
2
Framework, Principles, and Designs for Evaluation

As we stressed in Chapter 1, evaluating the effects of the new welfare legislation and related welfare reform initiatives at the state and local level around the country presents many challenges to data and evaluation methodology. In this chapter we present general principles for program evaluation and the major issues that any evaluation must confront, and we outline some of the choices and alternatives that are available in an evaluation. Although these general principles are quite well known to many welfare reform researchers, we review them to emphasize that our findings on individual state-level studies, discussed in Chapter 3, are based on and naturally follow from this set of general principles governing how evaluations should be conducted. In Chapter 3 we apply the principles to the 14 specific state studies we have assessed in detail, and also, more briefly, to other welfare reform examinations under way around the country.

We focus this discussion primarily on "impact" evaluations as opposed to "process" evaluations. Impact evaluations (sometimes called outcome evaluations) concern the outcomes of a program on recipients, such as the effects on individual employment, earnings, and family income. Process evaluations (sometimes called implementation evaluations) describe how the program services are actually provided and then assess how well the services provided match the intended purpose of a program. They also assess the degree to which a program was successfully implemented and thus aid in characterizing the policy "treatment" that the participants and potential participants actually received. Although we do not provide an extended analysis of process evaluation, we do provide a brief discussion of it at the end of the chapter, given its importance.

The report also focuses only on the effects of reform on individuals, rather than the effects of reform on government itself. One view of the purpose of the welfare reform legislation is that it was intended to change the nature of how government delivers assistance to the poor, away from a purely eligibility-oriented and check-writing function to a function of encouraging work, promoting self-sufficiency, and providing the right signals and incentives for those to occur. As Nathan and Gais (1999) have described, the reform is resulting in a major change in welfare bureaucracies. Although this is a legitimate issue, evaluating the effects of PRWORA on governments themselves requires different evaluation methods than the methods discussed in this report (although process studies, which we do discuss, are one component of such evaluations).

We organize our discussion of the general principles of impact evaluation in terms of four general issues that any impact study must address; we pose each in the form of a question:

  • What are the research and policy questions of interest, and what are the precise objectives of the study?

  • What are the study populations of interest, and what are the outcomes of interest on those populations?

  • What evaluation methodologies are appropriate for achieving the goals of the study?

  • What data sources are available to the study and how can they be used?

Having a solid understanding of these issues is not only important for the design of new welfare reform studies, but also for interpreting the results of those studies that are currently under way and will be issuing findings over the next few years. As Chapter 3 details, the current studies differ, often on critical dimensions, in the way in which each of the four issues listed above is addressed. Some answer different questions, many study different populations, they often use different methodologies, and they frequently use very different data. Melding the results of such a diverse set of studies into a single coherent picture of the effects of the latest wave of welfare reform is a challenge that requires a clear understanding of the issues that we discuss in this chapter.

RESEARCH AND POLICY QUESTIONS AND STUDY OBJECTIVES

Broadly speaking, the question of interest in all welfare reform studies is the effect of reform on adults and children. The types of reforms that are of interest and the geographic level at which these effects are assessed are major issues in the research community. One key distinction, for example, is whether interest centers on the effect of an entire "bundle" of reforms—that is, a package containing provisions for work requirements, sanctions, time limits, a particular set of support services, and other features—or whether one is interested in the effects of each component separately, holding the others fixed. Most welfare reforms that have been enacted in the last 10 years are, indeed, bundles of different types of reforms, sometimes introduced by policy makers on the presumption that the collective effect of all the components together is greater than the effect of each of them separately. Policy makers often discuss the importance of changing the overall "culture" of welfare and of changing the expectations that recipients have for welfare. Changing multiple components at the same time makes such changes in culture more likely. PRWORA itself legislated multiple changes of the old AFDC system, and each state has added more components to those required by the federal law. Thus, a strong case can be made that it is the effect of the entire bundle that is of major policy interest.

Yet knowing the effect of an entire bundle of reforms does not provide a very good basis for future reforms or for determining which components work and which do not. Taken literally, knowing the effect of the bundle allows policy makers to decide only whether the entire bundle turned out to be a good policy or a bad policy, on the whole, and is informative only for the decision to either continue or end the whole bundle. However, it is likely that some components have favorable effects and others have unfavorable or no effects. Determining which components should be changed requires knowledge of the effects of each of them separately. Indeed, most observers expect that when PRWORA comes up for reauthorization in 2002, it is unlikely that a simple return to the old AFDC system and the old method of financing welfare will be an active option. Rather, it is far more likely that Congress and the President will be interested in modifying the current law to eliminate or change components of the law that have been judged to be ineffective or less effective than others.

Determining the effects of each component separately will probably require choosing a "base" from which each component has changed. If it is indeed the case that the bundle of reforms has a greater effect than the sum of the effects of its components, then adding or subtracting any individual component will have a different effect if none of the other components is in place than if all the components are in place at the same time. If the basic structure of reforms enacted by PRWORA is taken as the base, for example, policy interest should center on the incremental effects of each policy component, holding that basic structure in place.

In addition to these issues of the inherent questions of interest, there are several practical questions about the feasibility of estimating the effects of individual components, rather than the bundle. We discuss these issues when we discuss alternative evaluation designs below.

Another issue that has assumed importance in recent welfare reform discussions is the relative importance of national-level estimates and state-specific estimates of the effects of reform. Many federal policy makers and members of Congress would like to know the total effect of reform in the country as a whole. PRWORA is, after all, a federal law and was intended to change the welfare system in the entire country. Yet, other analysts argue that an average estimate is not of great interest because the diversity of state reforms is so great that an average would not be a very good indicator of any particular reform package. This approach does not mean that national-level estimates are not desirable, for one may well be interested in the range of effects across states, not just the average. However, an acceptable evaluation strategy for this approach would only require estimates on a subset of the states, if that subset captured the range of different reform policies that have been tried in all the states. Yet this approach leads back to the question of what the policy of interest is, for obtaining a range of effects across a set of states leads inevitably to a search for why those effects differ. This question in turn leads to a need to determine which elements of the bundle of reforms explain the differences. Even if the aim is to obtain only a range of estimates across selected states, cross-state comparability is necessarily a major issue.

While these debates occur at the federal level, at the state level there is more interest in knowing the effect of a state's own specific reforms. Because evaluation, as well as operations, has shifted so heavily toward the states and away from Washington, welfare reform analysis in the current environment is much more state focused than it has previously been. State policy makers are often interested in comparing their state's policies to those of other states, but usually they are most interested in knowing the effects of their own policies first. This focus creates some difficulties in making national-level assessments of the effects of the policies and in determining what works and what does not, as discussed further below.

Yet another important issue concerning the nature of the research and policy questions that are, or should be, asked involves the distinction between evaluation studies and monitoring (or descriptive) studies. We discuss this issue below.

Evaluation Studies

The classic type of study enshrined in textbooks and in program project studies is the evaluation study, whose objective is to estimate the causal connection between a program or policy and its effect.1 Any study of this type must necessarily have what is known in the evaluation field as a "counterfactual": the program or policy that is being compared with the program or policy under study. By definition, when one speaks of the effect of a new welfare reform program or policy, one must say what that effect is relative to; the latter is the counterfactual.

1. Throughout this report, "evaluation" refers to such an assessment of the effects of the program or policy.

The most common counterfactual is simply the program that existed prior to the program under study, which in most cases for TANF or an AFDC waiver is the basic AFDC program in a state prior to the introduction of the state's waiver or PRWORA program. The effect of a program is generally taken to mean its effect relative to what existed before.2 While this is indeed the generally accepted counterfactual in current discussions, it should be noted that there are other counterfactuals of interest. One is a program bundle that is the same except for one changed feature. As we noted above, this may be the most important knowledge for incremental reform, where the alternative policy is not a return to AFDC, but a modification of current policy. Studies that consider this type of comparison will necessarily have an evaluation design that permits the estimation of the impact of the counterfactual. A program that modifies one element of a bundle is one example of such a counterfactual. A counterfactual could also be another state's program or policy.

2. To be precise, given time and changes in a state's economic and social environment, the counterfactual is usually defined not as AFDC at a prior time, but as what the effects of AFDC would be in the current environment had it continued.

Monitoring and Descriptive Studies

Many of the welfare reform studies currently under way have less ambitious goals than evaluation. These studies typically characterize their goals as "description" or "monitoring." A descriptive study is one that simply describes the characteristics of a population group relevant for policy—such as welfare leavers, welfare applicants, welfare eligibles, or just low-income families—and focuses on their levels of well-being. A monitoring study is one that follows such a population group over time, periodically describing and measuring its well-being along general and specific dimensions. In both descriptive and monitoring studies, there is no attempt to isolate the precise cause of the individual and family outcomes. No attempt is made to determine how much of the change (in the case of a monitoring study) is the result of welfare reform and how much is the result of other, simultaneous forces, such as trends in the economic environment.

The monitoring approach is very closely related to a classic method known as a before-and-after, or pre-post, design, which we discuss below when we review alternative methodologies for conducting an evaluation. A before-and-after design uses roughly the same data strategy as a monitoring study, namely, the collection of data on outcomes before and after a policy change. However, in a before-and-after design the family and individual outcomes in the "after" phase are intended to be causally related to the policy. A design of this type can be distinguished from a monitoring study if it includes a strong analysis of the influence of alternative, simultaneously occurring forces, such as social and economic trends (e.g., changes in the unemployment rate) that may have been contributing to the trends in outcomes as well as policy. (Because this separation of policy effects and the effects of other forces is so difficult, before-and-after designs are one of the least desirable types of evaluation methodologies, as we discuss further below.) This kind of analysis is usually missing in a monitoring study. Also, some monitoring studies have no data from the period prior to the policy change and so are clearly distinguished from a before-and-after evaluation design.

Describing and monitoring the populations of interest are, arguably, the necessary first steps prior to conducting an evaluation. Descriptive and monitoring studies can have tremendous utility in situations in which relatively little information on the recipient population is available. Studies of this type can be informative both to program managers and to the general policy-making and research community because the information gathered can be an indicator of the well-being of the target population intended to be served by the program, and whether that well-being is going up or down, or remaining unchanged. For example, descriptive studies can determine how many welfare leavers are in economic distress and can identify the existence of particular barriers to employment, such as health status, transportation needs, and access to child care.

Ultimately, however, in order to learn the effects of a policy change, description and monitoring need to be followed by evaluation. Without evaluation, nothing can be firmly known about why the well-being of the population is changing the way it is. More important, if that well-being is deteriorating, even for only a minority of the population, a descriptive or monitoring study provides no guidance on how to reverse that trend and increase well-being, because nothing has been firmly learned about its causes. Thus, very little guidance can be given to policy makers regarding whether a policy should be modified.

A potential danger of monitoring studies as well is that they are often misinterpreted as representing the results of a before-and-after design. Even though a monitoring study may carefully note that it has not established any cause-and-effect conclusions, the results may nevertheless be incorrectly labeled by others as demonstrating the effect of policy changes. This often occurs because many monitoring studies do not explicitly state the purpose of the study as monitoring, making the results easily interpreted as the results of a before-and-after evaluation. Given the weaknesses of the before-and-after methodology, such misinterpretations pose risks to good policy conclusions.

Monitoring studies are sometimes justified as useful in establishing a baseline for the evaluation of future policy changes. For example, welfare reform in most states is, at this writing, still evolving: any data collection (or monitoring) effort under way can be viewed as establishing a baseline that can be compared with later outcomes.3 A yet more long-run view is that PRWORA will, most likely, be modified, even if in only minor ways, so current monitoring studies can be viewed as establishing a baseline prior to those modifications. These interpretations and justifications for monitoring studies lead to more general issues concerning the desirability of investments in data and in knowledge infrastructure as the basis for research and evaluation in the future, possibly as part of a general building of data infrastructure.

3. It is possible, however, that such a baseline may have already missed certain attitude and perception changes, such as an increase in the stigma associated with welfare receipt, that were the result of the national debate and media attention on welfare reform prior to the law's enactment.

STUDY POPULATIONS OF INTEREST

In the broadest sense, the population of interest in any welfare reform study is the low-income or poor population in the United States. However, most welfare reform studies have a narrower focus because most policies and programs are aimed at a particular target population, usually the families or individuals that are eligible for program services. There is some danger in focusing only on the eligible population, however, because who is eligible and who is not can change over time, resulting in a shifting population of interest. Another complicating factor is that families sometimes have the ability to alter their behavior in order to make themselves eligible: for example, by spending down their asset levels. Nevertheless, eligible families are the first population of interest.

Many welfare reform studies are even narrower in their focus, concentrating instead on the population of program participants, usually those who are receiving benefits at a particular time or at two or three different times. Such a focus comes naturally because participants are those actually receiving program benefits and services. Yet this focus runs the risk of missing important responses to a reform. Who is receiving benefits at any time may change, sometimes because of external changes in the socioeconomic or demographic environment and sometimes because of behavioral responses to the policy change itself. In either case, the types of individuals who are program participants can change in ways that will affect the findings of the study or at least that will require a careful delineation of what the study shows and what it does not.

Studies of Recipients: Caseload Dynamics

Studying a population composed only of a sample of those on the rolls at a particular time is of great relevance and importance, yet it presents certain risks. Given its importance and the risks, an extended discussion of its different aspects is warranted. A useful perspective on the determinants of who is a participant and who is not is furnished by the framework of caseload dynamics, which views the caseload in a program as a fluid, ever-changing mix of families and individuals who move in and out of the program, possibly at frequent intervals. First-time entry by a TANF recipient, for example, occurs when a family suffers a drop in income, a woman has a nonmarital birth or experiences a divorce or separation, or some combination of these or other factors. Thus, first entry begins a recipient's experience with the system. First exit occurs when the recipient finds a job or gets married, when her child ages out of the age range of eligibility, or when any of a host of other events leads the recipient to attempt self-sufficiency. Reentry occurs for those who are unsuccessful in obtaining self-sufficiency, even if temporarily, and who therefore return to the program for another period of benefit receipt, possibly because of the loss of a job, the dissolution of a marital or nonmarital union, or some other event.

At any time, the caseload of a program is composed of families who are first-time entrants, as well as reentrants, and who have been on the rolls for varying lengths of time. The caseload dynamics perspective distinguishes between short-termers, cyclers, and long-termers. Short-termers, the least disadvantaged of the three, have only a brief experience with the welfare system and are, for the most part, relatively independent of welfare over their lifetimes. In contrast, cyclers move on and off the welfare rolls periodically and end up, over time, with a long-term dependence on the system for repeated assistance, being unable to achieve self-sufficiency. Long-termers, the most disadvantaged of the three, have long spells on welfare uninterrupted by time off the rolls, and have the heaviest dependence on the welfare system for support.

These distinctions are important because research has shown—and intuition supports—that the different degrees of dependence on the welfare system are correlated with individual, family, and community characteristics. These characteristics include a recipient's level of education, work experience, physical and mental health status, history of drug abuse, past history of nonmarital childbearing, and family background and how well it has prepared the recipient for adulthood; the family, social, and community networks available to the recipient; the neighborhood environment from which the recipient comes; her exposure to others with social difficulties; and related factors. Among the types of recipients, short-termers are typically the best off, with relatively good educational and work backgrounds and a relative lack of severe health problems, and they come from better-off family and neighborhood backgrounds than other recipients. Long-termers are typically the worst off, with relatively poor educational and work backgrounds, often with a history of health problems and drug abuse, and with a history of unstable marital or other partner relationships. Cyclers are in the middle, ranked somewhere between the short-termers and long-termers in these respects; they may have some job market skills and some family or community support, for example, but not enough for permanent self-sufficiency.

Among all families in a low-income welfare-eligible population, participants are, on average, worse off than those who are eligible nonparticipants. More important, as the socioeconomic and policy environments change, families move into and out of participation depending on their characteristics and situations. As the economy improves, for example, as it has in recent years, recipients who are better off in general and have greater skill potential tend to leave the program, so the worst-off cases remain. Thus, the caseload becomes increasingly composed of long-termers who have the greatest number of difficulties (sometimes also called the hard-to-serve). Not only do the exit rates of the better-off families increase, but the first-time entry and reentry rates of such families also decline as individuals who have better income potential or networks of support are less likely to lose their jobs or supports and become participants. These changes reinforce the change in the composition of the caseload.

Similarly, when policies change (such as those enacted in recent welfare reform), better-off recipients are likely to leave the program as they find jobs or other supports, and they are less likely to enter the program for the same reason; both exit rates and entry rates are affected, changing the composition of recipients. Some policy reforms, such as work requirements, have the net effect of encouraging recipients to leave welfare and discouraging them from entering welfare. Other reforms, such as time limits, sanctions, and diversion, literally push recipients out of programs or prevent them from entering. These latter reforms provide a possible exception to the rule that it is always the better-off families that tend to be the first to leave or to fail to enter: for example, in some states the evidence suggests that sanctioned families tend to be among the worst-off cases. In other words, families that are relatively better off will be more likely to voluntarily leave programs, while those who are relatively worse off will be more likely to involuntarily leave programs.
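
A stylized flow calculation can make the compositional point concrete. The sketch below is purely illustrative: the two-group simplification, the entry and exit rates, and the function name are hypothetical and are not drawn from any study or data discussed in this report.

    # Toy two-group flow model: how faster exits and slower entries among
    # better-off families shift the caseload toward harder-to-serve cases.
    # All rates are hypothetical.

    def steady_state_caseload(entry_rate, exit_rate, eligible=1000):
        """Long-run number on the rolls when eligible nonrecipients enter
        with probability entry_rate per month and recipients exit with
        probability exit_rate per month."""
        return eligible * entry_rate / (entry_rate + exit_rate)

    before_reform = {
        "short-termers": steady_state_caseload(entry_rate=0.05, exit_rate=0.20),
        "long-termers": steady_state_caseload(entry_rate=0.05, exit_rate=0.04),
    }
    # After reform: better-off families exit faster and enter less often,
    # while the rates for worse-off families change very little.
    after_reform = {
        "short-termers": steady_state_caseload(entry_rate=0.03, exit_rate=0.35),
        "long-termers": steady_state_caseload(entry_rate=0.05, exit_rate=0.05),
    }

    for label, mix in (("before", before_reform), ("after", after_reform)):
        total = sum(mix.values())
        share = mix["long-termers"] / total
        print(f"{label}: caseload = {total:.0f}, long-termer share = {share:.0%}")

Even though no individual family's prospects change in this sketch, the average characteristics of "recipients" shift markedly, which is why outcomes measured on the caseload alone can mislead.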

Implications

What are the implications of a caseload dynamics perspective for the study of welfare reform policy changes? The major implication is that, while it is easy for a study to define its population of interest as recipients at one particular time, the resulting estimates of the policy effects on that population may not generalize to any other time or any other place. This limitation arises because the composition of the recipient population (e.g., among long-termers, short-termers, and cyclers) changes in response to the state of the economy, the prior policies in place, and the nature of the eligible population from which recipients are drawn. This caseload dynamic is especially important for current studies of recipients or former recipients because there has been a significant decrease in the number of families receiving welfare since 1994. Between 1994 and December 1998, the number of families receiving AFDC/TANF declined from just over 5 million to 2.8 million (U.S. Department of Health and Human Services, 1998). A study of former recipients in 1998 is likely to show very different outcomes than a study of former recipients in 1994 because the caseload in 1994 was likely to have been composed of recipients with a greater mix of self-sufficiency levels (skills, education, and work experience) and of recipiency histories (long-termers, cyclers, and first-timers) than the caseload in 1998, which probably had less variation in self-sufficiency level and recipiency history and was likely composed of harder-to-serve recipients.

Equally important, the effects of policy reforms are likely to be different in any comparison where the caseload composition has changed. For example, the effect of imposing stricter work requirements would have been different if imposed in 1994 than in 1998, and, in turn, is likely to be different in a future time with a higher unemployment rate than in 1998. The effect of stricter work requirements in different states is also likely to be different if their unemployment rates are different or if their caseload compositions (short-termers, cyclers, and long-termers) are different for other reasons. The effects of time limits, sanctions, family caps, and other reforms are also likely to depend on caseload composition.

The lesson of a caseload dynamics perspective for studying welfare reform is, at a minimum, that the findings of any particular study of recipients must be carefully described as pertaining to the particular population at that particular time. A more proactive lesson is that a good welfare reform study should distinguish between different types of recipients in describing its results. This critical element is a first step toward comparability across studies in different states and localities and across studies in the same state or locality at different times. Thus, all results and findings should be stratified by whether the recipients were long-termers, cyclers, or short-termers and by the other individual and neighborhood dimensions mentioned above. Distinguishing between groups should take place when measuring outcomes, such as earnings and income, whether among those still on the rolls or among welfare leavers. Adequate stratification along various dimensions has clear implications for the types of data needed for the study as well, which we discuss further below.
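
As a concrete illustration of such stratification, the tabulation below sketches how outcomes might be broken out by recipiency history. The file name, column names, and coding of recipient types are hypothetical; they stand in for whatever administrative or survey data a particular study has available.

    # Hypothetical sketch: stratify outcomes by recipiency history.
    import pandas as pd

    leavers = pd.read_csv("leaver_survey.csv")  # one row per former recipient

    # "recipient_type" is assumed to be coded as short-termer, cycler, or long-termer.
    summary = (
        leavers.groupby("recipient_type")[["employed", "quarterly_earnings"]]
        .mean()
    )
    print(summary)  # average employment rate and earnings for each group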

Another implication of these principles concerns studies that examine only welfare leavers. Given the importance of first entry and reentry in the response to welfare reform, a study that intends to capture the full effects of the reform on the eligible population has to move beyond the examination of only leavers to an examination of the decisions of eligible nonparticipants, including their entry decisions. It is important to recognize that changes in welfare programs may affect the decisions of potentially eligible families before they apply or reapply for benefits. First, individuals who may be eligible for the program may not understand the new rules and, hence, may believe that they are no longer eligible to receive assistance. Second, some agencies have implemented formal diversion programs, which commonly offer a lump-sum payment or support services, such as job search support or transportation support, in exchange for not enrolling for the cash assistance program. Furthermore, some agencies are directly or indirectly sending signals to potential clients that the emphasis of welfare is now on employment and self-sufficiency and that more will be expected of them if they enroll in the cash assistance program, a sort of informal diversion program. In some cases, the names of the programs and agencies are the signals of a focus on employment and self-sufficiency. The leading signal to potential welfare recipients that agencies, politicians, and the media have been sending is that work effort is expected (Nathan and Gais, 1999).

Both formal and informal diversion programs aimed at reducing entry to welfare programs are important to understand in evaluating the entry effects of welfare reform. These diversion programs are also being implemented to stem reentry onto welfare for those who have voluntarily left or were sanctioned off of welfare. Understanding how these diversion programs, in conjunction with sanction policies, act to permanently keep potential recipients off welfare is also important in assessing the effects of time limits on permanently removing or keeping people from assistance. Studies of the broad effects of welfare reform should also seek to understand the behavioral responses of individuals who make themselves eligible or ineligible for participation, for example, by changes in marital status.

Beyond these broad issues of the study population, there are of course many important subpopulations of interest in most welfare reform studies. Most welfare reform studies differentiate carefully between unemployed-parent and single-parent cases, teenage parent and older parent cases, child-only and non-child-only cases, and a variety of different programmatic categories (by age of children, for example). The subpopulations of interest in any particular study depend on the policy and program of interest and on which subpopulations are differentially treated by the policy.4

4. The PRWORA policy changes targeted at specific subpopulations are too numerous to explain in detail. However, as an example, a study might focus on the well-being of unmarried minor teenage parents and their children who, under the new rules, must live with an adult or in an adult-supervised setting and must participate in educational and training activities.

OUTCOMES AND TIME FRAMES

The outcomes of interest in welfare evaluations vary widely: a comprehensive list of all possible outcomes of interest would be quite long. From a programmatic perspective, the effect of the reform on caseloads or, at the family level, on participation rates and recipient rates is clearly of key interest. The implications of caseload changes for costs, including costs net of the expense of operating and implementing the policy change, are usually also of interest to administrators and legislators. The policy and research community is interested in overall trends in family well-being and in how the reforms affect overall trends in poverty. The policy and research community is also often interested in the outcomes of those who begin or end participation because of a policy change. The typical outcomes considered are the employment and earnings of the mother or responsible adult in the case. Shared family or household income is also of interest, especially for former recipients who marry, or who, with their children, move in with or share supports with kin or friends, all of which are outcomes of interest themselves. The extent to which nonparticipant low-income families (because of program exit or failure to enter) rely on other programs or on families and friends for support is also an important question of interest. Dependence on other government programs (such as food stamps) implies that families are not entirely self-sufficient, and dependence on families, friends, community supports and charities or private aid-giving agencies is often less desirable to some policy makers than reliance on earnings.

Beyond these outcomes for adults and families are outcomes for children. The fundamental reforms of the last decade have the ambitious goal of reducing dependence on welfare and reducing the likelihood that future generations will become welfare recipients. Often reforms are supported on the basis that children will be made better off in the long run by the reforms. Studying these outcomes requires that child outcomes be explicitly identified. There is a tremendous range of types of child outcomes that can be studied, from relatively easy-to-define outcomes for older children, such as educational attainment and grades, to somewhat less easy-to-define but nevertheless conceptually clear outcomes related to the behavior of parents toward children, such as abuse and maltreatment or child support payments from an absent parent. There is also the behavior of the children themselves, such as nonmarital childbearing, drug abuse, and illegal activities; their physical health; socioemotional development, especially for young children; cognitive outcomes, such as test scores and performance on standardized scales; and attitudinal changes, mental health problems, and the like.

The time frame for studying outcomes is also an important feature of any welfare reform study, and time frames often differ across studies, which can introduce noncomparability. It is expected that short-term effects of some recent reforms on adult earnings and employment might be different from long-term outcomes, for it may take time for individuals to build up a sufficient work history to achieve an adequate level of income. Alternatively, an individual leaving welfare may initially do quite well but later encounter health problems or other difficulties that hinder or eliminate her forward progress, resulting in deteriorating long-term outcomes. The private sources of support that are available to families who leave or fail to enter welfare programs may differ over time. Even if a family is, in the short run, able to rely on the income of families and friends, this may not be true in the long run because such sources of support are likely to be sporadic. Child outcomes are particularly sensitive to the time frame because many of the basic cognitive outcomes can be expected to be affected only in the long term. However, some outcomes, such as school attendance, may be affected relatively quickly. Thus it is important, once again, for a welfare study both to carefully choose a time frame appropriate to its goals and to carefully describe and qualify its findings according to the time frame actually used in the study.

Demographic outcomes—reduction in nonmarital childbearing, for example—have been a goal of many recent welfare reform policies, including PRWORA. Some reforms were implemented specifically to discourage nonmarital childbearing, such as provisions directed specifically to unmarried teenage mothers, funding for abstinence education programs, family caps implemented by some states, reductions in benefits for not cooperating with paternity establishment, and greater enforcement of child support laws. PRWORA also has attempted to give incentives to states to focus on nonmarital childbearing outcomes by implementing an illegitimacy bonus, which will be given to the five states whose nonmarital births and abortions decrease the most over 2-year periods. Determining the extent to which these policy changes are successful in reducing nonmarital childbearing may require a relatively long time frame because demographic outcomes are heavily influenced by custom and social acceptability, and these may change slowly. However, it is also possible that the widespread attention and debate over welfare reform and nonmarital childbearing may have already affected childbearing decisions.

STUDY METHODOLOGIES

There is a long tradition in social science research and policy studies of evaluation of government programs, which, in turn, has generated a large literature on methods for evaluation and the relative pros and cons of different methodologies. Although there is still considerable disagreement among experts on what the "best" methodology is, there is general agreement on what the advantages and disadvantages of different methodologies are. Experts come to different judgments on the most preferred methodology because they give different weight to the various advantages and disadvantages. We provide in this section a thumbnail sketch and broad classification of the alternative methodologies, with an emphasis on the types of comparisons involved and the data required for them.

The goal of any evaluation method is to make valid inferences. By valid inference, we mean that the method leads to a conclusion about the true cause of an observed outcome. A method leads to a valid inference if the conclusion drawn from it attributes the change in an outcome to what truly caused the change in the outcome—in this case, if it correctly attributes a change in outcome to the change in policy being examined. Every methodology has some risk of leading to a wrong conclusion, and the possible reasons that a method may lead to incorrect conclusions are called threats to valid inference for that method. An assessment of the reliability of any evaluation methodology requires systematically listing the possible threats to valid inference for that method and, in practice, assessing the importance of each threat.

Randomized Trials

Perhaps the most well-known evaluation methodology is a randomized trial, or experiment, in which a randomly selected set of individuals is provided with a welfare reform alternative (the experimental group) and another randomly selected set of individuals is not (the control group). The difference in outcomes between the two groups is attributed to the difference in policy. The primary advantage of a randomized trial is that, if properly conducted, the results have a higher degree of credibility than comparisons based on other methodologies, for the randomization ensures that the estimated effect of the program is free of contamination from effects of other, nonreform changes. A secondary advantage of experiments is that they generally require less data than "observational" (i.e., nonexperimental) studies, for there is less need to collect background and retrospective information on the individuals involved in order to control for their differences; differences between the experimental and control groups have already been eliminated, at a first approximation, by the randomization. Nor is there need to collect data across a number of states and localities to ensure that sufficient programmatic variation is obtained, because programmatic variation is built into the experimental design and is thus "forced" on the environment.5
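
In the simplest generic notation (the symbols are illustrative, not specific to any study), the experimental comparison can be sketched as

    \hat{\Delta} = \bar{Y}_E - \bar{Y}_C,

the difference between the mean outcome of the experimental group and the mean outcome of the control group. Because assignment to the two groups is random, the groups are comparable on average, so the difference in means can be attributed to the policy rather than to preexisting differences or to other changes in the environment, which affect both groups alike.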

Despite these advantages, randomized trials are generally not being used in current welfare reform evaluations, except in a few areas in which pre-1996 waiver evaluations are being continued with experimental methodologies. One of the many reasons for this lack of experimental activity follows from the evolving and still sometimes ill-defined nature of many state TANF programs, which have undergone, and are still undergoing, significant modification as states explore different policy and implementation goals. Without a clearly defined and stable program, it is not cost-effective to begin a long-term experiment whose results may not be of interest after they are obtained because the program being examined is no longer the same as the one currently in effect. Another difficulty with experiments is that most welfare reform efforts in the states have sought to change the perception of welfare in low-income communities and to change the culture of welfare offices (e.g., to a more work-oriented environment). With changes on this scale occurring, it is difficult to prevent the members of a control group from being affected by the changes induced by welfare reform in their environment. Yet another drawback to experimental approaches is that experiments are ill-suited to capturing the effects of diversion, nonparticipation of eligible people, and general entry effects that result from welfare reform, a significant disadvantage given the current importance of such effects.

Despite these difficulties, however, experimental methodologies should have an important role to play in the future. If and when programs in the states stabilize, for example, experiments may become more cost-effective. In addition, experiments that vary one feature of a reform bundle offer an attractive means of estimating the effects of incremental reform, holding constant the environment and "culture" created by the initial reform. Given the advantages of experiments in terms of credibility, the methodology should be kept as an active alternative for future welfare reform.

5. Randomized experiments are often conducted at different sites when it is believed that the program effect depends on area characteristics, which helps make the conclusions from the study more generalizable.

Nonexperimental Studies

Turning to nonexperimental, or observational studies, Table 2-1 presents a classification of different types of evaluations according to the key issue of the source of the program variation used in obtaining an estimate of the program effect. The choice of a comparison group in a nonexperimental evaluation simultaneously defines the implicit counterfactual for the programmatic environment to which the policy change is being compared by determining who is being compared with whom to obtain an estimate of the program effect. The table considers four different generic types of evaluations: pure before-and-after designs, pure cross-section designs, designs that combine before-and-after with cross-sectional elements, and cohort designs. Each of these types makes a different type of comparison. The issue for evaluation, assuming that the different methodologies do not produce the same findings of program impact (which they usually do not), is how to assess the threats to each methodology.

TABLE 2-1 Methodologies for Nonexperimental Welfare Evaluations

Pure Before-and-After: Individuals examined over time on outcome measures; program has changed over time; change in outcomes attributed to change in program; can have multiple before-and-after time periods.

Pure Cross-Section: Comparison of different individuals at one time (e.g., week, month, or year); program differs across units; difference in outcomes across units attributed to program differences. Alternatively, participants compared with nonparticipants.

Cross-Section Combined with Before-and-After: Individuals followed over time as policy changes and affects different individuals differently. Within areas: individuals subject to different requirements; across areas: individuals subject to different policy rules.

Cohort and Repeated Cross-Section: Multiple birth or program entry cohorts who are followed over time; program is changing over time; changes in cohort experiences attributed to program change. Within areas or across areas.

Pure Before-and-After Designs

Pure before-and-after designs simply follow individuals or groups of individuals over a period in which a program change has occurred. (We referred to this method above in the discussion of monitoring as a study goal.) The change in outcomes is attributed to the change in the program. An example of such a design is a study that follows a group of low-income families for several years prior to a policy change—for example, the implementation of work requirements in order to receive cash assistance—through several years after the policy change. The policy comparison might be how many of these families applied for cash assistance before the work requirements were in place compared to how many of these same families applied for cash assistance after the work requirements were in place.
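
In the same generic notation used above (illustrative only), the before-and-after estimate is

    \hat{\Delta}_{BA} = \bar{Y}_{after} - \bar{Y}_{before},

the difference in average outcomes for the same individuals after and before the program change. This difference equals the effect of the program only if nothing else—aging of the families or shifts in the economic environment—would have changed the outcomes over the same period, which is exactly the set of threats discussed next.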

The threats to this design are of two distinct types: aging (sometimes called maturation or life-cycle) effects and systematic external changes in the environment. Aging effects refer to the employment or demographic changes that occur as individuals age and go through different stages of life. As an individual's family members age, their program participation may also change. For example, a woman may find sustained employment more feasible once all her children are of school age. Aging effects might be ignorable for short periods, but over long periods the change in outcomes will inevitably be affected by life-course transitions.

Systematic changes in the external environment can also pose threats to correct inference in a before-and-after design. Changes in the economy and in other programs can easily change the outcomes of the families being studied for reasons other than the policy change being examined. Attempts to control for changes in the environment are generally necessary for this type of evaluation to be credible, and credibility often requires relatively long periods of historical data to convincingly demonstrate that other influences have been successfully controlled (accounted) for. Like all time-series analyses, of which the before-and-after design is one, estimating the effect of the policy change requires an estimate of how outcomes would have evolved in the absence of the change, which must be estimated on the basis of historical trends at the aggregate or individual level.

Pure Cross-Section Designs

A pure cross-section evaluation compares different individuals or families at one time who face different program environments or who in some other way can be differentiated according to their programmatic status. For welfare reform, the most common approach is a comparison of outcomes across different states with different policies. The cross-sectional dimension of evaluations in this category refers to cross-sectional variations in policy at a given time, but it does not mean that only point-in-time data are used. Indeed, an evaluation can follow individuals and families over time and collect detailed outcome measures and still be a cross-sectional evaluation according to our definition, if the policies do not change over the period of the data collection. An example of this design would be a comparison of the earnings outcomes of program participants in a state that has a 5-year time limit, lenient sanctions, and weak work requirements, with the outcomes of program participants in another state that has a different bundle of reforms—for example, a 2-year time limit, stringent sanctions, and strong work requirements. This comparison would require observing the participants in both states over time, but it is still a cross-sectional study because the policies in both states do not change over the observation period.
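
In generic notation (again purely illustrative), the cross-section estimate is

    \hat{\Delta}_{CS} = \bar{Y}_A - \bar{Y}_B,

the difference in average outcomes between the study populations in state A and state B, whose policy bundles differ. The difference reflects the policy difference only to the extent that the two states, and the populations being compared within them, are alike in all other relevant respects.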

The major threat to this type of design is that not all differences across areas in the economic, social, or programmatic environments have been controlled for, and there are, therefore, alternative explanations for any differing outcomes among the study populations. A closely related threat to this design occurs if the study populations themselves are different across areas, an issue which is best understood in the context of our previous discussion of differing caseload compositions. Caseload composition may differ across states, which by itself can generate differences in outcomes, independent of differences in policies. A partial remedy to caseload compositional differences across states is to combine eligible nonparticipants with participants and compare the total eligible populations across states. However, the eligible populations themselves may differ for reasons that are difficult to measure, which leaves some potential for confounding policy differences with underlying population differences.

Most cross-sectional comparison designs in welfare reform are based upon cross-state differences, but within-state comparisons are not completely ruled out. A within-state cross-sectional comparison design is possible, for example, if policy is implemented differently in different areas. Migration across areas is a potential threat to such comparisons, but if migration can be shown to be minor, the threat is minimal. Comparisons of different types of recipients who are treated differently by a policy (e.g., women with and without young children who are and are not exempt from work requirements) are also sometimes considered under this rubric. However, such studies are rarely credible because obvious differences in outcomes result from the difference in characteristics that generates the differential policy treatment in the first place: for example, women with and without children will ordinarily have quite different employment and other outcomes independent of the welfare policies imposed upon them.

Also included under this category are comparisons of recipients to nonrecipients. Cross-sectional comparisons of this kind are extremely rare in welfare reform evaluations because recipients and nonrecipients are so different that their differences in outcomes can almost never be ascribed to differences in the policy. Participant-nonparticipant comparisons are more common in evaluations of other types of social programs, such as manpower training programs, where, arguably, there is a significant degree of randomness in who enters the program from among those who are eligible.

Combination of Cross-Sectional and Before-and-After Designs

A combination of cross-sectional and before-and-after designs is the third category of evaluation methodology. In this design, a study has data following individuals over time, but a policy changes during that time and changes differently for different individuals, so that both over-time policy variation and cross-sectional variation are generated. For example, a study may follow a group of recipients in two different states over a period in which a policy changes in one state but not the other. Relative to a pure cross-sectional comparison of the states over a period in which policy is not changing, this design has the advantage of permitting a comparison of the composition of the recipient populations before the policy change in order to ascertain the differences that may confound efforts to estimate the true effect; once these differences are controlled for, the isolation of the post-policy change difference across the states that results from the change itself is more credibly achieved.
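
This comparison is often summarized as a difference-in-differences estimate; in the generic notation used above,

    \hat{\Delta}_{DD} = (\bar{Y}_{A,after} - \bar{Y}_{A,before}) - (\bar{Y}_{B,after} - \bar{Y}_{B,before}),

where state A changes its policy and state B does not. The second difference nets out changes, such as common economic trends, that affect both states, so the estimate is credible only if outcomes in the two states would have moved in parallel in the absence of the policy change.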

The major threat to this design is the danger that the changes in outcomes across areas differ for reasons other than the difference in the policy change. Trends in the economic environment or programmatic environment in different areas are not always easy to identify and control for. In addition, differences in the types of recipients who are receiving benefits at any particular time may be associated with differences in their evolving experiences over time: better-off recipients may have stronger growth rates of employment and earnings, for example, than worse-off recipients. Controlling for observable differences in recipient composition is a critical feature for a convincing design of this type and often requires the use of information on the work and programmatic history of the recipients in the different states.

Cohort and Repeated Cross-Section Designs

Cohort and repeated cross-section designs are the final category of non-experimental methodologies. These designs are quite similar to the combination just discussed except that the same individuals or families are not followed over time; rather, different cohorts of families are compared before and after a policy change. The usual example of this type of design defines one cohort as participants in the welfare program at one time and the second cohort as participants at a later time, after welfare policy has changed. A comparison of the outcomes of the two groups is conducted, and any difference is attributed to the introduction of the policy. The advantage of this design over a pure before-and-after approach is that the threat of aging, maturation, or natural life-course effects that occur to a group of families over time—which is a threat to valid inference in a before-and-after design—is no longer present. If the two cohorts have roughly the same age distribution and distribution of other characteristics, they will both "age" simultaneously and their outcomes will evolve, but if the policy has an effect then the outcomes of the two cohorts will be different.

One major threat to valid inference in this design is the danger that the two cohorts are different in ways that affect their outcomes. This is a serious threat in welfare reform evaluations because most policies change the nature of who is a recipient (as discussed above in the context of study populations). Thus, comparing two cohorts of recipients, one before PRWORA and one after PRWORA, for example, is problematic because the types of recipients still on the rolls during the latter period may be quite different from the types on the rolls in the former period, leading to differences in outcomes because of the types of families on the rolls rather than the effect of welfare reform.

A method to reduce this threat is to compare birth cohorts instead of program cohorts or to consider cohorts of eligible populations instead of program recipiency cohorts.6 Although the nature of birth cohorts may change over time, any such effects should be minor and unaffected by policy. Cohorts of eligible populations are more problematic because policies can change who is eligible and who is not, but this can be partly controlled for by using pre-change measures of eligibility for the definition of the second cohort. Assuming cohorts are defined in this or some related way, one can compare series of cohorts across areas in which policies are changing differentially.

Whether cohorts are defined by recipiency, eligibility, or birth year, all cohort designs face the additional threat of changes over time in the economic and programmatic environment that will affect outcomes independently of those induced by the policy change, just as in a before-and-after study. Ideally, a number of cohorts (i.e., more than two) should be constructed to determine whether trends exist in successive cohorts.

A more elaborate cohort design combines cohort studies across states. One example would be to have a before cohort and an after cohort in a state that experienced a policy change between the cohorts and data from two cohorts at the same time in a state that did not experience a policy change. As with the combined before-and-after and cross-section designs discussed above, the major threat to this design is that there are differences across states either in the types of families in the cohorts or in trends in the social and economic environments.

As we noted at the beginning of this section, different studies of welfare reform have evaluation goals that require different methodologies for evaluation. If the threats to the validity of each method are not the same, then comparing the results of studies is problematic. More important for the long run is resolving the issues that arise from the choice of methodology and determining whether the threats to each type of design actually materialized. Strong welfare reform evaluations are those that consider the different threats and evaluate their importance.

6. An example of a birth cohort design is a design that follows a cross-section of women of the same age over time (perhaps following all 15-year-old women as they move through the childbearing years) and a later cross-section of women of the same age (i.e., a later cohort of 15-year-olds) who are subject to a different policy environment and compares their outcomes. A birth cohort could also be defined by the birth of a woman's child. An example of an eligibility cohort design would be to follow a cross-section of women pre-PRWORA who were eligible for AFDC (both those on and off the rolls) and a later cross-section of women post-PRWORA who have the same characteristics—that is, who would have been eligible for AFDC pre-PRWORA.


ESTIMATING THE EFFECT OF REFORM COMPONENTS

Our description of different study methodologies has implicitly assumed that there is a single policy whose effect is of interest, in most cases a policy consisting of a bundle of different reform elements. But different evaluation designs have different advantages and disadvantages when the evaluation seeks to estimate the effects of individual components in a bundle of reforms. Before-and-after designs are very problematic for estimating the effects of individual components because policies are almost always introduced in their totality, not in a piecemeal fashion. Cohort designs, if conducted in only one state or geographic location, are similarly disadvantaged for this purpose. Because before-and-after and cohort designs are the most common within-state evaluation methodologies, it is very difficult to estimate the effects of individual program components with data from a single state. In contrast, cross-section designs and combined cross-section and before-and-after designs, which use cross-state variation in policy to estimate welfare reform effects, are more amenable to estimating the effect of individual components because the variation in bundles across states sometimes permits an indirect assessment of individual component effects. For example, in the lucky (though unlikely) case that two states have enacted bundles of policies that are identical except for one feature, a comparison of the outcomes across the states may be interpreted as representing the effect of a change in that individual component. Although this type of case is quite unlikely, it is still possible that the 51 jurisdictions (the 50 states and the District of Columbia) may have enacted policies that differ in only a small number of dimensions. For example, it is possible that states could be classified into five or ten different "types," with the states within each type having more or less the same package of reforms. If such a classification is possible, comparisons of outcomes across the states might allow estimation of the differential effects of the "types" relative to one another and, possibly, an indirect estimate of the effects of individual components (if there are not too many).
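The following sketch illustrates, in highly simplified and hypothetical form, how such cross-state variation in policy bundles might be used: state-level outcomes are regressed on indicators for two invented policy components (a time limit and a family cap) plus a control for the unemployment rate. The data, the choice of components, and the specification are all assumptions made purely for illustration; nothing here should be read as a recommended model.

```python
import numpy as np

# Hypothetical state-level data: one row per state observation.
employment = np.array([0.52, 0.57, 0.49, 0.61, 0.55, 0.58])  # employment rate of single mothers
time_limit = np.array([0, 1, 0, 1, 1, 0])                    # 1 if the state imposes a time limit
family_cap = np.array([0, 0, 1, 1, 0, 1])                    # 1 if the state imposes a family cap
unemp_rate = np.array([0.06, 0.05, 0.07, 0.04, 0.05, 0.06])  # state unemployment rate

# Design matrix with an intercept; each policy indicator enters separately,
# which is only possible because the policy bundles differ across states.
X = np.column_stack([np.ones_like(employment), time_limit, family_cap, unemp_rate])
coef, *_ = np.linalg.lstsq(X, employment, rcond=None)

print("intercept, time-limit effect, family-cap effect, unemployment effect:")
print(np.round(coef, 3))
```

Even in this toy setting, the separate component effects are identified only because the two indicators do not move in lockstep across states, which is the point of the classification exercise described above.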

Although such a strategy is attractive and will no doubt be explored in welfare reform evaluations of PRWORA and related reforms, there are significant difficulties that will have to be addressed in its implementation. Some issues are practical, such as the capability of national-level data sets or a combination of state-level data sets to permit such comparisons (an issue we discuss further below). Another issue is whether accurate data on policy measures have been collected and are sufficiently available to represent a state's policy correctly (which we also discuss further below).

But even with the appropriate data and full knowledge of all states' policies, one difficulty that remains is whether the policies are similar enough in groups of states to permit the classification into types. There are numerous different welfare reform features and combinations of features that have been considered by the states, so a classification scheme would have to rank those features by their importance. Another difficulty arises if the features interact with each other—that is, if the effect of any individual component depends on the presence of another—which will likely reduce the strength of a comparison across only 51 jurisdictions. Finally, a major difficulty is controlling for the other differences across states that occur simultaneously with differences in policies—differences in the economic environment, socioeconomic characteristics of the population, and types of other welfare programs available. Thus, while this strategy should unquestionably be pursued in the full panoply of welfare reform evaluations, its success and credibility will depend on resolving these issues. We also note that these issues (except for the issue of across-state comparability) will have to be resolved to give credibility to cohort and before-and-after designs in a single state, as discussed above, and to other forms of evaluation.

DATA SOURCES

The discussions of research and policy questions, populations of interest, and evaluation methodologies in the preceding sections have already raised issues related to data requirements. Monitoring studies, for example, necessarily require data that track the observed group over time. Welfare reform studies need to make careful distinctions by entry, exit, and recipiency status, and they need to distinguish between long-termers and short-termers in their analyses. Disaggregating families by their characteristics is also important, which requires data on education, past work history, health, past recipiency, the ages and number of children, characteristics of the families' neighborhoods, and related information. Finally, the choice of evaluation methodology has immediate and direct implications for data requirements, for each type of methodology requires different information. The simplest design, a before-and-after study, for example, particularly needs historical data on trends at the state and individual levels. All methodologies, however, require data covering multiple periods or multiple cohorts, and all demand careful and detailed measurement.

The data requirements are unlikely to be met by any existing welfare reform study in perfectly acceptable fashion because data limitations are so severe. The severity of the data difficulties confronting studies of welfare reform is a major barrier to conducting convincing and credible analyses with reliable policy conclusions. In this section we discuss the data that are available for evaluations and briefly describe the more important data barriers.

The two major sources of data for welfare reform analysis are administrative data and survey data. Administrative data include information gathered from welfare records of all kinds (TANF, Food Stamp Program, Medicaid, etc.), as well as information gathered from nonwelfare sources, such as information on earnings from the records of the unemployment insurance system or information on fertility from birth records. Often, administrative records from many welfare and nonwelfare programmatic sources are linked together to expand the coverage of any one individual source, leading to "linked" administrative data.


Survey data are obtained when information is collected directly from participant or nonparticipant families through a question-and-answer interviewing process, either over the telephone or in person. There are national-level data sets of this kind that are relevant to welfare evaluation (e.g., the Survey of Program Dynamics and the Survey of Income and Program Participation), as well as state-specific surveys conducted explicitly to yield information on some relevant sub-population in that state.

It is useful to separately consider these two types of data sources, and we do so below.7 We also discuss the importance of collecting descriptive program data to make both administrative and survey data more effective. Finally, we consider the use of all three types of data together for both monitoring and evaluation.

Administrative Data

Administrative data in welfare reform evaluations come most often from the records of the TANF system itself. TANF records typically indicate the months of receipt by a family, a list of the persons included on the grant, the benefit paid, and various characteristics of the persons relevant to eligibility and to the grant amount, such as earned and unearned income, assets, and ages of children. With the more complex types of welfare reforms that have been implemented over the last decade, administrative data have come to include information on participation in work programs, sanctioning status, and related indicators of program treatment.

Administrative data of this type have been most heavily used because they are most readily available to welfare agencies and to the evaluation organizations with which they may subcontract. The data typically are not immediately usable for analytic and research purposes, however, but must be prepared for such use in what may be a fairly long and expensive process of correcting erroneous codes, interpreting missing data, and documenting the meaning of entries. Indeed, a significant barrier to the use of administrative data in general is that their quality is often unknown because there is rarely systematic checking for errors and inconsistencies, especially for items of information that are not directly used for administering the program (Hotz et al., 1998). Nevertheless, assembling administrative data on TANF recipiency is an important first step in describing a recipient population.

One serious issue that arises in the administrative welfare records of most states is the generally short time period of their availability. Not only are welfare records in most states unavailable for families who were recipients even a few years ago, but data on the historical recipiency patterns of current recipients are often relatively poor.8 Although there is considerable variation across states and even across counties in this regard, many states have not thus far given priority to maintaining historical administrative data at all or in usable form. For the most part, the reason for the lack is that such data were not needed to administer the program. Now, however, the relative unavailability of such information is a major barrier to many of the desirable features of a welfare evaluation study. Classifying recipients into long-term and short-term categories, for example, requires at least some historical information on recipiency. Likewise, implementing any of the evaluation methodologies that make use of data on recipients over time requires such information.

7. For other discussions of survey and administrative data for welfare reform research, see Brady and Snow (1996) and Hotz et al. (1998).

For the purposes of most welfare program evaluations, the major drawback to administrative data from the welfare system is simply that they do not, by definition, contain information on periods when the individual or family is not receiving benefits. Thus, the data are ill-equipped to assess the well-being or status of families who have left the program or of eligible nonparticipants who have failed to apply or who have been diverted. This problem can be considerably reduced by the linkage of data sets from different programs—AFDC-TANF, food stamps, housing, the child welfare system, child support enforcement, and so on—because families may be in at least one of these databases when they are not receiving benefits from one of the other programs. For example, a family that has stopped receiving cash assistance and is no longer in the welfare data system may still be receiving food stamps, Medicaid, or public housing benefits. A major issue in current welfare reform data discussions is whether administrative data from, say, non-TANF welfare programs provide adequate coverage of TANF leavers or TANF-eligible nonparticipants. At the present time, little information is available on such coverage rates.

The most common administrative data source currently in use to at least partly assess the economic circumstances of individuals and families when they are not receiving benefits is that based on unemployment insurance (UI) records. Employers who are covered by the UI system must provide quarterly earnings reports on individuals to state employment agencies, and these data can be made available to researchers. Typically, these data have been matched to information on families who have previously received welfare benefits, but they could also be gathered on periods prior to entry (or on low-skilled working women who are not receiving benefits) to estimate entry effects. Making such data available to researchers requires that safeguards and guarantees of confidentiality and disclosure be maintained and enforced. This is another barrier to the use of administrative data that needs to be addressed by state and local governments (see Hotz et al., 1998, for a discussion).

8. This may change with PRWORA because the enforcement of time limits requires that records on past spells of recipiency be kept for a much longer time.


A significant difficulty with UI earnings data is that they do not cover the entire workforce: they exclude many government workers, domestic workers, informal and temporary workers, and individuals in the underground economy. They also pertain only to individuals and not to families, and unearned income in general is not available from the records. Yet another issue is that earnings are reported only quarterly, which can create some difficulties for matching to monthly or weekly welfare participation or other records. Furthermore, in order to track workers who live in one state but work in another, a state would have to obtain the UI records of its neighboring states. One new potential source that might be useful in tracking workers across states is the Expanded Federal Parent Locator Service (EFPLS), which contains the National Directory of New Hires. The National Directory of New Hires contains quarterly reports from all states on the wages and unemployment compensation of newly hired workers. Many federal agencies, whose workers are not covered in UI reporting, will be reporting this information under EFPLS. Making the data available for research purposes could help analysts track employment outcomes of welfare recipients and potential recipients.
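Because UI earnings are reported by calendar quarter while welfare receipt is usually recorded by month, analysts must map one time unit onto the other when the two sources are linked. The sketch below shows one simple convention, assigning each month of receipt to its calendar quarter; the record layouts and values are invented for illustration, and in practice the quarterly total cannot reveal in which month the earnings were actually received.

```python
# Hypothetical linkage of monthly TANF receipt to quarterly UI earnings.
# Record layouts, field names, and values are invented for illustration only.

tanf_months = ["1998-01", "1998-02", "1998-07"]          # months of receipt for one case
ui_earnings = {("1998", 1): 1250.0, ("1998", 3): 900.0}  # (year, quarter) -> reported earnings

def month_to_quarter(year_month):
    """Map a 'YYYY-MM' string to a (year, calendar quarter) pair."""
    year, month = year_month.split("-")
    return year, (int(month) - 1) // 3 + 1

for ym in tanf_months:
    quarter = month_to_quarter(ym)
    earnings = ui_earnings.get(quarter, 0.0)
    # A month is flagged as combining work and welfare here if any UI-covered
    # earnings were reported in the same quarter, a coarse approximation since
    # the earnings may have been received in a different month of that quarter.
    print(ym, "-> quarter", quarter, "-> UI earnings:", earnings)
```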

Income tax records are another potential source of data for conducting evaluations. State tax records are likely to have wider coverage of the workforce than UI earnings data and will also have wider coverage of unearned income and the earnings of spouses. There are, however, serious privacy and confidentiality barriers to obtaining the use of tax records.

Each of these data sources is likely, at least to some degree, to miss part of the population of interest. Some low-income individuals may not show up in any of the data sets, especially if they are not working in the formal economy and therefore would neither be covered by unemployment insurance nor file tax returns. These individuals may be the worst-off cases in terms of formal labor market job skills, and missing them in an analysis could limit the generalizability of results.

Administrative data are usually quite weak on socioeconomic characteristics of the recipient because that information is not generally needed to determine eligibility for a benefit or to judge compliance with requirements. Consequently, information on education, occupation, marital status, and other basic characteristics is usually not available from administrative data. However, linking administrative data sets can improve the coverage of socioeconomic characteristics of individuals and families because different programs need different information about a potential recipient to judge eligibility or compliance. Nonprogrammatic sources of data can increase coverage of socioeconomic characteristics: for example, vital statistics birth records can be used to monitor and understand fertility decisions. However, linking individual administrative data sets can be difficult because there is often not a unique identifier for each case and because other identifying variables (names, Social Security numbers, or birth dates) can be incorrectly recorded. Significant strides have been made in probabilistic record matching, a technique that calculates the probability that two records with separate identifying information (name and birth date, for example) actually belong to the same person, and these methods can help address this problem.
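A toy example of the kind of scoring such matching performs is sketched below: agreement on each identifying field raises the evidence that two records belong to the same person, disagreement lowers it, and the total is compared with a cutoff. The fields, weights, and threshold are arbitrary values chosen for illustration; in real applications they are estimated from the data being linked.

```python
# Toy probabilistic record-matching score. Weights and threshold are
# illustrative only; in practice they are estimated from the linked files.
FIELD_WEIGHTS = {
    # field: (score added if the field agrees, score added if it disagrees)
    "last_name": (3.0, -2.0),
    "birth_date": (4.0, -3.0),
    "ssn_last4": (5.0, -4.0),
}
MATCH_THRESHOLD = 4.0  # arbitrary cutoff for declaring a link

def match_score(record_a, record_b):
    """Sum field-by-field agreement weights for a pair of records."""
    score = 0.0
    for field, (agree_w, disagree_w) in FIELD_WEIGHTS.items():
        score += agree_w if record_a.get(field) == record_b.get(field) else disagree_w
    return score

admin_record = {"last_name": "smith", "birth_date": "1970-05-02", "ssn_last4": "1234"}
survey_record = {"last_name": "smith", "birth_date": "1970-05-02", "ssn_last4": "1243"}  # digits transposed

score = match_score(admin_record, survey_record)
print(f"score = {score:.1f}, linked = {score >= MATCH_THRESHOLD}")
```

In practice, candidate pairs are typically also restricted ("blocked"), for example by county or birth year, so that every record need not be compared with every other record.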

Finally, administrative data present significant difficulties if attempts are made to compare them across states. In many instances—for example, the many welfare leaver studies conducted in different states (see Chapter 3)—one may want to know if an outcome in one state is comparable to that in another (e.g., if a 50% employment rate among welfare leavers is really double the 25% rate in a different state). Unfortunately, data from administrative records are often not comparable because of variations in the definition of what a case is, what a program is, and how a case is tracked with administrative data. Different concepts are often used for variables with the same label, and the classification schemes used for recipients may be quite different. This variation has always existed, but it is growing with the devolution of program design to the states and the increased variety of types of programs across the country. This variation presents a serious challenge to making cross-state comparisons with administrative data.

Survey Data

Survey data have important advantages over administrative data. General household surveys contain information on family structure, family income, earnings in all sectors, hours of work, and all other major socioeconomic and demographic characteristics. Often, earnings and wages are available at relatively short time intervals. In addition, perhaps most importantly, general population surveys have information on individuals and families when they are not receiving welfare benefits, and thus can be used to assess well-being and to measure behavior during those periods.

One source of household survey data is the national-level surveys, such as the Survey of Program Dynamics (SPD), Survey of Income and Program Participation (SIPP), Current Population Survey (CPS), Panel Study of Income Dynamics (PSID), and National Longitudinal Survey of Youth (NLSY). As we noted above, these data sets have a potential role to play in obtaining national-level estimates of the impact of welfare reform. Unfortunately, the usefulness of these surveys for the purpose of welfare program evaluation is significantly threatened by three factors. One is that most national surveys do not have very large sample sizes for the populations of interest in welfare reform. Even the CPS, the largest of the data sets, runs into potential sample size problems if an analysis is restricted to, say, less educated single mothers and conducted separately by race and ethnic group. A second drawback is that using national surveys to assess welfare reform requires that welfare rules be known for each state in a comparable form, and there have been, thus far, limits to the extent to which such information is collected and made available (see below).9 A third drawback is that most national household surveys collect only general socioeconomic information and do not obtain all the information from a respondent that welfare studies need, such as the respondent's history of receipt of welfare and other government program benefits, detailed accounts of sources of support, and the characteristics of the neighborhood in which the respondent lives.

From the point of view of state-level evaluations, new household surveys of the population are an option that can be considered. The major barrier to their use is their significant expense. Fielding a survey is a major operation and can be quite costly, particularly if interviews are conducted in person rather than over the telephone. Survey expenses are also quite high if the sample is generated by screening at the household door, because considerable effort is required to locate the target sample. A frequently used alternative in welfare evaluations is to gather administrative data from welfare or other programs to generate a sample of current or former welfare recipients. The major disadvantage to such list frames is their partial coverage of the population, because many families who are not receiving welfare benefits will not be included in such administrative data. An additional difficulty is that, although forming a sample from administrative data lowers screening costs, locating and tracking former recipients (e.g., obtaining current addresses or telephone numbers) can also be time-consuming and expensive.

In addition to the expense of household surveys, nonresponse10 and misreporting problems raise issues that can be difficult to address. Nonresponse in most household surveys is not random, and a low response rate in a survey leaves the potential for systematic bias due to nonresponse. Nonresponse rates can be particularly high in telephone surveys of low-income populations. Nonresponse rates in telephone surveys in general have grown with the increase in telemarketing and other factors (e.g., extensive polling, use of answering machines and other call-screening devices). Furthermore, the fraction of the low-income population without telephones or with disconnected telephone service, and the fraction who change telephone numbers frequently, are relatively high, leaving the potential for considerable sampling frame bias. Yet telephone surveys are often used because they are less expensive than in-person surveys. Indeed, there is a serious tradeoff between obtaining high-quality, high-response-rate survey data and the limited data collection resources of many states. This tradeoff may lead to a need to conduct smaller-scale surveys in order to keep quality standards sufficiently high.

9. The lack of comparable information on policy variables across states is also a problem when comparing outcomes across states using administrative data, and hence it is not inherently a problem with survey data. However, state-level administrative data can be used for state-level evaluations and hence is of some usefulness—at least for estimating the effect of the bundle of reforms.

10. Nonresponse can be by the respondent to a whole survey (unit nonresponse) or to one or more questions on a survey (item nonresponse).

Misreporting and underreporting of events and program participation are also problems, and they are difficult to detect. One of the main ways of detecting response errors is, in fact, to use the administrative data discussed above by cross-checking them against survey responses. For example, TANF receipt information from survey data can be cross-checked with administrative records. Administrative data can also be used to gather missing information for survey nonrespondents, for example, earnings from UI records. Such data can also help detect any nonresponse bias11 in the surveys. The use of administrative data for detecting nonresponse bias is limited by the coverage of administrative data sets (e.g., UI data cover only those who are employed in the formal economy). However, the potential use of administrative data for this purpose is worth serious consideration.
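As a concrete (and entirely hypothetical) illustration of such a cross-check, the sketch below compares survey-reported TANF receipt with linked administrative records for the same reference month and computes a simple underreporting rate among linked cases. The identifiers and values are invented for illustration.

```python
# Hypothetical cross-check of survey-reported TANF receipt against
# linked administrative records for the same reference month.

# case_id -> receipt reported in the survey (True/False)
survey_reports = {101: True, 102: False, 103: False, 104: True}
# case_id -> receipt according to TANF administrative records
admin_records = {101: True, 102: True, 103: False, 104: True}

linked = [cid for cid in survey_reports if cid in admin_records]
underreported = [
    cid for cid in linked
    if admin_records[cid] and not survey_reports[cid]  # admin says yes, survey says no
]

rate = len(underreported) / sum(admin_records[cid] for cid in linked)
print(f"Underreporting rate among linked recipients: {rate:.0%}")
```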

A more analytic difficulty with survey data is that they generally cannot be used to gather much retrospective information on earnings, employment, and welfare and other program participation while ensuring accurate answers. Consequently, historical information is difficult to obtain. This is a problem for welfare program evaluation, given that most evaluation methodologies require information on behavior and outcomes prior to the policy change as well as after the change. Most state-level surveys begin long after a new policy is in place, leaving the study without a pre-change, or baseline, measure. In contrast, with administrative data, the likelihood of the availability of at least some historical data is much greater.

Other difficulties with survey data result from attempts to reinterview respondents over periodic intervals and hence create a longitudinal, or panel, data set. While the average cost of interviewing a family a second time is much less than the cost of locating and interviewing a family for the first time, a small fraction of families who move or who are difficult to locate at a later time can generate very high expenses for the data collectors. Nonresponse in a longitudinal context can be a problem as well, because the ability to locate and reinterview a family may be correlated with the values of the outcome variables of interest (employment, earnings, program participation, etc.) for assessing new welfare policies. Consequently, issues of nonresponse bias again appear. Another complication for panel data sets is following all family members when families split up. To track outcomes, especially for children, it is critically important to collect data on all members of the original family. For instance, one may want to evaluate the outcomes of children who have been separated from their families because of hardships. To do so would require following the children in the family as well as the adult member(s) of the family, but it is often difficult to do so after the family has split, and it can be costly.

11. Nonresponse bias is a systematic difference in the characteristics of respondents and nonrespondents.

Despite this rather long list of disadvantages, survey data nevertheless have strong advantages and must be considered part of the data collection strategy of any welfare reform study that desires a reasonably comprehensive picture of families and individuals who are not participating in welfare programs.

Linking Administrative and Survey Data

Linking administrative data sets to survey data sets offers the potential to take advantage of the features of both types of data. Surveys can gather information on program participants when they are not receiving benefits and can also supplement administrative data in gathering information on demographic and background characteristics of the populations of interest. Surveys can also collect data on the entire household and on informal sources of support. Administrative data, in contrast, can provide reliable data on program participation, potentially for long periods of time, and information on how recipients are treated by the program. Linked administrative data can provide information on the services recipients receive while they are on welfare, such as work supports under Welfare to Work, job training, and job search services. Linked administrative data can also be used to track a recipient's or former recipient's dependency on other social welfare programs, such as public housing, food stamps, and others. The use of administrative data can also reduce the costs of collecting data that would otherwise be obtained with a survey, such as date of birth. Information on common items available in both sources can be used to check data quality. Administrative data (e.g., on UI earnings and employment) can be used to assess the seriousness of any bias from nonresponse in a survey.
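One small, hypothetical illustration of the last point: if administrative UI earnings are available for everyone in a survey's sampling frame, mean earnings can be compared for respondents and nonrespondents, and a large gap would signal potential nonresponse bias in survey-based estimates. The data below are invented for illustration.

```python
# Hypothetical check for nonresponse bias using administrative UI earnings
# available for everyone in the sampling frame, whether or not they responded.

frame = [
    # (case_id, responded_to_survey, quarterly_ui_earnings)
    (1, True, 2100.0), (2, True, 1800.0), (3, False, 600.0),
    (4, True, 2500.0), (5, False, 0.0),   (6, False, 1200.0),
]

def mean_earnings(records):
    earnings = [e for _, _, e in records]
    return sum(earnings) / len(earnings)

respondents = [r for r in frame if r[1]]
nonrespondents = [r for r in frame if not r[1]]

print("respondents:   ", round(mean_earnings(respondents), 2))
print("nonrespondents:", round(mean_earnings(nonrespondents), 2))
# A sizable gap suggests that survey-based estimates may be biased by who
# responded (or could be located), even before any survey answers are used.
```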

As we stated in the opening of this section, the appropriate data sources for an evaluation depend on the evaluation methodology chosen. A monitoring study would be more effective if based on a linked administrative-survey data set on families over time. Before-and-after studies require historical data at either the individual level or aggregated at the state (or relevant policy area) level in order to account for changes in the external environment that may change individual welfare recipiency, and linked administrative-survey data sets would also make this type of study more effective. The pure cross-section design, the combination of cross-section and before-and-after design, and cohort designs (at least across states) also require considerable knowledge of participation histories and demographic and socioeconomic characteristics, but in these cases, information is required across different states. A challenge to the use of administrative data for these types of designs is whether cross-state comparability of administrative data is sufficient to make these methods possible (see Hotz et al., 1998, for a discussion of such comparability).

Linking survey data to administrative data has thus far been done at the state level, probably because state agencies conducting welfare reform evaluations have easy access to administrative records. The extent to which survey and administrative data can be linked at the national level with national household surveys remains to be seen. As we explain in Chapter 3, the Census Bureau, with support from ASPE, is looking into the feasibility of matching social security records to the SIPP and SPD data.

Privacy and confidentiality are significant concerns for the development and linkage of administrative data sets and for survey data sets linked to administrative data sets. These concerns may limit the access outside researchers have to the data. The issue is also of concern for survey data, but is typically addressed through informed consent agreements and data masking procedures. Techniques and protocols for ensuring confidentiality and privacy continue to develop and will need to be developed further if linked data are to be more widely accessible.

Data Providing Descriptions of Programs

A third type of data, less often discussed, is that describing the welfare reform itself. Although it is commonly assumed that such data are readily available, the lack of accurate information about program rules and provisions has become a problem in current welfare reform efforts; collecting descriptive program data requires an independent effort.

Prior to the wave of welfare reform that began in the early 1990s, all state AFDC programs had the same approximate structure, with a relatively similar set of rules governing eligibility and benefit computation. States had considerable leeway in setting benefit levels, but most other characteristics of the program were heavily regulated by the federal government, operating under the provisions of the Social Security Act, court interpretations of that act, and administrative decisions. States were required to report to the federal government the provisions of their state AFDC plans, their benefit levels, and a wide variety of other information to ensure that they were in compliance. In addition, because the matching-grant structure of the federal financial support for the system required information on average benefit levels in the states, those had to be reported as well.

Requirements for reporting program rules to the federal government have changed greatly under PRWORA. Federal regulations include a requirement that states must provide an annual report on the characteristics of their TANF program rules. However, how these characteristics are reported is not standardized, and the wide variation in policy across states makes standardized reporting more difficult. The reporting requirements are fairly open-ended, possibly diminishing the usefulness of the data provided in these reports. States can use varying definitions in reporting and are likely to report only what is strictly defined in the final regulations since there are no incentives to report any other information and no funds from the federal government to do so, as there were under AFDC.


Furthermore, it is not clear from the final regulations how states that have given counties authority to set their own program rules will report the program rules, though clearly the states and the federal government have an interest in knowing what these county rules are.

Many states have their own state programs for low-income populations. For such programs, states are only required to report the characteristics of programs that use federal maintenance-of-effort funds that are provided for under PRWORA;12 states do not have to report rules of separate state programs that are funded from other state sources. But to evaluate the effects of the PRWORA legislation, it would be necessary to understand how these separate state programs interact with the federal requirements of TANF. For example, Illinois is using its own funds to pay benefits to recipients in months when they are working at least 25 hours per week, but receiving these benefits does not count against the 5-year time limit on receiving benefits (Illinois Department of Human Services, 1999).

It is difficult to judge whether the requirements of states to report on program rules will be comprehensive and standardized enough for use in evaluations. A separate effort is being made along these lines by the Urban Institute, under contract to DHHS. The Urban Institute is collecting information on the TANF rules for all the states for 1996–1998 and is attempting to classify the rules in a typology that could allow state comparisons. A list of the summary categories of rules that will be collected in the project is shown in Box 2-1. This is an important effort that should be strongly encouraged and considerably broadened. The pace of the effort is discouragingly slow, given that PRWORA was passed in August 1996. The work deserves support to produce information on a more timely basis, and a long-run institutional commitment is required to ensure that this information will be forthcoming on a regular basis in the future.

The current lack of information on state policies also poses a significant problem to any welfare reform evaluation that attempts to make cross-state comparisons. As discussed above, several of the major evaluation methodologies require such comparisons. Without reliable information on the programs enacted by the states and how they are changing over time, at a level of detail permitting accurate comparisons of how different states have approached the various major categories of reform policy (time limits, work requirements, sanctions, diversion, family caps, and so on), it is unlikely that credible cross-state comparisons will be possible. This would be an unfortunate outcome because the various policies adopted by the different states offer a valuable source of variation for estimating the effects of welfare policies.

12. States are not required to report very many details of the characteristics of these state programs.


PROCESS EVALUATIONS

As we note at the beginning of this chapter, process evaluation plays an important role in supplementing and complementing outcome evaluation. Documenting the written program rules in each state is the essential first step to understanding the policies that face program participants and potential program participants. A further step toward fully understanding the treatment is to document how the written rules are actually implemented. Such studies are generally referred to as process, or implementation, evaluations.

Process evaluations describe how program rules are operationalized and how the services are actually delivered. Implementation information is gathered by visiting program offices (often across multiple service delivery areas), interviewing caseworkers, surveying administrators, directly observing client and caseworker interactions, or reviewing documentation of individual cases. Process evaluations can be used for administrative purposes, such as assessing caseworker and administrator performance, determining whether the intended policies are actually being implemented, or documenting how services are provided in a particular area. Process evaluations can also be used in conjunction with outcome evaluations by linking individuals' exposure to the program with the effects of policies on those individuals. This use of process evaluations is the most relevant for the purposes of this report.

Although it is always possible that a gap exists between the written policy and the implementation of the policy, process analyses are particularly important in the post-PRWORA policy setting because there is greater variation in program rules and because responsibility for program design and administration has devolved to state and local levels. There is now more room for differential implementation of policies across service delivery areas because local welfare offices have more control over service provision than they did under the AFDC program. AFDC was an entitlement program in which caseworkers were charged only with determining eligibility and benefit levels, and quality control measures were taken to ensure that eligibility and benefit calculations were implemented consistently. Now, however, local welfare offices are increasingly becoming integrated with other social program offices so that caseworkers serve as gatekeepers to a variety of services (job training, job search, transportation benefits, and child care benefits, all in addition to cash assistance). Understanding how integrated these services are in each service delivery area and how clients are treated is an important component of assessing the treatment and, subsequently, of drawing conclusions about its effects.

Consider the following example given in Corbett (1998). A new policy that many states have implemented is a diversion payment, a lump-sum payment given to cash assistance applicants in exchange for not enrolling in the continuing cash assistance program. One local agency may encourage applicants to take the diversion payment, while another agency may just mention the payment in passing. In order to evaluate the effect of the diversion payment on TANF participation (and in the gatekeeper setting, on other social program participation), an evaluation study would need to understand the degree to which clients were aware of and pushed toward taking the diversion payment.

Process evaluations may also be useful in understanding how other social welfare programs have been affected by the change in cash assistance rules. For example, some administrative offices may direct potential cash assistance applicants or current recipients to other programs, such as food stamps, while other administrative offices may discourage the receipt of any form of assistance. Both possible cases would have implications for participation in other social welfare programs.

While the need for process evaluations is apparent, it is not always clear how their results can be integrated with outcome evaluations. Studies that span many service delivery areas present especially difficult problems, because linking program implementation to individual case outcomes requires specific information on each office from which cases in the sample receive services. It is less difficult to link implementation results to outcome studies if the study sample covers only a few service delivery areas, so that implementation in only those areas must be assessed. Keeping information on program implementation up to date so that it remains relevant to the study period is another challenge to effectively using process evaluations in conjunction with outcome evaluations. Efforts to address these challenges deserve further attention in the evaluation research community.

CONCLUSIONS

The study of welfare reform and the evaluation of its effects present many challenges. Examining the effect of complex bundles of individual reform programs, determining the influence of the composition of the welfare caseload on measured outcomes, developing a credible comparison group for those affected by welfare reform, and constructing an adequate database for measuring outcomes, as well as data describing policies across states, require thoughtful study designs and considerable resources.

We conclude that while nonexperimental methodologies have become the dominant approach to evaluation at the current time, experimental methodologies still have a role to play and should be kept on the table as one means of evaluation. We conclude that monitoring and descriptive studies of welfare reform are important, but that evaluation studies—which estimate the effect of a program reform—should be the ultimate goal of welfare reform research. We emphasize that there is a role both for national-level welfare reform evaluation, which yields a comprehensive assessment of the effects of reform in all the states around the country, and for purely state-level studies, which yield estimates for individual states.

Regarding data, the panel has found considerable weaknesses in the three elements of data infrastructure needed to evaluate welfare reform. Household survey data sets, which are rare at the state level, are more plentiful at the national level but suffer from small sample sizes, a lack of key variables, and the relative unavailability of comparable policy measures across states. State-level administrative data sets, which have traditionally been used for management rather than research purposes, are still at an early stage of development and need much more work before they can fulfill their potential. Comprehensive data on state welfare policies across states and over time on a comparable basis have yet to be published, and there is no systematic plan for collecting such data on a long-run, permanent basis within the federal government.

Suggested Citation:"2 Framework, Principles, and Designs for Evaluation." National Research Council. 1999. Evaluating Welfare Reform: A Framework and Review of Current Work, Interim Report. Washington, DC: The National Academies Press. doi: 10.17226/9672.
×
Page 16
Suggested Citation:"2 Framework, Principles, and Designs for Evaluation." National Research Council. 1999. Evaluating Welfare Reform: A Framework and Review of Current Work, Interim Report. Washington, DC: The National Academies Press. doi: 10.17226/9672.
×
Page 17
Suggested Citation:"2 Framework, Principles, and Designs for Evaluation." National Research Council. 1999. Evaluating Welfare Reform: A Framework and Review of Current Work, Interim Report. Washington, DC: The National Academies Press. doi: 10.17226/9672.
×
Page 18
Suggested Citation:"2 Framework, Principles, and Designs for Evaluation." National Research Council. 1999. Evaluating Welfare Reform: A Framework and Review of Current Work, Interim Report. Washington, DC: The National Academies Press. doi: 10.17226/9672.
×
Page 19
Suggested Citation:"2 Framework, Principles, and Designs for Evaluation." National Research Council. 1999. Evaluating Welfare Reform: A Framework and Review of Current Work, Interim Report. Washington, DC: The National Academies Press. doi: 10.17226/9672.
×
Page 20
Suggested Citation:"2 Framework, Principles, and Designs for Evaluation." National Research Council. 1999. Evaluating Welfare Reform: A Framework and Review of Current Work, Interim Report. Washington, DC: The National Academies Press. doi: 10.17226/9672.
×
Page 21
Suggested Citation:"2 Framework, Principles, and Designs for Evaluation." National Research Council. 1999. Evaluating Welfare Reform: A Framework and Review of Current Work, Interim Report. Washington, DC: The National Academies Press. doi: 10.17226/9672.
×
Page 22
Suggested Citation:"2 Framework, Principles, and Designs for Evaluation." National Research Council. 1999. Evaluating Welfare Reform: A Framework and Review of Current Work, Interim Report. Washington, DC: The National Academies Press. doi: 10.17226/9672.
×
Page 23
Suggested Citation:"2 Framework, Principles, and Designs for Evaluation." National Research Council. 1999. Evaluating Welfare Reform: A Framework and Review of Current Work, Interim Report. Washington, DC: The National Academies Press. doi: 10.17226/9672.
×
Page 24
Suggested Citation:"2 Framework, Principles, and Designs for Evaluation." National Research Council. 1999. Evaluating Welfare Reform: A Framework and Review of Current Work, Interim Report. Washington, DC: The National Academies Press. doi: 10.17226/9672.
×
Page 25
Suggested Citation:"2 Framework, Principles, and Designs for Evaluation." National Research Council. 1999. Evaluating Welfare Reform: A Framework and Review of Current Work, Interim Report. Washington, DC: The National Academies Press. doi: 10.17226/9672.
×
Page 26
Suggested Citation:"2 Framework, Principles, and Designs for Evaluation." National Research Council. 1999. Evaluating Welfare Reform: A Framework and Review of Current Work, Interim Report. Washington, DC: The National Academies Press. doi: 10.17226/9672.
×
Page 27
Suggested Citation:"2 Framework, Principles, and Designs for Evaluation." National Research Council. 1999. Evaluating Welfare Reform: A Framework and Review of Current Work, Interim Report. Washington, DC: The National Academies Press. doi: 10.17226/9672.
×
Page 28
Suggested Citation:"2 Framework, Principles, and Designs for Evaluation." National Research Council. 1999. Evaluating Welfare Reform: A Framework and Review of Current Work, Interim Report. Washington, DC: The National Academies Press. doi: 10.17226/9672.
×
Page 29
Suggested Citation:"2 Framework, Principles, and Designs for Evaluation." National Research Council. 1999. Evaluating Welfare Reform: A Framework and Review of Current Work, Interim Report. Washington, DC: The National Academies Press. doi: 10.17226/9672.
×
Page 30
Suggested Citation:"2 Framework, Principles, and Designs for Evaluation." National Research Council. 1999. Evaluating Welfare Reform: A Framework and Review of Current Work, Interim Report. Washington, DC: The National Academies Press. doi: 10.17226/9672.
×
Page 31
Suggested Citation:"2 Framework, Principles, and Designs for Evaluation." National Research Council. 1999. Evaluating Welfare Reform: A Framework and Review of Current Work, Interim Report. Washington, DC: The National Academies Press. doi: 10.17226/9672.
×
Page 32
Suggested Citation:"2 Framework, Principles, and Designs for Evaluation." National Research Council. 1999. Evaluating Welfare Reform: A Framework and Review of Current Work, Interim Report. Washington, DC: The National Academies Press. doi: 10.17226/9672.
×
Page 33
Suggested Citation:"2 Framework, Principles, and Designs for Evaluation." National Research Council. 1999. Evaluating Welfare Reform: A Framework and Review of Current Work, Interim Report. Washington, DC: The National Academies Press. doi: 10.17226/9672.
×
Page 34
Suggested Citation:"2 Framework, Principles, and Designs for Evaluation." National Research Council. 1999. Evaluating Welfare Reform: A Framework and Review of Current Work, Interim Report. Washington, DC: The National Academies Press. doi: 10.17226/9672.
×
Page 35
Suggested Citation:"2 Framework, Principles, and Designs for Evaluation." National Research Council. 1999. Evaluating Welfare Reform: A Framework and Review of Current Work, Interim Report. Washington, DC: The National Academies Press. doi: 10.17226/9672.
×
Page 36
Suggested Citation:"2 Framework, Principles, and Designs for Evaluation." National Research Council. 1999. Evaluating Welfare Reform: A Framework and Review of Current Work, Interim Report. Washington, DC: The National Academies Press. doi: 10.17226/9672.
×
Page 37
Suggested Citation:"2 Framework, Principles, and Designs for Evaluation." National Research Council. 1999. Evaluating Welfare Reform: A Framework and Review of Current Work, Interim Report. Washington, DC: The National Academies Press. doi: 10.17226/9672.
×
Page 38
Suggested Citation:"2 Framework, Principles, and Designs for Evaluation." National Research Council. 1999. Evaluating Welfare Reform: A Framework and Review of Current Work, Interim Report. Washington, DC: The National Academies Press. doi: 10.17226/9672.
×
Page 39
Suggested Citation:"2 Framework, Principles, and Designs for Evaluation." National Research Council. 1999. Evaluating Welfare Reform: A Framework and Review of Current Work, Interim Report. Washington, DC: The National Academies Press. doi: 10.17226/9672.
×
Page 40
Suggested Citation:"2 Framework, Principles, and Designs for Evaluation." National Research Council. 1999. Evaluating Welfare Reform: A Framework and Review of Current Work, Interim Report. Washington, DC: The National Academies Press. doi: 10.17226/9672.
×
Page 41
Suggested Citation:"2 Framework, Principles, and Designs for Evaluation." National Research Council. 1999. Evaluating Welfare Reform: A Framework and Review of Current Work, Interim Report. Washington, DC: The National Academies Press. doi: 10.17226/9672.
×
Page 42
Suggested Citation:"2 Framework, Principles, and Designs for Evaluation." National Research Council. 1999. Evaluating Welfare Reform: A Framework and Review of Current Work, Interim Report. Washington, DC: The National Academies Press. doi: 10.17226/9672.
×
Page 43
Suggested Citation:"2 Framework, Principles, and Designs for Evaluation." National Research Council. 1999. Evaluating Welfare Reform: A Framework and Review of Current Work, Interim Report. Washington, DC: The National Academies Press. doi: 10.17226/9672.
×
Page 44
Suggested Citation:"2 Framework, Principles, and Designs for Evaluation." National Research Council. 1999. Evaluating Welfare Reform: A Framework and Review of Current Work, Interim Report. Washington, DC: The National Academies Press. doi: 10.17226/9672.
×
Page 45
Suggested Citation:"2 Framework, Principles, and Designs for Evaluation." National Research Council. 1999. Evaluating Welfare Reform: A Framework and Review of Current Work, Interim Report. Washington, DC: The National Academies Press. doi: 10.17226/9672.
×
Page 46
Suggested Citation:"2 Framework, Principles, and Designs for Evaluation." National Research Council. 1999. Evaluating Welfare Reform: A Framework and Review of Current Work, Interim Report. Washington, DC: The National Academies Press. doi: 10.17226/9672.
×
Page 47
Suggested Citation:"2 Framework, Principles, and Designs for Evaluation." National Research Council. 1999. Evaluating Welfare Reform: A Framework and Review of Current Work, Interim Report. Washington, DC: The National Academies Press. doi: 10.17226/9672.
×
Page 48
Suggested Citation:"2 Framework, Principles, and Designs for Evaluation." National Research Council. 1999. Evaluating Welfare Reform: A Framework and Review of Current Work, Interim Report. Washington, DC: The National Academies Press. doi: 10.17226/9672.
×
Page 49
Next: 3 ASPE Leaver Studies and Other Current Research on Welfare Reform »
Evaluating Welfare Reform: A Framework and Review of Current Work, Interim Report Get This Book
×
 Evaluating Welfare Reform: A Framework and Review of Current Work, Interim Report
Buy Paperback | $68.00 Buy Ebook | $54.99
MyNAP members save 10% online.
Login or Register to save!
Download Free PDF

The Personal Responsibility and Work Opportunity Reconciliation Act (PRWORA) of 1996 fundamentally changed the nation's social welfare system, replacing a federal entitlement program for low-income families, Aid to Families with Dependent Children (AFDC), with state-administered block grants under the Temporary Assistance for Needy Families (TANF) program. PRWORA furthered a trend, begun earlier in the decade under so-called "waiver" programs (state experiments with different types of AFDC rules), toward devolution of the design and control of social welfare programs from the federal government to the states. The legislation imposed several major new requirements on state use of federal welfare funds but otherwise freed states to reconfigure their programs as they wish. The underlying goal of the legislation is to decrease dependence on welfare and increase the self-sufficiency of poor families in the United States.

In summer 1998, the Office of the Assistant Secretary for Planning and Evaluation (ASPE) of the Department of Health and Human Services (DHHS) asked the Committee on National Statistics of the National Research Council to convene a Panel on Data and Methods for Measuring the Effects of Changes in Social Welfare Programs. The panel's overall charge is to study and make recommendations on the best strategies for evaluating the effects of PRWORA and other welfare reforms and on the data needed to conduct useful evaluations. This interim report presents the panel's initial conclusions and recommendations. Given the short time the panel has been in existence, this report necessarily treats many issues in less depth than they will receive in the final report. The report's immediate goal is to provide DHHS-ASPE with recommendations on some of its current projects, particularly those recently funded to study "welfare leavers": former welfare recipients who have left the welfare rolls as part of the recent decline in welfare caseloads.
