Suggested Citation:"Evaluating Early Childhood Demonstration Programs." National Research Council. 1982. Learning from Experience: Evaluating Early Childhood Demonstration Programs. Washington, DC: The National Academies Press. doi: 10.17226/9007.

Evaluating Early Childhood Demonstration Programs

INTRODUCTION

During the last two decades, public and private programs for young children and their families have undergone profound changes. Programs and philosophies have proliferated. Program objectives have broadened. Federal support has increased: Projected expenditures for child care and preschool education alone neared $3 billion several years ago. Target populations have expanded and diversified, as have the constituencies affected by programs; such constituencies reach beyond the target populations themselves. A sizable evaluation enterprise has grown along with the expansion in programs. Formal outcome measurement has gained increasing acceptance as a tool for policy analysis, as a test of accountability, and to some extent as a guide for improving program practices. Programs have been subjected to scrutiny from all sides, as parents, practitioners, and politicians have become increasingly sophisticated about methods and issues that once were the exclusive preserve of the researcher. At the same time, evaluation has come under attack--some of it politically motivated, some of it justified. Professionals question the technical quality of evaluations, while parents, practitioners, and policy makers complain that studies fail to address their concerns or to reflect program realities. Improvements in evaluation design and outcome measurement have failed to keep pace with the evolution of programs, widening the gap between what is measured and what programs actually do. This report attempts to take modest steps toward rectifying the situation. Rather than recommend specific instruments, its aims are (1) to characterize recent developments in programs and policies for children and families that challenge traditional approaches to evaluations and (2) to trace the implications for outcome measurement and for the broader conduct of evaluation studies. We have attempted to identify various types of information that evaluators of early childhood programs might collect, depending on their purposes. Our intent is not so much to prescribe how evaluation should be done as to provide a basis for intelligent choice of data to be collected. Two related premises underlie much of our argument. First, policies and programs, at least those in the public domain, are shaped by many forces. Constituencies with conflicting interests influence policies or programs and in turn are affected by them. Policies and programs evolve continuously, in response to objective conditions and to the concerns of constituents. Demonstration programs, the subject of this report, are particularly likely to change as experience accumulates. Consequently, evaluation must address multiple concerns and must shift focus as programs mature or change and as new policy issues emerge. Any single study is limited in its ability to react to changes, but a single study is only a part of the larger evaluation process. Second, the role of the evaluator is to contribute to public debate, to help make programs and policies more effective by informing the forensic process through which they are shaped. Though the evaluator might never actually engage in public discussion or make policy recommendations, he or she is nevertheless a participant in the policy formation process, a participant whose special role is to provide systematic information and to articulate value choices, rather than to plead the case for particular actions or values. Note that we distinguish between informing the policy formation process and being co-opted by it--between research and advocacy.
Research is characterized by systematic inquiry, concern with the reduction and control of bias, and commitment to addressing all the evidence. Nothing that we say is intended to relax the need for such rigor. There are many views of the evaluator's role. Relevant discussions appear in numerous standard sources on evaluation methodology, such as Suchman (1967), Weiss (1972), Rossi et al. (1979), and Goodwin and Driscoll (1980). Some of these views are consonant, and some are partially contrasting with ours. For example, one widely held view is that the role of the evaluator is, ideally, to provide definitive information to decision makers about the degree to which programs or policies are achieving their stated goals.[1] Though we agree that evaluation should inform decision makers (among others) and should strive for clear evidence on whether goals are being met, we argue that this view is insufficiently attuned to the pluralistic, dynamic process through which most programs and policies are formed and changed. Sometimes the most valuable lesson to be learned from a demonstration is whether a particular intervention has achieved a specified end. Often, however, other lessons are equally or more important. An intervention can succeed for reasons that have little import for future programs or policies--for example, because of the efforts of uniquely talented staff. Conversely, a demonstration that fails, overall, may contain successful elements deserving replication in other contexts, and it may succeed in identifying practices that should be amended or avoided. Or a demonstration may shift its goals and "treatments" in response to local needs and resources, thereby failing to achieve its original ends but succeeding in other important respects. By the same token, a randomized field experiment, with rigorous control of treatment and subject assignment, is sometimes the most appropriate way to answer questions salient for policy formation or program management. In such situations, government should be encouraged to provide the support necessary to implement experimental designs. There are situations, however, in which experimental rigor is impractical or premature, or in which information of a different character is likely to be more useful to policy makers and program managers. Preoccupation with prespecified goals and treatments can cause evaluators to overlook important changes in the aims and operations of programs as well as important outcomes that were not part of the original plan.
If demonstrations have been allowed to adapt to local conditions, thoughtful documentation of the process of change can be far more useful in designing future programs than a report on whether original goals were met. Even if change in goals and treatments is not at issue, understanding the mechanisms by which programs work or fail to work is likely to be more helpful than simply knowing whether they have achieved their stated goals. These mechanisms are often complex, and the evaluator's understanding of them often develops gradually. To elucidate mechanisms of change, it may be necessary to modify an initial experimental design, to perform post hoc analyses without benefit of experimental control, or to supplement quantitative data collection with qualitative accounts of program operations. In short, we believe that evaluation is best conceived as a process of systematic learning from experience--the experience of the demonstration program itself and the experience of the evaluator as he or she gains increasing familiarity with the program. It is the systematic quality of evaluation that distinguishes it from advocacy or journalism. It is the need to bring experience to bear on practice that distinguishes evaluation from other forms of social scientific inquiry.

[1] Strictly speaking, this view applies only to "summative" evaluations, as distinguished from "formative" evaluations, which are intended to provide continuous feedback to program participants for the purpose of improving program operations.

A Word on Definitions

This is a report about the evaluation of demonstration programs for young children and their families. Each word or phrase in the foregoing sentence is subject to multiple interpretations. The substance of this report is intimately bound up with our choice of definitions. By evaluation we mean systematic inquiry into the operations of a program--the services it delivers, the process by which those services are provided, the costs of services, the characteristics of the persons served, relations with relevant community institutions (e.g., schools or clinics), and, especially, the outcomes for program participants. By outcomes we mean any changes in program participants or in the contexts in which they function.
The latter is a deliberately broad definition, which includes yet extends far beyond the changes in individual children that are usually thought of as program outcomes. We believe that the definition is appropriate, given the nature of contemporary programs, and we endeavor to support this claim in some detail.

By demonstration programs we mean any programs installed at least in part for the purpose of generating practical knowledge--such as the effectiveness of particular interventions; the costs, feasibility, or accessibility of services under alternative approaches to delivery; or the interaction of a program with other community institutions. This definition goes beyond traditional concerns with program effectiveness. We believe that it is an appropriate definition in light of the policy considerations that surround programs for young children today. Finally, by young children we mean children from birth to roughly age eight, although some of our discussion applies to older children as well. We take very seriously the inclusion of families as recipients of services; we emphasize the fact that many contemporary programs attempt to help the child through the family and that outcome measures should reflect this emphasis.

Plan of the Report

We begin by tracing the historical evolution of demonstration programs from 1960 to the mid-1970s, and of the evaluations undertaken in that period. Although children's programs and formal evaluation have histories beginning long before 1960, the programs and evaluations of the early 1960s both prefigure and constrain our thinking about outcome measurement today. Following this historical overview is a section that examines in some detail the policy issues and programs that have evolved in recent years and that appear to be salient for the 1980s. The next section--the heart of the report--identifies some important implications of these programs and policy developments for outcome measurement and evaluation design. The final section points to implications for dissemination and utilization of results, for the organization and conduct of applied research, and, finally, for the articulation between applied research and basic social science.

PROGRAMS FOR CHILDREN AND FAMILIES, 1960-1975

Programs for children and families have come a long way since 1960, but it is fair to say that the earliest demonstration programs of the 1960s, precursors of Head Start, still have a hold on the imagination of the public as well as many researchers. It is perhaps an oversimplification--but nevertheless one with a large grain of truth--to say that outcome measurement, which was reasonably well adapted to the early demonstrations, has stood still while programs have changed radically. To illustrate, let us consider the experience of a "typical" child in a "typical" demonstration program at various points from 1960 to the present, and let us briefly survey the kinds of measures that have been used at each point to assess the effects of programs. In the early 1960s it would have been easy to characterize a typical child and a typical program. Prototypical demonstrations of that period were primarily preschool education programs, designed to enhance the cognitive skills of "culturally disadvantaged" children from low-income families, in order to prepare them to function more effectively as students and, ultimately, as workers and citizens. It was only natural to measure as outcomes children's school performance, academic ability, and achievement. Some practitioners had misgivings about the fit between available measures and the skills and attitudes they were attempting to teach, and many lamented the lack of good measures of social and emotional growth. There was fairly widespread consensus, however, that preacademic instruction was the heart of early childhood demonstrations. (Horowitz and Paden, 1973, provide one of several useful reviews of these early projects.) By 1965 the typical child would have been one of more than half a million children to participate in the first Head Start program. Despite its scale, Head Start was and still is termed a "demonstration" in its authorizing legislation. Moreover, Head Start has constantly experimented with curricula and approaches to service delivery, and it has spawned a vast number of evaluations.
For these reasons it dominates our discussion of demonstrations from 1965 until very recently. (A collection of papers edited by Zigler and Valentine, 1979, reviews the history of Head Start. See in particular Datta's paper in that volume (Datta, 1979) for a discussion of Head Start research.) The program originally consisted of eight weeks of preschool during the summer and was soon extended to a full year. Proponents had stressed "comprehensive services," and many teachers viewed socialization rather than academic instruction as their primary goal. Many of the federal managers and local practitioners did not conceive Head Start exclusively as a cognitive enrichment program. Nevertheless, Head Start was widely perceived--by the public, by Congress, and by many participants--as a way to correct deficiencies in cognitive functioning before a child entered the school system. Early Head Start programs involved many enthusiastic parents, but the educational mission and direction of the program were set by professional staff and local sponsoring organizations. Programs and developmental theories were numerous and diverse; no uniform curriculum was set. Yet there seems to have been consensus and a high level of confidence with respect to one key point--that early intervention would be effective, regardless of the particular approach. In some quarters this confidence was severely shaken by the first national evaluation of Head Start's impact on children, the Westinghouse-Ohio study (Westinghouse Learning Corp. and Ohio University, 1969). The study reported that Head Start graduates showed only modest immediate gains on standardized tests of cognitive ability and that these gains disappeared after a few years in school. However, for others the results testified only to the narrowness of the study's outcome measures and to other inadequacies of design. Some partisans of Head Start and critics of the Westinghouse-Ohio study, claiming that the program was much more than an attempt at compensatory education or cognitive enrichment, argued that the study had measured Head Start against a standard more appropriate to its precursors. These advocates argued that Head Start enhanced social skills (to which the Westinghouse-Ohio study paid limited attention) and provided food, medical and dental checkups, and corrective services to children who were badly in need of them. Thus its justification lay in part in the provision of immediate benefits to low-income populations, not solely in expected future gains.
Furthermore, argued advocates of Head Start, many local programs had mobilized parents and become a focus for community organization and political action. To be sure, some of the criticism of the Westinghouse-Ohio study was rhetorical and politically motivated. However, many of the critics' points were supported empirically, for example, by an evaluation by Kirschner Associates (1970), which documented the impact of the program on services provided by the community. By 1970, Head Start had begun to experiment with systematic variations in curriculum. Now the typical preschool child might be served according to any of a dozen models, ranging from highly structured academic drill to global, diffuse support for social and emotional growth. Models were viewed as fixed treatments, to be applied more or less uniformly across sites. Parallel models were also put in place in elementary schools that received Head Start graduates, as part of the National Follow Through experiment. Under most models, treatment was still directed primarily to individual children, not families or communities. Some models made an effort to integrate parents; others did not. Noneducational program components, such as health, nutrition, and social services, had expanded but were still widely viewed as subordinate to the various developmental approaches. Comparative evaluations continued to stress a relatively narrow range of educational outcomes. As a result, programs with a heavy cognitive emphasis tended to fare better than others, although no single approach proved superior on all measures, and there were large differences in the effectiveness of a given model at different sites. Dissatisfaction with the narrowness of outcome measures continued to grow, as programs broadened their goals and came to be seen as having distinctive approaches and outcomes, not necessarily reflected by the measures being used. By 1975, Head Start had changed and diversified significantly. Program standards were put in place, mandating comprehensive services and parent involvement nationwide. In 1975 more than 300 Head Start programs were gearing up to provide home-based services as supplements to, or even substitutes for, center-based services. The home-based option was permitted in the national guidelines following an evaluation of Home Start, a 16-site demonstration project (Love et al., 1975).
The evaluation, which involved random assignment of children to home treatment and control conditions, found that the home treatment group scored significantly above the control group on a variety of measures, including a standardized cognitive test, and that the home treatment group did as well as a nonrandom comparison group of children in Head Start centers. In addition, several offshoot demonstrations, some of them dating from the 1960s, began to get increased attention, notably the Child and Family Resource Program, the Parent-Child Centers, and Parent-Child Development Centers. These projects extend services to children much younger than age three or four, the normal age for Head Start entrants. These programs work through the mother or the family rather than serving the child alone. They combine home visits with center sessions in various mixes. Although these programs even today serve only about 8 percent of the total number of children served in Head Start, they represent significant departures from traditional approaches. We have a good deal more to say about these programs below. Thus by 1975 the experience of the typical Head Start child had become difficult to characterize. The child might be served at home or in a center; he or she might receive a concentrated dose of preacademic instruction or almost no instruction at all. In the face of this diversity, it is apparent that standardized tests, measuring aspects of academic skill and ability, capture only a part of what Head Start was trying to accomplish. Evaluations of Head Start's components, such as health services, and offshoot demonstrations, such as the Child and Family Resource Program, have been conducted or are currently in progress. Head Start's research division in 1977 initiated a multimillion-dollar procurement to develop a new comprehensive assessment battery that stresses health and social as well as cognitive measures. By the late 1970s other programs, mostly federal in origin, were beginning to take their places beside Head Start as major providers of services to children. In addition, federal evaluation research began to concentrate on other children's programs, such as day care, which had existed for many years but had begun to assume new importance for policy in the 1970s. In the next section we attempt to characterize some of the recent program initiatives as well as the policy climate that surrounds programs for young children and their families in the early 1980s.

THE PROGRAM AND POLICY CONTEXT OF THE 1980s

Public policy both creates social change and responds to it. The evolution of policies toward children and families must be understood in the context of general societal change.
Demographic shifts in the number of young children, the composition of families, and the labor force participation of mothers in recent years have increased and broadened the demand for services. They have also heightened consciousness about policy issues surrounding child health care, early education, and social services. Policy makers and evaluators in the 1980s are coping with the consequences of these broad changes. Contemporary policy issues and program characteristics constitute the environment in which evaluators ply their trade, and they pose challenges with which new evaluations and outcome measures must deal. To understand the policy context surrounding demonstration programs for children in the 1980s, it is useful to begin by outlining some general considerations that affect the formation of policy. These generic considerations apply to virtually all programs and public issues but shift in emphasis and importance as they are applied to particular programs and issues, at particular times, under particular conditions. The most fundamental consideration is whether the program or policy in question (whether newly proposed or a candidate for modification or termination) accords with the general philosophy of some group of policy makers and their constituents. Closely related is the question of tangible public support for a program or policy: Can the groups favoring a particular action translate their needs into effective political pressure? Assuming that basic support exists, issues of access, equity, effectiveness, and efficiency arise. Will a program reach the target population(s) that it is intended to affect (access)? Will it provide benefits fairly, without favoring or denying any eligible target group--for example, by virtue of geographic location, ethnicity, or any other characteristics irrelevant to eligibility? And will its costs, financial and nonfinancial, be apportioned fairly (equity)? Will it achieve its intended objectives (effectiveness)? Will it do so without excessively cumbersome administrative machinery, and will cost-effectiveness and administrative requirements compare favorably with alternative programs or policies (efficiency)?
Two related concerns have to do with the unintended consequences of programs and policies and their interplay with existing policies and institutions. Will the policy or program have unanticipated positive or negative effects? Will it facilitate or impede the operations of existing policies, programs, or agencies? How will it affect the operations of private, formal, and informal institutions? Programs for children and families are not exempt from any of these concerns. Some have loomed larger than others at times in the past two decades, and the current configuration is rather different from the one that prevailed when the first evaluations of compensatory education were initiated. The policy climate of the early 1960s was one of concern over poverty and inequality and of faith in the effectiveness of government-initiated social reform. The principal policy initiative of that period directed toward children and families--namely, the founding of Head Start--exemplified this concern and this faith. Head Start was initially administered by the now defunct Office of Economic Opportunity (OEO), and many local Head Start centers were affiliated with OEO-funded Community Action Programs. Thus, while it was in the first instance a service to children, Head Start was also part of the government's somewhat paradoxical attempt to stimulate grass roots political action "from the top down." The national managers made a conscious, concerted effort to distinguish Head Start from other children's services, notably day care. The latter was seen as controversial--hence, a politically risky ally. The early 1960s was a time of economic and governmental expansion. Consequently, questions of cost and efficiency did not come to the fore. The principal concerns of the period were to extend services--to broaden access--and to demonstrate the effectiveness of the program. As noted earlier, effectiveness in the public mind was largely equated with cognitive gains. Despite the political character of the program, studies documenting its effectiveness as a focus for community organization and political action received little attention or weight--perhaps because the political activities of OEO-funded entities, such as the Community Action Programs and Legal Services, were sensitive issues even in the 1960s. Yet it was precisely the effectiveness of Head Start at mobilizing parents (together with the political skills of its national leaders) that saved the program when the Westinghouse-Ohio study produced bleak results and a new administration dismantled OEO. During the 1970s the policy climate changed markedly.
Economic slowdown and growing disillusionment with what were seen as excesses and failures of the policies of the 1960s brought about a concern for accountability and fiscal restraint, a concern that is still present and growing. Head Start responded by establishing national performance standards in an effort at quality control. Expansion was curtailed as the program fought to retain its budget in the face of inflation and congressional skepticism. (In fiscal 1977 only 15-18 percent of eligible children were actually served by Head Start.) Policy makers and program managers began to demand that evaluations focus on management information and cost accounting. At the same time, other policies and programs for children and families were gaining national attention. Economic pressures, the increased labor force participation of women, and the rise of feminism brought day care into prominence. Federal investment in day care increased under Title XX of the Social Security Act and numerous other federal programs for the working poor, backed by a curious alliance of feminists, liberals, child advocates, and "workfare conservatives." Although anti-day-care, "pro-family" forces remained strong, public subsidy of day care was gradually, if sometimes grudgingly, accepted as a reality. Most of the policy controversy surrounding day care in the 1970s centered on the trade-off of cost and quality: Should day care be viewed primarily as a service designed to free (or force) mothers to work--and therefore be funded at minimum levels consistent with children's physical and psychological safety? Or should it be viewed as a developmental service, akin to Head Start, or as a vehicle for delivering other services, such as health care and parent counseling, with attendant increases in cost? The controversy took concrete form in the debate over the Federal Interagency Day Care Requirements--purchasing standards that specify the type and quality of care on which federal dollars can legally be spent. As we move into the 1980s, new, or more precisely latent, issues are likely to become prominent with respect to day care. The financing of day care is likely to become an ever more pressing problem, as the service becomes increasingly professionalized. Day care workers, among the nation's lowest paid, are likely to seek higher wages.
Informal, low-cost care by friends or relatives may absorb less demand than it has in the past, as women who have heretofore provided such care either enter the work force in other capacities or begin to seek increased recognition and compensation for their services. At the same time, the importance of relatively informal care arrangements, such as family day care, has come to be recognized in policy circles. Informal arrangements are in fact the most prevalent forms of out-of-home care, especially for children of school age and for children under three. With this recognition will come new debates about the proper role of government: Should it regulate? Provide training? Invent new subsidy mechanisms? Major demonstrations examining alternative funding and regulatory policies for both center and family day care have already been undertaken by the state of California. Novel ways of funding child care, such as "tuition" vouchers, have been urged and studied, and a child care tax credit has already been legislated. Day care is of course not the only type of children's program that underwent major change in the 1970s. Important new initiatives arose in the areas of child health and nutrition. For example, the Department of Agriculture established the Supplementary Food Program for Women, Infants, and Children and the Child Care Food Program; these provide low-cost nutritional supplements to low-income families and to the child care programs serving them. The Early and Periodic Screening, Diagnosis, and Treatment program was established to ensure that children from low-income families would be examined for problems of health, vision, hearing, etc. Another initiative, sweeping in its implications, was the federal mandate under the Education for All Handicapped Children Act of 1975 (P.L. 94-142) that handicapped children be provided with a "free, appropriate public education," interpreted to mean education in the "least restrictive environment" feasible given their handicaps. The consequences for public schools have been enormous, and federal programs for younger children have also responded by building in provisions for the handicapped. The Head Start Economic Opportunity and Community Partnership Act of 1976 requires that 10 percent of Head Start slots in each state be set aside for handicapped children. Although P.L. 94-142 is linked to federal funds to aid the handicapped, the law has the character of an entitlement rather than being a service program per se. The law establishes very broad rights and guidelines, not particular machinery for service delivery. Entitlements greatly broaden the constituencies affected by federal policy, for they extend far beyond the children of the poor.
They highlight questions of access and equity for those charged with enforcement at the federal level. In the case of P.L. 94-142, questions of effectiveness and efficiency have largely been delegated to the local level: Local experts and practitioners are confronted with the task of devising programs that work at reasonable costs under local conditions. Questions having to do with overall effects of the policy on children, schools, and families have not been addressed at a national level. However, federal funds have been made available under

other legislative authorization for the establishment and evaluation of small-scale model programs for serving handicapped children. Another major development with profound consequences for the schools is the bilingual education movement. The movement has been reinforced by the courts, most notably by the case of Lau v. Nichols, in which a California federal district court, later upheld by the U.S. Supreme Court, declared that it is discriminatory for schools to provide instruction only in English to students whose primary language is not English. Although the case was brought on behalf of Oriental children, its primary effects are being felt in those states where Hispanic children constitute a large and growing segment of the student population. And, like P.L. 94-142, the bilingual education movement has generally trickled down to the preschool level, where bilingual programs are rapidly being established in Head Start and other programs. The bilingual movement poses basic questions about federal and state policies toward minority subcultures--questions of pluralism versus integration that have never been fully addressed. At the local level, these highly controversial issues are fueled with additional controversies over what are seen as federal rights of encroachment and the responsibilities of local governments. Concurrent with these specific legislative and judicial initiatives, more diffuse but no less important policy issues have arisen in connection with certain federal demonstration programs. Two characteristics of these programs are particularly salient: an emphasis on the family and the community institutions with which it interacts, rather than on the child in isolation, and a stress on localism--on the diversity, rather than the uniformity, of programs and on their adaptation to local values and conditions.
Programs exemplifying these emphases include Head Start's spinoff demonstrations, such as the Parent-Child Development Centers and the Child and Family Resource Program. These projects have acquired new strategic importance, in part as a result of a recent General Accounting Office report (General Accounting Office, 1979) that holds them up as models for future delivery of services to children from low-income families. Some nonfederal programs also emphasize multiservice support for families; an example is the Brookline Early Education Project, a privately funded program within a public school system. Other important

examples are day care programs funded under Title XX of the Social Security Act, which provides grants to states to purchase social services. These programs often provide a wide range of services that go beyond direct care of the child. And Title XX itself represents an attempt to decentralize decision making by allowing states considerable latitude in the use of federal funds. These policy emphases have multiple roots. In part they stem from a reaction against what has been seen as an intrusive, excessively prescriptive federal posture vis-a-vis local programs and their clients. In part they represent an assertion of the family's central role and responsibility in child rearing. In part they have a theoretical base and reflect an ecological perspective on child development--one that sees changes in the child's immediate social milieu, the family, and family-community relations as the best way to create and sustain change in individual children. In part they arise from practical experience with and applied research on earlier programs, which repeatedly showed dramatic differences in practices and effects from site to site, even when they were allegedly committed to implementing some prescribed treatment or model. Family support programs raise issues that have not been prominent with respect to earlier demonstrations. They focus attention on the relationships between children's programs and other service agencies in local communities. They also focus attention on relations between programs and informal institutions, such as extended families, which in some subcultures have traditionally provided the kind of global support that some demonstration programs aim to provide. They raise basic questions as to whether ecological approaches in general are more effective than interventions aimed at the child alone.
Finally, they highlight issues having to do with the prerogatives and responsibilities of different levels of government and of government vis-a-vis private program sponsors, service providers, and clients. A tension is created by pressures for accountability at the federal level and conflicting pressures for delegation of responsibility to the state or local level. Evaluation often plays a role in struggles among the various levels of government, usually as a device by which federal program managers attempt to exert some control over local practices. In short, the policy context surrounding early childhood demonstration programs in 1980 has become very complex. Old issues have remained, and new or resurgent issues have been overlaid on them. The need to measure program effects on children has not diminished--witness the current effort by Head Start to develop a new, comprehensive battery of outcome measures. Concerns about cost, efficiency, and equity have become acute, as the federal government has expanded the scope of its responsibilities. Broad entitlements and new initiatives have increased the competition for finite resources in the face of widespread resistance to further taxation and bureaucratic expansion. There is increased pressure for centralized accountability and cost and quality control. At the same time there has been a broadening of the constituencies affected by early childhood programs as well as increased emphasis on pluralism of goals and values; decentralized, local decision making; and the individualization of services. Fortunately, no single evaluation will ever have to address all of these policy concerns simultaneously. Nevertheless, their complexity and antithetical value premises pose staggering challenges for the evaluator who hopes to influence policy. Although evaluators can address only a small subset of these concerns, they must constantly be aware of the larger picture or run the risk that the information they provide will be irrelevant or misleading in light of the full configuration of issues bearing on the future of a particular program. These last observations lead to a final point about the policy climate of the 1980s: the role of evaluation itself in policy determination. An evaluation industry was born with the Great Society programs of the 1960s, which often included evaluations as integral parts.
That enterprise has continued to grow and its audience has expanded, as clients, advocacy groups, and practitioners as well as policy makers and social scientists have learned to use evaluation results for their own diverse purposes. Congress has explicitly written evaluation requirements into the authorizing legislation for major programs, such as Title I of the Elementary and Secondary Education Act and the Education for All Handicapped Children Act. As evaluation has grown in prevalence and importance, some of its limitations have also become apparent. By their very nature, evaluative studies must be restricted in scope and therefore can address broad policy issues only in a partial and fragmentary fashion. The injection

of rational, systematic, analytic perspective into policy formation does not dispense with value conflicts; the choice of questions in evaluations is partly a matter of values, and findings are always subject to interpretation from multiple perspectives. Evaluation itself has costs, not only financial but also in terms of respondent burden and potential invasion of privacy. There are concrete manifestations of resistance to evaluation, in the form of increased restrictions on data collection. Despite these limitations we believe that evaluation can contribute to policy. Particular findings may mesh with the immediate information needs of policy makers and thus affect decisions directly. Boruch and Cordray (1980) provide some striking case studies illustrating this sort of direct contribution. Perhaps more typically, findings from many studies over time can create a general climate of belief, for example, belief that early intervention in some sense "works," which in turn subtly and gradually shapes the questions that policy makers ask, shifting their attention, for example, from questions of effectiveness to questions of access, equity, and efficiency. Evaluation can also reveal unintended consequences of programs and point to new policy questions and new directions for program development. Sophistication about the multiple concerns of policy makers and their own limited roles in the process of policy determination may breed in evaluators a salutary humility, but it should not breed despair. And awareness should make their contribution even greater.

IMPLICATIONS FOR OUTCOME MEASUREMENT AND EVALUATION DESIGN

The programs and policy issues that have evolved over the past two decades, particularly in the late 1970s, pose serious challenges for evaluators.
However, experience in performing evaluative studies has been accumulating since the early 1960s, and that experience offers contemporary evaluators some lessons about how to deal with at least some of these challenges. In this section we discuss specific characteristics of contemporary programs for young children that confront evaluators with problems of design and measurement and lessons drawn from past experience that may help improve future evaluations.

Challenges to the Evaluator

Many of our concepts of outcome measurement and evaluation design were, as already suggested, shaped by the compensatory education and cognitive enrichment programs of the early 1960s. These programs were initiated under private auspices, often with government funding, at one or a few sites. While these programs were to become models for public policy and in many cases were consciously intended as such, they were not immediately concerned with issues of administration and implementation on a large scale or with links to other public service delivery systems, such as nutrition or health care. Nor were they much concerned with questions of cost or cost-effectiveness. The question on everyone's mind was, will preschool education work? That is, will it improve the school functioning and test scores of low-income children? The early programs were new and relatively small, their goals were relatively clear and circumscribed, and comparable services were not widely available. The individual child was typically the recipient of treatment, and the programs were implicitly conceived as operating in relative isolation from other social institutions and forces. Consequently, it was possible to devise simple evaluations, in which test scores and school performance of children in the program were compared with those of similar children in the same communities who received no services. The program itself was viewed as a unitary "treatment," and children in the control or comparison group were assumed to receive no treatment. Such evaluation designs were straightforward extensions of laboratory paradigms, although the children in control groups were often selected by post hoc matching rather than random assignment, thus making many evaluation designs quasi-experiments rather than true experiments.
Of course, not all early programs were rigorously evaluated, and not all evaluations were as limited as we have suggested; for example, diffusion of effects to siblings and neighbors was a topic of interest in some of the early evaluation studies. As suggested earlier, experimental designs are ideal for answering certain kinds of evaluation questions, because they provide the most direct means of establishing linkages of cause and effect. Children's academic skills and performance are often important program outcomes, and standardized tests, properly interpreted, measure aspects

of these skills. However, experience with the demonstrations that have evolved over the past two decades has made three points clear: First, a wider range of outcome measurement is necessary to do justice to program goals. Second, measurement of outcomes alone does not show why a program achieved or failed to achieve its intended goal--often the most significant lesson to be learned from a demonstration. Third, the conditions necessary for successful experimentation are often not met when demonstrations are conducted on a relatively large scale. Treatments tend to be multifaceted and variable. Often the pairing of client and treatment is beyond the experimenter's control. Extremely complex designs may be needed to tease out complex chains of causation. We amplify these points in the pages that follow. It should be clear, however, that we are not opposed to experimental approaches, controlled assignment, or formal designs. We discuss program characteristics that pose barriers to formal experimentation in order to make a case for supplementing, not supplanting, experimental approaches with other scientifically defensible forms of investigation. Similarly, we recognize the value of outcome measures focused on individual development, including academic skills and achievement. However, we emphasize program characteristics that point to the need for other kinds of data--measures of outcomes that go beyond the individual child and measures of context and process that illuminate why and how a program works or fails to work. We discuss below eight program characteristics that are particularly salient.

Diversity of Target Groups

In contrast to most earlier demonstrations, the programs of the 1980s are aimed at a broader range of client populations. Programs aimed at physically normal, English-speaking children from low-income families still predominate.
The sweeping entitlements mandated by legislatures and courts, however, have created many programs to meet the special needs of handicapped children and children of limited English-speaking ability, not all of them from low-income families. Of course, these children themselves form extremely heterogeneous populations with diverse needs. Accompanying increased public attention to day care has been a concern about the effects of prolonged out-of-home care on children from all social

backgrounds, including the middle class and well-to-do, and of all ages, from infancy through school age. Increased diversity in the children served by public and private demonstration programs calls for increased diversity in measures to address the needs and characteristics of the populations in question.

Diversity of Services

Closely related to the breadth of client populations is breadth in the range of services offered. Again, services to meet the special needs of handicapped children and children of limited English-speaking ability provide striking examples. In addition, preschool education, once the predominant service for children of low-income families, has been joined by health care and nutrition, referrals to a wide variety of social services, and training and counseling of parents in child care, in dealing with schools and other public institutions, in family relations, and in more peripheral areas such as employment and housing. This breadth of services obviously requires a commensurate breadth of measures--not only better measures of children's physical, intellectual, social, and emotional growth but also measures of the quality of the child's life in the program itself (as programs increasingly become a large part of the child's daily environment); the quality of parent-child relations; the strengths and cohesion of families; and the family's adaptation to its social, economic, and institutional environment.

Emphasis on the Social Environment

In many programs there has been a widening of focus, from the child in isolation to the child in the family and the family in the community. Strengthening families and improving family-community relations are seen as ways to create social environments for children that foster growth--as well as ends in themselves.
This emphasis on the child's social milieu creates a need to reexamine existing measures of individual development and family functioning, with an eye toward their appropriateness in assessing the effects of programs and policies aimed at reaching the child through the family. It may of course also create a need to modify existing measures or to

develop new ones. Similarly, it draws attention to measures of linkage between families and institutions--such as schools, courts, churches, voluntary organizations, social service and health care agencies--and informal sources of support--friends, neighbors, and relatives. There is an overarching need to test the basic assumption of these programs: that the most effective way to create and sustain benefits for the child is to improve his or her family and community environment. This assumption is well grounded in theory and basic research, but whether it can be translated into effective programs is an open question. Clearly, such a test is not the task of any single study, but must arise from a gradual accumulation of data on the effects of many such programs.

Support Versus Intervention

Accompanying the focus on families and communities is an emphasis on support rather than intervention. Intervention implies an initiative from outside the family, a "treatment" whose goals and methods are prescribed by an external agency, governmental or private. Support implies shared goal setting and initiative on the part of the family in selecting the services it or the child receives. Though often merely rhetorical, this emphasis has potentially profound consequences for evaluation design and measurement, since it implies that the goals of a program and the treatment provided cannot be predefined, except in a broad manner. In effect the client plays a role in selecting both dependent and independent measures. An additional, equally important implication of this emphasis on support is that support itself should be measured. There is a need to know whether family-oriented social programs in fact strengthen the family or inadvertently weaken it by creating dependence on government and cutting ties to informal supports such as friends, neighbors, and the extended family.
Even participation in a program may be hard to define or interpret when contacts between family and program are wholly or partially voluntary, as is the case with many support programs. A family may choose not to contact a program because it is doing well on its own, yet it may also fail to make contact when it is most in need of help. A family may remain out of contact for long periods, then renew the relationship in time of stress. Thus participation is an ambiguous indicator of need and of program effectiveness. It may be difficult just to know at any time how many families are participating and difficult to determine who should be counted as participants when the program is evaluated. It is important to note that certain key assumptions of support programs are embodied in far-reaching policies as well. P.L. 94-142, for example, establishes an advocacy process by which parents play a major role in the educational placement of their children. Like support programs, the law assumes that parents are rightful advocates for their children, that they can identify the child's needs and can and will act effectively in the child's best interest. In part, of course, this emphasis on parent involvement stems from basic value premises about the rights of parents. In part it also embodies empirical assumptions, which are subject to test through a gradual accumulation of information about the effects on children of programs and policies in which parental involvement plays a major role.

Individualization of Services

For many programs of the 1980s, services for a particular child or family are selected in light of that child's or family's needs; individualization of services has become a watchword. Individualization tends to characterize support-oriented programs, in which clients participate in decision making. It can also occur when the locus of control rests with the program. Individualization is required by law in educational programs for the handicapped. It occurs naturally as part of health programs--medical and dental services are provided in response to patients' complaints and diagnosed problems--although health programs may also provide uniform services, such as screenings and immunizations. Nonuniform treatments challenge evaluation designs in fairly obvious ways.
Although it is inappropriate to lump clients into a single treatment group to probe for common outcomes, it is equally unsatisfactory to treat individualized programs simply as a series of case studies. There is a need to find some middle ground that permits aggregation of effects across clients yet does justice to the diversity of treatments and outcomes. There is a complementary need to devise new techniques for "profiling" effects--for summarizing what the program

has done for the individual child or family across a range of outcome domains, which may vary from client to client. Finally, there is a need to test the underlying assumption that individualization is a viable approach, through gradual accumulation of data on a variety of individualized programs. Individualization of services also raises a related value issue: how to reconcile legitimate and desirable individual differences with the need to identify a manageable set of outcome measures that are consistent with program goals. Early childhood programs run the risk of attempting to homogenize certain characteristics of their participants. The need for relatively clear, consistent program goals can shade imperceptibly into an assumption that what is good for one is good for all. The process of evaluation, assuming that it is based on outcome criteria known to the program, may foster or exacerbate pressures for conformity and penalize children who are constructively different.

Decentralization and Site Variation

In part because of increased philosophical emphasis on local initiative and primarily because programs inevitably adapt to local needs and resources, even when federal program guidelines exist, decentralization of control and site-to-site variation are facts of life for the program evaluator of the 1980s. In multisite evaluations, site variations cannot be viewed as nuisance variables, to be quashed through insistence on rigid adherence to a treatment recipe or to be adjusted away after the fact by statistical manipulation. They are integral features of large-scale programs, to be examined in their own right. Evaluations must be designed to accommodate them, and outcome measures must be chosen to highlight rather than obscure them.

Indefinite Time Boundaries

Many demonstration programs of the 1980s are likely to be ongoing rather than time bounded.
Classical interventions typically involve strict age guidelines; for example, preschool compensatory education programs normally serve children from age three to age five. In contrast, some contemporary support programs imply an indefinite

period of relationship between program and family; programs continue to provide assistance as long as the family wants it, lives in the area, and meets eligibility criteria. This open-ended quality makes it difficult to know when to measure a program's outcomes. Different measures may be appropriate at different points in a family's relationship with a program, yet these points are defined not chronologically but by the juxtaposition of a need expressed and a service provided.

Integration of Services

Finally, the programs of the 1980s are likely to be characterized by increased emphasis on the integration of services. Head Start and Title XX day care attempt to provide a wide range of services in a single facility. Demonstrations such as the Child and Family Resource Program try to capitalize on existing services in the community, providing referrals and, if necessary, assistance and advocacy in securing services to which clients are entitled. In part this emphasis on service integration arises from considerations of efficiency. In part it arises from a felt need to present client families with a coherent image of the social service system rather than a fragmented one, with a sense of accessibility and rationality, rather than one of obstruction and confusion. Service integration raises questions that have heretofore been largely ignored in evaluations of early childhood programs, although they have been central in policy analyses of social programs generally: Under what conditions is the referral approach more appropriate? The answer depends in part on the services already available in a given community. If services are available elsewhere in a community, how should the convenience of service at a single facility, such as a Head Start center, be weighed against the efficiency of using existing services outside the facility? If referrals are used, how is demand for existing services affected?
Is the system structured so that the referral agency does not overload the provider agencies? How do federal programs, such as the Child and Family Resource Program, affect demand for state and local services? These and other systemic questions demand a different order of outcome measures from those usually thought of in connection with programs for children and families.

Lessons for Future Evaluations

There are no all-purpose solutions to the problems posed for evaluators by contemporary programs for children. Nor is there an all-encompassing list of widely accepted outcome measures from which evaluators can choose to suit their purposes. However, children's programs have been among the most heavily studied of all social programs, and considerable experience in the art of evaluation has accumulated. This section draws on that experience to make a series of broad suggestions about the kinds of information that evaluators might collect in order to make their results useful in shaping future policies and program practices. These suggestions should not be construed as implying that any single evaluation must make use of all of the kinds of measures mentioned. On the contrary, the panel is acutely aware of the constraints imposed by resources and by the need to avoid burdening programs and clients. Our suggestions are offered not as a recipe for the ideal evaluation but as a framework for choice. We have tried to provide some salient reminders about factors that should be considered in designing evaluations of children's programs, based on our review of program characteristics and contemporary policy issues.

Rethinking Developmental Measures

By choosing too narrow a range of outcome measures, the evaluator may forego opportunities to discover important effects of a program and thus misdirect policy or fail to address some of the many constituencies affected by a program. In this regard the limitations of traditional outcome measures, especially standardized tests of cognitive ability and achievement, have long been recognized. Because the goals of many early childhood programs lie in socialization, rather than cognitive enrichment, calls for better measures of self-concept, social skills, prosocial behavior, and the like have been frequent and forceful.
(For some proposals regarding the measurement of social competence in young children, see Anderson and Messick, 1974; Zigler and Trickett, 1978.) While we are prepared to add our voices to the chorus, we argue that some important distinctions, qualifications, and additions must be kept in mind.

One can conceptualize socioemotional outcomes in terms of enduring changes in the personality traits of children, traits that are exhibited in other contexts and preserved in later life. Or one can conceptualize such outcomes as indices of the child's immediate well-being. For example, one could speak of a day care program making a child more cooperative with other children, with the presumption that increased cooperativeness will manifest itself in the home or in school, not just in the day care center. Or one could simply speak of a day care center in which a cooperative atmosphere prevails, or in which a particular child behaves cooperatively, with no presumption about cross-situational generality or longitudinal persistence of cooperativeness. We suggest that this distinction is a crucial one, for the two interpretations raise different measurement issues. This section discusses some of the issues surrounding the "trait" interpretation. The immediate well-being of the child is discussed later. If the worth of a program is to be judged by its ability to produce enduring changes in individual traits, then a heavy burden of proof is placed on it. Despite the progress that has been made in developmental psychology, basic researchers in the field are still struggling with the question of how to conceptualize social behavior and to sort it into portions attributable to the enduring traits of the child and portions attributable to the immediate situation. Similarly, a great deal remains to be learned about which early behavior patterns are likely to persist into later childhood and adulthood. Thus we are currently ill equipped to choose or develop measures that capture important, lasting traits of children and that are also responsive to intervention. The evaluator's problem in choosing social measures is not merely a technological one that can be solved by straightforward investment in instrument development.
In fact, there are already hundreds of instruments for measuring social development in young children. These instruments are reviewed, for example, by Goodwin and Driscoll, 1980; Johnson, 1976; Walker, 1973; and Johnson and Bommarito, 1971. Unfortunately, the few that have been used in evaluation have had disappointing histories. Developing better social measures is a problem of basic research that cannot fairly be handed to evaluators. Until such measures are available, the limitations of our understanding should not be allowed to work to the detriment of programs; programs should not be judged on the basis of available measures, without regard for their actual

goals and practices. On the other hand, programs should not be allowed to use ill-defined goals in the realm of social development as a smokescreen to avoid accountability. Program planners should be specific and concrete about their goals, so that the programs can be evaluated as thoroughly as possible within the limits of existing technology. Paralleling the need for enriched psychosocial measurement is a less widely recognized need for measures of physical development and health that are likely to be sensitive to program interventions. Available measures of physical status, ranging from height and weight to presence or absence of a wide variety of diseases, are unlikely to show such sensitivity for most children. Height and weight are likely to be measurably affected primarily in children who enter a program in a state of malnourishment or physiological disorder. Ameliorating these serious cases is of course a program effect of major importance; however, detecting program effects on children in the normal growth range may require more sensitive measures. Incidence of serious diseases is likely to be so low that any program effects could be detected only with huge samples. More common diseases tend to be less serious and/or self-terminating; the incidence of such diseases may therefore be of secondary importance as an outcome measure. Thus there is a need for measures of "wellness" and normal development that vary with nonextreme differences in environments. Even for a measure that is well established in basic research, there are numerous hurdles to be cleared in adapting it for use in evaluation. Field conditions may rule out some of the control that characterizes use of the measure in the laboratory. Economic constraints in large-scale studies may preclude recruitment of highly educated field staff or extensive staff training. Sometimes measures may lack the degree of face validity they need if they are to be accepted by parents and program staff.
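The remark above that program effects on the incidence of serious diseases could be detected only with huge samples can be illustrated with a rough calculation. The sketch below is not part of the panel's analysis. It assumes a simple two-group comparison of proportions using the standard normal-approximation sample-size formula, and the incidence rates, the function name n_per_group, and the default significance level and power are all hypothetical choices made only for illustration.

```python
from math import ceil, sqrt
from statistics import NormalDist


def n_per_group(p_control, p_program, alpha=0.05, power=0.80):
    """Approximate number of children needed in each group to detect a
    difference between two incidence rates with a two-sided
    two-proportion z-test (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    p_bar = (p_control + p_program) / 2
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_power * sqrt(p_control * (1 - p_control)
                         + p_program * (1 - p_program))
    ) ** 2
    return ceil(numerator / (p_control - p_program) ** 2)


# Hypothetical incidence rates, chosen only for illustration.
rare = n_per_group(0.01, 0.005)    # serious disease: 1% halved to 0.5%
common = n_per_group(0.20, 0.10)   # common illness: 20% halved to 10%
print(rare, common)
```

Under these assumed rates, halving a rare incidence requires samples in the thousands per group, while halving a common one requires only a few hundred, which is the sense in which low-incidence outcomes make insensitive evaluation measures.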
Even in small-scale studies, researchers are often tempted to cut corners when a particular instrument requires a heavy investment of time and effort. For example, the "strange situation" developed by Ainsworth (Ainsworth and Wittig, 1969) to measure an infant's attachment to its mother has been shown to be a reliable, valid measure that predicts social adjustment up to age five (Sroufe, 1979). However, although many researchers have been concerned with the impact of early day care on mother-infant attachment, few have used Ainsworth's demanding coding scheme, and few have confined their research to the age range (12 to 18 months) for which the instrument is known to work. Instead, even basic researchers working with small samples have used ad hoc modifications of Ainsworth's procedure, with the result that much of the literature on day care and attachment must be viewed as ambiguous (Belsky and Steinberg, 1978).

Other important questions surround the adaptation of individual developmental measures for use in evaluation. One such question has to do with the expected timing of effects--an issue on which current theory and research give little guidance. Different outcomes may have very different time courses: Some effects may be transient and contemporaneous with the program itself; some effects may be at a maximum on completion of treatment and may diminish in size thereafter; and some effects may not become apparent until long after participation in the program. Preschool education, for example, has shown both of the latter two patterns of effects. Scores on standardized tests of ability or achievement tend to show maximum differences between treatment and control on completion of the program, diminishing afterward (Bronfenbrenner, 1974). However, as discussed below, there are recent reports of sleeper effects, in the form of better school performance, years later, for some programs. Assessment of program effects may thus depend critically on the timing of outcome measurement. Without a clear theory or at least a well-formulated hunch about relationships between treatment and outcome, it may be necessary to probe for effects at multiple time points.

Another such question has to do with the match between the quantitative form of outcome measures and the goals of the program in question. Some programs are designed primarily to shift a distribution upward--for example, the distribution of academic achievement scores of low-income children.
Some are designed to set a floor under a distribution--for example, to guarantee that all children in a program receive a certain minimum nutritional intake or achieve minimal literacy. Some are designed to lower the prevalence of undesirable conditions in the immediate present, such as dental caries, or in the future, such as adolescent delinquency. Some are designed to prevent relatively rare but catastrophic events, such as child abuse. In some cases the variance rather than the central tendency of a distribution may be important. For example, mainstreaming of handicapped children may not change their mean performance from that of handicapped children in separate classes, but some children might be doing much better and others much worse when integrated with nonhandicapped children.

A program may look successful or unsuccessful, given precisely the same distribution of individual outcomes, depending on how the individual scores are aggregated and analyzed. For example, a reading program may produce an upward shift in the group mean by increasing the scores of the children who read best already, while having no effect on the skills of nonreaders. Whether the program is deemed a success or a failure depends on whether the evaluator emphasizes the mean shift or the lack of change at the bottom of the distribution. The choice of quantitative summary measures is thus not a purely technical matter; it is intimately linked to the substance of the evaluation and the goals of the programs.

There are encouraging recent reports of lasting individual effects of some early preschool demonstrations of the early 1960s (e.g., Lazar and Darlington, 1978). These reports are significant not only for what they suggest about the time course of the effects of intervention but also for the nature of the long-term measures they use. Reviewing a number of longitudinal studies, Lazar and Darlington conclude that graduates of these programs were much less likely than control or comparison children to be placed in special education classes, to be held back one or more grades in school, and to score poorly on tests of academic achievement. The authors also conclude that children's participation in preschool programs elevated mothers' aspirations for their children's educational achievement and increased the children's pride in their own achievements. The panel has not reviewed these studies in detail and offers no judgment about the accuracy of their findings.
What is significant for our purposes is their attempt to use certain highly practical measures, which combine academic motivation and skill (such as grade retention and placement or nonplacement in classes for the retarded or learning disabled), as indicators of long-term program effects on individuals. These measures are clearly attractive for their direct social and policy importance. They sidestep many of the theoretical issues and value controversies that surround most cognitive and social measures. However, they do need careful scrutiny, since they are likely to be affected by school policies and other external factors that might cloud their interpretation as measures of long-term individual success.
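
The earlier point that the same distribution of outcomes can look like a success or a failure, depending on the summary measure chosen, can be shown with a small numerical sketch. The scores, the cutoff defining a "nonreader," and the function name below are hypothetical values invented for illustration; they do not come from any evaluation discussed in this report.

```python
from statistics import mean

# Hypothetical reading scores for the same eight children before and
# after a program. Scores below FLOOR are treated as "nonreaders".
FLOOR = 20
before = [5, 10, 15, 40, 50, 60, 70, 80]
after = [5, 10, 15, 40, 55, 70, 85, 95]


def summarize(scores, floor=FLOOR):
    """Return two competing summaries of one score distribution."""
    below = sum(1 for s in scores if s < floor)
    return {
        "mean": mean(scores),
        "share_below_floor": below / len(scores),
    }


pre = summarize(before)
post = summarize(after)

# Summary 1: did the group mean shift upward?
mean_gain = post["mean"] - pre["mean"]

# Summary 2: did the share of nonreaders change?
floor_change = post["share_below_floor"] - pre["share_below_floor"]

print(f"gain in group mean:          {mean_gain}")
print(f"change in share below floor: {floor_change}")
```

In this constructed example the group mean rises because the strongest readers improve, while the share of children below the cutoff does not change at all. An evaluator who reports only the mean and one who reports only the floor would reach opposite verdicts from identical data.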

The foregoing remarks are not meant to imply that measures of long-term individual development have no place as outcome measures for early childhood programs. On the contrary, such measures have been and remain central. We take this to be a position that requires no elaboration or defense. We have chosen, however, to focus the remainder of our comments elsewhere because we believe that other measures have been neglected.

Measuring Quality of Life

In view of the fact that some programs, notably day care and preschool education, consume a significant portion of the child's waking life, a case can be made for considering the quality of life to be an outcome in itself. We are accustomed to thinking of programs for children primarily as investments in the child's future. Often, however, social programs, such as some programs for the elderly, are justified on the grounds that they provide a decent environment in the here and now for people whose welfare is the concern of the citizenry as a whole. Our intent is not to advocate that the citizenry or government accept such responsibility for children. Rather, our point is that once such responsibility is taken, immediate quality of life becomes an appropriate standard by which programs may be judged. The same consideration applies to the evaluation of services financed by nongovernment agencies or purchased privately by parents.

Clearly, measuring the quality of life is no easier than measuring socioemotional development, except insofar as the former phrase carries no implication of enduring effects. Equally clearly, quality of life and development are intertwined; patterns of behavior that indicate immediate engagement, stimulation, self-confidence, etc., on the part of the child are at least good bets to relate to longer-term socioemotional growth.
In urging a shift of attention to the here and now, we are under no illusion that there exists a readily available, widely accepted technology for assessing children's social environments. There are examples, however, of influential studies that have focused on the child's immediate well-being. One is the National Day Care Study, a large-scale study of center day care, designed to inform federal regulatory policy (Ruopp et al., 1979). The study used natural observations of care givers and children to characterize the social experiences of children in groups of different sizes, with different staff/child ratios and different configurations of care givers' qualifications. The study found that cooperation and creative, intellectual activities by children were more frequent in small groups and that aimless wandering and noninvolvement were less frequent. This study also found that care givers with training specifically related to young children (e.g., in child development or early childhood education) provided more social and intellectual stimulation than those without such training. The study's results had a direct influence on the day care regulations subsequently proposed by the federal government (Federal Register, March 19, 1980), suggesting that the study's outcome measures had some weight for policy makers.

Assessing Effects on the Child's Social Milieu

Earlier we pointed out that ecological influences have gained increasing prominence in the rationales underlying contemporary programs. Some practitioners have come to believe that the best way to produce lasting effects on the child is to reshape the "ecosystems" in which the child grows--the immediate family and the larger web of relationships between families and external institutions, such as schools, the health care system, and social service agencies. Family support demonstrations, such as the Child and Family Resource Program, provide an obvious example of this ecological approach.

Our earlier discussion also pointed to some of the measurement requirements of family-oriented programs--the need to assess program effects on parent-child interaction, family functioning, and family-community relationships as well as the larger need to test the assumption that the best way to help the child is to work through the family and the community. Fortunately, it is possible to go beyond mere exhortation in this regard.
There is a massive literature on parent-child interaction that can be tapped to identify desirable and undesirable patterns of mutually contingent behavior of parents and children. For example, there have been studies of the effects of day care on parent-child interaction using as outcomes laboratory paradigms for measuring the quality of parental teaching (e.g., Ramey and Mills, 1975; Farran and Ramey, 1980). An evaluation of the Child and Family Resource Program, currently under way, assesses the program's impact on parent-child interaction by videotaping natural situations in the home (Cornell and Carew, 1980). Similarly, there exist many measures of family functioning that have been used in evaluation studies supported by the Office of Child Development, now the Administration for Children, Youth, and Families (see Lindsey, 1976, for a review). There is a literature on the effects of parent education programs (Brim, 1959; Goodson and Hess, 1978), which can also be drawn on to identify parental behaviors likely to be both significant for the child and susceptible to influence by programs. Finally, there is promising new theoretical work on the ecology of human development, which offers both a conceptual framework and specific suggestions about variables and relationships that might be examined in real-world contexts, such as day care (Bronfenbrenner, 1979). Bronfenbrenner's work has been applied by others in attempting to understand other practical problems, such as child abuse (Belsky, 1980).

In general, we are in a fair position to identify intrafamilial variables and measures that affect children; however, while there are many measures describing the interface between families and communities, published work tying these measures to the well-being of the child is just beginning to appear. This is an important area for development, and existing intrafamilial measures have certain problems. Most of them have been developed for specific basic research purposes and adapted for use in evaluation research. Little is known about the psychometric properties of various questionnaires, interviews, and laboratory-based procedures when applied under field conditions quite different from those under which they were developed. In addition, when evaluations of early childhood programs move beyond measures of the child into areas of parent-child interaction and family functioning, issues of privacy and confidentiality may inhibit in-depth investigation.
Assessing Effects on the Service Delivery System

As suggested earlier, systemic effects are crucial for policy. By systemic effects we mean effects on the formal and informal service delivery system as a whole, which can be intentional or unintentional. For example, a voucher demonstration, allowing eligible parents to purchase day care services as they choose, might draw new providers into the business of family day care, as they began to see a stable source of income for their services--an intended systemic effect--or it might lead to the purchase of substandard care in unregulated facilities--an unintended systemic effect. Similarly, federal regulatory policies might raise the quality of subsidized day care but might also raise costs and drive parents and providers into informal, unlicensed day care arrangements. Or family support programs may benefit children and parents but simultaneously increase their dependence on government and displace private support systems, such as the extended family.

There are no hard-and-fast rules for mapping the universe of potential systemic outcomes. However, as a preface to evaluation it is necessary to think broadly and systematically, perhaps drawing on case studies in which unintended effects were discovered, in order to identify as fully as possible the range of such outcomes that might result, particularly if a program is implemented on a large scale.

Fairly simple types of data can often shed a great deal of light on systemic issues. Evaluators of early childhood demonstrations often collect a limited amount of basic information on the numbers of individuals served by a program, the frequency or amount of participation, the services received, and the like. Such information, however, is usually accorded only subordinate status in reporting results and often is not analyzed in detail. We urge a fresh look at such descriptive data, and we suggest that from some points of view such data can legitimately be treated as measures of program effectiveness. Atheoretical indicators of services rendered and of contacts between clients and programs can be invaluable in program management, both on site and at the level of the funding agency.
Moreover, from a policy maker's point of view, delivery of service is often an end in itself, particularly when the value of the service is known or assumed. Health services, such as immunizations, are paradigmatic examples of services whose intrinsic value has been independently demonstrated, i.e., by medical research. Special education for the handicapped is an example of a service whose general value is in effect presumed by existing federal policies, and the choice of specific approaches is left to state or local discretion. The policy maker's concerns with issues of access, equity, and efficiency of services are addressed by descriptive data on types of services provided, numbers of persons served, costs of service, and the like. For example, these are the types of data included in reports to Congress on the implementation of P.L. 94-142, prepared by the Bureau of Education for the Handicapped (1979). Another example of the utility of such data is provided by demonstrations of service delivery mechanisms, e.g., vouchers for day care, for which head counts of persons served are obviously relevant as outcome measures.

Defining "Treatments"

A key problem in understanding the effects of a demonstration is specifying the nature of the "treatments" received by individual children or families within a program. Some of the difficulties involved in describing treatments were identified earlier. For example, we have seen that treatments are often individualized to match the needs of children and families. In the case of support programs, clients have an active voice in deciding what services they receive. As a result, treatment is not standardized and is distributed across clients in nonrandom fashion, complicating conventional experimental design and statistical analysis.

If the program itself is defined as the treatment, and "treated" subjects are compared with controls without regard for actual variations in type and amount of service, important information could be lost. For example, a program may appear to have no overall effect, whereas closer examination may suggest that certain treatment strategies, confined to a subset of the treatment group, were in fact effective. Precisely this situation occurred in national evaluations of Head Start and Follow Through (Smith and Bissell, 1970; Stebbins et al., 1977). If actual services received are measured within both treatment and control (or comparison) groups, and measured service rather than group assignment becomes the independent variable, the simplicity of the analysis is sacrificed.
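
The contrast drawn above, between treating the program as a single treatment and looking at the services actually received, can be sketched with a few lines of arithmetic. All scores and the strategy labels "A" and "B" below are hypothetical and invented for illustration. As the text cautions, clients are not assigned to strategies at random, so a breakdown of this kind suggests hypotheses rather than establishing causes.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical outcome scores for control children.
control = [46, 48, 49, 50, 50, 51, 52, 54]

# Hypothetical program children, tagged with the strategy they received.
program = [
    ("A", 60), ("A", 62),
    ("B", 47), ("B", 49), ("B", 50), ("B", 50), ("B", 51), ("B", 53),
]

control_mean = mean(control)

# Analysis 1: the program as a whole is "the treatment".
overall_effect = mean(score for _, score in program) - control_mean

# Analysis 2: break the program group down by strategy received.
by_strategy = defaultdict(list)
for strategy, score in program:
    by_strategy[strategy].append(score)

effect_by_strategy = {
    strategy: mean(scores) - control_mean
    for strategy, scores in by_strategy.items()
}

print(f"overall program-control difference: {overall_effect}")
for strategy, effect in sorted(effect_by_strategy.items()):
    print(f"strategy {strategy} minus control: {effect}")
```

In this constructed example the whole-program comparison shows only a small difference, because the large gain associated with strategy A is diluted by the larger number of children who received strategy B.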
Further complicating the definition of treatment is the fact that the time boundaries of a program may be fuzzy, and the temporal relationship between treatment and effects may be uncertain. When a program has no clearly defined temporal endpoint, it is difficult to say when treatment is complete.

Still another complication in many evaluations is that control subjects may themselves receive treatment. For example, low-income children not in Head Start may be served in Title XX day care, which resembles Head Start in many respects. Children without access to an experimental health program may be treated at a local clinic. Depending on the purpose of a particular evaluation, these alternative sources of service may be either nuisance variables or highly relevant. If the purpose is to determine whether the program "works" in comparison to no treatment, they obviously cloud the issue. If the purpose is to determine whether a particular program confers an advantage over existing service systems or agencies, alternative sources of service may provide a useful comparison. In general, experimental designs presume that the treatment/no treatment comparison is the relevant one, but for many policy purposes comparison with the preexisting configuration of services is more relevant.

These observations make it clear that if the results of an evaluation are to be intelligible, it is crucial to document the precise nature of the treatment received by children or families in the experimental program as well as those in any control or comparison groups that might be involved. For this purpose, so-called process measures are needed--both gross measures of services provided and fine-grained measures of transactions between staff and clients. Such measures might be of many types--systematic observations using a coding system, participant observation, in-depth interviews, etc. Such measures have the potential to document what actually transpires in a program, as opposed to what is prescribed in the program's guidelines or self-description. Thus they can tell us whether a program is living up to its stated ideals; see, for example, Stallings' (1975) monograph on the relationship between program ideologies and program practices in Follow Through. They can help us distinguish between the delivery and the receipt of services--i.e., between what the program provides and what the child or family experiences.
More importantly, process measures have the potential to illuminate the connection between means and ends--to tell us why a program worked or failed to work. As argued earlier, this information is critical. A demonstration can succeed for idiosyncratic reasons that preclude wider use of its results. Similarly, a demonstration that fails to achieve its intended effects may nonetheless contain valuable lessons for the future.

Numerous examples could be adduced to illustrate the potential usefulness of process measures in clarifying the connection of treatments with outcomes. To cite just a few cases: An early evaluation of the Child and Family Resource Program failed to include such elementary process data as frequency of home visits or regularity of attendance at center sessions. When no effects were found on children's development or family functioning after two to three years in the program, no precise explanation could be given for the lack of program effects. In a current evaluation, detailed process data are being collected, and tentative relationships have been found between participation measures and children's performance on developmental tests. Similarly, staff of the Brookline Early Education Project kept logs of their contacts with client families, and contact frequency has been found to be related to positive outcomes. While correlational data such as those just cited cannot distinguish selection effects from genuine causal linkages between program participation and outcomes, they at least suggest plausible hypotheses for further exploration.

Understanding Site Differences

A related reason for giving careful attention to process measures is to understand the site differences in effects so often found for children's programs. Umbrella programs are likely to vary from site to site with respect to such features as scope of services, the role of parents, philosophy or curriculum, nature of the sponsoring agency, links to the school system, etc. In some cases, notably family service programs such as Parent-Child Centers and the Child and Family Resource Program, this diversity is deliberate: Such programs are intended to respond to local needs and to make use of local resources. Even when programs or models operate under uniform guidelines, however, studies have repeatedly found great diversity in actual practices and in effects from site to site.
When site variation is great, it seems inappropriate to think of a program as a single treatment that is implemented at many sites or that varies unidimensionally from site to site in "distance" from national program specifications; rather, such a program is a collection of treatments, each of which applies to a single site or a few sites at most.

Large-scale comparative studies in the past, such as the evaluations of Head Start Planned Variation and Follow Through, have struggled against this reality, first by trying to enforce uniformity of program models across sites, then by grouping programs or performing various statistical adjustments in order to compare "models." In our view such efforts are often misplaced. We must learn to deal with site variations through innovations in design and analysis and through measurement of program characteristics that allow us to understand site differences. With respect to measurement, process data can help the evaluator understand site variations in a given program or model. With respect to design, it makes sense not only to avoid comparative designs that presume or require sites to be alike but also to capitalize on site differences. By studying how programs adapt to their settings, the evaluator can provide the policy maker with useful information about the potential generalizability of a locally successful approach and can provide practitioners with some indication as to whether a successful innovation is likely to work well under their particular circumstances. Investigation of site effects can also give the policy maker some indication of which program characteristics can and should be mandated at the federal or state level and which are best left to local initiative.

Measuring Costs and Cost Increments

Program cost has continued to be a concern of program sponsors; it is one for which entire methodologies for cost accounting and cost-benefit analysis have been developed. While the panel has not directly concerned itself with issues of cost measurement, it recognizes a need for much more attention to the relationship between costs and program outcomes. With the notable exception of High/Scope Foundation's Ypsilanti-Perry Preschool Project, early childhood demonstrations have made almost no attempt to examine their total costs in relation to long-run benefits.
The Perry project claimed substantial long-run cost-effectiveness, largely due to the fact that its graduates were far less likely than control children to require expensive special education during the school years (Weber et al., 1977).

Furthermore, almost no attention has been paid to variations in cost that are linked to variations in program configuration. In the evaluation of early childhood programs, variations in program philosophies and curricula have frequently been studied, and variations in delivery strategies or program structures have occasionally been studied, but little attention has been paid specifically to cost-relevant variations in programs. Cost-relevant variations may include, for example, staff/client ratios, economies that may derive from large-scale provision of services, or transporting staff to families rather than families to program centers. To the extent that such variations are related to program effectiveness, the nature of these relationships needs to be understood. An example of the usefulness of findings linking cost to quality of service is provided by the National Day Care Study, mentioned in an earlier section (Ruopp et al., 1979). This study examined the costs associated with different grouping and staffing patterns and concluded that the most costly program elements are not the ones most closely linked to quality of care--a finding that influenced the day care regulations proposed in 1980 by the federal government.

Such issues become even more important when one considers that demonstration programs are often designed as prototypes to be refined and made more cost-efficient later, so that they may be implemented on a larger scale. Dissecting such prototype demonstration programs, in order to identify the components that are most closely related to both outcomes and costs, is the best way to ensure that later efficiencies will be accomplished without risk to the effectiveness of the program.

Generalizing From Successful Demonstrations

Even when a program has proven to be highly effective at one or a few sites, numerous factors may limit its wider implementation. By being aware of these factors and addressing them explicitly, the evaluator can provide guidance as to where and how the program's lessons can be put to use. The kind of information necessary to make a reasonable projection of the generalizability of a demonstration is not typically collected in evaluations of programs for children, but it is very much in line with our earlier recommendations.
Examples of relevant questions include: To what degree are participants in the demonstration typical of the populations that might potentially be served? How feasible is it to recruit appropriate staff in large numbers? To what degree are the program's effects limited to particular sites with unique characteristics? How much does the program cost? Are there economies or diseconomies of scale? How complex, costly, and burdensome is the administrative machinery necessary to operate the program on a large scale? To what degree would widespread implementation disrupt, facilitate, or overlap with existing programs? An evaluation that focuses solely on the effects of a program on children or families furnishes indispensable but insufficient information. An evaluation that incorporates information about processes, costs, and the interaction of the program with its setting is in a far better position to address the concerns of those who would build on the experience of the successful demonstration by adapting it for a wide range of settings.

Rethinking Evaluation Designs

The challenges posed by contemporary programs for children and the suggestions we have made for addressing at least some of them require a broader view of alternative measurement techniques and evaluation designs than is commonly maintained. In this section we distinguish a number of different configurations of designs and measures that might be considered, depending on the evaluator's particular purposes.

The first distinction is between experimental and observational approaches. The difference is highlighted by characterizing the former as learning through manipulation. While suitable control is important for either approach, the static nature of observational studies heavily burdens the inferences that are drawn from them.

The concept of control leads to a second distinction--that between randomized and nonrandomized designs. While we make a plea for breadth, to have a rigorous demonstration of program effects there is no substitute for a completely randomized study. Although they do not predominate, randomized designs have been used in the evaluation of children's programs. For example, the Home Start Evaluation (Love et al., 1975) produced particularly clear-cut evidence of the effectiveness of home-based intervention.
Another example is provided by the National Day Care Study (Ruopp et al., 1979), which addressed the same set of questions through a large, quasi experiment and a smaller randomized study. The randomized study produced results generally similar to but stronger than those of the quasi experiment.

One must, however, study the program and not the experiment. Ideally, randomized experiments should be combined with observational studies that focus on the natural setting(s) in which the program is intended to operate. In addition to checking the reactivity of the experimental design, such observations may elucidate the "whys" of observed effects (and, in the absence of them, the "why not").

The quantitative/qualitative distinction is the third distinction to consider. Furthermore, we distinguish between qualitative assessment and qualitative research. The former denotes the use of qualitative techniques, such as clinical judgments, to gather data; the latter is exemplified by such approaches as grounded theory and analytic induction. Qualitative research relies primarily on three data collection techniques: document review, in-depth observation, and interviewing. It should be noted that both qualitative assessment and research may occur in experimental designs. A qualitative approach can provide a rich description of cases, which can broaden our understanding of the situation and the setting and answer the "why" of program effect or lack of it. Such description may also educe theory and provide a basis for subsequent research.

We find a great deal of promise in combining both qualitative and quantitative types of studies in the evaluation of early childhood programs. One approach would be to do both and see if they tell the same story. Another approach uses qualitative data to enrich and support quantitative findings. Especially promising seems to be a reciprocal strategy in which qualitative insights are treated as a challenge for the development of quantitative measures, and statistical findings are used as guideposts for more intensified and differentiated qualitative analysis.

At an entirely different level, that of the administering or funding agency, multiple approaches may also be useful in constructing an overall evaluation strategy.
Many of the best-known evaluations have been large, multisite studies. Alternatives are possible, however, even when the agency's intent is to understand a large-scale program. Small studies often permit greater experimental rigor than large ones, and they avert the risk of catastrophic failure. Although each study yields only a partial picture, collectively they may permit a gradual accumulation of knowledge about the program as a whole. This cumulative approach is especially likely to be effective when the evaluation program as a whole is specifically designed to permit integration of findings--rather than relying on after-the-fact integration in the manner of traditional literature reviews.

While this brief discussion of issues is not exhaustive, it does suggest the wide variety of approaches available. The choice of methods is, however, far from arbitrary. That choice should be linked to the questions to be answered, the state of knowledge, and the real constraints under which the research will be enacted. There are some questions, such as those addressing the issues of access and equity, that do not lend themselves to, nor are illuminated by, manipulation, and so are best addressed through observational studies. Again, the matching of design alternatives to the problem at hand is critical.

IMPLICATIONS FOR THE EVALUATION PROCESS

Some of our suggestions about design and measurement have indirect implications for the way in which applied research is organized and conducted, for the way in which its results may be presented most effectively, and even for the relationship between applied research and basic social science.

Involving Multiple Constituencies in Selecting Outcome Measures

Given that demonstration programs affect many constituencies that have a stake or a say in the program's future, ways must be found to involve these groups or at least take account of their concerns in selecting outcome measures. Actual involvement is preferable, because it creates a commitment to the evaluation process, which may not otherwise be present on the part of some constituent groups, even if the outcome measures used in an evaluation are relevant to their concerns. To say that constituents should somehow be involved in identifying salient concerns or potential program outcomes of course does not mean that the outcomes can or should be selected on the basis of a survey. Constituencies differ in the salience that they accord to different outcomes.
In some cases, outcomes valued by different constituencies may conflict. For example, when parents of handicapped children exercise their rights to change their children's educational placement, there is no guarantee that the educational experiences of the child will in fact be improved, either by the lengthy process of appeals that may be involved or by the ultimate outcome. In such a situation, legitimate values compete: Is it more important for parents to have such rights or for children to have steady, uninterrupted, and relaxed educational experiences? Such conflicts create delicate situations in which evaluators, sponsors of evaluations, practitioners, and clients must negotiate the choice and weighting of outcomes. Our point is that the scope of an evaluation, the breadth of the audience for which it provides at least some relevant information, and the likelihood that its findings will be put to use will all be enhanced if the perspectives of the various constituencies are considered.

Communicating with Multiple Audiences

We have argued consistently that if evaluation is to accomplish its goal of helping to improve programs and shape policies, it must be attuned to practical issues, not only to the interests of discipline-based researchers and methodologists. Beyond this first and most important step, evaluators can, by virtue of the way in which they present their work, take further measures to ensure the dissemination and utilization of their results.

Basic researchers are usually trained to speak only to other researchers. Buttressed with statistics and hedged with caveats, their reports typically have a logic and an organization aimed at persuading professional peers of the accuracy of carefully delimited empirical claims. However, applied researchers must address many audiences who make very different uses of their findings. Policy makers, government program managers, advocacy groups, practitioners, and parents are among their many audiences. Each group has its own concerns and requires a special form of communication.
However, all these groups have some common needs and aims, quite different from those of the research audience. They all want information to guide action, rather than information for its own sake. They have limited interest and sophistication with respect to research methods and statistics.

This situation poses practical and ethical problems for the evaluator. The practical problem is simply that of finding ways to communicate findings clearly, with a minimum of jargon and technical detail. One strategy that has proved effective in this regard is organizing presentations around the questions of concern to nontechnical audiences, rather than around the researcher's data-collection procedures and analyses. Adoption of this strategy of course presumes that the research itself has been designed at least in part to answer the questions of policy makers and practitioners. In addition, the impact of a report, however well written, can be enhanced by adroit management of other aspects of the dissemination process--public presentations, informal discussions with members of the intended audience, and the like--which can help create a climate of realistic advance expectations and appropriate after-the-fact interpretation.

The ethical problem is that of drawing the line between necessary qualification and unnecessary detail. One can always write a report with a clear message by ignoring inconsistent data and problematic analyses. The difficulty is to maintain scientific integrity without burying the message in methodological complexities and caveats. There is no general formula for solving this problem, any more than there is a formula for writing accurately and forcefully. It is important, however, that the problem be recognized--that researchers do not allow themselves to fall back on comfortable obscurantism or to strain for publicity and effect at the price of scientific honesty.

Building in Familiarity and Flexibility

The considerations about design and measurement discussed above have practical implications for the way in which applied research is conducted.
One implication is that both researchers and the people who manage applied research--particularly government project officers and perhaps even program officers in foundations--need to develop intimate familiarity with the operations of service programs as well as basic understanding of the policy context surrounding those programs. Technical virtuosity and substantive excellence in an academic discipline do not alone make an effective evaluator. Over and above these kinds of knowledge, a practical, experiential awareness of program realities and policy concerns is essential if evaluation is to deal with those realities and to address those concerns. When third-party evaluations are conducted by organizations other than the service program or its funding agency, a preliminary period of familiarization may be needed by the outside evaluator. Moreover, that individual or organization should remain in close enough touch with the service program throughout the evaluation to respond to changes in focus, clientele, or program practices.

A second, related implication is that the evaluation process must be flexible enough to accommodate the evolution of programs and the researcher's understanding. Premature commitment to a particular design or set of measures may leave an evaluation with insufficient resources to respond to important changes, ultimately resulting in a report that speaks only to a program's past and not to its future. Such a report fails disastrously in meeting what we see as the primary responsibility of the evaluator, namely to teach the public and the policy maker whatever there is to learn from the program's experience.

There is danger, too, in the evaluator's being familiar with programs and flexible in responding to program changes as we have advocated. Too much intimacy with a program can erode an evaluator's intellectual independence, which is often threatened in any case by his or her financial dependence on the agency sponsoring the program in question. (Most evaluations are funded and monitored by federal mission agencies or private sponsors that also operate demonstration programs themselves.) We see no easy solution to this serious dilemma, but at the same time we can point to mechanisms that limit any distortions introduced by too close a relationship between evaluator and program. Most important among them are the canons of science, which require that the evaluator collect, analyze, and present data in a way that opens the conclusions to scrutiny. The political process can also act as a corrective force, in that it exposes the evaluator's conclusions to criticism from many value perspectives.
Finally, as some researchers have urged, it may sometimes be feasible to deal with advocacy in evaluation by establishing concurrent evaluations of the same program, perhaps funded by separate agencies, but in any case deliberately designed to reflect divergent values and presuppositions.

This report does not discuss in detail the institutional arrangements that might lead to more effective program evaluations nor does it examine current arrangements critically. Such an examination would be a major report in itself. Relevant reports have been written under the aegis of the National Research Council, e.g., Raizen and Rossi (1981). However, we observe that many major evaluations are funded by the federal government through contracts with universities or private research organizations. The contracting process is rather tightly controlled. Subject to the approval of the funding agency, the contractor is typically required to choose designs, variables, and measures early in the course of the study, then stick to them. It is rare that contractors are given adequate time to assimilate preliminary information or to develop and pretest study designs and methods. Sometimes the overall evaluation process is segmented into separate contracts for design, data collection, statistical analysis, and policy analysis.

It is perfectly understandable that the government is reluctant to give universities or contract research organizations carte blanche, especially in large evaluations, which may cost millions of dollars. Even the fragmentation of evaluation efforts may be partially justifiable, on the grounds that it allows the government to purchase the services of organizations with complementary, specialized expertise. Whatever the merits of these policies, it seems clear that in some respects the contracting process is at odds with the needs we have identified for gradual accretion of practical understanding and for flexibility in adapting designs and measures to changes in programs.

Drawing on and Contributing to Basic Social Science

In some respects, evaluation stands in the same relationship to traditional social science disciplines as do engineering, medicine, and other applied fields to the physical and biological sciences. Evaluation draws on the theories, findings, and methods of anthropology, economics, history, political science, psychology, sociology, statistics, and kindred basic research fields.
At the same time, evaluation "technology" can also contribute to basic knowledge. The approach to the evaluation of children's programs set forth in this report has implications both for the kinds of basic social science that are likely to give rise to the most useful applications and for the kinds of contributions that evaluation can make to fundamental research.

Traditionally, evaluation has borrowed most heavily from basic research fields that emphasize formal designs and quantitative analytic techniques--statistics, economics, experimental psychology, survey research in sociology, and political science. The approach to evaluation we suggest implies that quantitative techniques can usefully be supplemented--not supplanted--by ethnographic, historical, and clinical techniques. These qualitative approaches are well suited to formulating hypotheses about orderly patterns underlying complex, multidetermined, constantly changing phenomena, although not to rigorous establishment of causal chains. There is nothing scientific about adherence to forms and techniques that have proved their usefulness elsewhere but fail to fit the phenomena at hand. Science instead adapts and develops techniques to fit natural and social phenomena. When a field is at an early stage of development, available techniques are likely to have severe limitations. But the use of all the techniques available, with candid admission of their limitations, is preferable to Procrustean distortion of phenomena to fit preferred methods in pursuit of spurious rigor.

Our proposed approach also suggests that global, systemic approaches to theory, of which the ecological approach to human development is an example, are potentially useful. Ad hoc empirical "theories" that specify relationships among small numbers of variables, whatever their merits in terms of clarity and precision, simply omit too much. Theories that explicate relationships among variables describing individual growth, family dynamics, and ties between families and other institutions have greater heuristic value, even if they are too ambitious to be precise at this early stage in their development.

It should be clear that we favor precision, rigor, and quantitative techniques.
Each has its place, even given the present state of the evaluation art, and that place is likely to become larger and more secure as the art advances. We argue, however, that description and qualitative understanding of social programs are in themselves worthwhile aims of evaluation and are essential to the development of useful formal approaches.

We have indicated some of the directions in which we think evaluation technology is likely to lead social science. Because understanding social programs requires a judicious fusion of qualitative and quantitative methods, evaluation may stimulate new methodological work articulating the two approaches. We may, for example, learn better ways to bring together clinical and experimental studies of individual children or ethnographic and survey-based studies of the family. Because understanding programs requires an appreciation of interlocking social systems, evaluation may contribute to the expansion and refinement of ecological, systemic theories. Thinking about children's programs may lead to a deeper understanding of the ways in which individual development is shaped by social systems of which the child is a part. Finally, because programs are complex phenomena that cannot be fully comprehended within the intellectual boundaries of a single discipline, evaluation may open up fruitful areas of interdisciplinary cooperation.

We are well aware that science often proceeds analytically rather than holistically; for example, it is useful for some purposes to isolate the circulatory system as an object of study, even though it is intimately linked to many other bodily systems. Nevertheless it is also useful now and then to examine interrelationships among previously defined systems to see if new insights and new areas of study--new systems--emerge. It is our hope that evaluation research can play this role vis-a-vis the social sciences. By focusing on concrete, real-world phenomena that do not fit neatly into existing theoretical or methodological boxes, evaluation may stimulate the development of both theory and method.

REFERENCES

Ainsworth, M. D. S., and Wittig, B. A. (1969) Attachment and exploratory behavior of one-year-olds in a strange situation. In B. M. Foss, ed., Determinants of Infant Behavior, Volume 4. London: Methuen.
Anderson, S., and Messick, S. (1974) Social competency in young children. Developmental Psychology 10:282-293.
Belsky, J. (1980) Child maltreatment: an ecological integration. American Psychologist 35(4):320-335.
Belsky, J., and Steinberg, L. D. (1978) The effects of day care: a critical review. Child Development 49:929-949.

Boruch, R. F., and Cordray, D. S. (1980) An Appraisal of Educational Program Evaluations: Federal, State and Local Agencies. Report prepared for the U.S. Department of Education, Contract No. 300-79-0467. Northwestern University (June 30).
Brim, O. G. (1959) Education for Child Rearing. New York: Russell Sage Foundation.
Bronfenbrenner, U. (1974) A Report on Longitudinal Evaluations of Preschool Programs. Vol. II: Is Early Intervention Effective? U.S. Department of Health, Education, and Welfare, Publication No. OHD 75-25. Washington, D.C.: U.S. Department of Health, Education, and Welfare.
Bronfenbrenner, U. (1979) The Ecology of Human Development. Cambridge, Mass.: Harvard University Press.
Bureau of Education for the Handicapped (1979) Progress Toward a Free, Appropriate Public Education. A Report to Congress on the Implementation of Public Law 94-142: The Education for All Handicapped Children Act. HEW Publication No. (OE) 79-05003. Washington, D.C.: U.S. Department of Health, Education, and Welfare.
Connell, D. C., and Carew, J. V. (1980) Infant Activities in Low-Income Homes: Impact of Family-Focused Intervention. International Conference on Infant Studies, New Haven, Conn. (April).
Datta, L. E. (1979) Another spring and other hopes: some findings from National Evaluations of Project Head Start. In E. Zigler and J. Valentine, eds., Project Head Start: A Legacy of the War on Poverty. New York: Free Press.
Farran, D., and Ramey, C. (1980) Social class differences in dyadic involvement during infancy. Child Development 51:254-257.
General Accounting Office (1979) Early Childhood and Family Development Programs Improve the Quality of Life for Low-Income Families. Report to the Congress by the Comptroller General. HR-79-40 (February).

Goodson, B. D., and Hess, R. D. (1978) The effects of parent training programs on child performance and parent behavior. In B. Brown, ed., Found: Long-Term Gains From Early Education. Boulder, Colo.: Westview Press.
Goodwin, W. L., and Driscoll, L. A. (1980) Handbook for Measurement and Evaluation in Early Childhood Education. San Francisco, Calif.: Jossey-Bass, Inc., Publishers.
Horowitz, F. D., and Paden, L. Y. (1973) The effectiveness of environmental programs. In B. Caldwell and H. D. Riccioti, eds., Review of Child Development Research. Vol. 3: Child Development and Social Policy. Chicago, Ill.: University of Chicago Press.
Johnson, O. G. (1976) Tests and Measurements in Child Development: Handbook II. Vols. 1 and 2. San Francisco, Calif.: Jossey-Bass, Inc., Publishers.
Johnson, O. G., and Bommarito, J. W. (1971) Tests and Measurements in Child Development: A Handbook. San Francisco, Calif.: Jossey-Bass, Inc., Publishers.
Kirschner Associates, Albuquerque, N.M. (1970) A National Survey of the Impacts of Head Start Centers on Community Institutions. (ED045195) Washington, D.C.: Office of Economic Opportunity.
Lazar, I., and Darlington, R. B. (1978) Lasting Effects After Preschool. A report of the Consortium for Longitudinal Studies. U.S. Department of Health, Education, and Welfare, Office of Human Development Services, Administration for Children, Youth, and Families.
Lindsey, W. E. (1976) Instrumentation of OCD Research Projects on the Family. Mimeographed report prepared under contract HEW-105-76-1120, U.S. Department of Health, Education, and Welfare. Social Research Group, The George Washington University, Washington, D.C.
Love, J. M., Nauta, M. J., Coelen, C. G., and Ruopp, R. R. (1975) Home Start Evaluation Study: Executive Summary--Findings and Recommendations. Ypsilanti, Mich., and Cambridge, Mass.: High/Scope Educational Research Foundation and Abt Associates, Inc.

Raizen, S. A., and Rossi, P. H., eds. (1981) Program Evaluation in Education: When? How? To What Ends? Committee on Program Evaluation in Education, Assembly of Behavioral and Social Sciences, National Research Council. Washington, D.C.: National Academy Press.
Ramey, C., and Mills, J. (1975) Mother-Infant Interaction Patterns as a Function of Rearing Conditions. Paper presented at the biennial meeting of the Society for Research in Child Development, Denver, Colo. (April).
Rossi, P. H., Freeman, H. E., and Wright, S. R. (1979) Evaluation: A Systematic Approach. Beverly Hills, Calif.: Sage Publications.
Ruopp, R., Travers, J., Coelen, C., and Glantz, F. (1979) Children at the Center. Final report of the National Day Care Study, Volume I. Cambridge, Mass.: Abt Books.
Smith, M. S., and Bissell, J. S. (1970) Report analysis: the impact of Head Start. Harvard Educational Review 40:51-104.
Sroufe, L. A. (1979) The coherence of individual development: early care, attachment and subsequent developmental issues. American Psychologist 34:834-841.
Stallings, J. (1975) Implementation and child effects of teaching practices in Follow Through classrooms. Monographs of the Society for Research in Child Development 40(7-8), Serial No. 163.
Stebbins, L. B., et al. (1977) Education as Experimentation: A Planned Variation Model. Vol. IV. Cambridge, Mass.: Abt Associates, Inc. Also issued by the U.S. Office of Education as National Evaluation: Patterns of Effects. Vol. II of the Follow Through Planned Variation Series.
Suchman, E. A. (1967) Evaluation Research: Principles and Practice in Public Service and Social Action Programs. New York: Russell Sage Foundation.
Walker, D. K. (1973) Socioemotional Measures for Preschool and Kindergarten Children. San Francisco, Calif.: Jossey-Bass, Inc., Publishers.

Weber, C. U., Foster, P. S., and Weikart, D. P. (1977) An economic analysis of the Ypsilanti Perry Preschool Project. Monographs of the High/Scope Educational Research Foundation. Series No. 5.
Weiss, C. H. (1972) Evaluating Action Programs: Readings in Social Action and Education. Boston, Mass.: Allyn & Bacon, Inc.
Westinghouse Learning Corporation and Ohio University (1969) The Impact of Head Start: An Evaluation of the Effects of Head Start on Children's Cognitive and Affective Development. Executive Summary. Report to the Office of Economic Opportunity (ED036321). Washington, D.C.: Clearinghouse for Federal Scientific and Technical Information.
Zigler, E., and Trickett, P. (1978) IQ, social competence and evaluation of early childhood intervention programs. American Psychologist 33:789-798.
Zigler, E., and Valentine, J., eds. (1979) Project Head Start: A Legacy of the War on Poverty. New York: The Free Press.


Learning from Experience: Evaluating Early Childhood Demonstration Programs (1982)
