

The National Academies | 500 Fifth St. N.W. | Washington, D.C. 20001
Copyright © National Academy of Sciences. All rights reserved.



Below are the first 10 and last 10 pages of uncorrected machine-read text (when available) of this chapter, followed by the top 30 algorithmically extracted key phrases from the chapter as a whole.
Intended to provide our own search engines and external engines with highly rich, chapter-representative searchable text on the opening pages of each chapter. Because it is UNCORRECTED material, please consider the following text as a useful but insufficient proxy for the authoritative book pages.

Do not use for reproduction, copying, pasting, or reading; exclusively for search engines.

OCR for page 231
APPENDIX

Implications of the Youth Employment Experience for Improving Applied Research and Evaluation Policy

Robert Boruch

Try all things; hold fast that which is good.
I Thessalonians 5:21

INTRODUCTION

Determining what is good is no easy matter. The purpose of this appendix is to capitalize on hard experience in making that judgment in one arena--employment and training programs supported by the federal government. The program evaluations reviewed by the Committee on Youth Employment Programs have a variety of implications for evaluation policy. The aim of this appendix is to educe some of those implications. The committee also relied on earlier reviews of social and educational program evaluation generally (Riecken et al., 1974; Rivlin, 1971; Raizen and Rossi, 1981; the U.S. General Accounting Office, 1978; and others).

The discussion is concerned with obtaining better evidence with which to answer fundamental questions about youth employment and training programs: Who needs the services? How well are services delivered? What are the relative effects of the service? What are the relative benefits and costs of alternative services? The implications are grouped into two simpler categories:

• Improving the design of outcome evaluations
• Improving reporting

Each implication is followed by a brief description of evidence and rationale.

Robert F. Boruch is professor of psychology, Center for Probability and Statistics, and director, Division of Methodology and Evaluation Research, Northwestern University.

IMPROVING THE DESIGN OF OUTCOME EVALUATIONS

Once said, it is obvious that quality in the design of an outcome evaluation affects the quality of the data and of conclusions. Poor designs can make programs look worse than they are, or better than they are, or yield uninterpretable evidence.

Quality and Evaluation Policy in General

Quality in evaluation design ought to be recognized and ought to be explicit in an agency's evaluation policy and in congressional oversight policy. Special efforts need to be made to improve the quality of research and evaluation designs for estimating the impact of youth employment projects. Existing professional guidelines can be used to influence the quality of design and the quality of reporting.

The theme of quality has been explicit in the Department of Labor's Knowledge Development Plan, insofar as the plan yoked the introduction of new programs to good evaluation design. That is, the plan recognized the legitimacy of the idea that good impact evaluations can only be done if conditions are controlled and evaluation is planned and begun at the start of a program. This theme appears also in the U.S. General Accounting Office's (1978) attention to competing explanations that characterize the results of poor research designs, to the elements of reasonable design, and to the need for designing the evaluation before a new program is put into the field. The theme has been recognized by the courts in cases that recognize the shortcomings in some evaluation designs and the benefits of others, e.g., copayments in health insurance. Injunctions have been issued against poor designs, for example, and challenges to good designs have been defeated (see Breger, 1983, for specific cases).

Despite this, the quality of evaluations of youth employment and training programs is still not sufficiently high. Fewer than 30 percent of the reports examined by this committee, for example, were of high enough quality to be reviewed seriously.
Projects rejected for serious consideration by this committee were flawed by the lack of sensible comparison groups, unreliable measures of program outcome, vague objectives, and other shortcomings. Our acceptance rate is low, but it still represents progress. Rossi's (1969) review of 200 evaluations issued by the Office of Economic Opportunity before 1968, for example, uncovered no randomized field experiments and only about 25 reports with credible evidence.

Professional and institutional guidelines for improving the quality of evaluation designs are readily available. Section A of the bibliography lists guidelines that pertain to evaluation design and reporting in health and health services, education and training, welfare, and other areas. The references include applications of standards and assessments of their common features and usefulness. It would not be unreasonable to adopt variations on these in evaluation policy.

Randomized Field Experiments

Randomized field experiments should be explicitly authorized in law and encouraged in evaluation policy as a device for estimating the effects of new projects, program variations, and program components.

Randomized field tests of new programs, program components, or program variations are a scientifically credible device for obtaining interpretable evidence about a program's effects. But they are demanding with respect to the requirement that individuals or schools or other organizational units must be assigned randomly to program variations or to program versus control conditions.

The usefulness of randomized tests, in principle, is generally not at issue in professional discussions about estimating the impact of programs. That is, there is substantial agreement that when experiments are conducted properly, estimates of effects will be unbiased. The conditions under which experimental design can be or should be employed are more debatable, however (e.g., Cronbach and others, 1980; Boruch, 1975). How precedent, pilot tests, ethics, and law constrain or enhance feasibility is considered briefly below.

Precedent

Some opponents of controlled randomized tests maintain that randomized experiments are rarely feasible in field settings. The references in Section B of the bibliography constitute evidence against the claim: the reports listed cover some recent, high-quality experiments. Randomized field tests have been undertaken, for example, to get at the effects of new law enforcement procedures (Sherman and Berk, 1984, 1985) and innovative methods for improving court efficiency (Partridge and Lind, 1983). They have been used to assess the relative benefits and costs of special health services delivery methods and practices, e.g., day care for the chronically ill and medical information systems.
They have produced good evidence on the effects of diversion projects for delinquent youths in California, telephone conferencing in administrative law hearings in New Mexico, post-prison subsidy programs in Texas and Georgia, and nutrition education projects in Nebraska.

Not all attempts to use randomized field tests succeed, of course. The procedure may be, and indeed has been, corrupted in medical experiments, e.g., early tests of the effects of enriched oxygen environments on retrolental fibroplasia (Silverman, 1977). And they have failed at times in formal efforts to evaluate court procedures, educational innovations, and other social programs (e.g., Conner, 1977). Judging from precedent, however, it is not impossible to assign individuals or other units randomly to programs for the sake of fair estimates of program effects. Good randomized experiments have indeed been mounted. The reasons for successes and failures need to be studied.
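
The random assignment on which such field tests depend can be sketched in a few lines. This is only an illustration under stated assumptions, not a procedure taken from any of the reports cited: the function name `assign_randomly`, the applicant labels, and the seed are hypothetical, and a real field experiment would need safeguards against the corruption of randomization noted above.

```python
import random


def assign_randomly(unit_ids, conditions=("program", "control"), seed=0):
    """Shuffle the units, then deal them out to the conditions in rotation.

    Hypothetical helper for illustration; a fixed seed makes the
    assignment reproducible so that it can be audited afterward.
    """
    rng = random.Random(seed)
    shuffled = list(unit_ids)
    rng.shuffle(shuffled)
    assignment = {}
    for position, unit in enumerate(shuffled):
        assignment[unit] = conditions[position % len(conditions)]
    return assignment


# Twenty hypothetical eligible applicants, two conditions.
applicants = [f"applicant-{i:03d}" for i in range(1, 21)]
assignment = assign_randomly(applicants, seed=1984)

counts = {
    condition: list(assignment.values()).count(condition)
    for condition in ("program", "control")
}
print(counts)  # {'program': 10, 'control': 10}
```

Dealing shuffled units out in rotation keeps the group sizes as equal as possible; passing three or more condition names would cover assignment to program variations rather than to a single program versus control.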

Pilot Tests of Randomized Experiments

Precedent is persuasive in the crudest sense: It implies that what has been done might be done again. Still, experience with a randomized trial in one setting may be irrelevant in others. For this reason, pilot tests of large-scale field experiments are worth considering. That is, small experiments prior to the main field experiment can provide evidence on feasibility that is more direct than what precedent can offer, can identify problems that otherwise could not be anticipated, and can help to resolve predictable problems before the main effort.

That there can be major problems in mounting randomized tests is clear from the Youth Employment Program experience (see Section C of the Bibliography). For instance, difficulties were encountered in the Tallmadge and Yuen evaluations of the Career Intern Program and the evaluation of the Career Advancement Voucher Demonstration (CAVD) program for low-income college students (Clark, Phipps, Clark & Harris, Inc., 1981; hereafter CPC&H). Some 30 randomized tests of Head Start programs were initiated in the late 1970s, despite counsel for pilot work: fewer than 10 succeeded. Randomized tests have also been unsuccessfully implemented in medicine, law enforcement, and other areas, because the randomization was corrupted. That attempts to run good randomized trials in the youth employment sector were imperfect, or that other attempts have failed miserably, should not be unexpected. Imperfection and failure are our lot, just as improvement is.

The pilot test strategy has helped to ensure the quality of field experiments on telephone conferencing in the administrative court system. For instance, city tests served as a pilot for a statewide experiment in administrative appeals in New Mexico (Corsi and Hurley, 1979). The strategy has also been used to enhance quality in the youth employment program research.
The Supported Work Experiments in five cities were preceded by a pilot effort in one city by the Vera Institute, and it did have a bearing on the quality of those experiments. The approach seems sensible in view of these efforts, the failed efforts, and experience from other areas. Hahn's (1984) advice in the industrial commercial sector is similar, for similar reasons.

Ethics

Where there is an oversupply of eligible recipients for scarce program services, randomized assignment of candidates for the resource seems fair. Vancouver's Crisis Intervention Program for youthful offenders, for instance, offered equal opportunity to eligible recipients. Since all participants could not be accommodated well with available program resources, but were all equally eligible, they were randomly assigned to program or control conditions. More generally, randomized experiments are most likely to be regarded as ethical when the services are in short supply, their effectiveness is not clear, and someone is interested in effectiveness.

This rationale dovetails neatly with some managerial constraints. That is, despite the aspirations of program advocates, new programs cannot be emplaced all at once, but must be introduced in stages, e.g., services are delayed for some. The argument for the ethicality of random assignment to scarce resources is not especially pertinent when the manager can simply spread resources more thinly, e.g., by expanding the size of classes dedicated to special instruction in tests of training projects.

Law

Randomized field tests have received attention only recently from the courts and from constitutional scholars. The attention, however, has been thorough and productive. Pertinent court decisions, for example, include Aguayo v. Richardson and California Welfare Rights Organizations v. Richardson. These cases challenged the use of randomized experiments in assessing welfare programs, but the challenges were dismissed by the court. Legal analyses of such cases are given by Breger (1983) and Teitelbaum (1983); Bermant et al. (1978), Federal Judicial Center (1983), and Rivlin and Timpane (1975) give more general treatments. Statutes that recognize the legitimacy of randomized experiments are scarce, however, and that is one reason for recommending explicit reference to randomized experiments in law.

The People Targeted for Services: Characteristics, Access, and Number

Surveys prior to mounting a field test are essential to ensure that people targeted for service or action are (a) identifiable, (b) trainable in the experimental regimen and (c) sufficient in number to warrant the investment in a controlled randomized experiment.

Program managers sometimes promise to randomize because they presume the target population is large enough to permit random assignment of individuals to "program" versus "control" conditions or to program variations.
The presumption has been wrong at times in medical research, e.g., experiments in day care for the chronically ill during the late 1970s. It has also been wrong in educational research, notably in attempts to do randomized field experiments on Head Start preschool programs and in planned variations. And it has been wrong in manpower training programs prior to 1966, to judge from Rossi's (1969) description of the National Opinion Research Center's (NORC's) failure to recruit enough clients for experimental tests of an employment training program. If there are too few individuals in need of the service and who are accessible and willing to participate, one will be unable to execute an experiment well.

The reasons for error in the presumption include ignorance: It is often very hard to estimate the number of those in need of special services, harder to identify them, and at times harder still to understand how to train them in a program. They include greed, of course.

The funds made available for a special program and for an experiment produce inflated counts of those in need.

Regardless of the reasons for the error, the matter is important if we expect to have decent tests implemented. What do the youth employment experiments tell us about this?

For the Manpower Demonstration Research Corporation (MDRC), there seem to have been no remarkable problems in identifying and enrolling members of various target groups for the Supported Work Program. Still, MDRC says it may not have focused sufficiently well on the right target group in its explanation of why effects on youths fail to be substantial. Nor is there any reference to shortfall in the reports by the Vera Institute on the Alternative Youth Employment Strategies Project. The report on the Opportunities Industrialization Center (OIC) project (O'Malley et al., 1981) is ambiguous on this account.

In the CAVD program, on the other hand, "All CETA prime sponsors were to recruit a pool of at least 200 youths between the ages of sixteen and twenty-one who met YETP eligibility requirements and who desired and were available for full time work" (CPC&H, 1980:15). Some 150 to 170 were eventually assigned to alternative treatments. The target sample size was partly a function of local screening criteria. Recruitment and assignment difficulty is discussed in the report (CPC&H, 1980). Difficulty in recruitment was encountered in four of five sites. In three sites, the difficulty seems serious, said to be caused by internal organizational problems (e.g., a move to a different building) or interinstitutional problems (e.g., one agency doing the screening, another the program implementation).

Similarly, the Project STEADY evaluation reported that "sufficient numbers of youth were difficult to recruit" and that start-up time for the program was brief according to site directors (Grandy, 1981).
The SPICY project for Indochinese youths was targeted for 120 youths per site, but obtained only 70 to 80 individuals. The Tallmadge and Yuen report on Career Education programs suggests that only three rather than four cohorts (with a projected 75 per cohort) were enrolled at each of the four sites. Further, the first two of the three cohorts contained fewer than the 75 members that were forecast (an extension of the period led to complete cohorts). Hahn and Friedman's evaluation of the Cambridge Job Factory for out-of-school youths encountered problems in recruitment because there were related summer programs in the same area. Their work in Wilkes-Barre suggests the enrollment problem was severe in that area (53 percent of target reached) and that it affected both the Youth Employment Service Program and the Employer-Voucher Program. The problem was attributed to competing CETA programs. A few of the experiments at hand also tell us something about the tractability of problems in the target population. The Supported Work Program run by MDRC, for instance, suggests that women receiving Aid to Families with Dependent Children profit more than young people do from the services provided. It is not clear that "tractability" can be assessed well in prior surveys of a targeted group. The experience does suggest that it is important to separate ostensibly different subgroups in the experiment and to establish that their members can be identified well in prior surveys.

Sensitivity of Field Experiments

Statistical power analyses and reporting on the analyses are important and ought to be undertaken routinely. This is elemental quality assurance for any evaluation policy.

By power analysis here is meant formal calculation of the probability of finding an effect, if there indeed is an effect, despite a noisy context. Critical reviews of field experiments in health services, education, and other areas stress that (a) the effects of projects will usually be small and (b) sample sizes are often too small to detect those effects. That is, differences between program and control groups are likely to go undetected.

Three of the experiments reviewed by this committee had samples large enough to justify the expectation that an effect would be detected if indeed youths were influenced by the regimen. The Vera Institute's study, for example, involved 600 to 800 individuals per site, with about 300 in a control group and 100 in each of three program variation groups. OIC had about 1,500 participants and 700 controls distributed across seven sites. MDRC's sample size exceeded 1,200, over five sites.

Other experiments listed in Section C of the Bibliography involve far smaller samples, however. And so it is difficult to understand how a small project effect could be detected. The CAVD project, for example, involved assignment of fewer than 30 individuals per group in each site. Attrition led to even fewer, e.g., 4 individuals in a site in one analysis of dropouts. And so it is no surprise that differences among groups are often insignificant. The Indochinese SPICY project had 70 to 80 individuals per site in three sites and analyzed the sites separately. (They did detect effects.) Tallmadge-Yuen's Career Intern report suggests that there were, at most, 75 subjects per cohort per site; there is no reference to a power analysis in the final report, and results are mixed.
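
The kind of power calculation described above can be sketched with the Python standard library. This is a rough illustration, not the method of any report cited: it uses a normal approximation for a two-sided comparison of two group means, and the function name `approx_power`, the standardized effect size of 0.2, and the 0.05 significance level are assumptions chosen for the example. Only the group sizes echo figures mentioned in the text (about 30 per group, and about 100 versus 300).

```python
from math import sqrt
from statistics import NormalDist


def approx_power(effect_size, n_program, n_control, alpha=0.05):
    """Approximate power of a two-sided test comparing two group means.

    effect_size is the difference in means divided by the common
    standard deviation (an assumed, hypothetical quantity). The normal
    approximation is used, so results for very small groups are rough.
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    # Expected value of the test statistic under the assumed effect.
    shift = effect_size * sqrt(n_program * n_control / (n_program + n_control))
    # Probability of landing in either rejection region.
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)


# Hypothetical small effect, with group sizes like those mentioned above.
power_small_site = approx_power(0.2, n_program=30, n_control=30)
power_larger_site = approx_power(0.2, n_program=100, n_control=300)

print(f"30 vs 30:   power is about {power_small_site:.2f}")
print(f"100 vs 300: power is about {power_larger_site:.2f}")
```

Under these assumptions both designs fall well short of the conventional 0.8 target for a small effect, and the design with 30 per group falls much further short, which is consistent with the point that small projects are unlikely to detect small effects.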
Measures of Program Implementation

More orderly, verifiable information on the degree of program implementation needs to be collected. Better, less-expensive methods for obtaining and reporting such information also need to be developed. And basic research needs to be conducted to link implementation data with impact data.

No outcome evaluation should exclude measurement of the level of program implementation. Such data are as essential in social program evaluation as measurements of dosage level and compliance are in evaluating new drugs and therapy.

At its crudest, measuring implementation may focus on structural features of the program's construction. This includes establishing the time frame required for actualizing major parts of program plans. Information about the time it takes for a new program to become stabilized, for instance, is often sketchy even for large-scale programs. Measurement should include crude observation on staff, material, and resources, and on the recipients of services and their eligibility.

See Rezmovic (1982), for instance, on tests of programs for former drug addicts.

It is also reasonable to expect such measurement efforts to document the way implementation is degraded. The Tallmadge and Yuen (1981:4) report, for example, stresses staffing problems at all four experimental sites, problems that are said to be attributable to "extremely compressed time schedules and bad timing associated with start up operations." Related reports cover actual program composition and the qualitative features of client and program interaction.

It seems sensible also to establish what kinds of services are offered to control group members. For they, too, may participate in other programs that are implemented to some degree. This measurement seems especially important insofar as 20 to 40 percent of control group youths in a given site may in fact avail themselves of services from other sources.

The problems of assuring that treatments are delivered as advertised, of measuring the degree of implementation, and of understanding how to couple implementation data and experimental data are not confined to the youth employment arena, of course. Poorly planned and executed programs occur in the commercial sector, though information about this is sparse for obvious reasons (see Hahn, 1984). Despite good planning, meteorological experiments have been imperfect and admirably well documented (Braham, 1979). Drug trials and other randomized clinical trials in medicine must often accommodate departures from protocol and noncompliance (e.g., Silverman, 1977). And so on.

"Evaluability"

The extent to which projects and programs are "evaluable" should be routinely established before large-scale evaluation is undertaken.

Not all programs can be evaluated with the same level of certainty. New projects, for instance, often present better opportunities for obtaining interpretable estimates of project effects than ongoing ones.
It is also clear that limitations on resources and experience prevent even new programs from being evaluated well.

The need to anticipate how well one might be able to evaluate has generated interest in the "evaluability" of programs. The idea of formal evaluability assessment, proposed by Joseph Wholey and extended by Leonard Rutman and others, involves addressing specific questions about whether an evaluation can be designed to provide useful information in a particular setting, especially whether decisions or changes can be made on the basis of the information. It has been suggested that the approach be employed before demanding evaluation in every instance.

Evaluability assessment has received some support and pilot testing in Canada and the United States (Wholey, 1977; Rutman, 1980). The strategy is imperfect in that it asks one to predict success based on experience that may not exist. But it is useful in identifying the senses in which evaluation is possible and potentially helpful. It is a promising device for avoiding unnecessary effort.

More to the point, the idea is to learn how to avoid putting money into evaluations that cannot be done well or are likely to be useless. More generally, the approach can provide a framework for understanding how to achieve compromises between the desirability of special designs, such as a randomized experiment, and operational constraints of the program, and for understanding the kinds of evaluation that will be useful.

Testing Components and Variations

Testing the components of programs is warranted, especially when tests of full programs are not feasible or appropriate.

No theory of evaluation demands that the effects of an entire program be estimated. Few practitioners would regard the requirement as reasonable. Yet rhetoric and legislative mandates foster this view, distracting attention from the possibility of testing important components of programs or variations on them. For example, one may find that running high-quality tests of an entire training program is not possible. But estimating the effect of alternative sources of information, ways of presenting information, ways of enhancing use of information, and so on, may be possible in small, high-quality experiments.

The strategy has been exploited in a few youth employment program evaluations, in research which preceded development of Sesame Street, and in experiments on surgical and health innovation. Incorporated into evaluation policy, the idea broadens options. And in the event of a major evaluation's failure, it is a device for assuring that at least parts of the program can be assayed well.

Attrition

Far better methods need to be invented and tested to control attrition and to understand its effects on analysis. Resources need to be dedicated to the activity.

Individuals who voluntarily participate in any social program are also free to abandon the program.
Individuals who participate in a randomized control group or some alternative to which they have been randomly assigned--by answering questions about their work activity, for instance--are also free to withdraw. The loss of contact with individuals in either group is important insofar as it affects how easily and confidently one can interpret the results of an experiment. To put the matter simply, if contact is lost with individuals in both groups, there is no way to determine the project's impact on participants. If the program maintains good contact with the participants, for instance, but fails to track nonparticipants well, it may generate evidence that makes the program look damaging when in fact it is ineffectual, or that makes the program appear effective when in fact its impact is negligible or even negative.

When the attrition rate in the program group differs appreciably from the rate in the control group, making inferences is more complicated. Suppose, for instance, that all individuals who "attrit" in the control group found jobs. Analysis that fails to recognize this would probably produce inflated estimates of the program's effect.

Problems in attrition in field experiments were sufficiently critical to warrant the committee's rejecting over half of them for serious review. This does not always imply that the work was unsalvageable, merely that resources do not permit determining if the analytic problems engendered by attrition could be resolved.

CAVD's differential in rate of interviewing, for instance, is substantial, e.g., 80 percent for the participants and 50 percent for controls. There is no discussion of the potential problems. There is no attempt to accommodate them in the report at hand. The Vera Institute (Sadd et al., 1983:18), on the other hand, reported about equal rates of attrition, e.g., "premature dropout was substantial and depended on (program) model," but there are no details. Interview completion rates are given in the 60 to 84 percent range for the program group and in the 60 to 84 percent range for the control group.

The Tallmadge and Yuen study of the Career Intern Project is unusual in having tried to accommodate the possible biases due to differential attrition by matching individuals first and then keeping only full pairs for analysis. For all four sites, they got rates in the 50 to 70 percent range.

OIC does no formal analysis of the effect of attrition. Their report does, however, give completion rates and rates at which comparison groups have access to programs. They ignore these differences in their analysis or discard controls who did become involved in programs. Nothing is said about cooperation 3 and 8 months after program completion though they give data on each point.
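
The supposition above, that everyone lost from the control group had found a job, can be worked through with a little arithmetic. All the numbers below are invented for illustration and do not come from any of the evaluations discussed; the function name `observed_rate` is likewise hypothetical. The example assumes a program with no true effect: 40 of 100 people are employed in each group.

```python
def observed_rate(n_total, n_employed, n_lost, n_lost_employed):
    """Employment rate among those still in contact at follow-up.

    n_lost is the number of people who could not be interviewed, and
    n_lost_employed is how many of those lost were in fact employed.
    """
    n_interviewed = n_total - n_lost
    employed_interviewed = n_employed - n_lost_employed
    return employed_interviewed / n_interviewed


# Program group: no attrition, 40 of 100 employed.
program_rate = observed_rate(100, 40, n_lost=0, n_lost_employed=0)

# Control group: same true rate, but 20 people are lost and
# all 20 of them had found jobs (the case supposed in the text).
control_rate = observed_rate(100, 40, n_lost=20, n_lost_employed=20)

apparent_effect = program_rate - control_rate
print(f"program: {program_rate:.2f}, control: {control_rate:.2f}")
print(f"apparent effect: {apparent_effect:+.2f} (true effect is zero)")
# program: 0.40, control: 0.25
# apparent effect: +0.15 (true effect is zero)
```

In this invented case a program that changes nothing appears to raise employment by 15 percentage points, purely because the employed controls dropped out of the follow-up.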
The Project STEADY report is conscientious in providing a table that illustrates the flow and attrition rate at various stages in the experiment (Educational Testing Service, 1981:Table 1, p. 26). It is a shame, however, that no serious statistical/analytic attention was brought to bear on this.

Coupling Experiments to Longitudinal and Panel Studies

Randomized experiments ought to be coupled routinely to longitudinal surveys and panel studies. The purposes of this "satellite" policy include calibration of nonrandomized experiments, more generalizable randomized experiments, and better methods for estimating program effects.

Longitudinal surveys based on well-designed probability samples are clearly useful, for science and policy, in understanding how individuals (or institutions) change over time. For example, they avoid the logical traps that cross-sectional studies invite, such as overlooking cohort effects, in economic, psychological, and other research.

Such studies are, however, often pressed to produce evidence that they cannot support. Of special concern here is evidence on the impact of social programs on groups that the longitudinal study happens to include. So, for example, the Continuous Longitudinal Manpower Survey (CLMS) has been justified and supported primarily on grounds that one ought to understand what happens to the human resources pool. Its secondary or tertiary justification is that it can help one understand the effect of special programs--in youth employment, training, and so on. Such justification may be useful for rhetorical and scientific purposes. But it is dysfunctional insofar as the claim is exaggerated. That is, longitudinal surveys are often not sufficient to permit confident estimation of the effect of programs designed to, say, affect earnings of individuals who happen to be members of the sample, crime rates of those people, and so on.

That the claims made for longitudinal surveys with respect to evaluation of programs can be misleading is clear empirically and analytically. The most dramatic recent empirical evidence stems from Fraker and Maynard's (1985) comparisons of program effects based on randomized experiments and effects based on nonrandomized data, notably the CLMS and the Current Population Survey. Earlier evidence in different arenas stems from the Salk polio vaccine trials, health services research, and others (see Boruch, 1975, for listing).

Randomized experiments, on the other hand, permit one to estimate the effects of projects with considerably more confidence. Indeed the committee report is emphatic on this account. A major shortcoming of experiments, one not shared by the large-scale longitudinal studies, is their limited generalizability. That is, a set of experiments might be feasible in only a half-dozen sites, sites that do not necessarily reflect national characteristics.
The implication is that one ought to invent and try out research policy that helps to couple the benefits of longitudinal studies, i.e., generalizability, with those of experiments, i.e., unbiased estimates of program effect.

This policy element is akin to science policy on satellite use. That is, the satellite, like the longitudinal study, requires enormous resources to emplace and maintain. It pays to capitalize on them. Further, the scientist who designs special-purpose studies can obtain access to part of the satellite to sustain his or her investigation. Just as the physicist then may use the satellite as a vehicle for limited, temporary investigation, the policy recommended here allows the researcher the option of using longitudinal infrastructure as a resource and as a vehicle for conducting prospective studies.

The policy element gets well beyond simple scientific traditions of "data sharing" (Committee on National Statistics, 1985). It is considerably more debatable and more important in principle. Access is likely to be feasible, for example, only for a few projects, perhaps only one every year or two, because of the sheer difficulty of coupling studies to an already complex longitudinal enterprise.

Such models are in fact small theories and they have the merit of being explicit and often ingenious. Their shortcomings lie in their parochialism: each is a notion developed by a mathematically oriented analyst, who is unlikely to have (a) come within sniffing distance of a real program, i.e., done empirical studies of the enrollment process, and (b) taken the trouble to exploit theory from disciplines outside economics.

For example, it takes no wit to recognize that enrollment processes involve information, supplied to and available to administrator and applicant, and decisions by each. The information is processed under constraints; the decisions are made under constraints. Why then do we not exploit, say, the theories and rules of cognitive processing that have been developed over the past 10 years by Herbert Simon, Kahneman and Tversky, and others to develop something more realistic and more coherent than the simplistic selection models that the ill-informed analyst might choose? The reason seems to lie more in disciplinary provincialism than in any inherent weaknesses in either kind of model.

The point is that far more integration of theory and practical program development and evaluation is warranted. Absent the integration, it is doubtful that one will learn much that is durable.

IMPROVING THE QUALITY AND INTERPRETABILITY OF EVALUATION REPORTS

The quality of evaluation reporting can be improved substantially by adhering to professional guidelines issued over the past five years.

Documentation on large-scale evaluations is generally much better than the information usually available on smaller, locally managed ones. Nonetheless, there are notable gaps in what is known about even the large ones. Information is not always presented in accord with reasonable reporting standards issued, for example, by the Evaluation Research Society (1982), the U.S. General Accounting Office (1970), and other organizations, and by individual experts, such as Mosteller et al. (1980) (in clinical trials section of the bibliography).

The weaknesses in reporting make it difficult to screen and summarize results for policy, as this committee has tried to do. And the weaknesses complicate efforts to develop quantitative syntheses of multiple experiments so we may judge how sizes of program effect vary. [See Light and Pillemer (1984) and Cordray and Orwin (1985) on synthesis and the problems that poor reporting engenders.]

Doubtless some good projects have been ignored because reporting is poor. Good projects are not exploited as much as they should be because information provided in reports is insufficient. So, for example, the best reports tell us what the attrition rate is from programs or from program versus control groups. But many reports do not. The best of the best educe the implications of attrition and how they have been taken into account to produce fair estimates of program effect. Most do not. The gaps make it very difficult to review the quality of evaluations and to adjust for quality of evaluation in gauging the success of multiple programs.
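
One way a report might educe the implications of attrition is to show how far the estimated effect could move under extreme assumptions about the youths who were lost to follow-up. The sketch below is an illustration with made-up numbers, not a method taken from any of the reports reviewed: it assumes a binary outcome (employed or not at follow-up), and the function name `attrition_bounds` and all counts are hypothetical. It computes the effect among youths actually followed up, then worst-case and best-case bounds obtained by assuming every lost youth had the least or the most favorable outcome.

```python
def attrition_bounds(assigned_t, followed_t, employed_t,
                     assigned_c, followed_c, employed_c):
    """Effect on employment rate, with extreme-case bounds for attrition.

    assigned_*: number originally assigned to program (t) or control (c)
    followed_*: number with follow-up data
    employed_*: number employed among those followed up
    """
    lost_t = assigned_t - followed_t
    lost_c = assigned_c - followed_c

    # Estimate using only the youths who were followed up.
    observed = employed_t / followed_t - employed_c / followed_c

    # Lower bound: every lost program youth unemployed, every lost control employed.
    lower = employed_t / assigned_t - (employed_c + lost_c) / assigned_c

    # Upper bound: every lost program youth employed, every lost control unemployed.
    upper = (employed_t + lost_t) / assigned_t - employed_c / assigned_c

    return {
        "attrition_program": lost_t / assigned_t,
        "attrition_control": lost_c / assigned_c,
        "observed_effect": observed,
        "lower_bound": lower,
        "upper_bound": upper,
    }


if __name__ == "__main__":
    # Hypothetical counts: 200 assigned per group, unequal attrition.
    result = attrition_bounds(200, 160, 96, 200, 120, 60)
    for name, value in result.items():
        print(f"{name:18s} {value:6.3f}")
```

With these invented counts (20 percent attrition in the program group, 40 percent in the control group) the observed difference in employment rates is 0.10, but the extreme-case bounds run from -0.22 to 0.38. A reader cannot carry out even this crude check unless the attrition rates for both groups are reported.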

The topics that should be routinely considered in such reports are easy to list. The following are based on fuller treatments in the references cited.

Attrition Rates

The difference between the ultimate target samples of program participants and control-group members is crucial. Estimates of program effect may be inflated, deflated, or remain unchanged, relative to their true value, depending on the magnitude of attrition in the groups. Yet attrition rates are sometimes not reported. Nor are differences in the rates reported. Even less frequently reported are analyses of how sensitive the conclusions are to the rates and to differences in the rates.

Character of Program

Understanding what happens in a program is, of course, important; developing orderly, inexpensive descriptions of what happens is difficult. This does not excuse one from trying to document program activity well. Little reporting has been undertaken, partly perhaps because of a lack of understanding of how to measure the level of program implementation well.

Access to the Data Base

Assuring that raw research data are accessible for reanalysis, in the interest of facilitating criticism and secondary analysis, is not common. Still, it seems sensible to advance understanding of how to exploit costly information better and how to encourage thoughtful criticism (see Fienberg et al., 1985). The implication for research policy in this arena is that reports should routinely inform the reader about what raw data are available and from whom. The vehicles for assuring that the data are available include the normal contract system and agencies responsible for overseeing evaluations.

Target Population/Recruitment

The general characteristics of youths targeted for programs are usually a matter of law or regulation and are usually reported. Demographic characteristics of the sample are also reported. But how youths are recruited, what fraction of the available population is involved, and what kinds of problems were encountered in targeting are often not reported or given only superficial treatment. As a consequence, it would be difficult to replicate the program even if it were found to be successful. And it is difficult for the thoughtful observer to reconcile conflicts among the results of different studies.

Perhaps most important, shortfalls between target and actual samples occurred. Understanding the magnitude of the shortfall and the reasons for it is crucial to designing better evaluations.

Site Selection

Many of the field tests of youth employment programs involve multiple sites. The 9 randomized trials reviewed seriously by the committee, for example, involve a randomized experiment in each of 40 sites. Very little information is provided on site selection in final reports, however. The information is important for understanding the general context of the test, perhaps for understanding why the program succeeded or failed and why the evaluation was executed poorly or well, and for learning whether and how the evaluation might be replicated. Final results need not provide great detail on site selection. Nonetheless, a reference, footnote, or paragraph ought to provide leads to accessible sources of written information on the topic.

Random Assignment

The specific method for randomly assigning individuals to alternative regimens is rarely reported. The methods may be mundane. Or they may be creative in the sense of being robust against indifference, incompetence, and corruption. In any case, reporting on method is warranted in a footnote or appendix to assure the reader that indeed the study was carried out as an experiment and to permit praise and criticism of the random assignment process [see Dobson and Cook (1979)].

For example, the specific mechanics of randomization are not described in any final report on youth employment projects listed in Section C of the Bibliography. There are no citations to reports that may contain the information. But broad information is given by some. The Tallmadge and Yuen report on the Career Intern Program, for instance, gives no detail except to say that assignment was by "lottery."
The report on the OIC project (O'Malley et al., 1981:34) says that individuals were randomly assigned to program versus control groups but that "there's no firm assurance that in all cases the participants were randomly assigned." The most detail on the topic is given in the final report for Project STEADY. Grandy (1981:17) reports that "Project STEADY participants consisted of random samples of program applicants. Site directors reported that all applicants drew a card from a hat, and on that card was the designation of participant or control." The randomization followed administration of tests and screening that determined eligibility and individuals' willingness to participate in the program.

Failure to report such crucial information is not confined to the youth employment reports. In their review of articles in medical research journals, such as the New England Journal of Medicine, for example, DerSimonian et al. found that only one-fifth reported anything on method of randomization.

Statistical Power Analyses

It is rare for final reports to specify the probability of finding program effects, if indeed they exist, given the sample size and other design parameters. It seems especially desirable that evaluations showing "no effects" report the statistical power of the analysis. Even when effects are found, the power calculations should be available for postevaluation analyses, e.g., to ask whether the size of effect obtained was close to the effect guessed in the power analysis.

Costs

Information on the costs of an evaluation, apart from the cost of the program under examination, is almost never published in professional journals or in final reports of evaluations. In consequence, it is difficult to estimate what has been spent on evaluation and impossible to do good benefit/cost analyses of evaluations based on evidence. There is no readily accessible evidence. It seems desirable, then, to have information on costs in the report. No uniform system for reporting costs of elements of evaluation has been adopted. And so creation of alternative accounting systems for budgets and expenditures is warranted.

Graphs and Tables

Tables in some evaluation reports are often dreadful, difficult to understand, and impossible to read quickly. And they are often susceptible to misreading. Only a few reports on youth employment experiments are exceptional on this account. So, for example, computer printouts of tables are merely reprinted, rather than being reorganized and restructured to permit the reader to understand patterns quickly. The state of the art in constructing tables and graphs has improved remarkably over the past 10 years. It is a shame that it is ignored. See Fienberg (1979) and Kruskal (1980), among others, on improving these presentations.

REFERENCES AND BIBLIOGRAPHY

Bermant, G., H.C. Kelman, and D.P. Warwick
  1978 The Ethics of Social Experimentation. New York: Wiley.
Boruch, R.F.
  1975 Coupling randomized experiments and approximations to experiments in social program evaluation. Sociological Methods and Research 4(1):31-53.

  1976 On common contentions about randomized experiment. Pp. 158-194 in G.V. Glass, ed., Evaluation Studies Review Annual, Vol. 3. Beverly Hills, Calif.: Sage Publications.
Boruch, R.F., and P.M. Wortman
  1979 Some implications of educational evaluation for evaluation policy and program development. Review of Research in Education 7:309-361.
Boruch, R.F., D.S. Cordray, and G. Pion
  1982 How well are local evaluations carried out? In L.E. Datta, ed., Local, State, and Federal Evaluation. Beverly Hills, Calif.: Sage Publications.
Boruch, R.F., A.J. McSweeny, and E.J. Soderstrom
  1978 Bibliography: illustrative randomized field experiments. Evaluation Quarterly 4:655-695.
Boruch, R.F., P.S. Anderson, D.M. Rindskopf, I.A. Amidjaya, and D. Jansson
  1979 Randomized experiments for evaluating and planning local programs: a summary on appropriateness and feasibility. Public Administration Review 39(1):36-40.
Braham, R.E.
  1979 Field experimentation in weather modification. Journal of the American Statistical Association 74(365):57-104.
Breger, M.J.
  1983 Randomized social experiments and the law. Pp. 97-144 in R.F. Boruch and J.S. Cecil, eds., Solutions to Ethical and Legal Problems in Social Research. New York: Academic Press.
Chen, H-T., and P.H. Rossi
  1980 The multigoal, theory driven approach to evaluation: a model linking basic and applied social science. Social Forces 59(1):106-122.
Clark, Phipps, Clark & Harris, Inc.
  1980 Advanced Education and Training--Interim Report on the Career Advancement Voucher Demonstration. Youth Knowledge Development Report No. 5.3. Washington, D.C.: U.S. Department of Labor, Employment and Training Administration.
  1981 Second Year Final Report: Career Advancement Voucher Demonstration Project. New York: Clark, Phipps, Clark & Harris, Inc.
Conner, R.F.
  1977 Selecting a control group: an analysis of the randomization process in twelve social reform programs. Evaluation Quarterly 1:195-244.
  1982 Random assignment of clients in social experimentation. In J.E. Sieber, ed., The Ethics of Social Research: Surveys and Experiments. New York: Springer Verlag.
Cordray, D.S., and R.G. Orwin
  1985 Effects of deficient reporting on meta analysis: a conceptual framework and reanalysis. Psychological Bulletin 97(1):134-147.
Corsi, J.R.
  1983 Randomization and consent in the New Mexico teleconferencing experiment: legal and ethical considerations. Pp. 159-170 in R.F. Boruch and J.S. Cecil, eds., Solutions to Ethical and Legal Problems in Social Research. New York: Academic Press.

Corsi, J.R., and T.L. Hurley
  1979 Pilot study report on the use of the telephone in administrative fair hearings. Administrative Law Review 31(4):484-524.
Cronbach, L.J., S.R. Ambron, S.M. Dornbusch, R.D. Hess, R.C. Hornik, D.C. Phillips, D.F. Walker, and S.S. Weiner
  1980 Toward Reform of Program Evaluation. San Francisco, Calif.: Jossey-Bass.
Dobson, L.D., and T.D. Cook
  1979 Implementing random assignment in a field setting: a computer based approach. Evaluation Quarterly 3:472-478.
Educational Testing Service
  1981 Assessment of the U.S. Employment Service Project STEADY. Technical Report No. 9. Prepared by J. Grandy.
Evaluation Research Society
  1982 Standards for evaluation practice. New Directions for Program Evaluation. No. 15.
Federal Judicial Center
  1983 Social Experimentation and the Law. Washington, D.C.: Federal Judicial Center.
Fienberg, S.E.
  1979 Graphical methods in statistics. American Statistician 33(4):165-178.
Fienberg, Stephen E., Margaret E. Martin, and Miron L. Straf, eds.
  1985 Sharing Research Data. Committee on National Statistics. Washington, D.C.: National Academy Press.
Fisher, R.A.
  1966 Design of Experiments. (First edition, London: Oliver and Boyd, 1935.) Eighth edition. New York: Hafner.
Fraker, T., and R. Maynard
  1985 The Use of Comparison Group Designs in Evaluations of Employment Related Programs. Princeton, N.J.: Mathematica Policy Research.
Freeman, H.E., and P.H. Rossi
  1981 Social experiments. Milbank Memorial Fund Quarterly: Health and Society 59(3):340-374.
Grandy, J.
  1981 Assessment of the U.S. Employment Service Project STEADY. Technical Report No. 9, prepared under contract to the U.S. Department of Labor, Office of Youth Programs. Princeton, N.J.: Educational Testing Service.
Hahn, G.J.
  1984 Experimental design in the complex world. Technometrics 26(1):19-31.
Heckman, J., and R. Robb
  1985 Alternative methods for evaluating the impact of interventions: an overview. Presented at the Social Science Research Council Workshops on Backtranslation, Committee on Comparative Evaluation of Longitudinal Surveys, New York.
Kruskal, W.H.
  1980 Criteria for Judging Statistical Graphics. Unpublished manuscript. University of Chicago, Department of Statistics.

Light, R.J., and D.B. Pillemer
  1984 Summing Up: The Science of Reviewing Research. Cambridge, Mass.: Harvard University Press.
Lipsey, M.W., D.S. Cordray, and D.E. Berger
  1981 Evaluation of a juvenile diversion program: using multiple lines of evidence. Evaluation Review 5(3):283-306.
O'Malley, J.M., B.B. Hampson, D.H. Holmes, A.M. Ellis, and F.J. Fannau
  1981 Evaluation of the OIC/A Career Exploration Project--1980. McLean, Va.: Center for Studies in Social Policy.
Partridge, A., and A. Lind
  1983 A Reevaluation of the Civil Appeals Management Plan. Washington, D.C.: Federal Judicial Center.
Raizen, S.A., and P.H. Rossi
  1981 Program Evaluation in Education: When? How? To What Ends? Report of the Committee on Program Evaluation in Education. Washington, D.C.: National Academy of Sciences.
Rezmovic, E.L.
  1982 Program implementation and evaluation results: a reexamination of Type III error in a field experiment. Evaluation and Program Planning 5:111-118.
Riecken, H.W., and others
  1974 Social Experimentation. New York: Academic Press.
Rivlin, A.M.
  1971 Systematic Thinking for Social Action. Washington, D.C.: The Brookings Institution.
Rivlin, A.M., and P.M. Timpane, eds.
  1975 Ethical and Legal Issues of Social Experimentation. Washington, D.C.: The Brookings Institution.
Rossi, P.H.
  1969 Practice, method, and theory in evaluating social action programs. Pp. 217-234 in D.P. Moynihan, ed., On Understanding Poverty. New York: Basic Books.
Rutman, L.
  1980 Planning Useful Evaluations: Evaluability Assessment. Beverly Hills, Calif.: Sage Publications.
Sadd, S., M. Kotkin, and S.R. Friedman
  1983 Alternative Youth Employment Strategies Project. New York: Vera Institute of Justice.
Sherman, L.W., and R.A. Berk
  1984 The specific deterrent effects of arrest for domestic assault. American Sociological Review 49:261-272.
Silverman, W.A.
  1977 The lesson of retrolental fibroplasia. Scientific American 236(6):100-107.
Tallmadge, G.K., and S.D. Yuen
  1981 Study of the Career Intern Program. Final Report--Task B: Assessment of Intern Outcomes. Prepared for the National Institute of Education. Mountain View, Calif.: RMC Research Corporation.

Teitelbaum, L.E.
  1983 Spurious tractable and intractable legal problems: a positivist approach to law and social science research. In R.F. Boruch and J.S. Cecil, eds., Solutions to Ethical and Legal Problems in Social Research. New York: Academic Press.
U.S. General Accounting Office
  1978 Assessing Social Program Impact Evaluations: A Checklist Approach. Washington, D.C.: U.S. General Accounting Office.
Wholey, J.
  1977 Evaluability assessment. In L. Rutman, ed., Evaluation Research. Beverly Hills, Calif.: Sage Publications.

BIBLIOGRAPHY

Guidelines, Standards, and Related Papers on Appraising the Quality of Social Program and Project Evaluations

The guidelines issued by the Evaluation Research Society, the U.S. General Accounting Office, and other institutions are marked with an asterisk (*). Papers by individual scholars (and one journalist) provide commentary, applications, comparisons among standards, or ideas for new standards.

Bernstein, I.N., and H.E. Freeman. 1975. Academic and Entrepreneurial Research. New York: Russell Sage Foundation.
Chalmers, T.C. 1981. A method for assessing the quality of a randomized control trial. Controlled Clinical Trials 2:31-49.
Cordray, D.S. 1982. An assessment of the utility of the ERS standards. New Directions for Program Evaluation No. 15:67-82.
Davis, H.R., C. Windle, and S.S. Sharfstein. 1977. Developing guidelines for program evaluation capability in Community Mental Health Centers. Evaluation 4:25-34.
DerSimonian, R., L.J. Charette, B. McPeek, and F. Mosteller. 1982. Reporting on methods for clinical trials. New England Journal of Medicine 306(22):1332-1337.
Gordon, G., and E.V. Morse. 1975. Evaluation research. In A. Inkeles, ed., Annual Review of Sociology 1.
*Joint Committee on Standards. 1981. Standards for Evaluations of Programs, Projects and Materials. New York: McGraw-Hill.
McTavish, D.G., J.D. Cleary, E.E. Brent, L. Perman, and K.R. Knudsen. 1977. Assessing research methodology: the structure of professional assessments of methodology. Sociological Methods and Research 6(1):3-44.
Mosteller, F.M., S.P. Gilbert, and B. McPeek. 1980. Reporting standards and research strategies: agenda for an editor. Controlled Clinical Trials 1(1):37-58.
*National Institute of Education, Department of Education. 1977. The Joint Dissemination Review Panel Ideabook. Washington, D.C.: National Institute of Education.

Pear, R. 1984. Taking the measure, or mismeasure, of it all. The New York Times, August 28.
*Rossi, P., ed. 1982. Standards for evaluation practice. New Directions for Program Evaluation No. 15.
*U.S. General Accounting Office. 1975. Evaluation and Analysis to Support Decisionmaking. Washington, D.C.: U.S. General Accounting Office.
*U.S. General Accounting Office. 1978. Assessing Social Program Impact Evaluations: A Checklist Approach. Washington, D.C.: U.S. General Accounting Office.

Illustrative, Recent Randomized Field Experiments in Law Enforcement and Corrections, Welfare, Court Procedures, Health Services, and Other Areas

For a list of over 300 field experiments run before 1979, see Boruch, McSweeny, Soderstrom (1978).

Law Enforcement and Corrections

Berk, R.A., and L.W. Sherman. 1985. Randomized experiments in police research. New Directions for Program Evaluation. In preparation.
Rossi, P.H., R.A. Berk, and K.J. Lenihan. 1980. Money, Work, and Crime: Experimental Evidence. New York: Academic Press.
Sherman, L.W., and R.A. Berk. 1984. The specific deterrent effects of arrest for domestic assault. American Sociological Review 49:261-272.

Civil, Criminal, and Administrative Law

Corsi, J.R., L.B. Rosenfeld, G.D. Fowler, K.E. Newcomer, and D. Niekerk. 1981. The Use of Telephone Conferencing in Administrative Law Hearings: Major Findings of the New Mexico Experiment with Unemployment Insurance Appeals. Final report to the National Science Foundation.
Goldman, G. 1980. Ineffective Justice: Evaluating the Preappeal Conference. Beverly Hills, Calif.: Sage Publications.
Partridge, A., and A. Lind. 1983. A Reevaluation of the Civil Appeals Management Plan. Washington, D.C.: Federal Judicial Center.

Juvenile and Criminal Justice

Lipsey, M.W., D.S. Cordray, and D.E. Berger. 1981. Evaluation of a juvenile diversion program: using multiple lines of evidence. Evaluation Review 5(3):283-306.

Schneider, P., and A. Schneider. 1979. The National Juvenile Justice Restitution Evaluation: Experimental Designs and Research Objectives. Paper presented at the Third National Symposium on Restitution, Duluth, Minn., September 28-29.

Health Services

Cohen, D.I., P. Jones, B. Littenberg, and D. Neuhauser. 1982. Does cost information availability reduce physician test usage? A randomized clinical trial with unexpected findings. Medical Care 20(3):286.
Rogers, J.L., and O.M. Haring. 1979. The impact of a computerized medical record summary system on incidence and length of hospitalization. Medical Care 17(6):618-630.
Weissert, W.G., T.H. Wan, B.B. Livieratos, and S. Katz. 1980. Cost-effectiveness of day care services for the chronically ill: randomized experiment. Medical Care 28(6):567-584.
Zimmer, J.G., A. Groth-Junker, and J. McCusker. 1985. A randomized controlled study of a home health care team. American Journal of Public Health 75(2):134-141.

Nutrition Related

Rush, D., Z. Stein, and M. Susser. 1980. A randomized controlled trial of prenatal nutrition supplementation in New York City. Pediatrics 65(4):683-697.
St. Pierre, R.G., T.D. Cook, and R.B. Straw. 1982. An evaluation of the nutrition education and training program: findings from Nebraska. Evaluation and Program Planning 4:335-344.
U.S. Department of Agriculture, Office of Analysis and Evaluation, Food and Nutrition Service. 1984. Food Stamp Work Registration and Job Search Demonstration: Report on Initial Demonstration Sites. Contract No. 53-3198-0-85. Washington, D.C.: U.S. Department of Agriculture.

Randomized Experiments, Planned and Executed in Varying Degrees, on Youth Employment Programs

Clark, Phipps, Clark & Harris, Inc. 1980. Advanced Education and Training--Interim Report on the Career Advancement Voucher Demonstration. Youth Knowledge Development Report No. 5.3. Washington, D.C.: U.S. Department of Labor, Employment and Training Administration.
Clark, Phipps, Clark & Harris, Inc. 1981. Second Year Final Report: Career Advancement Voucher Demonstration Project. New York: Clark, Phipps, Clark & Harris, Inc. (60 East 86th Street).

Grandy, J. 1981. Assessment of the U.S. Employment Service Project STEADY. Technical Report No. 9, prepared under contract to the U.S. Department of Labor, Office of Youth Programs. Princeton, N.J.: Educational Testing Service.
Hahn, A., and B. Friedman (with C. Rivera and R. Evans). 1981. The Effectiveness of Two Job Search Assistance Programs for Disadvantaged Youth: Final Report. Center for Employment and Income Studies. Waltham, Mass.: Brandeis University.
Manpower Demonstration Research Corporation. 1980. Summary and Findings of the National Supported Work Demonstration. Cambridge, Mass.: Ballinger.
O'Malley, J.M., B.B. Hampson, D.H. Holmes, A.M. Ellis, and F.J. Fannau. 1981. Evaluation of the OIC/A Career Exploration Project--1980. McLean, Va.: Center for Studies in Social Policy.
Resource Consultants, Inc. 1981. Special Project for Indochinese Youth: Final Results. Report to U.S. Department of Labor, Employment and Training Administration, Office of Youth Programs. McLean, Va.: Resource Consultants.
Rivera-Casale, C., B. Friedman, and R. Lerman. 1982. Can Employer or Work Subsidies Raise Youth Employment? An Evaluation of Two Financial Incentive Programs for Disadvantaged Youth. Center for Employment and Income. Waltham, Mass.: Brandeis University.
Sadd, S., M. Kotkin, and S.R. Friedman. 1983. Alternative Youth Employment Strategies Project. New York: Vera Institute of Justice.
Tallmadge, G.K., and S.D. Yuen. 1981. Study of the Career Intern Program. Final Report--Task B: Assessment of Intern Outcomes. Prepared for the National Institute of Education. Mountain View, Calif.: RMC Research Corporation.