Read "Placing Children in Special Education: A Strategy for Equity" at NAP.edu

Suggested Citation:"Testing in Educational Placement: Issues and Evidence." National Research Council. 1982. Placing Children in Special Education: A Strategy for Equity. Washington, DC: The National Academies Press. doi: 10.17226/9440.

Testing in Educational Placement: Issues and Evidence

JEFFREY R. TRAVERS

I would like to thank the panel members and the outside reviewers who commented on drafts of this paper. Among the panel members, special thanks go to Donald Bersoff, Asa Hilliard, Jane Mercer, and Samuel Messick. Outside reviewers were Lee Cronbach, Robert Linn, Richard Snow, and Mark Yudoff. Their thoughtful comments helped me to strengthen my arguments and correct various errors. For errors that remain, as well as for judgments with which a few reviewers disagreed, I alone am responsible.

To write about testing in relation to the issues facing the Panel on Selection and Placement of Children in Programs for the Mentally Retarded is somewhat like testifying as a ballistics expert at a shooting trial: The topic invites discussion in almost limitless technical detail, but the details are significant only insofar as they help illuminate whether someone has injured someone else and by what means. Therefore, this paper focuses less on psychometric issues than on their interplay with the legal, political, and moral issues raised by testing in the context of educational placement.

The paper, in providing background and support for portions of the panel's report, attempts to accomplish two distinct but related tasks. First, given the controversy that has surrounded testing in the academic and popular literature as well as in recent court cases, the panel felt a responsibility to survey the scientific evidence bearing on relevant aspects of the controversy. This paper provides such a survey, albeit one that is condensed and selective and that covers material already well known to professionals in testing and related fields. Second, and more importantly, the panel wanted to place the testing controversy in proper perspective. Issues surrounding testing are part of the larger complex of issues raised by the stubborn and tragic fact that large numbers of children, particularly minority children, are not learning in regular classrooms. Consequently, as the paper examines various controversies and the associated scientific evidence, it also examines their wider implications for educational policy and practice.

Several limitations on the scope of the paper should be made clear at the outset. It is not a comprehensive discussion of issues related to ability testing. (For such a discussion, see the report of the Committee on Ability Testing of the National Research Council, Wigdor and Garner, eds., 1982; see also the special issue of American Psychologist, Glaser and Bond, eds., 1981.) This paper focuses specifically on the issues that have figured in the debate over placement in programs for educable mentally retarded (EMR) children. It does not deal with research on mental retardation per se, nor does it make judgments about the validity or utility of the EMR category. It asks instead how tests contribute to classification or misclassification, given current professional and legal definitions of EMR. Finally, this paper does not deal directly with the consequences of classification (the effects of labeling or the educational benefits and costs of placement in EMR classes), although one of its major themes is that the consequences, not just the accuracy, of classification must be taken into account in deciding whether any assessment procedure is appropriate.

This paper focuses primarily on the widely used, individually administered tests that yield IQ scores, notably the Stanford-Binet and the revised Wechsler Intelligence Scale for Children (WISC-R), although other tests are mentioned. Much of the discussion applies to ability tests generally.
Special issues raised by group testing and by various quick and dirty substitutes for the major tests are not discussed. (The facts that the Stanford-Binet and WISC-R are widely used and that IQ scores are important determinants of EMR placement are documented in Chapter 2 of the panel's report and in the paper by Bickel in this volume.) Here, these facts are taken as points of departure, and the concentration is not on describing how tests are used in educational placement but on elucidating the controversy surrounding their use.

Readers familiar with professionally recommended practices for administering and interpreting tests of mental ability and with the range of such tests currently available may be disturbed by the emphasis throughout this paper on single IQ scores and the occasional use of such words as "IQ test." Leaders in the field of assessment have long recommended the use of multiple tests and careful consideration of performance profiles across subscales within tests, and they have inveighed against the practice of recording only single, summary IQ scores. Unfortunately, data (cited in Chapter 2 and in the paper by Bickel in this volume) indicate that in many school systems the single IQ score is accorded overwhelming weight in placement decisions. Although the extent of this practice cannot be gauged, it is an important source of the controversy over testing in educational placement. It may also be a source of miscommunication between professionals in testing and related fields, who think in terms of the best practices and proper test use, and some critics of testing, who focus on possible or actual misuse and misinterpretation of tests.

This paper assumes that the reader has at least a rudimentary knowledge of how tests are constructed and interpreted as well as of basic statistical concepts and procedures. The presentation is largely qualitative, however, and some background material is included.

It is useful to begin this inquiry with rough caricatures of the positions taken by proponents and opponents of mental ability testing. Though such caricatures ignore many significant distinctions and nuances within the two camps, they lay out most of the major points of dispute and illustrate the interrelatedness of the various issues from both perspectives. Subsequent sections of the paper will necessarily discuss selected issues seriatim. However, if one thing is clear in all of the debate, with its complex arguments and high emotions, it is that the positions of participants rarely rest on one or a few isolated facts or arguments; data and logic instead lodge within a web of assumptions, beliefs, and values that must be understood if rational analysis is to proceed.

TESTING ON TRIAL: BRIEFS FOR THE DEFENSE AND FOR THE PROSECUTION

Proponents of the use of tests of general ability in educational placement hold that such tests measure global, enduring qualities of cognitive functioning: not necessarily "native intelligence" but some broad ability to learn, reason, and grasp abstract concepts.
Proponents deny that tests are culturally biased; while they recognize that children from certain ethnic and socioeconomic groups on the average score lower than white, middle-class children, they attribute these group differences in test scores to genuine differences in cognitive functioning, caused by heredity, environment, or both. Finally, in justifying the social uses of tests in educational and occupational selection and placement, proponents argue that tests offer individual members of disadvantaged groups, such as minorities and the poor, their best chance of distinguishing themselves and achieving educational and economic success; alternatives to testing, such as qualitative assessments by teachers and supervisors, are, claim the proponents of testing, likely to be more discriminatory than tests.

Critics of standardized tests hold that the tests fail to measure intelligence, aptitude, or global cognitive skill and instead measure specific skills and knowledge acquired through particular experiences or instruction. Moreover, critics charge, experiences leading to the acquisition of these skills are more accessible to white, middle-class children than to children of other ethnic and socioeconomic groups. Some critics also argue that the test situation itself is unfamiliar and threatening to low-income and/or minority children, further depressing their scores. Thus, argue the critics, tests are inherently biased against low-income and/or minority children and systematically underestimate their intellectual ability relative to that of middle-class whites. Finally, critics attack the social uses and social effects of testing. Tests, they allege, perpetuate race and class prejudices because they are widely interpreted as demonstrating the inherent intellectual inferiority of minorities and low-income groups. Similarly, they perpetuate racial and class inequities in income, job status, and other forms of success and achievement, because they channel children from minority and/or low-income families into educational settings that provide little intellectual stimulation, little opportunity to acquire the skills most valued by the society, and little in the way of prestigious credentials and social contacts that can influence occupational and economic success quite apart from ability and effort. The extreme case in point, of course, is placement in classes for mentally retarded students, which, it is alleged, stigmatizes the child unfairly and virtually guarantees a dead-end education leading to a menial job at best.

Even this brief summary, which has barely skimmed the surface of the debate, makes it clear that many profound issues divide the proponents and opponents of testing.
Any list of the primary open questions would include at least the following:

1. What do standardized ability tests measure? To what degree do they measure deep-seated mental abilities as opposed to skills and knowledge that can be readily acquired by almost any child in the right environment?

2. Are tests culturally biased? To what degree do test scores understate or fail to measure the abilities of minority and/or low-income children?

3. What are the causes of observed group differences in test performance? To what degree are the causes genetic? To what degree do such differences arise from group differences in quality of prenatal care, nutrition, and health care? To what degree do they arise from differences in early experience or in the home environment? To what degree do they arise from differences in out-of-home educational environments and opportunities from the preschool years on?

4. What are the social consequences of testing? To what degree do tests provide opportunities for gifted individuals from disadvantaged backgrounds to identify themselves? To what degree do they perpetuate disadvantage and prejudice? In the context of educational placement, do they, on balance, help or hinder the meeting of children's needs? To what degree do they identify children who need special help? To what degree do they lead to inappropriate classification and unfair allocation of educational opportunity?

Answers to these questions vary with particular tests and particular policies regarding their use. The partial answers offered below relate primarily to the use of major "IQ tests" in EMR evaluations during recent years and may not generalize beyond that context.

The first three issues are discussed in separate sections below. The fourth is central to the mission of the panel and crosscuts the others; it is discussed in each substantive section and in the conclusion of this paper.

The possible contribution of testing to the disproportionate representation of boys in EMR classes (another concern of the panel) is not discussed explicitly, since the controversy over testing has focused on ethnicity rather than gender. Important issues concerning possible interactions of gender and ethnicity and the reportedly greater vulnerability of boys than girls to environmental variations are likewise beyond the scope of this paper.

WHAT DO "INTELLIGENCE" TESTS MEASURE?

To discuss what such tests as the WISC-R and Stanford-Binet measure, it is first necessary to clear away a popular misconception about what they are supposed to measure. In the view of most professionals in psychology, psychometrics, and related fields, such tests do not and are not intended to measure the global, fixed native capacity that seems to be implied by the term "intelligence."
Indeed, for these professionals the equation of intelligence with native intellectual capacity is entirely misleading and has been the source of much confusion and unnecessary acrimony in debates about testing and its uses. (For an authoritative statement of this position, see Cleary et al., 1975.)

The gap between this view and that of many educators, policy makers, members of the public, and some social scientists is illustrated by federal Judge Robert Peckham's landmark decision in the case of Larry P. v. Riles (1979). In a section entitled "The Impossibility of Measuring Intelligence," the judge writes (Larry P. v. Riles, 1979, Section IVA):

While many think of the IQ as an objective measure of innate, fixed intelligence, the testimony of the experts overwhelmingly demonstrated that this conception of IQ is erroneous. Defendant's expert witnesses, even those closely affiliated with the companies that devise and distribute the standardized intelligence tests, agreed, with only one exception, that we cannot truly define, much less measure, intelligence. We can measure certain skills, but not native intelligence.

The judge implies that in the common view intelligence is, by definition, a quality both innate and unchanging; and he apparently holds this view himself. (Generations of psychologists, most of them now deceased, advanced the same definition.) However, the judge rejects what he considers to be the popular view that IQ is an accurate measure of native intelligence. He himself was convinced that IQ tests measure something that is not fixed or innate ("certain skills"), and he does not seem to equate these skills with intelligence.

Presumably, however, the "experts" who "devise and distribute" intelligence tests must believe that they measure something that can legitimately be called "intelligence," even if it is ill defined and not fixed or innate. The experts seem to hold the view of those contemporary psychologists who think of intelligence as a kind of global ability to absorb complex information or grasp and manipulate abstract concepts, an ability that is not fixed but that develops continuously through a process of reciprocal interaction with the physical and social world, including, but not limited to, the world of formal education. This very general view is shared by psychologists who differ on many specific theoretical points: Piagetian developmental psychologists, cognitive psychologists oriented toward computer simulation and information processing, even some learning theorists committed to animal behavior models.
For all of these psychologists, it is reasonable to speak of an individual's intelligence at a given point in his or her development, but there is no presumption that individual differences in intelligence are fixed or wholly determined by the genes.

From this perspective the central question is whether IQ is a valid measure of "developed intelligence." Questions about how much genes contribute, how genes and environment interact, and how much IQ can be modified by planned social intervention through education are separate. A few of these questions are discussed in a subsequent section on the causes of variation in IQ; selected aspects of the validity question are discussed here.

Inspection of one of the major intelligence tests, such as the Stanford-Binet or the WISC-R, reveals that items vary widely in content and that many plainly require learning of a very specific sort. Examples include verbal analogies, numerical computations, and questions about practical tasks and social norms (How do you make water boil? What should you do if a smaller child tries to start a fight with you?). Vocabulary items provide some particularly striking examples: At its most advanced adult level, the Stanford-Binet asks the meanings of such esoteric words as "parterre" and "sudorific." This manifest emphasis on acquired knowledge and diversity of item content naturally raises questions as to how such tests can be said to measure any general mental property (as opposed to specific skills and knowledge) as well as how tests can be said to measure "ability" in any broad sense that goes beyond the ability to answer the specific questions and solve the specific problems presented by the test itself.

The generality of mental test scores has been the subject of a long debate in psychometrics. Early leaders in the field, notably Spearman and Thurstone, took opposed positions. The debate came to focus on the statistical issue of shared variation: What fraction of the variance in individual performance is shared by all items? What fraction is shared within distinct clusters of items but not across clusters (thus pointing to differentiated abilities rather than a single "intelligence")? What fraction is unique to individual items (pointing to "abilities" specific to the items)? Statistical techniques of principal components and factor analysis were developed largely to address these questions.

There is no universal agreement on precise, quantitative answers to these questions. Different analytic techniques yield different estimates of the relative importance of the general factor versus differentiated clusters. There is agreement, however, that a significant fraction of the variation is shared across items. The diverse items on such tests as the WISC-R and Stanford-Binet appear to measure (in part) the same thing or a small number of things; they are not merely a heterogeneous ragbag of skills and bits of knowledge. Item responses correlate with one another, with subscale scores, and with total scores on the test. Items load on a single general factor and on a small number of orthogonal factor scales.
For example, several analyses of WISC-R scores, based on large samples comprised of several ethnic groups, have revealed independent "verbal" and "perceptual" factors and, occasionally, a third factor variously labeled "distractibility," "attention," "memory," and "sequential" (Kaufman, 1975; Mercer, in press; Reschly and Reschly, 1979). In addition, most tests of general abilities, even when apparently dissimilar in content, correlate positively and often highly with each other.

Covariance of scores across items and across tests is an established empirical fact. To identify common variance with ability or abilities requires inference and interpretation. The inference rests on an assumption: A child who possesses general perceptual and analytic abilities will make good use of experience and will master a wide range of specific facts, concepts, and principles. Conversely, a child who performs well on a wide variety of items is likely to have well-developed information-processing abilities of a general sort. An alternative interpretation of test and item covariance is that both the tests and the individual items reflect exposure to the mainstream culture, especially to the language, symbols, information, strategies, and tasks that are important in schools. These two interpretations are not necessarily opposed, so long as it is recognized that perceptual and analytic abilities may be developed in part through experience and exposure to appropriate stimulation. (There may of course be other broad perceptual and analytic abilities that are neither captured by existing tests nor fostered by the mainstream culture.)

It is important to recognize that all test performance depends on both general abilities and specific knowledge, both of which are products of learning, at least in part. For example, a test of an advanced, academic subject matter, e.g., one that requires the respondent to solve differential equations, clearly requires specific preparation. Nevertheless, general mathematical ability is likely to play a large role in individual performance. The relative contributions of general ability and specific learning are not fixed characteristics of the test itself but depend as well on the tested population and the circumstances of testing. Pursuing the example just given, if students in a calculus class are all drawn from a narrow, high band of the spectrum of general mathematical ability but vary widely in their previous preparation for calculus, the latter variations will be a relatively important determinant of test performance. If students in the class vary widely in ability but have all been exposed to the same mathematics curriculum in the past, variations in ability will be a more important factor.

Most school psychologists and educators who use IQ tests avoid the interpretive issues discussed above and justify their use of tests on grounds of "predictive validity," a purely empirical phenomenon.
Many studies have shown that IQ scores predict (correlate with) "criterion" measures of scholastic success, such as later school grades or scores on standardized tests of achievement in specific subject areas. For elementary school children, validity coefficients (correlations) of .7 or higher have often been obtained using achievement tests as criteria (see Crano et al., 1972). Correlations with grades are typically somewhat lower. Values around .5 have been reported (Messe et al., 1979). Occasionally, much lower correlations with grades have been reported; however, technical limitations may account in part for these findings.1

It is not necessary to dwell on the evidence for predictive validity, because some degree of the predictive power of tests is generally conceded. What is sharply debated, however, is the interpretation of validity correlations. They are obviously consistent with the hypothesis that IQ tests measure academic ability, which is later manifested in scholastic performance, and they have been interpreted in this way, implicitly or explicitly, by many of those who use tests in schools. They are also consistent with the hypothesis that IQ tests, teacher-made tests, and standardized achievement tests all sample the same domain of acquired skills.

This ambiguity of interpretation points to an important fact, noted by Messick (1980), among others, that the term "predictive validity" is a misnomer. Prediction is not a kind of validity; prediction does not in itself guarantee that a test measures what it is supposed to measure. (Parental income predicts a child's IQ and school success, but it is surely stretching the term "measure" to call parental income a measure of the child's intelligence.) What is needed is an explicit theory of intelligence that links this construct to its measures and to other constructs and their associated measures. To draw a physical analogy, there is an explicit theory that links temperature to pressure and volume and, thereby, to the height of a column of liquid in a sealed tube. Without such a theory it would be hard to understand why a thermometer measures the entity that causes water to boil or one's hand to hurt when placed on a hot stove. Belief in the validity of the measure gains strength with repeated confirmation of the theory. In psychometric parlance, this process is "construct validation," and, as Messick and others have argued, construct validity is the only kind there is. Prediction is just one of several kinds of evidence that can be used to support claims of construct validity. Unfortunately, where intelligence is concerned, there are multiple, competing theories, few of them very precise; hence, the evidence of prediction is subject to multiple interpretations.

In sum, there are two principal pieces of evidence for the validity of IQ tests as measures of "developed intelligence." One is the convergence of different items and different tests. The other is the association between IQ scores and measures of academic achievement. Both are subject to varying interpretations. The question of interest here is how the evidence bears on the use of tests in educational placement.

1 Lower and less consistent correlations with grades are to be expected for many reasons. IQ tests are more similar in style and content to standardized achievement tests than to classroom tests and other performance measures used in grading. Grades are likely to be less reliable than standardized achievement tests, and unreliability attenuates correlations. Grades are likely to be influenced by factors other than achievement, such as deportment or perceived effort. Overall grade point averages may include nonacademic subjects, for which little effect of intellectual ability might be expected. Students are likely to be grouped by ability, formally or informally, and graded in comparison to their classmates; such practices imply that the same grade means different things for students in different classes or for students graded by different teachers and also that the restricted range of variation in IQ within classes will reduce the correlation between IQ and grades.
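Two of the technical limitations named in the note on grade correlations, unreliable criteria and restricted range, have standard textbook expressions, and a brief numerical sketch may help fix ideas. The code below is an illustration added for this purpose and is not part of the paper's analysis: it uses Spearman's attenuation formula and the usual formula for direct range restriction on the predictor (which assumes a linear, homoscedastic relation), and every numerical input is a hypothetical value chosen only for demonstration.

```python
import math


def attenuated_correlation(true_r, reliability_x, reliability_y):
    """Spearman attenuation: the correlation observed between two
    imperfectly reliable measures, given the correlation between
    their error-free counterparts."""
    return true_r * math.sqrt(reliability_x * reliability_y)


def range_restricted_correlation(full_r, sd_ratio):
    """Correlation expected in a subgroup selected directly on the
    predictor, where sd_ratio = (subgroup SD) / (full-population SD).
    Assumes a linear relation with constant residual variance."""
    numerator = full_r * sd_ratio
    denominator = math.sqrt(1.0 - full_r**2 + (full_r**2) * (sd_ratio**2))
    return numerator / denominator


# Hypothetical inputs, for illustration only.
true_r = 0.7            # assumed correlation between error-free measures
iq_reliability = 0.9    # assumed reliability of the IQ score
grade_reliability = 0.6  # assumed reliability of teacher-assigned grades
within_class_sd_ratio = 0.6  # assumed IQ spread within an ability-grouped class

print(attenuated_correlation(true_r, iq_reliability, grade_reliability))
print(range_restricted_correlation(true_r, within_class_sd_ratio))
```

Under these assumed values both effects pull a correlation of .7 down toward .5, which is the direction the note describes; the actual magnitudes in any school system would depend on reliabilities and grouping practices that the paper does not report.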

Critics of testing have argued vehemently that tests are invalid as measures of a child's potential and are, therefore, unfair devices to use for placement. However, they have not spelled out why they would be fair if they did measure potential nor why they are unfair if they measure only acquired skills or developed abilities. Defenders of testing have not contested the point about measurement of potential but have justified the use of tests on grounds of predictive validity, apparently believing that the use of tests in educational placement is fair even if tests measure skills that are partially or primarily acquired.

In my view, neither the critics nor the defenders (exemplified by the plaintiffs and defendants in Larry P. and in Parents in Action on Special Education v. Hannon) have focused their arguments appropriately. Prediction in itself is not sufficient justification for using tests in educational placement. Nor is the critical shortcoming of tests their failure to measure "potential" or "native intelligence." The key issue is whether tests offer guidance in choosing among educational alternatives.

One relevant, if obvious, limitation of prediction has been mentioned in court cases concerned with the use of tests in EMR placement (e.g., U.S. Department of Justice, 1980:A7-A8): Prediction is probabilistic. The fact that a given IQ on the average predicts a specific grade level does not guarantee that any particular child who achieves the given IQ will achieve the predicted grade level. Variation around the predicted level can be quite wide. When the validity coefficient is as high as .6, a child who scores below the 10th percentile (an IQ of roughly 80) would have a 46 percent chance of achieving a grade point average in the bottom fifth of the class, hence a 54 percent chance of doing better. The child would have a 17 percent chance of being in the top half of the class.
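Probabilities of this kind can be approximated from the validity coefficient alone. The sketch below is an added illustration, not the procedure of the cited source: it assumes that IQ and grade point average follow a bivariate normal distribution and that the child scores exactly at the 10th percentile (rather than anywhere below it). The function name and its parameters are hypothetical.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()


def criterion_probabilities(validity, predictor_percentile=0.10,
                            bottom_fraction=0.20):
    """Return (P(criterion in bottom fraction), P(criterion in top half))
    for a child at the given predictor percentile, assuming a bivariate
    normal relation with correlation `validity`."""
    # Predictor score in standard-deviation units.
    z_predictor = STANDARD_NORMAL.inv_cdf(predictor_percentile)
    # Conditional distribution of the standardized criterion score:
    # mean r * z, standard deviation sqrt(1 - r^2).
    conditional = NormalDist(mu=validity * z_predictor,
                             sigma=(1.0 - validity**2) ** 0.5)
    # Cutoff below which the stated fraction of the whole class falls.
    bottom_cutoff = STANDARD_NORMAL.inv_cdf(bottom_fraction)
    p_bottom = conditional.cdf(bottom_cutoff)
    p_top_half = 1.0 - conditional.cdf(0.0)
    return p_bottom, p_top_half


p_bottom, p_top_half = criterion_probabilities(0.6)
print(f"bottom fifth: {p_bottom:.2f}, top half: {p_top_half:.2f}")
```

With a validity coefficient of .6 the two probabilities come out at about .46 and .17, in line with the figures quoted above. Passing a different coefficient shows how quickly the prediction for an individual child weakens as validity falls.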
When the validity coefficient is as low as .2, the child would have only a 28 percent probability of being in the bottom fifth, just 8 percentage points higher than pure chance. The child would have a 40 percent likelihood of being in the top half of the class (Schrader, 1965). Even if it is conceded that IQ tests are among the best predictors of school success that we have, the margin of error in an individual case is substantial. (In principle, prediction can be improved by the use of other valid indicators in conjunction with IQ scores. In practice, as indicated earlier, this improvement may or may not be achieved, depending on whether additional indicators are in fact collected and used.)

A second limitation (somewhat paradoxical, given the first) is that the predictive information available in the IQ overlaps with that available in the child's grade record or achievement test scores, when the latter are available. Past and current achievement predicts future achievement, typically better than IQ (Crano et al., 1972). Although, as illustrated in the previous paragraph, a substantial portion of the variation in achievement is independent of IQ and vice versa, prediction based on both IQ and achievement is only a little more accurate than prediction based on achievement alone. (The fact that IQ and achievement measures are not entirely redundant does have important implications, however. In current practice, children are usually referred for testing only after experiencing serious and prolonged difficulty in the classroom. When testing reveals that such children have low IQs, it merely confirms expectations. In some individual cases, however, testing can make a distinctive and positive contribution: When children who are performing poorly in class prove to have IQs in the normal range, the discrepancy points to undetected problems that should be diagnosed: sensory malfunctions, emotional difficulties, poor or inappropriate instruction, etc. Obviously, this is not to say that high scores are somehow more valid or meaningful than low scores or that predictive equations are different for high and low scores. The point, rather, is that the functional contribution of testing is likely to lie less in improving prediction than in stimulating diagnosis.)

A more fundamental limitation concerns the underlying logic of using prediction as a basis for educational placement at all. Even if it could be predicted with certainty that a child with a low IQ will get low grades in a regular class, this fact would not in itself dictate or justify removing the child from the class. Judge Peckham recognized this point when he drew a distinction between testing for educational placement and testing for job placement. Courts have held that employers have a legitimate stake in employee performance and thus are justified in selecting employees on the basis of a test that has demonstrated predictive power (Bersoff, 1979). But the stake of educators in the performance of children is not analogous.
Children, not educators, are the beneficiaries of education, and the public schools have an obligation to teach every child as well as possible. The paramount question is not how to select children who will perform well in regular classes but how to select classes or programs that best meet the needs of children. To justify separate placement on the basis of an IQ score it would be necessary to show that children with low IQs require and profit from a dif- ferent curriculum or different type of instruction from that available in regular classes. (Alternatively, separate placement might be justified if it could be shown that children with low IQs are not harmed by it, while children in regular classes are harmed when children with low IQs share those regular classes.) Educational researchers call situations in which dif- ferent educational approaches work best with children of different initial ability "aptitude-treatment interactions" (Cronbach and Snow, 1977~. It has been urged that demonstration of aptitude-treatment interactions is the appropriate way to validate tests for use in educational placement,

although there may be severe difficulties in conducting such demonstrations in special education.² The more general point stands, however: Separate placement demands justification on grounds of educational consequences, not merely predicted failure in regular classes.

Such justification goes far beyond the boundaries of technical test validity, as demonstrated by item convergence and prediction. As Messick (1980) pointed out, even ironclad evidence of technical validity is insufficient to justify a particular use of a test. One must always consider whether the construct measured by the test is relevant to the decision to be made, and one must always consider the consequences of the decision. The case of educational placement is a dramatic illustration of these precepts, as was noted by Reschly (1981), among others. It is likely that the framers of the implementing regulations for P.L. 94-142 (see Chapter 2) had this broad range of information in mind when they required that tests be "validated for their intended use," i.e., educational placement.

Later in this paper I argue that the above arguments would apply even if IQ scores supported strong inferences about learning potential. That is, even if children with low IQs were genetically limited in their capacity to learn, the decision to separate them from other children (or to assign them to any sort of special program) should be based on the educational consequences for these children and their classmates. First, however, I will consider another issue, central to both the Larry P. and PASE cases, the issue of racial and cultural bias in tests.

ARE TESTS BIASED?

Do tests misrepresent the skills or abilities of minority children and those from low-income families? Are tests merely the bearers of bad news about genuine differences in academic functioning, or are they the creators of false differences?
To address these questions it is necessary to clarify certain points of definition that have caused confusion and miscommunication between specialists in psychological measurement, on one hand, and lawyers, judges, many social scientists, and the public on the other. Documents such as Judge Peckham's decision in the Larry P. case or the amicus curiae brief filed by the U.S. Department of Justice in PASE v. Hannon (1980) suggest that the authors define bias quite differently from the measurement specialists. For many nonspecialists (accustomed, as noted earlier, to thinking that tests purport to measure innate ability), tests are biased if group differences in test scores can be attributed to average differences in environmental advantage enjoyed by children from different ethnic or socioeconomic groups. The issue for nonspecialists is not whether tests capture genuine differences in skill or developed ability between groups; it is whether these differences are caused by cultural factors. Thus the Justice Department attorneys (1980:17) in their post-trial memorandum supported the plaintiffs in PASE:

Plaintiffs argue that racial and cultural bias, demonstrated most graphically by the differences in the test scores by race, reflect differences in cultural patterns and levels of exposure to the dominant school culture between blacks and whites.

² To demonstrate an aptitude-treatment interaction it is necessary to use similar outcome measures for the various children and classes, or "treatments," being studied. If EMR children are exposed to curricula with goals that are radically different from those of the regular class (e.g., teaching self-help and vocational skills rather than academic skills), the use of common measures is pointless. The situation is further complicated if EMR children are in fact given individualized treatment, as required by current law.

Judge Peckham, in supporting his conclusion that tests are biased, cited testimony by witnesses for both plaintiffs and defendants to the effect that racial differences in IQ scores are culturally caused. For example, he wrote (Larry P. v. Riles, 1979, Section IVC):

. . . there was general agreement by all sides on the inevitability of cultural differences on IQ scores. Put succinctly by Professor Hilliard, black people have "a cultural heritage that represents an experience pool which is never used" or tested by standardized IQ tests.

To be sure, the cited documents contain additional discussion suggesting that the writers are aware of other aspects of bias more closely akin to the concerns of the specialist, which are discussed below. It is clear, however, that for these (arguably) representative nonspecialists, evidence for cultural causation of group differences in test scores is sufficient to establish bias in the tests themselves.
In effect, "bias," "cultural causation," and "unfairness" are equivalent concepts for many nonspecialists. From this perspective it seems unfair to categorize children or allocate educational opportunities on the basis of performance differences that are culturally caused, and it seems proper to characterize the instruments that effectuate this unfair categorization as biased.

For the specialist, questions of bias, fairness, and cultural causation are separate. In psychometric theory, bias is purely a measurement issue: A test is biased if and only if quantitative indicators of validity (internal structure and relationships to other variables) differ for different cultural groups. A test is held to be unbiased if these quantitative properties are invariant across cultural groups, even if different groups have different performance profiles due to differential opportunity and experience. The following quote from Jensen (1980:375) illustrates the strong methodological flavor of the measurement specialist's definition of bias and its kinship to mathematical definitions of the term:

In mathematical statistics, "bias" refers to a systematic under- or overestimation of a population parameter by a statistic based on samples drawn from the population. In psychometrics "bias" refers to systematic errors in the predictive validity or the construct validity of test scores of individuals that are associated with the individual's group membership.

This definition separates bias from fairness. It makes bias a purely technical issue. No matter how good a test is technically, there is room for disagreement concerning the decision rules to be applied when the test is used for selection, placement, or other purposes. Questions of fairness apply to these rules and to test use, not to tests themselves.³ (There have been a number of attempts to formulate explicit, quantitative criteria of "fairness" in the use of tests that show different performance profiles across social groups; see Petersen and Novick, 1976.)

Given this technical definition of bias, it is not inconsistent to argue that the use of a particular test for a particular purpose may be unfair even if the test is, in the sense defined, unbiased. For example, it is consistent to argue that IQ tests are racially unbiased measures of academic ability but that ability is affected by cultural experience and that it is, therefore, unfair to use IQ tests to make decisions that require inferences about innate potential. Thus a measurement specialist might agree with some of Judge Peckham's conclusions while rejecting the judge's claim that tests per se are biased.

What evidence could be adduced to show that IQ tests are unbiased in the technical sense, i.e., that tests are equally valid for children from different ethnic groups or markedly different socioeconomic and/or educational backgrounds?
The answer is that there is no direct way to demonstrate that a test is culturally unbiased. (Jensen, who has devoted a 740-page tome to showing that bias is not a significant factor in mental testing, concurs with this point.) However, it is possible to show that a test is biased, in any of a number of specific ways. Conversely, by systematically ruling out each of the known potential sources of bias, it is possible to reduce the plausibility of the hypothesis that a test is biased, though never to falsify the hypothesis in a strict sense.

Three potential sources of bias have received the lion's share of attention in the psychometric literature to date: (1) differences in performance induced by culturally sensitive features of the test situation, such as the race or dialect of the tester; (2) conspicuous differences across cultural groups in the difficulty of particular items or in other internal features of the pattern of responses generated by test items, which would indicate that the items do not tap the same underlying construct for different groups; and (3) differences in the external or predictive validity of tests for different groups.

³ The usage here is fairly common but not universal. I use the term "bias" to refer to all potential group differences in quantitative measures of validity and the term "fairness" to refer to issues of test use. Others, such as Cole (1981), use "bias" and "fairness" to refer to different types of potential quantitative discrepancy between groups.

BIAS IN THE TEST SITUATION

Many aspects of the test situation, aside from a child's actual skill or ability, are known or hypothesized to influence test scores. Any of these factors could in theory operate differentially by race, thereby artificially depressing the scores of black children relative to those of white children. The most complete list appears to be that in Chapter 12 of Jensen's (1980) book on bias and includes the following: familiarity with the particular test or type of test (coaching and practice); the race and sex of the tester; the language style or dialect of the tester; the tester's expectations about the child's performance; distortions in scoring or time pressure or lack thereof; and attitudinal factors such as test anxiety, achievement motivation, and self-esteem. Jensen characterizes the findings on the contribution of these aspects of testing to the racial gap in test scores as "wholly negative"; I would characterize them as equivocal, indicating a small degree of bias at best. (Jensen agrees that there is evidence for a language bias in the testing of bilingual children, but he denies the existence of bias due to racial dialects or any other bias linked to race.)

Many of these situational factors have statistically significant overall effects on test scores but show no interactions with race.
For example, coaching and practice together can boost an individual's IQ score by about nine points, if the individual is retested after a fairly short time interval on a test that is highly similar to the practice one. However, blacks and whites profit almost equally from coaching and practice. Blacks do not gain much more than whites, as one might expect assuming that blacks are initially less familiar with tests and test-taking strategies. (Actually, a close look at the data reported by Jensen suggests that in several studies blacks did gain a point or two more than whites on some tests, while in other studies or on other tests they gained less. It is unclear whether the different outcomes are random or reflect some underlying phenomenon worthy of investigation.) There is little in the reported data to suggest that familiarization with tests can eliminate more than a small

fraction of the IQ difference between the races. Not all of the other situational factors have significant overall effects on test scores, and none are as large as the effects of coaching and practice. More importantly, in no case is there a large interaction between a situational factor and race.

How can these equivocal-to-negative findings be reconciled with reports of large IQ gains when minority children who scored low are retested by persons of the same ethnic group under nonthreatening conditions? Cases of this sort have frequently been cited in the courts. There are at least two possible answers, with very different implications, indicating a need for research to resolve the issue.

One answer is that the people who retest children and boost their IQ scores drastically are merely making the test easier, e.g., by translating items containing difficult words into items with the same content but with easier words, by giving hints, by putting the most favorable interpretation on ambiguous answers, etc. Such changes in procedure may or may not be desirable, but the question of interest here is whether this approach to testing boosts the scores of minority children selectively. It might be the case that white, middle-class children would benefit as much as or more than minority children from equivalent changes in procedure. If they did, the changed procedures would have nothing to do with cultural or ethnic bias in tests. If minority children benefited more, the changed procedure would point to bias and indicate that something was wrong or missing in the studies cited by Jensen.

What might that something be? One answer, a second potential explanation of the discrepancy between the null findings reported by Jensen and the substantial increases in IQ that are often reported, lies in the training of testers and the conditions under which tests are administered.
It seems likely that the testers employed in research projects are particularly well trained, conscientious in their adherence to prescribed procedures, and sensitized to issues of bias. It may well be the case that situational distortions are minimized when such testers operate under such conditions. In contrast, it seems likely that school psychologists often work under considerable administrative pressure and less than optimal testing conditions, and their evaluations are less open to scrutiny by other professionals. If so, testing errors in general and bias in particular seem more likely to occur under "field" conditions. Some of the large increases attributed to retesting may have been genuine corrections of testing errors that would not have occurred in research settings. Studies that systematically compare the effects of the test situation on minority and white children under research and field conditions are needed to choose between the two explanations.
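The distinction drawn in this section between the overall effect of a situational factor and its interaction with race can be made concrete with a small calculation. The Python sketch below is illustrative only: the scores are invented, the labels `group_a` and `group_b` are hypothetical, and the function name is an assumption introduced here rather than a procedure taken from the studies cited. It computes each group's mean gain under a changed test condition, the average of the two gains (the overall effect), and the difference between them (the interaction).

```python
from statistics import mean


def effect_and_interaction(scores):
    """Return (overall_effect, interaction) for a 2 x 2 design.

    `scores` maps (group, condition) -> list of test scores, with groups
    "group_a"/"group_b" and conditions "standard"/"changed".
    """
    cell_means = {key: mean(values) for key, values in scores.items()}
    gain_a = cell_means[("group_a", "changed")] - cell_means[("group_a", "standard")]
    gain_b = cell_means[("group_b", "changed")] - cell_means[("group_b", "standard")]
    overall_effect = (gain_a + gain_b) / 2   # average gain across groups
    interaction = gain_a - gain_b            # does one group gain more?
    return overall_effect, interaction


# Invented scores, chosen only to illustrate the arithmetic.
scores = {
    ("group_a", "standard"): [85, 90, 95],
    ("group_a", "changed"): [94, 99, 104],
    ("group_b", "standard"): [100, 105, 110],
    ("group_b", "changed"): [109, 114, 119],
}

overall, interaction = effect_and_interaction(scores)
print(f"overall effect: {overall:.1f} points, interaction: {interaction:.1f} points")
```

In these invented data both groups gain the same amount, so the overall effect is sizable while the interaction is zero and the gap between the groups is unchanged. Evidence of situational bias in the sense discussed here would require a nonzero interaction, not merely a nonzero overall effect.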

ITEM ANALYSIS: RUBY IS A RED HERRING

Curiously, many critics and some proponents of testing share an exaggerated faith in the analysis of individual test items as a method for assessing cultural bias. In fact, item analysis is useful in addressing only limited aspects (and, as it happens, relatively unimportant aspects) of test bias.

A common approach to item analysis, which might be called "editorial," is to analyze the face content of items on logical or semantic grounds or on the basis of apparent or presumed connections to particular subcultural milieux. Judge John F. Grady's decision in PASE v. Hannon (1980) provides a dramatic and socially significant illustration of this approach. Setting aside a variety of statistical and empirical arguments for and against the use of tests in placing black children in EMR classes, the judge chose instead to examine test items individually and to decide in each case whether the item appeared a priori to present special difficulties for black children, rather like Judges Woolsey and Bryan, who read Ulysses and Lady Chatterley's Lover, respectively, to decide whether they were pornographic. Thus Judge Grady rejected the test question "What is the color of a ruby?" on the grounds that "Ruby" is a common name in the black community; hence, the name of the gem might be mistaken for a proper name and the child might answer "black." However, his "item analysis" led the judge to accept all but a few items on the Stanford-Binet and WISC-R and to uphold the use of these tests in educational placement by the Chicago public schools. Others have drawn diametrically opposed conclusions from similar editorial item analyses (e.g., Hoffman, 1962).

One obvious flaw in this approach is that it places "bias" in the eye of the editor, and different editors disagree.
More important is the fact that judgments about item content (even if there is agreement) are neither necessary nor sufficient to prove that particular items discriminate against black children, in the sense of lowering their test scores. An apparently innocent item can be disproportionately difficult for minority children compared with whites, while an item that is problematic on its face can be equally difficult for all ethnic groups.

The foregoing sentence implicitly establishes one standard by which professionals in test construction determine whether items are biased: They examine proportions of children from different ethnic groups who get each item correct; when an item deviates markedly from the overall profile for any group (an item × group interaction), that item is assumed to confer an unacceptable advantage or disadvantage for one group or the other and is deemed to be "biased" in this precise and limited sense.

Related psychometric approaches to assessing item bias focus on item-scale correlations and the factor loadings of items. If correlations or

loadings for particular items differ conspicuously for children from different ethnic groups, such items are suspect on the grounds that they do not appear to measure the same construct for the various groups.

Item analyses performed on IQ tests have tended to show that most individual items show about the same gap in performance between whites and other ethnic groups. There are statistically significant item × group interactions, but they are trivially small relative to overall group differences (Mercer, in press; Sandoval, 1979). Factor structures show only minor differences for most major ethnic groups (Reschly and Reschly, 1979). If there is bias in IQ tests, it is pervasive and not primarily linked to a few offending items. But bias can indeed be pervasive. It is possible that all items on a test systematically understate the abilities of minority children. Item analyses of the kind described cannot rule out this possibility.

In short, criticisms of tests based on the content of individual items are misplaced, insofar as those criticisms are meant to imply that particular, "culturally loaded" items account for the differential test performance of children from different ethnic groups. On the other hand, defenses of tests based on item analyses fail to address the issue of pervasive or global test bias. An independent case can be made that "editorial" or content bias in test items should be eliminated in order to enhance the credibility and acceptability of tests among minority cultural groups,⁴ but current evidence does not warrant optimism that editorial changes will reduce differential performance.

DIFFERENTIAL PREDICTIVE VALIDITY

The logic of predictive validation of tests was explicated and critically examined earlier.
A straightforward extension of that logic makes differential predictive validity a measure of bias, in a precise but rather narrow sense: If a test is a valid measure of some trait or skill for some social groups but not others, and if an independent criterion measure of the same trait or skill exists, it follows that the test should predict the criterion for those groups for which it is valid and fail to predict the criterion for those groups for which it is invalid. For example, if IQ tests measure intellectual skills or abilities more accurately for white children than for black children, IQ should correlate more highly with measures of future school success for whites than for blacks. Thus an empirical demonstration of differential predictive accuracy would tend to confirm the hypothesis of cultural bias, although bias in the test itself is not the only possible explanation for differential prediction. (For example, differential prediction could arise if tests measured ability accurately for both blacks and whites but the school performance of blacks was adversely affected by teacher attitudes and behavior.)

⁴ There exist flagrant examples of racially offensive content in widely used tests. For example, prior to a recent revision, one popular test of "receptive vocabulary" incorporated only two pictures of black people among numerous pictorial stimulus items: a Pullman porter and a Sambo figure. A case can clearly be made against the use of such materials without regard to their effects on performance.

This question of differential validity can be addressed most clearly within the framework of statistical methods used to assess predictive power. In statistical terms, the question "Does a given test have equal predictive validity for blacks and whites?" translates into the questions "Do regression lines (relating the test to the criterion) for the two groups coincide, i.e., have the same slope and the same intercept?" and "Are the standard errors of estimate similar for the two groups?" The first question has to do with whether the test predicts the same level of success on the criterion variable (e.g., school grades) for blacks and whites who score the same on the test. The second question has to do with whether the margin of error in predicting individual performance on the criterion is equal for both groups or greater for one group than the other.

These issues have been explored fairly extensively in a series of studies on the differential predictive validity of various ability tests applicable to young adults, such as the Scholastic Aptitude Test, the Law School Admission Test, and numerous tests of job aptitude. The criterion variables in these studies were college grades, law school grades, supervisor ratings on the job, and other indices of job performance. This literature was reviewed in a paper by Robert Linn (1982), commissioned by the Committee on Ability Testing of the National Research Council.
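The two statistical questions posed above (whether the groups' regression lines coincide, and whether their standard errors of estimate are similar) can be illustrated with a short computation. The Python sketch below is a minimal illustration under stated assumptions, not a reconstruction of any study's analysis: the test scores and grades are invented, the labels `a` and `b` are hypothetical, and the helper names `fit_line` and `mean_prediction_error` are introduced here for convenience. It fits an ordinary least-squares line within each group and then asks how far one group's actual criterion scores fall from the predictions of the other group's line.

```python
from math import sqrt


def fit_line(x, y):
    """Ordinary least-squares fit of y on x.

    Returns (slope, intercept, standard_error_of_estimate).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    see = sqrt(sum(r * r for r in residuals) / (n - 2))
    return slope, intercept, see


def mean_prediction_error(x, y, slope, intercept):
    """Mean of (actual - predicted); negative values indicate overprediction."""
    return sum(yi - (intercept + slope * xi) for xi, yi in zip(x, y)) / len(x)


# Invented data: identical test scores, criterion on a grade-point scale.
test_scores = [80, 90, 100, 110, 120]
grades_a = [2.0, 2.1, 2.5, 2.7, 3.2]
grades_b = [1.7, 1.8, 2.2, 2.4, 2.9]

slope_a, intercept_a, see_a = fit_line(test_scores, grades_a)
slope_b, intercept_b, see_b = fit_line(test_scores, grades_b)

# Apply group a's line to group b to see whether it over- or underpredicts.
error_b = mean_prediction_error(test_scores, grades_b, slope_a, intercept_a)

print(f"group a: slope={slope_a:.3f} intercept={intercept_a:.2f} SEE={see_a:.3f}")
print(f"group b: slope={slope_b:.3f} intercept={intercept_b:.2f} SEE={see_b:.3f}")
print(f"mean error for group b under group a's line: {error_b:.2f}")
```

In these invented data the two lines have the same slope and the same standard error of estimate but different intercepts, so applying one group's line to the other yields a systematic error in one direction. Differences in slope or in the standard error of estimate would show up in the corresponding fitted values.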
Linn concludes that these studies consistently show that test scores overpredict the future success of blacks relative to that of whites; that is, blacks do less well in school or on the job than whites with similar test scores. There is also a tendency for the regression line for blacks to slope less steeply than the line for whites, so that overprediction is greatest for blacks who achieve the highest test scores.

With respect to the margin of error in prediction, Linn concludes that the evidence is less consistent but tends to show that tests predict less accurately for blacks than for whites, by a small margin in most studies. For example, 34 reported estimates of the multiple correlation between college aptitude tests and freshman or first-semester college grade averages yield a median of .302 for blacks and .385 for whites. Differences in predictive accuracy are essentially nonexistent for the Law School Admission Test and for most job-related tests; however, one large Air Force study found that the median correlation (across 39 different job areas) between the

Armed Forces Qualification Test and grades in Air Force technical training was .33 for whites and only .18 for blacks.⁵

Data like those presented by Linn, suggesting overprediction of black scholastic success and roughly comparable errors of estimate, were cited in defense of IQ testing in Larry P. and PASE. However, an obvious question that arises, in light of the matters at issue in those cases and of the mission of the panel, is whether the findings apply to children of elementary and secondary school age, particularly those from minority groups who score low enough on the IQ scale to be candidates for placement in EMR classes. Unfortunately, there are surprisingly few studies of the differential predictive validity of IQ tests for black and white children of school age, and fewer still that present regression data necessary to examine issues of underprediction and overprediction. (Most present only correlations.)

As indicated in the earlier section on predictive validity, correlations between IQ scores and scores on standardized achievement tests are generally quite high. Typically they are only slightly lower for minority children than for whites (see Sattler, 1974, for some representative findings). Correlations with grades are typically lower and are less consistent across studies, in part for the technical reasons mentioned earlier. Correlations reported for black children range from a high of .6-.7 (Sattler, 1974) to a low of zero (Green and Farquhar, 1965, quoted in Jensen, 1980:474). Correlations reported for whites are generally as high as or higher than those for blacks, sometimes substantially higher in the studies that find the lowest correlations for blacks (e.g., Goldman and Hartig, 1976; Green and Farquhar, 1965; Mercer, 1979). Goldman and Hartig, for example, found a correlation of only .27 between WISC IQ and later grades for a large sample of elementary school children in California.
For subsamples of black and Mexican-American children, correlations are in the range of .12-.18. Mercer (1979) reports correlations of .46 for Anglo children and .20 for black and Mexican-American children in an overlapping sample drawn from the same California school district. Judge Peckham gave considerable weight to the Mercer and Goldman and Hartig results, although the latter have been criticized on methodological grounds similar to those mentioned earlier (e.g., by Messe et al., 1979). Mercer herself points out that her differential correlations are due in part to restricted variance in both WISC scores and grades in the minority samples; however, she also points out that essentially the same results are obtained when a semantic differential rating of student competence by teachers (which does not suffer from range restriction) is used as the criterion variable.

⁵ It must be kept in mind that observed differences in so-called validity coefficients (test-criterion correlations) are affected by statistical factors that have nothing to do with "validity" as the word is commonly understood. In particular, if the range of variation in the test score or criterion is less for blacks than for whites, the correlations between the test scores and the criterion are lowered. Validity coefficients are, therefore, not always comparable. Close examination of some of the data presented here and elsewhere in the text indicates that relatively low validity coefficients reported for minority children are in fact due in part to restricted variance in the IQ, the criterion, or both.

I have encountered only three studies that present full regression information for school-age children from different ethnic groups (Parr et al., 1971, quoted in Jensen, 1980:475-476; Mercer, 1979; Reschly and Sabers, 1979). Parr et al. examined the predictive validity of the California Test of Mental Maturity for black and white secondary school students, using grades and various teacher ratings as criterion variables. Reschly and Sabers used the WISC-R as a predictor and achievement test scores as criteria; their sample was a group of children in grades 1-9 and included Anglos, blacks, Chicanos, and Native American Papagos. Mercer's analysis, based on data from Goldman and Hartig, used the WISC (not the WISC-R) as a predictor and used grades as criteria; the sample included Anglo, black, and Hispanic children.

The Parr et al. and Reschly and Sabers studies produced complex patterns of results, varying with the ages of the children involved. On balance they indicated only minor differences in prediction for Anglos, blacks, and, in the Reschly and Sabers study, Hispanics. When patterns differed, they often revealed overprediction for blacks and underprediction for Anglos.⁶ The Mercer analysis was unique in finding worse overall prediction, worse prediction for blacks and Hispanics than for Anglos, and underprediction of grades for minority children with IQs below the mid-70s, the range likely to be found among children being evaluated for placement in EMR classes.
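The effect of restricted variance on validity coefficients, noted in connection with the correlations reported for minority samples, can be demonstrated with a brief simulation. The Python sketch below rests entirely on assumptions introduced here for illustration: test scores are drawn from a normal distribution with mean 100 and standard deviation 15, the criterion is a linear function of the test score plus random error, and the "restricted" sample is simply the cases scoring below 100. None of these values comes from the studies discussed. The same test-criterion relationship holds throughout; only the range of test scores differs.

```python
import random
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / sqrt(sxx * syy)


def simulate(n=2000, seed=0):
    """Simulated (test, criterion) scores with one fixed underlying relationship."""
    rng = random.Random(seed)
    test = [rng.gauss(100, 15) for _ in range(n)]
    criterion = [0.04 * (t - 100) + rng.gauss(0, 0.8) for t in test]
    return test, criterion


test, criterion = simulate()

# Full-range sample versus a sample restricted to the lower half of test scores.
r_full = pearson_r(test, criterion)
restricted = [(t, c) for t, c in zip(test, criterion) if t < 100]
r_restricted = pearson_r([t for t, _ in restricted], [c for _, c in restricted])

print(f"full range:       r = {r_full:.2f} (n = {len(test)})")
print(f"restricted range: r = {r_restricted:.2f} (n = {len(restricted)})")
```

The restricted sample yields a lower correlation even though the rule generating the criterion from the test score is identical in both samples, which is why differences in validity coefficients between groups with different score ranges cannot by themselves be read as differences in validity.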
Mercer's findings suggest that, if the same cutoff scores were used to place children in EMR classes, minority children in those classes would be more academically able than their white counterparts. However, the findings are subject to some of the same caveats mentioned above in connection with the validity coefficients reported by Mercer.

⁶ Gordon (1980) reports partial results of a regression study, in which he found overprediction for Mexican-American students. Messe et al. (1979) report an analysis of data from a large, all-white British sample that revealed overprediction of grades for children of low socioeconomic status.

In sum, within the measurement specialist's precise but narrow empirical framework for assessing bias, there are only a few studies indicating a relatively modest amount of distortion in test scores of minority children, within the range of scores and ages most relevant to the panel's work. There is at best scattered evidence for bias in aspects of the testing situation external to the test itself; however, this issue merits further study under field conditions. There is little evidence that bias lodges in particular test items, but this does not preclude the possibility of generalized bias across all items. In general there has not been consistent evidence for differential predictive validity of tests across ethnic groups, although such evidence has been found in several influential but controversial studies.

On balance it must be concluded that bias in the technical sense contributes little either to explaining group differences in IQ or to shaping placement policy. No study I have encountered suggests that the magnitude of any bias effect, or even several combined, comes close to explaining all of the differences in IQs between whites and minorities. It is unlikely that elimination of psychometric bias, in the absence of other changes in policy and practice, would have much effect on the IQ scores of minority children or the proportion assigned to EMR classes.

It is important to recognize the limited import of this conclusion. The conclusion relates only to technical bias and says nothing about fairness in test use or about ethnic or racial bias in the interpretation of test scores or bias in the educational system or in society at large. Psychometric investigations of bias do not address many of the larger concerns of educators, policy makers, and the public, most of whom use the term "bias" more broadly than the technical definition allows.
For example, these investigations ignore the problem of bias in the criteria: If school grades and/or achievement test scores underestimate the academic attainment of minority students, as tests allegedly underestimate their abilities, it would be no justification of testing, from a moral or policy standpoint, to find that prediction was perfect. In addition, as we saw at the beginning of this section, many persons outside the field of psychological measurement define bias as any contribution of sociocultural factors that raise or lower the IQ scores of one group relative to another. There is simply no doubt that there is some cultural contribution, as even the firmest believers in genetic determination of IQ would admit. I take up the issue of the relative size of this contribution in the next section, but I also argue that the issue is less important for policy in the area of educational placement than it may seem.

WHAT CAUSES INDIVIDUAL AND GROUP VARIATIONS IN IQ?

No question in psychology has provoked more bitter debate than that surrounding the determinants of variation in IQ scores. In recent years the controversy has centered on the relative contributions of heredity and environment to the 15-point average difference usually found between the IQ scores of blacks and whites. I survey some of the main lines of evidence briefly and then consider the relevance of the entire debate for educational policy and practice.

The hereditarian viewpoint has had a sporadic history in psychology generally and in the field of IQ testing particularly. Alfred Binet, whose work in the Paris schools in the early 20th century initiated modern ability testing, vociferously denied that his test measured innate ability. However, many of the American and British psychologists who translated, modified, and used Binet's instrument took the contrary view. Some expressed their opinions in the public-policy arena and were associated with the eugenics and anti-immigration movements (Kamin, 1974). As we have seen, the assumption that "IQ tests" measure or are supposed to measure innate intelligence is still shared by many outside the measurement field, although most professionals in the field reject it.

Arthur Jensen's article in the Harvard Educational Review (1969) revived the hereditarian viewpoint within the field and provoked a debate that still continues. Jensen's paper attempted to show that IQ tests measure general intellectual ability, that this ability is of great social importance, and that educational intervention has relatively little effect on individual differences in IQ. Examining correlations among IQs of persons in various biological kinship relations, Jensen concluded that the data can be well explained by postulating that intelligence is a polygenic trait and that 80 percent of its phenotypic variation is due to underlying genotypic variation. Others, using similar techniques of "heritability estimation" but with somewhat different models, assumptions, or data, have arrived at lower estimates, in the neighborhood of 0.5 (e.g., Jencks et al., 1972; Plomin and DeFries, 1980).
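As a minimal sketch of what such a heritability coefficient expresses, assume the simplest additive model, in which genetic and environmental contributions are uncorrelated and do not interact (an assumption that much of the literature reviewed here disputes). Under that assumption the coefficient is simply the genetic share of total phenotypic variance. The function names below are hypothetical, and the standard deviation of 15 IQ points is a conventional value assumed for illustration rather than one stated in this paper; the fragment only shows the arithmetic behind the 0.8 and 0.5 figures.

```python
# Illustrative arithmetic only: a simple additive model with no
# gene-environment correlation or interaction (an assumption).

IQ_SD = 15.0  # assumed conventional standard deviation of IQ scores


def heritability(var_genetic: float, var_environment: float) -> float:
    """Return the share of phenotypic variance due to genotypic variance."""
    return var_genetic / (var_genetic + var_environment)


def split_variance(h2: float, total_variance: float = IQ_SD ** 2) -> tuple[float, float]:
    """Split total phenotypic variance into genetic and environmental parts."""
    var_genetic = h2 * total_variance
    return var_genetic, total_variance - var_genetic


if __name__ == "__main__":
    # The two estimates mentioned in the text: about 0.8 and about 0.5.
    for label, h2 in [("Jensen (1969)", 0.8), ("lower estimates", 0.5)]:
        var_g, var_e = split_variance(h2)
        print(f"{label}: genetic variance {var_g:.1f}, "
              f"environmental variance {var_e:.1f}, "
              f"ratio {heritability(var_g, var_e):.2f}")
```

The two estimates differ only in how the same total variance is apportioned, which is why modest changes in model, assumptions, or data can move the coefficient substantially.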
One thorough and dispassionate review (Loehlin et al., 1975) reached a summary estimate only a little lower than Jensen's for the heritability of individual variations in IQ within European and Caucasian populations. The reviewers found that estimates of heritability within the black population were less consistent and often lower than estimates for whites, although they still pointed to a substantial genetic component. However, Loehlin et al. note that there is considerable room for disagreement about the technical details of heritability calculations; existing evidence is hence consistent with a very broad range of within-group heritability coefficients.

A number of factors create difficulties for the statistical techniques, borrowed from population genetics, that are used to estimate heritability. For example, one widely noted problem is the confounding of heredity and environment: Innately bright parents are likely to provide their children with a lot of intellectual stimulation; innately bright children are likely to elicit stimulation from others and to find or create it in their physical environments (Scarr, 1981). Similarly, patterns of biological relationship are likely to mirror patterns of environmental similarity. For example, cousins share fewer genes than siblings, but they are also likely to grow up in less similar environments. As Loehlin et al. (1975) point out, most techniques for estimating heritability confound the purely genetic contribution with the contribution of the gene-environment correlation. To get a meaningful heritability estimate for a given trait in a given population, it is necessary to sample the relevant ranges of genotypes and environments and to specify correctly the statistical model that describes their separate and joint contribution to the phenotype. Some skeptics (e.g., Layzer, 1972) doubt that techniques of heritability estimation can be legitimately applied to IQ data, given the limitations of existing data, the imprecision of existing definitions and theories of intelligence, and our ignorance about possible environmental influences and gene-environment interactions.

Probably this rather arcane controversy over the proper use of statistics in estimating the heritability of traits would have aroused little public attention had Jensen not gone beyond his discussion of individual differences in IQ to speculate that group differences, specifically black-white differences, are also partly genetic in origin. Jensen wrote (1969:82):

So all we are left with are various lines of evidence, no one of which is definitive alone, but which, viewed all together, make it a not unreasonable hypothesis that genetic factors are strongly implicated in the average Negro-white intelligence difference.
The preponderance of the evidence is, in my opinion, less consistent with a strictly environmental hypothesis than with a genetic hypothesis, which, of course, does not exclude the influence of environment or its interaction with genetic factors.

This conjecture was not based on direct examination of data on the causes of racial differences but rather was an extension of Jensen's main discussion, which, as already noted, dealt with individual differences within ethnic groups. Jensen's critics have stressed that average group differences in a particular trait can be due mostly or entirely to the environment even if the heritability of the trait within groups is very high.

In an attempt to address the issue of between-group variance as directly as possible, Loehlin et al. (1975) reviewed a number of studies relating IQ to various indices of racial mixture. Some of these studies examined correlations between IQ and race-linked characteristics such as skin color and blood-group distributions. Others examined IQ distributions associated with various patterns of interracial mating. One particularly interesting study traced the genealogies of black children with extremely high IQs (and found no evidence for increased European admixture, compared with the black population at large). While careful to point out that the results of these studies "are consistent with either moderate hereditarian or environmentalist interpretations," Loehlin et al. (1975:238) suggest that the findings are "more easily accommodated in an environmentalist framework." (In an appendix they estimate between-group heritability at .125, though the estimate is cautious and tentative.)

A similar conclusion can be reached regarding other studies, indicating that the size of the IQ gap between blacks and whites is inversely related to the degree of the black child's exposure to white, middle-class culture and schooling. These include classic studies of black families who migrated from the rural South to the urban North (Klineberg, 1935), studies of interracial adoptions (Scarr and Weinberg, 1976), and studies of the effects of sociocultural variations within the black community (Mercer, 1979).

The foregoing cursory glance at a large and complex literature will not satisfy either supporters or critics of the hereditarian position. It merely indicates some of the areas in which scientific controversy exists. The important points for purposes of this discussion are (1) that controversy does exist; science has not yet provided definitive answers to the nature-nurture question, and perhaps never will, and (2) that virtually everyone involved in the controversy agrees that both genetic and experiential factors influence IQ; what is at issue is the degree of influence and the mechanisms involved. The relevant question is whether there are policy decisions or practices having to do with educational placement or instruction that hinge on resolution of the issue.

Courts have held that the issue is indeed central.
In Larry P., for example, Judge Peckham argued that EMR classes are (according to definitions adopted by the California Department of Education)⁷ intended for children who are congenitally unable to learn in regular classes; to be valid for purposes of placing children in such classes, the judge reasoned, tests must be capable of identifying congenital disability. (See Larry P., Sections IIIC and VB(4), and the analysis of the decision by Smith, 1980.)

⁷California's EMR classes were intended for "pupils whose mental capabilities make it impossible for them to profit from the regular instructional programs" (Larry P. v. Riles, 1979, Sec. IIIC). EMR children were distinguished (in a 1963 law) from "culturally disadvantaged minors," who are "potentially capable of completing a regular educational program" but unable to do so because of "cultural, economic and like disadvantages." EMR children were also distinguished from "educationally handicapped" children, who "cannot benefit from the regular educational program" because of "marked learning or behavioral disorders or both" (Larry P. v. Riles, 1979, Sec. IIIB). Given the historical definitions of the latter two categories, Judge Peckham not unreasonably construed the EMR category as applying to children who are congenitally unable to learn.

The assumption that mental retardation is by definition innate is one that professionals concerned with the problem abandoned long ago. The American Association on Mental Deficiency, for example, cites "significantly subaverage general intellectual functioning" and "deficits in adaptive behavior" as the defining conditions (Grossman, 1977:5). It can, of course, be debated whether this is an appropriate definition or whether IQ is an appropriate measure of intellectual functioning. Nevertheless, given the definition, it is not necessary to show that the deficient intellectual functioning (arguably) signalled by a low IQ is inborn in order to say that a child is "mentally retarded" according to the stated functional criteria. The medical profession has been more explicit in defining mental retardation as a purely functional category that may have many different causes, experiential as well as organic. (For a lucid discussion of contemporary definitions see Goodman, 1977.) It appears that there is a wide gap between the assumptions and definitions embraced by leaders in the field and those embodied in administrative procedures of some school systems. The latter assumptions apparently guided the Larry P. decision.

Professionals have abandoned the organic definition of mental retardation in favor of the functional definition for both scientific and moral reasons that seem compelling. Organic causes can be identified in a small proportion of cases of mild mental retardation. However, there is no evidence that different educational procedures are needed, or work better, for organically disabled children, compared with other children with similar functional abilities but no (known) organic deficit (Goodman, 1977).
There is no evidence that it is any easier to teach the latter group than the former or that their prognosis for future success is any worse. Good teaching can do a great deal to help even children with organic disabilities meet their potential; conversely, poor performance that is socially caused is just as hard to correct as poor performance that is organically caused, at least up to the limits of present scientific knowledge and instructional techniques.

Moreover, different views of the relative contributions of genetic and environmental factors in no way affect the responsibility of schools to provide the best instruction possible. There will always be differences in ability and achievement among students, and schools will always have to deal with these differences, regardless of their causes. To be sure, schools face difficult questions about how to allocate resources among students with different levels of developed academic ability. However, there is apparently no basis in current knowledge for believing that investment in the education of students of low ability with environmentally caused deficiencies will pay off (in future performance or social contribution) more than investment in the education of those with congenital disorders.

If it is indeed the case that treatment of educational disability is independent of the cause of the problem, it is hard to see why different beliefs about the relative contributions of genes and environment to IQ should have any educational import. Earlier we saw that a wide range of academic performance is consistent with any given IQ score. The job of the educator is to make sure that performance is as good as it can be. Though a teacher, administrator, or policy maker with hereditarian views might be pessimistic about the likelihood of large gains in underlying intellectual ability, this pessimism would be no justification for failing to impart as many skills and as much knowledge as possible. I am not denying that negative expectations can potentially do harm; they probably can, whether they are based on beliefs in genetic or cultural inferiority of minority groups. I am arguing that they should not be allowed to do harm, that such beliefs provide no legitimate basis for educational policies or practices that would in any way restrict children's progress. Decisions about curricula and teaching methods to be used with children at different levels of initial performance, as well as decisions about whether to teach these children separately or together, can and should be based on the demonstrated pedagogical effectiveness of the various approaches, not on preconceptions about the causes of initial differences in performance.

Finally, one's position on the nature-nurture question gives little or no guidance as to the degree of racial imbalance in special education placement that one should be willing to tolerate.
As long as there are separate classes or programs for children who are significantly lacking in traditional academic skills, both environmentalists and hereditarians would expect minority children to be overrepresented in such classes, at least for the immediate future.

Critics of IQ testing and EMR classes (e.g., the plaintiffs in Larry P. and PASE) have argued that the nativist connotations of terms such as "intelligence" and "mental retardation" are deeply ingrained. Children are harmed because people misinterpret the meaning of IQ scores and EMR placements, stigmatizing children and denying them educational opportunities. None of the evidence reviewed in this section bears on the truth or falsity of such claims. The arguments in this section of the paper have not dealt with the actual political and educational consequences of hereditarian versus environmentalist views. The arguments have been intended to make one fundamental point: Given current knowledge, there is no logical or necessary connection between the heritability of IQ and educational practice.
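The statistical point raised earlier in this section by Jensen's critics, that a trait can be highly heritable within groups while the average difference between groups is entirely environmental, can be illustrated with a toy calculation. The sketch below is not a model of any real population: the genotype and environment values are invented, the two hypothetical groups are given identical genotype distributions by construction, and the 15-point environmental shift simply echoes the average gap cited in this section.

```python
from itertools import product
from statistics import mean, pvariance

# Invented values. Every genotype effect is paired with every environment
# effect, so the two are exactly uncorrelated within each group.
GENOTYPE_EFFECTS = [-12.0, 0.0, 12.0]
ENVIRONMENT_EFFECTS = [-6.0, 0.0, 6.0]


def make_group(environment_shift: float) -> list[tuple[float, float]]:
    """Return (genotype effect, phenotype score) pairs for one group."""
    return [
        (g, 100.0 + g + e + environment_shift)
        for g, e in product(GENOTYPE_EFFECTS, ENVIRONMENT_EFFECTS)
    ]


def within_group_heritability(group: list[tuple[float, float]]) -> float:
    """Genetic variance divided by phenotypic variance inside one group."""
    genotypes = [g for g, _ in group]
    phenotypes = [p for _, p in group]
    return pvariance(genotypes) / pvariance(phenotypes)


# Both hypothetical groups share the same genotype distribution;
# group_b differs only by a uniform environmental disadvantage.
group_a = make_group(environment_shift=0.0)
group_b = make_group(environment_shift=-15.0)

gap = mean(p for _, p in group_a) - mean(p for _, p in group_b)
genetic_gap = mean(g for g, _ in group_a) - mean(g for g, _ in group_b)

if __name__ == "__main__":
    print("within-group heritability, group A:", within_group_heritability(group_a))
    print("within-group heritability, group B:", within_group_heritability(group_b))
    print("average score gap:", gap)
    print("average genetic gap:", genetic_gap)
```

In this constructed case the within-group coefficient is high in both groups, yet none of the average gap is genetic. The example shows only that the two quantities are logically separate; it says nothing about which situation holds in fact.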

CONCLUSIONS

Two kinds of conclusions have been sprinkled liberally throughout this paper and need not be repeated here: judgments about the weight of the scientific evidence on various empirical issues that have been raised and value-based arguments about the implications of these judgments for education policy. In this final section I will draw a few more general lessons and reflect on their implications for the work of the panel.

One general lesson is that there is less articulation between the concerns of the public and the concerns of specialists in psychological measurement than might be expected, given their common agreement on the importance of the issues. Specialists have succeeded in formulating and answering an array of specific questions regarding aspects of test validity, bias, and the like. Other questions, however, remain ill formulated or unanswered; many of the latter questions are important to the nonspecialist and figure in his or her legitimate definition of validity, bias, etc., even if they do not figure in specialists' definitions. By the same token, nonspecialists, including some who are highly knowledgeable about education policy and legal aspects of testing, have often failed to recognize scientifically important distinctions among possible interpretations of connotatively loaded terms, such as intelligence, validity, and bias.

A second lesson is that standardized ability tests, as currently conceived and constructed, will inevitably contribute to disproportionate placement of minority children in classes for mildly mentally retarded students (or classes by any other name that are designed to serve children whose prognosis for success in school is poor). The reasons for this bleak conclusion are deeply rooted in the natures of the tests, of the schools, and of society.
As long as new tests are built on the same logic as old ones, namely the logic of inferring ability from achieved performance across a wide variety of specific "intellectual" tasks, they will continue to tell us what we already know: that children who grow up outside the mainstream are likely to have trouble in school. They will not help us resolve the ambiguities of potential and achievement, of nature and nurture, that plague the existing tests. There are some new, experimental approaches to testing based on Piagetian developmental theory, on direct observation of the child's learning in novel situations, and even on measures of neurological functioning, such as electroencephalograms. It is impossible to say at this juncture, however, how much hope we should pin on them. For the foreseeable future, decisions about public policy and educational practice will have to be based on tests as they are.

Fortunately, many such decisions can be made in the face of a great deal of ambiguity about the meaning of tests. This is the third and most important lesson to be drawn from this paper. Debates over validity, bias, and the causes of group differences have a hypnotic quality because of the connotations of the word "intelligence" and the specter of genetic predestination. But the debate distracts our attention from what should be our central concern, namely how to improve education, particularly for children who are not doing well in the school system as it currently exists.

It is striking that some scholars who disagree fundamentally about the nature of IQ tests, such as Jane Mercer and Arthur Jensen, are in agreement about many aspects of the proper use of tests in evaluating children for placement in classes for mentally retarded children. Both Mercer and Jensen agree that tests tell us something about a child's level of school functioning and that they deserve a place in an assessment battery. Both agree that full diagnostic assessment should take place only when children have had trouble in the classroom; tests and other assessment procedures should not be used as general screening devices. Both agree that IQ tests alone should not determine placement but should be used in tandem with information about other characteristics of the child, notably the child's capacity to function in nonschool environments and roles, and the presence of any neurological, sensory, or other physical problems. To be sure, Mercer and Jensen would disagree about using information on the child's sociocultural background to interpret or adjust IQ scores, but the areas of agreement are substantial. It seems that serious theoretical disagreements are consistent with surprisingly similar practical recommendations. If so, one can only wonder about the wisdom of dragging the theoretical disagreements into the courts.
One consequence of the current focus of debate is that judges have been forced to deliberate about scientific controversies that they are ill equipped to consider. It is not surprising that their conclusions are sometimes contradictory. But judges (and policy makers) are well equipped to consider other kinds of issues: given the ambiguous meanings of test scores, and given the consequences of placement, is it consistent with established legal standards of fairness to use tests as placement devices? Are some uses fair, while others are not? This way of framing questions puts them squarely in the court of values and legal definitions and precedents.

The consequences of placement will surely play a central role in any such deliberation. Regardless of the intrinsic merits of tests or alternative placement procedures, it is hard to justify the use of any device to sort children or prescribe educational programs, unless there are demonstrated educational benefits attached to the sorting or prescription. In Larry P., Judge Peckham concluded that EMR classes are "educationally dead-end, isolated and stigmatizing." Given the issues raised in the case, it was necessary for the judge to go on to examine discrimination in placement procedures; had his purpose been to decide whether schools and society were meeting their responsibilities, however, he need not have looked further. If "special" classes (particularly EMR classes) convey no special benefits and involve no remedial instruction, it is hard to justify placing any children in them, regardless of race. If minority children are overrepresented in such classes, they are being disproportionately harmed; the basis for placement doesn't much matter. If, on the other hand, special classes do convey demonstrable benefits, disproportionate placement does not represent disproportionate harm. The benefits of the classes must be weighed against their costs, e.g., the cost of separateness per se.

If we are going to fight about IQ tests (or EMR classes) we should be fighting about what they do or do not contribute to learning. Proponents should try to show that tests give information, not available through other practical means, that can be used to match instruction to children's performance. Opponents should be trying to show that there are better ways to channel children into the most effective instructional situations. If the panel can help refocus public debate in this manner, it will have done a great service.

REFERENCES

Bersoff, D. N. 1979 Regarding psychologists testily: regulation of psychological assessment in the public schools. Maryland Law Review 39(1):27-120.
Cleary, T. A., Humphreys, L. G., Kendrick, S. A., and Wesman, A. 1975 Educational uses of tests with disadvantaged students. American Psychologist 30:15-41.
Cole, N. S. 1981 Bias in testing. American Psychologist 36:1067-1077.
Crano, W. D., Kenny, D. A., and Campbell, D. T. 1972 Does intelligence cause achievement?: a cross-lagged panel analysis. Journal of Educational Psychology 63:258-275.
Cronbach, L. J., and Snow, R. E. 1977 Aptitudes and Instructional Methods: A Handbook for Research on Interactions. New York: Irvington.
Farr, J. L., O'Leary, B. S., Pfeiffer, C. M., Goldstein, I. L., and Bartlett, C. J. 1971 Ethnic Group Membership as a Moderator in the Prediction of Job Performance: An Examination of Some Less Traditional Predictors. AIR technical report no. 2. Washington, D.C.: American Institutes for Research.
Glaser, R., and Bond, L., guest eds. 1981 Testing: concepts, policy, practice and research. American Psychologist (special issue).
Goldman, R. D., and Hartig, L. K. 1976 The WISC may not be a valid predictor of school performance for primary-grade minority children. American Journal of Mental Deficiency 80(6):583-587.

Goodman, J. F. 1977 The diagnostic fallacy: a critique of Jane Mercer's concept of mental retardation. Journal of School Psychology 15:197-205.
Gordon, R. A. 1980 Examining labeling theory: the case of mental retardation. Pp. 111-174 in W. R. Gove, ed., The Labeling of Deviance: Evaluating a Perspective. Beverly Hills, Calif.: Sage Publications.
Green, R. L., and Farquhar, W. W. 1965 Negro academic motivation and scholastic achievement. Journal of Educational Psychology 56:241-243.
Grossman, H. J., ed. 1977 Manual on Terminology and Classification in Mental Retardation. American Association on Mental Deficiency. Baltimore, Md.: Garamond/Pridemark.
Hoffman, B. 1962 The Tyranny of Testing. New York: Crowell-Collier.
Jencks, C., Smith, M., Acland, H., Bane, M. J., Cohen, D., Gintis, H., Heyns, B., and Michelson, S. 1972 Inequality: A Reassessment of the Effect of Family and Schooling in America. New York: Basic Books.
Jensen, A. R. 1969 How much can we boost IQ and scholastic achievement? Harvard Educational Review 39(1):1-123.
Jensen, A. R. 1980 Bias in Mental Testing. New York: Free Press.
Kamin, L. J. 1974 The Science and Politics of IQ. New York: John Wiley & Sons, Inc.
Kaufman, A. 1975 Factor analysis of the WISC-R at 11 age levels between 6½ and 16½ years. Journal of Consulting and Clinical Psychology 43:135-147.
Klineberg, O. 1935 Negro Intelligence and Selective Migration. New York: Columbia University Press.
Larry P. v. Riles 1979 495 F. Supp. 926 (N. D. Cal., 1979) (decision on merits) appeal docketed No. 80.4027 (9th Cir., Jan. 17, 1980).
Layzer, D. 1972 Science or superstition?: a physical scientist looks at the IQ controversy. Cognition 1:265-299.
Linn, R. 1982 Ability testing: individual differences, prediction, and differential prediction. Pp. 335-388 in A. K. Wigdor and W. R. Garner, eds., Ability Testing: Uses, Consequences, and Controversies. Vol. II. Report of the Committee on Ability Testing, National Research Council. Washington, D.C.: National Academy Press.
Loehlin, J. C., Lindzey, G., and Spuhler, J. N. 1975 Race Differences in Intelligence. San Francisco, Calif.: W. H. Freeman.
Mercer, J. 1979 System of Multicultural Pluralistic Assessment Technical Manual. New York: Psychological Corporation.
Mercer, J. In press What is a racially and culturally nondiscriminatory test? In C. R. Reynolds and R. T. Brown, eds., Perspectives on Bias in Mental Testing. New York: Plenum.
Messe, L. A., Crano, W. D., Messe, S. R., and Rice, W. 1979 Evaluation of the predictive validity of tests of mental ability for classroom performance in elementary grades. Journal of Educational Psychology 71:233-241.

Messick, S. 1980 Test validity and the ethics of assessment. American Psychologist 35:1012-1027.
Parents in Action on Special Education (PASE) v. Hannon 1980 No. 74-C-3586 (N. D. Ill. 1980).
Petersen, N., and Novick, M. 1976 An evaluation of some models for culture-fair selection. Journal of Educational Measurement 13:3-31.
Plomin, R., and DeFries, J. C. 1980 Genetics and intelligence: recent data. Intelligence 4:15-24.
Reschly, D. J. 1981 Psychological testing in educational classification and placement. American Psychologist 36:1094-1102.
Reschly, D. J., and Reschly, J. E. 1979 Validity of WISC-R factor scores in predicting achievement and attention for four sociocultural groups. Journal of School Psychology 17:355-361.
Reschly, D. J., and Sabers, D. L. 1979 Analysis of test bias in four groups with regression definition. Journal of Educational Measurement 16(1):1-9.
Sandoval, J. 1979 The WISC-R and internal evidence of test bias with minority groups. Journal of Consulting and Clinical Psychology 47:919-927.
Sattler, J. M. 1974 Assessment of Children's Intelligence. Philadelphia, Pa.: W. B. Saunders Company.
Scarr, S. 1981 Testing for children: assessment and the many determinants of intellectual competence. American Psychologist 36:1159-1166.
Scarr, S., and Weinberg, R. A. 1976 IQ test performance of black children adopted by white families. American Psychologist 31:726-739.
Schrader, W. B. 1965 A taxonomy of expectancy tables. Journal of Educational Measurement 2:29-35.
Smith, E. 1980 Test validation in the schools. Texas Law Review 58:1123-1159.
U.S. Department of Justice 1980 Post-Trial Memorandum of the United States. Amicus Curiae brief filed in PASE v. Hannon.
Wigdor, A. K., and Garner, W. R., eds. 1982 Ability Testing: Uses, Consequences, and Controversies. Vols. 1 and 2. Report of the Committee on Ability Testing, National Research Council. Washington, D.C.: National Academy Press.
