Chapter 3
The Role of Intellectual Assessment

For many years, only scores from intelligence tests (IQs) were used in the diagnosis of mental retardation. As professionals and the public came to understand better the limitations of intelligence theory and IQ tests, finding other useful measures for assessing mental retardation became more urgent, especially because of allegations of racial, cultural, and gender bias in standard IQ assessment instruments. Yet constructs like adaptive behavior have proven at least as difficult to assess as intelligence, and IQ still looms large in determining eligibility for a diagnosis of mental retardation. To address the many misunderstandings about intelligence and its assessment, this chapter covers the following topics: (1) intelligence theory and test use from a historical perspective; (2) intelligence tests used commonly in the diagnosis of mental retardation; (3) assessment conditions that affect examinees’ assessed cognitive performance; (4) the use of total test scores, like full-scale IQs, and subscores (part or scale scores) in the diagnosis of mental retardation; (5) the use of comprehensive as opposed to restricted measures of intelligence; and (6) psychometric considerations in the selection and application of intelligence tests for diagnosing mental retardation, including test fairness.



HISTORICAL PERSPECTIVE ON THEORY AND PRACTICE

History of Development of Tests of Intelligence

The use of intelligence tests in the process of diagnosing mental retardation dates back to the turn of the 20th century, when Alfred Binet and Theodore Simon developed an intelligence test for that purpose. In the course of the implementation of universal education laws in France at that time, debates arose over the relative benefits and methods of educating schoolchildren with subnormal intelligence. As a result of this educational movement, Binet and Simon developed and in 1905 published what has come to be known as the first “practical” intelligence test (Sattler, 1988).

Binet Scale

Three years after its initial publication in 1905, the Binet-Simon Scale was revised by Binet and Simon (Binet & Simon, 1916) and then again by Binet in 1911. The instrument was noticed by researchers in the United States and was brought to this country by Goddard (1908). Three independent researchers, Huey, Kuhlmann, and Wallin, translated the Binet Scale into English in 1911 (Thorndike & Lohman, 1990), and use of the instrument and the general practice of assessing intelligence for many purposes spread quickly. Lewis M. Terman was responsible for making the Binet Scale a recognized and accepted professional tool.

Terman adopted, then revised, and renormed the instrument several times at Stanford University (Terman, 1916; Terman & Merrill, 1937, 1960, 1973), and from the early 1900s it became the principal tool for assessing the intelligence of children, adolescents, and young adults. As a result of Terman’s research and development efforts, the original Binet-Simon scale eventually became known throughout the United States as the Stanford-Binet Intelligence Scale, which is currently in its fourth edition (Thorndike et al., 1986).

Pioneer Nonverbal Assessments

In tandem with Binet and Simon’s work, European clinicians also attempted to develop methods for assessing the cognitive functioning of children who could not or would not speak. This effort, designed to determine latent cognitive functioning in the absence of manifest language abilities, initiated the field of nonverbal intellectual assessment. In the widely celebrated case of Victor, the Wild Boy of Aveyron, Jean Itard sought to determine the cognitive abilities of a feral youth and help the boy acquire functional language skills (Carrey, 1995; Itard, 1932).

In addition to Itard’s pioneering work, even earlier historical figures pursued more directly the problem of assessing the intellectual abilities of children who could not or would not speak. Seguin (1856) is possibly best known for his development of unique instrumentation to aid in the assessment of children’s abilities through nonverbal means. Seguin’s performance-based nonverbal measure of cognition required the puzzle-like placement of common geometric shapes into openings of the same shape. The instrument and its many derivatives have become widely used and are known universally as the Seguin Form Board (DuBois, 1970). The current edition of the Stanford-Binet Intelligence Scale includes a Seguin-like form board, which has resulted in the merger of efforts by Binet, Simon, and Seguin, the three pioneer European test developers, in a contemporary American instrument.

Nonverbal intelligence testing has a history paralleling that of traditional language-loaded intelligence tests, with the publication of many nonverbal scales during the early 1900s.

In the lineage of nonverbal intelligence tests, the Leiter International Performance Scale (Leiter, 1948) is one of the only surviving historical instruments, although the number of new nonverbal tests available has grown since 1990.

Group Language and Nonverbal Assessments

The parallel development of verbal and nonverbal intellectual assessment continued during the group mental testing movement that stemmed from the country’s need to assess military recruits during the First World War. According to the Examiner’s Guide for the Army Psychological Examination (U.S. Government Printing Office, 1918), military testing was deemed necessary to classify soldiers according to mental ability, create organizational units of equal strength, identify potential problem soldiers (e.g., recruits with cognitive disability), assist in training and assignments, identify potential officers, and discover soldiers with special talents or skills.

The Army Mental Tests resulted in Group Examination Alpha and Beta forms. The Group Examination Alpha (Army Alpha) was administered to recruits who could read and respond to the written English version of the scale. Because the Army Alpha was not useful as a measure of ability when recruits had limited English proficiency or were insufficiently literate to read and respond reliably to verbal items, the Group Examination Beta portion of the Mental Tests (Army Beta) was developed as a nonverbal supplement to the Army Alpha.

Wechsler Scales

Since the onset of mental testing with the Stanford-Binet and the application of group intelligence testing procedures, a plethora of individual and group tests have been developed in the United States for assessing overall intelligence and diagnosing subnormal intellectual functioning in infants, children, adolescents, and adults. Most prominent among the post-Binet instruments was a series of intelligence tests developed by David Wechsler (1939, 1949, 1955, 1967, 1974, 1981, 1989, 1991).

Although the Binet scale was preeminent during the early to mid-1900s, the Wechsler scales quickly replaced the Binet as the test of choice among psychological examiners. The Wechsler and Binet scales remain the two dominant, language-loaded, individually administered intelligence tests used for the diagnosis of mental retardation in the United States. The Wechsler Scales of Intelligence employed the Army Alpha and Beta approach to assessment by creating a collection of language-oriented subtests (verbal scale) and a collection of language-reduced subtests (performance scale), which combine to create a full-scale IQ (FSIQ). Historically, the performance scale has been used as a nonverbal test because of its reduced language demands, but it is not truly nonverbal in that it requires the examinee to comprehend lengthy and complex verbal instructions.

During the past 20 years, a number of intelligence tests have been published as alternatives to the Stanford-Binet and the Wechsler scales. Currently psychologists have an impressive array of instruments from which to select, differing in features, theoretical orientations, length and complexity, and technical quality.

Correlates of Assessed Intelligence

The widespread use of intelligence tests in the diagnosis of mental retardation is a consequence of research outcomes that have definitively demonstrated that, of all the social science variables that have been studied, intelligence tests remain the single best predictor of most important life events and outcomes (Jensen, 1981; Neisser et al., 1996; Sattler, 1990; Wilson, 1978).

Intelligence tests predict such diverse outcomes as academic achievement, attainment, and deportment (Beck et al., 1988; Martel et al., 1987; Paal et al., 1988; Poteat et al., 1988; Roberts & Baird, 1972; Venter et al., 1992); language development, comprehension, and communication (Ackerman-Ross & Khanna, 1989; Bolla et al., 1990; Bracken, Howell, & Crain, 1993; Bracken, Prasse, & McCallum, 1984; Caplan et al., 1992; Lindsay et al., 1988; Morton & Green, 1991; Mitchell & Lambourne, 1979); psychosocial adjustment (Cunningham et al., 1991; Denno, 1986; Drotar & Sturm, 1988; Greenwald et al., 1989; Kohlberg, 1969; O’Toole & Stankov, 1992; Poon et al., 1992; Siebert, 1962; Stogdill, 1948; Windle & Blane, 1989); family and home environment (Bracken et al., 1993; Luster & Dubow, 1992); short-term memory (Miller & Vernon, 1992); and employment success (Arvey, 1986; Burke et al., 1989; Faas & D’Alonzo, 1990; Hunter, 1986; Hunter & Hunter, 1984; Schmidt & Hunter, 1981; Thorndike, 1986). With such a diverse and wide array of correlates, intelligence tests have become highly instrumental in the identification of levels of cognitive functioning, including differentiating levels of mental retardation and predicting concomitant behavioral, social, and economic consequences.

Theories of the Structure of Intellectual Abilities

Research and theorizing on the structure of intellectual abilities have a history that is virtually as long as the history of work on the measurement of intelligence. As Binet and Simon were hard at work developing their seminal scale for intelligence, Charles Spearman (1904a, 1904b) published two groundbreaking statistical papers, one on basic methods of correlational analysis and the other laying the foundation for factor analysis. The factor analytic techniques that Spearman (1904a) proposed were specially geared for testing his theoretical notions regarding ability structure, but the value of the generalized factor analysis model was recognized almost immediately by other researchers. Factor analysis has been the standard way to investigate the structure of the ability domain, among others, for over half a century.

An interesting conundrum in ability research is the continuing disconnection between techniques for assessing intelligence, or general intellectual ability for practical decision making, and research on the structure of intellectual abilities. When assessing intelligence to make decisions about individuals, attention has been paid almost exclusively to general intelligence, as reflected in a composite intelligence quotient, or IQ.

That is, a single number, embodied in the IQ, is used to portray an individual’s mental ability. This focus on a single dimension of general intelligence is consistent with the theory outlined by Spearman (1904a). The need to consider more than a single factor to represent correlations among ability tests was recognized only a few years after Spearman first described his theory. Furthermore, the need for more than a single factor has been widely acknowledged for over 60 years; the key disagreements in the field relate to how the structure of multiple abilities is portrayed, understood, and used in decision making. This continuing disconnection between theory and practice in the structure and measurement of human cognitive or intellective abilities is a central issue in psychometrics today (McArdle & Woodcock, 1998). Signs of a closer connection between theory and practice in the measurement of abilities are apparent, and the next decade is likely to show even greater influence of ability theory on the range of mental abilities for which IQs can be obtained.

We now review the major theories of the structure of intellectual abilities, which point toward an emerging consensus on the major ability dimensions that constitute the ability domain. This information undergirds the committee’s recommendations regarding the intelligence test scores that can best be used for eligibility decisions.

Spearman’s Two-Factor Theory

Spearman (1904a) developed factor analytic techniques to test his hypothesis that a single dimension accounted for correlations among all tests of mental ability. Spearman called this dimension “general intelligence.” To avoid contaminating the scientific construct of general intelligence with any ideas associated with the notion of intelligence in common parlance, Spearman signified the scientific construct derived from correlations among ability tests with the letter g, which stood for general intelligence. Spearman argued that g represented a new scientific construct, the meaning of which would be established only with substantial empirical research.

Spearman was perhaps the first to notice what is called the positive manifold, which refers to the finding of uniformly positive correlations among tests of ability. This positive manifold is a hallmark of the ability domain and is a distinctive attribute of the domain in comparison with others. Spearman reasoned that, if all tests of ability are positively intercorrelated, a single entity might influence all tests and thus be in common among the tests. Tests that correlate highly with other tests would be more heavily saturated with this common entity, whereas tests that tended to correlate at lower levels with other tests would be less saturated with the common entity. Spearman (1904a) presented techniques for estimating the saturation of each test, based on its correlations with other tests, and he continued to refine and extend these techniques for the remainder of his career.

Spearman’s theory is frequently called the two-factor theory, reflecting the hypothesis that two factors account mathematically for the variance of each measured variable. One of these factors is g, the factor of general intelligence; the second factor is s_j, a factor that is specific to manifest variable j. Thus, the two-factor theory postulates two classes of factors. One class has a single member, g, the factor of general intelligence, which is the single influence that is common to all tests of ability. The second class of factors has as many members as there are tests of ability, one specific factor for each different test of ability.
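
Stated compactly in modern factor-analytic notation, the two-factor theory takes the following form. This is a minimal formal sketch: the loading symbol λ_j and the assumption of standardized scores are our notation, not Spearman's.

```latex
% Each standardized test score X_j is a weighted sum of general
% intelligence g and a factor s_j specific to test j:
\[
  X_j = \lambda_j g + s_j , \qquad
  \mathrm{Var}(X_j) = \lambda_j^2 + \mathrm{Var}(s_j) = 1 .
\]
% Because g is the only factor shared by different tests, the model
% implies that any two distinct tests correlate as the product of
% their g-saturations (loadings):
\[
  \mathrm{Corr}(X_i, X_j) = \lambda_i \lambda_j , \qquad i \neq j .
\]
```

When all loadings are positive, the product rule yields uniformly positive correlations, which is one way to see why the positive manifold is consistent with a single common factor; it is also the basis of the tetrad differences Spearman used to test the model.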

As a theoretical metaphor, Spearman (1927) borrowed from the Industrial Revolution. Arguing that g, or general intelligence, could be likened to or identified with mental energy, Spearman also hypothesized that individual differences in mental energy were largely genetic in origin. This mental energy can be directed toward any kind of intellectual task or problem, and the greater the amount of mental energy devoted to a task, the better the performance on the task. Individuals with a high level of g have a high level of mental energy to devote to intellectual pursuits, whereas persons with low levels of g have much lower levels of mental energy at their disposal when confronting intellectual problems or puzzles. Consequently, individual differences in g reflect individual differences in mental energy, and individual differences in mental energy lead to individual differences in performance on all ability tests and therefore account for the correlations among all tests of mental ability.

The specific factor for variable j, s_j, is composed theoretically of two components: a reliable component that is specific to variable j, and a stochastic or random component that represents random error of measurement. (This specific factor is sometimes referred to as residual variance.) In most research situations, these two components cannot be separated, so emphasis is laid on the combined specific factor.

Spearman equated the specific factor s_j for a given test j with an engine. General intelligence, or g, provides the mental energy to power the engine that is used to solve a particular type of problem. Thus, one engine would be used to solve the problems on a verbal comprehension test, another engine would be used for numerical problems, and so forth. For certain types of problems, the general factor g is of primary importance, leading to a high g-loading for such a test and a relatively low contribution to explained variance by the engine, or specific factor, for the test. But for other tests, g is of less importance, and the engine for the test accounts for the majority of the variance. The specific factor s_j for a test is an opportunity for the environment or experience to play a part in performance on mental ability tests.

When conducting research to test his hypothesized ability structure, Spearman often conducted analyses so that the results would conform to his theory. For example, Spearman (1914) dropped a test from his analyses because its inclusion resulted in a failure to satisfy his statistical criterion for adequacy of a single factor. Once the test was dropped from the analysis, the remaining tests satisfied the mathematical criterion, supporting the adequacy of a single factor for the set of tests. This approach of discarding tests that led to failure to confirm his theory was a common one for Spearman, who discarded recalcitrant tests in several analyses reported in his major empirical work on mental abilities (Spearman, 1927).
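
To make the estimation of g-saturations concrete, here is a toy illustration in Python. It uses the first principal factor of a synthetic correlation matrix as a modern stand-in for Spearman's original centroid calculations; the matrix, the eigenvalue shortcut, and all numbers are ours, for illustration only.

```python
import numpy as np

# Toy correlation matrix for four ability tests exhibiting a positive
# manifold (all correlations positive), consistent with a single factor.
R = np.array([
    [1.00, 0.72, 0.63, 0.54],
    [0.72, 1.00, 0.56, 0.48],
    [0.63, 0.56, 1.00, 0.42],
    [0.54, 0.48, 0.42, 1.00],
])

# First principal factor: the leading eigenvector, scaled by the square
# root of its eigenvalue, approximates each test's g-loading (saturation).
eigvals, eigvecs = np.linalg.eigh(R)
g_loadings = np.sqrt(eigvals[-1]) * np.abs(eigvecs[:, -1])

# Under the two-factor theory, whatever variance g does not explain
# belongs to the specific factor s_j (reliable specificity plus error).
specific_variance = 1.0 - g_loadings**2

for j, (lam, s2) in enumerate(zip(g_loadings, specific_variance), start=1):
    print(f"test {j}: g-loading = {lam:.2f}, specific variance = {s2:.2f}")
```

Tests that correlate highly with the others come out more g-saturated, exactly the pattern Spearman described verbally.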

As a result, Spearman’s two-factor theory has equivocal support, because any indication of lack of fit was effectively swept under the rug. But the two-factor theory is important for several reasons, including its status as the first theory of the structure of mental abilities, the clarity with which the theory and its predictions were stated, and the close interplay between psychological theory and the mathematical and statistical tools developed to test it.

Thurstone’s Primary Mental Abilities

During the 1930s, L.L. Thurstone and his colleagues pursued a program of research designed to identify the basic set of dimensions that span the ability, or intelligence, domain. Rather than beginning with a strong a priori theory about the structure of mental abilities as Spearman had done, Thurstone and his collaborators took a very different approach. Specifically, they collected a large battery of tests comprising all conceivable types of intellectual tasks, administered the battery to a large sample of subjects, and then analyzed the correlations among the tests in this battery to determine the number and nature of the dimensions required to account for the correlations. If the same dimensions continued to emerge from their analyses across several samples of subjects and different but largely overlapping batteries of tests, then the dimensions would serve as a framework for representing the ability domain. In several early studies, Thurstone and his colleagues (1938a, 1938b) found seven interpretable factors that were replicated across several analyses; these seven factors were termed primary mental abilities.

The seven primary mental abilities that consistently appeared across samples were identified as: (1) verbal comprehension (V), or the ability to extract meaning from text; (2) word fluency (W), subsuming the ability to access elements of the lexicon based on structural characteristics (e.g., first letters, last letters), rather than meaning; (3) spatial ability (S), or the ability to rotate figural stimuli in a two-dimensional space; (4) memory (M), involving the short-term retention of material typically presented in paired-associate format; (5) numerical facility (F), reflecting the fast and accurate response to problems involving simple arithmetic; (6) perceptual speed (P), or the speedy identification of stimuli based on their stimulus features; and (7) reasoning (R), which represented inductive reasoning in some studies, deductive reasoning in other studies, and general reasoning in still others.

As for an interpretation of the nature of mental abilities, Thurstone (1938a, 1938b) was not specific. He repeatedly referred to ability dimensions as representing “functional unities,” by which he meant that the tests loading on a given factor had some functional similarity that was hypothesized to be the same across tests. Thurstone did believe that the future would bring a mapping of mental abilities onto brain areas, such that each ability factor would be tied to particular brain areas that supported its functioning. But brain mapping was in its initial stages, and Thurstone could only voice this as a hope for the future. He did think that cognitive psychology held hope for understanding the underpinnings of mental abilities, stating that psychologists should move into the laboratory to devise studies that would illuminate why a given set of tests loaded on a given factor (Thurstone, 1947). Once again, the field of psychology was not ready for this recommendation, and cognitive investigations into the processes underlying mental test performance began in earnest about 30 years after Thurstone’s encouragement to pursue this avenue of research.

In the initial studies by Thurstone and his collaborators (e.g., 1938a, 1938b), the primary mental ability factors were rotated orthogonally, so they were statistically uncorrelated with one another. But after the development of the mathematical theory for oblique rotations (Tucker, 1940), Thurstone and Thurstone (1941) quickly applied oblique rotations to the primary mental abilities and found substantial correlations among the seven dimensions. The correlations among the primary mental abilities were well described by a single second-order factor, which Thurstone and Thurstone argued provided a way to reconcile Spearman’s theory with their own. That is, at the level of the primary mental abilities, seven dimensions were required to represent the relations among a large set of tests. But correlations among the primary mental abilities could be explained by a single second-order factor. Thus, one could argue that Spearman pursued […]

[…] orthogonal versus correlated factors, secondary loadings, correlated error terms, other model constraints (fixed and free parameters), method of estimation, goodness of fit, overall fit, relative fit, parsimony, any model modification to improve model fit to data, factor loadings and standard errors, communality, and factor correlations and standard errors with statistical significance. Comprehensive treatment and inclusion of such information allows test users to better understand the extent to which the test fits its proposed model compared with competing models and provides support for the interpretation of the instrument’s respective subscales and composite scores.

External Evidence of Validity

External evidence of test validity considers the extent to which a test relates to or predicts other variables or outcomes in differing populations. Tests should be validated with regard to the purposes for which they are employed and the consequences of their use. In this section, we describe external classes of evidence for test construct validity, including criterion-related validity, consequential validity, and generalizability.

Criterion-related validity. Campbell and Fiske (1959) originally proposed that test scores should be related to external measures of the same psychological construct (convergent evidence of validity) and comparatively unrelated to measures of different psychological constructs (discriminant evidence of validity). In criterion-related validity, criterion measures can be obtained concurrently (concurrent validity) or at some future date (predictive validity). An intelligence test that is proposed for use in the process of diagnosing mental retardation should demonstrate convergent validity with other extant intelligence tests before the instrument is accepted for this purpose. Similarly, as a class of instruments, intelligence tests should demonstrate higher correlations among themselves than with measures of other psychoeducational constructs (e.g., academic achievement, adaptive behavior). Tests should meaningfully guide decision making.
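
A toy check of this convergent/discriminant pattern in Python; the score vectors below are hypothetical, invented purely for illustration.

```python
import numpy as np

# Hypothetical score vectors for the same 8 examinees (illustrative data,
# not from any real validation study).
new_iq      = np.array([68, 72, 85, 90, 100, 105, 110, 120])  # candidate test
extant_iq   = np.array([70, 75, 82, 93,  98, 108, 112, 118])  # established IQ test
achievement = np.array([60, 80, 78, 95,  90, 100, 115, 108])  # different construct

def r(x, y):
    """Pearson correlation between two score vectors."""
    return np.corrcoef(x, y)[0, 1]

convergent = r(new_iq, extant_iq)      # should be high
discriminant = r(new_iq, achievement)  # should be comparatively lower

print(f"convergent r = {convergent:.2f}, discriminant r = {discriminant:.2f}")
```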

Contrasted groups methodology is commonly used for validating psychological tests. In this approach to validation, the test performance of two samples that are known to be different on the criterion of interest is compared. For example, a sample of people who are known to have mental retardation should perform on an intelligence test at a level significantly below the performance of a second group known not to have mental retardation. Decision-making classification accuracy should be determined by examining sensitivity, specificity, positive predictive power, and negative predictive power, as illustrated in the sketch below.

Tests should provide evidence of consequential validity. A form of validity that emphasizes the societal impact of test results on individuals and groups is known as consequential validity. Consequential validity evaluates the utility of score interpretation as a basis for action, as well as the actual and potential consequences of test use (Messick, 1989). Messick (1995) argued that examination of the consequences of test use as a trigger to social and educational actions, such as equitable application of SSI benefits, is a necessary element of validating tests. Consequential validity is especially relevant to issues of bias, fairness, and distributive justice.

Generalizability of validity. External evidence of test validity is especially important when test results are to be generalized across contexts, situations, and populations, and when the consequences of testing reach beyond the test’s original intent. Intelligence test manuals should demonstrate the extent to which test validity generalizes across subpopulations, such as racial or ethnic minority groups, genders, or age levels. Examiners who wish to use tests for purposes not stated or supported in the examiner’s manual, such as using a language instrument for discerning levels of cognitive functioning, must demonstrate the validity of the new application prior to such use.
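
A minimal sketch of the four classification-accuracy indices named above, assuming a simple 2x2 decision table; the counts are hypothetical, standing in for a contrasted-groups study.

```python
def classification_accuracy(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Decision-making accuracy indices from a 2x2 diagnostic table.

    tp: diagnosed and truly has the condition
    fp: diagnosed but does not have the condition
    fn: missed despite having the condition
    tn: correctly ruled out
    """
    return {
        "sensitivity": tp / (tp + fn),                # true cases caught
        "specificity": tn / (tn + fp),                # non-cases ruled out
        "positive_predictive_power": tp / (tp + fp),  # diagnosis is correct
        "negative_predictive_power": tn / (tn + fn),  # ruling out is correct
    }

# Hypothetical counts, for illustration only.
print(classification_accuracy(tp=45, fp=5, fn=8, tn=92))
```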

Test Score Reliability

The reliability of test scores refers to the reproducibility (precision, consistency, and repeatability) of test results, or the degree to which test scores are free from measurement error. Measurement precision can be assessed by examining the instrument’s internal consistency, temporal stability, and interrater agreement. Reliability can only be evaluated in the context of test use (Nunnally & Bernstein, 1994).

Internal consistency. The internal consistency of a test is a reflection of the uniformity and coherence of test items and content. All variance generated by a test can be classified as either reliable variance or error variance. In classical test theory, reliability is based on the assumption that measurement error is distributed normally and equally for all score levels. By contrast, item response theory posits that reliability differs between individuals with different response patterns and levels of ability but generalizes across populations (Embretson & Hershberger, 1999). Internal consistency is usually estimated with coefficient alpha or split-half reliability (a computational sketch appears below). Several psychometricians (Bracken, 1987; Clark & Watson, 1995; Nunnally & Bernstein, 1994) have recommended that minimal levels of internal consistency should average across age levels at or above .80 or .90, depending on whether the test scale is put to low-stakes or high-stakes applications, respectively. Consistent with Nunnally’s (1978) original standards, Bracken (1987, 1998; Wasserman & Bracken, 2002) recommended that total test or total scale internal consistency for high-stakes test applications, such as clinical diagnosis or eligibility decision making, should equal or exceed .90 when averaged across age levels. Instruments used for the high-stakes purpose of diagnosing mental retardation for SSI should approximate this minimal level of reliability, recognizing that the inverse of reliability is measurement error and that error only confounds correct decision making.

Local reliability. Local reliability refers to measurement precision at specified levels or ranges of scores that are at or near the decision-making point for mental retardation. For example, a test with high local reliability at low ability levels would be more appropriate for use with low-functioning individuals than one with less local reliability. Local reliability can be measured from a classical test theory orientation or by using item response theory.
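
A minimal implementation of coefficient alpha for an examinees-by-items score matrix; the item scores below are invented for illustration.

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Coefficient alpha for an (examinees x items) score matrix."""
    k = items.shape[1]                         # number of items
    item_vars = items.var(axis=0, ddof=1)      # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)  # variance of total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical item scores for 6 examinees on a 4-item scale.
scores = np.array([
    [2, 3, 3, 2],
    [4, 4, 5, 4],
    [1, 2, 2, 1],
    [3, 3, 4, 3],
    [5, 4, 5, 5],
    [2, 2, 3, 2],
])
# Prints about .98 here, above the .90 threshold recommended for
# high-stakes applications.
print(f"alpha = {cronbach_alpha(scores):.2f}")
```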

Whichever approach is used, local reliability should be measured and the data made available to disability determination examiners so they can use the most appropriate tests for their clients.

Total test short-term stability. Test scores must be reasonably stable to have practical utility when diagnosing known stable conditions such as mental retardation and to be predictive of future performance. Stability is typically estimated through use of test-retest stability (correlation) coefficients across two points in time. Bracken (1987) suggested that for short-term test intervals of two to six weeks the total test stability coefficient should be greater than or equal to .90 for high-stakes test applications. Test-retest reliability is in part a measure of construct stability, but its interpretation in clinical contexts can be influenced by several factors, such as the deleterious effects of degenerative disorders or the positive effects of successful therapeutic interventions, which should be kept in mind in individual studies of test stability.

Generalizability of test score reliability. As an extension of validity generalization (Hunter & Schmidt, 1990; Schmidt & Hunter, 1977), reliability generalization investigates the stability of reliability coefficients across varying samples. In order to demonstrate measurement precision for the populations for which a test is intended, the test should show comparable levels of reliability across various demographic subsets of the population, as well as salient clinical and exceptional populations, such as individuals with mental retardation.

Fairness in Testing

Fairness has not historically been considered a leading criterion by which test selection decisions are made, but increased social sensitivity and recent court decisions have elevated its importance. Tiedeman (1978) noted, “Test equity seems to be emerging as a criterion for test use on a par with the concepts of reliability and validity” (p. xxviii). As such, tests intended for use with all subsets of the U.S. population, as in SSA evaluations, should provide ample evidence of psychometric fairness and equitable treatment of examinees.

Wasserman and Bracken (2002) consider fairness to be the extent to which test scores are (a) statistically shown to be free from evidence of psychometric bias, (b) comparably reliable and valid across demographic groups, and (c) equitably applied and equally predictive in real-life consequences and pragmatic impact. Fairness transcends psychometrics and includes philosophic, legal, and practical considerations.

Test bias refers to elements of a test and its usage that are construct irrelevant and that yield systematic errors that in turn lead to erroneous decisions related to specific demographic group membership. Bias results in differential outcomes for individuals of the same ability levels but from different ethnic, sex, cultural, or religious groups (Hambleton & Rodgers, 1995). Test bias has also been described as “a kind of invalidity that harms one group more than another” (Shepard et al., 1981, p. 318).

Internal Evidence of Fairness

As with internal evidence of validity, test fairness rests in part on the structural features of the instrument, including theoretical underpinnings, item content, assessment procedures, differential item functioning, and an invariant factor structure.

Theoretical underpinnings. The theory on which a test is built may have an inherent sensitivity to issues of fairness and should be fully discussed in the test manual. Several illustrations of these implications may be presented with regard to measures of cognitive and intellectual ability. For example, tests that emphasize speed may be less fair for Hispanics, because time is considered a less salient concept in many Hispanic cultures (Scheuneman & Oakland, 1998). Individuals who speak English as a second language also may be disadvantaged by traditional language-loaded intelligence tests, even on performance-based measures like the Wechsler Performance Scale that include lengthy and conceptually laden test directions (Bracken & McCallum, 1998; Duran, 1989; Geisinger, 1992; Oakland & Parmelee, 1985).

In addition, measures of crystallized ability and knowledge are inextricably linked to culture (Carroll, 1997) and accordingly may show differential performance across culturally different groups, whereas fluid abilities tend to show less differential performance across groups.

Multicultural bias and sensitivity reviews. The use of multicultural reviewers to examine the type, content, and format of test items for potential bias is a common practice among test publishers. Usually the goal of bias review panels is to identify offensive, controversial, or unfair material while remaining sensitive to population diversity. Among the considerations of such reviewers are language usage, ethnocentric item content, minority group representation in the norms, and minority group portrayals in test stimulus materials (Sireci & Geisinger, 1998). All tests should present items in a manner that is sensitive to all gender, culture, age, and racial groups. Stimulus artwork should depict people performing similar or equivalent roles and activities, regardless of gender, age, race, and cultural background. Stimulus artwork that portrays facial expressions, such as happiness, anger, or fear, or indicators of physical limitations, such as eyeglasses, hearing aids, or wheelchairs, should be evenly distributed across representations of differing demographic groups. Stereotyping of any sort in test artwork and stimulus materials should be avoided.

Differential item functioning (DIF). Differential item functioning (DIF) refers to a family of statistical procedures used to identify whether test items display different statistical properties for different groups after controlling for differences in the abilities of the comparison groups (Angoff, 1993). The concept of DIF has been extended by Shealy and Stout (1993) to include a test-level analysis known as differential test functioning (DTF). DTF is important because a test may contain a small number of offsetting items that DIF procedures identify as biased against each of the comparison groups, such as males and females. Because these few biased items offset one another, their overall effect (DTF) on the fairness of the test can be minimal (Waller et al., 2000).
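
One widely used DIF index is the Mantel-Haenszel statistic; here is a minimal sketch. The score strata and all counts are invented for illustration, and the code shows only the point estimate, not the accompanying chi-square significance test.

```python
import math

# Examinees are first matched on total score; each stratum then holds a
# 2x2 table of correct/incorrect counts for the reference and focal groups.
strata = [
    # (ref_correct, ref_incorrect, focal_correct, focal_incorrect)
    (40, 10, 30, 20),
    (60, 15, 45, 25),
    (80, 10, 70, 15),
]

num = den = 0.0
for a, b, c, d in strata:
    n = a + b + c + d
    num += a * d / n
    den += b * c / n

alpha_mh = num / den                   # common odds ratio across strata
delta_mh = -2.35 * math.log(alpha_mh)  # ETS delta scale; ~0 means no DIF
print(f"MH odds ratio = {alpha_mh:.2f}, MH D-DIF = {delta_mh:.2f}")
```

On the ETS delta scale, values near zero indicate negligible DIF, while absolute values of 1.5 or more are conventionally flagged as large.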

Invariant factor structure and scale reliabilities. The examination of comparable reliability and validity across separate demographic groups should be conducted to investigate test fairness. Jensen (1980) noted that if test reliability and validity coefficients differ significantly for designated subgroups of interest, then “it is clear that the test scores are not equally [reliable or valid] measures for both groups” (p. 430). With respect to validity, Meredith (1993) asserted that strict factorial invariance is required for test fairness and equity to exist. Geisinger (1998) noted the importance of comparable reliabilities across subsamples, stating that “subgroup-specific reliability analysis may be especially appropriate when the reliability of a test has been justified on the basis of internal consistency reliability procedures (e.g., coefficient alpha)” (p. 25). The demonstration of comparable reliabilities across samples that differ on the basis of gender, race, or ethnicity has been studied in some current-generation intelligence tests with positive outcomes (Bracken & McCallum, 1998; Matazow et al., 1991; Vance & Gaynor, 1976; Zhu et al., 1999).

External Evidence of Test Fairness

The external features of test fairness are evident in the relationship between test scores and various external criteria, including equality of prediction and consequential impact. It is important to examine external evidence of validity in addition to internal sources of evidence like DIF when investigating test fairness. Focusing solely on internal evidence of fairness may fail to capture subtle yet important sources of test bias (Shepard et al., 1981).

Comparable prediction. The demonstration of equivalent predictive validity across demographic groups constitutes an important source of fairness that is related to validity generalization. Intelligence tests used for the diagnosis of mental retardation should predict future external outcomes, such as employability or independent functioning, in a comparable manner across differing demographic groups.
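
Returning to the subgroup-specific reliability analysis described above, a toy demonstration in Python: it simulates parallel items so that, by construction, both demographic subgroups should show similar alphas. The generating model, the group labels, and all parameters are invented for illustration.

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Coefficient alpha for an (examinees x items) score matrix."""
    k = items.shape[1]
    return (k / (k - 1)) * (1 - items.var(axis=0, ddof=1).sum()
                            / items.sum(axis=1).var(ddof=1))

# Simulated responses: one true score per examinee plus item-level noise.
rng = np.random.default_rng(0)
true_score = rng.normal(size=200)[:, None]
items = true_score + rng.normal(scale=0.7, size=(200, 10))  # 10 parallel items
group = rng.integers(0, 2, size=200)                        # 0/1 group labels

# Comparable alphas across groups are one piece of fairness evidence;
# a marked discrepancy would warrant closer scrutiny of the instrument.
for g in (0, 1):
    print(f"group {g}: alpha = {cronbach_alpha(items[group == g]):.2f}")
```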

Minimize adverse impact and selection bias outcomes. A second form of external bias involves the differential incidence of adverse outcomes, or differential selection rates, across groups. Mean score differences between groups on tests are not inherently an indication of bias and may accompany comparable prediction rates. Still, disparate group mean scores can have the undesirable effect of producing disproportionate negative impact for one group as opposed to another (Thorndike, 1971). Such consequential aspects of test bias are commonly referred to as selection bias (Jencks, 1998). When test scores produce adverse, disparate, or disproportionate impact for one group over another, even when that impact is construct relevant, test users should consider the societal and legal implications of such selection bias.

CONCLUSIONS AND RECOMMENDATIONS

Review of the extensive literature on the assessment of intellectual functioning reveals that, because of differential rates of development across the life span, the most accurate estimates of intellectual functioning can be made only from recently administered, comprehensive IQ tests. This means that an IQ test should have been administered at the time of the eligibility determination for infants (birth through age 2), within the last year for children between the ages of 3 and 6, and within the last three years for those between the ages of 6 and 16. For adults between the ages of 18 and 50 who are living in stable conditions and are in stable health, composite IQ scores are valid for as long as five years; after age 50, composite IQs could reasonably be considered valid for three years.

Research also suggests that intelligence in the entire population increases at a rate of approximately 3 IQ points per decade, which approximates the standard error of measurement for most comprehensive intelligence tests. Thus, tests with norms older than 10 to 12 years will tend to produce inflated scores and could result in the denial of services to significant numbers of individuals who would have been eligible for them if more recent norms had been used.
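
To see why norm age matters, here is a back-of-the-envelope illustration using the 3-points-per-decade figure cited above. The adjustment function is ours, for illustration only; it is not an SSA or test-publisher rule.

```python
# Illustrative only: how aging norms can inflate an obtained IQ.
POINTS_PER_DECADE = 3.0  # the population gain cited in the text

def norm_inflation(years_since_norming: float) -> float:
    """Approximate score inflation attributable to outdated norms."""
    return POINTS_PER_DECADE * years_since_norming / 10.0

obtained_iq = 73
norm_age_years = 15  # test normed 15 years before administration
adjusted = obtained_iq - norm_inflation(norm_age_years)
print(f"obtained {obtained_iq}, roughly {adjusted:.1f} on current norms")
# obtained 73, roughly 68.5 on current norms: an eligibility outcome
# could flip on the age of the norms alone.
```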

Because intelligence is a complex and multidimensional construct, it is imperative that intelligence tests used for diagnosis be comprehensive (multifactored) and assess more than a single cognitive attribute. Also, because test length and comprehensiveness are directly related to an instrument’s technical adequacy and construct sampling, brief or abbreviated tests sacrifice technical quality or comprehensiveness for brevity.

Language-loaded intelligence tests are not appropriate for people who would be disadvantaged by language limitations (e.g., deafness, limited English proficiency, elective/selective mutism, autism). Whenever language facility constitutes a source of construct-irrelevant variance for examinees, language-loaded instruments (both verbal and performance scales) create an unfair additional challenge. In such cases, examinees should be assessed in their native language or with intelligence tests that do not require receptive or expressive language.

Since the skills and training of the examiner can affect the accuracy of an IQ test, examiners should meet publishers’ requirements for the use of Class C tests. Class C instruments are those that require the highest level of training, professional credentials, and supervision. Examiners (not their supervisors) should meet this minimal professional standard. Furthermore, examiners who administer and interpret intelligence tests should possess the skills and competencies to assess clients with uncommon characteristics, such as deafness, extreme youth or age, or a nonmajority cultural or linguistic background. Not only should examiners be competent to administer and interpret intelligence tests, but they should also have the knowledge and experience to work effectively with clients of all ages, exceptionalities, and cultural/linguistic backgrounds to ensure valid assessment results.

Almost a century of intelligence test development has shown that the most valid and accurate results are obtained when tests meet the minimal psychometric standards, outlined in this chapter, for use in high-stakes decision making like SSA disability determination. Tests should demonstrate adequate floors, item gradients, reliability, validity, norm table sensitivity, and population representation, as well as convincing evidence of fairness and lack of bias.

Composite scores from intelligence tests should be used routinely in mental retardation diagnosis, except when the validity of a composite IQ above 70 is in doubt, in which case an appropriate part score may be used in its place. Significant and meaningful variation among an instrument’s part scores may indicate compromised validity for one or more of them (for example, a low verbal scale score for an individual with a suspected speech disorder), which in turn would threaten the validity of the composite IQ. In such situations, appropriate part scores may better represent the individual’s true overall level of cognitive functioning, or it may be necessary to use other methods to support a diagnosis of mental retardation (see Chapter 5). However, only part scores derived from scales that demonstrate high g-loadings (e.g., crystallized, fluid, visual/spatial measures of intelligence) should be used in place of the composite IQ score when its validity is in doubt. Many intelligence tests assess several facets of intelligence, but not all facets are equally important or predict life events equally well. Those intellectual facets that are heavily g-saturated provide the best sources for replacing the composite IQ score when its validity is questionable. The characteristics of comprehensive IQ tests are such that, even when part scores are used in making disability determinations for mental retardation, the composite IQ score from an instrument should never be higher than 75. Furthermore, if a part score is used in place of the composite IQ score in SSA decision making, the part score should not exceed 70. Therefore:

Recommendation: A client must have an intelligence test score that is two or more standard deviations (SD) below the mean (e.g., a score of 70 or below, if the mean = 100 and the standard deviation = 15).

Composite score is 70 or below: If the composite or total test score meets this criterion, then the individual has met the intellectual eligibility component.

Composite score is between 71 and 75: If the composite score is suspected to be an invalid indicator of the person’s intellectual disability and falls in the range of 71-75, a part score of 70 or below can be used to satisfy the intellectual eligibility component.

Composite score is 76 or above: No individual can be eligible on the intellectual criterion if the composite score is 76 or above, regardless of part scores.[2]

The committee recommends continuation of the criterion of presumptive eligibility for persons with IQs below 60.

[2] Committee member Keith Widaman dissents from this part of the recommendation. Dr. Widaman believes that IQ part scores representing crystallized intelligence (Gc, similar to verbal IQ) and fluid intelligence (Gf, related to performance IQ) have clear discriminant validity and represent broad, general domains of intellectual functioning. Therefore, a score of 70 or below on either of these part scores from any standardized, individually administered intelligence test that reports such scores should be deemed sufficient to meet the listings for low general intellectual functioning, regardless of the level of the composite score, provided that the part scores have adequate psychometric properties (e.g., high reliability, low standard error of measurement). Dr. Widaman notes that, without any clear justification, SSA currently accepts either a composite IQ score from any standardized, individually administered intelligence test or a verbal or performance IQ score, any one of which can be 70 or below. SSA does not stipulate that the composite IQ must be below a certain score for a part score to be used. Dr. Widaman’s position provides a rationale for current SSA use of part scores, but it (a) aligns the acceptable part scores with the constructs of Gc and Gf used in contemporary theories of mental abilities and (b) argues that usable part scores for Gc and Gf should not be limited to those derived from any particular test instrument.