National Academies Press: OpenBook

Mental Retardation: Determining Eligibility for Social Security Benefits (2002)

Chapter: 3. The Role of Intellectual Assessment

Suggested Citation:"3. The Role of Intellectual Assessment." National Research Council. 2002. Mental Retardation: Determining Eligibility for Social Security Benefits. Washington, DC: The National Academies Press. doi: 10.17226/10295.

Chapter 3
The Role of Intellectual Assessment

For many years, only scores from intelligence tests (IQs) were used in the diagnosis of mental retardation. As professionals and the public came to understand better the limitations of intelligence theory and IQ tests, finding other useful measures for assessing mental retardation became more urgent, especially because of allegations of racial, cultural, and gender bias in standard IQ assessment instruments. Yet constructs like adaptive behavior have proven at least as difficult to assess as intelligence, and IQ still looms large in determining eligibility for a diagnosis of mental retardation. To address the many misunderstandings about intelligence and its assessment, this chapter covers the following topics: (1) intelligence theory and test use from a historical perspective; (2) intelligence tests used commonly in the diagnosis of mental retardation; (3) assessment conditions that affect examinees’ assessed cognitive performance; (4) the use of total test scores, like full-scale IQs, and subscores (part or scale scores) in the diagnosis of mental retardation; (5) the use of comprehensive as opposed to restricted measures of intelligence; and (6) psychometric considerations in the selection and application of intelligence tests for diagnosing mental retardation, including test fairness.

HISTORICAL PERSPECTIVE ON THEORY AND PRACTICE

History of Development of Tests of Intelligence

The use of intelligence tests in the process of diagnosing mental retardation dates back to the turn of the 20th century, when Alfred Binet and Theodore Simon developed an intelligence test for that purpose. In the course of the implementation of universal education laws in France at that time, debates arose over the relative benefits and methods of educating schoolchildren with subnormal intelligence. As a result of this educational movement, Binet and Simon developed and in 1905 published what has come to be known as the first “practical” intelligence test (Sattler, 1988).

Binet Scale

Three years after its initial publication in 1905, the Binet-Simon Scale was revised by Binet and Simon (Binet & Simon, 1916) and then again by Binet in 1911. The instrument was noticed by researchers in the United States and was brought to this country by Goddard (1908). Three independent researchers, Huey, Kuhlmann, and Wallin, translated the Binet Scale into English in 1911 (Thorndike & Lohman, 1990), and use of the instrument and the general practice of assessing intelligence for many purposes spread quickly.

Lewis M. Terman was responsible for making the Binet Scale a recognized and accepted professional tool. Terman adopted, revised, and renormed the instrument several times at Stanford University (Terman, 1916; Terman & Merrill, 1937, 1960, 1973), and from the early 1900s it became the principal tool for assessing the intelligence of children, adolescents, and young adults. As a result of Terman’s research and development efforts, the original Binet-Simon scale eventually became known throughout the United States as the Stanford-Binet Intelligence Scale, which is currently in its fourth edition (Thorndike et al., 1986).

Pioneer Nonverbal Assessments

In tandem with Binet and Simon’s work, European clinicians also attempted to develop methods for assessing the cognitive functioning of children who could not or would not speak. This effort, designed to determine latent cognitive functioning in the absence of manifest language abilities, initiated the field of nonverbal intellectual assessment. In the widely celebrated case of Victor, the Wild Boy of Aveyron, Jean Itard sought to determine the cognitive abilities of a feral youth and help the boy acquire functional language skills (Carrey, 1995; Itard, 1932).

In addition to Itard’s pioneering work, other early figures pursued more directly the problem of assessing the intellectual abilities of children who could not or would not speak. Seguin (1856) is possibly best known for his development of unique instrumentation to aid in the assessment of children’s abilities through nonverbal means. Seguin’s performance-based nonverbal measure of cognition required the puzzle-like placement of common geometric shapes into openings of the same shape. The instrument and its many derivatives have become widely used and are known universally as the Seguin Form Board (DuBois, 1970). The current edition of the Stanford-Binet Intelligence Scale includes a Seguin-like form board, which has resulted in the merger of efforts by Binet, Simon, and Seguin, the three pioneer European test developers, in a contemporary American instrument.

Nonverbal intelligence testing has a history paralleling that of traditional language-loaded intelligence tests, with the publication of many nonverbal scales during the early 1900s. In the lineage of nonverbal intelligence tests, the Leiter International Performance Scale (Leiter, 1948) is one of the few surviving historical instruments, although the number of new nonverbal tests available has grown since 1990.

Group Language and Nonverbal Assessments

The parallel development of verbal and nonverbal intellectual assessment continued during the group mental testing movement that stemmed from the country’s need to assess military recruits during the First World War. According to the Examiner’s Guide for the Army Psychological Examination (U.S. Government Printing Office, 1918), military testing was deemed necessary to classify soldiers according to mental ability, create organizational units of equal strength, identify potential problem soldiers (e.g., recruits with cognitive disability), assist in training and assignments, identify potential officers, and discover soldiers with special talents or skills. The Army Mental Tests resulted in Group Examination Alpha and Beta forms. The Group Examination Alpha (Army Alpha) was administered to recruits who could read and respond to the written English version of the scale. Because the Army Alpha was not useful as a measure of ability when recruits had limited English proficiency or were insufficiently literate to read and respond reliably to verbal items, the Group Examination Beta portion of the Mental Tests (Army Beta) was developed as a nonverbal supplement to the Army Alpha.

Wechsler Scales

Since the onset of mental testing with the Stanford-Binet and the application of group intelligence testing procedures, a plethora of individual and group tests have been developed in the United States for assessing overall intelligence and diagnosing subnormal intellectual functioning in infants, children, adolescents, and adults. Most prominent among the post-Binet instruments was a series of intelligence tests developed by David Wechsler (1939, 1949, 1955, 1967, 1974, 1981, 1989, 1991). Although the Binet scale was preeminent during the early to mid-1900s, the Wechsler scales quickly replaced the Binet as the test of choice among psychological examiners. The Wechsler and Binet scales remain the two dominant, language-loaded, individually administered intelligence tests used for the diagnosis of mental retardation in the United States.

The Wechsler Scales of Intelligence employed the Army Alpha and Beta approach to assessment by creating a collection of language-oriented subtests (verbal scale) and a collection of language-reduced subtests (performance scale), which combine to create a full scale IQ (FSIQ). Historically, the performance scale has been used as a nonverbal test because of its reduced language demands, but it is not truly nonverbal in that it requires the examinee to comprehend lengthy and complex verbal instructions.

During the past 20 years, a number of intelligence tests have been published as alternatives to the Stanford-Binet and the Wechsler scales. Psychologists can now select from an impressive array of instruments that differ in their features, theoretical orientations, length and complexity, and technical quality.

Correlates of Assessed Intelligence

The widespread use of intelligence tests in the diagnosis of mental retardation is a consequence of research outcomes that have definitively demonstrated that, of all the social science variables that have been studied, intelligence tests remain the single best predictors of most important life events and outcomes (Jensen, 1981; Neisser et al., 1996; Sattler, 1990; Wilson, 1978). Intelligence tests predict such diverse outcomes as academic achievement, attainment, and deportment (Beck et al., 1988; Martel et al., 1987; Paal et al., 1988; Poteat et al., 1988; Roberts & Baird, 1972; Venter et al., 1992); language development, comprehension, and communication (Ackerman-Ross & Khanna, 1989; Bolla et al., 1990; Bracken, Howell, & Crain, 1993; Bracken, Prasse, & McCallum, 1984; Caplan et al., 1992; Lindsay et al., 1988; Morton & Green, 1991; Mitchell & Lambourne, 1979); psychosocial adjustment (Cunningham et al., 1991; Denno, 1986; Drotar & Sturm, 1988; Greenwald et al., 1989; Kohlberg, 1969; O’Toole & Stankov, 1992; Poon et al., 1992; Siebert, 1962; Stogdill, 1948; Windle & Blane, 1989); family and home environment (Bracken et al., 1993; Luster & Dubow, 1992); short-term memory (Miller & Vernon, 1992); and employment success (Arvey, 1986; Burke et al., 1989; Faas & D’Alonzo, 1990; Hunter, 1986; Hunter & Hunter, 1984; Schmidt & Hunter, 1981; Thorndike, 1986).

With such a diverse and wide array of correlates, intelligence tests have become highly instrumental in the identification of levels of cognitive functioning, including the differentiation of levels of mental retardation and the prediction of concomitant behavioral, social, and economic consequences.

Theories of the Structure of Intellectual Abilities

Research and theorizing on the structure of intellectual abilities have a history that is virtually as long as the history of work on the measurement of intelligence. As Binet and Simon were hard at work developing their seminal scale for intelligence, Charles Spearman (1904a, 1904b) published two groundbreaking statistical papers, one on basic methods of correlational analysis and the other laying the foundation for factor analysis. The factor analytic techniques that Spearman (1904a) proposed were specially geared for testing his theoretical notions regarding ability structure, but the value of the generalized factor analysis model was recognized almost immediately by other researchers. Factor analysis has been the standard way to investigate the structure of the ability domain and of other domains for over half a century.

An interesting conundrum in ability research is the continuing disconnection between techniques for assessing intelligence, or general intellectual ability, for practical decision making and research on the structure of intellectual abilities. When assessing intelligence to make decisions about individuals, attention has been paid almost exclusively to general intelligence, as reflected in a composite intelligence quotient, or IQ. That is, a single number, embodied in the IQ, is used to portray an individual’s mental ability. This focus on a single dimension of general intelligence is consistent with the theory outlined by Spearman (1904a). The need to consider more than a single factor to represent correlations among ability tests was recognized only a few years after Spearman first described his theory. Furthermore, the need for more than a single factor has been widely acknowledged for over 60 years; the key disagreements in the field relate to how the structure of multiple abilities is portrayed, understood, and used in decision making. This continuing disconnection between theory and practice in the structure and measurement of human cognitive or intellective abilities is a central issue in psychometrics today (McArdle & Woodcock, 1998).

Signs of a closer connection between theory and practice in the measurement of abilities are apparent, and the next decade is likely to show even greater influence of ability theory on the range of mental abilities for which IQs can be obtained. We now review the major theories of the structure of intellectual abilities, which point toward an emerging consensus on the major ability dimensions that constitute the ability domain. This information undergirds the committee’s recommendations regarding the intelligence test scores that can best be used for eligibility decisions.

Spearman’s Two-Factor Theory

Spearman (1904a) developed factor analytic techniques to test his hypothesis that a single dimension accounted for correlations among all tests of mental ability. Spearman called this dimension “general intelligence.” To avoid contaminating the scientific construct of general intelligence with any ideas associated with the notion of intelligence in common parlance, Spearman signified the scientific construct derived from correlations among ability tests with the letter g, which stood for general intelligence. Spearman argued that g represented a new scientific construct, the meaning of which would be established only with substantial empirical research.

Spearman was perhaps the first to notice what is called the positive manifold, which refers to the finding of uniformly positive correlations among tests of ability. This positive manifold is a hallmark of the ability domain and is a distinctive attribute of the domain in comparison with others. Spearman reasoned that, if all tests of ability are positively intercorrelated, a single entity might influence all tests and thus be in common among the tests. Tests that correlate highly with other tests would be more heavily saturated with this common entity, whereas tests that tended to correlate at lower levels with other tests would be less saturated with the common entity. Spearman (1904a) presented techniques for estimating the saturation of each test, based on its correlations with other tests, and he continued to refine and extend these techniques for the remainder of his career.

Spearman’s theory is frequently called the two-factor theory, reflecting the hypothesis that two factors account mathematically for the variance of each measured variable. One of these factors is g, the factor of general intelligence; and the second factor is s_j, a factor that is specific to manifest variable j. Thus, the two-factor theory postulates two classes of factors. One class has a single member, g, the factor of general intelligence, which is the single influence that is common to all tests of ability. The second class of factors has as many members as there are tests of ability, one specific factor for each different test of ability.
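To make the idea of saturation concrete, the following minimal sketch assumes that a single common factor is the only source of correlation between tests, so that the correlation between any two tests equals the product of their g saturations. Under that assumption, the squared saturation of test j can be recovered from any triad of tests as r_jk * r_jl / r_kl. The loadings, function names, and the triad-averaging step are hypothetical illustrations, not a reconstruction of the specific techniques Spearman used.

```python
from itertools import combinations


def one_factor_correlations(loadings):
    """Correlation matrix implied by a single common factor (assumed model)."""
    n = len(loadings)
    return [
        [1.0 if j == k else loadings[j] * loadings[k] for k in range(n)]
        for j in range(n)
    ]


def triad_saturations(corr):
    """Estimate each test's saturation from triads of correlations.

    Under a one-factor model, r_jk * r_jl / r_kl equals the squared
    saturation of test j for every pair of other tests (k, l).
    Requires at least three tests and nonzero correlations.
    """
    n = len(corr)
    estimates = []
    for j in range(n):
        others = [k for k in range(n) if k != j]
        squared = [
            corr[j][k] * corr[j][l] / corr[k][l]
            for k, l in combinations(others, 2)
        ]
        estimates.append((sum(squared) / len(squared)) ** 0.5)
    return estimates


# Hypothetical saturations for four tests, chosen only for illustration.
true_loadings = [0.9, 0.8, 0.6, 0.4]
corr = one_factor_correlations(true_loadings)
recovered = triad_saturations(corr)
print([round(a, 3) for a in recovered])
```

Because the matrix here is built exactly from the assumed model, the recovered values match the hypothetical saturations. With real test data the triads would disagree with one another, and the size of that disagreement is one informal sign that a single factor is not sufficient.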

As a theoretical metaphor, Spearman (1927) borrowed from the Industrial Revolution. Arguing that g, or general intelligence, could be likened to or identified with mental energy, Spearman also hypothesized that individual differences in mental energy were largely genetic in origin. This mental energy can be directed toward any kind of intellectual task or problem, and the greater the amount of mental energy devoted to a task, the better the performance on the task. Individuals with a high level of g have a high level of mental energy to devote to intellectual pursuits, whereas persons with low levels of g have much lower levels of mental energy at their disposal when confronting intellectual problems or puzzles. Consequently, individual differences in g reflect individual differences in mental energy, and individual differences in mental energy lead to individual differences in performance on all ability tests and therefore account for the correlations among all tests of mental ability.

The specific factor for variable j, s_j, is composed theoretically of two components: a reliable component that is specific to variable j, and a stochastic or random component that represents random error of measurement. (This specific factor is sometimes referred to as residual variance.) In most research situations, these two components cannot be separated, so emphasis is laid on the combined specific factors. Spearman equated the specific factor s_j for a given test j with an engine. General intelligence, or g, provides the mental energy to power the engine that is used to solve a particular type of problem. Thus, one engine would be used to solve the problems on a verbal comprehension test, another engine would be used for numerical problems, and so forth. For certain types of problems, the general factor g is of primary importance, leading to a high g-loading for such a test and a relatively low contribution to explained variance by the engine, or specific factor, for the test. But, for other tests, g is of less importance, and the engine for the test accounts for the majority of the variance. The specific factor s_j for a test is an opportunity for the environment or experience to play a part in performance on mental ability tests.
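The decomposition described above can be written compactly. The following is a sketch in conventional factor-analytic notation rather than Spearman's own. The symbols x_j (standardized score on test j), a_j (the g-loading of test j), u_j (the reliable specific component), and e_j (random error) are assumptions introduced here for illustration, and g, u_j, and e_j are assumed to be mutually uncorrelated, with g having unit variance.

```latex
\begin{align}
x_j &= a_j\, g + s_j, \qquad s_j = u_j + e_j, \\
\operatorname{Var}(x_j) &= a_j^2 + \operatorname{Var}(u_j) + \operatorname{Var}(e_j) = 1 .
\end{align}
```

In this notation, a test with a high g-loading has a_j^2 close to 1 and little variance left for its specific factor, whereas a test with a low g-loading leaves most of its variance to the specific factor.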

When conducting research to test his hypothesized ability structure, Spearman often conducted analyses so that the results would conform to his theory. For example, Spearman (1914) dropped a test from his analyses because its inclusion resulted in a failure to satisfy his statistical criterion for adequacy of a single factor. Once the test was dropped from the analysis, the remaining tests satisfied the mathematical criterion, supporting the adequacy of a single factor for the set of tests. This approach, discarding tests that led to failure to confirm his theory, was a common one for Spearman, who discarded recalcitrant tests in several analyses reported in his major empirical work on mental abilities (Spearman, 1927). As a result, Spearman’s two-factor theory has equivocal support, because any indication of lack of fit was effectively swept under the rug. But the two-factor theory is important for several reasons, including its status as the first theory of the structure of mental abilities, the clarity with which the theory and its predictions were stated, and the close interplay between psychological theory and the mathematical and statistical tools developed to test it.
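The text above does not name the statistical criterion for the adequacy of a single factor. A criterion commonly associated with Spearman's work is the vanishing of tetrad differences: if one factor accounts for all correlations, then r_ab * r_cd - r_ac * r_bd is zero for every set of four tests. The sketch below assumes that criterion and uses a hypothetical battery in which two tests share some extra variance beyond g. It illustrates how dropping one of those tests makes the remaining battery satisfy the criterion. All loadings and names are invented for illustration.

```python
from itertools import combinations


def one_factor_correlations(loadings):
    """Correlation matrix implied by a single common factor (assumed model)."""
    n = len(loadings)
    return [
        [1.0 if j == k else loadings[j] * loadings[k] for k in range(n)]
        for j in range(n)
    ]


def max_abs_tetrad(corr):
    """Largest absolute tetrad difference over all sets of four tests."""
    worst = 0.0
    for a, b, c, d in combinations(range(len(corr)), 4):
        p1 = corr[a][b] * corr[c][d]
        p2 = corr[a][c] * corr[b][d]
        p3 = corr[a][d] * corr[b][c]
        worst = max(worst, abs(p1 - p2), abs(p1 - p3), abs(p2 - p3))
    return worst


def drop_test(corr, index):
    """Return the correlation matrix with one test removed."""
    keep = [i for i in range(len(corr)) if i != index]
    return [[corr[i][j] for j in keep] for i in keep]


# Hypothetical g-loadings for five tests.
g_loadings = [0.8, 0.7, 0.6, 0.5, 0.4]
corr = one_factor_correlations(g_loadings)

# Assumption: tests 0 and 1 also share variance that is unrelated to g,
# which raises their correlation above what one factor alone implies.
corr[0][1] = corr[1][0] = corr[0][1] + 0.09

full_battery = max_abs_tetrad(corr)
reduced_battery = max_abs_tetrad(drop_test(corr, 1))
print(f"full battery: {full_battery:.4f}, after dropping test 1: {reduced_battery:.2e}")
```

In this constructed example the full battery violates the criterion, while the reduced battery satisfies it up to floating-point rounding. The extra shared variance has not disappeared; it has only been removed from view, which is the concern raised about discarding recalcitrant tests.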

Thurstone’s Primary Mental Abilities

During the 1930s, L.L. Thurstone and his colleagues pursued a program of research designed to identify the basic set of dimensions that span the ability, or intelligence, domain. Rather than beginning with a strong a priori theory about the structure of mental abilities as Spearman had done, Thurstone and his collaborators took a very different approach. Specifically, they collected a large battery of tests comprising all conceivable types of intellectual tasks, administered the battery to a large sample of subjects, and then analyzed the correlations among the tests in this battery to determine the number and nature of the dimensions required to account for the correlations. If the same dimensions continued to emerge from their analyses across several samples of subjects and different but largely overlapping batteries of tests, then the dimensions would serve as a framework for representing the ability domain.

In several early studies, Thurstone and his colleagues (1938a, 1938b) found seven interpretable factors that were replicated across several analyses; these seven factors were termed primary mental abilities. The seven primary mental abilities that consistently appeared across samples were identified as: (1) verbal comprehension (V), or the ability to extract meaning from text; (2) word fluency (W), subsuming the ability to access elements of the lexicon based on structural characteristics (e.g., first letters, last letters), rather than meaning; (3) spatial ability (S), or the ability to rotate figural stimuli in a two-dimensional space; (4) memory (M), involving the short-term retention of material typically presented in paired-associate format; (5) numerical facility (F), reflecting the fast and accurate response to problems involving simple arithmetic; (6) perceptual speed (P), or the speedy identification of stimuli based on their stimulus features; and (7) reasoning (R), which represented inductive reasoning in some studies, deductive reasoning in other studies, and general reasoning in still others.

As for an interpretation of the nature of mental abilities, Thurstone (1938a, 1938b) was not specific. He repeatedly referred to ability dimensions as representing “functional unities,” by which he meant that the tests loading on a given factor had some functional similarity that was hypothesized to be the same across tests. Thurstone did believe that the future would bring a mapping of mental abilities onto brain areas, such that each ability factor would be tied to particular brain areas that supported its functioning. But brain mapping was in its initial stages and Thurstone could only voice this as a hope for the future. He did think that cognitive psychology held hope for understanding the underpinnings of mental abilities, stating that psychologists should move into the laboratory to devise studies that would illuminate why a given set of tests loaded on a given factor (Thurstone, 1947). Once again, the field of psychology was not ready for this recommendation, and cognitive investigations into the processes underlying mental test performance began in earnest about 30 years after Thurstone’s encouragement to pursue this avenue of research.

In the initial studies by Thurstone and his collaborators (e.g., 1938a, 1938b), the primary mental ability factors were rotated orthogonally, so they were statistically uncorrelated with one another. But after the development of the mathematical theory for oblique rotations (Tucker, 1940), Thurstone and Thurstone (1941) quickly applied oblique rotations to the primary mental abilities and found substantial correlations among the seven dimensions. The correlations among the primary mental abilities were well described by a single second-order factor, which Thurstone and Thurstone argued provided a way to reconcile Spearman’s theory with their own. That is, at the level of the primary mental abilities, seven dimensions were required to represent the relations among a large set of tests. But correlations among the primary mental abilities could be explained by a single second-order factor. Thus, one could argue that Spearman pursued work on the ability domain at the second-order level, whereas Thurstone and his colleagues worked to specify the dimensions that constituted the first-order level of factoring. Although this would provide a way of integrating the Spearman and Thurstone models, not all researchers agreed with this position. Indeed, Spearman (1939) argued that the primary mental abilities were rather trivial and narrow, and that the second-order general factor, or g, should be considered the principal or primary factor, rather than being relegated to second-order importance.
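The two levels of factoring described above can be sketched in conventional notation. The symbols are assumptions introduced here for illustration, not the notation of Thurstone and Thurstone: x_j is the standardized score on test j, \eta_p is the p-th primary mental ability, \lambda_{jp} is the first-order loading of test j on ability p, \gamma_p is the second-order loading of ability p on g, and \varepsilon_j and \zeta_p are residual terms assumed to be uncorrelated with g and with one another.

```latex
\begin{align}
x_j &= \sum_{p=1}^{7} \lambda_{jp}\, \eta_p + \varepsilon_j
  && \text{(first order: tests on primary abilities)} \\
\eta_p &= \gamma_p\, g + \zeta_p
  && \text{(second order: primary abilities on } g\text{)} \\
\operatorname{Corr}(\eta_p, \eta_q) &= \gamma_p\, \gamma_q
  && (p \neq q,\ \text{with } \eta_p \text{ and } g \text{ standardized})
\end{align}
```

Read this way, the correlations among the obliquely rotated primary abilities have the same product structure at the second-order level that Spearman hypothesized for correlations among tests.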

British Hierarchical Theorists

As early as 1909, Burt performed analyses that demonstrated the need to consider more than a single factor for explaining the correlations among a set of manifest indicators of ability. In this early publication, Burt (1909) provided little indication of a meaningful multiple factor structure, but 40 years later, he presented a theoretical summary of research that provided a three-level structure of mental abilities (Burt, 1949). At the first level, Burt postulated the presence of basic sensory and perceptual dimensions, including dimensions such as sound discrimination thresholds. The second level contained dimensions that were more cognitive and intellective in nature; here, typical ability dimensions such as verbal comprehension and spatial ability resided. The third level had a single dimension, the general factor of Spearman.

Vernon (1950, 1961) provided the most comprehensive and integrative review of the hierarchical theory; Vernon’s focus was at its highest levels. The topmost level had a single dimension, the general intelligence factor, g, of Spearman. Below g were two subgeneral abilities: v:ed (or verbal:educational), and k:m (or spatial:mechanical). Below the v:ed subgeneral dimension fall factors such as verbal comprehension, verbal fluency, numerical facility, and reasoning, whereas under the k:m subgeneral dimension are factors such as spatial rotation, mechanical and technical information, and various psychomotor abilities. Vernon presented the hierarchical structure of abilities as a way of summarizing the previous three decades of research and considered the several versions of the hierarchy to be tentative and subject to revision in the future. However, both Vernon and Burt believed strongly in the nature of the general factor g as representing a single entity that was common to all tests of ability.

The third member of the British hierarchical group was Godfrey Thomson (1951), who supported the general hierarchy of abilities even as he espoused a rather different theoretical basis for it. Thomson’s hierarchical factor pattern was similar to Vernon’s, with a general factor aligned with Spearman’s g at the apex of the hierarchy, followed by rather broad subgeneral factors, and finally a series of much more narrow factors at the bottom of the hierarchy.

However, Thomson believed that the ability hierarchy was based on a radically different set of processes. Indeed, he repudiated the notion of a single entity common to all tests of ability. Instead, he proposed that the human mind may be composed of a virtually infinite set of bonds or potential bonds that are independent of one another. When a person works on a particular type of test, a given set of bonds is required to arrive at a correct answer. When a different type of test is administered, a different but overlapping set of bonds is activated. The more highly overlapping the sets of bonds required by two tests, the higher the correlation between the tests. Conversely, if the sets of bonds sampled by two tests show little overlap, then the tests will correlate positively but at a low level. The upshot of Thomson’s sampling theory was this: no single entity (i.e., bond) may be found that is common to all tests of mental ability, so the hierarchical structure of human mental abilities simply indicates the degree of overlap among the bonds sampled by tests of mental ability.
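A small worked example may help show how overlapping samples of bonds can produce uniformly positive correlations without any bond being common to all tests. The sketch below makes simplifying assumptions that go beyond the text: there are only 12 bonds, each bond is independent with equal variance, a test score is the sum of the bonds it samples, and each of four hypothetical tests samples 8 consecutive bonds arranged in a ring. Under those assumptions the correlation between two tests is the number of shared bonds divided by the geometric mean of the two sample sizes.

```python
from itertools import combinations
from math import sqrt

N_BONDS = 12  # hypothetical; Thomson envisioned a virtually infinite set


def ring_sample(start, size=8, n_bonds=N_BONDS):
    """Bonds sampled by one test: `size` consecutive bonds on a ring."""
    return {(start + i) % n_bonds for i in range(size)}


# Four hypothetical tests, each sampling a different but overlapping set.
tests = {
    "A": ring_sample(0),
    "B": ring_sample(3),
    "C": ring_sample(6),
    "D": ring_sample(9),
}


def overlap_correlation(bonds_x, bonds_y):
    """Correlation of two sum scores built from independent, equal-variance bonds."""
    return len(bonds_x & bonds_y) / sqrt(len(bonds_x) * len(bonds_y))


correlations = {
    (x, y): overlap_correlation(tests[x], tests[y])
    for x, y in combinations(tests, 2)
}
common_to_all = set.intersection(*tests.values())

for pair, r in correlations.items():
    print(pair, round(r, 3))
print("bonds shared by every test:", common_to_all)
```

Every pair of tests in this example correlates positively (0.5 or 0.625), yet the set of bonds shared by all four tests is empty. The positive manifold therefore arises from pairwise overlap alone, which is the point of the sampling account.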

The Thomson explanation for the hierarchy of mental abilities may lead to a number of reactions. One may become highly suspicious of factor analytic approaches, as one set of empirical results, with a dominant general factor, is consistent with diametrically opposed generating mechanisms: a single entity common to all tests (e.g., Spearman) versus no single entity common to all tests of ability (e.g., Thomson).

But another reaction to these findings is to become attuned to the need to marshal evidence beyond the pattern of tests loading on factors. The loadings of tests on factors may suggest the presence of an entity common to all tests of ability. But additional evidence of different types may be relevant to the choice between a single common entity and the absence of a single common entity. This additional evidence may then tip the balance in favor of one or the other of the two competing positions.

Guilford’s Structure of Intellect

Based on considerable research during World War II on army recruits and a thorough review of cognitive psychological research, J.P. Guilford (1967) developed a model he termed the structure of intellect, or SOI. He and his colleagues spent over two decades attempting to confirm the basic hypotheses of SOI theory, work summarized by Guilford (1967) and Guilford and Hoepfner (1971). The Guilford theory was well recognized as a competing model of ability structure until Horn and Knapp (1973) published a reanalysis of many of the data sets used by Guilford and his colleagues to corroborate SOI theory. They found that Guilford’s own data gave much stronger, in fact almost perfect, support for Thurstone’s hypotheses than for hypotheses generated by SOI theory. An interesting pair of commentaries on the Horn and Knapp (1973) study by Guilford (1974) and Horn and Knapp (1974) left the main findings by Horn and Knapp (1973) unchallenged. SOI theory is no longer recognized as a useful conceptualization of the structure of human abilities.

Cattell-Horn Theory of Fluid and Crystallized Intelligence

Capitalizing on a much earlier observation (Cattell, 1941), Raymond B. Cattell (1963) proposed a new theory of ability structure, subsequently referred to as the theory of fluid and crystallized intelligence. According to the initial theory sketched by Cattell (1963), two very broad and important dimensions of intelligence (fluid intelligence, or Gf, and crystallized intelligence, or Gc) could be distinguished, rather than the single dimension of g hypothesized by Spearman. Cattell conceived of fluid intelligence in ways that were reminiscent of Spearman’s theorizing about g. In particular, Gf was thought to be a reservoir of reasoning ability that could be directed toward many different kinds of content, hence its identification as a fluid form of intelligence. Furthermore, Gf was thought to be largely genetically determined.

As fluid intelligence was expended on a given kind of content or intellectual problem, the individual would develop knowledge stores related to the particular content or type of problem as well as mental algorithms for solving such problems. The knowledge and mental algorithms developed through the application of Gf on given tasks are therefore crystallizations of the influence of Gf. Thus, verbal comprehension, or the ability to extract meaning from text, is a crystallized ability assessed using tests of vocabulary, paragraph comprehension, and the understanding of proverbs, among others. All of these tests require one to extract the meaning from text using stored meanings of words in the lexicon. Numerical facility is a crystallized ability that subsumes knowledge of simple numerical facts (e.g., addition facts, subtraction facts) as well as algorithms for solving numerical problems that cannot be solved easily mentally (e.g., long division, multiple-place multiplication). The higher a person’s level of Gf, the greater the amount of fluid intelligence invested in particular tasks, and therefore the higher that person’s general level of performance on crystallized ability tasks. Because Gf influences performance on all crystallized ability tasks, these tasks should correlate with one another and therefore define a general crystallized intelligence factor, or Gc.

Because Gf was a fluid ability to reason with new material, Cattell (1963, 1971) argued that Gf was best measured using either novel stimuli or problems or with highly overlearned stimuli with which a person is instructed to perform some novel operation, like doing simple math with letters of the alphabet. Theoretically, Gf was largely genetic in origin, and any learning that affected tests for Gf would be haphazard learning that occurred in the context of daily life. In contrast, Gc was best measured using tests of standard cultural knowledge (vocabulary, information, similarities) or tests of material like numerical facility that was highly practiced in standardized cultural settings such as school. One hypothesis regarding the pattern of tests loading on factors that distinguishes Gf-Gc hypotheses from those of the British hierarchical theorists has to do with tests of mechanical knowledge. Cattell argued that tests of mechanical knowledge should load on the Gc factor, which is closest to Vernon’s v:ed factor, because mechanical knowledge is systematically taught in schools, rather than on the k:m factor, as Vernon had hypothesized. In addition, boys should have an advantage on mechanical knowledge over girls relative to other indicators of Gc, due to the more consistent teaching of mechanical knowledge to boys than to girls. These hypotheses were confirmed, lending support to structural hypotheses of Gf-Gc theory over those associated with the hierarchical model of Vernon.

Cattell (1971) made a further contribution to the understanding of mental abilities by distinguishing between the order and stratum of a factor. The order of a factor is a superficial, methodological aspect of the analysis in which a factor is identified, whereas the stratum a factor occupies is a deeper, theoretical concern regarding the nature and breadth of the factor. Factors that are obtained from analyzing the correlations among observed variables are termed first-order factors. If the first-order factors are rotated obliquely, factoring the matrix of correlations among first-order factors leads to the identification of second-order factors. Multiple orders of factoring may be continued as long as at least three oblique factors are identified at a given level. In contrast, the stratum a factor occupies depends on its breadth and the generality of its influence.

To make the distinction between order and stratum clearer, consider the following two research scenarios. In the first scenario, suppose that a researcher included in a battery of tests three tests of word fluency, three tests of associational fluency, and three tests of ideational fluency. Factoring these nine tests would lead to the identification of three first-order factors, one each for word fluency, associational fluency, and ideational fluency. If the correlations among these three factors were factored, a single general fluency factor (or Glr, for general long-term retrieval from memory) could be derived as a second-order factor, and the three first-order fluency factors would load on this second-order factor. In this research scenario, the first-order factors are also first-stratum factors, representing the narrowest dimensions that would be fruitful to research. In addition, the second-order general fluency factor is a second-stratum factor, with broader influence on each of several types of more narrow fluency.

In the second research scenario, given constraints in testing time, the second researcher could administer only a single test for word fluency, a single test for associational fluency, and a single test of ideational fluency. In this scenario, the researcher could not identify first-stratum factors for word fluency, associational fluency, and ideational fluency, because only a single manifest variable for each dimension was available, and one must have at least two, and preferably three, tests for a given factor to identify it as a factor. Factor analyzing the three fluency tests would lead to a first-order factor on which the word fluency, associational fluency, and ideational fluency tests loaded. Now this factor is a first-order factor, because it was derived from the correlations among measured variables. But, because the variables loading on it represented different types of fluency, the first-order factor reflects general fluency, or Glr, a second-stratum dimension.
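
The two scenarios can be checked numerically. The following is a minimal sketch under stated assumptions, not an analysis from any of the studies cited. The loadings (0.8, 0.7, and 0.6 for the three fluency factors on Glr, and 0.7 for every test on its own factor) are hypothetical values chosen for illustration. The helper `one_factor_loadings` is an illustrative name. It applies the classical closed-form one-factor solution for three variables, in which each squared loading equals r_ij * r_ik / r_jk, and which is exact only for error-free population correlations.

```python
import numpy as np

def one_factor_loadings(R):
    """Closed-form one-factor loadings for a 3x3 correlation matrix."""
    r12, r13, r23 = R[0, 1], R[0, 2], R[1, 2]
    return np.sqrt(np.array([r12 * r13 / r23,
                             r12 * r23 / r13,
                             r13 * r23 / r12]))

# Hypothetical loadings of word, associational, and ideational fluency on Glr.
gamma = np.array([0.8, 0.7, 0.6])

# Correlations among the three oblique first-order factors implied by gamma.
Phi = np.outer(gamma, gamma)
np.fill_diagonal(Phi, 1.0)

# Scenario 1: nine tests, three per fluency factor, each loading 0.7 (hypothetical).
Lambda = np.kron(np.eye(3), np.full((3, 1), 0.7))   # 9 tests x 3 factors
R9 = Lambda @ Phi @ Lambda.T
np.fill_diagonal(R9, 1.0)

# Factoring the first-order factor correlations gives the second-order factor.
second_order = one_factor_loadings(Phi)

# Scenario 2: only one test per fluency type is administered.
idx = [0, 3, 6]
R3 = R9[np.ix_(idx, idx)]
first_order = one_factor_loadings(R3)

print("second-order loadings (scenario 1):", second_order)
print("first-order loadings (scenario 2): ", first_order)
```

In this sketch the scenario 2 factor is first-order, yet its loadings are simply the scenario 1 second-order loadings scaled by the test loading of 0.7. That is consistent with reading it as the same second-stratum dimension, general fluency.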

The distinction between the order and the stratum of factors enables one to place results in a hierarchical structure based on the stratum of the factors found in different studies. The current version of Gf-Gc theory has been outlined in several papers by John L. Horn (1985, 1988, 1998). The ability structure for Gf-Gc theory posits at least 55 primary or first-stratum factors. When correlations among the first-stratum factors are analyzed, nine second-stratum factors are found. These nine second-stratum factors are: (1) Gc (crystallized intelligence), which has verbal comprehension, semantic relations, numerical facility, mechanical knowledge, syllogistic reasoning, verbal closure, and general information factors as indicators; (2) Gf (fluid intelligence), which subsumes first-order factors such as induction, general reasoning, figural relations, concept formation, and symbolic classification; (3) Gv (general visualization), with loadings from first-stratum factors for visualization, speed of closure, flexibility of closure, spatial orientation, figural fluency, and figural adaptive flexibility; (4) Ga (general auditory processing), with loadings from first-stratum factors, such as listening, verbal comprehension, temporal tracking, sound pattern discrimination, and auditory memory span; (5) Gsm (general short-term memory, also identified at times as SAR, for short-term acquisition and retrieval), which subsumes first-stratum factors of associative memory, span memory, meaningful memory, and memory for order; (6) Glr (general long-term memory, also sometimes identified as TSR, for tertiary storage and retrieval), which represents a variety of fluency dimensions, such as delayed retrieval, associational fluency, expressional fluency, ideational fluency, word fluency, and originality; (7) Gs (general speediness or processing speed), covering first-stratum dimensions of perceptual speed, numerical facility, and writing and printing speed; (8) Gt (decision speed, also identified at times as CDS, for correct decision speed), reflecting choice reaction time, decision speed, and simple reaction time; and (9) Gq (general quantitative knowledge), representing dimensions such as applied problems, quantitative concepts, numerical facility, and general reasoning.

The preceding results related to the loading of first-stratum factors on the nine second-stratum factors may be termed structural results. But in the continued development of Gf-Gc theory, Horn (1998) has always monitored several additional kinds of information. One of these additional types of information is derived from developmental studies and consists both of kinematic trends (developmental growth and decline of abilities over the life span) and of the dynamic effects of ability dimensions on one another. The differential kinematic, life-span trends for the various second-stratum abilities have been replicated many times.

These trends show that both Gf and Gs begin to decline early in adulthood, around the age of 30, whereas Gc continues to increase in mean level until perhaps age 70 before declines begin. This is perhaps the strongest evidence against attempting to define a higher-stratum general factor analogous to Spearman’s g, an argument that Horn has made repeatedly. The moderate correlations among the nine second-stratum ability dimensions have been an impetus to many researchers to factor analyze these correlations and obtain a higher-order general factor. But Horn argued that, with very different life-span trends for the second-stratum dimensions, any general factor would be constructed out of the mixing of cognitive apples and oranges. This would lead to a hopelessly confounded and uninterpretable general factor showing essentially no change in level during the adult years, a pattern that none of the second-stratum factors actually displays. The dynamic effects mentioned above involve the hypothesized lead-lag relations among abilities. The most often cited of these is the hypothesis that Gf leads to later increases in Gc due to the investment of Gf on intellectual problems. Studies of these dynamic hypotheses have not been strongly supportive of hypothesized relations, but the current development of better models to test these hypotheses may lead to more definitive results.
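
Horn's apples-and-oranges argument can be illustrated with simple arithmetic. The sketch below uses invented linear trends for a declining Gf and a rising Gc between ages 30 and 70, and averages them into a composite. The starting mean of 100 and the slope of 0.25 points per year are hypothetical values, not estimates from any study.

```python
import numpy as np

ages = np.arange(30, 71, 10)          # 30, 40, 50, 60, 70

# Hypothetical mean trends: Gf declines and Gc rises at the same assumed rate.
gf = 100.0 - 0.25 * (ages - 30)
gc = 100.0 + 0.25 * (ages - 30)

# An equally weighted composite of the two dimensions.
composite = (gf + gc) / 2.0

for age, f, c, g in zip(ages, gf, gc, composite):
    print(f"age {age}: Gf={f:.1f}  Gc={c:.1f}  composite={g:.1f}")
```

Under these assumed trends the composite stays flat across adulthood even though neither component does, which is the pattern Horn regarded as uninterpretable.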

Horn (1985, 1998) evaluated still other kinds of research evidence, which are discussed here only briefly. Although still somewhat premature for drawing final conclusions, neurocognitive studies appear to support the hypothesis that different ability factors are subserved by different brain areas. As these findings become more firmly established, they will provide additional support for the hypotheses of Gf-Gc theory. Another type of evidence is derived from studies of heritability. Gf-Gc theory makes certain predictions regarding heritability, or the degree of genetic variance in ability factors. One such prediction is that Gf should have higher heritability than Gc. Here, the evidence is not obviously supportive of Gf-Gc theory, as most estimates of heritability show about equal heritabilities for Gf and Gc. A final kind of evidence comes from studies of achievement, in which achievement in particular curricular areas is related to second-stratum dimensions of ability. Horn (1998) noted the difficulties with such studies but concluded that achievement studies tend to support differential relations between achievements in distinct curricular areas and associated second-stratum factors of ability.

In summary, Gf-Gc theory is a complex and far-reaching enterprise. The theory makes predictions in the structural domain concerning the loading of first-stratum abilities on the broad second-stratum factors, but also makes clear predictions in several other domains. Although empirical results to date are not fully supportive of all predictions of the theory, a sufficient number of predictions have been confirmed that Gf-Gc theory is the most comprehensive and widely supported ability theory currently available. The frequent replication of the differential life-span trends for different abilities has resulted in Gf-Gc theory being the primary theoretical framework now used in studies of adulthood and aging. Moreover, the well-replicated structural results are leading the developers of intelligence tests to incorporate measures of Gf and Gc, in addition to an overall IQ, in the scoring of their instruments.

Carroll’s Three-Stratum Theory

In 1993, John B. Carroll published a monumental tome that reported the reanalyses of over 450 sets of data. The aim of this project was to reanalyze all previous ability studies using a constant and well-justified set of factor analytic techniques, trusting that this would lead to a more consistent set of results across studies. The factor analytic results reported by Carroll are similar to the Horn-Cattell structural results in most respects, so little detailed description is needed here. We merely recount the broad strokes of the Carroll approach.

The upshot of the reanalysis of 477 studies was the identification of approximately 65 narrow, first-stratum factors. When correlations of these first-stratum factors were analyzed, eight second-stratum factors were located. When the correlations among second-stratum ability factors were analyzed, Carroll identified a single third-stratum factor, which he interpreted as corresponding to Spearman’s g. One interesting advance by Carroll was to identify both level and speed components of abilities, where appropriate. The level component involves power tests in which time limits have little effect on individual differences on the tests. In contrast, the speed (or rate) component contains tests on which time limits or the rate of presentation of information, and therefore the speediness of performance, is important to measuring individual differences on the tests. One second-stratum ability had only level indicators, and two second-stratum dimensions had only speed indicators. The remaining five second-stratum factors had both level and speed (or rate) indicators.

The eight second-stratum factors identified by Carroll (1993) are: (1) Gf (fluid intelligence), with level first-stratum factors of general reasoning, induction, and quantitative reasoning and a speed factor of speed of reasoning; (2) Gc (crystallized intelligence), with level indicators of language development, verbal comprehension, spelling, and communication and speed indicators of oral fluency and writing ability; (3) Y (general memory and learning), with a level first-stratum factor of memory span and rate (related to speed) indicators of associative memory, free recall memory, meaningful memory, and visual memory; (4) V (broad visualization), with a level factor of visualization and speed indicators of spatial relations/orientation, speed of closure, flexibility of closure, and perceptual speed; (5) U (broad auditory perception), with level indicators of hearing and speech thresholds, speech sound discrimination, and musical discrimination and no clear speed or rate indicators; (6) R (broad retrieval), with level indicators of originality and creativity and speed indicators of ideational fluency, associational fluency, expressional fluency, word fluency, and figural fluency; (7) S (broad cognitive speediness), with no level indicators but speed indicators of rate of test taking and numerical facility; and (8) T (processing speed and/or decision speed), once again with no level indicators but speed indicators such as simple reaction time, choice reaction time, semantic processing speed, and mental comparison speed.

Despite the clear similarities between the Horn-Cattell and Carroll models, some important differences are apparent. The key difference concerns the presence and nature of a general intelligence factor.

Carroll (1997) argued that his work provided perhaps the strongest and most comprehensive support yet for the general intelligence factor, a position Cattell would probably have seconded. Carroll also identified the general factor as corresponding to Spearman’s g, the mental ability common to all tests of ability, also a position that Cattell might have favored. However, for more than 25 years, Horn has been responsible for the current synthesis of the Horn-Cattell model. He has long disclaimed the utility of a general factor, despite the positive correlations among the second-stratum abilities. Based on other information, such as the trends of growth and decline over the life span for second-stratum abilities, any overall score approximating general intelligence would represent a changing mixture of abilities, a general level of a person’s profile of second-stratum abilities, or “intelligence in general” or “on average,” rather than a single element common to all tests that retains its unitary nature across development. This striking difference of scientific opinion is reminiscent of the conflicting views on the nature of the general factor held by Vernon and Thomson, discussed above. The monumental work by Carroll (1993) was concerned almost exclusively with structural information about how variables load on factors. Carroll dismissed other forms of data, particularly differential life-span aging trends, by claiming that the data and their implications for theory were not yet sufficiently well established. In contrast, Horn has always studied structural information, but he has also monitored and integrated information from numerous other sources, such as kinematic or life-span trends, dynamic relations between abilities over time, and neurocognitive studies. Taking all of these kinds of information into consideration, Horn has argued that the existence of a single, unchanging entity common to all tests of ability cannot be supported.

The Horn-Cattell and Carroll models exhibit additional, but less important, differences. One of these is the absence of Gq, or general quantitative ability, as a second-stratum dimension in the Carroll model. Carroll considered the Gq dimension of the Horn-Cattell theory to be too narrow and lacking a sufficient research base to be accorded a position as a second-stratum factor. Also, some differences in the first-stratum factors subsumed by second-stratum dimensions can be found.

Aside from the preceding differences, eight of the second-stratum ability dimensions from the Horn-Cattell and Carroll models fall in a rather clear one-to-one relation with one another. Some second-stratum dimensions have differing names and identifying symbols across the two systems. Still, the eight second-stratum dimensions represent the current state of the science with regard to the broad abilities that span the intelligence domain. An overall score, whether corresponding to Spearman’s g or to a changing composite reflecting “intelligence in general,” may be a useful summary index of a person’s general level of functioning, regardless of whether one believes the score corresponds to a particular identifiable entity.
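
The correspondence can be written out as a lookup table. The pairing below is inferred from the factor descriptions given earlier in this chapter (for example, both Glr and R are defined largely by fluency factors). It is a reading aid under that assumption, not a table published by Horn or Carroll, and the variable names are illustrative.

```python
# Horn-Cattell symbol -> Carroll symbol (pairing inferred from the descriptions above).
HORN_CATTELL_TO_CARROLL = {
    "Gf": "Gf",   # fluid intelligence
    "Gc": "Gc",   # crystallized intelligence
    "Gsm": "Y",   # short-term memory / general memory and learning
    "Gv": "V",    # general / broad visualization
    "Ga": "U",    # general auditory processing / broad auditory perception
    "Glr": "R",   # long-term retrieval / broad retrieval
    "Gs": "S",    # processing speed / broad cognitive speediness
    "Gt": "T",    # decision speed / processing and decision speed
}

# Gq appears only in the Horn-Cattell model, so it has no Carroll counterpart.
UNMATCHED_HORN_CATTELL = ["Gq"]

for horn_cattell, carroll in HORN_CATTELL_TO_CARROLL.items():
    print(f"{horn_cattell:>3} <-> {carroll}")
```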

Other Theories

The preceding theories were developed in connection with the use of factor analysis, which was used to derive the dimensions underlying batteries of tests and thereby confirm or disconfirm the hypotheses put forward by the groups of researchers. In addition to these theories based on factor analysis, several additional theories of the structure of mental abilities have been developed. Most of these other theories have been based on a priori theory or summaries of previous research, but have relied much less or not at all on sophisticated measurement techniques such as factor analysis. As a result, the utility of these theories for applied work on the assessment of intelligence is much more limited, although the future may see greater application of the ideas.

The first of these other theories is embodied in the PASS model of Das, Naglieri, and Kirby (1994). PASS stands for planning, attention, simultaneous processing, and successive processing, which are processes or mental functions associated with particular brain areas by Luria (1966a, 1966b). Planning refers to processes governing cognitive control and self-regulation, enabling a person to develop or plan courses of intelligent action to be followed. Attention subsumes the processes by which a continual focus on cognitive problems is maintained. Simultaneous processing involves processing of stimuli in which the stimulus as a whole must be comprehended or in which elements must be integrated into a meaningful whole. Successive processing concerns processes in which the sequence of the processing of elements is crucial, such as language. Factor analytic studies of the PASS model have been less than fully successful, failing to establish planning and attention as empirically distinct entities. Despite this, the Cognitive Assessment System (Naglieri & Das, 1997) provides a standardized battery to assess the components of the PASS model.

A second theoretical approach encompasses information processing theories derived from cognitive psychology. For example, Campione and Brown (1978) offered an initial model that was further developed by Borkowski (1985). Information processing models of cognitive ability often distinguish the architectural and executive systems, roughly equivalent to the hardware and software components, respectively, of a computer. The architectural system is assumed to be genetically, or at least biologically, based and consists of basic operating parameters of cognitive processes, encompassing individual differences in (1) amount of information that can be processed, which is assessed using memory span, (2) durability of information storage, or the retention of memory traces, and (3) efficiency of processing, or the speed of encoding and decoding information. The executive system encompasses components that are environmentally based and guide processes comprising problem solving. The executive system subsumes components such as (1) one’s knowledge base, or declarative knowledge of facts; (2) control processes, which include strategies or heuristics to aid memory or problem solution; and (3) metacognition, which involves, among other things, knowing how problems should be solved and then monitoring progress toward problem solution and evaluating outcomes to ensure successful solution of the problem. Researchers using the information processing approach have paid little attention to converting theoretical insights into usable measures of intelligence.

Sternberg (1985, 1986, 1996) has offered several theories of human intelligence, theories that have been radically reformulated over time. The most recent incarnation is Sternberg’s notion of successful intelligence. The three components of successful intelligence are (1) analytic abilities, which aid in defining problems, setting up solution strategies, and monitoring solutions and presumably include many of the dimensions outlined in the Horn-Cattell and Carroll models; (2) creative abilities, which involve generating new problem solving options and attempting to convince others of their worth; and (3) practical abilities, which subsume skills in ensuring that one can implement solutions and see that they are carried out. As with information processing approaches, at present no standardized batteries are available to assess constructs within Sternberg’s triarchic theories.

The final theory discussed in this section is the theory of multiple intelligences, described by Gardner (1983). According to this theory, at least eight different types of intelligence can be identified: (1) linguistic intelligence, subsuming language and communication skills; (2) musical intelligence, involving individual differences in rhythm and pitch and skills in composing music; (3) logical-mathematical intelligence, including logical reasoning and number abilities; (4) spatial intelligence, or the ability to understand spatial relations; (5) bodily-kinesthetic intelligence, assessed by skills in dancing, acting, and athletics; (6) intrapersonal intelligence, or knowledge of one’s self, feelings, and motives; (7) interpersonal intelligence, or skills in discerning the feelings, beliefs, and intentions of others; and (8) naturalist intelligence, involving seeing and understanding patterns in nature. Gardner has done little research to validate his theory on the types of intelligence. To the extent that evidence supports the notion of different intelligences, the evidence is consistent with the Horn-Cattell and Carroll theories. For example, Gardner’s linguistic intelligence is most similar to Gc in the Horn-Cattell model. As a result, little empirical evidence is available that uniquely supports Gardner’s theory. Moreover, no standardized measures of the constructs in this theory are available.

Summary

During the 20th century, theories of the structure of mental abilities have evolved from the two-factor theory of Spearman, which hypothesized only a single factor common to all tests of ability, to the more differentiated structure of the Horn-Cattell and Carroll models. In these models, the two most widely studied of the second-stratum factors are Gc and Gf. Gc, or crystallized intelligence, reflects stored cultural knowledge and corresponds closely with the verbal factor often reported in factor analyses of the Wechsler batteries. Gf, or fluid intelligence, is a dimension representing reasoning or thinking skills; the performance factor identified in factor analyses of the Wechsler batteries appears to be an amalgamation of Gf and Gv (or visualization skills).

Some movement has already taken place in structuring intelligence tests to acknowledge the utility of the Horn-Cattell and Carroll models. For example, the Stanford-Binet IV yields a composite IQ, but it was based on a theoretical model that included subareas for crystallized abilities (verbal reasoning and quantitative reasoning), fluid-analytic abilities (abstract/visual reasoning), and short-term memory. Furthermore, one battery, the Woodcock-Johnson, was explicitly designed to assess all second-stratum dimensions in the Horn-Cattell model. During the next decade, even greater alignment of intelligence tests, and the IQ scores derived from them, with the Horn-Cattell and Carroll models is likely. As a result, the future will almost certainly see greater reliance on part scores, such as IQ scores for Gc and Gf, in addition to the traditional composite IQ. That is, the traditional composite IQ may not be dropped, but greater emphasis will be placed on part scores than has been the case in the past. As this movement to part scores develops, it will most likely occur first for Gc and Gf, the most central of the second-stratum factors, and then extend to other second-stratum dimensions as they are determined to be useful for differential prediction.

INTELLIGENCE TESTS COMMONLY USED IN THE DIAGNOSIS OF MENTAL RETARDATION

Given the widespread development of intelligence tests during the past 100 years, but especially during the past 20 years, many instruments with different theoretical orientations and quality can be employed to diagnose mental retardation. Table 3-1 identifies 13 instruments commonly used in the assessment of intelligence and the diagnosis of mental retardation among infants, children, adolescents, and adults. In addition to these 13 instruments, additional comprehensive intelligence tests are available to psychologists (e.g., McCarthy Scales of Children’s Abilities—McCarthy, 1972; Cattell Infant Intelligence Scale—Cattell, 1940). However, these additional tests are not included in the table because they lack norms or because their norms and stimulus materials are too outdated to recommend their use.

Also, several brief or unidimensional intelligence tests are currently available for the screening of intellectual functioning (e.g., Kaufman Brief Intelligence Test, KBIT—Kaufman & Kaufman, 1990; Test of Nonverbal Intelligence: Third Edition, TONI-III—Brown et al., 1997). Although these brief tests may have merit for use as cognitive screeners, they are best suited for low-stakes decision making because of their brevity and limited sampling of important theoretical facets of intelligence. Consequently, intellectual screening instruments are not included in the table.

Finally, a considerable number of group-administered intelligence tests (e.g., Otis-Lennon School Ability Test [OLSAT]; Otis & Lennon, 1979) are also available. But such group-administered instruments, while suitable for group screening and decision making, are not designed or appropriate for high-stakes individual disability diagnosis and decision making. Therefore, of all the intelligence tests published and available in their many forms, the instruments cited in Table 3-1 include current, individually administered, comprehensive tests of intelligence suitable for disability diagnosis and eligibility determination. It should be stated, however, that at some point in the future each of the instruments listed may also become outdated, unless they are revised and renormed. In addition, new instruments may be developed and considered appropriate for inclusion on the list of appropriate instruments. Thus, the list presented in Table 3-1 should be viewed as being valid today, but the equivalent list of appropriate tests is likely to change over time as old tests become outdated and new tests are developed.

TABLE 3-1 Comprehensive Tests of Intelligence

| Intelligence Test | Age Range [a] | Publication Date | Publisher Level [b] | Appropriate for MR | Appropriate Scores |
|---|---|---|---|---|---|
| Bayley Scales of Infant Development-II | Birth to 42 months | 1993 | C | Conditional [c] | Mental development index |
| Cognitive Assessment System | 5-0 to 17-11 | 1997 | C | Yes | Full-scale standard score |
| Differential Ability Scale [c] | 2-6 to 17-11 | 1990 | C | Yes | Verbal ability; Nonverbal ability; General conceptual ability |
| Kaufman Assessment Battery for Children [d] | 2-6 to 12-6 | 1983 | C | Yes | Mental processing composite |
| Kaufman Adolescent and Adult Intelligence Test | 11-0 to 85+ | 1993 | C | Yes | Fluid scale; Crystallized scale; Composite intelligence scale |
| Leiter International Performance Scale-Revised [e] | 2-0 to 20-0 | 1997 | C | Yes | Full-scale IQ |
| Mullen Scales of Early Learning [c] | Birth to 68 months | 1995 | C | Conditional [c] | Early learning composite |
| Stanford-Binet Intelligence Scale: Fourth Edition [b] | 2-0 to 24 | 1986 | C | Yes | Abstract/visual reasoning; Verbal reasoning; SAS composite |
| Universal Nonverbal Intelligence Test [e] | 5-0 to 17-11 | 1998 | C | Yes | Reasoning; Memory; Full-scale IQ |
| Wechsler Adult Intelligence Scale-III | 16 to 89 | 1997 | C | Yes | Verbal scale; Performance scale; Full-scale IQ |
| Wechsler Intelligence Scale for Children-III | 6-0 to 16-11 | 1991 | C | Yes | Verbal scale; Verbal comprehension index; Performance scale; Perceptual organization index; Full-scale IQ |
| Wechsler Preschool and Primary Scale of Intelligence | 2-11 to 7-3 | 1989 | C | Yes | Verbal scale; Performance scale; Full-scale IQ |
| Woodcock-Johnson Psycho-Educational Battery-III | 2-0 to 90+ | 2001 | C | Yes | General intellectual ability |

NOTE: Comprehensive intelligence tests are those that assess intelligence or early cognitive development through multiple subtests and factors, and assess a variety of cognitive processes.

[a] Ages are specified in years-months: 5-0 is 5 years, 0 months of age.

[b] Test publishers use criteria for purchasing tests, with different levels of tests requiring different levels of training and/or credentials. Most comprehensive intelligence tests are known as Class C tests, which require the highest level of training and credential to purchase. The qualification guidelines used by The Psychological Corporation, which are similar to those of other publishers, require the following to purchase a Class C test: “Verification of a PhD-level degree in psychology or education or the equivalent in a related field with relevant training in assessment OR Verification of licensure or certification by an agency recognized by The Psychological Corporation to require training and experience in a relevant area of assessment consistent with the expectations outlined in the 1985 Standards for Educational and Psychological Testing.”

[c] Infant scales may be used for identifying developmental delay that is in the mentally retarded range of functioning, but many psychologists and professional groups defer diagnosis of mental retardation based on developmental scales during the infant/toddler years.

[d] The K-ABC is currently undergoing revision and will be available in two or three years.

[e] The Leiter-R and UNIT are explicitly designed to assess intelligence in a nonverbal administration format. Such tests are employed when language-loaded intelligence tests may provide a distorted portrayal of the client’s current level of intellectual functioning due to limited English proficiency, language-related disabilities (e.g., verbal learning disability, speech disorders), certain psychiatric conditions (e.g., autism, selective mutism), or some neurological disorders.

The instruments listed in Table 3-1 can be thought of in a variety of ways. For example, some instruments, like the Cognitive Assessment System (CAS) and the Kaufman Assessment Battery for Children (K-ABC), are designed as “process” oriented tests that are intended to be sensitive to the processing aspects of intelligence and are based on neuropsychological theories, such as Luria’s conceptualization of brain function and activity. Other instruments, like the Stanford-Binet Fourth Edition and the Wechsler scales, are product-oriented measures that tend to assess the outcome of a lifetime of knowledge acquisition. Two instruments, the Leiter International Performance Scale-Revised (Leiter-R) and the Universal Nonverbal Intelligence Test (UNIT), were designed specifically for use when an examinee’s limited language facility makes it difficult to assess his or her overall cognitive functioning. This could occur with ethnic minorities, individuals who speak English as a second language, individuals who are deaf or hard of hearing or autistic or selective/elective mutes, and others. In such instances, language-based intellectual assessments may produce “construct irrelevant variance.” That is, test scores may be contaminated by variance related to a confounding influence like poor English facility, resulting in an IQ that is not a good indicator of the person’s true ability.

Another distinction among the instruments listed in Table 3-1 is the number and types of intellective factors or abilities assessed. The Wechsler scales, for example, are composed of two major subscales (verbal and performance), with three or four cognitive factors that better explain the tests’ true theoretical underpinnings. At the other extreme of sheer numbers of abilities assessed by a test, the cognitive battery of the Woodcock-Johnson Psycho-Educational Battery-III (Woodcock et al., 2001) purports to assess seven distinct cognitive factors. Most of the instruments cited in the table assess between three and five cognitive factors, with support for their theoretical underpinnings adequate to warrant their use in the diagnosis of mental retardation.

ASSESSMENT CONDITIONS THAT AFFECT INTELLIGENCE TEST SCORES

Intelligence test scores have considerable weight in diagnostic determination of mental retardation. Because of the importance placed on IQs, it is essential that examiners ensure that these scores are obtained in the most objective, clinically appropriate, and standardized fashion.

Test scores too frequently are assumed to be precise estimates of an individual’s intellectual functioning, without thoroughly considering the conditions under which the scores were obtained. Four major influences on an individual’s performance on an intelligence test should be considered in making diagnostic decisions or recommendations for intervention (Bracken, 2000). Each of the four poses threats to the validity of assessment results and all of them can be controlled to some considerable degree: (1) examinee characteristics, (2) examiner characteristics, (3) environmental influences, and (4) psychometric characteristics of tests. Table 3-2 summarizes these characteristics and gives examples of each. Test results should not be used for making diagnoses or program eligibility decisions without a statement from the examiner attesting to the validity of the evaluation results with regard to these four threats to validity.

Examinee Characteristics

Examinees approach intellectual assessments from differing sociocultural backgrounds and experiences. Some examinees may mistrust the “system” that is mandating the assessment, whereas other examinees may be challenged and highly motivated to participate. In a program like SSI that uses a form of intellectual means testing to identify participants, examiners must be aware of the risk of intentional faking or malingering by or on behalf of examinees. That is, examinees may be motivated to perform poorly on purpose in an effort to receive desired benefits or preferential treatment. Similarly, when parents or other parties who may benefit from assessment outcomes are involved in the diagnostic process by answering background information questions or responding to adaptive behavior measures, the veracity of the participants’ responses also needs to be considered and evaluated.

Because the results of intellectual assessments typically are associated with important decisions and outcomes, examinees should not be assessed unless they appear suitably healthy and well rested. If they exhibit symptoms of poor health, like cold or influenza symptoms, or symptoms of psychological disorders or distress, like depression or acute anxiety, that could adversely affect the assessment of the examinee’s cognitive functioning, the evaluation should be rescheduled for a later date after these conditions have abated or after they have been addressed adequately. In instances in which examinees have had an ongoing history of physical illness or mental health problems, the effects of these conditions on the examinee’s cognitive functioning must also be considered.

Examiners must also ensure that examinees have the requisite skills to perform all intelligence test tasks and activities when selecting instruments or assessment procedures. An examinee with impaired vision that is not suitably improved with corrective devices should not be examined using materials that require visual acuity and discrimination. Similarly, individuals with impaired motor skills should not be examined using materials that require fine motor dexterity or processing speed. Examinees with vision, motor, or visual-motor handicapping conditions might better be assessed on verbally loaded measures of intelligence to remove the construct-irrelevant influences of these noncognitive handicapping conditions on their cognitive assessment results. Similarly, examinees who are hard of hearing or who have known speech or language disabilities or who speak and understand English with limited proficiency should not be assessed with language-loaded intelligence tests, but rather should be assessed with nonverbal tests of intelligence. In all such instances, the examiner must use sound professional judgment when selecting appropriate instrumentation to render a valid assessment of the examinee’s true cognitive functioning. Social Security Administration (SSA) evaluators should ensure that they do not apply the same psychological battery to every client, without regard for its appropriateness. Examiners should carefully craft assessment batteries to fit the unique needs of each client.

TABLE 3-2 Potential Threats to the Validity of Intelligence Test Results

| Sources of Threat | Examples |
|---|---|
| Examinee Characteristics | |
| Transient health conditions | Influenza, colds, fever, minor injuries |
| Chronic health conditions | Otitis media, speech impairment, diabetes |
| Transient mental conditions | Recent trauma, acute situational anxiety |
| Ongoing mental conditions | Psychiatric disorders |
| Attitudinal conditions | Malingering, oppositional/defiant, uncooperative |
| Physical conditions | Hearing, vision, motor, neurological limitations |
| Social/cultural conditions | Linguistic/cultural effects, mistrust of examiner |
| Examiner Characteristics | |
| Nonstandardized administration | Failure to administer test in standardized manner |
| Communication | Failure to establish and/or maintain rapport |
| Attitude/approachability | Personal bias, prejudice, inability to fairly work with some clients (e.g., certain racial groups, sexual orientations) |
| Competence/clinical skill | Lack of experience working with some clients (e.g., preschoolers, elderly) |
| Behavior management | Inability to manage examinee behaviors (e.g., disruptions, poor motivation) |
| Environmental Characteristics | |
| Furniture | Inappropriately sized, textured furniture |
| Examining room conditions | Too cluttered, too cold or hot, poor lighting conditions |
| Distractions | Excessively noisy, too visually distracting, phones ringing, extraneous noises |
| Psychometric Characteristics | |
| Norms | Old norms, nonrepresentative norm samples, insensitive norm tables, small normative samples |
| Reliability | Excessive measurement error in scale, internal consistency, stability, interrater judgment |
| Validity | Threats to internal and external sources of test validity: nonsupportive factor structure, poor criterion-related validity, poor convergent/discriminative validity |
| Item gradients | Too few items to allow for fine levels of ability discrimination |
| Ceilings/floors | Ceilings and floors that artificially limit an examinee’s level of performance |
| Skill demands | Inappropriate skill demands for certain clients (e.g., language demands for examinees who speak English as a second language; performance tasks for motorically disabled clients) |
| Spoiled subtests/scales | Subtests that are spoiled for any reason (e.g., examiner, examinee, environmental) |

Psychological examiners are responsible for ensuring that examinees are sufficiently healthy, motivated, and cooperative and that they have the requisite skills and abilities to participate in the assessment before attesting to the validity of test results. When examinees’ mental or physical health or their effort or requisite skill levels are such that the validity of the test results is threatened, examiners have an obligation to select more appropriate assessment procedures or make known their reservations about the validity of the test results. Diagnoses should be deferred whenever test results are considered insufficiently valid to contribute meaningfully to such important decisions.

Examiner Influences

Examiners have the potential to significantly affect examinees’ test performance, and therefore they should have had proper training, supervision, and experience to conduct individual intellectual assessments for all the clients with whom they work. In addition, examiners must present an overall demeanor that creates an optimal assessment environment. Examiners must hold the required credentials to perform intellectual assessments in their respective locales, and they should ensure that they administer tests only in the manner in which the instruments were standardized and intended to be used. For example, they should use full-scale, normed versions of the instrument rather than employing abbreviated versions and should not modify test directions.

The ethical standards of American professional and scientific psychological associations, like the American Psychological Association and the National Association of School Psychologists, require that psychologists not engage in services for which they lack competence. Whether the examiner is a psychologist or holds other acceptable credentials to provide psychological assessment services, it is essential that examiners provide only those services they are competent to perform. Because Supplemental Security Income (SSI) benefits are distributed to people across the entire life span from infant to adult and are allocated to members of all racial, ethnic, and linguistic groups, examiners must be properly trained and experienced to work with such a diverse clientele.

Examiners should refrain from providing assessment services to any demographic group member whom they feel inadequately prepared to test. For example, examiners who are comfortable working with school-age children and adolescents may not have experience assessing infants and preschool children. Similarly, examiners may not have the linguistic or cultural competencies to fairly assess examinees whose nations of origin are other than the United States and whose primary languages are other than English. Examiners lacking the prerequisite skills and experience should acquire them through postgraduate or in-service training with supervision prior to attempting assessments with such a diverse clientele. Good professional practice requires that examiners who do not possess the required skills and experience refer clients to other examiners who do.

Examiners should also ensure that rapport is well established and that a businesslike atmosphere conducive to intellectual assessment is created and maintained during testing. Examinees should be comfortable and optimally engaged throughout the assessment process, and the pace of the assessment should be established to minimize examinee fatigue, boredom, distractibility, or other detrimental conditions associated with either a too slow or too rapid assessment pace. The examiner should describe the extent to which these potential threats to assessment validity adversely affected the examinee’s performance.

Environmental Conditions

Intellectual assessments should be conducted in settings that are optimal for eliciting the examinee’s best performance. Office furniture should be appropriately sized and safe for clients of all ages; for example, preschool children should be seated in small chairs for safety and comfort. Office decor should not unduly distract examinees or interfere with the examinee’s ability to focus on stimulus materials and tasks. Examining rooms should be properly ventilated, and the physical climate should be comfortable, with possible sources of distraction such as telephones and beepers eliminated during the assessment. Examiners should ensure that the environment allows examinees to demonstrate optimally their full range of intellectual talents and abilities.

Psychometric Considerations

Bracken (1988) identified 10 psychometric reasons why similar tests sometimes produce dissimilar results. When two tests intended to assess the same construct produce dissimilar results, one or both tests may possess unique psychometric characteristics that diminish the accuracy of their results for certain populations. Some of these psychometric characteristics, such as limited floors or steep item gradients, are often not as readily noted as other, more obvious psychometric characteristics, like low reliability, yet they must be identified through careful analysis before tests are employed.

Too frequently, examiners assume that test publishers have ensured that tests are equally useful for examinees of all ages and ability levels, but such assumptions are not always warranted. As an example, in the cognitive domain of the Battelle Developmental Battery (Newborg et al., 1984), the item gradient of the memory scale at the 24-35 month age level is too steep for reasonable discrimination of examinees’ abilities. On this scale, a raw score of 9 produces a percentile rank of 18, while a raw score of 14 has a percentile rank of 95. Thus, only five items must discriminate across a range of nearly 3 standard deviations, from nearly –1 to +2. Therefore, examiners must acquaint themselves with the examiner’s manuals for the instruments they use to determine which instruments may be inappropriate for certain demographic groups. The section on test standards in this chapter addresses these relevant psychometric considerations in more detail.
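The steepness of an item gradient like the one just described can be checked by converting the two percentile ranks to z-scores. The sketch below is illustrative only: it assumes the underlying ability is normally distributed, and the function names are hypothetical rather than drawn from any test manual. Under that assumption, percentile ranks of 18 and 95 correspond to z-scores of roughly -0.9 and +1.6, so five raw-score points cover about two and a half standard deviations, or about half a standard deviation per item.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def percentile_to_z(percentile_rank: float) -> float:
    """Convert a percentile rank (between 0 and 100) to a z-score,
    assuming the measured trait is normally distributed."""
    return STANDARD_NORMAL.inv_cdf(percentile_rank / 100.0)


def gradient_summary(raw_low, pr_low, raw_high, pr_high):
    """Return (z_low, z_high, span in SD units, SD units per raw-score point)."""
    z_low = percentile_to_z(pr_low)
    z_high = percentile_to_z(pr_high)
    span = z_high - z_low
    return z_low, z_high, span, span / (raw_high - raw_low)


# Values quoted in the text for the memory scale at the 24-35 month level:
# raw score 9 -> percentile rank 18, raw score 14 -> percentile rank 95.
z_low, z_high, span, per_point = gradient_summary(9, 18, 14, 95)
print(f"z at raw score 9:  {z_low:.2f}")
print(f"z at raw score 14: {z_high:.2f}")
print(f"span: {span:.2f} SD, or {per_point:.2f} SD per raw-score point")
```

On a scale with a standard deviation of 15, half a standard deviation per item means that a single item passed or failed moves the derived score by seven or eight points, which is why steep gradients limit fine discrimination.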

USE OF TOTAL TEST SCORES AND PART SCORES

Whenever the validity of one or more part scores (subtests, scales) is questioned, examiners must also question whether the test’s total score is appropriate for guiding diagnostic decision making. The total test score is usually considered the best estimate of a client’s overall intellectual functioning. However, there are instances in which, and individuals for whom, the total test score may not be the best representation of overall cognitive functioning.

In a compelling article, Jensen (1984) presented a sound argument for generally using total test scores in decision making. His recommendation was to use instruments’ total scale composite scores (for example, a composite IQ) rather than the instruments’ respective part scores when making diagnostic decisions. Jensen argued that total test scores are more “g-saturated”; that is, they are better representations of general intelligence than part scores because they combine and reflect the contributions of all the respective individual part scores from the instrument. Whereas part scores tend to reflect specific abilities like verbal skills or performance abilities, total test scores combine examinees’ various skills and abilities to better reflect the client’s overall cognitive functioning.

In a similar vein, Spearman (1927) argued that because psychometric g, an instrument’s average loading on the general ability factor in exploratory factor analyses, permeates all cognitive tasks to some considerable degree, the test content or the specific abilities like memory, spatial ability, and reasoning assessed by tests may be less important considerations when selecting instruments than their g-loadings. Two tests of differing theoretical orientations or content may be equally strong measures of psychometric g and, as such, may represent equally good measures of general intelligence. Spearman (1927) also coined the term “indifference of the indicator” to describe the phenomenon that test content, process, or theoretical orientation is secondary to how well the test measures g, and that tests with comparable g-loadings can be used interchangeably as overall measures of intelligence regardless of how they go about assessing g.

From a practical standpoint, the total test score of an intelligence test best approximates an instrument’s overall g-loading. Total test scores produce the highest percentage of a test’s explained, reliable variance, and they are reliably better predictors of more external criteria than are part scores (see Jensen, 1981, 1998). The intelligence test total score is also the single overall fairest predictor for individuals of differing ages, genders, races, and ethnic backgrounds (Jensen, 1980; Reschly, 1981b; Reynolds & Kaiser, 1990).

Total Test Scores

All of the instruments listed in Table 3-1 are first and foremost measures of general intelligence, which is best represented in each instrument’s total test score. Although the names applied to the instruments’ total test scores vary across instruments (e.g., full scale IQ, mental processing composite, general cognitive index), these global scores tend to be highly correlated and share a common source of variance: general intelligence or psychometric g. In this sense, there is little practical difference between what the total test scores are called; they are all representations of overall intelligence and historically have been referred to as IQ or full scale IQ. Given the high correlations among the comparable mean total test scores provided by the instruments cited in the table, these instruments can be thought of as collectively measuring the same construct, general intelligence, although their respective subparts may measure a diverse collection of additional specific cognitive abilities.

It is important to note that these tests can also be thought of as largely interchangeable, except in specific situations related to the unique characteristics of individual examinees, for example, those with limited English proficiency, or because of the psychometric foibles associated with specific tests. Some tests, for example, may have inadequate or barely adequate floors for the diagnosis of mild mental retardation at specific ages. With a minimal raw score of a single item answered correctly on each of the appropriate subtests of the Wechsler Preschool and Primary Scale of Intelligence-R for a child who is two and one-half years old, the test FSIQ barely meets the negative two standard deviation criterion commonly used for the diagnosis of mild mental retardation, which is an FSIQ of 68. This instrument would not be capable of differentiating among mild, moderate, and profound levels of retardation at this age level. In most instances and for most examinees, the instruments cited in Table 3-1 produce comparable total test scores across much of the ability continuum, and in most cases the total test score is the preferred score to use in the diagnosis of mental retardation.

Part Scores

There are occasions when a total test score may not be the best indicator of an individual’s overall intellectual functioning, and the examiner must resort to interpreting one of the instrument’s part scores as the best indicator of overall intellectual functioning. In such cases, the instrument’s total test score may offer little more than an awkward and artifactual “average” of a number of relatively disparate subtests or subscales (i.e., part scores). Whenever an examinee’s test performance is highly variable across subtests or subscales of an instrument, the validity and meaningfulness of the total test score must be questioned as a reflection of overall intellectual ability. Before an examiner chooses to employ a part score in place of a total test score for a diagnosis of mental retardation, however, four issues must be considered: the statistical significance of scale differences, the meaningfulness of scale differences, which abilities are appropriate for FSIQ replacement, and the actual magnitude of the composite IQ.

Statistical Significance

The first issue to be addressed when considering replacing a total test score with a part score in the diagnosis of mental retardation is whether a statistically significant difference exists between the subscales that contribute to the total test score. When part scores do not differ significantly from each other, the total test score is unequivocally the best indicator of overall cognitive functioning and should be used for decision making.

Most of the instruments cited in Table 3-1 provide interpretative information that identifies when one or more of the instrument’s subtests or subscales differs significantly from the mean of the remaining subtests or subscales or when subtests differ significantly from each other. In such comparisons the examiner frequently has a choice between using 85, 90, 95, or 99 percent confidence levels. Determining that two scales differ significantly in magnitude depends in part on the alpha level used for the basis of significance and the level of confidence desired. Statistically significant differences between scales or subtests are necessary but not sufficient criteria for judging that the total test score is not an optimal representation of the examinee’s overall intellectual functioning.
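How the chosen confidence level changes the size of difference needed for significance can be sketched with the standard psychometric formula for the standard error of a difference between two scores, in which each scale's standard error of measurement is SD × √(1 − reliability). The reliabilities below (.95 and .91) are hypothetical values chosen for illustration and are not taken from any of the instruments in Table 3-1; in practice examiners should rely on the tables in the test manual.

```python
from math import sqrt
from statistics import NormalDist


def standard_error_of_measurement(sd: float, reliability: float) -> float:
    """SEM = SD * sqrt(1 - reliability)."""
    return sd * sqrt(1.0 - reliability)


def critical_difference(sd: float, rel_a: float, rel_b: float,
                        confidence: float) -> float:
    """Smallest difference between two scale scores that is statistically
    significant (two-tailed) at the given confidence level."""
    sem_a = standard_error_of_measurement(sd, rel_a)
    sem_b = standard_error_of_measurement(sd, rel_b)
    se_diff = sqrt(sem_a ** 2 + sem_b ** 2)
    z = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    return z * se_diff


# Hypothetical reliabilities for two scales reported on a mean-100, SD-15 metric.
for confidence in (0.85, 0.90, 0.95, 0.99):
    points = critical_difference(15, 0.95, 0.91, confidence)
    print(f"{confidence:.0%} confidence: difference of {points:.1f} points or more")
```

With these assumed reliabilities, a stricter confidence level demands a larger difference before two scales are treated as significantly different, which is the dependence on alpha level described above.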

Meaningful Differences

The second issue is the meaningfulness of the difference between two or more statistically disparate part scores. It is common to find that two intelligence test subscales differ significantly (e.g., p < .05) from each other for individuals, meaning that the differences in the client’s respective intellectual abilities are not likely to have occurred by chance alone. However, differences of such magnitude and larger are quite common in the general population. For example, a difference of one standard deviation (15 IQ points) between the simultaneous and successive subscales of the Cognitive Assessment System (Naglieri & Das, 1997) is statistically significant for the individual, but it occurs among 31 percent of the general population. A similarly significant difference of one standard deviation between the verbal and performance scales of the Wechsler Intelligence Scale for Children-III (WISC-III) occurs among 24 percent of the general population; and the same one standard deviation difference between UNIT primary scales (reasoning and memory) occurs among 28 percent of the general population. Differences of this magnitude, although statistically significant, are not unusual or rare occurrences in the general population.
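A rough sense of why one-standard-deviation differences are so common can be had from a normal-theory approximation. The sketch below assumes that two scale scores are bivariate normal with equal standard deviations; the correlations used are hypothetical and are not values reported for the CAS, WISC-III, or UNIT. The percentages cited above come from standardization data, so this is an illustration of the principle and not a substitute for the base-rate tables in test manuals.

```python
from math import sqrt
from statistics import NormalDist


def base_rate_of_difference(diff_points: float, sd: float,
                            correlation: float) -> float:
    """Proportion of the population expected to show an absolute difference of
    at least `diff_points` between two scales, assuming bivariate normality,
    equal standard deviations, and the given correlation between the scales."""
    sd_of_difference = sd * sqrt(2.0 * (1.0 - correlation))
    z = diff_points / sd_of_difference
    return 2.0 * (1.0 - NormalDist().cdf(z))


# Hypothetical correlations between two scales on a mean-100, SD-15 metric.
for r in (0.5, 0.6, 0.7):
    rate = base_rate_of_difference(15, 15, r)
    print(f"correlation {r:.1f}: about {rate:.0%} differ by 15 points or more")
```

Under these assumptions, a 15-point difference appears in roughly one-fifth to one-third of the population, the same neighborhood as the figures cited in the text, and the base rate falls as the correlation between the scales rises.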

Before determining that a total test score is not an optimal representation of the examinee’s overall intellectual functioning, the examiner must consider both the statistical significance of the difference and the relative rarity of those significant differences. Scale differences that occur with less frequency than approximately 25 percent in the general population could be considered unusually rare events and therefore may be considered significant threats to the utility of the total test score as a representation of the individual’s overall intelligence. When subscale difference scores are statistically significant and are relatively rare occurrences in the general population, then examiners should consider whether the total test score is the best indication of overall functioning or whether one or another of the appropriate subscale scores might be a better representation of the client’s overall level of functioning. For example, examinees with limited English proficiency who are tested on the Wechsler scales frequently produce score differences that are both statistically significant and relatively rare between the instruments’ verbal and performance subscales. In such cases, the total test score would generally be considered invalid as a measure of the examinee’s “true” overall intellectual functioning because limited English facility, and not limited overall intelligence, is likely to have adversely affected and rendered invalid the examinee’s assessed verbal IQ and, consequently, the composite IQ.

The client’s language difficulty consequently would have had the adverse effect of reducing the composite IQ in direct proportion to its influence on the person’s performance on the verbal scale. In contrast to the verbal scale, the examinee’s performance on the language-reduced performance scale would have probably resulted in a significantly higher performance IQ. Hence, when there is a significant and relatively rare verbal and performance scale difference for individuals who speak English as a second language, the conclusion to be reached would be that the performance IQ is likely to be the best estimate of the client’s overall intellectual functioning.

It should be noted, however, that even the performance scales of the Wechsler series or comparable subscales on other instruments, like the Stanford-Binet IV’s abstract visual reasoning subscale, pose considerable language demands on examinees who are not proficient in English. As such, performance scales should be viewed only as a better measure of ability, but not necessarily the best measure of ability.
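The two-part check described above (statistical significance plus rarity in the general population) can be sketched in a few lines. This is a minimal illustration rather than a clinical procedure: the function name, its parameters, and the default thresholds are assumptions chosen to mirror the text (p < .05 and the frequency criterion of approximately 25 percent), and the p-values in the usage lines are hypothetical. In practice both inputs would come from the test manual.

```python
def difference_threatens_composite(p_value, base_rate,
                                   alpha=0.05, rarity_cutoff=0.25):
    """Return True when a subscale difference is both statistically
    significant for the individual and relatively rare in the population.

    p_value   -- significance level of the subscale difference (hypothetical input)
    base_rate -- proportion of the general population showing a difference
                 at least this large, as reported in the test manual
    """
    is_significant = p_value < alpha
    is_rare = base_rate < rarity_cutoff
    return is_significant and is_rare


# Base rate quoted in the text for a one-standard-deviation difference on the
# CAS (31 percent); the p-value of 0.04 is a made-up illustrative figure.
print(difference_threatens_composite(0.04, 0.31))  # significant but common
# A hypothetical difference seen in only 5 percent of the population.
print(difference_threatens_composite(0.04, 0.05))  # significant and rare
```

Only when the function returns True would the examiner go on to ask whether a part score represents overall functioning better than the total test score.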

Appropriate Cognitive Abilities

The third issue addresses which of the various instruments’ subscales or the intellectual subskills they measure are of sufficient importance, contributing significantly to the understanding of intelligence, to warrant their individual consideration in the diagnosis of mental retardation. Although there are different theoretical approaches to the construct of intelligence, the Cattell-Horn-Carroll (CHC) theory in particular appears to be more developed than others. Because all facets of even this model are not considered to be equally important and the facets vary in predictive value, an essential question arises: which factors can be used individually in the determination of mental retardation? That is, which factors are sufficiently credible measures of general intelligence to contribute to such important decisions? Some are more obvious than others, most prominently Gc and Gf.

Historically, both crystallized (Gc) and fluid (Gf) abilities have been considered substantive facets of intelligence. From a Thurstonian perspective, Gc maps closely onto the construct of verbal comprehension and Gf maps closely onto Thurstone’s concept of reasoning. In a multiple-instrument factor analysis of the Woodcock-Johnson Psycho-Educational Battery and the Cognitive Assessment System, subtests from both of these broad ability factors load at moderate to high levels (.60s - .70s) on the g-factor (Timothy Z. Keith, personal communication, June, 2001), whereas subtests from some other areas, such as long-term retrieval (Glr), short-term memory (Gsm), and auditory processing (Ga), load at much lower levels (.30s - .50s). Visual (Gv) and spatial (Gs) subtests tend to be moderate g-loaders, ranging in the .50s and .60s. Kaufman (1975, 1979, 1994) suggested a convention for rating the value of subtests’ g-loadings: .70 and above are considered “good” g-loaders, .50 to .69 are “fair” g-loaders, and g-loadings below .50 are considered “poor.” Given this convention, part scores derived from crystallized, fluid, and visual/spatial measures appear to be acceptable measures of general ability, in addition to the specific abilities they assess.
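Kaufman's rating convention amounts to a simple threshold rule, which the following sketch restates. The function name and the example loadings are hypothetical and are not drawn from any test manual; only the cut points (.70 and .50) come from the convention quoted above.

```python
def rate_g_loading(loading):
    """Label a subtest g-loading using the Kaufman convention:
    .70 and above is "good", .50 to .69 is "fair", below .50 is "poor"."""
    if loading >= 0.70:
        return "good"
    if loading >= 0.50:
        return "fair"
    return "poor"


# Hypothetical loadings for illustration only.
example_loadings = {"subtest_a": 0.74, "subtest_b": 0.58, "subtest_c": 0.41}
for name, loading in example_loadings.items():
    print(name, rate_g_loading(loading))
```

Under this rule, a scale would be a candidate for individual consideration only if its subtests fall predominantly in the "fair" or "good" bands.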

The traditional and practical dichotomy of verbal comprehension (Gc) and spatial reasoning (Gf) appears to represent a reasonable collection of subscales that could be used differentially in the diagnosis of mental retardation because subtests in these domains typically are considered “good” g-loading tasks. Such a dichotomy would allow the differential use of the Wechsler verbal and performance scales (or preferably, the factorially purer verbal comprehension and perceptual organization indices) and the verbal reasoning and abstract visual reasoning scales of the Stanford-Binet Intelligence Scale, Fourth Edition, for the purpose of diagnosing mental retardation.

Similar divisions within other instruments might also be considered appropriate, such as the simultaneous and successive scales of the CAS and K-ABC, but such decisions should be based on the instruments’ respective subtest g-loadings, with only those scales that have subtests that are predominately moderate to high g-loaders being used. For example, the two primary scales (memory and reasoning) of the UNIT (Bracken & McCallum, 1998) comprise six subtests that measure either complex memory or reasoning. Of the three memory and three reasoning subtests, all have g-loadings above .70, except one, the reasoning subtest (mazes < .50). Using the criteria of substantive contribution to g, either of these two primary scales may be considered appropriate for use in the diagnosis of mental retardation because of their significant g-loadings.

Magnitude of the Total Test Score

The last issue when considering whether a part score should be used in place of a total test score is the magnitude of the existing total test score. That is, when scale score discrepancies meet the previously mentioned criteria of significance and meaningfulness, the total test score may be simply too high to support a diagnosis of mental retardation. For example, one scale score might barely qualify for a diagnosis of mental retardation (e.g., verbal IQ near 70), while the second scale score may be considerably higher (e.g., performance IQ in the average range). In such cases, which are usually rare once significance and meaningfulness have been assessed, the resulting composite IQ would be in the low average range, and, in the committee’s opinion, the individual would not be likely to truly have mental retardation, despite one scale score in the retarded range. Accuracy of diagnosis is vitally important to the individual client and to SSA, because the stakes are so high. It is as important to include appropriate individuals as it is to exclude inappropriate ones from the SSI and Disability Insurance benefits programs. In the committee’s view, comprehensive intelligence tests provide the greatest technical adequacy and construct sampling and result in the best assessment of intelligence. Therefore, the final criterion for deciding whether or not to use part scores in place of the total test score in the diagnosis of mental retardation is that, no matter how great the discrepancy between relevant subscales, individuals with total test scores greater than 75 should not be diagnosed as having mental retardation.1

Composite scores from intelligence tests should be used routinely in mental retardation diagnosis, except when the composite IQ validity is in doubt, in which case an appropriate part score may be used in its place. Significant and meaningful variation among an instrument’s respective part scores may indicate evidence of compromised validity for one or more of them (for example, a low verbal scale score for an individual with a suspected speech disorder), which in turn would threaten the validity of the composite IQ. In such situations, appropriate part scores may better represent the individual’s true overall level of cognitive functioning.

1 Committee member Keith Widaman disagrees with this statement. Dr. Widaman believes that IQ part scores representing crystallized intelligence (Gc, similar to verbal IQ) and fluid intelligence (Gf, related to performance IQ) have clear discriminant validity and represent broad, general domains of intellectual functioning. Therefore, a score of 70 or below on either of these part scores from any standardized, individually administered intelligence test that reports such scores should be deemed sufficient to meet the listings for low general intellectual functioning regardless of the level of the composite score, providing that the part scores have adequate psychometric properties (e.g., high reliability, low standard error of measurement).

However, only part scores derived from scales that demonstrate high g-loadings (e.g., crystallized, fluid, visual/spatial measures of intelligence) should be used in place of the composite IQ score when its validity is in doubt. Many intelligence tests assess several facets of intelligence, but not all facets are equally important or predict life events equally well. Those intellectual facets that are heavily g-saturated provide the best sources for replacing the composite IQ score when its validity is questionable.

The characteristics of comprehensive IQ tests are such that, even when part scores are used in making disability determinations for mental retardation, the composite IQ score from an instrument should never be higher than 75. Furthermore, if a part score is used in place of the composite IQ score in SSA decision making, the part score should not exceed 70.
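The conditions stated in the preceding paragraphs can be restated as a single check. This is only an illustrative sketch of the committee's stated criteria, not an SSA procedure. The function and parameter names are hypothetical, and the two boolean inputs stand in for professional judgments (whether the composite's validity is in doubt, and whether the part score comes from a highly g-loaded scale) that the code cannot make.

```python
def part_score_may_substitute(composite_iq, part_score,
                              composite_validity_in_doubt,
                              part_scale_is_high_g):
    """Return True only when every condition named in the text holds:
    the composite IQ's validity is in doubt, the part score comes from a
    highly g-loaded scale, the composite is no higher than 75, and the
    part score itself does not exceed 70."""
    return (composite_validity_in_doubt
            and part_scale_is_high_g
            and composite_iq <= 75
            and part_score <= 70)


# Hypothetical cases for illustration.
print(part_score_may_substitute(74, 68, True, True))   # all conditions met
print(part_score_may_substitute(82, 68, True, True))   # composite above 75
print(part_score_may_substitute(74, 68, False, True))  # composite considered valid
```

The sketch covers only the majority position. The dissent recorded in footnote 1 would drop the composite ceiling.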

The committee considered a number of alternatives before recommending, under certain circumstances, the use of part scores in disability determination for mental retardation. Alternatives included: (1) recommending that SSA continue with its current practice of allowing the use of part scores in diagnosing mental retardation; (2) recommending against any use of part scores, with eligibility determinations made solely on the basis of composite IQs; and (3) recommending that the composite IQ be used, but also allowing for the use of part scores from various instruments, in certain circumstances.

The committee first considered endorsing current SSA practice, which allows the use of a valid verbal, performance, or full-scale (composite) IQ from an individually administered intelligence measure. In common clinical practice, this usually results in the use of a Wechsler VIQ, PIQ, or FSIQ, a situation that unfairly privileges one set of intelligence tests and has the effect of discouraging innovation on the part of other test developers. Furthermore, the Wechsler part scores VIQ and PIQ have poor theoretical and weak or mixed empirical support for their distinctive status. The current science of the structure of intelligence suggests that the Wechsler Verbal Comprehension and Perceptual Organization Indexes are better measures of Gc and Gf than the VIQ and PIQ.

The committee also considered recommending that SSA allow the use of only composite IQs for disability determinations. This recommendation would have made SSA’s definition of mental retardation consistent with that used by other professional associations and health-related organizations, all of which identify significantly subaverage general intellectual functioning as characteristic of mental retardation. In this situation, significantly subaverage general intellectual functioning would suggest that the deficits must be evident on the overall index of functioning, or the composite IQ. The committee decided against such a recommendation for two reasons. First, the practical consequences of declaring that a practice long used by SSA was invalid would have caused significant disruption for the agency and for disability benefit recipients. Second, and also important, was the recognition that there are circumstances, described earlier in this chapter, in which the composite IQ does not represent the person’s true intellectual functioning and is instead a meaningless artifact.

The recommendation eventually adopted by the committee advises that part scores not be used routinely in mental retardation determinations, except in those cases in which the composite IQ is thought to be invalid. Only then can an appropriate part score be used as the measure of the person’s intellectual functioning. The committee opted to bring SSA’s definition of the intellectual functioning dimension of mental retardation more in line with that of the other professional associations and health-related organizations, which focus on the summary measure of intelligence. Since there are some situations in which the composite IQ is invalid, part scores may more accurately reflect a person’s intellectual functioning. The committee’s examination of the structure of intelligence suggests that part scores that measure crystallized and fluid intelligence are the most appropriate part scores to use in these situations. Also, the committee recognizes that many of these abilities are measured by a wide range of intelligence tests, not just Wechsler measures, and therefore recommends that SSA expand in its listings the use of examples of other appropriate tests that yield g-loaded part scores. The text of this recommendation appears at the end of this chapter.

In rare instances it may be impossible to develop reliable and valid assessments of intellectual functioning even with the use of specially designed instruments that attempt to limit the effects of language differences, sensory or neuromotor impairments, and severe emotional disturbance. In such cases all of the summary scores, both the composite and part scores, may be suspected of being invalidly low. Invalid intelligence test results in the range of mental retardation, whether too low or too high, should always be ignored and other methods used to confirm or disconfirm a diagnosis of mental retardation, such as case history information, educational performance, social functioning across a variety of settings, adaptive behavior, and interviews with the individual and significant others. The principle of convergent validity should be applied to the interpretation of this information (see Chapter 5) and diagnostic decisions made based on the preponderance of evidence.

MULTIDIMENSIONAL VERSUS UNIDIMENSIONAL MEASURES OF COGNITIVE FUNCTIONING

The Standards for Educational and Psychological Testing (Joint Committee on Standards for Educational and Psychological Testing of the American Educational Research Association, the American Psychological Association, and the National Council on Measurement in Education, 1999) discuss the importance of using tests of differing length and psychometric quality for high-stakes versus low-stakes decisions. Intelligence testing intended for high-stakes decision making, such as disability diagnosis or eligibility determination, should include multidimensional measures of important intellectual factors like high g-loading tasks rather than unidimensional measures. The unidimensional Peabody Picture Vocabulary Test (PPVT; Dunn, 1959) originally reported an IQ as its total test score, a score that was once used for high-stakes placement and eligibility testing. During the 1960s and 1970s, the field came to the realization that, although the PPVT assessed a singularly important aspect of intelligence, verbal comprehension (crystallized abilities), it correlated to a relatively modest degree with comprehensive, multidimensional tests of intelligence. As a result, this unidimensional test was deemed insufficiently comprehensive to warrant using the term IQ. The revised PPVT (PPVT-R; Dunn & Dunn, 1981) did not continue the practice of using the term IQ for the total test score. Bracken et al. (1984) recommended further that the instrument not be considered or used as a general measure of intelligence.

In addition to tests designed as unidimensional ability measures, like the PPVT-R and the Raven Progressive Matrices (Raven et al., 1986), abbreviated versions of comprehensive tests (WISC-III short forms and the Wechsler Abbreviated Scale of Intelligence; Wechsler, 1999) and screening tests (KBIT; Kaufman & Kaufman, 1990) have been developed. These shortened tests typically have limited construct sampling and consequently have reduced levels of reliability and validity compared with comprehensive measures of intelligence; they should therefore be reserved for low-stakes decision making.

When intelligence testing is conducted for high-stakes purposes, multidimensional, full-scale instruments should be used in the diagnostic, decision-making process because these instruments provide the most convincing evidence of technical adequacy, construct sampling, and validity. Comprehensive intelligence tests assess multiple facets of the construct, and they more thoroughly sample the domain of intelligence. The instruments listed in Table 3-1 represent a current compendium of comprehensive measures of intelligence for infants, children, adolescents, and adults.

PSYCHOMETRIC STANDARDS

The psychometric quality of tests should guide examiners’ selection of tests used to contribute to the diagnosis of mental retardation. Due to the nature of this disability and the unique characteristics of individual intelligence tests, not all comprehensive tests may be appropriate for this application; at the least, not every test may be appropriate for every examinee. Both pragmatic and empirical aspects of test quality should guide test selection and inform decision making. Bracken and colleagues (Bracken, 1987, 1988, 1998; Wasserman & Bracken, 2002) have proposed criteria to guide the selection of cognitive tests. In some instances these guidelines include more breadth and specificity than the Standards for Educational and Psychological Testing (Joint Committee on Standards for Educational and Psychological Testing of the American Educational Research Association, the American Psychological Association, and the National Council on Measurement in Education, 1999), yet in general they are consistent with recommendations provided by such psychometricians as Anastasi and Urbina (1997), Cattell (1986), Cicchetti (1994), Nunnally and Bernstein (1994), and Salvia and Ysseldyke (2001).

Intelligence Test Norms

The adequacy of a test’s norms is of paramount importance when selecting a norm-referenced intelligence test. The quality of test norms is dependent on several factors, including sample size and population representation. Cronbach (1949) posed the following questions, which remain pertinent today, when assessing the quality of test norms: “(1) Are the norms based on a sufficiently large group? (2) Is the standard group representative? (3) Does the standard group resemble the persons with whom we wish to compare our subject?” (pp. 75-76).

The primary goal in normative sampling is to accurately reflect population parameters, which allows inferences based on obtained scores to be generalized to the population. The goal of intelligence test norms is to accurately represent the U.S. population because the goal of assessment is to identify the degree to which an individual deviates from normative expectations. In test norm development, sampling plans should sample representatively from among all potential examinees to reflect the entire distribution of ability, including individuals who have mental retardation.

Some test developers employ truncated selection procedures that do not sample the entire population and systematically exclude individuals with impairments or other special needs (McFadden, 1996). Such norming practices should be avoided, and tests used for the identification of mentally retarded individuals should include in the normative sample the same proportion of such individuals as is found in the general population. Hollon and Flick (1988) recommended that when tests are developed for use with special populations, norms still should be based on fully representative samples.

Sampling plans should be thoroughly described. Two of the principal assumptions of random sampling are that every individual in the target population has an equal chance of being selected and that every sample selection is made independently. However, true random sampling is an ideal that is rarely if ever achieved in test norming. Given the geographic expanse of the United States and its population of approximately 280 million citizens, random sampling from the entire U.S. population is typically not economically feasible or practical. As a reasonable compromise in test norming, intelligence test norms should be gathered in a stratified sampling manner that results in a sample that is demographically representative of the population, including all of its relevant characteristics.

Normative samples should be sufficiently large. Intelligence test normative samples should be sufficiently large to provide stable estimates of population parameters, thereby reducing sampling error to acceptable levels and meeting assumptions for requisite statistical analyses. Although large-scale group tests may involve 10,000 to 20,000 students per grade or age level, samples for individually administered intelligence tests generally are considerably smaller. Carefully drawn samples of 150 to 200 participants per grade or age level are typically considered appropriate and are frequently employed with individually administered tests. The smaller the sample size, the less likely the sample is to be normally distributed or to accurately reflect population parameters. Therefore, tests with norms based on samples smaller than the minimal level noted above should be avoided, unless additional evidence that supports their use is available.

Normative samples should reflect appropriate demographic parameters. Research has shown that certain demographic variables are more related to levels of cognitive functioning than are other variables. It is important that norm samples are stratified and selected on the basis of these identified variables. Variables used for selection and stratification when gathering samples for intelligence test norms generally include age, grade level, sex, ethnic origin, race, geographic region, urban or rural residence, and socioeconomic status. Intelligence tests used for the diagnosis of mental retardation should include carefully selected samples that fully represent these important demographic characteristics to the degree that they are found in the general population.

Many intelligence tests also appropriately include individuals with handicapping conditions and educational exceptionalities in their normative samples. The inclusion of exceptional individuals in norming samples is based on the logic that the intended function of the normative sample is to represent accurately the population, and the intended function of the test is to serve a comprehensive group of individuals rather than only people without known deficits or gifts (Elliott, 1990). If an intelligence test is intended to diagnose and serve individuals with mental retardation, then the test should include proportionate representation of this population in the normative sample.

Sampling should be representative and precise. The accuracy and precision of a stratified sample is most readily determined by the degree to which the sample matches the sampling plan. The degree to which the composition of an acquired sample reflects census proportions should be assessed through examination of not only single demographic characteristics like gender for the entire sample but also by examining combined demographic sampling cells (e.g., gender by race within individual age levels). It is in these smaller cells that sampling plans typically fail most often. Examiners should carefully examine sampling outcomes to ensure that selection variables are accurately represented not only across the entire norm sample, but also within each level of the norm sample (say, for 5-year-olds or 20-year-olds) and for each group sampled, such as blacks or females.
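The point about combined sampling cells can be made concrete with a small sketch. Everything in it is hypothetical: the function name, the group labels, the ten-person sample, and the target proportions are invented for illustration and do not describe any real norm sample or census figure. The example is constructed so that each single characteristic matches its target while every combined cell misses.

```python
from collections import Counter


def cell_gaps(participants, target_props):
    """Observed minus target proportion for each demographic cell.

    participants -- one cell label per person, e.g. ("F", "group_1")
    target_props -- dict mapping each cell label to its target proportion
    """
    counts = Counter(participants)
    n = len(participants)
    return {cell: counts.get(cell, 0) / n - target
            for cell, target in target_props.items()}


# Hypothetical sample for one age level: gender is split 5/5 and the two
# groups are split 8/2, so both single-variable margins match the targets.
sample = ([("F", "group_1")] * 5 + [("M", "group_1")] * 3
          + [("M", "group_2")] * 2)
targets = {("F", "group_1"): 0.4, ("M", "group_1"): 0.4,
           ("F", "group_2"): 0.1, ("M", "group_2"): 0.1}

for cell, gap in cell_gaps(sample, targets).items():
    print(cell, round(gap, 2))
```

Every cell in this made-up sample is off by ten percentage points, and one cell is empty, even though gender and group each look correct when examined alone.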

All statistical transformations used to develop interpretive scores should be reported in the examiner’s manual. Because raw scores have limited norm-referenced interpretative value, they must be transformed into more meaningful metrics, like standard scores. The statistical procedures by which raw scores are transformed into standard scores should be clearly documented in the test manual, including procedures used to smooth, normalize, or stretch distributions during the transformation process.

One consideration in transformation from raw score to standard score is whether scores were manipulated through sample weighting. Weighting is not necessary with most carefully normed intelligence tests; however, sometimes weighting is done to “correct” errant samples when the stated goals of the sampling plan have not been adequately met. When specific demographic strata have been under-sampled, score weighting is sometimes used to statistically correct this methodological slight. It should be recognized, however, that weighting scores often increases sampling error because the “corrected” scores are based on smaller and probably less representative samples than appropriate. Weighted scores in general should be viewed as an undesirable characteristic of test norms and should be carefully considered when selecting tests.

Similar to weighting is the issue of extrapolated score development. When normative samples are not sufficiently diverse in their range of talent, it sometimes happens that there are too few low- or high-functioning individuals to properly generate norms for individuals who are functioning in the mentally retarded or gifted range. In situations in which exceptional individuals are excluded from the norming process, there may be too few people with mental retardation assessed to establish norms at this level. Consequently, test publishers often “stretch” norms beyond their actual range through linear extrapolations. Extrapolation provides the benefit of extending norms farther than would otherwise be the case; however, extrapolated norms provide no assurance of accuracy because they are not based on obtained data. It is not known for certain whether cognitive functioning progresses through the population in a linear fashion, and the application of linear extrapolation is merely a best guess at what the norms would have been if sufficient numbers of exceptional individuals had been included in the sampling plan.

Standardization examiners and procedures should be clearly described. Procedures used to recruit, qualify, and train standardization examiners should be carefully described in intelligence test manuals. Quality assurance procedures intended to correct invalid administrations and to identify invalid test protocols also should be detailed. Ideally, standardization examiners should have the same credentials and experience as the professionals who will be administering the test.

Test manuals should carefully describe the standardized test conditions under which the test norms were established. These conditions should be the same when the test is employed in clinical practice. Any changes in artwork or stimulus materials, instructions, and test or item sequence after standardization should be described in the test manual.

The standardization sample should be current. Research suggests that intelligence in the entire population increases at a rate of approximately 3 IQ points per decade, which approximates the standard error of measurement for most comprehensive intelligence tests. Thus, tests with norms older than 10 to 12 years will tend to produce inflated scores and could result in the denial of benefits to significant numbers of individuals who would be eligible for them if more recent norms had been used. Disability examiners who use tests with outdated norms may be systematically if unintentionally denying benefits to those who are legally entitled to them. The examiners also risk losing their licenses for ethical violations of their professional codes. Proper test usage is essential for accurate testing and diagnosis and ultimately for equitable disability determination.

In several meta-analyses, James Flynn (1984, 1987, 1994, 1999) has demonstrated that the age of intelligence test norms may be one of the most important considerations when selecting tests for use. Internationally, Flynn has demonstrated that intelligence test norms “soften” at a rate of about 3 IQ points per decade. That is, a test with 20-year-old norms will tend to produce IQs that are approximately 6 points (> one-third of a standard deviation) higher than a recently normed instrument. Thus, administering outdated forms of intelligence tests like WISC-R instead of the WISC-III may have the unintentional and undesired result of failing to qualify individuals for services or benefits that they would otherwise qualify for. The Flynn effect is noticeable in samples as young as infants (Bayley, 1993; Campbell et al., 1986) and appears to continue throughout childhood and adolescence. Chan et al. (1999) demonstrated that a variety of cognitive abilities, especially those involving more semantically laden content and procedures that measure crystallized abilities, tend to be most susceptible to population changes over time.

This issue is particularly salient for psychologists who habitually use older tests such as the McCarthy Scales of Children’s Abilities (McCarthy, 1972) or previous editions of revised tests like the WISC or WISC-R, rather than the newer WISC-III. Tests of this vintage may have norms that are as old as three decades or in some instances even older. Norms of this age would predictably and reliably fail to identify large numbers of individuals who would otherwise qualify for services or benefits. For example, given the 9-point IQ inflation that would be associated with using either the McCarthy or WISC-R rather than a current generation test, many and possibly most individuals who are functioning in the 60-70 IQ range would fail to be properly identified as having mental retardation on either instrument. This issue is also important because ethical codes admonish psychologists against using outdated tests and norms. This view is also supported by many state psychological associations.

Verification of this norm-softening can be seen throughout the literature wherein early researchers discovered that the most recent edition of various intelligence tests produced scores that were significantly lower than the previous edition of the same instrument (e.g., Kaufman, 1979). Similarly, new instruments just entering the field typically produce total test scores that are significantly lower than the scores obtained on the traditional “old standards” used in convergent validity studies, leading to criterion contamination as a major threat to the validation of the newer instrument. For these reasons, professional organizations like the American Psychological Association and the National Association of School Psychologists and the joint AERA, APA, and NCME Standards for Educational and Psychological Testing (Joint Committee on Standards for Educational and Psychological Testing of the American Educational Research Association, the American Psychological Association, and the National Council on Measurement in Education, 1999) admonish psychologists not to use outdated instruments.

Given these generational changes in intelligence test norms, tests should undergo normative update, restandardization, or revision at intervals corresponding to the time expected to produce one standard error of measurement (SEM) of change. For example, the commonly used WISC-III has a composite IQ SEM of 3.20. Given an SEM of this magnitude, the WISC-III norms would be expected to soften to a significant degree (3 to 4 points) in 10 to 11 years (Wasserman & Bracken, 2002). Therefore, the WISC-III and most other intelligence tests might be considered inappropriate for the diagnosis of mental retardation when their norms are more than 10 to 11 years old.
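The 10-to-11-year interval follows directly from dividing the SEM by the rate of norm softening. A minimal sketch, assuming the same linear drift of about 3 points per decade (the function name is hypothetical):

```python
def years_to_one_sem(sem_points: float, drift_per_decade: float = 3.0) -> float:
    """Years for norms to soften by one SEM, assuming linear drift."""
    if sem_points < 0 or drift_per_decade <= 0:
        raise ValueError("SEM must be non-negative and drift must be positive")
    return sem_points / drift_per_decade * 10.0


# WISC-III composite IQ SEM of 3.20, as cited in the text.
print(round(years_to_one_sem(3.20), 1))  # 10.7
```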

A related issue is the length of time that an obtained IQ (or IQ equivalent) can be considered valid. Because intelligence is a quite stable construct, especially among older children, adolescents, and adults, IQs of record may be useful for a number of years beyond the date they were obtained, with the exception of the occurrence of any known condition that might threaten the validity of the obtained score, such as physical or emotional trauma. Despite its general stability, cognitive development proceeds most rapidly during the infant and toddler years and slows thereafter through childhood and adolescence (Bloom, 1964). For adults, formal learning-dependent knowledge (crystallized abilities) and long-term memory continue to improve into advanced years, but fluid abilities like novel problem solving and clerical speed generally decline fairly rapidly after peaking in adolescence (Horn, 1985).

Therefore, during the infant and toddler years, when cognitive growth and development are most rapid and consequently least stable, total test scores should be obtained at the time they are to be used in diagnosis or disability determination. For children between the ages of 3 and 6, total test scores might reasonably be considered valid for one year. Among children and adolescents between the ages of 6 and 16 years, total test scores should be considered valid for as long as three years. For adults ages 18 to 50 living in stable conditions and with stable health, total test scores should be considered valid for as long as five years. After age 50, total test scores might be considered reasonably valid for three years, but separate intellectual abilities, like Gf-Gc, might become important considerations. This lack of stability in elderly individuals’ specific cognitive abilities is typically due to debilitating factors associated with aging, and, although their IQs may change over the years, their diagnostic status is unlikely to change. That is, adults with mental retardation are likely to become more retarded in their functioning as they age.
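The age bands above can be read as a simple lookup. The sketch below is one possible encoding, not a rule stated in this form by the text: how the shared boundary at age 6 is assigned and the choice to return `None` for ages 16 to 18 (which the passage does not address) are assumptions, and the adult values presuppose stable living conditions and health.

```python
from typing import Optional


def score_validity_years(age_years: float) -> Optional[float]:
    """Suggested maximum age (in years) of a total test score of record.

    Returns 0.0 when the score should be obtained at the time of use and
    None for ages the passage does not cover. Boundary handling is assumed.
    """
    if age_years < 0:
        raise ValueError("age cannot be negative")
    if age_years < 3:
        return 0.0   # infants and toddlers: test at time of determination
    if age_years < 6:
        return 1.0   # ages 3 to 6
    if age_years <= 16:
        return 3.0   # ages 6 to 16
    if age_years < 18:
        return None  # gap between 16 and 18 is not addressed in the text
    if age_years <= 50:
        return 5.0   # adults 18 to 50 in stable conditions and health
    return 3.0       # after age 50


print(score_validity_years(4), score_validity_years(30), score_validity_years(65))
```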

Norms should reflect adequate item difficulty gradients. Item gradients reflect the degree to which standard scores change as a function of success or failure on a single item (Bracken, 1987). The larger the resulting standard score difference in relation to a change in a single raw score, the less sensitive and discriminating the test is. For a test to have adequate sensitivity at all levels of cognitive functioning, it must have adequate item density across the ability range. Bracken (1987, 1998) has suggested that item gradients should not be so steep that a single item passed or failed would result in a standard score change of more than one-third of a standard deviation.
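The one-third standard deviation criterion can be checked mechanically against a raw-score-to-standard-score conversion table. The sketch below assumes a hypothetical table stored as a dictionary and a scale with a standard deviation of 15; neither comes from any published norm table.

```python
def steep_item_gradients(raw_to_standard, sd=15.0):
    """Return (lower_raw, upper_raw, jump) for each one-point raw score step
    whose standard score change exceeds one-third of a standard deviation."""
    limit = sd / 3.0
    flagged = []
    raws = sorted(raw_to_standard)
    for lower, upper in zip(raws, raws[1:]):
        if upper - lower != 1:
            continue  # only compare adjacent raw scores
        jump = abs(raw_to_standard[upper] - raw_to_standard[lower])
        if jump > limit:
            flagged.append((lower, upper, jump))
    return flagged


# Hypothetical conversion table for the lowest raw scores of a subtest.
table = {0: 50, 1: 58, 2: 62, 3: 65}
print(steep_item_gradients(table))  # [(0, 1, 8)]
```

Here the step from a raw score of 0 to 1 moves the standard score 8 points, more than the 5-point limit, so a single item would carry too much weight at the floor of this hypothetical table.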

Similarly, norm table gradients should be sufficiently sensitive that when the same raw score is entered into two adjacent age tables, that score should not produce standard score changes of more than one-third of a standard deviation (Wasserman & Bracken, 2002). For example, the norm tables on the McCarthy Scales of Children’s Abilities are insufficiently sensitive at the younger age levels. A child who is 2 years, 7 months, and 16 days old could earn McCarthy total test general cognitive index scores that are more than two-thirds of a standard deviation (11 points) apart when the same raw score is entered into adjacent norm tables (Bracken, 1988). That is, a single day’s difference in the child’s chronological age would result in the child’s graduating from one norm table to the next, and with that movement into another norm table the child would appear to be as much as two-thirds of one standard deviation less intelligent. Such insensitivity in item or norm table gradients could easily lead to misidentification or misdiagnosis, especially among low-functioning individuals.
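The McCarthy example can be restated as a check on the jump between adjacent age tables. The sketch assumes that the general cognitive index has a standard deviation of 16, and the two scores used are hypothetical values chosen only to be 11 points apart.

```python
def table_jump_in_sd(score_younger_table, score_older_table, sd):
    """Standard score change, in SD units, when the same raw score is
    looked up in two adjacent age tables."""
    if sd <= 0:
        raise ValueError("standard deviation must be positive")
    return abs(score_younger_table - score_older_table) / sd


def exceeds_gradient_limit(score_younger_table, score_older_table, sd,
                           limit_in_sd=1.0 / 3.0):
    """True when the jump is larger than the one-third SD criterion."""
    return table_jump_in_sd(score_younger_table, score_older_table, sd) > limit_in_sd


# Hypothetical scores 11 points apart on a scale with an assumed SD of 16.
print(round(table_jump_in_sd(100, 89, sd=16), 2))  # 0.69
print(exceeds_gradient_limit(100, 89, sd=16))      # True
```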

Test norms should have adequate floors and ceilings. When tests are used to identify individuals who may have mental retardation or giftedness, it is important that the tests have sufficient discriminating power in the extreme ends of the distributions for accurate differentiation of ability and diagnosis. At a minimum, intelligence tests should have floors sufficiently strong to differentiate the extreme lowest 3 percent of the population from the top 97 percent (Bracken, 1984, 1987, 1998; Bracken & McCallum, 1998). Preferably, intelligence tests should be able to discern more severe levels of retardation from mild mental retardation. Although not pertinent to the diagnosis of mental retardation, intelligence tests should also have ceilings that are sufficiently high to differentiate the extreme upper 3 percent from the lower 97 percent.
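One way to make the floor criterion concrete is to compute the standard score that separates the lowest 3 percent of a normal distribution and compare it with the lowest score a test can yield. The sketch assumes normally distributed scores with a mean of 100 and a standard deviation of 15; the function names are hypothetical.

```python
from statistics import NormalDist


def floor_cutoff(mean=100.0, sd=15.0, proportion=0.03):
    """Standard score below which the given proportion of a normal
    population falls."""
    return NormalDist(mean, sd).inv_cdf(proportion)


def has_adequate_floor(lowest_obtainable_score, mean=100.0, sd=15.0,
                       proportion=0.03):
    """True when the test can yield scores below the cutoff, so that it can
    separate the lowest 3 percent from the rest of the population."""
    return lowest_obtainable_score < floor_cutoff(mean, sd, proportion)


print(round(floor_cutoff(), 1))   # roughly 71.8
print(has_adequate_floor(40))     # True
print(has_adequate_floor(75))     # False
```

A test meeting only this minimum would still be unable to separate mild from more severe levels of retardation, which is why the passage prefers floors that extend well below the cutoff.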

Evidence of Test Score Validity

The validity of a test is characterized by the extent to which it exclusively measures its targeted constructs (construct validity) and its scores meaningfully guide decision making. Increasing emphasis is being placed on the extent to which test scores serve their intended purposes and proposed applications (Messick, 1995). Construct validity can be supported with two broad classes of evidence, internal and external, which parallel the threats to validity typically considered in research designs (Campbell et al., 1963; Cook & Campbell, 1979).

Internal Evidence of Test Validity

Internal sources of validity include procedures to systematically examine the characteristics of a test, especially its content, assessment methods, structure, and theoretical underpinnings. Internal evidence of test validity can be found in investigations of face validity, content validity, theory-based validity, and structural validity.

Face validity refers to the degree to which a test appears to measure what it purports to measure. During casual examination, test items may be judged for face validity by the extent to which they appear to appropriately measure the targeted construct and objectives. Although not considered a source of validity in a technical sense, face validity has been shown to be related to examinee motivation and effort, as well as social desirability biases, labeling, and fairness (Bornstein, 1996). Most tests selected for the diagnosis of mental retardation include activities and tasks of sufficient difficulty that they readily appear to measure the construct of intelligence.

Content validity can be described as the degree to which a test adequately samples the domains of interest. Content validity varies with the purpose of the test and the nature of the inferences that may be drawn from test scores (Messick, 1993). Inferences made from tests with inadequate content validity may be suspect, even when other indices of validity are satisfactory (Haynes et al., 1995). Ideally, content should remain consistent throughout the age range of a test to ensure that the same construct is being measured (Bracken, 1988). The Stanford-Binet Intelligence Scale, Fourth Edition, includes subtests in which the content assessed is not consistent across the age range. For example, the vocabulary subtest begins with a picture vocabulary format and then graduates to an oral vocabulary format. When test content and item formats change in this manner, it is difficult to interpret an examinee’s test performance, because it is no longer clear which construct is being interpreted, receptive or expressive vocabulary.

The formulation of test items and procedures based on and consistent with a theory has been termed substantive validity (Loevinger, 1957) and is closely related to content validity. Psychology has produced rich and cohesive theories of behavior and cognition, theories that have led to the development of new tests and assessment practices (e.g., the K-ABC [Kaufman & Kaufman, 1983]; CAS [Naglieri & Das, 1997]). As Crocker and Algina (1986) suggest, “psychological measurement, even though it is based on observable responses, would have little meaning or usefulness unless it could be interpreted in light of the underlying theoretical construct” (p. 7). Tests used for the diagnosis of mental retardation should be based on reasonable and supportable theories, and these theoretical orientations should be presented in the test manual for consideration.

Composite scores should be supported through factor analyses. Exploratory factor analyses allow for examination of the natural structure of an instrument and the psychological meaningfulness of the dimensions or factors that emerge (Gorsuch, 1983). This criterion refers to the degree to which factor analytic results match the composite scales or subscales of the test. The mismatch between factor structure and composite indices has been shown to render test interpretation more difficult (Chattin & Bracken, 1989) on such tests as the Stanford-Binet Intelligence Scale, Fourth Edition (Thorndike, 1986), and the McCarthy Scales of Children’s Abilities (McCarthy, 1972).

Exploratory factor analyses provide a methodology by which the underlying dimensions assessed by a test may be separated or summarized. Floyd and Widaman (1995) suggest that exploratory factor analyses for clinical assessment instruments should routinely report whether principal component analysis or common factor analysis was used, initial communality estimates (or squared correlations of observed variables with the factors), the method of factor extraction, the criteria for retaining factors, the eigenvalues and the percentage of variance accounted for by the unrotated factors, the rotation method and rationale, all rotated factor loadings, factor intercorrelations, and the variance explained by the factors after rotation.

Competing models or theories should be tested with confirmatory factor analyses. Confirmatory factor analyses are conducted to evaluate the congruence of the test data with an a priori theoretical model, as well as to measure the relative fit of competing models. Floyd and Widaman (1995) recommend that confirmatory factor analyses should report proposed model(s), number and composition of factors, orthogonal versus correlated factors, secondary loadings, correlated error terms, other model constraints (fixed and free parameters), method of estimation, goodness of fit, overall fit, relative fit, parsimony, any model modification to improve model fit to data, factor loadings and standard errors, communality, and factor correlations and standard errors with statistical significance. Comprehensive treatment and inclusion of such information allows test users to better understand the extent to which the test fits its proposed model compared with competing models and provides support for the interpretation of the instrument’s respective subscales and composite scores.

External Evidence of Validity

External evidence of test validity considers the extent to which a test relates to or predicts other variables or outcomes in differing populations. Tests should be validated with regard to the purposes for which they are employed and the consequences of their use. In this section, we describe external classes of evidence for test construct validity, including criterion-related validity, consequential validity, and generalizability.

Criterion-related validity. Campbell and Fiske (1959) originally proposed that test scores should be related to external measures of the same psychological construct (convergent evidence of validity), and they should be comparatively unrelated to measures of different psychological constructs (discriminant evidence of validity). In criterion-related validity, criterion measures can be obtained concurrently (concurrent validity) or at some future date (predictive validity). An intelligence test that is proposed for use in the process of diagnosing mental retardation should demonstrate convergent validity with other extant intelligence tests before the instrument is accepted for this purpose. Similarly, as a class of instruments, intelligence tests should demonstrate higher correlations among themselves than with measures of other psychoeducational constructs (e.g., academic achievement, adaptive behavior).

Tests should meaningfully guide decision making. Contrasted groups methodology is commonly used for validating psychological tests. In this approach to validation, the test performance of two samples that are known to be different on the criterion of interest is compared. For example, a sample of people who are known to have mental retardation should perform on an intelligence test at a level significantly below the performance of a second group that is known to not have mental retardation. Decision-making classification accuracy should be determined by examining sensitivity, specificity, positive predictive power, and negative predictive power.
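The four classification accuracy indices named above are all derived from the same two-by-two table of test decisions against known group membership. A minimal sketch follows; the counts are hypothetical and serve only to show the calculations.

```python
def classification_accuracy(true_pos, false_pos, false_neg, true_neg):
    """Compute the four indices from a 2 x 2 table of test decisions
    (positive = identified by the test) against known status."""
    return {
        # proportion of people with the condition whom the test identifies
        "sensitivity": true_pos / (true_pos + false_neg),
        # proportion of people without the condition whom the test clears
        "specificity": true_neg / (true_neg + false_pos),
        # proportion of positive test decisions that are correct
        "positive_predictive_power": true_pos / (true_pos + false_pos),
        # proportion of negative test decisions that are correct
        "negative_predictive_power": true_neg / (true_neg + false_neg),
    }


# Hypothetical contrasted-groups study: 50 people known to have the
# condition and 150 known not to have it.
indices = classification_accuracy(true_pos=45, false_pos=10,
                                  false_neg=5, true_neg=140)
for name, value in indices.items():
    print(f"{name}: {value:.2f}")
```

Predictive power depends on how common the condition is in the sample, so values from a contrasted-groups study need not carry over to a population with a different base rate.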

Tests should provide evidence of consequential validity. A form of validity that emphasizes the societal impact of test results on individuals and groups is known as consequential validity. Consequential validity evaluates the utility of score interpretation as a basis for action, as well as the actual and potential consequences of test use (Messick, 1989). Messick (1995) argued that examination of the consequences of test use as a trigger to social and educational actions, such as equitable application of SSI benefits, is a necessary element of validating tests. Consequential validity is especially relevant to issues of bias, fairness, and distributive justice.

Generalizability of validity. External evidence of test validity is especially important when test results are to be generalized across contexts, situations, and populations, and when the consequences of testing reach beyond the test’s original intent. Intelligence test manuals should demonstrate the extent to which the test validity generalizes across subpopulations, such as racial or ethnic minority groups, gender, or age levels. Examiners who wish to use tests for purposes not stated or supported in the examiner’s manual, such as using a language instrument for discerning levels of cognitive functioning, must demonstrate the validity of the new application prior to its application.

Test Score Reliability

The reliability of test scores refers to the reproducibility (precision, consistency, and repeatability) of test results, or the degree to which test scores are free from measurement error. Measurement precision can be assessed by examining the instrument’s internal consistency, temporal stability, and interrater agreement. Reliability can only be evaluated in the context of test use (Nunnally & Bernstein, 1994).

Internal consistency. The internal consistency of a test is a reflection of the uniformity and coherence of test items and content. All variance generated by a test can be classified as either reliable variance or error variance. In classical test theory, reliability is based on the assumption that measurement error is distributed normally and equally for all score levels. By contrast, item response theory posits that reliability differs between individuals with different response patterns and levels of ability but generalizes across populations (Embretson & Hershberger, 1999).

Internal consistency is usually estimated with coefficient alpha or split-half reliability. Several psychometricians (Bracken, 1987; Clark & Watson, 1995; Nunnally & Bernstein, 1994) have recommended that minimal levels of internal consistency, averaged across age levels, should be at or above .80 for low-stakes applications and at or above .90 for high-stakes applications.
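For readers unfamiliar with coefficient alpha, the sketch below computes it from a small matrix of item scores using the standard formula based on item variances and total score variance. The data are hypothetical, and operational reliability estimates would come from a standardization sample rather than a handful of examinees.

```python
from statistics import pvariance


def coefficient_alpha(item_scores):
    """Coefficient alpha for a list of examinee rows, each a list of
    item scores of equal length."""
    n_items = len(item_scores[0])
    if n_items < 2:
        raise ValueError("alpha requires at least two items")
    item_variances = [
        pvariance([row[i] for row in item_scores]) for i in range(n_items)
    ]
    total_variance = pvariance([sum(row) for row in item_scores])
    if total_variance == 0:
        raise ValueError("total scores have no variance")
    return n_items / (n_items - 1) * (1 - sum(item_variances) / total_variance)


# Hypothetical scores for five examinees on four items.
scores = [
    [2, 3, 3, 2],
    [4, 4, 5, 4],
    [1, 2, 1, 2],
    [3, 3, 4, 3],
    [5, 4, 5, 5],
]
alpha = coefficient_alpha(scores)
print(f"alpha = {alpha:.2f}; meets .90 high-stakes guideline: {alpha >= 0.90}")
```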

Consistent with Nunnally’s (1978) original standards, Bracken (1987, 1998; Wasserman & Bracken, 2002) recommended that total test or total scale internal consistency of high-stakes test applications, such as for clinical diagnosis or eligibility decision making, should equal or exceed .90 when averaged across the age levels. Instruments used for the high-stakes purposes of diagnosing mental retardation for SSI should approximate this minimal level of reliability, recognizing that the inverse of reliability is measurement error and that error only confounds correct decision making.
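The link between reliability and measurement error can be made explicit with the classical test theory formula for the standard error of measurement, SEM = SD × √(1 − reliability). The sketch assumes a score scale with a standard deviation of 15; the reliabilities shown are the two guideline values from the text.

```python
import math


def standard_error_of_measurement(sd, reliability):
    """Classical test theory SEM for a given score SD and reliability."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must be between 0 and 1")
    return sd * math.sqrt(1.0 - reliability)


for r in (0.80, 0.90):
    sem = standard_error_of_measurement(15.0, r)
    print(f"reliability {r:.2f}: SEM of about {sem:.1f} points")
```

On these assumptions, moving from a reliability of .80 to .90 reduces the SEM from roughly 6.7 to roughly 4.7 points, a meaningful difference when a decision turns on a score near a cutoff.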

Local reliability. Local reliability refers to measurement precision at specified levels or ranges of scores that are at or near the decision-making point for mental retardation. For example, a test with high local reliability at low ability levels would be more appropriate for use with low-functioning individuals than one with less local reliability. Local reliability can be measured by approaching it from a classical test theory orientation or by using item response theory. Whichever approach is used, local reliability should be measured and the data made available for disability determination examiners so they can use the most appropriate tests for their clients.

Total test short-term stability. Test scores must be reasonably stable to have practical utility when diagnosing known stable conditions such as mental retardation and to be predictive of future performance. Stability is typically estimated through use of test-retest stability (correlation) coefficients across two points in time. Bracken (1987) suggested that for short-term test intervals of two to six weeks the total test stability coefficient should be greater than or equal to .90 for high-stakes test applications. Test-retest reliability is in part a measure of construct stability, but its interpretation in clinical contexts can be influenced by several factors like the deleterious effects of degenerative disorders or the positive effects of successful therapeutic interventions, which should be remembered in individual studies of test stability.
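A test-retest stability coefficient is simply the Pearson correlation between scores from the two administrations. The sketch below computes it directly; the paired scores are hypothetical, and the .90 comparison reflects the guideline attributed to Bracken (1987) above.

```python
import math


def pearson_r(first, second):
    """Pearson correlation between two equal-length lists of scores."""
    if len(first) != len(second) or len(first) < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    mean_x = sum(first) / len(first)
    mean_y = sum(second) / len(second)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(first, second))
    var_x = sum((x - mean_x) ** 2 for x in first)
    var_y = sum((y - mean_y) ** 2 for y in second)
    if var_x == 0 or var_y == 0:
        raise ValueError("scores must vary in both administrations")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical total test scores for six examinees tested four weeks apart.
time_1 = [62, 68, 71, 85, 98, 110]
time_2 = [64, 66, 74, 88, 97, 113]
r = pearson_r(time_1, time_2)
print(f"stability coefficient = {r:.2f}; at or above .90: {r >= 0.90}")
```

A high coefficient indicates that examinees keep their relative standing; it does not by itself rule out a uniform shift in scores, such as a practice effect, between the two administrations.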

Generalizability of test score reliability. As an extension of validity generalization (Hunter & Schmidt, 1990; Schmidt & Hunter, 1977), reliability generalization investigates the stability of reliability coefficients across varying samples. In order to demonstrate measurement precision for the populations for which a test is intended, the test should show comparable levels of reliability across various demographic subsets of the population, as well as salient clinical and exceptional populations like individuals with mental retardation.

Fairness in Testing

Fairness has not been considered historically as a leading criterion by which test selection decisions are made, but increased social sensitivity and recent court decisions have elevated its importance. Tiedeman (1978) has noted, “Test equity seems to be emerging as a criterion for test use on a par with the concepts of reliability and validity” (p. xxviii). As such, tests intended for use with all subsets of the U.S. population, as in SSA evaluations, should provide ample evidence of psychometric fairness and equitable treatment of examinees.

Wasserman and Bracken (2002) consider fairness to be the extent to which test scores are (a) statistically shown to be free from evidence of psychometric bias, (b) comparably reliable and valid across demographic groups, and (c) equitably applied and equally predictive in real-life consequences and pragmatic impact. Fairness transcends psychometrics and includes philosophic, legal, and practical considerations.

Test bias refers to elements of a test and its usage that are construct irrelevant and that yield systematic errors that in turn lead to erroneous decisions related to specific demographic group membership. Bias results in differential outcomes for individuals of the same ability levels but from different ethnic, sex, cultural, or religious groups (Hambleton & Rodgers, 1995). Test bias has also been described as “a kind of invalidity that harms one group more than another” (Shepard et al., 1981, p. 318).

Internal Evidence of Fairness

As with internal evidence of validity, test fairness rests in part on the structural features of the instrument, including theoretical underpinnings, item content, assessment procedures, differential item functioning, and an invariant factor structure.

Theoretical underpinnings. The theory on which a test is built may have an inherent sensitivity to issues of fairness and should be fully discussed in the test manual. Several illustrations of these implications may be presented with regard to measures of cognitive and intellectual ability. For example, tests that emphasize speed may be less fair for Hispanics, because time is considered a less salient concept in many Hispanic cultures (Scheuneman & Oakland, 1998). Individuals who speak English as a second language also may be disadvantaged by traditional language-loaded intelligence tests, even on performance-based measures like the Wechsler Performance Scale that include lengthy and conceptually laden test directions (Bracken & McCallum, 1998; Duran, 1989; Geisinger, 1992; Oakland & Parmelee, 1985). In addition, measures of crystallized ability and knowledge are inextricably linked to culture (Carroll, 1997) and accordingly may show differential performance across culturally different groups, whereas fluid abilities tend to show less differential performance across groups.

Multicultural bias and sensitivity reviews. The use of multicultural reviewers to examine the type, content, and format of test items for potential bias is a common practice among test publishers. Usually the goal of bias review panels is to identify offensive or controversial material and unfair material, remaining sensitive to population diversity. Among the considerations of such reviewers are language usage, ethnocentric item content, minority group representation in the norms, and minority group portrayals in test stimulus materials (Sireci & Geisinger, 1998).

All tests should present items in a sensitive manner for all gender, culture, age, and racial groups. Stimulus artwork should depict people performing similar or equivalent roles and activities, regardless of gender, age, race, and cultural backgrounds. Stimulus artwork that portrays facial expressions, such as happiness, anger, or fear, or indicators of physical limitations like eyeglasses, hearing aids, or wheelchairs, should be evenly distributed across representations of differing demographic groups. Stereotyping of any sort in test artwork and stimulus materials should be avoided.

Differential item functioning (DIF). Differential item functioning (DIF) refers to a family of statistical procedures used to identify whether test items display different statistical properties in different group settings after controlling for differences in the abilities of the comparison groups (Angoff, 1993). The concept of DIF has been extended by Shealy and Stout (1993) to include a test level of analysis known as differential test functioning (DTF). DTF is important because tests may produce a small number of offsetting items that are identified as biased against both comparison groups, such as males and females, using DIF procedures. Because these biased items offset one another, the overall effect (DTF) of these few items on the fairness of the test can be minimal (Waller et al., 2000).
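The text does not single out a particular DIF procedure. As an illustration of the general idea of comparing groups after matching on ability, the sketch below uses the Mantel-Haenszel common odds ratio, one widely used member of the DIF family; the choice of method, the function names, and the counts are all assumptions made for this example. Examinees are grouped into strata by total score, and within each stratum the item's pass and fail counts are tallied for a reference group and a focal group.

```python
import math


def mantel_haenszel_odds_ratio(strata):
    """Common odds ratio across ability strata.

    Each stratum is a tuple of counts:
    (reference_pass, reference_fail, focal_pass, focal_fail).
    A value near 1 suggests the item behaves alike for matched examinees.
    """
    numerator = 0.0
    denominator = 0.0
    for ref_pass, ref_fail, focal_pass, focal_fail in strata:
        n = ref_pass + ref_fail + focal_pass + focal_fail
        if n == 0:
            continue  # skip empty strata
        numerator += ref_pass * focal_fail / n
        denominator += ref_fail * focal_pass / n
    if denominator == 0:
        raise ValueError("odds ratio is undefined for these counts")
    return numerator / denominator


def mh_delta(odds_ratio):
    """Odds ratio re-expressed on the delta scale (-2.35 * ln of the ratio);
    negative values indicate an item that favors the reference group."""
    return -2.35 * math.log(odds_ratio)


# Hypothetical counts for one item in three total-score strata.
item_counts = [(20, 10, 10, 5), (30, 10, 15, 5), (40, 5, 16, 2)]
ratio = mantel_haenszel_odds_ratio(item_counts)
print(f"MH odds ratio = {ratio:.2f}, delta = {mh_delta(ratio):.2f}")
```

In this hypothetical item the two groups have identical pass odds within every stratum, so the ratio is 1 and the item shows no DIF even though the groups differ in how many examinees fall in each stratum.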

Invariant factor structure and scale reliabilities. The examination of comparable reliability and validity across separate demographic groups should be conducted to investigate test fairness. Jensen (1980) noted that if test reliability and validity coefficients differ significantly for designated subgroups of interest, then “it is clear that the test scores are not equally [reliable or valid] measures for both groups” (p. 430). With respect to validity, Meredith (1993) asserted that strict factorial invariance is required for test fairness and equity to exist.

Geisinger (1998) noted the importance of comparable reliabilities across subsamples, stating that “subgroup-specific reliability analysis may be especially appropriate when the reliability of a test has been justified on the basis of internal consistency reliability procedures (e.g., coefficient alpha)” (p. 25). The demonstration of comparable reliabilities across samples that differ on the basis of gender, race, or ethnicity has been studied in some current-generation intelligence tests with positive outcomes (Bracken & McCallum, 1998; Matazow et al., 1991; Vance & Gaynor, 1976; Zhu et al., 1999).

External Evidence of Test Fairness

The external features of test fairness are evident in the relationship between test scores and various external criteria, including equality of prediction and consequential impact. It is important to examine external evidence of validity in addition to internal sources of evidence like DIF when investigating test fairness. Focusing solely on internal evidence of fairness may fail to capture subtle yet important sources of test bias (Shepard et al., 1981).

Comparable prediction. The demonstration of equivalent predictive validity across demographic groups constitutes an important source of fairness that is related to validity generalization. Intelligence tests used for the diagnosis of mental retardation should predict future external outcomes, such as employability or independent functioning, in a comparable manner across differing demographic groups.

Minimize adverse impact and selection bias outcomes. A second form of external bias includes the differential incidence of adverse outcomes or differential selection rates across groups. Mean score differences between groups on tests are not inherently an indication of bias and may yield comparable prediction rates. Still, disparate group mean scores can have the undesirable effect of producing disproportionate negative impact for one group as opposed to another (Thorndike, 1971). Such consequential aspects of test bias are commonly referred to as selection bias (Jencks, 1998). When test scores produce adverse, disparate, or disproportionate impact for one group over another, even when that impact is construct relevant, test users should consider the societal and legal implications of such selection bias.

CONCLUSIONS AND RECOMMENDATIONS

Review of the extensive literature on the assessment of intellectual functioning reveals that because of differential rates of development across the life span, the most accurate estimates of intellectual functioning can be made only from recently administered, comprehensive IQ tests. This means that intelligence testing for infants (birth through age 2) is best done at the time of the eligibility determination, within the last year for children between the ages of 3 and 6, and within three years between the ages of 6 and 16. For adults between the ages of 18 and 50 who are living in stable conditions and are in stable health, composite IQ scores are valid for as long as five years; and, after age 50, composite IQs could reasonably be considered valid for three years.

Research also suggests that intelligence in the entire population increases at a rate of approximately 3 IQ points per decade, which approximates the standard error of measurement for most comprehensive intelligence tests. Thus, tests with norms older than 10 to 12 years will tend to produce inflated scores and could result in the denial of services to significant numbers of individuals who would have been eligible for them, if more recent norms had been used.

Because intelligence is a complex and multidimensional construct, it is imperative that intelligence tests used for diagnosis be comprehensive (multifactored) and assess more than a single cognitive attribute. Also, because test length and comprehensiveness are directly related to an instrument’s technical adequacy and construct sampling, brief or abbreviated tests sacrifice test quality or comprehensiveness for brevity.

Language-loaded intelligence tests are not appropriate for people who would be disadvantaged due to language limitations (e.g., deafness, limited English proficiency, elective/selective mutism, autism). Whenever language facility constitutes a source of construct-irrelevant variance for examinees, language-loaded instruments (both verbal and performance scales) create an unfair additional challenge. In such cases, examinees should be assessed in their native language or with intelligence tests that do not require receptive or expressive language.

Since the skills and training of the examiner can affect the accuracy of an IQ test, examiners should meet publishers’ requirements for the use of Class C tests. Class C instruments are those that require the highest level of training, professional credentials, and supervision. Examiners (not their supervisors) should meet this minimal professional standard. Furthermore, examiners who administer and interpret intelligence tests should possess the skills and competencies to assess clients with uncommon characteristics, such as deafness, extreme youth or age, or a nonmajority cultural or linguistic background. Not only should examiners be competent to administer and interpret intelligence tests, but they should also have the knowledge and experience to work effectively with clients of all ages, exceptionalities, and cultural/linguistic backgrounds to ensure valid assessment results.

Almost a century of intelligence test development has shown that the most valid and accurate results are obtained when tests meet minimal psychometric standards, as outlined in this chapter, for use in high-stakes decision making like SSA disability determination. The tests should demonstrate adequate floors, item gradients, reliability, validity, norm table sensitivity, and population representation, as well as sufficient and convincing evidence of fairness and lack of bias.

Composite scores from intelligence tests should be used routinely in mental retardation diagnosis, except when the validity of a composite IQ above 70 is in doubt, in which case an appropriate part score may be used in its place. Significant and meaningful variation among an instrument’s part scores may indicate evidence of compromised validity for one or more of them (for example, a low verbal scale score for an individual with a suspected speech disorder), which in turn would threaten the validity of the composite IQ. In such situations, appropriate part scores may better represent the individual’s true overall level of cognitive functioning or it may be necessary to use other methods to support a diagnosis of mental retardation (see Chapter 5).

However, only part scores derived from scales that demonstrate high g-loadings (e.g., crystallized, fluid, visual/spatial measures of intelligence) should be used in place of the composite IQ score when its validity is in doubt. Many intelligence tests assess several facets of intelligence, but not all facets are equally important or predict life events equally well. Those intellectual facets that are heavily g-saturated provide the best sources for replacing the composite IQ score when its validity is questionable.

The characteristics of comprehensive IQ tests are such that, even when part scores are used in making disability determinations for mental retardation, the composite IQ score from an instrument should never be higher than 75. Furthermore, if a part score is used in place of the composite IQ score in SSA decision making, the part score should not exceed 70. Therefore:

Recommendation: A client must have an intelligence test score that is two or more standard deviations (SD) below the mean (e.g., a score of 70 or below, if the mean = 100 and the standard deviation = 15).

  • Composite score is 70 or below: If the composite or total test score meets this criterion, then the individual has met the intellectual eligibility component.

  • Composite score is between 71 and 75: If the composite score is suspected to be an invalid indicator of the person’s intellectual disability and falls in the range of 71-75, a part score of 70 or below can be used to satisfy the intellectual eligibility component.

  • Composite score is 76 or above: No individual can be eligible on the intellectual criterion if the composite score is 76 or above, regardless of part scores.2

The committee recommends continuation of the criterion of presumptive eligibility for persons with IQs below 60.
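The composite and part-score rule in the recommendation above can be sketched as a small decision function. This is an illustrative sketch under stated assumptions, not an SSA algorithm: the names `meets_intellectual_criterion`, `composite_suspect`, and `part_score` are hypothetical, the caller is assumed to have already confirmed that any part score comes from a scale with high g-loadings, and the presumptive-eligibility criterion for IQs below 60 is not modeled.

```python
from typing import Optional

MEAN = 100.0
SD = 15.0
CUTOFF = MEAN - 2 * SD      # two standard deviations below the mean: 70
COMPOSITE_CEILING = 75.0    # highest composite at which a part score may substitute


def meets_intellectual_criterion(
    composite: float,
    composite_suspect: bool = False,
    part_score: Optional[float] = None,
) -> bool:
    """Apply the committee's three-tier rule to one set of test scores.

    composite          composite (full-scale) score
    composite_suspect  True if the composite is suspected to be an invalid
                       indicator of the person's intellectual functioning
    part_score         an appropriate part score, if one is being offered
                       in place of the composite (assumed highly g-loaded)
    """
    if composite <= CUTOFF:
        # Tier 1: composite of 70 or below meets the criterion outright.
        return True
    if composite <= COMPOSITE_CEILING:
        # Tier 2: composite of 71-75 qualifies only if its validity is in
        # doubt and a part score of 70 or below is available.
        return (
            composite_suspect
            and part_score is not None
            and part_score <= CUTOFF
        )
    # Tier 3: composite of 76 or above never qualifies, whatever the part scores.
    return False


print(meets_intellectual_criterion(68))
print(meets_intellectual_criterion(73, composite_suspect=True, part_score=69))
print(meets_intellectual_criterion(78, composite_suspect=True, part_score=60))
```

Deriving the cutoff from the mean and standard deviation, rather than hard-coding 70, reflects the wording of the recommendation, which defines the threshold as two standard deviations below the mean and gives 70 only as an example for a scale with mean 100 and SD 15.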

2 Committee member Keith Widaman dissents from this part of the recommendation. Dr. Widaman believes that IQ part scores representing crystallized intelligence (Gc, similar to verbal IQ) and fluid intelligence (Gf, related to performance IQ) have clear discriminant validity and represent broad, general domains of intellectual functioning. Therefore, a score of 70 or below on either of these part scores from any standardized, individually administered intelligence test that reports such scores should be deemed sufficient to meet the listings for low general intellectual functioning regardless of the level of the composite score, providing that the part scores have adequate psychometric properties (e.g., high reliability, low standard error of measurement). Dr. Widaman notes that, without any clear justification, SSA currently accepts either a composite IQ score from any standardized, individually administered intelligence test or a verbal or performance IQ score, any one of which can be 70 or below. SSA does not stipulate that the composite IQ must be below a certain score for a part score to be used. Dr. Widaman’s position provides a rationale for current SSA use of part scores, but it (a) aligns the acceptable part scores with the constructs of Gc and Gf used in contemporary theories of mental abilities and (b) argues that usable part scores for Gc and Gf should not be limited to those derived from any particular test instrument.

Current estimates suggest that between one and three percent of people living in the United States will receive a diagnosis of mental retardation. Mental retardation, a condition characterized by deficits in intellectual capabilities and adaptive behavior, can be particularly hard to diagnose in the mild range of the disability. The U.S. Social Security Administration (SSA) provides income support and medical benefits to individuals with cognitive limitations who experience significant problems in their ability to perform work and may therefore be in need of governmental support. Addressing the question of whether SSA’s current procedures are consistent with current scientific and professional practices, this book evaluates the process used by SSA to determine eligibility for these benefits. It examines the adequacy of the SSA definition of mental retardation and its current procedures for assessing intellectual capabilities, discusses adaptive behavior and its assessment, advises on ways to combine intellectual and adaptive assessment to provide a complete profile of an individual's capabilities, and clarifies ways to differentiate mental retardation from other conditions.
