9— Assessment Without Adverse Impact

Neal Schmitt

The 1964 Civil Rights Act stimulated an examination of employers' decisions in hiring, promotions, and other human resources actions. In the employment arena, the first Supreme Court case that ruled on the provisions of the act established that the court would look first at the hiring rates in different subgroups (Griggs v. Duke Power Company, 401 U.S. 424, 432 (1971)). If these hiring rates were different, the employment process would be examined to discover whether employment decisions were based on job-related concerns. In the absence of evidence about the validity of the employment procedures, the procedures were considered discriminatory, and the courts typically prescribed some remedial action. Between Griggs and the late 1980s, this was the legal status quo in employment discrimination cases. Employers realized that their human resource decisions would not be challenged if the "numbers came out right," and some adapted their procedures in ways that ensured this outcome.

The quandary for employers was that many of the measures they were using that were cognitively based (or related to various academic skills) provided valid predictions of applicants' subsequent job performance but produced large subgroup differences in test scores and subsequent hiring rates. Technically, an employer should be able to use those procedures (i.e., valid procedures that produce adverse impact), but in many instances the employer or the courts or both were unhappy with the resulting composition of the work force. This produced the impetus for within-group scoring and other types of adjustments designed to achieve the desired work-force composition. These actions, in turn, produced an increased concern about reverse discrimination, which still continues. Ultimately, the public demanded change (or so our congressional representatives believed), and one result was the Civil Rights Act of 1991, which explicitly prohibits any kind of score adjustment designed to favor one group over another with regard to employment decisions.

This, then, is a simple version of the quandary faced by organizational decision makers. How do employers use the valid assessment procedures they have been using in a way that will produce a work force that is optimally capable and that is representative of the diverse groups in our society? Or how do we develop equally valid instruments that do not produce adverse impact? This paper attempts to describe some of the ways in which organizations and assessment specialists have tried to adjust to this quandary, the success of these attempts, and what new legal issues might be raised when these procedures are challenged, as they either have been or almost certainly will be. Specifically, the following five approaches to reducing or eliminating the adverse impact of psychological measures will be discussed: (1) inclusion of additional job-related constructs with low or no adverse impact in a battery that includes cognitive or academically based measures with high adverse impact; (2) changing the format of the questions asked or the type of response requested; (3) using computer or video technology to present test stimuli and collect responses; (4) using portfolios, accomplishment records, or other formalized methods of documenting job-related accomplishments or achievements; and (5) changing the manner in which test scores are used: specifically, by the use of banding.

Use of Additional Constructs to Assess Competence

One criticism of traditional personnel selection procedures is that they often focus on a single set of abilities, usually cognitive. These cognitive abilities are relatively easy and inexpensive to measure in a group context with paper-and-pencil instruments. Moreover, they tend to exhibit some validity for most jobs in the economy. They also, of course, exhibit large subgroup differences. It should be noted that with unequal subgroup variances, a possibility not often examined, the differences between lower- and higher-scoring subgroups might vary as a function of the part of the test score distribution examined. If the job requires other capabilities, such as interpersonal or teamwork skills, for example, why are these capabilities not measured? If we did measure these alternative constructs, what would happen to the organization's ability to identify talent and to the size of the subgroup difference when information from multiple sources on multiple constructs is combined to make hiring decisions?

Recently, Sackett and Wilk (1994) examined a simple instance of this case in which one predictor with a large subgroup difference (i.e., one standard deviation) was combined with a second predictor on which subgroup scores were equivalent. In Sackett and Wilk's case the two predictors were equally valid and uncorrelated with each other. When first examining this case, one might predict that the subgroup difference on a simple equally weighted composite would be 0.5; in fact, if the two measures are uncorrelated, the difference on the composite is 0.71, and in the presence of some positive correlation between the two predictors the difference on the combined scores falls between 0.5 and 0.71, still well above the intuitive value. This simple example suggests that the combination of predictors with different levels of subgroup difference will not yield nearly the dampening effect on subgroup differences one might hope for.

In the actual prediction of academic or job performance criteria, the situation will always be more complex. Recently, Elaine Pulakos and I (Pulakos and Schmitt, 1996) had the opportunity to examine various possible combinations of assessment scores and their impact on three groups (African Americans, Hispanic Americans, and whites) of applicants for jobs in a federal investigative agency. A traditional multiple-choice measure of verbal ability (analogies, vocabulary, and reading comprehension) was used along with two performance measures of writing skills. One of these measures required examinees to watch a video enactment of a crime scene and then write a description of what had happened. The other performance test required examinees to study a set of documents and reports of interviews and then write a summary of their observations of the case. The two performance measures were rated for writing skills (grammar, spelling, and punctuation), organization, persuasiveness, and the degree to which the examinee attended to and reported details of the case. In addition, the examinees responded to a biographical data questionnaire (a multiple-choice measure of background experiences, interests, and values), a situational judgment test (requesting their choice of one of three or four alternative reactions), and a structured interview designed to measure their actions in past situations that required job-related skills. These measures were relatively uncorrelated (all less than .39), valid against at least one of two performance rating criteria (i.e., observed correlations with the criterion exceeding .14), and varied considerably in the degree to which scores were characterized by subgroup differences.

Of most relevance to the current discussion, the traditional verbal ability measure by itself produced a difference between white and African American examinees equal to 1.03 standard deviations. The white and Hispanic American difference was equal to 0.78 standard deviations. With one exception, the differences on the biographical data measure, the structured interview, and the situational judgment test were less than 0.22. The exception was the situational judgment test comparison for the African American and white groups, which produced a 0.41 standard deviation difference. That test was the most "verbal" and cognitive of these measures. When the four tests were combined, the difference between the African American and white groups was 0.63; that between the Hispanic American and white groups was 0.48. Both represent a drop in the subgroup difference of about 0.30 to 0.40 of a standard deviation, but note that this drop was accomplished by combining one test that had a large subgroup difference with three measures on which there were minimal or no subgroup differences.
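The arithmetic behind both the two-predictor example and the four-test battery follows from a standard result for equally weighted composites of standardized predictors, with each subgroup difference d_i expressed in pooled within-group standard deviation units and an average within-group intercorrelation of r-bar:

    d_{\text{composite}} = \frac{\sum_{i=1}^{k} d_i}{\sqrt{k + k(k-1)\bar{r}}}

With d_1 = 1.0, d_2 = 0, and r = 0 this gives 1/\sqrt{2} \approx 0.71; a modest intercorrelation of r = .30 gives 1/\sqrt{2.6} \approx 0.62, still well above 0.50.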

The use of all four measures added uniquely to the overall multiple R relating the criteria to predictors. It is certainly appropriate to include all four measures (particularly the noncognitive tests) in this battery. The three noncognitive tests are measures on which the Hispanic American and African American groups typically do better, and these measures have often been excluded on the grounds that they were too expensive to develop and implement or added nothing above and beyond more traditional test batteries. Using all four measures is fair as well as optimal in a scientific sense in that a broader sampling of job-relevant constructs results. Combining measures that have little or no adverse impact with traditional measures that have high adverse impact will not diminish the overall impact as much as one might hope, but it will lessen subgroup differences substantially.

Two other studies of which I am aware address the question of the degree to which adverse impact will be diminished when tests of varying levels of adverse impact are combined. The degree to which adverse impact is lessened appears to be a complex combination of the level of adverse impact each part of the battery displays, the reliability of the individual tests, the intercorrelation of the tests (with increased levels of intercorrelation, such combinations will result in smaller decreases in the level of adverse impact), and the selection ratio. Sackett and Roth (1995) have examined the case in which one alternative predictor with no subgroup difference is combined with a predictor that displays a large subgroup difference (i.e., one standard deviation) for a variety of test use strategies. Schmitt et al. (1997) examined the role some of these factors play in determining levels of adverse impact and predictability. Both papers confirm the complex interaction of these factors in the determination of both predictability of performance and the size of subgroup differences.

Obviously, whether any or all of these alternative measures should be used to make employment decisions is always contingent on their validity. On a legal basis it would be hard to challenge the use of additional tests with less adverse impact if in fact they are valid. What might occur in this situation is a challenge to the use of the traditional test, which, when used singly, produces large adverse impact and, when used in combination with the other predictors, is responsible for a relatively large adverse impact for the composite. In the case of the Pulakos and Schmitt (1996) study described above, the verbal ability test added .02 to the multiple correlation with one of the two rating criteria beyond that afforded by the combination of the situational judgment test, the biographical data measure, and the interview. So one is comparing an incremental validity of .02 against a rather significant impact on two protected groups. As is so often the case, it seems that the courts and society at large are left with conflicting goals, and the solution will be a function of the decision maker's value system. In one sense this solution to the problem of subgroup differences is another version of the search for equally valid alternatives to tests with high adverse impact. In this particular case, valid alternatives did exist, but each appeared to contribute uniquely to the prediction of the performance construct.
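To make the interplay of factors described above concrete, the sketch below computes the composite subgroup difference and the resulting subgroup selection rates under strict top-down selection. It assumes normally distributed scores with equal within-group variances, and every numeric input (the subgroup differences, the mean intercorrelation, the minority share of the applicant pool, and the overall selection rate) is an illustrative placeholder rather than a value from the studies cited.

    # Sketch: how predictor subgroup differences, their intercorrelation, and the
    # selection ratio combine to determine subgroup selection rates under strict
    # top-down selection on an equally weighted composite. Normal-distribution
    # assumptions; all numeric inputs are illustrative.
    from statistics import NormalDist

    def composite_d(ds, mean_r):
        """Standardized subgroup difference on an equally weighted composite."""
        k = len(ds)
        return sum(ds) / (k + k * (k - 1) * mean_r) ** 0.5

    def selection_rates(d, overall_rate, minority_share):
        """Per-group selection rates when the top scorers overall are hired."""
        norm = NormalDist()
        # Group means separated by d (within-group SD = 1), mixture mean = 0.
        mu_min = -(1.0 - minority_share) * d
        mu_maj = minority_share * d
        lo, hi = -6.0, 6.0
        for _ in range(60):  # bisect for the cut score giving the overall rate
            cut = (lo + hi) / 2.0
            rate = (minority_share * (1.0 - norm.cdf(cut - mu_min))
                    + (1.0 - minority_share) * (1.0 - norm.cdf(cut - mu_maj)))
            lo, hi = (cut, hi) if rate > overall_rate else (lo, cut)
        return 1.0 - norm.cdf(cut - mu_min), 1.0 - norm.cdf(cut - mu_maj)

    d = composite_d([1.0, 0.2, 0.2, 0.2], mean_r=0.25)  # about 0.60 with these inputs
    minority_rate, majority_rate = selection_rates(d, overall_rate=0.15, minority_share=0.20)
    print(round(d, 2), round(minority_rate, 3), round(majority_rate, 3))

Varying these inputs shows how sensitive both the composite difference and the relative selection rates are to each of the factors listed above.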

Coincidentally but relevant to the larger purpose of this paper, the findings in this study regarding the three different measures of verbal ability are interesting. The alternative measures of verbal ability had comparable validity (.15, .19, and .22) and reliability (.85, .86, and .92) but displayed radically different levels of subgroup difference in mean scores. As stated above, for the traditional verbal test, African American and white differences were 1.03, and Hispanic American and white differences were 0.78. The performance measure involving written stimulus materials and requiring written output yielded somewhat smaller differences of 0.91 and 0.52, while the same comparisons for the performance measure in which the stimulus material was visual yielded differences of 0.45 and 0.37. Although it would be tempting to attribute the diminution of adverse impact to the change in test format, the intercorrelations between these three measures were only .26, .39, and .31, indicating that there were differences in the abilities or traits measured as well as format differences in these three measures of verbal ability.

Developing New Formats

The multiple-choice paper-and-pencil measure of ability has been criticized most frequently and probably remains the most ubiquitous assessment tool. Maintaining that the multiple-choice format is responsible for the magnitude of subgroup differences is tantamount to saying that test variance is partly a function of method variance and that subgroups differ on the method variance component more than they do on the variance components that are construct relevant. That there is something unique about the multiple-choice format has been demonstrated in a number of studies over the years (Cronbach, 1941; Traub and Fisher, 1977; Ward et al., 1980; Ward, 1982; Boyle, 1984). That there are format-by-subgroup interactions that would indicate that method bias differentially affects members of different subgroups has not been frequently studied. When it has, the results have been confusing and contradictory. A paper by Scheuneman (1987) is illustrative. She examined 16 hypotheses regarding differences between African Americans and whites in response to the Graduate Record Examination. Significant interactions were observed for 10 of the 16 hypotheses, but these interactions were so complex (group by item version by item pair) that interpretation was difficult. In addition, the sample sizes were very large; hence, significant effects were associated with small effect sizes. A similar but largely unstudied hypothesis is that minority groups are more likely to omit items than guess.

Recently, Outtz (1994) has pointed out that very few researchers have actively studied the role that such method bias may actually play in producing subgroup differences in measured ability. He also provided a taxonomy of test characteristics that might provide the impetus for some systematic research on this issue. He cautions, as do others (Ryan and Greguras, 1994; Schmitt et al., 1996), that researchers in this area must be careful not to confound the construct measured with the format in which it is measured. This problem confounds the interpretation of the relatively small body of research on the influence of format differences on measures of ability.

An analysis of the degree to which the stimuli and possible responses in an assessment device are samples of some content domain also points to the problems a researcher encounters when trying to compare different formats. For example, we might ask a potential teacher to answer the following question: What would you do if an angry parent confronted you about the grade a son or daughter received on an examination? We might present this question in multiple-choice format with the following alternatives: (1) try to calm the parent down and then deal with the problem, (2) calm the parent down and then tell him or her to wait until you are finished with the task you are now doing, (3) ignore the parent or walk away because the parent is being rude, or (4) inform the parent that you will not tolerate his or her attitude and behavior. This item could also be presented in open-ended format and require an essay response from the prospective teacher. Or it could be an interview question that would require that the respondent give an oral response. We could even provide a role-play situation in which the examinee's response to an angry parent is observed and rated. Or, as is more frequently being done today, a video enactment of the four alternatives could be shown to the examinee, who would then have to indicate which course of action he or she would pursue. Another format might require that the examinee document her or his actual behavior in a similar situation in portfolio fashion.

Clearly, these "format" differences vary along various dimensions (e.g., realism, capability of being objectively scored), but perhaps the most significant difference relates to the breadth of content sampling that is possible. In a multiple-choice format we can provide many stimuli (hopefully of a broadly representative nature), but we limit the examinee to a given set of responses to each item. Usually because of time and cost constraints, some other formats are limited in terms of the stimuli sampled but will presumably allow for a wider potential sampling of responses. Whether these differences yield data that decrease or increase the size of subgroup differences on various measures is unknown. One could surmise that if verbal skills are a problem, some of the formats that require extended verbal responses would increase the difference between groups. If groups are differentially motivated by concrete realistic requests for information, we might expect the realism dimension to be related to the size of subgroup differences.

In a similar vein, some authors (e.g., Green et al., 1989; Ryan and Greguras, 1994) have drawn attention to the possible differential subgroup impact of distractors in multiple-choice formats. Whitney and Schmitt (1997) have presented evidence that distractors associated (or not associated) with African American cultural values change the attractiveness of these alternatives in multiple-choice biographical data items. Their hypothesis that options that reflect communal interests and a respect for authority would be more attractive to African Americans than to whites was confirmed, but the overall effects on test scores were very small. Similar efforts to assess differential distractor functioning in the realm of cognitive ability have rarely produced effects beyond chance levels. Even those effects that have been found did not have satisfying substantive interpretations. If these efforts are to be informative, they should be preceded by a careful examination of the cognitive requirements of the items and how they might be associated with subgroup differences. In other words, a priori hypotheses should be presented, as was true of the Scheuneman (1987) and Whitney and Schmitt (1997) work. From a content perspective, it is also important that the response options reflect the domain of possible responses (Guion, 1977).

As was alluded to above, there may also be motivational reasons to be concerned about the content of the item stimuli and response options that are related to subgroup status. There is a small body of research (e.g., Schmidt et al., 1977; Rynes and Connerly, 1993; Smither et al., 1993) indicating examinee preference for job-sample or "realistic" test formats over multiple-choice formats. In the educational arena it will be no surprise to any college professor that students prefer multiple-choice items (e.g., Bridgeman, 1992; Zeidner, 1987). I have used the threat of an essay makeup exam for many years in large college classes to avoid a large group of students demanding that they be given an exam after the scheduled date. Ryan and Greguras (1994), however, reported that minorities in an employment situation were significantly more likely than whites to agree that there was no connection between multiple-choice exams and one's ability to do a job, that multiple-choice exams cannot determine if one is a good employee, and that they would rather take a hands-on test even if it takes a lot more time. Smither et al. (1993) also found significant differences between minority and majority group members and older and younger job applicants on their reactions to various tests. Recently, Chan and co-workers (1997) also found relatively large and statistically significant differences in the perceived fairness and self-reported test-taking motivations of African American and white students who were taking a draft form of a cognitive ability measure to be used in selecting and promoting managerial personnel. It is at least plausible that these differences in motivation and perceived fairness will translate into differences in performance. Chan et al. (1997) provide some evidence that this might be the case.

It is difficult to envision what new legal issues might arise as a function of the use of exams other than multiple choice, if those alternative formats and testing methods yield equal reliability and validity. At this time I do not believe that there is any convincing evidence that one group or another performs better as a function of the format of the test items used. The few available comparisons of minority-majority differences on different types of tests completely confound the content or construct measured in the test with the format of the test items. If experimental tests of item format could be devised that do not confound content or construct with format, we would have better answers to these questions. My hypothesis at this point would be that any changes in the size of the difference between majority and minority groups will be moderated or mediated by the motivational impact of these format changes.

Use of New Technologies

A significant stimulus to the question of whether format differences (as well as alternative formats) increase or decrease subgroup differences has been the availability of new video and computer technologies by which test stimuli can be presented and test responses gathered and scored. A large number of paper-and-pencil tests have been computerized: a computer terminal provides a more or less direct translation of the test to an examinee along with the potential responses, and the examinee is required to indicate the response by computer as well. Mead and Drasgow (1993) provide a meta-analysis of the effects of computerization on test scores. They found that the conversion of paper-and-pencil power tests to computerized forms yields scores that correlate highly (.97) and that the computerized version is slightly more difficult (d = -.04). Computerized versions of speeded tests do appear to be measuring something different from their paper-and-pencil counterparts (R = .72). No mention is made as to whether subgroup differences in scores on these tests increase or decrease when they are computerized.

Computer adaptive testing is being more widely used. In this kind of testing the test items presented to an examinee are matched to the person's ability level, which is estimated on the basis of previous responses. For example, a portion of the national examination used to license nurses is now a computer-adaptive measure, and the Graduate Record Examination can now be taken in computer-adaptive form. To my knowledge there are no data indicating that subgroup differences are smaller or larger on adaptive tests than on traditional tests.
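For concreteness, the following sketch illustrates the adaptive logic just described; it assumes a simple one-parameter (Rasch) model, maximum-information item selection, invented item difficulties, and a crude maximum-likelihood update, rather than the procedures used by any operational program.

    # Sketch of a computer-adaptive testing loop: re-estimate ability after each
    # response, then administer the unused item that is most informative at the
    # current estimate. One-parameter (Rasch) model; item difficulties are invented.
    import math, random

    ITEM_BANK = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0, 2.5]  # difficulties

    def p_correct(theta, b):
        """Rasch probability of a correct response at ability theta, difficulty b."""
        return 1.0 / (1.0 + math.exp(-(theta - b)))

    def update_theta(theta, responses, steps=25, step_size=0.5):
        """Crude maximum-likelihood update by gradient ascent on the log-likelihood."""
        for _ in range(steps):
            grad = sum(u - p_correct(theta, b) for b, u in responses)
            theta += step_size * grad / len(responses)
        return theta

    def administer(true_theta, n_items=5):
        theta, responses, used = 0.0, [], set()
        for _ in range(n_items):
            # The most informative Rasch item is the one whose difficulty is closest to theta.
            b = min((d for d in ITEM_BANK if d not in used), key=lambda d: abs(d - theta))
            used.add(b)
            u = 1 if random.random() < p_correct(true_theta, b) else 0  # simulated examinee
            responses.append((b, u))
            theta = update_theta(theta, responses)
        return theta

    print(round(administer(true_theta=0.8), 2))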

Given the possibility that adaptive tests may be uniformly more difficult than conventional (nonadaptive) tests, which often include easy items, especially at the beginning of the test, one might speculate that a computer-adaptive test would be more demotivating than a paper-and-pencil one. If minority groups are more prone to be negatively motivated by standardized tests of any form, the use of computer-adaptive tests may heighten their demotivation; others (Wainer et al., 1990), however, have used the same arguments to speculate that members of minority groups should do better on adaptive tests. In addition, it is possible that disadvantaged students will have had little or no past experience with a computer. Some with no experience may actually fear using a computer. Again, there are no data about the effects of such computer phobia and no data of which I am aware showing a differential impact on one subgroup over another. In fact, I was able to find only one mention of this potential problem in books or papers on computer-adaptive testing (Wainer et al., 1990). If the opportunity to use computers is differentially distributed across members of different subgroups, as well it might be, there may be some negative impact on those who have had less opportunity or experience with computers. As in the case of simple computerization of tests, I am aware of no studies that have examined the nature of subgroup differences on adaptive tests as opposed to full-length tests with varying item difficulties.

In addition to the heightened flexibility and capacity to present stimuli and collect and score responses that are characteristic of computer test administration, computers can also be used to present stimuli and collect responses that are inaccessible through paper-and-pencil tests. On the response side, computers can provide very accurate measurement of time variables such as response latencies. Variables characterizing the process of responding can be measured by tracking the activities of a test taker as he or she makes a decision or solves a problem. On psychomotor tasks, a person's use of a mouse or joystick can be recorded and scored. On the stimulus side, Pellegrino and Hunt (1989) have done research showing that a dynamic spatial ability factor is distinguishable from a static spatial ability factor included in many paper-and-pencil test batteries. This dynamic spatial factor involves the ability to track and project how objects will move in space, something that is clearly impossible with a static two-dimensional display.

These technological advances certainly expand the capability to measure human ability, but the impact on subgroup differences is simply speculative at this point. If the use of technology results in more realistic face-valid measures, it is likely that the motivation of test takers will improve. In addition, research on job samples indicates that subgroup differences are likely to be minimized (see, e.g., Schmidt et al., 1977, 1996). On the other hand, if the use of computer technology constitutes an opportunity advantage, subgroup differences may be negatively affected.

In some instances, computer technology has been used to increase the realism of the test stimuli. Drasgow et al. (1993) describe various exercises designed to measure noncognitive managerial skills. In an in-basket exercise, examinees are presented with an interpersonal problem. With the use of CD-ROM technology, two solutions to the problem are presented, and the examinee is asked to pick his or her preferred solution. This solution then produces another problem for the examinee to "resolve." Depending on the particular sequence of examinee responses, different examinees will be presented with different sets of questions. Drasgow et al. (1993) have used item response theory to calibrate the items and score the many different sets of test items that might be presented to the examinee. Several other similar examples are described by McHenry and Schmitt (1994). Ashworth and McHenry (1992) describe a simulation used by Allstate Insurance to select claims adjustors. Dyer et al. (1993) describe a similar test designed to assess examinees' skills in resolving interpersonal problems that confront them in entry-level production jobs at IBM. Wilson Learning (1992) has developed tests of this type for sales, banking, supervisory, and customer service positions, and Schmitt et al. (1993) describe a test to assess the technical skills of applicants for clerical jobs at Ford Motor Company. Only Wilson Learning provides any evidence regarding criterion-related validity (validity equaled .40 for the customer service measure), and none of these investigators mentioned subgroup differences. Given the work-sample nature of these measures, subgroup differences are most likely smaller than they are for paper-and-pencil measures of ability. However, it would again be impossible to determine whether any difference in subgroup differences is a function of the test format or of the constructs measured. Interestingly, with the exception of the Schmitt et al. (1993) study, the focus in most of these efforts has primarily been on interpersonal or noncognitive capabilities.

Some recent efforts to reduce adverse impact of tests have focused on reducing the reading or writing requirements of examinations when those requirements are not essential to the job. This is certainly partly the focus of the multimedia tests described above, but in some cases the only change was from verbal or written to oral or visual test stimuli. That is, a written test of problem-solving skills is presented visually and orally. In some of these tests only the problem is presented visually, and the examinee is asked to select from a number of written options. In other cases both the problem and the response options are presented visually, and the examinee is asked to pick which action he or she would take to resolve the issue (HRStrategies, 1995). Chan and Schmitt (1997) have taken one of the video tests and produced a written paper-and-pencil version of the same test (items were written from the scripts used to produce the videos). They then compared the performance of African American and white examinees on the two versions of the test. They found subgroup differences to be about 0.20 on the video version and 0.90 on the written version. This may be the only comparison of subgroup differences on tests of different formats in which the contents of the test (and hopefully the constructs measured by the tests) were held constant.

While technology presents many alternatives for measuring individual differences in ability, there is an almost total absence of research literature on the validity of these measures as well as their potential impact on subgroup differences. Initial data on test reactions and older data regarding subgroup differences on job samples suggest that some of these changes may reduce subgroup differences. Equally promising is the potential to explore the nature of subgroup differences since the use of this technology allows for a significant expansion in the type of stimuli presented and the responses collected from examinees. The costs associated with the development, scoring, and updating of these measures, however, are certainly substantial. Opportunity differences associated with the previous use of, or exposure to, computers may raise legal concerns. If the test requires responses to different items from different examinees, as is true of computer-adaptive tests or branching tests of the type described by Drasgow et al. (1993), the equating of these tests may be difficult to explain and defend in court.

The major problem with the use of video and computer technology probably remains a simple lack of information on what exactly is being measured, what relationships with performance can be expected, and what differences in subgroup performance can be expected and why.

Documentation of Previous Accomplishments

In the past several years a great deal of attention has been directed to a consideration of "authentic assessment" or "portfolio assessment," particularly among educators who are interested in documenting student learning (Shulman, 1988). A portfolio is usually a collection of information about a person's experiences or accomplishments in various relevant areas. If organized around dimensions of importance to a particular job or educational experience, these portfolios may be viewed as indicators of a person's knowledge, skill, or ability in these areas. The contents of a portfolio are carefully selected to illustrate key features of a person's educational or work experiences and include written descriptions of how projects or products were created and accomplished, for what purpose the project was initiated, the examinee's role in the project, with whom he or she worked, and, perhaps most importantly, how the project or product reflects the examinee's competency on various dimensions. Obviously, if a portfolio is to be useful in the selection of individuals for a particular job, it cannot be a random collection and documentation of experiences; it must be targeted to the competencies required in a given job or jobs.

As with all measurement instruments, key concerns with the use of portfolio assessment are reliability and validity. Research addressing the reliability of the scoring of portfolios is just beginning to develop, but the results of existing studies suggest that both internal consistency and interscorer reliability are low (Dunbar et al., 1991; Koretz et al., 1992; Nystrand et al., 1993). The validity issue has not been addressed as often, but one might argue that insofar as the portfolio contains evidence of the accomplishment of work tasks or tasks that require similar knowledge, skills, and abilities, no further evidence of validity is necessary. However, low interrater reliability would certainly suggest that validity is low as well. There is a belief that subgroups who score lower on traditional tests may not score as low in portfolio assessment, but it will be important to document that this is not a function of the lower reliability usually associated with portfolio assessment.

While predictive bias studies do not exist, preliminary evidence on subgroup mean scores does not support the view that portfolio assessment will reduce adverse impact or produce equity. Extended-response essays on the National Assessment of Educational Progress, for example, result in mean differences between African American and white students that parallel and, after correction for unreliability, actually exceed those found on multiple-choice reading assessments (Linn et al., 1991). Bond (1995) also reports that in one study African American students received lower scores than their white counterparts on portfolio evaluations, regardless of the race of the rater.

In the work arena the development and use of accomplishment records is very similar to the use of portfolio assessment (Hough, 1984). However, much greater attention seems to have been placed on the psychometric adequacy of accomplishment records and on documentation of the level of examinee involvement in the various items that may appear in a portfolio. Further, the work experiences are usually documented at the time one is applying for a job rather than in an ongoing manner as is the case for many educational portfolios.

The development of accomplishment records has followed several well-defined steps that may contribute to their superior psychometric adequacy relative to portfolio assessment. Descriptions of portfolio assessment procedures do not include similar steps. Subject-matter experts meet to define the knowledge, skills, and abilities that best differentiate superior employees from those who are performing at minimally acceptable levels. This information is used to construct an application form that is organized around these job-relevant dimensions, and applicants are asked to describe their achievements on each dimension. This description must include information about the nature of the problem an applicant confronted, what he or she actually did, what outcome resulted, what percentage of the outcome was attributable to the respondent, and the name of someone who could document the respondent's role in producing the achievement described. Data are collected from a pool of applicants, and subject-matter experts are again used to judge the quality of this set of achievements. These achievements are then used as benchmarks against which additional applicants' qualifications are judged.

Schmidt et al. (1979) reported interrater reliabilities averaging .80 when they used this procedure to evaluate accountants for federal civil service jobs. Hough (1984) used the procedure to evaluate attorneys' job-related skills on seven dimensions as well as their overall ability. Interrater reliabilities ranged from .75 to .82 for a three-rater composite. Hough was also able to collect performance data for 307 attorneys, and the correlation between composite performance ratings and accomplishment record scores was .25. Finally, the standardized mean difference between minority and nonminority attorneys was 0.33, which almost exactly matched the difference of these two groups on the performance composite (0.35). In this case the relatively small difference between minority and majority groups was not a function of low reliability. It is also important to note that some of the dimensions rated were cognitive (e.g., researching/investigating, using knowledge, planning and organizing). Hough's minority sample was small (N = 30), and very little subsequent research has been published on this method. If Hough's results are replicable, this may be a viable and promising alternative to traditional selection procedures and a significant improvement over the results that seem to be obtained using portfolio assessments.
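The role that multiple raters play in these reliability figures can be seen from the Spearman-Brown formula (the single-rater value used below is illustrative, not one reported by Hough):

    r_{kk} = \frac{k \, r_{11}}{1 + (k - 1) \, r_{11}}

A single-rater reliability of r_{11} = .50, for example, yields r_{33} = 1.5 / 2.0 = .75 for a three-rater composite, roughly the lower end of the range reported above. Because the correlation between a predictor and a criterion cannot exceed the square root of the predictor's reliability, the low interscorer reliabilities reported for portfolios place a correspondingly low ceiling on their possible validity.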

Completing an accomplishment record or constructing a portfolio can be complex and time consuming; in at least one instance in which this author was involved, the minority group had a significantly lower completion rate than the majority group. The work involved may have a differential motivational impact; the persons involved may have realized that they did not have the required competencies when they read the accomplishment record; or they may have reacted to the organization's climate for minority individuals. In any event the involvement of members of different groups at all phases of an employment process should be monitored in an effort to detect unanticipated outcomes.

The use of an accomplishment record was preceded by the use of training and experience inventories in various civil service jurisdictions. One commonly used method specified the number of points to be awarded for various years of specified training and experience. Points were determined on the basis of some judgment of the relative worth of the various experiences. This approach was essentially credentialistic in that it focused on the amount of experience and education rather than on what was achieved or accomplished during that education or experience. Even this relatively simple approach to assessing competencies appears to have some validity (McDaniel et al., 1988). It should be noted that this extreme attenuation of the portfolio approach is almost certain to be attacked on the grounds that there are large subgroup differences in educational attainment (and almost certainly job experiences) and that direct connections between a high school or college diploma and job performance are difficult to make. This is true in spite of, or perhaps because of, the relatively low level of criterion-related validity of such credentials (see McDaniel et al., 1988).

Whether or not enthusiasm for the portfolio type of measurement continues and future research suggests practical solutions to some of the measurement inadequacies, there are a number of possible issues that could generate litigation. Explanations of the scoring process and the reliability and qualifications of the scorers are obvious targets. Questions may also be raised about who actually does the work involved (Gearhart and Herman, 1995) and about the extent of the examinee's role in any accomplishment. Perhaps the most significant unknown is the degree to which questions about the opportunity to achieve or accomplish along relevant dimensions will arise. As anyone who can remember looking for a first job or for summer employment will verify, one of the easiest ways to dismiss someone is to say the person does not have the requisite experience. But how does one achieve the required experience without that first job? Actually, portfolios and accomplishment records can accommodate this concern if developers and subject-matter experts are sensitive to the fact that relevant experiences can be acquired in nontraditional ways (e.g., through volunteer work or organizations and clubs).

Banding

The previous sections in this paper discussed various alternative testing methods and their impact on subgroup differences. This section addresses the manner in which test scores are used to make employment decisions.

As mentioned at the beginning of the paper, the desire to use valid tests while achieving a diverse work force often led employers to "adjust" scores in a variety of ways to achieve appropriate levels of diversity when the raw scores on those tests displayed subgroup differences. One such method, used by the U.S. Employment Service in reporting applicant scores to potential employers in the 1980s, was within-group norming. The scores of members of minority and majority groups were determined by reference to members of their own demographic group. This adjustment was equivalent to adding to the scores of members of the lower-scoring group a constant equal to the difference between the two groups' scores. One provision of the Civil Rights Act of 1991 was to prohibit such adjustments.

While adjustments to test scores are now illegal, there are many different ways in which test scores have been and can be used (e.g., pass-fail, multiple-hurdle, etc.), some of which result in lessened impact in some situations. Cascio et al. (1995a) describe a banding approach to the use of test scores (see also Sproule, 1984) to increase the likelihood of minority hiring as well as to attain other organizational objectives. This approach to the use of test scores started with the idea that individuals whose scores were not significantly different from the top scorer on the test should be treated as equally capable and that selections from this group of individuals could be made on other bases, including education, racial/ethnic diversity, seniority, job performance, training, experience, or relocation preferences. To determine the size of this band, Cascio et al. (1995a) proposed that the standard error of the difference (which is equal to the standard error of measurement multiplied by the square root of 2) be calculated. This value was then multiplied by 1.65 (if one chose the .05 level of significance) to determine the bandwidth. If the top score on a test was 100 and the bandwidth was 10, then all persons with scores between 90 and 100 would be considered equal, and some other means would be used to determine among this group of people who would be selected. Minority individuals in this band have a greater likelihood of being selected than if only top-down selection based on test scores were used to make selections. This assumes there are subgroup differences on the test and that not all individuals in the top band are selected. It also means that race, or some variable correlated with race, is used to make selections within the band if any reductions in adverse impact are to be realized.
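The band-width arithmetic can be illustrated with hypothetical figures (a test standard deviation of 10 and a reliability of .80; neither value is taken from the Cascio et al. work):

    \mathrm{SEM} = \mathrm{SD}\sqrt{1 - r_{xx}} = 10\sqrt{1 - .80} \approx 4.47
    \mathrm{SED} = \mathrm{SEM}\sqrt{2} \approx 6.32
    \text{bandwidth} = 1.65 \times \mathrm{SED} \approx 10.4

This is one way a band of roughly 10 points, as in the example above, could arise.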

As Kriska (1995) has pointed out, this approach to test use is no different from a multiple-hurdle approach in which examinees must achieve some passing test score as the first hurdle in a selection system and are then hired on other job-relevant bases. The banding method then constitutes a means of setting a pass score on the test and is probably no more or less scientifically or legally defensible than other available means of setting pass scores.

This banding approach is called a fixed band. In the context of this paper, the use of this approach would usually be motivated by a desire to increase minority representation. If all members in this band are selected, the result of this approach will be no different from that of a top-down procedure other than the fact that minority individuals might be selected earlier than would be the case if strict top-down selection occurred. If race is the only consideration used after the establishment of the band, the implication of the Civil Rights Act and a San Francisco case to be discussed below is that it is likely this process will be considered legally inappropriate. When race is combined with other decision factors, as was proposed by Cascio et al. (1995a), the effect on the number of minority hires will obviously be smaller. So, whether this fixed banding approach increases minority hiring is a function of the portion of the band hired and the secondary predictors used. In terms of the expected performance of those selected, the use of various banding approaches appears to have little impact (Siskin, 1995). If a fixed-band approach is used, it seems that one could also question why the test is used first in what amounts to a multiple-hurdle system. If test scores of minority and majority individuals are radically different while their standing on the secondary predictors is not, the organization would be using first the predictor on which minority individuals do worst. If I were a plaintiff's attorney, I would challenge this order of events.

A second category of banding approaches proposed by Cascio et al. (1995a) is referred to as a sliding band. In this approach a band is established as above. As the organization makes selections, however, the band changes. As soon as the top-scoring individual or individuals are selected, the band moves down a corresponding number of test score points; thereby, a band of constant width is maintained. If the motivation is to allow consideration of minority (or other) individuals both within the band and just below the original band, a system of top-down selection within a group within a band is recommended (Cascio et al., 1995a). This means that secondary predictors would be used until the supply of individuals possessing those secondary characteristics in the first band is depleted; then the top-scoring person or persons would be selected if this has not already been done. The band would then move down, allowing consideration of additional minority individuals or persons with whatever secondary characteristics are being considered. This sequence of events would be repeated with additional movement of the band until the desired number of people are selected.
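A procedural sketch of this sliding-band logic follows; the selection rule is a simplified reading of the strategy just described, and the scores, reliability, and secondary-criterion flags are invented for illustration.

    # Sketch of sliding-band selection: candidates within one band width of the
    # highest remaining score are treated as equivalent; within the band, candidates
    # meeting a secondary criterion are taken first (top-down), then the top scorer.
    # Scores, reliability, and secondary-criterion flags below are invented.
    import math

    def band_width(sd, reliability, z=1.65):
        sem = sd * math.sqrt(1.0 - reliability)      # standard error of measurement
        return z * sem * math.sqrt(2.0)              # z times the standard error of the difference

    def sliding_band_select(candidates, n_hires, width):
        """candidates: list of (name, score, meets_secondary_criterion)."""
        pool = sorted(candidates, key=lambda c: c[1], reverse=True)
        hired = []
        while pool and len(hired) < n_hires:
            top = pool[0][1]
            band = [c for c in pool if c[1] >= top - width]
            # Top-down within the band among those meeting the secondary criterion...
            preferred = [c for c in band if c[2]]
            pick = preferred[0] if preferred else pool[0]  # ...otherwise the top scorer.
            hired.append(pick)
            pool.remove(pick)                              # the band slides once the top score changes
        return hired

    width = band_width(sd=10, reliability=0.80)            # about 10.4 points, as computed earlier
    applicants = [("A", 98, False), ("B", 94, True), ("C", 91, False),
                  ("D", 89, True), ("E", 84, True), ("F", 82, False)]
    print([name for name, _, _ in sliding_band_select(applicants, n_hires=4, width=width)])
    # -> ['B', 'D', 'A', 'E']: the band slides downward only after the top scorer is selected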

If only race or sex is considered in isolation from other secondary characteristics when selections are made within the band, the sliding-band approach is equivalent to adding bonus points to the lower-scoring group equal to the size of the band (Sackett and Wilk, 1994). It is important to note, however, that banding advocates do not consider test score differences within a band meaningful; from their perspective, the term "bonus points" is therefore inappropriate. In a 1992 decision from the Ninth Circuit Court of Appeals (Officers for Justice v. Civil Service Commission of the City and County of San Francisco, 979 F.2d 721), this plan was found acceptable under the Civil Rights Act of 1991 as long as race was only one of several secondary criteria used to make selections within the band. An earlier plan to use race as the sole consideration on which to make selections within the band was not acceptable.

Putting aside the considerable professional debate about the logic and merits of banding, I do not believe this approach represents a workable or highly desirable long-term solution to the problem of subgroup differences in test scores. First, the efficacy of banding in producing increases in minority hiring is a complex function of at least the following variables: the size of subgroup differences on the test and any secondary predictors used, the proportion of minorities in the applicant pool, the selection ratio, the reliability of the test (hence the size of the band relative to the distribution as a whole), the confidence level chosen to set the band, and the intercorrelations among the tests and the other criteria used to make selections. The fact that so many variables determine the outcome means that the manner in which bands are established will almost always be a post hoc consideration of several alternatives and their impact on minority hiring. When, and if, the post hoc manipulation of these variables is explained to a court or jury, it may very likely be interpreted as deliberate tampering with test scores to achieve increases in minority hiring. The same might be said when any alternative test use strategy is explained or used, however. Second, in many situations the use of banding or sliding bands will not produce a large increase in the proportion of minorities hired (Sackett and Roth, 1991). Third, one very undesirable outcome might be a court's (or the public's) perception that this approach represents the means by which psychologists and statisticians are reintroducing within-group norming or something similar. The importance or salience of this latter concern depends on the degree to which demographic characteristics (e.g., race or gender) are used as secondary predictors.

One aspect of the debate on banding that I hope gains greater attention is the call for industrial and organizational psychologists to expand their notion of what constitutes criteria of effectiveness (Cascio et al., 1995b). This would almost certainly expand the domain of variables considered when one appraises the capability of a set of applicants to contribute to organizational effectiveness. As an example, public jurisdictions and private organizations that seek to serve a minority community have recognized the need to employ members of these communities because they are often more effective in providing that service. Use of a traditional paper-and-pencil test of job-related knowledge as a primary selection device in such situations seems almost farcical.

Summary and Conclusions

In considering the issues addressed in this paper, I asked myself what I would do at this time if I wanted to maximize the outcomes of a selection process in ways that would reflect societal values (at least as I interpret them) and organizational interests and that would maximize the participation of minority groups in our economy. The following are suggestions for reaching the stated goal:

Consider the performance criteria that one hopes will be maximized. A very broad consideration of the organization's goals and its role in the community at large should be part of this consideration, as should the varied roles an individual can play in helping the organization accomplish its goals. These considerations may mean that there are multiple nonredundant performance criteria. A candidate's predicted status on one of these criteria may be superior while the predicted status on other criteria may be appalling. The managerial problem then becomes reconciliation of these conflicting predictions, which may very well have implications for minority hiring.

Construct and use measures that reflect the broad range of abilities needed to accomplish these organizational objectives. One should not err by constructing measures simply because they are easy to administer, score, or interpret. An ability's job relevance should be the major concern, not the ease of evaluating or measuring it.

Pay attention to face validity as well as scientific validity. If examinees perceive the process as appropriate in light of what they expect to do when hired, there will be fewer legal problems, and it is more likely that the defense of such procedures will prevail in court. Improved perceptions of the fairness of the process may increase examinee motivation (especially for minorities), which may, in turn, affect test performance.

Continue research on alternative testing methods, technologies, and constructs to further our understanding of subgroup differences and to increase the probability that appropriate remedial actions can be taken. There are many points in this paper at which my conclusion was simply that there was a lack of information to answer a particular question.

Develop job-relevant, psychometrically adequate measures of past achievements and accomplishments. In doing so, pay special attention to the opportunities various groups have had to engage in activities that would result in these accomplishments.

Admit that there are substantial differences between minority and majority groups that transcend the particular test used to measure ability, primarily in the cognitive domain. Rather than continuing to focus attention on minimizing these measured differences, focus efforts on developing and supporting programs that will address the social, economic, and educational inequities that have produced these differences. Simultaneously, recognize that some tests, insofar as they constitute the primary gatekeepers, contribute to the perpetuation of these inequities.

References

Ashworth, S.D., and J.J. McHenry
1992 Development of a Computerized In-Basket to Measure Critical Job Skills. Unpublished paper presented at the fall meeting of the Personnel Testing Council of Southern California, Newport, CA.

Bond, L.
1995 Unintended consequences of performance assessments: Issues of bias and fairness. Educational Measurement: Issues and Practice 14:21-24.

Boyle, S.
1984 The effect of variations in answer-sheet format on aptitude performance. Journal of Occupational Psychology 57:323-326.

Bridgeman, B.
1992 A comparison of quantitative questions in open-ended and multiple-choice formats. Journal of Educational Measurement 29:253-271.

Cascio, W.F., I.L. Goldstein, J. Outtz, and S. Zedeck
1995a Statistical implications of six methods of test score use in personnel selection. Human Performance 8(3):133-164.
1995b Twenty issues and answers about sliding bands. Human Performance 8(3):227-242.

Chan, D., and N. Schmitt
1997 Video-based versus paper-and-pencil method of assessment in situational judgment tests: Differential adverse impact and examinee attitudes. Journal of Applied Psychology 82:143-159.

Chan, D., N. Schmitt, R.P. DeShon, C.S. Clause, and K. Delbridge
1997 Reactions to cognitive ability tests: Relationships between race, test performance, face validity, and test-taking motivation. Journal of Applied Psychology 82:300-310.

Cronbach, L.J.
1941 An experimental comparison of the multiple true-false and multiple-choice tests. Journal of Educational Psychology 32:533-543.

Drasgow, F., J.B. Olson, P.A. Keenan, P. Moberg, and A.D. Mead
1993 Computerized assessment. Pp. 163-206 in Research in Personnel and Human Resources Management, G.A. Ferris and K.M. Rowland, eds. Greenwich, CT: JAI Press.

Dunbar, S.B., D.M. Koretz, and H.D. Hoover
1991 Quality control in the development and use of performance assessment. Applied Measurement in Education 4:289-303.

Dyer, P.J., L.B. Desmarais, and K.R. Midkiff
1993 Multimedia Employment Testing in IBM: Preliminary Results from Employees. Unpublished paper presented at the annual conference of the Society for Industrial/Organizational Psychology, San Francisco.

Gearhart, M., and J.L. Herman
1995 Portfolio Assessment: Whose Work Is It? Issues in the Use of Classroom Assignments for Accountability. Technical report. Los Angeles: Center for Research on Evaluation, Standards, and Student Testing, University of California.

Green, B.F., C.R. Crone, and V.G. Folk
1989 A method for studying differential distractor functioning. Journal of Educational Measurement 26:147-160.

Guion, R.M.
1977 Content validity: The source of my discontent. Applied Psychological Measurement 1:1-10.

Hough, L.M.
1984 Development and evaluation of the "accomplishment record" method of selecting and promoting professionals. Journal of Applied Psychology 69:135-146.

OCR for page 215
--> HRStrategies 1995 Design, Validation and Implementation of the 1994 Police Officer Entrance Examination. Technical report. Nassau County, NY: HRStrategies. Koretz, D., D. McCaffrey, S. Klein, R. Bell, and B. Stecher 1992 The Reliability of Scores from the 1992 Vermont Portfolio Assessment Program: Interim Report. Los Angeles: RAND Institute on Education and Training. Kriska, S.D. 1995 Comments on banding. The Industrial-Organizational Psychologist 32:93-94. Linn, R.L., E.L. Baker, and S.B. Dunbar 1991 Complex, performance-based assessment: Expectations and validation criteria. Educational Researcher 20:15-21. McDaniel, M.A., F.L. Schmidt, and J.E. Hunter 1988 A meta-analysis of the validity of methods for rating training and experience in personnel selection. Personnel Psychology 41:282-314. McHenry, J.J., and N. Schmitt 1994 Multimedia testing. Pp. 193-232 in Personnel Selection and Classification , M.G. Rumsey, C.B. Walker, and J.H. Harris, eds. Hillsdale, NJ: Lawrence Erlbaum Associates. Mead, A.D., and F. Drasgow 1993 Equivalence of computerized and paper-and-pencil cognitive ability tests: A meta-analysis. Psychological Bulletin 114:449-458. Nystrand, M., A.S. Cohen, and N.M. Dowling 1993 Addressing reliability problems in the portfolio assessment of college writing. Educational Assessment 1:53-70. Outtz, J.L. 1994 Testing Medium, Validity, and Test Performance. Unpublished paper presented at the Conference on Evaluating Alternatives to Traditional Testing for Selection, Bowling Green State University, Bowling Green, OH. Pellegrino, J.W., and E.B. Hunt 1989 Computer-controlled assessment of static and dynamic spatial reasoning. Pp. 174-198 in Testing: Theoretical and Applied Perspectives , R.F. Dillon and J. W. Pellegrino, eds. New York: Praeger. Pulakos, E.D., and N. Schmitt 1996 An evaluation of two strategies for reducing adverse impact and their effects on criterion-related validity. Human Performance 9:241-259. Ryan, A.M., and G.J. Greguras 1994 Life is Not Multiple Choice: Reactions to the Alternatives. Unpublished paper presented at the Conference on Evaluating Alternatives to Traditional Testing for Selection, Bowling Green State University, Bowling Green, OH. Rynes, S.L., and M.L. Connerly 1993 Applicant reactions to alternative selection procedures. Journal of Business and Psychology 7:261-277. Sackett, P.R., and L. Roth 1991 A Monte Carlo examination of banding and rank order methods of test score use in selection. Human Performance 4:279-295. 1995 Multi-Stage Selection Strategies: A Monte Carlo Investigation of Effects on Performance and Minority Hiring. Unpublished manuscript, Industrial Relations Center, University of Minnesota, Minneapolis. Sackett, P.R., and S.L. Wilk 1994 Within-group norming and other forms of score adjustment in preemployment testing. American Psychologist 49:929-954.

OCR for page 215
--> Scheuneman, J.D. 1987 An experimental, exploratory study of causes of bias in test items. Journal of Educational Measurement 24:97-118. Schmidt, F.L., A.L. Greenthal, J.G. Berner, J.E. Hunter, and F.W. Seaton 1977 Job sample vs. paper-and-pencil trades and technical tests: Adverse impact and examinee attitudes. Personnel Psychology 30:187-198. Schmidt, F.L., J.R. Caplan, S.E. Bemis, R. Decuin, L. Dunn, and L. Antone 1979 The Behavioral Consistency Method of Unassembled Examining. Washington, DC: Office of Personnel Management. Schmitt, N., S.W. Gilliland, R.S. Landis, and D. Devine 1993 Computer-based testing applied to selection of secretarial applicants. Personnel Psychology 46:149-165. Schmitt, N., C.S. Clause, and E.D. Pulakos 1996 Subgroup differences associated with different measures of some common job-relevant constructs. Pp. 115-140 in International Review of Industrial and Organizational Psychology, Vol. 11, C.L. Cooper and I.T. Robertson, eds. New York: Wiley. Schmitt, N., W. Rogers, D. Chan, L. Sheppard, and D. Jennings 1997 Reducing adverse impact of cognitive ability tests by adding measures of other predictor constructs: The effects of number of predictors, predictor intercorrelations, validity, and level of subgroup differences. Journal of Applied Psychology (forthcoming). Schulman, L.S. 1988 A union of insufficiencies: Strategies for teacher assessment in a period of reform. Educational Leadership 46:36-41. Siskin, B.R. 1995 Relation between performance and banding. Human Performance 8:215-226. Smither, J.W., R.R. Reilly, R.E. Millsap, K. Pearlman, and R.W. Stoffey 1993 Applicant reactions to selection procedures. Personnel Psychology 46:49-76. Sproule, C.F. 1984 Should personnel selection tests be used on a pass-fail, grouping, or ranking basis? Public Personnel Management 13:375-394. Traub, R.E., and C.W. Fisher 1977 On the equivalence of constructed-response and multiple-choice tests. Applied Psychological Measurement 1:355-369. Wainer, H., N.J. Dorans, B.F. Green, R.J. Mislevy, L. Steinberg, and D. Thissen 1990 Future challenges. Pp. 233-272 in Computerized Adaptive Testing: A Primer. H. Wainer, ed. Hillsdale, NJ: Lawrence Erlbaum Associates. Ward, W.C. 1982 A comparison of free-response and multiple-choice forms of verbal aptitude tests. Applied Psychological Measurement 6:1-11. Ward, W.C., N. Fredericksen, and S.B. Carlson 1980 Construct validity of free-response and machine-scorable forms of a test. Journal of Educational Measurement 17:11-29. Whitney, D.J., and N. Schmitt 1997 The relationship between culture and responses to biodata and employment items. Journal of Applied Psychology 82:113-129. Wilson Learning 1992 Electronic Assessment of First-Line Supervisor: A Criterion-Related Validation Report. Longwood, FL: Wilson Learning. Zeidner, M. 1987 Essay versus multiple-choice type classroom exams: The student's perspective. Journal of Educational Research 80:352-358.