Performance Assessment for the Workplace

5
The Testing of Personnel

In an ideal world, empirical evaluation of job performance would involve measuring the performance of a large, representative group of job incumbents, all on the same day, at the same time, and under the same conditions. The ideal can seldom be attained, however, so some realistic accommodations are required. This chapter discusses issues in specifying the sample of personnel to be tested as well as the logistical and standardization problems that can arise in attempting to measure job performance, particularly hands-on performance. It also discusses steps that can be taken to obviate or minimize any adverse effects such problems may have on the quality of the data collected. As the JPM Project evolved, project scientists and committee members came to realize the magnitude of these problems when large-scale administration of hands-on tests is required. It is hoped that future researchers will profit from what the Services learned about sampling and standardization and from the solutions that emerged from conduct of the project. Examples from the JPM Project are used to illustrate some of the more important sampling, training, and logistical issues, although the concerns discussed are generic and apply to any attempt to measure performance on the job. Our discussion assumes that the job tasks to be used for hands-on testing have been selected and the procedures for testing and scoring have been developed—the sampling of jobs and job tasks is discussed in Chapter 4 and again in Chapter 7.
SAMPLING PERSONNEL

Specifying the Personnel to be Tested

Specifying the target population to which measures will be administered depends fundamentally on the planned uses of the performance data. If, for example, the goal is to determine the value of the ASVAB as a predictor of future performance, then the target population would be individuals who finished training in the relatively recent past (before years of experience mask the contributions of ability). If the purpose is to examine the relative effects of ability and experience, then one wants people at all stages of experience, and performance would be analyzed as a function of experience. If, however, the issue is an assessment of the present quality of performance in the occupational specialty, then a sample of job incumbents across the whole range of experience is needed, and the relation to experience is of secondary interest. In the JPM Project, a central concern was the linking of enlistment standards to job performance. A problem that plagues all validation research of this type is the practical necessity of being able to obtain criterion performance data only from job incumbents rather than from the entire pool of job applicants. In addition to the restriction in the range of ability that this condition imposes (for a full discussion, see Chapter 8), there is the further complication that experience gained on the job may affect estimates of the relationship between selection instruments and job performance in complex and unknown ways. In some situations, on-the-job experience might reduce the correlation between job performance and selection tests. This could happen, for example, if the supervisory policy is to expend disproportionate energy on slower-developing employees and to leave the better employees on their own.
In other situations, the correlation might increase if, because of the pressures of “getting the job done,” below-average employees are given unimportant or minor tasks and the better employees are always assigned the important or critical tasks. It is even conceivable that on-the-job experience could result in an underestimate of the relationship between predictor scores and job performance for one test in a battery and an overestimate of the relationship for another. No statistical adjustments can remedy this situation without making assumptions that generally will be untestable. The effects of experience gained on the job can be minimized, however, by carefully selecting the interval during which performance measures are administered. Testing should be delayed long enough for genuine and reliable individual differences in job performance to emerge, but not so long that supervisory practices and the passage of time distort relationships. Hence, in the JPM Project, job incumbents in their first term of enlistment who had at least six months of
service were designated as the target population. This cohort was expected to be reasonably well versed in the job and would not yet have been promoted to largely supervisory or management functions. In addition, the amount of time between entrance testing and the administration of the job performance measures (a maximum of 36 or 45 months, depending on the Service and the terms of enlistment) was not so long as to completely vitiate the predictive power of the test.

Selecting the Personnel to be Tested

Although researchers would always prefer to test all personnel in the target group who fit the population specifications, it is possible to obtain quite stable regression results from samples of 200 to 300 cases, and results from as few as 50 can provide useful indications of trends. When only a few cases are available, however, interpretation must be tentative, unless bolstered by replications, possibly in conjunction with Bayesian statistical methods (Rubin, 1980). When many more incumbents are available than can be tested economically, a representative sample of the total population can be used. A sample is representative of the population from which it is drawn if it reflects (within the limits of sampling error) the characteristics of the population. If the sample of incumbents from which performance information is obtained is not representative of the population of interest, any conclusions about the relationship between performance measures and selection tests would be subject to serious question. Drawing a representative sample does not mean selecting those people that some manager would like to have tested; almost always, that would mean testing those who are available. People are often available for reasons related to performance: they might be available because they are a crack group who finished some work assignment early, or they might be a group of poorly motivated personnel who were excluded from more advanced work.
Using such “available” people is likely to bias the results in one way or another. Rather, each incumbent should have an equal chance of being tested. Project researchers should be given the authority to select those individuals to be tested, or at least to specify the selection procedures in accordance with statistical methods of randomization. It is important to avoid sources of bias such as those introduced by selecting individuals recommended by a manager. Several methods are typically used to select a representative sample of incumbents. All such methods have the property that every person in the target population has the same probability of being included in the sample. The most straightforward and least used procedure is simple random sampling in which individuals are selected by some kind of lottery. For logistical purposes, the selected individuals can then be organized by
unit, and the units visited in some order. Even so, the obvious problem with this scheme is the strong likelihood that the selected individuals will be dispersed at many different sites, making testing difficult and expensive. A more advantageous regimen involves multistage sampling. Large organizations like the military are already divided into units. In multistage sampling, units are picked at random, with probabilities proportional to their size. Then individuals are selected at random from each of the selected units with probabilities inversely proportional to unit size. Such a procedure is especially economical in the military, in which the units are geographically dispersed. In organizations with several hierarchical levels, sampling can be done at each level. For example, in the Navy, with bases, squadrons, and ships within squadrons, bases can be selected at random, then squadrons within bases, and then ships within squadrons. If required, individuals can be sampled within each ship, but there are many advantages to testing all qualified people in the target population in the selected unit. If the planned data analysis involves comparisons of subgroups, some adjustments in the sampling plan may be needed to accommodate subgroups that represent a small fraction of the population. That is, if members of a given subgroup, such as women, blacks, or ethnic minority groups, are relatively scarce in the target population and are sampled at the overall rate, subgroup comparisons may be based on numbers of cases that are too small to be meaningful. If this outcome is anticipated, the subgroup can be oversampled—members of the subgroup in question are simply given a higher probability of being selected, and their data are weighted inversely by this differential probability in computing statistics for the total group.
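The two-stage scheme described above—units drawn with probability proportional to size, then a fixed number of individuals drawn at random within each selected unit, so that every individual has (approximately) the same overall chance of being tested—can be sketched in a few lines. This is an illustrative simplification, not the Services' actual procedure; the function and variable names are our own, and a sequential weighted draw only approximates strict probability-proportional-to-size sampling without replacement.

```python
import random

def two_stage_sample(units, n_units, per_unit, rng=random):
    """Two-stage sample of individuals from an organization.

    units: dict mapping unit name -> list of member ids.
    First stage: draw n_units units with probability proportional to
    unit size.  Second stage: draw per_unit members at random within
    each selected unit, making within-unit selection probability
    inversely proportional to unit size, which balances the first stage.
    """
    sizes = {name: len(members) for name, members in units.items()}
    pool = dict(units)
    chosen_units = []
    for _ in range(n_units):
        names = list(pool)
        weights = [sizes[name] for name in names]
        pick = rng.choices(names, weights=weights, k=1)[0]  # PPS draw
        chosen_units.append(pick)
        del pool[pick]  # without replacement (sequential approximation)
    sample = []
    for name in chosen_units:
        take = min(per_unit, sizes[name])  # small units yield all members
        sample.extend(rng.sample(units[name], take))
    return sample
```

With units of sizes 10, 20, 30, and 40, for example, drawing 2 units and 5 members per unit yields a sample of 10 with no duplicates, while concentrating the testing at only two sites.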
One of the advantages of random sampling is that uncontrolled variables are not likely to influence the outcome unduly, as they might when any subjective selection elements are permitted. But randomness works only on the average; any one sample can still be unrepresentative. It is always wise to record background information on all selected persons, including age, education, and any other similar data that might be relevant. A comparison of the population and sample values for these background variables can establish the extent to which the sample's characteristics match those found in the population. With random sampling, the group is not likely to be far off the population average, but it may diverge on some variables. Whenever other considerations force a departure from the purely statistical randomization procedures, such background comparisons are especially important. The Services varied considerably in the ease with which they could identify and select a representative sample of job incumbents. For the Navy, the sheer difficulty of obtaining access to a vessel for testing precluded any thought of preselecting a representative sample of, say, machinist's mates.
Availability sampling was the only possible option. In such a situation it is important to collect as much descriptive information on the sample as practical to determine the similarity of the sample to the population. The Army research team found that the central personnel locator could not be used to draw the sample. The largest of the Services by far, with soldiers scattered across the United States, Europe, and much of the rest of the world, some number of whom are in transit between duty stations at any given time, the Army could not maintain the level of currency in its central data system that was required for purposes of this project. As a consequence, the sample had to be selected at each base chosen for testing. The Marine Corps is a much smaller institution and has the bulk of its troops at a few locations, making for a more manageable task of tracking personnel. The researchers were able to design and select a stratified random sample prior to arrival at the testing sites and be assured that the Marines would be available for testing. For the infantry position of rifleman, for example, three rosters of potential examinees were prepared at Marine Corps Headquarters: a primary roster; an updated roster of recent graduates from the School of Infantry, from which the first replacements were to be drawn; and a supplementary roster. The sampling criteria used in creating the rosters were education level, time in service, and rank. Decision rules for selecting from the supplementary roster, should that become necessary, matched the supplementary candidates as closely as possible to the riflemen being replaced. The desired size and composition of the rifleman sample are described in Table 5-1. The problems of creating a representative sample were particularly difficult in those occupational specialties that were performed differently at different sites. The Navy radioman rating provided the most extreme example.
TABLE 5-1 Marine Corps Sampling Plan for the Infantry Occupational Field

                                                    Rank
Level of Education        Months of   Private-Private   Lance      Corporal   Total
                          Service     First Class       Corporal
Non–High School Graduate  1-48        50                200        50         300
High School Graduate      ≤9          150               150        —          300
High School Graduate      ≥10         100               50         250        400
Total number                          300               400        300        1000

Contextual and equipment differences between shore-based and
shipboard radio telecommunications systems were considerable. In addition, restrictions on the assignment of female radiomen produced large demographic differences between the two locations in which the job is performed.

THE IMPORTANCE OF STANDARDIZATION

Most research employing psychological tests begins with the postulate that observed differences among individuals (i.e., the test scores they obtain) are representative of true differences among those same individuals (i.e., how much of the attribute tested they really possess). In other words, if a sample of individuals completes a reasoning test and if there is variation evident in the resulting score distribution, it is assumed that this observed variation is the result of real variation in the capacity of these individuals to reason efficiently. The central hypothesis of most validation research is that differences in test scores are associated with differences in performance. Another way of stating it is that people who do better on the test will do better at the task in question. This generalization is equally applicable to aptitude tests used to predict performance and achievement tests used to measure performance.

Standardization and Prediction

A problem arises whenever the general testing postulate described above cannot be accepted. Unless we can assume that the observed score differences reflect true differences in the test takers' abilities, the inference that people who do better on the test will do better at the task makes no sense. As an example, consider an instance in which two samples of individuals take reasoning tests. In the first sample, the tests are administered in a noisy environment, full of distractions associated with a rock band practicing next door. In the second sample, the tests are administered in a quiet environment, free of these distractions.
In comparing the mean scores of the two samples, we note that the “noisy” sample performed less well than the “quiet” sample. Given the probable effect of noise on intellectual performance, it would be inappropriate to conclude that, on the average, individuals in the quiet sample have more reasoning ability than those in the noisy sample. In this instance, it would be fair to say that observed differences in reasoning scores are probably influenced by variables (e.g., noise) in addition to the reasoning abilities of the subjects. When the basic testing postulate does not hold, the effects are far-reaching. In such circumstances, any observed relationship between the test score in question and other test scores or the particular test score and observable behavior becomes difficult to interpret. Consider adding a second
test—a vocabulary test—to the reasoning test administered to the quiet and noisy samples. As was the case with reasoning performance, we note that the vocabulary performance of the quiet sample is better than that of the noisy sample. Ignoring the noisy/quiet conditions, we might be tempted to conclude that those who do well on reasoning tests also do well on vocabulary tests. But it might be more appropriate to conclude that environmental noise affects intellectual performance. In the example above, we have illustrated how an observed association between two test scores could be unrelated to attributes of the people taking the tests. An equally serious problem can be seen when certain conditions obscure a true relationship. Consider the quiet/noisy example once again. This time assume that we want to examine the relationship between reasoning test scores and the efficiency with which subjects make difficult judgments. As before, assume that some subjects take the reasoning test in quiet conditions and other subjects take the test under noisy conditions. Further assume that both groups of subjects are asked to make judgments in quiet conditions. Now we have an instance in which the reasoning test scores vary as a function of the environment in which the subject was tested but the judgment (performance) scores do not. In this instance, the test scores are influenced by a variable other than the underlying ability of the subject and, as a result, there will appear to be a lower relationship between the test score and performance than is actually the case.

Standardization and Performance Standards

Thus, it can be seen that when the basic testing postulate does not hold, it is possible to observe a spurious relationship or to miss a real one. There is one additional problem that might arise from distorted test scores.
In order to interpret test scores for individuals or groups, it is necessary to have some standard of comparison. In testing, standards of comparison are often called norms, and norms are developed by administering the test to a selected sample. The adequacy of these norms depends on the extent to which all members of the norm group or sample are exposed to identical conditions. To use our noise example, if we were constructing a reasoning test and wanted to develop norms to aid in test score interpretation, it would be unwise to allow variation in noise and quiet in our normative sample. Similarly, if the norms had been appropriately developed, they would have interpretive value only to the extent that a set of scores under consideration was gathered in conditions similar to those present when the norms were developed. The ASVAB norms are good examples of the value of standardization in testing. The fact that they have been carefully developed allows one to interpret the score of any single test taker compared with a sample of earlier
test takers. It allows us to infer that the person in question is above average on one ASVAB ability or below average on another. Furthermore, it permits us to determine if particular recruiting strategies are identifying candidates who are more capable than those produced using a different recruiting strategy. If the norms had been developed without standardization of conditions, or if the actual ASVAB testing was done under conditions that varied greatly from those present when the norms were developed, these inferences would not be possible. An associated problem arises when one attempts to link predictor test scores to performance standards. As an example, consider the implied relationship between ASVAB scores and job performance. It is assumed that it is reasonable to attempt to identify a minimum ASVAB score (or score profile) in terms of predicted performance in the Services. In other words, if there is a link between the abilities tested by the ASVAB and the abilities required to perform various military jobs, it should be possible to identify an ASVAB score (or score profile) below which successful performance is improbable. In order to identify such a score, it is critical that the performance scores be true scores, relatively unaffected by systematic distortions. If the performance norms are too high or too low because of extraneous influences in developing these norms, then the cutoff scores on the ASVAB will be similarly too high or too low. This is a calibration challenge that can be met only through standardization; it is the conceptual foundation of test standardization. In the JPM Project, the challenge is exaggerated by the fact that the performance tests of most interest are not standard paper-and-pencil, multiple-choice examinations but performance tests with various complex interactive mechanisms. In the sections that follow, we consider the standardization issue as it affects the hands-on tests.
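The attenuation argument made above—extraneous variance in test scores lowers the observed correlation with performance even when the underlying abilities are unchanged—can be illustrated with a small simulation. The variances below are arbitrary choices for illustration, not estimates from the JPM data.

```python
import random
from statistics import mean, pstdev

def corr(xs, ys):
    """Pearson correlation of two equal-length score lists."""
    mx, my = mean(xs), mean(ys)
    cov = mean((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / (pstdev(xs) * pstdev(ys))

random.seed(7)
n = 2000
ability = [random.gauss(0, 1) for _ in range(n)]          # true ability
performance = [a + random.gauss(0, 1) for a in ability]   # criterion scores
# Same underlying abilities, but the "noisy" administration adds extra
# score variance unrelated to ability (the rock band next door).
quiet_scores = [a + random.gauss(0, 0.5) for a in ability]
noisy_scores = [a + random.gauss(0, 0.5) + random.gauss(0, 1.5)
                for a in ability]
r_quiet = corr(quiet_scores, performance)
r_noisy = corr(noisy_scores, performance)
# r_noisy comes out well below r_quiet even though ability's relation
# to performance is identical in both groups.
```

Under these variance choices the quiet-condition correlation is roughly 0.63 and the noisy-condition correlation roughly 0.38: the unstandardized condition substantially understates the true test-performance relationship.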
STANDARDIZATION ISSUES IN HANDS-ON TESTING

Hands-on tests are work samples. It is therefore necessary to move to field environments for testing purposes. The typical hands-on test is administered to a single subject in a field setting by a trained administrator or scorer. By extension, this means that tests are administered in many settings by many different administrators. This, in turn, means that there is an opportunity for variance other than true score variance to enter into the hands-on test scores. To the extent that this unwanted variance is present in the test scores, inferences both about the capacity of the ASVAB to predict performance and about individual recruit performance predictions (i.e., performance standards) will be in error. There are three general classes of variables that are of particular concern in the use of hands-on tests. They include test administrators, administrative and scoring procedures, and physical testing conditions. Each of these classes of variables has the potential for distorting the hands-on test scores and any inferences that might be based on them. The issue is not only developing the “best” set of circumstances but also making sure that all people who take the hands-on tests experience the same circumstances.

Standardizing the Test Conditions

With hands-on performance testing, which requires one-on-one test administration, attention must be given to standardizing the interactions between administrator and test taker. Usually test developers provide a definite protocol to be followed, including instructions to be read for each task being tested and procedures to be used for verifying the execution of each step of task performance. Part of the test development includes preparing standard instructions, as well as standard responses to the most common questions raised by test takers. In effect, the test administration procedures attempt to reduce the test administrator to a completely neutral presence; the best administrators are those who follow the protocol, maintain a pleasant demeanor, and avoid giving verbal or facial cues. Some of the details of test development become critical only in the context of actual test administration and scoring. For example, all tests should have some kind of time limit, if for no other reason than to rescue the examinee who is so hopelessly lost as to be immobilized. Other details include clear specifications of what constitutes acceptable performance of each step, and how performance should be scored—go/no go versus rating on a continuum.
The difficulties of the actual testing situation, in which the administrator must simultaneously time the performance, observe and evaluate the performance, grade the observable units, and handle any questions in a nondirective manner, led the JPM Project planners to prefer a dichotomous scoring system because it involves simpler judgments. The locale of the testing must also be specified. If a task is to be performed outdoors at one site, it should be performed outdoors at all sites. There is a natural conflict between finding a quiet, relatively isolated place for the administration and having the task performed under realistic conditions. For example, some aircraft maintenance tasks are normally carried out on the flight line rather than in a hangar or shop, but conditions on the flight line make viewing the performance extremely difficult for examiners, and there are elements of risk to equipment and personnel on the flight line that are better controlled in the shop. The details of the JPM Project should be a source of valuable information about striking the best possible balance between the scientist's need for a controlled and replicable environment and the realities of measuring performance in a real-world environment. The schedule of the testing sessions and the arrangement of the various
testing stations are also candidates for standardization. Testing schedules must be set up so that trained administrators can do the testing. In the JPM Project, there were relatively few administrators and, since each test involved one test taker and at least one test administrator, testing had to be done sequentially. Moreover, in many cases, certain scarce equipment was required, and its use had to be carefully scheduled so as not to interfere with the unit's other scheduled activities. One inevitable consequence of having to test seriatim is that individuals tested later in the sequence may gain an advantage by learning about the test from those who had been through the process earlier. Efforts to minimize this type of information transmittal by exhortation did not seem particularly effective in the military project, but careful logistical preparations so that troops were not waiting together in groups for the next activity reduced on-site communications substantially. However, since the hands-on testing typically extended over days, if not weeks, at a given location, discussion among the test takers of test tasks had to be assumed and controlled for analytically. In the usual standardized test, all examinees perform each of the tasks to be evaluated; however, they are not required to perform the tasks in the same order. During early trials of JPM performance tests, it quickly became clear that imposing the same order on each test taker created large inefficiencies, with many individuals waiting at each station to begin testing and others waiting upon completing the final task. An important lesson in the logistics of mass performance testing gained from the project is that testing order should be counterbalanced across test takers to eliminate these inefficiencies as well as potential sources of performance variation due to the sequence in which the tasks are performed.
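The counterbalancing lesson above can be sketched as a cyclic (Latin-square) rotation: when the number of examinees in a session equals the number of stations, each examinee starts at a different station and all rotate together, so every station is occupied at every time slot and no one sits idle. This is an illustrative simplification with invented names, not the Services' actual scheduling procedure; note that a cyclic square balances starting position but not every possible ordering, so each station always follows the same predecessor.

```python
def rotation_schedule(stations, examinees):
    """Cyclic (Latin-square) rotation: examinee i starts at station
    i mod k and proceeds through the stations in wrap-around order.
    With k examinees per session, every station is occupied at every
    time slot and every examinee performs every station exactly once."""
    k = len(stations)
    return {person: [stations[(i + t) % k] for t in range(k)]
            for i, person in enumerate(examinees)}
```

Printing each examinee's row of the returned schedule yields exactly the kind of per-person station card that, as described below for the Marine Corps, kept every examinee and administrator busy throughout the day.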
In the JPM Project, the Marine Corps hands-on test administration, in large part because it came last and benefited from the experiences of the other Services, was the most carefully controlled. For the infantry specialty, each of the testing stations involved approximately the same amount of time (about 30 minutes), and a daily schedule was set up in which the order of the testing stations was carefully counterbalanced across examinees. Moreover, each examinee had a printed schedule showing which station to do at which time. This kept every examinee and every test administrator busy throughout the day. Thus, examinees did not have free time while waiting their turn in which to discuss their experiences with each other. Finally, the several testing stations were carefully isolated and/or insulated from each other so that examinees waiting to be tested could not view the test as it was being administered. This degree of control was possible because the testing site was physically separate from other activities on the base. Some of the testing locations in the JPM Project did not have sufficient space to permit the necessary degree of isolation (on shipboard that was simply impossible to achieve). As a result, individuals not being tested were undergoing training and/or performing their normal activities in close proximity to those who were. Clearly, such external activity has the possibility of adding to the error variance in the test performance. In particular, the comments of informal observers about a colleague's performance are likely to have a deleterious influence on the test taker's performance. Not all threats to standardization are amenable to control by researchers. Hands-on tasks for many jobs must be performed outdoors, prey to weather and other differences in local environment. At one European site at which Army researchers were testing tank crewmen, the base was suddenly put on alert. Although testing proceeded, the formerly quiet and tranquil test site was now filled with equipment and soldiers awaiting further orders. While the hectic and tense atmosphere created by the alert no doubt provided an element of authenticity to the hands-on testing, it also introduced unaccountable variability to the testing environment that made it hard to interpret the scores of the soldiers tested on that occasion.

Selecting and Training the Test Administrators

Individualized performance measurement requires trained administrators to observe and score the performance. These administrators need to understand the job fully, or at the very least to understand the tasks being tested; in the absence of careful selection and adequate training, the test administrators themselves could become sources of error variance. In the JPM Project, administrators were either past supervisors of the job under study or of related jobs, or they were incumbents in supervisory positions. Navy, Air Force, and Marine Corps researchers hired former servicemen who had served as supervisors in the relevant occupations.
The Army researchers, in some instances, used supervisors from the same bases as the individuals being tested; however, individuals were never tested by their own supervisors. Training of the administrators was necessary to ensure not only that they were knowledgeable about the tasks being tested, but also that they had a common understanding of the behavior that constituted correct performance of each step of the task and of how these various behaviors were to be rated. Training videos were very useful in showing the correct and incorrect ways to do the task. Use of the videos, coupled with group discussion, provided a common basis for scoring the performances. Administrators also took turns role playing as examinees to provide experience for each other. There are many ways to provide effective training; it is important to understand that a considerable amount of such training is always necessary to ensure standardization of test administration and scoring.
One problem that arises when administrators are not accustomed to laboratory studies is their tendency to help the examinees and even act as teachers. One of the roles of a supervisor, especially in the armed forces but elsewhere as well, is to provide feedback to correct and improve the work of those under their supervision. On-the-job training is more or less continuous for new workers. By contrast, the test administrator is supposed to take an impassive, nonreactive role. Not only is this mode extremely difficult for those who have spent their working lives as first-line supervisors, but they may not even see its importance. Certainly, there were some associated with the military project to whom it seemed more important that a serviceman learn the job than that a true study result be obtained. Such administrators had to be weeded out. Even when the administrators understand that they are not to train, it is often impossible for them to forgo the opportunity completely. Avoiding inappropriate examiner interaction with the test taker requires constant vigilance.

Calibrating the Test Administration

One-on-one testing shares some characteristics with grading test papers, such as essays written by students. When this is done professionally, as by a testing organization like the Educational Testing Service, not only is there considerable training and group discussion about standards but also, from time to time, sets of papers are rescored and careful records are kept of the scores given by each scorer, so that agreed-on standards can be maintained. Ideally, similar quality control should be used in collecting job performance data: daily records can be obtained on the scoring patterns of each administrator; periodically a second administrator or shadow scorer can watch the same performance and provide independent scoring; videotapes of the test session can be obtained for later checking.
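The daily record-keeping described above lends itself to a simple screening computation: accumulate each administrator's go/no-go scores for the day, compare each administrator's pass rate with the group's, and flag anyone far out of line for review. The function and threshold below are illustrative assumptions, not the JPM Project's actual system; with only a handful of administrators a fixed z-threshold is crude, and the point is the daily feedback loop rather than the particular statistic.

```python
from statistics import mean, pstdev

def flag_outlier_scorers(daily_scores, z_threshold=1.5):
    """daily_scores: dict mapping administrator -> list of go/no-go
    scores (1 = go, 0 = no go) recorded that day.  Returns the
    administrators whose pass rate deviates from the group mean by
    more than z_threshold standard deviations of the pass rates."""
    rates = {admin: mean(scores) for admin, scores in daily_scores.items()}
    overall = mean(rates.values())
    spread = pstdev(rates.values())
    if spread == 0:  # all administrators score identically; nothing to flag
        return []
    return [admin for admin, rate in rates.items()
            if abs(rate - overall) / spread > z_threshold]
```

An administrator flagged by such a report is a candidate for the corrective measures the text describes: group discussion of task performance, role playing, and rescoring of recorded sessions.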
When quality control procedures are applied daily, it is possible to correct situations that might otherwise get out of control—e.g., with different administrators using different standards. In the JPM Project the Marine Corps developed the most elaborate quality control system. Part of the success of its system came from the fact that just two teams of test administrators were used for data collections, one for the East Coast and one for the West Coast. For the infantry specialty, each team trained together for two weeks and then spent six months administering the hands-on tests. As a consequence, the scoring teams became experienced and developed a professional ethos. In addition, despite many complications, a computerized data analysis system was set up at the test sites. Each day, the scoring data were keyed in and reports were generated so that the test site manager could assess scoring trends for each administrator. In cases in which an administrator was clearly out of line—either more lenient
or more stringent than the others—corrective action could be taken via group discussions of task performance, role playing, and so forth.

CONCLUSION

Because of the massive size of the JPM Project and the diversity of the jobs being tested, prodigious efforts were needed to surmount logistical and standardization obstacles. The problems encountered in field testing by the Services, as well as their approaches to solutions, provide a wealth of insight for others concerned with these issues. And because standardization of hands-on performance measurement presents much greater challenges than traditional written tests, this topic is worthy of far more attention than it is usually given in setting up data collection plans.