Procedures for Eliciting and Using Judgments of the Value of Observed Behaviors on Military Job Performance Tests

Richard M. Jaeger and Sallie Keller-McNulty

THE PROBLEMS ADDRESSED

As part of a Joint-Service job performance measurement project, each Service is developing a series of standardized hands-on job performance tests. These tests are intended to measure the “manifest, observable job behaviors” (Committee on the Performance of Military Personnel, 1984:5) of first-term enlistees in selected military occupational specialties. Once the tests have been constructed and refined, they will be examined for use as criteria for validating the Armed Services Vocational Aptitude Battery (ASVAB), or its successor instruments, as devices for classifying military enlistees into various service schools and military occupational specialties.

Three problems are addressed in this paper. The first concerns the development of standards of minimally acceptable performance on the newly developed criterion tests. Such standards could be used to discriminate between enlistees who would not be expected to exhibit satisfactory (or, perhaps, cost-beneficial) on-the-job performance in a military occupational specialty and those who would be expected to exhibit such performance.

The second problem concerns methods for eliciting and characterizing judgments on the relative value or worth of enlistees' test performances that are judged to be above the minima deemed necessary for admission to one or more military occupational specialties. Practical interest in this problem derives from the need to classify enlistees into military occupational specialties in a way that maximizes their value to the Service while satisfying the enlistees' own requirements and interests.

The third problem concerns the use of enlistees' behaviors on the hands-on tests, and judgments of their value, in the classification of enlistees among military occupational specialties. As was true of the second problem, interest in this problem reflects the need to assign enlistees to military occupational specialties in a way that satisfies the needs of the Services and the enlistees. In a scarce-resource environment, it is essential that the classification problem be solved in a way that maximizes the value of available personnel to the Services while maintaining the attractiveness of the Services at a level that will not diminish the pool of available enlistees.

The three problems considered in this paper are not treated at the same level of detail. Since there is an extensive methodological and empirical literature on judgmental procedures for setting standards on tests, we have addressed this topic in considerable detail. There is little research that supports methodological recommendations on assigning relative value or worth to various levels of test performance. Therefore, our treatment of this problem is comparatively brief. Finally, our discussion of the problem of assigning enlistees to the military occupational specialties should be viewed as illustrative rather than definitive. This problem is logically related to the first two, but is of such complexity that complete development is beyond the scope of this paper.

Establishing Test Standards

To fulfill the requirements of a military occupational specialty, an enlistee must be capable of performing dozens, if not hundreds, of discrete and diverse tasks. Indeed, each Service has conducted extensive analyses of the task requirements of each of its jobs (Morsch et al., 1961; Goody, 1976; Raimsey-Klee, 1981; Burtch et al., 1982; U.S. Army Research Institute for the Behavioral and Social Sciences, 1984) that have produced convincing evidence of the complexity of the various military occupational specialties and the need to describe military occupational specialties in terms of disjoint clusters of tasks. Even when attention is restricted to the job proficiencies expected of personnel at the initial level of skill defined for a military occupational specialty, the military occupational specialty might be defined by several hundred tasks that can reasonably be allocated to anywhere from 2 to 25 or more disjoint clusters (U.S. Army Research Institute for the Behavioral and Social Sciences, 1984:12-19).

In view of the complexity of military occupational specialties, it is unlikely that the performance of an enlistee on the tasks that compose a military occupational specialty could validly be characterized by a single test score.

In their initial development of performance tests, the service branches have acknowledged this reality by (1) defining clusters of military occupational specialty tasks; (2) identifying samples of tasks that purportedly represent the population of tasks that compose a military occupational specialty; and (3) specifying sets of measurable behaviors that can be used to assess enlistees' proficiencies in performing the sampled tasks. The problem of defining minimally acceptable performance in a military occupational specialty must therefore be addressed by defining minimally acceptable performance on each of the clusters of tasks that compose the military occupational specialty. Methods for defining standards of performance on task clusters thus provide one major focus of this paper.

Eliciting and Combining Judgments of the Worth of Job Performance Test Behaviors

Scores on the job performance tests that are currently under development are to be used as criterion values in the development of algorithms for assigning new enlistees to various military occupational specialties. Were it possible to develop singular, equivalently scaled, equivalently valued measures that characterized the performance of an enlistee in each military occupational specialty, optimal classification of enlistees among military occupational specialties would be a theoretically simple problem.

In reality, the problem is complicated by several factors. First, as discussed above, the tasks that compose a military occupational specialty are not unidimensional. Second, even tests that assessed enlistees' performances on task clusters with perfect precision and validity would not be inherently equivalent. Third, the worth or value associated with an equivalent level of performance on tests that assessed proficiency in two different task clusters would likely differ across those clusters. Fourth, the worth or value associated with a given proficiency level in a single task cluster would likely differ, depending on the military occupational specialty in which the task cluster was embedded.

To address these issues, the problem of establishing functions and eliciting judgments that assign value to levels of proficiency in various military occupational specialties (hereafter called “value functions”) must be examined at the level of the individual tasks and at the level of the task clusters. In this regard, two of the major problems considered in this paper are equivalent.

To develop value functions for military occupational specialties, several component problems must be addressed. First, the task clusters defined by job analysts for each military occupational specialty must be accepted or revised. Second, value functions associated with performances on tasks sampled from task clusters must be defined.

Third, operational procedures for eliciting judgments of the values of various levels of performance on tasks sampled from task clusters must be developed. Fourth, methods for weighting and aggregating value assignments across sampled tasks, so as to determine a value assignment for a profile of performances on the tasks that are sampled from a military occupational specialty, must be developed. Related issues that must be considered include the comparability of value assignments across tasks within a military occupational specialty, as well as the scale equivalence of value assignments to levels of performance in different military occupational specialties.

Using Predicted Test Performances and Value Judgments in Personnel Classification

Assuming it is possible to predict enlistees' performances on military job performance tests from the ASVAB or other predictor batteries, and assuming that judgments of the values of these predicted performances can be elicited and combined to produce summary scores for military occupational specialties, there remains the problem of using these summaries in classifying enlistees among military occupational specialties. This problem can be addressed in several ways, depending on one's desire to consider as primary the interests of individual enlistees and/or the Services, and the types of decision scenarios envisioned.

If it were desired to satisfy the interests of individual enlistees with little regard for the needs of or costs to the military, predicted performances in various military occupational specialties would be used solely for guidance purposes. The only value functions that would be pertinent would be those of the individual enlistee. Enlistees would be assigned to the military occupational specialties they most desired, after having been informed of their likely chances of success in each.

If the interests of the military were viewed as primary, the best classification strategy would depend on the decision scenarios envisioned and the decision components to be taken into account. In a scenario in which each enlistee was to be classified individually, based on his/her predicted military occupational specialty job performances and the set of available military jobs at the time of his/her classification, the obvious classification choice would be the one that carried maximum value. In a scenario in which enlistees were to be classified as a group (e.g., the group of enlistees who completed the ASVAB during a given week), the predicted job performances of all members of the group, and the values associated with those predictions, could be taken into account, in addition to the average values associated with the performances of personnel currently assigned to military occupational specialties with jobs available at the time of classification.

These alternatives are considered in a discussion of the problem of using enlistees' predicted scores on job performance tests in classifying enlistees among military occupational specialties. A specific mathematical programming model for the third alternative is developed and illustrated.

ESTABLISHING MINIMUM STANDARDS OF PERFORMANCE

One of the two major problems considered in this paper is the establishment of standards of performance that define minimally acceptable levels of response on the new criterion tests that are under development by the Services. In addressing this problem, we first discuss the consequential issues associated with standard setting. We next describe the most widely used standard-setting methods that have been proposed for use with educational achievement tests. In the third section, we consider the prospects for applying these methods to the problem of setting standards on military job performance tests. Finally, we examine a variety of operational questions that arise in the application of any standard-setting procedure, such as the types and numbers of persons from whom judgments on appropriate standards are sought, the form in which judgments are sought, and the information provided to those from whom judgments are sought.

Rather than recommending the one “best” standard-setting procedure, it is our intent to illuminate the alternatives that have been applied elsewhere, to bring forth the principal considerations that affect their applicability in the military setting, and to bring to light the major operational issues that must be addressed in using any practical standard-setting procedure.

Consequences of Setting Standards

There are no objective procedures for setting test standards. It is necessary to rely on human judgment. Since judgments are fallible, it is important to consider the consequences of setting standards that are unnecessarily high or low. If an unnecessarily high standard is established, examinees whose competence is acceptable will be failed. Errors of this kind are termed false-negative errors. If the standard established is lower than necessary, examinees whose competence is unacceptable will be passed. Errors of this kind are termed false-positive errors. Both individuals and society are placed at risk by these kinds of errors.

When tests are used for selection—that is, for determining who is admitted to an educational program or an employment situation—society or institutions bear the primary effects of false-positive errors. The effects of false-negative errors are borne primarily by individuals when applicant pools greatly exceed institutional needs. However, limitations in the pool of personnel available for military service increase the institutional consequences of making false-negative errors. Adequate military staffing depends on the availability of personnel for a variety of military occupational specialties. Since the military now relies on an all-volunteer force, it is particularly vulnerable to erroneous exclusion of qualified personnel.

When tests are used for purposes of classification—that is, for allocating examinees among alternative educational programs or jobs—the effects of false-positive and false-negative errors are shared by institutions and individuals. When false-positive errors are made, individuals are assigned to programs or jobs that are beyond their levels of competence. This results in less-than-optimal utilization of personnel and the possibility of costly damage for institutions. It also results in psychological and physical hazards for individuals. When false-negative errors are made, individuals are not assigned to programs or jobs for which they are competent. Although this is unlikely to result in physical damage to individuals or institutions, it does produce less-than-optimal use of personnel by institutions and the risk of psychological distress for individuals.

In the military context, the risk to human life and the national security associated with false-positive classification errors is particularly great. Although they might cause psychological distress, false-negative classification errors are unlikely to be life-threatening for individuals. But the Services compete with the civilian sector for qualified personnel. Therefore, the military consequences of false-negative classification errors are likely to be severe for military occupational specialties that require personnel with rare skills and abilities.

Conventional Standard-Setting Procedures

The number of procedures that have been proposed for setting standards on pencil-and-paper tests has been estimated as somewhere between 18 (Hambleton and Eignor, 1980) and 31 (Berk, 1985). The difference between these figures has more to do with the authors' criteria for identifying methods as “different” than with substantively new developments during the years 1980 to 1985. These same authors, as well as others (Meskauskas, 1976; Berk, 1980; Hambleton, 1980), have proposed a variety of schemes for classifying standard-setting procedures. Since this review of standard-setting procedures will be restricted to those that have been widely used and/or hold promise for use in establishing standards on military job performance tests, a simple, two-category classification method will be used. Procedures that require judgments about test items will be described apart from procedures that require judgments about the competence of examinees.

Procedures That Require Judgments About Test Items

Many of the procedures used for setting standards on achievement tests are based on judgments about the characteristics of dichotomously scored test items and examinees' likely performances on those items. Both the types of judgments required and the methods through which judgments are elicited differ across procedures. The most widely used procedures of this type are reviewed in this section.

The Nedelsky Procedure. This standard-setting procedure is, perhaps, of historical interest since it is the oldest procedure in the modern literature on standard setting that still enjoys widespread use. It was proposed by Nedelsky in 1954, and is only applicable to tests composed of multiple-choice items. The first step in the procedure is to define a population of judges and to select a representative sample from that population. Judges who use the procedure must conceptualize a “minimally competent examinee” and then predict the behavior of this minimally competent examinee on each option of each multiple-choice test item. Because of the nature of the judgment task, it is essential that judges be knowledgeable about the proficiencies of the examinee population, the requirements of the job for which examinees are being selected, and the difficulties of the test items being judged.

For each item on the test, each judge is asked to predict the number of response options a minimally competent examinee could eliminate as being clearly incorrect. A statistic termed by Nedelsky the “minimum pass level” (MPL) is then computed for each item. The MPL for an item is equal to the reciprocal of the number of response options remaining, following elimination of the options that could be identified as incorrect by a minimally competent examinee. The test standard based on the predictions of a single judge is computed as the sum of the MPL values produced by that judge for all items on the test. An initial test standard is computed by averaging the summed MPL values produced by the predictions of each of a sample of judges.

Nedelsky (1954) recommended that this initial test standard be adjusted to control the probability that an examinee whose true performance was just equal to the initial test standard could be classified as incompetent due solely to measurement error in the testing process. The adjustment procedure recommended by Nedelsky depends on the assumption that the standard deviation of the test standards derived from the predictions of a sample of judges is equal to the standard error of measurement of the test. If the assumption were correct, and if the distribution of measurement errors on the test were normal, the probability of failing an examinee with true ability just equal to the initial recommended test standard could be reduced to any desired value. For example, reducing the initial test standard by one standard deviation of the distribution of summed MPL values would ensure that no more than 16 percent of examinees with true ability just equal to the initial recommended test standard would fail. Reducing the initial recommended test standard by two standard deviations would reduce this probability to about 2 percent.
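
The arithmetic just described is easy to express directly. The sketch below is a minimal illustration, not part of Nedelsky's published procedure: the option counts, the judges' elimination predictions, and the variable names are all hypothetical, and the adjustment step simply subtracts a chosen multiple of the standard deviation of the judges' summed MPL values, as described above.

```python
# Illustrative sketch of the Nedelsky (1954) computation. All data below are
# hypothetical; a real application would use judges' actual predictions for
# every item on the test.

from statistics import mean, stdev

options_per_item = [4, 4, 4, 5]          # number of response options on each item
judge_eliminations = [                   # options a minimally competent examinee
    [2, 1, 3, 2],                        # could eliminate, one row per judge
    [1, 2, 3, 3],
    [2, 2, 2, 2],
]

def judge_standard(eliminated):
    """Sum of minimum pass levels (MPLs) for one judge."""
    return sum(1.0 / (n_opts - n_elim)
               for n_opts, n_elim in zip(options_per_item, eliminated))

standards = [judge_standard(j) for j in judge_eliminations]
initial_standard = mean(standards)

# Nedelsky's adjustment: lower the standard by a chosen number of standard
# deviations of the judges' summed MPLs to limit the chance of failing an
# examinee whose true ability equals the initial standard.
k = 1.0                                   # one SD corresponds to roughly 16 percent
adjusted_standard = initial_standard - k * stdev(standards)

print(initial_standard, adjusted_standard)
```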

The initial recommended test standard produced by Nedelsky's procedure derives from the assumption that examinees will make random choices among the item options that they cannot eliminate as being clearly incorrect. Examinees are assumed to have no partial information or to be uninfluenced by partial information when making their choices among remaining options. If these assumptions were correct, and if judges were able to correctly predict the average number of options a minimally competent examinee could eliminate as being clearly incorrect, the initial test standard resulting from the Nedelsky procedure would be an unbiased estimate of the mean test score that would be earned by minimally competent examinees. However, Poggio et al. (1981) report that, when Nedelsky's procedure was applied to pencil-and-paper achievement tests in a public school setting, school personnel were unable to make consistent judgments of the type required to satisfy the assumptions of the procedure.

The Angoff Procedure. Although he attributes the procedure to Ledyard Tucker (Livingston and Zieky, 1983), William Angoff's name is associated with a standard-setting method that he described in 1971. The procedure requires that each of a sample of judges consider each item on a test and estimate (1971:515):

the probability that the “minimally acceptable” person would answer each item correctly. In effect, the judges would think of a number of minimally acceptable persons, instead of only one such person, and would estimate the proportion of minimally acceptable persons who would answer each item correctly. The sum of these probabilities, or proportions, would then represent the minimally acceptable score.

As was true of Nedelsky's procedure, the first step in using Angoff's procedure is to identify an appropriate population of judges and then to select a representative sample from this population. Judges are then asked to conceptualize a minimally competent examinee. Livingston and Zieky (1982) suggest that judges be helped to define minimal competence by having them review the domain that the test is to assess and then take part in a discussion on what constitutes “borderline knowledge and skills.” If judges can agree on a level of performance that distinguishes between examinees who are competent and those who are not, Zieky and Livingston recommend that the definition of that performance be recorded, together with examples of performance that are judged to be above, and below, the standard. Using as an example a test that was designed to assess the reading comprehension of high school students, Zieky and Livingston suggest that judges be asked to reach agreement on whether a minimally competent student must be able to “find specific information in a newspaper article, distinguish statements of fact from statements of opinion, recognize the main idea of a paragraph,” and so on.

To be useful in characterizing a minimally competent examinee, the behaviors used to distinguish between those who are competent and those who are not should represent the domain of behavior assessed by the test for which a standard is desired.

The judgments required by Angoff's procedure are as follows: Each judge, working independently, considers the items on a test individually and predicts for each item the probability that a minimally competent examinee would be able to answer the test item correctly. The sum of the probabilities predicted by a judge becomes that judge's recommended test standard and, if the predictions were correct, would equal the total score on the examination that would be earned by a minimally competent examinee. The average of the recommended test standards produced by the entire sample of judges is the test standard that results from Angoff's procedure. If for each item on the test the average of the probabilities predicted by the sample of judges was correct, the test standard produced by Angoff's procedure would equal the mean score earned by a population of minimally competent examinees. In any case, the result of Angoff's procedure can be viewed as a subjective estimate of that mean.

Angoff's procedure has been modified in several ways, so as to make it easier to use and/or to increase the reliability of its results. One modification involves use of a fixed scale of probability values from which judges select their predictions. This technique allows judges' predictions to be processed by an optical mark-sense reader for direct entry to a computer, thus saving a coding step and reducing the possibility of clerical errors. Educational Testing Service used an asymmetric scale of probabilities when setting standards on the subtests of the National Teacher Examinations (NTE). Livingston and Zieky (1982:25) objected to the use of an asymmetric scale, since they felt it might bias judges' predictions. Cross et al. (1984) used a symmetric scale of 10 probability values that covered the full range from zero to one, thus overcoming Livingston and Zieky's objections.

Other modifications of Angoff's procedure include the use of iterative processes through which judges are given an opportunity to discuss their initial predictions and then to reconsider those predictions. Cross et al. (1984) investigated the effects of such a process coupled with the use of normative data on examinees' actual test performances. They found that judges recommended a lower test standard at the end of a second judgment session than at the end of an initial session. These results were not entirely consistent with findings of Jaeger and Busch (1984) in a study of standards set for the National Teacher Examinations. They found that mean recommended standards were lower at the end of a second judgment session than at the end of an initial session for four out of eight subtests of the NTE Core Battery; they found just the reverse for the other four subtests. However, the variability of recommended test standards was consistently reduced by using an iterative judgment process. The resulting increase in the stability of mean recommended test standards suggests that use of an iterative judgment process with Angoff's procedure is advantageous.
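
A minimal sketch of the basic Angoff computation, together with a second judgment round of the kind used in the iterative variants discussed above, might look as follows. The probability matrices and both rounds of data are hypothetical illustrations; the procedure itself prescribes only the summing and averaging.

```python
# Illustrative sketch of Angoff's (1971) procedure. Each row holds one judge's
# predicted probabilities that a minimally competent examinee answers each
# item correctly. The numbers are hypothetical.

from statistics import mean

round_1 = [
    [0.6, 0.8, 0.4, 0.9, 0.7],
    [0.5, 0.7, 0.5, 0.8, 0.6],
    [0.7, 0.9, 0.3, 0.9, 0.8],
]

def test_standard(judgments):
    """Average, over judges, of each judge's summed item probabilities."""
    return mean(sum(judge) for judge in judgments)

standard_round_1 = test_standard(round_1)

# In an iterative variant, judges review normative data and one another's
# recommendations, then revise their predictions; the final standard is
# computed from the revised judgments in exactly the same way.
round_2 = [
    [0.6, 0.8, 0.5, 0.9, 0.7],
    [0.6, 0.7, 0.5, 0.8, 0.7],
    [0.6, 0.8, 0.4, 0.9, 0.7],
]
standard_round_2 = test_standard(round_2)

print(standard_round_1, standard_round_2)
```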

The Ebel Procedure. The Ebel (1972:492-494) standard-setting procedure also begins by defining a population of judges and selecting a representative sample from that population. After conceptualizing a “minimally competent” examinee, judges must complete three tasks.

First, judges must construct a two-dimensional taxonomy of the items in a test, one dimension being defined by the “difficulty” of the test items and the other being defined by the “relevance” of the items. Ebel suggested using three levels of difficulty, which he labeled “easy,” “medium,” and “hard.” He suggested that four levels of item relevance be labeled “essential,” “important,” “acceptable,” and “questionable.” However, the procedure does not depend on the use of these specific categories or labels. The numbers of dimensions and categories could be changed without altering the basic method.

The judges' second task is to allocate each of the items on the test to one of the cells created by the two-dimensional taxonomy constructed in the first step. For example, Item 1 might be judged to be of “medium” difficulty and to be “important”; Item 2 might be judged to be “easy” and of “questionable” relevance, etc.

The judges' final task is to answer the following question for each category of test items (Livingston and Zieky, 1982:25): If a borderline test-taker had to answer a large number of questions like these, what percentage would he or she answer correctly?

When a test standard is computed using Ebel's method, a judge's recommended percentage for a cell of the taxonomy is multiplied by the number of test items the judge allocated to that cell. These products are then summed across all cells of the taxonomy to produce a recommended test standard for that judge. As in the procedures described earlier, the recommendations of all sampled judges are averaged to produce a final recommended test standard.
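
The Ebel computation can be sketched the same way for a single judge. The difficulty-by-relevance cells, the item allocations, and the percentages below are hypothetical, and the sum is expressed here as an expected number of items answered correctly by a borderline examinee.

```python
# Illustrative sketch of the Ebel (1972) computation for a single judge.
# Cell labels, item counts, and percentages are hypothetical.

# Number of items this judge allocated to each (difficulty, relevance) cell.
items_in_cell = {
    ("easy", "essential"): 10,
    ("medium", "essential"): 8,
    ("medium", "important"): 12,
    ("hard", "acceptable"): 5,
}

# The judge's answer, for each cell, to the question: what percentage of
# questions like these would a borderline test-taker answer correctly?
expected_percent_correct = {
    ("easy", "essential"): 90,
    ("medium", "essential"): 70,
    ("medium", "important"): 60,
    ("hard", "acceptable"): 40,
}

# Multiply each cell's percentage (converted to a proportion) by the number
# of items allocated to the cell, then sum across cells.
judge_standard = sum(
    items_in_cell[cell] * expected_percent_correct[cell] / 100.0
    for cell in items_in_cell
)

# The final recommended standard is the mean of judge_standard across all
# sampled judges (only one judge is shown here).
print(judge_standard)
```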

The Jaeger Procedure. This procedure was developed for use in setting a standard on a high school competency test (Jaeger, 1978, 1982), but can be adapted to any testing situation where a licensing, certification, or selection decision is based on an examinee's test performance (Cross et al., 1984). One or more populations of judges must be specified, and representative samples must be selected from each population. As in the procedures described above, judges are asked to render judgments about test items. More specifically, judges are asked to answer the following question for each item on the test for which a standard is desired: Should every examinee in the population of those who receive favorable action on the decision that underlies use of the test (e.g., every enlistee who is admitted to the military occupational specialty) be able to answer the test item correctly? Notice that this question does not require judges to conceptualize a “minimally competent” examinee.

An initial standard for a judge is computed by counting the number of items for which that judge responded “yes” to the question stated above. An initial test standard is established by computing the median of the standards recommended by each sampled judge.

Jaeger's procedure is iterative by design. Judges are afforded several opportunities to reconsider their initial recommendations in light of data on the actual test performances of examinees and the recommendations of their fellow judges. In its original application, judges were first asked to provide “yes/no” recommendations on each test item on a 120-item reading comprehension test. The judges were then given data on the proportion of examinees who had actually answered each test item correctly in the most recent administration of the test, in addition to the distribution of test standards recommended by their fellow judges. Following a review of these data, judges were asked to reconsider their initial recommendations and once again answer, for each item, the question of whether every “successful” examinee should be able to answer the test item correctly. These answers were used to compute a new set of recommended test standards in preparation for a final judgment session.

Prior to the final judgment session, judges were given data on the proportion of examinees who completed the test during the most recent administration who would have failed the test had the standard been set at each of the score values between zero and the maximum possible score. In addition, judges were shown the distribution of test standards recommended by their fellow judges during the second judgment session. A final judgment session, identical to the first two, was then conducted. The “yes” responses were tabulated for each judge, and the final recommended test standard was defined as the median of the standards computed for each judge.

Jaeger (1982) recommends that more than one population of judges be sampled, and that the final test standard be based on the lowest of the median recommended standards computed for the various samples of judges. He also suggests that prior to the initial judgment session each judge complete the test under conditions that approximate those used in an actual test administration.
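
Because the judgment task is a yes/no question for each item, the computation for a single session reduces to counting “yes” responses per judge and taking the median across judges, as sketched below. The response matrix is hypothetical, and the iterative feedback (item difficulty data and fellow judges' recommendations) occurs between sessions rather than inside this arithmetic.

```python
# Illustrative sketch of the per-session computation in Jaeger's procedure.
# Each row holds one judge's answers to "should every successful examinee
# answer this item correctly?". The data are hypothetical.

from statistics import median

yes_no = [
    [True, True, False, True, True, False],
    [True, False, False, True, True, True],
    [True, True, True, True, False, False],
    [False, True, False, True, True, True],
]

# A judge's recommended standard is the number of items answered "yes";
# the session's recommended standard is the median across judges.
judge_standards = [sum(row) for row in yes_no]
session_standard = median(judge_standards)

# Jaeger (1982) recommends repeating the session after judges review item
# difficulty data and the distribution of one another's recommendations,
# and, when several judge populations are sampled, taking the lowest of the
# resulting median standards as the final standard.
print(judge_standards, session_standard)
```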

Procedures That Require Judgments About Examinees

Unlike the standard-setting procedures that have been described to this point, several widely used procedures do not require judgments about the characteristics of test items. Instead, they are based on judgments about the competence of individual examinees.

Suppose there are only two military occupational specialties, MOS1 and MOS2. Assume the current estimated average values of the personnel already assigned to these specialties are A1 for MOS1 and A2 for MOS2, and that A1 is considerably larger than A2. Suppose that, after completing the ASVAB or a similar examination, one enlistee's predicted value function for MOS2 is .6 and his or her predicted value function for MOS1 is somewhat higher. Also, suppose that this enlistee has the highest predicted value function for MOS2 among all of the new enlistees who have just completed the aptitude examination. What military occupational specialty assignment for this enlistee would be of maximum benefit to the military?

The value function classification method described in the previous section would place this enlistee in MOS1. The classification method developed in this section, in contrast, would place him/her in MOS2, and thereby be of maximum benefit to the military. Placing this enlistee in either military occupational specialty would likely help raise the average value of individuals in that military occupational specialty, because this enlistee's predicted value levels are higher than the current estimated average values of individuals assigned to both of the military occupational specialties. Since A1 is so much larger than A2, the military's immediate interest would be to assign to MOS2 those enlistees who would have the greatest potential to help raise the current average value of individuals already assigned to MOS2 (provided the military's goal is as we stated earlier). Recall that the enlistee under consideration has the highest predicted value function for MOS2 among all new enlistees for whom placement decisions are to be made. Consequently, it is this enlistee who would have the greatest (predicted) ability to help raise the current average value of individuals assigned to MOS2. Had there been other new enlistees with predicted value functions for MOS2 greater than .6, the best classification decision would not have been obvious.

Consider another enlistee from this same example whose predicted value function for MOS1 is .65 and whose predicted value function for MOS2 is lower. Where should this enlistee be placed, and what effect would he/she have on the average values of individuals assigned to the military occupational specialties? The classification method described in the previous section would have assigned this enlistee to MOS1. Without knowing the predicted value levels of all of the new enlistees and the quotas for MOS1 and MOS2, it is impossible to determine the classification of this enlistee that would minimize the potential negative effect he/she would have on the current average values of the individuals assigned to the military occupational specialties.

A general solution that would achieve the military goal previously described can be determined in the following way. Without loss of generality, assume there are only two military occupational specialties, MOS1 and MOS2. Consider forming a two-way table of new enlistees' predicted value functions, as shown in Figure 2. Potential enlistees whose values fell in the (0,0) cell would not be admitted into the Services because their predicted performance scores would fall below the minimally acceptable standards. New enlistees with values falling in the (0,j) cells would be assigned to MOS1, and those falling in the (i,0) cells would be assigned to MOS2.

FIGURE 2 Two-way table of new enlistees' predicted MOS value functions.

After these decisions had been made, the quotas could be adjusted to account for the enlistees just assigned to MOS1 and MOS2. Now, attention can be focused on the remainder of the table. Let Q1 and Q2 be the adjusted quotas for MOS1 and MOS2, respectively. Adjust Q1 and Q2 such that the number of remaining new enlistees equals the sum of Q1 and Q2.
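
The bookkeeping in the last two paragraphs (screening out the (0,0) cell, pre-assigning the (0,j) and (i,0) cells, and shrinking the quotas accordingly) can be sketched directly. The cell counts and initial quotas below are hypothetical and are chosen so that the remaining enlistees exactly fill the adjusted quotas, as the text assumes.

```python
# Illustrative sketch of the pre-assignment and quota-adjustment step.
# counts[(i, j)] is the number of new enlistees whose predicted value
# functions fall in cell (i, j) of the two-way table; interval index 0 means
# "below the minimally acceptable standard." All numbers are hypothetical.

counts = {
    (0, 0): 5,  (0, 1): 7,  (0, 2): 4,
    (1, 0): 6,  (1, 1): 40, (1, 2): 30,
    (2, 0): 3,  (2, 1): 20, (2, 2): 10,
}
quota_mos1, quota_mos2 = 66, 54            # initial quotas (hypothetical)

rejected = counts[(0, 0)]                   # below standard on both measures
pre_mos1 = sum(n for (i, j), n in counts.items() if i == 0 and j > 0)
pre_mos2 = sum(n for (i, j), n in counts.items() if j == 0 and i > 0)

# Adjust the quotas for the enlistees already assigned; the remaining cells
# (i > 0 and j > 0) are the ones handed to the optimization that follows.
q1 = quota_mos1 - pre_mos1
q2 = quota_mos2 - pre_mos2
remaining = {cell: n for cell, n in counts.items() if cell[0] > 0 and cell[1] > 0}

assert sum(remaining.values()) == q1 + q2   # the balance the text assumes
print(rejected, pre_mos1, pre_mos2, q1, q2)
```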

For simplicity, assume there are only two intervals of predicted values, I1 and I2. Figure 3 displays the simplified two-way table. Let pij be the proportion of the new enlistees in the (i,j)th cell to be assigned to MOS1, and let (1 − pij) be the proportion of the new enlistees in the (i,j)th cell to be assigned to MOS2. The predicted average military occupational specialty values for the new enlistees can then be expressed as

V1 = (n11 p11 v11 + n12 p12 v12 + n21 p21 v21 + n22 p22 v22) / Q1

and

V2 = (n11 (1 − p11) w11 + n12 (1 − p12) w12 + n21 (1 − p21) w21 + n22 (1 − p22) w22) / Q2,

where nij denotes the number of new enlistees in the (i,j)th cell, and vij and wij denote representative predicted value-function levels for MOS1 and MOS2, respectively, for enlistees in that cell. The goal is to find the pij's that jointly maximize V1 and V2 while jointly minimizing, if necessary, the amounts by which these values fall below the current estimated average values of individuals in the military occupational specialties, A1 and A2.

FIGURE 3 Simplified two-way table of new enlistees' predicted MOS value functions.

This problem can be written mathematically in the following way. Find the pij's, Δ1, and Δ2 that maximize

V1 + V2 − M1 Δ1 − M2 Δ2

subject to

V1 + Δ1 ≥ A1, V2 + Δ2 ≥ A2, Δ1 ≥ 0, Δ2 ≥ 0, 0 ≤ pij ≤ 1,

and

n11 p11 + n12 p12 + n21 p21 + n22 p22 = Q1;

where Δ1 and Δ2 are the amounts V1 and V2 fall below the current estimated average values of individuals assigned to MOS1 and MOS2, respectively, and M1 and M2 are positive known constants. The constants M1 and M2 can be thought of as the penalties imposed on the military for admitting enlistees whose predicted performance would result in dropping the average value of individuals in MOS1 and MOS2, respectively. These constants would be chosen by the military.

This formulation of the classification problem is equivalent to a simple linear programming problem that can be solved easily by using the simplex method with the aid of a computer (Hillier and Lieberman, 1974). The formulation can be expanded to include any number of military occupational specialties and any number of value intervals Ii.
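
As a concrete sketch, the linear program can be set up and solved with an off-the-shelf solver rather than a hand-run simplex method. The code below uses scipy.optimize.linprog; it writes the objective as maximizing V1 + V2 − M1 Δ1 − M2 Δ2, which is one reasonable reading of the formulation above, and the cell counts, representative cell values, quotas, current averages A1 and A2, and penalties are all assumed for illustration rather than taken from the paper's examples.

```python
# Illustrative sketch of the classification linear program, solved with
# scipy.optimize.linprog. All parameter values below are assumed for the
# sake of illustration.

import numpy as np
from scipy.optimize import linprog

cells = [(1, 1), (1, 2), (2, 1), (2, 2)]
n = {(1, 1): 40, (1, 2): 30, (2, 1): 20, (2, 2): 10}            # enlistees per cell
v1 = {(1, 1): 0.35, (1, 2): 0.60, (2, 1): 0.35, (2, 2): 0.60}   # representative MOS1 values
v2 = {(1, 1): 0.30, (1, 2): 0.30, (2, 1): 0.55, (2, 2): 0.55}   # representative MOS2 values
Q1, Q2 = 55, 45              # adjusted quotas
A1, A2 = 0.50, 0.45          # current average values of assigned personnel
M1, M2 = 2.0, 2.0            # penalties for falling below A1 and A2

# Decision vector x = [p(1,1), p(1,2), p(2,1), p(2,2), D1, D2].
a1 = np.array([n[c] * v1[c] / Q1 for c in cells])   # V1 = a1 @ p
a2 = np.array([n[c] * v2[c] / Q2 for c in cells])   # V2 = sum(a2) - a2 @ p

# linprog minimizes, so negate the objective V1 + V2 - M1*D1 - M2*D2.
c = np.concatenate([-(a1 - a2), [M1, M2]])

# V1 + D1 >= A1 and V2 + D2 >= A2, rewritten as "<=" constraints.
A_ub = np.array([np.concatenate([-a1, [-1.0, 0.0]]),
                 np.concatenate([ a2, [0.0, -1.0]])])
b_ub = np.array([-A1, a2.sum() - A2])

# Exactly Q1 of the remaining enlistees go to MOS1 (the rest fill Q2).
A_eq = np.array([[n[c] for c in cells] + [0.0, 0.0]])
b_eq = np.array([float(Q1)])

bounds = [(0.0, 1.0)] * len(cells) + [(0.0, None), (0.0, None)]
res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)

p = dict(zip(cells, res.x[:4]))
V1_hat = float(a1 @ res.x[:4])
V2_hat = float(a2.sum() - a2 @ res.x[:4])
print(p, round(V1_hat, 3), round(V2_hat, 3))
```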

The following examples have been included to demonstrate the outcome of this classification strategy. The data are fictitious.

Example 1

The following two-way table shows the distribution of 100 new enlistees' predicted value functions for two military occupational specialties (cell counts n11 = 40, n12 = 30, n21 = 20, and n22 = 10).

Let Q1 = 55 and Q2 = 45. Assume estimates of the current average values of personnel currently assigned to the military occupational specialties, A1 and A2, are available. Let M1 = M2 = 2. This assigns equal penalties to both military occupational specialties. The linear programming analysis produced the following results:

Cell (i,j)    pij     Enlistees Assigned to MOS1    Enlistees Assigned to MOS2
(1,1)          .42    17                            23
(1,2)         1.00    30                             0
(2,1)         0.00     0                            20
(2,2)          .80     8                             2
Total                 55                            45

The resulting predicted average values are V1 = .507 and V2 = .401.

Compare the results of this analysis to those of the following analysis, where M1 = .5 and M2 = 2. These choices assign a larger penalty to MOS2 than to MOS1 for decreases in anticipated average values of personnel currently assigned to the military occupational specialties. The linear programming analysis produced the following results:

Cell (i,j)    pij     Enlistees Assigned to MOS1    Enlistees Assigned to MOS2
(1,1)          .625   25                            15
(1,2)         1.00    30                             0
(2,1)         0.00     0                            20
(2,2)         0.00     0                            10
Total                 55                            45

The resulting predicted average values are V1 = .464 and V2 = .511.

Example 2

The following two-way table shows the distribution of 180 new enlistees' predicted value functions for two military occupational specialties (cell counts n11 = 43, n12 = 40, n13 = 25, n21 = 26, n22 = 45, and n23 = 1). This distribution of predicted value functions is similar to that in the example discussed in the text.

Let Q1 = 90 and Q2 = 90. Assume estimates of the average values of personnel currently assigned to the military occupational specialties, A1 and A2, are available. Let M1 = 1 and M2 = 3. These choices assign a larger penalty to MOS2 than to MOS1 for potential decreases in predicted average values. The assignment of penalties in this way is consistent with the military's immediate interest in placing enlistees into MOS2 if they have the greatest predicted potential to help raise the current average value of personnel assigned to MOS2. The linear programming analysis produced the following results:

Cell (i,j)    pij     Enlistees Assigned to MOS1    Enlistees Assigned to MOS2
(1,1)          .58    25                            18
(1,2)         1.00    40                             0
(1,3)         1.00    25                             0
(2,1)         0.00     0                            26
(2,2)         0.00     0                            45
(2,3)         0.00     0                             1
Total                 90                            90

The resulting predicted average values are V1 = .572 and V2 = .298.

Notice that, because of the distribution of predicted values of new enlistees, it is impossible to raise the average value of personnel assigned to MOS2. However, the optimization process did minimize the decrease in the average value of personnel assigned to MOS2 by allowing the average value of personnel assigned to MOS1 to fall appreciably.

Compare the outcome of this analysis to that of the following analysis, in which M1 = M2 = 1. These choices assign equal penalties to both military occupational specialties. The linear programming analysis produced the following results:

Cell (i,j)    pij     Enlistees Assigned to MOS1    Enlistees Assigned to MOS2
(1,1)         0.00     0                            43
(1,2)         1.00    40                             0
(1,3)         1.00    25                             0
(2,1)         0.00     0                            26
(2,2)          .53    24                            21
(2,3)         1.00     1                             0
Total                 90                            90

The resulting predicted average values are V1 = .631 and V2 = .259.

SUMMARY

Three problems associated with the use of military hands-on job performance tests have been addressed in this paper. The first concerned methods for setting standards of minimally acceptable performance on the tests. In addressing that problem, we described standard-setting procedures that have been used in a wide variety of settings in the civilian sector. We then discussed the prospects for using those procedures with the hands-on tests. Finally, we described a set of operational issues that must be addressed, regardless of the standard-setting procedures adopted by the Services. Among the most frequently used standard-setting procedures, those proposed by Angoff (1971) and Nedelsky (1954) appear to hold the greatest promise for use with the performance components and knowledge components, respectively, of the military job performance tests we have reviewed. Examinee-based standard-setting procedures would be most applicable to tests that are not composed of dichotomously scored activities or items.

The second problem we addressed involves procedures for eliciting and combining judgments of the values of enlistees' behaviors on military job performance tests. We examined the potential contributions of psychological decision theory and social behavior theory to solving this problem and concluded that they were largely inapplicable.

These theories are more appropriate for eliciting judgments of the values of decision alternatives or for inferring the attributes of decision alternatives that underlie judges' recommendations. A procedure involving successive lotteries holds promise for defining the values judges attribute to various patterns of enlistees' behavior on military job performance tests.

It appears that all Services have completed extensive job analysis studies and have developed elaborate lists of tasks that compose their military occupational specialties. Additional studies have resulted in the development of taxonomic clusterings of these tasks on such dimensions as frequency, difficulty, and judged importance. The results of these studies can and should be employed in developing methods for combining judged values associated with performance of the tasks that compose a military occupational specialty. A method based on weighted averages of value functions, with weights proportional to the judged importance of tasks, was described in detail.

The third problem addressed in this paper concerns procedures for using enlistees' predicted job performance test scores and judged values associated with those scores in classifying enlistees among military occupational specialties. Several alternatives were considered, including one that considered only the interests and the predicted abilities of individual enlistees (a guidance model) and two that considered only the interests of the Services. Of the latter, one method presumed that classification decisions were made sequentially, for each individual enlistee. The other method presumed that groups of enlistees were classified concurrently, and that it was desired to effect these classification decisions in a way that maximized the average values of personnel in all military occupational specialties. An explicit solution to this latter problem, in the form of a linear programming algorithm, was described and illustrated.

REFERENCES

Angoff, W.H.
1971 Scales, norms, and equivalent scores. Pp. 508-600 in R.L. Thorndike, ed., Educational Measurement. 2nd ed. Washington, D.C.: American Council on Education.

Berk, R.A.
1976 Determination of optimal cutting scores in criterion-referenced measurement. Journal of Experimental Education 45:4-9.
1985 A Consumer's Guide to Setting Performance Standards on Criterion-Referenced Tests. Paper presented before the annual meeting of the National Council on Measurement in Education, Chicago.

Berk, R.A., ed.
1980 Criterion-Referenced Measurement: The State of the Art. Baltimore, Md.: Johns Hopkins University Press.

Bunch, L.D., M.S. Lipscomb, and D.J. Wissman
1982 Aptitude Requirements Based on Task Difficulty: Methodology for Evaluation. TR-81-34. Air Force Human Resources Laboratory, Manpower and Personnel Division, Brooks Air Force Base, Tex.

Committee on the Performance of Military Personnel
1984 Job Performance Measurement in the Military: Report of a Workshop. Commission on Behavioral and Social Sciences and Education, National Research Council. Washington, D.C.: National Academy Press.

Cross, L.H., J.C. Impara, R.B. Frary, and R.M. Jaeger
1984 A comparison of three methods for establishing minimum standards on the National Teacher Examinations. Journal of Educational Measurement 21:113-130.

Ebel, R.L.
1972 Essentials of Educational Measurement. 2nd ed. Englewood Cliffs, N.J.: Prentice-Hall.
1979 Essentials of Educational Measurement. 3rd ed. Englewood Cliffs, N.J.: Prentice-Hall.

Gardiner, P.C., and W. Edwards
1975 Public values: multiattribute-utility measurement for social behavior. Pp. 1-38 in M.F. Kaplan and S. Schwartz, eds., Human Judgment and Decision Processes. New York: Academic Press.

Glass, G.V.
1978 Standards and criteria. Journal of Educational Measurement 15:237-261.

Goody, K.
1976 Comprehensive Occupational Data Analysis Programs (CODAP): Use of REXALL to Identify Divergent Raters. TR-76-82, AD-A034 327. Air Force Human Resources Laboratory, Occupation and Manpower Research Division, Lackland Air Force Base, Tex.

Gulliksen, H.
1950 Theory of Mental Tests. New York: John Wiley and Sons.

Hambleton, R.K.
1980 Test score validity and standard-setting methods. Pp. 80-123 in R.A. Berk, ed., Criterion-Referenced Measurement: The State of the Art. Baltimore, Md.: Johns Hopkins University Press.

Hambleton, R.K., and D.R. Eignor
1980 Competency test development, validation, and standard-setting. Pp. 367-396 in R.M. Jaeger and C.K. Tittle, eds., Minimum Competency Achievement Testing: Motives, Models, Measures, and Consequences. Berkeley, Calif.: McCutchan.

Hendrix, W.W., J.H. Ward, M. Pina, and D.D. Haney
1979 Pre-Enlistment Person-Job Match System. TR-79-29. Air Force Human Resources Laboratory, Occupation and Manpower Research Division, Brooks Air Force Base, Tex.

Hillier, F.S., and G.J. Lieberman
1974 Operations Research. San Francisco: Holden-Day Press.

Jaeger, R.M.
1978 A Proposal for Setting a Standard on the North Carolina High School Competency Test. Paper presented before the annual meeting of the North Carolina Association for Research in Education, Chapel Hill.
1982 An iterative structured judgment process for establishing standards on competency tests: theory and application. Educational Evaluation and Policy Analysis 4:461-476.

Jaeger, R.M., and J.C. Busch
1984 A Validation and Standard-Setting Study of the General Knowledge and Communication Skills Tests of the National Teacher Examinations. Final report. Greensboro, N.C.: Center for Educational Research and Evaluation, University of North Carolina.

Kaplan, M.
1975 Information integration and social judgment: interaction of judge and informational components. Pp. 139-172 in M.F. Kaplan and S. Schwartz, eds., Human Judgment and Decision Processes. New York: Academic Press.

Kroeker, L., and J. Folchi
1984a Classification and Assignment within PRIDE (CLASP) System: Development and Evaluation of an Attrition Component. TR 84-40. Navy Personnel Research and Development Center, San Diego, Calif.
1984b Minority Fill-Rate Component for Marine Corps Recruit Classification: Development and Test. TR 84-46. Navy Personnel Research and Development Center, San Diego, Calif.

Kroeker, L.P., and B.A. Rafacz
1983 Classification and Assignment within PRIDE (CLASP): A Recruit Model. TR 84-9. Navy Personnel Research and Development Center, San Diego, Calif.

Laabs, G.J.
1984 Performance-Based Personnel Classification: An Update. Navy Personnel Research and Development Center, San Diego, Calif.

Livingston, S.A., and M.J. Zieky
1982 Passing Scores: A Manual for Setting Standards of Performance on Educational and Occupational Tests. Princeton, N.J.: Educational Testing Service.
1983 A Comparative Study of Standard-Setting Methods. Research Report 83-38. Princeton, N.J.: Educational Testing Service.

Lord, F.M., and M.R. Novick
1968 Statistical Theories of Mental Test Scores. Reading, Mass.: Addison-Wesley.

Meskauskas, J.A.
1976 Evaluation models for criterion-referenced testing: views regarding mastery and standard-setting. Review of Educational Research 45:133-158.

Morsch, J.E., J.M. Madden, and R.E. Christal
1961 Job Analysis in the United States Air Force. WADD-TR-61-113, AD-259 389. Personnel Laboratory, Lackland Air Force Base, Tex.

Nedelsky, L.
1954 Absolute grading standards for objective tests. Educational and Psychological Measurement 14:3-19.

Poggio, J.P., D.R. Glassnap, and D.S. Eros
1981 An Empirical Investigation of the Angoff, Ebel, and Nedelsky Standard-Setting Methods. Paper presented before the annual meeting of the American Educational Research Association, Los Angeles.

Raimsey-Klee, D.M.
1981 Handbook for the Construction of Task Inventories for Navy Enlisted Ratings. Navy Occupational Development and Analysis Center, Washington, D.C.

Roberts, D.K., and J.W. Ward
1982 General Purpose Person-Job Match System for Air Force Enlisted Accessions. SR 82-2. Air Force Human Resources Laboratory, Manpower and Personnel Division, Brooks Air Force Base, Tex.

Schmitz, E.J., and P.B. McWhite
1984 Matching People with Occupations for the Army: The Development of the Enlisted Personnel Allocation System. Personnel Utilization Technical Area Working Paper 84-5. U.S. Army Research Institute for the Behavioral and Social Sciences, Alexandria, Va.

Sherif, M.
1947 Group influences upon the formation of norms and attitudes. In T.M. Newcomb and E.L. Hartley, eds., Readings in Social Psychology. 1st ed. New York: Holt.

U.S. Army Research Institute for the Behavioral and Social Sciences
1984 Selecting Job Tasks for Criterion Referenced Tests of MOS Proficiency. Working Paper RS-WP-84-25. U.S. Army Research Institute for the Behavioral and Social Sciences, Alexandria, Va.

Winkler, R.L., and W.L. Hays
1975 Statistics: Probability, Inference, and Decision. 2nd ed. New York: Holt, Rinehart and Winston.

Zedeck, S., and W.F. Cascio
1984 Psychological issues in personnel decisions. Annual Review of Psychology 35:461-518.

Zieky, M.J., and S.A. Livingston
1977 Manual for Setting Standards on the Basic Skills Assessment Tests. Princeton, N.J.: Educational Testing Service.