Suggested Citation:"Procedures for Eliciting and Using Judgments of the Value of Observed Behaviors on Military Job Performance Tests." National Research Council. 1991. Performance Assessment for the Workplace, Volume II: Technical Issues. Washington, DC: The National Academies Press. doi: 10.17226/1898.

Procedures for Eliciting and Using Judgments of the Value of Observed Behaviors on Military Job Performance Tests

Richard M. Jaeger and Sallie Keller-McNulty

THE PROBLEMS ADDRESSED

As part of a Joint-Service job performance measurement project, each Service is developing a series of standardized hands-on job performance tests. These tests are intended to measure the “manifest, observable job behaviors” (Committee on the Performance of Military Personnel, 1984:5) of first-term enlistees in selected military occupational specialties. Once the tests have been constructed and refined, they will be examined for use as criteria for validating the Armed Services Vocational Aptitude Battery (ASVAB), or its successor instruments, as devices for classifying military enlistees into various service schools and military occupational specialties.

Three problems are addressed in this paper. The first concerns the development of standards of minimally acceptable performance on the newly developed criterion tests. Such standards could be used to discriminate between enlistees who would not be expected to exhibit satisfactory (or, perhaps, cost-beneficial) on-the-job performance in a military occupational specialty and those who would be expected to exhibit such performance.

The second problem concerns methods for eliciting and characterizing judgments on the relative value or worth of enlistees' test performances that are judged to be above the minima deemed necessary for admission to one or more military occupational specialties. Practical interest in this problem derives from the need to classify enlistees into military occupational specialties in a way that maximizes their value to the Service while satisfying the enlistees' own requirements and interests.

The third problem concerns the use of enlistees' behaviors on the hands-on tests, and judgments of their value, in the classification of enlistees among military occupational specialties. As was true of the second problem, interest in this problem reflects the need to assign enlistees to military occupational specialties in a way that satisfies the needs of the Services and the enlistees. In a scarce-resource environment, it is essential that the classification problem be solved in a way that maximizes the value of available personnel to the Services while maintaining the attractiveness of the Services at a level that will not diminish the pool of available enlistees.

The three problems considered in this paper are not treated at the same level of detail. Since there is an extensive methodological and empirical literature on judgmental procedures for setting standards on tests, we have addressed this topic in considerable detail. There is little research that supports methodological recommendations on assigning relative value or worth to various levels of test performance. Therefore, our treatment of this problem is comparatively brief. Finally, our discussion of the problem of assigning enlistees to the military occupational specialties should be viewed as illustrative rather than definitive. This problem is logically related to the first two, but is of such complexity that complete development is beyond the scope of this paper.

Establishing Test Standards

To fulfill the requirements of a military occupational specialty, an enlistee must be capable of performing dozens, if not hundreds, of discrete and diverse tasks. Indeed, each Service has conducted extensive analyses of the task requirements of each of its jobs (Morsch et al., 1961; Goody, 1976; Raimsey-Klee, 1981; Burtch et al., 1982; U.S. Army Research Institute for the Behavioral and Social Sciences, 1984) that have produced convincing evidence of the complexity of the various military occupational specialties and the need to describe military occupational specialties in terms of disjoint clusters of tasks. Even when attention is restricted to the job proficiencies expected of personnel at the initial level of skill defined for a military occupational specialty, the military occupational specialty might be defined by several hundred tasks that can reasonably be allocated to anywhere from 2 to 25 or more disjoint clusters (U.S. Army Research Institute for the Behavioral and Social Sciences, 1984:12-19).

In view of the complexity of military occupational specialties, it is unlikely that the performance of an enlistee on the tasks that compose a military occupational specialty could validly be characterized by a single test score. In their initial development of performance tests, the service branches have acknowledged this reality by (1) defining clusters of military occupational specialty tasks; (2) identifying samples of tasks that purportedly represent the population of tasks that compose a military occupational specialty; and (3) specifying sets of measurable behaviors that can be used to assess enlistees' proficiencies in performing the sampled tasks. The problem of defining minimally acceptable performance in a military occupational specialty must therefore be addressed by defining minimally acceptable performance on each of the clusters of tasks that compose the military occupational specialty. Methods for defining standards of performance on task clusters thus provide one major focus of this paper.

Eliciting and Combining Judgments of the Worth of Job Performance Test Behaviors

Scores on the job performance tests that are currently under development are to be used as criterion values in the development of algorithms for assigning new enlistees to various military occupational specialties. Were it possible to develop singular, equivalently scaled, equivalently valued measures that characterized the performance of an enlistee in each military occupational specialty, optimal classification of enlistees among military occupational specialties would be a theoretically simple problem. In reality, the problem is complicated by several factors. First, as discussed above, the tasks that compose a military occupational specialty are not unidimensional. Second, even tests that assessed enlistees' performances on task clusters with perfect precision and validity would not be inherently equivalent. Third, the worth or value associated with an equivalent level of performance on tests that assessed proficiency in two different task clusters would likely differ across those clusters. Fourth, the worth or value associated with a given proficiency level in a single task cluster would likely differ, depending on the military occupational specialty in which the task cluster was embedded.

To address these issues, the problem of establishing functions and eliciting judgments that assign value to levels of proficiency in various military occupational specialties (hereafter called “value functions”) must be examined at the level of the individual tasks and at the level of the task clusters. In this regard, two of the major problems considered in this paper are equivalent.

To develop value functions for military occupational specialties, several component problems must be addressed. First, the task clusters defined by job analysts for each military occupational specialty must be accepted or revised. Second, value functions associated with performances on tasks sampled from task clusters must be defined. Third, operational procedures for eliciting judgments of the values of various levels of performance on tasks sampled from task clusters must be developed. Fourth, methods for weighting and aggregating value assignments across sampled tasks, so as to determine a value assignment for a profile of performances on the tasks that are sampled from a military occupational specialty, must be developed. Related issues that must be considered include the comparability of value assignments across tasks within a military occupational specialty, as well as the scale equivalence of value assignments to levels of performance in different military occupational specialties.

Using Predicted Test Performances and Value Judgments in Personnel Classification

Assuming it is possible to predict enlistees' performances on military job performance tests from the ASVAB or other predictor batteries, and assuming that judgments of the values of these predicted performances can be elicited and combined to produce summary scores for military occupational specialties, there remains the problem of using these summaries in classifying enlistees among military occupational specialties. This problem can be addressed in several ways, depending on one's desire to consider as primary the interests of individual enlistees and/or the Services, and the types of decision scenarios envisioned.

If it were desired to satisfy the interests of individual enlistees with little regard for the needs of or costs to the military, predicted performances in various military occupational specialties would be used solely for guidance purposes. The only value functions that would be pertinent would be those of the individual enlistee. Enlistees would be assigned to the military occupational specialties they most desired, after having been informed of their likely chances of success in each.

If the interests of the military were viewed as primary, the best classification strategy would depend on the decision scenarios envisioned and the decision components to be taken into account. In a scenario in which each enlistee was to be classified individually, based on his/her predicted military occupational specialty job performances and the set of available military jobs at the time of his/her classification, the obvious classification choice would be the one that carried maximum value. In a scenario in which enlistees were to be classified as a group (e.g., the group of enlistees who completed the ASVAB during a given week), the predicted job performances of all members of the group and the values associated with those predictions could be taken into account, in addition to the average values associated with the performances of personnel currently assigned to military occupational specialties with jobs available at the time of classification.
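The contrast between the individual and group scenarios can be illustrated with a small sketch. It is not the mathematical programming model developed later in the paper: it assumes a hypothetical table of judged values for each enlistee in each specialty, one opening per listed specialty, and a brute-force search that is practical only for very small groups. It also ignores the values associated with personnel already assigned. All names and numbers are invented for illustration.

```python
from itertools import permutations

def classify_individual(values, available):
    """Pick the available specialty with the highest judged value.

    `values` maps specialty -> judged value for one enlistee (hypothetical).
    """
    return max((s for s in values if s in available), key=values.get)

def classify_group(value_table, openings):
    """Assign each enlistee to a distinct opening so total value is maximal.

    Brute force over all assignments; assumes len(openings) >= group size.
    """
    enlistees = list(value_table)
    best = max(
        permutations(openings, len(enlistees)),
        key=lambda assign: sum(value_table[e][s] for e, s in zip(enlistees, assign)),
    )
    return dict(zip(enlistees, best))

# Hypothetical judged values for two enlistees and two specialties.
value_table = {
    "enlistee_A": {"MOS_X": 9, "MOS_Y": 8},
    "enlistee_B": {"MOS_X": 8, "MOS_Y": 2},
}
print(classify_individual(value_table["enlistee_A"], {"MOS_X", "MOS_Y"}))
print(classify_group(value_table, ["MOS_X", "MOS_Y"]))
```

In this invented example, classifying enlistee A alone selects MOS_X, whereas the group solution moves A to MOS_Y so that B can fill MOS_X, raising the total value from 11 to 16.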

These alternatives are considered in a discussion of the problem of using enlistees' predicted scores on job performance tests in classifying enlistees among military occupational specialties. A specific mathematical programming model for the third alternative is developed and illustrated.

ESTABLISHING MINIMUM STANDARDS OF PERFORMANCE

One of the two major problems considered in this paper is the establishment of standards of performance that define minimally acceptable levels of response on the new criterion tests that are under development by the Services. In addressing this problem, we first discuss the consequential issues associated with standard setting. We next describe the most widely used standard-setting methods that have been proposed for use with educational achievement tests. In the third section, we consider the prospects for applying these methods to the problem of setting standards on military job performance tests. Finally, we examine a variety of operational questions that arise in the application of any standard-setting procedure, such as the types and numbers of persons from whom judgments on appropriate standards are sought, the form in which judgments are sought, and the information provided to those from whom judgments are sought. Rather than recommending the one “best” standard-setting procedure, it is our intent to illuminate the alternatives that have been applied elsewhere, to bring forth the principal considerations that affect their applicability in the military setting, and to bring to light the major operational issues that must be addressed in using any practical standard-setting procedure.

Consequences of Setting Standards

There are no objective procedures for setting test standards. It is necessary to rely on human judgment. Since judgments are fallible, it is important to consider the consequences of setting standards that are unnecessarily high or low. If an unnecessarily high standard is established, examinees whose competence is acceptable will be failed. Errors of this kind are termed false-negative errors. If the standard established is lower than necessary, examinees whose competence is unacceptable will be passed. Errors of this kind are termed false-positive errors. Both individuals and society are placed at risk by these kinds of errors.
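The two kinds of error can be made concrete with a short sketch. It assumes, purely for illustration, that each examinee's true competence is known (in practice it is not) and that an examinee passes when the score is at or above the standard. The scores and competence labels are hypothetical.

```python
def count_errors(examinees, standard):
    """Return (false_positives, false_negatives) for a given standard.

    `examinees` is a list of (score, is_competent) pairs.
    A false negative is a competent examinee who fails;
    a false positive is an incompetent examinee who passes.
    """
    false_neg = sum(1 for score, ok in examinees if ok and score < standard)
    false_pos = sum(1 for score, ok in examinees if not ok and score >= standard)
    return false_pos, false_neg

# Hypothetical examinees: (test score, truly competent?)
examinees = [(12, True), (18, True), (9, False), (15, False), (20, True)]

print(count_errors(examinees, standard=16))  # high standard
print(count_errors(examinees, standard=10))  # low standard
```

With these invented data, the higher standard fails one competent examinee and passes no incompetent ones, while the lower standard does the reverse.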

When tests are used for selection (that is, for determining who is admitted to an educational program or an employment situation), society or institutions bear the primary effects of false-positive errors. The effects of false-negative errors are borne primarily by individuals when applicant pools greatly exceed institutional needs. However, limitations in the pool of personnel available for military service increase the institutional consequences of making false-negative errors. Adequate military staffing depends on the availability of personnel for a variety of military occupational specialties. Since the military now relies on an all-volunteer force, it is particularly vulnerable to erroneous exclusion of qualified personnel.

When tests are used for purposes of classification—that is, for allocating examinees among alternative educational programs or jobs—the effects of false-positive and false-negative errors are shared by institutions and individuals. When false-positive errors are made, individuals are assigned to programs or jobs that are beyond their levels of competence. This results in less-than-optimal utilization of personnel and the possibility of costly damage for institutions. It also results in psychological and physical hazards for individuals. When false-negative errors are made, individuals are not assigned to programs or jobs for which they are competent. Although this is unlikely to result in physical damage to individuals or institutions, it does produce less-than-optimal use of personnel by institutions and the risk of psychological distress for individuals.

In the military context, the risk to human life and the national security associated with false-positive classification errors is particularly great. Although they might cause psychological distress, false-negative classification errors are unlikely to be life-threatening for individuals. But the Services compete with the civilian sector for qualified personnel. Therefore, the military consequences of false-negative classification errors are likely to be severe for military occupational specialties that require personnel with rare skills and abilities.

Conventional Standard-Setting Procedures

The number of procedures that have been proposed for setting standards on pencil-and-paper tests has been estimated as somewhere between 18 (Hambleton and Eignor, 1980) and 31 (Berk, 1985). The difference between these figures has more to do with the authors' criteria for identifying methods as “different” than with substantively new developments during the years 1980 to 1985. These same authors, as well as others (Meskauskas, 1976; Berk, 1980; Hambleton, 1980), have proposed a variety of schemes for classifying standard-setting procedures. Since this review of standard-setting procedures will be restricted to those that have been widely used and/or hold promise for use in establishing standards on military job performance tests, a simple, two-category classification method will be used. Procedures that require judgments about test items will be described apart from procedures that require judgments about the competence of examinees.

Procedures That Require Judgments About Test Items

Many of the procedures used for setting standards on achievement tests are based on judgments about the characteristics of dichotomously scored test items and examinees' likely performances on those items. Both the types of judgments required and the methods through which judgments are elicited differ across procedures. The most widely used procedures of this type are reviewed in this section.

The Nedelsky Procedure. This standard-setting procedure is, perhaps, of historical interest since it is the oldest procedure in the modern literature on standard setting that still enjoys widespread use. It was proposed by Nedelsky in 1954, and is only applicable to tests composed of multiple-choice items.

The first step in the procedure is to define a population of judges and to select a representative sample from that population. Judges who use the procedure must conceptualize a “minimally competent examinee” and then predict the behavior of this minimally competent examinee on each option of each multiple-choice test item. Because of the nature of the judgment task, it is essential that judges be knowledgeable about the proficiencies of the examinee population, the requirements of the job for which examinees are being selected, and the difficulties of the test items being judged.

For each item on the test, each judge is asked to predict the number of response options a minimally competent examinee could eliminate as being clearly incorrect. A statistic termed by Nedelsky the “minimum pass level” (MPL) is then computed for each item. The MPL for an item is equal to the reciprocal of the number of response options remaining, following elimination of the options that could be identified as incorrect by a minimally competent examinee. The test standard based on the predictions of a single judge is computed as the sum of the MPL values produced by that judge for all items on the test.
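The MPL arithmetic for a single judge can be sketched as follows. The function names and the four-item example are hypothetical; exact fractions are used only to keep the illustration free of rounding.

```python
from fractions import Fraction

def item_mpl(num_options, num_eliminated):
    """Minimum pass level for one item: 1 / (options remaining)."""
    remaining = num_options - num_eliminated
    if remaining < 1:
        raise ValueError("at least the correct option must remain")
    return Fraction(1, remaining)

def nedelsky_judge_standard(items):
    """Sum of MPL values across all items for a single judge.

    `items` is a list of (num_options, num_eliminated) pairs.
    """
    return sum(item_mpl(n, k) for n, k in items)

# Hypothetical judgments for a four-item test with four options per item:
# the judge predicts 2, 3, 0, and 1 options eliminated, respectively.
judgments = [(4, 2), (4, 3), (4, 0), (4, 1)]
print(nedelsky_judge_standard(judgments))  # 1/2 + 1 + 1/4 + 1/3 = 25/12
```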

An initial test standard is computed by averaging the summed MPL values produced by the predictions of each of a sample of judges. Nedelsky (1954) recommended that this initial test standard be adjusted to control the probability that an examinee whose true performance was just equal to the initial test standard could be classified as incompetent due solely to measurement error in the testing process. The adjustment procedure recommended by Nedelsky depends on the assumption that the standard deviation of the test standards derived from the predictions of a sample of judges is equal to the standard error of measurement of the test. If the assumption were correct, and if the distribution of measurement errors on the test were normal, the probability of failing an examinee with true ability just equal to the initial recommended test standard could be reduced to any desired value. For example, reducing the initial test standard by one standard deviation of the distribution of summed MPL values would ensure that no more than 16 percent of examinees with true ability just equal to the initial recommended test standard would fail. Reducing the initial recommended test standard by two standard deviations would reduce this probability to about 2 percent.
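The adjustment and the percentages quoted above can be checked with a brief sketch. It rests on the same two assumptions stated in the text (normally distributed measurement error, and a standard deviation across judges equal to the standard error of measurement). The use of the sample standard deviation and the three judges' standards shown are assumptions made for illustration.

```python
import math
from statistics import mean, stdev

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def adjusted_standard(judge_standards, k):
    """Mean of the judges' summed MPL values, lowered by k standard deviations."""
    return mean(judge_standards) - k * stdev(judge_standards)

def failure_probability(k):
    """Chance that an examinee whose true score equals the initial standard
    falls below the adjusted standard, under the normal-error assumption."""
    return normal_cdf(-k)

# Hypothetical summed MPL values from three judges.
standards = [20, 22, 24]
print(adjusted_standard(standards, k=1))
print(round(failure_probability(1), 3), round(failure_probability(2), 3))
```

Under these assumptions the failure probabilities are roughly 0.159 and 0.023, which correspond to the 16 percent and 2 percent figures given in the text.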

The initial recommended test standard produced by Nedelsky's procedure derives from the assumption that examinees will make random choices among the item options that they cannot eliminate as being clearly incorrect. Examinees are assumed to have no partial information or to be uninfluenced by partial information when making their choices among remaining options. If these assumptions were correct, and if judges were able to correctly predict the average number of options a minimally competent examinee could eliminate as being clearly incorrect, the initial test standard resulting from the Nedelsky procedure would be an unbiased estimate of the mean test score that would be earned by minimally competent examinees. However, studies by Poggio et al. (1981) report that, when Nedelsky's procedure was applied to pencil-and-paper achievement tests in a public school setting, school personnel were unable to make consistent judgments of the type required to satisfy the assumptions of the procedure.

The Angoff Procedure. Although he attributes the procedure to Ledyard Tucker (Livingston and Zieky, 1983), William Angoff's name is associated with a standard-setting method that he described in 1971. The procedure requires that each of a sample of judges consider each item on a test and estimate (1971:515):

the probability that the “minimally acceptable” person would answer each item correctly. In effect, the judges would think of a number of minimally acceptable persons, instead of only one such person, and would estimate the proportion of minimally acceptable persons who would answer each item correctly. The sum of these probabilities, or proportions, would then represent the minimally acceptable score.

As was true of Nedelsky's procedure, the first step in using Angoff's procedure is to identify an appropriate population of judges and then to select a representative sample from this population. Judges are then asked to conceptualize a minimally competent examinee. Livingston and Zieky (1982) suggest that judges be helped to define minimal competence by having them review the domain that the test is to assess and then take part in a discussion on what constitutes “borderline knowledge and skills.” If judges can agree on a level of performance that distinguishes between examinees who are competent and those who are not, Livingston and Zieky recommend that the definition of that performance be recorded, together with examples of performance that are judged to be above, and below, the standard. Using as an example a test that was designed to assess the reading comprehension of high school students, Livingston and Zieky suggest that judges be asked to reach agreement on whether a minimally competent student must be able to “find specific information in a newspaper article, distinguish statements of fact from statements of opinion, recognize the main idea of a paragraph,” and so on. To be useful in characterizing a minimally competent examinee, the behaviors used to distinguish between those who are competent and those who are not should represent the domain of behavior assessed by the test for which a standard is desired.

The judgments required by Angoff's procedure are as follows: Each judge, working independently, considers the items on a test individually and predicts for each item the probability that a minimally competent examinee would be able to answer the test item correctly.

The sum of the probabilities predicted by a judge becomes that judge's recommended test standard and, if the predictions were correct, would equal the total score on the examination that would be earned by a minimally competent examinee. The average of the recommended test standards produced by the entire sample of judges is the test standard that results from Angoff's procedure.

If for each item on the test the average of the probabilities predicted by the sample of judges was correct, the test standard produced by Angoff's procedure would equal the mean score earned by a population of minimally competent examinees. In any case, the result of Angoff's procedure can be viewed as a subjective estimate of that mean.
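The computation can be sketched in a few lines. The function names and the two judges' probability estimates for a three-item test are hypothetical.

```python
from statistics import mean

def angoff_judge_standard(probabilities):
    """Sum of one judge's item probabilities for a minimally competent examinee."""
    for p in probabilities:
        if not 0.0 <= p <= 1.0:
            raise ValueError("each probability must lie between 0 and 1")
    return sum(probabilities)

def angoff_standard(ratings_by_judge):
    """Average of the judges' recommended standards."""
    return mean([angoff_judge_standard(r) for r in ratings_by_judge])

# Hypothetical estimates from two judges for a three-item test.
ratings = [
    [0.5, 0.75, 0.25],   # judge 1: recommended standard 1.5
    [0.75, 0.75, 0.5],   # judge 2: recommended standard 2.0
]
print(angoff_standard(ratings))  # 1.75
```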

Angoff's procedure has been modified in several ways, so as to make it easier to use and/or to increase the reliability of its results. One modification involves use of a fixed scale of probability values from which judges select their predictions. This technique allows judges' predictions to be processed by an optical mark-sense reader for direct entry to a computer, thus saving a coding step and reducing the possibility of clerical errors. Educational Testing Service used an asymmetric scale of probabilities when setting standards on the subtests of the National Teacher Examinations (NTE). Livingston and Zieky (1982:25) objected to the use of an asymmetric scale, since they felt it might bias judges' predictions. Cross et al. (1984) used a symmetric scale of 10 probability values that covered the full range from zero to one, thus overcoming Livingston and Zieky's objections.

Other modifications of Angoff's procedure include the use of iterative processes through which judges are given an opportunity to discuss their initial predictions and then to reconsider those predictions. Cross et al. (1984) investigated the effects of such a process coupled with the use of normative data on examinees' actual test performances. They found that judges recommended a lower test standard at the end of a second judgment session than at the end of an initial session. These results were not entirely consistent with findings of Jaeger and Busch (1984) in a study of standards set for the National Teacher Examinations. They found that mean recommended standards were lower at the end of a second judgment session than at the end of an initial session for four out of eight subtests of the NTE Core Battery; they found just the reverse for the other four subtests. However, the variability of recommended test standards was consistently reduced by using an iterative judgment process. The resulting increase in the stability of mean recommended test standards suggests that use of an iterative judgment process with Angoff's procedure is advantageous.

The Ebel Procedure. The Ebel (1972:492-494) standard-setting procedure also begins by defining a population of judges and selecting a representative sample from that population. After conceptualizing a “minimally competent” examinee, judges must complete three tasks.

First, judges must construct a two-dimensional taxonomy of the items in a test, one dimension being defined by the “difficulty” of the test items and the other being defined by the “relevance” of the items. Ebel suggested using three levels of difficulty, which he labeled “easy,” “medium,” and “hard.” He suggested that four levels of item relevance be labeled “essential,” “important,” “acceptable,” and “questionable.” However, the procedure does not depend on the use of these specific categories or labels. The numbers of dimensions and categories could be changed without altering the basic method.

The judges' second task is to allocate each of the items on the test to one of the cells created by the two-dimensional taxonomy constructed in the first step. For example, Item 1 might be judged to be of “medium difficulty” and to be “important”; Item 2 might be judged to be of “easy difficulty” and to be of “questionable” relevance; and so on.

The judges' final task is to answer the following question for each category of test items (Livingston and Zieky, 1982:25):

If a borderline test-taker had to answer a large number of questions like these, what percentage would he or she answer correctly?

When a test standard is computed using Ebel's method, a judge's recommended percentage for a cell of the taxonomy is multiplied by the number of test items the judge allocated to that cell. These products are then summed across all cells of the taxonomy to produce a recommended test standard for that judge. As in the procedures described earlier, the recommendations of all sampled judges are averaged to produce a final recommended test standard.
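A minimal sketch of this computation follows. It assumes that each percentage is converted to a proportion before multiplication, so that the result is expressed in raw-score units. The cell allocations and percentages are hypothetical.

```python
from statistics import mean

def ebel_judge_standard(cells):
    """Recommended standard for one judge.

    `cells` maps (difficulty, relevance) -> (items_in_cell, percent_correct),
    where percent_correct is the judge's estimate for a borderline test-taker.
    """
    return sum(n_items * pct / 100.0 for n_items, pct in cells.values())

def ebel_standard(cells_by_judge):
    """Average of the judges' recommended standards."""
    return mean([ebel_judge_standard(c) for c in cells_by_judge])

# Hypothetical allocation of a 20-item test by one judge.
judge_1 = {
    ("easy", "essential"): (10, 90),
    ("medium", "important"): (6, 50),
    ("hard", "questionable"): (4, 25),
}
print(ebel_judge_standard(judge_1))  # 9.0 + 3.0 + 1.0 = 13.0
```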

The Jaeger Procedure. This procedure was developed for use in setting a standard on a high school competency test (Jaeger, 1978, 1982), but can be adapted to any testing situation where a licensing, certification, or selection decision is based on an examinee's test performance (Cross et al., 1984).

One or more populations of judges must be specified, and representative samples must be selected from each population. As in the procedures described above, judges are asked to render judgments about test items. More specifically, judges are asked to answer the following question for each item on the test for which a standard is desired: Should every examinee in the population of those who receive favorable action on the decision that underlies use of the test (e.g., every enlistee who is admitted to the military occupational specialty) be able to answer the test item correctly? Notice that this question does not require judges to conceptualize a “minimally competent” examinee.

An initial standard for a judge is computed by counting the number of items for which that judge responded “yes” to the question stated above. An initial test standard is established by computing the median of the standards recommended by each sampled judge.

Jaeger's procedure is iterative by design. Judges are afforded several opportunities to reconsider their initial recommendations in light of data on the actual test performances of examinees and the recommendations of their fellow judges. In its original application, judges were first asked to provide “yes/no” recommendations on each test item on a 120-item reading comprehension test. The judges were then given data on the proportion of examinees who had actually answered each test item correctly in the most recent administration of the test, in addition to the distribution of test standards recommended by their fellow judges. Following a review of these data, judges were asked to reconsider their initial recommendations and once again answer, for each item, the question of whether every “successful” examinee should be able to answer the test item correctly. These answers were used to compute a new set of recommended test standards in preparation for a final judgment session. Prior to the final judgment session, judges were given data on the proportion of examinees who completed the test during the most recent administration who would have failed the test had the standard been set at each of the score values between zero and the maximum possible score. In addition, judges were shown the distribution of test standards recommended by their fellow judges during the second judgment session. A final judgment session, identical to the first two, was then conducted. The “yes” responses were tabulated for each judge, and the final recommended test standard was defined as the median of the standards computed for each judge.
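The normative data supplied before the final session, namely the proportion of examinees who would fail at each possible standard, can be tabulated as in the sketch below. It assumes that an examinee fails when the score falls below the standard; the five scores are hypothetical.

```python
def failure_rates(scores, max_score):
    """Proportion of examinees who would fail at each candidate standard.

    An examinee is counted as failing when the score is below the standard.
    Returns a dict mapping each standard from 0 to max_score to a proportion.
    """
    n = len(scores)
    return {
        cut: sum(1 for s in scores if s < cut) / n
        for cut in range(max_score + 1)
    }

# Hypothetical scores from the most recent administration of a 4-item test.
scores = [1, 2, 2, 3, 4]
rates = failure_rates(scores, max_score=4)
for cut, rate in rates.items():
    print(cut, rate)
```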

Jaeger (1982) recommends that more than one population of judges be sampled, and that the final test standard be based on the lowest of the median recommended standards computed for the various samples of judges. He also suggests that prior to the initial judgment session each judge complete the test under conditions that approximate those used in an actual test administration.
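The arithmetic of Jaeger's procedure can be sketched in a few lines. The sketch below assumes that each judge's final-session responses are recorded as a list of yes/no values; the population names, the five-item test, and the function names are hypothetical and serve only as illustration.

```python
from statistics import median

def judge_standard(responses):
    """Standard for one judge: the number of items answered 'yes' (True)."""
    return sum(1 for r in responses if r)

def sample_standard(judges):
    """Median of the per-judge standards within one sample of judges."""
    return median(judge_standard(r) for r in judges)

def final_standard(samples):
    """Lowest of the median standards across the sampled populations of judges."""
    return min(sample_standard(judges) for judges in samples.values())

# Hypothetical final-session responses on a 5-item test (True = "yes").
samples = {
    "population_a": [
        [True, True, True, False, False],   # judge standard = 3
        [True, True, True, True, False],    # judge standard = 4
        [True, True, False, False, False],  # judge standard = 2
    ],
    "population_b": [
        [True, True, True, True, True],     # judge standard = 5
        [True, True, True, True, False],    # judge standard = 4
        [True, True, True, False, False],   # judge standard = 3
    ],
}

print(final_standard(samples))  # medians are 3 and 4, so the result is 3
```

Taking the lowest median gives the benefit of the doubt to examinees whenever the sampled populations of judges disagree.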

Procedures That Require Judgments About Examinees

Unlike the standard-setting procedures that have been described to this point, several widely used procedures do not require judgments about the
characteristics or difficulty of test items. Instead, judges are asked to make decisions regarding the competence of individual examinees on the ability measured by the test for which a standard is sought. Proponents of these procedures claim that the types of judgments required—concerning persons rather than test items—are more consistent with the experience and capabilities of educators and supervisory personnel. The resulting test standards are thus claimed to be more reasonable and realistic.

The Borderline-Group Procedure. This standard-setting procedure was proposed by Zieky and Livingston (1977). As is true of all standard-setting procedures, the first step in applying the procedure is to define an appropriate population of judges and then to select a representative sample from that population. Livingston and Zieky (1982:31) indicate the importance of sampled judges knowing, or being able to determine, the level of knowledge or skill in the domain assessed by the test of individual examinees they will be asked to judge. Careful and appropriate selection of judges is thus critical to the success of the procedure.

Judges are first asked to define three categories of competence in the domain of knowledge or skill assessed by the test: “adequate or competent,” “borderline or marginal,” and “inadequate or incompetent.” Ideally, these definitions would be operational, would be consensual, and would be reached collectively following extensive deliberation and discussion by the entire sample of judges. In reality, this ideal might not be achieved, nor might a process of face-to-face discussion among judges be feasible.

Once definitions of the three categories of competence have been formulated, the principal act of judgment in the borderline-group procedure requires judges to identify members of the examinee population whom they would classify as “borderline or marginal” in the knowledge and/or skill assessed by the test. It is essential that the judges use information other than the score on the test for which a standard is sought in reaching their classification decisions. If scores on this test were used, the standard-setting procedure would be tautological. Additionally, classification decisions based on scores on the test for which a standard is sought might well be biased. Interpretations of test performances are often normative, and individual judges are unlikely to know about the performances of a representative sample of the population of examinees.

The test for which a standard is sought is administered, under standardized conditions, after a subpopulation of examinees has been classified as “marginal or borderline.” The standard produced by the borderline-group method is defined as the median of the distribution of test scores of examinees who are classified as “marginal or borderline.”
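As a minimal sketch of this computation, assume that each examinee is represented by a pair consisting of the judges' classification and the subsequently observed test score. The labels and scores below are hypothetical.

```python
from statistics import median

def borderline_group_standard(classified):
    """Median test score of the examinees classified as 'borderline'.

    `classified` is a list of (label, score) pairs; the labels are assigned
    by judges before testing, using information other than the test score.
    """
    scores = [score for label, score in classified if label == "borderline"]
    if not scores:
        raise ValueError("no examinees were classified as borderline")
    return median(scores)

# Hypothetical classifications and test scores.
examinees = [
    ("competent", 42),
    ("borderline", 31),
    ("borderline", 35),
    ("incompetent", 20),
    ("borderline", 28),
]

print(borderline_group_standard(examinees))  # median of 28, 31, 35 is 31
```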

Although the borderline-group procedure has some definite advantages,
it is subject to several factors that threaten the validity of the test standard it produces. First, unless the sample of examinees that is classified by the judges is, in its distribution of test scores, representative of the population of examinees to which the test standard is applied, a biased standard will result. Second, in making their classifications it is essential that judges restrict their attention to knowledge and/or skill that is assessed by the test for which a standard is sought. To make reasoned decisions, judges must be familiar with the performance of examinees they are to classify. However, the better they know these examinees, the more likely they are to be influenced in their judgments by factors other than the knowledge and/or skill assessed by the test; halo effect is a pervasive influence in judgments that require classification of persons. Finally, the “borderline” category is the middle position of a three-point scale that ranges from “competent” to “incompetent.” Numerous studies of judges' classification behavior have shown that the middle category of a rating scale tends to be used when judges do not have information that is sufficient to make a valid judgment. Contamination of the “borderline” category with examinees who do not belong there would bias the test standard produced by the borderline-group procedure.

The Contrasting-Groups Procedure. Proposed by Zieky and Livingston in 1977, this procedure is similar in concept to the criterion-groups procedure suggested by Berk (1976). The principal focus of judgment in the contrasting-groups procedure is on the competence of examinees rather than the characteristics of test items, just as in the borderline-group procedure.

The first two steps of the contrasting-groups procedure are identical to those of the borderline-group procedure. First, a population of judges is defined and a representative sample of judges is selected from that population. Second, the sampled judges must develop operational definitions of three categories of competence in the domain of knowledge and/or skill assessed by the test for which a standard is sought: “adequate or competent,” “borderline or marginal,” and “inadequate or incompetent.”

The principal judgmental act of the contrasting-groups procedure requires judges to assign a representative sample of the population to be examined to the three categories of competence they have just defined. That is, each member of the sample of examinees is assigned to a category labeled “adequate or competent,” “borderline or marginal,” or “inadequate or incompetent.”

Once classification of examinees has been completed, the test for which a standard is sought is administered to the examinees about whom judgments have been made. The standard that results from the contrasting-groups method is based on the test score distributions of examinees who
have been assigned to the “adequate or competent” and “inadequate or incompetent” categories.

Several methods have been proposed for analyzing the test score distributions of examinees who have been assigned to the “adequate or competent” and the “inadequate or incompetent” categories. Hambleton and Eignor (1980) recommended that the two test score distributions be plotted on the same graph and that the test standard be defined as the score at which these two distributions intersect. This procedure assumes that the score distributions will not be coincident and that they will be overlapping. Under these conditions, the test standard that results from this algorithm has the advantage of minimizing the total number of examinees who were classified as “competent” and who would fail the test, plus the total number of examinees who were classified as “incompetent” and who would pass the test. If the loss attendant to passing an incompetent examinee were not equal to the loss attendant to failing a competent one, this test standard would not minimize total expected losses. However, if the loss ratio were either known or estimable, the standard could be adjusted readily so as to minimize expected losses.
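The misclassification criterion described above can be illustrated with a direct search over candidate cut scores. This is a sketch under stated assumptions rather than the graphical method of Hambleton and Eignor: it assumes that an examinee passes when the score is at or above the cut, it resolves ties in favor of the lowest cut score, and the loss weights, function name, and score data are all hypothetical.

```python
def contrasting_groups_standard(competent, incompetent,
                                loss_false_pass=1.0, loss_false_fail=1.0):
    """Cut score that minimizes the weighted number of misclassified examinees.

    competent, incompetent: lists of test scores for the two judged groups.
    loss_false_pass: loss attached to passing an 'incompetent' examinee.
    loss_false_fail: loss attached to failing a 'competent' examinee.
    """
    all_scores = competent + incompetent
    candidates = range(min(all_scores), max(all_scores) + 2)

    def loss(cut):
        false_fail = sum(1 for s in competent if s < cut)
        false_pass = sum(1 for s in incompetent if s >= cut)
        return loss_false_fail * false_fail + loss_false_pass * false_pass

    # min() returns the first (lowest) cut score among ties.
    return min(candidates, key=loss)

# Hypothetical, overlapping score distributions.
competent = [30, 32, 35, 36, 38, 40]
incompetent = [20, 22, 25, 28, 31, 33]

print(contrasting_groups_standard(competent, incompetent))                     # 29
print(contrasting_groups_standard(competent, incompetent, loss_false_pass=3))  # 34
```

With equal losses, the lowest cut score that yields two misclassifications is selected. When passing an incompetent examinee is treated as three times as costly as failing a competent one, the standard moves upward, which is the adjustment for unequal losses described in the text.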

Livingston and Zieky (1982) proposed an alternative method of analyzing the test score distribution of “competent” examinees for the purpose of setting a standard. They suggested that the percentage of examinees classified as “competent” be computed for the subsample of examinees who earned each possible test score. The test standard would be defined as the test score for which 50 percent of the examinees were classified as “competent.” Since, for small samples of examinees, the distribution of test scores is likely to be irregular, Livingston and Zieky (1982) recommend the use of a smoothing procedure prior to computing the score value for which 50 percent of the examinees were classified as “competent.” They describe several alternative smoothing procedures.
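The 50 percent rule, together with one possible smoothing step, might be sketched as follows. The unweighted moving average used here is an assumption chosen for simplicity and is not necessarily one of the smoothing procedures that Livingston and Zieky describe; the function names and data are hypothetical.

```python
def proportion_competent_by_score(records):
    """records: iterable of (score, is_competent). Returns {score: proportion}."""
    counts = {}
    for score, is_competent in records:
        n, k = counts.get(score, (0, 0))
        counts[score] = (n + 1, k + (1 if is_competent else 0))
    return {s: k / n for s, (n, k) in sorted(counts.items())}

def moving_average(values, window=3):
    """Unweighted moving average; the window is truncated at both ends."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - half): i + half + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed

def fifty_percent_standard(records, window=3):
    """Lowest score at which the smoothed proportion competent reaches 0.5."""
    props = proportion_competent_by_score(records)
    scores = list(props)
    smoothed = moving_average([props[s] for s in scores], window)
    for score, p in zip(scores, smoothed):
        if p >= 0.5:
            return score
    return None  # the proportion never reaches 50 percent

# Hypothetical small sample with an irregular pattern at scores 11 to 13.
records = (
    [(10, False)] * 2
    + [(11, True)] + [(11, False)] * 2
    + [(12, False)] * 2
    + [(13, True)] * 2 + [(13, False)]
    + [(14, True), (14, False)]
    + [(15, True)] * 2
)

print(fifty_percent_standard(records))            # 14 with smoothing
print(fifty_percent_standard(records, window=1))  # 13 without smoothing
```

In this small example the unsmoothed proportions first reach 50 percent at a score of 13, while smoothing moves the standard to 14, which shows how sensitive the result can be to irregularities in small samples.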

Most of the cautions enumerated above for the borderline-group procedure apply to the contrasting-groups procedure as well: Judges must have an adequate and appropriate basis for classifying examinees, yet avoid classification on bases outside the domain of knowledge and/or skill assessed by the test. A representative sample of examinees must be classified so as to avoid distortion of the distributions of test scores of “competent” and “incompetent” examinees. Since not only the shapes of test score distributions but the sample sizes on which they are based will affect their point of intersection, use of a representative sample of examinees is essential to the fidelity of the standard resulting from the contrasting-groups procedure.

Berk's (1976) criterion-groups procedure is operationally identical to the contrasting-groups procedure apart from the definition of groups. In Berk's method, instead of classifying examinees as “competent” or “incompetent,”
“criterion” groups are formed from examinees who are “uninstructed” and “instructed” in the material assessed by the test for which a standard is sought. Of course, judgment is needed to define groups that can appropriately be termed “uninstructed” and “instructed.” A fundamental assumption of Berk's method is that the “uninstructed” group is incompetent in the knowledge and/or skill assessed by the test, and that the “instructed” group is competent in that knowledge and/or skill.

Prospects for Applying Conventional Standard-Setting Procedures

Although three Services are developing new pencil-and-paper tests as components of their job performance criterion measures (Laabs, 1984; Committee on the Performance of Military Personnel, 1984), a principal interest of the military is establishment of standards on the performance components of the new measures. Since all of the conventional standard-setting procedures reviewed above were developed for use with pencil-and-paper tests in a public education setting, they might not be applicable to hands-on and/or interview procedures used in a military setting. We will consider the applicability of the procedures in the order of their initial description.

Procedures That Require Judgments About Test Items

The Nedelsky procedure may be only partially applicable in setting standards on military job performance tests because it can be used only with multiple-choice test items, while the assessment of “manifest, observable job behaviors” is a central purpose of the military job performance tests. The performance components of these tests (in the Joint-Service lexicon, the hands-on portions) typically measure active performance of a task in accordance with the specifications of a military manual. Because the behavior to be measured is appropriate action, not discrimination among proposed actions, a multiple-choice item format would appear to be inconsistent with specified measurement objectives.

The Nedelsky procedure could be used to establish standards on the “knowledge measures” portion of the criterion measures being developed by the Army, and on similar tests developed by other Services, provided the measures consist of items in multiple-choice format. In civilian settings, the Nedelsky procedure often has provided standards that are somewhat more lenient than those provided by other procedures. In the proposed military setting, lenient standards of performance on individual tasks could still lead to stringent standards on an entire job performance test. This would be true if a separate standard had to be established for each task and satisfactory performance were required on all sampled tasks. For example, suppose that pencil-and-paper measures had been developed for 10 tasks that composed a military occupational specialty job performance test. If the standard of performance adopted for each measure resulted in just 5 percent of enlistees failing, and if examinee performances on the various measures were independent (an admittedly unlikely occurrence, used here merely to illustrate the extreme case), the percentage of examinees who would satisfy the overall military occupational specialty criterion on the pencil-and-paper portion of the job performance test would be 100 × (1 − 0.05)¹⁰ = 59.9 percent. Thus almost 40 percent of the examinees would fail the pencil-and-paper portion of the job performance test, even though only 5 percent would fail to complete any given task. An alternative standard-setting procedure that resulted in more stringent standards for each task would result in an even higher (and perhaps unacceptable) failure rate on the pencil-and-paper portion of the job performance test.
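The compounding effect in this example can be checked directly. The sketch below assumes, as the text does, that performances on the task measures are independent and that every task has the same failure rate; the function name is hypothetical.

```python
def overall_pass_rate(task_fail_rate, n_tasks):
    """Proportion of examinees passing every task, assuming independence."""
    return (1.0 - task_fail_rate) ** n_tasks

rate = overall_pass_rate(0.05, 10)
print(f"{100 * rate:.1f} percent pass all 10 tasks")  # 59.9 percent

# A more stringent per-task standard compounds quickly.
print(f"{100 * overall_pass_rate(0.10, 10):.1f} percent")  # 34.9 percent
```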

The stimulus question that defines the fundamental judgment task of the Nedelsky procedure could be stated in any of several seemingly reasonable ways. A central issue would be the appropriate referent for a “minimally competent examinee.” An example using the Army military occupational specialty (MOS) 95B (military police) should clarify the issue. Suppose that one tested task from this military occupational specialty was “restraining a suspect.” Should the judges being asked to recommend a standard for the test of knowledge of this task be asked to:

Think about a soldier who has just been admitted to MOS 95B who is borderline in his/her knowledge of restraining a suspect. Which options of each of the following test items should this soldier be able to eliminate as obviously incorrect?

Or should the judges be asked to:

Think about a soldier who has just been admitted to MOS 95B who is just borderline in his/her knowledge needed to function satisfactorily in that MOS. Which options of each of the following test items should this soldier be able to eliminate as obviously incorrect?

The difference here is in the referent population. One is task-specific and the other refers to the entire military occupational specialty. Either choice could be supported through logical argument. Since the test is task-specific, the task-delimited population is consistent with Nedelsky's specifications. However, the tested task is one of many that could have been sampled from those that compose the military occupational specialty and the domain of generalization of fundamental interest is the military occupational specialty rather than the sampled task. Perhaps the stimulus question should be constructed on practical rather than purely logical grounds. Judges could be asked whether it is easier to conceptualize a minimally competent soldier who has just been admitted to the military occupational specialty or a soldier who was minimally competent in performing the task being tested.
Some experiments could be conducted to determine the referent population that produced the smallest variation in recommended test standards. Mean recommended standards resulting from use of the two referent populations could be compared to determine whether they differed and which appeared to be most reasonable.

The standard-setting procedures proposed by Angoff, Ebel, and Jaeger can be used with any test that is composed of items or activities that are scored dichotomously. Since all of the military job performance measures we have reviewed are of this form, any of these standard-setting procedures could be used with these tests.

Like the Nedelsky procedure, both the Angoff and Ebel procedures require that judges define a minimally competent examinee. The issue of an appropriate referent population, discussed in the context of the Nedelsky procedure, would therefore be of concern with these procedures as well. Once the question of an appropriate referent population was settled, adaptation of the Angoff procedure to military job performance tests with dichotomously scored components would appear to be straightforward. For example, when used with the performance component of a criterion test for military occupational specialty 95B, the stimulus question might be:

Think about 100 soldiers who have just been admitted to MOS 95B who are borderline in their ability to restrain a suspect. What percentage of these 100 soldiers would position the suspect correctly when applying handcuffs?

Similar questions could be formed for each tested activity in the “restraining a suspect” task.
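A minimal sketch of how responses to such questions might be aggregated follows. It assumes the usual Angoff computation, in which each judge's recommended standard is the sum of that judge's estimated proportions across activities and the test standard is the mean across judges. The percentages and function names are hypothetical.

```python
from statistics import mean

def angoff_judge_standard(percentages):
    """Sum of one judge's estimates, converted from percentages to proportions."""
    return sum(p / 100.0 for p in percentages)

def angoff_standard(judgments):
    """Mean of the per-judge standards across all sampled judges."""
    return mean(angoff_judge_standard(p) for p in judgments)

# Hypothetical estimates from three judges for four activities in the
# "restraining a suspect" task: the percentage of 100 borderline soldiers
# expected to perform each activity correctly.
judgments = [
    [80, 60, 90, 70],  # judge standard = 3.0 activities
    [70, 50, 80, 60],  # judge standard = 2.6 activities
    [90, 70, 90, 80],  # judge standard = 3.3 activities
]

print(round(angoff_standard(judgments), 2))  # about 2.97 of 4 activities
```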

Ebel's procedure might not be applicable to military job performance tests for several practical reasons. The procedure presumes that the “items” that compose a test are unidimensional but stratifiable on dimensions of difficulty and relevance. Many of the military job performance measures we have reviewed contain very few activities or items, so that stratification of items might not be possible. Asking a judge “What percentage of the items on this test should a minimally competent examinee be able to answer correctly?” is tantamount to asking “Where should the test standard be set?” Without stratification of items into relatively homogeneous clusters, Ebel's method is unlikely to yield stable results. Theoretically, Ebel's method could be applied to an overall job performance test to yield a standard for an entire military occupational specialty rather than a single task. However, several assumptions inherent in the method would then be highly questionable. The most obvious basis for stratification of items or activities on the test would be by task, but it is not likely that the activities or items used to assess an examinee's performance of a single task are homogeneous in relevance or difficulty. Also, the relative lengths of subtests that assess
an examinee's performance of different tasks are probably a consequence of the level of detail contained in the military procedures manual that describes the tasks, rather than the relative importance of the tasks. Since Ebel's method weights item strata in proportion to their size, a task that contained a larger number of activities would receive more weight in determining an overall test standard than would a task that contained fewer activities, regardless of the relative importance of the two tasks. In computing an overall test standard, Ebel's procedure has no provision for weighting item strata by importance or by any other judgmental consideration.
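The size-weighting effect can be illustrated with a small sketch. It assumes that tasks serve as the strata and that judges supply, for each stratum, the proportion of activities a minimally competent examinee should perform correctly. The task names, activity counts, and proportions are hypothetical.

```python
def ebel_standard(strata):
    """Overall standard: sum over strata of (number of items x judged proportion)."""
    return sum(n_items * proportion for n_items, proportion in strata.values())

# Hypothetical strata: a heavily documented task with 20 scored activities
# and a briefly documented task with 5 scored activities.
strata = {
    "task_a": (20, 0.9),
    "task_b": (5, 0.5),
}

total_items = sum(n for n, _ in strata.values())
standard = ebel_standard(strata)
unweighted = sum(p for _, p in strata.values()) / len(strata)

print(standard, "of", total_items)    # 20.5 of 25
print(round(standard / total_items, 2))  # 0.82, dominated by the longer task
print(round(unweighted, 2))              # 0.7 if the tasks counted equally
```

The overall standard of 82 percent is pulled toward the judged proportion for the longer task, whereas giving the two tasks equal weight would yield 70 percent.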

From a purely mechanical standpoint, Jaeger's standard-setting procedure could be used readily with either the pencil-and-paper or performance components of military job performance tests. Since it does not require judges to conceptualize a minimally competent examinee, the problem of defining an appropriate referent population, a central issue with the other item-based standard-setting procedures, would not arise with Jaeger's procedure. However, if the empirical results observed in civilian public-school settings are also found in the military context, Jaeger's procedure might yield test standards that are unacceptably high. If standards are established for tests of each sampled task, this problem is likely to be greatly exacerbated. The principal advantage of Nedelsky's procedure, illustrated above, might well be the principal disadvantage of Jaeger's procedure, since stringent test standards for each task would translate to an impossibly stringent standard for admission to a military occupational specialty.

When using Jaeger's procedure, judges might be asked the following question for each activity in a test designed to assess performance on a designated task: “Should every enlistee who is accepted for this MOS be able to perform this activity?” On the tests we have reviewed, it appears that the activities listed closely mirror descriptions of standard practice as specified in applicable military procedures manuals. If judges based their recommendations on “the book,” they would likely answer questions about most, if not all, activities affirmatively, thus resulting in impossibly high test standards.

Our expectation, then, is that Jaeger's procedure could be adapted to military job performance tests quite readily, but would likely yield test standards that were impractically high. This expectation should not preclude small-scale empirical investigations of the procedure.

Procedures That Require Judgments About Examinees

When considered for setting standards on military job performance tests, both the borderline-group procedure and the contrasting-groups procedure offer appealing characteristics and features. These procedures can be used with tests composed of any type of test item, regardless of item scoring, as
long as the tests assess some unidimensional variable and yield a total score. Although the small sample of military job performance tests we have reviewed contained tasks that were made up of discrete units that could be treated as “items,” some performance tests might not be assembled in this way, thus rendering the item-based standard-setting procedures inapplicable. For example, a test concerned with operation of a simple weapon might be scored on the basis of time to effect firing or accuracy of results. If the variable representing “success” can be scored continuously, the borderline-group or contrasting-groups standard-setting procedures could be used to determine a standard of performance, even though the item-based procedures could not.

A second advantage of the standard-setting procedures that require judgments about examinees is that the types of judgments required are probably similar to those made routinely by military supervisors, both in training schools and active units. In fact, somewhat similar judgments were requested of military job experts in the study published by the U.S. Army Research Institute for the Behavioral and Social Sciences (1984). Appendix A of the Army report contains a scale for assessing the abilities of soldiers to perform various tasks associated with specified military occupational specialties.

Despite their advantages, the borderline-group and contrasting-groups standard-setting procedures present several operational problems that might be difficult to overcome. Both procedures require classification of examinees into groups labeled unacceptable, borderline, and acceptable, and subsequent testing of persons in at least one of these groups. When discussing the item-based standard-setting procedures, we suggested that the appropriate referent population of “minimally competent examinees” was not obvious. In a somewhat different form, the same problem must be dealt with for the person-based procedures: Should judges be asked to classify examinees as “unacceptable,” “borderline,” or “acceptable” in the skills defined by the task cluster represented by the test or in all skills needed to function within a military occupational specialty? Since a standard is likely to be desired for a test that is restricted to a single task cluster, one could argue that the appropriate referent population is obvious. On the other hand, eliminating personnel who cannot meet all the demands of a military occupational specialty is the ultimate goal.

A second, and perhaps more serious, problem is obtaining judgments and test data on examinees who are “unacceptable.” Under current military classification procedures, such persons would rarely be assigned to active duty in a military occupational specialty. First, potentially unacceptable enlistees are screened out on the basis of their ASVAB performances. Few such persons are assigned to service schools that provide the training necessary to enter a military occupational specialty for which they are “unacceptable.” Second, since success in an appropriate service school is prerequisite to assignment to active duty in a military occupational specialty and screening takes place prior to graduation from military service schools, the population of potentially unacceptable personnel assigned to active duty in a military occupational specialty is further reduced. The contrasting-groups procedure requires the identification of personnel who are “acceptable” and “unacceptable” in the skills assessed by the test for which a standard is sought. The number of “unacceptable” examinees must be sufficiently large to obtain a stable distribution of test scores. If very few “unacceptable” persons are admitted to a military occupational specialty, obtaining a sufficient number of nominations might not be possible. Recall that classification of examinees to the “unacceptable,” “borderline,” and “acceptable” groups must be based on information other than scores on the tests for which standards are sought. In the present context, that information would have to consist of observations of on-the-job performance of enlistees early in their initial tours of duty in a military occupational specialty. Again, to the extent that current military classification systems are effective, the number of “unacceptable” enlistees will be very small.

Operational Questions and Issues

Regardless of the procedure used to establish standards on military job performance tests, a set of common operational issues must be considered. Since judgments are required, one or more appropriate populations of judges must be identified. The numbers of judges to be sampled from each population must be specified. The stimulus materials used to elicit judgments must be developed. The substance and process of training judges must be specified and developed. The information to be provided judges, both prior to and during the judgment process, must be specified. A decision must be reached on handling measurement error within the process of computing a standard, and if measurement error is to be considered, an algorithm for doing so must be developed. We will discuss each of these issues briefly.

Types of Judges to be Used

All of the item-based standard-setting procedures (with the possible exception of Jaeger's procedure) require judges to be knowledgeable about the distribution of ability of examinees on the skills assessed by the test and about the distributions of performance of examinees on each item contained in the test. Judges used with these procedures should therefore have experience in observing and working with the examinee population, either in a service-school setting or, preferably, in the actual working environment of
the military occupational specialty for which the test is a criterion. Judges used with the person-based standard-setting procedures must meet even more stringent criteria. Since they must classify individual examinees, they must be knowledgeable about the abilities of these individuals to perform specific tasks in the actual work settings of the military occupational specialty.

These requirements suggest that either instructional personnel in appropriate military service schools or immediate unit supervisors (such as non-commissioned officers) in military occupational specialty field units could serve as judges for the item-based standard-setting procedures, but only the latter personnel would be suitable as judges for the examinee-based standard-setting procedures.

Numbers of Judges to be Used

In any standard-setting procedure, the numbers of judges to be used should be determined by considering the probable magnitude of the standard error of the recommended test standard as a function of sample size. Since in all of the standard-setting procedures described in this paper, the recommendations of individual judges are derived more or less independently and are aggregated only at the point of computing a final test standard, the standard error of that recommendation will vary inversely as the square root of the size of the sample of judges.

Ideally, the size of the sample of judges would be sufficient to reduce the standard error of the recommended test standard to less than half a raw score point on the test for which the standard was desired. In that case, assuming that the recommended test standard varied normally across samples of judges, the probability that an examinee whose test score was equal to the test standard recommended by a population of judges would pass or fail the test, just due to sampling of judges, would be no more than 0.05.
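The relationship between the number of judges and the half-point criterion can be sketched as follows. The sketch assumes that judges' recommendations are independent and that the standard deviation of recommended standards across judges is known or can be estimated from pilot data; the value of 3 raw score points used below is hypothetical.

```python
import math

def standard_error(sd_between_judges, n_judges):
    """Standard error of the mean recommended standard for n independent judges."""
    return sd_between_judges / math.sqrt(n_judges)

def judges_needed(sd_between_judges, target_se=0.5):
    """Smallest number of judges giving a standard error strictly below target_se."""
    return math.floor((sd_between_judges / target_se) ** 2) + 1

sd = 3.0  # hypothetical standard deviation of judges' standards, in raw score points
n = judges_needed(sd)

print(n)                                # 37
print(round(standard_error(sd, n), 3))  # 0.493, just under half a point
```

Because the standard error falls only as the square root of the sample size, halving it requires four times as many judges.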

In practice, the size of a sample of judges needed to reduce the standard error of the recommended test standard to the ideal point might be prohibitively expensive or otherwise infeasible to obtain. An alternative criterion for sample size might be based on the relative magnitudes of the standard error of the recommended test standard and the standard error of measurement of the test for which a standard was desired. Since these sources of error are independent, their variances are additive, and it is possible to determine the contribution of sampling error to the overall error in establishing a test standard. For example, if the standard error of the mean test standard was half the magnitude of the standard error of measurement of the test, the variance error of the mean would be only one-fourth the variance error of measurement, and the overall standard error would be increased by a factor of (1.25)^(1/2) = 1.12, or 12 percent. Alternatively, if the standard error of the mean test standard was one-tenth the magnitude of the standard error of

Page 279 Cite

Suggested Citation:"Procedures for Eliciting and Using Judgments of the Value of Observed Behaviors on Military Job Performance Tests." National Research Council. 1991. Performance Assessment for the Workplace, Volume II: Technical Issues. Washington, DC: The National Academies Press. doi: 10.17226/1898.

×

measurement of the test, the variance error of the mean would be only one one-hundredth the variance error of measurement, and the overall standard error would be increased by a factor of (1.01)^1/2 = 1.005, or 0.5 percent.
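The two worked figures follow from adding independent error variances. As a minimal sketch (the function name is hypothetical), the inflation factor for any ratio of the two standard errors can be computed as:

```python
import math

def overall_error_inflation(se_ratio: float) -> float:
    """Factor by which the overall standard error exceeds the standard error
    of measurement (SEM) when the standard error of the mean test standard
    equals se_ratio * SEM and the two error sources are independent.
    """
    if se_ratio < 0:
        raise ValueError("se_ratio must be non-negative")
    # Variances add: SEM^2 * (1 + se_ratio^2); take the square root and divide by SEM.
    return math.sqrt(1.0 + se_ratio ** 2)

for ratio in (0.5, 0.1):
    factor = overall_error_inflation(ratio)
    print(f"ratio {ratio}: factor {factor:.3f}")
```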

Empirical work by Cross et al. (1984) and Jaeger and Busch (1984) showed that, for the subtests of the National Teacher Examinations, relative magnitudes of standard errors of the mean and measurement closer to the latter example than the former were realized with samples of 20 to 30 judges. A modified, iterative form of Angoff's standard-setting procedure was used in each of these studies.

Stimulus Materials to be Used in Setting Standards

The specific stimulus materials to be used in a standard-setting procedure must, of course, be based on the steps involved in conducting that procedure. However, it is essential that the materials be explicit and standardized. All judges must engage in the same standard-setting process, must be fully informed about the types of judgments required of them, and must be privy to the same types of information given all other judges. Experience has shown that judges should be given written as well as standardized oral instructions on the purposes for which their judgments are sought, the types of judgments they are to make, and the exact procedures they are to follow.

The questions judges are asked to answer must be developed with caution and care. For example, in the Angoff standard-setting procedure, judges are asked to estimate the probability that a “minimally competent” examinee would be able to answer each test item correctly. Different responses should be expected, depending on whether judges are asked whether examinees “would be able to answer test items correctly” or “should be able to answer test items correctly.” The first question requires a prediction of actual examinee behavior, while the second one requires a statement of desired examinee behavior. Another subtle, but important, distinction in the Angoff procedure concerns the issue of guessing. Since the Angoff procedure is frequently applied to multiple-choice items, examinees might well answer a test item correctly by guessing. Judges will likely estimate different probabilities of actual examinee behavior, depending on whether they are told to consider guessing, to ignore the possibility of guessing, or are given no instructions about guessing. The last of these options is the least desirable, since consideration of guessing on the part of examinees becomes a source of error variance in the responses of judges.

This example illustrates one of many details that must be addressed if the stimulus materials used in a standard-setting procedure are to function correctly. The goal in developing stimulus materials should be to minimize the variance of recommendations across judges due to factors other than true differences in their judgment of an appropriate test standard.
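To make the notion of variance across judges concrete, the sketch below aggregates Angoff-style ratings. It assumes the usual Angoff aggregation, in which each judge's recommended standard is the sum of that judge's item probabilities and the panel standard is the mean across judges. This section does not spell out that computation, and the ratings shown are invented for illustration.

```python
import math
import statistics

def angoff_summary(ratings):
    """Summarize Angoff ratings.

    ratings[j][i] is judge j's estimated probability that a minimally
    competent examinee answers item i correctly (hypothetical layout).
    Returns (mean standard, SD across judges, standard error of the mean).
    """
    per_judge = [sum(row) for row in ratings]      # one standard per judge
    mean_standard = statistics.mean(per_judge)
    sd = statistics.stdev(per_judge)               # sample SD across judges
    se = sd / math.sqrt(len(per_judge))
    return mean_standard, sd, se

# Invented ratings: three judges, four items.
ratings = [
    [0.6, 0.7, 0.5, 0.8],
    [0.5, 0.6, 0.6, 0.7],
    [0.7, 0.8, 0.4, 0.9],
]
mean_standard, sd, se = angoff_summary(ratings)
print(round(mean_standard, 2), round(sd, 2), round(se, 3))
```

Ambiguous instructions, such as silence on guessing, would show up here as a larger standard deviation across judges and therefore a larger standard error for the same panel size.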

Training of Judges

Obviously, if judges are to provide considered and thoughtful recommendations on standards for military job performance tests, they must understand their tasks clearly and completely. Judges will have to be trained to do their jobs if these ends are to be achieved.

Although the specifics of a training program for judges must depend on the standard-setting procedure used, some common elements can be identified. First, judges must thoroughly understand the test for which standards are desired. One effective way to meet this need is to have judges complete the test themselves under conditions that approximate an operational test administration. This training technique has been used successfully in setting standards on high school competency tests (Jaeger, 1982) and knowledge tests for beginning teachers (Cross et al., 1984; Jaeger and Busch, 1984).

Second, judges must understand the sequence of operations they are to carry out in providing their recommendations. Since in some standard-setting procedures a single set of judgments is elicited, the procedures are likely to be straightforward and easily learned through a single instructional session followed by a period for answering questions. However, other standard-setting procedures are iterative and require judges to provide several sets of recommendations. In these cases, a simulation of the judgment process might be necessary to ensure that judges know what is expected of them. Jaeger and Busch (1984) used such a simulation, together with a small, simulated version of the test for which standards were desired, in a three-stage standard-setting procedure. Following the actual standard-setting exercise, almost all judges reported that they fully understood what they were to do.

Third, in some standard-setting procedures, judges are provided with normative data on the test performances of examinees. Typically, these data are provided in graphical or tabular form, and since the types of graphs or tables used might not be familiar to them, judges might require instruction on their interpretation. For example, in modified versions of the Angoff and Jaeger standard-setting procedures, judges are shown a “p-value” for each item on the test. It should not be assumed that judges will know that these numbers represent the proportions of examinees who answered each test item correctly when the test was last administered. Normative data on examinees' test performances have also been provided in the form of an ogive (cumulative distribution function graph). It is not reasonable to assume that all judges will know how to read and interpret such graphs without specific training.

The overall objective of training should be to ensure that all judges are responding to the same set of questions on the basis of accurate and common understanding of their judgment tasks.

Information to be Provided to Judges

Citing both logical and empirical grounds, several researchers (Glass, 1978; Poggio et al., 1981) have questioned the abilities of judges to make sensible recommendations of the sort required by most item-based standard-setting procedures. In support of their contentions, these authors cite a number of studies in which recommended test standards would have resulted in outlandish examinee failure rates. For example, Educational Testing Service's study to determine standards for the National Teacher Examinations in North Carolina produced recommendations that would have resulted in denial of teacher certification to half the graduates of the state's accredited and approved teacher education programs.

For some military and civilian occupations one could reasonably argue that examinee failure rates are irrelevant to decisions on an appropriate test standard. For example, brain surgeons, jet engine mechanics, and pilots must be able to perform well all tasks that are essential to their jobs. For these types of positions, it is clearly more damaging to employ less than fully competent persons than to have unfilled positions. But for other, less critical jobs, or where on-the-job training might reasonably be used to compensate for marginal qualification at the time an applicant is hired, one could argue that judges' recommendations for test standards should be based on a realistic assessment of the capabilities of the examinees to whom the test will be administered, as well as the requirements of the job itself.

In such situations, it has been recommended that judges be provided with normative information on examinees' test performances to enable them to evaluate the consequences of their proposed test standards. As mentioned in the preceding section, several iterative standard-setting procedures provide judges with item p-values as well as cumulative distributions of the total scores earned by a representative sample of examinees during the most recent administration of the test. Studies have shown that judges use such information when formulating their recommendations (Jaeger, 1982) and that the predominant effect of such data is to reduce the variability of judges' recommendations (Cross et al., 1984; Jaeger and Busch, 1984).

The principal logical argument in support of such use of normative test data is that judges who are well informed on the capabilities of the examinees to whom the test will be administered are likely to provide more reasoned (and therefore better) recommendations on appropriate test standards. That normative test data also appear to reduce the variability of judges' recommended test standards, thereby increasing the reliability of their recommendations, is a serendipitous finding that adds nothing to the logical argument.

Another type of information judges might be provided during an iterative standard-setting procedure is a summary of the recommendations of their fellow judges. A large body of social psychological literature, dating from the work of Sherif (1947), suggests that most persons are influenced by the judgments of their peers in decision situations. The manner in which information on the judgments of peers is provided has a crucial influence on the outcome of the judgment process. Summary data in the absence of justification might induce a shift in judgment toward the central tendency of the group, thereby reducing variability, but are unlikely to result in better informed, and hence more reasoned, judgments. A more defensible procedure would allow judges to state and justify the reasons for their recommendations. If this procedure is followed, it is essential that it be carefully controlled to avoid domination by one or a few judges, and to ensure that a full spectrum of judgments is explained.

Measurement Error

Errors of measurement on tests typically are assumed to be normally distributed (Gulliksen, 1950; Lord and Novick, 1968). Based on this assumption, a person whose true level of ability was equal to the standard established for a test would have a 50 percent chance of earning an observed score that fell below the standard, just due to errors of measurement. A person whose true level of ability was one standard error of measurement above the standard would still have a 16 percent chance of earning an observed score that fell below the standard.
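The 50 percent and 16 percent figures are lower-tail areas of the normal distribution. A minimal sketch, assuming normally distributed errors as stated above (the function name is hypothetical):

```python
import math

def prob_observed_below_standard(sems_above_standard: float) -> float:
    """Probability that the observed score falls below the standard when the
    true score lies sems_above_standard standard errors of measurement above
    the standard, assuming normally distributed measurement error.
    """
    z = -sems_above_standard
    # Standard normal CDF written with the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

for k in (0.0, 1.0, 2.0):
    print(k, round(prob_observed_below_standard(k), 3))
```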

To protect against the possibility of failing an examinee as a result of measurement error, several researchers have proposed that initial test standards be adjusted downward by some multiple of the standard error of measurement of the test for which a standard is desired. As described above, Nedelsky (1954) recommended such an adjustment as an integral part of his standard-setting procedure. Unfortunately, he falsely assumed that the standard error of measurement of a test would be well approximated by the standard deviation of judges' recommended standards and based his adjustment on the latter value.

It is unnecessary to adopt Nedelsky's assumption. In most standard-setting situations, the standard error of measurement of the test can be estimated through an internal-consistency reliability estimation procedure, if by no other means, and the recommended test standard can be adjusted accordingly.
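One way such an adjustment might be carried out is sketched below. The text does not name a particular reliability estimator. Cronbach's alpha is used here only as one common internal-consistency estimate, and the standard error of measurement is taken from the classical relation SEM = SD × sqrt(1 − reliability). The function names and the data are hypothetical.

```python
import math
import statistics

def cronbach_alpha(item_scores):
    """Cronbach's alpha; item_scores[p][i] is person p's score on item i."""
    k = len(item_scores[0])
    item_vars = [statistics.variance([row[i] for row in item_scores]) for i in range(k)]
    total_var = statistics.variance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1.0 - sum(item_vars) / total_var)

def standard_error_of_measurement(total_score_sd: float, reliability: float) -> float:
    """Classical test theory: SEM = SD * sqrt(1 - reliability)."""
    return total_score_sd * math.sqrt(1.0 - reliability)

def adjusted_standard(standard: float, sem: float, multiple: float = 1.0) -> float:
    """Lower the standard by multiple * SEM; a negative multiple raises it."""
    return standard - multiple * sem

# Hypothetical test: total-score SD of 10 points, reliability 0.84, initial standard 60.
sem = standard_error_of_measurement(10.0, 0.84)
print(round(sem, 2), round(adjusted_standard(60.0, sem), 2))
```

Leaving the multiple as a parameter keeps the direction and size of the adjustment a policy choice rather than a property of the code.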

Whether a test standard should be adjusted to compensate for errors of measurement is an arguable point, and some might even suggest that the proper adjustment is upward, rather than downward. At issue in the current application are the relative merits of protecting enlistees, or the military, from the consequences of failing a competent examinee or passing an incompetent one. It is likely that such consequences will vary substantially across tasks and military occupational specialties.

DEFINITION AND CONSTRUCTION OF VALUE FUNCTIONS

The problems discussed in this section concern the establishment of functions that assign value (or worth) to different levels of proficiency in completing various military occupational specialty tasks, and the use of these value functions in assessing the overall worth of an enlistee in a specific military occupational specialty. A method is proposed for defining a value function for a given task. The use of task value functions to establish an overall military occupational specialty value function is then demonstrated. The discussion focuses particularly on hands-on job sample testing.

As in all evaluation processes, judgments must be made. In this situation judgments are needed on the value or worth of task performance levels. Methods for eliciting these judgments are discussed, as are operational issues that are common to several methods.

Defining Task Value Functions

Psychological decision theory (Zedeck and Cascio, 1984) and social behavior theory (Kaplan, 1975) would appear to lend some insights into the problem of establishing task value functions. Psychological decision theory evaluates an individual's (or institution's) decision making strategy by studying the behavior of the individual (or institution). It is in this somewhat circular way that individual (or institutional) behavior is assigned a value. Social behavior theory is not unlike psychological decision theory. In social behavior theory, the value of an individual's behavior is judged on the basis of existing information about that individual. In both of these theoretical frameworks, the behavior being evaluated is the ability to discriminate among proposed actions.

In carrying out their job tasks, enlistees are to behave in accordance with the specifications of a military manual. We assume that the type of behavior (or performance) of most interest in military job performance tests is enlistees' abilities to begin and carry through specific activities. If this is true, judges have less need to judge enlistees' abilities to discriminate among proposed actions than to evaluate their ability to begin a specific task and carry it through to some level of completion. The value assigned will likely depend on an enlistee's level of completion of the task and how accurately the enlistee adhered to the specifications that define the particular task.

One way to define task value functions would be to treat the problem as a multiattribute-utility measurement problem (Gardiner and Edwards, 1975).

In this setting, each “dimension” of each task performance would be evaluated separately. The dimensions of a task would be defined as the set of measurable behaviors (hereafter called a “performance set”) being used to assess enlistees' proficiencies in performing the task. Based on the job performance tests we have reviewed, task performance sets might contain such dimensions as knowledge (measured by a pencil-and-paper test), speed (measured by time taken to complete a hands-on test), and fidelity (measured by the total number of successes on sequences of dichotomously scored hands-on test activities). The same performance set would be used in determining minimally acceptable performance levels when establishing job performance test standards. A task value function would be defined as a weighted average of the values assigned to each of the dimensions in the performance set.

Consider an example: Suppose the task to be assessed is an enlistee's proficiency at putting on a field or pressure dressing. Dimensions of this task could be defined as “pressure administered properly” (measured by the total number of successfully completed activities in a “hands-on” job proficiency test) and “time elapsed before pressure administered” (measured by time taken to complete this portion of the job proficiency test). Judgments of the values of varying levels of performance on these two dimensions would have to be elicited. Assuming these judgments could be secured, the value functions for these dimensions might be similar to those in Figure 1. If these were the only dimensions, the value function for this task, V_t, would be a weighted average of the value functions for these dimensions. Mathematically,

$$V_t(P) = \frac{w_1 V_1(x) + w_2 V_2(y)}{w_1 + w_2},$$

where P is a set of performance scores on the dimensions in the performance set, and V₁ and V₂ denote the value functions for the two dimensions. In this example the set P contains the time, x, elapsed before pressure was administered and the performance level (or consistency with specified procedures), y, of administering pressure. The weights w₁ and w₂ would be determined by assessing the relative importance of the two dimensions.
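The field-dressing example can be sketched in code. Everything numerical below is invented for illustration: the piecewise-linear shapes, the breakpoints, and the weights are assumptions, as are the names V1 and V2 for the two dimension value functions. The weighted average is normalized by the sum of the weights so that the task value stays between 0 and 1.

```python
from bisect import bisect_right

def piecewise_linear(points):
    """Return a function that interpolates linearly through (score, value)
    points sorted by score, holding the end values constant outside the range.
    """
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]

    def value(score):
        if score <= xs[0]:
            return ys[0]
        if score >= xs[-1]:
            return ys[-1]
        j = bisect_right(xs, score)   # xs[j - 1] <= score < xs[j]
        x0, x1, y0, y1 = xs[j - 1], xs[j], ys[j - 1], ys[j]
        return y0 + (y1 - y0) * (score - x0) / (x1 - x0)

    return value

# Hypothetical dimension value functions.
V1 = piecewise_linear([(10, 1.0), (30, 0.6), (60, 0.0)])  # x: seconds elapsed (faster is better)
V2 = piecewise_linear([(4, 0.0), (6, 0.7), (8, 1.0)])     # y: activities completed correctly

def task_value(x, y, w1=2.0, w2=1.0):
    """Weighted average of the two dimension values (weights are hypothetical)."""
    return (w1 * V1(x) + w2 * V2(y)) / (w1 + w2)

print(round(task_value(20, 7), 3))
```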

Value functions for the dimensions of a task could be constructed to range between 0 and 1. Performance-test score distributions could be used to determine realistic maximum performance levels, which would be assigned values equal to 1. Performance scores below minimally acceptable levels would be assigned values equal to 0. The general location and spread of the distributions of performance scores could help define the rates of increase or decrease of value functions.

FIGURE 1 Sample value functions.

There is no reason to believe that value functions would be linear. Intuitively, it would seem that small deviations from minimally acceptable performance levels would result in large changes in value, whereas at some higher levels of performance value functions would change more gradually. The actual shapes of value functions would be determined from the judgments elicited on the value (or worth) of different levels of performance (or proficiency) on the dimensions in performance sets.

Structuring Cluster and Military Occupational Specialty Value Functions

A review of the methods used by the Services in choosing the tasks to be included in performance tests indicates that they all used very similar strategies (Morsch et al., 1961; Goody, 1976; Raimsey-Klee, 1981; Burtch et al., 1982; U.S. Army Research Institute for the Behavioral and Social Sciences, 1984). First, each military occupational specialty was partitioned into nonoverlapping clusters, each of which contained similar or related tasks. Judgments were made on the frequency with which each task is performed, on the relative difficulty of each task compared to others in its cluster and the military occupational specialty, and on the relative importance of each task, compared to others in its cluster and the military occupational specialty. Some clusters of tasks appeared in several military occupational specialties and others only appeared in one military occupational specialty. A representative (judgment-based) sample of tasks from each cluster was chosen for the performance test. Based on face validity, it was determined that performance on the tasks sampled from a cluster could be considered to be generalizable to performance on all tasks in the cluster.

If enlistees' performances on the tasks sampled from a cluster are generalizable, the value of their performances on all tasks in that cluster can be computed from a weighted average of the value functions of the individual tasks sampled. The weights should reflect the already determined relative importance of the tasks sampled from the cluster. Note that, with the problem structured in this way, those who make task value judgments need only consider the frequency with which each task is performed and the difficulty of the task. The importance of each task will be reflected in the weights used to compute cluster value functions.

As is true of the task value functions, cluster value functions can be scaled to range between 0 and 1 (inclusive) by dividing the weighted average of task value functions by the sum of the weights, and assigning the cluster value function a value of 0 if performance on any of the sampled tasks in that cluster is below the minimally acceptable level (that is, if V_t(P) = 0 for any sampled task t). A cluster value function would have a value close to 1 if performances on all the sampled tasks in that cluster were close to their maximum possible levels. Cluster value functions defined in this way would be on a comparable scale. It would thus be possible to compare an enlistee's value (or worth) over different clusters both within and across military occupational specialties.

Comparability of value functions is essential to the classification of enlistees into the military occupational specialties and to the assignment of duties within a military occupational specialty. If an enlistee had a higher value level for a “first aid” cluster than for a “navigational” cluster, then he/she could be placed in a military occupational specialty where first aid was more essential than navigation. Once placed in a military occupational specialty, this enlistee could be assigned, if possible, more first aid duties than navigational duties.

It is also possible to define an overall value function for each specific military occupational specialty. Such value functions would be based on the generalizability of enlistees' performances on a sample of tasks within a military occupational specialty to their overall performance in that military occupational specialty. As previously mentioned, the relative importance of all tasks within each military occupational specialty has been determined. A military occupational specialty value function can therefore be defined as a weighted average of the value functions of sampled tasks within the military occupational specialty, where the weights are chosen to reflect the relative importance of the tasks sampled from the military occupational specialty. The military occupational specialty value function can be scaled to range between 0 and 1 (inclusive) by dividing the weighted average of task value functions by the sum of the weights and assigning the military occupational specialty value function a value of 0 if any task sampled from that military occupational specialty has a value level of 0. The resulting military occupational specialty value functions will then be comparable across military occupational specialties. With military occupational specialty value functions defined in this way, it will be possible to determine the military occupational specialty for which a given enlistee has the greatest value or worth.

The problem of specifying military occupational specialty value functions can be defined symbolically in the following way:

P_i = set of performance scores for individual i;

V_t(P_i) = value function for a specific task, t;

V_c(P_i) = value function for a specific cluster, c, within a military occupational specialty;

V_mos(P_i) = value function for a specific military occupational specialty;

w_tc = weight for V_t, reflects the relative importance of task t within cluster c;

w_tmos = weight for V_t, reflects the relative importance of task t within the military occupational specialty;

n_c = number of tasks sampled from cluster c;

n_mos = total number of tasks sampled from the military occupational specialty;

N = total number of individuals currently in the military occupational specialty;

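\bar{V}_mos = current average value of performances of individuals in the military occupational specialty;

and

$$V_c(P_i) = \begin{cases} \dfrac{\sum_{t=1}^{n_c} w_{tc}\, V_t(P_i)}{\sum_{t=1}^{n_c} w_{tc}} & \text{if } V_t(P_i) > 0 \text{ for every task } t \text{ sampled from cluster } c, \\ 0 & \text{otherwise,} \end{cases}$$

$$V_{mos}(P_i) = \begin{cases} \dfrac{\sum_{t=1}^{n_{mos}} w_{tmos}\, V_t(P_i)}{\sum_{t=1}^{n_{mos}} w_{tmos}} & \text{if } V_t(P_i) > 0 \text{ for every task } t \text{ sampled from the military occupational specialty,} \\ 0 & \text{otherwise,} \end{cases}$$

$$\bar{V}_{mos} = \frac{1}{N} \sum_{i=1}^{N} V_{mos}(P_i).$$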
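The scaling rules described for cluster and military occupational specialty value functions can be sketched as a single routine, since both divide a weighted sum of task values by the sum of the weights and fall to 0 when any sampled task has a value of 0. The function names and numbers below are hypothetical. The same routine would be called with the weights w_tc for a cluster or w_tmos for a specialty.

```python
from statistics import mean

def scaled_value(task_values, weights):
    """Weighted average of task values divided by the sum of the weights,
    forced to 0 if any sampled task has a value of 0.
    """
    if not task_values or len(task_values) != len(weights):
        raise ValueError("task_values and weights must be non-empty and equal in length")
    if any(v == 0 for v in task_values):
        return 0.0
    return sum(w * v for w, v in zip(weights, task_values)) / sum(weights)

def average_specialty_value(specialty_values):
    """Current average value across the N individuals in a specialty."""
    return mean(specialty_values)

# Hypothetical enlistee: two sampled tasks with importance weights 3 and 1.
print(round(scaled_value([0.8, 0.6], [3, 1]), 3))
print(scaled_value([0.8, 0.0], [3, 1]))
```

The zero rule makes the function non-compensatory: strong performance on one task cannot offset unacceptable performance on another.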

Eliciting Value Judgments

Task value functions must be based on two types of judgments. One type concerns the assignment of value to each possible level of performance on each of the dimensions that compose a task performance set. The other type concerns the relative importance (weights) of the dimensions that compose a performance set. Models for eliciting the second type of judgment exist within the procedures used by the Services to determine the relative importance of tasks that compose a military occupational specialty (Morsch et al., 1961; Burtch et al., 1982; U.S. Army Research Institute for the Behavioral and Social Sciences, 1984). Because these procedures can be adopted in their entirety, they will not be discussed further here.

The first type of judgment is inherently more difficult to elicit, since such judgments involve the assignment of value to continua rather than the more familiar ranking procedure associated with determining a value ordering for objects. Two methods for eliciting the first type of judgment are discussed: an average value function method (Gardiner and Edwards, 1975) and a method of successive lotteries (Winkler and Hays, 1975). The operational questions that arise in eliciting this type of value judgment are much the same as those that arise in eliciting judgments of appropriate test standards. The two situations are contrasted in this section of the paper.

Average Value Function Procedure

This method was employed by Gardiner and Edwards (1975) in a situation that differs from the present military context. Gardiner and Edwards were evaluating land use regulations (building permits) using a multiattribute-utility measurement framework and considering such dimensions as percent of on-site parking, unit rental, size of development, aesthetics, and density of the proposed development. Even though Gardiner and Edwards were evaluating physical, as opposed to human, characteristics, their method might be applicable to the problem of eliciting judgments of the values of performances on military job performance tests.

The average value function method operates as its name implies. Each judge derives his/her own value function for a given dimension and these value functions are then averaged across the judges. Of interest here is the information that is supplied to the judges to assist them in deriving their value functions. The following explanation of the method is set in the context of deriving value functions for dimensions of a military occupational specialty task performance set.

First, judges are told the dimensions in the performance set for which value functions are to be constructed. They are given minimally acceptable performance scores, plausible (not absolute) maximum performance scores, and some other fixed performance scores between the minimally acceptable and plausible maximum values in the performance set. The judges are then given a scenario related to the task in question. For example, suppose the task is to “collect and process evidence.” The scenario might be a description of the room or building that needs to be searched and the evidence containers on hand. The judges are told to give a value between 0 and 1 (inclusive) for each of the previously fixed performance levels or simply to draw a graph of their value function, basing their decisions on the value (or worth) of an enlistee's performance in this scenario. The value functions recommended by the judges are then averaged, and this information is given back to the judges.

Other relevant information can also be supplied to the judges at this point in the average value function procedure. This might include the shape, location, and spread of the performance score distributions from which the plausible maximum and minimally acceptable performance scores were determined. The judges are then given a second scenario and asked to repeat their evaluations of the worth of the fixed performance levels, based on this new scenario. The second scenario must be identical, in terms of task characteristics and difficulty, to the first. However, it must be presented in such a way that this is not apparent to the judges. The final value function for a given dimension in the performance set is the average of the value functions recommended by all judges for that dimension, based on the second scenario.
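The averaging step of this procedure can be sketched as follows. The performance levels, the judges' values, and the function name are all invented for illustration. Each judge is assumed to supply one value between 0 and 1 for every fixed performance level.

```python
def average_value_function(judge_values):
    """Average judges' values level by level.

    judge_values[j][k] is the value judge j assigns to the k-th fixed
    performance level (hypothetical layout).
    """
    n_judges = len(judge_values)
    n_levels = len(judge_values[0])
    if any(len(row) != n_levels for row in judge_values):
        raise ValueError("every judge must rate every performance level")
    return [sum(row[k] for row in judge_values) / n_judges for k in range(n_levels)]

# Invented fixed performance levels (percent accuracy) and two rounds of judgments.
levels = [30, 50, 70, 90]
first_round = [[0.0, 0.4, 0.8, 1.0], [0.0, 0.8, 0.9, 1.0]]
second_round = [[0.0, 0.5, 0.8, 1.0], [0.0, 0.7, 0.9, 1.0]]

feedback = average_value_function(first_round)      # reported back to the judges
final_values = average_value_function(second_round) # final value function points
print(list(zip(levels, [round(v, 2) for v in final_values])))
```

Only the second-round average defines the final value function. The first-round average serves as feedback.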

The average value function procedure has merit, in that the scenarios used can be constructed so as to mirror the scenarios used in “hands-on” performance tests or the scenarios used in assessing the relative importance of tasks within a military occupational specialty (U.S. Army Research Institute for the Behavioral and Social Sciences, 1984). In this way, the judges will consider the same set of circumstances that are imposed on an enlistee when his/her performance is measured. One problem with this method is that the frequency with which each task is performed and the difficulty of the tasks are not directly taken into account.

Method of Successive Lotteries

The method of successive lotteries (also called the method of certainty equivalence), as described by Winkler and Hays (1975), is used to develop utility functions in a decision-theoretic framework. The context of the current problem is not unlike a decision-theoretic framework, in that classification decisions are to be based on an enlistee's value (or worth). Stated in this way, what we have termed “value functions” are the utility functions of a decision-theoretic problem. Consequently, it should be possible to apply the method of successive lotteries to evaluate these value functions.

Consider the problem of constructing a value function for one dimension of one task. For example, let the task be shooting a firearm and let the dimension of interest be accuracy in hitting a stationary target (measured by percent of time on target). The method of successive lotteries would be applied in the following way. First, determine a minimally acceptable performance score (minimum acceptable percent of accuracy) and assign this a value just greater than 0. Then determine the plausible maximum performance score and assign this a value of 1. Select several performance scores between these two levels. The value function is to be evaluated and graphed at these scores. The shape of the value function will be given by the curve that results from connecting the points on this graph.

The evaluation consists of a series of comparisons. For example, suppose that the minimally acceptable accuracy level is 30 percent and the plausible maximum accuracy level is 90 percent. Comparisons between sets of lotteries such as the following are made.

Lottery I: Enlistee A shoots with x percent accuracy (30 < x < 90) all of the time.

Lottery II: Enlistee B has a probability of p of shooting with 90 percent accuracy and has a probability of (1 − p) of shooting with 30 percent accuracy.

The judge's task is to decide for what value of p he/she is indifferent between the value (or worth) of enlistees A and B. This indifference point, p, is the value of being x percent accurate (that is, the magnitude of the value function at performance level x, V(x)). If, in the example, x is 85 percent, we would expect the indifference point, p, to be close to 1, whereas if x is 35 percent we would expect the indifference point to be close to 0. Throughout this evaluation process, Lottery II remains fixed and Lottery I changes only in the sense that the value of x is changed.
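The link between the indifference point and V(x) can be made explicit by treating Lottery II as an expected value. This is the usual reading of a certainty-equivalence comparison and is assumed here rather than taken from the text. With V(90) = 1 and V(30) = ε, where ε is a small positive number standing in for "just greater than 0":

```latex
V(x) \;=\; p\,V(90) + (1-p)\,V(30) \;=\; p + (1-p)\,\varepsilon \;\approx\; p .
```

Under that assumption the approximation is exact in the limit ε → 0, which is why the elicited p can be plotted directly as the height of the value function at x.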

With this evaluation method, the frequency with which the task is performed and the difficulty of the task need to be taken into consideration in the determination of indifference points. Since several judges can be involved in this evaluation process, the final value function for a given dimension can be computed as an average (mean or median) of all the judges' value functions (sets of indifference points). To help the judges complete their evaluations, a scenario could also be provided. Analogously to the Jaeger (1982) standard-setting procedure, this evaluation method could be iterated several times, by providing judges with summary information about their fellow judges' initial value functions and information regarding enlistees' actual performance test score distributions before each iteration.
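A minimal Python sketch of how several judges' indifference points might be combined and then read as a value function. The function names, the data layout, and the sample judgments are hypothetical. Connecting the elicited points with straight lines is one reading of "connecting the points on this graph", and assigning a value of 0 below the minimally acceptable score is an assumption.

```python
from statistics import mean, median


def combine_judges(indifference_points, how="median"):
    """Combine judges' indifference points at each performance score."""
    if how not in ("mean", "median"):
        raise ValueError("how must be 'mean' or 'median'")
    agg = median if how == "median" else mean
    scores = sorted(indifference_points[0])
    return {x: agg(j[x] for j in indifference_points) for x in scores}


def value_at(value_fn, x):
    """Read V(x) off the piecewise-linear curve through the elicited points."""
    pts = sorted(value_fn.items())
    if x < pts[0][0]:
        return 0.0          # assumption: below the minimum has no value
    if x >= pts[-1][0]:
        return pts[-1][1]   # assumption: cap at the plausible maximum
    for (x0, v0), (x1, v1) in zip(pts, pts[1:]):
        if x0 <= x <= x1:
            return v0 + (v1 - v0) * (x - x0) / (x1 - x0)


# Hypothetical indifference points (percent accuracy -> p) from three judges.
judges = [
    {30: 0.01, 35: 0.05, 60: 0.50, 85: 0.95, 90: 1.0},
    {30: 0.01, 35: 0.10, 60: 0.60, 85: 0.90, 90: 1.0},
    {30: 0.01, 35: 0.05, 60: 0.40, 85: 0.95, 90: 1.0},
]
v = combine_judges(judges, how="median")
```

The median is less sensitive than the mean to a single divergent judge, which may matter when only a few judges are available; the text leaves the choice open.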

Operational Questions and Issues

With the exception of the issue concerning treatment of measurement error, all of the operational issues and questions that were discussed above (in the section on establishing minimally acceptable standards of performance) are pertinent to the process of eliciting judgments of the value of enlistees' task performances. A separate discussion of operational issues in this section would therefore be largely redundant.

One consideration that is appropriate here but would not be appropriate in the establishment of minimally acceptable performance standards is the use of enlistees themselves to determine value functions. First-term personnel who have been assigned to a military occupational specialty for a reasonable period of time should be capable of judging the value or worth of various levels of performance on tasks that compose that military occupational specialty. Zedeck and Cascio (1984) found that peer ratings of personnel were both reasonable and acceptably reliable.

CLASSIFICATION OF NEW ENLISTEES

All branches of the military have developed computerized personnel allocation systems. These systems have been developed and/or adapted to serve a number of purposes, including: enhancing person-job match (Hendrix et al., 1979; Kroeker and Rafacz, 1983; Roberts and Ward, 1982; Schmitz and McWhite, 1984), lowering attrition rates (Kroeker and Folchi, 1984a), balancing minority representation in certain job classifications and providing equal placement opportunity for minorities in all job classifications (Kroeker and Folchi, 1984b). The material in this section illustrates one application of the ideas discussed earlier in this paper to a computerized personnel allocation system. The illustrations are fictitious and are not necessarily representative of the algorithms in current use in any Service.

Assume that a set of performance scores for each potential enlistee can be estimated from his/her scores on an aptitude test such as the ASVAB. From this set of estimated performance scores, estimates can be found of the task value functions, the cluster value functions, and the value functions of the various military occupational specialties (V_mos). Using these estimates, several strategies for classification of enlistees into the different military occupational specialties can be defined. These classification strategies fall into two groups: individual classification strategies and institutional classification strategies.

Individual Classification Strategies

The simplest classification scheme would be to let each enlistee choose his/her preferred military occupational specialty from a pool of military occupational specialties for which his/her predicted performance scores satisfied the minimally acceptable standards. Enlistees' choices would have to be monitored and directed to some degree, so that quotas for the various military occupational specialties would be met. This classification method would not require estimation of any value functions. Consequently, no value judgments of task performance levels would have to be made. Also, this classification method would not require any information about predicted performance scores for enlistees other than the one being classified. A classification decision would be made solely on the basis of predicted performance scores of the enlistee in question and on that enlistee's preferences.
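As a rough sketch, this preference-based scheme reduces to a filter on the minimum standards followed by a quota check. The function name and data layout below are hypothetical, and a single predicted score per military occupational specialty is assumed for brevity, whereas the text speaks of sets of predicted scores.

```python
def classify_by_preference(predicted, minimums, preferences, quotas):
    """Return the first preferred MOS the enlistee qualifies for that
    still has an open quota slot, or None if there is none.

    predicted:   {mos: predicted performance score}
    minimums:    {mos: minimally acceptable score}
    preferences: list of MOS names, most preferred first
    quotas:      {mos: remaining slots}; updated in place
    """
    for mos in preferences:
        qualifies = predicted[mos] >= minimums[mos]
        if qualifies and quotas.get(mos, 0) > 0:
            quotas[mos] -= 1  # consume one slot
            return mos
    return None


# Hypothetical enlistee who prefers MOS1 but only qualifies for MOS2.
quotas = {"MOS1": 1, "MOS2": 1}
choice = classify_by_preference(
    predicted={"MOS1": 55, "MOS2": 70},
    minimums={"MOS1": 60, "MOS2": 50},
    preferences=["MOS1", "MOS2"],
    quotas=quotas,
)
```

No value function appears anywhere in the sketch, which is the point the paragraph makes.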

If the interests of the military are given primary consideration and if value functions are properly defined, alternative classification methods that better satisfy military requirements can be derived. The most direct way to use value functions for classification would be to classify each enlistee into the military occupational specialty for which he/she had the highest predicted V_mos. Since military occupational specialty quotas would have to be taken into consideration, some enlistees would have to be assigned to military occupational specialties that corresponded to their second- or third-largest predicted V_mos. This classification method would not require any information other than the individual enlistee's predicted value functions and the military occupational specialty quotas.
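A minimal sketch of this sequential, value-based rule, assuming each enlistee is handled one at a time and quotas are decremented as slots fill. The names and numbers are hypothetical; a predicted value of 0 is taken to mean that the enlistee falls below the minimum standard for that specialty.

```python
def classify_by_value(predicted_values, quotas):
    """Assign to the open MOS with the highest predicted value.

    predicted_values: {mos: predicted value for this enlistee}
    quotas:           {mos: remaining slots}; updated in place
    """
    ranked = sorted(predicted_values, key=predicted_values.get, reverse=True)
    for mos in ranked:
        if predicted_values[mos] > 0 and quotas.get(mos, 0) > 0:
            quotas[mos] -= 1
            return mos
    return None


# Hypothetical case: the top-valued MOS is already full,
# so the enlistee falls through to the second-largest value.
quotas = {"MOS1": 0, "MOS2": 1, "MOS3": 5}
assigned = classify_by_value({"MOS1": 0.8, "MOS2": 0.6, "MOS3": 0.3}, quotas)
```

Because each decision looks only at one enlistee and the current quotas, the outcome depends on the order in which enlistees arrive.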

An Institutional Classification Strategy

The final classification method proposed in this paper is an institutional strategy as opposed to an individual strategy. The problem of assigning new personnel to military occupational specialties in a way that maximized their value to the military could be formulated in several ways. In keeping with standard operations research terminology, we will call the function that defines the overall military value of a set of personnel classification decisions an “objective function” (Hillier and Lieberman, 1974). One possible goal (from which an objective function could be formed) might be to upgrade the average performance level and the average value (or worth) of individuals assigned to military occupational specialties across all military occupational specialties simultaneously. For V_mos defined as in Equations 1, it is possible to estimate the current average value of the individuals in each military occupational specialty. This can be done by taking a random sample of enlistees presently in the military occupational specialty, determining V_mos(P_i) for each individual, i, in the sample, and averaging those values. Using the estimates of the current average value of individuals in the military occupational specialties and the new enlistees' predicted military occupational specialty values, it is possible to derive a classification scheme that optimizes the anticipated changes (due to the new enlistees) in the average values of individuals assigned to all of the military occupational specialties. In this classification scheme, individual classification decisions would be based on predicted value functions for the entire group of new enlistees about whom classification decisions are to be made and also on information about enlistees currently assigned to the military occupational specialties.
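The sampling estimate of the current average value can be sketched as follows. The function name, the simple random sampling call, and the toy value function are assumptions made for illustration; the real V_mos would come from the value functions elicited from judges.

```python
import random
from statistics import mean


def estimate_current_average(incumbent_scores, v_mos, sample_size, seed=None):
    """Estimate the current average value of personnel in one MOS.

    incumbent_scores: list of performance scores P_i, one per incumbent
    v_mos:            callable mapping P_i to a value (hypothetical here)
    sample_size:      number of incumbents to sample without replacement
    """
    rng = random.Random(seed)
    sample = rng.sample(incumbent_scores, sample_size)
    return mean(v_mos(p_i) for p_i in sample)


def toy_value(score):
    """Toy linear value function on a 30 to 90 score range (illustrative)."""
    return (score - 30) / 60


estimate = estimate_current_average([30, 60, 90], toy_value, sample_size=3)
```

With a sample equal to the whole population, as in this toy call, the estimate is simply the population mean of the values.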

It is important to understand the advantages to the military of this classification scheme, compared to the value function classification method discussed in the previous section. The optimization invoked by this classification strategy would minimize the decreases, while maximizing the increases, in anticipated average values of individuals assigned to all military occupational specialties. To better see the difference between this classification method and the previous method it will be helpful to consider some fictitious data.

For convenience, assume there are only two military occupational specialties, MOS1 and MOS2. Assume and . Suppose that, after completing the ASVAB or a similar examination, the predicted value functions for one enlistee are and . Also, suppose that this enlistee has the highest predicted value function for MOS2 among all of the new enlistees who have just completed the aptitude examination. What military occupational specialty assignment for this enlistee would be of maximum benefit to the military?

The value function classification method described in the previous section would place this enlistee in MOS1. This classification method would place him/her in MOS2, and thereby be of maximum benefit to the military. Placing this enlistee in either military occupational specialty would likely help raise the average value of individuals in that military occupational specialty because this enlistee's predicted value levels are higher than the current estimated average values of individuals assigned to both of the military occupational specialties. Since is so much larger than , the military's immediate interest would be to assign enlistees to MOS2 who would have the greatest potential to help raise the current average value of individuals already assigned to MOS2 (provided the military's goal is as we stated earlier). Recall that the enlistee under consideration has the highest predicted value function among all new enlistees for whom placement decisions are to be made. Consequently, it is this enlistee who would have the greatest (predicted) ability to help raise the current average value of individuals assigned to MOS2. Had there been other new enlistees with predicted value functions for MOS2 greater than .6, the best classification decision would not have been obvious.

Consider another enlistee from this same example for whom =.65 and . Where should this enlistee be placed and what effect would he/she have on the average values of individuals assigned to the military occupational specialties? The classification method described in the previous section would have assigned this enlistee to MOS1. Without knowing the predicted value levels of all of the new enlistees and the quotas for MOS1 and MOS2, it is impossible to determine the classification of this enlistee that would minimize the potential negative effect he/she would have on the current average values of the individuals assigned to the military occupational specialties.

A general solution that would achieve the military goal previously described can be determined in the following way. Without loss of generality, assume there are only two military occupational specialties, MOS1 and MOS2. Consider forming a two-way table of new enlistees' predicted value functions, as shown in Figure 2. Potential enlistees whose values fell in the (0,0) cell would not be admitted into the Services because their predicted performance scores would fall below the minimally acceptable standards. New enlistees with values falling in the (0,j) cells would be assigned to MOS1 and those falling in the (i,0) cells would be assigned to MOS2. After these decisions had been made, the quotas could be adjusted to account for the enlistees just assigned to MOS1 and MOS2.

FIGURE 2 Two-way table of new enlistees' predicted MOS value functions.
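The forced assignments and the quota adjustment can be sketched directly from the predicted values, without reference to cell indices. The function names and sample data are hypothetical; a predicted value of 0 is taken to mean "below the minimally acceptable standard" for that specialty.

```python
def preassign(value_mos1, value_mos2):
    """Decide the cases that need no optimization."""
    if value_mos1 == 0 and value_mos2 == 0:
        return "not admitted"
    if value_mos2 == 0:
        return "MOS1"       # acceptable for MOS1 only
    if value_mos1 == 0:
        return "MOS2"       # acceptable for MOS2 only
    return "undecided"      # acceptable for both; left for the optimizer


def adjust_quotas(enlistees, q1, q2):
    """Apply the forced assignments and return what is left to allocate.

    enlistees: list of (value_mos1, value_mos2) pairs
    Returns (undecided enlistees, adjusted q1, adjusted q2).
    """
    remaining = []
    for v1, v2 in enlistees:
        decision = preassign(v1, v2)
        if decision == "MOS1":
            q1 -= 1
        elif decision == "MOS2":
            q2 -= 1
        elif decision == "undecided":
            remaining.append((v1, v2))
    return remaining, q1, q2


# Hypothetical group of four potential enlistees.
group = [(0, 0), (0.4, 0), (0, 0.7), (0.5, 0.6)]
remaining, q1, q2 = adjust_quotas(group, q1=10, q2=10)
```

Only the enlistees returned in `remaining` enter the two-way table used in the optimization that follows.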

Now, attention can be focused on the remainder of the table. Let Q₁ and Q₂ be the adjusted quotas for MOS1 and MOS2, respectively. Adjust Q₁ and Q₂ such that the number of remaining new enlistees equals the sum of Q₁ and Q₂. For simplicity, assume there are only two intervals of predicted values, I₁ and I₂. Figure 3 displays the simplified two-way table. Let p_ij be the proportion of the new enlistees in the (i,j)th cell to be assigned to MOS1. Let (1 − p_ij) be the proportion of the new enlistees in the (i,j)th cell to be assigned to MOS2. The predicted average military occupational specialty values for the new enlistees can be expressed as

and

The goal is to find the p_ij's that jointly maximize these two predicted average values while jointly minimizing, if necessary, the amounts by which they fall below the current estimated average values of individuals in the two military occupational specialties.

FIGURE 3 Simplified two-way table of new enlistees' predicted MOS value functions.

This problem can be written mathematically in the following way. Find the p_ij's, Δ₁ and Δ₂ that maximize

subject to

and

n₁₁p₁₁ + n₁₂p₁₂ + n₂₁p₂₁ + n₂₂p₂₂ = Q₁;

where n_ij is the number of new enlistees in the (i,j)th cell, Δ₁ and Δ₂ are the amounts by which the predicted average values of the new enlistees fall below the current estimated average values of individuals assigned to MOS1 and MOS2, respectively, and M₁ and M₂ are positive known constants. The constants M₁ and M₂ can be thought of as the penalties imposed on the military for admitting enlistees whose predicted performance would result in dropping the average value of individuals in MOS1 and MOS2, respectively. These constants would be chosen by the military.

This formulation of the classification problem is equivalent to a simple linear programming problem that can be solved easily by using the simplex method with the aid of a computer (Hillier and Lieberman, 1974). The formulation can be expanded to include any number of military occupational specialties and any number of value intervals I_i. The following examples have been included to demonstrate the outcome of this classification strategy. The data are fictitious.
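To make the role of the penalties concrete, the sketch below evaluates one candidate allocation. It does not solve the linear program. It assumes that the objective is the sum of the two predicted averages minus M₁Δ₁ + M₂Δ₂, and that each cell carries one representative predicted value per specialty; both are assumptions for illustration, not a quotation of the authors' equations. All names and numbers are hypothetical.

```python
def predicted_averages(n, p, v1, v2, q1, q2):
    """Predicted average MOS1 and MOS2 values of the new enlistees.

    n[c]: number of new enlistees in cell c
    p[c]: proportion of cell c assigned to MOS1
    v1[c], v2[c]: assumed representative MOS1 / MOS2 value for cell c
    """
    avg1 = sum(n[c] * p[c] * v1[c] for c in n) / q1
    avg2 = sum(n[c] * (1 - p[c]) * v2[c] for c in n) / q2
    return avg1, avg2


def penalized_objective(n, p, v1, v2, q1, q2, cur1, cur2, m1, m2):
    """Assumed objective: avg1 + avg2 - m1*delta1 - m2*delta2."""
    if abs(sum(n[c] * p[c] for c in n) - q1) > 1e-9:
        raise ValueError("allocation does not fill the MOS1 quota")
    avg1, avg2 = predicted_averages(n, p, v1, v2, q1, q2)
    delta1 = max(0.0, cur1 - avg1)  # shortfall below current MOS1 average
    delta2 = max(0.0, cur2 - avg2)  # shortfall below current MOS2 average
    return avg1 + avg2 - m1 * delta1 - m2 * delta2


# Hypothetical 2 x 2 table: counts, one candidate allocation, cell values.
n = {(1, 1): 40, (1, 2): 30, (2, 1): 20, (2, 2): 10}
p = {(1, 1): 0.625, (1, 2): 1.0, (2, 1): 0.0, (2, 2): 0.0}
v1 = {(1, 1): 0.25, (1, 2): 0.75, (2, 1): 0.25, (2, 2): 0.75}
v2 = {(1, 1): 0.25, (1, 2): 0.25, (2, 1): 0.75, (2, 2): 0.75}
score = penalized_objective(n, p, v1, v2, q1=55, q2=45,
                            cur1=0.5, cur2=0.6, m1=0.5, m2=2.0)
```

A linear programming solver would search over the p_ij's for the allocation with the largest such score; raising M₂ relative to M₁ makes shortfalls in MOS2 more costly, so the optimum shifts toward protecting the MOS2 average.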

Example 1

The following two-way table shows the distribution of 100 new enlistees' predicted value functions for two military occupational specialties.

Let Q₁ = 55 and Q₂ = 45. Assume estimates of the current average values of personnel currently assigned to the military occupational specialties are and . Let M₁ = M₂ = 2. This assigns equal penalties to both military occupational specialties. The linear programming analysis produced the following results:

cell (i,j)	p_ij	Number of Enlistees Assigned to MOS1	Number of Enlistees Assigned to MOS2
(1,1)	.42	17	23
(1,2)	1.00	30	0
(2,1)	0.00	0	20
(2,2)	.80	8	2
Total		55	45
The predicted average values of the new enlistees are .507 for MOS1 and .401 for MOS2.

Compare the results of this analysis to those of the following analysis where M₁ = .5 and M₂ = 2. These choices assign a larger penalty to MOS2 than to MOS1, for decreases in anticipated average values of personnel currently assigned to the military occupational specialties. The linear programming analysis produced the following results:

cell (i,j)	p_ij	Number of Enlistees Assigned to MOS1	Number of Enlistees Assigned to MOS2
(1,1)	.625	25	15
(1,2)	1.00	30	0
(2,1)	0.00	0	20
(2,2)	0.00	0	10
Total		55	45
The predicted average values of the new enlistees are .464 for MOS1 and .511 for MOS2.
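The two result tables for Example 1 can be checked for internal consistency: within each cell the number sent to MOS1 should equal the cell count times p_ij, up to rounding, and the column totals should equal the quotas. The sketch below is an illustrative check, with the cell counts taken as the row sums of the reported assignments; the function name and the rounding tolerance are assumptions.

```python
def allocation_is_consistent(rows, q1, q2, tol=0.5):
    """Check a reported allocation table.

    rows: list of (cell, p_ij, assigned_to_mos1, assigned_to_mos2)
    tol:  allowed rounding gap between n_ij * p_ij and the reported count
    """
    for _cell, p, to_mos1, to_mos2 in rows:
        n = to_mos1 + to_mos2          # cell count implied by the row
        if abs(n * p - to_mos1) > tol:
            return False
    total1 = sum(r[2] for r in rows)
    total2 = sum(r[3] for r in rows)
    return total1 == q1 and total2 == q2


# Rows as reported for M1 = M2 = 2.
equal_penalties = [
    ((1, 1), 0.42, 17, 23),
    ((1, 2), 1.00, 30, 0),
    ((2, 1), 0.00, 0, 20),
    ((2, 2), 0.80, 8, 2),
]
# Rows as reported for M1 = .5 and M2 = 2.
unequal_penalties = [
    ((1, 1), 0.625, 25, 15),
    ((1, 2), 1.00, 30, 0),
    ((2, 1), 0.00, 0, 20),
    ((2, 2), 0.00, 0, 10),
]
ok_equal = allocation_is_consistent(equal_penalties, q1=55, q2=45)
ok_unequal = allocation_is_consistent(unequal_penalties, q1=55, q2=45)
```

Both tables imply the same cell counts (40, 30, 20, and 10), so the two analyses differ only in how the enlistees in cells (1,1) and (2,2) are split.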

Example 2

The following two-way table shows the distribution of 180 new enlistees' predicted value functions for two military occupational specialties. This distribution of predicted value functions is similar to that in the example discussed in the text.

Let Q₁ = 90 and Q₂ = 90. Assume estimates of the average values of personnel currently assigned to the military occupational specialties are and . Let M₁ = 1 and M₂ = 3. These choices assign a larger penalty to MOS2 than to MOS1, for potential decreases in predicted average values. The assignment of penalties in this way is consistent with the military's immediate interest in placing enlistees into MOS2, if they have the greatest predicted potential to help raise the current average value of personnel assigned to MOS2. The linear programming analysis produced the following results:

cell (i,j)	p_ij	Number of Enlistees Assigned to MOS1	Number of Enlistees Assigned to MOS2
(1,1)	.58	25	18
(1,2)	1.00	40	0
(1,3)	1.00	25	0
(2,1)	0.00	0	26
(2,2)	0.00	0	45
(2,3)	0.00	0	1
Total		90	90
The predicted average values of the new enlistees are .572 for MOS1 and .298 for MOS2.

Notice that, because of the distribution of predicted values of new enlistees, it is impossible to raise the average value of personnel assigned to MOS2. However, the optimization process did minimize the decrease in the average value of personnel assigned to MOS2 by allowing the average value of personnel assigned to MOS1 to fall appreciably.

Compare the outcome of this analysis to that of the following analysis in which M₁ = M₂ = 1. These choices assign equal penalties to both military occupational specialties. The linear programming analysis produced the following results:

cell (i,j)	p_ij	Number of Enlistees Assigned to MOS1	Number of Enlistees Assigned to MOS2
(1,1)	0.00	0	43
(1,2)	1.00	40	0
(1,3)	1.00	25	0
(2,1)	0.00	0	26
(2,2)	0.53	24	21
(2,3)	1.00	1	0
Total		90	90
The predicted average values of the new enlistees are .631 for MOS1 and .259 for MOS2.

SUMMARY

Three problems associated with the use of military hands-on job performance tests have been addressed in this paper. The first concerned methods for setting standards of minimally acceptable performance on the tests. In addressing that problem, we described standard-setting procedures that have been used in a wide variety of settings in the civilian sector. We then discussed the prospects for using those procedures with the hands-on tests. Finally, we described a set of operational issues that must be addressed, regardless of the standard-setting procedures adopted by the Services. Among the most frequently used standard-setting procedures, those proposed by Angoff (1971) and Nedelsky (1954) appear to hold the greatest promise for use with the performance components and knowledge components, respectively, of the military job performance tests we have reviewed. Examinee-based standard-setting procedures would be most applicable to tests that are not composed of dichotomously scored activities or items.

The second problem we addressed involves procedures for eliciting and combining judgments of the values of enlistees' behaviors on military job performance tests. We examined the potential contributions of psychological decision theory and social behavior theory to solving this problem and concluded that they were largely inapplicable. These theories are more appropriate for eliciting judgments of the values of decision alternatives or for inferring the attributes of decision alternatives that underlie judges' recommendations. A procedure involving successive lotteries holds promise for defining the values judges attribute to various patterns of enlistees' behavior on military job performance tests.

It appears that all Services have completed extensive job analysis studies and have developed elaborate lists of tasks that compose their military occupational specialties. Additional studies have resulted in the development of taxonomic clusterings of these tasks on such dimensions as frequency, difficulty, and judged importance. The results of these studies can and should be employed in developing methods for combining judged values associated with performance of the tasks that compose a military occupational specialty. A method based on weighted averages of value functions, with weights proportional to the judged importance of tasks, was described in detail.

The third problem addressed in this paper concerns procedures for using enlistees' predicted job performance test scores and judged values associated with those scores in classifying enlistees among military occupational specialties. Several alternatives were considered, including one that considered only the interests and the predicted abilities of individual enlistees (a guidance model) and two that considered only the interests of the Services. Of the latter two, one method presumed that classification decisions were made sequentially, for each individual enlistee. The other method presumed that groups of enlistees were classified concurrently, and that it was desired to effect these classification decisions in a way that maximized the average values of personnel in all military occupational specialties. An explicit solution to this latter problem, in the form of a linear programming algorithm, was described and illustrated.

REFERENCES

Angoff, W.H. 1971 Scales, norms, and equivalent scores. Pp. 508-600 in R. L. Thorndike, ed., Educational Measurement. 2nd ed. Washington, D.C.: American Council on Education.

Berk, R.A. 1976 Determination of optimal cutting scores in criterion-referenced measurement. Journal of Experimental Education 45:4-9.

1985 A Consumer's Guide to Setting Performance Standards on Criterion-Referenced Tests. Paper presented before the annual meeting of the National Council on Measurement in Education, Chicago.

Berk, R.A., ed. 1980 Criterion-Referenced Measurement: The State of the Art. Baltimore, Md.: Johns Hopkins University Press.

Bunch, L.D., M.S. Lipscomb, and D.J. Wissman 1982 Aptitude Requirements Based on Task Difficulty: Methodology for Evaluation. TR-81-34. Air Force Human Resources Laboratory, Manpower and Personnel Division, Brooks Air Force Base, Tex.

Committee on the Performance of Military Personnel 1984 Job Performance Measurement in the Military: Report of a Workshop. Commission on Behavioral and Social Sciences and Education, National Research Council. Washington, D.C.: National Academy Press.

Cross, L.H., J.C. Impara, R.B. Frary, and R.M. Jaeger 1984 A comparison of three methods for establishing minimum standards on the National Teacher Examinations. Journal of Educational Measurement 21:113-130.

Ebel, R.L. 1972 Essentials of Educational Measurement. 2nd ed. Englewood Cliffs, N.J.: Prentice-Hall.

1979 Essentials of Educational Measurement. 3rd ed. Englewood Cliffs, N.J.: Prentice-Hall.

Gardiner, P.C., and W. Edwards 1975 Public values: multiattribute-utility measurement for social behavior. Pp. 1-38 in M. F. Kaplan and S. Schwartz, eds., Human Judgment and Decision Process. New York: Academic Press.

Glass, G.V 1978 Standards and criteria. Journal of Educational Measurement 15:237-261.

Goody, K. 1976 Comprehensive Occupational Data Analysis Programs (CODAP): Use of REXALL to Identify Divergent Raters. TR-76-82, AD-A034 327. Air Force Human Resources Laboratory, Occupation and Manpower Research Division, Lackland Air Force Base, Tex.

Gulliksen, H. 1950 Theory of Mental Tests. New York: John Wiley and Sons.

Hambleton, R.K. 1980 Test score validity and standard-setting methods. Pp. 80-123 in R. A. Berk, ed., Criterion-Referenced Measurement: The State of the Art. Baltimore, Md.: Johns Hopkins University Press.

Hambleton, R.K., and D.R. Eignor 1980 Competency test development, validation, and standard-setting. Pp. 367-396 in R. M. Jaeger and C. K. Tittle, eds., Minimum Competency Achievement Testing: Motives, Models, Measures, and Consequences. Berkeley, Calif.: McCutchan.

Hendrix, W.W., J.H. Ward, M. Pina, and D.D. Haney 1979 Pre-Enlistment Person-Job Match System. TR-79-29. Air Force Human Resources Laboratory, Occupation and Manpower Research Division, Brooks Air Force Base, Tex.

Hillier, F.S., and G.J. Lieberman 1974 Operations Research. San Francisco: Holden-Day Press.

Jaeger, R.M. 1978 A Proposal for Setting a Standard on the North Carolina High School Competency Test. Paper presented before the annual meeting of the North Carolina Association for Research in Education, Chapel Hill.

1982 An iterative structured judgment process for establishing standards on competency tests: theory and application. Educational Evaluation and Policy Analysis 4:461-476.

Jaeger, R.M., and J.C. Busch 1984 A Validation and Standard-Setting Study of the General Knowledge and Communication Skills Tests of the National Teacher Examinations. Final report. Greensboro, N.C.: Center for Educational Research and Evaluation, University of North Carolina.

Kaplan, M. 1975 Information integration and social judgment: interaction of judge and informational components. Pp. 139-172 in M. F. Kaplan and S. Schwartz, eds., Human Judgment and Decision Process. New York: Academic Press.

Kroeker, L., and J. Folchi 1984a Classification and Assignment within PRIDE (CLASP) System: Development and Evaluation of an Attrition Component. TR 84-40. Navy Personnel Research and Development Center, San Diego, Calif.

1984b Minority Fill-Rate Component for Marine Corps Recruit Classification: Development and Test. TR 84-46. Navy Personnel Research and Development Center, San Diego, Calif.

Kroeker, L.P., and B.A. Rafacz 1983 Classification and Assignment within PRIDE (CLASP): A Recruit Model. TR 84-9. Navy Personnel Research and Development Center, San Diego, Calif.

Laabs, G.J. 1984 Performance-Based Personnel Classification: An Update. Navy Personnel Research and Development Center, San Diego, Calif.

Livingston, S. A., and M. J. Zieky 1982 Passing Scores: A Manual for Setting Standards of Performance on Educational and Occupational Tests. Princeton, N.J.: Educational Testing Service.

1983 A Comparative Study of Standard-Setting Methods. Research Report 83-38. Princeton, N.J.: Educational Testing Service.

Lord, F.M., and M.R. Novick 1968 Statistical Theories of Mental Test Scores. Reading, Mass.: Addison-Wesley.

Meskauskas, J.A. 1976 Evaluation models for criterion-referenced testing: views regarding mastery and standard-setting. Review of Educational Research 45:133-158.

Morsch, J.E., J.M. Madden, and R.E. Christal 1961 Job Analysis in the United States Air Force. WADD-TR-61-113, AD-259 389. Personnel Laboratory, Lackland Air Force Base, Tex.

Nedelsky, L. 1954 Absolute grading standards for objective tests. Educational and Psychological Measurement 14:3-19.

Poggio, J.P., D.R. Glassnap, and D.S. Eros 1981 An Empirical Investigation of the Angoff, Ebel, and Nedelsky Standard-Setting Methods. Paper presented before the annual meeting of the American Educational Research Association, Los Angeles.

Raimsey-Klee, D.M. 1981 Handbook for the Construction of Task Inventories for Navy Enlisted Ratings. Navy Occupational Development and Analysis Center, Washington, D.C.

Roberts, D.K., and J.W. Ward 1982 General Purpose Person-Job Match System for Air Force Enlisted Accessions. SR 82-2. Air Force Human Resources Laboratory, Manpower and Personnel Division, Brooks Air Force Base, Tex.

Schmitz, E.J., and P.B. McWhite 1984 Matching People with Occupations for the Army: The Development of the Enlisted Personnel Allocation System. Personnel Utilization Technical Area Working Paper 84-5. U.S. Army Research Institute for the Behavioral and Social Sciences, Alexandria, Va.

Sherif, M. 1947 Group influences upon the formation of norms and attitudes. In T. M. Newcomb and E. L. Hartley, eds., Readings in Social Psychology. 1st ed. New York: Holt.

U.S. Army Research Institute for the Behavioral and Social Sciences 1984 Selecting Job Tasks for Criterion Referenced Tests of MOS Proficiency. Working Paper RS-WP-84-25. U.S. Army Research Institute for the Behavioral and Social Sciences, Alexandria, Va.

Winkler, R.L., and W.L. Hays 1975 Statistics: Probability, Inference, and Decision. 2nd ed. New York: Holt, Rinehart and Winston.

Zedeck, S., and W.F. Cascio 1984 Psychological issues in personnel decisions. Annual Review of Psychology 35:461-518.

Zieky, M.J., and S.A. Livingston 1977 Manual for Setting Standards on the Basic Skills Assessment Tests. Princeton, N.J.: Educational Testing Service.