Pay for Performance: Evaluating Performance Appraisal and Merit Pay

4 Performance Appraisal: Definition, Measurement, and Application

INTRODUCTION

The science of performance appraisal is directed toward two fundamental goals: to create a measure that accurately assesses the level of an individual's job performance and to create an evaluation system that will advance one or more operational functions in an organization. Although all performance appraisal systems encompass both goals, they are reflected differently in two major research orientations, one that grows out of the measurement tradition, the other from human resources management and other fields that focus on the organizational purposes of performance appraisal.

Within the measurement tradition, emanating from psychometrics and testing, researchers have worked and continue to work on the premise that accurate measurement is a precondition for understanding and accurate evaluation. Psychologists have striven to develop definitive measures of job performance, on the theory that accurate job analysis and measurement instruments would provide both employer and employee with a better understanding of what is expected and a knowledge of whether the employee's performance has been effective. By and large, researchers in measurement have made the assumption that if the tools and procedures are accurate (e.g., valid and reliable), then the functional goals of organizations using tests or performance appraisals will be met. Much has been learned, but as this summary of the field makes explicit, there is still a long way to go.

In a somewhat different vein, scholars in the more applied fields—human
resources management, organizational sociology, and, more recently, applied psychology—have focused their efforts on usability and acceptability of performance appraisal tools and procedures. They have concerned themselves less with questions of validity and reliability than with the workability of the performance appraisal system within the organization, its ability to communicate organizational standards to employees, to reward good performers, and to identify employees who require training and other development activities. For example, the scholarship in the management literature looks at the use of performance appraisal systems to reinforce organizational and employee belief systems. The implicit assumption of many applied researchers is that if the tools and procedures are acceptable and useful, they are also likely to be sufficiently accurate from a measurement standpoint.

From a historical perspective, until the last decade research on performance appraisal was largely dominated by the measurement tradition. Performance appraisals were viewed in much the same way as tests; that is to say, they were evaluated against criteria of validity, reliability, and freedom from bias. The emphasis throughout was on reducing rating errors, which was assumed to improve the accuracy of measurement. The research addressed two issues almost exclusively—the nature and quality of the scales to be used to assess performance and rater training. The question of which performance dimensions to evaluate tended to be taken as a given.

Although, strictly speaking, we do not disagree with the test analogy for performance appraisals, it can be misleading. Performance appraisals are different from the typical standardized test in that the "test" in this case is a combination of the scale and the person who completes the rating.
And, contrary to standardized test administration, the context in which the appraisal process takes place is difficult if not impossible to standardize. These complexities were often overlooked in the performance appraisal literature in the psychometric tradition. The research on scales has tended to treat all variation attributable to raters as error variance. The classic training research can be seen as attempting to develop and evaluate ways of standardizing the person component of the appraisal process.

In the late 1970s there was a shift in emphasis away from the psychometric properties of scales. The shift was initially articulated by Landy and Farr (1980) and was extended by Ilgen and Feldman (1983) and DeNisi et al. (1984). They expounded the thesis that the search for rating error had reached the point of diminishing returns for improving the quality of performance appraisals, and that it was time for the field to concentrate more on what the rater brings to performance appraisal—more specifically, how the rater processes information about the employee and how this mental processing influences the accuracy of the appraisal. The thrust of the research was still on accuracy, but now the focus was on the accuracy of judgment rather than rating errors and the classical psychometric indices of quality.
Just as there was dissatisfaction with progress in performance appraisal research at the end of the 1970s, recent literature suggests dissatisfaction with the approaches of the 1980s. But this time the shift promises to be more fundamental. The most recent research (Ilgen et al., 1989; Murphy and Cleveland, 1991) appears to reject the goal of precision measurement as impractical. From this point of view, prior research has either ignored or underestimated the powerful impact of organizational context and people's perceptions of it. The context position is that, although rating scale formats, training, and other technical qualities of performance appraisals do influence the qualities of ratings, the quality of performance appraisals is also strongly affected by the context in which they are used. It is argued that research on performance appraisals now needs to turn to learning more about the conditions that encourage raters to use the performance appraisal systems in the way that they were intended to be used. At this juncture, therefore, it appears that the measurement and management traditions in performance appraisal have reached a rapprochement.

How do these varied bodies of research contribute to an understanding of performance appraisal technology and application? Can jobs be accurately described? Can valid and reliable measures of performance be developed? Does the research offer evidence that performance appraisal instruments and procedures have a positive effect on individual and organizational effectiveness? Is there evidence that performance appraisal systems contribute to communication of organizational goals and performance expectations as management theory would lead us to believe?
What does the recent focus on the interactions between appraisal systems and organizational context suggest about the probable accuracy of appraisals when actually used to make decisions about individual employees? These questions and their treatment in the psychological research and human resources management literature form the major themes of this chapter.

In the following pages we present the results of research in the areas of psychometrics, applied psychology, and human resources management on performance description, performance measurement, and performance assessment for purposes of enhancing individual employee performance. The first section deals with measurement issues. The discussion proceeds from a general description of the research on job performance and its measurement to a description of the factors that can influence the quality of the performance assessment. Research relating to managerial-level jobs is presented as available, but most of the work in job performance description and measurement has involved nonmanagerial jobs.[1]

[1] The reason for this imbalance in the research literature is obvious: managerial jobs are difficult to define and assess at a specific level—not only are they fragmented, diverse, and amorphous, but many of the factors leading to successful outcomes in such jobs are not directly measurable. Moreover, in practice, most managerial appraisals involve some form of management by objective. This approach represents an attempt to finesse the problem of evaluating performance by defining good performance a priori—instead, the employee participates in establishing the performance objectives that are used to evaluate the performance.

The second section deals with research on the more applied
issues, such as the effects of rater training and the contextual sources of rating distortion.

PERFORMANCE APPRAISAL AND THE MEASUREMENT TRADITION

The Domain of Job Performance

The definition and measurement of job performance has been a central theme in psychological and organizational research. Definitions have ranged from general to specific and from quantitative to qualitative. Some researchers have concentrated their efforts on defining job performance in terms of outcomes; others have examined job behaviors; still others have studied personal traits such as conscientiousness or leadership orientation as correlates of successful performance. The more general, qualitative descriptions tend to be used for jobs that are complex and multifaceted like those at managerial levels, while quantitative descriptions are used frequently to describe highly proceduralized jobs for which employee actions can be measured and the resulting outcomes often quantified. The principal purpose of this research has been to enhance employee performance (via better selection, placement, and retention decisions), under the assumption that cumulative individual performance will influence organizational performance.

When considering measures of individual job performance, there is a tendency in the literature to characterize some measures as objective and others as subjective. We believe this to be a false distinction that may create too much confidence in the former and an unjustified suspicion about the latter. Measurement of performance in all jobs, no matter how structured and routinized they are, depends on external judgment about what the important dimensions of the job are and where the individual's performance falls on each dimension. Our discussion in this chapter avoids the artificial distinctions of objective and subjective and instead focuses on the role of human judgment in the performance appraisal process.
Initially, applied psychologists were optimistic about their ability to identify and measure job performance. Job analyses were used as the basis for constructing selection tests, for developing training programs, and for determining the strengths and weaknesses of employees. However, many of the results were disappointing and, as experience was gained, researchers began to realize that describing the constituent dimensions of a job and understanding its performance requirements was not a straightforward task. Today it is recognized
that job performance is made up of complex sets of interacting factors, some of them attributable to the job, some to the worker, and some to the environment. Thus, in even the simplest of jobs many elements of "job performance" are not easily isolated or directly observable. It is also clear to social scientists that the definition of what constitutes skill or successful work behavior is contingent and subject to frequent redefinition. In any appraisal system, the performance factors rated depend on the approach taken to job analysis, i.e., worker attributes or job tasks. There is evidence that different expert analysts and different analytic methods will result in different judgments about job skills (England and Dunn, 1988).

Furthermore, the evaluation of job performance is subject to social and organizational influences. Elaborating on this point, Spenner (1990) has identified several theoretical propositions concerning the social definition of skill or of what is considered effective job behavior. For example, scholars in the constructionist school argue that what is defined as skilled behavior is influenced by interested parties, such as managers, unions, and professions. Ultimately, what constitutes good and poor performance depends on organizational context. The armed forces, for example, place a great deal of importance on performance factors like "military bearing." Identical task performance by an auto mechanic would be valued differently and therefore evaluated differently by the military than by a typical car dealership. In order to capture some of this complexity, Landy and Farr (1983) propose that descriptions of the performance construct for purposes of appraisal should include job behavior, situational factors that influence or interact with behavior, and job outcomes.
Dimensions of Job Performance

Applied psychologists have used job analysis as a primary means for understanding the dimensions of job performance (McCormick, 1976, 1979). There have been a number of approaches to job analysis over the years, including the job element method (Clark and Primoff, 1979), the critical incident method (Flanagan, 1954; Latham et al., 1979), the U.S. Air Force task inventory approach (Christal, 1974), and those methods that rely on structured questionnaires such as the Position Analysis Questionnaire (McCormick et al., 1972; Cornelius et al., 1979) and the Executive Position Description Questionnaire developed by Hemphill (1959) to describe managerial-level jobs in large organizations. All of these methods share certain assumptions about good job analysis practices and all are based on a variety of empirical sources of information, including surveys of task performance, systematic observations, interviews with incumbents and their supervisors, review of job-related documentation, and self-report diaries. The results are usually detailed descriptions of job tasks, personal attributes and behaviors, or both.

One of the more traditional methods used to describe job performance is
the critical incident technique (Flanagan, 1954). This method involves obtaining reports from qualified observers of exceptionally good and poor behavior used to accomplish critical parts of a job. The resulting examples of effective and ineffective behavior are used as the basis for developing behaviorally based scales for performance appraisal purposes. Throughout the 1950s and 1960s, Flanagan and his colleagues applied the critical incident technique to the description of several managerial and professional jobs (e.g., military officers, air traffic controllers, foremen, and research scientists). The procedure for developing critical incident measures is systematic and extremely time-consuming. In the case of the military officers, over 3,000 incident descriptions were collected and analyzed. Descriptions usually include the context, the behaviors judged as effective or ineffective, and possibly some description of the favorable or unfavorable outcomes.

There is general agreement in the literature that the critical incident technique has proven useful in identifying a large range of critical job behaviors. The major reservations of measurement experts concern the omission of important behaviors and a lack of precision in the wording of incidents, which interferes with their usefulness as guides for interpreting the degree of effectiveness in job performance. Moreover, there is some research evidence—and this is pertinent to our study of performance appraisal—suggesting that descriptions of task behavior resulting from task or critical incident analyses do not match the way supervisors organize information about the performance of their subordinates (Lay and Jackson, 1969; Sticker et al., 1974; Borman, 1983, 1987).
In one of a few studies of supervisors' "folk theories" of job performance, Borman (1987) found that the dimensions that defined supervisors' conceptions of performance included: (1) initiative and hard work, (2) maturity and responsibility, (3) organization, (4) technical proficiency, (5) assertive leadership, and (6) supportive leadership. These dimensions are based more on global traits and broadly defined task areas than they are on tightly defined task behaviors. Borman's findings are supported by several recent cognitive models of the performance appraiser (Feldman, 1981; Ilgen and Feldman, 1983; Nathan and Lord, 1983; DeNisi et al., 1984). If, as these researchers suggest, supervisors use trait-based cognitive models to form impressions of their employees, the contribution of job analysis to the accuracy of appraisal systems is in some sense called into question. The suggestion is that supervisors translate observed behaviors into judgments about general traits or characteristics, and it is these judgments that are stored in memory. Asking them via an appraisal form to rate job behaviors does not mean that they are reporting what they saw. Rather, they may be reconstructing a behavioral portrait of the employee's performance based on their judgment of the employee's perseverance, maturity, or competence. At the very least, this research makes clearer the complexity of the connections between
job requirements, employee job behaviors, and supervisor evaluations of job performance.

The Joint-Service Job Performance Measurement (JPM) Project undertaken by the Department of Defense is among the most ambitious efforts at systematic job analysis to date (Green et al., 1991). This is a large-scale, decade-long research effort to develop measures of job proficiency for purposes of validating the entrance test used by all four services to screen recruits into the enlisted ranks. By the time the project is completed in 1992, over $30 million will have been expended to develop an array of job performance measures—including hands-on job-sample tests, written job knowledge tests, simulations, and, of particular interest here, performance appraisals—and to administer the measures to some 9,000 troops in 27 enlisted occupations.

Each of the services already had an ongoing occupational task inventory system that reported the percentage of job incumbents who perform each task, the average time spent on the task, and incumbents' perceptions of task importance and task difficulty. The services also had in hand soldier's manuals for each occupation that specify the content of the job. From this foundation of what might be called archival data, the services proceeded to a more comprehensive job analysis, calling on both scientists and subject matter experts (typically master sergeants who supervise or train others to do the job) to refine and narrow down the task domain according to such considerations as frequency of performance, difficulty, and importance to the job. Subject matter experts were used for such things as ranking the core tasks in terms of their criticality in a specific combat scenario, clustering tasks based on similarity of principles or procedures, or assigning difficulty ratings to each task based on estimates of how typical soldiers might perform the task.
Project scientists used all of this information to construct a purposive sample of 30 tasks to represent the job. From this sample the various performance measures were developed. The JPM project is particularly interesting for the variety of performance measures that were developed. In addition to hands-on performance tests (by far the most technically difficult and expensive sort of measure to develop and administer) and written job-knowledge tests, the services developed a wide array of performance appraisal instruments. These included supervisor, peer, and self ratings, ratings of very global performance factors as well as job-specific ratings, behaviorally anchored rating scales, ratings with numerical tags, and ratings with qualitative tags. Although the data analysis is still under way, the JPM project can be expected to contribute significantly to our understanding of job performance measurement and of the relationships among the various measures of that performance. For our purposes, it is instructive to note how the particular conception of job performance adopted by the project influenced everything else, from job analysis to instrument development, to interpretation of the data. First, it was decided to focus on proficiency (can do) and not on the personal
attributes that determine whether a person will do the job. Second, tasks were chosen as the central unit of analysis, rather than worker attributes or skill requirements. It follows logically that the performance measures were job-specific and that the measurement focus was on concrete, observable behaviors. All of these decisions made sense. The jobs studied are entry-level jobs assigned to enlisted personnel—jet engine mechanic, infantryman, administrative clerk, radio operator—relatively simple and amenable to measurement at the task level. Moreover, the enviable trove of task information virtually dictated the economic wisdom of that approach. And finally, the objectives of the research were well satisfied by the design decisions.

During the 1980s the military was faced each year with the task of trying to choose from close to a million 18- to 24-year-olds, most with relatively little training or job experience, in order to fill perhaps 300,000 openings spread across hundreds of military occupations. It was important to be able to demonstrate that the enlistment test is a reasonably accurate predictor of which applicants are likely to be successful in a broad sample of military jobs (earlier research focused on success in training, not job performance). For classification purposes, it was important to understand the relationship between the aptitude subtests and performance in various categories of jobs. In other words, the picture of job performance that emerged from the JPM research was suited to the organizational objectives and to the nature of the jobs studied. The same job analysis design would not necessarily work in another context, as the following discussion of managerial performance demonstrates.

Descriptions of Managerial Performance

Most of the research describing managerial behavior was conducted between the early 1950s and the mid-1970s.
The principal job analysis methods used (in addition to critical incident techniques) were interviews, task analyses, review of written job descriptions, observations, self-report diaries, activity sampling, and questionnaires. Hemphill's (1959) Executive Position Description Questionnaire was one of the earliest uses of an extensive questionnaire to define managerial performance. The results, based on responses from managers, led to the identification of the following ten job factors.

FACTOR A: Providing a Staff Service in Nonoperational Areas. Renders various staff services to supervisors: gathering information, interviewing, selecting employees, briefing superiors, checking statements, verifying facts, and making recommendations.

FACTOR B: Supervision of Work. Plans, organizes, and controls the work of others; concerned with the efficient use of equipment, the motivation of subordinates, efficiency of operation, and maintenance of the work force.

FACTOR C: Business Control. Concerned with cost reduction, maintenance of proper inventories, preparation of budgets, justification of capital
expenditures, determination of goals, definition of supervisor responsibilities, payment of salaries, enforcement of regulations.

FACTOR D: Technical Concerns With Products and Markets. Concerned with development of new business, activities of competitors, contacts with customers, assisting sales personnel.

FACTOR E: Human, Community, and Social Affairs. Concerned with company goodwill in the community, participation in community affairs, speaking before the public.

FACTOR F: Long-range Planning. Broad concerns oriented toward the future; does not get involved in routine and tends to be free of direct supervision.

FACTOR G: Exercise of Broad Power and Authority. Makes recommendations on very important matters; keeps informed about the company's performances; interprets policy; has a high status.

FACTOR H: Business Reputation. Concerned with product quality and/or public relations.

FACTOR I: Personal Demands. Senses obligation to conduct oneself according to the stereotype of the conservative business manager.

FACTOR J: Preservation of Assets. Concerned about capital expenditures, taxes, preservation of assets, loss of company money.

An analysis of these factors suggests relatively little focus on product quality. Rather, most factors dealt with creating internal services and controls for efficiency and developing external images to promote acceptability of the company in the community.

More recently, Flanders and Utterback (1985) reported on the development and use of the Management Excellence Inventory (MEI) by the Office of Personnel Management (OPM). The MEI is based on a model describing management functions and the skills needed to perform each function. Analyses conducted at three levels of management suggested that different skills and knowledge are needed to be successful at different levels.
Lower-level managers needed technical competence and interpersonal communication skills; middle-level managers needed less technical competence but substantial skill in areas such as communication, leadership, flexibility, concern with goal achievement, and risk-taking; and top-level managers needed all the skills of a middle-level manager plus sensitivity to the environment, a long-term view, and a strategic view. A review of these skill areas indicates that all are general, some are task-oriented, and some, such as flexibility and leadership, are personal traits. The finding that managers at different levels have different skill requirements is also reflected in the research of Katz (1974), Mintzberg (1975), and Kraut et al. (1989). In essence, the work describing managerial jobs has concentrated on behaviors, skills, or traits in general terms. These researchers suggest that assessment of effective managerial performance in terms of specific behaviors is particularly difficult because many of the behaviors related
to successful job performance are not directly observable and represent an interaction of skills and traits. Traits are widely used across organizations and are easily accepted by managers because they have face validity. However, they are relatively unattractive to measurement experts because they are not particularly sensitive to the characteristics of specific jobs and they are difficult to observe, measure, and verify. In many settings, outcomes have been accepted as legitimate measures. However, as measures of individual performance they are problematical because they are the measures most likely to be affected by conditions not under the control of the manager.

Implications

In sum, virtually all of the analysis of managerial performance has been at a global level; little attention has been given to the sort of detailed, task-centered definition that characterized the military JPM research. (One exception is the work of Gomez-Mejia et al., which involved the use of several job analysis methods to develop detailed descriptions of managerial tasks.) This focus on global dimensions conveys a message from the research community about the nature of managerial performance and the infeasibility of capturing its essence through easily quantified lists of tasks, duties, and standards. Reliance on global measures means that evaluation of a manager's performance is, of necessity, based on a substantial degree of judgment. Attempts to remove subjectivity from the appraisal process by developing comprehensive lists of tasks or job elements or behavioral standards are unlikely to produce a valid representation of the manager's job performance and may focus raters' attention on trivial criteria.
In a private-sector organization with a measurable bottom line, it is frequently easier to develop individual, quantitative work goals (such as sales volume or the number of units processed) than it is in a large bureaucracy like the federal government, where a bottom line tends to be difficult to define. However, the easy availability of quantitative goals in some private-sector jobs may actually hinder the valid measurement of the manager's effectiveness, especially when those goals focus on short-term results or solutions to immediate problems. There is evidence that the incorporation of objective, countable measures of performance into an overall performance appraisal can lead to an overemphasis on very concrete aspects of performance and an underemphasis on those less easily quantified or that yield concrete outcomes only in the long term (e.g., development of one's subordinates) (Landy and Farr, 1983).

It appears that managerial jobs fit less easily within the measurement tradition than simpler, more concrete jobs, if one interprets valid performance measurement to require job-related measures and a preference for "objective" measures (as the Civil Service Reform Act appears to do). It remains to be seen whether any approaches to performance appraisal can be demonstrated to
be reliable and valid in the psychometric sense and, if so, how global ratings compare with job-specific ratings.

Psychometric Properties of Appraisal Tools and Procedures

Approaches to Appraisal

As is true of standardized tests, performance evaluations can be either norm-referenced or criterion-referenced. In norm-referenced appraisals, employees are ranked relative to one another based on some trait, behavior, or output measure—this procedure does not necessarily involve the use of a performance appraisal scale. Typically, ranking is used when several employees are working on the same job. In criterion-referenced performance evaluations, the performance of each individual is judged against a standard defined by a rating scale. Our discussion in this section focuses on criterion-referenced appraisal because it is relevant to more jobs, particularly at the managerial level, and because it is the focus of the majority of the research.

In criterion-referenced performance appraisal the "measurement system" is a person-instrument couplet that cannot be separated. Unlike counters on machines, the scale does not measure performance; people measure performance using scales. Performance appraisal is a process in which humans judge other humans; the role of the rating scale is to make human judgment less susceptible to bias and error. Can raters make accurate assessments using the appraisal instruments? In addressing this question, researchers have studied several types of rating error, each of which was believed to influence the accuracy of the resulting rating.
Among the most commonly found types of errors and problems are (1) halo: raters giving similar ratings to an employee on several purportedly different, independent rating dimensions (e.g., quality of work, leadership ability, and planning); (2) leniency: raters giving higher ratings than are warranted by the employee's performance; (3) restriction in range: raters giving similar ratings to all employees; and (4) unreliability: different raters rating the same employee differently, or the same rater giving different ratings from one time to the next.

Over the years, a variety of innovations in scale format have been introduced with the intention of reducing rater bias and error. Descriptions of various formats are presented below, prefatory to the committee's review of research on the psychometric properties of performance appraisal systems.

Scale Formats

The earliest performance appraisal rating scales were graphic scales; they generally provided the rater with a continuum on which to rate a particular trait or behavior of the employee. Although these scales vary in their degree of explicitness, most provide only general guidance on the nature of the underlying
indicates that there is little to be gained from having more than five response categories.

Implications

There are substantial limitations in the kinds of evidence that can be brought to bear on the question of the validity of performance appraisal. The largest constraint is the lack of independent criteria for job performance that can be used to test the validity of various performance appraisal schemes. Given this constraint, most of the work has focused on (1) establishing content evidence by applying job analysis and critical incident techniques to the development of behaviorally based performance appraisal tools, (2) demonstrating interrater reliability, (3) examining the relationships among performance appraisal ratings, estimates of job knowledge, work samples, and performance predictors such as cognitive ability as a basis for establishing the construct validity of performance ratings, and (4) eliminating race, age, and gender as significant sources of rating bias.

The results show that supervisors can give reliable ratings of employee performance under controlled conditions and with carefully developed rating scales. In addition, there is indirect evidence that supervisors can make moderately accurate performance ratings; this evidence comes from studies in which supervisor ratings of job performance have been used as criteria for testing the predictive power of ability tests and from a limited number of studies showing that age, race, and gender do not appear to have a significant influence on the performance rating process.

It should be noted that the distinction between validity and reliability tends to become hazy in the research on the construct validity of performance appraisals. Much of the evidence documents interrater reliabilities.
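Interrater reliability is typically indexed by the correlation between raters' scores. A small numerical sketch (all values here are hypothetical, chosen only for illustration) shows why such consistency alone cannot establish validity: two raters who share the same leniency bias can still agree almost perfectly.

```python
# Hypothetical sketch: interrater reliability as a Pearson correlation.
# Two raters share the same leniency bias (+1 point on a 1-7 scale), yet
# their ratings correlate almost perfectly: high reliability, biased scores.
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# "True" performance of five employees (hypothetical values on a 1-7 scale).
true_scores = [2.0, 3.0, 4.0, 5.0, 6.0]

# Both raters inflate every rating by one point; rater B also adds small
# idiosyncratic wobbles so the two sets of ratings are not literally identical.
rater_a = [s + 1.0 for s in true_scores]
rater_b = [s + 1.0 + d for s, d in zip(true_scores, [0.1, -0.1, 0.0, 0.1, -0.1])]

reliability = pearson(rater_a, rater_b)          # near-perfect agreement
shared_bias = mean(rater_a) - mean(true_scores)  # 1.0: both raters lenient
print(reliability, shared_bias)
```

High interrater agreement here coexists with a systematic one-point inflation: the raters are consistent with each other but not accurate, which is why reliability figures alone cannot settle questions of validity.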
While consistency of measurement is important, it does not establish the relevance of the measurement; after all, several raters may merely display the same kinds of bias. Nevertheless, the accretion of many types of evidence suggests that performance appraisals based on well-chosen and clearly defined performance dimensions can provide modestly valid ratings within the terms of psychometric analysis. Most of the research, however, has involved nonmanagerial jobs; the evidence for managerial jobs is sparse.

The consensus of several reviews is that variations in scale type and rating format have very little effect on the measurement properties of performance ratings as long as the dimensions to be rated and the scale anchors are clearly defined (Jacobs et al., 1980; Landy and Farr, 1983; Murphy and Constans, 1988).2 In addition, there is evidence from research on the cognitive processes of raters suggesting that the distinction between behaviors and traits as bases for

2 On a cautionary note, there are some important methodological weaknesses in the research comparing behaviorally anchored rating scales (BARS) with other types of rating scales. In particular, in some studies the performance dimensions for the scales to be compared were generated by the same BARS methodology, so that what was really being tested was different presentation modes, not different scaling approaches (see Kingstrom and Bass, 1981; Landy and Farr, 1983).
rating is less critical than once thought. Whether rating traits or behaviors, raters appear to draw on trait-based cognitive models of each employee's performance. The result is that these general evaluations substantially affect raters' memory for and evaluation of actual work behaviors (Murphy et al., 1982; Ilgen and Feldman, 1983; Murphy and Jako, 1989; Murphy and Cleveland, 1991).

In litigation dealing with performance appraisal, the courts have shown a clear preference for job-specific dimensions. However, there is little research that directly addresses the comparative validity of ratings obtained on job-specific, general, or global dimensions. There is, however, a substantial body of research on halo error in ratings (see Cooper, 1981, for a review) that suggests that the generality or specificity of rating dimensions has little effect. This research shows that raters do not, for the most part, distinguish between conceptually distinct aspects of performance in rating their subordinates. That is, ratings tend to be organized around a global evaluative dimension (i.e., an overall evaluation of the individual's performance; see Murphy, 1982), and ratings of more specific aspects of performance provide relatively little information beyond the overall evaluation. This suggests that similar outcomes can be expected from rating scales that employ highly general or highly job-specific dimensions.

RESEARCH ON PERFORMANCE APPRAISAL APPLICATION

Chapter 6 provides a summary of private-sector practices in performance appraisal. Our purpose here is to present a general review of the research in industrial and organizational psychology and in the management sciences that contributes to an understanding of how appraisal systems function in organizations.
The principal issues include (1) the role of performance appraisal in motivating individual performance, (2) approaches to improving the quality of performance appraisal ratings, and (3) the types and sources of rating distortions (such as rating inflation) that can be anticipated in an organizational context. The discussion also includes the implications of links between performance appraisal and feedback and between performance appraisal and pay.

Performance Appraisal and Motivation

Information about one's performance is believed to influence work motivation in three ways. The first, formally expressed in contingency theory, is that it provides the basis for individuals to form beliefs about the causal connection between their performance and pay. Two contingency beliefs are important. The first of these is a belief about the degree of association
between the person's own behavior and his or her performance. In Vroom's (1964) Expectancy × Valence model, these beliefs are labeled expectancies and described as subjective probabilities regarding the extent to which the person's actions relate to his or her performance. The second contingency is the belief about the degree of association between performance and pay. This belief is less about the person than about the extent to which the situation rewards or does not reward performance with pay, where performance is measured by whatever means is used in that setting. When these two contingencies are considered together, so goes the theory, it is possible for the person to establish beliefs about the degree of association between his or her actions and pay, with performance as the mediating link between the two.

The second mechanism through which performance information is believed to affect motivation at work is that of intrinsic motivation. All theories of intrinsic motivation related to task performance (e.g., Deci, 1975; Hackman and Oldham, 1976, 1980) argue that tasks, to be intrinsically motivating, must provide the necessary conditions for the person performing the task to feel a sense of accomplishment. To gain a sense of accomplishment, the person needs some basis for judging his or her own performance. Performance evaluations provide one source for knowing how well the job was done and for subsequently experiencing a sense of accomplishment. This sense of accomplishment may be a sufficient incentive for maintaining high performance during the period following receipt of the evaluation.

The third mechanism served by the performance evaluation is that of cueing the individual into the specific behaviors that are necessary to perform well.
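The two contingency beliefs described above combine multiplicatively in Vroom's model; a toy calculation makes the logic concrete (the function name and all probability values below are illustrative assumptions, not figures from the research reviewed here):

```python
# Toy sketch of Vroom's Expectancy x Valence logic (hypothetical numbers).
# expectancy:      subjective probability that effort leads to good measured
#                  performance (the first contingency belief).
# instrumentality: subjective probability that good measured performance
#                  leads to pay (the second contingency belief).
# valence:         value the person places on the pay outcome.

def motivational_force(expectancy, instrumentality, valence):
    """Multiplicative combination: if any link is zero, motivation is zero."""
    return expectancy * instrumentality * valence

# Appraisal believed to track effort, and pay believed to track the appraisal:
strong = motivational_force(0.9, 0.8, 1.0)

# Same pay link, but the appraisal is believed to barely track effort:
weak = motivational_force(0.2, 0.8, 1.0)

print(round(strong, 2), round(weak, 2))  # 0.72 0.16
```

The multiplicative form captures the theory's central claim: if either contingency belief collapses (the appraisal is seen as unrelated to effort, or pay is seen as unrelated to the appraisal), the predicted motivational force of performance information falls toward zero.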
The receipt of a positive performance evaluation provides the person with information suggesting that whatever he or she did on the job in the past is the type of behavior that is valued and is likely to be valued in the future. As a result, the evaluation increases the probability that what was done in the past will be repeated. Likewise, a negative evaluation suggests that past actions were not appropriate. Thus, from a motivational standpoint, the performance evaluation provides cues about the direction in which future efforts should or should not be directed.

The motivational possibilities of performance appraisal are qualified by several factors. Although the performance rating is treated as the performance of the employee, it remains a judgment of one or more people about the performance of another, with all the potential limitations of any judgment. The employee is clearly aware of its judgmental character, and furthermore, it is only one source of evaluation of his or her performance. Greller and Herold (1975) asked employees from a number of organizations to rate five sources of information about how well they were doing their jobs: performance appraisals, informal interactions with their supervisors, talking with coworkers, specific indicators provided by the job itself, and their own personal feelings. Of the five, performance appraisals
were seen as the least likely to be useful for learning about performance. To the extent that many other sources are available for judging performance and the appraisal is not seen as a very accurate source of information, appraisals are unlikely to play much of a role in encouraging desired employee behavior (Ilgen and Knowlton, 1981).

If employees are to be influenced by performance appraisals (i.e., if they are to modify their behavior in response to their appraisals), they must believe that the performance reported in the appraisal is a reasonable estimate of how they have performed during the time period covered by the appraisal. One key factor in accepting the appraisal is the employee's belief in the credibility of the person or persons who completed the review, that is, in their ability to appraise the employee's performance accurately. Ilgen et al. (1979), in a review of the performance feedback literature, concluded that the two primary factors influencing beliefs about the credibility of the supervisor's judgments were expertise and trust. Perceived expertise was a function of the amount of knowledge that the appraisee believed the appraiser had about the appraisee's job and the extent to which the appraisee felt the appraiser was aware of the appraisee's work during the time period covered by the evaluation. Trust was a function of a number of conditions, most of which were related to the appraiser's freedom to be honest in the appraisal (Padgett, 1988) and the quality of the interpersonal relationship between the two parties.

A difficult motivational element related to acceptance of the performance appraisal message is that the nature of the message itself affects its acceptance. There is clear evidence that individuals are very likely to accept positive information about themselves and to reject negative information.
This effect is often credited with the frequent finding that subordinates rate their own performance higher than do their supervisors (e.g., see Holzbach, 1978; Zammuto et al., 1981; Shore and Thornton, 1986). This condition is not a surprising one, but if the focus is on the nature of employees' responses to performance appraisal information, the existence of the discrepancy means that the employee is faced with two primary methods of resolving it: acting in line with the supervisor's rating or denying the validity of that rating. The fact that the latter alternative is very frequently chosen, especially when the criteria for good performance are not very concrete (as is often the case for managerial jobs), is one of the reasons that performance appraisals often fail to achieve their desired motivational effect.

Approaches to Increasing the Quality of Rating Data

Applied psychologists have identified a variety of factors that can influence how a supervisor rates a subordinate. Some of these factors are associated with the philosophy and climate of the organization and may influence the rater's willingness to provide an accurate rating. Other factors are related to the
technical aspects of conducting a performance appraisal, such as the ability of the rater (1) to select and observe the critical job behaviors of subordinates, (2) to recall and record the observed behaviors, and (3) to interpret adequately the contribution of the behaviors to effective job performance. This section discusses research designed to reduce errors associated with the technical aspects of conducting a performance appraisal. Specific areas include rater training programs, behaviorally based rating scales, and variations in rating procedures.

Rater Training

The results of research on the effects of training on rating quality are mixed. A review by Feldman (1986) concluded that rater training has not been shown to be highly effective in increasing the validity and accuracy of ratings. Murphy et al. (1986) reviewed 15 studies (primarily laboratory studies) dealing with the effects of training on leniency and halo and found that average effects were small to moderate. In a more recent study, Murphy and Cleveland (1991) suggest that training is most appropriate when the underlying problem is a lack of knowledge or understanding; for example, training is more necessary if the performance appraisal system requires complicated procedures, calculations, or rating methods. However, these authors also suggested that the accuracy of overall or global ratings will not be influenced by training.

Taking the other position, Fay and Latham (1982) proposed that rater training is more important in reducing rating errors than is the type of rating scale used. They compared the rating responses of trained and untrained raters on three rating scales (one trait-based and two behaviorally based scales). The results showed significantly fewer rating errors for the trained raters and for the behaviorally based scales compared with the trait scales.
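The rating errors counted in studies like these can be operationalized as simple statistics over a matrix of ratings. A sketch follows; the index definitions are common simplifications and the data are hypothetical, chosen only to illustrate how leniency, halo, and restriction in range might be quantified:

```python
# Hypothetical sketch: three common rating-error indices computed from a
# ratings matrix (rows = employees, columns = dimensions, 1-7 scale).
from itertools import combinations
from statistics import mean, pstdev

def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def leniency(ratings, scale_midpoint):
    """Mean elevation of all ratings above the scale midpoint."""
    flat = [r for row in ratings for r in row]
    return mean(flat) - scale_midpoint

def range_restriction(ratings):
    """Spread of employee means; values near zero suggest restricted range."""
    return pstdev([mean(row) for row in ratings])

def halo(ratings):
    """Mean correlation between pairs of dimensions; near 1 suggests halo."""
    dims = list(zip(*ratings))
    return mean(pearson(a, b) for a, b in combinations(dims, 2))

# Four employees rated on three purportedly independent dimensions.
ratings = [[6, 6, 7],
           [4, 4, 5],
           [5, 5, 6],
           [7, 7, 7]]

print(leniency(ratings, 4))      # 1.75: ratings sit well above the midpoint
print(round(halo(ratings), 2))   # 0.96: dimensions barely distinguished
print(round(range_restriction(ratings), 2))
```

Indices like these are what allow a study to report, for example, that one group of raters made errors several times as large as another's.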
The rating errors were one and one half to three times as large for the untrained group. The training was a four-hour workshop consisting of (1) having trainees rate behaviors presented on videotape and then identify similar behaviors in the workplace, (2) a discussion of the types of rating errors made by trainees, and (3) group brainstorming on how to avoid errors. The workshop contained no examples of appropriate rating distributions or scale intercorrelations; the focus was on accurate observation and recording.

Researchers have found that instructing raters to avoid giving similar ratings across rating dimensions, or to avoid giving high ratings to several individuals, may not be appropriate; some individuals do well in more than one area of performance, and many individuals may perform a selected task effectively (Bernardin and Buckley, 1981; Latham, 1988). Thus, these instructions could result in inaccurate ratings. Other researchers have shown that training in observation skills is beneficial (Thornton and Zorich, 1980) and that training can help raters develop a common frame of reference for evaluating ratee performance (Bernardin and Buckley,
1981; McIntyre et al., 1984). However, the training effects documented in these laboratory studies are typically not large, and it is not clear whether they persist over time.

Behaviorally Based Rating Scale Design

Another approach used by researchers to reduce rating errors has involved the use of rating scales that present the rater with a more accurate or complete representation of the behaviors to be observed and evaluated. Behaviorally based scales may serve as a memory or observation aid; if developed accurately, they can provide raters with a standard frame of reference. The strategy of using behaviorally based scales to improve observation might be especially helpful if combined with observation skill training. However, there is some evidence that these scales can unduly bias the observations and the recall processes of raters. That is, raters may attend only to the behaviors depicted on the scales, to the exclusion of other, potentially important behaviors. Moreover, there is no compelling evidence that behaviorally based scales facilitate the performance appraisal process in a meaningful way when these scales are compared with others developed with the same care and attention.

Rating Sequence

Supervisors rating many individuals on several performance dimensions could complete ratings either in a person-by-person sequence or in a dimension-by-dimension sequence (rate all employees on dimension I and then go on to dimension II, etc.). Presumably, a person-by-person procedure focuses the rater's attention on the strengths and weaknesses of the individual, while the dimension-by-dimension procedure focuses attention on the differences among individuals on each performance dimension. A review of this research by Landy and Farr (1983) indicates that identical ratings are obtained with either strategy.
Implications

Although the results are mixed, the most promising approach to increasing the quality of ratings appears to be a combination of factors including good scales, well-trained raters, and a context that supports and encourages the appraisal process. With respect to training, Latham (1988) and Fay and Latham (1982) found that training in the technical aspects of the performance appraisal process, if done properly, can lead to more accurate ratings. Their results suggest that if raters are trained to recognize effective and ineffective performance and are informed about pitfalls such as the influence of false first impressions, they can provide more reliable and accurate ratings than raters who have not received training. The implication is that training in the use of performance appraisal technology can lead to both a more acceptable and a more effective system. However,
training is only one among several factors that can influence the performance appraisal process. As mentioned earlier, the rater's approach to the process is affected by organizational goals, the degree of managerial discretion, management philosophy, and external political and market forces, to name a few. Even if raters have been trained properly and have a good grasp of the rating process, they may distort their ratings on the basis of their perceptions of organizational factors. There is also evidence to suggest that the purpose of the rating may lead to rating distortion.

Context: Sources of Rating Distortion

It is widely assumed that the purpose of rating, or more specifically, the uses of rating data in an organization, affects the appraisal process and appraisal outcomes (Landy and Farr, 1980; Mohrman and Lawler, 1983; Murphy and Cleveland, 1991). That is, it is assumed that the same individual might receive different ratings and different feedback if a performance appraisal system is used to make administrative decisions (e.g., salary adjustment, promotion) than if it is used for employee development, systems documentation, or any of a number of other purposes. Furthermore, it is assumed that the rater will pay attention to different information about the ratee, and will evaluate that information differently, as a function of the purpose of the appraisal system.

One of the major barriers to testing this assumption has been the complexity of actual appraisal systems. Cleveland et al. (1989) documented 20 separate uses for performance appraisal and showed that most organizations use appraisal for a large number of different purposes, some of which may conflict (e.g., salary administration versus employee development). Thus, it is often difficult to characterize the primary purpose or even the major purposes of appraisal in any given setting.
Some authors have suggested separate appraisal systems for different purposes (Meyer et al., 1965), but Cleveland et al.'s (1989) survey suggests that this is rarely done. Most studies of the effects of the purpose of rating involve comparisons between ratings that are used to make administrative decisions and ratings collected for research purposes only (a few studies have examined ratings collected for feedback purposes only). Many of these studies were carried out in the laboratory, although there have been some field studies, particularly in the area of teacher evaluations. The most common finding is that ratings used to make administrative decisions are higher or more lenient than ratings used for research or feedback (Taylor and Wherry, 1951; Heron, 1956; Sharon and Bartlett, 1969; Bernardin et al., 1980; Zedeck and Cascio, 1982; Williams et al., 1985; Reilly and Balzer, 1988). Other studies have failed to demonstrate the effects of rating purpose on rating results (Berkshire and Highland, 1953; Borreson, 1967; Murphy et al., 1984).
There is a broader, mainly speculative or anecdotal literature dealing with the effects of rating purpose on rating outcomes. For example, in a series of interviews with executives, Longenecker et al. (1987) reported frank admissions of the political dimensions of performance appraisal, that is, the conscious manipulation of appraisals to achieve desired outcomes (see Longenecker, 1989; Longenecker and Gioia, 1988). Similarly, interviews conducted by Bjerke et al. (1987) showed clear evidence of conscious manipulation of ratings. This study was conducted in the Navy; the majority of raters reported that they considered the outcomes of giving high or low ratings before filling out appraisal forms, and that they filled out the forms in ways that would maximize the likelihood of outcomes they desired (e.g., promotion for a deserving subordinate) rather than reporting their true evaluations of each subordinate's present performance level.

One reason for the relative lack of field research on rating distortion is that, although thought to be widespread, rating distortion is a behavior that is officially subject to sanction. Longenecker (1989) and Murphy and Cleveland (1991) make the point that rating distortion is often necessary and beneficial; brutally frank ratings would probably do more harm than good. Nevertheless, organizations rarely admit that ratings should sometimes be distorted. As a result, it is difficult to secure cooperation from organizations for research projects that examine the incidence, causes, or effects of rating distortion.

Both Mohrman and Lawler (1983) and Murphy and Cleveland (1991) applied instrumentality models of motivation to explain rating distortion. These models suggest that raters will fill out appraisal forms in ways that maximize the rewards and minimize the punishments they are likely to receive as a result of rating.
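This instrumentality logic can be made concrete as a toy expected-value comparison; the outcomes, probabilities, and payoff values below are entirely hypothetical and serve only to show the shape of the prediction:

```python
# Toy expected-value sketch of the instrumentality account of rating
# inflation. Each rating strategy leads to outcomes with a perceived
# probability and a subjective value to the rater; the models predict the
# rater will choose the strategy with the higher expected value.
# All numbers are hypothetical.

def expected_value(outcomes):
    """Sum of probability x value over (probability, value) pairs."""
    return sum(p * v for p, v in outcomes)

# Honest low rating: likely friction with the subordinate and lower
# subordinate motivation; a small chance management rewards the candor.
honest = expected_value([(0.7, -5.0), (0.2, -3.0), (0.1, 2.0)])

# Inflated rating: smooth day-to-day relations, with a small perceived
# risk of being caught by a higher-level review.
inflated = expected_value([(0.9, 2.0), (0.1, -4.0)])

print(honest < inflated)  # True: inflation is the predicted choice
```

Under these illustrative numbers the payoff structure favors inflation, which is exactly the situation the instrumentality models predict when the costs of honest low ratings fall on the rater.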
Instrumentality theories suggest that the rater's choice to turn in distorted ratings will depend on (a) the value he or she attaches to the outcomes of turning in distorted ratings and (b) the perceived likelihood that turning in distorted ratings will lead to those outcomes. In the context of pay for performance, instrumentality theories suggest that the motivation to distort ratings may be strong. Turning in low ratings could have substantial negative consequences for subordinates (i.e., lower pay), which are very likely to lead to subsequent interpersonal difficulties between supervisors and subordinates and to lower levels of subordinate motivation. By turning in high ratings, supervisors may be able to avoid a number of otherwise difficult problems in their interactions with their subordinates.

Equity theory provides a second, related framework for explaining rating distortion. That is, raters might distort ratings to achieve or maintain equity within the work group. For example, an individual who received a low raise last year, perhaps because of a budgetary shortfall, might receive higher-than-deserved ratings this year in an attempt to restore equity. Similarly, raters might distort ratings to guarantee that salaries stay reasonably constant for individuals
within the work group who perform similar jobs. In both cases, attaining or maintaining parity might be viewed as more important than rewarding present performance.

While these predictions of instrumentality theories are reasonable, empirical research on motivational factors in rating distortion is rare. For example, there is some disagreement about the extent to which negative reactions on the part of ratees will actually affect the rater's behavior (Napier and Latham, 1986). More fundamentally, little is known about the factors actually considered by raters when they decide how to complete their rating forms (Murphy and Cleveland, 1991).

FINDINGS

Job Analysis

Job analysis and the specification of critical elements and standards can inform but not replace the supervisor's judgment in the performance appraisal process.

Managerial Performance

Most of the research on managerial performance describes broad categories of managerial tasks such as leadership, communication, and planning. Managerial performance does not lend itself to easily quantifiable, job-specific measurement: many of the tasks performed by managers are amorphous and not directly observable. The bulk of the existing research on job performance and performance appraisal deals with jobs that are more concrete and have clearer outcome measures; such research is not directly relevant to managerial jobs.

Psychometric Properties

Within the framework of the psychometric tradition, research establishes that performance appraisals show a fairly high degree of reliability and moderate validities. There is some evidence that performance appraisals can motivate employees and can improve the quality and quantity of their work when the supervisor is trusted and perceived as knowledgeable by the employee. Real-world influences such as organizational culture, market forces, and rating purposes can work to distort performance appraisals.
The research does not provide clear guidance on which scale format to use or whether to rely on global or job-specific ratings, although a consensus seems to be building that scale type and scale format are matters of indifference,
all things being equal. For example, one line of research suggests that rating scale format and the number of rating categories are not critical as long as the dimensions to be rated and the scale anchors are clearly defined. Another line of research suggests that raters tend to rely on broad traits in making judgments about employee performance, making the old distinctions between trait scales and behavioral scales appear less important. Although behaviorally based scales have not been shown to be psychometrically superior to other scales, some researchers suggest that behaviorally anchored rating scales offer advantages in providing employees with feedback and in establishing the external and internal legitimacy of the performance appraisal system. There is some evidence that rater training in the technology of performance appraisal tools and procedures can lead to more accurate performance ratings.

In sum, the research examined here does not provide the policy maker with strong guidance on choosing a performance appraisal system. Instead, the literature presents the complexities and pitfalls of attempting to quantify and assess what employees, particularly managers and professionals, do that contributes to effective job performance. All of the behaviorally based appraisal systems require a significant amount of initial development effort and cost, are not easily generalizable across jobs, apparently offer little if any psychometric advantage, and require significant additional effort as jobs change. The primary value of behaviorally based appraisal is that it appears relevant to both the supervisor and the employee, and it may provide an effective basis for corrective feedback.

GAPS IN EXISTING RESEARCH

A critical gap in the empirical research on performance appraisal relates to the influence of the rating context on the rating outcome.
How does context affect the relationship between the supervisor and the employee, and how does the nature of this relationship modify the supervisor's willingness to provide reliable ratings? Moreover, how specifically does the purpose of the rating change the rater's willingness to be accurate? Although the literature on performance appraisal discusses a variety of theoretical positions that bear on these questions, there is little convincing evidence on the extent or the causes of distortion in rating. As noted earlier, existing theory suggests that pay-for-performance systems will be especially prone to distortion, particularly in contexts in which base pay is regarded as unfairly low. However, it is unlikely that an adequate body of evidence could be assembled to document this phenomenon.

A second gap, already noted above, concerns managerial performance
appraisal. The existing body of research deals with different (i.e., lower-level) jobs and, more important, with different types of appraisal systems. The federal system has characteristics of both the traditional top-down system and management-by-objectives systems (e.g., the use of elements, standards, and objectives defined by the supervisor represents a mix of concepts from both types of systems). It is not clear whether either the body of research on lower-level jobs in the private sector or the research on managerial appraisal and management-by-objectives systems is fully relevant to the federal system.

A third gap has to do with the implications of the reliability, validity, and other psychometric properties of appraisal systems for the behavior of employees and the effectiveness of organizations. With few exceptions, the research does not establish any performance effects of performance appraisal. The preponderance of evidence relates to the consistency of measurement, not its relevance. Research documenting the impact of appraisal systems on organizations and their members is sparse, fragmented, and often poorly done. Empirical evidence is needed to determine whether organizations or their members actually benefit in any substantial way when appraisals are done, beyond providing legitimacy and reinforcing belief systems.