The Nature of the Evidence
We have been asked to assess the role of performance appraisals and pay for performance systems in promoting excellence at work and to identify promising models for potential application to the federal work force. A number of major evidentiary obstacles impede scientific study of these issues, for reasons that go well beyond the scholar's perennial lament that more data are needed. As this chapter articulates, there are conceptual and methodological minefields implicit in this charge. At the same time, the literature has a number of strengths in terms of methodological rigor and relevance to organizational practice. These strengths and limitations need to be made explicit so that readers of this report can accurately gauge the existing scientific evidence bearing on performance appraisal and pay for performance systems.
This chapter does not aim to provide a comprehensive introduction to methodology in the social and behavioral sciences. Rather, it briefly reviews some of the evidentiary issues that arose in pursuing the committee's charge and summarizes the different kinds of research methods and data that have been brought to bear on performance appraisal and pay for performance plans. The diverse and fragmentary nature of the research evidence available to us turned out to have important implications for how we carried out the study and formulated our conclusions.
THE DIVERSITY OF RELEVANT THEORIES AND METHODS
Understanding how organizations appraise performance and the extent to which they allocate rewards on the basis of performance involves processes
operating at numerous levels. The issues involved range from the intrapsychic (e.g., memory and attention allocation) to the interpersonal (e.g., affect, group dynamics) to the organizational, interorganizational, and even societal level (e.g., organizational structure, the role of money, legal constraints on performance appraisal and pay systems). Accordingly, the kinds of research relevant to our charge also run the gamut: research on the nature of jobs and job performance; investigations into the accuracy and context of human judgment; analyses of the impact of pay on motivation and behaviors; research on how organizational structure and environment influence personnel practices; studies of the effects of performance appraisal and pay systems on organizational functioning; proprietary surveys on attitude and climate undertaken by specific companies; and everything in between.
Because the issues of interest to the committee lie at the interstices between different theories, disciplines, audiences, and levels of analysis, there is not a single predominant type of research evidence for us to evaluate. Rather, we are faced with the task of trying to compare, contrast, and synthesize very different kinds of evidence relevant to the charge.
Not all of the different kinds of evidence address the same issues or even employ the same standards of proof. Each type has its strengths and its limitations, and each research tradition implies its own definition of what kinds of evidence are most relevant and useful. In this section, we briefly summarize the quality of existing evidence and discuss a number of challenges faced by the committee in reviewing, synthesizing, and drawing inferences from such diverse strands of research.
One of the clear areas of strength is the research on performance appraisal. There is an enormous literature, stretching back well over half a century, on the assessment of work performance. Although the particular topics that have captured the attention of researchers have changed from time to time, the sheer accumulation of empirical work, laboratory studies, surveys of practice, and analytical models provides a rich backdrop for contemporary thinking about the use of performance appraisal. An additional, although sometimes unrecognized, virtue of the work on performance appraisal in recent decades derives from the pressures of litigation under Title VII of the Civil Rights Act of 1964. Performance appraisal systems have had to be defended in high-stakes situations, a fact that has made researchers in the field more cognizant of actual practice and the problems of evaluating performance in applied settings.
Pay for performance is a much younger research field. Although there is a good deal of suggestive theory, there is not an equivalent accumulation of empirical research. The field is, however, energetic and protean. Pay for performance compensation strategies have begun to draw the attention of
students of economics, finance, accounting, sociology, psychology, management, and organizational science, as well as compensation consultants. The topic is fundamentally interdisciplinary, and that quality provides its own richness in terms of the variety of viewpoints and methods that are being brought to bear. This is an important strength—if also a complication—for it gives us a variety of clues to the hypothesized links between pay and performance.
In addition to the pertinent scholarly theory and research, there is also an extensive body of clinical knowledge and experience with organizations that is by no means irrelevant to our task. Hence, we have looked for points of convergence between the findings of detached scholarly studies and the intimate understandings of clinicians and practitioners. Furthermore, although we lack the wealth of empirical data that would permit us to make precise predictions about the effects of performance-based pay, we are not wholly ignorant about its effects. Our review of existing theory, diverse types of research, and clinical experience suggests that there are certain preconditions that appear to be necessary (though not sufficient) for pay for performance to do more good than harm: for instance, ample performance-based rewards available to be distributed; participants who are knowledgeable about the linkage between their actions and rewards received; credible indices of performance; and incentives for those conducting performance appraisals to do the job well, rather than incentives to avoid differentiating among subordinates. Because some of these necessary preconditions may not be satisfied in many government contexts (see Chapters 2 and 7), there is reason to question whether pay for performance would have beneficial effects there.
The evidence relating to performance appraisal and to pay for performance compensation systems is discussed in detail in Chapters 4 through 7. On a more general plane, however, there are a number of issues and evidentiary challenges that merit the reader's attention, ranging from how to gauge the effectiveness of performance-based pay to questions of causality.
Criteria for Gauging the Effectiveness of Personnel Practices
The Office of Personnel Management wishes to identify performance appraisal and pay systems that "work." However, there are so many conceivable definitions of what works—so many different ways of conceptualizing, measuring, and judging the effectiveness of a given performance appraisal and pay system—that it is difficult to render scientific assessments in this domain with confidence. In the course of the committee's review of the evidence, it became clear that there are at least four types of benefits that the theoretical and empirical literatures have posited in discussing performance-based pay systems: (1)
positive effects on the behaviors of individual employees (including decisions to join an organization, attend, perform, and remain attached); (2) increasing organization-level effectiveness (including cost-effectiveness); (3) facilitating socialization and communication (by transmitting expectations, goals, and role requirements); and (4) ensuring that the way the organization compensates, manages, and treats its employees is perceived as legitimate by important internal and external constituencies.
This is clearly a diverse set of criteria for gauging the effectiveness of an organization's performance appraisal and pay system. Agreeing on the relevant one(s) is hardly straightforward, especially because the criteria that may be important to scientists or academicians interested in performance appraisal and pay systems may not correspond to the ones of interest to managers and policy makers.
Moreover, the diverse criteria make radically different evidentiary demands. Marshaling evidence for the effectiveness of a performance appraisal system in facilitating socialization and communication, for example, would be fairly straightforward: careful surveys of supervisor and employee attitudes would satisfy most observers. The criterion of enhanced effectiveness at the organizational level, however, is largely (although not entirely) beyond the reach of social science analysis at present. Psychologists do not yet know much about the links between individual performance and group performance; neither psychology nor economics offers much empirical evidence of the effects of improved performance on productivity, although both disciplines have produced some interesting theory (see Hartigan and Wigdor, 1989). Accordingly, our conclusions about the organization-level effects of performance appraisal and pay for performance systems are necessarily guarded, based as they are on analogy to other compensation systems rather than direct evidence.
Validity and Reliability
Even if the relevant dimensions or criteria of effectiveness can be specified, however, they remain to be measured. In assessing the value of social science evidence, researchers emphasize two factors: validity and reliability.
Put simply, validity concerns the relevance or appropriateness of the measurement. The concept of validity is often expressed in terms of whether one is measuring what one intends to measure (e.g., Nunnally, 1967). Recent definitions focus on the appropriateness and meaningfulness of the inferences drawn from measurement data, such as test scores or performance ratings (American Educational Research Association et al., 1985). Reliability concerns the extent to which the measurement is consistent or dependable—that is, whether repeating a measurement in the absence of significant changes would yield the same measurement outcome. Both validity and reliability point to a process
of gathering evidence. More extensive discussion of issues of validity and reliability is provided in Chapter 4.
Clearly, validity and reliability are interrelated: a valid measure of Brand X word-processing skill, for instance, presumes a reliable one (e.g., computers free of malfunctions and operating with the same software). The point to be made here is that there are often trade-offs between the reliability and validity of evidence concerning the issues at hand. Laboratory experiments looking at performance appraisals or the impact of contingent rewards on behavior are often able to control for confounding factors and measure the relevant variables much more reliably than can be accomplished in field studies of real organizational settings. For example, participants in lab studies exposed to identical stimuli, such as film clips of a person performing a task adroitly and then inadequately, provide highly consistent evaluations of the good and poor behaviors. However, it is difficult to gauge the external validity (or generalizability) of the appraisal tools from evidence gathered in laboratory settings—that is, whether they would be as accurate when used to evaluate job performance in operational settings.
It is equally difficult to know what inferences to draw from the limited number of field-based and statistically controlled studies examining the consequences of tying rewards to performance. For instance, there is an increasing literature in economics assessing the effects of performance-based rewards on organizational performance among top managers and executives. There is also some empirical work looking at related issues in professional sports, and there are studies showing that salespeople tend to sell more when at least some of their compensation is based on commission. Needless to say, generalizing from this evidence to many of the managerial jobs that are the focus of our work is tenuous. To put the matter simply: work settings in which there are no problems finding valid and reliable measures of performance are likely not to be very interesting for our purposes. In these settings, pay is almost invariably based on performance. Examples would include door-to-door sales, piece-rate sewing of garments, and prize fights. However, few jobs within federal government agencies permit such concrete measurement, thereby making validity and reliability concerns much more salient (and much more matters of perception than of statistical reality).
Sources and Quality of Available Data
Knowing what one wants to measure and measuring it well are only part of the challenge. One's measures are only as good as the sample from which they are drawn. A perfectly valid and reliable public opinion survey administered to a random sample of adults entering and leaving the Veterans Administration, for instance, may be of limited value in predicting or understanding the attitudes of the U.S. population as a whole.
Studying organizational phenomena presents a number of challenges regarding data quality. Organizations often regard their performance appraisal and compensation policies as privileged information and are reluctant or unwilling to divulge information about them to researchers. Consequently, a considerable amount of information regarding prevailing practices in the performance appraisal and pay for performance area derives from three sources: surveys conducted by business associations, consulting organizations, and the like (e.g., the Wyatt Company and HayGroup surveys discussed extensively in Chapter 6); case studies of individual companies by researchers; and knowledge obtained by organizational consultants.
This state of affairs raises several possible problems in interpreting the available evidence. First, organizations that have been or are willing to share information on their practices with researchers need not be representative of any clearly defined population of interest. For instance, it seems likely that there is more information available about the personnel policies of an organization if it is in the public sector, publicly traded, or otherwise highly visible; has been taken to court; regards itself as a leader in the personnel field; belongs to industry or professional associations; or is large (and therefore more able to absorb the costs of complying with requests for information).
Statisticians refer to this problem as sample selection bias, whereby some observations are systematically excluded from the sample available for analysis. Sample biases can take two forms: the sample may be biased with respect to the dependent variables or outcomes of interest (typically referred to as censoring bias) and/or with regard to the independent variables or explanatory factors presumed to be at work (typically referred to as truncation bias). Both types of sample selection bias may be at work in our case, confounding the inferences we wish to draw. We are interested in understanding the factors that determine why organizations appraise performance and allocate pay differently and what consequences those differences have. It seems likely that the available data underrepresent organizations that appraise performance informally, that do not pay for performance, and that are performing poorly. (There are a number of justifications for this assumption. Poorly performing organizations are unlikely to respond to requests for information for many reasons, not the least of which is organizational mortality: when an organization is performing poorly enough, it ceases to exist, thereby precluding study. Moreover, we assume that organizations with elaborate performance appraisal and pay for performance systems are more likely to advertise the fact, perceiving their activities to be more legitimate and businesslike. As we note below, personnel professionals reporting on the organizations in which they work are also likely to have reasons to be partisan.) The extant data are also likely to overrepresent organizations of particular types, thereby resulting in truncation bias when it comes to examining the role of some explanatory factor (such as organizational size) in influencing how performance is appraised and pay is allocated and with what effects. After
all, if only large organizations were to permit researchers to study them, what could be said scientifically about small organizations?
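The distortion that censoring bias introduces can be illustrated with a small simulation (all numbers here are hypothetical): if poorly performing organizations fail to survive or decline to respond, the surviving sample overstates average performance.

```python
import random

random.seed(0)

# Hypothetical population of 10,000 organizations with performance
# scores drawn from a standard normal distribution (mean zero).
population = [random.gauss(0.0, 1.0) for _ in range(10_000)]

# Censoring: only organizations performing above a threshold survive
# long enough (or are willing) to respond to a survey.
observed = [p for p in population if p > -0.5]

pop_mean = sum(population) / len(population)
obs_mean = sum(observed) / len(observed)

print(f"population mean performance: {pop_mean:+.2f}")  # near zero
print(f"observed-sample mean:        {obs_mean:+.2f}")  # shifted upward
```

Any inference about "what well-run organizations do" that is drawn only from the observed sample inherits this upward shift.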
It is important to emphasize that the mere fact that a sample is nonrandom in some respects does not make it unrepresentative or useless. The extent of bias depends on the population to which researchers wish to make generalizations. Results from the above-mentioned hypothetical survey outside the Veterans Administration may be perfectly appropriate for making generalizations to some populations.
In our case, much of the relevant organizational evidence bearing on performance-based pay comes from private-sector corporations. Leaving aside all the complications of measurement, causal inference, and the like discussed throughout this chapter, even if we could make perfectly valid and precise inferences about corporations, we would still face the difficult issue of whether those conclusions can safely be generalized to workers in federal agencies. (That question, of course, is hardly idiosyncratic to the work of this committee; after all, scientific debates occur constantly about the relevance of specific evidence from animal studies for human health and behavior.)
In addition, as we noted earlier, much of the data derive from clinical knowledge and experience. Although this sort of data can be informative, it is important to acknowledge the potential limits of clinical expertise. The opinions of managers about their companies, or the assessments of paid consultants about organizations for which they have consulted, can be illuminating, but the potential for bias and conflict of interest must also be recognized. Furthermore, relying on the "excellent company" method to make inferences about the effectiveness of organizational practices is perilous. The mere observation that many organizations with a reputation for success appraise performance or allocate pay in a particular way does not constitute scientific evidence or a basis for prescription—any more than the fact that most successful companies have male chief executive officers would justify a recommendation against promoting women to top positions.
Two other related concerns should be noted about the sources and quality of available data bearing on performance-based pay. First, experimental control or random assignment of subjects to treatments is often difficult or impossible to obtain in studying organizational phenomena. Firms typically do not design or alter their appraisal or pay systems randomly over time, but rather in response to real or perceived dilemmas.
Second, it has been well documented that organizational intervention as such has effects on the behavior of organizational members. Physical scientists have documented that even physical phenomena are altered by the very process of scientific observation and measurement. However, in the organizational world, this problem, frequently called the Hawthorne effect, is much more severe and more difficult to disentangle. The mere entry of researchers or consultants into an enterprise or a change by the organization in its personnel
system can be enough to occasion large attitudinal and behavioral changes. The reactivity of organizations to policy changes and to external scrutiny further obscures inferences about the consequences of performance appraisal and pay systems for organizational effectiveness.
Determinants Versus Consequences
We have quite a bit of data describing organizational and industrial variations in performance appraisal and pay systems, and there are numerous respected consulting firms and other organizations (e.g., The Conference Board) in the business of tabulating and disseminating such data by size of firm, type of business, and so on. Yet one cannot infer from such evidence alone that, say, a given compensation plan is appropriate for other organizations of that size, technology, or industry, unless one is prepared to assume that "what is should be," and that the prevalence of a particular practice among organizations of a given type suggests some adaptive value of that practice. A considerable body of recent research suggests that inertia is a powerful force in organizations; many contemporary structures and practices appear to be residues or carryovers from the circumstances that prevailed when a particular organization was founded, rather than arrangements well suited to its contemporary environment (see Hannan and Freeman, 1984).
Much of the evidence concerning differences in performance appraisal systems, pay systems, the relationship between them, and their link to performance, which we summarize in this report, is based on studies that are cross-sectional or nearly cross-sectional (i.e., very short time series). This evidence is thus of limited power in making statements about causal relationships. Yet even if these difficulties could be surmounted and a causal link established between performance-based pay and some dimension of organizational performance, tricky issues remain that cloud the interpretation of the findings and their practical relevance.
First, inferences about the effects of performance-based pay plans on organization- or individual-level outcomes are only as valid as the statistical model used to look at the question. Any judgment about performance is always a judgment about performance compared with something. In statistical studies, that something is specified by control variables. If important control variables are omitted, or if the effects of the variables of interest are confounded with included or omitted control variables, then it can be perilous to make inferences about how some factor affects performance.
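The danger of omitted control variables can likewise be sketched in a few lines of simulation (the variables are hypothetical): here a background factor raises performance directly and also makes adoption of merit pay more likely, so a naive comparison attributes to merit pay an effect it does not have.

```python
import random

random.seed(1)

n = 5_000
# Hypothetical confounder (e.g., organizational resources) that
# raises performance directly AND predisposes adoption of merit pay.
resources = [random.gauss(0.0, 1.0) for _ in range(n)]
merit_pay = [1 if r + random.gauss(0.0, 1.0) > 0 else 0 for r in resources]

# By construction, merit pay has NO true effect on performance here.
performance = [r + random.gauss(0.0, 1.0) for r in resources]

def group_mean(flag):
    scores = [p for p, m in zip(performance, merit_pay) if m == flag]
    return sum(scores) / len(scores)

# Naive comparison that omits the confounder: merit-pay adopters
# look substantially better, despite a true effect of zero.
naive_effect = group_mean(1) - group_mean(0)
print(f"naive estimated merit-pay effect: {naive_effect:+.2f}")
```

Controlling for the confounder (for example, by comparing only organizations with similar resources) would shrink this spurious effect toward zero.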
Another reason why empirical evidence regarding the effects of pay for performance can be misleading concerns unobserved heterogeneity. Even in
studying biochemical processes, variations across individuals and environments can make a big difference. In trying to assess statistically the impact of linking pay to performance, the accuracy of one's conclusions depends critically on how well the relevant heterogeneity has been taken into account. We have some theory and past research to guide us in specifying what the relevant dimensions of heterogeneity might be, but we actually know relatively little. The effect of pay for performance is likely to vary considerably across individuals (e.g., as a function of wealth, age, values, and the like), jobs, organizational contexts, dimensions of performance, time periods, and locales. Failure to capture this heterogeneity can produce misleading inferences.
One other difficulty in formulating policy or managerial prescriptions is that we might be able to document felicitous effects of performance-based pay systems without necessarily understanding why those effects obtain, and therefore how likely they are to persist. In particular, a number of different streams of research suggest that how organizations do things often matters at least as much as what they do (see Chapter 7). The literature on procedural justice, for instance, indicates that procedures for allocating rewards matter a great deal, quite apart from the actual magnitude of rewards allocated. Similarly, surveys of worker satisfaction and commitment, as well as field research on gainsharing, employee stock ownership plans, and the like, routinely report that such factors as the extent of communication, participation, openness, flexibility, and "humaneness" surrounding employment and reward systems make a strong independent contribution to workers' subjective well-being, attachment, and (in some cases) work product (e.g., Halaby, 1986; Rosen, 1986). A common theme running through the presentations made by industry representatives to our committee was that their companies take the process very seriously.
These process effects are likely to be particularly elusive to researchers. Moreover, given the importance of belief systems, organizations may be extremely reluctant to permit their practices to be studied explicitly, since it may be preferable to have current practices taken for granted than to run the risk of uncovering evidence that those practices are dubious. The point here is that the ideology of pay for performance, based on fair and accurate performance appraisals, serves important functions. Accordingly, it may be no less difficult for managers and workers than for researchers and policy makers to separate the facts from beliefs about this topic.
In reviewing these various issues, we do not wish to overstate the complexities involved in weighing the evidence on performance appraisal and pay for performance. The issues raised in this chapter are generic to studies of social and organizational phenomena. Indeed, in some respects, there is a larger and higher-quality body of research bearing on these concerns than is often the case
in studying applied social science questions. We have surveyed the nature of the evidence simply to underscore the need for caution (and additional research) in drawing policy inferences from the scientific evidence and prevailing practice and to explain the general approach we take throughout this report in weighing the evidence and drawing conclusions from it.
In carrying out the study we built upon our own diversity, which went well beyond simple differences in disciplinary training or occupation, to encompass fundamental differences in approach to issues in human motivation and behavior, the nature of organizations, and the relevant questions to be asked about performance appraisal and pay for performance. Some of us viewed the problem at the individual level of analysis; others were concerned with organizational effectiveness and change. Some employed criteria of individual or organizational performance, while others interpreted the issues in terms of procedural justice or the role of performance appraisal and pay for performance in legitimizing organizations.
We have been catholic in pulling together evidence and information that might bear on the effectiveness of performance appraisal and performance-based compensation systems, taking account of theory, empirical research, and clinical studies not only from many disciplines but also from any research topics that seemed relevant. We have supplemented formal evidence with as much information about current practices in private-sector firms as we could reasonably gather in the limited time available for the study.
For example, our findings about performance appraisal and pay for performance rely on and exploit existing knowledge about organizations available from related areas. We know a great deal about how organizations vary along a number of other dimensions of their personnel systems, as well as some of the consequences of those differences. For instance, we know what types tend to pay higher wages, to promote more from within, to provide on-the-job training, to emphasize seniority more in pay and promotion decisions, and so on. We also know that personnel practices tend to be part of a larger system governing employment. Accordingly, it would be surprising if the insights we have gleaned from this other research were irrelevant to understanding the determinants and consequences of performance appraisal and performance-based pay systems.
Not all of this evidence will meet rigorous standards of scientific proof. We have been careful throughout the text to identify the type of evidence and the level of confidence we feel that it merits. But the fact is that managers in the private and public sector routinely have to make choices about management practice in the absence of definitive evidence. Federal leaders are currently working on compensation policy and will soon revise the Performance Management and Recognition System. In the end, we judged it better to paint as rich a picture as possible. We felt that a careful weaving together of the many kinds of evidence and experiential data would provide useful insights into general
tendencies or likelihoods, if not precise predictions about specific outcomes. In the language of statistical inference, we have aimed to draw rather broad confidence intervals around what is likely to happen in any given organizational setting, rather than seeking to offer point estimates. Stated more colloquially, answering a policy maker's query with "it depends" can nonetheless be useful, if one can articulate the factors on which it depends.
Finally, although we are confident that federal policy makers can benefit from a careful assessment of the scientific and impressionistic evidence on performance appraisal and pay for performance, we are also mindful of the broader political and normative concerns impinging on personnel management in the context of the federal civil service. By their very nature, governmental institutions rely significantly on public trust. Such institutions are predicted by organizational theorists to adopt elaborate evaluation rituals because of a need for perceived legitimacy in the eyes of constituencies (Meyer and Rowan, 1977). Indeed, government bureaus have long sought to bolster their public image by emulating what is thought to be state-of-the-art practice in the private sector. DiPrete (1989:81) suggests that even more than 100 years ago, "a principal argument for the merit system was that it would put the personnel affairs of government on a more businesslike footing" by emulating prevailing corporate practice.
We recognize that current efforts to reform federal personnel policies involve an effort to increase the perceived legitimacy of the federal government. It may be public perceptions of how performance is appraised and pay administered within the civil service that matter more than anything else. We also recognize that the legitimation aspects of performance appraisal and pay for performance may to some extent work at cross-purposes with other functions of those practices—for instance, practices that adhere to some idealized business model might provide the greatest legitimacy to a given agency but not necessarily do the best job of communicating its organizational goals or motivating its employees. The fact that personnel systems have important symbolic purposes, which may in some cases be in conflict with other important objectives, prompts us to be cautious about making suggestions for radical changes in prevailing practice within the federal civil service.