Edition 37, Human Resources

Supervisor Ratings of Job Performance: A Look to Increasing Effectiveness

By: David J. Woehr
The University of North Carolina at Charlotte

Sylvia G. Roch
State University of New York at Albany

A positive trend in recent selection and assessment research is an increased focus on multiple aspects of job performance. Specifically, job-relevant performance has expanded beyond job-specific task behavior to include other components such as organizational citizenship behavior, counterproductive work behavior, and adaptive behavior. Yet a key question in the literature is how best to assess and measure these different aspects of performance. On the surface this would seem a simple matter. How hard can it be to differentiate and assess levels of job performance across employees? In fact, most managers as well as researchers would agree that identifying and collecting relevant, psychometrically sound, and practical measures of job performance is a significant challenge.

At a very basic level, measures of individual job performance generally fall into two broad categories: objective and subjective. Objective measures include records of job-related outcomes (e.g., production counts, sales, accidents, salary, job level). Subjective measures, by contrast, rely on the evaluative judgment of another person (i.e., ratings and rankings). While objective measures might appear to be the preferred method for assessing performance, there is general agreement that they are simply not feasible in most settings. Consequently, the use of subjective measures as criteria in selection and assessment research has been, and continues to be, far more common. Ironically, even for jobs for which objective measures should ostensibly be readily available (e.g., sales jobs), subjective measures are often the criterion of choice. In a recent review of predictors of the job performance of salespeople, for example, 57% of the validation studies reviewed used subjective ratings of sales performance rather than actual sales data.

Subjective evaluations of performance may be obtained from a variety of sources (e.g., supervisors, peers, subordinates, customers). However, supervisory ratings of individual job performance are the most frequently used and studied means of assessing employee performance. Consequently, a key concern for both research and practice is the extent to which supervisors can and will differentiate among employees and rate performance accurately. Toward that end, a great deal of research has strived to improve the quality of performance ratings. Yet despite the volume of research devoted to this topic, many questions about the value and appropriateness of performance ratings as measures of job performance remain. Thus, an uncomfortable inconsistency presents itself. On the one hand, supervisory performance ratings serve as the predominant measure of job performance in the vast majority of selection and assessment research; on the other hand, questions about the quality and usefulness of performance ratings as a measure of job performance abound. At the same time, ratings of job performance are among the most predictable criterion measures available. Most studies investigating the criterion-related validity of predictors use performance ratings as the criterion, and many predictors, such as cognitive ability, structured interviews, and situational judgment tests, show the expected strong correlations with performance ratings. In addition, the research on improving the quality of performance ratings highlights a number of interventions that can serve to maximize rating veridicality and thus inform good practice. These interventions fall into three categories: rating scale format, rater training, and rater motivation.

Rating Scale Format

Rating scale interventions aimed at improving performance ratings have a long history in the literature. Not surprisingly, a great deal of research in the 1960s and 1970s concentrated on direct comparisons of ratings obtained via specific rating formats. Results of this research, however, tended to indicate that scale format had little impact on rating outcomes (evaluated primarily in terms of the psychometric properties of ratings and interrater agreement). Note, however, that although much of the rating scale literature indicates that specific scale format may not lead to major differences in rating outcomes, this conclusion is predicated on the use of job-relevant, professionally developed scales. So while it may not matter whether one uses behaviorally anchored rating scales, behavioral observation scales, or graphic rating scales, it does matter that the scales used are based on a thorough job analysis and incorporate clear, behaviorally based definitions of the constructs to be evaluated.
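
For readers who want the evaluation standards above made concrete, the following is a minimal sketch of how interrater agreement might be quantified for a small set of ratings. The data are hypothetical, and the two indices shown (a simple interrater correlation and mean absolute disagreement) are illustrative assumptions for this sketch, not the specific indices used in the studies discussed here; formal research typically relies on more raters and indices such as the intraclass correlation.

```python
# Illustrative sketch: quantifying interrater agreement for performance ratings.
# All ratings below are hypothetical.
import numpy as np

# Rows = ratees; columns = two supervisors rating the same ratees on a 1-7 scale.
ratings = np.array([
    [5, 6],
    [3, 3],
    [7, 6],
    [4, 5],
    [2, 2],
    [6, 7],
])

rater_a, rater_b = ratings[:, 0], ratings[:, 1]

# Pearson correlation between raters: a simple (if rough) agreement index.
r = np.corrcoef(rater_a, rater_b)[0, 1]
print(f"Interrater correlation: {r:.2f}")

# Mean absolute difference: expresses disagreement directly in scale points.
mad = np.mean(np.abs(rater_a - rater_b))
print(f"Mean absolute disagreement: {mad:.2f} scale points")
```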

Driven to a large extent by management practices in high-visibility organizations, recent research has focused on relative performance ratings requiring a forced distribution of ratees into predetermined performance categories. Jack Welch, the former CEO of General Electric, advocated that an organization should assign its employees to three categories: the top 20%, the middle 70%, and the bottom 10%. Welch further advocated that employees assigned to the bottom 10% be terminated and that this process be repeated yearly, thus continually raising the bar of performance and increasing the quality of employees. Other companies such as Cisco Systems, Hewlett-Packard, Microsoft, Lucent, Conoco, EDS, and Intel have adopted management systems based on this idea. In fact, it has been estimated that as many as one-third of U.S. corporations evaluate employees with systems that pit them against their colleagues. However, a 2005 survey conducted by the Society for Human Resource Management found that of the 330 human resource professionals surveyed, only 43 reported that their companies used forced rankings/distributions, and only 2 reported that forced rankings always lead to terminations.
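
To make the mechanics of such a system concrete, here is a minimal sketch of a 20/70/10 forced distribution, assuming a simple ranked list of hypothetical employees with hypothetical scores; real implementations differ in how they handle ties, rounding, and unit-level calibration.

```python
# Illustrative sketch of a 20/70/10 forced distribution ("vitality curve").
# Employee names and scores are hypothetical; tie-handling and rounding
# rules vary across real implementations.

def forced_distribution(scores, top=0.20, bottom=0.10):
    """Assign each employee to a 'top', 'middle', or 'bottom' bucket by rank."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    n = len(ranked)
    n_top = round(n * top)
    n_bottom = round(n * bottom)
    buckets = {}
    for i, name in enumerate(ranked):
        if i < n_top:
            buckets[name] = "top 20%"
        elif i >= n - n_bottom:
            buckets[name] = "bottom 10%"
        else:
            buckets[name] = "middle 70%"
    return buckets

scores = {"Ana": 91, "Ben": 78, "Cho": 85, "Dev": 62, "Eva": 88,
          "Fay": 70, "Gus": 74, "Hal": 55, "Ida": 80, "Jon": 67}
for name, bucket in forced_distribution(scores).items():
    print(f"{name}: {bucket}")
```

Note that the bottom bucket is filled no matter how well the group as a whole performs, which is precisely the property that drives the negative reactions discussed next.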

Research suggests that forced distribution scales may offer psychometric advantages over more traditional scales. Two recent reviews comparing forced distribution and traditional rating scales found that forced distribution scales demonstrate stronger correlations with a variety of criteria, such as production quantity, sales volume, general mental ability, verbal ability, quantitative ability, perceptual speed, and spatial/mechanical ability. However, while forced-distribution rating systems in which employees are evaluated relative to each other may yield short-term benefits, they may also lead to negative rater and ratee perceptions of the process and, ultimately, poorer employee morale and performance. A number of respected companies such as Xerox, PepsiCo, Goodyear, and Ford have tried forced rankings/distributions and either backed away from them or rejected them. Furthermore, a number of organizations, such as Ford, Goodyear, Microsoft, and Sprint, have recently been involved in adverse impact cases based on forced distributions/rankings. Thus, organizations must consider whether the possible psychometric benefits of relative scales outweigh the potential negative reactions associated with them.

Rater Training

While evidence for the impact of specific scale formats on performance ratings has been somewhat equivocal, evidence for the positive impact of rater training is more widely accepted. Moreover, the benefits of rater training may extend beyond the direct effect of improved rating accuracy. Rater training has the potential to improve “buy-in” from both the rater and ratee, which may in turn increase rater motivation to provide accurate ratings as well as ratee motivation to use the feedback provided to improve performance.

Although a number of rater training approaches have emerged from the performance appraisal literature, the most widely cited is frame-of-reference (FOR) training. The goal of FOR rater training is to train raters to use a common conceptualization (i.e., frame of reference) of performance when observing and evaluating ratees. Typical FOR rater training includes emphasizing the multidimensionality of job performance, concretely defining performance dimensions, and providing sample behavioral incidents indicative of each dimension and corresponding evaluative standards, along with practice and feedback in using these standards to evaluate performance. FOR training helps improve rating accuracy through two processes: (1) by helping raters understand which behaviors constitute specific levels of performance on specific dimensions, and (2) by establishing performance prototypes that allow raters to counteract normal information loss (i.e., forgetting) by categorizing ratee performance based on the performance prototypes presented during the training.

FOR training was developed within the context of performance appraisal. However, over the past 20 years it has been extended beyond this context. Researchers have shown that FOR training is directly applicable to a variety of evaluative contexts, including assessment centers, the setting of selection test cut scores, employment applications, competency modeling, job analysis, interviews, and even therapy. Certainly one of the primary reasons for the popularity and expanded use of FOR training is the evidence demonstrating the positive impact of this training on the quality of performance ratings. To date, a relatively large number of empirical studies have examined the impact of FOR training, and this body of evidence continues to grow. These studies present a consistent picture: providing raters with a clear and consistent explanation of what they are supposed to rate and how, along with practice and feedback in doing so, greatly facilitates the rating process.

Rater Motivation

Both rating scale format and rater training interventions target raters' ability to provide accurate ratings (although it may be argued that both may also indirectly affect rater motivation). However, it is widely recognized that performance rating problems may be just as likely, or more likely, to be a function of raters' willingness to provide accurate ratings. Thus, raters must be motivated, as well as able, to provide accurate ratings. Rater motivation is a function of the manner and context in which ratings are collected, and the context may serve either to inhibit or to facilitate rater willingness to provide accurate ratings. Quite simply, if the negative consequences of providing accurate ratings outweigh the positive consequences, no rating scale or rater training intervention will be effective. Consequently, in an attempt to neutralize the negative consequences often associated with ratings used for administrative purposes (e.g., promotion, salary adjustments, feedback and development), most research studies, including predictor criterion-related validity studies, utilize ratings collected for research purposes only. However, research-only ratings may not always be feasible or preferable to organizational decision makers. Thus, organizations should pay careful attention to the motivational incentives inherent in any rating situation.

In sum, research aimed at improving the quality of supervisory performance ratings has a long history in the organizational literature. Moreover, this research provides important findings that can be used to inform good practice with respect to job performance evaluation. Specifically, a full understanding of the role of rating scale format, rater training, and contextual and process factors is crucial for maximizing the effectiveness of job performance ratings.

Bibliography

Bennett, W., Lance, C. E., & Woehr, D. J. (2006). Performance measurement: Current perspectives and future challenges. Mahwah, NJ: Erlbaum.

Cascio, W. F., & Aguinis, H. (2008). Research in industrial and organizational psychology from 1963 to 2007: Changes, choices, and trends. Journal of Applied Psychology, 93, 1062-1081.

Murphy, K. R., & Cleveland, J. N. (1995). Understanding performance appraisal: Social, organizational, and goal-based perspectives. Thousand Oaks, CA: SAGE Publications, Inc.

Smither, J. W., & London, M. (2009). Performance management: Putting research into action. San Francisco, CA: Jossey-Bass.

The Authors

David J. Woehr is Professor of Management at the University of North Carolina at Charlotte. He received his PhD in Industrial and Organizational Psychology from the Georgia Institute of Technology. Dr. Woehr is coauthor of the book Performance measurement: Current perspectives and future challenges.

Sylvia G. Roch is Associate Professor in the Department of Psychology at the State University of New York at Albany. She holds a PhD in Industrial and Organizational Psychology from Texas A&M University.
