ratings. Hobson et al. (1981) also compared a psychology department chairperson's captured policy
with subjective estimates of that policy obtained from
19 faculty members. In all cases the goal was to quantify the extent to which subjective perceptions differed from the objectively captured policy.
In addition to computing captured rating policies
and comparing them to subjective policy estimates,
researchers have utilized various statistical clustering
procedures to group captured rating policies in terms
of their similarity (Hobson et al., 1981; Naylor &
Wherry, 1965; Stumpf & London, 1981; Zedeck &
Kafry, 1977). The purpose has been to determine whether there were systematic differences in rating orientation within a given group of raters.
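The clustering step can be sketched in code. The snippet below is a hypothetical illustration, not a reproduction of the procedures used in the cited studies: each rater's captured policy is represented as a vector of dimension weights, and raters whose weight vectors correlate highly are grouped together. The function name, correlation threshold, and example data are all invented for illustration.

```python
import numpy as np

def cluster_policies(weight_vectors, r_threshold=0.9):
    """Greedy grouping of captured policies (rows = raters,
    columns = dimension weights). A rater joins an existing
    cluster when the correlation between its weight vector and
    the cluster's first member exceeds r_threshold."""
    W = np.asarray(weight_vectors, dtype=float)
    clusters = []  # each cluster is a list of rater (row) indices
    for i, w in enumerate(W):
        placed = False
        for c in clusters:
            if np.corrcoef(w, W[c[0]])[0, 1] >= r_threshold:
                c.append(i)
                placed = True
                break
        if not placed:
            clusters.append([i])
    return clusters

# Three raters who emphasize dimension 1, one who emphasizes dimension 3
policies = [[.60, .30, .10], [.55, .35, .10], [.10, .20, .70], [.62, .28, .10]]
print(cluster_policies(policies))  # raters 0, 1, 3 share an orientation
```

A real application would of course use a formal hierarchical or k-means procedure; the point is only that a "rating orientation" is operationalized as similarity among weight vectors.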
When considering the policy capturing literature,
four general conclusions can be reasonably drawn.
(Table 1 provides summary information about the
six studies being reviewed.)
First, the general linear model has worked well in
describing rater policies, as evidenced by the consistently high R²s obtained when regressing
overall ratings on scores for the separate dimensions
of performance (see Table 1). These results are not
surprising in light of the decision tasks given to raters.
These tasks generally possessed characteristics similar
to those asserted by Dawes and Corrigan (1974) to
be necessary for adequate fit of the linear model. This
finding simply adds further confirmation to other
well-documented results obtained in a variety of decision making settings (Slovic & Lichtenstein, 1971;
Slovic et al., 1977).
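The basic policy capturing computation can be sketched as follows, using simulated data (the profiles, the "true" weighting, and the noise level are all hypothetical): overall ratings are regressed on dimension scores, and the fit of the linear model is summarized by R².

```python
import numpy as np

# Each row of X is one ratee profile (scores on 3 performance
# dimensions); y holds the rater's overall rating of that profile.
rng = np.random.default_rng(0)
X = rng.uniform(1, 7, size=(40, 3))           # 40 profiles, 3 dimensions
true_policy = np.array([0.6, 0.3, 0.1])       # hypothetical weighting
y = X @ true_policy + rng.normal(0, 0.2, 40)  # a nearly linear rater

# Regress overall ratings on dimension scores (with intercept)
A = np.column_stack([np.ones(len(X)), X])
b, *_ = np.linalg.lstsq(A, y, rcond=None)

pred = A @ b
r2 = 1 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)
print(np.round(b[1:], 2), round(float(r2), 2))
```

Because the simulated rater combines the dimensions almost additively, the recovered weights approximate the true policy and R² is high, mirroring the pattern reported across the studies in Table 1.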
Second, there is consistent evidence indicating that
raters generally possess poor insight into their own
rating policies (Hobson et al., 1981; Taylor &
Wilsted, 1974; Zedeck & Kafry, 1977). Raters exhibit a tendency to perceive a more equal weighting of performance dimensions than is actually the case, underestimating the importance of major dimensions and overestimating the importance of minor ones.
Third, Hobson et al. (1981) demonstrated that subordinate subjective estimates of their supervisor's
captured rating policy reflect the same basic errors.
That is, subordinates tended to believe that their
supervisor used a more equal weighting of performance dimensions in making appraisals than actually was the case.
Finally, in three of four studies utilizing rater
clustering procedures (Hobson et al., 1981; Naylor
[Table 1. Summary information for the six policy capturing studies reviewed; the table's contents did not survive text extraction.]
Methodological Issues/Problems
Major features of the policy capturing methodology detract from its usefulness as a way to
describe and represent actual rating decisions and
seriously undermine the construct validity of captured
rater policies.
Research Purpose
In all of the policy capturing studies, directions to
subjects clearly indicated that ratings were for
research purposes. As Anderson (1977) has noted,
the context in which ratings are made seems to affect the rating process and the results obtained. Landy and Farr (1980), in their review of the performance
rating literature, concluded that the evidence suggests
that there are systematic differences in ratings collected for administrative versus research purposes.
Specifically, ratings for administrative reasons seem
to be more lenient than those made for research purposes. Such findings raise important questions about
the construct validity of rater policies captured when ratings are for research purposes only. In the absence of direct comparisons, the limited available research casts doubt on the adequacy of captured policies as paramorphic representations of actual rating decisions.
Ratee Performance Profile Issues
To the extent that the dimensional intercorrelations in the ratee profiles differ from the actual relationships among dimensions, one simply does not know what captured policies mean. Certainly, the external validity of the ratee profiles and the resultant captured policies is seriously compromised. Thus, in those studies using orthogonal dimensions, one is left with the conclusion that the construct validity of captured policies is highly questionable.
Given the serious problems associated with high
levels of dimensional multicollinearity, what other
options are available besides artificial orthogonalization? Hobson et al. (1981) have discussed the various
methodological (use of behavioral expectation scales
to generate conceptually independent dimensions)
and statistical procedures (principal components
analysis and partialing overall performance from
dimensional scores before factor analysis as proposed
by Landy, Vance, Barnes-Farrell, & Steele, 1980) offered in the literature to deal with this problem. The
key issue, however, is the extent to which the actual
relationships among performance dimensions are
represented accurately in the ratee profiles presented
to subjects. When gross discrepancies exist, the meaning of captured rating policies is hopelessly confused.
Number of Ratee Profiles. An important consideration in the use of multiple regression in any
decision making setting is the number of stimuli or
profiles presented to raters, relative to the number
of cues or dimensions. In situations in which the ratio
of profiles to dimensions is low, two major, related
problems arise: (a) substantial overfitting of the
linear model, resulting in spuriously high obtained R² values, and (b) large sampling error in all computed estimates, both individual regression weights and the overall R² (Cohen & Cohen, 1975; Darlington, 1968).
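A small simulation illustrates both problems at once. With ratings generated completely at random, so that the true R² is zero, a low profiles-to-dimensions ratio still yields a large obtained R²; raising the ratio well past 10:1 drives it back toward zero. The data and function are hypothetical.

```python
import numpy as np

def obtained_r2(n_profiles, n_dims, seed=0):
    """Fit the linear model to ratings that are pure noise and
    return the obtained (in-sample) R-squared."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n_profiles, n_dims))
    y = rng.normal(size=n_profiles)          # ratings unrelated to cues
    A = np.column_stack([np.ones(n_profiles), X])
    b, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ b
    return 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))

print(round(obtained_r2(12, 9), 2))   # low ratio: spuriously high R²
print(round(obtained_r2(120, 9), 2))  # ample ratio: R² near zero
```

With 12 profiles and 9 dimensions the model has nearly as many parameters as observations, so it fits the noise; the obtained R² says almost nothing about the rater's actual policy.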
In policy capturing studies, obtained R²s indicate
the consistency with which raters combine performance information in making an overall judgment.
Regression weights for individual performance
dimensions are used in standard calculations of the
relative importance of each dimension in the rater's
policy (Hoffman's, 1960, relative weight formula is
the one most frequently utilized). In fact, captured
rating policies are nothing more than the pattern of
relative weights associated with the various dimensions of performance. Thus, small ratee profile to
performance dimension ratios resulting in substantial sampling error can lead to unreliable estimates
of an individual's rating policy.
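One common reading of Hoffman's index defines the relative weight of dimension j as β_j·r_j / R², where β_j is the standardized regression weight and r_j the zero-order correlation of that dimension with the overall rating; weights so defined sum to 1. The sketch below assumes that reading and uses invented data; it is not a reproduction of any reviewed study's exact computation.

```python
import numpy as np

def hoffman_relative_weights(X, y):
    """Relative weight of dimension j: beta_j * r_j / R^2, computed
    on standardized scores so that R^2 equals sum(beta_j * r_j)."""
    Xs = (X - X.mean(0)) / X.std(0)
    ys = (y - y.mean()) / y.std()
    beta, *_ = np.linalg.lstsq(Xs, ys, rcond=None)
    r = Xs.T @ ys / len(y)      # zero-order correlations with the rating
    r2 = float(beta @ r)        # R^2 in standardized form
    return beta * r / r2        # relative weights, summing to 1.0

# Hypothetical rater who leans heavily on the first dimension
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 3))
y = X @ np.array([0.7, 0.2, 0.1]) + rng.normal(0, 0.3, 60)
w = hoffman_relative_weights(X, y)
print(np.round(w, 2))
```

Since the relative weights are ratios of sampled regression estimates, any sampling error in the β_j propagates directly into the "policy" itself, which is the instability at issue here.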
To index the extent to which sampling error is present in the policy capturing research, the correction formulae outlined by Cattin (1980) for estimating the population cross-validity coefficient were applied to the central tendency measures of obtained R²s in each study. These corrected R² values are provided in the last column of Table 1. As the values in the table indicate, there is substantial variation in the amount of sampling error or shrinkage present. Drops in R² range from a low of .04 (Naylor & Wherry, 1965) to a high of .70 (Stumpf & London, 1981), indicating that the obtained R²s in the latter study were completely spurious. Although there are major
differences across studies in the profiles/dimensions
ratio, in general most researchers have exhibited a
lack of recognition of this important factor and have
violated the minimum standard ratio of 10:1 recommended by Nunnally (1978).
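One of the estimators Cattin (1980) discusses combines a Wherry-type adjustment of the obtained R² with Browne's formula for the expected squared cross-validity. The sketch below assumes that combination; the exact variant behind the corrected values in Table 1 is not specified here, so the numbers are purely illustrative.

```python
def estimated_cross_validity(r2_obtained, n, k):
    """Shrinkage sketch for an obtained R^2 from n profiles and k
    dimensions: adjust toward the population squared multiple
    correlation (Wherry-type), then apply Browne's cross-validity
    formula. Illustrative only."""
    # Wherry-type adjustment for the population squared multiple R
    rho2 = max(0.0, 1 - (1 - r2_obtained) * (n - 1) / (n - k - 1))
    # Browne's formula for the expected squared cross-validity
    num = (n - k - 3) * rho2 ** 2 + rho2
    den = (n - 2 * k - 2) * rho2 + k
    return num / den

# Hypothetical examples: the same obtained R^2 = .90 with 6 dimensions
print(round(estimated_cross_validity(0.90, 30, 6), 2))  # 30 profiles
print(round(estimated_cross_validity(0.90, 12, 6), 2))  # only 12 profiles
```

The same obtained R² shrinks modestly at a 5:1 profiles-to-dimensions ratio but drastically at 2:1, which is precisely why the low ratios in some reviewed studies render their captured policies suspect.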
In those situations in which substantial sampling
error exists, the meaning of captured policies is further obscured. Relative weights used to define rater
policies and computed in terms of estimated regression weights will exhibit a great deal of sampling instability. This poses serious interpretational problems
as to the meaning of captured policies and their accuracy in representing actual rating decisions.
Summary. Given the serious methodological problems discussed above, one cannot reasonably conclude that the evidence supports the construct validity
of captured policies as accurate paramorphic representations of actual rating decisions. The critical differences between the typical research tasks presented
to subjects and actual rating settings force one to admit that what captured policies represent is not really known.
Research Recommendations
Based on this review of the policy capturing literature, two recommendations for future research are
offered. First and foremost, it is recommended that
future policy capturing studies be conducted exclusively in field settings, using actual supervisory
ratings of subordinates for administrative purposes.
In conjunction with this recommendation, a cessation of policy capturing research as designed and implemented to date would seem warranted. It appears
at this point that the differences between the artificial
rating situations characterizing current research and
actual rating settings are of such overwhelming
Practical Implications
Despite the serious methodological problems with
the policy capturing literature, it has served to focus
attention on the need for an explicit definition of
supervisor rating policies and their consistent use in
making overall evaluations. When used in conjunction with a thorough job analysis and rigorously
The third step in the development of this rating system involves the definition of rater policies. Finally,
step four focuses on the explicit communication and
use of rater policies in making overall performance
ratings.
Before discussing the implications of using a rating
system of this type, the issue concerning the source
of supervisor rating policies must be addressed. There
is no logical imperative that they be defined in a purely statistical manner. Certainly the interpretational
problems associated with policies captured in artificial settings are quite evident and serious in nature.
However, even were rater policies captured in actual
field settings, it is still questionable whether one
should rigidly apply such policies in making overall
ratings. Einhorn and Hogarth (1981) introduced the
notion of whether optimal decision models can be
considered "reasonable." This idea is applicable in
considering the "reasonableness" of using a statistically defined policy to make overall ratings. Certainly a captured rating policy first must be acceptable to the individual rater before one could expect
that person to utilize it in making overall evaluations.
Thus, in the final analysis, captured policies should
undergo the scrutiny of the individual rater and be
judged appropriate.
Hobson and Gibson (in press) have outlined three
Subordinate Benefits
References
Anderson, B. L. Differences in teachers' judgement policies for varying numbers of verbal and numerical cues. Organizational Behavior and Human Performance, 1977, 19, 68-88.
Dawes, R. M., & Corrigan, B. Linear models in decision making. Psychological Bulletin, 1974, 81, 95-106.
Hobson, C. J., Mendel, R. M., & Gibson, F. W. Clarifying performance appraisal criteria. Organizational Behavior and Human Performance, 1981, 28, 164-188.
Hoffman, P. J. The paramorphic representation of clinical judgement. Psychological Bulletin, 1960, 57, 116-131.
Ilgen, D. R., Fisher, C. D., & Taylor, M. S. Consequences of individual feedback on behavior in organizations. Journal of Applied Psychology, 1979, 64, 349-371.
Oskamp, S. Clinical judgement from the MMPI: Simple or complex? Journal of Clinical Psychology, 1967, 23, 411-415.
Slovic, P., & Lichtenstein, S. Comparison of Bayesian and regression approaches to the study of information processing in judgement. Organizational Behavior and Human Performance, 1971, 6, 649-744.