Robert J. Valuck
Department of Pharmacy Administration, The University of Illinois at Chicago, Chicago IL
Reed G. Williams
Department of Medical Education, The University of Illinois at Chicago, Chicago IL
The study purpose was to improve pharmacy instruction by identifying dimensions of teaching unique to
pharmacy education and developing reliable and valid rating scales for student evaluation of instruction.
Error-producing problems in the use of student ratings of instruction, existing rating methods and dimensions
of effective teaching are reported. Rationale is provided for development of Behaviorally-Anchored Rating
Scales, BARS, and the methods used are described. In a national study, 4,300 descriptions of pharmacy
teaching were collected in nine critical incident writing workshops at four types of schools. Ten dimensions of
pharmacy teaching were identified and validated for classroom, laboratory and experiential teaching.
Scales were developed for each dimension. Measures of scale quality are described including retranslation
data, standard deviations of effectiveness ratings, reliability and validity data and data supporting reduction of
leniency and central tendency effects. Four outcomes of the project are discussed, with emphasis on two: use
of the newly-validated dimensions to modify traditional numerically-anchored scales in local use, and use
of BARS to provide clear and convincing performance feedback to pharmacy instructors.
INTRODUCTION AND PURPOSE

From among the traditional faculty roles of teaching, research and service, this study investigated only the evaluation of teaching. Teaching performance may be evaluated using multiple data sources: (i) documented self-evaluation and course improvement; (ii) peer review of instructional methods, instructor-written texts or manuals, and other developed media, syllabi and tests; (iii) gains in student learning; (iv) student ratings of instructor performance; (v) observation or videotaping; and (vi) teaching awards(1,2). This study focused on only one data source: student evaluation of faculty performance. Its purpose was to improve the quality of instruction in U.S. colleges of pharmacy by identifying dimensions of pharmacy instruction and developing new, reliable and valid student measures of effective pharmacy teaching[2]. Such measures of instructional performance, whether utilized in instructor self-assessment, for periodic performance reviews or in the critical promotion and tenure process, are essential for the continued development of effective teachers. If pharmacy students and instructors are to have confidence in instructional rating systems and to eventually benefit from the rating process, clear dimensions of effective teaching should be identified and rating errors minimized. Problems with the content validity of student ratings of instructor performance introduce rating error when instruments are not sensitive to the unique differences in lecture, laboratory and experiential instruction. Moreover, when instructor rating instruments are developed for use across university colleges and departments or disciplines, without having been validated for use in rating pharmacy instruction in particular, additional questions of validity and rating error arise.

Error in Instructor Ratings

Reduction of measurement error is imperative in evaluation of faculty teaching performance. Eight kinds of error in the administration and use of instructional performance rating scales prompted this study. The research and development methods chosen were intended to minimize most of these common sources of rating error, especially the first five listed: (i) error in instrument content; (ii) error in the interpretation of the meaning of ratings(3-5); (iii) showmanship(6-8); (iv) common rating error effects such as "halo effect"(9), "reverse halo effect"(10), "leniency effect" and "harshness (or strictness) effect"(11), and "central tendency effect"(12,13); (v) error in instrument reliability; (vi) mixed purposes of evaluation[3,4](14,15); (vii) inconsistent methods of instrument administration(16-19); and (viii) errors in data implementation(20,21).

Procedures for minimizing the first five types of rating errors were sought. Emphasis was placed on selecting or developing procedures and instruments to rate the most appropriate pharmacy teaching behaviors and to rate them accurately and consistently.

Study Goals

Four goals were set for the study. First, the project would identify dimensions of instructional behavior unique to pharmacy education and to three teaching environments: classroom, laboratory and experiential. Faculty colleagues have reported the belief that effective pharmacy teaching is different from good teaching in other departments and disciplines, and that it varies from one pharmacy teaching environment to another. The researchers sought to apply a method, other than factor analysis, to identify and describe dimensions of pharmacy teaching. The second goal was to develop Behaviorally-Anchored Rating Scales, BARS, for each dimension and teaching environment. Third, the researchers intended to demonstrate concurrent validation of the scales developed, by showing correlations with a known reliable and valid, traditional numerically-anchored scale of parallel content. Finally, the project was designed to demonstrate generalizability of the scales for use in all U.S. colleges of pharmacy.

METHODS

Nine study steps were elaborated to achieve the project goals. First, the study began with identification of tentative dimensions of pharmacy teaching. This initial validation step would be based on the literature. The second step was to select the most appropriate scaling method. The literature supporting this selection decision is described. The third step was to conduct critical incident workshops for the collection of descriptors of effective and ineffective teaching in U.S. colleges of pharmacy. Editing and selection of collected incidents was the fourth step. The fifth was to establish and validate dimensions of pharmacy teaching using the retranslation process to demonstrate independence of the dimensions[5](22). Simultaneously, the sixth step of obtaining effectiveness ratings for incidents from study panelists would provide data for establishing scale anchors. The seventh step was to develop scales by selection of meaningful behavioral anchors based on the retranslation process and high rater agreement on the scale anchors. A concurrent validation study would constitute the eighth step, for which traditional, numerically-anchored scales, parallel in content, would be developed. The final step was accomplished through the concurrent validation study, yielding a useful parallel set of traditional, content-parallel numerically-anchored scales.

Identification of Tentative Dimensions

Tentative dimensions of pharmacy instruction were identified and validated based on a review of the pertinent literature. Tables I and II display dimensions mentioned in studies and review articles outside and within pharmacy education. The tentative dimensions so identified were later used for preliminary classification of student- and faculty-generated critical incidents of pharmacy teaching.

Footnotes:
[1] The research was supported, in part, by a GAPS grant from the SmithKline Beecham Foundation through the American Association of Colleges of Pharmacy.
[2] The term "dimension", as used in this article, refers to an axis, or continuum, along which performance descriptors, varying in quality or intensity, may be ordered. The dimension is identified and shown to be independent and non-overlapping in meaning with other clusters of similar behaviors.
[3] Formative evaluation refers to evaluation of a process or product to provide feedback for the purpose of making possible mid-process refinements or improvements.
[4] Summative evaluation is conducted to examine the quality or impact of a final, completed process or product.
[5] The Smith and Kendall retranslation process uses an independent group of expert raters who reallocate descriptors of performance to dimensions describing performance qualities. It is analogous to the procedures used by language translators to ensure that all of the meanings of an original text are preserved. Text material is translated into a foreign language, then retranslated to the original by an independent expert.
Education, General. Seven dimensions of effective instruction were reported often in the education and psychology literature. Table I summarizes the most frequently mentioned dimensions of teaching in original studies or reviews. In their article describing the development of a teacher rating instrument, Wotruba and Wright reviewed 21 published studies of student evaluation of teaching(23). Of the 40 criteria they listed, the nine most frequently mentioned were also cited in a text chapter on uses and limitations of student ratings(24). Seven are shown in Table I. The text author also summarized dimensions of teaching behavior as identified in factor-analytic studies, four of which are reported in Table I. Brandenberg et al. described development and validation of scales for student evaluation of teaching(25). Their work yielded a comprehensive evaluation system available at the researchers' school(26). Instructors may select traditional, numerically-anchored items from a "catalog" of over 400 items classified by teaching dimensions. Items designed for use in summative evaluation are normed by instructor rank and by required/elective status. Items designed for instructor's formative self-evaluation are not so normed. Hildebrand et al. asked faculty and students to provide descriptions, in observable and behavioral terms, of the "best" and "worst" teaching they had experienced(27). Responses were factor-analyzed into five clusters (dimensions) of teaching performance. In a Canadian study of teaching in the behavioral sciences, Das et al. identified seven dimensions of teaching, and developed BARS for student evaluation of instruction(28). In equivalent forms comparisons using traditional rating instruments, they reported the BARS to be at least as psychometrically sound in terms of reliability, inter-rater variability and content validity. Dickinson and Zellinger identified six teaching dimensions for veterinary medicine instructors(29).

Education, Pharmacy. After review of the education and psychology literature, evidence of criteria for effective pharmacy teaching was sought. Ten articles from the pharmacy education literature, which described or mentioned dimensions of teaching, are summarized in Table II. Three of the articles reported research on pharmacy instruction. Based on deficiencies in the use of rating instruments designed for use in faculty performance evaluation generally, Kotzan and Mikael and Kotzan and Entreken developed and implemented a factor-analyzed instrument for student evaluation of undergraduate pharmacy instruction(30,31). Jacoby described how modification of an existing instrument for use in student evaluation of pharmacy teaching contributed to improved classroom instruction(32). Based on this research, an instructional consulting service was initiated to provide feedback to faculty. Purohit et al. explored issues of student evaluation of instruction(33). Citing Hildebrand, the authors discuss "components" of effective teaching as perceived by their colleagues and by students(34). Sauter and Walker, in reporting a theoretical model for pharmacy faculty peer evaluation, mentioned basic components of teaching and learning as requisite elements in such evaluation(35). Two authors reported special needs for evaluation of clinical teaching performance. Martin et al. described clinical faculty evaluation and development programs at one college of pharmacy(36). Downs and Troutman identified criteria for the evaluation of clinical pharmacy teaching(37). Three articles, written as reports or invitational articles, mentioned qualities of good pharmacy teaching. As part of a panel devoted to the evaluation of pharmacy teaching, Carlson suggested a comprehensive evaluation program(38). Citing articles by Kulick, and by Brown, the author emphasized three major dimensions students use to judge their teachers, and discussed major functions in the supervision of students by clinical pharmacists(39,40). Peterson, in an ad hoc committee report, mentioned key features of pharmacy teaching performance(41). Zanowiak, in an invitational article, citing Kiker, mentioned characteristics of an effective pharmacy teacher(42,43).

Some of the authors cited in Tables I and II described additional kinds of dimensions and behaviors not shown in the tables. First, problems in attaching meaning to labels assigned to factors in earlier studies could introduce bias in the generation of unobservable behaviors in this study. Specific behaviors might not, in the retranslation process chosen for this study, be assigned to the same dimension as suggested by the factor name coined by authors of previous scales and instruments. A second type of dimension not listed in the tables was based on the notion of self-rated student accomplishment. Behaviors associated with this named factor seemed unlikely to be collected and used as scale anchors in this research, which would focus on teacher, not student, behaviors. Finally, some instruments contained
[Sample instrument instructions and item format: "PLEASE COMPLETE THE FOLLOWING RATINGS SHOWING YOUR EVALUATION ON THE FIVE POINT SCALES BELOW. EXAMPLE... The Minnesota Twins will win the World Series in 1993. Agree _ _ _ √ _ Disagree (1 2 3 4 5)"]
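The example above shows the traditional numerically-anchored format; the BARS format instead ties scale positions to concrete observed behaviors. A minimal sketch of the contrast (the item wording and anchor texts are invented for illustration; the study's actual scales appear in its tables and Appendix):

```python
# A traditional numerically-anchored item: one statement rated on a
# 1-5 agree/disagree continuum (format as in the example above; wording invented).
numeric_item = {
    "statement": "The instructor explained concepts clearly.",
    "anchors": {1: "Disagree", 5: "Agree"},
}

# A behaviorally-anchored (BARS) scale: each anchored position is a
# concrete critical incident of observed teaching behavior (invented).
bars_scale = {
    "dimension": "Teaching Ability—Lecture",
    "anchors": {
        7: "Worked a dosage-calculation example step by step on the board.",
        2: "Read the text aloud without explaining the underlying concepts.",
    },
}

def nearest_anchor(scale, rating):
    """Return the behavioral description anchored closest to a rating,
    i.e., what a given scale position looks like in the classroom."""
    closest = min(scale["anchors"], key=lambda pos: abs(pos - rating))
    return scale["anchors"][closest]

print(nearest_anchor(bars_scale, 6))
```

Because each rating maps back to an observable behavior, BARS feedback tells an instructor what to change, not merely how high the score was.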
Concurrent Validity. BARS ratings were correlated with corresponding numerically-anchored scales constructed with selected items from the catalog of items available from the university's Office of Instructional Resources. The researchers first identified all catalog items which related to the content described in the ten dimensions. Then, 31 items were selected which most closely matched the behaviors described in the dimensions tested. The final 22-item numerically-anchored scale appears in Table III and its construction in relationship to the ten dimensions is described in Table IV. A numerically-anchored media scale was not constructed because the two lecturers did not use media other than assigned readings and the chalkboard.

Selected scales were administered to two groups of students at the researchers' school, one which rated two lecturers and one which rated clerkship instructors. Two senior faculty members, with courses in lecture format, volunteered for the study and signed releases which were included with the written rating instructions provided to students. Raters included all students in attendance at one third professional year lecture. After team-taught courses and unwilling volunteer instructors were eliminated, the two available participants received very high ratings with low variance and ranges of ratings. Ratings for the experiential rotations were obtained by asking volunteer students from the final professional year and recent alumni who were new, first-year members of the clinical faculty to rate their "second preceptor." This procedure provided sufficient numbers of raters and eliminated ratings of "first-preceptor" student-faculty relationships. Because rotations were systematically scheduled, it also ensured representativeness from the wide variety of required clerkships offered. Faculty members responsible for laboratory instruction declined participation. They noted that items and scale anchors involving quality of laboratory instruments could cause low ratings which, if not kept confidential, might adversely affect their departmental and college-wide performance reviews. All but two of the correlations are positive and significant.

Scale Properties and Error Reduction

BARS and numerically-anchored scales were compared for scale properties contributing to leniency, central tendency and halo effects. Evidence for less leniency effect in the use of BARS was provided by comparing the means for both sets of four selected scales: Evaluation, Interaction, Workload and Teaching. All four BARS means were lower. The mean BARS rating for four scales was 1.13 scale points lower, a statistically significant difference. Although these data may suggest that the BARS produce less leniency in ratings, possibly attributable to their unambiguous scale anchors, it is not clear which scale best represents a "true" rating of instructor performance.

Comparison of two scale properties suggests that BARS have produced less central tendency rating effect. The variance in ratings was greater for all BARS scales. Moreover, comparison of the modal ratings for both sets of all four scales shows that the BARS yielded modal ratings which were farther from their respective scale mid-points than their adjusted numerical scale counterparts: total differences of 14.9 vs. 8.3 scale points, respectively.

Halo effect was compared by examining correlations of measures with each other within BARS and within numerical scale types. If scales show a low inter-correlation, their independence is demonstrated, suggesting that raters are less apt to allow performance in one area to affect their ratings in another. Evidence for lower halo effect for BARS was not found. The mean intercorrelation for all four BARS was 0.71, SD = 0.10, and for the numerical scales, 0.58, SD = 0.19.

Project Products

Dimensions. Ten independent dimensions of pharmacy teaching were identified. They are described in Table V and include three previously-unreported new dimensions: "Selection and Use of Media," "Teaching Ability—Laboratory," and "Teaching Ability—Experiential." Three scales are environment-specific: "Teaching Ability—Lecture," "Teaching Ability—Laboratory," and "Teaching Ability—Experiential." The other seven scales apply to all three teaching environments. By combining the scales as the table suggests, either seven or eight dimensions of teaching may be measured in three pharmacy teaching environments. Laboratory instruction might also include evaluation of selection and use of media. Three sample scales appear in the Appendix.

BARS Scales. A total of 134 critical incidents "survived" the retranslation and effectiveness rating process, with a range of from 10 to 19 incidents used as anchors per scale. The process and results are summarized in Table VI. The mean percentage agreement on assignment of incidents to dimensions, 79.6 percent, nearly met the 80 percent retranslation goal. The mean standard deviation of 1.76 scale points illustrates strong student rater agreement on the level of effectiveness of each scale point.
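The three scale-property checks reported above can be expressed as small computations. The following is a sketch with invented rating data, mirroring the comparisons described (mean rating difference for leniency, modal distance from the scale midpoint for central tendency, and mean scale intercorrelation for halo); the same correlation function serves the concurrent-validity correlations between BARS and numerically-anchored scale scores.

```python
from statistics import mean, mode

def leniency_gap(numeric_scales, bars_scales):
    """Mean amount by which numerically-anchored scale means exceed
    the corresponding BARS means (larger gap = more numeric leniency)."""
    return mean(mean(n) - mean(b)
                for n, b in zip(numeric_scales, bars_scales))

def modal_distance(scales, midpoint):
    """Total distance of each scale's modal rating from the scale
    midpoint; larger totals indicate less central tendency."""
    return sum(abs(mode(s) - midpoint) for s in scales)

def pearson(x, y):
    """Pearson product-moment correlation of two rating vectors."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def mean_intercorrelation(scales):
    """Mean pairwise correlation among scales; lower values suggest
    more independent dimensions and less halo effect."""
    pairs = [pearson(scales[i], scales[j])
             for i in range(len(scales)) for j in range(i + 1, len(scales))]
    return mean(pairs)

# Hypothetical 7-point ratings of one instructor on two scales
numeric = [[6, 7, 6, 7, 6], [6, 6, 7, 7, 6]]
bars = [[5, 5, 6, 4, 5], [5, 6, 5, 4, 5]]

print(round(leniency_gap(numeric, bars), 2))  # 1.4
print(modal_distance(bars, midpoint=4))       # 2
```

Note that, as the text observes, a lower mean and a wider spread do not by themselves establish which format yields the "true" rating; these statistics only compare the two formats' tendencies.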
32 American Journal of Pharmaceutical Education Vol. 58, Winter Supplement 1994
Table V. Pharmacy instruction dimensions[a]

A. Teaching Ability—Lecture[b]
   Audible and clear speaking; Interpretation and explanation of concepts;
   Use of examples and illustrations;
   Emphasis and summary of main points; Effective use of chalkboard.

B. Teaching Ability—Laboratory
   Availability of equipment, reagents and ingredients;
   Demonstration before performance; Supervision; Safety;
   Sufficient time and access; Concise, useful reporting.

C. Teaching Ability—Experiential
   Demonstration and supervision of learning experiences;
   Professional and patient communications; Practice.

D. Course Organization[b]
   Clarity of scheduling; Detail of content outline;
   Clarity of learning objectives, assignments and student expectations;
   Following the course outline and objectives.

E. Selection and Use of Media
   Effective use of slides, overheads, videos, texts, handouts, models.

F. Student Performance Evaluation[b]
   Lecture, Laboratory and Experiential; Relationship to course content/objectives;
   Clear, unambiguous questions and assignments;
   Explanation of method, content, administration;
   Feedback to students; Fair, objective grading; Application, not rote memory.

G. Student-Instructor Interaction[b]
   Availability for consultation; Responses to student difficulties;
   Conveying a helpful and supportive attitude;
   Concern about student learning; Sensitivity to students' needs;
   Interest in student outcomes; Availability for help after class;
   Listening to student questions and concerns; Initiatives to help students;
   Atmosphere conducive to learning.

H. Workload/Course Difficulty[b]
   Scope of content; Length and difficulty of assignments;
   Coverage of content; Reasonable due dates and project deadlines.

I. Enthusiasm/Motivation[b]
   Dynamic in presentation of subject;
   Stimulation of student thought and interest;
   Motivation of students to do their best work.

J. Knowledge of Subject Area[b]
   Well-prepared; Competent in field; Knows limits of expertise.

[a] Classroom teaching evaluated on dimensions A, D-J. Laboratory teaching evaluated on dimensions B, D-J. Experiential teaching evaluated on dimensions C, D, F-J.
[b] Tentative dimensions identified at onset of study. Dimensions B, C & E added on basis of critical incidents surviving the retranslation/rating process.

Table VI. Summary statistics, retranslation and effectiveness ratings
[Columns: N of useable incidents; Percent agreement on relevant dimension; Standard deviation, effectiveness ratings]

DISCUSSION

The project yielded four major outcomes: (i) validated dimensions of teaching performance for use in development or revision of traditional scales; (ii) reliable and valid numerically-anchored I.C.E.S. scales; (iii) reliable and valid BARS for administration; and (iv) BARS for use in faculty development.

Utility of the Dimensions in Local Scale Development or Revision. The kind and quality of instruments in current use for student rating of pharmacy faculty teaching varies considerably. These BARS and parallel traditional scales are the first to be based on the ten new independent dimensions of teaching performance unique to pharmacy education. For colleges of pharmacy which participate in university-wide rating systems, the project offers guidance for the college to work with the central agency responsible for managing the faculty evaluation program. Existing traditional numerically-anchored items of high quality may be combined into scales for the 10 unique pharmacy teaching dimensions. Such scales may be used to report performance ratings with higher reliability than is possible with a series of individual items. If the central service agency does not offer items to rate performance in all of the new pharmacy teaching dimensions identified, item-writing and validating activities are called for to complete the locally-developed scales. With such revised scales in place, development of local, within-pharmacy norms is possible. For schools not required to participate in university-wide teaching evaluation systems, similar possibilities exist for within-school scale modification and improvement.

Use of Equivalent Form ICES Scales. Concurrent validation of the BARS using specially-constructed numerical scales of parallel content has an additional useful outcome. For schools using the I.C.E.S. system, use of the traditional scales developed for this study, augmented by additional I.C.E.S. items or other items descriptive of the ten dimensions, could provide reliable scale scores based on the dimensions. Reliability studies on such expanded scales are recommended.

Administration of BARS Scales. The expected project outcome of reliable and valid scale development was accomplished and the product is available for use in schools of pharmacy. BARS scales are expensive to develop and maintain. Use and continued research and development of these scales in multiple pharmacy schools would provide additional positive returns on the research and development investment. Care should be taken, however, to systematically select, introduce, administer, and monitor the scales. Use of BARS has been most successful in organizations where persons being rated have had input into the scale development process and where the scales are professionally-administered(74). Each administration should be managed by a human resources expert familiar with development and administration of this type of performance rating scale. Unsupervised scale use by students is not recommended, nor is administration by persons untrained in performance assessment. Potential user schools should utilize a designated testing specialist for BARS scale administration.

Use of BARS in Faculty Development. One of the many characteristics of BARS is that, because of the vivid behaviors they portray, faculty ratees are prone to adopt effective teaching behaviors and to abandon those associated with low scale ratings. This tends to cause a favorable shift in teaching behaviors and an inflation of ratings based on
improved faculty performance. This desirable side effect of BARS use suggests that their greatest contribution may be in the provision of highly-effective faculty performance feedback, and not in their reliable and valid performance assessment capabilities alone. The utility of BARS in providing performance feedback is well-established(75). Heartfelt introspection about these unforgiving "snapshots" of what students think of their instructors' teaching could result in re-dedicated commitment to improved teaching.

Study Constraints

Three constraints, two methodological and one philosophical, may have limited the study outcomes. First, the known disadvantages of using study volunteers are evident. Although sufficiently represented to permit rich incident writing contributions from upperclassmen, a larger number of volunteers from the final professional year could have enhanced the study. Perspectives of additional mature students' writings would have enhanced the pool of incidents. More importantly, participation by a larger proportion of "seniors" from study schools would have enabled their utilization in larger numbers for the retranslation/rating steps, allowing less reliance on senior students from the pilot school. Second, faculty member commitment from study schools for the purpose of concurrent validation of the scales was not sought at the onset. Instead, volunteers were obtained only from the researchers' school. Only two lecturers, both highly-experienced, volunteered, and with limitations on their available class time. This required administration of only part of the scales. Both lecturers received very high ratings on both types of scales, thus narrowing the range of responses. The higher correlations for experiential courses were due, in part, to a much wider variance in ratings than for the two volunteer lecturers. Third, the factor-analytic basis for classifying teaching behaviors was not challenged in this study. The foundation for scale construction was the commonality of seven factors established and named in previous studies. Because this study stressed observed behaviors, it did not create global descriptors of instructors' "personality." Moreover, the dimensions were not created or edited by students. Perhaps students, not educational researchers, should be asked to fashion a tentative set of dimensions based on the critical incidents, without prompting of previously-named factors or the dimensions identified in this study. It is possible that students have a discerning and reliable way of "knowing" qualities of instruction and may be able to organize and describe an array of instructional qualities more efficiently than researchers who begin with factor-analyzed groupings of teaching behaviors and who insist on working only with descriptions of observed behavior.

Topics for Future Research

Reliability. Ongoing reliability studies are planned. Cooperation of additional volunteer instructors, including those with little teaching experience, would broaden the range of talent being rated. Such studies should be expanded to include all of the dimensions of teaching in all environments, particularly laboratory teaching. The low concurrent validity correlations for two scales require additional study. Low correlations for the Workload item and the Knowledge scale are attributable, in part, to student differences in perceptions. Review of scale development ratings of critical incidents depicting "Workload and course difficulty" showed that some students approach ratings for this dimension in terms of relative "ease" of workload, others in a more normative sense in terms of perceived "appropriateness" of the amount of work assigned. Thus, both types of scales are subject to students' perceptions of appropriate input and effort vs. their own learning styles and willingness to expend effort. Similarly, for student ratings of "Knowledge," students deal with perceptions rather than facts about the instructor's knowledge. Only vivid examples of lack of preparedness in the classroom, as measured by the BARS scale, served to measure this reliably. Reliability studies will also be conducted on expanded versions of the numerically-anchored I.C.E.S. scales.

Research on Learning Styles. When students are made aware of their personal learning styles, accommodations to instructional formats and styles may be made. This study demonstrated that the mean BARS ratings for items selected in scale development were not affected by students' learning styles. Research is continuing on the effect of learning styles on all 402 critical incidents which were subjected to retranslation and effectiveness ratings, especially those which were rejected for scale use because of wide rating variance. Significance of item variance differences between learning style groups, if discovered, may offer insights to instructors for possible instructional style and performance accommodations based on specific observed teaching behaviors.

Taxonomical Classification of Incidents with Ethical Implications. Numerous items describing substandard professional behavior were eliminated from the scales. A review of the bank of critical incidents for purposes of classification into available taxonomies of ethical behaviors is planned(76,77).

Personal Dimensions. The emphasis in scale construction and use has been on the advantages of unidimensional observed behaviors as scale anchors. This emphasis enabled identification of ten discrete dimensions of teaching performance. It may also be possible to classify additional teaching behaviors based on personal attributes of the instructor. More "trait-like" than the observable performance-based dimensions identified, such clusters may "cut across" many of the ten validated performance dimensions. Such personal dimensions, e.g. "Independence/Assertiveness" and "Handling/Coping with Detail", have been previously reported