Abstract
The present study conducted a metaevaluation of the teacher performance evaluation system used by the Performance Assessment Services Unit (PASU) of De La Salle-College of Saint Benilde in Manila, Philippines. To determine whether the teacher performance evaluation system adheres to quality evaluation, the standards of feasibility, utility, propriety, and accuracy were used. The teacher performance evaluation system in PASU includes a student rating scale called the Student Instructional Report (SIR) and a rating scale used by peers called the Peer Evaluation Form (PEF). A series of guided discussions was conducted among the different stakeholders of the evaluation system in the college, such as the deans and program chairs, teaching faculty, and students, to determine their appraisal of the evaluation system in terms of the four standards. A metaevaluation checklist was also accomplished by experts in measurement and evaluation in the Center for Learning and Performance Assessment (CLPA). The results of the guided discussions showed that most of the stakeholders were satisfied with the conduct of teacher performance assessment. When judged against the Joint Committee standards, however, the ratings were low: utility, propriety, and feasibility were rated fair, and accuracy was rated poor. The areas for improvement are discussed in the paper.
Introduction
The National Board Certification was meant as a complement to, but not a replacement for, state licensure exams. While the latter represents the minimum standards required to teach, the former stands as a test of more advanced standards in teaching as a profession. Unlike the licensure examinations, it may or may not be taken; it is voluntary. As such, some schools offer monetary rewards for the completion of the test, as well as opportunities for better positions (e.g., certified teaching leadership and mentor roles) (Isaacs, 2003).
Metaevaluation: "Evaluation of an evaluation"
In 1969, Michael Scriven used the term metaevaluation to describe the evaluation of any evaluation, evaluative tool, device, or measure. Given how many decisions are based on evaluation tools (helping people make informed decisions is typically their main purpose in the first place), it is no wonder that the need for metaevaluative work on these evaluation tools is as great as it is (Stufflebeam, 2000).
In the teaching profession, student evaluation of teachers stands as one of the main evaluation tools. However, as earlier stated, while it is only fair that students be included in the evaluative process, it may not be fair to teaching professionals, depending on the evaluation process and content, to have their careers at the mercy of a potentially flawed tool.
1. Determine the Metaevaluation's Stakeholders. These are the parties with an interest in the evaluation under the microscope. These may include teachers, students, and administrators.
2. Staff the Metaevaluation with One or More Qualified Metaevaluators.
Preferably, these should be people with technical knowledge in psychometrics and
people who are familiar with the Joint Committee Personnel Evaluation Standards.
It is sound to have more than one metaevaluator on the job, so that more aspects
may be covered objectively.
3. Define the Metaevaluation Questions. While these might differ on a case-by-case basis, the four main criteria ought to be present: propriety, utility, feasibility, and accuracy.
4. As Appropriate, Agree on Standards, Principles, and/or Criteria to Judge the Evaluation System or Particular Evaluation.
5. Issue a Memo of Understanding or Negotiate a Formal Metaevaluation
Contract. This will serve as a guiding tool. It contains the standards and principles
contained in the previous step and will help both the metaevaluators and their
clients understand the direction the metaevaluation will take.
6. Collect and Review Pertinent Available Information.
7. Collect New Information as Needed, Including, for Example, On-Site Interviews, Observations, and Surveys.
8. Analyze the Findings. Put together all the qualitative and quantitative data
in such a way that it will be easy to do the following step.
9. Judge the Evaluation's Adherence to the Selected Evaluation Standards, Principles, and/or Other Criteria. This is the truly metaevaluative step. Here, one
takes the analyzed data and judges the evaluation based on the standards that were
agreed upon and put down in the formal contract. In another source, this step is
lumped with the previous one to form a single step (Stufflebeam, 2000).
10. Prepare and Submit the Needed Reports. This entails the finalization of
the data into a coherent report.
11. As Appropriate, Help the Client and Other Stakeholders Interpret and Apply the Findings. This is important for helping the evaluation system under scrutiny improve, by ensuring that the clients know how to use the metaevaluative data properly.
Utility standards are intended to ensure that an evaluation serves the information needs of its intended users (Widmer, 2004). They include: (U1) Stakeholder Identification, (U2) Evaluator Credibility, (U3) Information Scope and Selection, (U4) Values Identification, (U5) Report Clarity, (U6) Report Timeliness and Dissemination, and (U7) Evaluation Impact.
Feasibility standards make sure that the evaluation "is conducted in a realistic, well-considered, diplomatic, and cost-conscious manner" (Widmer, 2004).
They include: (F1) Practical Procedures, (F2) Political Viability, and (F3) Cost
Effectiveness.
Finally, accuracy standards make sure that the evaluation in question produces and disseminates information that is both valid and usable (Widmer, 2004). They include: (A1) Program Documentation, (A2) Context Analysis, (A3) Described Purposes and Procedures, (A4) Defensible Information Sources, (A5) Valid Information, (A6) Reliable Information, (A7) Systematic Information, (A8) Analysis of Quantitative Information, (A9) Analysis of Qualitative Information, (A10) Justified Conclusions, (A11) Impartial Reporting, and (A12) Metaevaluation.
It should be noted that the aforementioned standards were developed primarily for the metaevaluation of evaluations of education and training programs and of educational personnel.
Peer Evaluation Form. The Peer Evaluation Form (PEF) is used by faculty members in observing the performance of their colleagues. The PEF is designed to determine the extent to which CSB faculty members exhibit teaching behaviors in the areas of teacher's procedures, teacher's performance, and students' actions as observed by their peers.
The PEF is used by a peer observer when the teacher is new to the college and when the teacher is due for promotion. The peer observer discusses the observation and the rating given with the faculty member evaluated, and the faculty member signs the form after the conference.
Method
Guided Discussion
and/or when he or she deems it fit to interrupt (i.e., at points when the discussion goes astray, or the participants spend too much time on one point).
A similar procedure was followed for the student group. The purpose of the students' discussion was to obtain information on their perspectives on the evaluation process and their perception of their role as evaluators.
At the end of each discussion, the participants were asked to give their
opinion about the usefulness and feasibility of having this sort of discussion every
year to process their questions, comments, doubts, and suggestions. This provides
data for streamlining the metaevaluative process for future use.
The average ratings of the teachers within the last three school years (AY 2003-2004 to 2005-2006) were used to generate findings on how well the results could discriminate between good teaching and teaching that needs improvement. Cronbach's alpha was used to determine the internal consistency of the old teacher performance instrument.
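For concreteness, the alpha computation can be sketched in a few lines of Python; the data below are hypothetical, since the paper does not describe its computing setup:

    import numpy as np

    def cronbach_alpha(ratings):
        # Cronbach's alpha for a respondents-by-items matrix:
        # alpha = k/(k-1) * (1 - sum of item variances / variance of total scores)
        k = ratings.shape[1]                         # number of items
        item_vars = ratings.var(axis=0, ddof=1)      # per-item variances
        total_var = ratings.sum(axis=1).var(ddof=1)  # variance of summed scores
        return k / (k - 1) * (1 - item_vars.sum() / total_var)

    # Hypothetical data: 200 students rating 30 SIR items on a 1-5 scale.
    rng = np.random.default_rng(0)
    leniency = rng.integers(3, 6, size=(200, 1))     # each rater's general level
    noise = rng.integers(-1, 2, size=(200, 30))
    sir = np.clip(leniency + noise, 1, 5).astype(float)
    print(round(cronbach_alpha(sir), 3))

With items that share a common rater-level component, as above, the value lands in the high range reported in Table 4.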
The average of the scores for the three terms was computed for each school year, generating three average scores, one per year. These scores were compared with each other to check reliability across time.
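The paper does not spell out how the yearly averages were compared; one plausible reading, sketched below with hypothetical data, is to take each teacher's mean of the three term scores per school year and then correlate the yearly averages across teachers:

    import numpy as np
    import pandas as pd

    # Hypothetical layout: one row per teacher per term, with the teacher's
    # overall SIR mean for that term.
    rng = np.random.default_rng(1)
    scores = pd.DataFrame({
        "teacher": np.repeat([f"T{i:03d}" for i in range(120)], 9),
        "year": np.tile(np.repeat(["2003-2004", "2004-2005", "2005-2006"], 3), 120),
        "term": np.tile([1, 2, 3], 360),
        "sir": rng.normal(4.1, 0.25, 120 * 9),
    })

    # Average the three term scores within each school year for every teacher
    # (pivot_table aggregates by mean), then correlate the yearly averages as
    # a rough check of stability across time.
    yearly = scores.pivot_table(index="teacher", columns="year", values="sir")
    print(yearly.corr().round(2))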
Metaevaluation Checklist
Instrumentation
For the GD sessions, a guide list was used. The guide is composed of a set of questions under each standard that are meant to evaluate the evaluation system. The questions in the GD are pre-written; in the data-gathering method, they are still subject to change, both in the fielding of the questions prior to the GD sessions and on the day of the GD session itself.
The Metaevaluation Checklist by Stufflebeam (2000) was used to rate the SIR and PEF as an evaluation system. It is composed of ten items for each of the subvariables under each of the four standards (see Appendix B). The task is to check the items in each list that are applicable to the current teacher performance evaluation system conducted by the Center. Checking nine to ten items (a proportion of 0.9-1.0) generates a rating of excellent for that particular subvariable; 0.7-0.8, very good; 0.5-0.6, good; 0.3-0.4, fair; and 0-0.2, poor.
Data Analysis
The data obtained from the GD were analyzed using a qualitative approach. The important themes from the notes produced in the GD were extracted based on the appraisal components for each area of the metaevaluation standards. For utility, themes referring to stakeholder identification (persons affected by the evaluation should be identified), evaluator credibility (trustworthiness and competence of the evaluator), information scope and selection (broad selection of information/data for evaluation), values identification (description of procedures and rationale of the evaluation), report clarity (description of the evaluation being evaluated), report timeliness (findings and reports distributed to users), and evaluation impact (the evaluation should encourage follow-through by stakeholders) were extracted. For propriety, the appraisal themes extracted were on service orientation (designed to assist and effectively address the needs of the organization), formal agreement (obligations of the formal parties are agreed to in writing), rights of human subjects (the evaluation is conducted so as to respect and protect the rights of human subjects), and human interaction (respect for human dignity and worth). For feasibility, the themes extracted were on practical procedures, political viability, fiscal viability, and legal viability. The qualitative data were used as the basis for accomplishing the metaevaluation checklist for utility, feasibility, and propriety.
For the standards on accuracy, the existing documents of processes, procedures, programs, policies, documentation, and reports were made available to the metaevaluators in order to accomplish the metaevaluation checklist in this area.
In the checklist, the number of items checked under each metaevaluation standard was divided by 10 and averaged across the metaevaluators who accomplished the checklist. Each component was then interpreted as to whether the system reached the typical standards of evaluation. The scores are interpreted as 0.9 to 1.0, excellent; 0.7 to 0.8, very good; 0.5 to 0.6, good; 0.3 to 0.4, fair; 0.1 to 0.2, poor.
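A minimal sketch of this scoring rule in Python, assuming each subvariable has exactly ten checklist items (values falling between the published cut-offs are assigned to the lower band):

    def band(score):
        # Map a 0-1 proportion of checked items to its verbal rating.
        if score >= 0.9: return "Excellent"
        if score >= 0.7: return "Very Good"
        if score >= 0.5: return "Good"
        if score >= 0.3: return "Fair"
        return "Poor"

    def component_rating(checked_counts, items_per_component=10):
        # checked_counts holds, for each metaevaluator, how many of the
        # component's ten items he or she checked; divide by 10 and average.
        proportions = [c / items_per_component for c in checked_counts]
        return band(sum(proportions) / len(proportions))

    # Hypothetical example: three metaevaluators checked 4, 3, and 5 of the
    # ten items for one subvariable; mean proportion 0.4, banded as "Fair".
    print(component_rating([4, 3, 5]))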
Results
Utility
(Standing of the faculty). The sources of feedback are the students, through the SIR, student advising, and e-mail from students and parents, and peers (senior faculty, chairs, deans). Feedback is given "if the rating is high (3.75 and above); sometimes no feedback is given; when the results of the SIR are low; if the faculty is new to the college; and those who have been teaching for a long time and getting low ratings."
For values identification, the strands were clustered into three themes: needs, actions taken, and value of the instrument. According to the participants, the needs included "Results (that) are (not) too cumbersome for deans to read; A print out of the results should be given; the time taken to access the results turns off some teachers from accessing them; Students having difficulty answering the SIR; Students don't see how teaching effectiveness is measured and; Create a particular form for laboratory classes." The actions taken theme included "removing items that are valid and another computation is done and; Other evaluation criteria is done." The themes of the instrument value showed that for the instrument to be valuable, "there should be indicators for each score; there should be factors of teaching effectiveness with clear labels; identify what the instrument measures; there needs to be a lump score on learner-centeredness and; there are other success indicators that are not reflected in the SIR."
For functional reporting, two clusters emerged: decisions and functions. The decisions made from the teacher evaluation include promotion, loading with courses, retaining part-time faculty, deloading a faculty, permanency, and training enhancement. The functions of the teacher evaluation are "used for improvement the faculty; the administrators come up with a list of faculty that will be given teaching load based on SIR reports and; The PEF constricts what needs to be evaluated more."
The follow-up and impact included both qualitative and quantitative aspects. The qualitative aspect of the instruments included suggestions to "give headings/labels for the different parts; Come up with dimensions and subdimensions; Devise a way to reach the faculty (yahoo, emails etc.); The teachers and students should see what aspects to improve on; and there should be narrative explanations for the figures." The quantitative aspect of the report included "faculty doesn't understand the spreading index; Conduct a seminar explaining the statistics; Come up with a general global score; Each area should be represented with a number; and a verbal list of strengths and weaknesses of the faculty."
Two clusters were identified for information scope and selection: perception and action. In the perception cluster, the faculty "looks at evaluation as something negative because the school uses the results." The suggested actions were to "come up with CLPA kit explaining the PEF and SIR; check on the credibility on the answers of students; and SIR needs to be simplified for the non-hearing students."
Table 1
Rating for Utility
The ratings for utility using the metaevaluation checklist showed that in most of the item areas, the performance of the teacher evaluation processes is good. In particular, the area of information scope and selection is very good. However, report timeliness is poor and evaluation impact is fair; both should thus be improved.
Propriety
Table 2
Rating for Propriety
Most of the ratings for propriety using the metaevaluation checklist were pegged at good. A very good rating was obtained for formal agreement. A fair rating was obtained in the areas of complete and fair assessment, conflict of interest, and fiscal responsibility.
Feasibility
issues, anticipating student needs, and concerns about utility. The time issues in administration concerned the first-thirty-minutes policy observed by the Center. The time allotment is generally too short for the whole administration procedure, from giving instructions to the actual answering of the instrument. Teachers also have issues with the same policy: some refuse to be rated in the first thirty minutes, preferring to be rated in the last thirty. There are faculty members who "dictate that the last 30 minutes will be used for evaluation". There are others who "complain about the duration of the SIR administration", even if "the guidelines (distributed in the eighth week of the term, the week before the evaluation) indicated first 30 minutes."
Though discouraged by the Center, rescheduling still happens during the evaluation period. Usually it is because "some of the faculty members (or their students) do not show up". Similarly, there are times when some students do come, but their numbers do not meet the fifty percent quota required for each section's evaluation. Another common reason for rescheduling is schedule conflicts with other activities: "(the) Young Hoteliers' Exposition and some tours and retreats have the same schedule as the SIR".
The next issue cluster concerns the frequency of evaluation: teachers question whether there is a need to evaluate every term. Although there is only one strand, it is important enough to be segregated, as it gives voice to one of the interest groups' major concerns.
The next cluster forms the biggest group, the one about identifying and anticipating the needs of one of the major interest groups/stakeholders of the whole evaluation system: the teachers themselves. Their needs range from the minor ("We need to request for the updated list of the faculty names early in the term, a list including the faculty members who changed their surnames with computer center.") to the major, with a lot in between. The major needs include making sure that teachers are aware of their evaluation schedules and the Center's policies, coming up with ways to deal with the teachers during the actual administration, and equipping them with the know-how to access their online results.
Just as the teachers, the evaluatees, have needs, so do their evaluators, the students. By not taking care of the students' needs and/or preferences, the Center risks generating inaccurate results. Thus, the Center should "compile the needs of students and present it (the SIR) to (the) students in an attractive form. (CLPA should) drum up the interest of students in the evaluation."
Last under the feasibility area are issues on utilization. There appears to be
a need to make the utilization clearer to the stakeholders, especially the teachers.
For the area of cost effectiveness, the clusters formed were human resources, material resources, and technology. The human resources of the Center are "well-utilized". Despite special cases when the staff find it difficult to go home because of late working hours, they feel well compensated, in part because of the meals served. As to material resources, "the SIR process is well-supported by the College" and so everything is generally provided. There are special cases where the evaluation setting makes administration difficult; for instance, "sometimes it's hard to administer in the far buildings," especially in the distant food labs. Finally, under the theme of technology, the Center proved well-equipped to handle the pen-and-paper instrument's processing. However, it may be some time before the process becomes paperless: if the memos were delivered online instead of personally, as is currently done, some of the faculty would "not get the memo on time" because "the faculty members do not have their own PCs". An attempt was also made to administer the instrument online. A problem noted in this regard was "kaunting respondents with online evaluation" (very few respondents are gathered with the online evaluation). Other than that, "if all classes come together for on-line the computers hang."
For legal viability, only one theme was developed: standardizing the evaluation setting. "There is a common script" to keep the instructions standardized and, although "During college break some classes are affected with the noise (of college break activities)", the "classroom is generally conducive in answering".
Table 3
Rating for Feasibility
For the three areas of feasibility, a good rating was obtained for practical procedures and cost effectiveness, and a poor rating for political viability.
Accuracy
The standards of accuracy were rated based on the reliability reports of the instrument from SY 2003-2004 to SY 2005-2006. The trend in the mean SIR performance of the faculty from 2003 to 2006 was also obtained.
Table 4
Internal Consistency of the Items for the SIR from 2003 to 2006

Term        2003-2004   2004-2005   2005-2006
1st Term    0.873       0.875       0.881
2nd Term    0.888       0.892       0.894
3rd Term    0.892       0.885
Summer      0.832       0.866
The reliability of the SIR form has been consistently high from 2003 to 2006. The Cronbach's alpha values obtained are all at the same high level across the terms and across the three school years. This indicates that the internal consistency of the SIR measure is stable across time.
Figure 1 shows a line graph of the SIR means for each term across the three school years.
Figure 1
Data Trend from the Last Three Years
[Line graph of mean SIR ratings for Parts 1, 2, and 3 by term (1st to 4th) in each school year; y-axis: mean, ranging from 3.70 to 4.40.]
The trend in the means shows that the SIR results increase sharply during the summer terms (4th). This can be observed in the 4th-term spikes in the line graph for the three parts of the SIR instrument. The means for the first, second, and third terms are stable and rise rapidly in the summer term.
Table 5
Rating for Accuracy
The ratings for accuracy using the metaevaluation checklist were generally poor in most areas. Only systematic information was rated very good, only defensible information sources was rated good, and both reliable information and impartial reporting were rated fair.
Table 6
Summary Ratings for the Standards

Standard        Outcome
Utility         Fair
Propriety       Fair
Accuracy        Poor
Feasibility     Fair
Discussion
audiences. In order to improve these timely exchanges, the Center needs to maintain consistent communication with the different offices that it serves.
For propriety, the rating is only fair because low ratings were obtained for complete and fair assessment, conflict of interest, and fiscal responsibility. To improve complete and fair assessment, there is a need to assess and report the strengths and weaknesses of the procedure, use the strengths to overcome the weaknesses, and estimate the effects of the evaluation's limitations on the overall judgment of the system. In line with conflict of interest, there is a need to release evaluation procedures, data, and reports for public review. For fiscal responsibility, there is a need to keep adequate personnel records concerning job allocations and time spent on the job, and to employ comparisons for evaluation materials.
In the standards of accuracy, the majority of the ratings were poor, including program documentation, context analysis, described purposes and procedures, valid information, analysis of qualitative and quantitative information, justified conclusions, and metaevaluation. For program documentation, the only criterion met was the technical report that documents the program's operations; all other nine criteria were not met. For context analysis, no criteria were met. In described purposes and procedures, only the record of the client's purpose of the evaluation and the implementation of the actual evaluation procedures were met; all other eight criteria were not met. For valid information, there is a need to focus the evaluation on key ideas, employ multiple measures to address each idea, provide a detailed description of the constructs assessed, report the type of information each employed procedure acquires, report and justify inferences, report the comprehensiveness of the information provided by the procedures in relation to the information needed, and establish meaningful categories of information by identifying regular and recurrent themes through qualitative analysis. In the analysis of qualitative and quantitative information, there is a need to conduct exploratory analyses to assure data correctness, choose procedures appropriate to the system of evaluating teachers, specify the assumptions being met by the evaluation, report the limitations of each analytic procedure, examine outliers and verify correctness, analyze statistical interactions, and use displays to clarify the presentation and interpretation of statistical results. In the areas of justified conclusions and metaevaluation, no criteria were met.
In the standards of feasibility, political viability needs to be improved. For
political viability, the evaluation needs to consider ways to counteract attempts to
bias or misapply the findings, foster cooperation, involve stakeholders throughout
the evaluation, issue interim reports, report divergent views, and affirm a public
contract.
Given the present condition of the SIR and PEF in evaluating faculty performance based on the qualitative data, there are still gaps that need to be addressed in the evaluation system. The stakeholders are largely not yet aware of the detailed standards for conducting evaluations of their faculty, and what is verbalized in the qualitative data is based only on their personal experience and on the practices required by the evaluation system. By contrast, the standards on evaluation specify many more details that need to be met in the evaluation. Some areas of the evaluation are interpreted by the stakeholders as acceptable based on the themes of the qualitative data, but more criteria need to be met across the wider range of teacher evaluation practice. It is recommended that the Center for Learning and Performance Assessment consider the specific areas found wanting under utility, propriety, feasibility, and especially accuracy to attain quality standards in its conduct of teacher evaluation.
References
Bonfadini, J. (1998). Should students evaluate their teachers? Rural Living, 52(10),
40-41.
Berliner, D. (1982). Recognizing instructional variables. In D. E. Orlosky (Ed.), Introduction to education (pp. 198-222). Columbus, OH: Merrill.
Bradshaw, L. (1996). Alternative teacher performance appraisal in North Carolina: Developing guidelines. (ERIC Document Reproduction Service No. ED 400 255)
Egelson, P. (1994, April). Collaboration at Richland School District Two: Teachers
and administrators design and implement a teacher evaluation system that
supports professional growth. Paper presented at the Annual Meeting of the
American Educational Research Association, New Orleans, LA. (ERIC
Document Reproduction Service No. ED 376 159)
Glatthorn, A. (1997). Differential instruction. Alexandria, VA: Association for Supervision and Curriculum Development.
Galves, R. E. (1986). Ang ginabayang talakayan: Katutubong pamamaraan ng sama-samang pananaliksik [Guided discussion: An indigenous method of collaborative research]. Unpublished manuscript, Psychology Department, University of the Philippines.
Hummel, B. (2006). Metaevaluation: An online resource. Retrieved September 6,
2006, from http://www.bhummel.com/Metaevaluation/index.html
Isaacs, J.S. (2003). A study of teacher evaluation methods found in select Virginia
secondary public schools using the 4x4 model of block scheduling.
Unpublished doctoral dissertation, Virginia Polytechnic Institute and State
University.
Lengeling, M. (1996). The complexities of evaluating teachers. (ERIC Document
Reproduction Service No. ED 399 822)
O'Donell, J. (1990).
Scriven, M. (1969). An introduction to meta-evaluation. Educational Products
Report, 2, 36–38.
Seldin, P. (1991). The teacher portfolio. Bolton, MA: Anker.
Shulman, L. (1988). A union of insufficiencies: Strategies for teacher assessment in
a period of educational reform. Educational Leadership, 46(3), 36-41.
Strobbe, C. (1993). Professional partnerships. Educational Leadership, 51(42), 40-
41.
Stufflebeam, D. L. (2000). The methodology of metaevaluation as reflected by the Western Michigan University Evaluation Center. Journal of Personnel Evaluation in Education, 14(1), 95.
The National Board for Professional Teaching Standards (1998). The national certification process in-depth. Retrieved from http://www.nbpts.org/nbpts/standards/intro.html
Williams, W., & Ceci, S. (1997). "How am I doing?" Change, 29(5), 12-23.
Widmer, T. (2004). The development and status of evaluation standards in Western Europe. New Directions for Evaluation, 104, 31-42.
Author Notes