
Funded through the ESRC's Researcher Development Initiative

Session 3.3: Inter-rater reliability

Prof. Herb Marsh

Ms. Alison O'Mara

Dr. Lars-Erik Malmberg


Department of Education, University of Oxford

Establish research question

Define relevant studies

Develop code materials

Data entry and effect size calculation

Pilot coding; coding

Locate and collate studies

Main analyses

Supplementary analyses

Interrater reliability
Aim of co-judge procedure, to discern:
Consistency within coder
Consistency between coders

Take care when making inferences based on little information.
Phenomena impossible to code become missing values.

Interrater reliability

Percent agreement: common but not recommended.

Cohen's kappa coefficient: kappa is the proportion of the optimum improvement over chance attained by the coders.
1 = perfect agreement
0 = agreement is no better than that expected by chance
-1 = perfect disagreement
Kappas over .40 are considered to be a moderate level of agreement (but there is no clear basis for this guideline).

Correlation between different raters.

Intraclass correlation: agreement among multiple raters, corrected for the number of raters using the Spearman-Brown formula (r).

Interrater reliability of categorical IV (1)


Percent exact agreement = (Number of observations agreed on) / (Total number of observations)

Study   Rater 1   Rater 2
  1        0         0
  2        1         1
  3        2         1
  4        1         1
  5        1         1
  6        2         2
  7        1         1
  8        1         1
  9        0         0
 10        2         1
 11        1         0
 12        1         1

Categorical IV with 3 discrete scale-steps; 9 ratings are the same.
% exact agreement = 9/12 = .75
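As a quick cross-check, a minimal Python sketch of this calculation (variable names are illustrative; the ratings are copied from the table above):

# Percent exact agreement for two raters (ratings copied from the table above)
rater1 = [0, 1, 2, 1, 1, 2, 1, 1, 0, 2, 1, 1]
rater2 = [0, 1, 1, 1, 1, 2, 1, 1, 0, 1, 0, 1]

agreed = sum(a == b for a, b in zip(rater1, rater2))   # number of observations agreed on
print(agreed / len(rater1))                            # 9 / 12 = 0.75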

Interrater reliability of categorical IV (2) unweighted Kappa


                Rater 1
Rater 2      0     1     2   Sum
    0        2     1     0     3
    1        0     6     2     8
    2        0     0     1     1
  Sum        2     7     3    12

If the agreement matrix is irregular, Kappa will either not be calculated or will be misleading.

K = (P_O - P_E) / (1 - P_E)

Kappa: positive values indicate how much the raters agree over and above chance alone; negative values indicate disagreement.

P_O = (2 + 6 + 1) / 12 = .750
P_E = [(2)(3) + (7)(8) + (3)(1)] / 12^2 = .451
K = (.750 - .451) / (1 - .451) = .544
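The same arithmetic in a minimal Python sketch (kappa_from_table is a hypothetical helper name; NumPy is assumed to be available):

import numpy as np

def kappa_from_table(table):
    """Unweighted Cohen's kappa from a square agreement (contingency) table."""
    table = np.asarray(table, dtype=float)
    n = table.sum()
    p_o = np.trace(table) / n                                      # observed agreement
    p_e = (table.sum(axis=0) * table.sum(axis=1)).sum() / n ** 2   # chance-expected agreement
    return (p_o - p_e) / (1 - p_e)

# Agreement matrix from the slide (rows = Rater 2, columns = Rater 1)
table = [[2, 1, 0],
         [0, 6, 2],
         [0, 0, 1]]
print(round(kappa_from_table(table), 3))   # 0.544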

Interrater reliability of categorical IV (3) unweighted Kappa in SPSS


CROSSTABS
  /TABLES=rater1 BY rater2
  /FORMAT=AVALUE TABLES
  /STATISTIC=KAPPA
  /CELLS=COUNT
  /COUNT ROUND CELL.

Symmetric Measures
                              Value   Asymp. Std. Error(a)   Approx. T(b)   Approx. Sig.
Measure of Agreement  Kappa   .544    .220                   2.719          .007
N of Valid Cases              12

a. Not assuming the null hypothesis.
b. Using the asymptotic standard error assuming the null hypothesis.
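If Python is preferred to SPSS, the same value can be obtained from the raw ratings; a minimal sketch assuming scikit-learn is installed:

from sklearn.metrics import cohen_kappa_score

rater1 = [0, 1, 2, 1, 1, 2, 1, 1, 0, 2, 1, 1]
rater2 = [0, 1, 1, 1, 1, 2, 1, 1, 0, 1, 0, 1]

print(round(cohen_kappa_score(rater1, rater2), 3))   # 0.544, matching the SPSS output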

Interrater reliability of categorical IV (4) Kappas in irregular matrices

If Rater 2 is systematically above Rater 1 when coding an ordinal scale, Kappa will be misleading; it is possible to fill up the matrix with zeros.

Misaligned table (Rater 1 uses categories 1-3, Rater 2 uses 2-4):

                Rater 2
Rater 1      2     3     4   Sum
    1        4     3     0     7
    2        1     6     3    10
    3        0     1     7     8
  Sum        5    10    10    25

K = .51 (misleading)

Filled up with zeros (both raters on the full 1-4 scale):

                Rater 2
Rater 1      1     2     3     4   Sum
    1        0     4     3     0     7
    2        0     1     6     3    10
    3        0     0     1     7     8
    4        0     0     0     0     0
  Sum        0     5    10    10    25

K = -.16
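A minimal Python sketch of the same point (it re-defines the kappa_from_table helper from the earlier sketch so the block runs on its own; cell counts are copied from the two tables above):

import numpy as np

def kappa_from_table(table):
    """Unweighted Cohen's kappa from a square agreement table."""
    table = np.asarray(table, dtype=float)
    n = table.sum()
    p_o = np.trace(table) / n
    p_e = (table.sum(axis=0) * table.sum(axis=1)).sum() / n ** 2
    return (p_o - p_e) / (1 - p_e)

# Misaligned table treated as if row i and column i were the same category
misaligned = [[4, 3, 0],
              [1, 6, 3],
              [0, 1, 7]]

# Same data with the unused row/column filled with zeros (full 1-4 scale)
filled = [[0, 4, 3, 0],
          [0, 1, 6, 3],
          [0, 0, 1, 7],
          [0, 0, 0, 0]]

print(round(kappa_from_table(misaligned), 2))   # 0.51, misleadingly high
print(round(kappa_from_table(filled), 2))       # -0.16, the raters never agree exactly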

Interrater reliability of categorical IV (5) Kappas in irregular matrices


If there are no observations in some row or column, Kappa will not be calculated; it is possible to fill up the matrix with zeros.

Original table (Rater 1 never uses category 2):

                Rater 2
Rater 1      1     2     3     4   Sum
    1        4     2     1     0     7
    3        0     1     3     1     5
    4        0     0     2     4     6
  Sum        4     3     6     5    18

K not possible to estimate

Filled up with zeros:

                Rater 2
Rater 1      1     2     3     4   Sum
    1        4     2     1     0     7
    2        0     0     0     0     0
    3        0     1     3     1     5
    4        0     0     2     4     6
  Sum        4     3     6     5    18

K = .47
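As a worked check (following the Kappa formula above; the intermediate values are not from the slide), the zero-filled table gives:
P_O = (4 + 0 + 3 + 4) / 18 = .611
P_E = [(7)(4) + (0)(3) + (5)(6) + (6)(5)] / 18^2 = .272
K = (.611 - .272) / (1 - .272) = .47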

Interrater reliability of categorical IV (6) weighted Kappa using SAS macro


PROC FREQ DATA = int.interrater1;
  TABLES rater1 * rater2 / AGREE;
  TEST KAPPA;
RUN;

K_W = 1 - (Σ w_i p_oi) / (Σ w_i p_ei)
(w_i = disagreement weights; p_oi = observed and p_ei = chance-expected proportions in cell i)

Papers and macros are available for estimating Kappa when there are unequal or misaligned rows and columns, or multiple raters: <http://www.stataxis.com/about_me.htm>
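As an illustrative cross-check in Python (not the SAS macro itself; scikit-learn assumed installed), cohen_kappa_score also accepts disagreement weights. Applied to the 12-study ratings from the earlier categorical example:

from sklearn.metrics import cohen_kappa_score

rater1 = [0, 1, 2, 1, 1, 2, 1, 1, 0, 2, 1, 1]
rater2 = [0, 1, 1, 1, 1, 2, 1, 1, 0, 1, 0, 1]

# Linear disagreement weights penalise near-misses less than large misses
print(round(cohen_kappa_score(rater1, rater2, weights="linear"), 2))   # 0.60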

Interrater reliability of continuous IV (1)


Study   Rater 1   Rater 2   Rater 3
  1        5         6         5
  2        2         1         2
  3        3         4         4
  4        4         4         4
  5        5         5         5
  6        3         3         4
  7        4         4         4
  8        4         3         3
  9        3         3         2
 10        2         2         1
 11        1         2         1
 12        3         3         3

Correlations (SPSS output, N = 12):
rater1 with rater2: Pearson r = .873**, Sig. (2-tailed) = .000
rater1 with rater3: Pearson r = .879**, Sig. (2-tailed) = .000
rater2 with rater3: Pearson r = .866**, Sig. (2-tailed) = .000
**. Correlation is significant at the 0.01 level (2-tailed).

Average correlation r = (.873 + .879 + .866) / 3 = .873
Coders code in the same direction!
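A minimal Python sketch of the same summary (NumPy assumed; the final Spearman-Brown step-up line illustrates the formula mentioned earlier and is not a figure from the slides):

import numpy as np
from itertools import combinations

# Ratings copied from the table above (one row per study, columns = raters 1-3)
ratings = np.array([
    [5, 6, 5], [2, 1, 2], [3, 4, 4], [4, 4, 4],
    [5, 5, 5], [3, 3, 4], [4, 4, 4], [4, 3, 3],
    [3, 3, 2], [2, 2, 1], [1, 2, 1], [3, 3, 3],
])

# Average pairwise Pearson correlation between the raters
pairs = combinations(range(ratings.shape[1]), 2)
r_bar = np.mean([np.corrcoef(ratings[:, i], ratings[:, j])[0, 1] for i, j in pairs])
print(round(r_bar, 3))   # about 0.873

# Spearman-Brown step-up: reliability of the mean rating across k = 3 coders
k = ratings.shape[1]
print(round(k * r_bar / (1 + (k - 1) * r_bar), 3))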

Interrater reliability of continuous IV (2)

Estimates of Covariance Parameters (a)

Parameter                               Estimate
Residual                                 .222222
Intercept [subject = study] Variance    1.544613

a. Dependent Variable: rating.

ICC = σ²_B / (σ²_B + σ²_W) = 1.544 / (1.544 + 0.222) = 1.544 / 1.767 = 0.874
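For a balanced design like this, the same components can be recovered by hand from a one-way random-effects ANOVA; a minimal Python sketch (NumPy only, assuming this ANOVA estimator is an adequate stand-in for the mixed-model output above):

import numpy as np

# Ratings from the 12-study x 3-rater table above (rows = studies)
ratings = np.array([
    [5, 6, 5], [2, 1, 2], [3, 4, 4], [4, 4, 4],
    [5, 5, 5], [3, 3, 4], [4, 4, 4], [4, 3, 3],
    [3, 3, 2], [2, 2, 1], [1, 2, 1], [3, 3, 3],
], dtype=float)

n_studies, k = ratings.shape
study_means = ratings.mean(axis=1)
grand_mean = ratings.mean()

# One-way random-effects ANOVA mean squares
ms_within = ((ratings - study_means[:, None]) ** 2).sum() / (n_studies * (k - 1))
ms_between = k * ((study_means - grand_mean) ** 2).sum() / (n_studies - 1)

var_within = ms_within                       # ~0.222, the Residual estimate
var_between = (ms_between - ms_within) / k   # ~1.545, the between-study variance

print(round(var_between / (var_between + var_within), 3))   # ICC, about 0.874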

Interrater reliability of continuous IV (3)

Design 1: one-way random effects model, when each study is rated by a different pair of coders
Design 2: two-way random effects model, when a random pair of coders rates all studies
Design 3: two-way mixed effects model, when ONE pair of coders rates all studies

Comparison of methods (from Orwin, p. 153; in Cooper & Hedges, 1994)

Low Kappa but a good AR (agreement rate) can occur when there is little variability across items and the coders agree.

Interrater reliability in meta-analysis and primary study

Interrater reliability in meta-analysis vs. in other contexts
Meta-analysis: coding of independent variables
How many co-judges?
How many objects to co-judge? (sub-sample of studies versus sub-sample of codings)
Use of a gold standard (i.e., one master-coder)
Coder drift (cf. observer drift): are coders consistent over time?
Your qualitative analysis is only as good as the quality of your categorisation of qualitative data.
