
Funded through the ESRC's Researcher Development Initiative

Session 3.3: Inter-rater reliability

Prof. Herb Marsh

Ms. Alison O'Mara

Dr. Lars-Erik Malmberg


Department of Education, University of Oxford

Establish research question

Define relevant studies

Develop code materials

Data entry and effect size calculation

Pilot coding; coding

Locate and collate studies

Main analyses

Supplementary analyses

Interrater reliability
Aim of co-judge procedure, to discern:
Consistency within coder
Consistency between coders

Take care when making inferences based on little information.
Phenomena impossible to code become missing values.

Interrater reliability

Percent agreement: common but not recommended.

Cohen's kappa coefficient: kappa is the proportion of the optimum improvement over chance attained by the coders.
1 = perfect agreement
0 = agreement is no better than that expected by chance
-1 = perfect disagreement
Kappas over .40 are considered to be a moderate level of agreement (but there is no clear basis for this guideline).

Correlation between different raters.

Intraclass correlation: agreement among multiple raters, corrected for the number of raters using the Spearman-Brown formula (r).

Interrater reliability of categorical IV (1)


Percent exact agreement = (Number of observations agreed on) / (Total number of observations)

Study   Rater 1   Rater 2
  1        0         0
  2        1         1
  3        2         1
  4        1         1
  5        1         1
  6        2         2
  7        1         1
  8        1         1
  9        0         0
 10        2         1
 11        1         0
 12        1         1

Categorical IV with 3 discrete scale-steps; 9 ratings are the same.
% exact agreement = 9/12 = .75
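As a quick cross-check, a minimal Python sketch of this calculation (variable names are illustrative; the ratings are copied from the table above):

# Percent exact agreement for two raters (ratings copied from the table above)
rater1 = [0, 1, 2, 1, 1, 2, 1, 1, 0, 2, 1, 1]
rater2 = [0, 1, 1, 1, 1, 2, 1, 1, 0, 1, 0, 1]

agreed = sum(a == b for a, b in zip(rater1, rater2))   # number of observations agreed on
print(agreed / len(rater1))                            # 9 / 12 = 0.75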

Interrater reliability of categorical IV (2) unweighted Kappa


                Rater 1
Rater 2      0     1     2   Sum
    0        2     1     0     3
    1        0     6     2     8
    2        0     0     1     1
  Sum        2     7     3    12

If the agreement matrix is irregular, Kappa will either not be calculated or will be misleading.

K = (P_O - P_E) / (1 - P_E)

Kappa: positive values indicate how much the raters agree over and above chance alone; negative values indicate disagreement.

P_O = (2 + 6 + 1) / 12 = .750
P_E = [(2)(3) + (7)(8) + (3)(1)] / 12^2 = .451
K = (.750 - .451) / (1 - .451) = .544
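The same arithmetic in a minimal Python sketch (kappa_from_table is a hypothetical helper name; NumPy is assumed to be available):

import numpy as np

def kappa_from_table(table):
    """Unweighted Cohen's kappa from a square agreement (contingency) table."""
    table = np.asarray(table, dtype=float)
    n = table.sum()
    p_o = np.trace(table) / n                                      # observed agreement
    p_e = (table.sum(axis=0) * table.sum(axis=1)).sum() / n ** 2   # chance-expected agreement
    return (p_o - p_e) / (1 - p_e)

# Agreement matrix from the slide (rows = Rater 2, columns = Rater 1)
table = [[2, 1, 0],
         [0, 6, 2],
         [0, 0, 1]]
print(round(kappa_from_table(table), 3))   # 0.544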

Interrater reliability of categorical IV (3) unweighted Kappa in SPSS


CROSSTABS
  /TABLES=rater1 BY rater2
  /FORMAT=AVALUE TABLES
  /STATISTIC=KAPPA
  /CELLS=COUNT
  /COUNT ROUND CELL.

Symmetric Measures
                              Value   Asymp. Std. Error(a)   Approx. T(b)   Approx. Sig.
Measure of Agreement  Kappa   .544    .220                   2.719          .007
N of Valid Cases              12

a. Not assuming the null hypothesis.
b. Using the asymptotic standard error assuming the null hypothesis.
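If Python is preferred to SPSS, the same value can be obtained from the raw ratings; a minimal sketch assuming scikit-learn is installed:

from sklearn.metrics import cohen_kappa_score

rater1 = [0, 1, 2, 1, 1, 2, 1, 1, 0, 2, 1, 1]
rater2 = [0, 1, 1, 1, 1, 2, 1, 1, 0, 1, 0, 1]

print(round(cohen_kappa_score(rater1, rater2), 3))   # 0.544, matching the SPSS output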

Interrater reliability of categorical IV (4) Kappas in irregular matrices

If Rater 2 is systematically above Rater 1 when coding an ordinal scale, Kappa will be misleading; it is possible to fill up the matrix with zeros.

Misaligned table (Rater 1 uses categories 1-3, Rater 2 uses 2-4):

                Rater 2
Rater 1      2     3     4   Sum
    1        4     3     0     7
    2        1     6     3    10
    3        0     1     7     8
  Sum        5    10    10    25

K = .51 (misleading)

Filled up with zeros (both raters on the full 1-4 scale):

                Rater 2
Rater 1      1     2     3     4   Sum
    1        0     4     3     0     7
    2        0     1     6     3    10
    3        0     0     1     7     8
    4        0     0     0     0     0
  Sum        0     5    10    10    25

K = -.16
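A minimal Python sketch of the same point (it re-defines the kappa_from_table helper from the earlier sketch so the block runs on its own; cell counts are copied from the two tables above):

import numpy as np

def kappa_from_table(table):
    """Unweighted Cohen's kappa from a square agreement table."""
    table = np.asarray(table, dtype=float)
    n = table.sum()
    p_o = np.trace(table) / n
    p_e = (table.sum(axis=0) * table.sum(axis=1)).sum() / n ** 2
    return (p_o - p_e) / (1 - p_e)

# Misaligned table treated as if row i and column i were the same category
misaligned = [[4, 3, 0],
              [1, 6, 3],
              [0, 1, 7]]

# Same data with the unused row/column filled with zeros (full 1-4 scale)
filled = [[0, 4, 3, 0],
          [0, 1, 6, 3],
          [0, 0, 1, 7],
          [0, 0, 0, 0]]

print(round(kappa_from_table(misaligned), 2))   # 0.51, misleadingly high
print(round(kappa_from_table(filled), 2))       # -0.16, the raters never agree exactly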

Interrater reliability of categorical IV (5) Kappas in irregular matrices


If there are no observations in some row or column, Kappa will not be calculated; it is possible to fill up the matrix with zeros.

Original table (Rater 1 never uses category 2):

                Rater 2
Rater 1      1     2     3     4   Sum
    1        4     2     1     0     7
    3        0     1     3     1     5
    4        0     0     2     4     6
  Sum        4     3     6     5    18

K not possible to estimate

Filled up with zeros:

                Rater 2
Rater 1      1     2     3     4   Sum
    1        4     2     1     0     7
    2        0     0     0     0     0
    3        0     1     3     1     5
    4        0     0     2     4     6
  Sum        4     3     6     5    18

K = .47
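As a worked check (following the Kappa formula above; the intermediate values are not from the slide), the zero-filled table gives:
P_O = (4 + 0 + 3 + 4) / 18 = .611
P_E = [(7)(4) + (0)(3) + (5)(6) + (6)(5)] / 18^2 = .272
K = (.611 - .272) / (1 - .272) = .47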

Interrater reliability of categorical IV (6) weighted Kappa using SAS macro


PROC FREQ DATA = int.interrater1;
  TABLES rater1 * rater2 / AGREE;
  TEST KAPPA;
RUN;

K_W = 1 - (Σ w_i p_oi) / (Σ w_i p_ei)
(w_i = disagreement weights; p_oi = observed and p_ei = chance-expected proportions in cell i)

Papers and macros are available for estimating Kappa when there are unequal or misaligned rows and columns, or multiple raters: <http://www.stataxis.com/about_me.htm>
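As an illustrative cross-check in Python (not the SAS macro itself; scikit-learn assumed installed), cohen_kappa_score also accepts disagreement weights. Applied to the 12-study ratings from the earlier categorical example:

from sklearn.metrics import cohen_kappa_score

rater1 = [0, 1, 2, 1, 1, 2, 1, 1, 0, 2, 1, 1]
rater2 = [0, 1, 1, 1, 1, 2, 1, 1, 0, 1, 0, 1]

# Linear disagreement weights penalise near-misses less than large misses
print(round(cohen_kappa_score(rater1, rater2, weights="linear"), 2))   # 0.60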

Interrater reliability of continuous IV (1)


Study   Rater 1   Rater 2   Rater 3
  1        5         6         5
  2        2         1         2
  3        3         4         4
  4        4         4         4
  5        5         5         5
  6        3         3         4
  7        4         4         4
  8        4         3         3
  9        3         3         2
 10        2         2         1
 11        1         2         1
 12        3         3         3

Correlations (SPSS output, N = 12):
rater1 with rater2: Pearson r = .873**, Sig. (2-tailed) = .000
rater1 with rater3: Pearson r = .879**, Sig. (2-tailed) = .000
rater2 with rater3: Pearson r = .866**, Sig. (2-tailed) = .000
**. Correlation is significant at the 0.01 level (2-tailed).

Average correlation r = (.873 + .879 + .866) / 3 = .873
Coders code in the same direction!
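A minimal Python sketch of the same summary (NumPy assumed; the final Spearman-Brown step-up line illustrates the formula mentioned earlier and is not a figure from the slides):

import numpy as np
from itertools import combinations

# Ratings copied from the table above (one row per study, columns = raters 1-3)
ratings = np.array([
    [5, 6, 5], [2, 1, 2], [3, 4, 4], [4, 4, 4],
    [5, 5, 5], [3, 3, 4], [4, 4, 4], [4, 3, 3],
    [3, 3, 2], [2, 2, 1], [1, 2, 1], [3, 3, 3],
])

# Average pairwise Pearson correlation between the raters
pairs = combinations(range(ratings.shape[1]), 2)
r_bar = np.mean([np.corrcoef(ratings[:, i], ratings[:, j])[0, 1] for i, j in pairs])
print(round(r_bar, 3))   # about 0.873

# Spearman-Brown step-up: reliability of the mean rating across k = 3 coders
k = ratings.shape[1]
print(round(k * r_bar / (1 + (k - 1) * r_bar), 3))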

Interrater reliability of continuous IV (2)

Estimates of Covariance Parameters (a)

Parameter                               Estimate
Residual                                 .222222
Intercept [subject = study] Variance    1.544613

a. Dependent Variable: rating.

ICC = σ²_B / (σ²_B + σ²_W) = 1.544 / (1.544 + 0.222) = 1.544 / 1.767 = 0.874
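For a balanced design like this, the same components can be recovered by hand from a one-way random-effects ANOVA; a minimal Python sketch (NumPy only, assuming this ANOVA estimator is an adequate stand-in for the mixed-model output above):

import numpy as np

# Ratings from the 12-study x 3-rater table above (rows = studies)
ratings = np.array([
    [5, 6, 5], [2, 1, 2], [3, 4, 4], [4, 4, 4],
    [5, 5, 5], [3, 3, 4], [4, 4, 4], [4, 3, 3],
    [3, 3, 2], [2, 2, 1], [1, 2, 1], [3, 3, 3],
], dtype=float)

n_studies, k = ratings.shape
study_means = ratings.mean(axis=1)
grand_mean = ratings.mean()

# One-way random-effects ANOVA mean squares
ms_within = ((ratings - study_means[:, None]) ** 2).sum() / (n_studies * (k - 1))
ms_between = k * ((study_means - grand_mean) ** 2).sum() / (n_studies - 1)

var_within = ms_within                       # ~0.222, the Residual estimate
var_between = (ms_between - ms_within) / k   # ~1.545, the between-study variance

print(round(var_between / (var_between + var_within), 3))   # ICC, about 0.874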

Interrater reliability of continuous IV (3)

Design 1: one-way random effects model, when each study is rated by a different pair of coders
Design 2: two-way random effects model, when a random pair of coders rates all studies
Design 3: two-way mixed effects model, when ONE pair of coders rates all studies

Comparison of methods (from Orwin, p. 153; in Cooper & Hedges, 1994)

Low Kappa but a good AR (agreement rate) can occur when there is little variability across items and the coders agree.

Interrater reliability in meta-analysis and primary study

Interrater reliability in meta-analysis vs. in other contexts
Meta-analysis: coding of independent variables
How many co-judges?
How many objects to co-judge? (sub-sample of studies versus sub-sample of codings)
Use of a gold standard (i.e., one master-coder)
Coder drift (cf. observer drift): are coders consistent over time?
Your qualitative analysis is only as good as the quality of your categorisation of qualitative data.
