
Measuring association among ordinal categorical variables
Goodman and Kruskal's Gamma

An example
Effect of smoking at 45 years of age on self-reported health (SRH) five years later

Variable   Categories
Smoking    1 Never smoked
           2 Stopped smoking
           3 1-14 cigarettes/day
           4 15-24 cigarettes/day
           5 25+ cigarettes/day
SRH        1 Very good
           2 Good
           3 Fair
           4 Bad

Both variables have ordinal categories.

Expected monotonic association:
Increasing codes on Smoking go with increasing codes on SRH

Data on males from the Glostrup surveys

         | HEALTH51                    |       |
SMOKE45  |  Vgood   Good   Fair    Bad | TOTAL |
---------+-----------------------------+-------+
Never    |     16     73      6      1 |    96 |
    row% |   16.7   76.0    6.3    1.0 | 100.0 |
No mo    |     15     75      6      0 |    96 |
    row% |   15.6   78.1    6.3    0.0 | 100.0 |
1-14     |     13     59      7      1 |    80 |
    row% |   16.3   73.8    8.8    1.3 | 100.0 |
15-24    |     10     81     17      3 |   111 |
    row% |    9.0   73.0   15.3    2.7 | 100.0 |
25+      |      1     29      3      1 |    34 |
    row% |    2.9   85.3    8.8    2.9 | 100.0 |
---------+-----------------------------+-------+
TOTAL    |     55    317     39      6 |   417 |
    row% |   13.2   76.0    9.4    1.4 | 100.0 |
---------+-----------------------------+-------+

χ² = 16.2, df = 12, p = 0.182

No evidence of association, even though the expected monotonic relationship is as plain as the nose on your face
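
The χ² value quoted for this table can be checked with a few lines of Python. This is a minimal sketch, assuming the statistic on the slide is the ordinary Pearson χ² with (rows − 1)(columns − 1) degrees of freedom; the names `smoke_by_health` and `pearson_chi2` are illustrative and do not come from the slides.

```python
# Pearson chi-square statistic for a two-way table (standard library only).
# Rows: SMOKE45 (Never, No mo, 1-14, 15-24, 25+).
# Columns: HEALTH51 (Vgood, Good, Fair, Bad).
smoke_by_health = [
    [16, 73, 6, 1],
    [15, 75, 6, 0],
    [13, 59, 7, 1],
    [10, 81, 17, 3],
    [1, 29, 3, 1],
]


def pearson_chi2(table):
    """Return (chi-square statistic, degrees of freedom) for an r x c table.

    Assumes every row and column total is positive.
    """
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n
            chi2 += (observed - expected) ** 2 / expected
    df = (len(row_tot) - 1) * (len(col_tot) - 1)
    return chi2, df


chi2, df = pearson_chi2(smoke_by_health)
print(f"chi2 = {chi2:.1f}, df = {df}")
```

The statistic treats the categories as unordered labels, which is why it cannot pick up the ordered trend in the row percentages.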

Correlation coefficients for ordinal categorical variables

Pearson's correlation coefficient
1) Measures linear association, which is not meaningful for ordinal variables
2) Evaluation of significance requires normal distributions
Rank correlations (Kendall's τ and Spearman's ρ) are more appropriate but require continuous data with very little risk of ties.

Goodman and Kruskal's γ for ordinal categorical data

1) Similar to Kendall's τ.
2) Related to the odds ratio for 2×2 tables.
3) Well-known asymptotic properties.
4) A partial γ coefficient measuring conditional monotonic relationship among ordinal variables is available.

Monotonic relationships

Y increases when X increases (positive relationship) or when X decreases (negative relationship)

Two variables: X, Y
Probabilities: p_xy = Prob(X=x, Y=y)
X and Y are independent ⇔ p_xy = Prob(X=x)·Prob(Y=y)
What exactly do we mean when we say that there is a monotonic relationship between X and Y?

Concordance and discordance

Compare outcomes on (X,Y) for two stochastically independent cases (X1,Y1) and (X2,Y2)
Concordance (C) if X1<X2 and Y1<Y2, or X1>X2 and Y1>Y2
Discordance (D) if X1<X2 and Y1>Y2, or X1>X2 and Y1<Y2
Tie (T) if X1=X2 or Y1=Y2

Concordance = same trend in X and Y
Discordance = different trends in X and Y
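
The three cases can be written as a small classification rule. The sketch below assumes each case is a pair of integer category codes (x, y); the function name `classify_pair` is hypothetical.

```python
def classify_pair(case1, case2):
    """Classify two cases (x, y) as concordant 'C', discordant 'D' or tied 'T'."""
    (x1, y1), (x2, y2) = case1, case2
    dx = x2 - x1
    dy = y2 - y1
    if dx == 0 or dy == 0:
        return "T"  # tie on X or on Y
    # Same sign of the two differences means the same trend in X and Y.
    return "C" if dx * dy > 0 else "D"


print(classify_pair((1, 1), (3, 2)))  # higher X, higher Y
print(classify_pair((1, 3), (4, 2)))  # higher X, lower Y
print(classify_pair((2, 1), (2, 4)))  # same X
```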

Probability of concordance and discordance

$$P_C = \sum_{(x_1,y_1,x_2,y_2)\in C} p_{x_1y_1}\,p_{x_2y_2} \qquad P_D = \sum_{(x_1,y_1,x_2,y_2)\in D} p_{x_1y_1}\,p_{x_2y_2}$$

Positive relationship: P_C > P_D
Negative relationship: P_C < P_D

The gamma coefficient

A measure of the strength of the monotonic relationship

$$\gamma = \frac{P_C - P_D}{P_C + P_D}$$

Satisfies all conventional requirements of correlation coefficients:
-1 ≤ γ ≤ +1
γ = 0 if X and Y are independent
Positive association if γ > 0
Negative association if γ < 0
Reverse the order of the Y categories: γ after recoding = -γ before recoding


Interpretation of γ

$$P(C \mid C \cup D) = \frac{P(C)}{P(C) + P(D)} \qquad P(D \mid C \cup D) = \frac{P(D)}{P(C) + P(D)}$$

such that

$$\gamma = P(C \mid C \cup D) - P(D \mid C \cup D)$$

γ is the difference between two conditional probabilities
Estimation of γ

Pairwise comparison of all persons in the data set
n_C = number of concordances
n_D = number of discordances
n_T = number of ties
Relative frequencies

$$h_C = \frac{n_C}{n_C + n_D + n_T} \qquad h_D = \frac{n_D}{n_C + n_D + n_T} \qquad h_T = \frac{n_T}{n_C + n_D + n_T}$$

The estimate of γ

$$G = \frac{h_C - h_D}{h_C + h_D} = \frac{n_C - n_D}{n_C + n_D}$$
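A little bit of notation

n_xy = number of persons with X=x and Y=y

$$A_{ij} = \sum_{x<i,\;y<j} n_{xy} + \sum_{x>i,\;y>j} n_{xy} \qquad D_{ij} = \sum_{x<i,\;y>j} n_{xy} + \sum_{x>i,\;y<j} n_{xy}$$

[Figure: the r×c table with X = 1,…,c across and Y = 1,…,r down. With cell n_xy as pivot, A_xy covers the upper-left and lower-right quadrants and D_xy the upper-right and lower-left quadrants.]

Number of concordances and discordances

$$n_C = \sum_{x,y} n_{xy} A_{xy} \qquad n_D = \sum_{x,y} n_{xy} D_{xy}$$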
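
The counts n_C and n_D, and from them G, can be computed directly from the definitions of A_xy and D_xy. The sketch below is a plain quadruple loop, assuming the table is given as a list of rows with both variables coded in increasing order; `concordance_counts` and `gamma_estimate` are illustrative names. With these sums every pair of persons is counted twice, once from each side, which cancels in the ratio.

```python
def concordance_counts(table):
    """Return (n_C, n_D) as sums of n_xy * A_xy and n_xy * D_xy."""
    n_rows, n_cols = len(table), len(table[0])
    n_c = n_d = 0
    for i in range(n_rows):
        for j in range(n_cols):
            a_ij = d_ij = 0
            for k in range(n_rows):
                for l in range(n_cols):
                    sign = (k - i) * (l - j)
                    if sign > 0:      # both indices higher or both lower
                        a_ij += table[k][l]
                    elif sign < 0:    # one index higher, the other lower
                        d_ij += table[k][l]
            n_c += table[i][j] * a_ij
            n_d += table[i][j] * d_ij
    return n_c, n_d


def gamma_estimate(table):
    """Goodman and Kruskal's G = (n_C - n_D) / (n_C + n_D)."""
    n_c, n_d = concordance_counts(table)
    return (n_c - n_d) / (n_c + n_d)


# SMOKE45 (rows) by HEALTH51 (columns), males from the Glostrup surveys.
smoke_by_health = [
    [16, 73, 6, 1],
    [15, 75, 6, 0],
    [13, 59, 7, 1],
    [10, 81, 17, 3],
    [1, 29, 3, 1],
]
print(f"G = {gamma_estimate(smoke_by_health):.2f}")
```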

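The γ coefficient for 2×2 tables

    a   b
    c   d

n_C = ad, n_D = bc

$$\gamma = \frac{n_C - n_D}{n_C + n_D} = \frac{ad - bc}{ad + bc} = \frac{\frac{ad}{bc} - 1}{\frac{ad}{bc} + 1} = \frac{OR - 1}{OR + 1}$$

$$OR = \frac{1 + \gamma}{1 - \gamma} \qquad \gamma = \frac{OR - 1}{OR + 1}$$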

Gamma, odds-ratio and logit values

gamma   odds-ratio   logit = ln(odds-ratio)
-1.00      0.00         -∞
-0.90      0.05       -2.94
-0.80      0.11       -2.20
-0.70      0.18       -1.73
-0.60      0.25       -1.39
-0.50      0.33       -1.10
-0.40      0.43       -0.85
-0.30      0.54       -0.62
-0.20      0.67       -0.41
-0.10      0.82       -0.20
 0.00      1.00        0.00
 0.10      1.22        0.20
 0.20      1.50        0.41
 0.30      1.86        0.62
 0.40      2.33        0.85
 0.50      3.00        1.10
 0.60      4.00        1.39
 0.70      5.67        1.73
 0.80      9.00        2.20
 0.90     19.00        2.94
 1.00       +∞          +∞

Note: logit ≈ 2·gamma in the interval [-0.30, 0.30]
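
The table follows from the 2×2 relation between γ and the odds ratio. A short sketch that regenerates the finite rows, assuming OR = (1 + γ)/(1 − γ); the function names are illustrative.

```python
import math


def gamma_to_odds_ratio(gamma):
    """Odds ratio corresponding to gamma in a 2x2 table (requires gamma < 1)."""
    return (1 + gamma) / (1 - gamma)


def odds_ratio_to_gamma(odds_ratio):
    """Gamma corresponding to an odds ratio in a 2x2 table."""
    return (odds_ratio - 1) / (odds_ratio + 1)


# Finite rows of the table: gamma from -0.90 to 0.90 in steps of 0.10.
for step in range(-9, 10):
    g = step / 10
    odds_ratio = gamma_to_odds_ratio(g)
    logit = math.log(odds_ratio)
    print(f"{g:5.2f}  {odds_ratio:6.2f}  {logit:6.2f}  (2*gamma = {2 * g:5.2f})")
```

Printing 2·γ next to the logit makes the approximation in the note visible: the two columns agree closely for |γ| ≤ 0.30 and drift apart outside that range.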


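Properties of the estimate of γ

The estimate is asymptotically unbiased, E(G) → γ, and asymptotically normally distributed with standard error, s_1, given by

$$s_1^2 = \frac{16}{(n_C + n_D)^4} \sum_{x,y} n_{xy}\,\bigl(n_D A_{xy} - n_C D_{xy}\bigr)^2$$

If X and Y are independent, then the standard error, s_0, of G is given by

$$s_0^2 = \frac{4}{(n_C + n_D)^2} \left[\, \sum_{x,y} n_{xy}\,\bigl(A_{xy} - D_{xy}\bigr)^2 - \frac{(n_C - n_D)^2}{n} \right]$$

where n is the total number of persons.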
Statistical inference

95 % confidence intervals
G ± 1.96·s_1
Test of significance
If X and Y are independent then
z = G / s_0 ~ Norm(0,1)
Notice that confidence intervals and assessment of significance use different estimates of the standard error
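
The interval and the test can be put together in one function. This is a sketch under stated assumptions: the standard errors are taken in their usual large-sample form, s_1² = 16/(n_C+n_D)⁴ · Σ n_xy (n_D A_xy − n_C D_xy)² and s_0² = 4/(n_C+n_D)² · [Σ n_xy (A_xy − D_xy)² − (n_C − n_D)²/n], with n_C and n_D computed as the sums Σ n_xy A_xy and Σ n_xy D_xy. The name `gamma_inference` and the returned keys are hypothetical.

```python
import math


def gamma_inference(table):
    """Return G, s1, s0, a 95 % confidence interval and the z statistic."""
    n_rows, n_cols = len(table), len(table[0])
    n = sum(sum(row) for row in table)
    cells = []  # (n_xy, A_xy, D_xy) for every cell
    for i in range(n_rows):
        for j in range(n_cols):
            a = sum(table[k][l] for k in range(n_rows) for l in range(n_cols)
                    if (k - i) * (l - j) > 0)
            d = sum(table[k][l] for k in range(n_rows) for l in range(n_cols)
                    if (k - i) * (l - j) < 0)
            cells.append((table[i][j], a, d))
    n_c = sum(n_xy * a for n_xy, a, _ in cells)
    n_d = sum(n_xy * d for n_xy, _, d in cells)
    g = (n_c - n_d) / (n_c + n_d)
    # s1: used for confidence intervals.
    var1 = 16 / (n_c + n_d) ** 4 * sum(
        n_xy * (n_d * a - n_c * d) ** 2 for n_xy, a, d in cells)
    # s0: valid under independence, used for the significance test.
    var0 = 4 / (n_c + n_d) ** 2 * (
        sum(n_xy * (a - d) ** 2 for n_xy, a, d in cells) - (n_c - n_d) ** 2 / n)
    s1, s0 = math.sqrt(var1), math.sqrt(var0)
    z = g / s0
    return {
        "G": g,
        "s1": s1,
        "s0": s0,
        "ci95": (g - 1.96 * s1, g + 1.96 * s1),
        "z": z,
        "p_one_sided": 0.5 * math.erfc(z / math.sqrt(2)),  # P(Z >= z)
    }


# SMOKE45 (rows) by HEALTH51 (columns), males from the Glostrup surveys.
smoke_by_health = [
    [16, 73, 6, 1],
    [15, 75, 6, 0],
    [13, 59, 7, 1],
    [10, 81, 17, 3],
    [1, 29, 3, 1],
]
result = gamma_inference(smoke_by_health)
print({key: result[key] for key in ("G", "ci95", "z", "p_one_sided")})
```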


The example

For the SMOKE45 by HEALTH51 table shown earlier (males from the Glostrup surveys, n = 417):

χ² = 16.2, df = 12, p = 0.182
γ = 0.24, p < 0.0005
Very strong evidence of an effect of smoking on health
For ordinal variables, γ is much more powerful than χ²-distributed test statistics

Exact conditional inference

The problem:
Can the distributions of estimates and test statistics be approximated by their asymptotic distributions in small and moderate samples?
The small number of persons with bad health would result in warnings from most statistical programs that the asymptotics probably do not work.

If in doubt, use exact conditional tests instead of asymptotic tests.

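The hypergeometric distribution

The contingency table: n_xy, x = 1,…,c, y = 1,…,r

The margins of the table:

$$n_{x+} = \sum_{y} n_{xy} \qquad n_{+y} = \sum_{x} n_{xy} \qquad n = \sum_{x,y} n_{xy}$$

The probability of the table:

$$P(n_{11},\dots,n_{cr}) = \binom{n}{n_{11}\,\dots\,n_{cr}} \prod_{x,y} p_{xy}^{\,n_{xy}}$$

H0: p_xy = p_{x+} p_{+y}

$$P(n_{11},\dots,n_{cr}) = \binom{n}{n_{11}\,\dots\,n_{cr}} \prod_{x} p_{x+}^{\,n_{x+}} \prod_{y} p_{+y}^{\,n_{+y}}$$

The marginal tables, n_{x+} and n_{+y}, are sufficient under H0

$$P(n_{11},\dots,n_{cr} \mid n_{1+},\dots,n_{c+},\, n_{+1},\dots,n_{+r}) = \frac{\prod_{x} n_{x+}!\;\prod_{y} n_{+y}!}{n!\;\prod_{x,y} n_{xy}!}$$

does not depend on unknown parameters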
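
The conditional probability only involves factorials of the cell counts and the margins, so it can be evaluated exactly. A minimal sketch using exact fractions; the function name and the small 2×2 table are made up for illustration.

```python
from fractions import Fraction
from math import factorial, prod


def conditional_table_probability(table):
    """Probability of a table given its row and column margins under H0."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    numerator = (prod(factorial(t) for t in row_tot)
                 * prod(factorial(t) for t in col_tot))
    denominator = factorial(n) * prod(factorial(v) for row in table for v in row)
    return Fraction(numerator, denominator)


# Illustrative 2x2 table with row margins (3, 7) and column margins (4, 6).
example = [[1, 2], [3, 4]]
print(conditional_table_probability(example))

# All 2x2 tables with these margins: the probabilities sum to one.
same_margins = [[[a, 3 - a], [4 - a, 3 + a]] for a in range(0, 4)]
print(sum(conditional_table_probability(t) for t in same_margins))
```

No cell probabilities appear anywhere in the function, which is the point of conditioning on the margins.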


The exact conditional test procedure 1

Find all tables with the same marginal tables as the observed table.
For each of these tables calculate:
  the conditional probability of the table
  the test statistics of interest
The exact p-value = the sum of probabilities for tables with test statistics that are at least as extreme as the test statistic of the observed table

The exact conditional test procedure

Test statistic T(M), where M is an r×c table
Observed test statistic = t_obs
The exact p-value

$$p_{exact} = \sum_{\substack{M:\; m_{x+} = n_{x+},\; m_{+y} = n_{+y},\\ T(M) \ge t_{obs}}} P(M \mid n_{1+},\dots,n_{c+},\, n_{+1},\dots,n_{+r})$$

Fisher's exact test for 2×2 tables

Also appropriate for r×c tables, but may be very time-consuming due to the number of tables fitting the margins

The Monte Carlo test

Since the conditional probabilities are known exactly, we may ask the computer to generate a random sample consisting of a large number of independent tables from this distribution.
The MC test procedure:
Generate tables: M_1,…,M_Nsim
Calculate test statistics for each table: T_i = T(M_i), i = 1,…,Nsim
Count the number of random test statistics which are at least as extreme as the observed statistic

$$p_{MC} = \frac{1}{N_{sim}} \sum_{i=1}^{N_{sim}} 1\{T_i \ge t_{obs}\}$$

is an unbiased estimate of p_exact

The standard error of p_MC depends on Nsim
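
The procedure can be sketched in a few functions. Two assumptions are made here that the slides do not spell out: random tables with fixed margins are generated by shuffling the column labels of the persons (which gives the conditional distribution under independence), and the test is one-sided with G as the test statistic. All function names are illustrative, and the seed is fixed only to make the run repeatable.

```python
import random


def gamma_stat(table):
    """G = (n_C - n_D) / (n_C + n_D); returns 0.0 if there are no untied pairs."""
    n_rows, n_cols = len(table), len(table[0])
    n_c = n_d = 0
    for i in range(n_rows):
        for j in range(n_cols):
            for k in range(n_rows):
                for l in range(n_cols):
                    sign = (k - i) * (l - j)
                    if sign > 0:
                        n_c += table[i][j] * table[k][l]
                    elif sign < 0:
                        n_d += table[i][j] * table[k][l]
    return (n_c - n_d) / (n_c + n_d) if n_c + n_d else 0.0


def random_table(row_tot, col_tot, rng):
    """Draw one table with the given margins by permuting column labels."""
    labels = [j for j, total in enumerate(col_tot) for _ in range(total)]
    rng.shuffle(labels)
    table = [[0] * len(col_tot) for _ in row_tot]
    start = 0
    for i, total in enumerate(row_tot):
        for j in labels[start:start + total]:
            table[i][j] += 1
        start += total
    return table


def monte_carlo_p(table, n_sim=2000, seed=1):
    """One-sided Monte Carlo p-value: share of simulated G >= observed G."""
    rng = random.Random(seed)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    t_obs = gamma_stat(table)
    hits = sum(gamma_stat(random_table(row_tot, col_tot, rng)) >= t_obs
               for _ in range(n_sim))
    return hits / n_sim


# SMOKE45 (rows) by HEALTH51 (columns), males from the Glostrup surveys.
smoke_by_health = [
    [16, 73, 6, 1],
    [15, 75, 6, 0],
    [13, 59, 7, 1],
    [10, 81, 17, 3],
    [1, 29, 3, 1],
]
p_mc = monte_carlo_p(smoke_by_health)
print(f"p_MC = {p_mc:.4f}")
```

Since each simulated table is an independent draw, p_MC is a binomial proportion and its standard error is sqrt(p(1 − p)/Nsim), so a larger `n_sim` gives a more precise estimate.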

Sequential Monte Carlo tests

Interrupts the Monte Carlo procedure when it becomes obvious that the test statistic will not be significant
Nsim = 10,000
Critical level of the test = 5 %
The sequential Monte Carlo test interrupts the Monte Carlo procedure when the number of tables, S, with T(M_i) ≥ t_obs is equal to 501
S ≥ 501 ⇒ p_MC ≥ 501/10000 > 0.05
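
The stopping rule is easy to express in code. In this sketch `draw_statistic` is a hypothetical callable that returns the test statistic of one freshly simulated table; the demonstration uses uniform random numbers as a stand-in for such statistics, not real tables.

```python
import math
import random


def sequential_mc_test(draw_statistic, t_obs, n_sim=10_000, alpha=0.05):
    """Monte Carlo test that stops once significance is no longer possible.

    Stops as soon as the number of simulated statistics >= t_obs reaches
    floor(alpha * n_sim) + 1 (501 for n_sim = 10,000 and alpha = 5 %).
    When stopped early, 'p' is a lower bound for the full-run p_MC.
    """
    stop_count = math.floor(alpha * n_sim + 1e-9) + 1
    hits = 0
    for draws in range(1, n_sim + 1):
        if draw_statistic() >= t_obs:
            hits += 1
            if hits >= stop_count:
                return {"p": hits / n_sim, "draws": draws, "stopped_early": True}
    return {"p": hits / n_sim, "draws": n_sim, "stopped_early": False}


rng = random.Random(0)
# Stand-in statistics: about half of the draws exceed 0.5, so the run stops early.
not_significant = sequential_mc_test(rng.random, t_obs=0.5)
# No draw can reach 2.0, so all 10,000 tables are generated.
significant = sequential_mc_test(rng.random, t_obs=2.0)
print(not_significant)
print(significant)
```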


Repeated Monte Carlo tests

The repeated Monte Carlo test interrupts the Monte Carlo procedure when the risk of a significant p_MC-value has become very small

Parameters of the repeated Monte Carlo test:
Nsim = the total number of tables to be generated
Nstart = the minimum number of tables to be generated
Critical value
Max risk of stopping too soon (default = 0.1 %)

Smoking and self-reported health

For the SMOKE45 by HEALTH51 table shown earlier (n = 417):

χ² = 16.2, df = 12, p = 0.182
Gam = 0.24, p = 0.000

Confounding?

Analysis of the conditional association given self-reported health at 45 years

HEALTH45 = Very good (stratum 1)

         | HEALTH51                    |       |
SMOKE45  |  V.goo   Good   Fair    Bad | TOTAL |
---------+-----------------------------+-------+
Never    |      4     12      0      0 |    16 |
    row% |   25.0   75.0    0.0    0.0 | 100.0 |
No mo    |      5      7      0      0 |    12 |
    row% |   41.7   58.3    0.0    0.0 | 100.0 |
1-14     |      9      6      0      0 |    15 |
    row% |   60.0   40.0    0.0    0.0 | 100.0 |
15-24    |      2      3      0      0 |     5 |
    row% |   40.0   60.0    0.0    0.0 | 100.0 |
25+      |      0      3      0      0 |     3 |
    row% |    0.0  100.0    0.0    0.0 | 100.0 |
---------+-----------------------------+-------+
TOTAL    |     20     31      0      0 |    51 |
    row% |   39.2   60.8    0.0    0.0 | 100.0 |
---------+-----------------------------+-------+

χ² = 6.0, df = 4, p = 0.196
Gam = -0.18, p = 0.188

HEALTH45 = Good (stratum 2)

         | HEALTH51                    |       |
SMOKE45  |  V.goo   Good   Fair    Bad | TOTAL |
---------+-----------------------------+-------+
Never    |     11     55      5      1 |    72 |
    row% |   15.3   76.4    6.9    1.4 | 100.0 |
No mo    |     10     59      5      0 |    74 |
    row% |   13.5   79.7    6.8    0.0 | 100.0 |
1-14     |      3     50      4      1 |    58 |
    row% |    5.2   86.2    6.9    1.7 | 100.0 |
15-24    |      6     76      8      1 |    91 |
    row% |    6.6   83.5    8.8    1.1 | 100.0 |
25+      |      1     25      1      0 |    27 |
    row% |    3.7   92.6    3.7    0.0 | 100.0 |
---------+-----------------------------+-------+
TOTAL    |     31    265     23      3 |   322 |
    row% |    9.6   82.3    7.1    0.9 | 100.0 |
---------+-----------------------------+-------+

χ² = 9.8, df = 12, p = 0.636
Gam = 0.17, p = 0.041

HEALTH45 = Fair (stratum 3)

         | HEALTH51                    |       |
SMOKE45  |  V.goo   Good   Fair    Bad | TOTAL |
---------+-----------------------------+-------+
Never    |      1      6      1      0 |     8 |
    row% |   12.5   75.0   12.5    0.0 | 100.0 |
No mo    |      0      6      1      0 |     7 |
    row% |    0.0   85.7   14.3    0.0 | 100.0 |
1-14     |      1      3      3      0 |     7 |
    row% |   14.3   42.9   42.9    0.0 | 100.0 |
15-24    |      2      1      6      2 |    11 |
    row% |   18.2    9.1   54.5   18.2 | 100.0 |
25+      |      0      1      2      1 |     4 |
    row% |    0.0   25.0   50.0   25.0 | 100.0 |
---------+-----------------------------+-------+
TOTAL    |      4     17     13      3 |    37 |
    row% |   10.8   45.9   35.1    8.1 | 100.0 |
---------+-----------------------------+-------+

χ² = 17.5, df = 12, p = 0.131
Gam = 0.52, p = 0.001

HEALTH45 = Bad (stratum 4)

         | HEALTH51                    |       |
SMOKE45  |  V.goo   Good   Fair    Bad | TOTAL |
---------+-----------------------------+-------+
Never    |      0      0      0      0 |     0 |
    row% |    0.0    0.0    0.0    0.0 |   0.0 |
No mo    |      0      3      0      0 |     3 |
    row% |    0.0  100.0    0.0    0.0 | 100.0 |
1-14     |      0      0      0      0 |     0 |
    row% |    0.0    0.0    0.0    0.0 |   0.0 |
15-24    |      0      1      3      0 |     4 |
    row% |    0.0   25.0   75.0    0.0 | 100.0 |
25+      |      0      0      0      0 |     0 |
    row% |    0.0    0.0    0.0    0.0 |   0.0 |
---------+-----------------------------+-------+
TOTAL    |      0      4      3      0 |     7 |
    row% |    0.0   57.1   42.9    0.0 | 100.0 |
---------+-----------------------------+-------+

χ² = 3.9, df = 1, p = 0.047
Gam = 1.00, p = 0.001


** Local test results for strata defined by HEALTH45 (G) **

                            p-values                 p-values (1-sided)
G: HEALTH45      χ²   df   asympt    exact   Gamma   asympt    exact
---------------------------------------------------------------------
1: V.good      6.04    4   0.1960   0.1880   -0.18   0.1884   0.2050
2: Good        9.77   12   0.6358   0.6310    0.17   0.0410   0.0290
3: Fair       17.52   12   0.1311   0.1430    0.52   0.0010   0.0030
4: Bad         3.94    1   0.0472   0.1540    1.00   0.0006   0.1220
---------------------------------------------------------------------

Tests of conditional independence

H0: P(X,Y|Z=z) = P(X|Z=z)·P(Y|Z=z) for all z

Each stratum of Z has its own X×Y table, with its own concordance and discordance counts and its own test statistics:

Z     Concordance and discordance    Test statistics
1     N_1C and N_1D                  χ²_1,  γ_1 = (N_1C - N_1D)/(N_1C + N_1D)
2     N_2C and N_2D                  χ²_2,  γ_2 = (N_2C - N_2D)/(N_2C + N_2D)
...
k     N_kC and N_kD                  χ²_k,  γ_k = (N_kC - N_kD)/(N_kC + N_kD)

All test statistics must be insignificant

Global tests of conditional independence
The global χ²

$$\chi^2 = \sum_{z} \chi^2_z \qquad df = \sum_{z} df_z$$

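The partial γ coefficient

$$N_C = \sum_{z} N_{zC} \qquad N_D = \sum_{z} N_{zD}$$

$$\gamma_{partial} = \frac{N_C - N_D}{N_C + N_D} = \frac{\sum_{z} (N_{zC} - N_{zD})}{\sum_{z} (N_{zC} + N_{zD})} = \sum_{z} w_z\, \frac{N_{zC} - N_{zD}}{N_{zC} + N_{zD}}$$

where

$$w_z = \frac{N_{zC} + N_{zD}}{\sum_{i} (N_{iC} + N_{iD})}$$

so that

$$\gamma_{partial} = \sum_{z} w_z \gamma_z$$

Weighted mean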
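
The partial coefficient can be computed from the four HEALTH45 strata shown earlier, either by pooling the concordance and discordance counts or as the weighted mean of the local coefficients. The sketch below does both; the dictionary `strata` and the function name are illustrative, and the counts are copied from the stratum tables (rows SMOKE45, columns HEALTH51).

```python
def concordance_counts(table):
    """Return (N_C, N_D) as sums of n_xy * A_xy and n_xy * D_xy."""
    n_rows, n_cols = len(table), len(table[0])
    n_c = n_d = 0
    for i in range(n_rows):
        for j in range(n_cols):
            for k in range(n_rows):
                for l in range(n_cols):
                    sign = (k - i) * (l - j)
                    if sign > 0:
                        n_c += table[i][j] * table[k][l]
                    elif sign < 0:
                        n_d += table[i][j] * table[k][l]
    return n_c, n_d


strata = {
    "V.good": [[4, 12, 0, 0], [5, 7, 0, 0], [9, 6, 0, 0], [2, 3, 0, 0], [0, 3, 0, 0]],
    "Good": [[11, 55, 5, 1], [10, 59, 5, 0], [3, 50, 4, 1], [6, 76, 8, 1], [1, 25, 1, 0]],
    "Fair": [[1, 6, 1, 0], [0, 6, 1, 0], [1, 3, 3, 0], [2, 1, 6, 2], [0, 1, 2, 1]],
    "Bad": [[0, 0, 0, 0], [0, 3, 0, 0], [0, 0, 0, 0], [0, 1, 3, 0], [0, 0, 0, 0]],
}

counts = {z: concordance_counts(table) for z, table in strata.items()}
total_pairs = sum(n_c + n_d for n_c, n_d in counts.values())

# Local coefficients and their weights w_z.
local_gamma = {z: (n_c - n_d) / (n_c + n_d) for z, (n_c, n_d) in counts.items()}
weights = {z: (n_c + n_d) / total_pairs for z, (n_c, n_d) in counts.items()}

# Pooled form and weighted-mean form of the partial coefficient.
pooled = sum(n_c - n_d for n_c, n_d in counts.values()) / total_pairs
weighted_mean = sum(weights[z] * local_gamma[z] for z in strata)

for z in strata:
    print(f"{z:7s} gamma = {local_gamma[z]:5.2f}  weight = {weights[z]:.3f}")
print(f"partial gamma = {pooled:.2f}")
```

The weights here are shares of untied pairs, so the large "Good" stratum dominates the partial coefficient.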

Asymptotic normal distribution
Monte Carlo approximation of exact conditional p-values as for 2-way tables

Local test results for the strata defined by HEALTH45 (G): see the table of local test results above.

Global χ² = 37.3, df = 29, p = 0.139, p_exact = 0.148

γ partial = 0.17, p = 0.034, p_exact = 0.027

Are the local γ coefficients homogeneous?

Least squares estimate: Gamma = 0.1998, s.e. = 0.0777

G: HEALTH45    Gamma   variance     s.e.   weight   residual
-------------------------------------------------------------
1: V.good      -0.18     0.0397   0.1993    0.152     -2.060
2: Good         0.17     0.0097   0.0987    0.620     -0.411
3: Fair         0.52     0.0264   0.1625    0.228      2.237
4: Bad          1.00     standard error is not available
-------------------------------------------------------------
Incomplete set of Gammas
Test for partial association: χ² = 7.5, df = 2, p = 0.023

Pairwise comparisons of strata:
Comparison of strata 1 and 2: p = 0.11
Significant difference between strata 1+2 and stratum 3: p = 0.025
Notice the similarity of the analysis of γ coefficients and Mantel-Haenszel analysis of odds-ratios
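
The weights, the least squares estimate and the χ² statistic in the table can be reproduced approximately from the printed Gammas and variances. This sketch rests on an assumption the slides do not state: that the least squares estimate is the inverse-variance weighted mean of the local coefficients and that the test statistic is the weighted sum of squared deviations from it. Stratum 4 is left out because its standard error is not available, and the inputs are the rounded values from the table, so the results agree with the slide only up to rounding.

```python
import math

# Rounded values copied from the table (strata 1-3 only).
gammas = {"V.good": -0.18, "Good": 0.17, "Fair": 0.52}
variances = {"V.good": 0.0397, "Good": 0.0097, "Fair": 0.0264}

# Inverse-variance weights, normalised to sum to one.
precision = {z: 1 / variances[z] for z in gammas}
total_precision = sum(precision.values())
weights = {z: precision[z] / total_precision for z in gammas}

# Weighted least squares estimate of a common gamma and its standard error.
common_gamma = sum(weights[z] * gammas[z] for z in gammas)
common_se = math.sqrt(1 / total_precision)

# Homogeneity statistic: weighted squared deviations, df = strata - 1.
chi2_homogeneity = sum((gammas[z] - common_gamma) ** 2 / variances[z] for z in gammas)
df = len(gammas) - 1

for z in gammas:
    print(f"{z:7s} weight = {weights[z]:.3f}")
print(f"common gamma = {common_gamma:.3f}, s.e. = {common_se:.4f}")
print(f"chi2 = {chi2_homogeneity:.1f}, df = {df}")
```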
