categorical variables
Goodman and Kruskal's Gamma
An example
Effect of smoking at 45 years of age on self-reported health five years later

Variable   Categories
Smoking    1 Never smoked
           2 Stopped smoking
           3 1-14 cigarettes/day
           4 15-24 cigarettes/day
           5 25+ cigarettes/day
SRH        1 Very good
           2 Good
           3 Fair
           4 Bad
Ordinal categories
$\chi^2$ = 16.2
df = 12
p = 0.182
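Monotonic relationships
$p_{xy}$ = Prob(X = x, Y = y)
What exactly do we mean when we say that there is a monotonic relationship between X and Y?
Consider two independent observations $(X_1, Y_1)$ and $(X_2, Y_2)$. The pair is
concordant (C) if $(X_1 - X_2)(Y_1 - Y_2) > 0$
discordant (D) if $(X_1 - X_2)(Y_1 - Y_2) < 0$
tied (T) if $X_1 = X_2$ or $Y_1 = Y_2$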
Probability of concordance
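$$P_C = \sum_{(x_1,y_1,x_2,y_2)\in C} p_{x_1y_1}\,p_{x_2y_2}$$
Probability of discordance
$$P_D = \sum_{(x_1,y_1,x_2,y_2)\in D} p_{x_1y_1}\,p_{x_2y_2}$$
Gamma
$$\gamma = \frac{P_C - P_D}{P_C + P_D}$$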
Satisfies all conventional requirements of correlation coefficients:
$-1 \le \gamma \le +1$
$\gamma = 0$ if X and Y are independent
Positive association if $\gamma > 0$
Negative association if $\gamma < 0$
Reversing the order of the Y categories changes the sign of $\gamma$.
$$P(C \mid C \cup D) = \frac{P(C)}{P(C) + P(D)}, \qquad P(D \mid C \cup D) = \frac{P(D)}{P(C) + P(D)}$$
such that
$$\gamma = P(C \mid C \cup D) - P(D \mid C \cup D)$$
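Observed proportions of concordant, discordant and tied pairs:
$$h_C = \frac{n_C}{n_C + n_D + n_T}, \qquad h_D = \frac{n_D}{n_C + n_D + n_T}, \qquad h_T = \frac{n_T}{n_C + n_D + n_T}$$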
The estimate of $\gamma$
$$G = \frac{h_C - h_D}{h_C + h_D} = \frac{n_C - n_D}{n_C + n_D}$$
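A little bit of notation
$n_{xy}$ = number of persons with X = x and Y = y
$$A_{ij} = \sum_{x<i,\,y<j} n_{xy} + \sum_{x>i,\,y>j} n_{xy}$$
$$D_{ij} = \sum_{x<i,\,y>j} n_{xy} + \sum_{x>i,\,y<j} n_{xy}$$
In the R × C table, $A_{xy}$ sums the cells above-left and below-right of cell (x, y), and $D_{xy}$ sums the cells above-right and below-left.
$$n_C = \sum_{x,y} n_{xy} A_{xy}, \qquad n_D = \sum_{x,y} n_{xy} D_{xy}$$
These sums count every pair twice; the common factor 2 cancels in G.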
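The notation above can be sketched in a few lines of plain Python. This is a minimal illustration, not the lecturer's software: the function names `concordance_counts` and `gamma_estimate` and the 2 × 2 toy table are assumptions made for the example.

```python
def concordance_counts(table):
    """Return (A, D, n_C, n_D) for a contingency table given as a list of rows."""
    R, C = len(table), len(table[0])
    A = [[0] * C for _ in range(R)]
    D = [[0] * C for _ in range(R)]
    for i in range(R):
        for j in range(C):
            for x in range(R):
                for y in range(C):
                    if (x < i and y < j) or (x > i and y > j):
                        A[i][j] += table[x][y]   # cells concordant with (i, j)
                    elif (x < i and y > j) or (x > i and y < j):
                        D[i][j] += table[x][y]   # cells discordant with (i, j)
    # Each pair is counted twice in these sums; the factor cancels in G.
    n_C = sum(table[i][j] * A[i][j] for i in range(R) for j in range(C))
    n_D = sum(table[i][j] * D[i][j] for i in range(R) for j in range(C))
    return A, D, n_C, n_D


def gamma_estimate(table):
    """G = (n_C - n_D) / (n_C + n_D)."""
    _, _, n_C, n_D = concordance_counts(table)
    return (n_C - n_D) / (n_C + n_D)


# Hypothetical toy table, used only to show the mechanics.
toy = [[2, 1],
       [1, 2]]
A, D, n_C, n_D = concordance_counts(toy)
print(A, D, n_C, n_D, gamma_estimate(toy))
```

For the toy table the sums give n_C = 8 and n_D = 2, that is, twice the pair counts ad = 4 and bc = 1, and G = 0.6 either way.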
The 2 × 2 table

        Y=1   Y=2
X=1      a     b
X=2      c     d

Counting each pair once: $n_C = ad$, $n_D = bc$
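$$\gamma = \frac{n_C - n_D}{n_C + n_D} = \frac{ad - bc}{ad + bc} = \frac{\frac{ad}{bc} - 1}{\frac{ad}{bc} + 1} = \frac{OR - 1}{OR + 1}$$
$$OR = \frac{1 + \gamma}{1 - \gamma} \quad\Longleftrightarrow\quad \gamma = \frac{OR - 1}{OR + 1}$$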
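gamma    odds ratio    logit = ln(odds ratio)
-1.0        0.00         -∞
-0.9        0.05        -2.94
-0.8        0.11        -2.20
-0.7        0.18        -1.73
-0.6        0.25        -1.39
-0.5        0.33        -1.10
-0.4        0.43        -0.85
-0.3        0.54        -0.62
-0.2        0.67        -0.41
-0.1        0.82        -0.20
 0.0        1.00         0.00
 0.1        1.22         0.20
 0.2        1.50         0.41
 0.3        1.86         0.62
 0.4        2.33         0.85
 0.5        3.00         1.10
 0.6        4.00         1.39
 0.7        5.67         1.73
 0.8        9.00         2.20
 0.9       19.00         2.94
 1.0         +∞           +∞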
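The correspondence between gamma, the odds ratio and its logarithm for a 2 × 2 table can be checked with a short sketch. The function names are illustrative assumptions; the formulas are OR = (1 + γ)/(1 − γ) and γ = (OR − 1)/(OR + 1).

```python
import math


def gamma_to_odds_ratio(gamma):
    """OR = (1 + gamma) / (1 - gamma); gamma = 1 maps to +infinity."""
    if gamma >= 1:
        return math.inf
    return (1 + gamma) / (1 - gamma)


def odds_ratio_to_gamma(odds_ratio):
    """gamma = (OR - 1) / (OR + 1); OR = +infinity maps to gamma = 1."""
    if math.isinf(odds_ratio):
        return 1.0
    return (odds_ratio - 1) / (odds_ratio + 1)


def log_odds_ratio(gamma):
    """ln(OR), with the limits -infinity and +infinity at gamma = -1 and +1."""
    odds_ratio = gamma_to_odds_ratio(gamma)
    if odds_ratio == 0:
        return -math.inf
    if math.isinf(odds_ratio):
        return math.inf
    return math.log(odds_ratio)


# Print gamma, odds ratio and ln(odds ratio) for gamma = -1.0, -0.9, ..., 1.0
for k in range(-10, 11):
    g = k / 10
    print(f"{g:5.1f} {gamma_to_odds_ratio(g):8.2f} {log_odds_ratio(g):8.2f}")
```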
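Standard errors, with $n_C = \sum_{x,y} n_{xy} A_{xy}$ and $n_D = \sum_{x,y} n_{xy} D_{xy}$:
$$\mathrm{se}_1^2 = \frac{16}{(n_C + n_D)^4} \sum_{x,y} n_{xy} \left(n_D A_{xy} - n_C D_{xy}\right)^2$$
$$\mathrm{se}_0^2 = \frac{4}{(n_C + n_D)^2} \left[\sum_{x,y} n_{xy} \left(A_{xy} - D_{xy}\right)^2 - \frac{(n_C - n_D)^2}{n}\right]$$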
Statistical inference
95 % confidence intervals
$$G \pm 1.96\,\mathrm{se}_1$$
Test of significance
If X and Y are independent then
$$z = \frac{G}{\mathrm{se}_0} \sim \mathrm{Norm}(0,1)$$
Notice that confidence intervals and assessment of significance use different estimates of the standard errors
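A minimal sketch of the confidence interval and the z test, assuming the usual asymptotic variance formulas for gamma written in terms of the double-counted sums n_C = Σ n_xy A_xy and n_D = Σ n_xy D_xy. The function name, the returned dictionary and the toy table are hypothetical.

```python
import math


def gamma_with_standard_errors(table):
    """Return G, se1, se0, a 95 % interval and z for a table (list of rows)."""
    R, C = len(table), len(table[0])
    n = sum(sum(row) for row in table)
    cells = [(i, j) for i in range(R) for j in range(C)]

    def region_sum(i, j, concordant):
        # Sum of the cells concordant (or discordant) with cell (i, j).
        total = 0
        for x, y in cells:
            sign = (x - i) * (y - j)
            if (sign > 0) if concordant else (sign < 0):
                total += table[x][y]
        return total

    A = {c: region_sum(*c, True) for c in cells}
    D = {c: region_sum(*c, False) for c in cells}
    n_C = sum(table[i][j] * A[i, j] for i, j in cells)
    n_D = sum(table[i][j] * D[i, j] for i, j in cells)
    G = (n_C - n_D) / (n_C + n_D)

    # se1: used for the confidence interval.
    var1 = 16 / (n_C + n_D) ** 4 * sum(
        table[i][j] * (n_D * A[i, j] - n_C * D[i, j]) ** 2 for i, j in cells)
    # se0: used for the test, valid under independence.
    var0 = 4 / (n_C + n_D) ** 2 * (
        sum(table[i][j] * (A[i, j] - D[i, j]) ** 2 for i, j in cells)
        - (n_C - n_D) ** 2 / n)

    se1, se0 = math.sqrt(var1), math.sqrt(var0)
    return {"G": G, "se1": se1, "se0": se0,
            "ci95": (G - 1.96 * se1, G + 1.96 * se1), "z": G / se0}


# Hypothetical 2 x 2 toy table.
result = gamma_with_standard_errors([[2, 1], [1, 2]])
print(result)
```

For a 2 × 2 table gamma equals Yule's Q, and se1² here agrees with the familiar (1 − Q²)²/4 · (1/a + 1/b + 1/c + 1/d), which gives a quick sanity check of the sketch.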
The example
        |  HEALTH51
SMOKE45 |  Vgood   Good   Fair    Bad |  TOTAL |
--------+-----------------------------+--------+
Never   |     16     73      6      1 |     96 |
   row% |   16.7   76.0    6.3    1.0 |  100.0 |
No mo   |     15     75      6      0 |     96 |
   row% |   15.6   78.1    6.3    0.0 |  100.0 |
1-14    |     13     59      7      1 |     80 |
   row% |   16.3   73.8    8.8    1.3 |  100.0 |
15-24   |     10     81     17      3 |    111 |
   row% |    9.0   73.0   15.3    2.7 |  100.0 |
25+     |      1     29      3      1 |     34 |
   row% |    2.9   85.3    8.8    2.9 |  100.0 |
--------+-----------------------------+--------+
TOTAL   |     55    317     39      6 |    417 |
   row% |   13.2   76.0    9.4    1.4 |  100.0 |
--------+-----------------------------+--------+
$\chi^2$ = 16.2, df = 12, p = 0.182
$\gamma$ = 0.24, p < 0.0005
Very strong evidence of an effect of smoking on health
For ordinal variables, $\gamma$ is much more powerful than $\chi^2$-distributed test statistics
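The gamma value reported for the example can be reproduced by counting concordant and discordant pairs directly from the smoking by health table. This is a sketch only: `count_pairs` is a hypothetical helper that counts each pair once, and the table is typed in from the cross-tabulation of SMOKE45 by HEALTH51.

```python
# Rows: Never, No mo, 1-14, 15-24, 25+; columns: Vgood, Good, Fair, Bad.
smoke_health = [
    [16, 73, 6, 1],
    [15, 75, 6, 0],
    [13, 59, 7, 1],
    [10, 81, 17, 3],
    [1, 29, 3, 1],
]


def count_pairs(table):
    """Count concordant and discordant pairs, each pair once."""
    R, C = len(table), len(table[0])
    n_C = n_D = 0
    for i in range(R):
        for j in range(C):
            for x in range(i + 1, R):       # only rows below: no double counting
                for y in range(C):
                    if y > j:
                        n_C += table[i][j] * table[x][y]
                    elif y < j:
                        n_D += table[i][j] * table[x][y]
    return n_C, n_D


n_C, n_D = count_pairs(smoke_health)
G = (n_C - n_D) / (n_C + n_D)
print(n_C, n_D, round(G, 2))
```

Rounded to two decimals, G agrees with the 0.24 quoted for the example.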
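Marginal totals: $n_{x+} = \sum_y n_{xy}$, $n_{+y} = \sum_x n_{xy}$, $n = \sum_{x,y} n_{xy}$
The multinomial distribution of the table:
$$P(n_{11},\dots,n_{rc}) = \binom{n}{n_{11} \cdots n_{rc}} \prod_{x,y} p_{xy}^{n_{xy}}$$
If X and Y are independent:
$$P(n_{11},\dots,n_{rc}) = \binom{n}{n_{11} \cdots n_{rc}} \prod_{x} p_{x+}^{n_{x+}} \prod_{y} p_{+y}^{n_{+y}}$$
Conditional on the marginal totals:
$$P(n_{11},\dots,n_{rc} \mid n_{1+},\dots,n_{r+},\, n_{+1},\dots,n_{+c}) = \frac{\prod_x n_{x+}!\;\prod_y n_{+y}!}{n!\;\prod_{x,y} n_{xy}!}$$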
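The conditional probability of a table given its margins under independence, the product of marginal factorials divided by n! times the product of cell factorials, is easy to evaluate for small tables. The sketch below is an illustration with a hypothetical function name and toy table.

```python
from math import factorial, prod


def conditional_table_probability(table):
    """P(table | margins) under independence, from factorials of the counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    numerator = (prod(factorial(t) for t in row_totals)
                 * prod(factorial(t) for t in col_totals))
    denominator = factorial(n) * prod(factorial(c) for row in table for c in row)
    return numerator / denominator


# Hypothetical toy table with all margins equal to 3.
print(conditional_table_probability([[2, 1], [1, 2]]))
```

For a 2 × 2 table this reduces to the hypergeometric probability used in Fisher's exact test.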
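Exact p-value: the sum of the conditional probabilities of all tables M with the observed margins and a test statistic at least as extreme as the observed one
$$p = \sum_{M:\; m_{x+} = n_{x+},\; m_{+y} = n_{+y},\; T(M) \ge t_{obs}} P(M)$$
Monte Carlo estimate from simulated tables with test statistics $T_i$, i = 1,...,$N_{sim}$:
$$p_{MC} = \frac{1}{N_{sim}} \sum_{i=1}^{N_{sim}} 1(T_i \ge t_{obs})$$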
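One way to simulate tables with the observed margins is to permute the Y values among the persons while keeping the X values fixed; under independence this samples from the conditional distribution given the margins. The sketch below uses that permutation scheme as an assumption (the slides do not say how the tables were generated). It uses gamma as the test statistic with a one-sided comparison, and it treats gamma as 0 when a simulated table has no untied pairs, which is a further convention chosen for the example.

```python
import random


def gamma_statistic(table):
    """Gamma from pair counts; 0.0 if there are no untied pairs (assumed convention)."""
    R, C = len(table), len(table[0])
    n_C = n_D = 0
    for i in range(R):
        for j in range(C):
            for x in range(i + 1, R):
                for y in range(C):
                    if y > j:
                        n_C += table[i][j] * table[x][y]
                    elif y < j:
                        n_D += table[i][j] * table[x][y]
    return (n_C - n_D) / (n_C + n_D) if n_C + n_D else 0.0


def monte_carlo_p_value(table, n_sim=1000, seed=0):
    """Estimate P(T >= t_obs) over tables with the observed margins."""
    rng = random.Random(seed)
    R, C = len(table), len(table[0])
    xs, ys = [], []
    for i, row in enumerate(table):          # expand the table to one record per person
        for j, count in enumerate(row):
            xs.extend([i] * count)
            ys.extend([j] * count)
    t_obs = gamma_statistic(table)
    hits = 0
    for _ in range(n_sim):
        rng.shuffle(ys)                      # permuting Y keeps both margins fixed
        sim = [[0] * C for _ in range(R)]
        for x, y in zip(xs, ys):
            sim[x][y] += 1
        if gamma_statistic(sim) >= t_obs:
            hits += 1
    return hits / n_sim


# Hypothetical toy table.
print(monte_carlo_p_value([[5, 2], [2, 5]], n_sim=500))
```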
Confounding?
HEALTH45 = Good

        |  HEALTH51
SMOKE45 |  V.goo   Good   Fair    Bad |  TOTAL |
--------+-----------------------------+--------+
Never   |     11     55      5      1 |     72 |
   row% |   15.3   76.4    6.9    1.4 |  100.0 |
No mo   |     10     59      5      0 |     74 |
   row% |   13.5   79.7    6.8    0.0 |  100.0 |
1-14    |      3     50      4      1 |     58 |
   row% |    5.2   86.2    6.9    1.7 |  100.0 |
15-24   |      6     76      8      1 |     91 |
   row% |    6.6   83.5    8.8    1.1 |  100.0 |
25+     |      1     25      1      0 |     27 |
   row% |    3.7   92.6    3.7    0.0 |  100.0 |
--------+-----------------------------+--------+
TOTAL   |     31    265     23      3 |    322 |
   row% |    9.6   82.3    7.1    0.9 |  100.0 |
--------+-----------------------------+--------+
X² = 9.8, df = 12, p = 0.636
Gam = 0.17, p = 0.041
HEALTH45 = Fair

        |  HEALTH51
SMOKE45 |  V.goo   Good   Fair    Bad |  TOTAL |
--------+-----------------------------+--------+
Never   |      1      6      1      0 |      8 |
   row% |   12.5   75.0   12.5    0.0 |  100.0 |
No mo   |      0      6      1      0 |      7 |
   row% |    0.0   85.7   14.3    0.0 |  100.0 |
1-14    |      1      3      3      0 |      7 |
   row% |   14.3   42.9   42.9    0.0 |  100.0 |
15-24   |      2      1      6      2 |     11 |
   row% |   18.2    9.1   54.5   18.2 |  100.0 |
25+     |      0      1      2      1 |      4 |
   row% |    0.0   25.0   50.0   25.0 |  100.0 |
--------+-----------------------------+--------+
TOTAL   |      4     17     13      3 |     37 |
   row% |   10.8   45.9   35.1    8.1 |  100.0 |
--------+-----------------------------+--------+
X² = 17.5, df = 12, p = 0.131
Gam = 0.52, p = 0.001
HEALTH45 = Bad

        |  HEALTH51
SMOKE45 |  V.goo   Good   Fair    Bad |  TOTAL |
--------+-----------------------------+--------+
Never   |      0      0      0      0 |      0 |
   row% |    0.0    0.0    0.0    0.0 |    0.0 |
No mo   |      0      3      0      0 |      3 |
   row% |    0.0  100.0    0.0    0.0 |  100.0 |
1-14    |      0      0      0      0 |      0 |
   row% |    0.0    0.0    0.0    0.0 |    0.0 |
15-24   |      0      1      3      0 |      4 |
   row% |    0.0   25.0   75.0    0.0 |  100.0 |
25+     |      0      0      0      0 |      0 |
   row% |    0.0    0.0    0.0    0.0 |    0.0 |
--------+-----------------------------+--------+
TOTAL   |      0      4      3      0 |      7 |
   row% |    0.0   57.1   42.9    0.0 |  100.0 |
--------+-----------------------------+--------+
X² = 3.9, df = 1, p = 0.047
Gam = 1.00, p = 0.001
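Analysis of the association between X and Y in each stratum of Z:

Z = 1:   $\chi^2_1$, $df_1$, $\gamma_1 = \dfrac{N_{1C} - N_{1D}}{N_{1C} + N_{1D}}$
Z = 2:   $\chi^2_2$, $df_2$, $\gamma_2 = \dfrac{N_{2C} - N_{2D}}{N_{2C} + N_{2D}}$
..
Z = k:   $\chi^2_k$, $df_k$, $\gamma_k = \dfrac{N_{kC} - N_{kD}}{N_{kC} + N_{kD}}$

$$\chi^2 = \sum_z \chi^2_z, \qquad df = \sum_z df_z$$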
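The partial gamma, with $N_C = \sum_z N_{zC}$ and $N_D = \sum_z N_{zD}$:
$$\gamma_{partial} = \frac{N_C - N_D}{N_C + N_D} = \frac{\sum_z N_{zC} - \sum_z N_{zD}}{\sum_z N_{zC} + \sum_z N_{zD}}$$
$$\gamma_{partial} = \sum_z \frac{N_{zC} + N_{zD}}{\sum_i (N_{iC} + N_{iD})} \cdot \frac{N_{zC} - N_{zD}}{N_{zC} + N_{zD}} = \sum_z w_z \gamma_z \qquad \text{where } w_z = \frac{N_{zC} + N_{zD}}{\sum_i (N_{iC} + N_{iD})}$$
Weighted mean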
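The identity between the pooled form of the partial gamma and the weighted mean of the stratum gammas can be checked numerically. The sketch below uses the three stratum tables shown for HEALTH45 = Good, Fair and Bad; the table for the V.good stratum is not shown, so the pooled value computed here covers three strata only and is not the partial gamma of the full analysis. The helper `pair_counts` is a hypothetical name.

```python
def pair_counts(table):
    """Concordant and discordant pair counts, each pair counted once."""
    R, C = len(table), len(table[0])
    n_C = n_D = 0
    for i in range(R):
        for j in range(C):
            for x in range(i + 1, R):
                for y in range(C):
                    if y > j:
                        n_C += table[i][j] * table[x][y]
                    elif y < j:
                        n_D += table[i][j] * table[x][y]
    return n_C, n_D


# Stratum tables (rows: smoking categories, columns: health at 51).
strata = {
    "Good": [[11, 55, 5, 1], [10, 59, 5, 0], [3, 50, 4, 1], [6, 76, 8, 1], [1, 25, 1, 0]],
    "Fair": [[1, 6, 1, 0], [0, 6, 1, 0], [1, 3, 3, 0], [2, 1, 6, 2], [0, 1, 2, 1]],
    "Bad": [[0, 0, 0, 0], [0, 3, 0, 0], [0, 0, 0, 0], [0, 1, 3, 0], [0, 0, 0, 0]],
}

counts = {z: pair_counts(t) for z, t in strata.items()}
gammas = {z: (c - d) / (c + d) for z, (c, d) in counts.items()}

total_pairs = sum(c + d for c, d in counts.values())
weights = {z: (c + d) / total_pairs for z, (c, d) in counts.items()}

pooled = (sum(c for c, _ in counts.values())
          - sum(d for _, d in counts.values())) / total_pairs
weighted_mean = sum(weights[z] * gammas[z] for z in strata)

print(gammas, pooled, weighted_mean)
```

The stratum gammas agree with the values printed under the three tables, and the pooled and weighted forms coincide up to floating-point rounding.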
HEALTH45      X²     df   p        p        Gamma   p        p
1: V.good     6.04    4   0.1960   0.1880   -0.18   0.1884   0.2050
2: Good       9.77   12   0.6358   0.6310    0.17   0.0410   0.0290
3: Fair      17.52   12   0.1311   0.1430    0.52   0.0010   0.0030
4: Bad        3.94    1   0.0472   0.1540    1.00   0.0006   0.1220
-----------------------------------------------------------
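As a worked instance of adding the stratum tests, the four $\chi^2$ values and degrees of freedom listed for the strata of HEALTH45 sum as follows (plain arithmetic on the tabulated values, with no p-value computed here):

$$\chi^2 = 6.04 + 9.77 + 17.52 + 3.94 = 37.27, \qquad df = 4 + 12 + 12 + 1 = 29$$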
Gamma = 0.1998    s.e. = 0.0777    0.023