Ordinal Categorical Data Analysis

Measuring association among ordinal categorical variables
Goodman and Kruskal's Gamma
An example
Effect of smoking at 45 years of age on self-reported health (SRH) five years later

Variable   Categories
Smoking    1 Never smoked
           2 Stopped smoking
           3 1-14 cigarettes/day
           4 15-24 cigarettes/day
           5 25+ cigarettes/day
SRH        1 Very good
           2 Good
           3 Fair
           4 Bad

Both variables have ordinal categories.
Expected monotonic association: increasing codes on Smoking go with increasing codes on SRH.
Data on males from the Glostrup surveys

           |         B: HEALTH51         |
J: SMOKE45 |  Vgood   Good   Fair    Bad |  TOTAL
-----------+-----------------------------+-------
Never      |     16     73      6      1 |     96
      row% |   16.7   76.0    6.3    1.0 |  100.0
No mo      |     15     75      6      0 |     96
      row% |   15.6   78.1    6.3    0.0 |  100.0
1-14       |     13     59      7      1 |     80
      row% |   16.3   73.8    8.8    1.3 |  100.0
15-24      |     10     81     17      3 |    111
      row% |    9.0   73.0   15.3    2.7 |  100.0
25+        |      1     29      3      1 |     34
      row% |    2.9   85.3    8.8    2.9 |  100.0
-----------+-----------------------------+-------
TOTAL      |     55    317     39      6 |    417
      row% |   13.2   76.0    9.4    1.4 |  100.0
-----------+-----------------------------+-------

$\chi^2$ = 16.2, df = 12, p = 0.182
The $\chi^2$ test gives no evidence of association, even though the expected monotonic relationship is as plain as the nose on your face.
Correlation coefficients for ordinal categorical variables

Pearson's correlation coefficient
1) Measures linear association, which is not meaningful for ordinal variables
2) Evaluation of significance requires normal distributions

Rank correlations (Kendall's $\tau$ and Spearman's $\rho$) are more appropriate, but they require continuous data with very little risk of ties.
Goodman and Kruskal's $\gamma$ for ordinal categorical data

1) Similar to Kendall's $\tau$.
2) Related to the odds ratio for $2 \times 2$ tables.
3) Well-known asymptotic properties.
4) A partial $\gamma$ coefficient measuring the conditional monotonic relationship among ordinal variables is available.
Monotonic relationships
Y increases when X increases (positive relationship) or when X decreases (negative relationship)
Two variables: X, Y

Probabilities: $p_{xy} = P(X=x, Y=y)$
X and Y are independent if
$p_{xy} = P(X=x)\,P(Y=y)$
What exactly do we mean when we say that there is a monotonic relationship between X and Y?
Concordance and discordance
Compare outcomes on (X,Y) for two stochastically independent cases (X1,Y1) and (X2,Y2)
Concordance (C) if X1<X2 and Y1<Y2 or X1>X2 and Y1>Y2
Discordance (D) if X1<X2 and Y1>Y2 or X1>X2 and Y1<Y2
Tie (T)         if X1=X2 or Y1=Y2
Concordance = same trend in X and Y

Discordance = different trends in X and Y
Probability of concordance and discordance
$$P_C = \sum_{(x_1,y_1,x_2,y_2) \in C} p_{x_1 y_1}\, p_{x_2 y_2} \qquad P_D = \sum_{(x_1,y_1,x_2,y_2) \in D} p_{x_1 y_1}\, p_{x_2 y_2}$$
Positive relationship: $P_C > P_D$

Negative relationship: $P_C < P_D$
The gamma coefficient

A measure of the strength of the monotonic relationship
$$\gamma = \frac{P_C - P_D}{P_C + P_D}$$
Satisfies all conventional requirements of correlation coefficients:
$-1 \le \gamma \le +1$
$\gamma = 0$ if X and Y are independent
Positive association if $\gamma > 0$
Negative association if $\gamma < 0$
Reverse the order of the Y categories:
$\gamma$ after recoding $= -\gamma$ before recoding
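The definitions above can be checked numerically. The sketch below is only an illustration: the function name `gamma_from_probs` and both probability tables are made up for the example and do not come from the slides. It sums the products of cell probabilities over concordant and discordant ordered pairs, then confirms that an independent table gives a value of (numerically) zero and that reversing the Y categories flips the sign.

```python
def gamma_from_probs(p):
    """Return (P_C, P_D, gamma) for a joint probability table p[x][y]."""
    rows, cols = len(p), len(p[0])
    p_c = p_d = 0.0
    for x1 in range(rows):
        for y1 in range(cols):
            for x2 in range(rows):
                for y2 in range(cols):
                    w = p[x1][y1] * p[x2][y2]
                    s = (x1 - x2) * (y1 - y2)
                    if s > 0:
                        p_c += w      # concordant ordered pair
                    elif s < 0:
                        p_d += w      # discordant ordered pair
    return p_c, p_d, (p_c - p_d) / (p_c + p_d)

# Hypothetical independent table: outer product of two marginal distributions.
px, py = [0.2, 0.5, 0.3], [0.6, 0.4]
indep = [[a * b for b in py] for a in px]
# Hypothetical table with most of its mass on the "diagonal".
assoc = [[0.30, 0.05], [0.10, 0.10], [0.05, 0.40]]

g_indep = gamma_from_probs(indep)[2]
g_assoc = gamma_from_probs(assoc)[2]
g_reversed = gamma_from_probs([row[::-1] for row in assoc])[2]
```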

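Interpretation of $\gamma$
$$P(C \mid C \cup D) = \frac{P(C)}{P(C) + P(D)} \qquad P(D \mid C \cup D) = \frac{P(D)}{P(C) + P(D)}$$
such that
$$\gamma = P(C \mid C \cup D) - P(D \mid C \cup D)$$
$\gamma$ is the difference between two conditional probabilities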
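Estimation of $\gamma$
Pairwise comparison of all persons in the data set
$n_C$ = number of concordances
$n_D$ = number of discordances
$n_T$ = number of ties
Relative frequencies
$$h_C = \frac{n_C}{n_C + n_D + n_T} \qquad h_D = \frac{n_D}{n_C + n_D + n_T} \qquad h_T = \frac{n_T}{n_C + n_D + n_T}$$
The estimate of $\gamma$
$$G = \frac{h_C - h_D}{h_C + h_D} = \frac{n_C - n_D}{n_C + n_D}$$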
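As a minimal sketch of the pairwise comparison, the code below expands the smoking by self-reported health table shown earlier into one record per person and classifies every pair as concordant, discordant or tied. The variable names are illustrative only, and the brute-force loop is chosen for clarity, not speed.

```python
from itertools import combinations

# Observed counts: rows = smoking (Never ... 25+), columns = health (Vgood ... Bad).
table = [
    [16, 73, 6, 1],
    [15, 75, 6, 0],
    [13, 59, 7, 1],
    [10, 81, 17, 3],
    [1, 29, 3, 1],
]

# One (x, y) record per person.
persons = [(x, y) for x, row in enumerate(table)
           for y, n in enumerate(row) for _ in range(n)]

n_c = n_d = n_t = 0
for (x1, y1), (x2, y2) in combinations(persons, 2):
    s = (x1 - x2) * (y1 - y2)
    if s > 0:
        n_c += 1          # same trend in X and Y
    elif s < 0:
        n_d += 1          # different trends in X and Y
    else:
        n_t += 1          # tie on X or on Y

G = (n_c - n_d) / (n_c + n_d)
```

Rounded to two decimals, `G` agrees with the value 0.24 reported for this table later in the slides.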
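A little bit of notation
$n_{xy}$ = number of persons with X=x and Y=y
$$A_{ij} = \sum_{x<i,\,y<j} n_{xy} + \sum_{x>i,\,y>j} n_{xy} \qquad D_{ij} = \sum_{x<i,\,y>j} n_{xy} + \sum_{x>i,\,y<j} n_{xy}$$
Position of the cells counted in $A_{xy}$ and $D_{xy}$ relative to cell (x, y) of the R x c table (rows X = 1..R, columns Y = 1..c):

              Y < y      y      Y > y
   X < x      A_xy                D_xy
   X = x                n_xy
   X > x      D_xy                A_xy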
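Number of concordances and discordances

$$n_C = \sum_{x,y} n_{xy} A_{xy} \qquad n_D = \sum_{x,y} n_{xy} D_{xy}$$
(Each pair is counted twice in both sums. The common factor 2 cancels in G.)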
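The cell-based sums avoid the loop over all pairs of persons. A short sketch under the same data as the smoking table; the helper name `a_d` is an assumption made for this example. Here $A_{ij}$ collects the cells above-left and below-right of cell (i, j), and $D_{ij}$ the cells above-right and below-left.

```python
table = [
    [16, 73, 6, 1],
    [15, 75, 6, 0],
    [13, 59, 7, 1],
    [10, 81, 17, 3],
    [1, 29, 3, 1],
]

def a_d(table, i, j):
    """A_ij and D_ij: totals of the cells concordant / discordant with cell (i, j)."""
    a = d = 0
    for x, row in enumerate(table):
        for y, n in enumerate(row):
            s = (x - i) * (y - j)
            if s > 0:
                a += n
            elif s < 0:
                d += n
    return a, d

n_c = n_d = 0
for i, row in enumerate(table):
    for j, n in enumerate(row):
        a, d = a_d(table, i, j)
        n_c += n * a      # every concordant pair is reached from both of its cells
        n_d += n * d
G = (n_c - n_d) / (n_c + n_d)
```

Because each pair is visited from both of its cells, `n_c` and `n_d` are twice the pair counts, which leaves `G` unchanged.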
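The $\gamma$ coefficient for $2 \times 2$ tables

    a   b
    c   d

$n_C = ad$, $n_D = bc$
$$\gamma = \frac{n_C - n_D}{n_C + n_D} = \frac{ad - bc}{ad + bc} = \frac{\frac{ad}{bc} - 1}{\frac{ad}{bc} + 1} = \frac{OR - 1}{OR + 1}$$
$$OR = \frac{1 + \gamma}{1 - \gamma} \qquad \gamma = \frac{OR - 1}{OR + 1}$$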
Gamma, odds ratio and logit values

gamma   odds ratio   logit = ln(odds ratio)
-1.00      0.00         -∞
-0.90      0.05       -2.94
-0.80      0.11       -2.20
-0.70      0.18       -1.73
-0.60      0.25       -1.39
-0.50      0.33       -1.10
-0.40      0.43       -0.85
-0.30      0.54       -0.62
-0.20      0.67       -0.41
-0.10      0.82       -0.20
 0.00      1.00        0.00
 0.10      1.22        0.20
 0.20      1.50        0.41
 0.30      1.86        0.62
 0.40      2.33        0.85
 0.50      3.00        1.10
 0.60      4.00        1.39
 0.70      5.67        1.73
 0.80      9.00        2.20
 0.90     19.00        2.94
 1.00       +∞          +∞

Note: logit ≈ 2 × gamma in the interval [-0.30, 0.30]
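The relation between gamma and the odds ratio for $2 \times 2$ tables, OR = (1 + gamma)/(1 - gamma), is easy to tabulate. The sketch below regenerates the interior rows of the table; the function names `gamma_to_or` and `or_to_gamma` are chosen for this example only.

```python
import math

def gamma_to_or(g):
    """Odds ratio corresponding to gamma in a 2x2 table (requires -1 < g < 1)."""
    return (1 + g) / (1 - g)

def or_to_gamma(odds_ratio):
    """Gamma corresponding to an odds ratio in a 2x2 table."""
    return (odds_ratio - 1) / (odds_ratio + 1)

# (gamma, odds ratio, logit) for gamma = -0.9, -0.8, ..., 0.9.
rows = []
for k in range(-9, 10):
    g = k / 10
    oratio = gamma_to_or(g)
    rows.append((g, round(oratio, 2), round(math.log(oratio), 2)))
```

The end points gamma = -1 and gamma = +1 are left out because the odds ratio is 0 and infinite there.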
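Properties of the estimate of $\gamma$

The estimate is approximately unbiased in large samples,
$E(G) \approx \gamma$,
and asymptotically normally distributed with standard error, $s_1$, given by
$$s_1^2 = \frac{16}{(n_C + n_D)^4} \sum_{x,y} n_{xy} \left( n_D A_{xy} - n_C D_{xy} \right)^2$$
where $n_C = \sum_{x,y} n_{xy} A_{xy}$ and $n_D = \sum_{x,y} n_{xy} D_{xy}$.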
If X and Y are independent, then the standard error, $s_0$, of G is given by a separate expression for $s_0^2$ in terms of $n_{xy}$, $A_{xy}$, $D_{xy}$, $n_C$ and $n_D$ [formula not legible in the source].
Statistical inference
95 % confidence intervals
$$G \pm 1.96\, s_1$$
Test of significance
If X and Y are independent then, asymptotically,
$$z = \frac{G}{s_0} \sim \mathrm{Norm}(0,1)$$
Notice that confidence intervals and assessment of significance use different estimates of the standard error
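A sketch of the confidence interval for the smoking table, assuming the usual asymptotic variance of G, $s_1^2 = 16/(n_C+n_D)^4 \sum n_{xy}(n_D A_{xy} - n_C D_{xy})^2$, with $n_C$ and $n_D$ computed as the cell-based sums. The function name `gamma_ci` is hypothetical. The test against zero is not included, because it needs the null standard error $s_0$.

```python
import math

def gamma_ci(table, z=1.96):
    """Return (G, s1, (lower, upper)) for a two-way table of counts."""
    cells = []
    for i, row in enumerate(table):
        for j, n in enumerate(row):
            a = d = 0
            for x, other in enumerate(table):
                for y, m in enumerate(other):
                    s = (x - i) * (y - j)
                    if s > 0:
                        a += m        # contributes to A_ij
                    elif s < 0:
                        d += m        # contributes to D_ij
            cells.append((n, a, d))
    n_c = sum(n * a for n, a, d in cells)
    n_d = sum(n * d for n, a, d in cells)
    g = (n_c - n_d) / (n_c + n_d)
    var1 = 16 / (n_c + n_d) ** 4 * sum(n * (n_d * a - n_c * d) ** 2 for n, a, d in cells)
    s1 = math.sqrt(var1)
    return g, s1, (g - z * s1, g + z * s1)

table = [
    [16, 73, 6, 1],
    [15, 75, 6, 0],
    [13, 59, 7, 1],
    [10, 81, 17, 3],
    [1, 29, 3, 1],
]
G, s1, (lower, upper) = gamma_ci(table)
```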
The example

           |         B: HEALTH51         |
J: SMOKE45 |  Vgood   Good   Fair    Bad |  TOTAL
-----------+-----------------------------+-------
Never      |     16     73      6      1 |     96
      row% |   16.7   76.0    6.3    1.0 |  100.0
No mo      |     15     75      6      0 |     96
      row% |   15.6   78.1    6.3    0.0 |  100.0
1-14       |     13     59      7      1 |     80
      row% |   16.3   73.8    8.8    1.3 |  100.0
15-24      |     10     81     17      3 |    111
      row% |    9.0   73.0   15.3    2.7 |  100.0
25+        |      1     29      3      1 |     34
      row% |    2.9   85.3    8.8    2.9 |  100.0
-----------+-----------------------------+-------
TOTAL      |     55    317     39      6 |    417
      row% |   13.2   76.0    9.4    1.4 |  100.0
-----------+-----------------------------+-------

$\chi^2$ = 16.2, df = 12, p = 0.182
$\gamma$ = 0.24, p < 0.0005
Very strong evidence of an effect of smoking on health
For ordinal variables, $\gamma$ is much more powerful than $\chi^2$-distributed test statistics
Exact conditional inference

The problem:
Can the distributions of estimates and test statistics be approximated by asymptotic distributions in small and moderate samples?
The small number of persons with bad health would result in warnings from most statistical programs that asymptotics probably do not work.
If in doubt, use exact conditional tests instead of asymptotic tests.
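The hypergeometric distribution

The contingency table:
$n_{xy}$, x = 1,...,c, y = 1,...,r
The margins of the table:
$$n_{x+} = \sum_y n_{xy} \qquad n_{+y} = \sum_x n_{xy} \qquad n = \sum_{x,y} n_{xy}$$
The probability of the table
$$P(n_{11},\ldots,n_{cr}) = \binom{n}{n_{11} \ldots n_{cr}} \prod_{x,y} p_{xy}^{n_{xy}}$$
$H_0$: $p_{xy} = p_{x+}\, p_{+y}$
$$P(n_{11},\ldots,n_{cr}) = \binom{n}{n_{11} \ldots n_{cr}} \prod_x p_{x+}^{n_{x+}} \prod_y p_{+y}^{n_{+y}}$$
The marginal tables, $n_{x+}$ and $n_{+y}$, are sufficient under $H_0$
$$P(n_{11},\ldots,n_{cr} \mid n_{1+},\ldots,n_{c+}, n_{+1},\ldots,n_{+r}) = \frac{\prod_x n_{x+}! \, \prod_y n_{+y}!}{n! \, \prod_{x,y} n_{xy}!}$$
does not depend on unknown parameters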
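The conditional probability of a table given its margins involves only factorials, so it can be evaluated directly. The sketch below works on the log scale with `math.lgamma` to avoid overflow for larger tables. The function name `log_table_prob` and the small example table are illustrative only.

```python
import math

def log_table_prob(table):
    """Log of P(table | row and column margins) under independence."""
    row_tot = [sum(r) for r in table]
    col_tot = [sum(c) for c in zip(*table)]
    n = sum(row_tot)

    def log_fact(k):
        return math.lgamma(k + 1)     # log(k!)

    return (sum(log_fact(t) for t in row_tot)
            + sum(log_fact(t) for t in col_tot)
            - log_fact(n)
            - sum(log_fact(v) for r in table for v in r))

# Hypothetical 2x2 table with all margins equal to 4.
p_small = math.exp(log_table_prob([[3, 1], [1, 3]]))
```

For this 2x2 table the result equals the hypergeometric probability 16/70.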
The exact conditional test procedure 1

Find all tables with the same marginal tables as the observed table.
For each of these tables calculate:
   The conditional probability of the table
   The test statistics of interest
The exact p-value
=
the sum of probabilities for tables with test statistics that are at least as extreme as the test statistic of the observed table
The exact conditional test procedure

Test statistic T(M),
where M is an $r \times c$ table
Observed test statistic = $t_{obs}$
The exact p-value
$$p_{exact} = \sum_{M:\; m_{x+} = n_{x+},\; m_{+y} = n_{+y},\; T(M) \ge t_{obs}} P(M \mid n_{1+},\ldots,n_{c+}, n_{+1},\ldots,n_{+r})$$
Fisher's exact test for $2 \times 2$ tables

Also appropriate for $r \times c$ tables, but may be very time consuming due to the number of tables fitting the margins
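For a $2 \times 2$ table the enumeration is short, because the upper-left cell determines the whole table once the margins are fixed. The following sketch computes a one-sided exact p-value. The function name `exact_p_2x2` and the default statistic (the upper-left count) are assumptions made for this example; any statistic T(M), such as G, can be passed in instead.

```python
from math import comb

def exact_p_2x2(a, b, c, d, stat=lambda a, b, c, d: a):
    """One-sided exact conditional p-value for the table [[a, b], [c, d]]."""
    r1, r2, c1 = a + b, c + d, a + c
    n = r1 + r2
    t_obs = stat(a, b, c, d)
    p = 0.0
    # k runs over every value of the upper-left cell compatible with the margins.
    for k in range(max(0, c1 - r2), min(r1, c1) + 1):
        candidate = (k, r1 - k, c1 - k, r2 - c1 + k)
        prob = comb(r1, k) * comb(r2, c1 - k) / comb(n, c1)   # hypergeometric probability
        if stat(*candidate) >= t_obs:
            p += prob
    return p

# Hypothetical table [[3, 1], [1, 3]]: tables with 3 or 4 in the upper-left cell count.
p_exact = exact_p_2x2(3, 1, 1, 3)
```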
The Monte Carlo test

Since the conditional probabilities are known exactly, we may ask the computer to generate a random sample consisting of a large number of independent tables from this distribution.
The MC test procedure:
Generate tables: $M_1,\ldots,M_{N_{sim}}$
Calculate test statistics for each table: $T_i = T(M_i)$, $i = 1,\ldots,N_{sim}$
Count the number of random test statistics which are at least as extreme as the observed statistic
$$p_{MC} = \frac{1}{N_{sim}} \sum_{i=1}^{N_{sim}} 1(T_i \ge t_{obs})$$
is an unbiased estimate of $p_{exact}$

The standard error of $p_{MC}$ depends on $N_{sim}$
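One simple way to draw tables with the observed margins is to shuffle the Y codes among the persons while keeping the X codes fixed. This sampling scheme is an assumption of the sketch, since the slides do not say how the tables are generated. The function names, the seed and the default `n_sim` are illustrative as well.

```python
import random

def gamma_stat(xs, ys, n_rows, n_cols):
    """G computed from paired category codes via their cross-table."""
    t = [[0] * n_cols for _ in range(n_rows)]
    for x, y in zip(xs, ys):
        t[x][y] += 1
    n_c = n_d = 0
    for i in range(n_rows):
        for j in range(n_cols):
            for h in range(i + 1, n_rows):
                for k in range(n_cols):
                    if k > j:
                        n_c += t[i][j] * t[h][k]
                    elif k < j:
                        n_d += t[i][j] * t[h][k]
    return (n_c - n_d) / (n_c + n_d)   # assumes at least one untied pair

def monte_carlo_p(table, n_sim=1000, seed=1):
    """Return (t_obs, p_MC) for the one-sided test based on G."""
    rng = random.Random(seed)
    xs = [x for x, row in enumerate(table) for n in row for _ in range(n)]
    ys = [y for row in table for y, n in enumerate(row) for _ in range(n)]
    n_rows, n_cols = len(table), len(table[0])
    t_obs = gamma_stat(xs, ys, n_rows, n_cols)
    hits = 0
    for _ in range(n_sim):
        rng.shuffle(ys)                 # both margins stay fixed
        if gamma_stat(xs, ys, n_rows, n_cols) >= t_obs:
            hits += 1
    return t_obs, hits / n_sim

table = [
    [16, 73, 6, 1],
    [15, 75, 6, 0],
    [13, 59, 7, 1],
    [10, 81, 17, 3],
    [1, 29, 3, 1],
]
t_obs, p_mc = monte_carlo_p(table)
```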
Sequential Monte Carlo tests

Interrupt the Monte Carlo procedure when it becomes obvious that the test statistic will not be significant
$N_{sim}$ = 10,000
Critical level of the test = 5 %
The sequential Monte Carlo test interrupts the Monte Carlo procedure when the number of tables with $T(M_i) \ge t_{obs}$ is equal to 501:
$$S = 501 \;\Rightarrow\; p_{MC} \ge 501/10000 > 0.05$$
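The stopping rule can be sketched as a loop with an early exit. The function name `sequential_mc` and the callable `draw_statistic` (which should return the test statistic of one freshly simulated table) are hypothetical. The uniform random numbers used below only stand in for simulated test statistics.

```python
import math
import random

def sequential_mc(draw_statistic, t_obs, n_sim=10_000, alpha=0.05):
    """Monte Carlo test that stops once a significant result is impossible."""
    limit = math.floor(alpha * n_sim + 1e-9) + 1      # 501 for n_sim=10000, alpha=0.05
    hits = 0
    for drawn in range(1, n_sim + 1):
        if draw_statistic() >= t_obs:
            hits += 1
            if hits >= limit:
                # Even with no further extreme tables, p_MC >= limit / n_sim > alpha.
                return {"stopped_early": True, "tables": drawn,
                        "p_lower_bound": hits / n_sim}
    return {"stopped_early": False, "tables": n_sim, "p_mc": hits / n_sim}

rng = random.Random(0)
null_result = sequential_mc(rng.random, t_obs=0.5)    # about half the draws are extreme
sig_result = sequential_mc(rng.random, t_obs=2.0)     # no draw can reach t_obs
```

In the first call the procedure stops long before 10,000 tables have been generated. In the second it runs to the end and returns the ordinary estimate.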
Repeated Monte Carlo tests

The repeated Monte Carlo test interrupts the
Monte Carlo procedure when the risk of a
significant pMC-value has become very small
Parameters of the repeated Monte Carlo test:

Nsim = the total number of tables to be generated
Nstart = the minimum number of tables to be generated
Critical value
Max risk of stopping too soon (default = 0.1 %)
Smoking and self-reported health

           |         B: HEALTH51         |
J: SMOKE45 |  V.goo   Good   Fair    Bad |  TOTAL
-----------+-----------------------------+-------
Never      |     16     73      6      1 |     96
      row% |   16.7   76.0    6.3    1.0 |  100.0
No mo      |     15     75      6      0 |     96
      row% |   15.6   78.1    6.3    0.0 |  100.0
1-14       |     13     59      7      1 |     80
      row% |   16.3   73.8    8.8    1.3 |  100.0
15-24      |     10     81     17      3 |    111
      row% |    9.0   73.0   15.3    2.7 |  100.0
25+        |      1     29      3      1 |     34
      row% |    2.9   85.3    8.8    2.9 |  100.0
-----------+-----------------------------+-------
TOTAL      |     55    317     39      6 |    417
      row% |   13.2   76.0    9.4    1.4 |  100.0
-----------+-----------------------------+-------

$\chi^2$ = 16.2, df = 12, p = 0.182
Gamma = 0.24, p = 0.000
Confounding?
Analysis of the conditional association given self-reported health at 45 years

HEALTH45 (G) = 1: Very good

           |         B: HEALTH51         |
J: SMOKE45 |  V.goo   Good   Fair    Bad |  TOTAL
-----------+-----------------------------+-------
Never      |      4     12      0      0 |     16
      row% |   25.0   75.0    0.0    0.0 |  100.0
No mo      |      5      7      0      0 |     12
      row% |   41.7   58.3    0.0    0.0 |  100.0
1-14       |      9      6      0      0 |     15
      row% |   60.0   40.0    0.0    0.0 |  100.0
15-24      |      2      3      0      0 |      5
      row% |   40.0   60.0    0.0    0.0 |  100.0
25+        |      0      3      0      0 |      3
      row% |    0.0  100.0    0.0    0.0 |  100.0
-----------+-----------------------------+-------
TOTAL      |     20     31      0      0 |     51
      row% |   39.2   60.8    0.0    0.0 |  100.0
-----------+-----------------------------+-------

$\chi^2$ = 6.0, df = 4, p = 0.196
Gamma = -0.18, p = 0.188
HEALTH45 (G) = 2: Good

           |         B: HEALTH51         |
J: SMOKE45 |  V.goo   Good   Fair    Bad |  TOTAL
-----------+-----------------------------+-------
Never      |     11     55      5      1 |     72
      row% |   15.3   76.4    6.9    1.4 |  100.0
No mo      |     10     59      5      0 |     74
      row% |   13.5   79.7    6.8    0.0 |  100.0
1-14       |      3     50      4      1 |     58
      row% |    5.2   86.2    6.9    1.7 |  100.0
15-24      |      6     76      8      1 |     91
      row% |    6.6   83.5    8.8    1.1 |  100.0
25+        |      1     25      1      0 |     27
      row% |    3.7   92.6    3.7    0.0 |  100.0
-----------+-----------------------------+-------
TOTAL      |     31    265     23      3 |    322
      row% |    9.6   82.3    7.1    0.9 |  100.0
-----------+-----------------------------+-------

$\chi^2$ = 9.8, df = 12, p = 0.636
Gamma = 0.17, p = 0.041
HEALTH45 (G) = 3: Fair

           |         B: HEALTH51         |
J: SMOKE45 |  V.goo   Good   Fair    Bad |  TOTAL
-----------+-----------------------------+-------
Never      |      1      6      1      0 |      8
      row% |   12.5   75.0   12.5    0.0 |  100.0
No mo      |      0      6      1      0 |      7
      row% |    0.0   85.7   14.3    0.0 |  100.0
1-14       |      1      3      3      0 |      7
      row% |   14.3   42.9   42.9    0.0 |  100.0
15-24      |      2      1      6      2 |     11
      row% |   18.2    9.1   54.5   18.2 |  100.0
25+        |      0      1      2      1 |      4
      row% |    0.0   25.0   50.0   25.0 |  100.0
-----------+-----------------------------+-------
TOTAL      |      4     17     13      3 |     37
      row% |   10.8   45.9   35.1    8.1 |  100.0
-----------+-----------------------------+-------

$\chi^2$ = 17.5, df = 12, p = 0.131
Gamma = 0.52, p = 0.001
HEALTH45 (G) = 4: Bad

           |         B: HEALTH51         |
J: SMOKE45 |  V.goo   Good   Fair    Bad |  TOTAL
-----------+-----------------------------+-------
Never      |      0      0      0      0 |      0
      row% |    0.0    0.0    0.0    0.0 |    0.0
No mo      |      0      3      0      0 |      3
      row% |    0.0  100.0    0.0    0.0 |  100.0
1-14       |      0      0      0      0 |      0
      row% |    0.0    0.0    0.0    0.0 |    0.0
15-24      |      0      1      3      0 |      4
      row% |    0.0   25.0   75.0    0.0 |  100.0
25+        |      0      0      0      0 |      0
      row% |    0.0    0.0    0.0    0.0 |    0.0
-----------+-----------------------------+-------
TOTAL      |      0      4      3      0 |      7
      row% |    0.0   57.1   42.9    0.0 |  100.0
-----------+-----------------------------+-------

$\chi^2$ = 3.9, df = 1, p = 0.047
Gamma = 1.00, p = 0.001
** Local test results for strata defined by HEALTH45 (G) **

                             p-values                p-values (1-sided)
G: HEALTH45   $\chi^2$  df   asympt   exact    Gamma   asympt   exact
---------------------------------------------------------------------
1: V.good         6.04   4   0.1960  0.1880    -0.18   0.1884  0.2050
2: Good           9.77  12   0.6358  0.6310     0.17   0.0410  0.0290
3: Fair          17.52  12   0.1311  0.1430     0.52   0.0010  0.0030
4: Bad            3.94   1   0.0472  0.1540     1.00   0.0006  0.1220
---------------------------------------------------------------------
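Tests of conditional independence

$H_0$: $P(X,Y \mid Z=z) = P(X \mid Z=z)\,P(Y \mid Z=z)$ for all z

For each level z = 1,...,k of Z, the X by Y table gives:

Z     Concordance and discordance    Test statistics
1     $N_{1C}$ and $N_{1D}$          $\chi^2_1$,  $\gamma_1 = \frac{N_{1C} - N_{1D}}{N_{1C} + N_{1D}}$
2     $N_{2C}$ and $N_{2D}$          $\chi^2_2$,  $\gamma_2 = \frac{N_{2C} - N_{2D}}{N_{2C} + N_{2D}}$
..    ..                             ..
k     $N_{kC}$ and $N_{kD}$          $\chi^2_k$,  $\gamma_k = \frac{N_{kC} - N_{kD}}{N_{kC} + N_{kD}}$

All test statistics must be insignificant

Global tests of conditional independence
The global $\chi^2$:
$$\chi^2 = \sum_z \chi^2_z \qquad df = \sum_z df_z$$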
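As a small worked check, the global statistic is just the sum over strata. The values below are the local $\chi^2$ statistics and degrees of freedom printed in the local test results for the HEALTH45 strata. The dictionary layout is an illustration only.

```python
# (chi-square, df) per stratum, copied from the local test results.
local = {
    "V.good": (6.04, 4),
    "Good": (9.77, 12),
    "Fair": (17.52, 12),
    "Bad": (3.94, 1),
}

global_chi2 = sum(x2 for x2, _ in local.values())
global_df = sum(df for _, df in local.values())
```

The sums, 37.3 (after rounding) on 29 degrees of freedom, agree with the global test reported for these data.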
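The partial $\gamma$ coefficient

$$N_C = \sum_z N_{zC} \qquad N_D = \sum_z N_{zD}$$
$$\gamma_{partial} = \frac{N_C - N_D}{N_C + N_D} = \frac{\sum_z (N_{zC} - N_{zD})}{\sum_z (N_{zC} + N_{zD})} = \sum_z w_z \, \frac{N_{zC} - N_{zD}}{N_{zC} + N_{zD}}$$
where
$$w_z = \frac{N_{zC} + N_{zD}}{\sum_i (N_{iC} + N_{iD})}$$
$$\gamma_{partial} = \sum_z w_z \gamma_z$$
Weighted mean of the local $\gamma$ coefficients
Asymptotic normal distribution

Monte Carlo approximation of exact conditional p-values as for 2-way tables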
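The weighted-mean form can be verified on the four HEALTH45 strata tabulated earlier. In this sketch the helper name `concordant_discordant` and the dictionary layout are assumptions made for the example. The counts are the cell counts of the stratified SMOKE45 by HEALTH51 tables.

```python
def concordant_discordant(table):
    """Numbers of concordant and discordant pairs (each pair counted once)."""
    n_c = n_d = 0
    rows, cols = len(table), len(table[0])
    for i in range(rows):
        for j in range(cols):
            for h in range(i + 1, rows):
                for k in range(cols):
                    if k > j:
                        n_c += table[i][j] * table[h][k]
                    elif k < j:
                        n_d += table[i][j] * table[h][k]
    return n_c, n_d

strata = {
    "V.good": [[4, 12, 0, 0], [5, 7, 0, 0], [9, 6, 0, 0], [2, 3, 0, 0], [0, 3, 0, 0]],
    "Good": [[11, 55, 5, 1], [10, 59, 5, 0], [3, 50, 4, 1], [6, 76, 8, 1], [1, 25, 1, 0]],
    "Fair": [[1, 6, 1, 0], [0, 6, 1, 0], [1, 3, 3, 0], [2, 1, 6, 2], [0, 1, 2, 1]],
    "Bad": [[0, 0, 0, 0], [0, 3, 0, 0], [0, 0, 0, 0], [0, 1, 3, 0], [0, 0, 0, 0]],
}

counts = {z: concordant_discordant(t) for z, t in strata.items()}
total = sum(c + d for c, d in counts.values())
local_gamma = {z: (c - d) / (c + d) for z, (c, d) in counts.items()}
weights = {z: (c + d) / total for z, (c, d) in counts.items()}
partial_gamma = sum(weights[z] * local_gamma[z] for z in strata)
```

The local values reproduce the printed gammas (-0.18, 0.17, 0.52, 1.00), and the weighted mean rounds to the partial coefficient 0.17.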
** Local test results for strata defined by HEALTH45 (G) **

                             p-values                p-values (1-sided)
G: HEALTH45   $\chi^2$  df   asympt   exact    Gamma   asympt   exact
---------------------------------------------------------------------
1: V.good         6.04   4   0.1960  0.1880    -0.18   0.1884  0.2050
2: Good           9.77  12   0.6358  0.6310     0.17   0.0410  0.0290
3: Fair          17.52  12   0.1311  0.1430     0.52   0.0010  0.0030
4: Bad            3.94   1   0.0472  0.1540     1.00   0.0006  0.1220
---------------------------------------------------------------------
Global $\chi^2$ = 37.3, df = 29, p = 0.139, $p_{exact}$ = 0.148
Partial $\gamma$ = 0.17, p = 0.034, $p_{exact}$ = 0.027
Are the local $\gamma$ coefficients homogeneous?

Least squares estimate: Gamma = 0.1998, s.e. = 0.0777

G: HEALTH45   Gamma  variance    s.e.  weight  residual
-------------------------------------------------------
1: V.good     -0.18    0.0397  0.1993   0.152    -2.060
2: Good        0.17    0.0097  0.0987   0.620    -0.411
3: Fair        0.52    0.0264  0.1625   0.228     2.237
4: Bad         1.00    standard error is not available
-------------------------------------------------------
Incomplete set of Gammas
Test for partial association: $\chi^2$ = 7.5, df = 2, p = 0.023

Pairwise comparisons of strata:
Comparison of strata 1+2: p = 0.11
Significant difference between strata 1+2 and 3: p = 0.025
Notice the similarity between this analysis of $\gamma$ coefficients and the Mantel-Haenszel analysis of odds ratios
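The printed weights are consistent with inverse-variance weighting of the local gammas, so the least squares estimate can be sketched as below. That reading is an inference from the numbers and is not stated on the slide. The inputs are the rounded gammas and variances from the table, which is why the results match the printed values only approximately. The stratum without a standard error is left out.

```python
import math

# (gamma, variance) for the strata with an available standard error.
local = {
    "V.good": (-0.18, 0.0397),
    "Good": (0.17, 0.0097),
    "Fair": (0.52, 0.0264),
}

inv_var = {z: 1 / v for z, (_, v) in local.items()}
total = sum(inv_var.values())
weights = {z: w / total for z, w in inv_var.items()}

pooled = sum(weights[z] * g for z, (g, _) in local.items())   # weighted least squares estimate
pooled_se = math.sqrt(1 / total)

# Chi-square statistic for differences among the local gammas.
chi2 = sum((g - pooled) ** 2 / v for g, v in local.values())
df = len(local) - 1
```

With these rounded inputs the pooled estimate is close to 0.20 with standard error close to 0.078, and the statistic is close to the printed 7.5 on 2 degrees of freedom.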