
Categorical Response Data

Categorical Variable
A categorical variable is one for which the measurement scale consists of a set of categories. Categorical variables
represent types of data which may be divided into groups.
Example: Race, sex, age group, educational level, smoking status, etc.

Data
Data are measurements of the characteristics of an object or subject. There are two types of data: qualitative and quantitative. Categorical data commonly arise under multinomial or Poisson sampling.

Measurement Scale
Nominal
Categorical variables for which levels do not have a natural ordering are called nominal. For nominal variables the
order of listing of the categories is irrelevant to the statistical analysis.
Example: Religious affiliation (Categories Catholic, Jewish, Protestant, Other), mode of transportation (automobile,
bus, subway, bicycle, other), choice of residence (house, apartment, condominium, other), race, gender and marital
status

Ordinal
Many categorical variables do have ordered levels. Such variables are called ordinal. Ordinal variables clearly order
the categories, but absolute distances between categories are unknown.
Example: Size of automobile (subcompact, compact, mid-size, large), social class (upper, middle, lower), attitude toward legalization of abortion (strongly disapprove, disapprove, approve, strongly approve), appraisal of a company's inventory level (too low, about right, too high), etc.

Interval
An interval variable is one that does have numerical distances between any two levels of the scale.
Example: Blood pressure level, functional lifetime of a television set, length of prison term, income, and age.

Contingency Table
Let X and Y denote two categorical response variables, X having $I$ levels and Y having $J$ levels. When we classify subjects on both variables, there are $IJ$ possible combinations of classifications. The responses $(X, Y)$ of a subject randomly chosen from some population have a probability distribution. We display this distribution in a rectangular table having $I$ rows for the categories of X and $J$ columns for the categories of Y. The cells of the table represent the $IJ$ possible outcomes, and their probabilities are $\{\pi_{ij}\}$, where $\pi_{ij}$ denotes the probability that $(X, Y)$ falls in the cell in row $i$ and column $j$. When the cells contain frequency counts of outcomes, the table is called a contingency table, a term introduced by Karl Pearson (1904). Another name is cross-classification table. A contingency table having $I$ rows and $J$ columns is referred to as an $I \times J$ table. The general layout, with cell counts $n_{ij}$, is:

         Y_1      Y_2      ...      Y_J      Total
X_1      n_11     n_12     ...      n_1J     n_1+
X_2      n_21     n_22     ...      n_2J     n_2+
...      ...      ...      ...      ...      ...
X_I      n_I1     n_I2     ...      n_IJ     n_I+
Total    n_+1     n_+2     ...      n_+J     n

where $n_{i+}$ is the $i$-th row total, $n_{+j}$ is the $j$-th column total, $n_{ij}$ is the frequency in the $i$-th row and $j$-th column, and $n = \sum_i \sum_j n_{ij}$ is the grand total.


Notation and Definitions

Let $n_{ij}$ denote the number of observations cross-classified in the cell of the table that is in row $i$ and column $j$, and let $p_{ij}$ denote the proportion of the total sample falling in that cell. That is,

$$p_{ij} = \frac{n_{ij}}{n}, \qquad \text{where } n = \sum_i \sum_j n_{ij} \text{ is the total sample size, so that } \sum_i \sum_j p_{ij} = 1.$$

The set $\{p_{ij}\}$ is called the sample joint distribution. The sample marginal distributions are the row totals and column totals obtained by summing the joint proportions. These are denoted by $p_{i+}$ for the row variable, where $p_{i+} = \sum_j p_{ij}$, and by $p_{+j}$ for the column variable, where $p_{+j} = \sum_i p_{ij}$; also $p_{i+} = n_{i+}/n$ and $p_{+j} = n_{+j}/n$. Note that

$$\sum_i p_{i+} = \sum_j p_{+j} = 1.$$
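As a quick illustration of this notation, here is a minimal Python sketch (the counts are made up for illustration, not taken from the text) that computes the sample joint, marginal and conditional distributions from a table of cell counts.

```python
# Minimal sketch (hypothetical counts): sample joint, marginal, and
# conditional distributions for an I x J contingency table.
counts = [
    [10, 20, 30],   # n_11, n_12, n_13
    [15, 25, 35],   # n_21, n_22, n_23
]

n = sum(sum(row) for row in counts)                 # total sample size
p = [[nij / n for nij in row] for row in counts]    # joint p_ij = n_ij / n
p_row = [sum(row) for row in p]                     # row marginals p_i+
p_col = [sum(col) for col in zip(*p)]               # column marginals p_+j
p_cond = [[pij / p_row[i] for pij in p[i]]          # conditionals p_j|i
          for i in range(len(p))]

print(round(sum(p_row), 3), round(sum(p_col), 3))   # both print 1.0
print([round(x, 3) for x in p_cond[0]])             # distribution of Y in row 1
```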

Similar notation is used for population proportions, with the Greek letter $\pi$ in place of $p$. For instance, population analogs are denoted by $\pi_{ij}$ for the joint probabilities and $\pi_{j|i}$ for the conditional probabilities. The population conditional, joint and marginal probabilities are related by

$$\pi_{j|i} = \frac{\pi_{ij}}{\pi_{i+}},$$

and they satisfy $\sum_j \pi_{ij} = \pi_{i+}$ and $\sum_j \pi_{j|i} = 1$ for $i = 1, 2, \ldots, I$. The following table illustrates the notation for the $2 \times 2$ case, with conditional probabilities shown in parentheses.

Table: Notation for joint, conditional and marginal distributions.

              Column 1          Column 2          Total
Row 1     π_11 (π_1|1)      π_12 (π_2|1)      π_1+
Row 2     π_21 (π_1|2)      π_22 (π_2|2)      π_2+
Total     π_+1              π_+2              1

Independence
When both variables are response variables, we can describe the association using their joint distribution, the conditional distribution of Y given X, or the conditional distribution of X given Y. The conditional distribution of Y given X is related to the joint distribution by $\pi_{j|i} = \pi_{ij}/\pi_{i+}$ for all $i$ and $j$. The variables are statistically independent if all joint probabilities equal the product of their marginal probabilities; that is, if

$$\pi_{ij} = \pi_{i+}\pi_{+j} \quad \text{for } i = 1, 2, \ldots, I \text{ and } j = 1, 2, \ldots, J. \tag{1}$$

When X and Y are independent,

$$\pi_{j|i} = \frac{\pi_{ij}}{\pi_{i+}} = \frac{\pi_{i+}\pi_{+j}}{\pi_{i+}} = \pi_{+j} \quad \text{[using (1)]} \quad \text{for } i = 1, 2, \ldots, I;$$

i.e., each conditional distribution of Y is identical to the marginal distribution of Y. Thus, two variables are independent when the probability of column response $j$ is the same in each row, for $j = 1, 2, \ldots, J$. When Y is a response and X is an explanatory variable, the condition $\pi_{j|1} = \cdots = \pi_{j|I}$ for all $j$ provides a more natural definition of independence than (1).
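The following small sketch (hypothetical marginal probabilities) illustrates criterion (1) numerically: when the joint distribution is the product of its marginals, every conditional distribution of Y coincides with the marginal distribution of Y.

```python
# Sketch: if pi_ij = pi_i+ * pi_+j, each row's conditional distribution of Y
# equals the marginal distribution of Y (hypothetical marginals).
pi_row = [0.3, 0.7]            # pi_i+
pi_col = [0.2, 0.5, 0.3]       # pi_+j

joint = [[r * c for c in pi_col] for r in pi_row]   # pi_ij under independence

for i, row in enumerate(joint):
    cond = [pij / pi_row[i] for pij in row]         # pi_j|i
    assert all(abs(a - b) < 1e-12 for a, b in zip(cond, pi_col))
print("each conditional distribution of Y equals the marginal of Y")
```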


We use similar notation for sample distributions, with the letter $p$ in place of $\pi$. For instance, $\{p_{ij}\}$ denotes the sample joint distribution in a contingency table. The cell frequencies are denoted by $\{n_{ij}\}$, with $n = \sum_i \sum_j n_{ij}$ being the total sample size, so $p_{ij} = n_{ij}/n$. The proportion of times that subjects in row $i$ made response $j$ is

$$p_{j|i} = \frac{p_{ij}}{p_{i+}} = \frac{n_{ij}}{n_{i+}}, \qquad \text{where } n_{i+} = np_{i+} = \sum_j n_{ij}.$$

Ways of Comparing Proportions

Difference of proportions
For subjects in row $i$, $i = 1, 2, \ldots, I$, $\pi_{1|i}$ is the probability of response 1, and $(\pi_{1|i},\ \pi_{2|i} = 1 - \pi_{1|i})$ is the conditional distribution of the binary response. We can compare two rows, say $h$ and $i$, using the difference of proportions, $\pi_{1|h} - \pi_{1|i}$. Comparison on response 2 is equivalent to comparison on response 1, since

$$\pi_{2|h} - \pi_{2|i} = (1 - \pi_{1|h}) - (1 - \pi_{1|i}) = \pi_{1|i} - \pi_{1|h}.$$

The difference of proportions falls between $-1$ and $+1$. It equals zero when rows $h$ and $i$ have identical conditional distributions. The response Y is statistically independent of the row classification when $\pi_{1|h} - \pi_{1|i} = 0$ for all pairs of rows $h$ and $i$.
For $I \times J$ contingency tables, we can compare the conditional probabilities of response $j$ for rows $h$ and $i$ using the difference $\pi_{j|h} - \pi_{j|i}$. The variables are independent when this difference equals zero for all pairs of rows $h$ and $i$ and all possible responses $j$; equivalently, when the $(I-1)(J-1)$ differences

$$\pi_{j|i} - \pi_{j|I} = 0, \qquad i = 1, 2, \ldots, I-1 \text{ and } j = 1, 2, \ldots, J-1.$$

When both variables are responses and there is a joint distribution $\{\pi_{ij}\}$, the comparison of proportions within rows $h$ and $i$ satisfies

$$\pi_{1|h} - \pi_{1|i} = \frac{\pi_{h1}}{\pi_{h+}} - \frac{\pi_{i1}}{\pi_{i+}}.$$

For the $2 \times 2$ case,

$$P(\text{col. 1} \mid \text{row 1}) - P(\text{col. 1} \mid \text{row 2}) = \frac{\pi_{11}}{\pi_{1+}} - \frac{\pi_{21}}{\pi_{2+}}.$$

We can also compare columns in terms of the proportion of row-1 responses, using the difference of within-column proportions

$$P(\text{row 1} \mid \text{col. 1}) - P(\text{row 1} \mid \text{col. 2}) = \frac{\pi_{11}}{\pi_{+1}} - \frac{\pi_{12}}{\pi_{+2}}.$$

This does not usually give the same value as the difference of within-row proportions.

Test Procedure and Obtaining a C.I. for the Difference of Proportions

Under the null hypothesis,
$$H_0: \pi_{1|1} = \pi_{1|2}, \quad \text{i.e., } \pi_{1|1} - \pi_{1|2} = 0.$$

Under the alternative hypothesis,
$$H_1: \pi_{1|1} - \pi_{1|2} \neq 0.$$

Since $n_{11} \sim \text{Binomial}(n_{1+}, \pi_{1|1})$ and $n_{21} \sim \text{Binomial}(n_{2+}, \pi_{1|2})$ independently,

$$\text{Var}(p_{1|1}) = \text{Var}\left(\frac{n_{11}}{n_{1+}}\right) = \frac{\pi_{1|1}(1 - \pi_{1|1})}{n_{1+}} = \frac{\pi_{1|1}\pi_{2|1}}{n_{1+}}, \quad \text{and similarly} \quad \text{Var}(p_{1|2}) = \frac{\pi_{1|2}\pi_{2|2}}{n_{2+}}.$$

So for large $n$,

$$Z = \frac{(p_{1|1} - p_{1|2}) - E(p_{1|1} - p_{1|2})}{\sqrt{\text{Var}(p_{1|1} - p_{1|2})}} = \frac{(p_{1|1} - p_{1|2}) - (\pi_{1|1} - \pi_{1|2})}{\sqrt{\dfrac{\pi_{1|1}\pi_{2|1}}{n_{1+}} + \dfrac{\pi_{1|2}\pi_{2|2}}{n_{2+}}}} \sim N(0, 1) \quad \text{under } H_0.$$

A $100(1-\alpha)\%$ confidence interval for $\pi_{1|1} - \pi_{1|2}$ can be written as

$$(p_{1|1} - p_{1|2}) \pm Z_{\alpha/2}\sqrt{\frac{p_{1|1}p_{2|1}}{n_{1+}} + \frac{p_{1|2}p_{2|2}}{n_{2+}}}.$$
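A minimal sketch of this test and interval, using hypothetical 2 x 2 counts; the variance estimate mirrors the unpooled formula above.

```python
# Sketch of the large-sample z test and 95% CI for pi_1|1 - pi_1|2,
# using hypothetical counts [[n11, n12], [n21, n22]].
from math import sqrt

n11, n12 = 40, 60      # row 1: response 1, response 2
n21, n22 = 25, 75      # row 2
n1, n2 = n11 + n12, n21 + n22

p1, p2 = n11 / n1, n21 / n2            # p_1|1 and p_1|2
diff = p1 - p2

# unpooled standard error, as in the CI formula above
se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
z = diff / se                          # test statistic under H0: diff = 0
z_crit = 1.96                          # Z_{alpha/2} for a 95% interval
ci = (diff - z_crit * se, diff + z_crit * se)

print(f"diff = {diff:.3f}, z = {z:.2f}, 95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
```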

Relative Risk
A difference in proportions of fixed size may have greater importance when both proportions are close to 0 or 1 than when they are near the middle of the range. For instance, suppose we compare two drugs in terms of the proportion of subjects who suffer bad side effects. The difference between 0.010 and 0.001 may be more noteworthy than the difference between 0.410 and 0.401. In such cases, the ratio of proportions is also a useful descriptive measure.
For $2 \times 2$ tables, the relative risk is the ratio

$$\frac{\pi_{1|1}}{\pi_{1|2}} = \frac{\pi_{11}/\pi_{1+}}{\pi_{21}/\pi_{2+}} = \frac{\pi_{11}\pi_{2+}}{\pi_{1+}\pi_{21}}.$$

The ratio can be any non-negative number, with a relative risk of 1 corresponding to independence. Comparison on the second response gives a different relative risk,

$$\frac{\pi_{2|1}}{\pi_{2|2}} = \frac{1 - \pi_{1|1}}{1 - \pi_{1|2}}.$$

Comment: A relative risk of 1 indicates independence of the categorical variables.

Note: The relative risk and the difference of proportions are affected by an interchange of rows and columns.
Odds Ratio
In a $2 \times 2$ contingency table, within row 1 the odds that the response is in column 1 instead of column 2 are defined to be $\Omega_1 = \pi_{1|1}/\pi_{2|1}$, and within row 2 the corresponding odds equal $\Omega_2 = \pi_{1|2}/\pi_{2|2}$. For joint distributions, the equivalent definition is

$$\Omega_i = \frac{\pi_{i1}}{\pi_{i2}}, \qquad i = 1, 2.$$

Each $\Omega_i$ is non-negative, with value greater than 1 when response 1 is more likely than response 2. For example, when $\Omega_1 = 4$, in the first row response 1 is 4 times as likely as response 2. The within-row conditional distributions are identical, and thus the variables are independent, if and only if $\Omega_1 = \Omega_2$.


The ratio of the odds $\Omega_1$ and $\Omega_2$,

$$\theta = \frac{\Omega_1}{\Omega_2},$$

is called the odds ratio. From the definition of the odds using joint probabilities,

$$\theta = \frac{\pi_{11}/\pi_{12}}{\pi_{21}/\pi_{22}} = \frac{\pi_{11}\pi_{22}}{\pi_{12}\pi_{21}} \geq 0.$$

It is also called the cross-product ratio, since it equals the ratio of the products $\pi_{11}\pi_{22}$ and $\pi_{12}\pi_{21}$ of probabilities from diagonally opposite cells. The odds ratio can equal any non-negative number. When all cell probabilities are positive, independence of X and Y is equivalent to $\theta = 1$. When $\theta > 1$, subjects in row 1 are more likely to make the first response than are subjects in row 2; that is, $\pi_{1|1} > \pi_{1|2}$. For instance, when $\theta = 4$, the odds of the first response are four times higher in row 1 than in row 2. This does not mean that the probability $\pi_{1|1}$ is four times higher than $\pi_{1|2}$; that is the interpretation of a relative risk of 4.0. When $0 \leq \theta < 1$, the first response is less likely in row 1 than in row 2; that is, $\pi_{1|1} < \pi_{1|2}$. When one cell has zero probability, $\theta$ equals 0 or $\infty$.

Test Procedure and Obtaining a C.I. for the Odds Ratio

For sample cell frequencies $\{n_{ij}\}$, a sample version of $\theta$ is $\hat{\theta} = \dfrac{n_{11}n_{22}}{n_{12}n_{21}}$. For testing the odds ratio we can use the following procedure.

Under the null hypothesis,
$$H_0: \theta = 1, \quad \text{i.e., } \log\theta = 0.$$
Under the alternative hypothesis,
$$H_1: \theta \neq 1, \quad \text{i.e., } \log\theta \neq 0.$$

For large samples,

$$\log\hat{\theta} \sim N\left(\log\theta,\ \widehat{\text{Var}}(\log\hat{\theta})\right), \qquad \text{where} \qquad \widehat{\text{Var}}(\log\hat{\theta}) = \frac{1}{n_{11}} + \frac{1}{n_{12}} + \frac{1}{n_{21}} + \frac{1}{n_{22}}.$$

Now,

$$Z = \frac{\log\hat{\theta} - \log\theta_0}{\sqrt{\widehat{\text{Var}}(\log\hat{\theta})}} \sim N(0, 1) \quad \text{under } H_0.$$

The confidence interval on the log scale is

$$L.L. = \log\hat{\theta} - Z_{\alpha/2}\sqrt{\widehat{\text{Var}}(\log\hat{\theta})}, \qquad U.L. = \log\hat{\theta} + Z_{\alpha/2}\sqrt{\widehat{\text{Var}}(\log\hat{\theta})};$$

if 0 is contained in this interval, independence is indicated. Transforming back to the original scale,

$$L.L.(\theta) = e^{L.L.}, \qquad U.L.(\theta) = e^{U.L.};$$

if 1 is contained in this interval, independence is indicated.
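A minimal sketch of this procedure, again with hypothetical counts: it computes the sample odds ratio, the z statistic for H0: theta = 1, and a confidence interval built on the log scale and transformed back.

```python
# Sketch: sample odds ratio with the large-sample CI on the log scale,
# for hypothetical 2x2 counts.
from math import exp, log, sqrt

n11, n12, n21, n22 = 40, 60, 25, 75

theta = (n11 * n22) / (n12 * n21)                    # sample odds ratio
se_log = sqrt(1/n11 + 1/n12 + 1/n21 + 1/n22)         # SE of log(theta)
z = log(theta) / se_log                              # test of H0: theta = 1

z_crit = 1.96                                        # 95% interval
ll = log(theta) - z_crit * se_log
ul = log(theta) + z_crit * se_log
print(f"theta = {theta:.3f}, z = {z:.2f}")
# if this interval contains 1, independence is plausible
print(f"95% CI for theta: ({exp(ll):.3f}, {exp(ul):.3f})")
```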

Properties

The value of $\theta$ does not change if both cell frequencies within any row are multiplied by a non-zero constant, or if both cell frequencies within any column are multiplied by a non-zero constant.

Two values for $\theta$ represent the same level of association, but in opposite directions, when one value is the inverse of the other. For instance, when $\theta = 0.25$, the odds of the first response are 0.25 times as high in row 1 as in row 2, or equivalently $1/0.25 = 4.0$ times as high in row 2 as in row 1. If the order of the rows is reversed, or if the order of the columns is reversed, the new value of $\theta$ is simply the inverse of the original value.

The odds ratio does not change value when the orientation of the table is reversed so that the rows become the columns and the columns become the rows. Therefore, it is unnecessary to identify one classification as the response variable in order to calculate $\theta$.

Values of $\theta$ farther from 1.0 in a given direction represent stronger levels of association.

It is sometimes more convenient to use $\ln\theta$. Independence corresponds to $\ln\theta = 0$. The log odds ratio is symmetric about this value: reversal of rows or of columns results only in a change of its sign. Two values of $\ln\theta$ that are the same except for sign, such as $\ln 4 = 1.39$ and $\ln 0.25 = -1.39$, represent the same level of association.

An implication of the multiplicative invariance property is that the sample odds ratio estimates the same characteristic even when we select disproportionately large or small samples from marginal categories of a variable. For instance, suppose a study investigates the association between vaccination and catching a certain strain of flu. For a retrospective design, the sample odds ratio estimates the same characteristic whether we randomly sample (1) 100 people who got the flu and 100 people who did not, or (2) 150 people who got the flu and 50 people who did not, in each case classifying subjects on whether they took the vaccine. In fact, the odds ratio is equally valid for retrospective, prospective, or cross-sectional sampling designs. We would estimate the same characteristic if (3) we randomly sampled 100 people who took the vaccine and 100 who did not, and then classified them on whether they got the flu, or (4) we randomly sampled 200 people and classified them on whether they took the vaccine and whether they got the flu.

Comment: An odds ratio of 1 indicates independence of the categorical variables.

Relationship between the Odds Ratio and the Relative Risk

For $2 \times 2$ tables, the relative risk is defined as

$$\text{Relative Risk} = \frac{\pi_{1|1}}{\pi_{1|2}},$$

while the odds ratio can be defined as

$$\text{Odds Ratio} = \frac{\pi_{1|1}/\pi_{2|1}}{\pi_{1|2}/\pi_{2|2}} = \frac{\pi_{1|1}\pi_{2|2}}{\pi_{1|2}\pi_{2|1}}.$$

Now we have

$$\text{Odds Ratio} = \frac{\pi_{1|1}\pi_{2|2}}{\pi_{1|2}\pi_{2|1}} = \frac{\pi_{1|1}(1 - \pi_{1|2})}{\pi_{1|2}(1 - \pi_{1|1})},$$

so

$$\text{Odds Ratio} = \text{Relative Risk} \times \frac{1 - \pi_{1|2}}{1 - \pi_{1|1}}.$$

Their magnitudes are similar whenever the probability of response 1 is close to zero for both groups. When the sampling design is retrospective, it is possible to construct conditional distributions within levels of the fixed response. It is usually not possible to estimate the probability of the outcome of interest, or to compute the difference of proportions or relative risk for that outcome.


We can compute the odds ratio, however, since it is determined by the conditional distributions in either direction.
When the probability of the outcome of interest is very small, the population odds ratio and relative risk take similar
values. Thus, we can use the sample odds ratio to provide a rough indication of the relative risk.
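The side-effect proportions quoted earlier give a quick numerical check of this approximation; the sketch below compares the odds ratio and the relative risk for a rare outcome and for a mid-range one.

```python
# Sketch: for rare outcomes the odds ratio approximates the relative risk.
# Conditional probabilities of the outcome in two groups, taken from the
# drug side-effect illustration above.
for p1, p2 in [(0.010, 0.001), (0.410, 0.401)]:
    rr = p1 / p2
    odds_ratio = (p1 / (1 - p1)) / (p2 / (1 - p2))
    print(f"p1={p1}, p2={p2}: RR={rr:.3f}, OR={odds_ratio:.3f}")
# For the rare outcome, OR (10.091) nearly equals RR (10.000); away from
# zero the two measures drift apart.
```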

Measures of Ordinal Association


A basic question researchers usually pose when analyzing ordinal data is: does Y tend to increase as X increases? Bivariate analyses of interval-scale variables often summarize covariation by the Pearson correlation, which describes the degree to which Y has a linear relationship with X. Ordinal variables do not have a defined metric, so the notion of linearity is not meaningful. However, the inherent ordering of the categories allows consideration of monotonicity: for instance, whether Y tends to increase as X does. Measures for ordinal variables that are analogous to the Pearson correlation describe the degree to which the relationship is monotone.
In a strict sense, a comparison of two subjects on an ordinal scale can only answer the question of which subject makes the higher response. When we observe the ordering of two subjects on each of two variables, we can classify the pair of subjects as concordant or discordant.

Concordant and Discordant

The pair is concordant if the subject ranking higher on variable X also ranks higher on variable Y. The pair is discordant if the subject ranking higher on X ranks lower on Y. The pair is tied if the subjects have the same classification on X and/or Y.

Consider two independent observations from a joint probability distribution $\{\pi_{ij}\}$ for two ordinal variables. For that pair of observations, the probabilities of concordance and discordance are, respectively,

$$\Pi_c = 2\sum_i\sum_j \pi_{ij}\left(\sum_{h>i}\sum_{k>j}\pi_{hk}\right) \qquad \text{and} \qquad \Pi_d = 2\sum_i\sum_j \pi_{ij}\left(\sum_{h>i}\sum_{k<j}\pi_{hk}\right).$$

Several measures of association for ordinal variables utilize the difference $\Pi_c - \Pi_d$ between these probabilities. For these measures, the association is said to be positive if $\Pi_c - \Pi_d > 0$ and negative if $\Pi_c - \Pi_d < 0$; under independence, $\Pi_c - \Pi_d = 0$.
Example of Job Satisfaction
We illustrate concordance and discordance using the following table, taken from the 1984 General Social Survey of the National Data Program in the United States, as quoted by Norusis (1988). The variables are income and job satisfaction. Income has levels less than $6000 (denoted <6), between $6000 and $15,000 (6-15), between $15,000 and $25,000 (15-25), and over $25,000 (>25). Job satisfaction has levels very dissatisfied (VD), little dissatisfied (LD), moderately satisfied (MS), and very satisfied (VS). We treat VS as the high end of the job satisfaction scale.
Table: Cross-Classification of Job Satisfaction by Income

                                Job Satisfaction
Income (US$)        Very          Little        Moderately    Very
                    Dissatisfied  Dissatisfied  Satisfied     Satisfied
< 6000                   20            24            80            82
6000-15,000              22            38           104           125
15,000-25,000            13            28            81           113
> 25,000                  7            18            54            92

Consider a pair of subjects, one of whom is classified in the cell (<6, VD) and the other in the cell (6-15, LD). Since the second subject ranks higher on both income and job satisfaction, each such pair is concordant, and there are 20 x 38 = 760 concordant pairs from these two cells. The 20 subjects in the cell (<6, VD) are also part of a concordant pair when matched with each of the other (104 + 125 + 28 + 81 + 113 + 18 + 54 + 92) subjects ranked higher on both variables. Similarly, the 24 subjects in the cell (<6, LD) are part of concordant pairs when matched with the (104 + 125 + 81 + 113 + 54 + 92) subjects ranked higher on both variables. The total number of concordant pairs, denoted by C, is

C = 20(38 + 104 + 125 + 28 + 81 + 113 + 18 + 54 + 92) + 24(104 + 125 + 81 + 113 + 54 + 92)
  + 80(125 + 113 + 92) + 22(28 + 81 + 113 + 18 + 54 + 92) + 38(81 + 113 + 54 + 92)
  + 104(113 + 92) + 13(18 + 54 + 92) + 28(54 + 92) + 81(92) = 109,520.

The number of discordant pairs of observations is

D = 24(22 + 13 + 7) + 80(22 + 38 + 13 + 28 + 7 + 18) + ... + 113(7 + 18 + 54) = 84,915.

In this example, C > D suggests a tendency for low income to occur with low job satisfaction and high income with high job satisfaction.
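The pair counts are easy to verify by brute force; the sketch below implements the definitions of concordance and discordance directly for the table's cell counts.

```python
# Sketch: counting concordant (C) and discordant (D) pairs for the
# income x job satisfaction table (rows and columns in increasing order).
table = [
    [20, 24, 80, 82],     # income < 6000
    [22, 38, 104, 125],   # 6000-15,000
    [13, 28, 81, 113],    # 15,000-25,000
    [7, 18, 54, 92],      # > 25,000
]
I, J = len(table), len(table[0])

C = D = 0
for i in range(I):
    for j in range(J):
        for h in range(i + 1, I):           # cells in strictly higher rows
            for k in range(J):
                if k > j:                   # higher on both variables
                    C += table[i][j] * table[h][k]
                elif k < j:                 # higher on X, lower on Y
                    D += table[i][j] * table[h][k]

print(C, D)   # 109520 84915, matching the hand computation
```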

Gamma
Given that the pair is untied on both variables, $\Pi_c/(\Pi_c + \Pi_d)$ is the probability of concordance and $\Pi_d/(\Pi_c + \Pi_d)$ is the probability of discordance. The difference between these probabilities,

$$\gamma = \frac{\Pi_c - \Pi_d}{\Pi_c + \Pi_d},$$

is called gamma. Its range is $-1 \leq \gamma \leq 1$. Whereas the absolute value of the Pearson correlation equals 1 only when the relationship between X and Y is perfectly linear, only monotonicity is required for $|\gamma| = 1$: $\gamma = 1$ if $\Pi_d = 0$ and $\gamma = -1$ if $\Pi_c = 0$. The perfect-association value $|\gamma| = 1$ can occur even when the relationship is not strictly monotone. If $\gamma = 1$, for instance, then for observations $(X_a, Y_a)$ and $(X_b, Y_b)$ on a pair of subjects $a$ and $b$ having $X_a > X_b$, it follows that $Y_a \geq Y_b$ but not necessarily that $Y_a > Y_b$. Independence implies $\gamma = 0$, but the converse is not true.
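Continuing the job satisfaction example, the sample analog of gamma replaces Pi_c and Pi_d with the pair counts C and D.

```python
# Sample gamma for the job satisfaction table, using C and D from the
# concordance sketch above.
C, D = 109520, 84915
gamma = (C - D) / (C + D)
print(f"gamma = {gamma:.3f}")   # 0.127: a weak positive association
```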

Yule's Coefficient
For $2 \times 2$ tables, we define

$$Q = \frac{\pi_{11}\pi_{22} - \pi_{12}\pi_{21}}{\pi_{11}\pi_{22} + \pi_{12}\pi_{21}}.$$

This measure, which Yule (1900, 1912) introduced and called Q in honor of the Belgian statistician Quetelet, is now referred to as Yule's Q. The range of Q is $-1 \leq Q \leq 1$.

Relationship between the Odds Ratio and Yule's Coefficient

For a $2 \times 2$ table, the odds ratio can be defined as

$$\theta = \frac{\pi_{11}/\pi_{12}}{\pi_{21}/\pi_{22}} = \frac{\pi_{11}\pi_{22}}{\pi_{12}\pi_{21}},$$

and Yule's coefficient can be defined as

$$Q = \frac{\pi_{11}\pi_{22} - \pi_{12}\pi_{21}}{\pi_{11}\pi_{22} + \pi_{12}\pi_{21}}.$$

Dividing the numerator and denominator of Q by $\pi_{12}\pi_{21}$, we have

$$Q = \frac{\dfrac{\pi_{11}\pi_{22}}{\pi_{12}\pi_{21}} - 1}{\dfrac{\pi_{11}\pi_{22}}{\pi_{12}\pi_{21}} + 1} = \frac{\theta - 1}{\theta + 1}, \qquad -1 \leq Q \leq 1.$$

For $2 \times 2$ tables gamma simplifies to Yule's Q, so Q is a strictly monotone transformation of $\theta$ from the $[0, \infty)$ scale onto the $[-1, +1]$ scale.
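A quick numerical check of this transformation:

```python
# Sketch: Yule's Q as a monotone transform of the odds ratio theta.
for theta in [0.25, 1.0, 4.0]:
    q = (theta - 1) / (theta + 1)
    print(f"theta = {theta:>4}: Q = {q:+.3f}")
# theta = 1 maps to Q = 0 (independence); reciprocal thetas (0.25 and 4)
# give Q values equal in magnitude and opposite in sign.
```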

Kendall's Tau-b and Somers' d

The sample correlation between the $n(n-1)$ distinct $(X_{ab}, Y_{ab})$ pairs equals

$$\hat{\tau}_b = \frac{C - D}{\sqrt{\left[\dfrac{n(n-1)}{2} - T_X\right]\left[\dfrac{n(n-1)}{2} - T_Y\right]}},$$

where $T_X = \frac{1}{2}\sum_i n_{i+}(n_{i+} - 1)$ is the number of pairs tied on X and $T_Y = \frac{1}{2}\sum_j n_{+j}(n_{+j} - 1)$ is the number of pairs tied on Y. This index of ordinal association is called Kendall's tau-b. Tau-b tends to be less sensitive than gamma to the choice of response categories.

Somers' d can be defined analogously as

$$d = \frac{C - D}{\dfrac{n(n-1)}{2} - T_X},$$

which is the difference between the proportions of concordant and discordant pairs, out of those pairs untied on X.
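The sketch below computes both measures for the job satisfaction table, reusing the pair counts C and D obtained earlier.

```python
# Sketch: Kendall's tau-b and Somers' d for the job satisfaction table.
from math import sqrt

table = [
    [20, 24, 80, 82],
    [22, 38, 104, 125],
    [13, 28, 81, 113],
    [7, 18, 54, 92],
]
C, D = 109520, 84915                    # from the concordance sketch

n = sum(map(sum, table))
row_tot = [sum(row) for row in table]               # n_i+
col_tot = [sum(col) for col in zip(*table)]         # n_+j

pairs = n * (n - 1) / 2
Tx = sum(r * (r - 1) / 2 for r in row_tot)          # pairs tied on X
Ty = sum(c * (c - 1) / 2 for c in col_tot)          # pairs tied on Y

tau_b = (C - D) / sqrt((pairs - Tx) * (pairs - Ty))
somers_d = (C - D) / (pairs - Tx)
print(f"tau-b = {tau_b:.3f}, Somers' d = {somers_d:.3f}")  # about 0.088, 0.082
```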

Measures of Nominal Association


When variables in a two-way table are nominal, notions such as positive / negative association and monotonicity are
no longer meaningful. It is then more difficult to describe association by a single number, and summary measures are
less useful than for ordinal or interval-scale variables.

Proportional Reduction in Variation
The most interpretable indices for nominal variables have the same structure as R-squared (the coefficient of determination) for interval variables. R-squared, and the more general intraclass correlation coefficient and correlation ratio, describe the proportional reduction in variance from the marginal distribution to the conditional distributions of the response.
Let $V(Y)$ denote a measure of variation for the marginal distribution $(\pi_{+1}, \ldots, \pi_{+J})$ of the response Y, and let $V(Y \mid i)$ denote this measure computed for the conditional distribution $(\pi_{1|i}, \ldots, \pi_{J|i})$ of Y at the $i$-th setting of an explanatory variable X. A proportional reduction in variation measure has the form

$$\frac{V(Y) - E[V(Y \mid X)]}{V(Y)},$$

where $E[V(Y \mid X)]$ is the expectation of the conditional variation taken with respect to the distribution of X. When X is a categorical variable having marginal distribution $(\pi_{1+}, \ldots, \pi_{I+})$,

$$E[V(Y \mid X)] = \sum_i \pi_{i+}V(Y \mid i).$$

If the measure equals 1, the association between X and Y is strong; if it equals 0, there is no reduction in variation and hence no association of this type between X and Y.

Concentration and Uncertainty Measures

One variation measure for a nominal response is

$$V(Y) = \sum_j \pi_{+j}(1 - \pi_{+j}) = 1 - \sum_j \pi_{+j}^2.$$

This is the probability that two independent observations from the marginal distribution of Y fall in different categories. The variation takes its minimum value of zero when $\pi_{+j} = 1$ for some $j$, and its maximum value of $1 - 1/J$ when $\pi_{+j} = 1/J$ for all $j$. The conditional variation in row $i$ is

$$V(Y \mid i) = \sum_j \pi_{j|i}(1 - \pi_{j|i}) = 1 - \sum_j \pi_{j|i}^2.$$

For an $I \times J$ contingency table with joint probabilities $\{\pi_{ij}\}$, the average conditional variation is

$$E[V(Y \mid X)] = \sum_i \pi_{i+}V(Y \mid i) = 1 - \sum_i \pi_{i+}\sum_j \pi_{j|i}^2 = 1 - \sum_i\sum_j \frac{\pi_{ij}^2}{\pi_{i+}}.$$

The proportional reduction in variation is Goodman and Kruskal's tau,

$$\tau = \frac{V(Y) - E[V(Y \mid X)]}{V(Y)} = \frac{\sum_i\sum_j \pi_{ij}^2/\pi_{i+} - \sum_j \pi_{+j}^2}{1 - \sum_j \pi_{+j}^2},$$

also called the concentration coefficient. A large value of $\tau$ represents a strong association.

Theil (1970) proposed an alternative variation measure,

$$V(Y) = -\sum_j \pi_{+j}\log\pi_{+j} \qquad \text{and} \qquad V(Y \mid i) = -\sum_j \pi_{j|i}\log\pi_{j|i}.$$

Now,

$$E[V(Y \mid X)] = \sum_i \pi_{i+}V(Y \mid i) = -\sum_i \pi_{i+}\sum_j \pi_{j|i}\log\pi_{j|i} = -\sum_i\sum_j \pi_{ij}\log\left(\frac{\pi_{ij}}{\pi_{i+}}\right).$$

So the proportional reduction in variation index,

$$U = \frac{V(Y) - E[V(Y \mid X)]}{V(Y)} = \frac{-\sum_j \pi_{+j}\log\pi_{+j} + \sum_i\sum_j \pi_{ij}\log(\pi_{ij}/\pi_{i+})}{-\sum_j \pi_{+j}\log\pi_{+j}} = -\frac{\sum_i\sum_j \pi_{ij}\log\left(\dfrac{\pi_{ij}}{\pi_{i+}\pi_{+j}}\right)}{\sum_j \pi_{+j}\log\pi_{+j}},$$

is called the uncertainty coefficient.


The measures $\tau$ and $U$ are well defined when more than one $\pi_{+j} > 0$. They take values between 0 and 1. $\tau = U = 0$ is equivalent to independence of X and Y, and $\tau = U = 1$ indicates that there is no conditional variation, in the sense that for each $i$, $\pi_{j|i} = 1$ for some $j$. The variation measure used in $\tau$ is called the Gini concentration, and the variation measure used in $U$ is the entropy.

A difficulty with these measures is in determining how large a value constitutes a strong association. When the response variable has several possible categories, these measures tend to take smaller values as the number of categories increases. For instance, for $\tau$ the variation measure is the probability that two independent observations occur in different categories. Often this probability approaches 1.0 for both the conditional and marginal distributions as the number of response categories grows larger, in which case $\tau$ decreases toward 0.
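A minimal sketch (hypothetical joint probabilities) computing both coefficients directly from these formulas.

```python
# Sketch: concentration coefficient (Goodman-Kruskal tau) and uncertainty
# coefficient U for a table of joint probabilities (hypothetical values).
from math import log

pi = [
    [0.2, 0.1],
    [0.1, 0.6],
]
pi_row = [sum(row) for row in pi]                     # pi_i+
pi_col = [sum(col) for col in zip(*pi)]               # pi_+j

# Goodman-Kruskal tau (Gini concentration)
v_y = 1 - sum(c ** 2 for c in pi_col)
e_v = 1 - sum(pij ** 2 / pi_row[i]
              for i, row in enumerate(pi) for pij in row)
tau = (v_y - e_v) / v_y

# uncertainty coefficient (entropy); the numerator is the mutual information
num = sum(pij * log(pij / (pi_row[i] * pi_col[j]))
          for i, row in enumerate(pi) for j, pij in enumerate(row) if pij > 0)
U = -num / sum(c * log(c) for c in pi_col)

print(f"tau = {tau:.3f}, U = {U:.3f}")
```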

Lambda Coefficient
Goodman and Kruskal (1954) proposed an alternative measure, lambda, for nominal variables. It uses the variation measures $V(Y) = 1 - \max_j(\pi_{+j})$ and $V(Y \mid i) = 1 - \max_j(\pi_{j|i})$, giving

$$\lambda = \frac{V(Y) - E[V(Y \mid X)]}{V(Y)} = \frac{\left[1 - \max_j(\pi_{+j})\right] - \sum_i \pi_{i+}\left[1 - \max_j(\pi_{j|i})\right]}{1 - \max_j(\pi_{+j})}.$$

Comment: $\lambda = 0$ indicates that X provides no help in predicting the most likely category of Y, whereas $\lambda = 1$ indicates complete dependence, with Y perfectly predictable from X.
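The corresponding sketch for lambda, using the same hypothetical table as in the previous sketch.

```python
# Sketch: lambda for a table of joint probabilities (hypothetical values).
pi = [
    [0.2, 0.1],
    [0.1, 0.6],
]
pi_row = [sum(row) for row in pi]
pi_col = [sum(col) for col in zip(*pi)]

v_y = 1 - max(pi_col)                                 # 1 - max_j pi_+j
e_v = sum(pi_row[i] * (1 - max(pij / pi_row[i] for pij in row))
          for i, row in enumerate(pi))
lam = (v_y - e_v) / v_y
print(f"lambda = {lam:.3f}")
```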
Example 2.10
Describe the association in the following table, based on a 1965 study of a probability sample of high school seniors and their parents.


Table: Party Identification of Student by Parent

                                Student Party Identification
Parent Party Identification     Democrat    Independent    Republican    Total
Democrat                           604          245             67         916
Independent                        130          235             76         441
Republican                          63          180            252         495
Total                              797          660            395        1852

The sample marginal distribution of student party identification is

$$p_{+1} = \frac{797}{1852} = 0.430, \qquad p_{+2} = \frac{660}{1852} = 0.356, \qquad p_{+3} = \frac{395}{1852} = 0.213,$$

so

$$V(Y) = p_{+1}(1 - p_{+1}) + p_{+2}(1 - p_{+2}) + p_{+3}(1 - p_{+3}) = 0.245 + 0.230 + 0.168 = 0.643.$$

The conditional variations within the three rows are

$$V(Y \mid X = D) = \frac{604}{916}\left(1 - \frac{604}{916}\right) + \frac{245}{916}\left(1 - \frac{245}{916}\right) + \frac{67}{916}\left(1 - \frac{67}{916}\right) = 0.225 + 0.196 + 0.068 = 0.489,$$

$$V(Y \mid X = I) = \frac{130}{441}\left(1 - \frac{130}{441}\right) + \frac{235}{441}\left(1 - \frac{235}{441}\right) + \frac{76}{441}\left(1 - \frac{76}{441}\right) = 0.208 + 0.249 + 0.143 = 0.599,$$

$$V(Y \mid X = R) = \frac{63}{495}\left(1 - \frac{63}{495}\right) + \frac{180}{495}\left(1 - \frac{180}{495}\right) + \frac{252}{495}\left(1 - \frac{252}{495}\right) = 0.111 + 0.231 + 0.250 = 0.592.$$

Weighting each conditional variation by its row proportion $p_{i+}$,

$$E[V(Y \mid X)] = \frac{916}{1852}(0.489) + \frac{441}{1852}(0.599) + \frac{495}{1852}(0.592) = 0.543,$$

so

$$\tau = \frac{V(Y) - E[V(Y \mid X)]}{V(Y)} = \frac{0.643 - 0.543}{0.643} = 0.155.$$

There is thus only a weak association between the party identification of high school seniors and that of their parents.
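The computation is easy to verify; this sketch reproduces V(Y), E[V(Y|X)] and tau for the party identification table.

```python
# Sketch: verifying the concentration coefficient for the party
# identification table.
counts = [
    [604, 245, 67],    # parent Democrat
    [130, 235, 76],    # parent Independent
    [63, 180, 252],    # parent Republican
]
n = sum(map(sum, counts))                              # 1852
p_row = [sum(row) / n for row in counts]               # p_i+
p_col = [sum(col) / n for col in zip(*counts)]         # p_+j

v_y = 1 - sum(c ** 2 for c in p_col)                   # marginal variation
e_v = sum(p_row[i] * (1 - sum((x / sum(row)) ** 2 for x in row))
          for i, row in enumerate(counts))             # weighted conditional
tau = (v_y - e_v) / v_y
print(f"V(Y) = {v_y:.3f}, E[V(Y|X)] = {e_v:.3f}, tau = {tau:.3f}")  # tau ~ 0.155
```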

Example
A sample of 500 respondents was selected in a large metropolitan area to study various aspects of consumer behavior. The following contingency table was obtained:
           Enjoys Shopping for Clothing
Sex          Yes       No
Male         136       104
Female       224        36

a) Find the joint probabilities and the conditional probabilities.

b) Does enjoyment of shopping for clothing depend on sex?

c) What is the probability that a female does not enjoy shopping for clothing?

d) Compute the sample difference of proportions, the relative risk and the odds ratio.

Solution
a) The joint probabilities are

$$p_{11} = \frac{136}{500} = 0.272, \qquad p_{12} = \frac{104}{500} = 0.208, \qquad p_{21} = \frac{224}{500} = 0.448 \qquad \text{and} \qquad p_{22} = \frac{36}{500} = 0.072.$$

The conditional probabilities are

$$p_{1|1} = \frac{n_{11}}{n_{1+}} = \frac{136}{240} = 0.567, \qquad p_{2|1} = \frac{n_{12}}{n_{1+}} = \frac{104}{240} = 0.433,$$

$$p_{1|2} = \frac{n_{21}}{n_{2+}} = \frac{224}{260} = 0.862 \qquad \text{and} \qquad p_{2|2} = \frac{n_{22}}{n_{2+}} = \frac{36}{260} = 0.138.$$

b) Here

$$p_{1+} = \frac{n_{1+}}{n} = \frac{240}{500} = 0.48 \qquad \text{and} \qquad p_{+1} = \frac{n_{+1}}{n} = \frac{360}{500} = 0.72,$$

so $p_{1+}p_{+1} = 0.48 \times 0.72 = 0.346$. Hence $p_{11} = 0.272 \neq 0.346 = p_{1+}p_{+1}$, suggesting that enjoyment of shopping for clothing depends on sex.

c) The probability that a female does not enjoy shopping for clothing is $p_{2|2} = \frac{36}{260} = 0.138$.

d) The proportion of males who enjoy shopping is $\frac{136}{240} = 0.567$ and the proportion of females is $\frac{224}{260} = 0.862$, so the sample difference of proportions (female minus male) is $0.862 - 0.567 = 0.295$. The relative risk is $\frac{0.567}{0.862} = 0.658$; the proportion enjoying shopping for clothing was 0.658 times as high for males as for females. The sample odds ratio is

$$\hat{\theta} = \frac{136 \times 36}{224 \times 104} = 0.210.$$
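The figures in part (d) can be checked with a few lines of Python.

```python
# Sketch: the measures from part (d) for the shopping table.
n11, n12 = 136, 104    # male: yes, no
n21, n22 = 224, 36     # female: yes, no

p_male = n11 / (n11 + n12)          # 0.567
p_female = n21 / (n21 + n22)        # 0.862

diff = p_female - p_male            # 0.295
rr = p_male / p_female              # 0.658
theta = (n11 * n22) / (n12 * n21)   # 0.210

print(f"difference = {diff:.3f}, RR = {rr:.3f}, OR = {theta:.3f}")
```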

