Categorical Variable
A categorical variable is one whose measurement scale consists of a set of categories. Categorical variables represent data that can be divided into groups.
Example: race, sex, age group, educational level, smoking status, etc.
Data
Data are measurements of characteristics of an object or subject. There are two broad types of data: qualitative data and quantitative data. For categorical data, cell counts typically arise under multinomial sampling or Poisson sampling.
Measurement Scale
Nominal
Categorical variables whose levels do not have a natural ordering are called nominal. For nominal variables, the order in which the categories are listed is irrelevant to the statistical analysis.
Example: religious affiliation (Catholic, Jewish, Protestant, other), mode of transportation (automobile, bus, subway, bicycle, other), choice of residence (house, apartment, condominium, other), race, gender, and marital status.
Ordinal
Many categorical variables do have ordered levels; such variables are called ordinal. Ordinal variables have clearly ordered categories, but the absolute distances between categories are unknown.
Example: size of automobile (subcompact, compact, mid-size, large), social class (upper, middle, lower), attitude toward legalization of abortion (strongly disapprove, disapprove, approve, strongly approve), appraisal of a company's inventory level (too low, about right, too high), etc.
Interval
An interval variable is one that does have numerical distances between any two levels of the scale.
Example: blood pressure level, functional life length of a television set, length of prison term, income, and age.
Contingency Table
Let X and Y denote two categorical response variables, X having I levels and Y having J levels. When we classify subjects on both variables, there are IJ possible combinations of classifications. The responses (X, Y) of a subject randomly chosen from some population have a probability distribution. We display this distribution in a rectangular table having I rows for the categories of X and J columns for the categories of Y. The cells of the table represent the IJ possible outcomes; their probabilities are \pi_{ij}, where \pi_{ij} denotes the probability that (X, Y) falls in the cell in row i and column j. When the cells contain frequency counts of outcomes, the table is called a contingency table, a term introduced by Karl Pearson (1904). Another name is cross-classification table.
A contingency table having I rows and J columns is referred to as an I-by-J (I \times J) table.
The observed counts are laid out as follows, where n_{i+} denotes the i-th row total and n_{+j} the j-th column total:

              Y = 1     Y = 2     ...     Y = J     Total
  X = 1       n_{11}    n_{12}    ...     n_{1J}    n_{1+}
  X = 2       n_{21}    n_{22}    ...     n_{2J}    n_{2+}
  ...
  X = I       n_{I1}    n_{I2}    ...     n_{IJ}    n_{I+}
  Total       n_{+1}    n_{+2}    ...     n_{+J}    n
Let p_{ij} denote the proportion of the total sample falling in cell (i, j); that is,

p_{ij} = n_{ij} / n, where n = \sum_i \sum_j n_{ij}

is the total sample size. The set \{p_{ij}\} is the sample joint distribution. The marginal distributions are the row totals and column totals obtained by summing the joint proportions; these are denoted by

p_{i+} = n_{i+} / n and p_{+j} = n_{+j} / n,

and they satisfy

\sum_i p_{i+} = \sum_j p_{+j} = \sum_i \sum_j p_{ij} = 1.

Similar notation will be used for population proportions, with the Greek letter \pi in place of p. For instance, population conditional, joint, and marginal probabilities are related by

\pi_{j|i} = \pi_{ij} / \pi_{i+},

and they satisfy \sum_j \pi_{j|i} = 1 for i = 1, 2, ..., I. The following table illustrates the notation for the 2 \times 2 case.
Table: Notation for joint, conditional, and marginal distributions (2 \times 2 case).

             Column 1                 Column 2                 Total
  Row 1      \pi_{11} (\pi_{1|1})     \pi_{12} (\pi_{2|1})     \pi_{1+} (1.0)
  Row 2      \pi_{21} (\pi_{1|2})     \pi_{22} (\pi_{2|2})     \pi_{2+} (1.0)
  Total      \pi_{+1}                 \pi_{+2}                 1.0
Independence
When both variables are response variables, we can describe the association using their joint distribution, the conditional distribution of Y given X, or the conditional distribution of X given Y. The conditional distribution of Y given X is related to the joint distribution by \pi_{j|i} = \pi_{ij} / \pi_{i+}. The variables are statistically independent if all joint probabilities equal the product of their marginal probabilities, that is, if

\pi_{ij} = \pi_{i+} \pi_{+j} for i = 1, 2, ..., I and j = 1, 2, ..., J.   (1)

When X and Y are independent, using equation (1),

\pi_{j|i} = \pi_{ij} / \pi_{i+} = \pi_{i+} \pi_{+j} / \pi_{i+} = \pi_{+j} for i = 1, 2, ..., I,

i.e., each conditional distribution of Y is identical to the marginal distribution of Y. Thus, two variables are independent when the probability of column response j is the same in each row, for j = 1, 2, ..., J. When Y is a response and X is an explanatory variable, the condition

\pi_{j|1} = \pi_{j|2} = ... = \pi_{j|I} for all j

provides a more natural definition of independence.
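As a quick numerical illustration of condition (1), independence can be checked cell by cell: every joint probability must equal the product of its two marginals. The following sketch uses invented joint distributions, not data from the text.

```python
# Check condition (1): pi_ij = pi_{i+} * pi_{+j} for every cell.

def is_independent(joint, tol=1e-12):
    """Return True if every cell equals the product of its marginals."""
    I, J = len(joint), len(joint[0])
    row = [sum(joint[i]) for i in range(I)]                        # pi_{i+}
    col = [sum(joint[i][j] for i in range(I)) for j in range(J)]   # pi_{+j}
    return all(abs(joint[i][j] - row[i] * col[j]) < tol
               for i in range(I) for j in range(J))

# Independent table: rows sum to 0.4 and 0.6, columns to 0.3 and 0.7,
# and each cell is the product of its marginals.
indep = [[0.12, 0.28], [0.18, 0.42]]
# Dependent table: cell (1,1) exceeds the product of its marginals.
dep = [[0.25, 0.15], [0.05, 0.55]]

print(is_independent(indep))  # True
print(is_independent(dep))    # False
```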
We use similar notation for sample distributions, with the letter p in place of \pi. For instance, \{p_{ij}\} denotes the sample joint distribution in a contingency table. The cell frequencies are denoted by \{n_{ij}\}, with n = \sum_i \sum_j n_{ij} the total sample size, so that

p_{ij} = n_{ij} / n and p_{j|i} = p_{ij} / p_{i+} = n_{ij} / n_{i+}, where n_{i+} = n p_{i+} = \sum_j n_{ij}.

Thus the proportion of subjects in row i who made response j is p_{j|i} = n_{ij} / n_{i+}.
For a binary response, \{\pi_{1|i}, \pi_{2|i}\} is the conditional distribution in row i. We can compare two rows, say h and i, using the difference of proportions, \pi_{1|h} - \pi_{1|i}. Comparison on response 2 is equivalent to comparison on response 1, since

\pi_{2|h} - \pi_{2|i} = (1 - \pi_{1|h}) - (1 - \pi_{1|i}) = \pi_{1|i} - \pi_{1|h}.

The difference of proportions falls between -1 and +1. It equals zero when rows h and i have identical conditional distributions. The response Y is statistically independent of the row classification when \pi_{1|h} - \pi_{1|i} = 0 for all pairs of rows h and i.
For I \times J contingency tables, we can compare the conditional probabilities of response j for rows h and i using the difference \pi_{j|h} - \pi_{j|i}. The variables are independent when this difference equals zero for all pairs of rows h and i and all possible responses j; equivalently, when the (I - 1)(J - 1) differences

\pi_{j|i} - \pi_{j|I} = 0 for i = 1, 2, ..., I - 1 and j = 1, 2, ..., J - 1.
When both variables are responses and there is a joint distribution \{\pi_{ij}\}, the comparison of proportions within rows h and i satisfies

\pi_{1|h} - \pi_{1|i} = \pi_{h1}/\pi_{h+} - \pi_{i1}/\pi_{i+},

which for a 2 \times 2 table is \pi_{11}/\pi_{1+} - \pi_{21}/\pi_{2+}. We can also compare columns in terms of the proportion of row-1 responses, using the difference of within-column proportions

P(row 1 | col. 1) - P(row 1 | col. 2) = \pi_{11}/\pi_{+1} - \pi_{12}/\pi_{+2}.

This does not usually give the same value as the difference of within-row proportions.
To test H_0: \pi_{1|1} - \pi_{1|2} = 0, note that under independent binomial sampling within rows, n_{11} \sim \mathrm{Bin}(n_{1+}, \pi_{1|1}) and n_{21} \sim \mathrm{Bin}(n_{2+}, \pi_{1|2}), so that

\mathrm{Var}(p_{1|1}) = \pi_{1|1}\pi_{2|1}/n_{1+} and \mathrm{Var}(p_{1|2}) = \pi_{1|2}\pi_{2|2}/n_{2+}.

For large n, under H_0,

Z = \frac{p_{1|1} - p_{1|2}}{\sqrt{\pi_{1|1}\pi_{2|1}/n_{1+} + \pi_{1|2}\pi_{2|2}/n_{2+}}}

is approximately standard normal, with the unknown probabilities replaced by their sample estimates.
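The Z statistic above can be sketched in a few lines. The counts below are invented for illustration; the variance uses the sample estimates of the conditional proportions, as in the formula above.

```python
# Large-sample Z statistic for H0: pi_{1|1} = pi_{1|2}.
from math import sqrt

def z_diff_proportions(n11, n1p, n21, n2p):
    """Z for comparing the proportion of response 1 in rows 1 and 2.

    n11, n21: response-1 counts in rows 1 and 2
    n1p, n2p: row totals n_{1+} and n_{2+}
    """
    p1 = n11 / n1p                                    # p_{1|1}
    p2 = n21 / n2p                                    # p_{1|2}
    var = p1 * (1 - p1) / n1p + p2 * (1 - p2) / n2p   # estimated variance
    return (p1 - p2) / sqrt(var)

# Hypothetical 2x2 counts: 30/100 in row 1 versus 45/100 in row 2.
z = z_diff_proportions(30, 100, 45, 100)
print(round(z, 3))
```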
Relative Risk
A difference in proportions of fixed size may have greater importance when both proportions are close to 0 or 1 than when they are near the middle of the range. For instance, suppose we compare two drugs in terms of the proportion of subjects who suffer bad side effects. The difference between 0.010 and 0.001 may be more noteworthy than the difference between 0.410 and 0.401. In such cases, the ratio of proportions is also a useful descriptive measure.
For 2 \times 2 tables, the relative risk is the ratio

\frac{\pi_{1|1}}{\pi_{1|2}} = \frac{\pi_{11}/\pi_{1+}}{\pi_{21}/\pi_{2+}} = \frac{\pi_{11}\pi_{2+}}{\pi_{1+}\pi_{21}}.

The ratio can be any non-negative number. A relative risk of 1 corresponds to independence. Comparison on the second response gives a different relative risk,

\frac{\pi_{2|1}}{\pi_{2|2}} = \frac{1 - \pi_{1|1}}{1 - \pi_{1|2}}.

Note: The relative risk and the difference of proportions are affected by the interchange of rows and columns.
Odds Ratio
In a 2 \times 2 contingency table, within row 1 the odds that the response is in column 1 instead of column 2 are defined to be

\Omega_1 = \pi_{1|1} / \pi_{2|1},

and within row 2 the corresponding definition is \Omega_2 = \pi_{1|2} / \pi_{2|2}. In terms of joint probabilities, \Omega_i = \pi_{i1} / \pi_{i2}, i = 1, 2. Each \Omega_i is non-negative, with value greater than 1 when response 1 is more likely than response 2. For example, when \Omega_1 = 4, in the first row response 1 is 4 times as likely as response 2. The within-row conditional distributions are identical, and thus the variables are independent, if and only if \Omega_1 = \Omega_2.
The ratio of the odds, \theta = \Omega_1 / \Omega_2, is called the odds ratio. From the definition of the odds using joint probabilities,

\theta = \frac{\pi_{11}/\pi_{12}}{\pi_{21}/\pi_{22}} = \frac{\pi_{11}\pi_{22}}{\pi_{12}\pi_{21}}, \theta \ge 0.

It is also called the cross-product ratio, since it equals the ratio of the products \pi_{11}\pi_{22} and \pi_{12}\pi_{21} of probabilities from diagonally opposite cells. The odds ratio can equal any non-negative number. When all cell probabilities are positive, independence of X and Y is equivalent to \theta = 1. When \theta > 1, subjects in row 1 are more likely to make the first response than are subjects in row 2; that is, \pi_{1|1} > \pi_{1|2}. For instance, when \theta = 4, the odds of the first response are four times higher in row 1 than in row 2. This does not mean that the probability \pi_{1|1} is four times higher than \pi_{1|2}; that is the interpretation of a relative risk of 4.0. When 0 < \theta < 1, the first response is less likely in row 1 than in row 2; that is, \pi_{1|1} < \pi_{1|2}. When one cell has zero probability, \theta equals 0 or \infty.
Under H_0: \theta = 1, we have \log\theta = 0. For large samples, the sample log odds ratio \log\hat\theta has estimated variance

\widehat{\mathrm{Var}}(\log\hat\theta) = \frac{1}{n_{11}} + \frac{1}{n_{12}} + \frac{1}{n_{21}} + \frac{1}{n_{22}},

so that

Z = \frac{\log\hat\theta - 0}{\sqrt{\widehat{\mathrm{Var}}(\log\hat\theta)}} \sim N(0, 1) under H_0.

Confidence Interval
L.L. = \log\hat\theta - Z_{\alpha/2}\sqrt{\widehat{\mathrm{Var}}(\log\hat\theta)}
U.L. = \log\hat\theta + Z_{\alpha/2}\sqrt{\widehat{\mathrm{Var}}(\log\hat\theta)}

Original Units
L.L. = e^{L.L.}, U.L. = e^{U.L.}
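The steps above (odds ratio, variance of the log odds ratio, Wald interval on the log scale, back-transformation) can be sketched as follows. The 2 \times 2 counts used here are illustrative (the same counts appear in a worked example later in these notes).

```python
# Sample odds ratio with a 95% Wald confidence interval built on the
# log scale and transformed back to the original units.
from math import exp, log, sqrt

def odds_ratio_ci(n11, n12, n21, n22, z=1.96):
    theta = (n11 * n22) / (n12 * n21)             # cross-product ratio
    se = sqrt(1/n11 + 1/n12 + 1/n21 + 1/n22)      # SE of log(theta-hat)
    lo = log(theta) - z * se
    hi = log(theta) + z * se
    return theta, exp(lo), exp(hi)                # back to original units

theta, lo, hi = odds_ratio_ci(136, 104, 224, 36)
print(round(theta, 3), round(lo, 3), round(hi, 3))
```

Since the interval excludes 1, these counts would indicate an association between the row and column classifications.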
Properties
- The value of \theta does not change if both cell frequencies within any row are multiplied by a non-zero constant, or if both cell frequencies within any column are multiplied by a non-zero constant.
- Two values of \theta represent the same level of association, but in opposite directions, when one value is the inverse of the other. For instance, when \theta = 0.25, the odds of the first response are 0.25 times as high in row 1 as in row 2, or equivalently 1/0.25 = 4.0 times as high in row 2 as in row 1. If the order of the rows is reversed, or if the order of the columns is reversed, the new value of \theta is simply the inverse of the original value.
- The odds ratio does not change value when the orientation of the table is reversed so that the rows become the columns and the columns become the rows. Therefore, it is unnecessary to identify one classification as the response variable in order to calculate \theta.
- Values of \theta farther from 1.0 in a given direction represent stronger levels of association.
- It is sometimes more convenient to use \log\theta. Independence corresponds to \log\theta = 0. The log odds ratio is symmetric about this value: reversal of rows or of columns results only in a change of its sign. Two values of \log\theta that are the same except for sign, such as \log 4 = 1.39 and \log 0.25 = -1.39, represent the same level of association.
- An implication of the multiplicative invariance property is that the sample odds ratio estimates the same characteristic even when we select disproportionately large or small samples from marginal categories of a variable. For instance, suppose a study investigates the association between vaccination and catching a certain strain of flu. For a retrospective design, the sample odds ratio estimates the same characteristic whether we randomly sample (1) 100 people who got the flu and 100 people who did not, or (2) 150 people who got the flu and 50 people who did not, in each case classifying subjects on whether they took the vaccine. In fact, the odds ratio is equally valid for retrospective, prospective, or cross-sectional sampling designs. We would estimate the same characteristic if (3) we randomly sampled 100 people who took the vaccine and 100 people who did not, and then classified them on whether they got the flu, or (4) we randomly sampled 200 people and classified them on whether they took the vaccine and whether they got the flu.
Since \pi_{2|1} = 1 - \pi_{1|1} and \pi_{2|2} = 1 - \pi_{1|2}, we have

Odds Ratio = \frac{\pi_{1|1}\,\pi_{2|2}}{\pi_{1|2}\,\pi_{2|1}} = \frac{\pi_{1|1}(1 - \pi_{1|2})}{\pi_{1|2}(1 - \pi_{1|1})},

so that

Odds Ratio = Relative Risk \times \frac{1 - \pi_{1|2}}{1 - \pi_{1|1}}.
Their magnitudes are similar whenever the probability of response 1 is close to zero for both groups. When the sampling design is retrospective, it is possible to construct conditional distributions within levels of the fixed response, but it is usually not possible to estimate the probability of the outcome of interest, or to compute the difference of proportions or relative risk for that outcome.
We can compute the odds ratio, however, since it is determined by the conditional distributions in either direction. When the probability of the outcome of interest is very small, the population odds ratio and relative risk take similar values; thus, we can use the sample odds ratio to provide a rough indication of the relative risk.
Concordance and Discordance
Consider a pair of observations, one in cell (i, j) and the other in cell (h, k). The pair is concordant if the observation ranking higher on X also ranks higher on Y, and discordant if the observation ranking higher on X ranks lower on Y. The probabilities of concordance and discordance are

\Pi_c = 2 \sum_i \sum_j \pi_{ij} \left( \sum_{h>i} \sum_{k>j} \pi_{hk} \right)

and

\Pi_d = 2 \sum_i \sum_j \pi_{ij} \left( \sum_{h>i} \sum_{k<j} \pi_{hk} \right).

When \Pi_c - \Pi_d > 0 the variables are positively associated, and when \Pi_c - \Pi_d < 0 they are negatively associated.
Example of Job Satisfaction
We illustrate concordance and discordance using Table 2.4, taken from the 1984 General Social Survey of the National Data Program in the United States, as quoted by Norusis (1988). The variables are income and job satisfaction. Income has levels less than $6000 (denoted <6), between $6000 and $15,000 (6-15), between $15,000 and $25,000 (15-25), and over $25,000 (>25). Job satisfaction has levels very dissatisfied (VD), little dissatisfied (LD), moderately satisfied (MS), and very satisfied (VS). We treat VS as the high end of the job satisfaction scale.
Table: Cross-Classification of Job Satisfaction by Income

                            Job Satisfaction
  Income (US$)        VD      LD      MS      VS
  <6000               20      24      80      82
  6000-15,000         22      38     104     125
  15,000-25,000       13      28      81     113
  >25,000              7      18      54      92
Consider a pair of subjects, one of whom is classified in the cell (<6, VD) and the other in the cell (6-15, LD). This pair is concordant, so these two cells contribute 20 \times 38 = 760 concordant pairs. The 20 subjects in the cell (<6, VD) are also part of a concordant pair when matched with each of the other (104 + 125 + 28 + 81 + 113 + 18 + 54 + 92) subjects ranked higher on both variables. Similarly, the 24 subjects in the cell (<6, LD) are part of concordant pairs when matched with the (104 + 125 + 81 + 113 + 54 + 92) subjects ranked higher on both variables.
The total number of concordant pairs, denoted by C, is

C = 20(38 + 104 + 125 + 28 + 81 + 113 + 18 + 54 + 92) + 24(104 + 125 + 81 + 113 + 54 + 92) + 22(28 + 81 + 113 + 18 + 54 + 92) + 38(81 + 113 + 54 + 92) + ... = 109,520,

and the analogous count of discordant pairs is D = 84,915. In this example, C > D suggests a tendency for low income to occur with low job satisfaction and high income with high job satisfaction.
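The cell-by-cell counting described above can be automated. The sketch below counts concordant and discordant pairs for the job satisfaction table, pairing each cell with every cell in a higher row.

```python
# Count concordant (C) and discordant (D) pairs for an ordinal I x J table
# (rows ordered low -> high on X, columns low -> high on Y).

def concordant_discordant(t):
    I, J = len(t), len(t[0])
    C = D = 0
    for i in range(I):
        for j in range(J):
            for h in range(i + 1, I):          # cells ranked higher on X
                for k in range(J):
                    if k > j:                  # also higher on Y: concordant
                        C += t[i][j] * t[h][k]
                    elif k < j:                # lower on Y: discordant
                        D += t[i][j] * t[h][k]
    return C, D

# Job satisfaction table (income rows, satisfaction columns).
table = [[20, 24, 80, 82],
         [22, 38, 104, 125],
         [13, 28, 81, 113],
         [7, 18, 54, 92]]
C, D = concordant_discordant(table)
print(C, D)   # 109520 84915
```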
Gamma
Given that a pair is untied on both variables, the probability of concordance is \Pi_c/(\Pi_c + \Pi_d) and the probability of discordance is \Pi_d/(\Pi_c + \Pi_d). The difference between these probabilities,

\gamma = \frac{\Pi_c - \Pi_d}{\Pi_c + \Pi_d},

is called gamma. Its range is -1 \le \gamma \le 1. Whereas the absolute value of the correlation is 1 only when the relationship between X and Y is perfectly linear, only monotonicity is required for |\gamma| = 1: \gamma = 1 if \Pi_d = 0, and \gamma = -1 if \Pi_c = 0. The perfect-association value |\gamma| = 1 can occur even when the relationship is not strictly monotone. If \gamma = 1, for instance, then for observations (X_a, Y_a) and (X_b, Y_b) on a pair of subjects a and b having X_a < X_b, it follows that Y_a \le Y_b but not necessarily that Y_a < Y_b. Independence implies \gamma = 0, but the converse is not true.
Yule's Coefficient
For 2 \times 2 tables, we define

Q = \frac{\pi_{11}\pi_{22} - \pi_{12}\pi_{21}}{\pi_{11}\pi_{22} + \pi_{12}\pi_{21}}.

This measure, which Yule (1900, 1912) introduced and called Q in honor of the Belgian statistician Quetelet, is now referred to as Yule's Q. The range of Q is -1 \le Q \le 1.
Since \theta = \pi_{11}\pi_{22}/(\pi_{12}\pi_{21}), dividing the numerator and denominator by \pi_{12}\pi_{21} gives

Q = \frac{\dfrac{\pi_{11}\pi_{22}}{\pi_{12}\pi_{21}} - 1}{\dfrac{\pi_{11}\pi_{22}}{\pi_{12}\pi_{21}} + 1} = \frac{\theta - 1}{\theta + 1}, with -1 \le Q \le 1.

For 2 \times 2 tables gamma equals Q, so gamma is a strictly monotone transformation of \theta from the [0, \infty) scale onto the [-1, +1] scale.
Gamma ignores pairs that are tied on either variable. An alternative standardizes C - D by the numbers of pairs untied on each variable:

Kendall's tau-b = \frac{C - D}{\sqrt{\left(\dfrac{n(n-1)}{2} - T_x\right)\left(\dfrac{n(n-1)}{2} - T_y\right)}},

where

T_x = \sum_i \frac{n_{i+}(n_{i+} - 1)}{2} and T_y = \sum_j \frac{n_{+j}(n_{+j} - 1)}{2}

are the numbers of pairs tied on X and on Y, respectively. This index of ordinal association is called Kendall's tau-b. Tau-b tends to be less sensitive than gamma to the choice of response categories.
A related measure is Somers' d,

d = \frac{C - D}{\dfrac{n(n-1)}{2} - T_x},

where d indicates the difference between the proportions of concordant and discordant pairs, out of those pairs untied on X.
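The three ordinal measures can be computed together for the job satisfaction table; the sketch below reuses the pair-counting idea from the concordance example.

```python
# Gamma, Kendall's tau-b, and Somers' d for an ordinal contingency table.
from math import sqrt

def ordinal_measures(t):
    I, J = len(t), len(t[0])
    C = D = 0
    for i in range(I):
        for j in range(J):
            for h in range(i + 1, I):
                C += t[i][j] * sum(t[h][k] for k in range(j + 1, J))
                D += t[i][j] * sum(t[h][k] for k in range(j))
    n = sum(map(sum, t))
    Tx = sum(sum(r) * (sum(r) - 1) // 2 for r in t)                # ties on X
    cols = [sum(t[i][j] for i in range(I)) for j in range(J)]
    Ty = sum(c * (c - 1) // 2 for c in cols)                       # ties on Y
    pairs = n * (n - 1) // 2
    gamma = (C - D) / (C + D)
    tau_b = (C - D) / sqrt((pairs - Tx) * (pairs - Ty))
    somers_d = (C - D) / (pairs - Tx)
    return gamma, tau_b, somers_d

table = [[20, 24, 80, 82], [22, 38, 104, 125],
         [13, 28, 81, 113], [7, 18, 54, 92]]
g, tb, d = ordinal_measures(table)
print(round(g, 3), round(tb, 3), round(d, 3))
```

As expected, tau-b is smaller in magnitude than gamma here, since it also counts the tied pairs in its denominator.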
Proportional Reduction
The most interpretable indices for nominal variables have the same structure as R-squared (the coefficient of determination) for interval variables. R-squared, and the more general intraclass correlation coefficient and correlation ratio, describe the proportional reduction in variance from the marginal distribution to the conditional distributions of the response.
Let V(Y) denote a measure of variation for the marginal distribution \{\pi_{+1}, ..., \pi_{+J}\} of Y, and let V(Y|i) denote this measure computed for the conditional distribution \{\pi_{1|i}, ..., \pi_{J|i}\} of Y at the i-th setting of an explanatory variable X. A proportional reduction in variation measure has the form

\frac{V(Y) - E[V(Y|X)]}{V(Y)},

where E[V(Y|X)] is the expectation of the conditional variation taken with respect to the distribution of X; when X is discrete, E[V(Y|X)] = \sum_i \pi_{i+} V(Y|i). If the measure equals 1, the association between X and Y is strong, and if it equals 0, the association between X and Y is null.
One choice of variation measure is

V(Y) = \sum_j \pi_{+j}(1 - \pi_{+j}) = 1 - \sum_j \pi_{+j}^2.

This is the probability that two independent observations from the marginal distribution of Y fall in different categories. The variation takes its minimum value of zero when \pi_{+j} = 1 for some j, and its maximum value of (J - 1)/J when \pi_{+j} = 1/J for all j. The corresponding conditional variation in row i is

V(Y|i) = \sum_j \pi_{j|i}(1 - \pi_{j|i}) = 1 - \sum_j \pi_{j|i}^2.
For an I \times J contingency table with joint probabilities \{\pi_{ij}\}, the average conditional variation is

E[V(Y|X)] = \sum_i \pi_{i+} V(Y|i) = \sum_i \pi_{i+}\left(1 - \sum_j \pi_{j|i}^2\right) = 1 - \sum_i \sum_j \frac{\pi_{ij}^2}{\pi_{i+}}.

The proportional reduction in variation is then

\tau = \frac{V(Y) - E[V(Y|X)]}{V(Y)} = \frac{\sum_i \sum_j \pi_{ij}^2/\pi_{i+} - \sum_j \pi_{+j}^2}{1 - \sum_j \pi_{+j}^2},

also called the concentration coefficient. A large value of \tau represents a strong association.
Another choice of variation measure is the entropy, V(Y) = -\sum_j \pi_{+j} \log \pi_{+j}, with conditional version V(Y|i) = -\sum_j \pi_{j|i} \log \pi_{j|i}.
Now,

E[V(Y|X)] = \sum_i \pi_{i+} V(Y|i) = -\sum_i \pi_{i+} \sum_j \pi_{j|i} \log \pi_{j|i} = -\sum_i \sum_j \pi_{ij} \log\frac{\pi_{ij}}{\pi_{i+}}.

The proportional reduction in variation is

U = \frac{V(Y) - E[V(Y|X)]}{V(Y)} = \frac{-\sum_j \pi_{+j}\log\pi_{+j} + \sum_i \sum_j \pi_{ij}\log(\pi_{ij}/\pi_{i+})}{-\sum_j \pi_{+j}\log\pi_{+j}} = \frac{\sum_i \sum_j \pi_{ij} \log\dfrac{\pi_{ij}}{\pi_{i+}\pi_{+j}}}{-\sum_j \pi_{+j}\log\pi_{+j}},

called the uncertainty coefficient. If U = 0, then X and Y are independent; U = 1 indicates that there is no conditional variation, in the sense that for each i, \pi_{j|i} = 1 for some j.
The variation measure used in \tau is called the Gini concentration, and the variation measure used in U is the entropy.
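The uncertainty coefficient can be sketched directly from its final form above; the convention 0 \log 0 = 0 is handled by skipping zero cells. Both test tables are invented illustrations.

```python
# Uncertainty coefficient U from a joint probability table.
from math import log

def uncertainty(joint):
    I, J = len(joint), len(joint[0])
    row = [sum(joint[i]) for i in range(I)]                        # pi_{i+}
    col = [sum(joint[i][j] for i in range(I)) for j in range(J)]   # pi_{+j}
    vy = -sum(p * log(p) for p in col if p > 0)                    # entropy of Y
    num = sum(joint[i][j] * log(joint[i][j] / (row[i] * col[j]))
              for i in range(I) for j in range(J) if joint[i][j] > 0)
    return num / vy

# Perfect dependence: each row concentrates on one column, so U = 1.
print(round(uncertainty([[0.5, 0.0], [0.0, 0.5]]), 6))
# Independence: every cell is the product of its marginals, so U = 0.
print(round(uncertainty([[0.25, 0.25], [0.25, 0.25]]), 6))
```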
A difficulty with these measures is in determining how large a value constitutes a strong association. When the response variable has several possible categorizations, these measures tend to take smaller values as the number of categories increases. For instance, the Gini variation is the probability that two independent observations occur in different categories. Often this probability approaches 1.0 for both the conditional and marginal distributions as the number of response categories grows larger, in which case \tau decreases toward 0.
Lambda Coefficient
Goodman and Kruskal (1954) proposed an alternative measure, lambda, for nominal variables. It uses the variation measures

V(Y) = 1 - \max_j \pi_{+j} and V(Y|i) = 1 - \max_j \pi_{j|i},

so that

\lambda = \frac{V(Y) - E[V(Y|X)]}{V(Y)} = \frac{(1 - \max_j \pi_{+j}) - \sum_i \pi_{i+}(1 - \max_j \pi_{j|i})}{1 - \max_j \pi_{+j}} = \frac{\sum_i \max_j \pi_{ij} - \max_j \pi_{+j}}{1 - \max_j \pi_{+j}}.

Lambda is the proportional reduction in the probability of an error in predicting Y obtained by using the row classification.
Comments
\lambda = 0 occurs when the row classification provides no help in predicting Y (independence implies \lambda = 0, though \lambda = 0 does not imply independence), while \lambda = 1 indicates complete dependency: each row's conditional distribution is concentrated in a single category.
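Lambda's final closed form above is a one-liner from the joint table. The two test tables are invented: one with perfect prediction, one where the rows carry no predictive information.

```python
# Goodman and Kruskal's lambda from a joint probability table, using
# lambda = (sum_i max_j pi_ij - max_j pi_{+j}) / (1 - max_j pi_{+j}).

def gk_lambda(joint):
    I, J = len(joint), len(joint[0])
    col = [sum(joint[i][j] for i in range(I)) for j in range(J)]   # pi_{+j}
    num = sum(max(row) for row in joint) - max(col)
    return num / (1 - max(col))

# Perfect prediction: knowing the row pins down the column.
print(round(gk_lambda([[0.5, 0.0], [0.0, 0.5]]), 6))
# No predictive help: both rows favor the same column.
print(round(gk_lambda([[0.3, 0.2], [0.3, 0.2]]), 6))
```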
Example 2.10
Describe the association in Table 2.11, based on a survey conducted in 1965 of a probability sample of high school seniors and their parents.
Table 2.11: Party Identification of Students and Parents

                                Student Party Identification
  Parent Party Identification   Democrat   Independent   Republican   Total
  Democrat                         604         245            67        916
  Independent                      130         235            76        441
  Republican                        63         180           252        495
  Total                            797         660           395       1852
The marginal distribution of student party identification is

p_{+1} = 797/1852, p_{+2} = 660/1852, p_{+3} = 395/1852,

so the marginal Gini variation is

V(Y) = p_{+1}(1 - p_{+1}) + p_{+2}(1 - p_{+2}) + p_{+3}(1 - p_{+3}) = 0.2451 + 0.2294 + 0.1678 = 0.6423.

The conditional variations within the parent categories are

V(Y | X = D) = \frac{604}{916}\left(1 - \frac{604}{916}\right) + \frac{245}{916}\left(1 - \frac{245}{916}\right) + \frac{67}{916}\left(1 - \frac{67}{916}\right) = 0.2246 + 0.1959 + 0.0678 = 0.4883,

V(Y | X = I) = \frac{130}{441}\left(1 - \frac{130}{441}\right) + \frac{235}{441}\left(1 - \frac{235}{441}\right) + \frac{76}{441}\left(1 - \frac{76}{441}\right) = 0.2079 + 0.2489 + 0.1426 = 0.5994,

V(Y | X = R) = \frac{63}{495}\left(1 - \frac{63}{495}\right) + \frac{180}{495}\left(1 - \frac{180}{495}\right) + \frac{252}{495}\left(1 - \frac{252}{495}\right) = 0.1111 + 0.2314 + 0.2499 = 0.5924.

Weighting by the row proportions p_{i+} gives

E[V(Y|X)] = \frac{916}{1852}(0.4883) + \frac{441}{1852}(0.5994) + \frac{495}{1852}(0.5924) = 0.5426,

so that

\tau = \frac{V(Y) - E[V(Y|X)]}{V(Y)} = \frac{0.6423 - 0.5426}{0.6423} = 0.155.

There is only a modest association between the party identifications of high school seniors and their parents.
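The arithmetic of Example 2.10 can be reproduced from the raw counts; the sketch below computes the concentration coefficient \tau directly from the table.

```python
# Proportional reduction in Gini variation (concentration coefficient tau)
# for the party identification table of Example 2.10.

def concentration_tau(t):
    I, J = len(t), len(t[0])
    n = sum(map(sum, t))
    col = [sum(t[i][j] for i in range(I)) for j in range(J)]
    vy = 1 - sum((c / n) ** 2 for c in col)            # marginal Gini variation
    # E[V(Y|X)] = 1 - sum_ij n_ij^2 / (n * n_{i+})
    ev = 1 - sum(t[i][j] ** 2 / (n * sum(t[i]))
                 for i in range(I) for j in range(J))
    return (vy - ev) / vy

table = [[604, 245, 67], [130, 235, 76], [63, 180, 252]]
print(round(concentration_tau(table), 3))   # 0.155
```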
Example
A sample of 500 respondents was selected in a large metropolitan area to study various questions concerning consumer behavior. The resulting contingency table is given below:

            Enjoys Shopping for Clothing
  Sex          Yes     No
  Male         136     104
  Female       224      36

a) Find the sample joint distribution.
b) Find the conditional distributions of enjoyment given sex, and check whether the two variables are independent.
c) What is the probability that a female was not enjoying shopping for clothing?
d) Compute the relative risk and the odds ratio.
Solution
i) The sample joint distribution is

p_{11} = 136/500 = 0.272, p_{12} = 104/500 = 0.208, p_{21} = 224/500 = 0.448, and p_{22} = 36/500 = 0.072.

ii) The conditional distributions of enjoyment given sex are

p_{1|1} = n_{11}/n_{1+} = 136/240 = 0.567, p_{2|1} = n_{12}/n_{1+} = 104/240 = 0.433,
p_{1|2} = n_{21}/n_{2+} = 224/260 = 0.862, p_{2|2} = n_{22}/n_{2+} = 36/260 = 0.138.

Here n_{1+} = 240, n_{+1} = 360, and n = 500, so

p_{1+} = 240/500 = 0.48 and p_{+1} = 360/500 = 0.72,

and p_{1+} p_{+1} = 0.48 \times 0.72 = 0.346 \ne 0.272 = p_{11}. Hence the independence condition p_{11} = p_{1+} p_{+1} fails, so we may conclude that enjoyment of shopping for clothing depends on sex.

iii) The probability that a female was not enjoying shopping for clothing is

p_{2|2} = 36/260 = 0.138.

iv) The proportion enjoying shopping for clothing is 136/240 = 0.567 for males and 224/260 = 0.862 for females. The sample relative risk is

0.567/0.862 = 0.658,

so the proportion enjoying shopping for clothing was 0.658 times as high for males as for females. The sample odds ratio is

\frac{136 \times 36}{224 \times 104} = 0.210.
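All four parts of the solution follow from the four cell counts, as this short sketch shows.

```python
# Verifying the shopping example: joint and conditional proportions,
# relative risk, and odds ratio for the 2x2 sex-by-enjoyment table.

n11, n12, n21, n22 = 136, 104, 224, 36      # male yes/no, female yes/no
n = n11 + n12 + n21 + n22                   # total sample size, 500

p11 = n11 / n                               # joint proportion, male & yes
p_yes_male = n11 / (n11 + n12)              # p_{1|1} = 136/240
p_yes_female = n21 / (n21 + n22)            # p_{1|2} = 224/260
p_no_female = n22 / (n21 + n22)             # p_{2|2} = 36/260

relative_risk = p_yes_male / p_yes_female   # comparison on "yes"
odds_ratio = (n11 * n22) / (n12 * n21)      # cross-product ratio

print(round(p11, 3), round(p_no_female, 3),
      round(relative_risk, 3), round(odds_ratio, 3))
```

Both the relative risk (0.658) and the odds ratio (0.210) fall below 1, consistent with the conclusion that enjoyment of shopping for clothing is more likely for females.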