
Section 4.2
Relationships between Categorical Variables
Let’s Talk About Sex, Baby…
So far, we have concentrated on relationships
between quantitative variables. Now we will
describe relationships between two or more
categorical variables.
Some variables are categorical by nature – sex, race,
occupation – others are created by grouping
quantitative variables into classes.
To analyze categorical data, we use the counts or the
percents of individuals that fall into various
categories.
Two-way Tables

The distributions of Sex alone and Age alone are
marginal distributions. They appear in the bottom
and the right margins of the two-way table. Each
marginal distribution from a two-way table is a
distribution for a single categorical variable.
Marginal Distributions
Percents are often more informative than counts, because
raw counts are hard to compare when one group contains
many more observations than another.
Example:
The percent of college students who are 15 to 17 years old is:
(total age 15 to 17) / (table total) = 150 / 16,639 = 0.00901 = 0.901%
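
The arithmetic in this example can be checked with a short sketch. The helper name `percent_of_total` is hypothetical; the two counts are the ones quoted in the example.

```python
def percent_of_total(count, total):
    """Express a marginal count as a percent of the table total."""
    return 100 * count / total

# Counts quoted in the example: 150 students aged 15 to 17,
# out of a table total of 16,639.
share = percent_of_total(150, 16_639)
print(f"{share:.3f}%")  # roughly 0.9%
```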
Another Look at the Marginal Distribution of Age

0.901% of college students are in the 15 to 17 years age group.
Conditional Distributions
When percents of two groups within one
marginal distribution are compared, the
comparison is between two conditional
distributions.
Example: Conditional distributions of sex, given age:

                   Female   Male
15 to 17 yrs       59.3%    40.7%
18 to 24 yrs       54.7%    45.3%
25 to 34 yrs       54.5%    45.5%
35 yrs or older    63.1%    36.9%

For the 15 to 17 row, 89/150 = 59.3%.
Note: 63.1% + 36.9% = 100%
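
As a rough sketch of how one row of a conditional distribution is computed, the block below works through the 15 to 17 age group. It assumes that the 89 and 150 shown with the table are the female count and the row total for that group; the male count is derived as 150 - 89, and the function name `conditional_distribution` is hypothetical.

```python
def conditional_distribution(row_counts):
    """Turn one row of counts into percents of that row's total."""
    total = sum(row_counts.values())
    return {label: 100 * count / total for label, count in row_counts.items()}

# Assumption: 89 of the 150 students aged 15 to 17 are female,
# so the male count is the remainder of the row total.
row_15_to_17 = {"Female": 89, "Male": 150 - 89}
dist = conditional_distribution(row_15_to_17)

for label, pct in dist.items():
    print(f"{label}: {pct:.1f}%")
```

Because each row is divided by its own total, the percents in every row add to 100%, which is the check the note makes for the last row.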
The Other Way Around

Example: Conditional distributions of age, given sex:

                   Female   Male
15 to 17 yrs       1.0%     0.8%
18 to 24 yrs       60.8%    64.2%
25 to 34 yrs       20.4%    21.7%
35 yrs or older    17.8%    13.3%

For the female 15 to 17 entry, 89/9,321 = 1.0%.
Note: 1.0% + 60.8% + 20.4% + 17.8% = 100%
Well-Chosen Percents Can Help
No single graph (like a
scatterplot) shows the
form of the relationship
between categorical
variables.
No single numerical measure (like
correlation) summarizes the strength of the
association.

If there IS a clear explanatory/response relationship:


Compare the conditional distributions of the response
variable for the separate values of the explanatory
variable.
You try it
Number of deaths in the US from selected causes in 2003

                 15 - 24 yrs   25 - 44 yrs   45 - 64 yrs
Accidents        14,966        27,844        23,669
AIDS             171           6,879         5,917
Cancer           1,628         19,041        144,936
Heart disease    1,083         16,283        101,713
Homicide         5,148         7,367         2,756
Suicide          3,921         11,251        10,057
Total deaths     33,022        128,924       437,058

How many people are described in the two-way table?
599,004

How many of these people died of heart disease?
119,079

Why don’t the entries in each column add to the “Total deaths” count?
There are omitted causes.

Should you use counts or percents to compare the age groups?
Percents: the age groups have different numbers of individuals.

What is the conditional distribution of each cause, given age? (NOTE: You may want to use your LISTS to do this work for you more quickly.)

                 15 - 24 yrs   25 - 44 yrs   45 - 64 yrs
Accidents        45.32%        21.60%        5.42%
AIDS             0.52%         5.34%         1.35%
Cancer           4.93%         14.77%        33.16%
Heart disease    3.28%         12.63%        23.27%
Homicide         15.59%        5.71%         0.63%
Suicide          11.87%        8.73%         2.30%

Now, explain how the leading causes of death change as people get older.
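
For readers working outside a calculator's lists, the same work can be sketched in Python. The counts are copied from the deaths table; the variable names are illustrative only.

```python
AGE_GROUPS = ["15-24", "25-44", "45-64"]

# Counts from the table of US deaths from selected causes in 2003.
deaths = {
    "Accidents":     [14_966, 27_844, 23_669],
    "AIDS":          [171, 6_879, 5_917],
    "Cancer":        [1_628, 19_041, 144_936],
    "Heart disease": [1_083, 16_283, 101_713],
    "Homicide":      [5_148, 7_367, 2_756],
    "Suicide":       [3_921, 11_251, 10_057],
}
# The "Total deaths" row includes causes that are not listed above.
total_deaths = [33_022, 128_924, 437_058]

people_described = sum(total_deaths)
heart_disease_deaths = sum(deaths["Heart disease"])

# Conditional distribution of cause, given age:
# each count is divided by its own column total.
cause_given_age = {
    cause: [round(100 * c / t, 2) for c, t in zip(counts, total_deaths)]
    for cause, counts in deaths.items()
}

# Leading listed cause in each age group.
leading = [
    max(deaths, key=lambda cause: deaths[cause][i])
    for i in range(len(AGE_GROUPS))
]

print(people_described, heart_disease_deaths)
for cause, pcts in cause_given_age.items():
    print(f"{cause:<14}", pcts)
print(dict(zip(AGE_GROUPS, leading)))
```

Dividing by the "Total deaths" row rather than by the sum of the listed causes is what makes each column's percents add to less than 100%.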
Do helicopters save accident victims’ lives?

                   Helicopter   Road
Victim died        64           260
Victim survived    136          840
Total              200          1100

So, 32% (64/200) of helicopter patients died, compared with only 24% (260/1100) of the others!
Better not call for the helicopter!
Except… when the accidents are broken out differently:

            Serious Accidents      Less Serious Accidents
            Helicopter   Road      Helicopter   Road
Died        48           60        16           200
Survived    52           40        84           800
Total       100          100       100          1000

Among victims of serious accidents, the helicopter saves 52% (52/100) of
lives, compared with 40% (40/100) for road transport. And among victims
of less serious accidents, the helicopter saves 84% (84/100) as opposed to
80% (800/1000) for road transport.
Simpson’s Paradox
The explanation for this striking
misunderstanding is that the helicopter is sent
mostly to serious accidents, so the victims
transported by helicopter are more often
seriously injured. They are more likely to die
with or without helicopter evacuation.
An association or comparison that holds for all
of several groups can reverse direction when the
data are combined to form a single group. This
reversal is called Simpson’s Paradox. It is an
example of the effect of lurking variables on an
observed association.
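
The reversal can be reproduced from the helicopter counts with a minimal sketch. The (died, total) pairs come from the two accident tables; the dictionary layout and the helper name `death_rate` are illustrative choices.

```python
# (died, total) for each transport mode, split by accident severity.
counts = {
    "serious":      {"helicopter": (48, 100), "road": (60, 100)},
    "less serious": {"helicopter": (16, 100), "road": (200, 1000)},
}

def death_rate(died, total):
    """Percent of victims who died."""
    return 100 * died / total

# Death rates within each severity group.
by_group = {
    severity: {mode: death_rate(*pair) for mode, pair in modes.items()}
    for severity, modes in counts.items()
}

# Death rates after combining the two severity groups.
combined = {}
for mode in ("helicopter", "road"):
    died = sum(counts[severity][mode][0] for severity in counts)
    total = sum(counts[severity][mode][1] for severity in counts)
    combined[mode] = death_rate(died, total)

for severity, rates in by_group.items():
    print(severity, rates)
print("combined", combined)
```

Within each severity group the helicopter has the lower death rate, yet the combined rate is higher for the helicopter, because half of its 200 patients come from serious accidents while only 100 of the 1100 road patients do.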
