
Bivariate Data Analysis I: Crosstabulation and Measures of Association


Types of Bivariate Relationships and
Associated Statistics
• Nominal/Ordinal and Nominal/Ordinal (including dichotomous)
○ Crosstabulation (Lambda, Chi-Square, Gamma, etc.)
• Interval and Dichotomous
○ Difference of means test
• Interval and Nominal/Ordinal
○ Analysis of Variance
• Interval and Interval
○ Regression and correlation
Assessing Relationships between
Variables
• 1. Calculate the appropriate statistic to measure the magnitude of the
relationship in the sample
• 2. Calculate additional statistics to determine whether the relationship holds
for the population of interest (statistical significance)
• Substantive significance vs. statistical significance
What is a Crosstabulation?
• Crosstabulations are appropriate for examining relationships between
variables that are nominal, ordinal, or dichotomous.
What is a Crosstabulation?
• Example: We would like to know if presidential vote choice in 2000 was
related to race.
○ Vote choice = Gore or Bush
○ Race = White, Hispanic, Black
Are Race and Vote Choice Related?
Why?

         Black   Hispanic   White   TOTAL
Gore       106         23     427     556
Bush         8         15     484     507
TOTAL      114         38     911    1063


Are Race and Vote Choice Related?
Why?

         Black        Hispanic     White         TOTAL
Gore     106 (93%)    23 (60.5%)   427 (46.9%)   556 (52.3%)
Bush       8 (7%)     15 (39.5%)   484 (53.1%)   507 (47.7%)
TOTAL    114 (100%)   38 (100%)    911 (100%)    1063 (100%)

• Independent variable (columns)
• Dependent variable (rows)
• Marginal distribution vs. conditional distribution
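The percentages in this table can be reproduced with a short sketch. It assumes the counts are stored in a plain dictionary (the names `counts`, `conditional`, and `marginal` are illustrative, not from the slides). The conditional distribution is the set of vote shares within each race; the marginal distribution is the set of vote shares in the whole sample.

```python
# Counts from the race-by-vote table: columns = race (independent variable),
# rows = vote choice (dependent variable).
counts = {
    "Gore": {"Black": 106, "Hispanic": 23, "White": 427},
    "Bush": {"Black": 8, "Hispanic": 15, "White": 484},
}
races = ["Black", "Hispanic", "White"]

# Column totals: number of respondents in each category of race.
col_totals = {r: sum(counts[v][r] for v in counts) for r in races}
n = sum(col_totals.values())

# Conditional distribution: vote shares within each race (column percentages).
conditional = {
    v: {r: 100 * counts[v][r] / col_totals[r] for r in races} for v in counts
}

# Marginal distribution: vote shares in the whole sample (TOTAL column).
marginal = {v: 100 * sum(counts[v].values()) / n for v in counts}

for v in counts:
    cells = "  ".join(f"{r}: {conditional[v][r]:5.1f}%" for r in races)
    print(f"{v}  {cells}  TOTAL: {marginal[v]:5.1f}%")
```

If the conditional distributions differ noticeably from the marginal distribution, as they do here, the two variables appear to be related in the sample.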
Measures of Association for
Crosstabulations
• Purpose – to determine if nominal/ordinal variables are related in a
crosstabulation
• At least one nominal variable
○ Lambda
○ Chi-Square
○ Cramér’s V
• Two ordinal variables
○ Tau
○ Gamma
Lambda

(Obs = observed count; columns 1 and 2 hold the predictions under Rule 1 and Rule 2)

            NE            MW            SO            WE
         Obs  1  2     Obs  1  2     Obs  1  2     Obs  1  2    TOTAL
REPUB      4            10             6            10             30
IND       12             8            16             4             40
DEM       14             2             8             6             30
TOTAL     30            20            30            20            100
Lambda – Rule 1
(prediction based solely on knowledge of marginal distribution
of dependent variable – partisanship)

(Obs = observed count; 1 = prediction under Rule 1)

            NE          MW          SO          WE
         Obs   1     Obs   1     Obs   1     Obs   1    TOTAL
REPUB      4   0      10   0       6   0      10   0       30
IND       12  30       8  20      16  30       4  20       40
DEM       14   0       2   0       8   0       6   0       30
TOTAL     30          20          30          20          100
Lambda – Rule 2
(prediction based on knowledge provided by independent
variable)

(Obs = observed count; 1 = prediction under Rule 1; 2 = prediction under Rule 2)

            NE              MW              SO              WE
         Obs   1   2     Obs   1   2     Obs   1   2     Obs   1   2    TOTAL
REPUB      4   0   0      10   0  20       6   0   0      10   0  20       30
IND       12  30   0       8  20   0      16  30  30       4  20   0       40
DEM       14   0  30       2   0   0       8   0   0       6   0   0       30
TOTAL     30              20              30              20              100
Lambda – Calculation of Errors
• Errors w/Rule 1: 18 + 12 + 14 + 16 = 60
• Errors w/Rule 2: 16 + 10 + 14 + 10 = 50
• Lambda = (Errors R1 – Errors R2) / Errors R1
• Lambda = (60 – 50)/60 = 10/60 = .17

(Obs = observed count; 1 = prediction under Rule 1; 2 = prediction under Rule 2)

            NE              MW              SO              WE
         Obs   1   2     Obs   1   2     Obs   1   2     Obs   1   2    TOTAL
REPUB      4   0   0      10   0  20       6   0   0      10   0  20       30
IND       12  30   0       8  20   0      16  30  30       4  20   0       40
DEM       14   0  30       2   0   0       8   0   0       6   0   0       30
TOTAL     30              20              30              20              100
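The two prediction rules can be sketched in a few lines of Python. This is a minimal illustration, assuming the table is stored as a dictionary of rows (the function name `lambda_pre` is hypothetical). Rule 1 always guesses the overall modal category of the dependent variable; Rule 2 guesses the modal category within each column.

```python
# Region-by-partisanship counts (rows = partisanship, columns = NE, MW, SO, WE).
table = {
    "REPUB": [4, 10, 6, 10],
    "IND": [12, 8, 16, 4],
    "DEM": [14, 2, 8, 6],
}

def lambda_pre(table):
    """Return (errors under Rule 1, errors under Rule 2, Lambda)."""
    rows = list(table.values())
    n = sum(sum(r) for r in rows)
    # Rule 1: always guess the modal row category; all other cases are errors.
    errors_rule1 = n - max(sum(r) for r in rows)
    # Rule 2: within each column, guess that column's modal row category.
    columns = list(zip(*rows))
    errors_rule2 = sum(sum(col) - max(col) for col in columns)
    # Note: Lambda is undefined (division by zero) if Rule 1 makes no errors.
    return errors_rule1, errors_rule2, (errors_rule1 - errors_rule2) / errors_rule1

e1, e2, lam = lambda_pre(table)
print(e1, e2, round(lam, 2))  # 60 50 0.17
```

Knowing a respondent's region reduces prediction errors from 60 to 50, a proportional reduction of about 17 percent.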
Lambda
• PRE (proportional reduction in error) measure
• Ranges from 0 to 1
• Potential problems with Lambda
○ Underestimates the relationship when one or both variables are highly skewed
○ Always 0 when the modal category of Y is the same across all categories of X
Chi-Square (χ²)
• Also appropriate for any crosstabulation with at least one nominal variable
(and another nominal/ordinal variable)
• Based on the difference between the empirically observed crosstab and what
we would expect to observe if the two variables were statistically independent
An Example of Perfect Statistical
Independence

             Black         Hispanic      White         TOTAL
Democrat     120 (60%)      60 (60%)     540 (60%)      720 (60%)
Republican    80 (40%)      40 (40%)     360 (40%)      480 (40%)
TOTAL        200 (100%)    100 (100%)    900 (100%)    1200 (100%)
Chi-Square (χ²)

(O = observed frequency; E = expected frequency)

            NE         MW         SO         WE
          O   E      O   E      O   E      O   E    TOTAL
REPUB     4         10          6         10           30
IND      12          8         16          4           40
DEM      14          2          8          6           30
TOTAL    30         20         30         20          100
Calculating Expected Frequencies
• To calculate the expected cell frequency for NE Republicans:
○ E/30 = 30/100, therefore E = (30*30)/100 = 9
○ In general, E = (row total * column total) / N

(O = observed frequency; E = expected frequency)

            NE         MW         SO         WE
          O   E      O   E      O   E      O   E    TOTAL
REPUB     4         10          6         10           30
IND      12          8         16          4           40
DEM      14          2          8          6           30
TOTAL    30         20         30         20          100
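The same proportion can be applied to every cell. The sketch below assumes the observed counts are stored as a list of rows (the function name `expected_frequencies` is illustrative) and computes each expected frequency as row total times column total divided by N.

```python
# Observed region-by-partisanship counts (columns = NE, MW, SO, WE).
observed = [
    [4, 10, 6, 10],   # REPUB
    [12, 8, 16, 4],   # IND
    [14, 2, 8, 6],    # DEM
]

def expected_frequencies(observed):
    """Expected cell counts if rows and columns were statistically independent."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    # E = (row total * column total) / N for every cell.
    return [[rt * ct / n for ct in col_totals] for rt in row_totals]

expected = expected_frequencies(observed)
for label, row in zip(["REPUB", "IND", "DEM"], expected):
    print(label, row)
```

For NE Republicans this gives (30 * 30) / 100 = 9, matching the hand calculation.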
Calculating the Chi-Square Statistic
• The chi-square statistic is calculated as:
○ $\chi^2 = \sum_{i,k} (O_{ik} - E_{ik})^2 / E_{ik}$, where $O_{ik}$ and $E_{ik}$ are the observed and expected frequencies in row i, column k
(25/9)+(16/6)+(9/9)+(16/6)+(0)+(0)+(16/12)+(16/8)+(25/9)+(16/6)+(1/9)+(0) = 18

(O = observed frequency; E = expected frequency)

            NE         MW         SO         WE
          O   E      O   E      O   E      O   E    TOTAL
REPUB     4   9     10   6      6   9     10   6       30
IND      12  12      8   8     16  12      4   8       40
DEM      14   9      2   6      8   9      6   6       30
TOTAL    30         20         30         20          100
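The cell-by-cell sum can be checked with a small sketch. It assumes the observed table is a list of rows and recomputes the expected frequencies internally (the function name `chi_square` is illustrative).

```python
def chi_square(observed):
    """Sum of (O - E)^2 / E over all cells of a crosstabulation."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for k, o in enumerate(row):
            e = row_totals[i] * col_totals[k] / n  # expected frequency
            stat += (o - e) ** 2 / e
    return stat

# Region-by-partisanship table (columns = NE, MW, SO, WE).
region_party = [
    [4, 10, 6, 10],   # REPUB
    [12, 8, 16, 4],   # IND
    [14, 2, 8, 6],    # DEM
]
chi2 = chi_square(region_party)
print(round(chi2, 2))

# A table with identical column percentages (perfect independence).
independent = [
    [120, 60, 540],   # Democrat
    [80, 40, 360],    # Republican
]
chi2_independent = chi_square(independent)
print(chi2_independent)
```

The first table yields a statistic of 18, while the perfectly independent table yields 0 because every observed count equals its expected count.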
Interpreting the Chi-Square Statistic
• The Chi-Square statistic ranges from 0 to infinity
• 0 = perfect statistical independence
• Even though two variables may be statistically independent in the population,
in a sample the Chi-Square statistic may be > 0
• Therefore it is necessary to determine statistical significance for a
Chi-Square statistic (given a certain level of confidence)
Cramér’s V
• Problem with Chi-Square: not comparable across different sample sizes
(and their associated crosstabs)
• Cramér’s V is a standardization of the Chi-Square statistic
Calculating Cramér’s V
• $V = \sqrt{\chi^2 / (N \times \min(R - 1, C - 1))}$
• Where N = sample size, R = # rows, and C = # columns
○ V ranges from 0 to 1
• Example (region and partisanship)
○ $V = \sqrt{18 / (100 \times (3 - 1))} = \sqrt{.09} = .30$
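As a quick check of the arithmetic, the standardization can be written as a one-line function. This is a sketch only; the function name `cramers_v` and its argument names are illustrative.

```python
import math

def cramers_v(chi2, n, n_rows, n_cols):
    """Cramér's V: chi-square rescaled by sample size and table dimensions."""
    return math.sqrt(chi2 / (n * min(n_rows - 1, n_cols - 1)))

# Region (4 columns) by partisanship (3 rows), N = 100, chi-square = 18.
v = cramers_v(18, 100, 3, 4)
print(round(v, 2))  # 0.3
```

Because the divisor uses the smaller of (rows - 1) and (columns - 1), the result stays between 0 and 1 regardless of table shape.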
Relationships between Ordinal
Variables
• There are several measures of association appropriate for relationships
between ordinal variables
○ Gamma, Tau-b, Tau-c, Somers’ d
• All are based on identifying concordant, discordant, and tied pairs of
observations
Concordant Pairs:
Ideology and Voting
• Ideology - conserv (1), moderate (2), liberal (3)
• Voting - never (1), sometimes (2), often (3)
• Consider two hypothetical individuals in the sample with scores
○ Individual A: Ideology=1, Voting=1
○ Individual B: Ideology=2, Voting=2
○ Pair A&B is considered a concordant pair because B’s ideology score is
greater than A’s score, and B’s voting score is greater than A’s score
Concordant Pairs (cont’d)
• All of the following are concordant pairs
○ A(1,1) B(2,2)
○ A(1,1) B(2,3)
○ A(1,1) B(3,2)
○ A(1,2) B(2,3)
○ A(2,2) B(3,3)
• Concordant pairs are consistent with a positive relationship between the IV
and the DV (ideology and voting)
Discordant Pairs
• All of the following are discordant pairs
○ A(1,2) B(2,1)
○ A(1,3) B(2,2)
○ A(2,2) B(3,1)
○ A(1,2) B(3,1)
○ A(3,1) B(1,2)
• Discordant pairs are consistent with a negative relationship between the IV
and the DV (ideology and voting)
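The pair definitions can be expressed as a small function. This sketch assumes each individual is a tuple of (ideology, voting) scores, as in the lists above; the function name `classify_pair` is illustrative. A pair that shares a score on either variable is tied.

```python
def classify_pair(a, b):
    """Classify two (ideology, voting) observations as a pair."""
    d_ideology = b[0] - a[0]
    d_voting = b[1] - a[1]
    if d_ideology == 0 or d_voting == 0:
        return "tied"          # same score on at least one variable
    if d_ideology * d_voting > 0:
        return "concordant"    # both scores move in the same direction
    return "discordant"        # scores move in opposite directions

print(classify_pair((1, 1), (2, 2)))  # concordant
print(classify_pair((3, 1), (1, 2)))  # discordant
print(classify_pair((1, 2), (1, 3)))  # tied
```

The order of A and B does not matter: swapping them flips the sign of both differences, so the product keeps its sign.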
Identifying Concordant Pairs
• Concordant Pairs for Never - Conserv (1,1)
• #Concordant = 80*70 + 80*10 + 80*20 + 80*80 = 14,400

                Conservative (1)   Moderate (2)   Liberal (3)
Never (1)                     80             10            10
Sometimes (2)                 20             70            10
Often (3)                      0             20            80
Identifying Concordant Pairs
• Concordant Pairs for Never - Moderate (2,1)
• #Concordant = 10*10 + 10*80 = 900

                Conservative (1)   Moderate (2)   Liberal (3)
Never (1)                     80             10            10
Sometimes (2)                 20             70            10
Often (3)                      0             20            80
Identifying Discordant Pairs
• Discordant Pairs for Often - Conserv (1,3)
• #Discordant = 0*10 + 0*10 + 0*70 + 0*10 = 0

                Conservative (1)   Moderate (2)   Liberal (3)
Never (1)                     80             10            10
Sometimes (2)                 20             70            10
Often (3)                      0             20            80
Identifying Discordant Pairs
• Discordant Pairs for Often - Moderate (2,3)
• #Discordant = 20*10 + 20*10 = 400

                Conservative (1)   Moderate (2)   Liberal (3)
Never (1)                     80             10            10
Sometimes (2)                 20             70            10
Often (3)                      0             20            80
Gamma
• Gamma is calculated by identifying all possible pairs of individuals in the
sample and determining if they are concordant or discordant
• Gamma = (#C - #D) / (#C + #D)
Interpreting Gamma
• Gamma = (22,900 - 1,500) / (22,900 + 1,500) = 21,400/24,400 = .88
• Gamma ranges from -1 to +1
• Gamma does not account for tied pairs
• Tau (b and c) and Somers’ d account for tied pairs in different ways
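The full pair count for the ideology-by-voting table can be sketched as follows. The code assumes the table is a list of rows ordered from low to high on both variables (the function name `count_pairs` is illustrative). Each cell is paired only with cells in lower rows, so every pair is counted once.

```python
# Ideology (columns: conservative, moderate, liberal) by voting (rows).
table = [
    [80, 10, 10],  # Never
    [20, 70, 10],  # Sometimes
    [0, 20, 80],   # Often
]

def count_pairs(table):
    """Return (#concordant, #discordant) pairs for an ordered crosstab."""
    n_rows, n_cols = len(table), len(table[0])
    concordant = discordant = 0
    for i in range(n_rows):
        for j in range(n_cols):
            for k in range(i + 1, n_rows):        # cells in lower rows only
                for m in range(n_cols):
                    if m > j:                     # below and to the right
                        concordant += table[i][j] * table[k][m]
                    elif m < j:                   # below and to the left
                        discordant += table[i][j] * table[k][m]
                    # m == j: tied on the column variable, ignored by Gamma
    return concordant, discordant

c, d = count_pairs(table)
gamma = (c - d) / (c + d)
print(c, d, round(gamma, 2))  # 22900 1500 0.88
```

Pairs in the same row or the same column are tied and never enter the calculation, which is why Gamma tends to be larger than measures that penalize ties.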
Square tables: Tau-b

Non-square tables: Tau-c
Example
• NES 2004 – What explains variation in one’s political ideology (25)?
○ Income?
○ Education?
○ Religion?
○ Race?
Bivariate Relationships and Hypothesis Testing
(Significance Testing)
• 1. Determine the null and alternative hypotheses
○ Null: There is no relationship between X and Y in the population (X and Y
are statistically independent and test statistic = 0).
○ Alternative: There IS a relationship between X and Y in the population (test
statistic does not equal 0).
Bivariate Relationships and
Hypothesis Testing
• 2. Determine the appropriate test statistic (based on measurement levels of
X and Y)
• 3. Identify the type of sampling distribution for the test statistic, and what
it would look like if the null hypothesis were true.
Bivariate Relationships and
Hypothesis Testing
• 4. Calculate the test statistic from the sample data and determine the
probability of observing a test statistic this large (in absolute terms) if the
null hypothesis is true.
• P-value (significance level) – probability of observing a test statistic at
least as large as our observed test statistic, if in fact the null hypothesis is
true
Bivariate Relationships and
Hypothesis Testing
• 5. Choose an “alpha level” – a decision rule to guide us in determining which
values of the p-value lead us to reject/not reject the null hypothesis
○ When the p-value is extremely small, we reject the null hypothesis (why?).
The relationship is deemed “statistically significant.”
○ When the p-value is not small, we do not reject the null hypothesis (why?).
The relationship is deemed “statistically insignificant.”
• Most common alpha level: .05
Bottom Line
• Assuming we will always use an alpha level of .05:
○ Reject the null hypothesis if p-value < .05
○ Do not reject the null hypothesis if p-value > .05
An Example
• Dependent variable: Vote Choice in 2000
○ (Gore, Bush, Nader)
• Independent variable: Ideology
○ (liberal, moderate, conservative)
An Example
• 1. Determine the null and alternative hypotheses.
An Example
• Null Hypothesis: There is no relationship between ideology and vote choice
in 2000.
• Alternative (Research) Hypothesis: There is a relationship between ideology
and vote choice (liberals were more likely to vote for Gore, while
conservatives were more likely to vote for Bush).
An Example
• 2. Determine the appropriate test statistic (based on measurement levels of
X and Y)
• 3. Identify the type of sampling distribution for the test statistic, and what
it would look like if the null hypothesis were true.
Sampling Distributions for the Chi-Squared Statistic
(under assumption of perfect independence)
df = (rows-1)(columns-1)
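To illustrate how a p-value follows from the chi-squared sampling distribution, the sketch below uses the region-by-partisanship result from earlier (chi-square = 18, 3 rows, 4 columns). It relies on a closed-form expression for the upper-tail probability that holds only when df is even; that restriction is an assumption of this sketch, and the function name `chi2_p_value_even_df` is hypothetical. For arbitrary df, a library routine such as `scipy.stats.chi2.sf` would be the usual choice.

```python
import math

def chi2_p_value_even_df(chi2, df):
    """Upper-tail probability P(X >= chi2) for a chi-squared variable.

    Uses the identity P = exp(-x/2) * sum_{k < df/2} (x/2)^k / k!,
    which is valid only for positive even degrees of freedom.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this sketch only handles positive even df")
    half = chi2 / 2
    term = 1.0    # (x/2)^k / k! for k = 0
    total = 1.0
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total

df = (3 - 1) * (4 - 1)            # (rows - 1)(columns - 1) = 6
p_value = chi2_p_value_even_df(18, df)
alpha = 0.05
print(df, round(p_value, 4), p_value < alpha)
```

With a p-value of roughly .006, well below an alpha level of .05, the null hypothesis of independence between region and partisanship would be rejected for that table.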
In-Class Exercise
• For some years now, political commentators have cited the importance of a
“gender gap” in explaining election outcomes. What is the source of the
gender gap?
• Develop a simple theory and corresponding hypothesis (where gender is the
independent variable) which seeks to explain the source of the gender gap.
• Specifically, determine:
○ Theory
○ Null and research hypothesis
○ Test statistic for a cross-tabulation to test your hypothesis
