Chi-Squared Tests
Analyzing association between 2 Categorical Variables
In this section we will study the hypotheses used to test whether or not an association exists between 2
categorical variables.
EX 1: Researchers wanted to test the theory that women who went to work shortly after giving birth were
more likely to experience postpartum depression compared to those who stayed home.
A random sample of women giving birth at a Dallas hospital were queried six months after giving birth to
their first child. The researchers recorded whether or not the woman worked outside the home and
whether or not she experienced postpartum depression.
What is the explanatory variable?
Work Status
Descriptive Statistics:

Contingency Table: Work Status by Mental State
(each cell shows Count, Row %, Expected)

              Depressed        Not Depressed      Total
At Home       17 (n11)         50 (n12)           67 (R1)
              25.37%           74.63%
              23.7635 (E11)    43.2365 (E12)
Working       55 (n21)         81 (n22)           136 (R2)
              40.44%           59.56%
              48.2365 (E21)    87.7635 (E22)
Total         72 (C1)          131 (C2)           203 (n)
The rows are the groups: one row for each group defined by the explanatory variable.
The columns are divided up by response variable value: one column for each response variable value.
The first number in a cell (box) is the number of subjects in that row group taking that column's response value. Ex 1: In the first cell the first number is 17. This tells us that 17 of the 67 stay-at-home moms in the sample were depressed.
The second number in a cell is the % of subjects in the row group taking that response value. Ex 1: In the first cell the 2nd number is 25.37, which tells us that 25.37% of the stay-at-home moms in the sample were depressed. This is a conditional percent.
The 3rd number in the cell gives the count we'd expect to see if H0 were true and the % of depressed moms were the same in both the working and stay-at-home groups.
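As a quick check, the expected counts in the table can be reproduced from the row and column totals. A minimal Python sketch, using the totals from the table above:

```python
# Expected count under H0 for each cell: (row total * column total) / overall n
row_totals = [67, 136]   # At Home, Working
col_totals = [72, 131]   # Depressed, Not Depressed
n = 203

expected = [[r * c / n for c in col_totals] for r in row_totals]
# expected[0][0] is E11 for the At Home / Depressed cell, about 23.7635
```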
Hypotheses:
The null hypothesis is always that there is no association (the explanatory and response variables are independent).
This is not to say the explanatory causes the response but only that they are associated. For
example, we can predict that the likelihood of a shark attack in Florida is greater when more
ice cream cones are sold. The reason for this being that more ice cream cones are sold on
warm days and on warm days, more people swim making them shark bait.
Alternative statement of the null hypothesis: The explanatory and response are
independent.
Alternative statement of the alternative hypothesis: The explanatory and response are
not independent.
If there are only 2 values for both the explanatory and response variable, then we can write the hypotheses in the following mathematical form:

H0: p1 = p2 versus HA: p1 ≠ p2

EX 1 continued: What are the hypotheses, both written out and in mathematical form?
Hypotheses when the data comes from a good Comparative Randomized Experiment:
In the case of the good randomized experiment, we can make stronger conclusions when we reject the
null hypothesis.
One standard conclusion is to say that the chance of a particular experimental outcome occurring depends, in part, on which treatment the subject received; that is, the treatment causes a change in the response.
We use a statistical test called a chi-squared test to determine if there is statistical evidence of association (or cause and effect) between 2 categorical variables.
Output from the Chi-squared test:
Output from example 1:

Test       ChiSquare   Prob>ChiSq
Pearson    4.453       0.0348*

The test statistic is

TS = Σ (ni − Ei)² / Ei   (summed over all cells i)
For i = 1 (the At Home, Depressed cell in the contingency table above), n1 = 17 and E1 = 23.7635.
The larger the value of (ni − Ei)², the stronger the evidence is that H0 is not true.
For our data set the TS is calculated as

TS = (17 − 23.7635)²/23.7635 + (50 − 43.2365)²/43.2365 + (55 − 48.2365)²/48.2365 + (81 − 87.7635)²/87.7635 = 4.4527
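The same calculation can be written out in a short Python sketch (observed counts taken from the contingency table); it reproduces the Pearson chi-square value from the output:

```python
observed = [[17, 50],   # At Home: depressed, not depressed
            [55, 81]]   # Working: depressed, not depressed

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count for each cell under H0: (row total * column total) / n
expected = [[r * c / n for c in col_totals] for r in row_totals]

# TS = sum over all cells of (observed - expected)^2 / expected
ts = sum((observed[i][j] - expected[i][j]) ** 2 / expected[i][j]
         for i in range(2) for j in range(2))
print(round(ts, 4))  # 4.4527
```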
If there is no association between postpartum depression and work status, how many of the 136
working moms in our study would you expect to have been depressed?
How many working moms in our study were actually found to be depressed?
The larger the value of the test statistic, the greater the difference between the observed data counts and the expected counts, and the stronger the evidence against H0.
Are the conditions met to use the chi-squared test to analyze the data from example 1?
EX 2: The Physicians' Health Study is a very famous study which looked at the effects of aspirin on heart attack rates. In that study, the male subjects (all doctors) were randomly assigned to either take an aspirin or a placebo over a 5-year period. At the end of the study, the proportion of men who had heart attacks in each group was reported. Many spin-off studies have resulted from that one. In one study, men were randomly assigned to one of 3 groups (aspirin, ibuprofen or placebo), which they took for a period of 5 years. The number of heart attacks in each group was recorded at the end of the study. The proportion of men in each group who had a heart attack was then compared. The data is below. The theory they are testing is: does the type of drug a man takes affect his chances of having a heart attack?
Contingency Table: Drug by Health Status
(each cell shows Count, Row %, Expected)

            Heart Attack    None        Total
Aspirin     104             10933       11037
            0.94%           99.06%
            151.664         10885.3
Ibuprofen   81              5065        5146
            1.57%           98.43%
            70.7133         5075.29
Placebo     189             10845       11034
            1.71%           98.29%
            151.623         10882.4
Total       374             26843       27217

Test       ChiSquare   Prob>ChiSq
Pearson    26.048      <.0001*
Explanatory variable: Type of drug with values: aspirin, ibuprofen and placebo
Response variable: Whether or not a man experienced a heart attack during the time of the study.
Are the conditions met to use the chi-squared test to analyze this data set?
What is the p-value and what is your decision?
What do you conclude based on the chi-square test?
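To check the conditions, one common rule of thumb is that every expected cell count should be at least 5. That check can be sketched in Python using the counts from the table above:

```python
# Condition check for the chi-squared test: every expected count >= 5
observed = [[104, 10933],   # Aspirin: heart attack, none
            [81,  5065],    # Ibuprofen
            [189, 10845]]   # Placebo

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count for each cell under H0
expected = [[r * c / n for c in col_totals] for r in row_totals]
min_expected = min(min(row) for row in expected)
print(round(min_expected, 4))  # 70.7133 -- well above 5
```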
In the example above, the proportions in the Heart Attack column are all very small. In cases like this, a useful way to compare how much the groups' conditional proportions differ from each other is a measure called relative risk:

Relative risk = p̂1 / p̂2

            Heart Attack   None     Totals
Aspirin     104            10933    11037
Ibuprofen   81             5065     5146
Placebo     189            10845    11034
Totals      374            26843    27217
What is the relative risk of having a heart attack among men who take aspirin compared to men who take a placebo?
We interpret this as: Men in the study who took aspirin were about _____ as likely to have a heart attack as those men who took a placebo.
What is the relative risk of having a heart attack among men who take Ibuprofen compared to men who
take a placebo?
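The relative risks asked for above follow directly from the conditional proportions in the table; a quick Python sketch:

```python
# Conditional proportions of heart attack in each group (from the table)
p_aspirin = 104 / 11037
p_ibuprofen = 81 / 5146
p_placebo = 189 / 11034

# Relative risk = p1 / p2
rr_aspirin_vs_placebo = p_aspirin / p_placebo
rr_ibuprofen_vs_placebo = p_ibuprofen / p_placebo
print(round(rr_aspirin_vs_placebo, 2))    # 0.55
print(round(rr_ibuprofen_vs_placebo, 2))  # 0.92
```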
Linear Regression
Inferential statistics for 2 numerical variables
Now we want to make statistical inferences about a population based on the data when there appears to be a linear relationship between the explanatory variable (numerical) and the response variable (numerical).
Explanatory variable: height
Response variable: weight
The first step in answering this question is to make a scatter plot (see below). It is clear that there is a
general positive linear trend and that as height increases, on average, weight also increases. There are
no outrageous outliers and the correlation, R = .660, is a good measure of the strength of the linear
association.
From this data, I can estimate the equation for the regression line. The line used to estimate the true population line is:

μ̂_{Y|x} = -247 + 6.0x

[Scatter plot of weight (y-axis, roughly 100-250 lbs) vs. height (x-axis, 60-75 inches) with the fitted line]
μ_{Y|64} = average weight of all people in the population who are 64 inches tall.
The slope, β1, of the regression line measures the change in μ_{Y|x} for every unit change in the explanatory variable x. This is a population parameter.
EX 1 continued:
β1 = average change in weight when height increases by 1 inch for this population.
β1 is estimated by b1 = slope of the line which is the best fit through the data.
Using computer output, the value of b1 is calculated from the data set. b1 is a statistic calculated from the data.
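The least squares estimates can be computed directly with the formulas b1 = Sxy/Sxx and b0 = ȳ − b1·x̄. A sketch with a small made-up height/weight data set (these numbers are hypothetical, not the notes' sample):

```python
# Hypothetical data: height (inches) and weight (pounds)
heights = [62, 65, 68, 70, 72, 75]
weights = [120, 140, 155, 165, 180, 200]

n = len(heights)
xbar = sum(heights) / n
ybar = sum(weights) / n

# Least squares estimates: b1 = Sxy / Sxx, b0 = ybar - b1 * xbar
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(heights, weights))
sxx = sum((x - xbar) ** 2 for x in heights)
b1 = sxy / sxx
b0 = ybar - b1 * xbar
print(round(b1, 2))  # 6.02: weight rises about 6 lbs per extra inch of height
```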
[Three scatter plots of Response vs. Explanatory illustrating b1 > 0 (increasing trend), b1 < 0 (decreasing trend), and b1 = 0 (no trend)]
The true intercept, β0, of the mean function tells the value of μ_{Y|X} when X = 0. β0 is a population parameter.
We estimate the value of β0 with b0.
We use the estimated line μ̂_{Y|x} to calculate the estimated average response.
EX 1 revisited:
μ̂_{Y|X} = -247 + 6.0x
b0 =
b1 =
μ̂_{Y|70} =
The 3 possible hypotheses that can be tested using linear regression methods are:
There is a positive linear relationship: H0: β1 = 0 vs HA: β1 > 0 (1-sided)
There is a negative linear relationship: H0: β1 = 0 vs HA: β1 < 0 (1-sided)
There is a linear relationship: H0: β1 = 0 vs HA: β1 ≠ 0 (2-sided)
EX 2: Theory: the number of times a TAMU student goes out per week is negatively linearly related to their GPR. A SRS was taken of 43 STAT 302 students in Fall 02. Below is a scatter plot of their data. We want to test this theory at the α = .05 level.
Explanatory variable:
Response variable:
Hypothesis:
Summary
Multiple R = 0.5722, R-Square = 0.3275, Adjusted R-Square = 0.3111, Std Err of Estimate = 0.4831625

Regression Table
              Coefficient   Standard Error   t-Value   p-Value
Constant      3.678         0.181            20.310    < 0.0001
# NightsOut   -0.239        0.072                      0.002

(b0 = estimated intercept = 3.678; b1 = estimated slope = -0.239)
To get the p-value for a 1-sided test from the table:
a) If you have 1-sided hypotheses and the sign of b1 matches the HA statement, then the p-value = ½ (p-value in table).
b) If you have 1-sided hypotheses and the sign of b1 doesn't match the HA statement, then FTR H0.
What is the correct p-value for testing the hypotheses H0: β1 = 0 versus HA: β1 < 0?
The formula ŷ = b0 + b1x is used to predict the response value of an individual with explanatory value x.
What is the predicted GPR of a person who goes out 3 times per week?
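Plugging into ŷ = b0 + b1x with the coefficients from the regression table gives the prediction; a quick Python sketch:

```python
# Predicted GPR from the fitted line: y-hat = b0 + b1 * x
b0 = 3.678    # estimated intercept (Constant row of the regression table)
b1 = -0.239   # estimated slope (# NightsOut row)

nights_out = 3
predicted_gpr = b0 + b1 * nights_out
print(round(predicted_gpr, 3))  # 2.961
```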
As usual, there are conditions that must be met before we can make statistical inferences.
Below is a discussion of those conditions.
We start with defining residuals. These are very important to statisticians but all you need to be able to
do is understand enough of the plots to be able to decide if conditions are met.
Residuals:
There are various methods for estimating an equation for the best straight line through a set of data points. The most commonly used method results in a line called the Least Squares Line. The least squares line is the line that minimizes the sum of the squared sample residuals.
For a data point with response value y, the sample residual is y − ŷ.
A data point's residual tells us how much a subject's response value differs from the predicted average response value.
A plot of the residuals (see below) tells us about the variation of the data values about the predicted regression line.
[Scatter plot of residuals vs. predicted GPR values]
Checking independence:
o Each subject's response value should be measured only once, and the data should come from a random sample or randomized experiment.
Checking the assumption of a linear relationship between μ_{Y|X} and the value of x:
o Look at the scatter plot and make sure that the pattern of the points looks linear or like a shotgun pattern.
Checking the assumption that the responses are normally distributed about their means:
o Look at the normal Q-Q plot of the residuals and make sure you don't see a C shape.
Checking the assumption of equal variances:
o Look at the scatter plot of residuals; you want to see a horizontal band of points or a shotgun pattern. You do NOT want to see a wedge shape.
Also need to check that there are no extreme outliers, as they mess up everything, just like with correlation.
[Scatter plot of Y vs. X, normal P-P plot of the residuals, and scatter plot of the residuals for a data set where the explanatory and response are not linearly related]
In the case above, the explanatory and response are not linearly related, but everything else is OK, although the lack of linearity also messes up the scatter plot of the residuals.
Data not normally distributed:

[Normal P-P plots of the regression standardized residuals and scatter plots of the residuals for data sets that are not normally distributed]

More examples of the right side plot used to check equal variances:

Equal variances:

[Scatter plots of residuals showing a horizontal band of points, consistent with equal variances]
EX 3: Doctors would like a way to predict a premature infant's weight at birth based on the infant's gestational age. They wanted to test their theory that gestational age (in weeks) and weight (in grams) are positively linearly related. To test their theory, a research group selected a random sample of 100 premature infants and recorded the gestational age at birth and the birth weight of each baby. Assume all conditions are met to do the analysis.
Explanatory variable:
Response variable:
Hypotheses: H0: β1 = 0 vs HA: β1 > 0

[Scatter plot of birthweight (grams) vs. gestational age (20-40 weeks)]
[Normal Q-Q plot of the standardized residuals (Standardized Q-Value vs. Z-Value)]
[Scatter plot of residuals vs. fitted values]
Are the conditions met to analyze this data set using linear regression?
1. Independence: This condition is met because the data comes from a random sample and, most importantly, each baby's response value (birthweight) was only measured once.
2. Linear relationship: Yes, visual inspection of the scatter plot shows gestational age at birth and birth weight are linearly related.
3. Normality: Condition met; the normal Q-Q plot of the residuals doesn't have a C shape.
4. Equal variances: Condition met; the scatter plot of residuals shows a shotgun pattern and not a wedge pattern.
Summary
Multiple R = 0.66, R-Square = 0.44, Adjusted R-Square = 0.43, Std Err of Estimate = 203.89
Recall the definition of R² from the set 2 notes. It is the % of variability in the data's response values that can be explained by differences in the explanatory values.
What % of the variability in birth weights in this data set can be explained by differences in
gestational age?
Regression Table
                  Coeff.     Standard Error   t-Value   p-Value
Intercept         -932.40    234.49           -3.976    0.0001
gestational age   70.31      22.87            3.176     0.0010
The average birth weight of premature babies increases by approximately 70.3 grams
when the gestational age at birth increases by 1 week.
The data provides very strong statistical evidence that for premature babies, gestational
age at birth and birth weight are positively linearly related.
Discussion on when we can estimate average response and calculate a predicted response:
Now that we have completed our analysis, we can use the values of the coefficients to form a linear
equation relating gestational age and birth weight. Based on this data set we can both estimate the
average weight at birth and predict the birth weight of a baby yet to be born after x weeks.
μ̂_{Y|x} = -932.40 + 70.31x (estimated average birth weight)
ŷ = -932.40 + 70.31x (predicted birth weight of an individual baby)

[Scatter plot of birthweight vs. gestational age with the fitted line]
To find the minimum and maximum value of x, look at the scatter plot of the data.
Therefore, this equation can only be used to estimate average birth weight or predict the weights of infants whose gestational age is between 23 and 35 weeks. It is very important that once a regression is done, the estimates of β0 and β1 are only used to estimate averages or predict response values for values of x (gestational age) between the minimum and maximum x values of your data. In other words, we can interpolate between our explanatory data values but we can't extrapolate to values outside of the range of x values.
Using our equation, we predict that a baby born at 30 weeks will weigh -932.404 + 70.310(30) =
1,176.896 grams at birth.
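The prediction, together with the interpolation-only rule, can be sketched as a small helper (the 23-35 week limits come from the scatter plot; the function name is just for illustration):

```python
B0, B1 = -932.404, 70.310   # estimated intercept and slope
X_MIN, X_MAX = 23, 35       # range of gestational ages observed in the data

def predict_weight(weeks):
    # Refuse to extrapolate outside the observed range of x values
    if not X_MIN <= weeks <= X_MAX:
        raise ValueError("gestational age outside the data's x range")
    return B0 + B1 * weeks

print(round(predict_weight(30), 3))  # 1176.896
```

Calling predict_weight(40) raises an error: 40 weeks is outside the data's x range, so the fitted line should not be used to extrapolate there.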
What is the estimated average weight of babies born at 40 weeks?