Chi-Squared Tests
Analyzing association between 2 Categorical Variables
In this section we will study the hypotheses used to test whether or not an association exists between 2
categorical variables.
EX 1: Researchers wanted to test the theory that women who went to work shortly after giving birth were
more likely to experience postpartum depression compared to those who stayed home.
A random sample of women giving birth at a Dallas hospital were queried six months after giving birth to
their first child. The researchers recorded whether or not the woman worked outside the home and
whether or not she experienced postpartum depression.
What is the explanatory variable?
Work Status
Descriptive Statistics:

Contingency Table: Work Status by Mental State
(each cell shows Count, Row %, Expected)

              Depressed        Not Depressed      Total
At Home       17 (n11)         50 (n12)           67 (R1)
              25.37%           74.63%
              23.7635 (E11)    43.2365 (E12)
Working       55 (n21)         81 (n22)           136 (R2)
              40.44%           59.56%
              48.2365 (E21)    87.7635 (E22)
Total         72 (C1)          131 (C2)           203 (n)
The rows are the groups: one row for each group defined by the explanatory variable.
The columns are divided up by response variable value: one column for each response variable value.
The first number in a cell (box) is the number of subjects in that row group taking that column's response value. Ex 1: In the first cell the first number is 17. This tells us that 17 of the 67 stay-at-home moms in the sample were depressed.
The second number in a cell is the % of subjects in the row group taking that response value. Ex 1: In the first cell the 2nd number is 25.37, which tells us that 25.37% of the stay-at-home moms in the sample were depressed. This is a conditional percent.
The 3rd number in the cell gives the count we'd expect to see if H0 were true and the % of depressed moms were the same in both the working and stay-at-home groups.
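As a quick check, the expected counts in the table can be reproduced from the row and column totals. A minimal Python sketch, using the totals from the table above:

```python
# Expected count under H0 for each cell: (row total * column total) / overall n
row_totals = [67, 136]   # At Home, Working
col_totals = [72, 131]   # Depressed, Not Depressed
n = 203

expected = [[r * c / n for c in col_totals] for r in row_totals]
# expected[0][0] is E11 for the At Home / Depressed cell, about 23.7635
```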
Hypotheses:
The null hypothesis is always that there is no association (the explanatory and response variables are independent).
This is not to say the explanatory causes the response but only that they are associated. For
example, we can predict that the likelihood of a shark attack in Florida is greater when more
ice cream cones are sold. The reason for this being that more ice cream cones are sold on
warm days and on warm days, more people swim making them shark bait.
Alternative statement of the null hypothesis: The explanatory and response are
independent.
Alternative statement of the alternative hypothesis: The explanatory and response are
not independent.
If there are only 2 values for both the explanatory and response variable, then we can write the hypotheses in the following mathematical form:

H0: p1 = p2 versus HA: p1 ≠ p2

EX 1 continued: What are the hypotheses, both written out and in mathematical form?
Hypotheses when the data comes from a good Comparative Randomized Experiment:
In the case of the good randomized experiment, we can make stronger conclusions when we reject the
null hypothesis.
One standard conclusion is to say that the chance of a particular experimental outcome occurring depends, in part, on which treatment the subject received; that is, the treatment causes a change in the response.
We use a statistical test called a chi-squared test to determine if there is statistical evidence of association (or cause and effect) between 2 categorical variables.
Output from the Chi-squared test:
Output from example 1:

Test       ChiSquare   Prob>ChiSq
Pearson    4.453       0.0348*

The test statistic is

TS = Σ (ni − Ei)² / Ei   (summed over all cells i)
For i = 1 (the At Home, Depressed cell in the contingency table above), n1 = 17 and E1 = 23.7635.
The larger the value of (ni − Ei)², the stronger the evidence is that H0 is not true.
For our data set the TS is calculated as

TS = (17 − 23.7635)²/23.7635 + (50 − 43.2365)²/43.2365 + (55 − 48.2365)²/48.2365 + (81 − 87.7635)²/87.7635 = 4.4527
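The same calculation can be written out in a short Python sketch (observed counts taken from the contingency table); it reproduces the Pearson chi-square value from the output:

```python
observed = [[17, 50],   # At Home: depressed, not depressed
            [55, 81]]   # Working: depressed, not depressed

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count for each cell under H0: (row total * column total) / n
expected = [[r * c / n for c in col_totals] for r in row_totals]

# TS = sum over all cells of (observed - expected)^2 / expected
ts = sum((observed[i][j] - expected[i][j]) ** 2 / expected[i][j]
         for i in range(2) for j in range(2))
print(round(ts, 4))  # 4.4527
```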
If there is no association between postpartum depression and work status, how many of the 136
working moms in our study would you expect to have been depressed?
How many working moms in our study were actually found to be depressed?
The larger the value of the test statistic, the greater the difference between the observed data counts and the expected counts, and the stronger the evidence against H0.
Are the conditions met to use the chi-squared test to analyze the data from example 1?
EX 2: The Physicians' Health Study is a very famous study which looked at the effects of aspirin on heart attack rates. In that study, the male subjects (all doctors) were randomly assigned to either take an aspirin or a placebo over a 5-year period. At the end of the study, the proportion of men who had heart attacks in each group was reported. Many spin-off studies have resulted from that one. In one study, men were randomly assigned to one of 3 groups (aspirin, ibuprofen or placebo), which they took for a period of 5 years. The number of heart attacks in each group was recorded at the end of the study. The proportion of men in each group who had a heart attack was then compared. The data is below. The theory they are testing is: does the type of drug a man takes affect his chances of having a heart attack?
Contingency Table: Drug by Health Status
(each cell shows Count, Row %, Expected)

            Heart Attack    None        Total
Aspirin     104             10933       11037
            0.94%           99.06%
            151.664         10885.3
Ibuprofen   81              5065        5146
            1.57%           98.43%
            70.7133         5075.29
Placebo     189             10845       11034
            1.71%           98.29%
            151.623         10882.4
Total       374             26843       27217

Test       ChiSquare   Prob>ChiSq
Pearson    26.048      <.0001*
Explanatory variable: Type of drug with values: aspirin, ibuprofen and placebo
Response variable: Whether or not a man experienced a heart attack during the time of the study.
Are the conditions met to use the chi-squared test to analyze this data set?
What is the p-value and what is your decision?
What do you conclude based on the chi-square test?
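To check the conditions, one common rule of thumb is that every expected cell count should be at least 5. That check can be sketched in Python using the counts from the table above:

```python
# Condition check for the chi-squared test: every expected count >= 5
observed = [[104, 10933],   # Aspirin: heart attack, none
            [81,  5065],    # Ibuprofen
            [189, 10845]]   # Placebo

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count for each cell under H0
expected = [[r * c / n for c in col_totals] for r in row_totals]
min_expected = min(min(row) for row in expected)
print(round(min_expected, 4))  # 70.7133 -- well above 5
```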
In the example above, the proportions in the Heart Attack column are all very small. In cases like this, a useful way to compare how much the groups' conditional proportions differ from each other is a measure called relative risk:

Relative risk = p̂1 / p̂2

            Heart Attack   None     Totals
Aspirin     104            10933    11037
Ibuprofen   81             5065     5146
Placebo     189            10845    11034
Totals      374            26843    27217
What is the relative risk of having a heart attack among men who take aspirin compared to men who take a placebo?
We interpret this as: Men in the study who took aspirin were about _____ as likely to have a heart attack as those men who took a placebo.
What is the relative risk of having a heart attack among men who take Ibuprofen compared to men who
take a placebo?
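The relative risks asked for above follow directly from the conditional proportions in the table; a quick Python sketch:

```python
# Conditional proportions of heart attack in each group (from the table)
p_aspirin = 104 / 11037
p_ibuprofen = 81 / 5146
p_placebo = 189 / 11034

# Relative risk = p1 / p2
rr_aspirin_vs_placebo = p_aspirin / p_placebo
rr_ibuprofen_vs_placebo = p_ibuprofen / p_placebo
print(round(rr_aspirin_vs_placebo, 2))    # 0.55
print(round(rr_ibuprofen_vs_placebo, 2))  # 0.92
```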
Linear Regression
Inferential statistics for 2 numerical variables
Now we want to make statistical inferences about a population based on the data when there appears to be a linear relationship between the explanatory variable (numerical) and the response variable (numerical).
Explanatory variable: height
Response variable: weight
The first step in answering this question is to make a scatter plot (see below). It is clear that there is a
general positive linear trend and that as height increases, on average, weight also increases. There are
no outrageous outliers and the correlation, R = .660, is a good measure of the strength of the linear
association.
From this data, I can estimate the equation for the regression line. The line used to estimate the true population line is:

μ̂_{Y|x} = -247 + 6.0x

[Scatter plot of weight (y-axis, roughly 100-250 lbs) vs. height (x-axis, 60-75 inches) with the fitted line]
μ_{Y|64} = average weight of all people in the population who are 64 inches tall.
The slope, β1, of the regression line measures the change in μ_{Y|x} for every unit change in the explanatory variable x. This is a population parameter.
EX 1 continued:
β1 = average change in weight when height increases by 1 inch for this population.
β1 is estimated by b1 = slope of the line which is the best fit through the data.
Using computer output, the value of b1 is calculated from the data set. b1 is a statistic calculated from the data.
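The least squares estimates can be computed directly with the formulas b1 = Sxy/Sxx and b0 = ȳ − b1·x̄. A sketch with a small made-up height/weight data set (these numbers are hypothetical, not the notes' sample):

```python
# Hypothetical data: height (inches) and weight (pounds)
heights = [62, 65, 68, 70, 72, 75]
weights = [120, 140, 155, 165, 180, 200]

n = len(heights)
xbar = sum(heights) / n
ybar = sum(weights) / n

# Least squares estimates: b1 = Sxy / Sxx, b0 = ybar - b1 * xbar
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(heights, weights))
sxx = sum((x - xbar) ** 2 for x in heights)
b1 = sxy / sxx
b0 = ybar - b1 * xbar
print(round(b1, 2))  # 6.02: weight rises about 6 lbs per extra inch of height
```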
[Three scatter plots of Response vs. Explanatory illustrating b1 > 0 (increasing trend), b1 < 0 (decreasing trend), and b1 = 0 (no trend)]
The true intercept, β0, of the mean function tells the value of μ_{Y|X} when X = 0. β0 is a population parameter.
We estimate the value of β0 with b0.
We use the estimated line μ̂_{Y|x} to calculate the estimated average response.
EX 1 revisited:
μ̂_{Y|X} = -247 + 6.0x
b0 =
b1 =
μ̂_{Y|70} =
The 3 possible hypotheses that can be tested using linear regression methods are:
There is a positive linear relationship: H0: β1 = 0 vs HA: β1 > 0 (1-sided)
There is a negative linear relationship: H0: β1 = 0 vs HA: β1 < 0 (1-sided)
There is a linear relationship: H0: β1 = 0 vs HA: β1 ≠ 0 (2-sided)
EX 2: Theory: the number of times a TAMU student goes out per week is negatively linearly related to their GPR. A SRS was taken of 43 STAT 302 students in Fall 02. Below is a scatter plot of their data. We want to test this theory at the α = .05 level.
Explanatory variable:
Response variable:
Hypothesis:
Summary
Multiple R = 0.5722, R-Square = 0.3275, Adjusted R-Square = 0.3111, Std Err of Estimate = 0.4831625

Regression Table
              Coefficient   Standard Error   t-Value   p-Value
Constant      3.678         0.181            20.310    < 0.0001
# NightsOut   -0.239        0.072                      0.002

(b0 = estimated intercept = 3.678; b1 = estimated slope = -0.239)
To get the p-value for a 1-sided test from the table:
a) If you have 1-sided hypotheses and the sign of b1 matches the HA statement, then the p-value = ½ (p-value in table).
b) If you have 1-sided hypotheses and the sign of b1 doesn't match the HA statement, then FTR H0.
What is the correct p-value for testing the hypotheses H0: β1 = 0 versus HA: β1 < 0?
The formula ŷ = b0 + b1x is used to predict the response value of an individual with explanatory value x.
What is the predicted GPR of a person who goes out 3 times per week?
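Plugging into ŷ = b0 + b1x with the coefficients from the regression table gives the prediction; a quick Python sketch:

```python
# Predicted GPR from the fitted line: y-hat = b0 + b1 * x
b0 = 3.678    # estimated intercept (Constant row of the regression table)
b1 = -0.239   # estimated slope (# NightsOut row)

nights_out = 3
predicted_gpr = b0 + b1 * nights_out
print(round(predicted_gpr, 3))  # 2.961
```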
As usual, there are conditions that must be met before we can make statistical inferences.
Below is a discussion of those conditions.
We start with defining residuals. These are very important to statisticians but all you need to be able to
do is understand enough of the plots to be able to decide if conditions are met.
Residuals:
There are various methods for estimating an equation for the best straight line through a set of data points. The most commonly used method results in a line called the Least Squares Line. The least squares line is the line that minimizes the sum of the squared sample residuals.
For a data point with response value y, the sample residual is y − ŷ.
A data point's residual tells us how much a subject's response value differs from the predicted average response value.
A plot of the residuals (see below) tells us about the variation of the data values about the predicted regression line.
[Scatter plot of residuals vs. predicted GPR values]
Checking independence:
o Each subject's response value should be measured only once, and the data should come from a random sample or randomized experiment.
Checking the assumption of a linear relationship between μ_{Y|X} and the value of x:
o Look at the scatter plot and make sure that the pattern of the points looks linear or like a shotgun pattern.
Checking the assumption that the responses are normally distributed about their means:
o Look at the normal Q-Q plot of the residuals and make sure you don't see a C shape.
Checking the assumption of equal variances:
o Look at the scatter plot of residuals; you want to see a horizontal band of points or a shotgun pattern. You do NOT want to see a wedge shape.
Also need to check that there are no extreme outliers, as they mess up everything, just like with correlation.
[Scatter plot of Y vs. X, normal P-P plot of the residuals, and scatter plot of the residuals for a data set where the explanatory and response are not linearly related]
In the case above, the explanatory and response are not linearly related, but everything else is OK, although the lack of linearity also messes up the scatter plot of the residuals.
Data not normally distributed:

[Normal P-P plots of the regression standardized residuals and scatter plots of the residuals for data sets that are not normally distributed]

More examples of the right side plot used to check equal variances:

Equal variances:

[Scatter plots of residuals showing a horizontal band of points, consistent with equal variances]
EX 3: Doctors would like a way to predict a premature infant's weight at birth based on the infant's gestational age. They wanted to test their theory that gestational age (in weeks) and weight (in grams) are positively linearly related. To test their theory, a research group selected a random sample of 100 premature infants and recorded the gestational age at birth and the birth weight of each baby. Assume all conditions are met to do the analysis.
Explanatory variable:
Response variable:
Hypotheses: H0: β1 = 0 vs HA: β1 > 0

[Scatter plot of birthweight (grams) vs. gestational age (20-40 weeks)]
[Normal Q-Q plot of the standardized residuals (Standardized Q-Value vs. Z-Value)]
[Scatter plot of residuals vs. fitted values]
Are the conditions met to analyze this data set using linear regression?
1. Independence: This condition is met because the data comes from a random sample and, most importantly, each baby's response value (birthweight) was only measured once.
2. Linear relationship: Yes, visual inspection of the scatter plot shows gestational age at birth and birth weight are linearly related.
3. Normality: Condition met; the normal Q-Q plot of the residuals doesn't have a C shape.
4. Equal variances: Condition met; the scatter plot of residuals shows a shotgun pattern and not a wedge pattern.
Summary
Multiple R = 0.66, R-Square = 0.44, Adjusted R-Square = 0.43, Std Err of Estimate = 203.89
Recall the definition of R² from the set 2 notes. It is the % of variability in the data's response values that can be explained by differences in the explanatory values.
What % of the variability in birth weights in this data set can be explained by differences in
gestational age?
Regression Table
                  Coeff.     Standard Error   t-Value   p-Value
Intercept         -932.40    234.49           -3.976    0.0001
gestational age   70.31      22.87            3.176     0.0010
The average birth weight of premature babies increases by approximately 70.3 grams
when the gestational age at birth increases by 1 week.
The data provides very strong statistical evidence that for premature babies, gestational
age at birth and birth weight are positively linearly related.
Discussion on when we can estimate average response and calculate a predicted response:
Now that we have completed our analysis, we can use the values of the coefficients to form a linear
equation relating gestational age and birth weight. Based on this data set we can both estimate the
average weight at birth and predict the birth weight of a baby yet to be born after x weeks.
μ̂_{Y|x} = -932.40 + 70.31x (estimated average birth weight)
ŷ = -932.40 + 70.31x (predicted birth weight of an individual baby)

[Scatter plot of birthweight vs. gestational age with the fitted line]
To find the minimum and maximum value of x, look at the scatter plot of the data.
Therefore, this equation can only be used to estimate average birth weight or predict the weights of infants whose gestational age is between 23 and 35 weeks. It is very important that once a regression is done, the estimates of β0 and β1 are only used to estimate averages or predict response values for values of x (gestational age) between the minimum and maximum x values of your data. In other words, we can interpolate between our explanatory data values but we can't extrapolate to values outside of the range of x values.
Using our equation, we predict that a baby born at 30 weeks will weigh -932.404 + 70.310(30) =
1,176.896 grams at birth.
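The prediction, together with the interpolation-only rule, can be sketched as a small helper (the 23-35 week limits come from the scatter plot; the function name is just for illustration):

```python
B0, B1 = -932.404, 70.310   # estimated intercept and slope
X_MIN, X_MAX = 23, 35       # range of gestational ages observed in the data

def predict_weight(weeks):
    # Refuse to extrapolate outside the observed range of x values
    if not X_MIN <= weeks <= X_MAX:
        raise ValueError("gestational age outside the data's x range")
    return B0 + B1 * weeks

print(round(predict_weight(30), 3))  # 1176.896
```

Calling predict_weight(40) raises an error: 40 weeks is outside the data's x range, so the fitted line should not be used to extrapolate there.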
What is the estimated average weight of babies born at 40 weeks?