
Elementary Statistics

Larson Farber
9 Correlation and Regression
Correlation
Section 9.1
Correlation
A relationship between two variables
What type of relationship exists between the two variables, and is the correlation significant?

x: Explanatory (Independent) Variable    y: Response (Dependent) Variable
Cigarettes smoked per day                 Lung Capacity
Score on SAT                              Grade Point Average
Height                                    IQ
Hours of Training                         Number of Accidents
Shoe Size                                 Height
Scatter Plots and Types of Correlation
Negative correlation: as x increases, y decreases.
x = hours of training
y = number of accidents
[Scatter plot: Accidents (0 to 60) versus Hours of Training (0 to 20)]

Scatter Plots and Types of Correlation
Positive correlation: as x increases, y increases.
x = SAT score
y = GPA
[Scatter plot: GPA (1.50 to 4.00) versus Math SAT (300 to 800)]

Scatter Plots and Types of Correlation
No linear correlation
x = height
y = IQ
[Scatter plot: IQ (80 to 160) versus Height (60 to 80)]

Correlation Coefficient
A measure of the strength and direction of a linear relationship between two variables
The range of r is from -1 to 1.
If r is close to 1, there is a strong positive correlation.
If r is close to -1, there is a strong negative correlation.
If r is close to 0, there is no linear correlation.
Application
Absences (x)   Final Grade (y)
8              78
2              92
5              90
12             58
15             43
9              74
6              81
[Scatter plot: Final Grade (40 to 95) versus Absences (0 to 16)]

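Computation of r
       x     y     xy    x^2    y^2
1      8    78    624     64   6084
2      2    92    184      4   8464
3      5    90    450     25   8100
4     12    58    696    144   3364
5     15    43    645    225   1849
6      9    74    666     81   5476
7      6    81    486     36   6561
Sum   57   516   3751    579  39898

$$ r = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{n\sum x^2 - (\sum x)^2}\,\sqrt{n\sum y^2 - (\sum y)^2}} = \frac{7(3751) - (57)(516)}{\sqrt{7(579) - 57^2}\,\sqrt{7(39898) - 516^2}} \approx -0.975 $$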
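The computation of r for the absences and final grade data can be sketched in a few lines of Python. This is a minimal illustration, assuming the usual computational (sums) formula for the sample correlation coefficient; the function name `correlation_coefficient` is a label chosen here, not something from the slides.

```python
import math

# Absences (x) and final grades (y) from the application data.
x = [8, 2, 5, 12, 15, 9, 6]
y = [78, 92, 90, 58, 43, 74, 81]

def correlation_coefficient(xs, ys):
    """Sample correlation coefficient r computed from the column sums."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(a * b for a, b in zip(xs, ys))
    sum_x2 = sum(a * a for a in xs)
    sum_y2 = sum(b * b for b in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = (math.sqrt(n * sum_x2 - sum_x ** 2)
                   * math.sqrt(n * sum_y2 - sum_y ** 2))
    return numerator / denominator

r = correlation_coefficient(x, y)
print(round(r, 3))  # -0.975
```

The negative sign matches the scatter plot: grades fall as absences rise.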
r is the correlation coefficient for the sample. The correlation coefficient for the population is $\rho$ (rho).
The sampling distribution for r is a t-distribution with n - 2 d.f.
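Hypothesis Test for Significance
Standardized test statistic:
$$ t = \frac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}} $$
For a two-tail test for significance:
$H_0: \rho = 0$ (The correlation is not significant)
$H_a: \rho \neq 0$ (The correlation is significant)
For left-tail and right-tail tests of negative or positive significance:
Left tail: $H_0: \rho \geq 0$, $H_a: \rho < 0$
Right tail: $H_0: \rho \leq 0$, $H_a: \rho > 0$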
Test of Significance
You found the correlation between the number of times absent and a final grade to be r = -0.975. There were seven pairs of data. Test the significance of this correlation. Use $\alpha = 0.01$.
1. Write the null and alternative hypotheses.
$H_0: \rho = 0$ (The correlation is not significant)
$H_a: \rho \neq 0$ (The correlation is significant)
2. State the level of significance.
$\alpha = 0.01$
3. Identify the sampling distribution.
A t-distribution with 5 degrees of freedom
4. Find the critical value.
Critical values: $\pm t_0 = \pm 4.032$
5. Find the rejection region.
Rejection regions: t < -4.032 or t > 4.032
6. Find the test statistic.
$$ t = \frac{-0.975}{\sqrt{\dfrac{1 - (-0.975)^2}{7 - 2}}} \approx -9.811 $$
7. Make your decision.
t = -9.811 falls in the rejection region. Reject the null hypothesis.
8. Interpret your decision.
There is a significant correlation between the number of times absent and final grades.
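The test statistic and the decision step can be checked with a short Python sketch. It assumes the standard formula t = r / sqrt((1 - r^2) / (n - 2)) and takes the critical value 4.032 from the example rather than computing it; the name `t_statistic` is illustrative only.

```python
import math

def t_statistic(r, n):
    """Standardized test statistic for a sample correlation r from n pairs."""
    return r / math.sqrt((1 - r ** 2) / (n - 2))

r, n = -0.975, 7          # sample correlation and number of pairs from the example
critical_value = 4.032    # two-tailed, alpha = 0.01, 5 d.f. (taken from the example)

t = t_statistic(r, n)
reject_null = abs(t) > critical_value  # two-tailed rejection rule

print(round(t, 2), reject_null)  # -9.81 True
```

Because |t| is far larger than 4.032, the statistic falls in the left rejection region and the null hypothesis is rejected.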
Linear Regression
Section 9.2
The Line of Regression
Once you know there is a significant linear correlation, you can write an equation describing the relationship between the x and y variables. This equation is called the line of regression or least squares line.
The equation of a line may be written as y = mx + b, where m is the slope of the line and b is the y-intercept.
The line of regression is: $\hat{y} = mx + b$
The slope m is:
$$ m = \frac{n\sum xy - (\sum x)(\sum y)}{n\sum x^2 - (\sum x)^2} $$
The y-intercept is:
$$ b = \bar{y} - m\bar{x} $$
[Scatter plot with regression line: revenue (180 to 260) versus Ad $ (1.5 to 3.0)]
$(x_i, y_i)$ = a data point
$(x_i, \hat{y}_i)$ = a point on the line with the same x-value
$d_i = y_i - \hat{y}_i$ = a residual
Calculate m and b.
Write the equation of the line of regression with x = number of absences and y = final grade.
The line of regression is: $\hat{y} = -3.924x + 105.667$
[Scatter plot with regression line: Final Grade (40 to 95) versus Absences (0 to 16)]
m = -3.924 and b = 105.667
Note that the point $(\bar{x}, \bar{y}) = (8.143, 73.714)$ is on the line.
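As a rough check of the slope and intercept, the sketch below fits the least squares line to the absences and final grade data. It assumes the standard formulas m = (n Σxy - Σx Σy) / (n Σx² - (Σx)²) and b = ȳ - m x̄; the function name `regression_line` is hypothetical.

```python
# Absences (x) and final grades (y) from the application data.
x = [8, 2, 5, 12, 15, 9, 6]
y = [78, 92, 90, 58, 43, 74, 81]

def regression_line(xs, ys):
    """Return slope m and intercept b of the least squares line."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(a * b for a, b in zip(xs, ys))
    sum_x2 = sum(a * a for a in xs)
    m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    b = sum_y / n - m * (sum_x / n)   # b = y_bar - m * x_bar
    return m, b

m, b = regression_line(x, y)
x_bar, y_bar = sum(x) / len(x), sum(y) / len(y)

print(round(m, 3), round(b, 3))          # -3.924 105.668
print(round(x_bar, 3), round(y_bar, 3))  # 8.143 73.714
```

The intercept prints as 105.668 because the slope is not rounded before it is used; rounding m to -3.924 first gives the 105.667 shown in the text.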
The Line of Regression
Predicting y Values
The regression line can be used to predict values of y for values of x falling within the range of the data.
The regression equation for number of times absent and final grade is: $\hat{y} = -3.924x + 105.667$
Use this equation to predict the expected grade for a student with (a) 3 absences (b) 12 absences
(a) $\hat{y} = -3.924(3) + 105.667 = 93.895$
(b) $\hat{y} = -3.924(12) + 105.667 = 58.579$
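The two predictions can be reproduced with a small Python helper. This is only a sketch: the function name `predict_grade` is made up here, and the range check (2 to 15 absences, the smallest and largest x in the data) is one possible way to enforce the rule that predictions should stay within the range of the data.

```python
M, B = -3.924, 105.667   # slope and intercept of the regression line
X_MIN, X_MAX = 2, 15     # smallest and largest number of absences in the data

def predict_grade(absences):
    """Predicted final grade; refuses x-values outside the observed range."""
    if not X_MIN <= absences <= X_MAX:
        raise ValueError("x is outside the range of the data")
    return M * absences + B

print(round(predict_grade(3), 3))   # 93.895
print(round(predict_grade(12), 3))  # 58.579
```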
Measures of Regression and Correlation
Section 9.3
The Coefficient of Determination
The coefficient of determination, $r^2$, is the ratio of explained variation in y to the total variation in y.
The correlation coefficient of number of times absent and final grade is r = -0.975. The coefficient of determination is $r^2 = (-0.975)^2 = 0.9506$.
Interpretation: About 95% of the variation in final grades can be explained by the number of times a student is absent. The other 5% is unexplained and can be due to sampling error or other variables such as intelligence, amount of time studied, etc.
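To make the "explained over total variation" ratio concrete, the sketch below computes both variations directly from the absences and final grade data. It assumes the usual definitions: total variation Σ(y - ȳ)², explained variation Σ(ŷ - ȳ)², and unexplained variation Σ(y - ŷ)². The variable names are illustrative.

```python
# Absences (x) and final grades (y) from the application data.
x = [8, 2, 5, 12, 15, 9, 6]
y = [78, 92, 90, 58, 43, 74, 81]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

# Least squares slope and intercept (unrounded).
m = (sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
     / sum((a - x_bar) ** 2 for a in x))
b0 = y_bar - m * x_bar
predicted = [m * a + b0 for a in x]

total_variation = sum((b - y_bar) ** 2 for b in y)
explained_variation = sum((p - y_bar) ** 2 for p in predicted)
unexplained_variation = sum((b - p) ** 2 for b, p in zip(y, predicted))

r_squared = explained_variation / total_variation
print(round(r_squared, 2))  # 0.95
```

The explained and unexplained parts add up to the total variation, which is why the remaining 5% is described as unexplained.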
The Standard Error of Estimate
The standard error of estimate, $s_e$, is the standard deviation of the observed $y_i$ values about the predicted value $\hat{y}_i$.
$$ s_e = \sqrt{\frac{\sum (y_i - \hat{y}_i)^2}{n - 2}} $$
Calculate $(y_i - \hat{y}_i)^2$ for each x.
       x     y    y-hat     (y - y-hat)^2
1      8    78    74.275    13.8756
2      2    92    97.819    33.8608
3      5    90    86.047    15.6262
4     12    58    58.579     0.3352
5     15    43    46.807    14.4932
6      9    74    70.351    13.3152
7      6    81    82.123     1.2611
Sum                         92.767
$$ s_e = \sqrt{\frac{92.767}{7 - 2}} = 4.307 $$
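The table of squared residuals and the resulting standard error can be reproduced as follows. This is a sketch that uses the rounded regression line from the text (slope -3.924, intercept 105.667) and assumes the formula s_e = sqrt(Σ(y - ŷ)² / (n - 2)).

```python
import math

# Absences (x) and final grades (y) from the application data.
x = [8, 2, 5, 12, 15, 9, 6]
y = [78, 92, 90, 58, 43, 74, 81]
M, B = -3.924, 105.667   # rounded regression line used in the text

predicted = [M * a + B for a in x]
squared_residuals = [(b - p) ** 2 for b, p in zip(y, predicted)]

n = len(x)
s_e = math.sqrt(sum(squared_residuals) / (n - 2))

print(round(sum(squared_residuals), 3))  # 92.767
print(round(s_e, 3))                     # 4.307
```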
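Prediction Intervals
Given a specific linear regression equation and $x_0$, a specific value of x, a c-prediction interval for y is:
$$ \hat{y} - E < y < \hat{y} + E $$
where
$$ E = t_c s_e \sqrt{1 + \frac{1}{n} + \frac{n(x_0 - \bar{x})^2}{n\sum x^2 - (\sum x)^2}} $$
Use a t-distribution with n - 2 degrees of freedom.
The point estimate is $\hat{y}$ and E is the maximum error of estimate.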
Application
Construct a 90% prediction interval for a final grade when a student has been absent 6 times.
1. Find the point estimate:
$\hat{y} = -3.924(6) + 105.667 = 82.123$
The point (6, 82.123) is the point on the regression line with x-coordinate of 6.
2. Find E:
$$ E = 2.015(4.307)\sqrt{1 + \frac{1}{7} + \frac{7(6 - 8.143)^2}{7(579) - 57^2}} \approx 9.438 $$
At the 90% level of confidence, the maximum error of estimate is 9.438.
3. Find the endpoints.
$\hat{y} - E = 82.123 - 9.438 = 72.685$
$\hat{y} + E = 82.123 + 9.438 = 91.561$
72.685 < y < 91.561
When x = 6, the 90% prediction interval is from 72.685 to 91.561.
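The interval for x = 6 can be reproduced with the sketch below. It assumes the standard maximum-error formula E = t_c · s_e · sqrt(1 + 1/n + n(x0 - x̄)² / (n Σx² - (Σx)²)), and it hard-codes t_c = 2.015, the t-table critical value for 90% confidence with 5 degrees of freedom, which is an assumption not printed in the slides. The function name `prediction_interval` is illustrative.

```python
import math

x = [8, 2, 5, 12, 15, 9, 6]   # absences from the application data
n = len(x)
sum_x = sum(x)
sum_x2 = sum(a * a for a in x)
x_bar = sum_x / n

M, B = -3.924, 105.667   # regression line from the text
S_E = 4.307              # standard error of estimate from the text
T_C = 2.015              # assumed t-table value: 90% confidence, 5 d.f.

def prediction_interval(x0):
    """Return (lower, upper, E) for the predicted y at x0."""
    y_hat = M * x0 + B
    E = T_C * S_E * math.sqrt(
        1 + 1 / n + n * (x0 - x_bar) ** 2 / (n * sum_x2 - sum_x ** 2)
    )
    return y_hat - E, y_hat + E, E

low, high, E = prediction_interval(6)
print(round(low, 2), round(high, 2))  # 72.68 91.56
```

Small differences in the third decimal place relative to the hand calculation come from rounding intermediate values.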
Minitab Output
Regression Analysis

The regression equation is
y = 106 - 3.92x

Predictor    Coef       StDev     T        P
Constant     105.668    3.655     28.91    0.000
x            -3.9241    0.4019    -9.76    0.000

S = 4.307   R-Sq = 95.0%   R-Sq(adj) = 94.0%
Multiple Regression
Section 9.4
More Explanatory Variables
Absence   IQ     Grade
8         115    78
2         135    92
5         126    90
12        110    58
15        105    43
9         120    74
6         125    81


Minitab Output
Regression Analysis

The regression equation is
Grade = 52.7 - 2.65 absence + 0.357 IQ

Predictor    Coef      StDev     T        P
Constant     52.720    86.110    0.61     0.573
Absence      -2.652    2.111     -1.26    0.277
IQ           0.357     0.580     0.62     0.571

S = 4.603   R-Sq = 95.4%   R-Sq(adj) = 93.2%
Interpretation
The regression equation is
Grade = 52.7 - 2.65 absence + 0.357 IQ
When both explanatory variables are 0, the predicted grade is 52.7.
If IQ is held constant, each additional absence decreases the predicted grade by 2.65 points.
If the number of absences is held constant and IQ is increased by one point, the predicted grade will increase by 0.357 points.
Predicting the Response Variable
Use the regression equation to predict a grade when a student is absent 5 times and has an IQ of 125.
Grade = 52.7 - 2.65(5) + 0.357(125) = 84.075 (about 84)
Use the regression equation to predict a grade when a student is absent 9 times and has an IQ of 120.
Grade = 52.7 - 2.65(9) + 0.357(120) = 71.69 (about 72)
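Evaluating the multiple regression equation is plain arithmetic, and a short Python sketch makes the substitution explicit. The coefficients are the rounded ones from the regression equation (52.7, -2.65, 0.357); the function name `predict_grade` and its parameter names are chosen here for illustration.

```python
# Rounded coefficients from the regression equation
# Grade = 52.7 - 2.65 absence + 0.357 IQ
INTERCEPT = 52.7
ABSENCE_COEF = -2.65
IQ_COEF = 0.357

def predict_grade(absences, iq):
    """Predicted grade from number of absences and IQ."""
    return INTERCEPT + ABSENCE_COEF * absences + IQ_COEF * iq

print(round(predict_grade(5, 125), 3))  # 84.075
print(round(predict_grade(9, 120), 3))  # 71.69
```

Holding one argument fixed and raising the other by 1 changes the result by exactly that variable's coefficient, which is the interpretation given for each slope.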
