
Simple Linear Regression

Simple Linear Regression Model


Least Squares Method
Coefficient of Determination
Model Assumptions
Testing for Significance
Using the Estimated Regression Equation
for Estimation and Prediction
Residual Analysis: Validating Model Assumptions
Outliers and Influential Observations
Simple Linear Regression
Regression analysis can be used to develop an
equation showing how the variables are related.
Managerial decisions often are based on the
relationship between two or more variables.
The variables being used to predict the value of the
dependent variable are called the independent
variables and are denoted by x.
The variable being predicted is called the dependent
variable and is denoted by y.
Simple Linear Regression
The relationship between the two variables is
approximated by a straight line.
Simple linear regression involves one independent
variable and one dependent variable.
Regression analysis involving two or more
independent variables is called multiple regression.
Simple Linear Regression Model
The equation that describes how y is related to x and
an error term is called the regression model.
The simple linear regression model is:
$$y = \beta_0 + \beta_1 x + \varepsilon$$
where:
$\beta_0$ and $\beta_1$ are called parameters of the model,
$\varepsilon$ is a random variable called the error term.
Simple Linear Regression Equation
The simple linear regression equation is:
$$E(y) = \beta_0 + \beta_1 x$$
Graph of the regression equation is a straight line.
$\beta_0$ is the y intercept of the regression line.
$\beta_1$ is the slope of the regression line.
$E(y)$ is the expected value of y for a given x value.
Simple Linear Regression Equation
Positive Linear Relationship
[Figure: regression line of $E(y)$ against x, with intercept $\beta_0$. Slope $\beta_1$ is positive.]

Simple Linear Regression Equation
Negative Linear Relationship
[Figure: regression line of $E(y)$ against x, with intercept $\beta_0$. Slope $\beta_1$ is negative.]

Simple Linear Regression Equation
No Relationship
[Figure: regression line of $E(y)$ against x, with intercept $\beta_0$. Slope $\beta_1$ is 0.]

Estimated Simple Linear Regression Equation
The estimated simple linear regression equation is:
$$\hat{y} = b_0 + b_1 x$$
The graph is called the estimated regression line.
$b_0$ is the y intercept of the line.
$b_1$ is the slope of the line.
$\hat{y}$ is the estimated value of y for a given x value.
Estimation Process

Regression Model
$$y = \beta_0 + \beta_1 x + \varepsilon$$
Regression Equation
$$E(y) = \beta_0 + \beta_1 x$$
Unknown Parameters
$\beta_0, \beta_1$

Sample Data:
x        y
$x_1$    $y_1$
.        .
.        .
$x_n$    $y_n$

Estimated Regression Equation
$$\hat{y} = b_0 + b_1 x$$
Sample Statistics
$b_0, b_1$

$b_0$ and $b_1$ provide estimates of $\beta_0$ and $\beta_1$.
Least Squares Method
Least Squares Criterion
$$\min \sum (y_i - \hat{y}_i)^2$$
where:
$y_i$ = observed value of the dependent variable
for the ith observation
$\hat{y}_i$ = estimated value of the dependent variable
for the ith observation
Least Squares Method
Slope for the Estimated Regression Equation
$$b_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$$
where:
$x_i$ = value of independent variable for ith observation
$y_i$ = value of dependent variable for ith observation
$\bar{x}$ = mean value for independent variable
$\bar{y}$ = mean value for dependent variable
Least Squares Method
y-Intercept for the Estimated Regression Equation
$$b_0 = \bar{y} - b_1 \bar{x}$$
Simple Linear Regression
Example: Reed Auto Sales
Reed Auto periodically has a special week-long sale.
As part of the advertising campaign Reed runs one or
more television commercials during the weekend
preceding the sale. Data from a sample of 5 previous
sales are shown below.

Number of TV Ads (x)    Number of Cars Sold (y)
1                       14
3                       24
2                       18
1                       17
3                       27
$\sum x = 10$           $\sum y = 100$
$\bar{x} = 2$           $\bar{y} = 20$
Estimated Regression Equation
Slope for the Estimated Regression Equation
$$b_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} = \frac{20}{4} = 5$$
y-Intercept for the Estimated Regression Equation
$$b_0 = \bar{y} - b_1 \bar{x} = 20 - 5(2) = 10$$
Estimated Regression Equation
$$\hat{y} = 10 + 5x$$
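The least squares calculation for the Reed Auto sample can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the function name `least_squares` is an illustrative choice, not something from the slides.

```python
# Reed Auto sample from the slides: TV ads (x) and cars sold (y).
x = [1, 3, 2, 1, 3]
y = [14, 24, 18, 17, 27]


def least_squares(x, y):
    """Return (b0, b1) for the estimated regression line y_hat = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Numerator and denominator of the slope formula.
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


b0, b1 = least_squares(x, y)
print(b0, b1)  # 10.0 5.0
```

The result matches the slope of 5 and the intercept of 10 worked out by hand.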
Scatter Diagram and Trend Line
[Figure: scatter diagram of Cars Sold (0 to 30) against TV Ads (0 to 4), with trend line y = 5x + 10.]
Coefficient of Determination
Relationship Among SST, SSR, SSE
SST = SSR + SSE
$$\sum (y_i - \bar{y})^2 = \sum (\hat{y}_i - \bar{y})^2 + \sum (y_i - \hat{y}_i)^2$$
where:
SST = total sum of squares
SSR = sum of squares due to regression
SSE = sum of squares due to error
Coefficient of Determination
The coefficient of determination is:
$$r^2 = \mathrm{SSR}/\mathrm{SST}$$
where:
SSR = sum of squares due to regression
SST = total sum of squares
Coefficient of Determination
$$r^2 = \mathrm{SSR}/\mathrm{SST} = 100/114 = .8772$$
The regression relationship is very strong; 87.7%
of the variability in the number of cars sold can be
explained by the linear relationship between the
number of TV ads and the number of cars sold.
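The three sums of squares and the coefficient of determination can be reproduced directly from the Reed Auto data. The sketch below assumes the estimated equation with intercept 10 and slope 5 from the slides; the variable names are illustrative.

```python
# Reed Auto sample and the estimated regression equation from the slides.
x = [1, 3, 2, 1, 3]
y = [14, 24, 18, 17, 27]
b0, b1 = 10.0, 5.0

y_bar = sum(y) / len(y)
y_hat = [b0 + b1 * xi for xi in x]

sst = sum((yi - y_bar) ** 2 for yi in y)               # total sum of squares
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)           # due to regression
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # due to error

r_squared = ssr / sst
print(sst, ssr, sse)        # 114.0 100.0 14.0
print(round(r_squared, 4))  # 0.8772
```

The output also confirms the identity SST = SSR + SSE for this sample (114 = 100 + 14).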
Sample Correlation Coefficient
$$r_{xy} = (\text{sign of } b_1)\sqrt{\text{Coefficient of Determination}}$$
$$r_{xy} = (\text{sign of } b_1)\sqrt{r^2}$$
where:
$b_1$ = the slope of the estimated regression
equation $\hat{y} = b_0 + b_1 x$
Sample Correlation Coefficient
$$r_{xy} = (\text{sign of } b_1)\sqrt{r^2}$$
The sign of $b_1$ in the equation $\hat{y} = 10 + 5x$ is "+".
$$r_{xy} = +\sqrt{.8772}$$
$$r_{xy} = +.9366$$
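As a quick check, the sample correlation coefficient can be computed two ways: from the sign of the slope and the coefficient of determination, as on the slide, and from the usual product-moment formula. This is a sketch with illustrative variable names; both routes should agree.

```python
import math

x = [1, 3, 2, 1, 3]
y = [14, 24, 18, 17, 27]
b1 = 5.0               # slope from the slides
r_squared = 100 / 114  # SSR / SST from the slides

# Route 1: attach the sign of b1 to the square root of r^2.
r_xy = math.copysign(math.sqrt(r_squared), b1)

# Route 2: product-moment formula, as a cross-check.
x_bar = sum(x) / len(x)
y_bar = sum(y) / len(y)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
syy = sum((yi - y_bar) ** 2 for yi in y)
r_direct = sxy / math.sqrt(sxx * syy)

print(round(r_xy, 4), round(r_direct, 4))  # 0.9366 0.9366
```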
Assumptions About the Error Term $\varepsilon$
1. The error $\varepsilon$ is a random variable with mean of zero.
2. The variance of $\varepsilon$, denoted by $\sigma^2$, is the same for
all values of the independent variable.
3. The values of $\varepsilon$ are independent.
4. The error $\varepsilon$ is a normally distributed random
variable.
Testing for Significance
To test for a significant regression relationship, we
must conduct a hypothesis test to determine whether
the value of $\beta_1$ is zero.
Two tests are commonly used: the t Test and the F Test.
Both the t test and F test require an estimate of $\sigma^2$,
the variance of $\varepsilon$ in the regression model.
Testing for Significance
An Estimate of $\sigma^2$
The mean square error (MSE) provides the estimate
of $\sigma^2$, and the notation $s^2$ is also used.
$$s^2 = \mathrm{MSE} = \mathrm{SSE}/(n - 2)$$
where:
$$\mathrm{SSE} = \sum (y_i - \hat{y}_i)^2 = \sum (y_i - b_0 - b_1 x_i)^2$$
Testing for Significance
An Estimate of $\sigma$
To estimate $\sigma$ we take the square root of $s^2$.
$$s = \sqrt{\mathrm{MSE}} = \sqrt{\frac{\mathrm{SSE}}{n - 2}}$$
The resulting s is called the standard error of
the estimate.
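For the Reed Auto sample, SSE = 14 and n = 5, so the mean square error and the standard error of the estimate follow directly. A minimal sketch, with illustrative variable names:

```python
import math

sse = 14.0  # sum of squares due to error for the Reed Auto sample
n = 5       # number of observations

mse = sse / (n - 2)  # estimate of sigma^2, also written s^2
s = math.sqrt(mse)   # standard error of the estimate

print(round(mse, 3), round(s, 5))  # 4.667 2.16025
```

These are the values 4.667 and 2.16025 that appear in the later F test and interval calculations.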
Testing for Significance: t Test
Hypotheses
$$H_0: \beta_1 = 0$$
$$H_a: \beta_1 \neq 0$$
Test Statistic
$$t = \frac{b_1}{s_{b_1}}$$
where
$$s_{b_1} = \frac{s}{\sqrt{\sum (x_i - \bar{x})^2}}$$
Testing for Significance: t Test
Rejection Rule
Reject $H_0$ if p-value < $\alpha$
or $t < -t_{\alpha/2}$ or $t > t_{\alpha/2}$
where:
$t_{\alpha/2}$ is based on a t distribution
with n - 2 degrees of freedom
Testing for Significance: t Test
1. Determine the hypotheses.
   $H_0: \beta_1 = 0$
   $H_a: \beta_1 \neq 0$
2. Specify the level of significance.
   $\alpha = .05$
3. Select the test statistic.
   $t = b_1 / s_{b_1}$
4. State the rejection rule.
   Reject $H_0$ if p-value < .05
   or |t| > 3.182 (with 3 degrees of freedom)
Testing for Significance: t Test
5. Compute the value of the test statistic.
   $$t = \frac{b_1}{s_{b_1}} = \frac{5}{1.08} = 4.63$$
6. Determine whether to reject $H_0$.
   With 3 degrees of freedom, t = 4.541 provides an area of .01 in
   the upper tail. Because t = 4.63 > 4.541, the two-tailed p-value
   is less than 2(.01) = .02. (Also, t = 4.63 > 3.182.) We can
   reject $H_0$.
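The t test for the Reed Auto sample can be sketched as follows. The critical value 3.182 is taken from the slides (t table, 3 degrees of freedom, .025 in each tail) rather than computed, so that the example needs only the standard library.

```python
import math

x = [1, 3, 2, 1, 3]
b1 = 5.0                # estimated slope from the slides
s = math.sqrt(14 / 3)   # standard error of the estimate, sqrt(SSE / (n - 2))
t_crit = 3.182          # table value quoted in the slides, 3 d.f., alpha = .05

x_bar = sum(x) / len(x)
sxx = sum((xi - x_bar) ** 2 for xi in x)

s_b1 = s / math.sqrt(sxx)  # estimated standard deviation of b1
t = b1 / s_b1              # test statistic
reject_h0 = abs(t) > t_crit

print(round(s_b1, 2), round(t, 2), reject_h0)  # 1.08 4.63 True
```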
Confidence Interval for $\beta_1$
We can use a 95% confidence interval for $\beta_1$ to test
the hypotheses just used in the t test.
$H_0$ is rejected if the hypothesized value of $\beta_1$ is not
included in the confidence interval for $\beta_1$.
Confidence Interval for $\beta_1$
The form of a confidence interval for $\beta_1$ is:
$$b_1 \pm t_{\alpha/2} s_{b_1}$$
where:
$b_1$ is the point estimator
$t_{\alpha/2} s_{b_1}$ is the margin of error
$t_{\alpha/2}$ is the t value providing an area
of $\alpha/2$ in the upper tail of a t distribution
with n - 2 degrees of freedom
Confidence Interval for $\beta_1$
Rejection Rule
Reject $H_0$ if 0 is not included in
the confidence interval for $\beta_1$.
95% Confidence Interval for $\beta_1$
$$b_1 \pm t_{\alpha/2} s_{b_1} = 5 \pm 3.182(1.08) = 5 \pm 3.44$$
or 1.56 to 8.44
Conclusion
0 is not included in the confidence interval.
Reject $H_0$.
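The interval arithmetic for the slope can be verified in a few lines. This sketch again uses the table value 3.182 quoted in the slides and the fact that the sum of squared x deviations is 4 for this sample; variable names are illustrative.

```python
import math

b1 = 5.0                           # point estimator of the slope
s_b1 = math.sqrt(14 / 3) / math.sqrt(4)  # s / sqrt(sum of squared x deviations)
t_crit = 3.182                     # table value quoted in the slides, 3 d.f.

margin = t_crit * s_b1
lower, upper = b1 - margin, b1 + margin
contains_zero = lower <= 0 <= upper

print(round(margin, 2), round(lower, 2), round(upper, 2))  # 3.44 1.56 8.44
print(contains_zero)  # False, so H0 is rejected
```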

Testing for Significance: F Test
Hypotheses
$$H_0: \beta_1 = 0$$
$$H_a: \beta_1 \neq 0$$
Test Statistic
$$F = \mathrm{MSR}/\mathrm{MSE}$$
Testing for Significance: F Test
Rejection Rule
Reject $H_0$ if p-value < $\alpha$ or $F > F_\alpha$
where:
$F_\alpha$ is based on an F distribution with
1 degree of freedom in the numerator and
n - 2 degrees of freedom in the denominator
Testing for Significance: F Test
1. Determine the hypotheses.
   $H_0: \beta_1 = 0$
   $H_a: \beta_1 \neq 0$
2. Specify the level of significance.
   $\alpha = .05$
3. Select the test statistic.
   F = MSR/MSE
4. State the rejection rule.
   Reject $H_0$ if p-value < .05
   or F > 10.13 (with 1 d.f. in numerator and
   3 d.f. in denominator)
Testing for Significance: F Test
5. Compute the value of the test statistic.
   F = MSR/MSE = 100/4.667 = 21.43
6. Determine whether to reject $H_0$.
   F = 17.44 provides an area of .025 in the upper
   tail. Because the F test uses only the upper tail and
   F = 21.43 > 17.44, the p-value is less than .025, which
   is below $\alpha$ = .05. Hence, we reject $H_0$.
The statistical evidence is sufficient to conclude
that we have a significant relationship between the
number of TV ads aired and the number of cars sold.
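The F statistic for the Reed Auto sample follows from the sums of squares already computed. In this sketch the critical value 10.13 is the table value quoted in the slides, and MSR uses 1 degree of freedom because there is a single independent variable.

```python
ssr = 100.0    # sum of squares due to regression
sse = 14.0     # sum of squares due to error
n = 5          # number of observations
f_crit = 10.13 # table value quoted in the slides, (1, 3) d.f., alpha = .05

msr = ssr / 1        # mean square due to regression
mse = sse / (n - 2)  # mean square error
f_stat = msr / mse
reject_h0 = f_stat > f_crit

print(round(f_stat, 2), reject_h0)  # 21.43 True
```

With one independent variable the F statistic equals the square of the t statistic, so 21.43 is consistent with t = 4.63 from the t test.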
Some Cautions about the
Interpretation of Significance Tests
Just because we are able to reject $H_0: \beta_1 = 0$ and
demonstrate statistical significance does not enable
us to conclude that there is a linear relationship
between x and y.
Rejecting $H_0: \beta_1 = 0$ and concluding that the
relationship between x and y is significant does
not enable us to conclude that a cause-and-effect
relationship is present between x and y.
Using the Estimated Regression Equation
for Estimation and Prediction

Confidence Interval Estimate of $E(y_p)$
$$\hat{y}_p \pm t_{\alpha/2} s_{\hat{y}_p}$$
Prediction Interval Estimate of $y_p$
$$\hat{y}_p \pm t_{\alpha/2} s_{\text{ind}}$$
where:
confidence coefficient is $1 - \alpha$ and
$t_{\alpha/2}$ is based on a t distribution
with n - 2 degrees of freedom
Point Estimation
If 3 TV ads are run prior to a sale, we expect
the mean number of cars sold to be:
$$\hat{y} = 10 + 5(3) = 25 \text{ cars}$$
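Confidence Interval for $E(y_p)$
Estimate of the Standard Deviation of $\hat{y}_p$
$$s_{\hat{y}_p} = s\sqrt{\frac{1}{n} + \frac{(x_p - \bar{x})^2}{\sum (x_i - \bar{x})^2}}$$
$$s_{\hat{y}_p} = 2.16025\sqrt{\frac{1}{5} + \frac{(3 - 2)^2}{(1 - 2)^2 + (3 - 2)^2 + (2 - 2)^2 + (1 - 2)^2 + (3 - 2)^2}}$$
$$s_{\hat{y}_p} = 2.16025\sqrt{\frac{1}{5} + \frac{1}{4}} = 1.4491$$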
Confidence Interval for $E(y_p)$
The 95% confidence interval estimate of the mean
number of cars sold when 3 TV ads are run is:
$$\hat{y}_p \pm t_{\alpha/2} s_{\hat{y}_p}$$
25 ± 3.1824(1.4491)
25 ± 4.61
20.39 to 29.61 cars
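The confidence interval for the mean number of cars sold at 3 TV ads can be reproduced as below. This is a sketch under the slides' values: intercept 10, slope 5, SSE = 14, and the table value 3.1824; the helper name `mean_response_interval` is hypothetical.

```python
import math

x = [1, 3, 2, 1, 3]
b0, b1 = 10.0, 5.0   # estimated regression equation from the slides
sse = 14.0           # sum of squares due to error
t_crit = 3.1824      # table value quoted in the slides, 3 d.f.


def mean_response_interval(x_p):
    """Return (y_hat_p, s_yhat_p, lower, upper) for the mean of y at x_p."""
    n = len(x)
    s = math.sqrt(sse / (n - 2))
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    y_hat_p = b0 + b1 * x_p
    s_yhat_p = s * math.sqrt(1 / n + (x_p - x_bar) ** 2 / sxx)
    margin = t_crit * s_yhat_p
    return y_hat_p, s_yhat_p, y_hat_p - margin, y_hat_p + margin


y_hat_p, s_yhat_p, lower, upper = mean_response_interval(3)
print(y_hat_p, round(s_yhat_p, 4))        # 25.0 1.4491
print(round(lower, 2), round(upper, 2))   # 20.39 29.61
```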
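Prediction Interval for $y_p$
Estimate of the Standard Deviation
of an Individual Value of $y_p$
$$s_{\text{ind}} = s\sqrt{1 + \frac{1}{n} + \frac{(x_p - \bar{x})^2}{\sum (x_i - \bar{x})^2}}$$
$$s_{\text{ind}} = 2.16025\sqrt{1 + \frac{1}{5} + \frac{1}{4}}$$
$$s_{\text{ind}} = 2.16025(1.20416) = 2.6013$$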

Prediction Interval for $y_p$
The 95% prediction interval estimate of the number
of cars sold in one particular week when 3 TV ads
are run is:
$$\hat{y}_p \pm t_{\alpha/2} s_{\text{ind}}$$
25 ± 3.1824(2.6013)
25 ± 8.28
16.72 to 33.28 cars
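The prediction interval differs from the confidence interval for the mean only by the extra 1 under the square root, which accounts for the variability of an individual week. A minimal sketch, using the slides' values and the table value 3.1824; the helper name `prediction_interval` is hypothetical.

```python
import math

x = [1, 3, 2, 1, 3]
b0, b1 = 10.0, 5.0   # estimated regression equation from the slides
sse = 14.0           # sum of squares due to error
t_crit = 3.1824      # table value quoted in the slides, 3 d.f.


def prediction_interval(x_p):
    """Return (y_hat_p, s_ind, lower, upper) for an individual y at x_p."""
    n = len(x)
    s = math.sqrt(sse / (n - 2))
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    y_hat_p = b0 + b1 * x_p
    # The leading 1 is the only change from the mean-response formula.
    s_ind = s * math.sqrt(1 + 1 / n + (x_p - x_bar) ** 2 / sxx)
    margin = t_crit * s_ind
    return y_hat_p, s_ind, y_hat_p - margin, y_hat_p + margin


y_hat_p, s_ind, lower, upper = prediction_interval(3)
print(round(s_ind, 4))                   # 2.6013
print(round(lower, 2), round(upper, 2))  # 16.72 33.28
```

The prediction interval (16.72 to 33.28) is wider than the confidence interval for the mean (20.39 to 29.61), as expected.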


Residual Analysis
If the assumptions about the error term $\varepsilon$ appear
questionable, the hypothesis tests about the
significance of the regression relationship and the
interval estimation results may not be valid.
The residuals provide the best information about $\varepsilon$.
Residual for Observation i
$$y_i - \hat{y}_i$$
Much of the residual analysis is based on an
examination of graphical plots.
Residual Plot Against x
If the assumption that the variance of $\varepsilon$ is the same
for all values of x is valid, and the assumed
regression model is an adequate representation of the
relationship between the variables, then
the residual plot should give an overall
impression of a horizontal band of points.
[Figure: residual $y - \hat{y}$ plotted against x, centered at 0. Panel title: Good Pattern.]

Residual Plot Against x
[Figure: residual $y - \hat{y}$ plotted against x, centered at 0. Panel title: Nonconstant Variance.]
Residual Plot Against x
[Figure: residual $y - \hat{y}$ plotted against x, centered at 0. Panel title: Model Form Not Adequate.]
Residuals
Residual Plot Against x
Observation Predicted Cars Sold Residuals
1 15 -1
2 25 -1
3 20 -2
4 15 2
5 25 2
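The predicted values and residuals in the table can be regenerated from the data and the estimated equation. A short sketch with illustrative variable names:

```python
x = [1, 3, 2, 1, 3]
y = [14, 24, 18, 17, 27]
b0, b1 = 10, 5  # estimated regression equation from the slides

predicted = [b0 + b1 * xi for xi in x]
residuals = [yi - yh for yi, yh in zip(y, predicted)]

print(predicted)  # [15, 25, 20, 15, 25]
print(residuals)  # [-1, -1, -2, 2, 2]
```

The residuals sum to zero, which is a property of a least squares fit that includes an intercept.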
Residual Plot Against x
[Figure: TV Ads Residual Plot. Residuals (-3 to 3) plotted against TV Ads (0 to 4).]
Standardized Residuals
Standardized Residual for Observation i
$$\frac{y_i - \hat{y}_i}{s_{y_i - \hat{y}_i}}$$
where:
$$s_{y_i - \hat{y}_i} = s\sqrt{1 - h_i}$$
$$h_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum (x_i - \bar{x})^2}$$
Standardized Residual Plot
The standardized residual plot can provide insight
about the assumption that the error term $\varepsilon$ has a
normal distribution.
If this assumption is satisfied, the distribution of the
standardized residuals should appear to come from a
standard normal probability distribution.
Standardized Residual Plot
Standardized Residuals
Observation    Predicted Y    Residuals    Standard Residuals
1              15             -1           -0.535
2              25             -1           -0.535
3              20             -2           -1.069
4              15              2            1.069
5              25              2            1.069
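One point worth checking numerically: the "Standard Residuals" column in the table appears to follow the spreadsheet convention of dividing each residual by the sample standard deviation of the residuals, which is not the same as the leverage-based formula $s\sqrt{1 - h_i}$. The sketch below computes both for the Reed Auto sample; treating the table as spreadsheet output is an assumption based on the matching numbers.

```python
import math

x = [1, 3, 2, 1, 3]
residuals = [-1, -1, -2, 2, 2]
n = len(x)

sse = sum(e ** 2 for e in residuals)
s = math.sqrt(sse / (n - 2))  # standard error of the estimate

x_bar = sum(x) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)

# Leverage-based standardized residuals: e_i / (s * sqrt(1 - h_i)).
leverage = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]
std_leverage = [e / (s * math.sqrt(1 - h)) for e, h in zip(residuals, leverage)]

# Assumed spreadsheet convention: e_i / sqrt(SSE / (n - 1)).
s_resid = math.sqrt(sse / (n - 1))
std_simple = [e / s_resid for e in residuals]

print([round(v, 3) for v in std_leverage])  # [-0.624, -0.624, -1.035, 1.248, 1.248]
print([round(v, 3) for v in std_simple])    # [-0.535, -0.535, -1.069, 1.069, 1.069]
```

Either way, every value lies well inside the range -2 to +2 for this sample.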
Standardized Residual Plot
RESIDUAL OUTPUT
Observation    Predicted Y    Residuals    Standard Residuals
1              15             -1           -0.534522
2              25             -1           -0.534522
3              20             -2           -1.069045
4              15              2            1.069045
5              25              2            1.069045
[Figure: Standard Residuals (-1.5 to 1.5) plotted against Cars Sold (0 to 30).]
Standardized Residual Plot
All of the standardized residuals are between -1.5
and +1.5, indicating that there is no reason to question
the assumption that $\varepsilon$ has a normal distribution.
Outliers and Influential Observations
Detecting Outliers
An outlier is an observation that is unusual in
comparison with the other data.
Minitab classifies an observation as an outlier if its
standardized residual value is < -2 or > +2.
This standardized residual rule sometimes fails to
identify an unusually large observation as being
an outlier.
This rule's shortcoming can be circumvented by
using studentized deleted residuals.
When observation i is an outlier, the absolute value of
the ith studentized deleted residual will be larger than
the absolute value of the ith standardized residual.
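The standardized residual rule of -2 and +2 is simple to apply in code. The sketch below uses a hypothetical helper `flag_outliers`; the second list of values is made up purely to show a case where the rule fires and is not data from the slides.

```python
def flag_outliers(std_residuals, threshold=2.0):
    """Return 1-based observation numbers whose standardized residual
    is below -threshold or above +threshold."""
    return [
        i
        for i, r in enumerate(std_residuals, start=1)
        if abs(r) > threshold
    ]


# Standard residuals for the Reed Auto sample, from the slides.
reed = [-0.535, -0.535, -1.069, 1.069, 1.069]
print(flag_outliers(reed))  # []

# Hypothetical values, for illustration only.
hypothetical = [0.4, -2.3, 1.1, 2.6]
print(flag_outliers(hypothetical))  # [2, 4]
```

No observation in the Reed Auto sample is flagged under this rule.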
