
COMPLETE

BUSINESS
STATISTICS
by
AMIR D. ACZEL
&
JAYAVEL SOUNDERPANDIAN
7th edition.

Prepared by Lloyd Jaisingh, Morehead State University

Chapter 10
Simple Linear Regression and Correlation

McGraw-Hill/Irwin Copyright © 2009 by The McGraw-Hill Companies, Inc. All rights reserved.
10-2

10 Simple Linear Regression and Correlation


• Using Statistics
• The Simple Linear Regression Model
• Estimation: The Method of Least Squares
• Error Variance and the Standard Errors of Regression Estimators
• Correlation
• Hypothesis Tests about the Regression Relationship
• How Good is the Regression?
• Analysis of Variance Table and an F Test of the Regression Model
• Residual Analysis and Checking for Model Inadequacies
• Use of the Regression Model for Prediction
• The Solver Method for Regression
10-3

10 LEARNING OBJECTIVES
After studying this chapter, you should be able to:
• Determine whether a regression experiment would be useful in a given
instance
• Formulate a regression model
• Compute a regression equation
• Compute the covariance and the correlation coefficient of two random
variables
• Compute confidence intervals for regression coefficients
• Compute a prediction interval for the dependent variable
10-4

10 LEARNING OBJECTIVES (continued)

After studying this chapter, you should be able to:


• Test hypotheses about regression coefficients
• Conduct an ANOVA experiment using regression results
• Analyze residuals to check if the assumptions about the
regression model are valid
• Solve regression problems using spreadsheet templates
• Use the LINEST function to carry out a regression
10-5

10-1 Using Statistics

• Regression refers to the statistical technique of modeling the
relationship between variables.
• In simple linear regression, we model the relationship
between two variables.
• One of the variables, denoted by Y, is called the dependent
variable and the other, denoted by X, is called the
independent variable.
• The model we will use to depict the relationship between X and
Y will be a straight-line relationship.
• A graphical sketch of the pairs (X, Y) is called a scatter
plot.
10-6

10-1 Using Statistics


This scatterplot locates pairs of observations of advertising expenditures on the x-axis and sales on the y-axis.

[Figure: Scatterplot of Advertising Expenditures (X) and Sales (Y); x-axis: Advertising, 0 to 50; y-axis: Sales, 0 to 140]

We notice that:
• Larger (smaller) values of sales tend to be associated with larger (smaller) values of advertising.
• The scatter of points tends to be distributed around a positively sloped straight line.
• The pairs of values of advertising expenditures and sales are not located exactly on a straight line.
• The scatter plot reveals a more or less strong tendency rather than a precise linear relationship.
• The line represents the nature of the relationship on average.
10-7

Examples of Other Scatterplots

[Figure: two rows of example scatterplots of Y against X]
10-8

Model Building

The inexact nature of the relationship between advertising and sales suggests that a statistical model might be useful in analyzing the relationship.

A statistical model separates the systematic component of a relationship from the random component.

[Diagram: Data → Statistical model → Systematic component + Random errors]

In ANOVA, the systematic component is the variation of means between samples or treatments (SSTR) and the random component is the unexplained variation (SSE).

In regression, the systematic component is the overall linear relationship, and the random component is the variation around the line.
10-9

10-2 The Simple Linear Regression Model

The population simple linear regression model:

$$Y = \beta_0 + \beta_1 X + \varepsilon$$

Here $\beta_0 + \beta_1 X$ is the nonrandom or systematic component and $\varepsilon$ is the random component, where
• Y is the dependent variable, the variable we wish to explain or predict
• X is the independent variable, also called the predictor variable
• $\varepsilon$ is the error term, the only random component in the model, and thus the only source of randomness in Y
• $\beta_0$ is the intercept of the systematic component of the regression relationship
• $\beta_1$ is the slope of the systematic component

The conditional mean of Y: $E[Y \mid X] = \beta_0 + \beta_1 X$


10-10

Picturing the Simple Linear Regression Model

The simple linear regression model gives an exact linear relationship between the expected or average value of Y, the dependent variable, and X, the independent or predictor variable:

$$E[Y_i] = \beta_0 + \beta_1 X_i$$

Actual observed values of Y differ from the expected value by an unexplained or random error:

$$Y_i = E[Y_i] + \varepsilon_i = \beta_0 + \beta_1 X_i + \varepsilon_i$$

[Figure: Regression plot of the line $E[Y] = \beta_0 + \beta_1 X$, marking the intercept $\beta_0$, the slope $\beta_1$, and the error $\varepsilon_i$ between an observed $Y_i$ and $E[Y_i]$ at $X_i$]
10-11

Assumptions of the Simple Linear Regression Model

• The relationship between X and Y is a straight-line relationship.
• The values of the independent variable X are assumed fixed (not random); the only randomness in the values of Y comes from the error term $\varepsilon_i$.
• The errors $\varepsilon_i$ are normally distributed with mean 0 and variance $\sigma^2$. The errors are uncorrelated (not related) in successive observations. That is: $\varepsilon \sim N(0, \sigma^2)$

[Figure: Assumptions of the Simple Linear Regression Model; identical normal distributions of errors, all centered on the regression line $E[Y] = \beta_0 + \beta_1 X$]
10-12

10-3 Estimation: The Method of Least Squares

Estimation of a simple linear regression relationship involves finding estimated or predicted values of the intercept and slope of the linear regression line.

The estimated regression equation:

$$Y = b_0 + b_1 X + e$$

where $b_0$ estimates the intercept of the population regression line, $\beta_0$; $b_1$ estimates the slope of the population regression line, $\beta_1$; and $e$ stands for the observed errors, the residuals from fitting the estimated regression line $b_0 + b_1 X$ to a set of n points.

The estimated regression line:

$$\hat{Y} = b_0 + b_1 X$$

where $\hat{Y}$ (Y-hat) is the value of Y lying on the fitted regression line for a given value of X.
10-13

Fitting a Regression Line


[Figure: four panels of Y against X. Data; three errors from a fitted line; three errors from the least squares regression line; errors from the least squares regression line are minimized]
10-14

Errors in Regression

[Figure: the observed data point $Y_i$ at $X_i$ and the fitted regression line $\hat{Y} = b_0 + b_1 X$]

• $Y_i$ is the observed data point.
• $\hat{Y}_i$ is the predicted value of Y for $X_i$.
• The error is $e_i = Y_i - \hat{Y}_i$.
10-15

Least Squares Regression

The sum of squared errors in regression is:

$$SSE = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

The least squares regression line is that which minimizes the SSE with respect to the estimates $b_0$ and $b_1$.

The normal equations:

$$\sum_{i=1}^{n} y_i = n b_0 + b_1 \sum_{i=1}^{n} x_i$$

$$\sum_{i=1}^{n} x_i y_i = b_0 \sum_{i=1}^{n} x_i + b_1 \sum_{i=1}^{n} x_i^2$$

[Figure: SSE plotted over $b_0$ and $b_1$; at the least squares $b_0$ and $b_1$, SSE is minimized with respect to $b_0$ and $b_1$]
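
As a brief sketch of where the normal equations come from (standard calculus, not spelled out on the slide), write $\hat{y}_i = b_0 + b_1 x_i$ and set both partial derivatives of SSE to zero:

$$\frac{\partial SSE}{\partial b_0} = -2 \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i) = 0, \qquad \frac{\partial SSE}{\partial b_1} = -2 \sum_{i=1}^{n} x_i (y_i - b_0 - b_1 x_i) = 0$$

Dividing by -2 and moving the $b_0$ and $b_1$ terms to the right-hand side gives $\sum y_i = n b_0 + b_1 \sum x_i$ and $\sum x_i y_i = b_0 \sum x_i + b_1 \sum x_i^2$.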
10-16

Sums of Squares, Cross Products, and Least Squares Estimators

Sums of Squares and Cross Products:

$$SS_x = \sum (x - \bar{x})^2 = \sum x^2 - \frac{(\sum x)^2}{n}$$

$$SS_y = \sum (y - \bar{y})^2 = \sum y^2 - \frac{(\sum y)^2}{n}$$

$$SS_{xy} = \sum (x - \bar{x})(y - \bar{y}) = \sum xy - \frac{(\sum x)(\sum y)}{n}$$

Least-squares regression estimators:

$$b_1 = \frac{SS_{XY}}{SS_X}$$

$$b_0 = \bar{y} - b_1 \bar{x}$$
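
The estimator formulas above can be sketched in a few lines of Python. This is a minimal illustration, not the textbook's template: the function name `least_squares` and the four-point data set are made up for the example.

```python
def least_squares(x, y):
    """Return (b0, b1) from the sums-of-squares shortcut formulas."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    # SS_x = sum(x^2) - (sum x)^2 / n
    ss_x = sum(v * v for v in x) - sum_x ** 2 / n
    # SS_xy = sum(xy) - (sum x)(sum y) / n
    ss_xy = sum(a * b for a, b in zip(x, y)) - sum_x * sum_y / n
    b1 = ss_xy / ss_x                  # slope
    b0 = sum_y / n - b1 * sum_x / n    # intercept: y-bar - b1 * x-bar
    return b0, b1


# Hypothetical data lying exactly on the line y = 1 + 2x
b0, b1 = least_squares([1, 2, 3, 4], [3, 5, 7, 9])
print(b0, b1)
```

Because the illustrative points fall exactly on a line, the fitted intercept and slope recover that line.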
10-17

Example 10-1

Miles    Dollars    Miles²         Miles*Dollars
1211     1802       1466521        2182222
1345     2405       1809025        3234725
1422     2005       2022084        2851110
1687     2511       2845969        4236057
1849     2332       3418801        4311868
2026     2305       4104676        4669930
2133     3016       4549689        6433128
2253     3385       5076009        7626405
2400     3090       5760000        7416000
2468     3694       6091024        9116792
2699     3371       7284601        9098329
2806     3998       7873636        11218388
3082     3555       9498724        10956510
3209     4692       10297681       15056628
3466     4244       12013156       14709704
3643     5298       13271449       19300614
3852     4801       14837904       18493452
4033     5147       16265089       20757852
4267     5738       18207288       24484046
4498     6420       20232004       28877160
4533     6059       20548088       27465448
4804     6426       23078416       30870504
5090     6321       25908100       32173890
5233     7026       27384288       36767056
5439     6964       29582720       37877196
Total:
79,448   106,605    293,426,946    390,185,014

$$SS_x = \sum x^2 - \frac{(\sum x)^2}{n} = 293,426,946 - \frac{(79,448)^2}{25} = 40,947,557.84$$

$$SS_{xy} = \sum xy - \frac{(\sum x)(\sum y)}{n} = 390,185,014 - \frac{(79,448)(106,605)}{25} = 51,402,852.4$$

$$b_1 = \frac{SS_{XY}}{SS_X} = \frac{51,402,852.4}{40,947,557.84} = 1.255333776 \approx 1.26$$

$$b_0 = \bar{y} - b_1 \bar{x} = \frac{106,605}{25} - (1.255333776)\left(\frac{79,448}{25}\right) = 274.85$$
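
The arithmetic of Example 10-1 can be checked from the column totals alone. The sketch below uses only the four totals and n = 25 as given on the slide; the variable names are illustrative.

```python
# Column totals from Example 10-1 (n = 25 observations)
n = 25
sum_x, sum_y = 79_448, 106_605                 # total Miles, total Dollars
sum_x2, sum_xy = 293_426_946, 390_185_014      # total Miles^2, total Miles*Dollars

ss_x = sum_x2 - sum_x ** 2 / n        # SS_x
ss_xy = sum_xy - sum_x * sum_y / n    # SS_xy
b1 = ss_xy / ss_x                     # slope estimate
b0 = sum_y / n - b1 * sum_x / n       # intercept estimate

print(round(ss_x, 2), round(ss_xy, 1), round(b1, 4), round(b0, 2))
```

Working from totals rather than the 25 rows mirrors the shortcut formulas and avoids depending on the rounding of individual table entries.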
10-18

Template (partial output) that can be used to carry out a Simple Regression
10-19

Template (continued) that can be used to carry out a Simple Regression
10-20

Template (continued) that can be used to carry out a Simple Regression

Residual Analysis. The plot shows the absence of a relationship between the residuals and the X-values (miles).
10-21

Template (continued) that can be used to carry out a Simple Regression

Note: The normal probability plot is approximately linear. This would indicate that the normality assumption for the errors has not been violated.
10-22

Total Variance and Error Variance


[Figure: two plots of Y against X]

• What you see when looking at the total variation of Y.
• What you see when looking along the regression line at the error variance of Y.
10-23

10-4 Error Variance and the Standard Errors of Regression Estimators

Degrees of Freedom in Regression:

$df = n - 2$ (n total observations less one degree of freedom for each parameter estimated, $b_0$ and $b_1$)

Square and sum all regression errors to find SSE:

$$SSE = \sum (Y - \hat{Y})^2 = SS_Y - \frac{(SS_{XY})^2}{SS_X} = SS_Y - b_1 SS_{XY}$$

An unbiased estimator of $\sigma^2$, denoted by $S^2$:

$$MSE = \frac{SSE}{n - 2}$$

Example 10-1:

$$SSE = SS_Y - b_1 SS_{XY} = 66855898 - (1.255333776)(51402852.4) = 2328161.2$$

$$MSE = \frac{SSE}{n - 2} = \frac{2328161.2}{23} = 101224.4$$

$$s = \sqrt{MSE} = \sqrt{101224.4} = 318.158$$
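
The error-variance calculation for Example 10-1 can be reproduced as follows. This is a small sketch that takes the slide's values of SS_Y, SS_XY, and SS_X as inputs; the variable names are illustrative.

```python
import math

n = 25
ss_y = 66_855_898        # SS_Y from Example 10-1
ss_xy = 51_402_852.4     # SS_XY from Example 10-1
ss_x = 40_947_557.84     # SS_X from Example 10-1

b1 = ss_xy / ss_x
sse = ss_y - b1 * ss_xy  # SSE = SS_Y - b1 * SS_XY
mse = sse / (n - 2)      # two degrees of freedom lost to b0 and b1
s = math.sqrt(mse)       # standard error of the regression

print(round(sse, 1), round(mse, 1), round(s, 3))
```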
10-24

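Standard Errors of Estimates in Regression

The standard error of $b_0$ (intercept):

$$s(b_0) = \frac{s \sqrt{\sum x^2}}{\sqrt{n \, SS_X}}$$

where $s = \sqrt{MSE}$.

The standard error of $b_1$ (slope):

$$s(b_1) = \frac{s}{\sqrt{SS_X}}$$

Example 10-1:

$$s(b_0) = \frac{s \sqrt{\sum x^2}}{\sqrt{n \, SS_X}} = \frac{318.158 \sqrt{293426946}}{\sqrt{(25)(40947557.84)}} = 170.338$$

$$s(b_1) = \frac{s}{\sqrt{SS_X}} = \frac{318.158}{\sqrt{40947557.84}} = 0.04972$$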
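
A short numerical check of the two standard errors, assuming the Example 10-1 values s = 318.158, Σx² = 293,426,946, and SS_X = 40,947,557.84 (the variable names are illustrative):

```python
import math

n = 25
s = 318.158              # sqrt(MSE) from Example 10-1
sum_x2 = 293_426_946     # sum of squared Miles
ss_x = 40_947_557.84     # SS_X

se_b0 = s * math.sqrt(sum_x2) / math.sqrt(n * ss_x)  # standard error of the intercept
se_b1 = s / math.sqrt(ss_x)                          # standard error of the slope

print(round(se_b0, 2), round(se_b1, 5))
```

The intercept's standard error agrees with the slide's 170.338 to within rounding of the inputs.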
10-25

Confidence Intervals for the Regression Parameters

A $(1 - \alpha)100\%$ confidence interval for $\beta_0$:

$$b_0 \pm t_{\alpha/2,(n-2)} \, s(b_0)$$

A $(1 - \alpha)100\%$ confidence interval for $\beta_1$:

$$b_1 \pm t_{\alpha/2,(n-2)} \, s(b_1)$$

Example 10-1, 95% Confidence Intervals:

$$b_0 \pm t_{0.025,(25-2)} \, s(b_0) = 274.85 \pm (2.069)(170.338) = 274.85 \pm 352.43 = [-77.58, 627.28]$$

$$b_1 \pm t_{0.025,(25-2)} \, s(b_1) = 1.25533 \pm (2.069)(0.04972) = 1.25533 \pm 0.10287 = [1.15246, 1.35820]$$

[Figure: least-squares point estimate $b_1 = 1.25533$, drawn as height = slope over length = 1; 0 is not a possible value of the regression slope at 95%]
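
The two intervals can be reproduced with a small helper. This sketch takes the critical value 2.069 from the slide rather than computing it from a t distribution; the helper name `conf_interval` is hypothetical.

```python
def conf_interval(estimate, std_error, t_crit):
    """Return (lower, upper) for estimate +/- t_crit * std_error."""
    half_width = t_crit * std_error
    return estimate - half_width, estimate + half_width


t_crit = 2.069  # t(0.025, 23) as quoted in Example 10-1

b0_lo, b0_hi = conf_interval(274.85, 170.338, t_crit)   # intercept
b1_lo, b1_hi = conf_interval(1.25533, 0.04972, t_crit)  # slope

print(round(b0_lo, 2), round(b0_hi, 2))
print(round(b1_lo, 5), round(b1_hi, 5))
```

Since the slope interval excludes 0, it is consistent with rejecting a zero slope at the 5% level.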
10-26

Template (partial output) that can be used to obtain Confidence Intervals for $\beta_0$ and $\beta_1$
10-27

10-5 Correlation

The correlation between two random variables, X and Y, is a measure of the degree of linear association between the two variables.

The population correlation, denoted by $\rho$, can take on any value from -1 to 1.

• $\rho = -1$ indicates a perfect negative linear relationship
• $-1 < \rho < 0$ indicates a negative linear relationship
• $\rho = 0$ indicates no linear relationship
• $0 < \rho < 1$ indicates a positive linear relationship
• $\rho = 1$ indicates a perfect positive linear relationship

The absolute value of $\rho$ indicates the strength or exactness of the relationship.


10-28

Illustrations of Correlation

[Figure: six scatterplots of Y against X illustrating $\rho = -1$, $\rho = 0$, $\rho = 1$, $\rho = -.8$, $\rho = 0$, and $\rho = .8$]
10-29

Covariance and Correlation


The covariance of two random variables X and Y:

$$Cov(X, Y) = E[(X - \mu_X)(Y - \mu_Y)]$$

where $\mu_X$ and $\mu_Y$ are the population means of X and Y respectively.

The population correlation coefficient:

$$\rho = \frac{Cov(X, Y)}{\sigma_X \sigma_Y}$$

The sample correlation coefficient*:

$$r = \frac{SS_{XY}}{\sqrt{SS_X SS_Y}}$$

Example 10-1:

$$r = \frac{SS_{XY}}{\sqrt{SS_X SS_Y}} = \frac{51402852.4}{\sqrt{(40947557.84)(66855898)}} = \frac{51402852.4}{52321943.29} = .9824$$

*Note: If $\rho < 0$, $b_1 < 0$. If $\rho = 0$, $b_1 = 0$. If $\rho > 0$, $b_1 > 0$.
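
The sample correlation for Example 10-1 follows directly from the three sums of squares. A minimal sketch, using the slide's values (variable names are illustrative):

```python
import math

ss_x = 40_947_557.84     # SS_X
ss_y = 66_855_898        # SS_Y
ss_xy = 51_402_852.4     # SS_XY

# r = SS_XY / sqrt(SS_X * SS_Y); its sign always matches the sign of b1
r = ss_xy / math.sqrt(ss_x * ss_y)

print(round(r, 4))
```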


10-30

Hypothesis Tests for the Correlation Coefficient

$H_0: \rho = 0$ (No linear relationship)
$H_1: \rho \neq 0$ (Some linear relationship)

Test Statistic:

$$t_{(n-2)} = \frac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}}$$

Example 10-1:

$$t_{(n-2)} = \frac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}} = \frac{0.9824}{\sqrt{\dfrac{1 - 0.9651}{25 - 2}}} = \frac{0.9824}{0.0389} = 25.25$$

$t_{0.005} = 2.807 < 25.25$

$H_0$ rejected at 1% level
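
The test statistic can be evaluated with a short function. This is a sketch with a hypothetical helper name, `corr_t_stat`; the critical value 2.807 is taken from the slide. Without the slide's intermediate rounding (0.9651 and 0.0389), the statistic comes out near 25.2 rather than exactly 25.25, which does not change the conclusion.

```python
import math


def corr_t_stat(r, n):
    """t statistic with n - 2 degrees of freedom for H0: rho = 0."""
    return r / math.sqrt((1 - r * r) / (n - 2))


t_stat = corr_t_stat(0.9824, 25)
t_crit = 2.807                   # t(0.005, 23) as quoted on the slide
reject_h0 = abs(t_stat) > t_crit # two-tailed test at the 1% level

print(round(t_stat, 1), reject_h0)
```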
10-31

10-6 Hypothesis Tests about the Regression Relationship

[Figure: three plots of Y against X with no linear relationship: Constant Y; Unsystematic Variation; Nonlinear Relationship]

A hypothesis test for the existence of a linear relationship between X and Y:

$H_0: \beta_1 = 0$
$H_1: \beta_1 \neq 0$

Test statistic for the existence of a linear relationship between X and Y:

$$t_{(n-2)} = \frac{b_1}{s(b_1)}$$

where $b_1$ is the least-squares estimate of the regression slope and $s(b_1)$ is the standard error of $b_1$.
When the null hypothesis is true, the statistic has a t distribution with n - 2 degrees of freedom.
10-32

Hypothesis Tests for the Regression Slope

Example 10-1:

$H_0: \beta_1 = 0$
$H_1: \beta_1 \neq 0$

$$t_{(n-2)} = \frac{b_1}{s(b_1)} = \frac{1.25533}{0.04972} = 25.25$$

$t_{(0.005, 23)} = 2.807 < 25.25$

$H_0$ is rejected at the 1% level and we may conclude that there is a relationship between charges and miles traveled.

Example 10-4:

$H_0: \beta_1 = 1$
$H_1: \beta_1 \neq 1$

$$t_{(n-2)} = \frac{b_1 - 1}{s(b_1)} = \frac{1.24 - 1}{0.21} = 1.14$$

$t_{(0.05, 58)} = 1.671 > 1.14$

$H_0$ is not rejected at the 10% level. We may not conclude that the beta coefficient is different from 1.
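
Both slope tests use the same statistic with a different hypothesized value. A minimal sketch, with a hypothetical helper `slope_t_stat` and the critical values quoted on the slide:

```python
def slope_t_stat(b1, se_b1, beta1_null=0.0):
    """t statistic for H0: beta1 = beta1_null, with n - 2 degrees of freedom."""
    return (b1 - beta1_null) / se_b1


# Example 10-1: H0: beta1 = 0, critical value t(0.005, 23) = 2.807
t_10_1 = slope_t_stat(1.25533, 0.04972)
reject_10_1 = abs(t_10_1) > 2.807

# Example 10-4: H0: beta1 = 1, critical value t(0.05, 58) = 1.671
t_10_4 = slope_t_stat(1.24, 0.21, beta1_null=1.0)
reject_10_4 = abs(t_10_4) > 1.671

print(round(t_10_1, 2), reject_10_1)
print(round(t_10_4, 2), reject_10_4)
```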
10-33

10-7 How Good is the Regression?

The coefficient of determination, $r^2$, is a descriptive measure of the strength of the regression relationship, a measure of how well the regression line fits the data.

$$(y - \bar{y}) = (y - \hat{y}) + (\hat{y} - \bar{y})$$

Total Deviation = Unexplained Deviation (Error) + Explained Deviation (Regression)

$$\sum (y - \bar{y})^2 = \sum (y - \hat{y})^2 + \sum (\hat{y} - \bar{y})^2$$

SST = SSE + SSR

$$r^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}$$

$r^2$ is the percentage of total variation explained by the regression.

[Figure: a data point Y, the fitted value $\hat{Y}$, and the mean $\bar{Y}$ at a given X, showing the unexplained, explained, and total deviations]
10-34

The Coefficient of Determination

[Figure: three scatterplots of Y against X comparing SST, SSE, and SSR for $r^2 = 0$, $r^2 = 0.50$, and $r^2 = 0.90$]

Example 10-1:

$$r^2 = \frac{SSR}{SST} = \frac{64527736.8}{66855898} = 0.96518$$

[Figure: scatterplot of Dollars (2000 to 7000) against Miles (1000 to 5500) with the fitted line]
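
The two equivalent forms of the coefficient of determination can be compared numerically. A small sketch using the SSR and SST values of Example 10-1 (variable names are illustrative):

```python
sst = 66_855_898       # total sum of squares (SS_Y)
ssr = 64_527_736.8     # regression sum of squares
sse = sst - ssr        # error sum of squares, from SST = SSE + SSR

r2 = ssr / sst             # explained share of total variation
r2_alt = 1 - sse / sst     # same quantity via the unexplained share

print(round(r2, 5), round(r2_alt, 5))
```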
10-35

10-8 Analysis-of-Variance Table and an F Test of the Regression Model

Source of Variation    Sum of Squares    Degrees of Freedom    Mean Square    F Ratio
Regression             SSR               1                     MSR            MSR/MSE
Error                  SSE               n - 2                 MSE
Total                  SST               n - 1                 MST

Example 10-1

Source of Variation    Sum of Squares    Degrees of Freedom    Mean Square    F Ratio    p Value
Regression             64527736.8        1                     64527736.8     637.47     0.000
Error                  2328161.2         23                    101224.4
Total                  66855898.0        24
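
The entries of the Example 10-1 ANOVA table can be rebuilt from SSR, SSE, and n. This sketch does not compute the p value, which would need an F distribution function; variable names are illustrative.

```python
n = 25
ssr = 64_527_736.8     # regression sum of squares
sse = 2_328_161.2      # error sum of squares

sst = ssr + sse        # total sum of squares
msr = ssr / 1          # regression mean square, 1 degree of freedom
mse = sse / (n - 2)    # error mean square, n - 2 degrees of freedom
f_ratio = msr / mse    # F statistic with (1, n - 2) degrees of freedom

print(round(sst, 1), round(mse, 1), round(f_ratio, 2))
```

In simple linear regression the F ratio equals the square of the slope t statistic, so 637.47 is close to 25.25 squared.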
10-36

Template (partial output) that displays Analysis of Variance and an F Test of the Regression Model
10-37

10-9 Residual Analysis and Checking for Model Inadequacies

[Figure: four residual plots, each centered on 0; three plot residuals against x or ŷ and one plots residuals against time]

• Homoscedasticity: Residuals appear completely random. No indication of model inadequacy.
• Heteroscedasticity: Variance of residuals increases when x changes.
• Residuals exhibit a linear trend with time.
• Curved pattern in residuals resulting from underlying nonlinear relationship.
10-38

Normal Probability Plot of the Residuals

Flatter than Normal
10-39

Normal Probability Plot of the Residuals

More Peaked than Normal


10-40

Normal Probability Plot of the Residuals

Positively Skewed
10-41

Normal Probability Plot of the Residuals

Negatively Skewed
10-42

10-10 Use of the Regression Model for Prediction

• Point Prediction
A single-valued estimate of Y for a given value of X obtained by inserting the value of X in the estimated regression equation.
• Prediction Interval
For a value of Y given a value of X:
  • Variation in regression line estimate
  • Variation of points around regression line
For an average value of Y given a value of X:
  • Variation in regression line estimate
10-43

Errors in Predicting E[Y|X]

[Figure: two panels of Y against X]

1) Uncertainty about the slope of the regression line: the regression line shown with upper and lower limits on slope.
2) Uncertainty about the intercept of the regression line: the regression line shown with upper and lower limits on intercept.
10-44

Prediction Interval for E[Y|X]

[Figure: regression line with the prediction band for E[Y|X]]

• The prediction band for E[Y|X] is narrowest at the mean value of X.
• The prediction band widens as the distance from the mean of X increases.
• Predictions become very unreliable when we extrapolate beyond the range of the sample itself.


10-45

Additional Error in Predicting Individual Value of Y

3) Variation around the regression line

[Figure: two panels. Left: points varying around the regression line. Right: the regression line with the prediction band for E[Y|X] and the prediction band for Y]
10-46

Prediction Interval for a Value of Y

A $(1 - \alpha)100\%$ prediction interval for Y:

$$\hat{y} \pm t_{\alpha/2} \, s \sqrt{1 + \frac{1}{n} + \frac{(x - \bar{x})^2}{SS_X}}$$

Example 10-1 (X = 4,000):

$$\{274.85 + (1.2553)(4,000)\} \pm 2.069 \times 318.16 \sqrt{1 + \frac{1}{25} + \frac{(4,000 - 3,177.92)^2}{40,947,557.84}}$$

$$= 5296.05 \pm 676.62 = [4619.43, 5972.67]$$


10-47

Prediction Interval for the Average Value of Y

A $(1 - \alpha)100\%$ prediction interval for $E[Y \mid X]$:

$$\hat{y} \pm t_{\alpha/2} \, s \sqrt{\frac{1}{n} + \frac{(x - \bar{x})^2}{SS_X}}$$

Example 10-1 (X = 4,000):

$$\{274.85 + (1.2553)(4,000)\} \pm 2.069 \times 318.16 \sqrt{\frac{1}{25} + \frac{(4,000 - 3,177.92)^2}{40,947,557.84}}$$

$$= 5,296.05 \pm 156.48 = [5139.57, 5452.53]$$


10-48

Template Output with Prediction Intervals
10-49

10-11 The Excel Solver Method for Regression

The Solver macro available in Excel can also be used to conduct a
simple linear regression. See the text for instructions.
10-50

Using Minitab Fitted-Line Plot for Regression

[Figure: Minitab Fitted Line Plot, Y = -0.8465 + 1.352 X; S = 0.184266, R-Sq = 95.2%, R-Sq(adj) = 94.8%; x-axis: X, 5.5 to 7.5; y-axis: Y, 6.0 to 9.0]