Complete Business Statistics: Simple Linear Regression and Correlation

COMPLETE BUSINESS STATISTICS
by AMIR D. ACZEL & JAYAVEL SOUNDERPANDIAN
7th edition
Prepared by Lloyd Jaisingh, Morehead State University
Chapter 10
Simple Linear Regression and Correlation
McGraw-Hill/Irwin Copyright © 2009 by The McGraw-Hill Companies, Inc. All rights reserved.
10-2
10 Simple Linear Regression and Correlation

• Using Statistics
• The Simple Linear Regression Model
• Estimation: The Method of Least Squares
• Error Variance and the Standard Errors of Regression Estimators
• Correlation
• Hypothesis Tests about the Regression Relationship
• How Good is the Regression?
• Analysis of Variance Table and an F Test of the Regression Model
• Residual Analysis and Checking for Model Inadequacies
• Use of the Regression Model for Prediction
• The Solver Method for Regression
10-3
10 LEARNING OBJECTIVES
After studying this chapter, you should be able to:
• Determine whether a regression experiment would be useful in a given
instance
• Formulate a regression model
• Compute a regression equation
• Compute the covariance and the correlation coefficient of two random
variables
• Compute confidence intervals for regression coefficients
• Compute a prediction interval for the dependent variable
10-4
10 LEARNING OBJECTIVES (continued)
After studying this chapter, you should be able to:

• Test hypotheses about regression coefficients
• Conduct an ANOVA experiment using regression results
• Analyze residuals to check if the assumptions about the
regression model are valid
• Solve regression problems using spreadsheet templates
• Use the LINEST function to carry out a regression
10-5
10-1 Using Statistics
• Regression refers to the statistical technique of modeling the
relationship between variables.
• In simple linear regression, we model the relationship
between two variables.
• One of the variables, denoted by Y, is called the dependent
variable and the other, denoted by X, is called the
independent variable.
• The model we will use to depict the relationship between X and
Y will be a straight-line relationship.
• A graphical sketch of the pairs (X, Y) is called a scatter
plot.
10-6
10-1 Using Statistics

[Figure: Scatterplot of Advertising Expenditures (X) and Sales (Y); x-axis: Advertising (0 to 50), y-axis: Sales (0 to 140)]
This scatterplot locates pairs of observations of advertising expenditures on the x-axis and sales on the y-axis. We notice that:
• Larger (smaller) values of sales tend to be associated with larger (smaller) values of advertising.
• The scatter of points tends to be distributed around a positively sloped straight line.
• The pairs of values of advertising expenditures and sales are not located exactly on a straight line.
• The scatter plot reveals a more or less strong tendency rather than a precise linear relationship.
• The line represents the nature of the relationship on average.
10-7
Examples of Other Scatterplots
[Figure: a panel of example scatterplots of Y against X]
10-8
Model Building
The inexact nature of the relationship between advertising and sales suggests that a statistical model might be useful in analyzing the relationship.
A statistical model separates the systematic component of a relationship from the random component.
[Diagram: Data → Statistical model → Systematic component + Random errors]
In ANOVA, the systematic component is the variation of means between samples or treatments (SSTR) and the random component is the unexplained variation (SSE).
In regression, the systematic component is the overall linear relationship, and the random component is the variation around the line.
10-9
10-2 The Simple Linear Regression Model
The population simple linear regression model:
$$Y = \beta_0 + \beta_1 X + \varepsilon$$
Here $\beta_0 + \beta_1 X$ is the nonrandom or systematic component and $\varepsilon$ is the random component,
where
• Y is the dependent variable, the variable we wish to explain or predict.
• X is the independent variable, also called the predictor variable.
• $\varepsilon$ is the error term, the only random component in the model, and thus the only source of randomness in Y.
• $\beta_0$ is the intercept of the systematic component of the regression relationship.
• $\beta_1$ is the slope of the systematic component.
The conditional mean of Y: $E[Y \mid X] = \beta_0 + \beta_1 X$
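The model above can be illustrated by simulating from it. The following is a minimal sketch, not part of the original slides: the function name `simulate_slr`, the X values, and the parameter values are all hypothetical choices made only for illustration.

```python
import random

def simulate_slr(xs, beta0, beta1, sigma, seed=0):
    """Draw Y = beta0 + beta1*X + eps, with eps ~ N(0, sigma^2), for fixed X values."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    return [beta0 + beta1 * x + rng.gauss(0.0, sigma) for x in xs]

xs = [10, 20, 30, 40, 50]            # hypothetical fixed X values
ys = simulate_slr(xs, beta0=5.0, beta1=2.0, sigma=3.0)
means = [5.0 + 2.0 * x for x in xs]  # systematic component, E[Y | X]
errors = [y - m for y, m in zip(ys, means)]  # random component
```

Setting `sigma=0.0` removes the random component, so every simulated Y then falls exactly on the line of conditional means.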

10-10
Picturing the Simple Linear Regression Model
[Figure: Regression plot of the line $E[Y] = \beta_0 + \beta_1 X$, marking the intercept $\beta_0$, the slope $\beta_1$, and the error $\varepsilon_i$ between an observed $Y_i$ and $E[Y_i]$ at $X_i$]
The simple linear regression model gives an exact linear relationship between the expected or average value of Y, the dependent variable, and X, the independent or predictor variable:
$$E[Y_i] = \beta_0 + \beta_1 X_i$$
Actual observed values of Y differ from the expected value by an unexplained or random error:
$$Y_i = E[Y_i] + \varepsilon_i = \beta_0 + \beta_1 X_i + \varepsilon_i$$
10-11
Assumptions of the Simple Linear Regression Model
• The relationship between X and Y is a straight-line relationship.
• The values of the independent variable X are assumed fixed (not random); the only randomness in the values of Y comes from the error term $\varepsilon_i$.
• The errors $\varepsilon_i$ are normally distributed with mean 0 and variance $\sigma^2$. The errors are uncorrelated (not related) in successive observations. That is: $\varepsilon \sim N(0, \sigma^2)$
[Figure: the line $E[Y] = \beta_0 + \beta_1 X$ with identical normal distributions of errors, all centered on the regression line]
10-12
10-3 Estimation: The Method of Least Squares
Estimation of a simple linear regression relationship involves finding estimated or predicted values of the intercept and slope of the linear regression line.
The estimated regression equation:
$$Y = b_0 + b_1 X + e$$
where $b_0$ estimates the intercept of the population regression line, $\beta_0$; $b_1$ estimates the slope of the population regression line, $\beta_1$; and $e$ stands for the observed errors, the residuals from fitting the estimated regression line $b_0 + b_1 X$ to a set of n points.
The estimated regression line:
$$\hat{Y} = b_0 + b_1 X$$
where $\hat{Y}$ (Y-hat) is the value of Y lying on the fitted regression line for a given value of X.
10-13
Fitting a Regression Line
[Figure: four panels of Y against X: (1) Data; (2) Three errors from a fitted line; (3) Three errors from the least squares regression line; (4) Errors from the least squares regression line are minimized]
10-14
Errors in Regression
[Figure: an observed data point $Y_i$ at $X_i$, the fitted regression line $\hat{Y} = b_0 + b_1 X$, and the predicted value $\hat{Y}_i$ of Y for $X_i$]
Error: $e_i = Y_i - \hat{Y}_i$
10-15
Least Squares Regression
The sum of squared errors in regression is:
$$SSE = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
The least squares regression line is that which minimizes the SSE with respect to the estimates $b_0$ and $b_1$.
The normal equations:
$$\sum_{i=1}^{n} y_i = n b_0 + b_1 \sum_{i=1}^{n} x_i$$
$$\sum_{i=1}^{n} x_i y_i = b_0 \sum_{i=1}^{n} x_i + b_1 \sum_{i=1}^{n} x_i^2$$
[Figure: SSE plotted as a surface over $b_0$ and $b_1$; at the least squares $b_0$ and least squares $b_1$, SSE is minimized with respect to $b_0$ and $b_1$]
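The two normal equations form a 2×2 linear system in $b_0$ and $b_1$, which can be solved directly. The sketch below uses Cramer's rule; the function name `solve_normal_equations` and the toy data are hypothetical and are not taken from the slides.

```python
def solve_normal_equations(n, sum_x, sum_y, sum_x2, sum_xy):
    """Solve  sum_y  = n*b0     + b1*sum_x
              sum_xy = b0*sum_x + b1*sum_x2   for (b0, b1) by Cramer's rule."""
    det = n * sum_x2 - sum_x ** 2  # determinant of the 2x2 coefficient matrix
    if det == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    b0 = (sum_y * sum_x2 - sum_x * sum_xy) / det
    b1 = (n * sum_xy - sum_x * sum_y) / det
    return b0, b1

# Hypothetical toy data lying exactly on the line y = 1 + 2x
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
b0, b1 = solve_normal_equations(
    len(xs), sum(xs), sum(ys),
    sum(x * x for x in xs),
    sum(x * y for x, y in zip(xs, ys)),
)
```

Because the toy points lie exactly on a line, the solution recovers that line's intercept and slope.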
10-16
Sums of Squares, Cross Products, and Least Squares Estimators
Sums of Squares and Cross Products:
$$SS_X = \sum (x - \bar{x})^2 = \sum x^2 - \frac{(\sum x)^2}{n}$$
$$SS_Y = \sum (y - \bar{y})^2 = \sum y^2 - \frac{(\sum y)^2}{n}$$
$$SS_{XY} = \sum (x - \bar{x})(y - \bar{y}) = \sum xy - \frac{(\sum x)(\sum y)}{n}$$
Least-squares regression estimators:
$$b_1 = \frac{SS_{XY}}{SS_X}$$
$$b_0 = \bar{y} - b_1 \bar{x}$$
10-17
Example 10-1
Miles    Dollars    Miles^2        Miles*Dollars
1211     1802       1466521        2182222
1345     2405       1809025        3234725
1422     2005       2022084        2851110
1687     2511       2845969        4236057
1849     2332       3418801        4311868
2026     2305       4104676        4669930
2133     3016       4549689        6433128
2253     3385       5076009        7626405
2400     3090       5760000        7416000
2468     3694       6091024        9116792
2699     3371       7284601        9098329
2806     3998       7873636        11218388
3082     3555       9498724        10956510
3209     4692       10297681       15056628
3466     4244       12013156       14709704
3643     5298       13271449       19300614
3852     4801       14837904       18493452
4033     5147       16265089       20757852
4267     5738       18207288       24484046
4498     6420       20232004       28877160
4533     6059       20548088       27465448
4804     6426       23078416       30870504
5090     6321       25908100       32173890
5233     7026       27384288       36767056
5439     6964       29582720       37877196
Total:
79,448   106,605    293,426,946    390,185,014

$$SS_X = \sum x^2 - \frac{(\sum x)^2}{n} = 293,426,946 - \frac{79,448^2}{25} = 40,947,557.84$$
$$SS_{XY} = \sum xy - \frac{(\sum x)(\sum y)}{n} = 390,185,014 - \frac{(79,448)(106,605)}{25} = 51,402,852.4$$
$$b_1 = \frac{SS_{XY}}{SS_X} = \frac{51,402,852.4}{40,947,557.84} = 1.255333776 \approx 1.26$$
$$b_0 = \bar{y} - b_1 \bar{x} = \frac{106,605}{25} - (1.255333776)\left(\frac{79,448}{25}\right) = 274.85$$
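The arithmetic of Example 10-1 can be checked from the column totals alone. This is a small sketch using the totals shown in the table; the variable names are illustrative only.

```python
# Column totals from Example 10-1 (Miles = x, Dollars = y)
n = 25
sum_x, sum_y = 79_448, 106_605
sum_x2, sum_xy = 293_426_946, 390_185_014

ss_x = sum_x2 - sum_x ** 2 / n       # sum of squares for x
ss_xy = sum_xy - sum_x * sum_y / n   # sum of cross products
b1 = ss_xy / ss_x                    # least-squares slope
b0 = sum_y / n - b1 * (sum_x / n)    # least-squares intercept

print(f"SS_X = {ss_x:.2f}, SS_XY = {ss_xy:.1f}")
print(f"b1 = {b1:.9f}, b0 = {b0:.2f}")
```

Working from the totals rather than the displayed rows avoids any rounding in the individual table entries.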
10-18
Template (partial output) that can be used to carry out a Simple Regression
10-19
Template (continued) that can be used to carry out a Simple Regression
10-20

Residual Analysis. The plot shows the absence of a relationship between the residuals and the X-values (miles).
10-21

Note: The normal probability plot is approximately linear. This would indicate that the normality assumption for the errors has not been violated.
10-22
Total Variance and Error Variance

[Figure: two views of the scatter of Y against X]
What you see when looking at the total variation of Y.
What you see when looking along the regression line at the error variance of Y.
10-23
10-4 Error Variance and the Standard Errors of Regression Estimators
Degrees of Freedom in Regression:
$df = n - 2$ (n total observations less one degree of freedom for each parameter estimated, $b_0$ and $b_1$)
Square and sum all regression errors to find SSE:
$$SSE = \sum (Y - \hat{Y})^2 = SS_Y - \frac{(SS_{XY})^2}{SS_X} = SS_Y - b_1 SS_{XY}$$
An unbiased estimator of $\sigma^2$, denoted by $s^2$:
$$MSE = \frac{SSE}{n - 2}$$
Example 10-1:
$$SSE = SS_Y - b_1 SS_{XY} = 66,855,898 - (1.255333776)(51,402,852.4) = 2,328,161.2$$
$$MSE = \frac{SSE}{n - 2} = \frac{2,328,161.2}{23} = 101,224.4$$
$$s = \sqrt{MSE} = \sqrt{101,224.4} = 318.158$$
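A short sketch of the error-variance calculation for Example 10-1, using the sums of squares quoted on the slides. The variable names are illustrative.

```python
import math

n = 25
ss_x = 40_947_557.84   # SS_X from Example 10-1
ss_y = 66_855_898.0    # SS_Y from Example 10-1
ss_xy = 51_402_852.4   # SS_XY from Example 10-1

b1 = ss_xy / ss_x
sse = ss_y - b1 * ss_xy   # equivalently ss_y - ss_xy**2 / ss_x
mse = sse / (n - 2)       # unbiased estimate of the error variance
s = math.sqrt(mse)        # standard error of estimate

print(f"SSE = {sse:.1f}, MSE = {mse:.1f}, s = {s:.3f}")
```

The divisor is n - 2 rather than n because two parameters, $b_0$ and $b_1$, are estimated from the data.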
10-24
Standard Errors of Estimates in Regression
The standard error of $b_0$ (intercept):
$$s(b_0) = \frac{s \sqrt{\sum x^2}}{\sqrt{n \, SS_X}}$$
where $s = \sqrt{MSE}$
The standard error of $b_1$ (slope):
$$s(b_1) = \frac{s}{\sqrt{SS_X}}$$
Example 10-1:
$$s(b_0) = \frac{318.158 \sqrt{293,426,946}}{\sqrt{(25)(40,947,557.84)}} = 170.338$$
$$s(b_1) = \frac{318.158}{\sqrt{40,947,557.84}} = 0.04972$$
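The two standard errors can be reproduced from quantities already computed in Example 10-1. A minimal sketch, with illustrative variable names:

```python
import math

n = 25
s = 318.158              # standard error of estimate, sqrt(MSE)
sum_x2 = 293_426_946     # sum of squared x values
ss_x = 40_947_557.84     # SS_X

se_b0 = s * math.sqrt(sum_x2 / (n * ss_x))  # standard error of the intercept
se_b1 = s / math.sqrt(ss_x)                 # standard error of the slope

print(f"s(b0) = {se_b0:.3f}, s(b1) = {se_b1:.5f}")
```

The intercept's standard error here differs from the slide's 170.338 only in the third decimal place, which is consistent with rounding of intermediate values.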
10-25
Confidence Intervals for the Regression Parameters
A $(1 - \alpha)100\%$ confidence interval for $\beta_0$:
$$b_0 \pm t_{\alpha/2,\,(n-2)} \, s(b_0)$$
A $(1 - \alpha)100\%$ confidence interval for $\beta_1$:
$$b_1 \pm t_{\alpha/2,\,(n-2)} \, s(b_1)$$
Example 10-1, 95% Confidence Intervals:
$$b_0 \pm t_{0.025,\,(25-2)} \, s(b_0) = 274.85 \pm (2.069)(170.338) = 274.85 \pm 352.43 = [-77.58, 627.28]$$
$$b_1 \pm t_{0.025,\,(25-2)} \, s(b_1) = 1.25533 \pm (2.069)(0.04972) = 1.25533 \pm 0.10287 = [1.15246, 1.35820]$$
[Figure: the least-squares point estimate $b_1 = 1.25533$ drawn as a slope (Height = Slope, Length = 1), with 0 marked as not a possible value of the regression slope at 95%]
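The interval arithmetic can be sketched as follows. The critical value 2.069 is taken from the slide rather than computed, and the helper name `conf_int` is hypothetical.

```python
def conf_int(estimate, std_err, t_crit):
    """Return (lower, upper) for estimate +/- t_crit * std_err."""
    half_width = t_crit * std_err
    return estimate - half_width, estimate + half_width

t_crit = 2.069  # t(0.025, 23), as quoted on the slide

ci_b0 = conf_int(274.85, 170.338, t_crit)    # interval for the intercept
ci_b1 = conf_int(1.25533, 0.04972, t_crit)   # interval for the slope

slope_interval_excludes_zero = not (ci_b1[0] <= 0.0 <= ci_b1[1])
```

The slope interval lies entirely above 0, which matches the slide's remark that 0 is not a plausible slope value at the 95% level, whereas the intercept interval does contain 0.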
10-26
Template (partial output) that can be used to obtain Confidence Intervals for $\beta_0$ and $\beta_1$
10-27
10-5 Correlation
The correlation between two random variables, X and Y, is a measure of the degree of linear association between the two variables.
The population correlation, denoted by $\rho$, can take on any value from -1 to 1.
$\rho = -1$ indicates a perfect negative linear relationship
$-1 < \rho < 0$ indicates a negative linear relationship
$\rho = 0$ indicates no linear relationship
$0 < \rho < 1$ indicates a positive linear relationship
$\rho = 1$ indicates a perfect positive linear relationship
The absolute value of $\rho$ indicates the strength or exactness of the relationship.

10-28
Illustrations of Correlation
[Figure: six scatterplots of Y against X illustrating $\rho = -1$, $\rho = 0$, $\rho = 1$, $\rho = -.8$, $\rho = 0$, and $\rho = .8$]
10-29
Covariance and Correlation

The covariance of two random variables X and Y:
$$Cov(X, Y) = E[(X - \mu_X)(Y - \mu_Y)]$$
where $\mu_X$ and $\mu_Y$ are the population means of X and Y respectively.
The population correlation coefficient:
$$\rho = \frac{Cov(X, Y)}{\sigma_X \sigma_Y}$$
The sample correlation coefficient*:
$$r = \frac{SS_{XY}}{\sqrt{SS_X SS_Y}}$$
Example 10-1:
$$r = \frac{SS_{XY}}{\sqrt{SS_X SS_Y}} = \frac{51402852.4}{\sqrt{(40947557.84)(66855898)}} = \frac{51402852.4}{52321943.29} = .9824$$
*Note: If $\rho < 0$, $b_1 < 0$; if $\rho = 0$, $b_1 = 0$; if $\rho > 0$, $b_1 > 0$.
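A brief sketch of the sample correlation for Example 10-1, computed from the sums of squares on the slides. Variable names are illustrative.

```python
import math

ss_x = 40_947_557.84   # SS_X
ss_y = 66_855_898.0    # SS_Y
ss_xy = 51_402_852.4   # SS_XY

r = ss_xy / math.sqrt(ss_x * ss_y)   # sample correlation coefficient
b1 = ss_xy / ss_x                    # least-squares slope

# r and b1 share the sign of SS_XY, since both denominators are positive
same_sign = (r > 0) == (b1 > 0)
print(f"r = {r:.4f}")
```

Both r and $b_1$ have $SS_{XY}$ in the numerator over a positive denominator, which is why their signs always agree in the sample.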

10-30
Hypothesis Tests for the Correlation Coefficient
$H_0: \rho = 0$ (No linear relationship)
$H_1: \rho \neq 0$ (Some linear relationship)
Test Statistic:
$$t_{(n-2)} = \frac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}}$$
Example 10-1:
$$t_{(n-2)} = \frac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}} = \frac{0.9824}{\sqrt{\dfrac{1 - 0.9651}{25 - 2}}} = \frac{0.9824}{0.0389} = 25.25$$
$$t_{0.005} = 2.807 < 25.25$$
$H_0$ is rejected at the 1% level.
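The test statistic can be sketched as below. The critical value 2.807 is the one quoted on the slide, and the function name `corr_t_stat` is hypothetical. Computed without rounding the denominator, the statistic comes out slightly lower than the slide's 25.25, a difference that appears to come from the slide rounding the denominator to 0.0389.

```python
import math

def corr_t_stat(r, n):
    """t statistic with n - 2 degrees of freedom for H0: rho = 0."""
    return r / math.sqrt((1.0 - r ** 2) / (n - 2))

t_stat = corr_t_stat(0.9824, 25)
t_crit = 2.807                       # t(0.005, 23), as quoted on the slide
reject_h0 = abs(t_stat) > t_crit     # two-tailed test at the 1% level
```

The conclusion is unaffected by the rounding: the statistic is far beyond the critical value either way.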
10-31
10-6 Hypothesis Tests about the Regression Relationship
[Figure: three scatterplots of Y against X: Constant Y, Unsystematic Variation, Nonlinear Relationship]
A hypothesis test for the existence of a linear relationship between X and Y:
$H_0: \beta_1 = 0$
$H_1: \beta_1 \neq 0$
Test statistic for the existence of a linear relationship between X and Y:
$$t_{(n-2)} = \frac{b_1}{s(b_1)}$$
where $b_1$ is the least-squares estimate of the regression slope and $s(b_1)$ is the standard error of $b_1$.
When the null hypothesis is true, the statistic has a t distribution with n - 2 degrees of freedom.
10-32
Hypothesis Tests for the Regression Slope
Example 10-1:
$H_0: \beta_1 = 0$
$H_1: \beta_1 \neq 0$
$$t_{(n-2)} = \frac{b_1}{s(b_1)} = \frac{1.25533}{0.04972} = 25.25$$
$$t_{(0.005,\,23)} = 2.807 < 25.25$$
$H_0$ is rejected at the 1% level and we may conclude that there is a relationship between charges and miles traveled.
Example 10-4:
$H_0: \beta_1 = 1$
$H_1: \beta_1 \neq 1$
$$t_{(n-2)} = \frac{b_1 - 1}{s(b_1)} = \frac{1.24 - 1}{0.21} = 1.14$$
$$t_{(0.05,\,58)} = 1.671 > 1.14$$
$H_0$ is not rejected at the 10% level. We may not conclude that the beta coefficient is different from 1.
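Both slope tests follow the same pattern, with only the hypothesized value changing. A minimal sketch, assuming the critical values quoted on the slide; the function name `slope_t_stat` is hypothetical.

```python
def slope_t_stat(b1, se_b1, beta1_null=0.0):
    """t statistic for H0: beta1 = beta1_null, with n - 2 degrees of freedom."""
    return (b1 - beta1_null) / se_b1

# Example 10-1: H0: beta1 = 0, critical value t(0.005, 23) = 2.807
t_ex1 = slope_t_stat(1.25533, 0.04972)
reject_ex1 = abs(t_ex1) > 2.807

# Example 10-4: H0: beta1 = 1, critical value t(0.05, 58) = 1.671
t_ex4 = slope_t_stat(1.24, 0.21, beta1_null=1.0)
reject_ex4 = abs(t_ex4) > 1.671
```

Passing the hypothesized slope as a parameter makes the test for $\beta_1 = 0$ a special case of the general test.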
10-33
10-7 How Good is the Regression?
The coefficient of determination, $r^2$, is a descriptive measure of the strength of the regression relationship, a measure of how well the regression line fits the data.
$$(y - \bar{y}) = (y - \hat{y}) + (\hat{y} - \bar{y})$$
Total Deviation = Unexplained Deviation (Error) + Explained Deviation (Regression)
$$\sum (y - \bar{y})^2 = \sum (y - \hat{y})^2 + \sum (\hat{y} - \bar{y})^2$$
SST = SSE + SSR
$$r^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}$$
$r^2$ is the percentage of total variation explained by the regression.
[Figure: a data point Y above the fitted line, showing the total deviation from $\bar{Y}$ split into the unexplained deviation (from Y to $\hat{Y}$) and the explained deviation (from $\hat{Y}$ to $\bar{Y}$)]
10-34
The Coefficient of Determination
[Figure: three scatterplots of Y against X with SST divided into SSE and SSR, for $r^2 = 0$, $r^2 = 0.50$, and $r^2 = 0.90$]
Example 10-1:
$$r^2 = \frac{SSR}{SST} = \frac{64527736.8}{66855898} = 0.96518$$
[Figure: scatterplot of Dollars (2000 to 7000) against Miles (1000 to 5500) for Example 10-1]
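The decomposition SST = SSE + SSR gives two equivalent routes to $r^2$, sketched here with the Example 10-1 values. Variable names are illustrative.

```python
sst = 66_855_898.0   # total sum of squares (SS_Y)
sse = 2_328_161.2    # error sum of squares

ssr = sst - sse                  # regression (explained) sum of squares
r2_from_ssr = ssr / sst
r2_from_sse = 1.0 - sse / sst    # same quantity, via the unexplained share

print(f"SSR = {ssr:.1f}, r^2 = {r2_from_ssr:.5f}")
```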
10-35
10-8 Analysis-of-Variance Table and an F Test of the Regression Model
Source of Variation   Sum of Squares   Degrees of Freedom   Mean Square   F Ratio
Regression            SSR              1                    MSR           MSR/MSE
Error                 SSE              n - 2                MSE
Total                 SST              n - 1                MST
Example 10-1
Source of Variation   Sum of Squares   Degrees of Freedom   Mean Square   F Ratio   p Value
Regression            64527736.8       1                    64527736.8    637.47    0.000
Error                 2328161.2        23                   101224.4
Total                 66855898.0       24
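The entries of the Example 10-1 table can be rebuilt from SSR, SSE, and n. A minimal sketch with illustrative variable names; the p value is not computed here because that would require an F distribution routine.

```python
n = 25
ssr = 64_527_736.8   # regression sum of squares
sse = 2_328_161.2    # error sum of squares

df_reg, df_err = 1, n - 2
msr = ssr / df_reg       # mean square for regression
mse = sse / df_err       # mean square error
f_ratio = msr / mse      # F statistic with (1, n - 2) degrees of freedom

print(f"MSR = {msr:.1f}, MSE = {mse:.1f}, F = {f_ratio:.2f}")
```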
10-36
Template (partial output) that displays Analysis of Variance and an F Test of the Regression Model
10-37
10-9 Residual Analysis and Checking for Model Inadequacies
[Figure: four residual plots, each with residuals scattered around 0]
• Homoscedasticity (residuals against x or $\hat{y}$): Residuals appear completely random. No indication of model inadequacy.
• Heteroscedasticity (residuals against x or $\hat{y}$): Variance of residuals increases when x changes.
• Residuals against time: Residuals exhibit a linear trend with time.
• Residuals against x or $\hat{y}$: Curved pattern in residuals resulting from underlying nonlinear relationship.
10-38
Normal Probability Plot of the Residuals
Flatter than Normal
10-39

Normal Probability Plot of the Residuals
More Peaked than Normal

10-40

Normal Probability Plot of the Residuals
Positively Skewed
10-41

Normal Probability Plot of the Residuals
Negatively Skewed
10-42
10-10 Use of the Regression Model for Prediction
• Point Prediction
A single-valued estimate of Y for a given value of X obtained by inserting the value of X in the estimated regression equation.
• Prediction Interval
For a value of Y given a value of X:
  • Variation in regression line estimate
  • Variation of points around regression line
For an average value of Y given a value of X:
  • Variation in regression line estimate
10-43
Errors in Predicting E[Y|X]
[Figure: two panels of Y against X: 1) Uncertainty about the slope of the regression line, showing the regression line between the upper limit on slope and the lower limit on slope; 2) Uncertainty about the intercept of the regression line, showing the regression line between the upper limit on intercept and the lower limit on intercept]
10-44
Prediction Interval for E[Y|X]
[Figure: Prediction Interval for E[Y|X], showing the regression line and the prediction band for E[Y|X]]
• The prediction band for E[Y|X] is narrowest at the mean value of X.
• The prediction band widens as the distance from the mean of X increases.
• Predictions become very unreliable when we extrapolate beyond the range of the sample itself.

10-45
Additional Error in Predicting Individual Value of Y
[Figure: two panels of Y against X: 3) Variation around the regression line; and the regression line shown with the prediction band for E[Y|X] and the prediction band for Y]
10-46
Prediction Interval for a Value of Y
A $(1 - \alpha)100\%$ prediction interval for Y:
$$\hat{y} \pm t_{\alpha/2} \, s \sqrt{1 + \frac{1}{n} + \frac{(x - \bar{x})^2}{SS_X}}$$
Example 10-1 (X = 4,000):
$$\{274.85 + (1.2553)(4,000)\} \pm 2.069 \times 318.16 \sqrt{1 + \frac{1}{25} + \frac{(4,000 - 3,177.92)^2}{40,947,557.84}}$$
$$= 5296.05 \pm 676.62 = [4619.43, 5972.67]$$

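The prediction interval for an individual Y at X = 4,000 can be reproduced as follows. This is a sketch using the rounded inputs shown on the slide; the function name `prediction_interval_y` is hypothetical.

```python
import math

def prediction_interval_y(x0, b0, b1, s, n, x_bar, ss_x, t_crit):
    """Interval for an individual Y at x0: y_hat +/- t*s*sqrt(1 + 1/n + (x0 - x_bar)^2 / SS_X)."""
    y_hat = b0 + b1 * x0
    half_width = t_crit * s * math.sqrt(1.0 + 1.0 / n + (x0 - x_bar) ** 2 / ss_x)
    return y_hat, half_width

y_hat, half_width = prediction_interval_y(
    x0=4000, b0=274.85, b1=1.2553, s=318.16, n=25,
    x_bar=3177.92, ss_x=40_947_557.84, t_crit=2.069,  # t(0.025, 23) from the slide
)
lower, upper = y_hat - half_width, y_hat + half_width
```

The leading 1 under the square root accounts for the variation of individual points around the regression line, on top of the uncertainty in the line itself.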
10-47
Prediction Interval for the Average Value of Y
A $(1 - \alpha)100\%$ prediction interval for $E[Y \mid X]$:
$$\hat{y} \pm t_{\alpha/2} \, s \sqrt{\frac{1}{n} + \frac{(x - \bar{x})^2}{SS_X}}$$
Example 10-1 (X = 4,000):
$$\{274.85 + (1.2553)(4,000)\} \pm 2.069 \times 318.16 \sqrt{\frac{1}{25} + \frac{(4,000 - 3,177.92)^2}{40,947,557.84}}$$
$$= 5,296.05 \pm 156.48 = [5139.57, 5452.53]$$

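The interval for the average value of Y drops the leading 1 under the square root, which is why it is so much narrower. A small sketch under the same rounded inputs as the slide, with illustrative variable names:

```python
import math

n, s, t_crit = 25, 318.16, 2.069       # t(0.025, 23) as quoted on the slide
x0, x_bar, ss_x = 4000, 3177.92, 40_947_557.84
y_hat = 274.85 + 1.2553 * x0           # point prediction at x0

line_term = 1.0 / n + (x0 - x_bar) ** 2 / ss_x   # uncertainty in the fitted line only

half_mean = t_crit * s * math.sqrt(line_term)         # for E[Y | X]
half_indiv = t_crit * s * math.sqrt(1.0 + line_term)  # for an individual Y, for comparison

mean_interval = (y_hat - half_mean, y_hat + half_mean)
print(f"half-width for E[Y|X]: {half_mean:.2f}; for individual Y: {half_indiv:.2f}")
```

Both half-widths grow with the distance between x0 and the mean of X, so both bands are narrowest at the mean of X.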
10-48
Template Output with Prediction Intervals
10-49
10-11 The Excel Solver Method for Regression
The Solver macro available in Excel can also be used to conduct a
simple linear regression. See the text for instructions.
10-50
Using Minitab Fitted-Line Plot for Regression
[Figure: Minitab Fitted Line Plot of Y (6.0 to 9.0) against X (5.5 to 7.5)]
Y = -0.8465 + 1.352 X
S = 0.184266
R-Sq = 95.2%
R-Sq(adj) = 94.8%
