
Chapter 6

Linear Regression and Correlation Analysis
1. Introduction to regression analysis

 Regression analysis:
- Describes a relationship between two or more variables in mathematical terms
- Predicts the value of a dependent variable based on the value of at least one independent variable
- Explains the impact of changes in an independent variable on the dependent variable
1. Introduction to regression analysis

 Dependent variable: the variable we wish to explain
 Independent variable: the variable used to explain the dependent variable

Names for ys and xs in the regression model

Names for y          Names for xs
Dependent variable   Independent variables
Regressand           Regressors
Effect variable      Causal variables
Explained variable   Explanatory variables

Simple Linear Regression Model

 Only one independent variable, x
 The relationship between x and y is described by a linear function
 Changes in y are assumed to be caused by changes in x
Types of Regression Models

[Scatter plots illustrating four cases: positive linear relationship, negative linear relationship, non-linear relationship, no relationship]

Population Linear Regression

The population regression model:

y = β0 + β1x + ε

where:
y = dependent variable
β0 = population y-intercept
β1 = population slope coefficient
x = independent variable
ε = random error term, or residual

β0 + β1x is the linear component; ε is the random error component.
Linear Regression Assumptions

 Error values (ε) are statistically independent
 Error values are normally distributed for any given value of x
 The probability distribution of the errors has constant variance
 The underlying relationship between the x variable and the y variable is linear
Population Linear Regression

[Graph of y = β0 + β1x + ε: for a given xi, the observed value of y differs from the predicted value on the line by the random error εi for that x value; the line has intercept β0 and slope β1]
Estimated Regression Model

The sample regression line provides an estimate of the population regression line:

ŷi = b0 + b1x

where:
ŷi = estimated (or predicted) y value
b0 = estimate of the regression intercept
b1 = estimate of the regression slope
x = independent variable

The individual random error terms ei have a mean of zero.


Least Squares Criterion

 b0 and b1 are obtained by finding the values that minimize the sum of the squared residuals:

Σe² = Σ(y − ŷ)² = Σ(y − (b0 + b1x))²
The Least Squares Equation

 The formulas for b1 and b0 are:

b1 = ( Σxy − (Σx)(Σy)/n ) / ( Σx² − (Σx)²/n )

or, equivalently, in terms of means:

b1 = ( avg(xy) − x̄ · ȳ ) / σx²

and

b0 = ȳ − b1 x̄
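As a quick illustration, the formulas above translate directly into code. This is a minimal sketch (the helper name `least_squares` is our own, not from the slides) that computes b1 and b0 from the raw sums:

```python
# A minimal sketch of the least-squares formulas above: estimates the
# slope b1 and intercept b0 from paired samples x and y.
def least_squares(x, y):
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi ** 2 for xi in x)
    b1 = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)
    b0 = sum_y / n - b1 * sum_x / n   # b0 = ȳ − b1·x̄
    return b0, b1
```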
Interpretation of the Slope and the Intercept

 b0 is the estimated average value of y when the value of x is zero
 b1 is the estimated change in the average value of y as a result of a one-unit change in x
Example

 A real estate agent wishes to examine the relationship between the selling price of a home and its size (measured in square feet)
 A random sample of 10 houses is selected
◦ Dependent variable (y): house price in $1000s
◦ Independent variable (x): square feet
Sample Data for House Price Model

House Price in $1000s (y)   Square Feet (x)
245                         1400
312                         1600
279                         1700
308                         1875
199                         1100
219                         1550
405                         2350
324                         2450
319                         1425
255                         1700
Least Squares Regression Properties

 The sum of the residuals from the least squares regression line is 0: Σ(y − ŷ) = 0
 The sum of the squared residuals, Σ(y − ŷ)², is a minimum (minimized)
 The simple regression line always passes through the mean of the y variable and the mean of the x variable: ȳ = b0 + b1x̄
 The least squares coefficients are unbiased estimates of β0 and β1
Explained and Unexplained Variation

 Total variation is made up of two parts:

SST = SSE + SSR

(Total Sum of Squares = Sum of Squares Error + Sum of Squares Regression)

SST = Σ(y − ȳ)²    SSE = Σ(y − ŷ)²    SSR = Σ(ŷ − ȳ)²

where:
ȳ = average value of the dependent variable
y = observed values of the dependent variable
ŷ = estimated value of y for the given x value
Coefficient of Determination R²

 The coefficient of determination is the portion of the total variation in the dependent variable that is explained by variation in the independent variable
 The coefficient of determination is also called R-squared and is denoted as R², where

R² = RSS / TSS,   0 ≤ R² ≤ 1
Coefficient of Determination R²

R² = RSS / TSS = (sum of squares explained by regression) / (total sum of squares)
Examples of Approximate R² Values

[Two scatter plots with all points exactly on a line, one with negative slope and one with positive slope]

R² = 1: perfect linear relationship between x and y; 100% of the variation in y is explained by variation in x
Examples of Approximate R² Values

[Scatter plots with points loosely clustered around a line]

0 < R² < 1: weaker linear relationship between x and y; some but not all of the variation in y is explained by variation in x
Examples of Approximate R² Values

[Scatter plot with no trend]

R² = 0: no linear relationship between x and y; the value of y does not depend on x (none of the variation in y is explained by variation in x)
y      x       ŷ          (ŷ − ȳ)²    (y − ȳ)²
245    1400    251.8283   1202.127    1722.25
312    1600    273.7683   162.0962    650.25
279    1700    284.7383   3.103587    56.25
308    1875    303.9358   304.0071    462.25
199    1100    218.9183   4567.286    7656.25
219    1550    268.2833   331.8482    4556.25
405    2350    356.0433   4836.271    14042.25
324    2450    367.0133   6482.391    1406.25
319    1425    254.5708   1019.474    1056.25
255    1700    284.7383   3.103587    992.25
2865   17150   2863.838   18911.71    32600.5    (column sums)
Coefficient of determination

R² = RSS / TSS = 18911.71 / 32600.5 ≈ 0.58
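As a check, this is a hedged sketch (our own code, not from the slides) that fits the house-price data with the least-squares formulas and verifies R² against the table above:

```python
# Fit the house-price data and verify R² against the table above.
x = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]
y = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / \
     sum((xi - x_bar) ** 2 for xi in x)
b0 = y_bar - b1 * x_bar                        # roughly b0 ≈ 98.2, b1 ≈ 0.110

y_hat = [b0 + b1 * xi for xi in x]
rss = sum((yh - y_bar) ** 2 for yh in y_hat)   # regression sum of squares
tss = sum((yi - y_bar) ** 2 for yi in y)       # total sum of squares
print(f"R² = {rss / tss:.3f}")                 # ≈ 0.58, matching the table
```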
2. Correlation analysis

 Correlation is a technique used to measure the strength of the relationship between two variables.
 The stronger the correlation, the better the regression line fits the data, and vice versa.
Scatter Plot Examples

[Scatter plots illustrating a high degree of correlation, a low degree of correlation, and no relationship]
The correlation coefficient (r)

 The correlation coefficient is used to measure the strength of the linear relationship between two variables
 The product moment correlation coefficient is calculated using the formula:
The correlation coefficient (r)

r = Σ(x − x̄)(y − ȳ) / √( [Σ(x − x̄)²][Σ(y − ȳ)²] )

or

r = ( nΣxy − (Σx)(Σy) ) / √( [nΣx² − (Σx)²][nΣy² − (Σy)²] )

or

r = ( avg(xy) − x̄ · ȳ ) / (σx σy)
Note

 In the single independent variable case, the coefficient of determination is

R² = r²

where r is the simple correlation coefficient.
Features of r

 Unit free
 Ranges between −1 and 1
 The closer to −1, the stronger the negative linear relationship
 The closer to 1, the stronger the positive linear relationship
 The closer to 0, the weaker the linear relationship
Examples of Approximate r Values

[Scatter plots illustrating r = −1, r = −.6, r = 0, r = +.3, r = +1]
Example calculation

r = ( avg(xy) − x̄ · ȳ ) / √( (avg(x²) − x̄²)(avg(y²) − ȳ²) )
Example

 The data below relates working experience (years) to the productivity (items/h) of 10 workers in a small firm.

Working experience   Productivity (items/h)
1                    2
3                    8
4                    9
5                    15
6                    15
7                    20
9                    23
12                   25
14                   22
15                   36
Example calculation

Σx = 76     Σx² = 782
Σy = 175    Σy² = 3933
Σxy = 1722
 Estimate b0 and b1
 Write down the linear regression equation
 Interpretation of b0 and b1?
 Coefficient of determination and correlation coefficient (a worked sketch follows below)
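This is a hedged worked sketch for the exercise above (our own code, not from the slides), reusing the `least_squares()` and `correlation()` helpers defined earlier:

```python
# Worker experience vs. productivity exercise.
exp = [1, 3, 4, 5, 6, 7, 9, 12, 14, 15]          # working experience (years)
prod = [2, 8, 9, 15, 15, 20, 23, 25, 22, 36]     # productivity (items/h)

b0, b1 = least_squares(exp, prod)
r = correlation(exp, prod)
print(f"ŷ = {b0:.3f} + {b1:.3f}x")      # ≈ ŷ = 2.925 + 1.918x
print(f"r = {r:.3f}, r² = {r*r:.3f}")   # ≈ r = 0.929, r² = 0.864
```

Interpretation: b1 ≈ 1.92 extra items/h for each additional year of experience; b0 ≈ 2.9 items/h is the estimated average productivity at zero years of experience.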
Steps in Regression

1- For Xi (independent variable) and Yi (dependent variable), calculate:
ΣYi
ΣXi
ΣXiYi
ΣXi²
ΣYi²

2- Calculate the correlation coefficient, r:

r = ( nΣXiYi − (ΣXi)(ΣYi) ) / √( [nΣXi² − (ΣXi)²][nΣYi² − (ΣYi)²] )

-1 ≤ r ≤ 1

[This can be tested for significance. H0: ρ=0. If the correlation is not significant, then X and Y are not related. You really should not be doing this regression!]
Steps in Regression

3- Calculate the coefficient of determination: r² = (r)²
0 ≤ r² ≤ 1
This is the proportion of the variation in the dependent variable (Yi) explained by the independent variable (Xi).

4- Calculate the regression coefficient b1 (the slope):

b1 = ( nΣXiYi − (ΣXi)(ΣYi) ) / ( nΣXi² − (ΣXi)² )

Note that you have already calculated the numerator and the denominator for parts of r. Other than a single division operation, no new calculations are required. BTW, r and b1 are related: if the correlation is negative, the slope term must be negative; a positive slope means a positive correlation.

5- Calculate the regression coefficient b0 (the Y-intercept, or constant):

b0 = Ȳ − b1 X̄

The Y-intercept (b0) is the predicted value of Y when X = 0.
Steps in Regression

6- The regression equation (a straight line) is:
Ŷi = b0 + b1Xi

7- [OPTIONAL] Then we can test the regression for statistical significance. There are 3 ways to do this in simple regression:

(a) t-test for correlation:
H0: ρ=0
H1: ρ≠0

t(n−2) = r √(n − 2) / √(1 − r²)

(b) t-test for slope term:
H0: β1=0
H1: β1≠0
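The t statistic in step 7(a) is a one-liner. This is a hedged sketch (not from the slides); the result is compared to the critical value of a t distribution with n − 2 degrees of freedom:

```python
# t-test statistic for H0: ρ = 0, with n − 2 degrees of freedom.
from math import sqrt

def t_stat_correlation(r, n):
    return r * sqrt(n - 2) / sqrt(1 - r ** 2)

# e.g. for the worker data above: r ≈ 0.929, n = 10
print(t_stat_correlation(0.929, 10))   # ≈ 7.1, far beyond typical critical values
```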
Steps in Regression

(c) F-test (we can do it in MS Excel):

F = MS(Explained) / MS(Unexplained) = MS(Regression) / MS(Residual)

where the numerator is the Mean Square (variation) explained by the regression equation, and the denominator is the Mean Square (variation) unexplained by the regression.

 This is part of the output you see when you do regression in SPSS, MS Excel, etc.
Example: Water and Tomato Yield

 n = 5 pairs of X,Y observations
 Independent variable (X) is amount of water (in gallons) used on crop
 Dependent variable (Y) is yield (bushels of tomatoes)

Yi    Xi    XiYi    Xi²    Yi²
2     1     2       1      4
5     2     10      4      25
8     3     24      9      64
10    4     40      16     100
15    5     75      25     225
40    15    151     55     418
Example: Water and Tomato Yield

Step 1-
ΣYi = 40
ΣXi = 15
ΣXiYi = 151
ΣXi² = 55
ΣYi² = 418

Step 2-
r = [ (5)(151) − (15)(40) ] / √[ (5)(55) − (15)² ][ (5)(418) − (40)² ] = 155 / √(50)(490) = .9903
Example: Water and Tomato Yield

Step 3- r² = (.9903)² = 98.06%

Step 4- b1 = 155 / 50 = 3.1. The slope is positive. There is a positive relationship between water and crop yield.

Step 5- b0 = (40/5) − 3.1(15/5) = 8 − 9.3 = −1.3

Step 6- Thus, Ŷi = −1.3 + 3.1Xi
Example: Water and Tomato Yield

Ŷi = −1.3 + 3.1Xi

[Plot of the fitted line: y-axis = # bushels of tomatoes, x-axis = # gallons of water. Every gallon adds 3.1 bushels of tomatoes. The intercept prompts a question: does no water result in a negative yield?]
Example: Water and Tomato Yield

Yi    Xi    Ŷi     ei     ei²
2     1     1.8    .2     .04
5     2     4.9    .1     .01
8     3     8.0    0      0
10    4     11.1   -1.1   1.21
15    5     14.2   .8     .64
            Σei = 0       Σei² = 1.90

 Σei² = 1.90. This is a minimum, since regression minimizes Σei² (SSE).
 Now we can answer a question like: How many bushels of tomatoes can we expect if we use 3.5 gallons of water? −1.3 + 3.1(3.5) = 9.55 bushels.
 Notice the danger of predicting outside the range of X. The more water, the greater the yield? No. Too much water can ruin the crop.
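This is a hedged sketch (not part of the slides) that reproduces the tomato-yield fit and the 3.5-gallon prediction with the `least_squares()` helper defined earlier:

```python
# Reproduce the tomato-yield fit and check the slide's numbers.
water = [1, 2, 3, 4, 5]        # gallons of water (X)
yield_ = [2, 5, 8, 10, 15]     # bushels of tomatoes (Y)

b0, b1 = least_squares(water, yield_)    # b0 = -1.3, b1 = 3.1
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(water, yield_)]
print(sum(residuals))                    # ≈ 0, as least squares guarantees
print(sum(e ** 2 for e in residuals))    # SSE = 1.90, the minimum
print(b0 + b1 * 3.5)                     # 9.55 bushels at 3.5 gallons
```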
Sources of Variation in Regression

Total variation: Σ(Yi − Ȳ)² = ΣYi² − (ΣYi)²/n

Explained variation: Σ(Ŷi − Ȳ)² = b0ΣYi + b1ΣXiYi − (ΣYi)²/n

Unexplained variation: Σ(Yi − Ŷi)² = ΣYi² − b0ΣYi − b1ΣXiYi

From our previous problem:
Total variation in Y = 418 − (40)²/5 = 98
Explained variation (explained by X) = −1.3(40) + 3.1(151) − (40)²/5 = 96.10
Unexplained variation = 418 − (−1.3)(40) − 3.1(151) = 1.90

The coefficient of determination, r², is the proportion of the variation in Y explained by X:

r² = Explained Variation / Total Variation = 96.10 / 98 = .98

In other words, 98% of the total variation in crop yield is explained by the linear relationship of yield with the amount of water used on the crop.
The Multiple Regression Model

Idea: examine the linear relationship between 1 dependent variable (y) and 2 or more independent variables (xi).

Population model (β0 is the y-intercept, β1, …, βk are the population slopes, ε is the random error):

y = β0 + β1x1 + β2x2 + … + βkxk + ε

Estimated multiple regression model (b0 is the estimated intercept, b1, …, bk the estimated slope coefficients, ŷ the estimated or predicted value of y):

ŷ = b0 + b1x1 + b2x2 + … + bkxk
Estimates b0, b1, b2, …, bk

The least squares estimates solve the normal equations:

Σy = n b0 + b1Σx1 + b2Σx2 + … + bkΣxk
Σx1y = b0Σx1 + b1Σx1² + b2Σx1x2 + … + bkΣx1xk
Σx2y = b0Σx2 + b1Σx1x2 + b2Σx2² + … + bkΣx2xk
……………………………………………………
Σxky = b0Σxk + b1Σx1xk + b2Σx2xk + … + bkΣxk²
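In matrix form the normal equations are (XᵀX)b = Xᵀy, which is how software solves them. This is a hedged sketch (our own helper, not from the slides):

```python
# Solve the normal equations for multiple regression with numpy.
import numpy as np

def fit_multiple(X, y):
    """X is the n-by-k matrix of predictors; returns [b0, b1, ..., bk]."""
    X = np.column_stack([np.ones(len(y)), X])  # prepend a column of 1s for b0
    # Normal equations: (XᵀX) b = Xᵀy
    return np.linalg.solve(X.T @ X, X.T @ y)
```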
Interpretation of Estimated Coefficients

 Slope (bi)
◦ Estimates that the average value of y changes by bi units for each 1-unit increase in xi, given that all other variables are held unchanged
 Intercept (b0)
◦ The estimated average value of y when all xi = 0
Multiple Regression Model

Two variable model: ŷ = b0 + b1x1 + b2x2

[3-D plot: the fitted regression plane over the (x1, x2) plane]
Multiple Regression Model

[3-D plot, two variable model: for a sample observation yi at (x1i, x2i), the residual e = (y − ŷ) is the vertical distance to the plane ŷ = b0 + b1x1 + b2x2. The best fit equation ŷ is found by minimizing the sum of squared errors, Σe².]
Multiple Regression Assumptions

Errors (residuals) from the regression model: e = (y − ŷ)

 The errors are normally distributed
 The mean of the errors is zero
 Errors have a constant variance
 The model errors are independent
Example

A distributor of frozen dessert pies wants to evaluate factors thought to influence demand. Data are collected for 15 weeks.

Week   Pie Sales   Price ($)   Advertising ($100s)
1      350         5.50        3.3
2      460         7.50        3.3
3      350         8.00        3.0
4      430         8.00        4.5
5      350         6.80        3.0
6      380         7.50        4.0
7      430         4.50        3.0
8      470         6.40        3.7
9      450         7.00        3.5
10     490         5.00        4.0
11     340         7.20        3.5
12     300         7.90        3.2
13     440         5.90        4.0
14     450         5.00        3.5
15     300         7.00        2.7
Example

Dependent variable (y): pie sales
Independent variable 1 (x1): price ($)
Independent variable 2 (x2): advertising ($100s)

Estimated (predicted) regression equation:

ŷ = b0 + b1x1 + b2x2

Estimates b0, b1, b2 solve:

Σy = n b0 + b1Σx1 + b2Σx2
Σx1y = b0Σx1 + b1Σx1² + b2Σx1x2
Σx2y = b0Σx2 + b1Σx1x2 + b2Σx2²
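This is a hedged sketch (not from the slides) that fits the pie-sales model using the `fit_multiple()` helper defined earlier:

```python
# Fit pie sales on price and advertising.
import numpy as np

price = [5.50, 7.50, 8.00, 8.00, 6.80, 7.50, 4.50, 6.40,
         7.00, 5.00, 7.20, 7.90, 5.90, 5.00, 7.00]
advertising = [3.3, 3.3, 3.0, 4.5, 3.0, 4.0, 3.0, 3.7,
               3.5, 4.0, 3.5, 3.2, 4.0, 3.5, 2.7]
sales = [350, 460, 350, 430, 350, 380, 430, 470,
         450, 490, 340, 300, 440, 450, 300]

b0, b1, b2 = fit_multiple(np.column_stack([price, advertising]),
                          np.array(sales))
print(f"ŷ = {b0:.1f} + {b1:.1f}·price + {b2:.1f}·advertising")
# Expect a negative price coefficient and a positive advertising coefficient:
# each extra $1 of price changes expected weekly sales by b1 pies, and each
# extra $100 of advertising changes expected weekly sales by b2 pies.
```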
Multiple Coefficient of Determination

 Reports the proportion of total variation in y explained by all x variables taken together:

R² = RSS / TSS = (regression sum of squares) / (total sum of squares)
Multiple correlation (R)

 Multiple correlation provides a measure of the overall strength of the relationship between the dependent variable and the independent variables.
 It is defined as the positive square root of the coefficient of determination:

R = √R²
Residual analysis

 Define residuals
 Create residual plots
 Interpret plots

What is a residual?
Residual = observed value − predicted value: e = Y − Y′
(Y = observed value, Y′ = predicted value)

Properties: Σe = 0 and ē = 0

How to use residual plots: typical patterns in the residual plot tell us whether linear regression is appropriate.
 Random pattern: use linear regression
 Non-random pattern: consider another technique

(A residual-plot sketch follows below.)
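This is a hedged sketch (not from the slides) of a residual plot for the fitted tomato-yield line, using matplotlib:

```python
# Residual plot for the tomato-yield example.
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y = [2, 5, 8, 10, 15]
b0, b1 = -1.3, 3.1                       # from the earlier example
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

plt.scatter(x, residuals)
plt.axhline(0, linestyle="--")           # residuals should straddle zero
plt.xlabel("x (gallons of water)")
plt.ylabel("residual e = y - ŷ")
plt.title("Residual plot: look for a random pattern")
plt.show()
```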
TRANSFORMATIONS

 A transformation applies a math operation to a variable
 Example:

Original variable   Math operation   Transformed variable
X                   Addition         Xt = X + 5
Y                   Multiplication   Yt = 2·Y
A                   Square root      At = √A
B                   Reciprocal       Bt = 1/B

TRANSFORMATION BASICS
Do you know?

Linear                     Non-linear
No effect on correlation   Changes correlation
Xt = c·X                   Anything else
Xt = X/c
Xt = X + c
Application to Regression

 Nonlinear transformations change linear correlation
◦ Increase correlation: better regression
◦ Reduce correlation: worse regression
Methods of Transforming Variables to Achieve Linearity

METHOD              TRANSFORMATION(S)               REGRESSION EQUATION       PREDICTED VALUE
Standard linear     None                            y = b0 + b1x              ŷ = b0 + b1x
regression
Exponential model   Dependent variable = log(y)     log(y) = b0 + b1x         ŷ = 10^(b0 + b1x)
Quadratic model     Dependent variable = sqrt(y)    sqrt(y) = b0 + b1x        ŷ = (b0 + b1x)²
Reciprocal model    Dependent variable = 1/y        1/y = b0 + b1x            ŷ = 1 / (b0 + b1x)
Logarithmic model   Independent variable = log(x)   y = b0 + b1·log(x)        ŷ = b0 + b1·log(x)
Power model         Dependent variable = log(y),    log(y) = b0 + b1·log(x)   ŷ = 10^(b0 + b1·log(x))
                    Independent variable = log(x)
NOTE

 These methods need to be tested on the data to which they are applied, to be sure that they increase rather than decrease the linearity of the relationship. Testing the effect of a transformation method involves looking at residual plots and correlation coefficients.
How to Perform a Transformation to Achieve Linearity

 Conduct a standard regression analysis on the raw data.
 Construct a residual plot.
◦ If the plot pattern is random, do not transform data.
◦ If the plot pattern is not random, continue.
 Compute the coefficient of determination (R²).
 Choose a transformation method (see above table).
 Transform the independent variable, dependent variable, or both.
 Conduct a regression analysis using the transformed variables.
 Compute the coefficient of determination (R²) based on the transformed variables.
◦ If the transformed R² is greater than the raw-score R², the transformation was successful. Congratulations!
◦ If not, try a different transformation method.

The best transformation method (exponential model, quadratic model, reciprocal model, etc.) will depend on the nature of the original data. The only way to determine which method is best is to try each and compare the results (i.e., residual plots, correlation coefficients).
EXAMPLE

X   1   2   3   4   5   6   7   8   9
Y   2   1   6   14  15  30  40  74  75

 When we apply a linear regression to the untransformed raw data, the residual plot shows a non-random pattern (a U-shaped curve), which suggests that the data are nonlinear.
 Suppose we repeat the analysis, using a quadratic model to transform the dependent variable. For a quadratic model, we use the square root of y, rather than y, as the dependent variable. Using the transformed data, our regression equation is:

y′t = b0 + b1x

where:
 yt = transformed dependent variable, which is equal to the square root of y
 y′t = predicted value of the transformed dependent variable yt
 x = independent variable
 b0 = y-intercept of transformation regression line
 b1 = slope of transformation regression line
 The table below shows the transformed data we analyzed.

x    1     2   3     4     5     6     7     8    9
yt   1.41  1   2.45  3.74  3.87  5.48  6.32  8.6  8.66

 The residual plot shows residuals based on predicted raw scores from the transformation regression equation. The plot suggests that the transformation to achieve linearity was successful.

Since the transformation was based on the quadratic model (yt = the square root of y), the transformation regression equation can be expressed in terms of the original units of variable Y as:

y′ = (b0 + b1x)²

where:
 y′ = predicted value of y in its original units
 x = independent variable
 b0 = y-intercept of transformation regression line
 b1 = slope of transformation regression line
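This is a hedged sketch (not from the slides) that fits the square-root transformation to the example data and compares R² before and after, reusing the `least_squares()` helper defined earlier:

```python
# Compare R² for the raw and square-root-transformed example data.
from math import sqrt

x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [2, 1, 6, 14, 15, 30, 40, 74, 75]

def r_squared(x, y):
    b0, b1 = least_squares(x, y)
    y_bar = sum(y) / len(y)
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    tss = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - sse / tss

yt = [sqrt(yi) for yi in y]     # quadratic-model transform
print(r_squared(x, y))          # raw-score R²
print(r_squared(x, yt))         # transformed R²; higher means the transform helped
```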
