REGRESSION
CASE: SPENDING AND EDUCATION

INTRODUCTION TO REGRESSION ANALYSIS
Regression analysis is one of the most pervasive methods in the business world.
Regression analysis is used to:
Study the relationship between variables
Predict the value of a dependent variable based on the value of at least one
independent variable
TYPES OF RELATIONSHIPS
Relationships between X and Y may be linear or curvilinear, and strong or weak.
[Figure: scatter plots of Y versus X illustrating linear vs. curvilinear relationships, and strong vs. weak relationships.]
THE SIMPLE LINEAR REGRESSION MODEL

$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$$

where $\beta_0$ = population intercept, $\beta_1$ = population slope coefficient, $X_i$ = independent variable, and $\varepsilon_i$ = random error term. The linear component is $\beta_0 + \beta_1 X_i$; the random error component is $\varepsilon_i$.

[Figure: scatter plot showing, for a given $X_i$, the observed value of Y, the predicted value of Y on the line, the random error $\varepsilon_i$ for that $X_i$ value as the vertical gap between them, slope $\beta_1$, and intercept $\beta_0$.]
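To make the model concrete, here is a minimal sketch that generates data from this population model; the parameter values and error spread below are illustrative assumptions, not values from the case data.

```python
import numpy as np

# Sketch: generate data from Y_i = beta0 + beta1*X_i + eps_i.
# beta0, beta1, and the error standard deviation are assumed for illustration.
rng = np.random.default_rng(seed=1)

beta0, beta1 = 3.0, 2.0                 # population intercept and slope (assumed)
x = np.linspace(0, 10, 50)              # independent variable
eps = rng.normal(loc=0.0, scale=1.5, size=x.size)  # random error term, mean 0

y = beta0 + beta1 * x + eps             # linear component plus random error
```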
EXAMPLE
The annual bonuses ($1,000s) of six employees with different years of experience were recorded as follows. We wish to determine the straight line that best fits these data.

Years of experience (x): 1, 2, 3, 4, 5, 6
Annual bonus (y):        6, 1, 9, 5, 17, 12
As a first attempt, consider the line $\hat{y} = 3 + 2x$.

[Figure: scatter plot of annual bonus versus years of experience with the line $\hat{y} = 3 + 2x$ drawn through the points.]

x     y     ŷ = 3 + 2x   y − ŷ   (y − ŷ)²
1     6         5          1        1
2     1         7         -6       36
3     9         9          0        0
4     5        11         -6       36
5    17        13          4       16
6    12        15         -3        9
                  Sum of squared errors = 98
The least squares line is $\hat{y} = 0.934 + 2.114x$:

x     y     ŷ        y − ŷ    (y − ŷ)²
1     6     3.048    2.952     8.714304
2     1     5.162   -4.162    17.322240
3     9     7.276    1.724     2.972176
4     5     9.390   -4.390    19.272100
5    17    11.504    5.496    30.206016
6    12    13.618   -1.618     2.617924
                 Sum of squared errors = 81.10476

Its sum of squared errors (81.10) is smaller than that of the trial line (98), so it fits the data better; no other straight line has a smaller SSE.
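A short check of the two SSE figures above, as plain arithmetic on the six data points:

```python
# Sum of squared errors for a candidate line y_hat = b0 + b1*x.
x = [1, 2, 3, 4, 5, 6]                 # years of experience
y = [6, 1, 9, 5, 17, 12]               # annual bonus ($1,000s)

def sse(b0, b1):
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

print(sse(3, 2))            # trial line y = 3 + 2x   -> 98
print(sse(0.934, 2.114))    # least squares line      -> about 81.1
```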
THE LEAST SQUARES ESTIMATES

$$\hat{Y}_i = b_0 + b_1 X_i$$

where $b_0$ = estimate of the regression intercept, $b_1$ = estimate of the regression slope, and $X_i$ = value of X for observation i.

$$b_1 = \frac{SS_{XY}}{SS_{XX}} = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sum (X - \bar{X})^2}, \qquad b_0 = \bar{Y} - b_1 \bar{X}$$
For the bonus data:

x       y       x − x̄   y − ȳ   (x − x̄)(y − ȳ)   (x − x̄)²
1.00     6.00   -2.50   -2.33         5.83          6.25
2.00     1.00   -1.50   -7.33        11.00          2.25
3.00     9.00   -0.50    0.67        -0.33          0.25
4.00     5.00    0.50   -3.33        -1.67          0.25
5.00    17.00    1.50    8.67        13.00          2.25
6.00    12.00    2.50    3.67         9.17          6.25
Total   21.00   50.00                37.00         17.50

$$b_1 = \frac{37}{17.5} = 2.114, \qquad b_0 = \frac{50}{6} - 2.114 \times \frac{21}{6} = 0.9343$$

so the least squares line is $\hat{y} = 0.9343 + 2.114x$.
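The same computation scripted from the definition formulas; this reproduces $b_1 = 2.114$ and $b_0 \approx 0.934$:

```python
# Least squares slope and intercept from the definition formulas.
x = [1, 2, 3, 4, 5, 6]
y = [6, 1, 9, 5, 17, 12]
n = len(x)

x_bar = sum(x) / n                                                # 3.5
y_bar = sum(y) / n                                                # 8.333...

ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))  # 37.0
ss_xx = sum((xi - x_bar) ** 2 for xi in x)                        # 17.5

b1 = ss_xy / ss_xx            # 2.1143
b0 = y_bar - b1 * x_bar       # 0.9333 (0.9343 if b1 is rounded to 2.114 first)
print(b1, b0)
```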
INTERPRETATION OF THE SLOPE AND THE INTERCEPT
b0 is the estimated average value of Y when the value of X is zero.
b1 is the estimated change in the average value of Y as a result of a one-unit increase in X.
EXAMPLE: HOUSE PRICE VS. SQUARE FEET

House Price ($1000s) (Y)   Square Feet (X)
245                        1400
312                        1600
279                        1700
308                        1875
199                        1100
219                        1550
405                        2350
324                        2450
319                        1425
255                        1700

[Figure: scatter plot of house price ($1000s) versus square feet.]
EXCEL OUTPUT

Regression Statistics
Multiple R           0.76211
R Square             0.58082
Adjusted R Square    0.52842
Standard Error      41.33032
Observations        10

ANOVA
             df   SS           MS           F         Significance F
Regression    1   18934.9348   18934.9348   11.0848   0.01039
Residual      8   13665.5652    1708.1957
Total         9   32600.5000

              Coefficients   Standard Error   t Stat    P-value   Lower 95%   Upper 95%
Intercept      98.24833       58.03348        1.69296   0.12892   -35.57720   232.07386
Square Feet     0.10977        0.03297        3.32938   0.01039     0.03374     0.18580

The estimated regression equation is: house price = 98.24833 + 0.10977 (square feet).
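As a cross-check, the key numbers of this output can be reproduced with scipy.stats.linregress; this is a sketch using SciPy, which the slides themselves do not use.

```python
from scipy import stats

square_feet = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]
price = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]   # $1000s

fit = stats.linregress(square_feet, price)
print(fit.intercept)      # ~98.24833  (Intercept)
print(fit.slope)          # ~0.10977   (Square Feet coefficient)
print(fit.rvalue ** 2)    # ~0.58082   (R Square)
print(fit.pvalue)         # ~0.01039   (P-value for the slope)
```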
[Figure: scatter plot of house price ($1000s) versus square feet with the fitted regression line; intercept = 98.248, slope = 0.10977.]
When using a regression model for prediction, only predict within the
relevant range of data
[Figure: scatter plot of house price ($1000s) versus square feet, marking the relevant range for interpolation between the smallest and largest observed X values.]

Do not try to extrapolate beyond the range of observed Xs.
MAKING PREDICTIONS
Predict the price for a house with 2000 square feet:

house price = 98.25 + 0.1098 (2000) = 317.85

The predicted price is 317.85 ($1,000s), i.e. about $317,850. Because 2000 square feet lies within the observed range (1100 to 2450), this prediction is an interpolation, not an extrapolation.
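A small sketch of the same prediction, with a guard that enforces the relevant-range rule from the previous slide; the range bounds are the observed minimum and maximum square footage from the data.

```python
# Predict house price (in $1000s) from square feet, interpolation only.
B0, B1 = 98.24833, 0.10977
X_MIN, X_MAX = 1100, 2450            # observed range of square feet

def predict_price(sqft):
    if not (X_MIN <= sqft <= X_MAX):
        raise ValueError("outside the relevant range: do not extrapolate")
    return B0 + B1 * sqft

print(predict_price(2000))   # ~317.79; the slides' rounded coefficients give 317.85
```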
MEASURES OF VARIATION
Total variation is made up of two parts:

$$SST = SSR + SSE$$

SST (total sum of squares) measures the total variation of the $Y_i$ values around their mean: $SST = \sum (Y_i - \bar{Y})^2$
SSR (regression sum of squares) is the explained variation attributable to the relationship between X and Y: $SSR = \sum (\hat{Y}_i - \bar{Y})^2$
SSE (error sum of squares) is the variation attributable to factors other than the relationship between X and Y: $SSE = \sum (Y_i - \hat{Y}_i)^2$

where $\bar{Y}$ = mean value of the dependent variable, $Y_i$ = observed value, and $\hat{Y}_i$ = predicted value of Y for the given $X_i$.

[Figure: scatter plot showing, at a given $X_i$, the total deviation $Y_i - \bar{Y}$ decomposed into $\hat{Y}_i - \bar{Y}$ (the SSR part) and $Y_i - \hat{Y}_i$ (the SSE part).]
COEFFICIENT OF DETERMINATION, R²
The coefficient of determination is the portion of the total variation in the dependent variable that is explained by variation in the independent variable:

$$R^2 = \frac{SSR}{SST} = \frac{\text{regression sum of squares}}{\text{total sum of squares}}$$

note: $0 \le R^2 \le 1$
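Continuing the house price example, the decomposition and R² can be verified directly; the coefficients are rounded as in the output, so the sums match the ANOVA table only approximately.

```python
# SST = SSR + SSE decomposition and R^2 for the house price regression.
square_feet = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]
price = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]

b0, b1 = 98.24833, 0.10977
y_bar = sum(price) / len(price)
y_hat = [b0 + b1 * x for x in square_feet]

sst = sum((y - y_bar) ** 2 for y in price)                # ~32600.5
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)              # ~18934.9
sse = sum((y - yh) ** 2 for y, yh in zip(price, y_hat))   # ~13665.6
print(ssr / sst)                                          # ~0.58082
```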
EXAMPLES OF APPROXIMATE R² VALUES
$R^2 = 1$: perfect linear relationship between X and Y; 100% of the variation in Y is explained by variation in X. [Figure: points falling exactly on a line.]
$0 < R^2 < 1$: weaker linear relationship between X and Y; some but not all of the variation in Y is explained by variation in X. [Figure: points scattered around a line.]
$R^2 = 0$: no linear relationship between X and Y; none of the variation in Y is explained by variation in X. [Figure: points with no linear pattern.]
From the Excel output shown above:

$$R^2 = \frac{SSR}{SST} = \frac{18934.9348}{32600.5000} = 0.58082$$

58.08% of the variation in house prices is explained by variation in square feet.
ASSUMPTIONS OF REGRESSION: L.I.N.E
Linearity: the relationship between X and Y is linear.
Independence of Errors: error values are statistically independent.
Normality of Error: error values are normally distributed for any given value of X.
Equal Variance: the probability distribution of the errors has constant variance (homoscedasticity).
RESIDUAL ANALYSIS

$$e_i = Y_i - \hat{Y}_i$$

The residual for observation i, $e_i$, is the difference between its observed and predicted value. Check the assumptions of regression by examining the residuals:
Examine for linearity assumption
Evaluate independence assumption
Evaluate normal distribution assumption
Examine equal variance for all levels of X
[Figure: two residual-versus-X plots, labeled "Not Linear" (residuals show a curved pattern) and "Linear" (residuals scatter randomly around zero).]
[Figure: two residual-versus-X plots, labeled "Non-constant variance" (residual spread changes with X) and "Constant variance" (residual spread is similar across X).]

For the house price example, the predicted values and residuals are:

Predicted House Price   Residuals
251.92316                -6.923162
273.87671                38.12329
284.85348                -5.853484
304.06284                 3.937162
218.99284               -19.99284
268.38832               -49.38832
356.20251                48.79749
367.17929               -43.17929
254.66740                64.33264
284.85348               -29.85348

[Figure: plot of residuals versus square feet.]
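The table above can be regenerated from the fitted equation; small differences in the last digits come from coefficient rounding.

```python
# Predicted values and residuals e_i = Y_i - Y_hat_i for each house.
square_feet = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]
price = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]
b0, b1 = 98.24833, 0.10977

for x, y in zip(square_feet, price):
    y_hat = b0 + b1 * x                              # predicted house price
    print(f"{y_hat:10.5f} {y - y_hat:11.5f}")        # prediction, residual
```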
CONSTRUCTING A NORMAL PROBABILITY PLOT (SPSS OUTPUT)
Normal probability plot: arrange residuals into an ascending array and plot the observed cumulative probabilities against the expected normal probabilities.
[Figure: normal probability plot with Observed on the horizontal axis and Expected on the vertical axis; points lying close to the diagonal support the normality assumption.]
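A sketch of an equivalent plot with SciPy and matplotlib (the slides use SPSS; scipy.stats.probplot builds a comparable normal probability plot of the residuals).

```python
import matplotlib.pyplot as plt
from scipy import stats

# Residuals from the house price regression (from the residual table above).
residuals = [-6.92316, 38.12329, -5.85348, 3.93716, -19.99284,
             -49.38832, 48.79749, -43.17929, 64.33264, -29.85348]

# Points lying close to the reference line support the normality assumption.
stats.probplot(residuals, dist="norm", plot=plt)
plt.show()
```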
[Figure: residual plots labeled "Not Independent" (residuals show cyclical or trending patterns over time) and "Independent" (residuals scatter randomly over time).]
AUTOCORRELATION
Autocorrelation is correlation of the errors (residuals) over time.

THE DURBIN-WATSON STATISTIC
The Durbin-Watson statistic is used to test for autocorrelation:

$$D = \frac{\sum_{i=2}^{n} (e_i - e_{i-1})^2}{\sum_{i=1}^{n} e_i^2}$$
MEASURING AUTOCORRELATION: THE DURBIN-WATSON STATISTIC
Used when data are collected over time to detect if autocorrelation is present. Autocorrelation exists if residuals in one time period are related to residuals in another period.

Decision rule for positive autocorrelation: if $D < d_L$, positive autocorrelation exists; if $d_L \le D \le d_U$, the test is inconclusive; if $D > d_U$, positive autocorrelation does not exist.
Is there autocorrelation?

Week   Customers   Sales
 1     794          9.33
 2     799          8.26
 3     837          7.48
 4     855          9.08
 5     845          9.83
 6     844         10.09
 7     863         11.01
 8     875         11.49
 9     880         12.07
10     905         12.55
11     886         11.92
12     843         10.27
13     904         11.80
14     950         12.15
15     841          9.64
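A sketch of the test on these data: regress Sales on Customers, take the residuals in time order, and apply the Durbin-Watson formula from above.

```python
from scipy import stats

customers = [794, 799, 837, 855, 845, 844, 863, 875, 880, 905,
             886, 843, 904, 950, 841]
sales = [9.33, 8.26, 7.48, 9.08, 9.83, 10.09, 11.01, 11.49,
         12.07, 12.55, 11.92, 10.27, 11.8, 12.15, 9.64]

fit = stats.linregress(customers, sales)
e = [y - (fit.intercept + fit.slope * x) for x, y in zip(customers, sales)]

# Durbin-Watson: D near 2 means no autocorrelation; D well below 2
# points toward positive autocorrelation.
d = sum((e[i] - e[i - 1]) ** 2 for i in range(1, len(e))) / sum(ei ** 2 for ei in e)
print(d)
```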
INFERENCES ABOUT THE SLOPE: t TEST
Test statistic:

$$t_{STAT} = \frac{b_1 - \beta_1}{S_{b_1}}, \qquad \text{d.f.} = n - 2$$

where $b_1$ = regression slope coefficient, $\beta_1$ = hypothesized slope, and $S_{b_1}$ = standard error of the slope:

$$S_{b_1} = \frac{S_{YX}}{\sqrt{SS_X}} = \frac{S_{YX}}{\sqrt{\sum (X_i - \bar{X})^2}}$$

$$S_{YX} = \sqrt{\frac{SSE}{n-2}} = \sqrt{\frac{\sum (Y_i - \hat{Y}_i)^2}{n-2}}$$

where SSE = error sum of squares and n = sample size.
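Assembling these pieces for the house price example reproduces the standard error of the slope and the t statistic reported by Excel:

```python
import math

square_feet = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]
price = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]
b0, b1 = 98.24833, 0.10977
n = len(price)

sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(square_feet, price))
s_yx = math.sqrt(sse / (n - 2))        # ~41.33 (standard error of the estimate)

x_bar = sum(square_feet) / n
ss_x = sum((x - x_bar) ** 2 for x in square_feet)
s_b1 = s_yx / math.sqrt(ss_x)          # ~0.03297 (standard error of the slope)

t_stat = (b1 - 0) / s_b1               # ~3.329, with beta1 = 0 under H0
print(s_yx, s_b1, t_stat)
```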
To test whether square footage affects house price, use the house price data shown earlier and test:

H0: β1 = 0 (no linear relationship)
H1: β1 ≠ 0 (a linear relationship exists)
Using $b_1 = 0.10977$ and $S_{b_1} = 0.03297$ from the Excel output above:

$$t_{STAT} = \frac{b_1 - \beta_1}{S_{b_1}} = \frac{0.10977 - 0}{0.03297} = 3.32938$$
H0: β1 = 0, H1: β1 ≠ 0, d.f. = 10 − 2 = 8. At α = 0.05, the critical values are ±t(α/2) = ±2.3060.

[Figure: t distribution with rejection regions beyond ±2.3060 (α/2 = 0.025 in each tail); the test statistic 3.329 falls in the upper rejection region.]

Decision: since $t_{STAT} = 3.329 > 2.3060$, reject H0. There is sufficient evidence that square footage affects house price.
p-VALUE APPROACH
H0: β1 = 0, H1: β1 ≠ 0. From the Excel output, the p-value for the slope is 0.01039. Since p-value = 0.01039 < α = 0.05, reject H0: there is sufficient evidence that square footage affects house price.
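The p-value itself follows from the t distribution with n − 2 = 8 degrees of freedom:

```python
from scipy import stats

t_stat, df = 3.32938, 8
p_value = 2 * stats.t.sf(t_stat, df)   # two-tailed: ~0.0104 < alpha = 0.05
print(p_value)
```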
FITTING · DIAGNOSTICS · INTERPRETATION · PREDICTION