
SIMPLE LINEAR REGRESSION
CASE: SPENDING AND EDUCATION

INTRODUCTION TO REGRESSION ANALYSIS
One of the most pervasive methods in the business world.
Regression analysis is used to:
Study the relationship between variables
Predict the value of a dependent variable based on the value of at least one independent variable
Explain the impact of changes in an independent variable on the dependent variable

Dependent variable (response or target variable): the variable we wish to predict or explain
Independent variable (explanatory or predictor variable): the variable used to predict or explain the dependent variable

SIMPLE LINEAR REGRESSION MODEL
Only one independent variable, X
Assume that X and Y are linearly related:
the relationship between X and Y is described by a linear function
Changes in Y are assumed to be related to changes in X

TYPES OF RELATIONSHIPS
Linear relationships
Curvilinear relationships
[Scatter plots of Y versus X illustrating linear and curvilinear patterns]

TYPES OF RELATIONSHIPS
Strong relationships
Weak relationships
[Scatter plots of Y versus X illustrating strong and weak patterns]

PROMOTION V/S SALES

SIMPLE LINEAR REGRESSION MODEL

Yi = β0 + β1·Xi + εi

where:
Yi = dependent variable
β0 = population Y intercept
β1 = population slope coefficient
Xi = independent variable
εi = random error term

β0 + β1·Xi is the linear component; εi is the random error component

SIMPLE LINEAR REGRESSION MODEL
[Graph of Yi = β0 + β1·Xi + εi: the line has intercept β0 and slope β1; for a given Xi, the observed value of Y differs from the predicted value on the line by the random error εi]

EXAMPLE
The annual bonuses ($1,000s) of six employees with different years of
experience were recorded as follows. We wish to determine the straight-line
relationship between annual bonus and years of experience.

Years of experience (x):  1   2   3   4   5   6
Annual bonus (y):         6   1   9   5  17  12

[Scatter plot of annual bonus versus experience]

Consider the trial line Ŷ = 2x + 3:

 x    Y     Ŷ    (Y - Ŷ)   (Y - Ŷ)²
 1    6     5       1         1
 2    1     7      -6        36
 3    9     9       0         0
 4    5    11      -6        36
 5   17    13       4        16
 6   12    15      -3         9
                   Sum:      98

The least squares line Ŷ = 2.114x + 0.934 gives a smaller sum of squared differences:

 x    Y      Ŷ       (Y - Ŷ)   (Y - Ŷ)²
 1    6    3.048      2.952    8.714304
 2    1    5.162     -4.162   17.322240
 3    9    7.276      1.724    2.972176
 4    5    9.390     -4.390   19.272100
 5   17   11.504      5.496   30.206020
 6   12   13.618     -1.618    2.617924
                     Sum:     81.104760

LEAST SQUARES LINE
Example 16.1
These differences between observed and predicted values are called residuals.

SIMPLE LINEAR REGRESSION EQUATION
The simple linear regression equation provides an estimate of the population regression function:

Ŷi = b0 + b1·Xi

where:
Ŷi = estimated (or predicted) Y value for observation i
b0 = estimate of the regression intercept
b1 = estimate of the regression slope
Xi = value of X for observation i

THE LEAST SQUARES METHOD
b0 and b1 are obtained by finding the values that minimize the sum of the
squared differences between Yi and Ŷi:

min Σ(Yi - Ŷi)² = min Σ(Yi - (b0 + b1·Xi))²

The solution is:

b1 = SSXY / SSXX, where SSXY = Σ(Xi - X̄)(Yi - Ȳ) and SSXX = Σ(Xi - X̄)²

b0 = Ȳ - b1·X̄

For the bonus data:

  X       Y      X - X̄   Y - Ȳ   (X - X̄)(Y - Ȳ)   (X - X̄)²
  1.00    6.00   -2.50   -2.33        5.83           6.25
  2.00    1.00   -1.50   -7.33       11.00           2.25
  3.00    9.00   -0.50    0.67       -0.33           0.25
  4.00    5.00    0.50   -3.33       -1.67           0.25
  5.00   17.00    1.50    8.67       13.00           2.25
  6.00   12.00    2.50    3.67        9.17           6.25
Total 21.00    50.00                 37.00          17.50

b1 = 37 / 17.5 = 2.114

b0 = 50/6 - 2.114 × (21/6) = 0.9343

Ŷ = 2.114x + 0.9343
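The least-squares computation above can be sketched in plain Python (a minimal sketch; variable names are illustrative):

```python
x = [1, 2, 3, 4, 5, 6]          # years of experience
y = [6, 1, 9, 5, 17, 12]        # annual bonus ($1,000s)
n = len(x)

x_bar = sum(x) / n              # 3.5
y_bar = sum(y) / n              # 8.333...

# SSXY = sum of (Xi - X̄)(Yi - Ȳ); SSXX = sum of (Xi - X̄)²
ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))   # 37.0
ss_xx = sum((xi - x_bar) ** 2 for xi in x)                         # 17.5

b1 = ss_xy / ss_xx              # slope ≈ 2.114
b0 = y_bar - b1 * x_bar         # intercept ≈ 0.933

print(round(b1, 3), round(b0, 4))
```

Note that the unrounded slope gives b0 ≈ 0.933; the slide's 0.9343 comes from carrying the rounded slope 2.114 through the intercept formula.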

INTERPRETATION OF THE
SLOPE AND THE INTERCEPT
b0 is the estimated average value of Y when
the value of X is zero
b1 is the estimated change in the average
value of Y as a result of a one-unit increase
in X

SIMPLE LINEAR REGRESSION


EXAMPLE
A real estate agent wishes to examine the relationship
between the selling price of a home and its size
(measured in square feet)
A random sample of 10 houses is selected
Dependent variable (Y) = house price in $1000s
Independent variable (X) = square feet

SIMPLE LINEAR REGRESSION EXAMPLE: DATA

House Price in $1000s (Y)   Square Feet (X)
        245                      1400
        312                      1600
        279                      1700
        308                      1875
        199                      1100
        219                      1550
        405                      2350
        324                      2450
        319                      1425
        255                      1700

SIMPLE LINEAR REGRESSION EXAMPLE: SCATTER PLOT
House price model: scatter plot
[Scatter plot of house price ($1000s) versus square feet]

SIMPLE LINEAR REGRESSION EXAMPLE:
USING EXCEL DATA ANALYSIS FUNCTION
1. Choose Data
2. Choose Data Analysis
3. Choose Regression

SIMPLE LINEAR REGRESSION EXAMPLE:
USING EXCEL DATA ANALYSIS FUNCTION
Enter the Y range and X range and desired options

SIMPLE LINEAR REGRESSION EXAMPLE: EXCEL OUTPUT

Regression Statistics
Multiple R          0.76211
R Square            0.58082
Adjusted R Square   0.52842
Standard Error     41.33032
Observations       10

The regression equation is:
house price = 98.24833 + 0.10977 (square feet)

ANOVA
             df        SS            MS           F        Significance F
Regression    1    18934.9348    18934.9348    11.0848       0.01039
Residual      8    13665.5652     1708.1957
Total         9    32600.5000

              Coefficients   Standard Error   t Stat    P-value    Lower 95%   Upper 95%
Intercept       98.24833        58.03348      1.69296   0.12892    -35.57720   232.07386
Square Feet      0.10977         0.03297      3.32938   0.01039      0.03374     0.18580

SIMPLE LINEAR REGRESSION EXAMPLE: GRAPHICAL REPRESENTATION
House price model: scatter plot and prediction line
[Scatter plot of house price ($1000s) versus square feet with the fitted line; intercept = 98.248, slope = 0.10977]

house price = 98.24833 + 0.10977 (square feet)

house price = 98.24833 + 0.10977 (square feet)

b0 is the estimated average value of Y when the value of X is zero (if X = 0 is in the range of observed X values)
Because a house cannot have a square footage of 0, b0 has no practical application here

house price = 98.24833 + 0.10977 (square feet)

b1 estimates the change in the average value of Y as a result of a one-unit increase in X
Here, b1 = 0.10977 tells us that the mean value of a house increases by 0.10977 × ($1000) = $109.77, on average, for each additional square foot of size

When using a regression model for prediction, predict only within the relevant range of the data
[Scatter plot of house price ($1000s) versus square feet; the relevant range for interpolation spans the observed X values]
Do not try to extrapolate beyond the range of observed Xs

MAKING PREDICTIONS
Predict the price for a house with 2000 square feet:

house price = 98.25 + 0.1098 (sq. ft.)
            = 98.25 + 0.1098 (2000)
            = 317.85

The predicted price for a house with 2000 square feet is 317.85 ($1,000s) = $317,850
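The prediction above can be sketched as a short helper function (the function name is illustrative):

```python
# Rounded coefficients from the fitted house-price model
b0, b1 = 98.25, 0.1098

def predict_price(sq_ft):
    """Predicted house price in $1,000s for a given size in square feet."""
    return b0 + b1 * sq_ft

price = predict_price(2000)     # 98.25 + 0.1098 * 2000 = 317.85
print(f"${price * 1000:,.0f}")  # prints $317,850
```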

MEASURES OF VARIATION
[Graph: for an observation (Xi, Yi), the total deviation Yi - Ȳ splits into the explained part Ŷi - Ȳ and the unexplained part Yi - Ŷi]

SST = Σ(Yi - Ȳ)²    SSR = Σ(Ŷi - Ȳ)²    SSE = Σ(Yi - Ŷi)²

MEASURES OF VARIATION
Total variation is made up of two parts:

SST = SSR + SSE

SST = total sum of squares:      SST = Σ(Yi - Ȳ)²
SSR = regression sum of squares: SSR = Σ(Ŷi - Ȳ)²
SSE = error sum of squares:      SSE = Σ(Yi - Ŷi)²

where:
Ȳ = mean value of the dependent variable
Yi = observed value of the dependent variable
Ŷi = predicted value of Y for the given Xi value

MEASURES OF VARIATION
SST = total sum of squares (total variation)
Measures the variation of the Yi values around their mean Ȳ
SSR = regression sum of squares (explained variation)
Variation attributable to the relationship between X and Y
SSE = error sum of squares (unexplained variation)
Variation in Y attributable to factors other than X
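These three sums can be computed directly for the bonus example, verifying the identity SST = SSR + SSE (a minimal sketch using the least-squares fit):

```python
x = [1, 2, 3, 4, 5, 6]
y = [6, 1, 9, 5, 17, 12]
n = len(x)

# Least-squares fit (same formulas as earlier in the deck)
x_bar, y_bar = sum(x) / n, sum(y) / n
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / \
     sum((xi - x_bar) ** 2 for xi in x)
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * xi for xi in x]

sst = sum((yi - y_bar) ** 2 for yi in y)                 # total variation
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)             # explained variation
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))    # unexplained variation

print(round(sst, 4), round(ssr, 4), round(sse, 4))       # SSE ≈ 81.1048
assert abs(sst - (ssr + sse)) < 1e-8                      # SST = SSR + SSE
```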

COEFFICIENT OF DETERMINATION, R²
The coefficient of determination is the portion of the total variation in the
dependent variable that is explained by variation in the independent variable
The coefficient of determination is also called r-squared and is denoted R²

R² = SSR / SST = regression sum of squares / total sum of squares

note: 0 ≤ R² ≤ 1

EXAMPLES OF APPROXIMATE R² VALUES
R² = 1
Perfect linear relationship between X and Y:
100% of the variation in Y is explained by variation in X

EXAMPLES OF APPROXIMATE R² VALUES
0 < R² < 1
Weaker linear relationships between X and Y:
some but not all of the variation in Y is explained by variation in X

EXAMPLES OF APPROXIMATE R² VALUES
R² = 0
No linear relationship between X and Y:
the value of Y does not depend on X (none of the variation in Y is explained by variation in X)

SIMPLE LINEAR REGRESSION EXAMPLE:
COEFFICIENT OF DETERMINATION, R² IN EXCEL

From the Excel output (SSR and SST are the Regression and Total SS in the ANOVA table):

R² = SSR / SST = 18934.9348 / 32600.5000 = 0.58082

58.08% of the variation in house prices is explained by variation in square feet

Regression Statistics
Multiple R          0.76211
R Square            0.58082
Adjusted R Square   0.52842
Standard Error     41.33032
Observations       10

ASSUMPTIONS OF REGRESSION
L.I.N.E
Linearity
The relationship between X and Y is linear

Independence of Errors
Error values are statistically independent

Normality of Error
Error values are normally distributed for any given value of X

Equal Variance (also called homoscedasticity)


The probability distribution of the errors has constant variance

RESIDUAL ANALYSIS

ei = Yi - Ŷi

The residual for observation i, ei, is the difference between its observed and predicted value
Check the assumptions of regression by examining the residuals:
Examine for linearity assumption
Evaluate independence assumption
Evaluate normal distribution assumption
Examine for constant variance for all levels of X (homoscedasticity)

Graphical analysis of residuals: plot residuals versus X or predicted Y
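Computing the residuals for the house-price model can be sketched as follows (a minimal sketch using the rounded Excel coefficients, so values differ slightly from the Excel residual output):

```python
sq_ft = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]
price = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]

# Rounded coefficients from the Excel output
b0, b1 = 98.24833, 0.10977

# ei = Yi - Ŷi for each observation
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(sq_ft, price)]

# First residual: 245 - (98.24833 + 0.10977 * 1400) ≈ -6.93
print([round(e, 2) for e in residuals])
```

For a least-squares fit the residuals sum to (almost exactly) zero; with these rounded coefficients the sum is merely close to zero.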

RESIDUAL ANALYSIS FOR LINEARITY
[Residual plots versus X: no apparent pattern between residuals and X indicates a linear relationship; a curved pattern indicates the relationship is not linear]

RESIDUAL ANALYSIS FOR NORMALITY
When using a normal probability plot, normal errors will approximately display in a straight line
[Normal probability plot: percent versus residual]

RESIDUAL ANALYSIS FOR EQUAL VARIANCE
[Residual plots versus X: a constant spread of residuals indicates constant variance; a fanning pattern indicates non-constant variance]

SIMPLE LINEAR REGRESSION EXAMPLE: RESIDUAL OUTPUT

RESIDUAL OUTPUT
Predicted House Price    Residuals
      251.92316          -6.923162
      273.87671          38.123290
      284.85348          -5.853484
      304.06284           3.937162
      218.99284         -19.992840
      268.38832         -49.388320
      356.20251          48.797490
      367.17929         -43.179290
      254.66740          64.332640
      284.85348         -29.853480

[House price model residual plot: residuals versus square feet]

Does not appear to violate any regression assumptions

RESIDUAL ANALYSIS FOR EQUAL VARIANCE
Plot the residuals on the vertical axis against the values of X.
If the residuals show the same amount of variation across all values of X, the equal-variance assumption holds.
[Residual plots illustrating non-constant variance and constant variance]


CONSTRUCTING A NORMAL PROBABILITY PLOT - SPSS OUTPUT
Normal probability plot:
Arrange the residuals into an ascending array
Calculate the observed cumulative probability as i/(N+1)
Calculate the expected cumulative probability as per the normal distribution
Plot the pairs of points with the observed cumulative probability on the vertical axis and the expected cumulative probability on the horizontal axis
Evaluate the plot for evidence of linearity

[Normality P-P plot: observed versus expected cumulative probability]
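The P-P construction described above can be sketched in Python (a minimal sketch; `statistics.NormalDist` stands in for the expected-probability step that SPSS performs internally, and the residuals are the rounded house-price residuals):

```python
from statistics import NormalDist, mean, stdev

residuals = [-6.92, 38.12, -5.85, 3.94, -19.99, -49.39, 48.80, -43.18, 64.33, -29.85]

ordered = sorted(residuals)                        # ascending array
n = len(ordered)
observed = [i / (n + 1) for i in range(1, n + 1)]  # observed probability i/(N+1)

# Expected cumulative probability under a normal distribution fitted to the residuals
dist = NormalDist(mean(ordered), stdev(ordered))
expected = [dist.cdf(r) for r in ordered]

# The (expected, observed) pairs should fall near a straight line if errors are normal
for e, o in zip(expected, observed):
    print(round(e, 3), round(o, 3))
```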

RESIDUAL ANALYSIS FOR INDEPENDENCE
A plot of residuals versus time will show some pattern if the errors are not independent (time-series data)
[Residual-versus-time plots: a cyclical or trending pattern indicates errors are not independent; a random scatter indicates independence]

AUTOCORRELATION
Autocorrelation is correlation of the errors (residuals) over time
It violates the regression assumption that residuals are random and independent

THE DURBIN-WATSON STATISTIC
The Durbin-Watson statistic is used to test for autocorrelation:

D = Σ(i=2 to n) (ei - ei-1)² / Σ(i=1 to n) ei²

The possible range is 0 ≤ D ≤ 4
D should be close to 2 for no autocorrelation
D less than 2 may signal positive autocorrelation, D greater than 2 may signal negative autocorrelation
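The Durbin-Watson formula above translates directly into code (a minimal sketch; the example residuals are synthetic, chosen so the result is easy to check by hand):

```python
def durbin_watson(residuals):
    """D = sum over i=2..n of (e_i - e_{i-1})^2, divided by sum of e_i^2."""
    num = sum((residuals[i] - residuals[i - 1]) ** 2
              for i in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den

# Perfectly alternating residuals give strong negative autocorrelation (D near 4):
print(durbin_watson([1, -1, 1, -1]))   # 12 / 4 = 3.0
# Identical residuals give strong positive autocorrelation (D = 0):
print(durbin_watson([1, 1, 1, 1]))     # 0 / 4 = 0.0
```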

MEASURING
AUTOCORRELATION:
THE DURBIN-WATSON
STATISTIC
Used when data are collected over time to detect if
autocorrelation is present
Autocorrelation exists if residuals in one time period are related
to residuals in another period

TESTING FOR POSITIVE AUTOCORRELATION
Calculate the Durbin-Watson test statistic D; its value can lie between 0 and 4.
Find the values dL and dU from the Durbin-Watson table (for sample size n and number of independent variables k):
D < dL: positive autocorrelation exists
dL ≤ D ≤ dU: inconclusive
D > dU: positive autocorrelation does not exist
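The decision rule above can be sketched as a small function (the function name and return strings are illustrative; dL and dU must still be looked up in a Durbin-Watson table):

```python
def positive_autocorrelation_test(d, d_lower, d_upper):
    """Apply the one-sided Durbin-Watson decision rule for positive autocorrelation."""
    if d < d_lower:
        return "positive autocorrelation exists"
    if d > d_upper:
        return "positive autocorrelation does not exist"
    return "inconclusive"

# The customer/sales example below: D = 0.8830 with dL = 1.08, dU = 1.36 (n = 15, k = 1)
print(positive_autocorrelation_test(0.8830, 1.08, 1.36))
```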

TESTING FOR POSITIVE AUTOCORRELATION
Suppose we have the following data. Is there autocorrelation?

Week   Customers   Sales
  1       794       9.33
  2       799       8.26
  3       837       7.48
  4       855       9.08
  5       845       9.83
  6       844      10.09
  7       863      11.01
  8       875      11.49
  9       880      12.07
 10       905      12.55
 11       886      11.92
 12       843      10.27
 13       904      11.80
 14       950      12.15
 15       841       9.64


TESTING FOR POSITIVE AUTOCORRELATION
Here, n = 15 and there is k = 1 independent variable
Using the Durbin-Watson table, dL = 1.08 and dU = 1.36
D = 0.8830 < dL = 1.08, so significant positive autocorrelation exists

INFERENCES ABOUT THE SLOPE: t TEST
t test for a population slope:
Is there a linear relationship between X and Y?
Null and alternative hypotheses:
H0: β1 = 0 (no linear relationship)
H1: β1 ≠ 0 (linear relationship does exist)
Test statistic:

tSTAT = (b1 - β1) / Sb1,   d.f. = n - 2

where:
b1 = regression slope coefficient
β1 = hypothesized slope
Sb1 = standard error of the slope

INFERENCES ABOUT THE SLOPE
The standard error of the regression slope coefficient (b1) is estimated by

Sb1 = SYX / √SSX = SYX / √Σ(Xi - X̄)²

where:
Sb1 = estimate of the standard error of the slope
SYX = √(SSE / (n - 2)) = standard error of the estimate

STANDARD ERROR OF ESTIMATE
The standard deviation of the variation of observations around the regression line is estimated by

SYX = √(SSE / (n - 2)) = √( Σ(Yi - Ŷi)² / (n - 2) )

where:
SSE = error sum of squares
n = sample size
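Plugging the house-price model's SSE and n from the Excel ANOVA table into this formula reproduces the reported Standard Error (a minimal sketch):

```python
import math

sse = 13665.5652   # error sum of squares, from the ANOVA table
n = 10             # sample size

s_yx = math.sqrt(sse / (n - 2))
print(round(s_yx, 5))   # ≈ 41.33032, matching the Excel output
```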

INFERENCES ABOUT THE SLOPE: t TEST EXAMPLE

House Price in $1000s (Y)   Square Feet (X)
        245                      1400
        312                      1600
        279                      1700
        308                      1875
        199                      1100
        219                      1550
        405                      2350
        324                      2450
        319                      1425
        255                      1700

Estimated regression equation:
house price = 98.25 + 0.1098 (sq. ft.)

The slope of this model is 0.1098
Is there a relationship between the square footage of the house and its sales price?

INFERENCES ABOUT THE SLOPE: t TEST EXAMPLE
H0: β1 = 0
H1: β1 ≠ 0

From Excel output:
              Coefficients   Standard Error   t Stat    P-value
Intercept       98.24833        58.03348      1.69296   0.12892
Square Feet      0.10977         0.03297      3.32938   0.01039

tSTAT = (b1 - β1) / Sb1 = (0.10977 - 0) / 0.03297 = 3.32938
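The t statistic computation above is a one-liner in code (a minimal sketch using the rounded coefficients, so the last decimal differs slightly from Excel's unrounded 3.32938):

```python
b1 = 0.10977        # estimated slope
sb1 = 0.03297       # standard error of the slope
beta1_null = 0      # hypothesized slope under H0

t_stat = (b1 - beta1_null) / sb1
print(round(t_stat, 4))   # ≈ 3.3294
```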

INFERENCES ABOUT THE SLOPE: t TEST EXAMPLE
H0: β1 = 0
H1: β1 ≠ 0
Test statistic: tSTAT = 3.329
d.f. = 10 - 2 = 8
At α/2 = .025, the critical values are ±2.3060
Since tSTAT = 3.329 > 2.3060, reject H0
Decision: reject H0
There is sufficient evidence that square footage affects house price

INFERENCES ABOUT THE SLOPE: t TEST EXAMPLE
H0: β1 = 0
H1: β1 ≠ 0

From Excel output:
              Coefficients   Standard Error   t Stat    P-value
Intercept       98.24833        58.03348      1.69296   0.12892
Square Feet      0.10977         0.03297      3.32938   0.01039

Decision: reject H0, since p-value = 0.01039 < α
There is sufficient evidence that square footage affects house price.

REGRESSION DOES NOT MEAN CAUSATION

PROCEDURE OF CARRYING OUT REGRESSION
FITTING → DIAGNOSTICS → INTERPRETATION → PREDICTION
