
SIMPLE LINEAR REGRESSION
CASE: SPENDING AND EDUCATION

INTRODUCTION TO REGRESSION ANALYSIS
One of the most pervasive methods in the business world.
Regression analysis is used to:
Study the relationship between variables
Predict the value of a dependent variable based on the value of at least one independent variable
Explain the impact of changes in an independent variable on the dependent variable

Dependent variable (response or target variable): the variable we wish to predict or explain
Independent variable (explanatory or predictor variable): the variable used to predict or explain the dependent variable

SIMPLE LINEAR REGRESSION MODEL
Only one independent variable, X
Assume that X and Y are linearly related:
the relationship between X and Y is described by a linear function
Changes in Y are assumed to be related to changes in X

TYPES OF RELATIONSHIPS
Linear relationships
Curvilinear relationships
[Scatter plots of Y versus X illustrating linear and curvilinear patterns]

TYPES OF RELATIONSHIPS
Strong relationships
Weak relationships
[Scatter plots of Y versus X illustrating strong and weak patterns]

PROMOTION V/S SALES

SIMPLE LINEAR REGRESSION MODEL

Yi = β0 + β1·Xi + εi

where:
Yi = dependent variable
β0 = population Y intercept
β1 = population slope coefficient
Xi = independent variable
εi = random error term

β0 + β1·Xi is the linear component; εi is the random error component

SIMPLE LINEAR REGRESSION MODEL
[Graph of Yi = β0 + β1·Xi + εi: the line has intercept β0 and slope β1; for a given Xi, the observed value of Y differs from the predicted value on the line by the random error εi]

EXAMPLE
The annual bonuses ($1,000s) of six employees with different years of
experience were recorded as follows. We wish to determine the straight-line
relationship between annual bonus and years of experience.

Years of experience (x):  1   2   3   4   5   6
Annual bonus (y):         6   1   9   5  17  12

[Scatter plot of annual bonus versus experience]

Consider the trial line Ŷ = 2x + 3:

 x    Y     Ŷ    (Y - Ŷ)   (Y - Ŷ)²
 1    6     5       1         1
 2    1     7      -6        36
 3    9     9       0         0
 4    5    11      -6        36
 5   17    13       4        16
 6   12    15      -3         9
                   Sum:      98

The least squares line Ŷ = 2.114x + 0.934 gives a smaller sum of squared differences:

 x    Y      Ŷ       (Y - Ŷ)   (Y - Ŷ)²
 1    6    3.048      2.952    8.714304
 2    1    5.162     -4.162   17.322240
 3    9    7.276      1.724    2.972176
 4    5    9.390     -4.390   19.272100
 5   17   11.504      5.496   30.206020
 6   12   13.618     -1.618    2.617924
                     Sum:     81.104760

LEAST SQUARES LINE
Example 16.1
These differences between observed and predicted values are called residuals.

SIMPLE LINEAR REGRESSION EQUATION
The simple linear regression equation provides an estimate of the population regression function:

Ŷi = b0 + b1·Xi

where:
Ŷi = estimated (or predicted) Y value for observation i
b0 = estimate of the regression intercept
b1 = estimate of the regression slope
Xi = value of X for observation i

THE LEAST SQUARES METHOD
b0 and b1 are obtained by finding the values that minimize the sum of the
squared differences between Yi and Ŷi:

min Σ(Yi - Ŷi)² = min Σ(Yi - (b0 + b1·Xi))²

The solution is:

b1 = SSXY / SSXX, where SSXY = Σ(Xi - X̄)(Yi - Ȳ) and SSXX = Σ(Xi - X̄)²

b0 = Ȳ - b1·X̄

For the bonus data:

  X       Y      X - X̄   Y - Ȳ   (X - X̄)(Y - Ȳ)   (X - X̄)²
  1.00    6.00   -2.50   -2.33        5.83           6.25
  2.00    1.00   -1.50   -7.33       11.00           2.25
  3.00    9.00   -0.50    0.67       -0.33           0.25
  4.00    5.00    0.50   -3.33       -1.67           0.25
  5.00   17.00    1.50    8.67       13.00           2.25
  6.00   12.00    2.50    3.67        9.17           6.25
Total 21.00    50.00                 37.00          17.50

b1 = 37 / 17.5 = 2.114

b0 = 50/6 - 2.114 × (21/6) = 0.9343

Ŷ = 2.114x + 0.9343
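The least-squares computation above can be sketched in plain Python (a minimal sketch; variable names are illustrative):

```python
x = [1, 2, 3, 4, 5, 6]          # years of experience
y = [6, 1, 9, 5, 17, 12]        # annual bonus ($1,000s)
n = len(x)

x_bar = sum(x) / n              # 3.5
y_bar = sum(y) / n              # 8.333...

# SSXY = sum of (Xi - X̄)(Yi - Ȳ); SSXX = sum of (Xi - X̄)²
ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))   # 37.0
ss_xx = sum((xi - x_bar) ** 2 for xi in x)                         # 17.5

b1 = ss_xy / ss_xx              # slope ≈ 2.114
b0 = y_bar - b1 * x_bar         # intercept ≈ 0.933

print(round(b1, 3), round(b0, 4))
```

Note that the unrounded slope gives b0 ≈ 0.933; the slide's 0.9343 comes from carrying the rounded slope 2.114 through the intercept formula.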

INTERPRETATION OF THE
SLOPE AND THE INTERCEPT
b0 is the estimated average value of Y when
the value of X is zero
b1 is the estimated change in the average
value of Y as a result of a one-unit increase
in X

SIMPLE LINEAR REGRESSION


EXAMPLE
A real estate agent wishes to examine the relationship
between the selling price of a home and its size
(measured in square feet)
A random sample of 10 houses is selected
Dependent variable (Y) = house price in $1000s
Independent variable (X) = square feet

SIMPLE LINEAR REGRESSION EXAMPLE: DATA

House Price in $1000s (Y)   Square Feet (X)
        245                      1400
        312                      1600
        279                      1700
        308                      1875
        199                      1100
        219                      1550
        405                      2350
        324                      2450
        319                      1425
        255                      1700

SIMPLE LINEAR REGRESSION EXAMPLE: SCATTER PLOT
House price model: scatter plot
[Scatter plot of house price ($1000s) versus square feet]

SIMPLE LINEAR REGRESSION EXAMPLE:
USING EXCEL DATA ANALYSIS FUNCTION
1. Choose Data
2. Choose Data Analysis
3. Choose Regression

SIMPLE LINEAR REGRESSION EXAMPLE:
USING EXCEL DATA ANALYSIS FUNCTION
Enter the Y range and X range and desired options

SIMPLE LINEAR REGRESSION EXAMPLE: EXCEL OUTPUT

Regression Statistics
Multiple R          0.76211
R Square            0.58082
Adjusted R Square   0.52842
Standard Error     41.33032
Observations       10

The regression equation is:
house price = 98.24833 + 0.10977 (square feet)

ANOVA
             df        SS            MS           F        Significance F
Regression    1    18934.9348    18934.9348    11.0848       0.01039
Residual      8    13665.5652     1708.1957
Total         9    32600.5000

              Coefficients   Standard Error   t Stat    P-value    Lower 95%   Upper 95%
Intercept       98.24833        58.03348      1.69296   0.12892    -35.57720   232.07386
Square Feet      0.10977         0.03297      3.32938   0.01039      0.03374     0.18580

SIMPLE LINEAR REGRESSION EXAMPLE: GRAPHICAL REPRESENTATION
House price model: scatter plot and prediction line
[Scatter plot of house price ($1000s) versus square feet with the fitted line; intercept = 98.248, slope = 0.10977]

house price = 98.24833 + 0.10977 (square feet)

house price = 98.24833 + 0.10977 (square feet)

b0 is the estimated average value of Y when the value of X is zero (if X = 0 is in the range of observed X values)
Because a house cannot have a square footage of 0, b0 has no practical application here

house price = 98.24833 + 0.10977 (square feet)

b1 estimates the change in the average value of Y as a result of a one-unit increase in X
Here, b1 = 0.10977 tells us that the mean value of a house increases by 0.10977 × ($1000) = $109.77, on average, for each additional square foot of size

When using a regression model for prediction, predict only within the relevant range of the data
[Scatter plot of house price ($1000s) versus square feet; the relevant range for interpolation spans the observed X values]
Do not try to extrapolate beyond the range of observed Xs

MAKING PREDICTIONS
Predict the price for a house with 2000 square feet:

house price = 98.25 + 0.1098 (sq. ft.)
            = 98.25 + 0.1098 (2000)
            = 317.85

The predicted price for a house with 2000 square feet is 317.85 ($1,000s) = $317,850
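The prediction above can be sketched as a short helper function (the function name is illustrative):

```python
# Rounded coefficients from the fitted house-price model
b0, b1 = 98.25, 0.1098

def predict_price(sq_ft):
    """Predicted house price in $1,000s for a given size in square feet."""
    return b0 + b1 * sq_ft

price = predict_price(2000)     # 98.25 + 0.1098 * 2000 = 317.85
print(f"${price * 1000:,.0f}")  # prints $317,850
```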

MEASURES OF VARIATION
[Graph: for an observation (Xi, Yi), the total deviation Yi - Ȳ splits into the explained part Ŷi - Ȳ and the unexplained part Yi - Ŷi]

SST = Σ(Yi - Ȳ)²    SSR = Σ(Ŷi - Ȳ)²    SSE = Σ(Yi - Ŷi)²

MEASURES OF VARIATION
Total variation is made up of two parts:

SST = SSR + SSE

SST = total sum of squares:      SST = Σ(Yi - Ȳ)²
SSR = regression sum of squares: SSR = Σ(Ŷi - Ȳ)²
SSE = error sum of squares:      SSE = Σ(Yi - Ŷi)²

where:
Ȳ = mean value of the dependent variable
Yi = observed value of the dependent variable
Ŷi = predicted value of Y for the given Xi value

MEASURES OF VARIATION
SST = total sum of squares (total variation)
Measures the variation of the Yi values around their mean Ȳ
SSR = regression sum of squares (explained variation)
Variation attributable to the relationship between X and Y
SSE = error sum of squares (unexplained variation)
Variation in Y attributable to factors other than X
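These three sums can be computed directly for the bonus example, verifying the identity SST = SSR + SSE (a minimal sketch using the least-squares fit):

```python
x = [1, 2, 3, 4, 5, 6]
y = [6, 1, 9, 5, 17, 12]
n = len(x)

# Least-squares fit (same formulas as earlier in the deck)
x_bar, y_bar = sum(x) / n, sum(y) / n
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / \
     sum((xi - x_bar) ** 2 for xi in x)
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * xi for xi in x]

sst = sum((yi - y_bar) ** 2 for yi in y)                 # total variation
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)             # explained variation
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))    # unexplained variation

print(round(sst, 4), round(ssr, 4), round(sse, 4))       # SSE ≈ 81.1048
assert abs(sst - (ssr + sse)) < 1e-8                      # SST = SSR + SSE
```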

COEFFICIENT OF DETERMINATION, R²
The coefficient of determination is the portion of the total variation in the
dependent variable that is explained by variation in the independent variable
The coefficient of determination is also called r-squared and is denoted R²

R² = SSR / SST = regression sum of squares / total sum of squares

note: 0 ≤ R² ≤ 1

EXAMPLES OF APPROXIMATE R² VALUES
R² = 1
Perfect linear relationship between X and Y:
100% of the variation in Y is explained by variation in X

EXAMPLES OF APPROXIMATE R² VALUES
0 < R² < 1
Weaker linear relationships between X and Y:
some but not all of the variation in Y is explained by variation in X

EXAMPLES OF APPROXIMATE R² VALUES
R² = 0
No linear relationship between X and Y:
the value of Y does not depend on X (none of the variation in Y is explained by variation in X)

SIMPLE LINEAR REGRESSION EXAMPLE:
COEFFICIENT OF DETERMINATION, R² IN EXCEL

From the Excel output (SSR and SST are the Regression and Total SS in the ANOVA table):

R² = SSR / SST = 18934.9348 / 32600.5000 = 0.58082

58.08% of the variation in house prices is explained by variation in square feet

Regression Statistics
Multiple R          0.76211
R Square            0.58082
Adjusted R Square   0.52842
Standard Error     41.33032
Observations       10

ASSUMPTIONS OF REGRESSION
L.I.N.E
Linearity
The relationship between X and Y is linear

Independence of Errors
Error values are statistically independent

Normality of Error
Error values are normally distributed for any given value of X

Equal Variance (also called homoscedasticity)


The probability distribution of the errors has constant variance

RESIDUAL ANALYSIS

ei = Yi - Ŷi

The residual for observation i, ei, is the difference between its observed and predicted value
Check the assumptions of regression by examining the residuals:
Examine for linearity assumption
Evaluate independence assumption
Evaluate normal distribution assumption
Examine for constant variance for all levels of X (homoscedasticity)

Graphical analysis of residuals: plot residuals versus X or predicted Y
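Computing the residuals for the house-price model can be sketched as follows (a minimal sketch using the rounded Excel coefficients, so values differ slightly from the Excel residual output):

```python
sq_ft = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]
price = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]

# Rounded coefficients from the Excel output
b0, b1 = 98.24833, 0.10977

# ei = Yi - Ŷi for each observation
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(sq_ft, price)]

# First residual: 245 - (98.24833 + 0.10977 * 1400) ≈ -6.93
print([round(e, 2) for e in residuals])
```

For a least-squares fit the residuals sum to (almost exactly) zero; with these rounded coefficients the sum is merely close to zero.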

RESIDUAL ANALYSIS FOR LINEARITY
[Residual plots versus X: no apparent pattern between residuals and X indicates a linear relationship; a curved pattern indicates the relationship is not linear]

RESIDUAL ANALYSIS FOR NORMALITY
When using a normal probability plot, normal errors will approximately display in a straight line
[Normal probability plot: percent versus residual]

RESIDUAL ANALYSIS FOR EQUAL VARIANCE
[Residual plots versus X: a constant spread of residuals indicates constant variance; a fanning pattern indicates non-constant variance]

SIMPLE LINEAR REGRESSION EXAMPLE: RESIDUAL OUTPUT

RESIDUAL OUTPUT
Predicted House Price    Residuals
      251.92316          -6.923162
      273.87671          38.123290
      284.85348          -5.853484
      304.06284           3.937162
      218.99284         -19.992840
      268.38832         -49.388320
      356.20251          48.797490
      367.17929         -43.179290
      254.66740          64.332640
      284.85348         -29.853480

[House price model residual plot: residuals versus square feet]

Does not appear to violate any regression assumptions

RESIDUAL ANALYSIS FOR EQUAL VARIANCE
Plot the residuals on the vertical axis against the values of X.
If the residuals show the same amount of variation across all values of X, the equal-variance assumption holds.
[Residual plots illustrating non-constant variance and constant variance]


CONSTRUCTING A NORMAL PROBABILITY PLOT - SPSS OUTPUT
Normal probability plot:
Arrange the residuals into an ascending array
Calculate the observed cumulative probability as i/(N+1)
Calculate the expected cumulative probability as per the normal distribution
Plot the pairs of points with the observed cumulative probability on the vertical axis and the expected cumulative probability on the horizontal axis
Evaluate the plot for evidence of linearity

[Normality P-P plot: observed versus expected cumulative probability]
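The P-P construction described above can be sketched in Python (a minimal sketch; `statistics.NormalDist` stands in for the expected-probability step that SPSS performs internally, and the residuals are the rounded house-price residuals):

```python
from statistics import NormalDist, mean, stdev

residuals = [-6.92, 38.12, -5.85, 3.94, -19.99, -49.39, 48.80, -43.18, 64.33, -29.85]

ordered = sorted(residuals)                        # ascending array
n = len(ordered)
observed = [i / (n + 1) for i in range(1, n + 1)]  # observed probability i/(N+1)

# Expected cumulative probability under a normal distribution fitted to the residuals
dist = NormalDist(mean(ordered), stdev(ordered))
expected = [dist.cdf(r) for r in ordered]

# The (expected, observed) pairs should fall near a straight line if errors are normal
for e, o in zip(expected, observed):
    print(round(e, 3), round(o, 3))
```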

RESIDUAL ANALYSIS FOR INDEPENDENCE
A plot of residuals versus time will show some pattern if the errors are not independent (time-series data)
[Residual-versus-time plots: a cyclical or trending pattern indicates errors are not independent; a random scatter indicates independence]

AUTOCORRELATION
Autocorrelation is correlation of the errors (residuals) over time
It violates the regression assumption that residuals are random and independent

THE DURBIN-WATSON STATISTIC
The Durbin-Watson statistic is used to test for autocorrelation:

D = Σ(i=2 to n) (ei - ei-1)² / Σ(i=1 to n) ei²

The possible range is 0 ≤ D ≤ 4
D should be close to 2 for no autocorrelation
D less than 2 may signal positive autocorrelation, D greater than 2 may signal negative autocorrelation
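The Durbin-Watson formula above translates directly into code (a minimal sketch; the example residuals are synthetic, chosen so the result is easy to check by hand):

```python
def durbin_watson(residuals):
    """D = sum over i=2..n of (e_i - e_{i-1})^2, divided by sum of e_i^2."""
    num = sum((residuals[i] - residuals[i - 1]) ** 2
              for i in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den

# Perfectly alternating residuals give strong negative autocorrelation (D near 4):
print(durbin_watson([1, -1, 1, -1]))   # 12 / 4 = 3.0
# Identical residuals give strong positive autocorrelation (D = 0):
print(durbin_watson([1, 1, 1, 1]))     # 0 / 4 = 0.0
```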

MEASURING
AUTOCORRELATION:
THE DURBIN-WATSON
STATISTIC
Used when data are collected over time to detect if
autocorrelation is present
Autocorrelation exists if residuals in one time period are related
to residuals in another period

TESTING FOR POSITIVE AUTOCORRELATION
Calculate the Durbin-Watson test statistic D; its value can lie between 0 and 4.
Find the values dL and dU from the Durbin-Watson table (for sample size n and number of independent variables k):
D < dL: positive autocorrelation exists
dL ≤ D ≤ dU: inconclusive
D > dU: positive autocorrelation does not exist
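The decision rule above can be sketched as a small function (the function name and return strings are illustrative; dL and dU must still be looked up in a Durbin-Watson table):

```python
def positive_autocorrelation_test(d, d_lower, d_upper):
    """Apply the one-sided Durbin-Watson decision rule for positive autocorrelation."""
    if d < d_lower:
        return "positive autocorrelation exists"
    if d > d_upper:
        return "positive autocorrelation does not exist"
    return "inconclusive"

# The customer/sales example below: D = 0.8830 with dL = 1.08, dU = 1.36 (n = 15, k = 1)
print(positive_autocorrelation_test(0.8830, 1.08, 1.36))
```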

TESTING FOR POSITIVE AUTOCORRELATION
Suppose we have the following data. Is there autocorrelation?

Week   Customers   Sales
  1       794       9.33
  2       799       8.26
  3       837       7.48
  4       855       9.08
  5       845       9.83
  6       844      10.09
  7       863      11.01
  8       875      11.49
  9       880      12.07
 10       905      12.55
 11       886      11.92
 12       843      10.27
 13       904      11.80
 14       950      12.15
 15       841       9.64


TESTING FOR POSITIVE AUTOCORRELATION
Here, n = 15 and there is k = 1 independent variable
Using the Durbin-Watson table, dL = 1.08 and dU = 1.36
D = 0.8830 < dL = 1.08, so significant positive autocorrelation exists

INFERENCES ABOUT THE SLOPE: t TEST
t test for a population slope:
Is there a linear relationship between X and Y?
Null and alternative hypotheses:
H0: β1 = 0 (no linear relationship)
H1: β1 ≠ 0 (linear relationship does exist)
Test statistic:

tSTAT = (b1 - β1) / Sb1,   d.f. = n - 2

where:
b1 = regression slope coefficient
β1 = hypothesized slope
Sb1 = standard error of the slope

INFERENCES ABOUT THE SLOPE
The standard error of the regression slope coefficient (b1) is estimated by

Sb1 = SYX / √SSX = SYX / √Σ(Xi - X̄)²

where:
Sb1 = estimate of the standard error of the slope
SYX = √(SSE / (n - 2)) = standard error of the estimate

STANDARD ERROR OF ESTIMATE
The standard deviation of the variation of observations around the regression line is estimated by

SYX = √(SSE / (n - 2)) = √( Σ(Yi - Ŷi)² / (n - 2) )

where:
SSE = error sum of squares
n = sample size
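Plugging the house-price model's SSE and n from the Excel ANOVA table into this formula reproduces the reported Standard Error (a minimal sketch):

```python
import math

sse = 13665.5652   # error sum of squares, from the ANOVA table
n = 10             # sample size

s_yx = math.sqrt(sse / (n - 2))
print(round(s_yx, 5))   # ≈ 41.33032, matching the Excel output
```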

INFERENCES ABOUT THE SLOPE: t TEST EXAMPLE

House Price in $1000s (Y)   Square Feet (X)
        245                      1400
        312                      1600
        279                      1700
        308                      1875
        199                      1100
        219                      1550
        405                      2350
        324                      2450
        319                      1425
        255                      1700

Estimated regression equation:
house price = 98.25 + 0.1098 (sq. ft.)

The slope of this model is 0.1098
Is there a relationship between the square footage of the house and its sales price?

INFERENCES ABOUT THE SLOPE: t TEST EXAMPLE
H0: β1 = 0
H1: β1 ≠ 0

From Excel output:
              Coefficients   Standard Error   t Stat    P-value
Intercept       98.24833        58.03348      1.69296   0.12892
Square Feet      0.10977         0.03297      3.32938   0.01039

tSTAT = (b1 - β1) / Sb1 = (0.10977 - 0) / 0.03297 = 3.32938
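The t statistic computation above is a one-liner in code (a minimal sketch using the rounded coefficients, so the last decimal differs slightly from Excel's unrounded 3.32938):

```python
b1 = 0.10977        # estimated slope
sb1 = 0.03297       # standard error of the slope
beta1_null = 0      # hypothesized slope under H0

t_stat = (b1 - beta1_null) / sb1
print(round(t_stat, 4))   # ≈ 3.3294
```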

INFERENCES ABOUT THE SLOPE: t TEST EXAMPLE
H0: β1 = 0
H1: β1 ≠ 0
Test statistic: tSTAT = 3.329
d.f. = 10 - 2 = 8
At α/2 = .025, the critical values are ±2.3060
Since tSTAT = 3.329 > 2.3060, reject H0
Decision: reject H0
There is sufficient evidence that square footage affects house price

INFERENCES ABOUT THE SLOPE: t TEST EXAMPLE
H0: β1 = 0
H1: β1 ≠ 0

From Excel output:
              Coefficients   Standard Error   t Stat    P-value
Intercept       98.24833        58.03348      1.69296   0.12892
Square Feet      0.10977         0.03297      3.32938   0.01039

Decision: reject H0, since p-value = 0.01039 < α
There is sufficient evidence that square footage affects house price.

REGRESSION DOES NOT MEAN CAUSATION

PROCEDURE OF CARRYING OUT REGRESSION
FITTING → DIAGNOSTICS → INTERPRETATION → PREDICTION
