
REGRESSION

Regression Analysis
Data can be arranged in three ways:
Time series
Cross sectional
Panel
Regression can therefore be applied to all three.
Based on the number of variables, regression is either
bivariate or multivariate.

Bivariate Regression
A measure of linear association that
investigates a straight line relationship
Useful in estimation/forecasting
Introduction to Regression Analysis
Regression analysis is used to:
Predict the value of a dependent variable based on
the value of at least one independent variable
Explain the impact of changes in an independent
variable on the dependent variable
Dependent variable: the variable we wish to
explain
Independent variable: the variable used to
explain the dependent variable
Types of Regression Models
Positive Linear Relationship
Negative Linear Relationship
Relationship NOT Linear
No Relationship
Bivariate Linear Regression

A measure of linear association that
investigates a straight-line relationship.
$Y = \alpha + \beta X + \varepsilon$
where
Y is the dependent variable
X is the independent variable
$\alpha$ and $\beta$ are two constants to be estimated
$\varepsilon$ is the error or residual term
Y intercept
An intercepted segment of a line
The point at which a regression line intercepts
the Y-axis

Slope
The inclination of a regression line as
compared to a base line

[Figure: scatter plot of Y (axis 80 to 160) against X (axis 70 to 190) with the fitted regression line, marking the gap between Actual Y and Y hat for individual points]
Regression Line
$Y = a + bX + e$
$\hat{Y}$ (Y hat) is used for the predicted value of Y
The Least-Square Method
The criterion of attempting to make the least
amount of total error in prediction of Y from
X. More technically, the procedure used in the
least-squares method generates a straight line
that minimizes the sum of squared deviations
of the actual values from this predicted
regression line.
The Least-Square Method
A relatively simple mathematical technique
that ensures that the straight line will most
closely represent the relationship between X
and Y.


$e_i = Y_i - \hat{Y}_i$ (the residual)
$Y_i$ = actual value of the dependent variable
$\hat{Y}_i$ = estimated value of the dependent variable (Y hat)
n = number of observations

$\sum_{i=1}^{n} e_i^2$ is minimum
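The criterion can be checked numerically. The sketch below assumes the ten (X, Y) pairs from the R-squared example later in this document and compares the line quoted there (y = -0.94x + 43.7) with two alternative lines that are invented purely for comparison; the function name is illustrative.

```python
# Ten (X, Y) pairs taken from this document's R-squared example.
X = [3, 10, 11, 15, 22, 22, 23, 28, 28, 35]
Y = [40, 35, 30, 32, 19, 26, 24, 22, 18, 6]


def sum_squared_residuals(a, b, xs, ys):
    """Return the sum of e_i^2, where e_i = Y_i - (a + b * X_i)."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


# Line of best fit quoted in the document.
sse_best = sum_squared_residuals(43.7, -0.94, X, Y)
# Hypothetical alternatives: a flatter slope and a shifted intercept.
sse_flatter = sum_squared_residuals(43.7, -0.80, X, Y)
sse_shifted = sum_squared_residuals(45.0, -0.94, X, Y)

print(f"best fit: {sse_best:.2f}")
print(f"flatter slope: {sse_flatter:.2f}")
print(f"shifted intercept: {sse_shifted:.2f}")
```

Both alternatives give a larger sum of squared residuals than the quoted line, which is what the least-squares criterion requires.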
Regression - Least-Square Method
$\sum e^2 = \sum (y - \hat{y})^2 = \sum (y - (a + bx))^2$

$b = \dfrac{n \sum XY - (\sum X)(\sum Y)}{n \sum X^2 - (\sum X)^2}$
Finding out the values of a and b

b = estimated slope of the line (the regression coefficient, also written $\hat{\beta}$)
a = estimated intercept of the y axis
Y = dependent variable
$\bar{Y}$ = mean of the dependent variable
X = independent variable
$\bar{X}$ = mean of the independent variable
n = number of observations

$b = \dfrac{\sum XY - n \bar{X} \bar{Y}}{\sum X^2 - n \bar{X}^2}$

$a = \bar{Y} - b \bar{X}$
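As an illustration, the two formulas can be applied directly. This is a minimal sketch that assumes the ten (X, Y) pairs from the R-squared example later in this document; the variable names are illustrative.

```python
# Ten (X, Y) pairs taken from this document's R-squared example.
X = [3, 10, 11, 15, 22, 22, 23, 28, 28, 35]
Y = [40, 35, 30, 32, 19, 26, 24, 22, 18, 6]

n = len(X)
x_bar = sum(X) / n
y_bar = sum(Y) / n
sum_xy = sum(x * y for x, y in zip(X, Y))
sum_x2 = sum(x * x for x in X)

# Slope: (sum XY - n * X_bar * Y_bar) / (sum X^2 - n * X_bar^2)
b = (sum_xy - n * x_bar * y_bar) / (sum_x2 - n * x_bar ** 2)
# Intercept: Y_bar - b * X_bar
a = y_bar - b * x_bar

print(f"b = {b:.4f}, a = {a:.4f}")
```

The result is roughly b = -0.94 and a = 43.72, in line with the line of best fit quoted for this data set.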
The other method of calculating a and b
Use of the simultaneous equation method
$\sum Y = N a + b \sum X$ (where Y is the dependent variable and X is the independent variable)
$\sum XY = a \sum X + b \sum X^2$
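The two simultaneous equations form a 2×2 linear system in a and b. A minimal sketch, assuming Cramer's rule as the solution technique (the slides do not say which technique to use) and the ten (X, Y) pairs from the R-squared example later in this document; the function name is illustrative.

```python
def solve_normal_equations(xs, ys):
    """Solve sum(Y) = N*a + b*sum(X) and sum(XY) = a*sum(X) + b*sum(X^2)."""
    n = len(xs)
    sx = sum(xs)
    sy = sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    det = n * sxx - sx * sx  # determinant of the coefficient matrix
    if det == 0:
        raise ValueError("all X values are identical; the slope is undefined")
    a = (sy * sxx - sx * sxy) / det
    b = (n * sxy - sx * sy) / det
    return a, b


X = [3, 10, 11, 15, 22, 22, 23, 28, 28, 35]
Y = [40, 35, 30, 32, 19, 26, 24, 22, 18, 6]
a, b = solve_normal_equations(X, Y)
print(f"a = {a:.4f}, b = {b:.4f}")
```

The expression for b is the same ratio as the slope formula given earlier, so both methods return the same estimates.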
F-Test (Regression)-Goodness of fit
A procedure to determine whether there is
more variability explained by the regression or
unexplained by the regression.
Total deviation = Deviation explained by
the regression + Deviation unexplained by the
regression

$(Y_i - \bar{Y}) = (\hat{Y}_i - \bar{Y}) + (Y_i - \hat{Y}_i)$

Partitioning the Variance
$\sum (Y_i - \bar{Y})^2 = \sum (\hat{Y}_i - \bar{Y})^2 + \sum (Y_i - \hat{Y}_i)^2$
Total variation = Explained variation + Unexplained variation (residual)
$\bar{Y}$ = Mean of the total group
$\hat{Y}_i$ = Value predicted with regression equation
$Y_i$ = Actual value

Sum of Squares
SSt = SSr + SSe
$r^2 = \dfrac{SSr}{SSt} = 1 - \dfrac{SSe}{SSt}$
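The partition can be verified numerically. The sketch below assumes the ten (X, Y) pairs from the R-squared example later in this document, fits the least-squares line, and checks that the total sum of squares splits into the explained and unexplained parts; the variable names are illustrative.

```python
# Ten (X, Y) pairs taken from this document's R-squared example.
X = [3, 10, 11, 15, 22, 22, 23, 28, 28, 35]
Y = [40, 35, 30, 32, 19, 26, 24, 22, 18, 6]

n = len(X)
x_bar = sum(X) / n
y_bar = sum(Y) / n

# Least-squares slope and intercept.
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y)) / sum(
    (x - x_bar) ** 2 for x in X
)
a = y_bar - b * x_bar
y_hat = [a + b * x for x in X]

sst = sum((y - y_bar) ** 2 for y in Y)                # total variation
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)          # explained variation
sse = sum((y - yh) ** 2 for y, yh in zip(Y, y_hat))   # unexplained variation

r_squared = ssr / sst
print(f"SSt = {sst:.2f}, SSr = {ssr:.2f}, SSe = {sse:.2f}")
print(f"r^2 = {r_squared:.4f}")
```

The identity SSt = SSr + SSe holds exactly only for the least-squares line, which is why the fit is recomputed here rather than using rounded coefficients.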
The proportion of variance in Y that is explained by X (or vice versa) is referred to as the
Coefficient of Determination, $r^2$.
$r^2$ can also be calculated by squaring the correlation coefficient, r. It is also known as
explained variance.
X Y
3 40
10 35
11 30
15 32
22 19
22 26
23 24
28 22
28 18
35 6
Equation for Line of Best Fit: y = -0.94x + 43.7
Correlation = -0.94
Calculating The Value of R Square
X    Y    Predicted   Error    Error     Distance between   Mean distances
          Y value              squared   Y values and       squared
                                         their mean
3    40   40.88        0.88     0.77      14.8              219.04
10   35   34.30       -0.70     0.49       9.8               96.04
11   30   33.36        3.36    11.29       4.8               23.04
15   32   29.60       -2.40     5.76       6.8               46.24
22   19   23.02        4.02    16.16      -6.2               38.44
22   26   23.02       -2.98     8.88       0.8                0.64
23   24   22.08       -1.92     3.69      -1.2                1.44
28   22   17.38       -4.62    21.34      -3.2               10.24
28   18   17.38       -0.62     0.38      -7.2               51.84
35    6   10.80        4.80    23.04     -19.2              368.64
Mean of Y: 25.2    Sum of error squared: 91.81    Sum of mean distances squared: 855.60
Error = predicted Y value - actual Y
Equation for Line of Best Fit: y = -0.94x + 43.7
To calculate R Squared
$R^2 = 1 - \dfrac{\text{Sum of squared distances between the actual and predicted Y values}}{\text{Sum of squared distances between the actual Y values and their mean}}$
$R^2 = 1 - \dfrac{91.81}{855.60} = 1 - 0.11 = 0.89$
r = -0.944
The value we got for R Squared was 0.89.
Here's a shortcut: to find R Squared, square r.
$r^2 = (-0.944) \times (-0.944)$
$r^2 = 0.89$
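The shortcut can be checked by computing the correlation from the data. A minimal sketch, assuming the ten (X, Y) pairs of this example and the usual Pearson formula; the helper name is illustrative.

```python
import math

X = [3, 10, 11, 15, 22, 22, 23, 28, 28, 35]
Y = [40, 35, 30, 32, 19, 26, 24, 22, 18, 6]


def pearson_r(xs, ys):
    """Pearson correlation: S_xy / sqrt(S_xx * S_yy)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)


r = pearson_r(X, Y)
r_squared = r * r
print(f"r = {r:.3f}, r^2 = {r_squared:.2f}")
```

Squaring removes the sign, so a strong negative correlation such as this one still gives a high R Squared.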
R Squared
To determine how well the regression line
fits the data, we find a value called R-Squared ($r^2$)
To find $r^2$, simply square the correlation
The closer $r^2$ is to 1, the better the line fits
the data
$r^2$ is never negative: it always lies between 0 and 1
Understanding the Output of
Regression
Sample Data for House Price Model
House Price in Thousands (y)   Square Feet (x)
245                            1400
312                            1600
279                            1700
308                            1875
199                            1100
219                            1550
405                            2350
324                            2450
319                            1425
255                            1700
Regression Using Excel
Tools / Data Analysis / Regression
Excel Output

Regression Statistics
Multiple R          0.76211
R Square            0.58082
Adjusted R Square   0.52842
Standard Error      41.33032
Observations        10

ANOVA
             df   SS           MS           F         Significance F
Regression   1    18934.9348   18934.9348   11.0848   0.01039
Residual     8    13665.5652   1708.1957
Total        9    32600.5000

              Coefficients   Standard Error   t Stat    P-value   Lower 95%   Upper 95%
Intercept     98.24833       58.03348         1.69296   0.12892   -35.57720   232.07386
Square Feet   0.10977        0.03297          3.32938   0.01039   0.03374     0.18580
The regression equation is:
house price = 98.24833 + 0.10977 (square feet)
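The coefficients and the ANOVA figures in the Excel output can be reproduced outside Excel. The sketch below is one hedged way to do it in plain Python from the ten sample houses; the variable names are illustrative, and the F ratio is computed as mean square regression over mean square residual, which is how the F column relates to the MS column.

```python
# Sample data from the House Price Model slide (price in thousands).
price = [245, 312, 279, 308, 199, 219, 405, 324, 319, 255]
sqft = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]

n = len(price)
x_bar = sum(sqft) / n
y_bar = sum(price) / n

s_xx = sum((x - x_bar) ** 2 for x in sqft)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(sqft, price))

b1 = s_xy / s_xx           # slope
b0 = y_bar - b1 * x_bar    # intercept

fitted = [b0 + b1 * x for x in sqft]
sst = sum((y - y_bar) ** 2 for y in price)
sse = sum((y - f) ** 2 for y, f in zip(price, fitted))
ssr = sst - sse

# One independent variable: regression df = 1, residual df = n - 2.
f_stat = (ssr / 1) / (sse / (n - 2))

print(f"b0 = {b0:.5f}, b1 = {b1:.5f}")
print(f"SSR = {ssr:.4f}, SSE = {sse:.4f}, SST = {sst:.4f}, F = {f_stat:.4f}")
```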
Graphical Presentation
House price model: scatter plot and regression line
[Figure: scatter plot of House Price ($1000s, axis 0 to 450) against Square Feet (axis 0 to 3000) with the fitted regression line]
house price = 98.24833 + 0.10977 (square feet)
Slope = 0.10977
Intercept = 98.248
Interpretation of the Intercept, $b_0$
$b_0$ is the estimated average value of Y when the
value of X is zero (if x = 0 is in the range of
observed x values)
Here, no houses had 0 square feet, so $b_0$ =
98.24833 just indicates that, for houses within the
range of sizes observed, 98,248.33 is the portion
of the house price not explained by square feet
Interpretation of the Slope Coefficient, $b_1$
$b_1$ measures the estimated change in the
average value of Y as a result of a one-unit
change in X
Here, $b_1$ = 0.10977 tells us that the average value of a
house increases by 0.10977 × 1000 = 109.77, on
average, for each additional square foot of size
$R^2 = \dfrac{SSR}{SST} = \dfrac{18934.9348}{32600.5000} = 0.58082$
58.08% of the variation in house prices is
explained by variation in square feet
Adjusted R Square: used to judge whether an additional
independent variable improves the model. Unlike R Square, it does not rise automatically when a variable is added.
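The slides do not give the formula for Adjusted R Square. The sketch below assumes the standard definition, 1 - (1 - R²)(n - 1)/(n - k - 1), and checks it against the Excel output for the house price model; the function name is illustrative.

```python
def adjusted_r_squared(r_squared, n, k):
    """Adjusted R^2 for n observations and k independent variables.

    Assumes the standard definition; it is not stated in the slides.
    """
    return 1 - (1 - r_squared) * (n - 1) / (n - k - 1)


# Values from the Excel output: R Square = 0.58082, 10 observations, 1 predictor.
adj = adjusted_r_squared(0.58082, n=10, k=1)
print(f"Adjusted R Square = {adj:.5f}")
```

The penalty factor (n - 1)/(n - k - 1) grows with k, so an extra variable raises the adjusted value only if it improves the fit by more than the penalty.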
Standard Error of Estimate
The standard deviation of the variation of
observations around the regression line is
estimated by
$s_\varepsilon = \sqrt{\dfrac{SSE}{n - k - 1}}$
Where
SSE = Sum of squares error
n = Sample size
k = number of independent variables in the model
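Applied to the house price model, the formula reproduces the Standard Error line of the Excel output. A minimal sketch using the SSE reported in the ANOVA table; the function name is illustrative.

```python
import math


def standard_error_of_estimate(sse, n, k):
    """Return sqrt(SSE / (n - k - 1))."""
    return math.sqrt(sse / (n - k - 1))


# Residual SS = 13665.5652, 10 observations, 1 independent variable.
s_e = standard_error_of_estimate(13665.5652, n=10, k=1)
print(f"s_e = {s_e:.5f}")
```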
The Standard Deviation of the Regression Slope
The standard error of the regression slope
coefficient ($b_1$) is estimated by
$s_{b_1} = \dfrac{s_\varepsilon}{\sqrt{\sum (x - \bar{x})^2}} = \dfrac{s_\varepsilon}{\sqrt{\sum x^2 - \dfrac{(\sum x)^2}{n}}}$
where:
$s_{b_1}$ = Estimate of the standard error of the least squares slope
$s_\varepsilon = \sqrt{\dfrac{SSE}{n - 2}}$ = Sample standard error of the estimate
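Both forms of the denominator can be evaluated on the house price data. The sketch below takes the standard error of the estimate (41.33032) from the Excel output and should land close to the 0.03297 reported for the Square Feet coefficient; the variable names are illustrative.

```python
import math

sqft = [1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700]
s_e = 41.33032  # standard error of the estimate, from the Excel output

n = len(sqft)
x_bar = sum(sqft) / n

# Two equivalent ways to compute the sum of squared deviations of x.
s_xx_dev = sum((x - x_bar) ** 2 for x in sqft)
s_xx_raw = sum(x * x for x in sqft) - sum(sqft) ** 2 / n

s_b1 = s_e / math.sqrt(s_xx_dev)
print(f"s_b1 = {s_b1:.5f}")
```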
Thus, the standard error of 41.33 means that a house price
prediction is typically off by about 41,330 rupees (prices are recorded in thousands).
Inference about the Slope: t Test
t test for a population slope
Is there a linear relationship between x and y?
Null and alternative hypotheses
$H_0: \beta_1 = 0$ (no linear relationship)
$H_1: \beta_1 \neq 0$ (linear relationship does exist)
Test statistic
$t = \dfrac{b_1 - \beta_1}{s_{b_1}}$, d.f. = n - 2
where:
$b_1$ = Sample regression slope coefficient
$\beta_1$ = Hypothesized slope
$s_{b_1}$ = Estimator of the standard error of the slope
Inference about the Slope: t Test (continued)
Estimated Regression Equation:
house price = 98.25 + 0.1098 (sq.ft.)
The slope of this model is 0.1098
Does square footage of the house
affect its sales price?
Inferences about the Slope:
t Test Example
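$H_0: \beta_1 = 0$
$H_A: \beta_1 \neq 0$

              Coefficients   Standard Error   t Stat   P-value
Intercept     98.24833       58.03348         1.6929   0.1289
Square Feet   0.10977        0.03297          3.3293   0.0103

In the Square Feet row, 0.10977 is $b_1$, 0.03297 is $s_{b_1}$ and 3.3293 is t.
Test Statistic: t = 3.329
d.f. = 10 - 2 = 8, $\alpha/2$ = 0.025, critical values $\pm t_{\alpha/2}$ = ±2.3060
[Figure: t distribution with "Reject H0" regions below -2.3060 and above 2.3060 and a "Do not reject H0" region between them; the test statistic 3.329 falls in the upper rejection region]
Decision: Reject $H_0$
Conclusion: There is sufficient evidence that square footage affects house price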
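The decision rule can be written out in a few lines. This sketch takes the slope, its standard error and the critical value 2.3060 directly from the slides rather than computing them from a t distribution; the variable names are illustrative.

```python
b1 = 0.10977       # sample slope, from the Excel output
s_b1 = 0.03297     # standard error of the slope, from the Excel output
beta1_h0 = 0.0     # hypothesized slope under H0
t_crit = 2.3060    # two-tailed critical value quoted for d.f. = 8, alpha = 0.05

t_stat = (b1 - beta1_h0) / s_b1
reject_h0 = abs(t_stat) > t_crit

print(f"t = {t_stat:.3f}, reject H0: {reject_h0}")
```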
Regression Analysis for Description
Confidence Interval Estimate of the Slope:
$b_1 \pm t_{\alpha/2} \, s_{b_1}$, d.f. = n - 2
Excel Printout for House Prices:

              Coefficients   Standard Error   t Stat    P-value   Lower 95%   Upper 95%
Intercept     98.24833       58.03348         1.69296   0.12892   -35.57720   232.07386
Square Feet   0.10977        0.03297          3.32938   0.01039   0.03374     0.18580

At 95% level of confidence, the confidence interval for the
slope is (0.0337, 0.1858)
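The interval in the printout follows from the formula. A minimal sketch, again taking the slope, its standard error and the critical value 2.3060 (d.f. = 8) from the slides; the variable names are illustrative.

```python
b1 = 0.10977       # sample slope
s_b1 = 0.03297     # standard error of the slope
t_crit = 2.3060    # t value for alpha/2 = 0.025 and d.f. = 8, as quoted in the slides

margin = t_crit * s_b1
lower = b1 - margin
upper = b1 + margin

# Prices are recorded in thousands, so multiply by 1000 for a per-square-foot amount.
lower_per_sqft = lower * 1000
upper_per_sqft = upper * 1000

print(f"95% CI for the slope: ({lower:.4f}, {upper:.4f})")
print(f"per square foot: {lower_per_sqft:.2f} to {upper_per_sqft:.2f}")
```

Because the lower limit is above zero, the interval leads to the same decision as the t test on the slope.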
Regression Analysis for Description
Since the units of the house price variable are $1000s, we are 95%
confident that the average impact on sales price is between
$33.70 and $185.80 per square foot of house size
This 95% confidence interval does not include 0.
Conclusion: There is a significant relationship between house price and square
feet at the .05 level of significance
Multiple Regression
Extension of Bivariate Regression
Multidimensional when three or more
variables are involved
Simultaneously investigates the effect of two
or more variables on a single dependent
variable
