Dr Sunil D Lakdawala
Sunil_lakdawala@hotmail.com
Regression Line
25-Aug-17 5
Regression Line (Cont)
b = (Σ XiYi − n·X̄·Ȳ) / (Σ Xi² − n·X̄²)
a = Ȳ − b·X̄
Se (Standard Error) = sqrt(Σ (Yi − Ŷi)² / (n − 2))
Assumption: Errors are normally distributed
What is the interpretation of the standard error?
How does it compare with the mean value of Y?
What is it as a percentage?
Look at example of Cost vs Passengers
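The slope, intercept, and standard-error formulas above can be sketched in Python. The cost-vs-passengers numbers below are hypothetical stand-ins for the example, not the textbook data:

```python
import math

def fit_line(x, y):
    """Least-squares slope b and intercept a for y = a + b*x,
    using b = (sum(xi*yi) - n*xbar*ybar) / (sum(xi^2) - n*xbar^2)."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    b = (sum(xi * yi for xi, yi in zip(x, y)) - n * mx * my) / \
        (sum(xi * xi for xi in x) - n * mx * mx)
    a = my - b * mx
    return a, b

def standard_error(x, y, a, b):
    """Standard error of estimate: sqrt(sum((yi - yhat)^2) / (n - 2))."""
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    return math.sqrt(sse / (len(x) - 2))

# Hypothetical cost ($1000s) vs. number of passengers
passengers = [61, 63, 67, 69, 70, 74, 76, 81, 86, 91, 95, 97]
cost = [4.28, 4.08, 4.42, 4.17, 4.48, 4.30, 4.82, 4.70, 5.11, 5.13, 5.64, 5.56]
a, b = fit_line(passengers, cost)
se = standard_error(passengers, cost, a, b)
```

Comparing Se against the mean of Y (as the slide suggests) turns it into a percentage error.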
Regression Line (Cont)
Interpretation of Standard Error (Cont)
What is the range of cost for 80 passengers
with 95% confidence? With 90% confidence?
For n > 30, use the Z (normal) distribution
For n < 30, normality cannot be assumed;
use the t distribution instead
What will be the range for the above
problem?
Is the range the same for every predicted Y? Is that true?
Assumption: one is predicting within the
range of the observed data
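For the large-sample case (n > 30), the Z-based range can be sketched as below; the predicted value and Se are hypothetical inputs, and the small-sample case would substitute a t value from a t table:

```python
from statistics import NormalDist

def z_interval(y_pred, se, confidence):
    """Range y_pred +/- z * Se, valid when n > 30 so the
    normal approximation holds."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return y_pred - z * se, y_pred + z * se

# Hypothetical predicted cost and standard error for 80 passengers
lo95, hi95 = z_interval(4.85, 0.18, 0.95)
lo90, hi90 = z_interval(4.85, 0.18, 0.90)
```

The 90% interval is necessarily narrower than the 95% one, which matches the slide's question.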
Correlation Analysis
Degree to which one variable is linearly related to
another
Coefficient of Determination:
r² = 1 − Σ(Yi − Ŷi)² / Σ(Yi − Ȳ)²   (between 0 and 1)
   = 1 − Ratio
Ratio: variation between the actual and predicted values
relative to the variation of Yi from the mean (the unexplained part)
 = (variation of Y around the regression line)
   / (variation of Y around its own mean)
r² = 1 − ratio of the above two
r² = 0.78 means 78% of the variation of Y from Ȳ is explained
by the regression; 22% is not explained
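A minimal sketch of the r² computation from actual and predicted values, following the formula above:

```python
def r_squared(y, y_pred):
    """r^2 = 1 - sum((yi - yhat)^2) / sum((yi - ybar)^2)."""
    ybar = sum(y) / len(y)
    ss_res = sum((yi - yp) ** 2 for yi, yp in zip(y, y_pred))
    ss_tot = sum((yi - ybar) ** 2 for yi in y)
    return 1 - ss_res / ss_tot
```

A perfect fit gives r² = 1; predicting the mean for every point gives r² = 0.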
Interpretation of r²
Coefficient of Correlation
r = sqrt(r²)
See fig 12.16
If r = 0.6, how good is regression? How much
variation in Y is explained by regression?
Inferences about population parameters
Instead of point value of b, we want to find
out range of b with 90% confidence level
Find t for given degrees of freedom, and
given confidence level
Find sb (standard error for b)
sb = Se / sqrt(Σ Xi² − n·X̄²)
Range is b − t·sb to b + t·sb
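The slope interval can be sketched as follows. The t value is passed in by hand (from a t table for the given degrees of freedom and confidence level), since the Python standard library has no t distribution:

```python
import math

def slope_interval(x, se, b, t):
    """Confidence range for the slope: b +/- t * sb,
    where sb = Se / sqrt(sum(x^2) - n * xbar^2)."""
    n = len(x)
    xbar = sum(x) / n
    sb = se / math.sqrt(sum(xi * xi for xi in x) - n * xbar * xbar)
    return b - t * sb, b + t * sb
```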
The Equation
The Equation (cont)
The regression analysis is performed under the following assumptions.
Y = β0 + β1·X + ε
1. Residual:
(Y − Ŷ) should be near zero. (Y − Ŷ) is the residual, denoted by ε
2. Residual plot:
Plot X vs (Y − Ŷ). ε should be random (i.e. normally distributed)
and should not show any trend (unlike Y = X²)
3. Standard Error Se = sqrt(Sum of Squared Errors / (N − K − 1)); K = 1 for
simple regression; Sum of Squared Errors = Σ(Y − Ŷ)²
Se should be acceptable (Se / Ȳ gives a good idea of the error)
68% of residuals should be within ±Se
95% of residuals should be within ±2·Se
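The 68%/95% checks in point 3 can be automated with a small sketch like this:

```python
def residual_coverage(residuals, se):
    """Fraction of residuals within +/-Se and within +/-2*Se.
    For roughly normal errors these should be near 0.68 and 0.95."""
    n = len(residuals)
    within_1 = sum(abs(r) <= se for r in residuals) / n
    within_2 = sum(abs(r) <= 2 * se for r in residuals) / n
    return within_1, within_2
```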
The Equation (Cont)
4. Correlation (r) is a measure of the linear association
between the variables. Even if the variables have a strong
nonlinear relationship, r might be very small (see Fig
5-7 Makridakis)
5. For small n, r is notoriously unstable. For n = 30 or
more, it starts becoming stable
6. r can change drastically due to extreme values. (See
King-Kong problem Fig 5-8 Makridakis, where just one
extreme point changes r from 0.527 to 0.940). What
should we do?
7. Coefficient of Determination: R² should be high,
towards 1. Interpretation of R² and r
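Point 6 can be demonstrated directly: one extreme point can pull r sharply upward. The data below are hypothetical illustrations, not the King-Kong data from Makridakis:

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sx = math.sqrt(sum((xi - mx) ** 2 for xi in x))
    sy = math.sqrt(sum((yi - my) ** 2 for yi in y))
    return cov / (sx * sy)

# Weakly related hypothetical data
base_x = [1, 2, 3, 4, 5]
base_y = [2.0, 1.0, 3.5, 2.5, 3.0]
r_without = pearson_r(base_x, base_y)

# Adding a single extreme point inflates r dramatically
r_with = pearson_r(base_x + [20], base_y + [20])
```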
The Equation (Cont)
8. P value should be smaller than 0.05 (i.e. 95%
confidence) for rejecting the null hypothesis (β0 = 0 / β1 = 0)
F = t² = MS(Regression) / MS(Residual)   (t value
for β1)
Significance F = p value for β1
9. Adjusted R² =
1 − [Sum of Squared Errors / (N − K − 1)] / [Σ(Y − Ȳ)² / (N − 1)]
10. Should make common sense, i.e. when X changes by
1, Y changes by the slope. A +ve or −ve change should
make common sense
11. Predictions are valid only within the range from which the model
was built
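The adjusted R² of point 9 can be computed in the algebraically equivalent form 1 − (1 − R²)·(N − 1)/(N − K − 1), which makes the penalty for extra variables explicit:

```python
def adjusted_r2(r2, n, k):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1),
    where n = number of observations, k = number of predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)
```

Adding predictors that explain nothing lowers adjusted R² even though plain R² never decreases.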
The Equation (Cont)
12. Please see equation 5.19 for the error interval on a predicted
value
13. Please see the equations for β0 and β1 and their error
intervals on page 216
14. Residuals vs the explanatory variable should not show any
pattern (no trend, no seasonality, etc.)
15. Residuals should have zero mean and should be
normally distributed
Data and Analysis
Summary Output
Residuals
A residual is the difference between the actual Y value and the
Y value predicted by the regression model, for each value of the
dependent variable.
Residuals
Coefficient of Determination
r² in Airline Cost
r² = .899 [pg6,12,13]
This means that about 89.9% of the variability
of the cost of flying a Boeing 737 airplane on
a commercial flight is accounted for or
predicted by the number of passengers.
This also means that about 10.1% of the
variation in airline flight cost, Y, is
unaccounted for by X or unexplained by the
regression model.
Correlation
It is a measure of association: it measures
the strength of the relatedness of two variables.
For example, we may be interested in
determining the correlation between
the prices of two stocks in the same industry.
How strong are these correlations?
The Pearson product-moment correlation
coefficient is given by:
r = Σ(X − X̄)(Y − Ȳ) / sqrt(Σ(X − X̄)² · Σ(Y − Ȳ)²)
Correlation
1. The measure is applicable only
if both variables being analyzed
have at least an interval level of
data.
2. r is a measure of the linear
correlation of two variables.
3. r = +1 denotes a perfect positive
relationship between two sets of
variables.
4. r = -1 denotes a perfect
negative correlation, which
indicates an inverse relationship
between two variables.
5. r=0 means that there is no linear
relationship between the two
variables.
6. The coefficient of determination
is r², the square of the correlation coefficient.
Factors to be taken care of
Multiple Regression Model
The general equation which describes the multiple regression
model is given by
Yi = β0 + β1·X1 + β2·X2 + … + βk·Xk + ε
Minimize Σε² by finding the best βi
Assumptions made in the model are:
1. Residual:
(Y − Ŷ) should be near zero. (Y − Ŷ) is the residual,
denoted by ε
Plot Xi vs (Y − Ŷ) for each Xi. ε should be random and should
not show any trend (unlike Y = X²)
2. Standard Error Se = sqrt(Sum of Squared Errors / (N − K − 1)); K = # of
independent variables; Sum of Squared Errors = Σ(Y − Ŷ)²
Se should be acceptable (Se / Ȳ gives a good idea of the
error)
68% of residuals should be within ±Se
95% of residuals should be within ±2·Se
Multiple Regression Model (Cont)
The Fitted Model
Y = 57.351 + 0.0177·X1 − 0.6663·X2
Interpretation:
The Y-intercept is equal to 57.351. In this example, the Y-intercept
does not have any practical significance.
The coefficient of X1 (total number of square feet in the house) is
0.0177. This means that a 1-unit increase in square footage would
result in a predicted increase of (0.0177)($1000) = $17.70 in the
price of the home if the age were held constant.
The coefficient of X2 (age) is -0.6663. The negative sign on the
coefficient denotes an inverse relationship between the age of a
house and the price of the house: the older the house, the lower
the price. In this case, if the total number of square feet in the
house is kept constant, a 1-unit increase in the age of the house
(1 year) will result in (-0.6663)($1000) = -$666.30, a predicted
drop in the price.
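The fitted model above can be written as a small function, which makes the per-unit interpretations easy to check numerically:

```python
def predict_price(sqft, age):
    """Predicted house price in $1000s from the fitted model on the slide:
    Y = 57.351 + 0.0177*X1 - 0.6663*X2."""
    return 57.351 + 0.0177 * sqft - 0.6663 * age
```

Holding age constant, each extra square foot adds $17.70; holding square footage constant, each extra year subtracts $666.30.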
Testing the Model
r² = 0.715
Testing the overall model
Significance tests of the regression coefficients
Analysis of Residuals
Multicollinearity
Search Procedures
All possible regression
Take all possible combinations of the K variables (2^K − 1 models).
Choose the best model
Forward selection
Start with one variable: try out all variables one at a time and choose
the best one.
Then add a 2nd variable, and so on.
Backward Elimination:
Start with all variables
Repeatedly drop the least useful variable
Stepwise regression
Same as forward selection, but at every step also check that each
variable already included remains significant (acceptable p value)
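Forward selection can be sketched generically. The `score` function stands in for any model-quality measure (adjusted R², for instance); the variable names and scoring rule below are hypothetical:

```python
def forward_select(candidates, score, max_vars=None):
    """Greedy forward selection: repeatedly add the candidate variable
    that most improves the model score (higher is better), stopping
    when no candidate improves it."""
    selected, remaining = [], list(candidates)
    best = score(selected)
    while remaining and (max_vars is None or len(selected) < max_vars):
        trials = [(score(selected + [v]), v) for v in remaining]
        top_score, top_var = max(trials)
        if top_score <= best:
            break  # no remaining candidate improves the model
        selected.append(top_var)
        remaining.remove(top_var)
        best = top_score
    return selected

# Hypothetical score: pretend only x1 and x3 actually improve the model
score = lambda chosen: len(set(chosen) & {"x1", "x3"})
chosen = forward_select(["x1", "x2", "x3"], score)
```

Backward elimination is the mirror image: start from the full set and repeatedly drop the variable whose removal hurts the score least.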
Factors to be taken care of
Value of R2 could be inflated. Consider R2 adjusted
Better model does not imply cause and effect between
independent variables and dependent variable (some
other factors might be causing both)
Value of a regression coefficient may not directly tell about
its importance, because of
Different units
Multicollinearity
Non Linear Models (5/4 - Makridakis)
Nonlinearity in parameters: more complex (one may be
able to use a transformation to convert it into linear form in certain
cases)
Nonlinearity in variables
Local Regression (see 5/4/3 of Makridakis)
Non-Linear Model
Y = β0 + β1·X1 + β2·X2² + ε
Choose Z1 = X1; Z2 = X2²
Y = β0 + β1·X1 + β2·X1·X2 + ε
Choose Z1 = X1; Z2 = X1·X2
Y = β0·β1^X
Log(Y) = Log(β0) + X·Log(β1); now it is in
linear form
Similarly Y = β0·X^β1 and Y = 1 / (β0 + β1·X1 + β2·X2)
can be converted into linear regressions
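The log transformation above can be demonstrated end to end. The data here are generated from a hypothetical exponential model Y = 2 · 1.5^X, fitted with a simple linear regression on log(Y), and the original parameters are recovered by exponentiating:

```python
import math

# Hypothetical exponential data Y = b0 * b1**X with b0 = 2, b1 = 1.5
xs = [0, 1, 2, 3, 4]
ys = [2 * 1.5 ** x for x in xs]

# Transform: log(Y) = log(b0) + X * log(b1)  -> simple linear regression
log_ys = [math.log(y) for y in ys]
n = len(xs)
mx = sum(xs) / n
my = sum(log_ys) / n
slope = sum((x - mx) * (ly - my) for x, ly in zip(xs, log_ys)) \
        / sum((x - mx) ** 2 for x in xs)
intercept = my - slope * mx

b1 = math.exp(slope)      # recovers 1.5
b0 = math.exp(intercept)  # recovers 2.0
```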
Indicator (Dummy) Variable
Others (Pg 270 Makridakis)
Trading day variation
Introduce seven variables, T1: # of Mondays in the month, T2: # of
Tuesdays in the month, …
Holiday Effect
V=1 if Diwali falls in this month (or part of Diwali)
Interventions (Pg 271 Makridakis)
Seat-belt legislation was introduced
Due to that, car accidents went down
Introduce a dummy variable I = 0 (before seat-belt
legislation) and I = 1 (after seat-belt legislation)
More complex models can be introduced if the effect is
spread over some time
See figure 8-15 for intervention variable
Effect of Advertising Expenditure on Sale (Pg 271 Makridakis)
Miscellaneous
Variance - Covariance Matrix
Vector X = (X1, X2, X3, …)
Let μ(i) be the arithmetic average of X(i)
σ(i,j) = Σk (X(i,k) − μ(i))·(X(j,k) − μ(j)) / N
The diagonal entries σ(i,i) are the variances
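The definition above translates directly into a short sketch, with each variable given as a list of N observations:

```python
def covariance_matrix(X):
    """X is a list of variables, each a list of N observations.
    Returns sigma(i,j) = sum_k (X[i][k] - mean_i) * (X[j][k] - mean_j) / N;
    the diagonal entries are the variances."""
    n = len(X[0])
    means = [sum(var) / n for var in X]
    return [[sum((X[i][k] - means[i]) * (X[j][k] - means[j])
                 for k in range(n)) / n
             for j in range(len(X))]
            for i in range(len(X))]
```

The matrix is symmetric by construction, since σ(i,j) = σ(j,i).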