
Model Selection in Multiple

Linear Regression Analysis


Dr. Himanshu Joshi
FORE School of Management,
New Delhi
8-1
Learning Objectives

Understand the problem presented by omitted variable bias
Understand the problem presented by including an irrelevant variable
Understand the problem presented by missing data
Understand the problem presented by outliers
Perform the RESET test for the inclusion of higher-order polynomials
8-2
Learning Objectives

Consider what it means for a p-value to be just above a given cutoff

8-3
Model Specifications and
Errors in Specification
Model specification refers to the set of
variables included in the regression and
the regression equation's functional form.

8-4
Principles of Model
Specification
1. The model should be grounded in cogent
economic reasoning: we should be able to supply the
economic reasoning behind the choice of variables,
and the reasoning should make sense. When this
condition is fulfilled, we increase the chance that the
model will have predictive value with new data.
This approach contrasts with the variable-selection
process known as data mining. With data mining,
the investigator essentially develops a model that
maximally exploits the characteristics of a specific
dataset.

8-5
Principles of Model Specification

2. The functional form chosen for the variables in the
regression should be appropriate given the nature of
the variables:
As an illustration, consider studying mutual fund
market timing based on fund and market returns
only.
One might reason that for a successful timer, a plot
of the mutual fund's returns against market returns would
show curvature, because a successful timer would
tend to increase (decrease) beta when market
returns are high (low). The model specification should
reflect the expected non-linear relationship.
8-6
Principles of Model Specification

2. The functional form chosen for the variables in
the regression should be appropriate given the
nature of the variables:
To capture curvature, Treynor and Mazuy (1966)
included a term in the squared market excess
return. This does not violate the assumption of
the multiple linear regression model that the
relationship between the dependent and
independent variables is linear in the coefficients.
In other cases, we may transform the data so
that a regression assumption is better satisfied.
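A minimal sketch of the quadratic specification described above, assuming synthetic excess returns in percent. The data, the true coefficients, and the small `ols` helper are illustrative assumptions, not Treynor and Mazuy's data or code. The squared market term enters as one more regressor, so the model remains linear in the coefficients:

```python
def ols(X, y):
    """Solve the normal equations (X'X) b = X'y by Gauss-Jordan elimination."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y]
    A = [[sum(row[i] * row[j] for row in X) for j in range(k)]
         + [sum(row[i] * yi for row, yi in zip(X, y))] for i in range(k)]
    for col in range(k):
        # Partial pivoting for numerical stability
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]

# Hypothetical market excess returns, in percent.
market = [-6.0, -4.0, -2.0, -1.0, 0.0, 1.0, 2.0, 4.0, 6.0]
# Hypothetical fund: alpha 0.1, beta 0.9, timing coefficient 0.05 (no noise).
fund = [0.1 + 0.9 * m + 0.05 * m ** 2 for m in market]

# Regressors: constant, market excess return, squared market excess return.
X = [[1.0, m, m ** 2] for m in market]
alpha, beta, gamma = ols(X, fund)
print(f"alpha={alpha:.3f} beta={beta:.3f} gamma={gamma:.3f}")
```

A positive coefficient on the squared term (`gamma` here) is what curvature from successful timing would look like in this specification.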

8-7
Principles of Model Specification

3. The Model should be Parsimonious:


Parsimonious means accomplishing a lot
with a little. We should expect each
variable included in a regression to play an
essential role.

8-8
Principles of Model Specification

4. The Model should be examined for
violations of regression assumptions
before being accepted:
We may need to revise the set of included
variables and/or their functional form in
the light of heteroscedasticity, serial
correlation, and multicollinearity.

8-9
Principles of Model Specification

5. The Model should be tested and found useful out
of sample before being accepted:
The term "out of sample" refers to observations
outside the dataset on which the model was
estimated. A plausible model may not perform
well out of sample because economic
relationships have changed since the sample
period. That possibility itself is useful to know.
A second explanation, however, may be that
relationships have not changed but the model
explains only a specific dataset.
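One simple way to carry out such a check is to hold back part of the data, estimate on the rest, and compare the prediction error in the two parts. The sketch below assumes a single regressor and made-up numbers; the 8/4 split and the helper names are illustrative choices:

```python
def fit_line(x, y):
    """Closed-form OLS for y = a + b*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))
    return my - b * mx, b

def rmse(x, y, a, b):
    """Root mean squared prediction error of the fitted line on (x, y)."""
    return (sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y)) / len(x)) ** 0.5

# Hypothetical data: first 8 observations for estimation, last 4 held out.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
y = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1, 18.2, 19.8, 22.1, 24.0]

a, b = fit_line(x[:8], y[:8])
rmse_in = rmse(x[:8], y[:8], a, b)     # error on the estimation sample
rmse_out = rmse(x[8:], y[8:], a, b)    # error on observations the model never saw
print(f"slope={b:.3f} in-sample RMSE={rmse_in:.3f} out-of-sample RMSE={rmse_out:.3f}")
```

An out-of-sample error far larger than the in-sample error would suggest that the model explains only the specific dataset, or that the relationship has changed.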

8-10
8-11
Understand the Problem
Presented by Omitted Variable
Bias
Omitted Variable Bias is the bias in
coefficient estimates when a variable is
omitted from the model and that variable is
also related to one or more independent
variables.

Omitted variable bias results in OLS estimates
that are wrong on average, and in incorrect
hypothesis tests and confidence intervals.

8-12
Understand the Problem
Presented by Including an
Irrelevant Variable
Including an irrelevant variable means including
a variable in the regression model
even though it is not related to the
dependent variable.

Including an irrelevant variable does not
cause the coefficient estimates to be biased,
but it may result in larger standard errors
(which might lead to more variables being
judged statistically insignificant).

8-13
What is the Lesser of Two Evils:
Omitted Variable Bias or
Including an Irrelevant
Variable?
Because omitting a relevant variable results
in biased estimates while including an
irrelevant variable does not, it is more
desirable to include an irrelevant variable.
However, it would be best to have a
correctly specified model without either an
omitted variable or an irrelevant variable.
A correctly specified model should be
created by considering relevant economic
theory and by looking at what others have
done in similar studies.
8-14
Understanding Omitted
Variable Bias:
Suppose the true regression model is:
$Y_i = b_0 + b_1 X_{1i} + b_2 X_{2i} + \varepsilon_i$
but we estimate the model:
$Y_i = a_0 + a_1 X_{1i} + \varepsilon_i$
Note: we use a different regression
coefficient notation when $X_{2i}$ is omitted,
because the intercept term and slope
coefficient on $X_{1i}$ will generally not be the
same as when $X_{2i}$ is included.
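The link between the two sets of coefficients can be written out explicitly. The following is a standard textbook result, sketched in the notation of this slide, with the error term of the true model written as $\varepsilon_i$. The symbols $\delta_0$, $\delta_1$ and $v_i$ are introduced here for illustration and denote the intercept, slope and error of an auxiliary regression of $X_{2i}$ on $X_{1i}$:

```latex
X_{2i} = \delta_0 + \delta_1 X_{1i} + v_i
\quad\Longrightarrow\quad
Y_i = \underbrace{(b_0 + b_2 \delta_0)}_{a_0}
    + \underbrace{(b_1 + b_2 \delta_1)}_{a_1} X_{1i}
    + (b_2 v_i + \varepsilon_i)
```

The slope in the short regression therefore picks up $b_2 \delta_1$ in addition to $b_1$. This extra term vanishes only if the omitted variable is irrelevant ($b_2 = 0$) or unrelated to $X_{1i}$ ($\delta_1 = 0$).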
8-15
Understanding Omitted
Variable Bias:
If the omitted variable (X2) is correlated with the
remaining variable (X1), then the error term in
the model will be correlated with X1, and the
estimated values of the regression coefficients a0
and a1 will be biased and inconsistent.
In addition, the estimates of the standard
errors of those coefficients will also
be inconsistent, so we can use neither
the coefficient estimates nor the estimated
standard errors to make statistical
tests.
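A small simulation can make the bias visible. The sketch below uses made-up coefficients (1, 2, 3) and a made-up correlation structure in which X2 has slope 0.8 on X1; none of these numbers come from the slides. The short regression's slope settles near 2 + 3 × 0.8 = 4.4 rather than the true value 2:

```python
import random

random.seed(0)
n = 5000
b0, b1, b2 = 1.0, 2.0, 3.0    # hypothetical true coefficients

x1 = [random.gauss(0, 1) for _ in range(n)]
# X2 is correlated with X1: slope 0.8 in a regression of X2 on X1.
x2 = [0.8 * u + random.gauss(0, 0.6) for u in x1]
y = [b0 + b1 * u + b2 * v + random.gauss(0, 1) for u, v in zip(x1, x2)]

def slope(x, y):
    """OLS slope of y on x (with intercept)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
            / sum((xi - mx) ** 2 for xi in x))

a1 = slope(x1, y)             # short regression: X2 is omitted
print(f"true b1 = {b1}, estimated a1 = {a1:.2f}, b1 + b2*0.8 = {b1 + b2 * 0.8:.2f}")
```

Increasing `n` does not help: the estimate converges to the wrong value, which is what inconsistency means here.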

8-16
Understanding Omitted
Variable Bias
Omitted Variable Bias and the Bid-Ask Spread:
Results from regressing ln(bid-ask spread/price) on
ln(number of market makers) and ln(market capitalization):

                                      Coefficients   Standard Error   t-statistic
Intercept                                   1.5949           0.2275        7.0105
ln(number of NASDAQ market makers)         -1.5186           0.0808      -18.7946
ln(company's market capitalization)        -0.3790           0.0151      -25.0993

ANOVA               MSS            F   Significance F
Regression    1864.0667    2216.7505             0.00
Residual         0.8409

Residual standard error   0.9170
Multiple R-squared        0.6318

8-17
Understanding Omitted
Variable Bias
Omitted Variable Bias and the Bid-Ask Spread:
Results from regressing ln(bid-ask spread/price) on
ln(number of market makers):

                                      Coefficients   Standard Error   t-statistic
Intercept                                   5.0707           0.2009       25.2399
ln(number of NASDAQ market makers)         -3.1027           0.0561      -55.3066
ln(company's market capitalization)   (omitted variable)

ANOVA               MSS            F   Significance F
Regression    3200.3918    3063.3655             0.00
Residual         1.0447

Residual standard error   1.0221
Multiple R-squared        0.5423

8-18
Omitted Variable Bias:
Note that the coefficient on ln(number of
NASDAQ market makers) changed from
-1.5186 in the original (correctly
specified) regression to -3.1027 in the
misspecified regression.
Also, the intercept changed from 1.5949
in the correctly specified regression to
5.0707 in the misspecified regression.
These results illustrate that omitting an
independent variable that should be in
the regression can cause the remaining
regression coefficients to be biased and inconsistent.
8-19
Misspecification caused by
Use of Wrong Form of Data:
A second common cause of misspecification in regression
models is the use of the wrong form of the data in a
regression, when a transformed version of the data is
appropriate.
For example, a researcher sometimes fails to account for
curvature or nonlinearity in the relationship between the
dependent variable and one or more of the independent
variables, instead specifying a linear relation among
the variables.
When we are specifying a regression model, we should
consider whether economic theory suggests a nonlinear
relation.
We can often confirm the non-linearity by plotting the data.

8-20
Plotting ONGC Share Price with FX
Rate for $/INR
[Figure: scatter plot of ONGC share price (vertical axis, 0 to 350) against the $/INR FX rate (horizontal axis, 40 to 70)]


8-21
Plotting Ln (ONGC Price) with Ln (FX
Rates):
[Figure: scatter plot titled "ONGC Returns" (vertical axis, -0.2 to 0.3; horizontal axis, -0.08 to 0.1)]


8-22
Misspecification caused by Use
of Wrong Form of Data:
If the relationship between the variables
becomes linear when one or more of the
variables is represented as a proportional
change in the variable, we may be able to
correct the misspecification by taking the
natural logarithm of the variable(s) we want to
represent as a proportional change.

8-23
Understanding Variable Bias:
Wrong Form of Data
Omitted Variable Bias and the Bid-Ask Spread:
Results from regressing (bid-ask spread/price) on
ln(number of market makers) and ln(market capitalization):

                                      Coefficients   Standard Error   t-statistic
Intercept                                   0.0674           0.0035       19.2571
ln(number of NASDAQ market makers)         -0.0142           0.0012      -11.8333
ln(company's market capitalization)        -0.0016           0.0002       -8.0000

ANOVA               MSS            F   Significance F
Regression       0.0770     392.3338             0.00
Residual         0.0002

Residual standard error   0.0140
Multiple R-squared        0.2329

8-24
Understanding Variable Bias:
Wrong Form of Data
The table in the previous slide shows the regression with (bid-ask
spread/price) as the dependent variable and the
natural logarithm of the number of market makers and the
natural logarithm of the company's market capitalization
as the independent variables.
Q1. Suppose that for a particular listed stock, the number
of market makers is 50 and the market capitalization is
$6 billion. What is the predicted ratio of bid-ask spread to
price for this stock based on the above model?
Q2. Does the predicted bid-ask spread for the above
stock make sense? If not, how could this problem be
avoided?

8-25
Understanding Variable Bias:
Wrong Form of Data
Solution to Q1. ln 50 = 3.9120
ln 6000 = 8.6995 (market capitalization in $ millions)
In this case, the predicted ratio of bid-ask spread
to price is 0.0674 + (-0.0142 × 3.9120) +
(-0.0016 × 8.6995) = -0.0021.
Therefore the model predicts that the ratio of
bid-ask spread to stock price is -0.0021, or -0.21
percent of the stock price.
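The arithmetic in the solution to Q1 can be reproduced in a few lines. The coefficients are taken from the regression table for (bid-ask spread/price); the variable names are illustrative, and market capitalization is assumed to be measured in $ millions, as the use of ln 6000 implies:

```python
import math

# Coefficients from the regression of (bid-ask spread/price) on the two logs.
intercept, b_makers, b_cap = 0.0674, -0.0142, -0.0016

market_makers = 50
market_cap_millions = 6000    # $6 billion expressed in $ millions

predicted = (intercept
             + b_makers * math.log(market_makers)
             + b_cap * math.log(market_cap_millions))
print(round(predicted, 4))    # -0.0021
```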
Solution to Q2. ?

8-26
Understanding Variable Bias:
Wrong Form of Data
If we use the non-transformed ratio bid-ask
spread/price as the dependent variable,
the estimated model can predict
negative values of the bid-ask spread.
This result is nonsensical. In reality
no bid-ask spread is negative (it is hard to
motivate traders to simultaneously buy
high and sell low), so a model that predicts
a negative bid-ask spread is certainly
misspecified.
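For comparison, the same stock can be run through the earlier regression that uses ln(bid-ask spread/price) as the dependent variable. The sketch below takes the coefficients from that table and converts the predicted log back to a ratio by exponentiating. This back-transformation is a simplifying assumption of the sketch, not a step shown on the slides, but it illustrates why the log form cannot produce a negative spread:

```python
import math

# Coefficients from the regression of ln(bid-ask spread/price) on the two logs.
intercept, b_makers, b_cap = 1.5949, -1.5186, -0.3790

market_makers = 50
market_cap_millions = 6000    # $6 billion expressed in $ millions (assumed units)

pred_log = (intercept
            + b_makers * math.log(market_makers)
            + b_cap * math.log(market_cap_millions))
pred_ratio = math.exp(pred_log)   # exp() of any real number is positive
print(f"predicted ln(ratio) = {pred_log:.4f}, predicted ratio = {pred_ratio:.4f}")
```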
8-27
Misspecification caused by Use of
Wrong Form of Data: Unscaled Data
At other times analysts use unscaled data in a
regression when scaled data (such as
net income or cash flow divided by sales)
are more appropriate.
In the previous example, we scaled the bid-
ask spread by stock price because what a
bid-ask spread means in terms of
transaction costs for a given size of
investment depends on the price of the
stock. If we had not scaled the bid-ask
spread, the regression would have been misspecified.
8-28
Misspecification caused by
Use of Wrong Form of Data:
Unscaled Data
Often, analysts must decide whether to scale
variables before they compare data across
companies.
For example, in financial statement analysis,
analysts often compare companies using
common-size statements. Common-size
statements make comparisons across
companies much easier.
Issues of comparability also arise for analysts
who want to use regression analysis to compare
the performance of a group of companies.

8-29
Misspecification caused by
Use of Wrong Form of Data:
Unscaled Data
Suppose an analyst wants to explain free cash flow
to the firm as a function of cash flow from
operations in 2001 for 11 family clothing stores
with market capitalization of more than $100
million as of the end of 2001.
Using free cash flow as the dependent variable and
cash flow from operations as the independent
variable in a regression, the following results are
obtained:

8-30
Misspecification caused by Use
of Wrong Form of Data: Unscaled
Data
Results from regressing free cash flow on cash flow
from operations for family clothing stores:

                              Coefficients   Standard Error   t-statistic
Intercept                           0.7295          27.7302        0.0263
Cash flow from operations           0.3579           0.0548        6.5288

ANOVA                 MSS          F   Significance F
Regression    245093.7836    42.6247           0.0001
Residual        5750.0349

Residual standard error   75.8290
Multiple R-squared         0.8257

8-31
Misspecification caused by
Use of Wrong Form of Data:
Unscaled Data
The F-statistic and the slope's t-statistic are well above the
critical values, meaning that the regression
relation is significant.
So can we conclude that for a clothing
store, if cash flow from operations
increased by $1.00, we could confidently
predict that free cash flow to the firm
would increase by $0.3579?
Is this specification correct?

8-32
Misspecification caused by
Use of Wrong Form of Data:
Unscaled Data
The regression does not account for size
differences among the companies in the sample.
We can account for size differences by using
common-size cash flow results across companies.
We scale the variables by dividing cash flow
from operations and free cash flow to the firm by
the company's sales before running the regression
analysis.
So we will use (free cash flow to the firm/sales)
as the dependent variable and (cash flow from
operations/sales) as the independent variable.

8-33
Misspecification caused by Use
of Wrong Form of Data: Unscaled
Data
Results from regressing (free cash flow/sales) on (cash flow
from operations/sales) for family clothing stores:

                                   Coefficients   Standard Error   t-statistic
Intercept                               -0.0121           0.0221       -0.5497
Cash flow from operations/sales          0.4749           0.2920        1.6262

ANOVA             MSS         F   Significance F
Regression     0.0030    2.6447           0.1383
Residual       0.0011   (residual sum of squares: 0.0102)

Residual standard error   0.0336
Multiple R-squared        0.2271

8-34
Misspecification caused by Use of
Wrong Form of Data: Unscaled Data
Note that the t-statistic for the slope coefficient is not
significant at the 0.05 level. The p-value of the F-statistic
is 0.1383, so we cannot reject at the 0.05 level the null hypothesis
that the regression does not explain variation in (free
cash flow/sales) among family clothing stores.
Finally, note that the R-squared in this regression is much
lower than that of the previous regression.
Which regression makes more sense?
Without scaling, the results of the regression can be
based solely on scale differences across companies,
rather than on the companies' underlying
economics.
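The scale effect is easy to reproduce with made-up numbers. In the sketch below, six hypothetical companies differ widely in sales, while their cash flow margins are constructed to be unrelated to each other. The figures are invented for illustration and are not the clothing-store data. The unscaled regression still shows a very high R-squared, driven entirely by size:

```python
def r_squared(x, y):
    """R-squared of a simple regression of y on x (squared correlation)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)

# Hypothetical companies of very different sizes.
sales = [100, 200, 400, 800, 1600, 3200]
cfo = [10, 24, 32, 88, 144, 320]     # cash flow from operations
fcf = [5, 6, 16, 48, 64, 160]        # free cash flow to the firm

r2_unscaled = r_squared(cfo, fcf)

# Common-size versions: divide each cash flow by the company's sales.
cfo_margin = [c / s for c, s in zip(cfo, sales)]
fcf_margin = [f / s for f, s in zip(fcf, sales)]
r2_scaled = r_squared(cfo_margin, fcf_margin)

print(f"unscaled R-squared = {r2_unscaled:.3f}, scaled R-squared = {r2_scaled:.3f}")
```

Once both variables are divided by sales, the apparent relationship in this constructed example disappears.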

8-35
Understand the Problem
Presented by Missing Data
When collecting data, sometimes data are missing
for some of the observations.
Solutions:
(1) If there is no systematic reason that the data
are missing, we can delete those observations
and estimate the model for the observations
with non-missing data.
(2) Create a new dummy variable that is equal
to 1 if the data are missing for that observation
and 0 otherwise (and set the value of the
missing observations to 0).
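Solution (2) amounts to building two columns from one. A minimal sketch, assuming a made-up regressor in which `None` marks a missing observation:

```python
# Hypothetical regressor with two missing observations (None).
x = [4.2, None, 3.1, 5.0, None, 2.7]

# Dummy: 1 where the observation is missing, 0 otherwise.
missing_dummy = [1 if v is None else 0 for v in x]
# Regressor with the missing values set to 0, as the slide describes.
x_filled = [0.0 if v is None else v for v in x]

# Rows of the design matrix: constant, filled regressor, missing-data dummy.
rows = [[1.0, xf, d] for xf, d in zip(x_filled, missing_dummy)]
for row in rows:
    print(row)
```

Both columns then enter the regression together, so the observations with missing data are kept rather than dropped.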

8-36
Understand the Problem Presented
by Outliers
Outliers can significantly affect the calculated slope
coefficients.
It is not acceptable to simply drop outliers unless you can
determine that their presence is due to data entry error.
One possible way to control for outliers is to include dummy
variables that flag outliers in the dependent variable and
in the independent variables.
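The effect of an outlier dummy can be seen on a constructed example. In the sketch below, the data lie exactly on the line y = 2 + 0.5x except for one observation that is 50 units too high. The data and the small `ols` helper are illustrative assumptions:

```python
def ols(X, y):
    """Solve the normal equations (X'X) b = X'y by Gauss-Jordan elimination."""
    k = len(X[0])
    A = [[sum(row[i] * row[j] for row in X) for j in range(k)]
         + [sum(row[i] * yi for row, yi in zip(X, y))] for i in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]

x = [float(v) for v in range(1, 11)]
y = [2.0 + 0.5 * v for v in x]
y[-1] += 50.0                           # hypothetical outlier in the last observation

# Regression without controlling for the outlier.
_, slope_plain = ols([[1.0, v] for v in x], y)

# Regression with a dummy equal to 1 only for the outlying observation.
dummy = [0.0] * 9 + [1.0]
_, slope_dummy, outlier_effect = ols([[1.0, v, d] for v, d in zip(x, dummy)], y)

print(f"slope without dummy = {slope_plain:.3f}")
print(f"slope with dummy = {slope_dummy:.3f}, outlier coefficient = {outlier_effect:.2f}")
```

The observation stays in the sample, but the dummy absorbs its unusual level so that it no longer distorts the slope.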

8-37
Empirical Example: Total
Medals won in the Olympics
vs. GDP per Capita
Total Medals vs. GDP Per Capita (Thousands)
[Figure: scatter plot of total medals (vertical axis, 0 to 120) against GDP per capita in $1000 (horizontal axis, 0 to 120). Two groups of potential outliers are marked: the US, China and Russia; and Norway and Qatar.]


8-38
Empirical Example: Regression
Results without Controlling for the
Outliers
The coefficient on GDP per capita
means that, on average, if GDP per
capita increases by $1,000, the
number of Olympic medals goes up
by 0.15 of a medal. This coefficient
is statistically significant at the
10% level and almost significant at
the 5% level.

8-39
Empirical Example: Regression Results
with Controlling for the Outliers
The coefficient on GDP per capita has
increased from 0.15 to 0.21. The medal
outlier coefficient says that, on
average, the three medal outliers have
86.72 more medals than countries that
are not outliers. The GDP outlier
coefficient says that, on average, the
two GDP outliers have 20 fewer medals
than countries that are not outliers. Both
of these coefficients are statistically
significant at the 5% level.

8-40
Perform the Reset Test for
the Inclusion of Higher-
Order Polynomials
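A hedged sketch of one common form of the RESET test: estimate the model, add the squared and cubed fitted values as extra regressors, and use an F-test to check whether they are jointly significant. The data below are simulated with a quadratic term that the linear model leaves out. The data, the seed, and the helper names are assumptions made for illustration:

```python
import random

def ols(X, y):
    """Solve the normal equations (X'X) b = X'y by Gauss-Jordan elimination."""
    k = len(X[0])
    A = [[sum(row[i] * row[j] for row in X) for j in range(k)]
         + [sum(row[i] * yi for row, yi in zip(X, y))] for i in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]

def ssr(X, y, beta):
    """Sum of squared residuals for coefficients beta."""
    return sum((yi - sum(b * xj for b, xj in zip(beta, row))) ** 2
               for row, yi in zip(X, y))

random.seed(1)
x = [-2 + 4 * i / 39 for i in range(40)]
# Simulated data: the true relationship is quadratic.
y = [1 + 0.5 * v + 0.5 * v ** 2 + random.gauss(0, 0.3) for v in x]

# Step 1: restricted (linear) model and its fitted values.
X_r = [[1.0, v] for v in x]
beta_r = ols(X_r, y)
fitted = [beta_r[0] + beta_r[1] * v for v in x]
ssr_r = ssr(X_r, y, beta_r)

# Step 2: unrestricted model adds fitted^2 and fitted^3.
X_u = [[1.0, v, f ** 2, f ** 3] for v, f in zip(x, fitted)]
ssr_u = ssr(X_u, y, ols(X_u, y))

# Step 3: F-statistic for the q = 2 added terms.
q, df = 2, len(y) - 4
F = ((ssr_r - ssr_u) / q) / (ssr_u / df)
print(f"RESET F-statistic = {F:.2f} with ({q}, {df}) degrees of freedom")
```

The F-statistic is compared with the critical value for (2, 36) degrees of freedom, which is about 3.3 at the 5% level. A value above it suggests that higher-order terms belong in the model.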

8-41
End Term Project Guidelines

Select any topic of research in finance which requires the use of multiple
regression analysis.
Start by referring to some existing research papers on the subject.
You can draw on the illustrations given in your textbook,
Quantitative Investment Analysis by DeFusco, McLeavey, Pinto, and
Runkle. Also refer to the problems given at the end of the chapters.
Focus on Chapter 8 (Correlation and Regression) and Chapter 9
(Multiple Regression and Issues in Regression Analysis) of the
textbook.
Choose appropriate dependent and independent variables for the
regression analysis.
Prefer a topic that also requires some qualitative independent
variables. Use dummy variables (0 or 1) for qualitative
independent variables.

8-42
End Term Project Guidelines

Test for heteroscedasticity (non-constant error variance
across observations).
Test for serial correlation (generally positive
autocorrelation) in the error terms, and correct for
it.
Test for multicollinearity (high R-squared and significant
F-statistic but insignificant t-statistics). Correct for
multicollinearity by dropping one or more correlated
independent variables from the regression estimation.

8-43
End Term Project Guidelines

Test for omitted variable bias


Test for incorrect independent variable bias.
Test for scaling issues.
Test for correct form of data (Linear, or Logarithmic)
Test for missing data (adding dummy variables)
Test for outliers (adding dummy variables for both dependent
and independent variables)

8-44
Assignment QF

Test of seasonality?
For stock returns (month-of-the-calendar effect, using
11 dummy variables)
For stock returns (day-of-the-week effect, using 4
dummy variables)
For indices (same as above for stock returns).
For other financial time series data such as foreign
exchange rates, and futures and options prices and returns.
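The dummy counts above follow from leaving one category out as the base: 12 months need 11 dummies, and 5 trading days need 4. A minimal sketch of the construction, in which the choice of January and Monday as base categories and the function names are assumptions:

```python
def month_dummies(month):
    """11 dummies for months 2..12; January (month 1) is the omitted base."""
    return [1 if month == m else 0 for m in range(2, 13)]

def weekday_dummies(day):
    """4 dummies for Tuesday..Friday (days 2..5); Monday (day 1) is the base."""
    return [1 if day == d else 0 for d in range(2, 6)]

print(month_dummies(1))     # January: all zeros
print(month_dummies(12))    # December: last dummy is 1
print(weekday_dummies(3))   # Wednesday: second dummy is 1
```

Each dummy coefficient then measures the average return in that month or on that day relative to the base category.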

8-45
