8-3
Model Specifications and
Errors in Specification
Model specification refers to the set of
variables included in the regression and
the regression equation's functional form.
8-4
Principles of Model
Specification
1. The model should be grounded in cogent
economic reasoning: we should be able to supply the
economic reasoning behind the choice of variables,
and the reasoning should make sense. When this
condition is fulfilled, we increase the chance that the
model will have predictive value with new data.
This approach contrasts with the variable-selection
process known as data mining. With data mining,
the investigator essentially develops a model that
maximally exploits the characteristics of a specific
dataset.
8-5
8-11
Understand the Problem
Presented by Omitted Variable
Bias
Omitted Variable Bias is the bias in
coefficient estimates when a variable is
omitted from the model and that variable is
also related to one or more independent
variables.
8-12
Understand the Problem
Presented by Including an
Irrelevant Variable
Including an Irrelevant Variable is when a
variable is included in the regression model
even though it is not related to the
dependent variable.
8-13
What is the Lesser of Two Evils:
Omitted Variable Bias or
Including an Irrelevant
Variable?
Because omitting a relevant variable results
in biased estimates while including an
irrelevant variable does not, it is more
desirable to include an irrelevant variable.
However, it would be best to have a
correctly specified model without either an
omitted variable or an irrelevant variable.
A correctly specified model should be
created by considering relevant economic
theory and by looking at what others have
done in similar studies.
8-14
Understanding Omitted
Variable Bias:
If the true regression model was:
$Y_i = b_0 + b_1 X_{1i} + b_2 X_{2i} + \varepsilon_i$
but we estimate the model
$Y_i = a_0 + a_1 X_{1i} + \varepsilon_i$
Note: we have used a different regression
coefficient notation when $X_{2i}$ is omitted,
because the intercept term and slope
coefficient on $X_{1i}$ will generally not be the
same as when $X_{2i}$ is included.
8-15
Understanding Omitted
Variable Bias:
If the omitted variable ($X_2$) is correlated with the
remaining variable ($X_1$), then the error term in
the model will be correlated with $X_1$, and the
estimated values of the regression coefficients $a_0$
and $a_1$ will be biased and inconsistent.
In addition, the estimates of the standard
errors of those coefficients will also
be inconsistent, so we can use neither
the coefficient estimates nor the estimated
standard errors to make statistical
tests.
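The bias described above can be illustrated with a small simulation. This is a minimal sketch, not part of the original slides: the true coefficients (1, 2, 3), the 0.8 link between X2 and X1, the sample size, and the helper names `simple_ols` and `two_var_ols` are all hypothetical choices made for illustration. Under these assumptions, the slope from the regression that omits X2 should settle near b1 + b2 × 0.8 = 4.4 rather than the true b1 = 2.

```python
import random

def simple_ols(x, y):
    """Return (intercept, slope) from regressing y on one regressor x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

def two_var_ols(x1, x2, y):
    """Return (intercept, b1, b2) from regressing y on x1 and x2."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (c - my) for a, c in zip(x1, y))
    s2y = sum((b - m2) * (c - my) for b, c in zip(x2, y))
    det = s11 * s22 - s12 ** 2
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    return my - b1 * m1 - b2 * m2, b1, b2

# Hypothetical data-generating process (assumed for illustration only).
rng = random.Random(0)
TRUE_B0, TRUE_B1, TRUE_B2 = 1.0, 2.0, 3.0
n = 5000
x1 = [rng.gauss(0, 1) for _ in range(n)]
x2 = [0.8 * a + rng.gauss(0, 0.6) for a in x1]  # X2 is correlated with X1
y = [TRUE_B0 + TRUE_B1 * a + TRUE_B2 * b + rng.gauss(0, 1)
     for a, b in zip(x1, x2)]

# Correctly specified model: both regressors included.
b0_hat, b1_hat, b2_hat = two_var_ols(x1, x2, y)
# Misspecified model: X2 omitted, so its effect loads onto X1.
a0_hat, a1_hat = simple_ols(x1, y)

print("slope on X1 with X2 included:", round(b1_hat, 3))
print("slope on X1 with X2 omitted: ", round(a1_hat, 3))
```

Increasing the sample size does not bring the second slope closer to 2, which is what inconsistency means in this setting.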
8-16
Understanding Omitted
Variable Bias
Omitted Variable Bias and the Bid-Ask Spread:
Results from regressing ln(bid-ask spread/price) on
ln(number of market makers) and ln(market capitalization):

                                       Coefficients   Standard Error   t-statistic
Intercept                                    1.5949           0.2275        7.0105
ln(Number of NASDAQ market makers)          -1.5186           0.0808      -18.7946
ln(Company's market capitalization)         -0.3790           0.0151      -25.0993

ANOVA         MSS          F            Significance F
Regression    1864.0667    2216.7505    0.00
Residual      0.8409
8-17
Understanding Omitted
Variable Bias
Omitted Variable Bias and the Bid-Ask Spread:
Results from regressing ln(bid-ask spread/price) on
ln(number of market makers):

                                       Coefficients   Standard Error   t-statistic
Intercept                                    5.0707           0.2009       25.2399
ln(Number of NASDAQ market makers)          -3.1027           0.0561      -55.3066
ln(Company's market capitalization)         (omitted variable)

ANOVA         MSS          F            Significance F
Regression    3200.3918    3063.3655    0.00
Residual      1.0447
8-18
Omitted Variable Bias:
Note that the coefficient on ln(number of
NASDAQ market makers) changed from
-1.5186 in the original (correctly
specified) regression to -3.1027 in the
misspecified regression.
Also, the intercept changed from 1.5949
in the correctly specified regression to
5.0707 in the misspecified regression.
These results illustrate that omitting an
independent variable that should be in
the regression can cause the remaining
regression coefficients to be inconsistent.
8-19
Misspecification caused by
Use of Wrong Form of Data:
A second common cause of misspecification in regression
models is the use of the wrong form of the data in a
regression when a transformed version of the data is
appropriate.
For example, a researcher sometimes fails to account for
curvature or nonlinearity in the relationship between the
dependent variable and one or more of the independent
variables, instead specifying a linear relation among
the variables.
When we are specifying a regression model, we should
consider whether economic theory suggests a nonlinear
relation.
We can often confirm the nonlinearity by plotting the data.
8-20
Plotting ONGC Share Price with FX
Rate for $/INR
[Figure: ONGC share price (vertical axis, 0 to 350) plotted against the $/INR FX rate (horizontal axis, 40 to 70)]
8-21
Plotting Ln (ONGC Price) with Ln (FX
Rates):
[Figure: ONGC returns (vertical axis, -0.2 to 0.3) plotted against Ln (FX rates) (horizontal axis, -0.08 to 0.1)]
8-22
Misspecification caused by Use
of Wrong Form of Data:
If the relationship between the variables
becomes linear when one or more of the
variables is represented as a proportional
change in the variable, we may be able to
correct the misspecification by taking the
natural logarithm of the variable(s) we want to
represent as a proportional change.
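A short sketch of the idea above, using made-up numbers: if y is related to x by a power law (an assumed form, y = 5·x^1.5, chosen only for illustration), the relation is curved in levels but exactly linear in natural logarithms, and the slope of the log-log regression recovers the exponent. The helper name `simple_ols` is hypothetical.

```python
import math

def simple_ols(x, y):
    """Return (intercept, slope) from regressing y on one regressor x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

# Hypothetical nonlinear relation: y = 5 * x ** 1.5 (no noise).
x = [float(v) for v in range(1, 21)]
y = [5.0 * v ** 1.5 for v in x]

# Take natural logs of both variables before fitting.
ln_x = [math.log(v) for v in x]
ln_y = [math.log(v) for v in y]
intercept, slope = simple_ols(ln_x, ln_y)

# The slope is the proportional response of y to x; exp(intercept)
# recovers the multiplicative constant.
print("slope:", round(slope, 6))
print("constant:", round(math.exp(intercept), 6))
```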
8-23
Understanding Variable Bias:
Wrong Form of Data
Omitted Variable Bias and the Bid-Ask Spread:
Results from regressing (bid-ask spread/price) on
ln(number of market makers) and ln(market cap):

                                       Coefficients   Standard Error   t-statistic
Intercept                                    0.0674           0.0035       19.2571
ln(Number of NASDAQ market makers)          -0.0142           0.0012      -11.8333
ln(Company's market capitalization)         -0.0016           0.0002       -8.0000

ANOVA         MSS       F           Significance F
Regression    0.0770    392.3338    0.00
Residual      0.0002
8-24
Understanding Variable Bias:
Wrong Form of Data
The table in the previous slide shows the regression with
(bid-ask spread/price) as the dependent variable and the
natural logarithm of the number of market makers and the
natural logarithm of the company's market capitalization
as the independent variables.
Q1. Suppose that for a particular listed stock, the number
of market makers is 50 and the market capitalization is
$6 billion. What is the predicted ratio of bid-ask spread to
price for this stock based on the above model?
Q2. Does the predicted bid-ask spread for the above
stock make sense? If not, how could this problem be
avoided?
8-25
Understanding Variable Bias:
Wrong Form of Data
Solution to Q1. ln 50 = 3.9120
ln 6000 = 8.6995 (market capitalization in $ millions)
In this case, the predicted ratio of bid-ask spread
to price is 0.0674 + (-0.0142 × 3.9120) +
(-0.0016 × 8.6995) = -0.0021.
Therefore, the model predicts that the ratio of
bid-ask spread to stock price is -0.0021, or -0.21
percent of the stock price.
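The arithmetic in the solution to Q1 can be checked with a few lines of Python. The coefficients come from the regression table for (bid-ask spread/price); the variable names are illustrative, and market capitalization is entered in $ millions to match the ln 6000 used above.

```python
import math

# Coefficients from the (bid-ask spread/price) regression table.
INTERCEPT = 0.0674
B_LN_MARKET_MAKERS = -0.0142
B_LN_MARKET_CAP = -0.0016

market_makers = 50
market_cap_millions = 6000  # $6 billion expressed in $ millions

predicted_ratio = (
    INTERCEPT
    + B_LN_MARKET_MAKERS * math.log(market_makers)
    + B_LN_MARKET_CAP * math.log(market_cap_millions)
)

print(round(predicted_ratio, 4))  # -0.0021
```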
Solution to Q2. ?
8-26
Understanding Variable Bias:
Wrong Form of Data
If we use the non-transformed ratio (bid-ask
spread/price) as the dependent variable,
the estimated model can predict
negative values of the bid-ask spread.
This result is nonsensical. In reality,
no bid-ask spread is negative (it is hard to
motivate traders to simultaneously buy
high and sell low), so a model that predicts
a negative bid-ask spread is certainly
misspecified.
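One way to see why a log-transformed dependent variable avoids this problem: a prediction of ln(bid-ask spread/price) is converted back with the exponential function, which is positive for any input. The sketch below assumes the earlier regression of ln(bid-ask spread/price) on ln(number of market makers) and ln(market capitalization) is the intended alternative and reuses its reported coefficients; the variable names are illustrative.

```python
import math

# Coefficients from the earlier ln(bid-ask spread/price) regression table.
INTERCEPT = 1.5949
B_LN_MARKET_MAKERS = -1.5186
B_LN_MARKET_CAP = -0.3790

market_makers = 50
market_cap_millions = 6000  # $6 billion expressed in $ millions

predicted_ln_ratio = (
    INTERCEPT
    + B_LN_MARKET_MAKERS * math.log(market_makers)
    + B_LN_MARKET_CAP * math.log(market_cap_millions)
)

# Undo the log: exp() of any real number is strictly positive, so the
# implied spread-to-price ratio cannot be negative.
predicted_ratio_log_model = math.exp(predicted_ln_ratio)

print("predicted ln(spread/price):", round(predicted_ln_ratio, 4))
print("implied spread/price:", predicted_ratio_log_model)
```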
8-27
Misspecification caused by Use of
Wrong Form of Data: Unscaled Data
At other times, analysts use unscaled data in a
regression when scaled data (such as
dividing net income or cash flow by sales)
are more appropriate.
In the previous example, we scaled the bid-ask
spread by stock price because what a
bid-ask spread means in terms of
transaction costs for a given size
investment depends on the price of the
stock. If we had not scaled the bid-ask
spread, the regression would have been
misspecified.
8-28
Misspecification caused by
Use of Wrong Form of Data:
Unscaled Data
Often, analysts must decide whether to scale
variables before they compare data across
companies.
For example, in financial statement analysis,
analysts often compare companies using
common-size statements. Common-size
statements make comparisons across
companies much easier.
Issues of comparability also arise for analysts
who want to use regression analysis to compare
the performance of a group of companies.
8-29
Misspecification caused by
Use of Wrong Form of Data:
Unscaled Data
Suppose an analyst wants to explain free cash flow
to the firm as a function of cash flow from
operations in 2001 for 11 family clothing stores
with market capitalization of more than $100
million as of the end of 2001.
Using free cash flow as the dependent variable and
cash flow from operations as the independent
variable in a regression, the following results are
obtained:
8-30
Misspecification caused by Use
of Wrong Form of Data: Unscaled
Data
Results from Regressing Free Cash Flow
on Cash Flow from Operations for Family
Clothing Stores
Coefficients Standard Error t-statistics
8-31
Misspecification caused by
Use of Wrong Form of Data:
Unscaled Data
The F- and t-statistics are well above their
critical values, meaning that the regression
relation is significant.
So can we conclude that for a clothing
store, if cash flow from operations
increased by $1.00, we could confidently
predict that free cash flow to the firm
would increase by $0.3579?
Is this specification correct?
8-32
Misspecification caused by
Use of Wrong Form of Data:
Unscaled Data
The regression does not account for size
differences among the companies in the sample.
We can account for size differences by using
common-size cash flow results across companies.
We scale the variables by dividing cash flow
from operations and free cash flow to the firm by
the company's sales before using regression
analysis.
So we will use (Free cash flow to the firm/Sales)
as the dependent variable and (Cash flow from
operations/Sales) as the independent variable.
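The effect of scaling can be sketched with simulated data. Everything below is hypothetical and is not the clothing-store sample from the slides: 200 made-up firms whose sales vary widely, with cash-flow-to-sales and free-cash-flow-to-sales ratios drawn independently of each other. Because the two ratios are unrelated by construction, any strong fit in the unscaled regression reflects firm size alone.

```python
import random

def r_squared(x, y):
    """R-squared of a simple regression of y on x (squared correlation)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy ** 2 / (sxx * syy)

# Hypothetical firms: sales differ a lot, margins are independent draws.
rng = random.Random(42)
n_firms = 200
sales = [rng.lognormvariate(5, 1) for _ in range(n_firms)]
cfo_margin = [rng.uniform(0.08, 0.12) for _ in range(n_firms)]
fcf_margin = [rng.uniform(0.03, 0.06) for _ in range(n_firms)]

cfo = [m * s for m, s in zip(cfo_margin, sales)]  # cash flow from operations
fcf = [m * s for m, s in zip(fcf_margin, sales)]  # free cash flow to the firm

# Unscaled regression: FCF on CFO, both driven by firm size.
r2_unscaled = r_squared(cfo, fcf)

# Scaled regression: (FCF/Sales) on (CFO/Sales).
cfo_to_sales = [c / s for c, s in zip(cfo, sales)]
fcf_to_sales = [f / s for f, s in zip(fcf, sales)]
r2_scaled = r_squared(cfo_to_sales, fcf_to_sales)

print("R-squared, unscaled:", round(r2_unscaled, 3))
print("R-squared, scaled:  ", round(r2_scaled, 3))
```

In this setup the unscaled R-squared is high even though the underlying ratios are unrelated, which is the pattern the scaled regression is meant to expose.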
8-33
Misspecification caused by Use
of Wrong Form of Data: Unscaled
Data
Results from Regressing Free Cash
Flow/Sales on Cash Flow from Operations/Sales
for Family Clothing Stores
Coefficients Standard Error t-statistics
8-34
Misspecification caused by Use of
Wrong Form of Data: Unscaled Data
Note that the t-statistic for the slope coefficient is not
significant at the 0.05 level. Also, the F-statistic is 0.1383, so
we cannot reject at the 0.05 level the null hypothesis
that the regression does not explain variation in (Free
cash flow/Sales) among family clothing stores.
Finally, note that R-squared in this regression is much
lower than that of the previous regression.
Which regression makes more sense?
Without scaling, the results of the regression can be
based solely on scale differences across companies,
rather than on the companies' underlying
economics.
8-35
Understand the Problem
Presented by Missing Data
When collecting data, sometimes data are missing
for some of the observations.
Solutions:
(1) If there is no systematic reason that the data
are missing, we can delete those observations
and estimate the model using the observations
with non-missing data.
(2) Create a new dummy variable that is equal
to 1 if the data are missing for that observation
and 0 if they are not (and set the value of the
missing observations to 0).
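Both solutions can be sketched in a few lines. The values in `raw` are invented for illustration, `None` is assumed to mark a missing value, and the function name `add_missing_indicator` is hypothetical.

```python
def add_missing_indicator(values):
    """Return (filled, dummy): missing values set to 0, dummy = 1 where missing."""
    dummy = [1 if v is None else 0 for v in values]
    filled = [0.0 if v is None else float(v) for v in values]
    return filled, dummy

# Hypothetical regressor with two missing observations.
raw = [4.2, None, 3.1, None, 5.0]

# Solution (1): keep only the observations with non-missing data.
complete_cases = [v for v in raw if v is not None]

# Solution (2): zero-fill the variable and add a missing-data dummy,
# so both columns can enter the regression together.
filled, missing_dummy = add_missing_indicator(raw)

print(complete_cases)
print(filled)
print(missing_dummy)
```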
8-36
Understand the Problem Presented
by Outliers
Outliers can significantly affect the calculated slope
coefficients.
It is not acceptable to simply drop outliers unless you can
determine that their presence is due to a data entry error.
One possible way to control for outliers is to include a dummy
variable for dependent-variable outliers and another for
independent-variable outliers.
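A minimal sketch of the dummy-variable approach for outliers in the dependent variable. The data are invented: ten points on the line y = 2 + 0.5x, with 20 added to the last two. The cutoff of 15 used to flag outliers and the helper names are assumptions made for illustration; the slides do not say how outliers should be identified.

```python
def simple_ols(x, y):
    """Return (intercept, slope) from regressing y on one regressor x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

def two_var_ols(x1, x2, y):
    """Return (intercept, b1, b2) from regressing y on x1 and x2."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (c - my) for a, c in zip(x1, y))
    s2y = sum((b - m2) * (c - my) for b, c in zip(x2, y))
    det = s11 * s22 - s12 ** 2
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    return my - b1 * m1 - b2 * m2, b1, b2

# Hypothetical data: a clean line, then two inflated observations.
x = [float(i) for i in range(1, 11)]
y = [2.0 + 0.5 * xi for xi in x]
y[8] += 20.0
y[9] += 20.0

OUTLIER_CUTOFF = 15.0  # assumed threshold, for illustration only
outlier_dummy = [1.0 if yi > OUTLIER_CUTOFF else 0.0 for yi in y]

# Without the dummy, the two outliers pull the slope upward.
_, slope_no_dummy = simple_ols(x, y)
# With the dummy, the outliers' extra level is absorbed by its coefficient.
intercept, slope_with_dummy, dummy_coef = two_var_ols(x, outlier_dummy, y)

print("slope without dummy:", round(slope_no_dummy, 4))
print("slope with dummy:   ", round(slope_with_dummy, 4))
print("dummy coefficient:  ", round(dummy_coef, 4))
```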
8-37
Empirical Example: Total
Medals won in the Olympics
vs. GDP per Capita
[Figure: Total Medals vs. GDP Per Capita (Thousands). Vertical axis: Total Medals (0 to 120). Horizontal axis: GDP per Capita ($1000) (0 to 120)]
8-38
Empirical Example: Regression
Results without Controlling for the
Outliers
The coefficient on GDP per Capita
means that, on average, if GDP per
capita increases by $1,000, the
number of Olympic medals goes up
by 0.15 of a medal. This coefficient
is statistically significant at the
10% level and is almost significant at
the 5% level.
8-39
Empirical Example: Regression Results
with Controlling for the Outliers
The coefficient on GDP per Capita has
increased from 0.15 to 0.21. The medal
outlier coefficient says that, on
average, the three medal outliers have
86.72 more medals relative to not
being an outlier. The GDP outlier
coefficient says that, on average, the
two GDP outliers have 20 fewer medals
relative to not being an outlier. Both
of these coefficients are statistically
significant at the 5% level.
8-40
Perform the Reset Test for
the Inclusion of Higher-
Order Polynomials
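The slide gives only the title, so the following is a hedged sketch of one common form of the RESET test rather than the procedure the slides intend: fit the linear model, add the squared fitted values as an extra regressor, and compute an F-statistic for whether that term improves the fit. The data, the quadratic data-generating process, and all function names are hypothetical.

```python
import random

def fit_simple(x, y):
    """Fitted values from regressing y on one regressor x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    return [intercept + slope * a for a in x]

def fit_two(x1, x2, y):
    """Fitted values from regressing y on x1 and x2."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (c - my) for a, c in zip(x1, y))
    s2y = sum((b - m2) * (c - my) for b, c in zip(x2, y))
    det = s11 * s22 - s12 ** 2
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    b0 = my - b1 * m1 - b2 * m2
    return [b0 + b1 * a + b2 * b for a, b in zip(x1, x2)]

def ssr(y, fitted):
    """Sum of squared residuals."""
    return sum((a - b) ** 2 for a, b in zip(y, fitted))

# Hypothetical data with genuine curvature that a linear model misses.
rng = random.Random(1)
x = [0.5 * i for i in range(1, 31)]
y = [1.0 + 0.5 * xi + 0.3 * xi ** 2 + rng.gauss(0, 1) for xi in x]

# Step 1: restricted (linear) model.
fitted_linear = fit_simple(x, y)
ssr_restricted = ssr(y, fitted_linear)

# Step 2: augment with squared fitted values and refit.
fitted_sq = [f ** 2 for f in fitted_linear]
ssr_unrestricted = ssr(y, fit_two(x, fitted_sq, y))

# Step 3: F-statistic with q = 1 added term and k = 3 estimated parameters.
n, q, k = len(y), 1, 3
reset_f = ((ssr_restricted - ssr_unrestricted) / q) / (ssr_unrestricted / (n - k))

print("RESET F-statistic:", round(reset_f, 2))
```

A large F-statistic relative to the critical value for (1, n - 3) degrees of freedom suggests the linear functional form is inadequate and that higher-order terms may belong in the model.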
8-41
End Term Project Guidelines
8-44
Assignment QF
Test of seasonality?
For stock returns (month-of-the-calendar effect, using
11 dummy variables)
For stock returns (day-of-the-week effect, using 4
dummy variables)
For indices (same as above for the stock returns).
For other financial time series data such as foreign
exchange rates, futures and options prices, and returns.
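The dummy variables for the seasonality tests above can be built as follows. This is a sketch under two assumptions not stated in the assignment: January and Monday are taken as the omitted base categories, and trading days run Monday to Friday. The function names are hypothetical.

```python
from datetime import date

def month_dummies(d):
    """11 dummies for February..December; January is the omitted base month."""
    return [1 if d.month == m else 0 for m in range(2, 13)]

def weekday_dummies(d):
    """4 dummies for Tuesday..Friday; Monday is the omitted base day.

    Assumes the date is a trading day (Monday to Friday).
    """
    return [1 if d.weekday() == k else 0 for k in range(1, 5)]

# Example dates: a Monday in January and a Wednesday in March.
base_day = date(2024, 1, 15)
other_day = date(2024, 3, 6)

print(month_dummies(base_day), weekday_dummies(base_day))
print(month_dummies(other_day), weekday_dummies(other_day))
```

Using one fewer dummy than the number of categories avoids perfect collinearity with the intercept; each coefficient then measures the difference in average return relative to the base month or day.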
8-45