Política Pública
Multiple Regression and Model
Diagnostics
Dr. Heidi Jane Smith
Today: 3/1
• Multiple Regression and Model Diagnostics
(multicollinearity, heteroskedasticity, and
autocorrelation)
– Readings:
• Stock and Watson, Chapters 6 and 7
• Berman and Wang, Chapter 15
• Acock, Chapter 10
Multiple regression analysis
• Multiple regression analysis has an interval-level
dependent variable and two or more
independent variables, which can be either
dummy or interval level. If an effect has multiple
causes, multiple regression allows us to predict
values of Y more accurately than bivariate
regression does. Multiple regression also helps
isolate the direct effect of a single independent
variable on the dependent variable, once the
effects of the other independent variables are
controlled.
The equation
• The equation for a multiple regression is similar
to that of a linear regression, although it no
longer describes a two-dimensional line:
Y-hat = b0 + b1X1 + b2X2 + b3X3 + ... + bnXn
(the error term belongs to the equation for the observed Y, not for Y-hat)
• Y-hat is the expected value of Y,
• X1 through Xn are the independent variables,
• b0 is the y-intercept, and
• b1 through bn are the regression coefficients
(also called partial slope coefficients).
Y hat
• To determine the expected value of Y, insert the actual
values of X1 through Xn into the equation, multiply, and
add.
• The y-intercept is still the expected value of the dependent
variable when all of the independent variables equal zero
(though the y-intercept will seldom have a practical
meaning; it frequently does not make sense for all
independent variables to equal zero).
• The partial slope coefficients show the expected change in
Y from a one-unit increase in Xi, holding all the other X’s
constant. That is, Xi changes but none of the other
independent variables do. It usually does not matter at
what values you hold the other variables constant.
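The "insert, multiply, and add" arithmetic can be written out directly (a minimal pure-Python sketch; the coefficient and X values are made up for illustration):

```python
# Hypothetical fitted coefficients and observed X values (illustrative only).
b0 = 2.0                 # y-intercept
b = [0.5, -1.2, 3.0]     # partial slope coefficients b1..b3
x = [4.0, 2.0, 1.0]      # actual values of X1..X3

# Insert the X values into the equation, multiply, and add.
y_hat = b0 + sum(bi * xi for bi, xi in zip(b, x))
print(y_hat)  # 2.0 + 2.0 - 2.4 + 3.0 = 4.6
```

Changing x[0] by one unit while leaving x[1] and x[2] alone changes y_hat by exactly b[0] = 0.5, which is the "holding all other X's constant" interpretation of a partial slope.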
Testing Assumptions
1. Model Specification
(that is, identification of the DV and IVs)
2. Testing of regression assumptions
3. Correction of assumption violation
4. Reporting of the results of the final
regression
Model Specification
• Multiple regression is an extension of simple regression,
but an important difference exists between the two methods:
multiple regression aims for full model specification,
identifying all the variables that affect the DV; by contrast,
a simple regression examines the effect of only one IV
• Must identify the variables that are of most (theoretical and
practical) relevance in explaining the dependent variable
• Must also address those variables that were not considered
most relevant
• The assumption of full model specification is that these
other variables are justifiably omitted only when their
cumulative effect on the dependent variable is zero
• The validity of multiple regression models centers on
examining the behavior of the error term in this regard. If
the cumulative effect of the omitted variables is not zero, then
additional variables may have to be considered
Interpreting Coefficients
• Each regression coefficient is interpreted as its effect
on the dependent variable, controlled for the effects of
all the other independent variables included in the
regression.
• See exercise 1 with Auto data
E. Homoskedasticity: The variance of the error term ε is the same at every value (and every
combination of values) of the independent variables:
Var(ε) = Var(ε | x) = σ2.
Model misspecification
Definition: What happens when we violate the first assumption, a properly specified
model? We can potentially either (1) leave out an important independent variable (or
include it in the wrong form) or (2) include an irrelevant independent variable. The effects
will be different, with much more serious consequences in case (1).
Problem:
If we incorrectly leave out an important independent variable, the OLS estimator of βj
(call it bj*) will be biased. That is, E(bj*) ≠ βj. The OLS estimator bj* will remain biased
even in infinitely large samples. The estimated standard error of bj* could be larger or
smaller than the estimated standard error of bj, depending on both the additional
variation in y that could be uniquely explained by adding the omitted variable to the model
and the correlation between that omitted variable and xj.
Suppose, on the other hand, that we add an irrelevant variable (x3) to the correctly
specified population regression function, so that we mistakenly test the model:
y = β0 + β1 x1 + β2 x2 + β3 x3 + ε
OLS is still BLUE – that is, OLS still gives the best linear unbiased estimator of the
population regression function. However, adding the irrelevant variable will raise the
standard errors (both true and estimated) of the other coefficients, which will increase the
confidence interval for the coefficient and reduce the expected value of t*, making it more
difficult to reject the null hypothesis of no impact.
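The omitted-variable case can be illustrated with a small simulation (a hedged sketch in Python with numpy, not part of the course's Stata exercises; all coefficients and sample sizes are made up). Leaving out x2, which is correlated with x1, biases the estimated slope on x1:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# True model: y = 1 + 2*x1 + 3*x2 + e, with x1 and x2 correlated.
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + 0.6 * rng.normal(size=n)
y = 1.0 + 2.0 * x1 + 3.0 * x2 + rng.normal(size=n)

# Correctly specified model: regress y on x1 and x2.
X_full = np.column_stack([np.ones(n), x1, x2])
b_full = np.linalg.lstsq(X_full, y, rcond=None)[0]

# Misspecified model: omit x2.
X_short = np.column_stack([np.ones(n), x1])
b_short = np.linalg.lstsq(X_short, y, rcond=None)[0]

print(b_full[1])   # close to the true beta1 = 2
print(b_short[1])  # biased: roughly 2 + 3*0.8 = 4.4
```

The bias in the short regression matches the textbook omitted-variable formula: the true coefficient on x2 times the slope of x2 on x1.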
Multicollinearity
Multicollinearity occurs when explanatory variables are highly correlated with one
another. There will always be some collinearity in a model, because explanatory
variables can rarely be kept entirely free of linear relationships with one
another.
Causes: The main causes of multicollinearity are a small sample size; a data collection
method that covers only a limited range of values; constraints on the model or
on the population being sampled; and model specification error, including a
misspecified model that contains too many explanatory variables.
Consequences: The consequences of high multicollinearity are not too bad. When
present, the precision of the estimators may be lower, but the model may still capture
the true values. The model will show inflated standard errors, making the t-tests
misleading (lower t ratios and higher p values), so the independent variables will
look insignificant. Furthermore, the OLS estimates are sensitive to small changes in
the data, and the model may show a high R-squared despite few significant
variables. Even so, the OLS estimators are still BLUE: they remain unbiased,
though less precise.
Multicollinearity Detection
• Detection: There are a number of indicators and measures of possible
multicollinearity in a data set. The R2 and F-statistic are high but the t-statistics are
insignificant, which suggests that multicollinearity among the independent variables
may be leading to their insignificant coefficients.
• Low tolerance and high VIF (variance inflation factor). Tolerance is 1 – R2, where R2 is
from the auxiliary regression. (Stata displays tolerance as 1/VIF.) If the tolerance value is
less than some cutoff value, usually .20, the independent variable should be dropped from the
analysis due to multicollinearity.
• VIF can be used instead of tolerance. VIF is 1/(1-R2). If VIF>4 then multicollinearity
might be a problem. If VIF >10, you have high multicollinearity (that is, if R2 in the
auxiliary regression is greater than .90). [Run the vif command immediately after the
regression of interest (not after the auxiliary regression), which you can run with
either the fit or the regress command.]
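The VIF calculation behind Stata's vif command can be sketched by hand (an illustrative Python/numpy version with made-up data, not the Stata command itself): each variable is regressed on all the others, and VIF = 1/(1 – R2) of that auxiliary regression.

```python
import numpy as np

def vif(X):
    """Variance inflation factors via auxiliary regressions: regress each
    column on all the others (plus a constant) and return 1 / (1 - R^2)."""
    n, k = X.shape
    out = []
    for j in range(k):
        y = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        b = np.linalg.lstsq(others, y, rcond=None)[0]
        resid = y - others @ b
        r2 = 1 - resid.var() / y.var()
        out.append(1.0 / (1.0 - r2))
    return out

rng = np.random.default_rng(1)
x1 = rng.normal(size=500)
x2 = x1 + 0.1 * rng.normal(size=500)   # nearly collinear with x1
x3 = rng.normal(size=500)              # unrelated to x1 and x2
vifs = vif(np.column_stack([x1, x2, x3]))
print(vifs)  # x1 and x2 have VIF well above 10; x3 is near 1
```

Here the near-collinear pair x1, x2 crosses the VIF > 10 threshold described above, while the unrelated variable x3 does not.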
Multicollinearity Solutions
• Do nothing since OLS estimates are BLUE even in the presence of high
multicollinearity. Simply accept that your data are not strong enough to answer all
the questions you would like to put to them.
• Most of the time, however, you will feel a strong desire to do something about it.
A common situation is that either X1 or X2 (or both) will have a statistically
significant, even strong, coefficient if the other variable is left out of the model,
but that neither will be statistically significant if both are in the model. Possible
solutions include:
– Drop one of the variables. The other variable then becomes significant and the model looks
better. In general, this should be done only on theoretical grounds (but, in practice, theory will
frequently be weak). In practice, researchers tend to let the data choose the model, typically
dropping the variable with the smaller t-statistic. This is dangerous. If the originally specified
model was correct, the new coefficients will be biased.
– Create a new variable which is a combination of X1 and X2. With a large number of
independent variables, this would typically be done with principal components analysis or
factor analysis. These are methods for finding commonalities in sets of variables and can be
quite useful, but the meaning of the new variable and of its coefficient will usually be pretty
unclear.
– Get a bigger or better data set (one with more unexplained variation in the independent
variables). This leads to smaller standard errors for the coefficients.
Heteroskedasticity
• Definition: A key assumption of the classical linear regression model is that
the error term is homoskedastic. That is, the variance of the error term is the
same at all values of X. When the variance of the error terms is different for
different observations, the error is said to be heteroskedastic.
• Problem: Heteroskedasticity introduces two problems for Ordinary Least
Squares (OLS). OLS yields unbiased but not efficient estimators
of the population parameters, and it yields biased estimators of the variances
of the regression coefficients, which invalidates the logic of the t- and F-tests
and of the confidence intervals.
• Detection:
– Graph the residuals against the independent variable or variables that you suspect
are responsible for the heteroskedasticity. Nonconstant spread means
heteroskedasticity.
– Use a modified version of the Breusch-Pagan test. In Stata, follow the regression
output with the hettest command (null hypothesis: the residual has a constant
variance). The hettest test comes in two forms:
– Name the variables that you suspect are responsible for the heteroskedasticity.
– Run the hettest command without naming any variables, in which case hettest
uses the expected value of the dependent variable as the independent variable.
– Solution: Use robust standard errors. In Stata, at the end of the regression command,
simply add ", robust".
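The logic of the Breusch-Pagan test can be sketched outside Stata (an illustrative Python/numpy version with simulated data, not the hettest command): regress the squared residuals on the suspect variables; under the null of constant variance, n times the R2 of that regression is approximately chi-squared.

```python
import numpy as np

def breusch_pagan_lm(Z, resid):
    """LM form of the Breusch-Pagan test: regress the squared residuals
    on the suspect variables Z; LM = n * R^2 is approximately chi-squared
    with k degrees of freedom under the null of constant variance."""
    n = len(resid)
    ZZ = np.column_stack([np.ones(n), Z])
    u2 = resid ** 2
    b = np.linalg.lstsq(ZZ, u2, rcond=None)[0]
    fitted = ZZ @ b
    r2 = 1 - ((u2 - fitted) ** 2).sum() / ((u2 - u2.mean()) ** 2).sum()
    return n * r2

rng = np.random.default_rng(2)
n = 2_000
x = rng.uniform(1, 5, size=n)
y = 1 + 2 * x + x * rng.normal(size=n)   # error spread grows with x
X = np.column_stack([np.ones(n), x])
b = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ b
lm = breusch_pagan_lm(x.reshape(-1, 1), resid)
print(lm)  # far above the 5% chi-squared(1) cutoff of 3.84
```

Because the simulated error spread grows with x, the test strongly rejects the null of homoskedasticity, which is the same conclusion a significant hettest result conveys.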
Autocorrelation
• Definition: Errors are correlated from observation to
observation. Cov (εi , εj ) ≠ 0 for i ≠ j. It is more
common in time series and panel data.
• Problem: OLS regression yields unbiased but not
efficient estimators of the population parameters and
yields biased estimators of the variance and standard
errors of the regression coefficients. So neither the t-
test nor the confidence intervals can be trusted.
• Detection: In Stata, use the Durbin-Watson test: dwstat
(null hypothesis: there is no autocorrelation). Values of d
near 2 suggest no autocorrelation; values well below 2
suggest positive autocorrelation, and values well above 2
suggest negative autocorrelation.
• Solution: Use robust standard errors
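The Durbin-Watson statistic itself is simple to compute by hand (an illustrative Python/numpy sketch with simulated residuals, not Stata's dwstat):

```python
import numpy as np

def durbin_watson(resid):
    """d = sum of squared successive differences / sum of squared residuals.
    d near 2: no autocorrelation; d near 0: strong positive;
    d near 4: strong negative autocorrelation."""
    resid = np.asarray(resid, dtype=float)
    return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

rng = np.random.default_rng(3)
shocks = rng.normal(size=1_000)

white = shocks                      # independent errors
ar1 = np.empty_like(shocks)         # AR(1) errors with rho = 0.8
ar1[0] = shocks[0]
for t in range(1, len(shocks)):
    ar1[t] = 0.8 * ar1[t - 1] + shocks[t]

d_white = durbin_watson(white)   # near 2
d_ar1 = durbin_watson(ar1)       # well below 2: positive autocorrelation
print(d_white, d_ar1)
```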
Assumptions Needed to Believe
Parameter Estimates for X
• E(u) = 0 In general, this assumption means that any independent variables you couldn’t think of to include in the
analysis are just noise – they have no impact on the dependent variable.
• Non-random measurement error – "the extent to which a measure reflects the concept it is intended to measure." An
example of non-random error would be: If I am trying to measure the impact of September 11th on flying, and I include
an unrelated variable such as "pilot hair color," that would be non-random error. At the same time, if I purposely leave
out an important variable such as "daily number of airline passengers in the past 12 months," that is also non-random
error.
• If you have significant independent variables and a low R-squared, you may have a poor linear functional form, which means
that using a linear regression equation is not appropriate.
Assumptions Needed to Believe
Significance Tests for X
• No autocorrelation – observations must be independent of one another. If observations are not independent, you
cannot trust the accuracy of that variable. This can be a problem in:
– A time series design – observation from time 1 not independent from time 2 (what is an example of this?)
– Cross-sectional data – for example, if you are observing a whole group of people at the same time and are trying to keep data of each person in the
group individually. Each person in the group is impacting the other people in the group so the observations are not independent.
• No Heteroscedasticity – if the variance of the error term is not constant for all observations, there is
heteroscedasticity. For example, if you are trying to measure the impact of gun legislation on gun related death in
the US by collecting cross-sectional data from all 50 states there would probably be heteroscedasticity. The
reason is that with such varying populations, the variance in the error term would be different with different
populations.
– Look for it in:
– cross-sectional data or time series (mostly with cross-sectional).
– aggregate data – like states (each unit has a different N)
– test scores or policy opinions
– when dependent is spending and independent or control is income
• No severe collinearity (i.e., multicollinearity) – there can be no strong linear relationship among the independent
variables. If there is, R-squared will be high but the independent variables will be insignificant.
• Example – If I am using hair color and ethnicity as independent variables in a regression, neither will be significant
because they are highly related.
Review again R2
• R-squared (R2), also called the coefficient of
determination, is interpreted as the
percentage of variation in the dependent variable that
is explained by the independent variables.
• Overall R2 varies from zero to one and is a goodness-of-
fit measure (values closer to one indicate a better fit;
low values suggest that additional factors affect the
dependent variable).
• The strength of the relationship is a value between 0 and 1
– Typically, values of R2 below 0.20 are considered
weak relationships, those between 0.20 and 0.40 are
moderate, and those above 0.40 are strong relationships.
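The definition of R2 can be checked by hand (a pure-Python sketch with made-up observed and predicted values): it is one minus the ratio of the residual sum of squares to the total sum of squares.

```python
# Hypothetical observed and predicted values (illustrative only).
y     = [3.0, 5.0, 7.0, 6.0, 9.0]   # observed values of the DV
y_hat = [3.5, 4.5, 6.5, 6.5, 9.0]   # predicted values from a fitted model

y_bar = sum(y) / len(y)
ss_tot = sum((yi - y_bar) ** 2 for yi in y)            # total variation
ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # unexplained variation
r2 = 1 - ss_res / ss_tot
print(r2)  # 0.95: a strong relationship by the rule of thumb above
```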
Reporting Regressions
• When you are reporting on bivariate
relationships with 2 continuous variables (i.e.
simple regressions) you must report:
– 1) level of significance at which the two variables
are associated if at all (t stat).
– 2) whether the relationship between the two
variables is positive or negative (b)
– 3) the strength of the relationship (R2)
• Use Ho testing not predictions
Standard Error of the Estimate (SEE)
• The SEE is the spread of the y values around the
regression line, as calculated for the mean value
of the independent variable only, and assuming a
large sample.
• The SEE has an interpretation in terms of the
normal curve: about 68% of the y values lie within
one standard error of the calculated value of
y, as calculated for the mean value of x using
the preceding regression model.
error term e
• The predicted value of the dependent variable,
y^, is typically different from the observed
value of y
• Only when R2 = 1 are the observed and predicted
values identical
• The difference between y and y^ is called the
residual, or error term e
– y^ = a + b*x
– y = a + b*x + e
• Assumptions about e are important, such as
that it is normally distributed