

Subject: Econometrics

Assignment # 01

Multicollinearity in Multiple Regression Models

Submitted to:

Dr. Syed Asim

Submitted by:

Mr. Amir Nadeem


Mr. Mohammad Zainullah

Dated: April 06, 2015

Qurtuba University, Peshawar, KPK



Contents

Theory
    The Nature of Multicollinearity
    Consequences of High Multicollinearity
    So Why Is Multicollinearity a Problem?
    Sources of Multicollinearity
    Detecting Multicollinearity
    Solutions to Multicollinearity

Illustrated Example
    Dataset
    Multiple Linear Regression
    Interpretation of the Output
    Analysis of Variance (Source, Sum of Squares, Degrees of Freedom, Mean Square)
    Overall Model Fit (F-Ratio, Prob Level, R-squared, Root Mean Square Error)
    Parameter Estimates (Regression Coefficient, Standard Error, t-Ratios, p-Values)
    Detecting the Multicollinearity (Graphical Method, Sign Consistency, Variance Inflation Factor, Correlation Matrix, Klein’s Rule)
    Worked Solution: Variable Selection

Conclusion

Theory

The Nature of Multicollinearity

Multicollinearity occurs when two or more predictors/regressors/independent variables in the model are correlated and provide redundant information about the response/dependent variable [Hawking, 1983].

Assumption 10 of the classical linear regression model (CLRM) is that there is no multicollinearity among the regressors included in the regression model.

Ragnar Frisch coined the term multicollinearity, by which he meant the existence of a “perfect,” or exact, linear relationship among some or all explanatory variables in a regression model. If we have k independent variables X1, X2, . . . , Xk, an exact linear relationship is said to exist if the following condition is satisfied:

λ1X1 + λ2X2 +· · ·+λkXk = 0

where λ1, λ2, . . . , λk are constants such that not all of them are simultaneously zero. To accommodate imperfect (less than perfect) correlation, a stochastic error term vi is introduced as follows:

λ1X1 + λ2X2 + · · · + λkXk + vi = 0
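To make the distinction concrete, the following is a minimal sketch with simulated data (not the car data set used later): in Stata, a perfectly collinear regressor is dropped automatically, while a nearly collinear one is retained but estimated with an inflated standard error.

* Simulated illustration of perfect versus near-perfect collinearity
clear
set obs 50
set seed 12345
generate x1 = rnormal()
generate x2 = 2*x1                      // exact relationship: x2 - 2*x1 = 0
generate x3 = 2*x1 + rnormal(0, 0.05)   // near-perfect: a small stochastic term vi added
generate y  = 1 + x1 + rnormal()
regress y x1 x2    // Stata notes that x2 is omitted because of collinearity
regress y x1 x3    // both coefficients are reported, but with large standard errors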

Consequences of high multicollinearity

Multicollinearity is a matter of degree, not of presence or absence. In the presence of multicollinearity, the ordinary least squares (OLS) estimators are imprecisely estimated. If the goal is to understand how the various independent variables affect the dependent variable, then multicollinearity is a big problem.

Some of the consequences are mentioned as under:

1. Increased standard error of estimates of the β’s and thus decreased reliability.
Although BLUE, the OLS estimators have large variances and covariances, making precise
estimation difficult.

2. t-tests on the individual coefficients β1, β2, . . . might suggest that none of the predictors is significantly associated with y, while the F-test might indicate the model as a whole is useful for predicting y, and R2, the overall measure of goodness of fit, can also be very high.

3. If multicollinearity is perfect, the regression coefficients of the X variables are indeterminate and their standard errors are infinite.

4. If multicollinearity is less than perfect, the regression coefficients, although determinate, possess large standard errors, which means the coefficients cannot be estimated with great precision or accuracy.

5. Confidence intervals tend to be much wider, leading to the acceptance of the “zero
null hypothesis” (i.e., the true population coefficient is zero) more readily.

So why is Multicollinearity a problem?

If the goal is simply to predict Y from a set of X variables, then multicollinearity is not a
problem. The predictions will still be accurate, and the overall R2 (or adjusted R2 ) quantifies
how well the model predicts the Y values.

If the goal is to understand how the various X variables impact Y, then multicollinearity is a big problem. One problem is that the individual p-values can be misleading; for example, a p-value can be high even though the variable is important.

The second problem is that the confidence intervals on the regression coefficients will be
very wide. The confidence intervals may even include zero, which means one can’t even be
confident whether an increase in the X value is associated with an increase, or a decrease, in
Y. Because the confidence intervals are so wide, excluding a variable (or adding a new one)
can change the coefficients dramatically and may even change their signs.

Sources of multicollinearity

Montgomery and Peck state that multicollinearity may be due to the following factors:

1. The Sampling method: Sampling over a limited range of the regressors’ values in the
population.

2. Model or population constraints: for example, when regressing electricity consumption on income and house size, there is a physical constraint in the population in that higher-income families generally own larger houses.

3. Model specification, for example, adding polynomial terms to a regression model, especially when the range of the X variable is small.

4. An over-determined model. This happens when the model has more explanatory variables
than the number of observations. This could happen in medical research where there may be a
small number of patients about whom information is collected on a large number of variables.

5. Especially in time series data, the regressors included in the model may share a common trend; that is, they all increase or decrease over time. Thus, in a regression of consumption expenditure on income, wealth, and population, the regressors income, wealth, and population may all grow over time at more or less the same rate, leading to collinearity among these variables.

Detecting multicollinearity

By inspecting the correlation matrix, the variance inflation factors (VIF), and the eigenvalues of the correlation matrix, one can detect the presence of multicollinearity.
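The correlation matrix and VIF checks are demonstrated on the car data later in this report. As a hedged sketch of the eigenvalue check in Stata (with hypothetical regressor names x1, x2, x3), eigenvalues of the regressors' correlation matrix that are close to zero signal near-exact linear dependence among them:

correlate x1 x2 x3
matrix C = r(C)                  // correlation matrix saved by -correlate-
matrix symeigen V lambda = C     // eigenvectors in V, eigenvalues in lambda
matrix list lambda               // very small eigenvalues indicate collinearity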

Solutions to Multicollinearity

• If interest lies only in estimation and prediction, multicollinearity can be ignored.

• If the aim is to establish patterns of association between y and the predictors, then we can try the following rules of thumb to address the problem:

1. A priori information: it could come from previous empirical work in which the collinearity problem happens to be less serious, or from the relevant theory underlying the field of study.

2. Combining cross-sectional and time series data, suggested by Tobin, also known as
pooling the data.

3. Dropping one or more variables, at the risk of introducing specification bias.

4. Transformation of the variables, for example into first-difference form, in which the regression is run not on the original variables but on the differences of successive values of the variables.

5. Additional or new data, sometimes simply increasing the size of the sample may
attenuate the collinearity problem.

6. Reducing collinearity in polynomial regressions: in practice it has been found that if the explanatory variable(s) are expressed in deviation form (i.e., as deviations from their mean values), multicollinearity is substantially reduced (a sketch of this and of first differencing follows this list).

7. Multivariate statistical techniques such as factor analysis and principal components, or techniques such as ridge regression, are often employed to “solve” the problem of multicollinearity.
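A minimal Stata sketch of two of these remedies, using hypothetical variables y, x, and a time variable year (the variable names are illustrative, not from the assignment's data):

* Deviation (mean-centred) form for a polynomial regression (remedy 6)
summarize x, meanonly
generate x_c  = x - r(mean)       // deviation from the mean
generate x_c2 = x_c^2             // centred quadratic term
regress y x_c x_c2

* First-difference form for trending time series data (remedy 4)
tsset year                        // declare the time variable
regress D.y D.x                   // regress differences on differences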

Illustrated Example

Dataset

There are 27 instances (cars) in the data file. The goal is to predict the consumption of cars from various characteristics (price, engine size, horsepower and weight).
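Assuming the data have been saved as a Stata file, a minimal sketch for loading and inspecting them is shown below (the file name cars.dta is hypothetical):

use cars.dta, clear
describe
summarize Consumption Price Cylinder Horsepower Weight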

Multiple linear regression

Command: regress Consumption Price Cylinder Horsepower Weight

In the first instance, we performed a multiple regression analysis using all the explanatory variables (Price, Cylinder, Horsepower, Weight), with Consumption as the dependent variable.

Interpretation of the Output:

The coefficient of determination R2 is 0.9295. But when we consider the coefficients of the model, some results seem strange. Only the weight is significant for explaining the consumption (p-value = 0.009 < 0.05). Its sign is positive, implying that when the weight of the car increases, the consumption also increases; this seems natural and agrees with our domain knowledge. But neither the horsepower nor the engine size (Cylinder) seems to influence the consumption: the p-values are greater than 0.05 in both cases, and the t-statistics are below the 1.96 level. This is unusual, since our domain knowledge about car consumption indicates a direct relationship of horsepower and engine size with consumption. The model would lead us to conclude that two cars with the same weight have similar consumption even if the engine size of the second is three times bigger than that of the first; this does not correspond at all with what we know about cars.

Analysis of Variance

An analysis of variance (ANOVA) table summarizes the information related to the sources of
variation in data.

Source

This represents the partitions of the variation in dependent variable Y (Consumption). There
are four sources of variation listed: intercept, model, error, and total (adjusted for the mean).

Sum of Squares (SS)

These are the sums of squares associated with the corresponding sources of variation
(intercept, model, error, and total).

Degrees of freedom (df)

The degrees of freedom are the number of dimensions associated with each source of variation; each observation can be interpreted as a dimension in n-dimensional space. The degrees of freedom are 1 for the intercept, p for the model (4 in this case), n-p-1 for the error (27-4-1 = 22), and n-1 for the adjusted total.

Mean Square (MS)

The mean square is the sum of squares divided by the degrees of freedom. Each mean square is an estimated variance; for example, the mean square error is the estimated variance of the residuals (errors). In the output, MS (Model) = 30.759606 and MS (Residual) = 0.42402063; unstable variance estimates of this kind can accompany multicollinearity.
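As a quick arithmetic check (not part of the original output), the overall F-ratio discussed in the next section is the ratio of these two mean squares:

F = MS (Model) / MS (Residual) = 30.759606 / 0.42402063 ≈ 72.5

With (4, 22) degrees of freedom this is far beyond any conventional critical value, consistent with the p-value of 0.0000 reported below.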

Overall Model Fit

F-Ratio

This is the F-statistic for testing the null hypothesis that all regression coefficients βj = 0. The F-statistic has p (= 4) degrees of freedom for the numerator variance and n-p-1 (27-4-1 = 22) degrees of freedom for the denominator variance. It is well above the rule-of-thumb level of 4, showing the overall significance of the model.

Prob Level

This is the p-value for the above F-test. If the p-value is less than 0.05, the null hypothesis is rejected; if it is greater than 0.05, the null hypothesis is not rejected. In this case, p = 0.0000 < 0.05, so we reject the null hypothesis that all regression coefficients βj = 0.

R-squared

R2 represents the percentage of variation in the dependent variable explained by the independent variables in the model. If the model fits the data well, the overall R2 value will be high and the corresponding p-value will be low (the p-value is the observed significance level at which the null hypothesis is rejected). Here, although R2 is high, the model is not fitting the data well, as can be seen in the scatter-plot matrix, which indicates collinearity among the independent variables.

Root Mean Square Error

This is the square root of the mean square error. It is an estimate of σ, the standard deviation of the ei’s. The closer its value is to zero, the better.
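As a check derived from the mean square residual reported earlier (not quoted from the original output):

Root MSE = sqrt(MS (Residual)) = sqrt(0.42402063) ≈ 0.651

The Root MSE of 0.63154 quoted in the conclusion refers to the reduced two-variable model, not to this full model.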

Parameter Estimates:

Regression Coefficient

These are the estimated values of the regression coefficients b0, b1, b2, b3, b4. Each value indicates how much change in Y (Consumption) occurs for a one-unit change in the corresponding X (Price, Cylinder, Horsepower, Weight) when the remaining X’s are held constant. These coefficients are also called partial regression coefficients since the effect of the other X’s is removed.

Standard Error

These are the estimated standard errors (precision) of the regression coefficients. These are
the standard deviations of the estimates. In regular regression, we divide the coefficient by
the standard error to obtain a t statistic.

t-ratios:

The t-ratio is the coefficient divided by its standard error. A large t-ratio indicates that we are unlikely to have obtained the estimate by chance or sampling error alone. Price and Horsepower have t-statistics of 0.75 and -0.25, which are undesirably low, while Cylinder (1.67) and Weight (2.87) are larger, so their estimates are less likely to be due to chance or sampling error.

p-values

A p-value of 0.05 implies that a coefficient estimate and t-ratio of this size would occur only 5% of the time if the true value of the coefficient were zero. The p-values for Price (0.460) and Horsepower (0.806) are greater than 0.05, while that for Weight (0.009) is below 0.05. For Cylinder the value is 0.109, which may be due to multicollinearity; otherwise it would be expected to be a significant variable, as our domain knowledge of cars suggests.

Detecting the multicollinearity

Graphical method

We can create a scatter plot matrix of independent variables Weight, Horsepower, Cylinder,
Price and the dependent variable Consumption.

Command: graph matrix Consumption Weight Horsepower Cylinder Price

We suspect multicollinearity in the model. We know, for instance, that the engine size (Cylinder) and the Horsepower are often highly correlated. The model is very unstable; a small change in the dataset (removing or adding instances) causes a large modification of the estimated parameters. The signs and the values of some coefficients (e.g., -0.0037419 for Horsepower) are inconsistent with our knowledge about cars: it appears here that the horsepower has a negative influence on the consumption, which we know cannot be true. In short, we have an excellent model (according to the R2) that is nevertheless unusable, because we cannot draw a meaningful interpretation from the coefficients. It is impossible to understand the causal mechanism of the phenomenon studied.

Sign consistency

We check whether the sign of each coefficient is consistent with the sign of the correlation between the corresponding explanatory variable and the target variable (computed individually). Horsepower has a negative coefficient, which would mean that consumption and horsepower are negatively related; this is not the case in reality, indicating that something is interfering in the association.

The individual correlation is positive, but the sign of the coefficient in the regression is negative. Another variable probably interferes with Horsepower.
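A minimal Stata sketch of this sign check is given below; the point is to compare the sign of each simple correlation with the sign of the corresponding regression coefficient:

regress Consumption Price Cylinder Horsepower Weight
pwcorr Consumption Price Cylinder Horsepower Weight
* A positive simple correlation paired with a negative regression
* coefficient (as for Horsepower here) points to interference from
* a collinear regressor.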

Variance Inflation Factor (VIF)

VIF is a measure of multicollinearity; the term was coined by Marquardt (1970). It is the reciprocal of 1 - Rx2, where Rx2 is the R2 obtained when one independent variable (Price, Cylinder, Horsepower, or Weight) is regressed on the remaining independent variables. A VIF of 10 or more for large data sets indicates a multicollinearity problem, since the Rx2 with the remaining X’s is then at least 90 percent. For small data sets, even VIFs of 5 or more can signify multicollinearity. A high Rx2 indicates a lot of overlap with the remaining independent variables in explaining the variation. In our output, all the independent variables have VIF > 10, indicating collinearity among the regressors.

Once the multicollinearity has been remedied, these values will all be less than 10.
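In Stata, the VIFs are obtained with the estat vif postestimation command after the regression; a minimal sketch (output values not reproduced here):

regress Consumption Price Cylinder Horsepower Weight
estat vif    // reports the VIF and 1/VIF (tolerance) for each regressor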

Correlation Matrix

The correlation coefficients show which independent variables are highly correlated with the dependent variable and with each other. The independent variables, especially Horsepower and Cylinder, are highly correlated with one another (0.9559, as shown in the correlation matrix), and these are causing the multicollinearity problems.
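A minimal sketch of obtaining this correlation matrix in Stata (pwcorr with the sig option would also report significance levels):

correlate Consumption Price Cylinder Horsepower Weight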

Klein’s rule

We can also compute the squared correlation for each pair of explanatory variables. If one or more of these values is higher than (or at least near) the coefficient of determination (R2) of the regression, there is probably a multicollinearity problem. The advantage here is that we can identify the variables which are redundant in the regression.
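Using the figures already reported, the rule can be checked directly: squaring the correlation between Horsepower and Cylinder gives

0.9559^2 ≈ 0.914

which is close to the model’s coefficient of determination of 0.9295, so this pair is flagged as a likely source of collinearity.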

All the symptoms we studied (sign consistency, VIF, the correlation matrix, and Klein’s rule) suggest that there is a problem of collinearity in our study. We must adopt an appropriate strategy if we want to obtain usable results.

Worked Solution:

Variable selection

This process helps to identify the relevant variables and gives an interpretable result. In the context of a multicollinearity problem, it lets us in particular remove redundant variables which interfere in the regression. We used a forward search: at each step, we selected the most relevant explanatory variable according to the absolute value of its correlation coefficient (a Stata sketch of a comparable selection follows the list of observations below).

We observed that:

• The selected explanatory variables are Weight and Cylinder (engine size). They are very significant, whereas Price is insignificant in relation to consumption.

• Compared to the initial regression, despite the elimination of one variable (Horsepower), the proportion of explained variance remains very good, with a coefficient of determination of R2 = 0.9293 (R2 = 0.9295 for the model with 4 variables).

• Weight and Cylinder both have a positive influence on the consumption, i.e. when the weight (or the engine size) increases, the consumption also increases. This is consistent with the domain knowledge.

• The signs of the coefficients agree with the signs of the correlation coefficients computed individually.

• A variable is added to the regression when its p-value is lower than the significance level.
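A minimal Stata sketch of a comparable forward selection is shown below. Note that Stata’s stepwise prefix enters variables by p-value rather than by the absolute-correlation criterion used above, so this is an approximation of the procedure rather than a reproduction of it:

* Forward selection: a variable enters when its p-value is below 0.05
stepwise, pe(.05): regress Consumption Weight Cylinder Horsepower Price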

The VIF values for Weight and Cylinder are 9.68 and 6.42 respectively, which are below the threshold of 10, while the Price variable is insignificant, given its p-value of 0.560 > 0.05.
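A minimal sketch of refitting the final two-variable model discussed in the conclusion and re-checking its VIFs:

regress Consumption Weight Cylinder
estat vif    // both VIFs are expected to fall below 10 once the redundant variables are dropped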

Conclusion

After dropping the two variables Price and Horsepower, multicollinearity fell to an acceptable level. The final model has F > 4, an overall p-value < 0.05, a highly desirable R2 of 0.9277 (about 93% of the variation in Consumption can be explained by Cylinder and Weight), and a Root MSE of 0.63154, which is close to 0. The regression coefficients are positive, the t-ratios of 3.42 and 5.81 both exceed the 1.96 threshold, and the individual p-values of 0.002 and 0.000 are both below 0.05, showing statistical significance and leading us to reject the null hypothesis that the regression coefficients are equal to zero.
