
Multicollinearity:
What Happens if the Regressors Are Correlated?
What is the nature of multicollinearity?
Is multicollinearity really a problem?
What are its practical consequences?
How does one detect it?
What remedial measures can be taken to alleviate the problem of multicollinearity?
The Nature of Multicollinearity
The existence of a perfect, or exact, linear relationship among some or all explanatory variables of a regression model.
An exact linear relationship is said to exist if
the following condition is satisfied:
λ1X1 + λ2X2 + ... + λkXk = 0          (1)
where X1, X2, ..., Xk are the explanatory variables, with X1 = 1 for all observations to allow for the intercept term, and λ1, λ2, ..., λk are constants that are not all zero simultaneously.
The X variables may also be intercorrelated, but not perfectly:
λ1X1 + λ2X2 + ... + λkXk + vi = 0          (2)

where vi is a stochastic error term.
The difference between perfect and less-than-perfect multicollinearity:
Assume λ2 ≠ 0. Then (1) can be written as

X2i = −(λ1/λ2)X1i − (λ3/λ2)X3i − ... − (λk/λ2)Xki          (3)

and (2) can be written as

X2i = −(λ1/λ2)X1i − (λ3/λ2)X3i − ... − (λk/λ2)Xki − (1/λ2)vi          (4)
A numerical example:

X2    X3     X3*
10    50     52
15    75     75
18    90     97
24    120    129
30    150    152

Here X3i = 5X2i exactly, so X2 and X3 are perfectly collinear. X3* is X3 plus a small random number, so the collinearity between X2 and X3* is high but not perfect.
Multicollinearity refers only to linear relationships among the X variables; it does not rule out nonlinear relationships among them.
For example:

Yi = β0 + β1Xi + β2Xi² + β3Xi³ + ui          (5)

where Y = total cost of production and X = output. The variables Xi² and Xi³ are obviously related to Xi, but the relationship is nonlinear, so model (5) does not violate the assumption of no multicollinearity.
If multicollinearity is perfect, the regression coefficients of the X variables are indeterminate and their standard errors are infinite.
If multicollinearity is less than perfect, the regression coefficients, although determinate, possess large standard errors (relative to the coefficients themselves), which means the coefficients cannot be estimated with great precision or accuracy.
Assume X3i = λX2i, where λ is a nonzero constant. Write the regression model in deviation form together with the OLS estimator of β2:
yi = β̂2 x2i + β̂3 x3i + ûi          (6)

β̂2 = [Σyi x2i Σx3i² − Σyi x3i Σx2i x3i] / [Σx2i² Σx3i² − (Σx2i x3i)²]          (7)

Substituting x3i = λx2i into (7):

β̂2 = [Σyi x2i λ²Σx2i² − λΣyi x2i λΣx2i²] / [Σx2i² λ²Σx2i² − λ²(Σx2i²)²] = 0/0          (8)

β̂2 is indeterminate, and β̂3 is also indeterminate.
Recall the meaning of β2: the rate of change in the average value of Y as X2 changes by a unit, holding X3 constant. But if X3 and X2 are perfectly collinear, there is no way X3 can be kept constant: as X2 changes, so does X3, by the factor λ.
There is then no way of disentangling the separate influences of X2 and X3 from the given sample; for practical purposes X2 and X3 are indistinguishable. This is most damaging, since the entire intent is to separate the partial effect of each X upon the dependent variable.
To see this differently, substitute X3i = λX2i into (6):

yi = β̂2 x2i + β̂3(λx2i) + ûi
   = (β̂2 + λβ̂3)x2i + ûi
   = α̂ x2i + ûi          (10)

where

α̂ = (β̂2 + λβ̂3)          (11)

Applying the usual OLS formula, we get

α̂ = Σx2i yi / Σx2i²          (12)

Although we can estimate α uniquely, there is no way to estimate β2 and β3 uniquely. Let α̂ = 0.8 and λ = 2:

0.8 = β̂2 + 2β̂3          (13)
β̂2 = 0.8 − 2β̂3          (14)
The case of perfect multicollinearity:
1. A unique solution exists only for the linear combination (β2 + λβ3), which is estimated by α̂ for a given value of λ.
2. The variances and standard errors of β̂2 and β̂3 individually are infinite.
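The indeterminacy can be seen numerically. Below is a minimal NumPy sketch with invented data (not an example from the text), taking λ = 2: because X3 is an exact multiple of X2, X′X is singular and only the combination β2 + 2β3 is identified.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50

# Invented data: x3 is an exact multiple of x2 (perfect collinearity, lambda = 2).
x2 = rng.normal(size=n)
x3 = 2.0 * x2
y = 1.0 + 0.8 * x2 + rng.normal(scale=0.5, size=n)

X = np.column_stack([np.ones(n), x2, x3])

# X'X is singular, so the normal equations have no unique solution.
print("rank of X:", np.linalg.matrix_rank(X))   # 2 instead of 3
print("det(X'X):", np.linalg.det(X.T @ X))      # ~0

# lstsq returns one (minimum-norm) solution out of infinitely many;
# only the combination beta2 + 2*beta3 is pinned down by the data.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print("one admissible solution:", np.round(beta, 3))
print("identified combination beta2 + 2*beta3:", round(beta[1] + 2 * beta[2], 3))
```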
Estimation in the Presence of High but Imperfect Multicollinearity
Instead of exact multicollinearity, we may have

x3i = λx2i + vi          (15)

where λ ≠ 0 and vi is a stochastic error term. In this case estimation of the regression coefficients β2 and β3 may be possible.
β̂2 = [Σyi x2i (λ²Σx2i² + Σvi²) − (λΣyi x2i + Σyi vi) λΣx2i²] / [Σx2i² (λ²Σx2i² + Σvi²) − (λΣx2i²)²]          (16)

where use is made of Σx2i vi = 0.
There is no reason to believe a priori that β2 and β3 cannot be estimated. However, if vi is sufficiently small, very close to zero, (15) will indicate almost perfect collinearity and we shall be back to the indeterminate case of (8).
Theoretical Consequences of
Multicollinearity
Multicollinearity violates no regression assumption. Even if multicollinearity is very high, the OLS estimators still retain the BLUE property.
The estimates remain unbiased and consistent, and their standard errors are correctly estimated.
It is, however, hard to obtain coefficient estimates with small standard errors.
Collinearity does not destroy the minimum-variance property: the OLS estimators are still efficient.
Having only a small number of observations has the same effect. What should one do with few observations? No statistical answer can be given; multicollinearity is essentially a sample phenomenon.
The consumption-income example

Consi = β1 + β2 Incomei + β3 Wealthi + ui

Besides income, the wealth of the consumer is also an important determinant of consumption expenditure.
In sample data, however, income and wealth tend to be highly, if not perfectly, correlated, since wealthier people generally have higher incomes.
The practical consequences of
multicollinearity
Although BLUE, the OLS estimators have large variances and covariances, making precise estimation difficult. The confidence intervals therefore tend to be much wider, leading more readily to acceptance of the "zero null hypothesis," that is, the hypothesis that the true population coefficient is zero.
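As an illustration, the sketch below simulates hypothetical consumption, income, and wealth data (the numbers are invented, not the text's data) and fits the model with statsmodels. With near-perfect collinearity between income and wealth one typically sees a high R² and a significant F statistic, yet large standard errors, wide confidence intervals, and insignificant individual t values.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 30

# Invented data: wealth tracks income very closely, so the two regressors
# are highly, though not perfectly, collinear.
income = rng.normal(100, 10, size=n)
wealth = 10 * income + rng.normal(0, 5, size=n)
cons = 20 + 0.6 * income + 0.01 * wealth + rng.normal(0, 3, size=n)

X = sm.add_constant(np.column_stack([income, wealth]))
res = sm.OLS(cons, X).fit()

# Typically: high R^2 and a significant F statistic, yet large standard
# errors and wide confidence intervals on income and wealth individually.
print("R^2:", round(res.rsquared, 3), " F p-value:", round(res.f_pvalue, 4))
print("std errors:", np.round(res.bse, 3))
print("t p-values:", np.round(res.pvalues, 3))
print("95% confidence intervals:\n", np.round(res.conf_int(), 3))
```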
Large Variances and Covariances
of OLS Estimators
var(β̂2) = σ² / [Σx2i² (1 − r23²)]          (17)

var(β̂3) = σ² / [Σx3i² (1 − r23²)]          (18)

cov(β̂2, β̂3) = −r23 σ² / [(1 − r23²) √(Σx2i²) √(Σx3i²)]          (19)

where r23 is the coefficient of correlation between X2 and X3.
Wider Confidence Interval
Because of the large standard errors, the confidence intervals for the relevant population parameters tend to be wider.
In cases of high multicollinearity, the sample data may be compatible with a diverse set of hypotheses. Hence, the probability of accepting a false hypothesis (a Type II error) increases.

r23        95% confidence interval for β2

0.00       β̂2 ± 1.96 √(σ²/Σx2i²)
0.50       β̂2 ± 1.96 √(1.33 × σ²/Σx2i²)
0.95       β̂2 ± 1.96 √(10.26 × σ²/Σx2i²)
0.995      β̂2 ± 1.96 √(100 × σ²/Σx2i²)
0.999      β̂2 ± 1.96 √(500 × σ²/Σx2i²)
The signals of multicollinearity

A high overall R² (in excess of 0.8): the F test will in most cases reject H0 (a significant F), yet the individual t tests show that none, or very few, of the partial slope coefficients are statistically different from zero.
H0: β2 = 0          t = β̂2 / SE(β̂2)

The estimated standard errors increase dramatically, making the t values smaller. In such cases, therefore, one will increasingly accept the null hypothesis that the relevant true population value is zero.
High pairwise correlations among regressors, in excess of 0.8.
In models involving more than two explanatory variables, the simple (zero-order) correlations do not provide an infallible guide to the presence of multicollinearity; only when there are just two explanatory variables do the zero-order correlations suffice.
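A quick way to inspect the zero-order correlations is to print the correlation matrix of the regressors. The sketch below uses invented data; a pairwise value above roughly 0.8 is a warning sign, subject to the caveat above.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50

# Invented regressors stacked as columns (constant omitted).
x2 = rng.normal(size=n)
x3 = 0.95 * x2 + rng.normal(scale=0.3, size=n)
x4 = rng.normal(size=n)
X = np.column_stack([x2, x3, x4])

corr = np.corrcoef(X, rowvar=False)   # zero-order (pairwise) correlations
print(np.round(corr, 2))
# A pairwise value above roughly 0.8 is a warning sign, but with more than
# two regressors a low pairwise correlation does not rule out collinearity.
```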
Examination of partial correlations
If R²1.234 is very high but r²12.34, r²13.24, and r²14.23 are comparatively low, this may suggest that the variables X2, X3, and X4 are highly intercorrelated and that at least one of them is superfluous (as suggested by Farrar and Glauber).
Although a study of the partial correlations may be useful, there is no guarantee that it will provide an infallible guide to multicollinearity; it may happen that both R² and all the partial correlations are sufficiently high.
The Farrar–Glauber partial-correlation test has been severely criticized by C. Robert Wichers, T. Krishna Kumar, and John O'Hagan and Brendan McCabe, who argue that it is ineffective.
Auxiliary regressions (Ri²)
Regress each Xi on the remaining X variables and compute the corresponding R², designated Ri². The associated F statistic, Fi, tests whether Xi is linearly related to the other X's.
Even if Fi is statistically significant, we still have to decide whether the particular Xi should be dropped from the model.
Klein's rule of thumb: multicollinearity may be a troublesome problem only if an Ri² obtained from an auxiliary regression exceeds the overall R², as illustrated in the sketch below.
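A sketch of the auxiliary-regression approach with Klein's rule, using statsmodels on invented data (the helper name auxiliary_r2 is mine, not a standard function):

```python
import numpy as np
import statsmodels.api as sm

def auxiliary_r2(X):
    """Regress each column of X on all the other columns; return the R^2_i values."""
    k = X.shape[1]
    r2 = {}
    for i in range(k):
        others = [j for j in range(k) if j != i]
        aux = sm.OLS(X[:, i], sm.add_constant(X[:, others])).fit()
        r2[i] = aux.rsquared
    return r2

# Invented data: X2 and X3 are strongly related, X4 is not.
rng = np.random.default_rng(1)
n = 40
x2 = rng.normal(size=n)
x3 = 0.9 * x2 + rng.normal(scale=0.3, size=n)
x4 = rng.normal(size=n)
X = np.column_stack([x2, x3, x4])
y = 1 + x2 + x3 + x4 + rng.normal(size=n)

overall_r2 = sm.OLS(y, sm.add_constant(X)).fit().rsquared
print("overall R^2:", round(overall_r2, 3))
for i, r2 in auxiliary_r2(X).items():
    flag = "  <- troublesome by Klein's rule" if r2 > overall_r2 else ""
    print(f"auxiliary R^2 of X{i + 2} on the other X's: {r2:.3f}{flag}")
```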
Tolerance (TOL) and Variance Inflation
Factor (VIF)
If r23 tends toward 1, the collinearity increases,
the variances of the two estimators increase,
and in the limit when r23 = 1, they are infinite.
And the covariance of the two estimators also
increases in absolute value
The speed at which the variances and covariance increase can be seen with the variance-inflating factor (VIF); the inverse of the VIF is called the tolerance (TOL).
VIF = 1 / (1 − r23²)          (20)

var(β̂2) = (σ² / Σx2i²) VIF          (21)

var(β̂3) = (σ² / Σx3i²) VIF          (22)

var(β̂j) = (σ² / Σxj²) VIFj          (23)

TOLj = 1 / VIFj          (24)
r23      VIF       TOL
0.00     1.00      1.00
0.50     1.33      0.75
0.70     1.96      0.51
0.80     2.78      0.36
0.90     5.26      0.19
0.95     10.26     0.10
0.995    100.25    0.01
0.999    500.25    0.00
If the VIF of a variable exceeds 10, the variable is said to be highly collinear.
Equivalently, if the TOL of a variable is less than 0.10, the variable is said to be highly collinear.
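statsmodels provides variance_inflation_factor for this computation. A small sketch on invented data follows; it is usual to include the constant column in the design matrix passed to the function so that the auxiliary regressions have an intercept.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Invented design: X3 is strongly correlated with X2, X4 is not.
rng = np.random.default_rng(7)
n = 100
x2 = rng.normal(size=n)
x3 = x2 + rng.normal(scale=0.2, size=n)
x4 = rng.normal(size=n)

X = sm.add_constant(np.column_stack([x2, x3, x4]))   # constant in column 0

for j, name in enumerate(["X2", "X3", "X4"], start=1):
    vif = variance_inflation_factor(X, j)
    print(f"{name}: VIF = {vif:7.2f}   TOL = {1.0 / vif:.3f}")
# Rule of thumb from the text: VIF > 10 (equivalently TOL < 0.10) flags a
# highly collinear variable.
```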
Eigenvalues and condition index
The condition number k, computed from the eigenvalues of X′X:

k = Maximum eigenvalue / Minimum eigenvalue

If k is between 100 and 1000 there is moderate to strong multicollinearity, and if it exceeds 1000 there is severe multicollinearity.
The condition index (CI):

CI = √(Maximum eigenvalue / Minimum eigenvalue) = √k

If CI is between 10 and 30 there is moderate to strong multicollinearity, and if it exceeds 30 there is severe multicollinearity.
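A sketch of the computation with NumPy on invented data; conventions differ on whether the columns of X are scaled first, so treat the thresholds as rough guides.

```python
import numpy as np

# Invented regressors: the constant, X2, and X3 with X3 nearly equal to X2.
rng = np.random.default_rng(3)
n = 60
x2 = rng.normal(size=n)
x3 = x2 + rng.normal(scale=0.05, size=n)
X = np.column_stack([np.ones(n), x2, x3])

eigvals = np.linalg.eigvalsh(X.T @ X)       # eigenvalues of X'X
k = eigvals.max() / eigvals.min()           # condition number
ci = np.sqrt(k)                             # condition index

print(f"condition number k  = {k:,.1f}")
print(f"condition index  CI = {ci:,.1f}")
# Rough guide from the text: 100 <= k <= 1000 (10 <= CI <= 30) suggests
# moderate to strong multicollinearity; larger values suggest severe
# multicollinearity.
```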
Remedial Measures
Do nothing
Blanchard: multicollinearity is essentially a data-deficiency problem (micronumerosity), and sometimes we have no choice over the data available for empirical analysis.
Goldberger's analysis: micronumerosity (too small a sample size) is just as important a problem as multicollinearity.
A priori information
Yi = β1 + β2X2i + β3X3i + ui          (25)

where Y = consumption, X2 = income, and X3 = wealth.
Suppose a priori we believe that β3 = 0.10β2; that is, the rate of change of consumption with respect to wealth is one-tenth the corresponding rate with respect to income.
We can then run the following regression:

Yi = β1 + β2X2i + 0.10β2X3i + ui
   = β1 + β2Xi + ui          (26)

where Xi = X2i + 0.1X3i. Once we obtain β̂2, we can estimate β̂3 from the postulated relationship between β2 and β3.
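A sketch of imposing this kind of restriction in code, using invented data generated so that β3 = 0.10β2 actually holds:

```python
import numpy as np
import statsmodels.api as sm

# Invented data in which the prior restriction beta3 = 0.10 * beta2 holds
# exactly (beta2 = 0.7, beta3 = 0.07), with income and wealth highly collinear.
rng = np.random.default_rng(5)
n = 40
income = rng.normal(100, 15, size=n)                 # X2
wealth = 10 * income + rng.normal(0, 20, size=n)     # X3
cons = 10 + 0.7 * income + 0.07 * wealth + rng.normal(0, 5, size=n)

# Impose the restriction: regress consumption on the single combined
# regressor X = X2 + 0.1 * X3 instead of on X2 and X3 separately.
X_combined = sm.add_constant(income + 0.1 * wealth)
res = sm.OLS(cons, X_combined).fit()

beta2_hat = res.params[1]         # estimate of beta2
beta3_hat = 0.1 * beta2_hat       # beta3 recovered from the a priori restriction
print("beta2_hat:", round(beta2_hat, 3),
      "  beta3_hat (= 0.1 * beta2_hat):", round(beta3_hat, 3))
```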
Another example: a Cobb–Douglas-type production function. If one expects constant returns to scale to prevail, then (β2 + β3) = 1, in which case we can run the regression of the output–labor ratio on the capital–labor ratio. If there is collinearity between capital and labor, as is generally the case in most sample data, such a transformation may reduce or eliminate the collinearity problem.
A warning about a priori restrictions: in general we want to test economic theory's a priori predictions rather than simply impose them on data for which they may not hold.
Combining cross-sectional and time
series data
In time series data, the price and income variables generally tend to be highly collinear.
Tobin suggested pooling the time series and cross-sectional data; in this way we can obtain fairly reliable estimates.
Pooling time series and cross-sectional data may, however, create problems of interpretation.
Dropping a variable(s) and specification
bias
When faced with severe multicollinearity, one of the simplest things to do is to drop one of the collinear variables.
However, in dropping a variable from the model we may be committing a specification bias, which arises from an incorrect specification of the model used in the analysis.
Transformation of variables
Yt = β1 + β2X2t + β3X3t + ut          (27)
Yt−1 = β1 + β2X2,t−1 + β3X3,t−1 + ut−1          (28)

Subtracting (28) from (27) gives the first-difference form:

Yt − Yt−1 = β2(X2t − X2,t−1) + β3(X3t − X3,t−1) + vt          (29)

where vt = ut − ut−1.

Another transformation is the ratio transformation: dividing (27) through by X3t gives

Yt/X3t = β1(1/X3t) + β2(X2t/X3t) + β3 + ut/X3t          (30)
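A small NumPy sketch of both transformations on invented series (six observations, purely illustrative):

```python
import numpy as np

# Invented annual series for Y, X2 and X3 (six observations).
Y  = np.array([100.0, 106.0, 111.0, 118.0, 121.0, 128.0])
X2 = np.array([50.0, 53.0, 55.0, 59.0, 60.0, 65.0])
X3 = np.array([500.0, 510.0, 524.0, 533.0, 546.0, 558.0])

# First-difference transformation: one observation is lost, and differencing
# typically weakens the collinearity between two trending series.
dX2, dX3 = np.diff(X2), np.diff(X3)
print("corr(X2, X3) in levels:      ", round(np.corrcoef(X2, X3)[0, 1], 3))
print("corr(dX2, dX3) in differences:", round(np.corrcoef(dX2, dX3)[0, 1], 3))

# Ratio transformation: divide every term of (27) through by X3t, then
# regress Y/X3 on 1/X3 and X2/X3; beta3 becomes the intercept of the
# transformed model.
Y_ratio, inv_X3, X2_ratio = Y / X3, 1.0 / X3, X2 / X3
print("Y/X3: ", np.round(Y_ratio, 4))
print("1/X3: ", np.round(inv_X3, 4))
print("X2/X3:", np.round(X2_ratio, 4))
```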
The first-difference and ratio transformations are not without problems. The error term vt in (29) may not satisfy one of the assumptions of the classical linear regression model, namely that the disturbances are serially uncorrelated: if the original disturbance term ut is serially uncorrelated, the error term vt obtained from differencing will in most cases be serially correlated.
There is also a loss of one observation due to the differencing procedure, so the degrees of freedom are reduced by one. In a small sample, this is a factor one would wish to take into consideration.
The first-differencing procedure may not be appropriate in cross-sectional data, where there is no logical ordering of the observations.
In the ratio model, the error term (ut/X3t) will be heteroscedastic even when the original error term ut is homoscedastic.
Again, the remedy may be worse than the disease of collinearity.
Additional or new data
Sometimes simply increasing the size of the sample may attenuate the collinearity problem: the variance of β̂i will decrease, thus decreasing its standard error and enabling us to estimate βi more precisely.
Reducing multicollinearity in polynomial regressions
In polynomial regression models, when the explanatory variable(s) are expressed in deviation form (as deviations from the mean value), multicollinearity is substantially reduced.
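A quick NumPy check of this effect on an invented output grid:

```python
import numpy as np

# Invented output grid for a cost function with X, X^2 and X^3 terms.
X = np.linspace(1.0, 10.0, 25)

# Raw powers of X are strongly intercorrelated ...
print("corr(X,   X^2):", round(np.corrcoef(X, X**2)[0, 1], 3))
print("corr(X^2, X^3):", round(np.corrcoef(X**2, X**3)[0, 1], 3))

# ... expressing X in deviation (mean-centred) form before taking powers
# reduces the correlations substantially.
x = X - X.mean()
print("corr(x,   x^2):", round(np.corrcoef(x, x**2)[0, 1], 3))
print("corr(x^2, x^3):", round(np.corrcoef(x**2, x**3)[0, 1], 3))
# If the problem persists, an orthogonal polynomial basis (e.g. Legendre or
# Chebyshev polynomials) can be used instead of raw powers.
```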
But even then the problem may persist, in which case one may want to consider techniques such as orthogonal polynomials.
Other methods of remedying multicollinearity
Multivariate statistical techniques such as factor analysis, principal components, or ridge regression are often employed to solve the problem of multicollinearity.
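As one illustration, the sketch below implements a bare-bones ridge estimator with NumPy on invented, mean-centred data (real implementations typically standardise the regressors and handle the intercept separately); it is a sketch of the idea, not a recommended production routine.

```python
import numpy as np

# Invented, mean-centred data with two almost identical regressors.
rng = np.random.default_rng(11)
n = 50
x2 = rng.normal(size=n)
x3 = x2 + rng.normal(scale=0.05, size=n)
y = 0.5 * x2 + 0.5 * x3 + rng.normal(scale=0.3, size=n)

Xc = np.column_stack([x2 - x2.mean(), x3 - x3.mean()])
yc = y - y.mean()

def ridge(X, y, k):
    """Ridge estimator (X'X + k I)^(-1) X'y; k = 0 reproduces OLS."""
    return np.linalg.solve(X.T @ X + k * np.eye(X.shape[1]), X.T @ y)

for k in (0.0, 0.1, 1.0):
    print(f"k = {k:>4}:", np.round(ridge(Xc, yc, k), 3))
# A small positive k shrinks and stabilises the estimates of the two
# collinear coefficients, at the cost of introducing some bias.
```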
