
Multicollinearity

Assumption of OLS: the X variables are independent; that is, they are not correlated with each other.


Meaning of multicollinearity: the presence of a perfect (or less than perfect) linear relationship among some or all of the Xs.

Degrees of multicollinearity:
1. None
2. Low
3. Moderate
4. High
5. Very high
In practice one rarely encounters perfect multicollinearity.

But cases of near or very high multicollinearity arise in many applications.

Thus, multicollinearity is a question of degree.

Note that a nonlinear relationship between variables does not imply multicollinearity.


What if there is multicollinearity?

The separate influence of the Xs on Y cannot be assessed.


Example:
Consumption = f (Income, Wealth)

If income (I) and wealth (W) have an exact linear relationship (i.e. if they move exactly together), there is no way to assess their separate influence on consumption.
Intuitive Reasoning

We know that in multiple regression, $B_2$ measures the change in the mean value of Y per unit change in $X_2$, holding the value of $X_3$ constant.

In these circumstances, if $X_2$ and $X_3$ are perfectly collinear, there is no way $X_3$ can be kept constant.

As $X_2$ changes, so does $X_3$.
Sources of multicollinearity


(a) Model constraints or specification:

Electricity consumption = f (Income, House size)

Both Xs are important for the model.

But they are related, because families with high incomes generally have larger homes than otherwise.
(b) Overdetermined model:

When the model has more Xs than the number of observations.

Example:
Information collected on a large number of variables from a small number of patients.

(c) Data generation:

Most economic data are not obtained in controlled laboratory experiments.

Example: GDP, stock prices, profits.

If data could be obtained experimentally, we would not allow collinearity to come up.

Thus, multicollinearity is a sample phenomenon arising out of the non-experimental data generated in the social sciences.
Example

X1    X2    X3
10    50    52
15    75    75
18    90    97
24   120   129
30   150   152

Since $X_2 = 5X_1$, there is perfect multicollinearity ($r_{12} = 1$).

$r_{23} = 0.995$ (highly correlated, but not perfect!).

With 2 X variables, use simple correlation; with more than 2, partial correlation. Here $r_{12}$ denotes a simple correlation.
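To make the example concrete, here is a minimal check of the two correlations quoted above using numpy; the data are simply the five observations from the table:

```python
import numpy as np

X1 = np.array([10, 15, 18, 24, 30])
X2 = np.array([50, 75, 90, 120, 150])   # X2 = 5 * X1 exactly
X3 = np.array([52, 75, 97, 129, 152])

# r12 = 1 because X2 is an exact linear function of X1 (perfect collinearity)
print(np.corrcoef(X1, X2)[0, 1])   # 1.0
# r23 is very close to, but below, 1 (near-perfect collinearity)
print(np.corrcoef(X2, X3)[0, 1])   # approximately 0.995
```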
Is the multicollinearity problem most damaging?

No, provided it is imperfect in nature.

So long as collinearity is not perfect, OLS estimators still retain the BLUE property.

This is because imperfect multicollinearity per se does not violate the other assumptions of the OLS method.

In any case, perfect collinearity is an extreme case.
Then, why all the fuss?
Because multicollinearity has some practical consequences.
(a) Theory vs. practice:

Theoretically the inclusion of two Xs might be warranted. But it may create practical problems.

Example:
Consumption = f (Income, Wealth)
Although income and wealth are logical candidates to explain consumption, they might be correlated: wealthier people tend to have higher incomes.

The ideal solution here would be to have sample observations of both wealthy individuals with low income and high-income individuals with low wealth.

(b) High standard errors (SE):
In the presence of multicollinearity, the SEs and variances of the OLS estimators become large.

If the SE of an estimator increases, it becomes more difficult to estimate the true value of the parameter.

(c) Wider confidence intervals:
Because of the large SEs.

Hence, the probability of accepting a false hypothesis such as $H_0\!: B_2 = 0$ (a Type II error) increases.

(d) Insignificant t ratios:
For the purpose of hypothesis testing, we use the following t ratio:

$$t = \frac{b_2 - B_2}{se(b_2)}$$

Here, if the SE is large due to multicollinearity, t values will be smaller.

As such, one will increasingly accept $H_0\!: B_2 = 0$.
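As an illustration of consequences (b)-(d), the following sketch (not from the original text; variable names and seed are arbitrary) simulates two nearly collinear regressors and fits OLS with statsmodels: the slope SEs are huge and the t ratios tiny, even though the overall fit is good:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 50
x2 = rng.normal(size=n)
x3 = x2 + rng.normal(scale=0.01, size=n)   # x3 almost equal to x2: near-perfect collinearity
y = 1 + 2 * x2 + 3 * x3 + rng.normal(size=n)

X = sm.add_constant(np.column_stack([x2, x3]))
res = sm.OLS(y, X).fit()
print(res.bse)       # standard errors: very large for the x2 and x3 slopes
print(res.tvalues)   # correspondingly small t ratios
print(res.rsquared)  # yet R^2 stays high -- symptom (e) below
```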
(e) High $R^2$ and F ratio but few significant t ratios:

One or more b's are individually statistically insignificant on the basis of the t test.

Yet $R^2$ may be very high, and the F test can reject $H_0\!: b_2 = b_3 = \dots = b_k = 0$.

This is indeed a sign of the presence of multicollinearity.
(f) Sensitivity of results:

In the presence of multicollinearity, OLS estimators and their SEs become very sensitive to small changes in the data.

(g) Wrong signs:

Regression coefficients may take on wrong/unexpected signs.
Detection of multicollinearity
Some thumb rules

(1) High $R^2$ and F ratio but few significant t ratios:
One or more b's are individually statistically insignificant on the basis of the t test.

Yet $R^2$ may be very high, and the F test can reject $H_0\!: b_2 = b_3 = \dots = b_k = 0$.

This is the most commonly used detection technique. It serves as preliminary evidence, which can be confirmed with the other techniques.

(2) Zero-order (or pair-wise, or simple) correlation coefficient (r):

r measures the degree of linear association between two variables, say X and Y.

r can be computed in two ways:

$$r = \sqrt{r^2} \quad \text{(or)} \quad r = \frac{\sum x_i y_i}{\sqrt{\sum x_i^2 \, \sum y_i^2}}$$

where $x_i$ and $y_i$ are deviations from the respective means.

If r between two Xs is high, say in excess of 0.8, then multicollinearity can be a serious problem.

However, if there are more than two explanatory variables in the model, the partial correlation coefficient will provide a more accurate assessment of the presence or absence of multicollinearity.
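A small sketch of the deviation-form formula above (the helper name is illustrative; numpy's corrcoef gives the same answer):

```python
import numpy as np

def simple_r(x, y):
    # xd, yd are the deviations of the observations from their means,
    # i.e. the x_i and y_i in the formula above
    xd = x - x.mean()
    yd = y - y.mean()
    return (xd * yd).sum() / np.sqrt((xd ** 2).sum() * (yd ** 2).sum())

# e.g. on the table data used earlier, simple_r(X2, X3) is about 0.995
```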


(3) High partial correlation:

In general, r is not likely to reflect the true degree of association between Y and X in the presence of another variable, say $X_1$.

What we need is a correlation coefficient that is independent of the influence, if any, of $X_1$ on X and Y.

Such a correlation coefficient is known as the partial correlation coefficient.

Partial correlation represents the correlation between two variables holding another variable constant.

Conceptually it is similar to the partial regression coefficient.

Example: $r_{12.3}$ represents the correlation between variables 1 (say Y) and 2 (say X), holding a third variable ($X_1$) constant.


Consider 3 Xs ($X_2$, $X_3$ and $X_4$) in a regression model.

Under zero-order correlation, $X_2$ and $X_3$ might be highly correlated.

But under partial correlation (where we hold the influence of $X_4$ constant), $X_2$ and $X_3$ might not be highly correlated.

Thus, in the context of several Xs, reliance on zero-order correlations to check multicollinearity can be misleading.



$$r_{12.3} = \frac{r_{12} - r_{13}\, r_{23}}{\sqrt{(1 - r_{13}^2)(1 - r_{23}^2)}}$$
Partial correlations given above are called first-order correlation coefficients.

By order we mean the number of secondary subscripts.

Thus, $r_{12.34}$ would be a correlation coefficient of order two, $r_{12.345}$ would be of order three, and so on.
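The first-order formula above translates directly into code; a sketch (the values in the example call are hypothetical):

```python
import numpy as np

def partial_r(r12, r13, r23):
    # first-order partial correlation r_12.3 from the zero-order correlations
    return (r12 - r13 * r23) / np.sqrt((1 - r13 ** 2) * (1 - r23 ** 2))

# two variables that look highly correlated pairwise (0.9) may be barely
# correlated once a common third variable is held constant:
print(partial_r(0.9, 0.95, 0.95))   # about -0.03
```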
(4) Auxiliary regressions:

Step 1: Regress each $X_i$ on the remaining X variables (these are called auxiliary regressions) and get the corresponding $R^2$, which we designate $R_i^2$.

Step 2: Construct the following F statistic:

$$F_i = \frac{R^2_{x_i \cdot x_2 x_3 \dots x_k} / (k-2)}{\left(1 - R^2_{x_i \cdot x_2 x_3 \dots x_k}\right) / (n - k + 1)}$$
where n = sample size,
k = number of Xs including the intercept term,
$R^2_{x_i \cdot x_2 x_3 \dots x_k}$ = the $R^2$ value from a single auxiliary regression,
k − 2 = numerator d.f.,
n − k + 1 = denominator d.f.

Step 3: For any single auxiliary regression, if the computed F exceeds the critical F at the chosen α and the given numerator and denominator d.f., it is taken that the particular $X_i$ is collinear with the other Xs. Otherwise, the reverse conclusion applies.

Problem with this rule: the computational burden.
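A sketch of Steps 1 and 2, assuming numpy and statsmodels (the function name is illustrative). X is an (n, k−1) array of regressors without the constant column:

```python
import numpy as np
import statsmodels.api as sm

def auxiliary_F(X, i):
    """F statistic for the auxiliary regression of column i of X on the rest."""
    n, m = X.shape        # m slope regressors, so k = m + 1 including the intercept
    k = m + 1
    others = np.delete(X, i, axis=1)
    r2 = sm.OLS(X[:, i], sm.add_constant(others)).fit().rsquared
    # F with (k - 2) numerator and (n - k + 1) denominator d.f., as defined above
    return (r2 / (k - 2)) / ((1 - r2) / (n - k + 1))
```

The computed value would then be compared with the critical $F(k-2,\, n-k+1)$ at the chosen α (Step 3).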
(5) Klein's rule of thumb:

Lawrence R. Klein proposed this rule.

Step 1: Obtain $R^2$ from each auxiliary regression.

Step 2: Obtain $R^2$ from the regression of Y on all the Xs (the overall $R^2$).

Step 3: If an $R^2$ from Step 1 exceeds the $R^2$ from Step 2, then multicollinearity might be present.
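Klein's comparison is easy to automate; a sketch under the same assumptions as the previous snippet:

```python
import numpy as np
import statsmodels.api as sm

def klein_flags(y, X):
    # overall R^2 from the regression of Y on all the Xs
    overall_r2 = sm.OLS(y, sm.add_constant(X)).fit().rsquared
    flags = []
    for i in range(X.shape[1]):
        others = np.delete(X, i, axis=1)
        aux_r2 = sm.OLS(X[:, i], sm.add_constant(others)).fit().rsquared
        flags.append(aux_r2 > overall_r2)   # True -> multicollinearity suspected
    return flags
```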
(7) Variance-inflating factor (VIF):
The speed with which variances increase can be seen with the VIF.

The VIF shows how the variance of an estimator is inflated by the presence of multicollinearity. It is defined as follows:

$$VIF = \frac{1}{1 - r^2_{x_2 x_3}}$$

Here $r^2_{x_2 x_3}$ is the r-squared value obtained from the auxiliary regression of the $x_2$ variable on $x_3$.

Note that as $r^2_{x_2 x_3}$ approaches 1, the VIF approaches infinity.

In other words, as the extent of collinearity increases, the variance of an estimator increases.

If there is no collinearity between $x_2$ and $x_3$, the VIF will be 1.

Thus, the larger the VIF of a single X variable, the more troublesome or collinear that variable is with the other Xs.

Rule of thumb: if the VIF of an X exceeds 10, which will happen if its $R_j^2$ exceeds 0.90, that variable is said to be highly collinear.

$R_j^2$ is the $R^2$ in the regression of a single X variable on the remaining Xs in the model [the auxiliary regression].
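statsmodels ships a VIF helper that runs the auxiliary regression internally; a minimal sketch, with data simulated to be deliberately collinear (names and seed arbitrary):

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(1)
x2 = rng.normal(size=100)
x3 = 0.95 * x2 + rng.normal(scale=0.3, size=100)   # x3 built to track x2
design = sm.add_constant(np.column_stack([x2, x3]))

# column 0 is the constant, so VIFs are reported for columns 1 and 2
for j in (1, 2):
    print(j, variance_inflation_factor(design, j))   # rule of thumb: > 10 is highly collinear
```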


(8) Tolerance factor:
The inverse of the VIF is called the tolerance factor (TOL). That is,

$$TOL_j = \frac{1}{VIF_j} = \left(1 - R_j^2\right)$$

When $R_j^2 = 1$ (i.e. perfect collinearity), TOL = 0; and when $R_j^2 = 0$ (i.e. no collinearity), TOL = 1.
Therefore, the closer the tolerance is to zero, the greater the degree of collinearity of that variable with the other regressors.

The closer the tolerance is to 1, the greater the evidence that $x_j$ is not collinear with the other regressors.
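Tolerance follows immediately from the auxiliary $R_j^2$ (or as 1/VIF); a minimal sketch, reusing the simulated data idea from the VIF snippet:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x2 = rng.normal(size=100)
x3 = 0.95 * x2 + rng.normal(scale=0.3, size=100)
r2_aux = sm.OLS(x2, sm.add_constant(x3)).fit().rsquared   # auxiliary R_j^2
tol = 1.0 - r2_aux     # TOL_j = 1 - R_j^2 = 1 / VIF_j
print(tol)             # near 0 -> serious collinearity; near 1 -> little
```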



To conclude:

We cannot tell which of these methods will work in a given case.
No single diagnostic gives a complete handle on the collinearity problem.
Since it is a sample-specific problem, in some situations it might be easy to diagnose,
but in others one or more of the various methods will have to be used.
In short, there is no easy solution to the problem.
Remedial measures

What can be done if multicollinearity is serious?

Unfortunately, there is no surefire remedy; there are only a few rules of thumb.

This is because multicollinearity is largely a data-deficiency problem over which we do not have full control.

If the problem is with the data, there is not much that can be done.
Rules of thumb to eliminate/reduce multicollinearity

(1) A priori information:

When constructing the regression model, one can avoid Xs that are likely to have a linear relationship.

How?

Previous empirical work, theory, intuitive reasoning, etc.
(2) Dropping the collinear variables:

The simplest solution is to drop one or more of the collinear variables.

But this remedy can be worse than the disease, because it will lead to specification bias.

We construct regression models on some theoretical foundation; hence dropping variables for the sake of multicollinearity will produce biased results.

The advice then is: do not drop a variable from a theoretically sound model just because of a multicollinearity problem.

(3) Transformation of variables/data:

Applicable in the case of time-series data.

In such data, multicollinearity emerges because over time the variables tend to move in the same direction.

One way to minimize this dependence is to proceed as follows.

Consider this relation:

$$Y_t = b_1 + b_2 X_{2t} + b_3 X_{3t} + e_t$$

If it holds at time t, it must also hold at time t−1. Hence

$$Y_t - Y_{t-1} = b_2 (X_{2t} - X_{2,t-1}) + b_3 (X_{3t} - X_{3,t-1}) + (e_t - e_{t-1})$$

Now the relationship is in first-difference form.

If we run this regression, it reduces the severity of the multicollinearity problem, because there is no a priori reason to believe that the differences of the variables will also be highly correlated.

But this approach has the following problems (see the sketch after this list):

(a) There will be a loss of one observation due to the differencing procedure, and hence one d.f. will be lost.

(b) This procedure may not be appropriate with cross-sectional data, where there is no logical ordering of observations.
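A sketch of the idea with pandas (simulated series; names arbitrary): two trending series are almost perfectly correlated in levels, but their first differences are not. Note the lost observation, as point (a) above says:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
t = np.arange(100)
df = pd.DataFrame({
    "x2": t + rng.normal(scale=2.0, size=100),   # both series trend upward
    "x3": t + rng.normal(scale=2.0, size=100),
})
print(df["x2"].corr(df["x3"]))   # near 1: collinear in levels
d = df.diff().dropna()           # first differences; one observation lost
print(d["x2"].corr(d["x3"]))     # near 0 after differencing
```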
(4) Ratio transformation of variables/data:

This is another form of data transformation.

Consider the following model:

$$Y_t = b_1 + b_2 X_{2t} + b_3 X_{3t} + e_t$$

where Y is consumption expenditure, $X_2$ is GDP and $X_3$ is total population.

Since GDP and population grow over time, they are likely to be correlated.

Using the ratio transformation method we can express the above model on a per-capita basis.

That is,

$$\frac{Y_t}{X_{3t}} = b_1\left(\frac{1}{X_{3t}}\right) + b_2\left(\frac{X_{2t}}{X_{3t}}\right) + b_3 + \left(\frac{e_t}{X_{3t}}\right)$$
Such a transformation may reduce the collinearity in the original variables.

The population variable is now missing from the model as a separate regressor (does this lead to specification bias?).
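A sketch of the per-capita construction (variable names, growth rates and coefficients are illustrative): each series is divided by population, and $1/X_{3t}$ enters as a regressor while the intercept plays the role of $b_3$:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
t = np.arange(50)
gdp = 100.0 * 1.05 ** t * (1 + rng.normal(scale=0.02, size=50))  # X2: grows over time
pop = 10.0 * 1.02 ** t                                           # X3: grows over time
y = 0.7 * gdp + 2.0 * pop + rng.normal(scale=5.0, size=50)       # consumption

# per-capita regression: Y/X3 on 1/X3 and X2/X3, with an intercept (= b3)
X = sm.add_constant(np.column_stack([1.0 / pop, gdp / pop]))
res = sm.OLS(y / pop, X).fit()
print(res.params)   # [b3, b1, b2] in the notation above
```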

(5) Additional or new data:
Sometimes simply increasing the sample size may reduce the collinearity problem, provided it helps to impart more variation to the data.

(6) Specify the model correctly:
Sometimes a model chosen for empirical analysis is not carefully thought out.

Some important variables may be omitted, or the functional form of the model may be incorrectly chosen.

Avoiding this will solve a major part of the problem.
