Pair-wise (zero-order) correlation between two variables (in deviation form):

r = \frac{\sum x_i y_i}{\sqrt{\sum x_i^2 \, \sum y_i^2}}
If the r between two Xs is high, say in excess of 0.8, then multicollinearity can be a serious problem.
However, if there are more than two explanatory variables in the model, the partial correlation coefficients will provide a more accurate assessment of the presence or absence of multicollinearity.
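As a first check, the pair-wise correlations among the regressors can be inspected directly. Below is a minimal sketch in Python (NumPy) on simulated data; the variable names x2, x3, x4 and the induced collinearity are illustrative assumptions, and the 0.8 cut-off is the rule of thumb above.

import numpy as np

rng = np.random.default_rng(0)
n = 100
x2 = rng.normal(size=n)
x3 = 0.9 * x2 + 0.1 * rng.normal(size=n)   # deliberately near-collinear with x2
x4 = rng.normal(size=n)

X = np.column_stack([x2, x3, x4])
corr = np.corrcoef(X, rowvar=False)        # pair-wise (zero-order) correlation matrix

# Flag any pair whose |r| exceeds the 0.8 rule of thumb
for i in range(corr.shape[0]):
    for j in range(i + 1, corr.shape[1]):
        if abs(corr[i, j]) > 0.8:
            print(f"x{i + 2} and x{j + 2} look collinear: r = {corr[i, j]:.2f}")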
(3) High partial correlation:
In general, r is not likely to reflect the true degree of association between Y and X in the presence of another variable, say X_1.
What we need is a correlation coefficient that is independent of the influence, if any, of X_1 on X and Y.
Such a correlation coefficient is known as the partial correlation coefficient.
Partial correlation represents the correlation between two variables holding another variable constant.
Conceptually, it is similar to the partial regression coefficient.
Example: r_12.3 represents the correlation between variables 1 (say Y) and 2 (say X), holding a third variable (X_1) constant.
Consider 3 Xs (X_2, X_3 & X_4) in a regression model.
Under zero-order correlation, X_2 and X_3 might be highly correlated.
But under partial correlation (where we hold the influence of X_4 constant), X_2 and X_3 might not be highly correlated.
Thus, in the context of several Xs, reliance on zero-order correlations to check multicollinearity can be misleading.
r_{12.3} = \frac{r_{12} - r_{13}\, r_{23}}{\sqrt{(1 - r_{13}^2)(1 - r_{23}^2)}}
Partial correlations given above are called first-order correlation coefficients.
By order we mean the number of secondary subscripts.
Thus, r_12.34 would be a correlation coefficient of order two, r_12.345 would be of order three, and so on.
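The first-order partial correlation r_12.3 can be computed directly from the three zero-order correlations with the formula above. A minimal sketch in Python (NumPy); the simulated variables y, x and z are assumptions chosen so that the zero-order and partial correlations diverge.

import numpy as np

def partial_corr_12_3(v1, v2, v3):
    """First-order partial correlation r_12.3 built from zero-order correlations."""
    r12 = np.corrcoef(v1, v2)[0, 1]
    r13 = np.corrcoef(v1, v3)[0, 1]
    r23 = np.corrcoef(v2, v3)[0, 1]
    return (r12 - r13 * r23) / np.sqrt((1 - r13**2) * (1 - r23**2))

rng = np.random.default_rng(1)
z = rng.normal(size=200)            # common driver of both y and x
y = z + rng.normal(size=200)
x = z + rng.normal(size=200)

print("zero-order r(y, x): ", np.corrcoef(y, x)[0, 1])
print("partial r(y, x | z):", partial_corr_12_3(y, x, z))   # much smaller once z is held constant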
(4) Auxiliary regressions:
Step 1: Regress each X on the remaining X variables (called auxiliary regressions) and get the corresponding R², which we designate as R²_i.
Step 2: Construct the following F-statistic
F_i = \frac{R^2_{x_i \cdot x_2 x_3 \ldots x_k} / (k - 2)}{\left(1 - R^2_{x_i \cdot x_2 x_3 \ldots x_k}\right) / (n - k + 1)}
where n = sample size
k = no. of Xs including the intercept term
R²_{x_i . x_2 x_3 ... x_k} = the R² value from a single auxiliary regression
k - 2 = numerator d.f.
n - k + 1 = denominator d.f.
Step 3: For any single auxiliary regression, if the computed F exceeds the critical F at the chosen α and the given numerator and denominator d.f., it is taken that that particular X_i is collinear with the other Xs. Otherwise, the reverse conclusion applies.
Problem with this rule: the computational burden.
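A sketch of Steps 1-3 in Python, assuming statsmodels and SciPy are available; the simulated regressors x2, x3, x4 and the 5% significance level are illustrative choices, not part of the rule itself.

import numpy as np
import statsmodels.api as sm
from scipy.stats import f

rng = np.random.default_rng(2)
n = 100
x2 = rng.normal(size=n)
x3 = 0.95 * x2 + 0.05 * rng.normal(size=n)   # nearly collinear with x2 (by construction)
x4 = rng.normal(size=n)
X = np.column_stack([x2, x3, x4])

k = X.shape[1] + 1                            # no. of Xs including the intercept term
for i in range(X.shape[1]):
    others = np.delete(X, i, axis=1)
    aux = sm.OLS(X[:, i], sm.add_constant(others)).fit()    # Step 1: auxiliary regression
    R2_i = aux.rsquared
    F_i = (R2_i / (k - 2)) / ((1 - R2_i) / (n - k + 1))     # Step 2: the F-statistic
    F_crit = f.ppf(0.95, k - 2, n - k + 1)                  # Step 3: compare with critical F
    print(f"x{i + 2}: R2 = {R2_i:.3f}, F = {F_i:.1f}, collinear = {F_i > F_crit}")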
(5) Klein's rule of thumb:
Lawrence R. Klein proposed this rule.
Step 1: Obtain R² from each auxiliary regression.
Step 2: Obtain R² from the regression of Y on all the Xs (the overall R²).
Step 3: If an R² from Step 1 > the R² from Step 2, then multicollinearity might be present.
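Klein's comparison can be sketched in the same way, again with statsmodels and simulated data; the variable names and coefficients are assumptions.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 100
x2 = rng.normal(size=n)
x3 = 0.95 * x2 + 0.05 * rng.normal(size=n)
x4 = rng.normal(size=n)
y = 1.0 + 2.0 * x2 + 3.0 * x3 + 0.5 * x4 + rng.normal(size=n)

X = np.column_stack([x2, x3, x4])
overall_R2 = sm.OLS(y, sm.add_constant(X)).fit().rsquared    # Step 2: overall R2

for i in range(X.shape[1]):
    aux_R2 = sm.OLS(X[:, i], sm.add_constant(np.delete(X, i, axis=1))).fit().rsquared   # Step 1
    if aux_R2 > overall_R2:                                  # Step 3: Klein's comparison
        print(f"x{i + 2}: auxiliary R2 {aux_R2:.3f} > overall R2 {overall_R2:.3f} -> suspect")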
(7) Variance inflating factor (VIF):
The speed with which variances increase can be
seen with the VIF
VIF shows how the variance of an estimator is
inflated by the presence of multicollinearity
VIF is defined as follows
VIF = \frac{1}{1 - r_{x_2 x_3}^2}

Here r²_{x_2 x_3} is the r-squared value obtained from the auxiliary regression of the x_2 variable on x_3.
Note that as r²_{x_2 x_3} approaches 1, VIF approaches infinity.
In other words, as the extent of collinearity increases, the variance of an estimator increases.
If there is no collinearity between x_2 and x_3, VIF will be 1.
Thus, the larger the value of VIF for a single X variable, the more troublesome or collinear that variable is with the other Xs.
Rule of thumb: If the VIF of an X exceeds 10, which will happen if r²_j or R²_j exceeds 0.90, that variable is said to be highly collinear.
R²_j is the R² in the regression of a single X variable on the remaining Xs in the model [auxiliary regression].
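The VIFs can be computed with statsmodels' variance_inflation_factor, which regresses each column of the design matrix on the remaining columns and returns 1/(1 - R²_j). A minimal sketch on simulated data; the 10 cut-off is the rule of thumb above.

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(4)
n = 100
x2 = rng.normal(size=n)
x3 = 0.97 * x2 + 0.03 * rng.normal(size=n)   # highly collinear with x2 (by construction)
x4 = rng.normal(size=n)

exog = sm.add_constant(np.column_stack([x2, x3, x4]))   # design matrix with intercept
for j in range(1, exog.shape[1]):                        # skip the intercept column
    vif_j = variance_inflation_factor(exog, j)
    flag = "  <- exceeds the 10 rule of thumb" if vif_j > 10 else ""
    print(f"x{j + 1}: VIF = {vif_j:.1f}{flag}")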
(8) Tolerance factor:
The inverse of the VIF is called the tolerance factor (TOL). That is,

TOL_j = \frac{1}{VIF_j} = \left(1 - R_j^2\right)

When R²_j = 1 (i.e. perfect collinearity), TOL = 0, and when R²_j = 0 (i.e. no collinearity), TOL = 1.
Therefore, the closer the tolerance is to zero, the greater the degree of collinearity of that variable with the other regressors.
The closer the tolerance is to 1, the greater the evidence that x_j is not collinear with the other regressors.
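Since TOL_j is just the reciprocal of VIF_j, it can be obtained from the same statsmodels routine. A brief sketch on simulated data (the two regressors are illustrative assumptions).

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(5)
n = 100
x2 = rng.normal(size=n)
x3 = 0.97 * x2 + 0.03 * rng.normal(size=n)
exog = sm.add_constant(np.column_stack([x2, x3]))

for j in (1, 2):                               # columns 1 and 2 hold x2 and x3
    vif_j = variance_inflation_factor(exog, j)
    tol_j = 1.0 / vif_j                        # TOL_j = 1 / VIF_j = 1 - R2_j
    print(f"x{j + 1}: VIF = {vif_j:.1f}, TOL = {tol_j:.3f}")   # TOL near 0 -> highly collinear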
To conclude:
We cannot tell in advance which of these methods will work in a given case.
No single diagnostic gives us a complete handle on the collinearity problem.
Since multicollinearity is a sample-specific problem, in some situations it might be easy to diagnose, but in others one or more of the various methods will have to be used.
In short, there is no easy solution to the problem.
Remedial measures
What can be done if multicollinearity is serious?
Unfortunately, there is no surefire remedy; there are only a few rules of thumb.
This is because multicollinearity is largely a data-deficiency problem over which we do not have full control.
If the problem is with the data, there is not much that can be done.
Rules of thumb to eliminate/reduce multicollinearity
(1) A priori information:
When constructing the regression model, one can avoid Xs that are likely to have a linear relationship.
How? Through previous empirical work, theory, intuitive reasoning, etc.
(2) Dropping the collinear variables:
The simplest solution is to drop one or more of the collinear variables.
But this remedy can be worse than the disease, because it will lead to specification bias.
We construct regression models based on some theoretical foundation; dropping variables merely because of multicollinearity will therefore produce biased results.
The advice then is: do not drop a variable from a theoretically viable model just because of a multicollinearity problem.
(3) Transformation of variables/data:
Applicable in the case of time-series data.
In such data, multicollinearity emerges because, over time, variables tend to move in the same direction.
One way to minimize this dependence is to proceed as follows.
Consider this relation:

Y_t = b_1 + b_2 X_{2t} + b_3 X_{3t} + e_t
If it holds at time t, it must also hold at time
t-1
Hence

Y_t - Y_{t-1} = b_2 (X_{2t} - X_{2,t-1}) + b_3 (X_{3t} - X_{3,t-1}) + (e_t - e_{t-1})
Now the relationship is in first-difference form.
If we run this regression, it reduces the severity of the multicollinearity problem, because there is no a priori reason to believe that the differences of the variables will also be highly correlated (see the sketch after the list below).
But this approach has the following problems:
(a) There will be a loss of one observation due to the differencing procedure, and hence one d.f. will be lost.
(b) This procedure may not be appropriate for cross-sectional data, where there is no logical ordering of observations.
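A small numerical sketch of the first-difference idea in Python (NumPy/statsmodels), using simulated trending series; the trend slopes and variable names are assumptions. It shows the correlation between the regressors falling after differencing; the difference regression is run without an intercept, since b_1 cancels in the difference form.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
T = 120
t = np.arange(T)
x2 = 10 + 0.5 * t + rng.normal(size=T)        # trending regressor
x3 = 5 + 0.5 * t + rng.normal(size=T)         # trends the same way -> collinear in levels
y = 1 + 2 * x2 + 3 * x3 + rng.normal(size=T)

print("levels      r(x2, x3):  ", np.corrcoef(x2, x3)[0, 1])
print("differences r(dx2, dx3):", np.corrcoef(np.diff(x2), np.diff(x3))[0, 1])

# First-difference regression: one observation is lost, and the intercept b_1 drops out
dX = np.column_stack([np.diff(x2), np.diff(x3)])
print(sm.OLS(np.diff(y), dX).fit().params)     # estimates of b_2 and b_3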
(4) Ratio transformation of variables/data:
This is another form of data transformation
Consider the following model:
Y_t = b_1 + b_2 X_{2t} + b_3 X_{3t} + e_t

where Y is consumption expenditure, X_2 is GDP, and X_3 is total population.
Since GDP and population grow over time, they are likely to be correlated.
Using the ratio transformation method, we can express the above model on a per-capita basis.
That is
\frac{Y_t}{X_{3t}} = b_1 \left(\frac{1}{X_{3t}}\right) + b_2 \left(\frac{X_{2t}}{X_{3t}}\right) + b_3 + \left(\frac{u_t}{X_{3t}}\right)
Such a transformation may reduce collinearity in the original variables.
The population variable no longer appears as a separate regressor in the transformed model (does this lead to specification bias?).
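A brief sketch of how the ratio (per-capita) transformation can be set up in Python, with simulated GDP and population series (the growth rates and names are illustrative assumptions); note that the constant in the transformed regression plays the role of b_3.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
T = 60
pop = 100 * 1.02 ** np.arange(T) * (1 + 0.01 * rng.normal(size=T))   # growing population (X_3)
gdp = 500 * 1.03 ** np.arange(T) * (1 + 0.02 * rng.normal(size=T))   # growing GDP (X_2)
cons = 10 + 0.6 * gdp + 0.5 * pop + rng.normal(scale=5.0, size=T)    # consumption (Y)

print("correlation of GDP and population in levels:", np.corrcoef(gdp, pop)[0, 1])

# Per-capita (ratio-transformed) regression: Y/X_3 on 1/X_3 and X_2/X_3; the constant estimates b_3
Z = sm.add_constant(np.column_stack([1 / pop, gdp / pop]))
print(sm.OLS(cons / pop, Z).fit().params)      # column order: [b_3, b_1, b_2]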
(5) Additional or new data:
Sometimes, simply increasing the sample size may reduce the collinearity problem, provided it helps to impart more variation to the data.
(6) Specify the model correctly:
Sometimes, a model chosen for empirical analysis is not carefully thought out.
Some important variables may be omitted, or the functional form of the model may be incorrectly chosen.
Avoiding these errors will solve a major part of the problem.