Electronic Journal of Applied Statistical Analysis
e-ISSN: 2070-5948
DOI: 10.1285/i20705948v7n1p153

Identification of Multicollinearity and its effect in Model selection
By Jayakumar and A. Sulthan

This work is copyrighted by Università del Salento, and is licensed under a Creative Commons Attribuzione - Non commerciale - Non opere derivate 3.0 Italia License. For more information see: http://creativecommons.org/licenses/by-nc-nd/3.0/it/
Multicollinearity is a problem statisticians face when evaluating a sample regression model. This paper explores the relationship between the sample variance inflation factor (vif) and the F-ratio and, on this basis, proposes an exact F-test for the identification of multicollinearity that overcomes the traditional rule-of-thumb procedures. The authors show that multicollinearity not only inflates the variance of the estimated regression coefficients but also inflates the residual error variance of a fitted regression model, at various levels of inflation. Moreover, we establish a link between the problem of multicollinearity and its impact on the model selection decision. For this, the authors propose a multicollinearity-corrected version of the generalized information criteria, which incorporates the effect of multicollinearity and helps statisticians select the best model among competing models. The procedure is numerically illustrated by fitting 12 different stepwise regression models based on 44 independent variables in a BSQ (Bank Service Quality) study. Finally, the results show the transition in model selection after correction of the multicollinearity effect.

keywords: Multicollinearity, variance inflation factor, error variance, F-test, generalized information criteria, multicollinearity penalization.
The sample regression model is

y_i = b_0 + \sum_{j=1}^{p} b_j x_{ji} + \hat{e}_i    (1)

where b_0 is the estimated intercept, b_j are the estimated beta coefficients or partial regression coefficients, and \hat{e}_i is the estimated residual, which follows a normal distribution N(0, \sigma_e^2). From (1), the sample regression model should satisfy the assumptions of normality,
homoscedasticity of the error variance, and serial independence. Even when the model satisfies all of these assumptions, it still has to be evaluated. The authors focus in particular on multicollinearity and its effects, which arise from strong inter-causal relationships among the independent variables. For the past five decades, statisticians have believed that the impact of the multicollinearity problem severely inflates the variance of the estimated regression coefficients, creating greater instability and inconsistency in the estimated coefficients. Besides this, we identified a further remarkable problem due to multicollinearity, which is mathematically identified below.
Consider the variance of the estimated regression coefficient,

s^2_{\hat{b}_j} = \frac{s_e^2}{(n-1) s_{x_j}^2} \left( \frac{1}{1 - R_{x_j}^2} \right)    (2)
where s_e^2 is the unbiased estimate of the error variance, s_{x_j}^2 is the variance of the independent variable x_j (j = 1, 2, ..., p), and 1/(1 - R_{x_j}^2) is technically called the variance inflation factor (vif). The term 1 - R_{x_j}^2 is the variation in x_j left unexplained by the independent variables other than x_j. More specifically, statisticians call this term the Tolerance, and the inverse of the Tolerance is the VIF. Using the fact that s_e^2 = (n/(n-k)) \hat{\sigma}_e^2, rewrite (2) as
s^2_{\hat{b}_j} = \frac{n \hat{\sigma}_e^2}{(n-1)(n-k) s_{x_j}^2} \left( \frac{1}{1 - R_{x_j}^2} \right)    (3)

s^2_{\hat{b}_j} = \frac{n}{(n-1)(n-k) s_{x_j}^2} \, \hat{\sigma}^2_{INF(e_j)}    (4)
Instead of analyzing the inflation of the variance of the estimated regression coefficients (s^2_{\hat{b}_j}), the authors focus on the inflated part of the error variance due to the impact of multicollinearity, as in (4). From (4), \hat{\sigma}^2_{INF(e_j)} is the inflated error variance, which is inflated by (VIF)_j and is equal to

\hat{\sigma}^2_{INF(e_j)} = \frac{\hat{\sigma}_e^2}{1 - R_{x_j}^2}    (5)

\hat{\sigma}^2_{INF(e_j)} = \frac{1}{n} \sum_{i=1}^{n} \left( \hat{e}_i \big/ \sqrt{1 - R_{x_j}^2} \right)^2    (6)

\hat{\sigma}^2_{INF(e_j)} = \frac{1}{n} \sum_{i=1}^{n} \left( \hat{e}_{INF(e_{ji})} \right)^2    (7)
From (5), the inflated error variance \hat{\sigma}^2_{INF(e_j)} is always greater than or equal to the uninflated error variance \hat{\sigma}_e^2, that is, \hat{\sigma}^2_{INF(e_j)} \ge \hat{\sigma}_e^2. If R_{x_j}^2 = 0, the two variances are equal and there is no multicollinearity. Similarly, if R_{x_j}^2 = 1, the error variance is severely inflated and rises to \infty, which shows the existence of severe multicollinearity. In the same manner, if 0 < R_{x_j}^2 < 1, there will be some inflation in the error variance of the regression model. Likewise, from (6) and (7), the estimated errors \hat{e}_i are also inflated by \sqrt{(VIF)_j} and are transformed into the estimated inflated residuals \hat{e}_{INF(e_{ji})}. If the estimated errors and the error variance are inflated, the forecasting performance of the model declines, leading to inappropriate and illogical model selection decisions. From the above discussion, the authors have shown that the problem of multicollinearity not only inflates the variance of the estimated regression coefficients but also inflates the error variance of the regression model. In order to establish the statistical equality between the inflated error variance \hat{\sigma}^2_{INF(e_j)} and the uninflated error variance \hat{\sigma}_e^2, the authors propose an F-test based on the link between the sample vif and the F-ratio. The methodology of the test statistic is discussed in the next section.
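The vif and tolerance defined above can be computed directly from the data via the auxiliary regressions; a minimal numpy sketch (the data below are illustrative and hypothetical, not from the BSQ study):

```python
import numpy as np

def vif(X):
    """Sample variance inflation factors via auxiliary regressions.

    For each column x_j, regress it (with intercept) on the remaining
    columns and compute vif_j = 1 / (1 - R^2_{x_j}); the tolerance is
    the denominator 1 - R^2_{x_j}.
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = []
    for j in range(p):
        y = X[:, j]
        Z = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
        resid = y - Z @ beta
        tss = (y - y.mean()) @ (y - y.mean())
        r2 = 1.0 - (resid @ resid) / tss
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + 0.5 * rng.normal(size=200)   # strongly collinear with x1
x3 = rng.normal(size=200)              # unrelated to x1, x2
v = vif(np.column_stack([x1, x2, x3])) # vifs of x1, x2 large; x3 near 1
```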
\frac{\sigma^2_{INF(e_j)}}{\sigma_e^2} = \frac{1}{1 - R_{x_j}^2}    (8)

(vif)_j = \frac{\hat{\sigma}^2_{INF(e_j)}}{\hat{\sigma}_e^2}    (9)

\frac{\hat{\sigma}^2_{INF(e_j)}}{\hat{\sigma}_e^2} - 1 = \frac{R_{x_j}^2}{1 - R_{x_j}^2}    (10)

\frac{R_{x_j}^2}{1 - R_{x_j}^2} = (vif)_j - 1    (11)
From (11), using the facts (sst)_j = (ssr)_j + (sse)_j, R_{x_j}^2 = (ssr)_j/(sst)_j and 1 - R_{x_j}^2 = (sse)_j/(sst)_j, rewrite (11) as

\frac{(ssr)_j}{(sse)_j} = (vif)_j - 1    (12)

where ssr, sse and sst refer to the sample sums of squares of regression, error and total, respectively. Based on (12), this can be rewritten in terms of the sample mean squares of regression (s_r^2) and error (s_e^2) as

\frac{q s_{r_j}^2}{(n-q-1) s_{e_j}^2} = (vif)_j - 1    (13)
From (13), multiplying both sides by the population mean-square ratio \sigma_{e_j}^2/\sigma_{r_j}^2, we get

\frac{q s_{r_j}^2 / \sigma_{r_j}^2}{(n-q-1) s_{e_j}^2 / \sigma_{e_j}^2} = \left( \frac{\sigma_{e_j}^2}{\sigma_{r_j}^2} \right) \big( (vif)_j - 1 \big)    (14)
From (14), the quantities \chi^2_{r_j} = q s_{r_j}^2/\sigma_{r_j}^2 and \chi^2_{e_j} = (n-q-1) s_{e_j}^2/\sigma_{e_j}^2 follow chi-square distributions with q and n-q-1 degrees of freedom respectively, where q is the number of independent variables in the auxiliary regression model x_j = \beta_{0j} + \sum_{k=1}^{q} \beta_{jk} x_k + e_j (j \neq k), and they are independent. The independence of the two quantities follows from the least-squares property: if x_{ji} = \hat{x}_{ji} + e_{ji}, then \hat{x}_{ji} and e_{ji} are independent and the respective sums of squares satisfy (sst)_j = (ssr)_j + (sse)_j. Without loss of generality, (14) can be written in terms of the F-ratio, defined as the ratio of two independent chi-square variates each divided by its degrees of freedom:
F_{j(q, n-q-1)} = \frac{\chi^2_{r_j}/q}{\chi^2_{e_j}/(n-q-1)}    (15)

F_{j(q, n-q-1)} = \left( \frac{n-q-1}{q} \right) \big( (vif)_j - 1 \big) \left( \frac{\sigma_{e_j}^2}{\sigma_{r_j}^2} \right) \sim F_{(q, n-q-1)}    (16)
From (16), the population variance ratio \sigma_{e_j}^2/\sigma_{r_j}^2 can be written in terms of the population VIF; combining this ratio with (12) gives

\frac{\sigma_{r_j}^2}{\sigma_{e_j}^2} = (VIF)_j - 1    (17)

\frac{\sigma_{e_j}^2}{\sigma_{r_j}^2} = \frac{1}{(VIF)_j - 1}    (18)
Then, substituting (18) in (16), we get the final version of the F-ratio of the j-th independent variable used in the regression model:

F_{j(q, n-q-1)} = \frac{(n-q-1)}{q} \left( \frac{(vif)_j - 1}{(VIF)_j - 1} \right) \sim F_{(q, n-q-1)}    (19)
From (19), the relationship between the sample vif and the F-ratio can be derived as

(vif)_j = 1 + \big( (VIF)_j - 1 \big) \frac{q F_j}{(n-q-1)}    (20)
This relationship is the significant contribution of this paper, and the authors believe it can be used for an exact test of multicollinearity and its effect on the error variance of a regression model. The sampling distribution of the vif exists if the population VIF > 1; if VIF = 1, then the sample vif = 1. Under the null hypothesis, setting the population VIF = 2 tests for one-hundred-percent inflation of the population error variance of the model, and the test statistic from (19) is given as
F_{j(q, n-q-1)} = \frac{(n-q-1)}{q} \big( (vif)_j - 1 \big) \sim F_{(q, n-q-1)}    (21)
where, in (21), (vif)_j is the sample variance inflation factor of the j-th independent variable used in the regression model, q is the number of independent variables in the auxiliary regression model, and n is the sample size. Pragmatically, the authors recommend checking the variance inflation at several different levels, which gives more exact results. If VIF = 2, then from (8) the population inflated error variance equals twice the population error variance, that is, \sigma^2_{INF(e_j)} = 2\sigma_e^2; due to the multicollinearity effect, the error variance of the regression model is inflated by 100%. The authors believe that the exact test of multicollinearity and its effect can be carried out by fixing the value of the VIF at the desired inflation level.
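The test statistic in (19) and (21) is simple to evaluate; a hedged sketch using scipy (the numeric inputs below are hypothetical, chosen only to illustrate the call):

```python
from scipy.stats import f as f_dist

def vif_f_test(vif_j, n, q, VIF0=2.0, alpha=0.05):
    """F-test of eq. (19)/(21) for the j-th regressor.

    H0: population VIF = VIF0 (the default VIF0 = 2 tests for 100%
    inflation of the error variance; other inflation levels correspond
    to other VIF0 values).

    vif_j : sample variance inflation factor of the j-th regressor
    n     : sample size
    q     : no. of regressors in the auxiliary model for x_j
    """
    F = ((n - q - 1) / q) * (vif_j - 1.0) / (VIF0 - 1.0)
    p_value = f_dist.sf(F, q, n - q - 1)   # right-tail probability
    return F, p_value, p_value < alpha

# hypothetical numbers: n = 102 respondents, one auxiliary regressor
F, p, reject = vif_f_test(vif_j=1.254, n=102, q=1)
```

Large sample vifs give large F values, so the rejection region is in the right tail, matching the significance stars reported in the tables below.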
The generalized information criterion (GIC) is defined as

GIC = n \log(\hat{\sigma}_e^2) + f(n, k)    (22)

where \hat{\sigma}_e^2 is the estimated error variance of the regression model and f(n, k) is the generalized penalty function, which involves the sample size n and the number of parameters k estimated in the regression model. In order to incorporate the effect of multicollinearity in the GIC, the authors propose the following procedure.
Step 1: Calculate the residuals \hat{e}_i for the i-th observation from a fitted sample regression model, where e_i \sim N(0, \sigma_e^2).

Step 2: Inflate the residuals from Step 1 by the square root of the average variance inflation factor (\sqrt{AVIF}), where AVIF = \frac{1}{m} \sum_{j=1}^{m} 1/(1 - R_j^2) = \frac{1}{m} \sum_{j=1}^{m} (vif)_j and m is the number of significant sample vifs according to the F-test of significance from (19).

Step 3: Calculate the inflated residuals INF(\hat{e}_i) for the i-th observation from Step 2, where INF(\hat{e}_i) \sim N(0, \sigma^2_{INF(e)}).

Step 4: Estimate the inflated error variance: from INF(\hat{e}_i) = \hat{e}_i \sqrt{AVIF} we get \hat{\sigma}^2_{INF(e)} = \hat{\sigma}_e^2 (AVIF).

Step 5: From Step 4, rewrite \hat{\sigma}^2_{INF(e)} = \hat{\sigma}_e^2 (AVIF) as \hat{\sigma}_e^2 = \hat{\sigma}^2_{INF(e)} / AVIF and substitute into (22) to get the Modified Generalized information criterion (MGIC).
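The steps above can be sketched in a few lines (a minimal illustration of the procedure, not the authors' SPSS implementation; variable names are ours):

```python
import numpy as np

def mgic(residuals, sig_vifs, penalty):
    """Multicollinearity-corrected GIC following Steps 1-5.

    residuals : estimated residuals e_hat_i of the fitted model (Step 1)
    sig_vifs  : sample vifs found significant by the F-test (Step 2)
    penalty   : value of the generalized penalty f(n, k)
    """
    e = np.asarray(residuals, dtype=float)
    n = e.size
    avif = float(np.mean(sig_vifs))        # average variance inflation factor
    e_inf = e * np.sqrt(avif)              # Steps 2-3: inflated residuals
    s2_inf = np.mean(e_inf ** 2)           # Step 4: inflated error variance
    return n * np.log(s2_inf) + penalty    # Step 5 / eq. (23): MGIC
```

With AVIF = 1 (no significant multicollinearity) the function reduces to the ordinary GIC; AVIF > 1 adds the penalty n log(AVIF) to the criterion.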
Based on the above procedure, the multicollinearity effect is incorporated into the existing information criterion (22) as follows:

GIC = n \log\!\left( \frac{\hat{\sigma}^2_{INF(e)}}{AVIF} \right) + f(n, k)

GIC = n \log(\hat{\sigma}^2_{INF(e)}) - n \log(AVIF) + f(n, k)

GIC + n \log(AVIF) = n \log(\hat{\sigma}^2_{INF(e)}) + f(n, k)

MGIC = n \log(\hat{\sigma}^2_{INF(e)}) + f(n, k)    (23)
2
where M GICis the Modified Generalized information criterion,
bIN
F (e) is the inflated
estimate of population error variance and f (n, k)is the penalty function. Moreover, this
M GICis more meaningful when compared it with the traditional information criteria
because the error variance was inflated and it will be penalized due to the multicollinearity effect existing in a regression model. The authors also derived the Akaike information
criterion (AIC), Schwartz Bayesian criterion (SBIC), Hannan-Quinn information criterion, Akaike information criterion corrected for small sample (AICc) from the Modified
Generalized information criterion(M GIC)by assigning the proper values for the penalty
function which leads us to get 4 different modified version of the AIC, SBIC, HQIC and
AICc. These criteria can also be used for the purpose of selecting a best model whenever the problem of multicollinearity exists among the set of competing models. The
following table shows the different versions of information criteria after modification.
Criteria  Penalty function  Before multicollinearity correction            After multicollinearity correction
GIC       f(n,k)            n log(\hat{\sigma}_e^2) + f(n,k)               n log(\hat{\sigma}^2_{INF(e)}) + f(n,k)
AIC       2k                n log(\hat{\sigma}_e^2) + 2k                   n log(\hat{\sigma}^2_{INF(e)}) + 2k
SBIC      k log n           n log(\hat{\sigma}_e^2) + k log n              n log(\hat{\sigma}^2_{INF(e)}) + k log n
HQIC      2k log(log n)     n log(\hat{\sigma}_e^2) + 2k log(log n)        n log(\hat{\sigma}^2_{INF(e)}) + 2k log(log n)
AICc      2nk/(n-k-1)       n log(\hat{\sigma}_e^2) + 2nk/(n-k-1)          n log(\hat{\sigma}^2_{INF(e)}) + 2nk/(n-k-1)
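The formulas in the table are straightforward to evaluate; a sketch that reproduces the Model 2 entries of the tables below (error variance 0.18981, n = 102, k = 4, significant avif = 1.007, all taken from the study's results):

```python
import numpy as np

def criteria(sigma2, n, k, avif=1.0):
    """Information criteria from the table above.

    sigma2 : estimated error variance of the fitted model
    avif   : average variance inflation factor; avif > 1 applies the
             multicollinearity correction (inflated error variance)
    """
    base = n * np.log(sigma2 * avif)
    return {
        "AIC":  base + 2 * k,
        "SBIC": base + k * np.log(n),
        "HQIC": base + 2 * k * np.log(np.log(n)),
        "AICc": base + 2 * n * k / (n - k - 1),
    }

before = criteria(0.18981, n=102, k=4)              # Model 2, BMC
after  = criteria(0.18981, n=102, k=4, avif=1.007)  # Model 2, AMC
```

The before/after values differ only by n log(avif), which is exactly the multicollinearity penalization term derived in (23).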
Responses were recorded on a scale of 1 to 5; after the data collection was over, only 102 completed questionnaires were used for analysis. The aim of this article is to identify the exactness of multicollinearity and how it inflates the error variance of a fitted regression model. Moreover, the authors also discuss how multicollinearity distorts the model selection decision, as well as the resulting transition in model selection. The following tables show the results extracted from the analysis using IBM SPSS version 21. First, the authors performed stepwise multiple regression with 44 independent variables and one dependent variable. The results of the stepwise regression analysis, the test of multicollinearity and the transitions in model selection are presented in the following tables.
Tables 2 to 7 report, for each stepwise model (Models 2 to 12) and for every retained independent variable (X38, X2, X31, X19, X44, X5, X11, X13, X39, X8, X23), the tolerance, the sample vif, the inflated error variance \hat{\sigma}^2_{INF(e)} and the F-ratio of the multicollinearity test at the 5%, 10%, 50% and 100% variance inflation levels (** significant at the 1% level, * significant at the 5% level). The estimated uninflated error variances of the fitted models were:

Model     2        3        4        5        6        7        8        9        10       11        12
\hat{\sigma}_e^2  0.18981  0.17679  0.17679  0.16422  0.15650  0.14721  0.14099  0.13317  0.12569  0.123078  0.127046
Model  k   R2    avif    σ̂²INF(e)  AIC(BMC)  AIC(AMC)  SBIC(BMC)  SBIC(AMC)  HQIC(BMC)  HQIC(AMC)  AICc(BMC)  AICc(AMC)
2      4   .331  1.007   0.19114   -161.50   -160.78   -151.00    -150.28    -157.24    -156.53    -161.08    -160.37
3      5   .377  1.0123  0.17896   -166.74   -165.50   -153.62    -152.38    -161.43    -160.19    -166.12    -164.88
4      6   .410  1.0286  0.18184   -164.74   -161.87   -149.00    -146.12    -158.37    -155.49    -163.86    -160.99
5      7   .441  1.0926  0.17943   -170.27   -161.23   -151.89    -142.86    -162.83    -153.79    -169.08    -160.04
6      8   .489  1.109   0.17356   -173.18   -162.63   -152.18    -141.63    -164.68    -154.12    -171.63    -161.08
7      9   .525  1.168   0.17194   -177.42   -161.58   -153.80    -137.96    -167.85    -152.02    -175.46    -159.63
8      10  .565  1.1861  0.16723   -179.82   -162.42   -153.58    -136.17    -169.20    -151.79    -177.41    -160.00
9      11  .598  1.227   0.16340   -183.65   -162.78   -154.77    -133.90    -171.95    -151.09    -180.71    -159.85
10     12  .615  1.2734  0.16005   -187.54   -162.89   -156.04    -131.39    -174.79    -150.14    -184.04    -159.39
11     13  .634  1.3473  0.16582   -187.68   -157.28   -153.56    -123.15    -173.87    -143.46    -183.55    -153.14
12     12  .630  1.2848  0.16323   -186.45   -160.88   -154.95    -129.39    -173.69    -148.13    -182.94    -157.38
Table 9: Transition in model selection at 5% Inflation level of error variance
BMC - before multicollinearity correction; AMC - after multicollinearity correction; k - no. of parameters; avif - average variance inflation factor
Model  k   R2    avif    σ̂²INF(e)  AIC(BMC)  AIC(AMC)  SBIC(BMC)  SBIC(AMC)  HQIC(BMC)  HQIC(AMC)  AICc(BMC)  AICc(AMC)
2      4   .331  1.007   0.19114   -161.50   -160.78   -151.00    -150.28    -157.24    -156.53    -161.08    -160.37
3      5   .377  1.0123  0.17896   -166.74   -165.50   -153.62    -152.38    -161.43    -160.19    -166.12    -164.88
4      6   .410  1.035   0.18298   -164.74   -161.23   -149.00    -145.48    -158.37    -154.86    -163.86    -160.35
5      7   .441  1.0926  0.17943   -170.27   -161.23   -151.89    -142.86    -162.83    -153.79    -169.08    -160.04
6      8   .489  1.109   0.17356   -173.18   -162.63   -152.18    -141.63    -164.68    -154.12    -171.63    -161.08
7      9   .525  1.168   0.17194   -177.42   -161.58   -153.80    -137.96    -167.85    -152.02    -175.46    -159.63
8      10  .565  1.1861  0.16723   -179.82   -162.42   -153.58    -136.17    -169.20    -151.79    -177.41    -160.00
9      11  .598  1.227   0.16340   -183.65   -162.78   -154.77    -133.90    -171.95    -151.09    -180.71    -159.85
10     12  .615  1.2734  0.16005   -187.54   -162.89   -156.04    -131.39    -174.79    -150.14    -184.04    -159.39
11     13  .634  1.3473  0.16582   -187.68   -157.28   -153.56    -123.15    -173.87    -143.46    -183.55    -153.14
12     12  .630  1.2848  0.16323   -186.45   -160.88   -154.95    -129.39    -173.69    -148.13    -182.94    -157.38
Table 10: Transition in model selection at 10% Inflation level of error variance
BMC - before multicollinearity correction; AMC - after multicollinearity correction; k - no. of parameters; avif - average variance inflation factor
Figure 1: Line plot shows the values of Modified Versions of Information Criteria for
different Models at 5% and 10% inflation in error variance
Model  k   R2    avif    σ̂²INF(e)  AIC(BMC)  AIC(AMC)  SBIC(BMC)  SBIC(AMC)  HQIC(BMC)  HQIC(AMC)  AICc(BMC)  AICc(AMC)
2      4   .331  -       0.18981   -161.50   -161.50   -151.00    -151.00    -157.24    -157.24    -161.08    -161.08
3      5   .377  -       0.17679   -166.74   -166.74   -153.62    -153.62    -161.43    -161.43    -166.12    -166.12
4      6   .410  1.054   0.18633   -164.74   -159.38   -149.00    -143.63    -158.37    -153.01    -163.86    -158.50
5      7   .441  1.1065  0.18170   -170.27   -159.95   -151.89    -141.58    -162.83    -152.51    -169.08    -153.98
6      8   .489  1.1372  0.17798   -173.18   -160.06   -152.18    -139.06    -164.68    -151.56    -171.63    -158.76
7      9   .525  1.168   0.17194   -177.42   -161.58   -153.80    -137.96    -167.85    -152.02    -175.46    -159.63
8      10  .565  1.1861  0.16722   -179.82   -162.42   -153.58    -136.17    -169.20    -151.79    -177.41    -160.00
9      11  .598  1.227   0.16340   -183.65   -162.78   -154.77    -133.90    -171.95    -151.09    -180.71    -159.85
10     12  .615  1.2734  0.16005   -187.54   -162.89   -156.04    -131.39    -174.79    -150.14    -184.04    -159.39
11     13  .634  1.3473  0.16582   -187.68   -157.28   -153.56    -123.15    -173.87    -143.46    -183.55    -153.14
12     12  .630  1.2848  0.16323   -186.45   -160.88   -154.95    -129.39    -173.69    -148.13    -182.94    -157.38
Table 11: Transition in model selection at 50% Inflation level of error variance
BMC - before multicollinearity correction; AMC - after multicollinearity correction; k - no. of parameters; avif - average variance inflation factor
Model  k   R2    avif    σ̂²INF(e)  AIC(BMC)  AIC(AMC)  SBIC(BMC)  SBIC(AMC)  HQIC(BMC)  HQIC(AMC)  AICc(BMC)  AICc(AMC)
2      4   .331  -       0.18981   -161.50   -161.50   -151.00    -151.00    -157.24    -157.24    -161.08    -161.08
3      5   .377  -       0.17679   -166.74   -166.74   -153.62    -153.62    -161.43    -161.43    -166.12    -166.12
4      6   .410  -       0.17679   -164.74   -164.74   -149.00    -149.00    -158.37    -158.37    -163.86    -163.86
5      7   .441  1.1595  0.19041   -170.27   -155.17   -151.89    -136.80    -162.83    -147.73    -169.08    -153.98
6      8   .489  1.1945  0.18694   -173.18   -155.05   -152.18    -134.05    -164.68    -146.55    -171.63    -153.50
7      9   .525  1.217   0.17915   -177.42   -157.39   -153.80    -133.77    -167.85    -147.83    -175.46    -155.44
8      10  .565  1.2113  0.17078   -179.82   -160.27   -153.58    -134.02    -169.20    -149.64    -177.41    -157.86
9      11  .598  1.2467  0.16602   -183.65   -161.16   -154.77    -132.28    -171.95    -149.46    -180.71    -158.22
10     12  .615  1.3008  0.16349   -187.54   -160.72   -156.04    -129.22    -174.79    -147.97    -184.04    -157.22
11     13  .634  1.3865  0.17065   -187.68   -154.35   -153.56    -120.23    -173.87    -140.53    -183.55    -150.21
12     12  .630  1.3147  0.16703   -186.45   -158.54   -154.95    -127.04    -173.69    -145.78    -182.94    -155.03
Table 12: Transition in model selection at 100% Inflation level of error variance
BMC - before multicollinearity correction; AMC - after multicollinearity correction; k - no. of parameters; avif - average variance inflation factor
AMC values of the four modified criteria for Models 2-12 at the 5%, 10%, 50% and 100% inflation levels of error variance:

AIC (AMC)                         SBIC (AMC)
Model   5%       10%      50%      100%     5%       10%      50%      100%
2    -160.78  -160.78  -161.50  -161.50  -150.28  -150.28  -151.00  -151.00
3    -165.50  -165.50  -166.74  -166.74  -152.38  -152.38  -153.62  -153.62
4    -161.87  -161.23  -159.38  -164.74  -146.12  -145.48  -143.63  -149.00
5    -161.23  -161.23  -159.95  -155.17  -142.86  -142.86  -141.58  -136.80
6    -162.63  -162.63  -160.06  -155.05  -141.63  -141.63  -139.06  -134.05
7    -161.58  -161.58  -161.58  -157.39  -137.96  -137.96  -137.96  -133.77
8    -162.42  -162.42  -162.42  -160.27  -136.17  -136.17  -136.17  -134.02
9    -162.78  -162.78  -162.78  -161.16  -133.90  -133.90  -133.90  -132.28
10   -162.89  -162.89  -162.89  -160.72  -131.39  -131.39  -131.39  -129.22
11   -157.28  -157.28  -157.28  -154.35  -123.15  -123.15  -123.15  -120.23
12   -160.88  -160.88  -160.88  -158.54  -129.39  -129.39  -129.39  -127.04

HQIC (AMC)                        AICc (AMC)
Model   5%       10%      50%      100%     5%       10%      50%      100%
2    -156.53  -156.53  -157.24  -157.24  -160.37  -160.37  -161.08  -161.08
3    -160.19  -160.19  -161.43  -161.43  -164.88  -164.88  -166.12  -166.12
4    -155.49  -154.86  -153.01  -158.37  -160.99  -160.35  -158.50  -163.86
5    -153.79  -153.79  -152.51  -147.73  -160.04  -160.04  -153.98  -153.98
6    -154.12  -154.12  -151.56  -146.55  -161.08  -161.08  -158.76  -153.50
7    -152.02  -152.02  -152.02  -147.83  -159.63  -159.63  -159.63  -155.44
8    -151.79  -151.79  -151.79  -149.64  -160.00  -160.00  -160.00  -157.86
9    -151.09  -151.09  -151.09  -149.46  -159.85  -159.85  -159.85  -158.22
10   -150.14  -150.14  -150.14  -147.97  -159.39  -159.39  -159.39  -157.22
11   -143.46  -143.46  -143.46  -140.53  -153.14  -153.14  -153.14  -150.21
12   -148.13  -148.13  -148.13  -145.78  -157.38  -157.38  -157.38  -155.03
Figure 2: Line plot shows the values of Modified Versions of Information Criteria for
different Models at 50% inflation in error variance
Figure 3: Line plot shows the values of Modified Versions of Information Criteria for
different Models at 100% inflation in error variance
Figure 4: Line plot shows the Information Loss for different Models BMC and AMC at
5% and 10%, 50%, 100% inflation in error variance
Figure 5: Line plot shows the Information Loss AMC at 5%, 10%, 50%, and 100% inflation in error variance, based on p (no. of independent variables)
From the previous analysis, the authors first checked the significance of the multicollinearity effect for the 12 fitted stepwise regression models. The results for model 1 are omitted because it involves only one independent variable, so there is no need to test for multicollinearity. The results presented in Tables 2 to 7 include the tolerance of each independent variable, the sample variance inflation factor (vif), the estimated inflated error variance \hat{\sigma}^2_{INF(e)} of the fitted model with respect to the j-th independent variable, and the result of the F-test for multicollinearity identification. For model 2, the F-test shows that the error variance of the model is inflated up to 10% due to the multicollinearity between the independent variables X38 and X2. Similarly, almost all of the fitted models experienced a multicollinearity problem with inflation of the error variance up to 10%. As far as model 4 is concerned, the F-test reveals that the estimated error variance was inflated at the 50% level due to the multicollinearity effected by the independent variable X19. Similarly, for models 5 to 12, the F-test results show that the error variance of the fitted models was inflated up to 50% due to the multicollinearity created by the independent variables, at the 1% and 5% significance levels respectively. Moreover, models 5 to 12 are severely affected by multicollinearity among the independent variables, with a 100% inflation level in the error variances. From the above discussion, the authors conclude that, due to the inflation of the error variances in the fitted models, the estimated variances of the regression coefficients will certainly increase, which also threatens the stability of the least-squares estimates of the regression coefficients. In the same manner, the authors found another interesting result in the analysis: the impact of multicollinearity distorts the model selection decision. Tables 8 to 11, together with the graphs, exhibit the transitions in the model selection decision at the 5%, 10%, 50% and 100% inflation levels of the error variance, based on four frequently used information criteria, namely AIC, SBIC, HQIC and AICc. As far as AIC is concerned, it selects model 11 as the best before the multicollinearity correction, because its information loss between the true and the fitted model is minimal compared with the remaining competing models. As far as SBIC, HQIC and AICc are concerned, model 10 is the best before the incorporation of the multicollinearity correction. After incorporating the multicollinearity effect into the model selection criteria, however, the selection decision changes. From Tables 8, 9, 10 and 11, if the error variance is inflated up to 5%, 10%, 50% or 100% respectively, the modified information criteria AIC, SBIC, HQIC and AICc all recommend model 3 as the model with minimum information loss compared with the remaining competing models. Hence the results prove that there is a definite transition in model selection after removing the effect of multicollinearity from the fitted models.
6 Conclusion
In the previous sections the authors explained how multicollinearity influences the fit of a model and how it impacts the model selection decision. For this purpose, the authors used an F-test for the identification of multicollinearity, based on the estimated inflated and uninflated error variances of a fitted regression model. While taking a model selection decision, statisticians should consider the problems present in the fitted models. Multicollinearity does not violate the assumptions of the least-squares or maximum likelihood approaches, but some care should be taken when selecting a model affected by multicollinearity. The authors tactically penalized models with multicollinearity and showed the inflation of the error variances at various levels among the competing models. On the other hand, when a model selection decision is based on information criteria, these criteria are by themselves incapable of reflecting the problems in the fitted models. For this reason, the authors made a multicollinearity correction in the GIC, incorporating the multicollinearity effect by using the inflated error variance in place of the uninflated error variance of the model. Hence the authors emphasize that, when selecting a model, statisticians should consider the problems in the fitted model and incorporate the consequences of those problems into the model selection yardsticks. As far as least-squares problems are concerned, the authors suggest that statisticians also incorporate problems such as heteroskedasticity, autocorrelation, outliers and influential points into the model selection decision; only then can an unbiased model selection process be achieved. In the future, more rigorous simulation experiments, fixing the variance inflation level between 1% and 100% at intervals of 0.1%, would help statisticians find the exact variance inflation level of the error variance. Moreover, based on the relationship between the F-ratio and the sample variance inflation factor (vif), the exact sampling distribution of the vif can be derived, and its properties will motivate statisticians to take the multicollinearity problem to the next level.