
HANDLING MULTICOLLINEARITY WITH SAS IML® SOFTWARE: RIDGE REGRESSION ON THE PC

GARY WILLIAM CARR, NYNEX
GERALD TLAPA, BELLCORE
1.0 Introduction.

Analyses of economic data must consider the likely presence of multicollinearity. If one is interested in studying Gross Domestic Product (GDP) versus other macroeconomic infrastructure components, multicollinearity will be anticipated, since components of the infrastructure generally march in harmony with GDP. For example, many correlation studies have shown a strong direct relationship between GDP and Energy. (1) In particular, this relationship has been analyzed by Janosi and Grayson with a resulting R² ≥ 0.9 in thirty-two of the thirty-four cases studied. These results were obtained by using a log-log model assuming constant rates of continuous growth. (2)

As the number of variables increases within an econometric model, the concern with multicollinearity increases, since the probability that some variables measure similar phenomena increases. The presence of multicollinearity violates the assumption that the explanatory variables in a regression model are not strongly interrelated. This assumption is one of the three fundamental assumptions of regression analysis. The consequence of violating it is low precision of the individual regression coefficient estimates, which in turn can lead to erroneous inferences.

Normal use of ordinary least squares regression assumes that the input variables are uncorrelated. In addition, the researcher wants as many observations as possible per variable to ensure that a purely random component will be less likely to affect inferences about the deterministic portion of the equation. Economic data, however, are often sparse, especially with regard to developing countries, and may, to a significant degree, measure the same basic phenomena. Economic data therefore almost always display multicollinearity. This is a common problem that must be addressed when modeling economic effects. Minimizing multicollinearity maximizes the explanatory power of any model chosen to describe causal economic relationships.

Multicollinearity can be defined as a property of the correlation matrix in which the off-diagonal elements (independent regressors) approach 1. (3) When significant multicollinearity exists, it is impossible to determine the importance of each independent (regressor) variable in explaining a dependent variable based on R². (4) One specific focus of this paper addresses multicollinearity using the ridge regression technique. Ridge regression provides a statistically robust method for overcoming data problems frequently encountered in econometric modeling using ordinary least squares methodologies.

2.0 Analysis.

Assume that the relationships of any given country's infrastructure components relative to its GDP are sought, such that they can be expressed in terms of a general linear regression model:

    Y = XB + e                                    (1)

where Y is a vector of observations (Ln GDP in subsequent analyses of this study), X is a matrix of infrastructure variable observations (infrastructure components), B is a vector of parameters, and e is a vector of errors normally distributed with expected value E(e) = 0 and Var(e) = σ²I. In this case the elements of variance are uncorrelated. Since E(e) = 0,

    E(Y) = E(XB)                                  (2)

The least squares estimate (LSE) of B is the value b which minimizes the error sum of squares:

    e'e = (Y - Xb)'(Y - Xb)                       (3)

where b is the LSE of B and ( )' indicates the matrix transpose. This provides the normal equation:

    (X'X)b = X'Y                                  (4)

where (X'X) is the correlation matrix and is non-singular.

    b = (X'X)⁻¹X'Y                                (5)

where (X'X)⁻¹ is the inverse of the correlation matrix. The solution b has the following properties:
1) It estimates B with minimized error sum of squares irrespective of the distribution function of the errors.
2) The elements of b are linear functions of the observations Y1, Y2, ..., Yn and provide unbiased estimates of the elements of B which have the minimum variances irrespective of the distribution functions of the errors.

Consider the variables of energy, telecommunications, domestic investment, airline travel and savings that produce a correlation matrix, (X'X), below.

CORRELATION MATRIX (X'X) FOR CHINA (1961-1985)

          Ln(ENG)  Ln(TEL)  Ln(AIR)  Ln(INV)  Ln(SAV)  Ln(GDP)
Ln(ENG)   1.0000   0.9183   0.8786   0.9131   0.9110   0.8911
Ln(TEL)   0.9183   1.0000   0.9257   0.9313   0.9349   0.9259
Ln(AIR)   0.8786   0.9257   1.0000   0.9618   0.9649   0.9902
Ln(INV)   0.9131   0.9313   0.9618   1.0000   0.9992   0.9813
Ln(SAV)   0.9110   0.9349   0.9649   0.9992   1.0000   0.9844
Ln(GDP)   0.8911   0.9259   0.9902   0.9813   0.9844   1.0000

ENG - Level of energy consumption expressed in million tons of coal equivalent. (UN).
TEL - Number of telephones installed. (AT&T).
AIR - Number of airline passengers carried. (UN).
INV - Domestic investment. (World Bank).
SAV - Domestic savings. (World Bank).
GDP - Gross domestic product. (IMF).
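The least squares computation of equations (4)-(5) can be sketched with NumPy. This is an illustration only: the paper's own computations are done in SAS IML, and the toy data below is synthetic, not the China series.

```python
import numpy as np

def ols_normal_equations(X, Y):
    """Least squares estimate b solving (X'X)b = X'Y (equations 4-5)."""
    XtX = X.T @ X
    XtY = X.T @ Y
    # solve the normal equation rather than inverting (X'X) explicitly
    return np.linalg.solve(XtX, XtY)

# synthetic data with known coefficients (not the China data)
rng = np.random.default_rng(0)
X = rng.normal(size=(25, 3))
Y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=25)
b = ols_normal_equations(X, Y)
```

Solving the normal equation directly is numerically preferable to forming (X'X)⁻¹, though with near-singular (X'X) both approaches become unstable, which is precisely the problem ridge regression addresses.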

Each "independent" infrastructure variable shows a high correlation to each other as well as to Ln(GDP). Alternatively, multicollinearity can be observed in the inverse of this correlation matrix, (X'X)⁻¹, measured by the high diagonal values above 1. (See the inverse correlation matrix below.) While independent variables should ideally not be correlated with each other, economic data on infrastructure components will show some degree of correlation to each other. Collinearity among the independent variables must therefore be kept at acceptable levels in the regression model. The diagonal elements of the inverse matrix are called Variance Inflation Factors (VIF).

    VIFi = 1/(1 - Ri²)                            (6)

Ri² is the coefficient of determination of the i-th independent variable regressed on all other independent variables. (5) As an example, using the China data with the independent variable Ln(SAV) regressed on all other independent variables, one obtains an R² value of .9987. When this R² value is introduced into equation 6 above, VIF = 778.084. When the VIFi value exceeds the value of 10 (as identified by Freund and Littell) (obtained by substituting the R-squared value of the total regression results into the VIF formula), the presence of unacceptable multicollinearity is identified. (6) The analyst must now select which VIF element is most closely related to, and not independent from, the other independent variables. Correction of the data is now required to eliminate the presence of collinearity. Consider once again the inverse correlation matrix:

INVERSE CORRELATION MATRIX (X'X)⁻¹ FOR CHINA (1961-1985)

          Ln(ENG)   Ln(TEL)   Ln(AIR)   Ln(INV)   Ln(SAV)
Ln(ENG)     8.225    -5.036     1.035   -21.951    18.150
Ln(TEL)    -5.036    11.768    -3.227    24.936   -28.216
Ln(AIR)     1.035    -3.227    16.113    15.484   -28.945
Ln(INV)   -21.951    24.936    15.484   722.240  -739.911
Ln(SAV)    18.150   -28.216   -28.945  -739.911   778.084

Several remedies are frequently suggested to correct poorly conditioned data. The general approach is usually to collect more data. This answer rarely helps the econometrician, who typically has short data series and cannot wait for additional data to be obtained. In addition, the cost of obtaining additional data may be prohibitive and cannot guarantee a sample with reduced collinearity. (7) Another common approach is to reduce the number of independent variables in the model if similar phenomena are being measured by many independent variables. This may not be feasible if one is trying to determine the importance of several variables influencing one dependent variable. A procedure more robust than ordinary least squares (OLS) regression is appropriate in this instance.

To overcome problems with data quality and improve the reliability of data analyses and forecasts, econometricians may (intentionally) introduce bias into their models. This serves the purpose of reducing standard errors and multicollinearity. Indirect introduction of bias into a model occurs when one of the variables which is collinear with another variable is dropped. By reducing the information input into a model, collinearity is reduced. Another procedure that can be used introduces bias directly. One such robust procedure is ridge regression, which is becoming increasingly popular in econometric analyses. Hoerl and Kennard (1970) are cited frequently for their use of this technique. The guiding principles which they have followed are listed below. (8):

1) As bias, k, is added to the diagonal elements of the correlation matrix, the resulting coefficients will stabilize and have characteristics similar to an orthogonal system.
2) The actual coefficients will have reasonable absolute values respective to the factors they represent.
3) The proper sign will be assigned to all coefficients.
4) The residual sum of squares will not be inflated to an unreasonable value. The amount of variance will not be large relative to the process generating the data.

The ridge regression procedure is intended to overcome multicollinearity problems where the correlation matrix is nearly 1.0, giving rise to unstable parameter estimates. Additional work on this method by G.M. Mullett has explained why an incorrectly signed coefficient becomes corrected. (9) G. Jelisavcic justifies this technique by demonstrating the lower mean squared error which it produces and its ability to choose the proper bias, k, required for the variables being analyzed. (10) The work of Chatterjee and Price supports this technique by showing how it minimizes the mean squared error when a regression equation is used to predict future values. (11) Draper and Smith confirm the use of ridge regression when prior knowledge of the parameters is known (lower coefficient values, or the sign of a coefficient is incorrect), and also when ridge regression is subject to restrictions on the parameters (a least squares problem with the addition of restrictions or constraints based on external information). (12) T.H. Wonnacott demonstrates that in using ridge regression to avoid multicollinearity, the confidence intervals of relevant regressor variables are more precise. (13)

Ridge regression is not a panacea for all economic problems, but in many instances it has led to improved understanding of available data.
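Equation (6) and the inverse-correlation-matrix view of VIF can be illustrated with a short NumPy sketch. The data here is hypothetical (not the China series): two regressors are near-duplicates of each other and a third is independent.

```python
import numpy as np

def variance_inflation_factors(X):
    """VIF_i = 1/(1 - R_i^2), equation (6): for standardized regressors
    these are the diagonal elements of the inverted correlation matrix."""
    R = np.corrcoef(X, rowvar=False)   # correlation matrix of the columns
    return np.diag(np.linalg.inv(R))

rng = np.random.default_rng(1)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.05, size=100)   # nearly collinear with x1
x3 = rng.normal(size=100)                    # independent regressor
vifs = variance_inflation_factors(np.column_stack([x1, x2, x3]))
```

The collinear pair produces VIFs far above the threshold of 10 cited from Freund and Littell, while the independent regressor stays near 1.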

The ridge regression methodology demonstrated here will add bias, k, to the trace elements of the correlation matrix. If any trace element of the inverse correlation matrix is less than one, no bias is added. This procedure is repeated until all trace values are equal to one. This insures that the collinearity of each variable is treated separately and that only those variables demonstrating collinearity will receive bias. In addition, the relationships of the variables are maintained to assure correct "balance," thereby avoiding distortion of the original hypothesis. Another value of importance to be calculated is P* (P star), which determines the correct amount of bias for establishing the best regression equation. (14)

    P* = p - tr{[(X'X + kI)⁻¹(X'X) - I]'[(X'X + kI)⁻¹(X'X) - I]}    (7)

Since I is a p × p matrix, where p is the number of independent variables in the model, tr(I) = p. Clearly P* ≤ p, which suggests that the violation of the assumption about the interrelationship of the independent variables in the model has not been ignored, but rather accounted for. The quantity P* could therefore be thought of as the "effective" number of independent variables in the model.

Another important quantity to be calculated is the trace of the inverted design matrix, TR*:

    TR* = tr[(X'X + kI)⁻¹(X'X)(X'X + kI)⁻¹]                         (8)

Both P* and TR* are calculated for a whole range of the parameter k, starting with k = 0. The optimal value for k is the value at which the following equality is reached, or approached as closely as feasible:

    TR* = P*                                                         (9)

A SAS IML software program was used to compute results following the above procedures. This program appears in Appendix A to this paper. (15)

3.0 Results

Let us now apply ridge regression techniques to the China example of Section 2. First consider the ordinary least squares results operating upon the dependent variable Ln(GDP):

ORDINARY LEAST SQUARES RESULTS
CHINA (1961 to 1985)

   B0     Ln(ENG)  Ln(TEL)  Ln(AIR)  Ln(INV)  Ln(SAV)
 5.715    0.0388  -0.0475   0.3047  -0.5911   0.9865

Using the above coefficients for each infrastructure component generates erroneous results. For a growing economy the signs of the coefficients should be positive. A negative coefficient value is expected only if a segment of an economy is flat or declines (negative slope). Also, the large coefficients are misleading as indicators of the contribution being made to the dependent variable. China's economy shows increasing growth based on the gross domestic product per capita:

CHINA (GDP PER CAPITA)

  1960     1970     1980     1985
 87.22   109.66   275.56   308.73

Using ridge regression a different set of coefficients is obtained. Adding sectors of an economy, while not exceeding the dependent variable, usually results in positive coefficients. Contributions made by one sector upon another should be removed to insure the ceteris paribus requirement, thereby representing the independent contributions to the dependent variable. Imposing these conditions results in the following:

RIDGE REGRESSION RESULTS
CHINA (1961 to 1985)

                      B0      Ln(ENG)  Ln(TEL)  Ln(AIR)  Ln(INV)  Ln(SAV)
Coefficient         5.5153    0.1964   0.0635   0.1445   0.1764   0.1711
T for H0: B = 0              (3.265)  (4.873)  (11.72)  (12.88)  (13.90)
Probability > |T|            (.00407) (.0001)  (.00001) (.00001) (.00001)

The above results show levels of significance below .005 for all independent variables. This measures the probability that a |T| statistic would obtain a value greater than the observed, given that the true parameter is zero. The probability of the T-statistic for energy in China is 0.00407. This means that if we reject the null hypothesis (B = 0), there is a 0.41 percent probability that the null hypothesis is actually true.

The amount of ridge bias necessary to achieve the proper regression result was within "rule of thumb" parameters. (16) The maximum amount of bias added to any element was as follows:

BIAS REQUIRED FOR RIDGE REGRESSION
CHINA 1961-1985

    .275

4.0 Conclusions:

Ridge regression bias intentionally introduced can be kept at low levels and only added to those elements demonstrating collinearity. The mean square error term is kept at low levels, insuring improved results. Ridge regression corrects improper OLS signs, inflated parameter estimates and unstable coefficients.

This study dealt with five variables which show varying degrees of multicollinearity. Ridge regression methodology was employed to deal with multicollinearity in determining model coefficients. One must, however, use caution when making general statements concerning these regression results beyond the variables discussed. Additional work should be undertaken with an expanded economic model of infrastructure components to develop a better understanding of each country.
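The P*/TR* selection rule of equations (7)-(9) can be sketched in Python rather than SAS IML. This is a minimal illustration; the correlation matrix below is a made-up near-singular example, not the China matrix, and the k grid mirrors the paper's 0 to 0.475 by 0.025 scan.

```python
import numpy as np

def ridge_k_by_pstar_trstar(R, k_grid):
    """Scan bias values k and return the one where TR* most nearly
    equals P* (equations 7-9). R is the p x p correlation matrix of
    the standardized regressors."""
    p = R.shape[0]
    I = np.eye(p)
    best_k, best_gap = None, np.inf
    for k in k_grid:
        M = np.linalg.inv(R + k * I)
        H = M @ R                                     # (X'X + kI)^-1 (X'X)
        p_star = p - np.trace((H - I).T @ (H - I))    # equation (7)
        tr_star = np.trace(M @ R @ M)                 # equation (8)
        gap = abs(tr_star - p_star)
        if gap < best_gap:
            best_k, best_gap = k, gap
    return best_k

# illustrative near-singular correlation matrix (not the China data)
R = np.array([[1.00, 0.95, 0.90],
              [0.95, 1.00, 0.93],
              [0.90, 0.93, 1.00]])
k = ridge_k_by_pstar_trstar(R, np.arange(0.0, 0.5, 0.025))
```

With an orthogonal system (R = I) the rule selects k = 0, i.e. plain OLS; with highly intercorrelated regressors it selects a strictly positive bias.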

Using SAS IML® on a personal computer to perform ridge regression of economic data streams provides increased flexibility to the analyst. Care should be given to some of the PC's limitations when using SAS IML. SAS version 6.03 requires more memory than earlier versions, and 640K bytes should be considered the minimum requirement. The number of variables one uses should also be kept at a minimum, and variables should only be added that are important to the economic model. Please note that in the attached program listing, the IML worksize will need to be adjusted upward as variables are added to the model.

In Appendix B, the SAS output window display of the CHINA data is included. It is important to observe the signs of the variables and the statistical measures "PRESS" and "MSE." Appendix C includes the "PRESS" log window results. With each computational loop the effects of bias, k, on the diagonal elements can be observed by the analyst. No equation should be considered the best until all statistical parameters are reviewed. Intermediate knowledge of ridge regression procedural results and underlying data is always useful.
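The "PRESS" measure referred to above can be illustrated for an ordinary least squares fit using the standard leave-one-out identity. The program in Appendix A computes an analogous PRESS-like quantity for the ridge fit; the data below is synthetic, not the China series.

```python
import numpy as np

def press_statistic(X, y, beta):
    """PRESS: sum of squared leave-one-out prediction errors. For OLS
    these equal e_i / (1 - h_ii), with h_ii from the hat matrix."""
    H = X @ np.linalg.inv(X.T @ X) @ X.T          # hat matrix
    loo_resid = (y - X @ beta) / (1.0 - np.diag(H))
    return float(loo_resid @ loo_resid)

# synthetic example with an intercept column (not the China data)
rng = np.random.default_rng(2)
X = np.column_stack([np.ones(12), rng.normal(size=(12, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.3, size=12)
beta = np.linalg.lstsq(X, y, rcond=None)[0]
press = press_statistic(X, y, beta)
```

Comparing PRESS across bias values k, as the appendix output does, favors the equation that predicts withheld observations best rather than the one that merely fits the estimation sample best.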

NOTES and REFERENCES

SAS/IML® software is the registered trademark of SAS Institute Inc., Cary, NC, USA.

1. deJanosi, Peter E., and Grayson, Leslie E., "Patterns of Energy Consumption and Economic Growth and Structure", Journal of Development Studies, Vol. 8, 1972, pp. 241-249. See also: Todaro, Michael P., Economic Development in the Third World, 1985, pp. 540-542.
2. Dowling, Edward T., Mathematics for Economists, Schaum's Outline Series, McGraw-Hill Book Company, 1980.
3. Draper, N.R., and Smith, H., Applied Regression Analysis, John Wiley & Sons, New York, 1981, pp. 294-379.
4. Cassidy, Henry J., Using Econometrics: A Beginner's Guide, Reston Publishing Co., Inc., New York, 1981, pp. 160-168.
5. Belsley, David A., Kuh, Edwin, and Welsch, Roy E., Regression Diagnostics: Identifying Influential Data and Sources of Collinearity, John Wiley & Sons, New York, 1980, pp. 192-229.
6. Freund, Rudolph J., Ph.D., and Littell, Ramon C., Ph.D., SAS System for Regression, SAS Institute Inc., Cary, NC, 1986.
7. Belsley, op. cit.
8. Hoerl, A. E., and Kennard, R. W., "Ridge Regression: Biased Estimation for Nonorthogonal Problems", Technometrics, No. 12, 1970, pp. 69-82.
9. Mullett, G. M., "Why Regression Coefficients Have the Wrong Sign", Journal of Quality Technology, No. 8, 1976, pp. 112-126.
10. Jelisavcic, Gordana, "Mean Square Error as a Reliability Measure for Biased Estimators", paper delivered at ASA Meeting, August 16-19, 1982, Cincinnati, Ohio.
11. Chatterjee, Samprit, and Price, Bertram, Regression Analysis by Example, John Wiley & Sons, New York, 1977, pp. 143-214.
12. Draper, op. cit.
13. Wonnacott, Thomas H., and Wonnacott, Ronald J., Regression: A Second Course in Statistics, John Wiley & Sons, New York, 1981, pp. 64-448.
14. Personal papers from Gordana Jelisavcic, Ph.D., who has done extensive work on ridge regression and has published several papers and delivered speeches to the American Statistical Association.
15. SAS IML® is a computer software product by SAS Institute, Box 8000, Cary, NC. The original ridge program appeared in course notes of Principles of Regression Analysis and has been modified to include Gordana Jelisavcic, Ph.D., notes.
16. Ridge bias should not exceed the range of 0 to .30.
APPENDIX A
SAS IML® VERSION 6.03
RIDGE REGRESSION COMPUTER PROGRAM
FOR THE PERSONAL COMPUTER

OPTIONS NODATE;
DATA COUNTRY;
  INFILE 'D:CHN.PRN';
  LENGTH YAR AIR ENG GDI GDP GNS TEL 8;
  INPUT YAR AIR ENG GDI GDP GNS TEL;
  LAIR=LOG(AIR);
  LENG=LOG(ENG);
  LGDI=LOG(GDI);
  LGDP=LOG(GDP);
  LGNS=LOG(GNS);
  LTEL=LOG(TEL);
TITLE1 'CHINA DATA (5 VARIABLES)';
/* THE MACRO VARIABLE VARLIST CONTAINS */
/* THE REGRESSOR VARIABLES */
%LET VARLIST=LAIR LENG LGDI LGNS LTEL;
/* THE MACRO VARIABLE DEPVAR CONTAINS THE */
/* DEPENDENT VARIABLE */
%LET DEPVAR=LGDP;
/* THE MACRO VARIABLE DATASET CONTAINS THE NAME */
/* OF THE SAS DATA SET */
%LET DATASET=COUNTRY;
/* THE MACRO VARIABLE OUTDSN CONTAINS THE */
/* OUTPUT DATA SET */
%LET OUTDSN=RIDGE;
/* THE MACRO VARIABLE COEF CONTAINS THE */
/* LABELS FOR COEFFICIENTS */
%LET COEF='B0' 'LAIR' 'LENG' 'LGDI' 'LGNS' 'LTEL';
/* REQUIRED FOR CHARACTER MATRIX OF MODEL */
%LET LABL=CC;
/* THE MACRO VARIABLE BBNAMES CONTAINS THE */
/* NAMES OF THE COEFFICIENTS AND USEFUL STATISTICS */
%LET BBNAMES='PRESSSS' 'ESS' 'MSE' 'CK' 'K' 'TRHK'
             'B0' 'LAIR' 'LENG' 'LGDI' 'LGNS' 'LTEL';
/* USED TO NAME CHARACTER MATRIX OF MODEL */
%LET CCNAMES='LABEL';
PROC PRINT DATA=&DATASET;
/* PROC REG DATA=&DATASET; */
/* MODEL &DEPVAR=&VARLIST/VIF; */
PROC IML WORKSIZE=70;
START RIDGE;
  N=NROW(X);  /* NUMBER OF INPUTS PER VARIABLE */
  P=NCOL(X);  /* NUMBER OF INDEPENDENT VARIABLES */
  JX=J(N,1,1)||X;
  /* COMPUTE ANOVA ESTIMATE OF SIGMA2 */
  SIGMA2=(Y-JX*INV(JX`*JX)*JX`*Y)`
        *(Y-JX*INV(JX`*JX)*JX`*Y)/(N-P-1);
  XM=X[:,];
  YM=Y[:,];
  XC=X-REPEAT(XM,N,1);  /* X'S CENTERED */
  YC=Y-YM;              /* Y'S CENTERED */
  YCPYC=YC`*YC;
  SSXC=XC[##,];
  STDXC=SQRT(SSXC);
  XCS=XC*DIAG(1/STDXC); /* X'S ARE CENTERED AND SCALED */
  XCSPXCS=XCS`*XCS;     /* CORRELATION MATRIX */
  XCSPYC=XCS`*YC;
  LABL=J(1,1);          /* USED TO MAKE NUMERIC MATRIX */
  LABL=CHAR(LABL);      /* USED TO CHANGE NUMERIC TO CHARACTER MATRIX */
  ZA=J(P,1,0);          /* USED TO MAKE NUMERIC MATRIX */
  STARTT=0;       /* START OF BIAS TO BE SET */
  ENDD=.50;       /* END OF BIAS LOOP TO BE SET */
  INCREMNT=.025;  /* INCREMENT OF BIAS TO BE SET */
  DO K=STARTT TO ENDD BY INCREMNT;
    IF K=0 THEN RMAT=K#I(P);
    ELSE RMAT=ZC;
    ZB=VECDIAG(RMAT);
    XCSPXCSK=XCSPXCS+RMAT;
    MATINV=INV(XCSPXCSK);   /* CORR MATRIX + BIAS INVERTED */
    MATCOR=MATINV*XCSPXCS;  /* VALUE OF M(Z) */
    IOLSMAT=MATCOR-I(P);    /* VALUES OF MM(Z) */
    TRMAT=IOLSMAT`*IOLSMAT; /* VALUE OF MM(Z)'MM(Z) */
    TROFMAT=DIAG(TRMAT);    /* DIAG OF MATRIX TR */
    GARY=VECDIAG(TROFMAT);  /* COLUMN VECTOR OF DIAG TR VALUES */
    IOLSTRMA=I(P)-TROFMAT;  /* VALUE OF P MATRIX FORM */
    STRIG=VECDIAG(IOLSTRMA);/* COLUMN VECTOR OF DIAG P VALUES */
    VIFMZMZK=MATINV*MATCOR;
    MZMZKVIF=VECDIAG(VIFMZMZK); /* VIF (TR) VALUES */
    RVALUE=MZMZKVIF/STRIG;  /* RVALUE MUST BE KEPT AT UPPER BOUND >1 */
    TR=TRACE(VIFMZMZK);
    PSTAR=TRACE(IOLSTRMA);
    R=TR/P;
    KBIAS=VECDIAG(RMAT);
    IF ANY(RVALUE>1) THEN LABEL="OVER";
    ELSE LABEL="UNDER";
    LABL=LABEL;
    PRINT RVALUE KBIAS PSTAR TR R LABEL;
    /* THIS DO LOOP IS REQUIRED TO INCREASE THE KBIAS ON ONLY
       THOSE ELEMENTS WHERE THE VARIANCE INFLATION FACTOR
       (RVALUE) IS ABOVE 1, THEN ADDS EQUAL BIAS TO ALL ELEMENTS
       AFTER (RVALUE) IS LESS THAN 1 */
    DO I=1 TO P BY 1;
      IF RVALUE[I,1]>1 THEN ZA[I,1]=K+INCREMNT;
      ELSE ZA[I,1]=ZB[I,1];
    END;
    IF ALL(RVALUE<1) THEN DO I=1 TO P BY 1; /* DO LOOP FOR RVALUE */
      ZA[I,1]=K+INCREMNT;                   /* ALL UNDER 1 */
    END;
    ZC=DIAG(ZA);
    RMAT=ZC;
    SBETA=MATINV*XCSPYC;
    /* COMPUTE UNSTANDARDIZED REGRESSION COEFFICIENTS */
    SXC=1/STDXC;
    BETA=SXC`#SBETA;
    B0=BETA`*XM`;
    INTERCPT=YM-B0;
    BETA=INTERCPT//BETA;
    /* PREPARE FOR COMPUTING A PRESS-LIKE STATISTIC */
    /* THIS STATISTIC PR(RIDGE) IS EASIER */
    /* TO COMPUTE THAN PRESS */
    RESID=Y-JX*BETA;
    ESS=RESID[##,];
    MSE=ESS/(N-P-1);
    /* HAT MATRIX USING K */
    HK=XCS*INV(XCSPXCSK)*XCS`;
    TRHK=TRACE(HK);
    /* COMPUTE CP TYPE STATISTIC */
    CK=ESS/SIGMA2-N+2+2#TRHK;
    PRESSRES=RESID/(1-VECDIAG(HK)-1/N);
    PRESSSS=PRESSRES[##,];
    BB=BB//(PRESSSS||ESS||MSE||CK||K||TRHK||BETA`);
    CC=CC//(LABL); /* CREATES A VERTICAL CHARACTER MATRIX OF LABEL */
  END;
FINISH;
START ANOVA;
  /* BUILD ANOVA TABLE BASED ON SMALLEST PRESS */
  A=NROW(CC); /* READS THE AMOUNT OF COMPUTATIONS IN LABL */
  FF=0;       /* USED AS STARTING POINT FOR CALCULATIONS */
  CC[1,1]="O.L.S.";  /* CHANGES LABEL TO READ OLS */
  DO I=1 TO A BY 1;                    /* THIS DO LOOP IS REQUIRED */
    IF CC[I,1]="UNDER" THEN FF=1+FF;   /* SHOW THE POINT WHERE */
    IF FF=1 THEN CC[I-1,1]="BEST";     /* MULTICOLLINEARITY IS OUT */
  END;                                 /* BASED ON R(VALUE) */
  DO I=1 TO A BY 1;                    /* DO LOOP REQUIRED TO READ */
    IF CC[I,1]="BEST" THEN EE=I;       /* THE BEST DATA SET */
  END;
  USE &OUTDSN VAR{K PRESSSS MSE ESS CK TRHK B0 &VARLIST};
  READ POINT EE;
  PRINT ,K PRESSSS MSE ESS CK TRHK B0, &VARLIST;
  XCSPXCSK=XCSPXCS+K#I(P);
  VARCOV=INV(XCSPXCSK)*(XCS`*XCS)*INV(XCSPXCSK);
  VIF=VECDIAG(VARCOV);
  SE=SQRT(MSE#VIF);
  SBETA=INV(XCSPXCSK)*XCSPYC;
  T=SBETA/SE;
  PVALUE=(1-PROBT(ABS(T),N-P-1))#2;
  COEF={&COEF}; COEF=COEF[2:P+1,];
  PRINT ,COEF T PVALUE VIF;
FINISH;
/* MAIN PROGRAM */
USE &DATASET;
READ ALL VAR{&VARLIST} INTO X;
READ ALL VAR{&DEPVAR} INTO Y;
RUN RIDGE; /* RUN RIDGE PROGRAM */
/* SORT BB BY PRESSSS */
/* TBB=BB;
   BB[RANK(BB[,1]),]=TBB;
   FREE TBB; */
CREATE &OUTDSN FROM BB[COLNAME={&BBNAMES}];
APPEND FROM BB; /* CREATE OUTPUT DATA SET */
CREATE C FROM CC[COLNAME={&CCNAMES}];
APPEND FROM CC; /* CREATE CHARACTER OUTPUT DATA SET */
RUN ANOVA; /* CREATE NEW T STATISTICS AND VIFS */
QUIT;
DATA D; MERGE &OUTDSN C;
PROC PRINT DATA=D;
/* PROC PLOT DATA=D;
   PLOT (&VARLIST)*K; */
RUN;

APPENDIX B
SAS OUTPUT
ORIGINAL DATA SET and RIDGE REGRESSION RESULTS

CHINA INPUT DATA (5 VARIABLES)

[Data listing: 25 annual observations, 1961-1985, of the variables AIR, ENG, GDI, GDP, GNS and TEL together with their logarithms. The multi-column listing is not reproduced legibly in this copy.]

CHINA RIDGE REGRESSION RESULTS (5 VARIABLES)

[Ridge trace: for each bias value K = 0.000 to 0.475 by 0.025, the output lists PRESSSS, ESS, MSE, CK, TRHK and the coefficients B0, LAIR, LENG, LGDI, LGNS and LTEL, with a label column marking K = 0 as "O.L.S.", iterations with any RVALUE above 1 as "OVER", K = 0.275 as "BEST", and later iterations as "UNDER". The row selected as BEST is:]

  K      PRESSSS   ESS       MSE       CK       TRHK     B0       LAIR     LENG     LGDI     LGNS     LTEL
 0.275   0.32419   0.26141   0.013758  58.203   1.6489   5.5153   0.1445   0.1964   0.1764   0.1711   0.0635

APPENDIX C
SAS LOG OUTPUT

1. THE LOG OUTPUT SHOWS EACH TIME BIAS (k) IS ADDED.
2. ANOVA RESULTS OF THE BEST RIDGE REGRESSION EQUATION APPEAR AT THE END.

[At each bias step the log prints the column vectors RVALUE and KBIAS together with PSTAR, TR and the R LABEL ("OVER" while any RVALUE exceeds 1, "UNDER" once all fall below 1). The step-by-step columns are not reproduced legibly in this copy. The final ANOVA output for the BEST equation prints:]

K = 0.275  PRESSSS = 0.3241925  MSE = 0.0137583  ESS = 0.2614074  CK = 58.203402  TRHK = 1.6488649

B0 = 5.5153418  LAIR = 0.1444754  LENG = 0.1963999  LGDI = 0.1764421  LGNS = 0.1710571  LTEL = 0.0635369

COEF       T           PVALUE
LAIR   11.720654    3.853E-10
LENG    3.2651576   0.0040725
LGDI   12.881383    7.756E-11
LGNS   13.903939    2.075E-11
LTEL    4.8729964   0.0001056

VIF: 0.456007  0.643119  0.2229594  0.2042108  0.5339643

Gary William Carr
NYNEX Corporation
335 Madison Ave., Room 2104
New York, NY 10017
(212) 310-7459
