Canonical Correlation Analysis
V.K. Bhatia
I.A.S.R.I., Library Avenue, New Delhi -110 012
vkbhatia@iasri.res.in
Analogous with ordinary correlation, canonical correlation squared is the percent of variance
in the dependent set explained by the independent set of variables along a given dimension
(there may be more than one). In addition to asking how strong the relationship is between
two latent variables, canonical correlation is useful in determining how many dimensions are
needed to account for that relationship. Canonical correlation finds the linear combination of
variables that produces the largest correlation with the second set of variables. This linear
combination, or "root," is extracted and the process is repeated for the residual data, with the
constraint that the second linear combination of variables must not correlate with the first one.
The process is repeated until a successive linear combination is no longer significant.
Canonical correlation is a member of the multivariate general linear hypothesis (MGLH) family
and shares many of the assumptions of multiple regression, such as linearity of relationships,
homoscedasticity (same level of relationship for the full range of the data), interval or near-
interval data, untruncated variables, proper specification of the model, lack of high
multicollinearity, and multivariate normality for purposes of hypothesis testing.
Often in applied research, scientists encounter variables of large dimensions and are faced
with the problem of understanding dependency structures, reduction of dimensionalities,
construction of a subset of good predictors from the explanatory variables, etc. Canonical
Correlation Analysis (CCA) provides us with a tool to attack these problems. However, its
appeal, and hence its motivation, seems to differ between theoretical statisticians and social
scientists. We deal here with the various motivations of CCA mentioned above and the related
statistical inference procedures.
We seek linear combinations $U = \alpha' X_1$ and $V = \beta' X_2$, with maximal correlation,
subject to the normalizations $\mathrm{Var}(U) = \alpha' \Sigma_{11} \alpha = 1$ and
$\mathrm{Var}(V) = \beta' \Sigma_{22} \beta = 1$, where
\[
\mathrm{Disp}(X) = \Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}
\]
is partitioned according to that of $X$ as above.
It follows that this maximum correlation, say $\rho_1$, is given by the positive square root of the
largest of the eigen roots $\rho_1^2 \ge \rho_2^2 \ge \cdots \ge \rho_{p_1}^2$ of
$\Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21}$ in the metric of $\Sigma_{11}$, i.e. of
$\Sigma_{11}^{-1} \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21}$. The vectors $\alpha$ and $\beta$
are then given by $\alpha_1$, $\beta_1$ such that
$\alpha_1' \Sigma_{11} \alpha_1 = \beta_1' \Sigma_{22} \beta_1 = 1$ and
\[
\begin{pmatrix} -\rho_1 \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & -\rho_1 \Sigma_{22} \end{pmatrix}
\begin{pmatrix} \alpha_1 \\ \beta_1 \end{pmatrix} = 0 \qquad (2.1)
\]
Alternatively, $\alpha$ and $\beta$ may be obtained as the eigenvector solutions, subject to the
same normalizations, from
\[
(\Sigma_{11}^{-1} \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21} - \rho^2 I)\,\alpha = 0, \qquad
(\Sigma_{22}^{-1} \Sigma_{21} \Sigma_{11}^{-1} \Sigma_{12} - \rho^2 I)\,\beta = 0 \qquad (2.2)
\]
so that one needs to solve only one of the two equations in (2.2), since
$\beta \propto \Sigma_{22}^{-1} \Sigma_{21} \alpha$.
$\rho_1$ is called the (first) canonical correlation between $X_1$ and $X_2$, and
$(U_1, V_1) = (\alpha_1' X_1, \beta_1' X_2)$ the pair of first canonical variates. If
$\Sigma_{ii}$, $i = 1$ or $2$, happens to be singular, one can use a g-inverse
$\Sigma_{ii}^{-}$ in place of $\Sigma_{ii}^{-1}$ above.
Note that when $p_1 = p_2 = 1$, $\rho_1$ is the usual Pearson product-moment correlation
coefficient between the scalar random variables $X_1$ and $X_2$; when $p_1 = 1$, $p_2 > 1$,
$\rho_1$ is the multiple correlation coefficient between the scalar $X_1$ and the vector $X_2$.
Sample analogues are trivially defined.
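The eigen-equation route of (2.2), and the $p_1 = p_2 = 1$ special case above, can be sketched
with numpy by substituting a sample covariance matrix for the population dispersion matrix.
This is a minimal illustration on simulated data; all variable names are ours, not from the text.

```python
import numpy as np

rng = np.random.default_rng(1)
Z = rng.normal(size=(500, 5))
Z[:, 3:] += Z[:, :2]                    # induce dependence between the two sets
S = np.cov(Z, rowvar=False)             # sample analogue of Disp(X)

p1 = 3                                  # X1 = first 3 variables, X2 = last 2
S11, S12 = S[:p1, :p1], S[:p1, p1:]
S21, S22 = S[p1:, :p1], S[p1:, p1:]

# Eigen roots of S11^{-1} S12 S22^{-1} S21 are the squared canonical correlations.
M = np.linalg.solve(S11, S12) @ np.linalg.solve(S22, S21)
rho_sq = np.sort(np.linalg.eigvals(M).real)[::-1]
rho = np.sqrt(np.clip(rho_sq, 0, 1))    # rho_1 >= rho_2 >= ...
print(rho)

# Special case p1 = p2 = 1: rho_1 reduces to Pearson's r (up to sign).
r = np.corrcoef(Z[:, 0], Z[:, 3])[0, 1]
m = (S[0, 3] / S[0, 0]) * (S[3, 0] / S[3, 3])
print(np.isclose(abs(r), np.sqrt(m)))
```

Only min(p1, p2) of the eigen roots can be nonzero, so the trailing roots come out numerically
close to zero here.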
Reduction of Dimensionality
In case p2 or p1 is large, it may become necessary to achieve a reduction of dimensionality but
without sacrificing much of the dependency between $X_1$ and $X_2$. We then seek further linear
combinations $U_i = \alpha_i' X_1$, $V_i = \beta_i' X_2$, $i = 1, 2, \ldots, r+1$, such that
$U_{r+1}$ and $V_{r+1}$ are maximally correlated among all linear combinations subject to having
unit variances and further subject to being uncorrelated with $U_1, V_1, \ldots, U_r, V_r$. It
turns out that $\mathrm{Corr}(U_{r+1}, V_{r+1}) = \rho_{r+1}$, and $\alpha_{r+1}, \beta_{r+1}$
are simply solutions of (2.1) with $\rho_1$ replaced by $\rho_{r+1}$.
When $\rho_{k+1}$ is judged to be insignificant compared to zero for some $k+1$, one may then
retain only $(U_i, V_i)$, $i = 1, 2, \ldots, k$, for further analysis in place of the original,
presumably much larger, set of $p = p_1 + p_2$ variables. Note, however, that information on all
$p_1 + p_2$ variables $X_1$ and $X_2$ is still needed even to construct these $2k$ new variables.
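The text does not name a specific significance procedure for judging the remaining roots; one
common choice is Bartlett's sequential chi-square approximation, sketched below. The function
name and the sample values of the canonical correlations are illustrative only.

```python
import numpy as np
from scipy.stats import chi2

def bartlett_cca_test(rhos, n, p1, p2):
    """Sequential test of H0: rho_{k+1} = ... = 0 for k = 0, 1, ...; returns p-values."""
    rhos = np.asarray(rhos, dtype=float)
    pvals = []
    for k in range(len(rhos)):
        lam = np.prod(1.0 - rhos[k:] ** 2)            # Wilks' lambda for remaining roots
        stat = -(n - 1 - (p1 + p2 + 1) / 2.0) * np.log(lam)
        df = (p1 - k) * (p2 - k)
        pvals.append(chi2.sf(stat, df))
    return pvals

# Example: two strong roots and one negligible root, with n = 100, p1 = p2 = 3.
pvals = bartlett_cca_test([0.8, 0.6, 0.05], n=100, p1=3, p2=3)
print(pvals)
```

One stops at the first $k$ for which the p-value exceeds the chosen level and retains the first
$k$ pairs of canonical variates.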
where "varlist" is one of two lists of numeric variables. Output will be saved to a file called
"cc_tmp2.sav," which will contain the canonical scores as new variables along with the
original data file. These scores will be labeled s1_cv1 and s2_cv1, s1_cv2 and s2_cv2, and the
like, standing for the scores on the two canonical variables associated with each canonical
correlation. The macro will create two canonical variables for a number of canonical
correlations equal to the smaller number of variables in SET1 or SET2.
- OVERALS, which is part of the SPSS Categories module, computes nonlinear canonical
correlation analysis on two or more sets of variables.
The variance explained in a set of original variables by a canonical variable (the mean of the
squared structure correlations for the
canonical variable) is not at all the same as the canonical correlation, which has to do with
the correlation between the weighted sums of the two sets of variables. Put another way,
the canonical correlation does not tell us how much of the variance in the original
variables is explained by the canonical variables. Instead, that is determined on the basis
of the squares of the structure correlations.
Canonical coefficients can be used to explain with which original variables a canonical
correlation is predominantly associated. The canonical coefficients are standardized
coefficients and (like beta weights in regression) their magnitudes can be compared.
In SPSS output, the canonical coefficients are listed as columns and the variables in a set as
rows. Some researchers simply note the variables with the highest coefficients to determine
which variables are associated with which canonical correlation, and use this as the basis for
inducing the meaning of the dimension represented by that canonical correlation.
However, Levine (1977) argues against the procedure above on the ground that the canonical
coefficients may be subject to multicollinearity, leading to incorrect judgments. Also, because
of suppression, a canonical coefficient may even have a different sign compared to the
correlation of the original variable with the canonical variable. Therefore, instead, Levine
recommends interpreting the relations of the original variables to a canonical variable in terms
of the correlations of the original variables with the canonical variables - that is, by structure
coefficients. This is now the standard approach.
The table above shows that, for the first canonical correlation, although the independent
canonical variable explains 47.15% of the variance in the dependent canonical variable, the
independent canonical variable is able to predict only 11.29% of the variance in the
individual original dependent variables. Also, the dependent canonical variable predicts only
23.94% of the variance in the individual original dependent variables. Similar statements
could be made about the second canonical correlation (row 2).
               Canonical correlation
Variable          1          2
Y1             0.1510     0.1526
Y2             0.0280     0.0305
Y3             0.1596     0.1610
In the table above, the columns represent the canonical correlations and the rows represent
the original dependent variables, three in this case. The R-squareds are the percent of
variance in each original dependent variable explained by the independent canonical
variables. A similar table for the independent variables and the dependent canonical
variables is also output by SAS but is not reproduced here.
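A per-variable R-squared table of this kind can be reproduced as cumulative squared
cross-loadings: the squared correlations of each original dependent variable with the
independent canonical variables, accumulated across the roots. A hedged numpy sketch on
simulated data (all names illustrative; the exact SAS computation may differ in detail):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 3))                                 # independent set
Y = X @ rng.normal(size=(3, 3)) + rng.normal(size=(300, 3))   # dependent set

# Canonical weights for the independent set from the eigen-equation (2.2).
S = np.cov(np.hstack([X, Y]), rowvar=False)
p1 = X.shape[1]
S11, S12, S21, S22 = S[:p1, :p1], S[:p1, p1:], S[p1:, :p1], S[p1:, p1:]
vals, vecs = np.linalg.eig(np.linalg.solve(S11, S12) @ np.linalg.solve(S22, S21))
A = vecs[:, np.argsort(-vals.real)].real
U = (X - X.mean(axis=0)) @ A[:, :2]          # independent canonical variables

# Squared cross-loadings corr(Y_j, U_i)^2, accumulated across the roots.
r2 = np.array([[np.corrcoef(Y[:, j], U[:, i])[0, 1] ** 2
                for i in range(2)] for j in range(Y.shape[1])])
table = np.cumsum(r2, axis=1)
print(table)    # rows: Y1..Y3; columns: first root, then first two roots together
```

Each row grows (weakly) across columns, matching the near-equal column pairs in the table above,
where the second root adds little explained variance.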
OVERALS uses optimal scaling, which quantifies categorical variables and then treats them as
numerical variables, including applying nonlinear transformations to find the best-fitting
model. For nominal variables, the order of the categories is not retained, but values are created
for each category such that goodness of fit is maximized. For ordinal variables, order is
retained and values maximizing fit are created. For interval variables, order is retained as are
equal distances between values.
Obtain OVERALS from the SPSS menu by selecting Analyze, Data Reduction, Optimal
Scaling; Select Multiple sets; Select either Some variable(s) not multiple nominal or All
variables multiple nominal; click Define; define at least two sets of variables; define the value
range and measurement scale (optimal scaling level) for each selected variable. SPSS output
includes frequencies, centroids, iteration history, object scores, category quantifications,
weights, component loadings, single and multiple fit, object scores plots, category coordinates
plots, component loadings plots, category centroids plots, and transformation plots.
Tip: To minimize output, use the Automatic Recode facility on the Transform menu to create
consecutive categories beginning with 1 for variables treated as nominal or ordinal. To
minimize output, for each variable scaled at the numerical (integer) level, subtract the smallest
observed value from every value and add 1.
Warning: Optimal scaling recodes values on the fly to maximize goodness of fit for the given
data. As with any atheoretical, post-hoc data mining procedure, there is a danger of overfitting
the model to the given data. Therefore, it is particularly appropriate to employ cross-
validation, developing the model for a training dataset and then assessing its generalizability
by running the model on a separate validation dataset.
The SPSS manual notes, "If each set contains one variable, nonlinear canonical correlation
analysis is equivalent to principal components analysis with optimal scaling. If each of these
variables is multiple nominal, the analysis corresponds to homogeneity analysis. If two sets of
variables are involved and one of the sets contains only one variable, the analysis is identical
to categorical regression with optimal scaling."
Reference
Levine, Mark S. (1977). Canonical Analysis and Factor Comparison. Thousand Oaks, CA:
Sage Publications, Quantitative Applications in the Social Sciences Series, No. 6.