
Factor analysis

Introduction 2
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
Factor analysis model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
Factor analysis model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
Variance of x_i . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
Covariance matrix of x. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
Non-uniqueness of factor loadings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
Non-uniqueness of factor loadings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
Principal factor analysis 11
Procedure - initialization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
Idea . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
Procedure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
Constraint 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
Heywood cases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
Maximum likelihood estimation 18
MLE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
Testing for number of factors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
Factor rotation 22
Some general comments. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
What do we look for? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
Two types of rotations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
Types of rotations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
Estimating/predicting factor scores 28
Random vs. deterministic factor scores . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
Deterministic factor scores: Bartlett's method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
Random factor scores: Thompson's method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
Factor analysis vs. PCA 33
Common properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
Differences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
Introduction 2 / 35
Introduction
I In social sciences (e.g., psychology), it is often not possible to measure the variables of interest
directly. Examples:
N Intelligence
N Social class
Such variables are called latent variables or common factors.
I Researchers examine such variables indirectly, by measuring variables that can be measured and that
are believed to be indicators of the latent variables of interest. Examples:
N Examination scores on various tests
N Occupation, education, home ownership
Such variables are called manifest variables or observed variables.
I Goal: study the relationship between the latent variables and the manifest variables
3 / 35
Factor analysis model
I Multiple linear regression model:
x_1 = λ_11 f_1 + · · · + λ_1k f_k + u_1
x_2 = λ_21 f_1 + · · · + λ_2k f_k + u_2
...
x_p = λ_p1 f_1 + · · · + λ_pk f_k + u_p
where
N x = (x_1, . . . , x_p)' are the observed variables (random)
N f = (f_1, . . . , f_k)' are the common factors (random)
N u = (u_1, . . . , u_p)' are called specific factors (random)
N λ_ij are called factor loadings (constants)
4 / 35
Factor analysis model
I In short: x = Λf + u, where Λ is the p × k matrix containing the λ_ij's.
I Difference with multiple regression: the common factors f_1, . . . , f_k are unobserved.
I Assumptions:
N E(x) = 0 (if this is not the case, simply subtract the mean vector)
N E(f) = 0, Cov(f) = I
N E(u) = 0, Cov(u_i, u_j) = 0 for i ≠ j
N Cov(f, u) = 0
5 / 35
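A small simulation sketch (not part of the course material; the 6 × 2 loading matrix below is made up) showing data generated according to the model x = Λf + u and the implied covariance structure:

set.seed(1)
n      <- 5000
Lambda <- matrix(c(0.9, 0.8, 0.7, 0.0, 0.0, 0.0,
                   0.0, 0.0, 0.0, 0.8, 0.7, 0.6), ncol = 2)   # hypothetical 6 x 2 loadings
Psi    <- diag(1 - rowSums(Lambda^2))            # specific variances (variables have variance 1)
f <- matrix(rnorm(2 * n), n, 2)                  # common factors: E(f) = 0, Cov(f) = I
u <- matrix(rnorm(6 * n), n, 6) %*% sqrt(Psi)    # specific factors: Cov(u) = Psi
x <- f %*% t(Lambda) + u                         # each row follows x = Lambda f + u
round(cov(x) - (Lambda %*% t(Lambda) + Psi), 2)  # approximately 0 (sampling noise only)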
Variance of x_i
I Notation:
N Cov(u) = Ψ = diag(ψ_11, . . . , ψ_pp)
N Cov(x) = Σ
I Then (see board; a sketch of the derivation is given below):
N σ_ii = Var(x_i) = ∑_{j=1}^k λ_ij² + ψ_ii
N Var(x_i) consists of two parts:
I h_i² = ∑_{j=1}^k λ_ij², called the communality of x_i, represents the variance of x_i that is shared with the other variables via the common factors
I ψ_ii, called the specific or unique variance, represents the variance of x_i that is not shared with the other variables
6 / 35
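A sketch of the board derivation: writing x_i = ∑_{j=1}^k λ_ij f_j + u_i and using Cov(f) = I, Cov(u_i, u_j) = 0 for i ≠ j, and Cov(f, u) = 0,

Var(x_i) = ∑_{j=1}^k λ_ij² Var(f_j) + Var(u_i) = ∑_{j=1}^k λ_ij² + ψ_ii = h_i² + ψ_ii,

since all cross-covariance terms vanish under these assumptions.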
Covariance matrix of x
I Note that (see board):
N σ_ij = Cov(x_i, x_j) = ∑_{ℓ=1}^k λ_iℓ λ_jℓ
I Hence, the factor model leads to: Σ = ΛΛ' + Ψ
I The reverse is also true: if one can decompose Σ in this form, then the k-factor model holds for x
7 / 35
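A sketch of the corresponding board argument in matrix form: since Cov(f) = I, Cov(u) = Ψ and Cov(f, u) = 0,

Σ = Cov(x) = Cov(Λf + u) = Λ Cov(f) Λ' + Cov(u) = ΛΛ' + Ψ.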
4
Non-uniqueness of factor loadings
I Suppose that the k-factor model holds for x: x = Λf + u
I Let G be a k × k orthogonal matrix.
I Then x = ΛGG'f + u.
I Note that G'f satisfies the assumptions that we made about the common factors (see board).
I Hence the k-factor model holds with factors G'f and factor loadings ΛG.
I Σ = (ΛG)(ΛG)' + Ψ = ΛΛ' + Ψ
I Hence, factors f with loadings Λ, or factors G'f with loadings ΛG, are equivalent for explaining the covariance matrix of the observed variables (a small numerical check is given below).
8 / 35
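A small numerical check (not from the slides; the loading matrix and rotation angle are arbitrary) that a rotated loading matrix reproduces exactly the same covariance matrix:

Lambda <- matrix(c(0.9, 0.8, 0.1,
                   0.1, 0.2, 0.7), ncol = 2)        # hypothetical 3 x 2 loadings
Psi    <- diag(c(0.18, 0.32, 0.50))                 # hypothetical specific variances
theta  <- pi / 6
G      <- matrix(c(cos(theta), -sin(theta),
                   sin(theta),  cos(theta)), 2, 2)  # k x k orthogonal (rotation) matrix
Sigma1 <- Lambda %*% t(Lambda) + Psi
Sigma2 <- (Lambda %*% G) %*% t(Lambda %*% G) + Psi
all.equal(Sigma1, Sigma2)                           # TRUE: same Sigma, different loadings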
Non-uniqueness of factor loadings
I Non-uniqueness can be resolved by imposing an extra condition. For example:
N Λ'Ψ⁻¹Λ is diagonal with its elements in decreasing order (constraint 1)
N Λ'D⁻¹Λ is diagonal with its elements in decreasing order, where D = diag(σ_11, . . . , σ_pp) (constraint 2)
9 / 35
Estimation
I Σ is usually estimated by S (or often: the correlation matrix is estimated by R).
I Given S (or R), we need to find estimates Λ̂ and Ψ̂ that satisfy constraint 1 or 2, so that S (or R) ≈ Λ̂Λ̂' + Ψ̂.
I Note that typically, the number of parameters in Λ̂ and Ψ̂ is smaller than the number of parameters in S. Hence, there is no exact solution in general.
I Two main methods to estimate Λ̂ and Ψ̂:
N principal factor analysis
N maximum likelihood estimation (requires normality assumption)
I In practice, we also need to determine the value of k, the number of factors.
10 / 35
5
Principal factor analysis 11 / 35
Procedure - initialization
I Estimate correlation matrix by R
I Make preliminary estimates ĥ_i² of the communalities h_i², using:
N The square of the multiple correlation coefficient of the ith variable with all the other variables (a one-line computation is sketched below), or
N The largest correlation coefficient between the ith variable and one of the other variables
12 / 35
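A sketch of the SMC-based starting values, using the built-in swiss data purely as a stand-in for the course data: the squared multiple correlation of variable i with the other variables equals 1 − 1/(R⁻¹)_ii.

R       <- cor(swiss)              # any numeric data set; 'swiss' ships with R
h2.init <- 1 - 1 / diag(solve(R))  # preliminary communality estimates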
Idea
I Given R (p × p), we want to find Ψ̂ (p × p) and Λ̂ (p × k) that satisfy constraint 2, so that R ≈ Λ̂Λ̂' + Ψ̂.
I We look at R − Ψ̂, because we are interested in explaining the (co)variances that are shared through the common factors.
I R − Ψ̂ is symmetric. Hence there is a spectral decomposition R − Ψ̂ = GAG' = ∑_{i=1}^p a_i g_(i) g_(i)'.
I If the first k eigenvalues are positive, and the remaining ones are close to zero, then
R − Ψ̂ ≈ ∑_{i=1}^k a_i g_(i) g_(i)' = ∑_{i=1}^k (a_i^{1/2} g_(i))(a_i^{1/2} g_(i))'.
I ΛΛ' = ∑_{i=1}^k λ_(i) λ_(i)', where λ_(i) is the ith column of Λ. Hence, a natural estimate for λ_(i) is λ̂_(i) = a_i^{1/2} g_(i).
I In matrix form: Λ̂ = G_1 A_1^{1/2}.
13 / 35
Procedure
I Determine the spectral decomposition of the reduced correlation matrix R − Ψ̂, where the ones on the diagonal are replaced by ĥ_i² = 1 − ψ̂_ii. Thus, R − Ψ̂ = GAG', where A = diag(a_1, . . . , a_p) contains the eigenvalues of R − Ψ̂, a_1 ≥ . . . ≥ a_p, and G contains the corresponding orthonormal eigenvectors.
I Estimate Λ by Λ̂ = G_1 A_1^{1/2}, where G_1 = (g_(1), . . . , g_(k)) and A_1 = diag(a_1, . . . , a_k).
I Estimate the specific variances ψ_ii by ψ̂_ii = 1 − ∑_{j=1}^k λ̂_ij², i = 1, . . . , p.
I Stop, or repeat the above steps until some convergence criterion has been reached (the whole procedure is sketched in code below).
14 / 35
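A rough sketch of this procedure in R (my own illustration, not the course's R code; the function name, the SMC initialization and the fixed iteration count are choices of this sketch):

principal.factor <- function(R, k, n.iter = 25) {
  h2 <- 1 - 1 / diag(solve(R))                   # initial communalities (SMC)
  for (it in 1:n.iter) {
    Rr <- R
    diag(Rr) <- h2                               # reduced correlation matrix R - Psi-hat
    eig <- eigen(Rr, symmetric = TRUE)
    G1  <- eig$vectors[, 1:k, drop = FALSE]      # first k eigenvectors
    A1  <- pmax(eig$values[1:k], 0)              # first k eigenvalues (truncated at 0)
    Lambda <- G1 %*% diag(sqrt(A1), k)           # Lambda-hat = G1 A1^{1/2}
    h2 <- rowSums(Lambda^2)                      # updated communalities
  }
  list(loadings = Lambda, psi = 1 - h2)          # specific variances: psi-hat_ii = 1 - h_i^2
}
pf <- principal.factor(cor(swiss), k = 2)        # example call on the 'swiss' data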
6
Constraint 2
I D = diag(σ_11, . . . , σ_pp) = I, because working with the correlation matrix is equivalent to working with standardized variables.
I Hence, Λ̂ satisfies constraint 2: Λ̂'D⁻¹Λ̂ = Λ̂'Λ̂ = (A_1^{1/2} G_1')(G_1 A_1^{1/2}) = A_1 is diagonal with decreasing elements.
15 / 35
Heywood cases
I It can happen that ψ̂_ii < 0 or ψ̂_ii > 1.
I This makes no sense:
N ψ_ii is a variance, so it must be nonnegative.
N Working with the correlation matrix means we are working with standardized variables. So Var(x_i) = 1, and Var(u_i) = ψ_ii cannot exceed 1.
I Such cases are called Heywood cases.
16 / 35
Example
I See R-code.
17 / 35
Maximum likelihood estimation 18 / 35
MLE
I Assume that X has a multivariate normal distribution
I Then the log likelihood function (plugging in x̄ for μ) is (see board):
l(Σ) = −(1/2) n log |2πΣ| − (1/2) n tr(Σ⁻¹ S)
I Regard Σ = ΛΛ' + Ψ as a function of Λ and Ψ, and maximize the log likelihood function over Λ and Ψ.
I Optimization is done iteratively:
N For fixed Ψ, one can maximize analytically over Λ
N For fixed Λ, one can maximize numerically over Ψ
I This method is used by the R-function factanal().
I This method can also have problems with Heywood cases.
19 / 35
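A minimal usage sketch of factanal() (the swiss data are used only as an illustration):

fa <- factanal(swiss, factors = 2, rotation = "none")
fa$loadings       # estimated factor loadings Lambda-hat
fa$uniquenesses   # estimated specific variances psi-hat_ii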
7
Testing for number of factors
I An advantage of the MLE method is that it allows one to test whether the number of factors is sufficient:
N Null hypothesis: k factors are sufficient
N Alternative hypothesis: k factors are not sufficient
N p-value < 0.05 means ...
I Often a sequential testing procedure is used: start with 1 factor and then increase the number of factors one at a time until the test does not reject the null hypothesis (sketched in code below).
I It can occur that the test always rejects the null hypothesis. This is an indication that the model does not fit well (or that the sample size is very large).
20 / 35
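A sketch of the sequential procedure, assuming a numeric data frame (here called dat, a hypothetical name) and at most 4 factors:

for (k in 1:4) {
  fit <- factanal(dat, factors = k)
  cat("k =", k, " p-value =", fit$PVAL, "\n")
  if (fit$PVAL > 0.05) break   # stop at the first k that the test does not reject
}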
Example
I See R-code
21 / 35
Factor rotation 22 / 35
Some general comments
I In factor rotation, we look for an orthogonal matrix G such that the rotated factor loadings Λ* = ΛG can be more easily interpreted than the original factor loadings Λ.
I Is it a good idea to look for such rotations?
N Cons: One can keep rotating the factors until one finds an interpretation that one likes.
N Pros: Factor rotation does not change the overall structure of a solution. It only changes how the solution is described, and finds the simplest description.
23 / 35
What do we look for?
I Factor loadings can often be easily interpreted if:
N Each variable is highly loaded on at most one factor.
N All factor loadings are either large and positive, or close to zero.
24 / 35
8
Two types of rotations
I Orthogonal rotation: the factors are restricted to be uncorrelated.
I Oblique rotation: the factors may be correlated.
I Advantage of orthogonal rotation: For orthogonal rotation (based on standardized variables), the
factor loadings represent correlations between factors and observed variables (see board). This is not
the case for oblique rotations.
I Advantage of oblique rotation: it may be unrealistic to assume that the factors are uncorrelated. One may obtain a better fit by dropping this assumption.
25 / 35
Types of rotations
I Orthogonal:
N Varimax: default in factanal(). Aims at factors with a few large loadings, and many
near-zero loadings.
N Quartimax: not implemented in base R.
I Oblique:
N Promax: use option rotation="promax" in factanal(). Aims at simple structure with low
correlation between factors.
N Oblimin: not implemented in base R.
26 / 35
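A brief usage sketch (again with the swiss data as a stand-in):

factanal(swiss, factors = 2)                        # varimax rotation (the default)
factanal(swiss, factors = 2, rotation = "promax")   # oblique promax rotation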
Example
I See R-code
27 / 35
Estimating/predicting factor scores 28 / 35
Random vs. deterministic factor scores
I So far, we considered the factor scores to be random. This is appropriate when we think of different samples consisting of different individuals, and we are interested in the general structure.
I One can also consider the factor scores to be deterministic. That is appropriate when we are interested in a specific group of individuals.
29 / 35
9
Deterministic factor scores: Bartlett's method
I Assume normality, and suppose that Λ and Ψ are known.
I Denote the factor scores for the ith individual by f_i.
I Then x_i given f_i is normally distributed with mean Λf_i and covariance matrix Ψ.
I Hence, the log likelihood for one observation x_i is given by
−(1/2) log |2πΨ| − (1/2) (x_i − Λf_i)' Ψ⁻¹ (x_i − Λf_i).
I Setting the derivative with respect to f_i equal to zero gives (see board):
f̂_i = (Λ'Ψ⁻¹Λ)⁻¹ Λ'Ψ⁻¹ x_i.
30 / 35
Random factor scores: Thompson's method
I Consider f to be random, i.e., f has a normal distribution with mean 0 and covariance matrix I.
I Then (f', x')' has a multivariate normal distribution with mean vector 0 and covariance matrix
[ I    Λ'       ]
[ Λ    ΛΛ' + Ψ  ]
I Then f|x has distribution N(Λ'Σ⁻¹x, I − Λ'Σ⁻¹Λ) (see board).
I Hence, a natural estimator for f_i is f̂_i = Λ'Σ⁻¹x_i.
31 / 35
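Both types of scores are available directly from factanal() via its scores argument; a brief sketch, with swiss again as a stand-in:

fb <- factanal(swiss, factors = 2, scores = "Bartlett")    # Bartlett's method
ft <- factanal(swiss, factors = 2, scores = "regression")  # Thompson's (regression) method
head(fb$scores)
head(ft$scores)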
Examples
I Both methods have advantages and disadvantages, no clear favorite.
I See examples in R-code.
32 / 35
Factor analysis vs. PCA 33 / 35
Common properties
I Both methods are mostly used in exploratory data analysis.
I Both methods try to obtain dimension reduction: explain a data set in a smaller number of variables.
I Both methods don't work if the observed variables are almost uncorrelated:
N Then PCA returns components that are similar to the original variables.
N Then factor analysis has nothing to explain, i.e., ψ_ii is close to 1 for all i.
I Both methods give similar results if the specic variances are small.
I If the specific variances are assumed to be zero in principal factor analysis, then PCA and factor analysis are the same.
34 / 35
Differences
I PCA requires virtually no assumptions.
Factor analysis assumes that the data come from a specific model.
I In PCA, the emphasis is on transforming the observed variables to principal components.
In factor analysis, emphasis is on the transformation from factors to observed variables.
I PCA is not scale invariant.
Factor analysis (with MLE) is scale invariant.
I In PCA, considering k + 1 instead of k components does not change the first k components.
In factor analysis, considering k + 1 instead of k factors may change the first k factors (when using the MLE method).
I Calculation of PCA scores is straightforward.
Calculation of factor scores is more complex.
35 / 35