
© 2017, Park Hun Myoung http://sonsoo.org/

Factor Analysis for Questionnaire Survey Data:
Exploratory and Confirmatory Factor Analysis

Hun Myoung Park (kucc625@iuj.ac.jp)
International University of Japan

This document summarizes the essentials of questionnaire surveys and illustrates how to conduct
factor analysis of survey data.

1. Gist of Questionnaire Survey

A good questionnaire survey is difficult to prepare and conduct, and many critical issues must
be handled with care. A questionnaire survey has three stages before data analysis: survey
sampling, questionnaire design, and survey administration.

1.1 Survey sampling

The first stage is to clearly define the population (the group of respondents of interest) and
draw a representative sample of adequate size. The population follows from the research
question, the scope of the research, and the unit of analysis. The population should be defined
specifically (e.g., female citizens aged 25-60 who live in Niigata Prefecture as of
January 1, 2017).

Then the researcher has to choose an appropriate sampling method so that the sample is
representative of the population. Random sampling excludes arbitrariness on the part of the
researcher (it rules out subjective judgment and manipulation in sampling). A convenience
sample, taken by a researcher at his or her convenience (to save time and money), is not
necessarily representative and should not be used for scientific research (inferences from a
convenience sample may not be considered scientific). Random sampling methods range from
simple random sampling to more sophisticated designs.
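
As a minimal sketch of simple random sampling in Stata, the commands below draw 20 units at random from a list of units. The dataset auto.dta that ships with Stata stands in for a real sampling frame, and the seed value and the sample size of 20 are arbitrary choices made for illustration only.

```stata
* Minimal sketch: simple random sampling from a sampling frame.
* auto.dta (shipped with Stata) is only a stand-in for a real frame.
sysuse auto, clear

* Fix the random-number seed so that the draw can be reproduced
set seed 2017

* Keep 20 randomly chosen observations (sampling without replacement)
sample 20, count

* Confirm the size of the resulting sample
count
```

Because every unit in the frame has the same chance of being kept, the researcher's judgment plays no role in who ends up in the sample.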

The number of observations in a random sample is determined by the effect size, the
significance level, and the specific data analysis method. The more sophisticated the method,
the larger the required sample size. For instance, a t-test and OLS need different minimum
sample sizes.
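
The dependence of sample size on effect size can be illustrated with Stata's power command (available in Stata 13 and later). This is only a sketch for a two-sample t-test; the means, the standard deviation, and the statistical power of .80 are assumed values chosen for illustration and do not come from this document.

```stata
* Required sample size for a two-sample t-test, two-sided,
* at the 5 percent significance level with power .80 (assumed values)

* Medium effect: group means 0 and 0.5, common standard deviation 1
power twomeans 0 0.5, sd(1) alpha(0.05) power(0.8)

* Small effect: group means 0 and 0.2, common standard deviation 1
power twomeans 0 0.2, sd(1) alpha(0.05) power(0.8)
```

The second call should report a much larger required sample than the first, since a smaller effect is harder to detect at the same significance level.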

1.2 Questionnaire Design

A survey asks other people about things that you do not know. Accordingly, this method is
useless if the majority of the population does not know much about the subject (most
respondents will say “I don't know”).

In a questionnaire form, a researcher prepares at least three sections. The first section states
the name of the research (with contact information), the purposes of the research, the use of
responses (data security and confidentiality), and instructions on how to fill out the form.
Never ask for respondents’ names, telephone numbers, addresses, signatures, or anything else
that can identify them; otherwise, the survey will be biased because respondents feel uneasy
about the lack of confidentiality.


The second section asks about the socio-economic status of a respondent: age, gender, income,
education, marital status, location, type of employment, etc. But do not ask for exact values
of, for example, age and income, since respondents may feel uncomfortable answering such
questions honestly. Instead, ask for a class (range) of each item, like “Age 20-25” and
“$50,000-$90,000.”

Finally, substantive questions are listed. Specific questions are designed to measure the key
concepts used to answer the research questions. Do not add questions for fun that are not
directly related to the research question. You need to ask, “Which specific questions are
needed to measure concept (construct) A?” Wording is a critical issue in a questionnaire
survey. Therefore, researchers must put considerable effort into revising questions through,
for instance, a pretest of the survey.

Since most respondents are not very willing to answer questions sincerely, researchers must
design the questionnaire form carefully. Several suggestions: 1) do not ask many questions;
2) do not ask unnecessary questions (avoid questions with an obvious socially desirable
answer, like “Do you think murder is good?”); 3) mix the order of questions (if Q1-Q5 measure
engagement, list the questions as, for example, Q7, Q5, Q1, Q9, Q3, Q8, Q4, Q2 rather than
Q1, Q2, Q3, Q4, Q5); 4) reverse the order of possible answers for some questions (e.g., use
“Strongly Disagree, Disagree, Agree, Strongly Agree”), and recode these questions in reverse
before data analysis; 5) avoid a neutral or indifferent choice by using a 4-point, 6-point, or
8-point Likert scale (but you may add “I don’t know” separately).
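
The reverse recoding mentioned in suggestion 4 can be sketched in Stata as follows. The variable names q1 and q2 and the four records are hypothetical; the coding (1 = Strongly Agree through 4 = Strongly Disagree, 0 = Do Not Know) follows the sample form shown in this section, and q2 is assumed to be the item whose answers were printed in reverse order.

```stata
* Hypothetical data entry for two 4-point items (0 = "Do Not Know")
clear
input id q1 q2
1 1 4
2 2 3
3 0 1
4 4 2
end

* Treat "Do Not Know" as missing rather than as a point on the scale
mvdecode q1 q2, mv(0)

* q2 was printed in reverse order, so flip its codes into a new variable
recode q2 (1=4) (2=3) (3=2) (4=1), generate(q2_r)

* Compare the original and the recoded item
list id q1 q2 q2_r
```

Keeping the original q2 and writing the result to q2_r makes it easy to check the recoding against the paper forms.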

Your substantive survey questions may look like this:

Question                                        Strongly  Agree  Disagree  Strongly  Do Not
                                                Agree                      Disagree  Know

My performance appraisal is a fair reflection      1        2       3         4        0
of my performance
I have trust and confidence in my supervisor       1        2       3         4        0
My workload is reasonable                          1        2       3         4        0
…                                                  1        2       3         4        0

Note that the first and third questions are related to performance appraisal and the second
one to supervision and leadership. The numbers assigned to strongly agree, agree, and so on
are used for data entry.

1.3 Survey Administration

Once a survey questionnaire is ready, you need to find a way to distribute it properly. You
need to consider pros and cons of each distribution method (e.g., mailing and direct
distribution). You need to prepare a return envelope with a stamp attached in case of mailing.
You must provide sufficient instructions on answering questions (including time limit) and
returning the questionnaire form.

Monitor how respondents answer and return the form from time to time. You must be ready for
questions from respondents and check who answered and who did not. If necessary, you may
remind those who did not answer that their responses are still missing and explain how
important their responses are to your research (but do not be aggressive or rude).

You also need to choose a suitable method for collecting questionnaire forms. You may ask
respondents to put the form into a box placed at a designated location, or visit respondents
and collect the forms yourself (a less confidential way).

After collecting questionnaire forms, you must check and set aside badly answered forms,
such as those with incomplete answers or uniform answer patterns (e.g., 1, 1, 1, 1, 1, …).
Then assign a serial number to each questionnaire form so that you can check data entry
later. Finally, calculate the final response rate.
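
The screening and the response rate described above might be sketched in Stata as follows. The item names q1-q5, the three records, and the counts of distributed and usable forms are all hypothetical; the rule used here (flag a form if any item is missing or if all answers are identical) is one possible operationalization, not a rule stated in this document.

```stata
* Hypothetical entered forms: id is the serial number written on each form
clear
input id q1 q2 q3 q4 q5
1 1 1 1 1 1
2 2 3 . 1 4
3 4 3 2 2 1
end

* Count unanswered items and measure the spread of answers within a form
egen n_missing = rowmiss(q1-q5)
egen sd_items  = rowsd(q1-q5)

* Flag incomplete forms and forms with identical answers throughout
generate byte flag = (n_missing > 0) | (sd_items == 0)
list id n_missing sd_items flag

* Final response rate from hypothetical counts of forms
scalar distributed = 500
scalar usable      = 320
scalar rate        = 100*usable/distributed
display "Response rate: " %5.1f rate " percent"
```

Flagged forms still need a human decision; the flag only points to forms worth a second look.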

2. An Example of a Mental Model

Suppose you have a mental model that has two concepts (constructs) of economic value and
moral value. These concepts are called latent variables or factors in a sense that they are not
observable in reality. You cannot observe economic and moral values although you have
some senses of these concepts.

Figure 1. A Measurement Model with Two Latent Variables

Now you want to measure both economic and moral values. Unfortunately, you cannot
directly measure these concepts because they are not observable. But here is an alternative to
measure the latent variables indirectly.

Suppose you observe some phenomena that are assumed to manifest two factors. That is, a
group of phenomena is caused (influenced) by the economic value (factor) and the other
group by moral value (factor). These phenomena are captured by observed variables.

In the figure above, private ownership, government responsibility, and competition manifest
(or are caused by) the economic value, whereas the moral value is manifested by
homosexuality, legalized abortion, and assisted suicide. Notice that the variables from private
ownership through assisted suicide are observable, but economic and moral values are not.

However, there are random components in these causal relationships. For instance, private
ownership (observed variable) is explained by the economic value (latent variable) but the
economic value alone cannot explain all variation of private ownership. The portion of


variation that the economic value cannot explain is a random part of the manifest variable.
This random component is labeled as e1 or s1. The impact of the latent variable on the
observed variable is represented by b1. This relationship is described in the ordinary least
squares (OLS) style as follows.

Private ownership = a1 + (b1 × economic value) + e1

3. What Do We Want To Know?

The question here is whether the mental model in your mind is correct. If your mental model
is supported by statistical inference, you will have reliable measures of the economic and
moral values. These latent variables are not observable but are measured indirectly through
observed (manifest) variables.

The implication of this method is that you can reduce data (draw only two variables out of six
observed variables). Once you reduce data from many observed variables to several latent
variables successfully, you can use these latent variables as dependent and independent
variables in quantitative methods like OLS. If your mental model turns out incorrect, you
have to modify your model and test it out again.

Factor analysis is a data reduction technique that examines the relationship between
observed and latent variables (factors). The model that links manifest variables to
unobserved factors is called a measurement model. Here is a summary of related terminology.

Term                 Alternative Names               Role/Relationship

Latent variables     Factors, constructs, concepts   Cause (manifested by) observed variables
Observed variables   Manifest variables              Manifest (caused by) latent variables

The fundamental questions of factor analysis are (1) to what extent observed variables are
significantly caused by factors (that is, whether your mental model, the measurement model,
is confirmed) and (2) how to aggregate observed variables if they turn out to be
significantly influenced by the corresponding latent variables.

There are two approaches to confirm your mental model: exploratory factor analysis (EFA)
and confirmatory factor analysis (CFA). EFA does not impose any constraints on the model,
while CFA places substantive constraints. EFA is data driven, but CFA is theory driven.

Once your measurement model turns out to be statistically supported, you may calculate factor
scores of the latent variables on the basis of the factor analysis. Or you can simply compute
a factor-based score, that is, the average of the related observed variables (create a
variable that holds each subject's mean of the three variables belonging to a factor and
then calculate the average of the new variable). CFA does not produce any factor scores.

4. Model Specification

Our mental model is represented as follows (Albright and Park 2009: 3). Assume that all
latent and observed variables are mean centered (transformed to have deviation from their
means) in order to eliminate intercept terms.

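\[ X = \Lambda \xi + \delta \]

or

\[
\begin{aligned}
x_1 &= \lambda_{11}\xi_1 + \delta_1, & x_2 &= \lambda_{21}\xi_1 + \delta_2, & x_3 &= \lambda_{31}\xi_1 + \delta_3, \\
x_4 &= \lambda_{42}\xi_2 + \delta_4, & x_5 &= \lambda_{52}\xi_2 + \delta_5, & x_6 &= \lambda_{62}\xi_2 + \delta_6,
\end{aligned}
\]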

in which X is the vector of observed variables, Λ (lambda) is the matrix of factor loadings λ
connecting the ξj to the xi, ξ (ksi) is the vector of common factors, and δ (delta) is the
vector of unique factors. It is assumed that the error terms have a mean of zero, E(δ) = 0,
and that the common and unique factors are uncorrelated, E(ξδ′) = 0.

This model is called a measurement model; it describes the relationship between latent
variables and observed variables (observed variables are used to measure or estimate the
latent variables). Measurement models specify how latent variables (hypothetical constructs)
are measured by their observed indicators. There are two types of measurement models: one
for dependent observed variables, labeled Y, and the other for independent observed
variables, labeled X. They are formally written as Y = Λ_y η + ε and X = Λ_x ξ + δ, and their
matrix arrangement looks like this:

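\[
\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_p \end{bmatrix}
=
\begin{bmatrix} \lambda_{y11} & 0 \\ \lambda_{y21} & 0 \\ \vdots & \vdots \\ 0 & \lambda_{yp2} \end{bmatrix}
\times
\begin{bmatrix} \eta_1 \\ \eta_2 \end{bmatrix}
+
\begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_p \end{bmatrix}
\]

\[
\begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_q \end{bmatrix}
=
\begin{bmatrix} \lambda_{x11} \\ \lambda_{x21} \\ \vdots \\ \lambda_{xq1} \end{bmatrix}
\times
\begin{bmatrix} \xi_1 \end{bmatrix}
+
\begin{bmatrix} \delta_1 \\ \delta_2 \\ \vdots \\ \delta_q \end{bmatrix}
\]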

A structural model describes the causal relationship between X and Y. It specifies the causal
relationships among the latent endogenous (η, eta) and exogenous (ξ, ksi) variables,
describes the causal effects, and assigns the explained and unexplained variance
(disturbance term). The formal expression of a structural model is η = Bη + Γξ + ζ, and its
matrix arrangement looks like this:

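\[
\begin{bmatrix} \eta_1 \\ \eta_2 \end{bmatrix}
=
\begin{bmatrix} 0 & \beta_{12} \\ \beta_{21} & 0 \end{bmatrix}
\times
\begin{bmatrix} \eta_1 \\ \eta_2 \end{bmatrix}
+
\begin{bmatrix} \gamma_{11} \\ \gamma_{21} \end{bmatrix}
\times
\begin{bmatrix} \xi_1 \end{bmatrix}
+
\begin{bmatrix} \zeta_1 \\ \zeta_2 \end{bmatrix}
\]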

A structural equation model integrates measurement models and structural models. Basic
notations used in structural equation modeling are summarized as follows.

• η (eta) is an m x 1 random vector of latent dependent, or endogenous, variables
• ξ (ksi) is an n x 1 random vector of latent independent, or exogenous, variables
• y is a p x 1 vector of observed (endogenous) indicators of the dependent latent
variables η
• x is a q x 1 vector of observed (exogenous) indicators of the independent latent
variables ξ
• ε (epsilon) is a p x 1 vector of measurement errors in the observed endogenous
variables y
• δ (delta) is a q x 1 vector of measurement errors in the observed exogenous variables x
• Λy (lambda y) is a p x m coefficient matrix of the regression of y on η
• Λx (lambda x) is a q x n coefficient matrix of the regression of x on ξ
• B (beta) is an m x m coefficient matrix of the η in the structural relationship (B has
zeros on the diagonal, and I - B is required to be non-singular)
• Γ (gamma) is an m x n coefficient matrix of the ξ in the structural relationship
• ζ (zeta) is an m x 1 vector of equation errors (residuals) in the structural relationship
between η and ξ
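
The requirement that I - B be non-singular can be seen by solving the structural model for the endogenous latent variables. The short derivation below is a standard algebraic step added for illustration; it uses only the symbols defined in the list above.

```latex
\[
\eta = B\eta + \Gamma\xi + \zeta
\;\Longrightarrow\;
(I - B)\,\eta = \Gamma\xi + \zeta
\;\Longrightarrow\;
\eta = (I - B)^{-1}\,(\Gamma\xi + \zeta)
\]
```

The last step needs the inverse of I - B, so the endogenous variables can be expressed in terms of the exogenous variables and the equation errors only if I - B is non-singular.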

The following is an example of a structural equation model from Byrne (1998: 38).


5. Descriptive Statistics

Once you obtain and clean data, you need to describe data and take a look at them carefully
before conducting statistical inferences. Although often ignored in reality, descriptive
statistics provide important information and guidance on data analysis.
. sum *

Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
privtown | 1200 3.508333 2.259244 -1 10
govtresp | 1200 4.3075 2.700127 -1 10
compete | 1200 3.438333 2.39847 -1 10
homosex | 1200 4.663333 3.317755 -1 10
abortion | 1200 4.323333 2.991306 -1 10
euthanas | 1200 2.61 2.474807 -1 10

5.1 Frequency Analysis for Categorical Variables

The first approach to analyzing questionnaire survey data is to draw a frequency table.
Cross-tabulation is used to find a relationship between two categorical variables. In the
following example, the variable privtown has 11 values: -1 and 1 through 10. For instance,
310 respondents chose 1, and these 310 responses account for 25.83 percent of the total
(1,200).
. tab privtown, miss

privtown | Freq. Percent Cum.
------------+-----------------------------------
-1 | 7 0.58 0.58
1 | 310 25.83 26.42
2 | 136 11.33 37.75
3 | 194 16.17 53.92
4 | 175 14.58 68.50
5 | 167 13.92 82.42
6 | 90 7.50 89.92
7 | 51 4.25 94.17
8 | 35 2.92 97.08
9 | 9 0.75 97.83
10 | 26 2.17 100.00
------------+-----------------------------------
Total | 1,200 100.00

This type of analysis provides only basic information about a variable. For instance, you can
only say that 20 percent of respondents chose “Strongly agree,” 40 percent chose “Agree,” …
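
The cross-tabulation mentioned at the start of this subsection is not demonstrated in the output, so here is a minimal sketch. It uses auto.dta, which ships with Stata, instead of the survey data; the two variables (foreign and rep78) are simply convenient categorical variables from that file.

```stata
* Cross-tabulation of two categorical variables with row percentages
* and a chi-squared test of independence.
* auto.dta is a stand-in dataset, not the survey data used in this document.
sysuse auto, clear
tabulate foreign rep78, row chi2
```

With survey items that take many values, as privtown does, it may help to collapse the categories into a few groups before cross-tabulating.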

5.2 Parametric Statistics for Interval Variables

If you assume that the variables are measured in interval scale rather than ordinal or nominal
one, you may provide parametric statistics such as central tendency (e.g., mean) and
dispersion (e.g., variance and standard deviation).

. sum privtown, detail

                          privtown
-------------------------------------------------------------
      Percentiles      Smallest
 1%            1             -1
 5%            1             -1
10%            1             -1       Obs                1200
25%            1             -1       Sum of Wgt.        1200

50%            3                      Mean           3.508333
                        Largest       Std. Dev.      2.259244
75%            5             10
90%            7             10       Variance       5.104184
95%            8             10       Skewness       .7154164
99%           10             10       Kurtosis        3.08777

You may also provide graphical statistics like histogram, dot plot, and box plot. But this
univariate analysis also provides limited (partial) information about the constructs that you
want to explain.
. dotplot privtown

. graph box privtown govtresp compete

. histogram privtown, normal


[Figures: dot plot of privtown (Frequency on the horizontal axis); box plots of privtown,
govtresp, and compete; histogram of privtown (Density) with a normal curve overlaid]

5.3 Multivariate Analysis: Correlation Analysis

The correlations among the six observed variables suggest that private ownership, government
responsibility, and competition form one group, although government responsibility is only
weakly correlated with the other two (r = .08 and .11), and that homosexuality, legalized
abortion, and assisted suicide form another group (r = .36 to .48). Government
responsibility is also significantly related to homosexuality, legalized abortion, and
assisted suicide, implying that this variable is related to the moral value (second factor)
as well.
. graph matrix privtown-euthanas, half

[Figure: lower-half scatterplot matrix of privtown, govtresp, compete, homosex, abortion,
and euthanas, with each axis running from 0 to 10]

. pwcorr privtown-euthanas, sig

| privtown govtresp compete homosex abortion euthanas
-------------+------------------------------------------------------
privtown | 1.0000
|
|
govtresp | 0.0826 1.0000
| 0.0042
|
compete | 0.4061 0.1113 1.0000
| 0.0000 0.0001
|
homosex | -0.0199 0.1402 -0.0076 1.0000
| 0.4915 0.0000 0.7914
|
abortion | -0.0106 0.1041 -0.0027 0.4833 1.0000
| 0.7127 0.0003 0.9260 0.0000
|
euthanas | 0.0335 0.1062 0.0621 0.3589 0.4071 1.0000
| 0.2455 0.0002 0.0314 0.0000 0.0000
|

This analysis shows the relationship of two variables but cannot say anything about the
constructs that you want to explain.

6. Factor Analysis: Factor Extraction

Factor analysis studies the relationship between constructs and manifest variables. Factor
analysis follows these steps: (1) mental model (measurement model), (2) data collection and
cleaning, (3) data description (descriptive statistics), (4) factor extraction (determine the
number of factors), (5) rotation (choose rotation methods), (6) interpretation and labeling,
and (7) calculation of factor scores. The core steps are extraction of the factors,
determination of the number of meaningful factors, rotation, and creation of factor scores.

Let us first fit an exploratory measurement model without any constraints.


. factor privtown-euthanas
(obs=1200)

Factor analysis/correlation Number of obs = 1200
Method: principal factors Retained factors = 2
Rotation: (unrotated) Number of params = 11

--------------------------------------------------------------------------
Factor | Eigenvalue Difference Proportion Cumulative
-------------+------------------------------------------------------------
Factor1 | 1.13871 0.53063 0.9871 0.9871
Factor2 | 0.60808 0.62757 0.5271 1.5142
Factor3 | -0.01949 0.10605 -0.0169 1.4973
Factor4 | -0.12554 0.08365 -0.1088 1.3885
Factor5 | -0.20919 0.02978 -0.1813 1.2072
Factor6 | -0.23897 . -0.2072 1.0000
--------------------------------------------------------------------------
LR test: independent vs. saturated: chi2(15) = 854.44 Prob>chi2 = 0.0000

Factor loadings (pattern matrix) and unique variances

-------------------------------------------------
Variable | Factor1 Factor2 | Uniqueness
-------------+--------------------+--------------
privtown | 0.0463 0.5310 | 0.7158
govtresp | 0.2025 0.1502 | 0.9364
compete | 0.0714 0.5389 | 0.7045
homosex | 0.6158 -0.0809 | 0.6143
abortion | 0.6422 -0.0799 | 0.5812
euthanas | 0.5467 0.0141 | 0.7010
-------------------------------------------------

Stata by default employs the principal factor method to extract factors. It supports other
methods such as principal-component factor, iterated principal factor, and
maximum-likelihood factor. Factor analysis and principal component analysis (PCA) are both
used to reduce the number of variables. The former assumes that latent variables exert causal
influence on observed variables, while the latter does not assume such an underlying causal
model (O'Rourke and Hatcher 2013: 7). “Factor analysis assumes that covariation among the
observed variables is due to the presence of one or more latent variables that exert
directional influence on these observed variables” (p. 6). In factor analysis, observed
(manifest) variables are linear combinations of latent variables, whereas a principal
component is a linear combination of optimally weighted observed variables (p. 4). Factor
analysis incorporates measurement error (a random part), but PCA does not (p. 7). PCA does
not differentiate common variance (common factors) from unique variance (uniqueness)
(Brown 2006: 22).
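
The alternative extraction methods named above can be requested with options of the factor command, and PCA with the separate pca command. The sketch below assumes that the survey data used throughout this document (privtown through euthanas) are in memory; fixing the number of factors at two with factors(2) is a choice made here for comparability, not something the methods require.

```stata
* Assumes the survey data (privtown ... euthanas) are in memory

* Principal factor method (Stata's default)
factor privtown-euthanas, pf factors(2)

* Principal-component factor method
factor privtown-euthanas, pcf factors(2)

* Iterated principal factor method
factor privtown-euthanas, ipf factors(2)

* Maximum-likelihood factor method
factor privtown-euthanas, ml factors(2)

* Principal component analysis, for comparison with factor analysis
pca privtown-euthanas, components(2)

* Restore the default principal factor results for the commands that follow
quietly factor privtown-euthanas
```

The last line matters because postestimation commands such as loadingplot and rotate work on whichever results were estimated most recently.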

The Stata output above displays the eigenvalue of each factor and its proportion. A large
difference between consecutive eigenvalues suggests the appropriate number of factors to
extract. Conversely, a tiny difference (e.g., .10605 between the eigenvalues of factors 3
and 4) means that factors 3 and 4 are not clearly distinguished. Here the eigenvalues drop
sharply after factor 2 and the remaining ones are negative, so there appear to be two factors
in the measurement model. The eigenvalue of 1.13871 is the sum of the squared factor 1
loadings, .0463², .2025², .0714², .6158², .6422², and .5467². An eigenvalue measures the
amount of variation in the manifest variables that the construct can explain; dividing it by
the number of manifest variables gives the proportion, so factor 1 explains 1.13871/6, or
about 19 percent, of the variation in the six manifest variables.

Then the factor loadings of the two factors are listed (Albright and Park 2009: 3). For
example, .0463 is the loading of factor 1 on private ownership. The squared factor loading,
.0021 = .0463², is the proportion of variance in an observed variable that is explained by
the factor; the sum of these squared loadings across factors is called the communality of
the variable. That is, factor 1 explains only .21 percent of the variance in private
ownership. Similarly, 28.20 percent of the variance in private ownership (= .5310²) is
explained by factor 2. The last column is uniqueness, the proportion of variance in an
observed variable that is not explained by the factors listed; for instance, 71.58 percent
of the variance in private ownership is not explained by factor 1 and factor 2. By
definition, the squared factor loadings and uniqueness sum to 1: .0463² + .5310² + .7158 ≈ 1.
Because most of its variance is left unexplained, the mental model fits private ownership
poorly.
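
The arithmetic behind eigenvalues, communality, and uniqueness can be checked by hand with display. The loadings below are copied from the unrotated output above; the scalar names are arbitrary, and small discrepancies are expected because the printed loadings are rounded to four digits.

```stata
* Sum of squared Factor 1 loadings (unrotated output above)
scalar ssl1 = 0.0463^2 + 0.2025^2 + 0.0714^2 + 0.6158^2 + 0.6422^2 + 0.5467^2
display "Sum of squared Factor 1 loadings  = " %6.4f ssl1

* The same quantity as a share of the total variance of six standardized items
scalar share1 = ssl1/6
display "Share of total variance (6 items) = " %6.4f share1

* Communality of privtown: squared loadings on Factor 1 and Factor 2
scalar h2 = 0.0463^2 + 0.5310^2
display "Communality of privtown           = " %6.4f h2

* Communality plus uniqueness should be close to 1
scalar check = h2 + 0.7158
display "Communality + uniqueness          = " %6.4f check
```

The first figure should reproduce the Factor 1 eigenvalue of 1.13871 up to rounding, and the last should be very close to 1.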

Cattell (1966: 26-27) suggests interpretability criteria for factor analysis: (1) there are at
least three variables (items) with significant loadings on each retained component (latent
variable); (2) the variables that load on a given component share the same conceptual
meaning; (3) the variables that load on different components seem to be measuring different
constructs; and (4) the rotated factor pattern demonstrates “simple structure.” Simple
structure here means that (a) most of the variables have relatively high factor loadings on
only one component (factor) and near-zero loadings on the other components, and (b) most
components have relatively high factor loadings for some variables and near-zero loadings
for the remaining variables.

Stata can draw a factor loading plot and a scree plot to visualize the results of factor
analysis. The loading plot suggests that there are two groups of observed variables (private
ownership and competition versus homosexuality, legalized abortion, and assisted suicide),
with government responsibility located in an ambiguous area between them.

The scree plot implies that two factors are reasonable since the eigenvalues change only
marginally after two factors.
. loadingplot

. screeplot

[Figure: left panel, factor loadings (Factor 1 on the horizontal axis, Factor 2 on the
vertical axis) for privtown, govtresp, compete, homosex, abortion, and euthanas; right
panel, scree plot of eigenvalues after factor (Eigenvalues against Number)]

7. Factor Analysis: Rotation

The next step is to rotate the factor loadings to clarify the distinction between factors.
There are two types of rotation: orthogonal and non-orthogonal. Varimax, quartimax, equamax,
and parsimax are common orthogonal methods, whereas promax and quartimin are commonly used
non-orthogonal (oblique) methods.


. rotate, varimax
Factor analysis/correlation Number of obs = 1200
Method: principal factors Retained factors = 2
Rotation: orthogonal varimax (Kaiser off) Number of params = 11

--------------------------------------------------------------------------
Factor | Variance Difference Proportion Cumulative
-------------+------------------------------------------------------------
Factor1 | 1.13257 0.51836 0.9818 0.9818
Factor2 | 0.61421 . 0.5324 1.5142
--------------------------------------------------------------------------
LR test: independent vs. saturated: chi2(15) = 854.44 Prob>chi2 = 0.0000

Rotated factor loadings (pattern matrix) and unique variances

-------------------------------------------------
Variable | Factor1 Factor2 | Uniqueness
-------------+--------------------+--------------
privtown | -0.0111 0.5329 | 0.7158
govtresp | 0.1852 0.1711 | 0.9364
compete | 0.0130 0.5434 | 0.7045
homosex | 0.6209 -0.0143 | 0.6143
abortion | 0.6471 -0.0104 | 0.5812
euthanas | 0.5420 0.0728 | 0.7010
-------------------------------------------------

Factor rotation matrix

--------------------------------
| Factor1 Factor2
-------------+------------------
Factor1 | 0.9942 0.1075
Factor2 | -0.1075 0.9942
--------------------------------

. rotate, oblimax

Factor analysis/correlation Number of obs = 1200
Method: principal factors Retained factors = 2
Rotation: orthogonal oblimax (Kaiser off) Number of params = 11

--------------------------------------------------------------------------
Factor | Variance Difference Proportion Cumulative
-------------+------------------------------------------------------------
Factor1 | 1.13352 0.52025 0.9826 0.9826
Factor2 | 0.61327 . 0.5316 1.5142
--------------------------------------------------------------------------
LR test: independent vs. saturated: chi2(15) = 854.44 Prob>chi2 = 0.0000

Rotated factor loadings (pattern matrix) and unique variances

-------------------------------------------------
Variable | Factor1 Factor2 | Uniqueness
-------------+--------------------+--------------
privtown | -0.0064 0.5330 | 0.7158
govtresp | 0.1867 0.1695 | 0.9364
compete | 0.0177 0.5433 | 0.7045
homosex | 0.6208 -0.0197 | 0.6143
abortion | 0.6470 -0.0160 | 0.5812
euthanas | 0.5426 0.0681 | 0.7010
-------------------------------------------------

Factor rotation matrix

--------------------------------
| Factor1 Factor2
-------------+------------------
Factor1 | 0.9951 0.0989
Factor2 | -0.0989 0.9951
--------------------------------

Rotation gives us a different pattern matrix (factor loadings). Rotation does not change the
underlying information, such as the total variance explained and the uniqueness of each
variable, but views the same solution from a different perspective. Let us draw a loading
plot and a scree plot on the basis of the varimax rotation. These plots are not
substantially different from those above (without rotation).

[Figure: left panel, factor loadings after orthogonal varimax rotation (method: principal
factors; Factor 1 on the horizontal axis, Factor 2 on the vertical axis); right panel, scree
plot of eigenvalues after factor (Eigenvalues against Number)]
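
Both rotations shown in this section are orthogonal; the header of the oblimax output reads "orthogonal oblimax" because Stata treats oblimax as orthogonal unless the oblique option is given. As a sketch of a genuinely oblique rotation, the commands below apply promax and oblique oblimin and then display the correlation between the rotated factors. They assume that the results of ". factor privtown-euthanas" are still active; the cutoff of .3 in blanks() is an arbitrary display choice.

```stata
* Assumes the results of ". factor privtown-euthanas" are still active

* Promax (oblique); hide loadings smaller than .3 in absolute value
rotate, promax blanks(.3)

* Correlation matrix of the rotated common factors
estat common

* Oblimin with the oblique option
rotate, oblimin oblique
estat common

* Return to the varimax solution used in the rest of this document
quietly rotate, varimax
```

If the factor correlation reported by estat common is close to zero, the oblique and orthogonal solutions should tell much the same story.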

Stata has a command to compare factor loadings before and after rotation. Since uniqueness
remains unchanged, only factor loadings are compared.
. estat rotatecompare

Rotation matrix -- orthogonal varimax (Kaiser off)

------------------------------------
Variable | Factor1 Factor2
-------------+----------------------
Factor1 | 0.9942 0.1075
Factor2 | -0.1075 0.9942
------------------------------------

Factor loadings

-----------------------------------------------------------
| Rotated | Unrotated
Variable | Factor1 Factor2 | Factor1 Factor2
-------------+----------------------+----------------------
privtown | -0.0111 0.5329 | 0.0463 0.5310
govtresp | 0.1852 0.1711 | 0.2025 0.1502
compete | 0.0130 0.5434 | 0.0714 0.5389
homosex | 0.6209 -0.0143 | 0.6158 -0.0809
abortion | 0.6471 -0.0104 | 0.6422 -0.0799
euthanas | 0.5420 0.0728 | 0.5467 0.0141
-----------------------------------------------------------

Once you have extracted factors successfully, you need to aggregate the information in the
observed variables. Assuming the rotated measurement model is correct, you can get factor
scores for the two factors by running the .predict command. Stata will create two variables,
factor1 and factor2, and add them to the current dataset.
. predict factor1 factor2
(regression scoring assumed)

Scoring coefficients (method = regression; based on varimax rotated factors)

----------------------------------
Variable | Factor1 Factor2
-------------+--------------------
privtown | -0.01448 0.36833
govtresp | 0.07292 0.09930
compete | -0.00202 0.37982
homosex | 0.33748 -0.02270
abortion | 0.36897 -0.02257
euthanas | 0.26351 0.04362
----------------------------------

. sum factor1 factor2

Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
factor1 | 1200 1.06e-09 .777661 -1.583102 2.201795
factor2 | 1200 -5.07e-10 .6507104 -1.70306 2.322467

Alternatively, you can calculate a factor-based score, the average of the related observed
variables. Pay attention to the following .egen commands with the rowmean() function.
. egen f1 = rowmean(privtown govtresp compete)

. egen f2 = rowmean(homosex abortion euthanas)

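When comparing factor scores and factor-based scores, we observe a big difference between the
two sets of scores. Note that they are on different scales: factor scores are centered at
zero, whereas factor-based scores keep the original response scale. Factor scores are
recommended since they are theory based.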

Cattell (1966) says that a factor score (component score) is a linear composite of the
optimally-weighted observed variables, whereas a factor-based score is a linear composite of
the variables that demonstrated meaningful loadings for the component in question (p. 31).
Also see O’Rourke and Hatcher (2013: 72-74).
. sum factor1 f1 factor2 f2

Variable | Obs Mean Std. Dev. Min Max
-------------+--------------------------------------------------------
factor1 | 1200 1.06e-09 .777661 -1.583102 2.201795
f1 | 1200 3.751389 1.666624 -1 10
factor2 | 1200 -5.07e-10 .6507104 -1.70306 2.322467
f2 | 1200 3.865556 2.299605 -1 10
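
Because the two kinds of scores are on different scales, correlations may be a more informative comparison than means and standard deviations. The sketch below assumes that factor1, factor2, f1, and f2 were created as shown above and that the varimax-rotated factor results are still active. Note the pairing: in the rotated output Factor1 loads on the moral items, which were averaged into f2, while Factor2 loads on the economic items, which were averaged into f1. The names factor1_b and factor2_b are arbitrary.

```stata
* Assumes factor1, factor2, f1, and f2 exist as created above

* Factor1 loads on homosex, abortion, and euthanas, which make up f2
correlate factor1 f2

* Factor2 loads on privtown and compete, which are part of f1
correlate factor2 f1

* Bartlett scoring as an alternative to the default regression scoring
predict factor1_b factor2_b, bartlett
correlate factor1 factor1_b factor2 factor2_b
```

High correlations within each pair would suggest that the simpler factor-based scores rank respondents in much the same way as the factor scores.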

Before moving forward, you need to check the reliability of the observed variables. You can
calculate Cronbach's alpha (a measure of reliability) using the .alpha command. In general,
if alpha is larger than .8, the set of variables is assumed to be caused by the same factor.
The first set of observed variables below has an alpha of .4286, implying that the
measurement model of the economic value does not fit the data well.
. alpha privtown govtresp compete, asis item std

Test scale = mean(standardized items)

average
item-test item-rest interitem
Item | Obs Sign correlation correlation correlation alpha
-------------+-----------------------------------------------------------------
privtown | 1200 + 0.7264 0.3278 0.1113 0.2003
govtresp | 1200 + 0.5826 0.1156 0.4061 0.5777
compete | 1200 + 0.7404 0.3516 0.0826 0.1527
-------------+-----------------------------------------------------------------
Test scale | 0.2000 0.4286
-------------------------------------------------------------------------------

By contrast, homosexuality, legalized abortion, and assisted suicide are more reliable
measures of the moral value (alpha=.6816).
. alpha homosex abortion euthanas, asis item std

Test scale = mean(standardized items)

average
item-test item-rest interitem
Item | Obs Sign correlation correlation correlation alpha
-------------+-----------------------------------------------------------------
homosex | 1200 + 0.7856 0.5020 0.4071 0.5786
abortion | 1200 + 0.8062 0.5401 0.3589 0.5282
euthanas | 1200 + 0.7531 0.4447 0.4833 0.6516
-------------+-----------------------------------------------------------------
Test scale | 0.4164 0.6816
-------------------------------------------------------------------------------
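
The standardized alpha reported by .alpha with the std option can be reproduced by hand
from the number of items and the average interitem correlation. The following is a minimal
Python sketch, given only for illustration outside Stata. The function name
standardized_alpha is hypothetical, and the formula is the usual standardized form of
Cronbach's alpha, k*r/(1+(k-1)*r).

```python
def standardized_alpha(k, mean_r):
    """Standardized Cronbach's alpha for k items with average interitem correlation mean_r."""
    return k * mean_r / (1 + (k - 1) * mean_r)


# Values copied from the "Test scale" rows of the two .alpha outputs above.
alpha_economic = standardized_alpha(3, 0.2000)  # privtown, govtresp, compete
alpha_moral = standardized_alpha(3, 0.4164)     # homosex, abortion, euthanas

# Alpha of the economic scale when privtown is dropped: two items remain and
# their average interitem correlation is .1113 (privtown row of the output).
alpha_without_privtown = standardized_alpha(2, 0.1113)

print(round(alpha_economic, 4), round(alpha_moral, 4), round(alpha_without_privtown, 4))
```

The three results agree with the .4286, .6816, and .2003 printed by Stata. Because alpha
depends only on the number of items and their average correlation, a short scale with weakly
correlated items cannot reach the .8 benchmark.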

8. Confirmatory Factor Analysis


Stata has the .sem command to fit confirmatory factor analysis and structural equation models.
You need to specify a measurement model within parentheses as shown in the following
command. Here the latent variable is Values (one latent variable); by default Stata treats a
name that begins with a capital letter as a latent variable. The arrow -> indicates that the
latent variable is assumed to cause the observed variables. method(ml) tells Stata to use the
maximum likelihood method for estimation.
. sem (Values -> privtown govtresp compete homosex abortion euthanas), method(ml)

Endogenous variables

Measurement: privtown govtresp compete homosex abortion euthanas

Exogenous variables

Latent: Values

Fitting target model:

Iteration 0: log likelihood = -17155.58
Iteration 1: log likelihood = -17154.46
Iteration 2: log likelihood = -17154.399
Iteration 3: log likelihood = -17154.399

Structural equation model Number of obs = 1200
Estimation method = ml
Log likelihood = -17154.399

( 1) [privtown]Values = 1
--------------------------------------------------------------------------------
| OIM
| Coef. Std. Err. z P>|z| [95% Conf. Interval]
---------------+----------------------------------------------------------------
Measurement |
privtown <- |
Values | 1 (constrained)
_cons | 3.508333 .0651917 53.82 0.000 3.38056 3.636107
-------------+----------------------------------------------------------------
govtresp <- |
Values | .3519197 .0847335 4.15 0.000 .1858451 .5179943
_cons | 4.3075 .0779135 55.29 0.000 4.154792 4.460208
-------------+----------------------------------------------------------------
compete <- |
Values | 1.280656 .2595241 4.93 0.000 .7719986 1.789314
_cons | 3.438333 .0692088 49.68 0.000 3.302687 3.57398
-------------+----------------------------------------------------------------
homosex <- |
Values | .1101116 .1216503 0.91 0.365 -.1283186 .3485418
_cons | 4.663333 .0957354 48.71 0.000 4.475695 4.850971
-------------+----------------------------------------------------------------
abortion <- |
Values | .115648 .1105949 1.05 0.296 -.101114 .33241
_cons | 4.323333 .0863156 50.09 0.000 4.154158 4.492509
-------------+----------------------------------------------------------------
euthanas <- |
Values | .2180622 .0865095 2.52 0.012 .0485068 .3876177
_cons | 2.61 .0714118 36.55 0.000 2.470036 2.749964
---------------+----------------------------------------------------------------
var(e.privtown)| 3.415275 .3523314 2.790056 4.180599
var(e.govtresp)| 7.075966 .2967287 6.517647 7.682112
var(e.compete)| 2.984817 .5986811 2.014594 4.422296
var(e.homosex)| 10.9779 .4496889 10.13098 11.89562
var(e.abortion)| 8.917925 .3658081 8.229018 9.664505
var(e.euthanas)| 6.039458 .2517568 5.565643 6.55361
var(Values)| 1.684681 .3586063 1.110014 2.556861
--------------------------------------------------------------------------------
LR test of model vs. saturated: chi2(9) = 616.17, Prob > chi2 = 0.0000

Look at the parameter estimates. The coefficient (loading) of the latent variable on private
ownership is fixed to 1 in order to set the scale of the latent variable and identify the
measurement model; without such a constraint, the measurement model is under-identified.
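
The 9 degrees of freedom reported in the last line of the output follow from simple counting.
The following is a minimal Python sketch, assuming a one-factor model whose first loading is
fixed to 1. The function name one_factor_cfa_df is hypothetical; intercepts are left out
because they are matched one-for-one by the observed means.

```python
def one_factor_cfa_df(n_indicators):
    """Degrees of freedom of a one-factor CFA with the first loading fixed to 1."""
    # Distinct variances and covariances among the observed variables.
    n_moments = n_indicators * (n_indicators + 1) // 2
    # Free parameters: remaining loadings, error variances, and the factor variance.
    n_free = (n_indicators - 1) + n_indicators + 1
    return n_moments - n_free


df_model = one_factor_cfa_df(6)
print(df_model)
```

With six observed variables there are 21 variances and covariances and 12 free parameters,
which leaves the 9 degrees of freedom shown in chi2(9). A positive count is necessary but not
sufficient for identification; the latent variable still needs a scale, which is what the
constraint on the first loading provides.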

The unstandardized coefficient on government responsibility is .3519 and its standard error
is .0847 (z=4.15, p<.001). Therefore, the loading of government responsibility on the latent
variable is statistically significant. By contrast, the coefficient of the latent variable on
homosexuality is .1101 and its standard error is .1217 (z=0.91, p=.365). Therefore, the
relationship between the latent variable and the observed variable homosexuality is questionable.
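
The z and P>|z| columns can be checked by dividing each coefficient by its standard error and
looking up the two-sided normal tail probability. The following is a minimal Python sketch
using only the standard library; the function name wald_z_test is hypothetical.

```python
import math


def wald_z_test(coef, std_err):
    """Return the z statistic and its two-sided p-value under a standard normal."""
    z = coef / std_err
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)) for the standard normal cdf Phi.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Unstandardized estimates and standard errors copied from the .sem output above.
z_govtresp, p_govtresp = wald_z_test(0.3519197, 0.0847335)
z_homosex, p_homosex = wald_z_test(0.1101116, 0.1216503)

print(round(z_govtresp, 2), p_govtresp < 0.001)
print(round(z_homosex, 2), round(p_homosex, 3))
```

The sketch returns z=4.15 for government responsibility and z=0.91 with p=.365 for
homosexuality, the same values that Stata prints.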

The last line of the Stata output is the chi-square (likelihood ratio) test of model fit. The
null hypothesis is that the covariance matrix implied by the model is identical to the
population covariance matrix: Σ = Σ(θ). The chi-square of 616.17 with 9 degrees of freedom is
large enough to reject the null hypothesis at the .01 level (p<.0001). That is, the covariance
structure implied by the model differs from the one observed in the data; the measurement
model does not fit the data well.

The parameter estimates need to be standardized in order to evaluate the impact of the latent
variable on the observed variables substantively. Once you have fit a measurement model, run
.sem again with the standardized option.

. sem, standardized

Structural equation model Number of obs = 1200
Estimation method = ml
Log likelihood = -17154.399

( 1) [privtown]Values = 1
--------------------------------------------------------------------------------
| OIM
Standardized | Coef. Std. Err. z P>|z| [95% Conf. Interval]
---------------+----------------------------------------------------------------
Measurement |
privtown <- |
Values | .5747456 .0584807 9.83 0.000 .4601256 .6893656
_cons | 1.553523 .0428827 36.23 0.000 1.469474 1.637571
-------------+----------------------------------------------------------------
govtresp <- |
Values | .1692386 .0392781 4.31 0.000 .0922549 .2462222
_cons | 1.595961 .0435272 36.67 0.000 1.510649 1.681272
-------------+----------------------------------------------------------------
compete <- |
Values | .6933293 .0750543 9.24 0.000 .5462257 .840433
_cons | 1.434155 .0411136 34.88 0.000 1.353574 1.514736
-------------+----------------------------------------------------------------
homosex <- |
Values | .0430952 .0484466 0.89 0.374 -.0518584 .1380487
_cons | 1.406155 .0407087 34.54 0.000 1.326368 1.485943
-------------+----------------------------------------------------------------
abortion <- |
Values | .0502016 .0490001 1.02 0.306 -.0458368 .14624
_cons | 1.445902 .0412847 35.02 0.000 1.364985 1.526819
-------------+----------------------------------------------------------------
euthanas <- |
Values | .114414 .0461772 2.48 0.013 .0239083 .2049196
_cons | 1.055067 .036016 29.29 0.000 .9844773 1.125657
---------------+----------------------------------------------------------------
var(e.privtown)| .6696675 .067223 .5500642 .8152768
var(e.govtresp)| .9713583 .0132947 .9456475 .9977682
var(e.compete)| .5192945 .1040746 .3506063 .769144
var(e.homosex)| .9981428 .0041756 .9899922 1.006361
var(e.abortion)| .9974798 .0049198 .9878837 1.007169
var(e.euthanas)| .9869094 .0105666 .966415 1.007838
var(Values)| 1 . . .
--------------------------------------------------------------------------------
LR test of model vs. saturated: chi2(9) = 616.17, Prob > chi2 = 0.0000

The standardized parameter estimate on private ownership is .5747 and its standard error
is .0585 (p<.001). The coefficient on homosexuality is .0431 and its standard error is .0484
(p=.374). The chi-square test remains unchanged after standardization.
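
A standardized loading is the unstandardized loading rescaled by the standard deviations of
the latent and observed variables. The following is a minimal Python sketch of that
conversion, assuming the model-implied variance of an observed variable is
loading^2 * var(Values) + error variance. The function name standardized_loading is
hypothetical; all numbers are copied from the unstandardized output.

```python
import math

VAR_VALUES = 1.684681  # var(Values) from the unstandardized output

# name: (unstandardized loading, error variance)
estimates = {
    "privtown": (1.0, 3.415275),
    "govtresp": (0.3519197, 7.075966),
    "compete": (1.280656, 2.984817),
    "homosex": (0.1101116, 10.9779),
}


def standardized_loading(loading, var_latent, var_error):
    """Rescale a loading by sd(latent) / sd(observed), using the model-implied variance."""
    fitted_variance = loading ** 2 * var_latent + var_error
    return loading * math.sqrt(var_latent / fitted_variance)


std_loadings = {
    name: standardized_loading(loading, VAR_VALUES, var_error)
    for name, (loading, var_error) in estimates.items()
}

for name, value in std_loadings.items():
    print(name, round(value, 4))
```

The results match the standardized coefficients .5747, .1692, .6933, and .0431 in the output
above, which shows that standardization only rescales the estimates and leaves the model fit
untouched.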

The following command displays R-squared values. For instance, the R-squared of the latent
variable on private ownership is .3303 = .5747^2, the square of the standardized loading. This
figure is interpreted like R-squared in OLS; the latent variable can explain 33.03 percent of
the variation in private ownership. The following output suggests that the latent variable
cannot sufficiently explain government responsibility (2.86%), homosexuality (.19%), legalized
abortion (.25%), and assisted suicide (1.31%). Clearly, this measurement model with one latent
variable is really bad.
. estat eqgof

Equation-level goodness of fit

------------------------------------------------------------------------------
| Variance |
depvars | fitted predicted residual | R-squared mc mc2
-------------+---------------------------------+------------------------------
observed | |
privtown | 5.099957 1.684681 3.415275 | .3303325 .5747456 .3303325
govtresp | 7.284609 .2086435 7.075966 | .0286417 .1692386 .0286417
compete | 5.747831 2.763014 2.984817 | .4807055 .6933293 .4807055
homosex | 10.99832 .020426 10.9779 | .0018572 .0430952 .0018572
abortion | 8.940457 .0225317 8.917925 | .0025202 .0502016 .0025202
euthanas | 6.119567 .0801085 6.039458 | .0130906 .114414 .0130906
-------------+---------------------------------+------------------------------
overall | | .5945024
------------------------------------------------------------------------------
mc = correlation between depvar and its prediction
mc2 = mc^2 is the Bentler-Raykov squared multiple correlation coefficient
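
The R-squared column can be recovered from the variance columns of the same table. The
following is a minimal Python sketch, assuming R-squared is the predicted variance divided by
the fitted variance; the variable names are hypothetical and the numbers are copied from the
.estat eqgof output.

```python
# name: (fitted, predicted, residual) variances from the table above
variances = {
    "privtown": (5.099957, 1.684681, 3.415275),
    "govtresp": (7.284609, 0.2086435, 7.075966),
    "compete": (5.747831, 2.763014, 2.984817),
    "homosex": (10.99832, 0.020426, 10.9779),
    "abortion": (8.940457, 0.0225317, 8.917925),
    "euthanas": (6.119567, 0.0801085, 6.039458),
}

# Share of each observed variable's variance explained by the latent variable.
r_squared = {
    name: predicted / fitted
    for name, (fitted, predicted, residual) in variances.items()
}

# The fitted variance splits into a predicted part and a residual part.
decomposition_gap = {
    name: abs(fitted - (predicted + residual))
    for name, (fitted, predicted, residual) in variances.items()
}

for name, value in r_squared.items():
    print(name, round(100 * value, 2), "percent")
```

Only private ownership (33.03 percent) and competition (48.07 percent) share a noticeable
amount of variance with the latent variable; the remaining four items are almost entirely
residual variance.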

Stata can produce various goodness-of-fit measures. The presence of multiple goodness-of-fit
measures implies that the chi-square test does not always produce a reliable result; it is
sensitive to sample size, so even small misfit is rejected in large samples. You need to
report at least the chi-square, RMSEA (root mean square error of approximation), and CFI
(comparative fit index).
. estat gof, stats(all)

----------------------------------------------------------------------------
Fit statistic | Value Description
---------------------+------------------------------------------------------
Likelihood ratio |
chi2_ms(9) | 616.175 model vs. saturated
p > chi2 | 0.000
chi2_bs(15) | 856.459 baseline vs. saturated
p > chi2 | 0.000
---------------------+------------------------------------------------------
Population error |
RMSEA | 0.237 Root mean squared error of approximation
90% CI, lower bound | 0.221
upper bound | 0.253
pclose | 0.000 Probability RMSEA <= 0.05
---------------------+------------------------------------------------------
Information criteria |
AIC | 34344.798 Akaike's information criterion
BIC | 34436.419 Bayesian information criterion
---------------------+------------------------------------------------------
Baseline comparison |
CFI | 0.278 Comparative fit index
TLI | -0.203 Tucker-Lewis index
---------------------+------------------------------------------------------
Size of residuals |
SRMR | 0.144 Standardized root mean squared residual
CD | 0.595 Coefficient of determination
----------------------------------------------------------------------------
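
The RMSEA, CFI, and TLI in the table above can be reproduced from the two chi-square
statistics and the sample size. The following is a minimal Python sketch using common
textbook formulas; the function names are hypothetical, and the use of N - 1 rather than N in
the RMSEA denominator is an assumption (both choices round to the same value here).

```python
import math


def rmsea(chi2_model, df_model, n_obs):
    """Root mean squared error of approximation (assumes N - 1 in the denominator)."""
    return math.sqrt(max(chi2_model - df_model, 0.0) / (df_model * (n_obs - 1)))


def cfi(chi2_model, df_model, chi2_base, df_base):
    """Comparative fit index: misfit of the model relative to the baseline model."""
    d_model = max(chi2_model - df_model, 0.0)
    d_base = max(chi2_base - df_base, d_model)
    if d_base == 0.0:
        return 1.0  # neither model shows any misfit
    return 1.0 - d_model / d_base


def tli(chi2_model, df_model, chi2_base, df_base):
    """Tucker-Lewis index, based on chi-square to degrees-of-freedom ratios."""
    ratio_base = chi2_base / df_base
    ratio_model = chi2_model / df_model
    return (ratio_base - ratio_model) / (ratio_base - 1.0)


# chi2_ms(9), chi2_bs(15), and the number of observations from the output above.
fit_rmsea = rmsea(616.175, 9, 1200)
fit_cfi = cfi(616.175, 9, 856.459, 15)
fit_tli = tli(616.175, 9, 856.459, 15)

print(round(fit_rmsea, 3), round(fit_cfi, 3), round(fit_tli, 3))
```

The sketch gives .237, .278, and -.203. The negative TLI arises because the chi-square per
degree of freedom of this model (about 68.5) is even larger than that of the baseline model
(about 57.1).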

RMSEA “incorporates a penalty function for poor model parsimony” (Brown 2006: 83-84).
If RMSEA is less than .05, you can conclude that the measurement model fits the data
well. CFI evaluates “the fit of a user-specified solution in relation to a more restricted, nested
baseline model” (Brown 2006: 84). CFI ranges from 0 to 1; a CFI larger than .9 indicates a
good fit. This measurement model with one latent variable does not fit the data well since
(1) the chi-square is large (p<.0001), (2) the RMSEA of .237 is larger than .05, and (3) the CFI
of .278 is smaller than .9. What should you do then?

Modification indices suggest which way to go. The Stata post-estimation
command .estat mindices produces modification indices.
. estat mindices

Modification indices

--------------------------------------------------------------------------
| Standard
| MI df P>MI EPC EPC
--------------------------+-----------------------------------------------
cov(e.privtown,e.compete)| 164.626 1 0.00 33.82137 10.59301
cov(e.privtown,e.homosex)| 5.607 1 0.02 -.5239465 -.0855686
cov(e.privtown,e.abortion)| 4.455 1 0.03 -.4239463 -.0768186
cov(e.privtown,e.euthanas)| 3.914 1 0.05 -.3738381 -.0823136
cov(e.govtresp,e.homosex)| 22.360 1 0.00 1.216426 .1380172
cov(e.govtresp,e.abortion)| 11.569 1 0.00 .7888493 .0993046
cov(e.govtresp,e.euthanas)| 9.754 1 0.00 .5990838 .0916422
cov(e.compete,e.homosex)| 9.224 1 0.00 -.8442552 -.1474875
cov(e.compete,e.abortion)| 9.358 1 0.00 -.7724137 -.1497129
cov(e.homosex,e.abortion)| 279.824 1 0.00 4.785221 .483627
cov(e.homosex,e.euthanas)| 154.265 1 0.00 2.934899 .3604414
cov(e.abortion,e.euthanas)| 198.536 1 0.00 3.001695 .4090117
--------------------------------------------------------------------------
EPC = expected parameter change

The first index says that, if you allow the errors of private ownership and competition to be
correlated in the current measurement model, the chi-square will decrease by about 164.626.
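
One way to read the table is to rank the modification indices and compare each with the
chi-square critical value for one degree of freedom. The following is a minimal Python
sketch; the 3.84 cutoff (the .05 level) and the variable names are assumptions made for
illustration, not part of the Stata output.

```python
CRITICAL_CHI2_1DF = 3.84  # chi-square critical value with 1 df at the .05 level

# Modification indices copied from the .estat mindices output above.
modification_indices = {
    "cov(e.privtown,e.compete)": 164.626,
    "cov(e.privtown,e.homosex)": 5.607,
    "cov(e.privtown,e.abortion)": 4.455,
    "cov(e.privtown,e.euthanas)": 3.914,
    "cov(e.govtresp,e.homosex)": 22.360,
    "cov(e.govtresp,e.abortion)": 11.569,
    "cov(e.govtresp,e.euthanas)": 9.754,
    "cov(e.compete,e.homosex)": 9.224,
    "cov(e.compete,e.abortion)": 9.358,
    "cov(e.homosex,e.abortion)": 279.824,
    "cov(e.homosex,e.euthanas)": 154.265,
    "cov(e.abortion,e.euthanas)": 198.536,
}

# Largest expected drop in chi-square first.
ranked = sorted(modification_indices.items(), key=lambda item: item[1], reverse=True)
notable = [name for name, mi in ranked if mi > CRITICAL_CHI2_1DF]

for name, mi in ranked[:3]:
    print(name, mi)
```

The largest indices involve the error covariances among homosexuality, legalized abortion,
and assisted suicide, followed by private ownership and competition. This is the same
grouping of economic and moral items seen in the reliability analysis.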

End of document.

References
Acock, Alan C. (2013). Discovering structural equation modeling using Stata. College
Station, TX: Stata Press.
Albright, Jeremy J., and Hun Myoung Park. (2009). Confirmatory factor analysis using Amos,
LISREL, Mplus, and SAS/STAT CALIS. Working Paper. The University Information
Technology Services (UITS) Center for Statistical and Mathematical Computing,
Indiana University.
Brown, Timothy A. (2006). Confirmatory factor analysis for applied research. New York:
Guilford Press.
Cattell, R. B. (1966). The scree test for the number of factors. Multivariate Behavioral
Research, 1, 245-276.
O'Rourke, Norm, and Larry Hatcher. (2013). A step-by-step approach to using SAS for
factor analysis and structural equation modeling, 2nd ed. Cary, NC: SAS Institute.
