
Personality and Individual Differences 42 (2007) 825–829


On tests and indices for evaluating structural models ☆


Peter M. Bentler *

University of California, Los Angeles, Departments of Psychology and Statistics, Box 951563,
Los Angeles, CA, United States

Available online 15 December 2006

Abstract

Eight recommendations are given for the improved reporting of research based on structural equation modeling. These recommendations differ substantially from those offered by Prof. Barrett in this issue, especially with regard to the virtues and limitations of current statistical methods.
© 2006 Elsevier Ltd. All rights reserved.

Keywords: Test statistic; Model modification; Approximate fit; Comparative fit index; RMSEA

1. Introduction

Professor Barrett makes many wise and perceptive observations in his discussion of model fit, and I agree with much of what he says, e.g., that investigators inappropriately ignore the test of model fit, that there are virtues to cross-validation, etc. Yet I also disagree with certain points, e.g., his recommendation to ban all fit indices. I will give my own recommendations on how a structural equation model (SEM) should be submitted to, and reported in, a journal, and compare these to Professor Barrett's. See also McDonald and Ho (2002).

☆ Supported in part by National Institute on Drug Abuse Grants DA00017 and DA01070.
* Tel.: +1 310 825 2893; fax: +1 310 206 4315.
E-mail address: bentler@ucla.edu


2. My recommendations vs. Barrett's

1. When submitting a manuscript (ms) with an SEM, an author should submit a separate statement that, for each major model, (a) verifies that every parameter in the model is purely a priori or, if not, (b) gives details on all model modifications that were made. This material should be sent to reviewers along with the ms.
2. Every ms should provide summary statistics, where these exist, for evaluating the assumptions to be made in the statistical analysis. Example: if using a normal theory statistic for continuous data, always report on multivariate normality, and perhaps on univariate normality (a sketch of one such check follows this list).
3. If a major theoretical argument on a simple SEM hinges on modeling results from a given data set, then the ms should provide the correlation matrix, the SDs, and the means (if a mean structure is modeled) in an appendix (if one page or less) or on a web site. Similar statistics for more complex models are optional, but recommended.
4. At least one statistical test of model fit, say T, should be reported for each major SEM
model, unless the author verifies that no appropriate test exists for their design and model.
The key assumptions underlying T should be enumerated, and, where possible, evaluated
empirically. If assumptions are violated, the author’s decision rules on how to proceed
should be justified.
5. If any model modification is implemented on a major SEM, the user should report on the
similarity, before and after model modification, of the estimates of a priori parameters.
6. Any SEM based on a small sample (say, N < 100) should additionally report at least one
meaningful a priori model that is expected to be rejected.
7. For each major covariance/correlation model, the standardized root mean square residual (SRMR) or the average absolute standardized residual, as well as the several largest residuals in a correlation metric, should be reported (a sketch of this computation also follows the list).
8. Each major SEM model may be accompanied by at most two other indices of fit, such as CFI
(comparative fit index) and RMSEA (root mean square error of approximation).
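To make recommendation #2 concrete, here is a minimal sketch of one common check of the multivariate normality assumption, Mardia's multivariate skewness and kurtosis. The data matrix and all names are illustrative placeholders, not part of any reported analysis; SEM programs such as EQS report comparable coefficients directly.

```python
# Minimal sketch of an assumption check for recommendation #2:
# Mardia's multivariate skewness and kurtosis for continuous data.
import numpy as np

def mardia_coefficients(X):
    """Return Mardia's multivariate skewness (b1p) and kurtosis (b2p)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    S = Xc.T @ Xc / n                      # (biased) sample covariance matrix
    D = Xc @ np.linalg.inv(S) @ Xc.T       # Mahalanobis cross-products
    b1p = (D ** 3).sum() / n ** 2          # multivariate skewness
    b2p = (np.diag(D) ** 2).mean()         # multivariate kurtosis; about p(p + 2) under normality
    return b1p, b2p

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))              # placeholder data matrix
b1p, b2p = mardia_coefficients(X)
print(f"Mardia skewness = {b1p:.2f}, kurtosis = {b2p:.2f} (p(p + 2) = 48 under normality)")
```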
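Recommendation #7 is equally mechanical. The sketch below computes SRMR by standardizing the covariance residuals by the sample standard deviations, and lists the few largest residuals in a correlation metric; the two matrices are toy placeholders, not results from any real model.

```python
# Sketch of recommendation #7: SRMR and the largest correlation residuals,
# from a sample covariance matrix S and a model-implied covariance matrix Sigma_hat.
import numpy as np

def srmr_and_largest_residuals(S, Sigma_hat, k=3):
    d = np.sqrt(np.diag(S))
    resid = (S - Sigma_hat) / np.outer(d, d)     # residuals in a correlation metric
    iu = np.triu_indices_from(resid)             # nonredundant elements (upper triangle + diagonal)
    srmr = np.sqrt(np.mean(resid[iu] ** 2))
    largest = np.sort(np.abs(resid[iu]))[::-1][:k]
    return srmr, largest

S = np.array([[1.00, 0.50, 0.40],
              [0.50, 1.00, 0.30],
              [0.40, 0.30, 1.00]])               # toy sample matrix
Sigma_hat = np.array([[1.00, 0.45, 0.42],
                      [0.45, 1.00, 0.35],
                      [0.42, 0.35, 1.00]])       # toy model-implied matrix
srmr, top = srmr_and_largest_residuals(S, Sigma_hat)
print(f"SRMR = {srmr:.3f}; largest residuals: {np.round(top, 3)}")
```

A small SRMR simply says that the correlations are reproduced closely on average, which is the interpretation used in Section 4 below.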

I will call my recommendations #1–#8, and will compare them to Barrett's, which I will call B1–B5. First of all, Barrett ignores #1, yet it provides one fundamental reason to be skeptical about his reliance on tests of fit (in B1). When based on post-hoc model modification, T will not be χ² distributed, and the resulting p-values can be quite distorted (e.g., MacCallum, Roznowski, & Necowitz, 1992). In #1, I suggest that reviewers be provided far greater access to model modification information than is feasible in a manuscript. In #2, I recommend that empirical information on assumptions almost always be reported. B3 suggests authors consider this information when the model is rejected, and does not require reporting it in the ms. I recommend #3 so that readers and reviewers can understand the data somewhat independently of an author's report; Barrett does not require this. B1 allows a test of fit, and similarly in #4, I would always want to see a test, if it exists, as well as an evaluation of assumptions as in #2 and B3. When a model is modified, the a priori estimated parameters may remain the same or may change substantially; #5 recommends a report. If they stay basically the same, the extra parameters show that the model was incomplete, but not fundamentally biased. A correlation or rank-order correlation can be used to measure this similarity (e.g., Bentler, 1995, p. 262; a brief sketch appears at the end of this section). In B2, Barrett would reject most models with N < 200. This is not a bad idea for areas where large samples are easily available. But if
the small N is not due to laziness and the science seems appropriate, in #6, I would be willing to consider a small N model, especially if it can be shown that power is large enough to reject alternatives such as a 1-factor model. In B4 and elsewhere, Barrett is against #7 and #8. On #7, my feeling is that standardized residuals (Hu & Bentler, 1995) are always relevant because they convey much information and are easily understandable. A good model will have small residuals on average, and even the largest residuals will be fairly small. And of course #8 is critical, for the reasons given originally by Bentler and Bonett (1980). As summarized by Bentler (1990, p. 238) for covariance structure modeling, where Σ is the population covariance matrix, the model is Σ(θ), and θ is a vector of parameters: "Acceptance or rejection of the null hypothesis via a test based on T may be inappropriate or incomplete in model evaluation for several reasons: (1) Some basic assumptions underlying T may be false, and the distribution of the statistic may not be robust to violation of these assumptions; (2) No specific model Σ(θ) may be assumed to exist in the population, and T is intended to provide a summary regarding closeness of Σ̂ to S, but not necessarily a test of Σ = Σ(θ); (3) In small samples, T may not be chi-square distributed; hence the probability values used to evaluate the null hypothesis may not be correct; (4) In large samples any a priori hypothesis Σ = Σ(θ), although only trivially false, may be rejected". Today we know that other considerations also apply.
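To illustrate the #5 similarity check mentioned above, the following sketch simply correlates the a priori parameter estimates obtained before and after modification; the two estimate vectors are invented for illustration and would in practice be taken from the two fitted models.

```python
# Illustrative check of recommendation #5: how similar are the a priori
# parameter estimates before and after model modification?
import numpy as np
from scipy.stats import pearsonr, spearmanr

est_before = np.array([0.62, 0.55, 0.71, 0.48, 0.33, 0.80])   # invented values
est_after  = np.array([0.60, 0.57, 0.69, 0.50, 0.35, 0.78])   # invented values

r, _ = pearsonr(est_before, est_after)
rho, _ = spearmanr(est_before, est_after)
print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}")
# Values near 1.0 suggest the added parameters completed, rather than
# fundamentally altered, the a priori part of the model.
```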

3. On the sources of caution regarding test statistics

Barrett notes that the chosen probability level (e.g., reject the model if p < .05) on a χ² test is arbitrary (true: why not .10, or .024), but "once that alpha level is set subjectively, ... it becomes 'exact'." I disagree. As dozens of simulations across decades have shown, test statistics are not necessarily trustworthy (e.g., Curran, West, & Finch, 1996; Hu, Bentler, & Kano, 1992). Even early proponents Jöreskog and Sörbom (1982, p. 408) had reservations about their overall goodness of fit test: "... we emphasize that such a use of χ² is not valid in most cases ...", hence they proposed GFI, AGFI and RMR as useful adjuncts to evaluating fit. The toolkit of possible χ² tests has recently vastly expanded (see Yuan & Bentler, in press-a, in press-b), and it does not make sense to talk about "the" χ² test. Including F-tests, EQS 6 provides more than a dozen model tests (Bentler, in press). I certainly favor the use of a carefully chosen model test, but even the best of these can fail in applications, e.g., due to model modification as noted above. Speaking generally, the conditions for a test statistic to be precisely χ² distributed will rarely be met exactly, and hence what is printed out as a precise p-value will tend to be a rather crude and error-prone approximation to what this probability would be under ideal conditions. Recent research has shown that limitations on T extend to the typical situation where the most general model has been rejected and chi-square difference tests are conducted (Maydeu-Olivares & Cai, 2006; Yuan & Bentler, 2004).
Then there is the issue of whether model testing is even relevant. There is a long tradition suggesting that the hypothesis Σ = Σ(θ) will essentially never be precisely correct (e.g., Bentler & Bonett, 1980; Browne & Cudeck, 1993; de Leeuw, 1988). In discussing models generally, MacCallum (2003, p. 113) notes "All of these models, in their attempt to provide a parsimonious representation of psychological phenomena, are wrong to some degree and are thus implausible if taken literally". Such points of view imply that the classical use of T in a hypothesis testing
way is inappropriate, even if the assumptions underlying the test are met and the model test truly represents a confirmatory test of an a priori hypothesis. After all, if the null hypothesis really is incorrect, then as sample size increases, the test statistic does not really evaluate a null hypothesis but rather represents a measure of power to reject the null. This power can be high: when unique variances in a latent variable model are small, the test T may have high power for model rejection even though correlational residuals are very small on average (Browne, MacCallum, Kim, Anderson, & Glaser, 2002).
Far more often than Barrett evidently sees, I have seen SEM with important real-world criteria
included as part of a model. This is harder to do than simple prediction or external validation,
which Barrett and I agree are desirable. Prediction is often associated with saturated models that
cannot be rejected, and their null hypotheses are often "nil", e.g., that population R² = 0.
In a standard SEM, I am willing to believe that some non-nil null hypotheses Σ = Σ(θ) may be precisely true. But it is hard to take this viewpoint in a model with huge degrees of freedom (df). Such a model is liable to be misspecified somewhere, and hence to be rejected by any "exact" test. Consider a model with 400 df, where df is the number of nonredundant elements in Σ minus the number of free parameters in θ. This means that there are over 400 ways of being incorrect when specifying the model. It seems unlikely that any researcher would ever have enough knowledge to propose a model that is precisely correct in all 400+ ways. And real models can easily have more than 1000 df.
I doubt that there are fields in social science where large df models fit exactly. Barrett states that in "... item response theory, the notion of 'approximate fit' does not even exist. Models either fit, or they do not fit". This is not my reading of the literature. As in SEM, applying an appropriate factor analysis to binary IRT data may reveal unidimensionality or multidimensionality (e.g., Schilling & Bock, 2005). In fact, a precise unidimensional model may be hard to achieve (e.g., Stout, 1990). Since SEM tools based on normal latent traits (e.g., Lee, Poon, & Bentler, 1995) can be used to evaluate some IRT models, SEM fit indices become available to IRT. In this context, a one-factor model designed to evaluate whether 60 items are unidimensional – a reasonable problem in IRT – is associated with about 1700 df. If we had methods to estimate and test such a large model, I would guess that it would never fit exactly. But it might fit approximately.
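The degrees-of-freedom arithmetic behind this figure is easy to reproduce. The sketch below uses the standard parameter count for a one-factor model (one loading and one unique variance per item, with the factor variance fixed); the 60-item case is simply the illustrative example from the paragraph above.

```python
# df = p(p + 1)/2 nonredundant elements of Sigma minus the number of free parameters.
def one_factor_df(p):
    nonredundant = p * (p + 1) // 2   # unique variances and covariances
    free_params = 2 * p               # p loadings + p unique variances (factor variance fixed)
    return nonredundant - free_params

print(one_factor_df(60))  # 1830 - 120 = 1710, i.e., "about 1700 df"
```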

4. Approximate fit tests

If exact fit tests are not so exact, there may be a role for approximate fit. In discussing this, Barrett speaks negatively about recent work that tried to provide simulation-based guidance about the behavior of fit indices under null and non-null conditions. It seems to me helpful to know which indices are relatively insensitive to sample size, are sensitive to model misspecification, etc., even if the best recent research is not definitive. Perhaps SRMR needs little further research, since it is interpretable on its own. While T = 3000 is meaningless, SRMR = .04 provides sensible information – correlations are reproduced to about .04 on average.
I doubt that the concept of approximate fit is misleading or wrong. It does allow uncertainty that Barrett would banish (as would I, if I could). Yet there is uncertainty in external prediction too; a large R² with many predictors certainly does not rule out the existence of many competing models (different sets of predictors) with virtually identical R². And that's just a 1-equation model!

More generally, in large multivariate problems, perhaps there exists no single exact truth to be discovered. If so, current research on well-functioning statistical tests of approximate or "close" fit is right on target (see, e.g., UCLA Statistics Preprint #494).

References

Bentler, P. M. (1990). Comparative fit indexes in structural models. Psychological Bulletin, 107, 238–246.
Bentler, P. M. (1995). EQS structural equations program manual. Encino, CA: Multivariate Software.
Bentler, P. M. (in press). EQS 6 structural equations program manual. Encino, CA: Multivariate Software.
Bentler, P. M., & Bonett, D. G. (1980). Significance tests and goodness of fit in the analysis of covariance structures.
Psychological Bulletin, 88, 588–606.
Browne, M. W., & Cudeck, R. (1993). Alternative ways of assessing model fit. In K. A. Bollen & J. S. Long (Eds.),
Testing structural equation models (pp. 136–162). Newbury Park, CA: Sage.
Browne, M. W., MacCallum, R. C., Kim, C.-T., Anderson, B., & Glaser, R. (2002). When fit indices and residuals are
incompatible. Psychological Methods, 7, 403–421.
Curran, P. J., West, S. G., & Finch, J. F. (1996). The robustness of test statistics to nonnormality and specification error
in confirmatory factor analysis. Psychological Methods, 1, 16–29.
de Leeuw, J. (1988). Model selection in multinomial experiments. In T. K. Dijkstra (Ed.), On model uncertainty and its
statistical implications (pp. 118–138). Berlin: Springer.
Hu, L.-T., & Bentler, P. M. (1995). Evaluating model fit. In R. H. Hoyle (Ed.), Structural equation modeling: Concepts,
issues and applications (pp. 76–99). Thousand Oaks, CA: Sage.
Hu, L., Bentler, P. M., & Kano, Y. (1992). Can test statistics in covariance structure analysis be trusted? Psychological
Bulletin, 112, 351–362.
Jöreskog, K. G., & Sörbom, D. (1982). Recent developments in structural equation modeling. Journal of Marketing
Research, 19, 404–416.
Lee, S.-Y., Poon, W.-Y., & Bentler, P. M. (1995). A two-stage estimation of structural equation models with continuous
and polytomous variables. British Journal of Mathematical and Statistical Psychology, 48, 339–358.
MacCallum, R. C. (2003). Working with imperfect models. Multivariate Behavioral Research, 38, 113–139.
MacCallum, R. C., Roznowski, M., & Necowitz, L. B. (1992). Model modification in covariance structure analysis: The
problem of capitalization on chance. Psychological Bulletin, 111, 490–504.
Maydeu-Olivares, A., & Cai, L. (2006). A cautionary note on using G²(dif) to assess relative model fit in categorical
data analysis. Multivariate Behavioral Research, 41, 55–64.
McDonald, R. P., & Ho, M.-H. R. (2002). Principles and practice in reporting structural equation analyses.
Psychological Methods, 7, 64–82.
Schilling, S., & Bock, R. D. (2005). High-dimensional maximum marginal likelihood item factor analysis by adaptive
quadrature. Psychometrika, 70, 1–23.
Stout, W. F. (1990). A new item response theory modeling approach with applications to unidimensionality assessment
and ability estimation. Psychometrika, 55, 293–325.
Yuan, K.-H., & Bentler, P. M. (2004). On chi-square difference and z-tests in mean and covariance structure analysis
when the base model is misspecified. Educational and Psychological Measurement, 64, 737–757.
Yuan, K.-H., & Bentler, P. M. (in press-a). Structural equation modeling. In C. R. Rao & S. Sinharay (Eds.), Handbook
of statistics: Psychometrics. Amsterdam: North-Holland.
Yuan, K.-H., & Bentler, P. M. (in press-b). Robust procedures in structural equation modeling. In S.-Y. Lee (Ed.),
Handbook of structural equation models. Amsterdam: Elsevier.
