": Sample Size and Goodness of Fit in Structural Equation Models with Latent Variables Author(s): J. S. Tanaka Reviewed work(s): Source: Child Development, Vol. 58, No. 1 (Feb., 1987), pp. 134-146 Published by: Blackwell Publishing on behalf of the Society for Research in Child Development Stable URL: http://www.jstor.org/stable/1130296 . Accessed: 03/07/2012 03:38
"How Big Is Big Enough?": Sample Size and Goodness of Fit in Structural Equation Models with Latent Variables
J. S. Tanaka
New York University

TANAKA, J. S. "How Big Is Big Enough?": Sample Size and Goodness of Fit in Structural Equation Models with Latent Variables. CHILD DEVELOPMENT, 1987, 58, 134-146. The large-sample statistical theory for latent-variable structural equation models offers little solace to the developmental psychologist, who is often confronted with less than optimally large sample sizes. This article reviews previously proposed alternatives to the sample-size and goodness-of-fit issue in latent-variable structural equation models. Various nonparametric fit indices for latent-variable systems are reviewed, with their strengths and weaknesses discussed. An alternative estimation strategy called ME2 estimation is introduced as a possible solution to the small-sample problem.

The theory and application of latent-variable structural equation models and their relevance to developmental issues have become well established. Reviews of the field by Bentler (1980, 1983) and Jöreskog (1978), texts by authors such as Everitt (1984), James, Mulaik, and Brett (1982), and McDonald (1985), and the papers presented in this special section provide nontechnical introductions to the logic of these methods. In many of these treatments, the issue of sample size is invariably raised but typically not treated in sufficient detail to provide useful information to users of these methods. This paper will attempt to provide some guidelines and alternatives regarding sample-size and goodness-of-fit issues in latent-variable structural equation models.

The developmental researcher is probably most acutely aware of sample-size problems, since models of human development are often complex and involve many variables. However, the number of subjects available to test such models is often small or, at least, small relative to the size and complexity of the assumed model for the data. For example, Bornstein and Benasich (1986) recently tested a latent-variable model of habituation in infants using 35 subjects.
The question of how many subjects are needed before estimating and testing a latent-variable structural equation model has plagued researchers who are forewarned about the necessity of large samples for appropriate statistical conclusions. While statisticians can find solace in asymptotic statistical theory (perhaps unreasonably; see Tukey, 1986), the developmental researcher using these methods is often left wondering about the relevance of such theory for finite samples. This article will review results from previous work that has attempted to address the sample-size and goodness-of-fit issues and will introduce a new method of estimation especially designed for small samples.
This research was supported in part by a New York University Research Challenge Fund Grant and a New York University Presidential Fellowship. Portions of this manuscript were completed while the author was visiting the Department of Statistics at The Pennsylvania State University. The author wishes to thank G. J. Huba, A. T. Panter, and two anonymous reviewers for their comments on the manuscript and Paul Bolten for production assistance. Address correspondence to: Jeffrey S. Tanaka, Department of Psychology, New York University, 6 Washington Place, 7th Floor, New York, NY 10003.

[Child Development, 1987, 58, 134-146. © 1987 by the Society for Research in Child Development, Inc. All rights reserved. 0009-3920/87/5801-0008$01.00]
stably a correlation coefficient in a sample of size 2. As the number of observed variables increases and all pairwise correlations among variables are considered, sample size must be increased for confidence in the computed correlations. For example, ad hoc rules of thumb given for statistical models such as multiple regression might place the number of subjects to number of variables ratio at 10:1 to deal with problems of sampling variability (e.g., the "bouncing beta" problem in multiple regression) and to ensure adequate statistical power.

The problem is compounded in latent-variable structural equation models. Unlike more familiar univariate statistical models, such as ANOVA or multiple regression, statistical theory is not available explicitly to take into account differences in sample size, as is done in omnibus F or t tests. In statistics with known finite sample properties, the sampling distributions of the statistics change as a function of sample size. This is apparent in the changes in degrees of freedom when sample size changes. Hence, F statistics and t statistics explicitly adjust for sample-size differences. In contrast, the statistical theory in latent-variable structural equation models is asymptotic in nature. While asymptotic statistical theory is fully developed elsewhere (e.g., White, 1984), for the purposes of this review, it is sufficient to note that asymptotic statistical theory implies that confident conclusions can be drawn from data (e.g., regarding the distribution of omnibus test statistics and standard errors for parameter estimates) only as total sample size N increases without bound. These "large sample" results buy some degree of confidence (but not certainty) when N is large, but do not provide a guideline about when sample sizes are large enough.

Unfortunately, the researcher interested in latent-variable modeling is caught in a double bind.
If large samples are obtained, statistical power to reject the null hypothesis will be high. In the "accept/support" null hypothesis testing strategy in latent-variable structural equation models where one is looking for small chi-square statistic values relative to the degrees of freedom (and hence wants to "accept" the null hypothesis that "the model fits"), trivial substantive deviations from a postulated model may lead to an overall significant omnibus test and rejection of the model as an adequate one for the given data. Therefore, having dutifully collected data on a large sample, the high statistical power available to reject the null hypothesis
may lead a researcher to reject a model which, in fact, deviates from the population model in a trivial way. The problem in latent-variable structural equation models can be summarized as follows: The researcher realizes that results are supported only by large-sample (asymptotic) statistical theory and obtains an appropriately large sample, when possible. Having obtained such a sample, the researcher tests models of interest but finds that all such models are rejected since the obtained sample is "too large" (i.e., the statistical power available in the sample is detecting potentially noninteresting substantive differences as contributing to the lack of correspondence between model and data).

Recent developments in the theory of covariance structure models exacerbate this problem. Browne (1982, 1984) has introduced the appropriate statistical theory for covariance structures that can be applied to nonnormally distributed data. Huba and Harlow (1983, 1986, 1987, in this issue), Huba and Tanaka (1983), and Tanaka and Huba (in press) have demonstrated the empirical utility of this estimator, showing how the chi-square statistic evaluating model adequacy can be affected by data nonnormality. This estimator, which makes fewer assumptions regarding the distribution of the observed data, should be highly useful in evaluating developmental phenomena and is treated in greater detail by Huba and Harlow (1986, 1987, in this issue).

The problem that this estimator (referred to as an asymptotically distribution-free [ADF] estimator) raises with respect to the sample-size issue is that the observed mean vector and covariance matrix are no longer sufficient statistics, as they are in the case when data follow a multivariate normal distribution. In other words, when data are normally distributed, they can be completely summarized in terms of their means, variances, and covariances since other information (e.g., skewness and kurtosis) is no longer relevant.
This is not true of nonnormal data. Browne's nonnormal estimator requires fourth-order moment (kurtosis) information to estimate models. Mardia (1974) has shown that stable estimates of this information require large sample sizes. Hence, if these nonnormal data methods are employed, even larger samples are required both to estimate stably the necessary higher-order information from the data and to satisfy the requirements of the asymptotic statistical theory underlying the development of the nonnormal (and normal theory) approach.
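The sample-size demands of fourth-order moments can be illustrated with a small simulation (a sketch added here, not part of the original analysis): across repeated normal samples, the excess-kurtosis estimate is far more variable than the variance estimate at any given n, and both stabilize only slowly as n grows.

```python
import numpy as np

rng = np.random.default_rng(0)

def moment_sd(n, reps=2000):
    """Sampling standard deviations of the variance and excess-kurtosis
    estimates across `reps` standard-normal samples of size n."""
    var_est, kurt_est = [], []
    for _ in range(reps):
        x = rng.standard_normal(n)
        m = x.mean()
        m2 = ((x - m) ** 2).mean()          # second central moment
        m4 = ((x - m) ** 4).mean()          # fourth central moment
        var_est.append(m2)
        kurt_est.append(m4 / m2 ** 2 - 3.0)  # excess kurtosis
    return np.std(var_est), np.std(kurt_est)

for n in (50, 200, 1000):
    sd_var, sd_kurt = moment_sd(n)
    print(f"n={n:5d}  SD(variance)={sd_var:.3f}  SD(kurtosis)={sd_kurt:.3f}")
```

At every sample size the kurtosis estimate is several times more variable than the variance estimate, which is the intuition behind Mardia's (1974) result.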
This article reviews various solutions that have been forwarded in attempting to resolve the issue of sample size in structural equation models. Both empirical (Monte Carlo) and analytic work are discussed. In Monte Carlo studies, data are generated at various sample sizes. These data follow a known model in the population. The adequacy in recovering the known population models at the various sample sizes is then evaluated. Analytic work has focused primarily on the development of nonparametric fit indices to establish goodness of fit. In this article, a new estimation strategy for latent-variable models designed to address the small-sample problem is also given. This alternative strategy is based on work by Theil (1982; Theil & Laitinen, 1980) and Vinod (1982) in econometrics. An example showing the possible utility of this estimator is given.

as implemented in the LISREL computer program (Jöreskog & Sörbom, 1984). Samples ranging in size from 50 to 300 were used in their Monte Carlo study. They concluded that reasonably robust estimates could be obtained in samples smaller than the optimal N = 200 reported in Boomsma. Tanaka (1984) looked at the effects of sample size in latent-variable structural equation models across different sample sizes as well as different estimators, including the estimators introduced by Browne (1982, 1984) for nonnormal data and a quasi-likelihood estimator (e.g., McCullagh & Nelder, 1984, chap. 8). He found that estimates in the model, standard errors of the estimates, and the model fit statistic were degraded in samples of size 100 in a confirmatory two-factor, six-variable model. Further, these degradations were most noticeable in the ADF cases, which require additional fourth-order information from the data. Hence, Tanaka's Monte Carlo results suggest that the type of sample-size concerns raised in the Boomsma and Gerbing and Anderson studies become more acute in nonnormal estimation. Other Monte Carlo studies by Harlow (1985) and Muthén and Kaplan (1985) support this finding.

Monte Carlo studies can be of some utility in determining appropriate sample sizes. But even the most comprehensive studies tend to consider only a small subset of models encountered in practice. Further, the various Monte Carlo studies that have been conducted examining sample size have been inconclusive in their findings. Boomsma suggests that sample sizes of 100 are strong lower bounds when considering maximum likelihood estimation and recommends samples of 200 or more. Tanaka's results suggest that this size sample may be problematic when nonnormal estimation methods are used. In considering the behavior of the test statistic, Geweke and Singleton suggest that sample size can be reduced well below this, perhaps to samples as small as 20.

Recall, however, that Geweke and Singleton explicitly did not consider sample-size effects on estimates of model parameters or standard errors. Given these conflicting results, it would be difficult to establish a precise decision rule for determining sample size based on the existing Monte Carlo evidence. However, a number of convergences seem to emerge from the Monte Carlo studies. First, it would appear that the problem of selecting an appropriate sample size is tied to both the ratio of number of variables to number of subjects and the ratio of number of parameters to be estimated to the number of subjects. The results given by Geweke and Singleton seem to be based, in part, on the relatively small (with respect to the number of latent and measured variables) models that were considered in their simulations. The Boomsma results reflect an aggregation of evidence over a wider class of models. Additional information from the data required for the ADF estimators might partially explain the failure of samples of size 100 in Tanaka's investigation.

It is worthwhile to spell out in detail the convergences that exist about sample size from the available Monte Carlo evidence. First, there is some agreement that sample-size appropriateness is tied to the ratio of the number of subjects to the number of parameters estimated. This differs somewhat from the usual concern with the ratio of number of subjects to number of variables. In the context of latent-variable models, it can be made clear why the concern should lie with the number of estimated parameters rather than with the number of variables. For example, in multiple regression, where the number of variables to number of subjects rule is often cited, a regression coefficient is estimated for each predictor variable, measuring its unique contribution to the outcome measure. Assessments of the adequacy of the regression model are obtained by comparing the predicted values of the outcome variable (i.e., the estimated regression coefficients multiplied by the values of the predictor variables) to the observed values of the outcome variable using the sum of squared deviations as the criterion. In comparison, when considering the relations between observed and latent variables in latent-variable models (the measurement model), both the regression coefficient (factor loading) linking the observed and latent variable and the error component (unique variance) of the model must be simultaneously estimated.
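This distinction can be made concrete by counting free parameters. The sketch below (added for illustration) assumes a simple confirmatory factor model in which each observed variable loads on one factor and each factor's scale is set by fixing one loading at 1; other identification conventions would change the count slightly.

```python
def cfa_free_parameters(indicators_per_factor):
    """Count free parameters in a simple CFA: one loading per factor is
    fixed at 1 for scaling, every observed variable gets a uniqueness,
    and all factor variances and covariances are free."""
    k = len(indicators_per_factor)        # number of factors
    p = sum(indicators_per_factor)        # number of observed variables
    loadings = p - k                      # one loading per factor is fixed
    uniquenesses = p                      # one error variance per variable
    factor_cov = k * (k + 1) // 2         # factor variances + covariances
    return loadings + uniquenesses + factor_cov

# One factor with six indicators: 5 loadings + 6 uniquenesses + 1 variance.
q = cfa_free_parameters([6])
print(q, "free parameters; N = 50 gives a subjects-to-parameters ratio of", 50 / q)
```

With six indicators of a single factor, the count is 12 free parameters, so a sample of 50 yields roughly a 4:1 subjects-to-parameters ratio.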
Second, ML estimation will probably be more robust to the effect of small sample sizes than estimation procedures such as Browne's, which were designed to take into account data nonnormality. For example, Tanaka (1984) showed that ML estimates were least affected (using bias as a criterion) in comparison to a variety of possible nonnormal alternatives in nonnormal samples. It should be noted that ML test statistics and standard errors were more incorrect relative to their ADF counterparts in that study. Thus, decision rules about the omnibus fit of the model or the significance of individual parameter estimates may be incorrect under conditions of nonnormality. The magnitude of the parameter estimates did not change appreciably across the methods of estimation, replicating results reported, for example, in Huba and Harlow (1983, 1986, 1987, in this issue) and Tanaka and Huba (in press).

As an alternative to Monte Carlo work, other approaches to the small-sample-size issue have been forwarded. These approaches look for alternative goodness-of-fit indices that compare the observed data with the hypothesized model. With regard to latent-variable models, these fit indices were originally designed to look at possible problems in small samples where the chi-square statistic might, in fact, not be distributed as a chi-square variable. While these fit indices apply to any latent-variable model, they are particularly relevant in small samples, where they can be used to discriminate among models in a way that may be more valid (although less statistically based) than the chi-square statistic.
Further, decision errors of both types can be made under conditions of nonnormality (i.e., models that should be retained can be rejected and models can also be incorrectly retained).

An early decision criterion in evaluating the fit of models was to look at the ratio of the chi-square test statistic for a given model to the model's degrees of freedom. For well-fitting models, the expected value of this ratio is 1.0. However, published ad hoc rules for the retention of well-fitting models on the basis of this statistic have ranged from 2.0 to 5.0 (e.g., Marsh & Hocevar, 1985).

In a move away from the chi-square-based decision criterion, Bentler and Bonett (1980), following the earlier work of Tucker and Lewis (1973), developed normed and nonnormed fit indices for these latent-variable models. These indices differ only in that normed indices are constrained to lie between 0 and 1, while nonnormed indices do not have to lie in this range, although they generally will in practice. For both normed and nonnormed indices, values close to 1.0 are indicative of well-fitting models. The development of these indices by Bentler and Bonett rests on what they term "null model logic," where models of interest are compared to a baseline model (e.g., a model hypothesizing uncorrelated observed variables). Sobel and Bohrnstedt (1985) discuss some alternative null model specifications.

Bentler and Bonett developed their fit indices to deal with problems inherent in the omnibus chi-square test statistic and, in particular, its dependence on sample size. The normed and nonnormed indices are "sample size free" measures of the adequacy of fit of a model relative to some baseline (although see Bollen, 1986). Further, the normed fit index was designed as an "estimator free" measure of fit, since it could be used in conjunction with estimation methods that did not explicitly provide a chi-square statistic, such as ordinary (unweighted) least squares. Further issues regarding this aspect of the normed fit index can be found in Bentler and Bonett (1980).

The Bentler-Bonett indices, while popular, are not the only measures of fit available for latent-variable models. Other fit indices have been proposed by Jöreskog and Sörbom. In particular, recent releases of the Jöreskog-Sörbom LISREL program for latent-variable models have included a goodness-of-fit index and an adjusted goodness-of-fit index, as well as other nonparametric measures of model fit, including the root mean squared residual (i.e., the root mean square of the sample covariance matrix minus the covariance matrix generated by the model specification). Unlike the Bentler-Bonett normed index, these fit indices appear to be estimator-specific, since different fit indices are presented depending on the estimation method employed. However, Tanaka and Huba (1985) show how each of these can be derived as a special case of a more general form and, hence, are in the same metric. This point will be of interest later in this article.

Yet another alternative is presented in Cudeck and Browne (1983), based on the Akaike or Schwarz information criteria. Cudeck and Browne utilize developments from information theory in statistics and suggest that model selection from a set of models can be determined from the model with the minimum value of the Akaike (Schwarz) information. They suggest that such a strategy might be particularly effective in the context of cross-validating models.

In the next section, the behavior of the different fit indices will be examined. In particular, the same model for data will be estimated using different estimation procedures. By varying only the method of estimation, summary statistics such as these proposed fit indices should not vary extensively, particularly if these are to be the basis of determining model adequacy. More specifically, the same fit index for the same model should not vary according to the particular estimation strategy chosen. Such comparisons are considered by way of example below.
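For concreteness, the normed and nonnormed indices can be computed directly from the chi-square statistics of the target and baseline ("null") models using the published Bentler-Bonett (1980) formulas. The chi-square and degrees-of-freedom values in this sketch are hypothetical.

```python
def normed_fit_index(chisq_model, chisq_null):
    """Bentler-Bonett normed fit index: the proportion of the baseline
    model's lack of fit that the target model removes."""
    return (chisq_null - chisq_model) / chisq_null

def nonnormed_fit_index(chisq_model, df_model, chisq_null, df_null):
    """Tucker-Lewis / Bentler-Bonett nonnormed fit index; it can fall
    outside [0, 1], although it rarely does in practice."""
    r_null = chisq_null / df_null
    r_model = chisq_model / df_model
    return (r_null - r_model) / (r_null - 1.0)

# Hypothetical chi-square values, for illustration only.
print(normed_fit_index(30.0, 300.0))                  # 0.9
print(nonnormed_fit_index(30.0, 24.0, 300.0, 36.0))
```

Both indices approach 1.0 as the target model's chi-square per degree of freedom approaches the value expected for a correct model.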
[Figure 1. A latent-variable model in which Physical Symptoms (indicators: gastrointestinal symptoms, neurological symptoms, respiratory symptoms) and Psychosomaticism (indicators: psychosomaticism #1 and #2) predict Depression at Time 1 and Time 2 (indicators at each time: Beck, Zung, CES-D).]
Figure 1 depicts a latent-variable structural equation model looking at the effects of self-reported physical health ailments and self-reports of psychosomaticism on a construct of depressive affect at two time points, measured roughly 1 month apart. The structural part of the model is saturated; that is, all possible pairwise relations among latent variables are freely estimated. The sample was comprised of 112 college undergraduates. When the model of Figure 1 is estimated by ML, a normed fit index value of .88 is obtained. The same model is re-estimated under generalized least-squares (GLS) estimation (Browne, 1974; Jöreskog & Goldberger, 1972). Under this specification, a normed fit index of .62 is obtained.1 Using the decision criterion for the normed fit index originally provided in Bentler and Bonett, the ML model falls somewhat below the .90 threshold, while the GLS model falls well below this threshold. Note that the models themselves are identical; only the method by which they were estimated differs. Even without concentrating on the particulars of a decision rule, it is clear that the fit indices under the two different methods of estimation are yielding different results in terms of the proportion of total possible fit in the data being accounted for.

In contrast to this result, the fit indices presented in Tanaka and Huba (1985) were also calculated for the ML and GLS solutions. It should be noted that, for ML, this is printed as the goodness-of-fit index (GFI) in LISREL. However, the GFI presented for the GLS solution in the current version of LISREL (i.e., LISREL VI) is not the same one that is presented in Tanaka and Huba. For the ML and GLS solutions of the Figure 1 model, the values of this fit index are .89. The exact technical nature of this difference between the Bentler-Bonett index and the Tanaka-Huba generalization of the Jöreskog-Sörbom index is further developed in Tanaka and Huba (1986). For the present purposes, it is important to note that, because of these across-estimator differences, the
1 The nonnormed fit indices behave in the same manner, with values of .90 and .70 for ML and GLS, respectively.
normed fit index of Bentler and Bonett probably should not be used to evaluate model fit when comparisons are made across different methods of estimation. The robustness of fit indices across estimation method may become important in the context of cross-validations or comparisons of results across different studies, particularly as alternative methods of estimating models become available. In such comparisons, the index given in Tanaka and Huba (1985) and presented in the LISREL program of Jöreskog and Sörbom (1984) may be preferable.

An Estimator-based Perspective to Small Samples: Background for the Maximum Entropy Estimator

The Monte Carlo evidence and the results on fit indices presented thus far represent ways of trying to address the sample-size issue from the perspective of existing methods of estimating latent-variable models. In this section, another approach, based on developing a specific estimation strategy for small samples, is presented. In comparison to the other approaches reviewed, this addresses the small-sample problem by suggesting a novel estimation strategy rather than working within the limits of existing standard methods.

In examining small samples, one is particularly worried about the most degenerate case, where the number of subjects is less than the number of variables being analyzed. This results in what is referred to as matrix singularity (e.g., the covariance matrix is singular), or nonpositive definiteness (e.g., the covariance matrix is not positive definite). This implies that there is not sufficient information available in the sample to stably estimate the sample covariance matrix.2 Often, in LISREL runs, this information is given in the ubiquitous error message "the input matrix is not positive definite." Failure to stably estimate the sample covariance matrix poses a problem for statistically based methods of estimation, such as ML and GLS, since these methods depend on the existence of the inverse of a covariance matrix (equivalent to matrix nonsingularity and to the matrix being positive definite). The developments in maximum entropy estimation of covariance matrices are designed specifically to address this problem.

Hence, the most extreme definition of undersized would imply that the sample is too small for an inverse of the covariance matrix to exist. While this definition of "undersizedness" certainly applies for those models where the number of variables is large relative to the number of subjects, there is a wider class of models where this strict definition of undersizedness is not met but where the ratio of number of subjects to observed variables is less than adequate. While the resulting covariance matrix may have an inverse, it will not be stably estimated. Hence, maximum entropy estimation of the covariance matrix might also have something to say in the case where the researcher does not meet undersizedness by the strict definition but may feel uncomfortable with the available sample size. It is the latter definition of undersizedness that is suggested for adoption here, since there would appear to be limited scientific practicality and utility in postulating models for data where the number of modeled variables outstrips the number of subjects on whom data were collected.
2 The same claim for lack of sample information could be made for the case of collinearity (e.g., Cohen & Cohen, 1983, pp. 115-116). The interpretation given here will be in terms of deficient sample size.
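The degenerate case is easy to demonstrate numerically (an illustration added here, using simulated data): with fewer subjects than observed variables, the sample covariance matrix cannot attain full rank, so its inverse does not exist.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_cov_rank(n_subjects, n_vars):
    """Rank of the sample covariance matrix of a random data matrix."""
    X = rng.standard_normal((n_subjects, n_vars))
    S = np.cov(X, rowvar=False)   # n_vars x n_vars sample covariance
    return np.linalg.matrix_rank(S)

# With 5 subjects and 10 variables, rank is at most 4 (centering costs
# one degree of freedom), so S is singular and has no inverse.
print(sample_cov_rank(5, 10))
# With an adequate number of subjects, S attains full rank.
print(sample_cov_rank(100, 10))
```

This is exactly the situation that triggers the "input matrix is not positive definite" message described in the text.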
in observed measurement. Essentially, what one does is to augment the variances of measured variables by a factor reflecting the measurement uncertainty in each variable. Tanaka (1986) gives further details on the exact computational implementation of this method for latent-variable models. Given the Vinod formulation, the ME2 matrix can be expressed as a function of the sample covariance matrix. If one lets M be the ME2 matrix and S be the sample covariance matrix, then M = S + D, where D is a diagonal matrix whose elements consist of the "correction factors" for each of the observed variables in the model. The reader interested in further technical developments for the ME2 estimator may refer to Vinod (1982) or Tanaka (1986).3 An example of the ME2 method of estimation follows.
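The form M = S + D can be sketched as follows. The actual ME2 correction factors of Vinod (1982) are not reproduced here; the diagonal entries supplied in this illustration are hypothetical placeholders intended only to show the structure of the adjustment.

```python
import numpy as np

def me2_adjusted_cov(S, correction):
    """Form M = S + D, the maximum-entropy-style adjustment described in
    the text. `correction` is a vector of per-variable variance
    inflations; the actual ME2 values would come from the Vinod (1982)
    derivation rather than being chosen by hand as they are here."""
    D = np.diag(np.asarray(correction, dtype=float))
    return S + D

# Toy example: a singular 2 x 2 "covariance" matrix made invertible.
S = np.array([[1.0, 1.0],
              [1.0, 1.0]])              # rank 1, determinant 0, no inverse
M = me2_adjusted_cov(S, [0.1, 0.1])     # inflate each variance slightly
print(np.linalg.det(S), np.linalg.det(M))
```

After the diagonal adjustment the determinant is positive, so M is invertible and can be passed to covariance-structure software that requires a positive definite input matrix.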
TABLE 1

MAXIMUM-LIKELIHOOD ESTIMATES FOR A SINGLE-FACTOR MODEL UNDERLYING SIX WPPSI SUBTESTS (N = 50)

A. Factor Loadings and Uniquenesses

Subtest          Loading        Uniqueness
Information      1.00*          1.70 (3.64)
Vocabulary       1.41 (8.19)    2.47 (3.18)
Arithmetic        .74 (6.39)    1.89 (4.33)
Similarities      .77 (4.55)    5.22 (4.71)
Comprehension    1.07 (6.94)    3.01 (4.12)
Sentences         .90 (5.26)    4.89 (4.60)

B. Factor Variance

4.89 (3.69)

NOTE.-Critical ratios determining the statistical significance of each parameter estimate are enclosed in parentheses. Asterisk denotes a parameter fixed at the given value. Goodness-of-fit chi-square statistic = 11.98 on 9 df (p = .21); goodness-of-fit index = .93.
The data used in this example were presented by Woodward and Bentler (1978) and consist of the covariances among the Information, Vocabulary, Arithmetic, Similarities, Comprehension, and Sentences subtests of the Wechsler Preschool and Primary Scale of Intelligence (WPPSI; Wechsler, 1967). Woodward and Bentler adapted these data based on a sample of 50 respondents from earlier work by Cronbach, Gleser, Nanda, and Rajaratnam (1972, p. 251).

Table 1 presents ML estimates for a latent-variable model postulating a single Verbal latent construct underlying the six subtests. Note that the 12 parameters in the model are estimated in a sample of size 50, giving better than a 4:1 subject-to-parameters ratio. Since this is not a particularly small ratio, it is likely that ME2 estimates of this model will not deviate from the ML estimates, despite the fact that the given sample size is relatively small. Maximum-likelihood estimates were obtained from the LISREL program.

Table 2 presents the corresponding ME2 estimates of the single-factor model. Computationally, these estimates were obtained by inputting the ME2 covariance matrix as the matrix to be analyzed in LISREL. Note, however, that only ME2 estimates can be obtained in this way, and other information that is given in output, such as the chi-square statistic and standard errors, is not to be trusted.

TABLE 2

MAXIMUM ENTROPY ESTIMATES FOR A SINGLE-FACTOR MODEL UNDERLYING SIX WPPSI SUBTESTS

A. Factor Loadings and Uniquenesses

Subtest: Information, Vocabulary, Arithmetic, Similarities, Comprehension, Sentences
[The loading and uniqueness values of panel A were not recovered in this scan.]

B. Factor Variance

4.88

NOTE.-Asterisk denotes a parameter fixed at the given value.
3 There is a similarity between this method of "adjusting" the standard sample covariance matrix and similar methods of "adjusting" the data in multiple regression. In particular, the ME2 estimator resembles "ridge" regression and other "ridge" estimation strategies (e.g., Dong, 1985; Hoerl & Kennard, 1970; Pagel & Lunneborg, 1985). However, unlike the choice of the ridge constant(s) in ridge regression, the selection of the ME2 diagonal matrix is based on a more systematic notion of replacing sample moments as estimates of population moments by the ME2 moments.
TABLE 3
In comparing the ML results in Table 1 with the ME2 results in Table 2, one can observe that there is practically no difference between the estimates of the factor loadings and the factor variance across the two solutions. The uniqueness values in the ME2 solution are slightly higher than those in the ML solution. This is to be expected, given the inflation of variances necessary to ensure a positive definite covariance matrix in the maximum entropy solution. One method of quantifying the relationship between two sets of estimates is to calculate the Pearson correlation between the estimates (e.g., Tanaka & Huba, 1984). Calculating this coefficient for the Table 1 and Table 2 estimates gives a value of .991, indicating a high degree of congruence between the two solutions. This suggests that the ME2 estimates will "look like" ML estimates.

Finally, one would like to compare both the ML and ME2 estimates with a large-sample version of the same data to investigate how closely the different estimators resemble results that would be obtained in a large sample. Fortunately, for these data, such a comparison is possible. In this case, we employ the standardization data from the WPPSI (Wechsler, 1967). Recall that the standardization data for the WPPSI consisted of samples of 200 collected from six age groups ranging from age 4 to 6½. To obtain the largest sample size possible, that is, a sample of size 1,200, the hypothesis that the covariance matrices for the six groups are poolable must be tested. If the null hypothesis that the covariance matrices are equal across groups cannot be rejected, then greater statistical power is available for the full sample of 1,200.4 In this case, the chi-square statistic evaluating this null hypothesis gave a value of 100.78 on 105 df (p = .60), thus allowing the pooling of covariance matrices across the six groups. Table 3 gives the ML estimates for the single-factor model based on the sample size of 1,200.

TABLE 3
MAXIMUM-LIKELIHOOD ESTIMATES FOR A SINGLE-FACTOR MODEL UNDERLYING SIX WPPSI SUBTESTS (N = 1,200)

A. FACTOR LOADINGS AND UNIQUENESSES

Subtest    Loading         Uniqueness
1          1.00*           3.40 (18.33)
2          .99 (25.45)     4.39 (20.56)
3          .86 (24.32)     4.55 (21.14)
4          .88 (24.15)     4.87 (21.23)
5          .96 (27.24)     3.74 (19.32)
6          .87 (23.85)     5.07 (21.36)

B. FACTOR VARIANCE

5.81 (15.68)

NOTE.—Critical ratios determining the statistical significance of each parameter estimate are enclosed in parentheses. The asterisk denotes a parameter fixed at the given value. Goodness-of-fit chi-square statistic = 41.83.

The fit statistic for this model was 41.83 on 9 df (p < .001), thus indicating rejection of the model. Model rejection can be attributed to the large-sample-size problems previously discussed in motivating alternative fit indices for covariance structure models. The Jöreskog and Sörbom goodness-of-fit index for this model is .99, indicating high model-data congruence. As before, the congruence between estimates obtained across different solutions can be compared using a Pearson correlation coefficient. The correlation between the ML estimates in the sample of size 50 and the sample of size 1,200 was .86; the corresponding correlation between the ME2 estimates and the ML estimates in the sample of size 1,200 was .90. An alternative method for assessing the "closeness" of the two sets of small-sample estimates to the estimates in the sample of size 1,200 is to calculate the root mean square residual of the estimated parameters. For the ML estimates this was calculated to be 0.77, while for the ME2 estimates it was calculated as 0.65. This result also suggests that the ME2 estimates are "closer" to the large-sample ML estimates than are the ML estimates calculated in the sample of 50.
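Both congruence measures used above — the Pearson correlation between two vectors of parameter estimates and the root mean square residual of their differences — take only a few lines to compute. A minimal Python sketch; the two estimate vectors below are hypothetical stand-ins, not the actual Table 1 and Table 2 values:

```python
from math import sqrt

def pearson_r(a, b):
    """Pearson correlation between two equal-length estimate vectors."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = sqrt(sum((x - ma) ** 2 for x in a))
    sb = sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (sa * sb)

def rms_residual(a, b):
    """Root mean square difference between two estimate vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

# Hypothetical estimates from a small-sample and a large-sample solution
small_n = [0.95, 0.80, 0.85, 0.90, 0.82, 0.88]
large_n = [0.99, 0.86, 0.88, 0.96, 0.87, 0.91]

r = pearson_r(small_n, large_n)        # nearer 1.0 = greater congruence
rmsr = rms_residual(small_n, large_n)  # nearer 0.0 = greater congruence
```

Note that the two indices can disagree: the correlation is insensitive to a uniform shift in one set of estimates, while the root mean square residual penalizes it.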
4 This null hypothesis is also referred to as testing the homogeneity of covariance matrices and is an assumption in MANOVA. References to this test can be found in any standard multivariate text (e.g., Mardia, Kent, & Bibby, 1979, p. 140).
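The degrees of freedom for the pooling test follow directly from the number of distinct covariance elements per group: (k − 1)·p(p + 1)/2, which for six groups of six subtests gives the 105 df reported above. A brief Python sketch; the Wilson–Hilferty normal approximation to the chi-square tail is my own substitution for whatever routine produced the reported p value:

```python
from math import erf, sqrt

def pooling_df(n_groups, n_vars):
    """df for testing equality of covariance matrices across groups:
    (k - 1) * p * (p + 1) / 2 distinct variances and covariances."""
    return (n_groups - 1) * n_vars * (n_vars + 1) // 2

def chi2_pvalue(x, df):
    """Approximate upper-tail p for a chi-square statistic via the
    Wilson-Hilferty cube-root normal approximation."""
    z = ((x / df) ** (1.0 / 3.0)
         - (1.0 - 2.0 / (9.0 * df))) / sqrt(2.0 / (9.0 * df))
    return 0.5 * (1.0 - erf(z / sqrt(2.0)))

df = pooling_df(6, 6)        # six age groups, six WPPSI subtests -> 105
p = chi2_pvalue(100.78, df)  # roughly .60, so pooling is not rejected
```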
to establish whether or not the ME2 estimator will remain similarly well-behaved in other empirical examples. A number of issues remain unresolved at this point with respect to the ME2 estimator. For example, at this time, it has not been determined whether or not standard errors or test statistics are available for the maximum entropy estimator.5 Two lines of research must be developed with regard to the ME2 estimator. First, extensive Monte Carlo work should be performed to evaluate the behavior of the ME2 estimator relative to other standard methods of estimation such as ML and GLS. Second, statistical work should be done to see if standard errors and test statistics can be developed for the ME2 estimator. Some progress on the latter point may be possible if resampling methods such as the bootstrap (e.g., Efron, 1982) are used. Rather than deriving asymptotic standard errors and test statistics for these models, as has been done for ML and GLS, bootstrap standard errors and test statistics could be obtained. In the case of estimators based on raw data, such as ML and GLS for either normal or nonnormal data, the bootstrap methods would be conceptually simple, although computationally demanding. They would consist of a simple resampling and reanalysis of the available data to "build up" empirically derived standard errors and omnibus fit statistics. While some work in this area using the principal components model has been reported by Chatterjee (1984), and applications to general covariance structures would be straightforward, to my knowledge no published work on the bootstrap for latent-variable models has yet been reported. Details of the bootstrap and other resampling methods are developed by Efron (1982). Application of bootstrap methods to the ME2 estimator may be slightly less straightforward, since it involves direct adjustments to the sample covariance matrix S.
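The resample-and-reanalyze loop just described can be sketched compactly. The example below bootstraps a simple stand-in statistic (a correlation between two observed variables) on simulated data; a latent-variable application would follow the same pattern with a model-fitting call inside the loop:

```python
import random
from math import sqrt

def corr(xs, ys):
    """Pearson correlation, standing in for a model parameter estimate."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

def bootstrap_se(data, statistic, n_boot=1000, seed=0):
    """Empirical standard error: resample cases with replacement,
    re-estimate the statistic, and take the standard deviation of
    the bootstrap replicates."""
    rng = random.Random(seed)
    reps = []
    for _ in range(n_boot):
        sample = [rng.choice(data) for _ in data]
        reps.append(statistic(sample))
    mean = sum(reps) / n_boot
    return sqrt(sum((r - mean) ** 2 for r in reps) / (n_boot - 1))

# Simulated bivariate data for 50 cases (a small-sample setting)
rng = random.Random(1)
data = []
for _ in range(50):
    x = rng.gauss(0, 1)
    data.append((x, 0.7 * x + rng.gauss(0, 0.7)))

se = bootstrap_se(data, lambda d: corr([p[0] for p in d],
                                       [p[1] for p in d]))
```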
One possibility is to consider perturbations of the diagonal matrix D over some limited range, assessing the uncertainty in the measurement error for the observed variables and using those replicates as the bootstrap sample points. DeLaney and Chatterjee (1985) discuss use of the bootstrap in selecting the ridge constant in ridge regression. It is clear that, given the current state of knowledge regarding ME2 estimation, additional thought is required regarding how inferential statistics might be developed for this estimator.

Summary and Discussion

This paper considered two themes that can be used to address sample-size adequacy issues in latent-variable models. One approach, which operates within the framework of existing estimation strategies, examines the effects of sample size on existing methods in Monte Carlo studies; it has yielded some conflicting results, although convergences can be identified. The appropriateness of a sample size is intimately linked to the size of the model to be estimated. Fifty observations may be sufficient for a model hypothesizing a single latent variable underlying four measured indicators; the same number of observations will be inadequate for a model with 20 measured variables and four latent variables. In particular, the appropriateness of a sample size is linked to the number of parameters estimated in the model. Information about the number of parameters estimated in a model is a standard part of the output of latent-variable modeling programs. The complexity of the estimation method also determines sample-size appropriateness. Recent developments in latent-variable models that make fewer assumptions about the distribution of the data and allow for data nonnormality will require more subjects than more standard methods, such as ML and GLS. Hence, the "cost" of making fewer distributional assumptions about the data is the necessity of a larger sample.

Numerous fit indices are available to evaluate the fit of latent-variable models. An important criterion in selecting a fit index is its generalizability across estimation methods. In this regard, the fit index suggested by Tanaka and Huba (1985) and implemented (in part) in the LISREL program of Jöreskog and Sörbom (1984) may be preferable to other alternatives.

The second part of this article considered an estimator-based solution to the small-sample problem.
Unlike the first class of methods, which assess the robustness of large-sample results in small samples, a new estimator for latent-variable models that is specifically designed for the small-sample problem is introduced. Based on previous work in the econometric literature, the ME2 estimator for latent-variable models is easy to implement. The exact computational details of this estimator are developed in Tanaka (1986). Knowledge of the measurement uncertainty of the observed variables can probably be built up over studies that precede the application of these confirmatory methods to data. In an example, it was shown that the ME2 estimates seemed to lie closer to ML estimates obtained in a large sample than did small-sample ML estimates. Further work needs to be done to establish the replicability of this finding in other sets of data, as well as to implement inferential methods for this estimator.

Researchers who have been wary of the large-sample assumptions of latent-variable models may be comforted by the availability of a new estimator designed explicitly for the case of small-sample estimation. In particular, developmental research, where samples are often limited in size, may find the ME2 estimator useful in allowing tests of models that would be untestable using standard approaches to estimation. However, as with all new statistical developments, the practicality of the ME2 estimator in small samples lies in its ability to inform the data we observe. In the current absence of inferential statistical theory for this estimator, the utility of these ME2 methods may lie in "exploratory" uses of structural equation modeling (see Connell, 1987, in this issue; Crano & Mendoza, 1987, in this issue). The benefits, if any, of this new estimator will be seen as it is applied more widely to other data sets.

In this article I have set out to establish an answer to the question of sample-size adequacy in latent-variable modeling. Rather than providing a definitive answer, I have focused on guidelines for evaluating sample-size adequacy based on Monte Carlo work, some caveats about the use of nonparametric fit indices to help address the sample-size issue, and a new estimation strategy to provide a small-sample solution to small-sample problems. This represents an initial step in the direction of dealing with the idiosyncrasies of the small-sample problem.

5 The derivation of statistical results for the maximum entropy estimator would be straightforward if it could be established that the estimated covariance matrix in the maximum entropy approach is a consistent estimator of the population covariance matrix. This result remains to be established.

References

Bentler, P. M. (1980). Multivariate analysis with latent variables: Causal modeling. Annual Review of Psychology, 31, 419-456.
Bentler, P. M. (1983). Some contributions to efficient statistics in structural models: Specification and estimation of moment structures. Psychometrika, 48, 493-517.
Bentler, P. M., & Bonett, D. G. (1980). Significance tests and goodness of fit in the analysis of covariance structures. Psychological Bulletin, 88, 588-606.
Bollen, K. A. (1986). Sample size and Bentler and Bonett's nonnormed fit index. Psychometrika, 51, 375-377.
Boomsma, A. (1983). On the robustness of LISREL (maximum likelihood estimation) against small sample size and nonnormality. Unpublished doctoral dissertation, University of Groningen.
Bornstein, M. H., & Benasich, A. A. (1986). Infant habituation: Assessments of individual differences and short-term reliability at five months. Child Development, 57, 87-99.
Browne, M. W. (1974). Generalized least squares estimators in the analysis of covariance structures. South African Statistical Journal, 8, 1-24.
Browne, M. W. (1982). Covariance structures. In D. M. Hawkins (Ed.), Topics in applied multivariate analysis (pp. 72-141). Cambridge: Cambridge University Press.
Browne, M. W. (1984). Asymptotically distribution-free methods for the analysis of covariance structures. British Journal of Mathematical and Statistical Psychology, 37, 62-83.
Chatterjee, S. (1984). Variance estimation in factor analysis: An application of the bootstrap. British Journal of Mathematical and Statistical Psychology, 37, 252-262.
Cohen, J., & Cohen, P. (1983). Applied multiple regression/correlation analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Erlbaum.
Connell, J. P. (1987). Structural equation modeling and the study of child development: A question of goodness of fit. Child Development, 58.
DeLaney, N. J., & Chatterjee, S. (1985). … ridge regression. In American Statistical Association: 1985 Proceedings of the Business and Economic Statistics Section (pp. 546-548). Washington, DC: American Statistical Association.
Dong, H.-K. (1985). Non-Gramian and singular matrices in maximum likelihood factor analysis. Applied Psychological Measurement, 9, 363-366.
Efron, B. (1982). The jackknife, the bootstrap, and other resampling plans. Philadelphia: Society for Industrial and Applied Mathematics.
Everitt, B. S. (1984). An introduction to latent variable models. London: Chapman & Hall.
Gerbing, D. W., & Anderson, J. C. (1985). The effects of sampling error and model characteristics on parameter estimation for maximum likelihood confirmatory factor analysis. Multivariate Behavioral Research, 20, 255-271.
Geweke, J. F., & Singleton, K. J. (1980). Interpreting the likelihood ratio statistic in factor models when sample size is small. Journal of the American Statistical Association, 75, 133-137.
Harlow, L. L. (1985). Behavior of some elliptical theory estimators with nonnormal data in a covariance structures framework: A Monte Carlo study. Unpublished doctoral dissertation, University of California, Los Angeles.
Hoerl, A. E., & Kennard, R. W. (1970). Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12, 55-67.
Huba, G. J., & Harlow, L. L. (1983). Comparison of maximum likelihood, generalized least squares, ordinary least squares, and asymptotically distribution free parameter estimates in drug abuse latent variable causal models. Journal of Drug Education.
Huba, G. J., & Harlow, L. L. (1986). Robust estimation for causal models: A comparison of methods in some developmental datasets. In P. B. Baltes, D. L. Featherman, & R. M. Lerner (Eds.), Life-span development and behavior. Hillsdale, NJ: Erlbaum.
Jöreskog, K. G. (1974). Analyzing psychological data by structural analysis of covariance matrices. In D. H. Krantz, R. C. Atkinson, R. D. Luce, & P. Suppes (Eds.), Contemporary developments in mathematical psychology: Measurement, psychophysics, and neural information processing (Vol. 2, pp. 1-56). San Francisco: W. H. Freeman.
Jöreskog, K. G. (1978). Structural analysis of covariance and correlation matrices. Psychometrika, 43, 443-477.
Jöreskog, K. G., & Goldberger, A. S. (1972). Factor analysis by generalized least squares. Psychometrika, 37, 243-260.
Jöreskog, K. G., & Sörbom, D. (1984). LISREL VI: Analysis of linear structural relationships by the method of maximum likelihood, instrumental variables, and least squares methods. Mooresville, IN: Scientific Software.
Mardia, K. V. (1974). Applications of some measures of multivariate skewness and kurtosis in testing normality and robustness studies. Sankhyā, B36, 115-128.
Mardia, K. V., Kent, J. T., & Bibby, J. M. (1979). Multivariate analysis. London: Academic Press.
… Dissertation Abstracts International, 45, 924B.
Tanaka, J. S. (1986). A note on the technical development of the ME2 estimator for moment structures. Manuscript in preparation.
Tanaka, J. S., & Huba, G. J. (1984). Hierarchical confirmatory factor analyses of psychological distress measures. Journal of Personality and Social Psychology, 46, 621-635.
Tanaka, J. S., & Huba, G. J. (1985). A fit index for covariance structure models under arbitrary GLS estimation. British Journal of Mathematical and Statistical Psychology, 38, 197-201.
Tanaka, J. S., & Huba, G. J. … strategies in structural models: The "goodness" of goodness of fit. Manuscript in preparation.
Tanaka, J. S., & Huba, G. J. (in press). Assessing the stability of depression in college students. Multivariate Behavioral Research.
Theil, H. (1982). Some recent and new results on the maximum entropy distribution. Statistics & Probability Letters, 1, 17-22.
Theil, H., & Laitinen, K. (1980). Singular moment matrices in applied econometrics. In P. R. Krishnaiah (Ed.), Multivariate analysis—V (pp. 629-649). Amsterdam: North-Holland.
Tucker, L. R., & Lewis, C. (1973). A reliability coefficient for maximum likelihood factor analysis. Psychometrika, 38, 1-10.
Tukey, J. W. (1986). Sunset salvo. American Statistician, 40, 72-76.
Vinod, H. D. (1982). Maximum entropy measurement error estimates of singular covariance matrices in undersized samples. Journal of Econometrics, 20, 163-174.
Wechsler, D. (1967). Manual for the Wechsler Preschool and Primary Scale of Intelligence. New York: Psychological Corporation.
White, H. (1984). Asymptotic theory for econometricians. Orlando, FL: Academic Press.
Woodward, J. A., & Bentler, P. M. (1978). A statistical lower bound to population reliability. Psychological Bulletin, 85, 1323-1326.