

STRUCTURAL EQUATION MODELLING: MODERN METHODS (Chapter 14, pages 681-704, 714-735, 755-785)

Often applied psychologists are part of a research and development team trying to design interventions to help people improve their quality of life. Ideally, these interventions are based upon a theory that models aspects of human behaviour within particular social contexts. The more precise the theory, and the more it is based upon empirical evidence, the more specific components of an intervention can be designed based upon sound scientific principles as well as upon the experience and intuition of dedicated practitioners. Indeed, theory is useful both because it suggests ways to probe the effectiveness of conventional practices and because it suggests innovative practices that have not been considered in the past.

It bears repeating (stifle your yawns!) that modern theories must be complex in nature if they are to have relevance to individuals designing social programs. As well, practitioners can manipulate only some of the pertinent variables that influence program participants. Testing theory in this situation is not addressed adequately by conventional hypothesis-testing research methods, even though these methods can provide strong support for central parts of the theory. More and more, researchers wish to test the plausibility of their theoretical models as a whole; that is, they wish to examine the interaction of an entire set of variables in a field setting that includes important social outcomes. The growing interest in structural equation modelling (SEM) reflects this desire.

Whatever the SEM program selected by the researcher, the underlying statistical procedure uses matrix algebra to solve a set of simultaneous equations that are specified by a particular theory. These equations are directly derived from a path diagram that shows the relationships among the theoretical constructs (the structural model) and the relationships between the theoretical constructs and the measures of these constructs (the measurement model). Using these equations and initial start-up values, SEM programs follow an iterative algorithm which converges on an optimal solution (it is the different iterative procedures and their criteria for convergence that distinguish the various SEM programs). The extent to which this solution can reproduce the variance-covariance matrix among the variables in the data set becomes a test of the fit of the theoretical model (similar to the factor analysis criterion that the factor solution should reproduce the correlation matrix among the variables entered into the analysis).

Thus the results of a variety of goodness-of-fit indices are an important outcome of any SEM program, and it is possible to compare different theories in terms of their relative goodness-of-fit. As well, the results of the analysis can indicate whether adding certain paths will significantly improve the fit between the model and the data. This is a valuable, albeit post hoc, procedure that suggests specific modifications to the theory that warrant investigation and replication in a subsequent study.

The degree to which a model fits the data is only one of the outcomes of interest, however. Of equal interest are the values of path coefficients that estimate the strength and direction of both direct and mediated relationships among the variables in the model. Indeed, SEM can model mediating psychological processes in a much more direct way than traditional experimental designs.

Combining the results of experiments that establish a causal link from an independent variable to a dependent variable with the results of SEM that place this causal relationship within a theoretically specified network of relationships provides far more compelling evidence in support of a theory than either kind of evidence alone. The path coefficients are unbiased estimates of population parameters and they can be tested to see if they are significantly different from zero. As well, the relative values of standardized path coefficients (which range from -1 through 0 to +1) indicate their importance in predicting specific outcomes.

When SEM is used to test only a measurement model, it goes by another name: confirmatory factor analysis. Researchers using this name are specifically concerned with generating evidence that a psychological construct takes the form specified by theory (often the construct is multidimensional; e.g., sexism has both a hostile and a benevolent aspect, which are related). This type of analysis is usually used when the researcher is developing tests which measure the construct adequately (with reliability, content validity, and construct validity).

To repeat, psychologists use structural equation modelling techniques to test the adequacy of a theoretical model and to estimate the strength of causal paths using path coefficients. This modelling technique also allows the reliability of the measuring instruments to be estimated by specifying the measurement model and a structural model and testing the viability of both types of models simultaneously, a process that involves using a complex iterative computer program to analyse a large data set.

In this course, only a basic introduction to SEM as it is used to conduct a confirmatory factor analysis or to assess the plausibility of a theoretical model with recursive paths will be covered. As well, the exercises and assignment will give you practice with one of the most widely used Windows programs, EQS (installed on the Arts Lab computers in room 31 Arts). The Department has a site licence for EQS, so your supervisor can install it in her/his lab.

Note that if only one measured variable (manifest variable) is used to index each underlying construct in a theoretical model, the SEM procedure conducts a path analysis, an analysis that includes only measured variables.

An Important Note: In the past, researchers often assumed that theoretical models are recursive. This is limiting now and will likely be even more limiting in the future. People react to events with a sense of agency rather than as pawns controlled by fate, and their actions count. The ability of theory to model complex feedback loops and the ability of statistics to test the plausibility of such theories is becoming increasingly important.

To use a clinical example, victims of violence often do not label themselves in this way. Rather, active coping attempts by these individuals allow them to avoid applying the label of victim to themselves and to adapt successfully. Understanding how some individuals are able to do this while others succumb to this extremely stressful experience is essential if professional psychologists are to help effectively.

Similarly, in social psychology, stereotyping is only of concern because this invidious process is perpetuated through selective perceptions guided by unreasonable expectations. Again, understanding the feedback loop that results in group members confirming the negative expectations implied by their group's stereotype is a complex problem that involves reciprocal causal influences between stereotyper and stereotypee in an interpersonal interaction that extends across time.

More positively, consider the development of friendship which, at its heart, involves increasingly intimate and reciprocal interactions as the lives of two people become inextricably bound together. As all these examples show, reciprocal causality lies at the heart of many important psychological processes and the requirements for statistical techniques to help model these processes are in increasing demand.

CONVENTIONS USED IN SEM DIAGRAMS:

[Path diagram omitted. The example illustrates the conventions: exogenous latent variables (Cultural Identity, with indicators ID1-ID3, and Canadian Identity, with indicators IDC1-IDC3); an exogenous manifest variable (Illegitimate X Unstable); endogenous latent variables (Perceived Discrimination, with indicators PD1-PD3, and Emotions, with indicators such as Em2 and Em3); and an endogenous manifest variable (Intentions to Protest). E = error terms attached to measured variables; D = disturbance terms attached to endogenous latent variables.]
RECTANGLES = MEASURED OR MANIFEST VARIABLES
OVALS = CONSTRUCTS OR LATENT VARIABLES OR FACTORS
ARROW = CAUSAL PATH; DOUBLE ARROW = CORRELATION
EXOGENOUS VARIABLES = INDEPENDENT VARIABLES
ENDOGENOUS VARIABLES = DEPENDENT VARIABLES, WHICH ARE SOMETIMES INDEPENDENT VARIABLES AS WELL.

Assumptions Underlying Structural Equation Modelling

1. All variables must have linear relationships with each other.

2. Outliers must be identified and dealt with prior to the main analysis.

3. The variables in the analysis should be normally distributed (multivariate normality). This assumption is more crucial for modern forms of SEM, and so a preliminary analysis must involve examining each variable for skewness and kurtosis (a screening sketch follows this list). Transformations that create a normal distribution for these variables are then applied. If a transformation does not achieve a normal distribution, then an iterative estimation procedure that is robust to violation of this assumption must be used.

4. Absence of multicollinearity is necessary as the computer executes matrix inversions in each iteration. Most SEM programs give the determinant of the variance-covariance matrix as part of the output so that this assumption can be examined.

5. Sometimes SEM programs have difficulty analysing a data set that contains variables measured on scales that vary considerably in range and mean value, a situation which results in covariances that are tremendously different in size. The solution is to rescale some of the variables before running the analysis.

6. The possibility of a specification error haunts any researcher using structural equation modelling. The solution is to examine the residual variance covariance matrix. The residuals should be small and centred around zero. Non-symmetrical residuals (some small and some large) suggest that the model estimates some parameters well and others poorly. One reason for this is that a causal path between variables in the model has been mistakenly set to zero (the theory is wrong). If this is true, then post hoc procedures can be used which suggest how the model can be improved by adding paths. Then replication using another sample is required. The other reason why residuals are large and non-symmetrical is that the model is misspecified. There is no easy solution to this problem but at least the analysis pushes the researcher to examine the theory more critically.

7. Large sample sizes are required in order to run modern structural equation modelling programs. Generally, the minimum sample size for all SEM programs can be estimated by multiplying the number of parameters that the program is estimating by ten. This means that usually EQS requires a sample size of at least 200 research respondents, and other programs require more. However, experienced applied researchers with messy data say that even that number may not be enough for the program to converge on a final solution. That is, with smaller sample sizes and, therefore, more unstable estimates, the program simply may not be able to find an optimal solution. Part of this problem can be caused by the default start values used by the SEM program being very different from the actual values of the parameters. Therefore, if estimates of these parameters can be obtained from past research, they should be specified as the initial start values in the analysis.


8. Structural Equation Modelling is based upon a mathematical procedure that tests the ability of a theoretically derived model to reproduce the variance-covariance matrix among measured variables in a data set. The use of the variance-covariance matrix preserves the scale of the original variables. Rescaling these variables by, for example, adding or subtracting a constant does not change the results of the analysis. However, rescaling variables through the use of sample statistics is more problematic as it alters the value of the χ² statistic that is the basis for testing the goodness-of-fit of the model. When variables are standardized, the rescaling involves sample statistics because deviations from the mean are divided by the sample's standard deviation. Hence, the developer of EQS, Peter Bentler, warns researchers to avoid analysing correlation matrices whenever possible. In some circumstances, a researcher has no choice because he or she is doing a secondary analysis on a correlation matrix from a published article. In this instance, the EQS program alerts users to the fact that the analysis may not be correct because a correlation matrix has been used. Of course, standardized path coefficients are very useful because they reflect the relative strength of different paths in the model. Thus, after the analysis on the variance-covariance matrix has been done, the computer calculates these standardized path coefficients from the unstandardized path coefficients and their standard errors. Perhaps in the future this limitation will be overcome, but right now it is important for researchers to know that they should analyse the variance-covariance matrix, NOT the correlation matrix, whenever possible (the default option in EQS).
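Returning to assumption 3, here is a minimal screening sketch in Python; the function name and the transformation suggestions are mine, not part of EQS.

import numpy as np
from scipy.stats import kurtosis, skew

def normality_screen(x, name="variable"):
    # Report the skewness and excess kurtosis of one measured variable.
    x = np.asarray(x, dtype=float)
    print(f"{name}: skewness = {skew(x):.2f}, kurtosis = {kurtosis(x):.2f}")

# Values far from 0 suggest trying a normalizing transformation (e.g.,
# np.log or np.sqrt for positive skew) or, failing that, a robust
# estimation method as noted in assumption 3 above.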


Model Specification Using the Bentler-Weeks Estimation Method

The Bentler-Weeks model takes a regression approach to structural equation modelling. However, the matrix equation specifies both measured variables (also called manifest variables) and the latent variables (constructs or factors) that are presumed to underlie responses to the measuring instruments. In this model, both types of variables can be exogenous or endogenous.

Remember from chapter 5 that the matrix algebra equation for multiple regression is: y = Xb + e

In this equation there are k regression coefficients (in a k x 1 vector) which need to be estimated, and there are always enough equations to provide an estimate of these parameters (as long as N > k). The reason why it is always possible to estimate the regression coefficients is that it is assumed that 1) the independent variables are measured without error, 2) the independent variables have a direct causal influence on the dependent variable and on no other variables, and 3) the residuals associated with the dependent variable are not correlated with the independent variables (no specification error). In other words, these assumptions allow the analyst to fix the values of many parameters that could, theoretically, vary, and it is the specification of these parameters that allows the computer to estimate unique values for the regression coefficients (the estimated parameters) using a least squares solution.


In the same way, a more complex regression equation can be written describing the relationships among the endogenous (dependent) and exogenous (independent) variables specified by the structural (theoretical) model and the measurement model. Unlike multiple regression, however, SEM asks the researcher to choose which parameters he or she will fix and which to estimate. The researcher cannot allow all possible parameters to vary freely because this would always result in an underidentified model (not enough degrees of freedom available to estimate the parameters in the model). Thus the researcher must define both the measurement model and the structural model in such a way that enough parameters are fixed in value so that there are degrees of freedom available to test the plausibility of the model in its entirety. That is, the model must be over-identified.

Bearing this in mind, the fundamental Bentler-Weeks regression equation can be expressed as:

η = B η + Γ ξ
(q x 1) = (q x q)(q x 1) + (q x r)(r x 1)

where η (eta) is a q x 1 vector of the q endogenous (dependent) variables; B (beta) is a q x q square matrix of path (regression) coefficients which are estimates of the relationships among the endogenous variables; Γ (gamma) is a q x r matrix of path coefficients which are estimates of the relationships between the endogenous variables and the exogenous (independent) variables; and ξ (xi) is an r x 1 vector of the r exogenous variables.


Notice that this method involves solving q equations. That is, there is an equation for each of the q endogenous variables, and there are no equations for the exogenous variables because their variability is explained by variables outside the model. However, the r exogenous variables have variances and covariances that need to be estimated. These variances and covariances are in an r x r variance-covariance matrix called Φ (phi). Altogether, then, the parameters that need to be estimated are in the B, Γ, and Φ matrices, and the path diagram is used to set some of these parameters to fixed values (usually 0 or 1) so that there are enough degrees of freedom available to test the goodness of fit of the model to the data. Start values for the parameters are then entered into the matrices. These start values can be set by the computer or they can be estimated and entered by the researcher. The computer then estimates the variance-covariance matrix among all the measured variables using the criterion for convergence specified by the researcher and compares it to the actual variance-covariance matrix. New parameter estimates are calculated and entered as start values in the next iteration. The computer stops the iterations when the estimate of the variance-covariance matrix cannot be improved. Notice that in this form of structural equation modelling 1) the B, Γ, and Φ matrices contain parameter estimates for both the measured and the latent variables (factors) and 2) the η and ξ vectors are not estimated but derived directly from the data set.


Using the Path Diagram to Derive Equations Representing the Theory

Consider the path diagram shown below for the Skiing Satisfaction example in TF (Fig 14.4, p. 692).

[Path diagram (TF Fig 14.4): the exogenous latent factor F1 = loveski (variance fixed at 1.0) predicts the measured variables V1 = yrsski and V2 = daysski (starred paths). F1 and the exogenous measured variable V5 = senseek* predict the endogenous latent factor F2 = skisat, which carries a disturbance D2*. F2 in turn predicts the measured variables V3 = snowsat (path fixed at 1.0) and V4 = foodsat (starred path). Each endogenous measured variable has an error term (E1* to E4*) whose path is fixed at 1.0.]

In this diagram the stars (*) indicate a path coefficient or a variance that needs to be estimated. The number of stars equals the total number of parameters that need to be estimated (see later diagrams).


Consider the elements in this path diagram. Remember, measured variables are represented by squares and latent variables (factors) by ovals. Endogenous variables have causal paths leading to them, while exogenous variables do not. Residuals of the endogenous measured variables are included in the model as exogenous variables labelled E (for error), while residuals of the endogenous latent variables (factors) are included in the model as exogenous variables labelled D (for disturbances, the error in prediction). Notice that endogenous and exogenous variables can be either manifest or latent. Also note that usually the paths from the errors and disturbances are fixed at 1, but their variances are estimated. This reflects the preference of most researchers to estimate the residual error and to not be concerned with estimating the path coefficient from unknown variables outside the model.

By convention, this model fixes the variance of the exogenous latent variable, love of skiing, at 1. This is the researcher's decision. Alternatively, he or she could have set one of the path coefficients from this latent variable to one of its indicators to 1, as is done for the path from ski trip satisfaction to snow satisfaction. Essentially this latter convention gives the latent variable the same scale as the chosen indicator and is advocated strongly by statisticians, including Bentler. Allowing all the paths and the variance to vary freely is not an option, however, as this will usually cause the model to become underidentified.


How then are equations derived from the path diagram? First, it is important to realize that the number of equations that you need to write is equal to the number of endogenous variables. In this example, four measured variables (number of years skied, number of days skied, snow satisfaction, and food satisfaction) are endogenous and one latent variable, ski trip satisfaction, is endogenous. Therefore, there are five endogenous variables represented in five equations. Four of these equations represent the measurement model as they link the latent factors to the measured variables. The fifth equation represents the structural model specified by theory; in this case that the love of skiing and sensation seeking are the causal determinants of ski trip satisfaction.

Next you need to write the five equations in the same way as the matrix equation: η = Bη + Γξ

The matrix equation for this example is shown below (see TF page 694), where q = 5, r = 7, and * = a parameter that needs estimating. The endogenous vector is η = (V1, V2, V3, V4, F2)' and the exogenous vector is ξ = (V5, F1, E1, E2, E3, E4, D2)':

B =                              Γ =
      V1 V2 V3 V4 F2                   V5 F1 E1 E2 E3 E4 D2
V1  [  0  0  0  0  0 ]           V1  [  0  *  1  0  0  0  0 ]
V2  [  0  0  0  0  0 ]           V2  [  0  *  0  1  0  0  0 ]
V3  [  0  0  0  0  1 ]           V3  [  0  0  0  0  1  0  0 ]
V4  [  0  0  0  0  * ]           V4  [  0  0  0  0  0  1  0 ]
F2  [  0  0  0  0  0 ]           F2  [  *  *  0  0  0  0  1 ]


Consider the first endogenous variable, Number of years skied (V1). In the path diagram this variable is predicted by the latent factor, love of skiing, (F1) and variables outside the model (E1). Thus, the equation for this variable is:

V1 = * F1 + 1 . E1

In its full form, however, it would be written:

V1 = (0 . V1 + 0 . V2 + 0 . V3 + 0 . V4 + 0 . F2) + (0 . V5 + * . F1 + 1 . E1 + 0 . E2 + 0 . E3 + 0 . E4 + 0 . D2)

Remember that when two matrices are multiplied, each element of a row in the first matrix is multiplied by the corresponding element of a column in the second matrix and the products are summed, and so on. Thus, the full form of this equation is equivalent to the first line of the matrix equation shown above.


In the same way, the other equations for the endogenous variables are equivalent to the corresponding lines of the matrix equation specified by the Bentler-Weeks SEM model. This means that writing these equations is equivalent to writing the matrix equation for this SEM procedure and writing these equations is easily done using the path diagram. The five equations implied by this particular path diagram are:

V1 = *F1 + E1
V2 = *F1 + E2
V3 = 1 F2 + E3
V4 = *F2 + E4
F2 = *F1 + *V5 + D2

Note that in these equations the path coefficients from E1, E2, E3, E4, and D2 are set to 1.

In addition, the model specifies the variances and covariances among the exogenous variables that need to be estimated. This means that the variances for V5, E1, E2, E3, E4, and D2 need to be estimated, that the variance of F1 is set to 1, and that all the covariances are set to 0, specifying the variance-covariance matrix Φ (see page 695, which shows this diagonal matrix).

Together, this information is sufficient for the computer to execute the iterative estimation procedure (section 14.4.4 shows how the computer goes through one of these iterations). Specifically, start values replace the stars (*) in these matrices. Then the SEM program estimates the variance-covariance matrix among the measured variables.
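To make this concrete, here is a minimal sketch in Python of how a model-implied variance-covariance matrix is computed from the Bentler-Weeks matrices for the ski example. The starred parameters are given made-up illustrative values; an SEM program would adjust them on each iteration.

import numpy as np

# Endogenous order (eta): V1, V2, V3, V4, F2
# Exogenous order (xi):   V5, F1, E1, E2, E3, E4, D2
B = np.zeros((5, 5))
B[2, 4] = 1.0                        # V3 <- F2, fixed marker path
B[3, 4] = 0.8                        # V4 <- F2, free parameter (made-up value)

G = np.zeros((5, 7))                 # the Gamma matrix
G[0, 1], G[1, 1] = 0.9, 0.7          # V1 <- F1, V2 <- F1 (made-up values)
G[4, 0], G[4, 1] = 0.3, 0.5          # F2 <- V5, F2 <- F1 (made-up values)
G[0, 2] = G[1, 3] = G[2, 4] = G[3, 5] = G[4, 6] = 1.0  # E and D paths fixed at 1

# Phi: variances of the exogenous variables (variance of F1 fixed at 1,
# covariances fixed at 0; the free variances here are made-up values)
Phi = np.diag([1.2, 1.0, 0.5, 0.6, 0.4, 0.7, 0.3])

# Solving eta = B eta + Gamma xi gives eta = (I - B)^(-1) Gamma xi
A = np.linalg.inv(np.eye(5) - B) @ G
cov_eta = A @ Phi @ A.T              # covariances among endogenous variables
cov_eta_xi = A @ Phi                 # covariances between eta and xi

# Assemble the implied matrix for the measured variables V1-V4 and V5,
# which the program compares to the sample matrix at each iteration
implied = np.zeros((5, 5))
implied[:4, :4] = cov_eta[:4, :4]
implied[:4, 4] = implied[4, :4] = cov_eta_xi[:4, 0]
implied[4, 4] = Phi[0, 0]
print(np.round(implied, 3))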


Methods of Achieving Convergence

So far we have discussed how, in general, the computer converges on a best estimate of the variance-covariance matrix for the measured variables. However, there are several different criteria for achieving convergence. The most common of these is the Maximum Likelihood (ML) method, followed by the Generalized Least Squares (GLS) method. Essentially, the Maximum Likelihood method converges on an estimated variance-covariance matrix which maximizes the probability that the difference between the estimated and the sample's variance-covariance matrices occurred by chance.

In contrast, the Generalized Least Squares method converges on an estimated variance-covariance matrix that minimizes the sum of the squared differences between the elements in the estimated and the sample's variance-covariance matrices.

Mathematically, both of these criteria for convergence involve minimizing a mathematical function through successive approximations. Both methods are good if the variables are distributed normally and the sample size is adequate. Tabachnick and Fidell suggest using an estimation method called the scaled Maximum Likelihood method if non-normality cannot be corrected. EQS will do this if required and gives a corrected chi squared called the Satorra-Bentler scaled χ² (TF, p. 713). As well, adjustments to the standard errors of the path coefficients are calculated so as to correct their statistical significance.
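To make the ML criterion concrete, one common way of writing the ML discrepancy function is sketched below in Python; the function and variable names are mine, not part of EQS.

import numpy as np

def f_ml(S, Sigma):
    # ML discrepancy between the sample variance-covariance matrix S and
    # the model-implied matrix Sigma (both p x p).
    p = S.shape[0]
    _, logdet_Sigma = np.linalg.slogdet(Sigma)
    _, logdet_S = np.linalg.slogdet(S)
    return logdet_Sigma - logdet_S + np.trace(S @ np.linalg.inv(Sigma)) - p

The chi squared statistic reported by EQS is (N - 1) times the minimized value of such a function; in the confirmatory factor analysis example later in these notes, 0.26793 x 995 = 266.589.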

(I will discuss using dichotomous variables later.)


Testing the Adequacy of the Theoretical Model (including Goodness-of-Fit Indices)

The Basic Chi Squared Test for Goodness-of-Fit

Once the SEM program has converged on a solution, a χ² statistic is calculated that tests how well the estimated variance-covariance matrix fits the actual variance-covariance matrix among the measured variables (its value is taken from the value of the mathematical function that is minimized to achieve convergence). The degrees of freedom for this statistic are equal to the amount of unique information in the sample variance-covariance matrix minus the number of parameters that need to be estimated. If there are p measured variables, then the total degrees of freedom are p* = p(p + 1) / 2.

In the ski trip satisfaction study, for example, there are five measured variables, so the total number of degrees of freedom is 5 x 6 / 2 = 15. Given that the path diagram shows that the researcher wants to estimate five path coefficients and six variances, χ² is tested with 15 - 5 - 6 = 4 degrees of freedom.


Because the desired result is a good fit between the estimated and the sample variance-covariance matrices, the researcher hopes for a nonsignificant χ². However, χ² values are dependent upon sample size, such that even very small differences are significant when the sample size is large. The result is that a number of goodness-of-fit indices have been developed to correct for this problem, and many of them are part of the output in the EQS program.

Testing the Significance of the Path Coefficients

In the Bentler-Weeks SEM procedure, the unstandardized path coefficients are normally distributed, so that when they are divided by their standard error a Z score is obtained. It is these Z scores that provide a test of whether the path coefficient is significantly different from zero.

As the unstandardized path coefficients are on different scales from one another, researchers often report the standardized path coefficients, which vary from -1 to 0 to +1 and which indicate the relative strengths of the causal paths in the model in the same way as a standardized regression coefficient. Notice that the standardized path coefficients from the latent variables to their measured counterparts are factor loadings. Indeed, if the measurement model specifies the entire model, then the SEM procedure has executed a confirmatory factor analysis, as will be illustrated shortly in an example.
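A minimal sketch of this Z test in Python (the function name is mine; the numbers come from the confirmatory factor analysis output later in these notes):

from scipy.stats import norm

def z_test(b, se):
    # Z test of whether an unstandardized path coefficient differs from zero
    z = b / se
    return z, 2 * norm.sf(abs(z))    # two-tailed p value

z, p = z_test(1.010, 0.031)   # the ESC2 loading reported in the CFA example
print(round(z, 2), p)         # Z = 32.58 (EQS prints 32.704 from unrounded
                              # values), p < .001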


Goodness-of-Fit Indices

Given the problems associated with using the lack of significance of a χ² statistic to indicate a model's goodness-of-fit to the data, many other indices have been developed. Unfortunately, there is no consensus on which of these indices are the best ones to use, so SEM programs avoid this problem by outputting most of these indices. Usually, they all show that the model is or is not a good fit, so it is a matter of preference which one the researcher reports. Hu & Bentler (1999) argue, however, that researchers should usually report one residual-based fit index and one comparative fit index. The following section discusses some of the more commonly used goodness-of-fit indices of both types.


Comparative Fit Indices

In SEM the theoretical model can be compared to a just identified model that contains all possible causal paths. A just identified model cannot be tested, but the matrix equation will give an exact and unique mathematical solution. If one more parameter needs to be estimated, then the mathematical solution is indeterminate (there are not enough equations to yield a solution). More commonly, the theoretical model is compared to a model in which all the variables are independent of one another. Here all path coefficients are set to zero and only the values in the variance-covariance matrix for the exogenous variables are estimated.

These two extremes illustrate the fact that models vary along a continuum, from a model in which all the variables are independent of one another (only the variances of the exogenous variables are estimated) to the just identified model, which can be specified but which cannot be tested because there are no degrees of freedom left. The theoretical model's goodness-of-fit is estimated by comparing it with one or the other of these extremes.

A simple and often used comparative goodness-of-fit index is the Bentler-Bonett Normed Fit Index (NFI), defined as:

NFI = (χ²indep - χ²model) / χ²indep

This index varies from 0 to 1 and values greater than 0.9 indicate a good fit.


However, for relatively small samples (N < 200) the NFI underestimates the fit of a model. Thus, it has been replaced by the Comparative Fit Index (CFI), which uses a noncentrality parameter, τ, an index of model misspecification; the larger the value of τ, the greater the misspecification, with τ = 0 indicating that the estimated model is perfect.

The Comparative Fit Index is defined by the following equation:

CFI = 1 - (τmodel / τindep), where τ = (χ² - df)

This index also varies from 0 to 1 and is a better estimate of goodness-of-fit for smaller samples. Models are a good fit if CFI > .95.

Another popular comparative fit index is the Root Mean Square Error of Approximation (RMSEA), which compares the model to a just identified model. This statistic is defined as:

RMSEA = √(F0 / dfmodel), where F0 = (χ²model - dfmodel) / N, with F0 set to zero if its value is negative.

F0 = 0 indicates a perfect fit, so small values of RMSEA are desired. A good fitting model is indicated if RMSEA < 0.06. However, like the NFI, this statistic tends to reject models that fit well when the sample size is small.
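All three comparative indices can be computed by hand from the χ² output. Here is a minimal sketch in Python (the function names are mine; the example values come from the confirmatory factor analysis output later in these notes):

def nfi(chi_model, chi_indep):
    return (chi_indep - chi_model) / chi_indep

def cfi(chi_model, df_model, chi_indep, df_indep):
    tau_model = max(chi_model - df_model, 0.0)   # noncentrality estimates
    tau_indep = max(chi_indep - df_indep, 0.0)
    return 1.0 - tau_model / tau_indep

def rmsea(chi_model, df_model, n):
    f0 = max((chi_model - df_model) / n, 0.0)    # F0, per the formula above
    return (f0 / df_model) ** 0.5

# CFA example: chi-sq = 266.589 on 8 df, independence chi-sq = 5093.587
# on 15 df, N = 996.
print(round(nfi(266.589, 5093.587), 3))          # 0.948
print(round(cfi(266.589, 8, 5093.587, 15), 3))   # 0.949
print(round(rmsea(266.589, 8, 996), 3))          # 0.180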


Residual-Based Fit Indices

A popular index of goodness-of-fit that has intuitive appeal is the Root Mean Square Residual (RMR) index. This statistic is based upon the average of the squared differences between each element of the sample variance-covariance matrix and the corresponding element of the estimated variance-covariance matrix:

RMR = √( 2 Σi Σj≤i (sij - σ̂ij)² / (p(p + 1)) )

where p is the number of measured variables and sij and σ̂ij are the corresponding variances and covariances from the two matrices.

Models that fit well have small RMR values, but these values are dependent upon the scale of the original measured variables in the model. Therefore, a standardized Root Mean Square Residual index (sRMR) has been developed. The values for sRMR range from 0 to 1 with small values indicating a good fit (small residuals). sRMR < .08 indicates that the model is a good fit.

(In the social psychology literature the CFI and the sRMR are fit indices that are often reported.)
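A minimal sketch of the RMR computation, assuming the two matrices are available as NumPy arrays (the function names are mine; the sRMR variant shown rescales both matrices to the correlation metric first, which is one common way to standardize the residuals):

import numpy as np

def rmr(S, Sigma):
    # Average the squared residuals over the p(p+1)/2 unique elements.
    p = S.shape[0]
    resid = S - Sigma
    total = sum(resid[i, j] ** 2 for i in range(p) for j in range(i + 1))
    return (total / (p * (p + 1) / 2)) ** 0.5

def srmr(S, Sigma):
    # Rescale both matrices by the sample standard deviations, then apply rmr.
    d = np.diag(1.0 / np.sqrt(np.diag(S)))
    return rmr(d @ S @ d, d @ Sigma @ d)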


Model Identification: Can the Theoretical Model be Tested?

In order to test any model, it has to be over-identified. This means that there is a unique solution to the mathematical procedure which results in estimates for all the parameters that are allowed to vary freely in the model, and that there is at least one degree of freedom available to test this model using chi squared. If there are p measured variables, then the total number of degrees of freedom is p* = p(p + 1) / 2. The number of parameters that need to be estimated must, therefore, be less than this number.
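These degrees-of-freedom counts are easy to sanity-check with a small helper (a sketch; the two worked examples in these notes serve as test cases):

def model_df(p, n_free):
    # p measured variables; n_free parameters to be estimated.
    p_star = p * (p + 1) // 2
    return p_star - n_free   # must be >= 1 for a testable, over-identified model

print(model_df(5, 11))   # ski example: 5 paths + 6 variances -> 15 - 11 = 4 df
print(model_df(6, 13))   # CFA example: 13 free parameters -> 21 - 13 = 8 df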

However, if the model describes relationships among latent variables as well as the relationships among these factors and measured variables, then the SEM program may still not be able to converge on a solution. This is because both the structural model and the measurement model must be over-identified in order to test the entire model's goodness-of-fit. Once p* has been calculated, the next step is to establish whether the measurement model is likely to be over-identified. If there is one latent factor in the model, there need to be three variables measuring this construct and their errors must be uncorrelated. If there are two or more latent factors, the same conditions apply provided that each set of three measured variables loads on only one factor and that the factors are allowed to covary. Sometimes two indicators per factor is sufficient under these conditions provided that none of the variances and covariances among the factors are zero.


Note that, in order for the factor to have meaning, one of the measured variables is used to scale the latent variable by setting its path coefficient to 1 (the latent variable takes on the scale of that measured variable). This is called a marker variable. Failure to set the scale of a factor is one common error which results in identification problems.

Looking at whether the structural model is over-identified is the next step in this process. Provided there is only one latent variable or that the latent variables are recursive and their disturbances do not correlate, this part of the model is likely to be identified.

Notice that the phrase "likely to be over-identified" is used when discussing both the measurement model and the structural model. This is because the guidelines just reviewed do not guarantee that a particular model can be tested. Establishing this with certainty is complex, so perhaps the best strategy is to apply these guidelines and then run the analysis. The EQS program signals when this problem has arisen by indicating that some parameters are linearly dependent on other parameters. (NOTE: Dunn et al. (1993) suggest using p* = p(p + 1) / 2 and then running the analysis to see if problems arise. If they do, more parameters can be given fixed values to deal with the problem.)


Using Parcels of Measured Variables When the Sample Size is Small

Sometimes applied researchers are forced to do structural equation modelling with a relatively small N (less than 200). In this instance, the researcher must balance the need to include enough measured variables to adequately specify the measurement model with the need to restrict the number of parameters estimated by the model as a whole. The solution is to create a small number of parcels, made up by averaging the responses to several of the original measured variables (e.g., questionnaire items), with a minimum of three parcels per latent variable (the number usually required for any SEM program to run properly).

Parcels of items (measured variables) are constructed for each latent variable using the following item-to-construct balance method:

Consider parceling measured variables (items) measuring a construct into three parcels. First a factor analysis is done on all the items measuring the construct. Then the three items with the highest factor loadings are used to anchor the three parcels. The three items with the next highest loadings are added to the anchors in reverse order, and so on. Together these three parcels are used as the manifest variables in the SEM analysis rather than the original items (see Little, Cunningham, Shahar, & Widaman, 2002). This reduces the number of parameters that the SEM program needs to estimate for the measurement model. However, this should only be done if N is small (usually < 200). Otherwise it is better to use the original measured variables in the analysis as multiple indicators of the construct.
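A minimal sketch of this item-to-construct balance assignment in Python (the function names and the serpentine helper are mine, following the description above):

import numpy as np

def parcel_pattern(n_items, n_parcels=3):
    # Serpentine order of parcel indices: 0,1,2, 2,1,0, 0,1,2, ...
    forward = list(range(n_parcels))
    pattern = []
    while len(pattern) < n_items:
        pattern.extend(forward)
        pattern.extend(forward[::-1])
    return pattern[:n_items]

def build_parcels(item_scores, loadings, n_parcels=3):
    # item_scores: respondents x items; loadings: one factor loading per item.
    order = np.argsort(-np.abs(np.asarray(loadings)))  # highest loadings first
    assign = parcel_pattern(len(order), n_parcels)
    sums = np.zeros((item_scores.shape[0], n_parcels))
    counts = np.zeros(n_parcels)
    for rank, item in enumerate(order):
        sums[:, assign[rank]] += item_scores[:, item]
        counts[assign[rank]] += 1
    return sums / counts   # each parcel is the mean of its items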


EXAMPLE OF A SIMPLE CONFIRMATORY FACTOR ANALYSIS USING THE EQS PROGRAM

Confirmatory Factor Analysis is a factor analytic technique that is designed to test a theory that specifies an underlying structure to a construct. Instead of discovering the underlying factor structure in a post hoc fashion through the exploratory factor analysis techniques covered earlier in this course, the function of this type of factor analysis is to confirm that the theorized factor structure underlying a construct is plausible. When SEM is used purely for confirmatory factor analysis, the theory that defines a construct in a certain way is tested, but the relationship of that construct to other constructs is not explored. This means that SEM is used to test a measurement model.

Self-concept is a complex multi-dimensional psychological construct that psychologists have been interested in since our discipline began. In the example, a two-factor theory of academic self-concept is specified which suggests that it comprises two underlying and interrelated components reflecting different aspects of the self: English self-concept (ESC) and maths self-concept (MSC). Each of these components can be measured in several ways, and the responses to these measures are caused by these two factors in the manner specified by the path diagram on the next page. SEM tests this overall model as well as the specified causal paths which represent hypotheses derived from this academic self-concept theory. The study is a secondary analysis of published data summarizing the responses of 996 adolescents to a self-concept test battery. The authors of the study have provided the variance-covariance matrix, and so EQS is used to analyse the data in this matrix.


The Path Diagram Specified by Theory

[Path diagram: the latent factor ESC* (English self-concept) predicts the measured variables V3, V9, and V10, and the latent factor MSC* (maths self-concept) predicts V4, V11, and V12. The paths from ESC to V3 and from MSC to V4 are fixed at 1.0 (marker variables); the loadings for V9, V10, V11, and V12 are starred (estimated). Each measured variable carries an error term (E3*, E4*, E9*, E10*, E11*, E12*) whose path is fixed at 1.0, and the two factors are allowed to covary.]


The EQS Syntax File (*.EQS)

/TITLE
Self-concept: Confirmatory Factor Analysis Example

The title statement can be several lines long and can contain explanatory notes on the decisions that resulted in the syntax being used.

/SPECIFICATIONS
VARIABLES = 12; CASES = 996;
DATAFILE = 'c:\data\Eqs\807 2004\SEM notes.CFA example.byrne.ess';
MATRIX = COR; ANALYSIS = COV; METHOD = ML;

The specifications commands give the computer details on the number of variables in the data set (VAR), the sample size (CASE), the location of the data matrix (DATAFILE -- a *.ESS file is an EQS data file), the type of data matrix being analysed (MATRIX = COR or COV), the basis for the analysis (ANALYSIS = COR or COV, with the default being the variance-covariance matrix), and the iterative estimation procedure (METHOD). All subcommands are separated by semi-colons (a general EQS syntax rule, except for data matrices).

The data is in the form of a correlation matrix with standard deviations:
        V1     V2     V3     V4     V5     V6     V7     V8     V9     V10    V11    V12
V1   1.0000 0.0710 0.2890 0.1700 0.8630 0.4786 0.2522 0.2160 0.1560 0.1280 0.1770 0.1350
V2   0.0710 1.0000 0.2450 0.2530 0.2660 0.3060 0.2619 0.7675 0.2420 0.3070 0.2475 0.3040
V3   0.2890 0.2450 1.0000 0.0120 0.2270 0.2990 0.2389 0.3430 0.7050 0.8543 0.0660 0.0270
V4   0.1700 0.2530 0.0120 1.0000 0.2000 0.2250 0.3460 0.3472 0.0140 0.0690 0.8640 0.8280
V5   0.8630 0.2660 0.2270 0.2000 1.0000 0.8310 0.7100 0.2160 0.1900 0.1310 0.2700 0.1880
V6   0.4786 0.3060 0.2990 0.2250 0.8310 1.0000 0.2537 0.2830 0.2100 0.1740 0.2570 0.1870
V7   0.2522 0.2619 0.2389 0.3460 0.7100 0.2537 1.0000 0.2545 0.1440 0.1396 0.2426 0.0367
V8   0.2160 0.7675 0.3430 0.3472 0.2160 0.2830 0.2545 1.0000 0.2690 0.2900 0.2489 0.3057
V9   0.1560 0.2420 0.7050 0.0140 0.1900 0.2100 0.1440 0.2690 1.0000 0.7627 0.1420 0.0280
V10  0.1280 0.3070 0.8543 0.0690 0.1310 0.1740 0.1396 0.2900 0.7627 1.0000 0.0960 0.1460
V11  0.1770 0.2475 0.0660 0.8640 0.2700 0.2570 0.2426 0.2489 0.1420 0.0960 1.0000 0.8060
V12  0.1350 0.3040 0.0270 0.8280 0.1880 0.1870 0.0367 0.3057 0.0280 0.1460 0.8060 1.0000

The second-to-last row contains the standard deviations of the variables:


14.1000 12.3000 10.0000 16.1000 9.3000 14.9000 9.4000 15.3000 11.3000 15.7000 11.5000 12.4000

The last row contains the means which, in this example, are set to zero.


EQS Syntax (Continued)

/LABELS
V3 = ESC1; V4 = MSC1; V9 = ESC2; V10 = ESC3; V11 = MSC2; V12 = MSC3;
F1 = ESC; F2 = MSC;

These commands give more meaningful variable labels than those used by the computer. The syntax also helps you write the equations, which need to use the computer labels. Don't forget the semi-colons.

/EQUATIONS
V3 = F1 + E3;
V9 = *F1 + E9;
V10 = *F1 + E10;
V4 = F2 + E4;
V11 = *F2 + E11;
V12 = *F2 + E12;

These equations specify the model that is being tested. Each equation is separated by a semi-colon. If you run EQS using a path diagram (EQS diagrammer function), these equations will be generated from the path diagram automatically.

/VARIANCES
F1 TO F2 = *; E3 TO E4 = *; E9 TO E12 = *;
/COVARIANCES
F1 TO F2 = *;
/END

These commands specify the matrix of variances and covariances. Notice that the covariation among the error terms is not specified, implying that it is set to zero (the default). The /END statement tells the computer to begin the analysis.


The EQS Output File (*.OUT)

The first page of the output repeats the syntax that was just presented.

TITLE: Self-concept: Confirmatory Factor Analysis Example

COVARIANCE MATRIX TO BE ANALYZED: 6 VARIABLES (SELECTED FROM 12 VARIABLES) BASED ON 996 CASES.

This line reminds the researcher that the model contains 6 of the original 12 variables in the data set. Successive runs of EQS could specify different subsets of data for different analyses. Then the output gives the entire variance-covariance matrix among the variables used in the analysis:
            ESC1      MSC1      ESC2      ESC3      MSC2      MSC3
            V 3       V 4       V 9       V 10      V 11      V 12
ESC1 V 3   100.000
MSC1 V 4     1.932   259.210
ESC2 V 9    79.665     2.547   127.690
ESC3 V 10  134.125    17.441   135.311   246.490
MSC2 V 11    7.590   159.970    18.453    17.333   132.250
MSC3 V 12    3.348   165.302     3.923    28.423   114.936   153.760

If a correlation matrix had been analysed, the computer would remind the researcher (because this is not a good idea) by printing:

CORRELATION MATRIX TO BE ANALYZED:


BENTLER-WEEKS STRUCTURAL REPRESENTATION:
NUMBER OF DEPENDENT VARIABLES = 6
DEPENDENT V'S : 3 4 9 10 11 12

Here the endogenous (dependent) variables in the model are identified.

NUMBER OF INDEPENDENT VARIABLES = 8
INDEPENDENT F'S : 1 2
INDEPENDENT E'S : 3 4 9 10 11 12

Here the exogenous (independent) variables in the model are identified.

NUMBER OF FREE PARAMETERS = 13

This is the number of parameters the researcher is estimating in this analysis (the stars in the equations plus the stars in the variance-covariance matrix or, equivalently, the number of stars on the path diagram). The total number of degrees of freedom in this data set is calculated by the formula p* = p(p + 1) / 2, where p is the number of measured variables. In this example, there are 6 measured variables, so the total number of degrees of freedom is 6 x 7 / 2 = 21. As the researcher wishes to estimate 13 parameters, the degrees of freedom remaining that can be used to test the goodness-of-fit of the model is 21 - 13 = 8.

DETERMINANT OF INPUT MATRIX IS 0.99223E+11.

Clearly multicollinearity is not a problem in this data set.

AVERAGE ABSOLUTE STANDARDIZED RESIDUALS = 0.0182
AVERAGE OFF-DIAGONAL STANDARDIZED RESIDUALS = 0.0255

The computer then prints out the residual variance-covariance matrix and the standardized residual variance-covariance matrix (not shown). Following each matrix is an average of all the residuals and all the off-diagonal residuals. These averages should be small if the model fits the data well. The average of the off-diagonal residuals is given because small residual covariance values are more crucial for the model to be a good fit.


LARGEST STANDARDIZED RESIDUALS (extract):

V11,V9    0.076
V12,V10   0.069
V4,V3    -0.064
V9,V4    -0.054
V12,V3   -0.044

The computer then prints out the 20 largest residual values (this is an extract) so that the researcher knows which relationships are not modelled very well. For example, the first and largest residual in this table shows that the model does not explain the relationship between an index of English self-concept (V9) and maths self-concept (V11) as well as other relationships in the sample's correlation matrix. Whether the researcher will use this information or not depends on the overall fit of the model and the size of these residual covariances (look at standardized residuals > .10).
DISTRIBUTION OF STANDARDIZED RESIDUALS

[Histogram omitted; the frequency table it displays:]

RANGE            FREQ    PERCENT
-0.1 -  0.0       12      57.14%
 0.0 -  0.1        9      42.86%
 all others        0        .00%
--------------------------------
TOTAL             21     100.00%

This histogram shows that the residuals are centred on zero (100% are between -0.1 and 0.1) and are symmetrical. This information indicates that the model does not contain a serious specification error.


GOODNESS OF FIT SUMMARY

INDEPENDENCE MODEL CHI-SQUARE = 5093.587 ON 15 DEGREES OF FREEDOM

This first chi square test should be significant as it tests the hypothesis that the variables are independent of one another (one of the standards of comparison used by some of the comparative goodness-of-fit indices).

CHI-SQUARE = 266.589 BASED ON 8 DEGREES OF FREEDOM
PROBABILITY IS LESS THAN 0.000001

This is the basic chi square value that tests the goodness-of-fit of the model. Because the sample size is large (N = 996), the fact that this statistic is significant does not mean that the model is a poor fit.

BENTLER-BONETT NORMED FIT INDEX = 0.948
BENTLER-BONETT NONNORMED FIT INDEX = 0.905
COMPARATIVE FIT INDEX (CFI) = 0.949

These fit indices, and particularly the CFI, suggest that the model is quite a good fit as they are all around 0.9. The CFI should be greater than 0.95 for the model to be a good fit.

ITERATIVE SUMMARY
                 PARAMETER
ITERATION        ABS CHANGE        ALPHA        FUNCTION
    1            69.848500       1.00000        .48517
    2             5.597344       1.00000        .28123
    3              .887007       1.00000        .26795
    4              .069267       1.00000        .26793
    5              .012263       1.00000        .26793
    6              .000785       1.00000        .26793

This output shows how the function specified by the estimation method converges on a minimum value. Notice that the chi square testing the goodness-of-fit of the model is equal to the minimum function value multiplied by (N - 1): 0.26793 x 995 = 266.589 (within rounding error).


The computer then writes the equations for the endogenous variables with the estimated parameters.

MEASUREMENT EQUATIONS WITH STANDARD ERRORS AND TEST STATISTICS

ESC1 =V3 = 1.000 F1 + 1.000 E3
MSC1 =V4 = 1.000 F2 + 1.000 E4

This first set of equations contains the ones in which the path of one of the indicators (a marker) for each of the two factors was fixed at a value of 1. This gives the underlying factor the same scale as the marker variable.

ESC2 =V9 = 1.010*F1 + 1.000 E9
             .031
           32.704@

This equation shows that the unstandardized path coefficient between the English self-concept factor and the specific measure of English self-concept (ESC2) is 1.010. The standard error for this statistic is 0.031 and the Z score is 1.010 / .031 = 32.704. As this Z score is greater than 1.96, it is significant, as indicated by @.

ESC3 =V10 = 1.705*F1 + 1.000 E10
              .040
            42.362@
MSC2 =V11 =  .696*F2 + 1.000 E11
              .014
            49.339@
MSC3 =V12 =  .720*F2 + 1.000 E12
              .016
            44.579@

Sometimes, one estimated parameter is linearly dependent upon the others. The computer will run the analysis, but will print out an error message concerning this linear dependency. Do NOT trust the output if this message appears. A less extreme form of linear dependence is shown by a parameter estimate having a very small standard error (relative to standard errors found in past research). Consider eliminating this variable and rerunning the analysis.


The computer then prints the estimated variances and covariances of the exogenous variables with standard errors and Z scores. The estimates for the measured variables should be inspected to see if they seem reasonable given your knowledge of these variances and covariances from past research. Sometimes the estimation procedure produces an odd solution that does not conform to the findings from past research. In this instance, it is probably wise to give more credence to the results of past research until the results of the analysis are replicated.
VARIANCES OF INDEPENDENT VARIABLES (EXTRACT)
--------------------------------------------
F1 - ESC     78.690*          E3 - ESC1     21.310*
              4.539                          1.524
             17.338@                        13.980@

As an extreme, the SEM program can estimate negative variances for measured variables. This is flagged by the computer with an error message saying that these variances cannot be estimated. In this instance, the analysis is seriously flawed and the researcher will need to reassess whether the use of SEM to analyse the data set is warranted. As well, the estimated variances of the latent exogenous factors can be negative. The computer will not allow this to happen and will constrain the variance to zero (or a lower-bound estimate that is positive). If you see this error message, seriously question the validity of the results.


STANDARDIZED SOLUTION:                        R-SQUARED
ESC1 =V3  =  .887 F1 + .462 E3                  .787
MSC1 =V4  =  .941 F2 + .338 E4                  .886
ESC2 =V9  =  .793*F1 + .610 E9                  .628
ESC3 =V10 =  .963*F1 + .269 E10                 .928
MSC2 =V11 =  .918*F2 + .397 E11                 .842
MSC3 =V12 =  .879*F2 + .476 E12                 .773

These standardized path coefficients from the measured variables to the latent factors are the ones most usually written onto the path diagram in published reports (they are calculated from the unstandardized path coefficients after the analysis is completed). Note that this table cannot be calculated if some of the estimated variances for the exogenous variables are negative. In this example, the standardized path coefficients are the factor loadings of the measured variables on the latent factors.

The squared multiple correlations (the square of the path coefficient) are estimates of the proportion of the variance of the measured variables which is shared with the underlying factor. This is a communality estimate for the variable on the factor and, equivalently, an estimate of its reliability. For example, the reliability of the Math Self-Concept Scale (V12) is .773. This number indicates the proportion of variance in the measured variable that measures the underlying Maths Self-concept construct (F2). (This number is NOT a good reliability estimate if the error terms for the measured variables are correlated.)

Notice that in this table the standardized path coefficients for the marker variables (and the error terms) that were fixed in the equations now have a value different from 1 due to the standardization procedure. If you need their statistical significance, set another measured variable as the marker and rerun the analysis. This will give you the same value for the path coefficients because the two solutions are equivalent. The standardized variances are not printed as the computer sets all variances equal to 1.

The correlation among the latent factors is given in the last table of the output. In this case it is 0.091, indicating that English and maths academic self-concept are relatively independent of one another.
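A standardized coefficient is the unstandardized coefficient rescaled by the ratio of the standard deviations of the predictor and the outcome. As a quick check using numbers from the output above (a sketch, not EQS's internal calculation):

b = 1.705                 # unstandardized loading of V10 on F1
var_f1 = 78.690           # variance of F1, from the output extract
var_v10 = 246.490         # variance of V10, from the covariance matrix
print(round(b * (var_f1 ** 0.5) / (var_v10 ** 0.5), 3))   # 0.963, as in the table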


Additional Syntax

The following syntax allows you to output additional goodness-of-fit indices:

/PRINT
FIT = ALL;

For this example, some of the output from this command is:

ROOT MEAN SQUARED RESIDUAL (RMR)    =  5.133
STANDARDIZED RMR                    =  0.032
ROOT MEAN SQ. ERROR OF APP. (RMSEA) =  0.180
90% CONFIDENCE INTERVAL OF RMSEA    = (0.162, 0.199)

Post Hoc Adjustment of the Theoretical Model: Addition and Subtraction of Parameters

In the above example, the values of the goodness-of-fit indices suggest that the model is a good fit because they meet the established criteria. However, in some analyses the solution looks quite good but could be improved (the fit indices approach the criteria for a good fit, but do not meet them). In this situation, the researcher can conduct post hoc tests which suggest which parameters should be estimated rather than fixed and which parameters can be removed (set to zero). Then the modified model can and should be tested on a new sample. These post hoc procedures capitalize on chance, and so Tabachnick and Fidell suggest only adding or subtracting a few paths, one at a time. As well, they advocate using a conservative significance level (p < .01) for selecting modifications to the parameters specified by theory.


Lagrange Multiplier Test

This post hoc procedure indicates which parameters could be added to the model (estimated) to improve its goodness-of-fit based upon the current sample. Both univariate and multivariate tests are conducted but the multivariate test is the more important one as it identifies the parameters that could be added into the model in a stepwise fashion similar to forward selection in multiple regression. To run this procedure use the syntax:

/LMTEST

Using this syntax in the context of the current example yields the following output:

MULTIVARIATE LAGRANGE MULTIPLIER TEST BY SIMULTANEOUS PROCESS IN STAGE 1
PARAMETER SETS (SUBMATRICES) ACTIVE AT THIS STAGE ARE:
PVV PFV PFF PDD GVV GVF GFV GFF BVF BFF

This line indicates the type of parameter matrices that were active at this stage of the analysis. The first letter indicates the matrix containing the suggested parameter (P = Φ; G = Γ; and B = B) and the remaining letters indicate the type of variables involved (V = measured variables, F = factors, E = errors, and D = disturbances).

       CUMULATIVE MULTIVARIATE STATISTICS        UNIVARIATE INCREMENT
       ----------------------------------        --------------------
                                                                     HANCOCK'S SEQUENTIAL
STEP   PARAMETER   CHI-SQUARE   D.F.   PROB.    CHI-SQUARE   PROB.     D.F.     PROB.
----   ---------   ----------   ----   -----    ----------   -----    ------   ------
  1    V4,F1         14.601      1     .000       14.601     .000        8      .067
  2    V3,F2         25.807      2     .000       11.206     .001        7      .130

This part of the output indicates the possible changes, along with a χ² test which, if significant, indicates that the model will be improved. For example, the analysis suggests adding a path from F1 (the English self-concept factor) to V4 (supposedly a measure of maths self-concept). While statistically this makes sense, theoretically it may not. Indeed, whether the researcher actually makes this change and recomputes the model depends on a thoughtful analysis of the theoretical implications. One or two theoretically meaningful changes may well improve the model sufficiently for it to become a good fit, and this strategy is usually better than implementing all the changes without regard for theory.

Whenever changes are made, it is important to rerun the analysis on the modified model so as to check on its goodness-of-fit and to examine the impact of the changes on all the parameter estimates. Indeed, if the modifications result in parameter estimates that are not consistent with past research, the researcher may decide that these modifications are not worth making at all. After all, this is a post hoc procedure relying on purely statistical criteria. There is no guarantee that the changes it suggests are changes that improve understanding of the phenomena under study.

The Wald Test

This post hoc procedure is used to delete parameters (set them to zero) and so make the model more restrictive. It is usually done after parameters have been added using the Lagrange test, as adding paths changes the parameter estimates. The following syntax is used to activate this procedure:

/WTEST

Because parameters are being set to zero, results of the χ² test should be nonsignificant. The output is similar to the Lagrange test. In this instance, no paths were dropped when this test was conducted.


Comparing Nested Models: The Chi Square Difference Test

Sometimes different theories (or different versions of the same theory) specify two models such that one model is nested inside the other. In this instance the two models can be directly compared to see if the larger model (the one with more paths) significantly improves the goodness-of-fit (or, equivalently, if the added restrictions significantly reduce the goodness-of-fit). This comparison is achieved by subtracting the χ² values for the two models. The result is another χ² statistic with degrees of freedom equal to the difference in the degrees of freedom for the two models. This procedure requires the estimation of two models, but its advantage is that it is theory-based and provides evidence that directly bears upon the relative merits of the two theories. In my view, this is a whole lot better than fixing a model in a post hoc fashion using the Lagrange Multiplier and/or the Wald tests.
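A minimal sketch of the χ² difference test in Python (the function name is mine and the example values are hypothetical):

from scipy.stats import chi2

def chi_square_difference(chi_restricted, df_restricted, chi_full, df_full):
    # The restricted (nested) model has fewer free paths, so more df and
    # a chi-squared at least as large as the full model's.
    diff = chi_restricted - chi_full
    df_diff = df_restricted - df_full
    return diff, df_diff, chi2.sf(diff, df_diff)   # difference, df, p value

# Hypothetical values: restricted model chi-sq = 12.3 on 8 df,
# full model chi-sq = 4.1 on 6 df.
print(chi_square_difference(12.3, 8, 4.1, 6))   # 8.2 on 2 df, p = .017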


EXAMPLE OF TESTING A CAUSAL MODEL USING THE EQS PROGRAM

This path diagram specifies a simple theoretical model of job satisfaction (an endogenous latent variable) which was tested on 122 employees in an industrial sales force. The exogenous latent variables in this model are achievement motivation and self-esteem.

[Path diagram: the exogenous latent factors ACH* (achievement motivation, with indicators ACH1 and ACH2) and SE* (self-esteem, with indicators SE1 and SE2) both predict the endogenous latent factor F1 (job satisfaction, with indicators SAT1 and SAT2 and disturbance D1*). One loading per factor is fixed at 1.0 and the other is starred (estimated), as are the two structural paths into F1. Each indicator carries an error term (E2* to E7*) whose path is fixed at 1.0, and, per the syntax below, the two exogenous factors are allowed to covary.]


The syntax used to test this theoretical model is derived directly from the path diagram:

/TITLE
PERFORMANCE AND JOB SATISFACTION IN AN INDUSTRIAL SALES FORCE

/SPECIFICATIONS
CASES = 122; VARIABLES = 8; MATRIX = CORRELATION;
ANALYSIS = COVARIANCE; METHOD = ML;

The number of parameters being estimated is 15, so the sample size (CASES = 122) is a little small (it should be at least 15 x 10 = 150). The ANALYSIS command specifies that the variance-covariance matrix should be analysed. This matrix is created from the correlation matrix and the standard deviations of the variables contained in the command statement /STA below.

/LABELS
V2 = SAT1; V3 = SAT2; V4 = ACH1; V5 = ACH2; V6 = SE1; V7 = SE2;
F1 = JOBSAT; F2 = ACH; F3 = SE;

/EQUATIONS
V2 =  F1 + E2;
V3 = *F1 + E3;
V4 =  F2 + E4;
V5 = *F2 + E5;
V6 =  F3 + E6;
V7 = *F3 + E7;
F1 = *F2 + *F3 + D1;

Start values can be specified (from past research) for some or all of the parameters in these equations; they are given as numbers to the left of the stars, e.g., 0.5*F2. Specifying start values is more important when the sample size is small.


/VARIANCES
F2 TO F3 = *;
E2 TO E7 = *;
D1 = *;

/COVARIANCES
F2,F3 = *;

These sets of statements specify the variances and covariances for the exogenous variables. Covariances are set to zero by default, so it is only necessary to state that the covariance between F2 and F3 needs to be estimated.

/MATRIX
 1.000
  .418  1.000
  .394   .627  1.000
  .129   .202   .266  1.000
  .189   .284   .208   .365  1.000
  .544   .281   .324   .201   .161  1.000
  .507   .225   .314   .172   .174   .546  1.000
 -.357  -.156  -.038  -.199  -.277  -.294  -.174  1.000

/STANDARD DEVIATIONS
2.09 3.43 2.81 1.95 2.08 2.16 2.06 3.65

This is the way a matrix of correlations with standard deviations (so as to create a covariance matrix for the computer to analyze) is specified in the command file.

/END

This statement ends the commands and tells the computer to begin the analysis.
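The conversion EQS performs here is Σ = D R D, where R is the correlation matrix and D is a diagonal matrix of the standard deviations. A numpy sketch with the first three variables above (a check on the idea, not part of the EQS run):

    import numpy as np

    # first three variables from /MATRIX and /STANDARD DEVIATIONS above
    R = np.array([[1.000, 0.418, 0.394],
                  [0.418, 1.000, 0.627],
                  [0.394, 0.627, 1.000]])
    sd = np.array([2.09, 3.43, 2.81])

    D = np.diag(sd)
    Sigma = D @ R @ D              # covariance matrix actually analysed
    print(Sigma[0, 1])             # .418 * 2.09 * 3.43, approx. 3.00
    print(np.linalg.det(Sigma))    # a near-zero determinant would flag multicollinearity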


The EQS Output

The output starts out by repeating the syntax. Then the variance-covariance matrix among the measured variables is given, followed by:

BENTLER-WEEKS STRUCTURAL REPRESENTATION:

NUMBER OF DEPENDENT VARIABLES = 7
  DEPENDENT V'S :   2  3  4  5  6  7
  DEPENDENT F'S :   1

NUMBER OF INDEPENDENT VARIABLES = 9
  INDEPENDENT F'S :  2  3
  INDEPENDENT E'S :  2  3  4  5  6  7
  INDEPENDENT D'S :  1

NUMBER OF FREE PARAMETERS = 15
NUMBER OF FIXED NONZERO PARAMETERS = 10

The number of data points (unique variances and covariances among the six measured variables) is p(p + 1)/2 = 6 x 7 / 2 = 21. Therefore, the degrees of freedom available to test the goodness-of-fit of the model are 21 - 15 = 6.
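The same bookkeeping as a one-line check (p counts the six measured variables in the model, not the eight variables in the data file):

    p, free_parameters = 6, 15
    data_points = p * (p + 1) // 2         # 21 unique variances and covariances
    df = data_points - free_parameters     # 6 degrees of freedom for the fit test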

DETERMINANT OF INPUT MATRIX IS 0.82146E+04

This shows that there is no problem with multicollinearity. The computer then prints out the residual variance-covariance matrix and the standardized residual matrix. The summary of the values in the standardized matrix shows that the residuals are small, indicating that the model fits the data well:

AVERAGE ABSOLUTE STANDARDIZED RESIDUALS = 0.0113

AVERAGE OFF-DIAGONAL STANDARDIZED RESIDUALS = 0.0159
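These summaries can be reproduced from the residual matrix. A hedged numpy sketch, where S is the observed covariance matrix and Sigma_hat the model-implied matrix (both hypothetical p x p arrays; EQS's exact standardization may differ in detail):

    import numpy as np

    def residual_summaries(S, Sigma_hat):
        """Average absolute standardized residuals, overall and off-diagonal."""
        sd = np.sqrt(np.diag(S))
        std_resid = (S - Sigma_hat) / np.outer(sd, sd)   # residuals in correlation metric
        all_idx = np.triu_indices_from(std_resid)        # unique elements incl. diagonal
        off_idx = np.triu_indices_from(std_resid, k=1)   # off-diagonal elements only
        return np.abs(std_resid[all_idx]).mean(), np.abs(std_resid[off_idx]).mean()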


The histogram of the residuals shows that they are small and centred around zero (over 97% lie in the range -0.1 to +0.1).

GOODNESS OF FIT SUMMARY

CHI-SQUARE = 3.915 BASED ON 6 DEGREES OF FREEDOM
PROBABILITY VALUE FOR THE CHI-SQUARE STATISTIC IS 0.68813
COMPARATIVE FIT INDEX (CFI) = 1.000

This part of the output shows that the model is a very good fit: the χ² is not significant. The CFI also shows this, and it is the statistic to report given the small sample size.

The computer then writes the equations for the endogenous variables with the parameter estimates and their significance.


MEASUREMENT EQUATIONS WITH STANDARD ERRORS AND TEST STATISTICS

SAT1  =V2 =  1.000 F1  + 1.000 E2

SAT2  =V3 =   .929*F1  + 1.000 E3
              .189
             4.931@

ACH1  =V4 =  1.000 F2  + 1.000 E4

ACH2  =V5 =  1.006*F2  + 1.000 E5
              .361
             2.784@

SE1   =V6 =  1.000 F3  + 1.000 E6

SE2   =V7 =   .879*F3  + 1.000 E7
              .222
             3.965

JOBSAT=F1 =   .733*F2  +  .547*F3  + 1.000 D1
              .376        .234
             1.949@      2.335@
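Each free parameter is printed with its standard error underneath and its test statistic (the estimate divided by the standard error) below that; @ flags significance at the .05 level. A quick check on the first free loading (the small discrepancy with the printed 4.931 reflects rounding in the displayed values):

    estimate, se = 0.929, 0.189
    z = estimate / se            # about 4.92; EQS prints 4.931 from unrounded values
    significant = abs(z) > 1.96  # True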

The output also estimates the variances and covariances of the exogenous variables, including the correlation between F2 and F3 as 0.396.

STANDARDIZED SOLUTION:                                   R-SQUARED

SAT1  =V2 =   .743 F1  + .669 E2                           .553
SAT2  =V3 =   .843*F1  + .537 E3                           .711
ACH1  =V4 =   .622 F2  + .783 E4                           .387
ACH2  =V5 =   .587*F2  + .810 E5                           .344
SE1   =V6 =   .770 F3  + .638 E6                           .593
SE2   =V7 =   .709*F3  + .705 E7                           .503
JOBSAT=F1 =   .349*F2  + .357*F3  + .808 D1                .347
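In this standardized solution, each measurement equation's R-squared is 1 minus the squared standardized error loading (equivalently, the squared standardized factor loading). A numpy check on the printed values (small discrepancies reflect rounding in the displayed loadings):

    import numpy as np

    error_loadings = np.array([0.669, 0.537, 0.783, 0.810, 0.638, 0.705])  # E2-E7
    r_squared = 1 - error_loadings ** 2
    print(np.round(r_squared, 3))   # approx. [.552 .712 .387 .344 .593 .503]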


Further Comments on Structural Equation Modelling with EQS

The Equivalency of Solutions Using Different Marker Variables

If there is more than one latent factor in your path diagram, then changing the variable used as the marker variable for a factor results in an equivalent solution (the sizes of the path coefficients are the same). However, if one manifest variable is a much better index of the construct than the others, it is best to use it as the marker variable, as the solution tends to be more stable. Setting the variance of an exogenous factor to 1.0 (standardizing the variance) rather than fixing one of the paths to a marker variable is another option if you do not want to specify a variable as a marker variable. This also results in an equivalent solution.

The bottom line is that your choice of marker variable, or your choice to set the variance of an exogenous factor to 1.0 and not specify a marker variable, depends upon what you are interested in theoretically. For example, if you want to scale a factor to a well-known and highly reliable instrument, then you should make this measured variable your marker variable. If all measures are equivalent (and perhaps of unknown reliability, e.g., face-valid measures of the construct) and you are not concerned with scaling a latent exogenous factor, but rather want to know how all the measures load on this factor, then set the variance of this exogenous factor to 1.0.
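In the syntax of the running example, standardizing the self-esteem factor (F3) instead of using SE1 as its marker variable would mean freeing both loadings and fixing the factor variance, along these lines (a sketch of the alternative specification, not the syntax actually run above):

/EQUATIONS
V6 = *F3 + E6;
V7 = *F3 + E7;
/VARIANCES
F3 = 1.0;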


Including Dichotomous Variables in the Theoretical Model

Truly nominal variables define different groups of respondents without ranking them (see TF, section 14.5.7, p. 730). Provided the sample size is large enough, the SEM strategy is to test the theoretical model within each group separately. If the model is supported in each sample, then your results generalize across samples. For example, the model is supported for both men and women; within white, immigrant, and Aboriginal samples; etc.

A more advanced form of SEM, not covered in this course or in TF, starts by testing the model within each sample and then doing a multiple-group analysis in order, for example, to test the invariance of the factorial structure of a theoretical construct across groups. Simply put, the analysis constrains certain parameters to be equal across groups (e.g., the size of the path from a measured variable to a latent construct is the same for both men and women) and examines whether the goodness-of-fit is still as good as that of the model when these constraints are not applied.

If the sample size is small, you can instead include a nominal variable in the path diagram as one or more dummy variables. Clearly these dummy variables are not normally distributed, so you have to use a robust estimation procedure.
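A minimal pandas sketch of dummy coding (the column name and categories are hypothetical):

    import pandas as pd

    df = pd.DataFrame({"group": ["white", "immigrant", "aboriginal", "white"]})
    dummies = pd.get_dummies(df["group"], drop_first=True)  # k groups -> k-1 dummies
    df = pd.concat([df, dummies], axis=1)                   # dummies can enter the model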

If you are using ordinal data which reflect an underlying continuous variable (e.g., age: 1 = young, 2 = middle-aged, 3 = old), you must estimate the size of the correlations that would have been obtained if you had actually measured the continuous variable directly. These estimates are called polychoric correlations (between two ordinal variables) or polyserial correlations (between an ordinal and an interval variable), and they form the basis of the analysis. This goes far beyond the scope of this course and is only briefly mentioned in TF (section 14.5.6, p. 734).


Multivariate Kurtosis

If you run EQS using raw data, the program will print out Mardia's coefficient, which indexes multivariate kurtosis (use p < .001 to determine if kurtosis is a problem). The normalized estimate of this coefficient allows you to see whether the variables in the data set are normally distributed. For large sample sizes, the values of the normalized coefficient correspond to Z values, so large values indicate that some non-normality due to kurtosis is present.
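A numpy sketch of the textbook formula for Mardia's multivariate kurtosis and its normalized (Z) estimate (EQS's implementation may differ in small-sample details):

    import numpy as np

    def mardia_kurtosis(X):
        """X: n-by-p matrix of raw scores. Returns b2p and its normalized estimate."""
        n, p = X.shape
        Xc = X - X.mean(axis=0)
        S = np.cov(Xc, rowvar=False, bias=True)                  # ML covariance matrix
        d2 = np.einsum('ij,jk,ik->i', Xc, np.linalg.inv(S), Xc)  # squared Mahalanobis distances
        b2p = np.mean(d2 ** 2)                                   # Mardia's coefficient
        z = (b2p - p * (p + 2)) / np.sqrt(8.0 * p * (p + 2) / n) # normalized estimate
        return b2p, z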

The computer also prints out the 5 cases that have the largest normalized estimates (and which contribute the most to the overall value of Mardia's coefficient). If one or two of these cases have much larger normalized estimates than the others, consider dropping these cases and re-running the analysis. However, it is better not to drop these cases immediately, but first to check the variables for univariate and multivariate outliers and to adjust the variables that have a non-normal distribution using transformations. This is usually done with the SPSS program; EQS reads SPSS data files and converts them into *.ess files with ease.
