
Introduction to Longitudinal Analysis Using SPSS


Fabiana Gordon
Statistical Advisory Service
Imperial College
Tel: (+44) 20 75943339
Website: www.ic.ac.uk/stathelp
e-mail: fabiana.gordon@imperial.ac.uk

Course Outline
Longitudinal Studies are studies in which the data on individuals are measured repeatedly
through time. This course will cover exploratory analysis, modeling and interpretation of
longitudinal studies.
The outline of this course is as follows. Section 1 starts with the background assumed and a
brief overview of SPSS, introducing the module used to analyze longitudinal data: SPSS Proc
Mixed. Then, we highlight the main features and merits of longitudinal data. In longitudinal
analysis considerations we will mention different features of longitudinal studies that must be
taken into account when selecting an appropriate methodology. We present the data structure
needed to analyse longitudinal data using SPSS Proc Mixed. We also introduce some general
notation and concepts.
Section 2 presents some graphical techniques used to explore longitudinal analysis and it
provides a brief discussion of the two approaches for analysis of longitudinal data which will
be the main focus of this course.
Section 3 presents the Mean Response approach. Section 4 presents the Random
Coefficients approach.


INTRODUCTION
  General notes on using SPSS
  Features of Longitudinal Data
  Merits of Longitudinal Data
  General notation and concepts
  Longitudinal analysis considerations
  Longitudinal data layout for SPSS Mixed

EXPLORING LONGITUDINAL DATA

MODELS FOR THE MEAN RESPONSE
  Profile Analysis
    - Specifying the Covariance Structure
    - Model Fit and Model Comparison
    - Restricting the covariance structure
    - Likelihood ratio statistic
  Parametric Curves

RANDOM COEFFICIENTS APPROACH
  Random Intercept Model
  Random Intercept and Random Slope Model

1. Introduction
1.1 Background Assumed
- Variables:
Y: Outcome, response, dependent variable
X: Covariates, independent variables
- Inference
Estimation, testing, and confidence intervals
- Statistical methods:
Multiple linear regression, ANOVA, ANCOVA.

1.2 General notes on using SPSS


In the notes that follow, we assume that you are familiar with SPSS or at least with other
Windows programs. If you are not, then the following may be useful.
To open SPSS, just click on the SPSS button on your desktop.
You have now opened SPSS and you can see the default Untitled1 [DataSet0] - PASW
Statistics Data Editor sheet.
This is a sheet with rows that will represent cases and columns that will represent variables.
At the top of the screen we find the drop-down menus. For now we are going to have a look at
File and Analyze.

To open a file you need to click on File and then select Open Data. SPSS data sets have
the file extension *.sav.

Software: PROC MIXED in SPSS

To run a Linear Mixed Models analysis, from the menus choose:


Analyze → Mixed Models → Linear...

The first screen that appears is the one below.

Variable selection

After doing the appropriate selection and clicking on Continue the following screen will appear.

Variable selection

On the right hand side of the screen there are two Sub-dialogue boxes: Fixed and Random
(see below).

These four screens are needed to specify a model. We will go through each dialog box and
their range of options when presenting the methodology applied to longitudinal data.

1.3 Features of Longitudinal Data


- Defining feature: repeated observations on subjects over time, allowing the direct study of
change.
- Note that the measurements are commensurate, i.e. the same variable is measured
repeatedly.
- Longitudinal data require sophisticated statistical techniques because the repeated
observations are usually (positively) correlated.
- Sequential nature of the measures implies that certain types of correlation structures are
likely to arise.
- Correlation must be accounted for to obtain valid inferences.

EXAMPLES

1) Diet Study
A physician is evaluating a new diet for her patients with a family history of heart disease. To
test the effectiveness of this diet, 16 patients are placed on the diet for 6 months. Their
weights and triglyceride levels are measured before and after the study, and the physician
wants to know if the weights have changed.
2) Smoke Data
This dataset consists of a sub-sample from an epidemiologic study conducted in a rural area
of the Netherlands, where residents were followed over time to obtain information on the
prevalence of and risk factors for chronic obstructive lung diseases. A measure of pulmonary function
(FEV1) was obtained every three years for the first 15 years of the study, and also at year 19.
Information on respiratory symptoms and smoking status was collected.
3) Exercise Therapy Study
The data are from a study of exercise therapies, where 37 patients were assigned to one of
two weightlifting programs. In the first program (treatment 1), the number of repetitions was
increased as subjects became stronger. In the second program (treatment 2), the number of
repetitions was fixed but the amount of weight was increased as subjects became stronger.
Measures of strength were taken at baseline (day 0), and on days 2, 4, 6, 8, 10, and 12.
4) Treatment of Lead Exposed Children
These data consist of four repeated measurements of blood lead levels (a common form of
metal intoxication) obtained at baseline (or week 0), week 1, week 4, and week 6 on 100
children who were randomly assigned to a chelating treatment (antidote to lead poisoning) or
placebo.
5) Grocery coupons data
This is a hypothetical data file that contains survey data collected by a grocery store chain
interested in the purchasing habits of their customers. Each customer is followed for four
weeks, and each case corresponds to a separate customer-week and records information
about where and how the customer shops, including how much was spent on groceries during
that week.
A grocery store chain is interested in determining the effects of three different coupons
(versus no coupon) on customer spending. To this end, they construct a crossover trial in
which a random sample of their regular customers is followed for four weeks.
Design for crossover trial

         Sequence 1   Sequence 2   Sequence 3   Sequence 4
Week 1   No coupon    5 percent    15 percent   25 percent
Week 2   5 percent    25 percent   No coupon    15 percent
Week 3   15 percent   No coupon    25 percent   5 percent
Week 4   25 percent   15 percent   5 percent    No coupon
Thus, in Sequence 1, a customer is not sent a coupon in the first week, receives a 5 percent
coupon in the second week, a 15 percent coupon in the third week, and a 25 percent coupon
in the fourth week. Each customer is randomly assigned to one of the sequences.
6) Clinical Trial of Patients with Respiratory Illness
The data are from a clinical trial of patients with respiratory illness, where 111 patients from
two different clinics were randomized to receive either placebo or an active treatment.

Patients were examined at baseline and at four visits during treatment. At each examination,
respiratory status (categorized as 1 = good, 0 = poor) was determined.
7) Airline Cost Data
These data are from a study of a group of U.S. airlines, with data collected on several
companies over 15 years. The objective of the study is to determine the effect of economic
factors on cost.
8) Growth Study Data
Investigators at the University of North Carolina Dental School followed the growth of 27
children (16 males, 11 females) from age 8 until age 14. Every two years they measured the
distance between two points in the head that are easily identified on x-ray.
9) Opposites-naming data
This dataset consists of a sample of people who completed an inventory that assesses their
performance on a timed cognitive task called opposites naming. Individuals were measured
on four occasions. A baseline measure of cognitive skill was also collected. The research
interest focuses on whether opposites-naming skill increases more rapidly over time among
individuals with stronger cognitive skills.

1.4 Merits of Longitudinal Studies


(i) The prime advantage of a longitudinal study is its effectiveness for studying change. Unlike
a cross-sectional study, a longitudinal design can distinguish changes over time within
subjects (ageing effects) from differences among subjects in their baseline levels (cohort
effects). The idea is illustrated in Fig. 1.1. In Fig. 1.1(a), reading ability is plotted against age
for a hypothetical cross-sectional study of children. Reading ability appears to be poorer among
older children; little else can be said. In Fig. 1.1(b), we suppose the same data were obtained
in a longitudinal study in which each individual was measured twice. Now it is clear that while
younger children began at a higher reading level, everyone improved with time. This can be
explained by saying that this group of children came from a poor rural community and that
when elementary education was introduced it began with the younger children.

(ii) It can provide information about individual change.

(iii) Another advantage of a longitudinal study is its ability to distinguish the degree of
variation in the response over time for one person from the variation in the response among
people. We will come back to this later in the course.

1.5 Some general notation and concepts

General structure

In longitudinal studies the general data structure is as follows. Suppose that each subject is
measured on p occasions:

Subject      Time 1   Time 2   Time 3   ...   Time p
1            y11      y12      y13      ...   y1p
2            y21      y22      y23      ...   y2p
.            .        .        .        ...   .
n            yn1      yn2      yn3      ...   ynp

(For example, y23 would be the weight of subject 2 on the 3rd occasion.)

where yij is the j-th observation on the i-th subject and y is the response or outcome measure.
In the structure above we have a single group of subjects. We can extend this to two or
more groups of subjects repeatedly measured over time. The groups can be defined by
characteristics of the study subjects, such as age or gender (studies based on such categories
are observational). Groups can also be defined by randomization to alternative treatments.
This type of design is called a Parallel Groups Design. The structure of a two-group
parallel design is below:

Group 1
Subject      Time 1     Time 2     Time 3     ...   Time p
1            y11        y12        y13        ...   y1p
2            y21        y22        y23        ...   y2p
.            .          .          .          ...   .
m            ym1        ym2        ym3        ...   ymp

Group 2
Subject      Time 1     Time 2     Time 3     ...   Time p
m+1          ym+1,1     ym+1,2     ym+1,3     ...   ym+1,p
m+2          ym+2,1     ym+2,2     ym+2,3     ...   ym+2,p
.            .          .          .          ...   .
n            yn1        yn2        yn3        ...   ynp

The diet study in section 1.3 is an example of a two-group parallel design.

Next we consider a variant of the single-group repeated measures design known as the
crossover design. In the simplest version of the cross-over design, two treatments, say A
and B, are to be compared.
Subjects are randomly assigned to the two treatment orders: A then B, and B then A.
Example: a placebo-controlled study of the effect of erythropoietin on plasma histamine levels
and pruritus scores of 10 dialysis patients, in which the treatment schedule was 5 weeks of
placebo and 5 weeks of erythropoietin in random order. Another example is the Grocery
Coupons data in section 1.3.

Correlation

To avoid writing equations, we can say that correlation is a single number that describes the
degree of relationship between two variables. For example, we would expect some measure
of self-esteem and weight to be correlated. The same applies to price and demand for a
certain item.
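As a small numerical illustration of the price-and-demand example (the numbers are made up), the correlation coefficient can be computed with NumPy:

```python
import numpy as np

# Hypothetical paired observations: price of an item and its demand
price = np.array([1.0, 1.5, 2.0, 2.5, 3.0])
demand = np.array([100, 90, 70, 60, 40])

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal
# entry is the Pearson correlation between the two variables
r = np.corrcoef(price, demand)[0, 1]
```

Here r is strongly negative: as price rises, demand falls.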

Factors

Factors are categorical predictors, e.g., treatment (treatment 1, . . . , treatment k) and gender.

Level

The categories of a factor are called levels. For example gender is a factor with two levels.

Covariates

The name covariate usually refers to a continuous predictor, e.g., age.

Interactions

An interaction between two factors means that each combination of factor levels can have a
different effect on the dependent variable. Additionally, you can find that the relationship
between a covariate and the dependent variable changes for different levels of a factor. This
would be a factor-covariate interaction.
The response or outcome measure will be denoted by y. Predictors, both factors and
covariates, will be denoted by x, except for time, which will be denoted by t.

The mean response will be denoted by E(y).

The variance-covariance matrix will be denoted by Σ.

1.6 Longitudinal analysis considerations


There are several different features of longitudinal studies that must be considered when
selecting an appropriate methodology. First, the outcome or response measure can be
continuous or categorical. Second, subjects can be measured at the same time points or at
different occasions; in the first case we have a balanced design. Another consideration is
whether measurement times are equally spaced (e.g. every 2 days) or unequally spaced.
Another important consideration is whether predictors of interest are fixed or time-varying. An
example of a fixed predictor is gender, since it remains unchanged over time. In the smoking
dataset (section 1.3), smoking status is a predictor of FEV1 and it could also be considered
time-varying if some patients stop smoking.

In this course we will focus on:

- Parallel design
- Continuous outcome
- Balanced design with equally or unequally spaced measurements

1.7 Data Layout needed for SPSS Proc Mixed


Appropriate structure of the data file is an important yet often unmentioned condition in
longitudinal analysis. There are two basic structures to consider: Multiple-Variable (MV)
and Multiple-Record (MR).
The defining characteristic of the Multiple-Variable (MV) structure is that all information
pertaining to a single observation is placed on one line in the dataset. For example, if there
are 20 participants in a study with 3 variables recorded for each person, the resulting MV data
file will contain 20 lines (records) and 3 variables. This is illustrated in Figure 1 for the first six
participants. In longitudinal designs it is necessary to have a variable that identifies the
inherent group membership of multiple records; in this case it would be the participant's ID.

Figure 1. Multiple-Variable Data Structure

ID   Var1   Var2   Var3   Group
1    12     45     34     A
2    23     43     34     B
3    31     54     45     A
4    13     42     31     A
5    26     40     38     B
6    27     49     44     B

As an example we could think of Var1, Var2 and Var3 being, respectively, the weight of each
individual measured on three occasions.
The defining characteristic of the Multiple-Record (MR) structure is that information pertaining
to a single observation is stacked, or placed on multiple lines, in the dataset.
For example, if there are 20 participants in a study with 3 observations on variable X for each
person, the resulting MR data file will contain 60 lines (records), i.e., all the information
regarding variable X will be contained in one column. In addition to variable X, it is also
necessary to include an individual-level identifier (ID) and a variable representing the timing
or sequence (ORDER) of measurements.
This format is illustrated in Figure 2 for the first 3 participants.


Figure 2. Multiple-Record Data Structure

ID   Var X   Order
1    12      1
1    45      2
1    34      3
2    23      1
2    43      2
2    34      3
3    31      1
3    54      2
3    45      3

In this course we are going to use the SPSS Linear Mixed Models procedure to analyze
longitudinal data. SPSS Mixed uses the MR format, so if the data are initially recorded in MV
format you need to restructure the file.
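The same variables-to-cases restructuring can also be sketched outside SPSS; here is a minimal pandas version on a made-up MV file with one variable group (the column names wgt1..wgt3 and the values are hypothetical):

```python
import pandas as pd

# Hypothetical MV (wide) data: one row per participant,
# weight recorded on three occasions
wide = pd.DataFrame({
    "ID": [1, 2, 3],
    "wgt1": [12, 23, 31],
    "wgt2": [45, 43, 54],
    "wgt3": [34, 34, 45],
})

# Stack the three weight columns into a single column and add an
# index variable for the occasion: this is the MR (long) format
long = (pd.wide_to_long(wide, stubnames="wgt", i="ID", j="time")
          .reset_index()
          .sort_values(["ID", "time"]))
```

The result has one line per participant-occasion, exactly the layout SPSS Mixed expects.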
PREPARING THE DATA: using the Restructure option in SPSS
First, let's open the diet study file, Diet_Study.sav. To restructure the data from variables to
cases, from the menus choose:
Data → Restructure...


You want to restructure variables into cases, so simply click Next in the Restructure Data
Wizard Welcome dialog box.
You have to specify how many variable groups you want to restructure. The default is the
first option. However, the diet file contains Weight and Triglyceride which are recorded for
each patient at different time points. This means that we have two variable groups so the
second option is appropriate in this case.
Click on the More than One option and type 2 after How Many?


Click Next.


In the Case Group Identification group, select Use selected variable from the dropdown list
and select Patient ID as the identification variable.
In the Variables to Be Transposed group, type Triglice as the first target variable (trans1)
and select tg0, tg1, tg2, tg3 and tg4 as the variables to be transposed.
In the same Variables to Be Transposed box, click on the second target variable (trans2),
type Weight and select wgt0, wgt1, wgt2, wgt3 and wgt4 as the second group of variables to
be transposed.


Select Age in years and Gender as fixed variables.


Click Next.


You want to create just one index variable, so simply click Next in the Variables to Cases:
Create Index Variables.


In the edit fields, type time as the variable name.


Click Finish.
In order to keep the previous format you should save the restructured file with a new name,
e.g. Diet_StudyLF.

2. Exploring Longitudinal Data


Researchers should conduct descriptive exploratory analyses of their data before fitting statistical
models. As when working with cross-sectional data, exploratory analyses of longitudinal data can
reveal general patterns, provide insight into functional form, and identify individuals whose data do
not conform to the general pattern. In a longitudinal design you can also explore how different
individuals in your sample change over time.
Two types of questions form the core of every study about change:
(i) The first question is about characterizing each persons pattern of change over time - the withinperson question. For example: Is individual change linear? Nonlinear? Is it consistent over time or
does it fluctuate?


(ii) We can address the between-person question, "How does individual change differ across
people?", by exploring whether different people change in similar or different ways, and
whether observed differences in change across people are associated with individuals'
characteristics; in other words, which predictors are associated with which patterns.
Understanding these two questions will prepare you for subsequent model-based analyses.
The exploratory analyses presented in this course will be based on graphical techniques.
The simplest way of visualizing how a person changes over time (question (i)) is to plot each
person's outcome vs. time. Because it is difficult to discern similarities and differences among
individuals if each page contains only a single plot, we recommend that you cluster sets of
plots in smaller numbers of panels.
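Outside SPSS, the same kind of panelled person-by-person plot can be sketched with matplotlib (the subjects, trend and noise level here are simulated, not from the diet study):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")            # render off-screen
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
times = np.arange(5)

# One small panel per subject: 16 hypothetical subjects whose
# outcome declines over time plus noise
fig, axes = plt.subplots(4, 4, figsize=(8, 8), sharex=True, sharey=True)
for sid, ax in enumerate(axes.ravel(), start=1):
    y = 90 - 2 * times + rng.normal(0, 1.5, size=times.size)
    ax.plot(times, y, marker="o")
    ax.set_title(f"Subject {sid}", fontsize=8)

fig.supxlabel("time")
fig.supylabel("outcome")
fig.savefig("growth_panels.png")
```

Clustering all subjects into one grid of shared-axis panels makes individual trajectories easy to compare at a glance.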
We use the diet file with multiple records (Diet_StudyLF) as an illustration. Once you've
opened this file, from the menu choose:
Graphs → Chart Builder

Select the Simple Line chart from the Gallery and drag it to the canvas.

The next screen should look as below:


After you have inserted Triglyceride in the Y-Axis box and time in the X-Axis box, click on the
Groups/Point ID box.
A drop-down list appears. Tick the Rows panel variable box.
A Panel box is displayed on the right side of the canvas.
Drag Patient ID to the Panel box.
Go to the Options box and tick Wrap Panels in the Panels sub-box.
Click OK.


You can get another version of this graph by double-clicking on the graph to open the editor;
from the menu choose Elements and then click on Add Markers. If you get rid of the lines
connecting the markers you obtain the graph below.


Should you examine every possible empirical growth plot if your data set is large, including
perhaps hundreds of cases? Probably not; instead, you can randomly select a subsample of
individuals (perhaps stratified into groups defined by the values of important predictors).
Having summarized how each individual changes over time, we now examine similarities and
differences in these changes across people (question (ii)). In other words, we are interested in
the average change trajectory for the entire group.
To produce this type of graph, go to

Graphs → Chart Builder

Select the Multiple Line chart from the Gallery and drag it to the canvas.
Drag Weight to the Y-Axis box and time to the X-Axis box.
Drag Patient ID to the Set colour box.
Click OK.

The average change trajectory for the entire group would be a line somewhere in the middle of
the graph above. Unfortunately, there is no straightforward way to add it in SPSS. However,
you can do this semi-automatically by selecting Data → Split File from the menu and then
clicking on the Organize output by groups option.

Drag time to the Groups Based on box and click OK.


Now if you go to Analyze → Descriptive Statistics you obtain the mean weight for each
occasion of measurement (time). You can create a new patient ID (e.g., 17) in the
Diet_Study.sav file and type in the five mean values. Re-run the graph above and you'll get:
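The occasion means themselves are a simple computation; a pandas sketch on hypothetical MR-format data (not the actual diet study values):

```python
import pandas as pd

# Hypothetical MR-format data: 3 patients, weight at 3 occasions
data = pd.DataFrame({
    "patid":  [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "time":   [1, 2, 3, 1, 2, 3, 1, 2, 3],
    "weight": [200, 195, 190, 180, 178, 175, 220, 215, 212],
})

# Average change trajectory: mean weight at each occasion
trajectory = data.groupby("time")["weight"].mean()
```

Plotting this series against time gives the group's average change trajectory directly.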

It is also useful to do box plots (go to Graphs → Legacy Dialogs → Boxplot) to investigate
symmetry and variability at each time point. Lack of symmetry implies non-normality. In the
diet study dataset we can do a box-plot graph for weight by time and gender. We can see
that there is no apparent departure from symmetry.
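A matplotlib sketch of such per-occasion box plots, on simulated weights (the distributions are made up, not the diet study data):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")            # render off-screen
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)

# Hypothetical weights of 16 patients at 5 occasions:
# one box per time point to inspect symmetry and spread
weights = [rng.normal(200 - 3 * t, 10, size=16) for t in range(5)]

fig, ax = plt.subplots()
ax.boxplot(weights)
ax.set_xlabel("time point")
ax.set_ylabel("weight")
fig.savefig("weight_boxplots.png")
```

Skewed boxes flag non-normality; boxes of very different heights flag heterogeneous variability over time.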


The graph below shows box plots for each time point for the triglyceride outcome. This
response is less symmetric and, unlike weight, its variability over time is less homogeneous.
However, we will see that the methodology used in longitudinal designs can account for this
type of heterogeneity.


3. Models for the Mean Response


In this course I'm going to introduce two main approaches to longitudinal data analysis:
Marginal Models and Random Coefficients. In this section we introduce the Marginal Models
approach, also known as Mean Response Models. Random Coefficients will be described in
the following section.
In Marginal Models we model the regression of the response on covariates and the
covariance structure separately. In a parallel group design the main goal of the analysis will
be to characterize the patterns of change over time in the several groups and to determine
whether those patterns differ between the groups.
The linear model for Yij can be written as

Yij = β0 + β1X1ij + . . . + βkXkij + eij

where X1, . . . , Xk are the covariates and/or factors and β0, β1, . . . , βk are the regression
parameters.
In matrix notation,

Y = Xβ + e

The model will be completely specified with Σ, which is the variance-covariance matrix of the
errors e.
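A minimal numpy sketch of the matrix form of the mean response, E(Y) = Xβ (the design matrix and parameter values are hypothetical):

```python
import numpy as np

# Hypothetical design matrix X: an intercept column plus one
# covariate, for 4 observations
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])

beta = np.array([2.0, 0.5])   # beta0 (intercept), beta1 (slope)

# Mean response: E(Y) = X beta
mu = X @ beta
```

Each row of X produces one fitted mean; the observed Y differs from mu by the correlated errors e.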
Assumptions

- The dependent variable Y is assumed to be linearly related to the factors and covariates.
- Normality of the eij.
- While observations from different individuals are independent, repeated measurements of
  the same individual are not assumed to be independent, i.e. they are correlated. Note that
  if we had independence, Σ would be diagonal, which is the case in multiple linear
  regression.

Next we discuss different choices for modeling the mean response over time. The mean, μ, is
given by the linear regression model

E(Y) = μ = Xβ

We also need to specify the covariance matrix Σ, but we will deal with this later in this section.
First we will concentrate on modeling the mean response. We can distinguish two basic
strategies: (i) Arbitrary Means (Profile Analysis) and (ii) Parametric Curves.

3.1 Arbitrary Means (Profile Analysis)


In profile analysis time is treated as a factor. In the following, we will assume that
measurements are equally spaced and a parallel group design.

3.1.1 Hypotheses about the mean response


We can distinguish three hypotheses that may be of scientific interest.


H0(1): Are the profiles of means similar between the groups, in the sense that the line
segments between adjacent occasions are parallel? This is the hypothesis of no group by
time interaction.
H0(2): If the group profiles are parallel, are the means constant over time? This is the
hypothesis of no time effect.
H0(3): If the group profiles are parallel, are they also at the same level? This is the hypothesis
of no group effect.

Below there is a graphical representation of the three hypotheses.

Although these general formulations of the study hypotheses are a good place to begin, the
appropriate hypotheses in a particular study must be derived from the relevant scientific
issues in that investigation.

3.1.2 Specifying the Covariance Structure


The covariance structure for the residuals is denoted by Σ; more intuitively, Σ specifies the
relationship between the levels of the repeated effects.


Suppose that each individual was measured three times. Some of the available structures are:

Unstructured. This is a completely general covariance matrix:

    | σ1²   σ12   σ13 |
    | σ12   σ2²   σ23 |
    | σ13   σ23   σ3² |

AR(1). This is a first-order autoregressive structure with homogeneous variances. The
correlation between any two elements is equal to ρ for adjacent elements, ρ² for elements
that are separated by a third, and so on; ρ is constrained so that -1 < ρ < 1:

    σ² | 1    ρ    ρ² |
       | ρ    1    ρ  |
       | ρ²   ρ    1  |

Compound Symmetry. This structure has constant variance and constant covariance:

    | σ² + σ1²   σ1²        σ1²      |
    | σ1²        σ² + σ1²   σ1²      |
    | σ1²        σ1²        σ² + σ1² |

For more options see Covariance Structures in the SPSS Proc Mixed online help.

With too little structure (e.g. unstructured), there may be too many parameters to be
estimated with the limited amount of data available, leaving too little information for estimating
the regression parameters. With too much structure (e.g. compound symmetry), there is more
information available for estimating them, but if the assumed structure is wrong the inferences
can be misleading.
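The AR(1) and compound symmetry structures above can be constructed numerically; a numpy sketch with hypothetical parameter values:

```python
import numpy as np

p = 3          # number of occasions
sigma2 = 4.0   # hypothetical common variance
rho = 0.6      # hypothetical autoregressive correlation

# AR(1): covariance sigma2 * rho**|j - k| between occasions j and k
lags = np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
ar1 = sigma2 * rho ** lags

# Compound symmetry: constant variance on the diagonal and a
# constant covariance everywhere off the diagonal
sigma1sq = 1.5   # hypothetical common covariance
cs = np.full((p, p), sigma1sq) + sigma2 * np.eye(p)
```

AR(1) needs only 2 parameters and compound symmetry 2, versus p(p+1)/2 for unstructured; this is the parsimony trade-off discussed above.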

3.1.3 Model Fit and Model Comparison


Model Fit
A model can be chosen on the basis of its goodness of fit which is interpreted as closeness of
the fitted values to the observed values. Therefore in assessing goodness of fit, it is essential
to first plot the fitted values vs. the observed values. Many potential problems are easiest to
spot graphically. In addition to examining the graph, several measures can be used for
quantifying goodness of fit. Here, I only present measures of model fit provided by SPSS Proc
Mixed:
The -2 Restricted Log Likelihood is the most basic measure for model selection.
Informally, the likelihood tells you how likely your data are to have been produced by your
model. If the likelihood is high we can expect the fitted values to be close to the observed
values. Note that a high likelihood corresponds to a low -2 Restricted Log Likelihood because
we are multiplying by -2. Therefore, the smaller this measure is, the better.
The other four measures are modifications of the Log Likelihood which penalize more
complex models.
Akaike's Information Criterion (AIC) adjusts the -2 Restricted Log Likelihood by twice the
number of parameters in the model.
Hurvich and Tsai's Criterion (AICC) is a correction for the AIC when the sample size is
small. As the sample size increases, the AICC converges to the AIC.
Schwarz's Bayesian Criterion (BIC) has a stronger penalty than the AIC for
overparameterized models, and adjusts the -2 Restricted Log Likelihood by the number of
parameters times the log of the number of cases.
Bozdogan's Criterion (CAIC) has a stronger penalty than the AIC for overparameterized
models, and adjusts the -2 Restricted Log Likelihood by the number of parameters times one
plus the log of the number of cases. As the sample size increases, the CAIC converges to the
BIC.
Smaller values indicate better models.
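The penalized criteria can be reproduced from the -2 Restricted Log Likelihood as described above; a sketch with made-up numbers (q parameters, n cases):

```python
import math

# Hypothetical fit summary
neg2ll = 510.0   # -2 Restricted Log Likelihood
q = 15           # number of parameters
n = 16           # number of cases

aic = neg2ll + 2 * q                   # AIC: penalty 2q
bic = neg2ll + q * math.log(n)         # BIC: penalty q * log(n)
caic = neg2ll + q * (1 + math.log(n))  # CAIC: penalty q * (1 + log(n))
```

Comparing two candidate models, the one with the smaller criterion values is preferred.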
Model Comparison
First we need to make a distinction between nested and non-nested models:
Model I is nested in Model II when Model I is a particular case of Model II.
Example:

Y = β0 + β1x1 + β2x2 + β3x1x2 + e                              (I)

Y = β0 + β1x1 + β2x2 + β3x1x2 + β4x1² + β5x2² + e              (II)

When models are nested and the parameters are estimated using maximum likelihood
methods, as is the case with SPSS Mixed, they can be compared statistically by using the
Likelihood Ratio Test.
The likelihood ratio test compares a smaller model with a more complex model. The null
hypothesis of the test states that the smaller model fits the data as well as the larger, more
complex model. If the null hypothesis is rejected, then the alternative, larger model provides a
significant improvement over the smaller model.
We can use likelihood ratio tests for hypotheses about models for the mean and the
covariance (keep in mind that in this approach we need to specify a model for the mean and a
model for the covariance structure).
Note models for the covariance can be statistically compared provided that they have the
same model for the mean.
Non-nested models can only be compared descriptively by using model fit measures such
as the ones mentioned above. For example the autoregressive and compound symmetry
covariance structures are not special cases of each other i.e., they are non-nested.
We will go back to model selection and comparison in the examples described throughout the
course.
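The likelihood ratio comparison itself is a small computation; a scipy sketch with hypothetical -2 log likelihoods for two nested covariance models sharing the same mean model:

```python
from scipy import stats

# Hypothetical -2 log likelihoods for two nested models
neg2ll_simple = 530.0    # e.g. the more restricted structure
neg2ll_complex = 522.0   # e.g. the more general structure
df = 13                  # difference in the number of parameters

# Likelihood ratio statistic: difference of the -2 log likelihoods,
# referred to a chi-square distribution with df degrees of freedom
lr = neg2ll_simple - neg2ll_complex
p_value = stats.chi2.sf(lr, df)
```

A large p-value here would favour keeping the simpler model, since the complex one offers no significant improvement.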


Example of Conducting a Profile Analysis in SPSS Proc Mixed


To illustrate this approach we will use the diet study dataset.
Questions:
Have patients' weights changed over time?
Does gender have an effect on weight?
How about age?
Running the Analysis:
Select Patient ID as the subjects variable.
Select time as the repeated effects variable.
Select Unstructured from the Repeated Covariance type dropdown list.

Screen 1

- Subject variables define the individual subjects of the repeated measurements.
- Repeated effects variables are variables whose values are the multiple observations of a
  single subject.
- The covariance structure specifies the relationship between the levels of the repeated
  effects.

Click Continue.


Note that so far we have used time (which is a factor) to specify the covariance structure
(since observations for the same patient are correlated). However, we are also interested in
the effect of time on the response, so in the second window we enter time as a factor.
Select Weight as the dependent variable.
Select time and gender as factors.
Screen 2

Click Fixed.

Screen 3-a


Select time and gender in the Factors and Covariates box and click Add.
Note that the drop-down list in the middle is set to Main Effects. If you want to include the
interaction term between time and gender in the model you have to click on this list and select
Interaction, then highlight both time and gender and click Add. The screen below will appear.

Screen 3-b

Let's choose the model without the interaction term (you can test the interaction effect as an
exercise). If you are already in Screen 3-b, highlight time*gender and click Remove.
Click Continue.
Click on Statistics in the Linear Mixed Models dialog box.
In the Statistics sub-box (Screen 4), select Parameter estimates, Tests for covariance
parameters and Covariances of residuals in the Model Statistics group.
Click Continue and then click OK in the Linear Mixed Models dialog box.


Screen 4

Output and Interpretation


The table below provides a summary of the model you selected. For each effect, the number
of levels of the effect and the number of parameters accounted for by the effect in the model
are reported. The covariance structure and the number of parameters necessary for that
structure are reported for the repeated effects. The variable that defines subjects (patid) and
the number of unique subjects defined by the subject variable are also shown.

Table 1

The Information Criteria table provides measures of model fit.


Table 2

The tests of fixed effects table provides F tests for each of the fixed effects specified in the
model. Small significance values (Sig. column) indicate that the effect contributes to the
model.

Table 3

Table 4 provides estimates of the fixed model effects and tests of their significance. Since
there is an intercept term, the fifth level of time is redundant. Thus, the estimates for the
first four levels contrast the average weight at times 1, 2, 3 and 4 with the last period. We
can see that the estimated average weight at each time point is significantly different from the
last time point. From the first column, the estimated mean weights decrease over time, with
the mean weight on the first occasion being the highest. The effect of gender is also
significant.

Table 4


The following two tables show information about the variance-covariance structure.

Table 5

For the unstructured covariance matrix, the table above directly reports the values in the
matrix and their corresponding significance. UN(1,1) is the variance for the error term in time
1, UN(2,2) is the variance for the error term in time 2, UN(2,1) is the covariance between error
terms at the first and second time periods and so on. Another way to look at the estimated
variances and covariances is given by the table below.

Table 6

But how do we know that this model is appropriate for these data? First we need to assess
whether this is a feasible model before doing any model comparison (see Restricting the
Covariance Structure below).
A simple way to assess how well the model fits the data is to plot the fitted values
obtained from the model vs. the observed values. To obtain the predicted values when you
run the analysis in SPSS Mixed, you will find a sub-box called Save in Screen 2; click on
Fixed Predicted Values. Let's assume for now that this model is suitable. In the example in
Section 3.2 we will see how to plot predicted vs. observed values.


Restricting the Covariance Structure


The unstructured covariance is the most complex structure, and as the number of
measurements increases so does the number of parameters in the covariance matrix: from
Table 1, five time points need 15 parameters! So we may try a simpler structure and see
whether it fits the data well. From the table above, it seems that time periods further apart
have a smaller covariance, so we could use an autoregressive structure (see Section 3.1.2).
To do this, re-run the previous model as before except that in Screen 1 you select AR(1) in the
Repeated Covariance Type box. We can now compare the two models with the two
covariance structures.
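The two structures can be sketched numerically to see where the parameter counts come from. This is a Python illustration, not part of the SPSS workflow; the variance (10.0) and correlation (0.6) values are made up:

```python
import numpy as np

def ar1_cov(n_times, sigma2, rho):
    """AR(1) covariance: sigma2 * rho**|j - k| for occasions j, k.
    Only 2 parameters (sigma2, rho) regardless of the number of times."""
    lags = np.abs(np.subtract.outer(np.arange(n_times), np.arange(n_times)))
    return sigma2 * rho ** lags

def n_unstructured_params(n_times):
    """An unstructured covariance needs one parameter per variance and
    one per distinct covariance: T*(T+1)/2 in total."""
    return n_times * (n_times + 1) // 2

cov = ar1_cov(5, sigma2=10.0, rho=0.6)
print(n_unstructured_params(5))                  # 15 parameters for 5 time points
print(round(cov[0, 1], 3), round(cov[0, 2], 3))  # covariance decays with lag: 6.0 3.6
```

The decay of `rho ** lag` mirrors the pattern noted above, where time periods further apart show a smaller covariance.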

Table 7

The model dimension table shows that the number of parameters in the repeated effects is
reduced from 15 to 2.
We can compare the information criteria for this model to those for the previous model with
the unstructured covariance.
The -2 Restricted Log Likelihood is smaller for the unstructured model but this is expected
since this model has a larger number of parameters. All of the other measures, which
penalize overly complex models, are smaller for the autoregressive model except the AIC
which is similar for both models.

Table 8- AR(1) covariance

Table 9- Unstructured covariance

Another way to check that the simpler model is suitable is by looking at the parameter
estimates table for the fixed effects (Table 10). If the estimated effects and standard errors
were very different from those in the corresponding table for the unstructured model (Table 4),
that would not be a good sign. In this case there are only slight differences between the two
models. A more formal comparison is given by the likelihood ratio test.

Table 10

The AR1 diagonal parameter specifies the residual variance for each time point. The AR1
rho parameter specifies the residual correlation between two consecutive occasions (see
table 11).

Table 11

Computing the Likelihood Ratio Statistic:


The test statistic is computed by subtracting the -2 Restricted Log Likelihood of the larger
model from the -2 Restricted Log Likelihood of the smaller model. For these two models, that
difference is 391.354 - 364.812 = 26.542.
Remember that the null hypothesis of the test states that the smaller model fits the data as
well as the larger, more complex model. If the null hypothesis is rejected, then we can
conclude that the smaller, simpler model is not appropriate. On the other hand, if the null
hypothesis is not rejected, we can say that the simpler model is adequate for the data.
If the null hypothesis is true, then the test statistic has an approximately chi-squared
distribution. To compute the degrees of freedom for that distribution, compare the model
dimensions for the two models.
The degrees of freedom are computed by subtracting the total number of parameters in the
smaller model from the total parameters in the larger model. For these two models, that
difference is 21 - 8 = 13.
To compute the significance value for the likelihood ratio test, from the menus choose:
Transform
Compute Variable...


Type pvalue as the Target Variable.


Type sig.chisq(26.5420,13) as the Numeric Expression.
Click OK.

The significance value of the test is saved to the variable pvalue, and its value is 0.014.
Therefore, at the 5% significance level, we reject the assumption of an AR(1) structure.
However, if we take the information criteria for both models into account, it would make sense
to keep an AR(1) structure, especially because if the number of repeated measurements
becomes too large and the sample is not big enough, the estimation of an unstructured
covariance becomes unfeasible.
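The same likelihood-ratio calculation done above through SPSS's Compute Variable dialog can be reproduced outside SPSS; here is a minimal Python sketch using scipy, with the -2 restricted log-likelihoods taken from the two models fitted above:

```python
from scipy.stats import chi2

# -2 restricted log-likelihoods from the fitted models above
neg2ll_ar1 = 391.354           # smaller model: AR(1) covariance
neg2ll_unstructured = 364.812  # larger model: unstructured covariance

lrt = neg2ll_ar1 - neg2ll_unstructured  # 26.542
df = 21 - 8                             # difference in total parameters
p_value = chi2.sf(lrt, df)              # equivalent to SIG.CHISQ(26.542, 13)
print(round(lrt, 3), df, round(p_value, 3))  # 26.542 13 0.014
```

`chi2.sf` is the survival function (upper-tail probability), which is exactly what SPSS's SIG.CHISQ returns.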
Suggestion: try to fit a heterogeneous AR(1).

Summary of Features of Profile Analysis


- Does not assume any specific time trend.
- May have low power to detect specific trends, e.g., linear trends. On the other hand,
when the treatment effects do not have a simple form, profile analysis is a good option.
- Can be used to accommodate other linear combinations of the response vector.
- When the time periods are unequally spaced, profile analysis doesn't take into
account the distance between measurements.

3.2 Parametric Curves


An alternative approach for analyzing the parallel-groups repeated measures design is to
consider parametric curves for the time trends.
In this approach we model the means as an explicit function of time and, unlike profile
analysis, the time variable enters the model as continuous. We will still need a categorical
version of time in order to model the covariance structure (see the next example).
(a) Linear Trend
If the means tend to change linearly over time we can fit the following model:

E(Y_ij) = β0 + β1 Time_j + β2 Trt_i + β3 Time_j × Trt_i

where Trt_i is an indicator variable which takes the value 1 if subject i receives treatment
1, and zero otherwise.
Then, for subjects in treatment group 2,

E(Y_ij) = β0 + β1 Time_j

while for subjects in treatment group 1,

E(Y_ij) = (β0 + β2) + (β1 + β3) Time_j

Thus, each group's mean is assumed to change linearly over time.

(b) Quadratic Trend


If the means tend to change over time in a quadratic manner, we can fit the following model:

E(Y_ij) = β0 + β1 Time_j + β2 Time_j² + β3 Trt_i + β4 Time_j × Trt_i + β5 Time_j² × Trt_i

Then, for subjects in treatment group 2,

E(Y_ij) = β0 + β1 Time_j + β2 Time_j²

while for subjects in treatment group 1,

E(Y_ij) = (β0 + β3) + (β1 + β4) Time_j + (β2 + β5) Time_j²
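To make the group-specific means concrete, here is a small Python sketch of the linear-trend mean function; the coefficient values are invented purely for illustration:

```python
def linear_mean(time, trt, b0, b1, b2, b3):
    """E(Y_ij) = b0 + b1*Time + b2*Trt + b3*Time*Trt, with Trt = 1 or 0."""
    return b0 + b1 * time + b2 * trt + b3 * time * trt

# Hypothetical coefficients: b0=10, b1=2, b2=1, b3=0.5
# Group 2 (Trt = 0): intercept b0, slope b1
# Group 1 (Trt = 1): intercept b0 + b2, slope b1 + b3
print(linear_mean(0, 0, 10, 2, 1, 0.5))  # 10.0  -> b0
print(linear_mean(0, 1, 10, 2, 1, 0.5))  # 11.0  -> b0 + b2
print(linear_mean(1, 1, 10, 2, 1, 0.5))  # 13.5  -> (b0 + b2) + (b1 + b3)*1
```

The quadratic model works the same way, with Time² terms added for each group.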


Coding of Time:
The coding of time, denoted t, has implications for the interpretation of the model. For
example, t can start at the value 0 for baseline and be incremented according to the
measurement timeline (e.g., 1, 2, 3, etc.). Alternatively, t can be centered on its mean. Under
the first coding the intercept characterizes the baseline time point, while under the latter
coding it reflects the midpoint of the time scale.
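The effect of the two codings on the intercept can be verified with a tiny least-squares sketch (the data are made up for illustration; only numpy is used):

```python
import numpy as np

t = np.array([0., 1., 2., 3., 4.])  # raw coding: 0 = baseline
y = 5.0 + 2.0 * t                   # exact linear responses for illustration

b1_raw, b0_raw = np.polyfit(t, y, 1)         # intercept = mean at baseline
b1_c, b0_c = np.polyfit(t - t.mean(), y, 1)  # intercept = mean at midpoint of time

print(round(b0_raw, 6), round(b0_c, 6))  # 5.0 9.0 (= 5 + 2 * 2, the value at t = 2)
print(round(b1_raw, 6), round(b1_c, 6))  # slope unchanged: 2.0 2.0
```

Only the intercept's meaning changes; the slope is identical under either coding.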

Example of Conducting a Mean Response Analysis by Fitting a Parametric Curve in SPSS Proc Mixed
To illustrate this approach we will use the Lead Exposed Children dataset. The methodology
described in Profile Analysis regarding specification of the Covariance structure, model fit and
model comparison applies to parametric curves.

Questions:
Does Lead level increase linearly over time?
Running the analysis:
Select Patient ID as the subjects variable.
Select time as the repeated effects variable.
Select Unstructured from the Repeated Covariance type dropdown list.


Screen 1

Click Continue.
Note that, as in profile analysis, we still use time (which is a factor) to specify the covariance
structure. However, in the parametric curves approach we are interested in modeling the
time effect as a continuous variable, so we are going to use the actual day of measurement (days)
as the time variable. The simplest curve is a line, i.e., we are assuming that a straight line
approximates the relationship between the response and days.
Select Lead Level as the dependent variable.
Select Treatment as a factor.
Select Days as a covariate.


Screen 2

Click Fixed.
Select Treatment and days in the Factors and Covariates box and click Add.

Screen 3

As in profile analysis,
Click Continue.


Click Statistics in the Linear Mixed Models dialog box.


In the Statistics sub-box (Screen 2 ), select Parameter estimates and Tests for covariance
parameters in the Model Statistics group.
Click Continue and then click OK in the Linear Mixed Models dialog box.

Output and Interpretation


The interpretation of tables 12 to 16 is similar to their corresponding tables in Section 3.1.4.

Table 12

Table 13

Table 14 below shows that days is highly significant, but the effect of treatment is borderline
(Sig. = 0.042).

Table 14


Table 15

Table 16

First, before doing any model comparison (e.g. doing a profile analysis instead) we need to
assess whether this is a feasible model.
As mentioned in the previous section, a straightforward way to assess how well the model fits
the data is to plot the fitted values obtained from the model vs. the observed values. After
saving the Fixed Predicted Values in Screen 2 you can plot predicted vs. observed as follows:
From the menu choose:
Graphs
Legacy Dialogs
Interactive Line

Click on Summaries of separate variables and select the Multiple option. You may want to
put Treatment in the Row box so that you can have one plot for each treatment. You get the
graph below.


It is obvious that when there is no treatment (A) the relationship between days and lead level
is non-linear, so you may be better off trying another curve or simply doing a profile
analysis.
I strongly advise doing some exploratory analysis before fitting any model. Things such as
non-linearity are easy to check with a simple plot. In this case a graph of Lead Level vs. days
by Treatment would have indicated the lack of linearity.

Summary of Features of Parametric Curve Models


- Allows one to model the time trend and treatment effect(s) as a function of a small
number of parameters. That is, the treatment effect can be captured in one or
two parameters, leading to more powerful tests when these models fit the data.
- Since E(Y_ij) is defined as an explicit function of the time of measurement,
Time_j, there is no reason to require all subjects to have the same set of
measurement times, nor even the same number of measurements.
- May not always be possible to fit the data adequately.


Exercises for Section 3


1) For the diet study data
(i) Run a model including time as a factor without taking into account the correlation
between the repeated measures. To do this in SPSS Mixed, click Continue when the first
screen appears (without specifying subjects or repeated effects). Then use Screens 2 and 3 as before.
(ii) Run a model including time as a factor but this time taking into account the correlation
between the repeated measures.
(iii) Compare the results of both models regarding the estimated effect of time, its
standard error and significance. What are the differences?
2) For the lead exposed children perform a profile analysis and compare the results with the
model in Section 3.2.
3) For the growth data, consider age as the time variable. Would a linear trend be appropriate
to describe the relationship between age and the response? Also check whether the effect of
age on the response is different between males and females.

4. Random Coefficients Approach for Longitudinal Data
When we have a longitudinal design, Random Coefficients is a particular case of Mixed
Models. Mixed Models contain both fixed and random effects.
Fixed Effects: factors for which the only levels under consideration are contained in the
coding of those effects
Examples: Gender, Marital Status where all levels are included.
Random Effects: Factors for which the levels contained in the coding of those factors are a
random sample of the total number of levels in the population for that factor.
Examples: if we take a random sample of Subjects from a target population, then Subject is a
random effect.
A basic feature of Random Coefficients is the inclusion of Subject as a random effect.

Mixed Models Assumptions


The dependent variable is assumed to be linearly related to the fixed factors, random factors,
and covariates. The fixed effects constitute a model for the mean of the dependent variable.
The random effects model the covariance structure of the dependent variable. The dependent
variable is assumed to come from a normal distribution.

Random Coefficients
A random coefficient model is an alternative approach to modelling longitudinal data. The most
common applications are those in which a linear relationship is assumed between the
outcome of interest and time, and it can be considered an extension of the simple linear
regression model for the response Y of subject i on occasion j (see the equation below).

Y_ij = β0 + β1 Time_ij + e_ij,   i = 1, …, N;  j = 1, …, T     (*)


Note that in (*), β0 and β1 do not vary between individuals. They are fixed, and they have
the same interpretation as in the linear curve in Section 3.2. For β1, for example, this
means that we are only assessing the average change trajectory for the entire group, not
each individual's growth trajectory.
Also, remember that linear regression models assume independence of the errors and for
longitudinal data this assumption is unreasonable.
A way to extend the idea of multiple regression models to longitudinal data is by introducing
random effects in the regression parameters. This is the main characteristic of random
coefficient models. These random effects allow us to describe each subject's trend across time
and also to account for the correlation between measurements.
We are going to describe these ideas in more detail in the next two sections by presenting the
Random Intercept Model and the Random Intercept and Slope Model.

4.1 Random Intercept Model


The simplest extension of equation (*) is given by the model below:

Y_ij = b_0i + β1 Time_ij + e_ij     (4.1-a)

b_0i = β0 + u_0i     (4.1-b)

where e_ij ~ N(0, σ²_e) and u_0i ~ N(0, σ²_u0)     (4.1-c)
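To build intuition for equations (4.1-a) to (4.1-c), the random intercept model can be simulated; here is a hedged numpy sketch with invented parameter values:

```python
import numpy as np

rng = np.random.default_rng(42)
n_subjects, n_times = 200, 5
beta0, beta1 = 160.0, -2.0  # hypothetical fixed intercept and slope
sd_u0, sd_e = 16.0, 1.6     # between-subject and within-subject SDs (made up)

time = np.arange(n_times)
u0 = rng.normal(0.0, sd_u0, size=(n_subjects, 1))    # u_0i, one per subject
e = rng.normal(0.0, sd_e, size=(n_subjects, n_times))  # e_ij
y = (beta0 + u0) + beta1 * time + e                  # Y_ij = b_0i + beta1*t + e_ij

# Each subject's line is parallel to the common trend beta0 + beta1*t;
# u0 shifts each subject's whole line up or down.
print(y.shape)  # (200, 5)
```

Plotting a few rows of `y` against `time` reproduces the kind of graph described above: parallel lines with varying intercepts.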

A typical situation where this model would be appropriate is if the data looked like the graph
below.


Each line corresponds to a different subject. The slope doesn't seem to change across
subjects; however, the intercept does.

From equation (4.1-a), the random intercept model indicates that the response for
subject i is influenced by his/her initial level b_0i. Equation (4.1-b) indicates that
the initial level for subject i, b_0i, is determined by β0, the initial level common to all subjects,
plus a contribution u_0i unique to that subject. Equation (4.1-c) indicates that the subject
effect is random by specifying a variance parameter σ²_u0.
Another way to see this is to think of each subject's trend line as parallel to a common
trend given by β0 + β1 × Time, so that the difference between each subject's trend and the
common trend is u_0i. This means that for the random intercept model, u_0i is a
between-subject error and its variance σ²_u0 is the between-subject variance. This variance
represents the spread of the lines in the graph above. If σ²_u0 is near zero, then the individual
lines would not deviate much from the common trend β0 + β1 × Time. The graph below
illustrates this point.

[Graph: subject i's line offset by u_0i from the common trend β0 + β1 × Time]

There is also a within-subject variation, represented by σ²_e (see equation 4.1-a). It is also
called the within-person residual variance: it is the variability of the errors e_ij. To interpret
these errors, suppose a hypothesized true trajectory for subject i. In the graph below,
e_i1, e_i2, e_i3 are the deviations from subject i's true trajectory on each occasion.


The introduction of random effects induces correlation among the repeated measurements for
the same subject. For the random intercept model, the correlation structure has a
compound symmetry form (see page 31). This type of correlation is usually unrealistic unless
subjects are measured on only two occasions.
The Random Intercept and Random Slope Model given next allows for a more flexible
structure.

Example of fitting a Random Intercept Model


As an illustration we will use the diet study dataset.
Questions:
Are patients' weights significantly different at baseline?
Does gender have an effect on weight?
Running the Analysis:
To run a Linear Mixed Models analysis, from the menus choose:
Analyze
Mixed
Linear...

Select Patient ID as the Subjects variable.


Screen 1

Click Continue.
Select Weight as the dependent variable.
Select time as a covariate and gender as a factor.

Screen 2

Click Fixed.


Screen 3

Select time and gender in the Factors and Covariates box and click Add.
Click Continue
Click Random in the Linear Mixed Models dialog box.


Screen 4

Under Subject Groupings move Patient ID over into Combinations.


Tick the Include intercept box
Select Unstructured from Covariance type dropdown list.
Click Continue.
Click on Statistics in the Linear Mixed Models dialog box.
Select Parameter estimates and Tests for covariance parameters in the Model Statistics
group.
Click Continue and then click OK in the Linear Mixed Models dialog box.


Screen 5

Output and Interpretation


The table below provides a summary of the model you selected. The covariance structure and
the set of variables that define subjects are reported for the random effects. Note that even
though we specified an unstructured covariance structure for the random effects, it is
automatically changed to the identity matrix, which here is a scalar, since there is only one
variance parameter for the random effects.


Table 18

From this table we can see that time and gender are significant.

Table 20

The intercept refers to the initial value of the response. Since gender is significant, the
interpretation of the intercept (β0) depends on gender.
For gender = 0, the baseline average weight is 165.65 + 57.92 = 223.57. For gender = 1, the
average initial weight is 165.65.
The estimated slope (β1) is -2.01. Because the sign is negative, the linear trend goes down,
i.e., on average patients lose weight over time. The time scale in this dataset is weeks, so
we can say that on average patients lose 2.01 units per week.
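The fixed-effects estimates above can be combined into predicted mean trajectories; here is a minimal sketch using the coefficients reported in the output (the function name is ours, not SPSS's):

```python
def predicted_weight(week, gender):
    """Fixed-effects prediction from the fitted random intercept model:
    weight = 165.65 + 57.92*(gender == 0) - 2.01*week."""
    return 165.65 + (57.92 if gender == 0 else 0.0) - 2.01 * week

print(round(predicted_weight(0, 0), 2))  # 223.57  baseline mean for gender = 0
print(round(predicted_weight(0, 1), 2))  # 165.65  baseline mean for gender = 1
print(round(predicted_weight(4, 1), 2))  # 157.61  after 4 weeks at ~2.01 units/week
```

Individual subjects deviate from these means by their random intercepts u_0i.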


Table 21

The table above provides a summary of the parameters used to specify the random-effect and
residual covariance matrices. Unlike the models for the mean response, no repeated effects are
specified here, so the variance of the residuals has only one parameter. Its estimated value is
2.52.
The subject = patid variance is σ²_u0 = 264.02 in equation (4.1-c), indicating that there is
significant variation in baseline weight among patients.
The residual variance σ²_e measures the variability within subjects.
We can compute σ²_u0 / (σ²_u0 + σ²_e) ≈ 99%, which indicates that 99% of the unexplained
variability in weight (i.e., the variability in the response not explained by the linear effect of
time) is at the individual level.
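The 99% figure is the ratio of the between-subject variance to the total unexplained variance; it is also the within-subject correlation implied by the compound symmetry structure of the random intercept model. Computing it directly from the estimates above:

```python
var_u0 = 264.02  # between-subject (random intercept) variance
var_e = 2.52     # within-subject residual variance

icc = var_u0 / (var_u0 + var_e)  # intraclass correlation
print(round(100 * icc, 1))  # 99.1 -> ~99% of unexplained variability is between subjects
```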
It is always recommended to check predicted vs. observed values, either through the graph
below or by producing a table of predicted vs. observed weight (see Table 22). The model seems
to fit the data well.

Table 22


4.2 Random Intercept and Random Slope Model

We can also make the slopes vary between individuals, i.e., we consider both the intercept and
the slope random:

Y_ij = b_0i + b_1i Time_ij + e_ij     (4.2-a)

b_0i = β0 + u_0i,   b_1i = β1 + u_1i     (4.2-b)

where e_ij ~ N(0, σ²_e) and

(u_0i, u_1i)′ ~ N( (0, 0)′, [ σ²_u0   σ_u0,u1 ; σ_u0,u1   σ²_u1 ] )     (4.2-c)

This model is more suitable when the data show a pattern as in the graph below. We can see
that not only the intercepts but also the slopes seem to vary across subjects.


In this model both the intercepts b_0i and the slopes b_1i are considered random.
The randomness of the coefficients is set by assigning a distribution to them (4.2-c).
It can be shown that for the random intercept and random slope model the variance-covariance
structure is a function not only of the parameters in (4.2-c) but also of time.
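The time dependence can be written explicitly: Var(Y_ij) = σ²_u0 + 2t·σ_u0,u1 + t²·σ²_u1 + σ²_e. A small sketch with invented variance components:

```python
def marginal_variance(t, var_u0, var_u1, cov_u0u1, var_e):
    """Var(Y_ij) implied by the random intercept and random slope model:
    var_u0 + 2*t*cov_u0u1 + t**2*var_u1 + var_e (a function of time t)."""
    return var_u0 + 2.0 * t * cov_u0u1 + t ** 2 * var_u1 + var_e

# Hypothetical variance components: the variance changes with t,
# unlike the constant variance of the random intercept model.
for t in range(4):
    print(t, marginal_variance(t, var_u0=25.0, var_u1=4.0, cov_u0u1=-3.0, var_e=9.0))
    # t=0: 34.0, t=1: 32.0, t=2: 38.0, t=3: 52.0
```

With a negative intercept-slope covariance the variance first dips and then grows, illustrating why this structure is more flexible than compound symmetry.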

Example of fitting a Random Intercept and Random Slope Model


We will use the Opposites-naming dataset.
Questions:
Do patients' responses differ significantly at baseline?
Do their individual growth rates (slopes) differ significantly?
Running the Analysis:
To run a Linear Mixed Models analysis, from the menus choose:
Analyze
Mixed
Linear...

Screen 1 is the same as in the previous model, i.e., select id as the subjects variable.


Screen 1

Click Continue.
Select the opposites-naming score as the dependent variable.
Select time as a covariate.

Screen 2

Click Fixed


Screen 3

Select time in the Factors and Covariates box and click Add.
Click Continue
Click Random in the Linear Mixed Models dialog box.
Screen 4


Under Subject Groupings move id over into Combinations.


Tick the Include intercept box
Select time in the Factors and Covariates box and click Add.
Select Unstructured from Covariance type dropdown list.
Click Continue.
Click on Statistics in the Linear Mixed Models dialog box.
Select Parameter estimates and Tests for covariance parameters in the Model Statistics
group.
Click Continue and then click OK in the Linear Mixed Models dialog box.

Output and Interpretation

Table 23

Table 24

Table 25


Table 26

The interpretation of the fixed effects (intercept and time) is the same as in the random intercept
model: children start off, on average, with an OPP score of 164.37 and gain 26.96 points
per testing occasion. Because the sign of the estimated slope is positive, children's OPP
scores increase over time.

Table 27

Inspection of the random effects indicates that the residual variance is 159.48 and is
statistically significant.
The intercept and slope also have significant variances, UN(1,1) and UN(2,2) (σ²_u0 and σ²_u1
in equation 4.2-c), indicating that there is considerable heterogeneity in children's
initial OPP and in their change over time.
Note that the estimate of UN(2,1) (σ_u0,u1 in equation 4.2-c) is negative. This means that
children who have higher initial OPP scores (greater intercepts) improve at a lower rate (i.e.,
less pronounced slopes).


Summary of Features of Random Coefficients


- With random coefficients models it is possible to make inferences about average
effects (as with models for the mean response) as well as individual effects (e.g., each
subject's trend over time).
- This type of model assumes the correlation between repeated measurements arises
because each subject has an underlying (latent) level of response which persists over
time.
- It distinguishes two sources of variation that account for differences in the responses of
subjects measured at the same occasion:
(i) Between-subject variation: different subjects simply respond differently.
(ii) Within-subject variation: in a longitudinal design, it is the variation in the outcome
measure over time within each subject.
- Random coefficients models are very flexible, since they can accommodate any degree
of imbalance in the data. That is, we do not necessarily require the same number of
observations on each subject, or that the measurements be taken at the same times.

Exercises for Section 4


1) For the diet study data fit a Random Intercept and Random Slopes model and compare the
results with those obtained from the Random Intercept model.
2) For the Opposites-naming dataset, add Gender to the Random Intercept and Random
Coefficients model in section 4 and interpret the results.
(i) Is gender significant?
(ii) Do the covariance parameters change? How?
3) Do the same as the previous exercise but include variable ccog instead.


Random Coefficients vs. Mean Response models


When repeated measures data are obtained at fixed points in time, there will be a
choice between the use of Mean Response Models and Random Coefficient Models.
This choice may be influenced by how well the dependency of the observations on time
can be modelled and whether interest is centred on inferences about individuals or the
population averages. If the times of measurements are not standardised, or if there are
substantial discrepancies between the scheduled times and actual time of observation,
the random coefficients models are more likely to be the model of choice.
In real life choosing a model is a complex process even within one type of approach.
However, some general guidelines are given below:
Choice among models
The choice should be guided by the specific scientific question of interest:
- Marginal (mean response) model: population-averaged effects. The main interest
focuses on differences between sub-groups in the study population.
- Random coefficients: make inferences about individuals rather than population
averages. Interest focuses on estimating intercepts or slopes for each subject and/or on
controlling for unmeasured subject characteristics (variance parameters for the
intercept and slope). For missing data they are a better choice.


References

Diggle, P.J., Heagerty, P., Liang, K.-Y. and Zeger, S.L. (2002). Analysis of
Longitudinal Data (second edition). Oxford: Oxford University Press.

Hedeker, D. and Gibbons, R.D. (2006). Longitudinal Data Analysis. New York:
John Wiley & Sons.

http://biosun1.harvard.edu/~fitzmaur/ala/

