
Multiple Comparison Procedures https://www.uvm.edu/~dhowell/gradstat/psych341/lectures/AnovaRevi...

1/18/2001

This example is based on Klesges, R. C., et al. (1998) The prospective relationship between smoking and weight in a young, biracial cohort: The coronary artery risk development in young adults study. Journal of Consulting and Clinical Psychology, 66, 987-993.

The study looked at weight changes over a seven year period in subjects who did, and did not,
stop smoking. The authors broke down subjects by smoking condition, race, and sex, but the
way they presented their data I was not able to include sex as a variable in my example.

One reason for choosing this example is that it involved very large samples. We usually use
small samples for examples, and I thought it would be useful to look at a case with thousands of
subjects.

At baseline, data were collected on weight, smoking behavior (Never, Former, and Smoker), and other variables for over 5000 subjects. Seven years later data were obtained from 3868 subjects on their smoking status (Never, Former, Quitter, Intermittent, Initiator, and Continuous), their weight, and their weight gain. Data were also collected on alcohol use and caloric intake.

For this example I am going to run one-way analyses for smoking behavior on the pretest and the posttest data separately. I will ignore race and sex; I may come back to those variables later.

The data are found in Klesges.sav for 3868 subjects. The variables are Race, Basesmoke, Endsmoke, Alcohol, Basewt, Endweight, WtChange, and Fatpercn, though they appear in a different order in the file. I generated these myself based on their data. Weight is given in kilograms.

First we'll look at differences in weight of the three smoking conditions at the beginning of the
study. What would students predict?


Notice that the overall Anova is significant, but when we run the multiple comparisons the only significant difference is the 2.58 kilogram difference between Nonsmokers and Smokers. Interestingly, the ex-smokers fall in the middle and don't differ from either group. So it is not true that quitting smoking led to weight gain--at least over the long term.

Some basic formulae:


There is a whole set of procedures for making comparisons between groups subsequent to the overall analysis of variance. I cover many more in the text than I can cover here, but the basic ideas are very simple.

Error Rates

In the text I distinguish between two kinds of error rates. One is the probability that any
particular comparison will be a Type I error, and the other is the probability that any set of
comparisons will contain at least one Type I error.

Error rate per comparison (PC)

This is the probability that any given test will be significant if the null hypothesis
is true. If we just ran a simple t test between two means at alpha (a) = .05, then
the probability that a Type I error would occur is .05.

Familywise error rate (FW)

This is the probability that a whole set of comparisons will contain at least one
Type I error. It should be apparent that the more tests you run, the greater the
likelihood that you will make a Type I error someplace. [The more you have sex,
the more likely you are to get pregnant.]

Suppose that we have a situation where we somehow know that μ1 = μ2, μ3 = μ4, and μ5 = μ6. Suppose that we ran 3 independent t tests, each at a = .05. Then the probability that the first comparison will be significant is .05, the probability that the second will be significant is .05, and the probability that the third will be significant is .05. The probability that at least one of these will be significant is approximately 3(.05) = .15. [In fact, for independent tests it is really 1 - (1 - a)^c, where c = the number of comparisons. This is 1 - .95^3 = .1426.]

If you have sex once/night for a week, the question is the likelihood that you will
be pregnant at the end of the week, and we aren't concerned about which night.

The major point behind almost all multiple comparison procedures is to reduce FW to something reasonable, such as .05, and we do that by reducing the significance level for any particular comparison to a small value. [In this case, if we ran each test at a = .01667, the familywise error rate would come out to about 1 - (1 - .01667)^3 ≈ .05.]
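These two calculations are easy to check directly. Here is a minimal sketch in plain Python (no statistics library needed):

```python
# Familywise error rate for c independent comparisons, each run at
# the same per-comparison alpha: FW = 1 - (1 - alpha)**c.
def familywise_rate(alpha, c):
    return 1 - (1 - alpha) ** c

# Three tests at .05: the exact FW vs. the 3(.05) approximation.
fw_exact = familywise_rate(0.05, 3)   # 1 - .95**3 = .1426
fw_approx = 3 * 0.05                  # .15

# Running each test at .05/3 = .01667 brings FW back to about .05.
fw_adjusted = familywise_rate(0.05 / 3, 3)
```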

A Priori and Post-hoc Procedures

I hate to discuss this issue because it seems to get people all twisted around. The basic idea is rather simple. If you plan out your comparisons before you run your experiment, you get to use more liberal procedures. If you plan your comparisons after you have looked at the data, even if you run just a few tests, it is as if you were running all possible comparisons among the means. This will require much more conservative procedures.

In actual practice, the vast majority of the situations that I see involve post-hoc tests. People don't really plan out everything ahead of time. They wait until they get to their data, and then they decide what they want to test.

I'm not sure that this distinction, while completely defensible on theoretical
grounds, is the best one to make here. I prefer to think of it a bit differently.

If you want to make just a few comparisons that were decided on before you looked at the data (or that at least flow logically from the theory), then you probably want what I call a priori procedures.

If you want to make lots of pairwise comparisons, regardless of when you thought of them, then you probably want post hoc procedures.

In the case of truly a priori tests, I recommend that you just run simple t tests
between the means that you want to compare. The one difference is that I would
use MSerror from the overall Anova in place of the individual group variances,
unless you have good reason to believe that the variances are heterogeneous.

If you have equal sample sizes, just use

t = (X̄1 - X̄2) / √(2·MSerror / n)

and if you have unequal sample sizes, use

t = (X̄1 - X̄2) / √(MSerror·(1/n1 + 1/n2))

In each case, the df are the same as the df for error from the overall anova.
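A minimal Python sketch of that protected t. The two means are the baseline ex-smoker and smoker means from this example, but the MSerror value and the equal-n case are hypothetical, chosen only for illustration, since the actual error term is not shown here:

```python
import math

# Protected t between two group means, using MS_error from the
# overall Anova in place of the individual group variances.
def protected_t(mean1, mean2, ms_error, n1, n2=None):
    if n2 is None or n2 == n1:        # equal sample sizes
        se = math.sqrt(2 * ms_error / n1)
    else:                             # unequal sample sizes
        se = math.sqrt(ms_error * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / se

# Hypothetical MS_error and equal-n case, for illustration only:
t_eq = protected_t(71.65, 69.52, ms_error=450.0, n1=300, n2=300)
t_uneq = protected_t(71.65, 69.52, ms_error=450.0, n1=333, n2=1450)
```

Note how the unequal-n version gives a larger t here: the huge smoker group shrinks the standard error of the difference.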

Example

From what I know about the folklore of smoking, I might be led to believe that people who have quit smoking can be expected to gain weight. Therefore I predict that they will weigh more than people who have not quit. I would like to test that hypothesis using my data.

You might have a different hypothesis you want to test, but this is my example, so I have the ball. I say that because what is important is what the experimenter predicted before seeing the data, and that is what I would have predicted.

The means are given above as 71.65 and 69.52 for the 333 ex-
smokers and the 1450 Smokers, respectively. The within-group
variances are all very similar, so I'll use the error term from the
Anova.

The critical value of t on 3865 df is approximately 1.96, so we cannot reject the null hypothesis. My belief in the effect of quitting smoking seems to be wrong. (The actual two-tailed p value is .097.)
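That probability is easy to verify. The obtained t for this comparison is 1.66 (it reappears in the Bonferroni section below), and with 3865 df the t distribution is essentially normal, so a sketch using the standard normal CDF built from math.erf comes very close:

```python
import math

# Standard normal CDF from the error function.
def norm_cdf(x):
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

# Two-tailed p for the ex-smoker vs. smoker comparison; with 3865 df
# the normal approximation to t is excellent.
t_obtained = 1.66
p_two_tailed = 2 * (1 - norm_cdf(t_obtained))   # about .097
```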

Contrasts

Contrasts are another way of doing exactly what we have just done. These are covered in the text.

Show how to apply the above contrast using SPSS.

Bonferroni Procedure

We have already discussed the Bonferroni procedure, so there isn't a lot to say about it here. The basic idea is that you divide alpha by the number of comparisons you are going to run, and then run each individual comparison at that level.

Suppose that I had planned to run two comparisons. The first was Smokers against Ex-smokers, and the second was Smokers against Non-smokers. The first test I have already run, and it gave me a t = 1.66, with a probability of .097. The second, which I don't show here, would give a t = 3.58, with a probability < .001.

To be significant, each test would have to have a probability value less than .05/2 = .025. The first one obviously does not, but the second one does. So we will conclude that there is a significant difference in the weight of Smokers and Non-smokers. Here our familywise error rate is held at .05.

Notice that, as presented here, the Bonferroni is an a priori test. I would only use this test if I wanted to make a small set of comparisons from a larger set of possible comparisons. If I did not have a priori tests, I would be much better off using one of the other procedures, because the Bonferroni would come out to be too conservative.

The Dunn-Sidak test is a very similar test that is slightly more powerful. The Bonferroni is based on the idea that with three independent tests, the probability of at least one being significant is approximately 3(.05) = .15; the Dunn-Sidak is based on the fact that this probability is actually 1 - .95^3 = .1426 if the comparisons are independent. (Not much of a difference!)
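A quick sketch of the two adjustments side by side, in plain Python:

```python
# Per-comparison alpha needed to hold the familywise rate at fw
# across c comparisons.
def bonferroni_alpha(fw, c):
    return fw / c                      # divide alpha evenly

def dunn_sidak_alpha(fw, c):
    return 1 - (1 - fw) ** (1 / c)     # invert 1 - (1 - a)**c exactly

a_bonf = bonferroni_alpha(0.05, 3)     # .016667
a_sidak = dunn_sidak_alpha(0.05, 3)    # .016952 -- slightly larger,
                                       # hence slightly more powerful
```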

Multistage tests

I cover multistage tests in the text, and I recommend them. I won't go over them here because they don't really fit with this example. The basic idea behind them is that if you have a lot of tests, it is not very likely that the null will really be true for all of them. The Bonferroni itself penalizes you as if that were the case.

Fisher's Least Significant Difference Test (LSD)

We have talked about this test before. Fisher argued that if the overall Anova is significant, you can go ahead and run multiple t tests between any and all groups.

Notice the requirement of a significant F.

This is the most liberal of the multiple comparison tests, and it only keeps the familywise error rate at alpha if the complete (omnibus) null is true--i.e., if all populations have equal means.

This is the only test that requires a significant overall F before proceeding!!!

I have been pushing this test for years for the situation in which you have only 3
groups, but people don't like it. Finally I came across a paper by Levin, Serlin,
and Seaman (1994, Psychological Bulletin, 115, 153-159) that says the same
thing.

Studentized Range Statistic

Many procedures use what is called the Studentized Range Statistic. It was
originally designed as a statistic to compare the two extreme means in a set of
means. If there are a lot of means, the extremes are likely to be more different
than if there are just a few means. But that means that it is more likely to come up
with a "significant" difference when testing those means. So the test was designed to adjust the critical value, making it larger when there are more means to choose from.

For some reason that I have never seen explained, they came up with a slightly different test statistic than the normal t statistic. There is no reason why they had to do so; the t would do as well. But the statistic is

q = (X̄L - X̄S) / √(MSerror / n)

where X̄L and X̄S are the largest and smallest means in the set being compared. Note the relationship of this to t with equal n's. Notice that q is just the same as t except that the "2" is missing from in front of MSerror. This isn't a problem, because the critical values are altered in the same way.
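That relationship is easy to verify numerically. In this sketch (the inputs are hypothetical illustrative numbers), q works out to exactly t times the square root of 2:

```python
import math

# For two means with equal n's: t divides the mean difference by
# sqrt(2 * MS_error / n), while q divides by sqrt(MS_error / n) --
# the "2" is simply dropped from under the radical.
def t_stat(diff, ms_error, n):
    return diff / math.sqrt(2 * ms_error / n)

def q_stat(diff, ms_error, n):
    return diff / math.sqrt(ms_error / n)

t_val = t_stat(2.13, 450.0, 300)   # illustrative values
q_val = q_stat(2.13, 450.0, 300)
# q_val equals t_val * sqrt(2), whatever the inputs
```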

This testing approach is used in many of the tests which follow, which is why I
discussed it in the first place.

Newman-Keuls Test (Student-Newman-Keuls SNK)

I happen to like this test, but lots of people complain about it. I have laid out the
reasons in the text, but I'll simplify them here.

When there are three means, the Newman-Keuls holds the familywise error
rate at .05, just as we would like.
When there are four or five means, the error rate is held at (approx) .10.
When there are six or seven means, the max FW is .15, etc.
It is rare to have more than five groups in an experiment, and when we do
it is also very likely that at least some null hypothesis is not true. It is hard
to imagine an experiment where we really believe that all five means are
equal.
Thus I think the arguments against the Newman-Keuls are not really fair.

I go over how to apply this test by hand in the text, but people don't often do that, and I will probably cut that back drastically in the next edition.

In SPSS this test has a somewhat different printout than we have seen. I'm not
sure why they do that. Basically they show you those groups that are
homogeneous. The first example is the same set of data as the examples above.


None of these groups are different from any others. I don't know what the sig =
.053 means, although it may be the significance level of the most extreme
comparisons.

This is a good example, because the overall F was significant, but the test does
not find any differences.

If we jump ahead to looking at weight change over 7 years as a function of smoking groups we get

Here you can see that 5 of the groups are homogeneous, but the sixth group
(Quitter) is different from the other two.


What I can't display (because of the data I have) is the very common situation in
which two homogeneous sets of groups have some overlap.

Unequal Sample Sizes and Heterogeneous Variances

The formula that I gave above assumes equal sample sizes. When
you have unequal sample sizes, you can take the harmonic mean of
the n's and use that for all cases. You can see from the printout above
that this is what SPSS has done.
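For example (a sketch; the 2085 figure for never-smokers is inferred by subtracting the other two group sizes from the 3868 total):

```python
# Harmonic mean of the group sizes, used as a common n when the
# samples are unequal.
def harmonic_mean(ns):
    return len(ns) / sum(1.0 / n for n in ns)

# Baseline groups: never-smokers, ex-smokers, smokers.
n_h = harmonic_mean([2085, 333, 1450])   # about 719
```

Notice how far the harmonic mean (about 719) falls below the arithmetic mean of 1289.33: it is pulled toward the smallest group, which is what makes it the appropriate compromise for the standard error.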

When the variances are unequal, you can use the Games and Howell (1976) procedure. (Unfortunately, a different Howell.) SPSS will implement this procedure.

Tukey's test

Tukey's test is a very close relative of the Newman-Keuls test. The difference is
that all comparisons are done as if the groups were maximally far apart. In other
words, with 6 groups, two means that are adjacent in an ordered series are still
tested as if they were the largest and smallest of 6 means.

This test holds the familywise error rate at alpha regardless of which null hypothesis(es) are true.

The following very curious printout comes from an analysis of the three original
groups at baseline.

This shows that Nonsmokers and Smokers are different.

But now look at the next part of the printout.

Homogeneous Subsets


Notice that there are no significant differences using this test on these data.

I don't know why we get the difference between the two tables.

This shows that the Tukey is a somewhat more conservative test than the Newman-Keuls. I think this test is a bit too conservative, but lots of people like it.

Ryan Procedure (REGWQ)

The abbreviation stands for the Ryan, Einot, Gabriel, Welsch q test.

Sort of like the Bonferroni logic, except that each test is run at a(r/k), where k = the number of means in the experiment, and r is the number of means from which these two are the largest and smallest. Einot, Gabriel, and Welsch fiddled with this just a little bit, but the basic idea is still right.

This test keeps FW at alpha regardless of the true null, but is less conservative
than Tukey's test.
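As a sketch of that logic for six means. Ryan's original proposal runs a range of r means at a(r/k); the Einot-Gabriel-Welsch refinement, as I understand the standard REGWQ definition, uses 1 - (1 - a)^(r/k), with the two largest ranges run at alpha itself:

```python
# Per-range significance levels for k = 6 means.
def ryan_alpha(alpha, r, k):
    # Ryan's original proposal: alpha scaled by r/k.
    return alpha * r / k

def regwq_alpha(alpha, r, k):
    # REGWQ refinement; the two largest ranges get alpha itself.
    if r >= k - 1:
        return alpha
    return 1 - (1 - alpha) ** (r / k)

ryan_levels = [round(ryan_alpha(0.05, r, 6), 4) for r in range(2, 7)]
regwq_levels = [round(regwq_alpha(0.05, r, 6), 4) for r in range(2, 7)]
```

Both schemes shrink alpha for the smaller ranges, which is exactly where the Newman-Keuls lets the familywise rate climb.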

SPSS will run this test. For our example the printout is shown in the following tables for the baseline and the endpoint data.

If I apply this test to our three-group example I get

This is a good example of overlapping homogeneous groups.


Scheffé's Test

This test is the most conservative of the lot, and I do not recommend it. Only the
purists like it.

Bonferroni

I think that the Bonferroni is not a good test as a post-hoc test. I would only use it
as an a priori test. Explain why.

Describe the groups.
What would students predict, and why?
What multiple comparison technique would they use?

Descriptives


The following is the LSD output:

The next is the Bonferroni output:


Finally, for the REGWQ we get

Again, I got conflicting results with the Tukey. Students can do that on their own.


The assignment is to take the means, etc. from what we have here, sit down with a pencil and paper and my book, and see
what is going on with the conflicting Tukey results. It may have to do with different ways of treating unequal sample sizes.

Hint: You can find an exact probability of a t, for example, by COMPUTING a new variable named tprob. From the menu choose cdf(q,df). Put the actual t value in where the "q" is (I don't know why they don't call it "t", but they don't). Put the df for error in place of df. The result will be the one-tailed probability value for a t > the obtained t. (I know that it is annoying that it will calculate that value for every case, but I don't know a way around it.) If you wanted to know the value of t that cut off the lowest 2.5%, you could use the same compute statement except substitute idf(p,df), where p is the lower tail probability (e.g., .025) and df is the dferror. The result will be the critical value of t, and if you drop the sign it will be the two-tailed value. I think that this will help you solve the problem, but unfortunately you can't get a probability for q in the same way.
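For anyone working outside SPSS, here is a rough Python equivalent of that hint, assuming scipy is installed; scipy.stats.t plays the role of SPSS's cdf and idf functions:

```python
from scipy import stats

t_obtained, df_error = 1.66, 3865

# Analog of cdf(q, df): one-tailed probability of exceeding the
# obtained t.
p_one_tailed = 1 - stats.t.cdf(t_obtained, df_error)   # about .0485

# Analog of idf(p, df): critical value cutting off the lowest 2.5%.
# Dropping the sign gives the two-tailed critical value.
t_crit = stats.t.ppf(0.025, df_error)                  # about -1.96
```

Unlike the SPSS menus described above, recent versions of scipy also expose the Studentized range distribution (scipy.stats.studentized_range), so probabilities for q are obtainable there.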

Last revised: 01/17/01
