

5, page 1

Chapter 5: Comparisons among several samples

Case Study 5.1
A randomized experiment to compare lifetimes in six different diets among female mice.

Principles of good experimental design:

• Randomization: female mice were randomly assigned to the six treatments. Randomization
ensures no bias in the assignment of mice to treatments. It does not guarantee that the
groups will be identical, but it allows us to use probability to assess whether the differences
observed could have occurred by chance.
• Replication – important for estimating variability within groups
• Other?

Scope of inference
• Inference extends to what would have happened if all of the mice in the experiment had been fed each diet.
• The scope of inference can be expanded further if these mice can be viewed as
representative of a larger population of female mice.
• Since it’s an experiment, we can infer cause-and-effect if the experiment was well-run.

All six diets were of interest, but some specific comparisons were of particular interest, as
outlined in Display 5.3.

Although comparisons between pairs of treatments could be handled with two-sample
t-procedures (if normality assumptions are satisfied), there are some advantages to a more
comprehensive procedure:

• If the variability within each treatment is about the same for all treatments, then it makes
sense to estimate a pooled standard deviation from all the treatments even if we’re only
comparing any two at a time.
• We may want to carry out more complicated comparisons, such as a comparison of a
control group to the average of the other five groups.
• A standard first question of interest when comparing several groups is whether there is
evidence that any of the means are different from each other. Comparing all the treatments
pairwise with two-sample t tests results in a lot of individual tests (15 for 6 treatments). An
overall test of equality of all the treatment means is much more efficient and will not suffer
from the problem of running multiple tests (where statistically significant results have to be
considered in the context of how many tests were run).

An Ideal Model which allows the problems above to be solved fairly easily
• Population distributions are normal
• Population standard deviations are equal
• Independent random samples from each population (a randomized experiment satisfies this
assumption)

This model is exactly the model for the pooled two-sample t-test when there are two groups:
different means, but common standard deviation

The assumption of equal standard deviations is very important and must be checked. If there are
large differences in variability, this may be of interest in and of itself and the reasons for this
should be addressed. Often, differing variability is caused by higher values of the variable in
some groups than in others. For example, the variability in lifetimes of animals is likely to be
greater the longer they tend to live. Transformations (such as log) can sometimes solve this
problem.

Comparing pairs of means

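The two-sample pooled t procedure for comparing any pair of means, say $\mu_1$ and $\mu_2$, uses

$$\bar{Y}_1 - \bar{Y}_2 \quad\text{and}\quad \mathrm{SE}(\bar{Y}_1 - \bar{Y}_2) = s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}, \quad\text{where}\quad s_p = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}.$$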

The only change in adapting this to several groups is to use the pooled standard deviation from
all of the groups if the assumption of equal standard deviations seems reasonable.


Months survived

Diet          N    Mean   Std. Deviation   Minimum   Maximum
NP           49   27.40            6.134       6.4      35.5
N/N85        57   32.69            5.125      17.9      42.3
N/R50        71   42.30            7.768      18.6      51.9
R/R50        56   42.89            6.683      24.2      50.7
N/R lopro    56   39.69            6.992      23.4      49.7
N/R40        60   45.12            6.703      19.6      54.6

The equal variance assumption seems reasonable for this experiment so we will use the pooled standard
deviation from all 6 treatments.

$$s_p = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2 + \dots + (n_I - 1)s_I^2}{(n_1 - 1) + (n_2 - 1) + \dots + (n_I - 1)}} = \sqrt{\frac{48(6.134^2) + 56(5.125^2) + \dots + 59(6.703^2)}{48 + 56 + \dots + 59}} = \sqrt{44.599} = 6.678$$

The degrees of freedom for the t distribution when you use this pooled standard deviation is the
denominator in the above expression which is n − I , where n is the total sample size (349 in our
example) and I is the number of groups or treatments (6 in our example). So we use a t with 343
degrees of freedom for the mice experiment.
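
The pooled standard deviation can be checked directly from the summary table. The following is a minimal sketch, not part of the original notes; the dictionary `groups` and the helper `pooled_sd` are illustrative names, and the inputs are the rounded values printed in the table, so the result may differ from SPSS output in the last digit.

```python
import math

# (sample size, sample standard deviation) for each diet, copied from the
# "Months survived" summary table above.
groups = {
    "NP": (49, 6.134),
    "N/N85": (57, 5.125),
    "N/R50": (71, 7.768),
    "R/R50": (56, 6.683),
    "N/R lopro": (56, 6.992),
    "N/R40": (60, 6.703),
}


def pooled_sd(summary):
    """Return (pooled SD, degrees of freedom) from {name: (n, sd)}."""
    # Numerator: sum of (n_i - 1) * s_i^2 over all groups.
    weighted_ss = sum((n - 1) * sd ** 2 for n, sd in summary.values())
    # Denominator: sum of (n_i - 1), which equals n - I.
    df = sum(n - 1 for n, _ in summary.values())
    return math.sqrt(weighted_ss / df), df


sp, df = pooled_sd(groups)
print(f"pooled SD = {sp:.3f} on {df} d.f.")
```

The degrees of freedom returned by the helper are the denominator of the pooled variance, which is why they equal n − I.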

• One desired comparison is between groups 1 and 2: the unrestricted non-purified diet
(NP) to a standard 85 calorie diet (N/N85). The result is summarized in part e) on p. 116.

First, note that

$$\mathrm{SE}(\bar{Y}_1 - \bar{Y}_2) = s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}} = 6.678\sqrt{\frac{1}{49} + \frac{1}{57}} = 1.301.$$

A 95% confidence interval for $\mu_1 - \mu_2$, using the sample means 27.40 (NP) and 32.69 (N/N85):

$$\bar{Y}_1 - \bar{Y}_2 \pm t_{343}(.975)\,\mathrm{SE}(\bar{Y}_1 - \bar{Y}_2) = (27.40 - 32.69) \pm 1.967(1.301) = -5.29 \pm 2.56,$$

or approximately -7.8 months to -2.7 months.

Conclusion: It is estimated that the 85 calorie standard diet increases mean life
expectancy by 5.3 months over an unrestricted diet, with a 95% confidence interval of 2.7
to 7.8 months.

• A test of the null hypothesis that $\mu_1 = \mu_2$ against a one-sided alternative that $\mu_1 < \mu_2$
(we would have to decide before collecting the data that we were only interested in
detecting an increase in mean life expectancy with the 85 calorie diet):

$$\text{Test statistic} = \frac{\bar{Y}_1 - \bar{Y}_2}{\mathrm{SE}(\bar{Y}_1 - \bar{Y}_2)} = \frac{-5.29}{1.301} = -4.07$$

Compare to t distribution with 343 d.f.

P-value = area to left of -4.07 < .0001

Conclusion: The data provide very strong evidence that the 85-calorie diet increases life
expectancy over the unrestricted diet.
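
The interval and test statistic for this comparison can be reproduced with a few lines of arithmetic. This is a sketch under stated assumptions: the function name `pooled_comparison` is hypothetical, the group means and sample sizes are the NP and N/N85 rows of the summary table (27.40 and 32.69), and the critical value 1.967 and pooled SD 6.678 are taken as given from the notes rather than computed.

```python
import math


def pooled_comparison(mean1, mean2, n1, n2, sp, t_crit):
    """Difference in means, its SE, the t-statistic, and a CI, using a pooled SD."""
    diff = mean1 - mean2
    se = sp * math.sqrt(1 / n1 + 1 / n2)
    half_width = t_crit * se
    return diff, se, diff / se, (diff - half_width, diff + half_width)


# Group 1 = NP, group 2 = N/N85 (means and n from the summary table).
# sp = 6.678 (pooled over all six diets) and t_343(.975) = 1.967 are quoted
# from the notes, not derived here.
diff, se, t_stat, ci = pooled_comparison(27.40, 32.69, 49, 57, sp=6.678, t_crit=1.967)

print(f"difference = {diff:.2f}, SE = {se:.3f}, t = {t_stat:.2f}")
print(f"95% CI: {ci[0]:.2f} to {ci[1]:.2f} months")
```

Passing the pooled SD in as an argument keeps the same helper usable whether the SD is pooled over two groups or over all six.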

Note: if the equal standard deviations assumption did not appear reasonable, then we could
have done the confidence interval and hypothesis test the usual way using the pooled
standard deviation from the two groups or the unpooled Welch’s t procedures. The
advantage of pooling all 6 groups is a better estimate with increased degrees of freedom.

One-way Analysis of Variance F-Test

Designed to answer the question: is there evidence of a difference between any of the means?

That is, we wish to test the null hypothesis $H_0: \mu_1 = \mu_2 = \mu_3 = \mu_4 = \mu_5 = \mu_6$. The alternative
hypothesis is that at least one mean is different from the others. The alternative hypothesis
would include all these possibilities:
• All the means are different from one another
• Five means are the same and one is different
• Three of the means are the same, and the other three are the same as each other but different
from the first three

The idea of a one-sided alternative hypothesis is meaningless with three or more groups.

Testing the hypothesis of equal means relies on a general approach which we will use frequently
in the rest of the course:

Extra Sum of Squares Principle

General principle for testing hypotheses.

Full model: a general model which adequately describes the data.
Reduced model: a special case of the full model obtained by imposing the restriction of the
null hypothesis.

For testing the equality of several population means, these models are:
Full model: the population distributions are normal with the same standard deviations, but
possibly different means
Reduced model: the population distributions are normal with the same standard deviations,
and the same means

The general idea is that we “fit” both these models to the data (like regression). Each model
gives a predicted value for every case. The full model uses each observation’s group mean as the
predicted value. The reduced model uses the mean of all the observations together. We then
measure how well the data fit the models by computing the sum of squared residuals. The full
model can fit no worse than the reduced model because the reduced model is a special case of the
full model.

So, the predicted responses are

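Group     1             2             3             4             5             6
Full      $\bar{Y}_1$   $\bar{Y}_2$   $\bar{Y}_3$   $\bar{Y}_4$   $\bar{Y}_5$   $\bar{Y}_6$
Reduced   $\bar{Y}$     $\bar{Y}$     $\bar{Y}$     $\bar{Y}$     $\bar{Y}$     $\bar{Y}$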

To illustrate these calculations, we’ll use a small hypothetical example, with 3 groups and 10
observations in all.

Group 1: 10.7 13.2 15.7         $n_1 = 3$    $\bar{Y}_1 = 13.2$   $s_1 = 2.500$
Group 2: 12.1 14.2 16.0 16.5    $n_2 = 4$    $\bar{Y}_2 = 14.7$   $s_2 = 1.995$
Group 3: 20.9 24.4 27.3         $n_3 = 3$    $\bar{Y}_3 = 24.2$   $s_3 = 3.205$
Total:                          $n = 10$     $\bar{Y} = 17.1$     $s_p = 2.535$

$\bar{Y}$ is called the “grand mean” and is the mean of all 10 observations.

                       Predicted   Residual    Squared residual   Predicted   Residual   Squared residual
Group  Obs  Response   (reduced)   (reduced)   (reduced)          (full)      (full)     (full)
1      1    10.7       17.1        -6.4        40.96              13.2        -2.5       6.25
1      2    13.2       17.1        -3.9        15.21              13.2        0.0        0.00
1      3    15.7       17.1        -1.4        1.96               13.2        2.5        6.25
2      1    12.1       17.1        -5.0        25.00              14.7        -2.6       6.76
2      2    14.2       17.1        -2.9        8.41               14.7        -0.5       0.25
2      3    16.0       17.1        -1.1        1.21               14.7        1.3        1.69
2      4    16.5       17.1        -0.6        0.36               14.7        1.8        3.24
3      1    20.9       17.1        3.8         14.44              24.2        -3.3       10.89
3      2    24.4       17.1        7.3         53.29              24.2        0.2        0.04
3      3    27.3       17.1        10.2        104.04             24.2        3.1        9.61
Total                                          264.88                                    44.98

Extra sum of squares = Residual sum of squares (reduced) – Residual sum of squares (full)

= 264.88 - 44.98 = 219.9

The residual sum of squares for a model represents the variability in the original data which is
not explained by the model. The extra sum of squares therefore represents the amount of the
unexplained variability in the reduced model that is explained by the full model.
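
The residual sums of squares in the table can be recomputed from the ten observations. The sketch below is illustrative only (the variable names are not from the notes): it fits the reduced model with the grand mean, the full model with each group mean, and takes the difference.

```python
# The three groups of the small hypothetical example.
groups = [
    [10.7, 13.2, 15.7],
    [12.1, 14.2, 16.0, 16.5],
    [20.9, 24.4, 27.3],
]


def mean(values):
    return sum(values) / len(values)


all_obs = [y for g in groups for y in g]
grand_mean = mean(all_obs)

# Reduced (equal means) model: every observation is predicted by the grand mean.
ss_reduced = sum((y - grand_mean) ** 2 for y in all_obs)

# Full (separate means) model: every observation is predicted by its group mean.
ss_full = sum((y - mean(g)) ** 2 for g in groups for y in g)

extra_ss = ss_reduced - ss_full
print(f"reduced: {ss_reduced:.2f}, full: {ss_full:.2f}, extra: {extra_ss:.2f}")
```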

The question now is whether the improved fit represents something real or could just be
attributed to sampling variability. We use the F-statistic to test the null hypothesis that the
populations follow the reduced model against the alternative that they follow the full model and
not the reduced model.

$$\text{F-statistic} = \frac{(\text{Extra sum of squares})/(\text{Extra degrees of freedom})}{\hat{\sigma}^2_{\text{full}}}$$

Extra degrees of freedom = # params for full model – # params for reduced model

$\hat{\sigma}^2_{\text{full}}$ = estimate of $\sigma^2$ based on full model = $s_p^2$ (square of pooled standard deviation)

The numerator of the F-statistic is the average reduction in residual sum of squares for each
parameter added and the denominator is the reduction we would expect per extra parameter just
by chance.

For the above small example,

$$F_{2,7} = \frac{219.9/2}{6.426} = 17.11$$


This statistic is compared to an F distribution. F distributions have two parameters: numerator
degrees of freedom and denominator degrees of freedom.

Numerator d.f. = extra degrees of freedom

Denominator d.f. = d.f. for $s_p$ = n – I


                 Sum of Squares    df   Mean Square        F   Sig.
Between Groups          219.900     2       109.950   17.111   .002
Within Groups            44.980     7         6.426
Total                   264.880     9
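
The F-statistic and the "Sig." column for the small example can be reproduced without statistical software. This is a sketch with hypothetical helper names. It relies on one special fact: when the numerator has exactly 2 degrees of freedom, the upper-tail area of an F distribution with d denominator degrees of freedom has the closed form (1 + 2F/d)^(-d/2). That shortcut does not apply for other numerator degrees of freedom, where a statistics library would be needed.

```python
def f_statistic(extra_ss, extra_df, ss_full, df_full):
    """(Extra SS / extra d.f.) divided by the full-model variance estimate."""
    sigma2_full = ss_full / df_full
    return (extra_ss / extra_df) / sigma2_full


def f_upper_tail_num_df_2(f_value, denom_df):
    """P(F > f_value) for an F distribution with 2 numerator d.f. ONLY."""
    return (1 + 2 * f_value / denom_df) ** (-denom_df / 2)


# Values from the small example: extra SS = 219.9 on 2 d.f.,
# full-model residual SS = 44.98 on 7 d.f.
f_value = f_statistic(219.9, 2, 44.98, 7)
p_value = f_upper_tail_num_df_2(f_value, 7)

print(f"F = {f_value:.3f}, P-value = {p_value:.4f}")
```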

Notice that there are 3 different sums of squares.

$$\text{Total sum of squares} = SST = \sum_{i=1}^{I}\sum_{j=1}^{n_i}(Y_{ij} - \bar{Y})^2$$

$$\text{Sum of squares between groups} = SSB = \sum_{i=1}^{I} n_i(\bar{Y}_i - \bar{Y})^2$$

$$\text{Sum of squares within groups} = SSW = \sum_{i=1}^{I}\sum_{j=1}^{n_i}(Y_{ij} - \bar{Y}_i)^2$$

Note that, SST = SSB + SSW and Extra sum of squares = SST – SSW, hence SSB = ESS.

$$\text{Mean square between groups} = MSB = \frac{SSB}{I - 1}$$

$$\text{Mean square within groups} = MSW = \frac{SSW}{n - I} = s_p^2$$

Source Sum of squares d.f. Mean square F-statistic P-value

Between groups SSB I-1 MSB MSB/MSW
Within groups SSW n-I MSW
Total SST n-1
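
As a check on the identity SST = SSB + SSW, the three sums of squares and the two mean squares can be computed from their definitions for the small three-group example. The sketch below uses illustrative variable names that do not appear in the notes.

```python
groups = [
    [10.7, 13.2, 15.7],
    [12.1, 14.2, 16.0, 16.5],
    [20.9, 24.4, 27.3],
]

n = sum(len(g) for g in groups)          # total sample size
num_groups = len(groups)                 # I
grand_mean = sum(sum(g) for g in groups) / n
group_means = [sum(g) / len(g) for g in groups]

# Total, between-group, and within-group sums of squares.
sst = sum((y - grand_mean) ** 2 for g in groups for y in g)
ssb = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
ssw = sum((y - m) ** 2 for g, m in zip(groups, group_means) for y in g)

msb = ssb / (num_groups - 1)
msw = ssw / (n - num_groups)

print(f"SST = {sst:.2f}, SSB = {ssb:.2f}, SSW = {ssw:.2f}")
print(f"MSB = {msb:.3f}, MSW = {msw:.3f}, F = {msb / msw:.3f}")
```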

Logic behind the F-test

It’s easiest to see if the sample sizes are equal: $n_1 = n_2 = \dots = n_I$. Call the common sample size
n*. Remember that we always assume that the population distributions are normal, the standard
deviations are all equal, and the samples are independent.

MSW = $s_p^2$ is an estimate of $\sigma^2$ no matter which model (equal means or separate means) is
correct.

If the population means are equal (i.e., if the null hypothesis is true) then
$\bar{Y}_1$ is $N(\mu, \sigma/\sqrt{n^*})$
$\bar{Y}_2$ is $N(\mu, \sigma/\sqrt{n^*})$
⋮
$\bar{Y}_I$ is $N(\mu, \sigma/\sqrt{n^*})$

Since the samples are independent, $\bar{Y}_1, \bar{Y}_2, \dots, \bar{Y}_I$ are like a random sample from a normal
population with mean $\mu$ and standard deviation $\sigma/\sqrt{n^*}$. Therefore, the sample variance of
$\bar{Y}_1, \bar{Y}_2, \dots, \bar{Y}_I$ is an estimate of $\sigma^2/n^*$:

$$\frac{1}{I - 1}\sum_{i=1}^{I}(\bar{Y}_i - \bar{Y})^2 \ \text{ is an estimate of } \ \frac{\sigma^2}{n^*}.$$

Hence, $\frac{1}{I - 1}\sum_{i=1}^{I} n^*(\bar{Y}_i - \bar{Y})^2 = MSB$ is an estimate of $\sigma^2$.

To summarize:

• MSW is an estimate of $\sigma^2$ no matter whether the full or reduced model is correct.

• MSB is an estimate of $\sigma^2$ only if the reduced model (the equal means model) is correct.
If the reduced model is not correct, then MSB will tend to overestimate $\sigma^2$.

• if the null hypothesis is true (i.e., the equal means model is correct), then MSB/MSW
should be 1 except for sampling error
• if the null hypothesis is false, MSB/MSW will tend to be bigger than 1
• if the null hypothesis is true, the sampling distribution of MSB/MSW is an F distribution
with I-1 d.f. in the numerator and n-I d.f. in the denominator.
• large values of MSB/MSW are evidence in favor of the alternative hypothesis; therefore,
the P-value is the area to the right of MSB/MSW in the F distribution.
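
A small simulation can make this logic concrete. The sketch below is an illustration under assumptions that are not in the notes: three normal populations with standard deviation 1, a common sample size n* = 10, 2000 replications, and a fixed seed so the run is repeatable. Under equal means the average of MSB/MSW should come out close to 1; when one mean is shifted it should be clearly larger.

```python
import random


def f_ratio(samples):
    """Return MSB / MSW for a list of samples (one list per group)."""
    n = sum(len(s) for s in samples)
    k = len(samples)
    grand = sum(sum(s) for s in samples) / n
    means = [sum(s) / len(s) for s in samples]
    msb = sum(len(s) * (m - grand) ** 2 for s, m in zip(samples, means)) / (k - 1)
    msw = sum((y - m) ** 2 for s, m in zip(samples, means) for y in s) / (n - k)
    return msb / msw


def average_f_ratio(pop_means, sigma=1.0, n_star=10, reps=2000, seed=1):
    """Average MSB/MSW over repeated normal samples with the given population means."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        samples = [[rng.gauss(mu, sigma) for _ in range(n_star)] for mu in pop_means]
        total += f_ratio(samples)
    return total / reps


null_avg = average_f_ratio([0.0, 0.0, 0.0])   # equal means: null hypothesis true
alt_avg = average_f_ratio([0.0, 0.0, 1.0])    # one mean differs: null false

print(f"average MSB/MSW, equal means:    {null_avg:.2f}")
print(f"average MSB/MSW, one mean shifted: {alt_avg:.2f}")
```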

Case Study 5.1 (comparing diets in mice)



Months survived

                 Sum of Squares    df   Mean Square        F     Sig.
Between Groups        12733.942     5      2546.788   57.104   .00000
Within Groups         15297.415   343        44.599
Total                 28031.357   348

Conclusion: There is overwhelming evidence that there is a difference in the mean lifetimes
under the different diets. This does not mean that all the diets are different, only that at least one
of them is.

Robustness to assumptions: see Section 5.5.1, p. 130. The main distributional assumptions we
need to worry about are:
• Population standard deviations are roughly equal
• There are no extreme outliers; the F-test is not resistant to outliers, particularly with small
sample sizes

We can judge these assumptions from side-by-side dotplots or boxplots of the raw data. Judging
equality of standard deviations is a little easier if we subtract off the mean of each group. That is,
we examine the residuals for the full (separate means) model: $Y_{ij} - \bar{Y}_i$. As in regression, we plot
the residuals versus the predicted values. The predicted value for an observation is its group mean.

Judging from this plot, the original boxplots, and the sample standard deviations, there doesn’t
seem to be any reason to doubt the assumptions of the F test.

Examining models between the separate means and the equal means models

Suppose we wanted to examine the model which assumes the two control groups (NP and
N/N85) have the same mean lifetime and the remaining four calorie restricted diets have the
same mean lifetime. The question is: how much of the difference among the means is due
simply to the differences between these two groups of diets?

This is a two-mean model that is between the separate means model (with 6 parameters to
describe the means) and the equal means model (with one parameter to describe the means).

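Model            NP        N/N85     N/R50     R/R50     N/R50lopro   N/R40

Separate means   $\mu_1$   $\mu_2$   $\mu_3$   $\mu_4$   $\mu_5$      $\mu_6$
Two means        $\mu_1$   $\mu_1$   $\mu_2$   $\mu_2$   $\mu_2$      $\mu_2$
Equal means      $\mu$     $\mu$     $\mu$     $\mu$     $\mu$        $\mu$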

These three models are said to be nested because each model is a special case of the ones above
it.

To test the two-means model against the separate means model in SPSS, we first create a new
categorical variable which identifies the first two diets as group 1 and the remaining four diets as
group 2. We then run the ANOVA with this new variable as the explanatory variable.

Control diets (NP and N/N85) vs. restricted diets


Months survived

                 Sum of Squares    df   Mean Square         F   Sig.
Between Groups        11131.393     1     11131.393   228.556   .000
Within Groups         16899.964   347        48.703
Total                 28031.357   348

This ANOVA table compares the two-means model to the equal means model. We see that the
difference is significant. Now, to compare the two-means model to the separate means model we
need to use the sums of squares to compute a new F statistic. Recall

$$F = \frac{(\text{Extra sum of squares})/(\text{Extra degrees of freedom})}{\hat{\sigma}^2_{\text{full}}}$$

$$ESS = SSR_{\text{reduced}} - SSR_{\text{full}}$$


Here are both ANOVA tables, side by side:

Comparing the separate means model to the equal means model


Months survived

                 Sum of Squares    df   Mean Square        F     Sig.
Between Groups        12733.942     5      2546.788   57.104   .00000
Within Groups         15297.415   343        44.599
Total                 28031.357   348

Comparing the two-means model to the equal means model


Months survived

                 Sum of Squares    df   Mean Square         F   Sig.
Between Groups        11131.393     1     11131.393   228.556   .000
Within Groups         16899.964   347        48.703
Total                 28031.357   348

Calculate the F statistic to test the separate means model against the two-means model:

F , =
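
One way to carry out this calculation is sketched below. It is not the notes' own worked answer; the function name `extra_ss_f` is hypothetical. It assumes that the "Within Groups" row of each ANOVA table above gives that model's residual sum of squares and residual degrees of freedom, with the two-means model as the reduced model and the separate means model as the full model.

```python
def extra_ss_f(ssr_reduced, df_reduced, ssr_full, df_full):
    """Extra-sum-of-squares F statistic with its numerator and denominator d.f."""
    extra_ss = ssr_reduced - ssr_full
    extra_df = df_reduced - df_full
    sigma2_full = ssr_full / df_full       # estimate of sigma^2 from the full model
    f_value = (extra_ss / extra_df) / sigma2_full
    return f_value, extra_df, df_full


# Reduced model: two means (Within Groups SS = 16899.964 on 347 d.f.).
# Full model: separate means (Within Groups SS = 15297.415 on 343 d.f.).
f_value, num_df, den_df = extra_ss_f(16899.964, 347, 15297.415, 343)

print(f"F on {num_df} and {den_df} d.f. = {f_value:.3f}")
```

The resulting statistic would then be compared to an F distribution with those numerator and denominator degrees of freedom, with the P-value taken as the area to the right.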