
STAT 241/251 - Chapter 10: Analysis of Variance (ANOVA)

In the previous chapter, we learned how to do a 2-sample t-test for the equality of means
for two independent groups. (ie) H0 : µ1 = µ2.

Analysis of variance (ANOVA) is a method for testing the equality of 3 or more means
(independent groups). We will use g to represent the number of groups/treatments/means
we are comparing. It is an extension of the 2-sample t-test, where we use a pooled estimate
of the variance.

(eg) H0 : µ1 = µ2 = ..... = µg vs. HA : µa ≠ µb for some a ≠ b


(H0 : All means equal vs. HA : At least one group differs)

Like usual, we will work through an example, stating each idea in general terms as we go.

Note: Many software packages can perform ANOVA, although it is very useful to do it by
hand and understand exactly what ANOVA is doing.

Example A: You work for a company that manufactures cars and you have been asked to
design a computer networking system that coordinates all the machines involved in the
process. You have written 4 different programs for this (1, 2, 3, 4) and would like to test
if the average time to produce one car is the same for each of the 4 programs. You test
methods 1 and 2 five times each, and methods 3 and 4 four times each, and record the
following results. Test if the mean times are all the same and, if they differ, determine
which methods differ significantly.

Process Type (treatment) Production time (min.) Sample Mean Sample SD


----------------------- --------------------- ----------- ---------
1 37, 41, 33, 46, 36 38.6 5.03
2 29, 36, 35, 28, 34 32.4 3.65
3 42, 43, 37, 33 38.75 4.65
4 25, 36, 27, 42 32.5 7.94

We could do all pair-wise t-tests, although this is lengthy and inefficient. There are two
main reasons why that approach is not optimal.

The first is that we would not take a pooled estimate of the standard deviation, and so we
would not get as good an estimate of the common variance as we could. The second is that
we would encounter the problem of multiple comparisons, which we will talk about at
the end of this set of notes.

In general, we have......

Group (i)  Samples (j)              Sample size (ni)  Sample mean (Ȳi)  Sample SD (si)
---------  -----------------------  ----------------  ----------------  --------------
1          y1,1, y1,2, ..., y1,n1   n1                Ȳ1                s1
2          y2,1, y2,2, ..., y2,n2   n2                Ȳ2                s2
...        ...                      ...               ...               ...
g          yg,1, yg,2, ..., yg,ng   ng                Ȳg                sg

Notation:

Note, that we use Y for our variable and not X. It is convention in statistics that we use
Y to represent a response variable. We will see this especially when we talk about linear
regression, where we will use an explanatory variable (X) to try and predict a response
variable (Y ).

yi,j is sample observation number j (j = 1, 2, ..., ni) from treatment/group i (i = 1, 2, ..., g)

i indexes the g treatments/groups


j indexes the observations within each treatment/group
ni = The sample size for group i
n = n1 + n2 + ..... + ng = total sample size
Ȳi = The ith treatment/group mean
Ȳ = The overall mean
si = The sample SD for group i

In general, we have g-groups (treatments/populations/...) with true/population means of
µ1 , µ2 , ..., µg and true standard deviations of σ1 , σ2 , ...., σg (both of which are unknown in
reality), which we estimate using Ȳ1 , Ȳ2 , ....., Ȳg and s1 , s2 , ....., sg .

In our example, we want to test if all the means are equal (all the methods perform the
same).
(eg) H0 : µ1 = µ2 = µ3 = µ4 vs. HA : µa ≠ µb, for some a ≠ b

Note that we are testing a hypothesis, so we will follow the general procedure for a hypothesis
test, outlined in the last chapter. (ie) set up hypotheses, calculate a test statistic, compare
it to a critical value or calculate the p-value, and state a conclusion.

We will propose the following model, where each measurement is represented as the sum of
an unknown constant (µi) and a random error term (εi,j):

Yi,j = µi + εi,j, for i = 1, 2, ..., g and j = 1, 2, ..., ni

(ie) Each observation is represented as the sum of: 1) The mean or expected value for an
observation from that treatment/group 2) A random error term for unexplained variabil-
ity/noise.

Note: The εi,j terms account for the variability caused by factors other than the treat-
ment/group differences. We refer to this as unexplained variation, or random variation.

We randomize which experimental units get put into which group/treatment, in an attempt
to ‘average out’ these effects.

Then, εi,j ∼ i.i.d. N(0, σ²)


Whenever we use a statistical model, there are some assumptions we make, and we must
check if it is reasonable to make such assumptions.

Model Assumptions:
1) Groups are independent

2) The εi,j are independent and normally distributed

3) E(εi,j) = 0, for all i, j (ie) On average, each sample observation equals its true
treatment/group mean.

4) Constant variance, Var(εi,j) = σ², for all i, j (ie) All g groups have equal variances.

5) Assignment of experimental units to treatments/groups is made randomly.

So, for our model Yi,j = µi + εi,j, we have....


Model parameters: µ1, µ2, µ3, µ4, σ²

Point estimates: Ȳ1, Ȳ2, Ȳ3, Ȳ4, s²pooled, where

s²pooled = [(n1 − 1)s1² + (n2 − 1)s2² + ... + (ng − 1)sg²] / (n − g)

Note: This is the same formula for the 'pooled' estimate of variance that we saw in Chapter 8.
Later in these notes, we will refer to the above as the MSe.

Recall: Here, we assume that σ1 = σ2 = .... = σg, so we take a 'pooled' estimate of the
sample variances.
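To make this concrete, here is a minimal Python sketch (Python is not part of these notes,
and the variable names are ours) that computes s²pooled from the Example A summary
statistics:

    # Pooled variance from the Example A summary statistics
    n = [5, 5, 4, 4]              # group sample sizes
    s = [5.03, 3.65, 4.65, 7.94]  # group sample SDs
    g = len(n)

    s2_pooled = sum((ni - 1) * si**2 for ni, si in zip(n, s)) / (sum(n) - g)
    print(round(s2_pooled, 2))    # about 29.18 -- this reappears later as the MSe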

Sources of Variability:

1) The machine/process performance (eliminated by randomization)

2) The performance of each method/group/treatment

3) Noise or random error

Roughly speaking, there is evidence of a difference in the group means if a substantial
amount of the total variation in the data is due to the difference between groups/treatments,
relative to the amount of variation due to noise or random error.

We will develop a method whereby we can take the total amount of variation in the data,
and split it into 1) how much is from a difference in treatments/groups and 2) how much is
from random error or can’t be explained by group differences.

Total Sum of Squares (SST)


We abbreviate the total variability in the data as SST, where the SS means ‘sum of squares’,
and the T means ‘total’.

This includes all sources of variability (ie) raw material differences, treatment/group differ-
ences, worker difference, unknown differences, noise/random error,.....

SST = Σ_{i=1}^{g} Σ_{j=1}^{ni} (yi,j − Ȳ)²

Note: The above looks sort of like the formula for calculating a variance.... sort of!

If one were to go through and calculate SST, you would find SST = 582.8
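If you would rather not do that sum by hand, here is a short Python sketch (ours, not part
of the notes) that computes SST from the raw Example A data:

    # Total sum of squares for Example A from the raw data
    groups = [
        [37, 41, 33, 46, 36],   # method 1
        [29, 36, 35, 28, 34],   # method 2
        [42, 43, 37, 33],       # method 3
        [25, 36, 27, 42],       # method 4
    ]
    data = [y for grp in groups for y in grp]
    ybar = sum(data) / len(data)              # overall mean, about 35.56
    sst = sum((y - ybar)**2 for y in data)
    print(round(sst, 1))  # about 582.4; the 582.8 above comes from using
                          # the sample SDs rounded to two decimals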

Partitioning the Total Sum of Squares (SST)
Like we discussed earlier, if H0 really is true (all means equal), then we would expect the
variation between groups to be roughly similar to the variation arising within groups.

If H0 is not true (the means are not all the same), then we would expect more of the
variation to come from differences between groups/treatments, and relatively little variation
to be within groups.

Essentially, we would like to separate 'signal' from 'noise'.

Now, we will show how the Total sum of squares (SST), can be expressed as the sum of the
sum of squares due to treatments/groups (SSt) and the sum of squares due to random error
(SSe).

Note: Sometimes the SSt is also referred to as the sum of squares between groups (SSb) or
the model sum of squares (SSm), and the SSe is sometimes referred to as the sum of squares
within groups (SSw) or the sum of squares from residuals (SSr). We will use the terms SSt
and SSe, although I wanted to mention other names in case you look at other sources.

(ie) Here, we will split the total S.S. (or total variation) into how much comes from between
groups and how much from within groups.

This will allow us to compare the variation due to differences in groups/treatments with the
variation due to random error or noise.
SST = Σ_{i=1}^{g} Σ_{j=1}^{ni} (yi,j − Ȳ)² = Σ_{i=1}^{g} Σ_{j=1}^{ni} [(yi,j − Ȳi) + (Ȳi − Ȳ)]²

= Σ_{i=1}^{g} Σ_{j=1}^{ni} (yi,j − Ȳi)² + Σ_{i=1}^{g} Σ_{j=1}^{ni} (Ȳi − Ȳ)² + 2 Σ_{i=1}^{g} Σ_{j=1}^{ni} (yi,j − Ȳi)(Ȳi − Ȳ)

= Σ_{i=1}^{g} Σ_{j=1}^{ni} (yi,j − Ȳi)² + Σ_{i=1}^{g} Σ_{j=1}^{ni} (Ȳi − Ȳ)² = SSe + SSt

(The cross term vanishes because Σ_{j=1}^{ni} (yi,j − Ȳi) = 0 within each group i.)

(ie) The first part gives us the S.S. from the difference between each observation and its
group-specific mean (error). The second part gives us the S.S. from the difference between
the group-specific mean and the overall mean.

The Treatment Sum of Squares (SSt)
We saw that SSt = Σ_{i=1}^{g} Σ_{j=1}^{ni} (Ȳi − Ȳ)². With some algebra, it can be shown that.....

SSt = Σ_{i=1}^{g} ni (Ȳi − Ȳ)²
This is the S.S. due to groups/treatments (the signal)

It is also commonly referred to as the between treatment S.S.

The Error Sum of Squares (SSe)


We saw that SSe = Σ_{i=1}^{g} Σ_{j=1}^{ni} (yi,j − Ȳi)². With some algebra, it can be shown that.....

SSe = Σ_{i=1}^{g} (ni − 1)si², where si² is the sample variance for group i.

This is the S.S. due to random error (the noise)

It is also commonly referred to as the within treatment S.S.

So, if you look at the data from the example on the first page, we find that...

SSt = Σ_{i=1}^{g} ni(Ȳi − Ȳ)² = 5(38.6 − 35.56)² + 5(32.4 − 35.56)² +
4(38.75 − 35.56)² + 4(32.5 − 35.56)² = 174.3

SSe = Σ_{i=1}^{g} (ni − 1)si² = (5 − 1)(5.03)² + (5 − 1)(3.65)² +
(4 − 1)(4.65)² + (4 − 1)(7.94)² = 408.5

Therefore, SST = SSt + SSe = 174.3 + 408.5 = 582.8, which is what we
saw before.
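As a check, here is a short Python sketch (ours) that reproduces these numbers from the
summary statistics and verifies the partition:

    # Verify SST = SSt + SSe for Example A (rounded summary statistics)
    n    = [5, 5, 4, 4]
    ybar = [38.6, 32.4, 38.75, 32.5]   # group sample means
    s    = [5.03, 3.65, 4.65, 7.94]    # group sample SDs

    grand = sum(ni * yi for ni, yi in zip(n, ybar)) / sum(n)  # about 35.56
    sst_trt = sum(ni * (yi - grand)**2 for ni, yi in zip(n, ybar))
    ss_err  = sum((ni - 1) * si**2 for ni, si in zip(n, s))
    print(round(sst_trt, 1), round(ss_err, 1), round(sst_trt + ss_err, 1))
    # about 174.3, 408.5, 582.8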

Ok, we are almost ready to compute the signal to noise ratio (compare variability due to
treatment differences with variability due to random error)

It is important to note that the SST, SSt and SSe were all computed using a different 'number
of squares'. (eg) SSe is a sum of n squared terms, while SSt is a sum of only g squared terms.

So, we must divide each S.S. by its degrees of freedom before we can fairly compare them
(to get an average, or mean SS).

Degrees of Freedom (d.f.)


We will not go too deep into the theory here, but we will give an intuitive understanding.

Basically, our degrees of freedom start at the number of observations we have, and we lose
1 degree of freedom for every parameter we must estimate.

When we calculated the SST, we used (n) squares, and estimated 1 parameter (Ȳ ), so we
have.....

d.f.SST = n − 1
When we calculated the SSe, we used n squares and we had to estimate g parameters
(Ȳ1 , Ȳ2 , ...., Ȳg ), so we have.....

d.f.SSe = n − g
Since, SST = SSt + SSe, we have that d.f.SST = d.f.SSt + d.f.SSe

Therefore,

d.f.SSt = g − 1
In our example, d.f.SST = 18 − 1 = 17, d.f.SSe = 18 − 4 = 14 and d.f.SSt = 4 − 1 = 3

Mean Squares
If we want to make a fair comparison of the variation due to treatments with the variation
due to random error, we should compute the mean sum of squares (MS) by dividing
each S.S. by its degrees of freedom. We will take a closer look at the MSe later in these notes.

MSt = SSt/d.f.SSt = SSt/(g − 1)

MSe = SSe/d.f.SSe = SSe/(n − g)

In our example, MSt = 174.3/(4 − 1) = 58.1 and MSe = 408.5/(18 − 4) = 29.18

F-Statistic
Now, we would like to calculate our observed signal-to-noise ratio, and then calculate the
probability of observing a ratio that large or larger, given that the null hypothesis is true
(the p-value). (ie) just like before, if H0 is true, then what's the probability of getting a
test statistic as large or larger than what we observed.

Our test statistic is.......

F = MSt/MSe ∼ F(d.f.1 = g−1, d.f.2 = n−g), when H0 is true

We will just state the result that F , our test-statistic, follows an F -distribution with numer-
ator degrees of freedom = g − 1 and denominator degrees of freedom = n − g, under the
assumption that H0 is true.

In our example, F = 58.1/29.18 = 1.99

Now, we would like to compute the probability of observing an F as large as we observed (or
larger) if H0 really were true (the p-value). A computer can calculate this exactly, although
our tables make it more difficult to do such a thing.

Instead, we will compare our Fobs. to a critical value.

(ie) We will compare Fobs. to an F -statistic which we would observe only α*100% of the
time if H0 really were true.

Then, we can look up the value for F(g−1,n−g,α), and compare it to our Fobs.
In our example, if we use α = 0.10, 0.05, 0.01, then we have.....

F(4−1,18−4,0.10) = 2.52

F(4−1,18−4,0.05) = 3.34

F(4−1,18−4,0.01) = 5.56
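If software is available, both the exact p-value and these critical values are one-liners. A
sketch using scipy (assuming scipy is installed; any package with an F distribution works):

    # Exact p-value and critical values for the F(3, 14) distribution
    from scipy import stats

    f_obs, df1, df2 = 1.99, 3, 14
    print(round(stats.f.sf(f_obs, df1, df2), 3))   # p-value, about 0.16

    for alpha in (0.10, 0.05, 0.01):
        print(alpha, round(stats.f.ppf(1 - alpha, df1, df2), 2))
    # about 2.52, 3.34, 5.56 -- matching the table values above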

Rules:
If Fobs. < F(g−1,n−g,α) , then Fail to reject (FTR) H0

If Fobs. ≥ F(g−1,n−g,α) , then Reject H0

In our example, 1.99 < 3.34, therefore we FTR H0 and conclude that there is not enough
evidence to say that any of the group/treatment means differ significantly.

ANOVA Table
In general, we have the following ANOVA table......

Note that 'between' is due to treatment/group and 'within' is due to random error/noise.

Source             S.S.                                        d.f.   MS           F-Statistic
-----------------  ------------------------------------------  -----  -----------  -----------
Treatment/Between  SSt = Σ_{i=1}^{g} ni(Ȳi − Ȳ)²               g − 1  SSt/(g − 1)  MSt/MSe
Error/Within       SSe = Σ_{i=1}^{g} (ni − 1)si²               n − g  SSe/(n − g)
Total              SST = Σ_{i=1}^{g} Σ_{j=1}^{ni} (yi,j − Ȳ)²  n − 1

In our example, we had......

Source             S.S.   d.f.  MS     F-Statistic
-----------------  -----  ----  -----  -----------
Between/Treatment  174.3  3     58.1   1.99
Within/Error       408.5  14    29.18
Total              582.8  17
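For reference, scipy can reproduce the whole table's F and p-value in one call (a sketch,
assuming scipy is installed):

    # One-shot ANOVA for Example A with scipy
    from scipy.stats import f_oneway

    F, p = f_oneway([37, 41, 33, 46, 36],   # method 1
                    [29, 36, 35, 28, 34],   # method 2
                    [42, 43, 37, 33],       # method 3
                    [25, 36, 27, 42])       # method 4
    print(round(F, 2), round(p, 2))         # F about 1.99, p about 0.16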

Example B: Suppose that a course at UBC is taught in 3 different sections, by 3 different


instructors. We would like to compare the teachers, so we randomize which students are in
which section and administer the same test within each section. In the end, we will test if
the mean grade for each class is the same. For sections A, B and C, you sample 35, 41 and
63 students, and find class averages of 71, 68 and 77 with sample standard deviations of 13,
14.2 and 16.5 respectively. Test if the mean grades are equal for each section of the course,
using a significance level of 5%.
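Example B gives only summary statistics, and that is all ANOVA needs. A Python sketch
of the computation (ours, not part of the notes; it previews the conclusion discussed below):

    # ANOVA for Example B from summary statistics alone
    from scipy import stats

    n    = [35, 41, 63]
    ybar = [71, 68, 77]
    s    = [13, 14.2, 16.5]
    g, ntot = len(n), sum(n)

    grand = sum(ni * yi for ni, yi in zip(n, ybar)) / ntot            # about 72.835
    mst = sum(ni * (yi - grand)**2 for ni, yi in zip(n, ybar)) / (g - 1)
    mse = sum((ni - 1) * si**2 for ni, si in zip(n, s)) / (ntot - g)  # about 225.67
    F = mst / mse
    print(round(F, 2), round(stats.f.sf(F, g - 1, ntot - g), 3))
    # F about 4.81 with p about 0.01 < 0.05, so we reject H0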

Example C: A manufacturing company has 3 different methods for moulding steel into a
steel beam used for construction. They want to test if all methods seem roughly the same,
or if any of the particular methods result in a ‘harder’ steel, which is desirable. They will
use the Rockwell hardness scale as their measure of the ‘hardness’ of the steel beams. For
simplicity, we will refer to the 3 different methods as methods A, B and C. They measure
the hardness of the beams produced by methods A, B and C 10, 12 and 11 times,
respectively. They find sample means of 55, 57 and 64 with sample standard deviations of
3.1, 4.6 and 3.8 respectively. Test if the mean hardness is the same for each of the methods
of production using a significance level of 5%. Make sure to clearly state your hypotheses
and a conclusion.

In Example B, we concluded that there is a significant difference in the average grade for
different sections of the course. However, our test only allows us to conclude either: 1) there
is not enough evidence to say the means are not all the same, or 2) at least one differs.

It does not tell us which ones differ significantly from each other.

To decide which differ significantly, we must use Multiple Comparisons. Here, we will
make all pair-wise comparisons of means. We may do this either using confidence intervals,
or hypothesis tests. Now, we will discuss how to do this using confidence intervals.

Simultaneous Confidence Intervals


To address the question of which means significantly differ, we can construct simultaneous
confidence intervals for the difference of all pairs of means.

The procedure will be the same as when we made a confidence interval for the difference of
two means, except now we must account for the fact that we are making multiple comparisons
at once.

We will continue with the example of comparing the 3 classes.

Recall: We had 3 groups (A, B and C) with means of (71, 68, 77), sample standard deviations
of (13, 14.2, 16.5) and sample sizes of (35, 41, 63) respectively, with an overall mean of 72.835

So, we have 3 groups/treatments, and comparing every possible pair of means results in
G = (3 choose 2) = 3 comparisons at once.

In general, if we have g groups/treatments, there are G = (g choose 2) comparisons.
As we saw earlier, a confidence level of (1 − α) has an error rate of α.
(ie) We are (1 − α)*100% confident in the statement we make based on the confidence
interval.

But, now we must account for the fact that we are making many confidence intervals at
once... and hence, many statements being made at once.

Let’s assume that our confidence intervals are completely independent, then....

P(correct for confidence interval i) = (1 − α)

If we let Ai = {Correct on confidence interval i}, then....

P(correct in all intervals) = P(A1 ∩ A2 ∩ A3) = P(A1)P(A2)P(A3) = (1 − α)³, which is less
than (1 − α)

(ie) If we make each confidence interval with a confidence level of 95%, then the overall
confidence across all the confidence intervals will be less than 95%.

(ie) If the intervals are completely independent, then using a level of 95% confidence for one
interval would result in a 0.95³ = 0.857 level of confidence over all the statements we will
make about all the confidence intervals.

Bonferroni’s Multiple Comparisons Correction


If we wish to have an overall confidence level of (1 − α) ( (ie) overall error rate of α), then
use α* = α/G to make each confidence interval.

Note: Bonferroni's correction is conservative: it guarantees an overall error rate of at most α,
or equivalently, an overall confidence level of at least (1 − α), whatever the dependence
between the intervals.
There are other methods of correction that exist (Tukey’s correction is another one).

In our example, we wish to make G = 3 simultaneous confidence intervals (or, 3 comparisons


at once).

If we want to have an overall confidence level of 95% (α = 0.05 overall), then we should use
α* = 0.05/3 = 0.0167 to make each individual confidence interval.
(ie) 98.3% confidence intervals for each pair, to have (at least) 95% confidence over all
intervals at once.

Now, we will make confidence intervals for the difference of each pair of means
(ie) (µA − µB ), (µA − µC ) and (µB − µC )

Recall: When we made a confidence interval for the difference in means, it took the form.....

(ȲA − ȲB) ± t^(α)_(d.f.) × Spooled(ȲA−ȲB)
We will make the confidence interval for (µA − µB ) first.

We have (ȲA − ȲB ) = (71 - 68) = 3

Also, we know that Var(ȲA − ȲB) = Var(ȲA) + Var(ȲB), since A and B are independent.

So, Var(ȲA − ȲB) = σA²/nA + σB²/nB,

and recall that we assume that σ1 = σ2 = ..... = σg = σ, then....

Var(ȲA − ȲB) = σ² (nA + nB)/(nA nB)

Then, S(ȲA−ȲB) = σ √[(nA + nB)/(nA nB)]

In general, when we are comparing groups/treatments i and j, we have.....

Spooled(Ȳi−Ȳj) = σ √[(ni + nj)/(ni nj)]

Now, what is our estimate of the ‘common variance’ or ‘common SD’ ?

Let’s take a moment to compare this to what we saw in Chapter 8, when making a confidence
interval for the difference of two means, assuming equal variances.

Recall: MSe = Σ_{i=1}^{g} Σ_{j=1}^{ni} (yi,j − Ȳi)²/(n − g) = Σ_{i=1}^{g} (ni − 1)si²/(n − g)

If we examine this carefully, we can see that it is the pooled variance (s²pooled) from Chapter 8.

Then, s = √MSe is our estimate of the 'common' standard deviation.

In our example, we had MSe = 225.67, so s = √225.67 ≈ 15.02

Recall: We assume that the variances across groups are equal, so it is sort of like we have
n = n1 + ... + ng observations in total to estimate the variance.

Then, our d.f. = n − g = (35 + 41 + 63) − 3 = 136

So, we would like to find t^(α*)_(n−g).

In our example, we have t^(0.05/3)_(136) ≈ 2.43 (from our table).

Note: Our t-table has columns for 0.05/3, 0.05/6, 0.05/10, 0.05/15 (for g = 3, 4, 5, 6, using α = 0.05)
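If your table lacks the needed column, the Bonferroni-corrected critical value can be
computed directly (a scipy sketch, assuming the interval is two-sided):

    # Bonferroni-corrected t critical value for Example B
    from scipy import stats

    alpha_star = 0.05 / 3          # per-interval error rate
    df = 139 - 3                   # n - g = 136
    print(round(stats.t.ppf(1 - alpha_star / 2, df), 2))
    # about 2.42; the table value 2.43 uses the nearest tabulated d.f.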

Now, we know all we need to know to make the simultaneous confidence intervals.

For (µA − µB ), we have....

(ȲA − ȲB) ± t^(α*)_(n−g) × √MSe × √[(nA + nB)/(nA nB)]

Substituting in the values we have gives us.....


(71 − 68) ± (2.43) × 15.02 × √[(35 + 41)/(35 × 41)]

Which results in (−5.4, 11.4)

For (µA − µC ), we have....


(71 − 77) ± (2.43) × 15.02 × √[(35 + 63)/(35 × 63)]

Which results in (−13.7, 1.7)

For (µB − µC ), we have....


(68 − 77) ± (2.43) × 15.02 × √[(41 + 63)/(41 × 63)]

Which results in (−16.3, −1.7)

Note: If 0 is in a confidence interval for a difference of means, then we conclude that those
two means do not differ significantly.

Now, we can say with overall 95% confidence that the mean of class C is significantly greater
than the mean of class B, and that there are no other significant differences between means.
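All three intervals can be produced in one loop. A Python sketch under the same
assumptions (MSe = 225.67, d.f. = 136, two-sided α* = 0.05/3):

    # Simultaneous (Bonferroni) confidence intervals for Example B
    from itertools import combinations
    from scipy import stats

    n    = {"A": 35, "B": 41, "C": 63}
    ybar = {"A": 71, "B": 68, "C": 77}
    mse, df = 225.67, 136
    t_crit = stats.t.ppf(1 - (0.05 / 3) / 2, df)

    for a, b in combinations("ABC", 2):
        diff = ybar[a] - ybar[b]
        half = t_crit * (mse * (n[a] + n[b]) / (n[a] * n[b])) ** 0.5
        print(a, "-", b, (round(diff - half, 1), round(diff + half, 1)))
    # A-B about (-5.4, 11.4); A-C about (-13.7, 1.7); B-C about (-16.3, -1.7)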

Simultaneous Hypothesis Tests
Note that we could alternatively do all possible pairwise hypothesis tests....

t = [(Ȳ1 − Ȳ2) − 0] / (√MSe × √[(n1 + n2)/(n1 n2)]) ∼ t(df = n−g)

We then look up the p-value, and reject the null hypothesis that the two means are equal if
the p-value is less than α*.

My personal opinion is that a confidence interval is better, as it not only allows us to conclude
whether the two means are equal or not, but also gives us an idea of the magnitude of the
difference if we conclude that the two means are not equal.
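The same loop as before, recast as Bonferroni-adjusted pairwise tests (a sketch; it reaches
the same conclusion as the intervals above):

    # Pairwise t-tests for Example B, each compared to alpha* = 0.05/3
    from itertools import combinations
    from scipy import stats

    n    = {"A": 35, "B": 41, "C": 63}
    ybar = {"A": 71, "B": 68, "C": 77}
    mse, df, alpha_star = 225.67, 136, 0.05 / 3

    for a, b in combinations("ABC", 2):
        se = (mse * (n[a] + n[b]) / (n[a] * n[b])) ** 0.5
        t_obs = (ybar[a] - ybar[b]) / se
        p = 2 * stats.t.sf(abs(t_obs), df)      # two-sided p-value
        print(a, b, round(t_obs, 2), round(p, 4), "reject" if p < alpha_star else "FTR")
    # only B vs C rejects, matching the confidence intervals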

Examples:

1. Four types of mortars - ordinary cement mortar (OCM), polymer impregnated mortar
   (PIM), resin mortar (RM) and polymer cement mortar (PCM) - were subjected to a
   compression test to measure the strength (MPa). Three observations were taken for
   each mortar type, and are summarized in the table below.

Type Mean Strength Sample SD


---- ------------- ---------
OCM 33.5 2.5
PIM 131.2 3.8
RM 116.4 4.1
PCM 30.3 3.4

(a) Using a significance level of 5%, test if the mean strength is the same for all types
of mortars.
(b) If you determine that not all have the same strength, then use Bonferroni’s method
of simultaneous 95% confidence intervals to state which differ significantly.

2. Construct simultaneous 95% confidence intervals for Example C above, and state
   which means differ significantly.

