
ANOVA

ANALYSIS OF VARIANCE

Difference between two sample means

Suppose that two machines are to be compared. Because these machines are operated by people, and for other, inexplicable reasons, output per hour is subject to chance fluctuation. In the hope of averaging out and thus reducing the effect of chance fluctuation, a random sample of 5 different hours is obtained from each machine and set out in the table below, and each sample mean is then calculated. Assume that the 2 populations are normally distributed. Are the machines really different?

| | Hourly output | Sample mean |
|---|---|---|
| Machine 1 | 55, 54, 58, 61, 52 | $\bar{x}_1 = 56$ |
| Machine 2 | 47, 53, 49, 50, 46 | $\bar{x}_2 = 49$ |

Distribution of difference between two sample means

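Characteristics of the distribution of $\bar{x}_1 - \bar{x}_2$
- Mean = $\mu_1 - \mu_2$
- Variance = $\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}$ (population variances known: z statistic)
- Variance = $\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}$ (variances estimated from the samples: t statistic)
- $\alpha$: the probability of committing a type I error

Equal population variances: a common variance ($\sigma_1^2 = \sigma_2^2 = \sigma^2$) exists.
$s_p^2$: pooled estimate of the common population variance $\sigma^2$

$$s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$$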

Difference between two sample means


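Pooled estimate of the common variance, with $s_1^2 = 12.5$ and $s_2^2 = 7.5$:

$$s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2} = \frac{4(12.5) + 4(7.5)}{5 + 5 - 2} = 10$$

$H_0: \mu_1 = \mu_2$ (or $\mu_1 - \mu_2 = 0$) and $H_1: \mu_1 \neq \mu_2$

$$t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{s_p^2}{n_1} + \frac{s_p^2}{n_2}}} = \frac{(56 - 49) - 0}{\sqrt{\frac{10}{5} + \frac{10}{5}}} = 3.5$$

With the significance level $\alpha = 0.01$ and $n_1 + n_2 - 2 = 8$ d.f., the critical value is $t_{0.01} = 2.896$. Computed $t = 3.5 >$ critical $t = 2.896$, so reject $H_0$. There is a difference between the two population means.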
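The pooled-variance t statistic for the two machines can be reproduced with a short script. This is a minimal sketch using only the Python standard library; the function name `pooled_t` and the variable names are illustrative choices, not part of the slides.

```python
import math
from statistics import mean, variance

# Hourly output of the two machines in the two-sample example
machine_1 = [55, 54, 58, 61, 52]
machine_2 = [47, 53, 49, 50, 46]

def pooled_t(sample_a, sample_b):
    """Return (pooled variance, t statistic, d.f.) for H0: mu_a = mu_b."""
    n_a, n_b = len(sample_a), len(sample_b)
    df = n_a + n_b - 2
    # statistics.variance is the sample variance (divisor n - 1)
    s2_p = ((n_a - 1) * variance(sample_a) + (n_b - 1) * variance(sample_b)) / df
    t = (mean(sample_a) - mean(sample_b)) / math.sqrt(s2_p / n_a + s2_p / n_b)
    return s2_p, t, df

s2_p, t, df = pooled_t(machine_1, machine_2)
print(f"pooled variance = {s2_p}, t = {t}, d.f. = {df}")
```

The script gives a pooled variance of 10 and t = 3.5 with 8 d.f., matching the hand calculation.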

Difference between three sample means


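| | Machine 1 | Machine 2 | Machine 3 |
|---|---|---|---|
| Hourly output | 47, 53, 49, 50, 46 | 55, 54, 58, 61, 52 | 54, 50, 51, 51, 49 |
| Sample mean $\bar{x}_j$ | 49 | 56 | 51 |
| $\bar{x}_j - \bar{x}$ | -3 | 4 | -1 |
| $(\bar{x}_j - \bar{x})^2$ | 9 | 16 | 1 |

Grand mean $\bar{x} = 52$, $\sum(\bar{x}_j - \bar{x}) = 0$, $\sum(\bar{x}_j - \bar{x})^2 = 26$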

Are the machines really different?

Difference between three sample means


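| | Sample 1 | Sample 2 | Sample 3 |
|---|---|---|---|
| Hourly output | 49, 55, 51, 52, 48 | 52, 51, 55, 58, 49 | 55, 51, 52, 52, 50 |
| Sample mean $\bar{x}_j$ | 51 | 53 | 52 |

From one machine, 3 samples are taken. As expected, sampling fluctuations cause small differences in the sample means $\bar{x}_j$ even though the population means $\mu_j$ in this case are identical.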

Difference between three sample means

Question
Are the differences among the 3 means from 3 different machines of the same order as the differences among the 3 means from 3 samples of one machine? If so, the differences are due to chance fluctuation. If not, the differences among the 3 means from the 3 different machines are large enough to indicate a difference in the underlying population means, and we reject $H_0: \mu_1 = \mu_2 = \mu_3$.

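Samples from three erratic machines

| | Machine 1 | Machine 2 | Machine 3 |
|---|---|---|---|
| Hourly output | 50, 42, 53, 45, 55 | 48, 57, 65, 59, 51 | 57, 59, 48, 46, 45 |
| Sample mean $\bar{x}_j$ | 49 | 56 | 51 |

Grand mean $\bar{x} = 52$

[Figure: 3 machines. Panels: three different machines; three erratic machines]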

ANOVA Analysis of Variance

ANOVA:
- Partition the total variability into smaller components
- Identify the source of each component
- Measure the extent of each source
- Conclude which source is the true cause of the total variability

2 sources
- Variability due to distinct population means (treatments)
- Variability due to all other sources (residual)

Content
- The completely randomized design: one-way ANOVA
- The randomized complete block design: two-way ANOVA
- The Latin square design
- The factorial experiment: 2 factors
- ANOVA models

Definitions
- Treatment: a treatment is any factor that the experimenter controls. It may refer, for example, to a type of drug, one of several concentrations of a single drug, a new type of house paint, an advertising technique, or a particular training program.
- Experimental unit: the entity that receives a treatment is called an experimental unit.
- Mean square: the term mean square is synonymous with the word variance.

Definitions of ANOVA

ANOVA: Analysis of variance is a technique whereby the total variation present in a set of data, $\sum_{j=1}^{k}\sum_{i=1}^{n_j}(x_{ij} - \bar{x}_{..})^2$, is partitioned into several components. Associated with each of these components is a specific source of variation, so that, in the analysis, it is possible to ascertain the magnitude of the contribution of each of these sources to the total variation.

Definitions of ANOVA
- The completely randomized design: one-way ANOVA
- The randomized complete block design: two-way ANOVA
- The Latin square design

The completely randomized design One-way ANOVA


- Assign the treatments at random to the experimental units.
- 4 brands of tires (A, B, C, D): 4 treatments
- Randomly assign 10 tires of each brand to 20 cars: 20 experimental units
- Observe the number of km driven until tread wear occurs.
- Analyze the variance to decide whether or not the brands differ with respect to expected tire km.
- Define the variation due to brand differences and the variation due to all sources other than brands.

ANOVA model

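$$x_{ij} = \mu_j + e_{ij} \quad (1)$$

$$\mu = \frac{1}{k}\sum_{j=1}^{k}\mu_j, \qquad \tau_j = \mu_j - \mu \;\Rightarrow\; \mu_j = \mu + \tau_j \quad (2)$$

Combining (1) and (2):

$$x_{ij} = \mu + \tau_j + e_{ij}$$

- $\mu$: grand mean
- $\tau_j$: treatment effect (between)
- $e_{ij}$: error term, the deviation of $x_{ij}$ from $\mu_j$ (within)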
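To make the one-way model concrete, the sketch below estimates the grand mean, the treatment effects, and the error terms from the three-machine data used elsewhere in these slides. The variable names are illustrative assumptions; the estimates are the usual sample analogues (the sample grand mean and each group mean minus the grand mean).

```python
from statistics import mean

# Hourly output of the three machines (the three-sample example)
machines = {
    "Machine 1": [47, 53, 49, 50, 46],
    "Machine 2": [55, 54, 58, 61, 52],
    "Machine 3": [54, 50, 51, 51, 49],
}

all_values = [x for xs in machines.values() for x in xs]
grand_mean = mean(all_values)                      # estimate of the grand mean
group_means = {name: mean(xs) for name, xs in machines.items()}

# Estimated treatment effect: group mean minus grand mean
effects = {name: m - grand_mean for name, m in group_means.items()}

# Estimated error terms: each observation minus its own group mean
residuals = {name: [x - group_means[name] for x in xs]
             for name, xs in machines.items()}

# Every observation is rebuilt as grand mean + effect + residual
rebuilt = [grand_mean + effects[name] + r
           for name in machines for r in residuals[name]]

print("grand mean:", grand_mean)
print("effects:", effects)
print("sum of effects:", sum(effects.values()))
```

With equal sample sizes the estimated effects sum to zero, which mirrors the constraint placed on the treatment effects in the fixed-effects model.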

ANOVA assumptions

Fixed-effects model assumption


- The k sets of observed data constitute k independent random samples from the specified populations.
- Each of the populations represented by a sample is normally distributed with mean $\mu_j$ and variance $\sigma_j^2$.
- Each population has the same variance. That is, $\sigma_1^2 = \sigma_2^2 = \dots = \sigma_k^2 = \sigma^2$.
- The $\tau_j$ are unknown constants, and $\sum_{j=1}^{k}\tau_j = 0$.

Violation of ANOVA assumption

Violating either of the following assumptions increases the probability of rejecting a true null hypothesis:
- Equal population variances
- Normally distributed populations

Test whether the assumptions are violated
- Variance test: Levene's test or boxplot
- Normality test: boxplot, histogram, normal Q-Q plot

Solution
- Make sample sizes equal
- Refine the data set to eliminate outliers
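As an illustration of the variance check, Levene's statistic can be computed as a one-way ANOVA F ratio on the absolute deviations of each observation from its own group mean. The sketch below uses the mean-centred form of the test and the three-machine data from these slides; the function names `one_way_f` and `levene_w` are assumptions made for this example, and the critical value 6.93 is the F(2, 12) value at the 0.01 level quoted in the slides.

```python
from statistics import mean

def one_way_f(groups):
    """F ratio of a one-way ANOVA: between-group MS over within-group MS."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand = sum(sum(g) for g in groups) / n_total
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))

def levene_w(groups):
    """Levene statistic: one-way F on |x - group mean| (mean-centred variant)."""
    abs_dev = [[abs(x - mean(g)) for x in g] for g in groups]
    return one_way_f(abs_dev)

machines = [
    [47, 53, 49, 50, 46],
    [55, 54, 58, 61, 52],
    [54, 50, 51, 51, 49],
]

w = levene_w(machines)
CRITICAL_F = 6.93  # F(2, 12) at alpha = 0.01, as quoted in the slides
print(f"W = {w:.3f}; equal variances rejected: {w > CRITICAL_F}")
```

For these data W is about 1.39, well below the critical value, so the equal-variance assumption is not rejected.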

ANOVA

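$$\sum_{j=1}^{k}\sum_{i=1}^{n_j}(x_{ij} - \bar{x}_{..})^2 = \sum_{j=1}^{k}\sum_{i=1}^{n_j}\left[(x_{ij} - \bar{x}_{.j}) + (\bar{x}_{.j} - \bar{x}_{..})\right]^2$$

$$= \sum_{j=1}^{k}\sum_{i=1}^{n_j}(x_{ij} - \bar{x}_{.j})^2 + 2\sum_{j=1}^{k}\sum_{i=1}^{n_j}(x_{ij} - \bar{x}_{.j})(\bar{x}_{.j} - \bar{x}_{..}) + \sum_{j=1}^{k}\sum_{i=1}^{n_j}(\bar{x}_{.j} - \bar{x}_{..})^2$$

$$= \sum_{j=1}^{k}\sum_{i=1}^{n_j}(x_{ij} - \bar{x}_{.j})^2 + 2\sum_{j=1}^{k}(\bar{x}_{.j} - \bar{x}_{..})\sum_{i=1}^{n_j}(x_{ij} - \bar{x}_{.j}) + \sum_{j=1}^{k}n_j(\bar{x}_{.j} - \bar{x}_{..})^2$$

The middle term is zero because $\sum_{i=1}^{n_j}(x_{ij} - \bar{x}_{.j}) = 0$ for every $j$, so

$$\sum_{j=1}^{k}\sum_{i=1}^{n_j}(x_{ij} - \bar{x}_{..})^2 = \sum_{j=1}^{k}\sum_{i=1}^{n_j}(x_{ij} - \bar{x}_{.j})^2 + \sum_{j=1}^{k}n_j(\bar{x}_{.j} - \bar{x}_{..})^2$$

- Left-hand side: total sum of squares, SST (variation)
- First term on the right: SS for error, SSE (within)
- Second term on the right: SS for treatment, SSTr (between)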
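The partition of the total sum of squares can be checked numerically. The following sketch evaluates each term of the decomposition for the three-machine data from these slides; the variable names are illustrative.

```python
machines = [
    [47, 53, 49, 50, 46],   # Machine 1
    [55, 54, 58, 61, 52],   # Machine 2
    [54, 50, 51, 51, 49],   # Machine 3
]

n_total = sum(len(g) for g in machines)
grand_mean = sum(sum(g) for g in machines) / n_total
group_means = [sum(g) / len(g) for g in machines]

# Total variation: every observation against the grand mean
ss_total = sum((x - grand_mean) ** 2 for g in machines for x in g)

# Within (error): every observation against its own group mean
ss_error = sum((x - m) ** 2 for g, m in zip(machines, group_means) for x in g)

# Between (treatment): group means against the grand mean, weighted by n_j
ss_treatment = sum(len(g) * (m - grand_mean) ** 2
                   for g, m in zip(machines, group_means))

# Cross-product term of the expansion, which should vanish
cross_term = 2 * sum((x - m) * (m - grand_mean)
                     for g, m in zip(machines, group_means) for x in g)

print(ss_total, ss_error, ss_treatment, cross_term)
```

For these data the total (224) equals the within part (94) plus the between part (130), and the cross-product term is zero.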

Sum of squares for treatment

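$$SSTr = \sum_{j=1}^{k} n_j(\bar{x}_{.j} - \bar{x}_{..})^2$$

$$MSTr = \frac{SSTr}{k - 1} = \frac{\sum_{j=1}^{k} n_j(\bar{x}_{.j} - \bar{x}_{..})^2}{k - 1}$$

With equal sample sizes $n$, $MSTr = n s_{\bar{x}}^2$, where $s_{\bar{x}}^2$ is the sample variance of the $k$ sample means.

Is the total variability of the data explained by the underlying population means? Test the hypothesis
$H_0: \mu_1 = \mu_2 = \dots = \mu_k$ against $H_1$: not all $\mu_j$ are equal.

SSTr measures the proximity of the sample means to each other. If the sample means are remarkably different, this measure will be large.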

Sum of squares for treatment


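$$SSTr = n\sum_{j=1}^{k}(\bar{x}_{.j} - \bar{x}_{..})^2 = 5 \times 26 = 130$$

$$MSTr = \frac{SSTr}{k - 1} = \frac{130}{2} = 65$$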

Sum of squares for error

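$$SSE = \sum_{j=1}^{k}\sum_{i=1}^{n_j}(x_{ij} - \bar{x}_{.j})^2 = \sum_{i=1}^{n_1}(x_{i1} - \bar{x}_{.1})^2 + \sum_{i=1}^{n_2}(x_{i2} - \bar{x}_{.2})^2 + \dots + \sum_{i=1}^{n_k}(x_{ik} - \bar{x}_{.k})^2$$

$$= (n_1 - 1)\frac{\sum_{i=1}^{n_1}(x_{i1} - \bar{x}_{.1})^2}{n_1 - 1} + (n_2 - 1)\frac{\sum_{i=1}^{n_2}(x_{i2} - \bar{x}_{.2})^2}{n_2 - 1} + \dots + (n_k - 1)\frac{\sum_{i=1}^{n_k}(x_{ik} - \bar{x}_{.k})^2}{n_k - 1}$$

$$= (n_1 - 1)s_1^2 + (n_2 - 1)s_2^2 + \dots + (n_k - 1)s_k^2$$

With equal sample sizes $n$, this is $SSE = (n - 1)\sum_{j=1}^{k}s_j^2$.

$$MSE = \frac{SSE}{N - k} = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2 + \dots + (n_k - 1)s_k^2}{(n_1 - 1) + (n_2 - 1) + \dots + (n_k - 1)} = s_p^2$$

$s_p^2$ is the pooled estimate of the common population variance.

When the total variability is explained mainly by SSE, this measure will be large.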

Sum of squares for error


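$$SSE = \sum_{j=1}^{k}\sum_{i=1}^{n_j}(x_{ij} - \bar{x}_{.j})^2 = 30 + 50 + 14 = 94$$

$$MSE = \frac{SSE}{N - k} = \frac{94}{12} = 7.83$$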

F distribution

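Independent chi-squared variables

$$\frac{(n_1 - 1)s_1^2}{\sigma_1^2} \sim \chi^2_{n_1 - 1}, \qquad \frac{(n_2 - 1)s_2^2}{\sigma_2^2} \sim \chi^2_{n_2 - 1}$$

Divide each chi-squared variable by its d.f.

$$\frac{(n_1 - 1)s_1^2 / \sigma_1^2}{n_1 - 1} = \frac{s_1^2}{\sigma_1^2}, \qquad \frac{(n_2 - 1)s_2^2 / \sigma_2^2}{n_2 - 1} = \frac{s_2^2}{\sigma_2^2}$$

The ratio of those two is F distributed with $(n_1 - 1, n_2 - 1)$ d.f.

$$F = \frac{s_1^2 / \sigma_1^2}{s_2^2 / \sigma_2^2}$$

$H_0: \sigma_1^2 = \sigma_2^2$ and $H_1: \sigma_1^2 \neq \sigma_2^2$. Under $H_0$ the ratio reduces to $F = \frac{s_1^2}{s_2^2} > 0$.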
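As a small illustration of the variance ratio, the sketch below computes the two sample variances and their ratio for the two machines of the two-sample example. The variable names are illustrative, and no critical value is asserted here because the slides do not quote one for this comparison.

```python
from statistics import variance

machine_1 = [55, 54, 58, 61, 52]
machine_2 = [47, 53, 49, 50, 46]

# Sample variances (divisor n - 1)
s2_1 = variance(machine_1)
s2_2 = variance(machine_2)

# Under H0 (equal population variances) the F statistic is the ratio
# of the two sample variances
f_ratio = s2_1 / s2_2
df_pair = (len(machine_1) - 1, len(machine_2) - 1)  # numerator, denominator d.f.

print(f"s1^2 = {s2_1}, s2^2 = {s2_2}, F = {f_ratio:.4f}, d.f. = {df_pair}")
```

The sample variances are 12.5 and 7.5, giving a ratio of about 1.67 on (4, 4) d.f.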

F distribution MSE and MSTr


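- MSE is the estimate of the common variance $\sigma^2$ (the pooled variance $s_p^2$).
- MSTr estimates $\sigma^2$ only when $H_0$ is true; when the population means differ, it is inflated by the variability between treatments. With equal sample sizes, $MSTr = n s_{\bar{x}}^2$.
- Is the total variability mainly explained by the treatment or by all sources other than the treatment?

$$F = \frac{MSTr}{MSE} = \frac{n s_{\bar{x}}^2}{s_p^2}$$

$H_0: \mu_1 = \mu_2 = \dots = \mu_k$ and $H_1$: not all $\mu_j$ are equal

If $H_0$ is false, $MSTr$ will be relatively large compared with $MSE$, so $F > 1$. Therefore, the larger the computed $F$ is compared with the critical $F$, the less credible the null hypothesis is.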

F distribution MSE and MSTr


$$F = \frac{MSTr}{MSE} = \frac{65}{7.83} = 8.3$$

With numerator d.f. of 2, denominator d.f. of 12, and $\alpha = 0.01$:
- Critical F = 6.93
- Computed F > critical F
- Reject $H_0$
- The three machines are really different
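The same F test can be applied to all three data sets in these slides: three samples from one machine, three different machines, and three erratic machines. The sketch below uses the equal-sample-size shortcut (n times the variance of the sample means, divided by the average of the sample variances). The function name `f_equal_n` is an assumption for this example, and the assignment of the erratic-machine values to machines is inferred from the stated sample means of 49, 56, and 51.

```python
from statistics import mean, variance

def f_equal_n(groups):
    """F = n * var(sample means) / mean(sample variances); equal n only."""
    n = len(groups[0])
    if any(len(g) != n for g in groups):
        raise ValueError("this shortcut assumes equal sample sizes")
    ms_treatment = n * variance([mean(g) for g in groups])
    ms_error = mean([variance(g) for g in groups])
    return ms_treatment / ms_error

data_sets = {
    "three samples, one machine": [
        [49, 55, 51, 52, 48], [52, 51, 55, 58, 49], [55, 51, 52, 52, 50]],
    "three different machines": [
        [47, 53, 49, 50, 46], [55, 54, 58, 61, 52], [54, 50, 51, 51, 49]],
    # Column split inferred from the stated means (49, 56, 51)
    "three erratic machines": [
        [50, 42, 53, 45, 55], [48, 57, 65, 59, 51], [57, 59, 48, 46, 45]],
}

CRITICAL_F = 6.93  # F(2, 12) at alpha = 0.01, as quoted in the slides

f_values = {name: f_equal_n(groups) for name, groups in data_sets.items()}
for name, f in f_values.items():
    print(f"{name}: F = {f:.2f}, reject H0: {f > CRITICAL_F}")
```

Only the three different machines exceed the critical value. The erratic machines have the same sample means, but their large within-sample variance keeps F small.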

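ANOVA table ($n_1 = n_2 = \dots = n_k = n$)

| Source of variation | Sum of squares (SS) | d.f. | Mean square (MS) | F ratio |
|---|---|---|---|---|
| Treatment: differences between treatments | $SSTr = n\sum_{j=1}^{k}(\bar{x}_{.j} - \bar{x}_{..})^2$ | $k - 1$ | $MSTr = \frac{SSTr}{k - 1} = n s_{\bar{x}}^2$ | $F = \frac{MSTr}{MSE} = \frac{n s_{\bar{x}}^2}{s_p^2}$ |
| Residual: differences within treatment | $SSE = \sum_{j=1}^{k}\sum_{i=1}^{n}(x_{ij} - \bar{x}_{.j})^2$ | $nk - k$ | $MSE = \frac{SSE}{nk - k} = s_p^2$ | |
| Total | $SST = \sum_{j=1}^{k}\sum_{i=1}^{n}(x_{ij} - \bar{x}_{..})^2$ | $nk - 1$ | | |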

ANOVA table ($n_1 = n_2 = n_3 = 5$)

| Source of variation | Sum of squares (SS) | d.f. | Mean square (MS) | F ratio |
|---|---|---|---|---|
| Treatment: differences between machines | 130 | 2 | 65 | 65 / 7.83 = 8.3 |
| Residual: differences within machine | 94 | 12 | 7.83 | |
| Total | 224 | 14 | | |

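ANOVA table ($n_1 \neq n_2 \neq \dots \neq n_k$)

| Source of variation | Sum of squares (SS) | d.f. | Mean square (MS) | F ratio |
|---|---|---|---|---|
| Treatment: differences between treatments | $SSTr = \sum_{j=1}^{k}n_j(\bar{x}_{.j} - \bar{x}_{..})^2$ | $k - 1$ | $MSTr = \frac{SSTr}{k - 1}$ | $F = \frac{MSTr}{MSE}$ |
| Residual: differences within treatment | $SSE = \sum_{j=1}^{k}\sum_{i=1}^{n_j}(x_{ij} - \bar{x}_{.j})^2$ | $N - k$ | $MSE = \frac{SSE}{N - k}$ | |
| Total | $SST = \sum_{j=1}^{k}\sum_{i=1}^{n_j}(x_{ij} - \bar{x}_{..})^2$ | $N - 1$ | | |

$N = n_1 + n_2 + \dots + n_k$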
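A one-way ANOVA table for samples of unequal size can be assembled directly from the definitions of the sums of squares. The sketch below is one possible implementation; the function name `one_way_anova`, the dictionary layout, and the small data set are made up for illustration and do not come from the slides.

```python
def one_way_anova(groups):
    """Build the one-way ANOVA table; group sizes may differ."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)              # N = n_1 + ... + n_k
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]

    ss_tr = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_e = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    ms_tr = ss_tr / (k - 1)
    ms_e = ss_e / (n_total - k)

    return {
        "treatment": {"SS": ss_tr, "df": k - 1, "MS": ms_tr, "F": ms_tr / ms_e},
        "residual": {"SS": ss_e, "df": n_total - k, "MS": ms_e},
        "total": {"SS": ss_tr + ss_e, "df": n_total - 1},
    }

# Hypothetical data with unequal sample sizes (2, 3 and 2 observations)
example = [[1, 3], [3, 4, 5], [5, 7]]

table = one_way_anova(example)
for source, row in table.items():
    print(source, row)
```

For this made-up example the treatment row has SS = 16 on 2 d.f., the residual row has SS = 6 on 4 d.f., and F = 8 / 1.5, about 5.33.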

Computational formula
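| Population sampled | 1 | 2 | 3 | ... | k | Total |
|---|---|---|---|---|---|---|
| | $x_{11}$ | $x_{12}$ | $x_{13}$ | ... | $x_{1k}$ | |
| | $x_{21}$ | $x_{22}$ | $x_{23}$ | ... | $x_{2k}$ | |
| | ... | ... | ... | ... | ... | |
| | $x_{n_1 1}$ | $x_{n_2 2}$ | $x_{n_3 3}$ | ... | $x_{n_k k}$ | |
| Total | $T_{.1}$ | $T_{.2}$ | $T_{.3}$ | ... | $T_{.k}$ | $T_{..}$ |
| Mean | $\bar{x}_{.1}$ | $\bar{x}_{.2}$ | $\bar{x}_{.3}$ | ... | $\bar{x}_{.k}$ | $\bar{x}_{..}$ |

$$T_{.j} = \sum_{i=1}^{n_j}x_{ij}, \qquad \bar{x}_{.j} = \frac{T_{.j}}{n_j}, \qquad T_{..} = \sum_{j=1}^{k}T_{.j} = \sum_{j=1}^{k}\sum_{i=1}^{n_j}x_{ij}, \qquad \bar{x}_{..} = \frac{T_{..}}{N}, \qquad N = \sum_{j=1}^{k}n_j$$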

Computational formula

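$$SST = \sum_{j=1}^{k}\sum_{i=1}^{n_j}x_{ij}^2 - \frac{T_{..}^2}{N}$$

The term $\frac{T_{..}^2}{N}$ appears in each formula. We call it C.

$$SSTr = \frac{T_{.1}^2}{n_1} + \frac{T_{.2}^2}{n_2} + \dots + \frac{T_{.k}^2}{n_k} - C = \sum_{j=1}^{k}\frac{T_{.j}^2}{n_j} - C$$

$$SSTr = \frac{\sum_{j=1}^{k}T_{.j}^2}{n} - C \quad \text{when all samples are equal in size}$$

$$SSE = SST - SSTr$$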
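The computational formulas work from column totals rather than from deviations. The sketch below applies them to the three-machine data from these slides; the variable names are illustrative.

```python
machines = [
    [47, 53, 49, 50, 46],   # Machine 1
    [55, 54, 58, 61, 52],   # Machine 2
    [54, 50, 51, 51, 49],   # Machine 3
]

totals = [sum(g) for g in machines]            # column totals T.j
grand_total = sum(totals)                      # T..
n_total = sum(len(g) for g in machines)        # N
correction = grand_total ** 2 / n_total        # C = T..^2 / N

# SST: sum of squared observations minus C
ss_total = sum(x * x for g in machines for x in g) - correction

# SSTr: sum of T.j^2 / n_j minus C
ss_treatment = sum(t * t / len(g) for t, g in zip(totals, machines)) - correction

# SSE by subtraction
ss_error = ss_total - ss_treatment

print(totals, grand_total, correction)
print(ss_total, ss_treatment, ss_error)
```

The column totals are 245, 280 and 255, C is 40560, and the resulting sums of squares (224, 130 and 94) agree with the values obtained from the deviation formulas.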

t tests or ANOVA
| Number of populations | t test | ANOVA |
|---|---|---|
| 2 ($\mu_1, \mu_2$) | $\mu_1 = \mu_2$ | $\mu_1 = \mu_2$ |
| 3 ($\mu_1, \mu_2, \mu_3$) | $\mu_1 = \mu_2$, $\mu_2 = \mu_3$, $\mu_1 = \mu_3$ | $\mu_1 = \mu_2 = \mu_3$ |
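The comparison above can be made concrete by listing the pairwise null hypotheses that separate t tests would require for k populations, against the single hypothesis tested by ANOVA. This is a small sketch; the function names `pairwise_hypotheses` and `anova_hypothesis` are hypothetical.

```python
from itertools import combinations

def pairwise_hypotheses(k):
    """Null hypotheses needed to compare k population means two at a time."""
    return [f"mu_{a} = mu_{b}" for a, b in combinations(range(1, k + 1), 2)]

def anova_hypothesis(k):
    """The single null hypothesis tested by one-way ANOVA."""
    return " = ".join(f"mu_{j}" for j in range(1, k + 1))

for k in (2, 3):
    print(f"k = {k}: t tests -> {pairwise_hypotheses(k)}; "
          f"ANOVA -> {anova_hypothesis(k)}")
```

The number of pairwise tests is k(k - 1)/2, so it grows quickly with k, while ANOVA always tests one hypothesis.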
