
The ANOVA Table



The ANOVA (Analysis of Variance) table gives us the
following information:

1. Degrees Of Freedom
2. The Sum Of The Squares
3. The Mean Square
4. The F ratio
5. The p-value

A typical ANOVA table will look as follows:

Source    df    Sum of     Mean      F Value    Prob > F
                Squares    Square
Model     1     4.9000     4.9000    13.364     0.0354
Error     3     1.1000     0.3667
Total     4     6.0000

The value of each Mean Square is determined by:

$MS = \frac{SS}{df}$

i.e. $MS_{Model} = \frac{SS_{Model}}{df_{Model}}$

$MS_{Error} = \frac{SS_{Error}}{df_{Error}}$

The F-ratio is determined by:

$F = \frac{MS_{Model}}{MS_{Error}}$

So in the case above, $F = \frac{4.900}{0.3667} = 13.364$


When we have n observations and k predictor variables, we will
always fill in our degrees of freedom column as below:

          df
Model     k
Error     n - (k+1)
Total     n - 1

In simple linear regression, where there is just 1 predictor variable,
our degrees of freedom are as follows:

          df          df
Model     k         = 1
Error     n - (k+1) = n - 2
Total     n - 1     = n - 1

The reason we split our ANOVA table into rows for Model, Error
and Total is to examine how much error we have when we use our
predictive equation, and to determine how much error has
disappeared because we used our predictive equation.

If we do not use our equation, and therefore use only the average
of the y variable ($\bar{y}$) as an estimate of y, the scatterplot will
look something like this.

[Figure: scatterplot showing the deviation of each point from the
horizontal line at $\bar{y}$]


The sum of the squared differences between our actual value y and
the average of y, $\bar{y}$ (shown above), gives us $SS_{Total}$. This is
exactly the same value as $SS_{yy}$, the sum of the squared deviations
in y.

If we do use our equation to get an estimate of y, the scatterplot
will look something like this.

[Figure: scatterplot showing the deviation of each point from the
fitted regression line]

The sum of the squared differences between our actual value y and
the prediction of y, $\hat{y}$ (shown above), gives us $SS_{Error}$. This
tells us how much of our Total error we are left with once we do use
our predictive equation.

The error that disappeared between these stages is our $SS_{Model}$.
This measures the error that we avoid having when we do use our
predictive equation.
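
The decomposition above can be checked numerically. The following is a
minimal sketch in Python. The five (x, y) pairs are an assumption: the
notes do not list the underlying data, and these values were chosen
only because they reproduce the ANOVA table shown earlier.

```python
# Hypothetical data (not given in the notes), chosen to match the ANOVA table.
x = [1, 2, 3, 4, 5]
y = [1, 1, 2, 2, 4]
n = len(x)
k = 1  # number of predictor variables

x_bar = sum(x) / n
y_bar = sum(y) / n

# Least squares slope and intercept for simple linear regression.
ss_xx = sum((xi - x_bar) ** 2 for xi in x)
ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b1 = ss_xy / ss_xx
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * xi for xi in x]

# SS Total: error when predicting every y with the average of y.
ss_total = sum((yi - y_bar) ** 2 for yi in y)
# SS Error: error left over when predicting with the fitted line.
ss_error = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
# SS Model: the error that disappeared.
ss_model = ss_total - ss_error

ms_model = ss_model / k
ms_error = ss_error / (n - (k + 1))
f_ratio = ms_model / ms_error

print(f"SS Model = {ss_model:.4f}, SS Error = {ss_error:.4f}, SS Total = {ss_total:.4f}")
print(f"MS Model = {ms_model:.4f}, MS Error = {ms_error:.4f}, F = {f_ratio:.3f}")
```

With these assumed data the three sums of squares come out as 4.9, 1.1
and 6.0, matching the Model, Error and Total rows of the table.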

An ANOVA Table is often followed by a table of parameter
estimates. This table will be set up as follows:

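Variable     Param              Stndrd                 T for H0:                                  Prob>|T|
             Estim              Error                  Param=0
Intercept    $\hat{\beta}_0$    $s_{\hat{\beta}_0}$    $t = \hat{\beta}_0 / s_{\hat{\beta}_0}$    p-value
$x_1$        $\hat{\beta}_1$    $s_{\hat{\beta}_1}$    $t = \hat{\beta}_1 / s_{\hat{\beta}_1}$    p-value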

This can be seen in the table below:

Variable     Param      Stndrd    T for H0:    Prob>|T|
             Estim      Error     Param=0
Intercept    -0.1000    0.6351    -0.157       0.8849
x            0.7000     0.1915    3.656        0.0354

Using our column of parameter estimates, we know the least
squares line will be:

$\hat{y} = -0.1 + 0.7x$

Our final column gives us the p-value to test if our parameter is
significantly different from 0. We can see that our y-intercept has a
large p-value (.8849) and so we cannot reject $H_0: \beta_0 = 0$. However,
our slope has a small p-value (.0354) and so we would reject
$H_0: \beta_1 = 0$ at the 5% significance level. A slope significantly
different from 0 again means that x is useful in predicting y.
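
As a small illustration of how the "T for H0" column is formed, the
sketch below divides each parameter estimate by its standard error,
using the rounded values from the table above. The dictionary names are
hypothetical, and because the inputs are rounded the results can differ
from the printed table in the last digit.

```python
# (estimate, standard error) pairs copied from the parameter estimates table.
estimates = {
    "Intercept": (-0.1000, 0.6351),
    "x": (0.7000, 0.1915),
}

# t statistic for H0: parameter = 0 is the estimate divided by its standard error.
t_stats = {name: est / se for name, (est, se) in estimates.items()}

for name, t in t_stats.items():
    print(f"{name}: t = {t:.3f}")
```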

Often we are only really interested in testing the significance of the
sample coefficients of the predictor variables. There is often very
little relevance in testing the significance of our estimate of the
y-intercept, since even if our y-intercept is significantly different
from zero, it really doesn't tell us anything about the usefulness of
the model.

Testing The Overall Model

We saw that our ANOVA table also included an F-statistic. The
F-test is a test to determine the overall significance of the model,
and not just of one individual coefficient. In simple linear
regression, the F-test and the t-test for our sole predictor variable
will result in the same conclusion. However, in multiple regression
the tests will tell us different things.

The hypotheses being tested by the F-test in simple regression are:
$H_0: \beta_1 = 0$
$H_1: \beta_1 \neq 0$
In the case of simple regression analysis, $F = t^2$. The F value can be
computed directly by:

$F = \frac{SS_{model}/df_{model}}{SS_{error}/df_{error}} = \frac{MS_{model}}{MS_{error}}$


If we wish to see if our F-statistic is significant we can look at an
F-table (Table A7 in textbook) to find our critical value, where our
numerator degrees of freedom are the degrees of freedom for the
model and our denominator degrees of freedom are the degrees of
freedom for error.
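
As a quick check of the relationship between F and t in simple
regression, the rounded values from the two tables shown earlier
(t = 3.656 for the slope and F = 13.364) can be compared. Because both
are rounded, the agreement is only approximate:

```latex
t^2 = (3.656)^2 \approx 13.366 \approx F = 13.364
```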

More About The F-Distribution

Several properties of the F-distribution are as follows:

1. The F-distribution is a family of curves, each of which is
determined by two types of degrees of freedom: the degrees of
freedom corresponding to the numerator ($\nu_1$) and the degrees of
freedom corresponding to the denominator ($\nu_2$)
2. F-distributions are positively skewed
3. The total area under each curve of an F-distribution is equal to 1
4. F-values are always greater than or equal to zero
5. The mean value of F is $\nu_2/(\nu_2 - 2)$ (for $\nu_2 > 2$), which is
approximately equal to 1 when $\nu_2$ is large

Note that, since all F-values are greater than or equal to zero, all
tests using the F-distribution are effectively one-tailed tests that
test for a difference between parameters.



Finding Critical Values of the F-distribution

If $\nu_1 = 4$, $\nu_2 = 14$ and $\alpha = .05$:
From Table A7, $F_{4,14}(.05) = 3.11$

If $\nu_1 = 8$, $\nu_2 = 20$ and $\alpha = .10$:
From Table A7, $F_{8,20}(.10) = 2.00$

If $\nu_1 = 2$, $\nu_2 = 4$ and $\alpha = .01$:
From Table A7, $F_{2,4}(.01) = 18.00$

In all situations we would then compare our test statistic with the
critical value. Recall that if our test statistic is larger than our
critical value we reject our null hypothesis.

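
The table look-ups above can be approximated without a printed table.
The sketch below is one possible approach, not the method used in the
textbook: it integrates the F density numerically (Simpson's rule) and
then searches for the value with upper-tail area alpha by bisection. The
helper names `f_pdf`, `f_cdf` and `f_critical` are hypothetical, and the
code assumes the numerator degrees of freedom are at least 2. In
practice a statistics library routine (for example SciPy's
`scipy.stats.f.ppf`) would normally be preferred.

```python
import math

def f_pdf(x, d1, d2):
    """Density of the F-distribution with d1 and d2 degrees of freedom (d1 >= 2)."""
    if x <= 0.0:
        # Limit of the density at zero: 1 when d1 == 2, 0 when d1 > 2.
        return 1.0 if d1 == 2 else 0.0
    log_beta = math.lgamma(d1 / 2) + math.lgamma(d2 / 2) - math.lgamma((d1 + d2) / 2)
    log_pdf = ((d1 / 2) * math.log(d1 / d2)
               + (d1 / 2 - 1) * math.log(x)
               - ((d1 + d2) / 2) * math.log(1 + d1 * x / d2)
               - log_beta)
    return math.exp(log_pdf)

def f_cdf(c, d1, d2, steps=2000):
    """P(F <= c), approximated with Simpson's rule on [0, c] (steps must be even)."""
    h = c / steps
    total = f_pdf(0.0, d1, d2) + f_pdf(c, d1, d2)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f_pdf(i * h, d1, d2)
    return total * h / 3

def f_critical(d1, d2, alpha, upper=100.0):
    """Value with upper-tail area alpha, found by bisection on [0, upper]."""
    lo, hi = 0.0, upper
    for _ in range(40):
        mid = (lo + hi) / 2
        if f_cdf(mid, d1, d2) < 1 - alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

for d1, d2, alpha in [(4, 14, 0.05), (8, 20, 0.10), (2, 4, 0.01)]:
    print(f"F({d1},{d2}) at alpha={alpha}: {f_critical(d1, d2, alpha):.2f}")
```

The three cases in the loop are the same three look-ups worked above, so
the printed values can be compared directly with Table A7.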
Section 2: Multiple Regression Analysis and Model Building

Models that include more than one independent variable are called
multiple regression models. The general form of these models is:
$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + \varepsilon$
where:
y is the dependent variable
$x_1, x_2, \dots, x_k$ are the independent variables
$\beta_i$ determines the contribution of the independent variable $x_i$

As before, $\beta_0$ is the y-intercept. The coefficients
$\beta_0, \beta_1, \dots, \beta_k$ are usually unknown because they
represent population parameters, and so we generally have to find
estimates of them from our sample.

In forming our model, we have to follow the same Five Step Plan
as we did with Simple Linear Regression.

Brief Summary

Step 1: Decide which model we wish to use

Now that we have more than one predictor variable we will have
more complicated models. If we are trying to predict values of our
dependent variable at different levels of 3 different independent
variables, our model would be of the form:
$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \varepsilon$

We will still try to find the Least Squares Line through our points,
that is, the line that minimizes $SSE = \sum (y - \hat{y})^2$.

Step 2: Use sample data to estimate unknown parameters

In multiple regression, whilst it is still possible to estimate all our
different parameters by hand, in actuality the process is tedious and
time consuming and so we will always be given ANOVA tables
with the appropriate estimates.

Step 3: Specify the probability distribution of the random
error term and estimate the standard deviation of this
distribution.

Our estimator of $\sigma^2$ for a multiple regression model with k
independent variables is:

$s^2 = \frac{SSE}{n - (k+1)} = MSE$

Note that in the ANOVA table, this is the same as our mean
squared error. Also, the assumptions we made about the distribution
of $\varepsilon$ in simple linear regression will again apply.

Step 4: Evaluate how useful the model is

If we want to test the usefulness of a particular term in our model,
we would perform a t-test and look at the p-value for that term.
However, if we wanted to test whether any of the terms in our
model are useful in predicting y we would use the F-test.

The F-test is a test of the hypothesis:
$H_0: \beta_1 = \beta_2 = \dots = \beta_k = 0$
$H_1$: At least one of the coefficients is nonzero

Note that our $H_0$ will always include all of our parameters except
our y-intercept $\beta_0$.

Note that this test has a general set-up of:
$H_0$: None of the explanatory variables are helping
$H_1$: At least one of the explanatory variables is helping

which shares the general format seen throughout the last couple of
chapters of:
$H_0$: Model not useful
$H_1$: Model useful

Once we know the test statistic of our F-test, we will often want to
determine whether it is significant. As in all our tests, if our test
statistic is more extreme (i.e. greater) than our critical value, we
reject $H_0$.

By rejecting $H_0$ we are saying that our model is significantly better
than just estimating y with $\bar{y}$. However, we are not necessarily
saying that the model we have is the best model we could find. In
real life, once we have found our model to be useful, we often try
to fine-tune it by adding more independent variables or higher-
order terms. We also often look at each term currently in our
model to see which are individually significant.

Step 5: Using The Model For Estimation And Prediction

Calculating confidence or prediction intervals by hand in multiple
regression is incredibly tedious and time consuming. We can
however request them with our computer output and interpret their
meaning in the same way as before.




Example

The department head of a University's Accounting department
wanted to see if she could predict the GPA of students using the
number of credit hours and total SAT scores of each student. She
takes a sample of students and generates the following Excel
output:

              df    SS        MS        F         p-value
Regression    2     1.4468    0.7234    9.7286    0.0488
Residual      3     0.2231    0.0744
Total         5     1.6698

              Coefficients    Stan Error    t-stat     p-value
Intercept     5.6357          2.1045        2.6779     0.0752
Credit H      -0.3155         0.0770        -4.0974    0.0263
SAT Tot       0.0014          0.0014        0.9999     0.3923

Q. The standard deviation of the errors is closest to?

A. $s^2 = \frac{SSE}{n - (k+1)} = MSE = .0744$

Therefore $s = \sqrt{.0744} = .2728$


Q. What is the predicted GPA of an accounting student with 12
credit hours and a 1200 SAT total score?
A. Let y represent student's GPA
$x_1$ represent number of credit hours
$x_2$ represent SAT total score
Our model is of the form:
$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon$
So using our parameter estimates we get:

$\hat{y} = 5.6357 + (-0.3155)x_1 + (0.0014)x_2$

So when $x_1 = 12$ and $x_2 = 1200$:

$\hat{y} = 5.6357 + (-0.3155)(12) + (0.0014)(1200)$
= 3.53
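
The same prediction can be written as a small function. This is a
minimal sketch using the rounded coefficients from the Excel output
above; the function name `predict_gpa` is hypothetical.

```python
def predict_gpa(credit_hours, sat_total):
    """Predicted GPA from the fitted equation, using the rounded table coefficients."""
    b0, b1, b2 = 5.6357, -0.3155, 0.0014
    return b0 + b1 * credit_hours + b2 * sat_total

gpa = predict_gpa(12, 1200)
print(f"Predicted GPA: {gpa:.2f}")

# Change in predicted GPA for a 100-point SAT increase, credit hours held constant.
sat_effect = predict_gpa(12, 1300) - predict_gpa(12, 1200)
print(f"Effect of +100 SAT points: {sat_effect:.2f}")
```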

Other questions that could be asked are:
- Is the overall regression model useful at a 5% level of
  significance?
- Which of the explanatory variables are useful at a 5% level of
  significance?
- By how much would we predict that a student's GPA would
  increase based on a 100-point increase in SAT Total score,
  holding Credit Hours constant?
- What proportion of the variability in a student's GPA is
  explained by using SAT Total and Credit Hours in a
  regression model?
- When using a regression model that predicts student GPA
  based on SAT Total and Credit Hours, we would expect
  approximately 95% of the data to fall within what distance of
  the regression line?

Q. Is the overall regression model useful at a 5% level of
significance?

A. The table that comments on the entire regression model as a
whole (rather than breaking it down variable-by-variable) is the top
table and the test we are interested in is the F-test which tests the
following hypotheses:
$H_0$: None of the explanatory variables are helping
$H_1$: At least one of the explanatory variables is helping
Since the p-value of this test is less than alpha (.0488 < .05) we
reject $H_0$ and conclude that the regression model is useful.

Q. Which of the explanatory variables are useful at a 5% level of
significance?

A. Testing whether the coefficient of Credit Hours is equal to zero
gives us a p-value of .0263 (see table). So we can say that, as our
model stands, credit hours is significantly linearly related to (and is
a useful predictor of) GPA (since .0263 < .05).

Doing the same test with the coefficient of SAT Total Score gives
us a p-value of .3923, which is not significant. So, as our model
stands, SAT Total Score does not have a significant linear
relationship with GPA.

Q. By how much would we predict that a student's GPA would
increase based on a 100-point increase in SAT Total score, holding
Credit Hours constant?

A. The definition of the slope is that for every one-unit increase in
x, we predict that y will increase by the coefficient of x, holding all
other explanatory variables constant. Therefore if SAT Total
increases by 100, we predict that y (Student GPA) will increase by
0.14 (100 times the coefficient of SAT Total), holding credit hours
constant.

Q. What proportion of the variability in a student's GPA is
explained by using SAT Total and Credit Hours in a regression
model?

A. This is the definition of R-Squared.

$R^2 = \frac{\text{Amount of error that disappeared}}{\text{Total amount started with}} = \frac{1.4468}{1.6698} = 0.8665$

Q. When using a regression model that predicts student GPA based
on SAT Total and Credit Hours, we would expect approximately
95% of the data to fall within what distance of the regression line?

A. The empirical rule says that we expect approximately 95% of
the data to fall within two standard deviations of the mean. In the
context of regression, this tells us that we expect approximately
95% of the data to fall within two standard deviations of the
regression line.

Since s = .2728, we expect 95% of our predictions for student GPA
to fall within .5456 of the actual value.
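
The summary figures used in these answers can be recomputed from the
ANOVA table. The sketch below uses the rounded sums of squares from the
Excel output, with n = 6 and k = 2 inferred from the degrees of freedom
column (Total df = n - 1 = 5, Regression df = k = 2). Small differences
from the worked answers are due to rounding.

```python
import math

# Values copied from the ANOVA table of the GPA example.
sse, ss_model, ss_total = 0.2231, 1.4468, 1.6698
n, k = 6, 2  # inferred from the df column: total df = n - 1, regression df = k

mse = sse / (n - (k + 1))        # mean squared error, the estimate of sigma^2
s = math.sqrt(mse)               # standard deviation of the errors
r_squared = ss_model / ss_total  # proportion of variability explained
two_s = 2 * s                    # approximate 95% distance from the regression line

print(f"MSE = {mse:.4f}, s = {s:.4f}, R^2 = {r_squared:.4f}, 2s = {two_s:.4f}")
```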

The Coefficient Of Determination

Recall that $R^2$ measures the proportion of variation in y that can be
explained by using x to predict y. It can be calculated as:

$R^2 = \frac{SS_{yy} - SSE}{SS_{yy}}$

Another way to calculate $R^2$ if we are given an ANOVA table is by
dividing the sum of squares for the model by the total sum of
squares as follows:

$R^2 = \frac{\text{Explained variability}}{\text{Total variability}} = \frac{SS_{MODEL}}{SS_{TOTAL}}$

The calculation of $R^2$ does not involve any adjustment for degrees
of freedom. As a result of this, we could add an irrelevant term to
our model and $R^2$ would never decrease and in almost all cases it
would even increase. Because of this, there is a tendency for $R^2$ to
be too large. This bias can be removed by calculating instead an
adjusted $R^2$, using the formula below.

Adjusted $R^2 = 1 - \frac{SSE/(n - k - 1)}{SS_{total}/(n - 1)}$

The gap between the $R^2$ and the adjusted $R^2$ tends to increase as
non-significant independent variables are added to the regression
model. As n increases, the difference between the $R^2$ and the
adjusted $R^2$ becomes less.
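
To illustrate the size of the adjustment, the sketch below applies the
standard adjusted R-squared formula, 1 - [SSE/(n - k - 1)] / [SS_total/(n - 1)],
to the GPA example (SSE = 0.2231, SS total = 1.6698, n = 6, k = 2, all
taken or inferred from that example's ANOVA table). The function name is
hypothetical.

```python
def adjusted_r_squared(sse, ss_total, n, k):
    """Adjusted R^2: 1 minus the ratio of the two degrees-of-freedom-corrected variances."""
    return 1 - (sse / (n - k - 1)) / (ss_total / (n - 1))

sse, ss_total, n, k = 0.2231, 1.6698, 6, 2
r_squared = 1 - sse / ss_total
adj_r_squared = adjusted_r_squared(sse, ss_total, n, k)

print(f"R^2 = {r_squared:.4f}, adjusted R^2 = {adj_r_squared:.4f}")
```

With only six observations and two predictors, the adjusted value is
noticeably lower than the unadjusted one, which is the behavior the
paragraph above describes for small n.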

Indicator Variables

Indicator variables (also known as Dummy variables) are used
when we wish to incorporate a categorical explanatory variable
into our analysis. They are just variables that can take on the
values 0 or 1, where a 1 indicates that a subject possesses a
characteristic or is a member of an indicated group, while a 0
indicates the converse.

We will often use indicator variables in our analysis if we take into
account non-numeric variables such as gender (0=male, 1=female)
or, in the case of medical data, treatment (0=placebo, 1=drug).

Example

Suppose that a toy manufacturer wishes to determine whether his
red toys sell better than his blue toys. He gathered data regarding
sales levels, color, price and average age levels for which the toys
are intended. He entered these into a computer and obtained the
multiple regression equation:

$\hat{y} = 70{,}663 - 713x_1 - 59.6x_2 + 66.4x_3$

where y refers to sales levels (in units), $x_1$ refers to color (0=blue,
1=red), $x_2$ refers to retail price (in dollars) and $x_3$ refers to average
age level (in years).

Q. What is the prediction for the sales level of a red toy costing
$20 and intended for 5-year-old children?

A. $\hat{y} = 70{,}663 - 713(1) - 59.6(20) + 66.4(5)$
= 69,090

Notice that the equation is basically telling us that a red toy will
sell 713 fewer units than a blue toy, holding all other factors
constant.
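
A short sketch of the same calculation, showing how the 0/1 color
variable shifts the prediction. The function name `predict_sales` and
its argument names are hypothetical; the coefficients are those of the
equation above.

```python
def predict_sales(is_red, price, age):
    """Predicted unit sales; is_red is the indicator variable (1 = red, 0 = blue)."""
    return 70663 - 713 * is_red - 59.6 * price + 66.4 * age

red_toy = predict_sales(is_red=1, price=20, age=5)
blue_toy = predict_sales(is_red=0, price=20, age=5)

print(f"Red toy:  {red_toy:.0f}")
print(f"Blue toy: {blue_toy:.0f}")
print(f"Red minus blue: {red_toy - blue_toy:.0f}")
```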

It would be tempting, if we had a situation like that above, but the
toys came in three colors, to code them as 0=blue, 1=red, 2=green.
We should not fall into this trap for several reasons:

1. The values 0, 1 and 2 indicate a hierarchy of colors with
green selling less than red, which in turn sells less than blue.
We really don't know whether this is the case though. With a
0-1 variable, the hierarchy is not fixed, since changing the
sign of the coefficient will reverse the order.

2. Since the variable has only one coefficient, if we coded it as
mentioned we are committing ourselves to the assumption that the
difference between the sales of blue and red toys is the same
as the difference between the sales of red and green toys.

What we should do in a situation like this is create two different
indicator variables. Our first variable will be blue (1=blue, 0=not)
and our second will be red (1=red, 0=not). A variable should not
be assigned to green since all toys for which a 1 was not recorded
either for the blue or red variables must be green.
Similarly, if a categorical variable has c categories, we must create
c-1 indicator variables and put them all in our regression model.
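
The c-1 coding rule can be sketched as follows. The helper
`indicator_columns` and the sample list of colors are hypothetical; the
baseline category (green here, as in the text) gets no column of its
own.

```python
def indicator_columns(values, baseline):
    """Build one 0/1 column per category, except the baseline category."""
    categories = sorted(set(values) - {baseline})
    return {cat: [1 if v == cat else 0 for v in values] for cat in categories}

# Hypothetical sample of toy colors.
colors = ["blue", "red", "green", "red", "blue"]
dummies = indicator_columns(colors, baseline="green")

for name, column in dummies.items():
    print(name, column)
```

A green toy is then identified by having a 0 in both the blue and the
red columns.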

In many situations, especially where there is one class of extreme
interest, everything else is put in the baseline (X=0) class, while
items having the characteristic of interest are put in the (X=1)
class. For example, if our categorical variable was religious
preference, we may code our data so 1=Catholic, 0=non-Catholic,
rather than split our data into smaller denominations and
overcomplicate our model.

More Complex Regression Models

Consider the following regression models:

$y = \beta_0 + \varepsilon$
This is our null model, where we predict y using no explanatory
variables. Note that in this case $\hat{\beta}_0 = \bar{y}$.

$y = \beta_0 + \beta_1 X_1 + \varepsilon$
This is a first order model (meaning the highest power of any
predictor variable in the model is 1) with one independent variable.
This is the model we looked at in simple linear regression.

$y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \varepsilon$
This is a first order model with three independent variables. We
started looking at this type of model with multiple regression.

$y = \beta_0 + \beta_1 X_1 + \beta_2 X_1^2 + \varepsilon$
This is a second order model (meaning the highest power of any
predictor variable in the model is 2) with one independent variable.

$y = \beta_0 + \beta_1 X_1 + \beta_2 X_1^2 + \beta_3 X_1^3 + \varepsilon$
This is a third order model with one predictor variable.

$y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2 + \varepsilon$
This type of model is considered to be a second order model with
two predictor variables. The $X_1 X_2$ term is an interaction term. Even
though the model has 1 as the highest power of any one variable, it
is considered to be a second order equation because of the
interaction term.

All of the cases above can be thought of as special cases of a
General Linear Model. The first three cases are models we have
looked at already and the next three are cases we will move on to.

Polynomial Regression

Polynomial regression models are regression models that are
second or higher order models. They contain squared, or higher,
powers of the predictor variable.

If the predictions from the simple model:
$y = \beta_0 + \beta_1 X_1 + \varepsilon$
appear to be too high for moderate values of $X_1$ and too low in the
extremes, or vice versa, then we should worry about a possible
curvilinear relationship between $X_1$ and Y.

If we suspect a curvilinear relationship exists we could try a
quadratic model such as:
$y = \beta_0 + \beta_1 X_1 + \beta_2 X_1^2 + \varepsilon$
A model like this allows our model to curve with the data.

Even better fits can occasionally be obtained by trying cubic
polynomials:
$y = \beta_0 + \beta_1 X_1 + \beta_2 X_1^2 + \beta_3 X_1^3 + \varepsilon$

In general, one could keep trying higher order polynomials, but
that is not advised. Even though adding additional terms will result
in a higher $R^2$ (and therefore an apparently better fitting model)
there is always a danger of over-fitting the model to our sample
points.

It is theoretically possible to exactly fit any data set with n points
with an (n-1)th degree polynomial. However, if you really attempt
this, you will get a wildly oscillating function that does nothing but
fit the observed data. It would be utterly useless for predicting how
x actually affected y in the general population.
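
The over-fitting danger described above can be demonstrated with a
small sketch. It uses Lagrange interpolation (a standard way to
construct the degree n-1 polynomial through n points, swapped in here
for a regression fit) on six hypothetical, roughly linear data points.
The data and the function name are assumptions made for illustration.

```python
def lagrange_predict(xs, ys, x_new):
    """Evaluate at x_new the unique degree (n-1) polynomial through the n points."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x_new - xj) / (xi - xj)
        total += term
    return total

# Hypothetical data lying close to the line y = x + 1.
xs = [1, 2, 3, 4, 5, 6]
ys = [2.1, 2.9, 4.2, 4.8, 6.1, 6.9]

# The polynomial reproduces every observed point exactly...
fitted = [lagrange_predict(xs, ys, x) for x in xs]
print([round(v, 6) for v in fitted])

# ...but behaves wildly just outside the data.
outside = lagrange_predict(xs, ys, 8)
print(f"Prediction at x = 8: {outside:.1f}")
```

For these assumed data the fifth degree polynomial gives about -42 at
x = 8, whereas the roughly linear trend in the points would suggest a
value near 9.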

Another drawback of higher-order models is that they tend to
become difficult to interpret. They really don't help you find trends
or general directions in your data. Due to all this, it is very rare to
use models with higher than second-order terms.

Regression Models With Interaction

Often when two different independent variables are used in a
regression analysis, there is an interaction between the two
variables. An interaction between two explanatory variables ($X_1$
and $X_2$) simply implies a change in the coefficient of $X_1$ from one
value of $X_2$ to another value of $X_2$.

In cases where we suspect there may be an interaction, we can use
a model like this one:
$y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2 + \varepsilon$
or maybe try one like this, if we suspect that the relationship
between our X variables and Y is not linear:
$y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1^2 + \beta_4 X_2^2 + \beta_5 X_1 X_2 + \varepsilon$

In a two-predictor regression with interaction, the response surface
is not a plane but a twisted surface (like "a bent cookie tin"). The
change of slope is quantified by the value of $\beta_3$ in the first of the
two models above. Including it is a way to allow the effect of one
explanatory variable to depend on the value of the other (which is
not the same thing as the two explanatory variables being
correlated).
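
To see why the interaction coefficient measures a change of slope, the
first interaction model can be regrouped by collecting the terms that
multiply $X_1$:

```latex
y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2 + \varepsilon
  = (\beta_0 + \beta_2 X_2) + (\beta_1 + \beta_3 X_2)\, X_1 + \varepsilon
```

For a fixed value of $X_2$ the slope with respect to $X_1$ is therefore
$\beta_1 + \beta_3 X_2$, so each one-unit increase in $X_2$ changes that
slope by $\beta_3$.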

Model Refining

Once we have included all these terms in our model, we will
almost inevitably find that more than one of the terms will not
appear to be making a significant contribution to our model. It is
not right to automatically discard all these apparently insignificant
terms from our model in one go. However, you may not want to
spend the time eliminating one variable at a time many times
over. The test that follows shows you how to test whether it is
statistically permissible to proceed from the full model of (k+1)
parameters to a reduced model with (g+1) parameters.

1. Perform the regression using the full (k+1) parameter model.
Calculate the SSE and MSE for the full model, and label them
SSE1 and MSE1 = SSE1/[n - (k+1)].

2. Perform the reduced model regression using only the (g+1)
parameters being considered for inclusion in the reduced model.
Calculate the SSE and MSE for this model, and label them
SSE2 and MSE2 = SSE2/[n - (g+1)]. Note that MSE2 is not
really needed for this test procedure.

3. Calculate the increase in SSE caused by dropping the (k-g)
parameters from the full model. This sum of squares due to the
dropping can be calculated as SS(drop) = SSE2 - SSE1. The
MS(drop) = SS(drop)/(k-g).

4. Compare MS(drop) with MSE1 using an F-test with (k-g) and
n-(k+1) degrees of freedom. The F-test will be testing the
hypothesis:
$H_0: \beta_{g+1} = \beta_{g+2} = \dots = \beta_k = 0$
$H_1$: At least one of the dropped coefficients is non-zero
If the test is insignificant, then the (k-g) terms can be dropped
from the model, with no significant loss of predictive power. If
the test is significant, we reject $H_0$, which means that we can't
jump from the full model to the reduced model without a
significant loss of power.

We could summarize the above information in the ANOVA table
below:

                 Df         SS          MS          F
Dropped Terms    k-g        SS(drop)    MS(drop)    MS(drop)/MSE1
Full Model       n-(k+1)    SSE1        MSE1
Reduced Model    n-(g+1)    SSE2
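
The test statistic in this table can be computed with a short function.
This is a minimal sketch: the function name `partial_f` is hypothetical,
and the numbers in the usage example (SSE values, n, k and g) are made
up purely to show the arithmetic.

```python
def partial_f(sse_full, sse_reduced, n, k, g):
    """F statistic for dropping (k - g) terms from a model with k predictors."""
    ss_drop = sse_reduced - sse_full      # increase in SSE caused by dropping terms
    ms_drop = ss_drop / (k - g)
    mse_full = sse_full / (n - (k + 1))   # MSE1, from the full model
    return ms_drop / mse_full

# Hypothetical values: full model with k = 5 predictors, reduced model with g = 2.
f_drop = partial_f(sse_full=40.0, sse_reduced=64.0, n=30, k=5, g=2)
print(f"F = {f_drop:.2f} on (3, 24) degrees of freedom")
```

The result would then be compared with the critical value from the
F-table with (k-g) numerator and n-(k+1) denominator degrees of
freedom.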

In a case where our F-test is significant and we can't collapse to
the reduced model, we might consider examining reduced models
which fall somewhere between the full model and the above
reduced model.

The drawback to this method is that one must tell the computer
which models to try at each stage. Next we will consider several
methods which will allow the program to give us the best model
in one run of the program.

Variable Selection Procedures

There are various ways to select the variables which are used in a
multiple regression model. There is no one "best" way, although
most people would agree that one wants the simplest possible
model which explains the response variable adequately. The
difficulty is in determining what is "adequate" and in deciding
what trade-off to make in terms of model complexity for model fit.

SAS is a popular statistical computing package that is often used to
choose statistical models. It offers nine different selection
procedures, some of which (like STEPWISE) actually pick a best
model, while others (like RSQUARE) list the models which have
the best value of the statistic under consideration for a
given number of explanatory variables.

A brief review of some of these methods follows:

FORWARD selection

SAS starts with the null model ($y = \beta_0 + \varepsilon$), and then adds the most
significant variable. After that it adds the next most significant
variable (with the first already entered into the equation). This
process continues until none of the variables left outside of the
model meet the significance level for entry (SLE) value, $\alpha_1$. This
gives a reasonable idea of what variables might be important, but
tends to keep too many unneeded ones.

BACKWARD selection

SAS starts with the full k-variable model, and deletes the least
significant variable. After that, it deletes the next least significant
variable remaining, etc., until all variables remaining are significant
at the significance level for staying (SLS) value, $\alpha_2$. For a small
number of possible predictors, k, Forward and Backward regression
(with SLE=SLS) tend to give the same final models, but as k increases,
there is a good chance that they disagree. This is why the next
procedure was invented.

STEPWISE selection

STEPWISE is a combination of Forward and Backward Regression. For
those options, once a variable was entered (for FORWARD) or
deleted (for BACKWARD), that variable was never re-examined.
In STEPWISE, a variable can be added or deleted several times
before the final model is attained, dependent on the other variables
in the model. This is quite important when collinearity is present,
because a variable which might initially have appeared
insignificant (in the presence of some variables), might become
very significant in the presence of others (and vice-versa). The
final model is achieved when no variables outside of the model
meet the SLE criterion, and all in the model pass the SLS criterion.
The default values for both SLE and SLS are 0.15 under
STEPWISE.

RSQUARE selection

Unlike the FORWARD, BACKWARD and STEPWISE options,
this option doesn't yield a best model. It yields the b models (you
specify b) with the highest $R^2$ values for p = 1, 2, ..., k predictors. A
value of b=5 works well in most applications. Once these models
are printed out, one might want to look more closely at some of the
models which were "best in the class of p-variable models" and use
some other criteria to pick which one of those is best.

Once you have run these selection procedures, you will generally
have a choice of several models. Frequently, it doesn't really
matter which is used, and the decision over which will be most
helpful would be best made based on a general understanding of
the variables themselves.

Since we will not have access to SAS in this course, we will have
to use imperfect techniques in Excel to arrive at a reasonable
model. A good process to follow is outlined below:

1. Run the full model and observe the significance levels of the
individual parameters.
2. Use BACKWARD selection to arrive at a model where all
terms in the model appear significant by removing variables
one at a time.
3. Check the validity of your new model.
4. Observe the $R^2$ value of your new model and decide if that is
acceptable.
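
The manual backward elimination in steps 1 and 2 can be sketched as a
loop. Everything here is illustrative: `backward_select` is a
hypothetical helper, and `fake_p_values` stands in for refitting the
model (for example in Excel) and reading off the p-values. Its variable
names and numbers are made up.

```python
def backward_select(variables, p_values_for, alpha=0.05):
    """Drop the least significant variable, one at a time, until all pass alpha.

    p_values_for(model_vars) is a hypothetical callback that refits the model
    with the given variables and returns a dict of variable -> p-value.
    """
    current = list(variables)
    while current:
        p_values = p_values_for(current)
        worst = max(current, key=lambda v: p_values[v])
        if p_values[worst] <= alpha:
            break  # every remaining term appears significant
        current.remove(worst)
    return current

def fake_p_values(model_vars):
    """Made-up p-values standing in for three successive regression runs."""
    table = {
        ("credit", "sat", "age"): {"credit": 0.03, "sat": 0.40, "age": 0.70},
        ("credit", "sat"): {"credit": 0.02, "sat": 0.39},
        ("credit",): {"credit": 0.01},
    }
    return table[tuple(model_vars)]

selected = backward_select(["credit", "sat", "age"], fake_p_values)
print(selected)
```

Steps 3 and 4 (checking validity and judging the resulting R-squared)
still have to be carried out by hand on whichever model the loop
returns.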




Regression Pitfalls

There are many problems that can undermine our attempts to fit a
model to our data. Some of the major problems we can encounter
are outlined below.

Multicollinearity

Often two or more of the independent variables used in our model
contribute redundant information. That is, the independent
variables are correlated with each other. Suppose we wanted to
construct a model to predict a student's GPA based on their Total
SAT score ($x_1$), their Verbal SAT score ($x_2$) and their Math SAT
score ($x_3$). Although all three of our independent variables
contribute information for the prediction of GPA, some of the
information is overlapping because Total SAT score is highly
correlated with Verbal SAT score and Math SAT score.

If we were to fit a model using all three of these independent
variables to predict GPA we might find that the t values for
$\beta_1$, $\beta_2$ and $\beta_3$ (the coefficients of $x_1$, $x_2$ and $x_3$) are by
themselves not significant, yet our F-test still says the model is
useful. This is because all three of the variables are contributing to
the model, but the contribution of one overlaps with that of the
other two.

Another way that multicollinearity can be recognized is by
inspection of a correlation matrix. If we are using three different
explanatory variables ($X_1$, $X_2$ and $X_3$) to improve our prediction of
a dependent variable (Y), we may get a correlation matrix like the
one below:

         Y      X1     X2     X3
Y        1
X1      .84     1
X2      .34    .72     1
X3      .12    .36    .08     1


Since multicollinearity is a problem if the explanatory (X)
variables are highly correlated with one another, the only possible
problem here is that maybe $X_1$ and $X_2$ are providing very similar
information (r = .72). Note that the high correlation between Y and
$X_1$ is a good thing, since we want X and Y to be highly correlated.
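
Scanning a correlation matrix for this kind of overlap can be sketched
in a few lines. The correlations are those of the matrix above; the
cut-off of 0.7 is a hypothetical choice made for illustration, not a
rule taken from the notes.

```python
# Pairwise correlations read from the correlation matrix.
corr = {
    ("Y", "X1"): 0.84, ("Y", "X2"): 0.34, ("Y", "X3"): 0.12,
    ("X1", "X2"): 0.72, ("X1", "X3"): 0.36, ("X2", "X3"): 0.08,
}

threshold = 0.7  # hypothetical cut-off for "highly correlated"

# Only correlations between explanatory variables signal multicollinearity,
# so pairs involving the dependent variable Y are ignored.
flagged = [pair for pair, r in corr.items()
           if "Y" not in pair and abs(r) >= threshold]

print(flagged)
```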

Often when we have correlated explanatory variables we choose to
only include the bare minimum of the correlated variables in our
model.

Prediction Outside The Experimental Region

Throughout the course we have emphasized that predictions from
our regression model are only valid over the ranges of our
explanatory variables observed in the sample. If we try to use our
model to predict outside of the range of our x variable(s), we can
encounter many problems.

Curvature to the Data

It is important to realize that most of the models we have
considered in the course have been straight-line models. Often
though, if we were to perform some transformation (maybe using
$x^2$ rather than x to predict y) we would get a much better-fitting
model.

Consider the scatterplot below:

[Figure: scatterplot of Y (axis from 0 to 50) against X (axis from 0 to 10)]

In a situation like that above, using x to predict y would clearly be
a significant improvement on just predicting y with $\bar{y}$. However, it
is apparent that we would get an even better prediction if we
transformed x and used $x^2$, say.

It is therefore important to realize that if you only consider straight
line models, you are seriously limiting your chances of finding the
best model. The best linear model will not necessarily be the
best-fitting functional form for the data.

Violation Of Assumptions Concerning The Error Term, $\varepsilon$

All the assumptions made in Step 3 of constructing a regression
model are vitally important. If any of the assumptions do not hold,
the estimates of variability and all hypothesis tests based on them
will no longer be valid.


Section 3: Analysis Of Variance and Design Of Experiments

In this section we will explore research scenarios where
hypotheses are tested for more than two populations. For example,
we might wish to examine the average sales of salespeople trained
using five different training programs to see whether they are the
same. Our hypotheses become:
$H_0: \mu_1 = \mu_2 = \mu_3 = \mu_4 = \mu_5$
$H_1$: not all $\mu$'s are equal
We test such a hypothesis by first collecting five samples, one
from each of the training programs (populations). We will see that
to compare these five means one pair at a time is not the correct
approach, as this would result in ten different pairwise tests, and
what was intended to be a testing procedure with, say, a 5%
significance level results in a much higher overall chance of
wrongly rejecting at least one true null hypothesis.
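
A rough sense of how much the error rate inflates can be had from the
sketch below. It counts the pairwise comparisons among five groups and
then computes the chance of at least one false rejection, under the
simplifying assumption that the ten tests are independent. That
assumption does not strictly hold, since the tests share samples, so the
figure is only indicative.

```python
import math

groups = 5
alpha = 0.05

# Number of distinct pairs of means that could be compared.
pairs = math.comb(groups, 2)

# Chance of at least one Type I error if the tests were independent.
familywise = 1 - (1 - alpha) ** pairs

print(f"{pairs} pairwise tests, overall error rate about {familywise:.2f}")
```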

The correct procedure for this situation is to examine the variation
of the sales value, both (1) within each of the samples (examining
the variability of each sample alone) and (2) among the five
samples (for example, are the values in sample 1 larger, or smaller,
on average, than the values in the other samples?).

Another way to consider the reasoning behind this approach is by
relating it to a situation with just two populations. When testing for
a difference between $\mu_1$ and $\mu_2$, both $s_1$ and $s_2$ affect the width of
our confidence interval for ($\mu_1 - \mu_2$). Consequently, we infer
something about the means of several populations by utilizing the
variation of the resulting samples. Hence the term analysis of
variance.

Comparing Two Population Means: One Approach

A large sample confidence interval for ($\mu_1 - \mu_2$) is given by:

$$(\bar{x}_1 - \bar{x}_2) \pm Z\,\sigma_{(\bar{x}_1 - \bar{x}_2)} = (\bar{x}_1 - \bar{x}_2) \pm Z\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$$

If we do not have large sample sizes ($n_1 < 30$ or $n_2 < 30$) we need to
use the t-distribution.

A small sample confidence interval for ($\mu_1 - \mu_2$) is given by:

$$(\bar{x}_1 - \bar{x}_2) \pm t\sqrt{s_p^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}$$

where

$$s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$$

and t is based on ($n_1 + n_2 - 2$) degrees of freedom.

Note: $s_p$ is called the pooled standard deviation since it combines
the standard deviations of both samples.

When dealing with small samples we must make the following
assumptions:
- Both sampled populations have relative frequency
  distributions that are approximately normal.
- The population variances are equal.
- The samples are randomly and independently selected from
  the population.

Example

Liverpool Drug Company claims its aspirin tablets will relieve
headaches faster than any other aspirin on the market. To
determine whether Liverpool's claim is valid, a random sample of
size 15 is chosen from aspirins made by Liverpool and a further
random sample of size 15 is taken from aspirins made by the
Manchester Drug Company. An aspirin is given to each of the
randomly selected persons suffering from headaches and the
number of minutes required for each to recover from the headache
is recorded. The sample results are:


                 $\bar{x}$    $s^2$
Liverpool (L)    8.4          2.2
Manchester (M)   8.9          2.6

Assume that the two populations are normally distributed with
equal, but unknown, variances.

Q. What is the pooled standard deviation?

A. $$s_p = \sqrt{\frac{(n_L - 1)s_L^2 + (n_M - 1)s_M^2}{n_L + n_M - 2}}$$

$$= \sqrt{\frac{(14)(2.2) + (14)(2.6)}{15 + 15 - 2}}$$

$$= \sqrt{2.4}$$

= 1.549

Q. Construct a 99% confidence interval for the true mean
difference in the time taken to relieve headaches ($\mu_M - \mu_L$).

A. $$(\bar{x}_M - \bar{x}_L) \pm t\sqrt{s_p^2\left(\frac{1}{n_M} + \frac{1}{n_L}\right)}$$

$$= (8.9 - 8.4) \pm 2.763\sqrt{2.4\left(\frac{1}{15} + \frac{1}{15}\right)}$$

$$= 0.5 \pm 2.763\sqrt{2.4\left(\frac{2}{15}\right)}$$

$$= 0.5 \pm 1.563$$

= (-1.063, 2.063)

Since this interval contains zero, we cannot conclude that a difference
exists between the two aspirins in terms of the time taken to relieve
headaches.
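
The aspirin calculation can be reproduced with a short sketch. The function names `pooled_variance` and `pooled_t_interval` are hypothetical helpers introduced here for illustration, and the critical value 2.763 is simply the table value used in the worked answer rather than something the code computes.

```python
from math import sqrt

def pooled_variance(n1, var1, n2, var2):
    """Pooled estimate of the common variance from two sample variances."""
    return ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)

def pooled_t_interval(mean1, mean2, n1, n2, sp2, t_crit):
    """Small-sample confidence interval for (mu1 - mu2) using pooled variance sp2."""
    half_width = t_crit * sqrt(sp2 * (1 / n1 + 1 / n2))
    diff = mean1 - mean2
    return diff - half_width, diff + half_width

# Manchester: mean 8.9, variance 2.6; Liverpool: mean 8.4, variance 2.2.
sp2 = pooled_variance(15, 2.6, 15, 2.2)
sp = sqrt(sp2)

# t_crit = 2.763 is the t-table value for 28 degrees of freedom used in the text.
lower, upper = pooled_t_interval(8.9, 8.4, 15, 15, sp2, t_crit=2.763)

print(f"pooled standard deviation = {sp:.3f}")
print(f"99% interval = ({lower:.3f}, {upper:.3f})")
print("interval contains zero:", lower < 0 < upper)
```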

Example

A company offers an optional seminar for its managers on how to
interact with employees. A sample is taken of the job performance
ratings received by 30 managers who attended the seminar and 30
managers who did not attend the seminar. The mean and standard
deviation of ratings for the sample that did not attend was 6.1 and
5.9, respectively and the mean and standard deviation for the
sample that did attend was 9.7 and 4.5, respectively.

Q. What is the variance of the difference in the two sample means?

A. In this case our sample sizes are large, so we don't need to pool
the standard deviations.

Attended               Not Attended
n = 30                 n = 30
$\bar{x}_A$ = 9.7      $\bar{x}_{NA}$ = 6.1
$s_A$ = 4.5            $s_{NA}$ = 5.9

$$s^2_{(\bar{x}_1 - \bar{x}_2)} = \frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}$$

$$= \frac{(4.5)^2}{30} + \frac{(5.9)^2}{30}$$

= 1.84

Q. Find a 96% confidence interval for the difference in mean
ratings for those managers who did and did not attend the seminar.

A. $$\text{Interval} = (\bar{x}_1 - \bar{x}_2) \pm Z\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$$

$$= (9.7 - 6.1) \pm 2.05\sqrt{1.84}$$

$$= 3.6 \pm 2.78$$

= (0.82, 6.38)

Since this interval does not contain zero, there is a significant
difference in the means of those that did and did not attend the
seminar. The mean for those that attended is higher.
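
The seminar calculation can be checked with the sketch below. The helper name `large_sample_interval` is hypothetical, and the value 2.05 is the Z-table value used in the worked answer. Because the code does not round the variance to 1.84 before taking the square root, its endpoints can differ from the hand calculation in the third decimal place.

```python
from math import sqrt

def large_sample_interval(mean1, sd1, n1, mean2, sd2, n2, z_crit):
    """Variance of (xbar1 - xbar2) and the large-sample interval for (mu1 - mu2)."""
    var_diff = sd1 ** 2 / n1 + sd2 ** 2 / n2
    half_width = z_crit * sqrt(var_diff)
    diff = mean1 - mean2
    return var_diff, (diff - half_width, diff + half_width)

# Attended: mean 9.7, sd 4.5, n 30. Did not attend: mean 6.1, sd 5.9, n 30.
var_diff, (lower, upper) = large_sample_interval(
    9.7, 4.5, 30, 6.1, 5.9, 30, z_crit=2.05
)

print(f"variance of the difference = {var_diff:.2f}")
print(f"96% interval = ({lower:.2f}, {upper:.2f})")
```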

The Analysis Of Variance Approach

We need to introduce two terms: factor and level. The previous
example examined the effect of one factor (sex), consisting of two
levels (male and female).

The purpose of Analysis of Variance is to determine whether the
factor has a significant effect on the variable being measured
(salary, in our example). If for example the factor of sex is
significant, the mean salaries for the different sexes will not be
equal. Consequently, testing for equal means among the different
sexes is the same as attempting to answer the question: is there a
significant effect on salary due to this factor?

We will begin this part of the course by examining the effect of a
single factor on the variable being measured, one-factor ANOVA.
Extensions of this technique include ANOVA procedures that
determine the effect of two or more factors operating
simultaneously.

Assumptions behind ANOVA

The following assumptions are basically the same requirements
that were necessary when testing two means using small,
independent samples and the pooled variance approach. These
requirements are:

1. The replicates (observations) are obtained independently and
randomly from each of the populations. The value of one
observation has no effect on any other replicates within the
same sample or within the other samples.
2. The replicates from each population follow (approximately) a
normal distribution.
3. The normal populations all have a common variance, $\sigma^2$. We
expect the values in each sample to vary by about the same
amount. The ANOVA procedure will be much less sensitive to
violations of this requirement when we obtain samples of equal
size from each population.

We mentioned earlier that our error when comparing means of
different populations could be split into two groups: within-sample
variation and between-sample variation. When using the ANOVA
approach, we measure these two sources of variation by calculating
sums of squares for each of them, and in a similar way to our
previous look at the ANOVA table, we also calculate a sum of
squares total.

Deriving The Sum Of Squares

When examining k populations, for example, the data will be
configured something like this:

          Level 1              Level 2              ...   Level k
          $X_{11}$             $X_{12}$             ...   $X_{1k}$
          $X_{21}$             $X_{22}$             ...   $X_{2k}$
          ...                  ...                        ...
          ($n_1$ replicates)   ($n_2$ replicates)   ...   ($n_k$ replicates)
Totals    $T_1$                $T_2$                ...   $T_k$


In our example comparing male and female salaries, we had k=2
and $n_1 = n_2 = 21$ replicates. Notice also that $T_i$ is the total of the
observations in sample i and we will also define T as the grand
total of all the observations so $T = T_1 + T_2 + \dots + T_k$.

SS(factor): also known as SS(between)

SS(factor) is the sum of squares that determines whether the values
in one sample are larger or smaller on the average than the values
in another sample. It can be calculated as:

$$\text{SS(factor)} = n_1(\bar{X}_1 - \bar{X})^2 + n_2(\bar{X}_2 - \bar{X})^2 + \dots + n_k(\bar{X}_k - \bar{X})^2 = \sum_{j=1}^{k} n_j(\bar{X}_j - \bar{X})^2$$


A short cut calculation method is:

$$\text{SS(factor)} = \left[\frac{T_1^2}{n_1} + \frac{T_2^2}{n_2} + \dots + \frac{T_k^2}{n_k}\right] - \frac{T^2}{n}$$

where, again, k is the total number of populations we are
comparing.

Sum Of Squares Total: SS(total)

SS(total) is a measure of the variation in all of the
$n = n_1 + n_2 + \dots + n_k$ data values. You obtain this value as if you
were finding the variance of these n values, except that you do not
divide by n-1. It can be calculated as:

$$\text{SS(total)} = \sum_{j=1}^{k}\left[(X_{1j} - \bar{X})^2 + (X_{2j} - \bar{X})^2 + \dots + (X_{n_j j} - \bar{X})^2\right] = \sum_{j=1}^{k}\sum_{i=1}^{n_j}(X_{ij} - \bar{X})^2$$


A short cut calculation method is:

$$\text{SS(total)} = \sum X^2 - \frac{T^2}{n}$$



SS(error): also known as SS(within)

SS(error) is the measure of the variation within each of the
samples. It can be calculated as:

$$\text{SS(error)} = \sum_{j=1}^{k}\left[(X_{1j} - \bar{X}_j)^2 + (X_{2j} - \bar{X}_j)^2 + \dots + (X_{n_j j} - \bar{X}_j)^2\right] = \sum_{j=1}^{k}\sum_{i=1}^{n_j}(X_{ij} - \bar{X}_j)^2$$


A short cut calculation method is:

$$\text{SS(error)} = \sum X^2 - \left[\frac{T_1^2}{n_1} + \frac{T_2^2}{n_2} + \dots + \frac{T_k^2}{n_k}\right]$$

= SS(total) - SS(factor)
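
The definitional and short cut formulas above can be compared with a small sketch. The function names and the three tiny samples are hypothetical and serve only to show that both routes give the same sums of squares, and that SS(factor) + SS(error) = SS(total), even when the sample sizes differ.

```python
def sums_of_squares(samples):
    """(SS(factor), SS(error), SS(total)) from the definitional formulas."""
    all_values = [x for sample in samples for x in sample]
    grand_mean = sum(all_values) / len(all_values)
    ss_factor = 0.0
    ss_error = 0.0
    for sample in samples:
        mean_j = sum(sample) / len(sample)
        ss_factor += len(sample) * (mean_j - grand_mean) ** 2
        ss_error += sum((x - mean_j) ** 2 for x in sample)
    ss_total = sum((x - grand_mean) ** 2 for x in all_values)
    return ss_factor, ss_error, ss_total

def sums_of_squares_shortcut(samples):
    """The same three quantities from the short cut formulas based on totals."""
    n = sum(len(s) for s in samples)
    grand_total = sum(sum(s) for s in samples)           # T
    sum_x2 = sum(x ** 2 for s in samples for x in s)     # sum of X^2
    between = sum(sum(s) ** 2 / len(s) for s in samples) # sum of T_j^2 / n_j
    correction = grand_total ** 2 / n                    # T^2 / n
    return between - correction, sum_x2 - between, sum_x2 - correction

# Hypothetical data: k = 3 levels with unequal numbers of replicates.
samples = [[4, 5, 6], [7, 9], [1, 2, 3, 2]]

definitional = sums_of_squares(samples)
shortcut = sums_of_squares_shortcut(samples)

print("definitional:", definitional)
print("short cut:   ", shortcut)
```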

The ANOVA Table

The format of the ANOVA table will be the same, regardless of the
number of populations (levels), k. When we move on to examining
several factors in our analysis, however, our degrees of freedom
will change, in a similar way to when we moved from simple linear to
multiple regression.

The ANOVA table will look as follows:

Source   df    SS           MS           F    p-value
Factor   k-1   SS(factor)   MS(factor)
Error    n-k   SS(error)    MS(error)
Total    n-1   SS(total)

Values for MS, F-ratio and p-value can be calculated from the
other values in the table as before.

Example (Transportation Costs)

Family transportation costs are usually higher than most people
believe, because they include car payments, insurance, fuel costs,
repairs, parking and public transportation. Twenty randomly
sampled families, five in each of four major cities, are asked to use their records
to estimate a monthly figure for transportation cost. Use the data
obtained and ANOVA to test whether there is a significant
difference in monthly transportation costs for families in these
cities at a 5% level of significance.

        Atlanta   New York   Los Angeles   Chicago
        650       250        850           540
        480       525        700           450
        550       300        950           675
        600       175        780           550
        675       500        600           600
Total   2955      1750       3880          2815

So $T_1$ = 2955, $T_2$ = 1750, $T_3$ = 3880, $T_4$ = 2815 and T = 11400.
Also $n_1 = n_2 = n_3 = n_4 = 5$ and n = 20. Finally we can calculate $\sum X^2$
to be 7,175,400.

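$$\text{SS(factor)} = \left[\frac{T_1^2}{n_1} + \frac{T_2^2}{n_2} + \dots + \frac{T_k^2}{n_k}\right] - \frac{T^2}{n}$$

$$= \left[\frac{(2955)^2}{5} + \frac{(1750)^2}{5} + \frac{(3880)^2}{5} + \frac{(2815)^2}{5}\right] - \frac{(11400)^2}{20}$$

= 6,954,630 - 6,498,000
= 456,630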

$$\text{SS(total)} = \sum X^2 - \frac{T^2}{n}$$

$$= 7,175,400 - \frac{(11400)^2}{20}$$

= 7,175,400 - 6,498,000
= 677,400

SS(error) = SS(total) - SS(factor)
= 677,400 - 456,630
= 220,770

df(factor) = k - 1 = 4 - 1 = 3
df(error) = n - k = 20 - 4 = 16
df(total) = n - 1 = 20 - 1 = 19

The ANOVA table for this analysis follows:

Source   df   SS        MS           F
Factor   3    456,630   152,210      11.0312
Error    16   220,770   13,798.125
Total    19   677,400

By looking in our F-table with $\alpha$ = .05 we see that $F_{3,16}(.05)$ =
3.24. Since our F-ratio of 11.03 is greater than this critical value
we can reject $H_0$ and conclude that at least one of the cities has a
different mean at a 5% level of significance.
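
The whole transportation analysis can be retraced with the sketch below, which applies the short cut formulas to the raw data. The variable names are illustrative only, and the critical value 3.24 is taken from the F-table as in the text rather than computed.

```python
# Monthly transportation costs for five families in each city.
costs = {
    "Atlanta": [650, 480, 550, 600, 675],
    "New York": [250, 525, 300, 175, 500],
    "Los Angeles": [850, 700, 950, 780, 600],
    "Chicago": [540, 450, 675, 550, 600],
}

k = len(costs)                                           # number of levels
n = sum(len(v) for v in costs.values())                  # total observations
grand_total = sum(sum(v) for v in costs.values())        # T
sum_x2 = sum(x ** 2 for v in costs.values() for x in v)  # sum of X^2

# Short cut formulas for the sums of squares.
between = sum(sum(v) ** 2 / len(v) for v in costs.values())
ss_factor = between - grand_total ** 2 / n
ss_total = sum_x2 - grand_total ** 2 / n
ss_error = ss_total - ss_factor

# Degrees of freedom, mean squares and the F-ratio.
df_factor, df_error = k - 1, n - k
ms_factor = ss_factor / df_factor
ms_error = ss_error / df_error
f_ratio = ms_factor / ms_error

F_CRITICAL = 3.24  # F(3, 16) at alpha = .05, read from the F-table
reject_h0 = f_ratio > F_CRITICAL

print(f"SS(factor) = {ss_factor:,.0f}, SS(error) = {ss_error:,.0f}")
print(f"MS(factor) = {ms_factor:,.3f}, MS(error) = {ms_error:,.3f}")
print(f"F = {f_ratio:.4f}, reject H0: {reject_h0}")
```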

Interpretation of Mean Squares

Note that one of our assumptions in making this comparison of
means was that the variances of each of the populations were the
same. The ANOVA procedure is based on a comparison of two
separate estimates of this variance, $\sigma^2$.

The first estimate is derived using the variation among the sample
means whereas the other estimate is determined using the variation
within each of the samples. The ANOVA procedure is based on a
comparison of these two estimates of $\sigma^2$ because they should be
approximately equal provided $H_0$ is true.

MS(factor) = estimate of $\sigma^2$ based on the variation among the
sample means

MS(error) = estimate of $\sigma^2$ based on the variation within each of
the samples

The closer these estimates are to each other, the closer our F-ratio
will be to 1. As MS(factor) grows larger relative to MS(error), our
F-ratio increases and becomes increasingly significant.
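
To make the "two estimates of the variance" idea concrete, the sketch below recomputes the mean squares for the transportation data directly as variance estimates. It relies on the sample sizes being equal: in that case MS(error) is the simple average of the four sample variances, and MS(factor) is the common sample size times the variance of the four sample means. The variable names are illustrative only.

```python
from statistics import variance

# Transportation cost samples: Atlanta, New York, Los Angeles, Chicago.
samples = [
    [650, 480, 550, 600, 675],
    [250, 525, 300, 175, 500],
    [850, 700, 950, 780, 600],
    [540, 450, 675, 550, 600],
]
n_r = 5  # replicates per sample (equal sample sizes assumed)

# Estimate based on the variation within each sample:
# the average of the individual sample variances.
sample_variances = [variance(s) for s in samples]
ms_error = sum(sample_variances) / len(sample_variances)

# Estimate based on the variation among the sample means:
# n_r times the variance of the sample means.
sample_means = [sum(s) / len(s) for s in samples]
ms_factor = n_r * variance(sample_means)

f_ratio = ms_factor / ms_error

print(f"MS(error)  = {ms_error:,.3f}")
print(f"MS(factor) = {ms_factor:,.3f}")
print(f"F = {f_ratio:.2f}")
```

Both values agree with the ANOVA table above, and the estimate based on the sample means is about eleven times the estimate based on the variation within samples.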

Multiple Comparisons

If the one-factor ANOVA leads to a rejection of $H_0$, and therefore
a conclusion that at least one of the means differs, a natural
question to ask is: which of the means differ? In other
words, rejecting the ANOVA null hypothesis informs us that the
means are not all the same but provides no clue as to which of the
population means are different.

As mentioned earlier, performing a series of t-tests to compare all
possible pairs of means is not a good idea, since the chance of
making a Type I error (concluding a difference exists between two
population means when in fact they are the same) using such a
procedure is much larger than the predetermined $\alpha$ used for each of
the t-tests.

What is needed is a technique that compares all possible pairs of
means in such a way that the probability of making one or more
Type I errors is $\alpha$. This is a multiple comparisons procedure. There
are several methods available for making these comparisons, but
the most well-known is Tukey's test, which is presented here.

Tukey's honestly significant difference (HSD) test is somewhat
limited by the fact that it requires equal sample sizes. It takes into
account the number of populations, the value of the mean square
error and the sample size. Using these values and a table value Q
(Table A.10 in textbook page A-29 of 2nd edition), the HSD
determines the critical difference necessary between the means of
any two treatment levels for the means to be significantly different.

Procedure

1. Find $Q_{\alpha,k,\nu}$ using Table A.10 where $\alpha$ is the significance level
required, k is the number of sample means (groups) and $\nu$ is the
degrees of freedom associated with MS(error).

2. Determine $$\text{HSD} = Q_{\alpha,k,\nu}\sqrt{\frac{\text{MS(error)}}{n_r}}$$
where $n_r$ is the number of replicates in each sample; remember
that sample sizes should be equal.

3. Place the sample means in order, from smallest to largest.

4. If two sample means differ by more than HSD, the conclusion
is that the corresponding population means are unequal. In
other words, if $|\bar{X}_i - \bar{X}_j|$ > HSD, this implies that $\mu_i \neq \mu_j$.

Example (Transportation Costs, continued)

Recall how in the earlier example we compared the average
transportation costs in four different cities and concluded that at
least one of the means was different. We will now use the
procedure outlined above to determine which of the means differ.

1. We had an $\alpha$ = 5% level of significance, k=4 different
populations and $\nu$ = 16 degrees of freedom for error. Therefore
$Q_{\alpha,k,\nu} = Q_{.05,4,16} = 4.05$

2. We had MS(error) = 13,798.125 and $n_r$ = 5. Therefore the HSD
is equal to:
$$Q_{\alpha,k,\nu}\sqrt{\frac{\text{MS(error)}}{n_r}} = 4.05\sqrt{\frac{13798.125}{5}} = 212.755$$


3. Our sample means were:
New York: $\bar{X}_1$ = 350
Chicago: $\bar{X}_2$ = 563
Atlanta: $\bar{X}_3$ = 591
Los Angeles: $\bar{X}_4$ = 776

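4. $|\bar{X}_1 - \bar{X}_2|$ = 213 > 212.755 so $\bar{X}_1$ and $\bar{X}_2$ differ
$|\bar{X}_1 - \bar{X}_3|$ = 241 > 212.755 so $\bar{X}_1$ and $\bar{X}_3$ differ
$|\bar{X}_1 - \bar{X}_4|$ = 426 > 212.755 so $\bar{X}_1$ and $\bar{X}_4$ differ
$|\bar{X}_2 - \bar{X}_3|$ = 28 < 212.755
$|\bar{X}_2 - \bar{X}_4|$ = 213 > 212.755 so $\bar{X}_2$ and $\bar{X}_4$ differ
$|\bar{X}_3 - \bar{X}_4|$ = 185 < 212.755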

Conclusion

$\bar{X}_1$ (the mean for New York) differs from the means of the other
three cities, at a 5% level of significance. There was also a
significant difference between $\bar{X}_2$ and $\bar{X}_4$ (Chicago and LA).
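
The four steps of the procedure can be sketched as follows. The function name `tukey_hsd` is a hypothetical helper, and the value Q = 4.05 is the table value quoted above rather than something the code looks up.

```python
from itertools import combinations
from math import sqrt

def tukey_hsd(q_value, ms_error, n_r):
    """Critical difference for Tukey's test with n_r replicates per sample."""
    return q_value * sqrt(ms_error / n_r)

sample_means = {"New York": 350, "Chicago": 563, "Atlanta": 591, "Los Angeles": 776}

# Steps 1 and 2: Q from the table (4.05), then the HSD.
hsd = tukey_hsd(q_value=4.05, ms_error=13798.125, n_r=5)

# Step 3: order the sample means from smallest to largest.
ordered = sorted(sample_means.items(), key=lambda item: item[1])

# Step 4: a pair differs significantly if its means are more than HSD apart.
different_pairs = [
    (city_a, city_b)
    for (city_a, mean_a), (city_b, mean_b) in combinations(ordered, 2)
    if abs(mean_a - mean_b) > hsd
]

print(f"HSD = {hsd:.3f}")
for city_a, city_b in different_pairs:
    print(f"{city_a} and {city_b} differ")
```

Note that two of the differences (213) exceed the HSD of 212.755 only narrowly, so the conclusion for those pairs is sensitive to rounding of Q or of the sample means.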

If we asked SAS to perform the above test, our output would look
as follows:
New York   Chicago   Atlanta   Los Angeles
   A
              B         B
                        C           C

where means with the same letter are not significantly different.

Tukey-Kramer Procedure

As mentioned, the test above can only be performed when we have
equal sample sizes. The Tukey-Kramer procedure is a modification
of the regular Tukey's test that will work when we have unequal
sample sizes. The necessary formula is:

$$\text{HSD} = Q_{\alpha,k,n-k}\sqrt{\frac{\text{MS(error)}}{2}\left(\frac{1}{n_i} + \frac{1}{n_j}\right)}$$

where $n_i$ is the sample size for the $i$th sample and $n_j$ is the sample
size for the $j$th sample.

Note that a different HSD value must be computed for each
different pair, since our sample sizes will not be the same.
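
A minimal sketch of the Tukey-Kramer formula is given below. The function name `tukey_kramer_hsd` is hypothetical. The first call reuses the transportation values with equal sample sizes to show that the formula then reduces to the ordinary HSD. The second call uses made-up unequal sample sizes and, purely for illustration, keeps the same Q and MS(error); in a real analysis both would be recomputed for the actual data and error degrees of freedom.

```python
from math import sqrt

def tukey_kramer_hsd(q_value, ms_error, n_i, n_j):
    """Critical difference for comparing samples i and j of sizes n_i and n_j."""
    return q_value * sqrt((ms_error / 2) * (1 / n_i + 1 / n_j))

# Equal sample sizes: identical to Tukey's HSD, Q * sqrt(MS(error) / n_r).
equal_case = tukey_kramer_hsd(4.05, 13798.125, n_i=5, n_j=5)

# Hypothetical unequal sizes (Q and MS(error) reused only as placeholders).
unequal_case = tukey_kramer_hsd(4.05, 13798.125, n_i=5, n_j=3)

print(f"equal sizes (5, 5):   HSD = {equal_case:.3f}")
print(f"unequal sizes (5, 3): HSD = {unequal_case:.3f}")
```

A pair involving a smaller sample needs a larger difference between its means before it is declared significant.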

Designing An Experiment

So far in this part of the course we have introduced you to
one-factor (or one-way) ANOVA. In this type of analysis you
randomly obtain samples from each of the k populations (levels) of
a single factor; in our last example there were k=4 levels (cities)
of a single factor (location). Since replicates (repeat observations) are
obtained in a random manner from each population, this type of
sampling plan is called a completely randomized design.

Before we go further into Experimental Design we need to define a
few new terms:

We have already mentioned that a factor is a set of related levels
used as an explanatory variable. Factors are usually qualitative
(sex, marital status, etc.) but can be quantitative when a limited
number of levels of a quantitative variable are chosen for study. A
factor can be either a treatment variable or a classification variable.

A treatment variable is one the experimenter controls or modifies
in the experiment: for example, in a medical study, a treatment
variable may be medicine, a treatment that would consist of 2
levels, drug or placebo.

A classification variable is some characteristic of the
experimental subjects that was present prior to the experiment and
is not a result of the experimenter's manipulation or control: for
example, in the transportation costs situation we looked at
previously, the classification variable was city, a classification
that consisted of 4 levels (Atlanta, New York, Chicago and LA).

A treatment or treatment combination is a particular
combination of the levels of one or more factors. Treatment
combinations will come into play when we start studying more
than one factor at a time.

The experimental units are materials or items on which a
measurement is made and to which treatments are applied.

Nuisance variables are other variables which influence the
response variable but which are not of interest. Systematic bias
occurs when treatments are not alike with respect to nuisance
variables. In this case, the nuisance variable becomes a
confounding variable.

Confounding variables are variables that are not being controlled
by the researcher in the experiment but can have an effect on the
outcome of the treatment being studied. One way to control for
these variables is to include them in the experimental design. The
randomized blocking design, which is a type of experimental
design we will also be examining, has the capability of adding one
of these variables into the analysis as a blocking variable.

A blocking variable is a variable that the researcher wants to
control but is not the treatment variable of interest.

Structures Of An Experimental Design

A treatment structure is the set of treatments, treatment
combinations or populations under study, that is, the selection and
arrangement of treatment factors.

A design structure is the way in which the experimental units are
grouped together into homogeneous units (blocks).

These structures are combined with a method of randomization to
create an experimental design.

Types Of Treatment Structures

1. One-way
2. n-way Factorial, where two or more factors are combined so
that every possible combination occurs.
3. n-way Fractional Factorial, where a specified fraction of the
total number of possible treatment combinations occur (e.g.
Latin Square).
4. Nested (Hierarchical) Treatment Structures

Types Of Design Structures

1. Completely Randomized Designs.

All experimental units are considered as a single homogeneous
group (no blocks). Treatments are assigned completely at random
(with equal probability) to all units.

2. Randomized Complete Block Designs.
Experimental units are grouped into homogeneous blocks within
which each treatment occurs c times (usually c=1).

3. Incomplete Block Designs.

Fewer than the total number of treatments occur in each block.

4. Latin Square Designs.
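
The randomization step for the first two design structures can be sketched as below. The function names, unit labels and treatment labels are all hypothetical. The completely randomized sketch additionally assumes equal replication of each treatment, and the blocked sketch assumes each treatment occurs once per block (c=1).

```python
import random

def completely_randomized(units, treatments, rng):
    """Assign treatments to all units at random, with equal replication."""
    if len(units) % len(treatments) != 0:
        raise ValueError("number of units must be a multiple of the number of treatments")
    reps = len(units) // len(treatments)
    labels = list(treatments) * reps
    rng.shuffle(labels)  # random order over the whole set of units (no blocks)
    return dict(zip(units, labels))

def randomized_complete_block(blocks, treatments, rng):
    """Within each block, assign every treatment exactly once in random order."""
    design = {}
    for block, units in blocks.items():
        if len(units) != len(treatments):
            raise ValueError("each block needs one unit per treatment")
        order = list(treatments)
        rng.shuffle(order)  # randomization is restricted to the block
        design[block] = dict(zip(units, order))
    return design

rng = random.Random(0)  # fixed seed only so the sketch is reproducible
treatments = ["A", "B", "C"]

crd = completely_randomized([f"unit{i}" for i in range(1, 10)], treatments, rng)
blocks = {"block1": ["u1", "u2", "u3"], "block2": ["u4", "u5", "u6"]}
rcbd = randomized_complete_block(blocks, treatments, rng)

print(crd)
print(rcbd)
```

The only difference between the two functions is where the shuffle happens: over all units at once, or separately inside each homogeneous block.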





Considerations when Designing an Experiment

Experimental design should give unambiguous answers to
questions of interest.

Experimental design should be optimal. That is, it should
have more power (sensitivity) and estimate quantities of interest
more precisely than other designs.

Objectives of the experiment should be clearly defined.

- What questions are we trying to answer?
- What questions are more important?
- What populations are we interested in generalizing to?

Appropriate response and explanatory variables must be
determined and nuisance variables should be identified.

- What levels of the treatment factors will be examined?
- Should there be a control group?
- Which nuisance variables will be measured?

Statistical analysis of the experiment should be planned in detail
to meet the objectives of the experiment.

- What model will be used?
- How will nuisance variables be accounted for?
- What hypotheses will be tested?
- What effects will be estimated?

Experimental design should be economical.
