Transform Y to Correct Model Inadequacies

Chapter 4 Transformations and Weighting to Correct Model Inadequacies

UECM2263 Applied Statistical Model
Recall that fitting a regression model involves several implicit assumptions, including the following:
1. The model errors have mean zero and constant variance and are uncorrelated.
2. The model errors have a normal distribution. This assumption is made in order to conduct
hypothesis tests and construct confidence intervals (CIs); under this assumption, the errors are independent.
3. The form of the model, including the specification of the regressors, is correct.

Chapter 3 presented several techniques for checking the adequacy of the linear regression model. If the
linear regression model is not appropriate for a data set, there are two basic choices:
1. Abandon the regression model and develop a more appropriate model.
2. Employ some transformation on the data so that the regression model is appropriate for the
transformed data.

We consider the use of transformation in this chapter.

4.1 Variance Stabilizing Transformation

The assumption of constant variance is a basic requirement of regression analysis. A common reason
for the violation of this assumption is for the response variable Y to follow a probability distribution in
which the variance is functionally related to the mean.

For example, if $Y$ follows a Poisson distribution with mean $\mu$, the variance of $Y$ is equal to its
mean $\mu$. Since the mean of $Y$ is related to the regressor variable $X$, the variance of $Y$ will be
proportional to $X$.

Example 4.1:
Consider the simple linear regression model $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$, where
$\mathrm{Var}(\varepsilon_i) = \sigma^2 x_i^2$. Suppose we use the transformation $Y' = Y/X$. Is this a
variance-stabilizing transformation?

$\mathrm{Var}(Y_i) = \mathrm{Var}(\varepsilon_i) = \sigma^2 x_i^2$

$\mathrm{Var}(Y_i') = \mathrm{Var}\left(\frac{Y_i}{x_i}\right) = \left(\frac{1}{x_i}\right)^2 \mathrm{Var}(Y_i) = \frac{1}{x_i^2}\,\sigma^2 x_i^2 = \sigma^2$

Yes, the variance of $Y'$ becomes constant.
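
The result of Example 4.1 can be checked numerically. The following is a minimal simulation sketch, not part of the original example: the parameter values (intercept 2, slope 3, $\sigma = 1$) and the four $x$ levels are assumptions chosen only for illustration.

```r
# Simulation sketch for Example 4.1 (assumed values: beta0 = 2, beta1 = 3, sigma = 1)
set.seed(1)
x <- rep(c(1, 2, 5, 10), each = 5000)           # four x levels, many replicates each
eps <- rnorm(length(x), mean = 0, sd = 1 * x)   # Var(eps_i) = sigma^2 * x_i^2
y <- 2 + 3 * x + eps
y.prime <- y / x                                # transformed response Y' = Y / X

var.y <- tapply(y, x, var)                      # sample variance of Y at each x level
var.y.prime <- tapply(y.prime, x, var)          # sample variance of Y' at each x level
round(var.y, 2)
round(var.y.prime, 2)
```

The variances of $Y$ should grow roughly like $x^2$, while the variances of $Y'$ should all be close to $\sigma^2 = 1$.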

Unequal error variances and non-normality of the error terms frequently appear together. To
remedy these departures from the linear regression model, we need a transformation on $Y$, since
the shape and spread of the distribution of $Y$ need to be changed.

A transformation on $Y$ may at the same time help to linearize a curvilinear regression relation.

Figure 4.1 below contains some prototype regression relations where the skewness and error variance
increase with the mean response $E(Y)$.

Figure 4.1: Prototype Regression Pattern
[Figure: prototype regression patterns for which a transformation on Y is useful]

Transformations on Y:
$Y' = \sqrt{Y}$
$Y' = \log_{10}(Y)$
$Y' = 1/Y$
Note: A simultaneous transformation on X may also be helpful or necessary.

Useful Variance-Stabilizing Transformations:

Relationship of $\sigma^2$ to $E(Y)$          Transformation
$\sigma^2 \propto$ constant                   $Y' = Y$ (no transformation)
$\sigma^2 \propto E(Y)$                       $Y' = \sqrt{Y}$ (square root; Poisson data)
$\sigma^2 \propto E(Y)[1 - E(Y)]$             $Y' = \sin^{-1}(\sqrt{Y})$ (arcsin; binomial proportions, $0 \le Y_i \le 1$)
$\sigma^2 \propto [E(Y)]^2$                   $Y' = \ln(Y)$ (natural log)
$\sigma^2 \propto [E(Y)]^3$                   $Y' = Y^{-1/2}$ (reciprocal square root)
$\sigma^2 \propto [E(Y)]^4$                   $Y' = Y^{-1}$ (reciprocal)
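
The entries in this table can be motivated by a first-order (delta method) approximation. The short derivation below is a sketch added for illustration; the function $g$ and the mean $\mu = E(Y)$ are notation introduced here and do not appear in the original table.

```latex
% First-order Taylor expansion of g(Y) about mu = E(Y):
%   g(Y) \approx g(\mu) + g'(\mu)(Y - \mu)
\mathrm{Var}[g(Y)] \approx [g'(\mu)]^2 \,\mathrm{Var}(Y)

% Case sigma^2 proportional to E(Y) (e.g. Poisson, Var(Y) = \mu), with g(Y) = \sqrt{Y}:
g'(\mu) = \frac{1}{2\sqrt{\mu}}
\quad\Rightarrow\quad
\mathrm{Var}(\sqrt{Y}) \approx \frac{1}{4\mu}\,\mu = \frac{1}{4}

% Case sigma^2 proportional to [E(Y)]^2 (Var(Y) = c\,\mu^2), with g(Y) = \ln Y:
g'(\mu) = \frac{1}{\mu}
\quad\Rightarrow\quad
\mathrm{Var}(\ln Y) \approx \frac{1}{\mu^2}\, c\,\mu^2 = c
```

In both cases the approximate variance of the transformed response no longer depends on the mean, which is what a variance-stabilizing transformation is meant to achieve.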

Example 4.2:
Data on age ( X ) and plasma level of a polyamine ( Y ) for a portion of the 25 healthy children in a study
are presented below in R code:
Age <- c(0,0,0,0,0,1,1,1,1,1,2,2,2,2,2,3,3,3,3,3,4,4,4,4,4)
Plasma <- c(13.44,12.84,11.91,20.09,15.60,10.11,11.38,10.28,8.96,
8.59,9.83,9,8.65,7.85,8.88,7.94,6.01,5.14,6.9,6.77,4.86,5.1,5.67,5.75,6.23)

#Use lm() function to fit the model
Blood.Reg <- lm(Plasma~Age)

#create the scatter plot
plot(x = Age, y = Plasma, xlab = "Age", ylab = "Plasma",
     main = "Plasma Level vs. Age Before Transformation", col = "red", pch = 19, cex = 1.5)
[Figure: scatter plot "Plasma Level vs. Age Before Transformation" (x-axis: Age; y-axis: Plasma)]

The scatter plot indicates curvilinear regression relationship, as well as the greater variability for
younger children than for older ones.

Based on the prototype regression pattern, we shall first try the logarithmic transformation, $Y' = \log_{10} Y$.

#create the scatter plot after transformation
LY <- log10(Plasma)
plot(x = Age, y = LY, xlab = "Age", ylab = "log10(Plasma)",
     main = "Plasma Level vs. Age After Transformation", col = "red", pch = 19, cex = 1.2)

Note that the transformation has not only led to a reasonably linear regression relation, but the variability
at the different levels of X has also become reasonably constant.

To further examine the reasonableness of the transformation $Y' = \log_{10} Y$, we fitted the simple linear
regression model to the transformed Y' data and obtained:

$\hat{Y}' = 1.135 - 0.1023X$

# Fit the model Y' = log10(Y) vs. X
BloodT.Reg <- lm(log10(Plasma) ~ Age)
summary(BloodT.Reg)

# Create plot of residuals vs. Age after transformation
plot(x = Age, y = BloodT.Reg$residuals, xlab = "Age", ylab = "Residuals",
     main = "Residuals vs. Age after Transformation (y = log10(Y))",
     col = "blue", pch = 19, cex = 1.5, panel.first = grid(col = "gray", lty = "dotted"))
abline(h = 0, col = "red")

# Normal probability plot after transformation
qqt.plot <- qqnorm(BloodT.Reg$residuals, main = "Normal Probability Plot After Transformation",
     xlab = "Theoretical Quantiles", ylab = "Sample Quantiles", plot.it = TRUE,
     col = "blue", pch = 19, cex = 1.5, panel.first = grid(col = "gray", lty = "dotted"))
abline(lm(qqt.plot$y ~ qqt.plot$x))

A plot of residuals against X and a normal probability plot after the transformation are shown below.
All of this shows evidence of the appropriateness of linear regression model for the transformed Y' data.

[Figure: "Residuals vs. Age after Transformation (y = log10(Y))" (x-axis: Age; y-axis: Residuals)]

[Figure: "Normal Probability Plot After Transformation" (x-axis: Theoretical Quantiles; y-axis: Sample Quantiles)]


4.1.1 Transformations on Y : The Box-Cox Method
It is often difficult to determine from diagnostic plots, such as the one in the plasma levels example,
which transformation of Y is most appropriate for correcting skewness of the distributions of error
terms, unequal error variances, and nonlinearity of regression function. The Box-Cox procedure
automatically identifies a transformation from the family of power transformations on Y .

Consider the transformed regression model

$Y_i^{(\lambda)} = \beta_0 + \beta_1 x_i + \varepsilon_i$

where

$Y^{(\lambda)} = \dfrac{Y^{\lambda} - 1}{\lambda}$ for $\lambda \ne 0$, and $Y^{(\lambda)} = \log_e(Y)$ for $\lambda = 0$.

This definition was given by Box and Cox (1964). Due to the structure of a linear regression model, one
can equivalently express this as

$Y^{(\lambda)} = Y^{\lambda}$ for $\lambda \ne 0$, and $Y^{(\lambda)} = \log_e(Y)$ for $\lambda = 0$.

With this model, there is an extra parameter, $\lambda$, that needs to be estimated. $\lambda$, $\beta_0$, $\beta_1$, and
$\sigma^2$ can be estimated via maximum likelihood estimation. The estimate $\hat{\lambda}$ can then be used to suggest the type
of transformation. For example,

$\lambda = 2$:     $Y' = Y^2$
$\lambda = 0.5$:   $Y' = \sqrt{Y}$
$\lambda = 0$:     $Y' = \ln Y$ (by definition)
$\lambda = -0.5$:  $Y' = 1/\sqrt{Y}$
$\lambda = -1.0$:  $Y' = 1/Y$

Notice that if $\lambda$ is estimated to be 1, no transformation is needed. The estimate for $\lambda$ is commonly searched
for in the range of -2 to 2.

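The MLE of $\lambda$ corresponds to the value of $\lambda$ for which the residual sum of squares from the fitted
model, $SS_E(\lambda)$, is a minimum (with the transformed responses standardized so that $SS_E(\lambda)$ is
comparable across values of $\lambda$). It is usually determined by plotting $SS_E(\lambda)$ versus $\lambda$. Usually 10 to 20
values of $\lambda$ are sufficient for estimation of the optimum value.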
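
The grid search described above can be sketched in R as follows. This is an illustrative sketch under stated assumptions, not the procedure used to produce any table in these notes: the helper name boxcox.sse is hypothetical, and the geometric-mean standardization of the transformed response is the usual convention assumed here so that the residual sums of squares are comparable across $\lambda$. The data are the Age and Plasma values of Example 4.2.

```r
# Sketch: residual sum of squares SS_E(lambda) on a grid of lambda values.
# boxcox.sse is a hypothetical helper name introduced for this illustration.
boxcox.sse <- function(y, x, lambdas) {
  gm <- exp(mean(log(y)))                  # geometric mean of the responses
  sapply(lambdas, function(lam) {
    w <- if (abs(lam) < 1e-8) {
      gm * log(y)                          # limiting case lambda = 0
    } else {
      (y^lam - 1) / (lam * gm^(lam - 1))   # standardized power transformation
    }
    sum(resid(lm(w ~ x))^2)                # SS_E for this value of lambda
  })
}

# Data from Example 4.2
Age <- rep(0:4, each = 5)
Plasma <- c(13.44, 12.84, 11.91, 20.09, 15.60, 10.11, 11.38, 10.28, 8.96,
            8.59, 9.83, 9, 8.65, 7.85, 8.88, 7.94, 6.01, 5.14, 6.9, 6.77,
            4.86, 5.1, 5.67, 5.75, 6.23)

lambdas <- seq(from = -1, to = 1, by = 0.1)
sse <- boxcox.sse(Plasma, Age, lambdas)
plot(lambdas, sse, type = "b", xlab = "lambda", ylab = "SSE(lambda)")
lambdas[which.min(sse)]                    # grid value with the smallest SS_E
```

A coarse grid such as this one is usually enough to identify a convenient value of $\lambda$ (for example -0.5, 0 or 0.5) rather than an exact optimum.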

From Example 4.2, the Box-Cox results show:

$\lambda$    $SS_E(\lambda)$      $\lambda$    $SS_E(\lambda)$
 1.0    78.0         -0.1    33.1
 0.9    70.4         -0.3    31.2
 0.7    57.8         -0.4    30.7
 0.5    48.4         -0.5    30.6
 0.3    41.4         -0.6    30.7
 0.1    36.4         -0.7    31.1
 0      34.5         -0.9    32.7
                     -1.0    33.9

Note that $\hat{\lambda} = -0.5$, with $SS_E(\hat{\lambda}) = 30.6$.

Besides $Y' = \log_{10} Y$, another choice is $Y' = 1/\sqrt{Y}$.

Another approach: using R code

Example 4.4:
The trees data set ships with base R (package datasets); the boxcox() function used below is in the
MASS package, which contains a set of functions and datasets. See help(trees) for specific information on the dataset.
Let Y = volume and X = height for the trees in the sample.

R-Codes
library(MASS)   # provides boxcox()
trees
mod.fit <- lm(formula = Volume ~ Height, data = trees)
summary(mod.fit)
# Plot of Y vs. X with sample model
plot(x = trees$Height, y = trees$Volume, xlab = "Height",
     ylab = "Volume", main = "Volume vs. Height")
abline(mod.fit)
# e.i vs. Yhat.i
plot(x = mod.fit$fitted.values, y = mod.fit$residuals,
     xlab = expression(hat(Y)), ylab = "Residual",
     main = expression(paste("Residuals vs. ", hat(Y))))
# Determine lambda.hat (boxcox() is in the MASS package)
save.bc <- boxcox(object = mod.fit, lambda = seq(from = -2, to = 2, by = 0.01))
title(main = "Box-Cox transformation plot")
lambda.hat <- save.bc$x[which.max(save.bc$y)]
lambda.hat

[Figure: scatter plot "Volume vs. Height" with the fitted line (x-axis: Height; y-axis: Volume)]

Notice that the variability in the $y_i$'s increases as $x_i$ increases.

[Figure: "Residuals vs. Y-hat" (x-axis: $\hat{Y}$; y-axis: Residual)]

The funnel shape occurs here. Based upon this and the scatter plot, it would be of interest to
consider a transformation of Y .
Also, notice the use of hat(Y) and the expression() function in the plot() function. Use demo(plotmath)
for more information about how to get mathematical symbols in plots.

Note:
The function expression returns a vector of type "expression" containing its arguments (unevaluated)

lambda.hat
[1] -0.19
[Figure: "Box-Cox transformation plot" (x-axis: $\lambda$; y-axis: log-Likelihood; 95% confidence interval marked)]

The boxcox() function estimates $\lambda$ using maximum likelihood estimation.

Here, it shows that the log-likelihood function is maximized when $\lambda = -0.19$. It also gives a likelihood
based 95% confidence interval of about -0.8 to 0.4 for $\lambda$. Notice that $\lambda = 0$ is in the interval (we may want
to consider the natural log transformation), and notice that $\lambda = 1$ is not in the interval (a transformation is needed).

Using $\hat{\lambda} = -0.19$ results in the following, $Y' = Y^{-0.19}$:

mod.fit2 <- lm(formula = Volume^lambda.hat ~ Height, data = trees)
summary(mod.fit2)
plot(x = mod.fit2$fitted.values, y = mod.fit2$residuals,
     xlab = expression(hat(Y)^{-0.19}), ylab = "Residual",
     main = expression(paste("Residuals vs. ", hat(Y)^{-0.19})))
============================
Call:
lm(formula = Volume^lambda.hat ~ Height, data = trees)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.959543 0.090273 10.629 1.62e-11 ***
Height -0.005526 0.001184 -4.668 6.38e-05 ***
[Figure: "Residuals vs. Y-hat^(-0.19)" (x-axis: $\hat{Y}^{-0.19}$; y-axis: Residual)]

It looks like $\hat{\lambda} = -0.19$ leads to an approximately constant variance. The sample model can then be
expressed as

$\hat{Y}^{-0.19} = 0.9595 - 0.005526 \cdot \text{Height}$

How would you find $\hat{Y}$?

$\hat{Y} = (0.9595 - 0.005526 \cdot \text{Height})^{-1/0.19}$

Since $\lambda = 0$ is in the interval, it may be of interest to try the natural log transformation, since this is
easier to interpret (and more common).

R-Codes
mod.fit3 <- lm(formula = log(Volume) ~ Height, data = trees)
summary(mod.fit3)
plot(x = mod.fit3$fitted.values, y = mod.fit3$residuals,
     xlab = "log(Y)", ylab = "Residual", main = "Residuals vs. log(Y)")

Call:
lm(formula = log(Volume) ~ Height, data = trees)
Coefficients:
(Intercept) -0.79652 0.89053 -0.894 0.378
Height 0.05354 0.01168 4.585 8.03e-05 ***

[Figure: "Residuals vs. log Y-hat" (x-axis: $\log \hat{Y}$; y-axis: Residual)]

The natural log transformation works as well. This sample model can be expressed as

$\log(\hat{Y}) = -0.7965 + 0.05354X$

How would you find $\hat{Y}$?

$\hat{Y} = e^{-0.7965 + 0.05354X}$
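
Both back-transformations used for the trees models (the power $-1/0.19$ and the exponential) follow one rule: invert whatever was applied to Y. The sketch below illustrates this; the helper name back.transform and the height value 76 are assumptions made only for this illustration.

```r
# Sketch: convert a prediction on the transformed scale back to the scale of Y.
# back.transform is a hypothetical helper name; lambda = 0 denotes the natural log.
back.transform <- function(pred, lambda) {
  if (abs(lambda) < 1e-8) {
    exp(pred)             # inverse of Y' = log(Y)
  } else {
    pred^(1 / lambda)     # inverse of Y' = Y^lambda (requires pred > 0)
  }
}

height <- 76  # assumed height, for illustration only
back.transform(0.9595 - 0.005526 * height, lambda = -0.19)  # power model
back.transform(-0.7965 + 0.05354 * height, lambda = 0)      # log model
```

The two fitted models should give broadly similar predicted volumes for heights inside the range of the data.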


4.2 Transformations to Linearize the Model
When the distributions of the error terms are reasonably close to normal and have constant
variance, transformations on X should be attempted. The reason why transformations on Y may
not be desirable here is that a transformation on Y, such as $Y' = \sqrt{Y}$, may change the shape of the
distribution of the error terms away from the normal distribution and may also lead to substantially differing error
term variances.

Figure 4.2: Prototype Regression Patterns and Transformations of X
[Figure: three prototype regression patterns, each with suggested transformations of X]

Pattern (a): $X' = \log_{10} X$ or $X' = \sqrt{X}$
Pattern (b): $X' = X^2$ or $X' = \exp(X)$
Pattern (c): $X' = 1/X$ or $X' = \exp(-X)$

Example 4.3:
Data from an experiment on the effect of the number of days of training received ( X ) on performance ( Y )
in a battery of simulated sales situations are presented below:
Train <- c(.5,.5,1,1,1.5,1.5,2,2,2.5,2.5)
Score <- c(42.5,50.6,68.5,80.7,89,99.6,105.3,111.8,112.3,125.7)
perf.Reg <- lm(Score~Train)
# Create scatter plot of Training vs. Score before transformation
plot(x = Train, y = Score, xlab = "Training", ylab = "Performance",
     main = "Training vs. Performance before Transformation", col = "blue", pch = 19, cex = 1.5)
abline(perf.Reg)
# Create plot of residuals vs. predicted values before transformation
plot(x = perf.Reg$fitted.values, y = perf.Reg$residuals, xlab = "Predicted Values", ylab = "Residuals",
     main = "Residuals vs. Predicted Values Before Transformation", col = "blue", pch = 19, cex = 1.5)
# Normal probability plot before transformation
qq.plot <- qqnorm(perf.Reg$residuals, main = "Normal Probability Plot Before Transformation",
     xlab = "Theoretical Quantiles", ylab = "Sample Quantiles", plot.it = TRUE,
     col = "blue", pch = 19, cex = 1.5)
abline(lm(qq.plot$y~qq.plot$x))

[Figure: "Training vs. Performance before Transformation" (x-axis: Training; y-axis: Performance)]

[Figure: "Residuals vs. Predicted Values Before Transformation" (x-axis: Predicted Values; y-axis: Residuals)]

[Figure: "Normal Probability Plot Before Transformation" (y-axis: Sample Quantiles)]
The scatter plot indicates that the relation appears to be fairly curvilinear. Since the variability at
the different X levels appears to be fairly constant, we shall consider a transformation on X.
Based on the prototype plot, we shall initially consider the square root transformation $X' = \sqrt{X}$.

# Create scatter plot of sqrt(Training) vs. Score after transformation
XP <- sqrt(Train)
plot(x = XP, y = Score, xlab = "Sqrt(Training)", ylab = "Performance",
     main = "Sqrt(Training) vs. Performance after Transformation", col = "blue", pch = 19, cex = 1.5)
# Fit the model y vs. sqrt(x)
perfT.Reg <- lm(Score~I(sqrt(Train)))
summary(perfT.Reg)

plot(x = perfT.Reg$fitted.values, y = perfT.Reg$residuals, xlab = "Predicted Values", ylab = "Residuals",
     main = "Residuals vs. Predicted Values After Transformation", col = "blue", pch = 19, cex = 1.5)

qqt.plot <- qqnorm(perfT.Reg$residuals, main = "Normal Probability Plot After Transformation (x' = sqrt(x))",
     xlab = "Theoretical Quantiles", ylab = "Sample Quantiles", plot.it = TRUE,
     col = "blue", pch = 19, cex = 1.5, panel.first = grid(col = "gray", lty = "dotted"))
abline(lm(qqt.plot$y~qqt.plot$x))

[Figure: "Sqrt(Training) vs. Performance after Transformation" (x-axis: Sqrt(Training); y-axis: Performance)]

[Figure: "Residuals vs. Predicted Values After Transformation" (x-axis: Predicted Values; y-axis: Residuals)]

[Figure: "Normal Probability Plot After Transformation (x' = sqrt(x))" (y-axis: Sample Quantiles)]
Note that the scatter plot of Y versus X' shows a reasonably linear relation. The variability
of the scatter plot at the different X levels is the same as before.

The plot of residuals against X' shows no evidence of unequal error variances. The normal probability
plot after transformation also shows no indications of substantial departures from normality. Thus the
simple linear regression model $Y = \beta_0 + \beta_1 X' + \varepsilon$ appears to be appropriate here.

Fitting the model using the transformed data, we obtain:

$\hat{Y} = -10.33 + 83.45X'$

4.3.2 Transformations on the Predictor variable ( X ): The Box and Tidwell Method
Suppose that the relationship between Y and one or more of the regressor variables is nonlinear but
that the usual assumptions of normally and independently distributed responses with constant
variance are at least approximately satisfied. We want to select an appropriate transformation on
the regressor variables so that the relationship between Y and the transformed regressor is as
simple as possible.

Box and Tidwell describe an analytical procedure for determining the form of the transformation on X.
Assume that the response variable Y is related to a power of the regressor, say $\xi = X^{\alpha}$, as

$E(Y) = f(\xi, \beta_0, \beta_1) = \beta_0 + \beta_1 \xi$, where $\xi = X^{\alpha}$ for $\alpha \ne 0$ and $\xi = \ln X$ for $\alpha = 0$,

and $\beta_0$, $\beta_1$ and $\alpha$ are unknown parameters.

The procedure is:
Let $\alpha_0 = 1$ be the initial guess of $\alpha$, so that $\xi_0 = X^{\alpha_0} = X$; that is, no transformation at all is applied in
the first iteration.
Expanding about the initial guess in a Taylor series and ignoring terms of higher than first order:

$E(Y) = f(\xi_0, \beta_0, \beta_1) + (\alpha - \alpha_0)\left\{\dfrac{df(\xi, \beta_0, \beta_1)}{d\alpha}\right\}_{\xi = \xi_0,\ \alpha = \alpha_0}$

$\phantom{E(Y)} = \beta_0 + \beta_1 X + (\alpha - 1)\left\{\dfrac{df(\xi, \beta_0, \beta_1)}{d\alpha}\right\}_{\xi = \xi_0,\ \alpha = \alpha_0}$

Note:
If the term in braces were known, it could be treated as an additional regressor variable, and it would be
possible to estimate the parameters $\beta_0$, $\beta_1$ and $\alpha$ by least squares estimation.

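$\left\{\dfrac{df(\xi, \beta_0, \beta_1)}{d\alpha}\right\}_{\xi = \xi_0,\ \alpha = \alpha_0} = \left\{\dfrac{df(\xi, \beta_0, \beta_1)}{d\xi}\right\}_{\xi = \xi_0} \left\{\dfrac{d\xi}{d\alpha}\right\}_{\alpha = \alpha_0}$

$= \dfrac{d(\beta_0 + \beta_1 X)}{dX} \cdot \left.\dfrac{d(X^{\alpha})}{d\alpha}\right|_{\alpha = \alpha_0}$

$= \beta_1 X \ln(X)$

Thus,

$E(Y) = \beta_0^* + \beta_1^* X + (\alpha - 1)\beta_1 X \ln(X) = \beta_0^* + \beta_1^* X + \beta_2^* W$

where $\beta_2^* = (\alpha - 1)\beta_1$ and $W = X \ln(X)$.

Note that $\beta_1$ can be estimated by fitting the model $\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X$, and

$\beta_2^*$ can be estimated by fitting the model $\hat{Y} = \hat{\beta}_0^* + \hat{\beta}_1^* X + \hat{\beta}_2^* W$.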

Take $\alpha_1 = \dfrac{\hat{\beta}_2^*}{\hat{\beta}_1} + 1$ as the revised estimate of $\alpha$.
This procedure may now be repeated using the new regressor $X' = X^{\alpha_1}$ in the calculations.
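
The iteration can be sketched as a small R function. This is an illustrative sketch, not the implementation of any package: the function name box.tidwell.sketch and its arguments are hypothetical. One design choice is an assumption of this sketch: in every iteration the extra regressor is $X^{\alpha} \ln(X)$ (the derivative of $X^{\alpha}$ with respect to $\alpha$, using the log of the original X), so that the additive update $\alpha + \hat{\beta}_2^*/\hat{\beta}_1$ stays consistent with the Taylor expansion. With $\alpha_0 = 1$ the first step reduces to $W = X \ln(X)$ and $\alpha_1 = \hat{\beta}_2^*/\hat{\beta}_1 + 1$.

```r
# Sketch of the Box-Tidwell iteration for a single positive regressor x.
# box.tidwell.sketch is a hypothetical name used only for this illustration.
box.tidwell.sketch <- function(y, x, alpha0 = 1, max.iter = 10, tol = 1e-4) {
  alpha <- alpha0
  for (iter in seq_len(max.iter)) {
    x.alpha <- x^alpha                        # current transformed regressor
    w <- x.alpha * log(x)                     # derivative of x^alpha with respect to alpha
    b1 <- coef(lm(y ~ x.alpha))[2]            # beta1-hat from the simple fit
    b2 <- coef(lm(y ~ x.alpha + w))[3]        # beta2*-hat from the augmented fit
    alpha.new <- unname(alpha + b2 / b1)      # revised estimate of alpha
    converged <- abs(alpha.new - alpha) < tol
    alpha <- alpha.new
    if (converged) break
  }
  alpha
}

# Usage on simulated data (assumed true model: Y = 3 - 7 / X + error)
set.seed(10)
x.sim <- seq(from = 2, to = 10, by = 0.25)
y.sim <- 3 - 7 / x.sim + rnorm(length(x.sim), sd = 0.05)
box.tidwell.sketch(y.sim, x.sim)
```

The sketch does not guard against $\alpha$ approaching 0, where $X^{\alpha}$ becomes collinear with the intercept; a production routine would need to handle that case.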

Box and Tidwell (1962) noted that this procedure usually converges quite rapidly, and often the first-stage
result $\alpha_1$ is a satisfactory estimate of $\alpha$. However, round-off error is potentially a problem.
Convergence problems may be encountered in cases where the error standard deviation is large or when
the range of the regressor is very small compared to its mean.

Note: $\hat{\beta}_1$ and $\hat{\beta}_1^*$ generally differ.

Example 4.5:
A research engineer is investigating the use of a windmill to generate electricity. He has collected data
on the DC output (Y ) from his windmill and the corresponding wind velocity ( X ).
R-Codes:
Y <- c(.123, .5, .653, .558, 1.057, 1.137, 1.144, 1.194, 1.562, 1.582, 1.501, 1.737, 1.822, 1.866, 1.93,
1.8, 2.088, 2.179, 2.166, 2.112, 2.303, 2.294, 2.386, 2.236,2.31)
X <- c(2.45, 2.7, 2.9, 3.05, 3.4, 3.6, 3.95, 4.1, 4.6, 5, 5.45,5.8, 6, 6.2, 6.35, 7,7.4, 7.85, 8.15, 8.8, 9.1,
9.55, 9.7, 10, 10.2)

plot(X, Y, xlab = "Wind Velocity, X", ylab = "DC Output, Y", main = "DC Output vs. Wind Velocity",
col = "Blue", pch = 19, cex=1.5)

#First iteration
Fit0 <- lm(Y~X)
FitT0 <- lm(Y~X+I(X*log(X)))
Fit0
FitT0
[Figure: scatter plot "DC Output vs. Wind Velocity" (x-axis: Wind Velocity, X; y-axis: DC Output, Y)]

The scatter plot suggests that the relationship between DC output and wind speed is not a straight
line and that some transformation on X may be appropriate.

#First iteration
Call:
lm(formula = Y ~ X)
Coefficients:
(Intercept) X
0.1309 0.2411

Call:
lm(formula = Y ~ X + I(X * log(X)))
Coefficients:
(Intercept) X I(X * log(X))
-2.4168 1.5344 -0.4626

We begin with the initial guess $\alpha_0 = 1$ and fit the two models:

$\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X = 0.1309 + 0.2411X$

and

$\hat{Y} = \hat{\beta}_0^* + \hat{\beta}_1^* X + \hat{\beta}_2^* W = -2.4168 + 1.5344X - 0.4626W$

and we calculate

$\alpha_1 = \dfrac{\hat{\beta}_2^*}{\hat{\beta}_1} + 1 = \dfrac{-0.4626}{0.2411} + 1 = -0.9187$

as the improved estimate of $\alpha$. Note that this estimate of $\alpha$ is very close to -1, so the reciprocal
transformation, $X' = 1/X$, is appropriate.

R-codes:

#Download the package car from the CRAN homepage.
#To install the package: Menu->Packages->Install package(s) from local zip files.
library(car)
boxTidwell(Y ~ X)   # named box.tidwell() in older versions of car

Output:
Initial Power -0.91830
Score Statistic -9.13243
p-value 0.00000
MLE of Power -0.83334

iterations = 3

Note: The initial power is the first-stage estimate $\alpha_1$, obtained with $W = X \ln X$.
#Second iteration
Alpha1<- FitT0$coefficients[3]/ Fit0$coefficients[2]+1
lm(Y~I(X^ Alpha1))
lm(Y~I(X^Alpha1)+I((X^ Alpha1)*log(X^ Alpha1)))

#Second iteration
Call:
lm(formula = Y ~ I(X^ Alpha1))

Coefficients:
(Intercept) I(X^ Alpha1)
3.101 -6.683

Call:
lm(formula = Y ~ I(X^ Alpha1) + I((X^ Alpha1) * log(X^ Alpha1)))

Coefficients:
(Intercept) I(X^Alpha1) I((X^ Alpha1) * log(X^ Alpha1))
3.2409 -6.4445 0.5994

To perform a second iteration, define a new regressor variable $X' = X^{-0.9183}$ and fit the models

$\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X' = 3.101 - 6.683X'$

and

$\hat{Y} = \hat{\beta}_0^* + \hat{\beta}_1^* X' + \hat{\beta}_2^* W' = 3.2409 - 6.4445X' + 0.5994W'$

where $W' = X' \ln X'$. The second-step estimate of $\alpha$ is thus

$\alpha_2 = \dfrac{\hat{\beta}_2^*}{\hat{\beta}_1} + \alpha_1 = \dfrac{0.5994}{-6.683} + (-0.9183) = -1.01$

which again supports the use of the reciprocal transformation on X.

Generalized and Weighted Least Squares
4.2.1 Generalized Least Squares
A difficulty with transformations of Y is that they may create an inappropriate regression
relationship. When an appropriate regression relationship has been found but the variances of the error
terms are unequal, an alternative to transformation is weighted least squares.

Consider the model: $y = X\beta + \varepsilon$, with $E(\varepsilon) = 0$ and $\mathrm{Var}(\varepsilon) = \sigma^2 V$.
The ordinary least-squares estimator $\hat{\beta} = (X'X)^{-1}X'y$ is no longer appropriate.

Note:
$\sigma^2 V$ is the covariance matrix of the errors, and we define $K'K = KK = V$, where K is a nonsingular
symmetric matrix. The matrix K is often called the square root of V.

Define the new variables

$z = K^{-1}y$, $B = K^{-1}X$, $g = K^{-1}\varepsilon$

The regression model can be transformed as

$K^{-1}y = K^{-1}X\beta + K^{-1}\varepsilon$, or $z = B\beta + g$

where the errors in this transformed model have zero expectation,

i.e. $E(g) = K^{-1}E(\varepsilon) = 0$

and the covariance matrix of g is

$\mathrm{Var}(g) = E\{[g - E(g)][g - E(g)]'\} = E(gg') = E(K^{-1}\varepsilon\varepsilon'K^{-1}) = K^{-1}E(\varepsilon\varepsilon')K^{-1}$
$\phantom{\mathrm{Var}(g)} = \sigma^2 K^{-1}VK^{-1} = \sigma^2 K^{-1}KKK^{-1} = \sigma^2 I$

Thus, the elements of g have mean zero and constant variance and are uncorrelated.

Since the errors g in this new model satisfy the usual assumptions, we may apply ordinary least squares.
The least squares function is $S(\beta) = g'g = \varepsilon'V^{-1}\varepsilon = (y - X\beta)'V^{-1}(y - X\beta)$.

The normal equations are $(X'V^{-1}X)\hat{\beta} = X'V^{-1}y$.

The solution to these equations, $\hat{\beta} = (X'V^{-1}X)^{-1}X'V^{-1}y$,
is called the generalized least squares estimator of $\beta$.
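
The equivalence between the closed-form estimator and ordinary least squares on the transformed variables can be checked numerically. The sketch below is an illustration under assumptions: the function names gls.estimate and gls.by.transform are hypothetical, the data are simulated, and V is taken to be diagonal so that the result can also be compared with lm() using weights.

```r
# Sketch: generalized least squares computed two ways (hypothetical helper names).
gls.estimate <- function(X, y, V) {
  V.inv <- solve(V)
  solve(t(X) %*% V.inv %*% X, t(X) %*% V.inv %*% y)   # (X'V^-1 X)^-1 X'V^-1 y
}

gls.by.transform <- function(X, y, V) {
  e <- eigen(V, symmetric = TRUE)
  K <- e$vectors %*% diag(sqrt(e$values)) %*% t(e$vectors)  # symmetric square root, KK = V
  K.inv <- solve(K)
  z <- K.inv %*% y                                    # z = K^-1 y
  B <- K.inv %*% X                                    # B = K^-1 X
  solve(t(B) %*% B, t(B) %*% z)                       # OLS on the transformed model
}

# Simulated data (assumed model: Y = 2 + 3x + error, Var(error_i) = x_i^2)
set.seed(2)
n <- 30
x <- seq(from = 1, to = 10, length.out = n)
X <- cbind(1, x)
V <- diag(x^2)
y <- 2 + 3 * x + x * rnorm(n)

b.gls <- gls.estimate(X, y, V)
b.tr  <- gls.by.transform(X, y, V)
b.wls <- coef(lm(y ~ x, weights = 1 / x^2))           # same estimate when V is diagonal
cbind(b.gls, b.tr, b.wls)
```

The three columns should agree up to numerical round-off.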

Notes:
1. $E(\hat{\beta}) = \beta$
2. $\mathrm{Var}(\hat{\beta}) = \sigma^2 (B'B)^{-1} = \sigma^2 (X'V^{-1}X)^{-1}$
3. When $V = I$, the error terms $\varepsilon$ are uncorrelated and have equal variances, and the ordinary least-squares
estimator $\hat{\beta} = (X'X)^{-1}X'y$ is appropriate.
4. When V is a diagonal matrix with unequal diagonal elements, the error terms $\varepsilon$ are uncorrelated
but have unequal variances, and the generalized least squares estimator $\hat{\beta} = (X'V^{-1}X)^{-1}X'V^{-1}y$ is used.

4.2.2 Weighted Least Squares
When the errors are uncorrelated but have unequal variances and

$V = \mathrm{diag}(1/w_1,\ 1/w_2,\ \ldots,\ 1/w_n)$,

let $W = V^{-1}$ (since V is a diagonal matrix, W is also diagonal, with diagonal elements or weights
$w_1, w_2, \ldots, w_n$). The weighted least squares estimator $\hat{\beta} = (X'WX)^{-1}X'Wy$ is used.

Notes:
1) $w_i$ is used to stand for weight.
2) These estimators are unbiased and have minimum variance among all unbiased linear estimators.
3) Since the weight $w_i$ is inversely related to the variance $\sigma_i^2$, it reflects the amount of information
contained in the observation $y_i$. Thus, an observation $y_i$ that has a large variance receives less
weight than another observation that has a smaller variance. The more precise $y_i$ is (i.e., the
smaller $\sigma_i^2$ is), the more information $y_i$ provides about $E(y_i)$ and therefore the more weight it
should receive in fitting the regression function.

Problem: $w_i$ is usually unknown.

Solutions:
1) Examine a plot of $e_i$ vs. $\hat{Y}_i$ (using the ordinary least squares estimates). When the constant variance
assumption is violated, the plot may look like:

[Figure: residuals plotted against $\hat{Y}$, with a horizontal reference line at 0]

Divide the plot into 3 to 5 groups. Estimate the variance of the $e_i$'s for each group by $S_j^2$.

[Figure: the same residual plot divided into groups along the $\hat{Y}$ axis]

Set $w_j = 1/S_j^2$, where j denotes the group number.

2) Suppose the variance of the residuals varies with one of the predictor variables. For
example, suppose the following plot is obtained.

[Figure: residuals $e_i$ plotted against the predictor $X_k$, with a horizontal reference line at 0]

Fit a simple regression model (the estimated variance or standard deviation function) using
$e_i^2$ (or $|e_i|$) as the response variable and $X_{ik}$ as the predictor variable. The predicted values from
the estimated variance or standard deviation function for each observation are then used to
find the weights, $w_i = 1/\hat{V}_i$, where $\hat{V}_i$ denotes the fitted values (the fitted variances; if the
standard deviation function is used, the fitted values are squared first).

3) Estimate the regression coefficients using these weights.
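
Solution 2 can be sketched in R as follows. This is a minimal sketch on simulated data, not part of the original notes: the simulated model (error standard deviation proportional to x) and all object names are assumptions made for illustration. It uses the standard deviation function, so the fitted values are squared before being inverted.

```r
# Sketch: weights from an estimated standard deviation function (simulated data).
set.seed(3)
x <- seq(from = 1, to = 40, by = 0.25)
y <- 2 + 3 * x + x * rnorm(length(x))       # assumed: error SD proportional to x

ols.fit <- lm(y ~ x)                        # step 1: ordinary least squares fit
sd.fit <- lm(abs(resid(ols.fit)) ~ x)       # step 2: regress |e_i| on the predictor
s.hat <- fitted(sd.fit)                     # estimated standard deviation for each observation
if (any(s.hat <= 0)) {
  warning("non-positive fitted SD: a different variance function is needed")
}
w <- 1 / s.hat^2                            # step 3: weights = 1 / (fitted SD)^2
wls.fit <- lm(y ~ x, weights = w)           # step 4: weighted least squares fit
coef(ols.fit)
coef(wls.fit)
```

If the coefficients change noticeably, the residuals of the weighted fit can be used to re-estimate the standard deviation function and the steps repeated.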

Notes:
1. Inferences are usually done assuming W is known even though it really is not. By using
estimated quantities in W, there is a source of variability that is not being accounted for.
2. $R^2$ does not have the same meaning as for unweighted least squares.

Example 4.6: Fit a regression model using weighted least squares
We try to simulate some data to illustrate non-constant variance.
#Simulate data with nonconstant variance
X<-seq(from = 1, to = 40, by = 0.25)
#random generation for the normal distribution
set.seed(5)
epsilon<-rnorm(n = length(X), mean = 0, sd = 1)
epsilon2<-X*epsilon
#Var(epsilon2) = X^2 * Var(epsilon) = X^2 (non-constant variance), since Var(epsilon) = 1
Y<- 2 + 3*X + epsilon2
set1<-data.frame(Y, X)
#Y vs. X with sample model
plot(x = X, y = Y, xlab = "X", ylab = "Y", main = "Y vs. X", panel.first = grid(col = "gray", lty = "dotted"))
mod.fit<-lm(formula = Y ~ X, data = set1)
abline(mod.fit, col="red")
summary(mod.fit)
Call:
lm(formula = Y ~ X, data = set1)

Residuals:
Min 1Q Median 3Q Max
-67.436 -9.892 -1.117 10.978 78.869

Coefficients:
(Intercept) 2.051 3.818 0.537 0.592
X 3.018 0.163 18.514 <2e-16 ***
[Figure: scatter plot "Y vs. X" with the fitted least squares line (x-axis: X; y-axis: Y)]

From examining the plot, one can see that the variance is a function of X . (as X increases, the
variability increases).

#Residuals vs. Yhat
plot(x = mod.fit$fitted.values, y = mod.fit$residuals, xlab = expression(hat(Y)), ylab ="Residuals",
main = "Residuals vs. estimated mean response", panel.first = grid(col = "gray", lty = "dotted"))
abline(h = 0, col = "darkgreen")
[Figure: "Residuals vs. estimated mean response" (x-axis: $\hat{Y}$; y-axis: Residuals)]

The megaphone shape above indicates non-constant variance.

#Try calculating a P.I. for X = 40 (will use later)
pred<-predict(object = mod.fit, newdata = data.frame(X = 40), interval = "prediction", level = 0.95)
pred

fit lwr upr
[1,] 122.773 76.48418 169.0618

Three different weighted least squares methods are investigated.
1) Based on the predicted values, the data are broken up into 5 groups. The estimated variance for
each group is obtained. The weight used is $w_j = 1/S_j^2$, where $S_j^2$ is the sample variance of the
residuals for the $m_j$ observations in group $j = 1, \ldots, 5$.

# Method 1
# Find quantiles for the Y-hat's
quant5<-quantile(x = mod.fit$fitted.values, probs = c(0.2, 0.4, 0.6, 0.8), type = 1)
round(quant5,2)

# Put the Y-hat's into groups based upon the quantiles
groups<-ifelse(mod.fit$fitted.values < quant5[1], 1,
        ifelse(mod.fit$fitted.values < quant5[2], 2,
        ifelse(mod.fit$fitted.values < quant5[3], 3,
        ifelse(mod.fit$fitted.values < quant5[4], 4, 5))))
groups


1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2
41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
3 3 3 3 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4
101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120
4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140
4 4 4 4 4 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157
5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
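Output of round(quant5, 2):
20% 40% 60% 80%
28.46 51.85 75.99 99.38

Note: Based on the predicted values, the data are broken up into 5 groups. With 157 observations, the
20% cut point falls at observation 157 x 20% = 31.4, the 40% cut point at 62.8, the 60% cut point at 94.2
and the 80% cut point at 125.6.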
# tapply() applies a function to each (non-empty) group of values defined by the levels of a factor
# Quick way to find the variance of the residuals for each group
var.eps<-tapply(X = mod.fit$residuals, INDEX = groups, FUN = var)
var.eps

#Visualization of creating the groups
plot(x = mod.fit$fitted.values, y = mod.fit$residuals,
     xlab = expression(hat(Y)), ylab = "Residuals",
     main = "Residuals vs. estimated mean response")
abline(v = quant5, col = "red", lwd = 3)

[Figure: "Residuals vs. estimated mean response" with vertical lines at the four quantile cut points (x-axis: $\hat{Y}$; y-axis: Residuals)]

Output of var.eps (the estimated variance for each group):
1 2 3 4 5
25.91165 148.35059 331.15305 1036.06249 1172.47827

#Put the group variances into a vector corresponding to each observation
group.var<-ifelse(groups == 1, var.eps[1],
           ifelse(groups == 2, var.eps[2],
           ifelse(groups == 3, var.eps[3],
           ifelse(groups == 4, var.eps[4], var.eps[5]))))

Note: The vertical lines in the plot mark the cut points 28.46, 51.85, 75.99 and 99.38. The 31 observations
labelled 1 in the grouping output fall into the first group, and the 32 observations labelled 5 fall
into the fifth group.


mod.fit1<-lm(formula = Y ~ X, data = set1, weights = 1/group.var)
summary(mod.fit1)

#Try calculating a P.I. for X = 40
pred1<-predict(object = mod.fit1, newdata=data.frame(X = 40), interval = "prediction" , level = 0.95)
pred1
fit lwr upr
[1,] 123.9026 116.1134 131.6919
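
The nested ifelse() calls used to form the groups in Method 1 can be written more compactly with cut(). The sketch below is an illustration only: fv is a hypothetical stand-in for mod.fit$fitted.values, generated at random so that the block runs on its own, and the object names are assumptions.

```r
# Sketch: grouping by quantile cut points with cut() instead of nested ifelse().
set.seed(4)
fv <- runif(50, min = 0, max = 100)    # hypothetical stand-in for the fitted values
q <- quantile(x = fv, probs = c(0.2, 0.4, 0.6, 0.8), type = 1)

# Nested ifelse(), as in Method 1
groups.nested <- ifelse(fv < q[1], 1,
                 ifelse(fv < q[2], 2,
                 ifelse(fv < q[3], 3,
                 ifelse(fv < q[4], 4, 5))))

# Same grouping with cut(); right = FALSE gives intervals closed on the left, [a, b)
groups.cut <- cut(fv, breaks = c(-Inf, q, Inf), right = FALSE, labels = FALSE)

all(groups.nested == groups.cut)       # TRUE when both give the same grouping
table(groups.cut)
```

With group labels in hand, the per-observation variances can be looked up by indexing, for example var.eps[groups], instead of a second chain of ifelse() calls.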

2) Based on the predicted values, the data are broken up into 3 groups. The estimated variance for
each group is obtained. The weight used is $w_j = 1/S_j^2$, where $S_j^2$ is the sample variance of the
residuals for the $m_j$ observations in group $j = 1, 2, 3$.

# Method 2
# Find quantiles for the Y-hat's
quant3<-quantile(x = mod.fit$fitted.values, probs = c(1/3, 2/3), type = 1)
quant3
# Put the Y-hat's into groups based upon the quantiles
groups<-ifelse(mod.fit$fitted.values < quant3[1], 1,
        ifelse(mod.fit$fitted.values < quant3[2], 2, 3))
#Quick way to find the variance of residuals for each group
var.eps<-tapply(X = mod.fit$residuals, groups, var)
var.eps

1 2 3 4 5 6 7
25.91165 25.91165 25.91165 25.91165 25.91165 25.91165 25.91165
8 9 10 11 12 13 14
25.91165 25.91165 25.91165 25.91165 25.91165 25.91165 25.91165
15 16 17 18 19 20 21
25.91165 25.91165 25.91165 25.91165 25.91165 25.91165 25.91165
22 23 24 25 26 27 28
25.91165 25.91165 25.91165 25.91165 25.91165 25.91165 25.91165
29 30 31 32 33 34 35
25.91165 25.91165 25.91165 148.35059 148.35059 148.35059 148.35059
36 37 38 39 40 41 42
148.35059 148.35059 148.35059 148.35059 148.35059 148.35059 148.35059
43 44 45 46 47 48 49
148.35059 148.35059 148.35059 148.35059 148.35059 148.35059 148.35059
50 51 52 53 54 55 56
148.35059 148.35059 148.35059 148.35059 148.35059 148.35059 148.35059
57 58 59 60 61 62 63
148.35059 148.35059 148.35059 148.35059 148.35059 148.35059 331.15305
64 65 66 67 68 69 70
331.15305 331.15305 331.15305 331.15305 331.15305 331.15305 331.15305
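Note: The listing above is the vector group.var from Method 1 (first 70 observations shown). The group 1
variance, 25.91165, appears 31 times, corresponding to the 31 predicted response values in group 1. The
group 2 variance, 148.35059, also appears 31 times, corresponding to the 31 predicted response values in
group 2.

Compared with the least squares prediction interval for X = 40 obtained earlier, the width of the Method 1
WLS interval decreases.
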
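#Visualization of creating the groups
plot(x = mod.fit$fitted.values, y = mod.fit$residuals,
     xlab = expression(hat(Y)), ylab = "Residuals",
     main = "Residuals vs. estimated mean response")
abline(v = quant3, col = "red", lwd = 3)

[Figure: "Residuals vs. estimated mean response" with vertical lines at the two quantile cut points (x-axis: $\hat{Y}$; y-axis: Residuals)]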

#Put the group variances into a vector corresponding to each observation
group.var<-ifelse(groups == 1, var.eps[1],
ifelse(groups == 2, var.eps[2], var.eps[3]))
mod.fit2<-lm(formula = Y ~ X, data = set1, weights = 1/group.var)
summary(mod.fit2)
pred2<-predict(object = mod.fit2, newdata =data.frame(X = 40), interval = "prediction", level = 0.95)
pred2

fit lwr upr
[1,] 123.08 115.03 131.13

3) Suppose that $Z \sim N(0, \sigma^2)$. It can be shown that $cZ \sim N(0, c^2\sigma^2)$. In
the data simulation process, we are using $\varepsilon_i \sim N(0, x_i^2\sigma^2)$ as the error term, where $\sigma^2 = 1$. Thus,
the most appropriate weight to use is $w_i = 1/x_i^2$. Of course, in a real-life data analysis
setting, this information would not be known. However, it can serve here as the best
method to compare with methods #1 and #2.

# Method 3
mod.fit3<-lm(formula = Y ~ X, data = set1, weights = 1/X^2)
summary(mod.fit3)
pred3<-predict(object = mod.fit3, newdata = data.frame(X = 40), interval = "prediction", level = 0.95)
pred3
fit lwr upr
[1,] 123.4184 116.0678 130.7691
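
One caution when computing prediction intervals from a weighted fit: by default predict() treats the new observation as having weight 1, and it issues a warning to that effect. The weights argument of predict() lets the prediction variance reflect the weight that the new observation would have. The sketch below re-creates the simulated data of this example and compares the two intervals; the object names fit.wls, pi.default and pi.weighted are assumptions introduced for this illustration, and the new-observation weight 1/40^2 follows the known variance structure of the simulation.

```r
# Sketch: prediction interval at X = 40 from a weighted fit, with and without
# telling predict() the weight of the new observation.
set.seed(5)
X <- seq(from = 1, to = 40, by = 0.25)
Y <- 2 + 3 * X + X * rnorm(n = length(X), mean = 0, sd = 1)
set1 <- data.frame(Y, X)

fit.wls <- lm(formula = Y ~ X, data = set1, weights = 1 / X^2)
new.obs <- data.frame(X = 40)

# Default: assumes the new observation has weight 1 (R warns about this)
pi.default <- predict(fit.wls, newdata = new.obs, interval = "prediction", level = 0.95)

# Weight for the new observation supplied, matching Var(error) = X^2 at X = 40
pi.weighted <- predict(fit.wls, newdata = new.obs, interval = "prediction", level = 0.95,
                       weights = 1 / 40^2)
pi.default
pi.weighted
```

The second interval is wider because it accounts for the larger error variance at X = 40, so comparisons of interval widths between least squares and weighted fits should take this setting into account.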

Here is an overall summary of the estimated $\hat{\beta}_j$'s:
name X.Intercept. X
1 Least Squares 2.05 3.02
2 WLS 1 2.22 3.04
3 WLS 2 2.67 3.01
4 WLS 3 1.84 3.04

Since the constant variance assumption is violated, inferences using least squares estimation may be
incorrect.

Below are the prediction intervals for 40 = X .

name fit lwr upr
1 Least Squares 122.77 76.48 169.06
2 WLS 1 123.90 116.11 131.69
3 WLS 2 123.08 115.03 131.13
4 WLS 3 123.42 116.07 130.77

Notice how different the ordinary least squares interval (and thus the variance used in its calculation) is from
the WLS intervals. The three WLS intervals are almost identical, whereas the interval obtained from the
ordinary least squares method may be incorrect.
