
LINEAR REGRESSION

K.F.Turkman
Contents
1 Introduction
2 Straight line
2.1 Examining the regression equation
2.2 Some distributional theory
2.3 Confidence intervals and tests of hypotheses regarding $(\beta_0, \beta_1)$
2.4 Predicted future value of y
2.5 Straight line regression in matrix terms
3 Generalization to multivariate (multiple) regression
3.1 Precision of the regression equation
3.2 Is our model correct?
3.3 $R^2$ when there are repeated observations
3.4 Correlation Coefficients
3.5 Partial correlation coefficient
3.6 Use of qualitative variables in the regression equation
4 Selecting the best Regression equation
4.1 Extra sums of squares and partial F-tests
4.2 Methods of selecting the best regression
5 Examination of residuals
5.1 Testing the independence of residuals
5.2 Checking for normality
5.3 Plots of residuals
5.4 Outliers and tests for influential observations
6 Further Topics
6.1 Transformations
6.2 Unequal variances
6.3 Ill-conditioned regression, collinearity and Ridge regression
6.4 Generalized Linear models (GLIM)
6.5 Nonlinear models
Approximate duration of the course: 12 hours
References:
1. A. Sen and M. Srivastava (1990). Regression Analysis. Springer Verlag.
2. N. Draper and H. Smith (1998). Applied Regression Analysis. Wiley.
3. V. K. Rohatgi (1976). An Introduction to Probability Theory and Mathematical Statistics. J. Wiley and Sons.
Recommended software: STATISTICA
1 Introduction
A common question in experimental science is how some sets of variables affect others. Some relations are deterministic and easy to interpret; others are too complicated to grasp or describe in simple terms, possibly having a random component. In these cases we approximate the actual relationships by simple functions or random processes, using relatively simple empirical methods. Among all the methods available for approximating such complex relationships, linear regression is possibly the most widely used. A common feature of this methodology is to assume a functional, parametric relationship between the variables in question, typically linear in unknown parameters which are to be estimated from the available data. Two sets of variables can be distinguished at this stage: predictor variables and response variables. Predictor variables are those that can either be set to a desired value (controlled) or else take values that can be observed without error. Our objective is to find out how changes in the predictor variables affect the values of the response variables. Other names frequently attached to these variables in different books by different authors are the following:
Predictor variables = Input variables
= X-variables
= Regressors
= Independent variables
Response variable = Output variable
= Y-variable
= Dependent variable.
We shall be concerned with relationships of the form
Response variable = Linear model function in terms of input variables
+ random error.
In the simplest case, when we have data $(y_1, x_1), (y_2, x_2), \ldots, (y_n, x_n)$, a linear function of the form

    y_i = \beta_0 + \beta_1 x_i + \varepsilon_i,  i = 1, 2, \ldots, n    (1)

can be used to relate y to x. We will also write this model in generic terms as

    y = \beta_0 + \beta_1 x + \varepsilon.

Here, $\varepsilon$ is a random quantity measuring the error by which any individual y may fall off the regression line; in other words, it is a random quantity measuring the variation in y not explained by x. We also assume that the input variable x is either controlled or measured without error, and thus is not a random variable. (As long as any measurement error in x is smaller than the measurement error in y, this assumption is fairly robust.)
If the relation between y and x is more complex than the linear relationship given in (1), then models of the form

    y = \beta_0 + \beta_1 x + \beta_2 x^2 + \cdots + \beta_p x^p + \varepsilon    (2)

can be used. Note that we say the model is linear in the sense that the model is linear in the parameters. For example,

    y = \beta_0 x_1 + \beta_1 x_2^2 + \beta_3 x_1 x_2 + \varepsilon

is linear, whereas

    y = \beta_0 + \beta_1 x_1^{\gamma_1} x_2^{\gamma_2} + \varepsilon

is not.
2 Straight line

Suppose that we observe the data $(y_1, x_1), (y_2, x_2), \ldots, (y_n, x_n)$ and we think that the model (1)

    y_i = \beta_0 + \beta_1 x_i + \varepsilon_i,  i = 1, 2, \ldots, n    (3)

is the right model. Here $\beta_0, \beta_1$ are fixed but unknown model parameters to be estimated from data. One way of obtaining estimators $b_0, b_1$ of $\beta_0, \beta_1$ is by minimizing

    S = \sum_{i=1}^{n} \varepsilon_i^2 = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2    (4)

in terms of $\beta_0, \beta_1$, where $(b_0, b_1)$ is the value of $(\beta_0, \beta_1)$ corresponding to the minimal value of S. Here S is called the sum of squares of errors and this method is called the least squares method. Under certain general conditions, the estimators $(b_0, b_1)$ obtained this way also turn out to be the minimum variance unbiased estimators as well as the maximum likelihood estimators.

We can determine $(b_0, b_1)$ by differentiating S with respect to $(\beta_0, \beta_1)$, setting the derivatives to 0 and solving for $(\beta_0, \beta_1)$, giving

    \frac{\partial S}{\partial \beta_0} = -2 \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i) = 0,
    \frac{\partial S}{\partial \beta_1} = -2 \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)\, x_i = 0,    (5)

resulting in

    b_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},    (6)

    b_0 = \bar{y} - b_1 \bar{x}.    (7)

Here $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ is the sample average, and the equations (5) are called the normal equations.

We will call $\hat{y}_i = b_0 + b_1 x_i$ the fitted value of y at $x = x_i$ and $\hat{\varepsilon}_i = y_i - \hat{y}_i$ the residual for the ith observation. Note that

    \hat{y}_i = \bar{y} + b_1 (x_i - \bar{x}),

so that

    \sum_{i=1}^{n} \hat{\varepsilon}_i = \sum_{i=1}^{n} (y_i - \hat{y}_i) = 0.
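As an aside (not part of the original notes), the estimates (6)-(7) and the residuals are easy to compute directly; the following minimal NumPy sketch uses made-up data vectors x and y purely for illustration.

    import numpy as np

    # illustrative data (hypothetical values, for demonstration only)
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    y = np.array([2.1, 2.9, 3.8, 5.2, 5.9, 7.1])

    xbar, ybar = x.mean(), y.mean()

    # least squares estimates from (6) and (7)
    b1 = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
    b0 = ybar - b1 * xbar

    yhat = b0 + b1 * x          # fitted values
    resid = y - yhat            # residuals

    print(b0, b1, resid.sum())

The last printed value should be numerically zero, illustrating that the residuals of a least squares fit with an intercept always sum to zero.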
2.1 Examining the regression equation

So far we have made no assumption involving the probability structure of $\varepsilon$, and hence of y. We now give the basic assumptions regarding the model (1). In model (1), we assume:

1. $\varepsilon_i$, i = 1, ..., n, are identically distributed, uncorrelated random variables with mean $E(\varepsilon_i) = 0$ and variance $V(\varepsilon_i) = \sigma^2$, so that (assuming the x variables are measured without error or controlled) the $y_i$ are random variables with $E(y_i) = \beta_0 + \beta_1 x_i$ and $V(y_i) = \sigma^2$. (In fact, the correct notation for $E(y_i)$ is $E(Y \mid X = x_i)$.) Note that if both the independent variable X and the response variable Y are random variables, then the function $f(X)$ that minimizes

    E[(Y - f(X))^2 \mid X]

is given by

    E(Y \mid X).

If further $(X, Y)$ have a joint normal distribution, then $E(Y \mid X = x)$ is a linear function of the form

    E(Y \mid X = x) = \beta_0 + \beta_1 x.

Hence, $\hat{y}_i = b_0 + b_1 x_i$ can and should be seen as $\hat{E}(Y \mid X = x_i)$, the estimator of this conditional mean.

2. $\varepsilon_i$ and $\varepsilon_j$ are uncorrelated for all $i \neq j$; hence $y_i, y_j$ are also uncorrelated.

3. $\varepsilon_i \sim N(0, \sigma^2)$; being uncorrelated and normal, they are therefore independent. Hence

    Y \mid X = x_i \sim N(\beta_0 + \beta_1 x_i, \sigma^2)

and the $y_i$ are independent but not identically distributed random variables. (With some abuse of notation, let $y_i = (Y \mid X = x_i) \sim N(\beta_0 + \beta_1 x_i, \sigma^2)$.)

Note that in the regression model we specify only the conditional distribution of Y given $X = x_i$; full inference on $(Y, X)$ would require the specification of the joint distribution of $(Y, X)$.
2.2 Some distributional theory

While examining the regression equation, we will need to test various hypotheses which, in general, depend on the distributional properties of sums of squares of independent normal variables and of their ratios. In this section we give a brief summary of distributional results for these quadratic forms.

Normal density: A random variable X has a normal distribution with mean $\mu$ and variance $\sigma^2$ if it has the density

    f(x) = \frac{1}{\sigma (2\pi)^{1/2}} \exp\left[-\frac{(x-\mu)^2}{2\sigma^2}\right],

for $-\infty < x < \infty$. The transformation $Z = (X - \mu)/\sigma$ turns X into a standard normal variate with mean 0 and variance 1.

(central) t-distribution: X has a t-distribution with v degrees of freedom (denoted t(v)) if it has the density

    f_v(t) = \frac{\Gamma((v+1)/2)}{(\pi v)^{1/2}\,\Gamma(v/2)} \left(1 + \frac{t^2}{v}\right)^{-(v+1)/2},

for $-\infty < t < \infty$. Here, $\Gamma(q) = \int_0^{\infty} e^{-x} x^{q-1}\,dx$ is the gamma function. In general the t-distribution looks like a normal distribution with heavier tails. As $v \to \infty$ the t-distribution tends to the normal distribution, and in fact $t(\infty) = N(0, 1)$. For all practical purposes, when v > 30 they are equal.

(central) F-distribution: X has an F-distribution with m and n degrees of freedom ($F_{m,n}$) if it has the density

    f_{m,n}(x) = \frac{\Gamma((m+n)/2)\,(m/n)^{m/2}}{\Gamma(m/2)\,\Gamma(n/2)}\, x^{m/2-1} (1 + mx/n)^{-(m+n)/2},

for $x \ge 0$. If X has an $F_{1,n}$ distribution, then it is equivalent to the square of a random variable with a t(n) distribution, that is, $F_{1,n} = t_n^2$. Another useful property is that if $X \sim F_{m,n}$, then $1/X \sim F_{n,m}$.

$\chi^2$-distribution: X is said to have a $\chi^2$ distribution with n degrees of freedom ($\chi^2(n)$) if it has the density

    f(x) = \frac{1}{\Gamma(n/2)\,2^{n/2}}\, e^{-x/2}\, x^{n/2-1},

for $0 < x < \infty$.

How do these distributions appear in regression analysis? As we will see, most of the tests of hypotheses, as well as the estimators of the model parameters, depend on sums of squares of independent, normally distributed random variables and on their ratios; these sums usually have a $\chi^2$ distribution, whereas the ratio of independent $\chi^2$ random variables (each divided by its degrees of freedom) has an F distribution. Here is the summary of the distributional results we need. For details, see Sen and Srivastava (1990) or any good statistics book such as Rohatgi (1976).

1. If $X_1, \ldots, X_n$ are independent, normally distributed random variables with means $(\mu_1, \mu_2, \ldots, \mu_n)$ and common variance $\sigma^2$, then $\sigma^{-2} Z^T A Z$, where $Z^T = (X_1 - \mu_1, \ldots, X_n - \mu_n)$ and A is any symmetric idempotent matrix with $r = tr(A)$, has a (central) $\chi^2$ distribution with r degrees of freedom. (The sum of the diagonal elements of a symmetric matrix A is called the trace of the matrix and is denoted by tr(A).)

2. In particular,

    \frac{\sum_{i=1}^{n} (X_i - \mu_i)^2}{\sigma^2}

has a $\chi^2(n)$ distribution, whereas, if $\mu_i = \mu$ is constant and is estimated by $\bar{X}$, then

    \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{\sigma^2}

has a $\chi^2(n-1)$ distribution. (This is due to the loss of one degree of freedom in estimating the mean by $\bar{X}$.)

3. The ratio of two independent $\chi^2$ random variables, each divided by its respective degrees of freedom, has an F distribution. That is, if $X \sim \chi^2(m)$, $Y \sim \chi^2(n)$ and X, Y are independent, then $F_{m,n} = \frac{X/m}{Y/n}$ has an F distribution with m, n degrees of freedom.

4. If $X \sim N(\mu, \sigma^2)$, $Y \sim \chi^2(n)$, and X and Y are independent, then

    t = \frac{(X - \mu)/\sigma}{\sqrt{Y/n}}

has a t distribution with n degrees of freedom. Thus we immediately see that

    t^2 = \frac{(X - \mu)^2/\sigma^2}{Y/n}

has an F distribution with 1, n degrees of freedom. This distribution appears when we want to look at quantities of the form $(X - \mu)/\sigma$ when X has a normal distribution but $\sigma$ is not known and is replaced by the empirical standard deviation.

This is all the distributional theory we need to deal with inference on the regression equation. Most of the effort in proving results on the distributional properties of estimators and tests of hypotheses falls on showing the independence of the various quadratic forms involved.
2.3 Confidence intervals and tests of hypotheses regarding $(\beta_0, \beta_1)$

A simple calculation shows that

    b_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})\, y_i}{\sum_{i=1}^{n} (x_i - \bar{x})^2},    (8)

hence

    V(b_1) = \frac{\sigma^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}.    (9)

In general $\sigma^2$ is not known (usually the case), and then a suitable estimator of $\sigma^2$ replaces it. If the assumed model (1) is correct, then it is known that, under the normality assumption on the residuals,

    s^2 = \frac{1}{n-2} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2

is the minimum variance unbiased estimator of $\sigma^2$. Note that

    \hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2

is the maximum likelihood estimator, but it is biased. Hence, under the assumption that the $\varepsilon_i$ are normal, we can construct the usual $100(1-\alpha)\%$ confidence interval for $\beta_1$:

    b_1 \pm t(n-2, 1-\alpha/2)\, \frac{s}{\left[\sum_{i=1}^{n} (x_i - \bar{x})^2\right]^{1/2}}.

Here, $t(n-2, 1-\alpha/2)$ is the $1-\alpha/2$ percentage point of a t-distribution with n-2 degrees of freedom. (It is left to the reader to verify that

    \frac{b_1 - \beta_1}{s.e.(b_1)}

has a t-distribution with n-2 degrees of freedom. Here s.e. stands for the standard error.) The test of hypotheses

    H_0: \beta_1 = \beta_1^*  vs.  H_1: \beta_1 \neq \beta_1^*

can be performed by calculating the test statistic

    t = \frac{(b_1 - \beta_1^*)}{s} \left(\sum_{i=1}^{n} (x_i - \bar{x})^2\right)^{1/2},

and comparing |t| with the table value $t(n-2, 1-\alpha/2)$.

The standard error of $b_0$ can be calculated similarly:

    s.e.(b_0) = s \left[\frac{\sum_{i=1}^{n} x_i^2}{n \sum_{i=1}^{n} (x_i - \bar{x})^2}\right]^{1/2},

hence a $100(1-\alpha)\%$ confidence interval for $\beta_0$ is given by

    b_0 \pm t(n-2, 1-\alpha/2) \left[\frac{\sum_{i=1}^{n} x_i^2}{n \sum_{i=1}^{n} (x_i - \bar{x})^2}\right]^{1/2} s,

and the test

    H_0: \beta_0 = \beta_0^*  vs.  H_1: \beta_0 \neq \beta_0^*

can be performed by comparing the absolute value of

    t = \frac{b_0 - \beta_0^*}{s \left[\frac{\sum_{i=1}^{n} x_i^2}{n \sum_{i=1}^{n} (x_i - \bar{x})^2}\right]^{1/2}}

with $t(n-2, 1-\alpha/2)$.
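As a hedged illustration (not from the original text), the confidence interval and the t-test for $\beta_1$ described above can be computed as follows; the data are hypothetical and scipy.stats is used only for the t quantiles and tail probabilities.

    import numpy as np
    from scipy import stats

    # hypothetical data, for illustration only
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    y = np.array([2.1, 2.9, 3.8, 5.2, 5.9, 7.1])
    n = len(y)

    xbar, ybar = x.mean(), y.mean()
    Sxx = np.sum((x - xbar) ** 2)
    b1 = np.sum((x - xbar) * (y - ybar)) / Sxx
    b0 = ybar - b1 * xbar
    resid = y - (b0 + b1 * x)

    s2 = np.sum(resid ** 2) / (n - 2)      # unbiased estimator of sigma^2
    s = np.sqrt(s2)

    se_b1 = s / np.sqrt(Sxx)
    se_b0 = s * np.sqrt(np.sum(x ** 2) / (n * Sxx))

    alpha = 0.05
    tcrit = stats.t.ppf(1 - alpha / 2, df=n - 2)
    ci_b1 = (b1 - tcrit * se_b1, b1 + tcrit * se_b1)

    # test H0: beta_1 = 0 against H1: beta_1 != 0
    t_b1 = b1 / se_b1
    p_b1 = 2 * stats.t.sf(abs(t_b1), df=n - 2)
    print(ci_b1, t_b1, p_b1)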
2.4 Predicted future value of y

Suppose that we want to predict the future value $y_k$ of the response variable y at the observed value $x_k$. The predicted value is $\hat{y}_k = \hat{E}(Y \mid X = x_k) = b_0 + b_1 x_k$, that is, the estimator of the mean of Y conditional on $X = x_k$. Substituting the expressions (6) and (7) for $b_1$ and $b_0$ and simplifying, we get $\hat{y}_k = \bar{y} + b_1 (x_k - \bar{x})$. One can easily check that $b_1$ and $\bar{y}$ are uncorrelated, so that $Cov(b_1, \bar{y}) = 0$, and hence

    V(\hat{y}_k) = V(\hat{E}(Y_k \mid X = x_k)) = V(\bar{y}) + (x_k - \bar{x})^2 V(b_1)
                 = \sigma^2/n + \frac{(x_k - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}\, \sigma^2.    (10)

Now, a future observation $y_k$ varies around its mean $E(Y \mid X = x_k)$ with variance $\sigma^2$, hence

    V(y_k - \hat{y}_k) = \sigma^2 \left(1 + \frac{1}{n} + \frac{(x_k - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}\right).    (11)

Hence, a $100(1-\alpha)\%$ confidence interval for the future observation $y_k$ is given by

    \hat{y}_k \pm t(n-2, 1-\alpha/2)\, s \left[1 + \frac{1}{n} + \frac{(x_k - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}\right]^{1/2}.    (12)
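A minimal sketch of the prediction interval (12), again with made-up data; the prediction point x_k = 4.5 is an arbitrary choice for illustration.

    import numpy as np
    from scipy import stats

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    y = np.array([2.1, 2.9, 3.8, 5.2, 5.9, 7.1])
    n = len(y)
    xbar, ybar = x.mean(), y.mean()
    Sxx = np.sum((x - xbar) ** 2)
    b1 = np.sum((x - xbar) * (y - ybar)) / Sxx
    b0 = ybar - b1 * xbar
    s = np.sqrt(np.sum((y - b0 - b1 * x) ** 2) / (n - 2))

    xk = 4.5                                   # point at which we predict
    yk_hat = b0 + b1 * xk
    half = stats.t.ppf(0.975, n - 2) * s * np.sqrt(1 + 1/n + (xk - xbar)**2 / Sxx)
    print(yk_hat - half, yk_hat + half)        # 95% prediction interval, as in (12)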
2.5 Straight line regression in matrix terms

Suppose we have the observations

    (y_1, x_1), (y_2, x_2), \ldots, (y_n, x_n).

Let

    Y = (y_1, y_2, \ldots, y_n)^T,
    X = \begin{pmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix},
    \beta = (\beta_0, \beta_1)^T,
    b = (b_0, b_1)^T,
    \varepsilon = (\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n)^T.

Then the model (1) can be written as

    Y = X\beta + \varepsilon.    (13)

Note that

    \varepsilon^T \varepsilon = (Y - X\beta)^T (Y - X\beta) = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2,

hence the least squares estimates are obtained by minimizing $\varepsilon^T \varepsilon$, and the normal equations in (5) are given in matrix form by

    X^T X b = X^T Y,    (14)

and hence

    b = (X^T X)^{-1} X^T Y,    (15)

provided that the inverse of the matrix $X^T X$ exists. As we will often see, the matrix $X^T X$ and its inverse $(X^T X)^{-1}$ are the backbone of multiple regression analysis. Note that

    Cov(b_0, b_1) = Cov(\bar{y} - b_1 \bar{x},\, b_1) = -\frac{\bar{x}\,\sigma^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2},

hence

    V(b) = (X^T X)^{-1} \sigma^2.    (16)

Letting $a_k = (1, x_k)$, we can write $\hat{E}(Y \mid X = x_k) = b_0 + b_1 x_k = a_k b$, and

    V(\hat{E}(Y \mid X = x_k)) = V(b_0) + 2 x_k Cov(b_0, b_1) + x_k^2 V(b_1)
                               = a_k V(b) a_k^T = a_k (X^T X)^{-1} a_k^T \sigma^2.    (17)

Here, $V(b)$ is the covariance matrix of b.
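The matrix formulas (15)-(17) translate directly into code. The sketch below (illustrative only, with hypothetical data) forms the design matrix explicitly and inverts X^T X; in practice a numerically safer solver such as np.linalg.lstsq is preferable, but the explicit inverse mirrors the notation of this section.

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    y = np.array([2.1, 2.9, 3.8, 5.2, 5.9, 7.1])
    n = len(y)

    X = np.column_stack([np.ones(n), x])        # design matrix with an intercept column
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y                       # equation (15)

    resid = y - X @ b
    s2 = resid @ resid / (n - X.shape[1])       # estimate of sigma^2
    V_b = XtX_inv * s2                          # estimated covariance matrix of b, as in (16)

    ak = np.array([1.0, 4.5])                   # a_k = (1, x_k)
    var_fit = ak @ XtX_inv @ ak * s2            # estimated variance of E-hat(Y | X = x_k), as in (17)
    print(b, V_b, var_fit)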
3 Generalization to multivariate (multiple) regression

Suppose that we have p independent variables $(x_1, x_2, \ldots, x_p)$ and we want to know the effect of these variables on the response variable y through the linear relationship

    y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_p x_{pi} + \varepsilon_i,    (18)

for i = 1, 2, ..., n. We can write this model in matrix terms as

    Y = X\beta + \varepsilon,    (19)

where $Y = (y_1, y_2, \ldots, y_n)^T$, $\varepsilon = (\varepsilon_1, \ldots, \varepsilon_n)^T$, $\beta = (\beta_0, \beta_1, \ldots, \beta_p)^T$, $b = (b_0, b_1, \ldots, b_p)^T$ and

    X = \begin{pmatrix} 1 & x_{11} & \cdots & x_{p1} \\ \vdots & \vdots & & \vdots \\ 1 & x_{1n} & \cdots & x_{pn} \end{pmatrix}.

For this model, we again assume:

1. $E(\varepsilon_j) = 0$, $V(\varepsilon_j) = \sigma^2$, for every j;

2. $\varepsilon_i, \varepsilon_j$ are uncorrelated for $i \neq j$;

3. the $\varepsilon_i$ have a normal distribution, and hence $\varepsilon \sim N(0, \Sigma)$, where $\Sigma = I\sigma^2$, I being the identity matrix.

The generalization of the straight-line results gives similar expressions:

    b = (X^T X)^{-1} X^T Y,
    V(b) = (X^T X)^{-1} \sigma^2,

and

    V(\hat{E}(Y \mid X = x_k)) = a_k (X^T X)^{-1} a_k^T \sigma^2,

where $a_k = (1, x_{1k}, \ldots, x_{pk})$. Hence, tests of hypotheses as well as confidence intervals on the individual parameters $\beta_0, \ldots, \beta_p$ and on a future observation $Y_k$ can easily be constructed based on the individual standard errors of the $b_i$. (Students are strongly urged to construct these confidence intervals and tests of hypotheses, which are standard exercises in basic statistics.)
3.1 Precision of the regression equation

So far we have looked at the problem of inference on the individual parameters of the model (18)

    y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_p x_{pi} + \varepsilon_i;    (20)

however, there are more important questions to ask:

1. Is the model we use correct?

2. If it is the correct model, how significant is it, in the sense of how much the independent variables $(x_1, \ldots, x_p)$ contribute to explaining the variation in the response variable y?

3. How can we reach a more parsimonious model by excluding those independent variables which do not contribute significantly to explaining the variation in y?

We start by answering the second question. Let us assume that the model in (18) is the correct model. Then the second question can be formulated as testing the hypotheses

    H_0: \beta_1 = \beta_2 = \cdots = \beta_p = 0  vs.  H_1: not all are 0.    (21)

If we do not reject the null hypothesis, then the model is not statistically different from the model

    y = \beta_0 + \varepsilon,

which means that, whatever the variation in the independent variables, E(Y) remains constant, indicating that the independent variables do not contribute anything to explaining y. However, before going any further to see how such a test can be performed, we note that the test

    H_0: \beta_1 = \beta_2 = \cdots = \beta_p = 0  vs.  H_1: not all are 0

is not equivalent to testing p separate hypotheses

    H_0^{(i)}: \beta_i = 0  vs.  H_1^{(i)}: \beta_i \neq 0.

Let $\alpha_i = \alpha$ be the type one error in testing the hypothesis $H_0^{(i)}$, that is,

    \alpha = P(\text{rejecting } H_0^{(i)} \text{ when it is true}).

Suppose that we perform these p tests independently. Then the probability of rejecting at least one of the p hypotheses when all are true is $1 - (1-\alpha)^p$, which is the type 1 error for the composite hypothesis (carried out through independent individual tests). Note that this type of error increases as p increases. Hence, individual hypotheses cannot substitute for a composite hypothesis without increasing the type one error.

Now let us see how we can test the composite hypothesis (21). We can write

    \sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i + \hat{y}_i - \bar{y})^2
        = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 + 2\sum_{i=1}^{n} (\hat{y}_i - \bar{y})(y_i - \hat{y}_i),    (22)

and we can show that

    2\sum_{i=1}^{n} (\hat{y}_i - \bar{y})(y_i - \hat{y}_i) = 0.

Hence

    \sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n} (y_i - \hat{y}_i)^2.    (23)

In words, we have divided the total sum of squares of variation around the mean into two components, namely the sum of squares of variation due to regression (variation explained by regression) and the sum of squares of variation due to error (variation not explained by regression); in other words:

    TOTAL VARIATION = VARIATION EXPLAINED BY REGRESSION + RESIDUAL, UNEXPLAINED VARIATION    (24)

A useful way of seeing how good the regression is, is to see how big the sum of squares due to regression is or, equivalently, how small the sum of squares due to error is. One way of measuring this is to calculate

    R^2 = \frac{\text{SS due to regression}}{\text{SS total variation}},

and we would be pleased with the significance of our model if $R^2$ is close to 1. Usually $R^2$ is multiplied by 100 to represent it as the percentage of variation explained by the variables. In order to be more precise in our judgement of how good the regression is, we need to perform statistical tests on the sums of squares. Now,

    \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2  and  \sum_{i=1}^{n} (y_i - \hat{y}_i)^2

are random variables and, under the hypothesis that the model is correct,

    E(s^2) = E\left(\frac{1}{n-p-1} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2\right) = \sigma^2.

Also,

    E\left(\frac{1}{p} \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2\right) =
        \sigma^2,  if \beta_1 = \cdots = \beta_p = 0,
        \sigma^2 + \frac{1}{p}\,(X\beta)^T \left(X(X^T X)^{-1} X^T\right) X\beta,  if not.    (25)
Under the assumption of normality, it can be shown that

    \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sigma^2} \sim \chi^2_{n-p-1},

and, under the further assumption that $H_0: \beta_1 = \cdots = \beta_p = 0$,

    \frac{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2}{\sigma^2} \sim \chi^2_p.

Further, $\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$ and $\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$ are independent, and hence

    F = \frac{\frac{1}{p} \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 / \sigma^2}{\frac{1}{n-p-1} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 / \sigma^2} \sim F_{p,\, n-p-1}.    (26)

(Students again are strongly urged to carry out the calculations and fill in the details.) Note that the test statistic F follows the $F_{p, n-p-1}$ distribution under the hypothesis that all of the coefficients corresponding to the independent variables are 0. Otherwise the F-value tends to be larger, on account of the fact that under the alternative hypothesis F has positive terms corresponding to $(X\beta)^T (X(X^T X)^{-1} X^T) X\beta$. Hence, the significance of the regression equation can be checked by comparing F with the table value $F_{p, n-p-1}$ at the desired significance level. If the F value is significantly larger than the $F_{p, n-p-1}$ table value, then we conclude that our regression equation is significant. The above arguments can conveniently be displayed in a table called the ANOVA table:

    Source of variation | Degrees of freedom | Sum of squares | Mean square | F-value
    Due to regression (b_1, ..., b_p given b_0) | p | SS(reg) = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 | MS(reg) = SS(reg)/p | F = MS(reg)/s^2
    Residual | n - p - 1 | SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 | s^2 = SSE/(n - p - 1) |
    Total | n - 1 | \sum_{i=1}^{n} (y_i - \bar{y})^2 | |

The same table can be given in matrix notation:

    Source | Degrees of freedom | Sum of squares | Mean square | F-value
    b_1, ..., b_p given b_0 | p | b^T X^T Y - \frac{1}{n}\left(\sum_{i=1}^{n} y_i\right)^2 | MS(reg) | MS(reg)/s^2
    Residual | n - p - 1 | Y^T Y - b^T X^T Y | s^2 |
    Total | n - 1 | Y^T Y - \frac{1}{n}\left(\sum_{i=1}^{n} y_i\right)^2 | |
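The ANOVA quantities and the overall F-test (26) can be reproduced numerically as in the following sketch; the simulated data and the chosen coefficient values are assumptions made only for illustration.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    n, p = 30, 3
    Xvars = rng.normal(size=(n, p))             # hypothetical predictors
    y = 1.0 + Xvars @ np.array([0.5, -0.3, 0.0]) + rng.normal(scale=0.4, size=n)

    X = np.column_stack([np.ones(n), Xvars])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    yhat = X @ b

    ss_total = np.sum((y - y.mean()) ** 2)
    ss_reg = np.sum((yhat - y.mean()) ** 2)     # SS due to regression (given b0)
    sse = np.sum((y - yhat) ** 2)

    ms_reg = ss_reg / p
    s2 = sse / (n - p - 1)
    F = ms_reg / s2                             # statistic (26)
    p_value = stats.f.sf(F, p, n - p - 1)
    R2 = ss_reg / ss_total
    print(F, p_value, R2)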
3.2 Is our model correct?

Suppose that we assume the model

    y_i = \beta_0 + \beta_1 x_{1,i} + \cdots + \beta_k x_{k,i} + \varepsilon_{1,i},    (27)

whereas the correct model is given by

    y_i = \beta_0 + \beta_1 x_{1,i} + \cdots + \beta_k x_{k,i} + \beta_{k+1} x_{k+1,i} + \cdots + \beta_p x_{p,i} + \varepsilon_{2,i},    (28)

where k < p. How can we detect such a situation and what can we do about it? To highlight the problem, consider polynomial regression with one independent variable. Suppose that for a given set of data $(y_i, x_i)$, i = 1, ..., n, we decide on the linear model

    y = \beta_0 + \beta_1 x + \varepsilon_1,    (29)

when the actual relationship is given by

    y = \beta_0 + \beta_1 x + \beta_2 x^2 + \varepsilon_2.    (30)

Although the linear model in (29) is very likely to give a significant fit to the data, it will be insufficient and hence the wrong model.

Let

    E(y_i) = \beta_0 + \beta_1 x_{1,i} + \cdots + \beta_p x_{p,i}

be the value given by the correct model for the ith observation, whereas let

    \hat{y}_i = b_0 + b_1 x_{1,i} + \cdots + b_k x_{k,i}    (31)

be the predicted value obtained from the fitted model. Then

    y_i - \hat{y}_i = y_i - \hat{y}_i - E(y_i - \hat{y}_i) + E(y_i - \hat{y}_i)
                    = \left(y_i - \hat{y}_i - E(y_i) + E(\hat{y}_i)\right) + E(y_i - \hat{y}_i).    (32)

Let $A_i = y_i - \hat{y}_i - E(y_i - \hat{y}_i)$ and $B_i = E(y_i - \hat{y}_i)$. Note that

    E[A_i] = E[y_i - E(y_i)] - E[\hat{y}_i - E(\hat{y}_i)] = E(\varepsilon_{2,i}) - E[\hat{y}_i - E(\hat{y}_i)] = 0    (33)

and

    B_i = E(y_i) - E(\hat{y}_i)
        = \beta_0 + \beta_1 x_{1,i} + \cdots + \beta_p x_{p,i} - (\beta_0 + \beta_1 x_{1,i} + \cdots + \beta_k x_{k,i})
        = 0,  if the model is correct (k = p),
        = \beta_{k+1} x_{k+1,i} + \cdots + \beta_p x_{p,i},  if the model is not correct (k < p).
Also,

    \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} A_i^2 + \sum_{i=1}^{n} B_i^2 + 2\sum_{i=1}^{n} A_i B_i,

and one can show that

    E\left(\sum_{i=1}^{n} A_i B_i\right) = 0,

hence

    \frac{1}{n-k-1} E\left(\sum_{i=1}^{n} (y_i - \hat{y}_i)^2\right) = \frac{1}{n-k-1} E\left[\sum_{i=1}^{n} A_i^2\right] + \frac{1}{n-k-1} E\left[\sum_{i=1}^{n} B_i^2\right].    (34)

Also, it can be shown that

    \frac{1}{n-k-1} E\left(\sum_{i=1}^{n} A_i^2\right) = \sigma^2,

and hence

    \frac{1}{n-k-1} E\left(\sum_{i=1}^{n} (y_i - \hat{y}_i)^2\right) =
        \sigma^2,  if the model is correct,
        \sigma^2 + \frac{1}{n-k-1}\sum_{i=1}^{n} B_i^2,  if the model is not correct.    (35)

If $\sigma^2$ is known or has been estimated from a previous experiment, then the residual mean square from the fitted model,

    \frac{1}{n-k-1} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2,

can be compared with this estimated or known $\sigma^2$ to see whether it is significantly larger than the true variance. If it is significantly larger, then we say that there is lack of fit in the model and conclude that the model is inadequate in its present form. The problem with this approach is that more often than not the variance $\sigma^2$ is not known and a prior estimate is not available. However, the experiment can be designed initially to overcome this possible problem by obtaining repeated measurements on the response variable y for fixed values of the independent variables x. These repeated measurements for identical x's can be used to estimate $\sigma^2$. Such an estimator of $\sigma^2$ is called pure error and can be obtained as follows. For simplicity of notation, let us assume the simple regression

    y = \beta_0 + \beta_1 x + \varepsilon,
and repeated observations of the form

    y_{1,1}, y_{1,2}, \ldots, y_{1,n_1}   at x_1
    y_{2,1}, y_{2,2}, \ldots, y_{2,n_2}   at x_2
    \vdots
    y_{m,1}, y_{m,2}, \ldots, y_{m,n_m}   at x_m    (36)

where $\sum_{j=1}^{m} n_j = n$. The quantity

    s_e^2 = \frac{1}{n-m} \sum_{j=1}^{m} \sum_{i=1}^{n_j} (y_{ij} - \bar{y}_j)^2

is an estimate of $\sigma^2$ irrespective of the model fitted. Also note that we can break up the total residual sum of squares as follows:

    SS(E) = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{j=1}^{m} \sum_{i=1}^{n_j} (y_{ij} - \hat{y}_j)^2
          = \sum_{j=1}^{m} \sum_{i=1}^{n_j} (y_{ij} - \bar{y}_j)^2 + \sum_{j=1}^{m} \sum_{i=1}^{n_j} (\bar{y}_j - \hat{y}_j)^2
          = SS(\text{pure error}) + SS(\text{lack of fit}).    (37)

Now, it can be shown that

    F = \frac{\frac{1}{m-k-1}\, SS(\text{lack of fit})}{\frac{1}{n-m}\, SS(\text{pure error})}

has an $F_{m-k-1,\, n-m}$ distribution (here k is the number of regressors in the fitted model, so the lack of fit sum of squares has $(n-k-1) - (n-m) = m-k-1$ degrees of freedom). Hence F can be used to test whether there is any lack of fit in the model by comparing F with $F_{m-k-1, n-m}$ at the desired significance level. If F is significantly large, then we conclude that there is lack of fit and we have to look for more adequate models, using the diagnostic tools on residuals. If F is not significantly large, then there is no reason to doubt the adequacy of the model. The lack of fit test can conveniently be added to the ANOVA table:
    Source | Degrees of freedom | Sum of squares | Mean square | F-value
    b_1, ..., b_k given b_0 | k | b^T X^T Y - \frac{1}{n}\left(\sum_{i=1}^{n} y_i\right)^2 | MS(reg) | MS(reg)/s^2
    Residual | n - k - 1 | Y^T Y - b^T X^T Y | s^2 |
    Lack of fit | n - k - 1 - n_e | SS(lack of fit) = SSE - SS(pure error) | MSL = SS(lack of fit)/(n - k - 1 - n_e) | MSL/s_e^2
    Pure error | n_e | SS(pure error) | s_e^2 = SS(pure error)/n_e |
    Total | n - 1 | Y^T Y - \frac{1}{n}\left(\sum_{i=1}^{n} y_i\right)^2 | |

where $n_e = n - m$.
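A small illustration (not from the original notes) of the pure error / lack of fit decomposition (37) and the corresponding F-test, for a straight-line fit (k = 1) with repeated observations at each x level; the data are invented.

    import numpy as np
    from scipy import stats

    # hypothetical data with repeated observations at each x level
    x = np.array([1, 1, 1, 2, 2, 3, 3, 3, 4, 4])
    y = np.array([2.0, 2.3, 1.8, 3.1, 2.7, 4.4, 4.0, 4.1, 4.6, 5.1])
    n, k = len(y), 1                            # k = number of regressors in the fitted model

    X = np.column_stack([np.ones(n), x])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    sse = np.sum((y - X @ b) ** 2)

    levels = np.unique(x)
    m = len(levels)                             # number of distinct x values
    ss_pe = sum(np.sum((y[x == lv] - y[x == lv].mean()) ** 2) for lv in levels)
    ss_lof = sse - ss_pe                        # decomposition (37)

    F = (ss_lof / (m - k - 1)) / (ss_pe / (n - m))
    p_value = stats.f.sf(F, m - k - 1, n - m)
    print(F, p_value)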
We can now summarize our regression analysis as follows:

1. Plan and design an experiment to see the linear effect of the independent variables $(x_1, x_2, \ldots, x_k)$ on the response variable y and assume the model

    y = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k + \varepsilon,

with the assumptions on $\varepsilon$ given in the previous section.

2. Find the estimates of the model parameters and set up the first ANOVA table.

3. If repeated observations are available, compute the sum of squares of pure error and extend the first ANOVA table to the second ANOVA table.

4. Perform the lack of fit F-test. If there is significant lack of fit, go to 5; otherwise go to 6.

5. The assumed model is not correct. Try another model, using the diagnostic tools on residuals (chapter 5).

6. There is no reason to doubt the chosen model. Perform the F-test and check other measures (such as $R^2$) to assess the effectiveness of the regression equation. Perform statistical checks on the individual variables.

7. Obtain more parsimonious models by using partial F-tests (chapter 4).
3.3 $R^2$ when there are repeated observations

No matter how good our model is, it cannot explain the pure error, that is, the measurement error. In other words, the best our model can do is to explain the variation in

    Total sum of squares - Sum of squares due to pure error,

thus the maximum $R^2$ which can be attained by any model is

    R^2_{max} = \frac{SS(\text{Total}) - SS(\text{pure error})}{SS(\text{Total})}.

When it is possible to compute pure error, $R^2 / R^2_{max}$ gives a better view of how good the model is, in the sense of how much the model achieves of what can be achieved. For example, suppose we have the following ANOVA table for a regression analysis:

    Source | Degrees of freedom | Sum of squares | Mean square | F-value
    b_1, ..., b_k given b_0 | 1 | 5.499 | 5.499 | 7.56 (significant)
    Residual | 21 | 15.278 | 0.728 |
    Lack of fit | 11 | 8.233 | 0.748 | 1.061 (not significant)
    Pure error | 10 | 7.055 | 0.706 |
    Total | 22 | 20.777 | |

Then

    R^2_{max} = \frac{20.777 - 7.055}{20.777} = 0.6604,

whereas

    R^2 = \frac{5.499}{20.777} = 0.2647.

However, the regression actually explains

    R^2 / R^2_{max} \approx 0.40

of the amount which can be explained.
3.4 Correlation Coefficients

Let X and Y be random variables. The correlation coefficient $\rho_{XY}$,

    \rho_{XY} = \frac{Cov(X, Y)}{\sqrt{Var(X)\, Var(Y)}},

is a measure of the degree of linear dependence between the random variables X and Y, and $-1 \le \rho_{XY} \le 1$. The correlation coefficient provides a measure of how good a (linear) prediction of the value of one of the variables can be formed on the basis of an observed value of the other variable. As such, the notion of correlation coefficient plays a fundamental role in regression analysis. If a sample of size n, $(x_1, y_1), \ldots, (x_n, y_n)$, is available, then the correlation coefficient can be estimated by

    r_{XY} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\left(\sum_{i=1}^{n} (x_i - \bar{x})^2\right)^{1/2} \left(\sum_{i=1}^{n} (y_i - \bar{y})^2\right)^{1/2}},

which is called the sample correlation coefficient. In the regression model, the independent variables are assumed to be non-random. Still, the sample correlation coefficients $r_{y x_i}$, $r_{x_i x_j}$, i, j = 1, ..., p, can be used to see the degree of linear association between the respective variables. The matrix $(r_{x_i x_j})_{i,j}$, i, j = 1, ..., p, is usually called the correlation matrix of the regression. (Quite often in software outputs, the correlation matrix also includes the terms $r_{y x_i}$.) In general, higher values of $|r_{y x_i}|$ indicate the relative importance of each of the independent variables on y. However, this is not always the case, as we will see later when we study partial correlation coefficients. (Note that certain software packages give covariances and the covariance matrix instead of the correlation coefficients and correlation matrix.) Although higher values of $|r_{y x_i}|$ indicate a possibly significant regression analysis, high values of $|r_{x_i x_j}|$ among the independent variables signal trouble in the analysis. These high values indicate that the independent variables are linearly (or almost linearly) dependent, in which case the determinant of the matrix $X^T X$ is nearly 0; this results in computational difficulties in inverting this matrix and consequently in instability in the whole regression analysis. High values of $|r_{x_i x_j}|$ indicate that some of these variables are redundant in the analysis, and extra care should be taken to exclude these redundant variables from the study. As we will see, partial correlations are useful for such analysis.
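A short illustration (with simulated data, not from the original notes) of the correlation matrix and of the warning sign discussed above: when two regressors are nearly collinear, their sample correlation is close to 1 and X^T X is close to singular.

    import numpy as np

    rng = np.random.default_rng(1)
    n = 50
    x1 = rng.normal(size=n)
    x2 = 0.9 * x1 + 0.1 * rng.normal(size=n)    # nearly collinear with x1 on purpose
    x3 = rng.normal(size=n)
    y = 2 + x1 - x3 + rng.normal(scale=0.5, size=n)

    # rows: y, x1, x2, x3; np.corrcoef returns the matrix of sample correlations
    R = np.corrcoef(np.vstack([y, x1, x2, x3]))
    print(np.round(R, 2))
    # a value of R[1, 2] close to 1 warns that X^T X will be nearly singular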
3.5 Partial correlation coefficient

Suppose that we have three random variables X, Y and Z such that the correlation coefficients $r_{XY} \approx 1$ and $r_{YZ} \approx 1$. In this case X and Y, as well as Y and Z, are linearly related in a strong manner, and hence X and Z will also have a strong linear relationship. However, this linear relationship between X and Z exists through their relationship with Y, and if we are able to take out the effect of Y, then there may be very little linear relationship left between X and Z.

More precisely, let

    X = \alpha_{01} + \alpha_{11} Y + \varepsilon_1,
    Z = \alpha_{02} + \alpha_{12} Y + \varepsilon_2,

and let

    X^* = X - \hat{X},   Z^* = Z - \hat{Z}

be the residuals from these two regressions. The correlation coefficient $r_{X^* Z^*}$ measures the degree of (residual) linear relationship between X and Z after the linear effect of Y has been taken out of both variables. $r_{X^* Z^*}$ is called the partial correlation coefficient and is denoted by $r_{XZ.Y}$. Note again that $r_{XZ.Y}$ may be close to 0 even when $|r_{XZ}|$ is close to 1. Formally, $r_{XZ.Y}$ can be calculated in terms of the correlations:

    r_{XZ.Y} = \frac{r_{XZ} - r_{XY}\, r_{ZY}}{\left[(1 - r_{XY}^2)(1 - r_{YZ}^2)\right]^{1/2}}.

The notion of partial correlation can be extended to more than one conditioning variable, and we can define the correlation between two variables after taking out the linear effect of a set of p variables in a similar way. For example, the partial correlation between the variables Y and $X_4$, after taking out the linear effect of $X_1, X_2, X_3$, is written as $r_{Y X_4 . X_1 X_2 X_3}$. Most regression software calculates partial correlations and includes them in the output at various stages of the analysis. Partial correlations are particularly useful in choosing the most parsimonious model when using the stepwise regression technique, as we will see in chapter 4. Similarly, partial correlations are very useful in dealing with an ill-conditioned matrix $X^T X$.
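The formula for r_{XZ.Y} is easy to verify numerically. The sketch below uses simulated data; the helper function name partial_corr is mine, not a standard library routine. It shows a case where the ordinary correlation between X and Z is large while the partial correlation given Y is close to zero.

    import numpy as np

    def partial_corr(x, z, y):
        """Sample partial correlation r_{XZ.Y}, computed from the closed-form
        formula in terms of the pairwise sample correlations."""
        r_xz = np.corrcoef(x, z)[0, 1]
        r_xy = np.corrcoef(x, y)[0, 1]
        r_zy = np.corrcoef(z, y)[0, 1]
        return (r_xz - r_xy * r_zy) / np.sqrt((1 - r_xy**2) * (1 - r_zy**2))

    rng = np.random.default_rng(2)
    n = 200
    yv = rng.normal(size=n)
    xv = 2 * yv + rng.normal(scale=0.3, size=n)   # X strongly related to Y
    zv = -yv + rng.normal(scale=0.3, size=n)      # Z strongly related to Y

    print(np.corrcoef(xv, zv)[0, 1])   # large in absolute value
    print(partial_corr(xv, zv, yv))    # close to 0 once the effect of Y is removed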
3.6 Use of qualitative variables in the regression equation

We have so far implicitly assumed that the input variables in the regression are quantitative, taking either discrete or continuous values. However, quite often we need to know the effect of some qualitative variables on the response variable. For example, we may wish to know the effect of rain and temperature on tobacco production every year. If we collect data in two different countries, say country A and country B, we may also want to know if the country of origin has any effect on the tobacco production. In another example, suppose that transportation services in a country are provided by three different types of companies, namely public, semi-public and totally private companies. We may want to know if the type of company has any effect on the quality of the service. In yet another example, where the weights of newly born children are regressed on many factors, we may want to know whether smoking a little, moderately or heavily also affects the weight. We will classify qualitative variables into two groups:

- indicator (dummy or dichotomous) variables, taking only 2 values, namely 0 and 1;

- polychotomous variables, taking more than 2 integer values, for example a degree of belief expressed in a questionnaire such as strongly agree, agree, disagree and strongly disagree. In other cases polychotomous variables may represent some continuous variable; for example, in the study of the effect of cigarette smoking on the weight of newly born children, cigarette consumption per day may be coded into 3 categories, representing 0, 1-15, and above 15.

How can we introduce such variables into the regression equation together with other variables? Consider the case when we have a single independent variable $x_1$ taking only the 2 values 0 and 1. We then have the model

    y_i = \beta_0 + \beta_1 x_{i1} + \varepsilon_i,

where

    x_{i1} = 0  if i = 1, 2, \ldots, n_1,
    x_{i1} = 1  if i = n_1 + 1, \ldots, n.

In this case the X matrix takes the form

    X = \begin{pmatrix} x_0 & x_1 \\ 1 & 0 \\ \vdots & \vdots \\ 1 & 0 \\ 1 & 1 \\ \vdots & \vdots \\ 1 & 1 \end{pmatrix}    (38)

with the first $n_1$ rows having $x_1 = 0$ and the remaining $n - n_1$ rows having $x_1 = 1$.
Let $\mu_1 = \beta_0$ and $\mu_2 = \beta_0 + \beta_1$. Then it is evident that

    y_i = \mu_1 + \varepsilon_i  if i = 1, \ldots, n_1,
    y_i = \mu_2 + \varepsilon_i  if i = n_1 + 1, \ldots, n.

This is the usual, familiar two-sample testing problem of testing

    H_0: \mu_1 = \mu_2  vs.  H_1: \mu_1 \neq \mu_2,

which, within the regression formulation, becomes the test

    H_0: \beta_1 = 0  vs.  H_1: \beta_1 \neq 0.

Note that the classical test of the equality of two means is thus transformed into a regression problem with a single dummy variable and a corresponding parameter representing the difference between the means. The above setting often occurs when we want to introduce into a regression model the idea that the data come from two separate blocks due to some factor with two different levels (think of the factor as a machine and the levels as two different machines A and B), and we may want to know the effect of these two different levels on the response variable apart from the other measured p independent variables. These two levels can then conveniently be represented in the regression equation by a single dummy variable taking the values 0 and 1 corresponding to the respective levels (machine A or machine B).

Now suppose that the factor has 3 different levels (suppose there are 3 different machines used in the production of the response variable). In this case, in order to see whether the different machines have an effect on the response variable, we need to introduce two dummy variables $z_1$ and $z_2$ taking values 0 and 1 in the following fashion:

    (z_1, z_2) = (1, 0)  if level 1,
    (z_1, z_2) = (0, 1)  if level 2,
    (z_1, z_2) = (0, 0)  if level 3,
and the model would be

    y_i = \beta_0 + \beta_1 x_{1i} + \cdots + \beta_p x_{pi} + \beta_{p+1} z_{i1} + \beta_{p+2} z_{i2} + \varepsilon_i,

where the final X matrix would be of the form (assuming the first block contains the data from machine A, the second block those from machine B and the last block those from machine C)

    X = \begin{pmatrix} x_0 & \text{other } x\text{'s} & z_1 & z_2 \\ 1 & \cdot & 1 & 0 \\ \vdots & & \vdots & \vdots \\ 1 & \cdot & 1 & 0 \\ 1 & \cdot & 0 & 1 \\ \vdots & & \vdots & \vdots \\ 1 & \cdot & 0 & 1 \\ 1 & \cdot & 0 & 0 \\ \vdots & & \vdots & \vdots \\ 1 & \cdot & 0 & 0 \end{pmatrix}    (39)
In general, we can deal with a factor with p levels by introducing p-1 dummy variables into the X matrix. There is no unique way of introducing these dummy variables; however, they should be introduced in such a way that the resulting $X^T X$ matrix is not singular. (Question: why can't we introduce the 3 levels by a single dummy variable taking the values -1, 0, 1?)
Example: In order to model the quantity of material produced as a function of pressure and temperature in a chemical plant, an experiment was devised and data were collected at different combinations of temperature and pressure levels. However, the management also suspects that the yield of the product depends on whether the production was made during the day or the night shift, as well as on which of the 3 different engineers takes care of these shifts, say Mr A, Mr B and Mr C. What would be the regression model to test such assertions? We can write the regression equation as follows:

    y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \beta_5 x_5 + \varepsilon,

where

    x_1 = temperature,
    x_2 = pressure,
    x_3 = 1 if the response is measured during the night shift, 0 if not,
    x_4 = 1 if the response is measured during Mr B's shift, 0 if not,
    x_5 = 1 if the response is measured during Mr C's shift, 0 if not.

Here, $\beta_3$ is the difference between the mean responses (production) of the night and day shifts, $\beta_4$ is the difference between the mean responses of Mr B's and Mr A's shifts, whereas $\beta_5$ is the difference between the mean responses of Mr C's and Mr A's shifts. $\beta_0$ is the average response (production) during Mr A's day shift.
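The design matrix for this example can be built mechanically: each qualitative factor with L levels contributes L - 1 dummy columns. The following sketch is illustrative only; the simulated data and the coefficient values used to generate y are assumptions.

    import numpy as np

    rng = np.random.default_rng(3)
    n = 12
    temperature = rng.uniform(100, 200, size=n)           # x1
    pressure = rng.uniform(1, 5, size=n)                   # x2
    night = rng.integers(0, 2, size=n)                     # x3: 1 if night shift
    engineer = rng.choice(["A", "B", "C"], size=n)         # factor with 3 levels

    # two dummies for the 3 engineers; engineer A is the baseline level
    x4 = (engineer == "B").astype(float)
    x5 = (engineer == "C").astype(float)

    X = np.column_stack([np.ones(n), temperature, pressure, night, x4, x5])
    y = 10 + 0.05 * temperature + 2 * pressure - 1.5 * night + 0.5 * x4 + rng.normal(size=n)

    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    print(b)   # b[3], b[4], b[5] estimate the shift and engineer effects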
In general, dummy variables are very useful for writing down models for designed experiments in terms of a regression equation. (Skip the following material for a while until you learn the planning and analysis of experiments.)
Example: Randomized Block design

Suppose that we wish to compare the effect of four different stain-resistant chemicals which vary substantially from one piece of material (cloth or fabric) to another. We might select pieces of material from 3 different batches of fabric and compare all four chemicals within the relatively homogeneous conditions provided by each piece of material. Thus each piece of fabric would be cut into 4 pieces representing the experimental units. The treatments A, B, C, D would be randomly assigned to the four units, and this would be repeated for each of the 3 pieces of material. The response y is then measured for each combination. The result would be a randomized block design consisting of three blocks and four treatments. A typical randomization would result in a combination

    block 1 | block 2 | block 3
       D    |    A    |    C
       B    |    C    |    A
       A    |    B    |    B
       C    |    D    |    D        (40)

The model for such an experiment would be

    y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \beta_5 x_5 + \varepsilon,

where

    x_1 = 1 if the measurement is made in block 2, 0 otherwise,
    x_2 = 1 if the measurement is made in block 3, 0 otherwise,
    x_3 = 1 if treatment B is applied, 0 if not,
    x_4 = 1 if treatment C is applied, 0 if not,
    x_5 = 1 if treatment D is applied, 0 if not.

Note that

    y = \beta_0 + \varepsilon

when $x_1 = x_2 = x_3 = x_4 = x_5 = 0$, that is, when the measurement is obtained in block 1 for treatment A. Hence $\beta_0$ must be the average response for treatment A in block 1. $\beta_1$, for example, corresponds to the difference in average response between blocks 2 and 1, whereas $\beta_4$ is the difference between treatments C and A. The standard regression analysis, together with the relevant tests, can now be performed depending on the desired hypotheses.
4 Selecting the best Regression equation

While establishing a linear model for a response variable y in terms of a set of p variables, we often see that some of the independent variables do not contribute, or contribute very little, to explaining the variation in y. Hence their presence in the regression equation is not justified. In this chapter we see how we can eliminate from the regression equation those independent variables whose contributions are not significant.

In general, for any statistical modelling problem, there are two conflicting interests:

1. to include as many parameters as possible in the model to explain the variation,

2. to include as few parameters as possible in the model, so that the instability caused by parameter estimation does not introduce undesired variability into the fitted model and hence into the predictions.

Hence, our objective is always to employ the smallest possible number of parameters for an adequate representation. This is called model parsimony and is closely related to the sample size versus the number of parameters or, in other words, to the degrees of freedom available for estimating the parameters. This is the classical balance in dividing the limited information between interpolation and extrapolation.
4.1 Extra sums of squares and partial F-tests

Suppose that in a regression analysis we have the independent variables $x_1, x_2, \ldots, x_p, x_{p+1}, \ldots, x_{p+k}$, and we want to know how much information we lose if we exclude the variables $x_{p+1}, \ldots, x_{p+k}$ from the model. That is, suppose that we have

    Model 1:  y = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p + \beta_{p+1} x_{p+1} + \cdots + \beta_{p+k} x_{p+k} + \varepsilon_1,    (41)

and we know that there is no lack of fit and the model is significant. We want to know how much information we lose if we fit the reduced model

    Model 2:  y = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p + \varepsilon_2.    (42)

Note that the above problem is equivalent to testing

    H_0: \beta_{p+1} = \cdots = \beta_{p+k} = 0  vs.  H_1: not all are zero.

Suppose that we fitted Model 1 and calculated the sum of squares due to residuals from this model, say $SSE_1$. Suppose that we then fit the reduced Model 2 and let $SSE_2$ be the corresponding residual sum of squares from this reduced model. Now, $SSE_1$ will always be smaller than $SSE_2$, as the first model has more parameters to explain the total variation in y, so that $SSE_2 - SSE_1 > 0$. However, if the variables $x_{p+1}, \ldots, x_{p+k}$ do not contribute much to explaining y, then this difference should be small. Starting from this argument, we can construct the test as follows:

    E\left[\frac{1}{n-p-k-1}\, SSE_1\right] = \sigma^2,

and

    E\left[\frac{1}{k}(SSE_2 - SSE_1)\right] =
        \sigma^2,  if \beta_{p+1} = \cdots = \beta_{p+k} = 0,
        \sigma^2 + \text{positive terms involving } \beta_{p+1}, \ldots, \beta_{p+k},  otherwise.    (43)

Hence, if $H_0$ is true,

    F = \frac{\frac{1}{k}(SSE_2 - SSE_1)}{\frac{1}{n-p-k-1}\, SSE_1}    (44)

will be close to 1; otherwise it will be larger than 1. Also, it can be shown that, under $H_0$,

    \frac{1}{\sigma^2}(SSE_2 - SSE_1) \sim \chi^2_k,

and

    \frac{SSE_1}{\sigma^2} \sim \chi^2_{n-k-p-1},

hence under the null hypothesis $F \sim F_{k,\, n-p-k-1}$. The calculated F value can be compared with $F_{k, n-k-p-1}$ at a desired significance level $\alpha$.

Note that

    SST = SS(\text{due to regression from Model 1}) + SSE_1,

and

    SST = SS(\text{due to regression from Model 2}) + SSE_2,

so that

    SSE_2 - SSE_1 = SS(\text{due to regression from Model 1}) - SS(\text{due to regression from Model 2}).

Often we write

    SS(b_1, b_2, \ldots, b_{p+k} \mid b_0) = SS(\text{regression from Model 1}),
    SS(b_1, b_2, \ldots, b_p \mid b_0) = SS(\text{regression from Model 2}),

so that we write

    SS(b_{p+1}, \ldots, b_{p+k} \mid b_0, b_1, \ldots, b_p) = SS(b_1, \ldots, b_{p+k} \mid b_0) - SS(b_1, \ldots, b_p \mid b_0).

Often, while choosing the best regression equation, we will be interested in the relative importance of each of the variables $x_i$, i = 1, ..., p, as if it were the last to enter the regression equation. This is quantified by

    SS(b_i \mid b_0, b_1, \ldots, b_{i-1}, b_{i+1}, \ldots, b_p).

This sum of squares has 1 degree of freedom, and it measures the contribution to the regression sum of squares of each coefficient $\beta_i$ (hence each $x_i$), given that all the variables other than $x_i$ were already in the model. This is the contribution of the variable $x_i$ to the regression as though it were added to the model last. Then

    F = \frac{SS(b_i \mid b_0, b_1, \ldots, b_{i-1}, b_{i+1}, \ldots, b_p)}{\frac{1}{n-p-1}\, SSE_1}

has an $F_{1, n-p-1}$ distribution and is called the partial F-test for $\beta_i$.

When the best model is being built, the partial F-test is a useful criterion for adding and removing terms from the model. For example, the importance of $x_i$ in the regression may be high when it is alone in the regression, but when it is entered as the last variable in the equation, after the other variables, it may have very little effect on the dependent variable Y. This happens when $x_i$ is highly correlated with other variables already in the regression equation. Note that these partial F-tests are equivalent to t-tests with n - p - 1 degrees of freedom.
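A hedged sketch of the partial F-test (44) on simulated data: the full model uses four candidate regressors, the reduced model keeps only the first two, and the F statistic compares the increase in residual sum of squares against the full-model residual mean square.

    import numpy as np
    from scipy import stats

    def sse(X, y):
        b, *_ = np.linalg.lstsq(X, y, rcond=None)
        r = y - X @ b
        return r @ r

    rng = np.random.default_rng(4)
    n = 40
    Z = rng.normal(size=(n, 4))                      # four candidate regressors
    y = 1 + 2 * Z[:, 0] - Z[:, 1] + rng.normal(size=n)

    X_full = np.column_stack([np.ones(n), Z])        # Model 1: all 4 variables
    X_red = np.column_stack([np.ones(n), Z[:, :2]])  # Model 2: first 2 variables only

    sse1, sse2 = sse(X_full, y), sse(X_red, y)
    k = X_full.shape[1] - X_red.shape[1]             # number of variables dropped
    df_res = n - X_full.shape[1]                     # n - p - k - 1 in the notation above

    F = ((sse2 - sse1) / k) / (sse1 / df_res)        # statistic (44)
    print(F, stats.f.sf(F, k, df_res))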
4.2 Methods of selecting the best regression

There are quite a lot of tested, well-known methods for choosing the best regression equation. There is no guarantee that these methods will reach the same best regression for a given problem, hence individual judgement is needed. However, stepwise regression seems to perform well under most conditions. The methods available for choosing the best regression are:

1. All possible regressions
2. The backward elimination
3. The forward selection
4. Step-wise regression
5. The best subset regression
6. Stage-wise regression
7. Ridge regression
8. Principal component regression
9. Latent root regression

We now briefly explain the first three of these methods.
1. All Possible Regressions

This procedure first requires fitting all the possible regression equations which involve any subset of $(x_1, x_2, \ldots, x_p)$; hence we first fit $2^p$ possible models. The fitted models are then divided into p sets, each having k = 1, 2, ..., p variables. Each set is ordered according to the $R^2$ values obtained by the regression lines in the group, and we take from each group the regression line with the highest $R^2$. Groups with a higher number of variables will have higher $R^2$; however, at a certain stage, when the additional variables do not contribute, the increase in $R^2$ stops being significant. Hence, by comparing the p maximal $R^2$ values from the groups and deciding (by ad hoc methods) on the significant cut-off point, we choose the best regression line. For example, suppose that we have $(x_1, x_2, x_3)$. We then have the following sets:

Set 1:
    y = \beta_0 + \beta_1 x_1
    y = \beta_0 + \beta_1 x_2
    y = \beta_0 + \beta_1 x_3

Set 2:
    y = \beta_0 + \beta_1 x_1 + \beta_2 x_2
    y = \beta_0 + \beta_1 x_1 + \beta_2 x_3
    y = \beta_0 + \beta_1 x_2 + \beta_2 x_3

Set 3:
    y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3

Suppose that we have the following $R^2$ values in each set:

    Set 1: R^2_1 = 0.51, R^2_2 = 0.61, R^2_3 = 0.30
    Set 2: R^2_1 = 0.91, R^2_2 = 0.81, R^2_3 = 0.79
    Set 3: R^2_1 = 0.92

Thus the maximum from each set is 0.61, 0.91 and 0.92, corresponding to the models

    y = \beta_0 + \beta_1 x_2,
    y = \beta_0 + \beta_1 x_1 + \beta_2 x_2,
    y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3.

Note that in going from the model $y = \beta_0 + \beta_1 x_2$ to $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2$ we have quite a significant increase in $R^2$, whereas the increase from $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2$ to $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3$ is not so significant. Hence the indicated best model is $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2$. Note, however, that this method involves an enormous amount of calculation; for example, if p = 10, then we need to fit $2^p = 1024$ regression equations. By using partial correlation coefficients and partial F-tests, we can reach a similar choice without performing so many calculations.
2. The Backward elimination

This method can be summarized as follows:

(a) The full regression equation with all the variables is studied.

(b) If this regression equation is significant, we perform p partial F-tests, one for each of the variables in the regression. The lowest partial F-test value, say $F_s$, is compared with a preselected significance level (called the F-to-remove value) $F_0$. If $F_s > F_0$, then the current regression equation under study is the best regression.

(c) If $F_s < F_0$, we remove the variable $x_s$ on which this partial F-test is made. We then compute the regression equation without $x_s$, repeat stage (b), and continue this analysis until the lowest partial F-test value $F_{s_i}$ is greater than the fixed F-to-remove value.
3. Forward Selection

We can summarize this procedure as follows:

- Select the $x_i$, i = 1, 2, ..., p, which is most correlated with y, say $x_1$, and fit the regression equation on this single variable. Check whether the regression equation is significant by comparing the partial F-test value ($F_1$, say) with a pre-fixed F value, called the F-to-enter value. If $F_1 < F$, then stop and conclude that there is no significant regression.

- If $F_1 > F$, find the next variable $x_j \neq x_1$ (say $x_2$) which has the highest partial correlation coefficient with y (after the effect of $x_1$ is taken out) and fit the regression

    y = \beta_0 + \beta_1 x_1 + \beta_2 x_2.

- Continue adding variables to the regression equation according to the highest partial correlation coefficient with y (after the effect of the variables already in the equation is taken out).

- As soon as the partial F-value of the most recently entered variable becomes non-significant (as compared with the fixed F-to-enter value), stop the selection procedure. The latest regression equation with a significant partial F-test is the best regression.

The forward and backward selection procedures are almost optimal ways of choosing the best regression equation; however, they are not perfect. We explain this with an example. Suppose that we have $(x_1, x_2, x_3)$ as independent variables, and in the first stage $x_1$ enters as a significant variable. $x_2$ is then entered as the variable with the highest partial correlation coefficient and is seen to have a significant partial F-value. According to the forward selection procedure, both of the variables $x_1$ and $x_2$ are taken as significant contributors, and we then go on to test the significance of $x_3$. However, it is possible that, although $x_1$ enters as the variable with the highest correlation, its effectiveness may drop drastically once $x_2$ enters the equation. This suggests that once $x_2$ is in the equation, one should test yet again the significance of $x_1$, as if it were the last to enter the regression equation. This suggests upgrading the forward selection procedure by including the extra partial F-tests explained above. The resulting procedure is called stepwise regression and is possibly the most used selection procedure.
4. Step-wise regression

We can summarize this method as follows:

- Fix the F-to-enter and F-to-leave values (F-to-enter > F-to-leave). Calculate the correlation coefficient of each $x_i$ with y and fit the regression equation with the variable $x_{i_1}$ which has the highest correlation coefficient. Calculate the F value and, if it is significant as compared with the F-to-enter value, go to the next step.

- Choose $x_{i_2}$, the variable with the highest partial correlation coefficient with y given $x_{i_1}$. Fit

    y = \beta_0 + \beta_1 x_{i_1} + \beta_2 x_{i_2} + \varepsilon

and perform partial F-tests both for $x_{i_1}$ and $x_{i_2}$, as if each were the last to enter. Delete or keep these variables by comparing their F-values with the F-to-leave value (e.g., if $F_{i_1} <$ F-to-leave, then delete $x_{i_1}$).

- Continue adding new variables, repeating the previous step for each of the variables in the regression, until no significant variable outside the regression is left, that is, until the last variable to enter has an F-value smaller than the F-to-enter value.
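As a rough sketch (not the notes' exact algorithm), the forward part of these procedures can be coded as below: at each step the candidate with the largest partial F is entered, which orders candidates in the same way as the partial correlation with y, and the procedure stops when that F falls below the F-to-enter threshold. A full stepwise procedure would add the F-to-leave check on the variables already in the model. The threshold 4.0 and the simulated data are arbitrary choices.

    import numpy as np

    def sse(X, y):
        b, *_ = np.linalg.lstsq(X, y, rcond=None)
        r = y - X @ b
        return r @ r

    def forward_select(Z, y, f_to_enter=4.0):
        """Greedy forward selection using partial F-tests.
        Z: (n, p) matrix of candidate regressors; returns the selected column indices."""
        n, p = Z.shape
        selected, remaining = [], list(range(p))
        while remaining:
            X_cur = np.column_stack([np.ones(n)] + [Z[:, j] for j in selected])
            sse_cur = sse(X_cur, y)
            best_j, best_F = None, -np.inf
            for j in remaining:
                X_try = np.column_stack([X_cur, Z[:, j]])
                df_res = n - X_try.shape[1]
                F = (sse_cur - sse(X_try, y)) / (sse(X_try, y) / df_res)
                if F > best_F:
                    best_j, best_F = j, F
            if best_F < f_to_enter:
                break                       # no remaining variable is worth entering
            selected.append(best_j)
            remaining.remove(best_j)
        return selected

    rng = np.random.default_rng(5)
    n = 60
    Z = rng.normal(size=(n, 5))
    y = 1 + 3 * Z[:, 0] - 2 * Z[:, 2] + rng.normal(size=n)
    print(forward_select(Z, y))             # typically picks columns 0 and 2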
5 Examination of residuals

In the previous sections, while performing the regression analysis

    Y = X\beta + \varepsilon,    (45)

we assumed that the $\varepsilon_i$ are uncorrelated, identically distributed random variables with $\varepsilon_i \sim N(0, \sigma^2)$, and hence that the $\varepsilon_i$ are independent. Thus, if we fit our model under these conditions and obtain the sample

    \hat{\varepsilon}_i = y_i - \hat{y}_i,

where the $\hat{y}_i$ (the components of $\hat{Y} = Xb$) are the fitted values obtained from the model, then, if our model is the correct one, we would expect this sample $\hat{\varepsilon}_i$, i = 1, 2, ..., n, to resemble a random sample from iid random variables with the distribution $N(0, \sigma^2)$. The $\hat{\varepsilon}_i$ are called the residuals. In order to simplify notation, we will not distinguish the random errors $\varepsilon_i$ from the residuals $\hat{\varepsilon}_i$ and will simply write $\varepsilon_i$ for both; if the need to distinguish them arises, we will indicate it.

By analyzing the sample of residuals, we can decide whether the model assumptions appear to be violated and, if this is the case, the examination of the residuals should also give us an indication of how to proceed to correct the model. Tests on residuals can be collected under the following categories:

1. Check whether the mean of the residuals is constant; if not, check what is contributing to the systematic change in the mean.

2. Check whether the variance of the residuals is constant; if not, check what is contributing to the systematic change in the variance.

3. Check whether the residuals are normally distributed.

4. Check whether the residuals are uncorrelated.

Probably the most significant and troublesome deviation is due to (4). It is known that regression analysis is not robust against deviations from uncorrelated errors: all of the t and F tests are very sensitive to this deviation, and under even a slight deviation their conclusions should be in doubt. Also, under this deviation, the least squares estimates are no longer the maximum likelihood estimates. When this happens, there is no quick fix to the problem; more complicated methods, such as regression analysis with correlated errors or even time series techniques, will have to be used. It is known that deviation from normality is not that serious, in the sense that the tests of significance on which the regression analysis depends are quite robust against deviations from normality; however, the least squares estimators will no longer have their universally accepted optimality properties. Under certain transformations of y (for example, the Box-Cox transformation) it may be possible to transform the data to normality. However, as we will briefly mention in the next section, one has to be very careful when employing transformations of the data, as conclusions drawn on the transformed data cannot always be translated back to the conclusions that we would like to have on the original data. Deviations due to a non-zero mean or a non-constant variance are easier to fix; one can (though not always) take simple corrective measures to transform the data back to zero-mean, constant-variance errors. The regression analysis can also be performed on data with unequal variances; see, for example, Draper and Smith (1998), chapter 9, on the generalized least squares method. We now look at some of the ways of analyzing residuals.
5.1 Testing the independence of residuals

There are many ways in which the residuals can be correlated. However, the most common way they exhibit correlation is serial correlation. If we know the time order of the observations and assume that the observations are equally spaced in that ordering, we can check, either graphically or by specific tests, for the existence of serial correlation in the residuals. When the residuals are serially correlated, their correlation function, given by

    \rho_k = \frac{E[\varepsilon_t\, \varepsilon_{t+k}]}{E[\varepsilon_t^2]},

is a measure of how residuals separated by k units of time are linearly related. The function

    r_k = \frac{\sum_{i=1}^{n-|k|} \varepsilon_i\, \varepsilon_{i+|k|}}{\sum_{i=1}^{n} \varepsilon_i^2}

is an asymptotically unbiased estimator of the correlation function. Under the hypothesis of uncorrelated residuals, the correlation function of the residuals is given by

    \rho_k = 1 if k = 0,   \rho_k = 0 if |k| > 0.    (46)

The sampling properties of the estimator $r_k$ are quite complicated. However, under the hypothesis that the residuals are uncorrelated, asymptotically we have $r_k \sim N(0, 1/n)$ for each k, so that the hypothesis can be checked by verifying whether each sample correlation falls in the interval $\pm 1.96/\sqrt{n}$. Instead of testing whether each sample correlation falls inside this interval, we can build an overall portmanteau test: under the hypothesis of independent residuals,

    Q = n \sum_{j=1}^{h} r_j^2

has (asymptotically) a chi-square distribution with h degrees of freedom, and large values of Q would indicate that the residuals deviate from independence.
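A minimal implementation of the sample correlation function r_k and of the portmanteau statistic Q, together with the n(n+2)/(n-j) weighted variant that appears in the list below; the residual series here is simulated purely for illustration.

    import numpy as np
    from scipy import stats

    def sample_acf(e, h):
        # residuals from a fit with an intercept have mean ~0, so centering is harmless
        e = e - e.mean()
        denom = np.sum(e ** 2)
        return np.array([np.sum(e[k:] * e[:-k]) / denom for k in range(1, h + 1)])

    def portmanteau(e, h=10, weighted=False):
        """Q = n * sum r_k^2, or the n(n+2) sum r_k^2/(n-k) weighted variant; both
        are approximately chi-square with h degrees of freedom under independence."""
        n = len(e)
        r = sample_acf(e, h)
        if weighted:
            Q = n * (n + 2) * np.sum(r ** 2 / (n - np.arange(1, h + 1)))
        else:
            Q = n * np.sum(r ** 2)
        return Q, stats.chi2.sf(Q, h)

    rng = np.random.default_rng(6)
    resid = rng.normal(size=100)            # residuals from some fitted model
    print(portmanteau(resid, h=10))
    print(portmanteau(resid, h=10, weighted=True))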
Other tests that can be used to check the independence of the residuals are the following:

1. McLeod and Li test (Brockwell and Davis, 1996, page 35), which tests whether the residuals form an iid sequence:

    Q = n(n+2) \sum_{j=1}^{h} \frac{r_j^2}{n-j} \sim \chi^2_h.

2. Turning point test (Brockwell and Davis, 1996, page 35).

3. The difference-sign test (Brockwell and Davis, 1996, page 35).

4. Durbin-Watson test (Draper and Smith, 1998, chapter 7).
5.2 Checking for normality

There are numerous ways one can check for normality. For example, the simplest is to plot a histogram of the residuals and see whether it is compatible with a sample from a Gaussian distribution. However, histograms notoriously depend on how they are devised and can be quite misleading. A better option is the q-q plot and the associated squared correlation, which we explain briefly. Let $\varepsilon_{(1)}, \varepsilon_{(2)}, \ldots, \varepsilon_{(n)}$ be the order statistics of the residuals, assumed to be an iid sample from $N(\mu, \sigma^2)$. If $X_{(1)}, X_{(2)}, \ldots, X_{(n)}$ are the order statistics of an iid sample with the N(0, 1) distribution, then

    E(\varepsilon_{(j)}) = \mu + \sigma m_j,

where

    m_j = E(X_{(j)}).

Thus, under the hypothesis that the residuals are normally distributed, a graph of the points $(m_1, \varepsilon_{(1)}), \ldots, (m_n, \varepsilon_{(n)})$ (known as the q-q plot) should be approximately linear; if the residuals are not normally distributed, then this graph will be nonlinear. One can also devise a formal test based on this idea: under the assumption of normality, the squared correlation coefficient of the points $(m_i, \varepsilon_{(i)})$ should be near 1. Hence the assumption of normality should be rejected if the squared correlation $R^2$ is small. Now, the $m_i$ are not known but can be estimated by $\Phi^{-1}((i - 0.5)/n)$, and then

    R^2 = \frac{\left(\sum_{i=1}^{n} (\varepsilon_{(i)} - \bar{\varepsilon})\, m_i\right)^2}{\sum_{i=1}^{n} (\varepsilon_{(i)} - \bar{\varepsilon})^2\, \sum_{i=1}^{n} m_i^2}.

The percentage points for $R^2$ are given in Brockwell and Davis, page 37.
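The q-q correlation check can be coded in a few lines; the function name qq_r2 is mine, and the approximation m_i = Phi^{-1}((i - 0.5)/n) is the one used in the text. (scipy.stats.probplot produces the corresponding plot.)

    import numpy as np
    from scipy import stats

    def qq_r2(resid):
        """Squared correlation between the ordered residuals and the approximate
        expected normal order statistics m_i = Phi^{-1}((i - 0.5)/n)."""
        e = np.sort(resid)
        n = len(e)
        m = stats.norm.ppf((np.arange(1, n + 1) - 0.5) / n)
        return np.corrcoef(e, m)[0, 1] ** 2

    rng = np.random.default_rng(7)
    print(qq_r2(rng.normal(size=100)))        # close to 1 for normal residuals
    print(qq_r2(rng.exponential(size=100)))   # noticeably smaller for skewed residuals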
5.3 Plots of residuals
Plotting residuals is possibly the easiest way of seeing deviations from the
model assumptions. Here we look at some plots:
1. Plot against the fitted values $\hat{y}_i$: when residuals are plotted against
$\hat{y}_i$, under the model hypotheses we typically obtain one of the patterns
shown in the figures: no evidence against the model assumptions; non-constant
variance; a systematic error in the analysis; or an inadequate model, in which
case extra square or cross terms are needed.
2. Plot against the variables $x_j$: typical plots are similar to the ones given
above, indicating how the mean and the variance of the residuals change
according to each independent variable.
3. Plot in time sequence: again, typical plots are similar to the ones given
above and show in each case how the mean and the variance of the residuals
change over time, indicating what corrective actions should be taken.
4. Plot of the correlation function: as we have said, the sampling properties
of the sample correlation function are quite complicated. However,
a plot of this function gives a very good idea of whether we deviate from
the independence assumption (see the sketch below).
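The four plots described above can be produced, for example, as follows; the data here are simulated purely to illustrate the layout of the plots, and the non-constant variance is introduced deliberately.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
x = np.linspace(0, 10, 80)
y_hat = 2 + 0.5 * x                       # hypothetical fitted values
eps = rng.normal(scale=1 + 0.2 * x)       # residuals with deliberately non-constant variance

fig, axes = plt.subplots(2, 2, figsize=(9, 7))
axes[0, 0].scatter(y_hat, eps); axes[0, 0].set_title("residuals vs fitted values")
axes[0, 1].scatter(x, eps);     axes[0, 1].set_title("residuals vs regressor x")
axes[1, 0].plot(eps, marker="o", linestyle=""); axes[1, 0].set_title("residuals in time order")

# sample correlation function of the residuals with the +-1.96/sqrt(n) bounds
n, h = len(eps), 15
r = [np.sum(eps[:n - k] * eps[k:]) / np.sum(eps ** 2) for k in range(1, h + 1)]
axes[1, 1].bar(range(1, h + 1), r)
axes[1, 1].axhline(1.96 / np.sqrt(n), color="grey", linestyle="--")
axes[1, 1].axhline(-1.96 / np.sqrt(n), color="grey", linestyle="--")
axes[1, 1].set_title("correlation function of residuals")

for ax in axes.flat:
    ax.axhline(0, color="black", linewidth=0.5)
plt.tight_layout()
plt.show()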
5.4 Outliers and tests for influential observations
We note that for the regression equation
$$Y = X\beta + \epsilon,$$
the estimator for $\beta$ is given by
$$b = (X^T X)^{-1} X^T Y,$$
and the fitted values are given by
$$\hat{Y} = Xb = X(X^T X)^{-1} X^T Y = HY, \qquad (47)$$
so that the residuals can then be written as
$$\epsilon = Y - \hat{Y} = Y - HY = (I - H)Y.$$
Here, H is called the hat matrix (due to the fact that $\hat{Y} = HY$, that is, H
transforms the observations into predictions) and appears in every aspect of
regression analysis. For example, we can note the following:
1. The residuals can be written as
$$\epsilon = Y - \hat{Y} = Y - HY = (I - H)Y;$$
since $(I - H)X\beta = 0$, the residual vector equals $(I - H)$ applied to the error term, so that
the variance matrix of the residuals can be written as (see Draper
and Smith, page 206, for details)
$$V(\epsilon) = (I - H)\sigma^2.$$
2. We can write $V(\epsilon_i) = (1 - h_{ii})\sigma^2$, where $0 \le h_{ii} < 1$. Since $\sigma^2$ is
estimated by
$$s^2 = \frac{1}{n-p} \sum_{i=1}^{n} \epsilon_i^2,$$
we can studentize the residuals by
$$s_i = \frac{\epsilon_i}{s(1 - h_{ii})^{1/2}},$$
which are called the studentized residuals.
3. The sum of squares due to regression can be written as
$$SS(\text{due to regression}) = \hat{Y}^T \hat{Y} = Y^T H Y.$$
4. The diagonal elements $\{h_{ii}\}$ are called leverages, since we can write
$$\hat{y}_i = h_{ii} y_i + \sum_{j \ne i} h_{ij} y_j,$$
hence $h_{ii}$ shows how heavily $y_i$ contributes to $\hat{y}_i$ (a short computational sketch follows this list).
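As a sketch of how these quantities can be computed directly from the design matrix (an illustration, not part of the original notes; the simulated data and variable names are arbitrary):

import numpy as np

def hat_matrix_diagnostics(X, y):
    """Leverages h_ii and studentized residuals for the model Y = X beta + eps.
    X is the n x p design matrix (first column of ones for the intercept)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    H = X @ np.linalg.inv(X.T @ X) @ X.T        # hat matrix: Y_hat = H Y
    e = (np.eye(n) - H) @ y                     # residuals
    s2 = np.sum(e ** 2) / (n - p)               # estimate of sigma^2
    h = np.diag(H)                              # leverages
    studentized = e / np.sqrt(s2 * (1 - h))     # s_i = e_i / (s (1 - h_ii)^{1/2})
    return h, e, studentized

# small illustration
rng = np.random.default_rng(4)
x = rng.uniform(0, 10, size=20)
X = np.column_stack([np.ones_like(x), x])
y = 1 + 2 * x + rng.normal(size=20)
h, e, s = hat_matrix_diagnostics(X, y)
print(h.sum())    # the leverages sum to p, here 2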
Having given this notational introduction, let us now look at what we mean by
an outlier and an influential observation. We will call an observation on the
response variable an outlier if it is produced by some random mechanism
other than the one producing the rest of the data (usually by some unknown
methodological sampling error). Often outliers are very large (or very small) observations
and, in principle, it is not quite clear whether an observation is an extreme
observation coming from the tail of the distribution of the random variable
Y or is an outlier. By an influential observation we will mean an observation
whose presence or absence in the regression analysis would have a big effect on
the results. In principle, all outliers are influential observations, but the converse
is not true, as the following example shows. Consider the following data set:
y     x
11    3
2     3
4     3
5     3
3     7
                  (48)
Note that we have 5 observations, 4 at x = 3 and one at x = 7. If $V(y) = \sigma^2$,
then at x = 3, $V(\epsilon_i) = 0.75\sigma^2$, while at x = 7, $V(\epsilon_5) = 0$. That is, the fitted
line is totally determined by the value $y_5$. Hence, $y_5$ is an extremely influential
value, whether it is correct or not. Note that $y_1$ is a large value coming from
the tail, possibly an outlier, but it is not as influential as the observation $y_5$.
Hence, an observation need not be an outlier to be influential, and outliers,
no matter how large they may be, are not always influential.
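A short check of the variance claims above, under the five-observation reading of the table (the y-values used below follow that reconstruction and are otherwise immaterial to the leverages):

import numpy as np

x = np.array([3.0, 3.0, 3.0, 3.0, 7.0])     # four observations at x = 3 and one at x = 7
y = np.array([11.0, 2.0, 4.0, 5.0, 3.0])    # y-values as read from the (reconstructed) table

X = np.column_stack([np.ones_like(x), x])
H = X @ np.linalg.inv(X.T @ X) @ X.T
print(np.round(np.diag(H), 2))       # leverages: 0.25 at x = 3, 1.00 at x = 7
print(np.round(1 - np.diag(H), 2))   # V(eps_i)/sigma^2: 0.75 at x = 3 and 0 at x = 7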
As one can imagine, one way to check whether an observation is influential or
not is to see if its deletion from the analysis greatly affects the analysis. One
way to check this is Cook's statistic: the influence of the ith data
point is measured by the squared scaled distance
$$D_i = \frac{(\hat{Y} - \hat{Y}(i))^T (\hat{Y} - \hat{Y}(i))}{p s^2},$$
where $\hat{Y}$ is the usual vector of predicted values, whereas $\hat{Y}(i)$ is the vector
of predicted values from a least squares fit when the ith data point is deleted
(and b(i) the corresponding vector of parameter estimates). Here p is the number of parameters
in the model. If the ith observation has little influence, $D_i$ will be small. Since
$$\hat{Y} - \hat{Y}(i) = X(b - b(i)),$$
we can write
$$D_i = \frac{(b - b(i))^T X^T X (b - b(i))}{p s^2}.$$
Another way of writing $D_i$ is
$$D_i = \left[\frac{\epsilon_i}{s(1 - h_{ii})^{1/2}}\right]^2 \left[\frac{h_{ii}}{1 - h_{ii}}\right] \frac{1}{p},$$
where $\epsilon_i$ is the residual from the ith observation when the full data set is
used, and $h_{ii}$ are the diagonal elements of the hat matrix for the full data set.
Note that the first factor in the expression above is the square of the studentized residual,
whereas the second factor is the ratio
(variance of the ith predicted value)/(variance of the ith residual).
There are no formal tests for Cook's statistic; however, its routine calculation
is recommended. An ad hoc comparison of the values of Cook's statistic
for each i would reveal influential observations, if there are any. For
examples, see Draper and Smith. Other related statistics for influential
observations are DFFITS:
$$DFFITS_i = \left[(b - b(i))^T X^T X (b - b(i))/s^2(i)\right]^{1/2} = \left[D_i\, p\, s^2/s^2(i)\right]^{1/2}, \qquad (49)$$
and Atkinson's modified Cook's statistic:
$$A_i = \left[(b - b(i))^T X^T X (b - b(i))\,(n-p)/(p\, s^2(i))\right]^{1/2} = \left[D_i (n-p) s^2/s^2(i)\right]^{1/2} = DFFITS_i \left[(n-p)/p\right]^{1/2}. \qquad (50)$$
See Draper and Smith, chapter 8, for more details. There is a well-developed
theory for outliers and a wide variety of tests to check whether an observation is an outlier.
See Barnett and Lewis (1994) for an excellent treatment of the subject.
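A sketch (not part of the original notes) of how Cook's statistic and DFFITS can be computed from the leverages and residuals, using the closed forms above; the formula used for $s^2(i)$ is a standard deletion identity and the simulated data are only for illustration. Note that the DFFITS variant defined in the text is unsigned.

import numpy as np

def influence_measures(X, y):
    """Cook's statistic D_i and DFFITS_i computed without refitting the model."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    H = X @ np.linalg.inv(X.T @ X) @ X.T
    h = np.diag(H)
    e = y - H @ y
    s2 = np.sum(e ** 2) / (n - p)
    t = e / np.sqrt(s2 * (1 - h))                 # studentized residuals
    D = (t ** 2) * (h / (1 - h)) / p              # Cook's statistic, as in the text
    # s^2(i): residual variance with observation i deleted (standard deletion identity)
    s2_i = ((n - p) * s2 - e ** 2 / (1 - h)) / (n - p - 1)
    dffits = np.sqrt(D * p * s2 / s2_i)           # DFFITS_i = [D_i p s^2 / s^2(i)]^{1/2}
    return D, dffits

rng = np.random.default_rng(5)
x = rng.uniform(0, 10, 30)
y = 1 + 2 * x + rng.normal(size=30)
y[0] += 8                                         # perturb one observation
X = np.column_stack([np.ones_like(x), x])
D, dffits = influence_measures(X, y)
print(np.argmax(D), round(D.max(), 3))            # the perturbed observation stands out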
6 Further Topics
6.1 Transformations
When we do regression analysis, we start from the assumption that the relationship
between the response variable y and the independent variables
$x_1, \ldots, x_p$ is linear both in the parameters and in the variables, and further we
assume that the response variables are independent, normally distributed variables
with finite variances. Often this is not the case and some of the model
hypotheses are not correct. Traditionally, transformations of the response
variable y as well as of the independent variables are used to remedy such
deviations from the model assumptions and still use linear regression. For example,
a common situation occurs when we think that
$$y = \alpha\, x_1^{\beta_1} x_2^{\beta_2} x_3^{\beta_3}$$
is a sensible model function for a set of data. A logarithmic transformation will
then help us to transform this nonlinear function to
$$\log y = \log\alpha + \beta_1 \log x_1 + \beta_2 \log x_2 + \beta_3 \log x_3,$$
leading us to fit the linear model
$$\log y = \log\alpha + \beta_1 \log x_1 + \beta_2 \log x_2 + \beta_3 \log x_3 + \epsilon, \qquad (51)$$
assuming the usual distributional assumptions on the residuals $\epsilon$. In another
example, suppose we see that a quadratic polynomial regression
$$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \epsilon, \qquad (52)$$
gives a good fit, but upon the transformation $\sqrt{y}$ we see that the model
$$\sqrt{y} = \beta_0 + \beta_1 x + \epsilon, \qquad (53)$$
gives an even better fit. Another situation arises when the actual relationship is
given by
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 \qquad (54)$$
and we fit the linear model
$$y = \beta_0 + \beta_1 x_1^{1/2} + \beta_2 x_2^{1/2} + \epsilon \qquad (55)$$
to reduce the scale of the independent variables. Although these transformations
look innocent and seem to transform apparent forms of nonlinearity
into linearity, they are not at all innocent, and extra care is needed to see if
we reach legitimate conclusions. For example, in (51) and (53), the log
and square root transformations evidently interfere with the error structure. By
assuming that
$$\log y = \log\alpha + \beta_1 \log x_1 + \beta_2 \log x_2 + \beta_3 \log x_3 + \epsilon,$$
where $\epsilon_i \sim N(0, \sigma^2)$, we are implicitly assuming that
$$y = \alpha\, x_1^{\beta_1} x_2^{\beta_2} x_3^{\beta_3} e^{\epsilon}.$$
If, to start with, the correct model was
$$y = \alpha\, x_1^{\beta_1} x_2^{\beta_2} x_3^{\beta_3} + \epsilon,$$
then evidently we could not have taken logarithms and used linear regression.
In using the model (53) instead of (52), we implicitly assume one
error structure which is totally different from the other, and making the same model
assumptions for both models would be contradictory. Hence, one needs to
check how the error structure is affected by a transformation and whether the
transformed error structure has the required form; if so, one still has to be
very careful in transforming the results back to the original data.
Obviously, (55) is a linear model in the parameters, hence the transformations
carried out on the independent variables should present no problems for
the error structure. However, extra care is needed in interpreting the results. A
significant linear regression on the transformed variables does not necessarily
imply a significant linear relationship between the actual variables, and any
conclusion we draw from the fitted model needs to be very carefully checked.
In general, it is impossible to relate the parameters of the initially intended
model to those of the model on the transformed data. Hence, in general, results on
transformed data cannot be translated back into results on the original
model. Hence, whenever possible, transformations should be avoided, and when
they cannot be avoided, extra care should be given to understanding the
effect of the transformations on the results. Nevertheless, certain transformations are
extremely effective in reducing variability, heteroscedasticity (non-constant
variances), non-normality as well as nonlinearity, and their use can be highly
desirable. Usually, different types of plots of the residuals, as well as of the response
and independent variables, indicate what type of transformation
is needed for the occasion. We refer the reader to Draper and Smith, chapter 13,
and Sen and Srivastava, chapter 9, on this issue. We note that
the most used family of transformations is the Box-Cox transformation, which is
used for variance stabilization but, depending on the situation, can also be used
for reducing non-normality and nonlinearity. Box-Cox transformations are
given by
$$y_i^{(\lambda)} = \begin{cases} (y_i^{\lambda} - 1)/\lambda & \text{if } \lambda \ne 0 \\ \log(y_i) & \text{if } \lambda = 0. \end{cases} \qquad (56)$$
Here, the original data $y_i$ need to be positive (adding a constant to all the
data to shift the location so that the data become positive is the usual way
to handle non-positive data). Here, $\lambda$ is an unknown parameter which needs
to be chosen for the specific problem. We refer the reader to Draper and Smith
and Sen and Srivastava for ways of choosing a proper value of $\lambda$.
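As an illustration (not part of the original notes), scipy provides a Box-Cox routine that chooses $\lambda$ by maximum likelihood; the lognormal sample below is only an example of positive, skewed data.

import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
y = rng.lognormal(mean=1.0, sigma=0.6, size=200)   # positive, positively skewed data

y_transformed, lam = stats.boxcox(y)               # lambda chosen by maximum likelihood
print(f"chosen lambda = {lam:.2f}")                # near 0 here, i.e. close to a log transform

def box_cox(y, lam):
    """The transformation itself, as in (56)."""
    y = np.asarray(y, dtype=float)
    return np.log(y) if lam == 0 else (y ** lam - 1) / lam

print(np.allclose(box_cox(y, lam), y_transformed))  # True: same transformation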
6.2 Unequal variances
Usually the variances of the observations are not equal, in which case the
variance-covariance matrix of the error term, $V(\epsilon)$, will not be of the form $I\sigma^2$
but will have unequal diagonal elements, possibly together with non-zero off-diagonal
elements. In this case, ordinary least squares estimation will not apply as such,
affecting the variance-covariance structure of the parameter estimates (they
would still be unbiased) as well as their asymptotic distributions. There are
two different ways to approach this problem. One is to apply a proper
variance-stabilizing transformation to y, making sure that such a transformation
does not negatively affect other aspects of the analysis. If, on the other
hand, we know the form of $V(\epsilon)$, we can adjust the least squares method
to the generalized least squares method (Draper and Smith, section 9.2). The
problem with this method is that it is not common to know the error structure
completely, and if we substitute it by a proper estimator, we do not know
the resulting sampling distributions. Hence, in practice, transformations are
the best bet.
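A minimal sketch (not part of the original notes) of the generalized least squares estimator for a known error covariance matrix V, using the standard form $b = (X^T V^{-1} X)^{-1} X^T V^{-1} Y$; the heteroscedastic data are simulated purely for illustration.

import numpy as np

def generalized_least_squares(X, y, V):
    """GLS estimator for known error covariance V (reduces to OLS when V = sigma^2 I)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    Vinv = np.linalg.inv(V)
    return np.linalg.solve(X.T @ Vinv @ X, X.T @ Vinv @ y)

# illustration with purely diagonal (unequal) variances, i.e. weighted least squares
rng = np.random.default_rng(7)
x = np.linspace(1, 10, 50)
sigma = 0.5 * x                              # standard deviation grows with x
y = 1 + 2 * x + rng.normal(scale=sigma)
X = np.column_stack([np.ones_like(x), x])
print(generalized_least_squares(X, y, np.diag(sigma ** 2)))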
6.3 Ill-conditioned regression, collinearity and Ridge
regression
The matrix $X^T X$ and its inverse $(X^T X)^{-1}$ play the central role in linear
regression. Virtually everything that we have to calculate depends one way or
the other on this inverse. However, when the $X^T X$ matrix is singular,
this inverse will not exist, and the regression analysis will fail. For example,
when we fit the regression
$$Y = X\beta + \epsilon,$$
the solution $b = (X^T X)^{-1} X^T Y$ of the normal equations will not be
unique (in fact there will be infinitely many solutions) when $X^T X$ is
singular. This happens when at least one of the independent variables is
a linear combination of the other independent variables, in which case the
corresponding columns of the $X^T X$ matrix will be a linear combination of
the other columns. In this case, the determinant of this matrix will be zero
and hence the inverse of the matrix will not exist. When this happens, we say
that collinearity (sometimes called multicollinearity) exists among the
independent variables. Quite often, none of the variables is exactly a linear
combination of the others, but still there is a near linear dependency
(near collinearity among the variables) and the determinant of the $X^T X$
matrix is very near 0. In this case, we say that the matrix $X^T X$ is ill-conditioned.
Ill-conditioning in regression is highly undesirable, as it may
cause enormous numerical problems, depending on how ill-conditioned the
matrix is, resulting in unreliable estimates, large variances and covariances, and
thus in overall unreliable inference and predictions. Thus, this situation
should be avoided or, when it occurs, it should be remedied before advancing
further with the analysis. Among the important issues are measuring and
detecting collinearity, and finding those independent variables which are linearly
dependent on the others, to be excluded from the regression analysis. We refer the
reader to Draper and Smith, chapter 16, and Sen and Srivastava, chapter 10.
Ridge regression is a procedure to overcome ill-conditioned situations,
where near linear dependencies between the independent variables exist.
The basic steps of ridge regression are as follows:
1. In the regression model
$$y_i = \beta_0 + \beta_1 x_{1i} + \ldots + \beta_p x_{pi} + \epsilon_i,$$
center and scale the independent variables by
$$f_{ji} = \frac{x_{ji} - \bar{x}_j}{S_{jj}^{1/2}}, \qquad i = 1, \ldots, n,$$
where
$$S_{jj} = \sum_i (x_{ji} - \bar{x}_j)^2,$$
and center the dependent variable by $y_i - \bar{y}$ (denote this centered
vector still by Y). Call F the regression matrix of the transformed variables
corresponding to X, and $F^T F$ the matrix corresponding to $X^T X$.
2. The new regression equation becomes
$$y_i - \bar{y} = \gamma_1 f_{1i} + \ldots + \gamma_p f_{pi} + \epsilon_i.$$
Instead of taking the unbiased least squares estimates
$$\hat{\gamma} = (F^T F)^{-1} F^T Y,$$
consider the biased estimators
$$\hat{\gamma}(\theta) = (F^T F + \theta I)^{-1} F^T Y$$
for some unknown but positive value of $\theta$ to be chosen. (Note that
the estimators
$$\hat{\gamma} = (F^T F)^{-1} F^T Y$$
are still ill-conditioned if the original variables are linearly dependent;
such transformations do not take out any linear dependence. However,
centering and scaling reduce the number of parameters in the model by
1: there is no $\beta_0$ in the model, and $F^T F$ now is the correlation matrix
of the scaled and transformed variables.)
3. Usual values of $\theta$ in practice are in the range (0, 1). There are automatic
ways of choosing $\theta$ from the data, although it is also possible to introduce any prior
belief we may have about $\theta$ into the analysis by Bayesian considerations.
4. Although biased, these estimators will no longer be ill-conditioned, as
the matrix $F^T F$ is shifted by a quantity $\theta I$, which results in a determinant
away from 0. With a properly chosen $\theta$, the mean square error
will be much smaller than that of the ill-conditioned regression, thus compensating
for the bias it introduces. This is a trade-off which is very
common in statistical theory.
5. Ridge regression can only be effective for ill-conditioned regression and
should be avoided for properly conditioned regression, as it will introduce
unnecessary bias into the analysis. Note also that scaling and
centering the data helps reduce the bias, as we have one less parameter
to estimate.
6. It is possible to recover the values of $b(\theta) = (X^T X + \theta I)^{-1} X^T Y$ in terms
of $\hat{\gamma}(\theta)$, thus enabling us to see the effect of centering and scaling on the
original regression equation.
For a detailed study of ridge regression, see Draper and Smith, chapter 17, or
Sen and Srivastava, chapter 12.
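A sketch (not part of the original notes) of the centering, scaling and ridge steps above; the value of $\theta$ and the nearly collinear regressors are illustrative choices only.

import numpy as np

def ridge_fit(X, y, theta):
    """Ridge estimates on centred and scaled regressors, following the steps above.
    theta is the (positive) ridge constant; theta = 0 gives ordinary least squares."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    xbar = X.mean(axis=0)
    S = np.sum((X - xbar) ** 2, axis=0)
    F = (X - xbar) / np.sqrt(S)                   # centred and scaled regressors
    yc = y - y.mean()                             # centred response
    p = F.shape[1]
    return np.linalg.solve(F.T @ F + theta * np.eye(p), F.T @ yc)

# two nearly collinear regressors
rng = np.random.default_rng(8)
x1 = rng.normal(size=100)
x2 = x1 + 0.01 * rng.normal(size=100)             # near linear dependence
y = 1 + x1 + x2 + rng.normal(size=100)
X = np.column_stack([x1, x2])
print(ridge_fit(X, y, theta=0.0))                 # least squares estimates (may be unstable)
print(ridge_fit(X, y, theta=0.1))                 # shrunken, stabilised estimates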
6.4 Generalized Linear models (GLIM)
When we look at the expected value E(Y|X = x) of a random variable
Y conditioned on p other random variables X, this expectation need not
be a linear function of the x's, except when Y and X have a joint multinormal
distribution. As a consequence, when we use regression in the presence of
non-normal residuals (hence non-normal response variables) it does not make
sense to use linear regression, expressing the conditional mean as a
linear function. In this case, it is much more sensible to assume that some
properly chosen function of the conditional mean E(Y|X = x) has a linear
form, that is, to assume that there is a function g such that
$$g(E(Y|X = x)) = \beta_0 + \beta_1 x_1 + \ldots + \beta_p x_p.$$
Here, the function g is called the link function, and the class of such models
is called the class of generalized linear models. (GLIM is often also used for the
method of inference on such models.) For example, suppose that we have an
integer-valued response variable Y which we assume to have a Poisson
distribution with some mean $\lambda$, and we want to regress this variable on p
independent variables $x_1, x_2, \ldots, x_p$ (not necessarily taking integer values).
In this case, certainly the linear regression model
$$y_i = \beta_0 + \beta_1 x_{1i} + \ldots + \beta_p x_{pi} + \epsilon_i$$
will not make much sense, since such a model will not result in discrete values
for y. First of all, the intrinsic meaning of this regression equation is that
$\beta_0 + \beta_1 x_1 + \ldots + \beta_p x_p$ is the expected value of Y conditional on the independent
variables. Hence, the regression equation should be interpreted as
$$\lambda(x_1, \ldots, x_p) = E(Y|x_1, \ldots, x_p) = \beta_0 + \beta_1 x_1 + \ldots + \beta_p x_p.$$
This still would not be proper; the mean value of a Poisson random variable has to be positive,
and such a regression would not guarantee positive values for the predicted $\lambda$.
Hence, it would be much more sensible to assume a model of the sort
$$\log \lambda(x_1, \ldots, x_p) = \log E(Y|x_1, \ldots, x_p) = X\beta.$$
Here, $g(\cdot) = \log(\cdot)$ is the link function. Another possible link function for
Poisson data is $g(\cdot) = \sqrt{\cdot}$, so that the model
$$\sqrt{\lambda(x_1, \ldots, x_p)} = X\beta$$
is another possible regression model for a response variable with a Poisson
distribution.
The choice of a proper link function for response variables with other
types of distributions may not be straightforward; a complete theory for
generalized linear models is given for the exponential family of distributions.
A random variable belongs to the exponential family with parameter
of interest $\theta$ if it has a density function of the form
$$f(z, \theta) = \exp[a(z)\, b(\theta) + c(\theta) + d(z)]$$
for some known functions a, b, c, d. When a(z) = z, the distribution is said
to be in canonical form, and $b(\theta)$ is called the natural parameter of the distribution.
The functions a, b, c, d can involve (nuisance) parameters other than the
parameter of interest $\theta$.
Examples of distributions in the exponential family
1. Normal distribution $N(\mu, \sigma^2)$: taking $\mu$ to be the parameter of interest
and $\sigma^2$ as the nuisance parameter, we can write the normal density as
$$f(z, \mu) = \exp\left[z\left(\frac{\mu}{\sigma^2}\right) - \left(\frac{\mu^2}{2\sigma^2} + \frac{1}{2}\ln(2\pi\sigma^2)\right) - \frac{z^2}{2\sigma^2}\right].$$
Hence $a(z) = z$ and $b(\mu) = \mu/\sigma^2$.
2. Poisson distribution with mean $\lambda$:
$$f(z, \lambda) = \exp[z \ln\lambda - \lambda - \ln(z!)].$$
Here the natural parameter of the distribution is $\ln\lambda$.
3. Binomial distribution:
$$f(z, p) = \exp\left[z \ln\left(\frac{p}{1-p}\right) + n \ln(1-p) + \ln\binom{n}{z}\right].$$
Here the natural parameter of the distribution is $\ln(p/(1-p))$.
4. Gamma distribution with parameter of interest $\theta$ and $\nu$ as the nuisance
parameter:
$$f(z, \theta) = \exp[-\theta z + \nu \ln\theta - \ln\Gamma(\nu) + (\nu - 1)\ln z].$$
5. Pareto distribution with shape parameter $\theta$:
$$f(z, \theta) = \exp[-(1 + \theta)\ln z + \ln\theta].$$
Note that this is not a canonical form.
6. Exponential distribution:
$$f(z, \theta) = \exp[-\theta z + \ln\theta].$$
It is easy to check for the exponential family that $E(a(Z))$ is given by
$-c'(\theta)/b'(\theta)$, where $c'(\theta)$ is the derivative of c with respect to $\theta$. Hence, when
the density can be written in canonical form, $E(Z) = -c'(\theta)/b'(\theta)$. (Note
that the mean of the random variable and the parameter of interest need not
coincide. For example, for the Poisson distribution they coincide, but for the
binomial distribution they do not.) The table below gives some possible choices
of link functions for response variables with the corresponding distribution
function belonging to the exponential family.
Name of the link function    Form of the link function      Used for
Complementary log-log        $\ln(-\ln(1 - \mu))$           Binomial
Logit                        $\ln(p/(1 - p))$               Binomial
Identity                     $\mu$                          Normal
Logarithm                    $\ln\mu$                       Poisson, Gamma
Probit                       $\Phi^{-1}(\mu)$               Binomial
Reciprocal                   $\mu^{-1}$                     Gamma, Poisson
Square root                  $\mu^{1/2}$                    Poisson
(here $\mu$ denotes the mean of the response)
Fitting a generalized linear model is done by applying the re-weighted least
squares method iteratively. See Draper and Smith, or Amaral Turkman and
Silva (Modelos Lineares Generalizados, 2000, SPE), for further references
on the subject. The actual maximization procedure, which maximizes the likelihood
function, is generally messy, and the normal equations have the general
form
$$X^T W X b = X^T W z. \qquad (57)$$
Although this looks like the ordinary normal equations, the weight matrix W and
the vector z depend on the specific model (i.e. the distribution of the $y_i$'s and the link
function) as well as on the parameters b, hence there is no analytical solution
to (57). However, good software is available for estimating the parameters of
the most common generalized linear models, such as GLIM, Statistica, SAS,
SPSS, S-PLUS and so on.
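As a sketch of the iterative re-weighted least squares idea for one particular instance of (57), a Poisson regression with log link is fitted below; the weights W = diag($\mu$) and working response z = X b + (y $-$ $\mu$)/$\mu$ are the standard choices for this model, not taken from the notes, and the data are simulated.

import numpy as np

def poisson_glm_irls(X, y, n_iter=25):
    """IRLS for a Poisson regression with log link: at each step solve
    X^T W X b = X^T W z, with W = diag(mu) and z the working response."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    b = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ b
        mu = np.exp(eta)                  # inverse link
        W = np.diag(mu)
        z = eta + (y - mu) / mu           # working response
        b = np.linalg.solve(X.T @ W @ X, X.T @ W @ z)
    return b

rng = np.random.default_rng(9)
x = rng.uniform(0, 2, size=200)
X = np.column_stack([np.ones_like(x), x])
y = rng.poisson(np.exp(0.5 + 1.0 * x))
print(poisson_glm_irls(X, y))             # estimates close to the true values (0.5, 1.0)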
6.5 Nonlinear models
We will call any regression model nonlinear if it is not in the form
$$y = \beta_0 + \beta_1 x_1 + \ldots + \beta_p x_p + \epsilon, \qquad (58)$$
that is, any regression model which is not linear in the parameters will be
called nonlinear. Some nonlinear models, such as
$$y = \exp(\beta_0 + \beta_1 x_1 + \epsilon),$$
are intrinsically linear in the sense that the problem can be transformed into
a linear form. Others, such as
$$y = \beta_0 + \beta_2 x^{\beta_1} + \epsilon,$$
are intrinsically nonlinear, as they cannot be transformed to a linear form.
The basic problem with nonlinear regression is that the least squares
method, or any other method of parameter estimation, is highly complicated.
As a general rule, there are no analytical solutions to the normal equations, and
numerical, iterative estimation techniques are needed. Even under the assumption
of normal errors, the likelihood function is difficult to maximize, and its
success often depends on a good choice of the initial values $b_0$ used to initiate
the maximization. There are many different ways of maximizing a nonlinear
function, depending, for example, on whether we have second order derivatives of
the function to be maximized. One such method is Marquardt's method. For
details of nonlinear regression, see Draper and Smith, chapter 24.
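As an illustration (not part of the original notes), an intrinsically nonlinear model of the form above can be fitted by nonlinear least squares with scipy's curve_fit, which uses a Levenberg-Marquardt type algorithm and requires initial values p0, whose choice can be critical, as noted above; the data are simulated.

import numpy as np
from scipy.optimize import curve_fit

def model(x, b0, b1, b2):
    """Intrinsically nonlinear model y = b0 + b2 * x**b1."""
    return b0 + b2 * x ** b1

rng = np.random.default_rng(10)
x = np.linspace(0.5, 5.0, 60)
y = 1.0 + 2.0 * x ** 1.5 + rng.normal(scale=0.5, size=x.size)

params, cov = curve_fit(model, x, y, p0=[0.5, 1.0, 1.0])
print(np.round(params, 2))   # estimates of (b0, b1, b2); true values are (1.0, 1.5, 2.0)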