$$\text{var}(X) = \begin{bmatrix} \text{var}(X_1) & \text{cov}(X_1, X_2) & \cdots & \text{cov}(X_1, X_n) \\ \text{cov}(X_2, X_1) & \text{var}(X_2) & \cdots & \text{cov}(X_2, X_n) \\ \vdots & \vdots & \ddots & \vdots \\ \text{cov}(X_n, X_1) & \text{cov}(X_n, X_2) & \cdots & \text{var}(X_n) \end{bmatrix}.$$
It turns out that the mean and variance of a linear combination of the $X_i$, $y = \sum_{i=1}^n a_i X_i$, can be obtained from just the elements of $E[X]$ and $\text{var}(X)$:
$$E[y] = \sum_{i=1}^n a_i E[X_i], \qquad \text{var}(y) = \sum_{i=1}^n a_i^2\,\text{var}(X_i) + \sum_{i \neq j} a_i a_j\,\text{cov}(X_i, X_j).$$
In fact, it will be extremely convenient for later to represent these quantities using matrix algebra. That is, consider $a = (a_1, \ldots, a_n)$, $X$, and $\mu = E[X]$ as $(n \times 1)$ column vectors, and let $V = \text{var}(X)$ denote the $n \times n$ variance matrix. Then $y = a'X$, and
$$E[y] = a'\mu, \qquad \text{var}(y) = a'Va.$$
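As a quick sanity check, here is a small R sketch (the mean vector, variance matrix, and weights are made up for illustration) that computes $E[y]$ and $\text{var}(y)$ both elementwise and via the matrix formulas $a'\mu$ and $a'Va$:

# Hypothetical example: n = 3, made-up mean vector, variance matrix, and weights
mu <- c(1, 2, 3)
V  <- matrix(c(4, 1, 0,
               1, 3, 1,
               0, 1, 2), nrow = 3, byrow = TRUE)
a  <- c(0.5, -1, 2)

# Elementwise formulas
Ey_elem <- sum(a * mu)
off     <- row(V) != col(V)                      # indicator of i != j
vy_elem <- sum(a^2 * diag(V)) + sum((outer(a, a) * V)[off])

# Matrix formulas: E[y] = a'mu, var(y) = a'Va
Ey_mat <- drop(t(a) %*% mu)
vy_mat <- drop(t(a) %*% V %*% a)

c(Ey_elem, Ey_mat)   # identical
c(vy_elem, vy_mat)   # identical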
Now consider a random vector of $k$ linear combinations of $X$:
$$Y_1 = a_{11}X_1 + a_{12}X_2 + \cdots + a_{1n}X_n, \quad \ldots, \quad Y_k = a_{k1}X_1 + a_{k2}X_2 + \cdots + a_{kn}X_n.$$
Elementwise, $E[Y_l] = \sum_{j=1}^n a_{lj}E[X_j]$ and $\text{cov}(Y_l, Y_m) = \sum_{i=1}^n \sum_{j=1}^n a_{li}a_{mj}\,\text{cov}(X_i, X_j)$. Writing $Y = AX$, where $A$ is the $k \times n$ matrix with $(l, j)$ element $a_{lj}$, these become
$$E[Y] = A\mu, \qquad \text{var}(Y) = AVA'. \qquad (1.1)$$
Remark. In the beginning, it's confusing to remember where the transpose in the variance formula comes in: the first or the second term. The trick is to make sure that the dimensions of the matrix multiplication agree. Since $V = \text{var}(X)$ is $(n \times n)$, it must be premultiplied by a matrix with $n$ columns, which is $A$. For the single linear combination case, we had $\text{var}(y) = a'Va$ with the transpose in the first term, but this is because we considered $a$ as an $(n \times 1)$ column vector. The matrix multiplication $aVa'$ is forbidden since the dimensions don't agree.
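The same dimension check can be carried out in R; the sketch below uses a made-up $2 \times 3$ coefficient matrix $A$ together with the $V$ from the previous snippet:

V <- matrix(c(4, 1, 0,
              1, 3, 1,
              0, 1, 2), nrow = 3, byrow = TRUE)   # n x n variance matrix (n = 3)
A <- matrix(c(1, 0, -1,
              2, 1,  0), nrow = 2, byrow = TRUE)  # k x n coefficient matrix (k = 2)

varY <- A %*% V %*% t(A)   # (k x n)(n x n)(n x k) = k x k: dimensions agree
varY

# The other ordering fails because the dimensions do not conform:
# t(A) %*% V would be (3 x 2)(3 x 3) -> error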
1.2. Two Matrix Decompositions. The variance matrix $V$ of a random vector $X$ has two important properties:
1. $V$ must be a symmetric matrix, since $\text{cov}(X_i, X_j) = \text{cov}(X_j, X_i)$.
2. For any vector $a \in \mathbb{R}^d$ with $a \neq 0$ (i.e., at least one of the elements of $a$ is nonzero), we must have $a'Va > 0$. This is because the linear combination $y = a'X$ must have $\text{var}(y) = a'Va > 0$. This property of variance matrices is called positive definiteness.
Symmetric positive definite matrices have two very important decompositions which will come in handy later (a short R illustration follows the list):
1. Cholesky Decomposition: Any variance matrix $V$ can be uniquely written as
$$V = LL', \qquad (1.2)$$
where $L$ is a lower triangular matrix with positive entries $L_{ii} > 0$ on the diagonal.
2. Eigendecomposition: Any variance matrix $V$ can be written as
$$V = \Gamma \Lambda \Gamma', \qquad (1.3)$$
where $\Gamma$ is an orthogonal matrix ($\Gamma\Gamma' = I$) and $\Lambda = \text{diag}(\lambda_1, \ldots, \lambda_d)$ is a diagonal matrix of eigenvalues $\lambda_i > 0$. The two decompositions are linked through the trace of $V$:
$$\sum_{i=1}^d V_{ii} = \sum_{i=1}^d \sum_{j=1}^d L_{ij}^2 = \sum_{i=1}^d \lambda_i.$$
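A small R illustration of both decompositions, using the made-up $V$ from before; note that R's chol() returns the upper-triangular factor, so the lower-triangular $L$ is its transpose:

V <- matrix(c(4, 1, 0,
              1, 3, 1,
              0, 1, 2), nrow = 3, byrow = TRUE)

# Cholesky: V = LL', with L lower triangular and positive diagonal
L <- t(chol(V))              # chol() returns the upper-triangular factor
all.equal(L %*% t(L), V)     # TRUE

# Eigendecomposition: V = Gamma Lambda Gamma'
ed     <- eigen(V, symmetric = TRUE)
Gamma  <- ed$vectors
lambda <- ed$values          # all positive for a positive definite V
all.equal(Gamma %*% diag(lambda) %*% t(Gamma), V)  # TRUE

# Trace identity: sum of diagonal elements equals sum of eigenvalues
c(sum(diag(V)), sum(lambda))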
1.3. Completing the Square. Consider the quadratic function
$$Q(x) = (x - \mu)'A(x - \mu) = \sum_{i=1}^n \sum_{j=1}^n A_{ij}(x_i - \mu_i)(x_j - \mu_j).$$
Since $A$ is a variance matrix, we know that $b'Ab > 0$ for any vector $b \neq 0$. Therefore, $Q(x)$ is minimized at $x = \mu$. Moreover, we can write
$$(x - \mu)'A(x - \mu) = x'Ax - \mu'Ax - x'A\mu + \mu'A\mu = x'Ax - 2x'A\mu + \mu'A\mu.$$
Conversely, suppose we are given a quadratic of the form $x'Ax + 2x'b + c$. Completing the square gives
$$\begin{aligned} x'Ax + 2x'b + c &= x'Ax + 2x'AA^{-1}b + c \\ &= (x + A^{-1}b)'A(x + A^{-1}b) + c - b'A^{-1}AA^{-1}b \qquad (1.4) \\ &= Q(x) + c - \mu'A\mu, \end{aligned}$$
where now $\mu = -A^{-1}b$, so that the quadratic is minimized at $x = \mu$ with minimum value $c - b'A^{-1}b$.
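A quick numerical check of this identity in R, with a made-up positive definite $A$, vector $b$, and constant $c$; the raw quadratic and the completed-square form agree at an arbitrary point, and the minimum occurs at $x = -A^{-1}b$:

set.seed(1)
A  <- crossprod(matrix(rnorm(9), 3, 3)) + diag(3)  # made-up positive definite matrix
b  <- rnorm(3)
c0 <- 1.7

Q1 <- function(x) drop(t(x) %*% A %*% x + 2 * t(x) %*% b + c0)   # x'Ax + 2x'b + c
xhat <- drop(-solve(A, b))                                       # minimizer x = -A^{-1} b
Q2 <- function(x) drop(t(x - xhat) %*% A %*% (x - xhat)) + c0 - drop(t(b) %*% solve(A, b))

x <- rnorm(3)
c(Q1(x), Q2(x))                                 # identical
c(Q1(xhat), c0 - drop(t(b) %*% solve(A, b)))    # minimum value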
We'll allow for the case where $\lambda_i \geq 0$ a bit later in this chapter, but let's keep things simple for now.
Assuming that $V_{11}$ and $V_{22}$ are invertible, the joint pdf of $(X_1, X_2)$ is
$$\begin{aligned} f(x_1, x_2) &= \frac{1}{2\pi\sqrt{V_{11}V_{22}}} \exp\left\{ -\frac{1}{2} \begin{bmatrix} x_1 - \mu_1 & x_2 - \mu_2 \end{bmatrix} \begin{bmatrix} V_{11}^{-1} & 0 \\ 0 & V_{22}^{-1} \end{bmatrix} \begin{bmatrix} x_1 - \mu_1 \\ x_2 - \mu_2 \end{bmatrix} \right\} \\ &= \frac{1}{2\pi\sqrt{V_{11}V_{22}}} \exp\left\{ -\frac{1}{2}\frac{(x_1 - \mu_1)^2}{V_{11}} - \frac{1}{2}\frac{(x_2 - \mu_2)^2}{V_{22}} \right\} \\ &= \frac{1}{\sqrt{2\pi V_{11}}} \exp\left\{ -\frac{(x_1 - \mu_1)^2}{2V_{11}} \right\} \cdot \frac{1}{\sqrt{2\pi V_{22}}} \exp\left\{ -\frac{(x_2 - \mu_2)^2}{2V_{22}} \right\} \\ &= f_{X_1}(x_1) \cdot f_{X_2}(x_2). \end{aligned}$$
Any combination of the elements of $X$, say $Y = (Y_1, \ldots, Y_m) = (X_{i_1}, \ldots, X_{i_m})$ for some set of indices $\mathcal{I} = (i_1, \ldots, i_m)$, $1 \leq i_j \leq d$, is also multivariate normal. The means and covariances of the elements of $Y$ are $E[Y_j] = \mu_{i_j}$ and $\text{cov}(Y_j, Y_k) = V_{i_j, i_k}$.
If $\mathcal{I}$ is a set of indices as above, and $\text{cov}(X_{i_j}, X_{i_k}) = 0$ for every $i_j, i_k \in \mathcal{I}$ such that $j \neq k$, then $X_{i_1}, \ldots, X_{i_m}$ are independent random variables.
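A numerical illustration of the factorization above; this sketch assumes the mvtnorm package (not used elsewhere in these notes) and made-up values of the means and variances:

library(mvtnorm)   # assumed available; provides dmvnorm()

mu <- c(1, -2)
V  <- diag(c(4, 9))        # V11 = 4, V22 = 9, cov(X1, X2) = 0
x  <- c(0.3, -1.5)

joint    <- dmvnorm(x, mean = mu, sigma = V)
factored <- dnorm(x[1], mu[1], sqrt(V[1, 1])) * dnorm(x[2], mu[2], sqrt(V[2, 2]))
c(joint, factored)         # identical: f(x1, x2) = f_X1(x1) * f_X2(x2)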
$$y_i = x_i'\beta + \varepsilon_i, \qquad \varepsilon_i \overset{iid}{\sim} N(0, \sigma^2). \qquad (2.1)$$
However, for what follows it is more convenient to write (2.1) as a multivariate normal. That is, let
$$y = \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix}, \qquad X = \begin{bmatrix} x_{11} & \cdots & x_{1p} \\ \vdots & & \vdots \\ x_{n1} & \cdots & x_{np} \end{bmatrix}, \qquad \beta = \begin{bmatrix} \beta_1 \\ \vdots \\ \beta_p \end{bmatrix}, \qquad \varepsilon = \begin{bmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_n \end{bmatrix}.$$
Then (2.1) can be rewritten in matrix form as $y = X\beta + \varepsilon$ with $\varepsilon \sim N(0, \sigma^2 I)$, or equivalently,
$$y \sim N(X\beta, \sigma^2 I). \qquad (2.2)$$
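A minimal R sketch simulating from model (2.2), with made-up values of $n$, $p$, $\beta$, and $\sigma$, and fitting it with lm():

set.seed(42)
n <- 100; p <- 3
X <- cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))   # design matrix with intercept column
beta  <- c(2, -1, 0.5)
sigma <- 1.5

y <- drop(X %*% beta + rnorm(n, sd = sigma))          # y ~ N(X beta, sigma^2 I)

fit <- lm(y ~ X - 1)                                  # "- 1": X already contains the intercept
coef(fit)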
The MLE of $(\beta, \sigma^2)$ is
$$\hat\beta = (X'X)^{-1}X'y, \qquad \hat\sigma^2_{\text{ML}} = \frac{1}{n}\sum_{i=1}^n (y_i - x_i'\hat\beta)^2 = \frac{e'e}{n}, \qquad \text{(ML)}$$
where $e = y - X\hat\beta$, whereas the unbiased estimator of $\sigma^2$ is
$$\hat\sigma^2 = \frac{e'e}{n - p}.$$
The distribution of the unbiased estimator is
$$\frac{(n-p)\hat\sigma^2}{\sigma^2} \sim \chi^2_{(n-p)}.$$
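Continuing in R, a sketch comparing these closed-form estimates with what lm() reports; the simulated $X$ and $y$ are made up as before:

set.seed(42)
n <- 100; p <- 3
X <- cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))
y <- drop(X %*% c(2, -1, 0.5) + rnorm(n, sd = 1.5))

beta.hat <- drop(solve(crossprod(X), crossprod(X, y)))   # (X'X)^{-1} X'y
e <- y - drop(X %*% beta.hat)                            # residuals

sigma2.ml       <- sum(e^2) / n          # MLE: e'e / n
sigma2.unbiased <- sum(e^2) / (n - p)    # unbiased: e'e / (n - p)

fit <- lm(y ~ X - 1)
c(sigma2.unbiased, sigma(fit)^2)         # lm() reports the unbiased version
all.equal(beta.hat, unname(coef(fit)))   # same beta.hat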
One way of finding the MLE of $\beta$ is to take derivatives and set them to zero. Another is to complete the square as in Section 1.3.
Proposition 1. The loglikelihood function of the multiple regression model (2.2) can be written as
$$\ell(\beta, \sigma^2 \mid y, X) = -\frac{1}{2}\frac{(\beta - \hat\beta)'X'X(\beta - \hat\beta) + e'e}{\sigma^2} - \frac{n}{2}\log(\sigma^2). \qquad (2.3)$$
Proof. Expanding and completing the square in $\beta$ gives
$$\begin{aligned} (y - X\beta)'(y - X\beta) &= \beta'X'X\beta - 2\beta'(X'X)\underbrace{(X'X)^{-1}X'y}_{\hat\beta} + y'y \\ &= (\beta - \hat\beta)'X'X(\beta - \hat\beta) + y'y - \hat\beta'X'X\hat\beta. \end{aligned}$$
Moreover, $y'X\hat\beta = y'X(X'X)^{-1}X'y = \hat\beta'X'X\hat\beta$, so that
$$e'e = (y - X\hat\beta)'(y - X\hat\beta) = y'y - 2y'X\hat\beta + \hat\beta'X'X\hat\beta = y'y - 2\hat\beta'X'X\hat\beta + \hat\beta'X'X\hat\beta = y'y - \hat\beta'X'X\hat\beta.$$
Therefore, the loglikelihood function is
$$\begin{aligned} \ell(\beta, \sigma^2 \mid y, X) &= -\frac{1}{2}\frac{(y - X\beta)'(y - X\beta)}{\sigma^2} - \frac{n}{2}\log(\sigma^2) \\ &= -\frac{1}{2}\frac{(\beta - \hat\beta)'X'X(\beta - \hat\beta) + e'e}{\sigma^2} - \frac{n}{2}\log(\sigma^2), \end{aligned}$$
which yields (2.3).
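A numerical check of Proposition 1 in R, evaluating both forms of the loglikelihood at an arbitrary (made-up) $(\beta, \sigma^2)$ using the simulated $X$ and $y$ from before:

set.seed(42)
n <- 100; p <- 3
X <- cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))
y <- drop(X %*% c(2, -1, 0.5) + rnorm(n, sd = 1.5))

beta.hat <- drop(solve(crossprod(X), crossprod(X, y)))
e <- y - drop(X %*% beta.hat)

# Evaluate at an arbitrary (beta, sigma^2), not necessarily the MLE
beta <- c(1, 0, 1); sigma2 <- 2

ll.direct <- -0.5 * sum((y - X %*% beta)^2) / sigma2 - n / 2 * log(sigma2)
d <- beta - beta.hat
ll.decomp <- -0.5 * (drop(t(d) %*% crossprod(X) %*% d) + sum(e^2)) / sigma2 - n / 2 * log(sigma2)
c(ll.direct, ll.decomp)   # identical, as claimed in (2.3)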
To find the MLE of $\beta$, let's assume that $X'X$ is a variance matrix: it's clearly symmetric, and it turns out that it's positive definite as long as the columns of $X$ are not linear combinations of each other. Moreover, notice that $e'e$ does not depend on $\beta$. So for fixed $\sigma^2$, the $(\beta - \hat\beta)'X'X(\beta - \hat\beta)$ term in (2.3) is minimized at $\beta = \hat\beta$, so the loglikelihood is maximized at $\hat\beta_{\text{ML}} = \hat\beta$ for any value of $\sigma^2$.
Therefore, the loglikelihood is
$$\ell(\beta, \sigma^2 \mid y, X) = -\frac{1}{2}\frac{(\beta - \hat\beta)'X'X(\beta - \hat\beta)}{\sigma^2} - \frac{1}{2}\frac{y'y - \hat\beta'X'X\hat\beta}{\sigma^2} - \frac{n}{2}\log(\sigma^2),$$
which is maximized at $\beta = \hat\beta$ and
$$\hat\sigma^2_{\text{ML}} = \frac{y'y - \hat\beta'X'X\hat\beta}{n}.$$
In fact, it's slightly more convenient to rewrite the numerator by noting that $y'y - \hat\beta'X'X\hat\beta = (y - X\hat\beta)'(y - X\hat\beta)$, so that $\hat\sigma^2_{\text{ML}} = \frac{1}{n}\sum_{i=1}^n e_i^2$, where $e_i = y_i - x_i'\hat\beta$. It turns out that the unbiased estimate loses a few degrees of freedom in the denominator. To show this, let's proceed in a few steps.

Proposition 2. Consider the residual vector $e = (e_1, \ldots, e_n) = y - X\hat\beta$, that is, $e_i = y_i - x_i'\hat\beta$. Then $e$ and $\hat\beta$ are independent.
Proof. We have $\hat\beta = (X'X)^{-1}X'y$ and
$$e = y - X\hat\beta = y - X(X'X)^{-1}X'y = (I_n - \underbrace{X(X'X)^{-1}X'}_{H})y.$$
Therefore,
$$\begin{aligned} \text{var}\begin{bmatrix} \hat\beta \\ e \end{bmatrix} &= \begin{bmatrix} (X'X)^{-1}X' \\ I - H \end{bmatrix} \text{var}(y) \begin{bmatrix} (X'X)^{-1}X' \\ I - H \end{bmatrix}' = \sigma^2 \begin{bmatrix} (X'X)^{-1}X'X(X'X)^{-1} & (X'X)^{-1}X'(I - H)' \\ (I - H)X(X'X)^{-1} & (I - H)(I - H)' \end{bmatrix} \\ &= \begin{bmatrix} \text{var}(\hat\beta) & \text{cov}(\hat\beta, e) \\ \text{cov}(e, \hat\beta) & \text{var}(e) \end{bmatrix}. \qquad (2.4) \end{aligned}$$
Since $X'(I - H) = X' - X'X(X'X)^{-1}X' = 0$, the off-diagonal blocks of (2.4) are zero. Because $\hat\beta$ and $e$ are jointly normal (both are linear combinations of $y$), zero covariance implies that they are independent.
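A quick numerical illustration in R that the off-diagonal block of (2.4) is zero, using a made-up design matrix:

set.seed(42)
n <- 20; p <- 3
X <- cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))

H <- X %*% solve(crossprod(X)) %*% t(X)                    # hat matrix
max(abs(solve(crossprod(X)) %*% t(X) %*% (diag(n) - H)))   # ~ 0: cov(beta.hat, e) = 0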
Proposition 3. The distribution of the sum-of-squared residuals $e'e = \sum_{i=1}^n (y_i - x_i'\hat\beta)^2$ is chi-squared:
$$\frac{1}{\sigma^2}\, e'e \sim \chi^2_{(n-p)}. \qquad (2.5)$$
Since $(I - H)^2 = I - H$, it is also an idempotent matrix. Let's now consider the eigendecomposition
$$I - H = \Gamma D \Gamma', \qquad \text{where } \Gamma\Gamma' = I, \quad D = \text{diag}(\lambda_1, \ldots, \lambda_n).$$
Then we have
$$(I - H)^2 = (\Gamma D \Gamma')(\Gamma D \Gamma') = \Gamma D^2 \Gamma' = I - H = \Gamma D \Gamma'.$$
That is, $D^2 = D$, such that $\lambda_i^2 = \lambda_i$. In other words, the diagonal elements of $D$ are either $\lambda_i = 1$ or $\lambda_i = 0$. At first it would seem that each of the $\lambda_i = 1$, but in fact this is not the case. To see this, let's consider a very basic regression model with $p = 1$:
$$y_i \overset{iid}{\sim} N(\beta_0, \sigma^2).$$
In this case, the MLE of $\beta_0$ is $\hat\beta_0 = \bar{y}$, and $e_i = y_i - \bar{y}$. However, $\sum_{i=1}^n e_i = \sum_{i=1}^n y_i - n\bar{y} = 0$. So if $e_1, \ldots, e_{n-1}$ are given, then we must have $e_n = -\sum_{i=1}^{n-1} e_i$. In other words, the residuals for this model have only $n - 1$ degrees of freedom. It turns out that for the more general regression model we lose a few more.
P
To see how many of the eigen values i = 0, recall that tr{I H} = ni=1 i , such that
Pn
1
i=1 i = tr{In } tr{H} = n tr{X(X X) X }
= n tr{(X X)1 X X}
= n tr{Id } = n d.
$$[\Gamma D Z]'[\Gamma D Z] = Z'D\Gamma'\Gamma D Z = Z'D^2 Z = Z'DZ = \sum_{i=1}^n \lambda_i Z_i^2,$$
and in fact $\sum_{i=1}^n \lambda_i Z_i^2$ is the sum of the squares of exactly $n - p$ iid standard normals, because $n - p$ of the $\lambda_i = 1$ and the rest equal zero. Therefore,
$$(\Gamma D Z)'(\Gamma D Z) \sim \chi^2_{(n-p)} \implies \frac{1}{\sigma^2}\, e'e \sim \chi^2_{(n-p)},$$
which is (2.5).
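A numerical illustration in R that the eigenvalues of $I - H$ are indeed zeros and ones, with exactly $n - p$ of them equal to one (made-up design matrix):

set.seed(42)
n <- 20; p <- 3
X <- cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))

H   <- X %*% solve(crossprod(X)) %*% t(X)
lam <- eigen(diag(n) - H, symmetric = TRUE)$values

round(lam, 10)        # n - p ones followed by p zeros
sum(lam)              # equals n - p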
Severity: the severity of the patient's condition (higher values are more severe).
Stress: the patient's self-reported degree of stress (higher values indicate more stress).
The model is
$$\text{Satisfaction}_i = \beta_0 + \beta_1 \text{Age}_i + \beta_2 \text{Severity}_i + \beta_3 \text{Stress}_i + \varepsilon_i, \qquad \varepsilon_i \overset{iid}{\sim} N(0, \sigma^2). \qquad (3.1)$$
Model (3.1) has three covariates (Age, Severity, Stress), but the design matrix $X$ has four columns,
$$X = \begin{bmatrix} 1 & \text{Age}_1 & \text{Severity}_1 & \text{Stress}_1 \\ \vdots & \vdots & \vdots & \vdots \\ 1 & \text{Age}_n & \text{Severity}_n & \text{Stress}_n \end{bmatrix},$$
such that $p = 4$. The R code for fitting this model and summarizing the results is given below.
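A minimal sketch of that code (the data frame satis is assumed to be already loaded, and the object name fit is arbitrary):

fit <- lm(Satisfaction ~ Age + Severity + Stress, data = satis)
summary(fit)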
The output of the last command is:

Call:
lm(formula = Satisfaction ~ Age + Severity + Stress, data = satis)

Residuals:
     Min       1Q   Median       3Q      Max
-18.3524  -6.4230   0.5196   8.3715  17.1601

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 158.4913    18.1259   8.744 5.26e-11 ***
Age          -1.1416     0.2148  -5.315 3.81e-06 ***
Severity     -0.4420     0.4920  -0.898   0.3741
Stress      -13.4702     7.0997  -1.897   0.0647 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
t value: The values of the test statistics for each null hypothesis $H_0: \beta_j = 0$. These are calculated as
$$T_j = \frac{\hat\beta_j}{\text{se}(\hat\beta_j)}.$$
Pr(>|t|): The p-value for each of these null hypotheses. To obtain these, recall that under $H_0: \beta_j = 0$,
$$Z = \frac{\hat\beta_j}{\sigma\sqrt{[(X'X)^{-1}]_{jj}}} \sim N(0, 1), \qquad W = \frac{(n-p)\hat\sigma^2}{\sigma^2} \sim \chi^2_{(n-p)},$$
where
$$\hat\sigma^2 = \frac{\sum_{i=1}^n (y_i - x_i'\hat\beta)^2}{n - p} = \frac{e'e}{n - p}.$$
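A sketch of how the t value and Pr(>|t|) columns can be computed by hand from $\hat\beta$, its standard errors, and the unbiased $\hat\sigma^2$; simulated data are used here since the satis data set is not reproduced, and the p-values use the t distribution with $n - p$ degrees of freedom, which is what summary.lm() reports:

set.seed(42)
n <- 100; p <- 3
X <- cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))
y <- drop(X %*% c(2, -1, 0.5) + rnorm(n, sd = 1.5))

beta.hat <- drop(solve(crossprod(X), crossprod(X, y)))
e        <- y - drop(X %*% beta.hat)
sigma2   <- sum(e^2) / (n - p)                        # unbiased estimate
se       <- sqrt(sigma2 * diag(solve(crossprod(X))))  # se(beta.hat_j)

tval <- beta.hat / se
pval <- 2 * pt(-abs(tval), df = n - p)

cbind(Estimate = beta.hat, `Std. Error` = se, `t value` = tval, `Pr(>|t|)` = pval)
coef(summary(lm(y ~ X - 1)))   # matches the table above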