$$\text{var}(X) = \begin{bmatrix} \text{var}(X_1) & \text{cov}(X_1, X_2) & \cdots & \text{cov}(X_1, X_n) \\ \text{cov}(X_2, X_1) & \text{var}(X_2) & \cdots & \text{cov}(X_2, X_n) \\ \vdots & \vdots & \ddots & \vdots \\ \text{cov}(X_n, X_1) & \text{cov}(X_n, X_2) & \cdots & \text{var}(X_n) \end{bmatrix}.$$
It turns out that the mean and variance of a linear combination of the $X_i$, $y = \sum_{i=1}^n a_i X_i$, can be obtained from just the elements of $E[X]$ and $\text{var}(X)$:
$$E[y] = \sum_{i=1}^n a_i E[X_i], \qquad \text{var}(y) = \sum_{i=1}^n a_i^2\,\text{var}(X_i) + \sum_{i \neq j} a_i a_j\,\text{cov}(X_i, X_j).$$
In fact, it will be extremely convenient for later to represent these quantities using matrix algebra. That is, consider $a = (a_1, \ldots, a_n)$, $X$, and $\mu = E[X]$ as $(n \times 1)$ column vectors, and let $V = \text{var}(X)$ denote the $n \times n$ variance matrix. Then $y = a'X$, and
$$E[y] = a'\mu, \qquad \text{var}(y) = a'Va.$$
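As a quick sanity check, here is a small R sketch (the mean vector, variance matrix, and weights are made up for illustration) that computes $E[y]$ and $\text{var}(y)$ both elementwise and via the matrix formulas $a'\mu$ and $a'Va$:

# Hypothetical example: n = 3, made-up mean vector, variance matrix, and weights
mu <- c(1, 2, 3)
V  <- matrix(c(4, 1, 0,
               1, 3, 1,
               0, 1, 2), nrow = 3, byrow = TRUE)
a  <- c(0.5, -1, 2)

# Elementwise formulas
Ey_elem <- sum(a * mu)
off     <- row(V) != col(V)                      # indicator of i != j
vy_elem <- sum(a^2 * diag(V)) + sum((outer(a, a) * V)[off])

# Matrix formulas: E[y] = a'mu, var(y) = a'Va
Ey_mat <- drop(t(a) %*% mu)
vy_mat <- drop(t(a) %*% V %*% a)

c(Ey_elem, Ey_mat)   # identical
c(vy_elem, vy_mat)   # identical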
Now consider a random vector of $k$ linear combinations of $X$:
$$Y_1 = a_{11}X_1 + a_{12}X_2 + \cdots + a_{1n}X_n, \quad \ldots, \quad Y_k = a_{k1}X_1 + a_{k2}X_2 + \cdots + a_{kn}X_n.$$
Elementwise, $E[Y_l] = \sum_{j=1}^n a_{lj}E[X_j]$ and $\text{cov}(Y_l, Y_m) = \sum_{i=1}^n \sum_{j=1}^n a_{li}a_{mj}\,\text{cov}(X_i, X_j)$. Writing $Y = AX$, where $A$ is the $k \times n$ matrix with $(l, j)$ element $a_{lj}$, these become
$$E[Y] = A\mu, \qquad \text{var}(Y) = AVA'. \qquad (1.1)$$
Remark. In the beginning, it's confusing to remember where the transpose in the variance formula comes in: the first or the second term. The trick is to make sure that the dimensions of the matrix multiplication agree. Since $V = \text{var}(X)$ is $(n \times n)$, it must be premultiplied by a matrix with $n$ columns, which is $A$. For the single linear combination case, we had $\text{var}(y) = a'Va$ with the transpose in the first term, but this is because we considered $a$ as an $(n \times 1)$ column vector. The matrix multiplication $aVa'$ is forbidden since the dimensions don't agree.
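The same dimension check can be carried out in R; the sketch below uses a made-up $2 \times 3$ coefficient matrix $A$ together with the $V$ from the previous snippet:

V <- matrix(c(4, 1, 0,
              1, 3, 1,
              0, 1, 2), nrow = 3, byrow = TRUE)   # n x n variance matrix (n = 3)
A <- matrix(c(1, 0, -1,
              2, 1,  0), nrow = 2, byrow = TRUE)  # k x n coefficient matrix (k = 2)

varY <- A %*% V %*% t(A)   # (k x n)(n x n)(n x k) = k x k: dimensions agree
varY

# The other ordering fails because the dimensions do not conform:
# t(A) %*% V would be (3 x 2)(3 x 3) -> error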
1.2. Two Matrix Decompositions. The variance matrix $V$ of a random vector $X$ has two important properties:
1. $V$ must be a symmetric matrix, since $\text{cov}(X_i, X_j) = \text{cov}(X_j, X_i)$.
2. For any vector $a \in \mathbb{R}^d$ with $a \neq 0$ (i.e., at least one of the elements of $a$ is nonzero), we must have $a'Va > 0$. This is because the linear combination $y = a'X$ must have $\text{var}(y) = a'Va > 0$. This property of variance matrices is called positive definiteness.
Symmetric positive definite matrices have two very important decompositions which will come in handy later (a short R illustration follows the list):
1. Cholesky Decomposition: Any variance matrix $V$ can be uniquely written as
$$V = LL', \qquad (1.2)$$
where $L$ is a lower triangular matrix with positive entries $L_{ii} > 0$ on the diagonal.
2. Eigendecomposition: Any variance matrix $V$ can be written as
$$V = \Gamma \Lambda \Gamma', \qquad (1.3)$$
where $\Gamma$ is an orthogonal matrix ($\Gamma\Gamma' = I$) and $\Lambda = \text{diag}(\lambda_1, \ldots, \lambda_d)$ is a diagonal matrix of eigenvalues $\lambda_i > 0$. The two decompositions are linked through the trace of $V$:
$$\sum_{i=1}^d V_{ii} = \sum_{i=1}^d \sum_{j=1}^d L_{ij}^2 = \sum_{i=1}^d \lambda_i.$$
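A small R illustration of both decompositions, using the made-up $V$ from before; note that R's chol() returns the upper-triangular factor, so the lower-triangular $L$ is its transpose:

V <- matrix(c(4, 1, 0,
              1, 3, 1,
              0, 1, 2), nrow = 3, byrow = TRUE)

# Cholesky: V = LL', with L lower triangular and positive diagonal
L <- t(chol(V))              # chol() returns the upper-triangular factor
all.equal(L %*% t(L), V)     # TRUE

# Eigendecomposition: V = Gamma Lambda Gamma'
ed     <- eigen(V, symmetric = TRUE)
Gamma  <- ed$vectors
lambda <- ed$values          # all positive for a positive definite V
all.equal(Gamma %*% diag(lambda) %*% t(Gamma), V)  # TRUE

# Trace identity: sum of diagonal elements equals sum of eigenvalues
c(sum(diag(V)), sum(lambda))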
1.3. Completing the Square. Consider the quadratic function
$$Q(x) = (x - \mu)'A(x - \mu) = \sum_{i=1}^n \sum_{j=1}^n A_{ij}(x_i - \mu_i)(x_j - \mu_j).$$
Since $A$ is a variance matrix, we know that $b'Ab > 0$ for any vector $b \neq 0$. Therefore, $Q(x)$ is minimized at $x = \mu$. Moreover, we can write
$$(x - \mu)'A(x - \mu) = x'Ax - \mu'Ax - x'A\mu + \mu'A\mu = x'Ax - 2x'A\mu + \mu'A\mu.$$
Conversely, suppose we are given a quadratic of the form $x'Ax + 2x'b + c$. Completing the square gives
$$\begin{aligned} x'Ax + 2x'b + c &= x'Ax + 2x'AA^{-1}b + c \\ &= (x + A^{-1}b)'A(x + A^{-1}b) + c - b'A^{-1}AA^{-1}b \qquad (1.4) \\ &= Q(x) + c - \mu'A\mu, \end{aligned}$$
where now $\mu = -A^{-1}b$, so that the quadratic is minimized at $x = \mu$ with minimum value $c - b'A^{-1}b$.
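A quick numerical check of this identity in R, with a made-up positive definite $A$, vector $b$, and constant $c$; the raw quadratic and the completed-square form agree at an arbitrary point, and the minimum occurs at $x = -A^{-1}b$:

set.seed(1)
A  <- crossprod(matrix(rnorm(9), 3, 3)) + diag(3)  # made-up positive definite matrix
b  <- rnorm(3)
c0 <- 1.7

Q1 <- function(x) drop(t(x) %*% A %*% x + 2 * t(x) %*% b + c0)   # x'Ax + 2x'b + c
xhat <- drop(-solve(A, b))                                       # minimizer x = -A^{-1} b
Q2 <- function(x) drop(t(x - xhat) %*% A %*% (x - xhat)) + c0 - drop(t(b) %*% solve(A, b))

x <- rnorm(3)
c(Q1(x), Q2(x))                                 # identical
c(Q1(xhat), c0 - drop(t(b) %*% solve(A, b)))    # minimum value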
We'll allow for the case where $\lambda_i \geq 0$ a bit later in this chapter, but let's keep things simple for now.
Assuming that $V_{11}$ and $V_{22}$ are invertible, the joint pdf of $(X_1, X_2)$ is
$$\begin{aligned} f(x_1, x_2) &= \frac{1}{2\pi\sqrt{V_{11}V_{22}}} \exp\left\{ -\frac{1}{2} \begin{bmatrix} x_1 - \mu_1 & x_2 - \mu_2 \end{bmatrix} \begin{bmatrix} V_{11}^{-1} & 0 \\ 0 & V_{22}^{-1} \end{bmatrix} \begin{bmatrix} x_1 - \mu_1 \\ x_2 - \mu_2 \end{bmatrix} \right\} \\ &= \frac{1}{2\pi\sqrt{V_{11}V_{22}}} \exp\left\{ -\frac{1}{2}\frac{(x_1 - \mu_1)^2}{V_{11}} - \frac{1}{2}\frac{(x_2 - \mu_2)^2}{V_{22}} \right\} \\ &= \frac{1}{\sqrt{2\pi V_{11}}} \exp\left\{ -\frac{(x_1 - \mu_1)^2}{2V_{11}} \right\} \cdot \frac{1}{\sqrt{2\pi V_{22}}} \exp\left\{ -\frac{(x_2 - \mu_2)^2}{2V_{22}} \right\} \\ &= f_{X_1}(x_1) \cdot f_{X_2}(x_2). \end{aligned}$$
Any combination of the elements of $X$, say $Y = (Y_1, \ldots, Y_m) = (X_{i_1}, \ldots, X_{i_m})$ for some set of indices $\mathcal{I} = (i_1, \ldots, i_m)$, $1 \leq i_j \leq d$, is also multivariate normal. The means and covariances of the elements of $Y$ are $E[Y_j] = \mu_{i_j}$ and $\text{cov}(Y_j, Y_k) = V_{i_j, i_k}$.
If $\mathcal{I}$ is a set of indices as above, and $\text{cov}(X_{i_j}, X_{i_k}) = 0$ for every $i_j, i_k \in \mathcal{I}$ such that $j \neq k$, then $X_{i_1}, \ldots, X_{i_m}$ are independent random variables.
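A numerical illustration of the factorization above; this sketch assumes the mvtnorm package (not used elsewhere in these notes) and made-up values of the means and variances:

library(mvtnorm)   # assumed available; provides dmvnorm()

mu <- c(1, -2)
V  <- diag(c(4, 9))        # V11 = 4, V22 = 9, cov(X1, X2) = 0
x  <- c(0.3, -1.5)

joint    <- dmvnorm(x, mean = mu, sigma = V)
factored <- dnorm(x[1], mu[1], sqrt(V[1, 1])) * dnorm(x[2], mu[2], sqrt(V[2, 2]))
c(joint, factored)         # identical: f(x1, x2) = f_X1(x1) * f_X2(x2)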
$$y_i = x_i'\beta + \varepsilon_i, \qquad \varepsilon_i \overset{iid}{\sim} N(0, \sigma^2). \qquad (2.1)$$
However, for what follows it is more convenient to write (2.1) as a multivariate normal. That is, let
$$y = \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix}, \qquad X = \begin{bmatrix} x_{11} & \cdots & x_{1p} \\ \vdots & & \vdots \\ x_{n1} & \cdots & x_{np} \end{bmatrix}, \qquad \beta = \begin{bmatrix} \beta_1 \\ \vdots \\ \beta_p \end{bmatrix}, \qquad \varepsilon = \begin{bmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_n \end{bmatrix}.$$
Then (2.1) can be rewritten in matrix form as $y = X\beta + \varepsilon$ with $\varepsilon \sim N(0, \sigma^2 I)$, or equivalently,
$$y \sim N(X\beta, \sigma^2 I). \qquad (2.2)$$
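A minimal R sketch simulating from model (2.2), with made-up values of $n$, $p$, $\beta$, and $\sigma$, and fitting it with lm():

set.seed(42)
n <- 100; p <- 3
X <- cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))   # design matrix with intercept column
beta  <- c(2, -1, 0.5)
sigma <- 1.5

y <- drop(X %*% beta + rnorm(n, sd = sigma))          # y ~ N(X beta, sigma^2 I)

fit <- lm(y ~ X - 1)                                  # "- 1": X already contains the intercept
coef(fit)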
The MLE of $(\beta, \sigma^2)$ is
$$\hat\beta = (X'X)^{-1}X'y, \qquad \hat\sigma^2_{\text{ML}} = \frac{1}{n}\sum_{i=1}^n (y_i - x_i'\hat\beta)^2 = \frac{e'e}{n}, \qquad \text{(ML)}$$
where $e = y - X\hat\beta$, whereas the unbiased estimator of $\sigma^2$ is
$$\hat\sigma^2 = \frac{e'e}{n - p}.$$
The distribution of the unbiased estimator is
$$\frac{(n-p)\hat\sigma^2}{\sigma^2} \sim \chi^2_{(n-p)}.$$
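Continuing in R, a sketch comparing these closed-form estimates with what lm() reports; the simulated $X$ and $y$ are made up as before:

set.seed(42)
n <- 100; p <- 3
X <- cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))
y <- drop(X %*% c(2, -1, 0.5) + rnorm(n, sd = 1.5))

beta.hat <- drop(solve(crossprod(X), crossprod(X, y)))   # (X'X)^{-1} X'y
e <- y - drop(X %*% beta.hat)                            # residuals

sigma2.ml       <- sum(e^2) / n          # MLE: e'e / n
sigma2.unbiased <- sum(e^2) / (n - p)    # unbiased: e'e / (n - p)

fit <- lm(y ~ X - 1)
c(sigma2.unbiased, sigma(fit)^2)         # lm() reports the unbiased version
all.equal(beta.hat, unname(coef(fit)))   # same beta.hat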
One way of finding the MLE of $\beta$ is to take derivatives and set them to zero. Another is to complete the square as in Section 1.3.
Proposition 1. The loglikelihood function of the multiple regression model (2.2) can be written as
$$\ell(\beta, \sigma^2 \mid y, X) = -\frac{1}{2}\frac{(\beta - \hat\beta)'X'X(\beta - \hat\beta) + e'e}{\sigma^2} - \frac{n}{2}\log(\sigma^2). \qquad (2.3)$$
Proof. Expanding and completing the square in $\beta$ gives
$$\begin{aligned} (y - X\beta)'(y - X\beta) &= \beta'X'X\beta - 2\beta'(X'X)\underbrace{(X'X)^{-1}X'y}_{\hat\beta} + y'y \\ &= (\beta - \hat\beta)'X'X(\beta - \hat\beta) + y'y - \hat\beta'X'X\hat\beta. \end{aligned}$$
Moreover, $y'X\hat\beta = y'X(X'X)^{-1}X'y = \hat\beta'X'X\hat\beta$, so that
$$e'e = (y - X\hat\beta)'(y - X\hat\beta) = y'y - 2y'X\hat\beta + \hat\beta'X'X\hat\beta = y'y - 2\hat\beta'X'X\hat\beta + \hat\beta'X'X\hat\beta = y'y - \hat\beta'X'X\hat\beta.$$
Therefore, the loglikelihood function is
$$\begin{aligned} \ell(\beta, \sigma^2 \mid y, X) &= -\frac{1}{2}\frac{(y - X\beta)'(y - X\beta)}{\sigma^2} - \frac{n}{2}\log(\sigma^2) \\ &= -\frac{1}{2}\frac{(\beta - \hat\beta)'X'X(\beta - \hat\beta) + e'e}{\sigma^2} - \frac{n}{2}\log(\sigma^2), \end{aligned}$$
which yields (2.3).
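A numerical check of Proposition 1 in R, evaluating both forms of the loglikelihood at an arbitrary (made-up) $(\beta, \sigma^2)$ using the simulated $X$ and $y$ from before:

set.seed(42)
n <- 100; p <- 3
X <- cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))
y <- drop(X %*% c(2, -1, 0.5) + rnorm(n, sd = 1.5))

beta.hat <- drop(solve(crossprod(X), crossprod(X, y)))
e <- y - drop(X %*% beta.hat)

# Evaluate at an arbitrary (beta, sigma^2), not necessarily the MLE
beta <- c(1, 0, 1); sigma2 <- 2

ll.direct <- -0.5 * sum((y - X %*% beta)^2) / sigma2 - n / 2 * log(sigma2)
d <- beta - beta.hat
ll.decomp <- -0.5 * (drop(t(d) %*% crossprod(X) %*% d) + sum(e^2)) / sigma2 - n / 2 * log(sigma2)
c(ll.direct, ll.decomp)   # identical, as claimed in (2.3)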
To find the MLE of $\beta$, let's assume that $X'X$ is a variance matrix: it's clearly symmetric, and it turns out that it's positive definite as long as the columns of $X$ are not linear combinations of each other. Moreover, notice that $e'e$ does not depend on $\beta$. So for fixed $\sigma^2$, the $(\beta - \hat\beta)'X'X(\beta - \hat\beta)$ term in (2.3) is minimized at $\beta = \hat\beta$, so the loglikelihood is maximized at $\hat\beta_{\text{ML}} = \hat\beta$ for any value of $\sigma^2$.
Therefore, the loglikelihood is
$$\ell(\beta, \sigma^2 \mid y, X) = -\frac{1}{2}\frac{(\beta - \hat\beta)'X'X(\beta - \hat\beta)}{\sigma^2} - \frac{1}{2}\frac{y'y - \hat\beta'X'X\hat\beta}{\sigma^2} - \frac{n}{2}\log(\sigma^2),$$
which is maximized at $\beta = \hat\beta$ and
$$\hat\sigma^2_{\text{ML}} = \frac{y'y - \hat\beta'X'X\hat\beta}{n}.$$
In fact, it's slightly more convenient to rewrite the numerator by noting that $y'y - \hat\beta'X'X\hat\beta = (y - X\hat\beta)'(y - X\hat\beta)$, so that $\hat\sigma^2_{\text{ML}} = \frac{1}{n}\sum_{i=1}^n e_i^2$, where $e_i = y_i - x_i'\hat\beta$. It turns out that the unbiased estimate loses a few degrees of freedom in the denominator. To show this, let's proceed in a few steps.

Proposition 2. Consider the residual vector $e = (e_1, \ldots, e_n) = y - X\hat\beta$, that is, $e_i = y_i - x_i'\hat\beta$. Then $e$ and $\hat\beta$ are independent.
Proof. We have $\hat\beta = (X'X)^{-1}X'y$ and
$$e = y - X\hat\beta = y - X(X'X)^{-1}X'y = (I_n - \underbrace{X(X'X)^{-1}X'}_{H})y.$$
Therefore,
$$\begin{aligned} \text{var}\begin{bmatrix} \hat\beta \\ e \end{bmatrix} &= \begin{bmatrix} (X'X)^{-1}X' \\ I - H \end{bmatrix} \text{var}(y) \begin{bmatrix} (X'X)^{-1}X' \\ I - H \end{bmatrix}' = \sigma^2 \begin{bmatrix} (X'X)^{-1}X'X(X'X)^{-1} & (X'X)^{-1}X'(I - H)' \\ (I - H)X(X'X)^{-1} & (I - H)(I - H)' \end{bmatrix} \\ &= \begin{bmatrix} \text{var}(\hat\beta) & \text{cov}(\hat\beta, e) \\ \text{cov}(e, \hat\beta) & \text{var}(e) \end{bmatrix}. \qquad (2.4) \end{aligned}$$
Since $X'(I - H) = X' - X'X(X'X)^{-1}X' = 0$, the off-diagonal blocks of (2.4) are zero. Because $\hat\beta$ and $e$ are jointly normal (both are linear combinations of $y$), zero covariance implies that they are independent.
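A quick numerical illustration in R that the off-diagonal block of (2.4) is zero, using a made-up design matrix:

set.seed(42)
n <- 20; p <- 3
X <- cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))

H <- X %*% solve(crossprod(X)) %*% t(X)                    # hat matrix
max(abs(solve(crossprod(X)) %*% t(X) %*% (diag(n) - H)))   # ~ 0: cov(beta.hat, e) = 0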
Proposition 3. The distribution of the sum-of-squared residuals $e'e = \sum_{i=1}^n (y_i - x_i'\hat\beta)^2$ is chi-squared:
$$\frac{1}{\sigma^2}\, e'e \sim \chi^2_{(n-p)}. \qquad (2.5)$$
Since $(I - H)^2 = I - H$, it is also an idempotent matrix. Let's now consider the eigendecomposition
$$I - H = \Gamma D \Gamma', \qquad \text{where } \Gamma\Gamma' = I, \quad D = \text{diag}(\lambda_1, \ldots, \lambda_n).$$
Then we have
$$(I - H)^2 = (\Gamma D \Gamma')(\Gamma D \Gamma') = \Gamma D^2 \Gamma' = I - H = \Gamma D \Gamma'.$$
That is, $D^2 = D$, such that $\lambda_i^2 = \lambda_i$. In other words, the diagonal elements of $D$ are either $\lambda_i = 1$ or $\lambda_i = 0$. At first it would seem that each of the $\lambda_i = 1$, but in fact this is not the case. To see this, let's consider a very basic regression model with $p = 1$:
$$y_i \overset{iid}{\sim} N(\beta_0, \sigma^2).$$
In this case, the MLE of $\beta_0$ is $\hat\beta_0 = \bar{y}$, and $e_i = y_i - \bar{y}$. However, $\sum_{i=1}^n e_i = \sum_{i=1}^n y_i - n\bar{y} = 0$. So if $e_1, \ldots, e_{n-1}$ are given, then we must have $e_n = -\sum_{i=1}^{n-1} e_i$. In other words, the residuals for this model have only $n - 1$ degrees of freedom. It turns out that for the more general regression model we lose a few more.
P
To see how many of the eigen values i = 0, recall that tr{I H} = ni=1 i , such that
Pn
1
i=1 i = tr{In } tr{H} = n tr{X(X X) X }
= n tr{(X X)1 X X}
= n tr{Id } = n d.
$$[\Gamma D Z]'[\Gamma D Z] = Z'D\Gamma'\Gamma D Z = Z'D^2 Z = Z'DZ = \sum_{i=1}^n \lambda_i Z_i^2,$$
and in fact $\sum_{i=1}^n \lambda_i Z_i^2$ is the sum of the squares of exactly $n - p$ iid standard normals, because $n - p$ of the $\lambda_i = 1$ and the rest equal zero. Therefore,
$$(\Gamma D Z)'(\Gamma D Z) \sim \chi^2_{(n-p)} \implies \frac{1}{\sigma^2}\, e'e \sim \chi^2_{(n-p)},$$
which is (2.5).
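A numerical illustration in R that the eigenvalues of $I - H$ are indeed zeros and ones, with exactly $n - p$ of them equal to one (made-up design matrix):

set.seed(42)
n <- 20; p <- 3
X <- cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))

H   <- X %*% solve(crossprod(X)) %*% t(X)
lam <- eigen(diag(n) - H, symmetric = TRUE)$values

round(lam, 10)        # n - p ones followed by p zeros
sum(lam)              # equals n - p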
Severity: the severity of the patient's condition (higher values are more severe).
Stress: the patient's self-reported degree of stress (higher values indicate more stress).
The model is
$$\text{Satisfaction}_i = \beta_0 + \beta_1 \text{Age}_i + \beta_2 \text{Severity}_i + \beta_3 \text{Stress}_i + \varepsilon_i, \qquad \varepsilon_i \overset{iid}{\sim} N(0, \sigma^2). \qquad (3.1)$$
Model (3.1) has three covariates (Age, Severity, Stress), but the design matrix $X$ has four columns,
$$X = \begin{bmatrix} 1 & \text{Age}_1 & \text{Severity}_1 & \text{Stress}_1 \\ \vdots & \vdots & \vdots & \vdots \\ 1 & \text{Age}_n & \text{Severity}_n & \text{Stress}_n \end{bmatrix},$$
such that $p = 4$. The R code for fitting this model and summarizing the results is given below.
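A minimal sketch of that code (the data frame satis is assumed to be already loaded, and the object name fit is arbitrary):

fit <- lm(Satisfaction ~ Age + Severity + Stress, data = satis)
summary(fit)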
The output of the last command is:

Call:
lm(formula = Satisfaction ~ Age + Severity + Stress, data = satis)

Residuals:
     Min       1Q   Median       3Q      Max
-18.3524  -6.4230   0.5196   8.3715  17.1601

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 158.4913    18.1259   8.744 5.26e-11 ***
Age          -1.1416     0.2148  -5.315 3.81e-06 ***
Severity     -0.4420     0.4920  -0.898   0.3741
Stress      -13.4702     7.0997  -1.897   0.0647 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
t value: The values of the test statistics for each null hypothesis $H_0: \beta_j = 0$. These are calculated as
$$T_j = \frac{\hat\beta_j}{\text{se}(\hat\beta_j)}.$$
Pr(>|t|): The p-value for each of these null hypotheses. To obtain these, recall that under $H_0: \beta_j = 0$,
$$Z = \frac{\hat\beta_j}{\sigma\sqrt{[(X'X)^{-1}]_{jj}}} \sim N(0, 1), \qquad W = \frac{(n-p)\hat\sigma^2}{\sigma^2} \sim \chi^2_{(n-p)},$$
where
$$\hat\sigma^2 = \frac{\sum_{i=1}^n (y_i - x_i'\hat\beta)^2}{n - p} = \frac{e'e}{n - p}.$$
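A sketch of how the t value and Pr(>|t|) columns can be computed by hand from $\hat\beta$, its standard errors, and the unbiased $\hat\sigma^2$; simulated data are used here since the satis data set is not reproduced, and the p-values use the t distribution with $n - p$ degrees of freedom, which is what summary.lm() reports:

set.seed(42)
n <- 100; p <- 3
X <- cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))
y <- drop(X %*% c(2, -1, 0.5) + rnorm(n, sd = 1.5))

beta.hat <- drop(solve(crossprod(X), crossprod(X, y)))
e        <- y - drop(X %*% beta.hat)
sigma2   <- sum(e^2) / (n - p)                        # unbiased estimate
se       <- sqrt(sigma2 * diag(solve(crossprod(X))))  # se(beta.hat_j)

tval <- beta.hat / se
pval <- 2 * pt(-abs(tval), df = n - p)

cbind(Estimate = beta.hat, `Std. Error` = se, `t value` = tval, `Pr(>|t|)` = pval)
coef(summary(lm(y ~ X - 1)))   # matches the table above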