
ColumbiaX: Machine Learning

Lecture 3

Prof. John Paisley

Department of Electrical Engineering & Data Science Institute
Columbia University
REGRESSION: PROBLEM DEFINITION

Data
Measured pairs (x, y), where x ∈ R^{d+1} (input) and y ∈ R (output).

Goal
Find a function f : R^{d+1} → R such that y ≈ f(x; w) for the data pair (x, y).
f(x; w) is the regression function and the vector w contains its parameters.

Definition of linear regression


A regression method is called linear if the prediction f is a linear function of
the unknown parameters w.
LEAST SQUARES (CONTINUED)
LEAST SQUARES LINEAR REGRESSION

Least squares solution


Least squares finds the w that minimizes the sum of squared errors. The least
squares objective in its most basic form, where f(x; w) = x^T w, is

L = Σ_{i=1}^n (y_i − x_i^T w)^2 = ‖y − Xw‖^2 = (y − Xw)^T (y − Xw).

We defined y = [y_1, . . . , y_n]^T and X = [x_1, . . . , x_n]^T.

Taking the gradient with respect to w and setting it to zero, we find that

∇_w L = 2X^T Xw − 2X^T y = 0   ⇒   w_LS = (X^T X)^{-1} X^T y.

In other words, w_LS is the vector that minimizes L.
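As a quick illustration, here is a minimal NumPy sketch of this closed-form solution (the data, dimensions, and true weights are made up for this example):

import numpy as np

# Synthetic data (made up for this example): n inputs in R^{d+1},
# with a column of ones appended to handle the intercept.
rng = np.random.default_rng(0)
n, d = 100, 3
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d))])
w_true = np.array([1.0, -2.0, 0.5, 3.0])
y = X @ w_true + rng.normal(scale=0.1, size=n)

# w_LS = (X^T X)^{-1} X^T y; np.linalg.solve avoids forming the inverse explicitly.
w_ls = np.linalg.solve(X.T @ X, X.T @ y)
print(w_ls)   # close to w_true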


PROBABILISTIC VIEW

▶ Last class, we discussed the geometric interpretation of least squares.

▶ Least squares also has an insightful probabilistic interpretation that
  allows us to analyze its properties.

▶ That is, given that we pick this model as reasonable for our problem,
  we can ask: What kinds of assumptions are we making?
PROBABILISTIC VIEW

Recall: Gaussian density in n dimensions

Assume a diagonal covariance matrix Σ = σ^2 I. The density is

p(y | μ, σ^2) = (2πσ^2)^{-n/2} exp( −(1/(2σ^2)) (y − μ)^T (y − μ) ).

What if we restrict the mean to μ = Xw and find the maximum likelihood
solution for w?
PROBABILISTIC VIEW

Maximum likelihood for Gaussian linear regression

Plug μ = Xw into the multivariate Gaussian distribution and solve for w
using maximum likelihood:

w_ML = arg max_w  ln p(y | μ = Xw, σ^2)
     = arg max_w  −(1/(2σ^2)) ‖y − Xw‖^2 − (n/2) ln(2πσ^2).

Least squares (LS) and maximum likelihood (ML) share the same solution:

LS: arg min_w ‖y − Xw‖^2   ⇔   ML: arg max_w −(1/(2σ^2)) ‖y − Xw‖^2
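As a quick numerical illustration of this equivalence, here is a sketch (not from the original slides) that maximizes the Gaussian log-likelihood with a generic optimizer and compares the result to the closed-form least squares solution; the data and σ² are synthetic:

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
n, d = 200, 4
X = rng.normal(size=(n, d))
sigma2 = 0.25
y = X @ rng.normal(size=d) + rng.normal(scale=np.sqrt(sigma2), size=n)

# Negative Gaussian log-likelihood as a function of w (sigma^2 held fixed).
def neg_log_lik(w):
    resid = y - X @ w
    return resid @ resid / (2 * sigma2) + (n / 2) * np.log(2 * np.pi * sigma2)

w_ml = minimize(neg_log_lik, x0=np.zeros(d)).x   # numerical maximum likelihood
w_ls = np.linalg.solve(X.T @ X, X.T @ y)         # closed-form least squares
print(np.allclose(w_ml, w_ls, atol=1e-4))        # the two solutions coincide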
PROBABILISTIC VIEW

▶ Therefore, in a sense we are making an independent Gaussian noise
  assumption about the error, ε_i = y_i − x_i^T w.

▶ Other ways of saying this:

  1) y_i = x_i^T w + ε_i, with ε_i ∼ N(0, σ^2) i.i.d., for i = 1, . . . , n,
  2) y_i ∼ N(x_i^T w, σ^2), independently for i = 1, . . . , n,
  3) y ∼ N(Xw, σ^2 I), as on the previous slides.

▶ Can we use this probabilistic line of analysis to better understand the
  maximum likelihood (i.e., least squares) solution?
PROBABILISTIC VIEW

Expected solution
Given: The modeling assumption that y ∼ N(Xw, σ^2 I).

We can calculate the expectation of the ML solution under this distribution,

E[w_ML] = E[(X^T X)^{-1} X^T y] = ∫ (X^T X)^{-1} X^T y  p(y | X, w) dy
        = (X^T X)^{-1} X^T E[y]
        = (X^T X)^{-1} X^T Xw
        = w.

Therefore w_ML is an unbiased estimate of w, i.e., E[w_ML] = w.
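The unbiasedness claim is easy to check with a short Monte Carlo simulation; the sketch below uses a made-up fixed design X, true w, and noise level, and averages w_ML over many simulated datasets:

import numpy as np

rng = np.random.default_rng(2)
n, d = 50, 3
X = rng.normal(size=(n, d))       # fixed design matrix
w = np.array([2.0, -1.0, 0.5])    # true parameters (made up)
sigma = 0.5

# Average w_ML over many datasets drawn from y ~ N(Xw, sigma^2 I).
trials = 20000
proj = np.linalg.solve(X.T @ X, X.T)   # (X^T X)^{-1} X^T, reused each trial
w_ml_sum = np.zeros(d)
for _ in range(trials):
    y = X @ w + rng.normal(scale=sigma, size=n)
    w_ml_sum += proj @ y
print(w_ml_sum / trials)   # approximately equals w, i.e. E[w_ML] = w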


REVIEW: AN EQUALITY FROM PROBABILITY

▶ Even though the “expected” maximum likelihood solution is the correct
  one, should we actually expect to get something near it?

▶ We should also look at the covariance. Recall that if y ∼ N(μ, Σ), then

  Var[y] = E[(y − E[y])(y − E[y])^T] = Σ.

▶ Plugging in E[y] = μ, this is equivalently written as

  Var[y] = E[(y − μ)(y − μ)^T]
         = E[yy^T − yμ^T − μy^T + μμ^T]
         = E[yy^T] − μμ^T.

▶ Immediately we also get E[yy^T] = Σ + μμ^T.

PROBABILISTIC VIEW

Variance of the solution

Returning to least squares linear regression, we wish to find

Var[w_ML] = E[(w_ML − E[w_ML])(w_ML − E[w_ML])^T]
          = E[w_ML w_ML^T] − E[w_ML] E[w_ML]^T.

The sequence of equalities follows:¹

Var[w_ML] = E[(X^T X)^{-1} X^T yy^T X (X^T X)^{-1}] − ww^T
          = (X^T X)^{-1} X^T E[yy^T] X (X^T X)^{-1} − ww^T
          = (X^T X)^{-1} X^T (σ^2 I + Xww^T X^T) X (X^T X)^{-1} − ww^T
          = σ^2 (X^T X)^{-1} X^T X (X^T X)^{-1} + (X^T X)^{-1} X^T X ww^T X^T X (X^T X)^{-1} − ww^T
          = σ^2 (X^T X)^{-1}.

¹ Aside: For matrices A, B and vector c, recall that (ABc)^T = c^T B^T A^T.
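A companion sketch to the earlier unbiasedness check: under the same made-up setup, the empirical covariance of w_ML over many simulated datasets should match σ^2 (X^T X)^{-1}:

import numpy as np

rng = np.random.default_rng(3)
n, d = 50, 3
X = rng.normal(size=(n, d))
w = np.array([2.0, -1.0, 0.5])
sigma = 0.5

trials = 20000
proj = np.linalg.solve(X.T @ X, X.T)   # (X^T X)^{-1} X^T
W = np.empty((trials, d))
for t in range(trials):
    y = X @ w + rng.normal(scale=sigma, size=n)
    W[t] = proj @ y
print(np.cov(W, rowvar=False))              # empirical covariance of w_ML
print(sigma**2 * np.linalg.inv(X.T @ X))    # theoretical sigma^2 (X^T X)^{-1}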


PROBABILISTIC VIEW

▶ We’ve shown that, under the Gaussian assumption y ∼ N(Xw, σ^2 I),

  E[w_ML] = w,    Var[w_ML] = σ^2 (X^T X)^{-1}.

▶ When there are very large values in σ^2 (X^T X)^{-1}, the values of w_ML are
  very sensitive to the measured data y (more analysis later).

▶ This is bad if we want to analyze and predict using w_ML.

RIDGE REGRESSION
REGULARIZED LEAST SQUARES

▶ We saw how with least squares, the values in w_ML may be huge.

▶ In general, when developing a model for data we often wish to
  constrain the model parameters in some way.

▶ There are many models of the form

  w_OPT = arg min_w ‖y − Xw‖^2 + λ g(w).

▶ The added terms are

  1. λ > 0 : a regularization parameter,
  2. g(w) > 0 : a penalty function that encourages desired properties about w.
RIDGE REGRESSION

Ridge regression is one choice of g(w) that addresses the variance issues with w_ML.

It uses a squared penalty on the regression coefficient vector w,

w_RR = arg min_w ‖y − Xw‖^2 + λ‖w‖^2.

The term g(w) = ‖w‖^2 penalizes large values in w.

However, there is a tradeoff between the first and second terms that is
controlled by λ.
▶ Case λ → 0 : w_RR → w_LS
▶ Case λ → ∞ : w_RR → 0 (the zero vector)
RIDGE REGRESSION SOLUTION

Objective: We can solve the ridge regression problem using exactly the
same procedure as for least squares,

L = ‖y − Xw‖^2 + λ‖w‖^2
  = (y − Xw)^T (y − Xw) + λ w^T w.

Solution: First, take the gradient of L with respect to w and set it to zero,

∇_w L = −2X^T y + 2X^T Xw + 2λw = 0.

Then solve for w to find that

w_RR = (λI + X^T X)^{-1} X^T y.
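A minimal NumPy sketch of this closed-form solution; the data, dimensions, and λ below are placeholders for illustration:

import numpy as np

def ridge_regression(X, y, lam):
    """Closed-form ridge solution w_RR = (lam*I + X^T X)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(lam * np.eye(d) + X.T @ X, X.T @ y)

# Example with arbitrary synthetic data.
rng = np.random.default_rng(4)
X = rng.normal(size=(100, 5))
y = X @ rng.normal(size=5) + rng.normal(scale=0.1, size=100)
print(ridge_regression(X, y, lam=1.0))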


RIDGE REGRESSION GEOMETRY

There is a tradeoff between the squared error and the penalty on w.

We can write both in terms of level sets: curves where the function
evaluation gives the same number. The sum of these gives a new set
of levels with a unique minimum.

[Figure: level sets of (w − w_LS)^T (X^T X)(w − w_LS) and of λw^T w in the
(w_1, w_2) plane, with w_LS marked.]

You can check that we can write:

‖y − Xw‖^2 + λ‖w‖^2 = (w − w_LS)^T (X^T X)(w − w_LS) + λ w^T w + (const. w.r.t. w).
DATA PREPROCESSING

Ridge regression is one possible regularization scheme. For this problem, we
first assume the following preprocessing steps are done (a short code sketch
of these steps follows the list):

1. The mean is subtracted off of y:

   y ← y − (1/n) Σ_{i=1}^n y_i.

2. The dimensions of x_i have been standardized before constructing X:

   x_ij ← (x_ij − x̄_·j) / σ̂_j,    σ̂_j = sqrt( (1/n) Σ_{i=1}^n (x_ij − x̄_·j)^2 ),

   i.e., subtract the empirical mean and divide by the empirical standard
   deviation for each dimension.

3. We can show that there is no need for the dimension of 1’s in this case.
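A short NumPy sketch of these preprocessing steps, assuming X holds the raw inputs row-wise and y the raw outputs (both synthetic here):

import numpy as np

def preprocess(X, y):
    """Center y and standardize each column of X, as assumed above."""
    y_centered = y - y.mean()
    X_standardized = (X - X.mean(axis=0)) / X.std(axis=0)   # 1/n std convention
    return X_standardized, y_centered

# Example with arbitrary synthetic data.
rng = np.random.default_rng(5)
X = rng.normal(loc=10.0, scale=3.0, size=(50, 4))
y = rng.normal(size=50)
Xs, yc = preprocess(X, y)
print(Xs.mean(axis=0).round(8), Xs.std(axis=0).round(8), yc.mean().round(8))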
SOME ANALYSIS OF RIDGE REGRESSION
RIDGE REGRESSION VS LEAST SQUARES

The solutions to least squares and ridge regression are clearly very similar,

w_LS = (X^T X)^{-1} X^T y   ⇔   w_RR = (λI + X^T X)^{-1} X^T y.

▶ We can use linear algebra and probability to compare the two.

▶ This requires the singular value decomposition, which we review next.


REVIEW: SINGULAR VALUE DECOMPOSITIONS

▶ We can write any n × d matrix X (assume n > d) as X = USV^T, where

  1. U: n × d and orthonormal in the columns, i.e. U^T U = I.
  2. S: d × d non-negative diagonal matrix, i.e. S_ii ≥ 0 and S_ij = 0 for i ≠ j.
  3. V: d × d and orthonormal, i.e. V^T V = VV^T = I.

▶ From this we have the immediate equalities

  X^T X = (USV^T)^T (USV^T) = VS^2 V^T,    XX^T = US^2 U^T.

▶ Assuming S_ii ≠ 0 for all i (i.e., “X is full rank”), we also have that

  (X^T X)^{-1} = (VS^2 V^T)^{-1} = VS^{-2} V^T.

  Proof: Plug in and see that it satisfies the definition of the inverse,

  (X^T X)(X^T X)^{-1} = VS^2 V^T VS^{-2} V^T = I.
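These identities are easy to verify numerically; a small sketch using NumPy's thin SVD on a random matrix (chosen arbitrarily for illustration):

import numpy as np

rng = np.random.default_rng(6)
n, d = 20, 5
X = rng.normal(size=(n, d))

# Thin SVD: U is n x d, s holds the diagonal of S, Vt is V^T (d x d).
U, s, Vt = np.linalg.svd(X, full_matrices=False)
S = np.diag(s)

print(np.allclose(X, U @ S @ Vt))                      # X = U S V^T
print(np.allclose(X.T @ X, Vt.T @ S**2 @ Vt))          # X^T X = V S^2 V^T
print(np.allclose(np.linalg.inv(X.T @ X),
                  Vt.T @ np.diag(s**-2) @ Vt))         # (X^T X)^{-1} = V S^{-2} V^T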


LEAST SQUARES AND THE SVD

Using the SVD we can rewrite the variance,

Var[w_LS] = σ^2 (X^T X)^{-1} = σ^2 VS^{-2} V^T.

This inverse becomes huge when S_ii is very small for some values of i.
(Aside: This happens when columns of X are highly correlated.)

The least squares prediction for new data is

y_new = x_new^T w_LS = x_new^T (X^T X)^{-1} X^T y = x_new^T VS^{-1} U^T y.

When S^{-1} has very large values, this can lead to unstable predictions.
RIDGE REGRESSION VS LEAST SQUARES I

Relationship to least squares solution

Recall that for two invertible matrices, (AB)^{-1} = B^{-1} A^{-1}. Then

w_RR = (λI + X^T X)^{-1} X^T y
     = (λI + X^T X)^{-1} (X^T X) (X^T X)^{-1} X^T y        (the last factor is w_LS)
     = [(X^T X)(λ(X^T X)^{-1} + I)]^{-1} (X^T X) w_LS
     = (λ(X^T X)^{-1} + I)^{-1} (X^T X)^{-1} (X^T X) w_LS
     = (λ(X^T X)^{-1} + I)^{-1} w_LS.

We can use this to prove that the solution shrinks toward zero: ‖w_RR‖_2 ≤ ‖w_LS‖_2.
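A quick numerical check of this shrinkage property, with synthetic data and a few arbitrary λ values:

import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(100, 5))
y = X @ rng.normal(size=5) + rng.normal(size=100)

w_ls = np.linalg.solve(X.T @ X, X.T @ y)
for lam in [0.1, 1.0, 10.0, 100.0]:
    w_rr = np.linalg.solve(lam * np.eye(5) + X.T @ X, X.T @ y)
    # The ridge solution never has larger Euclidean norm than the LS solution.
    print(lam, np.linalg.norm(w_rr) <= np.linalg.norm(w_ls))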
RIDGE REGRESSION VS LEAST SQUARES II

Continue the analysis with the SVD, X = USV^T → (X^T X)^{-1} = VS^{-2} V^T:

w_RR = (λ(X^T X)^{-1} + I)^{-1} w_LS
     = (λVS^{-2} V^T + I)^{-1} w_LS
     = V(λS^{-2} + I)^{-1} V^T w_LS
     := VMV^T w_LS.

M is a diagonal matrix with M_ii = S_ii^2 / (λ + S_ii^2). We can pursue this to show that

w_RR = V S_λ^{-1} U^T y,   where S_λ^{-1} is the diagonal matrix with (S_λ^{-1})_ii = S_ii / (λ + S_ii^2).

Compare with w_LS = VS^{-1} U^T y, which is the case where λ = 0 above.
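One way to sanity-check this expression is to compare the SVD form against the direct closed form; a sketch with arbitrary synthetic data:

import numpy as np

rng = np.random.default_rng(8)
X = rng.normal(size=(100, 5))
y = rng.normal(size=100)
lam = 2.0

# Direct closed form: (lam*I + X^T X)^{-1} X^T y.
w_rr_direct = np.linalg.solve(lam * np.eye(5) + X.T @ X, X.T @ y)

# SVD form: w_RR = V diag(S_ii / (lam + S_ii^2)) U^T y.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
w_rr_svd = Vt.T @ ((s / (lam + s**2)) * (U.T @ y))
print(np.allclose(w_rr_direct, w_rr_svd))   # the two expressions agree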


RIDGE REGRESSION VS LEAST SQUARES III

Ridge regression can also be seen as a special case of least squares.

Define ŷ ≈ X̂w in the following way,

ŷ = [ y ]  (length n + d),    X̂ = [   X    ]  ((n + d) × d),
    [ 0 ]                          [ √λ I_d ]

i.e., stack d zeros beneath y and stack √λ times the d × d identity matrix
beneath X.

If we solve for w_LS in this regression problem, we find w_RR of the original
problem: Calculating (ŷ − X̂w)^T (ŷ − X̂w) in two parts gives

(ŷ − X̂w)^T (ŷ − X̂w) = (y − Xw)^T (y − Xw) + (√λ w)^T (√λ w)
                     = ‖y − Xw‖^2 + λ‖w‖^2.
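A brief numerical illustration of this augmented least squares view, using made-up data:

import numpy as np

rng = np.random.default_rng(9)
n, d = 80, 4
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
lam = 3.0

# Augmented problem: stack sqrt(lam)*I beneath X and d zeros beneath y.
X_hat = np.vstack([X, np.sqrt(lam) * np.eye(d)])
y_hat = np.concatenate([y, np.zeros(d)])

w_aug = np.linalg.solve(X_hat.T @ X_hat, X_hat.T @ y_hat)       # LS on (X_hat, y_hat)
w_rr = np.linalg.solve(lam * np.eye(d) + X.T @ X, X.T @ y)      # ridge on (X, y)
print(np.allclose(w_aug, w_rr))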
SELECTING λ

Degrees of freedom:

df(λ) = trace( X(X^T X + λI)^{-1} X^T ) = Σ_{i=1}^d S_ii^2 / (λ + S_ii^2).

This gives a way of visualizing relationships.

We will discuss methods for picking λ later.
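As an aside, a small sketch computing df(λ) both ways (the trace form and the SVD form) for a grid of λ values, on arbitrary synthetic data:

import numpy as np

rng = np.random.default_rng(10)
X = rng.normal(size=(60, 5))
_, s, _ = np.linalg.svd(X, full_matrices=False)

for lam in [0.0, 1.0, 10.0, 100.0]:
    H = X @ np.linalg.inv(X.T @ X + lam * np.eye(5)) @ X.T
    df_trace = np.trace(H)                 # trace(X (X^T X + lam I)^{-1} X^T)
    df_svd = np.sum(s**2 / (lam + s**2))   # sum_i S_ii^2 / (lam + S_ii^2)
    print(f"lambda={lam:6.1f}  df(trace)={df_trace:.4f}  df(SVD)={df_svd:.4f}")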
