
ColumbiaX: Machine Learning

Lecture 3

Prof. John Paisley

Department of Electrical Engineering & Data Science Institute
Columbia University
REGRESSION: PROBLEM DEFINITION

Data
Measured pairs (x, y), where x ∈ R^{d+1} (input) and y ∈ R (output).

Goal
Find a function f : R^{d+1} → R such that y ≈ f(x; w) for the data pair (x, y).
f(x; w) is the regression function and the vector w contains its parameters.

Definition of linear regression


A regression method is called linear if the prediction f is a linear function of
the unknown parameters w.
LEAST SQUARES (CONTINUED)
LEAST SQUARES LINEAR REGRESSION

Least squares solution


Least squares finds the w that minimizes the sum of squared errors. The least
squares objective in its most basic form, where f(x; w) = x^T w, is

L = Σ_{i=1}^n (y_i − x_i^T w)^2 = ‖y − Xw‖^2 = (y − Xw)^T (y − Xw).

We defined y = [y_1, . . . , y_n]^T and X = [x_1, . . . , x_n]^T.

Taking the gradient with respect to w and setting it to zero, we find that

∇_w L = 2X^T Xw − 2X^T y = 0   ⇒   w_LS = (X^T X)^{-1} X^T y.

In other words, w_LS is the vector that minimizes L.
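As a quick illustration, here is a minimal NumPy sketch of this closed-form solution (the data, dimensions, and true weights are made up for this example):

import numpy as np

# Synthetic data (made up for this example): n inputs in R^{d+1},
# with a column of ones appended to handle the intercept.
rng = np.random.default_rng(0)
n, d = 100, 3
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d))])
w_true = np.array([1.0, -2.0, 0.5, 3.0])
y = X @ w_true + rng.normal(scale=0.1, size=n)

# w_LS = (X^T X)^{-1} X^T y; np.linalg.solve avoids forming the inverse explicitly.
w_ls = np.linalg.solve(X.T @ X, X.T @ y)
print(w_ls)   # close to w_true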


PROBABILISTIC VIEW

▶ Last class, we discussed the geometric interpretation of least squares.

▶ Least squares also has an insightful probabilistic interpretation that
  allows us to analyze its properties.

▶ That is, given that we pick this model as reasonable for our problem,
  we can ask: What kinds of assumptions are we making?
PROBABILISTIC VIEW

Recall: Gaussian density in n dimensions

Assume a diagonal covariance matrix Σ = σ^2 I. The density is

p(y | μ, σ^2) = (2πσ^2)^{-n/2} exp( −(1/(2σ^2)) (y − μ)^T (y − μ) ).

What if we restrict the mean to μ = Xw and find the maximum likelihood
solution for w?
PROBABILISTIC VIEW

Maximum likelihood for Gaussian linear regression

Plug μ = Xw into the multivariate Gaussian distribution and solve for w
using maximum likelihood:

w_ML = arg max_w  ln p(y | μ = Xw, σ^2)
     = arg max_w  −(1/(2σ^2)) ‖y − Xw‖^2 − (n/2) ln(2πσ^2).

Least squares (LS) and maximum likelihood (ML) share the same solution:

LS: arg min_w ‖y − Xw‖^2   ⇔   ML: arg max_w −(1/(2σ^2)) ‖y − Xw‖^2
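As a quick numerical illustration of this equivalence, here is a sketch (not from the original slides) that maximizes the Gaussian log-likelihood with a generic optimizer and compares the result to the closed-form least squares solution; the data and σ² are synthetic:

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
n, d = 200, 4
X = rng.normal(size=(n, d))
sigma2 = 0.25
y = X @ rng.normal(size=d) + rng.normal(scale=np.sqrt(sigma2), size=n)

# Negative Gaussian log-likelihood as a function of w (sigma^2 held fixed).
def neg_log_lik(w):
    resid = y - X @ w
    return resid @ resid / (2 * sigma2) + (n / 2) * np.log(2 * np.pi * sigma2)

w_ml = minimize(neg_log_lik, x0=np.zeros(d)).x   # numerical maximum likelihood
w_ls = np.linalg.solve(X.T @ X, X.T @ y)         # closed-form least squares
print(np.allclose(w_ml, w_ls, atol=1e-4))        # the two solutions coincide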
PROBABILISTIC VIEW

▶ Therefore, in a sense we are making an independent Gaussian noise
  assumption about the error, ε_i = y_i − x_i^T w.

▶ Other ways of saying this:

  1) y_i = x_i^T w + ε_i, with ε_i ∼ N(0, σ^2) i.i.d., for i = 1, . . . , n,
  2) y_i ∼ N(x_i^T w, σ^2), independently for i = 1, . . . , n,
  3) y ∼ N(Xw, σ^2 I), as on the previous slides.

▶ Can we use this probabilistic line of analysis to better understand the
  maximum likelihood (i.e., least squares) solution?
PROBABILISTIC VIEW

Expected solution
Given: The modeling assumption that y ∼ N(Xw, σ^2 I).

We can calculate the expectation of the ML solution under this distribution,

E[w_ML] = E[(X^T X)^{-1} X^T y] = ∫ (X^T X)^{-1} X^T y  p(y | X, w) dy
        = (X^T X)^{-1} X^T E[y]
        = (X^T X)^{-1} X^T Xw
        = w.

Therefore w_ML is an unbiased estimate of w, i.e., E[w_ML] = w.
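The unbiasedness claim is easy to check with a short Monte Carlo simulation; the sketch below uses a made-up fixed design X, true w, and noise level, and averages w_ML over many simulated datasets:

import numpy as np

rng = np.random.default_rng(2)
n, d = 50, 3
X = rng.normal(size=(n, d))       # fixed design matrix
w = np.array([2.0, -1.0, 0.5])    # true parameters (made up)
sigma = 0.5

# Average w_ML over many datasets drawn from y ~ N(Xw, sigma^2 I).
trials = 20000
proj = np.linalg.solve(X.T @ X, X.T)   # (X^T X)^{-1} X^T, reused each trial
w_ml_sum = np.zeros(d)
for _ in range(trials):
    y = X @ w + rng.normal(scale=sigma, size=n)
    w_ml_sum += proj @ y
print(w_ml_sum / trials)   # approximately equals w, i.e. E[w_ML] = w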


REVIEW: AN EQUALITY FROM PROBABILITY

▶ Even though the “expected” maximum likelihood solution is the correct
  one, should we actually expect to get something near it?

▶ We should also look at the covariance. Recall that if y ∼ N(μ, Σ), then

  Var[y] = E[(y − E[y])(y − E[y])^T] = Σ.

▶ Plugging in E[y] = μ, this is equivalently written as

  Var[y] = E[(y − μ)(y − μ)^T]
         = E[yy^T − yμ^T − μy^T + μμ^T]
         = E[yy^T] − μμ^T.

▶ Immediately we also get E[yy^T] = Σ + μμ^T.

PROBABILISTIC VIEW

Variance of the solution

Returning to least squares linear regression, we wish to find

Var[w_ML] = E[(w_ML − E[w_ML])(w_ML − E[w_ML])^T]
          = E[w_ML w_ML^T] − E[w_ML] E[w_ML]^T.

The sequence of equalities follows:¹

Var[w_ML] = E[(X^T X)^{-1} X^T yy^T X (X^T X)^{-1}] − ww^T
          = (X^T X)^{-1} X^T E[yy^T] X (X^T X)^{-1} − ww^T
          = (X^T X)^{-1} X^T (σ^2 I + Xww^T X^T) X (X^T X)^{-1} − ww^T
          = σ^2 (X^T X)^{-1} X^T X (X^T X)^{-1} + (X^T X)^{-1} X^T X ww^T X^T X (X^T X)^{-1} − ww^T
          = σ^2 (X^T X)^{-1}.

¹ Aside: For matrices A, B and vector c, recall that (ABc)^T = c^T B^T A^T.
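A companion sketch to the earlier unbiasedness check: under the same made-up setup, the empirical covariance of w_ML over many simulated datasets should match σ^2 (X^T X)^{-1}:

import numpy as np

rng = np.random.default_rng(3)
n, d = 50, 3
X = rng.normal(size=(n, d))
w = np.array([2.0, -1.0, 0.5])
sigma = 0.5

trials = 20000
proj = np.linalg.solve(X.T @ X, X.T)   # (X^T X)^{-1} X^T
W = np.empty((trials, d))
for t in range(trials):
    y = X @ w + rng.normal(scale=sigma, size=n)
    W[t] = proj @ y
print(np.cov(W, rowvar=False))              # empirical covariance of w_ML
print(sigma**2 * np.linalg.inv(X.T @ X))    # theoretical sigma^2 (X^T X)^{-1}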


PROBABILISTIC VIEW

▶ We’ve shown that, under the Gaussian assumption y ∼ N(Xw, σ^2 I),

  E[w_ML] = w,    Var[w_ML] = σ^2 (X^T X)^{-1}.

▶ When there are very large values in σ^2 (X^T X)^{-1}, the values of w_ML are
  very sensitive to the measured data y (more analysis later).

▶ This is bad if we want to analyze and predict using w_ML.

RIDGE REGRESSION
REGULARIZED LEAST SQUARES

▶ We saw how with least squares, the values in w_ML may be huge.

▶ In general, when developing a model for data we often wish to
  constrain the model parameters in some way.

▶ There are many models of the form

  w_OPT = arg min_w ‖y − Xw‖^2 + λ g(w).

▶ The added terms are

  1. λ > 0 : a regularization parameter,
  2. g(w) > 0 : a penalty function that encourages desired properties about w.
RIDGE REGRESSION

Ridge regression is one choice of g(w) that addresses the variance issues with w_ML.

It uses a squared penalty on the regression coefficient vector w,

w_RR = arg min_w ‖y − Xw‖^2 + λ‖w‖^2.

The term g(w) = ‖w‖^2 penalizes large values in w.

However, there is a tradeoff between the first and second terms that is
controlled by λ.
▶ Case λ → 0 : w_RR → w_LS
▶ Case λ → ∞ : w_RR → 0 (the zero vector)
RIDGE REGRESSION SOLUTION

Objective: We can solve the ridge regression problem using exactly the
same procedure as for least squares,

L = ‖y − Xw‖^2 + λ‖w‖^2
  = (y − Xw)^T (y − Xw) + λ w^T w.

Solution: First, take the gradient of L with respect to w and set it to zero,

∇_w L = −2X^T y + 2X^T Xw + 2λw = 0.

Then solve for w to find that

w_RR = (λI + X^T X)^{-1} X^T y.
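A minimal NumPy sketch of this closed-form solution; the data, dimensions, and λ below are placeholders for illustration:

import numpy as np

def ridge_regression(X, y, lam):
    """Closed-form ridge solution w_RR = (lam*I + X^T X)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(lam * np.eye(d) + X.T @ X, X.T @ y)

# Example with arbitrary synthetic data.
rng = np.random.default_rng(4)
X = rng.normal(size=(100, 5))
y = X @ rng.normal(size=5) + rng.normal(scale=0.1, size=100)
print(ridge_regression(X, y, lam=1.0))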


RIDGE REGRESSION GEOMETRY

There is a tradeoff between the squared error and the penalty on w.

We can write both in terms of level sets: curves where the function
evaluation gives the same number. The sum of these gives a new set
of levels with a unique minimum.

[Figure: level sets of (w − w_LS)^T (X^T X)(w − w_LS) and of λw^T w in the
(w_1, w_2) plane, with w_LS marked.]

You can check that we can write:

‖y − Xw‖^2 + λ‖w‖^2 = (w − w_LS)^T (X^T X)(w − w_LS) + λ w^T w + (const. w.r.t. w).
DATA PREPROCESSING

Ridge regression is one possible regularization scheme. For this problem, we
first assume the following preprocessing steps are done (a short code sketch
of these steps follows the list):

1. The mean is subtracted off of y:

   y ← y − (1/n) Σ_{i=1}^n y_i.

2. The dimensions of x_i have been standardized before constructing X:

   x_ij ← (x_ij − x̄_·j) / σ̂_j,    σ̂_j = sqrt( (1/n) Σ_{i=1}^n (x_ij − x̄_·j)^2 ),

   i.e., subtract the empirical mean and divide by the empirical standard
   deviation for each dimension.

3. We can show that there is no need for the dimension of 1’s in this case.
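A short NumPy sketch of these preprocessing steps, assuming X holds the raw inputs row-wise and y the raw outputs (both synthetic here):

import numpy as np

def preprocess(X, y):
    """Center y and standardize each column of X, as assumed above."""
    y_centered = y - y.mean()
    X_standardized = (X - X.mean(axis=0)) / X.std(axis=0)   # 1/n std convention
    return X_standardized, y_centered

# Example with arbitrary synthetic data.
rng = np.random.default_rng(5)
X = rng.normal(loc=10.0, scale=3.0, size=(50, 4))
y = rng.normal(size=50)
Xs, yc = preprocess(X, y)
print(Xs.mean(axis=0).round(8), Xs.std(axis=0).round(8), yc.mean().round(8))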
SOME ANALYSIS OF RIDGE REGRESSION
RIDGE REGRESSION VS LEAST SQUARES

The solutions to least squares and ridge regression are clearly very similar,

w_LS = (X^T X)^{-1} X^T y   ⇔   w_RR = (λI + X^T X)^{-1} X^T y.

▶ We can use linear algebra and probability to compare the two.

▶ This requires the singular value decomposition, which we review next.


REVIEW: SINGULAR VALUE DECOMPOSITIONS

▶ We can write any n × d matrix X (assume n > d) as X = USV^T, where

  1. U: n × d and orthonormal in the columns, i.e. U^T U = I.
  2. S: d × d non-negative diagonal matrix, i.e. S_ii ≥ 0 and S_ij = 0 for i ≠ j.
  3. V: d × d and orthonormal, i.e. V^T V = VV^T = I.

▶ From this we have the immediate equalities

  X^T X = (USV^T)^T (USV^T) = VS^2 V^T,    XX^T = US^2 U^T.

▶ Assuming S_ii ≠ 0 for all i (i.e., “X is full rank”), we also have that

  (X^T X)^{-1} = (VS^2 V^T)^{-1} = VS^{-2} V^T.

  Proof: Plug in and see that it satisfies the definition of the inverse,

  (X^T X)(X^T X)^{-1} = VS^2 V^T VS^{-2} V^T = I.
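These identities are easy to verify numerically; a small sketch using NumPy's thin SVD on a random matrix (chosen arbitrarily for illustration):

import numpy as np

rng = np.random.default_rng(6)
n, d = 20, 5
X = rng.normal(size=(n, d))

# Thin SVD: U is n x d, s holds the diagonal of S, Vt is V^T (d x d).
U, s, Vt = np.linalg.svd(X, full_matrices=False)
S = np.diag(s)

print(np.allclose(X, U @ S @ Vt))                      # X = U S V^T
print(np.allclose(X.T @ X, Vt.T @ S**2 @ Vt))          # X^T X = V S^2 V^T
print(np.allclose(np.linalg.inv(X.T @ X),
                  Vt.T @ np.diag(s**-2) @ Vt))         # (X^T X)^{-1} = V S^{-2} V^T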


LEAST SQUARES AND THE SVD

Using the SVD we can rewrite the variance,

Var[w_LS] = σ^2 (X^T X)^{-1} = σ^2 VS^{-2} V^T.

This inverse becomes huge when S_ii is very small for some values of i.
(Aside: This happens when columns of X are highly correlated.)

The least squares prediction for new data is

y_new = x_new^T w_LS = x_new^T (X^T X)^{-1} X^T y = x_new^T VS^{-1} U^T y.

When S^{-1} has very large values, this can lead to unstable predictions.
RIDGE REGRESSION VS LEAST SQUARES I

Relationship to least squares solution

Recall that for two invertible matrices, (AB)^{-1} = B^{-1} A^{-1}. Then

w_RR = (λI + X^T X)^{-1} X^T y
     = (λI + X^T X)^{-1} (X^T X) (X^T X)^{-1} X^T y        (the last factor is w_LS)
     = [(X^T X)(λ(X^T X)^{-1} + I)]^{-1} (X^T X) w_LS
     = (λ(X^T X)^{-1} + I)^{-1} (X^T X)^{-1} (X^T X) w_LS
     = (λ(X^T X)^{-1} + I)^{-1} w_LS.

We can use this to prove that the solution shrinks toward zero: ‖w_RR‖_2 ≤ ‖w_LS‖_2.
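A quick numerical check of this shrinkage property, with synthetic data and a few arbitrary λ values:

import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(100, 5))
y = X @ rng.normal(size=5) + rng.normal(size=100)

w_ls = np.linalg.solve(X.T @ X, X.T @ y)
for lam in [0.1, 1.0, 10.0, 100.0]:
    w_rr = np.linalg.solve(lam * np.eye(5) + X.T @ X, X.T @ y)
    # The ridge solution never has larger Euclidean norm than the LS solution.
    print(lam, np.linalg.norm(w_rr) <= np.linalg.norm(w_ls))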
RIDGE REGRESSION VS LEAST SQUARES II

Continue the analysis with the SVD, X = USV^T → (X^T X)^{-1} = VS^{-2} V^T:

w_RR = (λ(X^T X)^{-1} + I)^{-1} w_LS
     = (λVS^{-2} V^T + I)^{-1} w_LS
     = V(λS^{-2} + I)^{-1} V^T w_LS
     := VMV^T w_LS.

M is a diagonal matrix with M_ii = S_ii^2 / (λ + S_ii^2). We can pursue this to show that

w_RR = V S_λ^{-1} U^T y,   where S_λ^{-1} is the diagonal matrix with (S_λ^{-1})_ii = S_ii / (λ + S_ii^2).

Compare with w_LS = VS^{-1} U^T y, which is the case where λ = 0 above.
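One way to sanity-check this expression is to compare the SVD form against the direct closed form; a sketch with arbitrary synthetic data:

import numpy as np

rng = np.random.default_rng(8)
X = rng.normal(size=(100, 5))
y = rng.normal(size=100)
lam = 2.0

# Direct closed form: (lam*I + X^T X)^{-1} X^T y.
w_rr_direct = np.linalg.solve(lam * np.eye(5) + X.T @ X, X.T @ y)

# SVD form: w_RR = V diag(S_ii / (lam + S_ii^2)) U^T y.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
w_rr_svd = Vt.T @ ((s / (lam + s**2)) * (U.T @ y))
print(np.allclose(w_rr_direct, w_rr_svd))   # the two expressions agree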


RIDGE REGRESSION VS LEAST SQUARES III

Ridge regression can also be seen as a special case of least squares.

Define ŷ ≈ X̂w in the following way,

ŷ = [ y ]  (length n + d),    X̂ = [   X    ]  ((n + d) × d),
    [ 0 ]                          [ √λ I_d ]

i.e., stack d zeros beneath y and stack √λ times the d × d identity matrix
beneath X.

If we solve for w_LS in this regression problem, we find w_RR of the original
problem: Calculating (ŷ − X̂w)^T (ŷ − X̂w) in two parts gives

(ŷ − X̂w)^T (ŷ − X̂w) = (y − Xw)^T (y − Xw) + (√λ w)^T (√λ w)
                     = ‖y − Xw‖^2 + λ‖w‖^2.
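A brief numerical illustration of this augmented least squares view, using made-up data:

import numpy as np

rng = np.random.default_rng(9)
n, d = 80, 4
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
lam = 3.0

# Augmented problem: stack sqrt(lam)*I beneath X and d zeros beneath y.
X_hat = np.vstack([X, np.sqrt(lam) * np.eye(d)])
y_hat = np.concatenate([y, np.zeros(d)])

w_aug = np.linalg.solve(X_hat.T @ X_hat, X_hat.T @ y_hat)       # LS on (X_hat, y_hat)
w_rr = np.linalg.solve(lam * np.eye(d) + X.T @ X, X.T @ y)      # ridge on (X, y)
print(np.allclose(w_aug, w_rr))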
SELECTING λ

Degrees of freedom:

df(λ) = trace( X(X^T X + λI)^{-1} X^T ) = Σ_{i=1}^d S_ii^2 / (λ + S_ii^2).

This gives a way of visualizing relationships.

We will discuss methods for picking λ later.
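As an aside, a small sketch computing df(λ) both ways (the trace form and the SVD form) for a grid of λ values, on arbitrary synthetic data:

import numpy as np

rng = np.random.default_rng(10)
X = rng.normal(size=(60, 5))
_, s, _ = np.linalg.svd(X, full_matrices=False)

for lam in [0.0, 1.0, 10.0, 100.0]:
    H = X @ np.linalg.inv(X.T @ X + lam * np.eye(5)) @ X.T
    df_trace = np.trace(H)                 # trace(X (X^T X + lam I)^{-1} X^T)
    df_svd = np.sum(s**2 / (lam + s**2))   # sum_i S_ii^2 / (lam + S_ii^2)
    print(f"lambda={lam:6.1f}  df(trace)={df_trace:.4f}  df(SVD)={df_svd:.4f}")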
