Lecture 3
Data
Measured pairs (x, y), where x ∈ R^{d+1} (input) and y ∈ R (output)
Goal
Find a function f : R^{d+1} → R such that y ≈ f(x; w) for the data pair (x, y).
f(x; w) is the regression function and the vector w contains its parameters.
For least squares, the objective is L = ‖y − Xw‖². Taking the gradient with respect to w and setting it to zero, we find that
∇_w L = 2X^T Xw − 2X^T y = 0  ⇒  w_LS = (X^T X)^{-1} X^T y.
▶ That is, given that we pick this model as reasonable for our problem, we can ask: what kinds of assumptions are we making?
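To make the closed-form solution concrete, here is a minimal NumPy sketch on synthetic data; the names X, y, w_ls and the noise level are illustrative choices, not from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 5

# Synthetic data: prepend a column of 1's so each x lives in R^(d+1).
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d))])
w_true = rng.normal(size=d + 1)
y = X @ w_true + 0.1 * rng.normal(size=n)

# Closed-form least squares: w_LS = (X^T X)^{-1} X^T y.
# Solve the normal equations rather than forming the inverse explicitly.
w_ls = np.linalg.solve(X.T @ X, X.T @ y)
print(w_ls)
```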
PROBABILISTIC VIEW
Least squares (LS) and maximum likelihood (ML) share the same solution:
LS: arg min_w ‖y − Xw‖²   ⇔   ML: arg max_w − (1/(2σ²)) ‖y − Xw‖²
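A quick numeric check of this equivalence, continuing the sketch above and assuming the noise standard deviation σ = 0.1 is known: the Gaussian log-likelihood (up to constants) is maximized at the least squares solution.

```python
sigma2 = 0.1 ** 2  # assumed known noise variance

def log_likelihood(w):
    # log N(y | Xw, sigma^2 I), dropping terms that do not depend on w
    return -np.sum((y - X @ w) ** 2) / (2 * sigma2)

# Any perturbation of w_ls can only lower the likelihood.
w_other = w_ls + 0.05 * rng.normal(size=w_ls.shape)
print(log_likelihood(w_ls) >= log_likelihood(w_other))  # True
```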
Expected solution
Given: The modeling assumption that y ∼ N(Xw, σ²I).
▶ We should also look at the covariance. Recall that if y ∼ N(µ, Σ), then Var[Ay] = AΣA^T; with A = (X^T X)^{-1}X^T this gives Var[w_ML] = σ²(X^T X)^{-1} (checked numerically in the sketch below).
▶ When there are very large values in σ²(X^T X)^{-1}, the values of w_ML are very sensitive to the measured data y (more analysis later).
▶ We saw that with least squares, the values in w_ML may be huge.
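To see the covariance formula in action, here is a small Monte Carlo sketch continuing the snippets above: resample the noise many times and compare the empirical covariance of w_ML with σ²(X^T X)^{-1}.

```python
n_trials = 2000
w_samples = np.empty((n_trials, d + 1))
for t in range(n_trials):
    y_t = X @ w_true + 0.1 * rng.normal(size=n)          # fresh noise, same X
    w_samples[t] = np.linalg.solve(X.T @ X, X.T @ y_t)   # w_ML for this draw

empirical_cov = np.cov(w_samples, rowvar=False)
theoretical_cov = sigma2 * np.linalg.inv(X.T @ X)        # Var[w_ML] = sigma^2 (X^T X)^{-1}
print(np.abs(empirical_cov - theoretical_cov).max())     # small
```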
Ridge regression is one choice of penalty g(w) that addresses the variance issues with w_ML: it penalizes the squared norm, w_RR = arg min_w ‖y − Xw‖² + λ‖w‖².
However, there is a tradeoff between the first and second terms that is
controlled by λ.
▶ Case λ → 0: w_RR → w_LS
▶ Case λ → ∞: w_RR → 0
RIDGE REGRESSION SOLUTION
Objective: We can solve the ridge regression problem using exactly the
same procedure as for least squares,
L = ‖y − Xw‖² + λ‖w‖²
  = (y − Xw)^T (y − Xw) + λ w^T w.
Solution: First, take the gradient of L with respect to w and set to zero,
∇_w L = −2X^T y + 2X^T X w + 2λw = 0  ⇒  w_RR = (λI + X^T X)^{-1} X^T y.
You can check that we can write:
‖y − Xw‖² + λ‖w‖² = (w − w_LS)^T (X^T X)(w − w_LS) + λ w^T w + (const. w.r.t. w).
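A minimal sketch of this solution, continuing the synthetic-data snippets from earlier; lam is an arbitrary illustrative choice of λ.

```python
lam = 1.0
p = X.shape[1]  # d + 1 columns in this synthetic example

# Ridge regression closed form: w_RR = (lambda I + X^T X)^{-1} X^T y.
w_rr = np.linalg.solve(lam * np.eye(p) + X.T @ X, X.T @ y)
print(np.linalg.norm(w_rr), np.linalg.norm(w_ls))  # the ridge solution is pulled toward zero
```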
DATA PREPROCESSING
Standardize the data, i.e., subtract the empirical mean and divide by the empirical standard deviation for each dimension. We can show that there is then no need for the added dimension of 1's in this case.
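A sketch of this preprocessing step on hypothetical raw data; the names X_raw and y_raw are illustrative, and centering y as well is an assumption commonly paired with this setup.

```python
# Hypothetical raw data with very different scales per dimension.
X_raw = rng.normal(size=(n, d)) * np.array([1.0, 5.0, 0.5, 2.0, 10.0])
y_raw = X_raw @ rng.normal(size=d) + rng.normal(size=n)

# Standardize each input dimension and center the outputs;
# after this, no column of 1's is needed in the design matrix.
X_std = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)
y_ctr = y_raw - y_raw.mean()
```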
SOME ANALYSIS OF RIDGE REGRESSION
RIDGE REGRESSION VS LEAST SQUARES
The solutions to least squares and ridge regression are clearly very similar:
w_LS = (X^T X)^{-1} X^T y,   w_RR = (λI + X^T X)^{-1} X^T y.
▶ Write the SVD X = USV^T, so X^T X = VS²V^T. Assuming S_ii ≠ 0 for all i (i.e., "X is full rank"), we also have that (X^T X)^{-1} = VS^{-2}V^T.
This inverse becomes huge when S_ii is very small for some values of i.
(Aside: This happens when columns of X are highly correlated.)
When S^{-1} has very large values, this can lead to unstable predictions.
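A small demonstration of this instability under an assumed design with two nearly identical columns (all names are illustrative):

```python
# Two highly correlated columns => one tiny singular value => near-singular X^T X.
z = rng.normal(size=n)
X_corr = np.column_stack([z, z + 1e-6 * rng.normal(size=n)])
y_corr = z + 0.1 * rng.normal(size=n)

print(np.linalg.svd(X_corr, compute_uv=False))                # one S_ii is tiny
print(np.linalg.solve(X_corr.T @ X_corr, X_corr.T @ y_corr))  # huge, unstable coefficients
```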
RIDGE REGRESSION VS LEAST SQUARES I
w_RR = (λI + X^T X)^{-1} X^T y
     = [(X^T X)(λ(X^T X)^{-1} + I)]^{-1} (X^T X) w_LS
     = (λ(X^T X)^{-1} + I)^{-1} (X^T X)^{-1} (X^T X) w_LS
     = (λ(X^T X)^{-1} + I)^{-1} w_LS
We can use this to prove that the solution shrinks toward zero: ‖w_RR‖₂ ≤ ‖w_LS‖₂.
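Continuing the earlier sketches, a quick numeric check of this identity and of the shrinkage claim:

```python
# w_RR = (lambda (X^T X)^{-1} + I)^{-1} w_LS
shrinkage = np.linalg.inv(lam * np.linalg.inv(X.T @ X) + np.eye(p))
print(np.allclose(shrinkage @ w_ls, w_rr))            # True
print(np.linalg.norm(w_rr) <= np.linalg.norm(w_ls))   # True
```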
RIDGE REGRESSION VS LEAST SQUARES II
w_RR = V S_λ^{-1} U^T y,   where S_λ^{-1} is the diagonal matrix with entries (S_λ^{-1})_ii = S_ii / (λ + S_ii²), i = 1, …, d.
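The same result expressed through NumPy's SVD, continuing the sketches above (thin SVD, so X = U diag(S) V^T):

```python
U, S, Vt = np.linalg.svd(X, full_matrices=False)  # S holds the singular values S_ii
S_lam_inv = S / (lam + S ** 2)                    # diagonal entries of S_lambda^{-1}
w_rr_svd = Vt.T @ (S_lam_inv * (U.T @ y))
print(np.allclose(w_rr_svd, w_rr))                # True
```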
Construct the augmented data
ŷ = [y; 0]  (y with d zeros appended),    X̂ = [X; √λ I]  (X with the rows of √λ·I appended),
and consider the least squares approximation ŷ ≈ X̂ w.
If we solve for w_LS in this regression problem, we find w_RR of the original problem: calculating (ŷ − X̂w)^T (ŷ − X̂w) in two parts gives
(ŷ − X̂w)^T (ŷ − X̂w) = (y − Xw)^T (y − Xw) + (√λ w)^T (√λ w)
                     = ‖y − Xw‖² + λ‖w‖²
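A numeric check of this augmented-data view, continuing the sketches above; here the column of 1's in the synthetic X is penalized too, which is why the preprocessing (centering and standardizing) is usually assumed first.

```python
# Append p zeros to y and the rows sqrt(lambda) * I to X.
y_hat = np.concatenate([y, np.zeros(p)])
X_hat = np.vstack([X, np.sqrt(lam) * np.eye(p)])

w_ls_aug = np.linalg.solve(X_hat.T @ X_hat, X_hat.T @ y_hat)
print(np.allclose(w_ls_aug, w_rr))  # True: LS on augmented data equals ridge on the original
```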
SELECTING λ
Degrees of freedom:
df(λ) = ∑_{i=1}^{d} S_ii² / (λ + S_ii²)
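A sketch of how this quantity could be computed and used to compare candidate values of λ, continuing the earlier snippets (the function name df is illustrative):

```python
def df(lam_value, X_mat):
    """Effective degrees of freedom of ridge regression at a given lambda."""
    S = np.linalg.svd(X_mat, compute_uv=False)
    return np.sum(S ** 2 / (lam_value + S ** 2))

# df decreases from the number of columns (lambda -> 0) toward 0 (lambda -> infinity).
for lam_value in [0.0, 1.0, 10.0, 100.0]:
    print(lam_value, df(lam_value, X))
```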