Some Course Admin
For those wanting to do some programming?
Assignment 1: By Monday the 2nd of April send me 1 page
describing a problem related to your research that you would like to
tackle with the methods introduced so far in the course.
In this description include some of the methods/algorithms
you will use and why.
Assignment 2: Will obviously be implementing this plan!
Deadline for the homework exercises
Deadline for homework sets 1, 2, 3: Monday the 2nd of
April.
Note this deadline is only to ensure you get the homework
corrected in a timely fashion!
Chapter 3: Linear Methods for Regression
DD3364
March 16, 2012
Introduction: Why focus on these models?
Simple and Interpretable
$E[Y \mid X] = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p$   (1)
Can outperform nonlinear methods when one has
a small number of training examples
a low signal-to-noise ratio
sparse data
Can be made nonlinear by applying a nonlinear
transformation to the data.
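As a minimal sketch of the last point (not part of the lecture), a model that is linear in its parameters can fit a nonlinear function once the input has been transformed. The quadratic target, the noise level and the helper name `poly_features` are assumptions chosen purely for illustration.

```python
import numpy as np

def poly_features(x, degree):
    """Map a 1-D input to the columns [1, x, x^2, ..., x^degree]."""
    return np.vander(x, degree + 1, increasing=True)

rng = np.random.default_rng(0)
x = np.linspace(-2.0, 2.0, 50)
y = 1.0 + 0.5 * x - 2.0 * x**2 + 0.1 * rng.standard_normal(x.size)

# Least squares fit on the raw input versus on the transformed input.
X_lin = poly_features(x, 1)
X_quad = poly_features(x, 2)
beta_lin, *_ = np.linalg.lstsq(X_lin, y, rcond=None)
beta_quad, *_ = np.linalg.lstsq(X_quad, y, rcond=None)

rss_lin = np.sum((y - X_lin @ beta_lin) ** 2)
rss_quad = np.sum((y - X_quad @ beta_quad) ** 2)
print(rss_lin, rss_quad)
```

Both fits are ordinary least squares; only the design matrix changes.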
Linear Regression Models and Least Squares
Have an input vector $X = (X_1, X_2, \ldots, X_p)^t$.
Want to predict a real-valued output $Y$.
The linear regression model has the form
$f(X) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p$
How to estimate $\beta$:
Training data: $(x_1, y_1), \ldots, (x_n, y_n)$ with each $x_i \in \mathbb{R}^p$ and $y_i \in \mathbb{R}$
Estimate parameters: Choose the $\beta$ which minimizes the residual sum-of-squares
$\mathrm{RSS}(\beta) = \sum_{i=1}^{n} (y_i - f(x_i))^2 = \sum_{i=1}^{n} \Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2$
Linear least squares fitting
[Figure 3.1: Linear least squares fitting with $X \in \mathbb{R}^2$. We seek the linear function of $X$ that minimizes the sum of squared residuals from $Y$.]
Training data: $(x_1, y_1), \ldots, (x_n, y_n)$ with each $x_i \in \mathbb{R}^p$ and $y_i \in \mathbb{R}$
Estimate parameters: Find the $\beta$ which minimizes the sum of squared residuals from $Y$,
$\mathrm{RSS}(\beta) = \sum_{i=1}^{n} (y_i - f(x_i))^2 = \sum_{i=1}^{n} \Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2$
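Minimizing RSS($\beta$)
Rewrite
$\mathrm{RSS}(\beta) = \sum_{i=1}^{n} \Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2$
in vector and matrix notation as
$\mathrm{RSS}(\beta) = (y - X\beta)^t (y - X\beta)$
where
$\beta = (\beta_0, \beta_1, \ldots, \beta_p)^t$, $y = (y_1, \ldots, y_n)^t$ and
$X = \begin{pmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1p} \\ 1 & x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{np} \end{pmatrix}$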
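Minimizing RSS($\beta$)
Want to find the $\beta$ which minimizes
$\mathrm{RSS}(\beta) = (y - X\beta)^t (y - X\beta)$
Differentiate $\mathrm{RSS}(\beta)$ w.r.t. $\beta$ to obtain
$\frac{\partial\, \mathrm{RSS}}{\partial \beta} = -2X^t (y - X\beta)$
Assume $X$ has full column rank $\implies$ $X^t X$ is positive definite, set
$\frac{\partial\, \mathrm{RSS}}{\partial \beta} = -2X^t (y - X\beta) = 0$
to obtain the unique solution
$\hat\beta = (X^t X)^{-1} X^t y$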
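The closed-form solution can be checked numerically. The sketch below uses synthetic data (the sizes, coefficients and noise level are assumptions, not values from the lecture). It solves the normal equations, compares the result with NumPy's general least squares routine, and confirms that the gradient vanishes at the solution.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 3
X_raw = rng.standard_normal((n, p))
X = np.column_stack([np.ones(n), X_raw])   # prepend the column of ones
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta_true + 0.3 * rng.standard_normal(n)

# Normal equations: solve (X^t X) beta = X^t y without forming the inverse.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Reference solution from NumPy's general least squares solver.
beta_ref, *_ = np.linalg.lstsq(X, y, rcond=None)

# At the minimiser the gradient -2 X^t (y - X beta) vanishes.
grad = -2.0 * X.T @ (y - X @ beta_hat)
print(beta_hat, np.abs(grad).max())
```

Solving the linear system is usually preferred over explicitly inverting $X^t X$, which is less stable numerically.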
Estimates using $\hat\beta$
Given an input $x_0$ this model predicts its output as
$\hat y_0 = (1, x_0^t)\,\hat\beta$
The fitted values at the training inputs are
$\hat y = X\hat\beta = X(X^t X)^{-1} X^t y = H y$
where $H = X(X^t X)^{-1} X^t$ is the hat matrix.
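A small numerical sketch of the hat matrix, on random data chosen only for illustration. It checks the properties of an orthogonal projection: $H$ is symmetric and idempotent, and the residual $y - Hy$ is orthogonal to the columns of $X$.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 30, 2
X = np.column_stack([np.ones(n), rng.standard_normal((n, p))])
y = rng.standard_normal(n)

H = X @ np.linalg.solve(X.T @ X, X.T)   # H = X (X^t X)^{-1} X^t
y_hat = H @ y                            # fitted values at the training inputs

# Prediction at a new input x0 uses the augmented vector (1, x0^t).
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
x0 = np.array([0.5, -1.0])
y0_hat = np.concatenate([[1.0], x0]) @ beta_hat
print(np.trace(H), y0_hat)
```

The trace of $H$ equals the number of columns of $X$, here $p + 1$.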
Geometric interpretation of the least squares estimate
[Figure 3.2: The N-dimensional geometry of least squares regression with two predictors. The outcome vector $y$ is orthogonally projected onto the hyperplane spanned by the input vectors $x_1$ and $x_2$. The projection $\hat y$ represents the vector of the least squares predictions.]
Let $X$ be the input data matrix.
Let $x_{.i}$ be the $i$th column of $X$.
In the figure the vector of outputs $y$ is orthogonally projected onto
the hyperplane spanned by the vectors $x_{.1}$ and $x_{.2}$.
The projection $\hat y$ represents the least squares estimate.
The hat matrix $H$ computes the orthogonal projection.
An example
Regression Example: Face Landmark Estimation
Example of training data
Have training data in the following format.
Input: image of a face of fixed size ($W \times H$ matrix of pixel
intensities = vector of length $WH$)
Output: coordinates of $F$ facial features of the face
Want to learn $F$ linear regression functions $f_i$
$f_i$ maps the image vector to the x-coordinate of the $i$th facial feature.
Learn also $F$ regression functions $g_i$ for the y-coordinate.
Regression Example: Face Landmark Estimation
[Figure: an input image is mapped by $f_{14}, g_{14}$ to an output landmark position.]
Given a test image, we want to predict each of its facial landmark
points.
How well can ordinary least squares regression do on this
problem?
Landmark estimation using OLS regression
[Figure: the estimated weight vectors $\hat\beta_{x,14}$ and $\hat\beta_{y,14}$, and the estimated landmark on a novel image; the estimate is not even in the image.]
These are not promising weight vectors!
This problem is too hard for OLS regression and it fails
miserably.
$p$ is too large and many of the $x_i$ are highly correlated.
Singular $X^t X$
When X is not full rank?
Not all the columns of $X$ are linearly independent.
In this case $X^t X$ is singular $\implies$ $\hat\beta$ is not uniquely defined.
The fitted values $\hat y = X\hat\beta$ are still the projection of $y$ onto the
column space of $X$, but $\exists\, \tilde\beta \neq \hat\beta$ such that
$\hat y = X\hat\beta = X\tilde\beta$
The non-full-rank case occurs when
one or more of the qualitative inputs are encoded redundantly,
the number of inputs $p$ exceeds the number of training examples $n$.
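The non-uniqueness can be seen in a tiny example. In this sketch (synthetic data, with a deliberately duplicated column standing in for a redundant encoding) two different coefficient vectors give exactly the same fitted values.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20
x1 = rng.standard_normal(n)
# The third column duplicates the second: X has 3 columns but rank 2.
X = np.column_stack([np.ones(n), x1, x1])
y = 1.0 + 2.0 * x1 + 0.1 * rng.standard_normal(n)

rank = np.linalg.matrix_rank(X)

# The pseudo-inverse picks the minimum-norm solution among all minimisers.
beta_a = np.linalg.pinv(X) @ y
# Moving along the null space of X gives another minimiser.
beta_b = beta_a + np.array([0.0, 1.0, -1.0])

fit_a, fit_b = X @ beta_a, X @ beta_b
print(rank, beta_a, beta_b)
```

The vector $(0, 1, -1)^t$ lies in the null space of this $X$, so adding any multiple of it leaves $X\beta$ unchanged.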
What can we say about the distribution of $\hat\beta$?
Analysis of the distribution of $\hat\beta$.
This requires making some assumptions. These are
the observations $y_i$ are uncorrelated
the $y_i$ have constant variance $\sigma^2$ and
the $x_i$ are fixed (non-random); this makes the analysis easier
The covariance matrix of $\hat\beta$ is then
$\mathrm{Var}(\hat\beta) = \mathrm{Var}((X^t X)^{-1} X^t y) = (X^t X)^{-1} X^t\, \mathrm{Var}(y)\, X (X^t X)^{-1} = (X^t X)^{-1} \sigma^2$
Usually one estimates the variance $\sigma^2$ with
$\hat\sigma^2 = \frac{1}{n - p - 1} \sum_{i=1}^{n} (y_i - \hat y_i)^2$
Analysis of the distribution of $\hat\beta$
To say more we need to make more assumptions. Therefore
assume
$Y = E(Y \mid X_1, X_2, \ldots, X_p) + \epsilon = \beta_0 + \sum_{j=1}^{p} X_j \beta_j + \epsilon$
where $\epsilon \sim N(0, \sigma^2)$
Then it is easy to show that (assuming non-random $x_i$)
$\hat\beta \sim N(\beta, (X^t X)^{-1} \sigma^2)$
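Given this additive model generate
[Figure: training sets $T_1, \ldots, T_{n_{trial}}$ drawn from the model; each panel shows the samples $y_i$ against $x$ together with the line $\beta_0 + \beta_1 x$.]
$T$ is a training set $\{(x_i, y_i)\}_{i=1}^{n}$
$\beta = (1, 1)^t$, $n = 40$, $\sigma = .6$
In this simulation the $x_i$'s differ across trials.
The distribution of $\hat\beta$
[Figure: scatter plot of the estimates $\hat\beta$.]
Each $T_i$ results in a different estimate of $\beta$.
Have plotted these $\hat\beta$'s for $n_{trial} = 500$.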
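Given this additive model generate
[Figure: training sets $T_1, \ldots, T_{n_{trial}}$ drawn from the model; each panel shows the samples $y_i$ against $x$ together with the line $\beta_0 + \beta_1 x$.]
$T$ is a training set $\{(x_i, y_i)\}_{i=1}^{n}$
$\beta = (1, 1)^t$, $n = 40$, $\sigma = .6$
In this simulation the $x_i$'s are fixed across trials.
The distribution of $\hat\beta$
[Figure: scatter plot of the estimates $\hat\beta$.]
Each $T_i$ results in a different estimate of $\beta$.
Have plotted these $\hat\beta$'s for $n_{trial} = 500$.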
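A rough re-creation of this fixed-design simulation is sketched below. The values $\beta = (1, 1)^t$, $n = 40$, $\sigma = .6$ and 500 trials come from the slide. The input range $[0, 4]$, the uniform sampling of the $x_i$ and the random seed are assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
beta = np.array([1.0, 1.0])
n, sigma, n_trial = 40, 0.6, 500

x = rng.uniform(0.0, 4.0, size=n)          # assumed input range; fixed across trials
X = np.column_stack([np.ones(n), x])
XtX_inv = np.linalg.inv(X.T @ X)

beta_hats = np.empty((n_trial, 2))
for t in range(n_trial):
    y = X @ beta + sigma * rng.standard_normal(n)   # new noise, same inputs
    beta_hats[t] = XtX_inv @ X.T @ y

emp_mean = beta_hats.mean(axis=0)
emp_cov = np.cov(beta_hats, rowvar=False)
theory_cov = sigma**2 * XtX_inv             # (X^t X)^{-1} sigma^2
print(emp_mean, emp_cov, theory_cov, sep="\n")
```

With the design held fixed, the empirical mean and covariance of the 500 estimates should be close to $\beta$ and $(X^t X)^{-1}\sigma^2$ respectively.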
Which $\beta_j$'s are probably zero?
To interpret the weights estimated by least squares it would
be nice to say which ones are probably zero.
The associated predictors can then be removed from the
model.
If $\beta_j = 0$ then $\hat\beta_j \sim N(0, \sigma^2 v_{jj})$ where $v_{jj}$ is the $j$th diagonal
element of $(X^t X)^{-1}$.
Then if the actual value computed for $\hat\beta_j$ is larger in magnitude than $2\sigma\sqrt{v_{jj}}$
then it is highly improbable that $\beta_j = 0$.
Statisticians have exact tests based on suitable distributions.
In this case compute
$z_j = \frac{\hat\beta_j}{\hat\sigma \sqrt{v_{jj}}}$
and if $\beta_j = 0$ then $z_j$ has a t-distribution with $n - p - 1$ dof.
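The Z-score computation can be sketched as follows. The helper name `z_scores`, the input range and the seed are assumptions. The two parameter settings, $\beta = (1, 1)^t$ and $\beta = (3, 0)^t$ with $n = 40$ and $\sigma = .6$, mirror the lecture's simulations.

```python
import numpy as np

def z_scores(X, y):
    """Return (beta_hat, z) for OLS with design matrix X (first column ones)."""
    n, k = X.shape                                   # k = p + 1
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    resid = y - X @ beta_hat
    sigma_hat = np.sqrt(resid @ resid / (n - k))     # n - p - 1 degrees of freedom
    z = beta_hat / (sigma_hat * np.sqrt(np.diag(XtX_inv)))
    return beta_hat, z

rng = np.random.default_rng(5)
n, sigma = 40, 0.6
x = rng.uniform(0.0, 4.0, size=n)                    # assumed input range
X = np.column_stack([np.ones(n), x])

_, z_slope = z_scores(X, X @ np.array([1.0, 1.0]) + sigma * rng.standard_normal(n))
_, z_null = z_scores(X, X @ np.array([3.0, 0.0]) + sigma * rng.standard_normal(n))
print(z_slope[1], z_null[1])
```

When the true slope is 1, $|z_1|$ is typically far out in the tail of the t-distribution. When the true slope is 0, $z_1$ is a draw from that distribution.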
Is $\beta_1$ zero?
[Figure: the t-distribution $p(\text{Z-score} \mid \beta_1 = 0)$ plotted against the Z-score, together with the Z-scores computed for the $\hat\beta_1$'s.]
For the example we had with $\beta = (1, 1)^t$, $n = 40$ and $\sigma = .6$,
the t-distribution of $z_1$ under $\beta_1 = 0$ is shown.
The $z_1$ computed from each $\hat\beta$ estimated with $T_i$ is shown.
Obviously even if we didn't know $\beta$ and only saw one $T_i$ we
would not think $\beta_1 = 0$.
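Look at an example when $\beta_1 = 0$
[Figure: training sets $T_1, \ldots, T_{n_{trial}}$ drawn from the model; each panel shows the samples against $x$ together with the line $\beta_0 + \beta_1 x$.]
$T$ is a training set $\{(x_i, y_i)\}_{i=1}^{n}$
$\beta = (3, 0)^t$, $n = 40$, $\sigma = .6$
In this simulation the $x_i$'s are fixed across trials.
The distribution of $\hat\beta$
[Figure: scatter plot of the estimates $\hat\beta$.]
Each $T_i$ results in a different estimate of $\beta$.
Have plotted these $\hat\beta$'s for $n_{trial} = 500$.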
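Is $\beta_1$ zero?
[Figure: the t-distribution $p(\text{Z-score} \mid \beta_1 = 0)$ plotted against the Z-score, together with the Z-scores computed for the $\hat\beta_1$'s.]
For this example we have $\beta = (3, 0)^t$, $n = 40$ and $\sigma = .6$;
the t-distribution of $z_1$ under $\beta_1 = 0$ is shown.
The $z_1$'s computed from the $\hat\beta$ estimated with each $T_i$ are shown.
Obviously even if we didn't know $\beta$ and only saw one $T_i$ we
would, in most trials, not conclude that $\beta_1 \neq 0$.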
Other tests can be performed
We will not look into these but you can
test for the significance of groups of coefficients
simultaneously
get confidence bounds for $\beta_j$ centred at $\hat\beta_j$.
Gauss-Markov Theorem
A famous result in statistics
The least squares estimate $\hat\beta^{ls}$ of the parameters $\beta$
has the smallest variance among all linear unbiased
estimates.
To explain a simple case of the theorem, let $\theta = a^t \beta$.
The least squares estimate of $a^t \beta$ is
$\hat\theta = a^t \hat\beta^{ls} = a^t (X^t X)^{-1} X^t y$
If $X$ is fixed this is a linear function, $c_0^t y$, of the response
vector $y$.
If we assume $E[y] = X\beta$ then $a^t \hat\beta^{ls}$ is unbiased
$E[a^t \hat\beta^{ls}] = E[a^t (X^t X)^{-1} X^t y] = a^t (X^t X)^{-1} X^t X \beta = a^t \beta = \theta$
Gauss-Markov Theorem: Simple example
The Gauss-Markov Theorem states that any other linear estimator
$\tilde\theta = c^t y$ that is unbiased for $a^t \beta$ has
$\mathrm{Var}[a^t \hat\beta^{ls}] \le \mathrm{Var}[c^t y]$
Have only stated the result for the estimation of one
parameter $a^t \beta$ but can state it in terms of the entire
parameter vector $\beta$.
However, having an unbiased estimator is not always crucial.
The Bias-Variance Trade-off (once again!)
Consider the mean-squared error of an estimator $\tilde\theta$ in
estimating $\theta$
$\mathrm{MSE}(\tilde\theta) = E((\tilde\theta - \theta)^2) = \underbrace{\mathrm{Var}(\tilde\theta)}_{\text{variance}} + \underbrace{(E(\tilde\theta) - \theta)^2}_{\text{squared bias}}$
Gauss-Markov says the least squares estimator has the
smallest MSE of all linear estimators with zero bias.
But there may be biased estimates with smaller MSE.
In these cases have traded an increase in squared bias for a
reduction in variance.
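For completeness, the decomposition follows by adding and subtracting $E(\tilde\theta)$ inside the square. The notation matches the slide; the intermediate steps are spelled out here as a worked derivation and are not taken from the lecture.

```latex
\begin{align*}
\mathrm{MSE}(\tilde\theta)
  &= E\big[(\tilde\theta - \theta)^2\big] \\
  &= E\big[(\tilde\theta - E\tilde\theta + E\tilde\theta - \theta)^2\big] \\
  &= E\big[(\tilde\theta - E\tilde\theta)^2\big]
     + 2\,(E\tilde\theta - \theta)\,E\big[\tilde\theta - E\tilde\theta\big]
     + (E\tilde\theta - \theta)^2 \\
  &= \mathrm{Var}(\tilde\theta) + \big(E\tilde\theta - \theta\big)^2 .
\end{align*}
```

The cross term vanishes because $E[\tilde\theta - E\tilde\theta] = 0$ and $E\tilde\theta - \theta$ is a constant.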
Simple Univariate Regression and
Gram-Schmidt
Multiple regression from simple univariate regression
Suppose we have a univariate model with no intercept
$Y = X\beta + \epsilon$
The least squares estimate is
$\hat\beta = \frac{\langle x, y \rangle}{\langle x, x \rangle}$
where $x = (x_1, x_2, \ldots, x_n)^t$ and $y = (y_1, y_2, \ldots, y_n)^t$.
The residuals are given by
$r = y - x\hat\beta$
Say $x_i \in \mathbb{R}^p$ and the columns of $X$ are orthogonal, then
$\hat\beta_j = \frac{\langle x_{.j}, y \rangle}{\langle x_{.j}, x_{.j} \rangle}$,
where $x_{.j}$ is the $j$th column of $X$
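OLS via successive orthogonalization
The columns of an $X$ acquired from observations are rarely orthogonal.
Hence they have to be orthogonalized to take advantage of the
previous insight.
Let $x_{.0}$ be the 0th column of $X \in \mathbb{R}^{n \times 2}$ (vector of ones) then
Regress $x_{.1}$ on $x_{.0}$, that is $\hat\gamma = \frac{\langle x_{.0}, x_{.1} \rangle}{\langle x_{.0}, x_{.0} \rangle}$, and let $z = x_{.1} - \hat\gamma x_{.0}$
Regress $y$ on $z$, then $\hat\beta_1 = \frac{\langle z, y \rangle}{\langle z, z \rangle}$
Then $y \approx \hat\beta_1 z = \hat\beta_1 (x_{.1} - \hat\gamma x_{.0}) = X\hat\beta$ where $\hat\beta = (-\hat\gamma \hat\beta_1, \hat\beta_1)^t$
(for centred $y$; otherwise the regression of $y$ on $x_{.0}$ adds $\bar y$ to the intercept).
The solution is the same as if one had directly calculated $\hat\beta^{ls}$. Have
just used an orthogonal basis for the col. space of $X$.
Note Step 1 orthogonalized $x_{.1}$ w.r.t. $x_{.0}$.
Step 2 is simple univariate regression using the orthogonal
predictors $x_{.0}$ and $z$.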
OLS via successive orthogonalization
Can extend the process to when the $x_i$'s are $p$-dimensional.
See Algorithm 3.1 in the book.
At each iteration $j$ a multiple least squares regression problem
with $j$ orthogonal inputs is solved.
And after this a new residual is formed which is orthogonal to
all these current directions.
This process is the Gram-Schmidt regression procedure.
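The procedure can be sketched in a few lines. This is an illustrative implementation of successive orthogonalization written for these notes, not a transcription of the book's Algorithm 3.1. The function name `gram_schmidt_last_coef` and the synthetic data are assumptions.

```python
import numpy as np

def gram_schmidt_last_coef(X, y):
    """Successively orthogonalise the columns of X.

    Returns (Z, beta_p): Z holds the orthogonal residual vectors and beta_p is
    the multiple-regression coefficient of the last column of X.
    """
    n, k = X.shape
    Z = np.empty((n, k))
    for j in range(k):
        z = X[:, j].copy()
        for l in range(j):
            # Coefficient from regressing column j on the earlier residual z_l.
            gamma = (Z[:, l] @ X[:, j]) / (Z[:, l] @ Z[:, l])
            z -= gamma * Z[:, l]
        Z[:, j] = z
    beta_p = (Z[:, -1] @ y) / (Z[:, -1] @ Z[:, -1])
    return Z, beta_p

rng = np.random.default_rng(6)
n = 50
X = np.column_stack([np.ones(n), rng.standard_normal((n, 3))])
y = X @ np.array([1.0, 0.5, -2.0, 3.0]) + 0.2 * rng.standard_normal(n)

Z, beta_p = gram_schmidt_last_coef(X, y)
beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
# Fitted values from univariate regressions on the orthogonal columns of Z.
y_hat = Z @ ((Z.T @ y) / np.sum(Z * Z, axis=0))
print(beta_p, beta_ls[-1])
```

Only the coefficient of the last column is read off directly. The fitted values, however, agree with the full least squares fit because the columns of `Z` span the same space as those of `X`.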
Subset Selection
Inadequacies of least squares estimates
Prediction Accuracy
Least squares estimates often have
low bias and high variance
This can affect prediction accuracy
Frequently better to set some of the $\beta_j$'s to zero.
This increases the bias but reduces the variance and in turn
improves prediction accuracy.
Interpretation
For $p$ large, it may be difficult to decipher the important
factors.
Therefore would like to determine a smaller subset of
predictors which are most informative.
May sacrifice small detail for the big picture.
Best-Subset Selection
Best subset regression finds for each $k \in \{0, 1, 2, \ldots, p\}$ the indices
$j_1, j_2, \ldots, j_k$ with each $j_l \in \{1, 2, \ldots, p\}$ s.t.
$\mathrm{RSS}(j_1, j_2, \ldots, j_k) = \min_{\beta_0, \beta_{j_1}, \ldots, \beta_{j_k}} \sum_{i=1}^{n} \Big(y_i - \beta_0 - \sum_{l=1}^{k} \beta_{j_l} x_{i,j_l}\Big)^2$
is smallest.
There are $\binom{p}{k}$ different subsets to try for a given $k$.
If $p \le 40$ there exist computationally feasible algorithms for
finding these best subsets of size $k$.
Question still remains of how to choose the best value of $k$.
Once again it is a trade-off between bias and variance....
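For small $p$ the search can be done by brute force. The sketch below enumerates every subset of each size and is only meant to make the definition concrete; it is not one of the efficient algorithms alluded to above. The function names and the synthetic data are assumptions.

```python
import numpy as np
from itertools import combinations

def rss_of_subset(X, y, subset):
    """RSS of the least squares fit using an intercept plus the given columns."""
    A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in subset])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    r = y - A @ beta
    return float(r @ r)

def best_subsets(X, y):
    """For each size k return (best column subset, its RSS) by exhaustive search."""
    p = X.shape[1]
    out = {}
    for k in range(p + 1):
        candidates = ((s, rss_of_subset(X, y, s)) for s in combinations(range(p), k))
        out[k] = min(candidates, key=lambda t: t[1])
    return out

rng = np.random.default_rng(7)
n, p = 60, 5
X = rng.standard_normal((n, p))
# Only columns 0 and 3 carry signal in this synthetic example.
y = 1.0 + 3.0 * X[:, 0] - 2.0 * X[:, 3] + 0.1 * rng.standard_normal(n)

best = best_subsets(X, y)
for k, (subset, rss) in best.items():
    print(k, subset, round(rss, 3))
```

Column indices are 0-based here, whereas the slide numbers predictors from 1. The best RSS can only decrease as $k$ grows, which is why RSS alone cannot be used to choose $k$.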
Forward-Stepwise Selection
Instead of searching all possible subsets (infeasible for large $p$)
can take a greedy approach.
The steps of Forward-Stepwise Selection are
Set $I = \{1, \ldots, p\}$
For $l = 1, \ldots, k$ choose $j_l$ according to
$j_l = \arg\min_{j \in I} \; \min_{\beta_0, \beta_{j_1}, \ldots, \beta_{j_{l-1}}, \beta_j} \sum_{i=1}^{n} \Big(y_i - \beta_0 - \sum_{s=1}^{l-1} \beta_{j_s} x_{i j_s} - \beta_j x_{ij}\Big)^2$
and
$I = I \setminus \{j_l\}$
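The greedy steps above can be sketched as follows. Each candidate is scored by refitting the full least squares problem, which is simple but not the most efficient way to do it. The function name `forward_stepwise` and the synthetic data are assumptions.

```python
import numpy as np

def forward_stepwise(X, y, k):
    """Greedily pick k columns of X, each time adding the one that lowers RSS most."""
    n, p = X.shape
    remaining = set(range(p))      # the index set I (0-based here)
    chosen = []
    for _ in range(k):
        def rss_with(j):
            # Refit with the intercept, the columns chosen so far, and candidate j.
            A = np.column_stack([np.ones(n), X[:, chosen + [j]]])
            beta, *_ = np.linalg.lstsq(A, y, rcond=None)
            r = y - A @ beta
            return float(r @ r)
        j_best = min(sorted(remaining), key=rss_with)
        chosen.append(j_best)
        remaining.remove(j_best)   # I = I \ {j_l}
    return chosen

rng = np.random.default_rng(8)
n, p = 80, 6
X = rng.standard_normal((n, p))
# Only columns 1 and 4 carry signal in this synthetic example.
y = 0.5 + 3.0 * X[:, 1] - 2.0 * X[:, 4] + 0.1 * rng.standard_normal(n)

selected = forward_stepwise(X, y, 2)
print(selected)
```

Unlike best-subset selection, a variable that has been chosen is never dropped again, so the selected sets are nested as $k$ grows.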
Forward-Stepwise vs Best-Subset Selection
Forward-stepwise may be suboptimal compared to best-subset
selection but may be preferred because
It is computationally feasible for large $p \gg n$. Not true for best-subset
selection.
Best-subset selection may overfit.
Forward-stepwise will probably produce a function with lower
variance but perhaps more bias.
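Forward-Stagewise Regression
The steps of Forward-Stagewise Regression are
Set $\beta_0 = \frac{1}{n} \sum_{i=1}^{n} y_i$
Set $\beta_1 = \beta_2 = \cdots = \beta_p = 0$
At each iteration
compute the residual for each example: $r_i = y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij}$
find the $X_j$ most correlated with $r$: $j^* = \arg\max_{j \in \{1, \ldots, p\}} |\langle x_{\cdot j}, r \rangle|$
update $\beta_{j^*} \leftarrow \beta_{j^*} + \epsilon \, \mathrm{sign}(\langle x_{\cdot j^*}, r \rangle)$ for a small step size $\epsilon > 0$
Stop iterations when the residuals are uncorrelated with all the predictors.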
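A minimal NumPy sketch of these iterations, assuming centred inputs and a fixed small step size. The names `forward_stagewise`, `eps`, `max_iter` and `tol` are hypothetical; with a fixed step the correlations never reach exactly zero, so the loop also stops after a maximum number of iterations.

```python
import numpy as np

def forward_stagewise(X, y, eps=0.01, max_iter=5000, tol=1e-8):
    """Incremental forward-stagewise regression with a fixed step size eps."""
    n, p = X.shape
    beta0 = y.mean()                   # intercept is set once to the mean of y
    beta = np.zeros(p)
    for _ in range(max_iter):
        r = y - beta0 - X @ beta       # residual for each example
        corr = X.T @ r                 # inner products <x_.j, r>
        j = int(np.argmax(np.abs(corr)))
        if np.abs(corr[j]) < tol:      # residual uncorrelated with all predictors
            break
        beta[j] += eps * np.sign(corr[j])   # only one coefficient moves per step
    return beta0, beta

# Hypothetical data with centred inputs.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
X -= X.mean(axis=0)
y = 1.0 + X @ np.array([2.0, -1.0, 0.0, 0.5, 0.0]) + 0.1 * rng.normal(size=200)

beta0, beta = forward_stagewise(X, y)
beta_ls = np.linalg.lstsq(X, y - y.mean(), rcond=None)[0]
print(np.max(np.abs(beta - beta_ls)))   # small: the path ends near the least squares fit
```

Stopping the loop early (a smaller `max_iter`) gives a more heavily regularized fit, which is the sense in which slow fitting can help in high dimensions.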
Forward-Stagewise Regression: What to note
Only one $\beta_j$ is updated at each iteration.
A $\beta_j$ can be updated at several different iterations.
It can be slow to reach the least squares fit.
But slow fitting may not be such a bad thing in high-dimensional problems.
Shrinkage methods
Why shrinkage methods?
Selecting a subset of predictors produces a model that is interpretable and probably has lower prediction error than the full model.
However it is a discrete process $\Rightarrow$ introduces variation into learning the model.
Shrinkage methods are more continuous and have a lower variance.
Ridge Regression
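Shrinkage Method 1: Ridge Regression
Ridge regression shrinks the $\beta_j$'s by imposing a penalty on their size.
The ridge coefficients minimize a penalized RSS
$$\hat{\beta}^{ridge} = \arg\min_{\beta} \Big\{ \underbrace{\sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij} \beta_j \Big)^2}_{\text{residual sum-of-squares}} + \lambda \underbrace{\sum_{j=1}^{p} \beta_j^2}_{\text{regularizer fn}} \Big\}$$
where $\lambda \geq 0$ is the non-negative complexity parameter.
The larger $\lambda \geq 0$, the greater the amount of shrinkage. This implies the $\beta_j$'s are shrunk toward zero (except $\beta_0$).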
An equivalent formulation of Ridge Regression
$$\hat{\beta}^{ridge} = \arg\min_{\beta} \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij} \beta_j \Big)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} \beta_j^2 \leq t$$
This formulation puts an explicit constraint on the size of the $\beta_j$'s.
There is a one-to-one correspondence between $\lambda$ and $t$ in the two formulations.
Note the estimated $\hat{\beta}^{ridge}$ changes if the scaling of the inputs changes.
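The sensitivity to input scaling can be checked numerically. The sketch below is illustrative only: the helper `ridge_coef`, the synthetic data and the factor of 10 applied to the first input are assumptions. It compares fitted values before and after rescaling one input, for least squares and for ridge with the same penalty.

```python
import numpy as np

def ridge_coef(X, y, lam):
    """Ridge coefficients for centred data (no intercept)."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
X -= X.mean(axis=0)
y = X @ np.array([2.0, -1.0, 0.5]) + 0.3 * rng.normal(size=40)
y -= y.mean()

scale = np.array([10.0, 1.0, 1.0])   # e.g. a change of units for the first input
X_scaled = X * scale
lam = 5.0

# Least squares: fitted values do not depend on the units of the inputs.
ls_fit = X @ np.linalg.lstsq(X, y, rcond=None)[0]
ls_fit_scaled = X_scaled @ np.linalg.lstsq(X_scaled, y, rcond=None)[0]

# Ridge with the same penalty: fitted values do change.
ridge_fit = X @ ridge_coef(X, y, lam)
ridge_fit_scaled = X_scaled @ ridge_coef(X_scaled, y, lam)

print(np.max(np.abs(ls_fit - ls_fit_scaled)))
print(np.max(np.abs(ridge_fit - ridge_fit_scaled)))
```

This is the usual motivation for standardizing the inputs before fitting ridge regression.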
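Ridge Regression and centering of data
The centered version of the input data is
$$\tilde{x}_{ij} = x_{ij} - \frac{1}{n} \sum_{s=1}^{n} x_{sj}$$
Then the ridge regression coefficients found using the centered data
$$\hat{\beta}^{c} = \arg\min_{\beta^{c}} \Big\{ \sum_{i=1}^{n} \Big( y_i - \beta_0^{c} - \sum_{j=1}^{p} \tilde{x}_{ij} \beta_j^{c} \Big)^2 + \lambda \sum_{j=1}^{p} (\beta_j^{c})^2 \Big\}$$
are related to the coefficients found using the original data via
$$\hat{\beta}_0^{c} = \frac{1}{n} \sum_{i=1}^{n} y_i = \bar{y},$$
$$\hat{\beta}_j^{c} = \hat{\beta}_j^{ridge} \quad \text{for } j = 1, \ldots, p,$$
$$\hat{\beta}_0^{ridge} = \bar{y} - \sum_{j=1}^{p} \bar{x}_{\cdot j} \hat{\beta}_j^{ridge}$$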
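These relations can be verified numerically. In the sketch below, `ridge_via_centering` fits on centred inputs and recovers the intercept afterwards, while `ridge_direct` solves the original problem with an unpenalized intercept column. Both function names and the data are hypothetical.

```python
import numpy as np

def ridge_via_centering(X, y, lam):
    """Fit ridge on centred inputs, then recover the intercept for the raw inputs."""
    x_bar = X.mean(axis=0)
    y_bar = y.mean()
    Xc = X - x_bar
    p = X.shape[1]
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ (y - y_bar))
    beta0 = y_bar - x_bar @ beta       # beta_0^ridge = y_bar - sum_j x_bar_j * beta_j
    return beta0, beta

def ridge_direct(X, y, lam):
    """Solve the penalized problem on raw inputs, leaving beta_0 unpenalized."""
    n, p = X.shape
    A = np.hstack([np.ones((n, 1)), X])
    P = lam * np.eye(p + 1)
    P[0, 0] = 0.0                      # no penalty on the intercept
    coef = np.linalg.solve(A.T @ A + P, A.T @ y)
    return coef[0], coef[1:]

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(50, 3))
y = 4.0 + X @ np.array([1.0, -2.0, 0.5]) + 0.2 * rng.normal(size=50)

b0_c, b_c = ridge_via_centering(X, y, lam=3.0)
b0_d, b_d = ridge_direct(X, y, lam=3.0)
print(b0_c - b0_d, np.max(np.abs(b_c - b_d)))
```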
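Ridge Regression and centering of data
If the $y_i$'s have zero mean $\Rightarrow \hat{\beta}_0^{c} = 0$
Can drop the intercept term from the linear model if the input data is centred.
Then for ridge regression, given all the necessary centering, find the $\beta = (\beta_1, \ldots, \beta_p)^t$ given by
$$\hat{\beta}^{ridge} = \arg\min_{\beta} \Big\{ \sum_{i=1}^{n} \Big( y_i - \sum_{j=1}^{p} x_{ij} \beta_j \Big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \Big\} = \arg\min_{\beta} \Big\{ (y - X\beta)^t (y - X\beta) + \lambda \beta^t \beta \Big\}$$
where $y = (y_1, \ldots, y_n)^t$ and
$$X = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{np} \end{pmatrix}$$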
Ridge Regression and centering of data
For the rest of the lecture we will assume centered input and output data.
The ridge regression solution is given by
$$\hat{\beta}^{ridge} = (X^t X + \lambda I_p)^{-1} X^t y$$
Note that the problem of inverting the potentially singular matrix $X^t X$ is averted, as $(X^t X + \lambda I_p)$ is full rank for $\lambda > 0$ even if $X^t X$ is not.
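A short NumPy sketch of the closed-form solution, using a case with more predictors than examples so that $X^t X$ is singular. The function name `ridge_fit` and the random data are assumptions for illustration. `np.linalg.solve` is used instead of forming the inverse explicitly, which is the usual numerical practice.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge solution (X^t X + lam * I_p)^{-1} X^t y for centred data."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(0)
n, p = 20, 50                       # more predictors than examples
X = rng.normal(size=(n, p))
X -= X.mean(axis=0)
y = rng.normal(size=n)
y -= y.mean()

rank = np.linalg.matrix_rank(X)     # rank of X^t X equals rank of X, here below p
lam = 1.0
beta = ridge_fit(X, y, lam)
print(rank, p, beta.shape)
```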
Insight into ridge regression
Insight into Ridge Regression
Compute the SVD of the $n \times p$ input matrix $X$; then
$$X = U D V^t$$
where
$U$ is an $n \times p$ matrix with orthonormal columns
$V$ is a $p \times p$ orthogonal matrix
$D$ is a $p \times p$ diagonal matrix with $d_1 \geq d_2 \geq \cdots \geq d_p \geq 0$.
Can write the least squares fitted vector as
$$X \hat{\beta}^{ls} = X (X^t X)^{-1} X^t y = U U^t y$$
which is the closest approximation to $y$ in the subspace spanned by the columns of $U$ (= column space of $X$).
Insight into Ridge Regression
Can write the ridge regression fitted vector as
$$X \hat{\beta}^{ridge} = X (X^t X + \lambda I_p)^{-1} X^t y = U D (D^2 + \lambda I_p)^{-1} D U^t y = \sum_{j=1}^{p} u_j \frac{d_j^2}{d_j^2 + \lambda} u_j^t y,$$
where the $u_j$'s are the columns of $U$.
As $\lambda \geq 0 \Rightarrow d_j^2 / (d_j^2 + \lambda) \leq 1$
Ridge regression computes the coordinates of $y$ with respect to the orthonormal basis formed by the columns of $U$.
It then shrinks these coordinates by the factors $d_j^2 / (d_j^2 + \lambda)$.
More shrinkage is applied to basis vectors with smaller $d_j^2$.
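The SVD form of the ridge fit can be checked against the direct formula. This is a minimal sketch on synthetic centred data. The variable names are illustrative, and `np.linalg.svd` with `full_matrices=False` returns the thin SVD used here.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 30, 4, 2.0
X = rng.normal(size=(n, p))
X -= X.mean(axis=0)
y = rng.normal(size=n)
y -= y.mean()

# Thin SVD: U is n x p, d holds d_1 >= ... >= d_p, Vt is V^t.
U, d, Vt = np.linalg.svd(X, full_matrices=False)

shrink = d**2 / (d**2 + lam)               # shrinkage factor per basis vector u_j
fit_svd = U @ (shrink * (U.T @ y))         # sum_j u_j * shrink_j * (u_j^t y)
fit_direct = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
fit_ls = U @ (U.T @ y)                     # least squares fit: no shrinkage

print(shrink)
print(np.max(np.abs(fit_svd - fit_direct)))
```

Because `d` is sorted in decreasing order, `shrink` is decreasing too: the directions with the smallest $d_j^2$ are shrunk the most.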
Insight into Ridge Regression: What are the $d_j^2$'s?
The sample covariance matrix of the data is given by
$$S = \frac{1}{n} X^t X$$
From the SVD of $X$ we know that
$$X^t X = V D^2 V^t \quad \text{(eigendecomposition of } X^t X\text{)}$$
The eigenvectors $v_j$ (columns of $V$) are the principal component directions of $X$.
Project the input of each training example onto the first principal component direction $v_1$ to get $z_i^{(1)} = v_1^t x_i$. The variance of the $z_i^{(1)}$'s is given by (remember the $x_i$'s are centred)
$$\frac{1}{n} \sum_{i=1}^{n} (z_i^{(1)})^2 = \frac{1}{n} \sum_{i=1}^{n} v_1^t x_i x_i^t v_1 = \frac{1}{n} v_1^t X^t X v_1 = \frac{d_1^2}{n}$$
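A small numerical check of the identity, assuming synthetic correlated two-dimensional inputs (the mixing matrix `A` is a hypothetical choice). The rows of `Vt` returned by `np.linalg.svd` are the principal component directions $v_j$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
A = np.array([[2.0, 0.0], [1.5, 0.5]])   # hypothetical mixing matrix: correlated inputs
X = rng.normal(size=(n, 2)) @ A
X -= X.mean(axis=0)                      # centre the inputs

U, d, Vt = np.linalg.svd(X, full_matrices=False)
v1 = Vt[0]                               # first principal component direction
z1 = X @ v1                              # projections z_i^(1) = v_1^t x_i
var_z1 = np.mean(z1**2)                  # centred data, so this is the variance

# The eigenvalues of X^t X are the squared singular values.
eigvals = np.linalg.eigvalsh(X.T @ X)[::-1]

print(var_z1, d[0]**2 / n)
print(eigvals, d**2)
```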
Insight into Ridge Regression: What are the $d_j^2$'s?
$v_1$ represents the direction (of unit length) for which the projected points have largest variance.
Subsequent principal components $z_i^{(j)}$ have maximum variance $d_j^2 / n$ subject to $v_j$ being orthogonal to the earlier directions.
[FIGURE 3.9 (from the book, Section 3.4 Shrinkage Methods, p. 67). Scatter plot of input data points with axes $X_1$ and $X_2$; the Largest Principal Component and the Smallest Principal Component directions are marked. Caption: Principal components of some input data points. The largest principal component is the direction that maximizes the variance of the projected data, and the smallest principal component minimizes that variance. Ridge regression projects y onto these components, and then shrinks the coefficients of the low-variance components more than the high-variance components.]
Insight into Ridge Regression: What are the $d_j^2$'s?
The last principal component has minimum variance.
Hence the small $d_j$ correspond to the directions of the column space of $X$ having small variance.
Ridge regression shrinks these directions the most!
The estimated directions $v_j$ with small $d_j$ have more uncertainty associated with the estimate. (Using a narrow baseline to estimate a direction.) Ridge regression protects against relying on these directions, whose estimates have high variance.
Ridge regression implicitly assumes that the output will vary most in the directions of high variance of the inputs. A reasonable assumption, but not always true.
Insight into Ridge Regression: What are the $d_j^2$'s?
The book defines the effective degrees of freedom of the ridge regression fit as
$$\mathrm{df}^{ridge}(\lambda) = \sum_{j=1}^{p} \frac{d_j^2}{d_j^2 + \lambda}$$
We will derive this later on in the course.
But it is interesting as
$\mathrm{df}^{ridge}(\lambda) \to p$ when $\lambda \to 0$ (ordinary least squares) and
$\mathrm{df}^{ridge}(\lambda) \to 0$ when $\lambda \to \infty$
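The two limits are easy to see numerically. Below is a minimal sketch, assuming a hypothetical helper `df_ridge` and random centred inputs of full column rank. It also compares the formula with the trace of the matrix $X (X^t X + \lambda I_p)^{-1} X^t$ that maps $y$ to the ridge fit.

```python
import numpy as np

def df_ridge(X, lam):
    """Effective degrees of freedom of ridge: sum_j d_j^2 / (d_j^2 + lam)."""
    d = np.linalg.svd(X, compute_uv=False)   # singular values only
    return float(np.sum(d**2 / (d**2 + lam)))

rng = np.random.default_rng(0)
n, p = 60, 8
X = rng.normal(size=(n, p))
X -= X.mean(axis=0)

lams = [0.0, 1.0, 10.0, 100.0, 1e4]
dfs = [df_ridge(X, lam) for lam in lams]
print(dfs)   # starts at p for lam = 0 and decreases as lam grows

# Same quantity as the trace of the ridge "hat" matrix.
lam = 10.0
H = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)
print(np.trace(H), df_ridge(X, lam))
```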
Back to our regression problem
Regression Example: Face Landmark Estimation
[Figure: panels labelled Input and Output; label f14, g14.]
Given a test image want to predict each of its facial landmark points.
How well can ridge regression do on this problem?
Landmark estimation using ridge regression
[Figure: panels labelled x,14 and y,14 and "Estimated Landmark on novel image", shown for $\lambda = 1$ (df ≈ 232), $\lambda = 10^2$ (df ≈ 187) and $\lambda = 10^3$ (df ≈ 97).]
Landmark estimation using ridge regression
[Figure: the same panels for $\lambda = 5000$ (df ≈ 44) and $\lambda = 10^4$ (df ≈ 29).]
Landmark estimation using ridge regression
[Figure: $\lambda = 1000$, df ≈ 97. Legend: Ground truth point, Estimated point.]
How the coefficients vary with $\lambda$
[FIGURE 3.8. Profiles of ridge coefficients for the prostate cancer example, as the tuning parameter $\lambda$ is varied. Coefficients are plotted versus df($\lambda$), the effective degrees of freedom. A vertical line is drawn at df = 5.0, the value chosen by cross-validation. Curves: lcavol, svi, lweight, pgg45, lbph, gleason, age, lcp.]
This is an example from the book. Notice how the weights associated with each predictor vary with $\lambda$.
The Lasso
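Shrinkage Method 2: The Lasso
The lasso estimate is defined by
$$\hat{\beta}^{lasso} = \arg\min_{\beta} \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij} \beta_j \Big)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} |\beta_j| \leq t$$
The penalty is the $L_1$ instead of the $L_2$ norm.
An equivalent formulation of the lasso problem is
$$\hat{\beta}^{lasso} = \arg\min_{\beta} \Big\{ \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij} \beta_j \Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \Big\}$$
The solution is nonlinear in the $y_i$'s and there is no closed-form solution.
It is convex and is, in fact, a quadratic programming problem.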
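Since there is no closed form, the lasso has to be computed iteratively. The sketch below uses cyclic coordinate descent, a technique swapped in for illustration: the slides only state that the problem is a quadratic program and do not prescribe a solver. It assumes centred data (no intercept) and the penalized objective $\sum_i (y_i - \sum_j x_{ij} \beta_j)^2 + \lambda \sum_j |\beta_j|$ without a factor $\frac{1}{2}$, which is why the soft-threshold level is $\lambda / 2$. The names `soft_threshold`, `lasso_cd` and `n_sweeps` are hypothetical.

```python
import numpy as np

def soft_threshold(z, gamma):
    """sign(z) * (|z| - gamma)_+"""
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

def lasso_cd(X, y, lam, n_sweeps=500):
    """Minimise sum_i (y_i - x_i^t beta)^2 + lam * sum_j |beta_j| by coordinate descent."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = np.sum(X**2, axis=0)       # squared norms ||x_.j||^2
    r = y - X @ beta                    # current residual
    for _ in range(n_sweeps):
        for j in range(p):
            r += X[:, j] * beta[j]      # remove x_j's contribution: partial residual
            rho = X[:, j] @ r
            beta[j] = soft_threshold(rho, lam / 2.0) / col_sq[j]
            r -= X[:, j] * beta[j]      # put the updated contribution back
    return beta

# Hypothetical sparse problem with centred data.
rng = np.random.default_rng(0)
n, p = 100, 6
X = rng.normal(size=(n, p))
X -= X.mean(axis=0)
beta_true = np.array([3.0, 0.0, 0.0, -2.0, 0.0, 0.0])
y = X @ beta_true + 0.5 * rng.normal(size=n)
y -= y.mean()

beta_mid = lasso_cd(X, y, lam=60.0)
print(beta_mid)
```

With `lam=0.0` the same loop converges to the least squares solution, and for `lam` at least `2 * max|X^t y|` every coefficient is exactly zero.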
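Shrinkage Method 2: The Lasso
$$\hat{\beta}^{lasso} = \arg\min_{\beta} \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij} \beta_j \Big)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} |\beta_j| \leq t$$
Because of the $L_1$ constraint, making $t$ small will force some of the $\beta_j$'s to be exactly 0.
Lasso does some kind of continuous subset selection.
However the nature of the shrinkage is not so obvious.
If $t \geq \sum_{j=1}^{p} |\hat{\beta}_j^{ls}|$ is sufficiently large, then $\hat{\beta}^{lasso} = \hat{\beta}^{ls}$.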
Back to our regression problem
Landmark estimation using lasso
[Figure: panels labelled x,14 and y,14 and "Estimated Landmark on novel image", shown for $\lambda = .01$ (df value not recoverable), $\lambda = .04$ (df ≈ 93) and $\lambda = .1$ (df ≈ 53).]
Landmark estimation using lasso
[Figure: the same panels for $\lambda = .3$ (df ≈ 21), $\lambda = .5$ (df ≈ 7) and $\lambda = .7$ (df ≈ 2).]
Landmark estimation using lasso regression
[Figure: $\lambda = .04$, df ≈ 93. Legend: Ground truth point, Estimated point.]
Subset Selection Vs Ridge Regression Vs Lasso
When $X$ has orthonormal columns
This implies $d_j = 1$ for $j = 1, \ldots, p$.
In this case each method applies a simple transformation to $\hat{\beta}_j^{ls}$:
Estimator               Formula
Best subset (size M)    $\hat{\beta}_j^{ls} \cdot \mathrm{Ind}(|\hat{\beta}_j^{ls}| \geq |\hat{\beta}_{(M)}^{ls}|)$
Ridge                   $\hat{\beta}_j^{ls} / (1 + \lambda)$
Lasso                   $\mathrm{sign}(\hat{\beta}_j^{ls}) \, (|\hat{\beta}_j^{ls}| - \lambda)_+$
where $\hat{\beta}_{(M)}^{ls}$ is the $M$th largest coefficient (in absolute value).
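The three transformations are one-liners in NumPy. The sketch below is illustrative: the function names and the example vector `beta_ls` are hypothetical. One point on conventions: the lasso threshold is exactly $\lambda$ when the residual sum of squares carries a factor $\frac{1}{2}$ in the penalized criterion, as in the book's Lagrangian form, and without that factor the threshold would be $\lambda / 2$.

```python
import numpy as np

def best_subset_orth(beta_ls, M):
    """Hard thresholding: keep the M largest coefficients in absolute value."""
    thresh = np.sort(np.abs(beta_ls))[::-1][M - 1]   # |beta_(M)|
    return beta_ls * (np.abs(beta_ls) >= thresh)

def ridge_orth(beta_ls, lam):
    """Proportional shrinkage: every coefficient is scaled by 1 / (1 + lam)."""
    return beta_ls / (1.0 + lam)

def lasso_orth(beta_ls, lam):
    """Soft thresholding: translate towards zero by lam, truncating at zero."""
    return np.sign(beta_ls) * np.maximum(np.abs(beta_ls) - lam, 0.0)

beta_ls = np.array([3.0, -1.5, 0.5, -0.2])   # hypothetical least squares estimates
print(best_subset_orth(beta_ls, M=2))
print(ridge_orth(beta_ls, lam=1.0))
print(lasso_orth(beta_ls, lam=1.0))
```

Best subset and lasso both set small coefficients to zero, but only the lasso also shrinks the ones it keeps. Ridge shrinks all of them and zeroes none.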
Ridge Regression Vs Lasso
When $X$ does not have orthogonal columns
Red elliptical contours show the iso-scores of RSS($\beta$).
Cyan regions show the feasible regions $\beta_1^2 + \beta_2^2 \leq t^2$ and $|\beta_1| + |\beta_2| \leq t$ resp.
[FIGURE 3.11. Estimation picture for the lasso (left) and ridge regression (right). Shown are contours of the error and constraint functions. The solid blue areas are the constraint regions $|\beta_1| + |\beta_2| \leq t$ and $\beta_1^2 + \beta_2^2 \leq t^2$, respectively. Axes: $\beta_1$, $\beta_2$; the least squares estimate $\hat{\beta}$ is marked.]
[TABLE 3.4. Estimators of $\beta_j$ in the case of orthonormal columns of $X$. $M$ and $\lambda$ are constants chosen by the corresponding techniques; sign denotes the sign of its argument ($\pm 1$), and $x_+$ denotes "positive part" of $x$. Below the table, estimators are shown by broken red lines. The 45° line in gray shows the unrestricted estimate for reference. Panels: Best Subset, Ridge, Lasso.]
Ridge Regression Vs Lasso
When $X$ does not have orthogonal columns
Both methods choose the first point where the elliptical contours hit the constraint region.
The lasso region has corners; if the solution occurs at a corner then one $\beta_j = 0$.
When $p > 2$ the diamond becomes a rhomboid with many corners and flat edges $\Rightarrow$ many opportunities for $\beta_j$'s to be 0.
[FIGURE 3.11. Estimation picture for the lasso (left) and ridge regression (right). Shown are contours of the error and constraint functions. The solid blue areas are the constraint regions $|\beta_1| + |\beta_2| \leq t$ and $\beta_1^2 + \beta_2^2 \leq t^2$, respectively.]
Generalization of ridge and lasso regression
$$\hat{\beta} = \arg\min_{\beta} \Big\{ \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij} \beta_j \Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j|^q \Big\}$$
$q = 0$: Variable subset selection
$q = 1$: Lasso
$q = 2$: Ridge regression
Can try other values of $q$.
When $q \geq 1$ still have a convex problem.
When $0 \leq q < 1$ do not have a convex problem.
When $q \leq 1$ sparse solutions are explicitly encouraged.
When $q > 1$ cannot set coefficients to zero.
[FIGURE 3.12. Contours of constant value of $\sum_j |\beta_j|^q$ for given values of $q$. Panels: q = 4, q = 2, q = 1, q = 0.5, q = 0.1.]
Excerpt from the book page shown on the slide:
Thinking of $|\beta_j|^q$ as the log-prior density for $\beta_j$, these are also the equi-contours of the prior distribution of the parameters. The value $q = 0$ corresponds to variable subset selection, as the penalty simply counts the number of nonzero parameters; $q = 1$ corresponds to the lasso, while $q = 2$ to ridge regression. Notice that for $q \leq 1$, the prior is not uniform in direction, but concentrates more mass in the coordinate directions. The prior corresponding to the $q = 1$ case is an independent double exponential (or Laplace) distribution for each input, with density $(1/2\tau) \exp(-|\beta|/\tau)$ and $\tau = 1/\lambda$. The case $q = 1$ (lasso) is the smallest $q$ such that the constraint region is convex; non-convex constraint regions make the optimization problem more difficult.
In this view, the lasso, ridge regression and best subset selection are Bayes estimates with different priors. Note, however, that they are derived as posterior modes, that is, maximizers of the posterior. It is more common to use the mean of the posterior as the Bayes estimate. Ridge regression is also the posterior mean, but the lasso and best subset selection are not.
Looking again at the criterion (3.53), we might try using other values of $q$ besides 0, 1, or 2. Although one might consider estimating $q$ from the data, our experience is that it is not worth the effort for the extra variance incurred. Values of $q \in (1, 2)$ suggest a compromise between the lasso and ridge regression. Although this is the case, with $q > 1$, $|\beta_j|^q$ is differentiable at 0, and so does not share the ability of lasso ($q = 1$) for ...
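The convexity statements for the $|\beta_j|^q$ penalty can be illustrated with a midpoint test: a convex function never exceeds the average of its endpoint values at the midpoint of a segment. The sketch below is illustrative only. The names `lq_penalty` and `midpoint_gap` and the two test points are assumptions.

```python
import numpy as np

def lq_penalty(beta, q):
    """sum_j |beta_j|^q, with q = 0 counting the nonzero coefficients."""
    beta = np.asarray(beta, dtype=float)
    if q == 0:
        return float(np.count_nonzero(beta))
    return float(np.sum(np.abs(beta) ** q))

def midpoint_gap(q):
    """Penalty at the midpoint of (1,0) and (0,1) minus the average endpoint penalty."""
    a = np.array([1.0, 0.0])
    b = np.array([0.0, 1.0])
    mid = 0.5 * (a + b)
    return lq_penalty(mid, q) - 0.5 * (lq_penalty(a, q) + lq_penalty(b, q))

for q in (0.5, 1.0, 2.0):
    # A positive gap contradicts convexity of the penalty.
    print(q, midpoint_gap(q))
```

The gap is positive for $q = 0.5$, zero for $q = 1$ and negative for $q = 2$, matching the statement that $q = 1$ is the boundary case between convex and non-convex penalties.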
Elastic-net penalty
A compromise between the ridge and lasso penalty is the elastic-net penalty:
$$\lambda \sum_{j=1}^{p} \big( \alpha \beta_j^2 + (1 - \alpha) |\beta_j| \big)$$
The elastic-net selects variables like the lasso and shrinks together the coefficients of correlated predictors like ridge regression.
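A direct transcription of the penalty, as a minimal sketch (the function name `elastic_net_penalty` and the example coefficients are hypothetical). Setting $\alpha = 1$ recovers the ridge penalty and $\alpha = 0$ the lasso penalty.

```python
import numpy as np

def elastic_net_penalty(beta, lam, alpha):
    """lam * sum_j (alpha * beta_j^2 + (1 - alpha) * |beta_j|)"""
    beta = np.asarray(beta, dtype=float)
    return float(lam * np.sum(alpha * beta**2 + (1.0 - alpha) * np.abs(beta)))

beta = np.array([1.0, -2.0, 0.5])
print(elastic_net_penalty(beta, lam=2.0, alpha=1.0))   # ridge penalty: lam * sum beta_j^2
print(elastic_net_penalty(beta, lam=2.0, alpha=0.0))   # lasso penalty: lam * sum |beta_j|
print(elastic_net_penalty(beta, lam=2.0, alpha=0.5))   # a mixture of the two
```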
Effective degrees of freedom
Definition of the effective degrees of freedom
Traditionally the number of linearly independent parameters is what is meant by degrees of freedom.
If we carry out a best subset selection to determine the optimal set of $k$ predictors, then surely we have used more than $k$ degrees of freedom.
A more general definition for the effective degrees of freedom of an adaptively fitted model is
$$\mathrm{df}(\hat{y}) = \frac{1}{\sigma^2} \sum_{i=1}^{n} \mathrm{Cov}(\hat{y}_i, y_i)$$
where $\mathrm{Cov}(\hat{y}_i, y_i)$ is the sampling covariance between the fitted value $\hat{y}_i$ and the corresponding outcome $y_i$.
Intuitively the harder we fit to the data, the larger the covariance and hence $\mathrm{df}(\hat{y})$.
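The covariance definition can be estimated by simulation for any fitting procedure: redraw the outputs many times with the inputs held fixed, refit, and average. The sketch below is illustrative and assumes a known noise level `sigma`, a hypothetical helper `mc_df`, and ridge regression as the fitting procedure, for which the result should be close to the trace of $X (X^t X + \lambda I_p)^{-1} X^t$.

```python
import numpy as np

def mc_df(fit, f, sigma, n_rep, rng):
    """Monte Carlo estimate of (1 / sigma^2) * sum_i Cov(yhat_i, y_i)."""
    n = f.shape[0]
    Y = f + sigma * rng.normal(size=(n_rep, n))    # each row is one simulated y
    Yhat = np.array([fit(y) for y in Y])           # refit on every replication
    Yc = Y - Y.mean(axis=0)
    Yhat_c = Yhat - Yhat.mean(axis=0)
    cov_sum = np.sum(np.mean(Yhat_c * Yc, axis=0)) # sum over i of Cov(yhat_i, y_i)
    return float(cov_sum / sigma**2)

rng = np.random.default_rng(0)
n, p, lam, sigma = 30, 5, 4.0, 1.0
X = rng.normal(size=(n, p))
X -= X.mean(axis=0)
f = X @ rng.normal(size=p)                         # hypothetical true regression function

H = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)   # ridge maps y to H y
df_mc = mc_df(lambda y: H @ y, f, sigma, n_rep=4000, rng=rng)
df_exact = float(np.trace(H))
print(df_mc, df_exact)
```

The same `mc_df` call works for adaptive procedures such as subset selection, where no simple trace formula is available: pass a `fit` function that performs the selection and returns the fitted values.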