
# Some Course Admin

## For those wanting to do some programming

Assignment 1: By Monday the 2nd of April, send me one page describing a problem related to your research that you would like to tackle with the methods introduced so far in the course.

## Deadline for the homework exercises

Deadline for homework sets 1, 2, 3: Monday the 2nd of April.

DD3364

## Introduction: Why focus on these models?

Simple and interpretable:

$$E[Y \mid X] = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p \tag{1}$$

Can outperform non-linear methods when one has
- a small number of training examples,
- a low signal-to-noise ratio, or
- sparse data.


## Linear Regression Models and Least Squares

Have an input vector $X = (X_1, X_2, \ldots, X_p)^t$. Want to predict a real-valued output $Y$. The linear regression model has the form

$$f(X) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p$$

How to estimate $\beta$:
- Training data: $(x_1, y_1), \ldots, (x_n, y_n)$ with each $x_i \in \mathbb{R}^p$ and $y_i \in \mathbb{R}$.
- Estimate parameters: choose the $\beta$ which minimizes the residual sum-of-squares

$$\mathrm{RSS}(\beta) = \sum_{i=1}^n \big(y_i - f(x_i)\big)^2 = \sum_{i=1}^n \Big(y_i - \beta_0 - \sum_{j=1}^p x_{ij}\beta_j\Big)^2$$


## Linear least squares fitting

[Figure 3.1 from the book: linear least squares fitting with $X \in \mathbb{R}^2$. We seek the linear function of $X$ that minimizes the sum of squared residuals from $Y$.]

Training data: $(x_1, y_1), \ldots, (x_n, y_n)$ with each $x_i \in \mathbb{R}^p$ and $y_i \in \mathbb{R}$.

Estimate parameters: choose the $\beta$ which minimizes the sum-of-squared residuals from $Y$,

$$\mathrm{RSS}(\beta) = \sum_{i=1}^n \Big(y_i - \beta_0 - \sum_{j=1}^p x_{ij}\beta_j\Big)^2$$

Note that this criterion makes no assumptions about the validity of the linear model; it simply finds the best linear fit to the data. Least squares fitting is intuitively satisfying no matter how the data arise: the criterion measures the average lack of fit.

## Minimizing RSS($\beta$)

Re-write

$$\mathrm{RSS}(\beta) = \sum_{i=1}^n \Big(y_i - \beta_0 - \sum_{j=1}^p x_{ij}\beta_j\Big)^2$$

in vector and matrix notation as

$$\mathrm{RSS}(\beta) = (\mathbf{y} - X\beta)^t(\mathbf{y} - X\beta)$$

where $\beta = (\beta_0, \beta_1, \ldots, \beta_p)^t$, $\mathbf{y} = (y_1, \ldots, y_n)^t$ and

$$X = \begin{pmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1p} \\ 1 & x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{np} \end{pmatrix}$$

## Minimizing RSS($\beta$)

Want to find the $\beta$ which minimizes

$$\mathrm{RSS}(\beta) = (\mathbf{y} - X\beta)^t(\mathbf{y} - X\beta)$$

Differentiate $\mathrm{RSS}(\beta)$ w.r.t. $\beta$ to obtain

$$\frac{\partial\,\mathrm{RSS}}{\partial\beta} = -2X^t(\mathbf{y} - X\beta)$$

Set

$$\frac{\partial\,\mathrm{RSS}}{\partial\beta} = -2X^t(\mathbf{y} - X\beta) = 0$$

to obtain the unique solution

$$\hat{\beta} = (X^t X)^{-1} X^t \mathbf{y}$$

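As a concrete check of the closed form above, here is a minimal numpy sketch (synthetic data and made-up coefficients, purely for illustration) that computes $\hat{\beta} = (X^tX)^{-1}X^t\mathbf{y}$ by solving the normal equations and compares it with numpy's built-in solver:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 40, 3
X1 = rng.normal(size=(n, p))            # raw inputs
X = np.column_stack([np.ones(n), X1])   # prepend a column of ones for beta_0
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta_true + 0.6 * rng.normal(size=n)

# Closed form: beta_hat = (X^t X)^{-1} X^t y.
# Solve the normal equations rather than forming the inverse explicitly.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# lstsq (QR/SVD based) gives the same answer
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(beta_hat, beta_lstsq)
```

Solving the normal equations directly is fine for illustration; in practice a QR or SVD based solver such as `lstsq` is numerically preferable.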

## Estimates using $\hat{\beta}$

Given an input $x_0$ this model predicts its output as

$$\hat{y}_0 = (1, x_0^t)\,\hat{\beta}$$

The fitted values at the training inputs are

$$\hat{\mathbf{y}} = X\hat{\beta} = X(X^t X)^{-1}X^t\mathbf{y} = H\mathbf{y}$$

where $H = X(X^t X)^{-1}X^t$ is the hat matrix.

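The hat matrix's projection properties can be verified numerically. A small sketch (random synthetic data, assumed for illustration) checking that $H$ is symmetric, idempotent, has trace $p+1$, and leaves residuals orthogonal to the column space of $X$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 30, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = rng.normal(size=n)

# Hat matrix H = X (X^t X)^{-1} X^t: puts the "hat" on y
H = X @ np.linalg.solve(X.T @ X, X.T)
y_hat = H @ y

# H is an orthogonal projection: symmetric, idempotent, trace = p + 1
assert np.allclose(H, H.T)
assert np.allclose(H @ H, H)
assert np.isclose(np.trace(H), p + 1)

# residuals are orthogonal to every column of X
r = y - y_hat
assert np.allclose(X.T @ r, 0, atol=1e-8)
```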

## Geometric interpretation of the least squares estimate

[Figure 3.2 from the book: the $N$-dimensional geometry of least squares regression with two predictors. The outcome vector $\mathbf{y}$ is orthogonally projected onto the hyperplane spanned by the input vectors $\mathbf{x}_1$ and $\mathbf{x}_2$. The projection $\hat{\mathbf{y}}$ represents the vector of the least squares predictions.]

Let $X$ be the input data matrix. In the figure the vector of outputs $\mathbf{y}$ is orthogonally projected onto the hyperplane spanned by the column vectors $\mathbf{x}_{.1}$ and $\mathbf{x}_{.2}$.

The fitted values at the training inputs are

$$\hat{\mathbf{y}} = X\hat{\beta} = X(X^t X)^{-1}X^t\mathbf{y}$$

The projection $\hat{\mathbf{y}}$ represents the least squares estimate. The matrix $H = X(X^t X)^{-1}X^t$ computes this orthogonal projection; it is sometimes called the hat matrix because it puts the hat on $\mathbf{y}$.


An example

## Example of training data

Have training data in the following format: images, each annotated with the positions of $F$ facial landmark points.

Want to learn $F$ linear regression functions $f_i$:
- $f_i$ maps the image vector to the x-coord of the $i$th facial feature.
- Learn also $F$ regression functions $g_i$ for the y-coord.


[Figure residue: example input image and output landmark points for the regression problem, illustrated with $f_{14}$ and $g_{14}$.]

## These are not promising weight vectors!

[Figure residue: visualizations of the estimated weight vectors $\hat{\beta}_{x,14}$, $|\hat{\beta}_{x,14}|$, $\hat{\beta}_{y,14}$ and $|\hat{\beta}_{y,14}|$.]

The estimated landmark on a novel image is not even in the image: plain least squares fails miserably here.

Singular $X^t X$

## When is $X$ not full rank?

When not all the columns of $X$ are linearly independent. In this case $X^t X$ is singular and $\hat{\beta}$ is not uniquely defined.

The fitted values $\hat{\mathbf{y}} = X\hat{\beta}$ are still the projection of $\mathbf{y}$ onto the column space of $X$; there is just more than one way of expressing that projection in terms of the columns of $X$.

The non-full-rank case occurs when
- one or more of the qualitative inputs are encoded redundantly, or
- the number of inputs exceeds the number of training examples, $p > n$.

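The non-uniqueness of $\hat{\beta}$ versus the uniqueness of the fitted values can be seen numerically. A sketch (toy rank-deficient design with a deliberately duplicated column, assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20
x = rng.normal(size=n)
# a duplicated column makes X rank deficient, so X^t X is singular
X = np.column_stack([np.ones(n), x, x])
y = 3.0 + 2.0 * x + 0.1 * rng.normal(size=n)

assert np.linalg.matrix_rank(X) == 2          # 3 columns, rank only 2

# beta_hat is not uniquely defined, but the fitted values (the projection
# of y onto the column space of X) are still unique.
beta_pinv = np.linalg.pinv(X) @ y             # minimum-norm solution
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(X @ beta_pinv, X @ beta_lstsq)
```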

## What can we say about the distribution of $\hat{\beta}$?

Analysis of the distribution of $\hat{\beta}$ requires making some assumptions:
- the observations $y_i$ are uncorrelated, and
- the $y_i$ have constant variance $\sigma^2$.

The covariance matrix of $\hat{\beta}$ is then

$$\mathrm{Var}(\hat{\beta}) = \mathrm{Var}\big((X^t X)^{-1}X^t\mathbf{y}\big) = (X^t X)^{-1}X^t\,\mathrm{Var}(\mathbf{y})\,X(X^t X)^{-1} = (X^t X)^{-1}\sigma^2$$

Usually one estimates the variance $\sigma^2$ with

$$\hat{\sigma}^2 = \frac{1}{n-p-1}\sum_{i=1}^n (y_i - \hat{y}_i)^2$$

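These estimates are easy to compute. A minimal sketch (synthetic data with known noise level $\sigma = 0.6$, so the estimate can be sanity-checked):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 100, 2
sigma = 0.6
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
beta = np.array([1.0, 2.0, -0.5])
y = X @ beta + sigma * rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ beta_hat

# unbiased variance estimate uses n - p - 1 in the denominator
sigma2_hat = np.sum((y - y_hat) ** 2) / (n - p - 1)

# estimated covariance matrix of beta_hat: (X^t X)^{-1} sigma_hat^2
cov_beta = np.linalg.inv(X.T @ X) * sigma2_hat
se = np.sqrt(np.diag(cov_beta))   # standard errors of the coefficients
```

With this sample size `sigma2_hat` should land reasonably close to the true $\sigma^2 = 0.36$.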

## Analysis of the distribution of $\hat{\beta}$

To say more we need to make more assumptions. Therefore assume

$$Y = E(Y \mid X_1, X_2, \ldots, X_p) + \varepsilon = \beta_0 + \sum_{j=1}^p X_j\beta_j + \varepsilon$$

where $\varepsilon \sim N(0, \sigma^2)$.

Then it is easy to show that (assuming non-random $x_i$'s)

$$\hat{\beta} \sim N\big(\beta,\, (X^t X)^{-1}\sigma^2\big)$$

## Given this additive model, generate training sets

[Figure residue: plots of the true line $\beta_0 + \beta_1 x$, the sampled points $y_i$, and the fitted lines $\hat{\beta}_0 + \hat{\beta}_1 x$ for training sets $T_1, \ldots, T_{n_{\mathrm{trial}}}$.]

Each $T$ is a training set $\{(x_i, y_i)\}_{i=1}^n$ with $\beta = (1, 1)^t$, $n = 40$, $\sigma = 0.6$. In this simulation the $x_i$'s differ across trials.

## The distribution of $\hat{\beta}_1$

[Figure residue: histogram of the $\hat{\beta}_1$'s, roughly spanning 0.5 to 1.5.]

Each $T_i$ results in a different estimate of $\beta$. Shown for $n_{\mathrm{trial}} = 500$.

[Figure residue: the same simulation with $\beta = (1, 1)^t$, $n = 40$, $\sigma = 0.6$, but with the $x_i$'s fixed across trials.]

## The distribution of $\hat{\beta}_1$

[Figure residue: histogram of the $\hat{\beta}_1$'s for the fixed-design simulation.]

Each $T_i$ results in a different estimate of $\beta$. Shown for $n_{\mathrm{trial}} = 500$.

## Which $\beta_j$'s are probably zero?

To interpret the weights estimated by least squares it would help to test whether individual coefficients are zero under the assumed additive Gaussian model.

In this case compute the Z-score

$$z_j = \frac{\hat{\beta}_j}{\hat{\sigma}\sqrt{v_{jj}}}$$

where $v_{jj}$ is the $j$th diagonal element of $(X^t X)^{-1}$. If $\beta_j = 0$ then $z_j$ has a $t$-distribution with $n - p - 1$ degrees of freedom.

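A sketch of this test (synthetic simple regression whose slope really is zero, so the Z-score for the slope should look like an ordinary $t$-distributed draw while the intercept's Z-score is huge):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 40
x = rng.normal(size=n)
# true model: beta = (3, 0), i.e. the slope really is zero
y = 3.0 + 0.0 * x + 0.6 * rng.normal(size=n)

X = np.column_stack([np.ones(n), x])
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta_hat
sigma_hat = np.sqrt(resid @ resid / (n - 2))   # p = 1 predictor

v = np.diag(np.linalg.inv(X.T @ X))
z = beta_hat / (sigma_hat * np.sqrt(v))        # Z-scores

# under beta_1 = 0, z[1] follows a t-distribution with n - 2 dof,
# so a value beyond roughly 2 in absolute value would be surprising
```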

## Is $\beta_1$ zero?

[Figure residue: the density $p(\text{Z-score} \mid \beta_1 = 0)$, a $t$-distribution, together with the Z-scores for the $\hat{\beta}_1$'s, which here lie far out in the tail, around 10 to 30.]

[Figure residue: a new simulation. Each $T$ is a training set $\{(x_i, y_i)\}_{i=1}^n$ with $\beta = (3, 0)^t$, $n = 40$, $\sigma = 0.6$, and the $x_i$'s fixed across trials. Each $T_i$ results in a different estimate of $\beta$; the distribution of $\hat{\beta}_1$ for $n_{\mathrm{trial}} = 500$ is now centred at 0.]

## Is $\beta_1$ zero?

[Figure residue: the $t$-distribution $p(\text{Z-score} \mid \beta_1 = 0)$ with the $z_1$'s computed from the $\hat{\beta}$'s estimated with each $T_i$ overlaid; they now match the $t$-distribution.]

Obviously, even if we didn't know $\beta$ and only saw one $T_i$, we could use its Z-score to judge whether $\beta_1$ is probably zero.

## Other tests can be performed

We will not look into these, but you can
- test for the significance of groups of coefficients simultaneously, and
- get confidence bounds for $\beta_j$ centred at $\hat{\beta}_j$.

Gauss-Markov Theorem

## Gauss-Markov Theorem

A famous result in statistics:

The least squares estimate $\hat{\beta}^{ls}$ of the parameters $\beta$ has the smallest variance among all linear unbiased estimates.

To explain a simple case of the theorem, let $\theta = a^t\beta$. The least squares estimate of $a^t\beta$ is

$$\hat{\theta} = a^t\hat{\beta}^{ls} = a^t(X^t X)^{-1}X^t\mathbf{y}$$

If $X$ is fixed this is a linear function, $c_0^t\mathbf{y}$, of the response vector $\mathbf{y}$. If we assume $E[\mathbf{y}] = X\beta$ then $a^t\hat{\beta}^{ls}$ is unbiased.

## Gauss-Markov Theorem: Simple example

The Gauss-Markov Theorem states that any other linear estimator $\tilde{\theta} = c^t\mathbf{y}$ that is unbiased for $a^t\beta$ has

$$\mathrm{Var}[a^t\hat{\beta}^{ls}] \le \mathrm{Var}[c^t\mathbf{y}]$$

We have only stated the result for the estimation of one parameter $a^t\beta$, but it can be stated in terms of the entire parameter vector $\beta$.


## The Bias-Variance Trade-off (once again!)

Consider the mean-squared error of an estimator $\tilde{\theta}$ in estimating $\theta$:

$$\mathrm{MSE}(\tilde{\theta}) = E\big((\tilde{\theta} - \theta)^2\big) = \underbrace{\mathrm{Var}(\tilde{\theta})}_{\text{variance}} + \underbrace{\big(E(\tilde{\theta}) - \theta\big)^2}_{\text{squared bias}}$$

But there may be biased estimates with smaller MSE than the least squares estimate. In these cases we have traded an increase in squared bias for a reduction in variance.


Gram-Schmidt

## Multiple regression from simple univariate regression

Suppose we have a univariate model with no intercept:

$$Y = X\beta + \varepsilon$$

The least squares estimate is

$$\hat{\beta} = \frac{\langle \mathbf{x}, \mathbf{y}\rangle}{\langle \mathbf{x}, \mathbf{x}\rangle}$$

where $\mathbf{x} = (x_1, x_2, \ldots, x_n)^t$ and $\mathbf{y} = (y_1, y_2, \ldots, y_n)^t$. The residuals are given by

$$\mathbf{r} = \mathbf{y} - \mathbf{x}\hat{\beta}$$

Now say $x_i \in \mathbb{R}^p$ and the columns of $X$ are orthogonal; then each coefficient is a simple univariate estimate,

$$\hat{\beta}_j = \frac{\langle \mathbf{x}_{.j}, \mathbf{y}\rangle}{\langle \mathbf{x}_{.j}, \mathbf{x}_{.j}\rangle}$$


## OLS via successive orthogonalization

Inputs $X$ acquired from observations are rarely orthogonal, hence they have to be orthogonalized to take advantage of the previous insight.

Let $\mathbf{x}_{.0}$ be the 0th column of $X \in \mathbb{R}^{n\times 2}$ (the vector of ones). Then:

1. Regress $\mathbf{x}_{.1}$ on $\mathbf{x}_{.0}$, that is $\hat{\gamma} = \dfrac{\langle \mathbf{x}_{.0}, \mathbf{x}_{.1}\rangle}{\langle \mathbf{x}_{.0}, \mathbf{x}_{.0}\rangle}$, and let $\mathbf{z} = \mathbf{x}_{.1} - \hat{\gamma}\mathbf{x}_{.0}$.
2. Regress $\mathbf{y}$ on $\mathbf{z}$: $\hat{\beta}_1 = \dfrac{\langle \mathbf{z}, \mathbf{y}\rangle}{\langle \mathbf{z}, \mathbf{z}\rangle}$.

Then $\hat{\beta}_1\mathbf{z} = \hat{\beta}_1(\mathbf{x}_{.1} - \hat{\gamma}\mathbf{x}_{.0})$, so the fit is linear in the original columns of $X$.

The solution is the same as if one had directly calculated $\hat{\beta}^{ls}$; we have just used an orthogonal basis for the column space of $X$.

Note: Step 1 orthogonalized $\mathbf{x}_{.1}$ w.r.t. $\mathbf{x}_{.0}$. Step 2 is simple univariate regression using the orthogonalized predictor $\mathbf{z}$.

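The two steps above can be carried out directly. A sketch with a synthetic simple regression (intercept plus one predictor), checking that the slope from successive orthogonalization matches the direct least squares fit:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 50
x1 = rng.normal(size=n)
y = 2.0 - 1.5 * x1 + 0.3 * rng.normal(size=n)

x0 = np.ones(n)                       # column of ones
# Step 1: regress x1 on x0 and keep the residual z (orthogonal to x0)
gamma = (x0 @ x1) / (x0 @ x0)
z = x1 - gamma * x0
# Step 2: simple univariate regression of y on z gives the slope
beta1 = (z @ y) / (z @ z)

# same slope as the direct least squares fit
X = np.column_stack([x0, x1])
beta_ls = np.linalg.solve(X.T @ X, X.T @ y)
assert np.isclose(beta1, beta_ls[1])
```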

## OLS via successive orthogonalization

Can extend the process to when the $x_i$'s are $p$-dimensional; see Algorithm 3.1 in the book. At each iteration $j$ the multiple least squares regression problem is reduced to a sequence of simple univariate regressions on successively orthogonalized predictors.


Subset Selection

## Inadequacies of least squares estimates

Prediction accuracy:
- Least squares estimates often have low bias and high variance.
- This can hurt prediction accuracy.
- Frequently it is better to set some of the $\beta_j$'s to zero: this increases the bias but reduces the variance, and in turn can improve prediction accuracy.

Interpretation:
- With many correlated factors, it helps to identify the subset of predictors which are most informative.
- We may sacrifice small detail for the big picture.


## Best-Subset Selection

Best subset regression finds, for each $k \in \{0, 1, 2, \ldots, p\}$, the indices $j_1, j_2, \ldots, j_k$ with each $j_l \in \{1, 2, \ldots, p\}$ such that

$$\mathrm{RSS}(j_1, j_2, \ldots, j_k) = \min_{\beta_0, \beta_{j_1}, \ldots, \beta_{j_k}}\; \sum_{i=1}^n \Big(y_i - \beta_0 - \sum_{l=1}^k \beta_{j_l} x_{i,j_l}\Big)^2$$

is smallest.

There are $\binom{p}{k}$ different subsets to try for a given $k$. If $p \lesssim 40$ there exist computationally feasible algorithms for finding the best subsets.

The question still remains of how to choose the best value of $k$. Once again it is a trade-off between bias and variance...

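For small $p$ the exhaustive search is easy to write down. A brute-force sketch (synthetic data in which only two of six predictors matter; the function and data are illustrative assumptions, not from the slides):

```python
import numpy as np
from itertools import combinations

def best_subset(X, y, k):
    """Exhaustively fit all (p choose k) size-k column subsets of X
    (an intercept is always included) and return (best RSS, best subset)."""
    n, p = X.shape
    ones = np.ones((n, 1))
    best = (np.inf, None)
    for subset in combinations(range(p), k):
        Xs = np.hstack([ones, X[:, subset]])
        beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
        rss = np.sum((y - Xs @ beta) ** 2)
        if rss < best[0]:
            best = (rss, subset)
    return best

rng = np.random.default_rng(6)
n, p = 80, 6
X = rng.normal(size=(n, p))
# only predictors 0 and 3 actually enter the model
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 3] + 0.5 * rng.normal(size=n)
rss, subset = best_subset(X, y, k=2)
```

With this strong signal-to-noise ratio the search recovers the true pair of predictors; the cost grows combinatorially in $p$, which is exactly why the method is infeasible for large $p$.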

## Forward-Stepwise Selection

Instead of searching all possible subsets (infeasible for large $p$), build the subset greedily, one predictor at a time.

The steps of Forward-Stepwise Selection are:
- Set $I = \{1, \ldots, p\}$.
- For $l = 1, \ldots, k$ choose $j_l$ according to

$$j_l = \arg\min_{j \in I}\; \min_{\beta_0, \beta_{j_1}, \ldots, \beta_{j_{l-1}}, \beta_j}\; \sum_{i=1}^n \Big(y_i - \beta_0 - \sum_{s=1}^{l-1}\beta_{j_s}x_{ij_s} - \beta_j x_{ij}\Big)^2$$

and set $I = I\setminus\{j_l\}$.

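The greedy loop above translates almost line for line into code. A sketch (synthetic data, two truly active predictors out of eight, an assumption made for the demo):

```python
import numpy as np

def forward_stepwise(X, y, k):
    """Greedy selection: at each step add the predictor that, together
    with those already chosen (plus an intercept), gives the smallest RSS."""
    n, p = X.shape
    ones = np.ones((n, 1))
    chosen, remaining = [], set(range(p))
    for _ in range(k):
        best_j, best_rss = None, np.inf
        for j in remaining:
            Xs = np.hstack([ones, X[:, chosen + [j]]])
            beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
            rss = np.sum((y - Xs @ beta) ** 2)
            if rss < best_rss:
                best_j, best_rss = j, rss
        chosen.append(best_j)
        remaining.discard(best_j)
    return chosen

rng = np.random.default_rng(7)
n, p = 100, 8
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 1] - 1.0 * X[:, 4] + 0.5 * rng.normal(size=n)
order = forward_stepwise(X, y, k=2)
```

Each step fits at most $p$ models, so the total work is $O(kp)$ fits rather than the $\binom{p}{k}$ fits of best subset.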

## Forward-Stepwise Vs Best Subset Selection

- Forward-Stepwise may be sub-optimal compared to the best subset.
- It is computationally feasible for large $p \gg n$. Not true for best subset selection.
- With so many candidate subsets, best subset selection may overfit.
- Forward stepwise will probably produce a function with lower variance but perhaps more bias.


## Forward-Stagewise Regression

The steps of Forward-Stagewise Regression are:
- Set $\hat{\beta}_0 = \frac{1}{n}\sum_{i=1}^n y_i$ and $\hat{\beta}_1 = \hat{\beta}_2 = \cdots = \hat{\beta}_p = 0$.
- At each iteration compute the residuals

$$r_i = y_i - \hat{\beta}_0 - \sum_{j=1}^p \hat{\beta}_j x_{ij}$$

find the predictor $\mathbf{x}_{.j}$ most correlated with $\mathbf{r}$, and update

$$\hat{\beta}_j \leftarrow \hat{\beta}_j + \delta\,\mathrm{sign}(\langle \mathbf{x}_{.j}, \mathbf{r}\rangle)$$

for some small step size $\delta > 0$.
- Stop iterating when the residuals are uncorrelated with all the predictors.

## Forward-Stagewise Regression: What to note...

- Only one $\hat{\beta}_j$ is updated at each iteration.
- A $\hat{\beta}_j$ can be updated at several different iterations.
- It can be slow to reach the least squares fit.
- But slow fitting may not be such a bad thing in high-dimensional problems.

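The stagewise updates can be sketched as below. The step size `delta` and the iteration budget are assumptions for the demo (the slides leave the increment unspecified); with many tiny steps the coefficients creep toward the least squares values:

```python
import numpy as np

rng = np.random.default_rng(8)
n, p = 100, 5
X = rng.normal(size=(n, p))
X = X - X.mean(axis=0)                        # centred inputs
y = 1.5 * X[:, 0] - 2.0 * X[:, 2] + 0.3 * rng.normal(size=n)

beta0 = y.mean()                              # intercept fixed at the mean
beta = np.zeros(p)
delta = 0.01                                  # small step size (assumption)
for _ in range(5000):
    r = y - beta0 - X @ beta                  # current residuals
    corr = X.T @ r
    j = np.argmax(np.abs(corr))               # predictor most correlated with r
    if np.abs(corr[j]) < 1e-6:                # stop: residuals uncorrelated
        break
    beta[j] += delta * np.sign(corr[j])       # tiny update of one beta_j only
```

After many small steps `beta` sits within roughly `delta` of the least squares coefficients, which illustrates both the slowness and the implicit regularization of the path.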

Shrinkage methods

## Why shrinkage methods?

Selecting a subset of predictors produces a model that is interpretable and possibly has lower prediction error than the full model. However, because subset selection is a discrete process (predictors are either retained or dropped), it often exhibits high variance. Shrinkage methods are more continuous and do not suffer as much from high variance.


Ridge Regression

## Shrinkage Method 1: Ridge Regression

Ridge regression shrinks the $\beta_j$'s by imposing a penalty on their size, controlled by a non-negative complexity parameter $\lambda$:

$$\hat{\beta}^{ridge} = \arg\min_\beta\; \underbrace{\sum_{i=1}^n \Big(y_i - \beta_0 - \sum_{j=1}^p x_{ij}\beta_j\Big)^2}_{\text{residual sum-of-squares}} \;+\; \underbrace{\lambda\sum_{j=1}^p \beta_j^2}_{\text{regularizer}}$$


An equivalent way to write the ridge problem is

$$\hat{\beta}^{ridge} = \arg\min_\beta\; \sum_{i=1}^n \Big(y_i - \beta_0 - \sum_{j=1}^p x_{ij}\beta_j\Big)^2 \quad\text{subject to}\quad \sum_{j=1}^p \beta_j^2 \le t$$

which makes the size constraint on the $\beta_j$'s explicit. There is a one-to-one correspondence between the parameters $\lambda$ and $t$ in the two formulations.

Note that the intercept $\beta_0$ is left out of the penalty: penalizing it would make the procedure depend on the origin chosen for the $y_i$'s, so that adding a constant to every $y_i$ would do more than simply shift the predictions by that change.

## Ridge Regression and centering of data

The centered version of the input data is

$$\tilde{x}_{ij} = x_{ij} - \frac{1}{n}\sum_{s=1}^n x_{sj}$$

Then the ridge regression coefficients found using the centered data,

$$\hat{\beta}^c = \arg\min_{\beta^c}\; \sum_{i=1}^n \Big(y_i - \beta_0^c - \sum_{j=1}^p \tilde{x}_{ij}\beta_j^c\Big)^2 + \lambda\sum_{j=1}^p (\beta_j^c)^2,$$

are related to the coefficients found using the original data via

$$\hat{\beta}_0^c = \frac{1}{n}\sum_{i=1}^n y_i = \bar{y}, \qquad \hat{\beta}_j^c = \hat{\beta}_j^{ridge} \quad\text{for } j = 1, \ldots, p$$

and

$$\hat{\beta}_0^{ridge} = \bar{y} - \sum_{j=1}^p \bar{x}_{.j}\,\hat{\beta}_j^{ridge}$$

## Ridge Regression and centering of data

If the $y$'s have zero mean then $\hat{\beta}_0^c = 0$. So we can drop the intercept term from the linear model if the input and output data are centred:

$$\hat{\beta} = \arg\min_\beta\; \sum_{i=1}^n \Big(y_i - \sum_{j=1}^p x_{ij}\beta_j\Big)^2 + \lambda\sum_{j=1}^p \beta_j^2 = \arg\min_\beta\; \Big\{(\mathbf{y} - X\beta)^t(\mathbf{y} - X\beta) + \lambda\,\beta^t\beta\Big\}$$

where now

$$X = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{np} \end{pmatrix}$$

has no column of ones.


## Ridge Regression and centering of data

For the rest of the lecture we will assume centered input and output data. The ridge regression solution is then given by

$$\hat{\beta}^{ridge} = (X^t X + \lambda I_p)^{-1} X^t \mathbf{y}$$

Note that while $X^t X$ is potentially singular, $X^t X + \lambda I_p$ with $\lambda > 0$ is not, so the problem of inverting a singular matrix disappears.

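A sketch of the closed-form ridge solution on centred synthetic data (values of $\lambda$ and the data itself are illustrative assumptions), checking the basic shrinkage property that the penalized coefficient vector is shorter than the least squares one:

```python
import numpy as np

rng = np.random.default_rng(9)
n, p = 50, 10
X = rng.normal(size=(n, p))
X -= X.mean(axis=0)                          # centre inputs
y = X[:, 0] - X[:, 1] + 0.5 * rng.normal(size=n)
y -= y.mean()                                # centre output: no intercept needed

lam = 5.0
# ridge solution: (X^t X + lam I_p)^{-1} X^t y
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
beta_ls = np.linalg.solve(X.T @ X, X.T @ y)

# the penalty shrinks the coefficient vector toward zero
assert np.linalg.norm(beta_ridge) < np.linalg.norm(beta_ls)
```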

## Insight into Ridge Regression

Compute the SVD of the $n \times p$ input matrix $X$:

$$X = UDV^t$$

where
- $U$ is an $n \times p$ orthogonal matrix,
- $V$ is a $p \times p$ orthogonal matrix, and
- $D$ is a $p \times p$ diagonal matrix with entries $d_1 \ge d_2 \ge \cdots \ge d_p \ge 0$.

The least squares fitted vector is then

$$X\hat{\beta}^{ls} = X(X^t X)^{-1}X^t\mathbf{y} = UU^t\mathbf{y}$$

which is the closest approximation to $\mathbf{y}$ in the subspace spanned by the columns of $U$ (= the column space of $X$).

## Insight into Ridge Regression

Can write the ridge regression fitted vector as

$$X\hat{\beta}^{ridge} = X(X^t X + \lambda I_p)^{-1}X^t\mathbf{y} = UD(D^2 + \lambda I_p)^{-1}DU^t\mathbf{y} = \sum_{j=1}^p \mathbf{u}_j\,\frac{d_j^2}{d_j^2 + \lambda}\,\mathbf{u}_j^t\mathbf{y}$$

where the $\mathbf{u}_j$'s are the columns of $U$.

As $\lambda \ge 0$ we have $0 \le d_j^2/(d_j^2 + \lambda) \le 1$.

Ridge regression computes the coordinates of $\mathbf{y}$ w.r.t. the orthonormal basis $U$. It then shrinks these coordinates by the factors $d_j^2/(d_j^2 + \lambda)$. More shrinkage is applied to basis vectors with smaller $d_j^2$.

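The SVD identity above can be verified directly (a numpy sketch of my own, not part of the lecture): the ridge fit computed from the normal equations coincides with the shrunken-coordinates form.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 30, 4
X = rng.standard_normal((n, p))
X -= X.mean(axis=0)
y = rng.standard_normal(n)
y -= y.mean()
lam = 2.0

U, d, Vt = np.linalg.svd(X, full_matrices=False)   # X = U D V^t, d_1 >= ... >= d_p
fit_direct = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

shrink = d**2 / (d**2 + lam)                       # shrinkage factors, all in [0, 1]
fit_svd = U @ (shrink * (U.T @ y))                 # sum_j u_j * d_j^2/(d_j^2+lam) * u_j^t y

print(np.max(np.abs(fit_direct - fit_svd)))        # numerically zero
```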

## Insight into Ridge Regression: What are the d2j s ?

The sample covariance matrix of the (centred) data is given by

S = \frac{1}{n} X^t X

From the SVD of X we know that

X^t X = V D^2 V^t \quad \text{(the eigen-decomposition of } X^t X\text{)}

The eigenvectors v_j (the columns of V) are the principal component directions of X.

Project the data onto the first principal component direction v_1 to get z_i^{(1)} = v_1^t x_i. The variance of the z_i^{(1)} s is given by (remember the x_i s are centred)

\frac{1}{n} \sum_{i=1}^{n} (z_i^{(1)})^2 = \frac{1}{n} \sum_{i=1}^{n} v_1^t x_i x_i^t v_1 = \frac{1}{n} v_1^t X^t X v_1 = \frac{d_1^2}{n}

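The relation between singular values and principal component variances can be checked directly (a numpy sketch with an assumed synthetic data set, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 200, 3
# synthetic data with very different variances along the coordinate axes
X = rng.standard_normal((n, p)) * np.array([3.0, 1.0, 0.3])
X -= X.mean(axis=0)                     # centre, as the slide assumes

U, d, Vt = np.linalg.svd(X, full_matrices=False)
v1 = Vt[0]                              # first principal component direction (unit length)
z1 = X @ v1                             # projections z_i^(1) = v1^t x_i
print(z1.var(), d[0]**2 / n)            # the two agree
```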

## Insight into Ridge Regression: What are the d2j s ?

v_1 represents the direction (of unit length) in which the projected points have largest variance.

[Scatter plot of the data in the (X_1, X_2) plane with the largest and smallest principal component directions marked.]

FIGURE 3.9. Principal components of some input data points. The largest principal component is the direction that maximizes the variance of the projected data, and the smallest principal component minimizes that variance. Ridge regression projects y onto these components, and then shrinks the coefficients of the low-variance components more than the high-variance components.

Subsequent principal components z^{(j)} have maximum variance d_j^2 / n subject to v_j being orthogonal to the earlier directions.


## Insight into Ridge Regression: What are the d2j s ?

The last principal component has minimum variance.
Hence the small d_j correspond to the directions of the column space of X having small variance.
Ridge regression shrinks these directions the most!
The estimated directions v_j s with small d_j have more uncertainty associated with the estimate. (Using a narrow baseline to estimate a direction.) Ridge regression protects against relying on these high variance directions.
The implicit assumption is that the response varies most in the directions of high variance of the inputs. A reasonable assumption, but not always true.


## Insight into Ridge Regression: What are the d2j s ?

The book defines the effective degrees of freedom of the ridge regression fit as

df_{ridge}(\lambda) = \sum_{j=1}^{p} \frac{d_j^2}{d_j^2 + \lambda}

We will derive this later on in the course. But it is interesting as
df_{ridge}(\lambda) \to p when \lambda \to 0 (ordinary least squares) and
df_{ridge}(\lambda) \to 0 when \lambda \to \infty.
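A small numpy sketch of this behaviour (my own illustration on random inputs, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 50, 8
X = rng.standard_normal((n, p))
X -= X.mean(axis=0)

d = np.linalg.svd(X, compute_uv=False)          # singular values d_1 >= ... >= d_p

def df_ridge(lam):
    """Effective degrees of freedom: sum_j d_j^2 / (d_j^2 + lam)."""
    return np.sum(d**2 / (d**2 + lam))

print(df_ridge(0.0))        # = p  (ordinary least squares)
print(df_ridge(1e8))        # close to 0 (extreme shrinkage)
```

Note that df_ridge is monotonically decreasing in \lambda, which is what makes it usable as a single "model complexity" axis for the coefficient-profile plots below.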

## Landmark estimation using ridge regression

[Slides showing, for landmark 14, the input and output training points and the estimated coefficient images \beta_{x,14}, |\beta_{x,14}|, \beta_{y,14}, |\beta_{y,14}|, together with the estimated landmark on a novel image, for \lambda = 1, 10^2, 10^3, 5000 and 10^4 (df(\lambda) \approx 232, 187, 97, 44 and 29 respectively). For \lambda = 1000 (df \approx 97) the ground truth and estimated points are compared.]

## How the coefficients vary with λ

This is an example from the book. Notice how the weights associated with each predictor (lcavol, lweight, age, lbph, svi, lcp, gleason, pgg45) vary with λ.

FIGURE 3.8. Profiles of ridge coefficients for the prostate cancer example, as the tuning parameter λ is varied. Coefficients are plotted versus df(λ), the effective degrees of freedom. A vertical line is drawn at df = 5.0, the value chosen by cross-validation.

The Lasso

## Shrinkage Method 2: The Lasso

The lasso estimate is defined by

\hat{\beta}^{lasso} = \arg\min_{\beta} \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} |\beta_j| \le t

The penalty is the L1 instead of the L2 norm.

An equivalent formulation of the lasso problem is

\hat{\beta}^{lasso} = \arg\min_{\beta} \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j|

Unlike ridge regression, the lasso has no closed form solution.

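Since there is no closed form, the lasso is computed iteratively. Below is a minimal cyclic coordinate descent sketch for the Lagrangian criterion above (my own illustration; the soft-thresholding update is the standard one for this criterion, not code from the lecture). The key point it demonstrates: the update produces coefficients that are exactly zero.

```python
import numpy as np

def soft_threshold(z, gamma):
    """sign(z) * (|z| - gamma)_+ : this is what creates exact zeros."""
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

def lasso_cd(X, y, lam, n_sweeps=500):
    """Minimize sum_i (y_i - x_i^t beta)^2 + lam * sum_j |beta_j|
    by cyclic coordinate descent (assumes centred X and y, no intercept)."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X**2).sum(axis=0)
    for _ in range(n_sweeps):
        for j in range(p):
            r_j = y - X @ beta + X[:, j] * beta[j]   # partial residual excluding x_j
            beta[j] = soft_threshold(X[:, j] @ r_j, lam / 2.0) / col_sq[j]
    return beta

rng = np.random.default_rng(4)
n, p = 60, 10
X = rng.standard_normal((n, p)); X -= X.mean(axis=0)
beta_true = np.array([3.0, -2.0] + [0.0] * (p - 2))
y = X @ beta_true + 0.5 * rng.standard_normal(n)
y -= y.mean()

beta_hat = lasso_cd(X, y, lam=50.0)
print(beta_hat)            # several coefficients are exactly 0
```

The per-coordinate update comes from minimizing \sum_i (r_i - x_{ij}\beta_j)^2 + \lambda|\beta_j| exactly, which is what makes coordinate descent so natural for the lasso.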

\hat{\beta}^{lasso} = \arg\min_{\beta} \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} |\beta_j| \le t

Because of the L1 constraint, making t small will force some of the \beta_j s to be exactly 0.
Lasso does some kind of continuous subset selection.
However, the nature of the shrinkage is not so obvious.
If t \ge \sum_{j=1}^{p} |\hat{\beta}_j^{ls}| then the lasso estimates coincide with the least squares estimates.


## Landmark estimation using lasso

[Slides showing, for landmark 14, the estimated coefficient images \beta_{x,14}, |\beta_{x,14}|, \beta_{y,14}, |\beta_{y,14}|, together with the estimated landmark on a novel image, for \lambda = .01, .04, .1, .3, .5 and .7 (df \approx 93, 53, 21, 7 and 2 for the latter five). For \lambda = .04 (df \approx 93) the ground truth and estimated points are compared.]

## Subset Selection Vs Ridge Regression Vs Lasso

When X has orthonormal columns:
This implies d_j = 1 for j = 1, \dots, p.
In this case each method applies a simple transformation to \hat{\beta}_j^{ls}:

Estimator              Formula
Best subset (size M)   \hat{\beta}_j^{ls} \cdot Ind(|\hat{\beta}_j^{ls}| \ge |\hat{\beta}_{(M)}^{ls}|)
Ridge                  \hat{\beta}_j^{ls} / (1 + \lambda)
Lasso                  sign(\hat{\beta}_j^{ls}) \, (|\hat{\beta}_j^{ls}| - \lambda)_+

where \hat{\beta}_{(M)}^{ls} is the M th largest coefficient (in absolute value).
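The three transformations can be written down directly. A small numpy sketch applied to some example least squares coefficients (the input vector is my own example):

```python
import numpy as np

def best_subset(beta_ls, M):
    """Keep the M largest-magnitude coefficients, zero the rest (hard thresholding)."""
    thresh = np.sort(np.abs(beta_ls))[-M]       # |beta_(M)|: M-th largest magnitude
    return beta_ls * (np.abs(beta_ls) >= thresh)

def ridge_shrink(beta_ls, lam):
    """Uniform proportional shrinkage: never exactly zero."""
    return beta_ls / (1.0 + lam)

def lasso_shrink(beta_ls, lam):
    """Soft thresholding: shifts towards zero and clips small values to exactly 0."""
    return np.sign(beta_ls) * np.maximum(np.abs(beta_ls) - lam, 0.0)

beta_ls = np.array([2.0, -1.5, 0.4, -0.1])
print(best_subset(beta_ls, 2))     # keeps 2.0 and -1.5, zeros the rest
print(ridge_shrink(beta_ls, 1.0))  # halves every coefficient
print(lasso_shrink(beta_ls, 0.5))  # small coefficients become exactly 0
```

This makes the qualitative difference in the table concrete: best subset is discontinuous, ridge shrinks everything proportionally, and the lasso both shrinks and selects.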

## When X does not have orthogonal columns

Red elliptical contours show the iso-scores of RSS(\beta).
Cyan regions show the feasible regions \beta_1^2 + \beta_2^2 \le t^2 and |\beta_1| + |\beta_2| \le t respectively.

FIGURE 3.11. Estimation picture for the lasso (left) and ridge regression (right). Shown are contours of the error and constraint functions. The solid blue areas are the constraint regions |\beta_1| + |\beta_2| \le t and \beta_1^2 + \beta_2^2 \le t^2, respectively.

TABLE 3.4. Estimators of \beta_j in the case of orthonormal columns of X. M and \lambda are constants chosen by the corresponding techniques; sign denotes the sign of its argument (\pm 1), and x_+ denotes the positive part of x. Below the table, estimators are shown by broken red lines. The 45° line in gray shows the unrestricted estimate for reference.

## Ridge Regression Vs Lasso

When X does not have orthogonal columns:
Both methods choose the first point where the elliptical contours hit the constraint region.
If this first hit occurs at a corner of the lasso's diamond, then one \beta_j = 0.
When p > 2 the diamond becomes a rhomboid with many corners and flat edges \Rightarrow many opportunities for \beta_j s to be 0.

FIGURE 3.11. Estimation picture for the lasso (left) and ridge regression (right). Shown are contours of the error and constraint functions. The solid blue areas are the constraint regions |\beta_1| + |\beta_2| \le t and \beta_1^2 + \beta_2^2 \le t^2, respectively.

A generalization of ridge regression and the lasso:

\tilde{\beta} = \arg\min_{\beta} \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j|^q

q = 0 - Variable subset selection
q = 1 - Lasso
q = 2 - Ridge regression

Can try other values of q.
When q \ge 1 we still have a convex problem.
When 0 \le q < 1 we do not have a convex problem.
When q \le 1 sparse solutions are explicitly encouraged.
When q > 1 we cannot set coefficients exactly to zero.

[Figure: contours of constant value of \sum_j |\beta_j|^q in the (\beta_1, \beta_2) plane, for q = 4, 2, 1, 0.5 and 0.1.]

From the book: Thinking of |\beta_j|^q as the log-prior density for \beta_j, these are also the equi-contours of the prior distribution of the parameters. The value q = 0 corresponds to variable subset selection, as the penalty simply counts the number of nonzero parameters; q = 1 corresponds to the lasso, while q = 2 to ridge regression. Notice that for q \le 1, the prior is not uniform in direction, but concentrates more mass in the coordinate directions. The prior corresponding to the q = 1 case is an independent double exponential (or Laplace) distribution for each input, with density (1/2\tau) \exp(-|\beta|/\tau) and \tau = 1/\lambda. The case q = 1 (lasso) is the smallest q such that the constraint region is convex; non-convex constraint regions make the optimization problem more difficult. In this view, the lasso, ridge regression and best subset selection are Bayes estimates with different priors. Note, however, that they are derived as posterior modes, that is, maximizers of the posterior. It is more common to use the mean of the posterior as the Bayes estimate. Ridge regression is also the posterior mean, but the lasso and best subset selection are not. Looking again at the criterion, we might try using other values of q besides 0, 1, or 2. Although one might consider estimating q from the data, our experience is that it is not worth the effort for the extra variance incurred. Values of q \in (1, 2) suggest a compromise between the lasso and ridge regression. Although this is the case, with q > 1, |\beta_j|^q is differentiable at 0, and so does not share the ability of the lasso (q = 1) for setting coefficients exactly to zero.

## Elastic-net penalty

A compromise between the ridge and lasso penalty is the elastic-net penalty:

\lambda \sum_{j=1}^{p} \big( \alpha \beta_j^2 + (1 - \alpha) |\beta_j| \big)

The elastic-net selects variables like the lasso, and shrinks together the coefficients of correlated predictors like ridge regression.
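The "shrinks together" claim can be seen on a toy problem with two identical predictors (a coordinate descent sketch of the criterion RSS + \lambda\sum_j(\alpha\beta_j^2 + (1-\alpha)|\beta_j|); the duplicated-column setup and all parameter choices are my own illustration, not from the lecture):

```python
import numpy as np

def enet_cd(X, y, lam, alpha, n_sweeps=2000):
    """Minimize sum_i (y_i - x_i^t beta)^2 + lam * sum_j (alpha*beta_j^2 + (1-alpha)*|beta_j|)
    by cyclic coordinate descent (assumes centred X and y)."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X**2).sum(axis=0)
    for _ in range(n_sweeps):
        for j in range(p):
            r_j = y - X @ beta + X[:, j] * beta[j]       # partial residual excluding x_j
            z = X[:, j] @ r_j
            # soft-threshold for the L1 part, extra denominator shrinkage for the L2 part
            beta[j] = np.sign(z) * max(abs(z) - lam * (1 - alpha) / 2.0, 0.0) \
                      / (col_sq[j] + lam * alpha)
    return beta

rng = np.random.default_rng(5)
n = 50
x = rng.standard_normal(n)
X = np.column_stack([x, x, rng.standard_normal(n)])      # two identical predictors
X -= X.mean(axis=0)
y = 2.0 * x + 0.1 * rng.standard_normal(n)
y -= y.mean()

beta = enet_cd(X, y, lam=1.0, alpha=0.5)
print(beta)   # the two duplicated predictors receive (nearly) equal coefficients
```

With the pure lasso (\alpha = 0) the split between the two identical columns would be arbitrary; the quadratic term makes the objective strictly convex and forces an even split.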

## Definition of the effective degrees of freedom

Traditionally the number of linearly independent parameters is what is meant by the degrees of freedom of a fit. But an adaptively or regularized fit with k parameters can have fewer (or more) than k dofs.
The effective degrees of freedom of an adaptively fitted \hat{y} is

df(\hat{y}) = \frac{1}{\sigma^2} \sum_{i=1}^{n} Cov(\hat{y}_i, y_i)

where Cov(\hat{y}_i, y_i) is the covariance between the fitted value \hat{y}_i and its corresponding observation y_i.
Intuitively, the harder we fit to the data, the larger this covariance and hence df(\hat{y}).

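The covariance definition can be checked by simulation against the closed-form ridge value \sum_j d_j^2/(d_j^2 + \lambda) from earlier in the lecture (a numpy sketch; the simulation setup is my own):

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 40, 8
X = rng.standard_normal((n, p))
X -= X.mean(axis=0)
lam = 5.0
sigma = 1.0

# ridge is a linear smoother y_hat = S y with S = X (X^t X + lam I_p)^{-1} X^t
S = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)

# Monte Carlo estimate of (1/sigma^2) * sum_i Cov(y_hat_i, y_i).
# For a linear smoother the covariance does not depend on E[y],
# so we simulate the noise directly: y = mu + eps, y_hat - E[y_hat] = S eps.
reps = 4000
eps = sigma * rng.standard_normal((reps, n))
y_hat_dev = eps @ S.T
df_mc = (eps * y_hat_dev).sum(axis=1).mean() / sigma**2

d = np.linalg.svd(X, compute_uv=False)
df_closed = np.sum(d**2 / (d**2 + lam))     # = trace(S), the formula from the ridge slides
print(df_mc, df_closed)                     # the two agree up to Monte Carlo error
```

For any linear smoother \hat{y} = S y the definition reduces to df = trace(S), which is exactly how the ridge formula \sum_j d_j^2/(d_j^2+\lambda) arises.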