
Chapter 5: Basis Expansion and Regularization

DD3364

April 1, 2012

Introduction

Moving beyond linearity


Main idea
Augment the vector of inputs X with additional variables.
These are transformations of X

  h_m(X) : ℝ^p → ℝ,   m = 1, …, M.

Then model the relationship between X and Y

  f(X) = Σ_{m=1}^M β_m h_m(X) = Σ_{m=1}^M β_m Z_m

as a linear basis expansion in X.

Have a linear model w.r.t. Z. Can use the same methods as before.

Which transformations?
Some examples
Linear:

  h_m(X) = X_m,   m = 1, …, p

Polynomial:

  h_m(X) = X_j²   or   h_m(X) = X_j X_k

Non-linear transformation of single inputs:

  h_m(X) = log(X_j), √X_j, …

Non-linear transformation of multiple inputs:

  h_m(X) = ‖X‖

Use of indicator functions:

  h_m(X) = Ind(L_m ≤ X_k < U_m)

Pros and Cons of this augmentation

Pros
Can model more complicated decision boundaries.
Can model more complicated regression relationships.

Cons
Lack of locality in global basis functions.
Solution: use local polynomial representations such as piecewise polynomials and splines.

How should one find the correct complexity of the model?
There is the danger of over-fitting.


Controlling the complexity of the model

Common approaches taken:

Restriction Methods
Limit the class of functions considered, e.g. use additive models

  f(X) = Σ_{j=1}^p Σ_{m=1}^{M_j} β_{jm} h_{jm}(X_j)

Selection Methods
Scan the set of h_m and only include those that contribute significantly to the fit of the model (e.g. Boosting, CART).

Regularization Methods
Let

  f(X) = Σ_{j=1}^M β_j h_j(X)

but when learning the β_j's restrict their values in the manner of ridge regression and the lasso.
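The regularization approach can be sketched in a few lines: build a design matrix from a basis expansion of x and shrink the β_j's in the manner of ridge regression (a minimal numpy sketch; the data, basis size M, and λ are illustrative assumptions, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 50)
y = np.sin(3 * x) + 0.1 * rng.standard_normal(50)

# Basis expansion h_j(x) = x^(j-1), j = 1..M, collected in a design matrix
M = 10
H = np.vander(x, M, increasing=True)          # n x M, columns x^0 .. x^(M-1)

# Ridge-style shrinkage of the beta_j's: beta = (H'H + lam I)^-1 H'y
lam = 1e-3
beta = np.linalg.solve(H.T @ H + lam * np.eye(M), H.T @ y)

yhat = H @ beta
print(np.mean((y - yhat) ** 2))
```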

Piecewise Polynomials and Splines

Piecewise polynomial function

To obtain a piecewise polynomial function f(X):
Divide the domain of X into contiguous intervals.
Represent f by a separate polynomial in each interval.

Examples (panels of Figure 5.1): Piecewise Constant, Piecewise Linear, Continuous Piecewise Linear, Piecewise-linear Basis Function.
Blue curve: the ground-truth function.
Green curve: the piecewise constant/linear fit to the training data.


Example: Piecewise constant function

Divide [a, b], the domain of X, into three regions

  [a, ξ₁), [ξ₁, ξ₂), [ξ₂, b]   with ξ₁ < ξ₂.

The ξ_i's are referred to as knots.

Define three basis functions

  h₁(X) = Ind(X < ξ₁),  h₂(X) = Ind(ξ₁ ≤ X < ξ₂),  h₃(X) = Ind(ξ₂ ≤ X)

The model f(X) = Σ_{m=1}^3 β_m h_m(X) is fit using least-squares.

As the basis functions don't overlap, β̂_m = mean of the y_i's in the m-th region.

(Figure 5.1, top left panel: a piecewise constant function fit to some artificial data.)
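The claim that β̂_m is just the region mean can be verified directly (a sketch; the knots and data are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 200)
y = np.where(x < 0.3, 0.0, np.where(x < 0.7, 1.0, 0.5)) + 0.05 * rng.standard_normal(200)

xi1, xi2 = 0.3, 0.7                              # the two knots
H = np.column_stack([x < xi1,                    # h1 = Ind(X < xi1)
                     (xi1 <= x) & (x < xi2),     # h2 = Ind(xi1 <= X < xi2)
                     xi2 <= x]).astype(float)    # h3 = Ind(xi2 <= X)

beta, *_ = np.linalg.lstsq(H, y, rcond=None)     # least-squares fit

# Non-overlapping indicators => beta_m equals the mean of y in region m
print(beta)
print([y[x < xi1].mean(), y[(xi1 <= x) & (x < xi2)].mean(), y[xi2 <= x].mean()])
```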

Example: Piecewise linear function

In this case define 6 basis functions

  h₁(X) = Ind(X < ξ₁),   h₂(X) = Ind(ξ₁ ≤ X < ξ₂),   h₃(X) = Ind(ξ₂ ≤ X)
  h₄(X) = X h₁(X),       h₅(X) = X h₂(X),            h₆(X) = X h₃(X)

The model f(X) = Σ_{m=1}^6 β_m h_m(X) is fit using least-squares.

As the basis functions don't overlap, this fits a separate linear model to the data in each region.

(Figure 5.1, top right panel: an unrestricted piecewise linear fit.)

Example: Continuous piecewise linear function

Additionally impose the constraint that f(X) is continuous at ξ₁ and ξ₂.

With the six basis functions above, this means

  β₁ + β₄ ξ₁ = β₂ + β₅ ξ₁,  and
  β₂ + β₅ ξ₂ = β₃ + β₆ ξ₂

This reduces the # of dof of f(X) from 6 to 4.

FIGURE 5.1. The top left panel shows a piecewise constant function fit to some artificial data. The broken vertical lines indicate the positions of the two knots ξ₁ and ξ₂. The blue curve represents the true function, from which the data were generated with Gaussian noise. The remaining two panels show piecewise linear functions fit to the same data: the top right unrestricted, and the lower left restricted to be continuous at the knots. The lower right panel shows a piecewise linear basis function, h₃(X) = (X − ξ₁)₊, continuous at ξ₁. The black points indicate the sample evaluations h₃(x_i), i = 1, …, N.

A more compact set of basis functions

To impose the continuity constraints directly, can use this basis instead:

  h₁(X) = 1,           h₂(X) = X
  h₃(X) = (X − ξ₁)₊,   h₄(X) = (X − ξ₂)₊

Smoother f(X)
Can achieve a smoother f(X) by increasing the order
of the local polynomials, or
of the continuity at the knots.

Piecewise Cubic Polynomials

Piecewise-cubic polynomials with increasing orders of continuity (panels of Figure 5.2): Discontinuous, Continuous, Continuous First Derivative, Continuous Second Derivative.

f(X) is a cubic spline if
it is a piecewise cubic polynomial and
has continuous 1st and 2nd derivatives at the knots.

FIGURE 5.2. A series of piecewise-cubic polynomials, with increasing orders of continuity at the knots. The function in the lower right panel is continuous, and has continuous first and second derivatives at the knots. It is known as a cubic spline. Enforcing one more order of continuity would lead to a global cubic polynomial.

A cubic spline

It is not hard to show (Exercise 5.1) that the following basis represents a cubic spline with knots at ξ₁ and ξ₂:

  h₁(X) = 1,    h₃(X) = X²,    h₅(X) = (X − ξ₁)³₊
  h₂(X) = X,    h₄(X) = X³,    h₆(X) = (X − ξ₂)³₊

Order-M spline
An order-M spline with knots ξ₁, …, ξ_K is
a piecewise polynomial of order M and
has continuous derivatives up to order M − 2.

The general form for the truncated-power basis set is

  h_j(X) = X^{j−1},   j = 1, …, M
  h_{M+l}(X) = (X − ξ_l)₊^{M−1},   l = 1, …, K

In practice the most widely used orders are M = 1, 2, 4.
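The truncated-power basis set can be generated directly from its definition (a sketch; M = 4 with two knots reproduces the cubic-spline basis above):

```python
import numpy as np

def truncated_power_basis(x, knots, M):
    """h_j(X) = X^(j-1) for j = 1..M, plus h_{M+l}(X) = (X - xi_l)_+^(M-1)."""
    x = np.asarray(x, float)
    cols = [x ** j for j in range(M)]                          # X^0 .. X^(M-1)
    cols += [np.maximum(x - xi, 0.0) ** (M - 1) for xi in knots]
    return np.column_stack(cols)

x = np.linspace(0, 1, 11)
H = truncated_power_basis(x, knots=[0.3, 0.7], M=4)   # order 4 = cubic spline
print(H.shape)   # (11, 6): M + K columns
```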



Regression Splines
Fixed-knot splines are known as regression splines.
For a regression spline one needs to select
the order of the spline,
the number of knots and
the placement of the knots.

One common approach is to set a knot at each observation x_i.

There are many equivalent bases for representing splines; the truncated power basis is intuitively attractive but not computationally attractive.

A better basis set for implementation is the B-spline basis set.
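The B-spline basis the slide recommends can be evaluated with the Cox-de Boor recursion (a self-contained sketch; the knot vector and degree are illustrative, and production code would use a library routine):

```python
import numpy as np

def bspline_basis(x, t, k):
    """All degree-k B-spline basis functions for knot vector t, at points x,
    via the Cox-de Boor recursion. Returns an array of shape (len(x), n)."""
    x = np.asarray(x, float)
    # degree 0: indicators of the half-open knot spans
    B = np.array([((t[i] <= x) & (x < t[i + 1])).astype(float)
                  for i in range(len(t) - 1)])
    for d in range(1, k + 1):
        Bnew = np.zeros((len(t) - d - 1, len(x)))
        for i in range(len(t) - d - 1):
            if t[i + d] > t[i]:
                Bnew[i] += (x - t[i]) / (t[i + d] - t[i]) * B[i]
            if t[i + d + 1] > t[i + 1]:
                Bnew[i] += (t[i + d + 1] - x) / (t[i + d + 1] - t[i + 1]) * B[i + 1]
        B = Bnew
    return B.T

# Clamped cubic knot vector with one interior knot at 0.5
t = np.array([0, 0, 0, 0, 0.5, 1, 1, 1, 1], float)
B = bspline_basis(np.linspace(0, 0.99, 50), t, k=3)
print(B.shape, B.sum(axis=1)[:3])   # basis functions sum to 1 on the interior
```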


Natural Cubic Splines

Natural Cubic Splines


Problem
The polynomials fit beyond the boundary knots behave wildly.

Solution: Natural Cubic Splines
Add the constraint that the function is linear beyond the boundary knots.
This frees up 4 dof, which can be used to place more knots in the interior region.
Near the boundaries one has reduced the variance of the fit but increased its bias!
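The boundary behaviour can be seen with scipy's CubicSpline, whose 'natural' boundary condition forces the second derivative to vanish at the boundary knots, so the fit flattens toward linearity there (a sketch; the data are illustrative):

```python
import numpy as np
from scipy.interpolate import CubicSpline

x = np.linspace(0, 1, 8)
y = np.sin(2 * np.pi * x)

nat = CubicSpline(x, y, bc_type='natural')   # f'' = 0 at both boundary knots
print(nat(x[0], 2), nat(x[-1], 2))           # second derivatives at the ends: ~0
```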

Smoothing Splines

Smoothing Splines
Avoid the knot selection problem by using a maximal set of knots.
Complexity of the fit is controlled by regularization.
Consider the following problem:
Find the function f(x) with continuous second derivative which minimizes

  RSS(f, λ) = Σ_{i=1}^n (y_i − f(x_i))² + λ ∫ (f″(t))² dt

where λ is the smoothing parameter, the first term measures closeness to the data, and the second term is a curvature penalty.


Smoothing Splines: Smoothing parameter

  RSS(f, λ) = Σ_{i=1}^n (y_i − f(x_i))² + λ ∫ (f″(t))² dt

λ establishes a trade-off between predicting the training data and minimizing the curvature of f(x).

The two special cases are
λ = 0: f is any function which interpolates the data.
λ = ∞: f is the simple least-squares line fit.

In these two cases we go from a very rough to a very smooth f(x).
The hope is that λ ∈ (0, ∞) indexes an interesting class of functions in between.


Smoothing Splines: Form of the solution

  RSS(f, λ) = Σ_{i=1}^n (y_i − f(x_i))² + λ ∫ (f″(t))² dt

Amazingly, the above criterion has an explicit, finite-dimensional unique minimizer for a fixed λ.

It is a natural cubic spline with knots at the unique values of the x_i, i = 1, …, n.

That is,

  f̂(x) = Σ_{j=1}^n N_j(x) θ_j

where the N_j(x) are an n-dimensional set of basis functions for representing this family of natural splines.


Smoothing Splines: Estimating the coefficients

The criterion to be optimized thus reduces to

  RSS(θ, λ) = (y − Nθ)ᵗ (y − Nθ) + λ θᵗ Ω_N θ

where

  {N}_{ij} = N_j(x_i),   {Ω_N}_{jk} = ∫ N_j″(t) N_k″(t) dt,   y = (y₁, y₂, …, y_n)ᵗ

Smoothing Splines: Estimating the coefficients

The criterion to be optimized thus reduces to

  RSS(θ, λ) = (y − Nθ)ᵗ (y − Nθ) + λ θᵗ Ω_N θ

and its solution is given by

  θ̂ = (Nᵗ N + λ Ω_N)⁻¹ Nᵗ y

The fitted smoothing spline is then given by

  f̂(x) = Σ_{j=1}^n N_j(x) θ̂_j

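The generalized ridge solution θ̂ = (NᵗN + λΩ_N)⁻¹Nᵗy can be sketched numerically, approximating Ω_N by second differences of the basis columns (the basis, knots, λ, and data here are all illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 40
x = np.sort(rng.uniform(0, 1, n))
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(n)

# Stand-in basis N (cubic truncated powers at a few knots); Omega approximated
# from second differences of the basis columns on a fine grid.
knots = np.linspace(0.1, 0.9, 9)
N = np.column_stack([np.ones(n), x, x**2, x**3] +
                    [np.maximum(x - k, 0.0)**3 for k in knots])
t = np.linspace(0, 1, 400)
Nt = np.column_stack([np.ones_like(t), t, t**2, t**3] +
                     [np.maximum(t - k, 0.0)**3 for k in knots])
h = t[1] - t[0]
D2 = np.diff(Nt, 2, axis=0) / h**2           # approximate N_j''(t)
Omega = D2.T @ D2 * h                        # approximate integral of N_j'' N_k''

lam = 1e-5
theta = np.linalg.solve(N.T @ N + lam * Omega, N.T @ y)
fhat = N @ theta
print(np.mean((y - fhat) ** 2))
```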

Degrees of Freedom and Smoother Matrices

A smoothing spline is a linear smoother

Assume that λ has been set.
Remember the estimated coefficients are a linear combination of the y_i's:

  θ̂ = (Nᵗ N + λ Ω_N)⁻¹ Nᵗ y

Let f̂ be the n-vector of the fitted values f̂(x_i). Then

  f̂ = N θ̂ = N(Nᵗ N + λ Ω_N)⁻¹ Nᵗ y = S_λ y

where S_λ = N(Nᵗ N + λ Ω_N)⁻¹ Nᵗ.


Properties of S_λ
S_λ is symmetric and positive semi-definite.
S_λ S_λ ⪯ S_λ.
S_λ has rank n.
The book defines the effective degrees of freedom of a smoothing spline to be

  df_λ = trace(S_λ)
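df_λ = trace(S_λ) can be computed directly: at λ = 0 the smoother is a projection onto the column space of N, so df equals the number of basis functions, and df shrinks as λ grows (a sketch with stand-in N and Ω matrices):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 30, 8
N = rng.standard_normal((n, p))            # stand-in basis matrix
A = rng.standard_normal((p, p))
Omega = A.T @ A                            # stand-in psd penalty matrix

def smoother(lam):
    return N @ np.linalg.solve(N.T @ N + lam * Omega, N.T)

df = [np.trace(smoother(lam)) for lam in (0.0, 1.0, 100.0)]
print(df)   # df = p at lam = 0, then monotonically decreasing
```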

Effective dof of a smoothing spline

(Figure 5.6: relative change in spinal BMD versus age, with separate smoothing-spline fits for males and females.)

FIGURE 5.6. The response is the relative change in bone mineral density measured at the spine in adolescents, as a function of age. A separate smoothing spline was fit to the males and females, with λ ≈ 0.00022. This choice corresponds to about 12 degrees of freedom.

Both curves were fit with λ ≈ 0.00022. This choice corresponds to about 12 degrees of freedom.

The eigen-decomposition of S_λ: S_λ in Reinsch form

Let N = USVᵗ be the SVD of N.
Using this decomposition it is straightforward to re-write

  S_λ = N(Nᵗ N + λ Ω_N)⁻¹ Nᵗ

as

  S_λ = (I + λK)⁻¹   where   K = U S⁻¹ Vᵗ Ω_N V S⁻¹ Uᵗ.

It is also easy to show that f̂ = S_λ y is the solution to the optimization problem

  min_f (y − f)ᵗ (y − f) + λ fᵗ K f


The eigen-decomposition of S_λ
Let K = P D Pᵗ be the real eigen-decomposition of K, possible as K is symmetric and positive semi-definite.

Then

  S_λ = (I + λK)⁻¹ = (I + λ P D Pᵗ)⁻¹
      = (P Pᵗ + λ P D Pᵗ)⁻¹
      = (P (I + λD) Pᵗ)⁻¹
      = P (I + λD)⁻¹ Pᵗ
      = Σ_{k=1}^n 1/(1 + λ d_k) p_k p_kᵗ

where the d_k are the elements of the diagonal D and e-values of K, and the p_k are the e-vectors of K.
The p_k are also the e-vectors of S_λ, and the 1/(1 + λ d_k) its e-values.
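The identity S_λ = Σ_k p_k p_kᵗ / (1 + λ d_k) is easy to verify numerically (a sketch with a random symmetric positive semi-definite K; the λ value is illustrative):

```python
import numpy as np

rng = np.random.default_rng(8)
A = rng.standard_normal((6, 6))
K = A @ A.T                                 # symmetric psd stand-in for K
lam = 0.5

d, P = np.linalg.eigh(K)                    # K = P diag(d) P^T
S = np.linalg.inv(np.eye(6) + lam * K)      # S_lambda = (I + lam K)^-1
S_eig = sum(1 / (1 + lam * d[k]) * np.outer(P[:, k], P[:, k]) for k in range(6))
print(np.allclose(S, S_eig))
```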


Example: Cubic spline smoothing to air pollution data

(Figure 5.7, top panel: ozone concentration versus Daggot pressure gradient.)

Green curve: smoothing spline with df = trace(S_λ) = 11.
Red curve: smoothing spline with df = trace(S_λ) = 5.

Example: Eigenvalues of S_λ

(Figure 5.7, lower left panel: the first 25 eigenvalues of the two smoother matrices.)

Green curve: eigenvalues of S_λ with df = 11.
Red curve: eigenvalues of S_λ with df = 5.

FIGURE 5.7. (Top:) Smoothing spline fit of ozone concentration versus Daggot pressure gradient. The two fits correspond to different values of the smoothing parameter, chosen to achieve five and eleven effective degrees of freedom, defined by df = trace(S_λ). (Lower left:) First 25 eigenvalues for the two smoothing-spline matrices. The first two are exactly 1, and all are ≥ 0. (Lower right:) …

Example: Eigenvectors of S_λ

(Figure 5.7, lower right panels.)

Each blue curve is an eigenvector of S_λ plotted against x. The top left has the highest e-value, the bottom right the smallest.
The red curve is the eigenvector damped by 1/(1 + λ d_k).

Highlights of the eigenrepresentation

The eigenvectors of S_λ do not depend on λ.
The smoothing spline decomposes y w.r.t. the basis {p_k} and shrinks the contributions using 1/(1 + λ d_k), as

  S_λ y = Σ_{k=1}^n p_k 1/(1 + λ d_k) (p_kᵗ y)

The first two e-values of S_λ are always 1, and correspond to the eigenspace of functions linear in x.

The sequence of p_k, ordered by decreasing 1/(1 + λ d_k), appears to increase in complexity.

  df_λ = trace(S_λ) = Σ_{k=1}^n 1/(1 + λ d_k).


Visualization of S_λ: Equivalent Kernels

(Figure 5.8: the smoother matrix, with rows 12, 25, 50, 75, 100 and 115 highlighted.)

FIGURE 5.8. The smoother matrix for a smoothing spline is nearly banded, indicating an equivalent kernel with local support. The left panel represents the …

Choosing λ?
This is a crucial and tricky problem.
We will deal with this problem in Chapter 7, when we consider the problem of Model Selection.

Nonparametric Logistic Regression

Back to logistic regression

Previously considered a binary classifier s.t.

  log [ P(Y=1|X=x) / P(Y=0|X=x) ] = β₀ + βᵗ x

However, consider the case when

  log [ P(Y=1|X=x) / P(Y=0|X=x) ] = f(x)

which in turn implies

  P(Y=1|X=x) = e^{f(x)} / (1 + e^{f(x)})

Fitting f(x) in a smooth fashion leads to a smooth estimate of P(Y=1|X=x).


The penalized log-likelihood criterion

Construct the penalized log-likelihood criterion

  ℓ(f; λ) = Σ_{i=1}^n [ y_i log P(Y=1|x_i) + (1 − y_i) log(1 − P(Y=1|x_i)) ] − ½ λ ∫ (f″(t))² dt
          = Σ_{i=1}^n [ y_i f(x_i) − log(1 + e^{f(x_i)}) ] − ½ λ ∫ (f″(t))² dt
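Maximizing a penalized log-likelihood of this shape can be sketched with Newton-Raphson on a finite basis, with a diagonal Ω standing in (a loud simplification) for the curvature penalty ∫(f″)² dt; the basis, Ω, λ, and data are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100
x = rng.uniform(-2, 2, n)
p_true = 1 / (1 + np.exp(-(x + 0.5 * x**2)))
y = (rng.uniform(size=n) < p_true).astype(float)

# f(x) = H theta on a small polynomial basis; Omega is a crude stand-in
# for the curvature penalty (hypothetical weights, not the true integral).
H = np.column_stack([np.ones(n), x, x**2, x**3])
Omega = np.diag([0.0, 0.0, 4.0, 36.0])
lam = 0.1

theta = np.zeros(4)
for _ in range(25):                          # Newton-Raphson ascent on l(f)
    p = 1 / (1 + np.exp(-H @ theta))
    g = H.T @ (y - p) - lam * Omega @ theta  # penalized score
    W = p * (1 - p)
    Hess = H.T @ (H * W[:, None]) + lam * Omega
    theta += np.linalg.solve(Hess, g)
print(theta)
```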

Regularization and Reproducing Kernel Hilbert Spaces

General class of regularization problems

There is a class of regularization problems which have the form

  min_{f∈H} [ Σ_{i=1}^n L(y_i, f(x_i)) + λ J(f) ]

where
L(y_i, f(x_i)) is a loss function,
J(f) is a penalty functional,
H is a space of functions on which J(f) is defined.

Important subclass of problems of this form

These are generated by a positive definite kernel K(x, y), where
the corresponding space of functions H_K is called a reproducing kernel Hilbert space (RKHS), and
the penalty functional J is defined in terms of the kernel as well.

What does all this mean??

What follows is mainly based on the notes of Nuno Vasconcelos.

Types of Kernels
Definition
A kernel is a mapping k : X × X → ℝ.
These three types of kernels are equivalent:

  dot-product kernel ⇔ positive definite kernel ⇔ Mercer kernel

Dot-product kernel
Definition
A mapping

  k : X × X → ℝ

is a dot-product kernel if and only if

  k(x, y) = ⟨Φ(x), Φ(y)⟩

where Φ : X → H, H is a vector space, and ⟨·,·⟩ is an inner-product on H.

Positive definite kernel

Definition
A mapping

  k : X × X → ℝ

is a positive semi-definite kernel on X × X if, for all m ∈ ℕ and x₁, …, x_m with each x_i ∈ X, the Gram matrix

  K = [ k(x_i, x_j) ]_{i,j=1}^m

is positive semi-definite.
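This property is easy to check numerically for a given kernel and sample (a sketch using the Gaussian kernel; the data and bandwidth are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.standard_normal((20, 3))

# Gaussian (RBF) kernel k(x, y) = exp(-||x - y||^2 / sigma)
sigma = 2.0
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq / sigma)

eig = np.linalg.eigvalsh(K)
print(eig.min())   # >= 0 up to rounding: the Gram matrix is psd
```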

Mercer kernel
Definition
A symmetric mapping k : X × X → ℝ such that

  ∫∫ k(x, y) f(x) f(y) dx dy ≥ 0

for all functions f s.t. ∫ f(x)² dx < ∞

is a Mercer kernel.

Two different pictures

These different definitions lead to different interpretations of what the kernel does:

Interpretation I
Reproducing kernel map:

  H_k = { f(·) | f(·) = Σ_i α_i k(·, x_i) }
  ⟨f, g⟩ = Σ_{i=1}^m Σ_{j=1}^m α_i α′_j k(x_i, x′_j)
  Φ : x ↦ k(·, x)

Two different pictures

These different definitions lead to different interpretations of what the kernel does:

Interpretation II
Mercer kernel map:

  H_M = ℓ₂ = { x | Σ_i x_i² < ∞ }
  ⟨f, g⟩ = fᵗ g
  Φ : x ↦ (√λ₁ φ₁(x), √λ₂ φ₂(x), …)ᵗ

where the λ_i, φ_i are the e-values and eigenfunctions of k(x, y) with λ_i > 0, and ℓ₂ is the space of square-summable sequences (Σ_i a_i² < ∞).

Interpretation I: The dot-product picture

When a Gaussian kernel k(x, x_i) = exp(−‖x − x_i‖²/σ) is used:
the point x_i ∈ X is mapped into the Gaussian G(·, x_i, σI),
H_k is the space of all functions that are linear combinations of Gaussians,
the kernel is a dot product in H_k and a non-linear similarity on X.

The reproducing property

With the definition of H_k and ⟨·,·⟩ one has

  ⟨k(·, x), f(·)⟩ = f(x)   ∀ f ∈ H_k

This is called the reproducing property.

Leads to the reproducing kernel Hilbert spaces

Definition
A Hilbert Space is a complete dot-product space.
(vector space + dot product + limit points of all Cauchy sequences)


Reproducing kernel Hilbert spaces

Definition
Let H be a Hilbert space of functions f : X → ℝ. H is a Reproducing Kernel Hilbert Space (RKHS) with inner-product ⟨·,·⟩ if there exists a

  k : X × X → ℝ

s.t.
k(·,·) spans H, that is

  H = { f(·) | f(·) = Σ_i α_i k(·, x_i) for α_i ∈ ℝ and x_i ∈ X }

k(·,·) is a reproducing kernel of H:

  f(x) = ⟨f(·), k(·, x)⟩   ∀ f ∈ H

Interpretation II: Mercer Kernels

Theorem
Let k : X × X → ℝ be a Mercer kernel. Then there exists an orthonormal set of functions

  ∫ φ_i(x) φ_j(x) dx = δ_ij

and a set of λ_i ≥ 0 such that

  Σ_{i=1}^∞ λ_i² = ∫∫ k²(x, y) dx dy < ∞   and   k(x, y) = Σ_{i=1}^∞ λ_i φ_i(x) φ_i(y)

Transformation induced by a Mercer kernel

This eigen-decomposition gives another way to design the feature transformation induced by the kernel k(·,·).
Let Φ : X → ℓ₂ be defined by

  Φ(x) = (√λ₁ φ₁(x), √λ₂ φ₂(x), …)

where ℓ₂ is the space of square-summable sequences.
Clearly

  ⟨Φ(x), Φ(y)⟩ = Σ_{i=1}^∞ √λ_i φ_i(x) √λ_i φ_i(y) = Σ_{i=1}^∞ λ_i φ_i(x) φ_i(y) = k(x, y)

Issues
Therefore there is a vector space ℓ₂, other than H_k, such that k(x, y) is a dot product in that space.
Have two very different interpretations of what the kernel does:
1. Reproducing kernel map
2. Mercer kernel map
They are in fact more or less the same.

rkhs Vs Mercer maps

For H_M we write Φ(x) = Σ_i √λ_i φ_i(x) e_i.

As the φ_i's are orthonormal there is a 1-1 map

  Γ : ℓ₂ → span{φ_k},   Γ e_k = √λ_k φ_k(·)

Can write

  (Γ ∘ Φ)(x) = Σ_i λ_i φ_i(x) φ_i(·) = k(·, x)

Hence x ↦ k(·, x) maps x into M = span{φ_k(·)}.

The Mercer picture

(Figure: data points x_i ∈ X are mapped by Φ into ℓ₂, and by Γ ∘ Φ into span{φ_i}, where (Γ ∘ Φ)(x_i) = k(·, x_i).)

Mercer map
Define the inner-product in M as

  ⟨f, g⟩_M = ∫ f(x) g(x) dx

Note we will normalize the eigenfunctions φ_l such that

  ∫ φ_l(x) φ_k(x) dx = δ_lk / λ_l

Any function f ∈ M can be written as

  f(x) = Σ_{k=1}^∞ γ_k φ_k(x)

Mercer map

Then

  ⟨f(·), k(·, y)⟩_M = ∫ f(x) k(x, y) dx
                    = ∫ Σ_{k=1}^∞ γ_k φ_k(x) Σ_{l=1}^∞ λ_l φ_l(x) φ_l(y) dx
                    = Σ_{k=1}^∞ Σ_{l=1}^∞ γ_k λ_l φ_l(y) ∫ φ_k(x) φ_l(x) dx
                    = Σ_{l=1}^∞ γ_l λ_l φ_l(y) (1/λ_l)
                    = Σ_{l=1}^∞ γ_l φ_l(y) = f(y)

k is a reproducing kernel on M.

Mercer map Vs Reproducing kernel map

We want to check if
the space M = H_k,
⟨f, g⟩_M and ⟨f, g⟩ are equivalent.

To do this will involve the following steps:
1. Show H_k ⊆ M.
2. Show ⟨f, g⟩_M = ⟨f, g⟩ for f, g ∈ H_k.
3. Show M ⊆ H_k.

H_k ⊆ M
If f ∈ H_k then there exist m ∈ ℕ, {α_i} and {x_i} such that

  f(·) = Σ_{i=1}^m α_i k(·, x_i)
       = Σ_{i=1}^m α_i Σ_{l=1}^∞ λ_l φ_l(x_i) φ_l(·)
       = Σ_{l=1}^∞ ( Σ_{i=1}^m α_i λ_l φ_l(x_i) ) φ_l(·)
       = Σ_{l=1}^∞ γ_l φ_l(·)

Thus f is a linear combination of the φ_l's and f ∈ M.

This shows that if f ∈ H_k then f ∈ M, and therefore H_k ⊆ M.

Equivalence of the inner-products

Let f, g ∈ H_k with

  f(·) = Σ_{i=1}^n α_i k(·, x_i),   g(·) = Σ_{j=1}^m β_j k(·, y_j)

Then by definition

  ⟨f, g⟩ = Σ_{i=1}^n Σ_{j=1}^m α_i β_j k(x_i, y_j)

While

  ⟨f, g⟩_M = ∫ f(x) g(x) dx
           = ∫ Σ_{i=1}^n α_i k(x, x_i) Σ_{j=1}^m β_j k(x, y_j) dx
           = Σ_{i=1}^n Σ_{j=1}^m α_i β_j ∫ k(x, x_i) k(x, y_j) dx

Equivalence of the inner-products ctd

  ⟨f, g⟩_M = Σ_{i=1}^n Σ_{j=1}^m α_i β_j ∫ Σ_{l=1}^∞ λ_l φ_l(x) φ_l(x_i) Σ_{s=1}^∞ λ_s φ_s(x) φ_s(y_j) dx
           = Σ_{i=1}^n Σ_{j=1}^m α_i β_j Σ_{l=1}^∞ λ_l φ_l(x_i) φ_l(y_j)
           = Σ_{i=1}^n Σ_{j=1}^m α_i β_j k(x_i, y_j)
           = ⟨f, g⟩

Thus for all f, g ∈ H_k:

  ⟨f, g⟩_M = ⟨f, g⟩

M ⊆ H_k
Can also show that if f ∈ M then also f ∈ H_k.
Will not prove that here.
But it implies M ⊆ H_k.

Summary
The reproducing kernel map and the Mercer kernel map lead to the same RKHS; Mercer gives us an orthonormal basis.

Interpretation I
Reproducing kernel map:

  H_k = { f(·) | f(·) = Σ_i α_i k(·, x_i) }
  ⟨f, g⟩ = Σ_{i=1}^m Σ_{j=1}^m α_i α′_j k(x_i, x′_j)
  Φ_r : x ↦ k(·, x)

Summary
The reproducing kernel map and the Mercer kernel map lead to the same RKHS; Mercer gives us an orthonormal basis.

Interpretation II
Mercer kernel map:

  H_M = ℓ₂ = { x | Σ_i x_i² < ∞ }
  ⟨f, g⟩ = fᵗ g
  Φ_M : x ↦ (√λ₁ φ₁(x), √λ₂ φ₂(x), …)ᵗ
  Γ : ℓ₂ → span{φ_k(·)}

  Γ ∘ Φ_M = Φ_r

Back to Regularization

Back to regularization
We want to solve

  min_{f∈H_k} [ Σ_{i=1}^n L(y_i, f(x_i)) + λ J(f) ]

where H_k is the RKHS of some appropriate Mercer kernel k(·,·).

What is a good regularizer J(f)?

Intuition: wigglier functions have larger norm than smoother functions.

For f ∈ H_k we have

  f(x) = Σ_i α_i k(x, x_i)
       = Σ_i α_i Σ_l λ_l φ_l(x) φ_l(x_i)
       = Σ_l ( λ_l Σ_i α_i φ_l(x_i) ) φ_l(x)
       = Σ_l c_l φ_l(x)

What is a good regularizer J(f)?

and therefore

  ‖f(x)‖² = Σ_{l,k} c_l c_k ⟨φ_l(x), φ_k(x)⟩_M = Σ_{l,k} c_l c_k δ_lk/λ_l = Σ_l c_l²/λ_l

with c_l = λ_l Σ_i α_i φ_l(x_i).

Hence
‖f‖² grows with the number of c_l different from zero;
functions with large e-values get penalized less, and vice versa;
more coefficients means more high frequencies, or less smoothness.

Representer Theorem
Theorem
Let
Ω : [0, ∞) → ℝ be a strictly monotonically increasing function,
H_k be the RKHS associated with a kernel k(x, y),
L(y, f(x)) be a loss function.

Then

  f̂ = arg min_{f∈H_k} [ Σ_{i=1}^n L(y_i, f(x_i)) + Ω(‖f‖²) ]

has a representation of the form

  f̂(x) = Σ_{i=1}^n α_i k(x, x_i)

Relevance
The remarkable consequence of the theorem is that we can reduce the minimization over the infinite-dimensional space of functions to a minimization over a finite-dimensional space.

This is because, as f̂ = Σ_{i=1}^n α_i k(·, x_i), then

  ‖f̂‖² = ⟨f̂, f̂⟩ = Σ_{ij} α_i α_j ⟨k(·, x_i), k(·, x_j)⟩ = Σ_{ij} α_i α_j k(x_i, x_j) = αᵗ K α

and

  f̂(x_i) = Σ_j α_j k(x_i, x_j) = K_i α

where K = (k(x_i, x_j)) is the Gram matrix and K_i is its i-th row.


Representer Theorem
Theorem
Let
Ω : [0, ∞) → ℝ be a strictly monotonically increasing function,
H_k be the RKHS associated with a kernel k(x, y),
L(y, f(x)) be a loss function.

Then

  f̂ = arg min_{f∈H_k} [ Σ_{i=1}^n L(y_i, f(x_i)) + Ω(‖f‖²) ]

has a representation of the form f̂(x) = Σ_{i=1}^n α̂_i k(x, x_i), where

  α̂ = arg min_α [ Σ_{i=1}^n L(y_i, K_i α) + Ω(αᵗ K α) ]

Regularization and SVM

Rejigging the formulation of the SVM

When given linearly separable data {(x_i, y_i)}, the optimal separating hyperplane is given by

  min_{β₀, β} ‖β‖²   subject to   y_i(β₀ + βᵗ x_i) ≥ 1  ∀i

The constraints are fulfilled when

  max(0, 1 − y_i(β₀ + βᵗ x_i)) = (1 − y_i(β₀ + βᵗ x_i))₊ = 0  ∀i

Hence we can re-write the optimization problem as

  min_{β₀, β} [ Σ_{i=1}^n (1 − y_i(β₀ + βᵗ x_i))₊ + λ ‖β‖² ]

SVMs connections to regularization

Finding the optimal separating hyperplane

  min_{β₀, β} [ Σ_{i=1}^n (1 − y_i(β₀ + βᵗ x_i))₊ + λ ‖β‖² ]

can be seen as a regularization problem

  min_f [ Σ_{i=1}^n L(y_i, f(x_i)) + Ω(‖f‖²) ]

where

  L(y, f(x)) = (1 − y f(x))₊,   Ω(‖f‖²) = λ ‖f‖²

SVMs connections to regularization

From the Representer theorem we know the solution to the latter problem is

  f̂(x) = Σ_{i=1}^n α_i x_iᵗ x

if the basic kernel k(x, y) = xᵗ y is used.

Therefore ‖f̂‖² = αᵗ K α.
This is the same form of the solution found via the KKT conditions:

  β = Σ_{i=1}^n α_i y_i x_i
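The hinge-loss-plus-ridge form can be minimized directly by subgradient descent (a sketch on synthetic linearly separable data; the step size and iteration count are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200
X = rng.standard_normal((n, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1.0, -1.0)   # linearly separable labels

lam = 0.01
b0, beta = 0.0, np.zeros(2)
for _ in range(2000):
    margins = y * (b0 + X @ beta)
    active = margins < 1                          # margin violations
    # subgradient of (1/n) sum hinge + lam ||beta||^2
    g_beta = -(y[active, None] * X[active]).sum(axis=0) / n + 2 * lam * beta
    g_b0 = -y[active].sum() / n
    beta -= 0.1 * g_beta
    b0 -= 0.1 * g_b0

acc = np.mean(np.sign(b0 + X @ beta) == y)
print(acc)
```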