
# Chapter 5: Basis Expansion and Regularization

DD3364

April 1, 2012

Introduction

## Moving beyond linearity

Main idea
Augment the vector of inputs $X$ with additional variables. These are transformations of $X$,

$$h_m(X) : \mathbb{R}^p \to \mathbb{R}, \qquad m = 1, \dots, M.$$

Then model the relationship between $X$ and $Y$,

$$f(X) = \sum_{m=1}^{M} \beta_m h_m(X) = \sum_{m=1}^{M} \beta_m Z_m,$$

as a linear basis expansion in $X$.

We then have a linear model w.r.t. $Z$ and can use the same methods as before.

Which transformations?
Some examples:

Linear: $h_m(X) = X_m$, $m = 1, \dots, p$

Polynomial: $h_m(X) = X_j^2$ or $h_m(X) = X_j X_k$

Non-linear transformations of single inputs: $h_m(X) = \log(X_j)$, $\sqrt{X_j}$, ...

Non-linear transformations of multiple inputs: $h_m(X) = \|X\|$

Indicator functions: e.g. $h_m(X) = \mathrm{Ind}(L_m \le X_k < U_m)$
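As a concrete sketch (with made-up data and an arbitrary choice of basis functions, neither from the slides), the augmented model is fit exactly like ordinary least squares on the expanded features $Z$:

```python
import numpy as np

# Minimal sketch: hand-picked basis functions h_m and synthetic data.
rng = np.random.default_rng(0)
x = rng.uniform(0.1, 3.0, size=100)
y = np.log(x) + 0.1 * rng.standard_normal(100)

# Z holds h_1(X)=1, h_2(X)=X, h_3(X)=X^2, h_4(X)=log(X), h_5(X)=sqrt(X).
Z = np.column_stack([np.ones_like(x), x, x**2, np.log(x), np.sqrt(x)])

# The model is linear in Z, so ordinary least squares applies unchanged.
beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
fitted = Z @ beta
```

The only change from a plain linear model is the construction of $Z$; the estimation machinery carries over unchanged.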

## Pros and Cons of this augmentation

Pros
Can model more complicated decision boundaries.
Can model more complicated regression relationships.

Cons
Lack of locality in global basis functions.
Solution: use local polynomial representations such as piecewise polynomials and splines.

How should one find the correct complexity in the model?
There is the danger of over-fitting.


## Controlling the complexity of the model

Common approaches taken:

Restriction methods
Limit the class of functions considered, e.g. use additive models

$$f(X) = \sum_{j=1}^{p} \sum_{m=1}^{M_j} \beta_{jm}\, h_{jm}(X_j)$$

Selection methods
Scan the set of $h_m$ and only include those that contribute significantly to the fit of the model: Boosting, CART.

Regularization methods
Let

$$f(X) = \sum_{j=1}^{M} \beta_j\, h_j(X)$$

but when learning the $\beta_j$'s restrict their values in the manner of ridge regression and the lasso.

## Piecewise polynomial function

To obtain a piecewise polynomial function $f(X)$:
Divide the domain of $X$ into contiguous intervals.
Represent $f$ by a separate polynomial in each interval.

Examples
[Figure: piecewise constant (left) and piecewise linear (right) fits. Blue curve: ground truth function. Green curve: piecewise constant/linear fit to the training data.]


## Example: Piecewise constant function

Divide $[a, b]$, the domain of $X$, into

$$[a, \xi_1),\ [\xi_1, \xi_2),\ [\xi_2, b].$$

The $\xi_i$'s are referred to as knots.

Use the basis functions

$$h_1(X) = \mathrm{Ind}(X < \xi_1), \quad h_2(X) = \mathrm{Ind}(\xi_1 \le X < \xi_2), \quad h_3(X) = \mathrm{Ind}(\xi_2 \le X)$$

and set $f(X) = \sum_{m=1}^{3} \beta_m h_m(X)$.

As the basis functions don't overlap, the least squares estimate $\hat\beta_m$ is the mean of the $y_i$'s in the $m$th region.

[FIGURE 5.1, top left panel: a piecewise constant function fit to some artificial data.]
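A quick numerical check of the claim, with synthetic data and assumed knot locations:

```python
import numpy as np

# Sketch: non-overlapping indicators mean the least squares coefficients
# are the region-wise means of y. Data and knots xi1, xi2 are made up.
rng = np.random.default_rng(1)
x = rng.uniform(0.0, 3.0, size=200)
y = np.sin(x) + 0.2 * rng.standard_normal(200)
xi1, xi2 = 1.0, 2.0

H = np.column_stack([x < xi1,
                     (xi1 <= x) & (x < xi2),
                     xi2 <= x]).astype(float)
beta, *_ = np.linalg.lstsq(H, y, rcond=None)

# Each coefficient equals the mean of the responses in its region.
region_means = [y[x < xi1].mean(),
                y[(xi1 <= x) & (x < xi2)].mean(),
                y[xi2 <= x].mean()]
```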

## Example: Piecewise linear function

Keep the indicator basis functions

$$h_1(X) = \mathrm{Ind}(X < \xi_1), \quad h_2(X) = \mathrm{Ind}(\xi_1 \le X < \xi_2), \quad h_3(X) = \mathrm{Ind}(\xi_2 \le X)$$

and add

$$h_4(X) = X\, h_1(X), \quad h_5(X) = X\, h_2(X), \quad h_6(X) = X\, h_3(X),$$

with $f(X) = \sum_{m=1}^{6} \beta_m h_m(X)$.

As the basis functions don't overlap, this fits a separate linear model to the data in each region.

[FIGURE 5.1, top right panel: an unrestricted piecewise linear fit to the same data.]

## Piecewise-linear Basis Function

[FIGURE 5.1. The top left panel shows a piecewise constant function fit to some artificial data. The broken vertical lines indicate the positions of the two knots $\xi_1$ and $\xi_2$. The blue curve represents the true function, from which the data were generated with Gaussian noise. The remaining two panels show piecewise linear functions fit to the same data: the top right unrestricted, and the lower left restricted to be continuous at the knots. The lower right panel shows a piecewise linear basis function, $h_3(X) = (X - \xi_1)_+$, continuous at $\xi_1$. The black points indicate the sample evaluations $h_3(x_i)$, $i = 1, \dots, N$.]

## Additionally impose the constraint that f(X) is continuous at the knots

Writing the fit as a separate linear function $\beta_{2r-1} + \beta_{2r} X$ in each region $r$, continuity at $\xi_1$ and $\xi_2$ means

$$\beta_1 + \beta_2 \xi_1 = \beta_3 + \beta_4 \xi_1, \quad \text{and} \quad \beta_3 + \beta_4 \xi_2 = \beta_5 + \beta_6 \xi_2.$$

This reduces the number of degrees of freedom of $f(X)$ from 6 to 4.

Instead of imposing the constraints directly, use a basis that incorporates them:

$$h_1(X) = 1, \quad h_2(X) = X, \quad h_3(X) = (X - \xi_1)_+, \quad h_4(X) = (X - \xi_2)_+,$$

where $t_+$ denotes the positive part of $t$. Each basis function is continuous, so any linear combination is automatically continuous at the knots, and the fit has 4 degrees of freedom.

[FIGURE 5.1, lower left panel: the piecewise linear fit restricted to be continuous at the knots; lower right panel: the basis function $(X - \xi_1)_+$.]

Smoother f(X)
Can achieve a smoother $f(X)$ by increasing the order of the local polynomials and the order of the continuity at the knots.
Piecewise Cubic Polynomials

[Figure: piecewise cubic polynomials fit with increasing orders of continuity at the knots: discontinuous, continuous, continuous first derivative, and continuous first and second derivatives. The last is a piecewise cubic polynomial with first- and second-order continuity at the 2 knots.]

A cubic spline
A piecewise cubic polynomial that is continuous and has continuous first and second derivatives at the knots is known as a cubic spline. Enforcing one more order of continuity would lead to a global cubic polynomial. It is not hard to show (Exercise 5.1) that the following basis represents a cubic spline with knots at $\xi_1$ and $\xi_2$:

$$h_1(X) = 1, \quad h_3(X) = X^2, \quad h_5(X) = (X - \xi_1)^3_+,$$
$$h_2(X) = X, \quad h_4(X) = X^3, \quad h_6(X) = (X - \xi_2)^3_+.$$

Order M spline
An order-$M$ spline with knots $\xi_1, \dots, \xi_K$ is
a piecewise polynomial of order $M$ and
has continuous derivatives up to order $M - 2$.

The general form for the truncated-power basis set is

$$h_j(X) = X^{j-1}, \qquad j = 1, \dots, M,$$
$$h_{M+\ell}(X) = (X - \xi_\ell)^{M-1}_+, \qquad \ell = 1, \dots, K.$$

In practice the most widely used orders are $M = 1, 2, 4$.

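The general truncated-power basis above can be sketched as a small helper (illustrative function name, example inputs assumed):

```python
import numpy as np

def truncated_power_basis(x, knots, M):
    """Columns h_j(X)=X^(j-1) for j=1..M, then (X-xi_l)_+^(M-1) for each knot."""
    x = np.asarray(x, dtype=float)
    cols = [x**j for j in range(M)]
    cols += [np.maximum(0.0, x - xi)**(M - 1) for xi in knots]
    return np.column_stack(cols)

# M=4 with two knots reproduces the six cubic-spline basis functions above.
x = np.linspace(0, 3, 7)
H = truncated_power_basis(x, knots=[1.0, 2.0], M=4)
```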

Regression Splines
Fixed-knot splines are known as regression splines.
For a regression spline one needs to select
the order of the spline,
the number of knots, and
the placement of the knots.

One common approach is to set a knot at each observation $x_i$.

There are many equivalent bases for representing splines. The truncated power basis is intuitively attractive but not computationally attractive.

A better basis set for implementation is the B-spline basis set.


## Natural Cubic Splines

Problem
The polynomials fit beyond the boundary knots behave wildly.

Solution: Natural Cubic Splines
Add the constraints that the function is linear beyond the boundary knots.

Near the boundaries one has reduced the variance of the fit but increased its bias!

Smoothing Splines

Smoothing Splines
Avoid the knot selection problem by using a maximal set of knots.
The complexity of the fit is controlled by regularization.
Consider the following problem:
Find the function $f(x)$ with continuous second derivative which minimizes

$$\sum_{i=1}^{n} (y_i - f(x_i))^2 + \lambda \int (f''(t))^2\, dt.$$


Here $\lambda$ is the smoothing parameter; the first term measures closeness to the data, the second is a curvature penalty.

## Smoothing Splines: Smoothing parameter

$$\sum_{i=1}^{n} (y_i - f(x_i))^2 + \lambda \int (f''(t))^2\, dt$$

The two special cases are:
$\lambda = 0$: $f$ is any function which interpolates the data.
$\lambda = \infty$: $f$ is the simple least squares line fit.

These two cases go from very rough to very smooth $f(x)$. The hope is that $\lambda \in (0, \infty)$ indexes an interesting class of functions in between.


## Smoothing Splines: Form of the solution

$$\sum_{i=1}^{n} (y_i - f(x_i))^2 + \lambda \int (f''(t))^2\, dt$$

The minimizer is a natural cubic spline with knots at the unique values of the $x_i$, $i = 1, \dots, n$.

That is,

$$\hat f(x) = \sum_{j=1}^{n} N_j(x)\, \theta_j$$

where the $N_j(x)$ are an $n$-dimensional set of basis functions for representing this family of natural splines.


## Smoothing Splines: Estimating the coefficients

The criterion to be optimized thus reduces to

$$\mathrm{RSS}(\theta, \lambda) = (y - N\theta)^t (y - N\theta) + \lambda\, \theta^t \Omega_N\, \theta$$

where

$$N = \begin{pmatrix} N_1(x_1) & N_2(x_1) & \cdots & N_n(x_1) \\ N_1(x_2) & N_2(x_2) & \cdots & N_n(x_2) \\ \vdots & \vdots & \ddots & \vdots \\ N_1(x_n) & N_2(x_n) & \cdots & N_n(x_n) \end{pmatrix}, \qquad \{\Omega_N\}_{jk} = \int N_j''(t)\, N_k''(t)\, dt,$$

and $y = (y_1, y_2, \dots, y_n)^t$.

## Smoothing Splines: Estimating the coefficients

The criterion to be optimized thus reduces to

$$\mathrm{RSS}(\theta, \lambda) = (y - N\theta)^t (y - N\theta) + \lambda\, \theta^t \Omega_N\, \theta$$

and its solution is given by

$$\hat\theta = (N^t N + \lambda \Omega_N)^{-1} N^t y.$$

The fitted smoothing spline is then given by

$$\hat f(x) = \sum_{j=1}^{n} N_j(x)\, \hat\theta_j.$$

## A smoothing spline is a linear smoother

Assume that $\lambda$ has been set.
Remember the estimated coefficients are a linear combination of the $y_i$'s:

$$\hat\theta = (N^t N + \lambda \Omega_N)^{-1} N^t y.$$

Let $\hat f$ be the $n$-vector of the fitted values $\hat f(x_i)$; then

$$\hat f = N \hat\theta = N (N^t N + \lambda \Omega_N)^{-1} N^t y = S_\lambda\, y$$

where $S_\lambda = N (N^t N + \lambda \Omega_N)^{-1} N^t$.


Properties of $S_\lambda$
$S_\lambda$ is symmetric and positive semi-definite.
$S_\lambda S_\lambda \preceq S_\lambda$ (it is a shrinking smoother).
$S_\lambda$ has rank $n$.
The book defines the effective degrees of freedom of a smoothing spline to be

$$\mathrm{df}_\lambda = \mathrm{trace}(S_\lambda).$$
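The linear-smoother view can be sketched numerically. The exact penalty matrix $\Omega_N$ requires natural-spline second derivatives; here a squared second-difference penalty stands in for it (an assumption for illustration only), which still yields a symmetric smoother with $\mathrm{df} = \mathrm{trace}(S_\lambda)$:

```python
import numpy as np

# Sketch with synthetic data. The penalty D^T D (squared second
# differences of the fitted values) is a stand-in for Omega_N.
rng = np.random.default_rng(3)
n = 50
x = np.sort(rng.uniform(0, 1, n))
y = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(n)

D = np.diff(np.eye(n), n=2, axis=0)     # (n-2) x n second-difference operator
K = D.T @ D                             # symmetric PSD penalty matrix
lam = 1.0
S = np.linalg.inv(np.eye(n) + lam * K)  # Reinsch form S_lambda = (I + lam K)^(-1)
f_hat = S @ y                           # fitted values, a linear map of y

df = np.trace(S)                        # effective degrees of freedom
```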

[FIGURE 5.6. The response is the relative change in bone mineral density measured at the spine in adolescents, as a function of age. A separate smoothing spline was fit to the males and females, with $\lambda \approx 0.00022$.]

## The eigen-decomposition of $S_\lambda$: $S_\lambda$ in Reinsch form

Let $N = U S V^t$ be the SVD of $N$.
Using this decomposition it is straightforward to re-write

$$S_\lambda = N (N^t N + \lambda \Omega_N)^{-1} N^t$$

as

$$S_\lambda = (I + \lambda K)^{-1}$$

where

$$K = U S^{-1} V^t\, \Omega_N\, V S^{-1} U^t.$$

It is also easy to show that $\hat f = S_\lambda y$ is the solution to the optimization problem

$$\min_f\ (y - f)^t (y - f) + \lambda\, f^t K f.$$


The eigen-decomposition of $S_\lambda$
Let $K = P D P^t$ be the real eigen-decomposition of $K$, possible as $K$ is symmetric and positive semi-definite. Then

$$S_\lambda = (I + \lambda K)^{-1} = (I + \lambda P D P^t)^{-1} = (P (I + \lambda D) P^t)^{-1} = P (I + \lambda D)^{-1} P^t = \sum_{k=1}^{n} \frac{1}{1 + \lambda d_k}\, p_k\, p_k^t$$

where the $d_k$ are the elements of the diagonal $D$ and the eigenvalues of $K$, and the $p_k$ are the eigenvectors of $K$.
The $p_k$ are also the eigenvectors of $S_\lambda$, with $1/(1 + \lambda d_k)$ its eigenvalues.


## Example: Cubic spline smoothing of air pollution data

[FIGURE 5.7. (Top:) Smoothing spline fit of ozone concentration versus Daggot pressure gradient. The two fits correspond to different values of the smoothing parameter, chosen to achieve five and eleven effective degrees of freedom, defined by $\mathrm{df} = \mathrm{trace}(S_\lambda)$. (Lower left:) First 25 eigenvalues for the two smoothing-spline matrices. The first two are exactly 1, and all are $\ge 0$.]

Example: Eigenvalues of $S_\lambda$
Red curve: eigenvalues of $S_\lambda$ with $\mathrm{df} = 5$. Green curve: eigenvalues of $S_\lambda$ with $\mathrm{df} = 11$.

Example: Eigenvectors of $S_\lambda$
Each blue curve is an eigenvector of $S_\lambda$ plotted against $x$; the top left has the highest eigenvalue, the bottom right the smallest. The red curve is the eigenvector damped by $1/(1 + \lambda d_k)$.

## Highlights of the eigenrepresentation

The eigenvectors of $S_\lambda$ do not depend on $\lambda$.
The smoothing spline decomposes $y$ w.r.t. the basis $\{p_k\}$ and shrinks the contributions:

$$S_\lambda\, y = \sum_{k=1}^{n} \frac{1}{1 + \lambda d_k}\, p_k (p_k^t y),$$

$$\mathrm{df}_\lambda = \mathrm{trace}(S_\lambda) = \sum_{k=1}^{n} \frac{1}{1 + \lambda d_k}.$$


Visualization of $S_\lambda$: Equivalent Kernels

[FIGURE 5.8. The smoother matrix for a smoothing spline is nearly banded, indicating an equivalent kernel with local support. The left panel shows the smoother matrix; the other panels show rows 12, 25, 50, 75, 100 and 115 of $S_\lambda$.]

Choosing $\lambda$?
This is a crucial and tricky problem.
We will deal with this problem in Chapter 7 when we consider model assessment and selection.

## Back to logistic regression

Previously we considered a binary classifier such that

$$\log \frac{P(Y = 1 \mid X = x)}{P(Y = 0 \mid X = x)} = \beta_0 + \beta^t x.$$

Now instead set

$$\log \frac{P(Y = 1 \mid X = x)}{P(Y = 0 \mid X = x)} = f(x),$$

which in turn implies

$$P(Y = 1 \mid X = x) = \frac{e^{f(x)}}{1 + e^{f(x)}}.$$

Fitting $f(x)$ in a smooth fashion leads to a smooth estimate of $P(Y = 1 \mid X = x)$.


## The penalized log-likelihood criterion

Construct the penalized log-likelihood criterion

$$\ell(f; \lambda) = \sum_{i=1}^{n} \left[ y_i \log P(Y{=}1 \mid x_i) + (1 - y_i) \log(1 - P(Y{=}1 \mid x_i)) \right] - \tfrac{1}{2}\lambda \int (f''(t))^2\, dt$$

$$= \sum_{i=1}^{n} \left[ y_i f(x_i) - \log(1 + e^{f(x_i)}) \right] - \tfrac{1}{2}\lambda \int (f''(t))^2\, dt.$$

Hilbert Spaces

## General class of regularization problems

There is a class of regularization problems which have the form

$$\min_{f \in H} \left[ \sum_{i=1}^{n} L(y_i, f(x_i)) + \lambda\, J(f) \right]$$

where
$L(y_i, f(x_i))$ is a loss function,
$J(f)$ is a penalty functional, and
$H$ is a space of functions on which $J(f)$ is defined.

## Important subclass of problems of this form

These are generated by a positive definite kernel $K(x, y)$ and the corresponding space of functions $H_K$, called a reproducing kernel Hilbert space, with the penalty functional $J$ defined in terms of the kernel as well.

## What does all this mean??

What follows is mainly based on the notes of Nuno Vasconcelos.

Types of Kernels
Definition
A kernel is a mapping $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$.

These three types of kernels are equivalent:

dot-product kernel $\iff$ positive definite kernel $\iff$ Mercer kernel

Dot-product kernel
Definition
A mapping $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is a dot-product kernel if and only if

$$k(x, y) = \langle \Phi(x), \Phi(y) \rangle$$

where $\Phi : \mathcal{X} \to \mathcal{H}$, $\mathcal{H}$ is a vector space and $\langle \cdot, \cdot \rangle$ is an inner-product on $\mathcal{H}$.

## Positive definite kernel

Definition
A mapping $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is a positive semi-definite kernel on $\mathcal{X} \times \mathcal{X}$ if for all $m \in \mathbb{N}$ and all $x_1, \dots, x_m$ with each $x_i \in \mathcal{X}$ the Gram matrix

$$K = \begin{pmatrix} k(x_1, x_1) & k(x_1, x_2) & \cdots & k(x_1, x_m) \\ k(x_2, x_1) & k(x_2, x_2) & \cdots & k(x_2, x_m) \\ \vdots & \vdots & \ddots & \vdots \\ k(x_m, x_1) & k(x_m, x_2) & \cdots & k(x_m, x_m) \end{pmatrix}$$

is positive semi-definite.
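An empirical spot-check of this property for the Gaussian kernel (random sample points assumed):

```python
import numpy as np

# The Gram matrix of a Gaussian kernel should be symmetric PSD.
def gauss_kernel(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / sigma)

rng = np.random.default_rng(5)
X = rng.standard_normal((20, 3))
K = np.array([[gauss_kernel(a, b) for b in X] for a in X])

eigs = np.linalg.eigvalsh(K)   # all eigenvalues >= 0 up to round-off
```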

Mercer kernel
Definition
A symmetric mapping $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ such that

$$\int\!\!\int k(x, y)\, f(x)\, f(y)\, dx\, dy \ge 0$$

for all functions $f$ s.t. $\int f(x)^2\, dx < \infty$ is a Mercer kernel.

## Two different pictures

These different definitions lead to different interpretations of what the kernel does.

Interpretation I
Reproducing kernel map:

$$H_k = \left\{ f(\cdot) \ \middle|\ f(\cdot) = \sum_{i} \alpha_i\, k(\cdot, x_i) \right\}, \qquad \langle f, g \rangle = \sum_{i}\sum_{j} \alpha_i\, \alpha'_j\, k(x_i, x'_j), \qquad \Phi : x \mapsto k(\cdot, x).$$

## Two different pictures

Interpretation II
Mercer kernel map:

$$H_M = \ell_2 = \left\{ x \ \middle|\ \sum_i x_i^2 < \infty \right\}, \qquad \langle f, g \rangle = f^t g, \qquad \Phi : x \mapsto (\sqrt{\lambda_1}\, \phi_1(x), \sqrt{\lambda_2}\, \phi_2(x), \dots)^t$$

where the $\lambda_i$, $\phi_i$ are the eigenvalues and eigenfunctions of $k(x, y)$ with $\lambda_i > 0$, and $\ell_2$ is the space of square-summable sequences.

## Interpretation I: The dot-product picture

When a Gaussian kernel $k(x, x_i) = \exp(-\|x - x_i\|^2 / \sigma)$ is used, the point $x_i \in \mathcal{X}$ is mapped into the Gaussian $G(\cdot, x_i, \sigma I)$.
$H_k$ is then the space of all functions on $\mathcal{X}$ that are linear combinations of Gaussians.

## The reproducing property

With the definition of $H_k$ and $\langle \cdot, \cdot \rangle$ one has

$$\langle f(\cdot), k(\cdot, x) \rangle = f(x) \qquad \forall f \in H_k.$$

This is called the reproducing property. It leads to reproducing kernel Hilbert spaces.

Definition
A Hilbert space is a complete dot-product space
(vector space + dot product + limit points of all Cauchy sequences).


## Reproducing kernel Hilbert spaces

Definition
Let $H$ be a Hilbert space of functions $f : \mathcal{X} \to \mathbb{R}$. $H$ is a Reproducing Kernel Hilbert Space (RKHS) with inner-product $\langle \cdot, \cdot \rangle$ if there exists a $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ such that

$k(\cdot, \cdot)$ spans $H$, that is

$$H = \left\{ f(\cdot) \ \middle|\ f(\cdot) = \sum_i \alpha_i\, k(\cdot, x_i) \ \text{ for } \alpha_i \in \mathbb{R} \text{ and } x_i \in \mathcal{X} \right\},$$

and $k$ has the reproducing property $\langle f(\cdot), k(\cdot, x) \rangle = f(x)$ for all $f \in H$.

## Interpretation II: Mercer Kernels

Theorem
Let $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be a Mercer kernel. Then there exists an orthonormal set of functions,

$$\int \phi_i(x)\, \phi_j(x)\, dx = \delta_{ij},$$

and a set of $\lambda_i \ge 0$ such that

$$\sum_{i=1}^{\infty} \lambda_i^2 = \int\!\!\int k^2(x, y)\, dx\, dy < \infty \quad \text{and} \quad k(x, y) = \sum_{i=1}^{\infty} \lambda_i\, \phi_i(x)\, \phi_i(y).$$

## Transformation induced by a Mercer kernel

This eigen-decomposition gives another way to design the feature transformation induced by the kernel $k(\cdot, \cdot)$.
Let $\Phi : \mathcal{X} \to \ell_2$ be defined by

$$\Phi(x) = (\sqrt{\lambda_1}\, \phi_1(x), \sqrt{\lambda_2}\, \phi_2(x), \dots)$$

where $\ell_2$ is the space of square-summable sequences. Clearly

$$\langle \Phi(x), \Phi(y) \rangle = \sum_{i=1}^{\infty} \sqrt{\lambda_i}\, \phi_i(x)\, \sqrt{\lambda_i}\, \phi_i(y) = \sum_{i=1}^{\infty} \lambda_i\, \phi_i(x)\, \phi_i(y) = k(x, y).$$
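A finite-dimensional analogue of this "kernel as a dot product in feature space" picture is the homogeneous polynomial kernel, where the feature map can be written out explicitly (this specific kernel and map are an illustrative choice, not from the slides):

```python
import numpy as np

# k(x,y) = (x^T y)^2 on R^2 equals <Phi(x), Phi(y)> with the explicit
# feature map Phi(x) = (x1^2, sqrt(2) x1 x2, x2^2).
def phi(x):
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

rng = np.random.default_rng(6)
x, y = rng.standard_normal(2), rng.standard_normal(2)
lhs = phi(x) @ phi(y)     # dot product in feature space
rhs = (x @ y) ** 2        # kernel evaluated directly
```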

Issues
Therefore there is a vector space $\ell_2$ other than $H_k$ such that $k(x, y)$ is a dot product in that space.
We thus have two very different interpretations of what the kernel does:
1. the reproducing kernel map, and
2. the Mercer kernel map.

For $H_M$ we write

$$\Phi(x) = \sum_i \sqrt{\lambda_i}\, \phi_i(x)\, e_i.$$

As the $\phi_i$'s are orthonormal there is a 1-1 map

$$\Gamma : \ell_2 \to \mathrm{span}\{\phi_k\}, \qquad \Gamma(e_k) = \sqrt{\lambda_k}\, \phi_k(\cdot).$$

We can then write

$$(\Gamma \circ \Phi)(x) = \sum_i \lambda_i\, \phi_i(x)\, \phi_i(\cdot) = k(\cdot, x).$$

## The Mercer picture

[Figure: points $x_i \in \mathcal{X}$ are mapped by $\Phi$ into $\ell_2$, and then by $\Gamma$ into $\mathrm{span}\{\phi_k\}$, with $(\Gamma \circ \Phi)(x_i) = k(\cdot, x_i)$.]

Mercer map
Define the inner-product in $M$ as

$$\langle f, g \rangle_M = \int f(x)\, g(x)\, dx.$$

Note we will normalize the eigenfunctions $\phi_l$ such that

$$\int \phi_l(x)\, \phi_k(x)\, dx = \frac{\delta_{lk}}{\lambda_l}.$$

Any function $f \in M$ can be written as

$$f(x) = \sum_{k=1}^{\infty} \gamma_k\, \phi_k(x).$$

Mercer map
Then

$$\langle f(\cdot), k(\cdot, y) \rangle_M = \int f(x)\, k(x, y)\, dx = \int \sum_{k=1}^{\infty} \gamma_k\, \phi_k(x) \sum_{l=1}^{\infty} \lambda_l\, \phi_l(x)\, \phi_l(y)\, dx$$
$$= \sum_{k=1}^{\infty} \sum_{l=1}^{\infty} \gamma_k\, \lambda_l\, \phi_l(y) \int \phi_k(x)\, \phi_l(x)\, dx = \sum_{l=1}^{\infty} \gamma_l\, \lambda_l\, \phi_l(y)\, \frac{1}{\lambda_l} = \sum_{l=1}^{\infty} \gamma_l\, \phi_l(y) = f(y).$$

So $k$ is a reproducing kernel on $M$.

## Mercer map vs Reproducing kernel map

We want to check that the space $M = H_k$:
1. Show $H_k \subseteq M$.
2. Show $\langle f, g \rangle_M = \langle f, g \rangle$ for $f, g \in H_k$.
3. Show $M \subseteq H_k$.

$H_k \subseteq M$
If $f \in H_k$ then there exist $m \in \mathbb{N}$, $\{\alpha_i\}$ and $\{x_i\}$ such that

$$f(\cdot) = \sum_{i=1}^{m} \alpha_i\, k(\cdot, x_i) = \sum_{i=1}^{m} \alpha_i \sum_{l=1}^{\infty} \lambda_l\, \phi_l(x_i)\, \phi_l(\cdot) = \sum_{l=1}^{\infty} \left( \lambda_l \sum_{i=1}^{m} \alpha_i\, \phi_l(x_i) \right) \phi_l(\cdot) = \sum_{l=1}^{\infty} \gamma_l\, \phi_l(\cdot).$$

Thus $f$ is a linear combination of the $\phi_l$'s and $f \in M$.
This shows that if $f \in H_k$ then $f \in M$, and therefore $H_k \subseteq M$.

## Equivalence of the inner-products

Let $f, g \in H_k$ with

$$f(\cdot) = \sum_{i=1}^{n} \alpha_i\, k(\cdot, x_i), \qquad g(\cdot) = \sum_{j=1}^{m} \beta_j\, k(\cdot, y_j).$$

Then by definition

$$\langle f, g \rangle = \sum_{i=1}^{n} \sum_{j=1}^{m} \alpha_i\, \beta_j\, k(x_i, y_j),$$

while

$$\langle f, g \rangle_M = \int f(x)\, g(x)\, dx = \int \sum_{i=1}^{n} \alpha_i\, k(x, x_i) \sum_{j=1}^{m} \beta_j\, k(x, y_j)\, dx = \sum_{i=1}^{n} \sum_{j=1}^{m} \alpha_i\, \beta_j \int k(x, x_i)\, k(x, y_j)\, dx.$$

## Equivalence of the inner-products ctd

$$\langle f, g \rangle_M = \sum_{i=1}^{n} \sum_{j=1}^{m} \alpha_i\, \beta_j \int \sum_{l=1}^{\infty} \lambda_l\, \phi_l(x)\, \phi_l(x_i) \sum_{s=1}^{\infty} \lambda_s\, \phi_s(x)\, \phi_s(y_j)\, dx$$
$$= \sum_{i=1}^{n} \sum_{j=1}^{m} \alpha_i\, \beta_j \sum_{l=1}^{\infty} \lambda_l\, \phi_l(x_i)\, \phi_l(y_j) = \sum_{i=1}^{n} \sum_{j=1}^{m} \alpha_i\, \beta_j\, k(x_i, y_j) = \langle f, g \rangle.$$

Thus for all $f, g \in H_k$

$$\langle f, g \rangle_M = \langle f, g \rangle.$$

$M \subseteq H_k$
One can also show that if $f \in M$ then $f \in H_k$. We will not prove that here, but it implies $M \subseteq H_k$.

Summary
The reproducing kernel map and the Mercer kernel map lead to the same RKHS; Mercer gives us an orthonormal basis.

Interpretation I
Reproducing kernel map:

$$H_k = \left\{ f(\cdot) \ \middle|\ f(\cdot) = \sum_{i} \alpha_i\, k(\cdot, x_i) \right\}, \qquad \langle f, g \rangle = \sum_{i}\sum_{j} \alpha_i\, \alpha'_j\, k(x_i, x'_j), \qquad \Phi_r : x \mapsto k(\cdot, x).$$

Summary
Interpretation II
Mercer kernel map:

$$H_M = \ell_2 = \left\{ x \ \middle|\ \sum_i x_i^2 < \infty \right\}, \qquad \langle f, g \rangle = f^t g, \qquad \Phi_M : x \mapsto (\sqrt{\lambda_1}\, \phi_1(x), \sqrt{\lambda_2}\, \phi_2(x), \dots)^t,$$

$$\Gamma : \ell_2 \to \mathrm{span}\{\phi_k(\cdot)\}, \qquad \Gamma \circ \Phi_M = \Phi_r.$$

Back to Regularization

Back to regularization
We want to solve

$$\min_{f \in H_k} \left[ \sum_{i=1}^{n} L(y_i, f(x_i)) + \lambda\, J(f) \right].$$

## What is a good regularizer?

Intuition: wigglier functions have larger norm than smoother functions.

For $f \in H_k$ we have

$$f(x) = \sum_i \alpha_i\, k(x, x_i) = \sum_i \alpha_i \sum_l \lambda_l\, \phi_l(x)\, \phi_l(x_i) = \sum_l \left( \lambda_l \sum_i \alpha_i\, \phi_l(x_i) \right) \phi_l(x) = \sum_l c_l\, \phi_l(x)$$

## What is a good regularizer?

with $c_l = \lambda_l \sum_i \alpha_i\, \phi_l(x_i)$. Hence

$$\|f\|^2 = \sum_{l,k} c_l\, c_k\, \langle \phi_l, \phi_k \rangle_M = \sum_{l,k} c_l\, c_k\, \frac{\delta_{lk}}{\lambda_l} = \sum_l \frac{c_l^2}{\lambda_l}.$$

$\|f\|^2$ grows with the number of $c_l$ different from zero:
eigenfunctions with large eigenvalues get penalized less, and vice versa;
more coefficients means more high frequencies, i.e. less smoothness.

Representer Theorem
Theorem
Let
$\Omega : [0, \infty) \to \mathbb{R}$ be a strictly monotonically increasing function,
$H_k$ be the RKHS associated with a kernel $k(x, y)$, and
$L(y, f(x))$ be a loss function.

Then

$$\hat f = \arg\min_{f \in H_k} \left[ \sum_{i=1}^{n} L(y_i, f(x_i)) + \Omega(\|f\|^2) \right]$$

has a representation of the form

$$\hat f(x) = \sum_{i=1}^{n} \alpha_i\, k(x, x_i).$$

Relevance
The remarkable consequence of the theorem is that it reduces the minimization over an infinite-dimensional space of functions to a minimization over a finite-dimensional space.

This is because, as $\hat f = \sum_{i=1}^{n} \alpha_i\, k(\cdot, x_i)$,

$$\|\hat f\|^2 = \langle \hat f, \hat f \rangle = \sum_{ij} \alpha_i\, \alpha_j\, \langle k(\cdot, x_i), k(\cdot, x_j) \rangle = \sum_{ij} \alpha_i\, \alpha_j\, k(x_i, x_j) = \alpha^t K \alpha$$

and

$$\hat f(x_i) = \sum_j \alpha_j\, k(x_i, x_j) = K_i\, \alpha$$

where $K = (k(x_i, x_j))$ is the Gram matrix and $K_i$ is its $i$th row.


Representer Theorem
Theorem (restated)
Under the same assumptions,

$$\hat f = \arg\min_{f \in H_k} \left[ \sum_{i=1}^{n} L(y_i, f(x_i)) + \Omega(\|f\|^2) \right]$$

has a representation of the form $\hat f(x) = \sum_{i=1}^{n} \hat\alpha_i\, k(x, x_i)$, where

$$\hat\alpha = \arg\min_{\alpha} \left[ \sum_{i=1}^{n} L(y_i, K_i\, \alpha) + \Omega(\alpha^t K \alpha) \right].$$
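With squared loss and $\Omega(t) = \lambda t$ this finite-dimensional problem has a closed form, $\hat\alpha = (K + \lambda I)^{-1} y$ (kernel ridge regression). A minimal sketch with synthetic data and an assumed Gaussian kernel:

```python
import numpy as np

# Representer-theorem sketch: minimize sum_i (y_i - K_i a)^2 + lam a^T K a.
def gauss_gram(X, sigma=0.5):
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma)

rng = np.random.default_rng(7)
X = rng.uniform(-1, 1, (40, 1))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.standard_normal(40)

K = gauss_gram(X)
lam = 1e-3
alpha = np.linalg.solve(K + lam * np.eye(40), y)  # alpha_hat = (K + lam I)^(-1) y
f_hat = K @ alpha                                 # fitted values f_hat(x_i) = K_i alpha
```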

## Rejigging the formulation of the SVM

When given linearly separable data $\{(x_i, y_i)\}$ the optimal separating hyperplane is given by

$$\min_{\beta_0, \beta}\ \|\beta\|^2 \quad \text{subject to} \quad y_i(\beta_0 + \beta^t x_i) \ge 1 \ \ \forall i.$$

The constraints are fulfilled when

$$\max(0,\, 1 - y_i(\beta_0 + \beta^t x_i)) = (1 - y_i(\beta_0 + \beta^t x_i))_+ = 0 \ \ \forall i.$$

Hence we can re-write the optimization problem as

$$\min_{\beta_0, \beta} \left[ \sum_{i=1}^{n} (1 - y_i(\beta_0 + \beta^t x_i))_+ + \lambda \|\beta\|^2 \right].$$

## SVMs connections to regularization

Finding the optimal separating hyperplane,

$$\min_{\beta_0, \beta} \left[ \sum_{i=1}^{n} (1 - y_i(\beta_0 + \beta^t x_i))_+ + \lambda \|\beta\|^2 \right],$$

is an instance of

$$\min_f \left[ \sum_{i=1}^{n} L(y_i, f(x_i)) + \Omega(\|f\|) \right]$$

where

$$L(y, f(x)) = (1 - y\, f(x))_+, \qquad \Omega(\|f\|^2) = \lambda \|f\|^2.$$

## SVMs connections to regularization

From the Representer theorem we know the solution to the latter problem is

$$\hat f(x) = \sum_{i=1}^{n} \hat\alpha_i\, x_i^t x$$

if the basic kernel $k(x, y) = x^t y$ is used. Therefore $\|\hat f\|^2 = \hat\alpha^t K \hat\alpha$.

This is the same form as the solution found via the KKT conditions,

$$\hat\beta = \sum_{i=1}^{n} \hat\alpha_i\, y_i\, x_i.$$
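The hinge-loss-plus-ridge formulation can be minimized directly. A minimal sketch using subgradient descent on synthetic linearly separable data (the data, step size and iteration count are illustrative assumptions):

```python
import numpy as np

# Minimize (1/n) sum_i (1 - y_i (b + w.x_i))_+ + lam ||w||^2 by
# subgradient descent; labels come from a known separating direction.
rng = np.random.default_rng(8)
n = 200
X = rng.standard_normal((n, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1.0, -1.0)   # linearly separable

w, b, lam, lr = np.zeros(2), 0.0, 0.01, 0.1
for _ in range(500):
    margins = y * (X @ w + b)
    active = margins < 1                 # points with non-zero hinge loss
    gw = -(y[active, None] * X[active]).sum(0) / n + 2 * lam * w
    gb = -y[active].sum() / n
    w -= lr * gw
    b -= lr * gb

acc = np.mean(np.sign(X @ w + b) == y)
```

This optimizes the hinge-loss view of the SVM directly instead of solving the dual via the KKT conditions; for this clean separable sample the learned hyperplane classifies nearly all points correctly.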