Basis Expansion and Regularization


DD3364

April 1, 2012

Introduction

Main idea

Augment the vector of inputs X with additional variables.

These are transformations of X

h_m(X) : \mathbb{R}^p \to \mathbb{R}, \quad m = 1, \ldots, M.

Then model the relationship between X and Y as

f(X) = \sum_{m=1}^{M} \beta_m h_m(X) = \sum_{m=1}^{M} \beta_m Z_m

We then have a linear model w.r.t. Z and can use the same methods as before.

Which transformations?

Some examples

Linear: h_m(X) = X_m, m = 1, \ldots, p

Polynomial: h_m(X) = X_j^2 or h_m(X) = X_j X_k

Nonlinear: h_m(X) = \log(X_j), \sqrt{X_j}, \ldots, or h_m(X) = \|X\|

Indicator functions: h_m(X) = \mathrm{Ind}(L_m \le X_k < U_m) for an interval [L_m, U_m) of X_k
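The idea above can be sketched numerically: stack transformed columns h_m(X) into a matrix Z and fit by ordinary least squares. The data, noise level and particular transformations below are illustrative.

```python
import numpy as np

# Illustrative data: n points in R^2.
rng = np.random.default_rng(0)
X = rng.uniform(0.1, 2.0, size=(50, 2))
y = np.log(X[:, 0]) + X[:, 0] * X[:, 1] + rng.normal(0, 0.1, size=50)

# Basis expansion: each column of Z is one transformation h_m(X).
Z = np.column_stack([
    X[:, 0], X[:, 1],          # linear:      h_m(X) = X_m
    X[:, 0] ** 2,              # polynomial:  h_m(X) = X_j^2
    X[:, 0] * X[:, 1],         # interaction: h_m(X) = X_j X_k
    np.log(X[:, 0]),           # nonlinear:   h_m(X) = log(X_j)
])

# The model is linear in Z, so ordinary least squares applies unchanged.
beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
fitted = Z @ beta
```

Any linear fitting machinery (ridge, lasso, logistic regression) can be swapped in at the last step, since only the design matrix changed.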

Pros

Can model more complicated decision boundaries.
Can model more complicated regression relationships.

Cons

Lack of locality in global basis functions. Solution: use local polynomial representations such as splines.
There is the danger of over-fitting.

Common approaches taken:

Restriction Methods

Limit the class of functions considered. Use additive models

f(X) = \sum_{j=1}^{p} \sum_{m=1}^{M_j} \beta_{jm} h_{jm}(X_j)

Selection Methods

Adaptively scan the dictionary and include only those basis functions h_m that contribute significantly to the fit of the model - Boosting, CART.

Regularization Methods

Let

f(X) = \sum_{j=1}^{M} \beta_j h_j(X)

where the entire dictionary is used but the coefficients are restricted, in the manner of ridge regression and lasso.

To obtain a piecewise polynomial function f(X):

Divide the domain of X into contiguous intervals.
Represent f by a separate polynomial in each interval.

Examples: piecewise constant and piecewise linear fits. [Figure: green curve - piecewise constant/linear fit to the training data.]


Piecewise Constant

Divide [a, b], the domain of X, into three intervals

[a, \xi_1), \quad [\xi_1, \xi_2), \quad [\xi_2, b]

and use the indicator basis

h_1(X) = \mathrm{Ind}(X < \xi_1), \quad h_2(X) = \mathrm{Ind}(\xi_1 \le X < \xi_2), \quad h_3(X) = \mathrm{Ind}(\xi_2 \le X)

For the least squares fit f(X) = \sum_{m=1}^{3} \beta_m h_m(X), each \hat\beta_m is the mean of the y_i's in the m-th region.

Piecewise Linear

Add to the indicator basis the three functions

h_4(X) = X h_1(X), \quad h_5(X) = X h_2(X), \quad h_6(X) = X h_3(X)

Then f(X) = \sum_{m=1}^{6} \beta_m h_m(X) fits a separate linear model to the data in each region.

Continuous Piecewise Linear

To make the piecewise linear fit continuous, impose continuity of f at the knots \xi_1 and \xi_2. This means

\beta_1 + \beta_4 \xi_1 = \beta_2 + \beta_5 \xi_1, and
\beta_2 + \beta_5 \xi_2 = \beta_3 + \beta_6 \xi_2

This reduces the # of dof of f(X) from 6 to 4.

FIGURE 5.1. The top left panel shows a piecewise constant function fit to some artificial data. The broken vertical lines indicate the positions of the two knots \xi_1 and \xi_2. The blue curve represents the true function, from which the data were generated with Gaussian noise. The remaining two panels show piecewise linear functions fit to the same data - the top right unrestricted, and the lower left restricted to be continuous at the knots. The lower right panel shows a piecewise linear basis function, h_3(X) = (X - \xi_1)_+, continuous at \xi_1. The black points indicate the sample evaluations h_3(x_i), i = 1, \ldots, N.

Alternatively, incorporate the continuity constraints directly and use this smaller basis instead:

h_1(X) = 1, \quad h_2(X) = X, \quad h_3(X) = (X - \xi_1)_+, \quad h_4(X) = (X - \xi_2)_+

where t_+ denotes the positive part of t.
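A minimal sketch of the truncated basis in code; the knot locations and data below are made up for illustration. Continuity of the fit at the knots holds by construction, since every basis function is itself continuous.

```python
import numpy as np

def plus(t):
    """Positive part t_+ = max(t, 0)."""
    return np.maximum(t, 0.0)

# Illustrative knots xi_1 < xi_2 inside the domain [0, 1].
xi1, xi2 = 0.33, 0.66
x = np.linspace(0.0, 1.0, 100)

# Basis for a continuous piecewise-linear function:
# h1 = 1, h2 = X, h3 = (X - xi1)_+, h4 = (X - xi2)_+  -> 4 dof.
H = np.column_stack([np.ones_like(x), x, plus(x - xi1), plus(x - xi2)])

rng = np.random.default_rng(1)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.1, size=x.size)

# Ordinary least squares in the expanded basis.
beta, *_ = np.linalg.lstsq(H, y, rcond=None)
f = H @ beta
```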

Smoother f(X)

Can achieve a smoother f(X) by increasing the order of the local polynomials.

Can also increase the order of the continuity at the knots.

Piecewise Cubic Polynomials

[Figure: piecewise cubic polynomial fits to the same data - discontinuous, continuous, and with first- and second-order continuity at the two knots \xi_1 and \xi_2.]

Cubic Spline

The piecewise cubic fit in the last panel is continuous, and has continuous first and second derivatives at the knots. It is known as a cubic spline. Enforcing one more order of continuity would lead to a global cubic polynomial.

It is not hard to show (Exercise 5.1) that the following basis represents a cubic spline with knots at \xi_1 and \xi_2:

h_1(X) = 1, \quad h_3(X) = X^2, \quad h_5(X) = (X - \xi_1)^3_+
h_2(X) = X, \quad h_4(X) = X^3, \quad h_6(X) = (X - \xi_2)^3_+

Order-M Spline

An order-M spline with knots \xi_1, \ldots, \xi_K is

a piecewise polynomial of order M, and
has continuous derivatives up to order M - 2.

A basis is

h_j(X) = X^{j-1}, \quad j = 1, \ldots, M
h_{M+l}(X) = (X - \xi_l)^{M-1}_+, \quad l = 1, \ldots, K
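The truncated-power basis above is easy to write down directly. A sketch, with hypothetical knots (`truncated_power_basis` is an illustrative helper, not a library function):

```python
import numpy as np

def truncated_power_basis(x, knots, M):
    """Basis for an order-M spline with the given knots:
    h_j(X) = X^(j-1) for j = 1..M, and
    h_{M+l}(X) = (X - xi_l)_+^(M-1) for each knot xi_l."""
    x = np.asarray(x, dtype=float)
    cols = [x ** j for j in range(M)]                       # 1, X, ..., X^(M-1)
    cols += [np.maximum(x - xi, 0.0) ** (M - 1) for xi in knots]
    return np.column_stack(cols)

# A cubic spline is the M = 4 case: piecewise cubic with continuous
# first and second derivatives at the knots.
x = np.linspace(0, 1, 7)
H = truncated_power_basis(x, knots=[0.4, 0.7], M=4)
print(H.shape)  # (7, 6): M + K = 4 + 2 basis functions
```

In practice the equivalent B-spline basis is preferred for numerical stability; the truncated-power form is shown here because it matches the slides.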

Regression Splines

Fixed-knot splines are known as regression splines. For a regression spline one needs to select

the order of the spline,
the number of knots, and
the placement of the knots.

There are many equivalent bases for representing splines, and the B-spline basis is computationally attractive.

Problem

The polynomials fit beyond the boundary knots behave wildly.

Solution: Natural Cubic Splines

Add the additional constraint that the function is linear beyond the boundary knots. Near the boundaries one has then reduced the variance of the fit.

Smoothing Splines

Avoid the knot selection problem by using a maximal set of knots. The complexity of the fit is controlled by regularization.

Consider the following problem: find the function f(x) with continuous second derivative which minimizes

RSS(f, \lambda) = \sum_{i=1}^{n} (y_i - f(x_i))^2 + \lambda \int (f''(t))^2 dt

Here \lambda is the smoothing parameter; the first term measures closeness to the data, while the second penalizes curvature in the function.

\lambda = 0: f can be any function which interpolates the data.
\lambda = \infty: f is the simple least squares line fit.
The hope is that \lambda \in (0, \infty) indexes an interesting class of functions in between.

For the criterion

RSS(f, \lambda) = \sum_{i=1}^{n} (y_i - f(x_i))^2 + \lambda \int (f''(t))^2 dt

the minimizer is a natural cubic spline with knots at the unique values of the x_i, i = 1, \ldots, n. That is,

\hat f(x) = \sum_{j=1}^{n} N_j(x) \theta_j

where the N_j(x) are an n-dimensional set of basis functions for representing this family of natural splines.

The criterion to be optimized thus reduces to

RSS(\theta, \lambda) = (y - N\theta)^t (y - N\theta) + \lambda \theta^t \Omega_N \theta

where

y = (y_1, y_2, \ldots, y_n)^t, \quad \{N\}_{ij} = N_j(x_i), \quad (\Omega_N)_{jk} = \int N_j''(t) N_k''(t) dt

Its solution is given by

\hat\theta = (N^t N + \lambda \Omega_N)^{-1} N^t y

and the fitted smoothing spline is then

\hat f(x) = \sum_{j=1}^{n} N_j(x) \hat\theta_j

Assume that \lambda has been set. The estimated coefficients \hat\theta are a linear combination of the y_i's. Let \hat f be the n-vector of fitted values \hat f(x_i); then

\hat f = N\hat\theta = N(N^t N + \lambda \Omega_N)^{-1} N^t y = S_\lambda y

where S_\lambda = N(N^t N + \lambda \Omega_N)^{-1} N^t is the smoother matrix.
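The solve above can be sketched numerically. The basis and penalty below are simple stand-ins (a polynomial basis and a coefficient second-difference penalty), not the actual natural-spline basis N_j and curvature penalty \Omega_N; only the linear algebra of the penalized solve is the point.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 30
x = np.sort(rng.uniform(0, 1, n))
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, n)

# Illustrative stand-ins for the basis matrix and penalty matrix.
N = np.vander(x, 6, increasing=True)   # N_ij = j-th basis function at x_i
D = np.diff(np.eye(6), n=2, axis=0)    # second differences of coefficients
Omega = D.T @ D                        # stand-in "curvature" penalty
lam = 1e-3

# theta_hat = (N^t N + lam * Omega)^{-1} N^t y
theta = np.linalg.solve(N.T @ N + lam * Omega, N.T @ y)

# Smoother matrix: f_hat = N theta_hat = S_lam y
S = N @ np.linalg.solve(N.T @ N + lam * Omega, N.T)
f_hat = S @ y
assert np.allclose(f_hat, N @ theta)
```

Because \Omega is symmetric, S is symmetric positive semi-definite, and trace(S) gives the effective degrees of freedom discussed below.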

Properties of S_\lambda

S_\lambda is symmetric and positive semi-definite.
S_\lambda S_\lambda \preceq S_\lambda (shrinking nature).
S_\lambda has rank n.
The book defines the effective degrees of freedom of a smoothing spline to be df_\lambda = trace(S_\lambda).

FIGURE 5.6. The response is the relative change in bone mineral density measured at the spine in adolescents, as a function of age. A separate smoothing spline was fit to the males and females, with \lambda \approx 0.00022. This choice corresponds to about 12 degrees of freedom.

Let N = USV^t be the SVD of N. Using this decomposition it is straightforward to re-write

S_\lambda = N(N^t N + \lambda \Omega_N)^{-1} N^t

as

S_\lambda = (I + \lambda K)^{-1}

where

K = U S^{-1} V^t \Omega_N V S^{-1} U^t

It is also easy to show that \hat f = S_\lambda y is the solution to the optimization problem

\min_f \; (y - f)^t (y - f) + \lambda f^t K f

The eigen-decomposition of S_\lambda

Let K = PDP^{-1} be the real eigen-decomposition of K. Then

S_\lambda = (I + \lambda K)^{-1} = (I + \lambda PDP^{-1})^{-1}
= (PP^{-1} + \lambda PDP^{-1})^{-1}
= (P(I + \lambda D)P^{-1})^{-1}
= P(I + \lambda D)^{-1} P^{-1}
= \sum_{k=1}^{n} \frac{1}{1 + \lambda d_k} p_k p_k^t

where the d_k are the e-values and the p_k the e-vectors of K. The p_k are also the e-vectors of S_\lambda, and the 1/(1 + \lambda d_k) its e-values.

Example: Eigenvalues of S_\lambda

FIGURE 5.7. (Top:) Smoothing spline fit of ozone concentration versus Daggot pressure gradient. The two fits correspond to different values of the smoothing parameter, chosen to achieve five and eleven effective degrees of freedom, defined by df = trace(S_\lambda). (Lower left:) First 25 eigenvalues for the two smoother matrices. The first two are exactly 1, and all are \ge 0.

Green curve: eigenvalues of S_\lambda with df = 11. Red curve: eigenvalues of S_\lambda with df = 5.

Example: Eigenvectors of S_\lambda

Each blue curve is an eigenvector of S_\lambda plotted against x; the top left panel has the highest e-value, the bottom right the smallest. The red curve is the eigenvector damped by 1/(1 + \lambda d_k).

The eigenvectors of S_\lambda do not depend on \lambda. The smoothing spline decomposes y w.r.t. the basis \{p_k\} and differentially shrinks each contribution:

S_\lambda y = \sum_{k=1}^{n} \frac{p_k (p_k^t y)}{1 + \lambda d_k}

df_\lambda = trace(S_\lambda) = \sum_{k=1}^{n} 1/(1 + \lambda d_k)
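A numerical sketch of this eigen-view, using an arbitrary symmetric positive semi-definite matrix as a stand-in for the penalty K:

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(20, 20))
K = A @ A.T                      # symmetric PSD stand-in "penalty" matrix
lam = 0.5

d, P = np.linalg.eigh(K)         # K = P diag(d) P^t, P orthogonal
S = P @ np.diag(1.0 / (1.0 + lam * d)) @ P.T   # S_lam = (I + lam K)^{-1}

# S_lam shares the eigenvectors p_k of K; its eigenvalues are
# 1/(1 + lam*d_k), so df_lam = trace(S_lam) = sum_k 1/(1 + lam*d_k).
df = np.trace(S)
assert np.isclose(df, np.sum(1.0 / (1.0 + lam * d)))

# Applying S_lam shrinks y along each eigenvector by 1/(1 + lam*d_k).
y = rng.normal(size=20)
shrunk = sum(P[:, k] * (P[:, k] @ y) / (1.0 + lam * d[k]) for k in range(20))
assert np.allclose(S @ y, shrunk)
```

Increasing \lambda shrinks the high-d_k (wiggly) directions toward zero first, which is exactly the differential shrinking described above.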

Visualization of S_\lambda: Equivalent Kernels

FIGURE 5.8. The smoother matrix for a smoothing spline is nearly banded, indicating an equivalent kernel with local support. The left panel represents the smoother matrix; the remaining panels show rows 12, 25, 50, 75, 100 and 115 of S_\lambda.

Choosing \lambda?

This is a crucial and tricky problem. Will deal with this problem in Chapter 7, when we consider model assessment and selection.

Previously considered a binary classifier s.t.

\log \frac{P(Y=1|X=x)}{P(Y=0|X=x)} = \beta_0 + \beta^t x

Now instead let

\log \frac{P(Y=1|X=x)}{P(Y=0|X=x)} = f(x), \quad \text{so that} \quad P(Y=1|X=x) = \frac{e^{f(x)}}{1 + e^{f(x)}}

Fitting f(x) in a smooth fashion gives a smooth estimate of P(Y=1|X=x).

Construct the penalized log-likelihood criterion

\ell(f; \lambda) = \sum_{i=1}^{n} [y_i \log P(Y=1|x_i) + (1 - y_i) \log(1 - P(Y=1|x_i))] - \tfrac{1}{2}\lambda \int (f''(t))^2 dt

= \sum_{i=1}^{n} [y_i f(x_i) - \log(1 + e^{f(x_i)})] - \tfrac{1}{2}\lambda \int (f''(t))^2 dt

Hilbert Spaces

There is a class of generalization problems which have the form

\min_{f \in H} \left[ \sum_{i=1}^{n} L(y_i, f(x_i)) + \lambda J(f) \right]

where

L(y_i, f(x_i)) is a loss function,
J(f) is a penalty functional,
H is a space of functions on which J(f) is defined.

An important subclass is generated by a positive definite kernel K(x, y), with the corresponding space of functions H_K called a reproducing kernel Hilbert space.

What follows is mainly based on the notes of Nuno Vasconcelos.

Types of Kernels

Definition: A kernel is a mapping k : X \times X \to \mathbb{R}.

These three types of kernels are equivalent:

dot-product kernel
positive semi-definite kernel
Mercer kernel

Dot-product kernel

Definition: A mapping

k : X \times X \to \mathbb{R}

is a dot-product kernel if and only if

k(x, y) = \langle \Phi(x), \Phi(y) \rangle

where \Phi : X \to H, H is a vector space and \langle \cdot, \cdot \rangle is an inner-product on H.

Positive semi-definite kernel

Definition: A mapping

k : X \times X \to \mathbb{R}

is a positive semi-definite kernel on X \times X if, for all m \in \mathbb{N} and all x_1, \ldots, x_m with each x_i \in X, the Gram matrix

K = [ k(x_1, x_1)  k(x_1, x_2)  ...  k(x_1, x_m)
      k(x_2, x_1)  k(x_2, x_2)  ...  k(x_2, x_m)
      ...
      k(x_m, x_1)  k(x_m, x_2)  ...  k(x_m, x_m) ]

is positive semi-definite.
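A quick numerical check of this definition for the Gaussian kernel; the points and kernel width are arbitrary, and `gram_matrix` is an illustrative helper.

```python
import numpy as np

def gram_matrix(k, xs):
    """Gram matrix K_ij = k(x_i, x_j) for points x_1..x_m."""
    m = len(xs)
    return np.array([[k(xs[i], xs[j]) for j in range(m)] for i in range(m)])

def gauss(x, y, sigma=1.0):
    """Gaussian kernel, a standard positive semi-definite kernel."""
    return np.exp(-np.sum((x - y) ** 2) / sigma)

rng = np.random.default_rng(4)
xs = [rng.normal(size=3) for _ in range(10)]
K = gram_matrix(gauss, xs)

# Positive semi-definiteness: every eigenvalue of the (symmetric)
# Gram matrix is >= 0, for any choice of m and x_1, ..., x_m.
eigvals = np.linalg.eigvalsh(K)
assert np.all(eigvals >= -1e-10)
```

A single set of points cannot prove a kernel is PSD, but one negative eigenvalue is enough to disprove it, which makes this a useful sanity check.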

Mercer kernel

Definition: A symmetric mapping k : X \times X \to \mathbb{R} such that

\int\int k(x, y) f(x) f(y) \, dx \, dy \ge 0

for all functions f s.t. \int f(x)^2 dx < \infty is a Mercer kernel.

These different definitions lead to different interpretations of what the kernel does.

Interpretation I - Reproducing kernel map:

H_k = \{ f(\cdot) \mid f(\cdot) = \sum_{i=1}^{m} \alpha_i k(\cdot, x_i) \}

\langle f, g \rangle = \sum_{i=1}^{m} \sum_{j=1}^{m'} \alpha_i \alpha'_j k(x_i, x'_j)

\Phi : x \mapsto k(\cdot, x)

Interpretation II - Mercer kernel map:

H_M = \ell_2 = \{ x \mid \sum_i x_i^2 < \infty \}

\langle f, g \rangle = f^t g

\Phi : x \mapsto (\sqrt{\lambda_1}\,\phi_1(x), \sqrt{\lambda_2}\,\phi_2(x), \ldots)^t

where the \lambda_i and \phi_i are the eigenvalues and eigenfunctions of k(x, y), with \lambda_i > 0, and \ell_2 is the space of sequences s.t. \sum_i a_i^2 < \infty.

Example: when a Gaussian kernel k(x, x_i) = \exp(-\|x - x_i\|^2/\sigma) is used, the point x_i \in X is mapped into the Gaussian G(\cdot, x_i, \sigma I), and H_k is the space of all functions that are linear combinations of Gaussians on X.

With the definition of H_k and \langle \cdot, \cdot \rangle one has, for f \in H_k,

\langle f, k(\cdot, x) \rangle = f(x)

the reproducing property. This leads to the reproducing kernel Hilbert spaces.

Definition: A Hilbert space is a complete dot-product space (vector space + dot product + limit points of all Cauchy sequences).

Definition: Let H be a Hilbert space of functions f : X \to \mathbb{R}. H is a Reproducing Kernel Hilbert Space (RKHS) with inner-product \langle \cdot, \cdot \rangle if there exists a

k : X \times X \to \mathbb{R}

s.t.

k(\cdot, \cdot) spans H, that is H = \{ f(\cdot) \mid f(\cdot) = \sum_i \alpha_i k(\cdot, x_i) \text{ for } \alpha_i \in \mathbb{R} \text{ and } x_i \in X \}, and
k has the reproducing property: \langle f, k(\cdot, x) \rangle = f(x) for all f \in H.

Theorem (Mercer)

Let k : X \times X \to \mathbb{R} be a Mercer kernel. Then there exists an orthonormal set of functions \phi_i,

\int \phi_i(x) \phi_j(x) dx = \delta_{ij}

and a set of \lambda_i \ge 0 such that

\sum_{i=1}^{\infty} \lambda_i^2 = \int\int k^2(x, y) \, dx \, dy < \infty, \quad and \quad k(x, y) = \sum_{i=1}^{\infty} \lambda_i \phi_i(x) \phi_i(y)

Let

: X `2

be defined by

where `2 is the space of square summable sequences.

Clearly

p

X

p

i i (x) i i (y)

h(x), (y)i =

i=1

X

i=1

Issues

Therefore there is a vector space \ell_2, other than H_k, such that k(x, y) is a dot product in that space. We have two very different interpretations of what the kernel does:

1. the reproducing kernel map
2. the Mercer kernel map

For HM we write

(x) =

P

i

i i (x)ei

: `2 span{k }

Can write

( )(x) =

P

i

ek =

p

k k ()

i i (x)i () = k(, x)

[Figure: the Mercer map \Phi sends each point x_i \in X to a vector in \ell_2 (axes e_1, \ldots, e_d), and \Psi \circ \Phi sends it to the function k(\cdot, x_i).]

Mercer map

Define the inner-product in M as

\langle f, g \rangle_M = \int f(x) g(x) dx

Note we will normalize the eigenfunctions \phi_l such that

\int \phi_l(x) \phi_k(x) dx = \frac{\delta_{lk}}{\lambda_l}

Any function f \in M can be written as

f(x) = \sum_{k=1}^{\infty} \alpha_k \phi_k(x)

Mercer map

Then

\int f(x) k(x, y) dx = \int \sum_{k=1}^{\infty} \alpha_k \phi_k(x) \sum_{l=1}^{\infty} \lambda_l \phi_l(x) \phi_l(y) \, dx
= \sum_{k=1}^{\infty} \sum_{l=1}^{\infty} \alpha_k \lambda_l \phi_l(y) \int \phi_k(x) \phi_l(x) dx
= \sum_{l=1}^{\infty} \alpha_l \lambda_l \phi_l(y) \frac{1}{\lambda_l}
= \sum_{l=1}^{\infty} \alpha_l \phi_l(y) = f(y)

So k is a reproducing kernel on M.

We want to check if the space M = H_k:

1. Show H_k \subseteq M.
2. Show M \subseteq H_k.
3. Show the two inner products agree.

H_k \subseteq M

If f \in H_k then there exist m \in \mathbb{N}, \{\alpha_i\} and \{x_i\} such that

f(\cdot) = \sum_{i=1}^{m} \alpha_i k(\cdot, x_i)
= \sum_{i=1}^{m} \alpha_i \sum_{l=1}^{\infty} \lambda_l \phi_l(x_i) \phi_l(\cdot)
= \sum_{l=1}^{\infty} \left( \sum_{i=1}^{m} \alpha_i \lambda_l \phi_l(x_i) \right) \phi_l(\cdot)
= \sum_{l=1}^{\infty} \gamma_l \phi_l(\cdot)

This shows that if f \in H_k then f \in M, and therefore H_k \subseteq M.

Let f, g \in H_k with

f(\cdot) = \sum_{i=1}^{n} \alpha_i k(\cdot, x_i), \quad g(\cdot) = \sum_{j=1}^{m} \beta_j k(\cdot, y_j)

Then by definition

\langle f, g \rangle = \sum_{i=1}^{n} \sum_{j=1}^{m} \alpha_i \beta_j k(x_i, y_j)

While

\langle f, g \rangle_M = \int f(x) g(x) dx
= \int \sum_{i=1}^{n} \alpha_i k(x, x_i) \sum_{j=1}^{m} \beta_j k(x, y_j) \, dx
= \sum_{i=1}^{n} \sum_{j=1}^{m} \alpha_i \beta_j \int k(x, x_i) k(x, y_j) dx
= \sum_{i=1}^{n} \sum_{j=1}^{m} \alpha_i \beta_j \int \sum_{l=1}^{\infty} \lambda_l \phi_l(x) \phi_l(x_i) \sum_{s=1}^{\infty} \lambda_s \phi_s(x) \phi_s(y_j) \, dx
= \sum_{i=1}^{n} \sum_{j=1}^{m} \alpha_i \beta_j \sum_{l=1}^{\infty} \lambda_l \phi_l(x_i) \phi_l(y_j)
= \sum_{i=1}^{n} \sum_{j=1}^{m} \alpha_i \beta_j k(x_i, y_j)
= \langle f, g \rangle

Hence \langle f, g \rangle_M = \langle f, g \rangle.

M \subseteq H_k

Can also show that if f \in M then also f \in H_k. Will not prove that here. But it implies M \subseteq H_k.

Summary

The reproducing kernel map and the Mercer kernel map lead to the same RKHS; Mercer gives us an orthonormal basis.

Interpretation I - Reproducing kernel map:

H_k = \{ f(\cdot) \mid f(\cdot) = \sum_{i=1}^{m} \alpha_i k(\cdot, x_i) \}
\langle f, g \rangle = \sum_{i=1}^{m} \sum_{j=1}^{m'} \alpha_i \alpha'_j k(x_i, x'_j)
\Phi_r : x \mapsto k(\cdot, x)

Interpretation II - Mercer kernel map:

H_M = \ell_2 = \{ x \mid \sum_i x_i^2 < \infty \}
\langle f, g \rangle = f^t g
\Phi_M : x \mapsto (\sqrt{\lambda_1}\,\phi_1(x), \sqrt{\lambda_2}\,\phi_2(x), \ldots)^t
\Psi : \ell_2 \to \mathrm{span}\{\phi_k(\cdot)\}

with \Psi \circ \Phi_M = \Phi_r.

Back to Regularization

We want to solve

\min_{f \in H_k} \left[ \sum_{i=1}^{n} L(y_i, f(x_i)) + \lambda \|f\|^2 \right]

Intuition: wigglier functions have larger norm than smoother functions. For f \in H_k we have

f(x) = \sum_i \alpha_i k(x, x_i) = \sum_i \alpha_i \sum_l \lambda_l \phi_l(x) \phi_l(x_i) = \sum_l c_l \phi_l(x)

with c_l = \lambda_l \sum_i \alpha_i \phi_l(x_i). Hence

\|f\|^2 = \sum_{l,k} c_l c_k \langle \phi_l, \phi_k \rangle_M = \sum_{l,k} c_l c_k \frac{\delta_{lk}}{\lambda_l} = \sum_l \frac{c_l^2}{\lambda_l}

Components along eigenfunctions with large e-values get penalized less, and vice versa; more coefficients means more high frequencies, i.e. less smoothness.

Representer Theorem

Theorem: Let

\Omega : [0, \infty) \to \mathbb{R} be a strictly monotonically increasing function,
H_k be the RKHS associated with a kernel k(x, y),
L(y, f(x)) be a loss function.

Then

\hat f = \arg\min_{f \in H_k} \left[ \sum_{i=1}^{n} L(y_i, f(x_i)) + \lambda \Omega(\|f\|^2) \right]

admits a representation of the form

\hat f(x) = \sum_{i=1}^{n} \alpha_i k(x, x_i)

Relevance

The remarkable consequence of the theorem is that the minimization over an infinite-dimensional space of functions reduces to a minimization over a finite-dimensional space. This is because, as \hat f = \sum_{i=1}^{n} \alpha_i k(\cdot, x_i), then

\|\hat f\|^2 = \langle \hat f, \hat f \rangle = \sum_{ij} \alpha_i \alpha_j \langle k(\cdot, x_i), k(\cdot, x_j) \rangle = \sum_{ij} \alpha_i \alpha_j k(x_i, x_j) = \alpha^t K \alpha

and

\hat f(x_i) = \sum_j \alpha_j k(x_i, x_j) = (K\alpha)_i

Representer Theorem

Theorem: Under the conditions above,

\hat f = \arg\min_{f \in H_k} \left[ \sum_{i=1}^{n} L(y_i, f(x_i)) + \lambda \Omega(\|f\|^2) \right], \quad \hat f(x) = \sum_{i=1}^{n} \hat\alpha_i k(x, x_i)

where

\hat\alpha = \arg\min_{\alpha} \left[ \sum_{i=1}^{n} L(y_i, (K\alpha)_i) + \lambda \Omega(\alpha^t K \alpha) \right]

When given linearly separable data \{(x_i, y_i)\}, the optimal separating hyperplane solves

\min_{\beta_0, \beta} \|\beta\|^2 \quad subject to \quad y_i(\beta_0 + \beta^t x_i) \ge 1 \; \forall i

The constraints can be written as

\max(0, 1 - y_i(\beta_0 + \beta^t x_i)) = (1 - y_i(\beta_0 + \beta^t x_i))_+ = 0 \; \forall i

Hence we can re-write the optimization problem as

\min_{\beta_0, \beta} \left[ \sum_{i=1}^{n} (1 - y_i(\beta_0 + \beta^t x_i))_+ + \lambda \|\beta\|^2 \right]

Finding the optimal separating hyperplane

This is of the form

\min_f \left[ \sum_{i=1}^{n} L(y_i, f(x_i)) + \lambda \Omega(\|f\|^2) \right]

where

L(y, f(x)) = (1 - y f(x))_+, \quad \Omega(\|f\|^2) = \|f\|^2

From the Representer Theorem we know the solution to the latter problem is

\hat f(x) = \sum_{i=1}^{n} \alpha_i x_i^t x

and therefore \|\hat f\|^2 = \alpha^t K \alpha. This is the same form of the solution found via the KKT conditions:

\beta = \sum_{i=1}^{n} \alpha_i y_i x_i
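As a further illustration of the Representer Theorem, with squared-error loss instead of the hinge loss above, one obtains kernel ridge regression: the problem over the whole RKHS collapses to a linear system in \alpha. The Gaussian kernel and data below are illustrative.

```python
import numpy as np

def gauss(a, b, sigma=0.5):
    """Gaussian kernel (broadcasts over arrays)."""
    return np.exp(-((a - b) ** 2) / sigma)

rng = np.random.default_rng(5)
x = np.sort(rng.uniform(0, 1, 25))
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.1, 25)

K = gauss(x[:, None], x[None, :])     # Gram matrix K_ij = k(x_i, x_j)
lam = 1e-3

# min_alpha (y - K alpha)^t (y - K alpha) + lam * alpha^t K alpha
# has the solution alpha = (K + lam I)^{-1} y.
alpha = np.linalg.solve(K + lam * np.eye(25), y)

def f_hat(t):
    # Representer form: f_hat(t) = sum_i alpha_i k(t, x_i)
    return gauss(t, x) @ alpha

# RKHS norm of the solution: ||f_hat||^2 = alpha^t K alpha.
norm_sq = alpha @ K @ alpha
```

Swapping the kernel (e.g. linear x_i^t x, as in the SVM above) changes only the Gram matrix, not the form of the solution.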
