
Chapter 5: Basis Expansion and Regularization

DD3364

April 1, 2012

Introduction

Moving beyond linearity


Main idea
Augment the vector of inputs X with additional variables.
These are transformations of X

  h_m(X) : ℝ^p → ℝ,   m = 1, …, M.

Then model the relationship between X and Y

  f(X) = Σ_{m=1}^M β_m h_m(X) = Σ_{m=1}^M β_m Z_m

as a linear basis expansion in X.

Have a linear model w.r.t. Z. Can use the same methods as before.

Which transformations?
Some examples
Linear:

  h_m(X) = X_m,   m = 1, …, p

Polynomial:

  h_m(X) = X_j²   or   h_m(X) = X_j X_k

Non-linear transformation of single inputs:

  h_m(X) = log(X_j), √X_j, …

Non-linear transformation of multiple inputs:

  h_m(X) = ‖X‖

Use of indicator functions:

  h_m(X) = Ind(L_m ≤ X_k < U_m)

Pros and Cons of this augmentation

Pros
Can model more complicated decision boundaries.
Can model more complicated regression relationships.

Cons
Lack of locality in global basis functions.
Solution: use local polynomial representations such as piecewise polynomials and splines.

How should one find the correct complexity of the model?
There is the danger of over-fitting.


Controlling the complexity of the model

Common approaches taken:

Restriction Methods
Limit the class of functions considered, e.g. use additive models

  f(X) = Σ_{j=1}^p Σ_{m=1}^{M_j} β_{jm} h_{jm}(X_j)

Selection Methods
Scan the set of h_m and only include those that contribute significantly to the fit of the model (e.g. Boosting, CART).

Regularization Methods
Let

  f(X) = Σ_{j=1}^M β_j h_j(X)

but when learning the β_j's restrict their values in the manner of ridge regression and the lasso.
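The regularization approach can be sketched in a few lines: build a design matrix from a basis expansion of x and shrink the β_j's in the manner of ridge regression (a minimal numpy sketch; the data, basis size M, and λ are illustrative assumptions, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 50)
y = np.sin(3 * x) + 0.1 * rng.standard_normal(50)

# Basis expansion h_j(x) = x^(j-1), j = 1..M, collected in a design matrix
M = 10
H = np.vander(x, M, increasing=True)          # n x M, columns x^0 .. x^(M-1)

# Ridge-style shrinkage of the beta_j's: beta = (H'H + lam I)^-1 H'y
lam = 1e-3
beta = np.linalg.solve(H.T @ H + lam * np.eye(M), H.T @ y)

yhat = H @ beta
print(np.mean((y - yhat) ** 2))
```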

Piecewise Polynomials and Splines

Piecewise polynomial function

To obtain a piecewise polynomial function f(X):
Divide the domain of X into contiguous intervals.
Represent f by a separate polynomial in each interval.

Examples (panels of Figure 5.1): Piecewise Constant, Piecewise Linear, Continuous Piecewise Linear, Piecewise-linear Basis Function.
Blue curve: the ground-truth function.
Green curve: the piecewise constant/linear fit to the training data.


Example: Piecewise constant function

Divide [a, b], the domain of X, into three regions

  [a, ξ₁), [ξ₁, ξ₂), [ξ₂, b]   with ξ₁ < ξ₂.

The ξ_i's are referred to as knots.

Define three basis functions

  h₁(X) = Ind(X < ξ₁),  h₂(X) = Ind(ξ₁ ≤ X < ξ₂),  h₃(X) = Ind(ξ₂ ≤ X)

The model f(X) = Σ_{m=1}^3 β_m h_m(X) is fit using least-squares.

As the basis functions don't overlap, β̂_m = mean of the y_i's in the m-th region.

(Figure 5.1, top left panel: a piecewise constant function fit to some artificial data.)
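The claim that β̂_m is just the region mean can be verified directly (a sketch; the knots and data are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 200)
y = np.where(x < 0.3, 0.0, np.where(x < 0.7, 1.0, 0.5)) + 0.05 * rng.standard_normal(200)

xi1, xi2 = 0.3, 0.7                              # the two knots
H = np.column_stack([x < xi1,                    # h1 = Ind(X < xi1)
                     (xi1 <= x) & (x < xi2),     # h2 = Ind(xi1 <= X < xi2)
                     xi2 <= x]).astype(float)    # h3 = Ind(xi2 <= X)

beta, *_ = np.linalg.lstsq(H, y, rcond=None)     # least-squares fit

# Non-overlapping indicators => beta_m equals the mean of y in region m
print(beta)
print([y[x < xi1].mean(), y[(xi1 <= x) & (x < xi2)].mean(), y[xi2 <= x].mean()])
```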

Example: Piecewise linear function

In this case define 6 basis functions

  h₁(X) = Ind(X < ξ₁),   h₂(X) = Ind(ξ₁ ≤ X < ξ₂),   h₃(X) = Ind(ξ₂ ≤ X)
  h₄(X) = X h₁(X),       h₅(X) = X h₂(X),            h₆(X) = X h₃(X)

The model f(X) = Σ_{m=1}^6 β_m h_m(X) is fit using least-squares.

As the basis functions don't overlap, this fits a separate linear model to the data in each region.

(Figure 5.1, top right panel: an unrestricted piecewise linear fit.)

Example: Continuous piecewise linear function

Additionally impose the constraint that f(X) is continuous at ξ₁ and ξ₂.

With the six basis functions above, this means

  β₁ + β₄ ξ₁ = β₂ + β₅ ξ₁,  and
  β₂ + β₅ ξ₂ = β₃ + β₆ ξ₂

This reduces the # of dof of f(X) from 6 to 4.

FIGURE 5.1. The top left panel shows a piecewise constant function fit to some artificial data. The broken vertical lines indicate the positions of the two knots ξ₁ and ξ₂. The blue curve represents the true function, from which the data were generated with Gaussian noise. The remaining two panels show piecewise linear functions fit to the same data: the top right unrestricted, and the lower left restricted to be continuous at the knots. The lower right panel shows a piecewise linear basis function, h₃(X) = (X − ξ₁)₊, continuous at ξ₁. The black points indicate the sample evaluations h₃(x_i), i = 1, …, N.

A more compact set of basis functions

To impose the continuity constraints directly, can use this basis instead:

  h₁(X) = 1,           h₂(X) = X
  h₃(X) = (X − ξ₁)₊,   h₄(X) = (X − ξ₂)₊

Smoother f(X)
Can achieve a smoother f(X) by increasing the order
of the local polynomials, or
of the continuity at the knots.

Piecewise Cubic Polynomials

Piecewise-cubic polynomials with increasing orders of continuity (panels of Figure 5.2): Discontinuous, Continuous, Continuous First Derivative, Continuous Second Derivative.

f(X) is a cubic spline if
it is a piecewise cubic polynomial and
has continuous 1st and 2nd derivatives at the knots.

FIGURE 5.2. A series of piecewise-cubic polynomials, with increasing orders of continuity at the knots. The function in the lower right panel is continuous, and has continuous first and second derivatives at the knots. It is known as a cubic spline. Enforcing one more order of continuity would lead to a global cubic polynomial.

A cubic spline

It is not hard to show (Exercise 5.1) that the following basis represents a cubic spline with knots at ξ₁ and ξ₂:

  h₁(X) = 1,    h₃(X) = X²,    h₅(X) = (X − ξ₁)³₊
  h₂(X) = X,    h₄(X) = X³,    h₆(X) = (X − ξ₂)³₊

Order-M spline
An order-M spline with knots ξ₁, …, ξ_K is
a piecewise polynomial of order M and
has continuous derivatives up to order M − 2.

The general form for the truncated-power basis set is

  h_j(X) = X^{j−1},   j = 1, …, M
  h_{M+l}(X) = (X − ξ_l)₊^{M−1},   l = 1, …, K

In practice the most widely used orders are M = 1, 2, 4.
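The truncated-power basis set can be generated directly from its definition (a sketch; M = 4 with two knots reproduces the cubic-spline basis above):

```python
import numpy as np

def truncated_power_basis(x, knots, M):
    """h_j(X) = X^(j-1) for j = 1..M, plus h_{M+l}(X) = (X - xi_l)_+^(M-1)."""
    x = np.asarray(x, float)
    cols = [x ** j for j in range(M)]                          # X^0 .. X^(M-1)
    cols += [np.maximum(x - xi, 0.0) ** (M - 1) for xi in knots]
    return np.column_stack(cols)

x = np.linspace(0, 1, 11)
H = truncated_power_basis(x, knots=[0.3, 0.7], M=4)   # order 4 = cubic spline
print(H.shape)   # (11, 6): M + K columns
```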



Regression Splines
Fixed-knot splines are known as regression splines.
For a regression spline one needs to select
the order of the spline,
the number of knots and
the placement of the knots.

One common approach is to set a knot at each observation x_i.

There are many equivalent bases for representing splines; the truncated power basis is intuitively attractive but not computationally attractive.

A better basis set for implementation is the B-spline basis set.
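The B-spline basis the slide recommends can be evaluated with the Cox-de Boor recursion (a self-contained sketch; the knot vector and degree are illustrative, and production code would use a library routine):

```python
import numpy as np

def bspline_basis(x, t, k):
    """All degree-k B-spline basis functions for knot vector t, at points x,
    via the Cox-de Boor recursion. Returns an array of shape (len(x), n)."""
    x = np.asarray(x, float)
    # degree 0: indicators of the half-open knot spans
    B = np.array([((t[i] <= x) & (x < t[i + 1])).astype(float)
                  for i in range(len(t) - 1)])
    for d in range(1, k + 1):
        Bnew = np.zeros((len(t) - d - 1, len(x)))
        for i in range(len(t) - d - 1):
            if t[i + d] > t[i]:
                Bnew[i] += (x - t[i]) / (t[i + d] - t[i]) * B[i]
            if t[i + d + 1] > t[i + 1]:
                Bnew[i] += (t[i + d + 1] - x) / (t[i + d + 1] - t[i + 1]) * B[i + 1]
        B = Bnew
    return B.T

# Clamped cubic knot vector with one interior knot at 0.5
t = np.array([0, 0, 0, 0, 0.5, 1, 1, 1, 1], float)
B = bspline_basis(np.linspace(0, 0.99, 50), t, k=3)
print(B.shape, B.sum(axis=1)[:3])   # basis functions sum to 1 on the interior
```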


Natural Cubic Splines

Natural Cubic Splines


Problem
The polynomials fit beyond the boundary knots behave wildly.

Solution: Natural Cubic Splines
Add the constraint that the function is linear beyond the boundary knots.
This frees up 4 dof, which can be used to place more knots in the interior region.
Near the boundaries one has reduced the variance of the fit but increased its bias!
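The boundary behaviour can be seen with scipy's CubicSpline, whose 'natural' boundary condition forces the second derivative to vanish at the boundary knots, so the fit flattens toward linearity there (a sketch; the data are illustrative):

```python
import numpy as np
from scipy.interpolate import CubicSpline

x = np.linspace(0, 1, 8)
y = np.sin(2 * np.pi * x)

nat = CubicSpline(x, y, bc_type='natural')   # f'' = 0 at both boundary knots
print(nat(x[0], 2), nat(x[-1], 2))           # second derivatives at the ends: ~0
```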

Smoothing Splines

Smoothing Splines
Avoid the knot selection problem by using a maximal set of knots.
Complexity of the fit is controlled by regularization.
Consider the following problem:
Find the function f(x) with continuous second derivative which minimizes

  RSS(f, λ) = Σ_{i=1}^n (y_i − f(x_i))² + λ ∫ (f″(t))² dt

where λ is the smoothing parameter, the first term measures closeness to the data, and the second term is a curvature penalty.


Smoothing Splines: Smoothing parameter

  RSS(f, λ) = Σ_{i=1}^n (y_i − f(x_i))² + λ ∫ (f″(t))² dt

λ establishes a trade-off between predicting the training data and minimizing the curvature of f(x).

The two special cases are
λ = 0: f is any function which interpolates the data.
λ = ∞: f is the simple least-squares line fit.

In these two cases we go from a very rough to a very smooth f(x).
The hope is that λ ∈ (0, ∞) indexes an interesting class of functions in between.


Smoothing Splines: Form of the solution

  RSS(f, λ) = Σ_{i=1}^n (y_i − f(x_i))² + λ ∫ (f″(t))² dt

Amazingly, the above criterion has an explicit, finite-dimensional unique minimizer for a fixed λ.

It is a natural cubic spline with knots at the unique values of the x_i, i = 1, …, n.

That is,

  f̂(x) = Σ_{j=1}^n N_j(x) θ_j

where the N_j(x) are an n-dimensional set of basis functions for representing this family of natural splines.


Smoothing Splines: Estimating the coefficients

The criterion to be optimized thus reduces to

  RSS(θ, λ) = (y − Nθ)ᵗ (y − Nθ) + λ θᵗ Ω_N θ

where

  {N}_{ij} = N_j(x_i),   {Ω_N}_{jk} = ∫ N_j″(t) N_k″(t) dt,   y = (y₁, y₂, …, y_n)ᵗ

Smoothing Splines: Estimating the coefficients

The criterion to be optimized thus reduces to

  RSS(θ, λ) = (y − Nθ)ᵗ (y − Nθ) + λ θᵗ Ω_N θ

and its solution is given by

  θ̂ = (Nᵗ N + λ Ω_N)⁻¹ Nᵗ y

The fitted smoothing spline is then given by

  f̂(x) = Σ_{j=1}^n N_j(x) θ̂_j

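The generalized ridge solution θ̂ = (NᵗN + λΩ_N)⁻¹Nᵗy can be sketched numerically, approximating Ω_N by second differences of the basis columns (the basis, knots, λ, and data here are all illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 40
x = np.sort(rng.uniform(0, 1, n))
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(n)

# Stand-in basis N (cubic truncated powers at a few knots); Omega approximated
# from second differences of the basis columns on a fine grid.
knots = np.linspace(0.1, 0.9, 9)
N = np.column_stack([np.ones(n), x, x**2, x**3] +
                    [np.maximum(x - k, 0.0)**3 for k in knots])
t = np.linspace(0, 1, 400)
Nt = np.column_stack([np.ones_like(t), t, t**2, t**3] +
                     [np.maximum(t - k, 0.0)**3 for k in knots])
h = t[1] - t[0]
D2 = np.diff(Nt, 2, axis=0) / h**2           # approximate N_j''(t)
Omega = D2.T @ D2 * h                        # approximate integral of N_j'' N_k''

lam = 1e-5
theta = np.linalg.solve(N.T @ N + lam * Omega, N.T @ y)
fhat = N @ theta
print(np.mean((y - fhat) ** 2))
```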

Degrees of Freedom and Smoother Matrices

A smoothing spline is a linear smoother

Assume that λ has been set.
Remember the estimated coefficients are a linear combination of the y_i's:

  θ̂ = (Nᵗ N + λ Ω_N)⁻¹ Nᵗ y

Let f̂ be the n-vector of the fitted values f̂(x_i). Then

  f̂ = N θ̂ = N(Nᵗ N + λ Ω_N)⁻¹ Nᵗ y = S_λ y

where S_λ = N(Nᵗ N + λ Ω_N)⁻¹ Nᵗ.


Properties of S_λ
S_λ is symmetric and positive semi-definite.
S_λ S_λ ⪯ S_λ.
S_λ has rank n.
The book defines the effective degrees of freedom of a smoothing spline to be

  df_λ = trace(S_λ)
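df_λ = trace(S_λ) can be computed directly: at λ = 0 the smoother is a projection onto the column space of N, so df equals the number of basis functions, and df shrinks as λ grows (a sketch with stand-in N and Ω matrices):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 30, 8
N = rng.standard_normal((n, p))            # stand-in basis matrix
A = rng.standard_normal((p, p))
Omega = A.T @ A                            # stand-in psd penalty matrix

def smoother(lam):
    return N @ np.linalg.solve(N.T @ N + lam * Omega, N.T)

df = [np.trace(smoother(lam)) for lam in (0.0, 1.0, 100.0)]
print(df)   # df = p at lam = 0, then monotonically decreasing
```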

Effective dof of a smoothing spline

(Figure 5.6: relative change in spinal BMD versus age, with separate smoothing-spline fits for males and females.)

FIGURE 5.6. The response is the relative change in bone mineral density measured at the spine in adolescents, as a function of age. A separate smoothing spline was fit to the males and females, with λ ≈ 0.00022. This choice corresponds to about 12 degrees of freedom.

Both curves were fit with λ ≈ 0.00022. This choice corresponds to about 12 degrees of freedom.

The eigen-decomposition of S_λ: S_λ in Reinsch form

Let N = USVᵗ be the SVD of N.
Using this decomposition it is straightforward to re-write

  S_λ = N(Nᵗ N + λ Ω_N)⁻¹ Nᵗ

as

  S_λ = (I + λK)⁻¹   where   K = U S⁻¹ Vᵗ Ω_N V S⁻¹ Uᵗ.

It is also easy to show that f̂ = S_λ y is the solution to the optimization problem

  min_f (y − f)ᵗ (y − f) + λ fᵗ K f


The eigen-decomposition of S_λ
Let K = P D Pᵗ be the real eigen-decomposition of K, possible as K is symmetric and positive semi-definite.

Then

  S_λ = (I + λK)⁻¹ = (I + λ P D Pᵗ)⁻¹
      = (P Pᵗ + λ P D Pᵗ)⁻¹
      = (P (I + λD) Pᵗ)⁻¹
      = P (I + λD)⁻¹ Pᵗ
      = Σ_{k=1}^n 1/(1 + λ d_k) p_k p_kᵗ

where the d_k are the elements of the diagonal D and e-values of K, and the p_k are the e-vectors of K.
The p_k are also the e-vectors of S_λ, and the 1/(1 + λ d_k) its e-values.
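The identity S_λ = Σ_k p_k p_kᵗ / (1 + λ d_k) is easy to verify numerically (a sketch with a random symmetric positive semi-definite K; the λ value is illustrative):

```python
import numpy as np

rng = np.random.default_rng(8)
A = rng.standard_normal((6, 6))
K = A @ A.T                                 # symmetric psd stand-in for K
lam = 0.5

d, P = np.linalg.eigh(K)                    # K = P diag(d) P^T
S = np.linalg.inv(np.eye(6) + lam * K)      # S_lambda = (I + lam K)^-1
S_eig = sum(1 / (1 + lam * d[k]) * np.outer(P[:, k], P[:, k]) for k in range(6))
print(np.allclose(S, S_eig))
```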


Example: Cubic spline smoothing to air pollution data

(Figure 5.7, top panel: ozone concentration versus Daggot pressure gradient.)

Green curve: smoothing spline with df = trace(S_λ) = 11.
Red curve: smoothing spline with df = trace(S_λ) = 5.

Example: Eigenvalues of S_λ

(Figure 5.7, lower left panel: the first 25 eigenvalues of the two smoother matrices.)

Green curve: eigenvalues of S_λ with df = 11.
Red curve: eigenvalues of S_λ with df = 5.

FIGURE 5.7. (Top:) Smoothing spline fit of ozone concentration versus Daggot pressure gradient. The two fits correspond to different values of the smoothing parameter, chosen to achieve five and eleven effective degrees of freedom, defined by df = trace(S_λ). (Lower left:) First 25 eigenvalues for the two smoothing-spline matrices. The first two are exactly 1, and all are ≥ 0. (Lower right:) …

Example: Eigenvectors of S_λ

(Figure 5.7, lower right panels.)

Each blue curve is an eigenvector of S_λ plotted against x. The top left has the highest e-value, the bottom right the smallest.
The red curve is the eigenvector damped by 1/(1 + λ d_k).

Highlights of the eigenrepresentation

The eigenvectors of S_λ do not depend on λ.
The smoothing spline decomposes y w.r.t. the basis {p_k} and shrinks the contributions using 1/(1 + λ d_k), as

  S_λ y = Σ_{k=1}^n p_k 1/(1 + λ d_k) (p_kᵗ y)

The first two e-values of S_λ are always 1, and correspond to the eigenspace of functions linear in x.

The sequence of p_k, ordered by decreasing 1/(1 + λ d_k), appears to increase in complexity.

  df_λ = trace(S_λ) = Σ_{k=1}^n 1/(1 + λ d_k).


Visualization of S_λ: Equivalent Kernels

(Figure 5.8: the smoother matrix, with rows 12, 25, 50, 75, 100 and 115 highlighted.)

FIGURE 5.8. The smoother matrix for a smoothing spline is nearly banded, indicating an equivalent kernel with local support. The left panel represents the …

Choosing λ?
This is a crucial and tricky problem.
We will deal with this problem in Chapter 7, when we consider the problem of Model Selection.

Nonparametric Logistic Regression

Back to logistic regression

Previously considered a binary classifier s.t.

  log [ P(Y=1|X=x) / P(Y=0|X=x) ] = β₀ + βᵗ x

However, consider the case when

  log [ P(Y=1|X=x) / P(Y=0|X=x) ] = f(x)

which in turn implies

  P(Y=1|X=x) = e^{f(x)} / (1 + e^{f(x)})

Fitting f(x) in a smooth fashion leads to a smooth estimate of P(Y=1|X=x).


The penalized log-likelihood criterion

Construct the penalized log-likelihood criterion

  ℓ(f; λ) = Σ_{i=1}^n [ y_i log P(Y=1|x_i) + (1 − y_i) log(1 − P(Y=1|x_i)) ] − ½ λ ∫ (f″(t))² dt
          = Σ_{i=1}^n [ y_i f(x_i) − log(1 + e^{f(x_i)}) ] − ½ λ ∫ (f″(t))² dt
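Maximizing a penalized log-likelihood of this shape can be sketched with Newton-Raphson on a finite basis, with a diagonal Ω standing in (a loud simplification) for the curvature penalty ∫(f″)² dt; the basis, Ω, λ, and data are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100
x = rng.uniform(-2, 2, n)
p_true = 1 / (1 + np.exp(-(x + 0.5 * x**2)))
y = (rng.uniform(size=n) < p_true).astype(float)

# f(x) = H theta on a small polynomial basis; Omega is a crude stand-in
# for the curvature penalty (hypothetical weights, not the true integral).
H = np.column_stack([np.ones(n), x, x**2, x**3])
Omega = np.diag([0.0, 0.0, 4.0, 36.0])
lam = 0.1

theta = np.zeros(4)
for _ in range(25):                          # Newton-Raphson ascent on l(f)
    p = 1 / (1 + np.exp(-H @ theta))
    g = H.T @ (y - p) - lam * Omega @ theta  # penalized score
    W = p * (1 - p)
    Hess = H.T @ (H * W[:, None]) + lam * Omega
    theta += np.linalg.solve(Hess, g)
print(theta)
```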

Regularization and Reproducing Kernel Hilbert Spaces

General class of regularization problems

There is a class of regularization problems which have the form

  min_{f∈H} [ Σ_{i=1}^n L(y_i, f(x_i)) + λ J(f) ]

where
L(y_i, f(x_i)) is a loss function,
J(f) is a penalty functional,
H is a space of functions on which J(f) is defined.

Important subclass of problems of this form

These are generated by a positive definite kernel K(x, y), where
the corresponding space of functions H_K is called a reproducing kernel Hilbert space (RKHS), and
the penalty functional J is defined in terms of the kernel as well.

What does all this mean??

What follows is mainly based on the notes of Nuno Vasconcelos.

Types of Kernels
Definition
A kernel is a mapping k : X × X → ℝ.
These three types of kernels are equivalent:

  dot-product kernel ⇔ positive definite kernel ⇔ Mercer kernel

Dot-product kernel
Definition
A mapping

  k : X × X → ℝ

is a dot-product kernel if and only if

  k(x, y) = ⟨Φ(x), Φ(y)⟩

where Φ : X → H, H is a vector space, and ⟨·,·⟩ is an inner-product on H.

Positive definite kernel

Definition
A mapping

  k : X × X → ℝ

is a positive semi-definite kernel on X × X if, for all m ∈ ℕ and x₁, …, x_m with each x_i ∈ X, the Gram matrix

  K = [ k(x_i, x_j) ]_{i,j=1}^m

is positive semi-definite.
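This property is easy to check numerically for a given kernel and sample (a sketch using the Gaussian kernel; the data and bandwidth are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.standard_normal((20, 3))

# Gaussian (RBF) kernel k(x, y) = exp(-||x - y||^2 / sigma)
sigma = 2.0
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq / sigma)

eig = np.linalg.eigvalsh(K)
print(eig.min())   # >= 0 up to rounding: the Gram matrix is psd
```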

Mercer kernel
Definition
A symmetric mapping k : X × X → ℝ such that

  ∫∫ k(x, y) f(x) f(y) dx dy ≥ 0

for all functions f s.t. ∫ f(x)² dx < ∞

is a Mercer kernel.

Two different pictures

These different definitions lead to different interpretations of what the kernel does:

Interpretation I
Reproducing kernel map:

  H_k = { f(·) | f(·) = Σ_i α_i k(·, x_i) }
  ⟨f, g⟩ = Σ_{i=1}^m Σ_{j=1}^m α_i α′_j k(x_i, x′_j)
  Φ : x ↦ k(·, x)

Two different pictures

These different definitions lead to different interpretations of what the kernel does:

Interpretation II
Mercer kernel map:

  H_M = ℓ₂ = { x | Σ_i x_i² < ∞ }
  ⟨f, g⟩ = fᵗ g
  Φ : x ↦ (√λ₁ φ₁(x), √λ₂ φ₂(x), …)ᵗ

where the λ_i, φ_i are the e-values and eigenfunctions of k(x, y) with λ_i > 0, and ℓ₂ is the space of square-summable sequences (Σ_i a_i² < ∞).

Interpretation I: The dot-product picture

When a Gaussian kernel k(x, x_i) = exp(−‖x − x_i‖²/σ) is used:
the point x_i ∈ X is mapped into the Gaussian G(·, x_i, σI),
H_k is the space of all functions that are linear combinations of Gaussians,
the kernel is a dot product in H_k and a non-linear similarity on X.

The reproducing property

With the definition of H_k and ⟨·,·⟩ one has

  ⟨k(·, x), f(·)⟩ = f(x)   ∀ f ∈ H_k

This is called the reproducing property.

Leads to the reproducing kernel Hilbert spaces

Definition
A Hilbert Space is a complete dot-product space.
(vector space + dot product + limit points of all Cauchy sequences)


Reproducing kernel Hilbert spaces

Definition
Let H be a Hilbert space of functions f : X → ℝ. H is a Reproducing Kernel Hilbert Space (RKHS) with inner-product ⟨·,·⟩ if there exists a

  k : X × X → ℝ

s.t.
k(·,·) spans H, that is

  H = { f(·) | f(·) = Σ_i α_i k(·, x_i) for α_i ∈ ℝ and x_i ∈ X }

k(·,·) is a reproducing kernel of H:

  f(x) = ⟨f(·), k(·, x)⟩   ∀ f ∈ H

Interpretation II: Mercer Kernels

Theorem
Let k : X × X → ℝ be a Mercer kernel. Then there exists an orthonormal set of functions

  ∫ φ_i(x) φ_j(x) dx = δ_ij

and a set of λ_i ≥ 0 such that

  Σ_{i=1}^∞ λ_i² = ∫∫ k²(x, y) dx dy < ∞   and   k(x, y) = Σ_{i=1}^∞ λ_i φ_i(x) φ_i(y)

Transformation induced by a Mercer kernel

This eigen-decomposition gives another way to design the feature transformation induced by the kernel k(·,·).
Let Φ : X → ℓ₂ be defined by

  Φ(x) = (√λ₁ φ₁(x), √λ₂ φ₂(x), …)

where ℓ₂ is the space of square-summable sequences.
Clearly

  ⟨Φ(x), Φ(y)⟩ = Σ_{i=1}^∞ √λ_i φ_i(x) √λ_i φ_i(y) = Σ_{i=1}^∞ λ_i φ_i(x) φ_i(y) = k(x, y)

Issues
Therefore there is a vector space ℓ₂, other than H_k, such that k(x, y) is a dot product in that space.
Have two very different interpretations of what the kernel does:
1. Reproducing kernel map
2. Mercer kernel map
They are in fact more or less the same.

rkhs Vs Mercer maps

For H_M we write Φ(x) = Σ_i √λ_i φ_i(x) e_i.

As the φ_i's are orthonormal there is a 1-1 map

  Γ : ℓ₂ → span{φ_k},   Γ e_k = √λ_k φ_k(·)

Can write

  (Γ ∘ Φ)(x) = Σ_i λ_i φ_i(x) φ_i(·) = k(·, x)

Hence x ↦ k(·, x) maps x into M = span{φ_k(·)}.

The Mercer picture

(Figure: data points x_i ∈ X are mapped by Φ into ℓ₂, and by Γ ∘ Φ into span{φ_i}, where (Γ ∘ Φ)(x_i) = k(·, x_i).)

Mercer map
Define the inner-product in M as

  ⟨f, g⟩_M = ∫ f(x) g(x) dx

Note we will normalize the eigenfunctions φ_l such that

  ∫ φ_l(x) φ_k(x) dx = δ_lk / λ_l

Any function f ∈ M can be written as

  f(x) = Σ_{k=1}^∞ γ_k φ_k(x)

Mercer map

Then

  ⟨f(·), k(·, y)⟩_M = ∫ f(x) k(x, y) dx
                    = ∫ Σ_{k=1}^∞ γ_k φ_k(x) Σ_{l=1}^∞ λ_l φ_l(x) φ_l(y) dx
                    = Σ_{k=1}^∞ Σ_{l=1}^∞ γ_k λ_l φ_l(y) ∫ φ_k(x) φ_l(x) dx
                    = Σ_{l=1}^∞ γ_l λ_l φ_l(y) (1/λ_l)
                    = Σ_{l=1}^∞ γ_l φ_l(y) = f(y)

k is a reproducing kernel on M.

Mercer map Vs Reproducing kernel map

We want to check if
the space M = H_k,
⟨f, g⟩_M and ⟨f, g⟩ are equivalent.

To do this will involve the following steps:
1. Show H_k ⊆ M.
2. Show ⟨f, g⟩_M = ⟨f, g⟩ for f, g ∈ H_k.
3. Show M ⊆ H_k.

H_k ⊆ M
If f ∈ H_k then there exist m ∈ ℕ, {α_i} and {x_i} such that

  f(·) = Σ_{i=1}^m α_i k(·, x_i)
       = Σ_{i=1}^m α_i Σ_{l=1}^∞ λ_l φ_l(x_i) φ_l(·)
       = Σ_{l=1}^∞ ( Σ_{i=1}^m α_i λ_l φ_l(x_i) ) φ_l(·)
       = Σ_{l=1}^∞ γ_l φ_l(·)

Thus f is a linear combination of the φ_l's and f ∈ M.

This shows that if f ∈ H_k then f ∈ M, and therefore H_k ⊆ M.

Equivalence of the inner-products

Let f, g ∈ H_k with

  f(·) = Σ_{i=1}^n α_i k(·, x_i),   g(·) = Σ_{j=1}^m β_j k(·, y_j)

Then by definition

  ⟨f, g⟩ = Σ_{i=1}^n Σ_{j=1}^m α_i β_j k(x_i, y_j)

While

  ⟨f, g⟩_M = ∫ f(x) g(x) dx
           = ∫ Σ_{i=1}^n α_i k(x, x_i) Σ_{j=1}^m β_j k(x, y_j) dx
           = Σ_{i=1}^n Σ_{j=1}^m α_i β_j ∫ k(x, x_i) k(x, y_j) dx

Equivalence of the inner-products ctd

  ⟨f, g⟩_M = Σ_{i=1}^n Σ_{j=1}^m α_i β_j ∫ Σ_{l=1}^∞ λ_l φ_l(x) φ_l(x_i) Σ_{s=1}^∞ λ_s φ_s(x) φ_s(y_j) dx
           = Σ_{i=1}^n Σ_{j=1}^m α_i β_j Σ_{l=1}^∞ λ_l φ_l(x_i) φ_l(y_j)
           = Σ_{i=1}^n Σ_{j=1}^m α_i β_j k(x_i, y_j)
           = ⟨f, g⟩

Thus for all f, g ∈ H_k:

  ⟨f, g⟩_M = ⟨f, g⟩

M ⊆ H_k
Can also show that if f ∈ M then also f ∈ H_k.
Will not prove that here.
But it implies M ⊆ H_k.

Summary
The reproducing kernel map and the Mercer kernel map lead to the same RKHS; Mercer gives us an orthonormal basis.

Interpretation I
Reproducing kernel map:

  H_k = { f(·) | f(·) = Σ_i α_i k(·, x_i) }
  ⟨f, g⟩ = Σ_{i=1}^m Σ_{j=1}^m α_i α′_j k(x_i, x′_j)
  Φ_r : x ↦ k(·, x)

Summary
The reproducing kernel map and the Mercer kernel map lead to the same RKHS; Mercer gives us an orthonormal basis.

Interpretation II
Mercer kernel map:

  H_M = ℓ₂ = { x | Σ_i x_i² < ∞ }
  ⟨f, g⟩ = fᵗ g
  Φ_M : x ↦ (√λ₁ φ₁(x), √λ₂ φ₂(x), …)ᵗ
  Γ : ℓ₂ → span{φ_k(·)}

  Γ ∘ Φ_M = Φ_r

Back to Regularization

Back to regularization
We want to solve

  min_{f∈H_k} [ Σ_{i=1}^n L(y_i, f(x_i)) + λ J(f) ]

where H_k is the RKHS of some appropriate Mercer kernel k(·,·).

What is a good regularizer J(f)?

Intuition: wigglier functions have larger norm than smoother functions.

For f ∈ H_k we have

  f(x) = Σ_i α_i k(x, x_i)
       = Σ_i α_i Σ_l λ_l φ_l(x) φ_l(x_i)
       = Σ_l ( λ_l Σ_i α_i φ_l(x_i) ) φ_l(x)
       = Σ_l c_l φ_l(x)

What is a good regularizer J(f)?

and therefore

  ‖f(x)‖² = Σ_{l,k} c_l c_k ⟨φ_l(x), φ_k(x)⟩_M = Σ_{l,k} c_l c_k δ_lk/λ_l = Σ_l c_l²/λ_l

with c_l = λ_l Σ_i α_i φ_l(x_i).

Hence
‖f‖² grows with the number of c_l different from zero;
functions with large e-values get penalized less, and vice versa;
more coefficients means more high frequencies, or less smoothness.

Representer Theorem
Theorem
Let
Ω : [0, ∞) → ℝ be a strictly monotonically increasing function,
H_k be the RKHS associated with a kernel k(x, y),
L(y, f(x)) be a loss function.

Then

  f̂ = arg min_{f∈H_k} [ Σ_{i=1}^n L(y_i, f(x_i)) + Ω(‖f‖²) ]

has a representation of the form

  f̂(x) = Σ_{i=1}^n α_i k(x, x_i)

Relevance
The remarkable consequence of the theorem is that we can reduce the minimization over the infinite-dimensional space of functions to a minimization over a finite-dimensional space.

This is because, as f̂ = Σ_{i=1}^n α_i k(·, x_i), then

  ‖f̂‖² = ⟨f̂, f̂⟩ = Σ_{ij} α_i α_j ⟨k(·, x_i), k(·, x_j)⟩ = Σ_{ij} α_i α_j k(x_i, x_j) = αᵗ K α

and

  f̂(x_i) = Σ_j α_j k(x_i, x_j) = K_i α

where K = (k(x_i, x_j)) is the Gram matrix and K_i is its i-th row.


Representer Theorem
Theorem
Let
Ω : [0, ∞) → ℝ be a strictly monotonically increasing function,
H_k be the RKHS associated with a kernel k(x, y),
L(y, f(x)) be a loss function.

Then

  f̂ = arg min_{f∈H_k} [ Σ_{i=1}^n L(y_i, f(x_i)) + Ω(‖f‖²) ]

has a representation of the form f̂(x) = Σ_{i=1}^n α̂_i k(x, x_i), where

  α̂ = arg min_α [ Σ_{i=1}^n L(y_i, K_i α) + Ω(αᵗ K α) ]

Regularization and SVM

Rejigging the formulation of the SVM

When given linearly separable data {(x_i, y_i)}, the optimal separating hyperplane is given by

  min_{β₀, β} ‖β‖²   subject to   y_i(β₀ + βᵗ x_i) ≥ 1  ∀i

The constraints are fulfilled when

  max(0, 1 − y_i(β₀ + βᵗ x_i)) = (1 − y_i(β₀ + βᵗ x_i))₊ = 0  ∀i

Hence we can re-write the optimization problem as

  min_{β₀, β} [ Σ_{i=1}^n (1 − y_i(β₀ + βᵗ x_i))₊ + λ ‖β‖² ]

SVMs connections to regularization

Finding the optimal separating hyperplane

  min_{β₀, β} [ Σ_{i=1}^n (1 − y_i(β₀ + βᵗ x_i))₊ + λ ‖β‖² ]

can be seen as a regularization problem

  min_f [ Σ_{i=1}^n L(y_i, f(x_i)) + Ω(‖f‖²) ]

where

  L(y, f(x)) = (1 − y f(x))₊,   Ω(‖f‖²) = λ ‖f‖²

SVMs connections to regularization

From the Representer theorem we know the solution to the latter problem is

  f̂(x) = Σ_{i=1}^n α_i x_iᵗ x

if the basic kernel k(x, y) = xᵗ y is used.

Therefore ‖f̂‖² = αᵗ K α.
This is the same form of the solution found via the KKT conditions:

  β = Σ_{i=1}^n α_i y_i x_i
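The hinge-loss-plus-ridge form can be minimized directly by subgradient descent (a sketch on synthetic linearly separable data; the step size and iteration count are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200
X = rng.standard_normal((n, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1.0, -1.0)   # linearly separable labels

lam = 0.01
b0, beta = 0.0, np.zeros(2)
for _ in range(2000):
    margins = y * (b0 + X @ beta)
    active = margins < 1                          # margin violations
    # subgradient of (1/n) sum hinge + lam ||beta||^2
    g_beta = -(y[active, None] * X[active]).sum(axis=0) / n + 2 * lam * beta
    g_b0 = -y[active].sum() / n
    beta -= 0.1 * g_beta
    b0 -= 0.1 * g_b0

acc = np.mean(np.sign(b0 + X @ beta) == y)
print(acc)
```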