
Machine Learning

Reproducing Kernel Hilbert Spaces


ECE 595 (Homework 2)
Adebello Jelili
Department of Electrical and Computer Engineering, University of New Mexico,
Albuquerque, NM 87131 USA
Abstract: Using the toy problem given in the assignment, we analyze the theoretical considerations by showing that the square all-ones matrix 11^T is needed to compute the bias b, and by relating the small diagonal term λI of the kernel regression to a regularization that minimizes the complexity of the classifier. Simulations that adjust the free parameters λ and σ are used to verify these claims with a v-fold cross-validation algorithm, implemented on a double Fibonacci spiral data-set, and the results are compared with those of an SVM classifier that uses a Gaussian kernel. A performance analysis on the Wisconsin Breast Cancer Database is discussed for comparing our results with other related papers.

(The work in this paper was aided by several online materials, including some of the proofs, which can be provided on request.)

I. INTRODUCTION

Predicting the future, rather than describing the data at hand, is the primary goal of learning, but many statistical learning problems involve some form of dimension reduction to achieve this goal. This may involve feature selection, with the aim of finding similarities among the original set of variables [2] and a new set of data with which to predict the future behavior of those variables. Finding an accurate model faces the challenge of having a much smaller sample set in a much higher-dimensional space: a model that is too highly parameterized over-fits the data and fails to learn the features of the underlying function [1], [8], while a model with too few parameters may not even describe the training data adequately and gives similarly poor performance. With regularization we have a reliable way to balance these effects and build a model whose complexity is matched to each class, depending on the dimension of the data being processed. Kernels provide a computationally feasible and flexible method for implementing such an algorithm. Reproducing Kernel Hilbert Spaces (RKHS) introduce a particularly useful hypothesis set associated with a given kernel, and deriving the general solution of Tikhonov regularization in an RKHS will help us implement and solve the statistical complications above [6].
A. Kernels
Considering RKHS theory, let V ⊆ R^n be endowed with an inner product. The configuration involves three aspects: the subspace V, the orientation of V in R^n, and the inner product on V, together with a basis v_1, ..., v_r ∈ V ⊆ R^n for V and the corresponding Gram matrix G, whose ij-th element is G_ij = ⟨v_i, v_j⟩ [4]. However, this representation does not satisfy the necessary requirements, because there is no unique choice of basis. An RKHS instead produces a unique spanning set k_1, ..., k_n for V ⊆ R^n by the rule that k_i is the unique vector in V satisfying ⟨v, k_i⟩ = e_i^T v for all v ∈ V. Therefore, the kernel of V is the unique matrix K = [k_1 k_2 ... k_n] ∈ R^{n×n} determined in such a way that the k_i span V and ⟨k_i, k_j⟩ = K_ij, which is the Gram matrix corresponding to the vectors k_1, ..., k_n. The name reproducing kernel can then be attributed to the kernel K reproducing itself, in that its ij-th element K_ij is the inner product of its i-th and j-th columns, ⟨k_i, k_j⟩.
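For a concrete finite-dimensional sketch (not part of the original text), assume V = R^n with the inner product ⟨x, y⟩ = x^T A y for a symmetric positive-definite A; then k_i = A^{-1} e_i, K = A^{-1}, and the reproducing property K_ij = ⟨k_i, k_j⟩ can be checked numerically:

```python
import numpy as np

# Illustrative assumption: V = R^n with inner product <x, y> = x' A y, A SPD.
rng = np.random.default_rng(0)
n = 4
B = rng.standard_normal((n, n))
A = B @ B.T + n * np.eye(n)      # symmetric positive-definite

# Riesz representers of the evaluation functionals: <v, k_i> = e_i' v  =>  k_i = A^{-1} e_i
K = np.linalg.inv(A)             # columns are k_1, ..., k_n

# Reproducing property: K_ij equals the inner product of the i-th and j-th columns
inner = K.T @ A @ K              # entry (i, j) is <k_i, k_j>
print(np.allclose(inner, K))     # True
```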
B. Regularization
The main goal of regularization is to remedy the ill-conditioned nature [3] of empirical risk minimization (ERM) by effectively restricting the hypothesis space of our data, introducing a complexity penalty term into the minimization, ERR(f) + λ pen(f), where the regularization parameter λ controls the trade-off between the two terms and causes the minimization to seek out simpler functions with lower cost.

Using Tikhonov regularization, we can rewrite our ERM as

(1/n) Σ_{i=1}^{n} L(f(x_i), y_i) + λ ‖f‖² ,    (1)

where λ is a regularization parameter, L(f(x_i), y_i) is the loss function, and ‖·‖ is the norm in the function space. From (1), the penalization should force f to be as smooth as possible while still fitting the data.
C. Representer Theorem
The algorithms that we want to study are defined by an optimization problem over an RKHS as in (1), where λ is a positive real number, the RKHS is defined by the kernel K(·, ·), and L(·, ·) is the loss function. We have imposed stability on this problem through the use of regularization; since the function space may be infinite-dimensional, solving the Tikhonov regularization problem directly could seem impossible, but the result actually has a very compact representation, as described below. The minimizer over the RKHS of the regularized empirical risk can be represented by the expression

f(x) = Σ_{i=1}^{n} α_i K(x_i, x) ,    (2)

for some n-tuple (α_1, ..., α_n) ∈ R^n. Hence, minimizing over the Hilbert space boils down to minimizing over R^n. Notice that there is only a finite number n of training points, so the fact that the minimizer can be written as a linear combination of kernel terms guarantees a representation of the minimizer as a vector in R^n.
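To make (2) concrete, the following is a minimal sketch for the squared-loss case, where the Tikhonov minimizer has the closed form α = (K + λnI)^{-1} y; the Gaussian kernel and the function names are illustrative assumptions, not the assignment scripts.

```python
import numpy as np

def gaussian_kernel(X, Z, sigma=1.0):
    # K_ij = exp(-||x_i - z_j||^2 / (2*sigma^2)), the Gaussian kernel used later in the paper
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def fit_regularized_least_squares(X, y, lam=0.1, sigma=1.0):
    # With the squared loss in (1), the representer theorem (2) gives
    # f(x) = sum_i alpha_i K(x_i, x) with alpha = (K + lam*n*I)^{-1} y.
    n = len(y)
    K = gaussian_kernel(X, X, sigma)
    return np.linalg.solve(K + lam * n * np.eye(n), y)

def evaluate(alpha, X_train, X_new, sigma=1.0):
    # Evaluate the kernel expansion (2) at new points.
    return gaussian_kernel(X_new, X_train, sigma) @ alpha
```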
II. METHODOLOGY

Analyzing data using kernel functions is a powerful computational tool [7]; its advantage comes from its capability to map a set of data into a high-dimensional Hilbert space without explicitly computing the coordinates of that structure. This is possible through a positive semi-definite kernel K, i.e.,

Σ_{i,j=1}^{n} α_i α_j K(x_i, x_j) ≥ 0    (3)

for any n ∈ N, samples x_i ∈ X (i = 1, ..., n), and coefficients α_i ∈ R (i = 1, ..., n). A kernel that is symmetric and positive semi-definite ensures the existence of a Hilbert space H and a map φ : X → H such that

κ(x, x') = ⟨φ(x), φ(x')⟩    (4)

for all x, x' ∈ X.
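As a quick numerical illustration of condition (3) (with the Gaussian kernel used later in the paper, which is an assumption at this point in the text), the Gram matrix is symmetric positive semi-definite, so the quadratic form in (3) is non-negative for any coefficients:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((30, 2))                  # 30 sample points in R^2
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
K = np.exp(-d2 / (2.0 * 1.0 ** 2))                # Gaussian Gram matrix, sigma = 1

alpha = rng.standard_normal(30)
print(alpha @ K @ alpha >= -1e-12)                # the quadratic form in (3) is non-negative
print(np.linalg.eigvalsh(K).min() >= -1e-12)      # all eigenvalues are (numerically) >= 0
```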


As discussed in Section I, regularization is introduced to prevent over-fitting by placing additional constraints on the model in the form of a complexity penalty. Consider the classifier given in the homework,

y_j = w^T φ(x_j) + b ,    (5)

where y is the n × 1 vector of targets and X the n × p matrix of samples. We can minimize the sum of squared errors using ridge regression.

A. Ridge Regression

Finding a function that models the dependencies between x_j and y_j amounts to minimizing a cost function while working in the feature space [5], where x_j ↦ φ(x_j), without over-fitting, which we achieve by regularizing the cost. A simple and effective way to regularize is to penalize the norm of w and then determine how to choose λ; the most useful approach is cross-validation or leave-one-out estimates. The total cost function hence becomes

L = (1/2) Σ_j (y_j − w^T x_j)² + (λ/2) ‖w‖² ,    (6)

which is minimized by taking the derivatives and equating them to zero, resulting in

w = ( Σ_j x_j x_j^T + λI )^{-1} ( Σ_i y_i x_i ) .    (7)
Here the term λI can be seen to stabilize the inverse numerically by bounding the smallest eigenvalue away from zero. In matrix form, (7) becomes

w = (X X^T + λI)^{-1} X y .    (8)

Taking the bias b into consideration, our cost function becomes

L = ‖y − (w^T X + b 1^T)‖² + λ‖w‖² + λ₀ b² .    (9)

Taking the derivative with respect to b and equating it to zero,

∂L/∂b = 2nb − 2(y − w^T X) 1 + 2λ₀ b = 0 ,    (10)

we get

b = (1/(n + λ₀)) (y − w^T X) 1 = ȳ − w^T x̄ ,    (11)

where

ȳ = (1/(n + λ₀)) y 1    (12)

and

x̄ = (1/(n + λ₀)) X 1 .    (13)

Substituting (11) back into (9) then removes the explicit bias term from the cost.
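A small sketch of (8)–(13), keeping the convention that samples are the columns of X. Note that, following the order of the text, w is computed from (8) first and b from (11) afterwards (a fully joint solution would re-center the data, which is not shown here), so this is an illustration rather than the assignment's implementation.

```python
import numpy as np

def ridge_with_bias(X, y, lam=0.1, lam0=0.1):
    # X: d x n with samples as columns, y: length-n targets, as in (8)-(13).
    d, n = X.shape
    ones = np.ones(n)
    # Eq. (8): w = (X X^T + lam*I)^{-1} X y
    w = np.linalg.solve(X @ X.T + lam * np.eye(d), X @ y)
    # Eqs. (11)-(13): b = (y - w^T X) 1 / (n + lam0) = y_bar - w^T x_bar
    y_bar = y @ ones / (n + lam0)
    x_bar = X @ ones / (n + lam0)
    b = y_bar - w @ x_bar
    return w, b

# Illustrative use on random data (shapes only; not the homework data):
rng = np.random.default_rng(0)
X = rng.standard_normal((3, 50))    # 3 features, 50 samples
y = rng.standard_normal(50)
w, b = ridge_with_bias(X, y)
print(w.shape, b)                    # (3,) and a scalar bias
```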
B. Kernel Ridge Regression

We now replace each data case with its feature vector, x_i ↦ φ_i = φ(x_i). Using the kernel trick, we can then perform the inverse above in the smaller of the two spaces: the dimension of the feature space or the number of data cases.

1) Bias Computation: From the solution of the ridge regression,

w = Φ (Φ^T Φ + λI)^{-1} y .    (14)

The prediction y for a future input x is

y = w^T φ(x) = φ(x)^T Φ (Φ^T Φ + λI)^{-1} y = k^T (K + λI)^{-1} y ,    (15)

where the elements of the vector k and the matrix K are k_j = κ(x, x_j) and K_ij = κ(x_i, x_j). Considering the bias by augmenting the input directly as x̃ = [1, x^T]^T and then mapping it to feature space will not solve the problem; besides, not all kernels can be applied to vectors (e.g., string kernels, graph kernels). Instead we augment in the feature space as φ̃ = [1, φ^T]^T, which leads to a non-linear model as in (5) with a bias term. Augmenting in the feature space and computing the inner product gives

κ̃(x_i, x_j) = ⟨φ̃_i, φ̃_j⟩ = ⟨φ_i, φ_j⟩ + 1 = κ(x_i, x_j) + 1 .    (16)

Using this to solve (9) in the feature space,

L = ‖y − (w^T Φ̃ + b 1^T)‖² + λ‖w‖² + λ₀ b² ,    (17)

where, with K̃ denoting the Gram matrix of κ̃,

α = (K̃ + λI)^{-1} y^T ;    (18)

the bias becomes

b = (1/(n + λ₀)) (y − w^T Φ̃) 1 = ȳ − w^T φ̄ .    (19)

Substituting (18) into (19), the bias reduces to

b = 1^T α = Σ_{j=1}^{n} α_j .    (20)
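The derivation (15)–(20) can be summarized in a short sketch of the augmented-kernel reading, in which augmenting in feature space adds 1 to every kernel entry as in (16), the dual coefficients come from (18), and the bias is the sum of the α's as in (20). The Gaussian kernel, the parameter names, and the absorption of the separate λ₀ penalty into the augmentation are illustrative assumptions, not the assignment code.

```python
import numpy as np

def gaussian_kernel(X, Z, sigma=1.0):
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def kernel_ridge_with_bias(X, y, lam=0.1, sigma=1.0):
    # Augmenting in feature space adds 1 to every kernel entry, Eq. (16).
    n = len(y)
    K_tilde = gaussian_kernel(X, X, sigma) + 1.0
    alpha = np.linalg.solve(K_tilde + lam * np.eye(n), y)   # Eq. (18)
    b = alpha.sum()                                          # Eq. (20): b = sum_j alpha_j
    return alpha, b

def predict(alpha, b, X_train, X_new, sigma=1.0):
    # Eq. (15): y(x) = k^T alpha + b, with k_j = kappa(x, x_j); adding b is
    # equivalent to using the augmented kernel k~_j = kappa(x, x_j) + 1.
    return gaussian_kernel(X_new, X_train, sigma) @ alpha + b
```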

2) Complexity Minimization: By substituting these parameters into the ridge regression in the feature space, we can minimize (21) with respect to α:

Σ_j ( y_j − Σ_i α_i κ̃(x_i, x_j) )² + λ Σ_i Σ_j α_i α_j κ(x_i, x_j) + λ₀ ( Σ_j α_j )² ,    (21)

from which we can see that, in order to control the model complexity, we have to control the dual form of the norm of w through α, and directly control the norm of the kernel, since the kernel itself can make our function arbitrarily complex if we do not control it.
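Since ‖w‖² in the dual is α^T K α, a quick check (with an assumed Gaussian kernel and synthetic XOR-like labels, not the homework data) shows how increasing λ shrinks this complexity term:

```python
import numpy as np

def gaussian_kernel(X, Z, sigma=1.0):
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

rng = np.random.default_rng(1)
X = rng.standard_normal((50, 2))
y = np.sign(X[:, 0] * X[:, 1])             # XOR-like labels, in the spirit of the toy problem

K = gaussian_kernel(X, X, sigma=1.0)
for lam in (1e-3, 1e-1, 1e1):
    alpha = np.linalg.solve(K + lam * np.eye(len(y)), y)
    print(lam, float(alpha @ K @ alpha))   # dual complexity alpha^T K alpha shrinks as lam grows
```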
C. Parameter Validation

To validate the free parameters λ and σ, we split the training sample with v-fold and implement cross-validation according to Algorithm 1. We then assess the test set and adjust each of the parameters to its optimal value.
Algorithm 1 Validation
 1: procedure PARAMETERVALIDATION
 2:    range ← length of the parameter grid
 3:    for i = 1, ..., range do
 4:        for j = 1, ..., v do                        ▷ v-fold split
 5:            X ← N training samples
 6:            XV ← split X into v folds
 7:            validate ← fold j of XV
 8:            train ← remaining (v − 1) folds with parameter(i)
 9:            mse(i, j) ← validation errors
10:    index ← arg min over i of the mean over j of mse(i, j)
11:    parameter ← parameter(index)
12:    output the selected parameters for the test set
13: end procedure
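A compact sketch of the v-fold search that Algorithm 1 describes (the kernel-ridge fit, the fixed fold shuffle, and the parameter grids are illustrative assumptions, not the prac21.m implementation):

```python
import numpy as np

def gaussian_kernel(X, Z, sigma):
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def cv_mse(X, y, lam, sigma, v=5):
    # Mean validation MSE over v folds for one (lam, sigma) pair.
    folds = np.array_split(np.random.default_rng(0).permutation(len(y)), v)
    errs = []
    for k in range(v):
        val = folds[k]
        trn = np.hstack([folds[j] for j in range(v) if j != k])
        K = gaussian_kernel(X[trn], X[trn], sigma) + 1.0
        alpha = np.linalg.solve(K + lam * np.eye(len(trn)), y[trn])
        pred = (gaussian_kernel(X[val], X[trn], sigma) + 1.0) @ alpha
        errs.append(np.mean((pred - y[val]) ** 2))
    return np.mean(errs)

def grid_search(X, y, lams, sigmas, v=5):
    # Algorithm 1: keep the (lam, sigma) pair with minimum cross-validated MSE.
    scores = {(l, s): cv_mse(X, y, l, s, v) for l in lams for s in sigmas}
    return min(scores, key=scores.get)
```

For the setup described below, a call such as grid_search(X, y, np.linspace(0.01, 1.0, 10), np.linspace(0.2, 2.0, 10)) would mirror the 10-value grids over λ ∈ [0, 1] and σ ∈ [0, 2] (avoiding the degenerate endpoints).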

III. EXPERIMENTS AND EVALUATION


Based on the non-linear classifier framework given in the assignment, we analyze simulated data with the aim of validating the classifier free parameter λ and the kernel parameter σ, generating an MMSE output that provides the best learning of our data. We address two problems (Sections III-A and III-B): (1) using a nonlinear mapping function to improve the trade-off property of the classifier, and (2) selecting the regularization parameters to minimize the complexity of the classifier and estimate the minimum error.

A. Parameters Adjustment

The experimental setup for parameter adjustment was carried out in MATLAB using prac21.m, a script that generates two sets of samples: one set of 100 samples for training and another set of 1000 samples for testing. The training samples were fixed, while the test samples were regenerated for the performance analysis. Using v-fold training and validation of the parameters, we split the training samples into 5 folds and apply a cross-validation algorithm for training and validation, in order to extract the best similarity function for the data while avoiding over-fitting. A grid of 10 values was used for each parameter, over the intervals λ ∈ [0, 1] and σ ∈ [0, 2]. Applying this setup to the XOR problem in the homework (N = 100 data points) generated the results shown in Fig. 1 and Fig. 2; Fig. 3 and Fig. 4 show the results generated from the same setup using a classifier with N = 400, the MMSE criterion, and Gaussian kernels.

Fig. 1: Plot of the XOR problem and a classifier based on MMSE.


After cross-validation and validation of the parameters, we found the parameters with the lowest mean square error (MSE = 0.507) to be λ = 0.1 and σ = 0.6. Choosing these optimal parameters for the test data results in a strong classifier, as shown in Fig. 4, where the data are tightly clustered and contain a considerable amount of Gaussian noise.

Analyzing these results, we notice that exploiting prior knowledge of the function to be learned leads to a lower generalization error. Prior knowledge of the hypothesis space and the related kernel also gives considerably higher performance for the RKHS algorithm. The MSE results show that the RKHS is very sensitive to parameter tuning, as observed in the experiment.


Fig. 2: Contour plot of the above XOR problem.

Fig. 5: The double Fibonacci spiral

Fig. 3: Plot of the XOR problem and a classifier based on MMSE (N = 400).

Fig. 6: The double Fibonacci spiral for N = 400 data using the reproducing
kernel Hilbert space.

Fig. 4: Contour plot of the above XOR problem.

B. Double Fibonacci Spiral

Fig. 5 shows two sets generated from a distribution that follows the Fibonacci spiral, using the MATLAB script prac22.m. Repeating the previous experiment and comparing its results with those of a classifier based on the SVM criterion and a Gaussian kernel, we generated the results shown in Figs. 6-9. These plots show the classification boundaries obtained by validating the two SVM parameters σ and C, compared with those obtained using a kernel matrix with the validated parameters λ and σ, as described in the first experiment.

The use of the double Fibonacci spiral in this experiment is aimed at illustrating the concept of regularization by taking a critical look at the behavior of both the RKHS and the SVM learning tools. Both methods use a Gaussian kernel, in the form of a radial basis function, for classification. As in the previous experiment, a range of finely sampled hyper-parameters was used to test and validate the model selection, and each case was evaluated with the MSE averaged over the 400-sample data-set and 10 iterations. Figs. 6 and 7 show the classification of the spiral data using the RKHS, while Figs. 8 and 9 show the results using the SVM classifier with a Gaussian kernel. The RKHS produces a near-perfect classification with negligible error, which was achieved through the cross-validation process that produced the optimal parameters λ = 0.01 and σ = 0.1 for the learning process. In comparison, the SVM (Figs. 8 and 9) was able to classify the data-set, but with much more error than the RKHS; this is because the data-set used here is small, and, as found in Section III-C, the SVM performs better as the size of the data-set increases.

C. Performance Analysis

Fig. 7: Contour plot for the RKHS.

Evaluating the performance of the experiments carried out above, we compare the SVM and the RKHS of the previous section using several criteria, as shown in Table I:
- computational complexity;
- large vs. small data-sets;
- the non-linearity order of the classifier;
- MMSE;
- smoothness of the performance behavior.

The kernels provided in the MATLAB scripts supply a non-linear mapping and computational convenience, with the results shown in Table I. The SVM algorithm outperformed the RKHS method in terms of the (slightly lower) MSE used to compute the optimal parameters, but the RKHS has lower computational complexity, provided the dimension of the data-set is smaller than that of the feature space used for classification. The plots in Fig. 6 also show that the smooth behavior of the RKHS classifier and the high SNR produce a near-perfect classifier with little or no error.
Fig. 8: The double Fibonacci spiral for N = 400 data using the SVM with Gaussian kernel.

Table I: Performance comparison of the SVM and RKHS

Criteria                     SVM                RKHS
Computational complexity     O(N^3)             O(N^2)
MSE                          0.50               0.5012
Optimal hyper-parameters     C = 5, σ = 0.1     λ = 0.01, σ = 0.1

Overall, the RKHS is shown to be equivalent to, or even better than, the SVM for the data dimension used in this experiment. We also notice that the accuracy of the methods depends on the degree of non-linearity: the higher the non-linearity, the lower the accuracy of the SVM tool. Both approaches also depend on hyper-parameter setting and tuning, which leads to acceptable performance over a wide range of hyper-parameters.

Using the UCI [9] database to analyze the performance of these tools on real data, which contains almost 696 samples and 10 attributes, would have allowed us to critically compare the overall performance of each of these algorithms with other published results, but we did not have enough time to do so.
IV. CONCLUSION

Fig. 9: Contour plot for the SVM using Gaussian Kernel.


The comparative study performed in this paper provides insightful observations into the performance of the SVM (SVR) and the RKHS, based on the RBF/Gaussian kernel formulation of a nonlinear model. This makes it possible to scrutinize the statistical properties of the data-sets and take advantage of kernels in mapping input data, output data, and the similarities in the data.

In general, the accuracy and efficiency of each approach rest on different criteria, which should be weighed by the user according to the goal to be achieved, since each method only performs well depending on the data scale and the level of non-linearity.

Finally, by using the real data of the UCI database in the future and incorporating prior knowledge into both methods, we could improve the quality of the RKHS and regression models and state categorically the advantages of these algorithms over other methods.
REFERENCES

[1] J. Hainmueller and C. Hazlett, "Kernel Regularized Least Squares: Reducing Mis-Specification Bias with a Flexible and Interpretable Machine Learning Approach," MIT, MA, Sept. 2013.
[2] M. W. Chang and C. J. Lin, "Leave-One-Out Bounds for Support Vector Regression Model Selection," Neural Computation, vol. 17, pp. 1188-1222, 2005.
[3] H. Zou and T. Hastie, "Regularization and variable selection via the elastic net," Journal of the Royal Statistical Society B, vol. 67, pp. 301-320, 2005.
[4] G. C. Cawley, N. L. Talbot and O. Chapelle, "Estimating Predictive Variances with Kernel Ridge Regression," Univ. of East Anglia, Norwich, UK, 2003.
[5] X. Lu, M. Unoki, S. Matsuda, C. Hori and H. Kashioka, "Controlling Tradeoff Between Approximation Accuracy and Complexity of a Smooth Function in a Reproducing Kernel Hilbert Space for Noise Reduction," IEEE Trans. Signal Processing, vol. 61, pp. 601-610, Nov. 2012.
[6] F. X. Dupe, S. Bougleux, L. Brun, O. Lezoray and A. Elmoataz, "Kernel-Based Implicit Regularization of Structured Objects," IEEE Int. Conf. on Pattern Recognition (ICPR), pp. 2142-2145, Aug. 2010.
[7] http://www.mit.edu/~9.520/scribe-notes/cl7.pdf
[8] http://www.mblondel.org/journal/2011/02/09/regularized-least-squares/
[9] https://archive.ics.uci.edu/ml/datasets.html
