Dr. Robi Polikar
Week 9
Lecture 9
Structural Risk Minimization
&
Other Kernel Methods
Advanced Topics in Pattern Recognition © 2001- 2012, Robi Polikar, Rowan University, Glassboro, NJ
This Week in PR
Another geometric interpretation of SVMs:
Convex hull of datasets
Relationship to ν-SVMs
Overview of Support Vector Machines
Structural Risk Minimization
Vapnik – Chervonenkis Dimension
Capacity / flexibility of a classifier
Strengths and weaknesses of SVMs
Connection to neural networks
Other kernel-based approaches
Kernel PCA
Figure credits: D: Duda, Hart & Stork, Pattern Classification. G: R. Gutierrez-Osuna, Lecture Notes. RP: original graphic created / generated by Robi Polikar, All Rights Reserved © 2001-2012; may be used with permission and citation.
Another Geometric Interpretation of SVMs
We have already seen that SVMs provide the optimal solution through
maximizing the margin, which intuitively makes sense.
…but there is another interpretation of the SVM solution that also makes common sense:
The solution found by the SVM is the hyperplane that bisects the segment
joining the two nearest points between the convex hulls of the data from
the two classes.
Convex huh?
Convex Hull
Let’s first understand the concept of convexity:
In Euclidean space, an object is said to be convex (as opposed to concave), if every straight line that
connects any pair of points remains inside the object.
(Figures: a convex polygon, and a non-convex (concave) polygon.)
Similarly, a real-valued function 𝑓 defined on some interval is called a convex function if, for any two
points 𝑥, 𝑦 within its domain 𝐶 and for any real-valued constant 𝛼 ∈ [0, 1], the condition
𝑓(𝛼𝑥 + (1 − 𝛼)𝑦) ≤ 𝛼𝑓(𝑥) + (1 − 𝛼)𝑓(𝑦) is satisfied.
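As a quick numeric sanity check (a Python sketch, not part of the original slides; the function names are mine), we can test this inequality on sampled pairs for the convex f(x) = x², and watch it fail for the concave −x²:

```python
def is_convex_on_samples(f, xs, alphas):
    """Check f(a*x + (1-a)*y) <= a*f(x) + (1-a)*f(y) on all sampled pairs."""
    for x in xs:
        for y in xs:
            for a in alphas:
                lhs = f(a * x + (1 - a) * y)
                rhs = a * f(x) + (1 - a) * f(y)
                if lhs > rhs + 1e-12:   # small tolerance for floating point
                    return False
    return True

xs = [-2.0, -0.5, 0.0, 1.0, 3.0]
alphas = [0.0, 0.25, 0.5, 0.75, 1.0]
print(is_convex_on_samples(lambda x: x * x, xs, alphas))      # convex: True
print(is_convex_on_samples(lambda x: -(x * x), xs, alphas))   # concave: False
```

Passing all sampled pairs is of course only evidence, not a proof of convexity.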
http://en.wikipedia.org/wiki/File:Convex_polygon_illustration1.png; http://en.wikipedia.org/wiki/File:Convex_polygon_illustration2.png
http://en.wikipedia.org/wiki/File:Convex-function-graph-1.png; http://en.wikipedia.org/wiki/File:Convex_supergraph.png
Convex Hull
Now, we can define a convex hull of a set of points:
The convex hull of the set X is the minimal convex set that contains all elements of X
Intuitively, for planar (2-D) objects, the convex hull is the shape obtained by stretching
an elastic band to encompass the object. This interpretation, while not technically
precise, is close enough for our understanding of the general meaning of the convex hull.
𝑅(𝑋) = { 𝑦 : 𝑦 = Σᵢ₌₁ⁿ 𝛼ᵢ𝐱ᵢ , 𝐱ᵢ ∈ 𝑋, 𝛼ᵢ ∈ ℝ, 𝛼ᵢ ≥ 0, Σᵢ₌₁ⁿ 𝛼ᵢ = 1 }
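For a concrete feel, here is a compact Python implementation of the planar convex hull using Andrew's monotone chain, a standard algorithm (not from the lecture); interior points are dropped, leaving only the "rubber band" vertices:

```python
def convex_hull(points):
    """Andrew's monotone chain: hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        # z-component of (a - o) x (b - o); <= 0 means a clockwise (or straight) turn
        return (a[0]-o[0])*(b[1]-o[1]) - (a[1]-o[1])*(b[0]-o[0])

    lower, upper = [], []
    for p in pts:                       # build lower hull left-to-right
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):             # build upper hull right-to-left
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]      # endpoints shared, so drop duplicates

square_plus_center = [(0, 0), (1, 0), (1, 1), (0, 1), (0.5, 0.5)]
print(convex_hull(square_plus_center))  # → [(0, 0), (1, 0), (1, 1), (0, 1)]
```

The interior point (0.5, 0.5) is correctly excluded from the hull.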
SVM and the Convex Hull
It turns out that the solution given by the dual formulation of the SVM
max 𝐿𝐷 = Σᵢ₌₁ⁿ 𝛼ᵢ − (1/2) Σᵢ₌₁ⁿ Σⱼ₌₁ⁿ 𝛼ᵢ𝛼ⱼ𝑦ᵢ𝑦ⱼ 𝐱ᵢᵀ𝐱ⱼ ,
subject to 𝛼ᵢ ≥ 0, Σᵢ₌₁ⁿ 𝛼ᵢ𝑦ᵢ = 0
for the linearly separable task results in the hyperplane that bisects the linear segment
joining the nearest points between the convex hulls of the data classes:
SVM and the Convex Hull
It is easy to see this connection: the solution described by the nearest points on
the convex hulls requires minimizing (convince yourself of the following!)
min over 𝛼 of ‖ Σᵢ:𝑦ᵢ=+1 𝛼ᵢ𝐱ᵢ − Σᵢ:𝑦ᵢ=−1 𝛼ᵢ𝐱ᵢ ‖²
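This nearest-point problem can be explored numerically. A rough Python sketch (my own toy data; with only two points per class, each class's convex combination reduces to a single parameter, so a grid search suffices):

```python
# Two points per class, so each convex combination is (t, 1 - t).
pos = [(0.0, 2.0), (2.0, 2.0)]   # class +1
neg = [(0.0, 0.0), (2.0, 0.0)]   # class -1

def combo(points, t):
    (x1, y1), (x2, y2) = points
    return (t * x1 + (1 - t) * x2, t * y1 + (1 - t) * y2)

grid = [i / 100 for i in range(101)]
best = min(
    ((combo(pos, t), combo(neg, s)) for t in grid for s in grid),
    key=lambda pq: (pq[0][0] - pq[1][0]) ** 2 + (pq[0][1] - pq[1][1]) ** 2,
)
p, q = best
# The midpoint of the segment joining the nearest hull points lies on the
# SVM hyperplane; here the hulls are parallel segments at y = 2 and y = 0,
# so the hyperplane is y = 1.
midpoint = ((p[0] + q[0]) / 2, (p[1] + q[1]) / 2)
print(p, q, midpoint)
```

For real problems this is solved by the quadratic program above, not by grid search; the sketch only illustrates the geometry.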
SVM and the Convex Hull
The more interesting connection, however, is the behavior in the nonseparable case. For this one,
we turn to our ν-SVM formulation. Recall:
min (1/2)‖𝐰‖² − 𝜈𝜌 + (1/𝑛) Σᵢ₌₁ⁿ 𝜉ᵢ ,
subject to 𝑦ᵢ(𝐰ᵀ𝐱ᵢ + 𝑏) ≥ 𝜌 − 𝜉ᵢ , 𝜉ᵢ ≥ 0, 𝜌 ≥ 0
Normalizing the cost function by 2/𝜈² and the set of constraints by 𝜈, we obtain

min ‖𝐰‖² − 2𝜌 + 𝜇 Σᵢ₌₁ⁿ 𝜉ᵢ ,
subject to 𝑦ᵢ(𝐰ᵀ𝐱ᵢ + 𝑏) ≥ 𝜌 − 𝜉ᵢ , 𝜉ᵢ ≥ 0, 𝜌 ≥ 0

where 𝜇 = 2/(𝜈𝑛). In this formulation, the parameters 𝐰, 𝑏, 𝜌 and 𝜉 are all normalized by the
constant 𝜈, which is not shown i) because it is just a multiplicative constant that does not change the
minimizer, and ii) to show the similarity that remains with the original formulation above. The
interesting point is the Wolfe dual of this optimization problem, which leads to the reduced convex
hull interpretation described next.
Reduced Convex Hulls
The reduced convex hull of a set of points X, denoted R(X, 𝜇), is defined as the convex set

𝑅(𝑋, 𝜇) = { 𝑦 : 𝑦 = Σᵢ₌₁ⁿ 𝛼ᵢ𝐱ᵢ , 𝐱ᵢ ∈ 𝑋, 𝛼ᵢ ∈ ℝ, 0 ≤ 𝛼ᵢ ≤ 𝜇, Σᵢ₌₁ⁿ 𝛼ᵢ = 1 }

(Figure: reduced convex hulls of the same point set for 𝜇 = 1, 𝜇 = 0.4, and 𝜇 = 0.1; the hull shrinks toward the centroid as 𝜇 decreases.)
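A quick numeric illustration (a Python sketch with my own toy data, not from the slides): under the box constraint 0 ≤ 𝛼ᵢ ≤ 𝜇 with Σ𝛼ᵢ = 1, the farthest reachable point in any direction can be found greedily (fractional-knapsack style), showing the hull shrinking from the full hull at 𝜇 = 1 to the centroid at 𝜇 = 1/n:

```python
def rch_max_x(points, mu):
    """Maximum x-coordinate reachable in R(X, mu): greedily load weight
    (at most mu per point) onto the points with the largest x.  Greedy is
    optimal here because the objective is linear over a box + simplex."""
    order = sorted(points, key=lambda p: p[0], reverse=True)
    remaining, x = 1.0, 0.0
    for (px, _) in order:
        w = min(mu, remaining)
        x += w * px
        remaining -= w
        if remaining <= 1e-12:
            break
    return x

X = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0), (4.0, 4.0)]   # a square, n = 4
print(rch_max_x(X, mu=1.0))    # full hull: reaches the vertex, x = 4.0
print(rch_max_x(X, mu=0.25))   # mu = 1/n: only alpha_i = 1/n is feasible,
                               # so the hull collapses to the centroid, x = 2.0
```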
ν-SVM and the Reduced Convex Hull
Note that we started with the ν-SVM formulation and ended up with the reduced convex
hull interpretation of the SVM.
Also note that the only difference between the two convex hull formulations (the first for the
linearly separable case, the second for the nonseparable case) is the additional upper bound 𝜇 in
the Lagrange multiplier constraints; that is, each 𝛼ᵢ must be at most 𝜇.
In the separable case we have 0 ≤ 𝛼ᵢ ≤ 1, i.e., 𝜇 = 1, which means the convex hulls
do not overlap. The solution is then the maximum margin separating the full convex hulls.
In the nonseparable case we have 0 ≤ 𝛼ᵢ ≤ 𝜇 with 𝜇 < 1, which gives the reduced convex hull case.
The margin is then sought between the reduced convex hulls.
Also recall from earlier the relationship between 𝜇 and 𝜈: 𝜇 = 2/(𝜈𝑛). The parameter 𝜈 directly
controls the error rate and the number of misclassified instances: the larger the 𝜈 (the more
samples allowed inside the margin), the smaller the 𝜇 (the smaller the extent of the non-
overlapping reduced convex hulls). In order to ensure a feasible solution:
𝜈 ≤ 𝜈max = 2𝑛min/𝑛 ≤ 1, where 𝑛min = min{𝑛₊, 𝑛₋} is the number of samples in the smaller of the + and − classes, and
𝜈 ≥ 𝜈min = 2/(𝑛𝜇max), satisfying 𝜇 ≥ 𝜇min = 1/𝑛min and 𝜇 ≤ 𝜇max ≤ 1
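These bounds are straightforward to compute. A tiny Python helper (function and variable names are mine), using 𝜇 = 2/(𝜈𝑛) and the feasibility requirement 𝜇 ≥ 1/𝑛min from above:

```python
def nu_svm_bounds(n_pos, n_neg):
    """Feasibility bounds for the nu-SVM / reduced-convex-hull link.
    Uses mu = 2 / (nu * n); each class's alphas must sum to 1 with
    alpha_i <= mu, so mu >= 1/n_min is required for a nonempty RCH."""
    n = n_pos + n_neg
    n_min = min(n_pos, n_neg)
    mu_min = 1.0 / n_min          # smallest mu with a feasible alpha
    nu_max = 2.0 * n_min / n      # follows from mu_min = 2 / (nu_max * n)
    return nu_max, mu_min

nu_max, mu_min = nu_svm_bounds(n_pos=30, n_neg=10)
print(nu_max, mu_min)  # 0.5 0.1
```

So with a 30/10 class imbalance, choosing 𝜈 above 0.5 would make the problem infeasible.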
ν-SVM Demo
(Figure: ν-SVM demo with a Gaussian kernel, kernel parameter 0.2 and ν = 0.3; the learning data and the resulting margin are plotted over x1, x2 ∈ [0, 1].)
Some Observations
SVMs provide a unique optimal solution for linearly separable two-class
problems
If there is noise in the data, we can add slack variables
These do not appear in the final formulation; instead we have the 𝐶 (or 𝜈) parameter,
a trade-off between maximizing the margin and minimizing the training error
If the problem is not linearly separable, SVM can still solve it by going to a
higher dimensional space, where the problem may be (more) linearly
separable
All calculations are done in the original space through the kernel trick
We do not need to know the transformation / mapping to the high dimension
The transformation function is determined by the choice of the kernel
There is no guarantee that the problem will be linearly separable in the space
determined by the chosen kernel – it is just that the problem is more likely to be
linearly separable in the high dimensional space.
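The kernel trick mentioned above is easy to verify for a concrete kernel. A small Python sketch (my own example, not from the slides): for 2-D inputs, the polynomial kernel K(x, z) = (xᵀz)² equals an ordinary dot product in an explicit 3-D feature space, yet it never computes the mapping:

```python
import math

def poly_kernel(x, z):
    """K(x, z) = (x . z)^2, computed entirely in the original 2-D space."""
    return (x[0]*z[0] + x[1]*z[1]) ** 2

def phi(x):
    """The explicit 3-D feature map that K implicitly uses:
    phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2)."""
    return (x[0]**2, math.sqrt(2)*x[0]*x[1], x[1]**2)

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

x, z = (1.0, 2.0), (3.0, -1.0)
print(poly_kernel(x, z), dot(phi(x), phi(z)))  # equal up to float rounding
```

The same identity is why SVM training never needs the high-dimensional coordinates, only kernel evaluations.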
Overfitting
If in fact we are working in a much higher dimensional (virtual) space, how
come we do not suffer from overfitting / curse of dimensionality?
Why do SVMs work as well as they do…?!
Let’s recall overfitting: A classifier that has too many degrees of freedom
(adjustable parameters) can always find a decision boundary that will
correctly classify all training data points.
However, the parameters may then be too specific to the given (finite size)
training data. Essentially, the classifier has memorized the training data, and cannot
generalize well on previously unseen test data.
Vapnik: The primary issue at hand is not just the number of adjustable
parameters to be estimated, but also the flexibility / capacity of the classifier.
Surely, a classifier with many adjustable parameters is also very flexible, capable of
adjusting itself to many variations in the data
But there are exceptions.
An Analogy (Burges, 1998)
To illustrate the flexibility / capacity of a classifier, the following
analogy is useful:
A classifier with very high capacity → a botanist with a photographic memory:
• Given a tree she has not seen before, this botanist will declare that it is NOT a tree, because it
has a different number of leaves than anything she has seen before. Having memorized
every tree she has ever seen, and this new tree not looking exactly like any of them,
she concludes that it must not be a tree.
A classifier with very little capacity → the botanist’s lazy brother:
• If it is green, it is a tree! Having very little capacity to learn the variations among
different examples, this classifier learns only the most basic common denominator,
which fails to distinguish among samples belonging to different classes.
So how do we decide?
Empirical Risk Minimization (ERM)
The fundamental problem of finding the best classifier:
Given a set of training i.i.d. data for a binary classification problem, drawn from
some unknown probability distribution 𝑃(𝒙, 𝑦)
(𝐱₁, 𝑦₁), (𝐱₂, 𝑦₂), ⋯, (𝐱ₙ, 𝑦ₙ) ∈ ℝᴰ × 𝑌, 𝑌 = {−1, +1}
our goal is to find the mapping function 𝑓 that will correctly classify unknown
samples (𝒙, 𝑦) that also come from the same distribution 𝑃(𝒙, 𝑦)
The best function 𝑓 is the one that minimizes the expected risk / error / loss:

𝑅(𝑓) = ∫ 𝐿(𝑓(𝐱), 𝑦) d𝑃(𝐱, 𝑦), where 𝐿(𝑓(𝐱), 𝑦) = (𝑓(𝐱) − 𝑦)²
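For a finite distribution, the integral becomes a weighted sum, which makes the definition easy to compute directly. A toy Python sketch (the distribution and classifier are mine, purely for illustration):

```python
# A toy finite distribution P(x, y): x in {0, 1, 2}, y in {-1, +1}.
P = {
    (0, -1): 0.3, (0, +1): 0.1,
    (1, -1): 0.1, (1, +1): 0.2,
    (2, -1): 0.0, (2, +1): 0.3,
}

def loss(fx, y):
    # Squared loss; with labels in {-1, +1} this is 0 on a hit, 4 on a miss.
    return (fx - y) ** 2

def expected_risk(f):
    """R(f) = sum over (x, y) of P(x, y) * L(f(x), y)."""
    return sum(p * loss(f(x), y) for (x, y), p in P.items())

# A simple threshold classifier: predict +1 when x >= 1.
f = lambda x: 1 if x >= 1 else -1
print(expected_risk(f))  # 0.8 (misclassification probability 0.2, loss 4 each)
```

The empirical risk is the same sum with P replaced by the uniform weight 1/n over a finite training sample.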
Empirical Risk Minimization (ERM)
A fancy term that simply means minimizing the error on the training data.
This is usually not a bad thing.
In fact, for most classification algorithms, this is the primary objective in
determining the classifier parameters
However, relying on the empirical error for selecting the best classifier (model
selection) is only meaningful if there is a lot of training data.
We should try to minimize the expected risk
The probability that the classifier will be incorrect on previously unseen data
As the number of training data points → ∞, the empirical risk approaches the
expected risk
For small sample sizes, we cannot guarantee that a small empirical error will also lead
to a small test error
VC Dimension
Some definitions:
Given a set of 𝑛 points for a two-class problem, there are 2ⁿ ways in which
these 𝑛 points can be labeled. If a member of the family of functions 𝑓 can be found
that correctly realizes every possible labeling, this set of 𝑛 points is said
to be shattered by 𝑓.
The VC dimension of a family of functions 𝑓 is the largest number of
training points that can be shattered by 𝑓.
If the VC dimension of 𝑓 is ℎ, then there exists at least one set of ℎ points that can
be shattered, but in general not every set of ℎ points can be shattered.
VC Dimension
For example, let 𝑛 = 3. There are 2³ = 8 possible labelings for a two-class
dichotomy:
{(−𝟏, −𝟏, −𝟏), (−𝟏, −𝟏, +𝟏), (−𝟏, +𝟏, +𝟏), (−𝟏, +𝟏, −𝟏), (+𝟏, −𝟏, −𝟏), (+𝟏, −𝟏, +𝟏), (+𝟏, +𝟏, −𝟏), (+𝟏, +𝟏, +𝟏)}
We can find a linear function 𝑓 that correctly labels this set of three points
whichever way the points are labeled.
VC Dimension
But if we choose four points, it can be shown that there is no set of four
points such that all 2⁴ = 16 possible labelings can be correctly classified.
Recall the XOR problem. For any given set of four points, we can find at least one
way of assigning labels to these points such that a linear function fails to label them
correctly.
Hence the VC dimension of the family of linear functions (in 2D) is three!
In general, it can be shown that the VC dimension of the family of
hyperplanes in 𝑑 dimensional space is 𝑑 + 1
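This shattering argument is easy to verify numerically. The Python sketch below (mine, not from the slides; the classic perceptron stands in as the linear learner, since it converges exactly when the data are linearly separable) checks that three points in general position are shattered by lines, while the XOR labeling of four points defeats every line:

```python
import itertools

def perceptron_separates(points, labels, epochs=1000):
    """True iff the perceptron finds a line with zero training error."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        errors = 0
        for (x1, x2), t in zip(points, labels):
            pred = 1 if w[0]*x1 + w[1]*x2 + b > 0 else -1
            if pred != t:                       # misclassified: update
                w[0] += t * x1; w[1] += t * x2; b += t
                errors += 1
        if errors == 0:
            return True
    return False        # never converged: (evidence of) non-separability

# Three points in general position: every one of the 2^3 labelings is separable.
three = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
shattered = all(perceptron_separates(three, ls)
                for ls in itertools.product([-1, 1], repeat=3))
print(shattered)  # True

# Four points with the XOR labeling: no line works, so these 4 are not shattered.
four = [(0.0, 0.0), (1.0, 1.0), (1.0, 0.0), (0.0, 1.0)]
print(perceptron_separates(four, [1, 1, -1, -1]))  # False
```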
VC Dimension
VC dimension gives a measure of the complexity (capacity) of the
(classifier) functions.
One may think that the larger the number of parameters a classifier (function) has,
the higher its VC dimension; while this is in general true, there are many exceptions
(see Burges, 1998)
VC dimension is therefore a more accurate representation of the capacity of a
classifier than the number of parameters: the higher the VC
dimension of a classifier, the higher its capacity
The k-nearest neighbor classifier (for k = 1) has a VC dimension that is infinite.
Why? Because, with kNN, one can correctly classify any number of training
samples, regardless of what the correct labels are
Actually, the SVM with the RBF kernel also has infinite VC dimension (see Burges,
1998), because for a sufficiently small choice of the spread parameter, any number of
instances can be correctly classified by the RBF-SVM.
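The kNN claim is easy to check: on the training set itself, each point's nearest neighbor is the point itself, so 1-NN reproduces any labeling we throw at it. A minimal Python sketch (my own random data):

```python
import random

def one_nn(train, q):
    """1-NN: return the label of the training point closest to query q."""
    return min(train,
               key=lambda t: (t[0][0]-q[0])**2 + (t[0][1]-q[1])**2)[1]

random.seed(0)
pts = [(random.random(), random.random()) for _ in range(20)]
labels = [random.choice([-1, 1]) for _ in range(20)]   # arbitrary labels
train = list(zip(pts, labels))

# Each distinct training point is its own nearest neighbor (distance 0),
# so training error is zero for ANY labeling: 1-NN shatters sets of any size.
train_error = sum(one_nn(train, p) != y for p, y in train)
print(train_error)  # 0
```

Zero training error for arbitrary labels is exactly why infinite VC dimension says nothing good about generalization.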
Structural Risk Minimization (SRM)
Let’s summarize:
We want to choose the function / classifier that minimizes expected risk.
• Cannot do it, since we do not know the underlying distributions
Then, let’s minimize the empirical risk / error; after all, the model that minimizes
the training error is the one that best describes the given data
• OK, but this is unreliable when the amount of training data is limited: we may choose a
classifier with a large capacity → possibly overfitting
So we need to control the capacity of a classifier. How do we do that?
• Choose the classifier with fewer parameters? No, not necessarily
• Instead, choose a classifier with a lower VC dimension, as VC dimension is a better
indicator of the classifier’s capacity
So, is there a relationship between the VC dimension and the expected risk,
the quantity that we really want to minimize?
Funny you asked…
Structural Risk Minimization (SRM)
VC dimension provides an upper bound to the expected risk as a function of the
empirical risk and the number of training data points:
We can show that, given the VC dimension ℎ, the number of training data points 𝑛, and a
constant 0 < 𝜂 < 1, the following upper bound holds with probability 1 − 𝜂:

𝑅(𝑓) < 𝑅ₑₘₚ(𝑓) + √[ (ℎ(log(2𝑛/ℎ) + 1) − log(𝜂/4)) / 𝑛 ]

The square-root term is called the VC confidence. Note the following:
As 𝑛 increases, the VC confidence becomes smaller and the expected risk approaches the
empirical risk. This makes sense: if we have lots of data, the empirical error is a good
indicator of the true performance.
The bound is not dependent on the (unknown) probability distribution 𝑃(𝒙, 𝑦), nor on the
number of adjustable parameters, nor on the dimensionality of the problem !!!
While we can never actually compute the left hand side (expected risk), this inequality gives
us an upper bound. If we know ℎ, we can easily compute the upper bound.
Given some small 𝜂, we should then choose the classifier (among all classifiers) that
minimizes the VC confidence, that is the one with the smallest VC dimension. This is called
structural risk minimization, as it minimizes the expected risk by reducing the second
term on the right hand side.
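The VC confidence term is directly computable. A small Python sketch (function name mine) showing how the bound tightens as 𝑛 grows for fixed ℎ and 𝜂:

```python
import math

def vc_confidence(h, n, eta):
    """The square-root term in the Vapnik bound R(f) < R_emp(f) + sqrt(...)."""
    return math.sqrt((h * (math.log(2 * n / h) + 1) - math.log(eta / 4)) / n)

# Fixed VC dimension h and confidence level eta; growing training set n:
vals = [vc_confidence(h=10, n=n, eta=0.05) for n in (100, 1000, 10000)]
print(vals)  # strictly decreasing
```

For small 𝑛 the confidence term dominates, which is exactly why the empirical risk alone is an unreliable model-selection criterion there.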
Structural Risk Minimization (SRM)
SRM is therefore a formal term that simply states that the best classifier is the one
that provides a meaningful balance between empirical risk and VC dimension.
Practically speaking, SRM tells us that we should choose a classifier that minimizes
the sum of empirical risk and VC dimension.
(Figure, after Müller et al., 2001: error rate vs. classifier complexity, showing the empirical risk, the VC confidence, and their sum bounding the expected risk; the underfitting and overfitting regimes are marked, with different confidence intervals for two classifiers.)
SRM in Practice
Ah, it is easy then… All we need to do is find the VC dimension of the
classifiers we generate, and then use the one with the smallest bound
(empirical risk + VC confidence)
If only it were that simple!
The VC dimension of many classifiers cannot be easily computed, and hence the
upper bound cannot be determined. It may be infinite (as in kNN), or the bound
may prove to be very loose, and hence not very useful.
Blah! All this VC theory crap for nothing…?
Well, not really. It turns out that for linear classifiers, there is a very interesting link
between the VC dimension and the separating margin between the two classes.
Remember: we want to minimize the sum, and therefore both the empirical
error and the capacity of the classifier
Structural Risk Minimization (SRM)
A different interpretation:
Minimize structural risk → minimize the sum of empirical error + classifier capacity
It is obvious why we want to minimize the empirical error, but why the classifier capacity?
• Because a large-capacity classifier can explain not only the observed data, but also many other data
that are not actually generated by the underlying boundary.
Structural Risk Minimization (SRM)
So we want a classifier that can explain the given data as best as it can, but
cannot explain any other data!
Let’s go back to our original representative data: what is the classifier (model) that
best describes the given data, but is least likely to explain any other data?
Intuitively speaking…
(Figure: the optimal, maximum-margin decision boundary in the Feature 1 vs. Feature 2 plane; slack variables ξi and ξj mark points xi and xj that fall inside the margin.)
SRM → VC Dimension → SVM
So, intuitively, then, the classifier that best explains the given data, and least likely to explain
any other data is the one that provides the largest margin between the classes.
Margin: The minimum distance of an instance to the decision boundary.
It turns out, there is an interesting relationship between the margin and the VC dimension,
one that provides a theoretical basis to this intuition (Vapnik, 1998):
ℎ ≤ min(𝑅²/𝑚², 𝑑) + 1, where 𝑚 = 2/‖𝐰‖
where R is the radius of the smallest (hyper) sphere that includes all data points, and 𝑑 is the
dimensionality of the dataset
So…?
So, by maximizing the margin, we are minimizing the VC dimension!
The SVM takes advantage of this very insight:
SVM finds the classifier / model / function that maximizes the margin between the classes
Hence, it automatically minimizes the VC dimension, even though we cannot compute the VC
dimension itself (actually for linear classifiers, the VC dimension is 𝑑 + 1)
Hence, it automatically minimizes the VC confidence, hence the expected risk!
Hence, the next best thing since sliced bread…er…since Bayes classifier
SRM → VC Dimension → SVM
A further insight:
Recall the optimization problem for the noisy dataset case (using slack variables)
min (1/2)‖𝐰‖² + 𝐶 Σᵢ₌₁ⁿ 𝜉ᵢ , 𝐶 > 0
subject to 𝑦ᵢ(𝐰ᵀ𝐱ᵢ + 𝑏) ≥ 1 − 𝜉ᵢ , 𝜉ᵢ ≥ 0
The first term in the objective function minimizes (1/2)‖𝐰‖², and hence maximizes the
margin 𝑚 = 2/‖𝐰‖, which reduces the capacity of the classifier
The second term minimizes the number of instances that fall inside the margin
and/or on the wrong side of the decision boundary, hence minimizes the
misclassification rate, hence minimizes the empirical error
Thus, this objective function (the Lagrangian) minimizes the sum of empirical error
and the classifier capacity, precisely the term that SRM asks to minimize
Hence, the support vector machines implement the structural risk
minimization!
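The two-term trade-off can be made concrete. A small Python sketch (toy data and hyperplane of my own choosing, not from the lecture) that evaluates (1/2)‖w‖² + CΣξᵢ, with the slacks computed as ξᵢ = max(0, 1 − yᵢ(wᵀxᵢ + b)):

```python
def soft_margin_objective(w, b, data, C):
    """0.5*||w||^2 + C * sum(xi), with xi = max(0, 1 - y*(w.x + b))."""
    margin_term = 0.5 * (w[0]**2 + w[1]**2)         # capacity (margin) term
    slack_term = sum(max(0.0, 1.0 - y * (w[0]*x1 + w[1]*x2 + b))
                     for (x1, x2), y in data)       # empirical-error term
    return margin_term + C * slack_term

data = [((0.0, 2.0), 1), ((2.0, 2.0), 1),
        ((0.0, 0.0), -1), ((2.0, 0.0), -1),
        ((1.0, 1.2), -1)]        # one noisy point inside the margin
# The wide-margin hyperplane y = 1 (w = (0, 1), b = -1) pays a slack
# penalty for the noisy point; shrinking the margin (larger ||w||) would
# trade capacity for training error instead.
print(soft_margin_objective(w=(0.0, 1.0), b=-1.0, data=data, C=1.0))
```

Sweeping C shifts the balance between the two terms, which is exactly the SRM trade-off in the text.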
Strengths & Weaknesses of SVMs
Strengths:
Training is straightforward, given a good quadratic programming algorithm
No local minima: unlike neural networks, SVMs will find the same global
optimum each and every time. Excellent generalization ability
Scales well to high dimensions (but not so well to large datasets; use SMO)
Provides an explicit control between classifier complexity (capacity) and error
minimization
Solution is sparse – it only uses the subset of the data – the support vectors
An elegant and intuitive theory
Only a few parameters need to be specified: the kernel itself, its kernel parameter
(usually one: 𝜎 for the Gaussian kernel, 𝑑 for the polynomial kernel), and the penalty term 𝐶
Weaknesses:
Need to choose a “good” kernel function, and also how to choose 𝐶
Multi-Class Classification
So far, we have only looked at binary (2-class) classification problems.
What do we do if we have more than two classes?
One possible solution
Change the quadratic programming formulation such that the multi-class
objective function is minimized.
More commonly: Use an ensemble of SVMs. Two variations
1. One against one: create one SVM for each pairwise combination of classes
• 3 classes: Class 1 vs. Class 2, Class 1 vs. Class 3, and Class 2 vs. Class 3
2. One against all: create C SVMs, one for each of the C classes, against all others
• Class 1 vs. Classes 2 & 3, Class 2 vs. Classes 1 & 3, Class 3 vs. Classes 1 & 2
Then combine these classifiers using a suitable combination rule, such as majority
voting. The class that gets the most votes is selected.
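The one-against-one scheme with majority voting fits in a few lines of Python (a sketch; the pairwise "classifier" here is a hypothetical nearest-prototype stub standing in for a trained binary SVM):

```python
from itertools import combinations

def one_vs_one_vote(classes, pairwise_predict, x):
    """Majority vote over all C(C-1)/2 pairwise classifiers."""
    votes = {c: 0 for c in classes}
    for a, b in combinations(classes, 2):
        votes[pairwise_predict(a, b, x)] += 1      # winner of each duel votes
    return max(votes, key=votes.get)

# Hypothetical 1-D stub: each class has a prototype; the pairwise "SVM"
# simply picks the class with the closer prototype.
prototypes = {1: 0.0, 2: 5.0, 3: 10.0}
predict = lambda a, b, x: (a if abs(x - prototypes[a]) <= abs(x - prototypes[b])
                           else b)

print(one_vs_one_vote([1, 2, 3], predict, x=4.2))  # class 2 wins 2 of 3 votes
```

With 3 classes this builds 3 pairwise classifiers; one-against-all would instead build one classifier per class and pick the largest decision value.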
Connection to Neural Networks
For certain choices of the kernel functions, there is a one-to-one
connection between the SVM and certain neural networks.
Specifically (# of output nodes = 1 in each case)
The RBF network is a special case of the SVM with the Gaussian kernel:
• The number of centers (number of receptive fields - hidden layer nodes) = number of
support vectors
• The centers themselves = the support vectors as found by SVM
• The weights 𝐰 from hidden to output layer = Lagrange multipliers 𝛼𝑖 and the threshold
(𝑏) are all determined automatically by the SVM solution!
The MLP network architecture can be determined by the SVM with the hyperbolic
tangent kernel:
• The number of hidden layer nodes = number of support vectors
• Input to output weights = kernel values obtained by the sigmoid
• The output weights are determined by the Lagrangian multipliers
Other Kernel-Based Methods
The take home message from these lectures:
A problem that is not linearly separable in the given low dimensional input space may very
well be linearly separable in a high dimensional space, and one does not need to perform
any computations in the high dimensional space for this!
The same approach can therefore be used for other data analysis type problems, such as
kernel PCA, kernel LDA, kernel / SVM for regression and for one-class classification
(novelty detection).
Extra credit: the take-home project from these lectures:
In groups of 2-3, pick one of the above-mentioned topics and prepare a 20-minute mini-lecture,
followed by a demo of the approach using any of the freely available software packages for
Matlab. Each group should have at least one graduate and one undergraduate student
• Your own implementation of SVM using quadprog()
• Kernel PCA (non-linear PCA)
• Kernel (Fisher) LDA
• SVM for regression (function approximation) – support vector regression
• One-class SVMs for classification (novelty detection)
• Any other Kernel method not covered in class.