
Dept. of Electrical and Computer Engineering
0909.555.01
Dr. Robi Polikar

Week 9, Lecture 9
Structural Risk Minimization & Other Kernel Methods

Advanced Topics in Pattern Recognition © 2001- 2012, Robi Polikar, Rowan University, Glassboro, NJ
This Week in PR
 Another geometric interpretation of SVMs:
  • Convex hull of datasets
  • Relationship to ν-SVMs
 Overview of Support Vector Machines
  • Structural Risk Minimization
  • Vapnik–Chervonenkis (VC) Dimension
  • Capacity / flexibility of a classifier
  • Strengths and weaknesses of SVMs
  • Connection to neural networks
 Other kernel-based approaches
  • Relevance vector machine
  • Kernel PCA

Sources and graphics:
 D – Duda, Hart & Stork, Pattern Classification
 G – R. Gutierrez-Osuna, Lecture Notes
 TK – Theodoridis & Koutroumbas, Pattern Recognition
 RP – Original graphic created / generated by Robi Polikar – All Rights Reserved © 2001–2012. May be used with permission and citation.

Another Geometric Interpretation of SVMs
 We have already seen that SVMs provide the optimal solution by maximizing the margin, which intuitively makes sense…
 …but there is another interpretation of the SVM solution that also makes common sense:
 The solution found by the SVM is the hyperplane that bisects the segment
joining the two nearest points between the convex hulls of the data from
the two classes.
 Convex huh?


Convex Hull
 Let’s first understand the concept of convexity:
 In Euclidean space, an object is said to be convex (as opposed to concave) if every straight line segment that connects any pair of its points remains entirely inside the object.

[Figure: an example of a convex object and an example of a non-convex (concave) object]

 Similarly, a real-valued function f defined on some interval (its domain C) is called a convex function if, for any two points x, y in C and any real-valued constant α in [0, 1], the condition

$$f(\alpha x + (1 - \alpha)y) \le \alpha f(x) + (1 - \alpha) f(y)$$

is satisfied.

[Figure: graph of a convex function; the chord connecting any two points on the graph lies above the graph]
http://en.wikipedia.org/wiki/File:Convex_polygon_illustration1.png; http://en.wikipedia.org/wiki/File:Convex_polygon_illustration2.png
http://en.wikipedia.org/wiki/File:Convex-function-graph-1.png; http://en.wikipedia.org/wiki/File:Convex_supergraph.png
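A quick numerical check of this inequality (a minimal sketch in Python; f(x) = x² is just an example convex function, not taken from the slides):

```python
# Minimal sketch (assumption: f(x) = x**2 chosen as an example convex function).
import numpy as np

f = lambda x: x ** 2
rng = np.random.default_rng(1)
x, y = rng.normal(size=2)                  # two arbitrary points in the domain
for alpha in np.linspace(0.0, 1.0, 11):
    lhs = f(alpha * x + (1 - alpha) * y)
    rhs = alpha * f(x) + (1 - alpha) * f(y)
    assert lhs <= rhs + 1e-12              # the convexity condition holds for every alpha
print("f(x) = x^2 satisfies the convexity inequality at all sampled alphas")
```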

Convex Hull
 Now, we can define a convex hull of a set of points:
 The convex hull of the set X is the minimal convex set that contains all elements of X
 Intuitively, for planar (2-D) objects, the convex hull is the shape obtained by stretching an elastic band around the object. This interpretation – while not technically accurate – is close enough for our understanding of the general meaning of the convex hull.

Convex (hull of a) bunny!

 Mathematically, the convex hull of a set of points X is R(X):

$$R(X) = \left\{ \sum_{i=1}^{n} \alpha_i x_i \;\middle|\; x_i \in X,\ \alpha_i \in \mathbb{R},\ \alpha_i \ge 0,\ \sum_{i=1}^{n} \alpha_i = 1,\ i = 1, 2, \cdots, n \right\}$$

http://en.wikipedia.org/wiki/File:ConvexHull.svg http://xcellerator.info/mPower/pages/convexHull.html http://codercorner.com/ConvexHull.jpg
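A minimal sketch (assuming NumPy and SciPy are available; the random points are invented) illustrating R(X): any convex combination of the points in X lies inside the convex hull of X.

```python
# Minimal sketch (assumptions: NumPy/SciPy; random planar points invented here).
import numpy as np
from scipy.spatial import ConvexHull, Delaunay

rng = np.random.default_rng(0)
X = rng.random((20, 2))                      # 20 random planar points

hull = ConvexHull(X)                         # the "elastic band" around X
print("hull vertices (indices into X):", hull.vertices)

# Draw weights alpha_i >= 0 with sum(alpha) = 1 and form y = sum_i alpha_i x_i
alpha = rng.random(len(X))
alpha /= alpha.sum()
y = alpha @ X

# A point lies inside the hull iff it falls in some simplex of the Delaunay triangulation
inside = Delaunay(X).find_simplex(y) >= 0
print("the convex combination lies inside R(X):", inside)   # expected: True
```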


SVM ↔ Convex Hull
 It turns out that the solution given by the dual formulation of the SVM

$$\max_{\alpha} \quad L_D = -\frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \mathbf{x}_i^T \mathbf{x}_j + \sum_{i=1}^{n} \alpha_i$$

$$\text{subject to} \quad \alpha_i \ge 0, \qquad \sum_{i=1}^{n} \alpha_i y_i = 0$$

for the linearly separable task results in the hyperplane that bisects the linear segment joining the nearest points between the convex hulls of the data classes:

[Figure (TK): the convex hulls of the two classes; the SVM hyperplane bisects the segment joining their nearest points]
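As a concrete illustration (a minimal sketch assuming scikit-learn and NumPy; the toy dataset below is invented), this dual can be solved numerically by training a linear SVM with a very large C, which approximates the hard-margin (linearly separable) formulation, and then inspecting the support vectors and Lagrange multipliers:

```python
# Minimal sketch (assumptions: scikit-learn/NumPy; toy separable data invented here).
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.5],     # class +1
              [4.0, 4.0], [5.0, 4.5], [4.5, 5.5]])    # class -1
y = np.array([+1, +1, +1, -1, -1, -1])

svm = SVC(kernel="linear", C=1e6).fit(X, y)            # very large C ~ hard margin

print("support vectors:\n", svm.support_vectors_)
print("alpha_i * y_i   :", svm.dual_coef_)             # signed Lagrange multipliers
print("w =", svm.coef_[0], " b =", svm.intercept_[0])
```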


SVM ↔ Convex Hull
 It is easy to see this connection: the solution described by the nearest points on the two convex hulls requires minimizing (convince yourself of the following!)

$$\min_{\alpha} \quad \left\| \sum_{i:\, y_i = +1} \alpha_i \mathbf{x}_i - \sum_{i:\, y_i = -1} \alpha_i \mathbf{x}_i \right\|^2$$

$$\text{subject to} \quad \sum_{i:\, y_i = +1} \alpha_i = 1, \quad \sum_{i:\, y_i = -1} \alpha_i = 1, \quad \alpha_i \ge 0, \quad i = 1, 2, \cdots, n$$

 Expanding the above norm, and with a little bit of algebra, we can show that the above optimization problem reduces to

$$\min_{\alpha} \quad \sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \mathbf{x}_i^T \mathbf{x}_j, \qquad \text{subject to} \quad \sum_{i=1}^{n} \alpha_i y_i = 0, \quad \sum_{i} \alpha_i = 2, \quad \alpha_i \ge 0, \quad i = 1, 2, \cdots, n$$

which can be shown to be equivalent to our original SVM formulation

$$\max_{\alpha} \quad L_D = -\frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \mathbf{x}_i^T \mathbf{x}_j + \sum_{i=1}^{n} \alpha_i, \qquad \text{subject to} \quad \alpha_i \ge 0, \quad \sum_{i=1}^{n} \alpha_i y_i = 0$$
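The sketch below (assuming NumPy, SciPy, and scikit-learn; the toy dataset is invented) solves the nearest-points-between-hulls problem directly with a generic constrained optimizer and compares the resulting bisecting hyperplane with the one found by a linear SVM:

```python
# Minimal sketch (assumptions: NumPy/SciPy/scikit-learn; toy separable data invented here).
import numpy as np
from scipy.optimize import minimize
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.5],
              [4.0, 4.0], [5.0, 4.5], [4.5, 5.5]])
y = np.array([+1, +1, +1, -1, -1, -1])
pos, neg = X[y == +1], X[y == -1]

def objective(a):
    """Squared distance between convex combinations of the two classes."""
    diff = a[:len(pos)] @ pos - a[len(pos):] @ neg
    return diff @ diff

n = len(X)
constraints = [{"type": "eq", "fun": lambda a: np.sum(a[:len(pos)]) - 1.0},
               {"type": "eq", "fun": lambda a: np.sum(a[len(pos):]) - 1.0}]
a0 = np.full(n, 1.0 / 3.0)                       # feasible starting point (3 points per class)
res = minimize(objective, a0, bounds=[(0, 1)] * n,
               constraints=constraints, method="SLSQP")

p_plus = res.x[:len(pos)] @ pos                  # nearest point on the + hull
p_minus = res.x[len(pos):] @ neg                 # nearest point on the - hull
w_hull = p_plus - p_minus                        # normal of the bisecting hyperplane
b_hull = -w_hull @ (p_plus + p_minus) / 2.0      # hyperplane passes through the midpoint

svm = SVC(kernel="linear", C=1e6).fit(X, y)
# Up to a positive scale factor, both solutions describe the same hyperplane:
print("hull-based w, b:", w_hull, b_hull)
print("SVM        w, b:", svm.coef_[0], svm.intercept_[0])
```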


SVM ↔ Convex Hull
 The more interesting connection, however, is the non-separable class behavior. For this one, we turn to our ν-SVM formulation. Recall:

$$\min_{\mathbf{w}, b, \boldsymbol{\xi}, \rho} \quad \frac{1}{2}\|\mathbf{w}\|^2 - \nu\rho + \frac{1}{n}\sum_{i=1}^{n}\xi_i$$

$$\text{subject to} \quad y_i(\mathbf{w}^T\mathbf{x}_i + b) \ge \rho - \xi_i, \quad \xi_i \ge 0, \quad \rho \ge 0$$

 Normalizing the cost function with 2/ν² and the set of constraints by ν, we obtain

$$\min \quad \|\mathbf{w}\|^2 - 2\rho + \mu\sum_{i=1}^{n}\xi_i$$

$$\text{subject to} \quad y_i(\mathbf{w}^T\mathbf{x}_i + b) \ge \rho - \xi_i, \quad \xi_i \ge 0, \quad \rho \ge 0$$

where μ = 2/(νn). In the above formulation, the parameters w, b, ρ and ξ are all normalized by the constant ν, which is not shown i) because it is just a multiplicative constant that does not change the solution of the optimization problem, and ii) to show the similarity that remains with the original equation above. The interesting concept here is that the Wolfe dual formulation of the above optimization problem is

$$\min_{\alpha} \quad \sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \mathbf{x}_i^T \mathbf{x}_j, \qquad \text{subject to} \quad \sum_{i=1}^{n}\alpha_i y_i = 0, \quad \sum_{i}\alpha_i = 2, \quad 0 \le \alpha_i \le \mu$$

which is identical to the convex hull formulation, except for the condition on the Lagrange multipliers, which now have an upper bound of μ. So…?


Reduced Convex Hulls
 The reduced convex hull of a set of vectors X, denoted R(X, μ), is defined as the convex set

$$R(X, \mu) = \left\{ y : y = \sum_{i=1}^{n} \alpha_i x_i \;\middle|\; x_i \in X,\ \alpha_i \in \mathbb{R},\ 0 \le \alpha_i \le \mu,\ \sum_{i=1}^{n} \alpha_i = 1,\ i = 1, 2, \cdots, n \right\}$$

 Note that R(X, 1) = R(X), and that R(X, μ) ⊆ R(X).

 The following figures illustrate the concept of the reduced convex hull:
  • For data with overlapping classes (non-separable), the convex hulls clearly overlap.
  • However, for an appropriate choice of μ, the reduced convex hulls do not overlap.

=1

=0.4

=0.1
TK


ν-SVM ↔ Reduced Convex Hull
 Note that we started with the ν-SVM formulation and we ended up with the reduced convex hull interpretation of the SVM.
 Also note that the only difference between the two convex hull formulations – the first for the linearly separable case, the second for the non-separable case – is the additional μ term in the Lagrange multiplier constraints; that is, each α must have an upper bound of μ.
  In the separable case, we have 0 ≤ α ≤ 1, which means μ = 1, which then means that the convex hulls do not overlap. The solution is then the maximum margin separating the full convex hulls.
  In the non-separable case, we have 0 ≤ α ≤ μ, where μ < 1, which gives us the reduced convex hull case. The margin is then searched between the reduced convex hulls.
  Also recall from earlier the relationship between μ and ν: μ = 2/(νn), where the parameter ν directly controls the error rate and the number of misclassified instances. The larger the ν (the more margin errors are allowed), the smaller the μ (and hence the more the convex hulls are reduced so that they no longer overlap). In order to ensure a feasible solution:

$$\nu \le \nu_{\max} = \frac{2\,n_{\min}}{n} \le 1, \qquad \nu \ge \nu_{\min} = \frac{2}{\mu_{\max}\, n}$$

where n_min = min{n₊, n₋} is the smaller of the numbers of + and − samples, satisfying

$$\mu \ge \mu_{\min} = \frac{1}{n_{\min}}, \qquad \mu \le \mu_{\max} \le 1$$
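A tiny sketch (plain Python; the class counts are made up) of these feasibility bounds:

```python
# Minimal sketch (assumption: invented class counts) of the feasibility bounds above.
n_plus, n_minus = 60, 40
n = n_plus + n_minus
n_min = min(n_plus, n_minus)

nu_max = 2 * n_min / n            # largest nu that still admits a feasible solution
mu_min = 1 / n_min                # smallest admissible upper bound on the alpha_i

nu = 0.5                          # an example choice with nu <= nu_max
mu = 2 / (nu * n)                 # mu = 2 / (nu * n), the cap on each alpha_i
print(f"nu_max = {nu_max:.3f}, mu_min = {mu_min:.4f}, mu(nu={nu}) = {mu:.4f}")
```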


ν-SVM Demo
[Figure: ν-SVM demo with a Gaussian kernel (kernel option 0.2) and ν = 0.3; the learning data, decision boundary and margin are plotted over the (x1, x2) unit square]
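A comparable demo can be reproduced with scikit-learn's NuSVC (a minimal sketch; the synthetic data and parameter values below are assumptions, not the original demo's data):

```python
# Minimal sketch (assumptions: scikit-learn/NumPy; synthetic overlapping data, not the demo data).
import numpy as np
from sklearn.svm import NuSVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.35, 0.12, size=(100, 2)),
               rng.normal(0.65, 0.12, size=(100, 2))])
y = np.array([-1] * 100 + [+1] * 100)

for nu in (0.1, 0.3, 0.5):
    clf = NuSVC(nu=nu, kernel="rbf", gamma=1.0 / (2 * 0.2 ** 2)).fit(X, y)
    frac_sv = len(clf.support_) / len(X)
    train_err = 1.0 - clf.score(X, y)
    # nu lower-bounds the fraction of support vectors and upper-bounds the fraction of margin errors
    print(f"nu = {nu:.1f}: support-vector fraction = {frac_sv:.2f}, training error = {train_err:.2f}")
```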

Some Observations
 SVMs provide a unique optimal solution for linearly separable two-class problems
 If there is noise in the data, we can add slack variables
  These do not appear in the problem formulation; instead we have the C (or ν) parameter, a trade-off between maximizing the margin and minimizing the training error
 If the problem is not linearly separable, SVM can still solve it by going to a
higher dimensional space, where the problem may be (more) linearly
separable
 All calculations are done in the original space through the kernel trick
 We do not need to know the transformation / mapping to the high dimension
 The transformation function is determined by the choice of the kernel
 There is no guarantee that the problem will be linearly separable in the space
determined by the chosen kernel – it is just that the problem is more likely to be
linearly separable in the high dimensional space.
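As a quick illustration of the kernel trick (a minimal sketch; the explicit degree-2 polynomial feature map φ below is a standard textbook construction used only for illustration), the kernel value computed in the original 2-D space matches the inner product in the higher-dimensional feature space:

```python
# Minimal sketch: k(x, z) = (x.z + 1)^2 equals <phi(x), phi(z)> for an explicit degree-2 map phi.
import numpy as np

def phi(v):
    """Explicit feature map of the inhomogeneous degree-2 polynomial kernel in 2-D."""
    x1, x2 = v
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2,
                     np.sqrt(2) * x1 * x2])

x = np.array([0.3, -1.2])
z = np.array([2.0, 0.5])

k_trick = (x @ z + 1.0) ** 2          # computed entirely in the original 2-D space
k_explicit = phi(x) @ phi(z)          # computed in the 6-D feature space
print(k_trick, k_explicit)            # identical up to floating-point rounding
```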


Overfitting
 If in fact we are working in a much higher dimensional (virtual) space, how
come we do not suffer from overfitting / curse of dimensionality?
 Why do SVMs work as well as they do…?!
 Let’s recall overfitting: A classifier that has too many degrees of freedom
(adjustable parameters) can always find a decision boundary that will
correctly classify all training data points.
 However, the parameters may then be too specific to the given (finite size)
training data. Essentially, the classifier has memorized the training data, and cannot
generalize well on previously unseen test data.
 Vapnik: The primary issue at hand is not just the number of adjustable
parameters to be estimated, but also the flexibility / capacity of the classifier.
 Surely, a classifier with many adjustable parameters is also very flexible, capable of adjusting itself to many variations in the data
 But there are exceptions.


An Analogy (Burges, 1998)
 To illustrate the flexibility / capacity of a classifier, the following
analogy is useful:
 A classifier with very high capacity   A botanist with a photographic memory:
• Given a previously unseen tree, this botanist will declare that it is NOT a tree, because it has a different number of leaves than anything she has seen before. Having memorized every tree she has ever seen, and this tree not looking exactly like any of them, she concludes that it must not be a tree.
 A classifier with very little capacity   The botanist’s lazy brother
• If it is green, it is a tree! Having very little capacity to learn the variations among
different examples, this classifier learns the very most basic common denominator,
which fails to distinguish among samples belonging to different classes.
 So how do we decide?


Empirical Risk Minimization (ERM)
 The fundamental problem of finding the best classifier:
 Given a set of i.i.d. training data for a binary classification problem, drawn from some unknown probability distribution P(x, y),

$$(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \cdots, (\mathbf{x}_n, y_n) \in \mathbb{R}^D \times Y, \qquad Y = \{-1, +1\}$$

our goal is to find the mapping function f that will correctly classify unknown samples (x, y) that also come from the same distribution P(x, y)
 The best function f is the one that minimizes the expected risk / error / loss:

$$R(f) = \int L\big(f(\mathbf{x}), y\big)\, dP(\mathbf{x}, y), \qquad L\big(f(\mathbf{x}), y\big) = \big(f(\mathbf{x}) - y\big)^2$$

 …but this function cannot be minimized, since we do not know 𝑃(𝒙, 𝑦)


 Instead, we use the empirical risk, computed as the observed loss (typically, a function of the misclassifications):

$$R_{emp}(f) = \frac{1}{n}\sum_{i=1}^{n} L\big(f(\mathbf{x}_i), y_i\big)$$
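A minimal sketch (the labels and predictions are invented) of the empirical risk under the loss defined above:

```python
# Minimal sketch (assumption: toy labels/predictions invented) of the empirical risk R_emp.
import numpy as np

y_true = np.array([+1, -1, +1, +1, -1, -1, +1, -1])
y_pred = np.array([+1, -1, -1, +1, -1, +1, +1, -1])   # f(x_i) for some classifier f

loss = (y_pred - y_true) ** 2     # L(f(x), y) = (f(x) - y)^2: equals 4 on each misclassification
R_emp = loss.mean()
print("empirical risk:", R_emp)   # 2 of the 8 samples are wrong -> R_emp = 4 * 2/8 = 1.0
```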


Empirical Risk Minimization (ERM)
 A fancy term that simply means minimizing the error on the training data.
 This is usually not a bad thing.
 In fact, for most classification algorithms, this is the primary objective in
determining the classifier parameters
 However, relying on the empirical error for selecting the best classifier (model
selection) is only a meaningful idea if there are lots of training data.
 We should try to minimize the expected risk
 The probability that the classifier will be incorrect on the previously unseen data
 As the number of available training samples → ∞, the empirical risk approaches the expected risk
 For small sample sizes, we cannot guarantee that a small empirical error will also lead to a small test error


Empirical Risk Minimization (ERM)

[Figure: Model A – simple (low capacity) vs. Model B – complex (high capacity). With too little data, we cannot tell which model is good. More data reveals that: if the true distribution is simple, Model B overfits; if the true distribution is complex, Model A underfits. From Muller (2001) / Osuna (G)]

Vapnik–Chervonenkis (VC) Theory
 So, then empirical error is not reliable when there is little data. What to do?
 We need to control the complexity / capacity of the classifier
 Choose the simplest classifier that adequately describes the data
 Occam’s razor
 But, what is a simple classifier? The one with fewest parameters?
 Vapnik & Chervonenkis argue: not necessarily!
 It is not just the number of parameters that matters, but the flexibility / capacity of
the classifier
 The VC dimension: a measure of the capacity of a classifier, described by a class of functions f (for example, the class of linear functions f(w, b))
 Colloquially, the VC dimension of a classifier (family of functions f) is the largest number of instances that can be correctly classified by f, regardless of the class labels of those instances.


VC Dimension
Some definitions:
 Given a set of n points for a two-class problem, there are 2ⁿ ways in which these n points can be labeled. If, for every possible labeling, a member of the family f can be found that assigns all labels correctly, this set of n points is said to be shattered by f.
 The VC dimension of a family of functions 𝑓 is the largest number of
training points that can be shattered by 𝑓.
 If the VC dimension of 𝑓 is ℎ, then there exists at least one set of ℎ points that can
be shattered, but in general not every set of ℎ points can be shattered.


VC Dimension
 For example, let n = 3. There are 2³ = 8 possible labelings for a two-class dichotomy:
 {(−1, −1, −1), (−1, −1, +1), (−1, +1, +1), (−1, +1, −1), (+1, −1, −1), (+1, −1, +1), (+1, +1, −1), (+1, +1, +1)}

 We can find a linear function f that correctly classifies this set of three points whichever way they are labeled.

[Figure (RP): the eight possible labelings of three points in 2-D, each correctly separated by a line]


VC Dimension
 But if we choose four points, it can be shown that there is no set of four points such that all possible (2⁴ = 16) labelings can be correctly classified.
 Recall the XOR problem. For any given set of four points, we can find at least one
way of assigning labels to these points such that a linear function fails to label them
correctly.

 Hence the VC dimension of the family of linear functions (in 2D) is three!
 In general, it can be shown that the VC dimension of the family of
hyperplanes in 𝑑 dimensional space is 𝑑 + 1
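A small sketch (assuming scikit-learn and NumPy; the point coordinates are arbitrary) that checks shattering empirically: every labeling of three non-collinear points is linearly separable, while the chosen set of four points (which contains the XOR labeling) is not shattered:

```python
# Minimal sketch (assumptions: scikit-learn/NumPy; arbitrary point coordinates).
from itertools import product
import numpy as np
from sklearn.svm import SVC

def shattered(points):
    """True if a linear classifier can realize every +/-1 labeling of the given points."""
    for labels in product([-1, +1], repeat=len(points)):
        if len(set(labels)) == 1:
            continue                               # single-class labelings are trivially realizable
        y = np.array(labels)
        clf = SVC(kernel="linear", C=1e6).fit(points, y)
        if clf.score(points, y) < 1.0:             # this labeling cannot be separated by a line
            return False
    return True

three = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])              # non-collinear
four = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])   # contains the XOR labeling

print("3 points shattered by lines:", shattered(three))   # expected: True
print("4 points shattered by lines:", shattered(four))    # expected: False
```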


VC Dimension
 VC dimension gives a measure of the complexity (capacity) of the
(classifier) functions.
 One may think that the larger the number of parameters a classifier (function) has, the higher the VC dimension – while this is generally true, there are many exceptions (see Burges, 1998)
 Hence, the VC dimension is a more accurate representation of the capacity of a classifier than the number of parameters: the higher the VC dimension of a classifier, the higher its capacity
 The k-nearest neighbor classifier (for k = 1) has a VC dimension that is infinite. Why? Because, with kNN, one can correctly classify any number of training samples, regardless of what the correct labels are
 Actually, the SVM with the RBF kernel also has infinite VC dimension (see Burges
1998), because for sufficiently small choice of the spread parameter, any number of
instances can be correctly classified by the RBF-SVM.


Structural Risk Minimization (SRM)
 Let’s summarize:
 We want to choose the function / classifier that minimizes expected risk.
• Cannot do it, since we do not know the underlying distributions
 Then, let’s minimize the empirical risk / error, after all, the model that minimizes
the training error is the one that best describes the given data
• Ok, but this is unreliable when the amount of training data is limited – we may end up choosing a classifier with a large capacity → possibly overfitting
 So we need to control the capacity of a classifier. How do we do that?
• Choose the classifier with fewer parameters? No, not necessarily
• Instead, choose a classifier with a lower VC dimension, as VC dimension is a better
indicator of the classifier’s capacity
 So, is there a relationship between the VC dimension and the expected risk,
the quantity that we really want to minimize?
 Funny you asked…


Structural Risk Minimization (SRM)
 VC dimension provides an upper bound to the expected risk as a function of the
empirical risk and the number of training data points:
 We can show that, given the VC dimension h, the number of training data points n, and a constant 0 < η < 1, the following upper bound holds with probability 1 − η:

$$R(f) < R_{emp}(f) + \sqrt{\frac{h\left(\log(2n/h) + 1\right) - \log(\eta/4)}{n}}$$

The second term on the right-hand side is called the VC confidence. Now, note the following:
 As n increases, the VC confidence becomes smaller and the expected risk approaches the empirical risk → makes sense: if we have lots of data, the empirical error is a good indicator of the true performance.
 The bound is not dependent on the (unknown) probability distribution 𝑃(𝒙, 𝑦), nor on the
number of adjustable parameters, nor on the dimensionality of the problem !!!
 While we can never actually compute the left hand side (expected risk), this inequality gives
us an upper bound. If we know ℎ, we can easily compute the upper bound.
 Given some small 𝜂, we should then choose the classifier (among all classifiers) that
minimizes the VC confidence, that is the one with the smallest VC dimension. This is called
structural risk minimization, as it minimizes the expected risk by reducing the second
term on the right hand side.
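A minimal sketch (plain Python/NumPy; the values of h, n, and η are arbitrary) evaluating the VC-confidence term of the bound above:

```python
# Minimal sketch (assumption: arbitrary h, n, eta) of the VC confidence in the bound above.
import numpy as np

def vc_confidence(h, n, eta):
    """sqrt( (h * (ln(2n/h) + 1) - ln(eta/4)) / n ), the second term of the bound."""
    return np.sqrt((h * (np.log(2 * n / h) + 1) - np.log(eta / 4)) / n)

R_emp = 0.05                       # observed training error of some classifier
for n in (100, 1000, 10000):
    bound = R_emp + vc_confidence(h=10, n=n, eta=0.05)
    print(f"n = {n:5d}: R(f) < {bound:.3f}")   # the bound tightens as n grows
```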

Structural Risk Minimization (SRM)
 SRM is therefore a formal term that simply states that the best classifier is the one
that provides a meaningful balance between empirical risk and VC dimension.
 Practically speaking, SRM tells us that we should choose a classifier that minimizes
the sum of empirical risk and VC dimension.
[Figure (RP / Muller, ’01): error rate vs. VC dimension (classifier capacity, from small to large). The empirical risk decreases with capacity while the VC confidence increases; the expected risk is bounded by their sum, whose minimum – between the underfitting and overfitting regimes – is the SRM choice. Confidence intervals are shown for Classifier 1 and Classifier 2.]


SRM in Practice
 Ah, it is easy then…All we need to do is to find the VC dimension of the
classifiers we generate, and then use the one that has the smallest ℎ + 𝐶𝐼
 If only it were that simple!
 The VC dimension for many classifiers cannot be easily computed, and hence the upper bound cannot be determined. It may be infinite (as in kNN), or the bound may prove to be very loose and therefore not very useful.
 Blah! All this VC theory crap for nothing…?
 Well, not really. It turns out for linear classifiers, there is a very interesting link
between the VC dimension and the separating margin between two classes.
 Remember: we want to minimize the sum – and therefore – both the empirical
error and the capacity of the classifier


Structural Risk Minimization (SRM)
 A different interpretation:
 Minimize structural risk  Minimize sum of empirical error + classifier capacity
 It is obvious why we want to minimize empirical error, but why the classifier capacity?
• Because, a large capacity classifier can explain not only the observed data, but many other data that
are not actually generated by the underlying boundary.

[Figure (RP): Model A explains both the given data and many other possible datasets; Model B mostly explains only the given data]


Structural Risk Minimization (SRM)
 So we want a classifier that can explain the given data as best as it can, but
cannot explain any other data!
 Let’s go back to our original representative data: what is the classifier (model) that
best describes the given data, but least likely to explain any other data?
 Intuitively speaking…

[Figure (RP): data in the Feature 1 – Feature 2 plane, the optimal (maximum-margin) decision boundary, and slack variables ξi, ξj for the points xi, xj that fall inside the margin]

SRM ↔ VC Dimension ↔ SVM
 So, intuitively, then, the classifier that best explains the given data, and least likely to explain
any other data is the one that provides the largest margin between the classes.
 Margin: The minimum distance of an instance to the decision boundary.
 It turns out, there is an interesting relationship between the margin and the VC dimension,
one that provides a theoretical basis to this intuition (Vapnik, 1998):

$$h \le \min\left(\frac{R^2}{m^2},\, d\right) + 1, \qquad m = \frac{2}{\|\mathbf{w}\|}$$

 where R is the radius of the smallest (hyper) sphere that includes all data points, and 𝑑 is the
dimensionality of the dataset
 So…?
 So, by maximizing the margin, we are minimizing the VC dimension!
 The SVM takes advantage of this very insight:
 SVM finds the classifier / model / function that maximizes the margin between the classes
 Hence, it automatically minimizes the VC dimension, even though we cannot compute the VC
dimension itself (actually for linear classifiers, the VC dimension is 𝑑 + 1)
 Hence, it automatically minimizes the VC confidence, hence the expected risk!
 Hence, the next best thing since sliced bread…er…since Bayes classifier
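A minimal sketch (assuming scikit-learn and NumPy; the enclosing-sphere radius is approximated by the largest distance from the data mean, which over-estimates the minimal enclosing sphere and so keeps the bound valid) evaluating this margin-based capacity bound for a trained linear SVM:

```python
# Minimal sketch (assumptions: scikit-learn/NumPy; R approximated from the data mean).
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=200, centers=2, cluster_std=1.0, random_state=0)
d = X.shape[1]

svm = SVC(kernel="linear", C=10.0).fit(X, y)
m = 2.0 / np.linalg.norm(svm.coef_[0])                    # margin m = 2 / ||w||

R = np.max(np.linalg.norm(X - X.mean(axis=0), axis=1))    # radius of a sphere enclosing all points
h_bound = min(R ** 2 / m ** 2, d) + 1
print(f"margin m = {m:.3f}, enclosing radius R = {R:.3f}, VC dimension h <= {h_bound:.1f}")
```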

SRM ↔ VC Dimension ↔ SVM

 A further insight:
 Recall the optimization problem for the noisy dataset case (using slack variables):

$$\min_{\mathbf{w}, b, \boldsymbol{\xi}} \quad \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^{n}\xi_i, \qquad C > 0$$

$$\text{subject to} \quad y_i(\mathbf{w}^T\mathbf{x}_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0$$

 The first term in the objective function minimizes ½‖w‖², hence maximizes the margin m = 2/‖w‖, which reduces the capacity of the classifier
 The second term minimizes the number of instances that fall inside the margin
and/or on the wrong side of the decision boundary, hence minimizes the
misclassification rate, hence minimizes the empirical error
 Thus, this objective function minimizes the sum of the empirical error and the classifier capacity – precisely the quantity that SRM asks us to minimize
 Hence, the support vector machines implement the structural risk
minimization!
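A small sketch (assuming scikit-learn and NumPy; the overlapping synthetic data is invented) showing the trade-off that the C term controls, between margin width (capacity) and training error:

```python
# Minimal sketch (assumptions: scikit-learn/NumPy; synthetic overlapping data).
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.5, random_state=1)

for C in (0.01, 1.0, 100.0):
    svm = SVC(kernel="linear", C=C).fit(X, y)
    margin = 2.0 / np.linalg.norm(svm.coef_[0])
    train_err = 1.0 - svm.score(X, y)
    # small C -> wide margin (low capacity) but more training errors; large C -> the reverse
    print(f"C = {C:6.2f}: margin = {margin:.3f}, training error = {train_err:.3f}")
```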


Strengths & Weaknesses of SVMs
 Strengths:
 Training is straightforward, given a good quadratic programming algorithm
 No local minima – unlike neural networks, the SVM will find the same global optimum each and every time. Excellent generalization ability
 Scales well to high dimensions (but not so well to large datasets → use SMO)
 Provides an explicit control between classifier complexity (capacity) and error
minimization
 Solution is sparse – it only uses the subset of the data – the support vectors
 An elegant and intuitive theory
 Only a few parameters need to be specified: the kernel itself, the (usually single) kernel parameter (σ for the Gaussian kernel, d for the polynomial), and the penalty term C
 Weaknesses:
 Need to choose a “good” kernel function, as well as a good value of C


Multi-Class Classification
 So far, we have only looked at binary (2-class) classification problems.
What do we do if we have more than two classes?
 One possible solution
 Change the quadratic programming formulation such that the multi-class
objective function is minimized.
 More commonly: Use an ensemble of SVMs. Two variations
1. One against one: create one SVM for each pairwise combination of classes
• 3 classes: Class 1 vs. Class 2, Class 1 vs. Class 3, and Class 2 vs. Class 3
2. One against all: create C SVMs, one for each of the C classes against all others
• Class 1 vs. Classes 2&3, Class 2 vs. Classes 1&3, Class 3 vs. Classes 1&2
 Then combine these classifiers using a suitable combination rule, such as majority voting. The class that receives the most votes is selected.
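A minimal sketch (assuming scikit-learn; the iris data is used only as a convenient 3-class example) of both ensemble strategies:

```python
# Minimal sketch (assumptions: scikit-learn; iris used only as a convenient 3-class example).
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

ovo = OneVsOneClassifier(SVC(kernel="rbf", C=1.0)).fit(X, y)   # C*(C-1)/2 pairwise SVMs
ovr = OneVsRestClassifier(SVC(kernel="rbf", C=1.0)).fit(X, y)  # C one-against-all SVMs

print("one-vs-one  training accuracy:", ovo.score(X, y))
print("one-vs-rest training accuracy:", ovr.score(X, y))
# Note: scikit-learn's SVC already applies a one-vs-one scheme internally for multi-class data.
```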


Connection to Neural Networks
 For certain choices of the kernel functions, there is a one-to-one
connection between the SVM and certain neural networks.
 Specifically (# of output nodes = 1 in each case)
 The RBF network is a special case of the SVM with the Gaussian kernel:
• The number of centers (number of receptive fields - hidden layer nodes) = number of
support vectors
• The centers themselves = the support vectors as found by SVM
• The weights 𝐰 from hidden to output layer = Lagrange multipliers 𝛼𝑖 and the threshold
(𝑏) are all determined automatically by the SVM solution!
 The MLP network architecture can be determined by the SVM with the hyperbolic tangent (sigmoid) kernel:
• The number of hidden layer nodes = number of support vectors
• Input to output weights = kernel values obtained by the sigmoid
• The output weights are determined by the Lagrangian multipliers
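A minimal sketch (assuming scikit-learn and NumPy; the synthetic data is invented) that makes the RBF-network reading explicit: the decision function of a Gaussian-kernel SVM is rebuilt by hand from its support vectors (the centers), dual coefficients (the hidden-to-output weights) and intercept (the threshold):

```python
# Minimal sketch (assumptions: scikit-learn/NumPy; synthetic data). An RBF-kernel SVM read as
# an RBF network: centers = support vectors, output weights = alpha_i * y_i, threshold = b.
import numpy as np
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
gamma = 2.0
svm = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X, y)

def rbf_network_output(x):
    """Hand-built 'RBF network' using the quantities found by the SVM."""
    centers = svm.support_vectors_                 # receptive-field centers
    weights = svm.dual_coef_[0]                    # alpha_i * y_i for each support vector
    hidden = np.exp(-gamma * np.sum((centers - x) ** 2, axis=1))   # Gaussian hidden units
    return hidden @ weights + svm.intercept_[0]    # linear output layer with threshold b

x_test = X[0]
print("hand-built RBF network  :", rbf_network_output(x_test))
print("sklearn decision_function:", svm.decision_function([x_test])[0])   # same value
```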


Other Kernel-Based Methods
 The take home message from these lectures:
 A problem that is not linearly separable in the given low dimensional input space, may very
well be linearly separable in the high dimensional space, and one does not need to perform
any computations in the high dimensional space for this!
 The same approach can therefore be used for other data analysis type problems, such as
kernel PCA, kernel LDA, kernel / SVM for regression and for one-class classification
(novelty detection).
 Extra credit: the “take-home project” from these lectures:
 In groups of 2-3, pick one of the above mentioned topics, prepare a 20 minute mini-lecture,
followed by a demo of the approach using any of the freely available software packages for
Matlab. Each group should have at least 1 grad and 1 undergraduate student
• Your own implementation of SVM using quadprog()
• Kernel PCA (non-linear PCA)
• Kernel (Fisher) LDA
• SVM for regression (function approximation) – support vector regression
• One-class SVMs for classification (novelty detection)
• Any other Kernel method not covered in class.
