Lecture 1
Kristiaan Pelckmans
September 8, 2015
Overview
Today:
- Overview of the course.
- Support Vector Machines (SVMs): the separable case.
- Convex optimization.
- Analysis.
- Kernels.
- SVMs: the inseparable case.
Overview (Ctd)
Organization:
- 10 lectures.
- 1 computer lab (mid October).
- Miniprojects (due end of October).
- Participants give lectures using the course material.
Overview (Ctd)
Course:
1. Introduction.
2. Support Vector Machines (SVMs).
3. Probably Approximately Correct (PAC) analysis.
4. Boosting.
5. Online Learning.
6. Multi-class classification (*).
7. Ranking (*).
8. Regression (*).
9. Stability-based analysis (*).
10. Dimensionality reduction (*).
11. Reinforcement learning (*).
12. Presentations of the results of the mini-projects.
Introduction
n-fold Cross-validation
- Let $S_m = \{(x_i, y_i)\}_{i=1}^{m}$ be the original training set.
- Divide $S_m$ into $n$ disjoint folds so that every point is included exactly once; denote the $i$-th fold as $\{(x_{ij}, y_{ij})\}_{j=1}^{m_i}$.
- The training set $S_i$ of the $i$-th iteration consists of the remaining $n - 1$ folds.
- Denote by $h_{S_i}$ the outcome of $A()$ applied to the $i$-th training set.
$$R_{CV}(A) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{m_i} \sum_{j=1}^{m_i} L\big(h_{S_i}(x_{ij}), y_{ij}\big)$$
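A minimal sketch of this estimator in Python, assuming a learner `A` that takes a sample `(X, y)` and returns a predict function, with the 0/1 loss:

```python
import numpy as np

# A minimal sketch of n-fold cross-validation; the learner A is assumed to
# take (X, y) and return a predict function h with h(X) -> predicted labels.
def cross_validation_risk(A, X, y, n=5, seed=0):
    m = len(y)
    folds = np.array_split(np.random.default_rng(seed).permutation(m), n)
    risks = []
    for i in range(n):
        train = np.concatenate([folds[j] for j in range(n) if j != i])
        h = A(X[train], y[train])                  # h_{S_i}: trained on n-1 folds
        errors = h(X[folds[i]]) != y[folds[i]]     # 0/1 loss on the held-out fold
        risks.append(errors.mean())                # (1/m_i) sum_j L(h(x_ij), y_ij)
    return float(np.mean(risks))                   # average over the n folds
```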
Introduction (Ctd)
Learning Scenarios
- Supervised learning.
- Unsupervised learning.
- Semi-supervised learning.
- Transductive inference.
- Online learning.
- Reinforcement learning.
- Active learning.
SVM - separable case
Support Vector Machine (SVM)
- Assume that there is an $f$ s.t. $y = f(x)$.
- Find $h$ with minimal risk $R(h)$.
Maximal Margin
- Hyperplane $\{x : w^\top x + b = 0\}$.
- Normalise such that $\min_i |w^\top x_i + b| = 1$ (w.l.o.g.).
- Distance of a point $x_0$ to the hyperplane:
$$\frac{|w^\top x_0 + b|}{\|w\|}$$
- Thus the margin is given as
$$\rho = \frac{\min_i |w^\top x_i + b|}{\|w\|} = \frac{1}{\|w\|}.$$
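A quick numeric check of the distance and margin formulas; the hyperplane $(w, b)$ and the points below are toy values chosen for illustration:

```python
import numpy as np

# Toy hyperplane and points, for illustration only.
w = np.array([2.0, 1.0])
b = -1.0
X = np.array([[1.0, 1.0], [2.0, 0.5], [-1.0, 0.5]])

dist = np.abs(X @ w + b) / np.linalg.norm(w)   # |w.x + b| / ||w||
print(dist, dist.min())                        # margin = min_i of the distances
```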
SVM - separable case (Ctd)
Maximal Margin
- Maximal margin hyperplane:
$$\max_{\rho, w, b} \; \rho \quad \text{s.t.} \quad y_i(w^\top x_i + b) \ge 0 \;\; \forall i, \quad \rho = \min_i \frac{|w^\top x_i + b|}{\|w\|}, \quad \min_i |w^\top x_i + b| = 1$$
- Equivalently:
$$\max_{\rho, w, b} \; \rho \quad \text{s.t.} \quad y_i(w^\top x_i + b) \ge 1 \;\; \forall i, \quad \rho = \frac{1}{\|w\|}$$
- Or
$$\min_{w, b} \; \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y_i(w^\top x_i + b) \ge 1 \;\; \forall i.$$
Maximal Margin
- Convex objective.
- Affine inequality constraints.
- Hence a Quadratic Programming (QP) problem.
- Dual problem: properties!
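A sketch of the hard-margin primal QP on toy separable data; scipy's generic SLSQP solver stands in here for a dedicated QP solver, and the data are illustrative assumptions:

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data; a dedicated QP solver would be used in practice.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

def objective(v):                  # v = (w, b); minimise (1/2)||w||^2
    w = v[:-1]
    return 0.5 * w @ w

constraints = [{"type": "ineq",    # y_i (w.x_i + b) - 1 >= 0
                "fun": lambda v, i=i: y[i] * (X[i] @ v[:-1] + v[-1]) - 1.0}
               for i in range(len(y))]

res = minimize(objective, x0=np.zeros(X.shape[1] + 1),
               method="SLSQP", constraints=constraints)
w, b = res.x[:-1], res.x[-1]
print(w, b, 1.0 / np.linalg.norm(w))   # maximal margin = 1 / ||w||
```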
Convex Optimization
Convex
- A set $X$ is convex iff for any two points $x, x' \in X$, the segment $\{\lambda x + (1 - \lambda)x' : 0 \le \lambda \le 1\} \subseteq X$.
- A function $f : X \to \mathbb{R}$ is convex iff for all $x, x' \in X$ and all $0 \le \lambda \le 1$ one has that $f(\lambda x + (1 - \lambda)x') \le \lambda f(x) + (1 - \lambda) f(x')$.
- Primal problem: $p^* = \min_{x \in X} f(x)$ s.t. $g_i(x) \le 0$ for all $i$.
- Lagrangian:
$$\forall x \in X, \; \forall \alpha \ge 0 : \quad L(x, \alpha) = f(x) + \sum_i \alpha_i g_i(x).$$
- Dual function:
$$\forall \alpha \ge 0 : \quad F(\alpha) = \inf_{x \in X} L(x, \alpha),$$
so that $F(\alpha) \le p^*$.
- Dual problem:
$$d^* = \max_{\alpha \ge 0} F(\alpha)$$
Convex Optimization (Ctd)
Convex Programming
- Weak duality: $p^* \ge d^*$.
- Strong duality: $p^* = d^*$.
- Duality gap: $p^* - d^*$.
- Strong duality holds when constraint qualifications hold.
- Strong constraint qualification (Slater): $\exists x \in \operatorname{int}(C) : g_i(x) < 0 \;\; \forall i$.
- Weak constraint qualification (weak Slater): $\exists x \in \operatorname{int}(C) : \forall i, \; g_i(x) < 0$, or $g_i$ is affine and $g_i(x) = 0$.
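A numeric illustration of weak and strong duality on a toy problem not taken from the slides: $\min x^2$ s.t. $1 - x \le 0$, with optimum $p^* = 1$ at $x = 1$; since $x = 2$ is strictly feasible, Slater's condition holds and the gap is zero:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def F(alpha):
    # dual function F(alpha) = inf_x L(x, alpha), L(x, alpha) = x^2 + alpha*(1 - x)
    return minimize_scalar(lambda x: x**2 + alpha * (1.0 - x)).fun

p_star = 1.0
for alpha in [0.0, 1.0, 2.0, 3.0]:
    print(alpha, F(alpha), F(alpha) <= p_star)    # weak duality: F(alpha) <= p*

# Slater holds (x = 2 is strictly feasible), so the duality gap is zero:
d_star = -minimize_scalar(lambda a: -F(a), bounds=(0.0, 10.0), method="bounded").fun
print(d_star)                                     # d* = p* = 1, attained at alpha = 2
```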
Convex Optimization (Ctd)
- Lagrangian:
$$L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{m} \alpha_i \big(y_i(w^\top x_i + b) - 1\big)$$
- KKT conditions:
$$\nabla_w L = 0 \;\Rightarrow\; w = \sum_{i=1}^{m} \alpha_i y_i x_i$$
$$\nabla_b L = 0 \;\Rightarrow\; \sum_{i=1}^{m} \alpha_i y_i = 0$$
$$\forall i : \; \alpha_i \big(y_i(w^\top x_i + b) - 1\big) = 0$$
- Resulting classifier: $h(x) = \operatorname{sign}(w^\top x + b)$
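A sketch of recovering $(w, b)$ from a dual solution via the KKT conditions; the $\alpha$ values and the two support vectors below are illustrative toy numbers, not a solver output:

```python
import numpy as np

# Illustrative support vectors and dual values (toy numbers).
X = np.array([[2.0, 2.0], [-2.0, -2.0]])      # support vectors (alpha_i > 0)
y = np.array([1.0, -1.0])
alpha = np.array([0.0625, 0.0625])            # satisfies sum_i alpha_i y_i = 0

w = (alpha * y) @ X                           # w = sum_i alpha_i y_i x_i
b = y[0] - w @ X[0]                           # complementary slackness: y_i (w.x_i + b) = 1
h = lambda x: np.sign(w @ x + b)              # h(x) = sign(w.x + b)
print(w, b, h(np.array([3.0, 3.0])))
```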
Analysis of SVMs (Ctd)
Generalization error
- Leave-one-out analysis.
- Bounds in terms of $N_{SV}$, the number of support vectors.
- Margin-based analysis.
Analysis of SVMs (Ctd)
Leave-one-out analysis
$$R_{LOO}(A, S) = \frac{1}{m} \sum_{i=1}^{m} 1\big(h_{S \setminus (x_i, y_i)}(x_i) \ne y_i\big)$$
- $A(S) = h_S$.
- $1(z) = 1$ iff $z$ is true, otherwise $1(z) = 0$.
- Then $R_{LOO}$ is an unbiased estimate of the risk of hypotheses trained on $m - 1$ points:
$$\mathbb{E}_{S \sim D^m}[R_{LOO}(A, S)] = \mathbb{E}_{S' \sim D^{m-1}}[R(h_{S'})]$$
Proof:
$$\mathbb{E}_{S \sim D^m}[R_{LOO}(A, S)] = \frac{1}{m} \sum_{i=1}^{m} \mathbb{E}_{S \sim D^m}\big[1(h_{S \setminus (x_i, y_i)}(x_i) \ne y_i)\big]$$
$$= \mathbb{E}_{S \sim D^m}\big[1(h_{S \setminus (x_1, y_1)}(x_1) \ne y_1)\big] \quad \text{(by symmetry of the i.i.d. sample)}$$
$$= \mathbb{E}_{S' \sim D^{m-1}}\big[\mathbb{E}_{x_1 \sim D}[1(h_{S'}(x_1) \ne y_1)]\big]$$
$$= \mathbb{E}_{S' \sim D^{m-1}}[R(h_{S'})].$$
Analysis of SVMs (Ctd)
$$\mathbb{E}_{S \sim D^m}[R(h_S)] \le \mathbb{E}_{S' \sim D^{m+1}}\left[\frac{N_{SV}(S')}{m+1}\right]$$
since for a sample $S$ of $m + 1$ points,
$$R_{LOO}(A, S) \le \frac{N_{SV}(S)}{m+1}$$
(leaving out a point that is not a support vector does not change the SVM solution).
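A direct implementation of $R_{LOO}$ as a sketch, under the same assumption as before that the learner `A` maps a sample to a predict function:

```python
import numpy as np

# Leave-one-out error: train on all but one point, test on the left-out point.
def loo_error(A, X, y):
    m = len(y)
    errors = 0
    for i in range(m):
        keep = np.arange(m) != i
        h = A(X[keep], y[keep])               # h trained on S \ (x_i, y_i)
        errors += int(h(X[i]) != y[i])        # 1(h(x_i) != y_i)
    return errors / m                         # (1/m) sum of LOO mistakes
```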
SVM - Margin analysis (Ctd)
Vapnik-Chervonenkis (VC) dimension:
- Distance of a point $x_0$ with label $y_0$ to a hyperplane $\{x : w^\top x + b = 0\}$ is
$$\rho(x_0) = \frac{y_0(w^\top x_0 + b)}{\|w\|}$$
- The margin is given as
$$\rho = \min_i \frac{y_i(w^\top x_i + b)}{\|w\|}$$
- Measures the capacity of $H$ (Structural Risk Minimisation: see next lecture).
- The VC dimension of hyperplanes in $\mathbb{R}^N$ is $N + 1$ (try it!) ...
- But what about high-dimensional settings?
SVM - Margin analysis (Ctd)
- Measures the capacity of $H$.
- Relates to the empirical Rademacher complexity:
$$\widehat{R}_S(H) = \mathbb{E}_{\sigma_1, \ldots, \sigma_m}\left[\sup_{h \in H} \frac{1}{m} \sum_{i=1}^{m} \sigma_i h(x_i)\right].$$
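A Monte Carlo sketch of this quantity for a finite hypothesis class; `preds` and the two constant classifiers in the usage line are illustrative assumptions:

```python
import numpy as np

# Monte Carlo estimate of the empirical Rademacher complexity of a finite
# class; `preds` holds the values (h(x_1), ..., h(x_m)) per hypothesis (row).
def empirical_rademacher(preds, n_draws=10000, seed=0):
    n_h, m = preds.shape
    sigma = np.random.default_rng(seed).choice([-1.0, 1.0], size=(n_draws, m))
    # per draw of sigma: sup over h of (1/m) sum_i sigma_i h(x_i)
    return (sigma @ preds.T / m).max(axis=1).mean()

preds = np.array([np.ones(50), -np.ones(50)])   # two constant classifiers, m = 50
print(empirical_rademacher(preds))              # approx sqrt(2/(pi*m)) ~ 0.11
```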
SVM - non-separable case
Maximal Soft Margin:
- Non-separable case: there is no $(w, b)$ with $y_i(w^\top x_i + b) \ge 1$ for all $i$.
- Relax the constraints with slack variables $\xi_i \ge 0$, penalised in the objective:
$$\min_{w, b, \xi} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{m} \xi_i \quad \text{s.t.} \quad y_i(w^\top x_i + b) \ge 1 - \xi_i, \; \xi_i \ge 0 \;\; \forall i$$
Dual problem:
$$\max_{0 \le \alpha \le C} \; \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j y_i y_j (x_i^\top x_j)$$
$$\text{s.t.} \quad \sum_{i=1}^{m} \alpha_i y_i = 0$$
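A sketch of this box-constrained dual on the earlier toy data, again using scipy's SLSQP in place of a dedicated QP solver:

```python
import numpy as np
from scipy.optimize import minimize

# Toy data; a dedicated QP solver would be preferable in practice.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
C = 1.0
G = (y[:, None] * X) @ (y[:, None] * X).T      # G_ij = y_i y_j (x_i . x_j)

def neg_dual(alpha):                           # negate to turn max into min
    return 0.5 * alpha @ G @ alpha - alpha.sum()

res = minimize(neg_dual, x0=np.zeros(len(y)), method="SLSQP",
               bounds=[(0.0, C)] * len(y),
               constraints=[{"type": "eq", "fun": lambda a: a @ y}])
print(res.x)                                   # alpha_i > 0 marks the support vectors
```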
- Empirical Rademacher complexity:
$$\widehat{R}_S(H) = \mathbb{E}_{\sigma_1, \ldots, \sigma_m}\left[\sup_{h \in H} \frac{1}{m} \sum_{i=1}^{m} \sigma_i h(x_i)\right]$$
- $H = \{h(x) = \operatorname{sign}(w^\top x + b) : \|w\| \le \Lambda, \; b \in \mathbb{R}\}$.
- Theorem: Let $H$ be a set of real-valued functions, and fix $\rho > 0$. For any $\delta > 0$, with probability exceeding $1 - \delta$ one has that
$$\forall h \in H : \; R(h) \le \widehat{R}_\rho(h) + \frac{2}{\rho} \widehat{R}_S(H) + 3\sqrt{\frac{\log \frac{2}{\delta}}{2m}}.$$
SVM - Analysis (Ctd).
Rademacher complexity:
$$\widehat{R}_S(H) = \mathbb{E}_{\sigma_1, \ldots, \sigma_m}\left[\sup_{h \in H} \frac{1}{m} \sum_{i=1}^{m} \sigma_i h(x_i)\right]$$
Contraction: for an $L$-Lipschitz function $\Phi$,
$$\widehat{R}_S(\Phi \circ H) \le L \, \widehat{R}_S(H)$$
Proof sketch:
$$\widehat{R}_S(\Phi \circ H) = \frac{1}{m} \mathbb{E}_{\sigma_1, \ldots, \sigma_m}\left[\sup_{h \in H} \sum_{i=1}^{m} \sigma_i (\Phi \circ h)(x_i)\right] = \frac{1}{m} \mathbb{E}_{\sigma_1, \ldots, \sigma_{m-1}}\left[\mathbb{E}_{\sigma_m}\left[\sup_{h \in H} u_{m-1}(h) + \sigma_m (\Phi \circ h)(x_m)\right]\right]$$
with $u_{m-1}(h) = \sum_{i=1}^{m-1} \sigma_i (\Phi \circ h)(x_i)$.
SVM - Analysis (Ctd).
Choosing $h_1, h_2 \in H$ that (nearly) attain the two suprema for $\sigma_m = +1$ and $\sigma_m = -1$ respectively (the outer $\frac{1}{m}\mathbb{E}_{\sigma_1, \ldots, \sigma_{m-1}}$ carries through):
$$\mathbb{E}_{\sigma_m}\left[\sup_{h \in H} u_{m-1}(h) + \sigma_m (\Phi \circ h)(x_m)\right] = \frac{1}{2}\big[u_{m-1}(h_1) + \Phi(h_1(x_m))\big] + \frac{1}{2}\big[u_{m-1}(h_2) - \Phi(h_2(x_m))\big]$$
$$\le \frac{1}{2}\big[u_{m-1}(h_1) + u_{m-1}(h_2)\big] + \frac{1}{2} s L \big[h_1(x_m) - h_2(x_m)\big]$$
$$\le \mathbb{E}_{\sigma_m}\left[\sup_{h \in H} u_{m-1}(h) + \sigma_m L \, h(x_m)\right],$$
with $s = \operatorname{sign}(h_1(x_m) - h_2(x_m))$, using the $L$-Lipschitz property of $\Phi$.
For kernel-based hypotheses:
$$H = \left\{ x \mapsto \sum_{i=1}^{m} \alpha_i y_i K(x_i, x) \; : \; \|\alpha\| \le \Lambda \right\}.$$
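A sketch of one hypothesis from this kernelised class, using a Gaussian kernel; the expansion coefficients and the bandwidth are illustrative assumptions:

```python
import numpy as np

# One hypothesis h(x) = sum_i alpha_i y_i K(x_i, x); toy values throughout.
def gaussian_kernel(a, b, bandwidth=1.0):
    return np.exp(-np.linalg.norm(a - b) ** 2 / (2.0 * bandwidth ** 2))

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0]])
y = np.array([1.0, 1.0, -1.0])
alpha = np.array([0.5, 0.3, 0.8])

def h(x):                                      # kernel expansion over the sample
    return sum(a_i * y_i * gaussian_kernel(x_i, x)
               for a_i, y_i, x_i in zip(alpha, y, X))

print(np.sign(h(np.array([2.5, 2.5]))), np.sign(h(np.array([-1.0, -1.5]))))
```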
Conclusions