The slides shown here are developed around the basic notions presented in the book.
They are intended to support both the instructors in the development and delivery
of course content and the learners in acquiring the ideas and techniques
presented in the book in a more pleasant way than just reading.
TOPICS in the INTRODUCTORY PART
SVMs & NNs as BLACK-box modeling, and
FLMs as WHITE-box modeling
Motivation and basics. Graphic presentation of
any approximation or classification scheme leads
to the so-called NNs and/or SVMs
Curve and surface fitting, multivariate function
approximation, nonlinear optimization and NNs
FL Models
[Figure: Black-Box vs. White-Box modeling. The black-box side is a network with inputs x1-x5, hidden-layer weights vj, output-layer weights wk and outputs o1-o4, learned from data. The white-box side relies on structured knowledge (experience, expertise or heuristics); IF-THEN rules are the most typical example of structured knowledge.]
This is the most common gray box situation covered by the paradigm of
Neuro-Fuzzy or Fuzzy-Neuro models.
If we do not have any prior knowledge AND we do not have any
measurements (by all accounts a very hopeless situation indeed), it may be
hard to expect or believe that the problem at hand may be approached
and solved easily. This is a no-color box situation.
[Figure: single-hidden-layer network o = F(x): input x, hidden-layer weights V (the matrix V is prescribed, entries vji), hidden-layer outputs y1, ..., yJ, bias +1, and output-layer weights wj.]
It is the same with POLYNOMIALS.
NONLINEAR LEARNING PROBLEM!!!
[Figure: NN with a single neuron in the hidden layer (HL): input x, hidden-layer weight w1 with netHL, a sine activation, output-layer weight w2, and output o compared with the desired value d.]
We know that the function is a sine, but we do not know its frequency and amplitude. Thus, by using the
training data set {x, d}, we want to model this system with the NN model consisting of a single neuron in the HL
(having the sine as an activation function) as given above.
[Figure: the cost function J as a function of the amplitude A = w2 (dashed) and the frequency w1 (solid), plotted over the weight w = [A; w].]
[Figure: the cost function J = Σi (di − w2 sin(w1 xi))² shown both as curves over the weight w = [A; w] and as a surface over the frequency w and the amplitude A.]
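As a rough sketch of how such a cost surface can be computed (assuming NumPy; the sample size and the "true" amplitude and frequency are arbitrary choices for illustration), the snippet below evaluates J(w1, w2) = Σi (di − w2 sin(w1 xi))² on a grid, which is what the plots above visualize.

```python
import numpy as np

# Hypothetical training data from a "true" system d = A*sin(w*x) (values chosen only for illustration)
x = np.linspace(-3, 3, 50)
d = 1.0 * np.sin(2.0 * x)            # true amplitude A = 1, true frequency w = 2

# Cost J(w1, w2) = sum_i (d_i - w2*sin(w1*x_i))^2 for a single hidden neuron with a sine activation
def cost(w1, w2):
    return np.sum((d - w2 * np.sin(w1 * x)) ** 2)

# Evaluate J on a grid of frequencies w1 and amplitudes w2 (= A)
w1_grid = np.linspace(-2, 6, 200)
w2_grid = np.linspace(-2, 4, 200)
J = np.array([[cost(w1, w2) for w1 in w1_grid] for w2 in w2_grid])

# The many local minima along the frequency axis are what make this a nonlinear learning problem.
print(J.shape, J.min())
```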
F(x) = Σ_{i=0}^{N} wi x^i

[Figure: the same single-hidden-layer network o = F(x), with the hidden layer computing the powers x^i (V is prescribed) and the output layer weighting them by wj.]

With prescribed (integer) exponents this is again a LINEAR APPROXIMATION SCHEME. It is linear in terms of the parameters to learn and not in terms of the resulting approximation function, which is a nonlinear function of x for i > 1.
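To make the "linear in the parameters" point concrete, here is a minimal sketch (assuming NumPy; the degree and the data are illustrative) that fits the weights wi of F(x) = Σ wi x^i by ordinary least squares. With prescribed exponents the learning problem is linear even though F(x) itself is nonlinear in x.

```python
import numpy as np

# Illustrative 1-D data from an unknown nonlinear function, here y = sin(x) plus noise
rng = np.random.default_rng(1)
x = np.linspace(-np.pi, np.pi, 40)
y = np.sin(x) + 0.05 * rng.standard_normal(x.size)

# Prescribed (integer) exponents i = 0..N define the design matrix; learning the w_i is LINEAR
N = 5
X = np.vander(x, N + 1, increasing=True)      # columns x^0, x^1, ..., x^N
w, *_ = np.linalg.lstsq(X, y, rcond=None)     # linear least-squares solution for the weights

F = X @ w                                     # resulting approximation F(x) = sum_i w_i x^i
print("weights:", w)
```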
The following approximation scheme is a NOVEL ONE, called an NN or an RBF network here.
Approximation of some NL 1-D function by Gaussian radial basis functions (RBFs):

F(x) = Σ_{i=1}^{N} wi φi(x, ci)

[Figure: noisy data points (*) approximated by a sum of Gaussian bumps placed at centers ci; the corresponding network has inputs x1, ..., xn+1, hidden-layer RBF units y1, ..., yJ with a bias +1, and output-layer weights wi producing o = F(x).]
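A minimal sketch of this scheme (assuming NumPy; the centers, width and data are illustrative choices): with fixed Gaussian centers ci and a fixed width, only the output weights wi have to be learned, again by linear least squares.

```python
import numpy as np

# Illustrative noisy samples of a nonlinear 1-D function
rng = np.random.default_rng(2)
x = np.linspace(0, 4, 60)
y = np.sinc(x - 2) + 0.05 * rng.standard_normal(x.size)

# Prescribed Gaussian basis functions phi_i(x, c_i) with fixed centers and width
centers = np.linspace(0, 4, 9)
sigma = 0.5
Phi = np.exp(-((x[:, None] - centers[None, :]) ** 2) / (2 * sigma ** 2))

# Only the output weights w are learned -> a linear problem
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
F = Phi @ w                                   # F(x) = sum_i w_i * phi_i(x, c_i)
print("number of basis functions:", centers.size)
```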
Data: measurements, images, records, observations.
[Figure: an example of a recorded data set plotted over a two-dimensional input space.]
F(x) = Σ_{j=1}^{J} wj φj(x, cj, σj)

[Figure: the corresponding network: inputs x1, ..., xn, bias +1, hidden-layer basis functions y1, ..., yJ with parameters vji, and output-layer weights wj producing o = F(x).]
There is no difference in representational capacity, i.e., in structure.
However, there is an important difference in LEARNING.
Therefore, let's talk about the basics of identification, estimation, regression, classification, pattern recognition, function approximation, curve or surface fitting, etc.
[Figure: final error as a function of the data size l, for a noisy data set and a noiseless data set (large-sample behavior).]
Glivenko-Cantelli-Kolmogorov results.
The Glivenko-Cantelli theorem states that the empirical distribution function Pemp(x) → P(x) as the number of data l → ∞.
However, for both regression and classification we need probability density functions p(x), i.e., p(x|ω), and not a distribution P(x):
∫ p(x) dx = P(x).
(Analogy with the classic problem Ax = y, x = A⁻¹y!)
What is needed here is the theory of
UNIFORM CONVERGENCE of the set of functions
implemented by a model, i.e., a learning machine
(Vapnik, Chervonenkis, 1960s-70s).
We will skip a discussion of this part here.
Skip it, but do not forget it.
[Figure: M or F? A classification problem from Brunelli & Poggio, 1993.]
Just a few words about the data set size and dimensionality. Approximation and
classification are the same for any dimensionality of the input space. Nothing (!) but
the size changes. But the change is DRASTIC. High dimensionality means both an
EXPLOSION in the number OF PARAMETERS to learn and a SPARSE training data
set.
High dimensional spaces seem to be terrifyingly empty.
[Figure: identification of a PLANT from sampled input-output data (input u, output y, sampling period T); with a one-dimensional input N training data points cover the input range, with a two-dimensional input N² points are needed, and with an n-dimensional input N^n points.]
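As a quick illustrative count (simply N^n with N = 10): keeping only a modest grid of 10 points per input dimension needs 10 training data for one input, 10² = 100 for two inputs, and already 10⁵ = 100,000 data points for five inputs. This is the explosion and the sparseness referred to above.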
[Figure: two panels of two-class data in the (x1, x2) plane with desired outputs d1 = +1 and d2 = -1, each separated by a different separation line.]
and, for another data set we will obtain another separation line.
Again, for a small sample a solution defined by w = X*D is NO LONGER A GOOD ONE!!!
What is common for both separation lines, the red and the blue one?
Both have a SMALL MARGIN.
WHAT'S WRONG WITH A SMALL MARGIN?
It is very likely that new examples will be wrongly classified.
However, the question is how to DEFINE and FIND the OPTIMAL SEPARATION HYPERPLANE GIVEN (scarce) DATA SAMPLES???
[Figure: nested hypothesis spaces H1 ⊂ H2 ⊂ ... ⊂ Hn-1 ⊂ Hn of increasing capacity (h ~ n), the target space T containing the target function fo, and the function fn,l chosen from Hn using l data.]
Therefore, in order to reduce the level of confusion for first-time readers, i.e., beginners in the world of SVMs, we developed the
LEARNSC software that should elucidate all these mathematical
objects, their relations and the spaces they live in.
Thus, the software, which can be downloaded from the same site, will
show the kind of figures below and a little more complex ones.
[Figure: SVM classification example: desired values y = +1 and y = -1 over the input plane (x1, x2), the decision function d(x, w, b), and the support vectors denoted by stars.]
Closeness to data:
E = Σ_{i=1}^{l} (di − f(xi, w))²

Closeness to data + smoothness (Regularization (RBF) NN):
E = Σ_{i=1}^{l} (di − f(xi, w))² + λ ||Pf||²

Closeness to data + capacity of a machine:
E = Σ_{i=1}^{l} (di − f(xi, w))² + Ω(l, h, η)
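The sketch below (assuming NumPy) contrasts the first two functionals on the RBF model from before. As a simplification, the smoothness term λ||Pf||² is replaced by a plain weight-decay penalty λ||w||², which is only a stand-in for a true differential-operator regularizer; the data and basis are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(0, 4, 60)
d = np.sinc(x - 2) + 0.1 * rng.standard_normal(x.size)

centers = np.linspace(0, 4, 25)               # deliberately many basis functions
sigma = 0.3
Phi = np.exp(-((x[:, None] - centers[None, :]) ** 2) / (2 * sigma ** 2))

# 1) Closeness to data only: ordinary least squares (can chase the noise)
w_ls, *_ = np.linalg.lstsq(Phi, d, rcond=None)

# 2) Closeness to data + smoothness, approximated here by weight decay lambda*||w||^2
lam = 1e-2
w_reg = np.linalg.solve(Phi.T @ Phi + lam * np.eye(centers.size), Phi.T @ d)

print("||w|| unregularized:", np.linalg.norm(w_ls))
print("||w|| regularized:  ", np.linalg.norm(w_reg))
```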
O.K.
3) Nonlinear Classifier.
yi ∈ {-1, +1}:
find the function f(x, w0), out of the set of functions f(x, w), which best approximates the
unknown discriminant (separation) function y = f(x).
Linearly separable data can
be separated by an infinite
number of linear
hyperplanes that can be
written as
f(x, w) = wTx + b
The problem is: find the
optimal separating
hyperplane
The MARGIN M IS DEFINED by w as follows:
M = 2 / ||w||
(Vapnik, Chervonenkis, 1974)
i = 1, ..., l, where l denotes the number of training data and NSV stands for the number of SVs.
Note that maximization of M means a minimization of ||w||.
Minimization of the norm of the hyperplane normal weight vector ||w|| = √(wTw) = √(w1² + w2² + ... + wn²) leads to a maximization of the margin M.
Because sqrt(f) is a monotonic function, its minimization is equivalent to a minimization of f.
Consequently, a minimization of the norm ||w|| equals a minimization of
wTw = w1² + w2² + ... + wn²
and this leads to a maximization of the margin M.
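As a quick numeric illustration (the weight vector is not from the slides): for w = [3, 4]T we get wTw = 9 + 16 = 25, ||w|| = 5 and M = 2/||w|| = 0.4, while a hyperplane with the smaller norm ||w|| = 2.5 has the larger margin M = 0.8.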
Minimize
J = wTw = ||w||²   (margin maximization!)
subject to the constraints
yi[wTxi + b] ≥ 1   (correct classification!)
and this is a classic QP problem with constraints that ends in forming and solving a primal and/or dual Lagrangian.
wTxi + b ≥ +1 − ξi,  for yi = +1,
wTxi + b ≤ −1 + ξi,  for yi = −1.
The problem is no longer convex and the solution is given by the saddle
point of the primal Lagrangian Lp(w, b, ξ, α, β), where αi and βi are the
Lagrange multipliers. Again, we should find an optimal saddle point (wo,
bo, ξo, αo, βo) because the Lagrangian Lp has to be minimized with
respect to w, b and ξ, and maximized with respect to nonnegative αi and βi.
[Figure: two-class data in the feature plane (Feature x1, Feature x2): Class 1 with y = +1 and Class 2 with y = -1.]
[Figure: mapping the inputs by adding a third feature x3 = x1x2; in the new (x1, x2, x3) space the two classes become separable by a plane f = 0 (f > 0 on one side), implemented by a small network over the inputs x1, x2, x3.]
Mapping Φ = Φ(x). Hyperplane in a feature space F: d(z) = wTz + b.

[Figure: a network implementing the mapping: the inputs x1, x2, x3 (plus a bias +1) are mapped to the features φ1(x), ..., φ9(x), namely x1, x2, x3, (x1)², (x2)², (x3)², x1x2, x2x3, x1x3; the decision function d(x) is their weighted sum with weights w1, ..., w9, and the indicator is iF = sign(d(x)).]
K(xi, xj) = ziTzj = ΦT(xi) Φ(xj).
Note that a kernel function K(xi, xj) is a function in input space.
Kernel functions                                        Type of classifier
K(x, xi) = [(xTxi) + 1]^d                               Polynomial of degree d
K(x, xi) = exp(-[(x − xi)T Σ⁻¹ (x − xi)] / 2)           Gaussian RBF
K(x, xi) = tanh[(xTxi) + b]*                            Multilayer perceptron
*only for certain values of b
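As a small illustration (assuming NumPy; an isotropic covariance Σ = σ²I is used to simplify the Gaussian kernel above, and the data points are made up), the kernel matrix K with entries K(xi, xj) can be computed directly in input space:

```python
import numpy as np

def gaussian_kernel_matrix(X, sigma=1.0):
    """K[i, j] = exp(-||x_i - x_j||^2 / (2*sigma^2)) for rows x_i of X (isotropic Sigma = sigma^2 * I)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T    # squared Euclidean distances
    return np.exp(-d2 / (2.0 * sigma ** 2))

# Illustrative data: 5 points in a 2-D input space
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]])
K = gaussian_kernel_matrix(X, sigma=0.7)
print(K.round(3))
```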
Ld(α) = Σ_{i=1}^{l} αi − ½ Σ_{i,j=1}^{l} yi yj αi αj ziTzj

or,

Ld(α) = Σ_{i=1}^{l} αi − ½ Σ_{i,j=1}^{l} yi yj αi αj K(xi, xj)

subject to
αi ≥ 0,  i = 1, ..., l
and
Σ_{i=1}^{l} αi yi = 0

d(x) = Σ_{i=1}^{l} yi αi K(x, xi) + b
We see that the final structure of the SVM is equal to that of the NN model.
In essence it is a weighted linear combination of some kernel (basis)
functions. We'll show these (hyper)surfaces in simulations later.
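A rough end-to-end sketch of this dual problem (assuming NumPy and SciPy; the toy data, the Gaussian kernel width, and the use of a general-purpose solver instead of a dedicated QP/SMO routine are all illustrative choices): it maximizes Ld(α) subject to αi ≥ 0 and Σ αi yi = 0, then forms d(x) = Σ yi αi K(x, xi) + b.

```python
import numpy as np
from scipy.optimize import minimize

def K(a, b, sigma=1.0):                       # Gaussian kernel evaluated in input space
    return np.exp(-np.sum((a - b) ** 2) / (2 * sigma ** 2))

# Tiny illustrative 2-D training set, linearly separable
X = np.array([[2.0, 2.0], [2.5, 1.5], [-2.0, -2.0], [-1.5, -2.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
l = len(y)
Kmat = np.array([[K(X[i], X[j]) for j in range(l)] for i in range(l)])

# Dual: maximize sum(alpha) - 0.5 * sum_ij alpha_i alpha_j y_i y_j K_ij  (here we minimize its negative)
def neg_dual(alpha):
    return 0.5 * alpha @ ((y[:, None] * y[None, :] * Kmat) @ alpha) - alpha.sum()

cons = {"type": "eq", "fun": lambda a: a @ y}             # sum_i alpha_i y_i = 0
res = minimize(neg_dual, np.zeros(l), bounds=[(0, None)] * l, constraints=cons)
alpha = res.x

# Bias from a support vector (alpha_s > 0): b = y_s - sum_i y_i alpha_i K(x_s, x_i)
s = int(np.argmax(alpha))
b = y[s] - np.sum(y * alpha * Kmat[s])

def d(x):                                                  # decision function d(x)
    return np.sum(y * alpha * np.array([K(x, xi) for xi in X])) + b

print(np.sign([d(x) for x in X]))                          # should reproduce the labels
```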
Regression by SVMs
Initially developed for solving classification problems, SV techniques
can be successfully applied in regression, i.e., to functional
approximation problems (Drucker et al., 1997; Vapnik et al., 1997).
Unlike in pattern recognition problems (where the desired outputs yi are
discrete values, e.g., Boolean), here we deal with real-valued functions.
Now, the general regression learning problem is set as follows:
the learning machine is given l training data from which it attempts to
learn the input-output relationship (dependency, mapping or function)
f(x).
A training data set D = {[x(i), y(i)] ∈ ℝⁿ × ℝ, i = 1, ..., l} consists of l
pairs (x1, y1), (x2, y2), ..., (xl, yl), where the inputs x are n-dimensional
vectors x ∈ ℝⁿ and the system responses y ∈ ℝ are continuous values. The
SVM considers approximating functions of the form

f(x, w) = Σ_{i=1}^{N} wi φi(x)
Vapnik introduced a more general error (loss) function, the so-called ε-insensitivity loss function:

|y − f(x, w)|_ε = 0                        if |y − f(x, w)| ≤ ε,
|y − f(x, w)|_ε = |y − f(x, w)| − ε        otherwise.
Thus, the loss is equal to 0 if the difference between the predicted f(x, w)
and the measured value is less than ε. Vapnik's ε-insensitivity loss
function defines an ε tube around f(x, w). If the predicted value is
within the tube, the loss (error, cost) is zero. For all other predicted points
outside the tube, the loss equals the magnitude of the difference between
the predicted value and the radius ε of the tube. See the next figure.
[Figure: Vapnik's ε-insensitivity loss plotted against y − f(x, w), and the corresponding ε tube around the predicted f(x, w) (solid line); measured points yi inside the tube produce no loss, while a measured point yj outside the tube has slack ξj*.]
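A direct transcription of this loss (assuming NumPy; vectorized over the residuals, with illustrative values):

```python
import numpy as np

def eps_insensitive_loss(y, f, eps=0.1):
    """Vapnik's epsilon-insensitivity loss: 0 inside the eps tube, |y - f| - eps outside."""
    r = np.abs(y - f)
    return np.where(r <= eps, 0.0, r - eps)

# Illustrative residuals: points inside the tube contribute nothing to the loss
print(eps_insensitive_loss(np.array([1.0, 1.05, 1.3]), np.array([1.0, 1.0, 1.0]), eps=0.1))
# -> [0.  0.  0.2]
```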
The risk (cost) to be minimized is

R = ½||w||² + C Σ_{i=1}^{l} (ξi + ξi*)

which leads to the primal Lagrangian

Lp = ½wTw + C Σ_{i=1}^{l}(ξi + ξi*) − Σ_{i=1}^{l} αi*[yi − wTxi − b + ε + ξi*] − Σ_{i=1}^{l} αi[wTxi + b − yi + ε + ξi] − Σ_{i=1}^{l}(βi*ξi* + βiξi)

subject to the constraints

Σ_{i=1}^{l} αi* = Σ_{i=1}^{l} αi,
0 ≤ αi* ≤ C,  i = 1, ..., l,
0 ≤ αi ≤ C,   i = 1, ..., l.
An optimal bias bo can be found from

bo = (1/l) Σ_{i=1}^{l} (yi − gi),

where g = G wo and the matrix G is the corresponding design matrix of the given RBF kernels.
The best nonlinear regression hyperfunction is given by
z = f(x, w) = Gw + b.
There are a few learning parameters in constructing SV machines for
regression. The two most relevant are the insensitivity zone ε and the
penalty parameter C, which determines the trade-off between the training
error and the VC dimension of the model. Both parameters should be
chosen by the user.
Generally, an increase in the insensitivity zone ε has smoothing effects
on modeling highly polluted (noisy) data. An increase in ε means a reduction
in requirements on the accuracy of approximation. It decreases the
number of SVs, leading to data compression too. See the next figures.
[Figure: two SV regression solutions (y vs. x) illustrating the influence of the insensitivity zone ε on the fit and on the number of SVs.]
min cTw = min [1 1 ... 1 | 1 1 ... 1] [w1+ ... wP+ | w1− ... wP−]T      (P columns in each block)

subject to

| G   −G | | w+ |      |  y + ε1 |
|−G    G | | w− |  ≤   | −y + ε1 |,    w+ ≥ 0,  w− ≥ 0.
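A sketch of how such a linear program can be solved numerically (assuming NumPy and SciPy; the RBF design matrix G, the data and ε are illustrative, the bias term is omitted for brevity, and scipy.optimize.linprog is used in place of the book's own LEARNSC routines):

```python
import numpy as np
from scipy.optimize import linprog

# Illustrative 1-D regression data and a Gaussian RBF design matrix G (P basis functions)
rng = np.random.default_rng(4)
x = np.linspace(0, 4, 30)
y = np.sinc(x - 2) + 0.02 * rng.standard_normal(x.size)
centers = np.linspace(0, 4, 12)
sigma = 0.5
G = np.exp(-((x[:, None] - centers[None, :]) ** 2) / (2 * sigma ** 2))
P = centers.size
eps = 0.1   # if eps is chosen too small relative to the noise, the LP becomes infeasible

# min c^T [w+; w-] with c = [1 ... 1], subject to the eps-tube constraints and w+, w- >= 0
c = np.ones(2 * P)
A_ub = np.block([[G, -G], [-G, G]])
b_ub = np.concatenate([y + eps, -y + eps])
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * (2 * P))

w = res.x[:P] - res.x[P:]                         # w = w+ - w-
print("solver status:", res.status)
print("nonzero weights:", int(np.sum(np.abs(w) > 1e-6)))
```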