The Support Vector Machine (SVM) is a type of learning machine introduced by V. Vapnik et al. [1], and it has been widely used in many fields such as pattern recognition and function approximation [2]. Compared with traditional methods, its generalization performance and its ability to deal with nonlinear problems are better. However, for a given learning problem, can the generalization performance of an SVM be measured quantitatively? Can it be obtained from the training dataset alone? These questions should be answered by generalization performance estimation.
In short, the generalization performance is the maximal error rate when an SVM is used to classify the testing dataset, and generalization performance estimation aims to obtain a bound on this error rate. Many bounds and methods have been published [3], but some require the samples to be linearly separable, while others require that the classifying hyper-plane pass through the origin. This means they cannot be used for a general SVM.
The method presented in this paper builds on the one proposed by Thorsten Joachims and used in text classification, which uses the intermediate parameters obtained when training the SVM [4]. A novel algorithm for estimating the generalization performance of the RBF-SVM, using the special features of the RBF-SVM, is discussed. With this method, we can estimate the generalization performance easily.
The decision function of the SVM is

f(x) = \mathrm{sgn}\left( \sum_{i=1}^{l} \alpha_i y_i K(x_i, x) + b \right)    (1)
where the coefficients α_i are obtained by solving the dual quadratic program

W(\alpha) = \max_{\alpha} \left[ \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{l} \alpha_i \alpha_j y_i y_j K(x_i, x_j) \right], \quad 0 \le \alpha_i \le C, \quad \sum_{i=1}^{l} y_i \alpha_i = 0    (2)
This work was supported by the National ECM Lab Foundation of China under project number 5143505.
Problem (2) is the dual of the primal optimization problem

\min_{\omega, b, \xi} \ \frac{1}{2} (\omega \cdot \omega) + C \sum_{i=1}^{l} \xi_i, \quad C \ge 0, \quad y_i \left[ (\omega \cdot x_i) + b \right] \ge 1 - \xi_i, \quad \xi_i \ge 0    (3)
In (2), K(·,·) is the kernel function satisfying the Mercer condition; it maps the input space into a high- (or infinite-) dimensional feature space. Commonly used kernel functions include the linear (no mapping), polynomial, radial basis function (RBF) and sigmoid kernels. The RBF kernel discussed here is K(x_i, x_j) = exp(−‖x_i − x_j‖²). C is a constant controlling the tradeoff between the training error and the complexity of the SVM. We can observe that only part of the α_i satisfy α_i > 0; the corresponding samples are called support vectors. These samples construct the classifying hyper-plane and determine the classification performance. The classifying hyper-plane is also called the soft-margin classifying hyper-plane, and the corresponding machine the soft-margin SVM.
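As a concrete illustration of the kernel defined above, the following sketch (NumPy; the helper name and sample data are ours, not the paper's) computes the RBF kernel matrix K(x_i, x_j) = exp(−‖x_i − x_j‖²) for a small set of samples; note that every diagonal entry equals 1, since exp(0) = 1.

import numpy as np

def rbf_kernel_matrix(X):
    # Pairwise squared Euclidean distances ||x_i - x_j||^2.
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    # RBF kernel with unit width, as used in the paper: K_ij = exp(-||x_i - x_j||^2).
    return np.exp(-np.maximum(sq_dists, 0.0))

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
print(rbf_kernel_matrix(X))   # the diagonal entries are all 1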
In optimization theory, since (2) is a convex quadratic program, the necessary and sufficient condition for a solution to be the global optimum is that it satisfies the KKT conditions, namely:

\alpha_i^0 \left[ y_i (\omega_0 \cdot x_i + b_0) - 1 + \xi_i \right] = 0, \quad \mu_i \xi_i = 0, \quad \alpha_i^0 + \mu_i = C    (4)

where μ_i is the Lagrange multiplier of the constraint ξ_i ≥ 0.
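For intuition, the conditions in (4) can be checked numerically once α_i, ξ_i and the decision values f(x_i) are available; the helper below is a sketch of such a check (the function name and tolerance are our own, not part of the paper).

import numpy as np

def check_kkt(alpha, xi, y, f_values, C, tol=1e-6):
    # mu_i is the multiplier of the constraint xi_i >= 0, with alpha_i + mu_i = C.
    mu = C - alpha
    ok_complementarity = np.all(np.abs(alpha * (y * f_values - 1.0 + xi)) < tol)
    ok_slack = np.all(np.abs(mu * xi) < tol)
    ok_bounds = np.all(alpha > -tol) and np.all(alpha < C + tol) and np.all(xi > -tol)
    return ok_complementarity and ok_slack and ok_bounds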
The leave-one-out estimate of the error rate is

\mathrm{Err}_{loo} = \frac{1}{l} \sum_{i=1}^{l} L\left( h_L^{i}(x_i), y_i \right)    (5)

where h_L^{i} is the classifier trained on the dataset with the sample (x_i, y_i) removed and L(·,·) is the 0–1 loss.
Although the leave-one-out method has high accuracy and can be applied to any learning algorithm, its efficiency is low, because it requires l rounds of training and classification; as the number of samples increases, the computational complexity and time consumption increase sharply. So the method is seldom used to estimate the generalization performance in practice; instead, it is used to assess the performance of other estimating methods.
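To make the cost of l trainings concrete, the sketch below runs the leave-one-out procedure with an RBF SVM (scikit-learn is assumed; gamma=1.0 matches the kernel exp(−‖x_i − x_j‖²) used in this paper, and X, y are placeholder arrays).

import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import LeaveOneOut

def leave_one_out_error_count(X, y, C=1.0):
    # One training run per held-out sample: l runs in total.
    errors = 0
    for train_idx, test_idx in LeaveOneOut().split(X):
        clf = SVC(kernel="rbf", gamma=1.0, C=C)
        clf.fit(X[train_idx], y[train_idx])
        errors += int(clf.predict(X[test_idx])[0] != y[test_idx][0])
    return errors

# Err_loo = leave_one_out_error_count(X, y) / len(y)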
The estimator studied here is

\mathrm{Err}_{\xi\alpha} = \frac{d}{l}, \quad d = \left| \{ i : (2\alpha_i + \xi_i) \ge 1 \} \right|    (6)

where α_i and ξ_i are the solutions of (2) and (3). (6) is called the estimator of the RBF-SVM generalization performance.
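In contrast to the leave-one-out method, (6) needs only one training run. The sketch below evaluates it with scikit-learn; reading α_i from dual_coef_ (which stores y_i α_i for the support vectors) and computing ξ_i = max(0, 1 − y_i f(x_i)) with labels in {−1, +1} are our assumptions about that library, not part of the paper.

import numpy as np
from sklearn.svm import SVC

def xi_alpha_estimate(X, y, C=1.0):
    # d / l with d = |{i : 2*alpha_i + xi_i >= 1}|, as in (6).
    clf = SVC(kernel="rbf", gamma=1.0, C=C).fit(X, y)
    alpha = np.zeros(len(y))
    alpha[clf.support_] = np.abs(clf.dual_coef_[0])            # alpha_i of the support vectors
    xi = np.maximum(0.0, 1.0 - y * clf.decision_function(X))   # slack variables of (3)
    d = int(np.sum(2.0 * alpha + xi >= 1.0))
    return d / len(y)

Comparing its output with the leave-one-out count from the previous sketch illustrates inequality (7) below on a concrete dataset.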
For this estimator, we have the following theorem:
Theorem 1: For a stable soft-margin RBF-SVM, the number of samples classified incorrectly by the leave-one-out method satisfies the following inequality:

\sum_{i=1}^{l} L\left( h_L^{i}(x_i), y_i \right) \le \left| \{ i : (2\alpha_i + \xi_i) \ge 1 \} \right|    (7)
Proof: When a leave-one-out error is produced on the sample (x_t, y_t), we have:

y_t \left[ \sum_{i=1, i \ne t}^{l} \alpha_i^t y_i \exp\left( -\| x_i - x_t \|^2 \right) + b_0^t \right] \le 0    (8)
Here α_i^t and b_0^t are the solution of the following quadratic program (QP), which is the dual (2) with the sample (x_t, y_t) removed:
W^t(\alpha^t) = \max_{\alpha^t} \left[ \sum_{i=1, i \ne t}^{l} \alpha_i^t - \frac{1}{2} \sum_{i \ne t} \sum_{j \ne t} \alpha_i^t \alpha_j^t y_i y_j \exp\left( -\| x_i - x_j \|^2 \right) \right], \quad 0 \le \alpha_i^t \le C, \quad \sum_{i=1, i \ne t}^{l} y_i \alpha_i^t = 0    (9)

Three cases of α_t^0 are distinguished.
1. α_t^0 = 0. The sample (x_t, y_t) is not a support vector; removing it does not change the solution of (2), so no leave-one-out error is produced on it and there is nothing to prove for this case.
2. 0 < α_t^0 < C. The sample (x_t, y_t) is an unbound support vector. With the solution of (9), a feasible point α̃ of (2) can be constructed:

\tilde{\alpha}_i = \begin{cases} \alpha_i^t, & \alpha_i^t = 0 \ \text{or} \ \alpha_i^t = C \\ \alpha_i^t - y_i y_t v_i, & i \in SV^t \\ \alpha_t^0, & i = t \end{cases}    (10)

SV^t is the index set of all unbound support vectors attained from (9); the v_i ≥ 0 are selected to satisfy \sum_{i \in SV^t} v_i = \alpha_t^0 and 0 ≤ α̃_i ≤ C.
So 0 ≤ α̃_i ≤ C and \sum_{i=1}^{l} y_i α̃_i = 0; α̃ is a feasible point of (2). When (10) is substituted into the objective of (2), we obtain:

W(\tilde{\alpha}) = W^t(\alpha^t) + \alpha_t^0 - \alpha_t^0 y_t \left[ \sum_{i=1, i \ne t}^{l} \alpha_i^t y_i \exp\left( -\| x_i - x_t \|^2 \right) + b_0^t \right] - \frac{1}{2} (\alpha_t^0)^2 + \alpha_t^0 \sum_{i \in SV^t} v_i \exp\left( -\| x_i - x_t \|^2 \right) - \frac{1}{2} \sum_{i \in SV^t} \sum_{j \in SV^t} v_i v_j \exp\left( -\| x_i - x_j \|^2 \right)    (11)
Analogously, with the solution of (2), a feasible point β of (9) can be constructed:

\beta_i = \begin{cases} \alpha_i^0, & \alpha_i^0 = 0 \ \text{or} \ \alpha_i^0 = C \\ \alpha_i^0 + y_i y_t \gamma_i, & i \in SV^{\setminus t} \end{cases}    (12)

SV^{\setminus t} is the index set of all unbound support vectors, except the sample (x_t, y_t), attained from (2); the γ_i are selected to satisfy γ_i ≥ 0 and \sum_{i \in SV^{\setminus t}} \gamma_i = \alpha_t^0. So 0 ≤ β_i ≤ C and \sum_{i=1, i \ne t}^{l} y_i \beta_i = 0; β is a feasible point of (9).
Substituting (12) into the objective of (9) and using the KKT condition (4) at the unbound support vector (x_t, y_t), we obtain:

W^t(\beta) = W(\alpha^0) - \frac{1}{2} (\alpha_t^0)^2 + \alpha_t^0 \sum_{i \in SV^{\setminus t}} \gamma_i \exp\left( -\| x_i - x_t \|^2 \right) - \frac{1}{2} \sum_{i \in SV^{\setminus t}} \sum_{j \in SV^{\setminus t}} \gamma_i \gamma_j \exp\left( -\| x_i - x_j \|^2 \right)    (13)
Because W(α^0) and W^t(α^t) are the maximal values of (2) and (9) respectively, W(α^0) ≥ W(α̃) and W^t(α^t) ≥ W^t(β) hold true. Summing (11) and (13), and considering the fact that for the RBF-SVM the kernel between two samples satisfies 0 < exp(−‖x_i − x_j‖²) ≤ 1, we have:
\begin{aligned}
0 &\ge W(\tilde{\alpha}) + W^t(\beta) - W(\alpha^0) - W^t(\alpha^t) \\
  &= \alpha_t^0 - \alpha_t^0 y_t \left[ \sum_{i=1, i \ne t}^{l} \alpha_i^t y_i \exp\left( -\| x_i - x_t \|^2 \right) + b_0^t \right] - (\alpha_t^0)^2 \exp\left( -\| x_t - x_t \|^2 \right) \\
  &\quad + \alpha_t^0 \sum_{i \in SV^t} v_i \exp\left( -\| x_i - x_t \|^2 \right) - \frac{1}{2} \sum_{i \in SV^t} \sum_{j \in SV^t} v_i v_j \exp\left( -\| x_i - x_j \|^2 \right) \\
  &\quad + \alpha_t^0 \sum_{i \in SV^{\setminus t}} \gamma_i \exp\left( -\| x_i - x_t \|^2 \right) - \frac{1}{2} \sum_{i \in SV^{\setminus t}} \sum_{j \in SV^{\setminus t}} \gamma_i \gamma_j \exp\left( -\| x_i - x_j \|^2 \right) \\
  &\ge \alpha_t^0 - \alpha_t^0 y_t \left[ \sum_{i=1, i \ne t}^{l} \alpha_i^t y_i \exp\left( -\| x_i - x_t \|^2 \right) + b_0^t \right] - (\alpha_t^0)^2 - \frac{1}{2} (\alpha_t^0)^2 - \frac{1}{2} (\alpha_t^0)^2 \\
  &= \alpha_t^0 - \alpha_t^0 y_t \left[ \sum_{i=1, i \ne t}^{l} \alpha_i^t y_i \exp\left( -\| x_i - x_t \|^2 \right) + b_0^t \right] - 2 (\alpha_t^0)^2
\end{aligned}    (14)
At this time, if there is a leave-one-out error, (8) holds true, so the bracketed term in (14) is non-positive and 0 ≥ α_t^0 − 2(α_t^0)² = α_t^0(1 − 2α_t^0). Considering that 0 < α_t^0 < C and ξ_t = 0, we have 2α_t^0 + ξ_t = 2α_t^0 ≥ 1.
3. α_t^0 = C. Now the sample (x_t, y_t) is a bound support vector. With the analogous method used in the case of the unbound support vector, and considering the KKT condition (4), we can also get 2α_t^0 + ξ_t ≥ 1.
Theorem 1 is proved.
In recent years, many methods for estimating the generalization performance have been proposed; some representative methods are discussed as follows:
a. Jaakkola-Haussler method [6]
When the classifying hyper-plane passes through the origin, this method uses the solution of (2) to bound the leave-one-out error by (1/l) |{i : α_i^0 K(x_i, x_i) ≥ y_i f(x_i)}|, where f is the decision function trained on the full dataset.
b. The method of [7]
The estimate given by this method is Err^l(h_L) = (1/l) |{i : α_i^0 / (K_SV^{-1})_{ii} ≥ 1}|, where K_SV^{-1} is the inverse of the kernel matrix formed from all support vectors. It requires the samples to be linearly separable and the hyper-plane to pass through the origin, and the computation is complex.
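Reading the estimate of method b as counting the support vectors with α_i^0 / (K_SV^{-1})_{ii} ≥ 1, it can be computed as in the sketch below (NumPy; alpha_sv and K_sv are assumed to be the support-vector coefficients and their kernel matrix from a previously trained machine). The explicit matrix inversion is exactly the costly step mentioned below.

import numpy as np

def method_b_estimate(alpha_sv, K_sv, l):
    # Requires inverting the kernel matrix of the support vectors.
    k_inv_diag = np.diag(np.linalg.inv(K_sv))
    return int(np.sum(alpha_sv / k_inv_diag >= 1.0)) / l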
c. Vapnik-Chapelle method [8]
In this method, a new concept named the span, defined as the distance between an unbound support vector and the point set composed of constrained linear combinations of the other unbound support vectors, is introduced. The method can estimate the generalization performance well, but from the results offered by the authors, its computation is of the same order of magnitude as the leave-one-out method.
Neither method a nor method b can be used for a general SVM. Because method b needs to compute an inverse matrix, it spends much time estimating the generalization performance. In method c, in order to compute the span, a new quadratic program is introduced, so it also spends much time.
4. Experiments
To verify the generality of the estimator, three datasets from the UCI repository, the benchmark collection of machine learning used for comparing learning algorithms, are selected [9]. A brief description follows.
a. Wisconsin Breast Cancer (abbr. WBC). There are 699 samples in total; every sample has 9 features taking discrete values from 0 to 10. The samples belong to two classes: 458 samples in the first class and 241 samples in the second class.
b. Iris. There are 150 samples in total; every sample has 4 continuous features. The samples belong to three classes, with 50 samples in each class. The first class is linearly separable from the second and third classes; the second class is not linearly separable from the third. The corresponding two-class datasets are abbreviated as Iris1vs23 and Iris2vs3 respectively.
c. Pima Indian Diabetes (abbr. PID). There are 768 samples in total, 500 samples in the first class and 268 samples in the second class; every sample has 8 continuous features.
In the experiments, about 20 percent of the samples are randomly selected to construct the training dataset, and the rest construct the testing dataset. The proportion of each class in both datasets remains almost the same as in the original dataset (a sketch of this split is given after Table 1). Detailed comparisons of the datasets are made in Table 1.
Table 1. Comparison of the datasets

                        WBC       Iris1vs23   Iris2vs3   PID
Sparseness              No        No          No         No
Discreteness            Yes       Yes         Yes        No
Linear separability     No        Yes         No         No
Ratio 1                 137/72    7/14        11/6       54/107
Ratio 2                 307/106   43/86       39/44      214/399

1. The ratio of the first (positive) class to the second (negative) class in the training dataset.
2. The ratio of the first (positive) class to the second (negative) class in the testing dataset.
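The split described in the text (about 20 percent of each dataset for training, with class proportions preserved) corresponds to a stratified random split; a minimal sketch with scikit-learn, assuming the feature matrix X and labels y have already been loaded from the UCI files, is:

from sklearn.model_selection import train_test_split

# About 20% of the samples for training, the rest for testing,
# with class proportions preserved by stratification.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.2, stratify=y, random_state=0)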