
An Algorithm of Estimating the Generalization Performance of RBF-SVM

Dong Chun-xi, Yang Shao-quan, Rao Xian, Tang Jian-long


School of Electronic Engineering, Xidian Univ., Xian, China
{chxdong, shqyang, jltang, xrao}@mail.xidian.edu.cn
Abstract
Using the sparseness of the Support Vector Machine (SVM) solution, the properties of the Radial
Basis Function (RBF) kernel and the intermediate parameters obtained while training the SVM, a novel
algorithm to estimate the generalization performance of the RBF-SVM is presented. Without
additional complex computation, it overcomes many disadvantages of existing algorithms, such as
long computation time and a narrow application range. It is shown, theoretically and experimentally,
to be a general method for estimating the generalization performance of an RBF-SVM, and it can be
applied to a wide range of pattern recognition problems that use SVMs.

The Support Vector Machine (SVM) is a new type of learning machine proposed by V. Vapnik et al.
[1], and it has been widely used in many fields such as pattern recognition and function
approximation [2]. Compared with traditional methods, its generalization performance and its
ability to deal with nonlinear problems are better. However, for a given learning
problem, can the generalization performance of an SVM be quantified? Can it be obtained
from the training dataset alone? These questions should be answered by generalization
performance estimation.
In short, the generalization performance is the maximal error rate when an SVM is used on a
testing dataset, and generalization performance estimation aims to obtain a bound on this error rate.
Many bounds and methods have been published [3], but some require the samples to be linearly
separable, while others require that the classifying hyper-plane pass through the origin. This
means they cannot be used for a general SVM.
The method on which this paper builds was proposed by Thorsten Joachims and used in text
classification; it relies on the intermediate parameters obtained when training the SVM [4]. A
novel algorithm for estimating the generalization performance of the RBF-SVM, using the special
features of the RBF kernel, is discussed here. With this method, we can easily estimate the
generalization performance.

1. Support Vector Machine


In two-class pattern recognition, given the training dataset $\{(x_i, y_i);\ x_i \in \mathbb{R}^N;\ y_i = \pm 1;\ i \in [1, l]\}$,
the SVM is the classification rule defined by the following equation:

$$f(x) = \operatorname{sgn}\Big(\sum_{i=1}^{l}\alpha_i y_i K(x_i, x) + b\Big) \qquad (1)$$

where $x_i$ is a training sample, $x$ is the sample to be classified, and $\alpha_i$ is the solution of the following
optimization problem:

$$\max_{\alpha}\ \sum_{i=1}^{l}\alpha_i - \frac{1}{2}\sum_{i,j=1}^{l}\alpha_i\alpha_j y_i y_j K(x_i, x_j);\qquad 0 \le \alpha_i \le C,\quad \sum_{i=1}^{l} y_i\alpha_i = 0 \qquad (2)$$

(This work was supported by the National ECM Lab Foundation of China under project number 5143505.)

This is the dual form of the following quadratic program:

$$\min_{w, b, \xi}\ \frac{1}{2}(w \cdot w) + C\sum_{i=1}^{l}\xi_i;\qquad C \ge 0;\quad y_i[(w \cdot x_i) + b] \ge 1 - \xi_i;\quad \xi_i \ge 0 \qquad (3)$$

In (2), $K(\cdot,\cdot)$ is the kernel function satisfying the Mercer condition; it maps the input space
into a high (or infinite) dimensional space (the feature space). Commonly used kernel functions
include the linear (no mapping), polynomial, Radial Basis Function (RBF) and sigmoid kernels. The RBF kernel
discussed here is $K(x_i, x_j) = \exp(-\gamma\|x_i - x_j\|^2)$. $C$ is a constant controlling the trade-off
between the training error and the complexity of the SVM. We can observe that only some of the $\alpha_i$
satisfy $\alpha_i > 0$; the corresponding samples are called support vectors. These samples construct the
classifying hyper-plane and influence the classification performance. The classifying hyper-plane
is also called the soft margin classifying hyper-plane, and the corresponding machine a soft margin SVM.
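
As a concrete illustration (not part of the original paper), the following Python sketch trains a soft margin SVM with the RBF kernel discussed above. It assumes scikit-learn is available; the subset of the Iris data and the values of $\gamma$ and $C$ are arbitrary choices for the example. It shows that only some of the $\alpha_i$ are non-zero, i.e., only the support vectors enter the classification rule (1).

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import SVC

# Two-class problem: Iris classes 1 vs 2 (the pair that is not linearly separable).
X, y = load_iris(return_X_y=True)
X, y = X[y > 0], np.where(y[y > 0] == 1, 1, -1)

# Soft-margin SVM with the RBF kernel K(xi, xj) = exp(-gamma * ||xi - xj||^2).
gamma, C = 0.5, 10.0                       # illustrative values, not from the paper
clf = SVC(kernel="rbf", gamma=gamma, C=C).fit(X, y)

# Only some alpha_i are non-zero; those samples are the support vectors and
# they alone determine the classifying hyper-plane.
alpha = np.abs(clf.dual_coef_).ravel()     # dual_coef_ stores y_i * alpha_i
print(f"{len(clf.support_)} support vectors out of {len(X)} training samples")
print("range of the non-zero alpha_i:", alpha.min(), alpha.max())
```
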
In optimization theory, the necessary and sufficient condition for a local optimal solution to be
global is that the solution satisfies the KKT conditions, namely:

$$\alpha_i^0\,[y_i(w_0 \cdot x_i + b_0) - 1 + \xi_i] = 0,\qquad \mu_i\,\xi_i = 0,\qquad \alpha_i + \mu_i = C \qquad (4)$$

where $\mu_i$ is the Lagrange multiplier associated with the constraint $\xi_i \ge 0$.
From (4) we can see that if $\xi_i > 0$, then $\mu_i = 0$ and $\alpha_i^0 = C$; if $0 < \mu_i < C$,
then $\xi_i = 0$ and $0 < \alpha_i^0 < C$; and if $\mu_i = C$, then $\alpha_i^0 = 0$ and $\xi_i = 0$. If $\alpha_i^0 = C$, the corresponding
samples are called bound support vectors; in contrast, if $0 < \alpha_i^0 < C$, the samples are called
unbound support vectors. If an SVM has more than one unbound support vector, the
SVM is said to be stable.
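
In the same illustrative setting as the previous sketch (scikit-learn, arbitrary $\gamma$ and $C$), the bound and unbound support vectors can be separated by comparing $\alpha_i$ with $C$, and the stability condition just mentioned can be checked directly. This is again only a sketch, not the paper's own code:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X, y = X[y > 0], np.where(y[y > 0] == 1, 1, -1)   # Iris classes 1 vs 2
C = 10.0
clf = SVC(kernel="rbf", gamma=0.5, C=C).fit(X, y)

alpha = np.abs(clf.dual_coef_).ravel()            # alpha_i of the support vectors
tol = 1e-8                                        # numerical tolerance for alpha_i == C
bound   = clf.support_[alpha >= C - tol]          # alpha_i = C     -> bound SVs
unbound = clf.support_[alpha <  C - tol]          # 0 < alpha_i < C -> unbound SVs

print("bound support vectors:  ", len(bound))
print("unbound support vectors:", len(unbound))

# The derivation below assumes a *stable* soft-margin SVM,
# i.e. more than one unbound support vector.
assert len(unbound) > 1, "SVM is not stable in the sense used here"
```
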
The following derivation is carried out in the feature space, but the result can also be applied in the input
space.

2. Using the leave-one-out method to estimate generalization performance


When estimating the generalization performance with the leave-one-out method, one first
removes one sample from the training dataset, trains a classification rule with the remaining samples, and tests
the removed sample with this rule. If the sample is classified incorrectly, a leave-one-out error is
produced. It has been proved that the error rate attained by the leave-one-out method is an unbiased
estimate of the generalization performance [5]. Let us denote by $h_i^L$ the classification rule attained
without the $i$th sample, by $h_i^L(x_i)$ the result of classifying sample $x_i$ with $h_i^L$, and by
$L(h_i^L(x_i), y_i)$ the classification loss, which is 1 or 0 according to
whether the rule classifies the sample incorrectly or correctly. The estimate of the generalization
performance with the leave-one-out method can then be written as:

$$Err_l^L(h^L) = \frac{1}{l}\sum_{i=1}^{l} L(h_i^L(x_i), y_i) \qquad (5)$$

Although the leave-one-out method has high accuracy and can be used with any learning algorithm,
its efficiency is low, because the method needs $l$ rounds of training and classification; as the number
of samples increases, the computational complexity and time consumption increase sharply. So
the method is seldom used to estimate the generalization performance in practice; instead, it is used to
assess the performance of other estimation methods.
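
For reference, the leave-one-out estimate (5) could be computed directly as follows. This is only an illustrative sketch assuming scikit-learn, and it performs exactly the $l$ retrainings whose cost the rest of the paper tries to avoid:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X, y = X[y > 0], np.where(y[y > 0] == 1, 1, -1)   # Iris classes 1 vs 2

# l rounds: drop one sample, train on the rest, test the held-out sample.
scores = cross_val_score(SVC(kernel="rbf", gamma=0.5, C=10.0),
                         X, y, cv=LeaveOneOut())
loo_error = 1.0 - scores.mean()                   # equation (5)
print("leave-one-out error estimate:", loo_error)
```
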

3. Method of estimating the generalization performance of the RBF-SVM


According to the special properties of the RBF-SVM, namely that the classifying hyper-plane is determined by the
support vectors, that the number of support vectors is far smaller than the size of the
training dataset, and that all samples in the feature space lie on the unit sphere, the estimate of the
generalization performance can be defined as follows.
Definition 1: For a stable soft margin RBF-SVM, the estimate of the generalization
performance is:

$$Err_l(h^L) = \frac{d}{l},\qquad d = \big|\{i : (2\alpha_i + \xi_i) \ge 1\}\big| \qquad (6)$$

where $\alpha_i$ and $\xi_i$ are the solutions of (2) and (3). (6) is called the estimator of the RBF-SVM
generalization performance.
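
As an illustrative sketch of Definition 1 (assuming scikit-learn; the $\alpha_i$ are read from the fitted dual coefficients and the slack $\xi_i$ is recovered as $\max(0, 1 - y_i f(x_i))$, which is its value at the optimum of (3)), the estimator (6) can be evaluated without any retraining:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X, y = X[y > 0], np.where(y[y > 0] == 1, 1, -1)   # Iris classes 1 vs 2
l = len(y)

clf = SVC(kernel="rbf", gamma=0.5, C=10.0).fit(X, y)

# alpha_i for every training sample (zero for non-support vectors).
alpha = np.zeros(l)
alpha[clf.support_] = np.abs(clf.dual_coef_).ravel()

# Slack variables of (3) at the optimum: xi_i = max(0, 1 - y_i * f(x_i)).
xi = np.maximum(0.0, 1.0 - y * clf.decision_function(X))

# Equation (6): count the samples with 2*alpha_i + xi_i >= 1.
d = np.sum(2.0 * alpha + xi >= 1.0)
print("RBF-SVM estimator of the generalization error:", d / l)
```
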
For this estimator, we have the following theorem.
Theorem 1: For a stable soft margin RBF-SVM, the number of samples classified incorrectly
by the leave-one-out method satisfies the following inequality:

$$\sum_{i=1}^{l} L(h_i^L(x_i), y_i) \le \big|\{i : (2\alpha_i + \xi_i) \ge 1\}\big| \qquad (7)$$

Proof:
When a leave-one-out error is produced, we have:

$$y_t(w^t \cdot x_t + b_0^t) = y_t\Big[\sum_{i=1, i \ne t}^{l} \alpha_i^t y_i \exp(-\gamma\|x_i - x_t\|^2) + b_0^t\Big] < 0 \qquad (8)$$

Here $\alpha_i^t$ and $b_0^t$ are the solution of the following quadratic program, which is (2) with the sample
$(x_t, y_t)$ removed:

$$W_t(\alpha^t) = \max_{\alpha^t}\ \sum_{i=1, i\ne t}^{l}\alpha_i^t - \frac{1}{2}\sum_{i=1, i\ne t}^{l}\sum_{j=1, j\ne t}^{l}\alpha_i^t\alpha_j^t y_i y_j \exp(-\gamma\|x_i - x_j\|^2) \qquad (9)$$

We now analyze three cases, according to the value of $\alpha_t^0$.


1. $\alpha_t^0 = 0$. The sample $(x_t, y_t)$ is not a support vector, so the hyper-plane is the same as when the
sample is not removed:

$$W_t(\alpha^t) = W(\alpha^0),\qquad y_t\Big[\sum_{i=1, i \ne t}^{l} \alpha_i^t y_i \exp(-\gamma\|x_i - x_t\|^2) + b_0^t\Big] = y_t\Big[\sum_{i=1}^{l} \alpha_i^0 y_i \exp(-\gamma\|x_i - x_t\|^2) + b_0\Big] \ge 1,$$

so there is no leave-one-out error.

2. $0 < \alpha_t^0 < C$. The sample $(x_t, y_t)$ is an unbound support vector. From the solution of
(9), a feasible point of (2) can be constructed:

$$\alpha_i = \begin{cases} \alpha_i^t, & \alpha_i^t = 0 \ \text{or}\ \alpha_i^t = C \\ \alpha_i^t - y_i y_t v_i, & i \in SV^t \\ \alpha_t^0, & i = t \end{cases} \qquad (10)$$

$SV^t$ is the index set of all unbound support vectors attained from (9); the $v_i$ are selected to
satisfy $\sum_{i \in SV^t} v_i = \alpha_t^0$. It can be proved that such $v_i$ exist for a stable soft margin SVM.

So $0 \le \alpha_i \le C$ and $\sum_{i=1}^{l} y_i\alpha_i = 0$; that is, $\alpha$ is a feasible point of (2). When (10) is substituted into (2), after
some transformation we have:

$$\begin{aligned}
\alpha_t^0\, y_t\Big[\sum_{i=1, i\ne t}^{l}\alpha_i^t y_i \exp(-\gamma\|x_i-x_t\|^2)+b_0^t\Big]
&= -W(\alpha) + W_t(\alpha^t) + \alpha_t^0 - \frac{1}{2}(\alpha_t^0)^2\exp(-\gamma\|x_t-x_t\|^2) \\
&\quad + \alpha_t^0\sum_{i\in SV^t}v_i\exp(-\gamma\|x_i-x_t\|^2)
 - \frac{1}{2}\sum_{i\in SV^t}\sum_{j\in SV^t}v_iv_j\exp(-\gamma\|x_i-x_j\|^2)
\end{aligned}\qquad (11)$$

Analogously, from the solution of (2), a feasible point of (9) can be constructed:

$$\tilde{\alpha}_i = \begin{cases} \alpha_i^0, & \alpha_i^0 = 0 \ \text{or}\ \alpha_i^0 = C \\ \alpha_i^0 + y_i y_t \lambda_i, & i \in SV^{\backslash t} \end{cases} \qquad (12)$$

$SV^{\backslash t}$ is the index set of all unbound support vectors, except the sample $(x_t, y_t)$, attained from (2). The $\lambda_i$ are
selected to satisfy $\lambda_i \ge 0$ and $\sum_{i \in SV^{\backslash t}} \lambda_i = \alpha_t^0$. So $0 \le \tilde{\alpha}_i \le C$ and $\sum_{i=1, i\ne t}^{l} y_i\tilde{\alpha}_i = 0$; that is, $\tilde{\alpha}$ is a feasible point
of (9). Substituting (12) into (9), after some transformation we have:

$$\begin{aligned}
0 &= W_t(\tilde{\alpha}) - W(\alpha^0) + \frac{1}{2}(\alpha_t^0)^2\exp(-\gamma\|x_t-x_t\|^2) \\
&\quad - \alpha_t^0\sum_{i\in SV^{\backslash t}}\lambda_i\exp(-\gamma\|x_i-x_t\|^2)
 + \frac{1}{2}\sum_{i\in SV^{\backslash t}}\sum_{j\in SV^{\backslash t}}\lambda_i\lambda_j\exp(-\gamma\|x_i-x_j\|^2)
\end{aligned}\qquad (13)$$

Because $W(\alpha^0)$ and $W_t(\alpha^t)$ are the optimal solutions of (2) and (9) respectively,
$W(\alpha^0) \ge W(\alpha)$ and $W_t(\alpha^t) \ge W_t(\tilde{\alpha})$ hold true. Summing (11) and (13), and considering
the fact that the RBF kernel satisfies $0 < \exp(-\gamma\|x_i-x_j\|^2) \le 1$ (all samples lie on the unit sphere in the feature space), we have:
$$\begin{aligned}
\alpha_t^0\, y_t\Big[\sum_{i=1, i\ne t}^{l}\alpha_i^t y_i \exp(-\gamma\|x_i-x_t\|^2)+b_0^t\Big]
&\ge \alpha_t^0 - \Big[\tfrac{1}{2}(\alpha_t^0)^2\exp(-\gamma\|x_t-x_t\|^2) - \alpha_t^0\sum_{i\in SV^t}v_i\exp(-\gamma\|x_i-x_t\|^2) + \tfrac{1}{2}\sum_{i\in SV^t}\sum_{j\in SV^t}v_iv_j\exp(-\gamma\|x_i-x_j\|^2)\Big] \\
&\quad - \Big[\tfrac{1}{2}(\alpha_t^0)^2\exp(-\gamma\|x_t-x_t\|^2) - \alpha_t^0\sum_{i\in SV^{\backslash t}}\lambda_i\exp(-\gamma\|x_i-x_t\|^2) + \tfrac{1}{2}\sum_{i\in SV^{\backslash t}}\sum_{j\in SV^{\backslash t}}\lambda_i\lambda_j\exp(-\gamma\|x_i-x_j\|^2)\Big] \\
&\ge \alpha_t^0 - (\alpha_t^0)^2 - (\alpha_t^0)^2 \\
&= \alpha_t^0 - 2(\alpha_t^0)^2
\end{aligned}\qquad (14)$$

At this point, if there is a leave-one-out error, (8) holds true. So
$0 > \alpha_t^0 - 2(\alpha_t^0)^2 = \alpha_t^0(1 - 2\alpha_t^0)$; considering that $0 < \alpha_t^0 < C$ and $\xi_t = 0$, we have
$2\alpha_t^0 = 2\alpha_t^0 + \xi_t \ge 1$.
3. $\alpha_t^0 = C$. Now the sample $(x_t, y_t)$ is a bound support vector. With a method analogous to
the one used in the case of unbound support vectors, and considering the KKT condition (4), we
can also get $2\alpha_t^0 + \xi_t \ge 1$.
Theorem 1 is proved.
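
A small numerical sanity check of Theorem 1 (an illustration only, under the same scikit-learn assumptions as the earlier sketches) is to verify that every sample producing a leave-one-out error indeed satisfies $2\alpha_t + \xi_t \ge 1$:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X, y = X[y > 0], np.where(y[y > 0] == 1, 1, -1)   # Iris classes 1 vs 2
l, gamma, C = len(y), 0.5, 10.0

clf = SVC(kernel="rbf", gamma=gamma, C=C).fit(X, y)
alpha = np.zeros(l)
alpha[clf.support_] = np.abs(clf.dual_coef_).ravel()
xi = np.maximum(0.0, 1.0 - y * clf.decision_function(X))

for t in range(l):                                 # explicit leave-one-out loop
    keep = np.arange(l) != t
    clf_t = SVC(kernel="rbf", gamma=gamma, C=C).fit(X[keep], y[keep])
    if clf_t.predict(X[t:t + 1])[0] != y[t]:       # leave-one-out error on sample t
        # Theorem 1: such a sample must be counted by the estimator (6).
        assert 2.0 * alpha[t] + xi[t] >= 1.0 - 1e-6
print("every leave-one-out error satisfies 2*alpha_t + xi_t >= 1")
```
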


In recent years, many methods for estimating the generalization performance have been published; some
prominent methods are discussed as follows.
a. Jaakkola-Haussler method [6]
When the classifying hyper-plane passes through the origin, this method uses the solution of (2) to estimate the
generalization performance, namely $Err_l(h^L) = \frac{1}{l}\sum_{i=1}^{l}\theta(\alpha_i^0 K(x_i, x_i) - 1)$, where $\theta(\cdot)$ is the step function.
b. Opper-Winther method [7]
The estimate given by this method is $Err_l(h^L) = \frac{1}{l}\sum_{i=1}^{l}\theta\big(\alpha_i^0 / (K_{SV}^{-1})_{ii} - 1\big)$, where $K_{SV}^{-1}$ is the inverse
of the kernel matrix composed of all support vectors. It requires the samples to be linearly separable and the hyper-plane to pass through the origin, and the computation is complex.
c. Vapnik-Chapelle method [8]
In this method, a new concept named the span is introduced; it is defined as the distance between an
unbound support vector and the point set composed of constrained linear combinations of the
other unbound support vectors. The method can estimate the generalization
performance well, but from the results offered by the authors we can see that its computation is of the
same order of magnitude as the leave-one-out method.
Neither method a nor method b can be used for a general SVM. Because method b needs to
compute an inverse matrix, it spends much time estimating the generalization
performance. In method c, in order to compute the span, a new quadratic program is
introduced, so it also spends much time.
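
For comparison, the Jaakkola-Haussler count as reconstructed above can be evaluated in the same way; note that for the RBF kernel $K(x_i, x_i) = 1$, so the count reduces to the samples with $\alpha_i \ge 1$. This is only a sketch under the same scikit-learn assumptions, and it ignores the method's formal requirement that the hyper-plane pass through the origin (an SVC fitted with a bias term does not satisfy it):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X, y = X[y > 0], np.where(y[y > 0] == 1, 1, -1)   # Iris classes 1 vs 2
l, gamma = len(y), 0.5

clf = SVC(kernel="rbf", gamma=gamma, C=10.0).fit(X, y)
alpha = np.zeros(l)
alpha[clf.support_] = np.abs(clf.dual_coef_).ravel()

# Jaakkola-Haussler style count: step(alpha_i * K(x_i, x_i) - 1).
K_diag = np.diag(rbf_kernel(X, gamma=gamma))      # equal to 1 for every sample
jh_estimate = np.mean(alpha * K_diag >= 1.0)
print("Jaakkola-Haussler style estimate:", jh_estimate)
```
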

4. Experiments
To verify the generality of the estimator, three datasets from the UCI repository, the
benchmark collection of machine learning datasets used for comparing learning algorithms, are selected [9].
A brief description follows.
a. Wisconsin Breast Cancer (abbr. WBC). There are 699 samples in total; each sample has
9 discrete features taking values from 0 to 10. The samples belong to two classes,
with 458 samples in the first class and 241 samples in the second class.
b. Iris. There are 150 samples in total; each sample has 4 continuous features.
The samples belong to three classes, with 50 samples in each class.
The first class is linearly separable from the second and third classes; the second class is not linearly
separable from the third. The corresponding two-class datasets are abbreviated as Iris1vs23 and Iris2vs3 respectively.
c. Pima Indian Diabetes (abbr. PID). There are 768 samples in total, 500 in the first
class and 268 in the second class; each sample has 8 continuous features.
In the experiment, about 20 percent of the samples are randomly selected to construct the
training dataset; the others construct the testing dataset. The proportion of each class in
both datasets remains almost the same as in the original dataset. Detailed comparisons are made in Table 1.
Table 1: The comparison of the datasets

                        WBC       Iris1vs23   Iris2vs3   PID
  Sparseness            No        No          No         No
  Discreteness          Yes       Yes         Yes        No
  Linear separability   No        Yes         No         No
  Ratio 1               137/72    7/14        11/6       54/107
  Ratio 2               307/106   43/86       39/44      214/399

1. The ratio of the first (positive) class to the second (negative) class in the training dataset.
2. The ratio of the first (positive) class to the second (negative) class in the testing dataset.
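
The split described above (roughly 20 percent of each dataset for training, keeping the class proportions close to the original) could be reproduced along the following lines. This is a sketch assuming scikit-learn, with its built-in Iris data standing in for the UCI files; the WBC and PID files would have to be downloaded from the repository separately:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Iris2vs3: classes 1 and 2 of the Iris dataset, relabelled as +1 / -1.
X, y = load_iris(return_X_y=True)
X, y = X[y > 0], np.where(y[y > 0] == 1, 1, -1)

# About 20% of the samples for training, stratified so that the class
# proportions in both subsets stay close to those of the original dataset.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.2, stratify=y, random_state=0)

print("training ratio (pos/neg):", np.sum(y_train == 1), "/", np.sum(y_train == -1))
print("testing  ratio (pos/neg):", np.sum(y_test == 1), "/", np.sum(y_test == -1))
```
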


The comparison of the estimation performance of the RBF-SVM estimator, the leave-one-out
method, and real testing is made in Table 2. The criterion for selecting the kernel parameter $\gamma$ and $C$ is to minimize
the real testing error. The time consumption of the RBF-SVM estimator and the leave-one-out
method is compared in Table 3 (in seconds).
Table 2: The comparison of estimation performance

                        WBC        Iris1vs23   Iris2vs3   PID
  RBF-SVM estimator     0.062201   0.000000    0.235294   0.354839
  Leave-one-out         0.028708   0.000000    0.176471   0.283871
  Real testing          0.027484   0.000000    0.072289   0.311582
Table 3: The comparison of time consumption (seconds)

                        WBC         Iris1vs23   Iris2vs3   PID
  RBF-SVM estimator     0.09        0.01        0.01       0.703
  Leave-one-out         27250.323   2277.854    1327.708   5210.672
From these results we can see that the RBF-SVM estimator can estimate the generalization
performance well and can be used widely, whether or not the dataset is sparse, discrete, or linearly
separable. For Iris1vs23 it attains the same result as the leave-one-out method; the main reason is
the linear separability of the dataset. For WBC and PID the results are very close. For Iris2vs3
the results differ somewhat; the main reason is the small size of the dataset: on this
training dataset only 4 samples are counted as errors by the estimator versus 3
samples by the leave-one-out method. At the same time, the time spent by the estimator is far less
than that spent by the leave-one-out method, so its efficiency is high. On the other hand, its
accuracy is lower than that of the leave-one-out method; from Theorem 1, the reason is that the estimator
is derived from the leave-one-out method, but condition (7) is only a necessary condition for a
leave-one-out error, not a necessary and sufficient one.

5. Discussion and future work


Inspired by Thorsten Joachims' method used in text classification, a general
method for estimating the generalization performance of the RBF-SVM in pattern recognition is
presented; its efficiency is proved theoretically and demonstrated experimentally. It is therefore a good
method for estimating the generalization performance of an RBF-SVM. Because estimating the
generalization performance is the basis of parameter and feature selection in machine
learning, we should make a great effort to improve the accuracy of the estimator and use it to
select the parameters and features of the RBF-SVM.
References
[1] Vladimir N. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag New York, Inc., 2000.
[2] C. J. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Kluwer Academic Publishers,
Boston, 1999.
[3] Devroye, L., Gyorfi, L., Lugosi, G. A Probabilistic Theory of Pattern Recognition. Springer-Verlag New York,
Inc., 1996.
[4] Joachims, T. Estimating the Generalization Performance of a SVM Efficiently. LS VIII-Report 25, University
of Dortmund, Germany, 1999.
[5] Lunts, A., Brailovskiy, V. Evaluation of Attributes Obtained in Statistical Decision Rules. Engineering
Cybernetics, 3, pp. 98-109, 1967.
[6] Jaakkola, T., Haussler, D. Probabilistic Kernel Regression Models. Proceedings of the Seventh Workshop on AI
and Statistics, San Francisco, 1999.
[7] M. Opper, O. Winther. Gaussian Processes and SVM: Mean Field and Leave-One-Out. In Advances in Large
Margin Classifiers, pp. 311-326, Cambridge, MA, MIT Press, 2000.
[8] V. Vapnik, O. Chapelle. Bounds on Error Expectation for Support Vector Machines. Neural Computation, 12(9), 2000.
[9] Murphy, P. M., Aha, D. W. UCI Repository of Machine Learning Databases. Irvine, CA: University of California,
Department of Information and Computer Science. http://www.ics.uci.edu/~mlearn/MLRepository.html, 1994.

