
Very Large SVM Training using Core Vector Machines

Ivor W. Tsang James T. Kwok Pak-Ming Cheung


Department of Computer Science
The Hong Kong University of Science and Technology
Clear Water Bay
Hong Kong

Abstract

Standard SVM training has O(m^3) time and O(m^2) space complexities, where m is the training set size. In this paper, we scale up kernel methods by exploiting the approximateness in practical SVM implementations. We formulate many kernel methods as equivalent minimum enclosing ball problems in computational geometry, and then obtain provably approximately optimal solutions efficiently with the use of core-sets. Our proposed Core Vector Machine (CVM) algorithm has a time complexity that is linear in m and a space complexity that is independent of m. Experiments on large toy and real-world data sets demonstrate that the CVM is much faster and can handle much larger data sets than existing scale-up methods. In particular, on our PC with only 512M RAM, the CVM with Gaussian kernel can process the checkerboard data set with 1 million points in less than 13 seconds.

1 Introduction

In recent years, there has been a lot of interest in using kernels in various machine learning problems, with the support vector machine (SVM) being the most prominent example. Many of these kernel methods are formulated as quadratic programming (QP) problems. Denote the number of training patterns by m. The training time complexity of QP is O(m^3) and its space complexity is at least quadratic. Hence, a major stumbling block is in scaling up these QPs to large data sets, such as those commonly encountered in data mining applications.

To reduce the time and space complexities, a popular technique is to obtain low-rank approximations of the kernel matrix, by using the Nystrom method (Williams & Seeger, 2001), greedy approximation (Smola & Scholkopf, 2000) or matrix decompositions (Fine & Scheinberg, 2001). However, on very large data sets, the resulting rank of the kernel matrix may still be too high to be handled efficiently.

Another approach to scaling up kernel methods is chunking or more sophisticated decomposition methods. However, chunking needs to optimize the entire set of non-zero Lagrange multipliers that have been identified, and the resultant kernel matrix may still be too large to fit into memory. Osuna et al. (1997) suggested optimizing only a fixed-size subset of the training data (the working set) each time, while the variables corresponding to the other patterns are frozen. Going to the extreme, the sequential minimal optimization (SMO) algorithm (Platt, 1999) breaks a large QP into a series of smallest possible QPs, each involving only two variables. In the context of classification, Mangasarian and Musicant (2001) proposed the Lagrangian SVM (LSVM), which avoids the QP (or LP) altogether. Instead, the solution is obtained by a fast iterative scheme. However, for nonlinear kernels (which are the focus of this paper), it still requires the inversion of an m × m matrix. Further speed-up is possible by employing the reduced SVM (RSVM) (Lee & Mangasarian, 2001), which uses a rectangular subset of the kernel matrix. However, this may lead to performance degradation (Lin & Lin, 2003).

In practice, state-of-the-art SVM implementations typically have a training time complexity that scales between O(m) and O(m^2.3) (Platt, 1999). This can be further driven down to O(m) with the use of a parallel mixture (Collobert et al., 2002). However, these are only empirical observations and not theoretical guarantees. For reliable scaling behavior on very large data sets, our goal is to develop an algorithm that can be proved (using tools in the analysis of algorithms) to be asymptotically efficient in both time and space.

Moreover, practical SVM implementations, as in many numerical routines, only approximate the optimal solution by an iterative strategy. Typically, the stopping criterion utilizes either the precision of the Lagrange multipliers (e.g., (Joachims, 1999; Platt, 1999)) or the duality gap (e.g., (Smola & Scholkopf, 2004)). However, while approximation algorithms (with provable performance guarantees) have been extensively used in tackling computationally difficult problems such as NP-complete problems (Garey & Johnson, 1979), such approximateness has never been exploited in the design of SVM implementations.
In this paper, we first transform the SVM optimization problem (with a possibly nonlinear kernel) to the minimum enclosing ball (MEB) problem in computational geometry. The MEB problem computes the ball of minimum radius enclosing a given set of points (or, more generally, balls). Traditional algorithms for finding exact MEBs do not scale well with the dimensionality d of the points. Consequently, recent attention has shifted to the development of approximation algorithms. Lately, a breakthrough was obtained by Badoiu and Clarkson (2002), who showed that an (1 + ε)-approximation of the MEB can be efficiently obtained using core-sets. Generally speaking, in an optimization problem, a core-set is a subset of the input points, such that we can get a good approximation (with an approximation ratio[1] specified by a user-defined parameter) to the original input by solving the optimization problem directly on the core-set. Moreover, a surprising property of (Badoiu & Clarkson, 2002) is that the size of its core-set is independent of both d and the size of the point set.

[1] Let C be the cost (or value of the objective function) of the solution returned by an approximate algorithm, and C* be the cost of the optimal solution. Then, the approximate algorithm has an approximation ratio ρ(n) for an input size n if max(C/C*, C*/C) ≤ ρ(n). Intuitively, this measures how bad the approximate solution is compared with the optimal solution. A large (small) approximation ratio means the solution is much worse than (more or less the same as) the optimal solution. Observe that ρ(n) is always ≥ 1. If the ratio does not depend on n, we may just write ρ and call the algorithm a ρ-approximation algorithm.

Inspired by this core-set-based approximate MEB algorithm, we will develop an approximation algorithm for SVM training that has an approximation ratio of (1 + ε)^2. Its time complexity is linear in m while its space complexity is independent of m. The rest of this paper is organized as follows. Section 2 gives a short introduction on the MEB problem and its approximation algorithm. The connection between kernel methods and the MEB problem is given in Section 3. Section 4 then describes our proposed Core Vector Machine (CVM) algorithm. Experimental results are presented in Section 5, and the last section gives some concluding remarks.

2 MEB in Computational Geometry

Given a set of points S = {x_1, ..., x_m}, where each x_i ∈ R^d, the minimum enclosing ball of S (denoted MEB(S)) is the smallest ball that contains all the points in S. The MEB problem has found applications in diverse areas such as computer graphics (e.g., collision detection, visibility culling), machine learning (e.g., similarity search) and facility location problems.

Here, we will focus on approximate MEB algorithms based on core-sets. Let B(c, R) be the ball with center c and radius R. Given ε > 0, a ball B(c, (1 + ε)R) is an (1 + ε)-approximation of MEB(S) if R ≤ r_MEB(S) and S ⊂ B(c, (1 + ε)R). A subset X ⊆ S is a core-set of S if an expansion by a factor (1 + ε) of its MEB contains S, i.e., S ⊂ B(c, (1 + ε)r), where B(c, r) = MEB(X) (Figure 1).

[Figure 1: The inner circle is the MEB of the set of squares, and its (1 + ε) expansion (the outer circle) covers all the points. The set of squares is thus a core-set.]

To obtain such an (1 + ε)-approximation, Badoiu and Clarkson (2002) proposed a simple iterative scheme: at the t-th iteration, the current estimate B(c_t, r_t) is expanded incrementally by including the furthest point outside the (1 + ε)-ball B(c_t, (1 + ε)r_t). This is repeated until all the points in S are covered by B(c_t, (1 + ε)r_t). Despite its simplicity, Badoiu and Clarkson (2002) showed that the number of iterations, and hence the size of the final core-set, depends only on ε but not on d or m.

This independence of d is important when applying this algorithm to kernel methods (Section 3), as the kernel-induced feature space can be infinite-dimensional. As for the independence of m, it allows both the time and space complexities of our algorithm to grow slowly, as will be shown in Section 4.3.
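To make the scheme above concrete, the following is a minimal sketch (ours, not from the paper) of the core-set iteration for points in R^d. The names meb_of_subset and approx_meb are our own, and the small MEB sub-problems are solved here with a generic SciPy routine applied to the standard MEB dual (the same dual reappears as problem (2) in Section 3), rather than a specialized solver.

    import numpy as np
    from scipy.optimize import minimize

    def meb_of_subset(P):
        """MEB of a small point set P (n x d) via the standard dual:
        max  a' diag(P P') - a' (P P') a   s.t.  a >= 0, sum(a) = 1."""
        G = P @ P.T                                   # Gram matrix of the subset
        diag = np.diag(G)
        n = len(P)
        obj = lambda a: -(a @ diag - a @ G @ a)       # minimize the negated dual objective
        res = minimize(obj, np.full(n, 1.0 / n), method="SLSQP",
                       bounds=[(0.0, 1.0)] * n,
                       constraints=[{"type": "eq", "fun": lambda a: a.sum() - 1.0}])
        a = res.x
        c = a @ P                                     # center = sum_i a_i x_i
        r = np.sqrt(max(a @ diag - a @ G @ a, 0.0))   # radius from the dual objective value
        return c, r

    def approx_meb(S, eps):
        """(1+eps)-approximate MEB of S (m x d) using the core-set scheme of Section 2."""
        core = [0]                                    # start from an arbitrary point
        while True:
            c, r = meb_of_subset(S[core])
            dist = np.linalg.norm(S - c, axis=1)
            far = int(np.argmax(dist))
            # stop once every point lies inside the (1+eps)-ball (the guard on 'far'
            # protects against numerical slack in the inner solver)
            if dist[far] <= (1.0 + eps) * r or far in core:
                return c, r, core
            core.append(far)                          # add the furthest point to the core-set

    if __name__ == "__main__":
        S = np.random.default_rng(0).normal(size=(2000, 5))
        c, r, core = approx_meb(S, eps=1e-2)
        print(len(core), r)

As the analysis above suggests, the number of points collected in core depends on eps but is essentially insensitive to the number and dimensionality of the input points.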
3 MEB Problems and Kernel Methods

Obviously, the MEB is equivalent to the hard-margin support vector data description (SVDD) (Tax & Duin, 1999), which will be briefly reviewed in Section 3.1. The MEB problem can also be used for finding the radius component of the radius-margin bound (Chapelle et al., 2002). Thus, as pointed out by Kumar et al. (2003), the MEB problem is useful in support vector clustering and SVM parameter tuning. However, we will show in Section 3.2 that other kernel-related problems, including the training of soft-margin one-class and two-class L2-SVMs, can also be viewed as MEB problems.

3.1 Hard-Margin SVDD

Given a kernel k with the associated feature map φ, let the MEB in the kernel-induced feature space be B(c, R). The primal problem in the hard-margin SVDD is

    min_{c,R} R^2 : ‖c − φ(x_i)‖^2 ≤ R^2, i = 1, ..., m.    (1)

The corresponding dual is

    max_α α' diag(K) − α' K α : α ≥ 0, α' 1 = 1,    (2)

where α = [α_1, ..., α_m]' are the Lagrange multipliers, 0 = [0, ..., 0]', 1 = [1, ..., 1]' and K = [k(x_i, x_j)] = [φ(x_i)' φ(x_j)] is the m × m kernel matrix. As is well-known, this is a QP problem. The primal variables can be recovered from the optimal α as

    c = Σ_{i=1}^m α_i φ(x_i),    R = √(α' diag(K) − α' K α).    (3)

3.2 Viewing Kernel Methods as MEB Problems

In this paper, we consider the situation where

    k(x, x) = κ,    (4)

a constant[2]. This will be the case when either (1) the isotropic kernel k(x, y) = K(‖x − y‖) (e.g., the Gaussian kernel) is used; or (2) the dot product kernel k(x, y) = K(x' y) (e.g., the polynomial kernel) is used with normalized inputs; or (3) any normalized kernel k(x, y) = K(x, y) / √(K(x, x) K(y, y)) is used. Using the condition α' 1 = 1 in (2), we have α' diag(K) = κ. Dropping this constant term from the dual objective in (2), we obtain a simpler optimization problem:

    max_α −α' K α : α ≥ 0, α' 1 = 1.    (5)

Conversely, when the kernel k satisfies (4), QPs of the form (5) can always be regarded as a MEB problem (1). Note that (2) and (5) yield the same set of α's. Moreover, let d_1* and d_2* denote the optimal dual objectives in (2) and (5) respectively; then, obviously,

    d_1* = d_2* + κ.    (6)

[2] In this case, it can be shown that the hard (soft) margin SVDD yields an identical solution to the hard (soft) margin one-class SVM (Scholkopf et al., 2001). Moreover, the weight w in the one-class SVM solution is equal to the center c in the SVDD solution.

In the following, we will show that when (4) is satisfied, the duals in a number of kernel methods can be rewritten in the form of (5). While the 1-norm error has been commonly used for the SVM, our main focus will be on the 2-norm error. In theory, this could be less robust in the presence of outliers. However, experimentally, its generalization performance is often comparable to that of the L1-SVM (e.g., (Lee & Mangasarian, 2001; Mangasarian & Musicant, 2001)). Besides, the 2-norm error is more advantageous here because a soft-margin L2-SVM can be transformed to a hard-margin one. While the 2-norm error has been used in classification (Section 3.2.2), we will also extend its use to novelty detection (Section 3.2.1).

3.2.1 One-Class L2-SVM

Given a set of unlabeled patterns {z_i}_{i=1}^m, where z_i only has the input part x_i, the one-class L2-SVM separates the outliers from the normal data by solving the primal problem:

    min_{w,ρ,ξ_i} ‖w‖^2 − 2ρ + C Σ_{i=1}^m ξ_i^2 : w' φ(x_i) ≥ ρ − ξ_i,

where w' φ(x) = ρ is the desired hyperplane and C is a user-defined parameter. Note that the constraints ξ_i ≥ 0 are not needed for the L2-SVM. The corresponding dual is

    max_α −α' (K + (1/C) I) α : α ≥ 0, α' 1 = 1
        = max_α −α' K̃ α : α ≥ 0, α' 1 = 1,    (7)

where I is the m × m identity matrix and K̃ = [k̃(z_i, z_j)] = [k(x_i, x_j) + δ_ij/C]. It is thus of the form in (5). Since k(x, x) = κ, k̃(z, z) = κ + 1/C is also a constant. This one-class SVM thus corresponds to the MEB problem (1), in which φ is replaced by the nonlinear map φ̃ satisfying φ̃(z_i)' φ̃(z_j) = k̃(z_i, z_j). From the Karush-Kuhn-Tucker (KKT) conditions, we can recover w = Σ_{i=1}^m α_i φ(x_i) and ξ_i = α_i/C, and ρ = w' φ(x_i) + α_i/C from any support vector x_i.

3.2.2 Two-Class L2-SVM

Given a training set {z_i = (x_i, y_i)}_{i=1}^m with y_i ∈ {−1, 1}, the primal of the two-class L2-SVM is

    min_{w,b,ρ,ξ_i} ‖w‖^2 + b^2 − 2ρ + C Σ_{i=1}^m ξ_i^2
    s.t. y_i (w' φ(x_i) + b) ≥ ρ − ξ_i.    (8)

The corresponding dual is

    max_α −α' (K ⊙ yy' + yy' + (1/C) I) α : α ≥ 0, α' 1 = 1
        = max_α −α' K̃ α : α ≥ 0, α' 1 = 1,    (9)

where ⊙ denotes the Hadamard product, y = [y_1, ..., y_m]' and K̃ = [k̃(z_i, z_j)] with

    k̃(z_i, z_j) = y_i y_j k(x_i, x_j) + y_i y_j + δ_ij/C,    (10)

involving both input and label information. Again, this is of the form in (5), with k̃(z, z) = κ + 1 + 1/C, a constant. Again, we can recover

    w = Σ_{i=1}^m α_i y_i φ(x_i),    b = Σ_{i=1}^m α_i y_i,    ξ_i = α_i/C,    (11)

from the optimal α, and ρ = y_i (w' φ(x_i) + b) + α_i/C from any support vector z_i. Note that all the support vectors of this L2-SVM, including those defining the margin and those that are misclassified, now reside on the surface of the ball in the feature space induced by k̃. A similar relationship connecting one-class classification and binary classification is also described in (Scholkopf et al., 2001).
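As an illustration of (10), the following is a small sketch (our own, not the authors' code) of how the transformed kernel matrix K̃ of (9) could be formed for a Gaussian base kernel, for which κ = k(x, x) = 1. The helper names gaussian_kernel and ktilde_matrix are hypothetical; the point is that the diagonal k̃(z, z) = κ + 1 + 1/C comes out constant, which is exactly condition (4) needed for the MEB reduction.

    import numpy as np

    def gaussian_kernel(X, Y, beta):
        """Base kernel k(x, y) = exp(-||x - y||^2 / beta); here kappa = k(x, x) = 1."""
        sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
        return np.exp(-sq / beta)

    def ktilde_matrix(X, y, C, beta):
        """Transformed two-class L2-SVM kernel of eq. (10):
        ktilde(z_i, z_j) = y_i y_j k(x_i, x_j) + y_i y_j + delta_ij / C."""
        K = gaussian_kernel(X, X, beta)
        yy = np.outer(y, y)
        return yy * K + yy + np.eye(len(y)) / C

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        X = rng.normal(size=(6, 3))
        y = np.array([1, -1, 1, 1, -1, -1])
        Kt = ktilde_matrix(X, y, C=10.0, beta=1.0)
        # diagonal equals kappa + 1 + 1/C = 1 + 1 + 0.1, a constant, as required by (4)
        print(np.allclose(np.diag(Kt), 1 + 1 + 1 / 10.0))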
4 Core Vector Machine (CVM)

After formulating the kernel method as a MEB problem, we obtain a transformed kernel k̃, together with the associated feature space F̃, mapping φ̃ and constant κ̃ = k̃(z, z). To solve this kernel-induced MEB problem, we adopt the approximation algorithm[3] described in the proof of Theorem 2.2 in (Badoiu & Clarkson, 2002). As mentioned in Section 2, the idea is to incrementally expand the ball by including the point furthest away from the current center. In the following, we denote the core-set, the ball's center and radius at the t-th iteration by S_t, c_t and R_t respectively. Also, the center and radius of a ball B are denoted by c_B and r_B. Given an ε > 0, the CVM then works as follows:

1. Initialize S_0, c_0 and R_0.
2. Terminate if there is no φ̃(z) (where z is a training point) falling outside the (1 + ε)-ball B(c_t, (1 + ε)R_t).
3. Find z such that φ̃(z) is furthest away from c_t. Set S_{t+1} = S_t ∪ {z}.
4. Find the new MEB(S_{t+1}) from (5) and set c_{t+1} = c_MEB(S_{t+1}) and R_{t+1} = r_MEB(S_{t+1}) using (3).
5. Increment t by 1 and go back to step 2.

[3] A similar algorithm is also described in (Kumar et al., 2003).

In the sequel, points that are added to the core-set will be called core vectors. Details of each of the above steps will be described in Section 4.1. Despite its simplicity, CVM has an approximation guarantee (Section 4.2) and also provably small time and space complexities (Section 4.3).

4.1 Detailed Procedure

4.1.1 Initialization

Badoiu and Clarkson (2002) simply used an arbitrary point z ∈ S to initialize S_0 = {z}. However, a good initialization may lead to fewer updates and so we follow the scheme in (Kumar et al., 2003). We start with an arbitrary point z ∈ S and find z_a ∈ S that is furthest away from z in the feature space F̃. Then, we find another point z_b ∈ S that is furthest away from z_a in F̃. The initial core-set is then set to be S_0 = {z_a, z_b}. Obviously, MEB(S_0) (in F̃) has center c_0 = (φ̃(z_a) + φ̃(z_b))/2. On using (3), we thus have α_a = α_b = 1/2 and all the other α_i's are zero. The initial radius is R_0 = ‖φ̃(z_a) − φ̃(z_b)‖/2 = (1/2)√(‖φ̃(z_a)‖^2 + ‖φ̃(z_b)‖^2 − 2 φ̃(z_a)' φ̃(z_b)) = (1/2)√(2κ̃ − 2k̃(z_a, z_b)).

In a classification problem, one may further require z_a and z_b to come from different classes. On using (10), R_0 then becomes (1/2)√(2κ + 2 + 2/C + 2k(x_a, x_b)). As κ and C are constants, choosing the pair (x_a, x_b) that maximizes R_0 is then equivalent to choosing the closest pair belonging to opposing classes, which is also the heuristic used in initializing the SimpleSVM (Vishwanathan et al., 2003).

4.1.2 Distance Computations

Steps 2 and 3 involve computing ‖c_t − φ̃(z_ℓ)‖ for z_ℓ ∈ S. Now,

    ‖c_t − φ̃(z_ℓ)‖^2 = Σ_{z_i, z_j ∈ S_t} α_i α_j k̃(z_i, z_j) − 2 Σ_{z_i ∈ S_t} α_i k̃(z_i, z_ℓ) + k̃(z_ℓ, z_ℓ),    (12)

on using (3). Hence, computations are based on kernel evaluations instead of the explicit φ̃(z_i)'s, which may be infinite-dimensional. Note that, in contrast, existing MEB algorithms only consider finite-dimensional spaces.

However, in the feature space, c_t cannot be obtained as an explicit point but rather as a convex combination of (at most) |S_t| φ̃(z_i)'s. Computing (12) for all m training points takes O(|S_t|^2 + m|S_t|) = O(m|S_t|) time at the t-th iteration. This becomes very expensive when m is large. Here, we use the probabilistic speedup method in (Smola & Scholkopf, 2000). The idea is to randomly sample a sufficiently large subset S' from S, and then take the point in S' that is furthest away from c_t as the approximate furthest point over S. As shown in (Smola & Scholkopf, 2000), by using a small random sample of, say, size 59, the furthest point obtained from S' is with probability 0.95 among the furthest 5% of points from the whole S. Instead of taking O(m|S_t|) time, this randomized method only takes O(|S_t|^2 + |S_t|) = O(|S_t|^2) time, which is much faster as |S_t| ≪ m. This trick can also be used in initialization.

4.1.3 Adding the Furthest Point

Points outside MEB(S_t) have zero α_i's (Section 4.1.1) and so violate the KKT conditions of the dual problem. As in (Osuna et al., 1997), one could simply add any such violating point to S_t. Our step 3, however, takes a greedy approach by including the point furthest away from the current center. In the classification case[4] (Section 3.2.2), we have

    arg max_{z_ℓ ∉ B(c_t, (1+ε)R_t)} ‖c_t − φ̃(z_ℓ)‖^2
        = arg min_{z_ℓ ∉ B(c_t, (1+ε)R_t)} Σ_{z_i ∈ S_t} α_i y_i y_ℓ (k(x_i, x_ℓ) + 1)
        = arg min_{z_ℓ ∉ B(c_t, (1+ε)R_t)} y_ℓ (w' φ(x_ℓ) + b),    (13)

on using (10), (11) and (12). Hence, (13) chooses the worst violating pattern corresponding to the constraint in (8). Also, as the dual objective in (9) has gradient −2K̃α, for a pattern ℓ currently outside the ball,

    (K̃ α)_ℓ = Σ_{i=1}^m α_i (y_i y_ℓ k(x_i, x_ℓ) + y_i y_ℓ + δ_iℓ/C) = y_ℓ (w' φ(x_ℓ) + b),

on using (10), (11) and α_ℓ = 0. Thus, the pattern chosen in (13) also makes the most progress towards maximizing the dual objective. This subset selection heuristic has been commonly used by various decomposition algorithms (e.g., (Chang & Lin, 2004; Joachims, 1999; Platt, 1999)).

[4] The case for one-class classification (Section 3.2.1) is similar.
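Putting Sections 4.1.1-4.1.3 together, the following is a rough, self-contained sketch (our own illustration, not the authors' released C++ code) of the CVM loop for the two-class L2-SVM. Distances to the center are computed through kernel evaluations as in (12), the furthest-point search uses the 59-point probabilistic speedup, and each small MEB sub-problem (5) is solved here with a generic SciPy solver rather than the warm-started SMO described in Section 4.1.4 below. The function names, the simplified initialization, and the default parameters are our own choices.

    import numpy as np
    from scipy.optimize import minimize

    def ktilde(Xa, ya, Xb, yb, beta):
        """Eq. (10) without the delta_ij/C term (added separately), Gaussian base kernel (kappa = 1)."""
        sq = ((Xa[:, None, :] - Xb[None, :, :]) ** 2).sum(-1)
        return np.outer(ya, yb) * (np.exp(-sq / beta) + 1.0)

    def solve_meb_dual(Kt):
        """Small MEB sub-problem (5): max -a' Kt a over the simplex (plain solver, no warm start)."""
        n = Kt.shape[0]
        res = minimize(lambda a: a @ Kt @ a, np.full(n, 1.0 / n),
                       jac=lambda a: 2.0 * Kt @ a, method="SLSQP",
                       bounds=[(0.0, 1.0)] * n,
                       constraints=[{"type": "eq", "fun": lambda a: a.sum() - 1.0}])
        return res.x

    def cvm_train(X, y, C=10.0, beta=1.0, eps=0.05, sample=59, rng=None):
        rng = rng if rng is not None else np.random.default_rng(0)
        m = len(y)
        kappa_t = 1.0 + 1.0 + 1.0 / C                      # constant diagonal ktilde(z, z)
        core = [int(i) for i in rng.choice(m, size=2, replace=False)]  # simplified init (cf. Section 4.1.1)
        while True:
            # MEB of the current core-set: dual alpha from (5), then R^2 via (3)
            Kt = ktilde(X[core], y[core], X[core], y[core], beta) + np.eye(len(core)) / C
            alpha = solve_meb_dual(Kt)
            R2 = max(kappa_t - alpha @ Kt @ alpha, 0.0)    # alpha' diag(Kt) = kappa_t since the diagonal is constant
            # probabilistic speedup: examine only a random subset of size 59
            cand = rng.choice(m, size=min(sample, m), replace=False)
            cross = ktilde(X[core], y[core], X[cand], y[cand], beta)
            for pos, j in enumerate(cand):                 # restore delta_ij/C for candidates already in the core
                if j in core:
                    cross[core.index(j), pos] += 1.0 / C
            # squared distances to the center, eq. (12)
            d2 = alpha @ Kt @ alpha - 2.0 * (alpha @ cross) + kappa_t
            far = int(cand[int(np.argmax(d2))])
            if d2.max() <= (1.0 + eps) ** 2 * R2 or far in core:
                break                                      # no sampled point falls outside the (1+eps)-ball
            core.append(far)                               # add the (approximate) furthest point
        # decision value via (11): w' phi(x) + b = sum_i alpha_i y_i (k(x_i, x) + 1)
        return np.asarray(core), alpha

    if __name__ == "__main__":
        rng = np.random.default_rng(1)
        X = rng.normal(size=(200, 2))
        y = np.where(X[:, 0] * X[:, 1] > 0, 1, -1)         # small toy XOR-like problem
        core, alpha = cvm_train(X, y, C=10.0, beta=1.0, eps=0.05, rng=rng)
        print("core-set size:", len(core))

Because only the sampled candidates are checked, termination here is probabilistic in the same sense discussed at the end of Section 4.2; the actual implementation described in Section 5 replaces the generic inner solver with SMO warm-started from the previous iteration.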
4.1.4 Finding the MEB

At each iteration of step 4, we find the MEB by using the QP formulation in Section 3.2. As the size |S_t| of the core-set is much smaller than m in practice (Section 5), the computational complexity of each QP sub-problem is much lower than that of solving the whole QP. Besides, as only one core vector is added at each iteration, efficient rank-one update procedures (Cauwenberghs & Poggio, 2001; Vishwanathan et al., 2003) can also be used. The cost then becomes quadratic rather than cubic. In the current implementation (Section 5), we use SMO. As only one point is added each time, the new QP is just a slight perturbation of the original. Hence, by using the MEB solution obtained from the previous iteration as the starting point (warm start), SMO can often converge in a small number of iterations.

4.2 Convergence to (Approximate) Optimality

First, consider ε = 0. The proof in (Badoiu & Clarkson, 2002) does not apply as it requires ε > 0. Nevertheless, as the number of core vectors increases by one at each iteration and the training set size is finite, CVM must terminate in a finite number (say, τ) of iterations. With ε = 0, MEB(S_τ) is an enclosing ball for all the points on termination. Because S_τ is a subset of the whole training set and the MEB of a subset cannot be larger than the MEB of the whole set, MEB(S_τ) must also be the exact MEB of the whole (φ̃-transformed) training set. In other words, when ε = 0, CVM outputs the exact solution of the kernel problem.

Now, consider ε > 0. Assume that the algorithm terminates at the τ-th iteration; then

    R_τ ≤ r_MEB(S) ≤ (1 + ε) R_τ    (14)

by definition. Recall that the optimal primal objective p* of the kernel problem in Section 3.2.1 (or 3.2.2) is equal to the optimal dual objective d_2* in (7) (or (9)), which in turn is related to the optimal dual objective d_1* = r_MEB(S)^2 in (2) by (6). Together with (14), we can then bound p* as

    R_τ^2 ≤ p* + κ̃ ≤ (1 + ε)^2 R_τ^2.    (15)

Hence, max(R_τ^2/(p* + κ̃), (p* + κ̃)/R_τ^2) ≤ (1 + ε)^2 and thus CVM is an (1 + ε)^2-approximation algorithm. This also holds with high probability when probabilistic speedup is used.

As mentioned in Section 1, practical SVM implementations also output approximate solutions only. Typically, a parameter similar to our ε is required at termination. For example, in SMO and SVMlight (Joachims, 1999), training stops when the KKT conditions are fulfilled within ε. Experience with these softwares indicates that near-optimal solutions are often good enough in practical applications. Moreover, it can also be shown that when the CVM terminates, all the points satisfy loose KKT conditions as in SMO and SVMlight.

4.3 Time and Space Complexities

Existing decomposition algorithms cannot guarantee the number of iterations and consequently the overall time complexity (Chang & Lin, 2004). In this Section, we show how this can be obtained for CVM. In the following, we assume that a plain QP implementation, which takes O(m^3) time and O(m^2) space for m patterns, is used for the MEB sub-problem in Section 4.1.4. Moreover, we assume that each kernel evaluation takes constant time.

As proved in (Badoiu & Clarkson, 2002), CVM converges in at most 2/ε iterations. In other words, the total number of iterations, and consequently the size of the final core-set, are of order τ = O(1/ε). In practice, it has often been observed that the size of the core-set is much smaller than this worst-case theoretical upper bound (Kumar et al., 2003). This will also be corroborated by our experiments in Section 5.

Consider first the case where probabilistic speedup is not used in Section 4.1.2. As only one core vector is added at each iteration, |S_t| = t + 2. Initialization takes O(m) time while distance computations in steps 2 and 3 take O((t + 2)^2 + tm) = O(t^2 + tm) time. Finding the MEB in step 4 takes O((t + 2)^3) = O(t^3) time, and the other operations take constant time. Hence, the t-th iteration takes O(tm + t^3) time, and the overall time for τ = O(1/ε) iterations is

    Σ_{t=1}^{τ} O(tm + t^3) = O(τ^2 m + τ^4) = O(m/ε^2 + 1/ε^4),

which is linear in m for a fixed ε.

As for space[5], since only the core vectors are involved in the QP, the space complexity for the t-th iteration is O(|S_t|^2). As τ = O(1/ε), the space complexity for the whole procedure is O(1/ε^2), which is independent of m for a fixed ε.

[5] As the patterns may be stored out of core, we ignore the O(m) space required for storing the m patterns.

On the other hand, when probabilistic speedup is used, initialization only takes O(1) time while distance computations in steps 2 and 3 take O((t + 2)^2) = O(t^2) time. Time for the other operations remains the same. Hence, the t-th iteration takes O(t^3) time and the whole procedure takes

    Σ_{t=1}^{τ} O(t^3) = O(τ^4) = O(1/ε^4).
For a fixed ε, this is constant, independent of m. The space complexity, which depends only on the number of iterations τ, is still O(1/ε^2).

If more efficient QP solvers were used in the MEB sub-problem of Section 4.1.4, both the time and space complexities could be further improved. For example, with SMO, the space complexity for the t-th iteration is reduced to O(|S_t|) and that for the whole procedure is driven down to O(1/ε).

Note that when ε decreases, the CVM solution becomes closer to the exact optimal solution, but at the expense of higher time and space complexities. Such a tradeoff between efficiency and approximation quality is typical of all approximation schemes. Moreover, be cautioned that the O-notation is used for studying the asymptotic efficiency of algorithms. As we are interested in handling very large data sets, an algorithm that is asymptotically more efficient (in time and space) will be the best choice. However, on smaller problems, it may be outperformed by algorithms that are not as efficient asymptotically. These points will be demonstrated experimentally in Section 5.

5 Experiments

In this Section, we implement the two-class L2-SVM in Section 3.2.2 and illustrate the scaling behavior of CVM (in C++) on both toy and real-world data sets. For comparison, we also run the following SVM implementations[6]:

1. L2-SVM: LIBSVM implementation (in C++);
2. L2-SVM: LSVM implementation (in MATLAB), with low-rank approximation (Fine & Scheinberg, 2001) of the kernel matrix added;
3. L2-SVM: RSVM (Lee & Mangasarian, 2001) implementation (in MATLAB). The RSVM addresses the scale-up issue by solving a smaller optimization problem that involves a random m̄ × m rectangular subset of the kernel matrix. Here, m̄ is set to 10% of m;
4. L1-SVM: LIBSVM implementation (in C++);
5. L1-SVM: SimpleSVM (Vishwanathan et al., 2003) implementation (in MATLAB).

[6] Our CVM implementation can be downloaded from http://www.cs.ust.hk/jamesk/cvm.zip. LIBSVM can be downloaded from http://www.csie.ntu.edu.tw/cjlin/libsvm/; LSVM from http://www.cs.wisc.edu/dmi/lsvm; and SimpleSVM from http://asi.insa-rouen.fr/gloosli/. Moreover, we followed http://www.csie.ntu.edu.tw/cjlin/libsvm/faq.html in adapting the LIBSVM package for L2-SVM.

Parameters are used in their default settings unless otherwise specified. All experiments are performed on a 3.2GHz Pentium-4 machine with 512M RAM, running Windows XP. Since our focus is on nonlinear kernels, we use the Gaussian kernel k(x, y) = exp(−‖x − y‖^2/β), with β = (1/m^2) Σ_{i,j=1}^m ‖x_i − x_j‖^2.

Our CVM implementation is adapted from LIBSVM, and uses SMO for each QP sub-problem in Section 4.1.4. As in LIBSVM, our CVM also uses caching (with the same cache size as in the other LIBSVM implementations above) and stores all training patterns in main memory. For simplicity, shrinking is not used in our current CVM implementation. Moreover, we employ probabilistic speedup (Section 4.1.2) and set ε = 10^-6 in all the experiments. As in other decomposition methods, the use of a very stringent stopping criterion is not necessary in practice. Preliminary studies show that ε = 10^-6 is acceptable for most tasks. Using an even smaller ε does not show improved generalization performance, but may increase the training time unnecessarily.
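For concreteness, the kernel width β defined above (the mean squared pairwise distance) can be estimated as in the following small sketch. This is our own illustration with a hypothetical kernel_width helper; subsampling pairs when m is large is our own addition to keep the estimate tractable on the data set sizes used below.

    import numpy as np

    def kernel_width(X, max_pairs=1_000_000, rng=None):
        """Estimate beta = (1/m^2) * sum_{i,j} ||x_i - x_j||^2, subsampling pairs when m is large."""
        rng = rng if rng is not None else np.random.default_rng(0)
        m = len(X)
        if m * m <= max_pairs:
            sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
            return sq.mean()                 # exact mean over all m^2 pairs
        i = rng.integers(0, m, size=max_pairs)
        j = rng.integers(0, m, size=max_pairs)
        return ((X[i] - X[j]) ** 2).sum(-1).mean()

    if __name__ == "__main__":
        X = np.random.default_rng(0).normal(size=(1000, 2))
        beta = kernel_width(X)
        print(beta)                          # used as k(x, y) = exp(-||x - y||^2 / beta)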
[Figure 2: Results on the checkerboard data set: (a) CPU time (in seconds), (b) number of SVs, and (c) testing error rate (in %), each plotted against the size of the training set. Except for the CVM, all the other implementations have to terminate early because of not enough memory and/or because the training time is too long. Note that the CPU time, number of support vectors, and size of the training set are in log scale.]

5.1 Checkerboard Data

We first experiment on the 4 × 4 checkerboard data used by Lee and Mangasarian (2001) for evaluating large-scale SVM implementations. We use training sets with a maximum of 1 million points, and 2000 independent points for testing. Of course, this problem does not need so many points for training, but it is convenient for illustrating the scaling properties. Experimentally, L2-SVM with low-rank approximation does not yield satisfactory performance on this data set, and so its result is not reported here. RSVM, on the other hand, has to keep a rectangular kernel matrix of size m̄ × m and cannot be run on our machine when m exceeds 10K. Similarly, the SimpleSVM has to store the kernel matrix of the active set, and runs into storage problems when m exceeds 30K.

As can be seen from Figure 2, CVM is as accurate as the others. Besides, it is much faster[7] and produces far fewer support vectors (which implies faster testing) on large data sets. In particular, one million patterns can be processed in under 13s. On the other hand, for relatively small training sets, with less than 10K patterns, LIBSVM is faster. This, however, is to be expected as LIBSVM uses more sophisticated heuristics and so will be more efficient on small-to-medium sized data sets. Figure 2(b) also shows the core-set size, which can be seen to be small, and its curve basically overlaps with that of the CVM. Thus, almost all the core vectors are useful support vectors. Moreover, it also confirms our theoretical findings that both time and space are constant w.r.t. the training set size, when it is large enough.

[7] As some implementations are in MATLAB, not all the CPU time measurements can be directly compared. However, it is still useful to note the constant scaling exhibited by the CVM and its speed advantage over other C++ implementations when the data set is large.

5.2 Forest Cover Type Data[8]

This data set has been used for large scale SVM training by Collobert et al. (2002). Following (Collobert et al., 2002), we aim at separating class 2 from the other classes. 1%-90% of the whole data set (with a maximum of 522,911 patterns) are used for training while the remaining patterns are used for testing. We set β = 10000 for the Gaussian kernel. Preliminary studies show that the number of support vectors is over ten thousand. Consequently, RSVM and SimpleSVM cannot be run on our machine. Similarly, for low-rank approximation, preliminary studies show that thousands of basis vectors are required for a good approximation. Therefore, only the two LIBSVM implementations will be compared with the CVM here.

[8] http://kdd.ics.uci.edu/databases/covertype/covertype.html

Figure 3 shows that CVM is, again, as accurate as the others. Note that when the training set is small, more training patterns bring in additional information useful for classification, and so the number of core vectors increases with training set size. However, after processing around 100K patterns, both the time and space requirements of CVM begin to exhibit a constant scaling with the training set size. With hindsight, one might simply sample 100K training patterns and hope to obtain comparable results[9]. However, for satisfactory classification performance, different problems require samples of different sizes, and CVM has the important advantage that the required sample size does not have to be pre-specified. Without such prior knowledge, random sampling gives poor testing results, as has been demonstrated in (Lee & Mangasarian, 2001).

[9] In fact, we tried both LIBSVM implementations on a random sample of 100K training patterns, but their testing accuracies are inferior to that of CVM.

5.3 Relatively Small Data Sets: UCI Adult Data[10]

Following (Platt, 1999), we use training sets with up to 32,562 patterns. As can be seen in Figure 4, CVM is still among the most accurate methods. However, as this data set is relatively small, more training patterns do carry more classification information. Hence, as discussed in Section 5.2, the number of iterations, the core-set size and consequently the CPU time all increase with the number of training patterns. From another perspective, recall that the worst-case core-set size is 2/ε, independent of m (Section 4.3). For the value of ε = 10^-6 used here, 2/ε = 2 × 10^6. Although we have seen that the actual size of the core-set is often much smaller than this worst-case value, when m ≪ 2/ε the number of core vectors can still be dependent on m. Moreover, as has been observed in Section 5.1, the CVM is slower than the more sophisticated LIBSVM on processing these smaller data sets.

[10] http://research.microsoft.com/users/jplatt/smo.html

6 Conclusion

In this paper, we exploit the approximateness in SVM implementations. We formulate kernel methods as equivalent MEB problems, and then obtain provably approximately optimal solutions efficiently with the use of core-sets. The proposed CVM procedure is simple, and does not require sophisticated heuristics as in other decomposition methods. Moreover, despite its simplicity, CVM has small asymptotic time and space complexities. In particular, for a fixed ε, its asymptotic time complexity is linear in the training set size m while its space complexity is independent of m. When probabilistic speedup is used, it even has constant asymptotic time and space complexities for a fixed ε, independent of the training set size m. Experimentally, on large data sets, it is much faster and produces far fewer support vectors (and thus faster testing) than existing methods. On the other hand, on relatively small data sets where m ≪ 2/ε, SMO can be faster. CVM can also be used for other kernel methods such as support vector regression, and details will be reported elsewhere.
[Figure 3: Results on the forest cover type data set: (a) CPU time (in seconds), (b) number of SVs, and (c) testing error rate (in %), each plotted against the size of the training set. Note that the y-axes in Figures 3(a) and 3(b) are in log scale.]

[Figure 4: Results on the UCI adult data set: (a) CPU time (in seconds), (b) number of SVs, and (c) testing error rate (in %), each plotted against the size of the training set. The other implementations have to terminate early because of not enough memory and/or because the training time is too long. Note that the CPU time, number of SVs and size of the training set are in log scale.]

References

Badoiu, M., & Clarkson, K. (2002). Optimal core-sets for balls. DIMACS Workshop on Computational Geometry.

Cauwenberghs, G., & Poggio, T. (2001). Incremental and decremental support vector machine learning. Advances in Neural Information Processing Systems 13. Cambridge, MA: MIT Press.

Chang, C.-C., & Lin, C.-J. (2004). LIBSVM: a library for support vector machines. Software available at http://www.csie.ntu.edu.tw/cjlin/libsvm.

Chapelle, O., Vapnik, V., Bousquet, O., & Mukherjee, S. (2002). Choosing multiple parameters for support vector machines. Machine Learning, 46, 131-159.

Collobert, R., Bengio, S., & Bengio, Y. (2002). A parallel mixture of SVMs for very large scale problems. Neural Computation, 14, 1105-1114.

Fine, S., & Scheinberg, K. (2001). Efficient SVM training using low-rank kernel representation. Journal of Machine Learning Research, 2, 243-264.

Garey, M., & Johnson, D. (1979). Computers and intractability: A guide to the theory of NP-completeness. W.H. Freeman.

Joachims, T. (1999). Making large-scale support vector machine learning practical. In B. Scholkopf, C. Burges and A. Smola (Eds.), Advances in kernel methods - Support vector learning, 169-184. Cambridge, MA: MIT Press.

Kumar, P., Mitchell, J., & Yildirim, A. (2003). Approximate minimum enclosing balls in high dimensions using core-sets. ACM Journal of Experimental Algorithmics, 8.

Lee, Y.-J., & Mangasarian, O. (2001). RSVM: Reduced support vector machines. Proceedings of the First SIAM International Conference on Data Mining.

Lin, K.-M., & Lin, C.-J. (2003). A study on reduced support vector machines. IEEE Transactions on Neural Networks, 14, 1449-1459.

Mangasarian, O., & Musicant, D. (2001). Lagrangian support vector machines. Journal of Machine Learning Research, 1, 161-177.

Osuna, E., Freund, R., & Girosi, F. (1997). Training support vector machines: an application to face detection. Proceedings of Computer Vision and Pattern Recognition (pp. 130-136). San Juan, Puerto Rico.

Platt, J. (1999). Fast training of support vector machines using sequential minimal optimization. In B. Scholkopf, C. Burges and A. Smola (Eds.), Advances in kernel methods - Support vector learning, 185-208. Cambridge, MA: MIT Press.

Scholkopf, B., Platt, J., Shawe-Taylor, J., Smola, A., & Williamson, R. (2001). Estimating the support of a high-dimensional distribution. Neural Computation, 13, 1443-1471.

Smola, A., & Scholkopf, B. (2000). Sparse greedy matrix approximation for machine learning. Proceedings of the Seventeenth International Conference on Machine Learning (pp. 911-918). Stanford, CA, USA.

Smola, A., & Scholkopf, B. (2004). A tutorial on support vector regression. Statistics and Computing, 14, 199-222.

Tax, D., & Duin, R. (1999). Support vector domain description. Pattern Recognition Letters, 20, 1191-1199.

Vishwanathan, S., Smola, A., & Murty, M. (2003). SimpleSVM. Proceedings of the Twentieth International Conference on Machine Learning (pp. 760-767). Washington, D.C., USA.

Williams, C., & Seeger, M. (2001). Using the Nystrom method to speed up kernel machines. Advances in Neural Information Processing Systems 13. Cambridge, MA: MIT Press.
