Abstract

Standard SVM training has O(m³) time and O(m²) space complexities, where m is the training set size. In this paper, we scale up kernel methods by exploiting the approximateness in practical SVM implementations. We formulate many kernel methods as equivalent minimum enclosing ball problems in computational geometry, and then obtain provably approximately optimal solutions efficiently with the use of core-sets. Our proposed Core Vector Machine (CVM) algorithm has a time complexity that is linear in m and a space complexity that is independent of m. Experiments on large toy and real-world data sets demonstrate that the CVM is much faster and can handle much larger data sets than existing scale-up methods. In particular, on our PC with only 512M RAM, the CVM with Gaussian kernel can process the checkerboard data set with 1 million points in less than 13 seconds.
1 Introduction

In recent years, there has been a lot of interest in using kernels in various machine learning problems, with the support vector machine (SVM) being the most prominent example. Many of these kernel methods are formulated as quadratic programming (QP) problems. Denote the number of training patterns by m. The training time complexity of QP is O(m³) and its space complexity is at least quadratic. Hence, a major stumbling block is in scaling up these QPs to large data sets, such as those commonly encountered in data mining applications.

To reduce the time and space complexities, a popular technique is to obtain low-rank approximations of the kernel matrix, by using the Nystrom method (Williams & Seeger, 2001), greedy approximation (Smola & Scholkopf, 2000) or matrix decompositions (Fine & Scheinberg, 2001). However, on very large data sets, the resulting rank of the kernel matrix may still be too high to be handled efficiently.

Another approach to scaling up kernel methods is chunking or more sophisticated decomposition methods. However, chunking needs to optimize the entire set of non-zero Lagrange multipliers that have been identified, and the resultant kernel matrix may still be too large to fit into memory. Osuna et al. (1997) suggested optimizing only a fixed-size subset of the training data (the working set) each time, while the variables corresponding to the other patterns are frozen. Going to the extreme, the sequential minimal optimization (SMO) algorithm (Platt, 1999) breaks a large QP into a series of smallest possible QPs, each involving only two variables. In the context of classification, Mangasarian and Musicant (2001) proposed the Lagrangian SVM (LSVM), which avoids the QP (or LP) altogether; instead, the solution is obtained by a fast iterative scheme. However, for nonlinear kernels (which are the focus of this paper), it still requires the inversion of an m × m matrix. Further speed-up is possible by employing the reduced SVM (RSVM) (Lee & Mangasarian, 2001), which uses a rectangular subset of the kernel matrix. However, this may lead to performance degradation (Lin & Lin, 2003).

In practice, state-of-the-art SVM implementations typically have a training time complexity that scales between O(m) and O(m^2.3) (Platt, 1999). This can be further driven down to O(m) with the use of a parallel mixture (Collobert et al., 2002). However, these are only empirical observations, not theoretical guarantees. For reliable scaling behavior on very large data sets, our goal is to develop an algorithm that can be proved (using tools from the analysis of algorithms) to be asymptotically efficient in both time and space.

Moreover, practical SVM implementations, as in many numerical routines, only approximate the optimal solution by an iterative strategy. Typically, the stopping criterion utilizes either the precision of the Lagrange multipliers (e.g., (Joachims, 1999; Platt, 1999)) or the duality gap (e.g., (Smola & Scholkopf, 2004)). However, while approximation algorithms (with provable performance guarantees) have been extensively used in tackling computationally difficult problems like NP-complete problems (Garey & Johnson, 1979), such approximateness has never been exploited in the design of SVM implementations.
In this paper, we first transform the SVM optimization problem (with a possibly nonlinear kernel) to the minimum enclosing ball (MEB) problem in computational geometry. The MEB problem computes the ball of minimum radius enclosing a given set of points (or, more generally, balls). Traditional algorithms for finding exact MEBs do not scale well with the dimensionality d of the points. Consequently, recent attention has shifted to the development of approximation algorithms. Lately, a breakthrough was obtained by Badoiu and Clarkson (2002), who showed that a (1 + ε)-approximation of the MEB can be efficiently obtained using core-sets. Generally speaking, in an optimization problem, a core-set is a subset of the input points such that we can get a good approximation (with an approximation ratio¹ specified by a user-defined parameter ε) to the original input by solving the optimization problem directly on the core-set. Moreover, a surprising property of (Badoiu & Clarkson, 2002) is that the size of its core-set is independent of both d and the size of the point set.

¹ Let C be the cost (or value of the objective function) of the solution returned by an approximate algorithm, and C* be the cost of the optimal solution. Then, the approximate algorithm has an approximation ratio ρ(n) for an input size n if max(C/C*, C*/C) ≤ ρ(n). Intuitively, this measures how bad the approximate solution is compared with the optimal solution. A large (small) approximation ratio means the solution is much worse than (more or less the same as) the optimal solution. Observe that ρ(n) is always ≥ 1. If the ratio does not depend on n, we may just write ρ and call the algorithm a ρ-approximation algorithm.

Inspired by this core-set-based approximate MEB algorithm, we develop an approximation algorithm for SVM training that has an approximation ratio of (1 + ε)². Its time complexity is linear in m while its space complexity is independent of m. The rest of this paper is organized as follows. Section 2 gives a short introduction to the MEB problem and its approximation algorithm. The connection between kernel methods and the MEB problem is given in Section 3. Section 4 then describes our proposed Core Vector Machine (CVM) algorithm. Experimental results are presented in Section 5, and the last section gives some concluding remarks.

2 MEB in Computational Geometry

Given a set of points S = {x₁, …, xₘ}, where each xᵢ ∈ Rᵈ, the minimum enclosing ball of S (denoted MEB(S)) is the smallest ball that contains all the points in S. The MEB problem has found applications in diverse areas such as computer graphics (e.g., collision detection, visibility culling), machine learning (e.g., similarity search) and facility location problems.

Here, we will focus on approximate MEB algorithms based on core-sets. Let B(c, R) be the ball with center c and radius R. Given ε > 0, a ball B(c, (1 + ε)R) is a (1 + ε)-approximation of MEB(S) if R ≤ r_MEB(S) and S ⊂ B(c, (1 + ε)R). A subset X ⊆ S is a core-set of S if an expansion by a factor (1 + ε) of its MEB contains S, i.e., S ⊂ B(c, (1 + ε)r), where B(c, r) = MEB(X) (Figure 1).

To obtain such a (1 + ε)-approximation, Badoiu and Clarkson (2002) proposed a simple iterative scheme: at the t-th iteration, the current estimate B(cₜ, rₜ) is expanded incrementally by including the furthest point outside the (1 + ε)-ball B(cₜ, (1 + ε)rₜ). This is repeated until all the points in S are covered by B(cₜ, (1 + ε)rₜ). Despite its simplicity, Badoiu and Clarkson (2002) showed that the number of iterations, and hence the size of the final core-set, depends only on ε but not on d or m.

[Figure 1: The inner circle is the MEB of the set of squares and its (1 + ε) expansion (the outer circle) covers all the points. The set of squares is thus a core-set.]
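To make the scheme concrete, the following minimal NumPy sketch (ours, not taken from the paper) runs it in input space with the linear kernel. The MEB of the current core-set is obtained here by a few Frank-Wolfe steps on the small dual QP (given as (2) in Section 3.1 below), whereas the paper solves that sub-problem with SMO; the function names, the fixed iteration count and the initialization are illustrative only.

import numpy as np

def meb_of_coreset(P, n_iter=300):
    # Approximate MEB of the small core-set P (k x d) via Frank-Wolfe on the
    # dual  max_a  a'diag(K) - a'Ka,  a >= 0, sum(a) = 1,  with K = P P'.
    K = P @ P.T
    dK = np.diag(K)
    a = np.full(len(P), 1.0 / len(P))
    for t in range(n_iter):
        grad = dK - 2.0 * (K @ a)          # gradient of the concave dual objective
        j = int(np.argmax(grad))           # best vertex of the simplex
        step = 2.0 / (t + 2.0)             # standard Frank-Wolfe step size
        a *= 1.0 - step
        a[j] += step
    c = a @ P                              # centre (cf. the recovery formulas (3))
    r2 = float(a @ dK - a @ (K @ a))       # squared radius (cf. (3))
    return c, np.sqrt(max(r2, 0.0))

def approx_meb(S, eps=1e-2):
    # (1+eps)-approximate MEB of S (m x d) by the core-set scheme described
    # above: repeatedly add the furthest point outside the (1+eps)-ball and
    # recompute the MEB of the current core-set.
    core = [0, int(np.argmax(np.linalg.norm(S - S[0], axis=1)))]
    while True:
        c, r = meb_of_coreset(S[core])
        dist = np.linalg.norm(S - c, axis=1)
        far = int(np.argmax(dist))
        if dist[far] <= (1.0 + eps) * r or far in core:
            return c, r, core              # all points are (1+eps)-covered
        core.append(far)

# Example: c, r, core = approx_meb(np.random.randn(100000, 2), eps=0.01)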
This independence of d is important when applying this algorithm to kernel methods (Section 3), as the kernel-induced feature space can be infinite-dimensional. As for the independence on m, it allows both the time and space complexities of our algorithm to grow slowly, as will be shown in Section 4.3.

3 MEB Problems and Kernel Methods
Obviously, the MEB is equivalent to the hard-margin support vector data description (SVDD) (Tax & Duin, 1999), which will be briefly reviewed in Section 3.1. The MEB problem can also be used for finding the radius component of the radius-margin bound (Chapelle et al., 2002). Thus, as pointed out by Kumar et al. (2003), the MEB problem is useful in support vector clustering and SVM parameter tuning. However, we will show in Section 3.2 that other kernel-related problems, including the training of soft-margin one-class and two-class L2-SVMs, can also be viewed as MEB problems.

3.1 Hard-Margin SVDD

Given a kernel k with the associated feature map φ, let the MEB in the kernel-induced feature space be B(c, R). The primal problem in the hard-margin SVDD is

    min R² : ‖c − φ(xᵢ)‖² ≤ R², i = 1, …, m.   (1)

The corresponding dual is

    max α'diag(K) − α'Kα : α ≥ 0, α'1 = 1,   (2)
where α = [α₁, …, αₘ]' are the Lagrange multipliers, 0 = [0, …, 0]', 1 = [1, …, 1]', and K_{m×m} = [k(xᵢ, xⱼ)] = [φ(xᵢ)'φ(xⱼ)] is the kernel matrix. As is well known, this is a QP problem. The primal variables can be recovered from the optimal α as

    c = Σ_{i=1}^m αᵢ φ(xᵢ),   R = √(α'diag(K) − α'Kα).   (3)
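For completeness, (2) and (3) follow from (1) by the standard Lagrangian argument; the short derivation below is ours and is not quoted from the paper. Introducing a multiplier αᵢ ≥ 0 for each constraint in (1) gives

    L(R, c, α) = R² + Σ_{i=1}^m αᵢ (‖c − φ(xᵢ)‖² − R²).

Setting ∂L/∂R = 0 yields α'1 = 1, and setting ∂L/∂c = 0 yields c = Σ_{i=1}^m αᵢ φ(xᵢ). Substituting these back,

    Σ_{i=1}^m αᵢ ‖c − φ(xᵢ)‖² = Σᵢ αᵢ k(xᵢ, xᵢ) − 2 Σᵢⱼ αᵢαⱼ k(xᵢ, xⱼ) + Σᵢⱼ αᵢαⱼ k(xᵢ, xⱼ) = α'diag(K) − α'Kα,

which is exactly the objective in (2). At the optimum this value equals the squared radius, giving R = √(α'diag(K) − α'Kα) as in (3).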
3.2 Viewing Kernel Methods as MEB Problems

Conversely, when the kernel k satisfies (4), QPs of the form (5) can always be regarded as an MEB problem (1). Note that (2) and (5) yield the same set of α's. Moreover, let d₁* and d₂* denote the optimal dual objectives in (2) and (5) respectively; the two are then related by (6).

The one-class L2-SVM separates outliers from the normal data by solving the primal problem

    min_{w, ρ, ξᵢ} ‖w‖² − 2ρ + C Σ_{i=1}^m ξᵢ² : w'φ(xᵢ) ≥ ρ − ξᵢ,

where w'φ(x) = ρ is the desired hyperplane and C is a user-defined parameter. Note that the constraints ξᵢ ≥ 0 are not needed for the L2-SVM. The corresponding dual is

    max −α'(K + (1/C) I)α : α ≥ 0, α'1 = 1,   (7)

and the primal of the two-class L2-SVM is

    min_{w, b, ρ, ξᵢ} ‖w‖² + b² − 2ρ + C Σ_{i=1}^m ξᵢ²
    s.t. yᵢ(w'φ(xᵢ) + b) ≥ ρ − ξᵢ.   (8)
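The claim that such QPs can be regarded as MEB problems can be checked directly for the one-class dual (7); the following short calculation is ours and assumes, as condition (4) presumably requires, that k(x, x) = κ for some constant κ (true, e.g., for the Gaussian kernel). Define the modified kernel matrix K̃ = K + (1/C) I, so that k̃(xᵢ, xᵢ) = κ + 1/C for every i and hence, on the feasible set α ≥ 0, α'1 = 1,

    α'diag(K̃) − α'K̃α = (κ + 1/C) − α'(K + (1/C) I)α.

The left-hand side is the MEB dual (2) with kernel matrix K̃, and the right-hand side is the one-class L2-SVM dual (7) shifted by the constant κ̃ := κ + 1/C. The two problems therefore share the same optimal α, and their optimal values differ only by κ̃, consistent with the constant relation between d₁* and d₂* referred to in (6).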
At termination of the CVM iterations, the radius R of the core-set's MEB satisfies

    R ≤ r_MEB(S) ≤ (1 + ε)R   (14)

by definition. Recall that the optimal primal objective p* of the kernel problem in Section 3.2.1 (or 3.2.2) is equal to the optimal dual objective d₂* in (7) (or (9)), which in turn is related to the optimal dual objective d₁* = r_MEB(S)² in (2) by (6). Together with (14), we can then bound p* as

    R² ≤ p* + κ̃ ≤ (1 + ε)²R²,   (15)

where κ̃ denotes the constant relating d₁* and d₂* in (6). Hence, max{(p* + κ̃)/R², R²/(p* + κ̃)} ≤ (1 + ε)², and thus CVM is a (1 + ε)²-approximation algorithm. This also holds with high probability when probabilistic speedup is used.

As mentioned in Section 1, practical SVM implementations also output approximated solutions only.

As for space⁵, since only the core vectors are involved in the QP, the space complexity for the t-th iteration is O(|S_t|²). As τ = O(1/ε), the space complexity for the whole procedure is O(1/ε²), which is independent of m for a fixed ε.

⁵ As the patterns may be stored out of core, we ignore the O(m) space required for storing the m patterns.

On the other hand, when probabilistic speedup is used, initialization only takes O(1) time, while the distance computations in steps 2 and 3 take O((t + 2)²) = O(t²) time. Time for the other operations remains the same. Hence, the t-th iteration takes O(t³) time and the whole procedure takes

    Σ_{t=1}^τ O(t³) = O(τ⁴) = O(1/ε⁴).

For a fixed ε, it is thus constant, independent of m. The space complexity, which depends only on the number of iterations τ, is still O(1/ε²).
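Spelling out the arithmetic behind the time bound (our own one-line check): since the number of iterations is τ = O(1/ε),

    Σ_{t=1}^τ t³ = (τ(τ + 1)/2)² = O(τ⁴) = O(1/ε⁴).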
If more efficient QP solvers were used in the MEB sub-problem of Section 4.1.4, both the time and space complexities can be further improved. For example, with SMO, the space complexity for the t-th iteration is reduced to O(|S_t|) and that for the whole procedure is driven down to O(1/ε).

Note that when ε decreases, the CVM solution becomes closer to the exact optimal solution, but at the expense of higher time and space complexities. Such a tradeoff between efficiency and approximation quality is typical of all approximation schemes. Moreover, be cautioned that the O-notation is used for studying the asymptotic efficiency of algorithms. As we are interested in handling very large data sets, an algorithm that is asymptotically more efficient (in time and space) will be the best choice. However, on smaller problems, this may be outperformed by algorithms that are not as efficient asymptotically. These will be demonstrated experimentally in Section 5.
5 Experiments

In this section, we implement the two-class L2-SVM in Section 3.2.2 and illustrate the scaling behavior of CVM (in C++) on both toy and real-world data sets. For comparison, we also run the following SVM implementations⁶:

1. L2-SVM: LIBSVM implementation (in C++);

2. L2-SVM: LSVM implementation (in MATLAB), with low-rank approximation (Fine & Scheinberg, 2001) of the kernel matrix added;

3. L2-SVM: RSVM (Lee & Mangasarian, 2001) implementation (in MATLAB). The RSVM addresses the scale-up issue by solving a smaller optimization problem that involves a random m̄ × m rectangular subset of the kernel matrix. Here, m̄ is set to 10% of m;

4. L1-SVM: LIBSVM implementation (in C++);

5. L1-SVM: SimpleSVM (Vishwanathan et al., 2003) implementation (in MATLAB).

⁶ Our CVM implementation can be downloaded from http://www.cs.ust.hk/~jamesk/cvm.zip. LIBSVM can be downloaded from http://www.csie.ntu.edu.tw/~cjlin/libsvm/; LSVM from http://www.cs.wisc.edu/dmi/lsvm; and SimpleSVM from http://asi.insa-rouen.fr/~gloosli/. Moreover, we followed http://www.csie.ntu.edu.tw/~cjlin/libsvm/faq.html in adapting the LIBSVM package for L2-SVM.

Parameters are used in their default settings unless otherwise specified. All experiments are performed on a 3.2GHz Pentium 4 machine with 512M RAM, running Windows XP. Since our focus is on nonlinear kernels, we use the Gaussian kernel k(x, y) = exp(−‖x − y‖²/β), with β = (1/m²) Σ_{i,j=1}^m ‖xᵢ − xⱼ‖².
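This kernel width can be computed in O(md) time without forming the m × m pairwise-distance matrix, using the identity Σᵢⱼ ‖xᵢ − xⱼ‖² = 2m Σᵢ ‖xᵢ‖² − 2‖Σᵢ xᵢ‖². The NumPy helpers below are ours, for illustration only; on very large sets one might also estimate β on a subsample.

import numpy as np

def gaussian_width(X):
    # beta = (1/m^2) * sum_{i,j} ||x_i - x_j||^2, via the expansion
    # sum_{i,j} ||x_i - x_j||^2 = 2m * sum_i ||x_i||^2 - 2 * ||sum_i x_i||^2.
    m = len(X)
    sq = float(np.sum(X * X))              # sum_i ||x_i||^2
    s = X.sum(axis=0)                      # sum_i x_i
    return (2.0 * m * sq - 2.0 * float(s @ s)) / (m * m)

def gaussian_kernel(X, Y, beta):
    # k(x, y) = exp(-||x - y||^2 / beta)
    d2 = (np.sum(X * X, axis=1)[:, None]
          + np.sum(Y * Y, axis=1)[None, :] - 2.0 * (X @ Y.T))
    return np.exp(-np.maximum(d2, 0.0) / beta)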
Our CVM implementation is adapted from LIBSVM, and uses SMO for each QP sub-problem in Section 4.1.4. As in LIBSVM, our CVM also uses caching (with the same cache size as in the other LIBSVM implementations above) and stores all training patterns in main memory. For simplicity, shrinking is not used in our current CVM implementation. Moreover, we employ probabilistic speedup (Section 4.1.2) and set ε = 10⁻⁶ in all the experiments. As in other decomposition methods, the use of a very stringent stopping criterion is not necessary in practice. Preliminary studies show that ε = 10⁻⁶ is acceptable for most tasks. Using an even smaller ε does not show improved generalization performance, but may increase the training time unnecessarily.

5.1 Checkerboard Data

We first experiment on the 4 × 4 checkerboard data used by Lee and Mangasarian (2001) for evaluating large-scale SVM implementations. We use training sets with a maximum of 1 million points and 2,000 independent points for testing. Of course, this problem does not need so many points for training, but it is convenient for illustrating the scaling properties. Experimentally, the L2-SVM with low-rank approximation does not yield satisfactory performance on this data set, and so its result is not reported here. RSVM, on the other hand, has to keep a rectangular kernel matrix of size m̄ × m and cannot be run on our machine when m exceeds 10K. Similarly, the SimpleSVM has to store the kernel matrix of the active set, and runs into storage problems when m exceeds 30K.

As can be seen from Figure 2, CVM is as accurate as the others. Besides, it is much faster⁷ and produces far fewer support vectors (which implies faster testing) on large data sets. In particular, one million patterns can be processed in under 13s. On the other hand, for relatively small training sets, with fewer than 10K patterns, LIBSVM is faster. This, however, is to be expected, as LIBSVM uses more sophisticated heuristics and so will be more efficient on small-to-medium sized data sets. Figure 2(b) also shows the core-set size, which can be seen to be small, and its curve basically overlaps with that of the CVM. Thus, almost all the core vectors are useful support vectors. Moreover, it also confirms our theoretical findings that both time and space are constant w.r.t. the training set size, when it is large enough.

⁷ As some implementations are in MATLAB, not all the CPU time measurements can be directly compared. However, it is still useful to note the constant scaling exhibited by the CVM and its speed advantage over the other C++ implementations, when the data set is large.
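For readers who want to reproduce the scaling experiment on synthetic data, a hypothetical generator is sketched below (ours; the exact data used by Lee and Mangasarian (2001) may be generated differently).

import numpy as np

def make_checkerboard(m, cells=4, seed=0):
    # Toy 4 x 4 checkerboard: points uniform on [0, cells)^2, labelled +1/-1
    # by the parity of the unit cell they fall into.
    rng = np.random.default_rng(seed)
    X = rng.uniform(0.0, cells, size=(m, 2))
    parity = (np.floor(X[:, 0]) + np.floor(X[:, 1])).astype(int) % 2
    y = np.where(parity == 0, 1, -1)
    return X, y

# X, y = make_checkerboard(1_000_000)   # 1M training points, as in Section 5.1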
[Figure 2: CPU time, number of support vectors (and core-set size), and testing accuracy of the various implementations (L2-SVM: CVM, LIBSVM, RSVM; L1-SVM: LIBSVM, SimpleSVM) on the checkerboard data, for training set sizes from 1K to 1M.]
5.2 Forest Cover Type Data⁸

This data set has been used for large scale SVM training by Collobert et al. (2002). Following (Collobert et al., 2002), we aim at separating class 2 from the other classes. 1% to 90% of the whole data set (with a maximum of 522,911 patterns) are used for training, while the remaining patterns are used for testing. We set β = 10000 for the Gaussian kernel. Preliminary studies show that the number of support vectors is over ten thousand. Consequently, RSVM and SimpleSVM cannot be run on our machine. Similarly, for low-rank approximation, preliminary studies show that thousands of basis vectors are required for a good approximation. Therefore, only the two LIBSVM implementations will be compared with the CVM here.

⁸ http://kdd.ics.uci.edu/databases/covertype/covertype.html

Figure 3 shows that CVM is, again, as accurate as the others. Note that when the training set is small, more training patterns bring in additional information useful for classification, and so the number of core vectors increases with the training set size. However, after processing around 100K patterns, both the time and space requirements of CVM begin to exhibit a constant scaling with the training set size. With hindsight, one might simply sample 100K training patterns and hope to obtain comparable results⁹. However, for satisfactory classification performance, different problems require samples of different sizes, and CVM has the important advantage that the required sample size does not have to be pre-specified. Without such prior knowledge, random sampling gives poor testing results, as has been demonstrated in (Lee & Mangasarian, 2001).

⁹ In fact, we tried both LIBSVM implementations on a random sample of 100K training patterns, but their testing accuracies are inferior to that of CVM.

5.3 Relatively Small Data Sets: UCI Adult Data¹⁰

Following (Platt, 1999), we use training sets with up to 32,562 patterns. As can be seen in Figure 4, CVM is still among the most accurate methods. However, as this data set is relatively small, more training patterns do carry more classification information. Hence, as discussed in Section 5.2, the number of iterations, the core-set size and consequently the CPU time all increase with the number of training patterns. From another perspective, recall that the worst-case core-set size is 2/ε, independent of m (Section 4.3). For the value of ε = 10⁻⁶ used here, 2/ε = 2 × 10⁶. Although we have seen that the actual size of the core-set is often much smaller than this worst-case value, when m ≪ 2/ε the number of core vectors can still be dependent on m. Moreover, as has been observed in Section 5.1, the CVM is slower than the more sophisticated LIBSVM on processing these smaller data sets.

¹⁰ http://research.microsoft.com/users/jplatt/smo.html

6 Conclusion

In this paper, we exploit the approximateness in SVM implementations. We formulate kernel methods as equivalent MEB problems, and then obtain provably approximately optimal solutions efficiently with the use of core-sets. The proposed CVM procedure is simple, and does not require sophisticated heuristics as in other decomposition methods. Moreover, despite its simplicity, CVM has small asymptotic time and space complexities. In particular, for a fixed ε, its asymptotic time complexity is linear in the training set size m, while its space complexity is independent of m. When probabilistic speedup is used, it even has constant asymptotic time and space complexities for a fixed ε, independent of the training set size m. Experimentally, on large data sets, it is much faster and produces far fewer support vectors (and thus faster testing) than existing methods. On the other hand, on relatively small data sets where m ≪ 2/ε, SMO can be faster. CVM can also be used for other kernel methods such as support vector regression, and details will be reported elsewhere.
[Figure 3: CPU time, number of support vectors (and core-set size), and testing accuracy of the CVM and the two LIBSVM implementations on the forest cover type data, for training set sizes up to about 500K.]

[Figure 4: CPU time (in seconds), number of support vectors (and core-set size), and testing accuracy of the various implementations (L2-SVM: CVM, LIBSVM, low rank, RSVM; L1-SVM: LIBSVM, SimpleSVM) on the UCI adult data, for training set sizes from 1000 to 30000.]
References

Badoiu, M., & Clarkson, K. (2002). Optimal core-sets for balls. DIMACS Workshop on Computational Geometry.

Cauwenberghs, G., & Poggio, T. (2001). Incremental and decremental support vector machine learning. Advances in Neural Information Processing Systems 13. Cambridge, MA: MIT Press.

Chang, C.-C., & Lin, C.-J. (2004). LIBSVM: a library for support vector machines. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

Chapelle, O., Vapnik, V., Bousquet, O., & Mukherjee, S. (2002). Choosing multiple parameters for support vector machines. Machine Learning, 46, 131-159.

Collobert, R., Bengio, S., & Bengio, Y. (2002). A parallel mixture of SVMs for very large scale problems. Neural Computation, 14, 1105-1114.

Fine, S., & Scheinberg, K. (2001). Efficient SVM training using low-rank kernel representation. Journal of Machine Learning Research, 2, 243-264.

Garey, M., & Johnson, D. (1979). Computers and intractability: A guide to the theory of NP-completeness. W.H. Freeman.

Joachims, T. (1999). Making large-scale support vector machine learning practical. In B. Scholkopf, C. Burges and A. Smola (Eds.), Advances in kernel methods: Support vector learning, 169-184. Cambridge, MA: MIT Press.

Kumar, P., Mitchell, J., & Yildirim, A. (2003). Approximate minimum enclosing balls in high dimensions using core-sets. ACM Journal of Experimental Algorithmics, 8.

Lee, Y.-J., & Mangasarian, O. (2001). RSVM: Reduced support vector machines. Proceedings of the First SIAM International Conference on Data Mining.

Lin, K.-M., & Lin, C.-J. (2003). A study on reduced support vector machines. IEEE Transactions on Neural Networks, 14, 1449-1459.

Mangasarian, O., & Musicant, D. (2001). Lagrangian support vector machines. Journal of Machine Learning Research, 1, 161-177.

Osuna, E., Freund, R., & Girosi, F. (1997). Training support vector machines: an application to face detection. Proceedings of Computer Vision and Pattern Recognition (pp. 130-136). San Juan, Puerto Rico.

Platt, J. (1999). Fast training of support vector machines using sequential minimal optimization. In B. Scholkopf, C. Burges and A. Smola (Eds.), Advances in kernel methods: Support vector learning, 185-208. Cambridge, MA: MIT Press.

Scholkopf, B., Platt, J., Shawe-Taylor, J., Smola, A., & Williamson, R. (2001). Estimating the support of a high-dimensional distribution. Neural Computation, 13, 1443-1471.

Smola, A., & Scholkopf, B. (2000). Sparse greedy matrix approximation for machine learning. Proceedings of the Seventeenth International Conference on Machine Learning (pp. 911-918). Stanford, CA, USA.

Smola, A., & Scholkopf, B. (2004). A tutorial on support vector regression. Statistics and Computing, 14, 199-222.

Tax, D., & Duin, R. (1999). Support vector domain description. Pattern Recognition Letters, 20, 1191-1199.

Vishwanathan, S., Smola, A., & Murty, M. (2003). SimpleSVM. Proceedings of the Twentieth International Conference on Machine Learning (pp. 760-767). Washington, D.C., USA.

Williams, C., & Seeger, M. (2001). Using the Nystrom method to speed up kernel machines. Advances in Neural Information Processing Systems 13. Cambridge, MA: MIT Press.