
A User's Guide to Support Vector Machines

Asa Ben-Hur
Department of Computer Science
Colorado State University

Jason Weston
NEC Labs America
Princeton, NJ 08540 USA

Abstract

The Support Vector Machine (SVM) is a widely used classifier. And yet, obtaining the best results with SVMs requires an understanding of their workings and of the various ways a user can influence their accuracy. We provide the user with a basic understanding of the theory behind SVMs and focus on their use in practice. We describe the effect of the SVM parameters on the resulting classifier, how to select good values for those parameters, data normalization, factors that affect training time, and software for training SVMs.

1 Introduction

The Support Vector Machine (SVM) is a state-of-the-art classification method introduced in 1992 by Boser, Guyon, and Vapnik [1]. The SVM classifier is widely used in bioinformatics (and other disciplines) due to its high accuracy, ability to deal with high-dimensional data such as gene expression, and flexibility in modeling diverse sources of data [2]. SVMs belong to the general category of kernel methods [4, 5]. A kernel method is an algorithm that depends on the data only through dot-products. When this is the case, the dot product can be replaced by a kernel function which computes a dot product in some possibly high dimensional feature space. This has two advantages: first, the ability to generate non-linear decision boundaries using methods designed for linear classifiers; second, the use of kernel functions allows the user to apply a classifier to data that have no obvious fixed-dimensional vector space representation. The prime examples of such data in bioinformatics are sequences, either DNA or protein, and protein structure.

Using SVMs effectively requires an understanding of how they work. When training an SVM the practitioner needs to make a number of decisions: how to preprocess the data, what kernel to use, and finally, how to set the parameters of the SVM and the kernel. Uninformed choices may result in severely reduced performance [6]. We aim to provide the user with an intuitive understanding of these choices and provide general usage guidelines. All the examples shown were generated using the PyML machine learning environment, which focuses on kernel methods and SVMs, and is available at http://pyml.sourceforge.net. PyML is just one of several software packages that provide SVM training methods; an incomplete listing of these is provided in Section 9. More information is found on the Machine Learning Open Source Software website, http://mloss.org, and a related paper [7].

Figure 1: A linear classifier. The decision boundary (points x such that w^T x + b = 0) divides the plane into two sets depending on the sign of w^T x + b.

2 Preliminaries: Linear Classifiers

Support vector machines are an example of a linear two-class classifier. This section explains what that means. The data for a two class learning problem consists of objects labeled with one of two labels corresponding to the two classes; for convenience we assume the labels are +1 (positive examples) or -1 (negative examples). In what follows boldface x denotes a vector with components x_i. The notation x_i will denote the i-th vector in a dataset {(x_i, y_i)}_{i=1}^n, where y_i is the label associated with x_i. The objects x_i are called patterns or examples. We assume the examples belong to some set X. Initially we assume the examples are vectors, but once we introduce kernels this assumption will be relaxed, at which point they could be any continuous/discrete object (e.g. a protein/DNA sequence or protein structure).

A key concept required for defining a linear classifier is the dot product between two vectors, also referred to as an inner product or scalar product, defined as w^T x = Σ_i w_i x_i. A linear classifier is based on a linear discriminant function of the form

    f(x) = w^T x + b.    (1)

The vector w is known as the weight vector, and b is called the bias. Consider the case b = 0 first. The set of points x such that w^T x = 0 are all the points that are perpendicular to w and go through the origin: a line in two dimensions, a plane in three dimensions, and more generally, a hyperplane. The bias b translates the hyperplane away from the origin. The hyperplane

    {x : f(x) = w^T x + b = 0}    (2)

divides the space into two: the sign of the discriminant function f(x) denotes the side of the hyperplane a point is on (see Figure 1). The boundary between regions classified as positive and negative

is called the decision boundary of the classifier. The decision boundary defined by a hyperplane (c.f. Equation 1) is said to be linear because it is linear in the input examples. A classifier with a linear decision boundary is called a linear classifier. Conversely, when the decision boundary of a classifier depends on the data in a non-linear way (see Figure 4 for example) the classifier is said to be non-linear.
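As a concrete illustration of Equation (1), the following minimal Python sketch (added here, not part of the original text) evaluates a linear discriminant function on a few two-dimensional points and classifies each by the sign of f(x); the weight vector and bias are arbitrary values chosen only for the example.

    import numpy as np

    # Arbitrary weight vector and bias chosen for illustration.
    w = np.array([2.0, -1.0])
    b = 0.5

    def discriminant(x, w, b):
        """Linear discriminant f(x) = w^T x + b (Equation 1)."""
        return np.dot(w, x) + b

    points = np.array([[1.0, 1.0], [-1.0, 2.0], [0.0, 0.0]])
    for x in points:
        f = discriminant(x, w, b)
        label = 1 if f > 0 else -1
        print(x, "f(x) = %.2f" % f, "predicted label:", label)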

3 Kernels: from Linear to Non-Linear Classifiers

In many applications a non-linear classifier provides better accuracy. And yet, linear classifiers have advantages, one of them being that they often have simple training algorithms that scale well with the number of examples [9, 10]. This begs the question: can the machinery of linear classifiers be extended to generate non-linear decision boundaries? Furthermore, can we handle domains such as protein sequences or structures where a representation in a fixed dimensional vector space is not available?

The naive way of making a non-linear classifier out of a linear classifier is to map our data from the input space X to a feature space F using a non-linear function φ : X → F. In the space F the discriminant function is:

    f(x) = w^T φ(x) + b.    (3)

Example 1 Consider the case of a two dimensional input space with the mapping

    φ(x) = (x_1^2, √2 x_1 x_2, x_2^2)^T,

which represents a vector in terms of all degree-2 monomials. In this case

    w^T φ(x) = w_1 x_1^2 + √2 w_2 x_1 x_2 + w_3 x_2^2,

resulting in a decision boundary for the classifier, f(x) = w^T φ(x) + b = 0, which is a conic section (e.g., an ellipse or hyperbola). The added flexibility of considering degree-2 monomials is illustrated in Figure 4 in the context of SVMs.

The approach of explicitly computing non-linear features does not scale well with the number of input features: when applying the mapping from the above example the dimensionality of the feature space F is quadratic in the dimensionality of the original space. This results in a quadratic increase in memory usage for storing the features and a quadratic increase in the time required to compute the discriminant function of the classifier. This quadratic complexity is feasible for low dimensional data; but when handling gene expression data that can have thousands of dimensions, quadratic complexity in the number of dimensions is not acceptable. Kernel methods solve this issue by avoiding the step of explicitly mapping the data to a high dimensional feature space.

Suppose the weight vector can be expressed as a linear combination of the training examples, i.e. w = Σ_{i=1}^n α_i x_i. Then:

    f(x) = Σ_{i=1}^n α_i x_i^T x + b.

In the feature space F this expression takes the form:

    f(x) = Σ_{i=1}^n α_i φ(x_i)^T φ(x) + b.

The representation of the decision boundary in terms of the variables α_i is known as the dual representation. As indicated above, the feature space F may be high dimensional, making this trick impractical unless the kernel function k(x, x') defined as

    k(x, x') = φ(x)^T φ(x')

can be computed efficiently. In terms of the kernel function the discriminant function is:

    f(x) = Σ_{i=1}^n α_i k(x, x_i) + b.    (4)

Example 2 Let's go back to the example of φ(x) = (x_1^2, √2 x_1 x_2, x_2^2)^T, and show the kernel associated with this mapping:

    φ(x)^T φ(z) = (x_1^2, √2 x_1 x_2, x_2^2)^T (z_1^2, √2 z_1 z_2, z_2^2)
                = x_1^2 z_1^2 + 2 x_1 x_2 z_1 z_2 + x_2^2 z_2^2
                = (x^T z)^2.

This shows that the kernel can be computed without explicitly computing the mapping φ. The above example leads us to the definition of the degree-d polynomial kernel:

    k(x, x') = (x^T x' + 1)^d.    (5)

The feature space for this kernel consists of all monomials of degree up to d, i.e. features of the form x_1^{d_1} x_2^{d_2} ⋯ x_m^{d_m} where Σ_i d_i ≤ d. The kernel with d = 1 is the linear kernel, and in that case the additive constant in Equation (5) is usually omitted. The increasing flexibility of the classifier as the degree of the polynomial is increased is illustrated in Figure 4. The other widely used kernel is the Gaussian kernel defined by:

    k(x, x') = exp(-γ ||x - x'||^2),    (6)

where γ > 0 is a parameter that controls the width of the Gaussian. It plays a similar role as the degree of the polynomial kernel in controlling the flexibility of the resulting classifier (see Figure 5).

We saw that a linear decision boundary can be "kernelized", i.e. its dependence on the data is only through dot products. In order for this to be useful, the training algorithm needs to be kernelizable as well. It turns out that a large number of machine learning algorithms can be expressed using kernels, including ridge regression, the perceptron algorithm, and SVMs [5, 8].
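The kernel trick in Example 2 can be checked numerically. The sketch below (an illustration we add, not from the original text) compares the explicit degree-2 feature map φ(x) = (x_1^2, √2 x_1 x_2, x_2^2) with the homogeneous polynomial kernel k(x, z) = (x^T z)^2 on random vectors; both routes give the same value.

    import numpy as np

    def phi(x):
        """Explicit degree-2 feature map from Example 2."""
        return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

    def poly2_kernel(x, z):
        """Homogeneous degree-2 polynomial kernel k(x, z) = (x^T z)^2."""
        return np.dot(x, z) ** 2

    rng = np.random.default_rng(0)
    x, z = rng.normal(size=2), rng.normal(size=2)

    explicit = np.dot(phi(x), phi(z))   # dot product computed in feature space
    implicit = poly2_kernel(x, z)       # kernel computed in input space
    print(explicit, implicit)           # the two values agree
    assert np.isclose(explicit, implicit)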

4 Large Margin Classification

In what follows we use the term linearly separable to denote data for which there exists a linear decision boundary that separates positive from negative examples (see Figure 2). Initially we will assume linearly separable data, and later indicate how to handle data that is not linearly separable.

Figure 2: A linear SVM. The circled data points are the support vectors, the examples that are closest to the decision boundary. They determine the margin with which the two classes are separated.

4.1 The Geometric Margin


In this section we define the notion of a margin. For a given hyperplane we denote by x_+ (x_-) the closest point to the hyperplane among the positive (negative) examples. The norm of a vector w, denoted by ||w||, is its length, and is given by sqrt(w^T w). A unit vector ŵ in the direction of w is given by w/||w|| and satisfies ||ŵ|| = 1. From simple geometric considerations the margin of a hyperplane f with respect to a dataset D can be seen to be:

    m_D(f) = (1/2) ŵ^T (x_+ - x_-),    (7)

where ŵ is a unit vector in the direction of w, and we assume that x_+ and x_- are equidistant from the decision boundary, i.e.

    f(x_+) = w^T x_+ + b = a
    f(x_-) = w^T x_- + b = -a    (8)

for some constant a > 0. Note that multiplying our data points by a fixed number will increase the margin by the same amount, whereas in reality, the margin hasn't really changed; we just changed the "units" with which we measure it. To make the geometric margin meaningful we fix the value of the decision function at the points closest to the hyperplane, and set a = 1 in Eqn. (8). Adding the two equations and dividing by ||w|| we obtain:

    m_D(f) = (1/2) ŵ^T (x_+ - x_-) = 1/||w||.    (9)
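To make Equation (9) concrete, here is a small numerical sketch (our own illustration; the toy data and the use of scikit-learn rather than the paper's PyML are assumptions of this example). It fits an approximately hard-margin linear SVM on a separable toy dataset and checks that the geometric margin equals 1/||w||, with the support vectors sitting at discriminant value ±1.

    import numpy as np
    from sklearn.svm import SVC

    # Toy linearly separable data: two clusters separated along the first coordinate.
    X = np.array([[2.0, 1.0], [3.0, 2.0], [2.5, -1.0],
                  [-2.0, -1.0], [-3.0, 0.5], [-2.5, 1.5]])
    y = np.array([1, 1, 1, -1, -1, -1])

    # A very large C approximates the hard-margin SVM of Eqn. (10).
    clf = SVC(kernel='linear', C=1e6).fit(X, y)

    w = clf.coef_[0]
    margin = 1.0 / np.linalg.norm(w)   # geometric margin, Eqn. (9)
    print("w =", w, "geometric margin =", margin)

    # The support vectors lie on the margin: |w^T x + b| is (close to) 1.
    print(np.abs(X[clf.support_].dot(w) + clf.intercept_[0]))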

4.2 Support Vector Machines


Now that we have the concept of a margin we can formulate the maximum margin classifier. We will first define the hard margin SVM, applicable to a linearly separable dataset, and then modify it to handle non-separable data.

The maximum margin classifier is the discriminant function that maximizes the geometric margin 1/||w||, which is equivalent to minimizing ||w||^2. This leads to the following constrained optimization problem:

    minimize_{w,b}  (1/2) ||w||^2
    subject to:  y_i (w^T x_i + b) ≥ 1,  i = 1, ..., n.    (10)

The constraints in this formulation ensure that the maximum margin classifier classifies each example correctly, which is possible since we assumed that the data is linearly separable. In practice, data is often not linearly separable; and even if it is, a greater margin can be achieved by allowing the classifier to misclassify some points. To allow errors we replace the inequality constraints in Eqn. (10) with

    y_i (w^T x_i + b) ≥ 1 - ξ_i,  i = 1, ..., n,

where ξ_i ≥ 0 are slack variables that allow an example to be in the margin (0 ≤ ξ_i ≤ 1, also called a margin error) or to be misclassified (ξ_i > 1). Since an example is misclassified if the value of its slack variable is greater than 1, Σ_i ξ_i is a bound on the number of misclassified examples. Our objective of maximizing the margin, i.e. minimizing (1/2) ||w||^2, will be augmented with a term C Σ_i ξ_i to penalize misclassification and margin errors. The optimization problem now becomes:

    minimize_{w,b}  (1/2) ||w||^2 + C Σ_{i=1}^n ξ_i
    subject to:  y_i (w^T x_i + b) ≥ 1 - ξ_i,  ξ_i ≥ 0.    (11)

The constant C > 0 sets the relative importance of maximizing the margin and minimizing the amount of slack. This formulation is called the soft-margin SVM, and was introduced by Cortes and Vapnik [11]. Using the method of Lagrange multipliers, we can obtain the dual formulation which is expressed in terms of variables α_i [11, 5, 8]:

    maximize_α  Σ_{i=1}^n α_i - (1/2) Σ_{i=1}^n Σ_{j=1}^n y_i y_j α_i α_j x_i^T x_j
    subject to:  Σ_{i=1}^n y_i α_i = 0,  0 ≤ α_i ≤ C.    (12)

The dual formulation leads to an expansion of the weight vector in terms of the input examples:

    w = Σ_{i=1}^n y_i α_i x_i.    (13)

The examples x_i for which α_i > 0 are those points that are on the margin, or within the margin when a soft-margin SVM is used. These are the so-called support vectors. The expansion in terms of the support vectors is often sparse, and the level of sparsity (fraction of the data serving as support vectors) is an upper bound on the error rate of the classifier [5].

The dual formulation of the SVM optimization problem depends on the data only through dot products. The dot product can therefore be replaced with a non-linear kernel function, thereby

performing large margin separation in the feature-space of the kernel (see Figures 4 and 5). The SVM optimization problem was traditionally solved in the dual formulation, and only recently was it shown that the primal formulation, Equation (11), can lead to efficient kernel-based learning [12]. Details on software for training SVMs are provided in Section 9.
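As an illustration of the dual expansion (Equations 4 and 13), the sketch below trains a soft-margin SVM with a Gaussian kernel using scikit-learn (the paper's own examples use PyML; scikit-learn and the synthetic dataset are our substitutions) and reconstructs the discriminant function from the support vectors and their dual coefficients.

    import numpy as np
    from sklearn.datasets import make_circles
    from sklearn.svm import SVC

    X, y = make_circles(n_samples=200, noise=0.1, factor=0.4, random_state=0)

    gamma, C = 1.0, 10.0
    clf = SVC(kernel='rbf', gamma=gamma, C=C).fit(X, y)
    print("number of support vectors:", clf.support_vectors_.shape[0])

    # Reconstruct f(x) = sum_i alpha_i y_i k(x, x_i) + b (Equation 4);
    # scikit-learn stores the products alpha_i * y_i in dual_coef_.
    def decision(x):
        k = np.exp(-gamma * np.sum((clf.support_vectors_ - x) ** 2, axis=1))
        return np.dot(clf.dual_coef_[0], k) + clf.intercept_[0]

    x0 = X[0]
    print(decision(x0), clf.decision_function([x0])[0])  # the two values should match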

5 Understanding the Effects of SVM and Kernel Parameters

Training an SVM finds the large margin hyperplane, i.e. sets the parameters α_i and b (c.f. Equation 4). The SVM has another set of parameters called hyperparameters: the soft margin constant, C, and any parameters the kernel function may depend on (the width of a Gaussian kernel or the degree of a polynomial kernel). In this section we illustrate the effect of the hyperparameters on the decision boundary of an SVM using two-dimensional examples.

We begin our discussion of hyperparameters with the soft-margin constant, whose role is illustrated in Figure 3. For a large value of C a large penalty is assigned to errors/margin errors. This is seen in the left panel of Figure 3, where the two points closest to the hyperplane affect its orientation, resulting in a hyperplane that comes close to several other data points. When C is decreased (right panel of the figure), those points become margin errors; the hyperplane's orientation is changed, providing a much larger margin for the rest of the data.

Kernel parameters also have a significant effect on the decision boundary. The degree of the polynomial kernel and the width parameter of the Gaussian kernel control the flexibility of the resulting classifier (Figures 4 and 5). The lowest degree polynomial is the linear kernel, which is not sufficient when a non-linear relationship between features exists. For the data in Figure 4 a degree-2 polynomial is already flexible enough to discriminate between the two classes with a sizable margin. The degree-5 polynomial yields a similar decision boundary, albeit with greater curvature.

Next we turn our attention to the Gaussian kernel: k(x, x') = exp(-γ ||x - x'||^2). This expression is essentially zero if the distance between x and x' is much larger than 1/√γ; i.e. for a fixed x' it is localized to a region around x'. The support vector expansion, Equation (4), is thus a sum of Gaussian "bumps" centered around each support vector. When γ is small (top left panel in Figure 5) a given data point x has a non-zero kernel value relative to any example in the set of support vectors. Therefore the whole set of support vectors affects the value of the discriminant function at x, resulting in a smooth decision boundary. As γ is increased the locality of the support vector expansion increases, leading to greater curvature of the decision boundary. When γ is large the value of the discriminant function is essentially constant outside the close proximity of the region where the data are concentrated (see bottom right panel in Figure 5). In this regime of the γ parameter the classifier is clearly overfitting the data.

As seen from the examples in Figures 4 and 5, the γ parameter of the Gaussian kernel and the degree of the polynomial kernel determine the flexibility of the resulting SVM in fitting the data. If this complexity parameter is too large, overfitting will occur (bottom panels in Figure 5).

A question frequently posed by practitioners is "which kernel should I use for my data?" There are several answers to this question. The first is that it is, like most practical questions in machine learning, data-dependent, so several kernels should be tried. That being said, we typically follow this procedure: try a linear kernel first, and then see if we can improve on its performance using a non-linear kernel. The linear kernel provides a useful baseline, and in many bioinformatics applications provides the best results.
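The overfitting behavior described above can be reproduced numerically. The following sketch (our illustration, using scikit-learn and a synthetic dataset rather than the data behind Figures 3-5) trains Gaussian-kernel SVMs over a range of γ values for a fixed C and reports training and test accuracy; very large γ drives training accuracy toward 1 while test accuracy drops, the overfitting regime shown in Figure 5.

    from sklearn.datasets import make_moons
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    X, y = make_moons(n_samples=300, noise=0.3, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    for gamma in [0.1, 1, 10, 100]:
        clf = SVC(kernel='rbf', gamma=gamma, C=10).fit(X_tr, y_tr)
        # Training accuracy rises with gamma, while test accuracy eventually falls.
        print("gamma=%-5s train acc=%.2f test acc=%.2f n_SV=%d"
              % (gamma, clf.score(X_tr, y_tr), clf.score(X_te, y_te),
                 len(clf.support_)))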

Figure 3: The effect of the soft-margin constant, C, on the decision boundary (left panel: C = 100; right panel: C = 10). A smaller value of C (right) allows the classifier to ignore points close to the boundary, and increases the margin. The decision boundary between negative examples (red circles) and positive examples (blue crosses) is shown as a thick line. The lighter lines are on the margin (discriminant value equal to -1 or +1). The grayscale level represents the value of the discriminant function, a dark shade for low values and a light shade for high values.

Figure 4: The effect of the degree of a polynomial kernel (panels: linear kernel, polynomial of degree 2, polynomial of degree 5). Higher degree polynomial kernels allow a more flexible decision boundary. The figure style follows that of Figure 3.

Figure 5: The effect of the inverse-width parameter of the Gaussian kernel (γ) for a fixed value of the soft-margin constant (panels: γ = 0.1, 1, 10, 100). For small values of γ (upper left) the decision boundary is nearly linear. As γ increases the flexibility of the decision boundary increases. Large values of γ lead to overfitting (bottom). The figure style follows that of Figure 3.


Figure 6: SVM accuracy on a grid of parameter values.

The flexibility of the Gaussian and polynomial kernels often leads to overfitting in high dimensional datasets with a small number of examples, microarray datasets being a good example. Furthermore, an SVM with a linear kernel is easier to tune since the only parameter that affects performance is the soft-margin constant. Once a result using a linear kernel is available it can serve as a baseline that you can try to improve upon using a non-linear kernel. Between the Gaussian and polynomial kernels, our experience shows that the Gaussian kernel usually outperforms the polynomial kernel in both accuracy and convergence time.
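The "linear kernel first" procedure can be followed with a few lines of code. This sketch (added for illustration; the dataset and the use of scikit-learn are our choices, not the paper's) estimates cross-validated accuracy of a linear-kernel SVM as a baseline and then checks whether a Gaussian kernel improves on it.

    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    X, y = load_breast_cancer(return_X_y=True)

    # Baseline: linear kernel; challenger: Gaussian kernel with default width.
    linear = make_pipeline(StandardScaler(), SVC(kernel='linear', C=1))
    rbf = make_pipeline(StandardScaler(), SVC(kernel='rbf', C=1, gamma='scale'))

    print("linear baseline accuracy:", cross_val_score(linear, X, y, cv=5).mean())
    print("Gaussian kernel accuracy:", cross_val_score(rbf, X, y, cv=5).mean())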

6 Model Selection

The dependence of the SVM decision boundary on the SVM hyperparameters translates into a dependence of classifier accuracy on the hyperparameters. When working with a linear classifier the only hyperparameter that needs to be tuned is the SVM soft-margin constant. For the polynomial and Gaussian kernels the search space is two-dimensional. The standard method of exploring this two dimensional space is via grid-search; the grid points are generally chosen on a logarithmic scale and classifier accuracy is estimated for each point on the grid. This is illustrated in Figure 6. A classifier is then trained using the hyperparameters that yield the best accuracy on the grid.

The accuracy landscape in Figure 6 has an interesting property: there is a range of parameter values that yield optimal classifier performance; furthermore, these equivalent points in parameter space fall along a "ridge" in parameter space. This phenomenon can be understood as follows. Consider a particular value of (γ, C). If we decrease the value of γ, this decreases the curvature of the decision boundary; if we then increase the value of C the decision boundary is forced to curve to accommodate the larger penalty for errors/margin errors. This is illustrated in Figure 7 for two dimensional data.
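A grid search over (C, γ) on a logarithmic scale, as described above, might look like the following sketch (our illustration; the paper's experiments use PyML, here we use scikit-learn's GridSearchCV with cross-validated accuracy as the estimate at each grid point).

    import numpy as np
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    X, y = load_breast_cancer(return_X_y=True)

    # Logarithmically spaced grid for the soft-margin constant and the kernel width.
    param_grid = {
        'C': np.logspace(-2, 3, 6),
        'gamma': np.logspace(-4, 1, 6),
    }

    search = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=5)
    search.fit(X, y)

    print("best parameters:", search.best_params_)
    print("best cross-validated accuracy: %.3f" % search.best_score_)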


Figure 7: Similar decision boundaries can be obtained using different combinations of SVM hyperparameters (panels: γ = 1, C = 10; γ = 0.2, C = 100; γ = 0.04, C = 1000). The values of C and γ are indicated on each panel and the figure style follows that of Figure 3.


7 SVMs for Unbalanced Data

Many datasets encountered in bioinformatics and other areas of application are unbalanced, i.e. one class contains a lot more examples than the other. Unbalanced datasets can present a challenge when training a classifier and SVMs are no exception; see [13] for a general overview of the issue. A good strategy for producing a high-accuracy classifier on unbalanced data is to classify any example as belonging to the majority class; this is called the majority-class classifier. While highly accurate under the standard measure of accuracy, such a classifier is not very useful. When presented with an unbalanced dataset that is not linearly separable, an SVM that follows the formulation of Eqn. (11) will often produce a classifier that behaves similarly to the majority-class classifier. An illustration of this phenomenon is provided in Figure 8.

The crux of the problem is that the standard notion of accuracy (the success rate, or fraction of correctly classified examples) is not a good way to measure the success of a classifier applied to unbalanced data, as is evident by the fact that the majority-class classifier performs well under it. The problem with the success rate is that it assigns equal importance to errors made on examples belonging to the majority class and errors made on examples belonging to the minority class. To correct for the imbalance in the data we need to assign different costs for misclassification to each class. Before introducing the balanced success rate we note that the success rate can be expressed as:

    P(success|+) P(+) + P(success|-) P(-),

where P(success|+) (P(success|-)) is an estimate of the probability of success in classifying positive (negative) examples, and P(+) (P(-)) is the fraction of positive (negative) examples. The balanced success rate modifies this expression to:

    BSR = (P(success|+) + P(success|-)) / 2,

which averages the success rates in each class. The majority-class classifier will have a balanced success rate of 0.5. A balanced error-rate is defined as 1 - BSR. The BSR, as opposed to the standard success rate, gives equal overall weight to each class in measuring performance. A similar effect is obtained in training SVMs by assigning different misclassification costs (SVM soft-margin constants) to each class. The total misclassification cost, C Σ_{i=1}^n ξ_i, is replaced with two terms, one for each class:

    C Σ_{i=1}^n ξ_i  →  C_+ Σ_{i ∈ I_+} ξ_i + C_- Σ_{i ∈ I_-} ξ_i,

where C_+ (C_-) is the soft-margin constant for the positive (negative) examples and I_+ (I_-) are the sets of positive (negative) examples. To give equal overall weight to each class we want the total penalty for each class to be equal. Assuming that the number of misclassified examples from each class is proportional to the number of examples in each class, we choose C_+ and C_- such that

    C_+ n_+ = C_- n_-,

where n_+ (n_-) is the number of positive (negative) examples. Or in other words:

    C_+ / C_- = n_- / n_+.

This provides a method for setting the ratio between the soft-margin constants of the two classes, leaving one parameter that needs to be adjusted. This method for handling unbalanced data is implemented in several SVM software packages, e.g. LIBSVM [14] and PyML.
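The class-dependent soft-margin constants described above are available in most SVM packages. The sketch below (an illustration using scikit-learn, which exposes the same idea through per-class weights on C) sets C_+/C_- inversely proportional to the class sizes and evaluates with the balanced success rate rather than the plain success rate; the dataset is synthetic and chosen only for the example.

    from sklearn.datasets import make_classification
    from sklearn.metrics import accuracy_score, balanced_accuracy_score
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    # Unbalanced two-class problem: roughly 95% negatives, 5% positives.
    X, y = make_classification(n_samples=1000, weights=[0.95, 0.05],
                               random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

    # Single soft-margin constant vs. per-class constants C_j proportional to 1/n_j.
    plain = SVC(kernel='linear', C=1).fit(X_tr, y_tr)
    weighted = SVC(kernel='linear', C=1, class_weight='balanced').fit(X_tr, y_tr)

    for name, clf in [("equal C", plain), ("C by class size", weighted)]:
        pred = clf.predict(X_te)
        print(name,
              "success rate: %.2f" % accuracy_score(y_te, pred),
              "balanced success rate: %.2f" % balanced_accuracy_score(y_te, pred))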

Figure 8: When data is unbalanced and a single soft-margin constant is used, the resulting classifier (left panel, "equal C") will tend to classify any example to the majority class. The solution (right panel, "C according to class size") is to assign a different soft-margin constant to each class (see text for details). The figure style follows that of Figure 3.


8 Normalization

Large margin classifiers are known to be sensitive to the way features are scaled [14]. Therefore it is essential to normalize either the data or the kernel itself. This observation carries over to kernel-based classifiers that use non-linear kernel functions: the accuracy of an SVM can severely degrade if the data is not normalized [14]. Some sources of data, e.g. microarray or mass-spectrometry data, require normalization methods that are technology-specific. In what follows we only consider normalization methods that are applicable regardless of the method that generated the data.

Normalization can be performed at the level of the input features or at the level of the kernel (normalization in feature space). In many applications the available features are continuous values, where each feature is measured in a different scale and has a different range of possible values. In such cases it is often beneficial to scale all features to a common range, e.g. by standardizing the data (for each feature, subtracting its mean and dividing by its standard deviation). Standardization is not appropriate when the data is sparse since it destroys sparsity, as each feature will typically have a different normalization constant. Another way to handle features with different ranges is to bin each feature and replace it with indicator variables that indicate which bin it falls in.

An alternative to normalizing each feature separately is to normalize each example to be a unit vector. If the data is explicitly represented as vectors you can normalize the data by dividing each vector by its norm such that ||x|| = 1 after normalization. Normalization can also be performed at the level of the kernel, i.e. normalizing in feature-space, leading to ||Φ(x)|| = 1 (or equivalently k(x, x) = 1). This is accomplished using the cosine kernel which normalizes a kernel k(x, x') to:

    k_cosine(x, x') = k(x, x') / sqrt(k(x, x) k(x', x')).    (14)

Note that for the linear kernel cosine normalization is equivalent to division by the norm. The use of the cosine kernel is redundant for the Gaussian kernel since it already satisfies k(x, x) = 1. This does not mean that normalization of the input features to unit vectors is redundant: our experience shows that the Gaussian kernel often benefits from it. Normalizing data to unit vectors reduces the dimensionality of the data by one since the data is projected to the unit sphere. Therefore this may not be a good idea for low dimensional data.
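The three normalization options discussed here (standardizing features, normalizing examples to unit vectors, and cosine-normalizing a kernel matrix as in Equation 14) can be sketched as follows. This is an illustration we add, using numpy and scikit-learn preprocessing utilities on randomly generated data with features of very different scales.

    import numpy as np
    from sklearn.preprocessing import StandardScaler, normalize

    rng = np.random.default_rng(0)
    # Three features with very different scales.
    X = rng.normal(loc=5.0, scale=[1.0, 10.0, 100.0], size=(50, 3))

    # 1. Standardization: per feature, subtract the mean and divide by the std.
    X_std = StandardScaler().fit_transform(X)

    # 2. Normalize each example to a unit vector: ||x|| = 1.
    X_unit = normalize(X)

    # 3. Cosine normalization of a kernel matrix, Equation (14).
    def cosine_normalize(K):
        d = np.sqrt(np.diag(K))
        return K / np.outer(d, d)

    K = X @ X.T                     # linear kernel matrix
    K_cos = cosine_normalize(K)     # equals the linear kernel on the unit vectors
    print(np.allclose(K_cos, X_unit @ X_unit.T))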

9 SVM Training Algorithms and Software

The popularity of SVMs has led to the development of a large number of special purpose solvers for the SVM optimization problem [15]. One of the most common SVM solvers is LIBSVM [14]. The complexity of training non-linear SVMs with solvers such as LIBSVM has been estimated to be quadratic in the number of training examples [15], which can be prohibitive for datasets with hundreds of thousands of examples. Researchers have therefore explored ways to achieve faster training times. For linear SVMs very efficient solvers are available which converge in a time which is linear in the number of examples [16, 17, 15]. Approximate solvers that can be trained in linear time without a significant loss of accuracy were also developed [18].

There are two types of software that provide SVM training algorithms. The first type is specialized software whose main objective is to provide an SVM solver. LIBSVM [14] and SVMlight [19] are two popular examples of this class of software. The other class of software is machine learning libraries that provide a variety of classification methods and other facilities such as methods

for feature selection, preprocessing etc. The user has a large number of choices, and the following is an incomplete list of environments that provide an SVM classifier: Orange [20], The Spider (http://www.kyb.tuebingen.mpg.de/bs/people/spider), Elefant [21], Plearn (http://plearn.berlios.de/), Weka [22], Lush [23], Shogun [24], RapidMiner [25], and PyML (http://pyml.sourceforge.net). The SVM implementations in several of these are wrappers for the LIBSVM library. A repository of machine learning open source software is available at http://mloss.org as part of a movement advocating distribution of machine learning algorithms as open source software [7].

10 Further Reading

We focused on the practical issues in using support vector machines to classify data that is already provided as features in some fixed-dimensional vector space. In bioinformatics we often encounter data that has no obvious explicit embedding in a fixed-dimensional vector space, e.g. protein or DNA sequences, protein structures, protein interaction networks etc. Researchers have developed a variety of ways in which to model such data with kernel methods. See [2, 8] for more details. The design of a good kernel, i.e. defining a set of features that make the classification task easy, is where most of the gains in classification accuracy can be obtained.

After having defined a set of features it is instructive to perform feature selection: remove features that do not contribute to the accuracy of the classifier [26, 27]. In our experience feature selection doesn't usually improve the accuracy of SVMs. Its importance is mainly in obtaining better understanding of the data; SVMs, like many other classifiers, are "black boxes" that do not provide the user much information on why a particular prediction was made. Reducing the set of features to a small salient set can help in this regard. Several successful feature selection methods have been developed specifically for SVMs and kernel methods. The Recursive Feature Elimination (RFE) method, for example, iteratively removes features that correspond to components of the SVM weight vector that are smallest in absolute value; such features have less of a contribution to the classification and are therefore removed [28].

SVMs are two-class classifiers. Solving multi-class problems can be done with multi-class extensions of SVMs [29]. These are computationally expensive, so the practical alternative is to convert a two-class classifier to a multi-class classifier. The standard method for doing so is the so-called one-vs-the-rest approach where for each class a classifier is trained for that class against the rest of the classes; an input is classified according to which classifier produces the largest discriminant function value. Despite its simplicity, it remains the method of choice [30].
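A one-vs-the-rest multi-class SVM can be assembled directly from two-class SVMs. The sketch below is our illustration using scikit-learn (a library choice not made by the paper); OneVsRestClassifier trains one binary classifier per class and predicts the class whose classifier gives the largest discriminant value, which is the scheme described above.

    from sklearn.datasets import load_iris
    from sklearn.model_selection import cross_val_score
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.svm import SVC

    X, y = load_iris(return_X_y=True)   # a three-class problem

    # One binary SVM per class; an input gets the label of the classifier
    # with the largest discriminant function value.
    ovr = OneVsRestClassifier(SVC(kernel='linear', C=1))
    print("one-vs-rest accuracy: %.2f" % cross_val_score(ovr, X, y, cv=5).mean())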

Acknowledgements

The authors would like to thank William Noble for comments on the manuscript.

References

[1] B. E. Boser, I. M. Guyon, and V. N. Vapnik. A training algorithm for optimal margin classifiers. In D. Haussler, editor, 5th Annual ACM Workshop on COLT, pages 144-152, Pittsburgh, PA, 1992. ACM Press.


[2] B. Schölkopf, K. Tsuda, and J.P. Vert, editors. Kernel Methods in Computational Biology. MIT Press series on Computational Molecular Biology. MIT Press, 2004.

[3] W.S. Noble. What is a support vector machine? Nature Biotechnology, 24:1564-1567, 2006.

[4] J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge UP, Cambridge, UK, 2004.

[5] B. Schölkopf and A. Smola. Learning with Kernels. MIT Press, Cambridge, MA, 2002.

[6] Chih-Wei Hsu, Chih-Chung Chang, and Chih-Jen Lin. A practical guide to support vector classification. Technical report, Department of Computer Science, National Taiwan University, 2003.

[7] S. Sonnenburg, M.L. Braun, C.S. Ong, L. Bengio, S. Bottou, G. Holmes, Y. LeCun, K. Müller, F. Pereira, C.E. Rasmussen, G. Rätsch, B. Schölkopf, A. Smola, P. Vincent, J. Weston, and R.C. Williamson. The need for open source software in machine learning. Journal of Machine Learning Research, 8:2443-2466, 2007.

[8] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines. Cambridge UP, Cambridge, UK, 2000.

[9] T. Hastie, R. Tibshirani, and J.H. Friedman. The Elements of Statistical Learning. Springer, 2001.

[10] C.M. Bishop. Pattern Recognition and Machine Learning. Springer, 2007.

[11] C. Cortes and V. Vapnik. Support vector networks. Machine Learning, 20:273-297, 1995.

[12] O. Chapelle. Training a support vector machine in the primal. In L. Bottou, O. Chapelle, D. DeCoste, and J. Weston, editors, Large Scale Kernel Machines. MIT Press, Cambridge, MA, 2007.

[13] F. Provost. Learning with imbalanced data sets 101. In AAAI 2000 workshop on imbalanced data sets, 2000.

[14] C-C. Chang and C-J. Lin. LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

[15] L. Bottou, O. Chapelle, D. DeCoste, and J. Weston, editors. Large Scale Kernel Machines. MIT Press, Cambridge, MA, 2007.

[16] T. Joachims. Training linear SVMs in linear time. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pages 217-226, 2006.

[17] V. Sindhwani and S. S. Keerthi. Large scale semi-supervised linear SVMs. In 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 477-484, New York, NY, USA, 2006. ACM Press.

[18] A. Bordes, S. Ertekin, J. Weston, and L. Bottou. Fast kernel classifiers with online and active learning. Journal of Machine Learning Research, 6:1579-1619, September 2005.
