
C++ Project

Support Vector Machines & Image Classification


Authors:

Pascal-Adam Sitbon, Alain Soltani


Supervisor:

Benoît Patra
February 2014

Contents
1 Support Vector Machines
  1.1 Introduction
  1.2 Support Vector Machines
      1.2.1 Linearly separable set
      1.2.2 Nearly linearly separable set
      1.2.3 Linearly inseparable set
            1.2.3.1 The kernel trick
            1.2.3.2 Classification: projection into a bigger space
            1.2.3.3 Mapping conveniently
            1.2.3.4 Usual kernel functions

2 Computation under C++
  2.1 Libraries & datasets employed
  2.2 Project format
  2.3 Two-class SVM implementation
      2.3.1 First results
      2.3.2 Parameter selection
            2.3.2.1 Optimal training on parameter grid
            2.3.2.2 Iterating and sharpening results
  2.4 A good insight: testing on a small zone
  2.5 Central results: testing on a larger zone
      2.5.1 Results
      2.5.2 Case of an unreached minimum
  2.6 Going further: enriching our model
      2.6.1 Case (A): limited dataset
      2.6.2 Case (B): richer dataset
  2.7 Conclusions

A Unbalanced data set
  A.1 Different costs for misclassification

B Multi-class SVM
  B.1 One-versus-all
  B.2 One-versus-one

Bibliography

Chapter 1

Support Vector Machines


1.1 Introduction

Support vector learning is based on simple ideas, which originated from statistical learning theory [1]. Support vector machines are supervised learning models with associated learning algorithms that analyze data and recognize patterns. The basic SVM takes a set of input data and predicts, for each given input, which of two possible classes forms the output, making it a non-probabilistic binary linear classifier. Given a set of training examples, each marked as belonging to one of two categories, an SVM training algorithm builds a model that assigns new examples to one category or the other. An SVM model is a representation of the examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible. New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall on.

1.2 Support Vector Machines

A data set containing points which belong to two different classes can be represented by the following set:

D = \{(x_i, y_i),\ 1 \le i \le m \mid \forall i,\ y_i \in \{-1; 1\},\ x_i \in \mathbb{R}^q\}, \quad (m, q) \in \mathbb{N}^2    (1.1)

where y_i indicates which of the two classes the point belongs to, x_i are the training points, and q is the dimension of the data set. One of the most important things we have to focus on is the shape of the data set. Our goal is to find the best way to distinguish between the two classes. Ideally, we would like to have a linearly separable data set - one in which our two sets of points can be fully separated by a line in a two-dimensional space, or a hyperplane in an n-dimensional space. However, this is not the case in general. We will look in the following subsections at three possible configurations for our dataset.


1.2.1 Linearly separable set

In the following example (Fig. 1.1), it is easy to see that the data points can be linearly separated. Most of the time, with a big data set, it is impossible to tell just by visualizing the data whether it can be linearly separated or not - sometimes the data cannot even be visualized.

Figure 1.1: A simple linearly separable dataset. Blue points are labelled 1; red are labelled -1.

To solve the problem analytically, we have to define several new objects.

Definition. A linear separator is a function f that depends on two parameters w and b, given by the following formula:

f_{w,b}(x) = \langle w, x \rangle + b, \quad b \in \mathbb{R},\ w \in \mathbb{R}^q.    (1.2)

This separator can take other values than -1 and 1. When f_{w,b}(x) ≥ 0, x is assigned to the class of vectors such that y_i = 1; in the opposite case, to the other class (i.e. such that y_i = -1). The line of separation is the contour line defined by the equation f_{w,b}(x) = 0.
Definition. The margin of an element (x_i, y_i), relative to a separator f, noted γ^f_{(x_i, y_i)}, is the real number given by:

\gamma^f_{(x_i, y_i)} = f(x_i) \cdot y_i \ge 0.    (1.3)

Definition. The margin of a set of points D, relative to a separator f, is the minimum of the margins over all the elements of D:

\gamma^f_D = \min \left\{ \gamma^f_{(x_i, y_i)} \mid (x_i, y_i) \in D \right\}.    (1.4)

Definition. The support vectors are the vectors such that:

\gamma^f_{(x_i, y_i)} = 1, \quad \text{i.e.} \quad y_i \left( \langle w, x_i \rangle + b \right) = 1.    (1.5)

The goal of the SVM is to maximize the margin of the data set.


Figure 1.2: Support vectors and minimal margin. The orange line represents the separation, while the pink and blue ones represent respectively the hyperplanes associated to the equations f_{w,b}(x) = 1 and f_{w,b}(x) = -1.

Lemma. The width of the band delimited by the hyperplanes f_{w,b}(x) = 1 and f_{w,b}(x) = -1 equals 2 / ||w||.

Proof. Let u be a point of the contour line defined by f_{w,b}(x) = 1, and let u' be its orthogonal projection on the contour line f_{w,b}(x) = -1. Hence we have:

f_{w,b}(u) - f_{w,b}(u') = 2, \quad \text{i.e.} \quad \langle u - u', w \rangle = 2.

Yet we have \langle u - u', w \rangle = \|u - u'\| \cdot \|w\|, since u - u' and w are collinear and have the same orientation. Besides, ||u - u'|| is equal to the width of the band delimited by the two contour lines, which therefore equals 2 / ||w||.

In order to find the best separator - i.e. the one providing the maximum margin - we have to seek within the class of separators such that γ^f_{(x_i, y_i)} ≥ 1 for all (x_i, y_i) ∈ D, and retain the one for which ||w|| is minimal. This leads us to solve the following constrained optimization problem:

\min_{w, b} \ \frac{\|w\|^2}{2}    (1.6)

under \ \forall (x_i, y_i) \in D, \quad \gamma^f_{(x_i, y_i)} = y_i \left( \langle w, x_i \rangle + b \right) \ge 1.

NB. We minimize ||w||^2 / 2 for calculus purposes: derivations become easier, and it is more convenient to work with the squared norm.

By introducing Lagrange multipliers α_i, the previous constrained problem can be expressed as:

\arg\min_{w, b} \ \max_{\alpha_i \ge 0} \ \left\{ \frac{\|w\|^2}{2} - \sum_{i=1}^{m} \alpha_i \left[ y_i \left( \langle w, x_i \rangle + b \right) - 1 \right] \right\}    (1.7)

that is, we look for a saddle point. In doing so, all the points which can be separated as y_i(\langle w, x_i \rangle + b) - 1 > 0 do not matter, since we must set the corresponding α_i to zero.

This problem can now be solved by standard quadratic programming techniques.
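For reference, solving (1.7) for w and b and substituting back yields the standard dual problem (a textbook derivation, quoted here with the notation of this chapter):

\max_{\alpha_i \ge 0} \ \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle
\quad \text{subject to} \quad \sum_{i=1}^{m} \alpha_i y_i = 0,

with w = \sum_{i=1}^{m} \alpha_i y_i x_i, so that the separator evaluates as f_{w,b}(x) = \sum_{i=1}^{m} \alpha_i y_i \langle x_i, x \rangle + b. The training points only appear through the inner products \langle x_i, x_j \rangle, which is what will allow the kernel substitution of Section 1.2.3.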

1.2.2 Nearly linearly separable set

In this subsection, we discuss the case of a nearly separable set - i.e. a dataset for which using a linear separator would still be efficient enough. If there exists no hyperplane that can split the dataset entirely, the following method - called the soft margin method - will choose a hyperplane that splits the examples as cleanly as possible, while still maximizing the distance to the nearest cleanly split examples. Let us modify the maximum margin idea to allow mislabeled examples to be treated the same way, by allowing points to have a margin which can be smaller than 1, even negative. The previous constraint in (1.6) now becomes:

\forall (x_i, y_i) \in D, \quad y_i \left( \langle w, x_i \rangle + b \right) \ge 1 - \xi_i,    (1.8)

where ξ_i ≥ 0 are called the slack variables, and measure the degree of misclassification of the data point x_i. The objective function we minimize also has to be changed: we increase it by a function which penalizes non-zero ξ_i, and the optimization becomes a trade-off between a large margin and a small error penalty. If the penalty function is linear, the optimization problem becomes:

\min_{w, b, \xi} \ \frac{\|w\|^2}{2} + C \sum_{i=1}^{m} \xi_i    (1.9)

under \ \forall (x_i, y_i) \in D, \quad y_i \left( \langle w, x_i \rangle + b \right) \ge 1 - \xi_i, \quad \xi_i \ge 0.

This constrained minimization problem can be solved using Lagrange multipliers as done previously. We now solve the following problem:

\arg\min_{w, b, \xi} \ \max_{\alpha_i, \beta_i \ge 0} \ \left\{ \frac{\|w\|^2}{2} + C \sum_{i=1}^{m} \xi_i - \sum_{i=1}^{m} \alpha_i \left[ y_i \left( \langle w, x_i \rangle + b \right) - 1 + \xi_i \right] - \sum_{i=1}^{m} \beta_i \xi_i \right\}    (1.10)

with α_i, β_i ≥ 0.

1.2.3 Linearly inseparable set

We saw in the previous subsection that linear classification can lead to misclassifications - this is especially true if the dataset D is not separable at all. Let us consider the following example (Fig. 1.3). For this set of data points, any linear classification would introduce too much misclassification to be considered accurate enough.


Figure 1.3: Linearly inseparable set. Blue points are labelled 1; red are labelled -1.

1.2.3.1 The kernel trick

To solve our classification problem, let us introduce the kernel trick. For machine learning algorithms, the kernel trick is a way of mapping observations from a general data set S into an inner product space V, without having to compute the mapping explicitly, such that the observations gain a meaningful linear structure in V. Hence linear classifications in V are equivalent to generic classifications in S. The trick used to avoid the explicit mapping is to use learning algorithms that only require dot products between the vectors in V, and to choose the mapping such that these high-dimensional dot products can be computed within the original space, by means of a certain kernel function - a function K : S^2 \to \mathbb{R} that can be expressed as an inner product in V.
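As a concrete illustration (a standard textbook example, with our own notation): take S = \mathbb{R}^2 and the polynomial kernel of degree 2 with constant c = 1. Then

K(x, z) = (\langle x, z \rangle + 1)^2 = \langle p(x), p(z) \rangle, \quad \text{with} \quad p(x) = \left( 1,\ \sqrt{2}\, x_1,\ \sqrt{2}\, x_2,\ x_1^2,\ \sqrt{2}\, x_1 x_2,\ x_2^2 \right) \in \mathbb{R}^6.

Evaluating K costs one inner product in \mathbb{R}^2, while the explicit mapping p lives in \mathbb{R}^6 and never has to be computed - this is exactly the saving the kernel trick provides.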

1.2.3.2 Classification: projection into a bigger space

To understand the usefulness of the trick, let us go back to our classification problem. Let us consider a simple projection of the vectors of D, our dataset, into a much richer, higher-dimensional feature space: we project each point of D into this bigger space and make a linear separation there. Let us name p this projection:

\forall (x_i, y_i) \in D, \quad p(x_i) = \begin{pmatrix} p_1(x_i) \\ \vdots \\ p_n(x_i) \end{pmatrix}

as we express the projected vector p(x_i) in a basis of the n-dimensional new space. This point of view can lead to problems, because n can grow without any limit, and nothing assures us that the p_i are linear in the vectors. Following the same method as above would imply working on a new set D':

D' = p(D) = \{(p(x_i), y_i),\ 1 \le i \le m \mid \forall i,\ y_i \in \{-1; 1\},\ x_i \in \mathbb{R}^q\}, \quad (m, q) \in \mathbb{N}^2    (1.11)

Because it implies calculating p for each vector of D, this method will never be used in practice.

1.2.3.3 Mapping conveniently

Let us first notice that it is not necessary to calculate p, as the optimization problem only involves inner products between the different vectors. We can now consider the kernel trick approach. We construct:

K : D^2 \to \mathbb{R} \quad \text{such that} \quad K(x, z) = \langle p(x), p(z) \rangle, \quad \forall (x, y_x), (z, y_z) \in D,    (1.12)

making sure that it corresponds to a projection into the unknown space V. We thereby avoid the computation of p, and the description of the space into which we are projecting. The optimization problem remains the same, replacing \langle \cdot, \cdot \rangle by k(\cdot, \cdot):

\min_{w, b, \xi} \ \frac{k(w, w)}{2} + C \sum_{i=1}^{m} \xi_i    (1.13)

under \ \forall (x_i, y_i) \in D, \quad y_i \left( k(w, x_i) + b \right) \ge 1 - \xi_i, \quad \xi_i \ge 0.

1.2.3.4 Usual kernel functions

Polynomial:

K(x, z) = (x^T z + c)^d,

where c ≥ 0 is a constant trading off the influence of higher-order versus lower-order terms in the polynomial. Polynomials such that c = 0 are called homogeneous.

Gaussian radial basis function (RBF):

K(x, z) = \exp\left( -\gamma \|x - z\|^2 \right), \quad \gamma > 0,

sometimes parametrized using \gamma = \frac{1}{2\sigma^2}.

Hyperbolic tangent:

K(x, z) = \tanh\left( \kappa\, x^T z + c \right), \quad \text{for } \kappa > 0,\ c < 0 \text{ well chosen.}
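For illustration, these three kernels can be evaluated directly on raw feature vectors. The small self-contained C++ sketch below simply mirrors the formulas; the function and parameter names are ours, and it is independent of the SVM library used in the next chapter.

    #include <cmath>
    #include <cstddef>
    #include <vector>

    // Inner product <x, z> of two feature vectors of equal length.
    double dot(const std::vector<double>& x, const std::vector<double>& z) {
        double s = 0.0;
        for (std::size_t i = 0; i < x.size(); ++i) s += x[i] * z[i];
        return s;
    }

    // Polynomial kernel: K(x, z) = (x^T z + c)^d, with c >= 0.
    double polynomial_kernel(const std::vector<double>& x, const std::vector<double>& z,
                             double c, int d) {
        return std::pow(dot(x, z) + c, d);
    }

    // Gaussian RBF kernel: K(x, z) = exp(-gamma * ||x - z||^2), gamma > 0.
    double rbf_kernel(const std::vector<double>& x, const std::vector<double>& z,
                      double gamma) {
        double sq = 0.0;
        for (std::size_t i = 0; i < x.size(); ++i) {
            double diff = x[i] - z[i];
            sq += diff * diff;
        }
        return std::exp(-gamma * sq);
    }

    // Hyperbolic tangent (sigmoid) kernel: K(x, z) = tanh(kappa * x^T z + c).
    double sigmoid_kernel(const std::vector<double>& x, const std::vector<double>& z,
                          double kappa, double c) {
        return std::tanh(kappa * dot(x, z) + c);
    }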

Chapter 2

Computation under C++


2.1 Libraries & datasets employed

We used for this project the computer vision and machine learning library OpenCV. All its SVM features are based on the specific library LibSVM, by Chih-Chung Chang and Chih-Jen Lin. We trained our models on the Image Classification Dataset from Andrea Vedaldi and Andrew Zisserman's Oxford assignment. It includes five different image classes - aeroplanes, motorbikes, people, horses and cars - of various sizes, and pre-computed feature vectors, in the form of sequences of consecutive 6-digit values. The pictures used are all color images in .jpg format, of various dimensions. The dataset can be downloaded at: http://www.robots.ox.ac.uk/~vgg/share/practical-image-classification.htm.

2.2 Project format

The C++ project itself possesses 4 branches, for the opening, saving, training and testing phases. In its original form, it allows opening two training files and a testing one, on a user-friendly, console-input basis. The user enters the file directories, the format used and the labels for the different training classes. For the testing phase, a label is asked for, so that the results obtained via the SVM classification can be compared with the prior label given by the user; the latter can directly see the misclassification results - rate, number of misclassified files - in the console output. The user can either choose his own kernel type and parameter values, or let the program find the optimal ones; classes have been created accordingly. The following results have been obtained using this program and additional versions (especially when including multiple training files) that derive directly from it; the latter will not be presented here. The project can be found on GitHub at: https://github.com/Parveez/CPP_Project_ENSAE_2013.


2.3 Two-class SVM implementation

2.3.1 First results

We first trained our SVM with the training sets aeroplane_train.txt and horse_train.txt; the data tested was contained in aeroplane_val.txt and horse_val.txt. As the images included in the two training classes may vary in size, we resized them all to a common zone; the same goes for the testing set. All images are stored in two matrices - one for the training phase, one for the testing phase: each matrix row is a point (here, an image), and all its coefficients are features (here, pixels). For example, for 251 training images, all of size 50x50 pixels, the training matrix will be of dimensions 251x2500. For a 50x50 pixel zone, with respectively 112 and 139 elements in each class, learning time amounts to 0.458 seconds; testing time, for 274 elements, amounts to 11.147 seconds. But a classifier of any type produces bad results for arbitrarily assigned parameter values: for example, with the default values assigned to C and γ, a Gaussian classifier misclassifies 126 elements of the aeroplane_val.txt file. The following section discusses the optimal selection of the statistical model's parameters.
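A minimal sketch of this pipeline with the OpenCV 2.4 C++ API (which wraps LibSVM) is shown below. The file-list handling is simplified and the helper name load_class is hypothetical, but the matrix layout - one flattened, resized image per row, one ±1 label per row - is the one described above, and the default CvSVMParams indeed leave C = 1 and γ = 1.

    #include <opencv2/highgui/highgui.hpp>
    #include <opencv2/imgproc/imgproc.hpp>
    #include <opencv2/ml/ml.hpp>
    #include <iostream>
    #include <string>
    #include <vector>

    // Append every image of one class to the sample matrix, one row per image.
    // `zone` is the common size every image is resized to (e.g. 50x50).
    static void load_class(const std::vector<std::string>& files, float label,
                           const cv::Size& zone, cv::Mat& samples, cv::Mat& labels) {
        for (size_t i = 0; i < files.size(); ++i) {
            cv::Mat img = cv::imread(files[i], CV_LOAD_IMAGE_GRAYSCALE);
            if (img.empty()) continue;                 // skip unreadable files
            cv::resize(img, img, zone);                // unique zone for all images
            cv::Mat row = img.reshape(1, 1);           // flatten: 1 row, zone.area() columns
            row.convertTo(row, CV_32FC1);              // SVM training expects float features
            samples.push_back(row);
            labels.push_back(label);
        }
    }

    int main() {
        // Hypothetical file lists; in the project they are read from the console.
        std::vector<std::string> aeroplanes, horses, testFiles;
        cv::Size zone(50, 50);

        cv::Mat trainData, trainLabels;
        load_class(aeroplanes, 1.0f, zone, trainData, trainLabels);    // class  1
        load_class(horses,    -1.0f, zone, trainData, trainLabels);    // class -1

        CvSVMParams params;
        params.svm_type    = CvSVM::C_SVC;
        params.kernel_type = CvSVM::RBF;               // default C and gamma: poor results
        params.term_crit   = cvTermCriteria(CV_TERMCRIT_ITER, 1000, 1e-6);

        CvSVM svm;
        svm.train(trainData, trainLabels, cv::Mat(), cv::Mat(), params);

        // Testing: count misclassified files against the label given by the user.
        cv::Mat testData, testLabels;
        load_class(testFiles, 1.0f, zone, testData, testLabels);
        int errors = 0;
        for (int r = 0; r < testData.rows; ++r)
            if (svm.predict(testData.row(r)) != testLabels.at<float>(r, 0)) ++errors;
        std::cout << "misclassified: " << errors << " / " << testData.rows << std::endl;
        return 0;
    }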

2.3.2 Parameter selection

2.3.2.1 Optimal training on parameter grid

The effectiveness of an SVM depends on the selection of the kernel, the kernel's parameters, and the soft margin parameter C. The best combination is here selected by a grid search with multiplicatively growing sequences of the parameter, given a certain step. The input parameters for the parameter selection are min_val and max_val, the extremal values tested, and step, the step parameter. Parameter values are tested through the following iteration sequence: (min_val, min_val · step, ..., min_val · step^n) with n such that min_val · step^n < max_val. Parameters are considered optimal when they yield the best cross-validation accuracy. Using an initial grid gives us a first approximation of the best parameters possible, and produces better results than default training and testing. It is important to mention here that, without specifying any kernel type to our program, the RBF kernel was always chosen as the best fit for our data. All the results presented thereafter are given for the RBF kernel, with optimization of the parameters C and γ; the following methods are applicable to other kernels as well, even though they remain less efficient.
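In OpenCV 2.4 this grid search is available directly through CvSVM::train_auto, which cross-validates over log-scaled parameter grids. The sketch below shows how grids like the ones used in Section 2.4 could be passed to it; the function name and the surrounding structure are ours, only the library calls are OpenCV's.

    #include <opencv2/ml/ml.hpp>

    // trainData: one flattened image per row (CV_32FC1); labels: one +1/-1 per row.
    void train_on_grid(const cv::Mat& trainData, const cv::Mat& labels, CvSVM& svm) {
        CvSVMParams params;
        params.svm_type    = CvSVM::C_SVC;
        params.kernel_type = CvSVM::RBF;                // kernel retained in practice

        // Multiplicative grids (min_val, min_val*step, ... while < max_val), step = 10.
        CvParamGrid Cgrid(1e-3, 1e7 + 1e3, 10);         // soft-margin parameter C
        CvParamGrid gammaGrid(1e-7, 1e-3 + 1e-10, 10);  // RBF parameter gamma

        // 10-fold cross-validation over the two grids; the remaining grids are the
        // library defaults and are not used by an RBF C-SVC.
        svm.train_auto(trainData, labels, cv::Mat(), cv::Mat(), params, 10,
                       Cgrid, gammaGrid,
                       CvSVM::get_default_grid(CvSVM::P),
                       CvSVM::get_default_grid(CvSVM::NU),
                       CvSVM::get_default_grid(CvSVM::COEF),
                       CvSVM::get_default_grid(CvSVM::DEGREE));
    }

After the call, the selected C and γ can be read back via svm.get_params().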

2.3.2.2 Iterating and sharpening results

Even if results are improved by the use of a parameter grid, refinements can be added. Indeed, we sharpen our estimation by computing iterative parameter selection - each time on smaller grids:

    Data: Default initial grid
    Result: Optimal parameters for SVM training
    while iterations under threshold do
        train SVM on grid through cross-validation;
        retrieve best parameter;
        set parameter = best parameter;
        re-center grid;
        diminish grid size;
    end
    Algorithm 1: Basic iterative parameter testing.

One can initially think of:

    max_val^(j) = max_val^(j-1) - (max_val^(j-1) - param^(j)) / 2
    min_val^(j) = min_val^(j-1) + (param^(j) - min_val^(j-1)) / 2
    step^(j)    = sqrt(step^(j-1))

to implement the grid resizing at step j, with param^(j) the best parameter value obtained after training the SVM model. Yet such a recursion is not really efficient: as j grows, the calculation time grows very fast. Indeed, as step gets smaller, the number of iterations needed to reach max_val increases quickly. As we usually initialize the C and γ grid extremal values at different powers of ten, with step^(0) = 10, a more convenient way to resize the grid at step j is the following:

    max_val^(j) = param^(j) · (10^(1/2^(j-1)) + 10^(1/2^j)) / 2
    min_val^(j) = param^(j) · 10^(-1/2^j)
    step^(j)    = sqrt(step^(j-1)) = ... = 10^(1/2^j)

as we can express min_val and max_val using powers of ten after replacing step^(j). It only takes a couple of iterations to go through the grid, and produces equivalent or better results. Besides, the more precise the estimation of the parameters, the faster the iteration.
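A sketch of this refinement loop on top of CvSVM::train_auto could look as follows. The recentring rule is the second (power-of-ten) scheme above, the stopping rule is a plain iteration count, and the grid bounds are the ones used in the next section; this illustrates the strategy rather than reproducing the project's exact code.

    #include <opencv2/ml/ml.hpp>
    #include <cmath>

    // Iteratively shrink the (C, gamma) grids around the best values found so far.
    void iterative_selection(const cv::Mat& trainData, const cv::Mat& labels,
                             int nIterations, CvSVM& svm) {
        double cMin = 1e-3, cMax = 1e7 + 1e3;        // initial C grid
        double gMin = 1e-7, gMax = 1e-3 + 1e-10;     // initial gamma grid
        double step = 10.0;                          // step^(0)

        CvSVMParams params;
        params.svm_type    = CvSVM::C_SVC;
        params.kernel_type = CvSVM::RBF;

        for (int j = 1; j <= nIterations; ++j) {
            svm.train_auto(trainData, labels, cv::Mat(), cv::Mat(), params, 10,
                           CvParamGrid(cMin, cMax, step),
                           CvParamGrid(gMin, gMax, step));
            CvSVMParams best = svm.get_params();     // best (C, gamma) on this grid

            // Re-center and shrink: step^(j) = sqrt(step^(j-1)) = 10^(1/2^j).
            step = std::sqrt(step);
            double stepPrev = step * step;           // 10^(1/2^(j-1))
            cMin = best.C / step;
            cMax = best.C * (stepPrev + step) / 2.0;
            gMin = best.gamma / step;
            gMax = best.gamma * (stepPrev + step) / 2.0;
            params = best;  // keep kernel/svm type; C and gamma are re-searched next round
        }
    }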


2.4 A good insight: testing on a small zone

We first sought results for a small zone of 50x50 pixels, to get a primary overview of how our algorithm works. For such a zone, and the following initial grid and characteristics¹:

    Grid    min_val    max_val
    γ       10^-7      10^-3 + 10^-10
    C       10^-3      10^7 + 10^3

    Number of class 1 files: 112    Number of class -1 files: 139    Files tested: 247

we obtained the following results:

No iterations nor grid usage (latest calculation time²: 0.599 seconds):
    default γ = 1, default C = 1, 126 files misclassified, misclassification rate 0.459.

After 1 iteration (latest calculation time: 11.691 seconds):
    final γ = 10^-7, final C = 1000, 68 files misclassified, misclassification rate 0.248.

After 5 iterations (latest calculation time: 4.138 seconds):
    final γ = 9.085 · 10^-8, final C = 90.851, 68 files misclassified, misclassification rate 0.248.

After 20 iterations (latest calculation time: 3.974 seconds):
    final γ = 4.966 · 10^-8, final C = 18.011, 66 files misclassified, misclassification rate 0.240.

¹ Again, we point out that the RBF kernel type was not specified initially by the user, but chosen by the program during parameter optimization.
² Here, the calculation time represents the total calculation time - i.e. including training and testing time - for the last iteration mentioned.


Figure 2.1: Values of C per iteration.

Figure 2.2: Values of γ · 10^8 per iteration.

What can we surmise from those results? Firstly, the number of misclassified images is reduced by automatically training our model on a grid. Secondly, it is also reduced by iterating the parameter selection process: although the decay is slow, each iteration helps our SVM classify the testing data better. Lastly, calculation time seems to be globally lower iteration after iteration, in acceptable proportions considering the small size of our zone.

2.5 Central results: testing on a larger zone

2.5.1 Results

Let us now run training and testing on a larger zone of 300x300 pixels, to gain a better comprehension of our model's behaviour. The parameter grids are initialized to the same values as in the previous section; here again, RBF is the optimal kernel type for the data.

No iterations nor grid usage (latest calculation time: 20.857 seconds):
    default γ = 1, default C = 1, 126 files misclassified, misclassification rate 0.459.

After 1 iteration (latest calculation time: 420.265 seconds):
    final γ = 10^-7, final C = 1000, 118 files misclassified, misclassification rate 0.430.

After 5 iterations (latest calculation time: 161.741 seconds):
    final γ = 9.085 · 10^-9, final C = 133.352, 60 files misclassified, misclassification rate 0.218.

After 15 iterations (latest calculation time: 143.982 seconds):
    final γ = 3.048 · 10^-9, final C = 38.983, 68 files misclassified, misclassification rate 0.248.

Figure 2.3: Number of misclassified images per iteration.

Figure 2.4: Values of C per iteration. Blue background, left: normal scale. Red background, right: logarithmic scale.


Figure 2.5: Values of γ · 10^10 per iteration. Blue background, left: normal scale. Red background, right: logarithmic scale.

2.5.2 Case of an unreached minimum

Here the most intriguing fact is probably that, after 5 iterations, the number of misclassified files drops to 60 files out of 274 tested, and rises to 62 at the next step. This can be explained as follows: the point (γ^(5), C^(5)) is near the minimum - i.e. the value of (γ, C) providing the minimal misclassification rate - that we are seeking, whose exact value cannot be reached through the grid at the fifth step; and as we reposition (γ, C) and resize the grid around (γ^(5), C^(5)), we might actually re-center the problem on a new area that does not include the minimum at all.

Figure 2.6: Problem of the unreached minimum. Here the minimum is included in the upper-middle cell of the grid at step 5. (Gamma, C) is the best approximation available over the grid, but shrinking the grid around this exact point leaves the minimum off the new grid.

A solution to address this problem may be to have a smoother re-sizing algorithm, like the first one we presented. But this may actually have a negative impact on calculation time at each step. For example, let us compare our results with those obtained with the initial, less efficient re-sizing algorithm; for the latter, with the same 300x300 pixel zone, the first three steps of iteration on parameter selection produced the following results:

After 1 iteration (latest calculation time: 432.228 seconds):
    final γ = 10^-7, final C = 1000, 118 files misclassified, misclassification rate 0.430.

After 2 iterations (latest calculation time: 644.136 seconds):
    final γ = 10^-7, final C = 1000, 118 files misclassified, misclassification rate 0.430.

After 3 iterations (latest calculation time: 1590.78 seconds):
    final γ = 10^-7, final C = 1000, 118 files misclassified, misclassification rate 0.430.

At the first step, the misclassification rate is the same as with the second re-sizing method; the decay is indeed much slower (the resizing is so smooth that the second and third steps still give a rate of 0.430), and the calculation times are very poor. The third iteration takes 1590.78 seconds to compute, compared to 161.975 seconds with the convenient method. The conclusion of this section is that, in many cases, there is an actual trade-off between computing performance and avoiding the unreached-minimum problem.

2.6 Going further: enriching our model

In the first two sections, we trained our model on two different subsets, aeroplane_train.txt and horse_train.txt, trying to make predictions for both aeroplanes and horses. Here, we will include more objects - horses, background, motorbikes, and cars - in the class -1, and leave aeroplanes in the class 1; we will only try to classify files from the testing set aeroplane_val.txt. Our goal here is to show how using a larger training set can improve our predictions. Let us compare the results between a class -1 training set containing only horses - case (A) - and the richer training set described above - case (B). RBF is the optimal kernel type in both cases. The zone used is of size 300x300 pixels.
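Building the richer class -1 amounts to stacking the per-class feature matrices under a single -1 label before training. A minimal sketch with cv::vconcat is given below; the function name and the assumption that every matrix holds one flattened image per row (as in Section 2.3.1) are ours.

    #include <opencv2/core/core.hpp>
    #include <vector>

    // Stack per-class feature matrices (one flattened image per row, CV_32FC1,
    // same number of columns) into a single training set: aeroplanes keep label 1,
    // every other category - horses, background, motorbikes, cars - is labelled -1.
    void build_training_set(const cv::Mat& aeroplaneFeatures,
                            const std::vector<cv::Mat>& negativeFeatures,
                            cv::Mat& trainData, cv::Mat& trainLabels) {
        std::vector<cv::Mat> dataParts, labelParts;
        dataParts.push_back(aeroplaneFeatures);
        labelParts.push_back(cv::Mat(aeroplaneFeatures.rows, 1, CV_32FC1, cv::Scalar(1)));
        for (size_t k = 0; k < negativeFeatures.size(); ++k) {
            dataParts.push_back(negativeFeatures[k]);
            labelParts.push_back(cv::Mat(negativeFeatures[k].rows, 1, CV_32FC1, cv::Scalar(-1)));
        }
        cv::vconcat(dataParts, trainData);     // rows of all classes, stacked
        cv::vconcat(labelParts, trainLabels);  // matching column of +1 / -1 labels
    }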

2.6.1 Case (A): limited dataset

    Grid    min_val    max_val
    γ       10^-7      10^-3 + 10^-10
    C       10^-3      10^7 + 10^3

    Number of class 1 files: 112    Number of class -1 files: 139    Files tested: 126

After 5 iterations (latest calculation time: 145.650 seconds):
    final γ = 3.83 · 10^-8, final C = 177.8, 41 files misclassified, misclassification rate 0.325.

After 10 iterations (latest calculation time: 137.342 seconds):
    final γ = 3.28 · 10^-8, final C = 56.51, 41 files misclassified, misclassification rate 0.325.

After 20 iterations (latest calculation time: 135.250 seconds):
    final γ = 2.27 · 10^-8, final C = 26.25, 36 files misclassified, misclassification rate 0.285.

After 40 iterations (latest calculation time: 141.561 seconds):
    final γ = 1.59 · 10^-8, final C = 13.98, 34 files misclassified, misclassification rate 0.269.

2.6.2 Case (B): richer dataset

    Grid    min_val    max_val
    γ       10^-7      10^-3 + 10^-10
    C       10^-3      10^7 + 10^3

    Number of class 1 files: 112    Number of class -1 files: 1717    Files tested: 126

After 1 iteration (latest calculation time: 681.084 seconds):
    final γ = 10^-6, final C = 1000, 12 files misclassified, misclassification rate 0.095.

We directly see here, after only 1 iteration, that the classification accuracy is much better; the larger the initial training set, the better. Note that calculation times can become quite high for very large datasets.


2.7 Conclusions

From all the experiments we conducted in this chapter, we can draw the following conclusions:

- The number of misclassified images is reduced by automatically training our model on a parameter grid.
- It can also be reduced by selecting the best parameters iteratively, shrinking our grid after each step.
- Choosing the right shrinking algorithm is important, and can be tricky: with a very sharp resizing, calculation time can be acceptable, but we might leave the point of minimal misclassification out of the grid.
- Using a large training set is always a good thing, as it drastically improves classification accuracy.

Appendix A

Unbalanced data set


In this study, we were lucky to work with well-balanced data sets: the numbers of files in each subset were of the same order. However, in general, data sets can be unbalanced: one class may contain a lot more examples than the others. The principal problem with such data sets is that we can no longer say that a classifier is efficient just by looking at its accuracy. Indeed, let us say that the ratio is 99% - for, e.g., class 1 - against 1% - for class -1. A classifier which misclassifies every vector belonging to class -1, but correctly classifies the vectors of class 1, will return a 99% accuracy. Yet if you are especially interested in that minority class in your study, this classifier is not very useful. There are several ways to address this problem; we will treat the most well known: different costs for misclassification.

A.1 Different costs for misclassification

Let us consider an unbalanced data set of the following form:

D = \{(x_i, y_i),\ 1 \le i \le m \mid \forall i,\ y_i \in \{-1; 1\},\ x_i \in \mathbb{R}^q\}, \quad (m, q) \in \mathbb{N}^2    (A.1)

The optimization problem remains the same as in (1.9):

\min_{w, b, \xi} \ \frac{\|w\|^2}{2} + C \sum_{i=1}^{m} \xi_i    (A.2)

under \ \forall (x_i, y_i) \in D, \quad y_i \left( \langle w, x_i \rangle + b \right) \ge 1 - \xi_i, \quad \xi_i \ge 0.

The solution is to replace the total misclassification penalty term C \sum_{i=1}^{m} \xi_i by a new one:

C_{+} \sum_{j \in J_{+}} \xi_j + C_{-} \sum_{j \in J_{-}} \xi_j, \quad C_{+} \ge 0,\ C_{-} \ge 0,    (A.3)

where

J_{+} = \{ j \in \{1, \dots, m\} \mid y_j = 1 \}, \quad J_{-} = \{ j \in \{1, \dots, m\} \mid y_j = -1 \}.

One condition has to be satisfied in order to give equal overall weight to each class: the total penalty term has to be the same for each class. A hypothesis commonly made is to suppose that the number of misclassified vectors in each class is proportional to the number of vectors in each class, leading us to the following condition:

C_{-} \, \mathrm{Card}(J_{-}) = C_{+} \, \mathrm{Card}(J_{+}).    (A.4)

If, for instance, Card(J_-) ≪ Card(J_+), then C_- ≫ C_+: a larger importance will be given to misclassified vectors x_i such that y_i = -1.
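OpenCV's CvSVM exposes this idea through the class_weights field of CvSVMParams, which rescales C per class. A hedged sketch follows; the weight values are illustrative, and the weights are matched to the classes in the library's internal label order, which should be checked against the data at hand.

    #include <opencv2/ml/ml.hpp>

    // Give the minority class a larger misclassification cost, so that the
    // effective C for class #i becomes class_weights[i] * C.
    void train_weighted(const cv::Mat& trainData, const cv::Mat& labels, CvSVM& svm) {
        CvSVMParams params;
        params.svm_type    = CvSVM::C_SVC;
        params.kernel_type = CvSVM::RBF;
        params.C           = 100.0;   // illustrative values
        params.gamma       = 1e-7;

        // Here the second class (in the internal label order) is assumed to be
        // the minority one and gets a 10x larger penalty.
        float w[2] = { 1.0f, 10.0f };
        CvMat weights = cvMat(1, 2, CV_32FC1, w);   // header over the local array
        params.class_weights = &weights;            // copied internally by train()

        svm.train(trainData, labels, cv::Mat(), cv::Mat(), params);
    }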

Appendix B

Multi-class SVM
Several methods have been suggested to extend the previous SVM scheme to solve multi-class problems [2]. All the following schemes are applicable to any binary classifier, and are not exclusive to SVMs. The most famous methods are the one-versus-all and one-versus-one methods.

B.1 One-versus-all

In this and the following subsection, the training and testing sets can be classified into M classes C_1, C_2, ..., C_M. The one-versus-all method is based on the construction of M binary classifiers, each labelling one specified class 1 and all the others -1. During the testing phase, the classifier providing the highest margin determines the predicted class.
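With CvSVM, the "highest margin" rule can be implemented by asking predict for the decision-function value rather than the label. A sketch is given below; it assumes each svms[k] was trained two-class with class C_k labelled 1, and the sign convention of the returned value depends on the label order seen at training time, so it should be checked on a validation example.

    #include <opencv2/ml/ml.hpp>
    #include <vector>

    // One-versus-all prediction: svms[k] separates class k (label 1) from the rest.
    // Returns the index of the class whose classifier gives the largest decision value.
    int predict_one_vs_all(const std::vector<CvSVM*>& svms, const cv::Mat& sample) {
        int bestClass = -1;
        double bestScore = 0.0;
        for (size_t k = 0; k < svms.size(); ++k) {
            // returnDFVal = true: signed distance to the separating hyperplane
            // (only defined for a 2-class model, which is the case here).
            double score = svms[k]->predict(sample, true);
            if (bestClass < 0 || score > bestScore) {
                bestScore = score;
                bestClass = static_cast<int>(k);
            }
        }
        return bestClass;
    }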

B.2 One-versus-one

The one-versus-one method is based on the construction of M(M-1)/2 binary classifiers, one for each pair of the M classes. During the testing phase, every point is analysed by each classifier, and a majority vote is conducted to determine its class. If we denote x_t the point to classify and h_{ij} the SVM classifier separating classes C_i and C_j, then the label awarded to x_t can be formally written:

\arg\max_{k \in [1; M]} \ \mathrm{Card}\left( \{ h_{ij}(x_t) \} \cap \{ k \} \mid i, j \in [1; M],\ i < j \right).    (B.1)

This is the class awarded to x_t most of the time, after it has been analysed by all the classifiers h_{ij}. Some ambiguity may remain in the counting of votes, if no class obtains a majority. Both methods present downsides. For the one-versus-all version, nothing indicates that the classification scores of the M classifiers are comparable. Besides, the problem is not well-balanced anymore: for example, with M = 10, we use only 10% of positive examples, against 90% negative ones.
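A sketch of the voting scheme in (B.1) with CvSVM classifiers is shown below; it assumes each pairwise classifier was trained with the two class indices used directly as its SVM labels, which is a convention of this sketch rather than of the library.

    #include <opencv2/ml/ml.hpp>
    #include <vector>

    // One-versus-one majority vote, as in (B.1).
    // pairSvms[p] separates one pair of classes and was trained with the two
    // class indices (0..M-1) used directly as the SVM labels, so predict()
    // returns the index of the winning class of that pair.
    int predict_one_vs_one(const std::vector<CvSVM*>& pairSvms, int M,
                           const cv::Mat& sample) {
        std::vector<int> votes(M, 0);
        for (size_t p = 0; p < pairSvms.size(); ++p) {
            int winner = static_cast<int>(pairSvms[p]->predict(sample));
            if (winner >= 0 && winner < M) ++votes[winner];
        }
        // Majority vote; ties are resolved arbitrarily (smallest index wins here).
        int best = 0;
        for (int k = 1; k < M; ++k)
            if (votes[k] > votes[best]) best = k;
        return best;
    }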

Bibliography
[1] Vladimir N. Vapnik. The Nature of Statistical Learning Theory. Springer, 1995.
[2] Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
