
Statistical Classification Using Subgradient Method

Ravi Kumar
Dept. of Electrical Engineering
Indian Institute of Technology Delhi
New Delhi, India
ravi.ee09.iitd@gmail.com

M. Hanmandlu
Dept. of Electrical Engineering
Indian Institute of Technology Delhi
New Delhi, India
mhanmandlu@gmail.com

Abstract—A newly developed classification algorithm based on the powerful concept of margin maximization, combined with low training error and good generalization performance, is presented in this paper. Special focus is given to speeding up the training process and to obtaining the model parameters rather than an exact solution. Starting with the estimation of the VC dimension, an important determinant of the generalization performance of a learning machine, the paper presents a novel approach to the linear discriminant method for classification. Since no prior information is available, a minimax problem is formulated from the concept of the maximal margin. To solve the minimax problem, a subgradient method with an averaging scheme is used to obtain the saddle point; the solution thus obtained gives the optimal parameters used for classification. The approach is applied to both the separable and the non-separable cases. A risk decision rule is further incorporated into the algorithm to improve its performance.

Index Terms—VC Dimension, Minimax Formulation, Saddle Point, Subgradient Method, Risk Decision, Empirical Risk.

I. INTRODUCTION
Classification algorithms are among the most widely used and actively researched topics in statistical learning theory, with applications to a wide range of problems. Broadly, a decision or prediction is formed from the available information, and a classification method is employed for discrimination in new circumstances. The goal of any learning algorithm is to formulate a rule that generalizes from the given data to new situations in a reasonable way. Specifically, given a sample of training vectors $\{(x_i, y_i);\ i = 1,\ldots,l\}$, our aim is to find a function $h: \mathbb{R}^n \to \{-1, +1\}$ that accurately predicts the class label when presented with a new testing instance, where accuracy is usually associated with a small probability of error. However, achieving this small probability of error is not an easy task. Since we are dealing only with problems for which no prior information is available, the best we can do is minimize the expected risk. A general model of learning from examples is described by Figure 1.

G: Generator of random vectors $x \in \mathbb{R}^n$, drawn independently from a fixed but unknown distribution function F(x).
S: Supervisor, which assigns an output value y to every input vector x according to a conditional distribution function that is also fixed but unknown.
LM: A learning machine, which implements a set of functions $f(x, \alpha)$, $\alpha \in \Lambda$, where $\Lambda$ is the parameter space.

Figure 1. A general model of learning from examples.


The problem of learning is that of choosing a function $f(x, \alpha)$ whose response best matches the supervisor's response, for a particular choice of $\alpha$. The difficulty is that we have no information except the training samples

$(x_1, y_1), \ldots, (x_l, y_l)$   (1)

In order to choose, from the set $f(x, \alpha)$, $\alpha \in \Lambda$, the function whose response is closest to that of the supervisor, we measure the discrepancy $L(y, f(x, \alpha))$ between the response y of the supervisor and the response provided by the learning machine. We consider the expected value of the loss, the risk functional

$R(\alpha) = \int L(y, f(x, \alpha))\, dF(x, y)$   (2)

Our goal is therefore to find the function $f(x, \alpha)$ for which $R(\alpha)$ is minimal.

We consider here the particular problem of pattern recognition. Suppose y can take only two values, $y \in \{0, 1\}$, and let $f(x, \alpha)$, $\alpha \in \Lambda$, be a set of indicator functions (functions which assume only the two values 0 and 1). We consider the loss function

$L(y, f(x, \alpha)) = \begin{cases} 0 & \text{if } y = f(x, \alpha) \\ 1 & \text{if } y \neq f(x, \alpha) \end{cases}$   (3)

For this loss function, the risk functional $R(\alpha)$ gives the classification error. In essence, we reduce the problem to minimizing the probability of classification error when F(x, y) is unknown but the data are given. The main principle for solving such problems is to find, within the set of admissible functions $f(x, \alpha)$, $\alpha \in \Lambda$, the one that minimizes the probability of error.
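Since F(x, y) is unknown, the risk in (2) can only be approximated on the training data. The following minimal Python sketch (an illustration, not the paper's implementation) computes the empirical counterpart of R(α) under the 0-1 loss of eq. (3):

```python
import numpy as np

def empirical_risk_01(y_true, y_pred):
    """Empirical 0-1 risk: the fraction of training samples whose predicted
    label disagrees with the supervisor's label, used as a proxy for R(alpha)
    when F(x, y) is unknown."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return np.mean(y_true != y_pred)

# Example: 2 disagreements out of 5 samples -> empirical risk 0.4
print(empirical_risk_01([0, 1, 1, 0, 1], [0, 0, 1, 1, 1]))
```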
In Section II, we discuss the concept of margin maximization, which forms the basis of the classification method. In Section III, the generalization ability of a learning machine is discussed; this section focuses on some key concepts used in designing any learning algorithm. Section IV presents the minimax formulation and the saddle point concept, computes the saddle point using the subgradient method, and reports results. We incorporate a risk decision rule in Section V to improve the performance of the algorithm. Finally, we present conclusions in the last section.
II. MARGIN MAXIMIZATION

We start this section by considering the idea of classifying data with a large gap; for this we present the concept of margins. Next, we examine the formulation of the underlying idea of using the margin to separate the instances. We then briefly discuss linear classification for the separable and non-separable cases.

A. Intuition

In this section, we give the intuition behind margins and the level of confidence involved in our prediction of the instances. In the probabilistic analysis of classification, a model $y(x) = u(\theta^T x)$ predicts the sample as +1 (in the binary case) iff $y(x) \geq 0.5$. A value of $y(x)$ near 0.5 indicates a low level of confidence in the prediction, while a value close to 1 indicates a very high level of confidence. Now consider the non-probabilistic case, where we predict class +1 if $\theta^T x \geq 0$ and class -1 otherwise. Then:

$\theta^T x \gg 0$ : more confident and correct "+1" class
$\theta^T x \ll 0$ : more confident and correct "-1" class
$\theta^T x$ slightly above 0 : less confident, though correct, "+1" class
$\theta^T x$ slightly below 0 : less confident, though correct, "-1" class

As an example, consider Figure 2 (obtained from the method implemented in this paper), in which two classes are separated by a line in the 2-D plane (the more general term is a hyperplane). We have taken two samples A and C (x for class +1 and o for class -1). The distance of A from the line is greater than the distance of C from the line. Hence we can be more confident in classifying A as class +1 than C. Basically, we look for a decision boundary that allows us to make our predictions for the samples more confidently and accurately. We use the notions of functional margin and geometric margin to formalize this concept.

Figure 2. Margin concept for classification. (Obtained from the method implemented in this paper.)

B. Functional and Geometric Margins

Given a training sample $(x_i, y_i)$, the functional margin of $(w, b)$ with respect to that sample is

$v_i = y_i (w^T x_i + b)$   (4)

where the classifier is

$p_{w,b}(x) = u(w^T x + b)$   (5)

and $u(\cdot)$ assumes the value +1 if $w^T x + b \geq 0$ and -1 otherwise. For a linear classifier with this choice of $u$, one property makes it unsuitable on its own for defining confidence: its inability to reflect the magnitude of the argument. The function does not change if $(w, b)$ is replaced by $(2w, 2b)$; this does not affect $p_{w,b}(x)$ at all, although it scales the functional margin arbitrarily. So a normalization condition is used, such as $\|w\|_2 = 1$, and we replace $(w, b)$ by $(w/\|w\|,\ b/\|w\|)$.

With this background we move to the analysis of a whole training set. Given a training set $\{(x_i, y_i);\ i = 1,\ldots,l\}$, we define the functional margin of $(w, b)$ with respect to the set to be the minimum of the functional margins of the individual instances. We denote it by

$v = \min_{i=1,\ldots,l} v_i$   (6)

Next we discuss the geometric margin. Consider the point A in Figure 2 and let B be the foot of the perpendicular from A onto the decision boundary, so that the length of AB is the perpendicular distance $d_i$. The coordinate of B is then $x_i - d_i\, w/\|w\|$, which lies on the boundary $w^T x + b = 0$:

$w^T (x_i - d_i\, w/\|w\|) + b = 0 \;\Rightarrow\; d_i = y_i\left\{(w/\|w\|)^T x_i + b/\|w\|\right\}$   (7)

(the factor $y_i$ makes the margin positive for correctly classified samples). Finally, for a training set $\{(x_i, y_i);\ i = 1,\ldots,l\}$, we define the geometric margin of $(w, b)$ with respect to the set to be the minimum of the geometric margins of the individual instances:

$d = \min_{i=1,\ldots,l} d_i$   (8)
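As a quick illustration of eqs. (4)-(8), the following short Python sketch (a toy example, not the paper's implementation) computes the functional and geometric margins of a candidate (w, b) on a small labelled set:

```python
import numpy as np

def margins(X, y, w, b):
    """Functional and geometric margins of (w, b) on a labelled set.

    X : (l, n) array of samples; y : (l,) array of labels in {-1, +1}.
    Returns the set margins of eqs. (6) and (8): the minima over the
    per-sample margins of eqs. (4) and (7)."""
    functional = y * (X @ w + b)                 # v_i = y_i (w^T x_i + b)
    geometric = functional / np.linalg.norm(w)   # d_i = v_i / ||w||
    return functional.min(), geometric.min()

# Toy example with the separating line w = (1, 1), b = -1
X = np.array([[2.0, 2.0], [0.0, 0.0], [3.0, 1.0]])
y = np.array([+1, -1, +1])
print(margins(X, y, np.array([1.0, 1.0]), -1.0))
```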

C. The Optimal Margin Classifier

Given a training set, it is a natural desideratum to find a decision boundary that maximizes the (geometric) margin. To find the boundary achieving the maximum geometric margin, consider the following optimization problem:

$\max_{d, w, b}\; d$
s.t. $y_i (w^T x_i + b) \geq d,\ i = 1,\ldots,l$
$\|w\| = 1$   (9)

We want to maximize $d$ subject to each training example having functional margin at least $d$. The constraint $\|w\| = 1$ moreover ensures that the functional margin equals the geometric margin, so we are also guaranteed that all the geometric margins are at least $d$. Thus, solving this problem results in the $(w, b)$ with the largest possible geometric margin with respect to the training set.

We transform problem (9) into a standard convex optimization problem by considering

$d = \frac{v}{\|w\|}$   (10)

which converts it to the following:

$\max_{w, b}\; \frac{v}{\|w\|}$
s.t. $y_i (w^T x_i + b) \geq v,\ i = 1,\ldots,l$   (11)

We can add an arbitrary scaling constraint on w and b without changing anything. We introduce the scaling constraint that the functional margin of $(w, b)$ with respect to the training set be 1, i.e. $v = 1$. So finally what we have to do is maximize $1/\|w\|$, or equivalently solve

$\min_{w, b}\; \frac{1}{2}\|w\|^2$
s.t. $y_i (w^T x_i + b) \geq 1,\ i = 1,\ldots,l$

The above is an optimization problem with a convex quadratic objective and only linear constraints. Its solution gives us the optimal margin classifier.

D. Linear Classification for the Separable and Non-Separable Cases

For the simple case of separable data, discriminative linear classification looks for a hyperplane that separates the classes with the largest margin, under the following constraints:

$w^T x + b \geq +1$ for $y = +1$
$w^T x + b \leq -1$ for $y = -1$   (12)

We want to find an optimal value of w that satisfies the above constraints while maximizing the functional and geometric margins. Solving this problem finally amounts to minimizing $\|w\|$ subject to the constraints. Thus, we expect the solution for a typical two-dimensional case to have the form shown in Figure 2.

An interesting situation arises when the same equations are applied to non-separable data. This, for obvious reasons, gives no solution; but a slight change handles the problem quite elegantly: introducing slack variables $\xi_i,\ i = 1,\ldots,l$, into the constraints solves the problem, at the cost of an extra term in the objective function. Given a training set $\{(x_i, y_i);\ i = 1,\ldots,l\}$, we solve the following optimization problem:

$\min_{w, b, \xi}\; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{l} \xi_i$
s.t. $y_i (w^T x_i + b) \geq 1 - \xi_i$
$\xi_i \geq 0$   (13)

Here each slack measures by how much a point falls on the wrong side of its margin boundary, so $\sum_{i=1}^{l} \xi_i$ is an upper bound on the number of training errors (any misclassified sample must have $\xi_i > 1$). The constant C is the penalty parameter of the error term; it is fine-tuned using cross-validation and defines the trade-off between margin maximization and error minimization. A large value of C results in the problem of over-fitting ($C = \infty$ leads to the hard-margin SVM obtained by solving the problem with the KKT conditions). A smaller value, on the other hand, results in the problem of under-fitting.
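For reference, the sketch below (Python; the variable names are illustrative, not taken from the paper) evaluates the soft-margin objective of eq. (13) for a given (w, b), using the fact that the tightest feasible slack is $\xi_i = \max(0,\ 1 - y_i(w^T x_i + b))$:

```python
import numpy as np

def soft_margin_objective(w, b, X, y, C):
    """Primal soft-margin objective of eq. (13):
    0.5 * ||w||^2 + C * sum_i xi_i, with xi_i = max(0, 1 - y_i (w^T x_i + b)),
    the smallest slack satisfying the constraints for the given (w, b)."""
    slacks = np.maximum(0.0, 1.0 - y * (X @ w + b))
    return 0.5 * np.dot(w, w) + C * slacks.sum()
```

Evaluating the objective this way is equivalent to (13) because, at the optimum, each slack takes exactly this hinge value.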

III. GENERALIZATION ABILITY OF LEARNING MACHINE

Several theoretical and experimental studies have shown the influence of the capacity of a learning machine on its generalization ability. The natural questions that arise are: what is the capacity of a learning machine, and how do we quantify it? We explain here probably the most important and powerful concept of the capacity of a learning machine, giving an intuitive idea of it and explaining how it is used in developing learning algorithms.
A. VC-Dimension of a Learning Machine

The VC dimension (for Vapnik-Chervonenkis dimension) measures the capacity of a learning machine. Capacity is a measure of the complexity of the learning task and reflects the expressive power, richness or flexibility of a set of functions. When the theory of learning was developed, much emphasis was placed on finding properties of the set of functions that are generic and do not depend on the particular elements of the set. One such concept is the VC-dimension. The definition below is taken from [20].

Definition 1 (The VC dimension of a set of indicator functions): The VC dimension of a set of indicator functions $Q(z, \alpha)$, $\alpha \in \Lambda$, is the maximum number h of vectors $x_1, \ldots, x_h$ that can be separated into two classes in all $2^h$ possible ways using functions of the set (i.e., the maximum number of vectors that can be shattered by the set of functions). If for any n there exists a set of n vectors that can be shattered by the set $Q(z, \alpha)$, $\alpha \in \Lambda$, then the VC dimension is equal to infinity.
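To make Definition 1 concrete, the brute-force Python sketch below (illustrative only; the candidate search is a crude random sampling, so it can confirm but not rigorously refute shatterability) checks whether a small point set in the plane can be shattered by linear indicator functions $\mathrm{sign}(w^T x + b)$:

```python
import itertools
import numpy as np

def can_shatter_linear(points):
    """Check whether every one of the 2^h labelings of the given 2-D points
    is realised by some linear classifier sign(w^T x + b).  Candidate (w, b)
    pairs are drawn at random, so this is a heuristic demonstration of the
    shattering idea, not a proof."""
    points = np.asarray(points, dtype=float)
    h = len(points)
    rng = np.random.default_rng(0)
    candidates = [(rng.normal(size=2), rng.normal()) for _ in range(20000)]
    for labels in itertools.product([-1, 1], repeat=h):
        labels = np.array(labels)
        if not any(np.all(np.sign(points @ w + b) == labels)
                   for w, b in candidates):
            return False
    return True

# Three points in general position can be shattered:
# the VC dimension of lines (with bias) in the plane is 3, as in Figure 3.
print(can_shatter_linear([[0, 0], [1, 0], [0, 1]]))
```

Figure 3. A simple example of VC dimension. (Here, the VC dimension is 3.)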

B. Measuring the VC-Dimension of a Learning Machine

For the pattern recognition problem, the goal of the learning machine is to choose the function $Q(z, \alpha^*)$ within the set that minimizes the probability of error, i.e., the probability of disagreement between the value of $\omega$ and the output of the learning machine $Q(z, \alpha)$:

$p(\alpha) = E\left[\,|\omega - Q(z, \alpha)|\,\right]$   (14)

where the expectation is taken with respect to the probability distribution $P(x, \omega)$. Since we have no information about this distribution, the only way to assess $p(\alpha)$ is through the frequency of errors computed on the training set,

$v_l(\alpha) = \frac{1}{l}\sum_{i=1}^{l} |\omega_i - Q(z_i, \alpha)|$   (15)

Many learning algorithms are based on the so-called principle of "empirical risk minimization", which consists in choosing the function $Q(z, \alpha_l)$ that minimizes the number of errors on the training set.

We are going to use (14) and (15) in finding the VC-dimension. Unfortunately, estimates of the VC-dimension by theoretical means have been obtained for only a handful of simple classes of functions, most notably the class of linear discriminant functions. So an empirical method developed in [21] has been used to obtain an accurate estimate of the VC-dimension. The method is valid for any learning machine.

C. Theory

In this section, we briefly discuss how the empirical method was developed. In [19] it was shown that a learning algorithm that minimizes the empirical risk (i.e. minimizes the error on the training set) will be consistent if and only if the following one-sided uniform convergence condition holds:

$\lim_{l\to\infty} P\{\sup_{\alpha}(p(\alpha) - v_l(\alpha)) > \varepsilon\} = 0$   (16)

In other words, the necessary and sufficient condition for the consistency of the learning process is the one-sided uniform convergence (within a given set of functions) of frequencies to probabilities.

In the 1930s, Kolmogorov and Smirnov found the law of distribution of the maximal deviation between a distribution function and an empirical distribution function for any random variable. This result can be formulated as follows. For the set of functions

$Q^*(z, \alpha) = \theta(z - \alpha),\quad \alpha \in (-\infty, \infty)$   (17)

the equality

$\Phi^*(l/h) = P\{\sup_{\alpha}(p^*(\alpha) - v^*(\alpha)) > \varepsilon\} \approx \exp\{-2\varepsilon^2 l\} + 2\sum_{n \geq 2}(-1)^{n}\exp\{-2\varepsilon^2 n^2 l\}$   (18)

holds for sufficiently large l, where

$p^*(\alpha) = E[Q^*(z, \alpha)], \qquad v^*(\alpha) = \frac{1}{l}\sum_{i=1}^{l} Q^*(z_i, \alpha)$   (19)

Here l is the number of training instances and h is the VC dimension. Now, assuming we know a functional form for $\Phi^*$, we can experimentally estimate the expected maximal deviation between the empirical risk and the expected risk for various values of l, and measure h by fitting $\Phi^*$ to the measurements. Instead of measuring the maximum difference between the training-set error and the error on an infinite test set, an easier quantity to measure is the maximum difference between the error rates measured on two separate samples of size l,

$\xi(l) = \max_{\alpha}\,[\nu_2(\alpha) - \nu_1(\alpha)]$   (20)

where $\nu_1(\alpha)$ and $\nu_2(\alpha)$ are the error frequencies, in the sense of (15), on the two samples. According to the VC theory, this deviation is bounded by

$\xi(l) \leq \Phi(l/h)$   (21)

where 2l is the total number of samples taken and h is the VC-dimension. A closed-form expression for $\Phi(\tau)$, $\tau = l/h$, is given in [21]; its constants a = 0.16, b = 1.2 and k = 0.14928 have been determined empirically.   (22)

To measure this deviation, let the sequence of 2l samples be denoted by $Z^{2l}$, and denote by $\bar{Z}^{2l}$ the new sequence obtained from $Z^{2l}$ by flipping the class labels of its second half:

$\bar{Z}^{2l} = x_1, \omega_1;\ x_2, \omega_2;\ \ldots;\ x_l, \omega_l;\ x_{l+1}, -\omega_{l+1};\ \ldots;\ x_{2l}, -\omega_{2l}$   (23)

We will use $\bar{Z}^{2l}$ as a training set for the learning machine. In this case, training results in the minimization of

$R(\alpha) = \frac{1}{l}\sum_{i=1}^{l}(\omega_i - Q(x_i, \alpha))^2 + \frac{1}{l}\sum_{i=l+1}^{2l}(\omega_i + Q(x_i, \alpha))^2$   (24)

Expanding the squares,

$R(\alpha) = \frac{1}{l}\sum_{i=1}^{l}\left(\omega_i^2 + Q(x_i, \alpha)^2 - 2\omega_i Q(x_i, \alpha)\right) + \frac{1}{l}\sum_{i=l+1}^{2l}\left(\omega_i^2 + Q(x_i, \alpha)^2 + 2\omega_i Q(x_i, \alpha)\right)$   (25)

and since $\omega_i, Q(x_i, \alpha) \in \{-1, 1\}$, eqn. (25) reduces to

$R(\alpha) = 4 - \frac{2}{l}\left[\sum_{i=1}^{l}\omega_i Q(x_i, \alpha) - \sum_{i=l+1}^{2l}\omega_i Q(x_i, \alpha)\right]$   (26)

In order to minimize (24), we can therefore maximize

$R'(\alpha) = \frac{1}{l}\left[\sum_{i=1}^{l}\omega_i Q(x_i, \alpha) - \sum_{i=l+1}^{2l}\omega_i Q(x_i, \alpha)\right]$   (27)

E. Estimation of the Parameter β

To solve the maximization of (27), note that the output of the learning machine can be written as $Q(x_i, \alpha) = \theta(f(x_i) - \beta)$, where $f$ is the underlying real-valued function, $x_i$ is a vector and $\theta(\cdot)$ attains only values from the set {-1, 1}. For the given real-valued function, all we require is an estimate of the threshold β within the range of the function in order to form the indicator function, for which we can match the proposed theoretical curve with our experimental measurements to obtain the VC dimension. To estimate β, we sort all the values of $f(x_i)$ such that

$f(x_{i_1}) \leq f(x_{i_2}) \leq \ldots \leq f(x_{i_{2l}})$   (28)

Since there are 2l samples, only 2l + 1 candidate values of β need to be evaluated within these sorted numbers, because the only thing that matters is the sign of $f(x_i) - \beta$. So we take the candidates to be

$f(x_{i_1}) - \delta,\ f(x_{i_1}) + \delta,\ f(x_{i_2}) + \delta,\ \ldots,\ f(x_{i_{2l-1}}) + \delta,\ f(x_{i_{2l}}) + \delta$   (29)

where δ is a small value, smaller than the least difference between any two distinct values of the real function. From this exhaustive search we can easily find the optimal value of β, which provides the indicator function needed for calculating the maximum difference between the error rates measured on the two separate sets.

Once we have this value of β, we can use it to calculate, following (15),

$v_{1l}(Z^{2l}, \alpha) = \frac{1}{l}\sum_{i=1}^{l} |\omega_i - Q(x_i, \alpha)|, \qquad v_{2l}(Z^{2l}, \alpha) = \frac{1}{l}\sum_{i=l+1}^{2l} |\omega_i - Q(x_i, \alpha)|$   (30)

after restoring the original labels $\omega_i$ of the second half of the set. The training on $\bar{Z}^{2l}$ is carried out in such a way that the difference between these two error rates attains its maximum value. Using the experimental measurements of this maximal difference, we can fit them to an analytic function that depends only on the VC-dimension.
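The exhaustive search for β can be written compactly. The sketch below (Python) follows eqs. (27)-(29) as reconstructed above; `f_values` are the real-valued outputs f(x_i) on the merged 2l-sample set and `labels` the original ±1 labels, both names being illustrative:

```python
import numpy as np

def estimate_beta(f_values, labels, l):
    """Exhaustive search for the threshold beta of Q(x) = theta(f(x) - beta):
    candidate thresholds sit just below/above the sorted f values (eqs. (28)-(29)),
    and the one maximising
        (1/l) [ sum_{i<=l} w_i Q_i - sum_{i>l} w_i Q_i ]     (eq. (27))
    is returned.  `labels` are the original +/-1 labels of the 2l samples."""
    f = np.asarray(f_values, dtype=float)
    w = np.asarray(labels, dtype=float)
    f_sorted = np.sort(np.unique(f))
    # delta smaller than the least gap between distinct f values
    delta = 0.5 * np.min(np.diff(f_sorted)) if f_sorted.size > 1 else 0.5
    betas = np.concatenate(([f_sorted[0] - delta], f_sorted + delta))
    sign = np.where(np.arange(f.size) < l, 1.0, -1.0)  # + first half, - second half
    scores = [(np.sum(sign * w * np.where(f > beta, 1.0, -1.0)) / l, beta)
              for beta in betas]
    return max(scores)[1]
```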

F. Experimental Procedure

We now give an overall summary of the method used in the calculation of the VC dimension. The following procedure summarizes it:
1) Generate a randomly labeled set of size 2l,
$Z^{2l} = x_1, \omega_1;\ x_2, \omega_2;\ \ldots;\ x_l, \omega_l;\ x_{l+1}, \omega_{l+1};\ \ldots;\ x_{2l}, \omega_{2l}$
2) Split it into two sets of equal size, $Z_1^l$ and $Z_2^l$, such that
$Z_1^l = x_1, \omega_1;\ \ldots;\ x_l, \omega_l$ and $Z_2^l = x_{l+1}, \omega_{l+1};\ \ldots;\ x_{2l}, \omega_{2l}$
3) Flip the class labels of the second set, so that
$\bar{Z}_1^l = x_1, \omega_1;\ \ldots;\ x_l, \omega_l$ and $\bar{Z}_2^l = x_{l+1}, -\omega_{l+1};\ \ldots;\ x_{2l}, -\omega_{2l}$
4) Merge the two sets and train the learning machine, obtaining the value of β by the procedure described above.
5) Separate the sets and flip the labels of the second set back again in order to calculate the maximal difference between the error rates on the two separate data sets.
6) Measure the difference between the error rates on the two sets:
$\xi(l) = \max_{\alpha}\,(v_{2l}(Z^{2l}, \alpha) - v_{1l}(Z^{2l}, \alpha))$
(the maximization over α is performed implicitly by the training in step 4).

The above procedure gives a single estimate of $\xi(l)$, from which we can obtain a single point estimate of h according to $\xi(l) = \Phi(l/h)$.
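The following Python sketch carries out one run of steps 1-6. As the learning machine it uses an ordinary least-squares linear discriminant thresholded at zero, purely as an illustrative stand-in (the paper trains its own classifier and searches for β as in Section III-E):

```python
import numpy as np

def xi_estimate(n_dim, l, rng):
    """One run of the procedure above (steps 1-6), with a least-squares
    linear discriminant standing in for the learning machine.
    Returns the measured error-rate difference xi(l)."""
    X = rng.normal(size=(2 * l, n_dim))              # step 1: random inputs
    w_true = rng.choice([-1.0, 1.0], size=2 * l)     # step 1: random labels
    y_train = w_true.copy()
    y_train[l:] *= -1.0                              # step 3: flip 2nd half
    A = np.hstack([X, np.ones((2 * l, 1))])          # step 4: train on merged set
    coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    pred = np.where(A @ coef >= 0.0, 1.0, -1.0)
    v1 = np.mean(pred[:l] != w_true[:l])             # steps 5-6: original labels
    v2 = np.mean(pred[l:] != w_true[l:])
    return v2 - v1

rng = np.random.default_rng(0)
print(np.mean([xi_estimate(5, 20, rng) for _ in range(50)]))
```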

In order to reduce the variability of the estimates due to the random samples in the experiment, the above procedure is repeated for datasets of different sample sizes $n_1, n_2, \ldots, n_d$, chosen so that the ratio $n_i/h$ lies in the range $0.5 \leq n_i/h \leq 30$. For each such value of $n_i$, several ($m_i$) repeated experiments are performed, and the mean values of these repeated experiments are taken at each design point:

$\bar{\xi}(n_1), \bar{\xi}(n_2), \ldots, \bar{\xi}(n_d)$

The effective VC-dimension h can then be determined by finding the parameter $h^*$ that provides the best fit between $\Phi(n_i/h)$ and the measured $\bar{\xi}(n_i)$'s:

$h^* = \arg\min_h \sum_{i=1}^{d} \left[\bar{\xi}(n_i) - \Phi(n_i/h)\right]^2$   (31)
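The fit in eq. (31) can be carried out by exhaustive search over candidate integer values of h, which is what this paper does. The Python sketch below assumes the bound $\Phi(\tau)$ has the piecewise form reported in [21] with the constants a = 0.16, b = 1.2, k = 0.14928 quoted above; the exact expression is reproduced from memory of that reference and should be treated as an assumption:

```python
import numpy as np

def phi(tau, a=0.16, b=1.2, k=0.14928):
    """Bound Phi(tau), tau = n / h, in the form attributed to [21]
    (an assumption): Phi = 1 for tau < 0.5, otherwise
    a*(ln(2 tau)+1)/(tau-k) * (1 + sqrt(1 + b(tau-k)/(ln(2 tau)+1))), capped at 1."""
    tau = np.atleast_1d(np.asarray(tau, dtype=float))
    out = np.ones_like(tau)
    m = tau >= 0.5
    t = tau[m]
    num = np.log(2.0 * t) + 1.0
    out[m] = np.minimum(1.0, a * num / (t - k) * (1.0 + np.sqrt(1.0 + b * (t - k) / num)))
    return out

def fit_vc_dimension(n_sizes, xi_means, h_max=200):
    """Eq. (31): exhaustive search for the integer h minimising
    sum_i [xi(n_i) - Phi(n_i / h)]^2 over the measured design points."""
    n = np.asarray(n_sizes, dtype=float)
    xi = np.asarray(xi_means, dtype=float)
    errs = [np.sum((xi - phi(n / h)) ** 2) for h in range(1, h_max + 1)]
    return int(np.argmin(errs)) + 1
```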

G. Results

Here we discuss the results obtained by using the above method for the linear discriminant function, whose theoretical VC dimension with n inputs is n + 1. We first present the experimental setup used and then the results obtained.

Experimental setup:
- We have taken 25 different datasets (generating random samples in each dataset) to conduct this experiment.
- We simulated 100 experiments for each dataset.
- We then take the average over all experiments to estimate the maximal difference between the error rates.
- We have used only the linear discriminant function (in n dimensions).

Figure 4. VC dimension estimate for one experiment.

Figure 4 represents a particular simulation result in which we generated 142 samples. The red curve represents the value of ξ for different VC dimensions, and the green curve is the best approximation of Φ obtained by the empirical method described above.

Table 1. VC dimension estimate for the linear discriminant function in m-dimensional space.

m (dimension)   Theoretical value   Estimated value
5               6                   6
10              11                  12
20              21                  23
30              31                  34
40              41                  44

We end this section with two points.
(a) The number of experiments and the number of datasets have been chosen at random in this method. The variability due to random samples in the experimental procedure should therefore be addressed by a procedure for specifying the measurement points (i.e., the sample sizes and the number of repeated experiments at each sample size).
(b) We have exhaustively searched over all candidate values of the VC dimension for the best approximation. The effect of the numerical optimization algorithm used for estimating the VC dimension should also be considered.

IV. MINIMAX LEARNING AND COMPUTATION OF SADDLE POINT USING SUBGRADIENT METHOD

We consider the following general problem:

$\min_{x \in X} \max_{y \in Y} \phi(x, y)$   (32)

where X and Y are closed convex sets and $\phi$ is a convex-concave function defined over $X \times Y$. In particular,
$\phi(\cdot, y): X \to \mathbb{R}$ is a convex function for every $y \in Y$;
$\phi(x, \cdot): Y \to \mathbb{R}$ is a concave function for every $x \in X$.
A solution $(x^*, y^*)$ of problem (32) satisfies the condition

$\phi(x^*, y) \leq \phi(x^*, y^*) \leq \phi(x, y^*)$ for all $x \in X,\ y \in Y$   (33)

Such a point is called a saddle point of the function $\phi$. In what follows, we show how to obtain the saddle point using the subgradient method.

A. Subgradient Algorithm for the Saddle Point

We first describe the notation used in this section:
$\phi$ : the min-max (saddle) function of problem (32);
$\alpha$ : step size;
$S_F(x)$ : subgradient of a convex function F at x;
$x^i$ : i-th component of a vector x;
$\mathbb{R}^m_+$ : the m-dimensional non-negative orthant;
$P_W, P_\Theta$ : the projections onto the sets W and $\Theta$, respectively.

We consider the minimax problem described by (32) in the form

$\min_{w \in W} \max_{\theta \in \Theta} \phi(w, \theta)$   (34)

and an algorithm motivated by the Arrow-Hurwicz-Uzawa algorithm [10]. This algorithm assumes the following form:

$w_{k+1} = P_W[\,w_k - \alpha\, \partial_w \phi(w_k, \theta_k)\,]$, for $k = 0, 1, \ldots$
$\theta_{k+1} = P_\Theta[\,\theta_k + \alpha\, \partial_\theta \phi(w_k, \theta_k)\,]$, for $k = 0, 1, \ldots$   (35)

where $w_0$ and $\theta_0$ are the initial iterates, and $\partial_w \phi(w_k, \theta_k)$ and $\partial_\theta \phi(w_k, \theta_k)$ denote subgradients of $\phi$ with respect to w and $\theta$, respectively, evaluated at $(w_k, \theta_k)$.

We present a lemma here with its proof.

Lemma 1: Let the two sequences generated by eqn. (35) be denoted by $\{w_k\}$ and $\{\theta_k\}$. Then:
(a) For any $w \in W$ and $k \geq 0$,

$\|w_{k+1} - w\|^2 \leq \|w_k - w\|^2 - 2\alpha\,(\phi(w_k, \theta_k) - \phi(w, \theta_k)) + \alpha^2 \|\partial_w \phi(w_k, \theta_k)\|^2$   (36)

(b) For any $\theta \in \Theta$ and $k \geq 0$,

$\|\theta_{k+1} - \theta\|^2 \leq \|\theta_k - \theta\|^2 + 2\alpha\,(\phi(w_k, \theta_k) - \phi(w_k, \theta)) + \alpha^2 \|\partial_\theta \phi(w_k, \theta_k)\|^2$   (37)

Proof.
(a) We start with

$\|w_{k+1} - w\|^2 = \|P_W[w_k - \alpha\, \partial_w \phi(w_k, \theta_k)] - w\|^2 \leq \|w_k - \alpha\, \partial_w \phi(w_k, \theta_k) - w\|^2$

using the non-expansive property of the projection operation. Therefore, for any $w \in W$ and $k \geq 0$,

$\|w_{k+1} - w\|^2 \leq \|w_k - w\|^2 - 2\alpha\, \partial_w \phi(w_k, \theta_k)^T (w_k - w) + \alpha^2 \|\partial_w \phi(w_k, \theta_k)\|^2$

Since the function $\phi(w, \theta)$ is convex in w for each $\theta$, and $\partial_w \phi(w_k, \theta_k)$ is a subgradient of $\phi(\cdot, \theta_k)$ at $w = w_k$, we have the subgradient inequality

$\partial_w \phi(w_k, \theta_k)^T (w_k - w) \geq \phi(w_k, \theta_k) - \phi(w, \theta_k)$

Combining the two preceding relations yields (36).

(b) The proof is similar, with a slight modification. For any $\theta \in \Theta$ and $k \geq 0$,

$\|\theta_{k+1} - \theta\|^2 = \|P_\Theta[\theta_k + \alpha\, \partial_\theta \phi(w_k, \theta_k)] - \theta\|^2 \leq \|\theta_k - \theta\|^2 + 2\alpha\, \partial_\theta \phi(w_k, \theta_k)^T (\theta_k - \theta) + \alpha^2 \|\partial_\theta \phi(w_k, \theta_k)\|^2$

Since $\partial_\theta \phi(w_k, \theta_k)$ is a subgradient of the concave function $\phi(w_k, \cdot)$ at $\theta_k$, we have, for all $\theta \in \Theta$,

$\partial_\theta \phi(w_k, \theta_k)^T (\theta - \theta_k) \geq \phi(w_k, \theta) - \phi(w_k, \theta_k)$

i.e. $\partial_\theta \phi(w_k, \theta_k)^T (\theta_k - \theta) \leq \phi(w_k, \theta_k) - \phi(w_k, \theta)$. Hence, for any $\theta \in \Theta$ and $k \geq 0$, relation (37) follows.

Some additional properties of the iterates $w_k$ and $\theta_k$ are established under the assumption of boundedness of the subgradients, which we now state formally.

Subgradient Boundedness Assumption: The subgradients $\partial_w \phi(w_k, \theta_k)$ and $\partial_\theta \phi(w_k, \theta_k)$ used in the method (35) are uniformly bounded, i.e., there is a constant $L \geq 0$ such that

$\|\partial_w \phi(w_k, \theta_k)\| \leq L, \qquad \|\partial_\theta \phi(w_k, \theta_k)\| \leq L$, for all $k \geq 0$

This assumption is valid, for instance, when the sets W and $\Theta$ are compact and the function $\phi$ is continuous over $W \times \Theta$. With this assumption we move to the next lemma.
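The iteration (35), together with the running averages used in the analysis below, is straightforward to implement. The following Python sketch is generic: the caller supplies the subgradients of the convex-concave function and the projections onto W and Θ (the names and the toy problem are illustrative, not taken from the paper):

```python
import numpy as np

def saddle_point_subgradient(grad_w, grad_theta, proj_w, proj_theta,
                             w0, theta0, step, iters):
    """Projected subgradient iteration (35) for min_w max_theta phi(w, theta),
    returning the iterate averages used in Lemma 2 below; the averages
    approximate a saddle point up to an error governed by the step size and
    the subgradient bound."""
    w, theta = np.asarray(w0, float), np.asarray(theta0, float)
    w_sum, theta_sum = np.zeros_like(w), np.zeros_like(theta)
    for _ in range(iters):
        gw, gt = grad_w(w, theta), grad_theta(w, theta)
        w_sum += w
        theta_sum += theta
        w = proj_w(w - step * gw)              # descent step in w
        theta = proj_theta(theta + step * gt)  # ascent step in theta
    return w_sum / iters, theta_sum / iters    # iterate averages

# Toy usage: phi(w, theta) = theta*w - 0.5*theta**2 on W = [-1, 1], Theta = [-1, 1]
w_bar, t_bar = saddle_point_subgradient(
    grad_w=lambda w, t: t, grad_theta=lambda w, t: w - t,
    proj_w=lambda w: np.clip(w, -1, 1), proj_theta=lambda t: np.clip(t, -1, 1),
    w0=np.array([0.8]), theta0=np.array([0.9]), step=0.05, iters=2000)
print(w_bar, t_bar)   # both settle near 0, the unique saddle point
```

In this toy problem the iterates spiral into the saddle point at the origin and the averages settle near it, consistent with the 1/k rate established below.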

Lemma 2: Let the two sequences generated by eqn. (35) be denoted by $\{w_k\}$ and $\{\theta_k\}$, and let the Subgradient Boundedness Assumption hold. Further, let $\bar{w}_k$ and $\bar{\theta}_k$ be the iterate averages given by

$\bar{w}_k = \frac{1}{k}\sum_{i=0}^{k-1} w_i, \qquad \bar{\theta}_k = \frac{1}{k}\sum_{i=0}^{k-1} \theta_i$

We then have, for every $k \geq 1$,

$\frac{1}{k}\sum_{i=0}^{k-1} \phi(w_i, \theta_i) - \phi(w, \bar{\theta}_k) \;\leq\; \frac{\|w_0 - w\|^2}{2\alpha k} + \frac{\alpha L^2}{2}$ for any $w \in W$,   (38)

$\frac{1}{k}\sum_{i=0}^{k-1} \phi(w_i, \theta_i) - \phi(\bar{w}_k, \theta) \;\geq\; -\frac{\|\theta_0 - \theta\|^2}{2\alpha k} - \frac{\alpha L^2}{2}$ for any $\theta \in \Theta$.   (39)

Proof.
From Lemma 1(a) we have, for any $w \in W$ and all $i \geq 0$,

$\|w_{i+1} - w\|^2 \leq \|w_i - w\|^2 - 2\alpha(\phi(w_i, \theta_i) - \phi(w, \theta_i)) + \alpha^2 L^2$

so that

$\phi(w_i, \theta_i) - \phi(w, \theta_i) \leq \frac{1}{2\alpha}\left(\|w_i - w\|^2 - \|w_{i+1} - w\|^2\right) + \frac{\alpha L^2}{2}$

By adding these relations over $i = 0, \ldots, k-1$ and dividing by k, we obtain

$\frac{1}{k}\sum_{i=0}^{k-1} \phi(w_i, \theta_i) - \frac{1}{k}\sum_{i=0}^{k-1} \phi(w, \theta_i) \leq \frac{\|w_0 - w\|^2}{2\alpha k} + \frac{\alpha L^2}{2}$

Since the function $\phi(w, \theta)$ is concave in $\theta$, for any fixed $w \in W$ there holds

$\frac{1}{k}\sum_{i=0}^{k-1} \phi(w, \theta_i) \leq \phi(w, \bar{\theta}_k)$, with $\bar{\theta}_k = \frac{1}{k}\sum_{i=0}^{k-1} \theta_i$

Combining the two preceding relations, we get

$\frac{1}{k}\sum_{i=0}^{k-1} \phi(w_i, \theta_i) - \phi(w, \bar{\theta}_k) \leq \frac{\|w_0 - w\|^2}{2\alpha k} + \frac{\alpha L^2}{2}$

thus establishing eqn. (38).

Similarly, from Lemma 1(b) we have, for any $\theta \in \Theta$ and all $i \geq 0$,

$\|\theta_{i+1} - \theta\|^2 \leq \|\theta_i - \theta\|^2 + 2\alpha(\phi(w_i, \theta_i) - \phi(w_i, \theta)) + \alpha^2 L^2$

so that

$\phi(w_i, \theta_i) - \phi(w_i, \theta) \geq \frac{1}{2\alpha}\left(\|\theta_{i+1} - \theta\|^2 - \|\theta_i - \theta\|^2\right) - \frac{\alpha L^2}{2}$

Adding these relations over $i = 0, \ldots, k-1$ and dividing by k gives

$\frac{1}{k}\sum_{i=0}^{k-1} \phi(w_i, \theta_i) - \frac{1}{k}\sum_{i=0}^{k-1} \phi(w_i, \theta) \geq -\frac{\|\theta_0 - \theta\|^2}{2\alpha k} - \frac{\alpha L^2}{2}$

Because the function $\phi(w, \theta)$ is convex in w for any fixed $\theta$, we have

$\frac{1}{k}\sum_{i=0}^{k-1} \phi(w_i, \theta) \geq \phi(\bar{w}_k, \theta)$, with $\bar{w}_k = \frac{1}{k}\sum_{i=0}^{k-1} w_i$

Combining the two preceding relations, we get

$\frac{1}{k}\sum_{i=0}^{k-1} \phi(w_i, \theta_i) - \phi(\bar{w}_k, \theta) \geq -\frac{\|\theta_0 - \theta\|^2}{2\alpha k} - \frac{\alpha L^2}{2}$

thus showing relation (39).

Lemma 2 will be key in providing approximate saddle points, which we show next.

Proposition 1: Let the Subgradient Boundedness Assumption hold. Let $\{w_k\}$ and $\{\theta_k\}$ be the sequences generated by equation (35), and let $(w^*, \theta^*) \in W \times \Theta$ be a saddle point of $\phi(w, \theta)$. We then have:
(a)

$-\frac{\|\theta_0 - \theta^*\|^2}{2\alpha k} - \frac{\alpha L^2}{2} \;\leq\; \frac{1}{k}\sum_{i=0}^{k-1}\phi(w_i, \theta_i) - \phi(w^*, \theta^*) \;\leq\; \frac{\|w_0 - w^*\|^2}{2\alpha k} + \frac{\alpha L^2}{2}$   (40)

(b) The averages $\bar{w}_k$ and $\bar{\theta}_k$ satisfy the following relation:

$-\frac{\|\theta_0 - \theta^*\|^2 + \|w_0 - \bar{w}_k\|^2}{2\alpha k} - \alpha L^2 \;\leq\; \phi(\bar{w}_k, \bar{\theta}_k) - \phi(w^*, \theta^*) \;\leq\; \frac{\|w_0 - w^*\|^2 + \|\theta_0 - \bar{\theta}_k\|^2}{2\alpha k} + \alpha L^2$   (41)

where L is the subgradient bound of the Subgradient Boundedness Assumption.

Proof.
(a) In Lemma 2, we take $w = w^*$ in (38) and $\theta = \theta^*$ in (39); we get, for any $k \geq 1$,

$\frac{1}{k}\sum_{i=0}^{k-1}\phi(w_i, \theta_i) - \phi(w^*, \bar{\theta}_k) \leq \frac{\|w_0 - w^*\|^2}{2\alpha k} + \frac{\alpha L^2}{2}$

$\frac{1}{k}\sum_{i=0}^{k-1}\phi(w_i, \theta_i) - \phi(\bar{w}_k, \theta^*) \geq -\frac{\|\theta_0 - \theta^*\|^2}{2\alpha k} - \frac{\alpha L^2}{2}$

By convexity of the sets W and $\Theta$, we have $\bar{w}_k \in W$ and $\bar{\theta}_k \in \Theta$ for $k \geq 1$. So, by the saddle point relation (33),

$\phi(w^*, \bar{\theta}_k) \leq \phi(w^*, \theta^*) \leq \phi(\bar{w}_k, \theta^*)$

Combining the preceding relations, we obtain (40).

(b) Since $\bar{w}_k \in W$ and $\bar{\theta}_k \in \Theta$ for $k \geq 1$, we use Lemma 2 with $w = \bar{w}_k$ and $\theta = \bar{\theta}_k$ to get

$\frac{1}{k}\sum_{i=0}^{k-1}\phi(w_i, \theta_i) - \phi(\bar{w}_k, \bar{\theta}_k) \leq \frac{\|w_0 - \bar{w}_k\|^2}{2\alpha k} + \frac{\alpha L^2}{2}$

$\frac{1}{k}\sum_{i=0}^{k-1}\phi(w_i, \theta_i) - \phi(\bar{w}_k, \bar{\theta}_k) \geq -\frac{\|\theta_0 - \bar{\theta}_k\|^2}{2\alpha k} - \frac{\alpha L^2}{2}$

Multiplying these by -1, we get

$\phi(\bar{w}_k, \bar{\theta}_k) - \frac{1}{k}\sum_{i=0}^{k-1}\phi(w_i, \theta_i) \geq -\frac{\|w_0 - \bar{w}_k\|^2}{2\alpha k} - \frac{\alpha L^2}{2}$

$\phi(\bar{w}_k, \bar{\theta}_k) - \frac{1}{k}\sum_{i=0}^{k-1}\phi(w_i, \theta_i) \leq \frac{\|\theta_0 - \bar{\theta}_k\|^2}{2\alpha k} + \frac{\alpha L^2}{2}$

The result (41) follows by summing these relations with the corresponding bounds of part (a).

So finally we arrive at the following conclusion. Let the two sequences generated by eqn. (35) be denoted by $\{w_k\}$ and $\{\theta_k\}$, and let $\{\bar{w}_k\}$ and $\{\bar{\theta}_k\}$ be the iterate averages given by

$\bar{w}_k = \frac{1}{k}\sum_{i=0}^{k-1} w_i, \qquad \bar{\theta}_k = \frac{1}{k}\sum_{i=0}^{k-1} \theta_i$

Then, under the assumption of boundedness of the subgradients, the following results hold:
a) $\frac{1}{k}\sum_{i=0}^{k-1}\phi(w_i, \theta_i)$ converges to $\phi(w^*, \theta^*)$ to within an error level that depends on the step size and the bound on the subgradients, at a rate of 1/k.
b) $\phi(\bar{w}_k, \bar{\theta}_k)$ converges to $\phi(w^*, \theta^*)$ to within an error level that depends on the step size and the bound on the subgradients, at a rate of 1/k.

We use this method to obtain the solution of the optimization problem described in Section II for the optimal separating hyperplane.
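The paper does not spell out the convex-concave function it uses, so the sketch below is only an illustrative assumption: it applies the projected subgradient iteration (35), with iterate averaging, to the Lagrangian of the soft-margin problem (13), φ(w, b, θ) = ½‖w‖² + Σᵢ θᵢ(1 − yᵢ(wᵀxᵢ + b)), minimized over (w, b) and maximized over θ ∈ [0, C]ˡ. All names are illustrative.

```python
import numpy as np

def train_subgradient_svm(X, y, C=1.0, step=1e-3, iters=5000):
    """Illustrative sketch only: iteration (35) with averaging, applied to one
    possible convex-concave function for the soft-margin problem (13), namely
    its Lagrangian with the slacks eliminated (theta constrained to [0, C]).
    This is an assumption about the formulation, not the paper's exact method."""
    l, n = X.shape
    w, b = np.zeros(n), 0.0
    theta = np.full(l, C / 2.0)
    w_avg, b_avg = np.zeros(n), 0.0
    for _ in range(iters):
        margins = y * (X @ w + b)
        gw = w - X.T @ (theta * y)                 # subgradient in w
        gb = -np.sum(theta * y)                    # subgradient in b
        gt = 1.0 - margins                         # supergradient in theta
        w, b = w - step * gw, b - step * gb        # descent step (no projection needed)
        theta = np.clip(theta + step * gt, 0.0, C) # ascent step + projection onto [0, C]
        w_avg += w / iters
        b_avg += b / iters
    return w_avg, b_avg

# Prediction is then sign(X_test @ w_avg + b_avg).
```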
B. Results

Table 2 shows the results obtained with the above method. The results are compared with a linear C-SVM classifier using the same value of C.

Table 2. Percentage accuracy on the datasets using the subgradient method and linear SVM.

Datasets        Training samples   Linear SVM with C=1 (accuracy in %)   Classifier using subgradient method (in %)
a1a             1605               84.42                                 85.61
a2a             2265               83.97                                 80.62
australian      200                85.71                                 82.44
breast-cancer   483                99.5                                  89
Cod-rna         10000              85.34                                 84.14
German-numer    700                78.66                                 68
a3a             3185               83.9                                  79.3
diabetes        400                72.55                                 64.13

Source: http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/

V. IMPROVING ALGORITHM THROUGH RISK DECISION RULE

For non-separable samples, the optimization model uses the additional regularization parameter to minimize the risk functional. In this section we incorporate a risk decision rule to improve the performance of our algorithm. We first propose the risk decision approach and then present the results achieved on several datasets with the method described in the previous section.

A. Classification Rule of Risk Decision

The main idea of the risk decision is to model a rule for the samples in the non-separable domain. Given a test sample $x_0$, we can calculate its signed distance $f(x_0)$ from the optimal separating hyperplane. We then classify the sample as follows:
a) If $f(x_0) \geq 1$, then the sample $x_0$ is classified to the positive class.
b) If $f(x_0) \leq -1$, then the sample $x_0$ is classified to the negative class.
c) If $-1 < f(x_0) < 1$, then the sample $x_0$ lies in the non-separable domain and is classified to the positive class with probability $\eta_1$ and to the negative class with probability $\eta_2$, such that $\eta_1 + \eta_2 = 1$.

For the samples in the non-separable domain we use the following rule of classification. The non-separable domain is divided into the intervals
I1: $[\eta_0^*, 1)$,  I2: $[0, \eta_0^*)$,  I3: $[-\eta_0^*, 0)$,  I4: $(-1, -\eta_0^*)$,
where $\eta_0^*$ is an optimal threshold based on the risk decision rule. Based on the risk decision rule of ERM, we then calculate the numbers of positive and negative samples falling in the different intervals; we denote these numbers by $n_{ij}$, i = 1, 2 and j = 1, 2, 3, 4. The classification probabilities $\eta_1$ and $\eta_2$ for each interval are then obtained from these counts, and the classification rule above is applied to the samples in the different intervals of the non-separable domain.

B. Optimization of the Threshold

The threshold $\eta_0^*$ is a critical parameter of the decision rule. For any given threshold there exist two types of error: Type I errors, which are negative samples wrongly classified to the positive class, and Type II errors, which are positive samples wrongly classified to the negative class. For a sample falling in a given interval of the non-separable domain, the probability of a Type I error is $\eta_1$ times the probability that the sample is actually negative, and the probability of a Type II error is $(1 - \eta_1)$ times the probability that the sample is actually positive. The risk decision rule of ERM chooses the probabilities, and thereby the threshold $\eta_0^*$, by minimizing the expectation of these two types of error.
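A minimal sketch of the resulting decision rule is given below (Python). The interval boundaries and the way η1 would be derived from the counts n_ij are assumptions made for illustration, since the paper's exact expressions are not reproduced above; only the overall structure (deterministic decisions outside the margin, randomized decisions inside it) follows the rule just described.

```python
import numpy as np

def risk_decision_predict(f_x0, eta0, eta1_by_interval, rng):
    """Classify one sample from its signed distance f(x0) to the hyperplane.
    Outside the margin the decision is deterministic (rules a and b); inside
    it the decision is randomized with an interval-dependent probability eta1
    of choosing the positive class (rule c).  `eta1_by_interval` maps the
    intervals I1..I4 to eta1 values estimated from the counts n_ij
    (that estimation is not shown here)."""
    if f_x0 >= 1.0:
        return +1
    if f_x0 <= -1.0:
        return -1
    if f_x0 >= eta0:
        interval = "I1"
    elif f_x0 >= 0.0:
        interval = "I2"
    elif f_x0 >= -eta0:
        interval = "I3"
    else:
        interval = "I4"
    eta1 = eta1_by_interval[interval]
    return +1 if rng.random() < eta1 else -1

# Example with illustrative probabilities (not taken from the paper)
rng = np.random.default_rng(0)
etas = {"I1": 0.9, "I2": 0.6, "I3": 0.4, "I4": 0.1}
print([risk_decision_predict(f, 0.5, etas, rng)
       for f in [1.3, 0.7, 0.2, -0.3, -0.8, -1.2]])
```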

C. Results

Table 3 shows the results on different datasets using the subgradient method combined with the above risk decision rule. The results can be compared with those of the linear C-SVM with C = 1 on the same datasets, as reported in the previous section.

Table 3. Percentage accuracy on the datasets using the subgradient method combined with the risk decision rule.

Datasets        Training samples   Classifier using subgradient method with risk decision rule (in %)
a1a             1605               87.17
a2a             2265               81.48
australian      200                78.97
breast-cancer   483                86
Cod-rna         10000              88.12
German-numer    700                71
a3a             3185               81.34
diabetes        400                68.75

Note: The datasets are the same as those used in the previous section.
VI. CONCLUSION
In this paper, a new statistical classification method has been presented using the concept of the maximal margin. A minimax problem has been formulated using the margin concept. The algorithm used for finding the saddle point takes steps along the subgradient directions of the function $\phi(w, \theta)$ with respect to w and $\theta$, and uses an averaging scheme to generate an approximate saddle point. The method can also be used in the non-differentiable case.

To further improve the performance, a risk decision rule has been incorporated, based on the principle of empirical risk minimization. It uses a probabilistic approach to modify the decision surface and captures those points that lie near the separating hyperplane.

The results shown in Table 2 clearly show that the accuracy achieved by this classification method is very close to that of the SVM, if not better. Table 3 shows that, after the incorporation of the risk decision rule, it gives better performance than the SVM on some datasets. The reason is that when the number of training samples is large, the empirical risk is close to the actual risk and the model therefore achieves better accuracy, whereas the datasets with fewer training samples do not perform as well.

Although we have not used kernel-based learning in this approach, incorporating such a concept into the classification remains future work. Another significant extension of this work is the formation of the hyperplane when both convex sets in the classification are open sets. In such cases it has been proved that a separating hyperplane exists, but not necessarily with any gap.

VII. REFERENCES
[1] H. Drucker, C.J.C. Burges, A.J. Smola, and V. Vapnik, Support Vector Regression Machines, NIPS, pp. 155-161 (1996).
[2] A. Chambolle and T. Pock, A First-Order Primal-Dual Algorithm for Convex Problems with Applications to Imaging, Journal of Mathematical Imaging and Vision 40(1), pp. 120-145 (2011).
[3] I.Y. Zabotin, A subgradient method for finding a saddle point of a convex-concave function, Issledovania po prikladnoi matematike 15, pp. 6-12 (1988).
[4] A.S. Nemirovski and D.B. Judin, Cesàro convergence of the gradient method of approximating saddle points of convex-concave functions, Doklady Akademii Nauk SSSR 239, pp. 1056-1059 (1978).
[5] G.M. Korpelevich, The extragradient method for finding saddle points and other problems, Matekon 13, pp. 35-49 (1977).
[6] A. Nedic and A.E. Ozdaglar, Approximate Primal Solutions and Rate Analysis for Dual Subgradient Methods, SIAM Journal on Optimization 19(4), pp. 1757-1780 (2009).
[7] T. Larsson, M. Patriksson, and A. Stromberg, Ergodic results and bounds on the optimal value in subgradient optimization, Operations Research Proceedings (P. Kelinschmidt et al., eds.), Springer, pp. 30-35 (1995).
[8] G. Schwarz, "Estimating the dimension of a model," Annals of Statistics 6, pp. 461-464 (1978).
[9] N.V. Smirnov, Theory of Probability and Mathematical Statistics (Selected Works), Nauka, Moscow (1970).
[10] K.J. Arrow, L. Hurwicz, and H. Uzawa, Studies in Linear and Non-Linear Programming, Stanford University Press, Stanford, CA (1958).
[11] D. Maistroskii, Gradient methods for finding saddle points, Matekon 13, pp. 3-22 (1977).
[12] H. Masnadi-Shirazi and N. Vasconcelos, Risk minimization, probability elicitation, and cost-sensitive SVMs, Proceedings of the International Conference on Machine Learning (ICML), 2010.
[13] H. Uzawa, Iterative methods in concave programming, in Studies in Linear and Nonlinear Programming (K. Arrow, L. Hurwicz, and H. Uzawa, eds.), Stanford University Press, pp. 154-165 (1958).
[14] X. Zhang and L. Yang, Improving SVM through a Risk Decision Rule Running on MATLAB, Journal of Software 7(10), pp. 2252-2257, October 2012.
[15] A. Nedic and A.E. Ozdaglar, Subgradient methods in network resource allocation: Rate analysis, CISS 2008, pp. 1189-1194.
[16] R.J. Solomonoff, "A formal theory of inductive inference," Parts 1 and 2, Information and Control 7, pp. 1-22 and pp. 224-254 (1964).
[17] R.A. Tapia and J.R. Thompson, Nonparametric Probability Density Estimation, The Johns Hopkins University Press, Baltimore (1978).
[18] E.G. Gol'shtein, A generalized gradient method for finding saddle points, Matekon 10, pp. 36-52 (1974).
[19] V. Vapnik, The Nature of Statistical Learning Theory, Springer-Verlag, New York, 1995.
[20] V. Vapnik, Statistical Learning Theory, John Wiley and Sons, Inc., New York, 1998.
[21] V. Vapnik, E. Levin, and Y. LeCun, Measuring the VC-Dimension of a Learning Machine, Neural Computation 6, pp. 851-876 (1994).
[22] C. Burges, A tutorial on support vector machines for pattern recognition, Data Mining and Knowledge Discovery 2(2) (1998).
[23] C. Cortes, Prediction of Generalization Ability in Learning Machines, PhD dissertation, University of Rochester (1995).
[24] J. Berger, Statistical Decision Theory and Bayesian Analysis, Springer (1985).
