
Linear Classifiers

April 4, 2017

1 An Introduction to Linear Classifiers


By Sibt ul Hussain

In [1]: %pylab inline


import scipy.stats
from collections import defaultdict # default dictionary
plt.style.use('ggplot')
matplotlib.rcParams['figure.figsize'] = (10.0, 8.0)
%load_ext autoreload
%autoreload 2

Populating the interactive namespace from numpy and matplotlib

In [2]: np.random.seed(seed=99)

# make some data up


mean1 = [-3,-3]
mean2 = [2,2]
cov = [[1.0,0.0],[0.0,1.0]]

#create some points


x1 = np.random.multivariate_normal(mean1,cov,50)
x2 = np.random.multivariate_normal(mean2,cov,50)

plt.scatter(x1[:,0],x1[:,1], c='r', s=100)


plt.scatter(x2[:,0],x2[:,1], c='b', s=100)

plt.plot([-4,5.5],[4,-5.5], c='g', linewidth=5.0)

plt.title("Linear Classification")
plt.xlabel("feature $x_1$")
plt.ylabel("feature $x_2$")

fig_ml_in_10 = plt.gcf()
plt.savefig('linear-class-fig.svg',format='svg')

Goal of a Linear Classifier Given a linearly separable dataset (as shown in the figure above), the goal of a linear classifier is to learn a classifier (a separator) from the given examples that can correctly predict the classes of future instances.
For simplicity's sake, we will focus on the given dataset, where the examples have two features ($x_1$ and $x_2$) and we have a binary classification problem. Our goal here is to find the decision boundary (a line in 2D, a plane in 3D, and a hyperplane in higher-dimensional spaces) that can separate the two classes.
Recall that a line or plane is defined by two attributes: 1. Its position (offset). 2. Its orientation (normal vector or slope).
In 2D, we know that the equation of a line can be written as $y = mx + c$, where $m$ defines the line's orientation and $c$ its offset. This can be rewritten in the general form of a line, $ax + by + c = 0$, i.e.,

$$y - mx - c = 0 \quad (1)$$
$$-y + mx + c = 0 \quad (2)$$
$$by + ax + c = 0 \quad (3)$$

where $b = -1$ and $a = m$ for the current setup.
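
As a quick sanity check, here is a minimal sketch (with made-up values $m = 2$, $c = 1$, chosen only for illustration) verifying that points on $y = mx + c$ satisfy $by + ax + c = 0$ with $a = m$ and $b = -1$:

import numpy as np

m, c = 2.0, 1.0                  # assumed slope and offset
a, b = m, -1.0                   # general-form coefficients

xs = np.linspace(-5, 5, 11)      # some x-coordinates
ys = m * xs + c                  # the corresponding points on the line
print(np.allclose(b * ys + a * xs + c, 0))   # prints True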

1.1 Line Positioned at Origin (Classes are placed around origin)
Let's first assume that our separating surface is positioned at the origin, i.e. it passes through the origin, so its offset is 0. The equation of the line can then be written as $y = mx$, or $y - mx = 0$, or $ax + by = 0$, which in matrix form is

$$\begin{bmatrix} x & y \end{bmatrix} \begin{bmatrix} a \\ b \end{bmatrix} = 0$$

Now here we have: (i) two features ($x$ and $y$), which will from now on be represented by $x_1$ and $x_2$ respectively, and (ii) two parameters $a$ and $b$, which will be represented by $\theta_1$ and $\theta_2$.
We can now compactly write the equation of the line as $x^T \theta = 0$, where $x = [x_1 \ x_2]$ and $\theta = [\theta_1 \ \theta_2]$.
All the points that lie on this line are perpendicular to the parameter vector $\theta$, which defines the orientation of our line (or plane).
This equation of the line now defines three regions:

1. The first region is defined by all the points $x^{(i)}$ that lie on the line (shown in green); these are the points that are perpendicular to our parameter vector $\theta$, since $x^T \theta = 0$. We can see this from the definition of the dot product, $x^T \theta = |x||\theta| \cos \alpha$ (here $\alpha$ is the angle between the 2D point $x$ and the parameter vector $\theta$): $|x||\theta| \cos \alpha$ is zero only when $\alpha = 90°$ (or $270°$). The points lying in this region are the ones that define the decision boundary.

2. The second region is defined by all the points $x^{(i)}$ that lie above the line (shown as the orange shaded region), giving $x^T \theta \geq 0$. These are the points whose angle with the parameter vector $\theta$ is either less than $90°$ or greater than $270°$. In other words, they lie in the 1st or 4th quadrant of the coordinate system defined by the line and its perpendicular vector $\theta$ (as shown in the figure). Recall that $\cos \alpha$ is positive for $\alpha \in [0°, 90°) \cup (270°, 360°]$, so $|x||\theta| \cos \alpha$ will be greater than or equal to zero. All the points lying in this region will be categorized as positive examples; in short, if the dot product of an example $x$ with the parameter vector is greater than or equal to 0, we classify it as a positive example.

3. The third region is defined by all the points $x^{(i)}$ that lie below the line, giving $x^T \theta < 0$. These are the points whose angle with the parameter vector $\theta$ is greater than $90°$ and less than $270°$. In other words, they lie in the 2nd or 3rd quadrant of the coordinate system defined by the line and its perpendicular vector $\theta$ (as shown in the figure). Recall that $\cos \alpha$ is negative for $\alpha \in (90°, 270°)$, so $|x||\theta| \cos \alpha$ will always be less than zero. All the points lying in this region will be classified as negative examples; in short, if the dot product of an example $x$ with the parameter vector is less than 0, we classify it as a negative example. (A small numerical sketch of this rule follows below.)
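
To make the rule concrete, here is a minimal sketch (with an assumed parameter vector $\theta$ and a few made-up test points, not taken from the figure) that classifies points by the sign of $x^T \theta$ for a line through the origin:

import numpy as np

theta = np.array([1.0, 1.0])        # assumed parameter vector (line x1 + x2 = 0)

points = np.array([[ 1.0, -1.0],    # on the line:    x . theta == 0
                   [ 2.0,  3.0],    # positive side:  x . theta >  0
                   [-2.0, -1.0]])   # negative side:  x . theta <  0

scores = points @ theta             # dot product of each point with theta
print(scores, np.where(scores >= 0, +1, -1))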

1.2 Line Positioned at an Arbitrary Position (Classes can be placed anywhere)


In [3]: np.random.seed(seed=99)

# make some data up


mean1 = [3,3]
mean2 = [8,8]
cov = [[1.0,0.0],[0.0,1.0]]

#create some points


x1 = np.random.multivariate_normal(mean1,cov,50)
x2 = np.random.multivariate_normal(mean2,cov,50)

plt.scatter(x1[:,0],x1[:,1], c='r', s=100)


plt.scatter(x2[:,0],x2[:,1], c='b', s=100)

plt.plot([2,10],[10,1], c='g', linewidth=5.0)

plt.ylim(-3,12)
plt.title("Linear Classification")
plt.xlabel("feature $x_1$")
plt.ylabel("feature $x_2$")

fig_ml_in_10 = plt.gcf()
plt.savefig('linear-class-fig-2.svg',format='svg')

Now in this case our decision surface no longer passes through the origin, i.e. its offset is non-zero. The equation of the line is then $y = mx + c$, or $ax + by + c = 0$, or in matrix form

$$\begin{bmatrix} x & y & 1 \end{bmatrix} \begin{bmatrix} a \\ b \\ c \end{bmatrix} = 0$$

Now here we have: (i) two features ($x$ and $y$), which will again be represented by $x_1$ and $x_2$ respectively, and (ii) three parameters $a$, $b$ and $c$, which will be represented by $\theta_1$, $\theta_2$ and $\theta_3$ respectively.
We can again compactly write the equation of the line as $x^T \theta = 0$, where $x = [x_1 \ x_2 \ 1]$ and $\theta = [\theta_1 \ \theta_2 \ \theta_3]$.
Here we lose the neat geometric picture we had when the line passed through the origin, so let's see how to approach this problem.
Now given a test example $x = (x_1, x_2)$, our goal is to classify it using our linear classifier defined by $\theta$.

We can write this example $x$ as the sum of two other vectors, $y$ and $z$, using vector addition, i.e. $x = y + z$, where $y$ is a vector lying on the line, and therefore $[y_1 \ y_2] \begin{bmatrix} \theta_1 \\ \theta_2 \end{bmatrix} = -\theta_3$, and $z = x - y$ is the remaining component.
As in the previous section, we can use simple trigonometry to decide whether $z$ falls on the positive or the negative side. Thus, for $z$ to be classified as positive,

$$z^T \begin{bmatrix} \theta_1 \\ \theta_2 \end{bmatrix} \geq 0$$

Thus

$$(x - y)^T \begin{bmatrix} \theta_1 \\ \theta_2 \end{bmatrix} \geq 0$$

$$x^T \begin{bmatrix} \theta_1 \\ \theta_2 \end{bmatrix} - y^T \begin{bmatrix} \theta_1 \\ \theta_2 \end{bmatrix} \geq 0, \quad \text{where we know that } y^T \begin{bmatrix} \theta_1 \\ \theta_2 \end{bmatrix} = -\theta_3, \text{ so}$$

$$\begin{bmatrix} x_1 & x_2 & 1 \end{bmatrix} \begin{bmatrix} \theta_1 \\ \theta_2 \\ \theta_3 \end{bmatrix} \geq 0$$

Hence we simply take the dot product of the new example (augmented with a constant 1) with the parameter vector $\theta$ to check whether it is a positive or a negative example, irrespective of where the line lies.
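
In code, this amounts to appending a constant 1 to each example and taking the dot product with $\theta$; a minimal sketch (the parameter values and test points below are made up for illustration, not taken from the figure):

import numpy as np

theta = np.array([1.0, 1.0, -11.0])      # assumed [theta1, theta2, theta3]

X = np.array([[3.0, 3.0],                # made-up test points
              [8.0, 8.0]])
X_aug = np.hstack([X, np.ones((X.shape[0], 1))])   # append the constant 1
scores = X_aug @ theta                             # [x1 x2 1] . [theta1 theta2 theta3]
print(np.where(scores >= 0, +1, -1))               # sign gives the predicted class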

1.3 Linear Classifier Summary


Given an example $x$, if its dot product with the parameter vector is greater than 0 we classify it as a positive example, otherwise as a negative one.
What about our confidence in the prediction? Luckily, that also comes from the dot product of our example $x$ with $\theta$: it equals the absolute value of the dot product, which is proportional to the distance of the example from the separating hyperplane (the greater the distance, the greater the confidence).
So, in summary, the dot product tells us two things:

• Its sign tells us which class the example belongs to.

• Its magnitude (absolute value) tells us the confidence in the prediction, i.e. (up to scaling) the distance of the example from the separating hyperplane. The greater the magnitude, the greater our confidence that the example belongs to the predicted class.
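
Putting both pieces together, here is a minimal sketch (the helper name and the parameter values are hypothetical, chosen for illustration) that returns the predicted label together with a confidence score; dividing $|x^T \theta|$ by $\|[\theta_1, \theta_2]\|$ gives the geometric distance from the line:

import numpy as np

def predict_with_confidence(X, theta):
    # X: (m, 2) raw examples; theta = [theta1, theta2, theta3] (assumed).
    X_aug = np.hstack([X, np.ones((X.shape[0], 1))])        # append the constant 1
    scores = X_aug @ theta                                   # x^T theta per example
    labels = np.where(scores >= 0, +1, -1)                   # sign -> predicted class
    distances = np.abs(scores) / np.linalg.norm(theta[:2])   # confidence as distance
    return labels, distances

theta = np.array([1.0, 1.0, -11.0])                    # assumed parameters
X = np.array([[3.0, 3.0], [8.0, 8.0], [10.0, 10.0]])   # made-up examples
print(predict_with_confidence(X, theta))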

2 Learning Setup
As we recall, to define a complete learning system we have to specify its three main components:

• Hypothesis with parameters.

• Cost Function (or Objective or Loss Function).

• Derivative of Cost Function

Now that we have defined the hypothesis for our linear classifiers, our next goal is to define the cost function $J$:

$$J(\theta) = \frac{1}{m} \sum_{i=1}^{m} L_i + \lambda \, \mathrm{Reg}(\theta)$$

Different popular linear classification algorithms (such as SVM, Logistic Regression, etc.) differentiate themselves by the cost function $L_i$ they use for penalizing the classifier's mistakes on the $i$th example.
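
As a rough sketch of this template (the function names, the choice of L2 regularization for Reg, and the $\lambda$ value are assumptions for illustration, not a fixed prescription), where each of the loss functions below can be plugged in as loss_fn:

import numpy as np

def cost(theta, X, y, loss_fn, lam=0.1):
    # J(theta) = (1/m) * sum_i L_i + lambda * Reg(theta)
    # loss_fn maps the margin y^(i) * x^(i)T theta to the per-example loss L_i.
    margins = y * (X @ theta)
    data_term = np.mean(loss_fn(margins))
    reg_term = lam * np.sum(theta ** 2)   # Reg(theta), assumed L2 here
    return data_term + reg_term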

2.1 Perceptron Loss Function


The perceptron loss function aims to reduce the number of misclassifications on the training set. However, instead of using a non-differentiable discrete loss function that assigns a penalty of 1 for a misclassification and 0 for a correct classification, i.e.

$$L_i = \mathbb{1}\big[\operatorname{sign}(h_\theta(x^{(i)})) \neq y^{(i)}\big]$$

it uses a continuous function that penalizes a misclassification in proportion to its distance from the decision surface (in other words, its confidence), i.e.,

$$L_i = \max(0, -y^{(i)} h_\theta(x^{(i)}))$$

Here, if $y^{(i)} h_\theta(x^{(i)}) \geq 0$, the classification is correct. Notice how this cost function does not penalize correctly classified examples, but penalizes a wrongly classified example in proportion to its confidence.
Once the cost function has been defined, we use gradient descent (or its variants) to find the best set of parameters. However, to run the gradient descent iterations we need the gradient of the perceptron cost function. The derivative w.r.t. a parameter $\theta_j$ turns out to be

$$\frac{\partial L_i(h_\theta(x^{(i)}), y^{(i)})}{\partial \theta_j} = \begin{cases} -y^{(i)} x_j^{(i)} & \text{if } y^{(i)} h_\theta(x^{(i)}) < 0, \\ 0 & \text{otherwise.} \end{cases}$$
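
A minimal training sketch under these definitions (the learning rate, the number of iterations, and the assumption that X already carries a trailing column of ones are all illustrative choices):

import numpy as np

def perceptron_train(X, y, lr=0.1, n_iters=100):
    # X: (m, n) examples with a constant-1 column; y: labels in {-1, +1}.
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        scores = X @ theta                  # h_theta(x^(i)) for all examples
        mask = y * scores < 0               # misclassified examples only
        # gradient of (1/m) * sum_i L_i: sum of -y^(i) x^(i) over mistakes, divided by m
        grad = -(y[mask, None] * X[mask]).sum(axis=0) / len(y)
        theta -= lr * grad                  # gradient descent step
    return theta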

2.2 Support Vector Machines (SVM).
Linear SVM goes a step further and encodes its preference for a specific plane in the cost function. Precisely, SVM tries to find the hyperplane that maximally separates the two classes (ideally sitting exactly in the middle between them). By doing so, SVM tends to achieve better generalization performance. Specifically, SVM uses the following loss function (also called the hinge loss):

$$L_i = \max(0, \delta - y^{(i)} h_\theta(x^{(i)}))$$


Thus, SVM not only penalizes a hypothesis for the examples it misclassifies, but also for the examples on which it has low confidence, i.e. $y^{(i)} h_\theta(x^{(i)}) < \delta$. In other words, it creates an imaginary margin (or buffer zone) of size $\delta$ on both sides of the plane to ensure maximum separation between the classes. Here $\delta$ is a hyper-parameter.
However, if we look closely, we can see that this margin is controlled by the magnitude of the parameters, since $x^T \theta = |x||\theta| \cos \alpha$: the geometric width of the margin corresponding to $\delta$ is $\delta / |\theta|$. So if we multiply all the parameters by 2, the margin shrinks by a factor of 2; similarly, if we divide all the parameters by 2, the margin grows by a factor of 2. In other words, smaller weights imply a larger margin and larger weights imply a smaller margin. Thus, instead of controlling the margin via the extra hyper-parameter $\delta$, we set $\delta = 1$ and control the weights indirectly via the regularization parameter $\lambda$, which in turn controls the margin and lets us obtain a maximum-margin separating hyperplane.
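
A sketch of the corresponding hinge-loss cost and its subgradient, with $\delta$ fixed to 1 as described above (the function name and $\lambda$ value are illustrative, and for simplicity the constant-1 column is regularized along with the other weights):

import numpy as np

def svm_cost_grad(theta, X, y, lam=0.1):
    # X: (m, n) examples with a constant-1 column; y: labels in {-1, +1}.
    m = len(y)
    margins = y * (X @ theta)
    losses = np.maximum(0, 1 - margins)      # L_i = max(0, 1 - y^(i) x^(i)T theta)
    cost = losses.mean() + lam * np.sum(theta ** 2)
    active = margins < 1                     # inside the margin or misclassified
    grad = -(y[active, None] * X[active]).sum(axis=0) / m + 2 * lam * theta
    return cost, grad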

2.3 Logistic Classifier.


The logistic classifier squashes the output range of the linear classifier using the sigmoid (or softmax) function to produce a probabilistic output; that is, it computes the probability of the $i$th example belonging to the correct class. It then uses the negative log-loss as the cost function. Mathematically,

$$L_i = -\log\big(p(\text{correct class})\big)$$


Please see the comparison of these different cost functions in the next plot.
We will see the logistic classifier in more detail in the next set of notes.
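
For binary labels in $\{-1, +1\}$, using the sigmoid for the probability gives $L_i = \log(1 + \exp(-y^{(i)} x^{(i)T} \theta))$, the logistic loss plotted in the next section. A minimal sketch (the function name is illustrative):

import numpy as np

def logistic_loss(theta, X, y):
    # p(correct class) = sigmoid(y * x^T theta), so L_i = log(1 + exp(-y * x^T theta)).
    margins = y * (X @ theta)
    return np.mean(np.log1p(np.exp(-margins)))   # average negative log-likelihood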

3 Loss Functions
Let's plot the different cost functions:

• Perceptron loss, i.e. $\max(0, -y^{(i)} x^{(i)T} \theta)$

• Hinge (SVM) loss, i.e. $\max(0, 1 - y^{(i)} x^{(i)T} \theta)$

• Logistic loss, i.e. $\log(1 + \exp(-y^{(i)} x^{(i)T} \theta))$

In [11]: x=np.arange(-4,5,0.25)        # x plays the role of the margin y^(i) x^(i)T theta
y=np.maximum(0,-x)                     # perceptron loss
plt.plot(x,y)
y=np.maximum(0,1-x)                    # hinge (SVM) loss
plt.plot(x,y)
y=-np.log(1.0/(1+np.exp(-x)))          # logistic loss, equals log(1 + exp(-x))
plt.plot(x,y)
plt.legend(['Perceptron (0-1) Loss','SVM (Hinge) Loss', 'Logistic Loss'])

Out[11]: <matplotlib.legend.Legend at 0x7f90070ef150>
