
Machine Learning

Husnain Inayat Hussain

List of Figures

1.1  Geometric Interpretation of Perceptron Learning
1.2  (a) When the Inputs are not Scaled. (b) When the Inputs are Scaled.
1.3  Gradient Descent in Action
1.4  (a) Data is Noisy. (b) Target is too complex.
1.5  (a) Bias. (b) Variance.
1.6  The Ellipse as the Error and the Circle as Regularizer.

2.1  Two Data Points are Separated by Margin r.
2.2  Three Different Classifiers with Different Margins.
List of Tables

1.1  A Toy Example Classification Data
1.2  Modified Classification Data with Bias Included
1.3  (a) Badly Scaled Input Data. (b) Standardized Input Data

Contents

1 Introduction
  1.1 Perceptron Learning
      1.1.1 Linear Algebraic Notation
      1.1.2 Perceptron Hypothesis
      1.1.3 Perceptron Learning Rule
  1.2 Input Scaling
  1.3 Stochastic Gradient Descent
  1.4 Noise
      1.4.1 Bias Variance Analysis
      1.4.2 Regularization
  1.5 Conclusion

2 Supervised Learning Techniques
  2.1 Support Vector Machines
      2.1.1 Maximum Margin Classifier
      2.1.2 Quantifying the Margin

Nations are born in the hearts of poets, they prosper and die in the hands of politicians.

Allama Mohammad Iqbal (1877-1938)

Introduction

This is a short guide whose aim is to provide you with the material asked in the third hourly. All the questions that you will see in the paper are covered here. The answers will not be explicit, but you should be able to respond to many of them if you attended the lectures and read this guide too.

1.1 Perceptron Learning


Perceptrons are by far the easiest learning algorithm to understand and the most comfortable to implement. As with most of the learning algorithms in this class, the first step is to analyze the data which needs to be classified (remember, perceptrons are used for classification). To understand how perceptrons work we will consider a toy example as shown in Table (1.1). The data can be divided into two separate categories. One belongs to the first two columns of the table. This is generally termed raw input and is denoted by the letter x (the boldface letter x if x is multidimensional). The second category is the third column y. If we just think about the third column without any explicit information, we could arrive at the following conclusion: y appears to be the result of the first two x values. If we can only figure out the relationship between the input and the output, then our job is done.

In more technical jargon (language), our task is to find the mapping between the input and the output. This mapping, which could also be called a function, takes in any general input and produces an output which is close to the original output. So we are required to learn a function (hypothesis) given the output label y. This kind of learning is called supervised learning. But the question is: how are we going to learn the hypothesis from the data? To answer this question, let us consider the idea of a signal. Ok! So we were already wondering about getting the function, and now a new term, signal, has been introduced. What are we going to do about it? One thing is certain: the signal has got something to do with the data. More precisely, the signal is connected with the input data. Let us look at Eq. (1.1) to see what the signal looks like for the first data point.

 x1    x2    y
 10    60    no
 16    80    no
 10    77    no
 22   100    no
 25    95    no
 50   170    yes
 45   180    yes
 80   130    yes
 55   150    yes
 20   110    no
 32   130    yes
100   190    ?

Table 1.1: A Toy Example Classification Data

$$s(x^{(1)}) = w_0 x_0^{(1)} + w_1 x_1^{(1)} + w_2 x_2^{(1)} \tag{1.1}$$

In Eq. (1.1), $s(x)$ denotes the signal as a function of x, $x_j^{(i)}$ is the input where the subscript j denotes the coordinate (dimension) and the bracketed superscript (i) denotes the example number. The w's are the parameters of the signal. Note, however, that the original data does not contain the coordinate $x_0$. Where has this coordinate come from and what is its purpose? The answer is that it is just a mathematical artifact (a technique) which helps nicely solve our problem. It is the threshold of the signal, which is its value at x = 0. This is the same reason why it is also termed the bias. In order to include the bias in the problem, the data is slightly modified. The modification is shown with the help of Table (1.2). A column of 1s is added to act as the bias of the problem.
Let's recap what has been done so far:

- a dataset is given
- the rows of the data are identified as different instances or examples
- the data is categorized into raw inputs and output
- a signal is formed out of the first example of the dataset
- the signal is formed using a certain weight vector w and the first input example $x^{(1)}$
Let's shed more light on what the signal is. The word has its origin in electrical engineering. But we are not studying any electrical engineering here. We are only borrowing this term to make things simple for us. The signal is nothing but a linear combination of the weight vector w and the input vector x. A more compact method is available to represent the signal. This is shown with the help of Eq. (1.2).

$$s(x) = w^T x \tag{1.2}$$


x0    x1    x2    y
 1    10    60    no
 1    16    80    no
 1    10    77    no
 1    22   100    no
 1    25    95    no
 1    50   170    yes
 1    45   180    yes
 1    80   130    yes
 1    55   150    yes
 1    20   110    no
 1    32   130    yes
 1   100   190    ?

Table 1.2: Modified Classification Data with Bias Included

1.1.1 Linear Algebraic Notation

Linear algebra has the necessary tools to make our job easy. Eq. (1.2) can be expanded in terms of the exact vectors to understand a little more clearly how the signal on the ith example is formed. This is shown with the help of Eq. (1.3).


$$s(x^{(i)}) = \begin{bmatrix} w_0 & w_1 & w_2 \end{bmatrix} \begin{bmatrix} x_0^{(i)} \\ x_1^{(i)} \\ x_2^{(i)} \end{bmatrix} \tag{1.3}$$

So far we have been forming signals out of one example only. The problem is that we have an entire dataset, and there must be some way to get signals from all the examples in the dataset. One way to do this is to use matrix multiplication. First let us consider a data matrix which is formed as follows:

$$X = \begin{bmatrix} x_0^{(1)} & x_1^{(1)} & x_2^{(1)} & \dots & x_n^{(1)} \\ x_0^{(2)} & x_1^{(2)} & x_2^{(2)} & \dots & x_n^{(2)} \\ x_0^{(3)} & x_1^{(3)} & x_2^{(3)} & \dots & x_n^{(3)} \\ x_0^{(4)} & x_1^{(4)} & x_2^{(4)} & \dots & x_n^{(4)} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ x_0^{(m)} & x_1^{(m)} & x_2^{(m)} & \dots & x_n^{(m)} \end{bmatrix} \tag{1.4}$$

In Eq. (1.4) is shown the data matrix X whose dimensions are m × (n + 1). Here, m denotes the number of examples and n denotes the dimension of the data. Note that although n is the dimension of the original input data, after adding the bias the dimension becomes n + 1. But since the subscript for the bias dimension is simply 0, the notation for the last dimension is still n. Now when we count the dimensions including 0, it turns out to be n + 1. Now that we know how to form the signal from one data point, we could form signals from all data points. Using the data matrix as shown in Eq. (1.4), we can write the following relationship:

$$s(X) = \begin{bmatrix} x_0^{(1)} & x_1^{(1)} & x_2^{(1)} & \dots & x_n^{(1)} \\ x_0^{(2)} & x_1^{(2)} & x_2^{(2)} & \dots & x_n^{(2)} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ x_0^{(m)} & x_1^{(m)} & x_2^{(m)} & \dots & x_n^{(m)} \end{bmatrix} \begin{bmatrix} w_0 \\ w_1 \\ w_2 \\ \vdots \\ w_n \end{bmatrix} \tag{1.5}$$

The result of the matrix multiplication presented in Eq. (1.5) is a vector of signals whose dimension is m × 1. In other words, s is a column vector with m components, where each component corresponds to one example in the dataset. In more compact notation, Eq. (1.5) is written as:

$$s = Xw \tag{1.6}$$
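To make this concrete, here is a minimal Octave / MATLAB sketch (not part of the lectures; the weight values are invented purely for illustration) that builds the data matrix of Table (1.2) and computes all twelve signals at once using Eq. (1.6):

% data matrix X from Table 1.2: bias column, then x1 and x2
X = [ones(12, 1), ...
     [10 16 10 22 25 50 45 80 55 20 32 100]', ...
     [60 80 77 100 95 170 180 130 150 110 130 190]'];
w = [-100; 0.5; 0.7];   % an arbitrary illustrative weight vector
s = X * w;              % Eq. (1.6): a 12 x 1 vector, one signal per example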

1.1.2 Perceptron Hypothesis

At this stage we have successfully extracted all the signals from the initial dataset. The big question is: how does the perceptron classify each of these signals into the two classes yes and no? The perceptron uses the following rule to classify the signals:

$$h_w(x) = \begin{cases} +1 & \text{if } s(x) \geq 0 \\ -1 & \text{if } s(x) < 0 \end{cases} \tag{1.7}$$

In Eq. (1.7), $h_w(x)$ denotes the hypothesis and is read as: the hypothesis as a function of x and parameterized by w. The hypothesis is pretty simple. It outputs a binary value. The result is either +1 or -1. We could think of +1 as the output of the hypothesis when y = yes and -1 when y = no.
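In code, the hypothesis of Eq. (1.7) can be evaluated for all examples in one vectorized line. A small sketch, reusing X and w from above (note that Octave's sign() returns 0 when the signal is exactly 0, whereas Eq. (1.7) assigns +1 to that boundary case):

h = sign(X * w);   % +1 corresponds to yes, -1 to no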

1.1.3 Perceptron Learning Rule

Now that we have the perceptron hypothesis, how do we know that the initially selected weight vector produces the correct output? To do this, we use the perceptron learning rule. The rule says the following:

- pick an example at random
- obtain the value of the hypothesis from an initially selected set of weights
- compare the hypothesis response with the true label y
- in case of mismatch, multiply the input vector by the correct label (+1 or -1)
- update by adding this multiplied input to the weight vector

Given the dataset X having m examples, output y and weight vector w, an OCTAVE / MATLAB implementation of the above rule is presented with the help of List 1.1.


Listing 1.1: Perceptron Algorithm in Octave / Matlab

while (err ~= 0) && (count <= 50)
  % pick one example at random
  ind = ceil(m * rand());
  % update the weights only on a misclassified example
  if (sign(X(ind, :) * w) ~= y(ind))
    w = w + y(ind) * (X(ind, :))';
    count = count + 1;
  end
  % fraction of misclassified examples
  err = sum(sign(X * w) ~= y) / m;
end

The perceptron learning rule can also be illustrated with the help of Fig. 1.1. It can be seen that with the given set of weights the signal $(s = w^T x) > 0$. However, the true label is -1. To correct this, the input x is multiplied by the correct label y and added to the weight vector. When this is done, it can be observed in the lower part of the figure that the signal is now < 0.

[Figure: if $h_w(x)$ is not equal to $y$, correct the weights by $w = w + y\,x$.]

Figure 1.1: Geometric Interpretation of Perceptron Learning

1.2 Input Scaling


Sometimes the input coordinates are not on the same scale. To illustrate this point let us look
at an example table. In Table (1.3) (a) x1 is on a very different scale than x2 . If we try to
minimize the mean squared error loss then we might end up with something like Fig. 1.2 (a).
The long ellipses show that one of the dimensions is much larger than the other and looking
at the table it can be verified that this is indeed the case. There is a remedy for this situation
and this is known as scaling. There are a few types of scaling available and you should
consult lecture 5 for details. Here, one of those rescaling techniques known as standardization
of the input data is elaborated. What we do is take the mean and standard deviation of all


(a)                         (b)
x1    x2      y             x1       x2       y
10    6000    no            -0.7833  -1.1739  no
16    8000    no             0.2611   0.0457  no
10    7700    yes           -0.7833  -0.1372  yes
22   10000    yes            1.3055   1.2653  yes

Table 1.3: (a) Badly Scaled Input Data. (b) Standardized Input Data

[Figure: (a) contour plot of an elliptic error function; (b) contour plot of a circular error function.]

Figure 1.2: (a) When the Inputs are not Scaled. (b) When the Inputs are Scaled.

What we do is take the mean and standard deviation of all the features. We subtract the mean from all the features and divide them by the standard deviation to normalize the data. After rescaling the data, we should expect the contours of the error plot to become more circular. This is shown with the help of the standardized data in Table (1.3)(b) and also presented in Fig. 1.2(b). It must be noted that these figures are only illustrative and real cases may differ from the cases depicted here. Also note that the perfectly circular plot shown here is generally not the result; after rescaling, the features become such that the contours are almost circular. When gradient descent is taking too long, rescaling the data helps it converge to the minimum faster. An OCTAVE / MATLAB implementation of standardization is shown with the help of List 1.2.
Listing 1.2: Input Scaling in Octave / Matlab

muX  = mean(X);
stdX = std(X);
for j = 2 : (n + 1)
  mX(:, j) = X(:, j) - muX(j);     % subtract the mean of feature j
  sX(:, j) = mX(:, j) / stdX(j);   % divide by its standard deviation
end

NOTE: The index j varies from 2 to n + 1. This is so because scaling is never applied to the bias (the column of 1s).
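As a quick sanity check, the following sketch standardizes the x1 column of Table (1.3)(a) by hand and reproduces the values shown in Table (1.3)(b):

x1 = [10; 16; 10; 22];
z1 = (x1 - mean(x1)) / std(x1);   % approximately [-0.7833; 0.2611; -0.7833; 1.3055]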


1.3 Stochastic Gradient Descent


Gradient descent is one of the most commonly used iterative techniques in machine learning.
The purpose of gradient descent is to find the minimum of an error function. The error
function is generated using input values from some given dataset. Using our notations for
data, the mean squared error function is given as:
$$E(w) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_w(x^{(i)}) - y^{(i)}\right)^2 \tag{1.8}$$
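Eq. (1.8) translates directly into one vectorized Octave / MATLAB line. A sketch, assuming the linear signal X*w is used as the hypothesis and y is a numeric column vector:

E = sum((X * w - y).^2) / (2 * m);   % mean squared error of Eq. (1.8)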

The gradient of the error function with respect to some weight coordinate j is given by:
$$\frac{\partial E(w)}{\partial w_j} = \frac{1}{m}\sum_{i=1}^{m}\left(h_w(x^{(i)}) - y^{(i)}\right)x_j^{(i)} \tag{1.9}$$

Now the gradient descent algorithm says: update the weight vector using the following rule:

$$w_j := w_j - \alpha\,\frac{\partial E(w)}{\partial w_j} \tag{1.10}$$

Here, in addition to the previous notation, $\alpha$ denotes the learning rate. The major difference between batch gradient descent and stochastic gradient descent is that instead of updating the weight vector by summing over all the examples, the update is performed based only on one randomly selected example. Therefore, if the ith example is chosen, then the error gradient is calculated only on the ith example for the jth coordinate, without summing over all the other examples. This error gradient then becomes:

$$\frac{\partial E(w)}{\partial w_j} = \left(h_w(x^{(i)}) - y^{(i)}\right)x_j^{(i)} \tag{1.11}$$
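For contrast with the stochastic version implemented below, here is a minimal sketch of one full batch update, combining Eqs. (1.9) and (1.10) and assuming a linear hypothesis and a numeric label vector y:

grad = (X' * (X * w - y)) / m;   % Eq. (1.9) for all coordinates at once
w = w - alpha * grad;            % Eq. (1.10): simultaneous update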

Now, in both batch gradient descent and stochastic gradient descent, the update is made simultaneously to all the coordinates and not just the jth coordinate. If you go through the first three lectures you will get a very good idea about that. So how are we going to implement stochastic gradient descent in OCTAVE / MATLAB? Let's look at the code snippet first and then we will discuss a little more about the algorithm.


Listing 1.3: Stochastic Gradient Descent in Octave / Matlab

while (dError > eps) && (nIters <= 100)
  % generate an index at random from m examples
  i = ceil(m * rand());
  grad = (X(i, :) * w - y(i)) * (X(i, :))';
  % update weight
  w = w - alpha * grad;
  % calculate error on the current example
  newError = 1/2 * (X(i, :) * w - y(i))^2;
  % get the difference
  dError = abs(oldError - newError);
  % assign old error to the new error
  oldError = newError;
  % increment number of iterations by 1
  nIters = nIters + 1;
  % save the new error at each iteration into a vector
  err(nIters) = newError;
end

Suppose the training data contains one million examples. In that case, using batch gradient descent will be computationally very expensive, as you have to sum over each of the examples at each iteration. In such a case it is a good idea to select one example at random at a time and update all the weights based on that, repeating the same procedure some number of times, say 100 or 1000. What the code presented in List 1.3 does, then, is pick one example at random. It calculates the error and the gradient on that example. It updates all the weights in a single shot, based only on that example.

To finish off this section I am going to present you with a diagram which shows how the gradient descent method generally works. Notice that the blue tangent is pointing in the direction of lower error, and in order to move in the correct direction you must move opposite to the gradient. That is the reason why there is a negative sign in front of $\alpha$ in the weight update step.

[Figure: the error curve E(w) with tangent lines at two points; where the slope is negative the weight update is positive, and where the slope is positive the weight update is negative.]

Figure 1.3: Gradient Descent in Action


[Figure: (a) a simple target curve whose data points do not lie on the curve; (b) a complex target curve whose data points fit the curve perfectly.]

Figure 1.4: (a) Data is Noisy. (b) Target is too complex.

1.4 Noise

In every measurement system there are accuracy and precision issues. The result is that the data has uncertainty. We can never be sure of the accuracy of the data. This uncertainty in the data is known as noise. Let us look at an example to appreciate the concept. In Fig. 1.4(a) is shown a simple target. The accompanying data points are also shown. It can easily be inferred from the figure that the data points do not agree with the actual curve. Such a type of noise is called stochastic noise.

Stochastic or random noise is a very common concept. A relatively uncommon notion is that of deterministic noise. To understand deterministic noise, let us look at Fig. 1.4(b). There it can be seen that the data points lie perfectly on the target function. So, now the question is: where has the noise come from here? The data has noise not because of uncertainty but because of target complexity. In other words, the source of any errors in training and testing is the fact that the target is too difficult to model.

To further understand this idea, consider the following example. Suppose you are a mathematics teacher in the primary section of a school. Of all the classes, Class IV-A performs exceptionally well, and the students ask you to teach them exclusive things in mathematics, as that is the best class and they are the best students. You become a little optimistic and teach them integration. After teaching them integration, something strange happens. They start performing poorly even on the easy questions. What happened? The answer is: their minds were too limited to understand integration. To them integration was complex. They did try to learn it. But now even on slightly different questions that they think are difficult, they try to apply integration (which they do not know how to use) and make big mistakes.

1.4.1 Bias Variance Analysis

The notion of noise and error in machine learning is closely related to bias and variance. So far, the error that we have dealt with is the error incurred while training. The task of machine learning is to see the response of your hypothesis on an unknown data point. The error it makes on a dataset (or data point) that your learning algorithm has never seen is called the true error of your hypothesis. This true error is also called the out of sample error. For notational convenience we denote the true error by the subscript out and the training error by the subscript in. Then the out of sample error can be decomposed in the following manner:

$$E_{out} = \text{Bias} + \text{Variance} + \text{Noise} \tag{1.12}$$
These three terms can further be explained as follows:

1. Bias: Bias is the component of the error which is due to the simplicity of your hypothesis. If you try to learn a very simple hypothesis, you might not learn the true characteristics of the data and hence you will make errors while training as well as predicting. This is also called underfitting.

2. Variance: Variance is the component which is due to the complexity of your hypothesis and not because of the complexity of the data. In other words, if you try to learn the dataset too well, then although your training error will be very small, you will make big errors while testing. This is also called overfitting. Hence variance corresponds to the deterministic noise. You can overcome variance by using lots of examples. Note: the complexity of the learned hypothesis should not be confused with the complexity of the target function.

3. Noise: This is due to the stochastic (random) noise in the data.

Fig. 1.5 (a) and (b) provide a pictorial presentation of bias and variance.

[Figure: two plots of error vs. # examples; (a) a simple hypothesis (low order polynomial) showing bias; (b) a complex hypothesis (high order polynomial) showing variance.]

Figure 1.5: (a) Bias. (b) Variance.

1.4.2 Regularization

Regularization is a cure for noise and, in particular, overfitting. As has already been explained, the error due to noise could be random or deterministic. The deterministic noise is the variance part of the error, which is also known as overfitting. So, let us first understand what overfitting means. Fitting a curve is to find parameters which produce output as close to the input as possible. Overfitting is to fit it too well, so much so that all the points are perfectly covered while training. However, at testing time this could be a big problem, as has been explained above.
Regularization deals with putting a restraint on the learning algorithm so that it does not learn too well (too strong weights). To put this into effect we consider the following constrained optimization problem, where C is a constant bounding the size of the weights:

$$\min_w\; E(w) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_w(x^{(i)}) - y^{(i)}\right)^2 \quad \text{subject to: } \|w\|^2 < C \tag{1.13}$$

This equation can pictorially be represented as in Fig. 1.6.

[Figure: contour ellipses of the error function and a circle representing the weight constraint, touching at the optimum point marked by a green star.]

Figure 1.6: The Ellipse as the Error and the Circle as Regularizer.

Instead of doing a detailed derivation, one could look at the diagram and infer the following mathematical relationship between the gradient of the error and the weight vector at the optimum point, represented by the green star in the figure:

$$\nabla E(w) + \frac{\lambda}{m}\,w = 0 \tag{1.14}$$

Eq. (1.14) tells me that if I could only transform the second part with $\lambda$ into the derivative of something, I could have another optimization function without the constraint. This is presented by the following equation:

$$\min_w\; E(w) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_w(x^{(i)}) - y^{(i)}\right)^2 + \frac{\lambda}{2m}\,\|w\|^2 \tag{1.15}$$


Now this is almost the same as the original cost function except for an additive term. We also know that the derivative of the additive term is available. Using this information we can rewrite the weight update and the exact solution for regression problems.

1. Weight Update:

$$w_j := w_j - \alpha\left[\frac{1}{m}\sum_{i=1}^{m}\left(h_w(x^{(i)}) - y^{(i)}\right)x_j^{(i)} + \frac{\lambda}{m}\,w_j\right] \tag{1.16}$$

2. Normal Equations:

$$w = \left(X^T X + \lambda J\right)^{-1} X^T y \tag{1.17}$$

Note: The bias term (the column of 1s) is never regularized. Therefore, J is the identity matrix, except that its first element is 0 instead of 1.
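The note above translates directly into code. A sketch of Eq. (1.17), where lambda is a regularization constant you choose; the backslash solve is used instead of an explicit matrix inverse because it is numerically more stable:

J = eye(n + 1);                        % identity sized for the bias plus n features
J(1, 1) = 0;                           % never regularize the bias column
w = (X' * X + lambda * J) \ (X' * y);  % Eq. (1.17)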

1.5 Conclusion

The conclusion will be in red, and note that you need to pay careful attention to it. Also go through the lecture on Neural Nets; I have asked very basic questions. Read the lecture on VC analysis; only conceptual questions will be asked.

1. The human brain is good at pattern recognition while computers are good at number crunching.

2. The growth function is justified because of the VC bound.

I hope that reading this document and going through lectures 5 - 10 will get you good marks. All the best!

With faith, discipline and selfless devotion to duty, there is nothing worthwhile that you cannot achieve.

Mohammad Ali Jinnah (1876-1948)

Supervised Learning Techniques


This chapter should serve as a guide for the final examination. Although the name implies supervised techniques in general, it will deal with neural nets, support vector machines and nearest neighbours.

2.1 Support Vector Machines


Support Vector Machines (SVMs) are arguably the most widely accepted technique used for classification. They rely on the concept of margins. The margin is defined as the distance between the separating hyperplane and the nearest example. An example here means a data point. The example which is closest to the hyperplane may be negative or positive for binary classification. Let's assume that the closest example as mentioned here is positive. So, that means that there will only be one margin, and that too for the nearest positive example. Well, that's wrong. As you will see later, when margins are formed, they take into account both the negative and positive examples. Let's look at Fig. 2.1 for the concept of margins.

In Fig. 2.1 are shown two examples from two classes. We see that margins are formed in such a manner that the distance of the two examples (data points) from the hyperplane is the same. If r is the distance of each of the examples from the separating hyperplane, then we can express the total margin as the sum of the individual margins, as shown in Eq. (2.1).
$$\text{margin}_T = 2r \tag{2.1}$$

In Eq. (2.1), the subscript T signifies the word total. We could sum the two distances easily without worrying about any projections or components. We summed them by their scalar magnitudes because the two margins separate the examples along one axis only. To further clarify, think of the two examples as sitting on the surface of a sphere. The line joining them is a straight line passing through the center of the sphere. If we know the radius of the sphere, we could easily find the distance between the two examples by summing the radius twice. In other words, the examples could be thought of as sitting on the surface of the sphere, separated by the diameter of the sphere.
[Figure: two data points from opposite classes on either side of the separating hyperplane, each at distance r from it.]

Figure 2.1: Two Data Points are Separated by Margin r.

2.1.1 Maximum Margin Classifier

Knowing that the margin is simply the perpendicular distance of the nearest example to the separating hyperplane, we may ask ourselves: what does the margin really do for us in terms of machine learning, and is it really worth spending time on? The answer to this rather fundamental question can be given intuitively. Let us look at a figure and then things will be clearer.

[Figure: three panels (a), (b), (c) showing the same dataset separated by lines with different margins.]

Figure 2.2: Three Different Classifiers with Different Margins.


In Fig. 2.2, three classifiers are presented. We do not know which learning algorithm generated them, but we do know that there are differences in the way they separate the otherwise identical dataset. Let us try to enumerate these differences:

- the three classifiers (separating planes) have different slopes.
- case (a) separates the examples such that the line is very close to each of the two classes of nearest examples.
- in case (b) the separating plane is rather far from the two examples.
- case (c) is where the separating plane is farthest from the two examples.
- the margin for case (a) is the narrowest, whereas the margin for case (c) is the widest.

Notice that I am using the terms separating plane, separating line and separating hyperplane interchangeably. They all mean the same thing. Although there are differences, for now they do not matter and the terms can be treated as the same. Getting back to our original question of how the margin helps us, we realize that perhaps the classifier with the widest margin is the best. But why? One way to think about this is to consider a new example. A new example is a data point which your training algorithm has not seen yet. What are the chances that the first separator will classify it correctly? What are the chances that the second or the third classifier will do its job without error? To a new example, the most error tolerant (or accurate) will be the third classifier. To further strengthen the idea, let us assume that the new example is a little wayward (not normal, or a little outside or beyond the normal region). Which classifier will tolerate it the most? The third one! Because it has a wider margin, and if an example lands a little here or there, it can still lie within a wider margin and hence get correctly classified, whereas if the margins are narrow, the chances are high that the new example will be misclassified.
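To make the comparison concrete, the margin of a given separating hyperplane can be computed as the smallest perpendicular distance over the whole dataset. A sketch, assuming X contains the column of 1s and w(1) is the bias, so only w(2:end) contributes to the norm:

% perpendicular distance of every example from the hyperplane w' * x = 0
dist   = abs(X * w) / norm(w(2:end));
margin = min(dist);   % the classifier with the largest such value wins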

2.1.2 Quantifying the Margin
