Contents

1 Introduction
  1.1 Perceptron Learning
    1.1.1 Linear Algebraic Notation
    1.1.2 Perceptron Hypothesis
    1.1.3 Perceptron Learning Rule
  1.2 Input Scaling
  1.3 Stochastic Gradient Descent
  1.4 Noise
    1.4.1 Bias Variance Analysis
    1.4.2 Regularization
  1.5 Conclusion
Introduction

This short guide aims to provide you with the material asked about in the third hourly. All the questions that you will see in the paper are covered here. The answers are not explicit, but you should be able to respond to most of them if you attended the lectures and read this guide as well.
x1    x2    y
10    60    no
16    80    no
10    77    no
22    100   no
25    95    no
50    170   yes
45    180   yes
80    130   yes
55    150   yes
20    110   no
32    130   yes
100   190   ?

s(x^{(1)}) = w_0 x_0^{(1)} + w_1 x_1^{(1)} + w_2 x_2^{(1)} \quad (1.1)
In Eq. (1.1), s(x) denotes the signal as a function of x, x_j is the input where the subscript j denotes the coordinate (dimension) and the bracketed superscript (i) denotes the example number. The w's are the parameters of the signal. Note, however, that the original data does not contain the coordinate x_0. Where has this coordinate come from and what is its purpose? The answer is that it is just a mathematical artifact (a technique) which helps solve our problem nicely. It is the threshold of the signal, that is, its value at x = 0. For the same reason it is also termed the bias. In order to include the bias in the problem, the data is slightly modified. The modification is shown with the help of Table (1.2): a column of 1s is added to act as the bias coordinate of the problem.
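The modification of adding a column of 1s can be sketched in a few lines (shown here in Python with NumPy rather than the course's Octave, purely as an illustration; the numbers are the first four examples of the dataset):

```python
import numpy as np

# First four raw input examples (x1, x2) from the dataset.
X_raw = np.array([[10, 60],
                  [16, 80],
                  [10, 77],
                  [22, 100]])

# Prepend a column of 1s: the extra coordinate x0 = 1 lets the weight w0
# act as the bias (threshold) term inside the same dot product.
X = np.hstack([np.ones((X_raw.shape[0], 1)), X_raw])
```

Each row now has the form (1, x1, x2), matching Table 1.2.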
Let's recap what has been done so far:
- a dataset is given
- the rows of the data are identified as different instances or examples
- the data is categorized into raw inputs and output
- a signal is formed out of the first example of the dataset
- the signal is formed using a certain weight vector w and the first input example x^{(1)}
Let's shed more light on what the signal is. The word has its origin in electrical engineering, but we are not studying any electrical engineering here; we are only borrowing the term to make things simple for us. The signal is nothing but a linear combination of the weight vector w and the input vector x. A more compact way to represent the signal is shown in Eq. (1.2).
s(x) = w^T x \quad (1.2)
x0   x1    x2    y
1    10    60    no
1    16    80    no
1    10    77    no
1    22    100   no
1    25    95    no
1    50    170   yes
1    45    180   yes
1    80    130   yes
1    55    150   yes
1    20    110   no
1    32    130   yes
1    100   190   ?
1.1.1 Linear Algebraic Notation

Linear algebra has the necessary tools to make our job easy. Eq. (1.2) can be expanded in terms of the exact vectors to understand a little more clearly how the signal on the ith example is formed. This is shown with the help of Eq. (1.3).
s(x^{(i)}) = \begin{bmatrix} w_0 & w_1 & w_2 \end{bmatrix} \begin{bmatrix} x_0^{(i)} \\ x_1^{(i)} \\ x_2^{(i)} \end{bmatrix} \quad (1.3)
So far we have been forming signals out of one example only. But we have an entire dataset, and there must be some way to get the signals from all the examples in the dataset. One way to do this is to use matrix multiplication. First let us consider a data matrix which is formed in the following way:
X = \begin{bmatrix}
x_0^{(1)} & x_1^{(1)} & x_2^{(1)} & \dots & x_n^{(1)} \\
x_0^{(2)} & x_1^{(2)} & x_2^{(2)} & \dots & x_n^{(2)} \\
x_0^{(3)} & x_1^{(3)} & x_2^{(3)} & \dots & x_n^{(3)} \\
x_0^{(4)} & x_1^{(4)} & x_2^{(4)} & \dots & x_n^{(4)} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
x_0^{(m)} & x_1^{(m)} & x_2^{(m)} & \dots & x_n^{(m)}
\end{bmatrix} \quad (1.4)
Eq. (1.4) shows the data matrix X, whose dimensions are m x (n + 1). Here, m denotes the number of examples and n denotes the dimension of the data. Note that although n is the dimension of the original input data, after adding the bias the dimension becomes n + 1. But since the subscript for the bias dimension is simply 0, the notation for the last dimension is still n; counting the dimensions including 0, there are n + 1 of them.
Now that we know how to form the signal from one data point, we can form the signals from all the data points at once:

s(X) = \begin{bmatrix}
x_0^{(1)} & x_1^{(1)} & x_2^{(1)} & \dots & x_n^{(1)} \\
x_0^{(2)} & x_1^{(2)} & x_2^{(2)} & \dots & x_n^{(2)} \\
x_0^{(3)} & x_1^{(3)} & x_2^{(3)} & \dots & x_n^{(3)} \\
x_0^{(4)} & x_1^{(4)} & x_2^{(4)} & \dots & x_n^{(4)} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
x_0^{(m)} & x_1^{(m)} & x_2^{(m)} & \dots & x_n^{(m)}
\end{bmatrix}
\begin{bmatrix} w_0 \\ w_1 \\ w_2 \\ \vdots \\ w_n \end{bmatrix} \quad (1.5)

The result of the matrix multiplication presented in Eq. (1.5) is a vector of signals whose dimension is m x 1. In other words, s is a column vector with m components, where each component corresponds to one example in the dataset. In more compact notation, Eq. (1.5) is written as:

s = Xw \quad (1.6)
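Eq. (1.6) is one line of linear algebra in code. A small check (Python/NumPy here for illustration; the weight values are hypothetical, not taken from the text):

```python
import numpy as np

# Three examples with the bias coordinate x0 = 1 already included.
X = np.array([[1., 10.,  60.],
              [1., 16.,  80.],
              [1., 50., 170.]])
w = np.array([-100., 1., 0.5])   # hypothetical weight vector

s = X @ w   # Eq. (1.6): one signal per example, an m x 1 vector
```

Each component of s is the signal of Eq. (1.1) computed on one row of X.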
1.1.2 Perceptron Hypothesis
At this stage we have successfully extracted all the signals from the initial dataset. The big question is: how does the perceptron classify each of these signals into the two classes yes and no? The perceptron uses the following rule to classify the signals:

h_w(x) = \begin{cases} +1 & \text{if } s(x) \ge 0 \\ -1 & \text{if } s(x) < 0 \end{cases} \quad (1.7)

In Eq. (1.7), h_w(x) denotes the hypothesis and is read as: the hypothesis as a function of x and parameterized by w. The hypothesis is pretty simple. It outputs a binary value: either +1 or -1. We can think of +1 as the output of the hypothesis when y = yes and -1 when y = no.
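Eq. (1.7) amounts to taking the sign of the signal. A minimal sketch (Python for illustration; the weight values are hypothetical):

```python
import numpy as np

def perceptron_hypothesis(w, x):
    """Eq. (1.7): +1 when the signal w.x is non-negative, -1 otherwise."""
    return 1 if w @ x >= 0 else -1

w = np.array([-100., 1., 0.5])   # hypothetical weights
```

Note the convention that a signal of exactly zero maps to +1, matching the non-strict inequality in Eq. (1.7).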
1.1.3 Perceptron Learning Rule
Now that we have the perceptron hypothesis, how do we know that the initially selected weight vector produces the correct output? To check this, we use the perceptron learning rule. The rule says the following:
- pick an example at random
- obtain the value of the hypothesis from an initially selected set of weights
- compare the hypothesis response with the true label y
- in case of a mismatch, multiply the input vector by the correct label (+1 or -1)
- update by adding this multiplied input to the weight vector
Given the dataset X having m examples, output y and weight vector w, an OCTAVE / MATLAB implementation of the above procedure is presented with the help of List 1.1.
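The five steps above can be sketched as follows (a Python illustration rather than the Octave of Listing 1.1; the toy dataset and the fixed iteration count are assumptions made for the sketch):

```python
import numpy as np

def perceptron_learn(X, y, w, steps=100, seed=0):
    """Perceptron learning rule: pick a random example, compare the
    hypothesis with the true label, and on a mismatch add y * x to w."""
    rng = np.random.default_rng(seed)
    for _ in range(steps):
        i = rng.integers(len(y))            # pick an example at random
        h = 1 if X[i] @ w >= 0 else -1      # hypothesis, Eq. (1.7)
        if h != y[i]:                       # mismatch with true label
            w = w + y[i] * X[i]             # update rule
    return w

# Toy data: bias coordinate plus one feature, labels +1 / -1.
X = np.array([[1.,  2.],
              [1., -1.]])
y = np.array([1, -1])
w = perceptron_learn(X, y, np.zeros(2))
```

For linearly separable data such as this toy set, the rule converges to weights that classify every training example correctly.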
The perceptron learning rule can also be illustrated with the help of Fig. 1.1. It can be seen that with the given set of weights the signal (s = w^T x) > 0. However, the true label is -1. To correct this, the input x is multiplied by the correct label y and added to the weight vector. When this is done, it can be observed in the lower part of the figure that the signal is now < 0.
(a)
x1    x2      y
10    6000    no
16    8000    no
10    7700    yes
22    10000   yes

(b)
x1        x2        y
-0.7833   -1.1739   no
 0.2611    0.0457   no
-0.7833   -0.1372   yes
 1.3055    1.2653   yes

Table 1.3: (a) Badly Scaled Input Data. (b) Standardized Input Data
the features. We subtract the mean from each feature and divide by the standard deviation to normalize the data. After rescaling, we should expect the contours of the error plot to become more circular. This is shown with the help of the standardized data in Table (1.3)(b) and also presented in Fig. 1.2(b). It must be noted that these figures are only illustrative and real cases may differ from those depicted. Also note that a perfectly circular plot as shown here is generally not the result; after rescaling, the features become such that the contours are almost circular. When gradient descent is taking too long, rescaling the data helps it converge to the minimum faster. An OCTAVE / MATLAB implementation of standardization is shown with the help of List 1.2.
Listing 1.2: Input Scaling in Octave / Matlab
% Standardize every column except the bias column (column 1).
muX = mean(X);
stdX = std(X);
for j = 2:(n+1)
  mX(:,j) = X(:,j) - muX(j);
  sX(:,j) = mX(:,j) / stdX(j);
end
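The same computation can be checked against Table 1.3 (a Python translation of Listing 1.2, given here only to verify the numbers; note that Octave's std uses the n-1 denominator, hence ddof=1):

```python
import numpy as np

# Badly scaled inputs from Table 1.3(a), with the bias column included.
X = np.array([[1., 10.,  6000.],
              [1., 16.,  8000.],
              [1., 10.,  7700.],
              [1., 22., 10000.]])

sX = X.copy()
# Standardize every column except the bias column (index 0):
# subtract the mean, divide by the sample standard deviation.
sX[:, 1:] = (X[:, 1:] - X[:, 1:].mean(axis=0)) / X[:, 1:].std(axis=0, ddof=1)
```

The resulting columns reproduce the standardized values of Table 1.3(b).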
1.3 Stochastic Gradient Descent

E(w) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_w(x^{(i)}) - y^{(i)} \right)^2 \quad (1.8)
The gradient of the error function with respect to some weight coordinate j is given by:

\frac{\partial E(w)}{\partial w_j} = \frac{1}{m} \sum_{i=1}^{m} \left( h_w(x^{(i)}) - y^{(i)} \right) x_j^{(i)} \quad (1.9)
Now the gradient descent algorithm says: update the weight vector using the following rule:

w_j = w_j - \alpha \frac{\partial E(w)}{\partial w_j} \quad (1.10)
Here, in addition to the previous notation, \alpha denotes the learning rate. The major difference between batch gradient descent and stochastic gradient descent is that instead of updating the weight vector by summing over all the examples, the update is performed based on only one randomly selected example. Therefore, if the ith example is chosen, then the error gradient is calculated only on the ith example for the jth coordinate, without summing over all the other examples. This error gradient then becomes:

\frac{\partial E(w)}{\partial w_j} = \left( h_w(x^{(i)}) - y^{(i)} \right) x_j^{(i)} \quad (1.11)
Now, in both batch gradient descent and stochastic gradient descent, the update is made simultaneously to all the coordinates and not just the jth coordinate. If you go through the first three lectures you will get a very good idea about that. So how are we going to implement stochastic gradient descent in OCTAVE / MATLAB? Let's look at the code snippet first and then discuss the algorithm a little more.
Suppose the training data contains one million examples. In that case, using batch gradient descent will be computationally very expensive, as you have to sum over all the examples at each iteration. In such a case it is a good idea to select one example at random at a time and update all the weights based on that, then repeat the same procedure some number of times, say 100 or 1,000. What the code presented in List 1.3 does is pick one example at random, calculate the error and the gradient on that example, and update all the weights in a single shot, based only on that example.
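That procedure can be sketched as follows (Python for illustration, using a linear hypothesis h_w(x) = w^T x for concreteness; the toy data, learning rate and iteration count are assumptions made for the sketch):

```python
import numpy as np

def sgd_step(w, X, y, alpha, rng):
    """One stochastic step: the gradient of Eq. (1.11) on a single
    randomly chosen example, applied to all coordinates at once."""
    i = rng.integers(len(y))
    grad = (X[i] @ w - y[i]) * X[i]   # error on example i times its input
    return w - alpha * grad

# Toy regression data exactly consistent with w = [1, 1].
X = np.array([[1., 1.],
              [1., 2.]])
y = np.array([2., 3.])

rng = np.random.default_rng(0)
w = np.zeros(2)
for _ in range(2000):
    w = sgd_step(w, X, y, 0.05, rng)
```

Because each step touches only one example, the cost per iteration is independent of the dataset size, which is the whole point when m is in the millions.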
To finish off this section, here is a diagram which shows how the gradient descent method works in general. Notice from the slope of the blue tangent that in order to move toward lower error you must move opposite to the gradient. That is the reason why there is a negative sign in front of \alpha in the weight update step.
(Figure: gradient descent on the error curve; where the slope is negative the weight update is positive, and where the slope is positive the update is negative.)
(Figure 1.4: (a) a simple target curve; (b) a complex target curve.)
1.4 Noise

In every measurement system there are accuracy and precision issues. The result is that the data has uncertainty: we can never be sure of the accuracy of the data. This uncertainty in the data is known as noise. Let us look at an example to appreciate the concept. Fig. 1.4(a) shows a simple target along with the accompanying data points. It can easily be inferred from the figure that the data points do not agree with the actual curve. This type of noise is called stochastic noise.
Stochastic or random noise is a very common concept. A relatively uncommon notion is that of deterministic noise. To understand deterministic noise, let us look at Fig. 1.4(b).
In Fig. 1.4(b) it can be seen that the data points lie perfectly on the target function. So now the question is: where has the noise come from here? The data has noise not because of uncertainty but because of target complexity. In other words, the source of any errors on training and testing is the fact that the target is too difficult to model.
To further understand this idea, consider the following example. Suppose you are a mathematics teacher in the primary section of a school. Of all the classes, Class IV-A performs exceptionally well, and the students ask you to teach them exclusive things in mathematics because they are the best class and the best students. You become a little optimistic and teach them integration. After teaching them integration, something strange happens: they start performing poorly even on the easy questions. What happened? The answer is that their minds were too limited to understand integration. To them integration was complex. They did try to learn it, but now even on slightly different questions that they think are difficult, they try to apply integration (which they do not really know how to use) and make big mistakes.
1.4.1 Bias Variance Analysis
The notion of noise and error in machine learning is closely related to bias and variance. So far, the error that we have dealt with is the error incurred while training. The task of machine learning, however, is to see the response of your hypothesis on an unknown data point. The error it makes on a dataset (or data point) that your learning algorithm has never seen is called the true error of your hypothesis. This true error is also called the out of sample error. For notational convenience we denote the true error by the subscript out and the training error by the subscript in. The out of sample error can then be decomposed in the following manner:

E_{out} = \text{Bias} + \text{Variance} + \text{Noise} \quad (1.12)
These three terms can further be explained as follows:
1. Bias: Bias is the component of the error which is due to the simplicity of your hypothesis. If you try to learn a very simple hypothesis you might not learn the true characteristics of the data, and hence you will make errors while training as well as predicting. This is also called underfitting.
2. Variance: Variance is the component which is due to the complexity of your hypothesis, and not because of the complexity of the data. In other words, if you try to learn the dataset too well, then although your training error will be very little, you will make big errors while testing. This is also called overfitting. Hence variance corresponds to deterministic noise. You can overcome variance by using lots of examples. Note: the complexity of the learned hypothesis should not be confused with the complexity of the target function.
3. Noise: This is due to the stochastic (random) noise in the data.
Fig. 1.5 (a) and (b) provide a pictorial presentation of bias and variance.
(Figure 1.5: error versus number of examples for (a) a simple hypothesis (low order polynomial), dominated by bias, and (b) a complex hypothesis (high order polynomial), dominated by variance.)
1.4.2
Regularization
Regularization is a cure for noise and, in particular, for overfitting. As has already been explained, the error due to noise can be random or deterministic. The deterministic noise is the variance part of the error, which is also known as overfitting. So let us first understand what overfitting means. Fitting a curve is to find parameters which produce output as close to the data as possible. Overfitting is to fit it too well, so much so that all the points are perfectly covered while training. However, at testing time this can be a big problem, as has been explained above.
Regularization puts a restraint on the learning algorithm so that it does not learn too well (too strong weights). To put this into effect we consider the following constrained optimization problem:
\min_w E(w) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_w(x^{(i)}) - y^{(i)} \right)^2 \quad \text{subject to } w^T w \le C \quad (1.13)
This equation can be represented pictorially as in Fig. 1.6.

Figure 1.6: The Ellipse as the Error and the Circle as Regularizer.

Instead of doing a detailed derivation, one could look at the diagram and infer the following mathematical relationship between the gradient of the error and the weight vector at the optimum point, marked by the green star in the figure:

\nabla E(w) + \frac{\lambda}{m} w = 0 \quad (1.14)
Eq. (1.14) tells me that if I could only transform the second term, with \lambda, into the differential of something, I could have another optimization function without the constraint. This is presented by the following equation:

\min_w E(w) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_w(x^{(i)}) - y^{(i)} \right)^2 + \frac{\lambda}{2m} w^T w \quad (1.15)
Now this is almost the same as the original cost function except for an additive term, and we know the derivative of the additive term. Using this information we can rewrite the weight update and the exact solution for regression problems.
1. Weight Update:

w_j = w_j - \alpha \left[ \frac{1}{m} \sum_{i=1}^{m} \left( h_w(x^{(i)}) - y^{(i)} \right) x_j^{(i)} + \frac{\lambda}{m} w_j \right] \quad (1.16)
2. Normal Equations:

w = \left( X^T X + \lambda J \right)^{-1} X^T y \quad (1.17)
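Eq. (1.17) can be sanity-checked in a few lines (Python for illustration; taking J as the identity with its (0,0) entry zeroed, so that the bias w0 is not penalized, is the usual convention and an assumption here, as is the toy data):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Solve Eq. (1.17): w = (X^T X + lam * J)^(-1) X^T y, where J is
    the identity with J[0, 0] = 0 so the bias is left unregularized."""
    J = np.eye(X.shape[1])
    J[0, 0] = 0.0
    return np.linalg.solve(X.T @ X + lam * J, X.T @ y)

# Toy data lying exactly on the line y = 1 + x.
X = np.array([[1., 0.],
              [1., 1.],
              [1., 2.]])
y = np.array([1., 2., 3.])
```

With lam = 0 this reduces to the ordinary normal equations; increasing lam shrinks the non-bias weights toward zero.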
1.5 Conclusion

The conclusion will be in red, and note that you need to pay careful attention to it. Also go through the lecture on Neural Nets; I have asked very basic questions. Read the lecture on VC analysis; only conceptual questions will be asked.
1. The human brain is good at pattern recognition, while computers are good at number crunching.
2. The growth function is justified because of the VC bound.
I hope that reading this document and going through lectures 5-10 will get you good marks. All the best!
(2.1)
In Eq. (2.1), the subscript T signifies the word total. We could sum the two distances easily, without worrying about any projections or components. We summed them by their scalar magnitudes because the two margins separate the examples along one axis only. To further clarify, think of the two examples as sitting on the surface of a sphere. The line joining them is a straight line passing through the center of the sphere. If we know the radius of the sphere, we can easily find the distance between the two examples by summing the radius twice. In other words, the examples can be thought of as sitting on the surface of the sphere, separated by the diameter of the sphere.
2.1.1

Knowing that the margin is simply the perpendicular distance of the nearest example to the separating hyperplane, we may ask ourselves: what does the margin really do for us in terms of machine learning, and is it really worth spending time on? The answer to this rather fundamental question can be given intuitively. Let us look at a figure and then things will be clearer.
(Figure 2.2: three classifiers (a), (b), (c) separating the same dataset.)
In Fig. 2.2, three classifiers are presented. We do not know which learning algorithm generated them, but we do know that there are differences in the way they separate the otherwise identical dataset. Let us try to enumerate these differences:
- the three classifiers (separating planes) have different slopes.
- case (a) separates the examples such that the line is very close to each of the two classes of nearest examples.
- in case (b) the separating plane is rather far from the two examples.
- case (c) is where the separating plane is farthest from the two examples.
- the margin for case (a) is the narrowest, whereas the margin for case (c) is the widest.
Notice that I am using the words separating plane, separating line and separating hyperplane interchangeably. They all mean the same thing here; although there are differences, for now they do not matter and the terms can be treated as the same. Getting back to our original question of how the margin helps us, we realize that perhaps the classifier with the widest margin is the best. But why? One way to think about this is to consider a new example. A new example is a data point which your training algorithm has not seen yet. What are the chances that the first separator will classify it correctly? What are the chances that the second or the third classifier will do its job without error? For a new example, the most error tolerant (or accurate) will be the third classifier. To further strengthen the idea, let us assume that the new example is a little wayward (not normal, or a little outside or beyond the normal region). Which classifier will tolerate it the most? The third one! Because it has a wider margin, and if an example is a little here or there, it can still lie within the wider margin and hence be correctly classified, whereas if the margins are narrow the chances are high that the new example will be misclassified.
2.1.2