
Source: https://towardsdatascience.com/introduction-to-machine-learning-db7c668822c4

Introduction to Machine Learning


Machine Learning is the idea of learning from examples and experience without being explicitly
programmed. Instead of writing task-specific code, you feed data to a generic algorithm, and it builds its
logic from the data it is given.

For example, one kind of algorithm is a classification algorithm, which puts data into different groups.
A classification algorithm trained to detect handwritten letters could equally be used to classify emails
as spam or not-spam.

“A computer program is said to learn from experience E with respect to some class of tasks T and performance
measure P if its performance at tasks in T, as measured by P, improves with experience E.” -Tom M. Mitchell

Consider playing checkers.

E = the experience of playing many games of checkers

T = the task of playing checkers.

P = the probability that the program will win the next game.

Examples of Machine Learning

There are many examples of machine learning. Here are a few examples of classification problems
where the goal is to categorize objects into a fixed set of categories.

Face detection: Identify faces in images (or indicate if a face is present).

Email filtering: Classify emails into spam and not-spam.

Medical diagnosis: Diagnose a patient as a sufferer or non-sufferer of some disease.

Weather prediction: Predict, for instance, whether or not it will rain tomorrow.

The Need for Machine Learning

Machine Learning is a field that grew out of Artificial Intelligence (AI). With AI, we wanted to build
better, more intelligent machines. But apart from a few narrow tasks, such as finding the shortest path
between points A and B, we were unable to program solutions to more complex and constantly evolving
challenges. There was a realisation that the only way to achieve this was to let the machine learn on its
own, much as a child does. So machine learning was developed as a new capability for computers, and it is
now present in so many segments of technology that we often don't even realise we are using it.
Finding patterns in data was once something only human brains could do. But the data has become so
massive that the time needed to analyse it has grown enormously, and this is where Machine Learning
comes into action: it helps people extract value from large amounts of data in minimal time.

If big data and cloud computing are gaining importance for their contributions, machine learning as a
technology helps analyse those big chunks of data, easing the task of data scientists through automation,
and is gaining equal importance and recognition.

The techniques we use for data mining have been around for many years, but until recently we lacked the
computing power to run the algorithms effectively. Running modern learning algorithms such as deep
learning with access to better data and more compute is what has led to the dramatic breakthroughs we now
associate with machine learning.

Kinds of Machine Learning

There are three kinds of Machine Learning Algorithms.

a. Supervised Learning

b. Unsupervised Learning

c. Reinforcement Learning

Supervised Learning

A majority of practical machine learning uses supervised learning.

In supervised learning, the system tries to learn from previously given examples. (In unsupervised
learning, by contrast, the system attempts to find patterns directly from the examples given.)

Speaking mathematically, supervised learning is where you have both input variables (X) and output
variables (Y) and use an algorithm to learn the mapping function from the input to the output.

The mapping function is expressed as Y = f(X).

Supervised learning problems can be further divided into two parts, namely classification and
regression.

Classification: A classification problem is when the output variable is a category or a group, such as
“black” or “white”, or “spam” and “not spam”.

Regression: A regression problem is when the output variable is a real value, such as “rupees” or
“height”.

Unsupervised Learning

In unsupervised learning, the algorithms are left to themselves to discover interesting structures in the
data.

Mathematically, unsupervised learning is when you only have input data (X) and no corresponding
output variables.

This is called unsupervised learning because, unlike supervised learning above, there are no given
correct answers; the machine has to find the answers itself.

Unsupervised learning problems can be further divided into association and clustering problems.

Association: An association rule learning problem is where you want to discover rules that describe
large portions of your data, such as “people that buy X also tend to buy Y”.
Clustering: A clustering problem is where you want to discover the inherent groupings in the data,
such as grouping customers by purchasing behaviour.

Reinforcement Learning

A computer program interacts with a dynamic environment in which it must achieve a particular goal
(such as playing a game against an opponent or driving a car). The program is provided feedback in
terms of rewards and punishments as it navigates its problem space.

Using this algorithm, the machine is trained to make specific decisions. It works this way: the machine
is exposed to an environment where it continuously trains itself by trial and error.


The Math of Intelligence

Machine Learning theory is a field that draws on statistics, probability, computer science and
algorithms, all arising from learning iteratively from data, and it can be used to build intelligent
applications.

Why Worry About The Maths?

There are various reasons why the mathematics of Machine Learning is necessary, and I will highlight
some of them below:

Selecting the appropriate algorithm for the problem includes considerations of accuracy, training time,
model complexity, the number of parameters and the number of features.

Identifying underfitting and overfitting by following the Bias-Variance tradeoff.

Choosing parameter settings and validation strategies.

Estimating the right confidence interval and uncertainty.


What Level of Maths Do We Need?

The foremost question when trying to understand a field such as Machine Learning is the amount of
maths necessary and the complexity of maths required to understand these systems.

The answer to this question is multidimensional and depends on the level and interest of the individual.

Here is the minimum level of mathematics that is needed for Machine Learning Engineers / Data
Scientists.

1. Linear Algebra (Matrix Operations, Projections, Factorisation, Symmetric Matrices, Orthogonalisation)

2. Probability Theory and Statistics (Probability Rules & Axioms, Bayes’ Theorem, Random
Variables, Variance and Expectation, Conditional and Joint Distributions, Standard Distributions.)

3. Calculus (Differential and Integral Calculus, Partial Derivatives)

4. Algorithms and Complex Optimizations (Binary Trees, Hashing, Heap, Stack)

Closing Notes

Thanks for reading! Hopefully, you’re now able to understand what Machine Learning is and its
applications.

Source: https://towardsdatascience.com/introduction-to-machine-learning-db7c668822c4
Artificial Intelligence v/s Machine Learning

Source: https://www.quora.com/What-are-the-main-differences-between-artificial-intelligence-and-machine-learning-Is-machine-learning-a-part-of-artificial-intelligence

Source: https://www.geeksforgeeks.org/difference-between-machine-learning-and-artificial-intelligence/

Difference between Machine Learning and Artificial Intelligence

Artificial Intelligence and Machine Learning are terms from computer science. This article discusses
some points on the basis of which we can differentiate between these two terms.

Overview

Artificial Intelligence: The term Artificial Intelligence comprises two words, “Artificial” and
“Intelligence”. Artificial refers to something made by humans, a non-natural thing, and Intelligence
means the ability to understand or think. There is a misconception that Artificial Intelligence is
a system; it is not a system, AI is implemented in a system. There can be many definitions of AI;
one definition is: “It is the study of how to train computers so that computers can do things which,
at present, humans can do better.” It is therefore an intelligence in which we want to add to machines
all the capabilities that humans possess.

Machine Learning: Machine Learning is learning in which a machine can learn on its own without
being explicitly programmed. It is an application of AI that provides a system with the ability to
automatically learn and improve from experience. Here we can generate a program by integrating the input
and output of that program. One simple definition of Machine Learning, echoing Mitchell, is: “A machine is
said to learn from experience E with respect to some class of tasks T and a performance measure P if the
learner's performance at the tasks in the class, as measured by P, improves with experience E.”

The key differences between AI and ML are:

ARTIFICIAL INTELLIGENCE | MACHINE LEARNING

AI stands for Artificial Intelligence, where intelligence is defined as the ability to acquire and apply knowledge. | ML stands for Machine Learning, which is defined as the acquisition of knowledge or skill.

The aim is to increase the chance of success, not accuracy. | The aim is to increase accuracy; it does not care about success.

It works as a computer program that does smart work. | It is a simple concept: the machine takes data and learns from the data.

The goal is to simulate natural intelligence to solve complex problems. | The goal is to learn from data on a certain task so as to maximize the machine's performance on that task.

AI is decision making. | ML allows the system to learn new things from data.

It leads to developing a system that mimics human responses and behaviour in given circumstances. | It involves creating self-learning algorithms.

AI will go for finding the optimal solution. | ML will go for a solution, whether it is optimal or not.

AI leads to intelligence or wisdom. | ML leads to knowledge.

Reference:
www.techrepublic.com/article/understanding-the-differences-between-ai-machine-learning-and-deep-learning

Source: https://www.geeksforgeeks.org/difference-between-machine-learning-and-artificial-intelligence/

Simple Linear Regression


Ordinary Least Squares Method
- Regression: allows us to model the relationship between two variables, an input variable and an output
(dependent) variable. The objective is to predict a continuous output value.
- The dependent variable is predicted using the best-fit regression line.

Example:

Size (input variable X) | Price of house (output variable Y)
500                     | 100
600                     | 150
700                     | 170
800                     | 200
900                     | 230

- A value y is a function of x:

  y = f(x)
  f(x) = θ0 + θ1·x

  where
  θ0 : y-intercept of the line (its value where the line crosses the y-axis, i.e. when x = 0), and
  θ1 : slope of the line.

Calculation of θ0 and θ1:
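The original figure with the formulas is not reproduced here; the standard ordinary least squares (closed-form) estimates are θ1 = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² and θ0 = ȳ − θ1·x̄. As a minimal sketch (variable names are illustrative), they can be computed for the table above with NumPy:

import numpy as np

# House-size example from the table above
x = np.array([500, 600, 700, 800, 900], dtype=float)   # size (input variable X)
y = np.array([100, 150, 170, 200, 230], dtype=float)   # price of house (output variable Y)

# Ordinary least squares estimates:
#   theta1 = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
#   theta0 = mean(y) - theta1 * mean(x)
theta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
theta0 = y.mean() - theta1 * x.mean()

print(f"theta0 = {theta0:.2f}, theta1 = {theta1:.2f}")  # best-fit line: y = theta0 + theta1 * x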

Goodness of Fit:

The difference between the best-fit value and the observed value is called the residual, or error. To measure the
error we use a cost function. The objective is to find the best-fitting line through the data, i.e. the one that
minimizes the error. In the case of linear regression, the sum of squared errors (SSE) is used as the cost function.
The reasons for preferring SSE as the cost function over the plain sum of errors or the sum of absolute errors are:

o A simple sum of errors does not reflect the actual misfit, because the error is negative for some points and
positive for others, so the terms can cancel. Hence we have to use either squared or absolute values.
o The absolute error function is not differentiable everywhere, and it is hard to work with the derivative of a
function involving absolute values. Most machine learning algorithms use gradient descent, which requires a
differentiable function.

[Figure: a non-differentiable function, with the blue line representing the function and the pink line its
derivative. At x = -2 and x = 3 the rate of change takes multiple values, hence the function is not
differentiable there.]

o Squaring makes the cost function convex, which speeds up convergence and ensures a global optimum when using
gradient descent.
o The least squares approach penalizes large errors heavily, so big deviations (such as outliers) carry extra
weight in the fit.

Coefficient of Determination

The coefficient of determination tells us how well the estimated regression equation fits our data. The Total Sum
of Squares is partitioned into an explained part and an error part, and the coefficient of determination expresses
this ratio as a proportion (often quoted as a percentage):

- The Total Sum of Squares (SST, or TSS) is the sum of squared differences of each observation from the
  overall mean.
- SST = SSR + SSE, where SSR is the regression (explained) sum of squares, i.e. the squared deviations of the
  predicted values from the mean, and SSE is the sum of squared errors (residuals), i.e. the squared deviations
  of the observations from the predicted values.
- If we simply use the mean line as the model, then SSR = 0 and SST = SSE.
- The coefficient of determination, R² = SSR / SST = 1 − SSE / SST, is an overall measure of the accuracy of
  the regression line.
- It varies from 0 to 1.
- A value of 0 indicates that the regression line does not fit the data points at all.
- A value of 1 indicates that the line perfectly fits the set of data points.
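For concreteness, here is a minimal sketch (with made-up observed and predicted values, purely for illustration) of how SSE, SST and the coefficient of determination would be computed:

import numpy as np

y = np.array([100, 150, 170, 200, 230], dtype=float)        # observed values
y_pred = np.array([105, 138, 170, 203, 235], dtype=float)   # predictions from a fitted line (illustrative)

sse = np.sum((y - y_pred) ** 2)     # error (residual) sum of squares
sst = np.sum((y - y.mean()) ** 2)   # total sum of squares around the mean
r_squared = 1 - sse / sst           # coefficient of determination

print(f"SSE = {sse:.1f}, SST = {sst:.1f}, R^2 = {r_squared:.3f}")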
Linear Regression using Gradient Descent
Gradient Descent is a technique for finding the values of parameters for which the value of a function is minimum.
For example, in the following figure we are looking for the W that minimizes J(W).

For example, consider the function f(x) = x² − 2x (whose derivative, used in the steps below, is f'(x) = 2x − 2).

Steps:

1. Set i = 1.
2. Pick a random starting value, say x_1 = 3.
3. Take the derivative f'(x) of f(x):
   f'(x) = 2x − 2
4. Evaluate the derivative at the current value of x:
   f'(3) = 2*3 − 2 = 4
   [
   - We know from calculus that the derivative at a minimum is 0 (the global minimum).
   - If the derivative is positive, we know the function value is increasing at that point.
   - By studying the derivative we know whether we are getting closer to or further away from the minimum.
   ]
5. Set x_{i+1} = x_i − α f'(x_i).
   Here α is the learning rate. Assume it is 0.02:
   x_{i+1} = 3 − 0.02 * 4 = 2.92
6. Go back to step 4 and repeat until the change in f(x) is less than some threshold value, i.e. until it converges to the minimum.

(A minimal code sketch of this procedure appears after the note on gradient ascent below.)

GRADIENT ASCENT: If instead the problem is one of maximization, use the following update in step 5:
x_{i+1} = x_i + α f'(x_i)
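As mentioned above, here is a minimal sketch of these steps in Python (the function name and stopping threshold are illustrative). It minimizes f(x) = x² − 2x, whose derivative f'(x) = 2x − 2 is used in the worked example:

def gradient_descent_1d(f_prime, x0, alpha=0.02, tol=1e-6, max_iters=10000):
    # Repeatedly step against the derivative until the updates become negligible.
    x = x0
    for _ in range(max_iters):
        step = alpha * f_prime(x)
        x -= step
        if abs(step) < tol:
            break
    return x

# f(x) = x^2 - 2x  =>  f'(x) = 2x - 2, minimum at x = 1
minimum = gradient_descent_1d(lambda x: 2 * x - 2, x0=3.0)
print(round(minimum, 4))   # approaches 1.0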

Finding the Minimum using Calculus:

Set f'(x) = 0, since at the minimum the rate of change is 0:
2x − 2 = 0
x = 1

If finding the minimum using calculus is this easy, why is Gradient Descent preferred?
Because the function may be very complicated, for example a high-dimensional cost function where x is a vector;
in such cases finding the minimum or maximum analytically is very hard. This is why Gradient Descent is useful.
Another reason is that it is fast.

Learning Rate α:
- It sets the size of each step (jump).
- If the learning rate is too high, the solution may overshoot and fail to converge.
- If the learning rate is too low, it will take a very large number of iterations to converge.

Suggestions for selecting the learning rate:

- Start with a large value.
- When you notice overshooting (the value bouncing back and forth), reduce it and try again.
- Repeat these steps until you find a small enough rate.

Source: Machine Learning by Andrew NG

Linear regression with Gradient Descent

Cost Function

We can measure the accuracy of our hypothesis function by using a cost function. This takes an average difference
(actually a fancier version of an average) of all the results of the hypothesis with inputs from the x's and the
actual outputs y's:

J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})^2

This function is otherwise called the "squared error function" or "mean squared error". The mean is halved
(multiplied by 1/2) as a convenience for the computation of gradient descent, because the derivative of the
square term will cancel out the 1/2 factor.

So we have our hypothesis function and we have a way of measuring how well it fits into the data. Now we need to
estimate the parameters in the hypothesis function. That's where gradient descent comes in.

Imagine that we graph our hypothesis function based on its parameters θ0 and θ1 (actually we are graphing the cost
function as a function of the parameter estimates). We are not graphing x and y themselves, but the parameter range
of our hypothesis function and the cost resulting from selecting a particular set of parameters.

We put θ0 on the x axis and θ1 on the y axis, with the cost function on the vertical z axis. The points on our graph will be
the result of the cost function using our hypothesis with those specific theta parameters. The graph below depicts
such a setup.

We will know that we have succeeded when our cost function is at the very bottom of the pits in our graph, i.e. when
its value is the minimum. The red arrows show the minimum points in the graph.

The way we do this is by taking the derivative (the tangential line to a function) of our cost function. The slope of the
tangent is the derivative at that point and it will give us a direction to move towards. We make steps down the cost
function in the direction with the steepest descent. The size of each step is determined by the parameter α, which is
called the learning rate.

For example, the distance between each 'star' in the graph above represents a step determined by our parameter α. A
smaller α would result in a smaller step and a larger α results in a larger step. The direction in which the step is taken
is determined by the partial derivative of J(θ0,θ1). Depending on where one starts on the graph, one could end up at
different points. The image above shows us two different starting points that end up in two different places.

The gradient descent algorithm is:

repeat until convergence:

\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)

where j = 0, 1 represents the feature index number.

When specifically applied to the case of linear regression, a new form of the gradient descent equation can
be derived. We can substitute our actual cost function and our actual hypothesis function and modify the
equations to:

\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x_i) - y_i)

\theta_1 := \theta_1 - \alpha \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x_i) - y_i) x_i

where m is the size of the training set, θ0 is a constant that will be updated simultaneously with θ1, and xi, yi
are the values of the given training set (data).
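As a minimal sketch (not the course's own code; the names are illustrative), a single batch gradient descent step that updates both parameters simultaneously could look like this:

def batch_gd_step(theta0, theta1, x, y, alpha):
    # One step of batch gradient descent for h(x) = theta0 + theta1 * x.
    m = len(x)
    errors = [theta0 + theta1 * xi - yi for xi, yi in zip(x, y)]
    grad0 = sum(errors) / m
    grad1 = sum(e * xi for e, xi in zip(errors, x)) / m
    # Both parameters are updated simultaneously from the same gradients.
    return theta0 - alpha * grad0, theta1 - alpha * grad1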

The point of all this is that if we start with a guess for our hypothesis and then repeatedly apply these
gradient descent equations, our hypothesis will become more and more accurate.

So, this is simply gradient descent on the original cost function J. This method looks at every example in the
entire training set on every step, and is called batch gradient descent. Note that, while gradient descent
can be susceptible to local minima in general, the optimization problem we have posed here for linear
regression has only one global optimum and no other local optima; thus gradient descent always converges
(assuming the learning rate α is not too large) to the global minimum. Indeed, J is a convex quadratic
function. Here is an example of gradient descent as it is run to minimize a quadratic function.
Linear Regression with Multiple Variables
Logistic Regression
Source: Machine Learning by Andrew NG

Evaluating And Improving the performance

Make sure that if you are developing machine learning systems, that you know how to choose one of
the most promising avenues to spend your time pursuing. And concretely what we'd focus on is the
problem of, suppose you are developing a machine learning system or trying to improve the
performance of a machine learning system, how do you go about deciding what are the proxy avenues
to try.

The question is what should you then try mixing in order to improve the learning algorithm?

1. One thing you could try is to get more training examples. Concretely, you can imagine setting up phone
   surveys or going door to door to try to get more data on how much different houses sell for.
o The sad thing is that a lot of people spend a lot of time collecting more training examples, thinking
that if only we had twice as much, or ten times as much, training data, that would certainly help.
o But sometimes getting more training data doesn't actually help; later we will see why, and how you can
avoid spending a lot of time collecting more training data in settings where it is just not going to help.

2. Try a smaller set of features. If you have some set of features such as x1, x2, x3 and so on, perhaps a
   large number of them, you may want to spend time carefully selecting some small subset of them to prevent
   overfitting.

3. Or maybe you need to get additional features. Maybe the current set of features isn't informative enough
   and you want to collect more data in the sense of getting more features.

4. Try adding polynomial features, things like squared terms (x²). We can spend quite a lot of time thinking
   about that, and we can also try other things, like decreasing or increasing lambda, the regularization
   parameter.

Given a menu of options like these, some of which can easily scale up to six-month or longer projects,
unfortunately the most common method people use to pick one is to go by gut feeling. What many people do is
more or less randomly pick one of these options, maybe say, "Oh, let's go and get more training data," and
easily spend six months collecting it, or someone else would rather say, "Well, let's go collect a lot more
features on these houses in our data set." Many people spend, literally, six months pursuing one of these
avenues chosen more or less at random, only to discover six months later that it really wasn't a promising
avenue to pursue.

Fortunately, there is a pretty simple technique that can let you very quickly rule out half of the things on
this list as unpromising. It is a very simple technique that, if you run it, can easily rule out many of
these options and potentially save you a lot of time pursuing something that is just not going to work.

In the next section we discuss machine learning diagnostics and learn how to improve the performance of a
machine learning algorithm.

Suppose you have to decide what degree of polynomial to fit to a data set, that is, which features to include
in your learning algorithm; or suppose you have to choose the regularization parameter lambda for a learning
algorithm. How do you do that? These are model selection problems. In our discussion of how to do this, we
will talk about not just how to split your data into a train and test set, but how to split it into what are
called the train, validation, and test sets, and we will see how to use them to do model selection. We have
already seen many times the problem of overfitting, in which, just because a learning algorithm fits a
training set well, that does not mean it is a good hypothesis. More generally, this is why the training set's
error is not a good predictor of how well the hypothesis will do on new examples. Concretely, if you fit some
set of parameters θ0, θ1, θ2, and so on, to your training set, then the fact that your hypothesis does well
on the training set does not mean much in terms of predicting how well it will generalize to new examples not
seen in the training set. The more general principle is that once your parameters are fit to some set of data
(maybe the training set, maybe something else), then the error of your hypothesis measured on that same data
set, such as the training error, is unlikely to be a good estimate of your actual generalization error, that
is, of how well the hypothesis will generalize to new examples.

Model Selection Problem

Let's say you are trying to choose what degree of polynomial to fit to the data.
Should you choose a linear function, a quadratic function, a cubic function, all the way up to a 10th-order
polynomial? It is as if there is one extra parameter in this algorithm, which we will denote d: the degree of
the polynomial you want to pick. The first option is d = 1, fitting a linear function; we could choose d = 2,
d = 3, all the way up to d = 10. So we would like to fit this extra parameter. Concretely, say you want to
choose a model, that is, choose a degree of polynomial, choose one of these 10 models, fit that model, and
also get some estimate of how well the fitted hypothesis generalizes to new examples. Here is one thing you
could do. You could first take your first model and minimize the training error, which gives you some
parameter vector θ. You could then take your second model, the quadratic function, fit it to your training
set, and get some other parameter vector θ. To distinguish between these different parameter vectors, we use
superscripts: θ(1) means the parameters obtained by fitting the first hypothesis to the training data, θ(2)
the parameters obtained by fitting the second hypothesis, and so on. One thing we could then do is take these
parameters and look at the test error: compute on the test set Jtest(θ(1)), Jtest(θ(2)), and so on, that is,
take each of the hypotheses with the corresponding parameters and measure its performance on the test set.

Now, in order to select one of these models, we could see which model has the lowest test set error. Let's say,
for this example, that we ended up choosing the fifth-order polynomial. This seems reasonable so far. But now
suppose we want to take our fifth hypothesis, the fifth-order model, and ask how well it generalizes. One thing
we could do is look at how well the fifth-order polynomial hypothesis did on the test set. The problem is that
this will not be a fair estimate of how well the hypothesis generalizes. The reason is that we have fit this
extra parameter d, the degree of the polynomial, using the test set: we chose the value of d that gave the best
possible performance on the test set. So the performance of the parameter vector θ(5) on the test set is likely
to be an overly optimistic estimate of the generalization error.

Because we fit the parameter d to the test set, it is no longer fair to evaluate the hypothesis on that same
test set; the hypothesis is likely to do better on this test set than it would on new examples it hasn't seen
before, which is what we really care about. To reiterate: on the previous slide we saw that if we fit some set
of parameters, say θ0, θ1, and so on, to a training set, then the performance of the fitted model on the
training set is not predictive of how well the hypothesis will generalize to new examples. That is because
those parameters were fit to the training set, so they are likely to do well on the training set even if they
do not do well on other examples. In the procedure just described we did the same thing: we fit the parameter
d to the test set, and by doing so, the performance of the hypothesis on that test set may not be a fair
estimate of how well the hypothesis is likely to do on examples we haven't seen before.

To address this problem in a model selection setting, when we want to evaluate a hypothesis, this is
what we usually do instead. Given the data set, instead of just splitting it into a training and test set,
we split it into three pieces:

1. Training Set
2. Validation Set (also called the Cross Validation Set)
3. Test Set

The first piece is called the training set as usual. The second piece of the data is called the cross
validation set; sometimes it is simply called the validation set. And the last piece is the usual test set.
A typical ratio for splitting is to send 60% of your data to the training set, maybe 20% to the cross
validation set, and 20% to the test set. These numbers can vary a little, but this split is pretty typical.
So our training set will now be only about 60% of the data, and our cross-validation set, or validation set,
will have some number of examples.

Notations:

m_cv = number of cross validation examples

(x_cv^(i), y_cv^(i)) : the i-th cross validation example

And finally we also have a test set, with m_test being the number of test examples. Now that we have defined
the training, validation (or cross validation), and test sets, we can also define the training error, cross
validation error, and test error.
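The standard definitions from the course (the original figure is not reproduced here) are:

J_{train}(\theta) = \frac{1}{2m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})^2

J_{cv}(\theta) = \frac{1}{2m_{cv}} \sum_{i=1}^{m_{cv}} (h_\theta(x_{cv}^{(i)}) - y_{cv}^{(i)})^2

J_{test}(\theta) = \frac{1}{2m_{test}} \sum_{i=1}^{m_{test}} (h_\theta(x_{test}^{(i)}) - y_{test}^{(i)})^2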
So when faced with a model selection problem like this, instead of using the test set to select the model, we
use the validation set (the cross validation set) to select the model. Concretely, we first take the first
hypothesis, the first model, minimize the cost function, and this gives some parameter vector θ(1) for that
model. We do the same thing for the quadratic model and get some parameter vector θ(2), and so on, down to
θ(10) for the 10th-order polynomial. Then, instead of testing these hypotheses on the test set, we test them
on the cross validation set: we measure Jcv to see how well each of these hypotheses does on the cross
validation set, and we pick the hypothesis with the lowest cross validation error. For this example, let's say
for the sake of argument that the fourth-order polynomial had the lowest cross validation error, so we pick
the fourth-order polynomial model. What this means is that the parameter d, the degree of the polynomial
(d = 2, d = 3, all the way up to d = 10), has been fit using the cross-validation set: we set d = 4. Because
this degree-of-polynomial parameter is no longer fit to the test set, we have saved the test set, and we can
use it to measure, or estimate, the generalization error of the model that was selected. A minimal code sketch
of this procedure follows.
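Here is a minimal sketch of that procedure in Python (the synthetic data and NumPy's polynomial helpers are used purely for illustration; this is not the course's actual code):

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 3, 60)
y = 1 + 2 * x - 0.5 * x ** 2 + rng.normal(0, 0.2, x.size)   # noisy quadratic data

idx = rng.permutation(x.size)
tr, cv, te = idx[:36], idx[36:48], idx[48:]                  # roughly a 60/20/20 split

def fit(degree, xs, ys):
    return np.polyfit(xs, ys, degree)                        # least-squares polynomial fit

def error(theta, xs, ys):
    return np.mean((np.polyval(theta, xs) - ys) ** 2) / 2    # squared-error cost

# Fit each candidate degree on the training set and pick the one
# with the lowest cross validation error ...
best_d = min(range(1, 11), key=lambda d: error(fit(d, x[tr], y[tr]), x[cv], y[cv]))
theta = fit(best_d, x[tr], y[tr])

# ... then estimate the generalization error on the untouched test set.
print("chosen degree:", best_d, "test error:", error(theta, x[te], y[te]))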
Overfitting and Underfitting With Machine Learning Algorithms

The cause of poor performance in machine learning is either overfitting or underfitting the data.

Approximating a Target Function in Machine Learning

Supervised machine learning is best understood as approximating a target function (f) that maps input
variables (X) to an output variable (Y).

Y = f(X)

This characterization describes the range of classification and prediction problems and the machine learning
algorithms that can be used to address them. An important consideration in learning the target function from
the training data is how well the model generalizes to new data. Generalization is important because the data
we collect is only a sample; it is incomplete and noisy.

Generalization in Machine Learning

In machine learning we describe the learning of the target function from training data as inductive
learning.

Induction refers to learning general concepts from specific examples which is exactly the problem that
supervised machine learning problems aim to solve. This is different from deduction that is the other
way around and seeks to learn specific concepts from general rules.

Generalization refers to how well the concepts learned by a machine learning model apply to specific
examples not seen by the model when it was learning.

The goal of a good machine learning model is to generalize well from the training data to any data from
the problem domain. This allows us to make predictions in the future on data the model has never seen.

There is a terminology used in machine learning when we talk about how well a machine learning
model learns and generalizes to new data, namely overfitting and underfitting.

Overfitting and underfitting are the two biggest causes for poor performance of machine learning
algorithms.

Statistical Fit

In statistics, a fit refers to how well you approximate a target function.

This is good terminology to use in machine learning, because supervised machine learning algorithms
seek to approximate the unknown underlying mapping function for the output variables given the input
variables.

Statistics often describe the goodness of fit which refers to measures used to estimate how well the
approximation of the function matches the target function.
Some of these methods are useful in machine learning (e.g. calculating the residual errors), but some of
these techniques assume we know the form of the target function we are approximating, which is not
the case in machine learning.

If we knew the form of the target function, we would use it directly to make predictions, rather than
trying to learn an approximation from samples of noisy training data.

Overfitting in Machine Learning

Overfitting refers to a model that models the training data too well.

Overfitting happens when a model learns the detail and noise in the training data to the extent that it
negatively impacts the performance of the model on new data. This means that the noise or random
fluctuations in the training data are picked up and learned as concepts by the model. The problem is that
these concepts do not apply to new data and negatively impact the model's ability to generalize.

Overfitting is more likely with nonparametric and nonlinear models that have more flexibility when
learning a target function. As such, many nonparametric machine learning algorithms also include
parameters or techniques to limit and constrain how much detail the model learns.

For example, decision trees are a nonparametric machine learning algorithm that is very flexible and is
subject to overfitting training data. This problem can be addressed by pruning a tree after it has learned
in order to remove some of the detail it has picked up.

Underfitting in Machine Learning

Underfitting refers to a model that can neither model the training data nor generalize to new data.

An underfit machine learning model is not a suitable model and will be obvious as it will have poor
performance on the training data.

Underfitting is often not discussed as it is easy to detect given a good performance metric. The remedy
is to move on and try alternate machine learning algorithms. Nevertheless, it does provide a good
contrast to the problem of overfitting.

A Good Fit in Machine Learning

Ideally, you want to select a model at the sweet spot between underfitting and overfitting.

This is the goal, but is very difficult to do in practice.

To understand this goal, we can look at the performance of a machine learning algorithm over time as it
learns from training data. We can plot both the skill on the training data and the skill on a test dataset
we have held back from the training process.

Over time, as the algorithm learns, the error for the model on the training data goes down and so does
the error on the test dataset. If we train for too long, the performance on the training dataset may
continue to decrease because the model is overfitting and learning the irrelevant detail and noise in the
training dataset. At the same time the error for the test set starts to rise again as the model’s ability to
generalize decreases.
The following figure depicts an example of high bias. In other words, the model is underfitting. The
data points obviously follow some sort of curve, but our predictor isn’t complex enough to capture that
information. Our model is biased in that it assumes that the data will behave in a certain fashion
(linear, quadratic, etc.) even though that assumption may not be true. A key point is that there’s
nothing wrong with our training—this is the best possible fit that a linear model can achieve. There is,
however, something wrong with the model itself in that it’s not complex enough to model our data.
High Bias Error (case of underfitting): when the training error is high and the cross validation/test
error is also high.
High Variance (case of overfitting): when the training error is low but the cross validation/test error
is high.

(A tiny illustrative sketch of this rule of thumb follows.)
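As referenced above, a tiny illustrative sketch of that rule of thumb (the threshold values are made up for illustration, not universal constants):

def diagnose(train_error, cv_error, high=1.0, gap_ratio=2.0):
    # Both errors high -> high bias (underfitting).
    # Low training error with a much larger CV/test error -> high variance (overfitting).
    if train_error > high and cv_error > high:
        return "high bias (underfitting)"
    if cv_error > gap_ratio * train_error:
        return "high variance (overfitting)"
    return "reasonable fit"

print(diagnose(train_error=2.5, cv_error=2.7))   # both high -> underfitting
print(diagnose(train_error=0.1, cv_error=1.5))   # large gap -> overfitting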

Learning Curve
Learning curves are often a very useful thing to plot, if either:
- you want to check that your algorithm is working correctly, or
- you want to improve the performance of the algorithm.

A learning curve is a tool that is used very often to diagnose whether a machine learning algorithm may be
suffering from a bias problem, a variance problem, or a bit of both.

To plot a learning curve, we usually plot Jtrain(θ) and Jcv(θ) as functions of m, that is, as functions of
the number of training examples we have. To plot the graph we vary the number of training examples
(10, 20, 30, ... up to m) and plot what the training error and the cross validation error are for each
reduced training set size.

Suppose we have only one training example, as shown in the first example, and say we are fitting a quadratic
function; we can fit it perfectly, so we get 0 error on that one training example. As m increases, the
training error starts increasing (while using the same quadratic function). However, a very small training
set does not let the model generalize, hence the cross validation error is high when m is small, and it
starts decreasing as m increases.

After some threshold value of m the error rates flatten out, so increasing m beyond this point does not
help to improve the performance.
Learning curve in the case of high bias

Suppose the hypothesis has high bias, for example data that can't really be fit well by a straight line, so
we end up with a hypothesis that underfits.

Now let's think about what would happen if we increase the training set size. If, instead of five examples,
we have a lot more training examples, then fitting a straight line to them still gives us pretty much the
same straight line.

Hence in this case increasing the size of the training data set will not improve the performance (the
straight line isn't going to change that much). The learning curves for the training and validation error
look as follows: by the time you have reached a certain number of training examples you have almost fit the
best possible straight line, and even with a much larger training set size, a much larger value of m, you
basically get the same straight line. So the cross-validation (or test set) error flattens out pretty soon,
once you reach a certain number of training examples.

The training error starts small and grows, and what you find in the high bias case is that the training
error ends up close to the cross validation error, because you have so few parameters and so much data, at
least when m is large. The performance on the training set and the cross validation set will be very
similar. This is what your learning curves will look like if you have an algorithm that has high bias.

Finally, the problem with high bias is reflected in the fact that both the cross validation error and
the training error are high.

This also implies something very interesting: if a learning algorithm has high bias, then as we get more
and more training examples the cross validation error does not go down much, it basically flattens out.
So knowing that your learning algorithm is suffering from high bias is a useful thing to know, because it
can prevent you from wasting a lot of time collecting more training data when it would not end up being
helpful.

Learning curve for high variance

Let us look at the training error when you have a very small training set, say five training examples as
shown in the figure on the right, and suppose we are fitting a very high order polynomial (a hundredth-degree
polynomial, just for example): a case of overfitting.

As the training set size increases a bit, we may still be overfitting the data a little, but it also becomes
slightly harder to fit the data set perfectly. As the training set size increases, we find that the training
error increases, because it is just a little harder to fit the training set perfectly when we have more
examples, but the training set error will still be pretty low.

Now, how about the cross validation error? Well, in the high variance setting, the hypothesis is overfitting,
so the cross validation error will remain high even as we get a moderate number of training examples. The
indicative diagnostic that we have a high variance problem is the large gap between the training error and
the cross validation error.

If we think about adding more training data, that is, taking this figure and extrapolating to the right, the
two curves (training and validation error) converge towards each other: it seems likely that the training
error will keep going up and the cross-validation error will keep coming down.

The thing we really care about is the cross-validation error or the test set error. In this scenario we can
tell that if we keep adding training examples and extrapolate to the right, our cross validation error will
keep coming down. So, in the high variance setting, getting more training data is indeed likely to help.

Now, on the previous slide and this one, I've drawn fairly clean, fairly idealized curves. If you plot these
curves for an actual learning algorithm, sometimes you will actually see curves pretty much like these,
although sometimes you will see curves that are a bit noisier and messier. But plotting learning curves like
these can often help you figure out whether your learning algorithm is suffering from bias, variance, or a
bit of both.
For an implementation of Linear Regression, refer to the Jupyter notebook “ML_ Learning Linear Regression” and
Source: https://utkuufuk.github.io/2018/04/21/linear-regression/

Training a Simple Linear Regression Model From Scratch

Posted on 2018-04-21 | Edited on 2018-04-27 | In Machine Learning

Hey everyone, welcome to my first blog post! This is going to be a walkthrough on training a simple
linear regression model in Python. I’ll show you how to do it from scratch, without using any machine
learning tools or libraries. We’ll only use NumPy and Matplotlib for matrix operations and data
visualization.

Problem & Dataset

We’ll look at a regression problem from a very popular machine learning course taught by Andrew Ng.
Our objective in this problem will be to train a model that accurately predicts the profits of a food truck.

The first column in our dataset file contains city populations and the second column contains food truck profits
in each city, both in units of 10,000s. Here are the first few training examples:

food_truck_data.txt
6.1101,17.592
5.5277,9.1302
8.5186,13.662
7.0032,11.854
5.8598,6.8233
...

We’re going to use this dataset as a training sample to build our model. Let’s begin by loading it:

import numpy as np
import matplotlib.pyplot as plt

data = np.loadtxt('food_truck_data.txt', delimiter=",")


x = data[:, 0] # city populations
y = data[:, 1] # food truck profits

Both x and y are one-dimensional arrays, because we have one feature (population) and one target variable
(profit) in this problem. Therefore we can conveniently visualize our dataset using a scatter plot:
fig, ax = plt.subplots()
ax.scatter(x, y, marker="x", c="red")
plt.title("Food Truck Dataset", fontsize=16)
plt.xlabel("City Population in 10,000s", fontsize=14)
plt.ylabel("Food Truck Profit in 10,000s", fontsize=14)
plt.axis([4, 25, -5, 25])
plt.show()

Hypothesis Function

Now we need to come up with a straight line which accurately represents the relationship between
population and profit. This is called the hypothesis function and it's formulated as:

h_\theta(x) = \theta^T x = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \ldots + \theta_n x_n

where x corresponds to the feature matrix and θ corresponds to the vector of model parameters.

Since we have a single feature x1, we'll only have two model parameters, θ0 and θ1, in our hypothesis function:

h_\theta(x) = \theta_0 + \theta_1 x_1

As you may have noticed, the number of model parameters is equal to the number of features plus 1. That's
because each feature is weighted by a parameter to control its impact on the hypothesis h_θ(x). There is
also an independent parameter θ0 called the intercept term, which defines the point where the hypothesis
function intercepts the y-axis, as demonstrated below:

The predictions of a hypothesis function can easily be evaluated in Python by computing the matrix product of
X and θ. At the moment we have our x and y vectors but we don't have our model parameters yet. So let's create
those as well and initialize them with zeros:

theta = np.zeros(2)

Also we have to make sure that the matrix dimensions of X and θ are compatible for the matrix product.
Currently x has 1 column but θ has 2 rows. The dimensions don't match because of the additional intercept
term θ0.

We can solve this issue by inserting a first column into x and setting it to all ones. This is essentially
equivalent to creating a new feature x0 = 1. This extra column won't affect the hypothesis whatsoever,
because θ0 is going to be multiplied by 1 in the product.

Let’s create a new variable X to store the extended x matrix:

X = np.ones(shape=(len(x), 2))
X[:, 1] = x

Finally we can compute the predictions of our hypothesis as follows:

predictions = X @ theta

Of course the predictions are currently all zeros because we haven’t trained our model yet.

Cost Function

The objective in training a linear regression model is to minimize a cost function, which measures the
difference between the actual y values in the training sample and the predictions made by the hypothesis
function h_θ(x). Such a cost function can be formulated as:

J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})^2

where m is the number of training examples.

Here’s its Python version:

def cost(theta, X, y):
    predictions = X @ theta
    squared_errors = np.square(predictions - y)
    return np.sum(squared_errors) / (2 * len(y))

Now let’s take a look at the cost of our initial untrained model:

print('The initial cost is:', cost(theta, X, y))


The initial cost is: 32.0727338775

Gradient Descent Algorithm

Since our hypothesis is based on the model parameters θ, we must somehow adjust them to minimize our cost
function J(θ). This is where the gradient descent algorithm comes into play. It's an optimization algorithm
which can be used to minimize differentiable functions. Luckily our cost function J(θ) happens to be a
differentiable one.

So here's how the gradient descent algorithm works in a nutshell: in each iteration, it takes a small step
in the opposite direction of the gradient of J(θ). This makes the model parameters θ gradually come closer
to their optimal values. This process is repeated until eventually the minimum cost is achieved.

More formally, gradient descent performs the following update in each iteration:

\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)}

The α term here is called the learning rate. It allows us to control the step size used to update θ in each
iteration. Choosing too large a learning rate may prevent us from converging to a minimum cost, whereas
choosing too small a learning rate may significantly slow down the algorithm.

Here’s a generic implementation of the gradient descent algorithm:

def gradient_descent(X, y, alpha, num_iters):
    num_features = X.shape[1]
    theta = np.zeros(num_features)          # initialize model parameters
    for n in range(num_iters):
        predictions = X @ theta             # compute predictions based on the current hypothesis
        errors = predictions - y
        gradient = X.transpose() @ errors
        theta -= alpha * gradient / len(y)  # update model parameters
    return theta                            # return optimized parameters

Now let’s use this function to train our model and plot the hypothesis function:

theta = gradient_descent(X, y, 0.02, 600)   # run GD for 600 iterations with learning rate = 0.02
predictions = X @ theta                     # predictions made by the optimized model
ax.plot(X[:, 1], predictions, linewidth=2)  # plot the hypothesis on top of the training data
fig
Debugging

Our linear fit looks pretty good, right? The algorithm must have successfully optimized our model.

Well, to be honest, it’s been fairly easy to visualize the hypothesis because there’s only one feature in
this problem.

But what if we had multiple features? Then it wouldn’t be possible to simply plot the hypothesis to see
whether the algorithm has worked as intended or not.

Fortunately, there’s a simple way to debug the gradient descent algorithm irrespective of the number of
features:

1. Modify the gradient descent function to make it record the cost at the end of each iteration.
2. Plot the cost history after the gradient descent has finished.
3. Pat yourself on the back if you see that the cost has monotonically decreased over time.

Here’s the modified version of our gradient descent function:

def gradient_descent(X, y, alpha, num_iters):
    cost_history = np.zeros(num_iters)      # create a vector to store the cost history
    num_features = X.shape[1]
    theta = np.zeros(num_features)
    for n in range(num_iters):
        predictions = X @ theta
        errors = predictions - y
        gradient = X.transpose() @ errors
        theta -= alpha * gradient / len(y)
        cost_history[n] = cost(theta, X, y) # compute and record the cost
    return theta, cost_history              # return optimized parameters and cost history

Now let’s try learning rates 0.01, 0.015 and 0.02, and plot the cost history for each one:

plt.figure()
num_iters = 1200
learning_rates = [0.01, 0.015, 0.02]
for lr in learning_rates:
    _, cost_history = gradient_descent(X, y, lr, num_iters)
    plt.plot(cost_history, linewidth=2)
plt.title("Gradient descent with different learning rates", fontsize=16)
plt.xlabel("number of iterations", fontsize=14)
plt.ylabel("cost", fontsize=14)
plt.legend(list(map(str, learning_rates)))
plt.axis([0, num_iters, 4, 6])
plt.grid()
plt.show()

It appears that the gradient descent algorithm worked correctly for these particular learning rates.
Notice that it takes more iterations to minimize the cost as the learning rate decreases.

Now let’s try a larger learning rate and see what happens:

learning_rate = 0.025
num_iters = 50
_, cost_history = gradient_descent(X, y, learning_rate, num_iters)
plt.plot(cost_history, linewidth=2)
plt.title("Gradient descent with learning rate = " + str(learning_rate), fontsize=16)
plt.xlabel("number of iterations", fontsize=14)
plt.ylabel("cost", fontsize=14)
plt.axis([0, num_iters, 0, 6000])
plt.grid()
plt.show()
Doesn’t look good… That’s what happens when the learning rate is too large. Even though the gradient
descent algorithm takes steps in the correct direction, these steps are so huge that it overshoots the
target, and the cost diverges from the minimum value instead of converging to it.

For now we can safely set the learning rate to 0.02, because it allows us to minimize the cost and requires
relatively few iterations to converge.

Prediction

Now that we’ve learned how to train our model, we can finally predict the food truck profit for a
particular city:

theta, _ = gradient_descent(X, y, 0.02, 600)   # train the model
test_example = np.array([1, 7])                # a city with a population of 70,000 as a test example
prediction = test_example @ theta              # use the trained model to make a prediction
print('For population = 70,000, we predict a profit of $', prediction * 10000)

For population = 70,000, we predict a profit of $ 45905.6621788

Source: https://utkuufuk.github.io/2018/05/04/learning-curves/

Learning Curves in Linear & Polynomial Regression
Learning curves are very useful for analyzing the bias-variance characteristics of a machine learning
model. In this post, I’m going to talk about how to make use of them in a case study of a regression
problem. We’re going to start with a simple linear regression model and improve it as much as we can
by taking advantage of learning curves.

Introduction to Learning Curves

In a nutshell, learning curves show how the training and validation errors change with respect to the
number of training examples used while training a machine learning model.

- If a model is balanced, both errors converge to small values as the training sample size increases.
- If a model has high bias, it ends up underfitting the data. As a result, both errors fail to decrease no
  matter how many examples there are in the training set.
- If a model has high variance, it ends up overfitting the training data. In that case, increasing the
  training sample size decreases the training error but fails to decrease the validation error.

The figure below demonstrates each of those cases:

Problem Definition and Dataset

After this incredibly brief introduction, let me introduce today's problem, where we'll get to see learning
curves in action. It's another problem from Andrew Ng's machine learning course, in which the objective is to
predict the amount of water flowing out of a dam, given the change of water level in a reservoir.

The dataset file we’re about to read contains historical records on the change in water level and the
amount of water flowing out of the dam. The reason that it’s a .mat file is because this problem is
originally a MATLAB assignment. Fortunately it’s pretty easy to load .mat files in Python using the
loadmat function from SciPy. We’ll also need NumPy and Matplotlib for matrix operations and data
visualization:

import numpy as np
import matplotlib.pyplot as plt
import scipy.optimize as opt # we'll need this later
import scipy.io as sio

dataset = sio.loadmat("water.mat")
x_train = dataset["X"]
x_val = dataset["Xval"]
x_test = dataset["Xtest"]

# squeeze the target variables into one dimensional arrays
y_train = dataset["y"].squeeze()
y_val = dataset["yval"].squeeze()
y_test = dataset["ytest"].squeeze()

The dataset is divided into three samples:

- The training sample consists of x_train and y_train.
- The validation sample consists of x_val and y_val.
- The test sample consists of x_test and y_test.

Notice that we have to explicitly convert the target variables (y_train, y_val and y_test) to one
dimensional vectors, because they are stored as matrices inside the .mat file.

Let’s plot the training sample to see what it looks like:

fig, ax = plt.subplots()
ax.scatter(x_train, y_train, marker="x", s=40, c='red')
plt.xlabel("change in water level", fontsize=14)
plt.ylabel("water flowing out of the dam", fontsize=14)
plt.title("Training sample", fontsize=16)
plt.show()

The Game Plan

Alright, it’s time to come up with a strategy. First of all, it’s clear that there’s a nonlinear relationship
between x and y. Normally we would rule out any linear model because of that. However, we are going to begin by
training a linear regression model so that we can see what the learning curves of a model with high bias look
like. Then we’ll train a polynomial regression model, which is going to be much more flexible than linear
regression. This will let us see the learning curves of a model with high variance.

Finally we’ll add regularization to the existing polynomial regression model and see what a balanced model’s
learning curves look like.

Linear Regression

I’ve already shown you in the previous post how to train a linear regression model using gradient
descent. Before proceeding any further, I strongly encourage you to take a look at it if you don’t have at
least a basic understanding of linear regression.

Here I’ll show you an easier way to train a linear regression model using an optimization function
called fmin_cg from scipy.optimize. You can check out the detailed documentation here. The cool
thing about this function is that it’s faster than gradient descent and also you don’t have to select a
learning rate by trial and error.

fmin_cg needs a function that returns the cost and another one that returns the gradient of the cost for a
given hypothesis. We have to pass those to fmin_cg as function arguments. Fortunately we can reuse
some code from the previous post:

- We can completely reuse the cost function, because it’s independent of the optimization method that we use.
- From the gradient_descent function, we can borrow the part where the gradient of the cost function is
  evaluated.

So here’s (almost) all we need in order to train a linear regression model:

def cost(theta, X, y):
    predictions = X @ theta
    return np.sum(np.square(predictions - y)) / (2 * len(y))

def cost_gradient(theta, X, y):
    predictions = X @ theta
    return X.transpose() @ (predictions - y) / len(y)

def train_linear_regression(X, y):
    theta = np.zeros(X.shape[1])   # initialize model parameters with zeros
    return opt.fmin_cg(cost, theta, cost_gradient, (X, y), disp=False)

If you look at our cost function, there we evaluate the matrix product of the feature matrix X and the vector of
model parameters θ. Remember, this is only possible if the matrix dimensions match. Therefore we also need a
tiny utility function that inserts an additional first column of all ones into a raw feature matrix such as x_train.

def insert_ones(x):
    X = np.ones(shape=(x.shape[0], x.shape[1] + 1))
    X[:, 1:] = x
    return X
Now let’s train a linear regression model and plot the linear fit on top of the training sample:

X_train = insert_ones(x_train)
theta = train_linear_regression(X_train, y_train)
hypothesis = X_train @ theta
ax.plot(X_train[:, 1], hypothesis, linewidth=2)
fig

Learning Curves for Linear Regression

The above plot clearly shows that linear regression is not suitable for this task. Let’s also look at its
learning curves and see if we can draw the same conclusion.

While plotting learning curves, we’re going to start with 2 training examples and increase them one by one. In
each iteration, we’ll train a model and evaluate the training error on the existing training sample, and the
validation error on the whole validation sample:

def learning_curves(X_train, y_train, X_val, y_val):
    train_err = np.zeros(len(y_train))
    val_err = np.zeros(len(y_train))

    for i in range(1, len(y_train)):
        theta = train_linear_regression(X_train[0:i + 1, :], y_train[0:i + 1])
        train_err[i] = cost(theta, X_train[0:i + 1, :], y_train[0:i + 1])
        val_err[i] = cost(theta, X_val, y_val)

    plt.plot(range(2, len(y_train) + 1), train_err[1:], c="r", linewidth=2)
    plt.plot(range(2, len(y_train) + 1), val_err[1:], c="b", linewidth=2)
    plt.xlabel("number of training examples", fontsize=14)
    plt.ylabel("error", fontsize=14)
    plt.legend(["training", "validation"], loc="best")
    plt.axis([2, len(y_train), 0, 100])
    plt.grid()

In order to use this function, we have to add a column of ones to x_val, just as we did with x_train:

X_val = insert_ones(x_val)
plt.title("Learning Curves for Linear Regression", fontsize=16)
learning_curves(X_train, y_train, X_val, y_val)

As expected, we were unable to sufficiently decrease either the training or the validation error.

Polynomial Regression

Now it’s time to introduce some nonlinearity with polynomial regression.

Feature Mapping

In order to train a polynomial regression model, the existing feature(s) have to be mapped to artificially
generated polynomial features. Then the rest is pretty much the same drill.

In our case we only have a single feature x1, the change in water level. Therefore we can simply compute the
first several powers of x1 to artificially obtain new polynomial features. Let’s create a simple function for this:

def poly_features(x, degree):
    X_poly = np.zeros(shape=(len(x), degree))
    for i in range(0, degree):
        X_poly[:, i] = x.squeeze() ** (i + 1)
    return X_poly

Now let’s generate new feature matrices for training, validation and test samples with 8 polynomial
features in each:

x_train_poly = poly_features(x_train, 8)
x_val_poly = poly_features(x_val, 8)
x_test_poly = poly_features(x_test, 8)
Feature Normalization

Ok we have our polynomial features but we also have a tiny little problem. If you take a closer look at
one of the new matrices, you’ll see that the polynomial features are very imbalanced at the moment. For
instance let’s look at the first few rows of the x_train_poly matrix:

print(x_train_poly[:4, :])
[[ -1.59367581e+01 2.53980260e+02 -4.04762197e+03 6.45059724e+04
-1.02801608e+06 1.63832436e+07 -2.61095791e+08 4.16102047e+09]
[ -2.91529792e+01 8.49896197e+02 -2.47770062e+04 7.22323546e+05
-2.10578833e+07 6.13900035e+08 -1.78970150e+10 5.21751305e+11]
[ 3.61895486e+01 1.30968343e+03 4.73968522e+04 1.71527069e+06
6.20748719e+07 2.24646160e+09 8.12984311e+10 2.94215353e+12]
[ 3.74921873e+01 1.40566411e+03 5.27014222e+04 1.97589159e+06
7.40804977e+07 2.77743990e+09 1.04132297e+11 3.90414759e+12]]

As the polynomial degree increases, the values in the corresponding columns exponentially grow to the
point where they differ by orders of magnitude.

The thing is, the cost function will generally converge much more slowly when the features are
imbalanced like this. So we need to make sure that our features are on a similar scale before we begin
to train our model. We’re going to do this in two steps:

1. Subtract each column’s mean from that column, so that the new mean is 0.

2. Divide each column by its standard deviation, so that the new standard deviation is 1.

It’s important that we use the mean and standard deviation values from the training sample while
normalizing the validation and test samples.

train_means = x_train_poly.mean(axis=0)
train_stdevs = np.std(x_train_poly, axis=0, ddof=1)

x_train_poly = (x_train_poly - train_means) / train_stdevs


x_val_poly = (x_val_poly - train_means) / train_stdevs
x_test_poly = (x_test_poly - train_means) / train_stdevs

X_train_poly = insert_ones(x_train_poly)
X_val_poly = insert_ones(x_val_poly)
X_test_poly = insert_ones(x_test_poly)

Finally we can train our polynomial regression model by using our train_linear_regression
function and plot the polynomial fit. Note that when the polynomial features are simply treated as
independent features, training a polynomial regression model is no different than training a multivariate
linear regression model:

def plot_fit(min_x, max_x, means, stdevs, theta, degree):
    x = np.linspace(min_x - 5, max_x + 5, 1000)
    x_poly = poly_features(x, degree)
    x_poly = (x_poly - means) / stdevs
    x_poly = insert_ones(x_poly)
    plt.plot(x, x_poly @ theta, linewidth=2)
    plt.show()

theta = train_linear_regression(X_train_poly, y_train)
plt.scatter(x_train, y_train, marker="x", s=40, c='red')
plt.xlabel("change in water level", fontsize=14)
plt.ylabel("water flowing out of the dam", fontsize=14)
plt.title("Polynomial Fit", fontsize=16)
plot_fit(min(x_train), max(x_train), train_means, train_stdevs, theta, 8)

What do you think, seems pretty accurate right? Let’s take a look at the learning curves.

plt.title("Learning Curves for Polynomial Regression", fontsize=16)


learning_curves(X_train_poly, y_train, X_val_poly, y_val)

Now that’s overfitting written all over it. Even though the training error is very low, the validation error
fails miserably to converge.
It appears that we need something in between in terms of flexibility. Although we can’t make linear
regression more flexible, we can decrease the flexibility of polynomial regression using regularization.
Before going further with the example, let’s discuss the basics of regularization:
Ridge Regression

It performs ‘L2 regularization’, i.e. it adds a penalty equivalent to the sum of the squares of the coefficient
magnitudes. Thus, it optimises the following:

Objective = RSS + λ * (sum of squares of coefficients)

Here, λ is the tuning parameter which balances the emphasis given to minimising RSS versus minimising the
sum of squared coefficients. It can take various values:

λ = 0:

 The objective becomes the same as simple linear regression.
 We’ll get the same coefficients as simple linear regression.

λ = ∞:

 The coefficients will be zero. Why? Because of the infinite weight on the squared coefficients, any
non-zero coefficient would make the objective infinite.

0 < λ < ∞:

 The magnitude of λ decides the weight given to the different parts of the objective.
 The coefficients will be somewhere between 0 and the ones obtained for simple linear regression.

Lasso Regression

LASSO stands for Least Absolute Shrinkage and Selection Operator. I know it doesn’t give much of
an idea but there are 2 key words here - absolute and selection.

Lasso regression performs L1 regularization, i.e. it adds a factor of sum of absolute value of
coefficients in the optimisation objective.

Objective = RSS + λ * (sum of absolute value of coefficients)

Here, λ works similarly to ridge: it can take various values and provides a trade-off between minimising RSS
and the magnitude of the coefficients.

Selection of λ

λ can be adjusted to help you find a good fit for your model.

 However, a value that is too low might not do anything.
 One that is too high might cause you to under-fit the model and lose valuable information.

It’s up to the user to find the optimal value. Cross-validation using different values of λ can help you
identify the λ that produces the lowest out-of-sample error.
Key differences between Ridge and Lasso Regression

Ridge: It includes all (or none) of the features in the model. Thus, the major advantage of ridge
regression is coefficient shrinkage and reducing model complexity.

Lasso: Along with shrinking coefficients, lasso performs feature selection as well. (Remember the
‘selection‘ in the lasso full-form?) As we observed earlier, some of the coefficients become exactly
zero, which is equivalent to the particular feature being excluded from the model.

But why is it that the lasso, unlike ridge regression, results in coefficient estimates that are exactly
equal to zero? The short sketch below illustrates this behaviour before we return to our example.
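To see this shrinkage in action, here is a quick sketch using scikit-learn’s Ridge and Lasso on synthetic data
(the data, and the alpha values, which play the role of λ here, are assumptions made purely for illustration):

# A minimal sketch contrasting ridge and lasso shrinkage with scikit-learn.
# The synthetic data and the alpha values below are assumptions for illustration.
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.RandomState(0)
X = rng.randn(100, 5)
# only the first two features actually matter
y = 3 * X[:, 0] - 2 * X[:, 1] + 0.1 * rng.randn(100)

ridge = Ridge(alpha=10.0).fit(X, y)
lasso = Lasso(alpha=0.5).fit(X, y)

print("ridge coefficients:", np.round(ridge.coef_, 3))  # all shrunk, none exactly 0
print("lasso coefficients:", np.round(lasso.coef_, 3))  # irrelevant features driven to exactly 0

Running a sketch like this typically shows the lasso zeroing out the coefficients of the three irrelevant
features, while ridge merely shrinks them towards zero.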
Regularized Polynomial Regression

Regularization lets us come up with simpler hypothesis functions that are less prone to overfitting. This is
achieved by penalizing large θ values during the training stage.

Of course we’ll need to reflect these changes in the corresponding Python implementations by
introducing a regularization parameter lamb:

def cost(theta, X, y, lamb=0):
    predictions = X @ theta
    squared_errors = np.sum(np.square(predictions - y))
    regularization = np.sum(lamb * np.square(theta[1:]))
    return (squared_errors + regularization) / (2 * len(y))

def cost_gradient(theta, X, y, lamb=0):
    predictions = X @ theta
    gradient = X.transpose() @ (predictions - y)
    regularization = lamb * theta
    regularization[0] = 0  # don't penalize the intercept term
    return (gradient + regularization) / len(y)

We also have to slightly modify train_linear_regression and learning_curves:

def train_linear_regression(X, y, lamb=0):
    theta = np.zeros(X.shape[1])
    return opt.fmin_cg(cost, theta, cost_gradient, (X, y, lamb), disp=False)

def learning_curves(X_train, y_train, X_val, y_val, lamb=0):
    train_err = np.zeros(len(y_train))
    val_err = np.zeros(len(y_train))

    for i in range(1, len(y_train)):
        theta = train_linear_regression(X_train[0:i + 1, :], y_train[0:i + 1], lamb)
        train_err[i] = cost(theta, X_train[0:i + 1, :], y_train[0:i + 1])
        val_err[i] = cost(theta, X_val, y_val)

    plt.plot(range(2, len(y_train) + 1), train_err[1:], c="r", linewidth=2)
    plt.plot(range(2, len(y_train) + 1), val_err[1:], c="b", linewidth=2)
    plt.xlabel("number of training examples", fontsize=14)
    plt.ylabel("error", fontsize=14)
    plt.legend(["Training", "Validation"], loc="best")
    plt.axis([2, len(y_train), 0, 100])
    plt.grid()

Alright, we’re now ready to train a regularized polynomial regression model. Let’s set λ = 1 and plot our
polynomial hypothesis on top of the training sample:

theta = train_linear_regression(X_train_poly, y_train, 1)

plt.scatter(x_train, y_train, marker="x", s=40, c='red')
plt.xlabel("change in water level", fontsize=14)
plt.ylabel("water flowing out of the dam", fontsize=14)
plt.title("Regularized Polynomial Fit", fontsize=16)
plot_fit(min(x_train), max(x_train), train_means, train_stdevs, theta, 8)
It is clear that this hypothesis is much less flexible than the unregularized one. Let’s plot the learning
curves and observe its bias-variance tradeoff:

plt.title("Learning Curves for Regularized Polynomial Regression", fontsize=16)


learning_curves(X_train_poly, y_train, X_val_poly, y_val, 1)

This is apparently the best model we’ve come up with so far.

Choosing the Optimal Regularization Parameter

Although setting λ = 1 has significantly improved the unregularized model, we can do even better by
optimizing λ as well. Here’s how we’re going to do it:

1. Select a set of λ values to try out.

2. Train a model for each λ in the set.

3. Find the λ value that yields the minimum validation error.

lambda_values = [0, 0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, 3, 10]

val_err = []
for lamb in lambda_values:
    theta = train_linear_regression(X_train_poly, y_train, lamb)
    val_err.append(cost(theta, X_val_poly, y_val))

plt.plot(lambda_values, val_err, c="b", linewidth=2)
plt.axis([0, len(lambda_values), 0, val_err[-1] + 1])
plt.grid()
plt.xlabel("lambda", fontsize=14)
plt.ylabel("error", fontsize=14)
plt.title("Validation Curve", fontsize=16)
plt.show()

Looks like we’ve achieved the lowest validation error at λ = 3.

Evaluating Test Errors

It’s good practice to evaluate an optimized model’s accuracy on a separate test sample other than the
training and validation samples. So let’s train our models once again and compare test errors:
X_test = insert_ones(x_test)
theta = train_linear_regression(X_train, y_train)
test_error = cost(theta, X_test, y_test)
print("Test Error =", test_error, "| Linear Regression")

theta = train_linear_regression(X_train_poly, y_train)
test_error = cost(theta, X_test_poly, y_test)
print("Test Error =", test_error, "| Polynomial Regression")

theta = train_linear_regression(X_train_poly, y_train, 3)
test_error = cost(theta, X_test_poly, y_test)
print("Test Error =", test_error, "| Regularized Polynomial Regression (at lambda = 3)")
Test Error = 32.5057492449 | Linear Regression
Test Error = 17.2624144407 | Polynomial Regression
Test Error = 3.85988782246 | Regularized Polynomial Regression (at lambda = 3)
Bayesian Classifiers
 Statistical classifiers
 Predict class membership probabilities: probability of a given tuple belonging to a particular
class
 Based on Bayes’ Theorem
Characteristics
Comparable performance with decision tree and selected neural network classifiers

Bayesian Classifiers

1. Naïve Bayesian Classifiers

Assume that the effect of an attribute value on a given class is independent of the values of the other
attributes.

2. Bayesian Belief Networks

Graphical models which allow the representation of dependencies among subsets of attributes.

Bayesian classifiers work based on the concept of probability; the next section describes the basics of
probability:

Probability
It is a value that depicts the chance of an event happening. When we toss a coin, the outcome is random
and the probability that the coin lands heads up is 50%.

A probability is always between 0 and 1, and the probabilities of all events in the same sample space sum to 1.
In the field of probability theory there are two schools of thought: Frequentist and Bayesian.
Let the events be denoted as
X: Gender of the voter
Y: Voted for a female president or not (Outcome: Yes or No)

Marginal Probability
P(X=male) = P(X=male ∩ Y= yes) + P(X=male ∩ Y= no)

P(X=male ∩ Y= yes) = P(X=male | Y= yes) P(Y=yes)


P(X=male ∩ Y= no) = P(X=male | Y= no) P(Y=no)

Hence
P(X=male) = P(X=male | Y= yes) P(Y=yes) + P(X=male | Y= no) P(Y=no)

In general, the marginal probability is

P(X = x) = Σ_y P(X = x | Y = y) P(Y = y)
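As a quick illustration, the marginal probability formula can be checked in a few lines of Python (the
conditional probabilities used below are made-up numbers, not data from the text):

# A small sketch of the marginal probability formula with made-up numbers
# (the conditional probabilities below are assumptions, not data from the text).
p_y = {"yes": 0.6, "no": 0.4}               # P(Y = y)
p_male_given_y = {"yes": 0.45, "no": 0.70}  # P(X = male | Y = y)

# P(X = male) = sum over y of P(X = male | Y = y) * P(Y = y)
p_male = sum(p_male_given_y[y] * p_y[y] for y in p_y)
print(p_male)  # 0.45*0.6 + 0.70*0.4 = 0.55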


Dependent and Independent Events
Two events are independent when the occurrence of one has no effect on the probability that the other
will occur. Earlier we established the definition of a conditional probability, or the probability of A
given B, P(A | B). If A is completely independent of B, then this conditional probability is the same as
the unconditional probability of A. Thus the definition of independent events states that two events - A
and B - are independent of each other, if, and only if,
P(A | B) = P(A).
By the same logic, B would be independent of A if, and only if, P(B | A), which is the probability of B
given that A has occurred, is equal to P(B). Two events are not independent when the conditional
probability of A given B is higher or lower than the unconditional probability of A. In this case, A is
dependent on B. Likewise, if P(B | A) is greater or less than P(B), we know that B depends on A.
Calculating the Joint Probability of Two or More Independent Events
Recall that for calculating joint probabilities, we use the multiplication rule, stated in probability
notation as P(AB) = P(A | B) * P(B). For independent events, we've now established that P(A | B) =
P(A), so by substituting P(A) into the equation for P(A | B), we see that for independent events, the
multiplication rule is simply the product of the individual probabilities.
P(A∩B) = P(A) * P(B)

Moreover, the rule generalizes for more than two events provided they are all independent of one another, so
the joint probability of three independent events is P(A∩B∩C) = P(A) * P(B) * P(C).
Bayes’ Theorem Example #2

You might be interested in finding out a patient’s probability of having liver disease if they are an
alcoholic. “Being an alcoholic” is the test (kind of like a litmus test) for liver disease.

 A could mean the event “Patient has liver disease.” Past data tells you that 10% of patients entering
your clinic have liver disease. P(A) = 0.10.
 B could mean the litmus test that “Patient is an alcoholic.” Five percent of the clinic’s patients are
alcoholics. P(B) = 0.05.
 You might also know that among those patients diagnosed with liver disease, 7% are alcoholics. This is
your B|A: the probability that a patient is alcoholic, given that they have liver disease, is 7%.

Bayes’ theorem tells you:


P(A|B) = (0.07 * 0.1)/0.05 = 0.14
In other words, if the patient is an alcoholic, their chances of having liver disease is 0.14 (14%). This is
a large increase from the 10% suggested by past data. But it’s still unlikely that any particular patient
has liver disease.

Bayes’ Theorem Problems Example #3

Another way to look at the theorem is to say that one event follows another. Above I said “tests” and
“events”, but it’s also legitimate to think of it as the “first event” that leads to the “second event.”
There’s no one right way to do this: use the terminology that makes most sense to you.

In a particular pain clinic, 10% of patients are prescribed narcotic pain killers. Overall, five percent of
the clinic’s patients are addicted to narcotics (including pain killers and illegal substances). Out of all
the people prescribed pain pills, 8% are addicts. If a patient is an addict, what is the probability that
they will be prescribed pain pills?

Step 1: Figure out what your event “A” is from the question. That information is in the italicized
part of this particular question. The event that happens first (A) is being prescribed pain pills. That’s
given as 10%.

Step 2: Figure out what your event “B” is from the question. That information is also in the
italicized part of this particular question. Event B is being an addict. That’s given as 5%.

Step 3: Figure out the probability of event B (Step 2) given event A (Step 1). In other words,
find P(B|A). We want to know: “Given that people are prescribed pain pills, what’s the
probability they are an addict?” That is given in the question as 8%, or 0.08.

Step 4: Insert your answers from Steps 1, 2 and 3 into the formula and solve.
P(A|B) = P(B|A) * P(A) / P(B) = (0.08 * 0.1)/0.05 = 0.16

The probability of an addict being prescribed pain pills is 0.16 (16%).
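The same arithmetic is easy to verify in code. Here is a minimal sketch of Bayes’ theorem applied to the two
worked examples above:

# A minimal sketch of Bayes' theorem applied to the two worked examples above.
def bayes(p_b_given_a, p_a, p_b):
    # P(A|B) = P(B|A) * P(A) / P(B)
    return p_b_given_a * p_a / p_b

# Example #2: liver disease given that the patient is an alcoholic
print(bayes(p_b_given_a=0.07, p_a=0.10, p_b=0.05))  # 0.14

# Example #3: prescribed pain pills given that the patient is an addict
print(bayes(p_b_given_a=0.08, p_a=0.10, p_b=0.05))  # 0.16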


Prior: This is the probability before observing the data/features. For example, the prior probability
of getting a head on top is p(h) = 0.5.

Likelihood: This is the probability after observing the data. Suppose we are given a dataset
D = {h,h,h,h,t,t,t,t,t,t}; the probability of getting a head on top is 4/10. The likelihood of the data
given a class can be represented as p(X|c).

Posterior: This is the estimate of the probability after considering the prior and observing the data. It
is proportional to the product of the likelihood and the prior.

Need for a Prior
Suppose D = {t,t,t,t}.
In this case, if we only consider the likelihood, the probability of getting a head is 0, which is unrealistic.
Hence we use the prior information to compute the posterior.

Now to find the class/hypothesis we have 3 options:

1. Maximum A Priori Estimation
Most likely hypothesis based on the prior alone:
h_prior = argmax_h p(h)

2. Maximum Likelihood Estimate (MLE)
Most likely hypothesis based on the likelihood/evidence:
h_MLE = argmax_h p(D|h)

3. Maximum A Posteriori (MAP) Estimate
Most likely hypothesis based on the posterior:
h_MAP = argmax_h p(D|h) p(h)

As the amount of data increases, the MAP estimate converges towards the MLE.
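A small sketch can make the MLE/MAP distinction concrete. The Beta(5, 5) prior used below is an assumption
chosen purely for illustration (it encodes a prior belief that the coin is roughly fair):

# A small sketch contrasting MLE and MAP estimates for the coin example.
# The Beta(5, 5) prior is an assumption made for illustration only.
def mle_heads(data):
    # likelihood alone: fraction of observed heads
    return data.count("h") / len(data)

def map_heads(data, a=5, b=5):
    # mode of the Beta posterior: (heads + a - 1) / (n + a + b - 2)
    heads, n = data.count("h"), len(data)
    return (heads + a - 1) / (n + a + b - 2)

D = list("tttt")
print(mle_heads(D))   # 0.0  -> the likelihood alone says heads are impossible
print(map_heads(D))   # 0.33 -> the prior keeps the estimate away from zero

# with more data, the MAP estimate converges towards the MLE
D_big = list("h" * 400 + "t" * 600)
print(mle_heads(D_big), map_heads(D_big))  # 0.4 vs roughly 0.4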

Performance Evaluation of a model


Classification Accuracy and its Limitations
Classification accuracy is the ratio of correct predictions to total predictions made. It is often presented as a
percentage by multiplying the result by 100.

classification accuracy = correct predictions / total predictions * 100

Classification accuracy can also easily be turned into a misclassification rate or error rate by inverting the value,
such as

error rate = (1 - (correct predictions / total predictions)) * 100

Classification accuracy is a great place to start, but often encounters problems in practice.

The main problem with classification accuracy is that it hides the detail you need to better understand
the performance of your classification model. There are two examples where you are most likely to
encounter this problem:

1. When your data has more than 2 classes. With 3 or more classes you may get a classification
accuracy of 80%, but you don’t know whether that is because all classes are being predicted equally
well or whether one or two classes are being neglected by the model.
2. When your data does not have an even distribution of classes. You may achieve an accuracy of 90% or
more, but this is not a good score if 90 records out of every 100 belong to one class, because you can
achieve this score by always predicting the most common class value.
Classification accuracy can hide the detail you need to diagnose the performance of your model. But
thankfully we can tease apart this detail by using a confusion matrix.

Confusion Matrix

A confusion matrix is a technique for summarizing the performance of a classification algorithm.

Classification accuracy alone can be misleading if you have an unequal number of observations in each
class or if you have more than two classes in your dataset. Calculating a confusion matrix can give you
a better idea of what your classification model is getting right and what types of errors it is making.

The number of correct and incorrect predictions are summarized with count values and broken down by
each class. This is the key to the confusion matrix.

The confusion matrix shows the ways in which your classification model
is confused when it makes predictions.

It gives you insight not only into the errors being made by your classifier but more importantly the
types of errors that are being made.

It is this breakdown that overcomes the limitation of using classification accuracy alone.

How to Calculate a Confusion Matrix

Below is the process for calculating a confusion Matrix.

1. You need a test dataset or a validation dataset with expected outcome values.
2. Make a prediction for each row in your test dataset.
3. From the expected outcomes and predictions count:
1. The number of correct predictions for each class.
2. The number of incorrect predictions for each class, organized by the class that was predicted.

These numbers are then organized into a table, or a matrix, as follows:

 Predicted down the side: each row of the matrix corresponds to a predicted class.
 Actual across the top: each column of the matrix corresponds to an actual (expected) class.

The counts of correct and incorrect classifications are then filled into the table.

The total number of correct predictions for a class goes into the cell whose row and column both
correspond to that class value.

In the same way, each count of incorrect predictions goes into the row of the class that was predicted
and the column of the class that was actually expected.

“In practice, a binary classifier such as this one can make two types of errors: it can incorrectly assign
an individual who defaults to the no default category, or it can incorrectly assign an individual who
does not default to the default category. It is often of interest to determine which of these two types of
errors are being made. A confusion matrix […] is a convenient way to display this information.” —
Page 145, An Introduction to Statistical Learning: with Applications in R, 2014
This matrix can be used for 2-class problems where it is very easy to understand, but can easily be
applied to problems with 3 or more class values, by adding more rows and columns to the confusion
matrix.

Let’s make this explanation of creating a confusion matrix concrete with an example.

2-Class Confusion Matrix Case Study

Let’s pretend we have a two-class classification problem of predicting whether a photograph contains a
man or a woman.

We have a test dataset of 10 records with expected outcomes and a set of predictions from our
classification algorithm.

Expected, Predicted
man, woman
man, man
woman, woman
man, man
woman, man
woman, woman
woman, woman
man, man
man, woman
woman, woman

Let’s start off and calculate the classification accuracy for this set of predictions. The algorithm made 7
of the 10 predictions correct with an accuracy of 70%.

accuracy = total correct predictions / total predictions made * 100


accuracy = 7 / 10 * 100

Let’s turn our results into a confusion matrix. First, we must calculate the number of correct predictions
for each class.

men classified as men: 3


women classified as women: 4

Now, we can calculate the number of incorrect predictions for each class, organized by the predicted
value.

men classified as women: 2


woman classified as men: 1

We can now arrange these values into the 2-class confusion matrix:

                    men (actual)   women (actual)
men (predicted)           3               1
women (predicted)         2               4

Here the column labels represent the actual values, whereas the row labels represent the predicted values.

We can learn a lot from this table.

 The total number of actual men in the dataset is the sum of the values in the men column (3 + 2).
 The total number of actual women in the dataset is the sum of the values in the women column (1 + 4).
 The correct predictions are organized on the diagonal from top-left to bottom-right of the matrix (3 + 4).
 More errors were made by predicting men as women (2) than by predicting women as men (1).
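Here is a quick sketch reproducing this case study with scikit-learn (one convenient option; note that
scikit-learn puts the actual classes on the rows, so its output is the transpose of the table above):

# A quick sketch reproducing the 2-class case study with scikit-learn.
from sklearn.metrics import accuracy_score, confusion_matrix

expected  = ["man", "man", "woman", "man", "woman",
             "woman", "woman", "man", "man", "woman"]
predicted = ["woman", "man", "woman", "man", "man",
             "woman", "woman", "man", "woman", "woman"]

print(accuracy_score(expected, predicted))  # 0.7
print(confusion_matrix(expected, predicted, labels=["man", "woman"]))
# rows are actual classes, columns are predicted classes:
# [[3 2]
#  [1 4]]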

Two-Class Problems Are Special

In a two-class problem, we are often looking to discriminate between observations with a specific
outcome and normal observations: for example, a disease state or event versus no disease state or no event.
In this way, we can assign the event row as “positive” and the no-event row as “negative“. We can then
assign the event column of predictions as “true” and the no-event column as “false“.

This gives us:

 “true positive” for correctly predicted event (yes) values.
 “false positive” for incorrectly predicted event (yes) values.
 “true negative” for correctly predicted no-event (no) values.
 “false negative” for incorrectly predicted no-event (no) values.

We can summarize this in the confusion matrix as follows:

                 yes (Predicted)    no (Predicted)
yes (Actual)     true positive      false negative
no (Actual)      false positive     true negative

Example:

                      Spam (Predicted)    Non-Spam (Predicted)
Spam (Actual)         true positive       false negative
Non-Spam (Actual)     false positive      true negative

                      Cancerous (Predicted)    Normal (Predicted)
Cancerous (Actual)    true positive             false negative
Normal (Actual)       false positive            true negative
This can help in calculating more advanced classification metrics such as precision, recall, specificity
and sensitivity of our classifier.

For example, classification accuracy is calculated as (true positives + true negatives) divided by the total
number of predictions.

Consider the case where there are two classes. […] The top row of the table corresponds to samples
predicted to be events. Some are predicted correctly (the true positives, or TP) while others are
inaccurately classified (false positives or FP). Similarly, the second row contains the predicted
negatives with true negatives (TN) and false negatives (FN).

Now that we have worked through a simple 2-class confusion matrix case study, let’s see how we might
calculate a confusion matrix in modern machine learning tools.

TP + FN = P
TN + FP = N

Precision : When yes is predicted, how often it is correct

Precision = TP/(TP+FP)

It is also called the Positive Predictive Value (PPV). Precision can be thought of as a measure of a
classifier’s exactness. A low precision can also indicate a large number of False Positives.

If precision is 0.8, it means that if the model predicted 10 positives, 8 of those predictions are correct.

Recall (R)/ Sensitivity/ True Positive Rate: Capability of the model to correctly identify positives
from total given positives.

Among the actual yeses, what fraction was predicted as yes.

R= TP/(TP+FN)

Recall can be thought of as a measure of a classifier’s completeness. A low recall indicates many False
Negatives.

Suppose there are actually 100 spam emails, but during classification only 70 of them have been marked
as spam.

So true positives TP = 70.

False negatives FN = 30.

R = 70/(70+30) = 0.7
F1 Score : It is also called the F Score or the F Measure. Put another way, the F1 score conveys the
balance between the precision and the recall.

F1= 2*P*R/(P+R)

The higher the F-Measure is, the better. F1 Score is needed when you want to seek a balance between
Precision and Recall. F1 is usually more useful than accuracy, especially if you have an uneven class
distribution. Accuracy works best if false positives and false negatives have similar cost. If the cost of
false positives and false negatives are very different, it’s better to look at both Precision and Recall.

Specificity (True Negative Rate): Capability of the model to correctly identify negatives from the total
negatives.

Among the actual nos, what fraction was predicted as no?

Specificity = TN/(TN+FP)

1 – Specificity (False Positive Rate): The fraction of the actual negatives that the model incorrectly
identifies as positive.

Among the actual nos, what fraction was predicted as yes? This is the false positive rate, equal to
1 – specificity:

1 – Specificity = FP/(FP+TN)

Suppose there are actually 100 non-spam emails, but during classification only 70 of them have been
marked as non-spam.

So true negatives TN = 70.

False positives FP = 30.

FPR = 30/(30+70) = 0.3
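A minimal sketch computing all of these metrics directly from the counts of a 2-class confusion matrix (the
numbers reuse the spam examples above):

# A minimal sketch computing the metrics above from the four confusion-matrix counts.
def metrics(tp, fp, tn, fn):
    precision   = tp / (tp + fp)
    recall      = tp / (tp + fn)   # sensitivity / true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    fpr         = fp / (fp + tn)   # 1 - specificity
    f1          = 2 * precision * recall / (precision + recall)
    accuracy    = (tp + tn) / (tp + fp + tn + fn)
    return precision, recall, specificity, fpr, f1, accuracy

# 100 actual spam emails (70 caught) and 100 actual non-spam emails (70 kept), as above
print(metrics(tp=70, fp=30, tn=70, fn=30))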

Area Under Curve (ROC): The Receiver Operating Characteristic curve plots the true positive rate against
the false positive rate; it is also known as a sensitivity versus 1 – specificity graph.

The image below depicts a typical binary situation - where any data you receive, will fall into one of
two distributions. The two distributions are the two bell curves below. To continue the spam
classification example, let's consider the curve on the left to be emails that are "not spam", and the
curve on the right to be emails that are "spam".
ROC curves – what are they and how are they
used?
by Suzanne Ekelund

ROC curves are frequently used to show in a graphical way the connection/trade-off between clinical
sensitivity and specificity for every possible cut-off for a test or a combination of tests. In addition the
area under the ROC curve gives an idea about the benefit of using the test(s) in question.

ROC curves are used in clinical biochemistry to choose the most appropriate cut-off for a test. The best
cut-off has the highest true positive rate together with the lowest false positive rate.

As the area under an ROC curve is a measure of the usefulness of a test in general, where a greater area
means a more useful test, the areas under ROC curves are used to compare the usefulness of tests.

The term ROC stands for Receiver Operating Characteristic.

ROC curves were first employed in the study of discriminator systems for the detection of radio signals
in the presence of noise in the 1940s, following the attack on Pearl Harbor.

The initial research was motivated by the desire to determine how the US RADAR "receiver operators"
had missed the Japanese aircraft.
Now ROC curves are frequently used to show the connection between clinical sensitivity and
specificity for every possible cut-off for a test or a combination of tests. In addition, the area under the
ROC curve gives an idea about the benefit of using the test(s) in question.

HOW TO MAKE A ROC CURVE

To make an ROC curve you have to be familiar with the concepts of true positive, true negative, false
positive and false negative. These concepts are used when you compare the results of a test with the
clinical truth, which is established by the use of diagnostic procedures not involving the test in question.

TABLE I : Comparing a method with the clinical truth

Before you make a table like TABLE I you have to decide your cut-off for distinguishing healthy from
sick.

The cut-off determines the clinical sensitivity (fraction of true positives to all with disease) and
specificity (fraction of true negatives to all without disease).

When you change the cut-off, you will get other values for true positives and negatives and false
positives and negatives, but the number of all with disease is the same and so is the number of all
without disease.

Thus you will get an increase in sensitivity or specificity at the expense of lowering the other parameter
when you change the cut-off [1].

FIG. I : Cut-off = 400 µg/L


FIG. II : Cut-off = 500 µg/L

FIG. I and FIG. II demonstrate the trade-off between sensitivity and specificity. When 400 µg/L is
chosen as the analyte concentration cut-off, the sensitivity is 100 % and the specificity is 54 %. When
the cut-off is increased to 500 µg/L, the sensitivity decreases to 92 % and the specificity increases to 79
%.

An ROC curve shows the relationship between clinical sensitivity and specificity for every possible cut-
off. The ROC curve is a graph with:

 The x-axis showing 1 – specificity (= false positive fraction = FP/(FP+TN))


 The y-axis showing sensitivity (= true positive fraction = TP/(TP+FN))

Thus every point on the ROC curve represents a chosen cut-off even though you cannot see this cut-off.
What you can see is the true positive fraction and the false positive fraction that you will get when you
choose this cut-off.

To make an ROC curve from your data you start by ranking all the values and linking each value to the
diagnosis – sick or healthy.
TABLE II : Ranked data with diagnosis (Yes/No)

In the example in TABLE II 159 healthy people and 81 sick people are tested. The results and the
diagnosis (sick Y or N) are listed and ranked based on parameter concentration.

For each and every concentration it is calculated what the clinical sensitivity (true positive rate) and the
(1 – specificity) (false positive rate) of the assay will be if a result identical to this value or above is
considered positive.

TABLE III: Ranked data with calculated true positive and false positive rates for a scenario
where the specific value is used as cut-off
Now the curve is constructed by plotting the data pairs for sensitivity and (1 – specificity):

FIG. III: First point on the ROC curve

FIG. IV: Second point on the ROC curve


FIG. V: Third point on the ROC curve

FIG. VI: Points #50 and #100 on the ROC curve


FIG. VII: The finalized ROC curve

AREA UNDER ROC CURVE

The area under the ROC curve (AUROC) of a test can be used as a criterion to measure the test's
discriminative ability, i.e. how good is the test in a given clinical situation.

FIG. VIII: Area under ROC curve

Various computer programs can automatically calculate the area under the ROC curve. Several methods
can be used. An easy way to calculate the AUROC is to use the trapezoid method. To explain it simply,
the sum of all the areas between the x-axis and a line connecting two adjacent data points is calculated:

(X_k – X_{k-1}) * (Y_k + Y_{k-1}) / 2, summed over all adjacent pairs of points


THE PERFECT TEST

A perfect test is able to discriminate between the healthy and sick with 100 % sensitivity and 100 %
specificity.

FIG. IX: No overlap between healthy and sick

It will have an ROC curve that passes through the upper left corner (~100 % sensitivity and 100 %
specificity). The area under the ROC curve of the perfect test is 1.

FIG. X: ROC curve for a test with no overlap between healthy and sick

THE WORTHLESS TEST

When we have a complete overlap between the results from the healthy and the results from the sick
population, we have a worthless test. A worthless test has a discriminating ability equal to flipping a
coin.

FIG. XI: Complete overlap between healthy and sick


The ROC curve of the worthless test falls on the diagonal line. It includes the point with 50 %
sensitivity and 50 % specificity. The area under the ROC curve of the worthless test is 0.5.

FIG. XII: ROC curve for a test with complete overlap between healthy and sick

COMPARING ROC CURVES

As mentioned above, the area under the ROC curve of a test can be used as a criterion to measure the
test's discriminative ability, i.e. how good is the test in a given clinical situation. Generally, tests are
categorized based on the area under the ROC curve.

The closer an ROC curve is to the upper left corner, the more efficient is the test.

In FIG. XIII test A is superior to test B because at all cut-offs the true positive rate is higher and the
false positive rate is lower than for test B. The area under the curve for test A is larger than the area
under the curve for test B.

FIG. XIII : ROC curves for tests A and B


TABLE IV : Categorization of ROC curves

As a rule of thumb the categorizations in TABLE IV can be used to describe an ROC curve.

Another Example:

T4 value     Hypothyroid    Normal
5 or less         18            1
5.1 – 7            7           17
7.1 – 9            4           36
9 or more          3           39
Totals:           32           93

If we take 5 as the cut-off (a T4 value of 5 or less counts as positive), the sensitivity = true positive rate =
TP/(TP+FN), where TP+FN is the total number of actually positive cases, i.e. the number of persons actually
having hypothyroidism:

= 18/32 = 0.56

And Specificity = TN/(TN+FP) = (17+36+39)/93 = 92/93 = 0.99

1 – specificity (= false positive fraction = FP/(FP+TN)) = 1/93 = 0.01

If we take 7 as the cut-off, the sensitivity = TP/(TP+FN)

= (18+7)/32 = 0.78

And Specificity = TN/(TN+FP) = (36+39)/93 = 75/93 = 0.81

1 – specificity (= FP/(FP+TN)) = (1+17)/93 = 18/93 = 0.19

If we take 9 as the cut-off, the sensitivity = TP/(TP+FN)

= (18+7+4)/32 = 0.91

And Specificity = TN/(TN+FP) = 39/93 = 0.42

1 – specificity (= FP/(FP+TN)) = (1+17+36)/93 = 54/93 = 0.58

If we take 10 as the cut-off (every value counts as positive), the sensitivity = TP/(TP+FN)

= (18+7+4+3)/32 = 1

And Specificity = TN/(TN+FP) = 0/93 = 0

1 – specificity (= FP/(FP+TN)) = 93/93 = 1

Cutpoint    Sensitivity    Specificity    1 – Specificity
5              0.56           0.99             0.01
7              0.78           0.81             0.19
9              0.91           0.42             0.58
10             1.00           0.00             1.00
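Here is a short sketch that reproduces the cut-off table above and estimates the area under the curve with the
trapezoid method (only plain arithmetic on the counts from the example):

# A short sketch reproducing the cut-off table above and the trapezoid AUROC.
hypo   = [18, 7, 4, 3]    # hypothyroid counts per T4 bin (positives)
normal = [1, 17, 36, 39]  # normal counts per T4 bin (negatives)
P, N = sum(hypo), sum(normal)

points = [(0.0, 0.0)]  # (false positive rate, true positive rate)
tp = fp = 0
for h, n in zip(hypo, normal):
    tp += h
    fp += n
    points.append((fp / N, tp / P))

for fpr, tpr in points[1:]:
    print(f"sensitivity = {tpr:.2f}, 1 - specificity = {fpr:.2f}")

# trapezoid rule: sum of (x_k - x_k-1) * (y_k + y_k-1) / 2 over adjacent points
auc = sum((x2 - x1) * (y2 + y1) / 2
          for (x1, y1), (x2, y2) in zip(points, points[1:]))
print("AUROC ~", round(auc, 2))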
Distance Based Methods

Any form of learning is based on generalizing from training data to unseen data by exploiting the
similarities between the two. Similarity based on a distance measure is one such form.

Distance

D is a distance measure if it is a function from pairs of points to real numbers such that:

1. d(x,y)>=0
2. d(x,y) = 0 iff x=y
3. d(x,y) = d(y,x)
4. d(x,y) <= d(x,z)+d(z,y) {triangle inequality: one side of triangle is always less than
sum of two other sides of triangle.}

Distances can be classified as :

 Euclidean Distance
 Non Euclidean Distance

Euclidean Distance: A Euclidean space has some number of real-valued dimensions and dense points, and a
Euclidean distance is based on the locations of points in such a space.

The most common Euclidean distance is the L2 norm: the square root of the sum of the squares of the
differences between x and y in each dimension.

The L2 distance from x to y in an N-dimensional space is

d(x, y) = sqrt( (x1 - y1)^2 + (x2 - y2)^2 + ... + (xN - yN)^2 )

Non-Euclidean Distance: On the other hand, distance measures for non-Euclidean spaces are based on
properties of the points rather than their locations in a space. Some examples of non-Euclidean distances are:

Edit Distance: The edit distance of two strings is the number of inserts and deletes of characters needed to
turn one into the other.

d(x,y) = |x| + |y| - 2|LCS(x,y)|

LCS = longest common subsequence – the longest string that can be obtained by deleting characters from
both x and y.

x = abcde, y = bcduve

x can be turned into y by deleting a and then inserting u and v, so the edit distance is 3. The LCS here is
bcde, and the formula agrees:

d(x,y) = |x| + |y| - 2|LCS(x,y)| = 5 + 6 - 2*4 = 3

Why edit distance is distance measure:

Here, d(x,y) =0 iff x=y

d(x,y) >= 0

d(x,y) = d(y,x) [because insert/delete are inverse of each other]

Hamming Distance: This non-Euclidean distance is the number of positions in which two bit vectors differ.

P = 10101 and Q = 10011

D(P,Q) = 2, as the bit vectors differ in the 3rd and 4th positions.
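A small sketch of both non-Euclidean distances discussed above; the edit distance is computed from the
longest common subsequence using the standard dynamic-programming recurrence:

# A small sketch of the edit distance (insert/delete only) and the Hamming distance.
def lcs_length(x, y):
    # classic dynamic-programming table for the longest common subsequence
    table = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i, a in enumerate(x, 1):
        for j, b in enumerate(y, 1):
            table[i][j] = table[i - 1][j - 1] + 1 if a == b else max(table[i - 1][j], table[i][j - 1])
    return table[-1][-1]

def edit_distance(x, y):
    return len(x) + len(y) - 2 * lcs_length(x, y)

def hamming_distance(p, q):
    return sum(a != b for a, b in zip(p, q))

print(edit_distance("abcde", "bcduve"))    # 3  (LCS = "bcde")
print(hamming_distance("10101", "10011"))  # 2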

Given N labeled training examples {(x_n, y_n)}, n = 1, ..., N, from two classes, positive and negative:

Number of positive examples = N+

Number of negative examples = N-

Goal: learn a model to predict the label/class for a new test example x.

A simple distance-based approach is to compute the distance of x from the mean (µ) of each class and assign
the class with the closer mean.

Mean of the positive class: µ+ = (1/N+) * sum of x_n over the positive examples

Mean of the negative class: µ- = (1/N-) * sum of x_n over the negative examples
The decision boundary of this classifier is a hyperplane.

A hyperplane is a subspace whose dimension is one less than that of its ambient space. If a space is 3D, its
hyperplanes are 2D.

Points to note about the ‘distance from mean’ classifier:

 Simple to understand and implement.
 Usually requires plenty of training data for each class (to estimate the mean of each class reliably).
 Can only learn a linear decision boundary, unless the Euclidean distance is replaced by a non-linear
distance function.
 The mean is used as an exemplar (a reference point representing a class); after learning the exemplars
there is no need to keep the training instances.
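Here is a minimal sketch of the ‘distance from mean’ classifier on synthetic 2-D data (the data generation
below is an assumption made purely for illustration):

# A minimal sketch of the 'distance from mean' classifier on synthetic 2-D data.
import numpy as np

rng = np.random.RandomState(1)
X_pos = rng.randn(50, 2) + np.array([2.0, 2.0])    # positive class around (2, 2)
X_neg = rng.randn(50, 2) + np.array([-2.0, -2.0])  # negative class around (-2, -2)

mu_pos = X_pos.mean(axis=0)  # exemplar of the positive class
mu_neg = X_neg.mean(axis=0)  # exemplar of the negative class

def predict(x):
    # assign the class whose mean is closer in Euclidean distance
    return +1 if np.linalg.norm(x - mu_pos) < np.linalg.norm(x - mu_neg) else -1

print(predict(np.array([1.5, 2.5])))    # +1
print(predict(np.array([-3.0, -1.0])))  # -1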
Nearest neighbor Classifier
Support Vector Machine
Pre 1980:
 Almost all learning methods learned linear decision surfaces.
 Linear learning methods have nice theoretical properties

1980’s

 Decision trees and NNs allowed efficient learning of non-linear decision surfaces
 Little theoretical basis and all suffer from local minima

1990’s

 Efficient learning algorithms for non-linear functions based on computational learning theory
developed
 Nice theoretical properties

Line, Plane, Hyperplane:


Line (2D): In 2D, a line separates the plane into two parts. It can be represented as

y= mx+c

Where m is slope and c is the intercept.

It can also be represented as general equation

ax+by+c=0

y= -c/b-(a/b)x

slope= -a/b

intercept=-c/b

Plane: In 3D, a plane is required to divide the space into two parts, and it is represented as

w0 + w1x1 + w2x2 + w3x3 = 0

Hyperplane: In geometry, a hyperplane is a subspace whose dimension is one less than that of its
ambient space. If a space is 3-dimensional then its hyperplanes are the 2-dimensional planes, while if
the space is 2-dimensional, its hyperplanes are the 1-dimensional lines. It is represented by

w0 + w1x1 + w2x2 + … + wnxn = 0

All points that lie on the hyperplane satisfy the above equation.
Understanding Support Vector Machine
algorithm from examples
Sunil Ray,

https://www.analyticsvidhya.com/blog/2017/09/understaing-support-vector-machine-example-code/

Introduction

Mastering machine learning algorithms isn’t a myth at all. Most of the beginners start by learning
regression. It is simple to learn and use, but does that solve our purpose? Of course not! Because, you
can do so much more than just Regression!

By now, I hope you’ve now mastered Random Forest, Naive Bayes Algorithm and Ensemble
Modeling. If not, I’d suggest you to take out few minutes and read about them as well. In this article, I
shall guide you through the basics to advanced knowledge of a crucial machine learning algorithm,
support vector machines.

Table of Contents

1. What is Support Vector Machine?


2. How does it work?
3. How to implement SVM in Python and R?
4. How to tune Parameters of SVM?
5. Pros and Cons associated with SVM

What is Support Vector Machine?

“Support Vector Machine” (SVM) is a supervised machine learning algorithm which can be used for
both classification and regression challenges. However, it is mostly used in classification problems. In
this algorithm, we plot each data item as a point in n-dimensional space (where n is the number of features
you have), with the value of each feature being the value of a particular coordinate. Then, we perform
classification by finding the hyper-plane that differentiates the two classes very well (look at the below
snapshot).
Support Vectors are simply the coordinates of individual observations. The Support Vector Machine is the
frontier which best segregates the two classes (hyper-plane/line).

You can look at definition of support vectors and a few examples of its working here.

How does it work?

Above, we got accustomed to the process of segregating the two classes with a hyper-plane. Now the
burning question is “How can we identify the right hyper-plane?”. Don’t worry, it’s not as hard as you
think!

Let’s understand:

 Identify the right hyper-plane (Scenario-1): Here, we have three hyper-planes (A, B and C). Now,
identify the right hyper-plane to classify star and circle.

You need to remember a thumb rule to identify the right hyper-plane: “Select the hyper-plane which
segregates the two classes better.” In this scenario, hyper-plane “B” has excellently performed this job.
 Identify the right hyper-plane (Scenario-2): Here, we have three hyper-planes (A, B and C) and all are
segregating the classes well. Now, how can we identify the right hyper-plane?

Here, maximizing the distance between the nearest data point (of either class) and the hyper-plane will
help us decide the right hyper-plane. This distance is called the Margin. Let’s look at the below snapshot:

Above, you can see that the margin for hyper-plane C is high as compared to both A and B.
Hence, we name the right hyper-plane as C. Another strong reason for selecting the hyper-plane with the
higher margin is robustness. If we select a hyper-plane having a low margin, then there is a high chance
of mis-classification.

 Identify the right hyper-plane (Scenario-3): Hint: use the rules discussed in the previous scenarios to
identify the right hyper-plane.

Some of you may have selected hyper-plane B as it has a higher margin compared to A. But, here is the
catch: SVM selects the hyper-plane which classifies the classes accurately prior to maximizing the margin.
Here, hyper-plane B has a classification error and A has classified all points correctly. Therefore, the right
hyper-plane is A.

 Can we classify two classes (Scenario-4)?: Below, we are unable to segregate the two classes using a
straight line, as one of the stars lies in the territory of the other (circle) class as an outlier.

As I have already mentioned, one star at the other end is like an outlier for the star class. SVM has a
feature to ignore outliers and find the hyper-plane that has the maximum margin. Hence, we can say that
SVM is robust to outliers.

 Find the hyper-plane to segregate the classes (Scenario-5): In the scenario below, we can’t have a linear
hyper-plane between the two classes, so how does SVM classify these two classes? Till now, we have
only looked at linear hyper-planes.

SVM solves this problem by introducing an additional feature. Here, we will add a new feature
z = x^2 + y^2. Now, let’s plot the data points on the x and z axes:

In the above plot, the points to consider are:

o All values of z will always be positive because z is the squared sum of both x and y.
o In the original plot, red circles appear close to the origin of the x and y axes, leading to a lower value
of z, while the stars are relatively far from the origin, resulting in a higher value of z.

In SVM, it is easy to have a linear hyper-plane between these two classes. But another burning
question which arises is: do we need to add this feature manually to obtain a hyper-plane?
No, SVM has a technique called the kernel trick. Kernels are functions which take a low-dimensional
input space and transform it into a higher-dimensional space, i.e. they convert a non-separable problem
into a separable problem. They are mostly useful in non-linear separation problems. Simply put, the
kernel does some extremely complex data transformations, then finds out the process to separate the
data based on the labels or outputs you’ve defined.
When we look at the hyper-plane in the original input space it looks like a circle:

Now, let’s look at how the SVM algorithm can be applied in a data science challenge.
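The original article’s implementation section is not reproduced here, so the following is only a sketch of how
scikit-learn’s SVC might be applied to a toy non-linear dataset (the dataset and hyper-parameter values are
assumptions, not the article’s code); the RBF kernel relies on the kernel trick described above:

# A sketch of training an SVM classifier with scikit-learn on a toy non-linear dataset.
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = datasets.make_moons(n_samples=200, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# an RBF kernel handles the non-linear boundary via the kernel trick
clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X_train, y_train)

print("test accuracy:", clf.score(X_test, y_test))
print("support vectors per class:", clf.n_support_)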

Pros and Cons associated with SVM

 Pros:
o It works really well with clear margin of separation
o It is effective in high dimensional spaces.
o It is effective in cases where number of dimensions is greater than the number of samples.
o It uses a subset of training points in the decision function (called support vectors), so it is also
memory efficient.
 Cons:
o It doesn’t perform well when we have a large data set because the required training time is
higher.
o It also doesn’t perform very well when the data set has more noise, i.e. the target classes are
overlapping.
o SVM doesn’t directly provide probability estimates; these are calculated using an expensive
five-fold cross-validation (this is how the SVC method of the Python scikit-learn library handles it).
Vectors
In Support Vector Machine, there is the word vector. It is important to know some basics about
vectors in order to understand SVMs and how to use them.

A vector is a mathematical object that can be represented by an arrow (Figure 1).

Figure 1: Representation of a vector

When we do calculations, we denote a vector with the coordinates of its endpoint (the point
where the tip of the arrow is). In Figure 1, the point A has the coordinates (4,3). We can write the vector
as OA = (4,3).

If we want to, we can give another name to the vector, for instance, u.

From this point, one might be tempted to think that a vector is defined by its coordinates.
However, if I give you a sheet of paper with only a horizontal line and ask you to trace the
same vector as the one in Figure 1, you can still do it.

You need only two pieces of information:

• What is the length of the vector?
• What is the angle between the vector and the horizontal line?

This leads us to the following definition of a vector:

A vector is an object that has both a magnitude and a direction.

Let us take a closer look at each of these components.

The magnitude of a vector

The magnitude, or length, of a vector u is written ||u|| and is called its norm.

Figure 2: The magnitude of this vector is the length of the segment OA

In Figure 2, we can calculate the norm of the vector OA by using the Pythagorean theorem:

||OA|| = sqrt(4^2 + 3^2) = 5

In general, we compute the norm of an n-dimensional vector x by using the Euclidean norm formula:

||x|| = sqrt(x1^2 + x2^2 + ... + xn^2)

The direction of a vector

The direction is the second component of a vector. By definition, it is a new vector whose
coordinates are the initial coordinates of our vector divided by its norm.

The direction of a vector u = (u1, u2) is the vector:

w = (u1/||u||, u2/||u||)

Where does it come from? Geometry. Figure 3 shows us a vector u and its angles with respect
to the horizontal and vertical axes. There is an angle θ (theta) between u and the horizontal
axis, and there is an angle α (alpha) between u and the vertical axis.

Figure 3: A vector u and its angles with respect to the axes

Using elementary geometry, we see that cos(θ) = u1/||u|| and cos(α) = u2/||u||, which means that w
can also be defined by:

w = (cos(θ), cos(α))

The coordinates of w are defined by cosines. As a result, if the angle between u and an axis
changes, which means the direction of u changes, w will also change. That is why we call this
vector the direction of vector u. For u = (4,3) we can compute the value of w and we find
that its coordinates are (0.8, 0.6). It is interesting to note that if two vectors have the same
direction, they will have the same direction vector.

It makes sense, as the sole objective of this vector is to describe the direction of other
vectors: by having a norm of 1, it stays as simple as possible. As a result, a direction
vector such as w is often referred to as a unit vector.

Dimensions of a vector
Note that the order in which the numbers are written is important. As a result, we say that an
n-dimensional vector is a tuple of n real-valued numbers.

For instance, u = (u1, u2) is a two-dimensional vector; we often write u ∈ ℝ² (u belongs to ℝ²).

Similarly, the vector v = (v1, v2, v3) is a three-dimensional vector, and v ∈ ℝ³.

The dot product


The dot product is an operation performed on two vectors that returns a number. A number is
sometimes called a scalar; that is why the dot product is also called a scalar product.

People often have trouble with the dot product because it seems to come out of nowhere. What
is important is that it is an operation performed on two vectors and that its result gives us some
insights into how the two vectors relate to each other. There are two ways to think about the dot
product: geometrically and algebraically.

Geometric definition of the dot product

Geometrically, the dot product is the product of the Euclidean magnitudes of the two
vectors and the cosine of the angle between them.

Figure 4: Two vectors x and y

This means that if we have two vectors, x and y, with an angle θ between them (Figure 4),
their dot product is:

x · y = ||x|| ||y|| cos(θ)

By looking at this formula, we can see that the dot product is strongly influenced by the angle θ:

• When θ = 0, we have cos(θ) = 1 and x · y = ||x|| ||y||
• When θ = 90°, we have cos(θ) = 0 and x · y = 0
• When θ = 180°, we have cos(θ) = -1 and x · y = -||x|| ||y||

Keep this in mind; it will be useful later when we study the Perceptron learning algorithm.

Algebraic definition of the dot product

Figure 5: Using these three angles will allow us to simplify the dot product

In Figure 5, we can see the relationship between the three angles θ (theta), β (beta), and α (alpha):

θ = β - α

This means computing cos(θ) is the same as computing cos(β - α).

Using the difference identity for cosine we get:

cos(θ) = cos(β - α) = cos(β) cos(α) + sin(β) sin(α)

If we multiply both sides by ||x|| ||y|| we get:

||x|| ||y|| cos(θ) = ||x|| ||y|| (cos(β) cos(α) + sin(β) sin(α))

We already know that:

cos(β) = x1/||x||, sin(β) = x2/||x||, cos(α) = y1/||y||, sin(α) = y2/||y||

This means the dot product can also be written:

x · y = ||x|| ||y|| (x1/||x|| * y1/||y|| + x2/||x|| * y2/||y||)

Or:

x · y = x1 y1 + x2 y2

In a more general way, for n-dimensional vectors, we can write:

x · y = x1 y1 + x2 y2 + ... + xn yn

This formula is the algebraic definition of the dot product.

This definition is advantageous because we do not have to know the angle θ to compute the dot
product. We can write a function to compute its value and get the same result as with the geometric
definition.
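A short NumPy sketch (the example vectors are assumptions) checking the definitions above: the norm, the
direction (unit) vector, and the agreement between the geometric and algebraic forms of the dot product:

# A short NumPy sketch checking the norm, the direction vector, and both dot products.
import numpy as np

u = np.array([4.0, 3.0])
print(np.linalg.norm(u))      # 5.0
print(u / np.linalg.norm(u))  # direction vector [0.8 0.6]

x = np.array([3.0, 5.0])
y = np.array([8.0, 2.0])

algebraic = np.dot(x, y)         # x1*y1 + x2*y2
beta = np.arctan2(x[1], x[0])    # angle of x with the horizontal axis
alpha = np.arctan2(y[1], y[0])   # angle of y with the horizontal axis
geometric = np.linalg.norm(x) * np.linalg.norm(y) * np.cos(beta - alpha)

print(algebraic, geometric)      # both 34.0 (up to floating-point error)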
We spent quite some time understanding what the dot product is and how it is computed. This is
because the dot product is a fundamental notion that you should be comfortable with in order to
figure out what is going on in SVMs. We will now see another crucial aspect, linear separability.

Understanding linear separability


In this section, we will use a simple example to introduce linear separability.

Linearly separable data


Imagine you are a wine producer. You sell wine coming from two different production batches:

• One high-end wine costing $145 a bottle.


• One common wine costing $8 a bottle.
Recently, you started to receive complaints from clients who bought an expensive bottle.
They claim that their bottle contains the cheap wine. This results in a major reputation loss for
your company, and customers stop ordering your wine.

Using alcohol-by-volume to classify wine


You decide to find a way to distinguish the two wines. You know that one of them contains more
alcohol than the other, so you open a few bottles, measure the alcohol concentration, and plot it.
Figure 6: An example of linearly separable data

In Figure 6, you can clearly see that the expensive wine contains less alcohol than the cheap
one. In fact, you can find a point that separates the data into two groups. This data is said to be
linearly separable. For now, you decide to measure the alcohol concentration of your wine
automatically before filling an expensive bottle. If it is greater than 13 percent, the production
chain stops and one of your employees must make an inspection. This improvement
dramatically reduces complaints, and your business is flourishing again.

This example is too easy—in reality, data seldom works like that. In fact, some scientists really
measured alcohol concentration of wine, and the plot they obtained is shown in Figure 7. This
is an example of non-linearly separable data. Even if most of the time data will not be linearly
separable, it is fundamental that you understand linear separability well. In most cases, we will
start from the linearly separable case (because it is the simpler one) and then derive the non-
separable case.

Similarly, in most problems, we will not work with only one dimension, as in Figure 6. Real-life
problems are more challenging than toy examples, and some of them can have thousands of
dimensions, which makes working with them more abstract. However, its abstractness does
not make it more complex. Most examples in this book will be two-dimensional examples.
They are simple enough to be easily visualized, and we can do some basic geometry on them,
which will allow you to understand the fundamentals of SVMs.
Figure 7: Plotting alcohol by volume from a real dataset

In our example of Figure 6, there is only one dimension: that is, each data point is represented
by a single number. When there are more dimensions, we will use vectors to represent each
data point. Every time we add a dimension, the object we use to separate the data changes.
Indeed, while we can separate the data with a single point in Figure 6, as soon as we go into
two dimensions we need a line (a set of points), and in three dimensions we need a plane
(which is also a set of points).

To summarize, data is linearly separable when:

• In one dimension, you can find a point separating the data (Figure 6).
• In two dimensions, you can find a line separating the data (Figure 8).
• In three dimensions, you can find a plane separating the data (Figure 9).

Figure 8: Data separated by a line
Figure 9: Data separated by a plane

Similarly, when data is non-linearly separable, we cannot find a separating point, line, or plane.
Figure 10 and Figure 11 show examples of non-linearly separable data in two and three
dimensions.
Figure 10: Non-linearly separable data in 2D
Figure 11: Non-linearly separable data in 3D

Hyperplanes
What do we use to separate the data when there are more than three dimensions? We
use what is called a hyperplane.

What is a hyperplane?
In geometry, a hyperplane is a subspace of one dimension less than its ambient space.

This definition, albeit true, is not very intuitive. Instead of using it, we will try to understand
what a hyperplane is by first studying what a line is.

If you recall mathematics from school, you probably learned that a line has an equation of the form y = ax + b, that the constant a is known as the slope, and that b intercepts the y-axis. There are several pairs (x, y) for which this formula is true, and we say that the set of the solutions is a line.

What is often confusing is that if you study the function f(x) = ax + b in a calculus course, you will be studying a function with one variable.

However, it is important to note that the linear equation y = ax + b has two variables, respectively x and y, and we can name them as we want.

For instance, we can rename x as x1 and y as x2, and the equation becomes: x2 = ax1 + b.

This is equivalent to ax1 − x2 + b = 0.

If we define the two-dimensional vectors x = (x1, x2) and w = (a, −1), we obtain another notation for the equation of a line (where w · x is the dot product of w and x):

w · x + b = 0

What is nice with this last equation is that it uses vectors. Even if we derived it by using two-
dimensional vectors, it works for vectors of any dimensions. It is, in fact, the equation of a
hyperplane.

From this equation, we can have another insight into what a hyperplane is: it is the set of points satisfying w · x + b = 0. And, if we keep just the essence of this definition: a hyperplane is a set of points.

If we have been able to deduce the hyperplane equation from the equation of a line, it is because a line is a hyperplane. You can convince yourself by reading the definition of a hyperplane again. You will notice that, indeed, a line is a one-dimensional space inside an ambient space (the plane) that has two dimensions. Similarly, points and planes are hyperplanes, too.

Understanding the hyperplane equation


We derived the equation of a hyperplane from the equation of a line. Doing the opposite
is interesting, as it shows us more clearly the relationship between the two.

Given vectors w = (w0, w1), x = (x, y), and b, we can define a hyperplane having the equation:

w · x + b = 0

This is equivalent to:

w0·x + w1·y + b = 0

We isolate y to get:

y = −(w0 / w1)·x − b / w1

If we define a = −w0 / w1 and c = −b / w1:

y = ax + c

We see that the bias c of the line equation is only equal to the bias b of the hyperplane equation when w1 = −1. So you should not be surprised if b is not the intersection with the vertical axis when you see a plot for a hyperplane (this will be the case in our next example). Moreover, if w0 and w1 have the same sign, the slope a will be negative.

Classifying data with a hyperplane

Figure 12: A linearly separable dataset

Given the linearly separable data of Figure 12, we can use a hyperplane to perform
binary classification.
For instance, with a particular choice of the vector w and of the bias b, we get the hyperplane shown in Figure 13.

Figure 13: A hyperplane separates the data

We associate each vector xi with a label yi, which can have the value +1 or −1 (respectively the triangles and the stars in Figure 13).

We define a hypothesis function h:

h(xi) = +1 if w · xi + b ≥ 0, and −1 if w · xi + b < 0

which is equivalent to:

h(xi) = sign(w · xi + b)

It uses the position of xi with respect to the hyperplane to predict a value for the label yi.
Every data point on one side of the hyperplane will be assigned a label, and every data point
on the other side will be assigned the other label.

For instance, take a data point that lies above the hyperplane. When we do the calculation, w · x + b is positive, so h returns +1.

Similarly, for a data point below the hyperplane, h will return −1 because w · x + b is negative.

Because it uses the equation of the hyperplane, which produces a linear combination of the values, the function h is called a linear classifier.

With one more trick, we can make the formula of h even simpler by removing the b constant.

First, we add a component w0 = b to the vector w. We get the vector ŵ = (b, w1, w2) (it reads "w hat" because we put a hat on w). Similarly, we add a component x0 = 1 to the vector x, which becomes x̂ = (1, x1, x2).


Note: In the rest of the book, we will call a vector to which we add an
artificial coordinate an augmented vector.

When we use augmented vectors, the hypothesis function becomes:

h(x̂i) = sign(ŵ · x̂i)
If we have a hyperplane that separates the data set like the one in Figure 13, by using
the hypothesis function , we are able to predict the label of every point perfectly. The
main question is: how do we find such a hyperplane?
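As an illustration, here is a minimal NumPy sketch of this hypothesis function with augmented vectors; the weight values below are arbitrary, chosen only to show the mechanics.

import numpy as np

def hypothesis(w_hat, x):
    # Linear classifier: the sign of the dot product between the augmented vectors.
    x_hat = np.concatenate(([1.0], x))     # augmented vector (1, x1, x2)
    return 1 if np.dot(w_hat, x_hat) >= 0 else -1

w_hat = np.array([-9.0, 0.4, 1.0])              # arbitrary augmented weights (b, w1, w2)
print(hypothesis(w_hat, np.array([8.0, 7.0])))  # +1: this point falls above the hyperplane
print(hypothesis(w_hat, np.array([1.0, 3.0])))  # -1: this point falls below the hyperplane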

How can we find a hyperplane (separating the data or not)?


Recall that the equation of the hyperplane is ŵ · x̂ = 0 in augmented form. It is important to understand that the only value that impacts the shape of the hyperplane is ŵ. To convince you, we can come back to the two-dimensional case when a hyperplane is just a line. When we create the augmented three-dimensional vectors, we obtain ŵ = (b, a, −1) and x̂ = (1, x, y). You can see that the vector ŵ contains both a and b, which are the two main components defining the look of the line. Changing the value of ŵ gives us different hyperplanes (lines), as shown in Figure 14.

Figure 14: Different values of w will give you different hyperplanes

Summary
After introducing vectors and linear separability, we learned what a hyperplane is and how
we can use it to classify data. We then saw that the goal of a learning algorithm trying to
learn a linear classifier is to find a hyperplane separating the data. Eventually, we discovered
that finding a hyperplane is equivalent to finding a vector w.

We will now examine which approaches learning algorithms use to find a hyperplane that
separates the data. Before looking at how SVMs do this, we will first look at one of the
simplest learning models: the Perceptron.
The Perceptron
The Perceptron is an algorithm invented in 1957 by Frank Rosenblatt, a few years before the
first SVM. It is widely known because it is the building block of a simple neural network: the
multilayer perceptron. The goal of the Perceptron is to find a hyperplane that can separate a
linearly separable data set. Once the hyperplane is found, it is used to perform binary
classification.

Given augmented vectors x̂ and ŵ, the Perceptron uses the same hypothesis function h we saw in the previous chapter to classify a data point x̂i:

h(x̂i) = sign(ŵ · x̂i)

The Perceptron learning algorithm


Given a training set D of m n-dimensional training examples (xi, yi), the Perceptron Learning Algorithm (PLA) tries to find a hypothesis function h that predicts the label yi of every example xi correctly.

The hypothesis function of the Perceptron is h(x) = sign(w · x), and we saw that w · x = 0 is just the equation of a hyperplane. We can then say that the set H of hypothesis functions is the set of (n − 1)-dimensional hyperplanes (n − 1 because a hyperplane has one dimension less than its ambient space).

What is important to understand here is that the only unknown value is w. It means that the goal of the algorithm is to find a value for w. You find w; you have a hyperplane. There is an infinite number of hyperplanes (you can give any value to w), so there is an infinity of hypothesis functions.

This can be written more formally this way:

Given a training set D = {(xi, yi)} and a set H of hypothesis functions.

Find h ∈ H such that h(xi) = yi for every xi.

This is equivalent to:

Given a training set D = {(xi, yi)} and a set H of hypothesis functions.

Find w such that sign(w · xi) = yi for every xi.


The PLA is a very simple algorithm, and can be summarized this way:

1. Start with a random hyperplane (defined by a vector w) and use it to classify the data.
2. Pick a misclassified example and select another hyperplane by updating the value of
w, hoping it will work better at classifying this example (this is called the update rule).

3. Classify the data with this new hyperplane.


4. Repeat steps 2 and 3 until there is no misclassified example.
Once the process is over, you have a hyperplane that separates the data.
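A minimal NumPy sketch of the PLA as summarized above (the toy data set is made up and assumed to be linearly separable; labels are in {-1, +1} and the examples are already augmented with a leading 1):

import numpy as np

def perceptron_learning_algorithm(X, y, seed=0):
    rng = np.random.default_rng(seed)
    w = rng.normal(size=X.shape[1])                # 1. start from a random hyperplane
    while True:
        misclassified = np.where(np.sign(X @ w) != y)[0]
        if len(misclassified) == 0:                # 4. stop when every example is classified
            return w
        i = rng.choice(misclassified)              # 2. pick a misclassified example at random
        w = w + y[i] * X[i]                        # update rule (explained in the next section)

X = np.array([[1.0, 1.0, 1.0], [1.0, 2.0, 2.0], [1.0, 6.0, 7.0], [1.0, 7.0, 9.0]])
y = np.array([-1, -1, 1, 1])
w = perceptron_learning_algorithm(X, y)
print(np.sign(X @ w))   # matches y once the loop has finished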

Understanding the update rule


Why do we use this particular update rule? Recall that we picked a misclassified example at
random. Now we would like to make the Perceptron correctly classify this example. To do so,
we decide to update the vector w. The idea here is simple. Since the sign of the dot product
between w and x is incorrect, by changing the angle between them, we can make it correct:

• If the predicted label is 1, the angle between w and x is smaller than 90 degrees, and we want
to increase it.
• If the predicted label is -1, the angle between w and x is bigger than 90 degrees, and we want
to decrease it.

Figure 15: Two vectors

Let’s see what happens with two vectors, w and x, having an angle θ between them (Figure 15).

On the one hand, adding them creates a new vector w + x, and the angle between x and w + x is smaller than θ (Figure 16).

Figure 16: The addition creates a smaller angle


On the other hand, subtracting them creates a new vector w − x, and the angle between x and w − x is bigger than θ (Figure 17).

Figure 17: The subtraction creates a bigger angle

We can use these two observations to adjust the angle:

• If the predicted label is 1, the angle is smaller than 90 degrees. We want to increase the angle, so we set w = w − x.

• If the predicted label is -1, the angle is bigger than 90 degrees. We want to decrease the angle, so we set w = w + x.

As we are doing this only on misclassified examples, when the predicted label has a value, the expected label is the opposite. This means we can rewrite the previous statement:

• If the expected label is -1: We want to increase the angle, so we set w = w − x.

• If the expected label is +1: We want to decrease the angle, so we set w = w + x.

Note that the update rule does not necessarily change the sign of the hypothesis for the
example the first time. Sometimes it is necessary to apply the update rule several times before
it happens. This is not a problem, as we are looping across misclassified examples, so we will
continue to use the update rule until the example is correctly classified. What matters here is
that each time we use the update rule, we change the value of the angle in the right direction
(increasing it or decreasing it).
Also note that sometimes updating the value of w for a particular example changes the hyperplane
in such a way that another example previously correctly classified becomes misclassified. So, the
hypothesis might become worse at classifying after being updated. This is illustrated in Figure 18,
which shows us the number of classified examples at each iteration step. One way to avoid this
problem is to keep a record of the value of w before making the update and use the updated w only if
it reduces the number of misclassified examples. This modification of the PLA is known as the Pocket
algorithm (because we keep w in our pocket).
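One common way to sketch the Pocket idea in code is to keep the best w seen so far and return it when the iteration budget runs out (illustrative only, not the book's code):

import numpy as np

def pocket_algorithm(X, y, max_iterations=1000, seed=0):
    rng = np.random.default_rng(seed)
    w = rng.normal(size=X.shape[1])
    best_w = w.copy()
    best_errors = np.count_nonzero(np.sign(X @ w) != y)
    for _ in range(max_iterations):
        misclassified = np.where(np.sign(X @ w) != y)[0]
        if len(misclassified) == 0:
            return w
        i = rng.choice(misclassified)
        w = w + y[i] * X[i]                       # usual PLA update
        errors = np.count_nonzero(np.sign(X @ w) != y)
        if errors < best_errors:                  # keep the best hyperplane "in the pocket"
            best_w, best_errors = w.copy(), errors
    return best_w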

Figure 18: The PLA update rule oscillates

Convergence of the algorithm


We said that we keep updating the vector w with the update rule until there is no misclassified
point. But how can we be sure that this will ever happen? Luckily for us, mathematicians have
studied this problem, and we can be very sure because the Perceptron convergence theorem
guarantees that if the two sets P and N (of positive and negative examples respectively) are
linearly separable, the vector w is updated only a finite number of times. This was first
proved by Novikoff in 1963 (Rojas, 1996).

Understanding the limitations of the PLA


One thing to understand about the PLA algorithm is that because weights are randomly
initialized and misclassified examples are randomly chosen, it is possible the algorithm will
return a different hyperplane each time we run it. Figure 19 shows the result of running the PLA
on the same dataset four times. As you can see, the PLA finds four different hyperplanes.
Figure 19: The PLA finds a different hyperplane each time

At first, this might not seem like a problem. After all, the four hyperplanes perfectly
classify the data, so they might be equally good, right? However, when using a
machine learning algorithm such as the PLA, our goal is not to find a way to classify
perfectly the data we have right now. Our goal is to find a way to correctly classify new
data we will receive in the future.

Let us introduce some terminology to be clear about this. To train a model, we pick a
sample of existing data and call it the training set. We train the model, and it comes up
with a hypothesis (a hyperplane in our case). We can measure how well the
hypothesis performs on the training set: we call this the in-sample error (also called
training error). Once we are satisfied with the hypothesis, we decide to use it on unseen
data (the test set) to see if it indeed learned something. We measure how well the
hypothesis performs on the test set, and we call this the out-of-sample error (also
called the generalization error).

Our goal is to have the smallest out-of-sample error.

In the case of the PLA, all hypotheses in Figure 19 perfectly classify the data: their in-
sample error is zero. But we are really concerned about their out-of-sample error. We
can use a test set such as the one in Figure 20 to check their out-of-sample errors.

Figure 20: A test dataset


As you can see in Figure 21, the two hypotheses on the right, despite perfectly classifying the
training dataset, are making errors with the test dataset.

Now we better understand why it is problematic. When using the Perceptron with a linearly
separable dataset, we have the guarantee of finding a hypothesis with zero in-sample error, but
we have no guarantee about how well it will generalize to unseen data (if an algorithm
generalizes well, its out-of-sample error will be close to its in-sample error). How can we choose
a hyperplane that generalizes well? As we will see in the next chapter, this is one of the goals of
SVMs.

Figure 21: Not all hypotheses have perfect out-of-sample error

Summary

In this chapter, we have learned what a Perceptron is. We then saw in detail how the
Perceptron Learning Algorithm works and what the motivation behind the update rule is. After
learning that the PLA is guaranteed to converge, we saw that not all hypotheses are equal,
and that some of them will generalize better than others. Eventually, we saw that the
Perceptron is unable to select which hypothesis will have the smallest out-of-sample error and
instead just picks one hypothesis having the lowest in-sample error at random.
Introduction to K-means Clustering

K-means clustering is a type of unsupervised learning, which is used when you have unlabeled
data (i.e., data without defined categories or groups). The goal of this algorithm is to find groups
in the data, with the number of groups represented by the variable K. The algorithm works
iteratively to assign each data point to one of K groups based on the features that are provided.
Data points are clustered based on feature similarity. The results of the K-means clustering
algorithm are:

1. The centroids of the K clusters, which can be used to label new data
2. Labels for the training data (each data point is assigned to a single cluster)

Rather than defining groups before looking at the data, clustering allows you to find and analyze
the groups that have formed organically. The "Choosing K" section below describes how the
number of groups can be determined.

Each centroid of a cluster is a collection of feature values which define the resulting groups.
Examining the centroid feature weights can be used to qualitatively interpret what kind of group
each cluster represents.

This introduction to the K-means clustering algorithm covers:

• Common business cases where K-means is used
• The steps involved in running the algorithm

Business Uses

The K-means clustering algorithm is used to find groups which have not been explicitly labeled
in the data. This can be used to confirm business assumptions about what types of groups exist or
to identify unknown groups in complex data sets. Once the algorithm has been run and the
groups are defined, any new data can be easily assigned to the correct group.

This is a versatile algorithm that can be used for any type of grouping. Some examples of use cases are:

• Behavioral segmentation:
  o Segment by purchase history
  o Segment by activities on application, website, or platform
  o Define personas based on interests
  o Create profiles based on activity monitoring
• Inventory categorization:
  o Group inventory by sales activity
  o Group inventory by manufacturing metrics
• Sorting sensor measurements:
  o Detect activity types in motion sensors
  o Group images
  o Separate audio
  o Identify groups in health monitoring
• Detecting bots or anomalies:
  o Separate valid activity groups from bots
  o Group valid activity to clean up outlier detection

In addition, monitoring if a tracked data point switches between groups over time can be used to
detect meaningful changes in the data.

Algorithm

The Κ-means clustering algorithm uses iterative refinement to produce a final result. The
algorithm inputs are the number of clusters Κ and the data set. The data set is a collection of
features for each data point. The algorithm starts with initial estimates for the Κ centroids,
which can either be randomly generated or randomly selected from the data set. The algorithm
then iterates between two steps:

1. Data assignment step:

Each centroid defines one of the clusters. In this step, each data point is assigned to its nearest centroid, based on the squared Euclidean distance. More formally, if ci is a centroid in the set C of centroids, then each data point x is assigned to the cluster whose centroid minimizes

argmin over ci in C of dist(ci, x)²

where dist( · ) is the standard (L2) Euclidean distance. Let the set of data point assignments for each ith cluster centroid be Si.

2. Centroid update step:

In this step, the centroids are recomputed. This is done by taking the mean of all data points assigned to that centroid's cluster:

ci = (1 / |Si|) · Σ of xi over all xi in Si

The algorithm iterates between steps one and two until a stopping criterion is met (i.e., no data points change clusters, the sum of the distances is minimized, or some maximum number of iterations is reached).

This algorithm is guaranteed to converge to a result. The result may be a local optimum (i.e. not
necessarily the best possible outcome), meaning that assessing more than one run of the
algorithm with randomized starting centroids may give a better outcome.
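The two steps above translate almost directly into code. Here is a minimal NumPy sketch (illustrative; it uses random points from the data set as the initial centroids):

import numpy as np

def kmeans(X, k, max_iterations=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]    # initial centroid estimates
    for _ in range(max_iterations):
        # 1. Data assignment step: nearest centroid by squared Euclidean distance.
        distances = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = distances.argmin(axis=1)
        # 2. Centroid update step: mean of the points assigned to each cluster.
        new_centroids = np.array([X[labels == i].mean(axis=0) if np.any(labels == i)
                                  else centroids[i] for i in range(k)])
        if np.allclose(new_centroids, centroids):                # stopping criterion
            break
        centroids = new_centroids
    return centroids, labels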

Choosing K

The algorithm described above finds the clusters and data set labels for a particular pre-chosen K.
To find the number of clusters in the data, the user needs to run the K-means clustering algorithm
for a range of K values and compare the results. In general, there is no method for determining the exact value of K, but an accurate estimate can be obtained using the following techniques.

One of the metrics that is commonly used to compare results across different values of K is the
mean distance between data points and their cluster centroid. Since increasing the number of
clusters will always reduce the distance to data points, increasing K will always decrease this
metric, to the extreme of reaching zero when K is the same as the number of data points. Thus,
this metric cannot be used as the sole target. Instead, mean distance to the centroid as a function

of K is plotted and the "elbow point," where the rate of decrease sharply shifts, can be used to
roughly determine K.
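For example, with scikit-learn (if it is available) the total within-cluster squared distance is exposed as inertia_, which can be tabulated or plotted against K to look for the elbow:

import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(300, 2))     # placeholder data
for k in range(1, 11):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, model.inertia_)   # the decrease slows sharply around the "elbow"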

A number of other techniques exist for validating K, including cross-validation, information


criteria, the information theoretic jump method, the silhouette method, and the G-means
algorithm. In addition, monitoring the distribution of data points across groups provides insight
into how the algorithm is splitting the data for each K.

k-Means: Step-By-Step Example


As a simple illustration of a k-means algorithm, consider the following data set consisting of the
scores of two variables on each of seven individuals:

Subject A B
1 1.0 1.0
2 1.5 2.0
3 3.0 4.0
4 5.0 7.0
5 3.5 5.0
6 4.5 5.0
7 3.5 4.5

This data set is to be grouped into two clusters. As a first step in finding a sensible initial
partition, let the A & B values of the two individuals furthest apart (using the Euclidean distance
measure), define the initial cluster means, giving:

          Individual   Mean Vector (centroid)
Group 1   1            (1.0, 1.0)
Group 2   4            (5.0, 7.0)

The remaining individuals are now examined in sequence and allocated to the cluster to which
they are closest, in terms of Euclidean distance to the cluster mean. The mean vector is
recalculated each time a new member is added. This leads to the following series of steps:

        Cluster 1                            Cluster 2
Step    Individuals   Mean (centroid)       Individuals   Mean (centroid)
1       1             (1.0, 1.0)            4             (5.0, 7.0)
2       1, 2          (1.2, 1.5)            4             (5.0, 7.0)
3       1, 2, 3       (1.8, 2.3)            4             (5.0, 7.0)
4       1, 2, 3       (1.8, 2.3)            4, 5          (4.2, 6.0)
5       1, 2, 3       (1.8, 2.3)            4, 5, 6       (4.3, 5.7)
6       1, 2, 3       (1.8, 2.3)            4, 5, 6, 7    (4.1, 5.4)

Now the initial partition has changed, and the two clusters at this stage have the following characteristics:

            Individuals    Mean Vector (centroid)
Cluster 1   1, 2, 3        (1.8, 2.3)
Cluster 2   4, 5, 6, 7     (4.1, 5.4)

But we cannot yet be sure that each individual has been assigned to the right cluster. So, we
compare each individual’s distance to its own cluster mean and to
that of the opposite cluster. And we find:

Individual   Distance to mean (centroid) of Cluster 1   Distance to mean (centroid) of Cluster 2
1            1.5                                        5.4
2            0.4                                        4.3
3            2.1                                        1.8
4            5.7                                        1.8
5            3.2                                        0.7
6            3.8                                        0.6
7            2.8                                        1.1

Only individual 3 is nearer to the mean of the opposite cluster (Cluster 2) than to its own (Cluster 1). In other words, each individual's distance to its own cluster mean should be smaller than the distance to the other cluster's mean (which is not the case for individual 3). Thus, individual 3 is relocated to Cluster 2, resulting in the new partition:

            Individuals      Mean Vector (centroid)
Cluster 1   1, 2             (1.3, 1.5)
Cluster 2   3, 4, 5, 6, 7    (3.9, 5.1)

The iterative relocation would now continue from this new partition until no more relocations
occur. However, in this example each individual is now nearer its own cluster mean than that of
the other cluster and the iteration stops, choosing the latest partitioning as the final cluster
solution.

Also, it is possible that the k-means algorithm will take a long time to settle on a final solution. In this case it is a good idea to stop the algorithm after a pre-chosen maximum number of iterations.
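A quick NumPy check of the final partition found in this worked example:

import numpy as np

X = np.array([[1.0, 1.0], [1.5, 2.0], [3.0, 4.0], [5.0, 7.0],
              [3.5, 5.0], [4.5, 5.0], [3.5, 4.5]])
cluster_1 = X[[0, 1]]            # subjects 1 and 2
cluster_2 = X[[2, 3, 4, 5, 6]]   # subjects 3, 4, 5, 6 and 7
print(cluster_1.mean(axis=0))    # [1.25 1.5 ], i.e. about (1.3, 1.5)
print(cluster_2.mean(axis=0))    # [3.9  5.1 ]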

Introduction to Decision Tree Algorithm

Decision Trees are a type of Supervised Machine Learning (that is you explain what the input is
and what the corresponding output is in the training data) where the data is continuously split
according to a certain parameter. The tree can be explained by two entities, namely decision
nodes and leaves. The leaves are the decisions or the final outcomes. And the decision nodes are
where the data is split.

An example of a decision tree can be explained using the binary tree above. Let’s say you want to predict whether a person is fit given information like their age, eating habits, and physical activity. The decision nodes here are questions like ‘What’s the age?’, ‘Does he exercise?’, ‘Does he eat a lot of pizzas?’, and the leaves are outcomes like ‘fit’ or ‘unfit’. In this case this was a binary classification problem (a yes/no type problem).

There are two main types of Decision Trees:

1. Classification trees (Yes/No types)

What we’ve seen above is an example of classification tree, where the outcome was a variable
like ‘fit’ or ‘unfit’. Here the decision variable is Categorical.

2. Regression trees (Continuous data types)

Here the decision or the outcome variable is Continuous, e.g. a number like 123.

Working

Now that we know what a Decision Tree is, we’ll see how it works internally. There are many
algorithms out there which construct Decision Trees, but one of the best known is the ID3 algorithm. ID3 stands for Iterative Dichotomiser 3.

Before discussing the ID3 algorithm, we’ll go through few definitions.

Entropy

Entropy, also called as Shannon Entropy is denoted by H(S) for a finite set S, is the measure of
the amount of uncertainty or randomness in data.

Or

Intuitively, it tells us about the predictability of a certain event. Example, consider a coin toss
whose probability of heads is 0.5 and probability of tails is 0.5. Here the entropy is the highest
possible, since there’s no way of determining what the outcome might be. Alternatively, consider
a coin which has heads on both the sides, the entropy of such an event can be predicted perfectly
since we know beforehand that it’ll always be heads. In other words, this event has no
randomness hence it’s entropy is zero.

In particular, lower values imply less uncertainty while higher values imply high uncertainty.

Let’s understand this with the help of another example

Consider a piece of data collected over the course of 14 days where the features are Outlook,
Temperature, Humidity, Wind and the outcome variable is whether Golf was played on the day.
Now, our job is to build a predictive model which takes in above 4 parameters and predicts
whether Golf will be played on the day. We’ll build a decision tree to do that using ID3
algorithm.

Day Outlook Temperature Humidity Wind Play Golf


D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No

The ID3 algorithm will perform the following tasks recursively:

1. Create root node for the tree


2. If all examples are positive, return leaf node ‘positive’
3. Else if all examples are negative, return leaf node ‘negative’
4. Calculate the entropy of current state H(S)
5. For each attribute, calculate the entropy with respect to the attribute ‘x’ denoted by H(S,
x)
6. Select the attribute which has maximum value of IG(S, x)
7. Remove the attribute that offers highest IG from the set of attributes
8. Repeat until we run out of all attributes, or the decision tree has all leaf nodes.
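Before working through the numbers, here is a small Python sketch of the two quantities ID3 relies on, entropy and information gain, for rows stored as dictionaries (the column names match the golf table above; this is illustrative, not a full ID3 implementation):

import math
from collections import Counter

def entropy(labels):
    # H(S) = -sum over classes of p(c) * log2(p(c))
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(rows, attribute, target="Play Golf"):
    # IG(S, x) = H(S) - sum over values v of attribute x of P(v) * H(S_v)
    total_entropy = entropy([row[target] for row in rows])
    remainder = 0.0
    for value in set(row[attribute] for row in rows):
        subset = [row[target] for row in rows if row[attribute] == value]
        remainder += (len(subset) / len(rows)) * entropy(subset)
    return total_entropy - remainder

# rows would be a list of dictionaries such as:
# {"Outlook": "Sunny", "Temperature": "Hot", "Humidity": "High", "Wind": "Weak", "Play Golf": "No"}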

Now we’ll go ahead and grow the decision tree. The initial step is to calculate H(S), the Entropy
of the current state. In the above example, we can see in total there are 5 No’s and 9 Yes’s.

Yes No Total
9 5 14
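Plugging these counts into the entropy formula gives the value used below:

H(S) = -(9/14) * log2(9/14) - (5/14) * log2(5/14) ≈ 0.940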

Remember that the Entropy is 0 if all members belong to the same class, and 1 when half of them belong to one class and the other half belong to the other class, which is perfect randomness. Here it’s 0.94, which means the distribution is fairly random.

Now the next step is to choose the attribute that gives us highest possible Information Gain
which we’ll choose as the root node.

Let’s start with ‘Wind’:

IG(S, Wind) = H(S) − Σ over x of P(x) · H(S_x)

where ‘x’ ranges over the possible values of the attribute and S_x is the subset of examples with that value. Here, attribute ‘Wind’ takes two possible values in the sample data, hence x = {Weak, Strong}.

We’ll have to calculate H(S_weak) and H(S_strong).

Amongst all the 14 examples we have 8 places where the wind is Weak and 6 where the wind is Strong.

Wind = Weak   Wind = Strong   Total
8             6               14

Now out of the 8 Weak examples, 6 of them were ‘Yes’ for Play Golf and 2 of them were ‘No’ for ‘Play Golf’. So we have:

H(S_weak) = -(6/8) * log2(6/8) - (2/8) * log2(2/8) ≈ 0.811
Similarly, out of the 6 Strong examples, we have 3 examples where the outcome was ‘Yes’ for Play Golf and 3 where we had ‘No’ for Play Golf.

Remember, here half of the items belong to one class while the other half belong to the other, so we have perfect randomness:

H(S_strong) = 1.0

Now we have all the pieces required to calculate the Information Gain, which tells us the Information Gain obtained by considering ‘Wind’ as the feature: it gives us an information gain of 0.048. Now we must similarly calculate the Information Gain for all the features.
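Putting the pieces together for Wind, and listing for reference the values that the same calculation gives for the other attributes on this data set:

IG(S, Wind) = H(S) - (8/14) * H(S_weak) - (6/14) * H(S_strong)
            = 0.940 - (8/14) * 0.811 - (6/14) * 1.0
            ≈ 0.048

IG(S, Outlook)     ≈ 0.246
IG(S, Humidity)    ≈ 0.151
IG(S, Temperature) ≈ 0.029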

We can clearly see that IG(S, Outlook) has the highest information gain of 0.246, hence we choose the Outlook attribute as the root node. At this point, the decision tree looks like this:

Here we observe that whenever the outlook is Overcast, Play Golf is always ‘Yes’. This is no coincidence: this simple subtree results precisely because Outlook is the attribute with the highest information gain.

Now how do we proceed from this point? We simply apply recursion; you might want to look back at the algorithm steps described earlier.

Now that we’ve used Outlook, we have three attributes remaining: Humidity, Temperature, and Wind. And we had three possible values of Outlook: Sunny, Overcast, and Rain. The Overcast node already ended up with the leaf node ‘Yes’, so we’re left with two subtrees to compute: Sunny and Rain.

Table where the value of Outlook is Sunny looks like:

Temperature Humidity Wind Play Golf


Hot High Weak No
Hot High Strong No
Mild High Weak No
Cool Normal Weak Yes
Mild Normal Strong Yes

In the similar fashion, we compute the following values
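For the five Sunny rows (3 ‘No’, 2 ‘Yes’, so H ≈ 0.971), the gains work out approximately to:

IG(Sunny, Humidity)    ≈ 0.971
IG(Sunny, Temperature) ≈ 0.571
IG(Sunny, Wind)        ≈ 0.020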

As we can see, the highest Information Gain is given by Humidity, so it becomes the next node under the Sunny branch. Proceeding in the same way with the Rain subtree gives us Wind as the attribute with the highest information gain. The final Decision Tree looks something like this:

Overfitting in case of decision tree classification

Overfitting is a significant practical difficulty for decision tree models and many other predictive
models. Overfitting happens when the learning algorithm continues to develop hypotheses that reduce training set error at the cost of an increased test set error.

There are several approaches to avoiding overfitting in building decision trees.

• Pre-pruning, which stops growing the tree earlier, before it perfectly classifies the training set.
• Post-pruning, which allows the tree to perfectly classify the training set, and then prunes the tree afterwards.

Practically, the second approach of post-pruning overfit trees is more successful because it
is not easy to precisely estimate when to stop growing the tree.

The important step of tree pruning is to define a criterion to be used to determine the correct final tree size, using one of the following methods:
1. Use a distinct dataset from the training set (called validation set), to evaluate the effect
of post-pruning nodes from the tree.
2. Build the tree by using the training set, then apply a statistical test to estimate whether
pruning or expanding a particular node is likely to produce an improvement beyond the
training set.
o Error estimation
o Significance testing (e.g., Chi-square test)
3. Minimum Description Length principle : Use an explicit measure of the complexity for
encoding the training set and the decision tree, stopping growth of the tree when this
encoding size (size(tree) + size(misclassifications(tree)) is minimized.

The first method is the most common approach. In this approach, the available data are
separated into two sets of examples: a training set, which is used to build the decision tree,
and a validation set, which is used to evaluate the impact of pruning the tree. The second
method is also a common approach. Here, we explain the error estimation and Chi2 test.

Dimensionality Reduction

Let’s say that you want to predict what the gross domestic product (GDP) of the United States
will be for 2017. You have lots of information available: the U.S. GDP for the first quarter of
2017, the U.S. GDP for the entirety of 2016, 2015, and so on. You have any publicly-available
economic indicator, like the unemployment rate, inflation rate, and so on. You have U.S. Census
data from 2010 estimating how many Americans work in each industry and American
Community Survey data updating those estimates in between each census. You know how many
members of the House and Senate belong to each political party. You could gather stock price
data, the number of IPOs occurring in a year, and how many CEOs seem to be mounting a bid
for public office. Despite being an overwhelming number of variables to consider, this just
scratches the surface. You have a lot of variables to consider.

If you’ve worked with a lot of variables before, you know this can present problems. Do you
understand the relationships between each variable? Do you have so many variables that you are
in danger of overfitting your model to your data or that you might be violating assumptions of
whichever modeling tactic you’re using?

You might ask the question, “How do I take all of the variables I’ve collected and focus on only
a few of them?” In technical terms, you want to “reduce the dimension of your feature space.”
By reducing the dimension of your feature space, you have fewer relationships between variables
to consider and you are less likely to overfit your model. (Note: This doesn’t immediately mean
that overfitting, etc. are no longer concerns — but we’re moving in the right direction!)

Somewhat unsurprisingly, reducing the dimension of the feature space is called “dimensionality
reduction.” There are many ways to achieve dimensionality reduction, but most of these
techniques fall into one of two classes:

• Feature Elimination
• Feature Extraction

Feature elimination is what it sounds like: we reduce the feature space by eliminating features.
In the GDP example above, instead of considering every single variable, we might drop all
variables except the three we think will best predict what the U.S.’s gross domestic product will
look like. Advantages of feature elimination methods include simplicity and maintaining
interpretability of your variables.

As a disadvantage, though, you gain no information from those variables you’ve dropped. If we
only use last year’s GDP, the proportion of the population in manufacturing jobs per the most
recent American Community Survey numbers, and unemployment rate to predict this year’s
GDP, we’re missing out on whatever the dropped variables could contribute to our model. By
eliminating features, we’ve also entirely eliminated any benefits those dropped variables would
bring.

Feature extraction, however, doesn’t run into this problem. Say we have ten independent
variables. In feature extraction, we create ten “new” independent variables, where each “new”
independent variable is a combination of each of the ten “old” independent variables. However,
we create these new independent variables in a specific way and order these new variables by
how well they predict our dependent variable.

You might say, “Where does the dimensionality reduction come into play?” Well, we keep as
many of the new independent variables as we want, but we drop the “least important ones.”
Because we ordered the new variables by how well they predict our dependent variable, we
know which variable is the most important and least important. But — and here’s the kicker —
 because these new independent variables are combinations of our old ones, we’re still keeping
the most valuable parts of our old variables, even when we drop one or more of these “new”
variables!

What is PCA?

Principal component analysis is a technique for feature extraction — so it combines our input
variables in a specific way, then we can drop the “least important” variables while still retaining
the most valuable parts of all of the variables! As an added benefit, each of the “new” variables
after PCA are all independent of one another. This is a benefit because the assumptions of a
linear model require our independent variables to be independent of one another. If we decide to
fit a linear regression model with these “new” variables (see “principal component regression”
below), this assumption will necessarily be satisfied.

When should I use PCA?

1. Do you want to reduce the number of variables, but aren’t able to identify variables to
completely remove from consideration?
2. Do you want to ensure your variables are independent of one another?
3. Are you comfortable making your independent variables less interpretable?

If you answered “yes” to all three questions, then PCA is a good method to use. If you answered
“no” to question 3, you should not use PCA.

How does PCA work?

The section after this discusses why PCA works, but providing a brief summary before jumping
into the algorithm may be helpful for context:

 We are going to calculate a matrix that summarizes how our variables all relate to one another.
 We’ll then break this matrix down into two separate components: direction and magnitude. We
can then understand the “directions” of our data and its “magnitude” (or how “important” each
direction is). The screenshot below, from the setosa.io applet, displays the two main directions
in this data: the “red direction” and the “green direction.” In this case, the “red direction” is the
more important one. We’ll get into why this is the case later, but given how the dots are
arranged, can you see why the “red direction” looks more important than the “green direction?”
(Hint: What would fitting a line of best fit to this data look like?)

Our original data in the xy-plane. (Source.)

 We will transform our original data to align with these important directions (which are
combinations of our original variables). The screenshot below (again from setosa.io) is the same
exact data as above, but transformed so that the x- and y-axes are now the “red direction” and
“green direction.” What would the line of best fit look like here?

Our original data transformed by PCA. (Source.)

 While the visual example here is two-dimensional (and thus we have two “directions”), think
about a case where our data has more dimensions. By identifying which “directions” are most
“important,” we can compress or project our data into a smaller space by dropping the
“directions” that are the “least important.” By projecting our data into a smaller space, we’re
reducing the dimensionality of our feature space… but because we’ve transformed our data in
these different “directions,” we’ve made sure to keep all original variables in our model!

Note two things in this graphic:

 The two charts show the exact same data, but the right graph reflects the original data
transformed so that our axes are now the principal components.

 In both graphs, the principal components are perpendicular to one another. In fact, every
principal component will ALWAYS be orthogonal (a.k.a. official math term for perpendicular) to
every other principal component. (Don’t believe me? Try to break the applet!)

Why does PCA work?

While PCA is a very technical method relying on in-depth linear algebra algorithms, it’s a
relatively intuitive method when you think about it.

 First, the covariance matrix ZᵀZ is a matrix that contains estimates of how every variable in Z
relates to every other variable in Z. Understanding how one variable is associated with another
is quite powerful.
 Second, eigenvalues and eigenvectors are important. Eigenvectors represent directions. Think of
plotting your data on a multidimensional scatterplot. Then one can think of an individual
eigenvector as a particular “direction” in your scatterplot of data. Eigenvalues represent
magnitude, or importance. Bigger eigenvalues correlate with more important directions.
 Finally, we make an assumption that more variability in a particular direction correlates with
explaining the behavior of the dependent variable. Lots of variability usually indicates signal,
whereas little variability usually indicates noise. Thus, the more variability there is in a particular
direction is, theoretically, indicative of something important we want to detect.

Thus, PCA is a method that brings together:

1. A measure of how each variable is associated with one another. (Covariance matrix.)
2. The directions in which our data are dispersed. (Eigenvectors.)
3. The relative importance of these different directions. (Eigenvalues.)

PCA combines our predictors and allows us to drop the eigenvectors that are relatively
unimportant.
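To connect the three ingredients, here is a minimal NumPy sketch of PCA via the covariance matrix and its eigendecomposition (illustrative only; real code would typically use a library routine):

import numpy as np

def pca(X, n_components):
    Z = X - X.mean(axis=0)                        # center each variable
    covariance = (Z.T @ Z) / (len(Z) - 1)         # how every variable relates to every other
    eigenvalues, eigenvectors = np.linalg.eigh(covariance)   # directions and their importance
    order = np.argsort(eigenvalues)[::-1]         # most important directions first
    components = eigenvectors[:, order[:n_components]]
    return Z @ components                         # the data expressed in the new directions

X = np.random.default_rng(0).normal(size=(100, 5))
print(pca(X, 2).shape)   # (100, 2): five variables compressed into two new ones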

Are there extensions to PCA?

Yes, more than I can address here in a reasonable amount of space. The one I’ve most frequently
seen is principal component regression, where we take our untransformed Y and regress it on the
subset of Z* that we didn’t drop. (This is where the independence of the columns of Z* comes
in; by regressing Y on Z*, we know that the required independence of independent variables will
necessarily be satisfied. However, we will need to still check our other assumptions.)

The other commonly-seen variant I’ve seen is kernel PCA.

Introduction to Ensemble learning

Ensemble modeling is a powerful way to improve the performance of your model. It usually pays
off to apply ensemble learning over and above various models you might be building.

Ensemble learning is a broad topic and is only confined by your own imagination. For the
purpose of this article, I will cover the basic concepts and ideas of ensemble modeling. This
should be enough for you to start building ensembles at your own end. As usual, we have tried to
keep things as simple as possible.

Let’s quickly start with an example to understand the basics of Ensemble learning. This example will bring out how we use ensemble models every day without realizing that we are using ensemble modeling.

Example: I want to invest in a company XYZ. I am not sure about its performance though. So, I
look for advice on whether the stock price will increase more than 6% per annum or not? I
decide to approach various experts having diverse domain experience:

1. Employee of Company XYZ: This person knows the internal functionality of the company and has insider information about how the firm operates. But he lacks a broader perspective on how competitors are innovating, how the technology is evolving, and what impact this evolution will have on Company XYZ’s product. In the past, he has been right 70% of the time.

2. Financial Advisor of Company XYZ: This person has a broader perspective on how the company’s strategy will fare in this competitive environment. However, he lacks a view on how the company’s internal policies are faring. In the past, he has been right 75% of the time.

3. Stock Market Trader: This person has observed the company’s stock price over the past 3 years. He knows the seasonality trends and how the overall market is performing. He has also developed a strong intuition on how stocks might vary over time. In the past, he has been right 70% of the time.

4. Employee of a competitor: This person knows the internal functionality of the competitor firms and is aware of certain changes which are yet to be introduced. However, he lacks insight into the company in focus and into the external factors which relate the competitor’s growth to that of the company in question. In the past, he has been right 60% of the time.

5. Market Research team in the same segment: This team analyzes the customer preference for company XYZ’s product over others and how this is changing with time. Because it deals with the customer side, it is unaware of the changes company XYZ will bring because of alignment to its own goals. In the past, they have been right 75% of the time.

6. Social Media Expert: This person can help us understand how company XYZ has positioned its products in the market, and how the sentiment of customers towards the company is changing over time. He is unaware of any details beyond digital marketing. In the past, he has been right 65% of the time.

Given the broad spectrum of access we have, we can probably combine all the information and
make an informed decision.

In a scenario where all 6 experts/teams verify that it’s a good decision (assuming all the predictions are independent of each other), we will get a combined accuracy rate of

1 - (30% * 25% * 30% * 40% * 25% * 35%)
= 1 - 0.0007875 = 99.92125%

Assumption: The assumption used here that all the predictions are completely independent is
slightly extreme as they are expected to be correlated. However, we see how we can be so sure
by combining various predictions together.

Let us now change the scenario slightly. This time we have 6 experts, all of them employees of company XYZ working in the same division. Each one has a 70% chance of advising correctly.

What if we combine all this advice together: can we still raise our confidence to >99%?

Obviously not, as all the predictions are based on a very similar set of information. They are certain to be influenced by the same information, and the only variation in their advice would be due to their personal opinions and the facts they have collected about the firm.

Combine Model Predictions Into Ensemble Predictions

The three most popular methods for combining the predictions from different models are:

• Bagging. Building multiple models (typically of the same type) from different subsamples of the training dataset.
• Boosting. Building multiple models (typically of the same type), each of which learns to fix the prediction errors of a prior model in the chain.
• Voting. Building multiple models (typically of differing types) and using simple statistics (like calculating the mean) to combine their predictions.

Bagging (Bootstrap Aggregation)

Bagging (Bootstrap Aggregation) is used when our goal is to reduce the variance of a decision tree. Here the idea is to create several subsets of data from the training sample, chosen randomly with replacement. Each collection of subset data is then used to train its own decision tree. As a result, we end up with an ensemble of different models. The average of all the predictions from the different trees is used, which is more robust than a single decision tree.

Bagging means Bootstrap Aggregation. Bootstrapping is a process of selecting samples from an original sample (or population) and using these samples for estimating various statistics or model accuracy. Bagging (Bootstrap aggregating) was proposed by Leo Breiman in 1994 for improving classification accuracy.

Bootstrapping is a process of creating random samples with replacement for estimating sample
statistics.

One way to select samples, or bootstrap samples, is to select n items with replacement from an original sample N. A bootstrap sample may have a few duplicate observations or records, as the sampling is done with replacement.

Bootstrapping Sampling Example

N = {23,45,55,47,34,88,95,27,87,78,26,19,10,3,4,17} – Original sample with 16 elements

Bootstrap sample 1: {10, 78, 87, 55, 26, 88, 10}

Bootstrap sample 2: {55, 78, 45, 78, 55, 23, 23}

Bootstrap sample 3: {88, 27, 10, 27, 34, 34, 23}
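Samples like these can be reproduced in a couple of lines of Python (the exact values depend on the random seed):

import random

N = [23, 45, 55, 47, 34, 88, 95, 27, 87, 78, 26, 19, 10, 3, 4, 17]
random.seed(0)
for _ in range(3):
    print(random.choices(N, k=7))   # sampling WITH replacement, so duplicates can appear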

Each of these samples is referred to as a resample. This allows the model or algorithm to get a better understanding of the various biases, variances, and features that exist in the resamples. Taking a sample of the data allows each resample to contain different characteristics than the data might have contained as a whole. Each sample population has different pieces, and none are identical. This affects the overall mean, standard deviation, and other descriptive metrics of a data set. In turn, it can help develop more robust models.

Why do we create bootstrap samples? Bootstrap samples are created to estimate and validate
models for improved accuracy, reduced variance and bias, and improved stability of a model.

Once bootstrap samples are created, a classifier is trained on each sample, and the overall prediction is selected based on popularity votes. In a classification model, the label with the maximum votes is assigned to an observation. The average value is used in the case of a regression model.

Bagging: Overview

Bagging is an ensembling process where a model is trained on each of the bootstrap samples and the final model is an aggregation of all the sample models. For a numeric target variable (regression problems) the predicted outcome is the average of all the models, and in classification problems the predicted class is decided based on plurality (majority vote).

What Bagging does is help reduce variance from models that might be very accurate, but only on the data they were trained on. This is also known as overfitting.

Overfitting is when a function fits the data too well. Typically this happens because the fitted function is overly complicated, bending to accommodate every individual data point and outlier.

Figure 2 Overfitting : Model represented with green line

Another example of an algorithm that can overfit easily is a decision tree. Models developed using decision trees rely on very simple heuristics: a decision tree is composed of a set of if-else statements applied in a specific order. Thus, if the data set is changed to a new data set that has some bias or a different spread of underlying features compared to the previous set, the model will fail to be as accurate. This is because the new data will not fit the learned model as well (which is a backwards way of saying it, anyway).

Now, we will show an example of Bagging for both Regression (Numerical Outcome) and
Classification scenarios.

In R, “adabag” and “ipred” packages allow us to develop bagging based models for both
classification and regression scenarios.

The tree building process used in Bagging is based on the CART algorithm, and adabag and ipred use the rpart implementation.
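For readers working in Python rather than R, a roughly equivalent sketch with scikit-learn (assuming it is installed) looks like this; by default BaggingClassifier uses decision trees as the base models:

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
bagging = BaggingClassifier(n_estimators=50, random_state=0)   # 50 trees, one per bootstrap sample
bagging.fit(X, y)
print(bagging.score(X, y))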

Random Forest

Random Forest is an extension of bagging. It takes one extra step: in addition to taking a random subset of the data, it also takes a random selection of features rather than using all features to grow the trees. When you have many such random trees, it’s called a Random Forest.

Let’s look at the steps taken to implement Random forest:

1. Suppose there are N observations and M features in training data set. First, a sample from
training data set is taken randomly with replacement.

2. A subset of M features are selected randomly and whichever feature gives the best split is used
to split the node iteratively.
3. The tree is grown to its largest extent.

4. The above steps are repeated, and the prediction is given based on the aggregation of predictions from n trees.
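In Python, for instance, these steps are wrapped up by scikit-learn's RandomForestClassifier, where max_features controls the random subset of features considered at each split (illustrative sketch):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
forest.fit(X, y)
print(forest.score(X, y))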

Advantages of using Random Forest technique:

• Handles higher dimensionality data very well.
• Handles missing values and maintains accuracy for missing data.

Disadvantages of using Random Forest technique:

• Since the final prediction is based on the mean of the predictions from the subset trees, it won’t give precise values for regression problems.

What is Boosting?

The term ‘Boosting’ refers to a family of algorithms which convert weak learners into strong learners.

Let’s understand this definition in detail by solving a problem of spam email identification:

How would you classify an email as SPAM or not? Like everyone else, our initial approach
would be to identify ‘spam’ and ‘not spam’ emails using the following criteria. If:

1. Email has only one image file (promotional image), It’s a SPAM
2. Email has only link(s), It’s a SPAM
3. Email body consist of sentence like “You won a prize money of $ xxxxxx”, It’s a SPAM
4. Email from our official domain “ABC.com” , Not a SPAM
5. Email from known source, Not a SPAM

Above, we’ve defined multiple rules to classify an email into ‘spam’ or ‘not spam’. But, do you
think these rules individually are strong enough to successfully classify an email? No.

Individually, these rules are not powerful enough to classify an email into ‘spam’ or ‘not spam’. Therefore, these rules are called weak learners.

To convert weak learners into a strong learner, we’ll combine the prediction of each weak learner using methods like:
• Using the average / weighted average
• Considering the prediction that has the higher vote

For example: Above, we have defined 5 weak learners. Out of these 5, 3 vote ‘SPAM’ and 2 vote ‘Not a SPAM’. In this case, by default, we’ll consider the email as SPAM because we have the higher vote count (3) for ‘SPAM’.

How Boosting Algorithms works?

Now we know that boosting combines weak learners (a.k.a. base learners) to form a strong rule. An immediate question which should pop into your mind is, ‘How does boosting identify weak rules?’

To find a weak rule, we apply a base learning (ML) algorithm with a different distribution each round. Each time the base learning algorithm is applied, it generates a new weak prediction rule. This is an iterative process. After many iterations, the boosting algorithm combines these weak rules into a single strong prediction rule.

Here’s another question which might haunt you: ‘How do we choose a different distribution for each round?’

For choosing the right distribution, here are the following steps:

Step 1: The base learner takes all the observations and assigns equal weight or attention to each observation.

Step 2: If there are prediction errors caused by the first base learning algorithm, then we pay higher attention to the observations having prediction errors. Then, we apply the next base learning algorithm.

Step 3: Iterate Step 2 until the limit of the base learning algorithm is reached or higher accuracy is achieved.

Finally, it combines the outputs from the weak learners and creates a strong learner which eventually improves the prediction power of the model. Boosting puts a higher focus on examples which are misclassified or have higher errors under the preceding weak rules.

Types of Boosting Algorithms

The underlying engine used for a boosting algorithm can be anything. It can be a decision stump, a margin-maximizing classification algorithm, etc. There are many boosting algorithms which use different types of engines, such as:

1. AdaBoost (Adaptive Boosting)


2. Gradient Tree Boosting
3. XGBoost

Boosting Algorithm: AdaBoost

This diagram aptly explains Ada-boost. Let’s understand it closely:

Box 1: You can see that we have assigned equal weights to each data point and applied a
decision stump to classify them as + (plus) or – (minus). The decision stump (D1) has generated
vertical line at left side to classify the data points. We see that, this vertical line has
incorrectly predicted three + (plus) as – (minus). In such case, we’ll assign higher weights to
these three + (plus) and apply another decision stump.

Box 2: Here, you can see that the size of three incorrectly predicted + (plus) is bigger as
compared to rest of the data points. In this case, the second decision stump (D2) will try to
predict them correctly. Now, a vertical line (D2) at right side of this box has classified three mis-
classified + (plus) correctly. But again, it has caused mis-classification errors. This time with
three -(minus). Again, we will assign higher weight to three – (minus) and apply another
decision stump.

Box 3: Here, three – (minus) are given higher weights. A decision stump (D3) is applied to
predict these mis-classified observation correctly. This time a horizontal line is generated to
classify + (plus) and – (minus) based on higher weight of mis-classified observation.

Box 4: Here, we have combined D1, D2 and D3 to form a strong prediction having complex rule
as compared to individual weak learner. You can see that this algorithm has classified these
observation quite well as compared to any of individual weak learner.

Combining Result

The combined prediction is Sign(W1*X1 + W2*X2 + ... + WN*XN), where Xi is the output of the i-th weak learner and Wi is its weight.

If W1*X1 + W2*X2 + ... + WN*XN > 0, the final predicted class will be +1, otherwise -1.

AdaBoost (Adaptive Boosting): It works in a similar way to the method discussed above. It fits a sequence of weak learners on differently weighted training data. It starts by predicting on the original data set, giving equal weight to each observation. If a prediction is incorrect using the first learner, it then gives higher weight to the observations which have been predicted incorrectly. Being an iterative process, it continues to add learners until a limit is reached in the number of models or in accuracy.

Mostly, we use decision stumps with AdaBoost. But we can use any machine learning algorithm as the base learner if it accepts weights on the training data set. We can use AdaBoost algorithms for both classification and regression problems.
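A short scikit-learn sketch (assuming the library is available); its AdaBoostClassifier uses depth-1 decision trees, i.e. decision stumps, as the default base learner:

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
ada = AdaBoostClassifier(n_estimators=50, random_state=0)   # 50 weighted decision stumps
ada.fit(X, y)
print(ada.score(X, y))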

Boosting Algorithm: Gradient Boosting

In gradient boosting, many models are trained sequentially. Each new model gradually minimizes the loss function (for a model y = ax + b + e, the error term e needs special attention) of the whole system using the Gradient Descent method. The learning procedure consecutively fits new models to provide a more accurate estimate of the response variable.

The principal idea behind this algorithm is to construct each new base learner so that it is
maximally correlated with the negative gradient of the loss function associated with the whole
ensemble. You can refer to the article "Learn Gradient Boosting Algorithm" to understand this
concept with an example.
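
For a concrete (illustrative, not from the article) example, here is a minimal gradient boosting sketch with scikit-learn on a toy regression problem; the data and hyperparameters are arbitrary choices.

# Minimal gradient boosting sketch on a toy regression problem (scikit-learn).
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=300)  # noisy target

# Each successive tree is fit to the negative gradient of the squared-error
# loss, i.e. the residuals of the current ensemble.
model = GradientBoostingRegressor(n_estimators=200, learning_rate=0.05, max_depth=2)
model.fit(X, y)
print("training R^2:", model.score(X, y))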

AdaBoost Tutorial
13 Dec 2013

My education in the fundamentals of machine learning has mainly come from Andrew Ng’s
excellent Coursera course on the topic. One thing that wasn’t covered in that course, though, was
the topic of “boosting” which I’ve come across in a number of different contexts now.
Fortunately, it’s a relatively straightforward topic if you’re already familiar with machine
learning classification.

Whenever I’ve read about something that uses boosting, it’s always been with the “AdaBoost”
algorithm, so that’s what this post covers.

AdaBoost is a popular boosting technique which helps you combine multiple “weak classifiers”
into a single “strong classifier”. A weak classifier is simply a classifier that performs poorly, but
performs better than random guessing. A simple example might be classifying a person as male
or female based on their height. You could say anyone over 5’ 9” is a male and anyone under
that is a female. You’ll misclassify a lot of people that way, but your accuracy will still be
greater than 50%.

AdaBoost can be applied to any classification algorithm, so it’s really a technique that builds on
top of other classifiers as opposed to being a classifier itself.

You could just train a bunch of weak classifiers on your own and combine the results, so what
does AdaBoost do for you? There are really two things it figures out for you:

1. It helps you choose the training set for each new classifier that you train based on the results of
the previous classifier.
2. It determines how much weight should be given to each classifier’s proposed answer when
combining the results.

Training Set Selection

Each weak classifier should be trained on a random subset of the total training set. The subsets
can overlap–it’s not the same as, for example, dividing the training set into ten portions.
AdaBoost assigns a “weight” to each training example, which determines the probability that
each example should appear in the training set. Examples with higher weights are more likely to
be included in the training set, and vice versa. After training a classifier, AdaBoost increases the
weight on the misclassified examples so that these examples will make up a larger part of the
next classifier's training set, and hopefully the next classifier trained will perform better on them.

The equation for this weight update step is detailed later on.

Classifier Output Weights

After each classifier is trained, the classifier’s weight is calculated based on its accuracy. More
accurate classifiers are given more weight. A classifier with 50% accuracy is given a weight of
zero, and a classifier with less than 50% accuracy (kind of a funny concept) is given negative
weight.

Formal Definition

To learn about AdaBoost, I read through a tutorial written by one of the original authors of the
algorithm, Robert Schapire. The tutorial is available here.

Below, I’ve tried to offer some intuition into the relevant equations.

Let’s look first at the equation for the final classifier.

The final classifier consists of ‘T’ weak classifiers. h_t(x) is the output of weak classifier ‘t’ (in
this paper, the outputs are limited to -1 or +1). Alpha_t is the weight applied to classifier ‘t’ as
determined by AdaBoost. So the final output is just a linear combination of all of the weak
classifiers, and then we make our final decision simply by looking at the sign of this sum.

The classifiers are trained one at a time. After each classifier is trained, we update the
probabilities of each of the training examples appearing in the training set for the next classifier.

The first classifier (t = 1) is trained with equal probability given to all training examples. After
it’s trained, we compute the output weight (alpha) for that classifier.

The output weight, alpha_t, is fairly straightforward. It’s based on the classifier’s error rate,
‘e_t’. e_t is just the number of misclassifications over the training set divided by the training set
size.
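
In the standard AdaBoost formulation, the output weight is computed from this error rate as:

\alpha_t = \frac{1}{2} \ln\left( \frac{1 - e_t}{e_t} \right)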

Here’s a plot of what alpha_t will look like for classifiers with different error rates.

There are three bits of intuition to take from this graph:

1. The classifier weight grows exponentially as the error approaches 0. Better classifiers are
given exponentially more weight.
2. The classifier weight is zero if the error rate is 0.5. A classifier with 50% accuracy is no
better than random guessing, so we ignore it.
3. The classifier weight grows exponentially negative as the error approaches 1. We give a
negative weight to classifiers with worse than 50% accuracy. “Whatever that classifier
says, do the opposite!”.

After computing the alpha for the first classifier, we update the training example weights using
the following formula.
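
In the standard notation, this update is:

D_{t+1}(i) = \frac{ D_t(i) \, \exp\left( -\alpha_t \, y_i \, h_t(x_i) \right) }{ Z_t }

where Z_t is the normalization factor explained below.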

The variable D_t is a vector of weights, with one weight for each training example in the training
set. ‘i’ is the training example number. This equation shows you how to update the weight for the
ith training example.

The paper describes D_t as a distribution. This just means that each weight D(i) represents the
probability that training example i will be selected as part of the training set.

To make it a distribution, all of these probabilities should add up to 1. To ensure this, we
normalize the weights by dividing each of them by the sum of all the weights, Z_t. So, for
example, if all of the calculated weights added up to 12.2, then we would divide each of the
weights by 12.2 so that they sum up to 1.0 instead.

This vector is updated for each new weak classifier that’s trained. D_t refers to the weight vector
used when training classifier ‘t’.

This equation needs to be evaluated for each of the training samples ‘i’ (x_i, y_i). Each weight
from the previous training round is going to be scaled up or down by this exponential term.

To understand how this exponential term behaves, let’s look first at how exp(x) behaves.

The function exp(x) will return a fraction for negative values of x, and a value greater than one
for positive values of x. So the weight for training sample i will be either increased or decreased
depending on the final sign of the term “-alpha * y * h(x)”. For binary classifiers whose output is
constrained to either -1 or +1, the terms y and h(x) only contribute to the sign and not the
magnitude.

y_i is the correct output for training example ‘i’, and h_t(x_i) is the predicted output by classifier
t on this training example. If the predicted and actual output agree, y * h(x) will always be +1
(either 1 * 1 or -1 * -1). If they disagree, y * h(x) will be negative.

Ultimately, misclassifications by a classifier with a positive alpha will cause this training
example to be given a larger weight. And vice versa.

Note that by including alpha in this term, we are also incorporating the classifier’s effectiveness
into consideration when updating the weights. If a weak classifier misclassifies an input, we
don’t take that as seriously as a strong classifier’s mistake.

Neural Networks:

Neural networks are a state-of-the-art technique for many different machine learning problems.
So why do we need yet another learning algorithm? We already have linear regression and
logistic regression, so why do we need neural networks? To motivate the discussion of neural
networks, let us start by discussing a few examples of machine learning problems where we need
to learn complex non-linear hypotheses. Consider a supervised learning classification problem
where you have a training set like this:

If you want to apply logistic regression to this problem, one thing you could do is apply logistic
regression with a lot of non-linear features. Here, g as usual is the sigmoid function, and we can
include lots of polynomial terms. If you include enough polynomial terms then maybe you can get
a hypothesis that separates the positive and negative examples. But if you were to include all the
quadratic terms, that is, all the second-order polynomial terms, there would be a lot of them. There
would be terms like x1*x1, x1*x2, x1*x3, up to x1*x100, and then x2*x2, x2*x3 and so on. Even
if you include just the second-order terms, that is, the terms that are a product of two of these
features, then for n = 100 you end up with about 5,000 features. Asymptotically, the number of
quadratic features grows roughly as order n^2, where n is the number of original features (x1
through x100 here); it is actually closer to n^2/2. So including all the quadratic features doesn't
seem like a good idea, because that is a lot of features: you might end up overfitting the training
set, and it can also be computationally expensive to work with that many features.

One thing you could do is include only a subset of these features. If you include only the features
x1*x1, x2*x2, x3*x3, up to x100*x100, then the number of features is much smaller: you have only
100 such quadratic features. But this is not enough features, and it certainly won't let you fit a data
set like the one on the upper left. In fact, if you include only these quadratic features together with
the original features x1 through x100, then you can fit some interesting hypotheses, such as
axis-aligned ellipses, but you certainly cannot fit a more complex data set like the one shown here.

So 5,000 features already seems like a lot, and if you were to include the cubic, or third-order,
terms as well, such as x1*x2*x3, x1*x1*x2, x10*x11*x17 and so on, you can imagine there are
going to be a lot of them. In fact, there are on the order of n^3 such features, and for n = 100 you
end up with on the order of about 170,000 cubic features. Including these higher-order polynomial
features when your original feature set size n is large really blows up your feature space
dramatically, and this doesn't seem like a good way to come up with additional features with which
to build non-linear classifiers when n is large. For many machine learning problems, n will be
pretty large.

What is an Activation Function?

It's just a node that you add to the output end of any neural network. It is also known as a Transfer
Function. It can also be attached between two neural networks.

Why do we use activation functions with neural networks?

An activation function is used to determine the output of a neural network, such as yes or no. It
maps the resulting values into a range such as 0 to 1 or -1 to 1 (depending on the function).

The activation functions can basically be divided into two types:

1. Linear Activation Function
2. Non-linear Activation Functions

FYI: The cheat sheet is given below.

Linear or Identity Activation Function

As you can see, the function is a line, i.e. linear. Therefore, the output of the function is not
confined to any range.

Fig: Linear Activation Function

Equation: f(x) = x

Range: (-infinity, infinity)

It doesn't help with the complexity, or the many parameters, of the usual data that is fed to neural
networks.

Non-linear Activation Function

The non-linear activation functions are the most used activation functions. Non-linearity helps to
make the graph look something like this:

Fig: Non-linear Activation Function

It makes it easy for the model to generalize or adapt to a variety of data and to differentiate
between the outputs.

The main terminology needed to understand non-linear functions:

Derivative or differential: the change in the y-axis with respect to the change in the x-axis. It is
also known as the slope.

Monotonic function: a function which is either entirely non-increasing or entirely non-decreasing.

The non-linear activation functions are mainly divided on the basis of their range or curve:

1. Sigmoid or Logistic Activation Function

The Sigmoid Function curve looks like an S-shape.

Fig: Sigmoid Function


The main reason we use the sigmoid function is that its output lies between 0 and 1. Therefore, it
is especially used for models where we have to predict a probability as the output. Since the
probability of anything exists only between 0 and 1, sigmoid is a natural choice.

The function is differentiable, which means we can find the slope of the sigmoid curve at any
point.

The function is monotonic, but the function's derivative is not.

The logistic sigmoid function can cause a neural network to get stuck during training, for example
when it saturates and its gradients become very small.

The softmax function is a more generalized logistic activation function which is used for
multi-class classification.

2. Tanh or hyperbolic tangent Activation Function

tanh is also like the logistic sigmoid, but better. The range of the tanh function is (-1, 1). tanh is
also sigmoidal (S-shaped).

Fig: tanh v/s Logistic Sigmoid

The advantage is that the negative inputs will be mapped strongly negative and the zero inputs
will be mapped near zero in the tanh graph.

The function is differentiable.

The function is monotonic while its derivative is not monotonic.

The tanh function is mainly used for classification between two classes.

Both tanh and logistic sigmoid activation functions are used in feed-forward nets.

3. ReLU (Rectified Linear Unit) Activation Function

The ReLU is the most used activation function in the world right now, since it is used in almost
all convolutional neural networks and deep learning models.

Fig: ReLU v/s Logistic Sigmoid

As you can see, the ReLU is half rectified (from the bottom). f(z) is zero when z is less than zero,
and f(z) is equal to z when z is greater than or equal to zero.

Range: [ 0 to infinity)

The function and its derivative both are monotonic.

But the issue is that all negative values become zero immediately, which decreases the ability
of the model to fit or train on the data properly. Any negative input given to the ReLU
activation function turns into zero immediately in the graph, which in turn affects the resulting
graph by not mapping the negative values appropriately.

4. Leaky ReLU

It is an attempt to solve the dying ReLU problem.

Fig : ReLU v/s Leaky ReLU

Can you see the Leak?

The leak helps to increase the range of the ReLU function. Usually, the value of the slope a
(applied to negative inputs) is 0.01 or so.

When a is not 0.01, it is called a Randomized ReLU.

Therefore, the range of the Leaky ReLU is (-infinity, infinity).

Both the Leaky and Randomized ReLU functions are monotonic in nature, and so are their
derivatives.

Why is the derivative/differentiation used?

When updating the curve, we need to know in which direction and by how much to change or
update it, and that depends on the slope. That is why we use differentiation in almost every part of
Machine Learning and Deep Learning.
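
As a small illustration (not part of the original article), here is a NumPy sketch of the activation functions above and their derivatives; the leak slope a = 0.01 follows the convention mentioned above.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))           # range (0, 1)

def d_sigmoid(x):
    s = sigmoid(x)
    return s * (1 - s)                         # output * (1 - output)

def tanh(x):
    return np.tanh(x)                          # range (-1, 1)

def d_tanh(x):
    return 1 - np.tanh(x) ** 2

def relu(x):
    return np.maximum(0, x)                    # range [0, infinity)

def d_relu(x):
    return (x > 0).astype(float)

def leaky_relu(x, a=0.01):
    return np.where(x > 0, x, a * x)           # small slope 'a' for negative inputs

def d_leaky_relu(x, a=0.01):
    return np.where(x > 0, 1.0, a)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(sigmoid(x), tanh(x), relu(x), leaky_relu(x), sep="\n")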

Fig: Activation Function Cheat Sheet

Fig: Derivative of Activation Functions


FOR BACKPROPAGATION AND NEURAL NETWORKS: REFER TO THE
VIDEOS BY 3BLUE1BROWN ON YOUTUBE
Another explanation of the same is mentioned below:

A Step by Step Backpropagation Example


Source: https://mattmazur.com/2015/03/17/a-step-by-step-backpropagation-example/

Background

Backpropagation is a common method for training a neural network. There is no shortage of
papers online that attempt to explain how backpropagation works, but few include an
example with actual numbers. This post is my attempt to explain how it works with a concrete
example that folks can compare their own calculations to in order to ensure they understand
backpropagation correctly.

If this kind of thing interests you, you should sign up for my newsletter where I post about AI-
related projects that I’m working on.
Backpropagation in Python

You can play around with a Python script that I wrote that implements the backpropagation
algorithm in this Github repo.

Backpropagation Visualization

For an interactive visualization showing a neural network as it learns, check out my Neural
Network visualization.

Additional Resources

If you find this tutorial useful and want to continue learning about neural networks, machine
learning, and deep learning, I highly recommend checking out Adrian Rosebrock’s new book,
Deep Learning for Computer Vision with Python. I really enjoyed the book and will have a full
review up soon.

Overview

For this tutorial, we're going to use a neural network with two inputs, two hidden neurons, and two
output neurons. Additionally, the hidden and output neurons will each include a bias.

Here’s the basic structure:

In order to have some numbers to work with, here are the initial weights, the biases, and training
inputs/outputs:

The goal of backpropagation is to optimize the weights so that the neural network can learn how
to correctly map arbitrary inputs to outputs.

For the rest of this tutorial we’re going to work with a single training set: given inputs 0.05 and
0.10, we want the neural network to output 0.01 and 0.99.

The Forward Pass

To begin, let's see what the neural network currently predicts given the weights and biases above
and inputs of 0.05 and 0.10. To do this we'll feed those inputs forward through the network.

We figure out the total net input to each hidden layer neuron, squash the total net input using an
activation function (here we use the logistic function), then repeat the process with the output
layer neurons.

Total net input is also referred to as just net input by some sources.

Here's how we calculate the total net input for h1:

We then squash it using the logistic function to get the output of h1:

Carrying out the same process for h2 we get:

We repeat this process for the output layer neurons, using the output from the hidden layer
neurons as inputs.

Here's the output for o1:

And carrying out the same process for o2 we get:

Calculating the Total Error

We can now calculate the error for each output neuron using the squared error function and sum
them to get the total error:
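
The squared error function referred to here (written with the 1/2 factor discussed next) is:

E_{total} = \sum \frac{1}{2} \left( target - output \right)^2

where the sum runs over the output neurons.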

Some sources refer to the target as the ideal and the output as the actual.

The 1/2 is included so that the exponent is cancelled when we differentiate later on. The result is
eventually multiplied by a learning rate anyway, so it doesn't matter that we introduce a constant
here [1].

For example, the target output for o1 is 0.01, but the neural network output 0.75136507, therefore
its error is:

Repeating this process for o2 (remembering that the target is 0.99) we get:

The total error for the neural network is the sum of these errors:

The Backwards Pass

Our goal with backpropagation is to update each of the weights in the network so that they cause
the actual output to be closer to the target output, thereby minimizing the error for each output
neuron and the network as a whole.

Output Layer

Consider w5. We want to know how much a change in w5 affects the total error, aka ∂E_total/∂w5.

∂E_total/∂w5 is read as "the partial derivative of E_total with respect to w5". You can also say
"the gradient with respect to w5".

By applying the chain rule we know that:
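
Writing out_o1 for the output of o1 and net_o1 for its total net input (the three pieces discussed below), the chain rule gives:

\frac{\partial E_{total}}{\partial w_5} =
\frac{\partial E_{total}}{\partial out_{o1}} \times
\frac{\partial out_{o1}}{\partial net_{o1}} \times
\frac{\partial net_{o1}}{\partial w_5}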


Visually, here’s what we’re doing:

We need to figure out each piece in this equation.

First, how much does the total error change with respect to the output?

is sometimes expressed as

When we take the partial derivative of the total error with respect to out_o1, the error term for o2
becomes zero, because out_o1 does not affect it, which means we're taking the derivative of a
constant, which is zero.

Next, how much does the output of o1 change with respect to its total net input?

The partial derivative of the logistic function is the output multiplied by 1 minus the output:

Finally, how much does the total net input of o1 change with respect to w5?

Putting it all together:

You’ll often see this calculation combined in the form of the delta rule:

Alternatively, we have ∂E_total/∂out_o1 and ∂out_o1/∂net_o1, which can be written as
∂E_total/∂net_o1, aka δ_o1 (the Greek letter delta), aka the node delta. We can use this to rewrite
the calculation above:

Therefore:

Some sources extract the negative sign from δ, so it would be written as:

To decrease the error, we then subtract this value from the current weight (optionally multiplied
by some learning rate, eta, which we’ll set to 0.5):
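
With the learning rate η = 0.5, the update for w5 is:

w_5^{+} = w_5 - \eta \cdot \frac{\partial E_{total}}{\partial w_5}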

Some sources use alpha to represent the learning rate, others use eta, and others even use epsilon.

We can repeat this process to get the new weights w6, w7, and w8:

We perform the actual updates in the neural network after we have the new weights leading into
the hidden layer neurons (i.e., we use the original weights, not the updated weights, when we
continue the backpropagation algorithm below).

Hidden Layer

Next, we'll continue the backwards pass by calculating new values for w1, w2, w3, and w4.

Big picture, here’s what we need to figure out:

Visually:

We’re going to use a similar process as we did for the output layer, but slightly different to
account for the fact that the output of each hidden layer neuron contributes to the output (and
therefore error) of multiple output neurons. We know that out_h1 affects both out_o1 and out_o2,
and therefore ∂E_total/∂out_h1 needs to take into consideration its effect on both output neurons:

Starting with ∂E_o1/∂out_h1:

We can calculate ∂E_o1/∂net_o1 using values we calculated earlier:

And ∂net_o1/∂out_h1 is equal to w5:

Plugging them in:

Following the same process for ∂E_o2/∂out_h1, we get:

Therefore:

Now that we have ∂E_total/∂out_h1, we need to figure out ∂out_h1/∂net_h1 and then ∂net_h1/∂w
for each weight:

We calculate the partial derivative of the total net input to h1 with respect to w1 the same way as
we did for the output neuron:

Putting it all together:

You might also see this written as:

We can now update w1:

Repeating this for w2, w3, and w4:

Finally, we’ve updated all of our weights! When we fed forward the 0.05 and 0.1 inputs
originally, the error on the network was 0.298371109. After this first round of backpropagation,
the total error is now down to 0.291027924. It might not seem like much, but after repeating this
process 10,000 times, for example, the error plummets to 0.0000351085. At this point, when we
feed forward 0.05 and 0.1, the two output neurons generate 0.015912196 (vs 0.01 target) and
0.984065734 (vs 0.99 target).
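
To tie the walkthrough together, here is a compact NumPy sketch of the same 2-2-2 network. The initial weights and biases below are the values commonly associated with this example (w1–w8 = 0.15, 0.20, 0.25, 0.30, 0.40, 0.45, 0.50, 0.55; b1 = 0.35, b2 = 0.60) and should be treated as an assumption rather than as values taken from the text above.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Inputs, targets, and the commonly cited initial parameters (assumed values).
i = np.array([0.05, 0.10])
target = np.array([0.01, 0.99])
W1 = np.array([[0.15, 0.20],   # weights into h1 (w1, w2)
               [0.25, 0.30]])  # weights into h2 (w3, w4)
W2 = np.array([[0.40, 0.45],   # weights into o1 (w5, w6)
               [0.50, 0.55]])  # weights into o2 (w7, w8)
b1, b2, eta = 0.35, 0.60, 0.5

for step in range(10000):
    # Forward pass
    h = sigmoid(W1 @ i + b1)                  # hidden outputs
    o = sigmoid(W2 @ h + b2)                  # network outputs
    E = 0.5 * np.sum((target - o) ** 2)
    if step == 0:
        print("initial error:", E)            # ~0.298371109, as in the walkthrough

    # Backward pass: deltas are dE/dnet for each layer
    delta_o = (o - target) * o * (1 - o)
    delta_h = (W2.T @ delta_o) * h * (1 - h)  # uses the original W2 for this step

    # Gradient descent updates (outer products give dE/dW for each weight matrix)
    W2 -= eta * np.outer(delta_o, h)
    W1 -= eta * np.outer(delta_h, i)
    # Biases are left fixed here, as in the walkthrough above.

print("final outputs:", sigmoid(W2 @ sigmoid(W1 @ i + b1) + b2))  # approaches the targets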
