http://stackoverflow.com/questions/10059594/a-simple-explan...
machine-learning
dataset
classification
edited Sep 21 '14 at 14:07
Fred Foo
It's quite easy if you understand Bayes' Theorem. If you haven't read about Bayes' theorem, try this link:
yudkowsky.net/rational/bayes. Pinch Apr 8 '12 at 3:43
Here is a nice blog entry about How To Build a Naive Bayes Classifier. tobigue Apr 8 '12 at 13:13
Thanks for the link, but OMG is it long. Jaggerjack Apr 8 '12 at 19:11
NOTE: The accepted answer below is not a traditional example for Naive Bayes. It's mostly a k Nearest
Neighbor implementation. Read accordingly. chmullig Nov 26 '13 at 4:01
@Jaggerjack: RamNarasimhan's answer is better explained than the accepted answer.
Unmesha SreeVeni May 20 '14 at 6:35
Jaggerjack
4 Answers
Your question, as I understand it, is divided into two parts: one being that you need more understanding
of the Naive Bayes classifier, and the second being the confusion surrounding the training set.
In general, all Machine Learning algorithms need to be trained for supervised learning tasks
like classification and prediction, or for unsupervised learning tasks like clustering.
Training means fitting them on particular inputs so that later we can test them on unknown
inputs (which they have never seen before), which they can then classify or predict
(in the case of supervised learning) based on what they learned. This is what most Machine
Learning techniques like Neural Networks, SVMs, Bayesian methods etc. are based upon.
So in a general Machine Learning project you basically have to divide your input set into a
Development Set (Training Set + Dev-Test Set) and a Test Set (or Evaluation Set). Remember,
your basic objective is that your system learns to classify new inputs it has never seen
before in either the dev set or the test set.
The test set typically has the same format as the training set. However, it is very important that
the test set be distinct from the training corpus: if we simply reused the training set as the test
set, then a model that simply memorized its input, without learning how to generalize to new
examples, would receive misleadingly high scores.
In general, for example, 70% of the cases can make up the training set. Also remember to partition the
original set into the training and test sets randomly.
Now I come to your other question about Naive Bayes.
Source for example below: http://www.statsoft.com/textbook/naive-bayes-classifier
To demonstrate the concept of Naive Bayes Classification, consider the example given below:
As indicated, the objects can be classified as either GREEN or RED . Our task is to classify new
cases as they arrive, i.e., decide to which class label they belong, based on the currently
existing objects.
Since there are twice as many GREEN objects as RED , it is reasonable to believe that a new
case (which hasn't been observed yet) is twice as likely to have membership GREEN rather
than RED . In the Bayesian analysis, this belief is known as the prior probability. Prior
probabilities are based on previous experience, in this case the percentage of GREEN and RED
objects, and are often used to predict outcomes before they actually happen.
Thus, we can write:
Prior Probability of GREEN : number of GREEN objects / total number of objects
Prior Probability of RED : number of RED objects / total number of objects
Since there is a total of 60 objects, 40 of which are GREEN and 20 RED , our prior probabilities
for class membership are:
Prior Probability for GREEN : 40 / 60
Prior Probability for RED : 20 / 60
Having formulated our prior probabilities, we are now ready to classify a new object (the WHITE circle
in the diagram below). Since the objects are well clustered, it is reasonable to assume that the
more GREEN (or RED ) objects in the vicinity of X, the more likely it is that the new case belongs to
that particular color. To measure this likelihood, we draw a circle around X which encompasses
a number (to be chosen a priori) of points irrespective of their class labels. Then we calculate
the number of points in the circle belonging to each class label. From this we calculate the
likelihood:
Likelihood of X given GREEN : number of GREEN in the vicinity of X / total number of GREEN cases
Likelihood of X given RED : number of RED in the vicinity of X / total number of RED cases
From the illustration above, it is clear that the likelihood of X given GREEN is smaller than
the likelihood of X given RED , since the circle encompasses 1 GREEN object and 3 RED ones.
Thus:
Likelihood of X given GREEN : 1 / 40
Likelihood of X given RED : 3 / 20
Although the prior probabilities indicate that X may belong to GREEN (given that there are
twice as many GREEN compared to RED ) the likelihood indicates otherwise; that the class
membership of X is RED (given that there are more RED objects in the vicinity of X than
GREEN ). In the Bayesian analysis, the final classification is produced by combining both
sources of information, i.e., the prior and the likelihood, to form a posterior probability using the
so-called Bayes' rule (named after Rev. Thomas Bayes 1702-1761):
Posterior probability of X being GREEN : Prior probability of GREEN x Likelihood of X given GREEN = 4/6 x 1/40 = 1/60
Posterior probability of X being RED : Prior probability of RED x Likelihood of X given RED = 2/6 x 3/20 = 1/20
Finally, we classify X as RED since its class membership achieves the largest posterior
probability.
edited Feb 1 at 18:12
A. M. Bittlingmayer
Yavar
Thank you for taking time to put all this together! Very useful! Murat Derya Özen Apr 9 '12 at 9:35
isn't this algorithm above more like k-nearest neighbors? Renaud Jun 12 '13 at 8:46
This answer is confusing - it mixes KNN (k nearest neighbours) and naive bayes. Michal Illich Sep 1 '13
at 16:39
The answer was proceeding nicely till the likelihood came up. So @Yavar has used K-nearest neighbours
for calculating the likelihood. How correct is that? If it is, what are some other methods to calculate the
likelihood? wrahool Jan 31 '14 at 6:24
I realize that this is an old question with an established answer. The reason I'm posting is that
the accepted answer has many elements of kNN (k nearest neighbors), a different algorithm.
Both kNN and NaiveBayes are classification algorithms. Conceptually, kNN uses the idea of
"nearness" to classify new entities. In kNN 'nearness' is modeled with ideas such as Euclidean
Distance or Cosine Distance. By contrast, in NaiveBayes, the concept of 'probability' is used to
classify new entities.
Since the question is about Naive Bayes, here's how I'd describe the ideas and steps to
someone. I'll try to do it with as few equations as possible, in plain English.
In reality we usually have to predict an outcome given multiple pieces of evidence, and the math
then gets very complicated. To get around that complication, one approach is to 'uncouple' the
multiple pieces of evidence and to treat each piece of evidence as independent. This approach is
why this is called naive Bayes.
P(Outcome|Multiple Evidence) =
P(Evidence1|Outcome) x P(Evidence2|Outcome) x ... x P(EvidenceN|Outcome) x P(Outcome)
scaled by P(Multiple Evidence)
Fruit Example
Let's try it out on an example to increase our understanding: The OP asked for a 'fruit'
identification example.
Let's say that we have data on 1000 pieces of fruit. They happen to be Banana, Orange or
some Other Fruit. We know 3 characteristics about each fruit:
1. Whether it is Long
2. Whether it is Sweet and
3. Whether its color is Yellow.
This is our 'training set.' We will use this to predict the type of any new fruit we encounter.
Type        | Long | Not Long || Sweet | Not Sweet || Yellow | Not Yellow | Total
__________________________________________________________________________________
Banana      |  400 |      100 ||   350 |       150 ||    450 |         50 |   500
Orange      |    0 |      300 ||   150 |       150 ||    300 |          0 |   300
Other Fruit |  100 |      100 ||   150 |        50 ||     50 |        150 |   200
__________________________________________________________________________________
Total       |  500 |      500 ||   650 |       350 ||    800 |        200 |  1000
__________________________________________________________________________________
Probability of "Evidence"
P(Long) = 0.5
P(Sweet) = 0.65
P(Yellow) = 0.8
Probability of "Likelihood"
P(Long|Banana) = 0.8
P(Long|Orange) = 0 [Oranges are never long in all the fruit we have seen.]
....
P(Yellow|Other Fruit) = 50/200 = 0.25
P(Not Yellow|Other Fruit) = 0.75
Now, given a new fruit that is Long, Sweet and Yellow, we can classify it by computing which class has the
highest probability based on our prior evidence (our 1000 fruit training set):
P(Banana|Long, Sweet and Yellow)
    = P(Long|Banana) x P(Sweet|Banana) x P(Yellow|Banana) x P(Banana)
      / (P(Long) x P(Sweet) x P(Yellow))
    = (0.8 x 0.7 x 0.9 x 0.5) / P(evidence)
    = 0.252 / P(evidence)
P(Orange|Long, Sweet and Yellow) = 0
P(Other Fruit|Long, Sweet and Yellow)
    = P(Long|Other fruit) x P(Sweet|Other fruit) x P(Yellow|Other fruit) x P(Other Fruit)
      / P(evidence)
    = (100/200 x 150/200 x 50/200 x 200/1000) / P(evidence)
    = 0.01875 / P(evidence)
Assign the class label of whichever is the highest number, and you are done.
Despite the name, NaiveBayes turns out to be excellent in certain applications. Text
classification is one area where it really shines.
Hope that helps in understanding the concepts behind the Naive Bayes algorithm.
edited Mar 26 at 8:51
user3494047
Ram Narasimhan
Really great explanation, thank you! It was very useful and easy to understand PerkinsB1024 Dec 16 '13
at 16:40
Thanks for the very clear explanation! Easily one of the better ones floating around the web. Question:
since each P(outcome/evidence) is multiplied by 1 / z=p(evidence) (which in the fruit case, means each is
essentially the probability based solely on previous evidence), would it be correct to say that z doesn't
matter at all for Naive Bayes? Which would thus mean that if, say, one ran into a long/sweet/yellow fruit
that wasn't a banana, it'd be classified incorrectly. covariance Dec 21 '13 at 2:30
I think this is actually the best answer here. fhucho Dec 27 '13 at 1:03
This is much better than the accepted answer. wrahool Jan 31 '14 at 6:46
The 2 answers together make an even better answer. smwikipedia Feb 11 '14 at 10:55
Suppose you know that 10% of people are smokers, and you see someone who is a man and 15 years
old; you want to guess the probability that he is a smoker. Your initial guess is 10% (the prior
probability, without knowing anything about the person), but the other pieces of evidence (that he
is a man and that he is 15) can contribute to this probability.
Each piece of evidence may increase or decrease this chance. For example, the fact that he is a man
may increase the chance that he is a smoker, provided that the percentage of men
among non-smokers is lower, for example 40%. In other words, being a man must then be a good
indicator of being a smoker rather than a non-smoker.
We can show this contribution in another way. For each feature, you need to compare the
commonness (probability) of that feature (f) in general vs. under the condition (P(f) vs. P(f | x)).
For example, if we know that the probability of being a man is 90% and 90% of smokers
are also men, then knowing that someone is a man doesn't change anything
(10% * (90% / 90%) = 10%). But if men make up 40% of society, yet 90% of the smokers, then being a
man increases the chance of being a smoker (10% * (90% / 40%) = 22.5%). In the same way,
if the probability of being a man were 95%, then regardless of the fact that the percentage of men
among smokers is high (90%), the evidence of being a man would decrease the chance of being a
smoker (10% * (90% / 95%) = 9.5%).
So we have:
P(X) =
P(smoker)*
(P(being a man | smoker)/P(being a man))*
(P(under 20 | smoker)/ P(under 20))
Notice that in this formula we assumed that being a man and being under 20 are independent
features, so we multiplied them; it means that knowing someone is under 20 has no effect
on guessing whether he is a man or a woman. But this may not be true; for example, maybe most
adolescents in a society are men...
To use this formula in a classifier
The classifier is given some features (man and under 20) and must decide whether he is a smoker
or not. It uses the above formula to find that. To obtain the needed probabilities (90%, 10%,
80%...) it uses the training set. For example, it counts the people in the training set who are smokers
and finds that they make up 10% of the sample. Then, for smokers, it checks how many of them are
men or women, and how many are above 20 or under 20...
edited May 19 at 14:14
Ahmad
Ram Narasimhan explained the concept very nicely; here below is an alternative explanation
through a code example of Naive Bayes in action.
It uses an example problem from this book, on page 351.
This is the data set that we will be using
Here is the code; the comments explain everything we are doing here! [python]
import pandas as pd
import pprint

class Classifier():
    data = None        # training data loaded from a CSV file
    class_attr = None  # name of the column that holds the class label
    priori = {}        # prior probability of each class value
    cp = {}            # conditional probability of each feature value given the class
    hypothesis = None  # the new instance we want to classify
output:
Priori Values: {'yes': 0.6428571428571429, 'no': 0.35714285714285715}
Calculated Conditional Probabilities:
{
'no': {
'<=30': 0.8,
'fair': 0.6,
'medium': 0.6,
'yes': 0.4
},
'yes': {
'<=30': 0.3333333333333333,
'fair': 0.7777777777777778,
'medium': 0.5555555555555556,
'yes': 0.7777777777777778
}
}
Result:
yes ==> 0.0720164609053
no ==> 0.0411428571429