
Classification Using K-NN

Ahmad Fathan Hidayatullah, ST, MCs


Supervised Vs Unsupervised
• Data mining methods may be categorized as either
supervised or unsupervised
• In unsupervised methods, no target variable is identified
as such
http://mlwithdata.blogspot.co.id/2015/04/machine-learning-supervised-vs.html
Classification
• Classification is a data mining task of predicting the value
of a categorical variable (target or class) by building a
model based on one or more numerical and/or categorical
variables (predictors or attributes)

http://www.saedsayad.com/classification.htm
Classification
• Given a collection of records (training set)
– Each record contains a set of attributes; one of the
attributes is the class
• Find a model for the class attribute as a function
of the values of the other attributes
• Goal: previously unseen records should be
assigned a class as accurately as possible
– A test set is used to determine the accuracy of the model. Usually, the
given data set is divided into training and test sets, with the training set
used to build the model and the test set used to validate it
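The train/test procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `train_test_split` and `accuracy`, the 30% test fraction, and the toy records are hypothetical and do not come from the slides.

```python
import random

def train_test_split(records, test_fraction=0.3, seed=0):
    """Shuffle the records and split them into (training set, test set)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(round(len(shuffled) * test_fraction)))
    return shuffled[n_test:], shuffled[:n_test]

def accuracy(model, test_set):
    """Fraction of test records whose predicted class equals the true class."""
    correct = sum(1 for attributes, label in test_set
                  if model(attributes) == label)
    return correct / len(test_set)

# Hypothetical toy records: (attributes, class)
records = [((i,), "A" if i < 5 else "B") for i in range(10)]
train, test = train_test_split(records)

# Stand-in for a model built from the training set (assumed, for illustration)
def threshold_model(attributes):
    return "A" if attributes[0] < 5 else "B"

print(len(train), len(test), accuracy(threshold_model, test))
```

The model is only ever scored on records it was not built from, which is the point of holding out a test set.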
Illustrating Classification Task
Classification Techniques
• Decision Tree based Methods
• Rule-based Methods
• Memory based reasoning
• Neural Networks
• Naïve Bayes and Bayesian Belief Network
• Support Vector Machines
• Nearest Neighbor
K Nearest Neighbors a.k.a. KNN
• K nearest neighbors is a simple algorithm that stores all
available cases and classifies new cases based on a
similarity measure (e.g., distance functions)
• KNN has been used in statistical estimation and pattern
recognition as a non-parametric technique since the early
1970s
Algorithm
• A case is classified by a majority vote of its neighbors,
with the case being assigned to the class most common
amongst its K nearest neighbors measured by a distance
function
• If K = 1, then the case is simply assigned to the class of
its nearest neighbor
• Classify an unknown example with the most common
class among k closest examples
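A minimal sketch of the majority-vote rule described above, assuming Euclidean distance and training data stored as (point, class) pairs. The function names and the toy data are hypothetical, and ties are broken by training order, which is only one possible choice.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(training, query, k, distance=euclidean):
    """Return the most common class among the k training points nearest to query.

    training is a list of (point, class) pairs. sorted() is stable, so points
    at equal distance keep their order in the training list.
    """
    neighbors = sorted(training, key=lambda rec: distance(rec[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two well-separated groups
training = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
            ((5, 5), "B"), ((6, 5), "B"), ((5, 6), "B")]

print(knn_classify(training, (2, 2), k=3))  # all three nearest points are "A"
print(knn_classify(training, (5, 4), k=1))  # nearest point is (5, 5), class "B"
```

Note that there is no training step: the "model" is simply the stored cases, and all the work happens when a new case is classified.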
KNN: Multiple Classes
• Easy to implement for multiple classes
• Example for k = 5
How to Choose K?
• In theory, if an infinite number of samples is available,
the larger k is, the better the classification
• The caveat is that all k neighbors have to be close
– Possible when an infinite number of samples is available
– Impossible in practice since the number of samples is finite
How to Choose K?
• Rule of thumb is k < sqrt(n), where n is the number of examples
– This choice has interesting theoretical properties
• In practice, k = 1 is often used for efficiency, but it can
be sensitive to “noise”
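As a small illustration of the rule of thumb, the helper below returns the largest integer k that is strictly below sqrt(n). The function name is hypothetical, and flooring k at 1 is an assumption made so that the result is always usable.

```python
import math

def max_k_rule_of_thumb(n):
    """Largest integer k with k < sqrt(n), but never less than 1."""
    root = math.isqrt(n)                        # floor of sqrt(n)
    k = root - 1 if root * root == n else root  # strict inequality
    return max(1, k)

for n in (10, 100, 1000):
    print(n, max_k_rule_of_thumb(n))
```

For n = 10 examples this gives an upper bound of k = 3, so the rule is a guideline for a starting point rather than a fixed prescription.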
How to Choose K?
• A larger k may improve performance, but too large a k destroys
locality, i.e. we end up looking at samples that are not neighbors
• Cross-validation (covered later) may be used to choose k
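One way cross-validation could be used to choose k is sketched below with leave-one-out validation: each sample is held out in turn, classified from the remaining samples, and the k with the best hold-out accuracy is kept. The one-dimensional toy data, the candidate values of k, and all function names are hypothetical; the single "noisy" sample is there to show why k = 1 can be fragile.

```python
from collections import Counter

def knn_predict(training, query, k):
    """Majority class among the k training points closest to query (1-D data)."""
    neighbors = sorted(training, key=lambda rec: abs(rec[0] - query))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

def leave_one_out_accuracy(data, k):
    """Hold out each record in turn and classify it from all the others."""
    correct = 0
    for i, (x, label) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        if knn_predict(rest, x, k) == label:
            correct += 1
    return correct / len(data)

# Hypothetical toy data: class A near 1..4, class B near 7..10,
# plus one noisy B sample sitting inside the A region.
data = [(1, "A"), (2, "A"), (3, "A"), (4, "A"),
        (7, "B"), (8, "B"), (9, "B"), (10, "B"),
        (2.5, "B")]

candidates = [1, 3, 5]
scores = {k: leave_one_out_accuracy(data, k) for k in candidates}
best_k = max(candidates, key=lambda k: scores[k])  # smallest k wins ties

for k in candidates:
    print(k, round(scores[k], 3))
print("chosen k:", best_k)
```

With k = 1 the noisy sample misleads its two closest neighbors, while k = 3 and k = 5 outvote it, so the larger values score better on this toy set.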
Example
• A snack company wants to classify the quality of its products into
two groups, GOOD and BAD. Two variables are used to assess quality:
the increase in the degree of acidity (%) and the volume shrinkage
(%). Ten labeled samples, shown in the table, serve as the training
data.
• The company wants to know whether a product with an acidity
increase of 6% and a volume shrinkage of 3% belongs to the GOOD or
the BAD category
No   Increase in degree of acidity, V1 (%)   Volume shrinkage, V2 (%)   Category
1    3                                       2                          GOOD
2    4                                       1                          GOOD
3    4                                       3                          GOOD
4    5                                       1                          GOOD
5    5                                       4                          GOOD
6    6                                       5                          BAD
7    7                                       6                          BAD
8    8                                       4                          BAD
9    7                                       2                          BAD
10   9                                       1                          BAD
• Choose k = 5
• Compute the Euclidean distance between the data point to be
evaluated, r = (6, 3), and every training sample.
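For reference, the Euclidean distance used in this step can be written out as follows. The subscript notation (V_{1,i} and V_{2,i} for the two variables of sample i) is an assumption introduced here for readability; the second line evaluates the formula for sample 5.

```latex
d(r, x_i) = \sqrt{(V_{1,i} - 6)^2 + (V_{2,i} - 3)^2}, \qquad i = 1, \dots, 10

d(r, x_5) = \sqrt{(5 - 6)^2 + (4 - 3)^2} = \sqrt{2} \approx 1.4142
```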
Result
No   V1   V2   Category   Distance to r = (6, 3)
1    3    2    GOOD       $\sqrt{(3 - 6)^2 + (2 - 3)^2} = 3.1623$
2    4    1    GOOD       $\sqrt{(4 - 6)^2 + (1 - 3)^2} = 2.8284$
3    4    3    GOOD       $\sqrt{(4 - 6)^2 + (3 - 3)^2} = 2.0000$
4    5    1    GOOD       $\sqrt{(5 - 6)^2 + (1 - 3)^2} = 2.2361$
5    5    4    GOOD       $\sqrt{(5 - 6)^2 + (4 - 3)^2} = 1.4142$
6    6    5    BAD        $\sqrt{(6 - 6)^2 + (5 - 3)^2} = 2.0000$
7    7    6    BAD        $\sqrt{(7 - 6)^2 + (6 - 3)^2} = 3.1623$
8    8    4    BAD        $\sqrt{(8 - 6)^2 + (4 - 3)^2} = 2.2361$
9    7    2    BAD        $\sqrt{(7 - 6)^2 + (2 - 3)^2} = 1.4142$
10   9    1    BAD        $\sqrt{(9 - 6)^2 + (1 - 3)^2} = 3.6056$
The data are then sorted by distance, from closest to farthest
Rank   Sample No   V1   V2   Category   Distance
1      5           5    4    GOOD       1.4142
2      9           7    2    BAD        1.4142
3      3           4    3    GOOD       2.0000
4      6           6    5    BAD        2.0000
5      4           5    1    GOOD       2.2361
6      8           8    4    BAD        2.2361
7      2           4    1    GOOD       2.8284
8      1           3    2    GOOD       3.1623
9      7           7    6    BAD        3.1623
10     10          9    1    BAD        3.6056
• From the sorted result, we take the first five rows (k = 5)
• Among these five neighbors, three are GOOD and two are BAD.
So, by majority vote, the data point r = (6, 3) is assigned to
the GOOD category
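The whole worked example can be reproduced with a short script. This is a sketch rather than a reference implementation: the variable names are hypothetical, and ties in distance are resolved by sample number, which is an assumption that happens to match the order shown in the sorted table.

```python
import math
from collections import Counter

# (sample number, V1, V2, category), copied from the table
samples = [
    (1, 3, 2, "GOOD"), (2, 4, 1, "GOOD"), (3, 4, 3, "GOOD"),
    (4, 5, 1, "GOOD"), (5, 5, 4, "GOOD"), (6, 6, 5, "BAD"),
    (7, 7, 6, "BAD"), (8, 8, 4, "BAD"), (9, 7, 2, "BAD"),
    (10, 9, 1, "BAD"),
]
r = (6, 3)  # product to classify
k = 5

def distance_to_r(sample):
    """Euclidean distance from a sample's (V1, V2) to r."""
    _, v1, v2, _ = sample
    return math.sqrt((v1 - r[0]) ** 2 + (v2 - r[1]) ** 2)

# sorted() is stable: samples at equal distance stay in sample-number order
ranked = sorted(samples, key=distance_to_r)
nearest = ranked[:k]
votes = Counter(category for _, _, _, category in nearest)
predicted = votes.most_common(1)[0][0]

for sample in nearest:
    print(sample, f"{distance_to_r(sample):.4f}")
print(dict(votes), "->", predicted)
```

One detail worth noticing: samples 4 (GOOD) and 8 (BAD) are both at distance 2.2361, so they tie for fifth place. Keeping sample 4 gives the 3-to-2 vote for GOOD; a tie-breaking rule that kept sample 8 instead would give 3-to-2 for BAD, so the outcome for this particular k depends on how ties are handled.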
References
• Larose, Daniel T. Discovering knowledge in data: an
introduction to data mining. John Wiley & Sons, 2005.
• Han, Jiawei, Jian Pei, and Micheline Kamber. Data mining:
concepts and techniques. Elsevier, 2006.
• Zaki, Mohammed J., and Wagner Meira Jr., Data Mining and
Analysis. Cambridge University Press, 2014.
• Dinda Eling K. Sasmito, Data Mining Lecture slides
• http://people.revoledu.com/kardi/tutorial/KNN/KNN_Numerical-example.html