Classification
Features of the k-nearest neighbours (kNN) algorithm:
- non-parametric classification algorithm (no assumption is made about
  the structure of the data)
- uses information from neighbouring points to predict the class
- one of the introductory supervised classifiers
- proposed by Fix and Hodges in 1951 for pattern classification tasks
- addresses pattern recognition problems
Consider the image above, which shows two target classes: white and
orange.
There are 26 training samples in total.
We would like to predict the target class of the blue circle.
Taking k = 3, we calculate the distance from the blue circle to each
training sample using a measure such as Euclidean distance.
In the image, the distances have been calculated and the circles nearest
to the blue circle are enclosed in the big circle.
What will be the predicted class?
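The procedure just described (measure the distance to every training sample, keep the k nearest, take the majority class among them) can be sketched in a few lines. This is a minimal illustration in Python, not the R code from the linked tutorials. The coordinates below are hypothetical toy values and do not reproduce the 26 samples in the image; the names `euclidean`, `knn_predict`, `train` and `query` are likewise assumptions made for this sketch.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train, query, k=3):
    """Return the majority class among the k training samples nearest to query.

    train is a list of (point, label) pairs.
    """
    # Sort all training samples by distance to the query and keep the k nearest.
    neighbours = sorted(train, key=lambda sample: euclidean(sample[0], query))[:k]
    # Count the class labels of those neighbours and return the most frequent one.
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Hypothetical toy data (not the samples from the image).
train = [
    ((1.0, 1.0), "white"),
    ((1.5, 2.0), "white"),
    ((2.0, 1.0), "white"),
    ((5.0, 5.0), "orange"),
    ((5.5, 6.0), "orange"),
    ((6.0, 5.0), "orange"),
]
query = (5.2, 5.1)  # plays the role of the blue circle

predicted = knn_predict(train, query, k=3)
print(predicted)
```

With these toy values the three nearest neighbours of the query are all orange, so the sketch predicts "orange". An odd k is a common choice for two classes because it avoids tied votes.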
Implementations in R:
- from scratch:
http://dataaspirant.com/2017/01/02/k-nearest-neighbor-classifier-implementation-r-scratch/
- using implemented functions found in libraries:
http://www.rpubs.com/Drmadhu/IRISclassification
1. Iris
Consider the "iris" data set in R. Determine which variable is the
response. Load the data and split it into two parts (80% and 20%) that
will serve as the training data and the test data. Train the kNN model
for k = 1, 2, 3 and determine which value is best (i.e. most accurate)
on the test data.
Use the functions: knn() in the package "class", CrossTable() in the
package "gmodels" or knn.cv() from the package "class", and
confusionMatrix() in the package "caret".
2. Breast Cancer
To diagnose breast cancer, a doctor draws on experience, analysing the
details provided by a) the patient's past medical history and b) the
reports of all the tests performed. At times, diagnosing cancer is
difficult even for experienced doctors, since the information provided
by the patient may be unclear or insufficient. The breast cancer
database was obtained from Dr. William H. Wolberg of the University of
Wisconsin Hospitals, Madison.