Lecture 11

Predicting Categorical Variables.
Classification
Data Analysis Using R 1 / 17

Prediction algorithms used so far: linear regression, multiple linear
regression (OLS), nonlinear regression, logistic regression.
Classification algorithms: nearest neighbour, K-nearest neighbour,
decision trees, random forests

k Nearest Neighbour
Features:
non-parametric classification algorithm (no assumption is made about
the distribution of the data)
uses the neighbour points information to predict the class
one of the simplest supervised classifiers
was proposed by Fix and Hodges in 1951 for pattern classification
tasks
addresses pattern recognition problems

Example I
We consider a basket of fruits containing apples, bananas, grapes
and cherries.
The task is to arrange them into groups.
Example II
We have the characteristics of the fruits (X variables): color
(red/green), size (big/small) and weight (numeric).
For example: apple(red+big), banana(green+big),
grapes(green+small), cherries(red+small).
Let Y (fruit name) be the response variable, i.e. the group/label.
Using kNN and a training set (60% of the data set), we want to
accurately predict the label (fruit name) of any new fruit (new
data).

Algorithm I
simplest version (k = 1): predicts the class of a point as the class of
its nearest neighbour
the nearest neighbour is identified using a distance measure (Euclidean
distance, Manhattan distance, etc.)
KNN algorithm:
Let (Xi, Ci), i = 1, ..., n, be the data points (there may be several
predictor variables).
Xi denotes the feature values and Ci the label of point i.
With c classes, Ci ∈ {1, 2, 3, ..., c} for all values of i.
Let x be a point whose label/group/class is not known.
We would like to find its class using the k-nearest neighbour
algorithm.
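
The two distance measures named above can be sketched in R as follows. This is a minimal illustration, not part of the lecture: the points `x` and `y` and the helper names `euclidean()` and `manhattan()` are made up for the example. Base R's `dist()` is used only to cross-check the hand-written versions.

```r
# Two points with three numeric features each (made-up values).
x <- c(1, 2, 3)
y <- c(4, 6, 3)

# Euclidean distance: square root of the sum of squared differences.
euclidean <- function(a, b) sqrt(sum((a - b)^2))

# Manhattan distance: sum of absolute differences.
manhattan <- function(a, b) sum(abs(a - b))

d_euc <- euclidean(x, y)   # sqrt(9 + 16 + 0) = 5
d_man <- manhattan(x, y)   # 3 + 4 + 0 = 7

# Base R's dist() computes the same quantities for the rows of a matrix.
d_euc_base <- as.numeric(dist(rbind(x, y), method = "euclidean"))
d_man_base <- as.numeric(dist(rbind(x, y), method = "manhattan"))
```

The choice of distance matters: Euclidean distance penalizes one large difference more heavily than several small ones, while Manhattan distance treats all differences linearly.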

kNN Algorithm
1 Calculate d(x, xi), i = 1, ..., n, where d denotes the Euclidean
distance between the points.
2 Arrange the n calculated distances in non-decreasing order.
3 Let k be a positive integer; take the first k distances from this
sorted list.
4 Find the k points corresponding to these k distances.
5 Let ki denote the number of points among these k that belong to the
ith class.
6 If kp > ki for all i ≠ p, then put x in class p.
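
The six steps above can be sketched in base R as follows. This is a minimal sketch under stated assumptions: the helper name `knn_predict()` and the toy training data are hypothetical, the features are assumed to be numeric, and ties are resolved by taking the first class in table order, a simplification.

```r
# Hypothetical helper: classify one point x by majority vote among its
# k nearest training points (rows of train_x, labels in train_c).
knn_predict <- function(train_x, train_c, x, k) {
  # Step 1: Euclidean distance from x to every training point.
  d <- sqrt(rowSums(sweep(train_x, 2, x)^2))
  # Steps 2-4: indices of the k smallest distances.
  nearest <- order(d)[seq_len(k)]
  # Step 5: count how many of the k neighbours fall in each class.
  votes <- table(train_c[nearest])
  # Step 6: return the class with the most votes
  # (on a tie, which.max() simply picks the first class; a simplification).
  names(votes)[which.max(votes)]
}

# Toy training set (made up): three points of class "a", three of class "b".
train_x <- matrix(c(1, 1,
                    1, 2,
                    2, 1,
                    6, 6,
                    7, 6,
                    6, 7), ncol = 2, byrow = TRUE)
train_c <- c("a", "a", "a", "b", "b", "b")

pred_near_a <- knn_predict(train_x, train_c, c(2, 2), k = 3)
pred_near_b <- knn_predict(train_x, train_c, c(6, 5), k = 3)
```

In this toy data the point (2, 2) has all three nearest neighbours in class "a" and the point (6, 5) has all three in class "b", so the vote is unanimous in both cases.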

If k is even, there might be ties (with more than two classes, ties are
possible for odd k as well).
To break ties, weights are usually given to the observations, so that
nearer observations are more influential in determining which class the
data point belongs to.
An example of such a scheme is giving a weight of 1/d to each of the
observations, where d is the distance to the data point.
If there is still a tie, then the class is chosen randomly.
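
The 1/d weighting can be sketched as follows. This is an illustration only: the helper name `knn_weighted()` and the toy data are hypothetical, and the small constant added to the distance (to avoid division by zero when x coincides with a training point) is an assumption, not something stated in the lecture. The random tie-break for a remaining tie is not implemented here.

```r
# Hypothetical helper: distance-weighted kNN vote with weights 1/d.
knn_weighted <- function(train_x, train_c, x, k) {
  d <- sqrt(rowSums(sweep(train_x, 2, x)^2))
  nearest <- order(d)[seq_len(k)]
  # Weight of each neighbour; 1e-12 guards against d = 0 (assumption).
  w <- 1 / (d[nearest] + 1e-12)
  # Sum the weights per class and pick the class with the largest total.
  score <- tapply(w, train_c[nearest], sum)
  names(score)[which.max(score)]
}

# Toy data (made up): two "a" points at distance 1 from the origin,
# two "b" points at distance 3. With k = 4 the plain vote is tied 2-2.
train_x <- matrix(c(0, 1,
                    1, 0,
                    0, 3,
                    3, 0), ncol = 2, byrow = TRUE)
train_c <- c("a", "a", "b", "b")

pred_weighted <- knn_weighted(train_x, train_c, c(0, 0), k = 4)
```

Here class "a" collects a total weight of about 2 and class "b" about 2/3, so the weighted vote resolves the tie in favour of the nearer class.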

kNN Algorithm Example I

kNN Algorithm Example II
Consider the image above, in which there are two target classes: white
and orange.
There are 26 training samples in total.
We would like to predict the target class of the blue circle.
Taking k = 3, we compute the distance from the blue circle to every
training sample using a measure such as Euclidean distance.
In the image, the circles closest to the blue circle are enclosed in
the big circle.
What will be the predicted class?

How to choose the value of k?
selecting the value of k is the most critical problem
small k ⇒ noise will have a higher influence on the result (overfitting
is very probable)
large k ⇒ computationally expensive; defeats the idea behind kNN
(nearby points have similar classes)
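
The two extremes can be made concrete with a small experiment. This is a sketch on synthetic data generated only for illustration (60 points of class "A", 40 of class "B"). It uses `knn()` from the `class` package and evaluates on the training data itself, which is done here purely to expose the effect.

```r
library(class)  # provides knn()

set.seed(1)
# Synthetic two-class data (made up): 60 points of class "A", 40 of "B".
n_a <- 60
n_b <- 40
x <- rbind(matrix(rnorm(2 * n_a, mean = 0), ncol = 2),
           matrix(rnorm(2 * n_b, mean = 1.5), ncol = 2))
y <- factor(rep(c("A", "B"), times = c(n_a, n_b)))

# k = 1 on the training data: every point is its own nearest neighbour,
# so the training accuracy is perfect and says nothing about new data.
acc_k1 <- mean(knn(x, x, y, k = 1) == y)

# k = n: every training point votes, so each prediction is simply the
# majority class "A", regardless of where the point lies.
pred_kn <- knn(x, x, y, k = nrow(x))
acc_kn <- mean(pred_kn == y)
```

With k = 1 the classifier memorizes the training set, noise included; with k = n it ignores locality altogether and its accuracy equals the share of the majority class (0.6 in this data).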

Implementation in R:
- from scratch:
http://dataaspirant.com/2017/01/02/k-nearest-neighbor-classifier-implementation-r-scratch/
- using implemented functions found in libraries:
http://www.rpubs.com/Drmadhu/IRISclassification

Validation
To optimize the results, we can use cross-validation, one of the
fundamental methods in machine learning for assessing methods and
choosing parameters in a prediction task.
Using cross-validation, we can test the kNN algorithm with different
values of k.
The value of k that gives the best cross-validated accuracy can be
considered the optimal choice.
To find the accuracy, you can compute the confusion matrix.
Often the best approach is to run through each candidate value of k
and compare the results.
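
Running through candidate values of k can be sketched with `knn.cv()` from the `class` package, which performs leave-one-out cross-validation (one particular form of cross-validation). The synthetic data, the class names "neg"/"pos" and the grid of k values below are assumptions made for the example.

```r
library(class)  # provides knn.cv()

set.seed(42)
# Synthetic two-class data (made up for illustration), 50 points per class.
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 2), ncol = 2))
y <- factor(rep(c("neg", "pos"), each = 50))

# Candidate values of k; odd values avoid voting ties with two classes.
k_values <- seq(1, 15, by = 2)

# Leave-one-out cross-validated accuracy for each candidate k.
cv_acc <- sapply(k_values, function(k) mean(knn.cv(x, y, k = k) == y))
names(cv_acc) <- k_values

# The k with the highest cross-validated accuracy (first one on ties).
best_k <- k_values[which.max(cv_acc)]
```

Printing `cv_acc` shows the accuracy per k; differences between neighbouring values of k are often small, so the maximum should be read as a guide rather than a sharp optimum.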

Example I
The dataset: PimaIndiansDiabetes2 from the mlbench package.
This dataset is part of the data collected from one of the numerous
diabetes studies on the Pima Indians, a group of indigenous Americans
who have among the highest prevalence of Type II diabetes in the
world, probably due to a combination of genetic factors and their
relatively recent introduction to a heavily processed Western diet.
768 observations, 9 variables: skin fold thickness, BMI, and so on,
and a binary variable representing whether the patient had diabetes.
Purpose: to train a classifier to predict whether a patient has diabetes
or not.

Example II
many observations available; a good number of predictor variables
available; interesting problem; good mixture of both class outcomes
(35% diabetes-positive observations)
Severely imbalanced datasets can cause problems for some classifiers
and distort our accuracy estimates.
Steps of the kNN method:
split the data 80/20 randomly
visualize the performance of kNN for different values of k using
cross-validation (knnEval() function from the chemometrics package)
perform kNN for a suitable value of k, using knn() function from the
class package
compute the accuracy of the method for the chosen k and determine
the confusion matrix
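
The split, fit and evaluation steps above can be sketched as follows. To keep the example self-contained it uses synthetic data as a stand-in for PimaIndiansDiabetes2; the column names `x1`, `x2`, `label` and the choice k = 5 are assumptions made for the illustration. The cross-validation plot with knnEval() is not reproduced here.

```r
library(class)  # provides knn()

set.seed(123)
# Synthetic stand-in data (made up): two numeric predictors, binary label.
n <- 200
dat <- data.frame(
  x1 = c(rnorm(n / 2, mean = 0), rnorm(n / 2, mean = 2)),
  x2 = c(rnorm(n / 2, mean = 0), rnorm(n / 2, mean = 2)),
  label = factor(rep(c("neg", "pos"), each = n / 2))
)

# Random 80/20 split into training and test data.
train_idx <- sample(nrow(dat), size = round(0.8 * nrow(dat)))
train <- dat[train_idx, ]
test <- dat[-train_idx, ]

# kNN prediction for the test data; k = 5 is an arbitrary choice here.
pred <- knn(train[, c("x1", "x2")], test[, c("x1", "x2")],
            cl = train$label, k = 5)

# Confusion matrix (predicted vs. actual) and overall accuracy.
conf_mat <- table(predicted = pred, actual = test$label)
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
```

The diagonal of `conf_mat` holds the correctly classified test observations, so the accuracy is the diagonal sum divided by the number of test observations.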

Exercises I
1. Iris
Consider the "iris" data set in R. Determine which is the response variable.
Load the data and split it into 2 parts (80%, 20%) that will be the training
data and test data. Train the kNN model for k = 1, 2, 3 and determine
which one is the best (i.e. most accurate) using the test data.
Use the functions: knn() in the package "class", CrossTable() in the package
"gmodels" or knn.cv() from the package "class", confusionMatrix() in the
package "caret".
2. Breast Cancer
To diagnose breast cancer, a doctor draws on experience, analyzing
a) the patient's past medical history and b) the reports of all the tests
performed. At times, diagnosis is difficult even for experienced doctors,
since the information provided by the patient may be unclear or
insufficient. The breast cancer database was obtained from the
University of Wisconsin Hospitals, Madison, from Dr. William H. Wolberg.

Exercises II
It contains 699 samples with 10 attributes. The main objective is to predict
whether a sample is benign or malignant. Use the kNN algorithm to do that
and explain your choice of value for k.
The data can be downloaded at
https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Original).
