Classification
Chapter # 6
Business Analytics Using R - A Practical Approach, by Dr. Umesh R. Hodeghatta & Umesha Nayak
Introduction
Classification and prediction are two important methods of data
analysis used to find patterns in data.
6.1 What Is Classification? What Is Prediction?
1. Classification is a two-step process. In the first step, a
model is constructed by analyzing the database and
the set of attributes that define the class variable. In the
second step, that model is used to classify data whose
class is unknown.
2. A classification problem is a supervised machine-learning
problem. The training data is a sample from the database,
and its class attribute is already known.
3. In a classification problem, the class Y, a
categorical variable, is determined by a set of input
variables {X1, X2, X3, …}. The variable we would like
to predict is typically called the class variable C, and
it takes values in the set {c1, c2, c3, …}.
4. The observed or measured variables X1, X2, …, Xn are
the attributes, or input variables, also called
explanatory variables.
5. In classification, we want to determine the relationship
between the class variable and the inputs, or
explanatory variables.
6. Typically, the model takes the form of classification rules
or mathematical formulas. Once the learning algorithm has
produced these rules, the model can be used to
predict the class of future data for which the class is
unknown.
6.2 Probabilistic Models for Classification
● Probabilistic classifiers, and the naïve Bayes classifier in particular, are among the
most popular classifiers in the machine-learning community.
● The naïve Bayes classifier is a simple probabilistic classifier based on Bayes’
theorem, which is widely used in natural language processing and visual
processing. It is one of the most basic classification techniques, with
applications such as e-mail spam detection, e-mail sorting, and sentiment
analysis.
● Even though naïve Bayes is a simple technique, it provides good performance
in many complex real-world problems.
● Probabilistic classification approximates the joint distribution under an
assumption of independence, and then decomposes that probability into a
product of conditional probabilities. The conditional probability of event A
given event B, denoted by P(A|B), is the chance of event A occurring given
that event B has occurred.
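As a small numeric illustration of P(A|B), the sketch below estimates the chance that an application is approved given a Good credit rating. The ten records are made up for this illustration and do not come from the book's data set.

```r
# Made-up records (an assumption for illustration, not the book's data).
credit_rating <- c("Fair", "Fair", "Excellent", "Excellent", "Fair",
                   "Excellent", "Good", "Good", "Excellent", "Fair")
approval <- c("Denied", "Denied", "Approved", "Approved", "Denied",
              "Approved", "Approved", "Denied", "Approved", "Denied")

# Event A: the application is approved. Event B: the credit rating is Good.
p_b       <- mean(credit_rating == "Good")
p_a_and_b <- mean(approval == "Approved" & credit_rating == "Good")

# P(A|B) = P(A and B) / P(B)
p_a_given_b <- p_a_and_b / p_b
print(p_a_given_b)
```

In this sample one of the two Good-rated records is approved, so the estimate is 0.5.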
Let C1 correspond to the class Approved, and C2 correspond to the class Denied. Using
the naïve Bayes classifier, we want to classify a sample X whose class label is unknown.
To classify a record, first compute the chance of the record belonging to each of the
classes by computing P(Ci|X1, X2, …, Xp) from the training records.
Then classify based on the class with the highest probability. In this example, there
are two classes, so we need to compute P(X|Ci)P(Ci) for i = 1, 2.
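Written out, the rule described above is Bayes' theorem combined with the naïve independence assumption. The symbols x_1, …, x_p for the attribute values of the sample X are notation introduced here for the formula.

```latex
P(C_i \mid X) = \frac{P(X \mid C_i)\, P(C_i)}{P(X)},
\qquad
P(X \mid C_i) \approx \prod_{k=1}^{p} P(x_k \mid C_i),
\qquad
\hat{C} = \arg\max_{i \in \{1, 2\}} \; P(C_i) \prod_{k=1}^{p} P(x_k \mid C_i)
```

The denominator P(X) is the same for both classes, so it can be dropped when only the most probable class is needed.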
6.2.2 Naïve Bayes Classifier Using R
● Let’s build the model by using R, with the same example. The sample data set has the
attributes Age, Purchase Frequency, and Credit Rating.
● The class label attribute has two distinct classes: Approved or Denied.
● The objective is to predict the class label for a new sample, where Age > 40,
Purchase Frequency = Medium, and Credit Rating = Excellent.
Step no. 2: Load the “e1071” package
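Loading the package would typically look like the lines below. The naiveBayes() function lives in the e1071 package; the commented install line is only needed once, if the package is not already present.

```r
# install.packages("e1071")  # one-time install if the package is missing
library(e1071)               # provides naiveBayes()
```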
Step no. 3: Generate naiveBayes model
[Figure: model-building call, with callouts marking the data set and the classification variable]
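A minimal sketch of this step, assuming a training data frame with the attributes named in the example. The column names, the class column Approval, and the ten records are made up for illustration; they are not the book's training data.

```r
library(e1071)

# Hypothetical training sample (not the book's data). The columns are stored
# as factors so that each category is treated as a level.
train <- data.frame(
  Age = c("<35", "<35", "35-40", ">40", ">40",
          ">40", "35-40", "<35", ">40", "35-40"),
  PurchaseFrequency = c("Low", "Medium", "High", "Medium", "Low",
                        "High", "Medium", "High", "Medium", "Low"),
  CreditRating = c("Fair", "Fair", "Excellent", "Excellent", "Fair",
                   "Excellent", "Good", "Good", "Excellent", "Fair"),
  Approval = c("Denied", "Denied", "Approved", "Approved", "Denied",
               "Approved", "Approved", "Denied", "Approved", "Denied"),
  stringsAsFactors = TRUE
)

# Data set: train. Classification variable: Approval (left of the ~).
# The dot means "use every other column as an input variable".
model <- naiveBayes(Approval ~ ., data = train)
```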
Step no. 4: View naiveBayes model
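Printing a fitted naiveBayes object shows the class priors and one conditional probability table per attribute. The sketch below fits a model on a made-up sample, reduced to a single hypothetical attribute to keep the output short, and then looks at those pieces individually.

```r
library(e1071)

# Hypothetical sample (not the book's data), one attribute plus the class.
train <- data.frame(
  CreditRating = c("Fair", "Fair", "Excellent", "Excellent", "Fair",
                   "Excellent", "Good", "Good", "Excellent", "Fair"),
  Approval = c("Denied", "Denied", "Approved", "Approved", "Denied",
               "Approved", "Approved", "Denied", "Approved", "Denied"),
  stringsAsFactors = TRUE
)
model <- naiveBayes(Approval ~ ., data = train)

print(model)                      # priors and conditional probability tables
print(model$apriori)              # class counts behind the priors P(Ci)
print(model$tables$CreditRating)  # P(CreditRating | class), one row per class
```

In this sample no Denied record has an Excellent rating, so that cell of the table is zero.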
Step no. 5: Load Data to be predicted
creditratingpredicted <- read.csv("CreditRating_to_be_predicted.csv")
Step no. 6: Predict output of test data
[Figure: prediction call, with callouts marking the name of the data set, the model name, and the predictions]
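A minimal sketch of the prediction step for the new sample from the example (Age > 40, Purchase Frequency = Medium, Credit Rating = Excellent). The training records and column names are made up for illustration, and the new sample is built inline rather than read from a file.

```r
library(e1071)

# Hypothetical training sample (not the book's data).
train <- data.frame(
  Age = c("<35", "<35", "35-40", ">40", ">40",
          ">40", "35-40", "<35", ">40", "35-40"),
  PurchaseFrequency = c("Low", "Medium", "High", "Medium", "Low",
                        "High", "Medium", "High", "Medium", "Low"),
  CreditRating = c("Fair", "Fair", "Excellent", "Excellent", "Fair",
                   "Excellent", "Good", "Good", "Excellent", "Fair"),
  Approval = c("Denied", "Denied", "Approved", "Approved", "Denied",
               "Approved", "Approved", "Denied", "Approved", "Denied"),
  stringsAsFactors = TRUE
)
model <- naiveBayes(Approval ~ ., data = train)

# Reusing the training levels keeps the factor coding of the new sample
# aligned with the model's probability tables.
new_sample <- data.frame(
  Age = factor(">40", levels = levels(train$Age)),
  PurchaseFrequency = factor("Medium", levels = levels(train$PurchaseFrequency)),
  CreditRating = factor("Excellent", levels = levels(train$CreditRating))
)

pred <- predict(model, new_sample)                   # predicted class label
post <- predict(model, new_sample, type = "raw")     # posterior per class
print(pred)
print(post)
```

When the data to be predicted is read from a CSV file, its text columns may first need converting to factors with the same levels as the training data.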
6.2.3 Advantages and Limitations of the Naïve Bayes Classifier
● The main problem with the naïve Bayes classifier is that the
classification depends on the posterior probability, which is a
product of conditional probabilities. When a predictor category
is not present in the training data for a class, the model assigns
it zero probability, and the whole product for that class becomes zero.
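The zero-probability problem can be seen on a few made-up counts. The sketch also shows Laplace (add-one) smoothing, a common remedy that is introduced here as an addition and is not part of the text above; e1071's naiveBayes() offers it through its laplace argument.

```r
# Made-up credit ratings observed among Denied records (illustration only).
# "Excellent" is a known category that never occurs in this class.
credit_rating <- factor(c("Fair", "Fair", "Fair", "Fair", "Good"),
                        levels = c("Excellent", "Fair", "Good"))
counts <- table(credit_rating)

# Plain relative frequencies: the unseen category gets probability zero.
p_raw <- counts / sum(counts)

# Laplace smoothing: add a pseudo-count to every category before normalizing.
laplace  <- 1
p_smooth <- (counts + laplace) / (sum(counts) + laplace * nlevels(credit_rating))

print(p_raw)
print(p_smooth)
```

With the pseudo-count, the unseen category keeps a small nonzero probability, so a single missing combination no longer wipes out the whole product for that class.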
6.3 Decision Trees
● A decision tree builds a classification model by using a tree structure.
● A decision tree structure consists of a root node, branches, and leaf nodes. Leaf
nodes hold the class label, each branch denotes the outcome of the decision-
tree test, and the internal nodes denote the decision points.
6.3.1 Recursive Partitioning Decision-Tree Algorithm
6.3.2 Information Gain
● To select the attribute on which to split a decision-tree node, we measure the information
provided by each attribute.
● Such a measure is referred to as a measure of the goodness of split.
● The attribute with the highest information gain is chosen as the test attribute for the node to split.
● This attribute minimizes the information needed to classify the samples in the recursive partition
nodes. This approach of splitting reduces the number of tests needed to classify an object and
tends to produce a simple tree.
● Many algorithms use entropy to calculate the homogeneity of a sample.
13141 25 M Y
13142 24 F N
13143 23 F N
13144 25 F N
13145 26 M Y
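Entropy and information gain can be computed by hand on the five rows above. The rows carry no headers, so treating the third column as a gender attribute and the last column as the class label is an assumption made for this sketch; the function names are also introduced here.

```r
# Assumed reading of the five rows: third column = gender, last = class.
gender <- c("M", "F", "F", "F", "M")
cls    <- c("Y", "N", "N", "N", "Y")

# Entropy of a set of class labels: -sum(p * log2(p)) over the classes present.
entropy <- function(labels) {
  p <- table(labels) / length(labels)
  p <- p[p > 0]
  -sum(p * log2(p))
}

# Information gain of an attribute: entropy before the split minus the
# size-weighted entropy of the groups the attribute creates.
info_gain <- function(attribute, labels) {
  groups <- split(labels, attribute)
  weighted <- sum(sapply(groups, function(g) {
    length(g) / length(labels) * entropy(g)
  }))
  entropy(labels) - weighted
}

print(entropy(cls))            # entropy of the class labels before any split
print(info_gain(gender, cls))  # gain from splitting on the assumed gender column
```

Under this reading, each gender group contains a single class, so the split removes all of the entropy and the gain equals the entropy of the class labels (about 0.971 bits).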
6.3.3 Example of a Decision Tree
6.3.4 Induction of a Decision Tree
● In this example, CreditRating has the highest information gain, so it is used as the root node, and branches are grown for
each of its attribute values.
● The next tree branch node is based on the remaining two attributes, Age and PurchaseFrequency. Both have almost the same
information gain, so either can be used as the split node for the branch. We have taken Age
as the split node for the branch.
● The rest of the branches are partitioned with the remaining samples. For Age < 35, the decision is clear. For the
other Age categories, PurchaseFrequency has to be examined before making the loan approval decision. This
involves calculating the information gain for the rest of the samples and identifying the next split.
6.3.5 Decision Tree classifier using rpart (recursive partitioning)
[Figure: rpart call, with callouts: Input Variables; Apply rpart Model; Method is classification]
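A minimal sketch of an rpart classification call on a made-up sample. The column names and records are hypothetical, and the control settings are relaxed only because ten records are too few for rpart's default minimum split size.

```r
library(rpart)

# Hypothetical training sample (not the book's data).
train <- data.frame(
  Age = c("<35", "<35", "35-40", ">40", ">40",
          ">40", "35-40", "<35", ">40", "35-40"),
  PurchaseFrequency = c("Low", "Medium", "High", "Medium", "Low",
                        "High", "Medium", "High", "Medium", "Low"),
  CreditRating = c("Fair", "Fair", "Excellent", "Excellent", "Fair",
                   "Excellent", "Good", "Good", "Excellent", "Fair"),
  Approval = c("Denied", "Denied", "Approved", "Approved", "Denied",
               "Approved", "Approved", "Denied", "Approved", "Denied"),
  stringsAsFactors = TRUE
)

# Class variable on the left of ~, input variables on the right;
# method = "class" requests a classification tree.
fit <- rpart(Approval ~ Age + PurchaseFrequency + CreditRating,
             data = train, method = "class",
             control = rpart.control(minsplit = 2, cp = 0, xval = 0))
print(fit)

# Classify the new sample from the example.
new_sample <- data.frame(
  Age = factor(">40", levels = levels(train$Age)),
  PurchaseFrequency = factor("Medium", levels = levels(train$PurchaseFrequency)),
  CreditRating = factor("Excellent", levels = levels(train$CreditRating))
)
pred <- predict(fit, new_sample, type = "class")
print(pred)
```

On real data the default control settings are usually a more sensible starting point; growing the tree this deep on a tiny sample is for demonstration only.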
6.4.1 K-Nearest Neighbor