
Classification Techniques

Decision Trees
Contents
Introduction
Terminology
Example
Types of Decision Trees
Binary Variable Decision Tree
Continuous Variable Decision Tree
Advantages/Disadvantages
Introduction
Definition: a type of supervised learning algorithm that is mostly used
in classification problems.
Works for both categorical/discrete and continuous input and output
variables.
In a nutshell: we split the population or sample into two or more
homogeneous sets (or sub-populations) based on the most significant
splitter / differentiator among the input variables.
A decision tree will keep splitting the data until every subset is pure,
i.e. the tree classifies the training examples perfectly.
It is recommended to perform a dimensionality reduction beforehand,
using PCA or similar techniques, to avoid growing a very large tree.
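A minimal sketch of this behavior, using scikit-learn and synthetic data (both are assumptions; the slides name no library): left unconstrained, the tree keeps splitting until it classifies the training set perfectly.

    # Hedged illustration: scikit-learn and the toy data are assumed.
    from sklearn.datasets import make_classification
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=200, n_features=5, random_state=0)

    # No depth limit: the tree keeps splitting until every leaf is pure.
    tree = DecisionTreeClassifier(random_state=0)
    tree.fit(X, y)
    print(tree.score(X, y))  # typically 1.0 on the training data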
Terminology
Classification Problem: the task of assigning objects to one of
several predefined categories. Some examples include:
Detecting spam email messages based upon the header and content.
Categorizing cells as malignant or benign based upon the MRI scans.
Classifying galaxies based upon their shapes.
Exploring discriminative (significant) features (variables).

Supervised learning is a type of machine learning algorithm that
uses a known dataset (called the training dataset) to make predictions.
The training dataset includes input data and response values.
The supervised learning algorithm seeks to build a model that can
make predictions of the response values for a new dataset.
Terminology
Root Node: represents the entire population or sample, which
further gets divided into two or more homogeneous sets.
Splitting: the process of dividing a node into two or more sub-nodes.
Decision Node: a sub-node that splits into further sub-nodes.
Leaf / Terminal Node: a node that does not split any further.
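These node types can be read off directly by printing a fitted tree. A small sketch (scikit-learn and the iris dataset are assumptions, used only for illustration):

    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier, export_text

    iris = load_iris()
    tree = DecisionTreeClassifier(max_depth=2, random_state=0)
    tree.fit(iris.data, iris.target)

    # The first split is the root node, inner splits are decision nodes,
    # and lines ending in "class: ..." are leaf / terminal nodes.
    print(export_text(tree, feature_names=list(iris.feature_names)))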
Example
A sample of 30 students with three variables:
Gender (Boy / Girl)
Class (IX / X)
Height (5 to 6 ft; in this example reduced to two values: < 5.5 ft and >= 5.5 ft)
15 out of these 30 play cricket in their leisure time.
Goal: create a model to predict who will play cricket during the leisure
period.
We need to segregate the students who play cricket in their leisure time
based on the most significant of the three input variables.
A decision tree will segregate the students based on all values of the
three variables and identify the best splitting variable.
Example
The variable Gender identifies the most homogeneous sets compared to the other two variables; Gender is the dominant splitter in this example. A hedged impurity calculation is sketched below.
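One way to score the candidate splitters is weighted Gini impurity. In the sketch below the per-group counts are hypothetical, chosen only to mirror the slide's 30-student setup (15 players in total):

    # Hypothetical counts: each tuple is (players, non-players) in a sub-group.
    splits = {
        "Gender": [(2, 8), (13, 7)],    # girls, boys (assumed counts)
        "Class":  [(6, 8), (9, 7)],     # IX, X (assumed counts)
        "Height": [(6, 6), (9, 9)],     # < 5.5 ft, >= 5.5 ft (assumed counts)
    }

    def gini(pos, neg):
        p = pos / (pos + neg)
        return 2 * p * (1 - p)  # Gini impurity for a binary outcome

    for var, groups in splits.items():
        total = sum(p + q for p, q in groups)
        score = sum((p + q) / total * gini(p, q) for p, q in groups)
        print(var, round(score, 3))  # lower weighted impurity => better splitter

With these counts Gender gets the lowest weighted impurity, matching the slide's conclusion.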
Types of Decision Trees
Binary Variable Decision Tree:
Has a binary target variable.
Example: in the student scenario above, the target variable was
whether the student will play cricket or not, i.e. YES or NO.
Types of Decision Trees
Continuous Variable Decision Tree:
Has a continuous target variable.
Example: predicting whether a customer will pay his renewal
premium with an insurance company (yes / no) is originally a
binary target problem.
Income of the customer is a significant variable, but the insurance
company does not have income details for all customers.
We can build a decision tree to predict customer income based
on occupation, product, and various other variables.
In this case, we are predicting values of a continuous variable.
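A minimal sketch of such a regression tree (scikit-learn is assumed, and the encoded features and income figures are hypothetical placeholders):

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    # Hypothetical encoded features: occupation code, product code, age.
    X = np.array([[0, 1, 34], [1, 0, 45], [2, 1, 29], [1, 2, 52], [0, 2, 41]])
    incomes = np.array([38_000, 52_000, 41_000, 67_000, 48_000])  # made-up targets

    reg = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, incomes)
    print(reg.predict([[1, 1, 40]]))  # predicted income for a new customer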
Advantages
Easy to Understand:
even for people from a non-analytical background; it does not require any
statistical knowledge to read and interpret them.
Its graphical representation is very intuitive, and users can easily relate
it to their own hypotheses.

Useful in data exploration:
One of the fastest ways to identify the most significant variables and the
relations between two or more variables.
With the help of decision trees, we can create new variables / features
that have better power to predict the target variable, as sketched below.
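A hedged sketch of this exploration step: a fitted tree exposes impurity-based importances that rank the input variables (scikit-learn and the breast-cancer dataset are assumptions):

    from sklearn.datasets import load_breast_cancer
    from sklearn.tree import DecisionTreeClassifier

    data = load_breast_cancer()
    tree = DecisionTreeClassifier(max_depth=3, random_state=0)
    tree.fit(data.data, data.target)

    # Rank variables by how much they reduce impurity across the tree.
    ranked = sorted(zip(data.feature_names, tree.feature_importances_),
                    key=lambda t: t[1], reverse=True)
    for name, importance in ranked[:5]:
        print(f"{name}: {importance:.3f}")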
Advantages
Less data cleaning required:
Requires less data cleaning compared to some other modeling techniques.
Fairly robust to outliers and missing values.

Data type is not a constraint:
It can handle both numerical and categorical variables.

Non-parametric method:
Decision trees are considered a non-parametric method: they make no
assumptions about the space distribution or the classifier structure.
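One caveat worth sketching: while the method itself accepts both data types, a concrete implementation such as scikit-learn expects numeric input, so categorical variables are typically encoded first (pandas and the toy columns below are assumptions):

    import pandas as pd
    from sklearn.tree import DecisionTreeClassifier

    # Hypothetical mixed-type data: one categorical, one numerical variable.
    df = pd.DataFrame({"gender": ["boy", "girl", "boy", "girl"],
                       "height": [5.7, 5.2, 5.9, 5.4],
                       "plays":  [1, 0, 1, 0]})

    # One-hot encode the categorical column before fitting.
    X = pd.get_dummies(df[["gender", "height"]])
    DecisionTreeClassifier(random_state=0).fit(X, df["plays"])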
Disadvantages
Overfitting:
One of the most practical difficulties for decision tree models.
As the decision tree keeps splitting the data, the tree gets bigger and
bigger.
As it gets bigger, it becomes more and more accurate on the training
data.
But at some point it becomes less accurate on data it has not
encountered before, i.e. the data we are not using to train.
Eventually the tree grows until it has effectively memorized the data set.
The performance rises at first, then starts dropping.
The algorithm becomes too specific to the data you used to train it,
and it will not generalize to examples it has not seen before.
Solution: try to grow a tree that is not too large (e.g. reduce the input
dimensionality with PCA, or use random forests). The depth sweep
sketched below reproduces the rise-then-drop pattern.
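A short experiment reproducing that pattern (scikit-learn and the synthetic data are assumptions):

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                               random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    for depth in (1, 2, 4, 8, 16, None):  # None = grow until leaves are pure
        tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
        tree.fit(X_tr, y_tr)
        # Training accuracy keeps climbing; test accuracy peaks, then drops.
        print(depth, tree.score(X_tr, y_tr), tree.score(X_te, y_te))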
Disadvantages
Not fit for continuous variables:
While working with continuous numerical variables, a decision
tree loses information when it categorizes the variables into
different ranges.
Possible remedies (dimensionality reduction techniques that keep the tree small):
Missing Values: drop variables with a high ratio of missing values.
Low Variance: drop variables whose values barely vary; they carry little information.
Random Forest: use the feature importances of a tree ensemble to select variables.
High Correlation:
Use a Pearson (continuous variables) or polychoric (discrete
variables) correlation matrix to identify the variables with high
correlation, and select among them using the VIF (Variance Inflation
Factor).
Variables having a higher value (VIF > 5) can be dropped, as sketched below.
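A sketch of that VIF check using statsmodels (the library choice and the column values are assumptions):

    import pandas as pd
    import statsmodels.api as sm
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    # Hypothetical continuous candidate variables; "spend" tracks "income".
    df = pd.DataFrame({"income": [30, 45, 28, 60, 52],
                       "spend":  [12, 20, 11, 27, 22],
                       "age":    [25, 40, 31, 55, 47]})

    X = sm.add_constant(df)  # include an intercept so the VIFs are meaningful
    for i, col in enumerate(X.columns):
        if col != "const":
            print(col, round(variance_inflation_factor(X.values, i), 1))  # drop if > 5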

Backward Feature Elimination: compute the sum of squared
residuals (SSR) with and without each variable, and repeatedly drop
the variable whose removal increases the error the least (a sketch follows).
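A minimal sketch of that loop under assumed data, using a linear model purely to score the error:

    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.linear_model import LinearRegression

    X, y = make_regression(n_samples=100, n_features=6, noise=10, random_state=0)
    features = list(range(X.shape[1]))

    def ssr(cols):
        pred = LinearRegression().fit(X[:, cols], y).predict(X[:, cols])
        return np.sum((y - pred) ** 2)  # sum of squared residuals

    while len(features) > 3:  # stopping at 3 variables is an arbitrary choice
        # Drop the variable whose absence hurts the fit the least.
        weakest = min(features, key=lambda f: ssr([c for c in features if c != f]))
        features.remove(weakest)
        print("dropped", weakest, "remaining", features)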
Factor Analysis:
EFA (Exploratory Factor Analysis)
CFA (Confirmatory Factor Analysis)
