Decision Trees
Contents
Introduction
Terminology
Example
Types of Decision Trees
Binary Variable Decision Tree
Continuous Variable Decision Tree
Advantages/Disadvantages
Introduction
Definition: a type of supervised learning algorithm that is mostly used
in classification problems.
Works for both categorical/discrete and continuous input and output
variables.
In a nutshell: We split the population or sample into two or more
homogeneous sets (or sub-populations) based on the most significant
splitter / differentiator among the input variables.
Unless it is stopped early or pruned, a decision tree keeps splitting the
data until every subset is pure, i.e. until it classifies the training
examples perfectly.
It is recommended to perform dimensionality reduction beforehand, using
PCA or similar techniques, to avoid getting a very large tree.
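To make the splitting idea concrete, here is a minimal sketch of how one "most significant" split could be chosen for a single numeric input. It assumes Gini impurity as the measure of homogeneity, which is one common choice and not necessarily the one intended in these slides; the function names and the toy data are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means the set is pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(values, labels):
    """Return (threshold, weighted_impurity) for the best split 'value <= threshold'."""
    n = len(values)
    best = (None, gini(labels))  # baseline: impurity of the unsplit set
    distinct = sorted(set(values))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for lo, hi in zip(distinct, distinct[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / n
        if weighted < best[1]:
            best = (threshold, weighted)
    return best

# Hypothetical toy data: customer age and whether the customer bought a product.
ages = [22, 25, 30, 35, 40, 45]
bought = ["no", "no", "no", "yes", "yes", "yes"]

threshold, impurity = best_split(ages, bought)
print(threshold, impurity)  # 32.5 0.0 (both subsets are pure)
```

A full tree would apply the same search recursively to each subset, over every input variable, until the subsets are pure or a stopping rule is reached.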
Terminology
Classification Problem: the task of assigning objects to one of
several predefined categories. Some examples include:
Detecting spam email messages based upon the header and content.
Categorizing cells as malignant or benign based upon the MRI scans.
Classifying galaxies based upon their shapes.
Exploring discriminative (significant) features (variables).
Types of Decision Trees
Continuous Variable Decision Tree:
Has a continuous target variable.
Example: an insurance company wants to predict whether a customer
will pay the renewal premium (yes/no), which is originally a
binary target variable.
Income of the customer is a significant variable, but the insurance
company does not have income details for all customers.
We can build a decision tree to predict customer income based
on occupation, product and various other variables.
In this case, we are predicting values for a continuous variable.
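The income example can be sketched as a one-level regression tree. This is only an illustration under stated assumptions: the split criterion is the sum of squared errors (SSE) around each side's mean, a common choice for continuous targets that the slides do not specify, and the occupations, incomes and function names are all hypothetical.

```python
def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)

def sse(values):
    """Sum of squared errors around the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    m = mean(values)
    return sum((v - m) ** 2 for v in values)

def best_category_split(categories, targets):
    """Find the category whose 'is / is not' split minimises the total SSE."""
    best_cat, best_error = None, sse(targets)
    for cat in sorted(set(categories)):
        inside = [t for c, t in zip(categories, targets) if c == cat]
        outside = [t for c, t in zip(categories, targets) if c != cat]
        error = sse(inside) + sse(outside)
        if error < best_error:
            best_cat, best_error = cat, error
    return best_cat, best_error

# Hypothetical data: occupation and yearly income (in thousands).
occupation = ["clerk", "clerk", "engineer", "engineer", "teacher", "teacher"]
income = [30.0, 32.0, 80.0, 84.0, 40.0, 42.0]

split_cat, error = best_category_split(occupation, income)
inside_mean = mean([t for c, t in zip(occupation, income) if c == split_cat])
outside_mean = mean([t for c, t in zip(occupation, income) if c != split_cat])

def predict_income(job):
    """Predict income as the mean target value on the matching side of the split."""
    return inside_mean if job == split_cat else outside_mean

print(split_cat, predict_income("engineer"), predict_income("clerk"))
```

Each leaf predicts the mean income of the training customers that fall into it, which is what makes the output continuous and not a class label.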
Advantages
Easy to Understand:
even for people from a non-analytical background. No statistical
knowledge is required to read and interpret a decision tree.
Its graphical representation is very intuitive and users can easily relate
it to their hypotheses.
High Correlation
Use a Pearson (continuous variables) or polychoric (discrete
variables) correlation matrix to identify the variables with high
correlation and select one of them using VIF (Variance Inflation
Factor).
Variables having a higher value (VIF > 5) can be dropped.
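The correlation and VIF check can be sketched in plain Python for continuous variables. This sketch covers only the Pearson case (polychoric correlation is not implemented) and relies on the identity that the VIF of each variable equals the corresponding diagonal entry of the inverse correlation matrix. The helper names and the data are hypothetical; the cut-off of 5 is the one stated above.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def correlation_matrix(columns):
    """Pairwise Pearson correlations between all columns."""
    return [[pearson(a, b) for b in columns] for a in columns]

def invert(matrix):
    """Gauss-Jordan matrix inversion with partial pivoting."""
    n = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular matrix: some variables are perfectly collinear")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def vif(columns):
    """VIF of each column: the diagonal of the inverse correlation matrix."""
    inv = invert(correlation_matrix(columns))
    return [inv[i][i] for i in range(len(columns))]

# Hypothetical data: x2 is almost a copy of x1, x3 is largely unrelated.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0]
x3 = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0]

vifs = vif([x1, x2, x3])
to_review = [i for i, v in enumerate(vifs) if v > 5]  # indices of candidates to drop
print(vifs, to_review)
```

When two variables both exceed the threshold because they duplicate each other, as x1 and x2 do here, dropping one of them and recomputing is usually enough.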