Classification

Decision Trees: what they are and how they work
  - Hunt's (TDIDT) algorithm
  - How to select the best split
  - How to handle:
    - Inconsistent data
    - Continuous attributes
    - Missing values
  - Overfitting

Reading: Sections 4.1-4.3, 4.4.1, 4.4.2, and 4.4.5 of the course book
Classification: Applying the Model

Test set (Class unknown):

  Tid  Attrib1  Attrib2  Attrib3  Class
  11   No       Small    55K      ?
  12   Yes      Medium   80K      ?
  13   Yes      Large    110K     ?
  14   No       Small    95K      ?
  15   No       Large    67K      ?

Learned model (decision rules):
  - Attrib1 = Yes                 =>  Class = No
  - Attrib1 = No, Attrib3 < 95K   =>  Class = Yes

TNM033: Introduction to Data Mining
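As a minimal sketch, the rules above can be applied to the test set in Python. The attribute names (Attrib1, Attrib3) are the slide's placeholders; the behaviour of the uncovered branch (Attrib1 = No and Attrib3 >= 95K predicting No) is an assumption for illustration.

```python
def classify(record):
    """Classify one record with the decision rules above.

    Assumption: records with Attrib1 = No and Attrib3 >= 95K
    (not covered by an explicit rule) are assigned Class = No.
    """
    if record["Attrib1"] == "Yes":
        return "No"
    # Attrib1 == "No": test the continuous attribute
    if record["Attrib3"] < 95_000:
        return "Yes"
    return "No"

# The test set from the slide (55K written as 55_000, etc.)
test_set = [
    {"Tid": 11, "Attrib1": "No",  "Attrib3": 55_000},
    {"Tid": 12, "Attrib1": "Yes", "Attrib3": 80_000},
    {"Tid": 13, "Attrib1": "Yes", "Attrib3": 110_000},
    {"Tid": 14, "Attrib1": "No",  "Attrib3": 95_000},
    {"Tid": 15, "Attrib1": "No",  "Attrib3": 67_000},
]

for r in test_set:
    print(r["Tid"], classify(r))
```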
Decision Trees

Training data:

  Tid  Refund  Marital Status  Taxable Income  Cheat
  1    Yes     Single          125K            No
  2    No      Married         100K            No
  3    No      Single          70K             No
  4    Yes     Married         120K            No
  5    No      Divorced        95K             Yes
  6    No      Married         60K             No
  7    Yes     Divorced        220K            No
  8    No      Single          85K             Yes
  9    No      Married         75K             No
  10   No      Single          90K             Yes

One tree that fits this data (splitting attributes: Refund, MarSt, TaxInc):

  Refund?
  |- Yes => NO
  |- No  => MarSt?
            |- Married          => NO
            |- Single, Divorced => TaxInc?
                                  |- < 80K => NO
                                  |- > 80K => YES

Note: there could be more than one tree that fits the same data!
Applying the Model to Test Data

Start from the root of the tree and follow the branch that matches the record at each node.

Test record:

  Refund  Marital Status  Taxable Income  Cheat
  No      Married         80K             ?

  Refund? -> No -> MarSt? -> Married -> predict Cheat = NO
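The walkthrough above can be written as an explicit traversal of the tree. One detail the slide leaves open is which branch the boundary value 80K falls into; placing it in the NO branch here is an assumption (the test record is classified NO at the MarSt node either way).

```python
def classify_cheat(refund, marital_status, taxable_income):
    """Traverse the Refund -> MarSt -> TaxInc tree from the slides.

    Assumption: taxable income of exactly 80K falls in the NO branch.
    """
    if refund == "Yes":
        return "No"
    if marital_status == "Married":
        return "No"
    # Single or Divorced: test the continuous attribute
    return "Yes" if taxable_income > 80_000 else "No"

# The test record from the slide: Refund=No, Married, 80K
print(classify_cheat("No", "Married", 80_000))  # predicts "No"
```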
Tree Induction

Issues:
  1. Determine how to split the records
     - Which attribute to use in a split node?
     - How to specify the attribute test condition?
       E.g., X < 1? or X + Y < 1?
     - Shall we use a 2-way split or a multi-way split?
     - How to determine the best split?
  2. Determine when to stop splitting
A node with a non-homogeneous class distribution has a high degree of impurity; a homogeneous node has a low degree of impurity. Common impurity measures:
  - Entropy
  - Gini index
  - Misclassification error
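The three measures can be sketched directly from their definitions. A pure (homogeneous) node scores 0 under all three; a 50/50 two-class node is maximally impure (entropy 1, Gini 0.5, error 0.5).

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy(t) = -sum_j p(j|t) * log2 p(j|t)."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """GINI(t) = 1 - sum_j [p(j|t)]^2."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def misclassification_error(labels):
    """Error(t) = 1 - max_i P(i|t)."""
    n = len(labels)
    return 1 - max(Counter(labels).values()) / n

pure  = ["No"] * 6                      # homogeneous node
mixed = ["Yes"] * 3 + ["No"] * 3        # maximally impure node
print(entropy(mixed), gini(mixed), misclassification_error(mixed))
```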
Comparing two candidate splits A? and B? (each with Yes/No branches): splitting on A produces children with impurities M1 and M2 (combined: M12), splitting on B produces M3 and M4 (combined: M34). Choose the split with the highest gain:

  Gain_split = M0 - M12   vs.   Gain_split = M0 - M34

where M0 is the impurity of the parent node before splitting.
How to Find the Best Split

Before splitting, the parent node N0 has class counts C0: N00 and C1: N01, and impurity M0. Candidate splits A? and B? (Yes/No branches) produce child nodes N1..N4 with impurities M1..M4.

Entropy of a node t:

  Entropy(t) = - \sum_j p(j|t) \log_2 p(j|t)
Information Gain:

  GAIN_split = Entropy(p) - \sum_{i=1}^{k} (n_i / n) Entropy(i)

Gain Ratio:

  GainRATIO_split = GAIN_split / SplitINFO

Split Information:

  SplitINFO = - \sum_{i=1}^{k} (n_i / n) \log_2 (n_i / n)

where the parent node p is split into k partitions, n is the number of records at p, and n_i is the number of records in partition i.
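As a sketch, the gain and gain-ratio formulas can be checked on the Refund split of the 10-record training data above (Refund=Yes: 3 records, all Cheat=No; Refund=No: 4 No and 3 Yes). The function names here are illustrative, not from any particular library.

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(parent_labels, partitions):
    """GAIN_split = Entropy(p) - sum_i (n_i/n) * Entropy(partition i)."""
    n = len(parent_labels)
    weighted = sum(len(p) / n * entropy(p) for p in partitions)
    return entropy(parent_labels) - weighted

def split_info(partitions):
    """SplitINFO = -sum_i (n_i/n) * log2(n_i/n)."""
    n = sum(len(p) for p in partitions)
    return -sum(len(p) / n * log2(len(p) / n) for p in partitions if p)

def gain_ratio(parent_labels, partitions):
    return information_gain(parent_labels, partitions) / split_info(partitions)

# Cheat labels of the 10 training records, split on Refund
parent     = ["No"] * 7 + ["Yes"] * 3
refund_yes = ["No"] * 3
refund_no  = ["No"] * 4 + ["Yes"] * 3

print(round(information_gain(parent, [refund_yes, refund_no]), 3))
print(round(gain_ratio(parent, [refund_yes, refund_no]), 3))
```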
Example (32 records split into up to four partitions):

  A=1  A=2  A=3  A=4  SplitINFO
  32   0    0    0    0
  16   16   0    0    1
  16   8    8    0    1.5
  16   8    4    4    1.75
  8    8    8    8    2
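The table can be reproduced from the SplitINFO formula: the more (and more evenly sized) partitions a split creates, the larger SplitINFO becomes, which is what penalizes many-valued attributes in the gain ratio.

```python
from math import log2

def split_info_from_counts(counts):
    """SplitINFO from partition sizes; empty partitions contribute 0."""
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c > 0)

# The rows of the SplitINFO table above
for counts in [(32, 0, 0, 0), (16, 16, 0, 0), (16, 8, 8, 0),
               (16, 8, 4, 4), (8, 8, 8, 8)]:
    print(counts, split_info_from_counts(counts))
```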
Gini index of a node t:

  GINI(t) = 1 - \sum_j [p(j|t)]^2

Misclassification error of a node t:

  Error(t) = 1 - \max_i P(i|t)

Example class counts at a node: Class=Yes: 0, Class=No: 3 (a pure node); Cheat=Yes: 2, Cheat=No: 4 (an impure node).
Overfitting

A tree that fits the training data too well may not be a good classifier for new examples. Overfitting results in decision trees that are more complex than necessary.

Estimating error rates (using statistical techniques):
  - Re-substitution error: error on the training data set (training error)
  - Generalization error: error on a test data set (test error)

Typically, 2/3 of the data set is reserved for model building and 1/3 for error estimation.
  - Disadvantage: less data is available for training.
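The 2/3 vs 1/3 holdout scheme can be sketched as a shuffle-and-split; the function name and fixed seed are illustrative choices, not part of the course material.

```python
import random

def holdout_split(records, train_fraction=2 / 3, seed=0):
    """Shuffle and split: ~2/3 for model building, ~1/3 for error estimation."""
    rng = random.Random(seed)          # fixed seed for reproducibility
    shuffled = records[:]
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

train, test = holdout_split(list(range(30)))
print(len(train), len(test))  # 20 10
```

The test error measured on the held-out third estimates the generalization error; the price, as noted above, is that only two thirds of the data remain for training.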
Underfitting: when the model is too simple, both training and test errors are large.
CART (Breiman et al.)

CART builds multivariate binary decision trees: a split node may test a linear combination of attributes, e.g. x + y < 1, rather than a single attribute. The line x + y = 1 separates the region labeled Class = + from the region labeled Class = -. CART is available in WEKA as SimpleCART.
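A multivariate (oblique) test like x + y < 1 is just a linear inequality over two attributes; which side of the boundary is labeled + is an assumption here, since the slide's figure does not survive extraction.

```python
def oblique_split(x, y):
    """Multivariate test x + y < 1, as in CART's oblique splits.

    Assumption: the x + y < 1 side is labeled "+", the other "-".
    """
    return "+" if x + y < 1 else "-"

print(oblique_split(0.2, 0.3))  # below the boundary line
print(oblique_split(0.9, 0.8))  # above the boundary line
```

A univariate tree can only approximate such a diagonal boundary with a staircase of axis-parallel splits, which is why oblique splits can yield much smaller trees on this kind of data.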