[Figure: Decision Tree Classification Task — a learning algorithm induces a model from the Training Set (Tid, Attrib1, Attrib2, Attrib3, Class; all class labels known), and the model is then applied to the Test Set, whose class labels are unknown (e.g. record 11: No, Small, 55K, ? and record 15: No, Large, 67K, ?).]
Training data:

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

A decision tree that fits this data (splitting attributes: Refund, MarSt, TaxInc):

Refund?
  Yes -> NO
  No  -> MarSt?
           Single, Divorced -> TaxInc?
                                 < 80K -> NO
                                 > 80K -> YES
           Married -> NO

A second tree that splits on MarSt at the root (Married -> NO; Single, Divorced -> Refund? -> TaxInc?) fits the data equally well: there could be more than one tree that fits the same data!
[Figure: Decision Tree Classification Task — induction on the Training Set produces a decision tree, which is then applied as the model to classify the Test Set records.]
Apply Model to Test Data

Test data:

Refund  Marital Status  Taxable Income  Cheat
No      Married         80K             ?

Start from the root of the tree:

Refund?
  Yes -> NO
  No  -> MarSt?
           Single, Divorced -> TaxInc?
                                 < 80K -> NO
                                 > 80K -> YES
           Married -> NO

Refund = No, so follow the "No" branch to the MarSt node. Marital Status = Married, so follow the "Married" branch, which ends in a leaf. Assign Cheat to "No".
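To make the traversal concrete, here is a minimal sketch of the lookup; the nested-tuple encoding of the tree is my own, not code from the slides:

```python
# Walk a test record from the root of the tree above to a leaf.
def classify(record, node):
    """Leaves are plain string labels; internal nodes are (attr, branches)."""
    if isinstance(node, str):
        return node
    attr, branches = node
    if attr == "TaxInc":                      # continuous test: threshold at 80K
        key = "<80K" if record["TaxInc"] < 80 else ">=80K"
    else:                                     # categorical test
        key = record[attr]
    return classify(record, branches[key])

tree = ("Refund", {
    "Yes": "No",                              # Refund = Yes -> Cheat = No
    "No": ("MarSt", {
        "Single":   ("TaxInc", {"<80K": "No", ">=80K": "Yes"}),
        "Divorced": ("TaxInc", {"<80K": "No", ">=80K": "Yes"}),
        "Married":  "No",                     # Married -> Cheat = No
    }),
})

record = {"Refund": "No", "MarSt": "Married", "TaxInc": 80}
print(classify(record, tree))                 # -> "No": assign Cheat to "No"
```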
Decision Tree Induction
Many Algorithms:
– Hunt’s Algorithm (one of the earliest)
– CART
– ID3, C4.5
– SLIQ, SPRINT
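Of these, CART is the algorithm family behind scikit-learn's DecisionTreeClassifier; a minimal usage sketch on synthetic data (not the lecture's table):

```python
# CART-style induction with scikit-learn (an optimized CART variant;
# ID3/C4.5 are not implemented in scikit-learn).
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
tree = DecisionTreeClassifier(criterion="gini", max_depth=3).fit(X, y)
print(export_text(tree))          # the learned splits, one line per node
```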
[Figure: Hunt's Algorithm — the tree is grown recursively on the tax data, starting from a single Don't Cheat node and refining the Cheat / Don't Cheat leaves as the records are partitioned.]
Tree Induction
Greedy strategy.
– Split the records based on an attribute test that optimizes a certain criterion.
Issues
– Determine how to split the records
How to specify the attribute test condition?
How to determine the best split?
Splits on an ordinal attribute must preserve the order among its values. What about this split: Size into {Small, Large} vs. {Medium}? It violates the order property (Medium sits between Small and Large), so it is not a valid ordinal grouping.

For continuous attributes, the test condition can be a binary split on a threshold (Taxable Income > 80K? Yes / No) or a multi-way split into disjoint ranges (e.g. < 10K, ..., > 80K).
Greedy approach:
– Nodes with a homogeneous class distribution are preferred.
Need a measure of node impurity:
– C0: 5, C1: 5 — non-homogeneous, high degree of impurity
– C0: 9, C1: 1 — homogeneous, low degree of impurity
Common measures: Gini index, entropy, misclassification error.
To compare candidate splits, first compute the impurity M0 of the parent node. A split on attribute A produces children with impurities M1 and M2, combined into the weighted average M12; a split on B likewise yields M3 and M4, combined into M34.

Gain = M0 – M12 vs. M0 – M34: choose the split with the larger gain, i.e. the lower weighted child impurity.
Measure of Impurity: GINI

Gini index for a node t:

$$GINI(t) = 1 - \sum_{j} [\,p(j \mid t)\,]^2$$

where p(j | t) is the relative frequency of class j at node t. It is largest when the classes are equally distributed and zero when all records at the node belong to one class.
Computing the Gini index of a binary split (parent node: C1 = 6, C2 = 6):

Split A?          N1  N2          Split B?          N1  N2
C1                 4   2          C1                 5   1
C2                 3   3          C2                 2   4

Split A:
Gini(N1) = 1 − (4/7)² − (3/7)² = 0.490
Gini(N2) = 1 − (2/5)² − (3/5)² = 0.480
Gini(Children) = 7/12 × 0.490 + 5/12 × 0.480 = 0.486

Split B:
Gini(N1) = 1 − (5/7)² − (2/7)² = 0.408
Gini(N2) = 1 − (1/5)² − (4/5)² = 0.320
Gini(Children) = 7/12 × 0.408 + 5/12 × 0.320 = 0.371

B yields the lower weighted Gini, so B is the better split.
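A quick numerical check of the two splits; gini and gini_children are my own helper names, a sketch rather than textbook code:

```python
# Verify the weighted Gini of splits A and B above.
def gini(counts):
    """counts: class counts at one node, e.g. [4, 3]."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_children(*nodes):
    """Weighted average Gini over the child nodes of a split."""
    n = sum(sum(node) for node in nodes)
    return sum(sum(node) / n * gini(node) for node in nodes)

print(gini_children([4, 3], [2, 3]))  # split A -> ~0.486
print(gini_children([5, 2], [1, 4]))  # split B -> ~0.371
```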
For a continuous attribute, sort the records on the attribute and scan the candidate split positions (midpoints between adjacent sorted values), maintaining the class counts on each side. For the Taxable Income column of the training table, the counts of class No on the (≤, >) sides and the resulting Gini at each of the 11 candidate positions are:

No:   (0,7) (1,6) (2,5) (3,4) (3,4) (3,4) (3,4) (4,3) (5,2) (6,1) (7,0)
Gini: 0.420 0.400 0.375 0.343 0.417 0.400 0.300 0.343 0.375 0.400 0.420

The minimum, Gini = 0.300, picks out the best threshold (between 95K and 100K).
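A sorted-scan implementation sketch (the function name is mine, not from the slides); it reproduces the 0.300 minimum on the table's Taxable Income column:

```python
# Find the best binary split on a continuous attribute by the Gini index.
from collections import Counter

def best_gini_split(values, labels):
    """Return (threshold, gini) minimizing the weighted Gini of the two sides."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    right = Counter(lab for _, lab in pairs)   # all records start on the right
    left = Counter()

    def gini(counts, total):
        return 1.0 - sum((c / total) ** 2 for c in counts.values())

    best = (None, float("inf"))
    for i in range(n - 1):                     # candidate cut after record i
        v, lab = pairs[i]
        left[lab] += 1
        right[lab] -= 1
        if pairs[i + 1][0] == v:               # only cut between distinct values
            continue
        threshold = (v + pairs[i + 1][0]) / 2  # midpoint split position
        nl, nr = i + 1, n - i - 1
        weighted = nl / n * gini(left, nl) + nr / n * gini(right, nr)
        if weighted < best[1]:
            best = (threshold, weighted)
    return best

# Taxable Income column from the training table: minimum Gini = 0.300
# at the 95/100 midpoint (97.5; the slides list this position as 97).
incomes = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]
cheat = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]
print(best_gini_split(incomes, cheat))  # -> (97.5, 0.3)
```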
Alternative splitting criterion — entropy at a node t:

$$Entropy(t) = -\sum_{j} p(j \mid t)\, \log_2 p(j \mid t)$$

Information Gain:

$$GAIN_{split} = Entropy(p) - \sum_{i=1}^{k} \frac{n_i}{n}\, Entropy(i)$$

where parent node p is split into k partitions and n_i is the number of records in partition i.

Gain Ratio:

$$GainRATIO_{split} = \frac{GAIN_{split}}{SplitINFO}, \qquad SplitINFO = -\sum_{i=1}^{k} \frac{n_i}{n} \log_2 \frac{n_i}{n}$$

The gain ratio adjusts information gain by the entropy of the partitioning, penalizing splits into a large number of small partitions.
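A minimal sketch of these three quantities (the helper names are mine):

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy(t) = -sum_j p(j|t) * log2 p(j|t)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_and_ratio(parent_labels, partitions):
    """partitions: one list of class labels per child node (all non-empty)."""
    n = len(parent_labels)
    weighted = sum(len(p) / n * entropy(p) for p in partitions)
    gain = entropy(parent_labels) - weighted
    split_info = -sum(len(p) / n * math.log2(len(p) / n) for p in partitions)
    return gain, (gain / split_info if split_info else 0.0)

# Example: a 5 yes / 5 no parent split into (3 yes, 1 no) and (2 yes, 4 no).
parent = ["yes"] * 5 + ["no"] * 5
children = [["yes"] * 3 + ["no"], ["yes"] * 2 + ["no"] * 4]
print(gain_and_ratio(parent, children))
```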
Another example (parent node: C1 = 7, C2 = 3, Gini = 0.42), splitting on A:

                 N1  N2
C1                3   4
C2                0   3

Gini(N1) = 1 − (3/3)² − (0/3)² = 0
Gini(N2) = 1 − (4/7)² − (3/7)² = 0.489
Gini(Children) = 3/10 × 0 + 7/10 × 0.489 = 0.342

Gini improves (0.342 < 0.42)!
Advantages:
– Inexpensive to construct
– Extremely fast at classifying unknown records
– Easy to interpret for small-sized trees
– Accuracy is comparable to other classification
techniques for many simple data sets
Example — web robot detection: 2916 records,
– 10% for training
– 90% for testing
Human user: Class 0
Web robot: Class 1
Worked example (choosing splits by information gain on a "fast car" table):

SC/Turbo: 4 yes, 11 no. There are 2 values for attribute SC/Turbo, so we need 2 entropy calculations. We can omit the calculations for fuel economy "good" and "average", since those cars always end up not fast.

Recap:
IG(Engine)  = 0.121
IG(Turbo)   = 0.085
IG(Weight)  = 0.115
IG(FuelEco) = 0.304

Our best pick is Fuel Eco, and we can immediately predict that a car is not fast when its fuel economy is good or average. Since we selected the Fuel Eco attribute for our root node, it is removed from the table for future calculations.

Calculating the entropy of the remaining subset, IE(Fuel Eco = bad), we get 1, since it contains 5 yes and 5 no.

SC/Turbo: 3 yes, 7 no. Again, 2 values for attribute SC/Turbo, so we need 2 entropy calculations.

Recap:
IG(Engine) = 0.115
IG(Turbo)  = 0.035
IG(Weight) = 0.439

Weight has the highest gain and is thus the best choice. All cars with large engines in this table are not fast.
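An ID3-style selection loop in this spirit; the function names and the two illustrative rows are mine (the lecture's full table is not reproduced here — with the full table, FuelEco wins at the root as computed above):

```python
# Pick the split attribute with the highest information gain.
import math
from collections import Counter, defaultdict

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, attr, target):
    """IG of splitting `rows` on categorical attribute `attr`."""
    parts = defaultdict(list)
    for row in rows:
        parts[row[attr]].append(row[target])
    n = len(rows)
    children = sum(len(p) / n * entropy(p) for p in parts.values())
    return entropy([r[target] for r in rows]) - children

def best_attribute(rows, attrs, target="Fast"):
    return max(attrs, key=lambda a: info_gain(rows, a, target))

rows = [  # hypothetical records standing in for the fast-car table
    {"Engine": "large", "Turbo": "yes", "Weight": "heavy", "FuelEco": "bad", "Fast": "yes"},
    {"Engine": "small", "Turbo": "no", "Weight": "light", "FuelEco": "good", "Fast": "no"},
]
print(best_attribute(rows, ["Engine", "Turbo", "Weight", "FuelEco"]))
```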
Practical issues of classification:
– Underfitting and Overfitting
– Missing Values
– Costs of Classification
Example (two-class synthetic data):
Circular points: 0.5 ≤ sqrt(x1² + x2²) ≤ 1
Triangular points: sqrt(x1² + x2²) < 0.5 or sqrt(x1² + x2²) > 1
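A sketch of generating such data by rejection sampling (the names and the 1.5 bounding box are my choices):

```python
# Circular points lie in the annulus 0.5 <= r <= 1; triangular points
# lie outside it (r < 0.5 or r > 1), within a bounding box so sampling halts.
import numpy as np

rng = np.random.default_rng(0)

def sample(n, keep):
    pts = []
    while len(pts) < n:
        p = rng.uniform(-1.5, 1.5, size=2)    # propose in a bounding box
        if keep(np.hypot(p[0], p[1])):        # accept or reject by radius r
            pts.append(p)
    return np.array(pts)

circles = sample(500, lambda r: 0.5 <= r <= 1)        # class "circle"
triangles = sample(500, lambda r: r < 0.5 or r > 1)   # class "triangle"
X = np.vstack([circles, triangles])
y = np.array([0] * 500 + [1] * 500)
```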
Underfitting and Overfitting

Underfitting: when the model is too simple, both training and test errors are large.
Overfitting: when the model is too complex, training error keeps decreasing while test error starts to grow.
Overfitting due to Insufficient Examples

A lack of data points in the lower half of the diagram makes it difficult to correctly predict the class labels of that region: with an insufficient number of training records in the region, the decision tree predicts the test examples using other training records that are irrelevant to the classification task.
Post-pruning
– Grow decision tree to its entirety
– Trim the nodes of the decision tree in a
bottom-up fashion
– If generalization error improves after trimming,
replace sub-tree by a leaf node.
– Class label of leaf node is determined from
majority class of instances in the sub-tree
– Can use MDL for post-pruning
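scikit-learn's cost-complexity pruning uses a different trimming criterion than the generalization-error/MDL test above, but a minimal sketch of it illustrates the same grow-then-trim idea:

```python
# Grow the tree fully, then keep the bottom-up trim that generalizes best.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Candidate trims, from the full tree (alpha=0) to a single node.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)
best = max(
    (DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_tr, y_tr)
     for a in path.ccp_alphas),
    key=lambda tree: tree.score(X_te, y_te),   # held-out error as the criterion
)
print(best.tree_.node_count, best.score(X_te, y_te))
```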
Computing the impurity measure with a missing value (one of ten Refund values unknown; class distribution 3 Yes / 7 No):

Entropy(Parent) = −0.3 log₂ 0.3 − 0.7 log₂ 0.7 = 0.8813
Entropy(Children) = 3/10 × 0 + 6/10 × 0.9183 = 0.551
Gain = 9/10 × (0.8813 − 0.551) = 0.297

The gain is scaled by 9/10, the fraction of records whose Refund value is known.
Distribute Instances: a record with a missing attribute value can be distributed to the children in proportion to the training records with known values.

Other issues: Data Fragmentation, Search Strategy, Expressiveness, Tree Replication.

Tree induction is typically a top-down, greedy search. Other strategies?
– Bottom-up
– Bi-directional
[Figure: Decision boundary of a tree over two attributes x, y ∈ [0, 1]. The root tests x < 0.43; each branch then tests y, and all four leaves are pure (class counts 4:0 / 0:4 and 0:4 / 3:0). The induced regions are axis-parallel rectangles.]
• The border line between two neighboring regions of different classes is known as the decision boundary.
• The decision boundary is parallel to the axes because each test condition involves a single attribute at a time.
Oblique decision trees allow test conditions that involve multiple attributes, e.g. x + y < 1, which separates Class = + from Class = • with a single non-axis-parallel boundary. They are more expressive, but finding good oblique tests is much more expensive.

[Figure: Tree Replication — the same subtree (rooted at S) appears in multiple branches of the tree, duplicating structure.]
Confusion matrix:

                        PREDICTED CLASS
                        Class=Yes   Class=No
ACTUAL   Class=Yes      a (TP)      b (FN)
CLASS    Class=No       c (FP)      d (TN)

a: TP (true positive), b: FN (false negative), c: FP (false positive), d: TN (true negative)

$$\text{Accuracy} = \frac{a + d}{a + b + c + d} = \frac{TP + TN}{TP + TN + FP + FN}$$
Limitation of Accuracy
When classes are imbalanced, a model that always predicts the majority class attains high accuracy while never detecting the minority class. A cost-sensitive alternative weights each cell of the confusion matrix:

$$\text{Weighted Accuracy} = \frac{w_1 a + w_4 d}{w_1 a + w_2 b + w_3 c + w_4 d}$$
At threshold t:
TP = 0.5, FN = 0.5, FP = 0.12, TN = 0.88
ROC Curve
(TP,FP):
(0,0): declare everything
to be negative class
(1,1): declare everything
to be positive class
(1,0): ideal
Diagonal line:
– Random guessing
– Below diagonal line:
prediction is opposite of
the true class
Using ROC for Model Comparison
No model consistently outperforms the other:
M1 is better for
small FPR
M2 is better for
large FPR
Constructing an ROC curve: sort the test records by decreasing classifier score; at each threshold, count TP, FP, TN, FN and compute TPR = TP/(TP+FN) and FPR = FP/(FP+TN):

TP:   5    4    4    3    3    3    3    2    2    1    0
FP:   5    5    4    4    3    2    1    1    0    0    0
TN:   0    0    1    1    2    3    4    4    5    5    5
FN:   0    1    1    2    2    2    2    3    3    4    5
TPR:  1    0.8  0.8  0.6  0.6  0.6  0.6  0.4  0.4  0.2  0
FPR:  1    1    0.8  0.8  0.6  0.4  0.2  0.2  0    0    0

ROC curve: plot the (FPR, TPR) pairs from the last two rows.
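A minimal construction sketch (the example scores and labels are mine, not the table's; ties are stepped through one record at a time):

```python
# Build ROC points by sweeping the threshold down the sorted scores.
def roc_points(scores, labels):
    """labels: 1 = positive, 0 = negative. Returns a list of (FPR, TPR)."""
    P = sum(labels)
    N = len(labels) - P
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    tp = fp = 0
    points = [(0.0, 0.0)]            # threshold above every score
    for i in order:                  # lower the threshold one record at a time
        if labels[i] == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / N, tp / P))
    return points

scores = [0.95, 0.93, 0.87, 0.85, 0.85, 0.85, 0.76, 0.53, 0.43, 0.25]
labels = [1, 1, 0, 0, 1, 0, 1, 0, 1, 0]  # 5 positives, 5 negatives
print(roc_points(scores, labels))
```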
Confidence interval for accuracy: for a test set of size N with observed accuracy acc, treat acc as a binomial estimate of the true accuracy p:

$$P\!\left( -Z_{\alpha/2} \le \frac{acc - p}{\sqrt{p(1-p)/N}} \le Z_{1-\alpha/2} \right) = 1 - \alpha$$

Solving for p gives the interval bounds:

$$p = \frac{2 \cdot N \cdot acc + Z_{\alpha/2}^2 \pm Z_{\alpha/2}\sqrt{Z_{\alpha/2}^2 + 4 \cdot N \cdot acc - 4 \cdot N \cdot acc^2}}{2\,(N + Z_{\alpha/2}^2)}$$
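A direct implementation sketch of the interval bounds (the helper name is mine):

```python
# Confidence interval for the true accuracy p, per the formula above.
import math

def accuracy_interval(acc, n, z=1.96):
    """95% (z = 1.96) confidence interval for the true accuracy p."""
    half = z * math.sqrt(z * z + 4 * n * acc - 4 * n * acc * acc)
    lo = (2 * n * acc + z * z - half) / (2 * (n + z * z))
    hi = (2 * n * acc + z * z + half) / (2 * (n + z * z))
    return lo, hi

print(accuracy_interval(0.80, 100))  # N=100, acc=0.8 -> roughly (0.71, 0.87)
```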
Comparing the performance of two models, given error rates e₁, e₂ measured on independent test sets of sizes n₁ and n₂:

$$e_1 \sim N(\mu_1, \sigma_1), \qquad e_2 \sim N(\mu_2, \sigma_2)$$

– Approximate each variance by

$$\hat{\sigma}_i^2 = \frac{e_i\,(1 - e_i)}{n_i}$$

For the observed difference d = e₁ − e₂:

$$\hat{\sigma}_t^2 = \hat{\sigma}_1^2 + \hat{\sigma}_2^2 = \frac{e_1(1-e_1)}{n_1} + \frac{e_2(1-e_2)}{n_2}, \qquad d_t = d \pm Z_{\alpha/2}\,\hat{\sigma}_t$$

Comparing two algorithms with k-fold cross-validation (per-fold differences d_j with mean d̄), use the t-distribution with k − 1 degrees of freedom:

$$\hat{\sigma}^2 = \frac{\sum_{j=1}^{k} (d_j - \bar{d})^2}{k\,(k-1)}, \qquad d_t = \bar{d} \pm t_{1-\alpha,\,k-1}\,\hat{\sigma}$$
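A sketch of the k-fold interval above (the function name and the example fold differences are mine; scipy supplies the t quantile):

```python
# Paired comparison of two algorithms over k cross-validation folds.
import math
from scipy import stats

def paired_interval(d, alpha=0.05):
    """d: list of per-fold error differences d_j = e1_j - e2_j."""
    k = len(d)
    dbar = sum(d) / k
    var = sum((x - dbar) ** 2 for x in d) / (k * (k - 1))
    t = stats.t.ppf(1 - alpha, df=k - 1)      # t_{1-alpha, k-1} as in the slide
    half = t * math.sqrt(var)
    return dbar - half, dbar + half           # interval containing 0 => no significant difference

print(paired_interval([0.02, -0.01, 0.03, 0.00, 0.01, 0.02, -0.02, 0.01, 0.00, 0.02]))
```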