Classification
Hatem Haddad
These slides are largely based on those provided with Tan, Steinbach, and Kumar, Introduction to Data Mining.
What have we seen last time?
Knowledge Discovery (KDD) Process
This is a view from the typical database systems and data warehousing communities. Data mining plays an essential role in the knowledge discovery process.
[Figure: the KDD pipeline: Databases → Data Cleaning → Data Integration → Task-relevant Data → Data Mining → Pattern Evaluation]
Data Mining Algorithms
Classification
 predicts categorical class labels (discrete or nominal)
 classifies data (constructs a model) based on the training set and the values (class labels) of a classifying attribute, and uses it to classify new data
Classification: Definition
Given a collection of records (the training set):
 Each record contains a set of attributes; one of the attributes is the class.
Find a model for the class attribute as a function of the values of the other attributes.
Goal: previously unseen records should be assigned a class as accurately as possible.
[Figure: a labeled training set (Tid, Attrib1, Attrib2, Attrib3, Class; e.g. 3: No, Small, 70K, No; 6: No, Medium, 60K, No) is used to learn a model, which is then applied to a test set with unknown class labels (e.g. 11: No, Small, 55K, ?; 15: No, Large, 67K, ?).]
Example of a Decision Tree
[Figure: a training set with attributes Refund, Marital Status, Taxable Income and class Cheat, and the decision tree induced from it. Splitting attributes:
Refund?
├─ Yes → NO
└─ No → Marital Status?
   ├─ Single, Divorced → Taxable Income?
   │  ├─ < 80K → NO
   │  └─ > 80K → YES
   └─ Married → NO
The learned model is then applied to the test set (e.g. Tid 11: No, Small, 55K, ?; 15: No, Large, 67K, ?).]
Apply Model to Test Data
Start from the root of the tree and, at each node, follow the branch that matches the test record.
Test record: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?
Refund?
├─ Yes → NO
└─ No → MarSt?
   ├─ Single, Divorced → TaxInc?
   │  ├─ < 80K → NO
   │  └─ > 80K → YES
   └─ Married → NO
Refund = No leads to the MarSt node; Marital Status = Married leads to the leaf NO. Assign Cheat to No.
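The traversal above can be sketched directly in Python. The function below is a hand-coded version of the example tree (the function name and record keys are illustrative, not from the slides):

```python
def classify(record):
    """Apply the example decision tree to one record; returns the Cheat label."""
    if record["Refund"] == "Yes":
        return "No"
    # Refund = No: test Marital Status next
    if record["MaritalStatus"] == "Married":
        return "No"
    # Single or Divorced: test Taxable Income (in thousands)
    return "Yes" if record["TaxableIncome"] > 80 else "No"

# The test record from the slides: Refund = No, Married, 80K
print(classify({"Refund": "No", "MaritalStatus": "Married", "TaxableIncome": 80}))  # prints No
```

Because the Married branch is a leaf, the Taxable Income test is never reached for this record.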
Decision Trees as a Computer Program
 Test set
 Ground truth, data labeling, Mechanical Turk
 Confusion matrix and cost matrix
 True positive, true negative, false positive, false negative
 Accuracy
 Error rates
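These evaluation quantities can be sketched in a few lines, assuming a binary labeling with "Yes" as the positive class (the helper name is illustrative):

```python
def confusion_counts(y_true, y_pred, positive="Yes"):
    """Count true/false positives and negatives for a binary classification."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp, tn, fp, fn

# Toy ground truth vs. model predictions
y_true = ["Yes", "No", "Yes", "No", "No", "Yes"]
y_pred = ["Yes", "No", "No", "No", "Yes", "Yes"]
tp, tn, fp, fn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / (tp + tn + fp + fn)  # fraction of correct predictions
error_rate = 1 - accuracy
```

A cost matrix generalizes this by weighting each of the four cells differently instead of counting all errors equally.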
Exercise (Tree Induction): Training Dataset
Class: buys_computer

age     income  student credit_rating buys_computer
<=30    high    no      fair          no
<=30    high    no      excellent     no
31..40  high    no      fair          yes
>40     medium  no      fair          yes
>40     low     yes     fair          yes
>40     low     yes     excellent     no
31..40  low     yes     excellent     yes
<=30    medium  no      fair          no
<=30    low     yes     fair          yes
>40     medium  yes     fair          yes
<=30    medium  yes     excellent     yes
31..40  medium  no      excellent     yes
31..40  high    yes     fair          yes
>40     medium  no      excellent     no
Solution: interpretation?
The first split is on age; the 31..40 branch is pure (all yes), while the other branches are labeled with their majority class:
age?
├─ <=30 → no
├─ 31..40 → yes
└─ >40 → yes
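The choice of age as the first split can be checked numerically. The sketch below computes the entropy of the full buys_computer table and the information gain of age (a standard calculation; the function names are illustrative):

```python
from math import log2

def entropy(labels):
    """Entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((labels.count(c) / n) * log2(labels.count(c) / n)
                for c in set(labels))

# (age, buys_computer) pairs from the training table above
rows = [("<=30", "no"), ("<=30", "no"), ("31..40", "yes"), (">40", "yes"),
        (">40", "yes"), (">40", "no"), ("31..40", "yes"), ("<=30", "no"),
        ("<=30", "yes"), (">40", "yes"), ("<=30", "yes"), ("31..40", "yes"),
        ("31..40", "yes"), (">40", "no")]

labels = [c for _, c in rows]
before = entropy(labels)                   # entropy of the full set (9 yes, 5 no)
after = sum(
    len(part) / len(rows) * entropy(part)  # weighted entropy of each age branch
    for v in {"<=30", "31..40", ">40"}
    for part in [[c for a, c in rows if a == v]]
)
gain_age = before - after
```

The full-set entropy comes out near 0.940 bits and the gain of age near 0.247 bits; the pure 31..40 branch contributes zero entropy, which is what makes age an attractive first split.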
Decision Tree Classification Task
[Figure: the training set (e.g. Tid 1: Yes, Large, 125K, No; 2: No, Medium, 100K, No; 3: No, Small, 70K, No; 6: No, Medium, 60K, No) is fed to a tree induction algorithm, which learns a decision tree model; the model is then applied to the test set (e.g. Tid 11: No, Small, 55K, ?; 15: No, Large, 67K, ?).]
Decision Tree Induction
Many algorithms:
 Hunt's Algorithm (one of the earliest)
 CART
 ID3, C4.5
 SLIQ, SPRINT
General Structure of Hunt's Algorithm
Let Dt be the set of training records that reach a node t. If all records in Dt belong to the same class, t is a leaf node labeled with that class; otherwise, split Dt into smaller subsets using an attribute test. Hunt's algorithm is then applied recursively to each child of the node.

Training data (Tid, Refund, Marital Status, Taxable Income, Cheat):
 1   Yes  Single    125K  No
 2   No   Married   100K  No
 3   No   Single    70K   No
 4   Yes  Married   120K  No
 5   No   Divorced  95K   Yes
 6   No   Married   60K   No
 7   Yes  Divorced  220K  No
 8   No   Single    85K   Yes
 9   No   Married   75K   No
 10  No   Single    90K   Yes

[Figure: the tree grown step by step. First, a single leaf (Don't Cheat); then a split on Refund (Yes → Don't Cheat, No → ?); then, under Refund = No, a split on Marital Status (Married → Don't Cheat, Single/Divorced → ?); finally, a split on Taxable Income (< 80K → Don't Cheat, >= 80K → Cheat).]
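Hunt's recursion can be sketched compactly over categorical attributes. The version below uses Gini to pick each split (the slides do not fix a particular impurity measure) and pre-bins Taxable Income at 80K for simplicity; all names are illustrative:

```python
def gini(labels):
    n = len(labels)
    return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def hunt(records, labels, attrs):
    """Grow a tree: returns a class label (leaf) or (attribute, {value: subtree})."""
    if len(set(labels)) == 1:                   # all records in one class -> leaf
        return labels[0]
    if not attrs:                               # no attributes left -> majority class
        return max(set(labels), key=labels.count)
    def split_gini(a):                          # weighted Gini of splitting on a
        parts = {}
        for r, c in zip(records, labels):
            parts.setdefault(r[a], []).append(c)
        return sum(len(p) / len(labels) * gini(p) for p in parts.values())
    best = min(attrs, key=split_gini)
    children = {}
    for v in {r[best] for r in records}:        # recurse into each child
        sub = [(r, c) for r, c in zip(records, labels) if r[best] == v]
        children[v] = hunt([r for r, _ in sub], [c for _, c in sub],
                           [a for a in attrs if a != best])
    return (best, children)

def predict(tree, record):
    while not isinstance(tree, str):
        attr, children = tree
        tree = children[record[attr]]
    return tree

# The ten training records from the slide, with income binned at 80K
data = [
    ({"Refund": "Yes", "Marital": "Single",   "Income": ">=80K"}, "No"),
    ({"Refund": "No",  "Marital": "Married",  "Income": ">=80K"}, "No"),
    ({"Refund": "No",  "Marital": "Single",   "Income": "<80K"},  "No"),
    ({"Refund": "Yes", "Marital": "Married",  "Income": ">=80K"}, "No"),
    ({"Refund": "No",  "Marital": "Divorced", "Income": ">=80K"}, "Yes"),
    ({"Refund": "No",  "Marital": "Married",  "Income": "<80K"},  "No"),
    ({"Refund": "Yes", "Marital": "Divorced", "Income": ">=80K"}, "No"),
    ({"Refund": "No",  "Marital": "Single",   "Income": ">=80K"}, "Yes"),
    ({"Refund": "No",  "Marital": "Married",  "Income": "<80K"},  "No"),
    ({"Refund": "No",  "Marital": "Single",   "Income": ">=80K"}, "Yes"),
]
records = [r for r, _ in data]
labels = [c for _, c in data]
tree = hunt(records, labels, ["Refund", "Marital", "Income"])
```

Since no two records share all three attribute values with different labels, the grown tree classifies every training record correctly.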
Tree Induction
Greedy strategy:
 Split the records based on an attribute test that optimizes a certain criterion.
Issues
 Determine how to split the records
  How to specify the attribute test condition?
  How to determine the best split?
 Determine when to stop splitting
 Determine how to cut back if the tree is too deep
  What is wrong with a tree that is too deep?
How to Specify the Test Condition?
For a nominal attribute such as Size, the split can be:
 Multi-way: use as many partitions as distinct values:
  Size → Small | Medium | Large
 Binary: divide the values into two subsets:
  {Small, Medium} vs {Large}, OR {Medium, Large} vs {Small}, OR {Small, Large} vs {Medium}
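For a k-valued nominal attribute there are 2^(k-1) - 1 distinct binary groupings; a quick sketch that enumerates them for Size (the helper name is illustrative):

```python
from itertools import combinations

def binary_splits(values):
    """All distinct two-subset partitions of a nominal attribute's values."""
    values = list(values)
    splits = []
    rest = values[1:]
    # Fix values[0] on the left side to avoid counting mirrored partitions twice.
    for r in range(len(rest) + 1):
        for extra in combinations(rest, r):
            left = {values[0], *extra}
            right = set(values) - left
            if right:                  # skip the trivial split with an empty side
                splits.append((left, right))
    return splits

for left, right in binary_splits(["Small", "Medium", "Large"]):
    print(sorted(left), "vs", sorted(right))
```

With three values this yields exactly the three binary splits listed above.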
Splitting Based on Continuous Attributes
 Binary decision: Taxable Income > 80K? (Yes / No)
 Multi-way split: discretize into disjoint ranges, e.g. Taxable Income in < 10K, ..., > 80K
How to Determine the Best Split
Before splitting: 10 records of class 0, 10 records of class 1.
Greedy approach: nodes with a homogeneous class distribution are preferred, so we need a measure of node impurity:
 C0: 5, C1: 5 → non-homogeneous, high degree of impurity
 C0: 9, C1: 1 → homogeneous, low degree of impurity

How to find the best split: let M be the impurity measure.
 Before splitting, the node (C0: N00, C1: N01) has impurity M0.
 For candidate attribute A (or B), compute the impurity M1 and M2 (M3 and M4) of the two children and combine them into a weighted average M12 (M34).
 Choose the split with the larger gain, i.e. the larger reduction in impurity: M0 − M12 vs M0 − M34.
Entropy
Gain
Gini Index
Classification Error
Entropy-Based Evaluation and Splitting
Entropy at a given node t:

 Entropy(t) = − Σ_j p(j|t) log₂ p(j|t)

where p(j|t) is the relative frequency of class j at node t.
 Minimum (0): when all records belong to one class, implying most information.
Examples for computing Entropy: for class counts (5, 5), Entropy = 1; for (9, 1), Entropy ≈ 0.469; for (10, 0), Entropy = 0.
Information Gain:

 GAIN_split = Entropy(p) − Σ_{i=1..k} (n_i / n) Entropy(i)

where parent node p is split into k partitions and n_i is the number of records in partition i.

Gain Ratio:

 GainRATIO_split = GAIN_split / SplitINFO,  where  SplitINFO = − Σ_{i=1..k} (n_i / n) log (n_i / n)
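These formulas translate directly into code. The sketch below computes entropy, gain, and gain ratio from class-count tuples (the function names and the example split are illustrative):

```python
from math import log2

def entropy(counts):
    """Entropy from class counts at a node, in bits."""
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c > 0)

def gain_and_ratio(parent_counts, partitions):
    """Information gain and gain ratio for splitting a parent node
    into the given partitions (each a tuple of class counts)."""
    n = sum(parent_counts)
    weighted = sum(sum(p) / n * entropy(p) for p in partitions)
    gain = entropy(parent_counts) - weighted
    # SplitINFO penalizes splits that produce many small partitions
    split_info = -sum(sum(p) / n * log2(sum(p) / n) for p in partitions)
    return gain, gain / split_info

# Parent node (10, 10) split into children (7, 3) and (3, 7)
gain, ratio = gain_and_ratio((10, 10), [(7, 3), (3, 7)])
```

For this balanced two-way split, SplitINFO is exactly 1 bit, so the gain ratio equals the gain.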
Examples of node impurity (Gini) for a two-class node with 6 records:
 C1: 0, C2: 6 → Gini = 0.000
 C1: 1, C2: 5 → Gini = 0.278
 C1: 2, C2: 4 → Gini = 0.444
 C1: 3, C2: 3 → Gini = 0.500
Examples for computing GINI
Gini index at a given node t:

 GINI(t) = 1 − Σ_j [p(j|t)]²

For a continuous attribute (e.g. Taxable Income > 80K?), a simple method to choose the best split value v: for each candidate v, scan the database to gather the count matrix and compute its Gini index. This is computationally inefficient: it repeats work.
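A direct transcription of the Gini formula, reproducing the impurity values tabulated above (the function name is illustrative):

```python
def gini(counts):
    """Gini index from class counts: 1 minus the sum of squared class frequencies."""
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

# The two-class examples from the table above
for c1, c2 in [(0, 6), (1, 5), (2, 4), (3, 3)]:
    print(c1, c2, round(gini((c1, c2)), 3))
```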
Continuous Attributes: Computing Gini Index...
For efficient computation: sort the records on the attribute, then linearly scan the sorted values, updating the count matrix at each candidate split position and computing the Gini index incrementally. For the Cheat data sorted on Taxable Income, the candidate splits give Gini values 0.420, 0.400, 0.375, 0.343, 0.417, 0.400, 0.300, 0.343, 0.375, 0.400, 0.420; the minimum, 0.300, identifies the best split.
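The sorted scan can be sketched as follows on the Taxable Income data; it recovers the minimum Gini of 0.300 from the values above (the helper name and midpoint thresholds are illustrative choices):

```python
def gini(counts):
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts) if n else 0.0

def best_split(values, labels):
    """Sort once, then scan candidate thresholds, updating counts incrementally."""
    pairs = sorted(zip(values, labels))
    classes = sorted(set(labels))
    left = {c: 0 for c in classes}
    right = {c: labels.count(c) for c in classes}
    best = (float("inf"), None)
    for i in range(len(pairs) - 1):
        v, c = pairs[i]
        left[c] += 1                     # move one record to the left side
        right[c] -= 1
        if v == pairs[i + 1][0]:
            continue                     # cannot split between equal values
        threshold = (v + pairs[i + 1][0]) / 2
        n = len(pairs)
        g = (i + 1) / n * gini(tuple(left.values())) \
            + (n - i - 1) / n * gini(tuple(right.values()))
        best = min(best, (g, threshold))
    return best

# Taxable Income and Cheat labels from the ten-record training set
incomes = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]
cheat   = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]
g, t = best_split(incomes, cheat)
```

The scan does one sort plus one linear pass, instead of one full database scan per candidate value.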
Classification Error
Classification error at a node t: Error(t) = 1 − max_j p(j|t).
Stopping Criteria for Tree Induction
Stop expanding a node when all the records belong to the same class
Stop expanding a node when all the records have similar attribute values
Exercise: from the Introduction to Data Mining book

 GINI(t) = 1 − Σ_j [p(j|t)]²

Customer ID  Gender  Car Type  Shirt Size  Class
 1           M       Family    Small       C0
 2           M       Sports    Medium      C0
 3           M       Sports    Medium      C0
 4           M       Sports    Large       C0
 ...
 9           F       Sports    Medium      C0
 10          F       Luxury    Large       C0
 11          M       Family    Large       C1
 ...
 13          M       Family    Medium      C1
 ...
 15          F       Luxury    Small       C1
 16          F       Luxury    Small       C1
 17          F       Luxury    Medium      C1
 18          F       Luxury    Medium      C1
 19          F       Luxury    Medium      C1
 20          F       Luxury    Large       C1
Exercise: from the Introduction to Data Mining book (on the customer dataset above)

Compute the Gini index for the Customer ID attribute. Answer: every Customer ID is unique, so each partition holds a single record and is pure; the Gini index of each partition, and hence of the split, is 0.

Compute the Gini index for the Gender attribute. Answer: the overall Gini for Gender is 0.5 · 0.5 + 0.5 · 0.5 = 0.5.
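The overall (split) Gini of a nominal attribute is the record-weighted average of its partitions' Gini values. A sketch reproducing the Gender arithmetic above (function names and the 5/5 per-gender class counts implied by that arithmetic are assumptions):

```python
def gini(counts):
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def gini_split(partitions):
    """Weighted Gini of a multiway split; partitions are class-count tuples."""
    n = sum(sum(p) for p in partitions)
    return sum(sum(p) / n * gini(p) for p in partitions)

# Gender: two partitions of 10 records each, split 5/5 between C0 and C1
overall = gini_split([(5, 5), (5, 5)])   # (10/20)*0.5 + (10/20)*0.5

# Customer ID: 20 singleton partitions, each pure, so the split Gini is 0
customer_id = gini_split([(1, 0)] * 10 + [(0, 1)] * 10)
```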
Exercise: from the Introduction to Data Mining book (on the customer dataset above)

 GINI(t) = 1 − Σ_j [p(j|t)]²,  GINI_split = Σ_{i=1..k} (n_i / n) GINI(i)

Compute the Gini index for the Car Type attribute (Family, Sports, Luxury) using a multiway split.
Decision Tree Based Classification
Advantages:
Inexpensive to construct
Extremely fast at classifying unknown records
Easy to interpret for small-sized trees
Accuracy is comparable to other classification techniques for many simple data sets