
DAT630


Classification

Basic Concepts, Decision Trees, and Model Evaluation
Introduction to Data Mining, Chapter 4

08/09/2015

Krisztian Balog | University of Stavanger

Classification

- Classification is the task of assigning objects to one of several predefined categories
- Examples
  - Credit card transactions: legitimate or fraudulent?
  - Emails: SPAM or not?
  - Patients: high or low risk?
  - Astronomy: star, galaxy, nebula, etc.
  - News stories: finance, weather, entertainment, sports, etc.

Why?

- Descriptive modeling
  - Explanatory tool to distinguish between objects of different classes
- Predictive modeling
  - Predict the class label of previously unseen records
  - Automatically assign a class label when presented with the attributes of the record

The task

Attribute set (x) -> Classification Model -> Class label (y)

- Input is a collection of records (instances)
- Each record is characterized by a tuple (x, y)
  - x is the attribute set
  - y is the class label (category or target attribute)
- Classification is the task of learning a target function f (classification model) that maps each attribute set x to one of the predefined class labels y
General approach

Induction: a learning algorithm learns a model from records whose class labels are known (the training set).
Deduction: the model is applied to records with unknown class labels (the test set).

Training Set:

Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test Set:

Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?

Objectives for Learning Alg.

- Should fit the input data (the training set) well
- Should correctly predict class labels for unseen data (the test set)

Learning Algorithms

- Decision trees
- Rule-based
- Naive Bayes
- Support Vector Machines
- Random forests
- k-nearest neighbors
- …

Machine Learning vs. Data Mining

- Similar techniques, but different goal
- Machine learning is focused on developing and designing learning algorithms
  - More abstract, e.g., features are given
- Data Mining is applied Machine Learning
  - Performed by a person who has a goal in mind and uses Machine Learning techniques on a specific dataset
  - Much of the work is concerned with data (pre)processing and feature engineering
Today

- Decision trees
- Binary class labels
  - Positive or Negative
- The learning algorithm should fit the input data well and correctly predict class labels for unseen data. How to measure this?

Evaluation

- Measuring the performance of a classifier
- Based on the number of records correctly and incorrectly predicted by the model
- Counts are tabulated in a table called the confusion matrix
- Compute various performance metrics based on this matrix

Confusion Matrix

                        Predicted class
                        Positive               Negative
Actual    Positive      True Positives (TP)    False Negatives (FN)
class     Negative      False Positives (FP)   True Negatives (TN)

Confusion Matrix Example

"Is the man innocent?"

                        Predicted class
                        Convicted              Freed
Actual    Guilty        True Positive          False Negative
class     Innocent      False Positive         True Negative

Type I Error (False Positive): convicting an innocent person, i.e., raising a false alarm (miscarriage of justice)
Type II Error (False Negative): letting a guilty person go free, i.e., failing to raise an alarm (error of impunity)
Evaluation Metrics

- Summarizing performance in a single number
- Accuracy
  Accuracy = Number of correct predictions / Total number of predictions = (TP + TN) / (TP + FP + TN + FN)
- Error rate
  Error rate = Number of wrong predictions / Total number of predictions = (FP + FN) / (TP + FP + TN + FN)
- We seek high accuracy, or equivalently, low error rate

Exercise

- Create the confusion matrix
- Compute Accuracy and Error rate
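The two metrics can be sketched in a few lines of Python; the example counts below are illustrative, not from the slides:

```python
# Accuracy and error rate from confusion-matrix counts.
def accuracy(tp, fp, tn, fn):
    """Fraction of correct predictions."""
    return (tp + tn) / (tp + fp + tn + fn)

def error_rate(tp, fp, tn, fn):
    """Fraction of wrong predictions; equals 1 - accuracy."""
    return (fp + fn) / (tp + fp + tn + fn)

# Example counts: TP=40, FP=5, TN=45, FN=10 (100 predictions total)
print(accuracy(40, 5, 45, 10))    # 0.85
print(error_rate(40, 5, 45, 10))  # 0.15
```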

Decision Trees: Motivational Example

How does it work?

- Ask a series of questions about the attributes of the test record
- Each time we receive an answer, a follow-up question is asked, until we reach a conclusion about the class label of the record
Decision Tree Model

The general approach instantiated with decision trees: the learning algorithm induces a decision tree from the training set, and the tree is then applied to the test set.

Decision Tree

- Root node: no incoming edges, zero or more outgoing edges
- Internal node: exactly one incoming edge, two or more outgoing edges

Decision Tree Example

Training Data (two categorical attributes, one continuous attribute, and the class):

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Model: Decision Tree (splitting attributes: Refund, MarSt, TaxInc):

Refund?
  Yes -> NO
  No  -> MarSt?
           Married -> NO
           Single, Divorced -> TaxInc?
                                 < 80K -> NO
                                 > 80K -> YES

- Leaf (or terminal) nodes have exactly one incoming edge and no outgoing edges

Another Example

An alternative tree that fits the same training data:

MarSt?
  Married -> NO
  Single, Divorced -> Refund?
                        Yes -> NO
                        No  -> TaxInc?
                                 < 80K -> NO
                                 > 80K -> YES

There could be more than one tree that fits the same data!

Apply Model to Test Data

Test record:

Refund  Marital Status  Taxable Income  Cheat
No      Married         80K             ?

Start from the root of the tree:
- Refund = No: follow the No branch to MarSt
- Marital Status = Married: follow the Married branch
- We reach the leaf NO: assign Cheat to "No"
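The walkthrough above can be mirrored in code. This is the example tree hand-coded as nested conditionals, a sketch rather than a general tree implementation:

```python
# The example tree (Refund -> MarSt -> TaxInc) as nested conditionals.
def predict_cheat(refund, marital_status, taxable_income):
    if refund == "Yes":
        return "No"                 # Refund = Yes: leaf NO
    if marital_status == "Married":
        return "No"                 # Married: leaf NO
    # Single or Divorced: split on taxable income at 80K
    return "No" if taxable_income < 80_000 else "Yes"

# The test record above: Refund = No, Married, 80K
print(predict_cheat("No", "Married", 80_000))  # No
```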

Decision Tree Induction

- There are exponentially many decision trees that can be constructed from a given set of attributes
- Finding the optimal tree is computationally infeasible (NP-hard)
- Greedy strategies are used
  - Grow a decision tree by making a series of locally optimal decisions about which attribute to use for splitting the data

Hunt's algorithm

- Let Dt be the set of training records that reach a node t, and y = {y1, …, yc} the class labels
- General Procedure
  - If Dt contains records that all belong to the same class yt, then t is a leaf node labeled as yt
  - If Dt is an empty set, then t is a leaf node labeled by the default class, yd
  - If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets. Recursively apply the procedure to each subset.
- Example (the Refund / Marital Status / Taxable Income data): starting from a single node, the tree is grown by splitting first on Refund, then on Marital Status, and finally on Taxable Income (< 80K vs. >= 80K), until the Cheat and Don't Cheat records are separated
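The general procedure can be sketched recursively. Here `choose_split` is a hypothetical helper standing in for the attribute-test selection discussed in the next slides; it is assumed to return a function mapping a record to a branch key, or None when no further split is useful:

```python
# A minimal sketch of Hunt's algorithm. Records are (attributes, label) pairs.
from collections import Counter

def hunt(records, choose_split, default="No"):
    if not records:                    # empty D_t: leaf labeled with default class
        return default
    labels = [y for _, y in records]
    if len(set(labels)) == 1:          # all records in one class: leaf node
        return labels[0]
    split = choose_split(records)      # hypothetical attribute-test selector
    if split is None:                  # no useful split left: majority-class leaf
        return Counter(labels).most_common(1)[0][0]
    branches = {}
    for x, y in records:
        branches.setdefault(split(x), []).append((x, y))
    # recurse on each subset; keep the split function for later prediction
    return (split, {k: hunt(v, choose_split, default) for k, v in branches.items()})
```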
Tree Induction Issues

- Determine how to split the records
  - How to specify the attribute test condition?
  - How to determine the best split?
- Determine when to stop splitting
How to Specify Test Condition?

- Depends on attribute types
  - Nominal
  - Ordinal
  - Continuous
- Depends on number of ways to split
  - 2-way split
  - Multi-way split

Splitting Based on Nominal Attributes

- Multi-way split: use as many partitions as distinct values.
  CarType -> Family | Sports | Luxury
- Binary split: divides values into two subsets; need to find the optimal partitioning.
  CarType -> {Sports, Luxury} | {Family}   or   {Family, Luxury} | {Sports}

Splitting Based on Ordinal Attributes

- Multi-way split: use as many partitions as distinct values.
  Size -> Small | Medium | Large
- Binary split: divides values into two subsets; need to find the optimal partitioning.
  Size -> {Small, Medium} | {Large}   or   {Small} | {Medium, Large}

Splitting Based on Continuous Attributes

- Different ways of handling
  - Discretization to form an ordinal categorical attribute
    - Static: discretize once at the beginning
    - Dynamic: ranges can be found by equal interval bucketing, equal frequency bucketing (percentiles), or clustering
  - Binary decision: (A < v) or (A >= v)
    - Consider all possible splits and find the best cut
    - Can be more compute intensive

Examples: (i) binary split "Taxable Income > 80K?" (Yes/No); (ii) multi-way split "Taxable Income?" into < 10K, [10K, 25K), [25K, 50K), [50K, 80K), > 80K
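The "consider all possible splits and find the best cut" step can be sketched as follows. Scoring candidate cuts by weighted Gini impurity is an assumption here; any of the impurity measures introduced later would work:

```python
# Find the best binary cut (A < v) for a continuous attribute by trying
# midpoints between consecutive sorted values.
def gini(labels):
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_cut(values, labels):
    """Return (threshold, weighted_gini) of the best split A < v."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, float("inf"))
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no cut between equal attribute values
        v = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for x, y in pairs if x < v]
        right = [y for x, y in pairs if x >= v]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (v, score)
    return best
```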

Determining the Best Split

Before splitting: 10 records of class 0, 10 records of class 1.

Candidate test conditions:
- Own Car?      Yes: C0: 6, C1: 4        No: C0: 4, C1: 6
- Car Type?     Family: C0: 1, C1: 3     Sports: C0: 8, C1: 0     Luxury: C0: 1, C1: 7
- Student ID?   c1 … c20: each child holds a single record (C0: 1, C1: 0 or C0: 0, C1: 1)

Which test condition is the best?

- Greedy approach: nodes with homogeneous class distribution are preferred
- Need a measure of node impurity:
  - C0: 5, C1: 5: non-homogeneous, high degree of impurity
  - C0: 9, C1: 1: homogeneous, low degree of impurity

Impurity Measures

- Measuring the impurity of a node
- P(i|t) = fraction of records belonging to class i at a given node t
- c is the number of classes

Entropy(t) = - Σ_{i=0..c-1} P(i|t) log2 P(i|t)

Gini(t) = 1 - Σ_{i=0..c-1} P(i|t)^2

Classification error(t) = 1 - max_i P(i|t)

Entropy

- Maximum (log2 c) when records are equally distributed among all classes, implying least information
- Minimum (0.0) when all records belong to one class, implying most information


Exercise (Entropy)

C1: 0, C2: 6    P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
                Entropy = - 0 log2 0 - 1 log2 1 = - 0 - 0 = 0

C1: 1, C2: 5    P(C1) = 1/6, P(C2) = 5/6
                Entropy = - (1/6) log2 (1/6) - (5/6) log2 (5/6) = 0.65

C1: 2, C2: 4    P(C1) = 2/6, P(C2) = 4/6
                Entropy = - (2/6) log2 (2/6) - (4/6) log2 (4/6) = 0.92
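The three impurity measures are easy to check numerically. A small sketch that reproduces the exercise values:

```python
# Impurity measures over a list of per-class record counts.
from math import log2

def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)

def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def classification_error(counts):
    n = sum(counts)
    return 1.0 - max(counts) / n

print(round(entropy([1, 5]), 2))  # 0.65
print(round(entropy([2, 4]), 2))  # 0.92
print(entropy([0, 6]))            # 0.0 (pure node)
```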

GINI

Gini(t) = 1 - Σ_{i=0..c-1} P(i|t)^2

- Maximum (1 - 1/c) when records are equally distributed among all classes, implying least interesting information
- Minimum (0.0) when all records belong to one class, implying most interesting information

Exercise: compute the Gini index for nodes with counts (C1: 0, C2: 6), (C1: 1, C2: 5), and (C1: 2, C2: 4)

Exercise (Gini)

C1: 0, C2: 6    P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
                Gini = 1 - P(C1)^2 - P(C2)^2 = 1 - 0 - 1 = 0

C1: 1, C2: 5    P(C1) = 1/6, P(C2) = 5/6
                Gini = 1 - (1/6)^2 - (5/6)^2 = 0.278

C1: 2, C2: 4    P(C1) = 2/6, P(C2) = 4/6
                Gini = 1 - (2/6)^2 - (4/6)^2 = 0.444

Classification Error

Classification error(t) = 1 - max_i P(i|t)

- Maximum (1 - 1/c) when records are equally distributed among all classes, implying least interesting information
- Minimum (0.0) when all records belong to one class, implying most interesting information
Exercise (Classification Error)

C1: 0, C2: 6    P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
                Error = 1 - max(0, 1) = 1 - 1 = 0

C1: 1, C2: 5    P(C1) = 1/6, P(C2) = 5/6
                Error = 1 - max(1/6, 5/6) = 1 - 5/6 = 1/6

C1: 2, C2: 4    P(C1) = 2/6, P(C2) = 4/6
                Error = 1 - max(2/6, 4/6) = 1 - 4/6 = 1/3

Comparison of Impurity Measures

For a 2-class problem, all three measures reach their maximum at a uniform class distribution (P = 0.5) and their minimum (0) at a pure node.

Gain = goodness of a split

Before splitting: a node with counts C0: N00, C1: N01 and impurity M0.
Split A? (Yes/No) yields nodes N1 (C0: N10, C1: N11) and N2 (C0: N20, C1: N21) with impurities M1 and M2, combined (weighted) into M12.
Split B? (Yes/No) yields nodes N3 (C0: N30, C1: N31) and N4 (C0: N40, C1: N41) with impurities M3 and M4, combined into M34.

Compare Gain = M0 - M12 vs. M0 - M34: the split with the larger gain is preferred.

Information Gain

- When Entropy is used as the impurity measure, it's called information gain
- Measures how much we gain by splitting a parent node

Δ_info = Entropy(p) - Σ_{j=1..k} [N(v_j) / N] * Entropy(v_j)

where k is the number of attribute values, N(v_j) is the number of records associated with the child node v_j, and N is the total number of records at the parent node.
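A sketch of the formula, evaluated on the "Own Car?" split from the earlier slide (parent with counts 10/10, children 6/4 and 4/6):

```python
# Information gain of a split, given class counts at the parent and children.
from math import log2

def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)

def information_gain(parent, children):
    """parent: class counts at the parent; children: list of class-count lists."""
    n = sum(parent)
    weighted = sum(sum(ch) / n * entropy(ch) for ch in children)
    return entropy(parent) - weighted

gain = information_gain([10, 10], [[6, 4], [4, 6]])
print(round(gain, 3))  # 0.029: a weak split, barely better than no split
```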
Gain Ratio

Gain ratio = Δ_info / Split info

Split info = - Σ_{i=1..k} P(v_i) log2 P(v_i)

- If the attribute produces a large number of splits, its split info will also be large, which in turn reduces its gain ratio
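Split info can be sketched over the child-node sizes; note how it penalizes the 20-way Student ID split from the earlier "best split" slide relative to the binary Own Car? split:

```python
# Split info over the sizes of the child nodes produced by a split.
from math import log2

def split_info(child_sizes):
    n = sum(child_sizes)
    return -sum((s / n) * log2(s / n) for s in child_sizes if s > 0)

print(split_info([10, 10]))  # 1.0 bit for a balanced binary split
print(split_info([1] * 20))  # log2(20), about 4.32, for the Student ID split
```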

Stopping Criteria for Tree Induction

- Stop expanding a node when all the records belong to the same class
- Stop expanding a node when all the records have similar attribute values
- Early termination

Summary: Decision Trees

- Inexpensive to construct
- Extremely fast at classifying unknown records
- Easy to interpret for small-sized trees
- Accuracy is comparable to other classification techniques for many simple data sets

Practical Issues of Classification

Underfitting and Overfitting

Example data: 500 circular and 500 triangular data points.
- Circular points: 0.5 <= sqrt(x1^2 + x2^2) <= 1
- Triangular points: sqrt(x1^2 + x2^2) < 0.5 or sqrt(x1^2 + x2^2) > 1

Underfitting: when the model is too simple, both training and test errors are large.
Overfitting: when the model is too complex, it fits the training data very well but generalizes poorly to unseen data.

How to Address Overfitting

- Pre-Pruning (Early Stopping Rule)
  - Stop the algorithm before it becomes a fully-grown tree
  - Typical stopping conditions for a node:
    - Stop if all instances belong to the same class
    - Stop if all the attribute values are the same
  - More restrictive conditions:
    - Stop if the number of instances is less than some user-specified threshold
    - Stop if the class distribution of instances is independent of the available features
    - Stop if expanding the current node does not improve impurity measures (e.g., Gini or information gain)

How to Address Overfitting…

- Post-pruning
  - Grow the decision tree to its entirety
  - Trim the nodes of the decision tree in a bottom-up fashion
  - If generalization error improves after trimming, replace the sub-tree by a leaf node
  - The class label of the leaf node is determined from the majority class of instances in the sub-tree

Methods for estimating performance

- Holdout
  - Reserve 2/3 for training and 1/3 for testing (validation set)
- Cross validation
  - Partition data into k disjoint subsets
  - k-fold: train on k-1 partitions, test on the remaining one
  - Leave-one-out: k = n
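The k-fold scheme can be sketched as an index partitioner (the helper name is my own, not from the slides): split the record indices into k disjoint folds, then train on k-1 of them and test on the remaining one:

```python
# Minimal k-fold cross-validation index generator.
def k_fold_indices(n, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    folds = [list(range(i, n, k)) for i in range(k)]  # k disjoint subsets
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

# Leave-one-out is the special case k = n: each test set holds one record.
for train, test in k_fold_indices(6, 3):
    print(train, test)
```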

Expressivity

- Each split involves a single attribute, so the decision boundary consists of axis-parallel segments that partition the attribute space into rectangular regions
- Example (attributes x, y in [0, 1]): the tree splits on x < 0.43 at the root, then on y < 0.47 and y < 0.33 in the two subtrees, and labels each leaf + or - by its class counts
- A diagonal boundary such as x + y < 1 cannot be captured by a single test on one attribute and requires many axis-parallel splits to approximate
Use-case: Web Robot Detection

Assignment 1

Tasks

- Given a training data set and a test set
- Task 1: Build a decision tree classifier
  - You have to build it from scratch
  - You are free to pick your programming language
  - Submit code and predicted class labels for the test set
  - Accuracy has to reach a certain threshold
- Task 2: Submit a short report describing
  - What processing steps you applied
  - Which are the most important features of the dataset (based on the decision tree built)
- Task 3 (optional): Use any classifier from the scikit-learn Python machine learning library
  - Submit code and predicted class labels for the test set

Online evaluation

- A real-time leaderboard will be available for the submissions (updated after each git push)
- Two tracks, with results reported separately
  - Decision tree track (for everyone)
  - Open track (optional)
- Best teams for each track get +5 points at the exam (all members)
- Online evaluation will be available from next week

Practicalities

- Data set and specific instructions will be made available today
- Work in groups of 2-3
- Deadlines
  - Forming groups by 11/9 (this Friday!)
  - Predictions due 28/9
  - Report due 5/10
- Note: I'm away this Friday, but the practicum will be held. Get started on Assignment 1!
