Chapter 6
August 1, 2011
- What is classification? What is prediction?
- Issues regarding classification and prediction
- Classification by decision tree induction
- Bayesian classification
- Rule-based classification
- Classification by back propagation
Training Data:

    RANK           | YEARS
    Assistant Prof | 3
    Assistant Prof | 7
    Professor      | 2
    Associate Prof | 7
    Assistant Prof | 6
    Associate Prof | 3

-> Classifier (Model)

Test Data:

    NAME    | RANK
    Tom     | Assistant Prof
    Merlisa | Associate Prof
    George  | Professor
    Joseph  | Assistant Prof

-> Tenured?
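The two-step process in the figures above (learn a model from labeled training tuples, then apply it to unseen tuples) can be sketched as follows. The rule and the test tuples' YEARS values are illustrative assumptions, not the slides' actual learned model:

```python
def tenured(rank: str, years: int) -> str:
    # Step 1 output (hypothetical): a learned rule such as
    #   IF rank = 'Professor' OR years > 6 THEN tenured = 'yes'
    return "yes" if rank == "Professor" or years > 6 else "no"

# Step 2: classify previously unseen test tuples
# (the YEARS values here are made up for illustration)
for name, rank, years in [("Tom", "Assistant Prof", 2),
                          ("Merlisa", "Associate Prof", 7),
                          ("George", "Professor", 5)]:
    print(name, tenured(rank, years))
```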
[Figure: the AllElectronics training data and the decision tree induced for buys_computer]
Select the attribute with the highest information gain. Let pi be the probability that an arbitrary tuple in D belongs to class Ci, estimated by |Ci,D|/|D|.

Expected information (entropy) needed to classify a tuple in D:

    Info(D) = -sum_{i=1..m} pi log2(pi)

Information needed (after using A to split D into v partitions) to classify D:

    Info_A(D) = sum_{j=1..v} (|Dj|/|D|) × Info(Dj)

Information gained by branching on attribute A:

    Gain(A) = Info(D) - Info_A(D)

Ex. Class "yes": 9 tuples, class "no": 5 tuples:

    Info(D) = I(9,5) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940

    Info_age(D) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694

Here (5/14) I(2,3) means age <=30 has 5 out of 14 samples, with 2 yes'es and 3 no's. Hence

    Gain(age) = Info(D) - Info_age(D) = 0.246
    age    | income | student | credit_rating | buys_computer
    <=30   | high   | no      | fair          | no
    <=30   | high   | no      | excellent     | no
    31..40 | high   | no      | fair          | yes
    >40    | medium | no      | fair          | yes
    >40    | low    | yes     | fair          | yes
    >40    | low    | yes     | excellent     | no
    31..40 | low    | yes     | excellent     | yes
    <=30   | medium | no      | fair          | no
    <=30   | low    | yes     | fair          | yes
    >40    | medium | yes     | fair          | yes
    <=30   | medium | yes     | excellent     | yes
    31..40 | medium | no      | excellent     | yes
    31..40 | high   | yes     | fair          | yes
    >40    | medium | no      | excellent     | no

Data Mining: Concepts and Techniques
Similarly,
    Gain(income) = 0.029
    Gain(student) = 0.151
    Gain(credit_rating) = 0.048
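These gains can be reproduced numerically from the table above. A minimal sketch in plain Python (no libraries; the tuple layout and attribute names follow the table):

```python
from math import log2
from collections import Counter

# Training tuples (age, income, student, credit_rating, buys_computer),
# copied from the AllElectronics table above
D = [
    ("<=30",   "high",   "no",  "fair",      "no"),
    ("<=30",   "high",   "no",  "excellent", "no"),
    ("31..40", "high",   "no",  "fair",      "yes"),
    (">40",    "medium", "no",  "fair",      "yes"),
    (">40",    "low",    "yes", "fair",      "yes"),
    (">40",    "low",    "yes", "excellent", "no"),
    ("31..40", "low",    "yes", "excellent", "yes"),
    ("<=30",   "medium", "no",  "fair",      "no"),
    ("<=30",   "low",    "yes", "fair",      "yes"),
    (">40",    "medium", "yes", "fair",      "yes"),
    ("<=30",   "medium", "yes", "excellent", "yes"),
    ("31..40", "medium", "no",  "excellent", "yes"),
    ("31..40", "high",   "yes", "fair",      "yes"),
    (">40",    "medium", "no",  "excellent", "no"),
]
ATTRS = {"age": 0, "income": 1, "student": 2, "credit_rating": 3}

def info(tuples):
    """Expected information (entropy): Info(D) = -sum p_i * log2(p_i)."""
    n = len(tuples)
    return -sum((c / n) * log2(c / n)
                for c in Counter(t[-1] for t in tuples).values())

def gain(tuples, attr):
    """Information gain: Gain(A) = Info(D) - Info_A(D)."""
    i = ATTRS[attr]
    n = len(tuples)
    info_a = sum((cnt / n) * info([t for t in tuples if t[i] == v])
                 for v, cnt in Counter(t[i] for t in tuples).items())
    return info(tuples) - info_a

print(f"Info(D) = {info(D):.3f}")  # Info(D) = 0.940
for a in ATTRS:
    # prints each gain; age has the largest (≈0.246, up to rounding)
    print(f"Gain({a}) = {gain(D, a):.3f}")
```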
For a continuous-valued attribute A, sort the values and consider the midpoint between each pair of adjacent values as a possible split-point. The point with the minimum expected information requirement for A is selected as the split-point for A. Split: D1 is the set of tuples in D satisfying A <= split-point, and D2 is the set of tuples in D satisfying A > split-point.
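A sketch of this split-point search. The data at the bottom is illustrative, not from the slides; candidate split points are assumed to be midpoints between adjacent sorted values:

```python
from math import log2
from collections import Counter

def entropy(labels):
    """Info(D) = -sum p_i log2(p_i) over the class labels."""
    n = len(labels)
    h = 0.0
    for c in Counter(labels).values():
        h -= (c / n) * log2(c / n)
    return h

def best_split_point(values, labels):
    """Evaluate the midpoint between each pair of adjacent sorted values
    and return the one minimizing the expected information Info_A(D)."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_info, best_point = float("inf"), None
    for i in range(n - 1):
        if pairs[i][0] == pairs[i + 1][0]:
            continue  # equal adjacent values yield no midpoint
        mid = (pairs[i][0] + pairs[i + 1][0]) / 2
        left = [c for _, c in pairs[: i + 1]]
        right = [c for _, c in pairs[i + 1:]]
        info_a = len(left) / n * entropy(left) + len(right) / n * entropy(right)
        if info_a < best_info:
            best_info, best_point = info_a, mid
    return best_point, best_info

# Illustrative data: the class flips cleanly around age 40,
# so the pure split at 39.5 needs zero expected information
ages   = [25, 32, 38, 41, 45, 50]
labels = ["no", "no", "no", "yes", "yes", "yes"]
print(best_split_point(ages, labels))  # (39.5, 0.0)
```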
The gain ratio (used by C4.5) normalizes information gain by the split information:

    SplitInfo_A(D) = -sum_{j=1..v} (|Dj|/|D|) × log2(|Dj|/|D|)

    GainRatio(A) = Gain(A) / SplitInfo_A(D)

Ex. For income (4 "high", 6 "medium", 4 "low" tuples):

    SplitInfo_income(D) = -(4/14) log2(4/14) - (6/14) log2(6/14) - (4/14) log2(4/14) = 1.557

    gain_ratio(income) = 0.029/1.557 = 0.019

The attribute with the maximum gain ratio is selected as the splitting attribute.
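As a quick check, SplitInfo for income follows directly from the income distribution in the table above (4 high, 6 medium, 4 low out of 14 tuples); a minimal sketch:

```python
from math import log2

n = 14                     # total training tuples
income_counts = (4, 6, 4)  # high, medium, low (from the table above)

# SplitInfo_A(D) = -sum (|Dj|/|D|) log2(|Dj|/|D|)
split_info = -sum((k / n) * log2(k / n) for k in income_counts)
gain_income = 0.029        # Gain(income), computed earlier

print(f"SplitInfo_income(D) = {split_info:.3f}")               # 1.557
print(f"GainRatio(income) = {gain_income / split_info:.3f}")   # 0.019
```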
If a data set D contains examples from n classes, the gini index gini(D) is defined as

    gini(D) = 1 - sum_{j=1..n} pj^2

where pj is the relative frequency of class j in D. If a data set D is split on A into two subsets D1 and D2, the gini index gini_A(D) is defined as

    gini_A(D) = (|D1|/|D|) gini(D1) + (|D2|/|D|) gini(D2)

Reduction in impurity:

    Δgini(A) = gini(D) - gini_A(D)

The attribute that provides the smallest gini_split(D) (or the largest reduction in impurity) is chosen to split the node (need to enumerate all the possible splitting points for each attribute).
Ex. D has 9 tuples in buys_computer = "yes" and 5 in "no":

    gini(D) = 1 - (9/14)^2 - (5/14)^2 = 0.459

Suppose the attribute income partitions D into 10 tuples in D1: {low, medium} and 4 tuples in D2: {high}:

    gini_income∈{low,medium}(D) = (10/14) gini(D1) + (4/14) gini(D2) = 0.443 = gini_income∈{high}(D)

The other binary splits give higher values (0.450 for {medium, high} and 0.458 for {low, high}), so the split on {low, medium} (and {high}) is the best since it has the lowest gini index.
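These gini values can be reproduced from the class counts per income level, which are taken from the training table above:

```python
from collections import Counter

# buys_computer labels grouped by income (counts from the table above)
by_income = {
    "low":    ["yes"] * 3 + ["no"] * 1,
    "medium": ["yes"] * 4 + ["no"] * 2,
    "high":   ["yes"] * 2 + ["no"] * 2,
}

def gini(labels):
    """gini(D) = 1 - sum p_j^2 over the class frequencies."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(subset):
    """Gini index of the binary split: subset vs. the remaining income values."""
    d1 = [l for v in subset for l in by_income[v]]
    d2 = [l for v in by_income if v not in subset for l in by_income[v]]
    n = len(d1) + len(d2)
    return len(d1) / n * gini(d1) + len(d2) / n * gini(d2)

all_labels = [l for ls in by_income.values() for l in ls]
print(f"{gini(all_labels):.3f}")                # 0.459
print(f"{gini_split({'low', 'medium'}):.3f}")   # 0.443  (best: lowest)
print(f"{gini_split({'medium', 'high'}):.3f}")  # 0.450
print(f"{gini_split({'low', 'high'}):.3f}")     # 0.458
```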
- All attributes are assumed continuous-valued
- May need other tools, e.g., clustering, to get the possible split values
- Can be modified for categorical attributes
- CHAID: a popular decision tree algorithm, measure based on the χ2 test for independence
- C-SEP: performs better than info. gain and gini index in certain cases
- G-statistic: has a close approximation to the χ2 distribution
- MDL (Minimal Description Length) principle (i.e., the simplest solution is preferred): the best tree is the one that requires the fewest # of bits to both (1) encode the tree, and (2) encode the exceptions to the tree
- Multivariate splits (partition based on multiple variable combinations): CART finds multivariate splits based on a linear comb. of attrs.
- Which attribute selection measure is the best? Most give good results; none is significantly superior to the others
The training data above can be summarized, one attribute at a time, into AVC-sets (Attribute-Value, Class label) holding the Buy_Computer class counts for each attribute value:

AVC-set on Age:

    age    | yes | no
    <=30   |  2  |  3
    31..40 |  4  |  0
    >40    |  3  |  2

AVC-set on income:

    income | yes | no
    high   |  2  |  2
    medium |  4  |  2
    low    |  3  |  1

AVC-set on Student:

    student | yes | no
    yes     |  6  |  1
    no      |  3  |  4

AVC-set on credit_rating:

    credit_rating | yes | no
    fair          |  6  |  2
    excellent     |  3  |  3
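Each AVC-set is just a per-attribute histogram of class counts, so all of them can be built in a single scan of the data. A minimal sketch over the training table above:

```python
from collections import defaultdict

# Training tuples (age, income, student, credit_rating, buys_computer),
# as in the table above
D = [
    ("<=30",   "high",   "no",  "fair",      "no"),
    ("<=30",   "high",   "no",  "excellent", "no"),
    ("31..40", "high",   "no",  "fair",      "yes"),
    (">40",    "medium", "no",  "fair",      "yes"),
    (">40",    "low",    "yes", "fair",      "yes"),
    (">40",    "low",    "yes", "excellent", "no"),
    ("31..40", "low",    "yes", "excellent", "yes"),
    ("<=30",   "medium", "no",  "fair",      "no"),
    ("<=30",   "low",    "yes", "fair",      "yes"),
    (">40",    "medium", "yes", "fair",      "yes"),
    ("<=30",   "medium", "yes", "excellent", "yes"),
    ("31..40", "medium", "no",  "excellent", "yes"),
    ("31..40", "high",   "yes", "fair",      "yes"),
    (">40",    "medium", "no",  "excellent", "no"),
]
ATTRS = ["age", "income", "student", "credit_rating"]

def avc_set(attr):
    """AVC-set for attr: class counts (yes/no of buys_computer)
    for each attribute value, built in one scan of the data."""
    i = ATTRS.index(attr)
    counts = defaultdict(lambda: {"yes": 0, "no": 0})
    for t in D:
        counts[t[i]][t[-1]] += 1
    return dict(counts)

print(avc_set("age"))
# {'<=30': {'yes': 2, 'no': 3}, '31..40': {'yes': 4, 'no': 0}, '>40': {'yes': 3, 'no': 2}}
```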