C4.5 algorithm
Let the classes be denoted {C1, C2, ..., Ck}. There are three possibilities for the content of the set of training samples T at a given node of the decision tree:
1. T contains one or more samples, all belonging to a single class Cj. The decision tree for T is a leaf identifying class Cj.
2. T contains no samples. The decision tree is again a leaf, but the class to be associated with the leaf must be determined from information other than T. The C4.5 algorithm uses as its criterion the most frequent class at the parent of the given node.
3. T contains samples that belong to a mixture of classes. In this situation, the idea is to refine T into subsets of samples that are heading towards single-class collections of samples. An appropriate test is chosen, based on a single attribute, that has one or more mutually exclusive outcomes {O1, O2, ..., On}:
T is partitioned into subsets T1, T2, ..., Tn, where Ti contains all the samples in T that have outcome Oi of the chosen test. The decision tree for T consists of a decision node identifying the test and one branch for each possible outcome (this recursive scheme is sketched below).
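A minimal Python sketch of this three-case recursion, under the assumption that each training sample is a dict with a 'class' key plus one key per attribute; the test-selection rule is left as a pluggable parameter, since C4.5 selects it with the gain criterion defined below. The names here are illustrative, not Quinlan's implementation.

from collections import Counter

def build_tree(T, attributes, choose=None, parent_majority=None):
    # Placeholder selection rule (first attribute); C4.5 instead picks the
    # test with the highest information gain.
    choose = choose or (lambda samples, atts: atts[0])
    # Case 2: T contains no samples -> leaf labeled with the most frequent
    # class at the parent of this node.
    if not T:
        return parent_majority
    class_counts = Counter(s["class"] for s in T)
    # Case 1: all samples belong to a single class -> leaf for that class
    # (also used as a fallback when no attributes remain to test).
    if len(class_counts) == 1 or not attributes:
        return class_counts.most_common(1)[0][0]
    # Case 3: mixed classes -> choose a test on one attribute, partition T
    # by its outcomes, and grow one subtree per outcome.
    att = choose(T, attributes)
    majority = class_counts.most_common(1)[0][0]
    remaining = [a for a in attributes if a != att]
    node = {"test": att, "branches": {}}
    for value in {s[att] for s in T}:
        Ti = [s for s in T if s[att] == value]
        node["branches"][value] = build_tree(Ti, remaining, choose, majority)
    return node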
Test entropy:
If S is any set of samples, let freq(Ci, S) stand for the number of samples in S that belong to class Ci (out of k possible classes), and let |S| denote the number of samples in the set S. Then the entropy of the set S is:

Info(S) = - sum_{i=1..k} (freq(Ci, S) / |S|) * log2(freq(Ci, S) / |S|)   bits
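As a concrete check of this formula, a small Python sketch (samples represented as dicts with a 'class' key, as in the sketch above):

import math
from collections import Counter

def info(S):
    # Info(S) = -sum_i freq(Ci,S)/|S| * log2(freq(Ci,S)/|S|), in bits.
    n = len(S)
    return -sum((c / n) * math.log2(c / n)
                for c in Counter(s["class"] for s in S).values())

# For the 14-sample database T used in the examples below (9 CLASS1,
# 5 CLASS2), info(T) = -9/14*log2(9/14) - 5/14*log2(5/14) = 0.940 bits.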
After set T has been partitioned in accordance with the n outcomes of one attribute test X:

Info_x(T) = sum_{i=1..n} (|Ti| / |T|) * Info(Ti)
Gain(X) = Info(T) - Info_x(T)

Criterion: select the attribute with the highest Gain value.
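A sketch of the gain computation for the standard test on a discrete attribute, assuming the info() helper from the entropy sketch above is in scope:

def gain(T, att):
    # Gain(X) = Info(T) - Info_x(T) for the test "one branch per value of att".
    n = len(T)
    info_x = 0.0
    for value in {s[att] for s in T}:
        Ti = [s for s in T if s[att] == value]
        info_x += len(Ti) / n * info(Ti)   # info() as defined above
    return info(T) - info_x

# On the example database, gain(T, "Attribute1") = 0.246 bits, the highest
# gain of the candidate tests.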
Subsets of the example database T produced by candidate tests:

T3 (samples with Attribute1 = C):

Att.2   Att.3   Class
-----------------------
80      True    CLASS2
70      True    CLASS2
80      False   CLASS1
80      False   CLASS1
96      False   CLASS1

Samples with Attribute3 = True:

Att.1   Att.2   Class
-----------------------
A       70      CLASS1
A       90      CLASS2
B       90      CLASS1
B       65      CLASS1
C       80      CLASS2
C       70      CLASS2

Samples with Attribute3 = False:

Att.1   Att.2   Class
-----------------------
A       85      CLASS2
A       95      CLASS2
A       70      CLASS1
B       78      CLASS1
B       75      CLASS1
C       80      CLASS1
C       80      CLASS1
C       96      CLASS1
C4.5 contains mechanisms for proposing three types of tests:
1. The standard test on a discrete attribute, with one outcome and one branch for each possible value of that attribute.
2. If attribute Y has continuous numeric values, a binary test with outcomes Y ≤ Z and Y > Z, based on comparing the value of Y against a threshold value Z.
3. A more complex test, also based on a discrete attribute, in which the possible values are allocated to a variable number of groups, with one outcome and one branch for each group.
Example(1/2)
Attribute2: after a sorting process, the set of values is {65, 70, 75, 78, 80, 85, 90, 95, 96}, so the set of potential threshold values Z considered by C4.5 is {65, 70, 75, 78, 80, 85, 90, 95}. The optimal value is Z = 80, and the information gain is computed for the corresponding test x3 (Attribute2 ≤ 80 or Attribute2 > 80).
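The threshold search can be sketched as follows, again assuming the info() helper from the entropy sketch is in scope; candidate thresholds are the sorted attribute values themselves, all but the largest, as described above:

def best_threshold(T, att):
    # Return (gain, Z) for the best binary test att <= Z vs. att > Z.
    n = len(T)
    values = sorted({s[att] for s in T})
    best = (0.0, None)
    for Z in values[:-1]:                  # every value except the largest
        left = [s for s in T if s[att] <= Z]
        right = [s for s in T if s[att] > Z]
        g = info(T) - (len(left) / n * info(left) + len(right) / n * info(right))
        if g > best[0]:
            best = (g, Z)
    return best

# For Attribute2 on the example database this yields Z = 80 with a gain of
# 0.103 bits, matching the computation below.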
Example(2/2)
Info_x3(T) = 9/14 * (-7/9 log2(7/9) - 2/9 log2(2/9)) + 5/14 * (-2/5 log2(2/5) - 3/5 log2(3/5)) = 0.837 bits

Gain(x3) = 0.940 - 0.837 = 0.103 bits, where Info(T) = 0.940 bits is the entropy of the complete database.

Attribute1 gives the highest gain, 0.246 bits, and therefore this attribute will be selected for the first splitting.
Example
The 14-sample database T; the Attribute1 value of one sample is unknown (?):

Attribute1   Attribute2   Attribute3   Class
---------------------------------------------
A            70           True         CLASS1
A            90           True         CLASS2
A            85           False        CLASS2
A            95           False        CLASS2
A            70           False        CLASS1
?            90           True         CLASS1
B            78           False        CLASS1
B            65           True         CLASS1
B            75           False        CLASS1
C            80           True         CLASS2
C            70           True         CLASS2
C            80           False        CLASS1
C            80           False        CLASS1
C            96           False        CLASS1
Example
Because one sample has an unknown Attribute1 value, Info(T) and Info_x1(T) are computed over the 13 samples with known values, and the resulting gain is multiplied by the factor F = 13/14, the fraction of samples for which the attribute value is known:

Info(T) = -8/13 log2(8/13) - 5/13 log2(5/13) = 0.961 bits

Info_x1(T) = 5/13 * (-2/5 log2(2/5) - 3/5 log2(3/5))
           + 3/13 * (-3/3 log2(3/3) - 0/3 log2(0/3))
           + 5/13 * (-3/5 log2(3/5) - 2/5 log2(2/5)) = 0.747 bits

Gain(x1) = 13/14 * (0.961 - 0.747) = 0.199 bits
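A sketch of this missing-value correction, with info() as in the entropy sketch above; the "?" marker and field names follow the table:

def gain_with_missing(T, att, missing="?"):
    # Evaluate Info and Info_x only on samples whose value of att is known,
    # then scale the gain by F, the fraction of samples with a known value.
    known = [s for s in T if s[att] != missing]
    F = len(known) / len(T)                # 13/14 in the example
    n = len(known)
    info_x = 0.0
    for value in {s[att] for s in known}:
        Ti = [s for s in known if s[att] == value]
        info_x += len(Ti) / n * info(Ti)
    return F * (info(known) - info_x)

# On the table above: gain_with_missing(T, "Attribute1")
# = 13/14 * (0.961 - 0.747) = 0.199 bits.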
Figure: the decision tree after the first split on Attribute1. The sample with the unknown Attribute1 value is sent down all three branches with fractional weights proportional to the known-value frequencies (5/13 to branch A, 3/13 to branch B, 5/13 to branch C); samples with known values keep weight 1.
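The fractional weights in the figure come from distributing the unknown-value sample across all branches in proportion to the known-value frequencies; a short sketch:

known_counts = {"A": 5, "B": 3, "C": 5}    # known Attribute1 values, 13 in total
total_known = sum(known_counts.values())
weights = {v: c / total_known for v, c in known_counts.items()}
# weights == {"A": 5/13, "B": 3/13, "C": 5/13}; known samples keep weight 1.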
Postpruning
Retrospectively removing some of the tree structure, using selected accuracy criteria.
If Attribute1 = C and Attribute3 = True then Classification = CLASS2 (2.4 / 0);
If Attribute1 = C and Attribute3 = False then Classification = CLASS1 (3.0 / 0).

The numbers in parentheses give the number of covered cases (fractional because of the distributed missing-value sample) and the number of misclassified cases.
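C4.5 itself prunes using a pessimistic, confidence-based estimate of the leaf error rate. As a simplified illustration of subtree replacement only, the following reduced-error-style sketch (a stand-in, not C4.5's criterion) collapses a subtree into its majority-class leaf whenever the leaf misclassifies no more of a set of pruning samples than the subtree does; it assumes the node structure from the build_tree sketch above.

from collections import Counter

def classify(node, sample):
    # Walk the tree from the build_tree sketch; leaves are class labels.
    while isinstance(node, dict):
        node = node["branches"].get(sample[node["test"]])
    return node

def prune(node, samples):
    if not isinstance(node, dict) or not samples:
        return node                        # a leaf, or nothing to evaluate on
    att = node["test"]
    # Prune bottom-up: children first, each on its share of the samples.
    for value in list(node["branches"]):
        subset = [s for s in samples if s[att] == value]
        node["branches"][value] = prune(node["branches"][value], subset)
    majority = Counter(s["class"] for s in samples).most_common(1)[0][0]
    subtree_errors = sum(classify(node, s) != s["class"] for s in samples)
    leaf_errors = sum(s["class"] != majority for s in samples)
    # Replace the subtree with a leaf if that does not increase errors.
    return majority if leaf_errors <= subtree_errors else node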