Article info

Keywords: Imprecise probabilities, Imprecise Dirichlet Model, Uncertainty measures, Credal decision trees, C4.5 algorithm, Noisy data

Abstract
In the area of classification, C4.5 is a well-known algorithm widely used to design decision trees. In this algorithm, a pruning process is carried out to solve the problem of over-fitting. This paper presents a modification of C4.5, called Credal-C4.5. The new procedure rests on a mathematical theory of imprecise probabilities together with uncertainty measures: Credal-C4.5 estimates the probabilities of the features and of the class variable via imprecise probabilities, and it uses a new split criterion, called Imprecise Information Gain Ratio, which applies uncertainty measures on convex sets of probability distributions (credal sets). In this manner, Credal-C4.5 builds trees for solving classification problems under the assumption that the training set is not fully reliable. We carried out several experimental studies comparing this new procedure with others, and we obtained the following principal conclusion: in domains with class noise, Credal-C4.5 obtains smaller trees and better performance than classic C4.5.
© 2014 Elsevier Ltd. All rights reserved.
X1 = 0 → 3 of class A, 6 of class B
X1 = 1 → 6 of class A, 0 of class B
X2 = 0 → 1 of class A, 5 of class B
X2 = 1 → 8 of class A, 1 of class B
If this data set appears in a node of a tree, then the C4.5 algorithm chooses the variable X1 for splitting the node (see Fig. 1).
We can suppose that the data set is noisy because it contains an outlier: the instance with X2 = 1 and class B. The clean distribution is thus composed of 10 instances of class A and 5 instances of class B, organized as follows:
X1 = 0 → 4 of class A, 5 of class B
X1 = 1 → 6 of class A, 0 of class B
X2 = 0 → 1 of class A, 5 of class B
X2 = 1 → 9 of class A, 0 of class B
If this data set is found in a node of a tree, then the C4.5 algorithm chooses the variable X2 for splitting the node (see Fig. 2).
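These two selections can be checked numerically. The following sketch (ours, not part of the original paper) computes the Info-Gain criterion used by C4.5 (defined formally in Section 2.1.1) for both features on each sample:

    from math import log2

    def entropy(counts):
        """Shannon entropy (in bits) of a list of class counts."""
        n = sum(counts)
        return -sum(c / n * log2(c / n) for c in counts if c > 0)

    def info_gain(total, branches):
        """IG(C, X) = H(C) - sum_i P(X = x_i) H(C | X = x_i)."""
        n = sum(total)
        return entropy(total) - sum(sum(b) / n * entropy(b) for b in branches)

    # Noisy sample: class totals (9 A, 6 B).
    print(info_gain([9, 6], [[3, 6], [6, 0]]))   # X1: ~0.420 -> chosen by C4.5
    print(info_gain([9, 6], [[1, 5], [8, 1]]))   # X2: ~0.409

    # Clean sample: class totals (10 A, 5 B).
    print(info_gain([10, 5], [[4, 5], [6, 0]]))  # X1: ~0.324
    print(info_gain([10, 5], [[1, 5], [9, 0]]))  # X2: ~0.658 -> chosen by C4.5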
We can observe that the C4.5 algorithm generates an incorrect subtree when noisy data are processed, because it assumes that the data set is reliable. Later, the pruning process assumes that the data set is not reliable in order to solve this problem. However, the pruning process can only delete the incorrect subtree already generated; it cannot make a detailed adjustment towards the correct subtree illustrated in Fig. 2. The ideal situation is to carry out the branching shown in Fig. 2 and then to apply the pruning process. This situation is achieved by using decision trees based on imprecise probabilities, as will be shown later.
In recent years, several formal theories for manipulating imprecise probabilities have been developed (Walley, 1996; Wang, 2010; Weichselberger, 2000). By using the theory of imprecise probabilities presented in Walley (1996), known as the Imprecise Dirichlet Model (IDM), Abellán and Moral (2003) developed an algorithm for designing decision trees, called credal decision trees (CDTs). The variable selection process of this algorithm (its split criterion) is based on imprecise probabilities and uncertainty measures on credal sets, i.e. closed and convex sets of probability distributions. In particular, the CDT algorithm extends the measure of information gain used by ID3. The split criterion is called the Imprecise Info-Gain (IIG).
Recently, in Mantas and Abellán (2014), credal decision trees were built by using an extension of the IIG criterion. In that work, the probability values of the class variable and of the features are estimated via imprecise probabilities. The CDT algorithm obtains good experimental results (Abellán & Moral, 2005; Abellán & Masegosa, 2009). Besides, its use within bagging ensembles (Abellán & Masegosa, 2009, 2012; Abellán & Mantas, 2014) and its above-mentioned extension (Mantas & Abellán, 2014) are especially suitable when noisy data are classified. A complete and recent review of machine learning methods for handling label noise can be found in Frénay and Verleysen (in press), where the credal decision tree procedure is included as a label-noise-robust method.
[Figs. 1 and 2. Trees obtained by splitting on X1 (branches X1 = 0 and X1 = 1) and on X2 (branches X2 = 0 and X2 = 1) for the samples above.]
2. Previous knowledge
2.1. Decision trees
Decision trees (DTs), also known as classification trees or hierarchical classifiers, started to play an important role in machine learning with the publication of Quinlan's ID3 (Iterative Dichotomiser 3) (Quinlan, 1986). Subsequently, Quinlan also presented the C4.5 algorithm (Classifier 4.5) (Quinlan, 1993), which is an advanced version of ID3. Since then, C4.5 has been considered a standard model in supervised classification. It has also been widely applied as a data analysis tool in very different fields, such as astronomy, biology and medicine.
Decision trees are models based on a recursive partition method, the aim of which is to divide the data set using a single variable at each level. This variable is selected with a given criterion. Ideally, each partition defines a set of cases all of which belong to the same class.
Their knowledge representation has a simple tree structure. It can be interpreted as a compact set of rules in which each internal node is labeled with an attribute variable that produces a branch for each of its values. The leaf nodes are labeled with a class label.
The process for inferring a decision tree is mainly determined by the following aspects:
(i) The criteria used to select the attribute to insert in a node and to branch on it (split criteria).
(ii) The criteria to stop the tree from branching.
(iii) The method for assigning a class label or a probability distribution at the leaf nodes.
(iv) The post-pruning process used to simplify the tree structure.
Many different approaches for inferring decision trees, depending on the aforementioned factors, have been published. Quinlan's ID3 (Quinlan, 1986) and C4.5 (Quinlan, 1993) stand out among them.
Decision trees are built using a set of data referred to as the
training data set. A different set, called the test data set, is used
to check the model. When we obtain a new sample or instance
of the test data set, we can make a decision or prediction on the
state of the class variable by following the path in the tree from
the root to a leaf node, using the sample values and tree structure.
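This prediction walk can be pictured with a minimal tree structure. The sketch below is purely illustrative (the Node type and the labels are ours, not from the paper):

    from dataclasses import dataclass, field
    from typing import Dict, Optional

    @dataclass
    class Node:
        attribute: Optional[str] = None   # splitting attribute; None at a leaf
        children: Dict[str, "Node"] = field(default_factory=dict)  # one child per value
        label: Optional[str] = None       # class label at a leaf

    def predict(node: Node, instance: Dict[str, str]) -> str:
        """Follow the path from the root to a leaf using the sample's values."""
        while node.attribute is not None:
            node = node.children[instance[node.attribute]]
        return node.label

    # Tiny tree for the clean sample of Section 1: split on X2, label the leaves.
    root = Node(attribute="X2",
                children={"0": Node(label="B"), "1": Node(label="A")})
    print(predict(root, {"X1": "0", "X2": "1"}))  # -> "A"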
2.1.1. Split criteria
Let us suppose a classification problem. Let C be the class variable, {X_1, ..., X_n} the set of features, and X a general feature. We can find the following split criteria used to build a DT.
Info-Gain. This metric was introduced by Quinlan as the basis for his ID3 model (Quinlan, 1986). The model has the following main features: it was defined to obtain decision trees with discrete variables, it does not work with missing values, no pruning process is carried out, and it is based on Shannon's entropy H.
The split criterion of this model is the Info-Gain (IG), which is defined as

IG(C, X) = H(C) - \sum_i P(X = x_i) H(C | X = x_i),

where

H(C) = - \sum_j P(C = c_j) log P(C = c_j).
Info-Gain Ratio. This criterion, used by C4.5 to compensate the bias of IG in favor of features with many values, is defined as

IGR(C, X) = IG(C, X) / H(X).  (2)
Via the IDM, the probability of each value z_j of a variable Z is estimated to lie in the interval

p(z_j) ∈ [ n_{z_j} / (N + s), (n_{z_j} + s) / (N + s) ],   j = 1, ..., k,  (3)

with n_{z_j} the frequency of the set of values (Z = z_j) in the data set, N the sample size and s a given hyperparameter that does not depend on the sample space (Representation Invariance Principle, Walley (1996)). The value of the parameter s determines the speed at which the upper and lower probability values converge as the sample size increases. Higher values of s give a more cautious inference. Walley (1996) does not give a definitive recommendation for the value of this parameter, but he suggests two candidates: s = 1 or s = 2.
One important characteristic of this model is that the intervals are wider when the sample size is smaller. Hence, the method produces more and more precise intervals as N increases.
This representation gives rise to a specific kind of convex set of probability distributions on the variable Z, K(Z) (Abellán, 2006). The set is defined as
K(Z) = { p | p(z_j) ∈ [ n_{z_j} / (N + s), (n_{z_j} + s) / (N + s) ], j = 1, ..., k }.
On this type of set (really a credal set, Abellán (2006)), uncertainty measures can be applied. The procedure to build CDTs uses the maximum of entropy function on the above defined credal set, a well established total uncertainty measure on credal sets (see Klir (2006)). This function, denoted H*, is defined as

H*(K(Z)) = max { H(p) | p ∈ K(Z) }.

For the credal set obtained via the IDM, the distribution p* that attains this maximum shares the mass s among the least frequent states: if A is the set of states with minimal frequency and l = |A|, then

p*(z_i) = n_{z_i} / (N + s)           if z_i ∉ A,
p*(z_i) = (n_{z_i} + s/l) / (N + s)   if z_i ∈ A,     i = 1, ..., k.

The split criterion of Credal-C4.5 is the Imprecise Info-Gain Ratio, defined as

IIGR^D(C, X) = IIG^D(C, X) / H(X),  (7)

with

IIG^D(C, X) = H*(K^D(C)) - \sum_i P^D(X = x_i) H*(K^D(C | X = x_i)),  (8)

where K^D(C) and K^D(C | X = x_i) are the credal sets obtained via the IDM for the variables C and (C | X = x_i) respectively, for a partition D of the data set (see Abellán & Moral (2003)), and P^D(X = x_i), i = 1, ..., n, is a probability distribution that belongs to the credal set K^D(X). We choose the probability distribution P^D from K^D(X) that maximizes the expression

\sum_i P(X = x_i) H(C | X = x_i).
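H* on an IDM credal set can thus be computed directly from the closed form for p*. A minimal sketch (ours), valid when s is small enough for that closed form to apply (e.g. the values s = 1 or s = 0.5 used in this paper):

    from math import log2

    def max_entropy_idm(counts, s=1.0):
        """H*(K(Z)): share the mass s among the least frequent states
        (the set A above) and take the Shannon entropy of the result."""
        N, m = sum(counts), min(counts)
        A = [i for i, c in enumerate(counts) if c == m]
        p = [(c + s / len(A) if i in A else c) / (N + s)
             for i, c in enumerate(counts)]
        return -sum(pi * log2(pi) for pi in p if pi > 0)

    print(max_entropy_idm([10, 20], s=1.0))  # ~0.938 > H(10/30, 20/30) ~ 0.918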
This maximum is attained by assigning the upper probability to the value of X whose branch has the greatest entropy, that is,

P^D(x_i) = n_{x_i} / (N + s)         if i ≠ j0,
P^D(x_i) = (n_{x_i} + s) / (N + s)   if i = j0,

where j0 is the index of the value x_j with maximum entropy H(C | X = x_j).
Each leaf node No with associated data set D is labeled with the most frequent class:

Class(No, D) = arg max_{c_i ∈ C} | { I_j ∈ D | class(I_j) = c_i, j = 1, ..., |D| } |,

where class(I_j) is the class of the instance I_j ∈ D and |D| is the number of instances in D.
Stopping criteria: The branching of the decision tree is stopped when the uncertainty measure is not reduced (α ≤ 0, step 6), when there are no more features to insert in a node (L = ∅, step 1), or when the minimum number of instances per leaf is not reached (step 3). The branching of a decision tree is also stopped when no valid split attribute remains under the aforementioned condition in Split Criteria, as in classic C4.5.

Handling numeric attributes: Numeric attributes are handled in the same way as in C4.5, presented in Section 2.2. The only difference is the use of the IIG measure instead of IG.

Dealing with missing values: Missing values are handled in a similar way to C4.5, presented in Section 2.2. Again, the only difference is the use of IIG instead of IG.

Post-pruning process: As in C4.5, Pessimistic Error Pruning is employed to prune a Credal-C4.5 tree.
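Putting the pieces together, the split criterion of Credal-C4.5 can be condensed as below (our sketch, reusing entropy and max_entropy_idm from the earlier sketches; the full algorithm of Fig. 4 adds the numeric-attribute, missing-value and minimum-instance handling described above):

    def iig(class_counts, branches, s=1.0):
        """IIG (Eq. (8)): imprecise entropies H* for the class terms; the
        distribution P^D puts the extra mass s on the branch j0 whose
        classical entropy H(C | X = x_j) is greatest."""
        N = sum(class_counts)
        j0 = max(range(len(branches)), key=lambda j: entropy(branches[j]))
        weights = [(sum(b) + (s if j == j0 else 0)) / (N + s)
                   for j, b in enumerate(branches)]
        return max_entropy_idm(class_counts, s) - sum(
            w * max_entropy_idm(b, s) for w, b in zip(weights, branches))

    def iigr(class_counts, branches, s=1.0):
        """IIGR (Eq. (7)) = IIG / H(X); a negative value stops the branching."""
        return iig(class_counts, branches, s) / entropy([sum(b) for b in branches])

    # X1 on the noisy sample of Section 1:
    print(iigr([9, 6], [[3, 6], [6, 0]], s=1.0))  # ~0.165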
(a) Small data sets. According to Eq. (3), when imprecise probabilities are used to estimate the values of a variable, the size of the obtained credal set is proportional to the parameter s. Hence, if s = 0 the credal set contains only one probability distribution and, in this case, the IIGR measure is equal to IGR (the Info-Gain Ratio criterion is actually a specific case of the IIGR criterion with the parameter s = 0). If s > 0, the size of the credal set is inversely proportional to the data set size N. If N is very high, then the effect of the parameter s can be ignored, and the measures IIGR and IGR can be considered equivalent. That is, Credal-C4.5 and classic C4.5 behave similarly in nodes with a large associated data set, usually in the upper levels of the tree. On the other hand, if N is small, then the parameter s produces credal sets with many probability distributions (a large convex set). Hence, the measures IIGR and IGR can differ (the maximum entropy H* of a set can be distinct from the classic entropy H). That is, Credal-C4.5 and C4.5 behave differently in nodes with a small associated data set, usually in the lower levels of the tree.
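This convergence is easy to observe numerically with the sketches above: for fixed class proportions (1/3, 2/3), the gap between H* and H vanishes as N grows.

    for N in (15, 150, 15000):
        counts = [N // 3, 2 * N // 3]
        gap = max_entropy_idm(counts, s=1.0) - entropy(counts)
        print(N, gap)   # 15 -> ~0.036, 150 -> ~0.0044, 15000 -> ~0.00004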
(b) Features with many states. The IG measure used by ID3 is biased in favor of feature variables with a large number of values. The IGR measure used by C4.5 was created to compensate for this bias. As the IIG measure (Eq. (8)) also penalizes feature variables with many states (see Mantas & Abellán (2014)), and the difference between IGR (Eq. (2)) and IIGR (Eq. (7)) is the use of the IIG measure instead of IG, we can conclude that Credal-C4.5 penalizes features with many values to a higher degree than classic C4.5. As said in the previous paragraph, this happens in nodes with a small associated data set.
(c) Negative split criterion values. It is important to note that, for a feature X and a partition D, IIGR^D(C, X) can be negative. This situation does not appear with classical split criteria, such as the IGR criterion used in C4.5. This characteristic enables the IIGR criterion to reveal features that worsen the information on the class variable. Hence, a new stopping criterion is defined for Credal-C4.5 (step 6 in Fig. 4) that is not available for classic C4.5. In this way, IIGR provides a trade-off between tight and loose stopping criteria, that is, between small under-fitted trees and large over-fitted ones. Hence, we can expect the Credal-C4.5 procedure to produce smaller trees than the classic C4.5 procedure. In the experimental section we will show that this is so, and also that the accuracy results of the new method are similar to or better than those of the classic C4.5 procedure, before and after pruning.
5. Credal-C4.5 versus pessimistic pruning

The pessimistic pruning employed by C4.5 estimates the generalization error of a node by increasing its training error by a constant amount s:

e_est(N) = e_tr(N) + s.  (10)

Credal-C4.5, in turn, stops splitting a node when no feature improves the information on the class variable, that is, when every available feature X_j satisfies, for the partition D' of the node,

IIGR^{D'}(C, X_j) ≤ 0.

If we compare the generalization error of the pessimistic pruning (Eq. (10)) and the error assumed with the use of imprecise probabilities (Eq. (11)), we can observe that both estimate the real error on a data set by increasing the training error. The difference is that pessimistic pruning labels a node as a leaf if the sum of the estimated errors of the descendant nodes is greater than that of the parent node, whereas Credal-C4.5 labels a node as a leaf if the data sets obtained with the error estimation do not provide information gain when the node is split. Let us see an example of this difference.

Consider a node whose associated data set contains 30 instances, with a feature X taking four values x1, x2, x3 and x4. C4.5 splits this node, since

IG(Class, X) = H(Class) - \sum_{i=1}^{4} P(X = x_i) H(Class | X = x_i)
             = 0.918 - (12/30 · 0.918 + 7/30 · 0.985 + 3/30 · 0.918 + 8/30 · 0.811)
             = 0.918 - 0.9051 > 0,

so that

IGR(Class, X) = IG(Class, X) / H(X) > 0.

After the creation of this subtree, the parent node is pruned by the pessimistic pruning, because the sum of the estimated errors of the descendant nodes is greater than the estimated error of the parent node.

Credal-C4.5, on the contrary, never builds this subtree. With N = 30 and s = 0.5, the probability distribution chosen from K^D(X) is

P^D = (12/30.5, 7.5/30.5, 3/30.5, 8/30.5),

and we obtain

IIG^D(Class, X) = H*(K^D(Class)) - \sum_{i=1}^{4} P^D(X = x_i) H*(K^D(Class | X = x_i))  (11)
                = 0.928 - (12/30.5 · 0.942 + 7.5/30.5 · 0.996 + 3/30.5 · 0.985 + 8/30.5 · 0.873)
                = 0.928 - 0.941 < 0,

that is,

IIGR^D(Class, X) = IIG^D(Class, X) / H(X) < 0.
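The figures in this example can be reproduced with the earlier sketches. The per-branch class counts below are our reconstruction (they are the counts that yield the entropies quoted above), not data given explicitly in the text:

    # 30 instances, class totals (10, 20), feature X with four values.
    total = [10, 20]
    branches = [[4, 8], [3, 4], [1, 2], [2, 6]]   # x1, x2, x3, x4

    print(info_gain(total, branches))      # ~0.013  > 0: C4.5 splits the node
    print(iig(total, branches, s=0.5))     # ~-0.013 < 0: Credal-C4.5 does not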
6. Experimental analysis

We present two experiments in this section.

(1) The aim of the first experiment is to show that Credal-C4.5 improves on the best previously published credal decision tree, the Complete Credal Decision Tree (CCDT) (Mantas & Abellán, 2014). CCDT differs from Credal-C4.5 in that it uses the IIG measure as split criterion instead of the IIGR measure, it has no pruning process, and the data sets are preprocessed before using CCDT (missing values are replaced and continuous variables are discretized).

(2) The second experiment studies the performance of Credal-C4.5 as opposed to classic C4.5. An algorithm similar to ID3, which we have called MID3, is also implemented in order to carry out a more complete comparison. This MID3 algorithm is equal to classic C4.5 with the IGR measure replaced by IG; besides, all the variables available in a node are used for the split criterion.
Table 1
Data set description. Column N is the number of instances in the data set, column Feat is the number of features or attribute variables, column Num is the number of numerical variables, column Nom is the number of nominal variables, and column k is the number of cases or states of the class variable (always a nominal variable).
Data set                     N      Feat   Num   Nom   k
Anneal                       898    38     6     32    6
Arrhythmia                   452    279    206   73    16
Audiology                    226    69     0     69    24
Autos                        205    25     15    10    7
Balance-scale                625    4      4     0     3
Breast-cancer                286    9      0     9     2
Wisconsin-breast-cancer      699    9      9     0     2
Car                          1728   6      0     6     4
CMC                          1473   9      2     7     3
Horse-colic                  368    22     7     15    2
Credit-rating                690    15     6     9     2
German-credit                1000   20     7     13    2
Dermatology                  366    34     1     33    6
Pima-diabetes                768    8      8     0     2
Ecoli                        366    7      7     0     7
Glass                        214    9      9     0     7
Haberman                     306    3      2     1     2
Cleveland-14-heart-disease   303    13     6     7     5
Hungarian-14-heart-disease   294    13     6     7     5
Heart-statlog                270    13     13    0     2
Hepatitis                    155    19     4     15    2
Hypothyroid                  3772   30     7     23    4
Ionosphere                   351    35     35    0     2
Iris                         150    4      4     0     3
kr-vs-kp                     3196   36     0     36    2
Letter                       20000  16     16    0     26
Liver-disorders              345    6      6     0     2
Lymphography                 146    18     3     15    4
mfeat-pixel                  2000   240    0     240   10
Nursery                      12960  8      0     8     4
Optdigits                    5620   64     64    0     10
Page-blocks                  5473   10     10    0     5
Pendigits                    10992  16     16    0     10
Primary-tumor                339    17     0     17    21
Segment                      2310   19     16    0     7
Sick                         3772   29     7     22    2
Solar-flare2                 1066   12     0     6     3
Sonar                        208    60     60    0     2
Soybean                      683    35     0     35    19
Spambase                     4601   57     57    0     2
Spectrometer                 531    101    100   1     48
Splice                       3190   60     0     60    3
Sponge                       76     44     0     44    3
Tae                          151    5      3     2     3
Vehicle                      946    18     18    0     4
Vote                         435    16     0     16    2
Vowel                        990    11     10    1     11
Waveform                     5000   40     40    0     3
Wine                         178    13     13    0     3
Zoo                          101    16     1     16    7
In order to check the above procedures, we used a broad and diverse set of 50 well-known data sets, obtained from the UCI repository of machine learning data sets, which can be directly downloaded from http://archive.ics.uci.edu/ml. We took data sets that differ with respect to the number of cases of the variable to be classified, data set size, feature types (discrete or continuous) and number of cases of the features. A brief description of them can be found in Table 1.
We used Weka software (Witten & Frank, 2005) on Java 1.5 for our experimentation. The implementation of the C4.5 algorithm provided by the Weka software, called J48, was employed with its default configuration. We added the necessary methods to build Credal-C4.5 and MID3 trees under the same experimental conditions. The parameter of the IDM for the Credal-C4.5 algorithm was set to s = 1.0 (see Section 2.3). The minimum number of instances per leaf for branching was fixed to 2 for C4.5 and Credal-C4.5, as in the default configuration of C4.5 (J48 in Weka).
Data sets with missing values or continuous variables were preprocessed (missing values replaced and continuous variables discretized) only for the CCDT method, as indicated above.
Table 2
Accuracy results of C4.5, Credal-C4.5, MID3 and CCDT (without pruning) when applied on data sets with a percentage of random noise equal to 0%.

Dataset                      C4.5    Credal-C4.5   MID3    CCDT
Anneal                       98.57   98.19         98.99   99.34
Arrhythmia                   64.05   67.55         64.89   67.08
Audiology                    76.48   78.58         76.02   80.94
Autos                        82.40   74.52         77.46   78.27
Balance-scale                79.44   78.00         79.45   69.59
Breast-cancer                68.15   71.44         68.34   72.03
Wisconsin-breast-cancer      94.37   95.08         94.56   94.74
Car                          93.74   91.42         93.98   90.28
CMC                          49.19   52.01         49.58   48.70
Horse-colic                  82.09   84.64         82.91   83.17
Credit-rating                82.17   85.46         81.26   84.06
German-credit                68.11   70.17         69.68   69.53
Dermatology                  94.04   93.82         92.20   94.58
Pima-diabetes                73.87   73.19         73.79   74.22
Ecoli                        82.53   81.90         83.07   80.03
Glass                        67.76   63.66         67.90   68.83
Haberman                     70.52   73.89         70.62   73.59
Cleveland-14-heart-disease   76.44   76.60         78.62   76.23
Hungarian-14-heart-disease   78.55   82.54         76.12   78.62
Heart-statlog                76.78   80.04         77.93   82.11
Hepatitis                    78.59   79.84         78.82   80.32
Hypothyroid                  99.51   99.53         99.56   99.37
Ionosphere                   89.83   88.35         88.15   89.75
Iris                         94.80   94.73         94.80   93.73
kr-vs-kp                     99.44   99.40         99.42   99.49
Letter                       88.02   87.57         88.00   77.55
Liver-disorders              65.37   64.18         65.75   56.85
Lymphography                 75.42   78.51         73.42   74.50
mfeat-pixel                  78.42   79.58         75.66   80.31
Nursery                      98.69   96.30         98.64   96.31
Optdigits                    90.48   90.77         91.10   79.33
Page-blocks                  96.78   96.72         96.92   96.26
Pendigits                    96.54   96.39         96.39   88.87
Primary-tumor                42.60   42.19         38.59   38.73
Segment                      96.80   96.04         96.77   94.18
Sick                         98.77   98.77         98.82   97.80
Solar-flare2                 99.49   99.53         99.38   99.46
Sonar                        73.42   71.47         73.53   73.92
Soybean                      90.69   92.50         86.76   92.21
Spambase                     92.42   92.61         92.83   91.85
Spectrometer                 47.31   45.52         43.49   45.22
Splice                       92.16   93.81         91.37   93.17
Sponge                       91.68   94.11         92.70   94.63
Tae                          58.60   53.20         58.21   46.78
Vehicle                      72.18   72.84         72.67   69.53
Vote                         95.76   96.04         95.56   96.18
Vowel                        81.63   77.87         84.09   75.60
Waveform                     75.12   76.05         75.70   74.44
Wine                         93.20   92.13         93.83   92.08
Zoo                          93.41   92.42         92.01   95.83
Average                      82.13   82.23         81.81   81.00

Table 3
Accuracy results of C4.5, Credal-C4.5, MID3 and CCDT (without pruning) when applied on data sets with a percentage of random noise equal to 10%.

Dataset                      C4.5    Credal-C4.5   MID3    CCDT
Anneal                       96.05   97.84         96.31   97.87
Arrhythmia                   59.81   64.77         57.65   63.90
Audiology                    75.06   76.98         71.43   76.40
Autos                        74.73   71.57         68.88   74.50
Balance-scale                76.56   78.46         76.35   71.90
Breast-cancer                66.31   68.68         64.85   68.98
Wisconsin-breast-cancer      92.00   94.29         92.20   92.93
Car                          88.80   90.90         88.47   90.41
CMC                          47.25   50.07         47.30   47.66
Horse-colic                  79.74   83.52         78.93   79.24
Credit-rating                76.80   85.04         75.80   82.87
German-credit                65.23   69.01         65.94   68.69
Dermatology                  88.00   92.66         85.73   92.10
Pima-diabetes                72.04   73.58         72.20   73.35
Ecoli                        77.77   81.52         77.89   79.67
Glass                        64.07   65.29         63.40   67.43
Haberman                     68.86   73.40         69.02   72.74
Cleveland-14-heart-disease   72.93   76.31         73.83   75.87
Hungarian-14-heart-disease   75.73   80.73         73.73   77.70
Heart-statlog                72.33   77.67         72.48   79.70
Hepatitis                    73.84   78.17         75.27   77.37
Hypothyroid                  95.55   99.38         95.74   99.05
Ionosphere                   86.30   87.30         85.31   85.95
Iris                         89.73   93.60         89.20   93.67
kr-vs-kp                     93.51   98.04         93.05   97.39
Letter                       85.01   86.27         84.45   76.45
Liver-disorders              62.15   61.11         62.27   56.85
Lymphography                 71.31   74.10         69.59   73.57
mfeat-pixel                  71.79   76.88         67.51   76.43
Nursery                      91.59   96.23         91.39   96.13
Optdigits                    82.49   87.49         83.01   75.14
Page-blocks                  93.95   96.67         94.06   96.18
Pendigits                    89.66   95.19         89.62   87.80
Primary-tumor                38.70   39.53         37.85   36.99
Segment                      90.09   95.02         90.31   93.58
Sick                         95.68   98.07         95.53   97.49
Solar-flare2                 98.22   99.50         98.32   99.44
Sonar                        67.56   70.53         69.34   71.51
Soybean                      86.85   91.70         79.55   90.04
Spambase                     89.87   91.56         89.38   88.20
Spectrometer                 42.63   43.01         38.78   42.43
Splice                       81.23   90.87         80.30   90.18
Sponge                       83.00   89.07         84.80   90.73
Tae                          52.23   49.01         52.55   45.13
Vehicle                      66.17   69.63         65.72   67.77
Vote                         92.20   94.39         92.52   94.57
Vowel                        78.19   75.15         78.42   74.12
Waveform                     69.10   74.96         69.07   70.81
Wine                         86.17   89.22         86.28   90.64
Zoo                          91.99   92.10         91.49   92.07
Average                      77.74   80.72         77.06   79.23
3 All the tests were carried out using Keel software (Alcalá-Fdez et al., 2009), available at www.keel.es.
6.1.1. Results
Next we show the results obtained by the C4.5, Credal-C4.5, MID3 and CCDT trees. Tables 2–4 present the accuracy results of each method without the post-pruning procedure, applied on data sets with a percentage of random noise on the class variable equal to 0%, 10% and 30%, respectively.
Tables 5 and 6 present the average results of accuracy and tree size (number of nodes) for each method without pruning when applied to data sets with percentages of random noise equal to 0%, 10% and 30%.
Table 7 shows Friedman's ranks obtained from the accuracy results of C4.5, Credal-C4.5, MID3 and CCDT (without pruning) when they are applied on data sets with percentages of random noise equal to 0%, 10% and 30%. We remark that the null hypothesis is rejected in all the cases with noise.
Table 4
Accuracy results of C4.5, Credal-C4.5, MID3 and CCDT (without pruning) when applied on data sets with a percentage of random noise equal to 30%.

Dataset                      C4.5    Credal-C4.5   MID3    CCDT
Anneal                       80.49   90.84         79.84   88.92
Arrhythmia                   46.55   59.71         44.83   57.13
Audiology                    63.93   69.45         53.39   64.90
Autos                        56.70   58.40         52.19   62.23
Balance-scale                64.80   73.61         64.40   73.21
Breast-cancer                58.76   62.71         58.58   60.94
Wisconsin-breast-cancer      84.59   91.45         84.57   86.51
Car                          73.92   82.80         73.42   82.88
CMC                          42.96   45.23         43.16   43.94
Horse-colic                  68.39   73.91         66.06   63.56
Credit-rating                64.43   73.03         63.61   68.41
German-credit                57.98   60.57         59.41   61.47
Dermatology                  69.16   79.93         67.52   77.02
Pima-diabetes                68.87   69.43         68.45   69.31
Ecoli                        64.14   77.71         64.14   78.28
Glass                        52.51   59.82         52.29   66.31
Haberman                     63.93   66.44         63.89   67.90
Cleveland-14-heart-disease   59.97   69.09         60.00   69.39
Hungarian-14-heart-disease   70.80   79.94         66.22   75.27
Heart-statlog                63.07   71.19         62.56   72.74
Hepatitis                    62.19   70.41         62.17   67.10
Hypothyroid                  79.54   97.45         80.49   96.12
Ionosphere                   77.87   79.84         77.13   71.17
Iris                         78.60   88.27         78.00   89.87
kr-vs-kp                     72.22   81.16         72.38   80.22
Letter                       71.74   77.44         71.36   68.92
Liver-disorders              56.72   55.37         57.00   56.85
Lymphography                 56.50   63.43         54.91   61.28
mfeat-pixel                  57.78   64.57         53.55   62.63
Nursery                      72.18   88.95         71.74   88.98
Optdigits                    62.98   73.59         63.82   61.15
Page-blocks                  83.80   96.04         83.06   94.01
Pendigits                    69.88   86.66         70.11   77.42
Primary-tumor                34.69   34.90         33.66   31.92
Segment                      70.66   89.86         71.52   83.44
Sick                         87.65   94.96         87.82   92.41
Solar-flare2                 91.35   96.90         91.94   96.56
Sonar                        60.74   63.34         61.06   61.98
Soybean                      73.03   86.32         58.82   79.76
Spambase                     85.42   87.44         84.47   72.88
Spectrometer                 31.85   35.40         29.12   35.51
Splice                       62.46   68.00         61.85   67.62
Sponge                       63.52   71.29         62.93   72.27
Tae                          45.73   43.31         44.92   41.48
Vehicle                      53.09   62.99         53.15   59.92
Vote                         79.08   84.45         78.71   82.55
Vowel                        64.68   65.00         64.13   65.72
Waveform                     57.22   69.84         56.50   58.51
Wine                         70.01   82.40         70.14   79.20
Zoo                          84.06   85.49         85.81   82.50
Average                      65.86   73.21         64.82   70.60

Table 6
Average tree size for C4.5, Credal-C4.5, MID3 and CCDT when applied on data sets with percentages of random noise equal to 0%, 10% and 30%.

Tree          Noise 0%   Noise 10%   Noise 30%
C4.5          216.98     376.37      672.13
Credal-C4.5   138.78     167.09      317.92
MID3          216.15     373.97      662.42
CCDT          387.18     523.09      1033.15
Table 5
Average accuracy results for C4.5, Credal-C4.5, MID3 and CCDT (without pruning) when applied on data sets with percentages of random noise equal to 0%, 10% and 30%.

Tree          Noise 0%   Noise 10%   Noise 30%
C4.5          82.13      77.74       65.86
Credal-C4.5   82.23      80.72       73.21
MID3          81.81      77.06       64.82
CCDT          81.00      79.23       70.60
Table 7
Friedman's ranks for α = 0.1 obtained from the accuracy results of C4.5, Credal-C4.5, MID3 and CCDT (without pruning) when they are applied on data sets with percentages of random noise equal to 0%, 10% and 30%.

Tree          Noise 0%   Noise 10%   Noise 30%
C4.5          2.52       3.10        3.11
Credal-C4.5   2.30       1.34        1.38
MID3          2.54       3.32        3.45
CCDT          2.64       2.24        2.06
Table 8
p-values of the Nemenyi test with α = 0.1 obtained from the accuracy results of the methods C4.5, Credal-C4.5, MID3 and CCDT (without pruning) when they are applied on data sets with a percentage of random noise equal to 10%. The Nemenyi procedure rejects those hypotheses that have a p-value ≤ 0.016667.

i   Algorithms              p-value
6   Credal-C4.5 vs. MID3    0
5   C4.5 vs. Credal-C4.5    0
4   MID3 vs. CCDT           0.000029
3   Credal-C4.5 vs. CCDT    0.000491
2   C4.5 vs. CCDT           0.000866
1   C4.5 vs. MID3           0.394183
Table 9
p-values of the Nemenyi test with α = 0.1 obtained from the accuracy results of the methods C4.5, Credal-C4.5, MID3 and CCDT (without pruning) when they are applied on data sets with a percentage of random noise equal to 30%. The Nemenyi procedure rejects those hypotheses that have a p-value ≤ 0.016667.

i   Algorithms              p-value
6   Credal-C4.5 vs. MID3    0
5   C4.5 vs. Credal-C4.5    0
4   MID3 vs. CCDT           0
3   C4.5 vs. CCDT           0.000048
2   Credal-C4.5 vs. CCDT    0.008448
1   C4.5 vs. MID3           0.187901
Table 10
Accuracy results of C4.5, Credal-C4.5 and MID3 (with pruning) when applied on data sets with a percentage of random noise equal to 0%.

Dataset                      C4.5    Credal-C4.5   MID3
Anneal                       98.57   98.36         98.99
Arrhythmia                   65.65   67.68         65.15
Audiology                    77.26   78.94         76.91
Autos                        81.77   74.57         78.24
Balance-scale                77.82   77.33         77.69
Breast-cancer                74.28   74.84         71.75
Wisconsin-breast-cancer      95.01   95.12         95.35
Car                          92.22   91.16         93.02
CMC                          51.44   52.80         52.06
Horse-colic                  85.16   85.18         84.34
Credit-rating                85.57   85.43         84.03
German-credit                71.25   71.34         71.98
Dermatology                  94.10   94.26         93.49
Pima-diabetes                74.49   74.15         74.39
Ecoli                        82.83   81.60         83.61
Glass                        67.63   63.61         67.67
Haberman                     72.16   71.18         72.03
Cleveland-14-heart-disease   76.94   76.53         79.30
Hungarian-14-heart-disease   80.22   82.33         76.77
Heart-statlog                78.15   80.33         78.81
Hepatitis                    79.22   79.79         80.33
Hypothyroid                  99.54   99.52         99.56
Ionosphere                   89.74   88.18         88.04
Iris                         94.73   94.73         94.73
kr-vs-kp                     99.44   99.45         99.42
Letter                       88.03   87.58         87.97
Liver-disorders              65.84   64.53         66.16
Lymphography                 75.84   78.31         75.01
mfeat-pixel                  78.66   79.76         77.12
Nursery                      97.18   96.30         97.10
Optdigits                    90.52   90.83         91.10
Page-blocks                  96.99   96.69         97.09
Pendigits                    96.54   96.42         96.39
Primary-tumor                41.39   42.33         39.92
Segment                      96.79   96.04         96.74
Sick                         98.72   98.79         98.85
Solar-flare2                 99.53   99.53         99.53
Sonar                        73.61   71.37         73.53
Soybean                      91.78   92.40         89.94
Spambase                     92.68   92.56         93.11
Spectrometer                 47.50   45.54         43.37
Splice                       94.17   94.04         93.57
Sponge                       92.50   92.50         92.50
Tae                          57.41   53.26         57.62
Vehicle                      72.28   72.78         72.71
Vote                         96.57   96.59         96.11
Vowel                        80.20   77.88         83.63
Waveform                     75.25   76.07         75.83
Wine                         93.20   92.13         93.83
Zoo                          92.61   92.42         92.01
Average                      82.62   82.30         82.37

Table 11
Accuracy results of C4.5, Credal-C4.5 and MID3 (with pruning) when applied on data sets with a percentage of random noise equal to 10%.

Dataset                      C4.5    Credal-C4.5   MID3
Anneal                       98.37   98.23         98.42
Arrhythmia                   62.54   65.76         58.44
Audiology                    77.53   77.39         72.70
Autos                        74.72   71.65         69.61
Balance-scale                78.11   78.26         77.82
Breast-cancer                71.13   72.07         70.75
Wisconsin-breast-cancer      93.72   94.28         94.06
Car                          90.92   90.53         90.74
CMC                          49.95   51.36         50.36
Horse-colic                  84.61   85.10         84.50
Credit-rating                84.78   85.23         84.22
German-credit                71.18   71.38         71.72
Dermatology                  93.31   93.12         91.06
Pima-diabetes                72.37   73.83         72.56
Ecoli                        81.87   81.49         82.04
Glass                        65.37   65.57         64.55
Haberman                     72.32   72.39         72.29
Cleveland-14-heart-disease   75.78   76.94         77.56
Hungarian-14-heart-disease   79.78   80.94         77.03
Heart-statlog                75.63   78.41         76.04
Hepatitis                    77.88   80.19         78.62
Hypothyroid                  99.40   99.44         99.43
Ionosphere                   86.90   87.04         85.79
Iris                         92.73   93.53         92.47
kr-vs-kp                     98.97   98.95         98.80
Letter                       86.74   86.67         86.38
Liver-disorders              62.38   61.69         62.73
Lymphography                 75.11   74.78         76.53
mfeat-pixel                  76.77   77.97         74.36
Nursery                      96.29   96.08         96.00
Optdigits                    88.47   88.94         88.86
Page-blocks                  96.70   96.78         96.79
Pendigits                    95.37   95.49         95.20
Primary-tumor                39.59   40.39         40.09
Segment                      95.06   95.17         95.03
Sick                         98.22   98.24         98.22
Solar-flare2                 99.53   99.53         99.53
Sonar                        67.56   70.39         69.34
Soybean                      90.54   91.74         85.85
Spambase                     90.96   91.52         90.57
Spectrometer                 43.20   43.07         39.64
Splice                       93.05   93.08         92.48
Sponge                       91.80   91.66         92.50
Tae                          50.77   49.01         51.61
Vehicle                      68.51   69.99         68.26
Vote                         95.74   95.45         95.28
Vowel                        77.13   75.26         78.37
Waveform                     69.51   75.13         69.50
Wine                         87.35   89.39         87.36
Zoo                          92.39   92.10         92.19
Average                      80.77   81.25         80.29
without noise. For data with noise (10% and 30%), this test indicates that Credal-C4.5 is the best model, with a statistically significant difference. This test also says that CCDT improves on MID3 and C4.5 (without pruning) for data with noise.
After this analysis, we can draw the following conclusions about the comparison between algorithms:

Credal-C4.5 (without pruning) vs. CCDT: Credal-C4.5 without pruning obtains better average accuracy results and a better Friedman's rank than CCDT. According to the Nemenyi test, this difference is significant for data with noise. Besides, Credal-C4.5 without pruning builds smaller trees than CCDT. Hence, we can conclude from this experiment that the Credal-C4.5 algorithm without pruning improves on the best previously published credal decision tree.
Table 12
Accuracy results of C4.5, Credal-C4.5 and MID3 (with pruning) when applied on data sets with a percentage of random noise equal to 30%.

Dataset                      C4.5    Credal-C4.5   MID3
Anneal                       96.03   95.85         95.24
Arrhythmia                   49.15   62.06         45.09
Audiology                    70.88   70.68         60.25
Autos                        57.92   60.35         53.81
Balance-scale                74.16   75.02         73.52
Breast-cancer                68.65   67.61         67.49
Wisconsin-breast-cancer      89.24   92.27         89.43
Car                          86.00   85.97         85.89
CMC                          46.39   47.70         45.59
Horse-colic                  79.63   80.48         75.00
Credit-rating                74.58   81.41         71.77
German-credit                63.09   63.70         66.05
Dermatology                  87.64   88.95         86.56
Pima-diabetes                69.39   69.67         68.93
Ecoli                        75.27   79.78         73.63
Glass                        55.23   60.49         54.69
Haberman                     68.83   72.85         68.87
Cleveland-14-heart-disease   68.00   71.57         67.97
Hungarian-14-heart-disease   78.16   80.81         74.68
Heart-statlog                65.52   72.33         64.70
Hepatitis                    68.15   73.36         68.63
Hypothyroid                  98.59   98.96         98.41
Ionosphere                   78.18   80.04         77.30
Iris                         84.00   89.00         84.07
kr-vs-kp                     91.13   90.97         90.53
Letter                       82.13   82.54         81.62
Liver-disorders              56.83   55.45         57.06
Lymphography                 66.33   68.11         68.59
mfeat-pixel                  71.98   73.19         68.43
Nursery                      93.99   94.30         93.46
Optdigits                    76.91   80.77         70.24
Page-blocks                  94.91   96.25         94.81
Pendigits                    89.21   92.25         88.02
Primary-tumor                37.67   37.76         38.44
Segment                      85.35   91.92         84.33
Sick                         95.20   97.14         95.29
Solar-flare2                 99.53   99.49         99.53
Sonar                        60.84   63.34         61.10
Soybean                      88.45   89.34         72.78
Spambase                     86.07   87.69         85.32
Spectrometer                 33.02   35.61         29.72
Splice                       81.21   80.06         81.85
Sponge                       88.84   86.71         92.50
Tae                          45.86   43.64         45.26
Vehicle                      56.06   63.50         55.56
Vote                         90.99   91.55         91.38
Vowel                        66.01   65.61         64.16
Waveform                     57.32   70.08         56.59
Wine                         71.02   82.91         70.98
Zoo                          87.65   87.74         89.05
Average                      74.14   76.58         72.88
Table 13
Average accuracy results for C4.5, Credal-C4.5 and MID3 when applied on data sets with percentages of random noise equal to 0%, 10% and 30%.

Tree          Noise 0%   Noise 10%   Noise 30%
C4.5          82.62      80.77       74.14
Credal-C4.5   82.30      81.25       76.58
MID3          82.37      80.29       72.88
Table 14
Average tree size for C4.5, Credal-C4.5 and MID3 when applied on data sets with percentages of random noise equal to 0%, 10% and 30%.

Tree          Noise 0%   Noise 10%   Noise 30%
C4.5          156.54     170.02      244.05
Credal-C4.5   122.67     131.06      171.39
MID3          155.83     170.03      253.73
Table 15
Friedman's ranks for α = 0.1 obtained from the accuracy results of C4.5, Credal-C4.5 and MID3 (with pruning) when they are applied on data sets with percentages of random noise equal to 0%, 10% and 30%.

Tree          Noise 0%   Noise 10%   Noise 30%
C4.5          1.90       2.07        2.07
Credal-C4.5   2.08       1.60        1.40
MID3          2.02       2.33        2.53
Table 16
p-values of the Nemenyi test with α = 0.1 obtained from the accuracy results of the methods C4.5, Credal-C4.5 and MID3 (with pruning) when they are applied on data sets with a percentage of random noise equal to 10%. The Nemenyi procedure rejects those hypotheses that have a p-value ≤ 0.033333.

i   Algorithms              p-value
3   Credal-C4.5 vs. MID3    0.000262
2   C4.5 vs. Credal-C4.5    0.018773
1   C4.5 vs. MID3           0.193601
Table 17
p-values of the Nemenyi test with α = 0.1 obtained from the accuracy results of the methods C4.5, Credal-C4.5 and MID3 (with pruning) when they are applied on data sets with a percentage of random noise equal to 30%. The Nemenyi procedure rejects those hypotheses that have a p-value ≤ 0.033333.

i   Algorithms              p-value
3   Credal-C4.5 vs. MID3    0
2   C4.5 vs. Credal-C4.5    0.000808
1   C4.5 vs. MID3           0.021448
Credal trees (CCDT and Credal-C4.5 without pruning) vs. classic methods without pruning (C4.5 and MID3): We can observe from this experiment that the credal trees improve on the results obtained by the classic methods (without pruning) for data with noise. According to the Nemenyi test, this difference is significant. Hence, we can conclude that the pruning process is a fundamental part of the classic methods (C4.5 and MID3) to improve their accuracy. For this reason, we have focused on the methods with a pruning process in the next experiment.