
2010 IEEE 26th Convention of Electrical and Electronics Engineers in Israel

Using the Confusion Matrix for Improving Ensemble Classifiers


Nadav David Marom
Department of Information System Engineering, Ben-Gurion University of the Negev

Lior Rokach
Department of Information System Engineering, Ben-Gurion University of the Negev

Armin Shmilovici
Department of Information System Engineering, Ben-Gurion University of the Negev

Abstract: The code matrix enables the conversion of a multi-class problem into an ensemble of binary classifiers. We suggest a new un-weighted framework for iteratively extending the code matrix, based on the confusion matrix. The confusion matrix holds important information which is exploited by the suggested framework. Evaluating the confusion matrix at each iteration enables a decision regarding the next one-against-all classifier that should be added to the current code matrix. We demonstrate the benefits of the method by applying it to an Error Correcting Output Code (ECOC) based ensemble and to AdaBoost. We use orthogonal arrays as the basic code matrix.

I. INTRODUCTION

A classifier is a classification model which assigns an unclassified instance to a predefined set of classes. The classifier may be induced by using a learning algorithm (also known as an inducer, I), such as logistic regression or a decision tree. The classifier is a function f(X) in which the domain consists of the training samples Xi={x1, x2, x3, ..., xn} and the range is one of Y classes. The range is also called the target attribute. Given a training sample Xi, the above notation produces the tuple (Xi, f(Xi)). A binary inducer requires that the range of the function f(X) is binary. Many of the popular inducers are binary, e.g. logistic regression and SVM. However, most real-world problems are multi-class problems in which the range consists of more than two classes. Some examples are character recognition, disease classification, and face recognition. Different ways of solving them have been suggested. One solution is to use an ensemble. An ensemble is a combination of more than one classifier. Such a combination is only useful if the classifiers disagree about some inputs [1]. Creating an ensemble in which each classifier is as different as possible while still being consistent with the training set is theoretically known to be an important feature for attaining improved ensemble performance [2]. Diversified classifiers lead to uncorrelated errors, which in turn improve the classification accuracy [3]. Dietterich [4] surveyed several approaches to constructing a good ensemble of classifiers, one of which is to manipulate the target attribute values that are given to the learning algorithm: the Error-Correcting Output Coding (ECOC).

The ECOC, motivated by error correction methods in noisy communication channels, uses a code matrix to decompose a multi-class problem into multiple binary problems. The decomposition enables us to transform the complex original problem into a set of simpler, binary classification problems. We call each classifier which solves one of these binary problems a base classifier. Then, all the problems are reconstructed by a combining framework [5].

II. ECOC FRAMEWORK

ECOC for multi-class classification hinges on the design of the binary code matrix M of size k by l, where k is the number of classes and l is the number of base classifiers. An important property is that as the number l increases, the accuracy tends to increase along with the costs (such as training and classification time, and implementation complexity). However, unlike l, which can be extended, k is constant for a given problem.

Table 1. A coding scheme containing seven rows of code words and seven columns of classifiers

Class   f1  f2  f3  f4  f5  f6  f7
k=1      1   0   1   0   1   0   1
k=2      0   1   1   0   0   1   1
k=3      1   1   0   0   1   1   0
k=4      0   0   0   1   1   1   1
k=5      1   0   1   1   0   1   0
k=6      0   1   1   1   1   0   0
k=7      1   1   0   1   0   0   1

Table 1 depicts a code matrix M in which k=l=7. According to the decomposition, each class ki is encoded as the l-bit code word given by the i-th row of the code matrix. For example, the 3rd row of M is [1,1,0,0,1,1,0], which means that the class k=3 is represented by this code word. An instance belonging to this class should get 1 as the result of the f(xi)1, f(xi)2, f(xi)5 and f(xi)6 base classifiers and 0 as the result of the f(xi)3, f(xi)4 and f(xi)7 base classifiers. The columns of the code matrix M define the two super groups between which each base classifier should distinguish. For example, the fifth column of M is [1,0,1,1,0,1,0]T, which means that the fifth base classifier should distinguish between the first super group (the first, third, fourth and sixth original classes) and the second super group (the second, fifth, and seventh original classes).

Given an unseen sample Xi, we apply all base classifiers and combine their binary predictions into an l-bit predicted codeword. We then choose the class Yi whose codeword is closest to the predicted codeword based on the Hamming distance measure: the sample will be assigned to class Yi if the number of differing bits between the predicted codeword and the codeword of class Yi is minimal.
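To make the decoding step concrete, the following is a minimal sketch (ours, not the authors' implementation) of Hamming-distance decoding with the Table 1 code matrix; the base classifiers are assumed to be already trained and are represented here only by their binary outputs.

import numpy as np

# Code matrix M from Table 1: rows are class codewords, columns are base classifiers.
M = np.array([
    [1, 0, 1, 0, 1, 0, 1],
    [0, 1, 1, 0, 0, 1, 1],
    [1, 1, 0, 0, 1, 1, 0],
    [0, 0, 0, 1, 1, 1, 1],
    [1, 0, 1, 1, 0, 1, 0],
    [0, 1, 1, 1, 1, 0, 0],
    [1, 1, 0, 1, 0, 0, 1],
])

def decode(predicted_bits):
    """Return the class (row index) whose codeword has minimal Hamming distance
    to the l-bit predicted codeword produced by the base classifiers."""
    distances = np.sum(M != np.asarray(predicted_bits), axis=1)
    return int(np.argmin(distances))

# Example: the base classifiers answer [1, 1, 0, 0, 1, 0, 0];
# the closest codeword is row index 2 (class k=3), at Hamming distance 1.
print(decode([1, 1, 0, 0, 1, 0, 0]))  # -> 2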
A. Different coding schemes
Several methods of designing such ECOC matrices have been suggested in the literature. The different methods can be classified into the following three categories [6]:

B. Given M, find a subset of binary classifiers
These matrices are also called discrete codes. Most of the published work is in this category. For instance: One against all [7], where each problem considers the discrimination of one class from the other classes and the code matrix M is designed as a diagonal matrix. Random codes [8], where the code matrix M contains randomly chosen binary elements. The Hadamard matrix of order n (denoted Hn) is a square matrix of +1's and -1's whose rows are orthogonal and which satisfies $H_n H_n^T = n I_n$ [9]. Zhang et al. [10] suggested a way of using these matrices as part of ECOC and proved that for the k by k-1 sub-matrix the classification accuracy is optimal. Data-driven Error Correcting Output Coding (DECOC) [11] explores the distribution of data classes and optimizes both the composition and the number of base learners to produce an effective and compact code matrix. A more general approach based on ECOC matrices enables the code words to be ternary, with values {-1, 0, +1}, following the convention:

$$M_{kl} = \begin{cases} +1 & C_k \in C^1 \\ 0 & C_k \notin C^1 \ \&\ C_k \notin C^2 \\ -1 & C_k \in C^2 \end{cases}$$

where $C^1$ and $C^2$ denote the two super groups. A popular implementation is the one against one [7].

C. Given a set of binary classifiers, find a matrix M
As this problem is considered to be NP-hard for discrete codes, it has been suggested to relax it by defining both the code matrix and the classifiers' output over the reals [6]. With such a relaxation, M can be solved from a constrained optimization problem.
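For illustration only (this is our sketch, not part of the paper, which works with binary matrices and orthogonal arrays), the two standard schemes mentioned above can be generated as follows: a binary one-against-all matrix and a ternary one-against-one matrix under the {-1, 0, +1} convention.

import numpy as np
from itertools import combinations

def one_vs_all_matrix(k):
    """Binary code matrix for one-against-all: column i separates class i from the rest."""
    return np.eye(k, dtype=int)

def one_vs_one_matrix(k):
    """Ternary code matrix for one-against-one: one column per pair of classes,
    +1 for the first class, -1 for the second, 0 for classes left out of that column."""
    pairs = list(combinations(range(k), 2))
    M = np.zeros((k, len(pairs)), dtype=int)
    for col, (a, b) in enumerate(pairs):
        M[a, col] = +1
        M[b, col] = -1
    return M

print(one_vs_all_matrix(4))
print(one_vs_one_matrix(4))   # 4 classes -> 6 pairwise columns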

D. Find both the matrix M and a subset of classifiers
This category contains the boosting algorithms, in which we combine many "weak" classifiers (weak classifiers perform slightly better than random guessing). A key principle in boosting is the re-weighting of the data during the learning phase, so that a misclassified record gets a higher weight during the current classifier learning phase. A popular implementation for binary problems is AdaBoost [12]. AdaBoost.M1 [12] and AdaBoost.M2 [12] are extensions of AdaBoost to multi-class algorithms. AdaBoost.OC [13] combines ECOC and AdaBoost. At each iteration, the resulting classifier minimizes an error function which considers both the different weighting and the different partitioning of the dataset into binary problems (as applied by the code matrix). The error produced by the classifier at a given iteration affects the weighting of the next iteration. The final classification is a weighted vote of all the classifiers. The AdaBoost.ECC [14] algorithm is a variant of AdaBoost.OC that attempts to improve the performance by considering the correlation between the columns of M. A different weighting scheme is introduced, based on the positive and negative votes of the classifier.

III. ORTHOGONAL ARRAYS

A full factorial design is a Design of Experiments in which the experimenter chooses n attributes that are believed to affect the target attribute. Then, all possible combinations of the selected input attributes are acquired [15]. Applying a full factorial design is impractical when many input attributes are given. A fractional factorial design is a design in which only a fraction of the combinations required for the complete factorial experiment is selected. One of the most practical forms of fractional factorial design is the Orthogonal Array (OA) [9]. An OA Lk(d^n) is a matrix of k rows and n columns, with every element being one of d values; d is also called the OA level. The rows (runs) represent the experiments or tests to be performed and the columns (factors) correspond to the different variables whose effects are being analyzed [9]. The array has strength t if, in every k by t sub-matrix, the d^t possible distinct rows all appear the same number of times. An example of an OA of strength 2 is presented in Table 2. Any two attributes in this matrix have all possible combinations (11, 10, 01, 00), and each of these combinations appears an equal number of times.

Table 2. The L8(2^7) OA design

Instance   a1  a2  a3  a4  a5  a6  a7
1           0   0   0   0   0   0   0
2           1   0   1   0   1   0   1
3           0   1   1   0   0   1   1
4           1   1   0   0   1   1   0
5           0   0   0   1   1   1   1
6           1   0   1   1   0   1   0
7           0   1   1   1   1   0   0
8           1   1   0   1   0   0   1
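A small sketch (ours, assuming NumPy) can be used to check the strength-2 property of the Table 2 array and to obtain the transposed matrix that Section V later uses as the initial code matrix M0; dropping the constant all-zero run is our reading of the preprocessing, and doing so reproduces the 7x7 scheme of Table 1.

import numpy as np
from itertools import combinations, product

# The L8(2^7) orthogonal array from Table 2 (8 runs x 7 factors).
OA = np.array([
    [0, 0, 0, 0, 0, 0, 0],
    [1, 0, 1, 0, 1, 0, 1],
    [0, 1, 1, 0, 0, 1, 1],
    [1, 1, 0, 0, 1, 1, 0],
    [0, 0, 0, 1, 1, 1, 1],
    [1, 0, 1, 1, 0, 1, 0],
    [0, 1, 1, 1, 1, 0, 0],
    [1, 1, 0, 1, 0, 0, 1],
])

def has_strength_2(oa):
    """Every pair of columns must contain each of the 4 combinations equally often."""
    n_runs = oa.shape[0]
    for i, j in combinations(range(oa.shape[1]), 2):
        for combo in product([0, 1], repeat=2):
            count = np.sum((oa[:, i] == combo[0]) & (oa[:, j] == combo[1]))
            if count != n_runs // 4:
                return False
    return True

print(has_strength_2(OA))   # True
M0 = OA.T                   # factors (classes) become rows, runs become columns
M0 = M0[:, 1:]              # dropping the all-zero run (our assumption) gives the 7x7 matrix of Table 1
print(M0)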

IV. THE SUGGESTED METHOD

A major drawback of the current adaptive methods related to category C is the weighting introduced to the instances. Such a weighting is an integral part of the AdaBoost algorithms and is used to determine the identity of the next added classifier. Another open issue in ECOC is the design of the code matrix: the design should be compact, yet still represent the given problem. With weighting, a different weight is used in each iteration, which distorts the original distribution of the instance space [16]. That harms the intuitive relation between the classifier and the dataset. How to extend a given code matrix without weighting has not yet been determined.

We propose a new framework for extending a given code matrix without using weighting, based on the confusion matrix. The confusion matrix is the result of the classification phase. Each classified instance is located in exactly one cell of the matrix. For every cell in the matrix, the row represents the original class and the column represents the class that the classification model produced (Table 3). The diagonal of the matrix represents the ideal case in which the instance was correctly classified. All the off-diagonal cells represent misclassified instances. An important advantage of using the confusion matrix is the ability to consider the performance of the whole classification model when considering the identity of the next classifier.
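As a point of reference for the convention used throughout the paper (rows are the original classes, columns the classes the model produced), a confusion matrix can be accumulated as in the following small sketch; encoding the class labels as integers 0..k-1 is our assumption for illustration.

import numpy as np

def confusion_matrix(y_true, y_pred, k):
    """Rows are the original (true) classes, columns the classes the model produced."""
    C = np.zeros((k, k), dtype=int)
    for t, p in zip(y_true, y_pred):
        C[t, p] += 1          # each classified instance falls in exactly one cell
    return C

C = confusion_matrix([0, 0, 1, 2, 2, 2], [0, 1, 1, 2, 0, 2], k=3)
print(C)                       # the diagonal counts the correctly classified instances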
The proposed implementation makes two assumptions:
1. It starts from a ready-made code matrix (M0) which is iteratively extended.
2. The added classifiers are one-against-all classifiers.

We propose the following three iterative steps; a detailed algorithm is described in Figure 1:
1. Train the inducers based on the current code matrix.
2. Test the inducers and record the results in the confusion matrix.
3. Use the coloring function, based on the confusion matrix, to decide which classifier to add next.

The Proposed Iterative Algorithm
Input:
  X  - training set {<x1,y1>,...,<xn,yn>}
  I  - an induction algorithm
  M0 - an initial ECOC matrix of type OA
Output:
  E  - ensemble of binary classifiers
  M  - the final ECOC matrix
Algorithm:
  1: E <- build an ensemble of binary classifiers based on M0 using I
  2: M <- M0
  3: While (another classifier can be added) Do
  4:   C <- confusion matrix for E
  5:   Find a new coloring function mapping Y -> {-1, 0, +1}
  6:   Add the new column to M
  7:   Train a new classifier based on it and add it to E
  8: End While
  9: Return E, M

Fig. 1. The proposed algorithm
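A compact Python sketch of this loop is given below. It only illustrates the control flow of Figure 1: the helpers train_binary_classifiers and evaluate_confusion are hypothetical placeholders, and the column-sum scoring rule stands in for the coloring functions defined next.

import numpy as np

def extend_code_matrix(M0, X, y, train_binary_classifiers, evaluate_confusion, k):
    """Iteratively append one-against-all columns chosen from the confusion matrix.
    train_binary_classifiers and evaluate_confusion are placeholders for the
    induction and validation steps; they are not specified in the paper's text."""
    M = M0.copy()
    ensemble = train_binary_classifiers(M, X, y)
    history = set()                                  # classes already used for one-vs-all columns
    while len(history) < k:                          # upper limit: k one-against-all classifiers
        C = evaluate_confusion(ensemble, X, y)       # k x k confusion matrix
        scores = C.sum(axis=0) - np.diag(C)          # e.g. the column-sum rule ("C" below)
        scores[list(history)] = -1                   # respect the history check
        best = int(np.argmax(scores))
        history.add(best)
        column = (np.arange(k) == best).astype(int)  # one-against-all coloring for class `best`
        M = np.column_stack([M, column])
        ensemble = train_binary_classifiers(M, X, y)
    return M, ensemble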

One of the key elements is the use of the coloring function. The result of the coloring function is a vector which determines the identity of the next added classifier. The length of the vector is equal to the number of classes, and each class receives a score which represents its importance. As the intrinsic relation between the different classes in the data set cannot be inferred from the vector, we choose to add a one-against-all classifier. A history check ensures that the classifier being added does not already exist. The algorithm chooses the class with the highest score and suggests it for the next one-against-all classifier to be added to the ensemble. Based on experiments (detailed elsewhere), we have concluded that more accurate classifiers were produced when handling the matrix according to the columns rather than the rows. We propose five different coloring functions and a random one. Out of the five, four handle the matrix by columns and one by rows. A code sketch of these scoring rules follows Table 4.

1. CRB - Sum both columns and rows, both with respect to the history:
$$\arg\max_{col \notin history} \sum_{row \neq col,\; row \notin history} \left( confusion[row][col] + confusion[col][row] \right)$$

2. CB - Sum only columns, with columns and rows with respect to the history:
$$\arg\max_{col \notin history} \sum_{row \neq col,\; row \notin history} confusion[row][col]$$

3. CR - Sum both columns and rows, only columns with respect to the history:
$$\arg\max_{col \notin history} \sum_{row \neq col} \left( confusion[row][col] + confusion[col][row] \right)$$

4. C - Sum columns, only columns with respect to the history:
$$\arg\max_{col \notin history} \sum_{row \neq col} confusion[row][col]$$

5. R - Sum rows, only rows with respect to the history:
$$\arg\max_{row \notin history} \sum_{col \neq row} confusion[row][col]$$

6. RA - Random selection of the class, with respect to the history.

We will demonstrate each of those five methods on the confusion matrix given in Table 3. Table 4 depicts the vector that each coloring function will yield. Given a vector, the produced classifier sets the class with the maximum score against the rest.
Table 3. The yeast dataset confusion matrix during the 10th iteration (rows: original class, columns: predicted class)

      A    B    C    D    E    F    G    H    I    J
A   493    1    0    0    2    1    1    1    0    0
B     0  486    4    3    0    6    0    0   12    3
C     1    8  477    1    0    5    1    1    7    0
D     2    5    2  468    0   25    0    0    7    6
E     9   13    3    0  460    4    4    1    5   12
F     0    1    2    4    0  490    2    1    0    2
G    13   10    0    0    5    5  466    0    3    0
H     0    5    2    1    1   17    0  480    1    3
I     1   32    2    1    2   21    1    0  437    1
J     5   14    0    6    0   35    0    2   11  433

Table 4. The score vectors that each coloring function returns, given the confusion matrix depicted in Table 3 and under the assumption that classes F and I were already included in the past. The maximum score of each vector is the index of the class in the next one-against-all classifier.

Class   CRB   CB   CR    C    R
A        35   30   36   31    6
B        66   56   99   89   28
C        23   11   27   15   47
D        26   11   31   16   47
E        50    8   52   10   51
F         X    X    X    X    X
G        34    6   37    9   36
H        15    5   18    6   30
I         X    X    X    X    X
J        51   24   54   27   73
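The following sketch (ours, not from the paper) applies the C and CB scoring rules to the Table 3 matrix; the integer class encoding A=0, ..., J=9 is an assumption made for illustration. The printed scores match the C and CB columns of Table 4.

import numpy as np

# Confusion matrix of Table 3 (rows: original class A..J, columns: predicted class A..J).
C3 = np.array([
    [493,   1,   0,   0,   2,   1,   1,   1,   0,   0],
    [  0, 486,   4,   3,   0,   6,   0,   0,  12,   3],
    [  1,   8, 477,   1,   0,   5,   1,   1,   7,   0],
    [  2,   5,   2, 468,   0,  25,   0,   0,   7,   6],
    [  9,  13,   3,   0, 460,   4,   4,   1,   5,  12],
    [  0,   1,   2,   4,   0, 490,   2,   1,   0,   2],
    [ 13,  10,   0,   0,   5,   5, 466,   0,   3,   0],
    [  0,   5,   2,   1,   1,  17,   0, 480,   1,   3],
    [  1,  32,   2,   1,   2,  21,   1,   0, 437,   1],
    [  5,  14,   0,   6,   0,  35,   0,   2,  11, 433],
])
history = {5, 8}                       # classes F and I were already added

def score_C(conf, history):
    """Coloring function C: for each column not in the history, sum the off-diagonal entries."""
    k = conf.shape[0]
    return {c: int(conf[:, c].sum() - conf[c, c]) for c in range(k) if c not in history}

def score_CB(conf, history):
    """Coloring function CB: as C, but rows that belong to the history are excluded as well."""
    k = conf.shape[0]
    keep = [r for r in range(k) if r not in history]
    return {c: int(sum(conf[r, c] for r in keep if r != c)) for c in range(k) if c not in history}

print(score_C(C3, history))   # {0: 31, 1: 89, 2: 15, 3: 16, 4: 10, 6: 9, 7: 6, 9: 27}
print(score_CB(C3, history))  # {0: 30, 1: 56, 2: 11, 3: 11, 4: 8, 6: 6, 7: 5, 9: 24}
# The class with the maximum score (B, index 1) is proposed for the next one-against-all classifier.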

V. EXPERIMENTAL DETAILS

In order to evaluate the performance of the proposed approach, a comparative experiment was conducted on ten benchmark datasets. Since the accuracy and the classifier complexity are affected by the ensemble size (number of classifiers), we examined several ensemble sizes on each dataset. We apply the suggested method to two popular classification algorithms: a) an ECOC based ensemble; b) AdaBoost.ECC. The following subsections describe the experimental set-up and the results obtained.

A. Experimental Setup
To estimate the generalized accuracy, a 10-fold cross-validation procedure was conducted. All the experiments started from the same code matrix (M0). For the purpose of M0, we introduce for the first time the OA as a code matrix for the multi-class classification task. At first sight, the use of an OA as part of ECOC in order to solve a multi-class learning problem may seem surprising. A deeper observation reveals that ECOC and OA have a lot in common. The rows of the OA matrix (the runs) represent tests which require resources such as time, money and hardware. The purpose of the OA is to minimize the number of runs. This economical property is exactly what we are trying to achieve. Moreover, OA is related to ECOC through combinatorics, finite fields, and geometry [15]. OA also supports binary values, which is one of the requirements of the produced matrix.

The OA, as many other code matrices, requires a preprocessing phase. We choose an OA of strength 2 with the number of factors equal to the number of classes. Then, we transpose the matrix so that each of the "runs" moves from a row to a column. Details regarding the sizes of M0 are given in Table 5. As for the induction algorithm used for training the base classifiers, we used the decision stump. A decision stump is a weak learner consisting of a one-level tree, which is known to benefit from a boosting strategy [17].

B. Experimental Setup - ECOC based Ensemble
The first experimental setup used an ECOC based ensemble. We extended M0 by each one of the six mentioned coloring functions. To create the confusion matrix, at each iteration we randomly partitioned the training data into 3 disjoint instance subsets. Each subset was utilized once as a test set and twice as part of the training set. The sum of the three resulting confusion matrices was sent to the coloring function (a sketch of this partition is given just before Table 5). The natural stopping criterion is the upper limit of adding k different one-against-all classifiers. However, in some cases the coloring functions stopped before reaching k, when one of the methods was not able to add another new classifier. In that case, we stopped the other methods too and recorded the accuracy.

C. Experimental Setup - Improved AdaBoost.ECC
AdaBoost.ECC was chosen in order to compare the suggested method to existing weighting algorithms (in category C). In order to compare the results to the previously mentioned experiments, we updated the AdaBoost.ECC algorithm so that we are able to control its intrinsic coloring function. For the experiment we created two extensions of the algorithm: 1. ABEM - AdaBoost.ECC whose first l classifiers are determined by M0. 2. ABEMCB - as ABEM, except that after adding the l classifiers according to M0, we add the next classifier according to the mentioned CB coloring function. In order to apply the CB coloring function, at each iteration we record the classification results in a confusion matrix. Except for these, no other changes have been made to the algorithm. When one of the algorithms was not able to add a classifier as described above, the following classifier was added based on the regular coloring function employed by the AdaBoost.ECC algorithm. The other parameters given to the algorithm are asymmetric weighting and a coloring function of type even split.

D. Datasets
Ten benchmark datasets were taken from the widely used UCI Machine Learning Repository for evaluating learning algorithms. The datasets vary across dimensions such as the number of target classes, of instances, of input features and their type (such as nominal and numeric). Table 5 describes the datasets employed in the study.
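As promised in subsection B, a minimal sketch (ours) of the three-way partition used to build the summed confusion matrix at each iteration; the helper build_and_score is a hypothetical placeholder for training the current ensemble and scoring the held-out part.

import numpy as np

def summed_confusion_matrix(X, y, k, build_and_score, n_parts=3, seed=0):
    """Randomly split the data into 3 disjoint subsets; each subset serves once as the
    test set (and twice as part of the training set). The three confusion matrices are
    summed before being handed to the coloring function.
    `build_and_score(train_idx, test_idx)` must return a k x k confusion matrix."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    parts = np.array_split(idx, n_parts)
    total = np.zeros((k, k), dtype=int)
    for i in range(n_parts):
        test_idx = parts[i]
        train_idx = np.concatenate([parts[j] for j in range(n_parts) if j != i])
        total += build_and_score(train_idx, test_idx)
    return total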


Table 5. The datasets used in the experiment

Dataset              Number of    Number of input   Number of   Size of M0
                     Instances    Attributes        Classes     Classes (k) x Classifiers (l)
White wine quality      4,898          11               7            7x7
Segment                 2,310          19
Seven segment           1,000           7
Yeast                   1,484           8              10           10x11
Opt digits              5,620          64
Pen digit              10,992          16
Abalone                 3,842           8              11           11x11
Vowel                     990          12
Krkopt                 28,056           6              18           18x19
Soybean                   683          35              19           19x19

VI. EXPERIMENTAL RESULTS

All together, there were 76 different iterations with respect to the stopping criterion mentioned before. In each iteration we ranked the eight methods: six in the first experiment and two more in the second one. In order to conclude which method is superior, we used the null hypothesis that all methods perform the same over multiple data sets and that the observed differences are merely random. The adjusted Friedman test rejected the null hypothesis with FF(5,45)=22.74, p < 0.05. We proceeded with a post-hoc Bonferroni-Dunn test using C as the control method. We concluded that the C, CB and CR methods perform almost the same. Still, C significantly outperforms RA with z=6.16, p < 0.05. Table 6 depicts the average rank over the 76 iterations of the first six coloring functions employed in the first experiment. The second experiment, which also considers ABEM and ABEMCB, yields the results depicted in Table 7. Figure 2 depicts the results of two datasets graphically. Using the adjusted Friedman test, the null hypothesis that all methods perform the same over multiple data sets and that the observed differences are merely random was rejected with FF(7,63)=20.76, p < 0.05. We proceeded with a post-hoc Bonferroni-Dunn test using ABEM as the control method. We concluded that the ABEM and ABEMCB methods perform almost the same.

Table 6. The average rank over the 76 iterations given the six coloring functions
Algorithm    CRB    CB    CR     C     R    RA
Mean Rank   3.46  2.74  3.04  2.58  4.74  4.45

Table 7. The average rank over the 76 iterations given the six coloring functions, ABEM and ABEMCB
Algorithm    CRB    CB    CR     C     R    RA   ABEM   ABEMCB
Mean Rank   4.88  4.08  4.43  3.89  6.25  5.97   3.14     3.34
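For readers who wish to reproduce this kind of analysis, the following sketch (ours) shows how an adjusted Friedman (Iman-Davenport) statistic can be computed from a score matrix with SciPy; the data below are random placeholders, not the paper's measurements. With 10 datasets and 6 methods the statistic is F-distributed with (5, 45) degrees of freedom, matching the figures quoted in the text.

import numpy as np
from scipy.stats import friedmanchisquare

# Toy accuracy matrix: rows are datasets, columns are the compared methods (placeholders only).
rng = np.random.default_rng(1)
acc = rng.random((10, 6))

chi2, p = friedmanchisquare(*[acc[:, j] for j in range(acc.shape[1])])

# Iman-Davenport adjustment of the Friedman statistic, distributed as F with
# (k-1, (k-1)(N-1)) degrees of freedom.
N, k = acc.shape
F_F = (N - 1) * chi2 / (N * (k - 1) - chi2)
print(chi2, p, F_F)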

Fig. 2. The performance of the six methods given the number of classifiers for the opt-digit dataset

VII. CONCLUSIONS

We have suggested a new non-weighting algorithm for extending a given code matrix, which outperforms random extension of the code matrix but not the weighting algorithm. Encouraged by these results, we believe that there is room for further research, such as using more complicated classifiers than one-against-all.

REFERENCES
[1] K. Tumer and J. Ghosh, "Error correlation and error reduction in ensemble classifiers," Connection Science, Vol. 8, pp. 385-404, 1996.
[2] L. I. Kuncheva, "Using diversity measures for generating error-correcting output codes in classifier ensembles," Pattern Recognition Letters, Vol. 26, pp. 83-90, 2005.
[3] X. Hu, "Using rough sets theory and database operations to construct a good ensemble of classifiers for data mining applications," ICDM, pp. 233, 2001.
[4] T. G. Dietterich, "Ensemble methods in machine learning," Multiple Classifier Systems, Vol. 1857, pp. 1-15, 2000.
[5] T. G. Dietterich and G. Bakiri, "Solving multiclass learning problems via error-correcting output codes," Journal of Artificial Intelligence Research, Vol. 2, pp. 263-286, 1995.
[6] K. Crammer and Y. Singer, "On the learnability and design of output codes for multiclass problems," Machine Learning, Vol. 47, pp. 201-233, 2002.
[7] L. Rokach, "Taxonomy for characterizing ensemble methods in classification tasks: A review and annotated bibliography," Computational Statistics & Data Analysis, Vol. 53, pp. 4046-4072, 2009.
[8] T. Windeatt and R. Ghaderi, "Coding and decoding strategies for multi-class learning problems," Information Fusion, Vol. 4, pp. 11-21, 2003.
[9] A. Hedayat, N. J. A. Sloane and J. Stufken, "Orthogonal Arrays: Theory and Applications," Springer Verlag, NY, 1999.
[10] A. Zhang, Z. L. Wu, C. H. Li and K. T. Fang, "On Hadamard-type output coding in multiclass learning," Intelligent Data Engineering and Automated Learning, Vol. 2690, pp. 397-404, 2003.
[11] J. Zhou, H. Peng and C. Y. Suen, "Data-driven decomposition for multi-class classification," Pattern Recognition, Vol. 41, pp. 67-76, 2008.
[12] Y. Freund and R. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," Computational Learning Theory, Vol. 904, pp. 23-37, 1995.
[13] R. E. Schapire, "Using output codes to boost multiclass learning problems," Machine Learning - International Workshop, pp. 313-321, 1997.
[14] V. Guruswami and A. Sahai, "Multiclass learning, boosting, and error-correcting codes," Proceedings of the Twelfth Annual Conference on Computational Learning Theory, pp. 145-155, 1999.
[15] K. Hinkelmann and O. Kempthorne, "Design and Analysis of Experiments: Introduction to Experimental Design," Wiley-Interscience, 2007.
[16] L. Rokach, "Decomposition methodology for classification tasks: A meta decomposer framework," Pattern Analysis & Applications, Vol. 9, pp. 257-271, 2006.
[17] S. Kotsiantis, D. Kanellopoulos and P. Pintelas, "Local boosting of decision stumps for regression and classification problems," Journal of Computers, Vol. 1, pp. 30-37, 2006.

