
Attribute Filtered Data Mining

José C. Cortizo
Artificial Intelligence & Network Solutions S.L.
C/ Bélgica 7 2B, Fuenlabrada (28943), Spain
jccp@ainetsolutions.com

Ignacio Giráldez
Departamento de Inteligencia Artificial, Universidad Europea de Madrid
C/ Tajo s/n, Villaviciosa de Odón (28670), Spain
ignacio.giraldez@uem.es

Abstract
Data mining opens up the possibility of using data presented in different sources for the discovery of interesting and useful patterns. Our data mining tool, FBL (Filtered Bayesian Learning), performs a two stage process: first it analyzes the data present in a data source, and then, using information about the data dependencies encountered, it filters out dependent attributes and performs the mining phase based on Bayesian learning. The Naïve Bayes classifier is based on the assumption that the attribute values are conditionally independent given the class. This makes it perform very well in some data domains, but its effectiveness worsens when attributes are dependent. In this paper, we try to identify those dependencies using linear regression on the attribute values, and then filter out the attributes which are a linear combination of one or two others. We have tested the FBL system on five domains, to which we added a synthetic attribute that is a linear combination of two of the original ones. The system perfectly detects those synthetic attributes, and also some naturally dependent attributes, obtaining a more accurate classifier.

1. Introduction
Machine Learning algorithms try to learn from experience with respect to some class of tasks, and the experience is usually coded as a set of training instances, where each instance is defined by a set of attribute values. It is then evident that the performance of the algorithms depends, at least, on the attributes and values given. It seems logical to think that the more attributes you use to describe the problem instances, the better the performance you will obtain. But that is not always the case: for a set of algorithms (like Naïve Bayes, Logistic Regression and others) and techniques, the existence of dependencies between attributes may worsen system performance.

Dependencies between attributes can have effects in multiple ways. In algorithms like Naïve Bayes, the independence assumption gives us the possibility of estimating a value which could not be calculated without this naïve assumption. But, for example, in tree based models, if strong dependencies between the attributes are not recognized and these attributes are not used near the root, node replications appear in lower levels [23]. Since this is a problem for some algorithms, there have been several attempts to solve it: some try to solve the problem for a single algorithm ([18][19][22]), while others try to solve it in a more general way ([17]). [17] presents a general method based on attribute interactions. The presence of an attribute interaction depends on whether the evidence function (the class membership function) can be written as the sum of the concrete pieces of evidence from the individual attributes. A good example to illustrate this are the OR and XOR logical functions. No interaction is present in the OR operation, because the function can be written as the sum of the evidence given by each attribute (X OR Y is true exactly when at least one of the attributes contributes evidence of truth on its own). The XOR operation, in contrast, does present an interaction, because the evidence given by each attribute on its own makes no distinction about the final result. If the function cannot be expressed as such a sum, we can assume there are interactions between the attributes which make the global function behave differently from the additive approximation. Jakulin and Bratko define a heuristic for detecting these interactions which can be used to improve some algorithms, like Naïve Bayes and Logistic Regression. [20] also explains a general method for performing classification tasks with attribute dependencies.
The algorithm consists of transforming the input space attributes into a set of attributes which only explain whether a relation is present in the original attributes or not, so that the attributes used for the classification task contain no dependencies between them and show much more explicit knowledge. Multiple kinds of relations are used, due to the huge variety found in real life (linear, stochastic, logical), and the method also permits the inclusion of other relationship models. In [20] Montes only shows his model working on tree based models but, as he explains and as is easy to see, the ideas can easily be applied to other algorithms.

In this paper, as in [2], we study how to solve the problem of dependencies between attributes in the Naïve Bayes Classifier (where it is necessary to assume independence between the attributes to make possible the calculation of the conditional probability of a class given the attributes that define the instance to be classified). Our main goal is to define a Naïve Bayes based algorithm which does not require the independence assumption, an assumption that is violated in many situations. We want to improve the Naïve Bayes algorithm because it is a simple algorithm that produces very good results in situations where the attributes present no dependencies among themselves.

The Bayesian classifier [5] selects the most probable target value v_j ∈ V for a new instance according to its attribute values (a_1, a_2, ..., a_n). The most probable target value is the one that maximizes:

  v_MAP = argmax_{v_j ∈ V} P(v_j | a_1, a_2, ..., a_n)

This expression can be rewritten using the Bayes theorem as

  v_MAP = argmax_{v_j ∈ V} [ P(a_1, a_2, ..., a_n | v_j) · P(v_j) / P(a_1, a_2, ..., a_n) ]
        = argmax_{v_j ∈ V} P(a_1, a_2, ..., a_n | v_j) · P(v_j)                    (Eq. 1)

In this equation it is easy to estimate P(v_j), because it represents the prior probability of the category value v_j (it can be obtained by dividing the number of instances of this category by the size of the sample). Estimating P(a_1, a_2, ..., a_n | v_j) is not so easy, but if we assume the attribute values are independent given the target value v_j, then it can be factored as the product P(a_1 | v_j) · ... · P(a_n | v_j). Substituting this into Equation (1), we have the Naïve Bayes Classifier [15]:

  v_NB = argmax_{v_j ∈ V} P(v_j) · ∏_i P(a_i | v_j)                                (Eq. 2)

To study how (Eq. 2) works, we use the mean accuracy obtained from tenfold cross-validation (on unseen examples) as the quality metric for the output of the different classifiers. This metric is also used as the reference for comparisons between Naïve Bayes and our system FBL (Filtered Bayesian Learning). We chose it because it is a reference evaluation metric for machine learning algorithms and it reflects how well the learned abstraction works on test data that has not been used for training.

The Naïve Bayes Classifier achieves its best accuracy when the attribute values are independent given the class value. It can be optimal in other particular domains which present dependencies, as can be seen in [4], and it has undergone multiple variations motivated by the singular problems of some domains and by its own characteristics [25]. In this paper we study the problem of domains that present dependent attribute values, setting aside those domains in which it can be optimal even with dependent attributes, as studied in [18], [19] and [22]. In those approximations, the way to improve the accuracy was to transform the attributes using a greedy algorithm that, in its simplest form, starts with an empty set of attributes F. In each iteration, it adds the attribute A_i ∈ S from the initial set of attributes such that, by adding A_i, the classifier performs better than by adding any other A_j ≠ A_i (A_i, A_j ∈ S). [22] also considers the possibility of joining two attributes into one to avoid dependencies.

Our approach is in some way innovative because, although as in [20] we find the dependencies between attributes and use them, we do not exclusively use the dependency information in the mining process, as [20] does ([20] detected the dependencies and then created a new input data set which only contained information about attribute dependencies). We transform the original data according to the information found when detecting the dependencies, and then we use the transformed data in the mining process. Other approaches, such as [18][19][22], also transformed the original dataset instead of creating a new one, but they did not transform the original data using the information extracted by detecting dependencies: they only transformed the original dataset by trying combinations of the original attributes in different ways. We consider that searching for dependencies by testing each attribute combination is a rudimentary method, because it does not allow each kind of dependency to be treated in a different way (as [20] allows), and it also does not extend the benefits to other machine learning algorithms. If we can detect the dependencies and classify them according to some properties, it could be possible to improve the Bayesian classifier accuracy by transforming the attribute set according to the class of the dependency. To make this possible, we use linear regression, a classic statistical solution to the problem of determining the relationship between two (or more) random variables X and Y. Using regression we can determine only one kind of relationship between attributes, but we consider this a good starting point, because there are nonlinear regression models that can expand the system's possibilities.

A more complex Bayesian approach to this problem is represented by Bayesian Networks ([8], [9]). [14] defines a method, based on Augmented Naïve Bayes, that combines the simplicity of Naïve Bayes with the ability of Bayesian Networks to represent dependences between attributes: it starts from the Naïve Bayes method and then finds a Bayesian Network that represents the dependences between attributes. Other authors also work in this direction, as can be seen in [15], [16].

The rest of the paper is structured as follows. In section 2, we describe a method for deleting dependent attributes that first calculates all the possible dependencies and orders them by their strength; with that information there is no need to test all the possible combinations of attributes, only to delete attributes according to that list and to stop when no accuracy improvement is obtained. In section 3, we explain the experiments we have run with the FBL system and the results they produced. In section 4, we discuss the statistical significance of the experimental results. And in sections 5 and 6, we explain the limitations of FBL and present the conclusions of this paper.
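To make (Eq. 2) concrete, the factored classifier can be sketched in a few lines of Python. This is our own minimal sketch for categorical attributes, with add-one smoothing that the paper does not mention; the function names are ours.

```python
from collections import Counter, defaultdict

def train_nb(examples):
    """examples: list of (attribute_tuple, class_label) pairs.
    Returns class counts, per-(class, attribute) value counts, and n."""
    priors = Counter(v for _, v in examples)
    counts = defaultdict(Counter)          # (class, attr index) -> value counts
    for attrs, v in examples:
        for i, a in enumerate(attrs):
            counts[(v, i)][a] += 1
    return priors, counts, len(examples)

def classify_nb(attrs, priors, counts, n):
    """Return the v_j maximizing P(v_j) * prod_i P(a_i | v_j), i.e. (Eq. 2).
    Add-one smoothing (our choice) keeps unseen values from zeroing the product."""
    best, best_score = None, -1.0
    for v, nv in priors.items():
        score = nv / n                     # prior P(v_j)
        for i, a in enumerate(attrs):
            score *= (counts[(v, i)][a] + 1) / (nv + 2)
        if score > best_score:
            best, best_score = v, score
    return best
```

Trained on a handful of toy examples, the classifier simply picks the class whose prior times per-attribute likelihoods is largest.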

2. The FBL Algorithm

We propose a method for improving the Naïve Bayesian Classifier based on filtering out, from the original dataset, those attributes that strongly depend on other attributes present in the same dataset. Since Naïve Bayes assumes the training data is free of dependencies, we first find those dependencies (the finding phase) and then clean the data used for training, removing the attributes that depend on others (only if removing them has no bad consequences for the accuracy of the classifier). The entire process of finding and cleaning is defined next.

Let S = {A_1, A_2, ..., A_n} be the original set of attributes and V = {v_1, v_2, ..., v_m} the set of possible classes. E_j = (x, v) is a training example, where x = (a_1, a_2, ..., a_n) is a point in the input space (X) and v ∈ V is a point in the output space (V). The Naïve Bayes classifier receives as input a set of training examples T = {E_1, E_2, ..., E_k} and produces, as output, a function h (hypothesis) mapping the input space onto the output space, h: X → V, according to the probabilistic model defined before, which can be used to predict the class of previously unseen examples E_j.

Our method modifies S by deleting all dependencies of the forms

  A_i = a + b·A_v        and        A_i = a + b·A_v + c·A_w

where A_i ≠ A_v ≠ A_w; A_i, A_v, A_w ∈ S; and a, b, c ∈ R. For this purpose, we create a list L of all possible linear dependencies ([21]),

  L = ((D_1, R_1), (D_2, R_2), ...),  with D_q = (A_i, A_v, a, b) or D_q = (A_i, A_v, A_w, a, b, c),

ordered so that q < t if R_q ≥ R_t, where R_q is the square of the correlation and measures how well the linear regression explains the relation (that is, if a correlation of 0.8 is observed between two variables, then a linear regression model attempting to explain either variable in terms of the other will account for 64% of the variability in the data).

We calculate all the possible dependencies between pairs or trios of attributes for each class value. L is therefore a list of attribute dependencies ordered by their strength, so to obtain the final attribute set it is only necessary to delete each A_i from each dependency, in order, and to stop deleting when no accuracy improvement is obtained. Once this process is finished, we obtain the filtered attribute set F ⊆ S, in which each A_i is linearly independent given the class. Finally, we modify each element in T, deleting each attribute value that belongs to an attribute removed in this process. Let g be the linear regression method applied to a pair or a trio of attributes, and t a function that transforms each training example by deleting all the values corresponding to the attributes that have been filtered out. Then, the FBL algorithm is summarized as follows:

   1  F ← S
   2  L ← ()
   3  For each v_i ∈ V
   4    For each A_i ∈ S
   5      For each A_j ∈ S − {A_i}
   6        L ← L ∪ g(A_i, A_j)
   7        For each A_k ∈ S − {A_i, A_j}
   8          L ← L ∪ g(A_i, A_j, A_k)
   9  For each D_q ∈ L
  10    If accuracy(F' = F − {A_i of D_q}) ≥ accuracy(F) then F ← F'
  11  T' ← {t(E_j) : E_j ∈ T}
  12  NaïveBayes(T')

Table 1. Comparison between the Naïve Bayes classifier and the Filtered Bayesian Learning system in the five selected domains. N.B. (all) uses the augmented attribute set; N.B. (orig.) uses only the original attributes.

  Domain                 N.B. (all)   N.B. (orig.)   FBL
  Balance-scale          0.8336       0.9136         0.9136
  Contraceptive          0.4840       0.5017         0.5078
  Glass Identification   0.8178       0.8411         0.8925
  Wine Recognition       0.8983       0.9719         0.9719
  TAE                    0.4569       0.4967         0.5099
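The dependency-scoring step of FBL can be sketched as follows. This is our own simplification, assuming NumPy: it scores attribute pairs only (the paper also fits trios, and fits per class value) and uses the squared correlation as R^2.

```python
import numpy as np

def linear_dependencies(X):
    """X: (n_samples, n_attrs) array. For every ordered attribute pair
    (i, j), i != j, score how well A_i is explained linearly by A_j via
    the squared correlation R^2 (a correlation of 0.8 gives R^2 = 0.64).
    Returns (R2, i, j) triples sorted by decreasing strength."""
    n_attrs = X.shape[1]
    deps = []
    for i in range(n_attrs):
        for j in range(n_attrs):
            if i != j:
                r = np.corrcoef(X[:, i], X[:, j])[0, 1]
                deps.append((r * r, i, j))
    return sorted(deps, reverse=True)

# Synthetic check: attribute 2 is an exact linear combination of attribute 0,
# so a pair involving attributes 0 and 2 tops the list and becomes the first
# candidate for deletion, mirroring the paper's synthetic-attribute setup.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
X = np.column_stack([X, 2.0 * X[:, 0]])
top_r2, i, j = linear_dependencies(X)[0]
```

A full implementation would then drop the dependent attribute only while the cross-validated accuracy does not decrease, as in line 10 of the algorithm.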


3. Experiments
We ran experiments comparing our system to the Naïve Bayes (NB) algorithm. We selected five problem domains from [1]. We then added one randomly created synthetic attribute to each domain, so that each domain had at least one linear attribute dependency. We created one HTML file for each domain so that the system could consult the data on-line: the system accessed the corresponding HTML file for each domain, parsed the HTML tags, extracted the data and created an internal representation of it before applying the entire process. On each problem, we ran the experiments using tenfold cross-validation to obtain more objective results. For each domain we then executed three algorithms: the NB algorithm (using [26]) with the augmented attribute set, NB with only the original attributes, and our system FBL with the augmented attribute set. The results are shown in Table 1. As can be seen there, our system detects all the synthetic dependencies, avoiding the lower accuracy of Naïve Bayes in those domains. Moreover, in three of the five domains the system performs better than the Naïve Bayes classifier working on the original data; this improvement is caused by some natural attribute relationships that were also detected. If we compare the performance of FBL to that of NB with the original attribute set, we see that FBL achieves accuracy better than or equal to NB while working with fewer attributes (see Table 2), so it obtains better results with simpler classifiers.
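The tenfold cross-validation protocol used in these experiments can be sketched generically (our own helper, with placeholder train/classify callables):

```python
def ten_fold_accuracy(examples, train, classify, k=10):
    """Mean accuracy over k folds: each fold is held out once for testing
    while the remaining examples are used for training."""
    accs = []
    for f in range(k):
        held_out = examples[f::k]                  # every k-th example, offset f
        train_set = [e for i, e in enumerate(examples) if i % k != f]
        model = train(train_set)
        hits = sum(classify(model, x) == y for x, y in held_out)
        accs.append(hits / len(held_out))
    return sum(accs) / k
```

Averaging over the ten held-out folds is what yields the single accuracy figure reported per domain in Table 1.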

Table 2. Number of attributes deleted by FBL in the selected domains. When only one attribute is deleted, that attribute is the synthetic one; the other deleted attributes reflect natural dependencies.

  Domain                 Deleted
  Balance-scale          1
  Contraceptive          3
  Glass Identification   3
  Wine Recognition       1
  TAE                    2

But dependencies are not only a Bayesian matter. As can be seen in Table 3, some other machine learning algorithms are affected in a similar way by these dependencies. The table presents the difference between the accuracy obtained with the original attribute set and the accuracy obtained after adding the synthetic attributes.

Table 3. Accuracy variation of several machine learning algorithms and FBL when synthetic dependent attributes are added in 6 domains (Contraceptive Method, Breast Cancer, Balance Scale, Glass Identification, Wine Recognition and TAE), testing several added synthetic dependencies in each one. Each entry is the accuracy variation observed when the synthetic attributes are added.

  Algorithm      CM1    CM2    CM3    BC1    BC2    BC3    BC4    BS1    BS2    BS3    GI1    WR1    TAE
  FBL            0.07   0.07   0.07   0.00   0.00   0.00   0.00   2.30   2.30   2.30   0.00   0.00   0.01
  N. Bayes      -1.15  -2.45  -1.02  -0.14  -0.14  -0.14  -0.14  -9.11  -5.23  -0.68  -0.08  -0.07  -0.04
  C4.5          -0.61   1.77   0.68   1.72   3.72   2.72   0.14   0.00   0.45   0.68   0.13   0.01   0.89
  C4.5 rules     0.34   0.81   1.08   0.72   1.58   0.29   0.86  -2.74  -1.60  -0.91   0.72   1.58  -0.29
  KNN (K=1)     -1.29  -0.61  -1.29   0.04   0.28   0.28  -0.15  -0.23   0.91   1.14  -0.13  -0.28   0.23
  KNN (K=4)     -0.81  -0.88  -0.88   0.15   0.72   0.72  -0.43  -1.14  -0.91  -2.50  -1.16  -0.72   0.01
  SMO            0.06   0.06   0.00  -0.14  -0.14  -0.14   0.00   0.22  -0.23  -0.46   0.00  -0.04   0.54

The results are promising because the FBL system detects all the synthetic linear dependencies, and it also detects five natural linear dependencies in the test domains, which suggests how profitable it would be to remove the linearity limitation of the regression method used for detecting relations. It is also interesting to note (Table 3) that every algorithm tested loses accuracy in some domains when dependent attributes are used.

4. Discussion of Experimental Results

As shown in the results above, the FBL algorithm performs better than the classical Naïve approach. To assess the relevance of the improvement, we study the statistical significance of the results using a null hypothesis test [13]. We define two hypotheses: H0, which represents what is not desired, and its opposite, Ha, which represents what we try to demonstrate; we then support Ha by showing that H0 is highly improbable. H0 is the hypothesis that there is no statistical difference between the original Naïve Bayes and our approach FBL (i.e. the experiments do not show that FBL is a significant improvement over NB). The alternative hypothesis Ha states that there is a significant improvement in the results shown above.

Table 4. Comparison between the number of times FBL does not decay with dependent attributes and the number of times Naïve Bayes does not. n_i is the number of experiments done, and p_i the fraction of times the algorithm does not decay in accuracy.

  Algorithm     Experiments   Fraction of times accuracy doesn't decay
  Naïve Bayes   n1 = 13       p1 = 0/13 = 0
  FBL           n2 = 13       p2 = 13/13 = 1

Let p1 and p2 be the fractions of times Naïve Bayes and FBL, respectively, do not decay in accuracy when dependencies are present in the data. We first calculate p, the average of these fractions:

  p = (p1 + p2) / 2 = (0 + 1) / 2 = 0.5

With n1 and n2 the number of experiments performed on Naïve Bayes and FBL respectively, the standard error is calculated as:

  standard error = sqrt( p · (1 − p) · (1/n1 + 1/n2) )

Since p1 is 0 (in all the experiments Naïve Bayes decreased in accuracy) and p2 is 1 (in all the experiments performed on the FBL system the accuracy increased, or at least did not decrease), p is 0.5. Applying this value to the formula, with n1 = n2 = 13:

  standard error = sqrt( 0.5 · (1 − 0.5) · (1/13 + 1/13) ) ≈ 0.1961

Assuming that 95% certainty is enough to assure statistical significance, we check whether the standard error multiplied by 1.96 (the x coordinate of the Z function to whose left 95% of the area lies) is lower than |p1 − p2|:

  standard error · 1.96 ≈ 0.38 < |p1 − p2| = |0 − 1| = 1

As 0.38 is lower than 1, we reject H0 and accept Ha, and we can take the experimental results as evidence that FBL performs better than the Naïve Bayes approach.

5. Limitations and Future Work

The good results on the selected domains are a good starting point because they show that regression is a reliable method for detecting attribute relations. But linear regression is not the only regression method: there is a huge variety of them, each related to one kind of function (polynomial, exponential, logistic, etc.), and using those methods would allow us to detect many more dependencies, because in real problems the relationships are not always linear. Another important point is that not all relations are given by a function that explains one variable in terms of other independent ones. [20] explains several kinds of relationships; one of them is given by regression models, but there are also logical and stochastic relationships, where a dependence can be defined as membership (for example, whether or

not an instance is classified as a member of a class by a neural network). Detecting attribute dependencies would also help some tree based models, as explained in [23], and it could be interesting to extend the benefits of this system to other inductive algorithms. It would be really interesting to test FBL in domains where Naïve Bayes works surprisingly well but which contain many obvious dependencies between attributes, such as spam detection [11][24] or concept indexing for automatic text categorization [3][10]. If FBL performs in these domains as it does in the tested ones, it would be a really good alternative to the Naïve Bayesian Classifier. We also intend to try FBL in experiments on traffic simulations [6], learning the behavior of traffic lights and other traffic signals as a function of the traffic flows (where one flow may depend on one or more other flows, e.g. at a road crossing), in order to reduce the average time a vehicle wastes in reaching its destination from its origin.
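As a small illustration of the nonlinear extension proposed above, a polynomial regression can score a quadratic dependency that a linear fit misses entirely (our own sketch, assuming NumPy):

```python
import numpy as np

def fit_r2(x, y, degree):
    """R^2 of a least-squares polynomial fit of the given degree."""
    coeffs = np.polyfit(x, y, degree)
    residuals = y - np.polyval(coeffs, x)
    ss_res = float(np.sum(residuals ** 2))
    ss_tot = float(np.sum((y - np.mean(y)) ** 2))
    return 1.0 - ss_res / ss_tot

x = np.linspace(-3.0, 3.0, 101)
y = x ** 2                    # exact quadratic dependency, zero noise
linear_r2 = fit_r2(x, y, 1)   # near 0: the symmetry defeats a straight line
quad_r2 = fit_r2(x, y, 2)     # near 1: fully explained by a degree-2 fit
```

Swapping such a fit into the scoring function g is one way the linearity limitation could be lifted without changing the rest of the algorithm.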

6. Conclusions

We have introduced a simple alternative to the Naïve Bayesian Classifier for dealing with the attribute dependencies found in some domains. Our non-greedy approach is based on estimating, for each attribute, a value that shows how much it depends on other attributes, and then ordering the attributes according to this value. We can then filter out the most dependent attributes, in order of decreasing dependency strength, until no improvement in accuracy is obtained. We have shown that searching for dependencies among attributes when learning Naïve Bayesian classifiers results in increased accuracy. Three of the five domains presented natural linear dependencies among attributes. These dependencies, common in real world data sets, contradict the naïve assumption of independence between attributes; as a consequence, the Naïve Bayesian Classifier needs some help not to decrease its accuracy. The results show the convenience of studying the detection of dependencies using regression methods to avoid the loss of accuracy caused by hidden relationships among attributes. This process is especially useful in web content mining applications, because data definitions are either not accessible or difficult for web content mining tools to understand; for these tools it is difficult to take advantage of data dependencies unless they can detect them themselves, as FBL does. The experiments also show that some other machine learning algorithms are inclined to lose accuracy in domains where some attributes depend on others. As FBL uses the Bayesian Classifier as a module, FBL might be extended by replacing that module with another one based on a different machine learning paradigm. That is why FBL is both an improvement over the Naïve Bayes Classifier and a general method for improving machine learning algorithms prone to accuracy decays when attribute dependencies are present in the training data.

7. References

[1] Blake, C.L. and Merz, C.J., 1998. UCI Repository of machine learning databases [http://www.ics.uci.edu/~mlearn/MLRepository.html]. Irvine, CA: University of California, Department of Information and Computer Science.
[2] Cortizo, J. C. and Giráldez, J. I., 2004. Discovering Data Dependencies in Web Content Mining. IADIS International Conference WWW/Internet 2004.
[3] Cortizo, J. C. and Ruiz, M. J., 2003. Integración de Información Conceptual de WordNet en Categorización Automática de textos. Tercera Conferencia de Procesamiento del Lenguaje Natural de la Universidad Europea de Madrid: Plenum III 2003, CEES.
[4] Domingos, P. and Pazzani, M., 1996. Beyond Independence: Conditions for the Optimality of the Simple Bayesian Classifier. International Conference on Machine Learning, Bari, Italy, pp. 105-112.
[5] Duda, R. and Hart, P., 1973. Pattern Classification and Scene Analysis. John Wiley and Sons, New York, USA.
[6] Expósito, D. and Giráldez, J. I., 2004. Control Multi-Agente del tráfico rodado mediante Redes WIFI. Conferencia Iberoamericana IADIS WWW/Internet 2004.
[8] Friedman, N. and Goldszmidt, M., 1996. Building classifiers using Bayesian networks. In Proceedings of the National Conference on Artificial Intelligence, pp. 1277-1284. Menlo Park, CA: AAAI Press.
[9] Friedman, N., Geiger, D. and Goldszmidt, M., 1997. Bayesian Network Classifiers. Machine Learning Journal, volume 29, pp. 131-163.
[10] Gómez, J. M., Cortizo, J. C., Puertas, E. and Ruiz, M. J., 2004. Concept Indexing for Automated Text Categorization. Natural Language Processing and Information Systems: 9th International Conference on Applications of Natural Language to Information Systems, NLDB 2004, Salford, UK, June 23-25, 2004, Proceedings. Lecture Notes in Computer Science, vol. 3136, Springer, pp. 195-206.
[11] Gómez, J. M., 2002. Evaluating Cost-Sensitive Unsolicited Bulk Email Categorization. Applied Computing 2002, Proceedings of the ACM Symposium on Applied Computing, Universidad Carlos III de Madrid, Spain, March 11-14, 2002, pp. 615-620.
[12] Heckerman, D., Geiger, D. and Chickering, D., 1995. Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning, 20, pp. 197-243.
[13] Hoel, P. G., Port, S. C. and Stone, C. J., 1971. Testing Hypotheses. Ch. 3 in Introduction to Statistical Theory. New York: Houghton Mifflin, pp. 52-110.
[14] Keogh, E. J. and Pazzani, M. J., 2002. Learning the Structure of Augmented Bayesian Classifiers. International Journal on Artificial Intelligence Tools, Vol. 11, No. 4, pp. 587-601. World Scientific Publishing Company.
[15] Kononenko, I., 1990. Comparison of inductive and naive Bayesian learning approaches to automatic knowledge acquisition. In B. Wielinga (Ed.), Current Trends in Knowledge Acquisition. Amsterdam: IOS Press.
[16] Kononenko, I., 1991. Semi-naive Bayesian Classifier. In Proceedings of the 6th European Working Session on Learning, pp. 206-219. AAAI Press.
[17] Jakulin, A. and Bratko, I., 2003. Analyzing Attribute Dependencies. Proceedings of Knowledge Discovery in Data (PKDD), Springer Verlag, LNAI, pp. 229-240.
[18] Langley, P., 1993. Induction of recursive Bayesian classifiers. Proceedings of the Eighth European Conference on Machine Learning, Vienna, Austria, pp. 153-164.
[19] Langley, P. and Sage, S., 1994. Induction of selective Bayesian classifiers. Proceedings of the Tenth National Conference on Artificial Intelligence, Seattle, USA, pp. 339-406.
[20] Montes, C., 1994. MITO: Método de Inducción Total. PhD Thesis, Universidad Politécnica de Madrid, Boadilla del Monte, Spain.
[21] Neter, J., Kutner, M. H., Wasserman, W. and Nachtsheim, C. J., 1996. Applied Linear Statistical Models. Irwin.
[22] Pazzani, M., 1997. Searching for dependencies in Bayesian Classifiers. Artificial Intelligence and Statistics IV. Springer Verlag, New York, USA.
[23] Robnik, M. and Kononenko, I., 1999. Attribute dependencies, understandability and split selection in tree based models. International Conference on Machine Learning, Bled, Slovenia. Morgan Kaufmann, pp. 344-353.
[24] Salib, M., 2002. MetaSlicer: Spam Classification with Naïve Bayes and Smart Heuristics. Massachusetts Institute of Technology.
[25] Webb, G. I. and Pazzani, M. J., 1998. Adjusted probability naive Bayesian induction. Proceedings of the Eleventh Australian Joint Conference on Artificial Intelligence. Berlin: Springer-Verlag, pp. 285-295.
[26] Witten, I. H. and Frank, E., 2000. Data Mining: Practical Machine Learning Tools with Java Implementations. Morgan Kaufmann, San Francisco.
