
Interesting Measures for Mining Association Rules

Liaquat Majeed Sheikh, Basit Tanveer, Syed Mustafa Ali Hamdani
FAST-NUCES, Lahore
liaquat.majeed@nu.edu.pk, basit.tanveer@gmail.com, mustafa.hamdani@gmail.com

Abstract
Discovering association rules is one of the most important tasks in data mining, and many efficient algorithms have been proposed in the literature. However, the number of discovered rules is often so large that the user cannot analyze them all. To overcome this problem, several methods for mining only the interesting rules have been proposed, together with many measures for determining the interestingness of a rule. In this paper we select a total of eight different measures, compare them on a sample data set, and make recommendations about the use of the measures for discovering the most interesting rules.

1. Introduction

In the past few years a great deal of work has been done in the field of data mining, especially in finding associations between items in databases of customer transactions. Association rules identify items that are most often bought together with certain other items by a significant fraction of the customers. For example, we may find that 95 percent of the customers who bought bread also bought milk. A rule may contain more than one item in the antecedent and in the consequent. Every rule must satisfy two user-specified constraints: one is a measure of statistical significance called support, and the other is a measure of the goodness of the rule called confidence. In this paper we identify a set of measures proposed in the literature and argue that no single measure alone can determine the interestingness of a rule. The rest of the paper is organized as follows. Section 2 gives the formal definition (as presented in the literature) and some explanation of each measure; the measures we have chosen are Support, Confidence, Conviction, Lift, Leverage (the Piatetsky-Shapiro measure), Coverage, Correlation, and Odds Ratio. Section 3 gives the calculation of each measure on our sample data (customer transactions), and the conclusion contains our recommendations on which measures to use for discovering the most interesting rules.

2. Description of Different Measures

To make the measures comparable, all of them are defined using probabilities. The probability of encountering itemset X is given by

P(X) = count(X) / |D|

where count(X) is the number of transactions that contain the itemset X and |D| is the size (number of transactions) of the database.
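To make the definition concrete, P(X) can be computed directly from a list of transactions. Below is a minimal Python sketch; the three-transaction list is hypothetical and used only for illustration:

```python
def p(itemset, transactions):
    """P(X) = count(X) / |D|: the fraction of transactions containing itemset X."""
    count = sum(1 for t in transactions if itemset <= t)  # itemset <= t is a subset test
    return count / len(transactions)

# Hypothetical database D with |D| = 3 transactions.
transactions = [{"bread", "milk"}, {"bread"}, {"milk", "butter"}]
print(p({"bread"}, transactions))          # 2/3, i.e. about 0.67
print(p({"bread", "milk"}, transactions))  # 1/3, i.e. about 0.33
```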

2.1. Support [1]


Introduced by R. Agrawal, T. Imielinski, and A. Swami. Mining associations between sets of items in large databases. In Proc. of the ACM SIGMOD Int'l Conference on Management of Data, pages 207-216, Washington D.C., May 1993.

support(X) = P(X)
Support is defined on itemsets and gives the proportion of transactions that contain X; it is therefore used as a measure of the significance (importance) of an itemset. Since it basically uses the count of transactions, it is often called a frequency constraint. An itemset with a support greater than a set minimum support threshold is called a frequent or large itemset. Support's main feature is that it possesses the downward closure property (anti-monotonicity), which means that all subsets of a frequent itemset are also frequent. This property (actually, the fact that no superset of an infrequent itemset can be frequent) is used to prune the search space (usually thought of as a lattice or tree of itemsets of increasing size) in level-wise algorithms such as Apriori. The disadvantage of support is the rare item problem: items that occur very infrequently in the data set are pruned even though they could still produce interesting and potentially valuable rules. The rare item problem matters for transaction data, which usually have a very uneven distribution of support across the individual items (a few items are used all the time while most items are used rarely). Its values are in the range [0; 1]: it is 0 if the itemset occurs in no transaction and 1 if it occurs in every transaction.
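The downward closure property can be demonstrated directly: an itemset can never appear in more transactions than any of its subsets. A minimal sketch over the same kind of hypothetical data:

```python
def support(itemset, transactions):
    """support(X) = P(X), the fraction of transactions containing X."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

transactions = [{"bread", "milk"}, {"bread"}, {"milk", "butter"}]
# Anti-monotonicity: every subset of {bread, milk} is at least as frequent,
# so no superset of an infrequent itemset needs to be counted at all.
assert support({"bread"}, transactions) >= support({"bread", "milk"}, transactions)
assert support({"milk"}, transactions) >= support({"bread", "milk"}, transactions)
```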

2.2. Confidence [1]

Introduced by R. Agrawal, T. Imielinski, and A. Swami. Mining associations between sets of items in large databases. In Proc. of the ACM SIGMOD Int'l Conference on Management of Data, pages 207-216, Washington D.C., May 1993.

confidence(X → Y) = P(X and Y) / P(X)

Confidence is defined as the probability of seeing the rule's consequent under the condition that the transaction also contains the antecedent. Confidence is directed and gives different values for the rules X → Y and Y → X. Confidence is not downward closed and was developed together with support by Agrawal et al. (the so-called support-confidence framework). Support is used first to find the frequent (significant) itemsets, exploiting its downward closure property to prune the search space; confidence is then used in a second step to produce, from the frequent itemsets, the rules that exceed a minimum confidence threshold. A problem with confidence is that it is sensitive to the frequency of the consequent Y in the database: because of the way confidence is calculated, consequents with higher support automatically produce higher confidence values even if there is no association between the items. Its values are in the range [0; 1]: it is 0 if antecedent and consequent never occur together, and 1 for implications that hold in all cases.
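Expressed over the probabilities defined above, confidence is a one-line function. A minimal sketch; the example values are the B→D and D→B entries that appear later in Table 3 (P(B and D) = 0.50, P(B) = 0.5, P(D) = 0.7):

```python
def confidence(p_xy, p_x):
    """confidence(X -> Y) = P(X and Y) / P(X)."""
    return p_xy / p_x

# Confidence is directed: B -> D and D -> B give different values.
print(round(confidence(0.50, 0.5), 2))  # B -> D: 1.0
print(round(confidence(0.50, 0.7), 2))  # D -> B: 0.71
```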

2.3. Conviction [1]

Introduced by S. Brin, R. Motwani, J. D. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data. In Proc. of the ACM SIGMOD Int'l Conference on Management of Data, pages 255-264, Tucson, Arizona, USA, May 1997.

conviction(X → Y) = P(X) · P(not Y) / P(X and not Y)

Conviction was developed as an alternative to confidence, which was found not to capture the direction of associations adequately. Conviction compares the probability that X would appear without Y if the two were independent with the actual frequency of the appearance of X without Y. In that respect it is similar to lift (see the next section); in contrast to lift, however, it is a directed measure, since it also uses the information about the absence of the consequent. An interesting fact is that conviction is monotone in confidence and lift. Its values are in the range [0; +∞]: it is 1 if antecedent and consequent are independent, and +∞ for implications that hold in all cases.
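A minimal sketch of conviction under the definition above, with a guard for implications that hold in all cases; the numeric example uses hypothetical probabilities of two independent itemsets, for which conviction is 1:

```python
import math

def conviction(p_xy, p_x, p_y):
    """conviction(X -> Y) = P(X) * P(not Y) / P(X and not Y)."""
    p_x_not_y = p_x - p_xy           # P(X and not Y)
    if p_x_not_y == 0:
        return math.inf              # the implication holds in all cases
    return p_x * (1 - p_y) / p_x_not_y

# Hypothetical independent itemsets: P(X)=0.5, P(Y)=0.4, P(X and Y)=0.5*0.4=0.2.
print(conviction(0.2, 0.5, 0.4))     # 1.0
```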

2.4. Lift [1]

Introduced by S. Brin, R. Motwani, J. D. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data. In Proc. of the ACM SIGMOD Int'l Conference on Management of Data, pages 255-264, Tucson, Arizona, USA, May 1997.

lift(X → Y) = P(X and Y) / (P(X) · P(Y))

Lift measures how many times more often X and Y occur together than would be expected if they were statistically independent. Lift is not downward closed and does not suffer from the rare item problem. Its values are in the range [0; +∞). Values lower than 1 mean that satisfying the antecedent decreases the probability of the consequent compared with its unconditional probability; values higher than 1 mean that satisfying the antecedent increases the probability of the consequent. If antecedent and consequent are independent, lift is equal to 1.
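A minimal sketch of lift; the check reproduces the A→D entry of Table 3 (P(A and D) = 0.40, P(A) = 0.6, P(D) = 0.7):

```python
def lift(p_xy, p_x, p_y):
    """lift(X -> Y) = P(X and Y) / (P(X) * P(Y)); 1 means independence."""
    return p_xy / (p_x * p_y)

print(round(lift(0.40, 0.6, 0.7), 2))  # A -> D: 0.95, a slightly negative association
```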


2.5. Leverage [1]


Introduced by G. Piatetsky-Shapiro. Discovery, analysis, and presentation of strong rules. In Knowledge Discovery in Databases, pages 229-248, 1991.

leverage(X → Y) = P(X and Y) − P(X) · P(Y)

Leverage measures the difference between how often X and Y appear together in the data set and how often they would be expected to appear together if they were statistically independent. The rationale, in a sales setting, is to find out how many more units (items X and Y together) are sold than would be expected from their independent sales. Using a minimum leverage threshold at the same time incorporates an implicit frequency constraint: e.g., for a minimum leverage threshold of 0.01% (corresponding to 10 occurrences in a data set with 100,000 transactions), one can first use an algorithm to find all itemsets with a minimum support of 0.01% and then filter the found itemsets using the leverage constraint. Because of this property leverage can also suffer from the rare item problem. Its values are in the range [-0.25; 0.25]; if antecedent and consequent are independent it is equal to 0.
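A minimal sketch of leverage; the check reproduces the A→F entry of Table 3 (P(A and F) = 0.50, P(A) = 0.6, P(F) = 0.5):

```python
def leverage(p_xy, p_x, p_y):
    """leverage(X -> Y) = P(X and Y) - P(X) * P(Y); 0 means independence."""
    return p_xy - p_x * p_y

print(round(leverage(0.50, 0.6, 0.5), 2))  # A -> F: 0.2
```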

2.6. Coverage [1]

coverage(X → Y) = P(X and Y) / P(Y)

Coverage shows what portion of the transactions containing the consequent is covered by the rule. Its values are in the range [0; 1].
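A minimal sketch of coverage; the check reproduces the A→D entry of Table 3 (P(A and D) = 0.40, P(D) = 0.7):

```python
def coverage(p_xy, p_y):
    """coverage(X -> Y) = P(X and Y) / P(Y)."""
    return p_xy / p_y

print(round(coverage(0.40, 0.7), 2))  # A -> D: 0.57
```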

2.7. Correlation
Correlation is a statistical technique that shows whether, and how strongly, pairs of variables/itemsets are related.


corr(X → Y) = (P(X and Y) − P(X) · P(Y)) / sqrt( P(X) · P(Y) · (1 − P(X)) · (1 − P(Y)) )

Correlation is a bivariate measure of the strength of the association between two variables/itemsets. It varies from -1 (perfect negative linear relationship) through 0 (no relationship) to +1 (perfect positive linear relationship). To the extent that the relationship between the two variables is nonlinear, correlation will understate it. [4]
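A minimal sketch of the correlation measure; the check reproduces the A→D entry of Table 3:

```python
import math

def correlation(p_xy, p_x, p_y):
    """corr(X -> Y) = (P(X and Y) - P(X)P(Y)) / sqrt(P(X)P(Y)(1-P(X))(1-P(Y)))."""
    return (p_xy - p_x * p_y) / math.sqrt(p_x * p_y * (1 - p_x) * (1 - p_y))

print(round(correlation(0.40, 0.6, 0.7), 2))  # A -> D: -0.09
```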

2.8. Odds Ratio

The odds ratio is a statistical measure defined as the ratio of the odds of an event occurring in one group to the odds of it occurring in another group, or a data-based estimate of that ratio. [3]

odds(X → Y) = ( P(X and Y) · P(not X and not Y) ) / ( P(X and not Y) · P(not X and Y) )

Its values are in the range [0; +∞]: if antecedent and consequent are independent it is equal to 1, and for implications that hold in all cases its value is +∞.

3. Comparison of Measures

This section compares all the measures discussed above. We have chosen a data set on which we run the Apriori algorithm to find the frequent itemsets. All the measures are applied to each rule derived from the frequent itemsets, and at the end recommendations are given on how to choose measures for deciding which rules are interesting.

3.1. Sample Data

The sample data for the analysis is taken from a store database of customer transactions. There are six different items and a total of ten transactions. In each transaction a 1 represents the presence of an item and a 0 represents its absence from the market basket.

Table 1: Sample Transactions

TID     A    B    C    D    E    F
1       1    1    0    1    0    1
2       1    0    1    1    0    1
3       1    0    1    1    0    1
4       0    1    1    1    0    0
5       0    1    0    1    1    0
6       1    0    0    0    1    1
7       1    0    1    0    1    1
8       0    0    1    0    0    0
9       0    1    1    1    0    0
10      1    1    0    1    1    0
Total   6    5    6    7    4    5
P(X)    0.6  0.5  0.6  0.7  0.4  0.5
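The transactions of Table 1 can be encoded directly as Python sets, which also allows the odds ratio to be checked against the A→D entry of Table 3; a sketch in which the 2x2 cell probabilities are derived from P(X), P(Y) and P(X and Y):

```python
import math

# Table 1 encoded as itemsets (TIDs 1 to 10).
transactions = [
    {"A", "B", "D", "F"}, {"A", "C", "D", "F"}, {"A", "C", "D", "F"},
    {"B", "C", "D"},      {"B", "D", "E"},      {"A", "E", "F"},
    {"A", "C", "E", "F"}, {"C"},                {"B", "C", "D"},
    {"A", "B", "D", "E"},
]

def p(itemset):
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def odds_ratio(p_xy, p_x, p_y):
    """odds(X -> Y) = P(X and Y) * P(not X and not Y) / (P(X and not Y) * P(not X and Y))."""
    numerator = p_xy * (1 - p_x - p_y + p_xy)     # P(X, Y) * P(not X, not Y)
    denominator = (p_x - p_xy) * (p_y - p_xy)     # P(X, not Y) * P(not X, Y)
    return math.inf if denominator == 0 else numerator / denominator

p_a, p_d, p_ad = p({"A"}), p({"D"}), p({"A", "D"})
print(round(odds_ratio(p_ad, p_a, p_d), 2))       # A -> D: 0.67 (cf. Table 3)
```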


3.2. Generating Frequent Itemsets

The frequent itemsets generated from the sample data using the Apriori algorithm are shown in the following table; the minimum support used for their generation is 40%.

Table 2: Frequent Itemsets

Itemset   Support
{A,D}     40%
{A,F}     50%
{B,D}     50%
{C,D}     40%
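A minimal level-wise Apriori sketch over the transactions of Table 1; with a minimum support of 40% it reproduces the four frequent pairs of Table 2. The candidate generation is simplified (joining frequent itemsets of the current level) rather than the prefix-based join of the original algorithm:

```python
from itertools import combinations

transactions = [
    {"A", "B", "D", "F"}, {"A", "C", "D", "F"}, {"A", "C", "D", "F"},
    {"B", "C", "D"},      {"B", "D", "E"},      {"A", "E", "F"},
    {"A", "C", "E", "F"}, {"C"},                {"B", "C", "D"},
    {"A", "B", "D", "E"},
]
MINSUP = 0.4

def support(itemset):
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

items = sorted(set().union(*transactions))
level = [frozenset({i}) for i in items if support({i}) >= MINSUP]
frequent = list(level)
while level:
    # Join step: unions of current-level itemsets that are exactly one item larger.
    candidates = {a | b for a, b in combinations(level, 2) if len(a | b) == len(a) + 1}
    # Prune step (downward closure): every (k-1)-subset must itself be frequent.
    candidates = {c for c in candidates if all(c - {i} in level for i in c)}
    level = [c for c in candidates if support(c) >= MINSUP]
    frequent += level

for pair in sorted((f for f in frequent if len(f) == 2), key=sorted):
    print(sorted(pair), f"{support(pair):.0%}")
# ['A', 'D'] 40%, ['A', 'F'] 50%, ['B', 'D'] 50%, ['C', 'D'] 40%, matching Table 2
```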

3.3. Calculations

All the measures discussed in the first section are calculated for each rule derived from the frequent itemsets output by the Apriori algorithm. The results are shown in Table 3.

Table 3: Calculation of the Different Measures on the Sample Data Set

Rule   Support  Confidence  Conviction  Lift  Leverage  Coverage  Correlation  Odds Ratio
A→D    0.40     0.67        0.50        0.95  -0.02     0.57      -0.09        0.67
D→A    0.40     0.57        0.83        0.95  -0.02     0.67      -0.09        0.67
A→F    0.50     0.83        1.00        1.67   0.20     1.00       0.82        +∞
F→A    0.50     1.00        0.80        1.67   0.20     0.83       0.82        +∞
B→D    0.50     1.00        0.60        1.43   0.15     0.71       0.65        +∞
D→B    0.50     0.71        1.00        1.43   0.15     1.00       0.65        +∞
C→D    0.40     0.67        0.50        0.95  -0.02     0.57      -0.09        0.67
D→C    0.40     0.57        0.67        0.95  -0.02     0.67      -0.09        0.67
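The columns of Table 3 can be recomputed from the encoded transactions in the same way. A sketch covering the support, confidence, lift, leverage, coverage, and correlation columns (conviction and the odds ratio were sketched in Sections 2.3 and 2.8):

```python
import math

transactions = [
    {"A", "B", "D", "F"}, {"A", "C", "D", "F"}, {"A", "C", "D", "F"},
    {"B", "C", "D"},      {"B", "D", "E"},      {"A", "E", "F"},
    {"A", "C", "E", "F"}, {"C"},                {"B", "C", "D"},
    {"A", "B", "D", "E"},
]

def p(itemset):
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

rules = [("A", "D"), ("D", "A"), ("A", "F"), ("F", "A"),
         ("B", "D"), ("D", "B"), ("C", "D"), ("D", "C")]

for x, y in rules:
    p_x, p_y, p_xy = p({x}), p({y}), p({x, y})
    corr = (p_xy - p_x * p_y) / math.sqrt(p_x * p_y * (1 - p_x) * (1 - p_y))
    print(f"{x}->{y}: sup={p_xy:.2f} conf={p_xy / p_x:.2f} "
          f"lift={p_xy / (p_x * p_y):.2f} lev={p_xy - p_x * p_y:.2f} "
          f"cov={p_xy / p_y:.2f} corr={corr:.2f}")
```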

Table 4: Subset of Rules and Measures from Table 3

Rule   Confidence  Correlation  Odds Ratio
A→F    0.83        0.82         +∞
F→A    1.00        0.82         +∞
B→D    1.00        0.65         +∞
D→B    0.71        0.65         +∞

The table shown above contains a subset of the measures and rules taken from Table 3. The Odds Ratio alone suggests that all four rules are interesting, but looking at the Correlation together with the Odds Ratio shows that A and F are more strongly related than B and D. Combining a third measure, Confidence, with these two reveals that the rule F→A is the most interesting one.

4. Conclusion

No single measure alone can determine the interestingness of a rule; a combination of different measures has to be examined in order to find the rules that are really interesting. The measures fall into two groups: symmetric and asymmetric. Looking only at a symmetric measure such as the Odds Ratio, we would conclude that A→F and F→A are equally interesting; the Confidence value of F→A, however, suggests that it is more interesting than A→F. Hence we cannot judge from a symmetric measure alone: an asymmetric measure must also be consulted to determine the interestingness of rule pairs such as A→B and B→A.

5. References

[1] http://wwwai.wu-wien.ac.at/~hahsler/research/association_rules/measures.html
[2] www.cise.ufl.edu/class/cis6930fa03dm/notes/dm4part2.pdf
[3] http://en.wikipedia.org/wiki/Odds-ratio
[4] http://www2.chass.ncsu.edu/garson/pa765/correl.htm
[5] Przemysław Sołdacki. Discovering Interesting Rules from Financial Data. Institute of Computer Science, Warsaw University of Technology, ul. Andersa 13, 00-159 Warszawa.
[6] Edward R. Omiecinski. Alternative Interest Measures for Mining Associations in Databases. IEEE Transactions on Knowledge and Data Engineering.


