Association Rule Mining: The Apriori Algorithm and the FP-Growth Method
Course Instructor: Prof. Anita Wasilewska State University of New York, Stony Brook
References

- Data Mining: Concepts and Techniques by Jiawei Han and Micheline Kamber.
- Presentation slides of Prof. Anita Wasilewska.
- Presentation slides of the course book.
- "An Effective Hash-Based Algorithm for Mining Association Rules" (Apriori algorithm) by J. S. Park, M. S. Chen, and P. S. Yu, SIGMOD Conference, 1995.
- "Mining Frequent Patterns without Candidate Generation" (FP-tree method) by J. Han, J. Pei, Y. Yin, and R. Mao, SIGMOD Conference, 2000.
Overview
- Basic Concepts of Association Rule Mining
- The Apriori Algorithm (mining single-dimensional Boolean association rules)
- Methods to Improve Apriori's Efficiency
- Frequent-Pattern Growth (FP-Growth) Method
- From Association Analysis to Correlation Analysis
- Summary
Simple Formulas:

confidence(A ⇒ B) = (# tuples containing both A and B) / (# tuples containing A) = P(B|A) = P(A ∪ B) / P(A)

support(A ⇒ B) = (# tuples containing both A and B) / (total # of tuples) = P(A ∪ B)

Here, following the course book's notation, P(A ∪ B) denotes the probability that a transaction contains both A and B.

What do they actually mean? Find all rules X ∧ Y ⇒ Z with minimum support and confidence:
- support, s: probability that a transaction contains {X, Y, Z}
- confidence, c: conditional probability that a transaction containing {X, Y} also contains Z
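As a concrete illustration (not from the slides; the function and variable names are mine), here is a minimal Python sketch that computes the support and confidence of a rule A ⇒ B over a list of transactions given as sets:

    def support(transactions, itemset):
        """Fraction of transactions containing every item in itemset."""
        itemset = set(itemset)
        return sum(itemset <= t for t in transactions) / len(transactions)

    def confidence(transactions, antecedent, consequent):
        """P(consequent | antecedent) = support(A and B) / support(A)."""
        both = set(antecedent) | set(consequent)
        return support(transactions, both) / support(transactions, antecedent)

    # Toy example: does {X, Y} => {Z} hold with enough support/confidence?
    transactions = [{"X", "Y", "Z"}, {"X", "Y"}, {"X", "Z"}, {"Y", "Z"}]
    print(support(transactions, {"X", "Y", "Z"}))       # 0.25
    print(confidence(transactions, {"X", "Y"}, {"Z"}))  # 0.5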
Let minimum support be 50% and minimum confidence be 50%. Then, for the example transaction database, we have: A ⇒ C (support 50%, confidence 66.6%) and C ⇒ A (support 50%, confidence 100%).
The Apriori Algorithm
Apriori property: all nonempty subsets of a frequent itemset must also be frequent; i.e., if {A, B} is a frequent itemset, both {A} and {B} must also be frequent itemsets.

The algorithm iteratively finds frequent itemsets with cardinality from 1 to k (k-itemsets).
Pseudo-code:

    Ck : candidate itemsets of size k
    Lk : frequent itemsets of size k

    L1 = {frequent 1-itemsets};
    for (k = 1; Lk != ∅; k++) do begin
        Ck+1 = candidates generated from Lk;
        for each transaction t in the database do
            increment the count of all candidates in Ck+1 that are contained in t;
        Lk+1 = candidates in Ck+1 with min_support;
    end
    return ∪k Lk;
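To make the loop concrete, below is a small self-contained Python sketch of the level-wise search, assuming transactions are sets of items and min_sup is an absolute support count; all names (apriori, etc.) are illustrative, not from the slides. Candidates are formed by unioning pairs of frequent k-itemsets and pruned via the Apriori property.

    from itertools import combinations

    def apriori(transactions, min_sup):
        """Level-wise search: returns {itemset (frozenset): support count}."""
        # L1: frequent 1-itemsets.
        counts = {}
        for t in transactions:
            for item in t:
                key = frozenset([item])
                counts[key] = counts.get(key, 0) + 1
        Lk = {s: c for s, c in counts.items() if c >= min_sup}
        frequent = dict(Lk)
        k = 1
        while Lk:
            # Join: merge pairs of frequent k-itemsets into (k+1)-candidates;
            # Prune: keep a candidate only if all its k-subsets are frequent.
            candidates = set()
            items = list(Lk)
            for i in range(len(items)):
                for j in range(i + 1, len(items)):
                    union = items[i] | items[j]
                    if len(union) == k + 1 and all(
                        frozenset(sub) in Lk for sub in combinations(union, k)
                    ):
                        candidates.add(union)
            # Scan the database and count each surviving candidate.
            counts = {c: 0 for c in candidates}
            for t in transactions:
                for c in candidates:
                    if c <= t:
                        counts[c] += 1
            Lk = {s: c for s, c in counts.items() if c >= min_sup}
            frequent.update(Lk)
            k += 1
        return frequent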
Transaction database D (TID : list of item IDs):

    T100    I1, I2, I5
    T200    I2, I4
    T300    I2, I3
    T400    I1, I2, I4
    T500    I1, I3
    T600    I2, I3
    T700    I1, I3
    T800    I1, I2, I3, I5
    T900    I1, I2, I3
Consider the database D above, consisting of 9 transactions. Suppose the minimum support count required is 2 (i.e., min_sup = 2/9 ≈ 22%) and the minimum confidence required is 70%. We first find the frequent itemsets using the Apriori algorithm; then association rules are generated using min. support and min. confidence.
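Under those settings, the step-by-step results below can be reproduced with the earlier apriori sketch (a usage illustration with the same hypothetical names):

    D = [{"I1","I2","I5"}, {"I2","I4"}, {"I2","I3"},
         {"I1","I2","I4"}, {"I1","I3"}, {"I2","I3"},
         {"I1","I3"}, {"I1","I2","I3","I5"}, {"I1","I2","I3"}]
    freq = apriori(D, min_sup=2)                 # min. support count = 2
    print(freq[frozenset({"I1", "I2", "I5"})])   # 2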
In the first iteration of the algorithm, each item is a member of the set C1 of candidate 1-itemsets. The set of frequent 1-itemsets, L1, consists of the candidate 1-itemsets satisfying minimum support; here every candidate qualifies, so L1 = C1.

    C1 = L1:
    Itemset    Sup. Count
    {I1}       6
    {I2}       7
    {I3}       6
    {I4}       2
    {I5}       2
To discover the set of frequent 2-itemsets, L2, the algorithm generates the candidate set C2 = L1 Join L1, then scans D for the count of each candidate:

    C2:
    Itemset     Sup. Count
    {I1, I2}    4
    {I1, I3}    4
    {I1, I4}    1
    {I1, I5}    2
    {I2, I3}    4
    {I2, I4}    2
    {I2, I5}    2
    {I3, I4}    0
    {I3, I5}    1
    {I4, I5}    0

The candidates satisfying the minimum support count of 2 form L2:

    L2:
    Itemset     Sup. Count
    {I1, I2}    4
    {I1, I3}    4
    {I1, I5}    2
    {I2, I3}    4
    {I2, I4}    2
    {I2, I5}    2
The generation of the set of candidate 3-itemsets, C3, involves use of the Apriori property. In order to find C3, we compute L2 Join L2:

C3 = L2 Join L2 = {{I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I3, I5}, {I2, I4, I5}}.

Now the Join step is complete, and the Prune step is used to reduce the size of C3; the Prune step helps to avoid heavy computation due to a large Ck. Any candidate with a 2-item subset that is not in L2 is removed: this eliminates {I1, I3, I5}, {I2, I3, I4}, {I2, I3, I5}, and {I2, I4, I5} (e.g., {I3, I5} is not in L2), leaving C3 = {{I1, I2, I3}, {I1, I2, I5}}. Scanning D for the counts of these candidates gives:

    C3:
    Itemset          Sup. Count
    {I1, I2, I3}     2
    {I1, I2, I5}     2

Both satisfy minimum support, so:

    L3:
    Itemset          Sup. Count
    {I1, I2, I3}     2
    {I1, I2, I5}     2
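The Join and Prune steps can also be written as a standalone candidate-generation routine, in the classic formulation that joins two sorted k-itemsets only when their first k−1 items agree. A hedged Python sketch with illustrative names:

    from itertools import combinations

    def apriori_gen(Lk, k):
        """Generate (k+1)-candidates from the frequent k-itemsets Lk."""
        # Join: merge itemsets whose first k-1 items (in sorted order) agree.
        sorted_sets = sorted(tuple(sorted(s)) for s in Lk)
        candidates = set()
        for i in range(len(sorted_sets)):
            for j in range(i + 1, len(sorted_sets)):
                a, b = sorted_sets[i], sorted_sets[j]
                if a[:k - 1] == b[:k - 1]:
                    candidates.add(frozenset(a) | frozenset(b))
        # Prune: drop candidates with an infrequent k-subset (Apriori property).
        Lk = {frozenset(s) for s in Lk}
        return {c for c in candidates
                if all(frozenset(sub) in Lk for sub in combinations(c, k))}

    L2 = [{"I1","I2"}, {"I1","I3"}, {"I1","I5"},
          {"I2","I3"}, {"I2","I4"}, {"I2","I5"}]
    print(apriori_gen(L2, 2))   # only {I1,I2,I3} and {I1,I2,I5} survive

On the example's L2 this reproduces exactly the pruned C3 above: the join produces all six 3-itemsets listed, and the prune step discards the four containing an infrequent 2-subset.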
Back To Example:

We had L = {{I1}, {I2}, {I3}, {I4}, {I5}, {I1,I2}, {I1,I3}, {I1,I5}, {I2,I3}, {I2,I4}, {I2,I5}, {I1,I2,I3}, {I1,I2,I5}}.

Let's take l = {I1,I2,I5}. All its nonempty proper subsets are {I1,I2}, {I1,I5}, {I2,I5}, {I1}, {I2}, {I5}. For every nonempty subset s of l, we output the rule s ⇒ (l − s) if its confidence meets the minimum confidence threshold.
Let the minimum confidence threshold be, say, 70%. The resulting association rules are shown below, each listed with its confidence.

R1: I1 ∧ I2 ⇒ I5, confidence = sc{I1,I2,I5} / sc{I1,I2} = 2/4 = 50%. R1 is rejected.
R2: I1 ∧ I5 ⇒ I2, confidence = sc{I1,I2,I5} / sc{I1,I5} = 2/2 = 100%. R2 is selected.
R3: I2 ∧ I5 ⇒ I1, confidence = sc{I1,I2,I5} / sc{I2,I5} = 2/2 = 100%. R3 is selected.
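A minimal Python sketch of this rule-generation step (names are illustrative; sc is the support-count table such as the freq dict produced by the earlier apriori sketch):

    from itertools import combinations

    def gen_rules(l, sc, min_conf):
        """Emit rules s => l - s from frequent itemset l meeting min_conf."""
        l = frozenset(l)
        rules = []
        for r in range(1, len(l)):                  # all nonempty proper subsets
            for s in map(frozenset, combinations(l, r)):
                conf = sc[l] / sc[s]                # confidence = sc(l) / sc(s)
                if conf >= min_conf:
                    rules.append((set(s), set(l - s), conf))
        return rules

    sc = {frozenset(x): c for x, c in [
        ({"I1"}, 6), ({"I2"}, 7), ({"I5"}, 2),
        ({"I1","I2"}, 4), ({"I1","I5"}, 2), ({"I2","I5"}, 2),
        ({"I1","I2","I5"}, 2)]}
    for a, b, conf in gen_rules({"I1","I2","I5"}, sc, 0.7):
        print(a, "=>", b, f"{conf:.0%}")

Besides R2 and R3, this also emits I5 ⇒ I1 ∧ I2 (confidence 2/2 = 100%), the remaining rule among the six subsets that meets the threshold; the single-item antecedents I1 and I2 yield confidences 2/6 and 2/7 and are rejected.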
Methods to Improve Apriori's Efficiency
Frequent-Pattern Growth (FP-Growth) Method
Consider the same database, D, of 9 transactions from the previous example, and suppose again that the minimum support count required is 2 (i.e., min_sup = 2/9 ≈ 22%). The first scan of the database is the same as in Apriori: it derives the set of frequent 1-itemsets and their support counts. The set of frequent items is then sorted in descending order of support count; the resulting set is denoted L = {I2: 7, I1: 6, I3: 6, I4: 2, I5: 2}.
[FP-tree built from D by inserting each transaction with its items reordered according to L: the root has two branches, I2:7 and I1:2. Node I2:7 has children I1:4, I4:1, and I3:2; node I1:4 has children I5:1, I4:1, and I3:2, and that I3:2 node has a child I5:1; the I1:2 branch has a single child I3:2. A header table links all nodes carrying the same item: I2, I1, I3, I4, I5.]
Mining the FP-tree then proceeds as follows (see the sketch after this list):
1. Start from each frequent length-1 pattern (as an initial suffix pattern).
2. Construct its conditional pattern base, which consists of the set of prefix paths in the FP-tree co-occurring with the suffix pattern.
3. Construct its conditional FP-tree and perform mining recursively on that tree.
4. The pattern growth is achieved by concatenating the suffix pattern with the frequent patterns generated from the conditional FP-tree.
5. The union of all the frequent patterns (generated by step 4) gives the required frequent itemsets.
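The following compact Python sketch implements these steps (tree construction plus recursive mining over conditional pattern bases). Every name is illustrative, and this is a minimal pedagogical version under the stated assumptions, not the authors' implementation:

    from collections import defaultdict

    class Node:
        """One FP-tree node: an item, its count, parent link, and children."""
        def __init__(self, item, parent):
            self.item, self.parent = item, parent
            self.count, self.children = 0, {}

    def build_tree(patterns, min_sup):
        """Build an FP-tree from {pattern tuple: count}; return (header table, item supports)."""
        sup = defaultdict(int)
        for pat, cnt in patterns.items():
            for item in pat:
                sup[item] += cnt
        freq = {i: s for i, s in sup.items() if s >= min_sup}
        root, header = Node(None, None), defaultdict(list)
        for pat, cnt in patterns.items():
            # Keep only frequent items, ordered by descending support count.
            node = root
            for item in sorted((i for i in pat if i in freq),
                               key=lambda i: (-freq[i], i)):
                if item not in node.children:
                    node.children[item] = Node(item, node)
                    header[item].append(node.children[item])
                node = node.children[item]
                node.count += cnt
        return header, freq

    def fp_growth(patterns, min_sup, suffix=()):
        """Recursively mine {pattern tuple: count}; yield (itemset, support)."""
        header, freq = build_tree(patterns, min_sup)
        for item in freq:
            new_suffix = (item,) + suffix
            yield new_suffix, freq[item]
            # Conditional pattern base: prefix paths above each node of `item`.
            cond = defaultdict(int)
            for node in header[item]:
                path, p = [], node.parent
                while p.item is not None:
                    path.append(p.item)
                    p = p.parent
                if path:
                    cond[tuple(reversed(path))] += node.count
            yield from fp_growth(cond, min_sup, new_suffix)

    D = [("I1","I2","I5"), ("I2","I4"), ("I2","I3"), ("I1","I2","I4"),
         ("I1","I3"), ("I2","I3"), ("I1","I3"), ("I1","I2","I3","I5"),
         ("I1","I2","I3")]
    patterns = defaultdict(int)
    for t in D:
        patterns[t] += 1
    for itemset, sup in sorted(fp_growth(patterns, 2)):
        print(set(itemset), sup)

Running it on the 9-transaction database prints each frequent itemset with its support count, matching the conditional-pattern-base table below.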
Mining the FP-tree by creating conditional (sub-)pattern bases:

    Item   Conditional Pattern Base           Conditional FP-tree        Frequent Patterns Generated
    I5     {(I2 I1: 1), (I2 I1 I3: 1)}        <I2: 2, I1: 2>             I2 I5: 2, I1 I5: 2, I2 I1 I5: 2
    I4     {(I2 I1: 1), (I2: 1)}              <I2: 2>                    I2 I4: 2
    I3     {(I2 I1: 2), (I2: 2), (I1: 2)}     <I2: 4, I1: 2>, <I1: 2>    I2 I3: 4, I1 I3: 4, I2 I1 I3: 2
    I1     {(I2: 4)}                          <I2: 4>                    I2 I1: 4
Following the steps above, let's start from I5. I5 is involved in 2 branches of the FP-tree, {I2 I1 I5: 1} and {I2 I1 I3 I5: 1}. Therefore, considering I5 as the suffix, its 2 corresponding prefix paths are {I2 I1: 1} and {I2 I1 I3: 1}, which form its conditional pattern base.
From Association Analysis to Correlation Analysis
Correlation Concepts
Two itemsets A and B are independent (the occurrence of A is independent of the occurrence of itemset B) if and only if

    P(A ∪ B) = P(A) · P(B)

Otherwise, A and B are dependent and correlated. The measure of correlation between A and B is given by the formula:

    corr(A,B) = P(A ∪ B) / (P(A) · P(B))

where, as before, P(A ∪ B) denotes the probability that a transaction contains both A and B.
Support(A ⇒ B) = P(A ∪ B) and Confidence(A ⇒ B) = P(B|A). That means that Confidence(A ⇒ B) = corr(A,B) · P(B).

So correlation, support, and confidence are all different, but the correlation provides extra information about the association rule A ⇒ B.

We say that the correlation corr(A,B) provides the LIFT of the association rule A ⇒ B; i.e., A is said to increase (or LIFT) the likelihood of B by the factor returned by the formula for corr(A,B).
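As an illustration (the code and names are mine, not the slides'), the lift of I1 ⇒ I3 on the example database D works out to P(I1 ∪ I3) / (P(I1) · P(I3)) = (4/9) / ((6/9) · (6/9)) = 1, i.e., I1 and I3 are uncorrelated:

    def lift(transactions, A, B):
        """corr(A,B) = P(A and B) / (P(A) * P(B)); > 1 means positive correlation."""
        n = len(transactions)
        p_ab = sum((A | B) <= t for t in transactions) / n
        p_a = sum(A <= t for t in transactions) / n
        p_b = sum(B <= t for t in transactions) / n
        return p_ab / (p_a * p_b)

    D = [{"I1","I2","I5"}, {"I2","I4"}, {"I2","I3"}, {"I1","I2","I4"},
         {"I1","I3"}, {"I2","I3"}, {"I1","I3"}, {"I1","I2","I3","I5"},
         {"I1","I2","I3"}]
    print(lift(D, {"I1"}, {"I3"}))   # ≈ 1.0: I1 and I3 occur independently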
Correlation Rules
A correlation rule is a set of items {i1, i2, ..., in} where the items' occurrences are correlated. The correlation value is given by the correlation formula, and we use the χ² (chi-square) test to determine whether the correlation is statistically significant. The χ² test can also detect negative correlation. We can also form minimal correlated itemsets, etc.

Limitations: the χ² test is less accurate on data tables that are sparse, and it can be misleading for contingency tables larger than 2×2.
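For concreteness, a manual χ² computation on the 2×2 contingency table of two itemsets' occurrences (the helper name and layout are mine, not from the slides); the statistic is Σ (observed − expected)² / expected over the four cells:

    def chi_square_2x2(transactions, A, B):
        """Chi-square statistic for the 2x2 contingency table of A vs. B."""
        n = len(transactions)
        a = sum(A <= t for t in transactions)   # transactions containing A
        b = sum(B <= t for t in transactions)   # transactions containing B
        observed = [
            [sum(A <= t and B <= t for t in transactions),        # A and B
             sum(A <= t and not B <= t for t in transactions)],   # A, not B
            [sum(not A <= t and B <= t for t in transactions),    # not A, B
             sum(not A <= t and not B <= t for t in transactions)],
        ]
        row, col = [a, n - a], [b, n - b]
        return sum(
            (observed[i][j] - row[i] * col[j] / n) ** 2 / (row[i] * col[j] / n)
            for i in range(2) for j in range(2)
        )

    D = [{"I1","I2","I5"}, {"I2","I4"}, {"I2","I3"}, {"I1","I2","I4"},
         {"I1","I3"}, {"I2","I3"}, {"I1","I3"}, {"I1","I2","I3","I5"},
         {"I1","I2","I3"}]
    print(chi_square_2x2(D, {"I1"}, {"I3"}))   # 0.0: no correlation, consistent with lift ≈ 1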
Summary
Association Rule Mining:
- Finding interesting association or correlation relationships.
- Association rules are generated from frequent itemsets.
- Frequent itemsets are mined using the Apriori algorithm or the Frequent-Pattern Growth method.
- The Apriori property states that all the subsets of a frequent itemset must also be frequent.
- The Apriori algorithm uses frequent itemsets, the Join & Prune steps, and the Apriori property to derive strong association rules.
- The Frequent-Pattern Growth method avoids the repeated database scans of the Apriori algorithm.
- The FP-Growth method is faster than the Apriori algorithm.
- Correlation concepts & rules can be used to further support our derived association rules.
Questions?