Data Mining I
Summer semester 2017
Introduction
Basic concepts
Frequent Itemsets Mining (FIM) Apriori
Association Rules Mining
Apriori improvements
Closed frequent itemsets (CFI) & Maximal frequent itemsets (MFI)
Homework/tutorial
Things you should know from this lecture
Applications:
Improving store layout, Sales campaigns, Cross-marketing, Advertising
Introduction
Basic concepts
Frequent Itemsets Mining (FIM) Apriori
Association Rules Mining
Apriori improvements
Closed frequent itemsets (CFI) & Maximal frequent itemsets (MFI)
Homework/tutorial
Things you should know from this lecture
Let X be an itemset.

Example transaction database DB:
Tid  Transaction items
1    Butter, Bread, Milk, Sugar
2    Butter, Flour, Milk, Sugar
3    Butter, Eggs, Milk, Salt
4    Eggs
5    Butter, Flour, Milk, Salt, Sugar

Itemset cover: the set of transactions containing X:
cover(X) = {tid | (tid, XT) ∈ DB, X ⊆ XT}

(absolute) Support / support count of X: the number of transactions containing X:
supportCount(X) = |cover(X)|

(relative) Support of X: the fraction of transactions containing X (or the probability that a transaction contains X):
support(X) = supportCount(X) / |DB|

Frequent itemset: an itemset X is frequent in DB if its support is no less than a minSupport threshold s:
support(X) ≥ s

Lk: the set of frequent k-itemsets
(L comes from "Large" (large itemsets), another term for frequent itemsets)
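To make these definitions concrete, here is a minimal Python sketch (not part of the slides; the helper names cover, support_count and support simply mirror the notation above) that computes the three quantities for the example database:

# Transaction database from the table above, keyed by tid
DB = {
    1: {"Butter", "Bread", "Milk", "Sugar"},
    2: {"Butter", "Flour", "Milk", "Sugar"},
    3: {"Butter", "Eggs", "Milk", "Salt"},
    4: {"Eggs"},
    5: {"Butter", "Flour", "Milk", "Salt", "Sugar"},
}

def cover(X, db):
    """Set of transaction ids whose items are a superset of X."""
    return {tid for tid, items in db.items() if X <= items}

def support_count(X, db):
    """Absolute support: number of transactions containing X."""
    return len(cover(X, db))

def support(X, db):
    """Relative support: fraction of transactions containing X."""
    return support_count(X, db) / len(db)

X = {"Butter", "Milk"}
print(cover(X, DB))          # {1, 2, 3, 5}
print(support_count(X, DB))  # 4
print(support(X, DB))        # 0.8 -> frequent for any minSupport <= 0.8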
Support s of a rule: the percentage of transactions containing X ∪ Y in the DB, or the probability P(X ∪ Y):
support(X ⇒ Y) = P(X ∪ Y) = support(X ∪ Y)

Confidence c of a rule: the percentage of transactions containing X ∪ Y among the transactions containing X,
or, in other words, the conditional probability that a transaction containing X also contains Y:
confidence(X ⇒ Y) = P(Y | X) = support(X ∪ Y) / support(X) = supportCount(X ∪ Y) / supportCount(X)
Support and confidence are measures of rule interestingness. Rules are usually written as X ⇒ Y (support, confidence). Explain the rules:
{Diapers} ⇒ {Beer} (0.5%, 60%)
{Toast bread} ⇒ {Toast cheese} (50%, 90%)
transactionID  items
2000           A, B, C
1000           A, C
4000           A, D
5000           B, E, F

Association rules:
A ⇒ C (support = 50%, confidence = 66.6%)
C ⇒ A (support = 50%, confidence = 100%)
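These numbers can be checked with a few lines of Python (an illustrative sketch; the helper rule_stats is a made-up name):

DB = {2000: {"A", "B", "C"}, 1000: {"A", "C"}, 4000: {"A", "D"}, 5000: {"B", "E", "F"}}

def support_count(X):
    """Number of transactions containing all items of X."""
    return sum(1 for items in DB.values() if X <= items)

def rule_stats(X, Y):
    """Return (support, confidence) of the rule X => Y."""
    sup = support_count(X | Y) / len(DB)
    conf = support_count(X | Y) / support_count(X)
    return sup, conf

print(rule_stats({"A"}, {"C"}))  # (0.5, 0.666...) -> A => C (50%, 66.6%)
print(rule_stats({"C"}, {"A"}))  # (0.5, 1.0)      -> C => A (50%, 100%)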
Problem 2 (ARM): Find all association rules X ⇒ Y in DB w.r.t. a minimum support s and a minimum confidence c, i.e.:
{X ⇒ Y | support(X ∪ Y) ≥ s, confidence(X ⇒ Y) ≥ c, X, Y ⊆ I and X ∩ Y = ∅}
The number of itemsets can be really huge. Let us consider a small set of items: I = {A, B, C, D}
# 1-itemsets: (4 choose 1) = 4!/(3!·1!) = 4:  A, B, C, D
# 2-itemsets: (4 choose 2) = 4!/(2!·2!) = 6:  AB, AC, AD, BC, BD, CD
# 3-itemsets: (4 choose 3) = 4!/(1!·3!) = 4:  ABC, ABD, ACD, BCD
# 4-itemsets: (4 choose 4) = 4!/(0!·4!) = 1:  ABCD
In general, there are 2^|I| - 1 possible non-empty itemsets.
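A quick sanity check of these counts using Python's standard library (illustrative, not from the slides):

from math import comb  # binomial coefficient (Python >= 3.8)

n = 4                                      # |I| = |{A, B, C, D}|
for k in range(1, n + 1):
    print(f"# {k}-itemsets:", comb(n, k))  # 4, 6, 4, 1
print("non-empty itemsets:", 2 ** n - 1)   # 15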
Introduction
Basic concepts
Frequent Itemsets Mining (FIM) Apriori
Association Rules Mining
Apriori improvements
Closed frequent itemsets (CFI) & Maximal frequent itemsets (MFI)
Homework/tutorial
Things you should know from this lecture
Method overview:
Initially, scan the DB once to find the frequent 1-itemsets.
Then, level by level, generate candidate (k+1)-itemsets from the frequent k-itemsets and test them against the DB.
Apriori property (downward closure): every subset of a frequent itemset is frequent. Conversely, when X is not frequent, all its supersets are not frequent either and thus should not be generated/tested → this reduces the set of candidate itemsets.
e.g., with minSupport = 2: if {beer, diaper} has support 1, then support({beer, diaper, nuts}) ≤ 1, so {beer, diaper, nuts} cannot be frequent either.
Example (minSupport = 2):

Transaction database:
{Chips, Pizza}
{Beer, Chips}
{Chips, Pizza, Wine}
{Wine}

Itemset lattice with support counts (explored level by level, bottom-up):
{}: 4
{Beer}: 1   {Chips}: 3   {Pizza}: 2   {Wine}: 2
{Beer,Chips}: 1   {Beer,Pizza}: 0   {Beer,Wine}: 0   {Chips,Pizza}: 2   {Chips,Wine}: 1   {Pizza,Wine}: 1
{Beer,Chips,Pizza}: 0   {Beer,Chips,Wine}: 0   {Beer,Pizza,Wine}: 0   {Chips,Pizza,Wine}: 1
{Beer,Chips,Pizza,Wine}: 0

Pruned search space: since {Beer} is not frequent, none of its supersets has to be generated or counted; likewise, once {Chips,Wine} and {Pizza,Wine} turn out to be non-frequent, {Chips,Pizza,Wine} (and with it {Beer,Chips,Pizza,Wine}) is pruned.

Border itemsets X: all subsets Y ⊂ X are frequent, all supersets Z ⊃ X are not frequent.
Positive border: X is also frequent.
Negative border: X is not frequent.
(The border can be read off the lattice above, both for minSupport s = 2 and for minSupport s = 1.)
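For illustration, a brute-force Python sketch (my own, not from the slides) that reproduces the support counts of the lattice above by enumerating every itemset; the point of Apriori is precisely to avoid this exhaustive enumeration:

from itertools import combinations

# Example transaction database and threshold from the slides
DB = [
    {"Chips", "Pizza"},
    {"Beer", "Chips"},
    {"Chips", "Pizza", "Wine"},
    {"Wine"},
]
MIN_SUPPORT = 2
ITEMS = sorted(set().union(*DB))

def support_count(itemset):
    """Number of transactions containing every item of the itemset."""
    return sum(1 for t in DB if set(itemset) <= t)

# Enumerate the complete lattice (2^|I| itemsets), level by level
for k in range(len(ITEMS) + 1):
    print(k, {c: support_count(c) for c in combinations(ITEMS, k)})

# Frequent itemsets w.r.t. MIN_SUPPORT
frequent = [c for k in range(1, len(ITEMS) + 1)
            for c in combinations(ITEMS, k) if support_count(c) >= MIN_SUPPORT]
print(frequent)  # [('Chips',), ('Pizza',), ('Wine',), ('Chips', 'Pizza')]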
Candidate generation (join step): two frequent (k-1)-itemsets p, q are joined if they agree in their first (k-2) items.
Prune step (Apriori-based): e.g., acde is pruned since cde is not frequent.
Prune step (DB-based): Ck is a superset of Lk → check the support of the remaining candidates (e.g., abcd) in the DB; this prunes Ck and returns Lk.
Naïve idea: count the support for all candidate itemsets in Ck. But |Ck| might be large!
Use the Apriori property instead: a candidate k-itemset that has some non-frequent (k-1)-subset cannot be frequent.
Prune all those k-itemsets that have some (k-1)-subset that is not frequent (i.e., does not belong to Lk-1).
Due to the level-wise approach of Apriori, we only need to check the (k-1)-subsets.
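A Python sketch of this join-and-prune candidate generation (apriori_gen is the name used in the original paper; the implementation here is my own simplification):

from itertools import combinations

def apriori_gen(L_prev, k):
    """Generate candidate k-itemsets from the frequent (k-1)-itemsets L_prev.

    Join step:  merge two (k-1)-itemsets that agree in their first k-2 items.
    Prune step: drop candidates having a (k-1)-subset that is not in L_prev.
    """
    frequent = {frozenset(s) for s in L_prev}
    sorted_sets = sorted(sorted(s) for s in L_prev)
    candidates = set()
    for p, q in combinations(sorted_sets, 2):
        if p[:k - 2] == q[:k - 2]:                   # join step
            cand = frozenset(p) | frozenset(q)
            if len(cand) == k and all(
                frozenset(sub) in frequent           # Apriori-based prune step
                for sub in combinations(sorted(cand), k - 1)
            ):
                candidates.add(cand)
    return candidates

# Example from the slides: acde is pruned because cde is not frequent;
# abcd survives the prune and its support is then checked against the DB.
L3 = [{"a", "b", "c"}, {"a", "b", "d"}, {"a", "c", "d"},
      {"a", "c", "e"}, {"b", "c", "d"}]
print(apriori_gen(L3, 4))  # {frozenset({'a', 'b', 'c', 'd'})}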
L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;   (candidate generation: self-join + Apriori property)
    for each transaction t in the database do   (DB scan)
        increment the count of all candidates in Ck+1 that are contained in t;   (subset function)
    Lk+1 = candidates in Ck+1 with min_support;   (prune by support count, ask DB)
end
return ∪k Lk;
Subset function:
- The subset function must, for each transaction T in the DB, check for all candidates in the candidate set Ck whether they are part of the transaction T.
- To speed this up, the candidates Ck are organized in a hash tree.
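Putting the pseudocode together, here is a compact, self-contained Python sketch of the whole algorithm (my own rendering, added for illustration; it uses a simple per-transaction subset test instead of a hash tree, and a simplified candidate join):

from itertools import combinations
from collections import defaultdict

def apriori(db, min_support):
    """Return all frequent itemsets of db together with their support counts."""
    db = [frozenset(t) for t in db]

    # One DB scan to obtain L1
    counts = defaultdict(int)
    for t in db:
        for item in t:
            counts[frozenset([item])] += 1
    L = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(L)

    k = 2
    while L:
        # Candidate generation: simplified join (any pair whose union has k items),
        # followed by the Apriori-based prune step
        Ck = set()
        for p, q in combinations(list(L), 2):
            cand = p | q
            if len(cand) == k and all(
                frozenset(s) in L for s in combinations(cand, k - 1)
            ):
                Ck.add(cand)

        # DB scan + subset step: count candidates contained in each transaction
        # (a hash tree over Ck would make this containment test faster)
        counts = defaultdict(int)
        for t in db:
            for cand in Ck:
                if cand <= t:
                    counts[cand] += 1

        # Keep only candidates with sufficient support
        L = {s: c for s, c in counts.items() if c >= min_support}
        result.update(L)
        k += 1
    return result

db = [{"Chips", "Pizza"}, {"Beer", "Chips"}, {"Chips", "Pizza", "Wine"}, {"Wine"}]
print(apriori(db, min_support=2))
# {Chips}: 3, {Pizza}: 2, {Wine}: 2, {Chips, Pizza}: 2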
Advantages:
Uses the Apriori property to prune the candidate space
Easy to implement (also in parallel)
Disadvantages:
It requires up to |I| database scans
It assumes that the DB is in memory
Complexity depends on
minSupport threshold
Number of items (dimensionality)
Number of transactions
Average transaction length
Introduction
Basic concepts
Frequent Itemsets Mining (FIM) Apriori
Association Rules Mining
Apriori improvements
Closed frequent itemsets (CFI) & Maximal frequent itemsets (MFI)
Homework/tutorial
Things you should know from this lecture
No database access is needed for rule generation! We can decide whether a rule is strong using the support counts already computed during the FIM step.

Transaction database:
tid  XT
1    {Beer, Chips, Wine}
2    {Beer, Chips}
3    {Pizza, Wine}
4    {Chips, Pizza}

I = {Beer, Chips, Pizza, Wine}

Rule                Sup.  Freq.  Conf.
{Beer} ⇒ {Chips}    2     50 %   100 %
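For illustration, a Python sketch of this step (added here, not from the slides; generate_rules and its parameters are made-up names). It reads only the support counts produced by the FIM step, never the transactions themselves:

from itertools import combinations

def generate_rules(support_counts, n_transactions, min_conf):
    """Yield (X, Y, support, confidence) for every rule X => Y with conf >= min_conf.

    support_counts maps each frequent itemset (as a frozenset) to its absolute
    support count, exactly as computed during the FIM step.
    """
    for Z, z_count in support_counts.items():
        if len(Z) < 2:
            continue
        for r in range(1, len(Z)):
            for X in map(frozenset, combinations(Z, r)):
                Y = Z - X
                conf = z_count / support_counts[X]   # no database access
                if conf >= min_conf:
                    yield set(X), set(Y), z_count / n_transactions, conf

# Support counts of the frequent itemsets of the 4-transaction database above (minSupport = 2)
support_counts = {
    frozenset({"Beer"}): 2, frozenset({"Chips"}): 3,
    frozenset({"Pizza"}): 2, frozenset({"Wine"}): 2,
    frozenset({"Beer", "Chips"}): 2,
}
for rule in generate_rules(support_counts, n_transactions=4, min_conf=0.7):
    print(rule)
# ({'Beer'}, {'Chips'}, 0.5, 1.0)  i.e. {Beer} => {Chips} (50 %, 100 %);
# {Chips} => {Beer} has confidence 2/3 and is filtered out by min_conf = 0.7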
For a rule A ⇒ B:

Support: P(A ∪ B)
e.g., support(milk, bread, butter) = 20%, i.e., 20% of the transactions contain these items.

Confidence: P(A ∪ B) / P(A)
e.g., confidence(milk, bread ⇒ butter) = 50%, i.e., 50% of the times a customer buys milk and bread, butter is bought as well.

Lift: P(A ∪ B) / (P(A) · P(B))
e.g., lift(milk, bread ⇒ butter) = 20% / (40% · 40%) = 1.25: the observed support is 20%, while the expected support (if they were independent) would be 16%.
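The milk/bread/butter numbers can be reproduced with a couple of lines (a small sketch; the probabilities are taken directly from the example above):

p_ab = 0.20  # support({milk, bread, butter}) = P(A ∪ B)
p_a = 0.40   # support({milk, bread})         = P(A)
p_b = 0.40   # support({butter})              = P(B)

confidence = p_ab / p_a    # 0.5  -> 50 %
lift = p_ab / (p_a * p_b)  # 1.25 (expected support if independent: 0.4 * 0.4 = 0.16)
print(confidence, lift)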
Introduction
Basic concepts
Frequent Itemsets Mining (FIM) Apriori
Association Rules Mining
Apriori improvements
Closed frequent itemsets (CFI) & Maximal frequent itemsets (MFI)
Homework/tutorial
Things you should know from this lecture
Readings:
Tan P.-N., Steinbach M., Kumar V., Introduction to Data Mining, Chapter 6.
Han J., Kamber M., Pei J., Data Mining: Concepts and Techniques, 3rd ed., Morgan Kaufmann, 2011 (Chapter 6).
Apriori algorithm: Rakesh Agrawal and R. Srikant, Fast Algorithms for Mining Association Rules, VLDB '94.
Introduction
Basic concepts
Frequent Itemsets Mining (FIM) Apriori
Association Rules Mining
Apriori improvements
Closed frequent itemsets (CFI) & Maximal frequent itemsets (MFI)
Homework/tutorial
Things you should know from this lecture
Frequent Itemsets Mining: computation cost, negative border, downward closure property