
1

Fast Algorithms for Mining


Association Rules
Rakesh Agrawal
Ramakrishnan Srikant
Rully Soelaiman
Information Systems Study Program
Faculty of Information Technology, ITS
2
Outline
Introduction
Formal statement
Apriori Algorithm
AprioriTid Algorithm
Comparison
AprioriHybrid Algorithm
Conclusions

3
Introduction
Bar-code technology makes it possible to collect basket data.
Mining association rules over basket data (1993).
Example rule: tires ∧ auto accessories ⇒ automotive service.
Applications: cross-marketing, attached mailings.
Must work on very large databases.

4
Notation
Items: I = {i_1, i_2, ..., i_m}
A transaction T is a set of items, T ⊆ I
Items within an itemset are kept sorted lexicographically
TID: a unique identifier for each transaction
5
Notation
An association rule is an implication X ⇒ Y,
where X ⊆ I, Y ⊆ I, and X ∩ Y = ∅
6
Confidence and Support
Association rule X ⇒ Y has confidence c if
c% of the transactions in D that contain X also contain Y.
Association rule X ⇒ Y has support s if
s% of the transactions in D contain X ∪ Y (both X and Y).
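To make the two measures concrete, here is a minimal Python sketch (not from the slides) that counts support and confidence of a rule over a small transaction list; the transactions, items, and function names are illustrative assumptions.

# Minimal sketch: support and confidence of a rule X => Y over basket data.
def support(itemset, transactions):
    """Fraction of transactions containing every item of `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(x, y, transactions):
    """conf(X => Y) = support(X u Y) / support(X)."""
    return support(x | y, transactions) / support(x, transactions)

transactions = [{"tires", "accessories", "service"},
                {"tires", "accessories"},
                {"milk", "bread"},
                {"tires", "service"}]

x, y = {"tires", "accessories"}, {"service"}
print(support(x | y, transactions))    # 0.25
print(confidence(x, y, transactions))  # 0.5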

7
Notice
X ⇒ A does not imply X ∪ Y ⇒ A
(it may not have minimum support)
X ⇒ A and A ⇒ Z do not imply X ⇒ Z
(it may not have minimum confidence)


8
Define the Problem
Given a set of transactions D, generate
all association rules that have support
and confidence greater than the
user-specified minimum support and
minimum confidence.

9
Discovering all Association
Rules
Find all large itemsets:
itemsets with support above the minimum support.
Use the large itemsets to generate the rules.
10
General idea
Say ABCD and AB are large itemsets.
Compute conf = support(ABCD) / support(AB).
If conf >= minconf, the rule AB ⇒ CD holds.

11
Discovering Large Itemsets
Multiple passes over the data.
First pass: count the support of individual items.
Each subsequent pass:
Generate candidates using the previous pass's large itemsets.
Go over the data and check the actual support of the candidates.
Stop when no new large itemsets are found.
12
The Trick
Any subset of a large itemset is large.
Therefore, to find the large k-itemsets:
Create candidates by joining the large (k-1)-itemsets.
Delete those that contain any subset that is not large.

13
Algorithm Apriori

L_1 = {large 1-itemsets};                          // Count item occurrences
for (k = 2; L_{k-1} ≠ ∅; k++) do begin
    C_k = apriori-gen(L_{k-1});                    // Generate new candidate k-itemsets
    forall transactions t ∈ D do begin
        C_t = subset(C_k, t);                      // Candidates contained in t
        forall candidates c ∈ C_t do
            c.count++;                             // Find the support of all the candidates
    end
    L_k = {c ∈ C_k | c.count ≥ minsup};            // Take only those with support over minsup
end
Answer = ∪_k L_k;
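A compact, runnable sketch of this level-wise loop in Python follows. It is an illustration under my own naming (apriori, generate_candidates) and uses a naive subset test instead of the hash tree described on a later slide.

# Sketch of the Apriori level-wise loop (illustrative names, naive subset test).
from itertools import combinations

def generate_candidates(prev_large, k):
    """Join large (k-1)-itemsets sharing their first k-2 items, then prune
    candidates that have a (k-1)-subset which is not large."""
    prev = sorted(prev_large)
    candidates = set()
    for i in range(len(prev)):
        for j in range(i + 1, len(prev)):
            p, q = prev[i], prev[j]
            if p[:-1] == q[:-1] and p[-1] < q[-1]:          # join step
                cand = p + (q[-1],)
                if all(s in prev_large                       # prune step
                       for s in combinations(cand, k - 1)):
                    candidates.add(cand)
    return candidates

def apriori(transactions, minsup):
    """transactions: list of sets; minsup: absolute support count."""
    items = sorted({i for t in transactions for i in t})
    counts = {(i,): sum(1 for t in transactions if i in t) for i in items}
    large = {c for c, n in counts.items() if n >= minsup}
    all_large, k = {c: counts[c] for c in large}, 2
    while large:
        candidates = generate_candidates(large, k)
        counts = {c: sum(1 for t in transactions if set(c) <= t) for c in candidates}
        large = {c for c, n in counts.items() if n >= minsup}
        all_large.update((c, counts[c]) for c in large)
        k += 1
    return all_large

# The four-transaction database from the example slide, minsup = 2:
D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
print(apriori(D, 2))   # includes (2, 3, 5) with support 2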
14
The Apriori Algorithm Example
Database D (minimum support = 50% = 2 transactions):
TID   Items
100   1 3 4
200   2 3 5
300   1 2 3 5
400   2 5

Scan D → C_1:
itemset  sup.
{1}      2
{2}      3
{3}      3
{4}      1
{5}      3

L_1:
itemset  sup.
{1}      2
{2}      3
{3}      3
{5}      3

C_2 (from L_1): {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}

Scan D → C_2 with counts:
itemset  sup
{1 2}    1
{1 3}    2
{1 5}    1
{2 3}    2
{2 5}    3
{3 5}    2

L_2:
itemset  sup
{1 3}    2
{2 3}    2
{2 5}    3
{3 5}    2

C_3 (from L_2): {2 3 5}

Scan D → C_3 with counts:
itemset  sup
{2 3 5}  2

L_3:
itemset  sup
{2 3 5}  2
15
Candidate generation
Join step:
    insert into C_k
    select p.item_1, p.item_2, ..., p.item_{k-1}, q.item_{k-1}
    from L_{k-1} p, L_{k-1} q
    where p.item_1 = q.item_1, ..., p.item_{k-2} = q.item_{k-2}, p.item_{k-1} < q.item_{k-1};

p and q are two large (k-1)-itemsets that are identical in their first k-2 items.
Join by adding the last item of q to p.

Prune step:
    forall itemsets c ∈ C_k do
        forall (k-1)-subsets s of c do
            if (s ∉ L_{k-1}) then
                delete c from C_k;

Check all (k-1)-subsets of each candidate; remove any candidate with a subset that is not large.
16
Example
L_3 = { {1 2 3}, {1 2 4}, {1 3 4}, {1 3 5}, {2 3 4} }

After joining: { {1 2 3 4}, {1 3 4 5} }

After pruning: { {1 2 3 4} }

{1 3 4 5} is pruned because its subsets {1 4 5} and {3 4 5} are not in L_3.
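Below is a small Python sketch of the join and prune steps, reproducing the example above; the function name apriori_gen and the sorted-tuple representation are my own illustrative choices, not the paper's code.

# Sketch of apriori-gen's join and prune steps on the slide's example.
from itertools import combinations

def apriori_gen(large_prev, k):
    """large_prev: set of sorted (k-1)-tuples; returns candidate k-tuples."""
    joined = {p + (q[-1],)                       # join: same first k-2 items,
              for p in large_prev                # append the last item of q to p
              for q in large_prev
              if p[:-1] == q[:-1] and p[-1] < q[-1]}
    return {c for c in joined                    # prune: every (k-1)-subset
            if all(s in large_prev               # must itself be large
                   for s in combinations(c, k - 1))}

L3 = {(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)}
print(apriori_gen(L3, 4))   # {(1, 2, 3, 4)}; (1, 3, 4, 5) is pruned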
17
Correctness
Show that L_k ⊆ C_k (every large k-itemset is among the candidates).

The join is equivalent to extending L_{k-1} with all items and then removing those extensions whose (k-1)-subsets are not in L_{k-1}.
The condition p.item_{k-1} < q.item_{k-1} prevents duplicates.
Any subset of a large itemset must also be large, so no large k-itemset is removed.

(The apriori-gen pseudocode is repeated on this slide for reference; see slide 15.)
18
Subset Function
Candidate itemsets C_k are stored in a hash tree.
It finds in O(k) time whether a candidate itemset of size k is contained in transaction t.
Total time: O(max(k, size(t))).
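As an illustration of the hash-tree idea (a simplified sketch, not the paper's exact data structure), here is a small Python class; the class name, leaf-size threshold, and fanout are assumptions of mine.

# Simplified hash-tree sketch for finding candidates contained in a transaction.
class HashTree:
    def __init__(self, depth=0, leaf_size=3, fanout=7):
        self.depth, self.leaf_size, self.fanout = depth, leaf_size, fanout
        self.children = None        # interior node: bucket -> HashTree
        self.candidates = []        # leaf node: list of candidate tuples

    def _bucket(self, item):
        return hash(item) % self.fanout

    def _child(self, item):
        b = self._bucket(item)
        if b not in self.children:
            self.children[b] = HashTree(self.depth + 1, self.leaf_size, self.fanout)
        return self.children[b]

    def insert(self, cand):
        if self.children is None:                      # still a leaf
            self.candidates.append(cand)
            if len(self.candidates) > self.leaf_size and self.depth < len(cand):
                old, self.candidates, self.children = self.candidates, [], {}
                for c in old:                          # split the leaf
                    self._child(c[self.depth]).insert(c)
        else:
            self._child(cand[self.depth]).insert(cand)

    def subset(self, t, start=0, found=None):
        """Return the candidates stored in the tree that are contained in
        transaction t (t is a sorted tuple of items)."""
        found = set() if found is None else found
        if self.children is None:                      # leaf: test each candidate
            found.update(c for c in self.candidates if set(c) <= set(t))
        else:                                          # interior: hash the remaining items
            for i in range(start, len(t)):
                child = self.children.get(self._bucket(t[i]))
                if child is not None:
                    child.subset(t, i + 1, found)
        return found

tree = HashTree()
for cand in [(1, 2), (1, 3), (1, 5), (2, 3), (2, 5), (3, 5)]:
    tree.insert(cand)
print(sorted(tree.subset((2, 3, 5))))   # [(2, 3), (2, 5), (3, 5)]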

(The Apriori pseudocode is repeated on this slide for reference; see slide 13.)
19
Problem?
Every pass goes over
the whole data.

20
Algorithm AprioriTid
Uses the database only once.
Builds a storage set C^_k.
Each member has the form <TID, {X_k}>, where the X_k are potentially large k-itemsets present in the transaction with identifier TID.
For k = 1, C^_1 is the database itself, with each item replaced by the corresponding 1-itemset.
C^_k is used in pass k+1.
21
Advantage
C^_k can be smaller than the database.
If a transaction does not contain any candidate k-itemset, it is excluded from C^_k.
For large k, each entry may be smaller than the corresponding transaction, because the transaction may contain only a few candidates.
22
Disadvantage
For small k, each entry may be larger than the corresponding transaction,
because an entry includes all candidate k-itemsets contained in the transaction.
23
Algorithm AprioriTid

C^_1 = database D;                                  // The storage set is initialized with the database
L_1 = {large 1-itemsets};                           // Count item occurrences
for (k = 2; L_{k-1} ≠ ∅; k++) do begin
    C_k = apriori-gen(L_{k-1});                     // Generate new candidate k-itemsets
    C^_k = ∅;                                       // Build a new storage set
    forall entries t ∈ C^_{k-1} do begin
        // Determine candidate itemsets contained in the transaction with identifier t.TID
        C_t = {c ∈ C_k | (c - c[k]) ∈ t.set-of-itemsets ∧ (c - c[k-1]) ∈ t.set-of-itemsets};
        forall candidates c ∈ C_t do
            c.count++;                              // Find the support of all the candidates
        if (C_t ≠ ∅) then C^_k += <t.TID, C_t>;     // Keep only non-empty entries
    end
    L_k = {c ∈ C_k | c.count ≥ minsup};             // Take only those with support over minsup
end
Answer = ∪_k L_k;
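The following Python sketch mirrors this loop under my own naming (aprioritid, apriori_gen). It represents C^_k as a dict from TID to a set of itemsets, which is an illustrative choice rather than the paper's encoding.

# Sketch of AprioriTid: later passes read the storage set C^_{k-1}, not the database.
from itertools import combinations

def apriori_gen(large_prev, k):
    joined = {p + (q[-1],) for p in large_prev for q in large_prev
              if p[:-1] == q[:-1] and p[-1] < q[-1]}
    return {c for c in joined
            if all(s in large_prev for s in combinations(c, k - 1))}

def aprioritid(transactions, minsup):
    # C^_1: each transaction as a set of 1-itemsets, keyed by TID.
    storage = {tid: {(i,) for i in t} for tid, t in enumerate(transactions)}
    counts = {}
    for itemsets in storage.values():
        for c in itemsets:
            counts[c] = counts.get(c, 0) + 1
    large = {c for c, n in counts.items() if n >= minsup}
    all_large, k = {c: counts[c] for c in large}, 2
    while large:
        candidates = apriori_gen(large, k)
        counts = {c: 0 for c in candidates}
        new_storage = {}
        for tid, itemsets in storage.items():
            # c is contained in the transaction iff both of its (k-1)-subsets
            # c - c[k] and c - c[k-1] were itemsets of that transaction's entry.
            ct = {c for c in candidates
                  if c[:-1] in itemsets and (c[:-2] + c[-1:]) in itemsets}
            for c in ct:
                counts[c] += 1
            if ct:
                new_storage[tid] = ct            # drop empty entries
        storage = new_storage
        large = {c for c, n in counts.items() if n >= minsup}
        all_large.update((c, counts[c]) for c in large)
        k += 1
    return all_large

D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
print(aprioritid(D, 2))   # same large itemsets as plain Apriori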
24
Database D:
TID   Items
100   1 3 4
200   2 3 5
300   1 2 3 5
400   2 5

C^_1:
TID   Set-of-itemsets
100   { {1}, {3}, {4} }
200   { {2}, {3}, {5} }
300   { {1}, {2}, {3}, {5} }
400   { {2}, {5} }

L_1:
Itemset   Support
{1}       2
{2}       3
{3}       3
{5}       3

C_2:
{1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}

C^_2:
TID   Set-of-itemsets
100   { {1 3} }
200   { {2 3}, {2 5}, {3 5} }
300   { {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5} }
400   { {2 5} }

L_2:
Itemset   Support
{1 3}     2
{2 3}     2
{2 5}     3
{3 5}     2

C_3:
{2 3 5}

C^_3:
TID   Set-of-itemsets
200   { {2 3 5} }
300   { {2 3 5} }

L_3:
Itemset   Support
{2 3 5}   2
25
Correctness
Show that C_t generated in the kth pass is the same as the set of candidate k-itemsets in C_k contained in the transaction with identifier t.TID.

(The AprioriTid pseudocode is repeated on this slide for reference; see slide 23.)
26
Correctness
Lemma 1:
∀ k > 1, if C^_{k-1} is correct and complete and L_{k-1} is correct, then the set C_t generated in the kth pass is the same as the set of candidate k-itemsets in C_k contained in the transaction with identifier t.TID.

C^_k is complete: for every entry t of C^_k, t.set-of-itemsets includes all large k-itemsets contained in the transaction with identifier t.TID.
C^_k is correct: for every entry t of C^_k, t.set-of-itemsets does not include any k-itemset not contained in that transaction.
L_k is correct: L_k is the same as the set of all large k-itemsets.
27
Proof
Suppose a candidate itemset c = c[1]·c[2]·...·c[k] is contained in the transaction with identifier t.TID.
Then c_1 = (c - c[k]) and c_2 = (c - c[k-1]) are also contained in that transaction.
c_1 and c_2 must be large, because C_k was built using apriori-gen(L_{k-1}), so all (k-1)-subsets of any c ∈ C_k are large.
Since C^_{k-1} is complete, c_1 and c_2 are members of t.set-of-itemsets.
Therefore c will be a member of C_t.
28
Proof
Conversely, suppose c_1 (or c_2) is not contained in the transaction with identifier t.TID.
Since C^_{k-1} is correct, c_1 (or c_2) is not in t.set-of-itemsets.
Hence c ∈ C_k is not contained in the transaction, and c will not be a member of C_t.
29
Correctness
Lemma 2:
∀ k > 1, if L_{k-1} is correct and the set C_t generated in the kth pass is the same as the set of candidate k-itemsets in C_k contained in the transaction with identifier t.TID, then the set C^_k is correct and complete.
30
Proof
Apriori-gen guarantees C_k ⊇ L_k, so C_t includes all large k-itemsets contained in the transaction with identifier t.TID, and these are added to C^_k.
⇒ C^_k is complete.
C_t includes only itemsets contained in that transaction, and only itemsets in C_t are added to C^_k.
⇒ C^_k is correct.

(The AprioriTid pseudocode is repeated on this slide for reference; see slide 23.)
31
Correctness
Theorem 1:
∀ k > 1, the set C_t generated in the kth pass is the same as the set of candidate k-itemsets in C_k contained in the transaction with identifier t.TID.
Show: C^_k is correct and complete and L_k is correct for all k ≥ 1.
32
Proof (by induction on k)
k = 1: C^_1 is the database, so it is correct and complete.
Assume the claim holds for k = n.
By Lemma 1, C_t generated in pass n+1 consists of exactly those itemsets in C_{n+1} contained in the transaction with identifier t.TID.
Apriori-gen guarantees C_{n+1} ⊇ L_{n+1}, and since C_t is correct, L_{n+1} is correct.
By Lemma 2, C^_{n+1} is then correct and complete.
Hence C^_k is correct and complete and L_k is correct for all k ≥ 1, and the theorem holds.
33
General idea (reminder)
Say ABCD and AB are large itemsets.
Compute conf = support(ABCD) / support(AB).
If conf >= minconf, the rule AB ⇒ CD holds.

34
Discovering Rules
For every large itemset l:
Find all non-empty subsets of l.
For every subset a, produce the rule a ⇒ (l - a).
Accept it if support(l) / support(a) >= minconf.


35
Checking the subsets
For efficiency, generate the subsets using a recursive DFS. If a subset a does not produce a rule, we do not need to check the subsets of a.
Example:
Given the itemset ABCD, if ABC ⇒ D does not have enough confidence, then AB ⇒ CD certainly will not hold.

36
Why?
For any subset â of a:
support(â) >= support(a), so
confidence(â ⇒ (l - â)) = support(l) / support(â) <= support(l) / support(a) = confidence(a ⇒ (l - a)).

37
Simple Algorithm


// Check all the large itemsets
forall large k-itemsets l_k, k ≥ 2 do
    call genrules(l_k, l_k);

procedure genrules(l_k: large k-itemset, a_m: large m-itemset)
begin
    A = {(m-1)-itemsets a_{m-1} | a_{m-1} ⊂ a_m};          // Check all the subsets
    forall a_{m-1} ∈ A do begin
        conf = support(l_k) / support(a_{m-1});            // Check the confidence of the new rule
        if (conf ≥ minconf) then begin
            output the rule a_{m-1} ⇒ (l_k - a_{m-1});      // Output the rule
            if (m - 1 > 1) then
                call genrules(l_k, a_{m-1});               // Continue the DFS over the subsets
        end
        // If there is not enough confidence, the DFS branch is cut here
    end
end
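A runnable Python sketch of this DFS with the confidence cut follows; supports is assumed to be a dict from frozenset itemsets to support counts (for example, the output of the earlier apriori sketch), and the function names are mine.

# Sketch of the simple rule-generation DFS with the confidence-based cut.
from itertools import combinations

def genrules(l, a, supports, minconf, rules):
    """Try every (len(a)-1)-subset of a as an antecedent for rules about l."""
    for sub in combinations(sorted(a), len(a) - 1):
        antecedent = frozenset(sub)
        conf = supports[l] / supports[antecedent]
        if conf >= minconf:
            rules.append((antecedent, l - antecedent, conf))       # output the rule
            if len(antecedent) > 1:
                genrules(l, antecedent, supports, minconf, rules)  # continue the DFS
        # else: this DFS branch is cut; smaller antecedents cannot do better

def all_rules(supports, minconf):
    rules = []
    for l in supports:
        if len(l) >= 2:
            genrules(l, l, supports, minconf, rules)
    return rules

# Illustrative supports (absolute counts) for the slide-14 database:
supports = {frozenset(s): n for s, n in [
    ({1}, 2), ({2}, 3), ({3}, 3), ({5}, 3),
    ({1, 3}, 2), ({2, 3}, 2), ({2, 5}, 3), ({3, 5}, 2), ({2, 3, 5}, 2)]}
for ante, cons, conf in all_rules(supports, minconf=1.0):
    print(set(ante), "=>", set(cons), conf)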
38
Faster Algorithm
Idea:
If (l - c) ⇒ c holds, then all rules (l - c̃) ⇒ c̃ must also hold, where c̃ is any non-empty subset of c.
Example:
If AB ⇒ CD holds, then so do ABC ⇒ D and ABD ⇒ C.
39
Faster Algorithm
From a large itemset l:
Generate all rules with one item in the consequent.
Use those consequents and apriori-gen to generate all possible 2-item consequents, and so on.
The candidate set of the faster algorithm is a subset of the candidate set of the simple algorithm.

40
Faster algorithm
// Find all 1-item consequents (using one pass of the simple algorithm)
forall large k-itemsets l_k, k ≥ 2 do begin
    H_1 = {consequents of rules derived from l_k with one item in the consequent};
    call ap-genrules(l_k, H_1);
end

procedure ap-genrules(l_k: large k-itemset, H_m: set of m-item consequents)
begin
    if (k > m + 1) then begin
        H_{m+1} = apriori-gen(H_m);                         // Generate new (m+1)-item consequents
        forall h_{m+1} ∈ H_{m+1} do begin
            conf = support(l_k) / support(l_k - h_{m+1});   // Check the confidence of the new rule
            if (conf ≥ minconf) then
                output the rule (l_k - h_{m+1}) ⇒ h_{m+1}
                    with confidence = conf and support = support(l_k);
            else
                delete h_{m+1} from H_{m+1};                // If a consequent does not hold, do not look for bigger ones
        end
        call ap-genrules(l_k, H_{m+1});                     // Continue for bigger consequents
    end
end
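As an illustration, here is a Python sketch of this consequent-growing scheme; it redefines the small apriori_gen helper from the earlier sketch so the block stands alone, and all names are mine, not the paper's code.

# Sketch of the faster rule generator: grow consequents with apriori-gen.
from itertools import combinations

def apriori_gen(large_prev, k):
    joined = {p + (q[-1],) for p in large_prev for q in large_prev
              if p[:-1] == q[:-1] and p[-1] < q[-1]}
    return {c for c in joined
            if all(s in large_prev for s in combinations(c, k - 1))}

def ap_genrules(l, consequents, supports, minconf, rules):
    m = len(next(iter(consequents)))
    if len(l) > m + 1:
        h_next = apriori_gen(consequents, m + 1)
        kept = set()
        for h in h_next:
            ante = l - frozenset(h)
            conf = supports[l] / supports[ante]
            if conf >= minconf:
                rules.append((ante, frozenset(h), conf))
                kept.add(h)                      # only confident consequents survive
        if kept:
            ap_genrules(l, kept, supports, minconf, rules)

def fast_rules(supports, minconf):
    rules = []
    for l in (s for s in supports if len(s) >= 2):
        h1 = set()
        for item in l:                           # 1-item consequents first
            conf = supports[l] / supports[l - {item}]
            if conf >= minconf:
                rules.append((l - {item}, frozenset({item}), conf))
                h1.add((item,))
        if h1:
            ap_genrules(l, h1, supports, minconf, rules)
    return rules

supports = {frozenset(s): n for s, n in [
    ({2}, 3), ({3}, 3), ({5}, 3),
    ({2, 3}, 2), ({2, 5}, 3), ({3, 5}, 2), ({2, 3, 5}, 2)]}
print(fast_rules(supports, minconf=1.0))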
41
Advantage
Example:
Large itemset: ABCDE
One-item consequents that hold: ACDE ⇒ B and ABCE ⇒ D
The simple algorithm will also check: ABC ⇒ DE, ABE ⇒ CD, BCE ⇒ AD and ACE ⇒ BD.
The faster algorithm will check only ACE ⇒ BD, which is also the only one of these rules that holds.
42
Example (rule tree for large itemset ABCDE):
Rules with minimum support (one-item consequents): ACDE ⇒ B, ABCE ⇒ D
Simple algorithm then checks: ACD ⇒ BE, ADE ⇒ BC, CDE ⇒ AB, ACE ⇒ BD, BCE ⇒ AD, ABE ⇒ CD, ABC ⇒ ED
Fast algorithm then checks only: ACE ⇒ BD
43
Results
Compare the performance of Apriori and AprioriTid to each other and to the previously known algorithms:
AIS
SETM (designed for use over SQL)
The algorithms differ in the method of generating all large itemsets; both AIS and SETM generate candidates on-the-fly.
44
Method
Check the algorithms on the same
databases
Synthetic data
Real data
45
Synthetic Data
Choose the parameters to be compared.
Transaction sizes, and large itemsets sizes are
each clustered around a mean.
Parameters for data generation
D Number of transactions
T Average size of the transaction
I Average size of the maximal potentially large
itemsets
L Number of maximal potentially large itemsets
N Number of Items.
46
Synthetic Data
Experiment values:
N = 1000
L = 2000
T5.I2.D100k
T10.I2.D100k
T10.I4.D100k
T20.I2.D100k
T20.I4.D100k
T20.I6.D100k
D Number of transactions
T Average size of the transaction
I Average size of the maximal
potentially large itemsets
L Number of maximal potentially large
itemsets
N Number of Items.

(e.g. T5.I2.D100k means T = 5, I = 2, D = 100,000 transactions)
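The paper's generator is more elaborate, but a highly simplified sketch of the idea could look like this; the parameter names follow the slide (T, I, D, L, N), while the distributions, pattern reuse, and the scaled-down defaults are my own assumptions for illustration.

# Highly simplified sketch of a synthetic basket-data generator (T?.I?.D? style).
import random

def generate_data(D=1000, T=5, I=2, L=200, N=100, seed=0):
    """D transactions over N items, assembled from L potentially large itemsets
    whose sizes cluster around I; transaction sizes cluster loosely around T."""
    rng = random.Random(seed)
    items = list(range(N))
    patterns = [rng.sample(items, min(N, max(1, round(rng.gauss(I, 1)))))
                for _ in range(L)]
    transactions = []
    for _ in range(D):
        n_patterns = max(1, round(rng.gauss(T / I, 1)))   # about T items on average
        t = set()
        for _ in range(n_patterns):
            t.update(rng.choice(patterns))                # reuse shared patterns
        transactions.append(t)
    return transactions

data = generate_data()
print(len(data), sum(len(t) for t in data) / len(data))   # transaction count, avg size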
47
SETM values
are too big to
fit the graphs.
Apriori always
beats AIS
D Number of transactions
T Average size of the transaction
I Average size of the maximal
potentially large itemsets

Apriori is better
than AprioriTid in
large problems
48
Explaining the Results
AprioriTid uses C^_k instead of the database. If C^_k fits in memory, AprioriTid is faster than Apriori.
When C^_k is too big to fit in memory, the computation time is much longer, and Apriori is faster than AprioriTid.
49
Reality Check
Retail sales data:
63 departments
46,873 transactions (avg. size 2.47)
A small database: C^_k fits in memory.


50
Reality Check
Mail Order
15836 items
2.9 million transactions
(avg size 2.62)
Mail Customer
15836 items
213972 transactions
(avg size 31)
51
So who is better?
Look at the passes:
in the final passes, C^_k is small enough to fit in memory.
52
Algorithm AprioriHybrid
Use Apriori in the initial passes.
Estimate the size of C^_k at the end of pass k:
    estimated size(C^_k) ≈ Σ_{candidates c ∈ C_k} support(c) + number of transactions
Switch to AprioriTid when C^_k is expected to fit in memory.
The switch takes time, but it is still better in most cases.
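A tiny Python sketch of this switching heuristic follows; the memory budget and the example numbers are illustrative assumptions, not values from the paper.

# Sketch of the AprioriHybrid switch test at the end of a pass.
def should_switch_to_aprioritid(candidate_counts, num_transactions,
                                memory_budget_entries):
    """candidate_counts: support count of every candidate in C_k.
    Estimated size of C^_k = sum of candidate supports + number of transactions.
    Switch when the estimate fits within the memory budget."""
    estimated_entries = sum(candidate_counts.values()) + num_transactions
    return estimated_entries <= memory_budget_entries

counts = {("a", "b"): 4000, ("a", "c"): 2500, ("b", "c"): 1500}
print(should_switch_to_aprioritid(counts, 10000, memory_budget_entries=50000))  # True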
53
54
Scale up experiment
55
Conclusions
The Apriori algorithms outperform the previously known algorithms:
for small problems by constant factors,
for large problems by orders of magnitude.
The two algorithms are best combined (AprioriHybrid).
The algorithms show good results in scale-up experiments.

56
Summary
Association rules are an important tool for analyzing databases.
We have seen algorithms that find all association rules in a database.
They have better running times than previous algorithms.
They maintain their performance on large databases.
