
CLOSET: An Efficient Algorithm for Mining
Frequent Closed Itemsets
Jian Pei, Jiawei Han and Runying Mao
Intelligent Database Systems Research Lab.
School of Computing Science
Simon Fraser University
Email: {peijian, han, rmao}@cs.sfu.ca
http://www.cs.sfu.ca/~{peijian, han, rmao}
Outline
Why mine frequent closed itemsets?
CLOSET: an efficient method
Performance study and experimental results
Conclusions
Mining Frequent Itemsets
Given a transaction database and a
support threshold, mining frequent
itemsets is to find the complete set of
frequent itemsets
Mining frequent itemsets is essential for
many data mining tasks, e.g. association
rule mining
Mining frequent itemsets and association
rules over them often generates a large
number of frequent itemsets and rules
  - Harms efficiency
  - Hard to understand
From Frequent Itemsets to
Frequent Closed Itemsets
Mining frequent closed itemsets has the
same power as mining the complete set of
frequent itemsets, but it substantially
reduces the number of redundant rules generated
  - Increases both efficiency and effectiveness
Example: TDB contains two transactions, (a1a2…a100) and (a1a2…a50),
with min_sup=1 and min_conf=50%
  - Frequent itemsets: 2^100-1 of them (a1, …, a100, a1a2, …, a99a100,
    …, a1a2…a100), giving a tremendous number of association rules!
  - Frequent closed itemsets: only 2 (a1a2…a100 and a1a2…a50),
    giving 1 rule: a1a2…a50 → a51a52…a100
What Is a Frequent Closed
Itemset?
An itemset X is a closed itemset if there
exists no proper superset Y of X such that every
transaction having X also contains Y
A closed itemset X is frequent if its
support passes the given support
threshold
The concept was first proposed by Pasquier
et al. in ICDT’99 and Information Systems
Vol.24, No.1, 1999
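The definition above can be checked mechanically. The following is a minimal Python sketch, not part of the original slides: the helper names `support`, `closure` and `is_closed` are hypothetical, and the five transactions are borrowed from the CLOSET example used later in this talk. It relies on the fact that an itemset is closed exactly when it equals the intersection of all transactions that contain it.

```python
# Transactions from the CLOSET walkthrough example (IDs 10 to 50).
TDB = [set("acdef"), set("abe"), set("cef"), set("acdf"), set("cef")]

def support(itemset, tdb):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in tdb if set(itemset) <= t)

def closure(itemset, tdb):
    """Largest itemset contained in every transaction that contains `itemset`."""
    covering = [t for t in tdb if set(itemset) <= t]
    if not covering:  # no supporting transaction: nothing to intersect
        return frozenset(itemset)
    return frozenset(set.intersection(*covering))

def is_closed(itemset, tdb):
    """X is closed iff no proper superset Y occurs in every transaction having X."""
    return closure(itemset, tdb) == frozenset(itemset)

print(sorted(closure("c", TDB)))                   # items shared by all transactions with c
print(is_closed("cf", TDB), is_closed("c", TDB))   # closedness of cf and of c
```

With min_sup=2, a closed itemset reported by `is_closed` is a frequent closed itemset whenever `support` returns at least 2.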
How to Generate Rules on
Frequent Closed Itemsets?
Rule X → Y is an association rule on
frequent closed itemsets if
  - Both X and X ∪ Y are frequent closed itemsets
  - There exists no frequent closed itemset Z
    such that X ⊂ Z ⊂ (X ∪ Y)
  - The confidence of the rule passes the given
    threshold
Given rules X → Y and XY → Z, the rule
X → YZ is redundant!
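The three conditions above translate directly into a filter over pairs of frequent closed itemsets. The sketch below is illustrative only: `rules_on_closed_itemsets` is a hypothetical helper, it assumes the frequent closed itemsets and their supports are already available as a dictionary, and it uses a simple quadratic scan rather than anything from the CLOSET paper.

```python
def rules_on_closed_itemsets(fci, min_conf):
    """Return (X, Y, confidence) for every rule X -> Y on frequent closed itemsets.

    `fci` maps each frequent closed itemset (a frozenset) to its support.
    """
    rules = []
    for x, sup_x in fci.items():
        for xy, sup_xy in fci.items():
            if not x < xy:                       # X must be a proper subset of X ∪ Y
                continue
            if any(x < z < xy for z in fci):     # no closed Z strictly in between
                continue
            conf = sup_xy / sup_x
            if conf >= min_conf:
                rules.append((x, xy - x, conf))
    return rules

# The two-transaction example: a1..a50 has support 2, a1..a100 has support 1.
first_50 = frozenset(f"a{i}" for i in range(1, 51))
all_100 = frozenset(f"a{i}" for i in range(1, 101))
example = {first_50: 2, all_100: 1}
rules = rules_on_closed_itemsets(example, min_conf=0.5)
print(len(rules))   # number of rules produced for the a1..a100 example
```

Because only adjacent closed itemsets are paired, a rule such as X → YZ is never emitted when X → Y and XY → Z already cover it.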
How to Mine Frequent Closed
Itemsets?
A-Close [PBTL99]
  - Using the A-priori framework
  - Pruning redundancies in candidates
  - Post-processing to generate a complete but
    non-duplicate result
ChARM [ZaHs00]
  - Exploring a vertical data format
  - Finding frequent closed itemsets by computing
    intersections of sets of transaction ids for
    itemsets
CLOSET: our method presented here
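To make the vertical data format mentioned for ChARM concrete, here is a small Python sketch of the underlying idea only; it is not the ChARM algorithm itself. The function names are hypothetical, and the transactions are the ones from the example used later in these slides. Each item maps to the set of transaction ids (tids) containing it, and the support of an itemset is the size of the intersection of its items' tid sets.

```python
TDB = {10: "acdef", 20: "abe", 30: "cef", 40: "acdf", 50: "cef"}

def to_vertical(tdb):
    """Convert {tid: items} into the vertical layout {item: set of tids}."""
    vertical = {}
    for tid, items in tdb.items():
        for item in items:
            vertical.setdefault(item, set()).add(tid)
    return vertical

def tidset(itemset, vertical):
    """Tids of the transactions containing every item of a non-empty itemset."""
    tids = None
    for item in itemset:
        item_tids = vertical.get(item, set())
        tids = set(item_tids) if tids is None else tids & item_tids
    return tids if tids is not None else set()

def closure_by_tidsets(itemset, vertical):
    """All items whose tid set covers the itemset's tid set.

    Assumes the itemset occurs in at least one transaction.
    """
    tids = tidset(itemset, vertical)
    return frozenset(i for i, t in vertical.items() if tids <= t)

V = to_vertical(TDB)
print(sorted(tidset("cf", V)))              # transactions containing both c and f
print(sorted(closure_by_tidsets("d", V)))   # closure of the itemset {d}
```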
How Does CLOSET Work? An
Example
Transaction ID   Items
10               a, c, d, e, f
20               a, b, e
30               c, e, f
40               a, c, d, f
50               c, e, f
min_sup = 2
Step 1. Find frequent items
List of frequent items in support
descending order
f_list=<c:4, e:4, f:4, a:3, d:2>
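Step 1 amounts to one counting pass over the database. The following Python sketch reproduces the f_list of the example; `build_f_list` is a hypothetical name, and breaking ties between equally frequent items alphabetically is an assumption made here so that the output matches the order shown on the slide.

```python
from collections import Counter

TDB = ["acdef", "abe", "cef", "acdf", "cef"]   # transactions 10 to 50

def build_f_list(tdb, min_sup):
    """Frequent items with their supports, in support-descending order."""
    # set(t) guards against an item being counted twice in one transaction.
    counts = Counter(item for t in tdb for item in set(t))
    frequent = [(item, c) for item, c in counts.items() if c >= min_sup]
    # Assumption: ties are broken alphabetically.
    frequent.sort(key=lambda pair: (-pair[1], pair[0]))
    return frequent

f_list = build_f_list(TDB, min_sup=2)
print(f_list)
```

Item b occurs in only one transaction, so it falls below min_sup and is dropped.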
Divide Search Space
All frequent closed itemsets can be divided
into 5 non-overlapping subsets based on f_list
  - The ones containing d
  - The ones containing a but no d
  - The ones containing f but no a nor d
  - The ones containing e but no f, a nor d
  - The ones containing only c
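The five subsets are determined by the item of each itemset that comes last in f_list. As a hedged illustration (the helper names are hypothetical, and the itemsets used are the final results listed in the summary of this example), the partition can be computed as follows:

```python
F_LIST = ["c", "e", "f", "a", "d"]

def partition_key(itemset, f_list):
    """The item of `itemset` that appears last in f_list."""
    return max(itemset, key=f_list.index)

def divide(itemsets, f_list):
    """Group itemsets into the non-overlapping subsets, one per f_list item."""
    groups = {item: [] for item in f_list}
    for itemset in itemsets:
        groups[partition_key(itemset, f_list)].append(itemset)
    return groups

groups = divide(["acdf", "a", "ae", "cf", "cef", "e"], F_LIST)
for item in reversed(F_LIST):   # d first, then a, f, e, c, as on the slide
    print(item, groups[item])
```

Every itemset lands in exactly one group, which is why the subsets can be mined independently.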
Find Subsets of Frequent Closed Itemsets by
Constructing Conditional Databases
Let a be a frequent item in TDB. The a-conditional
database, denoted as TDB|a, is the
subset of transactions in TDB containing a, with
all occurrences of infrequent items, item a, and
items following a in f_list omitted
Let b be a frequent item in the X-conditional
database TDB|X. The bX-conditional database,
denoted as TDB|bX, is the subset of transactions
in TDB|X containing b, with all occurrences of
local infrequent items, item b, and items
following b in the local f_list_X omitted
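The two definitions above describe the same projection applied at different levels. A minimal Python sketch, assuming transactions are plain collections of single-character items and that `conditional_db` is a hypothetical helper name, could look like this. Items that are not in the supplied f_list (global or local) are dropped, which covers the "infrequent items" part of the definition.

```python
F_LIST = ["c", "e", "f", "a", "d"]
TDB = [set("acdef"), set("abe"), set("cef"), set("acdf"), set("cef")]

def conditional_db(tdb, item, f_list):
    """Project `tdb` on `item`: keep transactions containing `item`, and in
    each keep only the items that precede `item` in `f_list`."""
    keep = f_list[:f_list.index(item)]
    return [[i for i in keep if i in t] for t in tdb if item in t]

tdb_d = conditional_db(TDB, "d", F_LIST)
tdb_a = conditional_db(TDB, "a", F_LIST)
# Assumption: within TDB|a the items c, e and f all have local support 2,
# so the local f_list is taken to be c, e, f.
tdb_ea = conditional_db(tdb_a, "e", ["c", "e", "f"])
print(tdb_d)
print(tdb_a)
print(tdb_ea)
```

Applying the same function to its own output with a local f_list gives the nested databases such as TDB|ea.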
Find Frequent Closed Itemsets
Containing d
TDB (items in f_list order): cefad, ea, cef, cfad, cef
f_list:<c:4, e:4, f:4, a:3, d:2>
TDB|d (d:2): cefa, cfa
  Local frequent items: c, f, a
  Every transaction having d also contains c, f and a
  F.C.I.: cfad:2
TDB|a (a:3): cef, e, cf
  F.C.I.: a:3
  TDB|ea (ea:2): c
  F.C.I.: ea:2
TDB|f (f:4): ce:3, c
  F.C.I.: cf:4, cef:3
TDB|e (e:4): c:3
  F.C.I.: e:4
Find Frequent Closed Itemsets
Containing a but No d
Frequent closed itemsets containing a but no d can be
further partitioned into subsets
  - Ones having af but no d
  - Ones having ae but no d nor f
  - Ones having ac but no d, e nor f
TDB|a (a:3): cef, e, cf
  F.C.I.: a:3
  sup(fa)=sup(ca)=sup(cfad), so there is no FCI having fa or ca but no d
TDB|ea (ea:2): c
  F.C.I.: ea:2
Find Frequent Closed Itemsets
Containing f but No a Nor d
TDB|f (f:4): ce:3, c
  F.C.I.: cf:4, cef:3
Find Frequent Closed Itemsets
Containing e but No f, a Nor d
TDB|e (e:4): c:3
  F.C.I.: e:4
Find Frequent Closed Itemsets
Containing Only c
sup(c)=sup(cf), so c is not a closed itemset
In summary, the set of frequent closed
itemsets is {acdf:2, a:3, ae:2, cf:4, cef:3, e:4}
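The summary can be cross-checked with a brute-force enumeration. The sketch below is deliberately not CLOSET: it tries every combination of items, which is exponential and only practical for a toy database like this one. The function name is hypothetical and min_sup is assumed to be at least 1.

```python
from itertools import combinations

TDB = [frozenset("acdef"), frozenset("abe"), frozenset("cef"),
       frozenset("acdf"), frozenset("cef")]

def frequent_closed_itemsets(tdb, min_sup):
    """Brute force: map every frequent closed itemset to its support."""
    items = sorted(frozenset().union(*tdb))
    result = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            x = frozenset(combo)
            covering = [t for t in tdb if x <= t]
            if len(covering) < min_sup:      # not frequent (min_sup >= 1 assumed)
                continue
            # Closed iff x equals the intersection of its covering transactions.
            if frozenset.intersection(*covering) == x:
                result[x] = len(covering)
    return result

fci = frequent_closed_itemsets(TDB, min_sup=2)
for itemset, sup in sorted(fci.items(), key=lambda kv: sorted(kv[0])):
    print("".join(sorted(itemset)), sup)
```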
Optimization 1: Compress Transactional
& Conditional Databases Using FP-trees
FP-tree compresses databases for
frequent itemset mining
Conditional databases can be derived from
the FP-tree efficiently
Please refer to our SIGMOD’00 paper for
details
Optimization 2: Extract Items Appearing in
Every Transaction of Conditional Database
Let Y be the set of items appearing in
every transaction of the X-conditional
database. Then X ∪ Y is a potential frequent
closed itemset
This optimization takes effect before
constructing the FP-tree for the
conditional database
Benefits
  - Reduces the size of the FP-tree
  - Reduces the levels of recursion
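As an illustration of Optimization 2, the following Python sketch pulls out the items shared by every transaction of a conditional database before any tree is built. The helper name `extract_common_items` is hypothetical, and the input is TDB|d from the running example written out by hand.

```python
def extract_common_items(cond_db):
    """Split a conditional database into (Y, remaining transactions), where Y
    is the set of items present in every transaction."""
    if not cond_db:
        return set(), []
    common = set.intersection(*(set(t) for t in cond_db))
    remaining = [[i for i in t if i not in common] for t in cond_db]
    return common, remaining

tdb_d = [["c", "e", "f", "a"], ["c", "f", "a"]]   # TDB|d from the example
y, rest = extract_common_items(tdb_d)
candidate = frozenset("d") | y                    # X ∪ Y with X = {d}
print(sorted(candidate), rest)
```

Here the remaining database contains only the locally infrequent item e, so no further recursion is needed for d.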
Optimization 3: Directly Extract
Frequent Closed Itemsets From FP-tree
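Benefits
  - Identify frequent closed itemsets quickly
  - Reduce the size of the remaining FP-tree to be examined
  - Reduce the levels of recursions
Example: an FP-tree consisting of the single path
root, a:7, b:7, c:7, d:5, e:4, f:4 directly yields the
frequent closed itemsets abc:7, abcd:5 and abcdef:4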
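For the single-path case, the extraction can be sketched in a few lines of Python. This is an illustrative simplification: the path is given as a list of (item, count) pairs from the root downwards, the function name is hypothetical, and the sketch ignores both the prefix itemset X and the check against previously found closed itemsets that a full implementation would need.

```python
def closed_itemsets_from_single_path(path, min_sup):
    """Closed itemset candidates along a single FP-tree path.

    A prefix of the path is reported when its last node is frequent and the
    count drops at the next node (or the path ends there).
    """
    result = []
    for idx, (item, count) in enumerate(path):
        is_last = idx == len(path) - 1
        if count >= min_sup and (is_last or path[idx + 1][1] < count):
            prefix = tuple(i for i, _ in path[:idx + 1])
            result.append((prefix, count))
    return result

path = [("a", 7), ("b", 7), ("c", 7), ("d", 5), ("e", 4), ("f", 4)]
for itemset, count in closed_itemsets_from_single_path(path, min_sup=2):
    print("".join(itemset), count)
```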
Optimization 4: Prune Search
Branches
If X ⊂ Y, sup(X)=sup(Y) and Y is a frequent
closed itemset, there is no need to search
the X-conditional database for frequent
closed itemsets
  - Any frequent closed itemset having X must
    contain Y - X as well
Benefits
  - Avoids searching for subsumed frequent itemsets
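The pruning test can be written as a single subset-and-support comparison against the closed itemsets found so far. The sketch below is a hedged illustration with a hypothetical function name; it uses a linear scan over a dictionary, whereas an efficient implementation would index the results.

```python
def is_subsumed(x, sup_x, found):
    """True if some already-found closed itemset Y satisfies X ⊂ Y and
    sup(X) = sup(Y), so the X-conditional database need not be searched."""
    return any(x < y and sup_x == sup_y for y, sup_y in found.items())

found = {frozenset("acdf"): 2}                   # result of mining TDB|d first
print(is_subsumed(frozenset("af"), 2, found))    # fa has the same support as cfad
print(is_subsumed(frozenset("ae"), 2, found))    # ae is not a subset of cfad
print(is_subsumed(frozenset("a"), 3, found))     # a has a higher support
```

This mirrors the earlier observation that sup(fa)=sup(ca)=sup(cfad), so the fa and ca branches are skipped while ea is still explored.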
Scaling Up CLOSET in Large
Databases
Using projected databases in place of FP-trees
Partition-based projection
(Figure: TDB projected into TDB|d, TDB|a, TDB|f and TDB|e, with TDB|ea
projected from TDB|a, as in the example above)
Performance Study
Algorithms compared
  - A-Close
  - ChARM
  - CLOSET
Datasets
  - Synthetic dataset T25I20D100k with 10k items
  - Connect-4
  - Pumsb
Compactness of Frequent
Closed Itemsets
Example: Dataset Connect-4

Support        #FCI     #FI        #FI/#FCI
64179 (95%)    812      2205       2.72
60801 (90%)    3486     27127      7.78
54046 (80%)    15107    533975     35.35
47290 (70%)    35875    4129839    115.12
Scalability with Support Threshold
on Dataset T25I20D100k
(Figure: runtime in seconds versus support threshold, 0.7% to 1.5%, for A-Close, CLOSET and ChARM)
Scalability With Support Threshold
on Dataset Connect-4
(Figure: runtime in seconds on a logarithmic axis, 1 to 10000, versus support threshold, 40% to 100%, for A-Close, CLOSET and ChARM)
Scalability With Support Threshold
on Dataset Pumsb
(Figure: runtime in seconds versus support threshold, 75% to 95%, for A-Close, CLOSET and ChARM)
Size Scaleup on Datasets
(Figure: runtime in seconds versus replication factor, 0 to 10, for T25I20D100-1000K (1%), Connect4 (70%) and Pumsb (85%))
Conclusions
CLOSET is an FP-tree-based database
projection method for efficient mining of
frequent closed itemsets in large
databases
  - Applying the FP-tree structure
  - Developing techniques to identify frequent
    closed itemsets quickly
  - Exploring a partition-based projection
    mechanism for scalable mining
CLOSET can be straightforwardly extended
to mine max-patterns
References
R. Agarwal, C. Aggarwal and V.V.V. Prasad. A tree projection algorithm
for generation of frequent itemsets. In Journal of Parallel and
Distributed Computing, (to appear), 2000
R. Agrawal and R. Srikant. Fast algorithms for mining association
rules. In Proc. VLDB’94, Chile, September 1994
R.J. Bayardo. Efficiently mining long patterns from databases. In
Proc. SIGMOD’98, WA, June 1998
J. Han, J. Pei and Y. Yin. Mining frequent patterns without candidate
generation. In Proc. SIGMOD’00, TX, May 2000
H. Mannila, H. Toivonen and A.I. Verkamo. Efficient algorithms for
discovering association rules. In Proc. KDD’94, WA, July 1994
N. Pasquier, Y. Bastide, R. Taouil and L. Lakhal. Discovering frequent
closed itemsets for association rules. In Proc. ICDT’99, Israel, January
1999
N. Pasquier, Y. Bastide, R. Taouil and L. Lakhal. Efficient
mining of association rules using closed itemset lattices. In
Information Systems, Vol.24, No.1, 1999
M.J. Zaki and C. Hsiao. ChARM: An efficient algorithm for closed
association rule mining. In Tech. Rep. 99-10, Computer Science,
Rensselaer Polytechnic Institute, 1999
