
Frequent/Closed/Maximal Itemsets

1 National Institute of Informatics
  2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo 101-8430, Japan
  e-mail: uno@nii.jp, masashi@grad.nii.ac.jp

2 Information Science and Technology, Hokkaido University
  Kita 14-jo Nishi 9-chome, 060-0814 Sapporo, Japan
  e-mail: arim@ist.hokudai.ac.jp

Abstract: For a transaction database, a frequent itemset is an itemset included in at least a specified number of transactions. A frequent itemset P is maximal if P is included in no other frequent itemset, and closed if P is included in no other itemset included in exactly the same transactions as P. The problems of finding these frequent itemsets are fundamental in data mining, and from the applications, fast implementations for solving the problems are needed. In this paper, we propose efficient algorithms LCM (Linear time Closed itemset Miner), LCMfreq and LCMmax for these problems. We show the efficiency of our algorithms by computational experiments compared with existing algorithms.

1 Introduction

Frequent itemset mining is one of the fundamental problems in data mining and has many applications such as association rule mining, inductive databases, and query expansion. From these applications, fast implementations of frequent itemset mining problems are needed. In this paper, we propose the second versions of LCM, LCMfreq and LCMmax, for enumerating closed, all and maximal frequent itemsets. LCM is an abbreviation of Linear time Closed itemset Miner.

In FIMI03 [7], we proposed the first version of LCM, which is for enumerating frequent closed itemsets. LCM uses prefix preserving closure extension (ppc extension in short), which is an extension from a closed itemset to another closed itemset. The extension induces a search tree on the set of frequent closed itemsets, thereby we can completely enumerate closed itemsets without duplication. Generating a ppc extension needs no previously obtained closed itemset. Hence, the memory use of LCM does not depend on the number of frequent closed itemsets, even if there are many frequent closed itemsets. The time complexity of LCM is theoretically bounded by a linear function in the number of frequent closed itemsets, while that of the existing algorithms is not. We further developed algorithms for the frequency counting: occurrence deliver and a hybrid of diffsets. They reduce the practical computation time efficiently. Moreover, the framework of LCM is simple. Generating ppc extensions needs no sophisticated data structure such as binary trees; LCM is implemented with only arrays. Therefore, LCM is fast, and outperforms other algorithms for some sparse datasets.

However, LCM does not have any routine for reducing the database, while many existing algorithms do. Thus, the performance of LCM is not good for dense datasets with large minimum supports, which involve many unnecessary items and transactions. At FIMI03, we also proposed modifications of LCM, LCMfreq and LCMmax, for enumerating all frequent itemsets and maximal frequent itemsets. Although they are fast for some instances, if LCM is not fast for an instance, they are also not fast for that instance. Existing maximal frequent itemset mining algorithms have efficient pruning methods to reduce the number of iterations, while LCMmax does not. This is also a reason for the slowness of LCMmax.

This paper proposes the second version of the LCM algorithms. We added database reduction to LCM, so that problems on dense datasets can be solved in a short time. The second version of LCMmax includes a pruning method, thus the computation time is reduced when the number of maximal frequent itemsets is small. We further developed new algorithms for checking the maximality of a frequent itemset and

for taking the closure of an itemset. We compare the performance of the LCM algorithms and the other algorithms submitted to FIMI03 by computational experiments. In many instances, the LCM algorithms outperform the other algorithms.

The organization of the paper is as follows. Section 2 introduces preliminaries. The main algorithms and practical techniques of the LCM algorithms are described in Section 3. Section 4 shows the results of computational experiments, and Section 5 concludes the paper.

2 Preliminaries

Let I = {1, ..., n} be the set of items. A transaction database on I is a set T = {t1, ..., tm} such that each ti is included in I. Each ti is called a transaction. We denote by ||T|| the sum of the sizes of all transactions in T, that is, the size of the database T. A set P ⊆ I is called an itemset.

For an itemset P, a transaction including P is called an occurrence of P. The denotation of P, denoted by T(P), is the set of the occurrences of P. |T(P)| is called the frequency of P, and denoted by frq(P). For a given constant θ, called a minimum support, an itemset P is frequent if frq(P) ≥ θ. If a frequent itemset P is included in no other frequent itemset, P is called maximal. For any itemsets P and Q, T(P ∪ Q) = T(P) ∩ T(Q) holds, and if P ⊆ Q then T(Q) ⊆ T(P). An itemset P is called closed if no other itemset Q satisfies T(P) = T(Q), P ⊆ Q.

Given a set S ⊆ T of transactions, let I(S) be the set of items common to all transactions in S, i.e., I(S) = ∩_{T∈S} T. Then, we define the closure of an itemset P in T, denoted by clo(P), by I(T(P)) (= ∩_{t∈T(P)} t). For every pair of itemsets P and Q, the following properties hold [13, 14].

(1) If P ⊆ Q, then clo(P) ⊆ clo(Q).
(2) If T(P) = T(Q), then clo(P) = clo(Q).
(3) clo(clo(P)) = clo(P).
(4) clo(P) is the unique smallest closed itemset including P.
(5) An itemset P is a closed itemset if and only if clo(P) = P.

For an itemset P and an item i ∈ P, let P(i) = P ∩ {1, ..., i} be the subset of P consisting only of elements no greater than i, called the i-prefix of P. An itemset Q is a closure extension of an itemset P if Q = clo(P ∪ {i}) holds for some i ∉ P. If Q is a closure extension of P, then Q ⊃ P and frq(Q) < frq(P). We call the item with the maximum index in P the tail of P, and denote it by tail(P).

3 Algorithms for Efficient Enumeration

In this section, we explain the techniques used in the second versions of the LCM algorithms. We explain them one by one, comparing them with the techniques used by the other algorithms, in the following subsections. The new techniques used in the second version are:

3.2. anytime database reduction (reduce the frequency counting cost)
3.6. database reduction for fast checking closedness
3.8. database reduction for fast checking maximality
3.7. new pruning algorithm for backtracking-based maximal frequent itemset mining.

The techniques also used in the first versions are:

3.4. occurrence deliver (compute frequency in linear time)
3.5. ppc extension (generate closed itemsets with neither memory nor duplication)
3.3. hypercube decomposition (fast enumeration by grouping frequent itemsets by equivalence class).

The techniques used in the existing algorithms and the first version have citations to the previous papers.

3.1 Enumerating Frequent Itemsets

Any itemset included in a frequent itemset is itself frequent. Thereby, the property "frequent" is monotone. From this, we can construct any frequent itemset from the empty set by adding items one by one without passing through any infrequent itemset. Roughly speaking, the existing algorithms are classified into two groups, and algorithms in both groups use this property.

The first group is the so-called apriori or level-by-level algorithms [1, 2]. Let Dk be the set of frequent itemsets of size k. Apriori algorithms start with D0, that is {∅}, and compute Dk from Dk−1 in increasing order of k from k = 1. Any itemset in Dk is obtained from an itemset of Dk−1 by adding an item. Apriori algorithms add every item to each itemset of Dk−1, and choose the frequent itemsets among them. If Dk = ∅ holds for some k, then Dk′ = ∅ holds for any k′ > k. Thus, apriori algorithms stop at such k. This is the scheme of apriori algorithms.
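The level-by-level scheme can be sketched as follows. This is a minimal illustration on an assumed toy database (the database, threshold, and helper names here are ours for illustration, not the optimized implementations compared in this paper):

```python
from itertools import chain

# Assumed toy database and minimum support, for illustration only.
T = [frozenset(t) for t in ({1, 2, 3}, {1, 2, 3}, {1, 2}, {2, 3})]
theta = 2
items = sorted(set().union(*T))

def frq(P):
    """frq(P): the number of transactions including the itemset P."""
    return sum(1 for t in T if P <= t)

# Apriori scheme: D[k] holds the frequent itemsets of size k. Each level
# is built by adding every item to each itemset of the previous level and
# keeping the frequent candidates; the loop stops at the first empty level.
D = [{frozenset()}]
while D[-1]:
    candidates = {P | {e} for P in D[-1] for e in items if e not in P}
    D.append({P for P in candidates if frq(P) >= theta})

frequent = set(chain.from_iterable(D)) - {frozenset()}
```

On this toy database every non-empty subset of {1, 2, 3} turns out to be frequent at θ = 2, so the loop produces levels D1, D2, D3 and stops when no candidate of size 4 exists.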

The other group is the so-called backtracking algorithms [3, 18, 19]. A backtracking algorithm is based on recursive calls. An iteration of a backtracking algorithm inputs a frequent itemset P, and generates itemsets by adding every item to P. Then, for each itemset among them that is frequent, the iteration generates a recursive call with respect to it. To avoid duplication, an iteration of a backtracking algorithm adds only items with indices larger than the tail of P. We describe the framework of backtracking algorithms as follows.

ALGORITHM BackTracking (P: current solution)
1. Output P
2. For each e ∈ I, e > tail(P) do
3.   If P ∪ {e} is frequent then
       call BackTracking (P ∪ {e})

An execution of a backtracking algorithm gives a tree structure such that the vertices of the tree are the iterations, and edges connect two iterations if one of the iterations calls the other. If an iteration I recursively calls another iteration I′, then we say that I is the parent of I′, and I′ is a child of I. For an iteration, the itemset received from the parent is called the current solution.

Apriori algorithms use much memory for storing Dk in memory, while backtracking algorithms use less memory since they keep only the current solution. Backtracking algorithms need no computation for maintaining previously obtained itemsets, so the computation time of backtracking algorithms is generally short. However, apriori algorithms have advantages for the frequency counting.

The LCM algorithms are based on backtracking algorithms, and use efficient techniques for the frequency counting, which are occurrence deliver and anytime database reduction, described below. Hence, the LCM algorithms compute the frequency efficiently without keeping previously obtained itemsets in memory.

3.2 Maintaining Databases

In the existing studies, database reduction is said to be important for reducing the computation time. It is to reduce the input database by the following rules:

1. remove each item included in fewer than θ transactions,
2. remove each item included in all transactions,
3. merge the identical transactions into one.

Database reduction performs well when the minimum support is large, and many existing algorithms use it. The LCM algorithms also use database reduction.

In the existing studies, the input databases are often stored and maintained by using an FP-tree (frequent pattern tree), which is a version of a prefix tree (trie) [9]. By using an FP-tree, we can search specified transactions in the datasets efficiently. An FP-tree compresses the common prefixes, so we can decrease the memory use. In addition, an FP-tree can detect identical transactions, thus we can merge them into one. This merge accelerates the frequency counting. For these reasons, FP-trees are used in many algorithms and implementations.

Although the FP-tree has many advantages, we do not use it in the implementation of LCM, but use simple arrays. The main reason is that LCM does not have to search transactions in the database. The main operation of LCM is tracing the transactions in the denotation of the current solution. Thus, we do not need sophisticated data structures for searching.

The other reason is the computation time for the initialization. If we use a standard binary tree for implementing the FP-tree, the initialization of the input database takes O(||T|| + |T| log |T|) time for constructing the FP-tree in memory. Compared to this, LCM detects the identical transactions and stores the database in memory within time linear in the database size. This is because LCM uses radix sort for this task, which sorts the transactions in lexicographic order in linear time. In general, the datasets of data mining problems have many transactions, and each transaction has few items. Thus, ||T|| is usually smaller than |T| log |T|, and LCM has an advantage. The constant factors of the computation time of binary tree operations are also relatively larger than those of array operations; LCM has an advantage at this point as well. Again, we recall that LCM never searches the transactions, so each operation required by LCM can be done in constant time.

The FP-tree has an advantage in reducing the memory use. This memory reduction can also reduce the computation time of the frequency counting. To check the efficiency of the reduction, we checked the reduction ratio achieved by the FP-tree for some datasets examined in FIMI03. The result is shown in Table 1. Each cell shows the ratio of the number of items needed to be stored by arrays and by the FP-tree. Usually, the input database is reduced in each iteration, hence we sum up the numbers over all iterations to compute the ratio. In the results of our experiments,

the ratio was not greater than 3 in many instances. If |I| is small, |T| is large, and the dataset has randomness, such as accidents, the ratio was up to 6. Generally, a binary tree uses three times as much memory as an array. Thus, the performance of the FP-tree seems to be no better than that of an array in both memory use and computation time, for many datasets.

3.3 Hypercube Decomposition

LCM finds a number of frequent itemsets at once to reduce the computation time [18]. Since the itemsets obtained at once compose a hypercube in the itemset lattice, we call the technique hypercube decomposition. For a frequent itemset P, let H(P) be the set of items e satisfying e > tail(P) and T(P) = T(P ∪ {e}). Then, for any Q ⊆ H(P), T(P ∪ Q) = T(P) holds, and P ∪ Q is frequent. LCMfreq uses this property. For two itemsets P and P ∪ Q, we say that P′ is between P and P ∪ Q if P ⊆ P′ ⊆ P ∪ Q. In the iteration with respect to P, we output all P′ between P and P ∪ H(P). This saves about 2^|H(P)| frequency countings.

To avoid duplication, we do not generate recursive calls with respect to items included in H(P). Instead of generating these recursive calls, we output the frequent itemsets including items of H(P) in the recursive calls with respect to items not included in H(P). When the algorithm generates a recursive call with respect to e ∉ H(P), we pass H(P) to it. In the recursive call, we output all itemsets between P ∪ {e} and P ∪ {e} ∪ H(P) ∪ H(P ∪ {e}). Since any itemset Q satisfies T(P ∪ Q ∪ H(P)) = T(P ∪ Q), the itemsets output in the recursive calls are frequent. We describe hypercube decomposition as follows.

ALGORITHM HypercubeDecomposition (P: current solution, S: itemset)
  S′ := S ∪ H(P)
  Output all itemsets including P and included in P ∪ S′
  For each item e ∈ I \ (P ∪ S′), e > tail(P) do
    If P ∪ {e} is frequent then
      call HypercubeDecomposition (P ∪ {e}, S′)
  End for

3.4 Frequency Counting

Generally, the heaviest part of frequent itemset mining is the frequency counting, which is to count the number of transactions including a newly generated itemset. To reduce the computation time, existing algorithms use down project. For an itemset P, down project computes its denotation T(P) by using two subsets P1 and P2 of P. If P = P1 ∪ P2, then T(P) = T(P1) ∩ T(P2). Under the condition that the items of P1 and P2 are sorted by their indices, the intersection can be computed in O(|T(P1)| + |T(P2)|) time. Down project uses this property, and computes the denotations quickly. Moreover, if |T(P1)| < θ or |T(P2)| < θ holds, we can see that P can never be frequent. This also helps to reduce the computation time.

Apriori-type algorithms accelerate the frequency counting by finding a good pair P1 and P2 of subsets of P, such that |T(P1)| + |T(P2)| is small, or either P1 or P2 is infrequent. A backtracking algorithm adds an item e to the current solution P in each iteration, and computes its denotation. By using T({e}), the computation time for the frequency counting is reduced to O(Σ_{e>tail(P)} (|T(P)| + |T({e})|)).

The bitmap method [5] is a technique for speeding up the computation of the intersection in down project. It uses a bitmap image (the characteristic vector) of the denotations. To take the intersection, we have to take O(|T|) time with a bitmap. However, a 32-bit CPU can take the intersection of 32 bits at once, thus roughly speaking the computation time is reduced to 1/32. This method has a disadvantage for sparse datasets, and is not orthogonal to the anytime database reduction described below. From the results of the experiments in FIMI03, the bitmap method seems not to be good for sparse large datasets.

The LCM algorithms use another method for the frequency counting, called occurrence deliver [18, 19]. Occurrence deliver computes the denotations of P ∪ {e} for e = tail(P)+1, ..., |I| at once by tracing the transactions in T(P). It uses a bucket for each item e to be added, and sets them to the empty set at the beginning. Then, for each transaction t ∈ T(P), occurrence deliver inserts t into the bucket of e for each e ∈ t, e > tail(P). After these insertions, the bucket of e is equal to T(P ∪ {e}). For each transaction t, occurrence deliver takes O(|t ∩ {tail(P)+1, ..., |I|}|) time. Thus, the computation time is O(Σ_{T∈T(P)} |T ∩ {tail(P)+1, ..., |I|}|) = O(|T(P)| + Σ_{e>tail(P)} |T(P ∪ {e})|). This time complexity is smaller than that of down project. We describe the pseudo code of occurrence deliver in the following.

ALGORITHM OccurrenceDeliver (T: database, P: itemset)
1. Set Bucket[e] := ∅ for each item e > tail(P)

dataset                     chess   accidents   BMS-WebView2   T40I10D100K
minimum support             40%     30%         0.05%          0.1%
reduction factor
  by FP-tree                2.27    6.01        1.9            1.57
reduction factor by
  hypercube decomposition   6.25    1           1.21           1
reduction factor
  by apriori (best)         1.11    1.34        1.35           2.85

Table 1: Efficiency test of FP-tree, hypercube decomposition, and apriori: the reduction factor of FP-tree is (sum of # of elements in the reduced database by LCM) / (sum of # of elements in the reduced database by FP-tree) over all iterations; the reduction ratio of hypercube decomposition is the average number of output frequent itemsets in an iteration; and the reduction ratio of apriori is (sum of Σ_{e>tail(P)} |T(P ∪ {e})|) / (sum of Σ_{e∈F(P)} |T(P ∪ {e})|) over all iterations.

2. For each transaction t ∈ T(P) do
3.   For each item e ∈ t, e > tail(P) do
4.     Insert t into Bucket[e]
5.   End for
6. End for
7. Output Bucket[e] for all e > tail(P)

Let F(P) be the set of items e such that e > tail(P) and P ∪ {e} is frequent. Apriori algorithms may find out in a short time that P ∪ {e} is infrequent; thus, in the best case, their computation time can be reduced to O(Σ_{e∈F(P)} |T(P ∪ {e})|). If Σ_{e>tail(P), e∉F(P)} |T(P ∪ {e})| is large, occurrence deliver will be slow.

To decrease Σ_{e>tail(P), e∉F(P)} |T(P ∪ {e})|, the LCM algorithms sort the indices of the items e in increasing order of |T({e})|. As we can see in Table 1, this sort reduces Σ_{e>tail(P), e∉F(P)} |T(P ∪ {e})| to 1/4 of Σ_{e>tail(P)} |T(P ∪ {e})| in many cases. Since apriori algorithms take much time to maintain previously obtained itemsets, the possibility of a speedup by apriori algorithms is not so large.

The LCM algorithms further speed up the frequency counting by iteratively reducing the database. Suppose that an iteration I of a backtracking algorithm receives a frequent itemset P from its parent. Then, in any descendant iteration of I, no item with an index smaller than tail(P) is added. Hence, any such item can be removed from the database during the execution of the descendant iterations. Similarly, the transactions not including P never include the current solution of any descendant iteration, thus such transactions can be removed during the execution of the descendant iterations. Indeed, infrequent items can be removed, and the identical transactions can be merged into one. According to this, the LCM algorithms recursively reduce the database during the execution of the recursive calls. Before a recursive call, the LCM algorithms generate a reduced database according to the above discussion, and pass it to the recursive call. We call this technique anytime database reduction.

Anytime database reduction reduces the computation time of the iterations located at the lower levels of the recursion tree. In the recursion tree, many iterations are on the lower levels and few iterations are on the upper levels. Thus, anytime database reduction is expected to be efficient. In our experiments, anytime database reduction works quite well. The following table shows the efficiency of anytime database reduction. We sum up over all iterations the sizes of the databases received from the parent, both with anytime database reduction and without it. Each cell shows the sum. The reduction factor is large especially if the dataset is dense and the minimum support is large.

3.5 Prefix Preserving Closure Extension

Many existing algorithms for mining closed itemsets are based on frequent itemset mining. That is, the algorithms enumerate frequent itemsets, and output those that are closed. This approach is efficient when the number of frequent itemsets and the number of frequent closed itemsets do not differ much. However, if the difference between them is large, the algorithms generate many non-closed frequent itemsets,

dataset                      connect     pumsb        BMS-WebView2   T40I10D100K
minimum support              50%         60%          0.1%           0.03%
Database reduction           188319235   2125460007   2280260        1704927639
Anytime database reduction   538931      7777187      521576         77371534
Reduction factor             349.4       273.2        4.3            22.0

Table 2: Accumulated number of transactions in the database over all iterations

thus they will not be efficient. Many pruning methods have been developed for speeding this up; however, they are not complete. Thus, the computation time is not bounded by a linear function in the number of frequent closed itemsets. There is a possibility of a super-linear increase of the computation time in the number of outputs.

LCM uses prefix preserving closure extension (ppc-extension in short) for generating closed itemsets [18, 19]. For a closed itemset P, we define the closure tail clo_tail(P) as the item i of minimum index satisfying clo(P(i)) = P. clo_tail(P) is always included in P. We say that P′ is a ppc extension of P if P′ = clo(P ∪ {e}) and P′(e−1) = P(e−1) hold for an item e > clo_tail(P). Let P0 be the itemset satisfying T(P0) = T. Any closed itemset P′ ≠ P0 is a ppc extension of another closed itemset P, and such P is unique for P′. Moreover, the frequency of P is strictly larger than that of P′; hence ppc extension induces a rooted tree on the frequent closed itemsets. LCM starts from P0, and finds all frequent closed itemsets in a depth-first manner by recursively generating ppc extensions. The proofs for the ppc extension algorithms are described in [18, 19].

By ppc extension, the time complexity is bounded by a linear function in the number of frequent closed itemsets. Hence, the computation time of LCM is never super-linear in the number of frequent closed itemsets.

3.6 Closure Operation

To enumerate closed itemsets, we have to check whether the current solution P is a closed itemset or not. In the existing studies, there are two methods for this task. The first method is to store in memory the previously obtained itemsets which are currently maximal among the itemsets having the identical denotation. In this method, we find frequent itemsets one by one, and store them in memory, removing itemsets included in another itemset having the identical denotation. After finding all frequent itemsets, only the closed itemsets remain in memory. We call this the storage method. The second method is to generate the closure of P. By adding to P all items e such that frq(P) = frq(P ∪ {e}), we can construct the closure of P. We call the second method the closure operation.

LCM uses closure operations for generating ppc extensions. Similar to the frequency counting, we use database reduction for the closure operation. Suppose that the current solution is P, the reduced database is composed of transactions S1, ..., Sh, and each Sl is obtained from transactions T_1^l, ..., T_k^l of the original database. For each Sl, we define the interior intersection In(Sl) by ∩_{T∈{T_1^l,...,T_k^l}} T. Here the closure of P is equal to ∩_{S∈{S1,...,Sh}} In(S). Thus, by using interior intersections, we can efficiently construct the closure of P.

When we merge transactions to reduce the database, the interior intersections can be updated efficiently, by taking the intersection of their interior intersections. In the same way as in the frequency counting, we can remove infrequent items from the interior intersections for more reduction. The computation time for the closure operation in LCM depends on the size of the database, but not on the number of previously obtained itemsets. Thus, the storage method has advantages if the number of frequent closed itemsets is small. However, for the instances with a lot of frequent closed itemsets, which take a long time to solve, LCM has an advantage.

3.7 Enumerating Maximal Frequent Itemsets

Many existing algorithms for maximal frequent itemset enumeration are based on the enumeration of frequent itemsets. In a breadth-first or depth-first manner, they enumerate frequent itemsets and output the maximal itemsets among them. To reduce the computation time, the algorithms prune the unnecessary itemsets and recursive calls.

Similar to these algorithms, LCMmax enumerates closed itemsets by backtracking, and outputs the maximal itemsets among them. It uses a pruning to cut off unnecessary branches of the recursion. The pruning is based on a re-ordering of the indices of the items in each iteration. We explain the re-ordering in the

following.

Let us consider a backtracking algorithm for enumerating frequent itemsets. Let P be the current solution of an iteration of the algorithm. Suppose that P′ is a maximal frequent itemset including P. LCMmax puts new indices on the items with indices larger than tail(P) so that any item in P′ has an index larger than any item not in P′. Note that this re-ordering of indices has no effect on the correctness of the algorithm.

Let e > tail(P) be an item in P′, and consider the recursive call with respect to P ∪ {e}. Any frequent itemset P̂ found in the recursive call is included in P′, since every item having an index larger than e is included in P′, and the recursive call adds to P only items with indices larger than e. From this, we can see that by the re-ordering of indices, the recursive calls with respect to items in P′ ∩ H generate no maximal frequent itemset other than P′.

According to this, an iteration of LCMmax chooses an item e* ∈ H, and generates a recursive call with respect to P ∪ {e*} to obtain a maximal frequent itemset P′. Then, it re-orders the indices of the items other than e* as above, and generates recursive calls with respect to each e > tail(P) not included in P′ ∪ {e*}. In this way, we save the computation time for finding P′, and by finding a large itemset, increase the efficiency of this approach. In the following, we describe LCMmax.

ALGORITHM LCMmax (P: itemset, H: items to be added)
1. H′ := the set of items e in H s.t. P ∪ {e} is frequent
2. If H′ = ∅ then
3.   If P ∪ {e} is infrequent for any e then output P; return
4. End if
5. End if
6. Choose an item e* ∈ H′; H′ := H′ \ {e*}
7. LCMmax (P ∪ {e*}, H′)
8. P′ := the frequent itemset of the maximum size found in the recursive call in 7
9. For each item e ∈ H \ P′ do
10.   H′ := H′ \ {e}
11.   LCMmax (P ∪ {e}, H′)
12. End for

3.8 Checking Maximality

When LCMmax finds a frequent itemset P, it checks whether the current solution is maximal or not. We call this operation the maximality check. The maximality check is a heavy task, thus many existing algorithms avoid it. They store in memory the maximal itemsets among the previously obtained frequent itemsets, and update them when they find a new itemset. When the algorithms terminate and have obtained all frequent itemsets, only the maximal frequent itemsets remain in memory. We call this the storage method. If the number of maximal frequent itemsets is small, the storage method is efficient. However, if the number is large, the storage method needs much memory. When a frequent itemset is newly found, the storage method checks whether the itemset is included in some itemset in memory or not. If the number of frequent itemsets is large, this operation takes a long time.

To avoid the disadvantage of the storage method, LCMmax performs the maximality check. LCMmax checks the maximality by finding an item e such that P ∪ {e} is frequent. If and only if such an e exists, P is not maximal. To perform this efficiently, we reduce the database. Let us consider an iteration of LCMmax with respect to a frequent itemset P. The LCM algorithms reduce the database by anytime database reduction for the frequency counting. Suppose that the reduced database is composed of transactions S1, ..., Sh, and each Sl is obtained by merging transactions T_1^l, ..., T_k^l of the original database. Let H be the set of items to be added in the iteration. Suppose that we remove from H all items e such that P ∪ {e} is infrequent. Then, for any l, T_1^l ∩ H = T_2^l ∩ H = ... = T_k^l ∩ H holds. For an item e and a transaction Sl, we define the weight w(e, Sl) as the number of transactions in T_1^l, ..., T_k^l including e. Here the frequency of P ∪ {e} is Σ_{S∈{S1,...,Sh}} w(e, S). Thus, by using the weights, we can efficiently check the maximality, in time linear in the size of the reduced database.

When we merge transactions to reduce the database, the weights can be updated easily. For each item e, we take the sum of w(e, S) over all transactions S to be merged. In the same way as in the frequency counting, we can remove infrequent items from the database for maximality checking, for more reduction.

The computation time for the maximality check in LCMmax depends on the size of the database, but not on the number of previously obtained itemsets. Thus, the storage method has advantages if the number of maximal frequent itemsets is small, but for the instances with a lot of maximal frequent itemsets, which take a long time to solve, LCMmax has an advantage.
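The maximality test itself can be sketched as follows. This is a naive version on an assumed toy database (the names and data are ours for illustration); LCMmax evaluates the same condition using the weights w(e, S) on the reduced database rather than rescanning the original one:

```python
# Assumed toy database and minimum support, for illustration only.
T = [frozenset(t) for t in ({1, 2, 3}, {1, 2, 3}, {1, 2}, {2, 3})]
theta = 2
items = sorted(set().union(*T))

def frq(P):
    """frq(P): the number of transactions including the itemset P."""
    return sum(1 for t in T if P <= t)

def is_maximal(P):
    """A frequent itemset P is maximal iff no item e outside P keeps
    P ∪ {e} frequent -- the condition LCMmax checks."""
    if frq(P) < theta:
        return False  # only frequent itemsets can be maximal
    return all(frq(P | {e}) < theta for e in items if e not in P)
```

For example, {1, 2} is frequent here but not maximal, since adding item 3 still yields a frequent itemset, whereas {1, 2, 3} is maximal.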

[Figure 1: Results 1. Three panels (BMS-WebView-2-all, BMS-WebView-2-closed, BMS-WebView-2-maximal) plotting CPU time (sec, log scale) against minimum support (%), for afopt, fpg, lcm, mafia, dci, and patriciamine.]

4 Computational Experiments

In this section, we show the results of our computational experiments. We implemented our three algorithms LCM, LCMfreq, and LCMmax. They are coded in ANSI C and compiled with gcc. The experiments were executed on a notebook PC with an AMD Athlon XP 1600+ and 224MB of memory. The performance of the LCM algorithms is compared with the algorithms that scored well at FIMI'03: fpgrowth [8], afopt [11], MAFIA [5, 6], kDCI [12], and PATRICIAMINE [16]. We note that kDCI and PATRICIAMINE are only for all frequent itemset mining. To reduce the time for the experiments, we stopped the execution when an algorithm took more than 10 minutes, and we do not plot a result if the computation time is over 10 minutes. The results are displayed in Figures 1 and 2. In each graph, the horizontal axis is the minimum support and the vertical axis is the CPU time, plotted on a log scale.

From the performances of the implementations, the instances were classified into three groups, within which the results are similar. Due to space limitations, we show one instance as a representative of each group. The first group is composed of BMS-WebView1, BMS-WebView2, BMS-POS, T10I4D100K, kosarak, and retail. These datasets have many items and transactions but are sparse; we call them sparse datasets. We chose BMS-WebView2 as the representative.

The second group is composed of datasets taken

(Figure 2: six plots of CPU time (sec, log scale) against minimum support (%) on chess and accidents, for all, closed, and maximal frequent itemset mining; algorithms shown: afopt, fpgrowth, lcm, mafia, dci, and patriciamine.)

Figure 2: Results 2


from the UCI Machine Learning Repository (http://www.ics.uci.edu/~mlearn/MLRepository.html): connect, chess, mushrooms, pumsb, and pumsb-star. These datasets have many transactions but few items. We call these middle density datasets. As a representative, we show the result of chess.

The third group is accidents. It is different from every other dataset: it has a huge number of transactions but few distinct items, and each transaction includes many items, so the dataset is very dense. We call this the very dense dataset.

On almost all instances and minimum supports, the LCM algorithms perform well. When the minimum support is large, the LCM algorithms are the fastest on all instances, because of their fast initialization. On all instances, with any minimum support, LCM outperforms the other closed itemset mining algorithms. This shows the efficiency of ppc extension.

For sparse datasets, the LCM algorithms are the fastest for any minimum support. The benefit of the FP-tree is not large here, and occurrence deliver works efficiently. The performances of afopt and fpgrowth are quite similar on these problems; they are the second best, 2 to 10 times slower than the LCM algorithms, and for enumerating frequent closed itemsets they take much time when the number of closed itemsets is large. Although PATRICIAMINE is as fast as fpgrowth and afopt, it terminated abnormally on some instances. kDCI is slow when the number of frequent itemsets is large. MAFIA was the slowest on these instances for any minimum support.

For middle density datasets, LCM is the fastest on all instances for closed itemset mining. For all and maximal frequent itemset mining, LCMfreq and LCMmax are the fastest for large minimum supports on any dataset. For small minimum supports, LCMfreq and LCMmax are the fastest on half of the instances; on the other instances, the results are case by case, and each algorithm won in some cases.

For accidents, the LCM algorithms are the fastest when the minimum support is large. For small supports, LCM (closed) is the fastest, but LCMfreq and LCMmax are slower than fpgrowth. On this dataset the FP-tree is very effective, with a compression ratio of up to 6, and the bitmap also benefits from the density. Hence, the time for frequency counting is short in the existing implementations. However, thanks to ppc extension, LCM retains an advantage for closed itemset mining, so LCM (closed) is the fastest.

5 Conclusion

In this paper, we proposed a fast implementation of LCM for enumerating frequent closed itemsets, which is based on prefix preserving closure extension. We further gave implementations LCMfreq and LCMmax for enumerating all frequent itemsets and maximal frequent itemsets by modifying LCM. We showed by computational experiments that our implementations of LCM, LCMfreq and LCMmax outperform the other algorithms on many datasets, especially on sparse datasets. There is a possibility of speeding up the LCM algorithms by developing more efficient maximality checking algorithms, or by developing a hybrid of array and FP-tree-like data structures.

Acknowledgment

This research is supported by joint-research funds of National Institute of Informatics.

References

[1] R. Agrawal and R. Srikant, "Fast Algorithms for Mining Association Rules in Large Databases," In Proceedings of VLDB '94, pp. 487–499, 1994.

[2] R. Agrawal, H. Mannila, R. Srikant, H. Toivonen and A. I. Verkamo, "Fast Discovery of Association Rules," In Advances in Knowledge Discovery and Data Mining, MIT Press, pp. 307–328, 1996.

[3] R. J. Bayardo Jr., "Efficiently Mining Long Patterns from Databases," In Proc. SIGMOD'98, pp. 85–93, 1998.

[4] E. Boros, V. Gurvich, L. Khachiyan, and K. Makino, "On the Complexity of Generating Maximal Frequent and Minimal Infrequent Sets," STACS 2002, pp. 133–141, 2002.

[5] D. Burdick, M. Calimlim, J. Gehrke, "MAFIA: A Maximal Frequent Itemset Algorithm for Transactional Databases," In Proc. ICDE 2001, pp. 443–452, 2001.

[6] D. Burdick, M. Calimlim, J. Flannick, J. Gehrke, and T. Yiu, "MAFIA: A Performance Study of Mining Maximal Frequent Itemsets," In Proc. IEEE ICDM'03 Workshop FIMI'03, 2003. (Available as CEUR Workshop Proc. series, Vol. 90, http://ceur-ws.org/vol-90)

[7] B. Goethals, the FIMI'03 Homepage, http://fimi.cs.helsinki.fi/, 2003.


[8] G. Grahne and J. Zhu, "Efficiently Using Prefix-trees in Mining Frequent Itemsets," In Proc. IEEE ICDM'03 Workshop FIMI'03, 2003. (Available as CEUR Workshop Proc. series, Vol. 90, http://ceur-ws.org/vol-90)

[9] J. Han, J. Pei, Y. Yin, "Mining Frequent Patterns without Candidate Generation," SIGMOD Conference 2000, pp. 1–12, 2000.

[10] R. Kohavi, C. E. Brodley, B. Frasca, L. Mason and Z. Zheng, "KDD-Cup 2000 Organizers' Report: Peeling the Onion," SIGKDD Explorations, 2(2), pp. 86–98, 2000.

[11] Guimei Liu, Hongjun Lu, Jeffrey Xu Yu, Wang Wei, and Xiangye Xiao, "AFOPT: An Efficient Implementation of Pattern Growth Approach," In Proc. IEEE ICDM'03 Workshop FIMI'03, 2003. (Available as CEUR Workshop Proc. series, Vol. 90, http://ceur-ws.org/vol-90)

[12] S. Orlando, C. Lucchese, P. Palmerini, R. Perego and F. Silvestri, "kDCI: a Multi-Strategy Algorithm for Mining Frequent Sets," In Proc. IEEE ICDM'03 Workshop FIMI'03, 2003. (Available as CEUR Workshop Proc. series, Vol. 90, http://ceur-ws.org/vol-90)

[13] N. Pasquier, Y. Bastide, R. Taouil, L. Lakhal, "Efficient Mining of Association Rules Using Closed Itemset Lattices," Inform. Syst., 24(1), pp. 25–46, 1999.

[14] N. Pasquier, Y. Bastide, R. Taouil, L. Lakhal, "Discovering Frequent Closed Itemsets for Association Rules," In Proc. ICDT'99, pp. 398–416, 1999.

[15] J. Pei, J. Han, R. Mao, "CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets," ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery 2000, pp. 21–30, 2000.

[16] A. Pietracaprina and D. Zandolin, "Mining Frequent Itemsets using Patricia Tries," In Proc. IEEE ICDM'03 Workshop FIMI'03, 2003. (Available as CEUR Workshop Proc. series, Vol. 90, http://ceur-ws.org/vol-90)

[17] S. Tsukiyama, M. Ide, H. Ariyoshi and I. Shirakawa, "A New Algorithm for Generating All the Maximum Independent Sets," SIAM Journal on Computing, Vol. 6, pp. 505–517, 1977.

[18] T. Uno, T. Asai, Y. Uchida, H. Arimura, "LCM: An Efficient Algorithm for Enumerating Frequent Closed Item Sets," In Proc. IEEE ICDM'03 Workshop FIMI'03, 2003. (Available as CEUR Workshop Proc. series, Vol. 90, http://ceur-ws.org/vol-90)

[19] T. Uno, T. Asai, Y. Uchida, H. Arimura, "An Efficient Algorithm for Enumerating Closed Patterns in Transaction Databases," to appear in Proc. of Discovery Science 2004, 2004.

[20] M. J. Zaki, C. Hsiao, "CHARM: An Efficient Algorithm for Closed Itemset Mining," 2nd SIAM International Conference on Data Mining (SDM'02), pp. 457–473, 2002.

[21] Z. Zheng, R. Kohavi and L. Mason, "Real World Performance of Association Rule Algorithms," KDD 2001, pp. 401–406, 2001.
