Efficient Mining of Frequent Item Set using Recursive Algorithm
International Journal of Advanced Engineering Research and Technology (IJAERT)
Volume 2 Issue 2, May 2014, ISSN No.: 2348 8190
61
www.ijaert.org
Efficient Mining of Frequent Item Set using Recursive Algorithm
Bhumika H. Patel
Department of Computer Science and Engineering, PIET, Limda
Gujarat Technological University
Vadodara, India
Abstract - Frequent pattern (or item set) mining is the extraction of interesting collections of items from a dataset. Frequent item sets are used to obtain the collections of items that meet a user's requirements. Researchers have proposed various algorithms for this task, such as Apriori, Eclat, RElim and SaM. A problem arises in the ordering of items when one item is selected as the prefix for mining frequent item sets, and this problem affects performance. RElim is one of the algorithms introduced for frequent item set mining. In this paper, two approaches are considered for solving this problem, and their results are compared with the execution of RElim.
Index Terms - Data mining, Frequent item set mining, RElim, Support.
I. INTRODUCTION
In today's business world, data mining is gaining popularity because of its contribution to organizational profits. The core idea of data mining is to extract useful and previously unknown information or patterns from large datasets. Data mining is currently used in a wide range of profiling practices, such as scientific discovery, marketing, fraud detection and surveillance [11]. Frequent item set mining works on the principle of finding the item sets that occur frequently and together in the transaction set. Various algorithms such as Apriori [1,2], FP-Growth [4], Eclat [5], RElim [6] and SaM [7] have been proposed since Agrawal et al. first introduced the problem of deriving categorical association rules from transactional databases [2]. Frequent item set mining is widely studied in data mining because of its broad applications in mining association rules, correlations, graph patterns, constraint-based frequent patterns and many other data mining tasks.
Let I = {i1, ..., in} be a set of n distinct items and let DB be a database consisting of m transactions {t1, ..., tm} such that each transaction ti is a subset of I. An itemset or pattern x is a subset of I; if |x| = k, it is called a k-itemset. One of the properties of x is its support count, Sup(x), which is the number of transactions in DB that contain the itemset x. If Sup(x) is no less than a user-specified threshold, called Minsup, x is called a frequent pattern. The aim of frequent pattern mining is to find all patterns in a given database DB that satisfy Minsup. As the minimum threshold decreases, the number of resulting frequent patterns grows. Eliminating infrequent patterns effectively during the mining process is therefore one of the main issues in frequent pattern mining. Our work addresses one aspect of this issue: which item to select for elimination when two or more items have the same frequency.
The rest of this paper is organized as follows. Section 2 describes related work on frequent item set mining. Section 3 presents the limitation of an existing algorithm. Section 4 presents the proposed work, Section 5 compares the algorithms, and Section 6 concludes the paper.
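The definitions of Sup(x) and Minsup given in the introduction can be made concrete with a small Python sketch. It is only an illustration: the function names and the toy database are assumptions made here, not taken from the paper, and the brute-force enumeration shows the definition rather than an efficient mining method.

```python
from itertools import combinations

def support(itemset, db):
    """Sup(x): number of transactions in db that contain every item of itemset."""
    items = set(itemset)
    return sum(1 for t in db if items <= set(t))

def frequent_itemsets_bruteforce(db, minsup):
    """Enumerate every non-empty itemset over I and keep those with Sup(x) >= minsup."""
    universe = sorted(set().union(*map(set, db)))
    result = {}
    for k in range(1, len(universe) + 1):
        for candidate in combinations(universe, k):
            s = support(candidate, db)
            if s >= minsup:
                result[frozenset(candidate)] = s
    return result

# Hypothetical toy database (not the database shown in Fig. 1).
TOY_DB = [("a", "d"), ("a", "c", "d", "e"), ("b", "d"), ("b", "c", "d"), ("b", "c")]

print(support(("b", "d"), TOY_DB))
print(frequent_itemsets_bruteforce(TOY_DB, minsup=3))
```

The brute-force search visits all 2^n - 1 candidate itemsets, which is why the algorithms discussed in the following sections try to eliminate infrequent patterns early.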
II. RELATED WORK
In this section, we describe a few existing frequent item set mining algorithms, namely: (i) Can-tree, (ii) CP-tree, (iii) RElim.
A. Can-tree: In [10], a tree structure called Can-tree is proposed. The Can-tree algorithm requires only a single scan of the database. Items are arranged according to a canonical order (e.g. alphabetical) chosen by the user. A change in item frequencies therefore does not affect the order of items in the Can-tree, and new transactions can be inserted into the tree without swapping any tree nodes.
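The canonical-order insertion described for the Can-tree can be sketched as a plain prefix tree. This is a simplified illustration under stated assumptions: the class and function names are hypothetical, alphabetical order stands in for the user-chosen canonical order, and none of the further machinery of [10] is reproduced.

```python
class Node:
    """One prefix-tree node: an item, a count, and children keyed by item."""
    def __init__(self, item=None):
        self.item = item
        self.count = 0
        self.children = {}

def insert(root, transaction, key=None):
    """Insert a transaction with its items in a fixed canonical order.

    The default order is alphabetical; `key` may supply another fixed order.
    Because the order never depends on item frequencies, no node is ever swapped.
    """
    node = root
    for item in sorted(set(transaction), key=key):
        node = node.children.setdefault(item, Node(item))
        node.count += 1

def paths(node, prefix=()):
    """List every (path, count) pair in the tree, in canonical order."""
    out = []
    for item in sorted(node.children):
        child = node.children[item]
        out.append((prefix + (item,), child.count))
        out.extend(paths(child, prefix + (item,)))
    return out

root = Node()
for t in [("d", "a"), ("a", "c", "d"), ("b", "d")]:  # hypothetical transactions
    insert(root, t)
print(paths(root))
```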
B. CP-tree: In [9], a dynamic tree structure called CP-tree is proposed. Transactions are inserted in accordance with a predefined item order, which is maintained in a list called the I-list. After some transactions have been inserted, if the item order of the I-list differs from the current frequency-descending item order by a predefined degree, the CP-tree is restructured by a method called branch sorting. The item order in the I-list is then updated to the current one.
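The restructuring trigger can be illustrated with a rough sketch of the I-list check alone. The text does not say how the "predefined degree" is measured, so the inversion count and the `max_inversions` threshold below are assumptions made purely for illustration; branch sorting of the tree itself is not shown.

```python
from collections import Counter

def inversions(i_list, freq):
    """Count item pairs whose I-list order contradicts frequency-descending order."""
    return sum(
        1
        for a in range(len(i_list))
        for b in range(a + 1, len(i_list))
        if freq[i_list[a]] < freq[i_list[b]]
    )

def maybe_reorder(i_list, freq, max_inversions):
    """Return (new I-list, restructured?) using a hypothetical inversion threshold."""
    if inversions(i_list, freq) > max_inversions:
        # Stable sort: items with equal frequency keep their relative order.
        return sorted(i_list, key=lambda i: -freq[i]), True
    return list(i_list), False

I_LIST = ["a", "b", "c", "d"]                        # hypothetical predefined order
FREQ = Counter({"a": 1, "b": 3, "c": 2, "d": 3})     # hypothetical current counts
print(inversions(I_LIST, FREQ))
print(maybe_reorder(I_LIST, FREQ, max_inversions=2))
```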
C. RElim: In [6], the RElim algorithm is proposed, which uses an array-list structure to find frequent item sets. Figure 1 shows the preprocessing steps required by RElim. Step 1 shows the original database. In step 2, the database is scanned to determine the frequency of each item. In step 3, the items in each transaction are sorted in ascending order of frequency. In step 4, the transactions are sorted lexicographically with respect to this item order.
Fig. 1: (1) Database in original form, (2) item frequencies, (3) transactions with sorted items, (4) lexicographically sorted transactions
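Preprocessing steps 2 to 4 can be sketched in a few lines of Python. The database below is a hypothetical toy example, not the one in Fig. 1, and breaking frequency ties alphabetically is an assumption of this sketch.

```python
from collections import Counter

def preprocess(db):
    # Step 2: count in how many transactions each item occurs.
    freq = Counter(item for t in db for item in set(t))
    # Ascending-frequency item order; the alphabetical tie-break is an assumption.
    order = sorted(freq, key=lambda i: (freq[i], i))
    rank = {item: r for r, item in enumerate(order)}
    # Step 3: sort the items of each transaction by ascending frequency.
    by_item = [tuple(sorted(set(t), key=rank.__getitem__)) for t in db]
    # Step 4: sort the transactions lexicographically w.r.t. the item order.
    by_tx = sorted(by_item, key=lambda t: [rank[i] for i in t])
    return freq, order, by_item, by_tx

# Hypothetical toy database in which items a and e have the same frequency.
TOY_DB = [("a", "d"), ("a", "c", "d", "e"), ("b", "d"),
          ("b", "c", "d"), ("b", "c"), ("e", "b", "d")]

freq, order, by_item, by_tx = preprocess(TOY_DB)
print(order)
print(by_tx)
```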
In step 5, the data structure used by RElim is created. This data structure is a list array whose lists are ordered by descending item frequency. Each list has a counter that gives the number of transactions starting with the corresponding leading item, and a pointer to the head of the list. The list elements themselves contain a successor pointer and a pointer to the transaction.
Fig. 2: (5) Data structure used by RElim
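A minimal sketch of how such a list array might be built from the preprocessed transactions is given below. A Python list stands in for the linked list of successor pointers, and the names `ItemList` and `build_list_array` as well as the input data are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class ItemList:
    item: str
    count: int = 0                                  # transactions whose leading item is `item`
    suffixes: list = field(default_factory=list)    # remainder of each such transaction

def build_list_array(sorted_tx, ascending_order):
    """Assign each transaction to the list of its leading item, dropping that item."""
    lists = {item: ItemList(item) for item in ascending_order}
    for t in sorted_tx:
        if not t:
            continue
        lists[t[0]].count += 1
        if len(t) > 1:
            lists[t[0]].suffixes.append(tuple(t[1:]))
    # The array is laid out in descending frequency, so the least frequent item is last.
    return [lists[item] for item in reversed(ascending_order)]

# Hypothetical preprocessed input: item order (ascending frequency) and sorted transactions.
ORDER = ["a", "e", "c", "b", "d"]
SORTED_TX = [("a", "e", "c", "d"), ("a", "d"), ("e", "b", "d"),
             ("c", "b"), ("c", "b", "d"), ("b", "d")]

ARRAY = build_list_array(SORTED_TX, ORDER)
for entry in ARRAY:
    print(entry.item, entry.count, entry.suffixes)
```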
The basic operations of the RElim algorithm are illustrated in Figure 3. Processing starts by eliminating the least frequent item from the list array; the corresponding list elements are transferred to the conditional database for that item. The item to be processed is the one associated with the last (rightmost) list (in the example, item e).
Fig. 3: Basic operations of RElim
If the counter associated with the list, which states the support of the item, is no less than the minimum support, the item set consisting of this item and the prefix of the conditional database is reported as frequent. In addition, the list is traversed and its elements are copied to construct a new list array, which represents the conditional database of transactions containing the item. In this operation, the leading item of each transaction (suffix) is used as an index into the list array to find the list it has to be added to, and the leading item is then removed (see Figure 3 on the right). The resulting conditional database is processed recursively to find all frequent item sets containing the list item.
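The recursive elimination just described can be sketched compactly in Python. This is a simplified reading of the procedure, not the reference implementation of [6]: items are replaced by their rank in ascending-frequency order, Python lists replace the linked lists, ties are broken alphabetically by assumption, and the toy database is hypothetical.

```python
from collections import Counter

def _assign(lists, suffixes):
    """Add each non-empty suffix to the list of its leading item, removing that item."""
    for s in suffixes:
        if s:
            lists[s[0]][0] += 1
            lists[s[0]][1].append(s[1:])

def _recurse(lists, minsup, prefix, order, out):
    for r in range(len(lists)):            # r = 0 is the least frequent item
        count, suffixes = lists[r]
        if count >= minsup:
            itemset = prefix + (order[r],)
            out[frozenset(itemset)] = count
            # Conditional database of the transactions that contain item r.
            cond = [[0, []] for _ in lists]
            _assign(cond, suffixes)
            _recurse(cond, minsup, itemset, order, out)
        # Eliminate item r: reassign its transactions to the remaining lists.
        _assign(lists, suffixes)
        lists[r] = [0, []]

def relim(db, minsup):
    """Return {itemset: support} for all itemsets with support >= minsup (minsup >= 1)."""
    freq = Counter(i for t in db for i in set(t))
    order = sorted(freq, key=lambda i: (freq[i], i))   # tie-break is an assumption
    rank = {item: r for r, item in enumerate(order)}
    lists = [[0, []] for _ in order]                   # [counter, suffixes] per item
    _assign(lists, [sorted(rank[i] for i in set(t)) for t in db])
    out = {}
    _recurse(lists, minsup, (), order, out)
    return out

# Hypothetical toy database.
TOY_DB = [("a", "d"), ("a", "c", "d", "e"), ("b", "d"), ("b", "c", "d"), ("b", "c")]
for itemset, sup in relim(TOY_DB, minsup=2).items():
    print(sorted(itemset), sup)
```

Because every suffix only contains items of higher rank than its leading item, reassigning a list never modifies the list that is currently being traversed.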
III. LIMITATION OF EXISTING ALGORITHM
The limitation of the RElim algorithm is that its performance decreases when the dataset has a large number of attributes. With more attributes, each transaction also contains more items. This makes it difficult to select the prefix among items that have the same frequency.
IV. PROPOSED WORK
The frequent item set mining problem can be solved using many approaches, one of which is the RElim algorithm. As discussed in the previous section, this algorithm has a limitation. To overcome it, we propose two different approaches. Because RElim uses an array-list data structure, it takes less running time than tree-based FI mining algorithms. RElim is based on three processing steps: deleting items, recursive processing, and reassigning transactions. We consider the item deletion step, where there is scope for improvement in terms of time. All preprocessing steps are therefore the same as in RElim; the difference lies in the order in which items are chosen for pruning when they have the same frequency. The two proposed approaches for choosing among items of equal frequency are alphabetical order and the order of occurrence of the items in the database.
Suppose the database and preprocessing steps 1 to 4 shown in Figure 1 are unchanged, and the data structure shown in Figure 2 is also the same. To begin step 6, we need to consult the list to select an item as the prefix for pruning. In Figure 2, item e and item a have the same frequency, so it is unclear whether item a or item e should be selected as the prefix, and all the recursive processing and reassignment of transactions depends on this choice. With the first approach, alphabetical order, the item-order list is {a,b,c,d,e}, as shown in Figure 3 (Method-I), and item a is selected as the prefix. Its list elements are transferred to the conditional database, and the leading item of each transaction is used as an index into the list array to find the list it has to be added to (in our case only c). The support count of that list is incremented by the number of transactions added to it (in our case c is incremented by 2).
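The two tie-breaking rules proposed in this section, alphabetical order and order of occurrence, can be sketched as alternative sort keys for the elimination order. This is one possible reading of the proposal rather than the author's implementation; the function name, the `tie_break` parameter and the toy database are assumptions.

```python
from collections import Counter

def elimination_order(db, tie_break="alphabetical"):
    """Items in ascending frequency; equal-frequency items ordered by the chosen rule."""
    freq = Counter(i for t in db for i in set(t))
    first_seen = {}
    for t in db:
        for i in t:
            first_seen.setdefault(i, len(first_seen))   # position of first occurrence
    if tie_break == "alphabetical":        # Method-I
        key = lambda i: (freq[i], i)
    elif tie_break == "occurrence":        # Method-II
        key = lambda i: (freq[i], first_seen[i])
    else:
        raise ValueError("tie_break must be 'alphabetical' or 'occurrence'")
    return sorted(freq, key=key)

# Hypothetical toy database: a and e both occur twice, and e appears first.
TOY_DB = [("e", "d"), ("a", "c", "d", "e"), ("b", "d"),
          ("b", "c", "d"), ("b", "c"), ("a", "b", "d")]

print(elimination_order(TOY_DB, "alphabetical"))
print(elimination_order(TOY_DB, "occurrence"))
```

In this toy database the first item chosen as prefix is a under the alphabetical rule and e under the order-of-occurrence rule, while the order of the remaining items is the same.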
Fig. 3: Operations of Modified RElim using Method-I
With the second approach, order of occurrence, the item-order list is {e,d,b,a,c}, as shown in Figure 4, and item e is selected as the prefix. Its list elements are transferred to the conditional database. The leading items b and c are used as indices into the list array to find the lists the transactions have to be added to, and the support counts of both lists are incremented by 1, since one transaction bd is added to the list of b and one transaction cbd is added to the list of c.
Fig. 4: Operations of Modified RElim using Method-II
V. COMPARING ALGORITHMS
Tree-based algorithms require more time to find frequent itemsets, while RElim requires less execution time because it uses only three steps: deleting items, recursive processing, and reassigning transactions. With the modified approach, the algorithm is expected to take less execution time, and each item gets its due importance when frequent itemsets are generated.
VI. CONCLUSION
This paper gives a brief introduction to the algorithms used in the area of frequent item set mining. The RElim algorithm is based on an array-list structure and is easy to implement. Modified RElim extends the existing RElim by maintaining an item-order list for items with the same frequency. As future work, building on the comparison of such items, I am going to analyse the behavior of various interestingness measures in mining frequent itemsets.
ACKNOWLEDGMENT
My most sincere thanks go to my advisor, Asst. Prof. Neha Pandya. I thank her for giving me the opportunity to work in the area of FI mining, and for her guidance, encouragement and support during the initial development of this project. I would also like to thank her for the time she spared for me from her extremely busy schedule.
REFERENCES
[1] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In VLDB'94, pp. 487-499.
[2] R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. In Proc. 1993 ACM-SIGMOD Int. Conf. Management of Data, Washington, D.C., May 1993, pp. 207-216.
[3] J. Han, J. Pei, Y. Yin, and R. Mao. Mining frequent patterns without candidate generation: a frequent-pattern tree approach. Data Mining and Knowledge Discovery, 2003.
[4] J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. In Proc. Conf. on the Management of Data (SIGMOD'00, Dallas, TX). ACM Press, New York, NY, USA 2000.
[5] M. Zaki, S. Parthasarathy, M. Ogihara, and W. Li. New algorithms for fast discovery of association rules. Proc. 3rd Int. Conf. on Knowledge Discovery and Data Mining (KDD'97, Newport Beach, CA), 283-296. AAAI Press, Menlo Park, CA, USA 1997.
[6] C. Borgelt. Keeping things simple: finding frequent item sets by recursive elimination. Proc. Workshop Open Software for Data Mining (OSDM'05 at KDD'05, Chicago, IL), 66-70. ACM Press, New York, NY, USA 2005.
[7] C. Borgelt. Simple algorithms for frequent item set mining. Springer-Verlag, Berlin, Germany 2010.
[8] J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2000.
[9] S.K. Tanbeer, C.F. Ahmed, B.-S. Jeong, Y.-K. Lee. Efficient single-pass frequent pattern mining using a prefix-tree. Information Sciences 179 (2009) 559-583.
[10] C.K.-S. Leung, Q.I. Khan, Z. Li, and T. Hoque. CanTree: A canonical-order tree for incremental frequent-pattern mining. KAIS, 11 (3), pp. 287-311, Apr. 2007.
[11] R. Somkumar. A study on various data mining approaches of association rules. Int. J. Comput. Sci. Eng. Vol. 2, pp. 141-144.
[12] C.L. Blake and C.J. Merz. UCI Repository of Machine Learning Databases. Dept. of Information and Computer Science, University of California at Irvine, CA, USA 1998. http://www.ics.uci.edu/mlearn/MLRepository.html
[13] R. Kohavi, C.E. Bradley, B. Frasca, L. Mason, and Z. Zheng. KDD-Cup 2000 Organizers Report: Peeling the Onion. SIGKDD Exploration 2(2):86-93. ACM Press, New York, NY, USA 2000.
[14] Synthetic Data Generation Code for Associations and Sequential Patterns. Intelligent Information Systems, IBM Almaden Research Center. http://www.almaden.ibm.com/software/quest/Resources/index.shtml