Chapter 6: Mining Sequential Patterns
Arif Djunaidy
e-mail: arif@its-sby.edu
URL: www.its-sby.edu/~arif

Chapter 6 - 2/25 Data Mining Arif Djunaidy FTIF ITS

Outline
- What is sequential rules mining?
- Finding sequential patterns
- AprioriAll Algorithm
- Generalized Sequential Patterns (GSP) Algorithm

What Is Sequential Rules Mining? - 1
Definition: Given a set of objects, each associated with its own timeline of events, find rules that predict strong sequential dependencies among different events.
- Sequence mining: discover sequences of events that commonly occur together.
- Rules are formed by first discovering patterns.
- Event occurrences in the patterns are governed by timing constraints.
- Much higher computational complexity than association rule discovery: there are $O(m^k 2^{k-1})$ possible sequential patterns having $k$ events, where $m$ is the total number of possible events.

What Is Sequential Rules Mining? - 2
- The input data is a set of sequences, called data-sequences.
- Each data-sequence is a list of transactions, where each transaction is a set of literals, called items.
- Typically there is a transaction-time associated with each transaction.
- A sequential pattern also consists of a list of sets of items.
- The problem is to find all sequential patterns with a user-specified minimum support, where the support of a sequential pattern is the percentage of data-sequences that contain the pattern.

What Is Sequential Rules Mining? - 3
Example:
- In the database of a book club, each data-sequence may correspond to all book selections of a customer, and each transaction to the books selected by the customer in one order.
- A sequential pattern might be: 5% of customers bought "Foundation", then "Foundation and Empire", and then "Second Foundation".
- Elements of a sequential pattern can be sets of items, for example, "Foundation" and "Ringworld", followed by "Foundation and Empire" and "Ringworld Engineers", followed by "Second Foundation".

Problem Statement - 1
We are given a database D of customer transactions:
- Each transaction consists of the following fields: customer-id, transaction-time, and the items purchased in the transaction.
- No customer has more than one transaction with the same transaction-time.
- Quantities of items bought in a transaction are not considered.
- Each item is a binary variable representing whether the item was bought or not.
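As a rough illustration of the transaction model above, the following sketch groups raw purchase lines into transactions keyed by customer-id and transaction-time, keeping each item only as present or absent. The function name `build_transactions`, the 4-tuple line layout and the sample rows are assumptions made for this example; they are not taken from the slides.

```python
from collections import defaultdict


def build_transactions(purchase_lines):
    """Group (customer_id, transaction_time, item, quantity) lines.

    Hypothetical helper: returns {(customer_id, transaction_time): set of items}.
    """
    transactions = defaultdict(set)
    for customer_id, transaction_time, item, _quantity in purchase_lines:
        # The quantity is ignored: an item is either in the transaction or not.
        transactions[(customer_id, transaction_time)].add(item)
    return dict(transactions)


# Hypothetical purchase lines for one book-club customer.
lines = [
    (1, 10, "Foundation", 2),
    (1, 10, "Ringworld", 1),
    (1, 20, "Foundation and Empire", 1),
    (1, 20, "Foundation and Empire", 3),  # repeated line, still a single item
]
transactions = build_transactions(lines)
```

Keying on the pair (customer-id, transaction-time) reflects the rule that a customer has at most one transaction per transaction-time.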
Problem Statement - 2
- An itemset is a non-empty set of items. A sequence is an ordered list of itemsets.
- The support for a sequence is defined as the fraction of total customers who support this sequence (a customer supports a sequence if it is contained in that customer's sequence).
- It is assumed that the set of items is mapped to a set of contiguous integers.
- An itemset $i$ is denoted as $(i_1 i_2 \ldots i_m)$, where $i_j$ is an item.
- A sequence $s$ is denoted as $\{ s_1\, s_2 \ldots s_n \}$, where $s_j$ is an itemset.
- A sequence $\{ a_1\, a_2 \ldots a_n \}$ is contained in another sequence $\{ b_1\, b_2 \ldots b_m \}$ if there exist integers $i_1 < i_2 < \ldots < i_n$ such that $a_1 \subseteq b_{i_1},\ a_2 \subseteq b_{i_2},\ \ldots,\ a_n \subseteq b_{i_n}$.
- For example: the sequence { (3) (4 5) (8) } is contained in { (7) (3 8) (9) (4 5 6) (8) }, since (3) ⊆ (3 8), (4 5) ⊆ (4 5 6) and (8) ⊆ (8). However, the sequence { (3) (5) } is not contained in { (3 5) } (and vice versa).

Problem Statement - 3
Given a database D of customer transactions:
- The problem of mining sequential patterns is to find the maximal sequences among all sequences that have a certain user-specified minimum support.
- Each such maximal sequence represents a sequential pattern.
- A sequence satisfying the minimum support constraint is called a large sequence.
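The containment test and the support of a sequence can be sketched as follows. This is a minimal illustration that assumes sequences are stored as Python lists of sets; the names `is_contained` and `support` are hypothetical. Matching each itemset against the earliest possible later itemset is enough to decide containment.

```python
def is_contained(a, b):
    """Return True if sequence a is contained in sequence b.

    Both sequences are lists of sets (itemsets), an assumed representation.
    """
    j = 0
    for itemset in a:
        # Advance to the earliest remaining itemset of b that is a superset.
        while j < len(b) and not itemset <= b[j]:
            j += 1
        if j == len(b):
            return False
        j += 1  # the next itemset of a must match a strictly later itemset of b
    return True


def support(pattern, customer_sequences):
    """Fraction of customer sequences that contain the pattern."""
    hits = sum(is_contained(pattern, seq) for seq in customer_sequences)
    return hits / len(customer_sequences)


# The containment examples quoted in the text.
small = [{3}, {4, 5}, {8}]
big = [{7}, {3, 8}, {9}, {4, 5, 6}, {8}]
print(is_contained(small, big))             # contained
print(is_contained([{3}, {5}], [{3, 5}]))   # not contained
print(is_contained([{3, 5}], [{3}, {5}]))   # not contained either
```

The last two calls show why order and grouping both matter: buying 3 and 5 together is a different event from buying 3 and later 5.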
[Table: Database Sorted by Customer Id and Transaction Time]
[Table: Customer-Sequence Version of the Database]

Problem Statement: Example
- With the minimum support set to 25%, i.e., a minimum support of 2 customers, two sequences, {(30) (90)} and {(30) (40 70)}, are maximal among those satisfying the support constraint and are the desired sequential patterns.
- An example of a sequence that does not have minimum support is {(10 20) (30)}, which is supported only by customer 2.
- The sequences {(30)}, {(40)}, {(70)}, {(90)}, {(30) (40)}, {(30) (70)} and {(40 70)}, though having minimum support, are not in the answer because they are not maximal.

Finding Sequential Patterns
Terminology:
- The length of a sequence is the number of itemsets in the sequence.
- A sequence of length k is called a k-sequence.
- The support for an itemset i is defined as the fraction of customers who bought the items in i in a single transaction.
- An itemset with minimum support is called a large itemset or litemset.
- Note that each itemset in a large sequence must have minimum support. Hence, any large sequence must be a list of litemsets.

Finding Sequential Patterns: The Algorithm - 1
1. Sort Phase. The database (D) is sorted, with customer-id as the major key and transaction-time as the minor key. This step implicitly converts the original transaction database into a database of customer sequences.
2. Litemset Phase. In this phase, we find the set of all litemsets L. We are also simultaneously finding the set of all large 1-sequences, since this set is just { (l) | l ∈ L }.
   - The litemsets are mapped to a set of contiguous integers. In the example database, the large itemsets are (30), (40), (70), (40 70) and (90), which are mapped to 1, 2, 3, 4 and 5, respectively (see next slide).
   - The reason for this mapping is that, by treating litemsets as single entities, we can compare two litemsets for equality in constant time and reduce the time required to check whether a sequence is contained in a customer sequence.

Finding Sequential Patterns: The Algorithm - 2
2. Litemset Phase (example), minsup = 25%
[Table: Customer-Sequence Version of the Database]
[Table: Large itemsets]

Finding Sequential Patterns: The Algorithm - 3
3. Transformation Phase. As we will see later (phase 4), we need to repeatedly determine which of a given set of large sequences are contained in a customer sequence. To make this test fast, we transform each customer sequence into an alternative representation.
   - In a transformed customer sequence, each transaction is replaced by the set of all litemsets contained in that transaction.
   - If a transaction does not contain any litemset, it is not retained in the transformed sequence.
   - If a customer sequence does not contain any litemset, this sequence is dropped from the transformed database. However, it still contributes to the count of the total number of customers.

Finding Sequential Patterns: The Algorithm - 4
3. Transformation Phase (example), minsup = 25%

Finding Sequential Patterns: The Algorithm - 5
4. Sequence Phase. Use the set of litemsets obtained in phase 2 to find the desired sequences. We will illustrate the use of the AprioriAll algorithm (see later).
5. Maximal Phase. Find the maximal sequences among the set of large sequences. In some algorithms (such as AprioriAll), this phase is combined with the sequence phase to reduce the time wasted in counting non-maximal sequences.
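The sort, litemset and transformation phases can be sketched end to end as below. The rows in `ROWS` are an assumption: the original tables are not available here, so the data was chosen only to agree with the figures quoted in the text (5 customers, minimum support of 2 customers, litemsets (30), (40), (70), (40 70) and (90)). The function names are hypothetical, and the litemset search is a brute-force enumeration meant for tiny inputs, not the method a real implementation would use.

```python
from itertools import combinations
from math import ceil

# Assumed rows: (customer_id, transaction_time, items). Times are made up.
ROWS = [
    (2, 2, {30}), (1, 1, {30}), (4, 3, {90}), (2, 1, {10, 20}),
    (3, 1, {30, 50, 70}), (1, 2, {90}), (4, 1, {30}), (5, 1, {90}),
    (2, 3, {40, 60, 70}), (4, 2, {40, 70}),
]


def sort_phase(rows):
    """Sort by customer-id (major) and time (minor); build customer sequences."""
    sequences = {}
    for customer_id, _time, items in sorted(rows, key=lambda r: (r[0], r[1])):
        sequences.setdefault(customer_id, []).append(frozenset(items))
    return sequences


def litemset_phase(sequences, min_count):
    """Brute-force search for itemsets supported by at least min_count customers."""
    counts = {}
    for seq in sequences.values():
        seen = set()  # each itemset is counted once per customer
        for transaction in seq:
            for size in range(1, len(transaction) + 1):
                seen.update(frozenset(c) for c in combinations(transaction, size))
        for itemset in seen:
            counts[itemset] = counts.get(itemset, 0) + 1
    return {itemset for itemset, n in counts.items() if n >= min_count}


def transform_phase(sequences, litemset_ids):
    """Replace each transaction by the ids of the litemsets it contains."""
    transformed = {}
    for customer_id, seq in sequences.items():
        new_seq = []
        for transaction in seq:
            ids = {i for litemset, i in litemset_ids.items() if litemset <= transaction}
            if ids:  # transactions without any litemset are not retained
                new_seq.append(ids)
        if new_seq:  # customers without any litemset are dropped
            transformed[customer_id] = new_seq
    return transformed


sequences = sort_phase(ROWS)
min_count = ceil(0.25 * len(sequences))  # 25% of 5 customers, rounded up
litemsets = litemset_phase(sequences, min_count)
# Mapping quoted in the text: (30)->1, (40)->2, (70)->3, (40 70)->4, (90)->5.
LITEMSET_IDS = {
    frozenset({30}): 1, frozenset({40}): 2, frozenset({70}): 3,
    frozenset({40, 70}): 4, frozenset({90}): 5,
}
transformed = transform_phase(sequences, LITEMSET_IDS)
```

With these assumed rows, customer 2's first transaction (10 20) contains no litemset and disappears from the transformed sequence, while (40 60 70) becomes the set of ids {2, 3, 4}.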
The Algorithm: AprioriAll
- $L_k$ denotes the set of all large k-sequences, and $C_k$ the set of candidate k-sequences.
- In the first pass, the output of the litemset phase is used to initialize the set of large 1-sequences.
- In each subsequent pass, we use the large sequences obtained from the previous pass to generate the candidate sequences and then measure their support by making a pass over the database.
- At the end of the pass, the support of the candidates is used to determine the large sequences.
- The candidates are stored in a hash tree to quickly find all candidates contained in a customer sequence.

AprioriAll: Candidate Generation
The apriori-generate function takes as argument $L_{k-1}$, the set of all large (k-1)-sequences. The function works as follows:
- First, join $L_{k-1}$ with $L_{k-1}$ to form the candidate set $C_k$.
- Next, delete all sequences $c \in C_k$ such that some (k-1)-subsequence of $c$ is not in $L_{k-1}$.
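The join and prune steps of apriori-generate might look like the sketch below, with sequences stored as tuples of litemset ids. The join condition used here (two sequences are paired when they agree on all but their last element) is an assumption based on the usual Apriori-style join, since the join listing itself is not reproduced in this text. The input set `L3` is a hypothetical example.

```python
def apriori_generate(large_prev):
    """Generate candidate k-sequences from the set of large (k-1)-sequences.

    Sequences are tuples of litemset ids (assumed representation).
    """
    # Join: pair sequences that agree on everything but their last element.
    # A sequence may be paired with itself (an assumption), so that repeated
    # litemsets such as (1, 1) can become candidates.
    joined = {
        p + (q[-1],)
        for p in large_prev
        for q in large_prev
        if p[:-1] == q[:-1]
    }

    # Prune: drop candidates having a (k-1)-subsequence that is not large.
    def all_subsequences_large(c):
        return all(c[:i] + c[i + 1:] in large_prev for i in range(len(c)))

    return {c for c in joined if all_subsequences_large(c)}


# Hypothetical set of large 3-sequences.
L3 = {(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)}
C4 = apriori_generate(L3)
print(C4)  # {(1, 2, 3, 4)}
```

In this example the join also produces sequences such as (1, 2, 4, 3) and (1, 3, 4, 5), but each of them has a 3-subsequence missing from `L3`, so only (1, 2, 3, 4) survives pruning.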
AprioriAll: Finding Maximal Sequences
Having found the set S of all large sequences in the sequence phase, the following algorithm can be used for finding maximal sequences. Let the length of the longest sequence be n. Then:
[Listing: algorithm for finding maximal sequences]

AprioriAll: Example
- Assume we have a database with the customer sequences shown below (in transformed form). The minimum support is assumed to be 40% (i.e., 2 customer sequences).
[Table: customer sequences in transformed form]
[Table: Candidate 3-sequences]
- No candidate is generated for the 5th pass.
- The resulting maximal large sequences: {1 2 3 4}, {1 3 5} and {4 5}.

Algorithm: AprioriSome

AprioriSome: Forward Phase
- In the forward phase, only sequences of certain lengths are counted.
- For example, we might count sequences of length 1, 2, 4 and 6 in the forward phase and sequences of length 3 and 5 in the backward phase.
- The function next takes as parameter the length of the sequences counted in the last pass and returns the length of the sequences to be counted in the next pass.
- The apriori-generate function is used to generate new candidate sequences. However, in the k-th pass, we may not have the large sequence set $L_{k-1}$ available, as we did not count the (k-1)-candidate sequences. In that case, we use the candidate set $C_{k-1}$ to generate $C_k$.
- Correctness is maintained because $C_{k-1} \supseteq L_{k-1}$.
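The forward phase described above can be sketched as follows. Everything named here (`contains`, `count_large`, `forward_phase`, the `DB` list) is hypothetical; in particular `DB` is an assumed transformed database with 5 customers, used only to exercise the code with a minimum support of 2 customers and next(k) = 2k. The key point is the fallback: when $L_{k-1}$ was not counted, candidates are generated from $C_{k-1}$ instead.

```python
def contains(customer_seq, candidate):
    """True if the candidate's ids occur, in order, in distinct transactions."""
    j = 0
    for litemset_id in candidate:
        while j < len(customer_seq) and litemset_id not in customer_seq[j]:
            j += 1
        if j == len(customer_seq):
            return False
        j += 1
    return True


def apriori_generate(prev):
    """Join sequences sharing all but the last id, then prune (assumed join rule)."""
    joined = {p + (q[-1],) for p in prev for q in prev if p[:-1] == q[:-1]}
    return {c for c in joined
            if all(c[:i] + c[i + 1:] in prev for i in range(len(c)))}


def count_large(candidates, db, min_count):
    """Keep the candidates supported by at least min_count customers."""
    return {c for c in candidates
            if sum(contains(seq, c) for seq in db) >= min_count}


def forward_phase(db, large_1, min_count, next_len):
    large = {1: set(large_1)}       # L_k, only for the lengths that were counted
    candidates = {1: set(large_1)}  # C_k, for every length
    last, k = 1, 2
    while candidates[k - 1] and large[last]:
        # Use L_{k-1} if it was counted; otherwise fall back to C_{k-1}.
        source = large.get(k - 1, candidates[k - 1])
        candidates[k] = apriori_generate(source)
        if k == next_len(last):
            large[k] = count_large(candidates[k], db, min_count)
            last = k
        k += 1
    return large, candidates


# Assumed transformed database (one list of id sets per customer).
DB = [
    [{1, 5}, {2}, {3}, {4}],
    [{1}, {3}, {4}, {3, 5}],
    [{1}, {2}, {3}, {4}],
    [{1}, {3}, {5}],
    [{4}, {5}],
]
L1 = count_large({(i,) for i in range(1, 6)}, DB, 2)
large, candidates = forward_phase(DB, L1, 2, lambda k: 2 * k)
print(sorted(large))  # lengths that were actually counted
```

With next(k) = 2k only lengths 1, 2 and 4 are counted; length 3 has candidates but no counted large set, which is what the backward phase later has to deal with.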
AprioriSome: Forward Phase - Example
- Using the database from the AprioriAll example, we find the large 1-sequences ($L_1$) in the litemset phase (during the first pass over the database).
- For simplicity of illustration, take f(k) = 2k as the next function.
- In the second pass, we count $C_2$ to get $L_2$.
- Next, apriori-generate is called with $L_2$ as argument to get $C_3$. We do not count $C_3$, and hence do not generate $L_3$.
- Then, apriori-generate is called with $C_3$ to get $C_4$, which, after pruning, turns out to be the same $C_4$ as in the AprioriAll example: (1 2 3 4).
- After counting $C_4$ to get $L_4$, we try generating $C_5$, which turns out to be empty.
[Table: Candidate 3-sequences]

AprioriSome: Backward Phase
- In the backward phase, we count sequences of the lengths we skipped over during the forward phase, after first deleting all sequences contained in some large sequence. These smaller sequences cannot be in the answer because we are only interested in maximal sequences.
- We also delete the large sequences found in the forward phase that are non-maximal.

AprioriSome: Backward Phase - Example
- When the backward phase starts, nothing gets deleted from $L_4$, since there are no longer sequences.
- We had skipped counting the support for the sequences in $C_3$ in the forward phase. After deleting those sequences in $C_3$ that are subsequences of sequences in $L_4$, i.e., subsequences of (1 2 3 4), we are left with the sequences (1 3 5) and (3 4 5). These are counted, giving (1 3 5) as a maximal large 3-sequence.
- Next, all the sequences in $L_2$ except (4 5) are deleted, since they are contained in some longer sequence. For the same reason, all sequences in $L_1$ are also deleted.
[Table: Candidate 3-sequences]
Answer: (1 2 3 4), (1 3 5), (4 5)
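The deletion of non-maximal sequences can be sketched as below. The set `LARGE` is an assumption: the tables of large sequences are not available in this text, so the values were chosen to agree with what the example states ($L_4$ = (1 2 3 4), (1 3 5) as the only maximal large 3-sequence, and (4 5) as the only survivor of $L_2$). The function names are hypothetical, and litemset ids are treated as atomic symbols.

```python
def is_subsequence(short, long):
    """True if the ids of short appear in long in the same order."""
    remaining = iter(long)
    return all(x in remaining for x in short)


def maximal_sequences(large):
    """Keep only the large sequences not contained in a longer large sequence."""
    kept = set(large)
    # Work from the longest sequences down, deleting their subsequences.
    for s in sorted(large, key=len, reverse=True):
        if s in kept:
            kept -= {t for t in kept
                     if len(t) < len(s) and is_subsequence(t, s)}
    return kept


# Assumed large sequences, consistent with the example in the text.
LARGE = {
    (1,), (2,), (3,), (4,), (5,),
    (1, 2), (1, 3), (1, 4), (1, 5), (2, 3), (2, 4), (3, 4), (3, 5), (4, 5),
    (1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4),
    (1, 2, 3, 4),
}
answer = maximal_sequences(LARGE)
print(sorted(answer, key=len, reverse=True))
```

Under these assumed inputs the result agrees with the answer given in the example: (1 2 3 4), (1 3 5) and (4 5).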