August 19, 2014 Data Mining: Concepts and Techniques 1
Chap 5.1: Mining Sequential Patterns
Sequential patterns are a kind of association rule.
Their mining algorithms are closely related to association rule mining algorithms.

Sequence Databases and Sequential Pattern Analysis
Transaction databases and time-series databases vs. sequence databases
  A time-series database stores sequences of values that change with time, such as data collected from the stock exchange
Frequent patterns vs. (frequent) sequential patterns
Applications of sequential pattern mining
  Customer shopping sequences: first buy a computer, then a CD-ROM, and then a digital camera, all within 3 months
  Medical treatment, natural disasters (e.g., earthquakes), science and engineering processes, stocks and markets, etc.
  Telephone calling patterns, Weblog click streams
  DNA sequences and gene structures

What Is Sequential Pattern Mining?
Given a set of sequences, find the complete set of frequent subsequences.

A sequence database:
  SID   Sequence
  10    <a(abc)(ac)d(cf)>
  20    <(ad)c(bc)(ae)>
  30    <(ef)(ab)(df)cb>
  40    <eg(af)cbc>

A sequence: <(ef)(ab)(df)cb>
An element may contain a set of items. Items within an element are unordered, and we list them alphabetically.
<a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>.
Given support threshold min_sup = 2, <(ab)c> is a sequential pattern: it is a subsequence of sequences 10 and 30.

Mining Sequential Patterns
A sequence $\langle a_1, a_2, \ldots, a_n \rangle$ is contained in another sequence $\langle b_1, b_2, \ldots, b_m \rangle$ if there exist integers $1 \le i_1 < i_2 < \cdots < i_n \le m$ such that $a_1 \subseteq b_{i_1},\ a_2 \subseteq b_{i_2},\ \ldots,\ a_n \subseteq b_{i_n}$.
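The containment test and the support count defined above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the names `seq`, `is_contained` and `support` are hypothetical, and each element is written as a string of single-letter items.

```python
from typing import Dict, FrozenSet, List

Element = FrozenSet[str]
Sequence = List[Element]


def seq(*elements: str) -> Sequence:
    """Build a sequence; each string is one element, e.g. seq("a", "abc")."""
    return [frozenset(e) for e in elements]


def is_contained(a: Sequence, b: Sequence) -> bool:
    """Return True if a is a subsequence of b.

    Each element of a is matched greedily to the earliest remaining
    element of b that is a superset of it, which preserves the order.
    """
    j = 0
    for elem in a:
        while j < len(b) and not elem <= b[j]:
            j += 1
        if j == len(b):
            return False
        j += 1  # the next element of a must match strictly later in b
    return True


def support(pattern: Sequence, database: Dict[int, Sequence]) -> int:
    """Number of database sequences that contain the pattern."""
    return sum(is_contained(pattern, s) for s in database.values())


# The four-sequence database from the slide above.
database = {
    10: seq("a", "abc", "ac", "d", "cf"),
    20: seq("ad", "c", "bc", "ae"),
    30: seq("ef", "ab", "df", "c", "b"),
    40: seq("e", "g", "af", "c", "b", "c"),
}

print(is_contained(seq("a", "bc", "d", "c"), database[10]))  # True
print(support(seq("ab", "c"), database))                     # 2
```

With min_sup = 2, the second call confirms that <(ab)c> qualifies as a sequential pattern in this database.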
E.g., the sequence <(3)(4 5)(8)> is contained in <(7)(3 8)(9)(4 5 6)(8)>.
However, <(3)(5)> is not contained in <(3 5)>:
  <(3)(5)> means items 3 and 5 are bought one after the other
  <(3 5)> means items 3 and 5 are bought together
In a set of sequences, a sequence s is maximal if s is not contained in any other sequence.

Mining Sequential Patterns: An Example

Database sorted by customer ID and transaction time:
  Customer ID   Transaction Time   Items Bought
  1             June 25 93         30
  1             June 30 93         90
  2             June 10 93         10, 20
  2             June 15 93         30
  2             June 20 93         40, 60, 70
  3             June 25 93         30, 50, 70
  4             June 25 93         30
  4             June 30 93         40, 70
  4             July 25 93         90
  5             June 12 93         90

Customer-sequence version of the DB:
  Customer ID   Customer Sequence
  1             <(30)(90)>
  2             <(10 20)(30)(40 60 70)>
  3             <(30 50 70)>
  4             <(30)(40 70)(90)>
  5             <(90)>

Sequential patterns with support > 25%:
  <(30)(90)>
  <(30)(40 70)>

Mining Sequential Patterns
Given a database D of customer transactions, the problem of mining sequential patterns is to find the maximal sequences among all sequences that have a certain user-specified minimum support.

Mining Sequential Patterns
Involves 5 phases:
  Sort phase
  Litemset phase
  Transformation phase
  Sequence phase
  Maximal phase
Terminology
  Length of a sequence: the number of itemsets in the sequence (a sequence of length k is called a k-sequence)
  Support of an itemset i: the fraction of customers who bought the items in i in a single transaction
  An itemset with minimum support is called a large itemset, or litemset

Mining Sequential Patterns
Involves 5 phases:
Sort phase
  The database D is sorted, with customer ID as the major key and transaction time as the minor key
  This converts the original transaction database into a database of customer sequences
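The sort phase can be sketched as follows. This is a minimal illustration under stated assumptions: the rows are the ten transactions of the example database, entered here in shuffled order, the two-digit year 93 is read as 1993, and the function name `to_customer_sequences` is hypothetical.

```python
from collections import defaultdict
from datetime import date

# (customer_id, transaction_time, items) rows, deliberately out of order.
transactions = [
    (4, date(1993, 7, 25), {90}),
    (2, date(1993, 6, 20), {40, 60, 70}),
    (1, date(1993, 6, 30), {90}),
    (5, date(1993, 6, 12), {90}),
    (2, date(1993, 6, 10), {10, 20}),
    (4, date(1993, 6, 25), {30}),
    (3, date(1993, 6, 25), {30, 50, 70}),
    (1, date(1993, 6, 25), {30}),
    (4, date(1993, 6, 30), {40, 70}),
    (2, date(1993, 6, 15), {30}),
]


def to_customer_sequences(rows):
    """Sort by (customer id, time), then group items into one sequence per customer."""
    ordered = sorted(rows, key=lambda row: (row[0], row[1]))
    sequences = defaultdict(list)
    for customer_id, _time, items in ordered:
        sequences[customer_id].append(frozenset(items))
    return dict(sequences)


customer_sequences = to_customer_sequences(transactions)
for customer_id, sequence in customer_sequences.items():
    print(customer_id, [sorted(itemset) for itemset in sequence])
```

Sorting on an explicit key avoids comparing the item sets themselves when two rows share a customer and a time.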
Litemset phase
  Find the set of all litemsets L
  Simultaneously find the set of all large 1-sequences, $\{\langle l \rangle \mid l \in L\}$
  The set of litemsets is mapped to a set of contiguous integers
  Reason for mapping: by treating litemsets as single entities, two litemsets can be compared for equality in constant time, which reduces the time required to check whether a sequence is contained in a customer sequence
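A brute-force sketch of the litemset phase on the five-customer example is given here. It enumerates every subset of every transaction, which is only reasonable for a toy database; a practical implementation would use a level-wise Apriori search instead. The function name `find_litemsets` and the sort key used for numbering are assumptions made for this illustration, and any one-to-one mapping to integers would serve equally well.

```python
from itertools import combinations

# Customer sequences of the example database (one list of transactions each).
customer_sequences = {
    1: [{30}, {90}],
    2: [{10, 20}, {30}, {40, 60, 70}],
    3: [{30, 50, 70}],
    4: [{30}, {40, 70}, {90}],
    5: [{90}],
}


def find_litemsets(sequences, min_customers):
    """Count, per itemset, the customers with a single transaction containing it."""
    counts = {}
    for sequence in sequences.values():
        seen = set()  # each customer contributes at most once per itemset
        for transaction in sequence:
            items = sorted(transaction)
            for size in range(1, len(items) + 1):
                seen.update(combinations(items, size))
        for itemset in seen:
            counts[itemset] = counts.get(itemset, 0) + 1
    return {i: c for i, c in counts.items() if c >= min_customers}


# A support above 25% of 5 customers means at least 2 customers.
litemsets = find_litemsets(customer_sequences, min_customers=2)

# Map litemsets to contiguous integers (ordering chosen only for readability).
ordered = sorted(litemsets, key=lambda s: (max(s), len(s), s))
mapping = {itemset: number for number, itemset in enumerate(ordered, start=1)}
print(mapping)
```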
Litemset phase

Fig. 1  Database sorted by customer ID and transaction time
  Customer ID   Transaction Time   Items Bought
  1             June 25 93         30
  1             June 30 93         90
  2             June 10 93         10, 20
  2             June 15 93         30
  2             June 20 93         40, 60, 70
  3             June 25 93         30, 50, 70
  4             June 25 93         30
  4             June 30 93         40, 70
  4             July 25 93         90
  5             June 12 93         90

Large itemsets are (30), (40), (70), (40 70) and (90).

  Large Itemset   Mapped To
  (30)            1
  (40)            2
  (70)            3
  (40 70)         4
  (90)            5

Transformation Phase
The sequence phase must repeatedly determine which of a given set of large sequences are contained in a customer sequence
To make this test fast, each customer sequence is transformed into an alternative representation
  Each transaction is replaced by the set of all litemsets contained in that transaction
  If a transaction does not contain any litemset, it is not retained in the transformed sequence
  If a customer sequence does not contain any litemset, the sequence is dropped from the transformed database

Transformation Phase
Applying these rules and the litemset mapping above to the customer sequences:

  Customer ID   Customer Sequence         Transformed Customer Sequence              After Mapping
  1             <(30)(90)>                <{(30)} {(90)}>                            <{1} {5}>
  2             <(10 20)(30)(40 60 70)>   <{(30)} {(40), (70), (40 70)}>             <{1} {2, 3, 4}>
  3             <(30 50 70)>              <{(30), (70)}>                             <{1, 3}>
  4             <(30)(40 70)(90)>         <{(30)} {(40), (70), (40 70)} {(90)}>      <{1} {2, 3, 4} {5}>
  5             <(90)>                    <{(90)}>                                   <{5}>
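The transformation rules can be sketched directly from the three bullet points. This is a minimal illustration, assuming the litemset numbering shown in the mapping table; the function name `transform` is hypothetical.

```python
# Litemset numbering from the example.
litemset_ids = {
    frozenset({30}): 1,
    frozenset({40}): 2,
    frozenset({70}): 3,
    frozenset({40, 70}): 4,
    frozenset({90}): 5,
}

customer_sequences = {
    1: [{30}, {90}],
    2: [{10, 20}, {30}, {40, 60, 70}],
    3: [{30, 50, 70}],
    4: [{30}, {40, 70}, {90}],
    5: [{90}],
}


def transform(sequences, ids):
    """Replace each transaction by the ids of the litemsets it contains."""
    transformed = {}
    for customer_id, sequence in sequences.items():
        new_sequence = []
        for transaction in sequence:
            contained = {n for litemset, n in ids.items() if litemset <= transaction}
            if contained:  # transactions without any litemset are dropped
                new_sequence.append(contained)
        if new_sequence:  # customers without any litemset are dropped
            transformed[customer_id] = new_sequence
    return transformed


transformed = transform(customer_sequences, litemset_ids)
for customer_id, sequence in transformed.items():
    print(customer_id, [sorted(element) for element in sequence])
```

For customer 2, the transaction (10 20) contains no litemset and disappears, while (40 60 70) becomes the set {2, 3, 4}.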
Sequence Phase
Make multiple passes over the data
  Each pass starts with a seed set of large sequences
  The seed set is used to generate new potentially large sequences, called candidate sequences
  The support of the candidates is counted during the pass over the data
  At the end of the pass, the candidates that are actually large are determined; these large candidates become the seed for the next pass
Two families of algorithms: count-all and count-some
  The count-all algorithm, based on the Apriori algorithm, is called AprioriAll
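Candidate generation in an Apriori-style sequence phase is commonly done in two steps: a join of the large (k-1)-sequences that share their first k-2 elements, followed by a prune of every candidate that has a (k-1)-subsequence which is not large. A minimal sketch follows, assuming sequences are tuples of litemset numbers and that a sequence is not joined with itself; the function name `generate_candidates` is hypothetical.

```python
from itertools import combinations


def generate_candidates(large_prev):
    """Return (joined, pruned) candidate k-sequences from large (k-1)-sequences."""
    large = set(large_prev)
    if not large:
        return [], []
    k_minus_1 = len(next(iter(large)))

    # Join: p and q agree on everything except their last element.
    joined = sorted(
        p + (q[-1],)
        for p in large
        for q in large
        if p != q and p[:-1] == q[:-1]
    )

    # Prune: every (k-1)-subsequence of a candidate must itself be large.
    pruned = [
        c for c in joined
        if all(sub in large for sub in combinations(c, k_minus_1))
    ]
    return joined, pruned


large_3 = [(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)]
joined, pruned = generate_candidates(large_3)
print(joined)  # [(1, 2, 3, 4), (1, 2, 4, 3), (1, 3, 4, 5), (1, 3, 5, 4)]
print(pruned)  # [(1, 2, 3, 4)]
```

`combinations` keeps the original order of the elements, so each tuple it yields is a subsequence obtained by dropping one element of the candidate.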
  Large 3-Sequences   Candidate 4-Sequences (after join)   Candidate 4-Sequences (after pruning)
  <1 2 3>             <1 2 3 4>                            <1 2 3 4>
  <1 2 4>             <1 2 4 3>
  <1 3 4>             <1 3 4 5>
  <1 3 5>             <1 3 5 4>
  <2 3 4>

Maximal phase
  Find the maximal sequences among the set of large sequences
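The maximal phase can be sketched as a filter that keeps only the large sequences not contained in any other large sequence. This is a minimal, quadratic-time illustration. The input list is an enumeration made for this sketch of the large sequences of the five-customer example at a minimum support of 2 customers, and the names `S`, `is_contained` and `maximal_sequences` are hypothetical.

```python
def S(*elements):
    """Build a hashable sequence from sets of items."""
    return tuple(frozenset(e) for e in elements)


def is_contained(a, b):
    """True if sequence a is a subsequence of b (greedy in-order matching)."""
    j = 0
    for elem in a:
        while j < len(b) and not elem <= b[j]:
            j += 1
        if j == len(b):
            return False
        j += 1
    return True


def maximal_sequences(large):
    """Keep the sequences that are not contained in any other sequence."""
    return [
        s for s in large
        if not any(s != t and is_contained(s, t) for t in large)
    ]


# Assumed large sequences for the five-customer example (support >= 2).
large = [
    S({30}), S({40}), S({70}), S({40, 70}), S({90}),
    S({30}, {40}), S({30}, {70}), S({30}, {40, 70}), S({30}, {90}),
]

for sequence in maximal_sequences(large):
    print([sorted(element) for element in sequence])
# [[30], [40, 70]]
# [[30], [90]]
```

The two survivors match the sequential patterns reported for that example, <(30)(40 70)> and <(30)(90)>.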
Sequence Phase

Customer sequences:
  <{1 5} {2} {3} {4}>
  <{1} {3} {4} {3 5}>
  <{1} {2} {3} {4}>
  <{1} {3} {5}>
  <{4} {5}>

Min_support = 40% (2 customer sequences)

Large sequences:
  L1                  L2                  L3                    L4
  1-Seq   Support     2-Seq   Support     3-Seq     Support     4-Seq       Support
  <1>     4           <1 2>   2           <1 2 3>   2           <1 2 3 4>   2
  <2>     2           <1 3>   4           <1 2 4>   2
  <3>     4           <1 4>   3           <1 3 4>   3
  <4>     4           <1 5>   3           <1 3 5>   2
  <5>     4           <2 3>   2           <2 3 4>   2
                      <2 4>   2
                      <3 4>   3
                      <3 5>   2
                      <4 5>   2

Maximal large sequences: <1 2 3 4>, <1 3 5>, <4 5>

Challenges on Sequential Pattern Mining
A huge number of possible sequential patterns are hidden in databases
A mining algorithm should
  find the complete set of patterns, when possible, satisfying the minimum support (frequency) threshold
  be highly efficient and scalable, involving only a small number of database scans
  be able to incorporate various kinds of user-specific constraints

Studies on Sequential Pattern Mining
Concept introduction and an initial Apriori-like algorithm
  R. Agrawal & R. Srikant. Mining sequential patterns. ICDE 1995
GSP: an Apriori-based, influential mining method (developed at IBM Almaden)
  R. Srikant & R. Agrawal. Mining sequential patterns: Generalizations and performance improvements. EDBT 1996
From sequential patterns to episodes (Apriori-like + constraints)
  H. Mannila, H. Toivonen & A.I. Verkamo. Discovery of frequent episodes in event sequences. Data Mining and Knowledge Discovery, 1997
Mining sequential patterns with constraints
  M.N. Garofalakis, R. Rastogi & K. Shim. SPIRIT: Sequential Pattern Mining with Regular Expression Constraints. VLDB 1999

A Basic Property of Sequential Patterns: Apriori
A basic property: Apriori (Agrawal & Srikant, 1994)
  If a sequence S is not frequent, then none of the super-sequences of S is frequent
  E.g., <hb> is infrequent, so <hab> and <(ah)b> are infrequent as well

  Seq. ID   Sequence
  10        <(bd)cb(ac)>
  20        <(bf)(ce)b(fg)>
  30        <(ah)(bf)abf>
  40        <(be)(ce)d>
  50        <a(bd)bcb(ade)>

Given support threshold min_sup = 2

GSP: A Generalized Sequential Pattern Mining Algorithm
GSP (Generalized Sequential Pattern) mining algorithm, proposed by Srikant and Agrawal, EDBT 1996
Outline of the method
  Initially, every item in the DB is a candidate of length 1
  For each level (i.e., sequences of length k):
    scan the database to collect the support count for each candidate sequence
    generate candidate length-(k+1) sequences from the length-k frequent sequences using Apriori
  Repeat until no frequent sequence or no candidate can be found
Major strength: candidate pruning by Apriori

Finding Length-1 Sequential Patterns
Examine GSP using an example
Initial candidates: all singleton sequences
  <a>, <b>, <c>, <d>, <e>, <f>, <g>, <h>
Scan the sequence database above once and count the support of each candidate (min_sup = 2)

  Cand   Sup
  <a>    3
  <b>    5
  <c>    4
  <d>    3
  <e>    3
  <f>    2
  <g>    1
  <h>    1

Generating Length-2 Candidates
Candidates in which the two items occur in separate elements (6 x 6 = 36):

        <a>    <b>    <c>    <d>    <e>    <f>
  <a>   <aa>   <ab>   <ac>   <ad>   <ae>   <af>
  <b>   <ba>   <bb>   <bc>   <bd>   <be>   <bf>
  <c>   <ca>   <cb>   <cc>   <cd>   <ce>   <cf>
  <d>   <da>   <db>   <dc>   <dd>   <de>   <df>
  <e>   <ea>   <eb>   <ec>   <ed>   <ee>   <ef>
  <f>   <fa>   <fb>   <fc>   <fd>   <fe>   <ff>

Candidates in which the two items occur in the same element (6 x 5 / 2 = 15):

        <a>   <b>      <c>      <d>      <e>      <f>
  <a>         <(ab)>   <(ac)>   <(ad)>   <(ae)>   <(af)>
  <b>                  <(bc)>   <(bd)>   <(be)>   <(bf)>
  <c>                           <(cd)>   <(ce)>   <(cf)>
  <d>                                    <(de)>   <(df)>
  <e>                                             <(ef)>
  <f>

36 + 15 = 51 length-2 candidates
Without the Apriori property, 8*8 + 8*7/2 = 92 candidates
Apriori prunes 44.57% of the candidates
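The first two GSP steps of this example can be checked with a short script: count the support of every singleton, keep the frequent items, and build the length-2 candidates from them. This is only a sketch of these two steps, not of the full GSP algorithm. The helper names `item_support` and `length2_candidates` are hypothetical, and each element is written as a string of single-letter items.

```python
from itertools import combinations, product

# The five-sequence database of the example; one string per element.
database = {
    10: ["bd", "c", "b", "ac"],
    20: ["bf", "ce", "b", "fg"],
    30: ["ah", "bf", "a", "b", "f"],
    40: ["be", "ce", "d"],
    50: ["a", "bd", "b", "c", "b", "ade"],
}
min_sup = 2


def item_support(db):
    """Number of sequences in which each item occurs at least once."""
    counts = {}
    for sequence in db.values():
        for item in set("".join(sequence)):
            counts[item] = counts.get(item, 0) + 1
    return counts


def length2_candidates(items):
    """All <xy> (separate elements) and <(xy)> (same element) candidates."""
    items = sorted(items)
    separate = [f"<{x}{y}>" for x, y in product(items, repeat=2)]
    together = [f"<({x}{y})>" for x, y in combinations(items, 2)]
    return separate + together


support = item_support(database)
frequent = sorted(item for item, count in support.items() if count >= min_sup)

with_apriori = length2_candidates(frequent)     # built from 6 frequent items
without_apriori = length2_candidates(support)   # built from all 8 items

print(frequent)                                  # ['a', 'b', 'c', 'd', 'e', 'f']
print(len(with_apriori), len(without_apriori))   # 51 92
print(f"{1 - len(with_apriori) / len(without_apriori):.2%}")  # 44.57%
```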