Course Content: Data Mining & Knowledge Discovery

Data Mining & Knowledge Discovery
Fall 2007
Chapter 3: Sequential Pattern Analysis
Dr. Osmar R. Zaïane
University of Alberta

Course Content
  Introduction to Data Mining
  Association Analysis
  Sequential Pattern Analysis
  Classification and prediction
  Contrast Sets
  Data Clustering
  Outlier Detection
  Web Mining
  Other topics if time permits (spatial data, biomedical data, etc.)

Dr. Osmar R. Zaïane, 1999, 2007. Data Mining and Knowledge Discovery, University of Alberta

Chapter 3 Objectives
Understand Sequential Pattern Analysis in the context of transactional data and get a brief introduction to the different algorithms for sequential pattern discovery.

Sequence of Transactions
Association rule mining searches for relationships between items in a dataset where time is irrelevant.
  [Figure: a store receives unordered transactions such as {a,b,c,d} and {x,y,z}.]
Sequential Pattern Analysis considers time (or order of transactions).
  [Figure: customer x visits the store on day 1, day 2, ..., producing the sequence <{a,b,c,d}, {x,y,z}, ...>.]
Data: sequences of evidence in time order
Target: sub-sequences that happened frequently

Lecture Outline
Part I: Concepts (30 minutes)
  Basic concepts
Part II: Apriori-based Approaches (45 minutes)
  Apriori-All
  GSP
Part III: Pattern-Growth-based Approaches (45 minutes)
  Free-Span
  Prefix-Span

Sequence Pattern Examples
Example 1
  60% of customers typically rent Star Wars, then Empire Strikes Back, and then Return of the Jedi.
  Note: these rentals need not be consecutive.
  <SW>, ..., <ESB>, ..., <RJ>
Example 2
  60% of customers buy fitted sheet and flat sheet and pillow, followed by comforter, followed by drapes and ruffles.
  Note: elements of a sequential pattern need not be simple items.
  <FittedSheet, FlatSheet, Pillow>, ..., <Comforter>, ..., <Drapes, Ruffles>

Why sequential pattern mining?
Time or order in which actions appear or happen can be relevant in decision making.
Many applications of sequential pattern mining:
  Customer shopping sequences (idem for book/video rental): first buy computer, then CD-ROM, and then digital camera, within 3 months.
  Medical treatment (e.g., symptoms and diseases)
  Serial crime solving
  Natural disasters (e.g., earthquakes)
  Science & engineering processes
  Stocks and markets
  Telephone calling patterns
  Web access log click streams, < a1, a2, a3 >
  DNA sequences and gene structures, etc.

Sequence Database
Converts the original transaction database into a database of customer sequences.
Sort transactions: Customer ID = major key, Transaction Time = minor key

  Transactional database    Sorted by customer, time    Sequence database
  Cust1 {30}                Cust1 {30}                  1,<30, 90>
  Cust2 {10,20}             Cust1 {90}                  2,<(10,20), 30, (40, 60, 70)>
  Cust3 {30, 50, 70}        Cust2 {10,20}               3,<(30, 50, 70)>
  Cust4 {30}                Cust2 {30}                  4,<30, (40, 70), 90>
  Cust5 {90}                Cust2 {40, 60, 70}          5,<90>
  Cust2 {30}                Cust3 {30, 50, 70}
  Cust4 {40, 70}            Cust4 {30}
  Cust1 {90}                Cust4 {40, 70}
  Cust2 {40, 60, 70}        Cust4 {90}
  Cust4 {90}                Cust5 {90}

[Figure: one timeline per customer (CID=1 to CID=5) showing the itemsets bought at each transaction time t.]

Example (CID=2): <(10,20)(30)(40,60,70)>
  <(20)(30)(40)> is contained in <(10,20)(30)(40,60,70)>
  <(20)(30,40)> is not contained in <(10,20)(30)(40,60,70)>

What Is Sequential Pattern Mining?
Given a set of sequences, find the complete set of frequent subsequences.
A sequence: < (ef) (ab) (df) c b >
A sequence database:
  SID   sequence
  10    <a(abc)(ac)d(cf)>
  20    <(ad)c(bc)(ae)>
  30    <(ef)(ab)(df)cb>
  40    <eg(af)cbc>
An element may contain a set of items. Items within an element are unordered and we list them alphabetically.
<a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>
Given support threshold min_sup = 2, <(ab)c> is a sequential pattern (cf. SID 10 & 30)

Sequential Patterns
Given
  a set of sequences, where each sequence consists of a list of elements and each element consists of a set of items
  a user-specified min_support threshold
<a(abc)(ac)d(cf)> has 5 elements and 9 items
<a(abc)(ac)d(cf)> is a 9-sequence
<a(abc)(ac)d(cf)> = <a(cba)(ac)d(cf)>: order doesn't matter (list of items within an element)
<a(abc)(ac)d(cf)> ≠ <a(ac)(abc)d(cf)>: order matters (sequence of elements)

Sequential Pattern Mining
Find all the frequent subsequences, i.e. the subsequences whose occurrence frequency in the set of sequences is no less than min_support.
  id    Sequence
  10    <a(abc)(ac)d(cf)>
  20    <(ad)c(bc)(ae)>
  30    <(ef)(ab)(df)cb>
  40    <eg(af)cbc>
min_support = 2
Solution: 53 frequent subsequences
  <a> <aa> <ab> <a(bc)> <a(bc)a> <aba> <abc> <(ab)> <(ab)c> <(ab)d> <(ab)f> <(ab)dc> <ac> <aca> <acb> <acc> <ad> <adc> <af>
  <b> <ba> <bc> <(bc)> <(bc)a> <bd> <bdc> <bf>
  <c> <ca> <cb> <cc>
  <d> <db> <dc> <dcb>
  <e> <ea> <eab> <eac> <eacb> <eb> <ebc> <ec> <ecb> <ef> <efb> <efc> <efcb>
  <f> <fb> <fbc> <fc> <fcb>

Subsequence vs. super sequence
Given two sequences α = <a1 a2 … an> and β = <b1 b2 … bm>,
α is called a subsequence of β, denoted as α ⊆ β, if there exist integers 1 ≤ j1 < j2 < … < jn ≤ m such that a1 ⊆ bj1, a2 ⊆ bj2, …, an ⊆ bjn.
β is a super sequence of α.
A sequence s is maximal if it is not contained in any other sequence.
Examples with β = <a(abc)(ac)d(cf)>:
  Subsequences of β:      α1 = <aa(ac)d(c)>   α2 = <(ac)(ac)d(cf)>   α3 = <ac>
  Not subsequences of β:  α4 = <df(cf)>       α5 = <(cf)d>           α6 = <(abc)dcf>
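
The subsequence definition above can be sketched in Python. This is a minimal illustration, not code from the lecture: the helper names `parse` and `is_subsequence` are hypothetical, items are assumed to be single characters, and the matching is done greedily from left to right (each element of α is matched to the earliest possible element of β).

```python
def parse(text):
    """Parse slide notation such as '<a(abc)(ac)d(cf)>' into a list of item sets.

    Assumption for this sketch: every item is a single character.
    """
    elements, group = [], None
    for ch in text.strip("<>"):
        if ch == "(":
            group = set()                      # start of a multi-item element
        elif ch == ")":
            elements.append(frozenset(group))  # end of a multi-item element
            group = None
        elif group is not None:
            group.add(ch)
        else:
            elements.append(frozenset(ch))     # a single-item element
    return elements


def is_subsequence(alpha, beta):
    """True if alpha ⊆ beta: elements of alpha map, in order, into supersets in beta."""
    j = 0
    for element in alpha:
        while j < len(beta) and not element <= beta[j]:
            j += 1
        if j == len(beta):
            return False
        j += 1                                 # the next element must match strictly later
    return True


beta = parse("<a(abc)(ac)d(cf)>")
inside = [is_subsequence(parse(s), beta) for s in ("<aa(ac)d(c)>", "<(ac)(ac)d(cf)>", "<ac>")]
outside = [is_subsequence(parse(s), beta) for s in ("<df(cf)>", "<(cf)d>", "<(abc)dcf>")]
```

Here `inside` holds the results for α1 to α3 and `outside` those for α4 to α6.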

Sequence Support Count
A sequence database is a set of tuples <sid, s>.
A tuple <sid, s> is said to contain a sequence α if α is a subsequence of s, i.e., α ⊆ s.
Support count: the support of a sequence α is the number of tuples containing α.
  id    Sequence
  10    <a(abc)(ac)d(cf)>
  20    <(ad)c(bc)(ae)>
  30    <(ef)(ab)(df)cb>
  40    <eg(af)cbc>
α1 = <a>       support(α1) = 4
α2 = <ac>      support(α2) = 4
α3 = <(ab)c>   support(α3) = 2

Counting Sequences (An example)
A generated candidate: <(3)(4,5)(8)>
Each data sequence is tested against the candidate, and the counter (0 to 5) is incremented for every sequence that contains it:
  <(7)(3,8)(9)(4,5,6)(8)>     Contained
  <(8)(3,8)(9)(4,5)(6)(7)>    Not contained
  <(3)(5,4)>                  Not contained
  <(3)(5,4)(6)(9,8)>          Contained
  <(10)(3)(5,4)(9,8)>         Contained
  <(5,4)(6,6)(9,8)>           Not contained
  <(3)(5,4)(6)(9,8)>          Contained
  <(3)(5)(4)(6)(9,8)>         Not contained
  <(3)(5,2,1)(6)(9,8)>        Not contained
  <(3)(5,4)(6)(9,8)>          Contained
IF Min_Support is 50% THEN <(3)(4,5)(8)> is frequent.
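
As a rough sketch of the support count defined above (the function names `contains` and `support` are illustrative, not from the lecture), each sequence is stored as a list of Python sets and a tuple is counted at most once:

```python
def contains(sequence, pattern):
    """True if pattern is a subsequence of sequence (both are lists of item sets)."""
    position = 0
    for element in pattern:
        while position < len(sequence) and not element <= sequence[position]:
            position += 1
        if position == len(sequence):
            return False
        position += 1
    return True


def support(database, pattern):
    """Number of tuples <sid, s> whose sequence s contains the pattern."""
    return sum(1 for sequence in database.values() if contains(sequence, pattern))


# The four-sequence example database from the slide, keyed by sequence id.
database = {
    10: [{"a"}, {"a", "b", "c"}, {"a", "c"}, {"d"}, {"c", "f"}],
    20: [{"a", "d"}, {"c"}, {"b", "c"}, {"a", "e"}],
    30: [{"e", "f"}, {"a", "b"}, {"d", "f"}, {"c"}, {"b"}],
    40: [{"e"}, {"g"}, {"a", "f"}, {"c"}, {"b"}, {"c"}],
}

support_a = support(database, [{"a"}])                 # α1 = <a>
support_ac = support(database, [{"a"}, {"c"}])         # α2 = <ac>
support_ab_c = support(database, [{"a", "b"}, {"c"}])  # α3 = <(ab)c>
```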

Challenges on Sequential Pattern Mining
A huge number of possible sequential patterns are hidden in databases.
A mining algorithm should
  find the complete set of patterns, when possible, satisfying the minimum support (frequency) threshold
  be highly efficient, scalable, involving only a small number of database scans
  be able to incorporate various kinds of user-specific constraints
Comparison of association rules and sequence mining
  Mining for association rules
    Purpose: Discovery of frequent unordered itemsets.
    Complexity: With n items there are $\binom{n}{k} = \frac{n!}{k!\,(n-k)!}$ k-itemsets (sets with k items).
  Mining for Sequential Patterns
    Purpose: Discovery of frequent sequences of (unordered) itemsets.
    Complexity: With n items there are $n^k$ k-sequences (sequences with k items).
Association mining algorithms discover isolated item sets (intra-event patterns).
Sequence mining algorithms discover series of item sets (inter-event patterns).

Studies on Sequential Pattern Mining
Concept introduction and an initial Apriori-like algorithm
  R. Agrawal & R. Srikant. "Mining sequential patterns," ICDE'95
GSP: an Apriori-based, influential mining method (developed at IBM Almaden)
  R. Srikant & R. Agrawal. "Mining sequential patterns: Generalizations and performance improvements," EDBT'96
From sequential patterns to episodes (Apriori-like + constraints)
  H. Mannila, H. Toivonen & A.I. Verkamo. "Discovery of frequent episodes in event sequences," Data Mining and Knowledge Discovery, 1997
Data projection-based approaches
  FreeSpan (Han et al. "Frequent pattern-projected sequential pattern mining," SIGKDD 2000)
  PrefixSpan (Pei et al. "Prefix-projected pattern growth," ICDE 2001)
Mining sequential patterns with constraints
  M.N. Garofalakis, R. Rastogi, K. Shim. "SPIRIT: Sequential Pattern Mining with Regular Expression Constraints," VLDB 1999
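
To get a feel for the two complexity figures, the following small Python check compares them for one arbitrary choice of n and k (the values 100 and 3 are assumptions made only for illustration):

```python
from math import comb

n, k = 100, 3                # illustrative values, not from the lecture

k_itemsets = comb(n, k)      # unordered: n! / (k! (n-k)!)
k_sequences = n ** k         # ordered, repetition allowed: n^k

ratio = k_sequences / k_itemsets
```

Even for short patterns the ordered search space is several times larger than the unordered one, which is why sequence mining is the harder problem.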

Lecture Outline
Part I: Concepts (30 minutes)
  Basic concepts
Part II: Apriori-based Approaches (45 minutes)
  Apriori-All
  GSP
Part III: Pattern-Growth-based Approaches (45 minutes)
  Free-Span
  Prefix-Span

AprioriAll: The idea
Basic method to mine sequential patterns
  Based on the Apriori algorithm
  Count all the large sequences, including non-maximal sequences
  Use the Apriori-generate function to generate candidate sequences: get candidates for a pass using only the large sequences found in the previous pass, and make a scan over the data to find their support

AprioriAll: The big picture
Five-phase algorithm
  1. Sort phase: Create the sequence database from transactions.
  2. Large itemset phase: Find all frequent itemsets using Apriori.
  3. Transformation phase: Do integer mapping for large itemsets.
  4. Sequence phase: Find all frequent sequential patterns using Apriori.
  5. Maximal phase: Eliminate non-maximal sequences.

AprioriAll Algorithm
Ck: candidate sequences of size k
Lk: frequent or large sequences of size k

  L1 = {large 1-sequence};   // result of litemset phase
  for (k = 2; Lk-1 != ∅; k++) do begin
      Ck = candidates generated from Lk-1;
      for each customer-sequence c in database do
          Increment the count of all candidates in Ck that are contained in c
      Lk = Candidates in Ck with minimum support
  end
  Answer = Maximal sequences in ∪k Lk;

AprioriAll Algorithm (1)
Candidate Generation, Join Step: Ck is generated by joining Lk-1 with itself

  Insert into Ck
  Select p.litemset1, …, p.litemsetk-1, q.litemsetk-1
  From Lk-1 p, Lk-1 q
  Where p.litemset1 = q.litemset1, ..., p.litemsetk-2 = q.litemsetk-2

For example: {1,2,3} X {1,2,4} = {1,2,3,4} and {1,2,4,3}
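
The join step can be sketched in Python as follows. The function names `join` and `apriori_generate` are hypothetical, sequences are tuples of litemset ids, and the pruning step (dropping candidates that have a non-large (k-1)-subsequence) is an assumption carried over from the standard Apriori scheme, since the slide only shows the join. The five-sequence input `large_3` is made up for illustration.

```python
def join(large_prev):
    """Join L(k-1) with itself: p and q must agree on their first k-2 litemsets."""
    return {p + (q[-1],) for p in large_prev for q in large_prev if p[:-1] == q[:-1]}


def apriori_generate(large_prev):
    """Join, then keep only candidates whose (k-1)-subsequences are all large."""
    large_prev = set(large_prev)
    return {
        c for c in join(large_prev)
        if all(c[:i] + c[i + 1:] in large_prev for i in range(len(c)))
    }


# The example from the slide: {1,2,3} X {1,2,4}.
joined = join({(1, 2, 3), (1, 2, 4)})

# A hypothetical set of large 3-sequences, used only to show the pruning.
large_3 = {(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)}
candidates_4 = apriori_generate(large_3)
```

Because a sequence may repeat a litemset, this sketch also lets p join with itself (giving, for example, (1, 2, 3, 3)); such candidates are removed by the pruning step unless all their subsequences are large.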

Sort Phase
CID: major key, TID: secondary key

  Customer ID   TransactionTime   Items
  1             1                 30
  1             2                 90
  2             1                 10,20
  2             2                 30
  2             3                 40,60,70
  3             1                 30,50,70
  4             1                 30
  4             2                 40,70
  4             3                 90
  5             1                 90

[Figure: one timeline per customer (CID=1 to CID=5) showing the itemsets bought at each transaction time t.]

Sequence Database Example
MinSupport = 40%, i.e. 2 customers
Answer: (<30><90>) (CID 1, 4) and (<30><40,70>) (CID 2, 4)
Not answer: <30>, <40>, <70>, <90>, (<30><40>), (<30><70>), (<40 70>). Why? They are frequent but not maximal: each is contained in one of the two answer sequences.

Litemset Phase
Find all large itemsets.
  (Why? Because each itemset in a large sequence has to be a large itemset.)
To get large (frequent) itemsets
  Use the Apriori algorithm.
  Need to modify support counting. (For sequential patterns, support is measured by fraction of customers.)

  Customer ID   TransactionTime   Items
  1             1                 30
  1             2                 90
  2             1                 10,20
  2             2                 30
  2             3                 40,60,70
  3             1                 30,50,70
  4             1                 30
  4             2                 40,70
  4             3                 90
  5             1                 90

MinSupport = 40%, i.e. 2 customers
Litemset result: {30} {40} {70} {40 70} {90}
Difference from Apriori: the support count should be incremented only once per customer.

Transformation Phase
Each large itemset is then mapped to a set of contiguous integers.
  (Why? To be able to compare two frequent itemsets in constant time.)
Represent transactions as sets of large itemsets.

  itemset    Map
  {30}       1
  {40}       2
  {70}       3
  {40 70}    4
  {90}       5

  Sequence database                  Transformed database
  1,<30, 90>                         1,<{1}, {5}>
  2,<(10,20), 30, (40, 60, 70)>      2,<{1}, {2, 3, 4}>
  3,<(30, 50, 70)>                   3,<{1, 3}>
  4,<30, (40, 70), 90>               4,<{1}, {2, 3, 4}, {5}>
  5,<90>                             5,<{5}>

[Figure: the customer timelines before and after replacing each transaction by the ids of the litemsets it contains.]
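
The sort, litemset and transformation phases can be sketched together on the example data. This is a simplified illustration under stated assumptions: the function names are hypothetical, large itemsets are found by brute-force subset enumeration rather than by Apriori, and the integer mapping is copied from the slide instead of being generated.

```python
from itertools import combinations

# (customer id, transaction time, items), in arbitrary arrival order.
rows = [
    (1, 1, {30}), (2, 1, {10, 20}), (3, 1, {30, 50, 70}), (4, 1, {30}), (5, 1, {90}),
    (2, 2, {30}), (4, 2, {40, 70}), (1, 2, {90}), (2, 3, {40, 60, 70}), (4, 3, {90}),
]


def sort_phase(rows):
    """Group transactions per customer, ordered by transaction time."""
    sequences = {}
    for cid, _time, items in sorted(rows, key=lambda r: (r[0], r[1])):
        sequences.setdefault(cid, []).append(frozenset(items))
    return sequences


def large_itemsets(sequences, min_customers):
    """Count each itemset at most once per customer (brute force, not Apriori)."""
    counts = {}
    for transactions in sequences.values():
        seen = set()
        for t in transactions:
            for size in range(1, len(t) + 1):
                seen.update(frozenset(c) for c in combinations(sorted(t), size))
        for itemset in seen:
            counts[itemset] = counts.get(itemset, 0) + 1
    return {s: n for s, n in counts.items() if n >= min_customers}


def transform(sequences, mapping):
    """Replace every transaction by the ids of the litemsets it contains."""
    transformed = {}
    for cid, transactions in sequences.items():
        mapped = [{mapping[s] for s in mapping if s <= t} for t in transactions]
        transformed[cid] = [m for m in mapped if m]  # drop transactions with no litemset
    return transformed


sequences = sort_phase(rows)
litemsets = large_itemsets(sequences, min_customers=2)

# Integer mapping as given on the slide.
mapping = {frozenset({30}): 1, frozenset({40}): 2, frozenset({70}): 3,
           frozenset({40, 70}): 4, frozenset({90}): 5}
transformed = transform(sequences, mapping)
```

Note how the transaction (10,20) of customer 2 disappears in the transformed database because it contains no large itemset.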

Sequence Phase
Use the set of large itemsets to find the desired sequences.
Similar structure to the Apriori algorithms used to find large itemsets:
  Use a seed set to generate candidate sequences.
  Count support for each candidate.
  Eliminate candidate sequences which are not large.
MinSupport = 40%, i.e. 2 customers

  Transformed database          Frequent sequences (Apriori)
  1,<{1}, {5}>                  F.Seq.        Sup.
  2,<{1}, {2, 3, 4}>            {1}           4
  3,<{1, 3}>                    {2}           2
  4,<{1}, {2, 3, 4}, {5}>       {3}           3
  5,<{5}>                       {4}           2
                                {5}           3
                                ({1}, {2})    2
                                ({1}, {3})    2
                                ({1}, {4})    2
                                ({1}, {5})    2

Re-mapping with the itemset map ({30} 1, {40} 2, {70} 3, {40 70} 4, {90} 5) gives the frequent sequences:
  <30> <40> <70> <40, 70> <90> <30><40> <30><70> <30, (40, 70)> <30><90>

Maximal Phase
Find the maximal sequences among the set of frequent sequences: delete all sequences that are sub-sequences of other frequent sequences.

  for (k = n; k > 1; k--) do
      for each k-sequence Sk do
          Delete from the set of frequent sequences all subsequences of Sk

Before: <30> <40> <70> <40, 70> <90> <30><40> <30><70> <30, (40, 70)> <30><90>
After:  <30, (40, 70)> <30><90>
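
A minimal sketch of the maximal phase on the nine frequent sequences of the example. The helper names `fs` and `contains` are hypothetical, and instead of the slide's loop from the longest sequences downward this version simply compares every pair, which is fine for a handful of sequences.

```python
def fs(*items):
    """Shorthand for one element (itemset) of a sequence."""
    return frozenset(items)


def contains(sequence, pattern):
    """True if pattern is a subsequence of sequence."""
    position = 0
    for element in pattern:
        while position < len(sequence) and not element <= sequence[position]:
            position += 1
        if position == len(sequence):
            return False
        position += 1
    return True


frequent = [
    (fs(30),), (fs(40),), (fs(70),), (fs(40, 70),), (fs(90),),
    (fs(30), fs(40)), (fs(30), fs(70)), (fs(30), fs(40, 70)), (fs(30), fs(90)),
]

# Keep a sequence only if no other frequent sequence contains it.
maximal = [s for s in frequent if not any(s != t and contains(t, s) for t in frequent)]
```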

Summary for AprioriAll
The algorithm wastes much time counting non-maximal sequences, which cannot be sequential patterns.
  There are other variations of AprioriAll that reduce the candidates that are not maximal: AprioriSome and DynamicSome.
Absence of time window constraints.
AprioriAll is the basis of many efficient algorithms developed later. GSP is among them.

GSP: A Generalized Sequential Pattern Mining Algorithm
GSP (Generalized Sequential Pattern) mining algorithm
  proposed by Agrawal and Srikant, EDBT'96
Outline of the method
  Initially, every item in DB is a candidate of length-1
  for each level (i.e., sequences of length-k) do
    scan database to collect support count for each candidate sequence
    generate candidate length-(k+1) sequences from length-k frequent sequences using Apriori
  repeat until no frequent sequence or no candidate can be found
Major strength: candidate pruning by Apriori

A Basic Property of Sequential Patterns: Apriori
A basic property: Apriori (Agrawal & Srikant'94)
  If a sequence S is not frequent,
  then none of the super-sequences of S is frequent.
  E.g., <hb> is infrequent, and so are <hab> and <(ah)b>.

  Seq. ID   Sequence
  10        <(bd)cb(ac)>
  20        <(bf)(ce)b(fg)>
  30        <(ah)(bf)abf>
  40        <(be)(ce)d>
  50        <a(bd)bcb(ade)>
Given support threshold min_sup = 2

The GSP Algorithm
Take sequences in the form of <x> as length-1 candidates
Scan database once, find F1, the set of length-1 sequential patterns
Let k = 1; while Fk is not empty do
  Form Ck+1, the set of length-(k+1) candidates from Fk;
  If Ck+1 is not empty, scan database once, find Fk+1, the set of length-(k+1) sequential patterns
  Let k = k+1;

Finding Length-1 Sequential Patterns
Examine GSP using an example
Initial candidates: all singleton sequences
  <a>, <b>, <c>, <d>, <e>, <f>, <g>, <h>
Scan database once, count support for candidates

  Seq. ID   Sequence              Cand   Sup
  10        <(bd)cb(ac)>          <a>    3
  20        <(bf)(ce)b(fg)>       <b>    5
  30        <(ah)(bf)abf>         <c>    4
  40        <(be)(ce)d>           <d>    3
  50        <a(bd)bcb(ade)>       <e>    3
                                  <f>    2
  min_sup = 2                     <g>    1
                                  <h>    1

Generating Length-2 Candidates
51 length-2 candidates: 36 of the form <xy> and 15 of the form <(xy)>

         <a>    <b>    <c>    <d>    <e>    <f>
  <a>    <aa>   <ab>   <ac>   <ad>   <ae>   <af>
  <b>    <ba>   <bb>   <bc>   <bd>   <be>   <bf>
  <c>    <ca>   <cb>   <cc>   <cd>   <ce>   <cf>
  <d>    <da>   <db>   <dc>   <dd>   <de>   <df>
  <e>    <ea>   <eb>   <ec>   <ed>   <ee>   <ef>
  <f>    <fa>   <fb>   <fc>   <fd>   <fe>   <ff>

         <a>    <b>      <c>      <d>      <e>      <f>
  <a>           <(ab)>   <(ac)>   <(ad)>   <(ae)>   <(af)>
  <b>                    <(bc)>   <(bd)>   <(be)>   <(bf)>
  <c>                             <(cd)>   <(ce)>   <(cf)>
  <d>                                      <(de)>   <(df)>
  <e>                                               <(ef)>
  <f>

Without the Apriori property, 8*8 + 8*7/2 = 92 candidates
Apriori prunes 44.57% of the candidates

Generating Length-2 Candidates (support counts)
min_sup = 2, same sequence database (Seq. ID 10 to 50) as on the previous slides.

         <a>      <b>      <c>      <d>      <e>      <f>
  <a>    <aa>:2   <ab>:2   <ac>:1   <ad>:1   <ae>:1   <af>:1
  <b>    <ba>:3   <bb>:4   <bc>:4   <bd>:2   <be>:3   <bf>:2
  <c>    <ca>:2   <cb>:3   <cc>:1   <cd>:2   <ce>:1   <cf>:1
  <d>    <da>:2   <db>:2   <dc>:2   <dd>:1   <de>:1   <df>:0
  <e>    <ea>:0   <eb>:1   <ec>:1   <ed>:1   <ee>:1   <ef>:1
  <f>    <fa>:1   <fb>:2   <fc>:1   <fd>:0   <fe>:1   <ff>:2

         <a>    <b>        <c>        <d>        <e>        <f>
  <a>           <(ab)>:0   <(ac)>:1   <(ad)>:1   <(ae)>:1   <(af)>:0
  <b>                      <(bc)>:0   <(bd)>:2   <(be)>:1   <(bf)>:2
  <c>                                 <(cd)>:0   <(ce)>:2   <(cf)>:0
  <d>                                            <(de)>:1   <(df)>:0
  <e>                                                       <(ef)>:0
  <f>

For example, <aa> is supported by SID 30 and 50, and <(bf)> by SID 20 and 30.
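
The length-1 and length-2 steps of this example can be reproduced with a short Python sketch. It only covers the first two levels, uses hypothetical helper names (`contains`, `support`), and builds the length-2 candidates exactly as the tables do: every ordered pair <xy> of frequent items plus every unordered pair <(xy)>.

```python
from itertools import combinations, product

database = [
    [{"b", "d"}, {"c"}, {"b"}, {"a", "c"}],                 # 10
    [{"b", "f"}, {"c", "e"}, {"b"}, {"f", "g"}],            # 20
    [{"a", "h"}, {"b", "f"}, {"a"}, {"b"}, {"f"}],          # 30
    [{"b", "e"}, {"c", "e"}, {"d"}],                        # 40
    [{"a"}, {"b", "d"}, {"b"}, {"c"}, {"b"}, {"a", "d", "e"}],  # 50
]
min_sup = 2


def contains(sequence, pattern):
    """True if pattern is a subsequence of sequence."""
    position = 0
    for element in pattern:
        while position < len(sequence) and not element <= sequence[position]:
            position += 1
        if position == len(sequence):
            return False
        position += 1
    return True


def support(pattern):
    return sum(1 for sequence in database if contains(sequence, pattern))


items = sorted({item for sequence in database for element in sequence for item in element})
f1 = [item for item in items if support([{item}]) >= min_sup]

# 6*6 candidates of the form <xy> plus 6*5/2 candidates of the form <(xy)>.
c2 = [[{x}, {y}] for x, y in product(f1, repeat=2)] + [[{x, y}] for x, y in combinations(f1, 2)]
f2 = [candidate for candidate in c2 if support(candidate) >= min_sup]

# Candidate count if the infrequent items g and h had not been pruned first.
unpruned = len(items) ** 2 + len(items) * (len(items) - 1) // 2
```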

Length-2 Sequential Patterns
After scanning the database to collect the support count for each length-2 candidate:
  There are 19 length-2 candidates which pass the minimum support threshold
  They are the length-2 sequential patterns
    16 of them in the pattern of <xy>
    3 of them in the pattern of <(xy)>

Generating Length-3 Candidates and Finding Length-3 Patterns
Generate Length-3 Candidates
  Self-join length-2 sequential patterns
    Based on the Apriori property
    <ab>, <aa> and <ba> are all length-2 sequential patterns, so <aba> is a length-3 candidate
      54 candidates are generated
    <(bd)>, <bb> and <db> are all length-2 sequential patterns, so <(bd)b> is a length-3 candidate
      27 candidates are generated:
      a(bd), (bd)a, b(bd), (bd)b, (bd)c, (bd)d, (bd)e, (bd)f, c(bd), d(bd), f(bd),
      a(bf), (bf)a, (bf)b, b(bf), (bf)c, (bf)d, (bf)e, (bf)f, c(bf), d(bf), f(bf),
      b(ce), (ce)a, (ce)b, (ce)d, d(ce)
Find Length-3 Sequential Patterns
  Scan database once more, collect support counts for candidates
  19 out of 81 candidates pass the support threshold

Generating Length-3 Candidates (pattern <xyz>)
min_sup = 2, same sequence database (Seq. ID 10 to 50) and length-2 support tables as on the previous slides.
Example of generating the <xyz> pattern for <aa>:
  Need to concatenate another length-2 frequent sequence
  Concatenating the frequent sequences that start with a gives <aaa> and <aab>
The 54 candidates of the form <xyz>:
  <aaa>:0, <aab>:1
  <aba>:2, <abb>:2, <abc>:1, <abd>:1, <abe>:1, <abf>:1
  <baa>, <bab>
  <bba>, <bbb>, <bbc>, <bbd>, <bbe>, <bbf>
  <bca>, <bcb>, <bcd>
  <bda>, <bdb>, <bdc>
  <bfb>, <bff>
  <caa>, <cab>
  <cba>, <cbb>, <cbc>, <cbd>, <cbe>, <cbf>
  <cda>, <cdb>, <cdc>
  <daa>, <dab>
  <dba>, <dbb>, <dbc>, <dbd>, <dbe>, <dbf>
  <dca>, <dcb>, <dcd>
  <fba>, <fbb>, <fbc>, <fbd>, <fbe>, <fbf>
  <ffb>, <fff>

Generating Length-3 Candidates (pattern <(xy)z>)
Example of generating the <(xy)z> pattern for <(bd)>:
  Need to concatenate another length-2 frequent pattern
  Concatenating those patterns that end with b or d to form something like
    <a(bd)>, <b(bd)>, <c(bd)>, <d(bd)>, <f(bd)>
  Concatenating those patterns that start with b or d to form something like
    <(bd)a>, <(bd)b>, <(bd)c>, <(bd)d>, <(bd)e>, <(bd)f>
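
The count of 54 candidates of the form <xyz> can be checked with a few lines of Python. This sketch assumes single-character items, writes each length-2 pattern <xy> as the string "xy", and only performs the join (a pattern ending in y is concatenated with a pattern starting with y). The extra filter at the end, requiring <xz> to be frequent as well, is the stricter Apriori check hinted at by the <aba> example and is shown only for comparison.

```python
# The 16 frequent length-2 patterns of the form <xy> from the support table.
f2 = ["aa", "ab", "ba", "bb", "bc", "bd", "be", "bf",
      "ca", "cb", "cd", "da", "db", "dc", "fb", "ff"]

# Join step: <xy> followed by <yz> gives the candidate <xyz>.
candidates = sorted(p + q[1] for p in f2 for q in f2 if p[1] == q[0])

# Stricter Apriori check: the remaining length-2 subsequence <xz> must be frequent too.
apriori_checked = [c for c in candidates if c[0] + c[2] in f2]
```

With the join alone, `candidates` reproduces the list on the slide, starting with "aaa" and "aab".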

The GSP Mining Process
min_sup = 2, same sequence database (Seq. ID 10 to 50) as on the previous slides.
[Figure: the candidate lattice explored level by level.
  Length 1: <a> <b> <c> <d> <e> <f> <g> <h>
  Length 2: <aa> <ab> … <af> <ba> <bb> … <ff> <(ab)> … <(ef)>
  Length 3: <abb> <aab> <aba> <baa> <bab> …
  Length 4: <abba> <(bd)bc> …
  Length 5: <(bd)cba>
  Legend: candidate cannot pass the support threshold; candidate not in DB at all.]

Bottlenecks of GSP
A huge set of candidates could be generated
  1,000 frequent length-1 sequences generate $1000 \times 1000 + \frac{1000 \times 999}{2} = 1{,}499{,}500$ length-2 candidates!
Multiple scans of database
Real challenge: mining long sequential patterns
  An exponential number of short candidates
  A length-100 sequential pattern needs about $10^{30}$ candidate sequences:
  $\sum_{i=1}^{100} \binom{100}{i} = 2^{100} - 1 \approx 10^{30}$
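
Both bottleneck figures are easy to verify with exact integer arithmetic in Python (the variable names are illustrative):

```python
from math import comb

n = 1000
# n*n candidates <xy> plus n*(n-1)/2 candidates <(xy)>.
length_2_candidates = n * n + n * (n - 1) // 2

# Every non-empty sub-pattern of a length-100 pattern is itself a candidate.
short_candidates = sum(comb(100, i) for i in range(1, 101))
```

`short_candidates` equals 2**100 - 1, roughly 1.27 * 10**30, which is where the order of magnitude on the slide comes from.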

Lecture Outline
Part I: Concepts (30 minutes)
  Basic concepts
Part II: Apriori-based Approaches (45 minutes)
  Apriori-All
  GSP
Part III: Pattern-Growth-based Approaches (45 minutes)
  Free-Span (Frequent Pattern-Projected Sequential Pattern Mining)
  Prefix-Span (Prefix-Projected Sequential Pattern)

FreeSpan (Generalities)
J. Han, J. Pei, B. Mortazavi-Asl, Q. Chen, U. Dayal, and M.-C. Hsu. FreeSpan: Frequent pattern-projected sequential pattern mining. KDD'00, pages 355-359.
A divide-and-conquer approach
  Recursively project a sequence database into a set of smaller databases
  Mine each projected database to find the subset of patterns

FreeSpan (example)
  SID   Sequence
  10    <a(abc)(ac)d(cf)>
  20    <(ad)c(bc)(ae)>
  30    <(ef)(ab)(df)cb>
  40    <eg(af)cbc>
Given a sequence database S and min_support = 2
Step 1: find length-1 sequential patterns and list them in support descending order
  f_list = a:4, b:4, c:4, d:3, e:3, f:3; g:1
Step 2: divide search space. The complete set of seq. patterns can be partitioned into 6 disjoint subsets (move down the f_list):
  ones that only contain item a
  ones that contain item b but no items after b in f_list
  ones that contain item c but no items after c in f_list
  ones that contain item d but no items after d in f_list
  ones that contain item e but no items after e in f_list
  ones that contain item f
Step 3: find subsets of sequential patterns. They can be mined by constructing projected databases and mining each recursively.

FreeSpan (cont)
f_list = a:4, b:4, c:4, d:3, e:3, f:3
Finding seq. patterns containing item b but no items after b in f_list
  <b>-projected database: 10:<a(ab)a>, 20:<aba>, 30:<(ab)b>, 40:<ab>
  Find all the length-2 seq. patterns containing item b but no items after b in f_list:
    <ab>:4, <ba>:2, <(ab)>:2
  Further partition and mining
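
Steps 1 and 2 of the FreeSpan example can be sketched as follows. This is only an illustration of the f_list and of one projected database, not the FreeSpan algorithm itself; the function names are hypothetical, and ties in the f_list are broken alphabetically to match the slide.

```python
database = {
    10: [{"a"}, {"a", "b", "c"}, {"a", "c"}, {"d"}, {"c", "f"}],
    20: [{"a", "d"}, {"c"}, {"b", "c"}, {"a", "e"}],
    30: [{"e", "f"}, {"a", "b"}, {"d", "f"}, {"c"}, {"b"}],
    40: [{"e"}, {"g"}, {"a", "f"}, {"c"}, {"b"}, {"c"}],
}
min_support = 2


def item_support(item):
    """Number of sequences in which the item occurs at least once."""
    return sum(1 for seq in database.values() if any(item in element for element in seq))


items = sorted({item for seq in database.values() for element in seq for item in element})
f_list = sorted(
    ((item, item_support(item)) for item in items if item_support(item) >= min_support),
    key=lambda pair: (-pair[1], pair[0]),
)


def project(item):
    """Keep only the items up to `item` in f_list, for sequences that contain `item`."""
    names = [name for name, _ in f_list]
    allowed = set(names[: names.index(item) + 1])
    projected = {}
    for sid, seq in database.items():
        kept = [element & allowed for element in seq if element & allowed]
        if any(item in element for element in kept):
            projected[sid] = kept
    return projected


b_projected = project("b")
```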

From FreeSpan to PrefixSpan
FreeSpan:
  Projection-based: no candidate sequence needs to be generated
  But projection can be performed at any point in the sequence, and the projected sequences may not shrink much. For example, the size of the f-projected database is the same as the original sequence database.
PrefixSpan:
  Projection-based
  But only prefix-based projection: fewer projections and quickly shrinking sequences
J. Pei, J. Han, H. Pinto, Q. Chen, U. Dayal, and M.-C. Hsu, "PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth", Proc. 2001 Int. Conf. on Data Engineering (ICDE'01), Heidelberg, Germany, April 2001.

Prefix of a Sequence
Given two sequences α = <a1 a2 … an> and β = <b1 b2 … bm>, m ≤ n,
sequence β is called a prefix of α if and only if:
  bi = ai for i ≤ m-1;
  bm ⊆ am;
  all the items in (am − bm) are alphabetically after those in bm
Given an alphabetical order of items in each itemset (element).
Examples with α = <a(abc)(ac)d(cf)>:
  Prefixes of α:      β = <a>, β = <a(ab)>, β = <a(abc)a>
  Not prefixes of α:  β = <a(ab)a>, β = <a(abc)c>

Projection
Given sequences α and β, such that β is a subsequence of α.
A subsequence α' of sequence α is called a projection of α w.r.t. prefix β if and only if
  α' has prefix β;
  there exists no proper super-sequence α'' of α' such that α'' is a subsequence of α and also has prefix β
Example: α = <a(abc)(ac)d(cf)>, β = <(bc)a>, α' = <(bc)(ac)d(cf)>

Postfix
Let α' = <a1 a2 … an> be the projection of α w.r.t. prefix β = <a1 a2 … am-1 a'm> (m ≤ n).
Sequence γ = <a''m am+1 … an> is called the postfix of α w.r.t. prefix β, denoted as γ = α / β, where a''m = (am − a'm).
We also denote α = β·γ.
Example: α = <a(abc)(ac)d(cf)>, β = <a(abc)a>, γ = <(_c)d(cf)>
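
The prefix and postfix definitions can be sketched in Python for the simple case where β is a literal prefix of α (the general projection case, where β is only a subsequence, is not covered). The function names are hypothetical, and the postfix is returned as a pair: the leftover items of the partially consumed element, written (_x) on the slides, and the remaining elements.

```python
def is_prefix(beta, alpha):
    """Check the three prefix conditions; sequences are lists of non-empty item sets."""
    m = len(beta)
    if m == 0 or m > len(alpha):
        return False
    if beta[:-1] != alpha[:m - 1]:           # b_i = a_i for i <= m-1
        return False
    last, target = beta[-1], alpha[m - 1]
    if not last <= target:                    # b_m ⊆ a_m
        return False
    # Items left over in a_m must come alphabetically after those in b_m.
    return all(item > max(last) for item in target - last)


def postfix(alpha, beta):
    """Return (leftover items of a_m, remaining elements) for prefix beta."""
    if not is_prefix(beta, alpha):
        raise ValueError("beta is not a prefix of alpha")
    m = len(beta)
    return alpha[m - 1] - beta[-1], alpha[m:]


alpha = [{"a"}, {"a", "b", "c"}, {"a", "c"}, {"d"}, {"c", "f"}]   # <a(abc)(ac)d(cf)>

prefix_checks = [
    is_prefix([{"a"}], alpha),                                    # <a>
    is_prefix([{"a"}, {"a", "b"}], alpha),                        # <a(ab)>
    is_prefix([{"a"}, {"a", "b", "c"}, {"a"}], alpha),            # <a(abc)a>
    is_prefix([{"a"}, {"a", "b"}, {"a"}], alpha),                 # <a(ab)a>
    is_prefix([{"a"}, {"a", "b", "c"}, {"c"}], alpha),            # <a(abc)c>
]
gamma = postfix(alpha, [{"a"}, {"a", "b", "c"}, {"a"}])           # <(_c)d(cf)>
```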

PrefixSpan Algorithm
Input: A sequence database S, and the minimum support threshold min_sup
Output: The complete set of sequential patterns
Method: Call PrefixSpan(<>, 0, S)
Subroutine PrefixSpan(α, l, S|α)
Parameters:
  α: sequential pattern,
  l: the length of α;
  S|α: the α-projected database, if α ≠ <>; otherwise, the sequence database S.

PrefixSpan Algorithm (2)
Method
  1. Scan S|α once, find the set of frequent items b such that:
     a) b can be assembled to the last element of α to form a sequential pattern; or
     b) <b> can be appended to α to form a sequential pattern.
  2. For each frequent item b, append it to α to form a sequential pattern α', and output α';
  3. For each α', construct the α'-projected database S|α', and call PrefixSpan(α', l+1, S|α').

PrefixSpan - Example
  id    Sequence
  10    <a(abc)(ac)d(cf)>
  20    <(ad)c(bc)(ae)>
  30    <(ef)(ab)(df)cb>
  40    <eg(af)cbc>
min_support = 2
1. Find length-1 sequential patterns
     <a>  <b>  <c>  <d>  <e>  <f>  <g>
     4    4    4    3    3    3    1
2. Divide search space
   Partition search space into 6 subsets:
     ones having prefix <a>;
     ones having prefix <b>;
     …
     ones having prefix <f>;

   Prefix   Projected database
   <a>      <(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>
   <b>      <(_c)(ac)d(cf)>, <(_c)(ae)>, <(df)cb>, <c>
   <c>      <(ac)d(cf)>, <(bc)(ae)>, <b>, <bc>
   <d>      <(cf)>, <c(bc)(ae)>, <(_f)cb>
   <e>      <(_f)(ab)(df)cb>, <(af)cbc>
   <f>      <(ab)(df)cb>, <cbc>

PrefixSpan Example (2)
3. Find subsets of sequential patterns
   Let's see the case of <d>
   Projected database for <d>: <(cf)>, <c(bc)(ae)>, <(_f)cb>
     <a>  <b>  <c>  <d>  <e>  <f>  <(_f)>
     1    2    3    0    1    1    1
   Frequent extensions: <db> and <dc>
   Projected database for <db>: <(_c)(ae)>, <>
     no frequent item
   Projected database for <dc>: <(_f)>, <(bc)(ae)>, <b>
     <b> has support 2, which gives <dcb>
   Patterns found with prefix <d>: <d>, <db>, <dc>, <dcb>
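
A compact Python sketch in the spirit of PrefixSpan is given below. It is an illustration under several assumptions, not the authors' implementation: items are single characters, `parse` and `prefixspan` are hypothetical names, and the projected database is kept as a list of pointers (sequence index, index of the element that matches the last element of the prefix) rather than as physical postfixes. For each prefix it collects the two kinds of extension described in the method: items that can be appended as a new element, and items that can be assembled into the last element.

```python
from collections import defaultdict


def parse(text):
    """'<a(abc)d>' -> [frozenset('a'), frozenset('abc'), frozenset('d')] (single-char items)."""
    elements, group = [], None
    for ch in text.strip("<>"):
        if ch == "(":
            group = set()
        elif ch == ")":
            elements.append(frozenset(group))
            group = None
        elif group is not None:
            group.add(ch)
        else:
            elements.append(frozenset(ch))
    return elements


def prefixspan(database, min_support):
    """Return {pattern: support}; a pattern is a tuple of sorted item tuples."""
    patterns = {}

    def grow(pattern, pointers):
        last = frozenset(pattern[-1]) if pattern else None
        appended = defaultdict(dict)    # item -> {sequence index: element index}
        assembled = defaultdict(dict)
        for sid, pos in pointers:
            sequence = database[sid]
            # (b) item starts a new element strictly after the matched position
            for k in range(pos + 1, len(sequence)):
                for item in sequence[k]:
                    appended[item].setdefault(sid, k)
            if last is None:
                continue
            # (a) item joins the last element: some element holds last + item
            for k in range(pos, len(sequence)):
                if last <= sequence[k]:
                    for item in sequence[k]:
                        if item > max(last):
                            assembled[item].setdefault(sid, k)
        for item, where in sorted(appended.items()):
            if len(where) >= min_support:
                report(pattern + ((item,),), where)
        for item, where in sorted(assembled.items()):
            if len(where) >= min_support:
                report(pattern[:-1] + (pattern[-1] + (item,),), where)

    def report(pattern, where):
        patterns[pattern] = len(where)
        grow(pattern, list(where.items()))   # recurse on the projected pointers

    grow((), [(sid, -1) for sid in range(len(database))])
    return patterns


database = [parse(s) for s in
            ("<a(abc)(ac)d(cf)>", "<(ad)c(bc)(ae)>", "<(ef)(ab)(df)cb>", "<eg(af)cbc>")]
patterns = prefixspan(database, min_support=2)
with_prefix_d = {p for p in patterns if p[0] == ("d",)}
```

On the four-sequence example with min_support = 2 this yields the 53 patterns listed earlier, and `with_prefix_d` holds <d>, <db>, <dc> and <dcb>.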

PrefixSpan - characteristics
No candidate sequence needs to be generated by PrefixSpan
Projected databases keep shrinking
The major cost of PrefixSpan is the construction of projected databases
  How to reduce this cost?
Different projection methods
  Bi-level projection
    reduces the number and the size of projected databases
  Pseudo-projection
    reduces the cost of projection when the projected database can be held in main memory

Bi-level Projection
  id    Sequence
  10    <a(abc)(ac)d(cf)>
  20    <(ad)c(bc)(ae)>
  30    <(ef)(ab)(df)cb>
  40    <eg(af)cbc>
min_support = 2
Scan to get 1-length sequences
Construct a triangular matrix instead of projected databases for each length-1 pattern

  a   2
  b   (4,2,2)   1
  c   (4,2,1)   (3,3,2)   3
  d   (2,1,1)   (2,2,0)   (1,3,0)   0
  e   (1,2,1)   (1,2,0)   (1,2,0)   (1,1,0)   0
  f   (2,1,1)   (2,2,0)   (1,2,1)   (1,1,1)   (2,0,1)   1
      a         b         c         d         e         f

The matrix gives ALL length-2 sequential patterns. For example, the entry (4,2,1) for (c, a) means
  Support(<ac>) = 4, Support(<ca>) = 2, Support(<(ac)>) = 1
and the diagonal entry 3 for c means Support(<cc>) = 3.

Bi-level projection (2)
For each length-2 sequential pattern α, construct the α-projected database and find the frequent items
Construct the corresponding S-matrix
Example: the <ab>-projected database
  <(_c)(ac)(cf)>
  <(_c)a>
  <c>

  item    a   b   c   (_c)   d   (_d)   e   (_e)   f   (_f)
  count   2   0   2   2      0   1      0   0      1   0

  S-matrix for <ab>:
  a      0
  c      (1,0,1)   1
  (_c)   (∅,2,∅)   (∅,1,∅)
         a         c         (_c)

  Patterns found: <aba>, <abc>, <a(bc)>, and then <a(bc)a>

Bi-level projection (3) - optimization
Do we need to include every item in a postfix in the projected databases?
NO! Item pruning in projected database by 3-way Apriori checking
  If <ac> is not frequent, any super-sequence of it can never be a sequential pattern, so c can be excluded from the construction of the <ab>-projected database.
  If <a(bd)> is not frequent, then to construct the <a(bc)>-projected database, sequence <a(bcde)df> should be projected to <(_e)df> instead of <(_de)df>.

Pseudo-Projection
Observation: postfixes of a sequence often appear repeatedly in recursive projected databases
Method: instead of constructing a physical projection by collecting all the postfixes, we can use pointers referring to the sequences in the database as a pseudo-projection
Every projection consists of two pieces of information: pointer to the sequence in the database and offset to the postfix in the sequence

  s1 = <a(abc)(ac)d(cf)>
  Pointer   Offset   Postfix
  s1        2        <(abc)(ac)d(cf)>
  s1        5        <(ac)d(cf)>
  s1        6        <(_c)d(cf)>

Efficiency of Prefix-Span and Effect of Pseudo-Projection
[Figure: performance charts comparing PrefixSpan-1 (level-by-level projection) and PrefixSpan-2 (bi-level projection).]
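
The pointer-and-offset idea can be sketched as follows. The representation is an assumption made for this illustration: each element is a string of alphabetically sorted single-character items, the offset is the 1-based index of the item at which the postfix starts (which matches the three rows of the table), and `postfix_at` is a hypothetical helper that materialises a postfix only when it is needed.

```python
sequences = {"s1": ["a", "abc", "ac", "d", "cf"]}       # s1 = <a(abc)(ac)d(cf)>

# A pseudo-projected database stores (pointer, offset) pairs, not postfix copies.
pseudo_projection = [("s1", 2), ("s1", 5), ("s1", 6)]


def postfix_at(sequence, offset):
    """Render the postfix that starts at the 1-based item offset."""
    parts, seen = [], 0
    for element in sequence:
        skip = max(offset - 1 - seen, 0)    # items of this element before the offset
        seen += len(element)
        rest = element[skip:]
        if not rest:
            continue                        # element lies entirely before the offset
        if skip > 0:
            parts.append("(_" + rest + ")")  # partially consumed element
        elif len(rest) > 1:
            parts.append("(" + rest + ")")
        else:
            parts.append(rest)
    return "<" + "".join(parts) + ">"


postfixes = [postfix_at(sequences[pointer], offset) for pointer, offset in pseudo_projection]
```

Only two small integers are stored per projected sequence, which is why this pays off when the original database fits in main memory.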

Summary
Sequential pattern mining is useful in many applications, e.g. weblog analysis, financial market prediction, bioinformatics, etc.
It is similar to frequent itemset mining, but with consideration of ordering.
We have looked at different approaches that are descendants of two popular algorithms for mining frequent itemsets:
  Candidate generation: AprioriAll and GSP
  Pattern growth: FreeSpan and PrefixSpan
