
Data Mining & Knowledge Discovery
Fall 2007
Chapter 3: Sequential Pattern Analysis
Dr. Osmar R. Zaïane
University of Alberta

Course Content
- Introduction to Data Mining
- Association Analysis
- Sequential Pattern Analysis
- Classification and prediction
- Contrast Sets
- Data Clustering
- Outlier Detection
- Web Mining
- Other topics if time permits (spatial data, biomedical data, etc.)

Dr. Osmar R. Zaïane, 1999, 2007. Data Mining and Knowledge Discovery, University of Alberta.

Chapter 3 Objectives
Understand Sequential Pattern Analysis in the context of transactional data and get a brief introduction to the different algorithms for sequential pattern discovery.

Sequence of Transactions
- Association rule mining searches for relationships between items in a dataset where time is irrelevant.
    Store: {a,b,c,d}, {x,y,z}, {...}
- Sequential Pattern Analysis considers time (or order of transactions).
    Customer x, Store: <{a,b,c,d}, {x,y,z}, ...> (day 1, day 2, ...)
- Data: sequences of evidence in time order
- Target: sub-sequences that happened frequently
Lecture Outline
- Part I: Concepts (30 minutes)
    Basic concepts
- Part II: Apriori-based Approaches (45 minutes)
    Apriori-All
    GSP
- Part III: Pattern-Growth-based Approaches (45 minutes)
    Free-Span
    Prefix-Span

Sequence Pattern Examples
- Example 1
    60% of customers typically rent Star Wars, then Empire Strikes Back, and then Return of the Jedi.
    Note: these rentals need not be consecutive.
    <SW>, ..., <ESB>, ..., <RJ>
- Example 2
    60% of customers buy fitted sheet and flat sheet and pillow, followed by comforter, followed by drapes and ruffles.
    Note: elements of a sequential pattern need not be simple items.
    <FittedSheet, FlatSheet, Pillow>, ..., <Comforter>, ..., <Drapes, Ruffles>


Why sequential pattern mining?
- Time or order in which actions appear or happen can be relevant in decision making.
- Many applications of sequential pattern mining:
    Customer shopping sequences (idem for book/video rental): first buy a computer, then a CD-ROM, and then a digital camera, within 3 months
    Medical treatment (e.g., symptoms and diseases)
    Serial crime solving
    Natural disasters (e.g., earthquakes)
    Science & engineering processes
    Stocks and markets
    Telephone calling patterns
    Web access log click streams
    DNA sequences and gene structures, etc.

Sequence Database
Converts the original transaction database into a database of customer sequences.
Sort transactions: Customer ID = major key, Transaction Time = minor key.

    Transactional database     Sorted transactions
    Cust1 {30}                 Cust1 {30}
    Cust2 {10,20}              Cust1 {90}
    Cust3 {30, 50, 70}         Cust2 {10,20}
    Cust4 {30}                 Cust2 {30}
    Cust5 {90}                 Cust2 {40, 60, 70}
    Cust2 {30}                 Cust3 {30, 50, 70}
    Cust4 {40, 70}             Cust4 {30}
    Cust1 {90}                 Cust4 {40, 70}
    Cust2 {40, 60, 70}         Cust4 {90}
    Cust4 {90}                 Cust5 {90}

    Sequence database
    1, <30, 90>
    2, <(10,20), 30, (40, 60, 70)>
    3, <(30, 50, 70)>
    4, <30, (40, 70), 90>
    5, <90>

[Figure: one timeline per customer (CID=1 to CID=5) showing that customer's transactions in time order.]

A customer sequence is an ordered list of itemsets < a1, a2, a3 >, e.g. <(10,20)(30)(40,60,70)>.
- <(20)(30)(40)> is contained in <(10,20)(30)(40,60,70)>
- <(20)(30,40)> is not contained in <(10,20)(30)(40,60,70)>

What Is Sequential Pattern Mining?
Given a set of sequences, find the complete set of frequent subsequences.
- A sequence: <(ef)(ab)(df)cb>
- A sequence database:

    SID  sequence
    10   <a(abc)(ac)d(cf)>
    20   <(ad)c(bc)(ae)>
    30   <(ef)(ab)(df)cb>
    40   <eg(af)cbc>

- An element may contain a set of items. Items within an element are unordered and we list them alphabetically.
- <a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>.
- Given support threshold min_sup = 2, <(ab)c> is a sequential pattern (cf. SID 10 & 30).

Sequential Patterns
Given
- a set of sequences, where each sequence consists of a list of elements and each element consists of a set of items
- a user-specified min_support threshold

Using the same four sequences (id 10 to 40):
- <a(abc)(ac)d(cf)> has 5 elements and 9 items; it is a 9-sequence.
- <a(abc)(ac)d(cf)> = <a(cba)(ac)d(cf)>: the order of items within an element does not matter.
- <a(abc)(ac)d(cf)> ≠ <a(ac)(abc)d(cf)>: the order of elements matters (sequence).


Sequential Pattern Mining
Find all the frequent subsequences, i.e. the subsequences whose occurrence frequency in the set of sequences is no less than min_support.

    id  Sequence
    10  <a(abc)(ac)d(cf)>
    20  <(ad)c(bc)(ae)>
    30  <(ef)(ab)(df)cb>
    40  <eg(af)cbc>

    min_support = 2

Solution: 53 frequent subsequences
    <a> <aa> <ab> <a(bc)> <a(bc)a> <aba> <abc>
    <(ab)> <(ab)c> <(ab)d> <(ab)f> <(ab)dc> <ac>
    <aca> <acb> <acc> <ad> <adc> <af>
    <b> <ba> <bc> <(bc)> <(bc)a> <bd> <bdc> <bf>
    <c> <ca> <cb> <cc>
    <d> <db> <dc> <dcb>
    <e> <ea> <eab> <eac> <eacb> <eb> <ebc> <ec>
    <ecb> <ef> <efb> <efc> <efcb>
    <f> <fb> <fbc> <fc> <fcb>

Subsequence vs. super sequence
Given two sequences $\alpha = \langle a_1 a_2 \dots a_n \rangle$ and $\beta = \langle b_1 b_2 \dots b_m \rangle$:
- $\alpha$ is called a subsequence of $\beta$, denoted $\alpha \sqsubseteq \beta$, if there exist integers $1 \le j_1 < j_2 < \dots < j_n \le m$ such that $a_1 \subseteq b_{j_1}, a_2 \subseteq b_{j_2}, \dots, a_n \subseteq b_{j_n}$.
- $\beta$ is a super sequence of $\alpha$.
- A sequence s is maximal if it is not contained in any other sequence.

Examples with $\beta$ = <a(abc)(ac)d(cf)>:
- Subsequences of $\beta$: $\alpha_1$ = <aa(ac)d(c)>, $\alpha_2$ = <(ac)(ac)d(cf)>, $\alpha_3$ = <ac>
- Not subsequences of $\beta$: $\alpha_4$ = <df(cf)>, $\alpha_5$ = <(cf)d>, $\alpha_6$ = <(abc)dcf>

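The subsequence definition can be checked mechanically. The following is a minimal Python sketch, assuming each sequence is represented as a list of item sets; the function name `embedding` and this representation are illustrative choices, not part of the lecture. When the sequence is contained, it returns one valid choice of the 1-based indices j1 < j2 < ... < jn.

```python
def embedding(alpha, beta):
    """Return 1-based indices j1 < ... < jn with alpha[i] a subset of beta[j_i], or None.

    Greedy matching: each element of alpha is matched to the earliest
    remaining element of beta that is a superset of it.
    """
    indices, pos = [], 0
    for a in alpha:
        while pos < len(beta) and not a <= beta[pos]:
            pos += 1
        if pos == len(beta):
            return None          # alpha is not a subsequence of beta
        indices.append(pos + 1)  # report 1-based positions, as in the definition
        pos += 1
    return indices


beta = [{"a"}, {"a", "b", "c"}, {"a", "c"}, {"d"}, {"c", "f"}]  # <a(abc)(ac)d(cf)>
print(embedding([{"a"}, {"b", "c"}, {"d"}, {"c"}], beta))       # <a(bc)dc>
print(embedding([{"c", "f"}, {"d"}], beta))                     # <(cf)d>
```

Matching each element as early as possible is enough here: if any valid index choice exists, the earliest one also leaves the most room for the remaining elements.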
Sequence Support Count
- A sequence database is a set of tuples <sid, s>.
- A tuple <sid, s> is said to contain a sequence $\alpha$ if $\alpha$ is a subsequence of s, i.e., $\alpha \sqsubseteq s$.
- Support count: the support of a sequence $\alpha$ is the number of tuples containing $\alpha$.

    id  Sequence
    10  <a(abc)(ac)d(cf)>
    20  <(ad)c(bc)(ae)>
    30  <(ef)(ab)(df)cb>
    40  <eg(af)cbc>

    $\alpha_1$ = <a>        support($\alpha_1$) = 4
    $\alpha_2$ = <ac>       support($\alpha_2$) = 4
    $\alpha_3$ = <(ab)c>    support($\alpha_3$) = 2

Counting Sequences (An example)
[Figure: a generated candidate, <(3)(4,5)(8)>, is compared with each data sequence, and a counter is incremented for every sequence that contains it. IF the count reaches Min_Support (50%) THEN <(3)(4,5)(8)> is frequent.]
Sequences shown in the figure, with containment of the candidate checked against the subsequence definition:

    <(7)(3,8)(9)(4,5,6)(8)>     contained
    <(3)(5,4)(6)(9,8)>          contained
    <(10)(3)(5,4)(9,8)>         contained
    <(8)(3,8)(9)(4,5)(6)(7)>    not contained
    <(3)(5,4)>                  not contained
    <(5,4)(6,6)(9,8)>           not contained
    <(3)(5)(4)(6)(9,8)>         not contained
    <(3)(5,2,1)(6)(9,8)>        not contained

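Support counting follows directly from the containment test. The sketch below is illustrative only: the helper names `parse`, `contains` and `support` are assumptions, and the parser handles only the single-letter items used in the slide notation.

```python
def parse(text):
    """Parse the slide notation, e.g. 'a(abc)d', into a list of item sets."""
    elements, i = [], 0
    while i < len(text):
        if text[i] == "(":
            j = text.index(")", i)
            elements.append(set(text[i + 1:j]))
            i = j + 1
        else:
            elements.append({text[i]})
            i += 1
    return elements


def contains(s, alpha):
    """True if alpha is a subsequence of s (elements matched left to right)."""
    remaining = iter(s)  # shared iterator: each match consumes a prefix of s
    return all(any(a <= element for element in remaining) for a in alpha)


def support(database, pattern):
    """Number of tuples <sid, s> whose sequence s contains the pattern."""
    alpha = parse(pattern)
    return sum(contains(s, alpha) for s in database.values())


database = {
    10: parse("a(abc)(ac)d(cf)"),
    20: parse("(ad)c(bc)(ae)"),
    30: parse("(ef)(ab)(df)cb"),
    40: parse("eg(af)cbc"),
}
for pattern in ["a", "ac", "(ab)c"]:
    print(pattern, support(database, pattern))
```

Each tuple contributes at most 1 to the count, no matter how many times the pattern occurs inside its sequence.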

Challenges on Sequential Pattern Mining
- A huge number of possible sequential patterns are hidden in databases
- A mining algorithm should
    find the complete set of patterns, when possible, satisfying the minimum support (frequency) threshold
    be highly efficient, scalable, involving only a small number of database scans
    be able to incorporate various kinds of user-specific constraints
- Comparison of association rules and sequence mining
    Mining for association rules
      Purpose: discovery of frequent unordered itemsets.
      Complexity: with n items there are $\binom{n}{k} = \frac{n!}{k!\,(n-k)!}$ k-itemsets (sets with k items).
    Mining for sequential patterns
      Purpose: discovery of frequent sequences of (unordered) itemsets.
      Complexity: with n items there are $n^k$ k-sequences (sequences with k items).
    Association mining algorithms discover isolated item sets (intra-event patterns).
    Sequence mining algorithms discover series of item sets (inter-event patterns).

Studies on Sequential Pattern Mining
- Concept introduction and an initial Apriori-like algorithm
    R. Agrawal & R. Srikant. Mining sequential patterns, ICDE'95
- GSP: an Apriori-based, influential mining method (developed at IBM Almaden)
    R. Srikant & R. Agrawal. Mining sequential patterns: Generalizations and performance improvements, EDBT'96
- From sequential patterns to episodes (Apriori-like + constraints)
    H. Mannila, H. Toivonen & A.I. Verkamo. Discovery of frequent episodes in event sequences, Data Mining and Knowledge Discovery, 1997
- Data projection-based approaches
    FreeSpan (Han et al., Frequent pattern-projected sequential pattern mining, SIGKDD 2000)
    PrefixSpan (Pei et al., Prefix-projected pattern growth, ICDE 2001)
- Mining sequential patterns with constraints
    M.N. Garofalakis, R. Rastogi, K. Shim: SPIRIT: Sequential Pattern Mining with Regular Expression Constraints. VLDB 1999
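The complexity comparison between k-itemsets and k-sequences can be made concrete with a few lines of arithmetic. This is a small sketch; the function names are illustrative, and the sequence count follows the slide's simplified formula, which counts ordered arrangements of k items with repetition and ignores how items are grouped into elements.

```python
from math import comb


def count_k_itemsets(n, k):
    """Unordered sets of k distinct items chosen from n: n! / (k! (n-k)!)."""
    return comb(n, k)


def count_k_sequences(n, k):
    """Ordered sequences of k items chosen from n, repetition allowed: n^k."""
    return n ** k


for n, k in [(10, 3), (100, 3), (1000, 2)]:
    print(n, k, count_k_itemsets(n, k), count_k_sequences(n, k))
```

Even for small k the ordered search space is several times larger than the unordered one, which is why sequence mining is the harder problem.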
Lecture Outline: Part II, Apriori-based Approaches (45 minutes)
- Apriori-All
- GSP

AprioriAll: The idea
- Basic method to mine sequential patterns
- Based on the Apriori algorithm
- Count all the large sequences, including non-maximal sequences
- Use the Apriori-generate function to generate candidate sequences: get candidates for a pass using only the large sequences found in the previous pass, and make a scan over the data to find their support

AprioriAll: The big picture
Five-phase algorithm
1. Sort phase: create the sequence database from transactions.
2. Large itemset phase: find all frequent itemsets using Apriori.
3. Transformation phase: do integer mapping for large itemsets.
4. Sequence phase: find all frequent sequential patterns using Apriori.
5. Maximal phase: eliminate non-maximal sequences.

AprioriAll Algorithm
Ck: candidate sequences of size k
Lk: frequent (large) sequences of size k

    L1 = {large 1-sequences};    // result of litemset phase
    for (k = 2; Lk-1 != {}; k++) do begin
        Ck = candidates generated from Lk-1;
        for each customer-sequence c in database do
            increment the count of all candidates in Ck that are contained in c
        Lk = candidates in Ck with minimum support
    end
    Answer = maximal sequences in the union over k of Lk;

Candidate Generation, Join Step
Ck is generated by joining Lk-1 with itself:

    insert into Ck
    select p.litemset1, ..., p.litemsetk-1, q.litemsetk-1
    from Lk-1 p, Lk-1 q
    where p.litemset1 = q.litemset1, ..., p.litemsetk-2 = q.litemsetk-2

For example: {1,2,3} X {1,2,4} = {1,2,3,4} and {1,2,4,3}
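The join step can be sketched in a few lines of Python. This is an illustration under assumptions: sequences are tuples of integer litemset ids, the function name `join` is hypothetical, and a sequence is allowed to join with itself because the select statement states no p != q condition. The subsequent pruning of candidates with an infrequent subsequence is not shown.

```python
def join(large_prev):
    """Join step: combine every ordered pair (p, q) of (k-1)-sequences that
    agree on their first k-2 litemsets into the k-sequence p + last(q)."""
    candidates = set()
    for p in large_prev:
        for q in large_prev:
            if p[:-1] == q[:-1]:
                candidates.add(p + (q[-1],))
    return candidates


large_3 = [(1, 2, 3), (1, 2, 4)]
for candidate in sorted(join(large_3)):
    print(candidate)
```

Because order matters in a sequence, the pair ({1,2,3}, {1,2,4}) yields two different candidates, one for each ordering of the last two litemsets, unlike the itemset join in plain Apriori.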
Sequence Database Example / Sort Phase
Sort the transactions with CID as the major key and TID (transaction time) as the secondary key.

    Customer ID   TransactionTime   Items
    1             1                 30
    1             2                 90
    2             1                 10,20
    2             2                 30
    2             3                 40,60,70
    3             1                 30,50,70
    4             1                 30
    4             2                 40,70
    4             3                 90
    5             1                 90

[Figure: one timeline per customer (CID=1 to CID=5) showing the transactions above in time order.]

MinSupport = 40%, i.e. 2 customers
Answer: (<30><90>) (CID 1, 4), (<30><40,70>) (CID 2, 4)
Not Answer: <30>, <40>, <70>, <90>, (<30><40>), (<30><70>), (<40 70>). Why?
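The sort phase amounts to a two-key sort followed by grouping. A minimal sketch, assuming transactions arrive as (customer_id, transaction_time, items) tuples in arbitrary order; the function name `to_sequences` is illustrative.

```python
from itertools import groupby

# (customer_id, transaction_time, items), deliberately unsorted
transactions = [
    (1, 1, {30}), (2, 1, {10, 20}), (3, 1, {30, 50, 70}), (4, 1, {30}),
    (5, 1, {90}), (2, 2, {30}), (4, 2, {40, 70}), (1, 2, {90}),
    (2, 3, {40, 60, 70}), (4, 3, {90}),
]


def to_sequences(transactions):
    """Sort by customer id (major key) and time (minor key), then collect
    each customer's itemsets into one customer sequence."""
    ordered = sorted(transactions, key=lambda t: (t[0], t[1]))
    return {
        cid: [items for _, _, items in group]
        for cid, group in groupby(ordered, key=lambda t: t[0])
    }


sequences = to_sequences(transactions)
for cid, sequence in sequences.items():
    print(cid, sequence)
```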

Litemset Phase
- Find all large itemsets. (Why? Because each itemset in a large sequence has to be a large itemset.)
- To get large (frequent) itemsets, use the Apriori algorithm.
- Need to modify support counting. (For sequential patterns, support is measured by the fraction of customers.)
- Difference from Apriori: the support count should be incremented only once per customer.
- Data: the customer transactions of the Sort Phase example; MinSupport = 40%, i.e. 2 customers
- Litemset result: {30} {40} {70} {40 70} {90}

Transformation Phase
- Each large itemset is then mapped to a set of contiguous integers. (Why? To be able to compare two frequent itemsets in constant time.)
- Represent transactions as sets of large itemsets.

    Litemset   Map
    {30}       1
    {40}       2
    {70}       3
    {40 70}    4
    {90}       5

    CID   Sequence database               Transformed database
    1     <30, 90>                        <{1}, {5}>
    2     <(10,20), 30, (40, 60, 70)>     <{1}, {2, 3, 4}>
    3     <(30, 50, 70)>                  <{1, 3}>
    4     <30, (40, 70), 90>              <{1}, {2, 3, 4}, {5}>
    5     <90>                            <{5}>

[Figure: customer timelines (CID=1 to CID=5) before and after replacing each transaction by the litemsets it contains.]

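The transformation phase can be sketched as follows. This is a minimal illustration, assuming the litemset-to-integer map from the example is already known; the names `LITEMSET_IDS` and `transform` are hypothetical. Transactions that contain no large itemset are dropped, which is how (10,20) disappears from customer 2.

```python
# Map of large itemsets to integers, as in the example.
LITEMSET_IDS = {
    frozenset({30}): 1,
    frozenset({40}): 2,
    frozenset({70}): 3,
    frozenset({40, 70}): 4,
    frozenset({90}): 5,
}


def transform(sequence, litemset_ids):
    """Replace each transaction by the ids of all litemsets it contains;
    transactions containing no litemset are removed."""
    transformed = []
    for transaction in sequence:
        ids = {i for litemset, i in litemset_ids.items() if litemset <= transaction}
        if ids:
            transformed.append(ids)
    return transformed


customer_sequences = {
    1: [{30}, {90}],
    2: [{10, 20}, {30}, {40, 60, 70}],
    3: [{30, 50, 70}],
    4: [{30}, {40, 70}, {90}],
    5: [{90}],
}
for cid, sequence in customer_sequences.items():
    print(cid, transform(sequence, LITEMSET_IDS))
```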
Sequence Phase
- Use the set of large itemsets to find the desired sequences.
- Similar structure to the Apriori algorithms used to find large itemsets:
    Use a seed set to generate candidate sequences.
    Count support for each candidate.
    Eliminate candidate sequences which are not large.
- MinSupport = 40%, i.e. 2 customers

    Transformed database        Litemset   Map
    1, <{1}, {5}>               {30}       1
    2, <{1}, {2, 3, 4}>         {40}       2
    3, <{1, 3}>                 {70}       3
    4, <{1}, {2, 3, 4}, {5}>    {40 70}    4
    5, <{5}>                    {90}       5

    Apriori on the transformed database, then re-mapping to the original items:

    F.Seq.        Sup.   Re-mapped
    {1}           4      <30>
    {2}           2      <40>
    {3}           3      <70>
    {4}           2      <40, 70>
    {5}           3      <90>
    ({1}, {2})    2      <30><40>
    ({1}, {3})    2      <30><70>
    ({1}, {4})    2      <30><(40, 70)>
    ({1}, {5})    2      <30><90>

Maximal Phase
- Find the maximal sequences among the set of frequent sequences.
- Delete all sequences that are sub-sequences of other frequent sequences:

    for (k = n; k > 1; k--) do
        for each k-sequence Sk do
            delete from the set of frequent sequences all subsequences of Sk

- In the example, only <30><(40, 70)> and <30><90> remain.

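The maximal phase can be illustrated with a direct, quadratic filter rather than the length-ordered loop of the slide; the result is the same set of maximal sequences. The sketch assumes the frequent sequences are distinct tuples of frozensets, and the names `seq`, `contains` and `maximal` are illustrative.

```python
def seq(*elements):
    """Build a sequence as a tuple of frozensets, e.g. seq({30}, {40, 70})."""
    return tuple(frozenset(e) for e in elements)


def contains(s, alpha):
    """True if alpha is a subsequence of s."""
    remaining = iter(s)
    return all(any(a <= element for element in remaining) for a in alpha)


def maximal(sequences):
    """Keep only the sequences not contained in any other sequence of the list."""
    return [s for s in sequences
            if not any(s != t and contains(t, s) for t in sequences)]


frequent = [
    seq({30}), seq({40}), seq({70}), seq({40, 70}), seq({90}),
    seq({30}, {40}), seq({30}, {70}), seq({30}, {40, 70}), seq({30}, {90}),
]
for s in maximal(frequent):
    print([sorted(e) for e in s])
```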

Summary for AprioriAll
- The algorithm wastes much time counting non-maximal sequences, which cannot be sequential patterns.
- There are other variations of AprioriAll that reduce the candidates that are not maximal: AprioriSome and DynamicSome.
- Absence of time window constraints.
- AprioriAll is the basis of many efficient algorithms developed later. GSP is among them.

GSP: A Generalized Sequential Pattern Mining Algorithm
- GSP (Generalized Sequential Pattern) mining algorithm, proposed by Srikant and Agrawal, EDBT'96
- Outline of the method
    Initially, every item in DB is a candidate of length 1
    For each level (i.e., sequences of length k) do
      scan database to collect support count for each candidate sequence
      generate candidate length-(k+1) sequences from length-k frequent sequences using Apriori
    Repeat until no frequent sequence or no candidate can be found
- Major strength: candidate pruning by Apriori
A Basic Property of Sequential Patterns: Apriori
- A basic property: Apriori (Agrawal & Srikant '94)
    If a sequence S is not frequent,
    then none of the super-sequences of S is frequent.
    E.g., <hb> is infrequent, and so are <hab> and <(ah)b>.

    Seq. ID   Sequence
    10        <(bd)cb(ac)>
    20        <(bf)(ce)b(fg)>
    30        <(ah)(bf)abf>
    40        <(be)(ce)d>
    50        <a(bd)bcb(ade)>

    Given support threshold min_sup = 2

The GSP Algorithm
- Take sequences in the form of <x> as length-1 candidates
- Scan database once, find F1, the set of length-1 sequential patterns
- Let k = 1; while Fk is not empty do
    Form Ck+1, the set of length-(k+1) candidates from Fk;
    If Ck+1 is not empty, scan database once, find Fk+1, the set of length-(k+1) sequential patterns
    Let k = k+1;

Finding Length-1 Sequential Patterns
- Examine GSP using an example
- Initial candidates: all singleton sequences
    <a>, <b>, <c>, <d>, <e>, <f>, <g>, <h>
- Scan database once, count support for candidates

    Seq. ID   Sequence              Cand   Sup
    10        <(bd)cb(ac)>          <a>    3
    20        <(bf)(ce)b(fg)>       <b>    5
    30        <(ah)(bf)abf>         <c>    4
    40        <(be)(ce)d>           <d>    3
    50        <a(bd)bcb(ade)>       <e>    3
                                    <f>    2
    min_sup = 2                     <g>    1
                                    <h>    1

Generating Length-2 Candidates
51 length-2 candidates: 36 of the form <xy> and 15 of the form <(xy)>.

          <a>    <b>    <c>    <d>    <e>    <f>
    <a>   <aa>   <ab>   <ac>   <ad>   <ae>   <af>
    <b>   <ba>   <bb>   <bc>   <bd>   <be>   <bf>
    <c>   <ca>   <cb>   <cc>   <cd>   <ce>   <cf>
    <d>   <da>   <db>   <dc>   <dd>   <de>   <df>
    <e>   <ea>   <eb>   <ec>   <ed>   <ee>   <ef>
    <f>   <fa>   <fb>   <fc>   <fd>   <fe>   <ff>

          <a>   <b>      <c>      <d>      <e>      <f>
    <a>         <(ab)>   <(ac)>   <(ad)>   <(ae)>   <(af)>
    <b>                  <(bc)>   <(bd)>   <(be)>   <(bf)>
    <c>                           <(cd)>   <(ce)>   <(cf)>
    <d>                                    <(de)>   <(df)>
    <e>                                             <(ef)>
    <f>

Without the Apriori property there would be 8*8 + 8*7/2 = 92 candidates; Apriori prunes 44.57% of the candidates.
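The candidate counts can be reproduced by enumerating the two candidate shapes. A short sketch, with the function name `length2_candidates` chosen for illustration: ordered pairs (repetition allowed) give the <xy> candidates, and unordered pairs of distinct items give the <(xy)> candidates.

```python
from itertools import combinations, product


def length2_candidates(items):
    """All length-2 candidates built from the given length-1 patterns."""
    ordered = [f"<{x}{y}>" for x, y in product(items, repeat=2)]       # <xy>
    together = [f"<({x}{y})>" for x, y in combinations(items, 2)]      # <(xy)>
    return ordered + together


with_pruning = length2_candidates("abcdef")        # only the 6 frequent items
without_pruning = length2_candidates("abcdefgh")   # all 8 items
pruned = len(without_pruning) - len(with_pruning)
print(len(with_pruning), len(without_pruning))
print(round(100 * pruned / len(without_pruning), 2), "% pruned")
```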
Generating Length-2 Candidates (support counts)
min_sup = 2

    Seq. ID   Sequence
    10        <(bd)cb(ac)>
    20        <(bf)(ce)b(fg)>
    30        <(ah)(bf)abf>
    40        <(be)(ce)d>
    50        <a(bd)bcb(ade)>

          <a>      <b>      <c>      <d>      <e>      <f>
    <a>   <aa>:2   <ab>:2   <ac>:1   <ad>:1   <ae>:1   <af>:1
    <b>   <ba>:3   <bb>:4   <bc>:4   <bd>:2   <be>:3   <bf>:2
    <c>   <ca>:2   <cb>:3   <cc>:1   <cd>:2   <ce>:1   <cf>:1
    <d>   <da>:2   <db>:2   <dc>:2   <dd>:0   <de>:1   <df>:0
    <e>   <ea>:0   <eb>:1   <ec>:0   <ed>:1   <ee>:1   <ef>:1
    <f>   <fa>:1   <fb>:2   <fc>:1   <fd>:0   <fe>:1   <ff>:2

          <a>   <b>        <c>        <d>        <e>        <f>
    <a>         <(ab)>:0   <(ac)>:1   <(ad)>:1   <(ae)>:1   <(af)>:0
    <b>                    <(bc)>:0   <(bd)>:2   <(be)>:1   <(bf)>:2
    <c>                               <(cd)>:0   <(ce)>:2   <(cf)>:0
    <d>                                          <(de)>:1   <(df)>:0
    <e>                                                     <(ef)>:0
    <f>

For example, <aa> is supported by SID 30 and 50, and <(bf)> by SID 20 and 30.


Length-2 Sequential Patterns
- After scanning the database to collect the support count for each length-2 candidate,
- there are 19 length-2 candidates which pass the minimum support threshold.
    They are length-2 sequential patterns:
    16 of them in the pattern of <xy>
    3 of them in the pattern of <(xy)>

Generating Length-3 Candidates and Finding Length-3 Patterns
- Generate length-3 candidates
    Self-join length-2 sequential patterns, based on the Apriori property
    <ab>, <aa> and <ba> are all length-2 sequential patterns, so <aba> is a length-3 candidate
      54 candidates are generated
    <(bd)>, <bb> and <db> are all length-2 sequential patterns, so <(bd)b> is a length-3 candidate
      27 candidates are generated:
      a(bd), (bd)a, b(bd), (bd)b, (bd)c, (bd)d, (bd)e, (bd)f, c(bd), d(bd), f(bd), a(bf), (bf)a, (bf)b, b(bf),
      (bf)c, (bf)d, (bf)e, (bf)f, c(bf), d(bf), f(bf), b(ce), (ce)a, (ce)b, (ce)d, d(ce)
- Find length-3 sequential patterns
    Scan database once more, collect support counts for candidates
    19 out of 81 candidates pass the support threshold

Generating Length-3 Candidates: <xyz> patterns
(Both slides repeat the length-2 support tables and the five-sequence database, min_sup = 2.)
- Example of generating an <xyz> pattern for <aa>:
    Need to concatenate another length-2 frequent pattern
    Concatenating another frequent pattern that starts with a, to form <aaa> and <aab>
- The 54 candidates:
    <aaa>:0, <aab>:0
    <aba>:2, <abb>:2, <abc>:1, <abd>:1, <abe>:1, <abf>:1
    <baa>, <bab>
    <bba>, <bbb>, <bbc>, <bbd>, <bbe>, <bbf>
    <bca>, <bcb>, <bcd>
    <bda>, <bdb>, <bdc>
    <bfb>, <bff>
    <caa>, <cab>
    <cba>, <cbb>, <cbc>, <cbd>, <cbe>, <cbf>
    <cda>, <cdb>, <cdc>
    <daa>, <dab>
    <dba>, <dbb>, <dbc>, <dbd>, <dbe>, <dbf>
    <dca>, <dcb>, <dcd>
    <fba>, <fbb>, <fbc>, <fbd>, <fbe>, <fbf>
    <ffb>, <fff>

Generating Length-3 Candidates: <(xy)z> patterns
- Example of generating an <(xy)z> pattern for <(bd)>:
    Need to concatenate another length-2 frequent pattern
    Concatenating those patterns that end with b or d, to form something like
      <a(bd)>, <b(bd)>, <c(bd)>, <d(bd)>, <f(bd)>
    Concatenating those patterns that start with b or d, to form something like
      <(bd)a>, <(bd)b>, <(bd)c>, <(bd)d>, <(bd)e>, <(bd)f>

The GSP Mining Process
[Figure: level-wise candidate lattice for the five-sequence database (min_sup = 2), from length 1 at the bottom to length 5 at the top. Some candidates are marked "cannot pass sup. threshold" and others "not in DB at all".]

    Length 5:  <(bd)cba>
    Length 4:  <abba> <(bd)bc> ...
    Length 3:  <abb> <aab> <aba> <baa> <bab> ...
    Length 2:  <aa> <ab> ... <af> <ba> <bb> ... <ff> <(ab)> ... <(ef)>
    Length 1:  <a> <b> <c> <d> <e> <f> <g> <h>

Bottlenecks of GSP
- A huge set of candidates could be generated
    1,000 frequent length-1 sequences generate
    $1000 \times 1000 + \frac{1000 \times 999}{2} = 1{,}499{,}500$ length-2 candidates!
- Multiple scans of database
- Real challenge: mining long sequential patterns
    An exponential number of short candidates
    A length-100 sequential pattern needs $10^{30}$ candidate sequences!
    $\sum_{i=1}^{100} \binom{100}{i} = 2^{100} - 1 \approx 10^{30}$

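Both bottleneck figures are easy to verify with exact integer arithmetic. The variable names below are illustrative.

```python
from math import comb

# Length-2 candidates from 1,000 frequent length-1 sequences:
# 1000 * 1000 ordered pairs <xy> plus C(1000, 2) unordered pairs <(xy)>.
length2_candidates = 1000 * 1000 + comb(1000, 2)
print(f"{length2_candidates:,}")

# Sub-patterns of a length-100 pattern: choose any non-empty subset of its
# 100 items, i.e. the sum of C(100, i) for i = 1..100, which equals 2^100 - 1.
short_candidates = sum(comb(100, i) for i in range(1, 101))
print(short_candidates == 2**100 - 1)
print(f"{short_candidates:.3e}")
```

The second sum is about 1.27 x 10^30, which is where the slide's order-of-magnitude figure of 10^30 comes from.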
Lecture Outline: Part III, Pattern-Growth-based Approaches (45 minutes)
- Free-Span (Frequent Pattern-Projected Sequential Pattern Mining)
- Prefix-Span (Prefix-Projected Sequential Pattern)

FreeSpan (Generalities)
- J. Han, J. Pei, B. Mortazavi-Asl, Q. Chen, U. Dayal, and M.-C. Hsu. FreeSpan: Frequent pattern-projected sequential pattern mining. KDD'00, pages 355-359.
- A divide-and-conquer approach
    Recursively project a sequence database into a set of smaller databases
    Mine each projected database to find the subset of patterns


FreeSpan (example)

    SID   Sequence
    10    <a(abc)(ac)d(cf)>
    20    <(ad)c(bc)(ae)>
    30    <(ef)(ab)(df)cb>
    40    <eg(af)cbc>

Given a sequence database S and min_support = 2
- Step 1: find length-1 sequential patterns and list them in support-descending order
    f_list = a:4, b:4, c:4, d:3, e:3, f:3 (g:1 is infrequent)
- Step 2: divide the search space. The complete set of sequential patterns can be partitioned into 6 disjoint subsets (moving down the f_list):
    ones that only contain item a
    ones that contain item b but no items after b in f_list
    ones that contain item c but no items after c in f_list
    ones that contain item d but no items after d in f_list
    ones that contain item e but no items after e in f_list
    ones that contain item f
- Step 3: find subsets of sequential patterns. They can be mined by constructing projected databases and mining each recursively.

FreeSpan (cont.)
Finding sequential patterns containing item b but no items after b in f_list (f_list = a:4, b:4, c:4, d:3, e:3, f:3)
- <b>-projected database:
    10: <a(ab)a>, 20: <aba>, 30: <(ab)b>, 40: <ab>
- Find all the length-2 sequential patterns containing item b but no items after b in f_list:
    <ab>:4, <ba>:2, <(ab)>:2
- Further partition and mining

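Step 1 of the FreeSpan example, building the f_list, can be sketched as below. The sketch assumes sequences are written in the slide notation with single-letter items, so the distinct items of a sequence are simply its characters minus the parentheses; the function name `f_list` mirrors the slide, but this implementation is only an illustration.

```python
from collections import Counter


def f_list(database, min_support):
    """Frequent items with their support, in support-descending order
    (ties broken alphabetically)."""
    counts = Counter()
    for sequence in database:
        # Each sequence contributes at most once per item.
        counts.update(set(sequence) - set("()"))
    frequent = [(item, n) for item, n in counts.items() if n >= min_support]
    return sorted(frequent, key=lambda pair: (-pair[1], pair[0]))


database = ["a(abc)(ac)d(cf)", "(ad)c(bc)(ae)", "(ef)(ab)(df)cb", "eg(af)cbc"]
print(f_list(database, min_support=2))
```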
From FreeSpan to PrefixSpan
- FreeSpan:
    Projection-based: no candidate sequence needs to be generated
    But projection can be performed at any point in the sequence, and the projected sequences may not shrink much. For example, the size of the f-projected database is the same as the original sequence database.
- PrefixSpan:
    Projection-based
    But only prefix-based projection: fewer projections and quickly shrinking sequences
    J. Pei, J. Han, H. Pinto, Q. Chen, U. Dayal, and M.-C. Hsu, "PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth", Proc. 2001 Int. Conf. on Data Engineering (ICDE'01), Heidelberg, Germany, April 2001.

Prefix of a Sequence
Given an alphabetical order of items in each itemset (element), and two sequences $\alpha = \langle a_1 a_2 \dots a_n \rangle$ and $\beta = \langle b_1 b_2 \dots b_m \rangle$ with $m \le n$, sequence $\beta$ is called a prefix of $\alpha$ if and only if:
- $b_i = a_i$ for $i \le m-1$;
- $b_m \subseteq a_m$;
- all the items in $(a_m - b_m)$ are alphabetically after those in $b_m$.

Examples with $\alpha$ = <a(abc)(ac)d(cf)>:
- Prefixes of $\alpha$: $\beta$ = <a>, $\beta$ = <a(ab)>, $\beta$ = <a(abc)a>
- Not prefixes of $\alpha$: $\beta$ = <a(ab)a>, $\beta$ = <a(abc)c>


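Projection
Given sequences $\alpha$ and $\beta$ such that $\beta$ is a subsequence of $\alpha$.
A subsequence $\alpha'$ of sequence $\alpha$ is called a projection of $\alpha$ w.r.t. prefix $\beta$ if and only if
- $\alpha'$ has prefix $\beta$;
- there exists no proper super-sequence $\alpha''$ of $\alpha'$ such that $\alpha''$ is a subsequence of $\alpha$ and also has prefix $\beta$.

Example:
    $\alpha$ = <a(abc)(ac)d(cf)>
    $\beta$ = <(bc)a>
    $\alpha'$ = <(bc)(ac)d(cf)>

Postfix
Let $\alpha' = \langle a_1 a_2 \dots a_n \rangle$ be the projection of $\alpha$ w.r.t. prefix $\beta = \langle a_1 a_2 \dots a_{m-1} a'_m \rangle$ $(m \le n)$.
Sequence $\gamma = \langle a''_m a_{m+1} \dots a_n \rangle$ is called the postfix of $\alpha$ w.r.t. prefix $\beta$, denoted as $\gamma = \alpha / \beta$, where $a''_m = (a_m - a'_m)$.
We also denote $\alpha = \beta \cdot \gamma$.

Example:
    $\alpha$ = <a(abc)(ac)d(cf)>
    $\beta$ = <a(abc)a>
    $\gamma$ = <(_c)d(cf)>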

PrefixSpan Algorithm
Input: a sequence database S, and the minimum support threshold min_sup
Output: the complete set of sequential patterns
Method: call PrefixSpan(<>, 0, S)

Subroutine PrefixSpan($\alpha$, l, S|$\alpha$)
Parameters:
- $\alpha$: a sequential pattern;
- l: the length of $\alpha$;
- S|$\alpha$: the $\alpha$-projected database if $\alpha \ne$ <>; otherwise, the sequence database S.

PrefixSpan Algorithm (2)
Method:
1. Scan S|$\alpha$ once, find the set of frequent items b such that:
    a) b can be assembled to the last element of $\alpha$ to form a sequential pattern; or
    b) <b> can be appended to $\alpha$ to form a sequential pattern.
2. For each frequent item b, append it to $\alpha$ to form a sequential pattern $\alpha'$, and output $\alpha'$;
3. For each $\alpha'$, construct the $\alpha'$-projected database S|$\alpha'$, and call PrefixSpan($\alpha'$, l+1, S|$\alpha'$).

PrefixSpan - Example

    id   Sequence
    10   <a(abc)(ac)d(cf)>
    20   <(ad)c(bc)(ae)>
    30   <(ef)(ab)(df)cb>
    40   <eg(af)cbc>

    min_support = 2

1. Find length-1 sequential patterns

    <a>  <b>  <c>  <d>  <e>  <f>  <g>
    4    4    4    3    3    3    1

2. Divide search space: partition it into 6 subsets
    ones having prefix <a>;
    ones having prefix <b>;
    ...
    ones having prefix <f>

3. Find subsets of sequential patterns, using one projected database per prefix

    Prefix   Projected database
    <a>      <(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>
    <b>      <(_c)(ac)d(cf)>, <(_c)(ae)>, <(df)cb>, <c>
    <c>      <(ac)d(cf)>, <(bc)(ae)>, <b>, <bc>
    <d>      <(cf)>, <c(bc)(ae)>, <(_f)cb>
    <e>      <(_f)(ab)(df)cb>, <(af)cbc>
    <f>      <(ab)(df)cb>, <cbc>

Let's see the case of <d>.

PrefixSpan Example (2)
Projected database for <d>: <(cf)>, <c(bc)(ae)>, <(_f)cb>

    Support counts in the <d>-projected database
    <a>  <b>  <c>  <d>  <e>  <f>  <(_f)>
    1    2    3    0    1    1    1

- <b> and <c> are frequent, giving the patterns <db> and <dc>.
- Recursing on the <dc>-projected database gives the counts <b>:2 and <c>:1, hence the pattern <dcb>.
- The recursion stops when a projected database yields no frequent item (<>).

PrefixSpan - characteristics
- No candidate sequence needs to be generated by PrefixSpan
- Projected databases keep shrinking
- The major cost of PrefixSpan is the construction of projected databases
- How to reduce this cost? Different projection methods:
    Bi-level projection: reduces the number and the size of projected databases
    Pseudo-projection: reduces the cost of projection when the projected database can be held in main memory

Bi-level Projection

    id   Sequence
    10   <a(abc)(ac)d(cf)>
    20   <(ad)c(bc)(ae)>
    30   <(ef)(ab)(df)cb>
    40   <eg(af)cbc>

    min_support = 2

- Scan to get the length-1 sequences
- Construct a triangular matrix instead of projected databases for each length-1 pattern; the matrix gives ALL length-2 sequential patterns

        a         b         c         d         e         f
    a   2
    b   (4,2,2)   1
    c   (4,2,1)   (3,3,2)   3
    d   (2,1,1)   (2,2,0)   (1,3,0)   0
    e   (1,2,1)   (1,2,0)   (1,2,0)   (1,1,0)   0
    f   (2,1,1)   (2,2,0)   (1,2,1)   (1,1,1)   (2,0,1)   1

- Reading the cell for a and c, (4,2,1): Support(<ac>) = 4, Support(<ca>) = 2, Support(<(ac)>) = 1
- Reading the diagonal cell for c: Support(<cc>) = 3

Bi-level projection (2)
- For each length-2 sequential pattern $\alpha$, construct the $\alpha$-projected database and find the frequent items
- Construct the corresponding S-matrix

    <ab>-projected database: <(_c)(ac)(cf)>, <(_c)a>, <c>

    a   b   c   (_c)   d   (_d)   e   (_e)   f   (_f)
    2   0   2   2      0   1      0   0      1   0

    Resulting patterns: <aba>, <abc>, <a(bc)>

    S-matrix of the <ab>-projected database (entries left blank on the slide are shown as -):
           a         c         (_c)
    a      0
    c      (1,0,1)   1
    (_c)   (-,2,-)   (-,1,-)

    Resulting pattern: <a(bc)a>

Bi-level projection (3) - optimization
- Do we need to include every item in a postfix in the projected databases?
- NO! Item pruning in projected databases by 3-way Apriori checking
    <ac> is not frequent: any super-sequence of it can never be a sequential pattern, so c can be excluded from the construction of the <ab>-projected database.
    <a(bd)> is not frequent: to construct the <a(bc)>-projected database, sequence <a(bcde)df> should be projected to <(_e)df> instead of <(_de)df>.
Pseudo-Projection
- Observation: postfixes of a sequence often appear repeatedly in recursive projected databases
- Method: instead of constructing a physical projection by collecting all the postfixes, we can use pointers referring to the sequences in the database as a pseudo-projection
- Every projection consists of two pieces of information: a pointer to the sequence in the database and the offset of the postfix in the sequence

    s1 = <a(abc)(ac)d(cf)>

    Pointer   Offset   Postfix
    s1        2        <(abc)(ac)d(cf)>
    s1        5        <(ac)d(cf)>
    s1        6        <(_c)d(cf)>

Efficiency of Prefix-Span and Effect of Pseudo-Projection
[Figure: performance comparison plots; legend: PrefixSpan-1 (level-by-level projection), PrefixSpan-2 (bi-level projection).]

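The pointer-and-offset idea can be sketched as follows. One assumption is made explicit here: the offset is read as the 1-based position, among the items of the flattened sequence, at which the postfix starts, which is consistent with the three rows of the table for s1. The helper names are illustrative.

```python
def parse(text):
    """'a(abc)d' -> ['a', 'abc', 'd'], items sorted within each element."""
    elements, i = [], 0
    while i < len(text):
        if text[i] == "(":
            j = text.index(")", i)
            elements.append("".join(sorted(text[i + 1:j])))
            i = j + 1
        else:
            elements.append(text[i])
            i += 1
    return elements


def fmt(elements):
    return "<" + "".join(e if len(e) == 1 else f"({e})" for e in elements) + ">"


def postfix_at(elements, offset):
    """Materialise the postfix that a (pointer, offset) pair refers to.

    ASSUMPTION: offset is the 1-based index of the first postfix item when
    all items of the sequence are counted from left to right.
    """
    seen = 0
    for index, element in enumerate(elements):
        if offset <= seen + len(element):
            k = offset - seen - 1                      # position inside the element
            head = element if k == 0 else "_" + element[k:]
            return [head] + elements[index + 1:]
        seen += len(element)
    return []


s1 = parse("a(abc)(ac)d(cf)")          # the stored sequence; projections only keep offsets
for offset in (2, 5, 6):
    print(offset, fmt(postfix_at(s1, offset)))
```

Only the integer offsets need to be stored per projected database; the postfix itself is read from the original sequence on demand, which is what saves memory and copying when the database fits in main memory.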

Summary
- Sequential pattern mining is useful in many applications, e.g. weblog analysis, financial market prediction, bioinformatics, etc.
- It is similar to frequent itemset mining, but with consideration of ordering.
- We have looked at different approaches that are descendants of two popular algorithms for mining frequent itemsets:
    Candidate generation: AprioriAll and GSP
    Pattern growth: FreeSpan and PrefixSpan

