Dr. Osmar R. Zaïane, 1999, 2007. Data Mining and Knowledge Discovery, University of Alberta
Chapter 3 Objectives
Brief introduction to the different algorithms for sequential pattern discovery.

Sequence of Transactions
Association rule mining searches for relationships between items in a dataset where time is irrelevant.
  {a,b,c,d}
Sequential Pattern Analysis considers time (or order of transactions).
  Customer x, Store: <{a,b,c,d}, {x,y,z}, ...> (day1, day2)
Data: sequences of evidences in time order
Target: sub-sequences that happened frequently
Lecture Outline
Part I: Concepts (30 minutes)
  Basic concepts
Part II: Apriori-based Approaches (45 minutes)
  Apriori-all
  GSP
Part III: Pattern-Growth-based Approaches (45 minutes)
  Free-Span
  Prefix-Span

Sequence Pattern Examples
Example 1
  60% of customers typically rent Star Wars, then Empire Strikes Back, and then Return of the Jedi.
  Note: these rentals need not be consecutive.
  <SW>, ..., <ESB>, ..., <RJ>
Example 2
  60% of customers buy fitted sheet, flat sheet and pillow, followed by comforter, followed by drapes and ruffles.
  Note: elements of a sequential pattern need not be simple items.
  <FittedSheet, FlatSheet, Pillow>, ..., <Comforter>, ..., <Drapes, Ruffles>
Why sequential pattern mining?

Sequence Database
Converts the original transaction database into a database of customer sequences.
[Figure: transactional database and the corresponding sequence database]
What Is Sequential Pattern Mining?
Given a set of sequences, find the complete set of frequent subsequences.
A sequence: <(ef)(ab)(df)cb>
A sequence database:
  SID  sequence
  10   <a(abc)(ac)d(cf)>
  20   <(ad)c(bc)(ae)>
An element may contain a set of items.
Items within an element are unordered and we list them alphabetically.

Sequential Patterns
Given
  a set of sequences, where each sequence consists of a list of elements and each element consists of a set of items
  user-specified min_support threshold
<a(abc)(ac)d(cf)> - 5 elements, 9 items
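The notation above can be made concrete with a small parser. This is only an illustrative sketch: `parse_sequence` is a hypothetical helper, not part of the lecture, and it assumes every item is a single character, as in the slides' examples.

```python
def parse_sequence(text):
    """Parse compact notation such as '<a(abc)(ac)d(cf)>' into a list of elements.

    Assumption: every item is a single character. Each element becomes a
    tuple of items sorted alphabetically.
    """
    body = "".join(text.split())          # drop any whitespace
    if body.startswith("<") and body.endswith(">"):
        body = body[1:-1]
    elements, i = [], 0
    while i < len(body):
        if body[i] == "(":                # multi-item element
            close = body.index(")", i)
            elements.append(tuple(sorted(body[i + 1:close])))
            i = close + 1
        else:                             # single-item element
            elements.append((body[i],))
            i += 1
    return elements


seq = parse_sequence("<a(abc)(ac)d(cf)>")
n_elements = len(seq)                     # number of elements
n_items = sum(len(e) for e in seq)        # number of item occurrences
print(n_elements, n_items)
```

Keeping each element as a sorted tuple mirrors the convention that items within an element are unordered but listed alphabetically.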
Sequence Support Count
A sequence database is a set of tuples <sid, s>
A tuple <sid, s> is said to contain a sequence α if α is a subsequence of s

Counting Sequences (An example)
[Figure: a generated candidate and a contained pattern, illustrated on the sequence <(7)(3,8)(9)(4,5,6)(8)>]
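The containment test and the support count can be sketched as follows. This is a minimal illustration, not the lecture's code: `contains` and `support` are hypothetical names, elements are written as strings of single-character items, and the four-sequence database is the running example used later in these slides.

```python
def contains(s, alpha):
    """Return True if alpha is a subsequence of s.

    Each element of alpha must be a subset of a distinct element of s,
    in the same order. Greedy leftmost matching is sufficient.
    """
    i = 0
    for element in s:
        if i < len(alpha) and set(alpha[i]) <= set(element):
            i += 1
    return i == len(alpha)


def support(db, alpha):
    """Number of tuples <sid, s> in db that contain alpha."""
    return sum(1 for _sid, s in db if contains(s, alpha))


# Running example; each string is one element, e.g. "abc" stands for (abc).
DB = [
    (10, ["a", "abc", "ac", "d", "cf"]),
    (20, ["ad", "c", "bc", "ae"]),
    (30, ["ef", "ab", "df", "c", "b"]),
    (40, ["e", "g", "af", "c", "b", "c"]),
]

print(support(DB, ["a", "c"]))   # <ac>
print(support(DB, ["ac"]))       # <(ac)>
```

Note that <ac> and <(ac)> are different patterns: the first asks for a followed later by c, the second for a and c in the same element.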
AprioriAll: The big picture

AprioriAll Algorithm (1)
Ck: candidate sequences of size k
Lk: frequent or large sequences of size k
Litemsets and their mapping
  itemset   Map
  {30}      1
  {40}      2
  {70}      3
  {40 70}   4
  {90}      5

Sequence Phase:
Use set of large itemsets to find the desired sequences.
  Similar structure to Apriori algorithms used to find large itemsets.
  Use seed set to generate candidate sequences.
  Count support for each candidate.
  Eliminate candidate sequences which are not large.

Example: MinSupport = 40%, i.e. 2 customers
  Customer sequences (after mapping):
    1, <{1}, {5}>
    2, <{1}, {2, 3, 4}>
    3, <{1, 3}>
    4, <{1}, {2, 3, 4}, {5}>
    5, <{5}>
  Apriori gives:
    F.Seq.       Sup.
    {1}          4
    {2}          2
    {3}          3
    {4}          2
    {5}          3
    ({1}, {2})   2
    ({1}, {3})   3
    ({1}, {4})   2
    ({1}, {5})   2

Maximal Phase:
Find the maximal sequences among the set of frequent sequences:
delete all sequences that are sub-sequences of other frequent sequences.
  for (k=n; k>1; k--) do
    for each k-sequence Sk do
      Delete from S all subsequences of Sk
  Frequent sequences after re-mapping:
    <30>, <40>, <70>, <40, 70>, <90>, <30><40>, <30><70>, <30, (40, 70)>, <30><90>
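The maximal phase can be sketched in a few lines. This is an illustration under assumptions, not the lecture's implementation: `is_subsequence` and `maximal_sequences` are hypothetical names, sequences are tuples of tuples of item ids, and sequences are processed from longest to shortest by total item count rather than with the explicit loop over k.

```python
def is_subsequence(small, big):
    """True if every element of `small` is a subset of a distinct,
    order-preserving element of `big` (greedy leftmost matching)."""
    i = 0
    for element in big:
        if i < len(small) and set(small[i]) <= set(element):
            i += 1
    return i == len(small)


def maximal_sequences(frequent):
    """Keep only the sequences not contained in another frequent sequence."""
    # Longest first, so any sequence that can absorb `seq` is already in `kept`.
    ordered = sorted(frequent, key=lambda s: sum(len(e) for e in s), reverse=True)
    kept = []
    for seq in ordered:
        if not any(is_subsequence(seq, other) for other in kept):
            kept.append(seq)
    return kept


# Frequent sequences from the slide, written with the original item ids.
FREQUENT = [
    ((30,),), ((40,),), ((70,),), ((40, 70),), ((90,),),
    ((30,), (40,)), ((30,), (70,)), ((30,), (40, 70)), ((30,), (90,)),
]
maximal = maximal_sequences(FREQUENT)
print(maximal)
```

Sorting by total item count works because a sequence can only be contained in a different sequence that has strictly more items.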
[Length-2 candidates of the form <(xy)>, with support counts]
  <a>  <(ab)>:0  <(ac)>:1  <(ad)>:1  <(ae)>:1  <(af)>:0
  <b>  <(bc)>:0  <(bd)>:2  <(be)>:1  <(bf)>:2
  <c>  <(cd)>:0  <(ce)>:2  <(cf)>:0
  <d>  <(de)>:1  <(df)>:0
  <e>  <(ef)>:0
  <f>
Annotations on the slide: SID: 30, 50 and SID: 20, 30
Generating Length-3 Candidates
Support counts of length-2 sequences <xy>:
  <a>  <aa>:2  <ab>:2  <ac>:1  <ad>:1  <ae>:1  <af>:1
  <b>  <ba>:3  <bb>:4  <bc>:4  <bd>:2  <be>:3  <bf>:2
  <c>  <ca>:2  <cb>:3  <cc>:1  <cd>:2  <ce>:1  <cf>:1
  <d>  <da>:2  <db>:2  <dc>:2  <dd>:0  <de>:1  <df>:0
  <e>  <ea>:0  <eb>:1  <ec>:0  <ed>:1  <ee>:1  <ef>:1
  <f>  <fa>:1  <fb>:2  <fc>:1  <fd>:0  <fe>:1  <ff>:2
Support counts of length-2 sequences <(xy)>:
  <a>  <(ab)>:0  <(ac)>:1  <(ad)>:1  <(ae)>:1  <(af)>:0
  <b>  <(bc)>:0  <(bd)>:2  <(be)>:1  <(bf)>:2
  <c>  <(cd)>:0  <(ce)>:2  <(cf)>:0
  <d>  <(de)>:1  <(df)>:0
Length-3 candidates:
  <aaa>:0, <aab>:0, <aba>:2, <abb>:2, <abc>:1, <abd>:1, <abe>:1, <abf>:1
  <baa>, <bab>
  <bba>, <bbb>, <bbc>, <bbd>, <bbe>, <bbf>
  <bca>, <bcb>, <bcd>
  <bda>, <bdb>, <bdc>
  <bfb>, <bff>
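The length-3 candidates listed above are consistent with a join of frequent length-2 sequences: <xy> and <yz> combine into <xyz>. The sketch below reproduces only that part. It is a simplification: it handles sequences whose elements are single items (the full GSP join also builds candidates with multi-item elements such as <(bd)...>), `join_length2` is a hypothetical name, and the threshold of 2 is an assumption inferred from which candidates appear on the slide.

```python
ITEMS = "abcdef"
# Support counts of <xy> copied from the slide, one row per first item.
ROWS = {
    "a": [2, 2, 1, 1, 1, 1],
    "b": [3, 4, 4, 2, 3, 2],
    "c": [2, 3, 1, 2, 1, 1],
    "d": [2, 2, 2, 0, 1, 0],
    "e": [0, 1, 0, 1, 1, 1],
    "f": [1, 2, 1, 0, 1, 2],
}
COUNTS = {x + y: c for x, row in ROWS.items() for y, c in zip(ITEMS, row)}

MIN_SUP = 2  # assumption: not stated on these slides


def join_length2(frequent_2):
    """Join <xy> with <yz> into the candidate <xyz>.

    Sequences are 2-character strings, e.g. 'ab' for <ab>.
    """
    freq = sorted(set(frequent_2))
    return sorted(s1 + s2[1] for s1 in freq for s2 in freq if s1[1] == s2[0])


frequent_2 = [p for p, c in COUNTS.items() if c >= MIN_SUP]
candidates = join_length2(frequent_2)
print([c for c in candidates if c.startswith("b")])
```

Each candidate would still need its support counted against the database before it is accepted as frequent.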
Lecture Outline
Part I: Concepts (30 minutes)
  Basic concepts
Part II: Apriori-based Approaches (45 minutes)
  Apriori-All
  GSP
Part III: Pattern-Growth-based Approaches (45 minutes)
  Free-Span (Frequent Pattern-Projected Sequential Pattern Mining)
  Prefix-Span (Prefix-Projected Sequential Pattern)

FreeSpan (Generalities)
J. Han, J. Pei, B. Mortazavi-Asl, Q. Chen, U. Dayal, and M.-C. Hsu. FreeSpan: Frequent pattern-projected sequential pattern mining. KDD'00, pages 355-359.
A divide-and-conquer approach
  Recursively project a sequence database into a set of smaller databases
  Mine each projected database to find the subset of patterns
FreeSpan (example)
  SID  Sequence
  10   <a(abc)(ac)d(cf)>
  20   <(ad)c(bc)(ae)>
  30   <(ef)(ab)(df)cb>
  40   <eg(af)cbc>
Given a sequence database S and min_support = 2
Step 1: find length-1 sequential patterns and list them in support descending order
  f_list = a:4, b:4, c:4, d:3, e:3, f:3; g:1
Step 2: divide search space. The complete set of seq. patterns can be partitioned into 6 disjoint subsets (move down the f_list):
  ones only contain item a
  ones contain item b but no items after b in f_list
  ones contain item c but no items after c in f_list
  ones contain item d but no items after d in f_list
  ones contain item e but no items after e in f_list
  ones contain item f
Step 3: find subsets of sequential patterns. They can be mined by constructing projected databases and mining each recursively

FreeSpan (cont)
Finding Seq. Patterns containing item b but no items after b in f_list
  <b>-projected database:
    10:<a(ab)a>, 20:<aba>, 30:<(ab)b>, 40:<ab>
  Find all the length-2 seq. patterns containing item b but no items after b in f_list:
    <ab>:4, <ba>:2, <(ab)>:2
  Further partition and mining
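Steps 1 and 2 of the FreeSpan example can be reproduced with a short sketch. The names `f_list` and `freespan_projection` are hypothetical, elements are written as strings of single-character items, and ties in support are broken alphabetically (an assumption that happens to match the slide).

```python
from collections import Counter

DB = {
    10: ["a", "abc", "ac", "d", "cf"],
    20: ["ad", "c", "bc", "ae"],
    30: ["ef", "ab", "df", "c", "b"],
    40: ["e", "g", "af", "c", "b", "c"],
}


def f_list(db, min_sup):
    """Frequent items with their support, in support-descending order."""
    counts = Counter()
    for seq in db.values():
        counts.update(set("".join(seq)))   # count each item once per sequence
    frequent = [(item, n) for item, n in counts.items() if n >= min_sup]
    return sorted(frequent, key=lambda pair: (-pair[1], pair[0]))


def freespan_projection(db, flist, item):
    """Project onto `item`: keep sequences containing it, and within them
    keep only items that are not after `item` in the f_list."""
    order = [x for x, _ in flist]
    allowed = set(order[: order.index(item) + 1])
    projected = {}
    for sid, seq in db.items():
        if not any(item in element for element in seq):
            continue
        kept = ["".join(x for x in element if x in allowed) for element in seq]
        projected[sid] = [element for element in kept if element]
    return projected


flist = f_list(DB, min_sup=2)
b_projected = freespan_projection(DB, flist, "b")
print(flist)
print(b_projected)
```

Because items after b in the f_list are removed, the <b>-projected database only contains a and b, which is what makes the six subsets of patterns disjoint.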
From FreeSpan to PrefixSpan
FreeSpan:
  Projection-based: no candidate sequence needs to be generated
  But projection can be performed at any point in the sequence, and the projected sequences may not shrink much. For example, the size of the f-projected database is the same as the original sequence database
PrefixSpan:
  Projection-based
  But only prefix-based projection: fewer projections and quickly shrinking sequences
J. Pei, J. Han, H. Pinto, Q. Chen, U. Dayal, and M.-C. Hsu, "PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth", Proc. 2001 Int. Conf. on Data Engineering (ICDE'01), Heidelberg, Germany, April 2001.

Prefix of a Sequence
Given an alphabetical order of items in each itemset (element)
Given two sequences α=<a1a2…an> and β=<b1b2…bm>, m ≤ n
Sequence β is called a prefix of α if and only if:
  bi = ai for i ≤ m-1;
  bm ⊆ am;
  All the items in (am − bm) are alphabetically after those in bm
Examples with α=<a(abc)(ac)d(cf)>:
  β=<a>, β=<a(ab)> and β=<a(abc)a> are prefixes of α
  β=<a(ab)a> and β=<a(abc)c> are not prefixes of α
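The three conditions of the prefix definition translate directly into code. The following is a minimal sketch: `is_prefix` is a hypothetical name, elements are strings of single-character items, and β is assumed to be non-empty.

```python
def is_prefix(beta, alpha):
    """Check whether beta is a prefix of alpha under the three conditions.

    Assumption: beta has at least one element and no element is empty.
    """
    m, n = len(beta), len(alpha)
    if m > n:
        return False
    # Condition 1: b_i = a_i for i <= m-1
    if any(set(beta[i]) != set(alpha[i]) for i in range(m - 1)):
        return False
    b_m, a_m = set(beta[m - 1]), set(alpha[m - 1])
    # Condition 2: b_m is a subset of a_m
    if not b_m <= a_m:
        return False
    # Condition 3: items in (a_m - b_m) come alphabetically after those in b_m
    return all(x > max(b_m) for x in a_m - b_m)


ALPHA = ["a", "abc", "ac", "d", "cf"]        # <a(abc)(ac)d(cf)>
for beta in (["a"], ["a", "ab"], ["a", "abc", "a"], ["a", "ab", "a"], ["a", "abc", "c"]):
    print(beta, is_prefix(beta, ALPHA))
```

The third condition is what rejects <a(abc)c>: the leftover item a in (ac) is alphabetically before c.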
Projection
Given sequences α and β, such that β is a subsequence of α.
A subsequence α' of sequence α is called a projection of α w.r.t. prefix β if and only if
  α' has prefix β;
  There exists no proper super-sequence α'' of α' such that α'' is a subsequence of α and also has prefix β
Example:
  α=<a(abc)(ac)d(cf)>
  β=<(bc)a>
  α'=<(bc)(ac)d(cf)>

Postfix
Let α'=<a1a2…an> be the projection of α w.r.t. prefix β=<a1a2…am-1a'm> (m ≤ n)
Sequence γ=<a''m am+1…an> is called the postfix of α w.r.t. prefix β, denoted as γ = α / β, where a''m = (am − a'm)
We also denote α = β · γ
Example:
  α=<a(abc)(ac)d(cf)>
  β=<a(abc)a>
  γ=<(_c)d(cf)>
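For the simple case where β is already a prefix of α, the postfix can be computed as below. This is a sketch under that assumption only (it does not compute the general projection for an arbitrary subsequence β); `postfix` and `show` are hypothetical helpers, and elements are strings of single-character items.

```python
def postfix(alpha, beta):
    """Return (partial, rest) for gamma = alpha / beta.

    Assumption: beta is a non-empty prefix of alpha. `partial` holds the
    items left in the m-th element (a''_m = a_m - a'_m); `rest` holds the
    elements a_{m+1} ... a_n.
    """
    m = len(beta)
    taken = set(beta[m - 1])
    partial = "".join(x for x in alpha[m - 1] if x not in taken)
    return partial, list(alpha[m:])


def show(partial, rest):
    """Format a postfix in the slides' notation, e.g. <(_c)d(cf)>."""
    parts = ["(_" + partial + ")"] if partial else []
    parts += [e if len(e) == 1 else "(" + e + ")" for e in rest]
    return "<" + "".join(parts) + ">"


ALPHA = ["a", "abc", "ac", "d", "cf"]            # <a(abc)(ac)d(cf)>
print(show(*postfix(ALPHA, ["a", "abc", "a"])))  # beta = <a(abc)a>
print(show(*postfix(ALPHA, ["a"])))              # beta = <a>
```

The underscore in (_c) records that c shares its element with the last item of the prefix.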
PrefixSpan Algorithm
Input: A sequence database S, and the minimum support threshold min_sup
Output: The complete set of sequential patterns
Method: Call PrefixSpan(<>, 0, S)
Subroutine PrefixSpan(α, l, S|α)
Parameters:
  α: sequential pattern,
  l: the length of α;
  S|α: the α-projected database, if α ≠ <>; otherwise, the sequence database S.

PrefixSpan Algorithm (2)
Method
1. Scan S|α once, find the set of frequent items b such that:
  a) b can be assembled to the last element of α to form a sequential pattern; or
  b) <b> can be appended to α to form a sequential pattern.
2. For each frequent item b, append it to α to form a sequential pattern α', and output α';
3. For each α', construct α'-projected database S|α', and call PrefixSpan(α', l+1, S|α').
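The subroutine can be sketched in Python as follows. This is one possible reading of the three steps, not the authors' implementation: the function names are hypothetical, projected databases are built physically (no pseudo-projection), elements are strings of single-character items, and a projected sequence is stored as a pair (partial, rest), where partial is the "(_x)" remainder of the element matched last.

```python
from collections import defaultdict


def prefixspan(db, min_sup):
    """Return {pattern: support}; a pattern is a tuple of tuples of items."""
    seqs = [[tuple(sorted(e)) for e in s] for s in db]
    patterns = {}
    _mine((), [((), s) for s in seqs], min_sup, patterns)
    return patterns


def _mine(prefix, projected, min_sup, patterns):
    last = set(prefix[-1]) if prefix else set()
    top = max(last) if last else None
    i_count, s_count = defaultdict(int), defaultdict(int)
    # Step 1: one scan, counting each item at most once per sequence.
    for partial, rest in projected:
        i_items, s_items = set(partial), set()
        for e in rest:
            s_items.update(e)                       # case b): append <b>
            if last and last <= set(e):             # case a): assemble into last element
                i_items.update(x for x in e if x > top)
        for x in i_items:
            i_count[x] += 1
        for x in s_items:
            s_count[x] += 1
    # Steps 2 and 3: output each extended pattern, project, recurse.
    for b, sup in sorted(i_count.items()):
        if sup >= min_sup:
            grown = prefix[:-1] + (prefix[-1] + (b,),)
            patterns[grown] = sup
            _mine(grown, _project_i(projected, last, b), min_sup, patterns)
    for b, sup in sorted(s_count.items()):
        if sup >= min_sup:
            grown = prefix + ((b,),)
            patterns[grown] = sup
            _mine(grown, _project_s(projected, b), min_sup, patterns)


def _project_s(projected, b):
    out = []
    for _partial, rest in projected:
        for k, e in enumerate(rest):
            if b in e:
                out.append((tuple(x for x in e if x > b), rest[k + 1:]))
                break
    return out


def _project_i(projected, last, b):
    out = []
    for partial, rest in projected:
        if b in partial:
            out.append((tuple(x for x in partial if x > b), rest))
            continue
        for k, e in enumerate(rest):
            if b in e and last <= set(e):
                out.append((tuple(x for x in e if x > b), rest[k + 1:]))
                break
    return out


DB = [["a", "abc", "ac", "d", "cf"], ["ad", "c", "bc", "ae"],
      ["ef", "ab", "df", "c", "b"], ["e", "g", "af", "c", "b", "c"]]
patterns = prefixspan(DB, min_sup=2)
print(patterns[(("a",), ("c",))])
```

Only items alphabetically after the last item of α are considered for case a), so each pattern is generated exactly once.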
PrefixSpan - Example
  id   Sequence
  10   <a(abc)(ac)d(cf)>
  20   <(ad)c(bc)(ae)>
  30   <(ef)(ab)(df)cb>
  40   <eg(af)cbc>
min_support = 2
1. Find length-1 sequential patterns
  <a>  <b>  <c>  <d>  <e>  <f>  <g>
  4    4    4    3    3    3    1
2. Divide search space
  Partition search space into 6 subsets:
    ones having prefix <a>;
    ones having prefix <b>;
    ...
    ones having prefix <f>;
  Projected databases by prefix:
    <a>: <(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>
    <b>: <(_c)(ac)d(cf)>, <(_c)(ae)>, <(df)cb>, <c>
    <c>: <(ac)d(cf)>, <(bc)(ae)>, <b>, <bc>
    <d>: <(cf)>, <c(bc)(ae)>, <(_f)cb>
    <e>: <(_f)(ab)(df)cb>, <(af)cbc>
    <f>: <(ab)(df)cb>, <cbc>
3. Find subsets of sequential patterns
  Let's see the case of <d>

PrefixSpan Example (2)
Projected database for <d>:
  <(cf)>, <c(bc)(ae)>, <(_f)cb>
Item counts in the <d>-projected database:
  <a>  <b>  <c>  <d>  <e>  <f>  <(_f)>
  1    2    3    0    1    1    1
Frequent extensions: <db>, <dc>
  <db>-projected database: <(_c)>
  <dc>-projected database: <(bc)>, <b>
    <b>  <c>
    2    1
  <dcb>
    <>
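The single-item projected databases of step 2 can be recomputed with a short sketch. `project` and `show` are hypothetical helpers, and elements are strings of single-character items listed alphabetically. Unlike the slide, the sketch does not drop the infrequent item g or empty postfixes, so its output for <e> and <f> differs slightly from the table.

```python
DB = {
    10: ["a", "abc", "ac", "d", "cf"],
    20: ["ad", "c", "bc", "ae"],
    30: ["ef", "ab", "df", "c", "b"],
    40: ["e", "g", "af", "c", "b", "c"],
}


def project(seq, item):
    """Postfix of `seq` w.r.t. the one-item prefix <item>, or None if absent.

    Returns (partial, rest): the items after `item` in the first element
    that contains it, and all later elements.
    """
    for k, element in enumerate(seq):
        if item in element:
            partial = "".join(x for x in element if x > item)
            return partial, seq[k + 1:]
    return None


def show(partial, rest):
    """Format a postfix in the slides' notation, e.g. <(_f)cb>."""
    parts = ["(_" + partial + ")"] if partial else []
    parts += [e if len(e) == 1 else "(" + e + ")" for e in rest]
    return "<" + "".join(parts) + ">"


def projected_db(db, item):
    hits = (project(seq, item) for seq in db.values())
    return [show(*hit) for hit in hits if hit is not None]


print(projected_db(DB, "d"))
print(projected_db(DB, "a"))
```

Projecting on the first occurrence keeps the longest possible postfix, so no pattern with that prefix is lost.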
PrefixSpan - characteristics
Pseudo-Projection
  reduces the cost of projection when the projected database can be held in main memory

Support example
  id   Sequence
  10   <a(abc)(ac)d(cf)>
  20   <(ad)c(bc)(ae)>
  30   <(ef)(ab)(df)cb>
  40   <eg(af)cbc>
  Support(<ac>) = 4
  Support(<ca>) = 2
  Support(<cc>) = 3
  Support(<(ac)>) = 1
[Figure: triangular matrix over a, c and (_c), with cells such as 0, (1,0,1) and 1, shown together with <a(bc)a>]
<a(bd)> is not frequent, so d cannot extend <a(bc)>.
To construct the <a(bc)>-projected database, sequence <a(bcde)df> should be projected to <(_e)df> instead of <(_de)df>
Pseudo-Projection
Observation: postfixes of a sequence often appear repeatedly in recursive projected databases
Method: instead of constructing physical projection by collecting all the postfixes, we can use pointers referring to the sequences in the database as a pseudo-projection
Every projection consists of two pieces of information: pointer to the sequence in database and offset to the postfix in the sequence
  s1=<a(abc)(ac)d(cf)>
  Pointer  Offset  Postfix
  s1       2       <(abc)(ac)d(cf)>
  s1       5       <(ac)d(cf)>
  s1       6       <(_c)d(cf)>

Efficiency of Prefix-Span and Effect of Pseudo-Projection
[Figure: performance plots comparing PrefixSpan-1 (level-by-level projection) and PrefixSpan-2 (bi-level projection)]
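The pointer-and-offset idea can be illustrated by recovering a postfix on demand instead of storing it. This is a sketch under assumptions: `postfix_at` is a hypothetical name, elements are strings of single-character items, and the offset is read as the 1-based position of the first item of the postfix, which is the interpretation that reproduces the table above.

```python
def postfix_at(seq, offset):
    """Rebuild the postfix of `seq` that starts at the 1-based item `offset`.

    Nothing is copied ahead of time: a pseudo-projection only stores the
    pair (pointer to seq, offset) and calls this when the postfix is needed.
    """
    parts, pos = [], 1
    for element in seq:
        end = pos + len(element)            # position just past this element
        if offset < end:
            skipped = max(offset - pos, 0)  # items of this element already in the prefix
            tail = element[skipped:]
            if skipped > 0:
                parts.append("(_" + tail + ")")
            elif len(tail) == 1:
                parts.append(tail)
            else:
                parts.append("(" + tail + ")")
        pos = end
    return "<" + "".join(parts) + ">"


S1 = ["a", "abc", "ac", "d", "cf"]          # s1 = <a(abc)(ac)d(cf)>
pseudo_projection = [(S1, 2), (S1, 5), (S1, 6)]
for pointer, offset in pseudo_projection:
    print(offset, postfix_at(pointer, offset))
```

Storing two integers per projected sequence instead of a copied postfix is what saves memory and copying time when the database fits in main memory.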
Summary
Sequential Pattern Mining is useful in many applications, e.g. weblog analysis, financial market prediction, bioinformatics, etc.