Ch5 Sequential

August 19, 2014 Data Mining: Concepts and Techniques 1
Chap 5.1: Mining Sequential Patterns

A kind of association rule mining
Its algorithms are closely related to association rule mining algorithms
Sequence Databases and Sequential
Pattern Analysis
Transaction databases, time-series databases vs. sequence
databases
A time-series database stores sequences of values that change with time, such as
data collected on the stock exchange.
Frequent patterns vs. (frequent) sequential patterns
Applications of sequential pattern mining
Customer shopping sequences:
First buy computer, then CD-ROM, and then digital camera,
within 3 months.
Medical treatment, natural disasters (e.g., earthquakes),
science & engineering processes, stocks and markets, etc.
Telephone calling patterns, Weblog click streams
DNA sequences and gene structures
What Is Sequential Pattern Mining?
Given a set of sequences, find the complete set
of frequent subsequences
A sequence database
A sequence : < (ef) (ab) (df) c b >
An element may contain a set of items.
Items within an element are unordered
and we list them alphabetically.
<a(bc)dc> is a subsequence
of <a(abc)(ac)d(cf)>
Given support threshold min_sup =2, <(ab)c> is a
sequential pattern
SID sequence
10 <a(abc)(ac)d(cf)>
20 <(ad)c(bc)(ae)>
30 <(ef)(ab)(df)cb>
40 <eg(af)cbc>
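The two claims above (that <a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>, and that <(ab)c> reaches min_sup = 2) can be checked mechanically. The following is a minimal sketch, not part of the original slides: the helper names `parse_sequence`, `is_subsequence` and `support` are illustrative, and the parser assumes single-letter items written in the notation used here.

```python
import re

def parse_sequence(text):
    """Turn '<a(abc)(ac)d(cf)>' into a list of item sets, one per element."""
    body = text.strip().strip("<>")
    # An element is either a parenthesised group of items or a single item.
    return [frozenset(group or single)
            for group, single in re.findall(r"\(([a-z]+)\)|([a-z])", body)]

def is_subsequence(pattern, sequence):
    """True if every pattern element is a subset of a distinct, later element."""
    pos = 0
    for element in pattern:
        # Greedily take the earliest element of `sequence` that covers it.
        while pos < len(sequence) and not element <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True

def support(pattern, database):
    """Number of database sequences that contain the pattern."""
    return sum(is_subsequence(pattern, seq) for seq in database.values())

database = {
    10: parse_sequence("<a(abc)(ac)d(cf)>"),
    20: parse_sequence("<(ad)c(bc)(ae)>"),
    30: parse_sequence("<(ef)(ab)(df)cb>"),
    40: parse_sequence("<eg(af)cbc>"),
}

print(is_subsequence(parse_sequence("<a(bc)dc>"), database[10]))  # True
print(support(parse_sequence("<(ab)c>"), database))               # 2
```

Only sequences 10 and 30 have an element containing both a and b followed by a later element containing c, which is why the support is exactly 2.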
Mining Sequential Patterns
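A sequence $\langle a_1, a_2, \ldots, a_n \rangle$ is contained in another
sequence $\langle b_1, b_2, \ldots, b_m \rangle$ if there exist integers
$1 \le i_1 < i_2 < \cdots < i_n \le m$ such that
$a_1 \subseteq b_{i_1},\ a_2 \subseteq b_{i_2},\ \ldots,\ a_n \subseteq b_{i_n}$.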

E.g. sequence
<(3) (4 5) (8)> is contained in <(7)(3 8)(9)(4 5 6) (8)>
However, <(3) (5)> is not contained in <(3 5)>
<(3) (5)> means items 3 and 5 were bought one after the other, in separate transactions
<(3 5)> means items 3 and 5 were bought together, in a single transaction
In a set of sequences, a sequence s is maximal if s is not
contained in any other sequence
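The containment definition asks for increasing positions in the longer sequence, one per element of the shorter one. A small sketch that returns those positions makes the two examples concrete; the function name `find_embedding` is illustrative and not from the slides.

```python
def find_embedding(a, b):
    """Return 1-based positions i_1 < ... < i_n such that each element of `a`
    is a subset of the element of `b` at that position, or None if none exist."""
    positions = []
    start = 0
    for element in a:
        for j in range(start, len(b)):
            if set(element) <= set(b[j]):
                positions.append(j + 1)
                start = j + 1      # the next element must match strictly later
                break
        else:
            return None            # no remaining element of b covers this one
    return positions

print(find_embedding([(3,), (4, 5), (8,)],
                     [(7,), (3, 8), (9,), (4, 5, 6), (8,)]))   # [2, 4, 5]
print(find_embedding([(3,), (5,)], [(3, 5)]))                  # None
```

The second call fails because both 3 and 5 would have to be matched against the same single element, which the strictly increasing positions forbid.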
Mining Sequential Patterns : An example
Customer Id   Transaction Time   Items Bought
1             June 25 93         30
1             June 30 93         90
2             June 10 93         10, 20
2             June 15 93         30
2             June 20 93         40, 60, 70
3             June 25 93         30, 50, 70
4             June 25 93         30
4             June 30 93         40, 70
4             July 25 93         90
5             June 12 93         90
Database Sorted by Customer Id and Transaction Time
Customer ID   Customer Sequence
1             <(30)(90)>
2             <(10 20)(30)(40 60 70)>
3             <(30 50 70)>
4             <(30)(40 70)(90)>
5             <(90)>
Customer sequence version of the DB
Sequential patterns with support > 25%

<(30) (90)>
<(30) (40 70)>
Given a database D of customer transactions, the problem
of mining sequential patterns is to find the maximal
sequences among all sequences that have a certain
user-specified minimum support
The solution involves 5 phases
Sort phase
Litemset phase
Transformation phase
Sequence phase
Maximal phase

Terminology
Length of a sequence: the number of itemsets in the sequence
(a sequence of length k is called a k-sequence)
Support of an itemset i: the fraction of customers who bought the
items in i in a single transaction
An itemset with minimum support is called a large itemset, or
litemset
The 5 phases in detail
Sort phase
The database (D) is sorted, with customer ID as the major key and
transaction time as the minor key
This converts the original transaction database into a database of
customer sequences
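As a rough illustration of the sort phase, the sketch below sorts the example transactions by (customer ID, transaction time) and groups them into customer sequences. The tuple layout and the function name `to_customer_sequences` are assumptions made for this example; the transactions are deliberately listed out of order.

```python
from datetime import date
from itertools import groupby

# (customer id, transaction time, items bought)
transactions = [
    (4, date(1993, 7, 25), {90}),
    (2, date(1993, 6, 20), {40, 60, 70}),
    (1, date(1993, 6, 30), {90}),
    (5, date(1993, 6, 12), {90}),
    (2, date(1993, 6, 10), {10, 20}),
    (4, date(1993, 6, 25), {30}),
    (3, date(1993, 6, 25), {30, 50, 70}),
    (1, date(1993, 6, 25), {30}),
    (4, date(1993, 6, 30), {40, 70}),
    (2, date(1993, 6, 15), {30}),
]

def to_customer_sequences(transactions):
    """Sort by customer id (major key) and time (minor key), then group."""
    ordered = sorted(transactions, key=lambda t: (t[0], t[1]))
    return {cid: [items for _, _, items in group]
            for cid, group in groupby(ordered, key=lambda t: t[0])}

customer_sequences = to_customer_sequences(transactions)
for cid, sequence in customer_sequences.items():
    print(cid, sequence)
```

Customer 2, for instance, ends up with the three-element sequence (10 20), (30), (40 60 70), in transaction-time order.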

Litemset phase

Find the set of all litemsets L
Simultaneously find the set of all large 1-sequences, {<l> | l ∈ L}
The set of litemsets is mapped to a set of contiguous integers
Reason for mapping: by treating litemsets as single entities, two
litemsets can be compared for equality in constant time, which reduces
the time required to check whether a sequence is contained in a customer sequence


Litemset phase
Customer Id   Transaction Time   Items Bought
1             June 25 93         30
1             June 30 93         90
2             June 10 93         10, 20
2             June 15 93         30
2             June 20 93         40, 60, 70
3             June 25 93         30, 50, 70
4             June 25 93         30
4             June 30 93         40, 70
4             July 25 93         90
5             June 12 93         90
Fig. 1 Database Sorted by Customer Id and Transaction Time
Large itemsets are (30), (40), (70), (40 70) and (90)
Large Itemset   Mapped to
(30)            1
(40)            2
(70)            3
(40 70)         4
(90)            5
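The litemset phase can be sketched by brute force on the example: enumerate every itemset that occurs inside some transaction, count each customer at most once per itemset, and keep those with support above 25%. This is only an illustration under stated assumptions: the function name `find_litemsets` is made up here, exhaustive subset enumeration is exponential in transaction size, and the sort key is chosen purely so that the integer mapping matches the numbering used in these slides (any consistent numbering would do).

```python
from itertools import combinations

customer_sequences = {
    1: [{30}, {90}],
    2: [{10, 20}, {30}, {40, 60, 70}],
    3: [{30, 50, 70}],
    4: [{30}, {40, 70}, {90}],
    5: [{90}],
}

def find_litemsets(customer_sequences, min_support):
    """Itemsets bought in a single transaction by more than min_support of customers."""
    counts = {}
    for transactions in customer_sequences.values():
        seen = set()                       # a customer counts once per itemset
        for items in transactions:
            for size in range(1, len(items) + 1):
                seen.update(combinations(sorted(items), size))
        for itemset in seen:
            counts[itemset] = counts.get(itemset, 0) + 1
    total = len(customer_sequences)
    large = [s for s, c in counts.items() if c / total > min_support]
    # Ordering assumption: by largest item, then by size, to mirror the slides.
    return sorted(large, key=lambda s: (s[-1], len(s)))

litemsets = find_litemsets(customer_sequences, min_support=0.25)
mapping = {itemset: i for i, itemset in enumerate(litemsets, start=1)}
print(mapping)
# {(30,): 1, (40,): 2, (70,): 3, (40, 70): 4, (90,): 5}
```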
Transformation Phase
The sequence phase must repeatedly determine which of a given set of
large sequences are contained in a customer sequence
To make this test fast, each customer sequence is transformed into an
alternative representation
Each transaction is replaced by the set of all litemsets contained in
that transaction
A transaction that does not contain any litemset is not retained in the
transformed sequence
A customer sequence that does not contain any litemset is dropped from
the transformed database
Customer ID   Customer Sequence
1             <(30)(90)>
2             <(10 20)(30)(40 60 70)>
3             <(30 50 70)>
4             <(30)(40 70)(90)>
5             <(90)>
Customer sequence version of the DB
Large Itemset   Mapped to
(30)            1
(40)            2
(70)            3
(40 70)         4
(90)            5
Transformation Phase
Customer ID   Customer Sequence            Transformed Customer Sequence             After Mapping
1             <(30)(90)>                   <{(30)} {(90)}>                           <{1} {5}>
2             <(10 20)(30)(40 60 70)>      <{(30)} {(40), (70), (40 70)}>            <{1} {2, 3, 4}>
3             <(30 50 70)>                 <{(30), (70)}>                            <{1, 3}>
4             <(30)(40 70)(90)>            <{(30)} {(40), (70), (40 70)} {(90)}>     <{1} {2, 3, 4} {5}>
5             <(90)>                       <{(90)}>                                  <{5}>
Transformed Database
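The transformation can be reproduced with a few lines. This sketch assumes the litemset-to-integer mapping from the slides is already available as a dictionary (`litemset_ids`, a name chosen for this example) and replaces each transaction by the IDs of the litemsets it contains.

```python
litemset_ids = {
    frozenset({30}): 1,
    frozenset({40}): 2,
    frozenset({70}): 3,
    frozenset({40, 70}): 4,
    frozenset({90}): 5,
}

customer_sequences = {
    1: [{30}, {90}],
    2: [{10, 20}, {30}, {40, 60, 70}],
    3: [{30, 50, 70}],
    4: [{30}, {40, 70}, {90}],
    5: [{90}],
}

def transform(sequence, litemset_ids):
    """Replace each transaction by the set of litemset ids it contains."""
    transformed = []
    for transaction in sequence:
        ids = {i for itemset, i in litemset_ids.items() if itemset <= transaction}
        if ids:                      # transactions without any litemset are dropped
            transformed.append(ids)
    return transformed

transformed_db = {}
for cid, sequence in customer_sequences.items():
    new_sequence = transform(sequence, litemset_ids)
    if new_sequence:                 # sequences without any litemset are dropped
        transformed_db[cid] = new_sequence

print(transformed_db[2])             # [{1}, {2, 3, 4}]
```

Customer 2's first transaction (10 20) contains no litemset, so it disappears from the transformed sequence, as in the table.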

Sequence Phase
Make multiple passes over the data
In each pass, start with a seed set of large sequences
Use the seed set to generate new potentially large sequences,
called candidate sequences
Count the support of the candidates during the pass over the data
At the end of the pass, determine which candidates are actually large
These large candidates become the seed for the next pass
Two families of algorithms: count-all and count-some
The count-all algorithm, based on the Apriori algorithm, is called AprioriAll
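The pass structure described above can be sketched as a level-wise loop. This is a simplified illustration in the spirit of AprioriAll, not the authors' implementation: it works on transformed customer sequences (lists of sets of litemset IDs), represents a candidate as a tuple of IDs, and uses the five customer sequences of this chapter's worked example with a minimum support of 2 customers. All function names are assumptions.

```python
def is_contained(candidate, customer_sequence):
    """True if the ids of `candidate` occur, in order, in distinct elements."""
    pos = 0
    for lid in candidate:
        while pos < len(customer_sequence) and lid not in customer_sequence[pos]:
            pos += 1
        if pos == len(customer_sequence):
            return False
        pos += 1
    return True

def generate_candidates(seeds):
    """Join seeds sharing all but their last id, then prune by the Apriori property."""
    joined = {p + (q[-1],) for p in seeds for q in seeds if p[:-1] == q[:-1]}
    return {c for c in joined
            if all(c[:i] + c[i + 1:] in seeds for i in range(len(c)))}

def sequence_phase(customer_sequences, min_count):
    """Return [L1, L2, ...], the large sequences found in each pass."""
    ids = {i for seq in customer_sequences for element in seq for i in element}
    candidates = {(i,) for i in ids}
    large_by_length = []
    while candidates:
        # One pass over the data: count every candidate.
        large = {c for c in candidates
                 if sum(is_contained(c, s) for s in customer_sequences) >= min_count}
        if not large:
            break
        large_by_length.append(large)
        candidates = generate_candidates(large)   # seed set for the next pass
    return large_by_length

customer_sequences = [
    [{1, 5}, {2}, {3}, {4}],
    [{1}, {3}, {4}, {3, 5}],
    [{1}, {2}, {3}, {4}],
    [{1}, {3}, {5}],
    [{4}, {5}],
]

levels = sequence_phase(customer_sequences, min_count=2)
print([len(level) for level in levels])   # [5, 9, 5, 1]
print(levels[-1])                         # {(1, 2, 3, 4)}
```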

Sequence Phase
AprioriAll: Apriori candidate generation

Large 3-sequences   Candidate 4-sequences   Candidate 4-sequences
                    (after join)            (after pruning)
<1 2 3>             <1 2 3 4>               <1 2 3 4>
<1 2 4>             <1 2 4 3>
<1 3 4>             <1 3 4 5>
<1 3 5>             <1 3 5 4>
<2 3 4>
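The join and prune steps behind this table can be sketched as follows. Two large (k-1)-sequences are joined when they agree on everything but their last litemset, and a candidate survives pruning only if every (k-1)-subsequence obtained by dropping one litemset is itself large. One assumption is made to match the table: a sequence is not joined with itself, so candidates such as <1 2 3 3> do not appear in the join column (they would be removed by pruning in this example anyway). The function name is illustrative.

```python
def generate_candidates(large_prev):
    """Return (candidates after join, candidates after pruning)."""
    large_prev = set(large_prev)
    # Join: same prefix, different last litemset (self-joins left out, see above).
    joined = sorted(p + (q[-1],)
                    for p in large_prev for q in large_prev
                    if p != q and p[:-1] == q[:-1])
    # Prune: every subsequence with one litemset dropped must be large.
    pruned = [c for c in joined
              if all(c[:i] + c[i + 1:] in large_prev for i in range(len(c)))]
    return joined, pruned

large_3 = [(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)]
after_join, after_pruning = generate_candidates(large_3)
print(after_join)     # [(1, 2, 3, 4), (1, 2, 4, 3), (1, 3, 4, 5), (1, 3, 5, 4)]
print(after_pruning)  # [(1, 2, 3, 4)]
```

For example, <1 2 4 3> is pruned because its subsequence <2 4 3> is not among the large 3-sequences.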
Maximal phase
Find the maximal sequences among the set of large sequences

Sequence Phase
Customer sequences
< {1 5} {2} {3} {4} >
< {1} {3} {4} {3 5} >
< {1} {2} {3} {4} >
< {1} {3} {5} >
< {4} {5} >
Min_support = 40% (2 customer sequences)

Large sequences
L1: 1-sequences   support
<1>               4
<2>               2
<3>               4
<4>               4
<5>               4
L2: 2-sequences   support
<1 2>             2
<1 3>             4
<1 4>             3
<1 5>             3
<2 3>             2
<2 4>             2
<3 4>             3
<3 5>             2
<4 5>             2
L3: 3-sequences   support
<1 2 3>           2
<1 2 4>           2
<1 3 4>           3
<1 3 5>           2
<2 3 4>           2
L4: 4-sequences   support
<1 2 3 4>         2

Maximal large sequences
<1 2 3 4>, <1 3 5>, <4 5>
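The maximal phase is a filter over the large sequences: drop every sequence that is contained in another large sequence. In this sketch each sequence is a tuple of litemset IDs and each ID is treated as an atomic symbol, so containment reduces to an ordinary subsequence test; that simplification, and the function names, are assumptions of this example.

```python
def is_contained(small, big):
    """True if `small` is a (not necessarily contiguous) subsequence of `big`."""
    remaining = iter(big)
    # `item in remaining` advances the iterator, which enforces the order.
    return all(item in remaining for item in small)

def maximal_sequences(large):
    """Keep only the sequences not contained in any other large sequence."""
    return [s for s in large
            if not any(s != t and is_contained(s, t) for t in large)]

large = [
    (1,), (2,), (3,), (4,), (5,),
    (1, 2), (1, 3), (1, 4), (1, 5), (2, 3), (2, 4), (3, 4), (3, 5), (4, 5),
    (1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4),
    (1, 2, 3, 4),
]

print(maximal_sequences(large))   # [(4, 5), (1, 3, 5), (1, 2, 3, 4)]
```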
Challenges on Sequential Pattern Mining
A huge number of possible sequential patterns are
hidden in databases
A mining algorithm should
find the complete set of patterns, when possible,
satisfying the minimum support (frequency) threshold
be highly efficient, scalable, involving only a small
number of database scans
be able to incorporate various kinds of user-specific
constraints
Studies on Sequential Pattern Mining
Concept introduction and an initial Apriori-like algorithm
R. Agrawal & R. Srikant. Mining sequential patterns. ICDE'95
GSP: an Apriori-based, influential mining method (developed at IBM
Almaden)
R. Srikant & R. Agrawal. Mining sequential patterns:
Generalizations and performance improvements. EDBT'96
From sequential patterns to episodes (Apriori-like + constraints)
H. Mannila, H. Toivonen & A.I. Verkamo. Discovery of frequent
episodes in event sequences. Data Mining and Knowledge
Discovery, 1997
Mining sequential patterns with constraints
M.N. Garofalakis, R. Rastogi & K. Shim. SPIRIT: Sequential
Pattern Mining with Regular Expression Constraints. VLDB 1999
A Basic Property of Sequential Patterns: Apriori
A basic property: Apriori (Agrawal & Srikant '94)
If a sequence S is not frequent,
then none of the super-sequences of S is frequent
E.g., <hb> is infrequent, so <hab> and <(ah)b> are infrequent as well
Seq. ID   Sequence
10        <(bd)cb(ac)>
20        <(bf)(ce)b(fg)>
30        <(ah)(bf)abf>
40        <(be)(ce)d>
50        <a(bd)bcb(ade)>
Given support threshold
min_sup = 2
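Stated as an inequality, the property says that support can only shrink as a sequence is extended. In the formalisation below, sup(S) denotes the number of database sequences containing S and the symbol ⊑ denotes the subsequence relation; both are notation chosen here for illustration, not taken from the slides.

```latex
\[
S \sqsubseteq S' \;\Longrightarrow\; \mathrm{sup}(S') \le \mathrm{sup}(S),
\qquad\text{hence}\qquad
\mathrm{sup}(S) < \mathit{min\_sup} \;\Longrightarrow\; \mathrm{sup}(S') < \mathit{min\_sup}.
\]
```

In the example database, item h occurs only in sequence 30, so sup(<hb>) = 1 < 2, and any sequence containing <hb>, such as <hab> or <(ah)b>, is supported by at most that one sequence.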
GSP: A Generalized Sequential Pattern Mining Algorithm
GSP (Generalized Sequential Pattern) mining algorithm,
proposed by Srikant and Agrawal, EDBT'96
Outline of the method
Initially, every item in the DB is a candidate of length 1
For each level (i.e., sequences of length k) do
  scan the database to collect the support count for each
  candidate sequence
  generate candidate length-(k+1) sequences from
  length-k frequent sequences using Apriori
Repeat until no frequent sequence or no candidate
can be found
Major strength: candidate pruning by Apriori
Finding Length-1 Sequential Patterns
Examine GSP using an example
Initial candidates: all singleton sequences
<a>, <b>, <c>, <d>, <e>, <f>, <g>, <h>
Scan the database once, count support for the
candidates
Seq. ID   Sequence
10        <(bd)cb(ac)>
20        <(bf)(ce)b(fg)>
30        <(ah)(bf)abf>
40        <(be)(ce)d>
50        <a(bd)bcb(ade)>
min_sup = 2
Cand   Sup
<a>    3
<b>    5
<c>    4
<d>    3
<e>    3
<f>    2
<g>    1
<h>    1
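The first scan can be sketched in a few lines: every item is counted once per sequence in which it appears, regardless of how many times it occurs there. The data layout (each sequence as a list of strings, one string per element) and the function name `count_items` are choices made for this example only.

```python
from collections import Counter

database = {
    10: ["bd", "c", "b", "ac"],
    20: ["bf", "ce", "b", "fg"],
    30: ["ah", "bf", "a", "b", "f"],
    40: ["be", "ce", "d"],
    50: ["a", "bd", "b", "c", "b", "ade"],
}

def count_items(database):
    """Support of each length-1 candidate: number of sequences containing the item."""
    counts = Counter()
    for sequence in database.values():
        counts.update(set().union(*sequence))   # each item once per sequence
    return counts

min_sup = 2
counts = count_items(database)
frequent = sorted(item for item, c in counts.items() if c >= min_sup)

print(sorted(counts.items()))
# [('a', 3), ('b', 5), ('c', 4), ('d', 3), ('e', 3), ('f', 2), ('g', 1), ('h', 1)]
print(frequent)   # ['a', 'b', 'c', 'd', 'e', 'f']
```

Items g and h fall below min_sup and are discarded, leaving six length-1 sequential patterns.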
Generating Length-2 Candidates
Candidates with the two items in separate elements:
      <a>    <b>    <c>    <d>    <e>    <f>
<a>   <aa>   <ab>   <ac>   <ad>   <ae>   <af>
<b>   <ba>   <bb>   <bc>   <bd>   <be>   <bf>
<c>   <ca>   <cb>   <cc>   <cd>   <ce>   <cf>
<d>   <da>   <db>   <dc>   <dd>   <de>   <df>
<e>   <ea>   <eb>   <ec>   <ed>   <ee>   <ef>
<f>   <fa>   <fb>   <fc>   <fd>   <fe>   <ff>
Candidates with the two items in the same element:
      <a>    <b>      <c>      <d>      <e>      <f>
<a>          <(ab)>   <(ac)>   <(ad)>   <(ae)>   <(af)>
<b>                   <(bc)>   <(bd)>   <(be)>   <(bf)>
<c>                            <(cd)>   <(ce)>   <(cf)>
<d>                                     <(de)>   <(df)>
<e>                                              <(ef)>
<f>
6*6 + 6*5/2 = 51 length-2 candidates
Without the Apriori property,
8*8 + 8*7/2 = 92 candidates
Apriori prunes 44.57% of the candidates
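The candidate counts can be verified by enumeration. The sketch below builds both kinds of length-2 candidates (two items in separate elements, and two distinct items in one element) from a given set of items; the function name `length2_candidates` is illustrative.

```python
from itertools import combinations, product

def length2_candidates(items):
    """All length-2 candidates over `items`, written in the slides' notation."""
    separate = [f"<{x}{y}>" for x, y in product(items, repeat=2)]   # x, then y later
    together = [f"<({x}{y})>" for x, y in combinations(items, 2)]   # x and y together
    return separate + together

with_apriori = length2_candidates("abcdef")       # only the 6 frequent items
without_apriori = length2_candidates("abcdefgh")  # all 8 items
pruned_share = 1 - len(with_apriori) / len(without_apriori)

print(len(with_apriori), len(without_apriori), f"{pruned_share:.2%}")
# 51 92 44.57%
```

The 51 candidates split into 36 of the form <xy> and 15 of the form <(xy)>.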
