In this blog post, I will give an introduction to sequential pattern mining, an important data mining task with
a wide range of applications from text analysis to market basket analysis. This blog post is meant to be a short
introduction. If you want to read a more detailed introduction to sequential pattern mining, you can read a
survey paper that I recently wrote on this topic.
Data mining consists of extracting information from data stored in databases to understand the data and/or make
decisions. Some of the most fundamental data mining tasks are clustering, classification, outlier analysis, and
pattern mining. Pattern mining consists of discovering interesting, useful, and unexpected patterns in
databases. Various types of patterns can be discovered in databases, such as frequent itemsets, associations,
subgraphs, sequential rules, and periodic patterns.
The task of sequential pattern mining is a data mining task specialized for analyzing sequential data, to
discover sequential patterns. More precisely, it consists of discovering interesting subsequences in a set of
sequences, where the interestingness of a subsequence can be measured in terms of various criteria such as its
occurrence frequency, length, and profit. Sequential pattern mining has numerous real-life applications due to the
fact that data is naturally encoded as sequences of symbols in many fields such as bioinformatics, e-learning,
market basket analysis, texts, and webpage click-stream analysis.
I will now explain the task of sequential pattern mining with an example. Consider the following sequence
database, representing the purchases made by customers in a retail store.
This database contains four sequences. Each sequence represents the items purchased by a customer at
different times. A sequence is an ordered list of itemsets (sets of items bought together). For example, in this
database, the first sequence (SID 1) indicates that a customer bought some items a and b together, then purchased
an item c, then purchased items f and g together, then purchased an item g, and then finally purchased an item e.
Traditionally, sequential pattern mining is used to find subsequences that appear often in a sequence
database, i.e. that are common to several sequences. Those subsequences are called the frequent sequential
patterns. For example, in the context of our example, sequential pattern mining can be used to find the
http://data-mining.philippe-fournier-viger.com/introduction-sequential-pattern-mining/ 1/9
05/10/2017 An Introduction to Sequential Pattern Mining - The Data Mining Blog
sequences of items frequently bought by customers. This can be useful for understanding customer behavior and
making marketing decisions.
To do sequential pattern mining, a user must provide a sequence database and specify a parameter called the
minimum support threshold. This parameter indicates a minimum number of sequences in which a pattern
must appear to be considered frequent, and be shown to the user. For example, if a user sets the minimum
support threshold to 2 sequences, the task of sequential pattern mining consists of finding all subsequences
appearing in at least 2 sequences of the input database. In the example database, 29 subsequences meet this
requirement. These sequential patterns are shown in the table below, where the number of sequences containing
each pattern (called the support) is indicated in the right column of the table.
For example, the patterns <{a}> and <{a}, {g}> are frequent and have a support of 3 and 2 sequences,
respectively. In other words, these patterns appear in 3 and 2 sequences of the input database, respectively. The
pattern <{a}> appears in the sequences 1, 2 and 3, while the pattern <{a}, {g}> appears in sequences 1 and 3.
These patterns are interesting as they represent some behavior common to several customers. Of course, this is a
toy example. Sequential pattern mining can actually be applied to databases containing hundreds of thousands of
sequences.
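As a rough illustration of how support is counted, here is a minimal Python sketch. The database below is an assumption made for this example: its first sequence follows the description of SID 1 given above, but the other three sequences are stand-ins chosen to be consistent with the supports quoted in the text, since the original table is not reproduced here. The helper names (`seq`, `contains`, `support`) are hypothetical, and this is a naive check, not one of the optimized algorithms discussed later.

```python
from typing import FrozenSet, List

Itemset = FrozenSet[str]
Sequence = List[Itemset]


def seq(*itemsets: str) -> Sequence:
    """Build a sequence from strings, one character per item: seq("ab", "c")."""
    return [frozenset(s) for s in itemsets]


def contains(sequence: Sequence, pattern: Sequence) -> bool:
    """True if each itemset of `pattern` is a subset of a distinct itemset
    of `sequence`, with the order of itemsets preserved."""
    pos = 0
    for itemset in sequence:
        if pos < len(pattern) and pattern[pos] <= itemset:
            pos += 1  # greedy earliest match is enough for containment
    return pos == len(pattern)


def support(database: List[Sequence], pattern: Sequence) -> int:
    """Number of sequences containing the pattern (each counted once)."""
    return sum(1 for s in database if contains(s, pattern))


# Illustrative database (assumed, see the note above).
DATABASE = [
    seq("ab", "c", "fg", "g", "e"),  # SID 1, as described in the text
    seq("ad", "c", "b", "abef"),     # SID 2 (assumption)
    seq("a", "b", "fg", "e"),        # SID 3 (assumption)
    seq("b", "fg"),                  # SID 4 (assumption)
]

MIN_SUPPORT = 2
for pattern in (seq("a"), seq("a", "g")):
    sup = support(DATABASE, pattern)
    print([sorted(i) for i in pattern], sup, sup >= MIN_SUPPORT)
```

With this stand-in database, <{a}> has a support of 3 and <{a}, {g}> has a support of 2, so both are frequent for a minimum support threshold of 2.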
Another application of sequential pattern mining is text analysis. In this context, a set of sentences
from a text can be viewed as a sequence database, and the goal of sequential pattern mining is then to find
subsequences of words frequently used in the text. If such sequences are contiguous, they are called “ngrams” in
this context. If you want to know more about this application, you can read this blog post, where sequential
patterns are discovered in a Sherlock Holmes novel.
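To make the contiguous case concrete, the sketch below counts in how many sentences each n-gram appears and keeps those that reach a minimum support. The sentences and the function name `frequent_ngrams` are made up for illustration, and tokenization is reduced to lowercasing and splitting on whitespace.

```python
from collections import Counter
from typing import Dict, List, Tuple


def frequent_ngrams(sentences: List[str], n: int, min_support: int) -> Dict[Tuple[str, ...], int]:
    """Return the n-grams appearing in at least `min_support` sentences."""
    counts: Counter = Counter()
    for sentence in sentences:
        words = sentence.lower().split()
        # A set, so that an n-gram repeated inside one sentence counts once.
        grams = {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}
        counts.update(grams)
    return {gram: c for gram, c in counts.items() if c >= min_support}


sentences = [
    "the dog chased the cat",
    "the dog barked",
    "a cat chased the dog",
]
result = frequent_ngrams(sentences, n=2, min_support=2)
print(result)
```

Here only the bigrams "the dog" (3 sentences) and "chased the" (2 sentences) are frequent.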
Besides sequences, sequential pattern mining can also be applied to time series (e.g. stock data), when
discretization is performed as a pre-processing step. For example, the figure below shows a time series (an
ordered list of numbers) on the left. On the right, a sequence (a sequence of symbols) is shown representing the
same data, after applying a transformation. Various transformations can convert a time series into a
sequence, such as the popular SAX transformation. After performing the transformation, any sequential pattern
mining algorithm can be applied.
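The following is a simplified SAX-style discretization, written as a sketch and not as a reference implementation. It z-normalizes the series, averages it over equal-length segments, and maps each average to a symbol using breakpoints that split the standard normal distribution into equally probable regions. The function name `sax`, the four-letter alphabet, and the requirement that the series length be divisible by the number of segments are assumptions made to keep the example short.

```python
from bisect import bisect_right
from statistics import NormalDist, mean, pstdev
from typing import List


def sax(series: List[float], n_segments: int, alphabet: str = "abcd") -> str:
    """Convert a numeric time series into a string of symbols."""
    if len(series) % n_segments != 0:
        raise ValueError("series length must be divisible by n_segments")
    mu, sigma = mean(series), pstdev(series)
    if sigma == 0:
        raise ValueError("a constant series cannot be z-normalized")
    z = [(x - mu) / sigma for x in series]

    # Piecewise aggregate approximation: one average per segment.
    size = len(z) // n_segments
    averages = [mean(z[i * size:(i + 1) * size]) for i in range(n_segments)]

    # Breakpoints cutting N(0, 1) into len(alphabet) equiprobable regions.
    k = len(alphabet)
    breakpoints = [NormalDist().inv_cdf(i / k) for i in range(1, k)]
    return "".join(alphabet[bisect_right(breakpoints, v)] for v in averages)


print(sax([1, 1, 2, 2, 9, 9, 10, 10], n_segments=4))  # prints "aadd"
```

The resulting string can then be treated as a sequence of symbols and given to a sequential pattern mining algorithm.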
To try sequential pattern mining with your datasets, you may try the open-source SPMF data mining
software: http://www.philippe-fournier-viger.com/spmf/
It provides implementations of several algorithms for sequential pattern mining, as well as for several variations of
the problem such as discovering maximal sequential patterns, closed sequential patterns and sequential
rules. Sequential rules are especially useful for making predictions, as they also include the
concept of confidence.
What are the current best algorithms for sequential pattern mining?
There exist several sequential pattern mining algorithms. Some of the classic algorithms for this problem are
PrefixSpan, Spade, SPAM, and GSP. However, in the last decade, several novel and more efficient
algorithms have been proposed such as CM-SPADE and CM-SPAM (2014), FCloSM and FGenSM (2017), to
name a few. Besides, numerous algorithms have been proposed for extensions of the problem of sequential
pattern mining such as finding the sequential patterns that generate the most profit (high utility sequential
pattern mining).
Conclusion
In this blog post, I have given a brief overview of sequential pattern mining, a very useful set of techniques for
analyzing sequential data. If you want to know more about this topic, you may read the following recent survey
paper that I wrote, which gives an easy-to-read overview of this topic, including the algorithms for sequential
pattern mining, extensions, research challenges and opportunities.
Fournier-Viger, P., Lin, J. C.-W., Kiran, R. U., Koh, Y. S., Thomas, R. (2017). A Survey of Sequential
Pattern Mining. Data Science and Pattern Recognition, vol. 1(1), pp. 54-77.
—
Philippe Fournier-Viger is a professor of Computer Science and also the founder of the open-source data
mining software SPMF, offering more than 120 data mining algorithms.
Periodic patterns are discovered in a single sequence. We want to find a pattern that appears periodically. For
example, we may require that it appears approximately every week.
Sequential patterns are discovered in a sequence database (multiple sequences). We want to find a pattern that is
common to several sequences. If a pattern appears multiple times in the SAME sequence, it is still counted
only once by sequential pattern mining algorithms.
Reply
I started reading your survey paper (A Survey of Sequential Pattern Mining) and I have a question. You stated that an itemset X is a
subset of the set of items I.
Thus an item, say A, can appear at most ‘k’ times in an itemset if it appears ‘k’ times in the set I.
However, while explaining the BFS approach, you took I = {a,b,c}, and while enumerating 2-sequences you generated as well.
I guess there is an error in this enumeration.
Also, in the explanation of the support of a subsequence, there is a mistake in the formal definition.
Given is:
sup(sa) = | { s | s contained in sa ^ s belongs to SDB} |
I guess it should be:
sup(sa) = | { s | sa contained in s ^ s belongs to SDB} |
Yes, there was an error in the definition of sup(sa), as you noticed. I have fixed it and re-uploaded the paper to my
website:
http://www.philippe-fournier-viger.com/dspr-paper5.pdf
For the other question, I think that something is missing in your question. Do you want to ask why the same item can
appear multiple times in the same pattern? If this is your question, then yes, an item is allowed to appear multiple times in
the same sequence. Let me explain this more clearly. In sequential pattern mining, an item can appear at most
once in each itemset. But a sequence can contain multiple itemsets. Thus, it is possible for the same item to appear
multiple times in the same sequence by appearing in several itemsets (because a sequence is a list of itemsets). For
example, the sequence (a)(a) is correct because it contains “a” twice, but not in the same itemset. But the
sequence (a a) would be incorrect according to the definition because it contains the same item more than once in the
same itemset. So, to explain again, an itemset X in a sequence must be a subset of I. Thus, an item cannot appear
more than once in an itemset. However, a sequence contains many itemsets. Thus, the same item is allowed to appear
multiple times in the same sequence, but it must be in different itemsets.
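The rule can be expressed as a small check. In this sketch, itemsets are written as lists so that a duplicated item can be represented at all, and `is_valid_sequence` is a hypothetical helper name, not something taken from the survey paper.

```python
from typing import List, Set


def is_valid_sequence(sequence: List[List[str]], items: Set[str]) -> bool:
    """A sequence is valid if every itemset is a subset of `items`
    and no item is repeated inside the same itemset."""
    for itemset in sequence:
        if len(set(itemset)) != len(itemset):  # repeated item in one itemset
            return False
        if not set(itemset) <= items:          # item outside the set I
            return False
    return True


I = {"a", "b", "c"}
print(is_valid_sequence([["a"], ["a"]], I))  # True: (a)(a), two itemsets
print(is_valid_sequence([["a", "a"]], I))    # False: (a a), one itemset
```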
Best regards,
Reply
I got confused because, while explaining BFS-based algorithms on page 58, there is an example that takes I =
{a,b,c}, and the enumeration of 2-sequences contains itemsets like {a,a}, {b,b}, {c,c}.
I wanted to point out the above fact. As I = {a,b,c}. The sequence is valid. However the sequence is not.
Reply
Hello,
I understand now. Yes, you are right, there is an error in that example. I have fixed it and
re-uploaded the paper to my website. Thanks!
Philippe
AT says:
2017-09-19 at 8:09 AM
Hello,
I have a question about the example in this article. Is it correct that the pattern has a support of 2? Because from what I
understood, this pattern appears in sequences 1, 3 and 4. And the same for the pattern alone. Are there some errors in the
example, or am I misunderstanding sequential patterns?
Thanks.
Reply
Hello, I would like to answer your question but I am not sure which patterns you are talking about. I think that when
you submitted your question, WordPress might have removed the patterns from your comment because of the >
symbols.
If you can tell me which patterns, I will answer your question.
Reply
AT says:
2017-09-19 at 5:37 PM
Yes, it probably thinks that they are tags. My bad. The patterns are ({b},{f,g}) and also ({f,g}) alone.
Thanks.
Reply
Hello,
I’m curious about the difference between ‘high utility sequential pattern’ and ‘high utility itemset’. I’m studying
this research area in order to adapt this concept to find sequential high-yield patterns in manufacturing processes. Thus,
I’m searching for articles on the topic, including ones written by you.
While searching, I got confused about the difference between the above two terms.
In my opinion, the main difference is ‘sequential’, since the results of algorithms for ‘high utility itemset’ mining do not
consider the order of transactions, right?
If I am right, I will be able to narrow down my search from ‘high utility itemset’ to ‘high utility sequential pattern’.
Hi,
Yes, that is correct. In high utility itemset mining, the time or sequential order is not considered.
In high utility itemset mining, the input is a transaction database (a set of transactions performed by many
customers).
The goal is then to find the groups of items that yield a high profit when purchased together. But there is no order.
For example, you could find that {apple,bread} is a high utility itemset. It means that when bought together, apple
and bread make a lot of money. But it does not tell anything about the order.
In high utility sequential pattern mining, the order is considered.
Here the input is a sequence database. It means a set of sequences of transactions.
For example, you may have a database for 100 customers, and for each customer, you have a sequence of transactions
that is sequentially ordered.
Then you can find some patterns such as < (apple),(bread)> which means that buying apple followed by buying bread
yields a high profit.
But a problem of high utility sequential pattern mining is that there is no concept of confidence.
If you have the pattern < (apple),(bread)> , you still don’t know how likely it is that if someone buys apple, he will then buy
bread.
To solve this problem, high utility sequential rule mining was proposed.
It is also applied on a sequence database, and will find rules such as:
Apple –> Bread confidence : 70 % which means that this rule yields a lot of money and when someone buys apple, 70
% of the time he will buy bread after.
Besides that, if you have a single sequence instead of many sequences, you can look at episode mining. Episode
mining is similar to sequential pattern mining except that patterns are mined from a single sequence instead of many
sequences.
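To illustrate what the confidence of a rule such as Apple –> Bread measures, here is a minimal sketch for single-item rules. It divides the number of sequences where the antecedent is later followed by the consequent by the number of sequences containing the antecedent. The customer data and the function name `rule_confidence` are made up for this example, and the utility (profit) part of high utility sequential rule mining is not modelled at all.

```python
from typing import List, Set


def rule_confidence(database: List[List[Set[str]]], antecedent: str, consequent: str) -> float:
    """Fraction of sequences containing `antecedent` in which `consequent`
    appears in a later transaction."""
    with_antecedent = 0
    with_both = 0
    for sequence in database:
        # Index of the first transaction containing the antecedent, if any.
        first = next((i for i, t in enumerate(sequence) if antecedent in t), None)
        if first is None:
            continue
        with_antecedent += 1
        if any(consequent in t for t in sequence[first + 1:]):
            with_both += 1
    return with_both / with_antecedent if with_antecedent else 0.0


# Hypothetical purchase sequences, one per customer.
customers = [
    [{"apple"}, {"bread"}],
    [{"apple", "milk"}, {"milk"}, {"bread"}],
    [{"bread"}, {"apple"}],
    [{"apple"}, {"milk"}],
    [{"milk"}],
]
print(rule_confidence(customers, "apple", "bread"))  # 0.5
```

Four customers bought apple, and two of them bought bread afterwards, so the confidence is 50 % in this toy data.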
Reply
Reply
If you want some faster high utility sequential pattern mining algorithms (faster than USpan), you
can check CRoM and HuspExt.
Best,
Philippe
Reply
Cristina says:
2017-09-24 at 10:25 PM
Hi,
I was wondering if you could explain briefly when a sequential pattern mining method (algorithm) is robust. How to validate
the robustness of an algorithm?
Best regards,
Cristina
Reply
Hello,
In general, in computer science, an algorithm is said to be robust if it can tolerate some invalid or erroneous input.
How would it apply to sequential pattern mining? Just like any computer program, you want a sequential pattern
mining algorithm to work well with different kinds of data, and to report an error if the input data or parameter
value(s) are not what the algorithm expects. So this is software engineering 101. When you design a program,
you should test it with different kinds of data and parameters to make sure that it works in all situations, and make
sure that it shows an error message to the user when an error message should be shown.
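As one possible illustration, an implementation could validate its input before mining and fail with a clear message. The checks below are examples of what such validation might cover, and `validate_input` is a hypothetical helper, not part of any particular library.

```python
from typing import List, Set


def validate_input(database: List[List[Set[str]]], min_support: int) -> None:
    """Raise ValueError with an explicit message if the input is unusable."""
    if not database:
        raise ValueError("the sequence database is empty")
    if not isinstance(min_support, int) or not 1 <= min_support <= len(database):
        raise ValueError(
            f"min_support must be an integer between 1 and {len(database)}, "
            f"got {min_support!r}"
        )
    for sid, sequence in enumerate(database, start=1):
        if not sequence:
            raise ValueError(f"sequence {sid} is empty")
        for itemset in sequence:
            if not itemset:
                raise ValueError(f"sequence {sid} contains an empty itemset")


database = [[{"a", "b"}, {"c"}], [{"a"}, {"c"}]]
validate_input(database, min_support=2)  # valid input: no exception

try:
    validate_input(database, min_support=5)
except ValueError as error:
    print("Invalid input:", error)
```

Testing the program with such invalid cases, as well as with valid data of different sizes and shapes, is one way to check the robustness described above.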
Regards,
Reply
Cristina says:
2017-09-25 at 2:33 AM
Thank you
Reply