
Efficient Mining of Correlated Sequential Patterns Based on Null Hypothesis

Cindy Xide Lin†§, Ming Ji†, Marina Danilevsky†, Jiawei Han†

† Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL, USA
§ Twitter Inc., San Francisco, CA, USA

seraphimdl@twitter.com, {mingji1, danilev1, hanj}@illinois.edu

ABSTRACT
Frequent pattern mining has been a widely studied topic in the research area of data mining for more than a decade. However, pattern mining with real data sets is complicated: a huge number of co-occurrence patterns are usually generated, a majority of which are either redundant or uninformative. The true correlation relationships among data objects are buried deep in a large pile of useless information. To overcome this difficulty, mining correlations has been recognized as an important data mining task for its many advantages over mining frequent patterns.

In this paper, we formally propose and define the task of mining frequent correlated sequential patterns from a sequential database. With this aim in mind, we re-examine various interestingness measures to select the appropriate one(s) that can disclose succinct relationships of sequential patterns. We then propose PSBSpan, an efficient mining algorithm based on the framework of the pattern-growth methodology, which mines frequent correlated sequential patterns. Our experimental study on real datasets shows that our algorithm has outstanding performance in terms of both efficiency and effectiveness.

Categories and Subject Descriptors
H.2.8 [Database Management]: Database Applications—Data Mining

General Terms
Algorithms

Keywords
Frequent Pattern Mining, Correlated Pattern Mining

1. INTRODUCTION
Frequent pattern mining has been a widely studied topic in data mining research. Common approaches include association rule mining [8], sequential pattern mining [22], graph pattern mining [30], etc. However, all pattern mining approaches have a hard time with real datasets. When the minimum support threshold is high, in general only obvious, common-sense 'knowledge' will be found, whereas when the minimum support is low, a huge number of patterns will usually be generated, most of which are redundant, uninformative, or just random combinations of popular data objects [3]. The question of how to discover truly useful patterns [25, 10, 4, 33] that are buried deep in a large pile of useless information has recently attracted substantial attention from researchers.

Example 1.1. What makes a frequent pattern 'interesting'? We crawled 13,409,424 Flickr photos containing geospatial information, and generated frequent patterns by treating the photos uploaded by each user as one sequence. We discovered a huge number of patterns such as popular tourism trails, some of which revealed a clear picture of tourists' interests (e.g., traveling from the center of San Francisco to the Pacific seashore, as shown in Figure 1(b)), while others are just combinations of popular locations (e.g., Figure 1(a)).

Figure 1: Popular Tours at San Francisco. (a) Frequent Pattern; (b) Interesting Pattern. (Map panels omitted.)

In this paper, we study the problem of finding sequences that are both popular and correlated in an input sequential database. This task is widely applicable in a variety of settings, including popular event tracking [17, 18], research topic analysis [20], market basket problems [1], etc.

S1  support vector machine           S2  graph support classification
S3  support vector machine           S4  graph theory
S5  support evidence                 S6  graph pattern mining
S7  machine learning                 S8  graph pattern mining
S9  spectral clustering algorithm    S10 sequence pattern mining
S11 spectral clustering algorithm    S12 novel association pattern mining
S13 spectral clustering method       S14 construction algorithm
S15 spectral clustering model        S16 EM algorithm
Table 1: The Example Sequence Database SDB
Although the traditional association pattern mining problem is well defined and has been thoroughly studied over the last decade, there is currently no canonical way to measure the degree of the so-called correlation between sequential patterns. We believe that there should intuitively be more than one 'correct' way to define this new type of pattern, especially across different scenarios. Let us start with a toy example that illustrates (i) the difference between an association pattern and a correlated pattern, and (ii) how a reasonable correlation measure can effectively mine correlated sequential patterns.

Example 1.2. Suppose we have a mini database made up of 16 word phrases extracted from bibliographical records (see Table 1). Some phrases therein are research topics, e.g., 'support vector machine' and 'machine learning', while others are combinations of popular terms, e.g., 'support machine', 'graph mining' and 'clustering algorithm'. All five patterns are association patterns because they appear together frequently, but only two are correlated, as they express specific, meaningful research topics, whereas the other three express either useless or overly broad meanings.

Although the answer to whether a sequential pattern is correlated is not an absolute 'Yes' or 'No', we at least expect it to match common knowledge, so that the phrase 'support vector machine' would be more correlated than 'support vector'. Thus, under an appropriate measure of correlation, a long pattern should be allowed to be more correlated than its sub-patterns. Based on this observation, we re-examine a number of interestingness measures and make careful selections for our mining task in later sections.

In this paper, we propose a novel algorithm for mining frequent correlated sequential patterns, based on extensions of the pattern-growth methodology [23]. Specifically, we make the following contributions:

1. Although there have been extensive studies on mining item pairs, correlated association rules, and recurrent sequence patterns [3, 16], to the best of our knowledge, this paper is the first work to formally propose and define the task of mining correlated sequential patterns.

2. We analyze the 'good' and 'bad' properties that a reasonable correlation measure should satisfy, and select measure(s) appropriate to our mining task.

3. We propose an efficient mining algorithm based on the pattern-growth framework [23], and demonstrate the outstanding performance of our algorithm, in terms of both efficiency and effectiveness, on two real datasets.

2. PRELIMINARIES
In this section, we formally define the problem of mining correlated sequential patterns (Section 2.1), and theoretically and empirically analyze certain properties under this definition, in order to select appropriate association measure(s) for our mining task (Section 2.2).

2.1 Problem Formulation
Let E be a set of distinct items. A sequence S is an ordered list of items, denoted as S = e1 e2 ··· e|S|, where ei ∈ E is an item. For convenience, we refer to the i-th item ei as S[i]. An input sequence database is a set of sequences, denoted as SDB = {S1, S2, ···, SN}.

Definition 2.1. (Subsequence and Super-sequence) For two sequences S = e1 e2 ··· e|S| and S′ = e′1 e′2 ··· e′|S′| (|S| ≤ |S′|), S is said to be a subsequence of S′, denoted by S ⊆ S′, if there exists a series of one-to-one mapping positions 1 ≤ p1 < p2 < ··· < p|S| ≤ |S′| such that S[i] = S′[pi] (i.e., ei = e′pi) for every i ∈ {1, 2, ···, |S|}. In particular, for |S| < |S′|, we call S a proper (strict) subsequence of S′, denoted by S ⊂ S′. We may also say that S′ is a super-sequence of S, or that S′ contains S.

A pattern is also a sequence in the context of a sequential database. For two patterns P and P′, if P is a subsequence of P′, then P is said to be a sub-pattern of P′, and P′ is a super-pattern of P.
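To make the subsequence test concrete, here is a small illustrative sketch (ours, not part of the paper; Python is used throughout for such sketches) that checks S ⊆ S′ by greedily matching each item against the leftmost available position. Greedy leftmost matching suffices: if any valid mapping exists, the leftmost one does.

```python
def is_subsequence(s, t):
    """Test S ⊆ S' (Definition 2.1) by a greedy left-to-right scan."""
    pos = 0  # next candidate position in t
    for item in s:
        # advance to the leftmost occurrence of item at or after pos
        while pos < len(t) and t[pos] != item:
            pos += 1
        if pos == len(t):      # item has no remaining match in t
            return False
        pos += 1               # consume the matched position
    return True

# 'support machine' is a subsequence of S1 = 'support vector machine':
assert is_subsequence(["support", "machine"],
                      ["support", "vector", "machine"])
```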
Definition 2.2. (Projected Database, Support, and Probability) For an input sequence database SDB and a sequence P, DB(P) is the set of sequences in SDB in which P appears as a subsequence. We define the support of P as Sup(P) = |DB(P)|, and therefore the probability of P as Pr(P) = Sup(P)/|SDB|. Any sequence in DB(P) is referred to as a supporting sequence of P, and DB(P) is called the projected database of SDB based on the sequence P.

Definition 2.3. (Cutting and Cutting Set) For a sequence S (|S| ≥ 2) and an ordered list of subsequences C = {c1, c2, ···, c|C|} (|C| ≥ 2), we call C a cutting of S if every ci ∈ C is non-empty (i.e., |ci| > 0) and the concatenation of c1, c2, ···, c|C| equals S. Specifically, C is a k-piece cutting if |C| = k. We denote the set of all k-piece cuttings of S as Ck(S), and furthermore C2..k(S) = C2(S) ∪ ··· ∪ Ck(S).
A number of interesting correlation measures defined on association patterns [1, 3] have been proposed and analyzed [29, 25], including χ2, lift, all-confidence, max-confidence, Kulczynski and cosine. Formally, a pattern P is said to be frequent if Sup(P) ≥ min_sup, and P is said to be correlated if its correlation score is no less than min_cor, where min_sup and min_cor are specified empirically by users. The task of mining frequent correlated sequential patterns is therefore to find the complete set of sequences in an input database which are both frequent and correlated.

2.2 Measure Selection
In this section, we analyze the 'good' and 'bad' properties a reasonable correlation measure should satisfy, and make careful selections for our mining task. Let us start by introducing the well-known Apriori property on support.

Lemma 1. (Monotonicity on Support) Given two patterns P and P′ in a sequence database SDB, if P′ is a super-pattern of P (i.e., P ⊆ P′), then Sup(P) ≥ Sup(P′).

Theorem 1. (Apriori Property on Support) If P is not frequent, none of its super-patterns P′ are frequent either; or, equivalently, if P is frequent, all of its sub-patterns are frequent too.

The Apriori property on support can be well utilized to improve mining efficiency: we can stop exploring the super-patterns of P as soon as we find that P itself is not frequent. However, an appropriate correlation measure may not satisfy the Apriori property on correlation score. Recall Example 1.2: the word phrase 'support vector machine' is more interesting than 'support machine', yet the former is a super-pattern of the latter. Indeed, it is often the case in real situations that a super-pattern is more correlated than its sub-pattern(s). We admit that the failure of the Apriori property on correlation brings computational challenges, but on the other hand it makes the mining results more effective and meaningful. Therefore, we re-examine various interestingness measures [2, 29, 14] in search of a measure suitable for mining correlated sequential patterns, and ultimately draw the motivation for our correlation measure from the null hypothesis in Ngram testing.
Tests of correlation between n random variables typically set up a null hypothesis that holds if the n random variables are independent. The n items in a correlated pattern that fail the test might then be considered to be related or dependent in some way, since they have failed to exhibit statistical independence. Formally speaking, for n items a1, a2, ···, an that make up a correlated pattern a1 a2 ··· an, we would expect the probability of these items appearing together to be significantly larger than the product of the probabilities of each item appearing separately. Moreover, we extend the concept of an individual item to any arbitrary sub-sequence of a correlated pattern, and propose a correlation measure on sequential patterns as:

CorSeq(P) = \frac{\Pr(P)}{\max_{C=\{c_i\}\in\mathcal{C}_{2..|P|}(P)} \prod_i \Pr(c_i)} \qquad (1)

In the above formula, we express the combination of information units by 'cutting' (see Definition 2.3). The general idea is: a correlated sequential pattern P is expected to fail the null hypothesis formed by any arbitrary cutting that makes up P, i.e., the probability of a correlated pattern should be significantly larger than the product of the probabilities of the sub-patterns appearing in any of its cuttings. Hence, we use the ratio between the probability of P and the maximum, over all possible cuttings of P, of the joint probability of its sub-patterns appearing separately to denote the correlation score of P.
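As a reference point, Equation (1) can be evaluated by brute force using the helpers sketched in Section 2.1 (ours, illustrative only; it recomputes pattern probabilities by full database scans, which is precisely the inefficiency that Section 3 sets out to avoid):

```python
from math import prod

def cor_seq(pattern, sdb, max_pieces=None):
    """CorSeq(P) of Equation (1); bounding the cutting size by k
    gives the k-CorSeq(P) of Equation (2) in Section 3.2."""
    k = len(pattern) if max_pieces is None else max_pieces
    best = max(
        prod(probability(piece, sdb) for piece in cutting)
        for cutting in cuttings(pattern, k)
    )
    return probability(pattern, sdb) / best
```

On the toy database of Table 1, this gives cor_seq(['support', 'vector', 'machine'], SDB) = 0.125 / (0.25 × 0.125) = 4.0, which, being no less than min_cor = 3.0, marks the pattern correlated, as in the last column of Table 3.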

Theorem 2. The correlation measure defined by Equation (1) does not satisfy the Apriori property on correlation score, i.e., if P is correlated, its sub-pattern P′ is not necessarily correlated too.

Proof. We prove Theorem 2 by giving a counter-example, stated in Section 3.

3. AN EFFICIENT MINING ALGORITHM
In this section, we first introduce the classical PrefixSpan [23] algorithm (Section 3.1), based on which a three-stage mining method is developed to extract correlated sequential patterns (Section 3.2).

3.1 The PrefixSpan Algorithm
PrefixSpan [23] belongs to a family of pattern-growth methodologies [7, 9, 30]. The major idea is that instead of projecting sequence databases by considering all possible occurrences of frequent subsequences, the projection is based only on frequent prefixes, as any frequent subsequence can always be found by growing a frequent prefix. Generally speaking, sequential patterns can be mined by a prefix-projection method in three steps: (i) find all frequent length-1 patterns; (ii) divide the search space into projected databases based on prefixes; and (iii) mine each projected database recursively, outputting sequential patterns with their prefixes added to the front.
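The following is a minimal in-memory caricature of this prefix-projection recursion (ours, for illustration; the real PrefixSpan [23] works on pseudo-projections of one shared copy of the database rather than materialized lists). Visiting candidate items in sorted order also yields the lexicographically ordered output that Theorem 3 relies on:

```python
def prefix_span(proj_db, prefix, min_sup, out):
    """Emit (pattern, support) for every frequent pattern extending prefix.

    proj_db holds, for each supporting sequence, the suffix remaining
    after the first occurrence of prefix has been consumed.
    """
    counts = {}
    for seq in proj_db:                 # (i) count each item once per sequence
        for item in set(seq):
            counts[item] = counts.get(item, 0) + 1
    for item in sorted(counts):         # sorted order => lexicographic output
        if counts[item] < min_sup:
            continue
        pattern = prefix + [item]
        out.append((pattern, counts[item]))
        # (ii) project on the new prefix: keep what follows item's
        # first occurrence in each supporting sequence
        projected = [seq[seq.index(item) + 1:] for seq in proj_db if item in seq]
        prefix_span(projected, pattern, min_sup, out)   # (iii) recurse
```

It would be called as prefix_span(SDB, [], min_sup, out), with SDB given as lists of items.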
However, there are challenges in computing the correlation score (Equation (1)) of a frequent pattern during the mining step of PrefixSpan, because we do not know the probabilities of its sub-patterns. The naïve solution would be to generate all frequent patterns first, create an in-memory index on these patterns, and re-examine all frequent patterns by calculating their correlation scores. However, such a solution is undesirable for several reasons:

1. First, to avoid missing useful patterns, the minimum support is usually set low, and the in-memory index of frequent patterns may exceed the available primary memory as a result. In fact, this very situation occurs in our experiments (Section 4).

2. Second, even if we are able to create an in-memory index for frequent patterns, the efficiency of the pattern mining algorithm will rely heavily on the efficiency of the hash function of the index (i.e., the cost of accessing one value), especially since the patterns themselves may have various formats and structures.

3. Third, when an algorithm is disk-based, it is easier to improve its efficiency by utilizing parallel computing [27].

3.2 The PSBSpan Mining Algorithm

Definition 3.1. (k-Piece Correlated Pattern and k-Piece Maximum Cutting Probability) A frequent pattern P is called a k-piece correlated pattern (k ≥ 2) if k-CorSeq(P) ≥ min_cor, where k-CorSeq(P) is derived from Equation (1) by putting a constraint on the cutting size:

k\text{-}CorSeq(P) = \frac{\Pr(P)}{\max_{C=\{c_i\}\in\mathcal{C}_{2..k}(P)} \prod_i \Pr(c_i)} \qquad (2)

For convenience, we call the denominator of Equation (2) the k-piece maximum cutting probability, written k-Mcp(P).

association 0.06 vector 0.13 machine 0.19 learning 0.06
spectral 0.25 clustering 0.25 algorithm 0.25 method 0.06
model 0.06 mining 0.25 classification 0.06 graph 0.25
EM 0.06 theory 0.06 pattern 0.25 novel 0.06
sequence 0.06 support 0.25 construction 0.06 evidence 0.06
Table 2: The Vocabulary E for Example 3.1
Non-Single-Item Pattern           Sup   PrefixSpan (Cor_pre)   SuffixSpan (Cor_post)   Binding
graph mining                      2     2.0                    2.0
graph pattern                     2     2.0                    2.0
graph pattern mining              2     4.0 ✓                  2.0
pattern mining                    4     4.0 ✓                  4.0 ✓                   ✓
spectral clustering               4     4.0 ✓                  4.0 ✓                   ✓
spectral clustering algorithm     2     2.0                    4.0 ✓
support machine                   2     2.7                    2.7
support vector                    2     4.0 ✓                  4.0 ✓                   ✓
support vector machine            2     5.3 ✓                  4.0 ✓                   ✓
vector machine                    2     5.3 ✓                  5.3 ✓                   ✓
Table 3: The Sequential Patterns Extracted at Different Stages
Definition 3.2. (Prefix and Suffix Upper-bound) For a sequence P = e1 e2 ··· en (n ≥ 2), {e1 e2 ··· e_{n-1}, e_n} and {e_1, e_2 e_3 ··· e_n} are two cuttings of P, based on which it is easy to prove that

CorSeq_{pre}(P) = \frac{\Pr(P)}{\Pr(e_1 e_2 \cdots e_{n-1})\,\Pr(e_n)} \qquad and \qquad CorSeq_{post}(P) = \frac{\Pr(P)}{\Pr(e_1)\,\Pr(e_2 e_3 \cdots e_n)}

are two upper bounds of CorSeq(P), called the prefix upper-bound and the suffix upper-bound, respectively. (They are upper bounds because the denominator of Equation (1) maximizes over all cuttings, including these two.)
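Both bounds are cheap to evaluate: each is a single ratio over quantities the mining passes already carry. A sketch under the earlier assumptions, with pr any callable returning Pr(·), e.g. lambda p: probability(p, sdb):

```python
def prefix_upper_bound(pattern, pr):
    """CorSeq_pre(P) = Pr(P) / (Pr(e1 ... e_{n-1}) * Pr(e_n))."""
    return pr(pattern) / (pr(pattern[:-1]) * pr(pattern[-1:]))

def suffix_upper_bound(pattern, pr):
    """CorSeq_post(P) = Pr(P) / (Pr(e_1) * Pr(e2 ... e_n))."""
    return pr(pattern) / (pr(pattern[:1]) * pr(pattern[1:]))
```

For 'support vector machine' on Table 1, these give 0.125 / (0.125 × 0.1875) ≈ 5.3 and 0.125 / (0.25 × 0.125) = 4.0, the two bound columns of Table 3.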
We now introduce a mining algorithm (called PSBSpan) that generates the complete set of 2-piece correlated patterns in three steps, namely the PrefixSpan step, the SuffixSpan step, and the Binding step, and then extends the results to correlated patterns with pieces of arbitrary size.

The PrefixSpan Step. This step is almost identical to the traditional PrefixSpan algorithm, with one additional calculation:

1. We first calculate the prefix upper-bound of each frequent pattern. A pattern is said to be a potentially correlated pattern if its prefix upper-bound is no less than the minimum correlation threshold min_cor. We output a frequent pattern only if (i) it is a potentially correlated pattern, or (ii) it is a prefix of a potentially correlated pattern. The intuition behind this is: (a) if a pattern is not a potentially correlated pattern, it is definitely not a correlated pattern, so it is unnecessary to consider it further; (b) the cost of pruning is low: by the time we reach a given frequent pattern, we have already accessed the probability of its immediate prefix, and the space needed to store the probability of each prefix of the current pattern, and of each single item, is minuscule.

2. For each pattern we select to output in the above step, we output not only its probability, but also the probabilities of all of its proper prefixes. We will show the utility of this additional information in the discussion of the Binding step.

Example 3.1. We illustrate the mining process of the PrefixSpan step (shown in Table 3) on a toy sequence database consisting of the abbreviated titles of 16 publications (listed in Table 1). The vocabulary, with word probabilities, is given in Table 2. The parameters are empirically set to min_sup = 2 and min_cor = 3.0. The frequent patterns are listed in the first column of Table 3, along with their supports; the prefix upper-bounds of these frequent patterns are calculated in the second column. We can see that 10 non-single-item patterns are generated during the mining procedure, only 6 of which (marked with ✓) have prefix upper-bounds no smaller than the minimum correlation threshold of 3.0.

Theorem 3. The output sequences of the PrefixSpan step are automatically sorted lexicographically.

Proof. Consider two sequential patterns S = e1 e2 ··· en and S′ = e′1 e′2 ··· e′m in the output with S < S′. W.l.o.g., suppose ei = e′i for i = 1, 2, ···, j and e_{j+1} < e′_{j+1}, i.e., S* = e1 e2 ··· ej is the longest common prefix of S and S′. The two sequences are projected into two different sub-databases DB(S* + e_{j+1}) and DB(S* + e′_{j+1}), respectively, at the function call PrefixSpan(DB(S*), S*, min_sup, min_cor). Since PrefixSpan generates all patterns in DB(S* + e_{j+1}) before the patterns in DB(S* + e′_{j+1}), S is output earlier than S′.

The SuffixSpan Step. This step is a mirrored version of the PrefixSpan step, with an additional sorting step at the end: (i) project databases based on suffixes, and generate patterns by concatenating suffixes onto the ends of patterns mined from the projected databases; (ii) calculate the suffix upper-bound of each pattern, and output a frequent pattern if it is a potentially correlated pattern or a suffix of a potentially correlated pattern, together with the probabilities of its suffixes; (iii) sort the output sequences in lexicographic order by (disk-based) sorting methods [15].

Example 3.2. Following Example 3.1, the suffix upper-bounds of the frequent patterns are listed in the third column of Table 3. 6 of the 10 frequent patterns (marked with ✓) have suffix upper-bounds no smaller than the minimum correlation threshold of 3.0, and therefore remain potentially correlated patterns.
The Binding Step. With the two sorted output lists generated by the two previous steps, we (i) find their overlapping set; (ii) perform a final verification for each pattern in the overlapping set, checking whether it is a 'true' 2-piece correlated pattern according to Equation (2); and (iii) output a frequent pattern if it is a 2-piece correlated pattern or a prefix of a 2-piece correlated pattern, along with the 2-piece maximum cutting probability of each of its prefixes.

The procedure is basically the merge step of merge sort (linear time) [15], and the verification is easy, since we already have the probability of every prefix or suffix of a potentially correlated pattern. For each 2-piece correlated pattern, we output its probability as well as the 2-piece maximum cutting probabilities of all of its proper prefixes. A minimal sketch of the merge follows.
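Since both runs are sorted (Theorem 3 and step (iii) of SuffixSpan), the overlap can be found with a single linear merge. An illustrative sketch (ours; a disk-based implementation would stream the two runs instead of holding them in lists, and patterns are tuples of items, which Python compares lexicographically):

```python
def binding(prefix_run, suffix_run):
    """Merge two lexicographically sorted runs of (pattern, payload)
    records; yield each pattern surviving both pruning passes."""
    i, j = 0, 0
    while i < len(prefix_run) and j < len(suffix_run):
        p, q = prefix_run[i][0], suffix_run[j][0]
        if p == q:
            # pattern passed both bounds: verify it against Equation (2)
            yield p, prefix_run[i][1], suffix_run[j][1]
            i += 1
            j += 1
        elif p < q:
            i += 1
        else:
            j += 1
```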

Figure 2: PSBSpan vs. PrefixSpan+. (a) Running time w.r.t. |SDB|; (b) running time w.r.t. the minimum support threshold min_sup; (c) running time w.r.t. the minimum correlation threshold min_cor. (Plots omitted; the y-axis of each panel is running time in seconds.)

Example 3.3. After 'binding' the results generated in Examples 3.1 and 3.2, the 'truly' correlated patterns are those marked with a ✓ in the last column of Table 3.

Extending to k-Piece Correlated Patterns. To find correlated patterns with pieces of arbitrary size, the only problem we need to solve is how to extend k-piece correlated patterns to (k+1)-piece ones. Let us start with two theorems:
sider therein 32, 224 papers published in prestigious confer-
Theorem 4. The 2-piece correlated patterns generated by PSBSpan are automatically sorted lexicographically.

Proof. Correlated patterns are generated at the Binding step only if they are contained in the result of the PrefixSpan step. Since the results generated by the PrefixSpan step are sorted lexicographically per Theorem 3, and the Binding step does not change the order among patterns from the PrefixSpan step, the final results of PSBSpan are also automatically sorted lexicographically.
4.1 Efficiency Evaluation
Theorem 5. For a k-piece correlated pattern P (|P| ≥ k), if we have the k-piece maximum cutting probability and the probability of each of its proper prefixes, as well as the probability of each of its proper suffixes, we can calculate the (k+1)-piece correlation score (Equation (2)) and the (k+1)-piece maximum cutting probability of P.

Proof. Write P = P′ + c′ for a split of P into a proper prefix P′ and the remaining proper suffix c′. Every cutting of P into 2..(k+1) pieces consists of a last piece c′ preceded either by P′ as a single piece (the 2-piece cuttings of P) or by a cutting of P′ into 2..k pieces. We therefore transform the formula as follows:

\begin{align*}
(k{+}1)\text{-}Mcp(P) &= \max_{C=\{c_i\}\in\mathcal{C}_{2..k+1}(P)} \prod_i \Pr(c_i) \\
&= \max_{P'+c'=P}\; \max\Big(\Pr(P'),\; \max_{C'=\{c'_i\}\in\mathcal{C}_{2..k}(P')} \prod_i \Pr(c'_i)\Big)\,\Pr(c') \\
&= \max_{P'+c'=P}\; \max\big(\Pr(P'),\; k\text{-}Mcp(P')\big)\,\Pr(c')
\end{align*}

Thus, given the k-piece maximum cutting probabilities (and probabilities) of the proper prefixes of P, together with the probabilities of its proper suffixes, it is easy to obtain the (k+1)-piece maximum cutting probability of P, hence its (k+1)-piece correlation score (Equation (2)), and therefore to judge whether P is a (k+1)-piece correlated pattern.
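A sketch of the resulting extension step (ours, illustrative; k_mcp maps each proper prefix, as a tuple, to its k-piece maximum cutting probability from the previous round, and pr returns the stored probability of a prefix or suffix; the pr(head) term folds in the 2-piece cutting at each split, which k-Mcp of the prefix alone does not cover):

```python
def next_mcp(pattern, k_mcp, pr):
    """(k+1)-Mcp(P): the largest cutting product over 2..k+1 pieces.

    Each cutting splits as P = P' + c' with c' its last piece; the rest
    is either P' whole (the 2-piece case) or a cutting counted in k-Mcp(P').
    """
    best = 0.0
    for i in range(1, len(pattern)):
        head, tail = tuple(pattern[:i]), tuple(pattern[i:])
        candidate = max(pr(head), k_mcp.get(head, 0.0)) * pr(tail)
        best = max(best, candidate)
    return best
```

A pattern P is then a (k+1)-piece correlated pattern iff Pr(P) / next_mcp(P, k_mcp, pr) ≥ min_cor.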
4. EXPERIMENTS
In this section, we conduct experiments on two real datasets to evaluate the performance of the PSBSpan algorithm. All algorithms were implemented in Java (Eclipse Helios), and the experiments were performed on a Windows 7 server with Intel Core2 Duo processors and 2GB of main memory.

DBLP Dataset. The Digital Bibliography and Library Project (www.informatik.uni-trier.de/~ley/db/) is a web-accessible database of bibliographic information on computer science publications. In this paper, we use a collection of DBLP articles [26] released by the ArnetMiner group of Tsinghua University, which contains 1,632,442 publications and 1,741,170 researchers. We consider therein 32,224 papers published in prestigious conferences (e.g., SIGKDD, SIGIR, SIGMOD, VLDB, SDM, etc.) in the areas of databases, data mining, and machine learning.

Flickr Dataset. Flickr (www.flickr.com) is an image and video hosting website. In this paper, we use a collection of 13,409,424 Flickr photos supplied by Kodak Inc. Each photo is associated with a publishing user, a title, geographical information, a set of tags, etc. We consider therein the 13.7% of the photos taken in 12 metropolitan cities famous among tourists.

4.1 Efficiency Evaluation
In this section, we evaluate the mining efficiency of our PSBSpan algorithm.

PrefixSpan+. The number of generated frequent patterns may be several million, and creating indexes on these frequent patterns exceeds the availability of our primary memory. To avoid using indexes, we implemented an alternative method, called PrefixSpan+, as a baseline against which to compare our PSBSpan algorithm. Generally, PrefixSpan+ is the same approach as the traditional PrefixSpan algorithm [23], but with additional correlation testing during the pattern-growth-based mining process. However, since computing the correlation score (Equation (1)) of a sequence depends on the probabilities of all of its sub-sequences, PrefixSpan+ has no choice but to go back to the original input database SDB and count the supports of these sub-sequences by scanning the whole database.

Measure          Top Ranked Patterns
support          object oriented database            distributed database system
                 database management system          relational database system
                 object oriented system              data management system
                 object database system              association rule mining
                 support vector machine              oriented database system
                 data base system                    time series data
                 object oriented database system     real time database
all-confidence   object oriented database            association rule mining
                 peer peer network                   object oriented system
                 nearest neighbor search             nearest neighbor query
                 self organizing map                 distributed database system
                 concurrency control database        database management system
                 relational database system          wireless sensor network
                 real time database                  mining association rule
lift             support vector machine              nonnegative matrix factorization
                 reverse nearest neighbor            conditional random field
                 named entity recognition            nearest neighbor moving
                 latent dirichlet allocation         object oriented database
                 nearest neighbor uncertain          singular value decomposition
                 privacy preserving publishing       association rule mining
                 continuous nearest neighbor         nearest neighbor search
cor              nonnegative matrix factorization    singular value decomposition
                 conditional random field            named entity recognition
                 aqualogic data service platform     latent dirichlet allocation
                 association rule mining             join algorithm multiprocessor
                 optimized rule numeric attribute    inductive logic programming
                 reverse nearest neighbor            wireless data broadcast
                 privacy preserving data publishing  message table content index
Table 4: Case Study on DBLP

In Figure 2, we show the running time (in seconds) of PSBSpan and PrefixSpan+ while varying the size of the input database (Figure 2(a)), the minimum support threshold (Figure 2(b)), and the minimum correlation threshold (Figure 2(c)). We can see that in all cases, PSBSpan significantly outperforms PrefixSpan+. In fact, when the size of the database is larger than 25K or the minimum support threshold is lower than 0.01%, the running time of PrefixSpan+ becomes intolerable.

4.2 Case Study
In this experiment, we perform case studies on two real datasets, DBLP and Flickr, to show the effectiveness of our method.

4.2.1 Study on DBLP
In Table 4, we list the top-ranked sequential patterns (of size larger than two) according to four measures: support (see Definition 2.2), all-confidence [29], lift [14], and cor (Equation (1)). We can see that the patterns with the highest support values are mostly random combinations of popular words: even though some phrases make sense as high-level concepts, e.g., 'database system', their useless duplicates may appear multiple times, such as 'oriented database system', 'data base system', and 'object database system'.

As discussed in Section 2.2, the measure all-confidence satisfies the Apriori property, i.e., a super-pattern can never have a higher all-confidence score than its sub-patterns, which is obviously unreasonable in real situations. For instance, 'object oriented database system' is a meaningful phrase, but its useless sub-pattern 'object oriented system' is ranked higher by all-confidence (see the second row of Table 4).

The traditional measure lift, defined on itemsets, and the cor measure we propose for sequential patterns (Equation (1)) share the same intuition: the probability of a correlated pattern should be significantly larger than the joint probability of the 'information units' that make up the correlated pattern. The main difference is that the only information units considered by lift are the single items appearing in the pattern, whereas the concept of an information unit in cor is extended to any sub-pattern of the correlated pattern, so that the cor measure considers a more complete set of decompositions than the lift measure. We refer to this extended concept as a 'cutting' (Definition 2.3), and define the cor measure using cuttings in Equation (1).

For the reasons brought up previously, 'nearest neighbor' is an interesting information unit, but 'nearest' and 'neighbor' separately are not. Using the lift measure, 'reverse nearest neighbor', 'nearest neighbor moving', 'nearest neighbor uncertain', and 'nearest neighbor search' all appear as highly ranked patterns (see the third row of Table 4), but only 'reverse nearest neighbor' is considered a truly correlated pattern by the cor measure (see the last row of Table 4).

4.2.2 Study on Flickr
We pick the Flickr photos taken at 12 popular metropolitan cities (as listed in [31]) and cluster them into 1,200 places of interest, whereby the photos uploaded by each user become a sequence of visited locations in the database. For three selected cities in three different countries - London, San Francisco, and Paris - we draw the top correlated pattern of each city on Google Maps in Figures 3(a), 3(b) and 3(c), and compare it with the city's top frequent pattern, shown in Figures 3(d), 3(e) and 3(f), respectively.

Figure 3: Case Study on Flickr. (a) Tour at London by PSBSpan; (b) Tour at San Francisco by PSBSpan; (c) Tour at Paris by PSBSpan; (d) Tour at London by PrefixSpan; (e) Tour at San Francisco by PrefixSpan; (f) Tour at Paris by PrefixSpan. (Map panels omitted.)

For all three cities, without exception, the top trail mined by PrefixSpan (i.e., the most frequent pattern) is a 'random' connection of two of the top three popular places of the city, e.g., the Louvre Museum with Notre Dame Cathedral, and Big Ben with the Millennium Bridge. For these three trails, the traveling distance varies from 1.7 to 5.1 miles, and the walking time ranges from 13 minutes to 1 hour 4 minutes, which does not seem to be a pleasant tour.

In contrast, the top trail mined in each city by PSBSpan (i.e., the most correlated frequent pattern in each city) appears highly consistent and localized in its geography, and reveals reasonable tourist interests. For example, Downtown, Union Square, the Transamerica Pyramid and Coit Tower form a sequence of locations leading from the bustling center of San Francisco to the beautiful shore of the Pacific Ocean.

5. RELATED WORK
PSBSpan is a novel algorithm for mining correlated sequential patterns in a sequential database. It inherits pattern growth and database projection from PrefixSpan [23] to accelerate mining, and utilizes the binding technique to improve the accessibility of information. To the best of our knowledge, this paper is the first work aimed at correlated sequential pattern mining. There are, however, several lines of related work.

Ngram Testing has been proposed in natural language processing and topic modeling to extract meaningful phrases through statistical tests on the co-occurrence of the words in a phrase. An Ngram is a sequence of N units, or tokens, of text, where those units are typically single characters or strings delimited by spaces [2, 32, 28]. Many methods have been proposed to determine whether an Ngram is a meaningful phrase [5, 2, 19] by testing the association or computing statistical measures, such as mutual information [5], among the units. Some other approaches rely on hypothesis-testing techniques, such as null hypothesis testing [2], where the authors design a null hypothesis that holds if the two random variables are independent of each other. Other tests, including the χ2 test and Student's t-test [19], have also been employed for hypothesis testing.

A number of different Correlation Measures have been discussed extensively in the pattern mining literature. (i) χ2 [6, 16] is a measure adopted in correlation relationship mining; its definition follows the standard definition in statistics. (ii) Lift [16] is also a correlation measure, computed from the supports of the itemsets. (iii) All-confidence is a measure that can disclose correlation relationships among data objects; it has the nice null-invariance property and the downward closure property. (iv) Coherence [16], (v) Kulczynski, (vi) max-confidence and (vii) cosine [29] are four other good measures useful in discovering correlation information. (viii) Bond [21] is an interesting correlation measure that offers information about the conjunctive support of a pattern as well as its disjunctive and negative supports. (ix) Pearson's correlation coefficient [24] is a nominal measure which can analyze the correlation relationships among nominal variables. Given the variety of measures proposed, Tan et al. [25] discuss how to select the right measure for a given application, and show that each measure is useful for some applications but not for others. Wu et al. [29] re-examine the null-invariant measures and present a generalization of these measures in one mathematical framework, which helps us understand and select the proper measure for different applications.

Substantial research efforts have also been devoted to mining Correlated Itemsets. Based on extensions of a pattern-growth methodology, efficient algorithms [16] have been proposed to mine the correlation relationships among patterns. Ke et al. [11] mine correlations from quantitative databases efficiently by utilizing normalized mutual information and all-confidence to perform two-level pruning, and show that mining correlations is more effective than mining associations. In the graph pattern mining literature, there is also some work on mining correlated and representative graph patterns [4, 13, 12].

6. CONCLUSION
To the best of our knowledge, this paper is the first study on mining frequent correlated sequential patterns from a sequential database. To formally define the problem, we analyze 'good' and 'bad' properties for the selection of correlation measures. We point out that forcing the correlation score to satisfy the Apriori property is harmful to the effectiveness of the mining results, and we use this analysis to carefully select appropriate measures for calculating a correlation score.

Moreover, we develop an efficient three-stage mining method, Prefix-Suffix-Binding Span (PSBSpan), based on an extension of the pattern-growth methodology. Experimental studies on real datasets reveal that our mining method is able to discover 'truly' succinct and interesting patterns while remaining efficient on large-scale datasets.
7. REFERENCES

[1] R. Agrawal, T. Imielinski, and A. N. Swami. Mining association rules between sets of items in large databases. In SIGMOD, 1993.
[2] S. Banerjee and T. Pedersen. The design, implementation, and use of the ngram statistics package. In CICLing, 2003.
[3] S. Brin, R. Motwani, and C. Silverstein. Beyond market baskets: Generalizing association rules to correlations. In SIGMOD, 1997.
[4] C. Chen, C. X. Lin, X. Yan, and J. Han. On effective presentation of graph patterns: a structural representative approach. In CIKM, 2008.
[5] K. W. Church and P. Hanks. Word association norms, mutual information, and lexicography. Comput. Linguist., 1990.
[6] G. Corder and D. Foreman. Nonparametric Statistics for Non-Statisticians: A Step-by-Step Approach. Wiley, 2009.
[7] J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 3rd edition, 2006.
[8] J. Han and J. Pei. Mining frequent patterns by pattern-growth: Methodology and implications. SIGKDD Explorations, 2000.
[9] J. Han, J. Pei, B. Mortazavi-Asl, Q. Chen, U. Dayal, and M. Hsu. FreeSpan: frequent pattern-projected sequential pattern mining. In KDD, 2000.
[10] M. A. Hasan, V. Chaoji, S. Salem, J. Besson, and M. J. Zaki. Origami: Mining representative orthogonal graph patterns. In ICDM, 2007.
[11] Y. Ke, J. Cheng, and W. Ng. Mining quantitative correlated patterns using an information-theoretic approach. In KDD, 2006.
[12] Y. Ke, J. Cheng, and J. X. Yu. Efficient discovery of frequent correlated subgraph pairs. In ICDM, 2009.
[13] Y. Ke, J. Cheng, and J. X. Yu. Top-k correlative graph mining. In SDM, 2009.
[14] S. Kim, M. Barsky, and J. Han. Efficient mining of top correlated patterns based on null-invariant measures. In PKDD, 2011.
[15] D. E. Knuth. The Art of Computer Programming: Sorting and Searching. Addison-Wesley, 1968.
[16] Y.-K. Lee, W.-Y. Kim, Y. D. Cai, and J. Han. CoMine: Efficient mining of correlated patterns. In ICDM, 2003.
[17] C. X. Lin, Q. Mei, J. Han, Y. Jiang, and M. Danilevsky. The joint inference of topic diffusion and evolution in social communities. In ICDM, 2011.
[18] C. X. Lin, B. Zhao, Q. Mei, and J. Han. PET: a statistical model for popular events tracking in social communities. In KDD, 2010.
[19] C. D. Manning and H. Schütze. Foundations of Statistical Natural Language Processing. MIT Press, 1999.
[20] Q. Mei, X. Shen, and C. Zhai. Automatic labeling of multinomial topic models. In KDD, 2007.
[21] E. Omiecinski. Alternative interest measures for mining associations in databases. IEEE Trans. Knowl. Data Eng., 2003.
[22] J. Pei and J. Han. Constrained frequent pattern mining: a pattern-growth view. In KDD, 2002.
[23] J. Pei, J. Han, B. Mortazavi-Asl, J. Wang, H. Pinto, Q. Chen, U. Dayal, and M.-C. Hsu. Mining sequential patterns by pattern-growth: The PrefixSpan approach. IEEE Trans. Knowledge and Data Engineering, 2004.
[24] J. L. Rodgers and W. A. Nicewander. Thirteen ways to look at the correlation coefficient. The American Statistician, 1988.
[25] P.-N. Tan, V. Kumar, and J. Srivastava. Selecting the right objective measure for association analysis. In KDD, 2002.
[26] J. Tang, J. Zhang, L. Yao, J. Li, L. Zhang, and Z. Su. ArnetMiner: extraction and mining of academic social networks. In KDD, 2008.
[27] C. Wang, W. Wang, J. Pei, Y. Zhu, and B. Shi. Scalable mining of large disk-based graph databases. In KDD, 2004.
[28] G. I. Webb. Self-sufficient itemsets: An approach to screening potentially interesting associations between items. TKDD, 4(1), 2010.
[29] T. Wu, Y. Chen, and J. Han. Re-examination of interestingness measures in pattern mining: A unified framework. Data Mining and Knowledge Discovery, 2010.
[30] X. Yan and J. Han. gSpan: Graph-based substructure pattern mining. In ICDM, 2002.
[31] Z. Yin, L. Cao, J. Han, J. Luo, and T. S. Huang. Diversified trajectory pattern ranking in geo-tagged social media. In SDM, 2011.
[32] J. Zhang, B. Jiang, M. Li, J. Tromp, X. Zhang, and M. Q. Zhang. Computing exact p-values for DNA motifs. Bioinformatics, 23(5):531-537, 2007.
[33] S. Zhang, J. Yang, and S. Li. RING: An integrated method for frequent representative subgraph mining. In ICDM, 2009.
