
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 21, NO. 11, NOVEMBER 2009 1617

The Cyclic Model Analysis on Sequential Patterns
Ding-An Chiang, Cheng-Tzu Wang, Shao-Ping Chen, and Chun-Chi Chen

D.A. Chiang, S.P. Chen, and C.C. Chen are with the Department of Computer Science and Information Engineering, Tamkang University, Tamsui, Taipei County, Taiwan 25137, ROC. E-mail: chiang@cs.tku.edu.tw, {694190033, 894190049}@s94.tku.edu.tw.
C.-T. Wang is with the Department of Computer Science, National Taipei University of Education, Taipei 106, Taiwan, ROC. E-mail: ctwang@tea.ntue.edu.tw.
Manuscript received 15 Dec. 2006; revised 21 Dec. 2007; accepted 6 Jan. 2009; published online 16 Jan. 2009.
Recommended for acceptance by W. Wang.
For information on obtaining reprints of this article, please send e-mail to: tkde@computer.org, and reference IEEECS Log Number TKDE-0566-1206.
Digital Object Identifier no. 10.1109/TKDE.2009.36.
1041-4347/09/$25.00 © 2009 IEEE. Published by the IEEE Computer Society.

Abstract: Sequential pattern mining has long been used to predict various aspects of customer buying behavior. A discovered sequence reveals the chronological relation between items and provides valuable information for developing marketing strategies. Nevertheless, it is hard to tell whether the buying is cyclic and how long the interval between two consecutive items in the sequential pattern is. To solve this problem, this paper combines data mining techniques and the fundamentals of statistics to develop a set of algorithms that unearth the cyclic properties of discovered sequential patterns. The algorithms, coupled with the sequential pattern mining process, constitute a thorough scheme to analyze and predict likely consumer behavior. The proposed algorithms are implemented and tested against real data collected from a consumer goods company. The experimental results illustrate how the model can be used to predict likely purchases within a certain time frame. Consequently, marketing professionals can execute campaigns to favorably impact customers' behaviors.

Index Terms: Association rules, data mining, frequency, sequential pattern, polynomial regression.

1 INTRODUCTION

Discovering sequential relationships present in transaction data is important to many application domains. It is particularly useful in the analysis of customers, where certain buying patterns can be identified, such as likely follow-up purchases. When the data can be interpreted from a temporal or sequential perspective, the task of sequential pattern mining is to identify the sequences whose statistical significance in the database is above the user-specified threshold.

Sequential pattern mining is applicable in a wide range of applications. For example, the sequential patterns discovered from supermarket transactions can provide insight for developing marketing and product strategies, the patterns mined from Web usage logs can suggest a better way to arrange the Web site, and the alarm patterns occurring in telecom networks are very useful for alarm prediction and alarm control.

The problem of discovering sequences was first introduced by Agrawal and Srikant [2]. Many algorithms were developed afterward and successfully improved the efficiency of mining sequential patterns. A great diversity of algorithms for sequential pattern mining exists. The most basic and earliest algorithms are based upon the Apriori algorithm [1]. The core of the Apriori property is that any subpattern of a frequent pattern must be frequent. Based on this heuristic, a series of Apriori-like algorithms such as AprioriAll, AprioriSome, DynamicSome [3], and GSP [17] was developed. Later on, different kinds of algorithms were proposed by different researchers. For example, FreeSpan [8] and PrefixSpan [9] were developed using the data projection approach. SPADE [19] is a lattice-based algorithm, MEMISP [11] is a memory-indexing-based approach, while SPIRIT [6] integrates constraints by using regular expressions. The above algorithms focus on the chronological order only.

Some have tried to extend the mining of sequential patterns to periodical patterns. Periodicity analysis attempts to analyze the data to identify patterns that repeat or recur in a time series. In general, full periodicity indicates the situation where all data points contribute to the behavior of the series, whereas partial periodicity means that only certain points contribute to the behavior of the series. Cyclical periodicity relates to a set of events that occurs periodically.

Han et al. [7] proposed two algorithms for mining partial periodic patterns: single period and multiple periods. Yang et al. [18] proposed distance-based pruning to find periodic patterns that may contain a disturbance of length up to a certain threshold. The mining of frequent partial periodic sequential patterns in a time series is to find such patterns, possibly with some restriction or disturbance.

Ozden et al. [14] proposed the sequential algorithm and the interleaved algorithm to determine cyclic association rules. Association rules capture interrelationships between various items. Cycle Pruning, Cycle Skipping, and Cycle Elimination aim to identify the association rules that have the minimum confidence and support occurring at regular intervals. Chiang et al. [5] and Chen et al. [4] proposed algorithms to determine the intervals of recurring patterns.

The approaches presented above primarily target the order of items purchased, or cyclic patterns occurring within a time window defined by users. However, by these works,
the cycle and interval of items purchased are hardly known. Thus, the best time to recommend the right products to the right person is hardly known either. Actually, periodical patterns are common in daily life. A time-interval sequential pattern provides more information than a conventional sequential pattern does, and discovering the time interval between successive item sets is the first step toward more accurate customer analysis. Therefore, in this paper, we develop a set of algorithms to analyze the periodical properties of time intervals over sequential patterns.

Data mining techniques and the fundamentals of statistics are combined to introduce an algorithm, Cyclic Model Analysis (CMA), that finds the model of recurring purchasing. The modeling process commences with the discovery of sequential patterns from the transactional database. Then the existence of periodicity is identified and the interval between successive events is computed by Generalized Periodicity Detection (GPD) and Trend Modeling (TM), which will be explained in more detail later. Next, the CMA algorithm is used to obtain the period and trends of purchase quantities. Consequently, marketing people can recommend the right products to the right customers at the right time.

This section provides a review of prior work related to sequential pattern mining. In addition, the motivation and research objectives of this paper are explained in this section. The remainder of this paper is organized as follows: The mathematical models that portray the sequential buying behavior are constructed in Section 2. The proposed algorithms are presented in Section 3. The experimental results are shown in Section 4. A short conclusion and a discussion of future directions are given in Section 5.

2 PROBLEM STATEMENT

An item set i, denoted by $(x_1, x_2, \ldots, x_t)$, is a nonempty set of items. A sequence S, denoted by $\langle i_1, i_2, \ldots, i_q \rangle$, is an ordered list of item sets. The size of a sequence S, written as $|S|$, is the number of elements in S. A sequence is a k-sequence if $|S| = k$. For example, the sequence $\langle a, b, c, d \rangle$ is a 4-sequence.

A sequence $\langle a_1, a_2, \ldots, a_n \rangle$ is a subsequence of another sequence $\langle b_1, b_2, \ldots, b_m \rangle$ if there exist $1 \le i_1 < i_2 < \cdots < i_n \le m$ such that $a_1 \subseteq b_{i_1}, a_2 \subseteq b_{i_2}, \ldots$, and $a_n \subseteq b_{i_n}$. We also say that the sequence $\langle a_1, a_2, \ldots, a_n \rangle$ is contained in the sequence $\langle b_1, b_2, \ldots, b_m \rangle$. For example, the sequence $\langle a, b \rangle$ is a subsequence of $\langle (a, c), (b, d) \rangle$ since $a \subseteq (a, c)$ and $b \subseteq (b, d)$. On the other hand, the sequence $\langle (a, c), b \rangle$ is not contained in $\langle (a, c, b) \rangle$, and vice versa.

Given a database D of customer transactions, each transaction is characterized by the fields customer-id, time stamp, and items purchased. More precisely, each transaction is a set of item sets and each sequence is a list of transactions ordered by transaction time. Usually, the list of all the transactions of a customer is called the customer sequence.

A customer supports a sequence s if s is contained in the corresponding customer sequence. The support for a sequence s is defined as the fraction of customers whose customer sequences contain s. The definition of support for a sequence s can be written as follows:

$$\mathrm{Support}(S) = \frac{\text{Number of customers supporting the sequence}}{\text{Total number of customers}}.$$

A sequence is maximal if it is not contained in any other sequence. Given a database D of customer transactions, sequential pattern mining is the process of finding the maximal sequences among all sequences that have a certain user-specified minimum support. Each such maximal sequence represents a sequential pattern. The user-specified minimum support threshold (denoted by minsup) expresses the statistical significance of a sequence in the database.

TABLE 1. The Transactions by Four Customers over One Month

Table 1 gives a simple example, which contains four customers and their activities over one month. Given the threshold minsup = 0.5, three frequent sequences $\langle A, F \rangle$, $\langle F, H \rangle$, and $\langle D, E \rangle$ are found. The support of $\langle A, F \rangle$ is 3/4 = 0.75. The support of $\langle F, H \rangle$ and $\langle D, E \rangle$ is 2/4 = 0.5. Hence, there are three sequential patterns in the example database.

A sequential pattern indicates the correlation between transactions. The sequence mined from the transaction database represents the order of purchases by the same customer, where the items come from different transactions. A typical example of such a sequential pattern is a customer who buys a personal computer, then a laser printer. As discussed in the previous section, researchers have developed many algorithms to address the problem of efficiently discovering sequences. However, prior works seldom address our major concerns: tendency and periodicity. Whether the next purchase will happen, or how long the purchase behavior will last, is hard to tell. A tool to capture the characteristics of discovered sequences is needed.

To simplify the discussion, the case of a 2-sequence $\langle i_1, i_2 \rangle$, where $i_1, i_2$ are item sets, is considered. An item set is a collection of items, so the case can be extended to more complicated situations. Given a 2-sequence $\langle i_1, i_2 \rangle$ mined from transactions, the Trend Distribution Function (TDF) of the 2-sequence is stated in Definition 1.

Definition 1. The sequential pattern $s = \langle i_1, i_2 \rangle$ is a 2-sequence mined from the transaction database over the designated time frame $T = [t_1, t_n]$. The Trend Distribution Function of the given sequence s, denoted by $f(x_j)$, is a nonnegative function defined on $[0, t_n - t_1]$. A sequence s is said to be an $x_j$-interval-sequence if the interval between $i_1$ and $i_2$ is $x_j$. The value of $f(x_j)$ is the total number of occurrences of $x_j$-interval-sequences in the transaction database D.

The pseudocode for computing the value of the trend distribution function is presented in Fig. 1.
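The containment test and the support ratio defined in this section can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function names `is_contained` and `support` are hypothetical, and the four customer sequences are invented for the example because the body of Table 1 is not available here. They are merely chosen so that the resulting supports agree with the values quoted in the text.

```python
from typing import FrozenSet, List, Sequence

ItemSet = FrozenSet[str]


def is_contained(pattern: Sequence[ItemSet], customer_seq: Sequence[ItemSet]) -> bool:
    """True if `pattern` is a subsequence of `customer_seq`.

    Each pattern element must be a subset of a distinct, later element of
    the customer sequence; greedy left-to-right matching is sufficient.
    """
    pos = 0
    for element in pattern:
        # Advance to the first remaining transaction that contains `element`.
        while pos < len(customer_seq) and not element <= customer_seq[pos]:
            pos += 1
        if pos == len(customer_seq):
            return False
        pos += 1  # the next pattern element must match a later transaction
    return True


def support(pattern: Sequence[ItemSet], customer_seqs: List[Sequence[ItemSet]]) -> float:
    """Fraction of customers whose sequence contains `pattern`."""
    hits = sum(1 for seq in customer_seqs if is_contained(pattern, seq))
    return hits / len(customer_seqs)


def s(*items: str) -> ItemSet:
    """Small helper to build an item set."""
    return frozenset(items)


# Hypothetical customer sequences (NOT the contents of Table 1).
customers = [
    [s("A"), s("F"), s("H")],
    [s("A"), s("D"), s("F"), s("E")],
    [s("A", "B"), s("F"), s("H")],
    [s("D"), s("E")],
]

sup_af = support([s("A"), s("F")], customers)
sup_fh = support([s("F"), s("H")], customers)
sup_de = support([s("D"), s("E")], customers)
```

With minsup = 0.5, all three patterns in this invented data set would be retained, which mirrors the situation described for Table 1.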
Fig. 1. Pseudocode for computing the value of trend distribution function for sequence s.
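A minimal Python sketch of the trend distribution function computation is given below. It is an interpretation rather than a transcription of Fig. 1: the rule that each occurrence of the second item is paired with the most recent preceding occurrence of the first item is inferred from the example cases enumerated in this section, and the name `trend_distribution` and the `(item, time)` event representation are assumptions made for illustration.

```python
from collections import Counter
from typing import Hashable, Iterable, Tuple


def trend_distribution(
    events: Iterable[Tuple[Hashable, float]],
    first: Hashable,
    second: Hashable,
) -> Counter:
    """Count interval occurrences for the 2-sequence <first, second>.

    `events` is one customer's purchases as (item, time) pairs in time order.
    Assumption: every `second` purchase is paired with the latest earlier
    `first` purchase, and f(interval) is incremented for that pair.
    """
    f: Counter = Counter()
    last_first = None
    for item, t in events:
        if item == first:
            last_first = t
        elif item == second and last_first is not None:
            f[t - last_first] += 1
    return f


# "More complicated" case with t1..t8 taken as 1..8 (illustrative times).
complicated = [("a", 1), ("b", 2), ("a", 3), ("a", 4),
               ("b", 5), ("a", 6), ("b", 7), ("b", 8)]
f_complicated = trend_distribution(complicated, "a", "b")

# "Repeated appearance" case: one a followed by several b purchases.
repeated = [("a", 0), ("b", 3), ("b", 5), ("b", 9)]
f_repeated = trend_distribution(repeated, "a", "b")
```

To obtain the function over a whole database, the per-customer counters can simply be added together, since `Counter` objects support `+` and `update`.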

For a 2-sequence $\langle a, b \rangle$, many complicated distributions exist. The examples below illustrate how the TDF is computed. For the sake of brevity, the postfix increment operator ++ means adding 1 to the value of the expression. Therefore, $f(d_n)\texttt{++}$ means that the value of the function at $d_n$ is incremented by one. Considering the example 2-sequence $\langle a, b \rangle$, the different types of appearances of distribution are as follows:

1. Simple: $\langle (a, t_1), (b, t_2) \rangle$
   $\Rightarrow f(t_2 - t_1)\texttt{++}$.

2. Repeated appearance: $\langle (a, t_1), (b, t_2), (b, t_3), (b, t_4), \ldots, (b, t_n) \rangle$
   $\Rightarrow f(t_2 - t_1)\texttt{++}, f(t_3 - t_1)\texttt{++}, f(t_4 - t_1)\texttt{++}, \ldots, f(t_n - t_1)\texttt{++}$.

3. Multiple appearance: $\langle (a, t_1), (b, t_2), (a, t_3), (b, t_4), (a, t_5), (b, t_6) \rangle$
   $\Rightarrow f(t_2 - t_1)\texttt{++}, f(t_4 - t_3)\texttt{++}, f(t_6 - t_5)\texttt{++}$.

4. More complicated: $\langle (a, t_1), (b, t_2), (a, t_3), (a, t_4), (b, t_5), (a, t_6), (b, t_7), (b, t_8) \rangle$
   $\Rightarrow f(t_2 - t_1)\texttt{++}, f(t_5 - t_4)\texttt{++}, f(t_7 - t_6)\texttt{++}, f(t_8 - t_6)\texttt{++}$.

The distribution function defined in Definition 1 is a time-series representation of the transactions associated with the discovered sequence. The traditional sequential formulation reveals the chronological order of purchases only. Therefore, the distribution function is constructed as a model to portray the sequential purchasing phenomenon. Take the sequence $\langle i_1, i_2 \rangle$ as an example. The distribution function is characterized by the interval between the purchases of the consecutive items. The graph of the function is the set of all points (x, f(x)) in the xy-plane such that x is in the domain of f and y = f(x). If the plot of the function is sketched, the movement along the curve shows the tendency of the purchase.

Since the function is a portrait of the real world, the graph of the function will never be a simple straight line. What is more, it is unlikely to be a periodic function with the standard sine (or cosine) wave graph either. The curve goes along the x-axis with the inclination moving upward or downward with slight vibration. To facilitate the study of the tendency of the function, the slope of the regression line of the function within a designated domain is used to characterize the inclination of the function.

For any given subset of the domain of the function, a simple linear regression is used to construct the regression line of the function within the subdomain. If the slope of the obtained straight line is positive, the purchase of item-$i_2$ increases at a certain rate. If the slope is negative, the sales volume of item-$i_2$ decreases at a certain rate.

The distributions of interest can be categorized into three types. The first type is the simply ascending type. The second is the plainly descending distribution. The third type is the most common one in real-world scenarios: first, consumers are more likely to buy item-$i_2$ after the initial purchase of item-$i_1$; then, the sales volume of item-$i_2$ decreases after a certain point is reached.

Fig. 2. The illustrations of the three types of distributions.

Fig. 2 is the plot of the distribution functions obtained from a real-world database. The upper half of Fig. 2 illustrates the third distribution type; however, the function is of type one on the domain [0, 90]. The follow-up sales of item-$i_2$ go up for a span of time after item-$i_1$ is purchased; then, the sales volume decreases after the ascending duration. The lower half of Fig. 2 shows another common type, in which the purchase of item-$i_2$ decreases at a moderate rate.

After the mathematical model is constructed, the characteristics of the model and the relations between the model and the real world will be explained. Periodicity and degeneration will be the major concerns. The study of the characteristics of the model helps to uncover previously hidden facts underlying patterns, such as "How long will the repeated behavior last?" and "When will the next purchase happen?"
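The slope-based characterization of tendency described in this section can be sketched as follows. This is a minimal illustration under stated assumptions: the helper names `regression_line` and `tendency` are hypothetical, ordinary least squares is computed directly from its textbook formula, and the sample input is function (1) of this paper evaluated at $x_i = i \cdot (18/100)$ for $1 \le i \le 100$.

```python
import math
from typing import Sequence, Tuple


def regression_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit y = a*x + b; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


def tendency(xs: Sequence[float], ys: Sequence[float]) -> str:
    """Classify a (sub)domain by the sign of the regression slope."""
    a, _ = regression_line(xs, ys)
    if a > 0:
        return "increasing"
    if a < 0:
        return "decreasing"
    return "flat"


# Function (1): f(x) = (24 - x) / 18 * (sin(x) + 1.5), sampled on (0, 18].
xs = [i * 18 / 100 for i in range(1, 101)]
ys = [(24 - x) / 18 * (math.sin(x) + 1.5) for x in xs]
a, b = regression_line(xs, ys)   # the slope a comes out negative
label = tendency(xs, ys)
```

The same two helpers can be applied to any subdomain by slicing `xs` and `ys`, which is how the upper curve of Fig. 2 can be ascending on one part of its domain and descending on another.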
If the function repeats its value after a certain period, the function is called a periodic function. Periodicity implies a cyclic sales volume. In other words, the time at which customers are likely to buy item-$i_2$ can be predicted. Therefore, the definition of periodicity of the function is vital to the study. The following definition considers the formulation of a simple periodic function.

Definition 2. Let f(x) be a trend distribution function of sequence s defined on $X = [x_1, x_n]$. If, for each x in X, $f(x) = f(x + \lambda)$ and $\lambda \le (x_n - x_1)/2$, then f(x) is said to be a periodical trend distribution function of the sequence s with period $\lambda$.

In daily life, the purchasing amount varies over time; it is hardly a constant. Consider curves with a downward tendency such as

$$f(x) = \frac{(24 - x)}{18}\,(\sin(x) + 1.5). \qquad (1)$$

Fig. 3. The plot of the function (1).

The function shown in Fig. 3 is not a strictly (monotone) decreasing function, but the movement along the curve goes downward steadily. As mentioned before in this section, the inclination of the function within the designated domain can be characterized by the slope of the regression line within the domain. Given any subset of the whole domain, the function is called a linearly increasing distribution function if the slope of the regression line is positive. The function is a linearly decreasing function if the slope of the regression line is negative. Accordingly, the following is the definition of a linearly periodical distribution function.

Definition 3. Let f(x) be a trend distribution function of sequence s defined on the domain $X = [x_1, x_n]$. The straight line $y = ax + b$ is the trend line constructed by linear regression. If, for each x in X, $f(x) \cong f(x + \lambda) + ax$ with $\lambda \le \frac{1}{2}(x_n - x_1)$, then f(x) is said to be a linearly increasing periodic trend distribution function of the sequence with period $\lambda$ on the domain X. The function f(x) is said to be a linearly decreasing periodical trend distribution function if $f(x) \cong f(x + \lambda) - ax$ with $\lambda \le (x_n - x_1)/2$.

Fig. 4 is a sample of a typical linearly decreasing function. The graph of the function goes downward along the x-axis at a certain rate, and the curve repeats its shape after a period of 63. That is, the function decreases steadily with period $\lambda = 63$. Then the function reaches zero at a certain point, that is, x = 300.

Fig. 4. The plot of a linearly decreasing function.

As mentioned, periodicity is not the only interest. The degeneration phenomenon is another major concern. Since the distribution function is a nonnegative function defined on the domain, the value of the function will be greater than or equal to zero. Hence, the point at which the function reaches zero is important for the study of the model as well. If f(x) is a linearly decreasing distribution function and reaches zero within the domain, the point x at which it does so is called the degeneration point of the function. Real-world experience shows that periodical purchasing will not last indefinitely, for many reasons. The degeneration point signifies that the customers tend to stop purchasing. The business owner must be alerted before the degeneration point is reached, so that marketers can take actions to favorably impact the customers.

In the next section, data mining techniques and some mathematical tools will be combined to formulate algorithms that constitute an automated and attainable analysis procedure for both engineers and marketing professionals.

3 ALGORITHMS

In this section, a set of algorithms designed to deal with the distribution functions obtained from the transaction databases is presented. The procedures proposed here are a synthesis of data mining techniques and mathematical tools. More specifically, the aim of this research is to devise a scheme to analyze the trend underlying the patterns. The scheme is to be integrated with traditional sequential pattern mining to offer a comprehensive analysis procedure, which can more easily be adopted by marketers. The scheme presented in this paper takes a two-phase approach to cope with all periodicity-related problems that occur in the analysis of sequential patterns mined from transactions.

The core theme of the research is "Simple is Beauty." It is well known that a host of algorithms have been developed for efficient mining of sequential patterns. To solve the periodicity problem, a mathematical model is constructed to portray the sequential pattern mined from the database. The structure proposed to describe the nature of the pattern can reveal not only the periodicity but also the tendency of the occurrence of purchasing actions. Then the mathematical tool is used to determine whether periodicity exists. If periodicity exists, a procedure is proposed to analyze the likely consumer behavior.

The scheme comprises the sequential pattern mining technique and the algorithms presented in this section. Given the result of sequential pattern mining, the primary concern is to know whether there are regularities that can be found. Thus, the value of the trend distribution function is computed and then the GPD is introduced to detect the periodicity of the function. If the periodicity can be
identified, the analysts can have a better understanding of the patterns and decide on the next appropriate action.

Next, the TM procedure is developed to find an equation that approximates the trend distribution function obtained from the sequential pattern mined from databases. Thus, the nature of the phenomenon represented by the sequence can be identified and more or less described by the mathematical tool introduced here.

The third procedure presented in this paper is CMA. The CMA algorithm is used to depict the tendency and periodicity of the buying activities within the investigated time frame. Coupled with industry domain knowledge and marketers' expertise, the constructed model helps to predict likely buying behavior.

3.1 Generalized Periodicity Detection

In a real-world scenario, purchasing behavior is extremely dynamic. Purchases do not occur in a strictly regular way. However, marketers and analysts are eager to identify the nature of the phenomenon. Periodicity detection is one of the primary areas of interest. Therefore, a systematic way has to be found to determine whether the function repeats itself at a regular period and what the value of the period is. As mentioned in Section 1, simple periodicity is rarely found in the real world. Thus, the conceptual definition of periodicity has been generalized. Then an empirical measure is proposed to determine the generalized periodicity of the function. The procedure GPD determines whether the function has periodicity and computes the value of the periods if periodicity exists.

Three parameters need to be determined beforehand. Given a trend distribution function f defined on $X = \{x_1, x_2, \ldots, x_n\}$, the parameter n stands for the number of elements in X. The parameter min_period is the smallest possible value of the period. The possibility of existence of periodicity is calculated from the designated minimum possible value to the maximum possible value. The value of the period usually cannot exceed half of the investigated time frame (n/2). The next parameter is max_error, which is the error threshold used to judge whether the investigated value is subject to the generalized periodicity. Fig. 5 is the pseudocode of the procedure GPD; the output of GPD is empty if periodicity does not exist.

Fig. 5. The GPD procedure determines the possible values of period.

Since the given trend distribution function will not be a straight line or a monotone increasing (or decreasing) function, the measure $\sum |f(x_{i+\lambda}) - f(x_i)|$ is not sufficient to judge whether the function has the period $\lambda$. Hence, a more general measure to complete the task is proposed. The first step is to find the simple linear model $y = ax + b$ of the given function f. Then the function g(x) is defined as $f(x)/(ax + b)$. The function g(x) is used to construct the measure

$$\mathrm{error}(\lambda) = \frac{\sum_{i=1}^{n} |g(x_{i+\lambda}) - g(x_i)|}{\sum_{i=1}^{n} |g(x_i)|}. \qquad (2)$$

Equation (2) gives the error value that is compared against the threshold to determine whether the examined value is a prospect of period. For each prospect value of the period, the value of the measure is computed. If $\mathrm{error}(x_i)$ is less than the threshold max_error predefined by users, the value $x_i$ is a period prospect. Fig. 6 shows the result of applying GPD to the input (1) mentioned before in Section 2.

Fig. 6. The first step of the GPD procedure is to find the regression line $y = ax + b$. Then the iterative computation of the error measure suggests that 6.39 is the most likely period of the function.

Let f(x) be defined on X = [0, 18], and divide the domain into 100 partitions. Thus, we have $x_i = i \cdot (18/100)$ for $1 \le i \le 100$. The value of the function is computed at each $x_i$.

Set the input parameters min_period = 0.5 and max_error = 0.2. The first step is to find the values of a and b to determine the linear model $y = ax + b$. Next, the value of the error indicator $\mathrm{error}(x_i)$ is computed for each prospect of period $x_i$. Then, we have $a = -0.112$, $b = 2.2326$, and $\lambda = 6.39$.

Fig. 6 illustrates the result of the examination. Fig. 6a shows the graph of the function and the regression line $y = -0.112x + 2.2326$ found by step 1 of the procedure. Fig. 6b shows the vertical bar at the discovered period for the given function.
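The GPD steps described in this subsection can be sketched in Python as follows. This is one plausible reading of the text and of measure (2), not a transcription of the pseudocode in Fig. 5. Several choices are assumptions: the samples are taken to be evenly spaced, candidate periods are whole numbers of sample steps, terms whose shifted index falls outside the domain are dropped from the numerator, and the names `fit_line` and `gpd` are hypothetical. The demonstration input is a synthetic series with a known period of 20 samples, used only so that the behavior is easy to check.

```python
import math
from typing import List, Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit y = a*x + b; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, mean_y - a * mean_x


def gpd(
    xs: Sequence[float],
    ys: Sequence[float],
    min_period: float,
    max_error: float,
) -> List[Tuple[float, float]]:
    """Return (period, error) pairs whose error is below max_error.

    Assumes evenly spaced xs and a trend line that does not cross zero
    on the sampled domain (otherwise g = f / (a*x + b) is undefined).
    """
    n = len(xs)
    step = xs[1] - xs[0]
    a, b = fit_line(xs, ys)
    g = [y / (a * x + b) for x, y in zip(xs, ys)]   # detrended function
    total = sum(abs(v) for v in g)                  # denominator of (2)
    prospects = []
    first_shift = max(1, math.ceil(min_period / step))
    for shift in range(first_shift, n // 2 + 1):    # period <= half the frame
        diff = sum(abs(g[i + shift] - g[i]) for i in range(n - shift))
        err = diff / total
        if err < max_error:
            prospects.append((shift * step, err))
    return prospects


# Synthetic input (illustrative only): period of 20 samples, no real trend.
xs = [float(i) for i in range(200)]
ys = [2.0 + math.sin(2 * math.pi * x / 20) for x in xs]
prospects = gpd(xs, ys, min_period=2.0, max_error=0.03)
best_period, best_error = min(prospects, key=lambda p: p[1])
```

An empty result corresponds to the case where no periodicity is detected. When several prospects are returned, as happens here for multiples of the true period, taking the one with the smallest error is one simple way to pick a single value.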

3.2 Trend Modeling

Frequency is a good indicator of the importance of a pattern. In real life, the environment may change constantly and patterns discovered from databases may also change over time. Once the existence of periodicity has been identified, the characteristics of the sequence can be captured more precisely.

Hence, a mathematical structure that symbolizes and describes the transactions occurring in the real world is needed. The results of the modeling process must map the relationships between relevant attributes in the transaction databases to the relationships between relevant attributes of the function obtained by the computing rules presented in Definition 1.

The purchasing behavior observed in the real world is highly dynamic. The graph of the function is usually a visual aid to illustrate the dynamics of the observed facts. The salient features of the (x, y) plot are the evidence. Therefore, a sophisticated model is needed.

In many cases, broad movements can be discerned that evolve more gradually than the other movements which are evident. These gradual changes are called trends. The changes that are of a transitory nature are described as fluctuations. When dealing with time-series data, there is an inclination to break down the time series into the trend component to capture the trend characteristics of the function. Since regression analysis is frequently used for fitting equations to data, regression techniques are applied to construct the Trend Modeling procedure.

Obviously, the simple linear regression model, which is used to find a straight line representing the inclination of the scatter of the (x, y) plot, is not sufficient to describe the trend component underlying the series data. A more general type of model, called polynomial regression, is therefore considered.

The simple linear regression model $Y = \beta_0 + \beta_1 x + \varepsilon$ can be generalized as an mth-order polynomial regression in one variable, which is written as

$$y_i = \beta_0 + \beta_1 x + \beta_2 x^2 + \cdots + \beta_m x^m + e_i. \qquad (3)$$

To approximate the substantial curvature that is commonly found in the real world, neither the simple linear regression model nor the general model represented by (3) is appropriate to fit the data. In addition, there are pitfalls that one must be aware of [10], [13].

The polynomial model can be highly ill-conditioned, even for low-order polynomials. This may lead to substantial round-off errors. Generally, as low an order as possible should be used to obtain a satisfactory fit. To deal with the ill-conditioning issue, the polynomial model (4) is brought in to solve the problem:

$$(ax + b) \sum_{i=0}^{m} c_i \cdot (x \bmod \lambda)^i. \qquad (4)$$

However, the overfitting problem needs to be addressed. For the overfitting problem, the degree of the polynomial is usually left as a user input parameter [13].

The TM algorithm is straightforward. It begins with the establishment of the simple linear regression model $y = ax + b$. The regression line obtained characterizes the inclination of the variable Y. Then the polynomial model described in (4) is computed. Combined with the result of GPD, the complete model characterizing the given input distribution function is obtained. The pseudocode of the algorithm is presented in Fig. 7.

Fig. 7. The pseudocode of the CMA procedure.

To illustrate how Trend Modeling is used to find the approximating model of a given input, two typical linearly increasing/decreasing periodic functions are used as sample input.

It has been shown that (1) is a typical linearly decreasing function, which has period 6.39. TM is invoked to find the polynomial that approximates (1).

Let f(x) be defined on X = [0, 18], and divide the domain into 100 partitions. Given the predetermined parameters min_period = 0.5, max_error = 0.2, and degree = 6, Trend Modeling is applied to find the approximating model of (1). This gives the following:

1. Applying GPD to f(x) gives $a = -0.112$, $b = 2.2326$, and $\lambda = 6.39$.
2. Compute the polynomial

$$f'(x) = (-0.112x + 2.2326) \times \{\, 0.9390 + 0.5577\,(x \bmod 6.39) + 0.5251\,(x \bmod 6.39)^2 - 0.5153\,(x \bmod 6.39)^3 + 0.1321\,(x \bmod 6.39)^4 - 0.0144\,(x \bmod 6.39)^5 + 0.00066\,(x \bmod 6.39)^6 \,\}.$$
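A minimal Python sketch of fitting model (4) is shown below. It assumes that the trend line coefficients a and b and the period have already been obtained (for example from GPD), and it estimates the coefficients $c_0, \ldots, c_m$ by least squares on the detrended values $f(x)/(ax + b)$ against powers of $x \bmod \lambda$. This is one plausible reading of the TM description, not the authors' pseudocode. The names `fit_trend_model`, `evaluate_model`, and `solve_linear_system` are hypothetical, and the demonstration data are synthetic. The normal equations are solved with a small pure-Python Gaussian elimination to keep the example self-contained; for high degrees a numerically more stable library routine would be preferable.

```python
from typing import List, Sequence


def solve_linear_system(A: List[List[float]], rhs: List[float]) -> List[float]:
    """Solve A x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    M = [row[:] + [r] for row, r in zip(A, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x


def fit_trend_model(
    xs: Sequence[float], ys: Sequence[float],
    a: float, b: float, period: float, degree: int,
) -> List[float]:
    """Least-squares coefficients c_0..c_degree of model (4)."""
    u = [x % period for x in xs]                        # x mod lambda
    target = [y / (a * x + b) for x, y in zip(xs, ys)]  # detrended values
    m = degree + 1
    # Normal equations of the polynomial regression of target on u.
    A = [[sum(ui ** (j + k) for ui in u) for k in range(m)] for j in range(m)]
    rhs = [sum(t * ui ** j for t, ui in zip(target, u)) for j in range(m)]
    return solve_linear_system(A, rhs)


def evaluate_model(x: float, a: float, b: float, period: float,
                   coeffs: Sequence[float]) -> float:
    """Value of (a*x + b) * sum_i c_i * (x mod period)**i."""
    u = x % period
    return (a * x + b) * sum(c * u ** i for i, c in enumerate(coeffs))


# Synthetic check (illustrative values): data generated from a known model.
true_coeffs = [1.0, 0.5, -0.1]
xs = [0.25 * i for i in range(80)]
ys = [evaluate_model(x, -0.1, 3.0, 5.0, true_coeffs) for x in xs]
estimated = fit_trend_model(xs, ys, a=-0.1, b=3.0, period=5.0, degree=2)
```

Because the polynomial is fitted in the bounded variable $x \bmod \lambda$ rather than in x itself, the powers stay within a limited range, which is consistent with the motivation given for model (4).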
Fig. 8. The plot of the function $f(x)$ and the polynomial $f'(x)$.

Fig. 8 shows the result of applying TM to the function f(x). The darker line is the polynomial $f'(x)$ determined by regression and the other line is the input function f(x).

Next, TM is used to find the polynomial of the function (5), which is similar to the previously inspected example (1) but is a linearly increasing function:

$$g(x) = \frac{(24 + x)}{18}\,(\sin(x) + 1.5). \qquad (5)$$

The function g(x) is defined on the same domain X = [0, 18], which is again divided into 100 partitions. The same input parameters as for (1) are used: min_period = 0.5, max_error = 0.2, and degree = 6. Trend Modeling is then applied to find the approximating model of g(x). This gives the following:

1. Invoking GPD on g(x) gives $a = 0.023$, $b = 2.526$, and $\lambda = 6.42$.
2. Compute the polynomial

$$g'(x) = (0.023x + 2.526) \times \{\, 0.9012 + 0.5577\,(x \bmod 6.42) + 0.5251\,(x \bmod 6.42)^2 - 0.5153\,(x \bmod 6.42)^3 + 0.1321\,(x \bmod 6.42)^4 - 0.0144\,(x \bmod 6.42)^5 + 0.00066\,(x \bmod 6.42)^6 \,\}.$$

Fig. 9. The plot of the function $f(x)$ and the polynomial $f'(x)$.

The darker line in Fig. 9 is the polynomial $g'(x)$, which is determined by regression on the input g(x); the lighter one is the plot of the input function g(x).

These two examples demonstrate how TM can closely approximate the descending and ascending types of linearly periodic functions.

The polynomial gained by the TM process is an aid to identify the nature of the sequence mined. The graphical representation of the polynomial helps observers to have a better understanding of the tendency of the pattern, and the analysis of the characteristics of the polynomial itself is helpful in describing the phenomenon of the mined pattern. Hence, an elaborate and systematic plan of action is needed to complete the task. That is why we developed CMA.

3.3 Cyclic Model Analysis

The purpose of establishing the mathematical model is to help analysts obtain a better understanding of the whole picture of what happened and to predict what is likely to happen. With GPD and TM, it can be determined whether customers tend to repeat buying at a regular period, and an equation approximating the trend distribution function obtained from the sequence can be found. In short, the mathematical model established by GPD/TM is used to describe the characteristics of the sequential patterns mined from the designated time frame.

Next, CMA is proposed to analyze and describe the characteristics of the sequential pattern mined from the transaction databases. Users must determine the values of the parameters min_period, max_error, degree, and trcd. The meaning of the parameters min_period, max_error, and degree is the same as defined in GPD and TM. The value of trcd is the terminating condition of the process. If the length of the domain is too short, the process should be stopped since it is meaningless to investigate the characteristics of repeated patterns. The procedural steps are shown in Fig. 10.

The trend distribution function of a given sequence is defined in Definition 1. Then the type of the function is determined by finding the local maximum of the function. If the local maximum exists at the end of the domain of the function, the function belongs to the ascending type. The descending type can be determined if the local maximum exists at the beginning of the domain.

If the distribution function is of the ascending or descending type, GPD/TM is applied directly to get the polynomial approximating the patterns and to find the period of the distribution function. If the distribution is neither the ascending nor the descending type, the whole time frame has to be partitioned into two subframes and CMA is invoked recursively until the distribution function of each subframe is simplified. If the length of the inspected subframe is smaller than the predefined terminating condition trcd, the process will be stopped.

In other words, CMA takes the divide-and-conquer approach to collect the knowledge of the designated distribution function. Analysts use the synthesized mathematical model together with product knowledge to interpret the meaning of the model discovered by the proposed algorithms. Consequently, the interpretations will be translated into marketing insight and marketing practice accordingly.

Below is an illustration of CMA applied to real-world databases. The data collected were transactions of a domestic cosmetic supplier. The marketing department discovered several sequential patterns, which are of inter-
equation can be formulated to approximate the distribution est. They found that the pattern <37, 27> was unusual;

therefore, they chose to take on this pattern to demonstrate the CMA process.

Fig. 10. The pseudocode of the CMA algorithm.

First, the analyzer computed the trend distribution function according to its definition. The input parameters were set as min_period = 5, max_error = 0.5, degree = 6, and trcd = 30; then GPD/TM was invoked to find the period of the distribution and the approximating polynomial. The results obtained by GPD were a = 0.039, b = 3.760, and period = 42. The approximating polynomial produced by TM is as follows:

$$
f(x) =
\begin{cases}
\begin{aligned}
(0.039x + 3.760)\,\bigl\{ & 1.6473 - 0.4335\,(x \bmod 42) + 0.0691\,(x \bmod 42)^{2} \\
& - 0.052\,(x \bmod 42)^{3} + 0.0002\,(x \bmod 42)^{4} \\
& - 0.000\,(x \bmod 42)^{5} + 0.000\,(x \bmod 42)^{6} \bigr\},
\end{aligned} & \text{if } x < 170, \\[1ex]
0, & \text{otherwise.}
\end{cases}
$$

Fig. 11. (a) The plot of the distribution function of pattern <37, 27>. (b) After applying GPD/TM, the vertical bar used to indicate the discovered period was drawn on the picture.

The plot of the distribution function of pattern <37, 27> is shown in Fig. 11. It was learned from the picture and the polynomial that:

1. Consumers usually purchased <27> 42 days after they purchased <37> (period = 42).
2. In the first 6-month time frame after the first purchase of <37>, more and more purchases were made.
3. After six months, all buying stopped abruptly. This is unusual for consumer goods.

After consultation with business analysts, the strange pattern was explained. The item <27> faced internal and external competition during the first half-year of the investigated time frame. The marketers conducted a price-cut campaign, which stimulated the sales volume. After six months, the product was replaced with an alternative manufactured by the brand owner. Thus, the sales of <27> decreased dramatically.

As mentioned before, the interpretation of the model established by GPD/TM must be based on the abstraction of purchasing behavior, knowledge of products, and industry know-how. More comprehensive examples and discussions are detailed in the next section.

4 EXPERIMENTS

In this section, the experiments conducted to demonstrate the usage of the proposed Cyclic Model Analysis procedure are described, and how the proposed method can affect marketers’ decisions and actions is discussed.

4.1 Analysis Process and Data Sets

In general, consumer markets have several characteristics in common, such as repeat-buying over the relevant time frame, a large number of customers, and a wealth of information detailing past customer purchases. Hence, a well-known cosmetic company was approached to conduct the sequential pattern mining project.

The first experiment illustrates how to interpret the analysis and how to capture the cyclic characteristics of

patterns discovered by sequential pattern mining. Next, it was demonstrated that the consistent behavior pattern shown in the first experiment proved to be significant in the following year.

Fig. 12. The process flow of how the CMA is applied to the real-world data.

The process of applying CMA to real-world data is displayed in Fig. 12. After the mining process was completed, the sequential patterns to be investigated were selected. The distribution function of the designated pattern was computed. The plot of the function was outlined to determine if the function was of interest to marketers. Then CMA was applied to capture the knowledge of the mathematical model. CMA divides the domain into smaller segments such that the distribution function defined on each segment is of simply increasing or decreasing type. Then GPD and TM were invoked to obtain the polynomial approximating the pattern defined on the segment.

The results of the divide-and-conquer process are collected and synthesized. Then the synthesized knowledge is used to help analyze past and likely future behaviors. From the collected knowledge, marketers can elicit more facts from the transaction records. Consequently, the marketers can take appropriate action to favorably impact customers’ behavior.

In the first experiment, GPD/TM/CMA were applied to conduct the analysis against the transactions that occurred in the year 2000. The experiment showed how the procedure is implemented and how the customers’ likely behavior can be predicted with the aid of a simple and elegant mathematical model.

Next, the patterns that were identified as vital in 2000 were selected, and the results of the analysis conducted against 2001 data sets were examined.

The purchasing records of a total of 41 products between years 2000 and 2001 were collected. Then IBM Intelligent Miner was used as our mining tool. In summary, there were 160,334 and 215,854 transactions in the years 2000 and 2001, respectively. In the year 2000, the average number of products purchased in a single transaction was 2.93 items. Assuming that each product was equally likely to be sold, (1/41) × 2.93 = 0.0715. Thus, we set min_sup = 7.0 percent as the threshold of frequent item sets. After applying the sequential mining process to the transaction database, the sequences found are listed in Table 2.

TABLE 2
The Result of Sequential Pattern Mining

4.2 Finding Periodicity and Tendency

Working with the owner of the transaction database, two patterns, <38, 20> and <38, 36>, were found to be typical cyclic purchasing patterns of interest to marketing professionals.

In terms of product sales, the existence of sequence <38, 20> means that the customer purchases product 38 first, then buys product 20. But the information did not reveal when the customer will buy product 20. Therefore, CMA was applied to analyze <38, 20> to discover the periodicity and tendency of the pattern.

The first step of the process is to compute the distribution function of the pattern and find the regression line of the distribution function. The GPD procedure is invoked to find the linear model and the optimal candidate for the period of the function. Invoking GPD with parameters min_period = 5, max_error = 0.5, and degree = 6, the results obtained were a = -0.08, b = 25.87, and period = 35.

Then TM was used to get the approximating polynomial. The polynomial generated by TM is as follows:

$$
f(x) =
\begin{cases}
\begin{aligned}
(-0.08x + 25.87)\,\bigl\{ & 1.756 - 0.361\,(x \bmod 35) + 0.070\,(x \bmod 35)^{2} \\
& - 0.007\,(x \bmod 35)^{3} + 0.000\,(x \bmod 35)^{4} \\
& - 0.000\,(x \bmod 35)^{5} + 0.000\,(x \bmod 35)^{6} \bigr\},
\end{aligned} & \text{if } x < 305, \\[1ex]
0, & \text{otherwise.}
\end{cases}
$$

However, the approximating polynomial is an abstraction of the pattern, which is hard for nontechnical people to interpret. With the help of a visual representation of the distribution function, engineers, marketers, and business owners can communicate among themselves easily. Thus, the results of the modeling process can easily be incorporated into a marketing practice.

The plot of the distribution function and its regression line are shown in Fig. 13. It can easily be seen that <38, 20> is a simple descending sequence. The pattern has a period of 35 days and degenerates at day x = 305. Thus, the analysis suggested the following:

• The majority of customers tended to buy product <20> every 35 days.
• The purchasing decreases moderately.
• Customers will not buy product <20> more than 305 days after the initial purchase of product <38>.

Fig. 13. The plot of the pattern <38, 20> and its regression line.

The characteristics of the patterns were learned from the mathematical model and visualization of the distribution function. Marketing professionals combine the information gained from CMA with their knowledge of a product, then adapt the marketing practice to impact consumers’ likely behavior.

Next, CMA was applied to <38, 36>. Similarly, sequence <38, 36> means that the customer purchases product 38 first and then buys product 36. It was understood that marketers require more information than what was revealed. Thus, GPD was invoked with the predetermined parameters min_period = 5, max_error = 0.5, and degree = 6. The results of GPD were a = -0.07, b = 22.13, and period = 63. The approximating polynomial of the distribution function of <38, 36> is

$$
f(x) =
\begin{cases}
\begin{aligned}
(-0.07x + 22.13)\,\bigl\{ & 1.995 - 0.231\,(x \bmod 63) + 0.016\,(x \bmod 63)^{2} \\
& - 0.001\,(x \bmod 63)^{3} + 0.000\,(x \bmod 63)^{4} \\
& - 0.000\,(x \bmod 63)^{5} + 0.000\,(x \bmod 63)^{6} \bigr\},
\end{aligned} & \text{if } x < 299, \\[1ex]
0, & \text{otherwise.}
\end{cases}
$$

Fig. 14. The plot of the pattern <38, 36> and its regression line.

The plot of the distribution function and its regression line are shown in Fig. 14. Together, the picture of the model and the characteristics of the polynomial were obtained. It was learned that:

• The majority of customers tended to buy product <36> every 63 days.
• The purchasing decreases moderately.
• Customers will not buy product <36> more than 299 days after the initial purchase of product <38>.

The results indicated that the CMA performs well in exploring the trends of repeat-buying behaviors and provides a practical model for predicting when the customers tend to purchase and when they are likely to stop buying. Consequently, the marketers can allocate resources to build and execute marketing campaigns that favorably impact the behavior of these customers.

4.3 Consistent Buying Behaviors

Next, transactions that occurred in the year 2001 were examined to see if the patterns identified as vital by CMA in the year 2000 have the same characteristics. Hence, we applied GPD/TM to find the regression polynomial of each pattern in the years 2000 and 2001.

Fig. 15 is the plot of the trend line and regression polynomial determined by GPD/TM of the pattern <38, 20> in the years 2000 and 2001, respectively. Fig. 16 is the plot of the trend line and regression polynomial discovered by GPD/TM of the pattern <38, 36> in the years 2000 and 2001, respectively.

Fig. 15. The regression line and approximation polynomial of <38, 20> in the years 2000 and 2001. We found out that the shapes of the two plots of the two polynomials are similar.
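The piecewise models reported in Section 4.2 combine a linear trend with a degree-6 polynomial in x mod period and vanish after a cutoff day. A minimal sketch of how such a model could be evaluated is given below, in Python. The function name `cyclic_model` and its parameter names are hypothetical and do not come from the paper. Two assumptions are made: the two factors are multiplied, and the slope for <38, 20> is negative, consistent with the pattern being described as descending. The coefficients are copied as printed, rounded to three decimals.

```python
from typing import Sequence


def cyclic_model(x: float, a: float, b: float, period: float,
                 coeffs: Sequence[float], cutoff: float) -> float:
    """Evaluate (a*x + b) * P(x mod period) for x < cutoff, and 0 otherwise.

    coeffs lists the polynomial coefficients in ascending order of power.
    Treating the trend and the periodic polynomial as a product is an
    assumption of this sketch.
    """
    if x >= cutoff:
        return 0.0
    t = x % period              # position inside the current cycle
    poly = 0.0
    for c in reversed(coeffs):  # Horner's rule
        poly = poly * t + c
    return (a * x + b) * poly


# Values for <38, 20> as printed in Section 4.2.
# The negative slope is an assumption (the pattern is described as descending).
PATTERN_38_20 = {
    "a": -0.08,
    "b": 25.87,
    "period": 35.0,
    "coeffs": [1.756, -0.361, 0.070, -0.007, 0.0, 0.0, 0.0],
    "cutoff": 305.0,
}

if __name__ == "__main__":
    # Print the model value at the start of a few cycles and at the cutoff.
    for day in (0, 35, 70, 305):
        print(day, round(cyclic_model(day, **PATTERN_38_20), 3))
```

Because the higher-order coefficients are printed as 0.000, values computed from these rounded numbers late in a cycle may differ substantially from the fitted curves shown in the figures; the sketch is intended only to make the structure of the model concrete.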

The patterns were examined in turn and some facts were found to be in common. Sketches of the regression polynomial for the years 2000 and 2001 are similar in shape, but the sales volume varied. Figs. 15 and 16 show the similarity of the shape of the function plot.

Fig. 16. The regression line and approximation polynomial of <38, 36> in the years 2000 and 2001. The results of the experiment are similar to the experiment conducted on <38, 20>.

In addition, it was found that the polynomials for both years have periodicity, but with a slight difference. The period of <38, 20> in the year 2000 is 35; the period of the repeated buying in the year 2001 is 42. The period of <38, 36> found by GPD for the years 2000 and 2001 is the same, i.e., 63.

The similarity of the plots and the computation of the period showed notable consistency. The variation of sales volume and periodicity can be explained. Many factors can affect sales: economic projection (boom or recession), emergence of product replacement, marketing practice conducted in the first year, etc.

Analysis of the customer profile database revealed an interesting fact. Some customers bought the cosmetics during both 2000 and 2001. Some only purchased the products in 2000 and did not buy anything in 2001. Some only purchased the products in the year 2001.

The discovered facts revealed that the repeat-buying pattern holds for different sets of customers. This indicated that the picked patterns are user-independent, since the CMA only considers the items bought and the chronological order of the purchases.

The consistency of the repeat-buying behavior over time and the item-based feature of the CMA algorithm suggested that a hybrid recommendation system can be formed to provide better prediction. Thus, marketing professionals will have a better tool with which to retain their customers.

5 CONCLUSION

The main purpose of this work is to design a time-interval analysis algorithm, which can be applied to a wide array of applications, especially to analyze time-variant purchase behavior and establish a model for predicting consumers’ likely behavior. The algorithms proposed in this paper were designed to work with traditional mining algorithms to provide better understanding of customers and predict likely actions taken by observed targets.

It has been shown that the CMA performs well in describing the buying pattern of consumers and predicting likely behaviors. However, it leaves some room for improvement. First, the periodicity detecting procedure can be improved by applying fuzzy techniques and other statistics tools. In addition, the CMA algorithm should be implemented as an add-on to existing mining tools.

Nowadays, mass production is an outmoded business model, and companies must provide goods and services that fit customers’ individual needs. Mass customization has become the new paradigm. The GPD/TM/CMA procedures can be incorporated into personalized recommendation systems. The hybrid recommender can be formed to build an automated process for discovering time-relevant knowledge of customers, predicting customers’ likely actions, and providing useful suggestions for marketing practice.

REFERENCES

[1] R. Agrawal, T. Imieliński, and A. Swami, “Mining Association Rules between Sets of Items in Large Databases,” Proc. ACM SIGMOD ’93, pp. 207-216, 1993.
[2] R. Agrawal and R. Srikant, “Mining Sequential Patterns,” Proc. 1995 IEEE 11th Int’l Conf. Data Eng. (ICDE ’95), pp. 3-14, 1995.
[3] R. Agrawal and R. Srikant, “Fast Algorithms for Mining Association Rules in Large Databases,” Proc. 20th Int’l Conf. Very Large Data Bases (VLDB ’94), pp. 487-499, 1994.
[4] Y. Chen, M. Chiang, and M. Ko, “Discovering Time-Interval Sequential Patterns in Sequence Databases,” Expert Systems with Applications, vol. 25, pp. 343-354, 2003.
[5] D. Chiang, S. Lee, C. Chen, and M. Wang, “Mining Interval Sequential Patterns,” Int’l J. Intelligent Systems, vol. 20, pp. 359-373, 2005.
[6] M. Garofalakis, R. Rastogi, and K. Shim, “Mining Sequential Patterns with Regular Expression Constraints,” IEEE Trans. Knowledge and Data Eng., vol. 14, no. 3, pp. 530-552, May 2002.
[7] J. Han, G. Dong, and Y. Yin, “Efficient Mining of Partial Periodic Patterns in Time Series Database,” Proc. Int’l Conf. Data Eng. (ICDE ’99), pp. 106-115, 1999.
[8] J. Han, J. Pei, B. Mortazavi-Asl, Q. Chen, U. Dayal, and M. Hsu, “FreeSpan: Frequent Pattern-Projected Sequential Pattern Mining,” Proc. ACM SIGKDD Int’l Conf. Knowledge Discovery and Data Mining (SIGKDD ’00), pp. 355-359, 2000.
[9] J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal, and M.-C. Hsu, “PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth,” Proc. 17th Int’l Conf. Data Eng. (ICDE ’01), pp. 215-224, 2001.
[10] J. Neter, M.H. Kutner, W. Wasserman, and C.J. Nachtsheim, Applied Linear Statistics Model, fourth ed. McGraw-Hill, 1996.
[11] M. Lin and S. Lee, “Fast Discovery of Sequential Patterns by Memory Indexing,” Proc. Fourth Int’l Conf. Data Warehousing Knowledge Discovery (DaWaK ’02), pp. 150-160, 2002.
[12] F. Masseglia, F. Cathala, and P. Poncelet, “The PSP Approach for Mining Sequential Patterns,” Proc. Second European Symp. Principles Data Mining Knowledge Discovery (PKDD ’98), pp. 176-184, 1998.
[13] M.A. Golberg and H.A. Cho, Introduction to Regression Analysis, vol. 1. WIT Press, 2003.
[14] B. Ozden, S. Ramaswamy, and A. Silberschatz, “Cyclic Association Rules,” Proc. 14th Int’l Conf. Data Eng. (ICDE ’98), pp. 412-421, 1998.
[15] J. Pei, J. Han, and W. Wang, “Mining Sequential Patterns with Constraints in Large Databases,” Proc. 11th Int’l Conf. Information and Knowledge Management (CIKM ’02), pp. 18-25, 2002.
[16] P. Rolland, “FlExPat: Flexible Extraction of Sequential Patterns,” Proc. IEEE Int’l Conf. Data Mining (ICDM ’01), pp. 481-488, 2001.
[17] R. Srikant and R. Agrawal, “Mining Sequential Patterns: Generalizations and Performance Improvements,” Proc. Fifth Int’l Conf. Extending Database Technology (EDBT ’96), pp. 3-17, 1996.
[18] J. Yang, W. Wang, P.S. Yu, and J. Han, “Mining Long Sequential Patterns in a Noisy Environment,” Proc. ACM SIGMOD ’02, pp. 406-417, 2002.
[19] M.J.E. Zaki, “SPADE: An Efficient Algorithm for Mining Frequent Sequences,” Machine Learning, vol. 42, nos. 1/2, pp. 31-60, 2001.

Ding-An Chiang received the BS degree in hydraulic engineering from Chung Yuan Christian University, Taiwan, in 1981, and the MS and PhD degrees in computer science from the University of Southwestern Louisiana in 1986 and 1990, respectively. He is currently a professor in the Department of Computer Science and Information Engineering and the dean of student affairs at Tamkang University. His research interests include fuzzy, relational databases and data mining.

Cheng-Tzu Wang received the MS and PhD degrees from the Center for Advanced Computer Studies at the University of Louisiana in 1991 and 1994, respectively. He is currently an associate professor in the Department of Computer Science at the National Taipei University of Education, Taiwan. His research interests include software engineering, hybrid soft computing models, and data mining.

Shao-Ping Chen is currently working toward the PhD degree in computer science and information engineering at Tamkang University in Taipei, Taiwan. His research interests include data mining, e-commerce, and cyber culture.

Chun-Chi Chen received the MS degree in computer science and information engineering from Tamkang University in Taipei, Taiwan, in 2003. His research interests include relational databases and data mining.
