Abstract: Data mining and knowledge discovery methods host many decision support and engineering application needs of various organisations. Most real-world data has a time component inherent in it. Sequential patterns are inter-event patterns ordered in time and associated with the various objects under study. Analysis and discovery of frequent sequential patterns under user-defined constraints are interesting data mining results. These patterns can serve a variety of enterprise applications concerning analytic and decision support needs. Imposition of various constraints further enhances the quality of mining results and restricts the results to only the relevant patterns. In this paper, we propose a rough set perspective on the problem of constraint-driven mining of sequential patterns. We use the indiscernibility relation from the theory of rough sets to partition the search space of sequential patterns, and propose a novel algorithm that allows pre-visualization of patterns and imposition of various types of constraints in the mining task. The algorithm C-Rough Set Partitioning is at least ten times faster than the naïve algorithm SPRINT, which is based on various types of regular expression constraints.

Keywords: Rough sets, Sequential patterns, constraints, indiscernibility, partitioning

1. Introduction
Sequential pattern mining is studied extensively in the data mining literature due to its applicability to a variety of applications. It is applied to many real-world decision support applications such as root causes of banking customer churn [8], analysis of web logs [9], fault diagnosis and prediction in telecom networks [10], and the study of adverse drug reactions as temporal association rules [11]. The enormous search space and the huge number of patterns are inherent challenges in the sequence mining task. Conventional studies of sequential pattern mining give various computational methodologies to enumerate the frequent sequence space [1]-[6]. These methods mine all sequential patterns in the support confidence framework. The computational methodologies in [1]-[5] are bottom-up candidate generate and test approaches. The method PrefixSpan [6] works on the concept of iteratively projecting the database on the basis of the prefix. This method does not generate any candidates and is strictly based on the events present in the database.
New generation mining methods require the retrieval of patterns under user-defined constraints. Imposition of constraints not only condenses the mining results to the most useful ones but also reduces the search space and improves performance.
A constraint can be regarded as a Boolean function on all sequences. The problem of constraint-based mining of sequential patterns is about finding all those patterns which satisfy a given constraint. Under the classical framework, constraints can be classified as monotonic, anti-monotonic and succinct [14]. A constraint is anti-monotonic if its satisfaction by any sequence α implies its satisfaction by all subsequences of α. A constraint is monotonic if its satisfaction by a sequence α implies that every super-sequence of α also satisfies it.
A succinct constraint is a pre-counting pushable constraint such that, for any sequence α, the satisfaction of the constraint implies its satisfaction by all the elements of the sequence α. A succinct constraint is specified using a precise “formula”. According to the “formula”, one can generate all the patterns satisfying a succinct constraint. There is no need to iteratively check the constraint in the mining process.
Early work in the domain of constraint imposition in the sequential pattern mining task is the algorithm GSP [3]. Its authors proposed the concepts of time interval constraint, maximum gap and minimum gap constraint, and built them into the Apriori algorithm framework. Another work on time interval constraints is given by Mannila et al. [2]. They defined “an episode as a collection of events that occur relatively close to each other in a given partial order.” They considered the importance of the time frame of patterns and gave the concepts of event window and sliding event window. They defined patterns as directed acyclic graphs with a vertex as a single event and an edge as “Event A occurs before event B”. Their method of finding frequent episodes is a “bottom-up candidate-generate and test approach”, which is similar to AprioriAll proposed by Agrawal and Srikant [1].
F. Masseglia et al. [15] have also proposed time constraint imposition in the mining of sequential patterns. They have presented a graph theoretic mining algorithm to deduce the search space of time-constrained sequential patterns.
Garofalakis et al. [16] have given the framework for imposing regular expression constraints in sequential pattern mining. A regular expression R is built from operators such as disjunction and Kleene closure [17]. R specifies a language of strings over a regular family of sequential patterns that are of interest to the user. They confirmed that regular expression constraints have the same expressive power as deterministic finite automata [17]. The algorithm SPRINT is a multiple database scan, candidate generate and test strategy based on GSP [3]. The candidate generation strategy works on imposing a relaxed constraint.
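As an illustrative sketch of the constraint classes defined above (not code from the paper), the following Python snippet checks the anti-monotone and monotone properties by brute force on one short length-1 sequence. The event names and the two constraints, max_length_3 and contains_dead_phone, are hypothetical examples chosen only for illustration.

```python
from itertools import combinations


def subsequences(seq):
    """Yield every non-empty subsequence of seq, preserving event order."""
    n = len(seq)
    for k in range(1, n + 1):
        for idx in combinations(range(n), k):
            yield tuple(seq[i] for i in idx)


def is_subsequence(sub, seq):
    """True if sub can be obtained from seq by deleting zero or more events."""
    it = iter(seq)
    return all(event in it for event in sub)


# Hypothetical length constraint: anti-monotone, because deleting events
# from a sequence can never make it longer.
def max_length_3(seq):
    return len(seq) <= 3


# Hypothetical item constraint: monotone, because adding events to a
# sequence can never remove the required event.
def contains_dead_phone(seq):
    return "dead_phone" in seq


def holds_for_all_subsequences(constraint, seq):
    """Brute-force check that the constraint holds on every subsequence."""
    return all(constraint(s) for s in subsequences(seq))


alpha = ("noise", "dead_phone", "cross_talk")

# Anti-monotone behaviour: alpha satisfies the constraint and so does
# every one of its subsequences.
anti_monotone_ok = max_length_3(alpha) and holds_for_all_subsequences(
    max_length_3, alpha
)

# Monotone behaviour: alpha satisfies the constraint and so do these two
# sample super-sequences of alpha.
supers = [alpha + ("no_dial_tone",), ("no_dial_tone",) + alpha]
monotone_ok = contains_dead_phone(alpha) and all(
    is_subsequence(alpha, s) and contains_dead_phone(s) for s in supers
)

# The item constraint is not anti-monotone: the subsequence ("noise",)
# no longer contains the required event.
item_constraint_not_anti_monotone = not holds_for_all_subsequences(
    contains_dead_phone, alpha
)
```

The check is exhaustive only over the subsequences of a single example, so it demonstrates the definitions rather than proving the property for all sequences.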
(IJCNS) International Journal of Computer and Network Security, 17
Vol. 1, No. 2, November 2009
The method first generates candidates, checks them for validity against the given regular expression constraint, and then finds the occurrence frequency of such length-1 sequences. Those that cross the minimum support threshold become the seed set for further iteration. The candidate length-2 sequences are formed by joining the elements of the seed set. The database is then scanned again to search for these candidates, and their counts are accumulated after checking the relaxed constraint. In subsequent iterations, candidate k-length sequences are formed by joining frequent (k-1)-sequences that have the same contiguous subsequences. Suppose a sequence $S_\alpha = \langle e_1, e_2, \dots, e_n \rangle$. Another sequence $s_\beta$ is a contiguous subsequence of $S_\alpha$ if any of the following holds: (i) $s_\beta$ is derived from $S_\alpha$ by dropping an item from either $e_1$ or $e_n$; (ii) $s_\beta$ is derived from $S_\alpha$ by dropping an item from an element $e_j$ that has at least 2 items; (iii) $s_\beta$ is a contiguous subsequence of $s_\delta$ and $s_\delta$ is a contiguous subsequence of $S_\alpha$.
The process is continued until all frequent sequences present in the database that satisfy the relaxed constraint are found.
Given an anti-monotonic constraint, the constraint is first imposed and candidates which do not satisfy the constraint are pruned. It is clear that, like the support constraint, this constraint is also anti-monotonic; that is, if the constraint is not satisfied by a sub-pattern it will not be satisfied by its super-pattern either.
In case the constraint is monotone, an appropriate choice of relaxed constraint is used for the generation of valid results. The family of SPRINT methods suffers from the drawback of huge query overhead due to multiple scans and weak constraint imposition based candidate generation, followed by frequent pattern discovery from amongst the candidate set.
Han et al. [17] have confirmed the imposition of various user-defined constraints for efficient mining of patterns. They have proposed an architecture for mining multidimensional association rules in the framework of online analytical mining. They proposed constraint imposition at the level of the transaction database with the use of the PL/SQL query language, the result of which is further subject to multidimensional association pattern discovery.
Pei et al. [14] have studied the process of constraint imposition in the framework of PrefixSpan [6]. They have presented the constraint imposition framework in both the classical and the application-centric framework. Their work presents a detailed study of how conventional monotone, anti-monotone and succinct constraints can be studied as a prefix constraint while recursively projecting the database with the same. Their study confirmed that while the method PrefixSpan is efficient for sequential pattern mining, it is not suitable for constraint-driven mining. They have presented a systematic study of the imposition of regular expression and aggregate constraints, and presented various application-oriented examples of tough but interesting constraints. They defined seven categories of constraints from the application perspective: item, super pattern, time interval, gap between subsequent transactions, regular expression constraint, length of sequence and various aggregate constraints. Though these are not the complete set of possible constraints, they are more or less comprehensive enough to address most decision-centric constraint imposition tasks. In this paper, we explain all seven types of constraint and their treatment in the rough set based framework. Here we restrict our discussion to length-1 sequences. This corresponds to many real-world sequential patterns, for example web access patterns, faults in telecom landline networks etc. (i) We have proposed a user-friendly interface that generates a pre-visualization of a sample of emerging sequential patterns and allows flexible imposition of time, length and gap constraints prior to the mining task, and (ii) we have presented a novel algorithm based on the indiscernibility relation from the theory of rough sets to address the computational aspect of the expensive mining problem of frequent sequential patterns satisfying item, super pattern and regular expression constraints. It is found from experimental evaluations that our algorithm is at least 10 times faster than the algorithm SPRINT [16].

2. Problem Formulation
From the theory of rough sets, an information system is given as $S = \{U, A_t, V, f\}$, where $U = \{x_1, x_2, \dots, x_n\}$ is a finite set of objects and $A_t$ is a finite set of attributes. $A_t$ is further classified into two disjoint subsets, conditional attributes $C$ and decision attributes $D$, with $A_t = C \cup D$. $V = \bigcup_{p \in A_t} V_p$, where $V_p$ is the domain of attribute $p$. $f : U \times A_t \to V$ is a total function such that $f(x_i, q) \in V_q$ for every $q \in A_t$ and $x_i \in U$. Consider an example transaction database as in TABLE I.

Table 1: Example transaction database

$A_t = (T, I)$, where $T$ is the set of transaction times and $I$ is the set of itemsets associated with $x_i$. Examples of a transaction database are a database of customer purchase patterns in a retail store, web access details etc. There are multiple instances of the same customer ($x_i$) in the information system U. An alternate representation of the
transaction database is termed a sequence database, formed by grouping the transactions corresponding to the same $x_i$. The alternate information system is $S' = (U, E)$ where $U = \{x_1, x_2, \dots, x_n\}$ and $E = \{e_1, e_2, e_3, \dots, e_m\}$. A sequence or serial episode is defined as a set of events that occur within a predetermined event interval and are associated with the object under study. Given the set of itemsets $I = \{i_1, i_2, i_3, \dots, i_n\}$, the set of sequences $E \subset A_t$ is formed by combining the itemsets associated with the same object, ordered by time: $\forall e_i \in E$, $e_i = \{i_1, i_2, \dots, i_l\}$. The length of a sequence is the number of items it contains. A k-sequence contains k items, $k = \sum_j |e_j|$. The absolute support of a sequence $e_i$ is defined as the number of transactions that contain it, and the relative support is defined as $sup(e_i)$ = absolute support / number of objects in the dataset. A pattern is frequent if it crosses a user-specified frequency threshold called the minimum support threshold [15].
In a regular expression constraint over given sequences, the disjunction operator indicates the selection of either of two event patterns, and the Kleene closure operator, applied to the i-th element of a sequence, signifies zero or more occurrences of that element.
The problem of constraint-driven mining of sequential patterns is concerned with the discovery of frequent patterns that also satisfy user-specified constraints. Commonly imposed constraints can be classified into the following categories.
Constraint type 1: (Item constraint) An item constraint specifies a subset of items that should or should not be present in the patterns. Considering the case of n-size length-1 sequential patterns, V also corresponds to the subsequence relation.
(1)
where V is the subset of items. Depending on the condition and operation chosen in (1), the item constraint is either both anti-monotone and succinct, or both monotone and succinct.
An example of a type 1 constraint is the discovery of a specific web usage pattern of customers characterized by one type of site, for example online gift stores. As another example, in the case of fault diagnosis in telecom landline networks, a constraint of type 1 can be characterized by all sequential patterns in which the fault signal “dead phone” is present or absent. If T is the set of gift stores on the web then,
(2)
Given the domain of all unique sequential patterns, all transactions that follow the type 1 constraint are the members of the indiscernibility relation formed by the equivalence class of patterns indiscernible with respect to the concept of pattern existence.
(3)
Constraint type 2: (Super pattern constraint) A super pattern constraint finds those patterns which encapsulate a user-specified sequence.
(4)
For example, considering the web browsing patterns of customers, a pattern of type 2 can be a web access pattern which encapsulates the subsequence (online advertisement, product site). The super pattern constraint is monotone and succinct.
Constraint type 3: (Time interval constraint) A transaction database has time stamp information against event labels. The time interval or duration constraint selects the set of sequences with the property that the time interval between the first and last transaction is less than or greater than a specific value.
(5)
where the bound on the interval is a given integer. The length of the sequential pattern depends on the choice of the time interval under study. Let, in $T \subset A_t$, $t_s$ be the start time and $t_e$ be the end time for the study of transaction patterns. Then the event/time interval for the study of patterns is given by $t_e - t_s$ for the given information system S. If we group the transaction information $I \subset A_t$ corresponding to the same $x_i$, we derive an alternate representation of the information system S. If we impose a time interval restriction, we derive the sequence database in the constrained time interval. The maximum length can be controlled by the appropriate consideration of the time interval constraint. Consider the transaction database in TABLE I. If the time interval under consideration is 20 days then the sequence database is as given in TABLE II, and if the time interval under consideration is 25 days then the derived sequence database is given by TABLE III. Both length and time interval constraints are anti-monotone when they impose an upper bound (“less than”), and they are monotone and succinct when they impose a lower bound (“greater than”).
Constraint type 4: (Length constraint) In the case of length-1 sequences this type of constraint restricts the size of the sequence under consideration. It can be a restriction on the maximal pattern length.
(6)
Considering the example in TABLES I, II and III, the maximum length of a sequential pattern in TABLE II is 5, while in the case of TABLE III it is 3.
Constraint type 5: (Aggregate constraint) An aggregate constraint is a constraint on an aggregate of the items in a pattern, where the aggregate function can be sum, avg, max, min, standard deviation, etc. For example, in the case of data for market basket analysis, a retail store customer might be interested in knowing those items for which the sum of the bill was more than Rs. 2000. Some aggregate functions, such as sum and average over both positive and negative values, are neither monotone, anti-monotone nor succinct.
Constraint type 6: (Regular expression constraint) Regular expression constraints are specified as a regular expression over the set of items using regular expression operators such as disjunction or Kleene closure. A sequential pattern satisfies a regular expression constraint if and only if the pattern is accepted by the equivalent finite automaton. Like aggregate constraints, regular expression constraints are neither monotone, anti-monotone nor succinct.
Constraint type 7: (Gap constraint in adjacent transactions) In many transactions, events have to be equispaced in time, that is, the time gap between subsequent
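A minimal sketch, assuming length-1 sequences and a hypothetical in-memory list of transactions, of how the sequence database of Section 2 can be derived under a time interval constraint and how relative support and a simple item constraint can then be evaluated on it. The function names, object ids, dates and event labels below are illustrative assumptions; they are not the contents of TABLE I and not the authors' implementation.

```python
from collections import defaultdict
from datetime import date

# Hypothetical transactions: (object id, transaction date, event label).
TRANSACTIONS = [
    ("x1", date(2009, 1, 1), "a"),
    ("x1", date(2009, 1, 10), "b"),
    ("x1", date(2009, 1, 28), "c"),
    ("x2", date(2009, 1, 3), "a"),
    ("x2", date(2009, 1, 15), "c"),
    ("x3", date(2009, 1, 5), "b"),
]


def build_sequence_db(rows, t_start, t_end):
    """Group events by object, keep those in [t_start, t_end], order by time."""
    grouped = defaultdict(list)
    for obj, when, event in rows:
        if t_start <= when <= t_end:
            grouped[obj].append((when, event))
    return {
        obj: tuple(event for _, event in sorted(events))
        for obj, events in grouped.items()
    }


def is_subsequence(pattern, seq):
    """True if pattern occurs in seq with its event order preserved."""
    it = iter(seq)
    return all(event in it for event in pattern)


def relative_support(pattern, seq_db):
    """Absolute support divided by the number of objects in seq_db.

    Assumption: only objects with at least one event inside the chosen
    time interval count as objects of the dataset.
    """
    if not seq_db:
        return 0.0
    hits = sum(1 for seq in seq_db.values() if is_subsequence(pattern, seq))
    return hits / len(seq_db)


def satisfies_item_constraint(pattern, required_items):
    """Item constraint: every required item must be present in the pattern."""
    return set(required_items) <= set(pattern)


# A 20-day window drops the late event of x1; a 31-day window keeps it.
db_20 = build_sequence_db(TRANSACTIONS, date(2009, 1, 1), date(2009, 1, 20))
db_31 = build_sequence_db(TRANSACTIONS, date(2009, 1, 1), date(2009, 1, 31))
```

Widening the time interval lengthens the derived sequences, which is why the support of a pattern such as ("a", "c") differs between db_20 and db_31 in this toy data.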
The proposed algorithm C-RSP is a break and search strategy. C-RSP proposes a complete mining system that allows the imposition of all types of constraint. The input to the problem of mining sequential patterns under user-defined constraints is the transaction database of the objects under study. A sample database is given in TABLE I. It is evident that the resultant sequence database is governed by the user’s choice of time interval and maximum length constraint.
The algorithm first presents a user interface that allows flexible and adjustable imposition of CAT1 types of constraints.
Once the user derives the relevant sequence database under study by imposition of CAT1 categories of constraints, the sequence database is the input to the mining of patterns under CAT2 categories of constraints.
This is done by presenting a user interface that gives a view of the sequence database on choosing an appropriate time interval. Figure 1 gives the user interface that allows pre-visualization of sequences formed by transactions indiscernible with respect to the maximal time interval of patterns. Figure 2 gives the user interface for pre-visualization of the maximum length of patterns as a result of the user’s choice of time interval.

    Π Top k LocationId from Table1
        where transaction_date >= Tstart & transaction_date <= Tend
    // Π is the project operator of relational algebra, which implies Select Distinct;
    // k is the number of records the user wishes to visualize
    FOR each customer id in the rec_inner_test
    LOOP
        return_str := '';
        // append every signal of this customer, in order, separated by ':'
        FOR i IN 1..rec_inner_test.COUNT
        LOOP
            return_str := return_str || rec_inner_test(i).signal || ':';
        END LOOP;
        Update the Sequence_table with return_str as the Sequence against each LocationID
    END LOOP

Figure 3. Algorithm pseudocode to derive sequences from the transaction database in a user-specified time interval

    Π Top k LocationId from Table1 where Lengthofsequence <= n
    // Π is the project operator of relational algebra, which implies Select Distinct;
    // k is the number of records the user wishes to visualize
    FOR each customer id in the rec_inner_test
    LOOP
        return_str := '';
        // append every signal of this customer, in order, separated by ':'
        FOR i IN 1..rec_inner_test.COUNT
        LOOP
            return_str := return_str || rec_inner_test(i).signal || ':';
        END LOOP;
        Update the Sequence_table with return_str as the Sequence against each LocationID
    END LOOP

(7)
are satisfied, with $y_i \subseteq U$:
Condition 1: $y_i \neq \varnothing$
Condition 2: $y_i \cap y_j = \varnothing$ for $i \neq j$, $i, j = 1, 2, \dots, n$
Condition 3: $\bigcup_i y_i = U'$
Given the domain $V = \bigcup_{S} V_s$ of all sequences present in the database under study, V can be partitioned on the basis of equivalence classes $y_i$ such that each $y_i$ contains patterns with the same prefix. Clearly condition 1 is satisfied since each element of V will be a member of some $y_i$. Condition 2 is satisfied since no two elements of V with the same prefix will be in different equivalence classes. Since all members of V with different prefixes are in some equivalence class, the union of all equivalence classes should result in V:
$\bigcup_i y_i = V$, $i = 1, 2, \dots, n$
Now the database is in good form for the imposition of the various constraints of CAT2: item constraint, super pattern constraint, regular expression constraint and other complex constraints.
Case 1: Suppose the user wants to find all frequent sequences that have a given pattern in them. The algorithm finds the patterns which are indiscernible on the basis of pattern existence.

4. Results and discussion
We have compared the efficiency of C-RSP with the naïve SPRINT(N). It was found to be more than 10 times faster than SPRINT. Figures 6, 7 and 8 give runtime comparisons of C-RSP with SPRINT under imposition of time interval and length constraints respectively. Figure 6 gives the comparative efficiency on imposition of the time constraint on real data of network fault patterns in the telecom landline networks of Madhya Pradesh in India. The time period of the data was considered by the knowledge worker as three months. The algorithm C-RSP is implemented in JDK 1.3. The preprocessing step is a Java program which connects to the database as in TABLE I and invokes a PL/SQL cursor which creates TABLE II. The entire process is undertaken using the Java Database Connectivity (JDBC) interface. It connects to the database in MS SQL Server 2005 as in TABLE II and fetches the data into data structures using JDBC. The machine used is an HP ProLiant DL580 G5 with an Intel Xeon 1.6 GHz processor and 8 GB RAM. The operating system is MS Windows Server 2003 R2. The data comprised 75833 records of voice-related gross faults collected over a time window of three months. There are 215 distinct elements in the sequences and the maximum length of a sequence is 14. The algorithm SPRINT is also programmed on the same machine using JDK 1.3. The time constraint imposition is done
at the level of generating candidates. Only those candidates which satisfy the specified time constraints are considered in the support counting process in the subsequent scan of the data.

Figure 7. Runtime evaluations of synthetic data on imposition of length constraint

It is clear from the above graphs that C-RSP outperforms the SPRINT family of methods by an order of magnitude. This is due to the partitioning of the search space, the imposition of constraints at the preprocessing level, and avoiding recursive validity checking of the same. There is no candidate generation since we are only fetching data into data structures and applying computation logic on the same. The method C-RSP requires only one to two scans of the database, while SPRINT recursively scans the database and works on a candidate generate and test strategy. The constraint imposition strategies allow the imposition of individual and composite constraints.

5. Conclusion
The following are the benefits of the proposed model:

References
[1] R. Agrawal and R. Srikant, “Mining Sequential Patterns”, In Proceedings of the International Conference on Data Engineering, pp. 3-14, 1995.
[2] H. Mannila, H. Toivonen and A. I. Verkamo, “Discovering frequent episodes in sequences”, In Proceedings of the International Conference on Knowledge Discovery and Data Mining, IEEE Computer Society Press, pp. 210-215, 1995.
[3] R. Srikant and R. Agrawal, “Mining sequential patterns: Generalizations and performance improvements”, In Proc. 5th Int. Conf. Extending Database Technology (EDBT’96), pp. 3-17, Avignon, France, March 1996.
[4] J. Ayres, J. Gehrke, T. Yiu and J. Flannick, “Sequential Pattern Mining using A Bitmap Representation”, In Proc. of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Edmonton, Alberta, Canada, pp. 429-435, 2002.