
Identification of Precursory Alarm Sequence Patterns for Predicting Equipment Failures Using Ant Colony-Based Algorithm


M. Luo¹, D.H. Zhang¹, L. L. Aw¹, F. L. Lewis²
¹ Singapore Institute of Manufacturing Technology, 71 Nanyang Drive, SINGAPORE
² Automation & Robotics Research Institute, The University of Texas at Arlington, 7300 Jack Newell Blvd. S, Ft. Worth, Texas 76118-7115

Ming Luo is a Research Scientist in Singapore Institute of Manufacturing Technology. Her research
interests include computational intelligence for optimization in industrial applications, manufacturing event
management, event data based diagnosis and prognosis for manufacturing systems, and discrete system
modeling and control. She received her B.Eng. in Electrical Engineering in 1987 from South China
University of Technology, M.A. in Mathematics and Computer Science from Eastern Michigan University,
USA in 1991 and Ph.D. degree in System Modeling in 1998 from Nanyang Technological University,
Singapore. She has been involved in many in-house and industrial projects in the areas of dynamic resource
allocation and management, and equipment diagnosis and prognosis technology development. She has
published more than 30 journal and conference papers in the areas of system modeling, computational
intelligence approaches for equipment diagnosis and prognosis, and event management.

Danhong Zhang is a Senior Research Engineer in Singapore Institute of Manufacturing Technology. Her
research interests include distributed control systems, real-time control systems, device communication
technology, web-enabled technology, wireless sensor networks, and fault diagnostics and prognostics for
equipment and systems. Ms. Danhong Zhang received her B.Eng. and M.Eng. in Automation from Tsinghua
University, China in 1986 and 1988. She has been involved in many research and industry projects such as shop
floor information systems, maintenance and diagnosis systems for large scale material handling equipment,
real-time monitoring and control systems, machine and system diagnosis support systems, and wireless sensor
networks for machine condition monitoring. She has published 18 conference and journal papers.

Leck Leng Aw is a research officer in Singapore Institute of Manufacturing Technology. Her current
research focuses on the tracking and tracing of manufacturing processes and on the analysis of manufacturing data.
She obtained her Master of Science in CIM (Computer Integrated Manufacturing) from Nanyang
Technological University in 2000 and her B.S. from Singapore Institute of Management. In past years, she has
been involved in many industry and in-house projects in the areas of data pre-processing for data mining and
manufacturing process tracking and tracing systems.

Frank L. Lewis, Fellow IEEE, Fellow IFAC, Fellow U.K. Institute of Measurement & Control, PE Texas,
U.K. Chartered Engineer, is Distinguished Scholar Professor and Moncrief-O’Donnell Chair at University
of Texas at Arlington’s Automation & Robotics Research Institute. He obtained the Bachelor's Degree in
Physics/EE and the MSEE at Rice University, the MS in Aeronautical Engineering from Univ. W. Florida, and
the Ph.D. at Ga. Tech. He works in feedback control, intelligent systems, distributed control systems, and
sensor networks. He is author of 6 U.S. patents, 216 journal papers, 330 conference papers, 14 books, 44
chapters, and 11 journal special issues. He received the Fulbright Research Award, NSF Research Initiation
Grant, ASEE Terman Award, Int. Neural Network Soc. Gabor Award 2009, U.K. Inst Measurement & Control
Honeywell Field Engineering Medal 2009. Received Outstanding Service Award from Dallas IEEE Section,
selected as Engineer of the year by Ft. Worth IEEE Section. Listed in Ft. Worth Business Press Top 200
Leaders in Manufacturing. He served on the NAE Committee on Space Station in 1995. He is an elected Guest
Consulting Professor at South China University of Technology and Shanghai Jiao Tong University. Founding
Member of the Board of Governors of the Mediterranean Control Association. Helped win the IEEE Control
Systems Society Best Chapter Award (as Founding Chairman of DFW Chapter), the National Sigma Xi Award
for Outstanding Chapter (as President of UTA Chapter), and the US SBA Tibbets Award in 1996 (as Director of
ARRI’s SBIR Program).

ABSTRACT
The identification of alarm patterns that occur as precursors to failures makes it possible to predict
potential equipment malfunctions for condition-based maintenance. In industry, enormous files of historical
data are collected from equipment monitoring prior to failures. The search for reliable precursory alarm
patterns, that is, specific sequences of alarm events, in such data sets is a challenging task. This paper describes
an algorithm for identifying precursory alarm patterns from historical measured event data containing
numerous fault alarms and equipment states. The algorithm is based on modifications to ant colony
optimization (ACO), which is an effective probabilistic learning method for finding shortest paths in large
complex graphs. An actual industry application is used to verify the algorithm. The algorithm is applied to a
dataset from an industry partner collected over two years from an automatic material handling system. By
mining the historical data with the proposed algorithm, a plausible set of precursory alarm patterns is found.
These industry application results demonstrate the effectiveness of the proposed algorithm for finding
precursory alarm patterns for prediction of equipment failures.

I. INTRODUCTION

Effective approaches for monitoring and predicting the health status of equipment have long been a major concern to industry. In many applications, especially those relating to high value manufacturing that demands high degrees of equipment uptime and reliability, failures are costly and dangerous. Potential unrecoverable losses make prediction of equipment failures extremely important, as opposed to relying on traditional “fail and fix” approaches. On the other hand, a “play safe” approach such as time-based regular preventive maintenance can also be wasteful, since parts are replaced after fixed time intervals regardless of their condition or remaining useful lifetime. Therefore, an intelligent approach for equipment health assessment is desirable for scheduling maintenance work effectively while ensuring reliable performance.

The state of health of a machine/equipment can include information such as the existence and type of imminent problems and an estimation of the remaining life expectancy. In general, complex industrial equipment is furnished with numerous sensors that detect abnormalities in the system and trigger corresponding alarms when a situation deteriorates beyond a threshold. The alarm values are logged to a database as historical data. Specific alarm events are often precursors to specific equipment failures and breakdowns. The occurrence of such precursory alarm patterns can therefore be used to predict possible problems that can cause a machine to fail.

Recorded historical data from industry typically consists of huge amounts of data running over long time periods that can extend into months or years. Industrial equipment controllers with the capability of fusing sensor data may report hundreds of possible causes of system faults, operation faults, and component failures in the form of alarms for maintenance. Equipment monitoring systems continuously log all records of such alarms as an event stream that includes the alarm identification, severity level, problem location and occurrence time stamp, etc. Though most alarms are minor, some alarms are critical signals of faults that could lead to equipment breakdown. Alarm frequencies vary widely, from once a month or year to a few times each day. Unfortunately, the identification of the meaningful precursory alarm patterns from the enormous sets of recorded historical data is extremely difficult, even for human experts, due to the complexity of the data and human limitations inherent in analyzing large quantities of data.

Our objective is to analyze the relationship between these alarms and the equipment breakdowns to identify those meaningful alarm patterns that serve as precursors for failures. The extraction of such precursory alarm patterns would allow the prediction of critical faults or equipment failures. This paper describes an algorithm for mining historical industrial data to find reliable precursory alarm patterns for machine failures. The algorithm is based on ant colony optimization (ACO), which can solve complex learning problems that can be converted to shortest path problems in large graphs. However, despite its potential, ACO as normally used is not suitable for this application. Therefore modifications are made to ACO to develop an effective method for extracting precursory alarm patterns from complex industrial historical data. The resulting data mining algorithm is able to identify, from complex historical data, sequences of alarm events, i.e. alarm patterns, that are associated with and precursory to equipment breakdowns. This allows predictive rules to be discovered for prediction of equipment failures.

An application is made to an actual industrial dataset provided by an industry partner. The data was collected over a two year period from an elevated transfer vehicle with more than 10 subsystems and numerous installed sensors. Despite the complexity of the data, the proposed algorithm is able to effectively and efficiently find plausible precursory alarm patterns for prescribed faults. A performance analysis demonstrates the effectiveness of the proposed algorithm.

The paper is organized as follows. Section II provides a literature review. In Section III we define a temporal event-based framework for using alarm patterns as predictors for failures. In Section IV we review ant colony optimization (ACO). In Section V we develop a novel algorithm for extracting precursory alarm patterns from historical data. First, we show how to formulate event data mining as a graph problem. Then a new algorithm is developed by making certain modifications to ACO to allow its application to identification of precursory alarm patterns. The importance of selecting suitable start features is emphasized and a method is given for so doing. Convergence criteria are added, and a stopping condition is given. In Section VI we apply the algorithm to a complex historical dataset provided by an industry partner, and the effectiveness of the algorithm in finding precursory alarm patterns is verified through performance analysis.

II. RELATED WORK

Most research work on equipment failure prediction using alarm or error log analysis is for computer systems and telecommunication equipment, where huge numbers of alarm logs are generated by software that controls computer systems or the switches [1]-[6]. The authors in [1] developed a failure prediction mechanism based on the dispersion frame technique (DFT). This approach exploits the times of error occurrence, applying heuristic rules for failure prediction. The authors in [2] proposed “Timeweaver”, a genetic algorithm to predict rare events in sequences. It uses a set of event sequences to train the algorithm and a test set to tune its performance. The authors in [3] proposed different predictive algorithms for short term prediction of performance variables and long term prediction of abnormal behavior and system failure events for computer systems. Criteria for the selection of predictive algorithms based on both numeric and categorical data were discussed. Association rule mining techniques were adopted for finding rules that predict a target event from the frequency of occurrence of correlated events. In [4], three alarm analyzing principles are discussed that use the variation of total alarm counts and Pareto ranking within a certain period for detection of possible failure events. The method is implemented in the web-based system called ARISTO for alarm monitoring. It has been shown that ARISTO is able to generate early warnings for Telecom voice mail systems. A Similar Events Prediction (SEP) approach is proposed in [5] to predict imminent failures in a telecommunication system online. SEP is based on recognition of suspicious error patterns utilizing a semi-Markov chain in combination with clustering. The comparison study showed that SEP is superior to DFT though the computational complexity is increased. The authors in [6] reported their work on a proactive prediction and control system for large-scale clusters of computer systems. Based on cluster node performance data (reliability, availability and serviceability, and activity report), they applied a time series algorithm for performance prediction, rule-based classification techniques to predict critical events, and a Bayesian network to build dependency graphs to isolate the root cause of problems.

Reference [7] presents a framework of fault prediction and detection for Grid computing systems. They classify fault detection as resource-based detection, which depends on data collection on-site, or workflow-based detection, which depends on monitoring the execution state of activities (e.g. events). Historical data is used to define the template, and data mining algorithms are used to solve the prediction problems. Workflow prediction techniques are used for fault prediction. An association algorithm is used to derive multi-feature correlations from the resource- and workflow-based models. A workflow-based clustering algorithm detects outliers by clustering the historical data. The basic idea is to compare the execution data of activities with the same activity type and to label the smallest clusters as anomalies.

Less work can be found in the literature on the applications of equipment failure prediction based on event or alarm data. Early work related to failure prediction using alarm data used SPC based methods. Paper [8] presents a template monitoring method incorporating SPC (statistical process control) techniques for fault detection and prediction in discrete event manufacturing systems. This method is first applied to obtain the normal sequence relationships and timing relationships of alarm events. Then the SPC monitoring technique is used to detect the timing deviations of typical process sequences that result in a fault prediction warning. Similar to the formulation of the failure prediction problem in [2], the authors in [9] use frequent event sequences as failure signatures for prediction. The WINEPI approach [10] is proposed to discover the frequent episodes from the alarm database and classify episodes into serial episodes and parallel episodes. The events in the failure signatures are then used to build a Cox proportional hazard model to provide a statistically rigorous prediction method for system failures.

A scalable Watchdog Agent-based toolbox approach for machine health prognostics is presented in [11]. The toolbox consists of modularized embedded algorithms for signal processing, failure signature extraction and performance assessment. The methods are based on continuous time series data, which may be costly to obtain when additional sensors need to be installed. However, with advanced IT technology the event data may become huge. Mining frequent event sequences from a huge dataset effectively is a challenging research area for which many exciting problems remain open. In particular, mining algorithms for static data are not directly applicable to temporal data [12].

Equipment failure prediction is an application of temporal data mining with the objective of predicting rare events within a window of time [12][13]. Most previous work on algorithms for temporal association rule mining focuses on extending association rule mining algorithms, such as the A-priori algorithm, to the time dimension. It involves
complicated data pre-processing. In this paper, we propose an Ant Colony Optimization (ACO) [14] based algorithm which can search directly in temporal datasets, reducing data preprocessing effort with a higher accuracy rate [15].

III. BACKGROUND ON EVENT SEQUENCE MONITORING FOR FAILURE PREDICTION

In this section we present a formulation for predicting equipment failures in terms of measured alarm event sequences. Industrial equipment controllers typically log alarm event data records as event streams that include the alarm identification, severity level, problem location and occurrence time stamp, etc. Alarm frequencies vary widely, from once a month or year to a few times each day. Our objective is to analyze the relationship between these alarms and the equipment breakdowns to identify those meaningful alarm patterns that serve as precursors for failures. To accomplish this based on typical available data, one requires a formulation for working with event sequences, including the causal relationships between various alarm events and the failure events. In this section we define a framework for failure prediction whose terms share much in common with the concepts of predicting target events in [2][5].

A. Events and Feature Space

Suppose there is available historical recorded data from equipment monitoring computers in the form of event streams. The events are of two sorts: alarm conditions detected by the controllers, and failure events. An event $e_i$ can be characterized by its timestamp and its description by a set of features so that

$$e_i \in E = \langle \text{timestamp } t_i,\ \text{feature } f_j \rangle$$

where $i = 1, 2, \ldots$, feature $f_j \in F_Z$, $j = 1, \ldots, Z$, and $F_Z$ is a finite Z-dimensional feature space for describing the observations in the problem domain. In the industrial application examined in this paper, every alarm event can be described by a point in a 3-dimensional feature space. The three dimensions are: (1) the alarm ID, which is associated with the fault type; (2) the source ID, which indicates the location of the alarm source; and (3) the severity level. To demonstrate the technique developed herein, we use the fact that all alarm IDs are uniquely identified with fault types. Therefore, we shall assume that features are 1-dimensional, with only the alarm ID used to describe the observations of events in the following discussion. Hereafter, a feature refers to a point in the 1-D feature space. The development in this paper extends directly to higher-dimensional feature spaces.

The recorded history of events can be represented as an event sequence S defined as a time-ordered sequence of events

$$S = (e_1, e_2, \ldots, e_{t1}, \ldots, e_{n-k}, \ldots, e_{tx}, \ldots, e_n)$$

where all n events are in the time interval $t_1 \le t \le t_n$ and satisfy the time ordering $t_j < t_{j+1} \le t_n$. There are two sorts of events. Alarm events are denoted in the form $e_i$. Target events are those failures that are to be predicted based on the alarm events. These are denoted in the form $e_{tx}$ where $x = 1, 2, \ldots, H$. Figure 1 shows how the events are represented in 1-dimensional feature space plotted with respect to time. In the figure, $e_{t1}$ and $e_{t4}$ represent target failure events occurring at different timestamps but with the same feature A5. These two are treated as distinct events, but are referred to as the same type of failure event. Furthermore, the subsequences S1 = ((t1, A2), (t4, A1), (t5, A3), (t7, A4)) and S2 = ((t22, A2), (t23, A1), (t25, A3), (t27, A4)) are called event patterns. They have the same feature sequence P = (A2, A1, A3, A4). We refer to these two event sequences, S1 and S2, as having the same event pattern. Since this alarm event pattern occurs prior to both target events, the alarm pattern is said to be precursory to the target events, that is, it is a signal that the failure A5 is imminent.

Figure 1. Events in 1-dimensional Feature Space w.r.t. time (vertical axis: features A1 to AZ of the feature space FZ; horizontal axis: time, with marked instants t1, t4, t5, t7, t22, t23, t25, t27)

B. Monitoring, Prediction and Warning Times

We adopt the terms and definitions used in [2][5] to formulate the problem of predicting equipment failures, namely the target events just defined, in terms of the measured alarm events. The following discussion is illustrated in Figure 2. Assuming a target event occurs at time t, the monitoring time M is the time prior to the target event during which one monitors for possible precursory alarm patterns. This determines a maximum window [t-M, t) within which the precursory event patterns must occur for them to be associated with the target event.

The warning time interval W is the time necessary to raise a warning before the occurrence of the target event. As the warning time decreases, the accuracy of the prediction may increase. However, the time for reaction will be reduced correspondingly, and hence there is a trade-off between warning time and accuracy.
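
The event, pattern and window definitions of this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: events are stored as (timestamp, feature) pairs, the integers 1, 4, 5, ... merely stand in for the symbolic instants t1, t4, t5, ... of Figure 1, and the function names, the target time t = 30 and the values M = 10 and W = 2 are hypothetical. The half-open interval [t - M, t - W) is one possible convention for "inside the monitoring window but before the warning interval".

```python
from typing import List, Tuple

# Assumption: an event is a (timestamp, feature) pair with a numeric timestamp.
Event = Tuple[float, str]


def feature_pattern(subsequence: List[Event]) -> Tuple[str, ...]:
    """Return the ordered feature sequence (the event pattern) of a subsequence."""
    ordered = sorted(subsequence, key=lambda e: e[0])
    return tuple(feature for _, feature in ordered)


def monitoring_window(events: List[Event], t: float, M: float, W: float = 0.0) -> List[Event]:
    """Select events with timestamps in [t - M, t - W) for a target event at time t."""
    return [e for e in events if t - M <= e[0] < t - W]


# Subsequences S1 and S2 of Figure 1, with t1, t4, ... replaced by 1, 4, ...
S1 = [(1, "A2"), (4, "A1"), (5, "A3"), (7, "A4")]
S2 = [(22, "A2"), (23, "A1"), (25, "A3"), (27, "A4")]

# Both subsequences share the feature sequence P = (A2, A1, A3, A4).
same_pattern = feature_pattern(S1) == feature_pattern(S2)

# Hypothetical target event at t = 30 with M = 10 and W = 2 (illustrative values).
candidates = monitoring_window(S1 + S2, t=30, M=10, W=2)
```

With these illustrative values only the events of S2 fall inside the window, so only S2 could be associated with the hypothetical target event at t = 30.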

The prediction period D is the time allowed to make a prediction of the target event. Any prediction must happen before the warning time t-W to be useful. One has that the prediction period is $D = M - W$. The respective times are illustrated in Figure 2.

Figure 2. Monitoring, Prediction, Warning times (the monitoring period M runs from t-M to t and consists of the prediction period D, from t-M to t-W, followed by the warning period W, from t-W to t)

The event sequence S is considered to be partitioned into a series of non-overlapping time intervals or chunks $C_x$, $x = 1, 2, \ldots, H$. Each chunk consists of the sequence of events happening before a target event $e_{tx}$ and within the monitoring period. To simplify the chunking scenarios in the study, we assume that these chunks are non-overlapping. The event sequence S is thus written as:

$$S = (C_1, C_2, \ldots, C_H) \tag{3.1}$$

subject to the constraints $C_i \cap C_j = \emptyset$, $\forall i \ne j$, where $C_x$ is the set of time-ordered events immediately prior to the x-th target event $e_{tx}$.

C. Prediction of Target Events

Define a pattern as a specific fixed sequence of features. A pattern associated with an event sequence is the ordered sequence of features occurring in the event sequence, and is called an event pattern. Considering the facts that a fault may manifest itself in slightly different ways at different times, and also that the equipment behavior changes in response to a fault, the actual time intervals between any occurrences of two events are not of concern, and only matching of patterns to feature space is performed.

The prediction approach identifies a set of event patterns $P_j$ where each pattern correctly predicts a subset of target events in S. Each pattern $P_j$ is composed of a sequence of features and is constructed by mapping the feature sequence to an event sequence in an event chunk $C_x$. In order for the prediction to be valid, the last mapped event must occur within the prediction period D for the target event $e_{tx}$.

We also introduce another time parameter $D_p$, which denotes the maximum time length of the pattern durations. That is, only patterns of time lengths less than or equal to $D_p$ are allowed.

Given pattern $P_j = (f_{j1}, f_{j2}, \ldots, f_{jk}, \ldots, f_{jK})$, where $f_{jk}$ is the k-th feature of $P_j$ and K is the total number of features, $P_j$ correctly predicts target event $e_{tx}$ if it matches a subsequence of events in chunk $C_x$ within the prediction period D; otherwise $P_j$ does not predict $e_{tx}$. This prediction relation is captured by a matching function defined as

$$M(P_j, C_x) \in \{\text{True}, \text{False}\}, \quad j = 1, 2, \ldots, N \ \text{and} \ x = 1, 2, \ldots, H \tag{3.2}$$

Define $M(P_j, C_x) = \text{True}$ if there is a subsequence $S_j = (e_{j1}, e_{j2}, \ldots, e_{jk}, \ldots, e_{jK})$ in $C_x$ that is matched to $P_j$, where the timestamps of the events in $S_j$ satisfy $t - M < t_{jk} < t - W$ and $(t_{jK} - t_{j1}) \le D_p$.

Target event $e_{ti}$ is said to be predicted to occur if there exists at least one pattern $P_j$ for which $M(P_j, C_i) = \text{True}$ holds.

With the definition (3.2), if $P_j$ can be identified and tested in a given dataset, then prediction rules can be formulated as (3.3) and used for online prediction of equipment failure.

IF $P_j$ occurs within a prediction period D
THEN a target event $e_x$ will occur in the period $\Delta t_p \le D$
After warning time period W
With a Confidence Level CL (3.3)

The time window $\Delta t_p \le D$ is an additional parameter that can be tuned to obtain best performance of the algorithm developed in this paper.

Confidence Level (CL), defined in (3.4), is introduced to evaluate the accuracy of a candidate predictive pattern generated. It measures the fraction of predictions that are made correctly by a given pattern. The definition of Confidence Level is equivalent to Precision in [2][5]. We use different names to differentiate its usage in the different stages of our algorithm: Precision is used to generate predictive patterns while, by contrast, CL is used to evaluate the accuracy of the pattern for prediction in testing data.

$$\text{Confidence} = \frac{TP}{TP + FP} \tag{3.4}$$

where TP (True Positives) is the number of target events predicted correctly by the pattern, and FP (False Positives) is the number of false predictions by the pattern. A prediction of a failure event is a true positive if a true failure in the test data occurs within the prediction period $\Delta t_p \le D$ after the warning time, i.e. $M(P_j, C_x) = \text{True}$ as defined in (3.2). If no failure actually occurs, the prediction is a false positive.

D. Prediction Matrix and Support

Assume a set of prediction patterns $P = \{P_1, P_2, \ldots, P_N\}$ has been identified to predict the set of target events $E_t = (e_{t1}, e_{t2}, \ldots, e_{tk}, \ldots, e_{tH})$ in S. Each pattern $P_j$ may predict a subset of target events $e_{ti}$. Define a binary Prediction Matrix A as

$$A := [a_{i,j}]_{H \times N}, \quad \text{where } a_{i,j} = \begin{cases} 1 & \text{if } M(P_j, C_i) = \text{True} \\ 0 & \text{otherwise} \end{cases} \tag{3.5}$$

The nonzero entries of column j of A represent the target events that are predicted by pattern $P_j$. Define the prediction vector $V_j$ of pattern $P_j$ as column j of A.
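
The matching function (3.2), the Confidence Level (3.4) and the Prediction Matrix (3.5) can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' implementation. The function names are hypothetical, a matched subsequence is assumed not to need contiguity (other alarms may occur in between), and the target times 10 and 30, the extra alarm A9 and the parameter values M = 10, W = 2, Dp = 8 are made up for the example.

```python
from typing import List, Sequence, Tuple

Event = Tuple[float, str]  # assumption: (timestamp, feature) pairs


def matches(pattern: Sequence[str], chunk: List[Event], t: float,
            M: float, W: float, Dp: float) -> bool:
    """Sketch of the matching function M(Pj, Cx) in (3.2) for a target event at time t."""
    if not pattern:
        return False
    # Keep only events strictly inside the prediction period (t - M, t - W).
    window = sorted(e for e in chunk if t - M < e[0] < t - W)
    for start, (t_first, feature) in enumerate(window):
        if feature != pattern[0]:
            continue
        k, t_last = 1, t_first
        # Greedily take the earliest later event for each remaining feature;
        # for a fixed start event this minimises the pattern duration.
        for ts, f in window[start + 1:]:
            if k == len(pattern):
                break
            if f == pattern[k]:
                k, t_last = k + 1, ts
        if k == len(pattern) and t_last - t_first <= Dp:
            return True
    return False


def prediction_matrix(patterns, chunks, target_times, M, W, Dp):
    """Binary H x N matrix A of (3.5): A[i][j] = 1 if pattern j matches chunk i."""
    return [[int(matches(p, c, t, M, W, Dp)) for p in patterns]
            for c, t in zip(chunks, target_times)]


def confidence(tp: int, fp: int) -> float:
    """Confidence Level of (3.4): TP / (TP + FP)."""
    return tp / (tp + fp)


# Two hypothetical chunks preceding target events at t = 10 and t = 30.
C1 = [(1, "A2"), (4, "A1"), (5, "A3"), (7, "A4")]
C2 = [(22, "A2"), (23, "A1"), (25, "A3"), (26, "A9"), (27, "A4")]
patterns = [("A2", "A1", "A3", "A4"), ("A3", "A9")]
A = prediction_matrix(patterns, [C1, C2], [10, 30], M=10, W=2, Dp=8)
```

In this toy example the first pattern matches both chunks, while the second matches only the second chunk, so column 1 of A has two nonzero entries and column 2 has one.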

The total number of target events predicted by pattern $P_j$ is known as the Support of $P_j$, defined as the j-th column sum

$$S(P_j) = \sum_{i=1}^{H} a_{i,j}, \quad \text{where } j = 1, 2, \ldots, N \tag{3.6}$$

Support is used as a measure of the effectiveness of each prediction pattern and it is similar to recall used in [2][5].

IV. REVIEW OF PRINCIPLES OF ANT COLONY OPTIMIZATION

Ant colony optimization (ACO) is an effective method of solving problems that can be reduced to searches for shortest paths in graphs [14]. ACO is a probabilistic learning search method, where learning is based on the activities of swarms of ants. The concept of pheromones, which enhance learning along certain paths, together with the lack of centralized control and a priori information are the main attractive points for designing a pattern identification algorithm that is inspired by this behavior.

To find the shortest path for the Traveling Salesman Problem (TSP) using ACO, the problem can be represented as a graph of nodes and edges. Nodes represent cities; edges represent connective relations between any pair of nodes. There are three principal rules in ACO [14], namely the State Transition Rule, the Local Update Rule and the Global Update Rule. These rules are fundamental to all ACO applications.

The State Transition Rule determines the probability of moving ant k from node r to the next node s in the graph. This rule can be divided into two parts, Exploration (4.1) and Exploitation (4.2).

An ant chooses to explore new nodes instead of following currently attractive edges by applying Exploration rule (4.1), which is based on two variables, pheromone density $\tau_{rs}(t)$ and visibility $\eta_{rs}$. Multiplying $\tau_{rs}(t)$ by $\eta_{rs}$ forces ants to choose short edges with high amounts of pheromone. A node is then chosen randomly according to the normalized pheromone levels.

$$p^k_{rs}(t) = \begin{cases} \dfrac{[\tau_{rs}(t)] \cdot [\eta_{rs}]^{\beta}}{\sum_{u \in J^k_r} [\tau_{ru}(t)] \cdot [\eta_{ru}]^{\beta}} & \text{if } s \in J^k_r \\ 0 & \text{otherwise} \end{cases} \tag{4.1}$$

where
$\tau_{rs}(t)$ = the pheromone between features r and s at time instant t
$\eta_{rs}$ = the visibility $1/\delta_{rs}$
$\delta_{rs}$ = a value determined by a greedy rule (heuristics) based on the problem domain
$J^k_r$ = the set of nodes that remain to be visited by ant k located at node r
$\beta$ = a parameter that determines the relative importance of pheromone versus visibility

Ants follow the most attractive path by applying Exploitation rule (4.2), which selects the maximum product of $\tau_{rs}(t)$ and $\eta_{rs}$. It will lead to convergence on the most favored path.

$$R = \arg\max_{u \in J^k_r} \left\{ [\tau_{ru}(t)] \cdot [\eta_{ru}]^{\beta} \right\} \tag{4.2}$$

The State Transition Rule is defined as a combination of the Exploration and Exploitation rules.

$$s = \begin{cases} R & \text{if } q \le q_0 \ \text{(exploitation)} \\ p^k_{rs}(t) & \text{otherwise (biased exploration)} \end{cases} \tag{4.3}$$

where
$q$ = a random number uniformly distributed in [0, 1]
$q_0$ = a predefined threshold controlling the rate of exploration and exploitation ($0 \le q_0 \le 1$).

The Local Updating Rule for ACO [14] with $\tau_0 = 0$ is the following:

$$\tau_{rs}(t+1) = (1 - \rho) \cdot \tau_{rs}(t) \tag{4.4}$$

where $\rho$ is the local pheromone decay parameter. After an ant has chosen a node, the pheromone level is updated using (4.4). This decreases the pheromone density on visited nodes, making old edges less attractive, and therefore favouring the exploration of seldom visited edges.

The Global Update Rule defined in (4.5) deposits an additional amount of pheromone on edges that are in the path with relatively shorter distance. The purpose is to make the ants search in the neighbourhood of the best paths identified up to the current iteration of the algorithm.

$$\tau_{rs}(t+1) = (1 - \alpha) \cdot \tau_{rs}(t) + \alpha \cdot \Delta\tau_{rs}(t) \tag{4.5}$$

where
$\alpha$ = global pheromone decay ($0 < \alpha < 1$)
$\Delta\tau_{rs}(t)$ is the amount of pheromone added to the edges that belong to the global best tour and is defined as

$$\Delta\tau_{rs}(t) = \begin{cases} \dfrac{1}{L_{gb}} & \text{if } (r, s) \in \text{global-best-tour} \\ 0 & \text{otherwise} \end{cases} \tag{4.6}$$

where $L_{gb}$ = length of the global best tour found so far.

V. ALGORITHM FOR EXTRACTING PRECURSORY ALARM PATTERNS (EPAP) FROM HISTORICAL DATA LOGS

ACO is able to learn the shortest paths in extremely complex graphs with large numbers of nodes. Therefore, one is motivated to apply ACO for the extraction of precursory alarm patterns from complex historical data logs of very large sizes such as those stored by industrial equipment monitoring systems. To accomplish this, however, one must map the alarm pattern extraction problem into the sort of problem confronted by ACO, i.e. a graph search problem, and then make several modifications to ACO. The result is an effective algorithm which we call EPAP (Extraction of Precursory Alarm Patterns).

A. Formulating Event Data Mining as a Graph Problem

To map the extraction of precursory patterns into a graph search problem, the feature space is modeled as a graph of
6
nodes and arcs, where nodes represent the features and arcs are correlations among features. Thus, the 1-dimensional feature space in Figure 1 can be modeled as a graph as shown in Figure 3, where fi = Ai, i = 1, 2, ..., Z. We define feature fs to be correlated with feature fr if an event with feature fr is followed by an event with feature fs within a fixed time window ∆tp ≤ D in the event sequence S.

To quantify the correlation between two nodes, we define a correlation factor δrs as the number of occurrences of events with feature fr that are followed by events with feature fs within a given time window ∆tp, normalized by the frequency of fr. That is

$$\delta_{rs} = \frac{\sum_{i=1}^{I} \sum_{j=1}^{J} B(f_{ri}, f_{sj})}{\mathrm{Freq}(f_r)} \qquad (5.1)$$

where

$$B(f_{ri}, f_{sj}) = \begin{cases} 1 & \text{if } f_{sj} \text{ follows } f_{ri} \text{ within window } \Delta t_p \\ 0 & \text{otherwise} \end{cases}$$

Variable fzk is the kth time instance when an event with feature z occurs in the event sequence S. I and J are the total numbers of occurrences of events with features r and s in the event sequence S, respectively. The time window ∆tp ≤ D is defined in connection with the rule (3.3).

[Figure 3: graph of feature nodes A1, A2, ..., A6, ..., Az and nodes F1, ..., Fm, joined by correlation arcs labeled with correlation factors such as δA1A3, δA3F1, δA1A2, δA2A4, δA4Fm, δA5A6 and δAzA6. Legend: Az = feature node; arrow = correlation arc; δrs = correlation factor.]
Figure 3. Graph model for event data mining

B. Required Modifications to ACO for Pattern Identification
Now that the alarm pattern extraction problem has been mapped into a graph structure in feature space, the EPAP algorithm can be developed. EPAP operates by identifying a set of candidate patterns using ACO to construct certain paths in the feature space graph. The key point to note is that paths generated by the ants correspond to feature patterns in the feature space. These paths are then matched to the event sequence to predict the target events. With this mapping, the δrs in (4.2) is defined as follows:
δrs = the correlation factor defined in (5.1)

The identification of predictive patterns is different from the classical Traveling Salesman Problem (TSP), which is the benchmark for ACO and other NP-hard optimization techniques. Therefore, to apply ACO to pattern identification, some modifications to ACO are needed. A comparison of TSP and the Pattern Identification Problem is given in Figure 4. Motivated by these differences, the rest of this section elaborates the modifications to the original ACO algorithm required to develop EPAP. Several of the modifications add variable parameters that may be selected to tune the EPAP algorithm and improve its performance.

1. TSP: Start and end cities do not need to be explicitly defined.
   Pattern Identification: The choice of starting features affects the convergence of the algorithm.
2. TSP: Unambiguous stopping condition.
   Pattern Identification: Stopping condition is ambiguous.
3. TSP: The exact number of intermediate cities which remain to be visited is known.
   Pattern Identification: The number of features in each prediction pattern is variable.
4. TSP: Each city is allowed to be visited once only.
   Pattern Identification: Features are allowed to be visited more than once.
5. TSP: Ants can travel to any remaining cities from the current location as all cities are interconnected (i.e. a densely connected graph).
   Pattern Identification: Each feature is potentially connected to all other features, depending on the correlation between the features. Normally, it is not a dense graph.
6. TSP: Costs of edges are symmetrical.
   Pattern Identification: Costs (i.e. correlation factors) are asymmetrical.
Figure 4. Differences between TSP and Pattern Identification

C. Choosing Start Features
In ACO [14], ants are randomly placed on starting cities without penalties. However, in EPAP, the choice of a poor start feature results in a poor prediction pattern. Therefore, care must be taken to ensure both exploration of start features and convergence of patterns. As such, ProbStart, defined in (5.2), is created to balance exploration and convergence.

$$\mathrm{ProbStart} = \begin{cases} P(P_x(1)) & \text{if } q \le q_0 \text{ (convergence)} \\ P(f_x) & \text{otherwise (exploration)} \end{cases} \qquad (5.2)$$

where
$P(P_x(1)) = 1/N$, and Px(1) is the starting feature in predictive pattern Px
$P(f_x) = 1/c$, and c is the size of the feature space

The variable q0 is a predefined threshold that balances exploration and convergence.
Exploration of the start features simply selects a random feature from the feature space according to a uniform distribution, i.e. P(fx) = 1/c. Convergence, on the other hand, selects a random start feature from the global best N patterns
(refer to Section F below), i.e. P(Px(1)) = 1/N, where N << c. Next, the ants move according to the ACO state transition rule while they have not converged to a pattern.

D. Stopping Condition
The predictive pattern lengths are variable. Hence, the search for predictive patterns does not have a clear stopping condition, as it is possible for ants to construct paths of infinite length. Thus, the original ACO needs to be modified by adding a stopping condition. When an ant moves to the next node, the generated node is checked to determine whether it should be appended to the ant's path. A simple support threshold is used for the stopping condition. If the Support of the kth ant's path (pathk), appended with the generated node ni, drops below a user-specified threshold, the generated node is not constructive to pathk. Ant k thus reaches its stopping condition and waits for the other ants to stop. If the threshold is low, ants will generate long patterns, and vice versa if the threshold is high.

E. Weighted Support
Ideally, a correct prediction is made when a path matches a set of event patterns in an event stream. However, as paths are of variable lengths, evaluation of predictive patterns based purely on the Support in (3.6) may be biased, since patterns with fewer features occur more often and generally have higher Support values.
Evaluation of the predictive patterns needs to balance the Support against the Precision [2], which indicates the level of accuracy of a predictive pattern. Two possibilities can occur during the process of identifying predictive patterns. First, a short pattern of, for example, 2 features may happen more often and thus can predict a wider range of target events and have a higher Support value. However, it may be undesirable to search for 2-feature patterns, as the most common patterns in the event sequence may have 2 features, including those that have no predictive value. Thus, 2-feature patterns may lead to false prediction of target events.
On the other hand, longer patterns, for example of 5 features, occur less often and thus have a lower Support value. However, since longer patterns are more precise in their prediction than their shorter counterparts, it is necessary to augment the evaluation with the lengths of the patterns so that the precision of the predictive patterns is considered. To capture this tradeoff, the Weighted Support SW(Pj) is defined for evaluation of predictive patterns identified by EPAP as

$$S_W(P_j) = S(P_j) \cdot L_{p_j}^{W}, \quad W \in [0, \infty) \qquad (5.3)$$

where
Lpj = the length of the pattern Pj, which equals the number of unique features in Pj.
W = weight factor for control of the dependency of the pattern length.
The weighted support is used in EPAP and described in the next subsection.

F. Global Best N Patterns
EPAP is an elitist ant colony algorithm that makes use of a pool of relative best paths up to the current iteration. This pool of relative best paths is known as the Global Best N Patterns. A predefined parameter N controls the size of the pool that stores the predictive patterns with high Weighted Support as defined in (5.3). The larger the value of N, the more patterns the algorithm will store. However, this may lead to non-convergence of the colony if spurious patterns, i.e. patterns that do not predict any target event, are stored.
The Weighted Support SW(Pj) of each generated path is computed by (5.3). Paths with higher Weighted Support values are added into the Global Best N Patterns by applying an elitist strategy. This allows the best patterns identified to be kept until better patterns are found. The definition of ∆τrs(t) in (4.6) is thus modified as follows:

$$\Delta\tau_{rs}(t) = \begin{cases} \dfrac{S}{L} & \text{if } (r,s) \in \text{global best } N \text{ patterns} \\ 0 & \text{otherwise} \end{cases} \qquad (5.4)$$

where S = SW(Pj) defined by (5.3)
L = number of unique features in the pattern

The Global Update Rule defined in (4.5) is then intended to provide a greater amount of pheromone to patterns with high Weighted Support. The pheromones of edges in the Global Best N Patterns are updated by the Global Update Rule in (4.5) with the modified ∆τrs(t). This process is repeated until the paths in the Global Best N Patterns reach convergence.

G. Convergence Criterion
1. Definition of Convergence and Search Complexity
The definition of convergence is a soft matching of patterns with the Global Best N Patterns. For example, if a generated pattern contains a subset of features that is in the Global Best N Patterns, it is treated as a single convergence. This is because the search space of the feature set is polynomial, with complexity of O(n^2).
The complexity of the feature generation is computed as follows. Assume n features in the feature space are densely connected. The number of features that can be selected in each feature generation is thus n. For a path of length m, there would be m feature generations. Hence, the number of possible paths is n^m, and thus the complexity is O(n^m).
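As an illustration of the correlation factor in (5.1), the following minimal sketch counts, for each occurrence of feature r, the occurrences of feature s that follow it within the window, and divides by the number of occurrences of r. The function name `correlation_factor`, the `(timestamp, feature)` tuple representation, the inclusive window bound, and the reading of Freq(f_r) as the occurrence count of r are assumptions made for this example, not details taken from the EPAP implementation.

```python
from bisect import bisect_right
from collections import defaultdict


def correlation_factor(events, r, s, window):
    """Estimate delta_rs in the spirit of (5.1).

    events: list of (timestamp, feature) tuples sorted by timestamp.
    window: time window, in the same unit as the timestamps (assumed inclusive).
    """
    times = defaultdict(list)
    for timestamp, feature in events:
        times[feature].append(timestamp)

    r_times, s_times = times[r], times[s]
    if not r_times:
        return 0.0  # assumption: a feature that never occurs correlates with nothing

    follows = 0
    for t_r in r_times:
        # Count occurrences of s with t_r < t_s <= t_r + window (B = 1 for each).
        lo = bisect_right(s_times, t_r)
        hi = bisect_right(s_times, t_r + window)
        follows += hi - lo

    # Assumption: Freq(f_r) is the number of occurrences of feature r.
    return follows / len(r_times)


if __name__ == "__main__":
    sequence = [(0, "A"), (1, "B"), (5, "A"), (20, "B"), (21, "A")]
    print(correlation_factor(sequence, "A", "B", window=2))
    print(correlation_factor(sequence, "B", "A", window=2))
```

On this toy sequence the two directions give different values, which is the asymmetry of edge costs noted in Figure 4.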

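The Weighted Support in (5.3) and the pheromone increment in (5.4) can be sketched in a few lines. The function names and the representation of a pattern as a list of alarm codes are assumptions made for this illustration.

```python
def weighted_support(support, pattern, weight):
    """Weighted Support as in (5.3): S * L**W, with L the number of unique features."""
    length = len(set(pattern))
    return support * length ** weight


def pheromone_increment(support, pattern, weight):
    """Pheromone increment as in (5.4) for an edge in the global best pool: S_W / L."""
    length = len(set(pattern))
    return weighted_support(support, pattern, weight) / length


if __name__ == "__main__":
    pattern = ["0018", "0082", "0102"]
    print(weighted_support(23, pattern, 1.75))
    print(pheromone_increment(23, pattern, 1.75))
```

For the first row of Table 1 (S = 23, pattern 0018 0082 0102, W = 1.75), this gives a Weighted Support of roughly 157.3, in line with the value of 157 listed there. With W = 0 the Weighted Support reduces to the plain Support, which matches the observation that W = 0 favors short, frequent patterns.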
2. Termination of EPAP
In ACO, when the colony of ants converges onto the same path, the algorithm can be terminated, as further searches will not improve the solution. In EPAP, as the convergence rate is low even with soft convergence, termination conditions must be carefully designed. Two possible criteria exist for termination of the algorithm: 1) set a fixed number of iterations; 2) keep track of the total number of times a prediction pattern has been generated.
The problem with the first criterion is that all runs are unique. Therefore, a fixed number of iterations is inflexible in managing the convergence of the prediction patterns. Moreover, it is not possible to choose a fixed number of iterations that will work for all runs. Some runs may be able to terminate early, and some runs may need a much longer time to converge.
Thus, the second criterion of keeping track of the number of times that a pattern has been generated is intuitively more sound. This allows each pattern to reach a minimum amount of convergence and yield satisfactory results before termination. However, in practice, allowing the ants to find the paths may not result in convergence if some spurious patterns are generated.
The solution to this problem is to use a combination of both methods. EPAP terminates if the predictive patterns reach convergence. Meanwhile, an upper bound is also set on the total number of iterations to terminate the algorithm if the patterns are unable to converge. This ensures that most of the patterns will have a minimum convergence prior to termination.
The flowchart of the procedure for identifying predictive patterns in the EPAP Algorithm is shown in Figure 5.

H. Confidence Level for Performance Evaluation
The value of the Confidence Level (CL) (see 3.4) is calculated for each candidate predictive pattern Pj generated by EPAP to evaluate prediction accuracy. Given a predefined minimum threshold of CL, a greedy approach can be used at each iteration to select the patterns with the highest Confidence Level among those above the minimum threshold.
A set of 20 to 50 predictive patterns will usually be generated by EPAP, depending on the parameter settings. It may be necessary to prune those which are too general, e.g. with higher Support and low CL value. Doing so reduces the number of patterns and thus increases the accuracy of the prediction rules.

VI. TESTING AND PRELIMINARY FINDINGS

The proposed EPAP Algorithm was tested using historical alarm data from an industry partner, collected from an automatic material handling system over a two-year period. The data involves specifically an elevated transfer vehicle. The vehicle is 44 metres high, can transport 6.8 tons at a speed of 60 meters per minute, and contains more than 10 subsystems such as the main/backup travelling motors, travelling control system, main/backup hoisting motors, power supply system, etc. It is equipped with approximately 200 different alarms, of which 148 actively send signals to an equipment monitoring system. The equipment operates in four modes/states: auto, manual, maintenance, and off. When the equipment is in for preventive maintenance, it is in off mode. When unexpected failure events occur, the equipment switches from auto mode or manual mode to maintenance mode.

[Figure 5 flowchart: Start; set iteration i = 1 (M is the total number of iterations, K is the total number of ants). While the ants have not converged and i < M: set the start node for each ant according to ProbStart; for each ant k < K, generate the next node ni according to the State Transition Rule (4.3), moving by exploration (4.1) or exploitation (4.2); calculate the support for (path k + ni); if support ≥ threshold, append ni to path k and apply the Local Update (4.4) to ni; when ant k has finished its tour, continue with the next ant; apply the Global Update (4.5) to the global best N patterns; i = i + 1. When the loop ends, output the Global Best N Patterns; End.]
Figure 5. Pattern Identification Procedure in EPAP Algorithm
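The combined termination rule of Section V.G.2, which is also the outer loop of Figure 5, can be sketched as follows. This is a simplified illustration under stated assumptions: `generate_patterns` and `best_pool` are hypothetical callbacks standing in for the ant tours and the Global Best N Patterns, `min_hits` is an assumed parameter for the minimum number of times a pattern must be generated, and exact matching is used in place of the soft matching described in the paper.

```python
from collections import Counter


def run_epap_loop(generate_patterns, best_pool, min_hits, max_iterations):
    """Stop when every pooled pattern was generated min_hits times, or at the cap.

    generate_patterns(iteration) -> iterable of patterns built by the ants (hypothetical).
    best_pool() -> current Global Best N Patterns (hypothetical).
    Returns (iterations_run, converged).
    """
    hits = Counter()
    for iteration in range(1, max_iterations + 1):
        # Criterion 2: keep track of how often each pattern has been generated.
        for pattern in generate_patterns(iteration):
            hits[tuple(pattern)] += 1

        pool = best_pool()
        if pool and all(hits[tuple(p)] >= min_hits for p in pool):
            return iteration, True

    # Criterion 1: the upper bound on iterations ends runs that do not converge.
    return max_iterations, False


if __name__ == "__main__":
    result = run_epap_loop(
        generate_patterns=lambda i: [("0071", "0018", "0082")],
        best_pool=lambda: [("0071", "0018", "0082")],
        min_hits=3,
        max_iterations=10,
    )
    print(result)
```

The iteration cap only matters when spurious patterns keep the pool from converging, which is the situation the combined criterion is meant to handle.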

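The Confidence Level pruning described in Section V.H amounts to a threshold filter followed by a ranking. The sketch below applies it to four rows taken from Table 1; the dictionary layout, the function name `select_patterns`, and the 0.6 threshold used in the call are assumptions for illustration only.

```python
def select_patterns(candidates, min_confidence):
    """Keep candidates whose CL meets the threshold, highest CL first."""
    kept = [c for c in candidates if c["cl"] >= min_confidence]
    return sorted(kept, key=lambda c: c["cl"], reverse=True)


if __name__ == "__main__":
    # Rows copied from Table 1: Support, Confidence Level, pattern.
    candidates = [
        {"s": 23, "cl": 0.32, "pattern": ("0018", "0082", "0102")},
        {"s": 16, "cl": 0.75, "pattern": ("0071", "0018", "0082")},
        {"s": 14, "cl": 0.30, "pattern": ("0101", "0018", "0082")},
        {"s": 5, "cl": 0.86, "pattern": ("0087", "0107", "0161", "0081", "0082")},
    ]
    for row in select_patterns(candidates, min_confidence=0.6):
        print(row["cl"], " ".join(row["pattern"]))
```

Here the two high-Support rows with CL near 0.3 are dropped, which is the "too general" case the pruning step targets.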
The equipment failure events were defined as occurring when the equipment is in maintenance mode for more than 15 minutes after it switched from auto mode. For training, we used data collected in 2003, within which there were 29 identified unexpected equipment failures that were used as target events. For verification, we used data from 2004, within which there were 31 identified unexpected equipment failures.
The prototype of EPAP has been run on the dataset with parameters of 7-day Monitoring Time (M) and 8-hour Warning Time (W). Initial test results are presented in terms of Support (S), Weighted Support (SW), Confidence Level (CL) and Predictive Pattern, as shown in columns 1, 2, 3 and 4 of Table 1 respectively. In these tables and in this section in general, the alarm fault patterns are identified by uniquely assigned numbers.
Each weight factor (W) value in (5.3) generates one unique set of patterns. When W = 0, EPAP generated generally short patterns with one or two features frequently occurring in the event sequence. An example is the alarm pattern with two events <0004, 0018>. Domain experts verified that this alarm pattern is indeed associated with a certain fault event that occurs for most equipment failures. Thus, this short pattern may not hold any predictive value, as it is too general and will give a false alarm most of the time when a short down-time occurs.
When W = 1.75, EPAP generates longer patterns. Thus, W = 1.75 was used to conduct the testing hereafter. Table 1 shows pattern examples with the pattern duration parameter selected as ∆tp = 0.5 hour.

S    SW    CL     Candidate Predictive Pattern
23   157   0.32   0018 0082 0102
5    115   0.71   0006 0031 0098 0099 0161 0081 0031
5    115   0.71   0022 0098 0099 0161 0081 0125
5    115   0.71   0006 0031 0098 0161 0081 0125
16   109   0.75   0071 0018 0082
14   95    0.3    0101 0018 0082
14   95    0.28   0018 0101 0082
4    92    0.8    0000 0006 0098 0018 0081 0125
4    92    0.8    0006 0031 0098 0099 0018 0082
13   88    0.48   0071 0018 0101
13   88    0.42   0037 0004 0018
5    83    0.71   0006 0098 0099 0161 0125
5    83    0.63   0006 0098 0099 0081 0125
5    83    0.71   0006 0098 0099 0081 0082
5    83    0.86   0087 0107 0161 0081 0082
5    83    0.71   0087 0107 0161 0081 0125
5    83    0.83   0006 0098 0161 0081 0082
5    83    0.71   0086 0098 0161 0081 0125
12   82    0.31   0071 0018 0100

Table 1. Preliminary Results: ∆tp = 0.5 hour, W = 1.75, D = 8 hours, N = 30

With different pattern durations, there were a total of 6 sets of predictive patterns generated, each containing 30 patterns.
It should be noted that, from the patterns listed in Table 1, the CL value is not proportional to the Weighted Support. Therefore, selecting predictive patterns to construct the prediction rule set as defined in (3.3) is not a simple task and needs to balance the Support against the Confidence Level.
At the initial development and testing stage, target events were identified as the events of the equipment mode changing from auto mode to maintenance mode and staying in maintenance mode for more than 15 minutes. This did not differentiate the target events by failure type; hence the alarm patterns may not be accurate and meaningful. With more rigorous data pre-processing and clear definitions of the failure types, it is expected that we can identify alarm patterns that are useful for equipment degradation modelling.
Further testing was conducted with 7 identified chain failure events from 3 elevated transfer vehicles of the same type. The dataset from three vehicles of the same type was used because failure events are rare: the same type of failure event happened only 2-3 times during the period of 2 years for a specific piece of equipment.
Figure 6 shows examples of the candidate predictive alarm patterns identified by EPAP with different parameter settings. Those patterns with confidence level greater than 60% can be used to construct a prediction model. In Table a) of Fig. 6, pattern A1 clearly presents the typical event sequence that happens in handling a critical vehicle problem, including braking the vehicle travel motor (0154), opening the cabin door (0088), lifting the maintenance supports (0084-0087) and locking the horizontal movement (0099). It shows that EPAP can find a valid event pattern. Pattern A3 shows the main travel motor break error (0152) followed by the backup travel motor break error (0154), then the travel drive safety loop fault (0082). This pattern is also quite meaningful.
In Table b) of Fig. 6, we noticed that pattern B1 is the same as A1, which indicates that the duration of pattern A1 is less than or equal to 0.5 hr. Though this pattern's confidence level is 100%, it cannot be used as a predictive pattern, as it indicates a sequence of events that happened when maintenance staff were conducting a check-up activity for chain failure.
Pattern B3 indicates a timeout at a friction deck / a powered deck when the vehicle was loading and unloading. These are symptoms associated with chain failure. Pattern B3 can be selected as a predictive pattern if the minimum confidence level requirement is set at 70%.
Pattern B4 shows the backup travel motor break error (0154), followed by the travel drive safety loop fault (0082). This is a subset of pattern A3 with higher support. It could be due to the window size of 0.5 hours for the second group: the complete pattern happened in a larger time window. This implies that a suitable window size plays an important role in event pattern searching.
With different parameter settings, more than 10 candidate predictive alarm patterns with confidence level > 60% can be identified. How to determine the predictive alarm patterns that can be used to model equipment degradation is a challenging issue. We believe that domain experts'
involvement will play an important role at the stage of prediction rule set generation, as the background knowledge of field experts is vital.

Pattern ID   Simple support   Weighted support   Confidence   Candidate Predictive Pattern
A1           3                9037.386           1            0154 0083 0084 0085 0086 0087 0099
A2           5                1681.793           0.85         0118 0117 0117 0117
A3           3                3394.113           0.75         0148 0152 0154 0082
A4           5                1681.793           0.62         0118 0118 0117 0117
A5           3                2051.556           0.6          0004 0086 0087
A6           6                2018.151           0.58         0020 0018
A7           7                2354.51            0.57         0136 0135
A8           4                2735.408           0.57         0122 0152 0154
a) Pattern duration 1 hr, warning period 7 hr

Pattern ID   Simple support   Weighted support   Confidence   Candidate Predictive Pattern
B1           3                9037.386           1            0154 0083 0084 0085 0086 0087 0099
B2           4                1345.434           0.8          0194 0063
B3           4                1345.434           0.75         0196 0195
B4           4                1345.434           0.75         0154 0082
B5           6                2018.151           0.58         0020 0018
B6           7                2354.51            0.57         0136 0135
B7           4                2735.408           0.57         0122 0152 0154
B8           5                1681.793           0.57         0118 0117
B9           5                1681.793           0.54         0165 0018
b) Pattern duration 0.5 hr, warning period 7 hr

Figure 6. Examples of candidate predictive alarm patterns for chain failure type

VII. CONCLUSION AND FUTURE WORK

This paper has presented an ACO-based pattern identification algorithm, EPAP (Extraction of Precursory Alarm Patterns), for equipment failure prediction. In EPAP, a feature correlation factor and heuristics for start node selection, stopping condition, and termination are developed to balance the convergence and exploration capabilities of the algorithm. We applied EPAP to a real alarm history dataset collected from equipment. The initial results have shown that EPAP is feasible.
Refinement of the algorithm will look at two areas: the ordering connective and adjustment of parameters in EPAP. The ordering connective is a simplified set of regular expressions [16]. It allows the use of non-linear pattern matching, e.g. A|B is equivalent to AB and BA, where "|" is known as the "unordered" connective. As alarms may not necessarily occur in a strict sequence, using connectives enables richer pattern matching. This will potentially increase the accuracy of the algorithm.
As EPAP's performance, in terms of running time and accuracy, is highly dependent on parameter tuning, parameter adjustment should be carried out to improve the efficiency of the proposed algorithm.

VIII. ACKNOWLEDGEMENT

The authors would like to thank K. C. Ng for his implementation of EPAP and the members of the MEC group of SIMTech for their support and many useful discussions related to the domain problem and data pre-processing.

IX. REFERENCES

1. Ting-Ting Y. Lin, Daniel P. Siewiorek, "Error Log Analysis: Statistical Modeling and Heuristic Trend Analysis", IEEE Transactions on Reliability, Vol. 39, No. 4, Oct. 1990.

2. Gary M. Weiss and Haym Hirsh, 1998. Learning to Predict Rare Events in Event Sequences. Proceedings of the 4th International Conference on Knowledge Discovery and Data Mining, AAAI Press, 359-363.

3. R. Vilalta, C. V. Apte, J. L. Hellerstein, S. Ma, S. M. Weiss, "Predictive algorithms in the management of computer systems", IBM Systems Journal, Vol. 41, No. 3, 2002.

4. D. Levy and R. Chillarege, Early Warning of Failures through Alarm Analysis – A Case Study in Telecom Voice Mail Systems, IEEE International Symposium on Software Reliability Engineering (ISSRE 2003), Denver CO, Nov 2003.

5. Felix Salfner, Michael Schieschke and Miroslaw Malek. Predicting Failures of Computer Systems: A Case Study for a Telecommunication System. Proceedings of 11th IEEE Workshop on Dependable Parallel, Distributed and Network-Centric Systems (DPDNS'06); Rhodes Island, Greece; April 2006.

6. R. K. Sahoo, A. J. Oliner, I. Rish, M. Gupta, J. E. Moreira and S. Ma, "Critical Event Prediction for Proactive Management in Largescale Computer Clusters", The Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD'03), pp. 426-435, August 2003, Washington DC, USA.

7. R. Duan, R. Prodan, T. Fahringer, "Short Paper: Data Mining-based Fault Prediction and Detection on the Grid," pp. 305-308, 15th IEEE International Conference on High Performance Distributed Computing (HPDC), 2006.

8. Fadel, H.K.; Holloway, L.E., "Using SPC and template monitoring method for fault detection and prediction in discrete event manufacturing systems", Intelligent Control/Intelligent Systems and Semiotics, 1999. Proceedings of the 1999 IEEE International Symposium on, 15-17 Sept. 1999, pp. 150-155.

9. Zhiguo Li, Shiyu Zhou, Suresh Choubey, Crispian Sievenpiper, "Failure event prediction using the Cox proportional hazard model driven by frequent failure signatures", IIE Transactions, Online Publication Date 01 March 2007.

10. Mannila, H., Toivonen, H. and Verkamo, A. I. (1997) Discovery of frequent episodes in event sequences, Data Mining and Knowledge Discovery, 1, 259-289.

11. L. Liao, H. Wang and J. Lee, "A Reconfigurable Watchdog Agent for Machine Health Prognostics", International Journal of COMADEM, Volume 11, Number 3, July 2008, pp. 2-15.

12. John F. Roddick, Myra Spiliopoulou, "A Survey of Temporal Knowledge Discovery Paradigms and Methods", IEEE Transactions on Knowledge and Data Engineering, Vol. 14, No. 4, Aug 2002.
13. Ricardo Vilalta, Sheng Ma, "Predicting Rare Events In Temporal Domains," Second IEEE International Conference on Data Mining (ICDM'02), 2002, pp. 474.

14. Dorigo M., V. Maniezzo & A. Colorni (1996). Ant System: Optimization by a colony of cooperating agents. IEEE Transactions on Systems, Man, and Cybernetics-Part B, 26(1):29-41.

15. Chih-Hung Wu, Wei-Ting Lin, Chi-Hua Li, I-Ching Fang, Chia-Hsiang Wu, "Ant Colony Optimization On Building An Online Delayed Diagnosis Detection Support System For Emergency Department", The 7th International Conference on Computational Intelligence in Economics and Finance, Dec. 5-7, 2008, Kainan University, Taoyuan, Taiwan.

16. http://www.regular-expressions.info/ Regular expressions and their uses.
