Ming Luo is a Research Scientist in Singapore Institute of Manufacturing Technology. Her research
interests include computational intelligence for optimization in industrial applications, manufacturing event
management, event data based diagnosis and prognosis for manufacturing systems, and discrete system
modeling and control. She received her B.Eng. in Electrical Engineering in 1987 from South China
University of Technology, M.A. in Mathematics and Computer Science from Eastern Michigan University,
USA in 1991 and Ph.D. degree in System Modeling in 1998 from Nanyang Technological University,
Singapore. She has been involved in many in-house and industrial projects in the areas of dynamic resource
allocation and management, and equipment diagnosis and prognosis technology development. She has
published more than 30 journal and conference papers in the areas of system modeling, computational
intelligence approaches for equipment diagnosis and prognosis, and event management.
Danhong Zhang is a Senior Research Engineer in Singapore Institute of Manufacturing Technology. Her
research interests include distributed control systems, real-time control systems, device communication
technology, web-enabled technology, wireless sensor networks, and fault diagnostics and prognostics for
equipment and systems. Ms. Danhong Zhang received her B.Eng and M.Eng in Automation from Tsinghua
University, China in 1986 and 1988. She has been involved in many research and industry projects, such as shop
floor information systems, maintenance and diagnosis systems for large-scale material handling equipment,
real-time monitoring and control systems, machine and system diagnosis support systems, and wireless sensor
networks for machine condition monitoring. She has published 18 conference and journal papers.
Leck Leng Aw is a Research Officer in Singapore Institute of Manufacturing Technology. Her current
research focuses on the track and trace of manufacturing processes and on the analysis of manufacturing data.
She obtained her Master of Science in CIM (Computer Integrated Manufacturing) from Nanyang
Technological University in 2000 and her B.S from Singapore Institute of Management. In past years, she has
been involved in many industry and in-house projects in the areas of data pre-processing for data mining
and manufacturing process tracking and tracing systems.
Frank L. Lewis, Fellow IEEE, Fellow IFAC, Fellow U.K. Institute of Measurement & Control, PE Texas,
U.K. Chartered Engineer, is Distinguished Scholar Professor and Moncrief-O’Donnell Chair at University
of Texas at Arlington’s Automation & Robotics Research Institute. He obtained the Bachelor's Degree in
Physics/EE and the MSEE at Rice University, the MS in Aeronautical Engineering from Univ. W. Florida, and
the Ph.D. at Ga. Tech. He works in feedback control, intelligent systems, distributed control systems, and
sensor networks. He is author of 6 U.S. patents, 216 journal papers, 330 conference papers, 14 books, 44
chapters, and 11 journal special issues. He received the Fulbright Research Award, NSF Research Initiation
Grant, ASEE Terman Award, Int. Neural Network Soc. Gabor Award 2009, U.K. Inst Measurement & Control
Honeywell Field Engineering Medal 2009. Received Outstanding Service Award from Dallas IEEE Section,
selected as Engineer of the year by Ft. Worth IEEE Section. Listed in Ft. Worth Business Press Top 200
Leaders in Manufacturing. He served on the NAE Committee on Space Station in 1995. He is an elected Guest
Consulting Professor at South China University of Technology and Shanghai Jiao Tong University. Founding
Member of the Board of Governors of the Mediterranean Control Association. Helped win the IEEE Control
Systems Society Best Chapter Award (as Founding Chairman of DFW Chapter), the National Sigma Xi Award
for Outstanding Chapter (as President of UTA Chapter), and the US SBA Tibbets Award in 1996 (as Director of
ARRI’s SBIR Program).
ABSTRACT
The identification of alarm patterns that occur as precursors to failures makes it possible to predict
potential equipment malfunctions for condition-based maintenance. In industry, enormous files of historical
data are collected from equipment monitoring prior to failures. The search for reliable precursory alarm
patterns, that is, specific sequences of alarm events, in such data sets is a challenging task. This paper describes
an algorithm for identifying precursory alarm patterns from historical measured event data containing
numerous fault alarms and equipment states. The algorithm is based on modifications to ant colony
optimization (ACO), which is an effective probabilistic learning method for finding shortest paths in large
complex graphs. An actual industry application is used to verify the algorithm. The algorithm is applied to a
dataset from an industry partner collected over two years from an automatic material handling system. By
mining the historical data with the proposed algorithm, a plausible set of precursory alarm patterns is found.
These industry application results demonstrate the effectiveness of the proposed algorithm for finding
precursory alarm patterns for prediction of equipment failures.
based framework for using alarm patterns as predictors for failures. In Section IV we review ant colony
optimization (ACO). In Section V we develop a novel algorithm for extracting precursory alarm patterns
from historical data. First, we show how to formulate event data mining as a graph problem. Then a new
algorithm is developed by making certain modifications to ACO to allow its application to the
identification of precursory alarm patterns. The importance of selecting suitable start features is
emphasized and a method is given for doing so. Convergence criteria are added, and a stopping condition
is given. In Section VI we apply the algorithm to a complex historical dataset provided by an industry
partner, and the effectiveness of the algorithm in finding precursory alarm patterns is verified through
performance analysis.
II. RELATED WORK
Most research work on equipment failure prediction using alarm or error log analysis is for computer
systems and telecommunication equipment, where huge numbers of alarm logs are generated by the software
that controls the computer systems or the switches [1]-[6]. The authors in [1] developed a failure
prediction mechanism based on the dispersion frame technique (DFT). This approach exploits the times of
error occurrence, applying heuristic rules for failure prediction. The authors in [2] proposed
“Timeweaver”, a genetic algorithm to predict rare events in sequences. It uses a set of event sequences to
train the algorithm and a test set to tune its performance. The authors in [3] proposed different
predictive algorithms for short-term prediction of performance variables and long-term prediction of
abnormal behavior and system failure events for computer systems. Criteria for the selection of predictive
algorithms based on both numeric and categorical data were discussed. Association rule mining techniques
were adopted for finding the rules for predicting a target event from the frequency of occurrence of
correlated events. In [4], three alarm analysis principles are discussed that use the variation of total
alarm counts and Pareto ranking within a certain period for detection of possible failure events. The
method is implemented in a web-based system called ARISTO for alarm monitoring. It has been shown that
ARISTO is able to generate early warnings for telecom voice mail systems. A Similar Events Prediction
(SEP) approach is proposed in [5] to predict imminent failures in a telecommunication system online. SEP
is based on recognition of suspicious error patterns utilizing a semi-Markov chain in combination with
clustering. The comparison study showed that SEP is superior to DFT, though the computational complexity
is increased. The authors in [6] reported their work on a proactive prediction and control system for
large-scale clusters of computer systems. Based on cluster node performance data (reliability,
availability and serviceability, and activity report), they applied a time series algorithm for
performance prediction, rule-based classification techniques to predict critical events, and a Bayesian
network to build dependency graphs to isolate the root cause of problems.
Reference [7] presents a framework of fault prediction and detection for Grid computing systems. The
authors classify fault detection as either resource-based detection, which depends on on-site data
collection, or workflow-based detection, which depends on monitoring the execution state of activities
(e.g. events). Historical data was used to define the template, and data mining algorithms were used to
solve the prediction problems. Workflow prediction techniques are used for fault prediction. An
association algorithm is used to derive multi-feature correlations from the resource- and workflow-based
models. A workflow-based clustering algorithm detects outliers by clustering the historical data. The
basic idea is to compare the execution data of activities with the same activity type, and to label the
smallest clusters as anomalies.
Less work can be found in the literature on applications of equipment failure prediction based on event
or alarm data. Early work related to failure prediction using alarm data used SPC-based methods. Paper [8]
presents a template monitoring method incorporating SPC (statistical process control) techniques for fault
detection and prediction in discrete event manufacturing systems. This method is first applied to obtain
the normal sequence relationships and timing relationships of alarm events. Then the SPC monitoring
technique is used to detect the timing deviations of typical process sequences that result in a fault
prediction warning. Similar to the formulation of the failure prediction problem in [2], the authors in
[9] use frequent event sequences as failure signatures for prediction. The WINEPI approach [10] is
proposed to discover frequent episodes from an alarm database and to classify episodes into serial
episodes and parallel episodes. The events in the failure signatures are then used to build a Cox
proportional hazard model to provide a statistically rigorous prediction method for system failures.
A scalable Watchdog Agent-based toolbox approach for machine health prognostics is presented in [11]. The
toolbox consists of modularized embedded algorithms for signal processing, failure signature extraction
and performance assessment. The methods are based on continuous time series data, which may be costly to
obtain when additional sensors need to be installed. However, with advanced IT technology the event data
may become huge. Mining frequent event sequences from a huge dataset effectively is a challenging research
area for which many exciting problems remain open. In particular, mining algorithms for static data are
not directly applicable to temporal data [12]. Equipment failure prediction is an application of temporal
data mining with the objective of predicting rare events within a window of time [12][13]. Most previous
work on algorithms for temporal association rule mining focuses on extending association rule mining
algorithms, such as the A-priori algorithm, to the time dimension. It involves
complicated data pre-processing. In this paper, we propose an Ant Colony Optimization (ACO) [14] based
algorithm which can directly search in temporal datasets to reduce data preprocessing effort with a higher
accuracy rate [15].
III. BACKGROUND ON EVENT SEQUENCE MONITORING FOR FAILURE PREDICTION
In this section we present a formulation for predicting equipment failures in terms of measured alarm
event sequences. Industrial equipment controllers typically log alarm event data records as event streams
that include the alarm identification, severity level, problem location, occurrence time stamp, etc.
Alarm frequencies may range widely, from once a month or year to a few times each day. Our objective is to
analyze the relationship between these alarms and the equipment breakdowns to identify those meaningful
alarm patterns that serve as precursors for failures. To accomplish this based on typical available data,
one requires a formulation for working with event sequences, including the
A. Events and Feature Space
where i=1,2,…, feature $f_j \in F_Z$, j=1,…,Z, and $F_Z$ is a finite Z-dimensional feature space for
describing the observations in the problem domain. In the industrial application examined in this paper,
every alarm event can be described by a point in a 3-dimensional feature space. These 3 dimensions are:
(1) the alarm ID that is associated with the fault type; (2) the source ID that indicates the location or
the alarm source; and (3) the severity level. To demonstrate the technique developed herein, we use the
fact that all alarm IDs are uniquely identified with fault types. Therefore, we shall assume that features
are 1-dimensional, with only the alarm ID used to describe the observations of events in the following
discussion. Hereafter, a feature refers to a point in the 1-D feature space. The development in this paper
extends directly to higher-dimensional feature spaces.
The recorded history of events can be represented as an event sequence S defined as a time-ordered
sequence of events
$S = (e_1, e_2, \ldots, e_{t1}, \ldots, e_{n-k}, \ldots, e_{tx}, \ldots, e_n)$
where all n events are in the time interval $t_1 \le t \le t_n$ and satisfy the ordering of time
$t_j < t_{j+1} < t_n$. There are two sorts of events. Alarm events are denoted in the form $e_i$. Target
events are those failures that are to be predicted based on the alarm events. These are denoted in the
form $e_{tx}$ where x=1,2,…,H. Figure 1 shows how the events are represented in the 1-dimensional feature
space plotted with respect to time. In the figure, $e_{t1}$ and $e_{t4}$ represent target failure events
occurring at different timestamps but with the same feature $A_5$. These two are treated as distinct
events, but are referred to as the same type of failure event. Furthermore, the subsequences
$S_1 = ((t_1, A_2), (t_4, A_1), (t_5, A_3), (t_7, A_4))$ and
$S_2 = ((t_{22}, A_2), (t_{23}, A_1), (t_{25}, A_3), (t_{27}, A_4))$ are called event patterns. They have
the same feature sequence $P = (A_2, A_1, A_3, A_4)$. We refer to these two event sequences, $S_1$ and
$S_2$, as having the same event pattern. Since this alarm event pattern occurs prior to both target
events, the alarm pattern is said to be precursory to the target events, that is, it is a signal that the
failure $A_5$ is imminent.
Figure 1. Events in 1-dimensional Feature Space w.r.t. time
B. Monitoring, Prediction and Warning Times
We adopt the terms and definitions used in [2][5] to formulate the problem of predicting equipment
failures, namely the target events just defined, in terms of the measured alarm events. The following
discussion is illustrated in Figure 2. Assuming a target event occurs at time t, the monitoring time M is
the time prior to the target event during which one monitors for possible precursory alarm patterns. This
determines a maximum window [t-M, t) within which precursory event patterns must occur in order to be
associated with the target event.
Warning time interval W is the time necessary to raise a warning before the occurrence of the target
event. As the warning time decreases, the accuracy of the prediction may increase. However, the time for
reaction will be reduced correspondingly, and hence there is a trade-off between warning time and accuracy.
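The event notation above can be illustrated with a minimal Python sketch. It assumes that an event is stored as a (timestamp, feature) pair, as in the subsequences $S_1$ and $S_2$, and it replaces the symbolic timestamps $t_1, t_4, \ldots$ with the illustrative numbers 1, 4, …. The helper names `feature_pattern` and `monitoring_window` are hypothetical and do not come from the paper.

```python
from typing import List, Tuple

# Assumed representation: an event is a (timestamp, feature) pair, e.g. (4, "A1").
Event = Tuple[float, str]


def feature_pattern(events: List[Event]) -> Tuple[str, ...]:
    """Return the ordered feature sequence of an event subsequence."""
    return tuple(feature for _, feature in events)


def monitoring_window(sequence: List[Event], t_target: float, M: float) -> List[Event]:
    """Return the events whose timestamps fall in the window [t_target - M, t_target)."""
    return [(t, f) for t, f in sequence if t_target - M <= t < t_target]


# The two subsequences of the Figure 1 discussion, with illustrative numeric timestamps.
S1 = [(1, "A2"), (4, "A1"), (5, "A3"), (7, "A4")]
S2 = [(22, "A2"), (23, "A1"), (25, "A3"), (27, "A4")]

# Both subsequences share the feature sequence P = (A2, A1, A3, A4).
same_pattern = feature_pattern(S1) == feature_pattern(S2)
```

Under these assumptions, `monitoring_window(S1 + S2, t_target=30, M=10)` keeps only the events of `S2`, since those are the ones inside the 10-unit window before the hypothetical target time 30.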
Prediction period D is the time allowed to make a prediction of the target event. Any prediction must
happen before the warning time t-W to be useful. One has that prediction period D = M - W. The respective
times are illustrated in Figure 2.
Figure 2. Monitoring, Prediction, Warning times [timeline: the monitoring period M runs from t-M to t, the
prediction period D from t-M to t-W, and the warning period W from t-W to t]
The event sequence S is considered to be partitioned into a series of time intervals or chunks $C_x$,
x=1,2,…,H. Each chunk consists of the sequence of events happening before a target event $e_{tx}$ and
within the monitoring period. To simplify the chunking scenarios in the study, we assume that these chunks
are non-overlapping. The event sequence S is thus written as:
$S = (C_1, C_2, \ldots, C_H)$    (3.1)
subject to the constraints $C_i \cap C_j = \emptyset$, $\forall i \ne j$, where $C_x$ is the set of
time-ordered events immediately prior to the x-th target event $e_{tx}$.
C. Prediction of Target Events
Define a pattern as a specific fixed sequence of features. A pattern associated with an event sequence is
the ordered sequence of features occurring in the event sequence, and is called an event pattern.
Considering that a fault may manifest itself in slightly different ways at different times, and also that
the equipment behavior changes in response to a fault, the actual time intervals between occurrences of
any two events are not of concern, and only matching of patterns in feature space is performed.
The prediction approach identifies a set of event patterns $P_j$ where each pattern correctly predicts a
subset of target events in S. Each pattern $P_j$ is composed of a sequence of features and is constructed
by mapping the feature sequence to an event sequence in an event chunk $C_x$. In order for the prediction
to be valid, the last mapped event must occur within the prediction period D for the target event $e_{tx}$.
We also introduce another time parameter $D_p$, which denotes the maximum duration of a pattern. That is,
only patterns of time lengths less than or equal to $D_p$ are allowed.
Given pattern $P_j = (f_{j1}, f_{j2}, \ldots, f_{jk}, \ldots, f_{jK})$, where $f_{jk}$ is the k-th feature
of $P_j$ and K is the total number of features, $P_j$ correctly predicts target event $e_{tx}$ if it
matches a subsequence of events in chunk $C_x$ within the prediction period D; otherwise $P_j$ does not
predict $e_{tx}$. This prediction relation is captured by a matching function defined as
$M(P_j, C_x) \in \{\text{True}, \text{False}\}, \quad j = 1, 2, \ldots, N \text{ and } x = 1, 2, \ldots, H$    (3.2)
Define $M(P_j, C_x) = \text{True}$ if there is a subsequence
$S_j = (e_{j1}, e_{j2}, \ldots, e_{jk}, \ldots, e_{jK})$ in $C_x$ that is matched to $P_j$, where the
timestamps of the events in $S_j$ satisfy $t - M < t_{jk} < t - W$ and $(t_{jK} - t_{j1}) \le D_p$.
Target event $e_{ti}$ is said to be predicted to occur if there exists at least one pattern $P_j$ for
which $M(P_j, C_i) = \text{True}$ holds.
With the definition (3.2), if $P_j$ can be identified and tested in a given dataset, then prediction rules
can be formulated as (3.3) and used for online prediction of equipment failure.
IF $P_j$ occurs within a prediction period D
THEN a target event $e_x$ will occur in the period $\Delta t_p \le D$
     after warning time period W
     with a Confidence Level CL    (3.3)
The time window $\Delta t_p \le D$ is an additional parameter that can be tuned to obtain the best
performance of the algorithm developed in this paper.
Confidence Level (CL), defined in (3.4), is introduced to evaluate the accuracy of a candidate predictive
pattern. It measures the fraction of predictions that are made correctly by a given pattern. The
definition of Confidence Level is equivalent to Precision in [2][5]. We use different names to
differentiate its usage in the different stages of our algorithm: Precision is used to generate predictive
patterns while, by contrast, CL is used to evaluate the accuracy of the pattern for prediction in testing
data.
$\text{Confidence} = \frac{TP}{TP + FP}$    (3.4)
where TP (True Positives) is the number of target events predicted correctly by the pattern, and FP (False
Positives) is the number of false predictions by the pattern. A prediction of a failure event is a true
positive if a true failure in the test data occurs within the prediction period $\Delta t_p \le D$ after
the warning time, i.e. $M(P_j, C_x) = \text{True}$ as defined in (3.2). If no failure actually occurs, the
prediction is a false positive.
D. Prediction Matrix and Support
Assume a set of prediction patterns $P = \{P_1, P_2, \ldots, P_N\}$ has been identified to predict the set
of target events $E_t = (e_{t1}, e_{t2}, \ldots, e_{tk}, \ldots, e_{tH})$ in S. Each pattern $P_j$ may
predict a subset of target events $e_{ti}$. Define a binary Prediction Matrix A as
$A := [a_{i,j}]_{H \times N}$, where $a_{i,j} = \begin{cases} 1 & \text{if } M(P_j, C_i) = \text{True} \\ 0 & \text{otherwise} \end{cases}$    (3.5)
The nonzero entries of column j of A represent the target events that are predicted by pattern $P_j$.
Define the prediction vector $V_j$ of pattern $P_j$ as column j of A.
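The matching function (3.2), the Confidence Level (3.4) and the Prediction Matrix (3.5) can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: events are (timestamp, feature) pairs sorted by time, a pattern may match non-contiguous events (as in the $S_1$ example of Figure 1), and the function names `matches`, `prediction_matrix`, `support` and `confidence` are hypothetical. The example chunks and timestamps are invented for illustration.

```python
from typing import List, Sequence, Tuple

Event = Tuple[float, str]  # assumed (timestamp, feature) representation


def matches(pattern: Sequence[str], chunk: List[Event],
            t_target: float, M: float, W: float, Dp: float) -> bool:
    """Sketch of M(P_j, C_x): True if `pattern` occurs as an ordered subsequence of
    events with t - M < t_jk < t - W whose total duration is at most Dp."""
    window = [(t, f) for t, f in chunk if t_target - M < t < t_target - W]
    if not pattern:
        return False
    for i, (t_first, f_first) in enumerate(window):
        if f_first != pattern[0]:
            continue
        # Greedily match the remaining features at their earliest occurrences;
        # for a fixed start this gives the shortest possible duration.
        k, t_last = 1, t_first
        for t, f in window[i + 1:]:
            if k == len(pattern):
                break
            if f == pattern[k]:
                k, t_last = k + 1, t
        if k == len(pattern) and t_last - t_first <= Dp:
            return True
    return False


def prediction_matrix(patterns, chunks, targets, M, W, Dp) -> List[List[int]]:
    """Binary H x N matrix of (3.5): row i is chunk C_i, column j is pattern P_j."""
    return [[1 if matches(p, c, t, M, W, Dp) else 0 for p in patterns]
            for c, t in zip(chunks, targets)]


def support(A: List[List[int]], j: int) -> int:
    """Number of target events predicted by pattern j (sum of column j of A)."""
    return sum(row[j] for row in A)


def confidence(tp: int, fp: int) -> float:
    """Confidence Level (3.4): TP / (TP + FP); returns 0.0 when no prediction was made."""
    return tp / (tp + fp) if tp + fp else 0.0


# Illustrative data (invented): two chunks ending in target events at t = 10 and t = 30.
C1 = [(1, "A2"), (2, "A7"), (4, "A1"), (5, "A3"), (7, "A4")]
C2 = [(22, "A2"), (23, "A1"), (25, "A3"), (27, "A4")]
patterns = [("A2", "A1", "A3", "A4"), ("A1", "A9")]
A = prediction_matrix(patterns, [C1, C2], [10, 30], M=10, W=1, Dp=8)
```

With these invented inputs the first pattern is found in both chunks and the second in neither, so the column sums of `A` are 2 and 0. Tightening `Dp` to 3 rejects the first pattern as well, because its shortest occurrence in either chunk lasts longer than 3 time units.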
The total number of target events predicted by pattern $P_j$ is known as the Support of $P_j$, defined as
the j-th column sum
$S(P_j) = \sum_{i=1}^{H} a_{i,j}$, where j = 1, 2, …, N    (3.6)
Support is used as a measure of the effectiveness of each prediction pattern and it is similar to recall
used in [2][5].
IV. REVIEW OF PRINCIPLES OF ANT COLONY OPTIMIZATION
Ant colony optimization (ACO) is an effective method of solving problems that can be reduced to searches
for shortest paths in graphs [14]. ACO is a probabilistic learning search method, where learning is based
on the activities of swarms of ants. The concept of pheromones, which enhance learning along certain
paths, together with the lack of centralized control and a priori information, are the main attractive
points for designing a pattern identification algorithm that is inspired by this behavior.
To find the shortest path for the Traveling Salesman Problem (TSP) using ACO, the problem can be
represented as a graph of nodes and edges. Nodes represent cities; edges represent connective relations
between any pair of nodes. There are three principal rules in ACO [14], namely the State Transition Rule,
the Local Update Rule and the Global Update Rule. These rules are fundamental to all ACO applications.
The State Transition Rule determines the probability of moving ant k from node r to the next node s in the
graph. This rule can be divided into two parts, Exploration (4.1) and Exploitation (4.2).
An ant chooses to explore new nodes instead of following currently attractive edges by applying the
Exploration rule (4.1), which is based on two variables, pheromone density $\tau_{rs}(t)$ and visibility
$\eta_{rs}$. Multiplying $\tau_{rs}(t)$ by $\eta_{rs}$ forces ants to choose short edges with high amounts
of pheromone. A node is then chosen randomly according to the normalized pheromone levels.
$p^k_{rs}(t) = \frac{[\tau_{rs}(t)] \cdot [\eta_{rs}]^{\beta}}{\sum_{u \in J^k_r} [\tau_{ru}(t)] \cdot [\eta_{ru}]^{\beta}}$ if $s \in J^k_r$    (4.1)
$R = \arg\max_{u \in J^k_r} \{ [\tau_{ru}(t)] \cdot [\eta_{ru}]^{\beta} \}$    (4.2)
The State Transition Rule is defined as a combination of the Exploration and Exploitation rules.
$s = \begin{cases} R & \text{if } q \le q_0 \text{ (exploitation)} \\ p^k_{rs}(t) & \text{otherwise (biased exploration)} \end{cases}$    (4.3)
where
q = a random number uniformly distributed in [0, 1]
$q_0$ = a predefined threshold controlling the rate of exploration and exploitation ($0 \le q_0 \le 1$).
The Local Updating Rule for ACO [14], with $\tau_0 = 0$, is the following:
$\tau_{rs}(t+1) = (1 - \rho) \cdot \tau_{rs}(t)$    (4.4)
After an ant has chosen a node, the pheromone level is updated using (4.4). This decreases the pheromone
density on visited nodes, making old edges less attractive, and therefore favouring the exploration of
seldom visited edges.
The Global Update Rule defined in (4.5) deposits an additional amount of pheromone on edges that are in
the path with relatively shorter distance. The purpose is to make the ants search in the neighbourhood of
the best paths identified up to the current iteration of the algorithm.
$\tau_{rs}(t+1) = (1 - \alpha) \cdot \tau_{rs}(t) + \alpha \cdot \Delta\tau_{rs}(t)$    (4.5)
where
$\alpha$ = global pheromone decay ($0 < \alpha < 1$)
$\Delta\tau_{rs}(t)$ is the amount of pheromone added on the edges that belong to the global best tour and
is defined as
$\Delta\tau_{rs}(t) = \begin{cases} \frac{1}{L_{gb}} & \text{if } (r, s) \in \text{global-best-tour} \\ 0 & \text{otherwise} \end{cases}$    (4.6)
where $L_{gb}$ = length of the global best tour found so far.
V. ALGORITHM FOR EXTRACTING PRECURSORY ALARM PATTERNS (EPAP) FROM HISTORICAL DATA LOGS
ACO is able to learn the shortest paths in extremely complex graphs with large numbers of nodes.
Therefore, one
nodes and arcs where nodes represent the features and arcs are correlations among features. Thus, the
1-dimensional feature space in Figure 1 can be modeled as a graph as shown in Figure 3 where $f_i = A_i$,
i=1, 2, …, Z. We define feature $f_s$ to be correlated with feature $f_r$ if an event with feature $f_r$
is followed by an event with feature $f_s$ within a fixed time window $\Delta t_p \le D$ in the event
sequence S.
In order to quantify the correlation relation between two nodes, we define a correlation factor
$\delta_{rs}$ as the number of occurrences of events with feature $f_r$ followed by events with feature
$f_s$ within a given time window $\Delta t_p$. That is
B. Required Modifications to ACO for Pattern Identification
Now that the alarm pattern extraction problem has been mapped into a graph structure in feature space, the
EPAP
The identification of predictive patterns is different from the classical Traveling Salesman Problem
(TSP), which is the benchmark for ACO and other techniques for NP-hard optimization. Therefore, to apply
ACO to pattern identification, some modifications to ACO are needed. A comparison of TSP and the Pattern
Identification Problem is given in Figure 4. Motivated by these differences, the rest of this section
elaborates the modifications to the original ACO algorithm required to develop EPAP. Several of the
modifications add variable parameters that may be selected to tune the EPAP algorithm to improve its
performance.
where
$P(P_x(1)) = \frac{1}{N}$, $P_x(1)$ is the starting feature in predictive pattern $P_x$
$P(f_x) = \frac{1}{c}$, c is the size of the feature space
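The correlation factor $\delta_{rs}$ introduced earlier in this section can be sketched in Python. Because the defining equation is not legible in this copy, the sketch follows only the verbal definition and makes one explicit assumption: every ordered pair of events in which a feature $f_s$ event follows a feature $f_r$ event by at most $\Delta t_p$ counts as one occurrence. The function name `correlation_factors` and the sample data are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Event = Tuple[float, str]  # assumed (timestamp, feature) representation, sorted by time


def correlation_factors(sequence: List[Event], dt_p: float) -> Dict[Tuple[str, str], int]:
    """Count, for every ordered feature pair (f_r, f_s), how often an event with f_s
    follows an event with f_r within the time window dt_p (assumed reading of delta_rs)."""
    delta: Dict[Tuple[str, str], int] = defaultdict(int)
    for i, (t_r, f_r) in enumerate(sequence):
        for t_s, f_s in sequence[i + 1:]:
            if t_s - t_r > dt_p:
                break  # the sequence is time-ordered, so later events are out of range too
            delta[(f_r, f_s)] += 1
    return dict(delta)


# Invented example: A2 is followed by A1 twice within 2 time units.
events = [(0, "A2"), (1, "A1"), (2, "A3"), (10, "A2"), (11, "A1")]
delta = correlation_factors(events, dt_p=2)
```

The resulting counts can serve as arc weights of the feature graph, with an arc from $f_r$ to $f_s$ present only when its count is nonzero. How EPAP uses these weights in the state transition rule is not recoverable from this section, so the sketch stops at the counting step.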
(refer to section G below), $P(P_x(1)) = \frac{1}{N}$ where $N \ll c$. Next the ants move according to the
ACO state transition rule when they have not converged to a pattern.
of the pattern length. The weighted support is used in EPAP and described in the next subsection.
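The ACO rules reviewed in Section IV, on which the ant moves above rely, can be sketched in Python. This is a minimal sketch of the generic rules (4.1) to (4.6) as they are written in this paper, not of the EPAP modifications. Pheromone and visibility values are assumed to be stored in dictionaries keyed by edge (r, s), and the names `choose_next`, `local_update` and `global_update` are hypothetical.

```python
import random
from typing import Dict, Hashable, Iterable, List, Tuple

Edge = Tuple[Hashable, Hashable]


def choose_next(r: Hashable, candidates: Iterable[Hashable],
                tau: Dict[Edge, float], eta: Dict[Edge, float],
                beta: float, q0: float, rng: random.Random) -> Hashable:
    """State transition rule (4.3): exploit the best edge with probability q0,
    otherwise sample the next node in proportion to tau * eta**beta as in (4.1)."""
    nodes: List[Hashable] = list(candidates)
    weights = [tau[(r, s)] * eta[(r, s)] ** beta for s in nodes]
    if rng.random() <= q0:
        return nodes[weights.index(max(weights))]  # exploitation (4.2)
    if sum(weights) == 0:
        return rng.choice(nodes)  # assumption: fall back to a uniform choice
    return rng.choices(nodes, weights=weights)[0]  # biased exploration (4.1)


def local_update(tau: Dict[Edge, float], edge: Edge, rho: float) -> None:
    """Local update (4.4) with tau_0 = 0: decay the pheromone on a visited edge."""
    tau[edge] = (1.0 - rho) * tau[edge]


def global_update(tau: Dict[Edge, float], best_tour: Iterable[Edge],
                  L_gb: float, alpha: float) -> None:
    """Global update (4.5) with the deposit (4.6): 1 / L_gb on best-tour edges, 0 elsewhere."""
    best = set(best_tour)
    for edge in tau:
        deposit = 1.0 / L_gb if edge in best else 0.0
        tau[edge] = (1.0 - alpha) * tau[edge] + alpha * deposit
```

In a full run, each ant would call `choose_next` repeatedly, apply `local_update` after every move, and `global_update` would be applied once per iteration to the best tour found so far. The parameter values (`beta`, `q0`, `rho`, `alpha`) are tuning choices that this section does not specify.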
2. Termination of EPAP
In ACO, when the colony of ants converges onto the same path, the algorithm can be terminated, as further
searches will not improve the solution. In EPAP, as the convergence rate is low even with soft
convergence, termination conditions must be carefully designed. Two possible criteria exist for
termination of the algorithm: 1) set a fixed number of iterations; 2) keep track of the total number of
times a prediction pattern has been generated.
The problem with the first criterion is that all runs are unique. Therefore, having a fixed number of
iterations is inflexible in managing the convergence of the prediction patterns. Moreover, it is not
possible to choose a fixed number of iterations that will work for all runs. Some runs may be able to
terminate early, and some runs may need a much longer time to converge.
Thus, the second criterion of keeping track of the number of times that a pattern has been generated is
intuitively more sound. This allows each pattern to reach a minimum amount of convergence and yield
satisfactory results before termination. However, in practice, allowing the ants to find the paths may not
result in convergence if some spurious patterns were generated.
may be necessary to prune those which are too general, e.g. with higher Support and low CL value. Doing so
reduces the number of patterns and thus increases the accuracy of the prediction rules.
Figure 5. Pattern Identification Procedure in EPAP Algorithm [flowchart: with M the total number of
iterations and K the total number of ants, start at iteration i = 1; while the ants have not converged and
i < M, set the start node for each ant according to ProbStart; until ant k has finished its tour, apply
the Local Update (4.4); then apply the Global Update (4.5) to the global best N patterns and set i = i + 1]
vehicle. The vehicle is 44 metres high and can transport 6.8 tons at a speed of 60 metres per minute, and
contains more than 10 subsystems such as the main/backup travelling motors, travelling control system,
main/backup hoisting motors, power supply system, etc. It is equipped with approximately 200 different
alarms, out of which 148 alarms actively send signals to an equipment monitoring system. The equipment
operates in four modes/states: auto, manual, maintenance, and off mode. When the equipment is in for
preventive maintenance, it will be in off mode. When unexpected failure events occur, the equipment will
switch from auto mode or manual mode to maintenance mode.
9
It should be noted that, from the patterns listed in Table 1,
The equipment failure events were defined as occurring the CL value is not proportional to Weighted Support.
when the equipment is in maintenance mode for more than Therefore, selecting predictive patterns to construct
15 minutes after it switched from auto mode. For training, prediction rule set as defined in (3.3) is not a simple task and
we used data collected in 2003 within which there were 29 needs to balance between the Support and the Confidence
identified unexpected equipment failures that were used as Level.
target events. For verification, we used data from 2004 At the initial development and testing stage, target events
within which there were 31 identified unexpected equipment were identified as the events of equipment mode changing
failures. from auto mode to maintenance mode and staying in
The prototype of EPAP has been running on the dataset maintenance mode for more than 15 minutes. It did not
with parameters of 7-day Monitoring Time (M) and 8 hours differentiate the target events with the different failure types;
Warning Time (W). Initial test results are presented in terms hence the alarm patterns may not be accurate and meaningful.
With more rigid data pre-process and clear definitions of the
of Support (S), Weighted Support (SW), Confident Level (CL)
failure types, it is expected that we can identify alarm
and Predictive Pattern as shown in column 1, 2, 3 and 4 in
patterns that can be useful for equipment degradation
Table 1 respectively. In these tables and in this section in modelling.
general, the alarm fault patterns are identified by uniquely Further testing was conducted with 7 identified chain
assigned numbers. failure events from 3 elevated transfer vehicles of the same
type. The reason for using the dataset from three vehicles of the same type is that failure events are rare: the same type of failure event happened only 2-3 times over a period of 2 years for a specific piece of equipment.

Each weight factor (W) value in (5.3) generates a unique set of patterns. When W = 0, EPAP generally generated short patterns with one or two features that occur frequently in the event sequence. An example is the alarm pattern with two events <0004, 0018>. Domain experts verified that this alarm pattern is indeed associated with a certain fault event that occurs for most equipment failures. Thus, this short pattern may not hold any predictive value, as it is too general and will give a false alarm most of the time when a short down-time occurs.

When W = 1.75, EPAP generates longer patterns. Thus, W = 1.75 was used for the testing hereafter. Table 1 shows pattern examples with the pattern duration parameter set to ∆t_p = 0.5 hr.

S    SW    CL     Candidate Predictive Pattern
23   157   0.32   0018 0082 0102
5    115   0.71   0006 0031 0098 0099 0161 0081 0031
5    115   0.71   0022 0098 0099 0161 0081 0125
5    115   0.71   0006 0031 0098 0161 0081 0125
16   109   0.75   0071 0018 0082
14   95    0.3    0101 0018 0082
14   95    0.28   0018 0101 0082
4    92    0.8    0000 0006 0098 0018 0081 0125
4    92    0.8    0006 0031 0098 0099 0018 0082
13   88    0.48   0071 0018 0101
13   88    0.42   0037 0004 0018
5    83    0.71   0006 0098 0099 0161 0125
5    83    0.63   0006 0098 0099 0081 0125
5    83    0.71   0006 0098 0099 0081 0082
5    83    0.86   0087 0107 0161 0081 0082
5    83    0.71   0087 0107 0161 0081 0125
5    83    0.83   0006 0098 0161 0081 0082
5    83    0.71   0086 0098 0161 0081 0125
12   82    0.31   0071 0018 0100

Table 1. Preliminary Results: ∆t_p = 0.5 hr, W = 1.75, D = 8 hr, N = 30

With different pattern durations, a total of 6 sets of predictive patterns were generated, each containing 30 patterns.

Figure 6 shows examples of the candidate predictive alarm patterns identified by EPAP with different parameter settings. Those patterns with a confidence level greater than 60% can be used to construct a prediction model. In Table a) of Fig. 6, pattern A1 clearly presents the typical event sequence that happens when handling a critical vehicle problem, including breaking the vehicle travel motor (0154), opening the cabin door (0088), lifting the maintenance supports (0084-0087) and locking the horizontal movement (0099). It shows that EPAP can find valid event patterns. Pattern A3 shows the main travel motor break error (0152) followed by the backup travel motor break error (0154), then the travel drive safety loop fault (0082). This pattern is also quite meaningful.

In Table b) of Fig. 6, we noticed that pattern B1 is the same as A1, which indicates that the duration of pattern A1 is less than or equal to 0.5 hr. Though this pattern's confidence level is 100%, it cannot be used as a predictive pattern, as it indicates a sequence of events that happened when maintenance staff were conducting a check-up activity for chain failure.

Pattern B3 indicates a timeout at a friction deck / a powered deck when the vehicle was loading and unloading. These are symptoms associated with chain failure. Pattern B3 can be selected as a predictive pattern if patterns are required to have a confidence level greater than 70%.

Pattern B4 shows the backup travel motor break error (0154), followed by the travel drive safety loop fault (0082). This is a subset of pattern A3 with higher support. It could be due to the window size of 0.5 hours for the second group: the complete pattern happened in a larger time window. This implies that a suitable window size plays an important role in event pattern searching.

With different parameter settings, more than 10 candidate predictive alarm patterns with confidence level > 60% can be identified. How to determine the predictive alarm patterns that can be used to model equipment degradation is a challenging issue. We believe that domain experts'
involvement will play an important role at the stage of prediction rule set generation, as the background knowledge of field experts is vital.

Pattern   Simple    Weighted
ID        support   support     Confidence   Candidate Predictive Pattern
A1        3         9037.386    1            0154 0083 0084 0085 0086 0087 0099
A2        5         1681.793    0.85         0118 0117 0117 0117
A3        3         3394.113    0.75         0148 0152 0154 0082
A4        5         1681.793    0.62         0118 0118 0117 0117
A5        3         2051.556    0.6          0004 0086 0087
A6        6         2018.151    0.58         0020 0018
A7        7         2354.51     0.57         0136 0135
A8        4         2735.408    0.57         0122 0152 0154
a) Pattern duration 1 hr, warning period 7 hr

Pattern   Simple    Weighted
ID        support   support     Confidence   Candidate Predictive Pattern
B1        3         9037.386    1            0154 0083 0084 0085 0086 0087 0099
B2        4         1345.434    0.8          0194 0063
B3        4         1345.434    0.75         0196 0195
B4        4         1345.434    0.75         0154 0082
B5        6         2018.151    0.58         0020 0018
B6        7         2354.51     0.57         0136 0135
B7        4         2735.408    0.57         0122 0152 0154
B8        5         1681.793    0.57         0118 0117
B9        5         1681.793    0.54         0165 0018
b) Pattern duration 0.5 hr, warning period 7 hr

Figure 6. Examples of candidate predictive alarm patterns for chain failure type

VII. CONCLUSION AND FUTURE WORK

This paper has presented an ACO based pattern identification algorithm, EPAP (Extraction of Precursory Alarm Patterns), for equipment failure prediction. In EPAP, a feature correlation factor and heuristics for start node selection, stopping condition, and termination are developed to balance the convergence and exploration capabilities of the algorithm. We applied EPAP to the real alarm history dataset collected from equipment. The initial results have shown that EPAP is feasible.

Refinement of the algorithm will look at two areas: the ordering connective and the adjustment of parameters in EPAP. The ordering connective is a simplified set of regular expressions [16]. It allows the use of non-linear pattern matching, e.g. A|B is equivalent to AB and BA, where "|" is known as the "unordered" connective. As alarms may not necessarily occur in a strict sequence, using connectives enables a richer set of pattern matching. This will potentially increase the accuracy of the algorithm.

As EPAP's performance, in terms of running time and accuracy, is highly dependent on parameter tuning, parameter adjustment should be carried out to improve the efficiency of the proposed algorithm.

VIII. ACKNOWLEDGEMENT

The authors would like to thank K. C. Ng for his implementation of EPAP and the members of the MEC group of SIMTech for their support and many useful discussions related to the domain problem and data pre-processing.

IX. REFERENCES

1. Ting-Ting Y. Lin, Daniel P. Siewiorek, "Error Log Analysis: Statistical Modeling and Heuristic Trend Analysis", IEEE Transactions on Reliability, Vol. 39, No. 4, Oct. 1990.
2. Gary M. Weiss and Haym Hirsh, "Learning to Predict Rare Events in Event Sequences", Proceedings of the 4th International Conference on Knowledge Discovery and Data Mining, AAAI Press, pp. 359-363, 1998.
3. R. Vilalta, C. V. Apte, J. L. Hellerstein, S. Ma, S. M. Weiss, "Predictive algorithms in the management of computer systems", IBM Systems Journal, Vol. 41, No. 3, 2002.
4. D. Levy and R. Chillarege, "Early Warning of Failures through Alarm Analysis - A Case Study in Telecom Voice Mail Systems", IEEE International Symposium on Software Reliability Engineering (ISSRE 2003), Denver, CO, Nov. 2003.
5. Felix Salfner, Michael Schieschke and Miroslaw Malek, "Predicting Failures of Computer Systems: A Case Study for a Telecommunication System", Proceedings of the 11th IEEE Workshop on Dependable Parallel, Distributed and Network-Centric Systems (DPDNS'06), Rhodes Island, Greece, April 2006.
6. R. K. Sahoo, A. J. Oliner, I. Rish, M. Gupta, J. E. Moreira and S. Ma, "Critical Event Prediction for Proactive Management in Large-scale Computer Clusters", The Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD'03), pp. 426-435, August 2003, Washington DC, USA.
7. R. Duan, R. Prodan, T. Fahringer, "Short Paper: Data Mining-based Fault Prediction and Detection on the Grid", 15th IEEE International Conference on High Performance Distributed Computing (HPDC), pp. 305-308, 2006.
8. H. K. Fadel, L. E. Holloway, "Using SPC and template monitoring method for fault detection and prediction in discrete event manufacturing systems", Proceedings of the 1999 IEEE International Symposium on Intelligent Control/Intelligent Systems and Semiotics, 15-17 Sept. 1999, pp. 150-155.
9. Zhiguo Li, Shiyu Zhou, Suresh Choubey, Crispian Sievenpiper, "Failure event prediction using the Cox proportional hazard model driven by frequent failure signatures", IIE Transactions, online publication date 01 March 2007.
10. H. Mannila, H. Toivonen and A. I. Verkamo, "Discovery of frequent episodes in event sequences", Data Mining and Knowledge Discovery, 1, pp. 259-289, 1997.
11. L. Liao, H. Wang and J. Lee, "A Reconfigurable Watchdog Agent for Machine Health Prognostics", International Journal of COMADEM, Vol. 11, No. 3, July 2008, pp. 2-15.
12. John F. Roddick, Myra Spiliopoulou, "A Survey of Temporal Knowledge Discovery Paradigms and Methods", IEEE Transactions on Knowledge and Data Engineering, Vol. 14, No. 4, Aug. 2002.
13. Ricardo Vilalta, Sheng Ma, "Predicting Rare Events In Temporal Domains", Second IEEE International Conference on Data Mining (ICDM'02), 2002, pp. 474.
14. M. Dorigo, V. Maniezzo and A. Colorni, "Ant System: Optimization by a colony of cooperating agents", IEEE Transactions on Systems, Man, and Cybernetics-Part B, 26(1):29-41, 1996.
15. Chih-Hung Wu, Wei-Ting Lin, Chi-Hua Li, I-Ching Fang, Chia-Hsiang Wu, "Ant Colony Optimization On Building An Online Delayed Diagnosis Detection Support System For Emergency Department", The 7th International Conference on Computational Intelligence in Economics and Finance, Dec. 5-7, 2008, Kainan University, Taoyuan, Taiwan.
16. http://www.regular-expressions.info/ Regular expressions and their uses.
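
As an illustration of the selection rule stated for Figure 6 (patterns with a confidence level greater than 60% are candidates for the prediction model), the following is a minimal sketch in Python. The rows are copied from table a) of Figure 6; the `Pattern` record and the `select_candidates` helper are hypothetical names introduced here and are not part of EPAP.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Pattern:
    """One row of Figure 6 (hypothetical record type, for illustration only)."""
    pattern_id: str
    support: int
    weighted_support: float
    confidence: float
    events: Tuple[str, ...]


# Rows copied from table a) of Figure 6 (pattern duration 1 hr).
FIG6_A: List[Pattern] = [
    Pattern("A1", 3, 9037.386, 1.00, ("0154", "0083", "0084", "0085", "0086", "0087", "0099")),
    Pattern("A2", 5, 1681.793, 0.85, ("0118", "0117", "0117", "0117")),
    Pattern("A3", 3, 3394.113, 0.75, ("0148", "0152", "0154", "0082")),
    Pattern("A4", 5, 1681.793, 0.62, ("0118", "0118", "0117", "0117")),
    Pattern("A5", 3, 2051.556, 0.60, ("0004", "0086", "0087")),
    Pattern("A6", 6, 2018.151, 0.58, ("0020", "0018")),
    Pattern("A7", 7, 2354.510, 0.57, ("0136", "0135")),
    Pattern("A8", 4, 2735.408, 0.57, ("0122", "0152", "0154")),
]


def select_candidates(patterns: List[Pattern], min_confidence: float = 0.60) -> List[Pattern]:
    """Keep patterns whose confidence is strictly greater than the threshold."""
    return [p for p in patterns if p.confidence > min_confidence]


if __name__ == "__main__":
    for p in select_candidates(FIG6_A):
        print(p.pattern_id, p.confidence, " ".join(p.events))
```

The threshold alone does not settle the choice: the discussion of Figure 6 rejects A1/B1 despite its 100% confidence because it reflects a maintenance check-up sequence, which is why expert review is still needed after this filter.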
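
The remark that B4 <0154, 0082> is a subset of A3 found under the shorter 0.5 hr window can be made concrete with a small sketch. It assumes that a pattern "occurs" when its events appear in order (not necessarily adjacent) in a time-sorted alarm log and that the first and last matched events lie within the pattern duration. The paper's exact matching rule is not given in this section, so this is an assumption. The function name `occurs_within` and the timestamps in `LOG` are hypothetical.

```python
from typing import Sequence, Tuple

Event = Tuple[float, str]  # (time in hours, alarm code); log must be sorted by time


def occurs_within(log: Sequence[Event], pattern: Sequence[str], window_hr: float) -> bool:
    """Return True if `pattern` occurs as an ordered subsequence of `log`
    with all matched events at most `window_hr` after the first matched event."""
    if not pattern:
        return True
    for i, (t0, code) in enumerate(log):
        if code != pattern[0]:
            continue
        k = 1  # index of the next pattern element to match
        for t, c in log[i + 1:]:
            if k == len(pattern) or t - t0 > window_hr:
                break
            if c == pattern[k]:
                k += 1
        if k == len(pattern):
            return True
    return False


# Hypothetical timestamps, chosen only to illustrate the window effect.
LOG = [(0.0, "0148"), (0.4, "0152"), (0.7, "0154"), (0.9, "0082")]
A3 = ["0148", "0152", "0154", "0082"]
B4 = ["0154", "0082"]

if __name__ == "__main__":
    print(occurs_within(LOG, A3, 1.0))  # full pattern fits in a 1 hr window
    print(occurs_within(LOG, A3, 0.5))  # but not in a 0.5 hr window
    print(occurs_within(LOG, B4, 0.5))  # its tail <0154, 0082> does
```

With these made-up timestamps the full A3 sequence spans 0.9 hr, so only its tail is found under the 0.5 hr setting, mirroring the explanation given for B4.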
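
The "unordered" connective mentioned in the future work (A|B is equivalent to AB and BA) can be sketched by expanding each unordered group into all of its orderings and then matching any of the resulting plain sequences. This is only one possible reading, not the authors' design: the representation of a group as a tuple, the helper names `expand`, `is_subsequence` and `matches`, and the choice of subsequence (rather than contiguous) matching are all assumptions made here.

```python
from itertools import permutations, product
from typing import Iterable, List, Sequence, Tuple, Union

# A pattern element is a single alarm code, or a tuple of codes joined by "|".
Element = Union[str, Tuple[str, ...]]


def expand(pattern: Sequence[Element]) -> List[Tuple[str, ...]]:
    """Expand unordered groups into every equivalent strictly ordered pattern."""
    choices = []
    for element in pattern:
        if isinstance(element, str):
            choices.append([(element,)])
        else:
            # All orderings of the group; set() removes duplicates from repeated codes.
            choices.append(sorted(set(permutations(element))))
    return [tuple(code for part in combo for code in part) for combo in product(*choices)]


def is_subsequence(sequence: Iterable[str], pattern: Sequence[str]) -> bool:
    """True if `pattern` appears in `sequence` in order, gaps allowed."""
    it = iter(sequence)
    return all(code in it for code in pattern)


def matches(sequence: Sequence[str], pattern: Sequence[Element]) -> bool:
    """True if any ordering permitted by the unordered groups occurs in `sequence`."""
    return any(is_subsequence(sequence, p) for p in expand(pattern))


if __name__ == "__main__":
    print(expand([("A", "B")]))                      # A|B -> AB and BA
    print(matches(["B", "X", "A"], [("A", "B")]))    # BA occurs, so A|B matches
    print(matches(["B", "X", "A"], ["A", "B"]))      # strict order AB does not
```

Enumerating permutations grows factorially with group size, so this form is only practical for the small groups typical of short alarm patterns.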