www.elsevier.com/locate/sigpro
Abstract
This paper presents a novel concept of anomaly detection in complex dynamical systems using tools of Symbolic Dynamics,
Finite State Automata, and Pattern Recognition, where time-series data of the observed variables on the fast time-scale
are analyzed at slow time-scale epochs for early detection of (possible) anomalies. The concept of anomaly detection in
dynamical systems is elucidated based on experimental data that have been generated from an active electronic circuit with
a slowly varying dissipation parameter.
© 2004 Elsevier B.V. All rights reserved.
0165-1684/$ - see front matter © 2004 Elsevier B.V. All rights reserved.
doi:10.1016/j.sigpro.2004.03.011
1116 A. Ray / Signal Processing 84 (2004) 1115 – 1130
• An observable non-stationary behavior of the dynamical system can be associated with anomaly(ies) evolving at a slow time scale.

The theme of anomaly detection, formulated in this paper, is built upon the concepts of Symbolic Dynamics [14,15], Finite State Automata [12], and Pattern Recognition [9] as a means to qualitatively describe the (fast-time-scale) dynamical behavior in terms of symbol sequences [2,4]. Appropriate phase-space partitioning of the dynamical system yields an alphabet for obtaining symbol sequences from time-series data [1,8,13]. Then, tools of computational mechanics [7] are used to identify statistical patterns in these symbolic sequences through construction of a (probabilistic) finite-state machine from each symbol sequence. Transition probability matrices of the finite-state machines, obtained from the symbol sequences, capture the pattern of the system behavior by information compression. For anomaly detection, it suffices that a detectable change in the pattern represents a deviation of the nominal behavior from an anomalous one. The state probability vectors, derived from the respective state transition matrices under the nominal and an anomalous condition, yield a vector measure of the anomaly, which provides more information than a scalar measure such as the complexity measure [20].

In contrast to the ε-machine [7,20], which has an a priori unknown structure and yields optimal pattern discovery in the sense of mutual information [5,11], the state machine adopted in this paper has an a priori known structure that can be freely chosen. Although the proposed approach is suboptimal, it provides a common state machine structure in which the physical significance of each state is invariant under changes in the statistical patterns of symbol sequences. This feature allows unambiguous detection of possible anomalies from symbol sequences at different (slow-time) epochs. The proposed approach is also computationally faster than the ε-machine [20], because it requires significantly fewer floating-point arithmetic operations. These are the motivating factors for introducing this new anomaly detection concept, based on a fixed-structure, fixed-order Markov chain, called the D-Markov machine in the sequel.

The anomaly detection problem is separated into two parts [21]: (i) the forward problem of Pattern Discovery, to identify variations in the anomalous behavior patterns compared to those of the nominal behavior; and (ii) the inverse problem of Pattern Recognition, to infer parametric or non-parametric changes based on the learnt patterns and the observed stationary response. The inverse problem could be ill-posed or have no unique solution. That is, it may not always be possible to identify a unique anomaly pattern based on the observed behavior of the dynamical system. Nevertheless, the feasible range of parameter variation estimates can be narrowed down from the intersection of the information generated from inverse images of the responses under several stimuli.

It is envisioned that complex dynamical systems will acquire the ability of self-diagnostics through usage of the proposed anomaly detection technique, which is analogous to the diagnostic procedure employed in medical practice in the following sense. Similar to the notion of injecting medication or inoculation into a nominally healthy patient, a dynamical system would be excited with known stimuli (chosen in the forward problem) during idle cycles for self-diagnosis and health monitoring. The inferred information on health status can then be used for the purpose of self-healing control or life-extending control [25]. This paper focuses on the forward problem and demonstrates the efficacy of anomaly detection based on experimental data generated from an active electronic circuit with a slowly varying dissipation parameter.

The paper is organized in seven sections and two appendices. Section 2 briefly introduces the notion of nonlinear time-series analysis. Section 3 provides a brief overview of symbolic dynamics and encoding of time-series data. Section 4 presents two ensemble approaches for statistical pattern representation; it also presents information extraction based on the ε-machine [7] and the D-Markov machine, as well as their comparison from different perspectives. Section 5 presents the notion of anomaly measure to quantify the changing patterns of anomalous behavior of the dynamical system from information-theoretic perspectives, followed by an outline of the anomaly detection procedure. Section 6 presents experimental results on a nonlinear active electronic circuit to demonstrate efficacy of the proposed anomaly detection technique. Section 7 summarizes and concludes the paper with recommendations for future research. Appendix A explains the physical significance of the information-theoretic quantities used in Section 4.1 and Section 5.
Appendix B introduces the concept of shift spaces, which is used to delineate the differences between the ε-machine [7] and the D-Markov machine in Section 4.4.

2. Nonlinear time-series analysis

This section presents nonlinear time-series analysis (NTSA), which is needed to extract relevant physical information on the dynamical system from the observed data. NTSA techniques are usually executed in the following steps [1]:

1. Signal separation: The (deterministic) time-dependent signal {y(n) : n ∈ N}, where N is the set of positive integers, is separated from noise, using time-frequency and other types of analysis.

2. Phase space reconstruction: Based on the Takens Embedding theorem [22], time-lagged or delayed variables are used to construct the state vector x(n) in a phase space of dimension dE (which is diffeomorphically equivalent to the attractor of the original dynamical system) as follows:

x(n) = [y(n), y(n + T), …, y(n + (dE − 1)T)],  (1)

where the time lag T is determined using mutual information; and one of the ways to determine dE is the false nearest neighbors test [1].

3. Signal classification: Signal classification and system identification in nonlinear chaotic systems require a set of invariants for each subsystem of interest, followed by comparison of observations with those in the library of invariants. The invariants are properties of the attractor and could be independent of any particular trajectory. These invariants can be divided into two classes: fractal dimensions and Lyapunov exponents. Fractal dimensions characterize the geometrical complexity of the dynamics (e.g., the spatial distribution of points along a system orbit); and Lyapunov exponents describe the dynamical complexity (e.g., stretching and folding of an orbit in the phase space) [18].

4. Modeling and prediction: This step involves determination of the parameters of the assumed model of the dynamics, which is consistent with the invariant classifiers (e.g., Lyapunov exponents and fractal dimensions).

The first three steps show how chaotic systems may be separated from stochastic ones and, at the same time, provide estimates of the degrees of freedom and the complexity of the underlying dynamical system. Based on this information, Step 4 formulates a state-space model that can be used for prediction of anomalies and incipient failures. The functional forms often used in this step include orthogonal polynomials and radial basis functions. This paper has adopted an alternative class of discrete models inspired by Automata Theory, which is built upon the principles of Symbolic Dynamics as described in the following section.

3. Symbolic dynamics and encoding

This section introduces the concept of Symbolic Dynamics and its usage for encoding nonlinear system dynamics from observed time-series data. Let a continuously varying physical process be modelled as a finite-dimensional dynamical system in the setting of an initial value problem:

dx(t)/dt = f(x(t); θ(ts)),  x(0) = x0,  (2)

where t ∈ [0, ∞) denotes the (fast-scale) time; x ∈ Rⁿ is the state vector in the phase space; and θ ∈ Rℓ is the (possibly anomalous) parameter vector varying in (slow-scale) time ts. Sole usage of the model in Eq. (2) may not always be feasible due to unknown parametric and non-parametric uncertainties and noise. A convenient way of learning the dynamical behavior is to rely on the additional information provided by (sensor-based) time-series data [1,4].

A tool for behavior description of nonlinear dynamical systems is based on the concept of formal languages for transitions from smooth dynamics to a discrete symbolic description [2]. The phase space of the dynamical system in Eq. (2) is partitioned into a finite number of cells, so as to obtain a coordinate grid of the space. A compact (i.e., closed and bounded) region Ω ⊂ Rⁿ, within which the (stationary) motion under the specific exogenous stimulus is circumscribed, is identified. Encoding of Ω is accomplished by introducing a partition B ≡ {B0, …, B_{m−1}} consisting of m mutually exclusive (i.e., Bj ∩ Bk = ∅ ∀j ≠ k) and exhaustive (i.e., ⋃_{j=0}^{m−1} Bj = Ω) cells. The dynamical system describes an orbit through the cells of this partition, represented by the time-series data [13].
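The cell-to-symbol encoding can be sketched as follows. The uniform two-cells-per-axis grid over Ω, the circular sample orbit, and all names below are illustrative assumptions of this sketch, not the paper's setup (which later adopts a wavelet-space partition instead):

```python
import numpy as np

def encode(trajectory, edges):
    """Map each phase-space point to the symbol of the partition cell it
    visits. `edges[k]` holds the interior cell boundaries along coordinate k
    (the same number of cells is assumed along each coordinate)."""
    m = len(edges[0]) + 1                       # cells per coordinate
    symbols = np.zeros(len(trajectory), dtype=int)
    for k in range(trajectory.shape[1]):
        # Fold the per-coordinate cell indices into one symbol per point.
        symbols = symbols * m + np.digitize(trajectory[:, k], edges[k])
    return symbols

# Illustrative 2-D orbit confined to the compact region [-1, 1] x [-1, 1].
t = np.linspace(0.0, 8.0 * np.pi, 1000)
orbit = np.column_stack([np.cos(t), np.sin(t)])
edges = [np.array([0.0]), np.array([0.0])]      # 2 x 2 grid -> alphabet size 4
s = encode(orbit, edges)
print(sorted(set(s)))   # the orbit visits all four cells: [0, 1, 2, 3]
```

For real data the cell boundaries would be chosen to respect the attractor's structure rather than placed on a uniform grid; the example only illustrates how the partition B ≡ {B0, …, B_{m−1}} induces a symbol alphabet.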
Having defined a partition of the phase space, the time-series data is converted to a symbol sequence that, in turn, is used for construction of a finite-state machine using the tools of Computational Mechanics [7], as illustrated in Fig. 1. This paper considers two alternative techniques of finite-state machine construction from a given symbol sequence S: (i) the ε-machine formulation [20]; and (ii) a new concept based on Dth order Markov chains, called the D-Markov machine, for identifying patterns based on time-series analysis of the observed data. Both techniques rely on information-theoretic principles (see Appendix A) and are based on computational mechanics [7].

4.1. The ε-machine

Like statistical mechanics [10,4], computational mechanics is concerned with dynamical systems consisting of many partially correlated components. Whereas Statistical Mechanics deals with the local space–time behavior and interactions of the system elements, computational mechanics relies on the joint probability distribution of the phase-space trajectories of a dynamical system. The ε-machine construction [7,20] makes use of the joint probability distribution to infer the information processing being performed by the dynamical system. This is developed using the statistical mechanics of orbit ensembles, rather than focusing on the computational complexity of individual orbits.

Let the symbolic representation of a discrete-time, discrete-valued stochastic process be denoted by S ≡ ⋯S₋₂S₋₁S₀S₁S₂⋯, as defined earlier in Section 4. At any instant t, this sequence of random variables can be split into a sequence ←St of the past and a sequence →St of the future. Assuming conditional stationarity of the symbolic process S (i.e., P[→St | ←St = ←s] being independent of t), the subscript t can be dropped to denote the past and future sequences as ←S and →S, respectively. A symbol string made of the first L symbols of →S is denoted by →S^L. Similarly, a symbol string made of the last L symbols of ←S is denoted by ←S^L.

Prediction of the future →S necessitates determination of its probability conditioned on the past ←S, which requires existence of a function mapping histories ←s to predictions P(→S | ←s). In essence, a prediction imposes a partition on the set of all histories. The cells of this partition contain histories for which the same prediction is made, and are called the effective states of the process under the given predictor. The set of effective states is denoted by R; a random variable for an effective state is denoted by R and its realization by ρ.

The objective of ε-machine construction is to find a predictor that is an optimal partition of the set of histories, which requires invoking two criteria in the theory of Computational Mechanics [6]:

1. Optimal Prediction: For any partition of histories into effective states R, the conditional entropy satisfies H[→S^L | R] ≥ H[→S^L | ←S] ∀L ∈ N; that is, no effective-state partition predicts the future better than remembering the whole past. Effective states R are called prescient if the equality is attained ∀L ∈ N. Therefore, optimal prediction needs the effective states to be prescient.

2. Principle of Occam's Razor: The prescient states with the least complexity are selected, where complexity is defined as the measured Shannon information of the effective states:

H[R] = −Σ_{ρ∈R} P(R = ρ) log P(R = ρ).  (5)

Eq. (5) measures the amount of past information needed for future prediction and is known as the Statistical Complexity, denoted by Cμ(R) (see Appendix A). For each symbolic process S, there is a unique set of prescient states, known as causal states, that minimizes the statistical complexity Cμ(R).

Definition 4.1 (Shalizi et al. [20]): Let S be a (conditionally) stationary symbolic process and ←S be the set of histories. Let a mapping ε from the set ←S of histories into a collection of measurable subsets of ←S be defined as:

ε(←s) ≡ {←s′ ∈ ←S such that P(→S ∈ F | ←S = ←s′) = P(→S ∈ F | ←S = ←s) for every measurable set F of futures}.  (6)

Then, the members of the range of the function ε are called the causal states of the symbolic process S. The ith causal state is denoted by qi and the set of all causal states by Q.
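On finite data, the equivalence relation of Definition 4.1 and Eq. (6) can be approximated by lumping together histories whose empirical next-symbol distributions agree within a tolerance. The sketch below illustrates only this equivalence, not a full ε-machine reconstruction such as the CSSR algorithm discussed below; the history length L and the tolerance are arbitrary choices:

```python
from collections import defaultdict

def causal_groups(seq, L=2, tol=0.05):
    """Group length-L histories whose empirical next-symbol distributions
    coincide within `tol`; each group approximates one causal state."""
    counts = defaultdict(lambda: defaultdict(int))
    for i in range(L, len(seq)):
        counts[seq[i - L:i]][seq[i]] += 1
    alphabet = sorted(set(seq))
    dists = {}
    for hist, c in counts.items():
        n = sum(c.values())
        dists[hist] = tuple(c[a] / n for a in alphabet)
    groups = []
    for hist in sorted(dists):
        for g in groups:
            if max(abs(x - y) for x, y in zip(dists[hist], dists[g[0]])) < tol:
                g.append(hist)
                break
        else:
            groups.append([hist])
    return groups

# A period-2 process: a history ending in '1' always predicts '0' and
# vice versa, so exactly two groups (causal states) emerge.
print(causal_groups("01" * 200, L=2))   # prints [['01'], ['10']]
```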
The random variable corresponding to a causal state is denoted by Q and its realization by q.

Given an initial causal state and the next symbol from the symbolic process, only certain successor causal states are possible. This is represented by the legal transitions among the causal states and the probabilities of these transitions. Specifically, the probability of transition from state qi to state qj on a single symbol s is expressed as:

Tij(s) = P(→S¹ = s, Q = qj | Q = qi) ∀qi, qj ∈ Q;  (7)

Σ_{s∈A} Σ_{qj∈Q} Tij(s) = 1.  (8)

The combination of causal states and transitions is called the ε-machine (also known as the causal state model [20]) of a given symbolic process. Thus, the ε-machine represents the way in which the symbolic process stores and transforms information. It also provides a description of the pattern or regularities in the process, in the sense that the pattern is an algebraic structure determined by the causal states and their transitions. The set of labelled transition probabilities can be used to obtain a stochastic matrix [3] given by T = Σ_{s∈A} T^(s), where the square matrix T^(s) is defined as T^(s) = [Tij(s)] ∀s ∈ A. Denoting p as the left eigenvector of T corresponding to the eigenvalue λ = 1, the probability of being in a particular causal state can be obtained by normalizing p such that ‖p‖ℓ1 = 1. A procedure for construction of the ε-machine is outlined below.

The original ε-machine construction algorithm is the subtree-merging algorithm introduced in [7,6]. The default assumption of this technique was employed by Surana et al. [21] for anomaly detection. This approach has several shortcomings: it lacks a systematic procedure for choosing the algorithm parameters, it may return non-deterministic causal states, and it suffers from slow convergence rates. Recently, Shalizi et al. [20] have developed a code known as Causal State Splitting Reconstruction (CSSR) that is based on state splitting instead of the state merging of the earlier subtree-merging algorithm [7]. The CSSR algorithm starts with a simple model for the symbolic process and elaborates the model components only when statistically justified. Initially, the algorithm assumes the process to be independent and identically distributed (iid), which can be represented by a single causal state and hence has zero statistical complexity and a high entropy rate. At this stage, CSSR uses statistical tests to determine when it must add states to the model, which increases the estimated complexity while lowering the entropy rate hμ (see Appendix A). A key and distinguishing feature of the CSSR code is that it maintains homogeneity of the causal states and deterministic state-to-state transitions as the model grows. Complexity of the CSSR algorithm is O(m^Lmax) + O(m^(2Lmax+1)) + O(N), where m is the size of the alphabet A, N is the data size, and Lmax is the length of the longest history to be considered. Details are given in [20].

4.2. The suboptimal D-Markov machine

This section presents a new alternative approach for representing the pattern in a symbolic process, which is motivated from the perspective of anomaly detection. The core assumption here is that the symbolic process can be represented to a desired level of accuracy as a Dth order Markov chain, by appropriately choosing D ∈ N.

Definition 4.2. A stationary stochastic symbolic process S ≡ ⋯S₋₂S₋₁S₀S₁S₂⋯ is called a Dth order Markov process if the probability of the next symbol depends only on the previous (at most) D symbols, i.e., the following condition holds:

P(Si | Si−1 Si−2 ⋯ Si−D ⋯) = P(Si | Si−1 ⋯ Si−D).  (9)

Alternatively, two symbol strings ←S and ←S′ in the set of histories become indistinguishable whenever the respective substrings ←S^D and ←S′^D, made of the most recent D symbols, are identical.

Definition 4.2 can be interpreted as follows: for all histories ←S and ←S′ such that |←S| ≥ D and |←S′| ≥ D, (←S ∈ ε(←S′) and ←S′ ∈ ε(←S)) iff ←S^D = ←S′^D. Thus, the set {←S^L : L ≥ D} of symbol strings can be partitioned into a maximum of |A|^D equivalence classes, where A is the symbol alphabet, under the equivalence relation defined in Eq. (6). Each symbol string in {←S^L : L ≥ D} either belongs to one of the |A|^D equivalence classes or has a distinct equivalence class.
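The construction in Definition 4.2 translates directly into code: the machine's states are the |A|^D words of the D most recent symbols, transition probabilities can be estimated by frequency counting, and the state probability vector p is the left eigenvector of the resulting stochastic matrix. This is a minimal sketch with illustrative names and data, not the paper's implementation:

```python
import numpy as np
from itertools import product

def d_markov_machine(seq, alphabet, D):
    """D-Markov machine: states are the |A|^D words of the D most recent
    symbols; transition probabilities come from frequency counts."""
    states = [''.join(w) for w in product(alphabet, repeat=D)]
    idx = {w: k for k, w in enumerate(states)}
    T = np.zeros((len(states), len(states)))
    for i in range(D, len(seq)):
        T[idx[seq[i - D:i]], idx[seq[i - D + 1:i + 1]]] += 1
    rows = T.sum(axis=1, keepdims=True)
    return states, np.divide(T, rows, out=np.zeros_like(T), where=rows > 0)

def state_probabilities(T):
    """Stationary state probability vector p: left eigenvector of the
    stochastic matrix T for the eigenvalue 1, normalized to sum to 1."""
    w, V = np.linalg.eig(T.T)
    p = np.real(V[:, np.argmin(np.abs(w - 1.0))])
    return p / p.sum()

states, T = d_markov_machine("aabb" * 500, alphabet="ab", D=2)
p = state_probabilities(T)
print(states)   # ['aa', 'ab', 'ba', 'bb']
print(p)        # the periodic sequence occupies each state 1/4 of the time
```

Unlike the ε-machine, the state set is fixed a priori by |A| and D, so machines estimated at different slow-time epochs are directly comparable state by state.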
For |A| = 2 and D = 2, the finite-state machine construction is (to some extent) analogous to the one-dimensional Ising model of spin-1/2 systems with nearest-neighbor interactions, where the z-component of each spin takes on one of the two possible values s = +1 or s = −1 [10,19]. For |A| ≥ 3, the machine would be analogous to the one-dimensional Potts model, where each spin is directed in the z-direction with |A| different discrete values sk, k ∈ {1, 2, …, |A|}; for a spin-j/2 model, the alphabet size is |A| = j + 1 [4]. For D ≥ 2, the spin interactions extend up to the (D − 1)th neighbor.

4.4. Comparison of the D-Markov machine and the ε-machine

An ε-machine seeks to find the patterns in the time-series data in the form of a finite-state machine, whose states are chosen for optimal prediction of the symbolic process; a finite-state automaton can then be used as a pattern for prediction [20]. An alternative notion of the pattern is one which can be used to compress the given observation. The first notion of the pattern subsumes the second, because the capability of optimal prediction necessarily leads to compression, as seen in the construction of states by lumping histories together. However, the converse is not true in general. For the purpose of anomaly detection, the second notion of pattern is sufficient because the goal is to represent and detect the deviation of an anomalous behavior from the nominal behavior. This has been the motivating factor for proposing an alternative technique, based on the fixed-structure D-Markov machine. It is possible to detect the evolving anomaly, if any, as a change in the probability distribution over the states.

Another distinction between the D-Markov machine and the ε-machine can be seen in terms of finite-type shifts and sofic shifts [15] (see Appendix B). The basic distinction between finite-type shifts and sofic shifts can be characterized in terms of memory: while a finite-type shift has finite-length memory, a sofic shift uses a finite amount of memory in representing the patterns. Hence, finite-type shifts are strictly proper subsets of sofic shifts. While any finite-type shift has a representation as a graph, sofic shifts can be represented as a labelled graph. As a result, the finite-type shift can be considered an "extreme version" of a D-Markov chain (for an appropriate D), and sofic shifts as an "extreme version" of a Hidden Markov process [24], respectively. The shifts have been referred to as "extreme" in the sense that they specify only a set of allowed sequences of symbols (i.e., symbol sequences that are actually possible), but not the probabilities of these sequences. Note that a Hidden Markov model consists of an internal D-order Markov process that is observed only through a function of its internal-state sequence. This is analogous to sofic shifts, which are obtained by a labelling function on the edges of a graph that otherwise denotes a finite-type shift. Thus, in these terms, an ε-machine infers a Hidden Markov Model (sofic shift) for the observed process. In contrast, the D-Markov model proposed in this paper infers a (finite-type shift) approximation of the (sofic shift) ε-machine.

5. Anomaly measure and detection

The machines described in Sections 4.1 and 4.2 recognize patterns in the behavior of a dynamical system that undergoes anomalous behavior. In order to quantify changes in the patterns that are representations of evolving anomalies, we induce an anomaly measure on these machines, denoted by M. The anomaly measure M can be constructed based on the following information-theoretic quantities: entropy rate, excess entropy, and complexity measure of a symbol string S (see Appendix A).

• The entropy rate hμ(S) quantifies the intrinsic randomness in the observed dynamical process.
• The excess entropy E(S) quantifies the memory in the observed process.
• The statistical complexity Cμ(S) of the state machine captures the average memory requirements for modelling the complex behavior of a process.

Given two symbol strings S and S0, it is possible to obtain a measure of anomaly by adopting any one of the following three alternatives:

M(S, S0) = |hμ(S) − hμ(S0)|, or |E(S) − E(S0)|, or |Cμ(S) − Cμ(S0)|.
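For a finite-state machine with stationary state probability vector p and transition matrix T, the entropy rate is the standard Markov-chain quantity hμ = −Σᵢ pᵢ Σⱼ Tᵢⱼ log₂ Tᵢⱼ, so the first of the three alternatives can be sketched as follows; the two machines below are illustrative, not experimental data:

```python
import numpy as np

def entropy_rate(T, p):
    """Entropy rate (bits/symbol) of a machine with transition matrix T
    and stationary state probability vector p."""
    logT = np.zeros_like(T)
    mask = T > 0
    logT[mask] = np.log2(T[mask])
    return float(-np.sum(p[:, None] * T * logT))

T_nom  = np.array([[0.5, 0.5], [0.5, 0.5]])   # nominal: fair-coin process
T_anom = np.array([[0.9, 0.1], [0.9, 0.1]])   # anomalous: strongly biased
p_nom  = np.array([0.5, 0.5])                 # stationary vectors of T_nom
p_anom = np.array([0.9, 0.1])                 # and T_anom, respectively
M = abs(entropy_rate(T_nom, p_nom) - entropy_rate(T_anom, p_anom))
print(round(M, 3))   # prints 0.531
```

The same skeleton applies to |E(S) − E(S0)| and |Cμ(S) − Cμ(S0)| once those quantities have been estimated for each machine.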
Note that each of the anomaly measures defined above is a pseudo-metric [17]. For example, let us consider two periodic processes with unequal periods, represented by S and S0. For both processes, hμ = 0, so that M(S, S0) = 0 for the first of the above three options, even though S ≠ S0.

The above measures are obtained through scalar-valued functions defined on a state machine and do not exploit the rich algebraic structure represented in the state machine. For example, the connection matrix T associated with the ε-machine (see Section 4.1) can be treated as a vector representation of any possible anomalies in the dynamical system. The induced 2-norm of the difference between the T-matrices for the two state machines can then be used as a measure of anomaly, i.e., M(S, S0) = ‖T − T0‖2. Such a measure, used in [21], was found to be effective. However, there is some subtlety in using this measure on ε-machines, because ε-machines do not guarantee that the machines formulated from the symbol sequences S and S0 have the same number of states; and these states do not necessarily have similar physical significance. In general, T and T0 may have different dimensions and different physical significance. However, by encoding the causal states, T could be embedded in a larger matrix, and an induced norm of the difference between the T-matrices for these two machines can be defined. Alternatively, a (vector) measure of anomaly can be derived directly from the stochastic matrix T as the left eigenvector p corresponding to the unit eigenvalue of T, which is the state probability vector under a stationary condition.

This paper has adopted the D-Markov machine approach, described in Section 4.2, to build the state machines. Since D-Markov machines have a fixed state structure, the state probability vector p associated with the state machine has been used for a vector representation of anomalies, leading to the anomaly measure M(S, S0) as a distance function between the respective probability vectors p and p0 (which are of identical dimensions), or any other appropriate functional.

5.1. Anomaly detection procedure

Having discussed various tools and techniques, this section outlines the steps of the forward problem and the inverse problem described in Section 1. Following are the steps for the forward problem:

(F1) Selection of an appropriate set of input stimuli.
(F2) Signal–noise separation, time interval selection, and phase-space construction.
(F3) Choice of a phase-space partitioning to generate the symbol alphabet and symbol sequences.
(F4) State machine construction using the generated symbol sequence(s) and determination of the connection matrix.
(F5) Selection of an appropriate metric for the anomaly measure M.
(F6) Formulation and calibration of a (possibly non-parametric) relation between the computed anomaly measure and the known physical anomaly under which the time-series data were collected at different (slow-time) epochs.

Following are the steps for the inverse problem:

(I1) Excitation with the known input stimuli selected in the forward problem.
(I2) Generation of the stationary behavior as time-series data for each input stimulus at different (slow-time) epochs.
(I3) Embedding of the time-series data in the phase space determined for the corresponding input stimuli in Step F2 of the forward problem.
(I4) Generation of the symbol sequence using the same phase-space partition as in Step F3 of the forward problem.
(I5) State machine construction using the symbol sequence and determination of the anomaly measure.
(I6) Detection and identification of an anomaly, if any, based on the computed anomaly measure and the relation derived in Step F6 of the forward problem.

6. Application to an active electronic circuit

This section illustrates an application of the D-Markov machine concept for anomaly detection on an experimental apparatus that consists of an active electronic circuit. The apparatus implements a second-order non-autonomous, forced Duffing equation in real time [23]. The governing equation, with a cubic nonlinearity in one of the state variables, is given by Eq. (14).
d²x(t)/dt² + β(ts) dx(t)/dt + x(t) + x³(t) = A cos ωt.  (14)

Fig. 4. Time plots for electronic circuit experiment (panels at β = 0.10 and β = 0.28).

The dissipation parameter β(ts), realized in the form of a slowly varying circuit element, changes on a time scale far slower than the fast time scale t at which the dynamical system is excited. The goal is to detect, at an early stage, changes in β(ts) that are associated with the anomaly.

In the forward problem, the first task is the selection of appropriate input stimuli. For illustration purposes, we have used the stimulus with amplitude A = 22.0 and frequency ω = 5.0 in this paper. Changes in the stationary behavior of the electronic circuit take place starting from β ≈ 0.10, with significant changes occurring in the narrow range 0.28 < β < 0.29. The stationary behavior of the system response for this input stimulus is obtained for several values of β in the range of 0.10–0.35.

The four plates in Fig. 3 exhibit four phase plots for the values of the parameter β at 0.10, 0.27, 0.28, and 0.29, respectively, relating the phase variable of electrical charge, which is proportional to the voltage across one of the capacitors in the electronic circuit, and its time derivative (i.e., the instantaneous current). While a small difference between the plots for β = 0.10 and 0.27 is observed, there is no clearly visible difference between the plots for β = 0.27 and 0.28 in Fig. 3. However, the phase plots for β = 0.28 and 0.29 display a very large difference, indicating period doubling possibly due to the onset of bifurcation. Fig. 4 displays time responses of the stationary behavior of the phase variable for different values of the parameter β corresponding to the phase plots in Fig. 3. The plots in Fig. 4 are phase-aligned for better visibility. (Note that the proposed anomaly detection method does not require phase alignment; equivalently, the finite-state machine of Fig. 2 can be started from any arbitrary state corresponding to no specific initial condition.) While the time responses for β = 0.27 and 0.28 are indistinguishable, there is a small difference between those for β = 0.27 and 0.10.
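The stationary responses discussed above can be reproduced qualitatively by integrating Eq. (14) directly. The values A = 22.0 and ω = 5.0 follow the text, while the fixed-step RK4 scheme, step size, duration, and initial condition are arbitrary choices of this sketch:

```python
import numpy as np

def duffing_response(beta, A=22.0, omega=5.0, dt=1e-3, n_steps=20_000):
    """Integrate x'' + beta*x' + x + x^3 = A*cos(omega*t), Eq. (14), with a
    fixed-step RK4 scheme; beta is held constant over the run, i.e. the
    slow-scale variation is frozen at one epoch."""
    def f(t, s):
        x, v = s
        return np.array([v, A * np.cos(omega * t) - beta * v - x - x**3])
    s = np.array([1.0, 0.0])                  # arbitrary initial condition
    xs = np.empty(n_steps)
    for i in range(n_steps):
        t = i * dt
        k1 = f(t, s)
        k2 = f(t + dt / 2, s + dt / 2 * k1)
        k3 = f(t + dt / 2, s + dt / 2 * k2)
        k4 = f(t + dt, s + dt * k3)
        s = s + dt / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
        xs[i] = s[0]
    return xs

x = duffing_response(beta=0.10)
print(x[-5:])   # post-transient samples of the phase variable
```

Sweeping beta over the operating range 0.10–0.35 and symbolizing each run mimics the forward-problem data collection at successive slow-time epochs.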
Similar to the phase plots in Fig. 3, the time responses for β = 0.28 and 0.29 display the existence of period doubling due to a possible bifurcation.

Additional exogenous stimuli have been identified, which also lead to significant changes in the stationary behavior of the electronic system dynamics for other ranges of β. For example, with the same amplitude A = 22.0, stimuli at the excitation frequencies ω = 2.0 and ω = 1.67 (not shown in Figs. 3 and 4) detect small changes in the ranges 0.18 < β < 0.20 and 0.11 < β < 0.12, respectively [21]. These observations reveal that exogenous stimuli at different excitation frequencies can be effectively used for detection of small changes in β over an operating range (e.g., 0.10 < β < 0.35).

Having obtained the phase plots from the time-series data, the next step is to find a partition of the phase space for symbol sequence generation. This is a difficult task, especially if the time-series data is noise-contaminated. Several methods of phase-space partitioning have been suggested in the literature (for example, [1,8,13]). Apparently, there exists no well-established procedure for phase-space partitioning of complex dynamical systems; this is a subject of active research. In this paper, we have introduced a new concept of symbol sequence generation, which uses the wavelet transform to convert the time-series data to time-frequency data for generating the symbol sequence. The graphs of wavelet coefficients versus scale at selected time shifts are stacked, starting with the smallest value of scale and ending with its largest value, and then back from the largest value to the smallest value of the scale at the next instant of time shift. The resulting scale-series data in the wavelet space is analogous to the time-series data in the phase space. Then, the wavelet space is partitioned into segments of coefficients on the ordinate, separated by horizontal lines. The number of segments in a partition is equal to the size of the alphabet, and each segment is associated with a symbol in the alphabet. For a given stimulus, partitioning of the wavelet space must remain invariant at all epochs of the slow time scale. Nevertheless, for different stimuli, the partitioning could be chosen differently. (The concept of the proposed wavelet-space partitioning would require significant theoretical research before its acceptance for application to a general class of dynamical systems for anomaly detection; and its efficacy needs to be compared with that of existing phase-space partitioning methods such as false nearest neighbor partitioning [13].)

The procedure described in Section 4.2 constructs a D-Markov machine and obtains the connection matrix T and the state vector p from the symbol sequence corresponding to each β. For this analysis, the wavelet space generated from each data set has been partitioned into eight (8) segments, which makes the alphabet size |A| = 8 for generating symbol sequences from the scale-series data. At each value of β, the generated symbol sequence has been used to construct several D-Markov machines, starting with D = 1 and proceeding to higher integers. It is observed that the dominant probabilities of the state vector (albeit having different dimensions) for different values of D are virtually similar. Therefore, a fixed-structure D-Markov machine with alphabet size |A| = 8 and depth D = 1, which yields the number of states |A|^D = 8, is chosen to generate state probability (p) vectors for the symbol sequences.

Fig. 5. Anomaly measure versus parameter β.

The electronic circuit system is assumed to be at the nominal condition for the dissipation parameter β = 0.10, which is selected as the reference point for calculating the anomaly measure. The anomaly measure M is computed based on two different computation methods as discussed in Section 5. Fig. 5 exhibits three plots of the normalized anomaly measure M versus the dissipation parameter β, where M is computed based on different metrics. In each case, the reference point of the nominal condition is represented by the parameter β = 0.10.
1126 A. Ray / Signal Processing 84 (2004) 1115 – 1130
case, the reference point of nominal condition is rep- Symbolic Dynamics, Finite State Automata, and Pat-
resented by the parameter ) = 0:10. The 3rst plot, tern Recognition. It is assumed that dynamical sys-
shown in solid line, shows M expressed as the angle tems under consideration exhibit nonlinear dynami-
(in radians) between the p vectors of the state ma- cal behavior on two time scales. Anomalies occur
chines under nominal and anomalous conditions, i.e., on a slow time scale that is (possibly) several or-
M = “(pref ; p) ≡ cos−1 ( p|p ref ;p|
ref ‘2 p ‘2
). The remain- ders of magnitude larger than the fast time scale of
ing two plots, one in dashed line and the other in dot- the system dynamics. It is also assumed that the un-
ted line, show the anomaly measure expressed as the forced dynamical system (i.e., in the absence of ex-
‘1 -norm and ‘2 -norm of the diDerence between the ternal stimuli) is stationary at the fast time scale and
p vectors of the state machines under nominal and that any non-stationary behavior is observable only on
an anomalous conditions, i.e., M = pref − p‘1 and the slow time scale. This concept of small change de-
M = pref − p‘2 , respectively. In each of the three tection in dynamical systems is elucidated on an ac-
plots, M is normalized to unity for better compari- tive electronic circuit representing the forced Du7ng
son. All three plots in Fig. 5 show gradual increase equation with a slowly varying dissipation parameter.
in the anomaly measure M for ) in the approximate The time-series data of stationary phase trajectories
range of 0.10–0.25. At ) ≈ 0:25 and onwards, M are collected to create the respective symbolic dynam-
starts increasing at a faster rate and 3nally saturates ics (i.e., symbol sequences) using wavelet transform.
at ) ¿ 0:29. The large values of anomaly measure at The resulting state probability vector of the transition
) = 0:29 and beyond indicate the occurrence of pe- matrix is considered as the vector representation of
riod reduction as seen in Figs. 3 and 4. This abrupt a phase trajectory’s stationary behavior. The distance
disruption, preceded by gradual changes, is analogous between any two such vectors under the same stimu-
to a phase transition in the thermodynamic sense [4], lus is the measure of anomaly that the system has been
which can also be interpreted as a catastrophic disrup- subjected to. This vector representation of anomalies
tion in a physical process. Hence, observation of mod- is more powerful than a scalar measure. The major
est changes in the anomaly measure may provide very conclusion of this research is that Symbolic Dynam-
early warnings for a forthcoming catastrophic failure ics along with the stimulus-response methodology and
as indicated by the gradual change in the )−M curve. having a vector representation of anomaly is eDective
Following the steps (I1)–(I5) of the inverse prob- for early detection of small anomalies.
lem in Section 5.1, the state probability vector p can be The D-Markov machine, proposed for anomaly
obtained for the stationary behavior under the known detection, is a suboptimal approximation of the
stimulus. The a priori information on the anomaly -machine. It is important that this approximation is
measure, generated in the step F6 of the forward prob- a su7ciently accurate representation of the nominal
lem in the Section 5.1, can then be used to determine behavior. Research in this direction is in progress and
the possible range in which ) lies. Solutions of the for- the results will be presented in a forthcoming publi-
ward problem can generate more information on dif- cation. Further theoretical research is recommended
ferent ranges of ) under diDerent input stimuli. Thus, in the following areas:
the range of the unknown parameter ) can be further
narrowed down by repeating this step for other known • Separation of information-bearing part of the signal
stimuli as reported earlier [21]. This ensemble of in- from noise.
formation provides inputs for the inverse problem for • Identi3cation of a relevant submanifold of the phase
detecting anomalies based on the sensor data collected space and its partitioning to generate a symbol al-
in real time, during the operation of machineries. phabet.
• Identi3cation of appropriate wavelet basis functions
for symbol generation and construction of a map-
7. Summary and conclusions ping from the wavelet space to the symbol space.
• Selection of the minimal D for the D-Markov ma-
This paper presents a novel concept of anomaly chine and identi3cation of the irreducible submatrix
detection in complex systems based on the tools of of the state transition matrix that contains relevant
information on the anomalous behavior of the dynamical system.

Acknowledgements

The author wishes to thank Dr. Cosma Shalizi for making the CSSR code available and also for clarifying some of the underlying concepts of the ε-machine. The author is also thankful to Dr. Matthew Kennel for providing him with the software code on phase-space partitioning using symbolic false nearest neighbors. The author acknowledges the technical contributions of Mr. Amit Surana and Mr. David Friedlander in developing the anomaly detection concept, and of Mr. Venkatesh Rajagopalan for the design and fabrication of the experimental apparatus on active electronic circuits.

Appendix A. Information-theoretic quantities

This appendix introduces the concepts of standard information-theoretic quantities: entropy rate, excess entropy, and statistical complexity [11], which are used to establish the anomaly measure in Section 5.

Entropy rate (h_μ): The entropy rate of a symbol string S is given by the Shannon entropy as follows:

$h_\mu = \lim_{L \to \infty} \frac{H[L]}{L}$,  (A.1)

where $H[L] \equiv -\sum_{s^L \in \mathcal{A}^L} P(s^L) \log_2 P(s^L)$ is the Shannon entropy of all L-blocks (i.e., symbol sequences of length L) in S. The limit is guaranteed to exist for a stationary process [5]. The entropy rate quantifies the irreducible randomness in sequences produced by a source: the randomness that remains after the correlations and structures in longer and longer sequence blocks are taken into account. For a symbol string S represented as an ε-machine, $h_\mu = H[\overrightarrow{S}^{\,1} \mid \mathcal{S}]$.

Excess entropy (E): The excess entropy of a symbol string S is defined as

$E = \sum_{L=1}^{\infty} [h_\mu(L) - h_\mu]$,  (A.2)

where $h_\mu(L) \equiv H[L] - H[L-1]$ is the estimate of how random the source appears if only L-blocks in S are considered. Excess entropy measures how much additional information must be gained about the sequence in order to reveal the actual per-symbol uncertainty h_μ, and thus measures the difficulty in the prediction of the process. Excess entropy has alternate interpretations, such as: it is the intrinsic redundancy in the process; geometrically, it is a sub-extensive part of H(L); and it represents how much historical information stored in the present is communicated to the future.

Statistical complexity (C_μ) [11]: The information of the probability distribution of causal states, as measured by Shannon entropy, yields the minimum average amount of memory needed to predict future configurations. This quantity is the statistical complexity of a symbol string S, defined by Crutchfield and Young [7] as

$C_\mu \equiv H(\mathcal{S}) = -\sum_{k=0}^{n-1} \Pr(\mathcal{S}_k) \log_2 \Pr(\mathcal{S}_k)$,  (A.4)

where n is the number of states of the finite-state machine constructed from the symbol string S. As shown in [11], $E \leq C_\mu$ in general, and $C_\mu = E + D h_\mu$.

Appendix B. Finite-type shift and sofic shift

This appendix very briefly introduces the concept of shift spaces, with emphasis on finite-type shifts and sofic shifts that respectively characterize the D-Markov machine and the ε-machine described in Section 4.4. The shift space formalism is a systematic way to study the properties of the underlying grammar, which represents the behavior of dynamical systems encoded through symbolic dynamics. The different shift spaces provide increasingly powerful classes of models that can be used to represent the patterns in the dynamical behavior.

Definition 2.1. Let A be a finite alphabet. The full A-shift is the collection of all bi-infinite sequences of symbols from A and is denoted by

$\mathcal{A}^{\mathbb{Z}} = \{x = (x_i)_{i \in \mathbb{Z}} : x_i \in \mathcal{A}\ \forall i \in \mathbb{Z}\}$.  (A.5)

Definition 2.2. The shift map σ on the full shift $\mathcal{A}^{\mathbb{Z}}$ maps a point x to a point y = σ(x) whose ith coordinate is $y_i = x_{i+1}$.

A block is a finite sequence of symbols over A. Let $x \in \mathcal{A}^{\mathbb{Z}}$ and w be a block over A. Then w occurs in x if ∃ indices i and j such that $w = x_{[i,j]} = x_i x_{i+1} \cdots x_j$. Note that the empty block occurs in every x.

Let F be a collection of blocks, i.e., finite sequences of symbols over A. For any such F, let us define $X_{\mathcal{F}}$ to be the subset of sequences in $\mathcal{A}^{\mathbb{Z}}$ that do not contain any block in F.

Definition 2.3. A shift space is a subset X of a full shift $\mathcal{A}^{\mathbb{Z}}$ such that $X = X_{\mathcal{F}}$ for some collection F of forbidden blocks over A.

For a given shift space, the collection F is at most countable (i.e., finite or countably infinite) and is non-unique (i.e., there may be many such F's describing the shift space). As subshifts of full shifts, these spaces share a common feature called shift invariance. Since the constraints on points are given in terms of forbidden blocks alone and do not involve the coordinate at which a block might be forbidden, it follows that if $x \in X_{\mathcal{F}}$, then so are its shifts σ(x) and σ⁻¹(x). Therefore $\sigma(X_{\mathcal{F}}) = X_{\mathcal{F}}$, which is a necessary condition for a subset of $\mathcal{A}^{\mathbb{Z}}$ to be a shift space. This property introduces the concept of shift dynamical systems.

Definition 2.4. Let X be a shift space and $\sigma_X : X \to X$ be the shift map. Then $(X, \sigma_X)$ is known as a shift dynamical system.

The shift dynamical system mirrors the dynamics of the original dynamical system from which it is generated (by symbolic dynamics). Several examples of shift spaces are given in [15].

Rather than describing a shift space by specifying the forbidden blocks, it can also be specified by allowed blocks. This leads to the notion of the language of a shift.

Definition 2.5. Let X be a subset of a full shift, and let $\mathcal{B}_n(X)$ denote the set of all n-blocks (i.e., blocks of length n) that occur in X. The language of the shift space X is defined as

$\mathcal{B}(X) = \bigcup_{n=0}^{\infty} \mathcal{B}_n(X)$.  (A.6)

Sliding block codes: Let X be a shift space over A; then $x \in X$ can be transformed into a new sequence $y = \cdots y_{-1} y_0 y_1 \cdots$ over another alphabet U as follows. Fix integers m and n such that $-m \leq n$. To compute $y_i$ of the transformed sequence, we use a function Φ that depends on the "window" of coordinates of x from $i - m$ to $i + n$. Here $\Phi : \mathcal{B}_{m+n+1}(X) \to \mathcal{U}$ is a fixed block map, called an (m + n + 1)-block map, from the allowed (m + n + 1)-blocks in X to symbols in U. Therefore,

$y_i = \Phi(x_{i-m} x_{i-m+1} \cdots x_{i+n}) = \Phi(x_{[i-m,\,i+n]})$.  (A.7)

Definition 2.6. Let Φ be a block map as defined in Eq. (A.7). Then the map $\phi : X \to \mathcal{U}^{\mathbb{Z}}$ defined by $y = \phi(x)$, with $y_i$ given by Eq. (A.7), is called the sliding block code with memory m and anticipation n induced by Φ.

Definition 2.7. Let X and Y be shift spaces, and $\phi : X \to Y$ be a sliding block code.

• If $\phi : X \to Y$ is onto, then φ is called a factor code from X onto Y.
• If $\phi : X \to Y$ is one-to-one, then φ is called an embedding of X into Y.
• If $\phi : X \to Y$ has an inverse (i.e., ∃ a sliding block code $\psi : Y \to X$ such that $\psi(\phi(x)) = x\ \forall x \in X$ and $\phi(\psi(y)) = y\ \forall y \in Y$), then φ is called a conjugacy from X to Y.

If ∃ a conjugacy from X to Y, then Y can be viewed as a copy of X, sharing all properties of X. Therefore, a conjugacy is often called a topological conjugacy in the literature.

Finite-type shifts: We now introduce the concept of the finite-type shift, which is the structure of the shift space in the D-Markov machine proposed in Section 4.2.

Definition 2.8. A finite-type shift is a shift space that can be described by a finite collection of forbidden blocks (i.e., X having the form $X_{\mathcal{F}}$ for some finite set F of blocks).

An example of a finite-type shift is the golden mean shift, where the alphabet is A = {0, 1} and the forbidden set is F = {11}. That is, $X = X_{\mathcal{F}}$ is the set of all binary sequences with no two consecutive 1's.
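Membership in a finite-type shift can be tested directly from its forbidden blocks: a finite word belongs to the language iff no forbidden block occurs in it as a substring. The following minimal sketch (function name hypothetical, not from the paper) illustrates this for the golden mean shift:

```python
def in_language(word, forbidden_blocks):
    """Return True iff no forbidden block occurs in the finite word.

    For a finite-type shift X_F, this tests whether a finite window of a
    point avoids every block in the (finite) forbidden collection F.
    """
    return not any(block in word for block in forbidden_blocks)

# Golden mean shift: alphabet {0, 1}, forbidden set F = {"11"}.
GOLDEN_MEAN_F = {"11"}
```

For example, `in_language("0100101", GOLDEN_MEAN_F)` holds, while `in_language("0110", GOLDEN_MEAN_F)` fails because the block 11 occurs. Since the golden mean shift is 1-step (all forbidden blocks have length 2), checking every length-2 window suffices.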
Definition 2.9. A finite-type shift is M-step, or has memory M, if it can be described by a collection of forbidden blocks all of which have length M + 1.

The properties of a finite-type shift are listed below:

• If X is a finite-type shift, then ∃ M ≥ 0 such that X is M-step.
• The language of the finite-type shift is characterized by the property that if two words overlap, then they can be glued together along their overlap to form another word in the language. Thus, a shift space X is an M-step finite-type shift iff whenever uv, vw ∈ B(X) and |v| ≥ M, then uvw ∈ B(X).
• A shift space that is conjugate to a finite-type shift is itself a finite-type shift.
• A finite-type shift can be represented by a finite, directed graph and produces the collection of all bi-infinite walks (i.e., sequences of edges) on the graph.

Sofic shifts: The sofic shift is the structure of the shift space in the ε-machines [7,20] in Section 4.1. Let us label the edges of a graph with symbols from an alphabet A, where two or more edges are allowed to have the same label. Every bi-infinite walk on the graph yields a point in $\mathcal{A}^{\mathbb{Z}}$ by reading the labels of its edges, and the set of all such points is called a sofic shift. A presentation of a sofic shift X is a labelled graph G for which $X_G = X$.

An example of a sofic shift is the even shift, which is the set of all binary sequences with only an even number of 0's between any two 1's. That is, the forbidden set F is the collection $\{1 0^{2n+1} 1 : n \geq 0\}$.

Some of the salient characterizations of sofic shifts are presented below [15]:

• Every finite-type shift qualifies as a sofic shift.
• A shift space is sofic iff it is a factor of a finite-type shift.
• The class of sofic shifts is the smallest collection of shift spaces that contains all finite-type shifts and also contains all factors of each space in the collection.
• A sofic shift that is not itself a finite-type shift is called strictly sofic. For example, the even shift is strictly sofic [15].
• A factor of a sofic shift is a sofic shift.
• A shift space conjugate to a sofic shift is itself sofic.
• A distinction between finite-type shifts and sofic shifts can be characterized in terms of memory. While finite-type shifts use finite-length memory, sofic shifts require only a finite amount of memory. In contrast, context-free shifts require an infinite amount of memory [12].
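The even shift, being sofic, is specified by a labelled graph rather than a finite forbidden-block list. A common two-state presentation (assumed here; the paper does not draw it) has states {a, b} with a self-loop at a labelled 1, an edge a→b labelled 0, and an edge b→a labelled 0, so runs of 0's between 1's must have even length. A finite word occurs in the sofic shift iff some labelled walk on the graph reproduces it; a minimal sketch with hypothetical names:

```python
def graph_accepts(word, edges):
    """Nondeterministically walk a labelled graph: True iff some path
    (starting from any state) reproduces the word's label sequence,
    i.e., the word occurs in the sofic shift the graph presents."""
    states = set(edges)  # a walk may start at any state of the graph
    for symbol in word:
        # follow every edge whose label matches the next symbol
        states = {dst for src in states
                  for label, dst in edges[src] if label == symbol}
        if not states:   # no walk can produce this word
            return False
    return True

# Two-state presentation of the even shift described above.
EVEN_SHIFT = {"a": [("1", "a"), ("0", "b")],
              "b": [("0", "a")]}
```

For example, `graph_accepts("1001", EVEN_SHIFT)` holds (two 0's between the 1's), whereas `graph_accepts("101", EVEN_SHIFT)` fails, since the word contains the forbidden block $1 0^1 1$.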
[10] D.P. Feldman, Computational mechanics of classical spin systems, Ph.D. Dissertation, University of California, Davis, 1998.
[11] D.P. Feldman, J.P. Crutchfield, Discovering non-critical organization: statistical mechanical, information theoretic, and computational views of patterns in one-dimensional spin systems, Santa Fe Institute Working Paper 98-04-026, 1998.
[12] J.E. Hopcroft, R. Motwani, J.D. Ullman, Introduction to Automata Theory, Languages, and Computation, 2nd Edition, Addison-Wesley, Boston, 2001.
[13] M.B. Kennel, M. Buhl, Estimating good discrete partitions from observed data: symbolic false nearest neighbors, http://arxiv.org/PS_cache/nlin/pdf/0304/0304054.pdf, 2003.
[14] B.P. Kitchens, Symbolic Dynamics: One-sided, Two-sided and Countable State Markov Shifts, Springer, New York, 1998.
[15] D. Lind, B. Marcus, An Introduction to Symbolic Dynamics and Coding, Cambridge University Press, UK, 1995.
[16] M. Markou, S. Singh, Novelty detection: a review—parts 1 and 2, Signal Processing 83 (2003) 2481–2521.
[17] A.W. Naylor, G.R. Sell, Linear Operator Theory in Engineering and Science, Springer, New York, 1982.
[18] E. Ott, Chaos in Dynamical Systems, Cambridge University Press, UK, 1993.
[19] R.K. Pathria, Statistical Mechanics, 2nd Edition, Butterworth-Heinemann, Oxford, UK, 1998.
[20] C.R. Shalizi, K.L. Shalizi, J.P. Crutchfield, An algorithm for pattern discovery in time series, SFI Working Paper 02-10-060, 2002.
[21] A. Surana, A. Ray, S.C. Chin, Anomaly detection in complex systems, in: Fifth IFAC Symposium on Fault Detection, Supervision and Safety of Technical Processes, Washington, DC, 2003.
[22] F. Takens, Detecting strange attractors in turbulence, in: D. Rand, L.S. Young (Eds.), Proceedings of the Symposium on Dynamical Systems and Turbulence, Warwick, 1980, Lecture Notes in Mathematics, Vol. 898, Springer, Berlin, 1981, p. 366.
[23] J.M.T. Thompson, H.B. Stewart, Nonlinear Dynamics and Chaos, Wiley, Chichester, UK, 1986.
[24] D.R. Upper, Theory and algorithms for hidden Markov models and generalized hidden Markov models, Ph.D. Dissertation in Mathematics, University of California, Berkeley, 1997.
[25] H. Zang, A. Ray, S. Phoha, Hybrid life extending control of mechanical systems: experimental validation of the concept, Automatica 36 (2000) 23–36.