ABSTRACT

At the heart of fault management is alarm correlation. The approach that is used in commercial telecom systems, as specified by international standards, is correlation through monitoring, filtering and masking event alarms. This is vital since it reduces the amount of event instances presented, the symptoms of the fault. The next objective is to automatically present the actual fault. Artificial Intelligence (AI) offers the potential of determining the fault or multiple faults, yet AI has its own problems including dealing with the vast amounts of potential variations in the data. Not least of these problems is the mistrust, by engineers, in the 'non-deterministic' AI results produced. This paper presents a simple first-stage alarm correlator to reduce the data for both induction and deduction.

Keywords: Intelligent Systems, Expert Systems, Event Correlation, Telecommunications.

1 INTRODUCTION

The authors have been working in the area of using AI for fault management and alarm correlation using Bayesian Belief Networks (BBN) as the technique of choice. The BBN offers the advantages of ANN (Artificial Neural Network) while also being justifiable in that the network is visible and not a black box. Yet, like other AI approaches, dealing with the sheer volume of data creates a complex learning process and subsequent problems with usage in a real-time fault management system.

To assist in reducing the number of alarm events that need to be considered in both stages (learning and real-time use), the authors propose developing a first-stage alarm correlator, which is presented in this paper. The first approach would be to develop rules from the ITU-T recommendations. After testing, other sources could then be considered, such as defining rules with expert assistance using visualisation tools such as NxGantt[1]. Then, once the approach was well proven and had an effect on reducing the size of the BBN, to test the concept of extracting correlations from the actual BBN for those variables that have an exceedingly high probability of cause and effect.
[Figure: alarm processing chain. A raw alarm passes through Alarm Monitoring (monitored state: ACTIVE / CLEAR), Alarm Filtering (filtered state: PRESENT / NOT PRESENT / INTERMITTENT) and Alarm Masking (masked state: PRESENT / NOT PRESENT / INTERMITTENT); each stage can be enabled or disabled.]
Alarm masking is designed to prevent the unnecessary reporting of alarms. The masked alarm is inhibited from generating reports if an instance of its superior alarm is active and fits the 'Masking' periods. A 'Masking Hierarchy' determines the priority of each alarm type. Alarm masking is also enabled/disabled on a per alarm instance basis.

The types of alarm correlation can be generalised into six rules: Compression, Suppression, Count, Generalisation, Specialisation and Boolean Patterns. The rule definitions that follow use an example of alarms visualised in Figure 2. Figure 2 is part of a screenshot of NxGantt[1] that displays an event log's alarm events (horizontal bars) against user action events (vertical lines) over time.

Displayed is the multiplexer 'Enfield', which has an alarm active on all 16 ports on the tributary card in slot 2 on Nov. 9th 1998; the alarm on slot 2 port 8 was active from 2:28:36pm to 2:29:30pm.

In a test case of the equipment it is common to daisy chain the tributaries, since there is no real traffic: the generated signal is passed or chained through all the ports. In this case (Figure 2) 15 PPI-AIS alarms can be compressed to one PPI-AIS since it arises from the initial port on the card. (Nortel Networks equipment has 16 ports on a card, therefore 4 cards are needed in a mux for 63 tribs.)

Suppression: [A, B, p(A)<p(B)] => ∅

A low-priority alarm may be inhibited in the presence of a higher-priority alarm. This is generally referred to as "masking". A consequence of the PPI-AIS is that a TU-AIS is injected towards the payload manager. A TU-AIS is of higher priority than PPI-AIS and as such PPI-AIS could be suppressed.

Count: [n x A] => B

The substitution of a specified number of occurrences of an alarm with a new alarm. In the example above, if it was known that the system was not daisy chained and 16 x PPI-AIS appeared, this may indicate a card fault (NE-Card_Fail), which could be substituted for the 16 alarms.

Generalisation: [A, A ⊂ B] => B

Reference to an alarm by its superclass. PPI-AIS is the lowest of the AIS alarms. It may be superseded in terms of priority by INT-TU-AIS, TU-AIS, INT-AU-AIS, AU-AIS.

Specialisation: [A, A ⊃ B] => B

Alarm specialisation is the direct opposite of alarm generalisation and provides for substitution of an alarm by a more specific sub-class of the alarm. In the SDH world alarms are usually thought of in terms of generalisations, i.e. higher order alarms mask lower order alarms. Yet when looking for the root cause, which is the underlying goal of correlation, looking at the specific can assist. For example, AU-AIS could be specialised to PPI-AIS if the configuration indicated that the traffic signal is unstructured. The PPI-AIS is the valid signal, the cause.

Boolean Pattern: [A, B, …. T, ∧,∨,¬] => C

Substitution of a set of alarms satisfying a Boolean pattern with a new alarm. MS-AIS indicates an AIS has been detected in the K2 byte in the section overhead (indicating a failure at the far multiplexer). If AU-AIS injection is enabled for the MS-AIS alarm then MS-FERF will always [...]

[...] general interest once the failure is determined[3]. There are two real world concerns:
1. the sheer volume of alarm event traffic when a fault occurs;
2. the cause not the symptoms.

The types of correlation described previously meet criterion (1), which is vital. They focus on reducing the volume of alarms but do not necessarily meet criterion (2), determining the actual cause; this is left to the operator to determine from the reduced set of higher priority alarms.

Ideally a technique that can tackle both these concerns would be best. Artificial Intelligence (AI) offers that potential and has been, and still is, an active area of research to assist in fault management[4][5][6][7].

2.6 ALARM CORRELATION - THE BAYESIAN NETWORK WAY

The authors' research [8] does deal with both criteria (volume of alarms and cause not the symptoms) using probabilistic reasoning techniques [9]. The cause and effect graph can be considered a complex form of alarm correlation. The alarms are connected by edges that indicate the probabilistic strength of correlation. Yet the cause and effect network can contain more than just alarms as variables: actual faults can be included as variables.

Induction is used to produce the probabilistic network by correlating offline alarm event data, and deduction is used to determine the cause from live alarm events using this probabilistic network.

This approach compares well with others reported in the literature, although it does have several disadvantages, which are not exclusive to it:
• Dealing with the sheer volume of data and all possible alarm events (and as such combinations of correlations) results in a complex induction process, and therefore in a complex probabilistic network. The network then becomes hard to validate and may lead to mistrust among the telecoms experts.
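The deduction step in a cause and effect network can be illustrated with a small worked example. The sketch below is not the authors' network: it assumes a hypothetical two-layer structure in which a single fault variable is the parent of two alarm variables, and every probability in it is an invented number chosen only to make the arithmetic visible. It computes the posterior probability of the fault from the observed alarms by enumerating both fault states.

```python
# Hypothetical numbers for illustration only; none are taken from the paper.
PRIOR_FAULT = 0.01  # assumed P(card fault) before any alarm is seen

# Assumed P(alarm present | fault state). The alarms are treated as
# conditionally independent given the fault (a naive two-layer network).
P_ALARM_GIVEN = {
    "PPI-AIS": {True: 0.95, False: 0.02},
    "TU-AIS": {True: 0.90, False: 0.05},
}


def posterior_fault(observed):
    """Return P(fault | observed alarms) by enumerating both fault states.

    `observed` maps an alarm name to True (present) or False (not present).
    Alarms that are not listed are treated as unobserved.
    """
    def joint(fault):
        # P(fault state) multiplied by the likelihood of each observation.
        p = PRIOR_FAULT if fault else 1.0 - PRIOR_FAULT
        for alarm, present in observed.items():
            p_present = P_ALARM_GIVEN[alarm][fault]
            p *= p_present if present else 1.0 - p_present
        return p

    with_fault = joint(True)
    return with_fault / (with_fault + joint(False))


if __name__ == "__main__":
    print(posterior_fault({"PPI-AIS": True}))
    print(posterior_fault({"PPI-AIS": True, "TU-AIS": True}))
```

Under these assumed numbers, each additional alarm that is more likely under the fault raises the posterior above the prior. A learned network would replace the hand-written table with probabilities induced from offline alarm event data.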
Figure 3 Fault Management Process - Learning and Fault Prediction - including the First-Stage Alarm Correlator
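A first-stage correlator of this kind might apply the compression, suppression and count rules to the alarm stream before it reaches the BBN. The following is a minimal sketch and not the authors' implementation. The `Alarm` record, the function names, the relative order in the `PRIORITY` table beyond PPI-AIS being lowest, and the use of port 0 to mean a card-level alarm are all assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Alarm:
    name: str  # alarm type, e.g. "PPI-AIS"
    slot: int  # card slot in the multiplexer
    port: int  # port on the card; 0 is used here for a card-level alarm


# Assumed priority order, lowest first (only "PPI-AIS is lowest" is from the text).
PRIORITY = {"PPI-AIS": 0, "INT-TU-AIS": 1, "TU-AIS": 2, "INT-AU-AIS": 3, "AU-AIS": 4}


def compress(alarms):
    """Compression: keep one alarm per (name, slot), the one on the lowest port."""
    first = {}
    for alarm in sorted(alarms, key=lambda a: a.port):
        first.setdefault((alarm.name, alarm.slot), alarm)
    return list(first.values())


def suppress(alarms):
    """Suppression: on each (slot, port), keep only the highest-priority alarms."""
    top = {}
    for alarm in alarms:
        key = (alarm.slot, alarm.port)
        top[key] = max(top.get(key, -1), PRIORITY[alarm.name])
    return [a for a in alarms if PRIORITY[a.name] == top[(a.slot, a.port)]]


def count(alarms, name, n, replacement):
    """Count: replace n or more `name` alarms on one slot with one `replacement`."""
    per_slot = Counter(a.slot for a in alarms if a.name == name)
    out, replaced = [], set()
    for alarm in alarms:
        if alarm.name == name and per_slot[alarm.slot] >= n:
            if alarm.slot not in replaced:
                out.append(Alarm(replacement, alarm.slot, 0))
                replaced.add(alarm.slot)
        else:
            out.append(alarm)
    return out


if __name__ == "__main__":
    # 16 PPI-AIS alarms on the tributary card in slot 2, as in the Figure 2 example.
    burst = [Alarm("PPI-AIS", 2, port) for port in range(1, 17)]
    print(compress(burst))
    print(count(burst, "PPI-AIS", 16, "NE-Card_Fail"))
```

Which rule applies to the burst depends on configuration knowledge: compression suits the daisy-chained test case, while the count rule suits the case where the system is known not to be daisy chained.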