
HIGH SPEED NETWORK FIRST-STAGE ALARM CORRELATOR

ROY STERRITT, MARY SHAPCOTT, KENNY ADAMSON, EDWIN CURRAN

School of Information and Software Engineering, Faculty of Informatics, University of Ulster,
Shore Road, Newtownabbey, Co. Antrim, BT37 OQB, Northern Ireland, UK
E-mail: {R.Sterritt, CM.Shapcott, K.Adamson, EP.Curran}@ulst.ac.uk

ABSTRACT

At the heart of fault management is alarm correlation. The approach that is used in commercial telecom systems, as specified by international standards, is correlation through monitoring, filtering and masking event alarms. This is vital since it reduces the number of event instances presented, the symptoms of the fault. The next objective is to automatically present the actual fault. Artificial Intelligence (AI) offers the potential of determining the fault or multiple faults, yet AI has its own problems, including dealing with the vast number of potential variations in the data. Not least of these problems is the mistrust, by engineers, of the 'non-deterministic' AI results produced. This paper presents a simple first-stage alarm correlator to reduce the data for both induction and deduction.

Keywords: Intelligent Systems, Expert Systems, Event Correlation, Telecommunications.

1 INTRODUCTION

High-speed broadband telecommunication systems are built with extensive redundancy and complex management systems to ensure robustness. The presence of a fault may not only be detected by the offending component and its parent; the consequences of that fault may also be discovered by other components. This often results in a large number of alarm events being raised and cascaded to the element controller.

The behaviour of the alarms is so complex that it appears non-deterministic. It is very difficult to isolate the true cause of the fault. Failures in the network are unavoidable, but quick detection and identification of the fault is essential to ensure robustness. To this end the ability to correlate alarm events becomes very important.

The major telecommunication equipment manufacturers deal with alarm correlation through alarm monitoring, filtering and masking (as specified by ITU-T), with rule-based systems to assist the operator. Yet it is often left in the operator's hands to determine the actual fault or multiple faults from the filtered set of alarms reported.

The authors have been working in the area of using AI for fault management and alarm correlation, with Bayesian Belief Networks (BBN) as the technique of choice. The BBN offers the advantages of an ANN (Artificial Neural Network) while also being justifiable, in that the network is visible and not a black box. Yet, as with other AI approaches, dealing with the sheer volume of data creates a complex learning process and subsequent problems with usage in a real-time fault management system.

To assist in reducing the number of alarm events that need to be considered in both stages (learning and real-time use), the authors propose developing a first-stage alarm correlator, which is presented in this paper. The first approach would be to develop rules from the ITU-T recommendations. After testing, other sources could then be considered, such as defining rules with expert assistance using visualisation tools such as NxGantt[1]. Then, once the approach was well proven and had an effect on reducing the size of the BBN, the next step would be to test the concept of extracting correlations from the actual BBN for those variables that have an exceedingly high probability of cause and effect.

2 ALARM EVENT CORRELATION

2.1 FAULTS

A fault is a malfunction that has occurred in either the hardware or the software on the network. This can be due to some external force, for example a digger cutting through the fibre cable, or an internal fault such as a card failure.

2.2 EVENTS

An event is an occurrence on the network. Those that relate to the management of the network are recorded by the Element Controller (EC; historically referred to as the Element Manager - EM). In older releases a recorded event equated to an alarm. This is no longer the case; other examples of events are user logins and user actions such as switch protection.
Table 1 An Example Set of Alarm Events

Type   Path                    Event_Type        EC_Time   EC_Date        Alarm    Sev       NE_ID  Alarm_ID
TN-1X  /bireh706/TN-1X/Mux_03  Comms fail alarm  11:27:12  26 March 1999  present  Critical  5002   25001
TN-1X  /bireh706/TN-1X/Mux_03  Comms fail alarm  11:33:41  26 March 1999  clear    Critical  5002   25001

Table 2 Example of other alarms generated from Comms fail alarm

Type   Path                       Event_Type       EC_Time   EC_Date        Alarm    Sev       NE_ID  Alarm_ID
TN-1X  /bireh706/TN-1X/Mux_02/S6  Qecc-Comms_Fail  11:31:48  26 March 1999  present  Critical  5001   668
TN-1X  /bireh706/TN-1X/Mux_04/S7  Qecc-Comms_Fail  11:32:38  26 March 1999  present  Critical  5003   349
TN-1X  /bireh706/TN-1X/Mux_04/S7  Qecc-Comms_Fail  11:33:05  26 March 1999  clear    Critical  5003   349
TN-1X  /bireh706/TN-1X/Mux_02/S6  Qecc-Comms_Fail  11:33:31  26 March 1999  clear    Critical  5001   668
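
The present and clear events in Tables 1 and 2 pair up into alarm lifetimes. The following is a minimal sketch of that pairing, not the authors' implementation. It assumes that an alarm instance is identified by the pair (NE_ID, Alarm_ID); the names ROWS and lifetimes are hypothetical and are introduced only for illustration.

```python
from datetime import datetime

# Rows transcribed from Tables 1 and 2:
# (Event_Type, EC_Time, EC_Date, Alarm state, NE_ID, Alarm_ID)
ROWS = [
    ("Comms fail alarm", "11:27:12", "26 March 1999", "present", 5002, 25001),
    ("Qecc-Comms_Fail", "11:31:48", "26 March 1999", "present", 5001, 668),
    ("Qecc-Comms_Fail", "11:32:38", "26 March 1999", "present", 5003, 349),
    ("Qecc-Comms_Fail", "11:33:05", "26 March 1999", "clear", 5003, 349),
    ("Qecc-Comms_Fail", "11:33:31", "26 March 1999", "clear", 5001, 668),
    ("Comms fail alarm", "11:33:41", "26 March 1999", "clear", 5002, 25001),
]


def lifetimes(rows):
    """Pair each 'present' event with the next 'clear' for the same alarm instance.

    Assumption: (NE_ID, Alarm_ID) identifies one alarm instance.
    Returns a dict mapping that key to a timedelta.
    """
    open_alarms = {}  # key -> time at which the alarm became present
    durations = {}
    for _event_type, time_s, date_s, state, ne_id, alarm_id in rows:
        stamp = datetime.strptime(f"{date_s} {time_s}", "%d %B %Y %H:%M:%S")
        key = (ne_id, alarm_id)
        if state == "present":
            open_alarms[key] = stamp
        elif state == "clear" and key in open_alarms:
            durations[key] = stamp - open_alarms.pop(key)
    return durations


if __name__ == "__main__":
    for key, delta in lifetimes(ROWS).items():
        print(key, int(delta.total_seconds()), "seconds")
```

For the Comms fail alarm on NE 5002 this gives 389 seconds, which is 6 minutes 29 seconds.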

2.3 ALARM EVENTS

There are numerous types of alarm events that may be generated within a Network Element (NE). For instance, the Nortel Networks FibreWorld TN-1X multiplexer (STM-1) release 8 has 88 possible alarm events. Other releases and other products have different possible alarm events. An example of a critical alarm is a 'Comms fail alarm'. An alarm exists for a time period whereby, under normal circumstances, an 'alarm present event' will be accompanied by an 'alarm clear event'. In Table 1 the 'Comms fail alarm' was in existence for 6 minutes 29 seconds.

2.3.1 Alarm Severity Levels

Each alarm type is assigned a Severity Level of Critical, Major or Minor by the network management system, depending on the severity of the fault indicated by the alarm type. In the example, the alarm type 'Comms fail' has a Critical Severity Level, while other alarms such as 'Tributary Unit Alarm Indication Signal' (TU-AIS) have a Minor Severity Level.

2.3.2 Alarm Generation

A single fault can cause numerous alarm events to be raised from an individual NE, which means that the alarms are often interrelated (and thus the desire to correlate). A fault may also trigger numerous similar and different alarms (and indeed alarm types) in different NEs upstream or downstream on the network. For example the Comms fail alarm, an alarm raised by the management system if it cannot maintain a communications channel to the indicated NE, may cause other alarms such as RS-LOS, RS-LOF, Qecc-Comms_fail, MS-EXC or even laser alarms, depending on the fault and configuration. In the Comms fail alarm example (Table 1), Qecc-Comms_fail alarms were raised (Table 2). This alarm indicates that the NE cannot communicate via the Embedded Control Channel (ECC) of the indicated STM-N card (slots 6 and 7 in the example) with the neighbouring NE.

2.3.3 Alarm Monitoring, Filtering and Masking

Alarms can be generated exponentially in different NEs throughout the network due to certain fault conditions; the larger the network, the greater the number of alarms that will be generated. It is therefore essential for the NEs to provide some correlation of the different alarms that are generated, so that the EC is not flooded with alarms and only the ones with high priorities are transmitted.

This is handled in three sequential transformations: alarm monitoring, alarm filtering and alarm masking, as shown in Figure 1. These mean that if the raw state of an alarm instance changes, an alarm event is not necessarily generated.

Alarm monitoring takes the raw state of an alarm and produces a 'monitored' state. Alarm monitoring is enabled/disabled on a per alarm instance basis. If monitoring is enabled, then the monitored state is the same as the raw state; if disabled, then the monitored state is clear.

Alarm filtering is also enabled/disabled on a per alarm instance basis. An alarm may exist in any one of three states, Present, Intermittent or Clear, depending on how long the alarm is raised for. Assigning these states, by checking for the presence of an alarm within certain 'filtering' periods, determines the Alarm Filtering.
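
The monitoring and filtering transformations can be sketched as two small functions. This is an illustrative reading only: the paper does not give the filtering periods or the exact classification logic, so the single threshold present_period and the rule "raised at least that long means Present, raised for a shorter time means Intermittent" are assumptions, and both function names are hypothetical.

```python
def monitored_state(raw_active: bool, monitoring_enabled: bool) -> bool:
    """Monitoring: pass the raw state through when enabled, else report clear."""
    return raw_active if monitoring_enabled else False


def filtered_state(raised_seconds: float, present_period: float) -> str:
    """Filtering: classify an alarm by how long it has been raised.

    present_period is an assumed threshold, not a value taken from the paper.
    """
    if raised_seconds <= 0:
        return "CLEAR"
    if raised_seconds >= present_period:
        return "PRESENT"
    return "INTERMITTENT"


if __name__ == "__main__":
    print(monitored_state(raw_active=True, monitoring_enabled=False))
    print(filtered_state(raised_seconds=1.0, present_period=2.5))
```

A real NE would apply separate periods for entering and leaving each state; the single threshold here only illustrates the three-state idea.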
[Figure 1: block diagram. Raw alarm (ACTIVE / CLEAR) -> Alarm Monitoring (enable/disable) -> monitored alarm state -> Alarm Filtering (enable/disable; filter periods for present and not present) -> filtered alarm state (PRESENT / NOT PRESENT / INTERMITTENT) -> Alarm Masking (enable/disable; mask periods and check extension; states of superior alarms) -> masked state (PRESENT / NOT PRESENT / INTERMITTENT).]

Figure 1 Alarm Monitoring, Filtering and Masking Sequential Transformations

Alarm masking is designed to prevent the unnecessary reporting of alarms. The masked alarm is inhibited from generating reports if an instance of its superior alarm is active and fits the 'Masking' periods. A 'Masking Hierarchy' determines the priority of each alarm type. Alarm masking is also enabled/disabled on a per alarm instance basis.

If an alarm changes state at any time the network management system must be informed. The combination of Alarm Monitoring, Filtering and Masking makes alarm handling within the NEs quite complex.

2.3.4 The Complexity

The simple example of inter-connecting alarms above and the transformations should have illustrated that fault determination is not a straightforward process. The combinations of possible alarm events and the times they are received at the EC are numerous. Added to this complexity is the fact that individual alarms can be configured in different states such as 'Masking Disabled' or 'Masking Enabled', or the network can be in different states such as '1+1 protection' or 'unprotected'.

2.4 ALARM CORRELATION

At the heart of fault management is alarm correlation. The alarm events may be the first indication that a fault has occurred or is occurring. Since one of the primary aims is to avoid network traffic interruption, a quick diagnosis is essential.

The types of alarm correlation can be generalised into six rules: Compression, Suppression, Count, Generalisation, Specialisation and Boolean Patterns. The rule definitions that follow use an example of alarms visualised in Figure 2. Figure 2 is a part of a screenshot of NxGantt[1] that displays an event log's alarm events (horizontal bars) against user action events (vertical lines) over time. Displayed is the multiplexer 'Enfield', which has an alarm active on all 16 ports on the tributary card in slot 2, on Nov. 9th 1998; the alarm on slot 2 port 8 was active from 2:28:36pm to 2:29:30pm.

Figure 2 Section of NxGantt[1] - the Stimuli Event Correlation Analyser

Compression [A, A, ..., A] => A

The reduction of multiple occurrences of an alarm into a single alarm. PPI-AIS - An Alarm Indication Signal (AIS) has been detected in the incoming 2Mbit/s traffic. If the 2Mbit/s signal is unstructured (e.g. does not conform to ITU-T recommendation G732[2]), this AIS may be a valid signal (indicating presence of traffic). The monitoring of the alarm can be disabled by the operator.

In a test case of the equipment it is common to daisy chain the tributaries, since there is no real traffic, so that the generated signal is passed or chained through all the ports. In this case (Figure 2) 15 PPI-AIS alarms can be compressed to one PPI-AIS since it arises from the initial port on the card. (Nortel Networks equipment has 16
ports on a card, therefore 4 cards are needed in a mux for 63 tribs).

Suppression [A, B, p(A) < p(B)] => ∅

A low-priority alarm may be inhibited in the presence of a higher-priority alarm. This is generally referred to as "masking". A consequence of the PPI-AIS is that a TU-AIS is injected towards the payload manager. A TU-AIS is of higher priority than PPI-AIS and as such PPI-AIS could be suppressed.

Count [n x A] => B

The substitution of a specified number of occurrences of an alarm with a new alarm. In the example above, if it was known that the system was not daisy chained and 16 x PPI-AIS appeared, this may indicate a card fault (NE-Card_Fail), which could be substituted for the 16 alarms.

Generalisation [A, A ⊂ B] => B

Reference to an alarm by its superclass. PPI-AIS is the lowest of the AIS alarms. It may be superseded in terms of priority by INT-TU-AIS, TU-AIS, INT-AU-AIS, AU-AIS.

Specialisation [A, A ⊃ B] => B

Alarm specialisation is the direct opposite of alarm generalisation and provides for substitution of an alarm by a more specific sub-class of the alarm. In the SDH world alarms are usually thought of in terms of generalisations, i.e. higher order alarms mask lower order alarms. Yet when looking for the root cause, which is the underlying goal of correlation, looking at the specific can assist. For example, AU-AIS could be specialised to PPI-AIS if the configuration indicated that the traffic signal is unstructured. The PPI-AIS is the valid signal, the cause.

Boolean Pattern [A, B, ..., T, ∧, ∨, ¬] => C

Substitution of a set of alarms satisfying a Boolean pattern with a new alarm. MS-AIS indicates an AIS has been detected in the K2 byte in the section overhead (indicating a failure at the far multiplexer). If AU-AIS injection is enabled for the MS-AIS alarm then MS-FERF will always be raised. Thus: MS-AIS ∧ MS-FERF => AU-AIS

2.5 ALARM CORRELATION AND ARTIFICIAL INTELLIGENCE

At the heart of alarm event correlation is the determination of the cause. The alarms represent the symptoms and as such, in the global scheme, are not of general interest once the failure is determined[3]. There are two real world concerns:
1. the sheer volume of alarm event traffic when a fault occurs;
2. the cause not the symptoms.

The types of correlation that have been described previously meet criterion (1), which is vital. They focus on reducing the volume of alarms but do not necessarily meet criterion (2), to determine the actual cause - this is left to the operator to determine from the reduced set of higher priority alarms.

Ideally a technique that can tackle both these concerns would be best. Artificial Intelligence (AI) offers that potential and has been, and still is, an active area of research to assist in fault management[4][5][6][7].

2.6 ALARM CORRELATION - THE BAYESIAN NETWORK WAY

The authors' research [8] does deal with both criteria (volume of alarms and cause not the symptoms) using probabilistic reasoning techniques [9]. The cause and effect graph can be considered a complex form of alarm correlation. The alarms are connected by edges that indicate the probabilistic strength of correlation. Yet the cause and effect network can contain more than just alarms as variables - actual faults can be included as variables.

Induction is used to produce the probabilistic network by correlating offline alarm event data, and deduction is used to determine the cause from live alarm events using this probabilistic network.

This approach compares well with others reported in the literature, although it does have several disadvantages, which are not exclusive to it:
• Dealing with the sheer volume of data and all possible alarm events (and as such combinations of correlations) results in a complex induction process, and therefore in a complex probabilistic network. The network then becomes hard to validate and may lead to mistrust among the telecoms experts.
• A complex probabilistic network ensures further complications with deduction, not least of which is the speed of propagation of probabilities.

2.6.1 Learning the Graph - Induction

In this case, as in many cases, the structure of the graphical model (the Bayesian network) is not known in advance, but there is a database of information concerning the frequencies of occurrence of combinations of different
variable values (the alarms). In such a case the problem is that of induction - to induce the structure from the data. Heckerman has a good description of the problem[10]. There has been a lot of work in the literature in the area, including that of Cooper and Herskovits[11]. Unfortunately the general problem is NP-hard [12]. For a given number of variables there is a very large number of potential graphical structures which can be induced. To determine the best structure, in theory one should fit the data to each possible graphical structure, score the structure, and then select the structure with the best score. Consequently algorithms for learning networks from data are usually heuristic, once the number of variables gets to be of reasonable size. There are 2^{k(k-1)/2} distinct possible independence graphs for a k-dimensional random vector: this translates to 64 probabilistic models for k = 4, and 32,768 models for k = 6.

Although not the primary aim, the authors have undertaken approaches to assist in validating the network by means of providing visualisation applications [1][13][14][15]. These clarify the induction process and assure the telecoms engineers (technology transfer). They also provide the mechanism for feedback on the telecoms domain (knowledge capture) that can be used to validate the probabilistic network.

2.6.2 Using the Graph - Deduction

Once the graph has been developed it can be used as the knowledge 'guts' of an expert system. By deducing or inferring¹ the correlations between the alarms (the cause and effect relationships) from given alarm data, the system can hopefully determine the root cause - the actual fault or the fault possibilities.

¹ Thus the alternative names for this approach - deduction component and an inference engine

Once an alarm has been reported, the effects of that observation are propagated throughout the network and the other marginal probabilities are updated. In simple networks the marginal probabilities (likelihoods) of each state can be calculated from the knowledge of the joint distribution using the product rule and Bayes' theorem. Nevertheless, more often than not the graph is not simple, cycles occur and the calculation is much more complex. Some algorithms calculate the marginals exactly, yet the calculation of exact probabilities on graphical structures is NP-hard [12]. Therefore many researchers have developed algorithms which approximate the answer. They may sacrifice accuracy for a lower computational overhead.

Either way, exact or approximate, the system would benefit from as simple a graphical structure as possible, particularly if the final system is to achieve real-time speeds.

3 FIRST-STAGE ALARM CORRELATOR (FAC)

In practice, when it comes to learning the cause and effect graph, the volume of event traffic and correlation of alarms can be reduced by simple first-stage correlation (generally pattern matchers). The expert system approach (in this case the deduction from the probabilistic network) could then handle the remaining more complex problems, taking advantage of the much reduced and enriched stream of events.

As such the authors have now designed and developed this simple first-stage event correlator. The first implementation has rules that were extracted from the industry standards specification [2] (correlations derived from consequence alarm tables and masking hierarchies). Using the masking hierarchy to define rules is equivalent to what the monitoring, filtering and masking functions can do, yet they do not consider consequence alarms. The masking hierarchy is included since there is no guarantee that masking is enabled in the data being considered. The tool is also generic so rules can be added; for instance, simple additional rules specific to a particular environment, such as the testing environment, can be defined (as in the data displayed in Figure 2, where a rule concerning 15 x PPI-AIS equating to a test daisy chain configuration could be defined).

It has been established in the previous sections why you would want to reduce the number of alarm events you need to consider, both at the AI learning stage and in the actual fault management process. To recap:
Learning;
• The BBN is complex to induce when there is a vast number of variables.
Real-time Fault Management System;
• Rapid response for simpler correlations - more challenging correlations to be handled by AI.
• Reduce size of BBN to reduce time required to propagate probabilities.

In addition, since temporal behaviour is not a part of the BBN approach, this tool offers the ability to start investigative research into the temporal aspects of the alarm correlation, in the form of knowledge capture with telecoms experts. This will assist in capturing knowledge concerning the temporal influence of alarms for possible future migration to Dynamic Belief Networks/Temporal Belief Networks (DBN/TBN).

3.1 THE GENERIC TOOL

As stated, the purpose was to develop a generic tool to assist in the learning process as well as the first stages of the fault management process. Therefore it must be capable of coping with several styles of rules. Screenshot
1 depicts the FAC screenshot demonstrating correlation of an event log and Screenshot 2 shows the FAC screenshot demonstrating configuration of simple correlation rules.

Screenshot 1 FAC screenshot demonstrating correlation of an event log

Screenshot 2 FAC screenshot demonstrating configuration of simple correlation rules

3.1.1 The Rules

The current prototype can cope with several types of rules, which gives it its generic potential. The main rule type is:

X ∧ Y, r(X ∧ Y), t_min, t_max, p(X ∧ Y)

Where if alarm X and alarm Y occur within the time frame t_min - t_max (either arriving first) then the correlation is made and reason given. A certainty field (probability of correlation being correct) is included in the rule. At this stage this is a speculative figure defined by the user but is included for future development. Some examples of this type of rule are:

AU-AIS, TU-AIS, "Failure in upstream path (AU-AIS caused injection of TU-AIS)", 0, 4, 15

AU-AIS, HO-FERF, "Indirect consequence of AU-AIS (AU-AIS Failure in upstream path)", 0, 4, 5

PPI-AIS, TU-AIS, "Unstructured signal (thus AIS a valid signal) thus injection of TU-AIS", 0, 4, 10

Another rule type is:

N x X, r(N x X), t_min, t_max, p(N x X)

Where if alarm X occurs N times within the time frame t_min - t_max then the correlation is made and reason given. An example of this type of rule is:

PPI-AIS, 16, "Test equipment - Daisy chained tribs", 0, 6, 10

3.1.2 The Discovery and Learning of Rules

Since the basis of FAC was to catch simple correlations to reduce the number of event types that need to be handled by the expert system, the initial source for rules has been knowledge acquisition - consultation with experts and the appropriate documentation. Yet the tool has the potential to use rules discovered from human discovery (visualisation techniques) and computer discovery (data mining); this is depicted in Figure 3.

The top half of the figure depicts the learning process; rules can be discovered via the visualisation tools and coded for FAC. Rules can also be derived from the standards bodies' specifications. Then when undertaking the knowledge discovery process, FAC is used to extract the data that can be correlated from its rules - this reduces the amount of data and combinations that need to be considered when actually inducing (data mining) the cause and effect (C&E) network. After the first iteration strong correlations may be extracted from the C&E network to be coded as FAC rules.

The bottom half of the figure depicts the fault management/diagnostic tool application. FAC is then used to catch correlations from live data; anything it does not handle is passed on to the deduction algorithm/expert system.
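
The two rule types of Section 3.1.1 can be sketched as simple pattern matchers. This is a minimal illustration under stated assumptions, not the FAC implementation: the paper does not give the time units or the exact window semantics, so the sketch assumes that the gap between the two alarms (pair rule), or between the first and the N-th occurrence (count rule), must lie between t_min and t_max. All class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Event:
    alarm: str  # alarm type, e.g. "AU-AIS"
    t: float    # arrival time, in the same (assumed) units as the rule window


@dataclass(frozen=True)
class PairRule:  # X ∧ Y, r, t_min, t_max, p
    x: str
    y: str
    reason: str
    tmin: float
    tmax: float
    certainty: float  # user-defined figure; carried along but not used here


@dataclass(frozen=True)
class CountRule:  # N x X, r, t_min, t_max, p
    x: str
    n: int
    reason: str
    tmin: float
    tmax: float
    certainty: float


def match_pair(rule: PairRule, events: List[Event]) -> List[Tuple[Event, Event, str]]:
    """Return (x_event, y_event, reason) for each pair inside the time window."""
    hits = []
    for ex in (e for e in events if e.alarm == rule.x):
        for ey in (e for e in events if e.alarm == rule.y):
            gap = abs(ex.t - ey.t)  # either alarm may arrive first
            if rule.tmin <= gap <= rule.tmax:
                hits.append((ex, ey, rule.reason))
    return hits


def match_count(rule: CountRule, events: List[Event]) -> bool:
    """True if some run of n occurrences of x spans a time within the window."""
    times = sorted(e.t for e in events if e.alarm == rule.x)
    for i in range(len(times) - rule.n + 1):
        span = times[i + rule.n - 1] - times[i]
        if rule.tmin <= span <= rule.tmax:
            return True
    return False


if __name__ == "__main__":
    log = [Event("AU-AIS", 0.0), Event("TU-AIS", 3.0), Event("HO-FERF", 10.0)]
    rule = PairRule("AU-AIS", "TU-AIS", "Failure in upstream path", 0, 4, 15)
    for ex, ey, reason in match_pair(rule, log):
        print(ex.alarm, ey.alarm, reason)
```

Events matched by a rule would be replaced by the rule's reason, and unmatched events passed on to the deduction stage; that hand-off is not shown here.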
4 CONCLUSION

The FAC prototype tool has successfully demonstrated the benefits of reducing the number of alarms to be considered by first-stage simple alarm correlation. The future development planned for the tool is to improve the tool's correlation routines to achieve real-time response. It is also planned to expand the types of rules that can be handled from the current types (generalisation and count) to compression, suppression, specialisation and Boolean pattern. It is planned for the pilot of FAC to be conducted on a private STM-4 radio network, which is the first of its type in Europe.

ACKNOWLEDGEMENTS

We would like to thank IRTU (START ITS 7 programme) for funding the GARNET research and EPSRC (AIKMS programme) for the initial NetExtract research. We are also indebted to our industrial collaborators NITEC (Northern Ireland Telecommunications Engineering Centre) Nortel Networks, and in particular Nortel's GARNET team leader Dr. Roger Johnson. Finally, we thank Deirdre Clarke, who undertook prototyping of the first-stage correlation tool as part of her BSc final year project (1999).

REFERENCES

[1] R. Sterritt, K. Adamson, E.P. Curran, C.M. Shapcott, "Visualisation And Context Of Telecommunications Data", Proceedings of the 17th International IASTED Conference on Applied Informatics, pp. 588-591, 1999.
[2] ITU-T Recommendations G732.
[3] Harrison K., "A Novel Approach to Event Correlation", Hewlett Packard, Intelligent Networked Computing Laboratory, HP Laboratories, Bristol, HP-94-68, July 1994, pp. 1-10.
[4] Bouloutas A. T., Calo S., Finkel A., "Alarm Correlation and Fault Identification in Communication Networks", IEEE Transactions on Communications, Vol. 42, No. 2/3/4, February/March/April 1994, pp. 523-533.
[5] Gardner R. D., Harie D. A., "Expert Data Mining for Alarm, Correlation in High-Speed Networks Industrial Application", EXPERSYS-97, pp. 145-150.
[6] Haljela S., "HP OEMF: Alarm Management in Telecommunications Networks", Hewlett Packard Journal, October 1996, Article 3, pp. 1-11.
[7] Ricciulli L., Shacham N., "Modeling Correlated Alarms In Network Management Systems", Computer Science Laboratory, SRI International, Menlo Park, CA 94025-9493, May 1996, pp. 1-25.
[8] EPSRC AIKMS programme, The NetExtract Project (An Architecture for the Extraction of Cause and Effect Networks from Complex Systems), 1995-97.
[9] R. Sterritt, M. Daly, K. Adamson, M. Shapcott, D.A. Bell, F. McErlean, "NETEXTRACT: An Architecture For The Extraction Of Cause And Effect Networks From Complex Systems", Proceedings of the 15th IASTED International Conference on Applied Informatics, pp. 55-57, 1997.
[10] Heckerman D., 1996. "Bayesian Networks for Knowledge Discovery". In Fayyad U.M., Piatetsky-Shapiro G., Smyth P. and Uthurusamy R. (Eds.), Advances in Knowledge Discovery and Data Mining, AAAI Press / The MIT Press, pp. 273-305.
[11] Cooper G.F. and Herskovits E., 1992. "A Bayesian Method for the Induction of Probabilistic Networks from Data". Machine Learning, 9, pp. 309-347.
[12] Chickering D.M. and Heckerman D., 1994. "Learning Bayesian networks is NP-hard". Technical Report MSR-TR-94-17, Microsoft Research, Microsoft Corporation, 1994.
[13] IRTU ITS Start 7 programme, the GARNET (Graphical And Real-time Network Emulation Tool) project, 1996-99.
[14] R. Sterritt, E.P. Curran, K. Adamson, C.M. Shapcott, "Application Of AI For Automated Testing In Complex Telecommunication Systems", Proceedings of the EXPERSYS 98, 10th International Conference on Artificial Intelligent Applications, 1998.
[15] S. Mcbride, R. Sterritt, E.P. Curran, K. Adamson, C.M. Shapcott, "MAYPOLE: Visualisating Contingency Tables", accepted for The International Conference On Artificial Intelligence (IC-AI'2000), 26-29 June 2000.

Figure 3 Fault Management Process - Learning and Fault Prediction - including the First-Stage Alarm Correlator
