Handling Processor Reboot
Realtime systems typically consist of multiple processors implementing different parts of the system's functionality. Any of these processors can encounter a hardware or software failure and reboot. Realtime systems should therefore be designed to handle processor failure and recovery smoothly.
Processor failure and recovery handling can be divided into the following steps:
1. A processor in the system fails. The other processors in the system detect the failure.
2. All other processors clean up the features that were involved in interactions with the failed processor.
3. The failed processor reboots and comes back up.
4. Once the processor is back up, it re-establishes protocol links with all the other processors in the system.
5. After the protocol links are established, the rebooted processor reconciles its data structures with the rest of the system.
6. Data structure audits are initiated with other processors to weed out inconsistencies that might have been introduced by the reboot.
In the following discussion we cover each of these steps, taking the example of a XEN card reboot in the Xenon system.
Processor Failure Detection
When a processor reboots in the system, other processors will detect its failure in one of the following ways:
Loss of periodic health messages: In an idle system with very little traffic, loss of periodic health messages may be the only
mechanism for detecting processor failure. This mechanism places an upper bound on the failure detection time. For
example, if a XEN card sends a health message to CAS every 5 seconds and it takes 3 consecutive timeouts to declare card failure,
the worst-case XEN failure detection time is 20 seconds (15 seconds for the three timeouts, plus a 5 second additional delay for the
case where the XEN card failed just after sending a health message).
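The heartbeat bookkeeping described above can be sketched as follows. This is a minimal illustration, not the Xenon implementation; the timer values match the example in the text, but the class and method names are assumptions.

```python
# Heartbeat-based failure detection sketch. The period and miss limit
# follow the 5-second / 3-timeout example above; everything else is
# an assumption for illustration.

HEALTH_PERIOD = 5   # seconds between health messages
MISS_LIMIT = 3      # consecutive timeouts before declaring failure

def worst_case_detection_time(period, miss_limit):
    """Worst case: the card fails just after sending a health message,
    so detection takes one full period plus miss_limit timeout intervals."""
    return period + miss_limit * period

class HealthMonitor:
    """Tracks health messages from one peer processor."""
    def __init__(self):
        self.misses = 0
        self.failed = False

    def on_health_message(self):
        self.misses = 0          # peer is alive; reset the miss counter

    def on_timer_expiry(self):
        self.misses += 1
        if self.misses >= MISS_LIMIT:
            self.failed = True   # report to the fault handling software
```

With the values from the example, `worst_case_detection_time(HEALTH_PERIOD, MISS_LIMIT)` evaluates to 20 seconds.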
Protocol faults: Protocol faults are the quickest way to detect processor failure in a busy system. As soon as a node sends a
message to the failed processor, the protocol software times out waiting for a response from the peer protocol entity on the failed
processor, and the timeout is reported to the fault handling software. Note that this technique works only when a message is sent
to the failed node, so no upper bound can be placed on the failure detection time. In most situations, however, protocol fault
detection is fast, as there will be some message traffic towards the failed node. For example, a XEN card failure will be detected
by the other XEN and CAS processors as soon as they try to send a message to the failed XEN.
Cleaning Up on Processor Failure
Whenever a node fails, all the other nodes in the system that were involved in feature interactions with that node need to be notified so that
they can clean up any feature affected by the failure.
For example, when a XEN card fails, all the other XEN cards are informed so that they can clear every call that had one leg on the
failed XEN. This may appear fairly straightforward, but consider that the system suddenly has to clear a large number of calls at once. This
can cause a spike in memory buffer and CPU utilization, which designers should take into account when dimensioning resources.
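One common way to bound that spike is to release the affected calls in small batches rather than all at once. The sketch below is illustrative only; the call-record layout, batch size, and function names are assumptions, not part of the Xenon design.

```python
# Paced call cleanup sketch: release calls with a leg on the failed card
# in small batches so a scheduler can space the work out in time,
# avoiding a burst of CPU and buffer usage.

BATCH_SIZE = 50  # assumed tuning parameter

def cleanup_calls(calls, failed_card, release):
    """Release every call that has a leg on failed_card.
    Returns the batches so the caller can schedule them with gaps."""
    affected = [c for c in calls if failed_card in c["legs"]]
    batches = [affected[i:i + BATCH_SIZE]
               for i in range(0, len(affected), BATCH_SIZE)]
    for batch in batches:
        for call in batch:
            release(call)   # free time slots, buffers, etc.
    return batches
```

In a real system each batch would be driven from a timer tick instead of a simple loop, keeping per-tick CPU usage predictable.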
Processor Recovery
Once a failed processor reboots and comes up, it informs the central processor that it has recovered and is ready
to resume service. The central processor then informs all the other processors so that they can re-establish protocol with the
recovered processor.
In the XEN example, when a XEN card recovers it informs the CAS card about its recovery. CAS then informs the other XEN cards so
that they can resume protocol with the recovered card. This also involves changing the status of all terminals and trunk groups handled
by the XEN card to in-service.
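The central-processor side of this sequence can be sketched as one handler. The card names, the link and terminal data structures, and the status strings are all assumptions for illustration; they are not taken from the Xenon design.

```python
# Recovery notification sketch, as seen from the central (CAS) card:
# bring the protocol links between peers and the recovered card back up,
# then mark the recovered card's terminals as in-service.

def handle_recovery(recovered_card, other_cards, links, terminals):
    """links: dict (card_a, card_b) -> 'up' / 'down'
    terminals: dict terminal_id -> {'card': ..., 'status': ...}"""
    for card in other_cards:
        # Tell each peer to re-establish protocol with the recovered card.
        links[(card, recovered_card)] = "up"
    for info in terminals.values():
        if info["card"] == recovered_card:
            info["status"] = "in-service"
```

In practice each step would be a message exchange (CAS to peer XEN, CAS to O&M) rather than a direct dictionary update.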
Data Reconciliation
When the failed card comes up, it has to recover the context that was lost due to the failure. The context is recovered by the following
mechanisms:
Obtaining the configuration data from the operations and maintenance module.
Periodically backing up the state data with the operations and maintenance module, so that this information can be restored on
reboot.
Reconciling data structures with the other processors in the system.
When a XEN card recovers, it obtains the V5.2 interface definitions, trunk group data, etc. from the operations and maintenance module.
Permanent status changes, such as circuit failure status, are obtained from the backed-up data. Transient state, such as
circuit blocking status, is recovered by exchanging blocking messages with other exchanges.
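The three recovery sources can be combined in the order listed: configuration first, then backed-up permanent state, then transient state learned from peers. The sketch below is a simplified illustration; the field names and dictionary-merge representation are assumptions.

```python
# Context recovery sketch: rebuild the rebooted card's context from the
# three sources named above. Later sources refine earlier ones, so
# peer-reported transient state takes precedence over stale backups.

def rebuild_context(config_data, backed_up_state, peer_reports):
    """config_data: from the O&M module (e.g. trunk group definitions)
    backed_up_state: periodic backups (e.g. circuit failure status)
    peer_reports: messages from peers (e.g. circuit blocking status)"""
    context = {}
    context.update(config_data)
    context.update(backed_up_state)
    for report in peer_reports:
        context.update(report)
    return context
```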
Audits
A processor reboot may have created a lot of inconsistencies in the system. Software audits are run just after processor recovery to catch
these inconsistencies. Once they are fixed, the system designers may opt to run audits periodically to counter
inconsistencies that arise during the normal course of operation.
When the XEN card recovers, it triggers the following audits:
Space slot resource audit with CAS
Time slot resource audit with other XEN cards
Call audit with XEN and CAS
The above audits will clean up any hanging slot allocations or hanging calls in the system.
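The essence of a resource audit is comparing two views of the same resource and freeing anything claimed by one side but not the other. The sketch below shows that comparison for time slots; the data layout is an assumption for illustration.

```python
# Resource audit sketch: compare the allocator's view of time slots
# against the slots actually referenced by active calls. Any slot that
# is allocated but unreferenced is "hanging" and should be freed.

def audit_slots(allocated_slots, active_calls):
    """Return the set of hanging slots: allocated but used by no call."""
    in_use = set()
    for call in active_calls:
        in_use.update(call["slots"])
    return set(allocated_slots) - in_use
```

A call audit works the same way in reverse: a call whose slots are no longer allocated on the peer card is a hanging call and gets cleared.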
© 2014 EventHelix.com Inc.