
Environmentally Adaptive Fault Tolerant Computing

(EAFTC)
Jeremy Ramos, Dean W. Brenner, Gary E. Galica, and Chris J. Walter
Honeywell Inc., Defense & Space Electronic Systems
13350 U.S. Highway 19 North
Clearwater, Florida 33764-7290
727-539-4311/ 727-539-2584
jeremy.ramos@honeywell.com
dean.brenner@honeywell.com
galica@psicorp.com
cwalter@wwtechnology.com

Abstract—The application of commercial-off-the-shelf (COTS) processing components in operational space missions with optimal performance and efficiency requires a system-level approach. Of primary concern is the need to handle the inherent susceptibility of COTS components to Single Event Upsets (SEUs). Honeywell, in conjunction with Physical Sciences Incorporated and WW Technology Group, has developed a new paradigm for fault tolerant COTS based onboard computing. The paradigm is called "Environmentally Adaptive Fault Tolerant Computing" (EAFTC). EAFTC combines a set of innovative technologies to enable efficient use of high performance COTS processors in the harsh space environment while maintaining the required system availability.

TABLE OF CONTENTS

INTRODUCTION ......................................... 1
TECHNOLOGY ADVANCE ................................... 2
CONCEPT OF OPERATION ................................. 2
IMPLEMENTATION METHODOLOGY ........................... 2
HARDWARE IMPLEMENTATION .............................. 3
SOFTWARE FRAMEWORK ................................... 5
EAFTC AND RP MIDDLEWARE .............................. 6
TECHNOLOGY DEVELOPMENT ............................... 8
CONCLUSION ........................................... 8
REFERENCES ........................................... 9
ACKNOWLEDGEMENTS ..................................... 9
BIOGRAPHIES .......................................... 9

INTRODUCTION

Science and defense missions alike have ever-increasing demands for data returns from their space assets. In recent times we have seen a significant increase in the capability of the instruments deployed in space [1,2,3]. The traditional implementation approach of data gathering, data compression, and data transmission is no longer sustainable. The vast amounts of data being generated cannot be transmitted via available downlink channels in reasonable time. The industry-proposed solution is to reduce the demand on the downlink by moving processing onto the spacecraft. This approach is hampered by the limited capabilities of today's on-board processors and the prohibitive cost of developing radiation hardened high-performance electronics [4,5]. This has partly encouraged the industry to consider the use of COTS components [6]. Furthermore, the recent adoption of silicon-on-insulator (SOI) technology by COTS integrated circuit foundries is resulting in devices with moderate space radiation tolerance [7,8]. Despite all of this progress, COTS components continue to be highly susceptible to SEUs. Therefore, technology must be developed that capitalizes on the strengths of COTS devices while overcoming their susceptibility to SEUs without negating their benefits.

The popular approach for mitigating SEUs is to employ fixed component level redundancy [9]. One major disadvantage of fixed redundancy is low efficiency and unrealized system capacity. The EAFTC paradigm facilitates use of COTS components in SEU abundant environments, while maintaining adequate levels of system efficiency and capacity. It accomplishes this by adaptively configuring the level of fault tolerance in the system as mandated by the mission environment and mission application. The three fundamental elements of EAFTC are:

• Real time environmental sensing
• A COTS based computer architecture that supports adaptable configuration levels of fault tolerance
• A system controller to optimize performance and efficiency while maintaining reliable operation

In this paper we discuss the EAFTC paradigm in detail. We also discuss ongoing efforts to advance EAFTC as a technology to adequate levels for use in operational missions.

0-7803-8870-4/05/$20.00 © 2005 IEEE
IEEEAC paper #10787, Version 4, Updated October 27, 2004
TECHNOLOGY ADVANCE

State of the art onboard processing computers consist mostly of radiation hardened components based on COTS equivalents. Though COTS compatibility offers many benefits, including adoption of commercial software, large amounts of NRE are required for the initial silicon implementation. Additionally, radiation hardened components lag their commercial counterparts in overall performance and capability by at least 1 to 2 orders of magnitude. There are a number of factors that contribute to this deficiency. Most predominantly, radiation-hardening techniques for microelectronics require the use of fixed transistor or gate level redundancy. The additional logic increases the power required to perform the same unit of computation.

An approach towards improvement is the use of true COTS microprocessors and Field Programmable Gate Arrays (FPGAs). This approach avoids the high cost and long development time associated with radiation hardened equivalents. However, COTS devices are highly susceptible to SEUs. The popular SEU mitigation approach is to use component level N-module redundancy, which results in low efficiency and low capacity due to the overhead of 1/3, or more. Furthermore, the level of redundancy is fixed and often unnecessary. To overcome the deficiencies of fixed redundancy, we focused on two characteristics of space missions: the variability of the space environment and task level criticality. Most missions will have a mix of processes with varying criticality. This characteristic of mission processing can be exploited to increase the system's efficiency by applying redundancy at the task level. Furthermore, the variability of the space environment provides a temporal and orbital position dependency on the necessary redundancy. EAFTC exploits both of these characteristics to mitigate SEUs via redundancy, but unlike N-module redundancy, it employs replication selectively and adaptively as mandated by the environment and the particular task.

Hence, EAFTC as a technology mitigates faults, and in particular SEUs in COTS devices, while increasing the system's overall efficiency and capacity. EAFTC does this by optimally applying fault tolerance over the life of the mission as demanded by the task criticality and in-situ environmental measurements.

CONCEPT OF OPERATION

The goal of an EAFTC based system is to optimally employ system level fault tolerance based on historical and in-situ environmental conditions. Figure 1-1 is a block diagram of a representative EAFTC based system. The major functional elements of the system are a controller, an environmental sensor suite, and a target computer (e.g. a payload computer system). The fundamental process implemented in the system consists of three steps:

1. Sense the environment

2. Assess the environmental threat to the system's availability

3. Adapt the processing system's configuration to effectively mitigate the threat presented by the environment

Figure 1-1. EAFTC based system (the SEU Alarm and environmental sensors for temperature, power, etc., together with spacecraft ephemeris, response history, and the Deployment Plan, feed the EAFTC controller, which issues process deployments to, and monitors the health of, the onboard target computer)

In general, EAFTC can be implemented to accept a multitude of environmental inputs that can induce faults in the target computer (for example temperature, available power, etc.). However, for this particular discussion, environmental monitoring is focused on measurements of high-energy particle flux. By monitoring the flux of high-energy particles, it is possible to assess the system's overall susceptibility to SEUs.

Sensor measurements (i.e. SEU Alarm, temperature, and spacecraft ephemeris) and target computer health are continuously monitored by the EAFTC controller and combined with a mission defined application task Deployment Plan. The Deployment Plan contains task level criticality requirements as well as other pertinent information used by the controller. Based on that input, the controller determines the reliability and availability threat posed by the environment on the target computer. The controller then generates the requisite signals to adapt the target computer, thereby countering the hostile environmental threat. The target computer in response optimally employs the requested fault tolerant mechanism. This process is performed in real-time and on-line as an integral part of the overall system operation.
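To make the control flow concrete, the following minimal sketch (in Python, with purely illustrative names such as read_flux, health_status, and select that do not come from the actual Honeywell implementation) shows how the three steps above could be iterated as a periodic loop by the EAFTC controller.

    import time

    ALERT_THRESHOLDS = [1e2, 1e4, 1e6]  # particles/cm^2/s; example values only

    def assess_threat(flux):
        """Step 2: map a measured particle flux onto a discrete alert level."""
        return sum(flux >= t for t in ALERT_THRESHOLDS)

    def eaftc_loop(sensor, target, deployment_plan, period_s=1.0):
        """Periodic sense-assess-adapt cycle (hypothetical object interfaces)."""
        while True:
            flux = sensor.read_flux()                    # step 1: sense the environment
            alert = assess_threat(flux)                  # step 2: assess the threat
            health = target.health_status()
            deployment = deployment_plan.select(alert, health)
            target.apply(deployment)                     # step 3: adapt the configuration
            time.sleep(period_s)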

IMPLEMENTATION METHODOLOGY

Figure 1-2 shows an implementation methodology that incorporates fault tolerant techniques at each level of the system hierarchy.
Figure 1-2. Fault Tolerance Design Methodology (layers from bottom to top: Component Faults, addressed by Component Selection: Radiation Hardened, COTS, Military Grade; Module Faults, addressed by Hardware Methods: Replication, Scrubbing, Heart Beating, Watchdogs, others; Application Faults, addressed by Software Methods: SIFT Middlewares, Roll-back, Check-Pointing, Algorithm Design, others; Mission Faults, addressed by System Methods such as EAFTC)

Figure 1-3. Target Computer Block Diagram (redundant RHPPC SBC System Controllers running VxWorks with system software components, Experiment Manager, Experiment Monitors, Configuration Control, Configuration Memory, and System Profiling; APC Data Processors #1-#4 running Linux with benchmark applications, horizontal services, and EAFTC system components; redundant Switch Fabrics A/B, Power Supplies A/B on 28V power, SEU Alarm Environmental Sensors A/B, and spacecraft interfaces A/B)
The foundation of this methodology starts with component selection through testing, screening, and analysis. At the next layer, fault tolerant hardware design is used to address component faults. In the proposed system, the use of hardware redundancy is adapted to mitigate faults in the critical system subassemblies. The third layer of fault tolerance uses software-based methods for faults not covered by the lower layers in the system. The final fault tolerance layer requires system level methods, such as EAFTC, to address those faults not covered by the application. This overall fault tolerance strategy is employed in the design of our Target Computer and provides an optimal system solution contributing significantly to overall system capacity and efficiency.

HARDWARE IMPLEMENTATION

A Target Computer hardware architecture is proposed that is based on Honeywell's Integrated Payload System. Figure 1-3 depicts the hardware realization of a COTS-based Target Computer for EAFTC. The basic hardware elements of the system are a System Controller, a cluster of Data Processors, a Packet Switched Fabric, and an Environmental Sensor Suite.

System Controller

The System Controller is implemented using redundant, Radiation Hardened Single Board Computers. A highly reliable, radiation hardened System Controller provides a platform for the deployment of critical control software such as the EAFTC controller. A good candidate for a system controller is the Honeywell radiation hardened RHPPC Single Board Computer (SBC). This radiation hardened SBC is based on Motorola 603e microprocessor technology [5]. Key features of the RHPPC SBC are summarized in Table 1-1.

Table 1-1. RHPPC SBC Key Features
• 3.3 V and 5.0 V power
• RHPPC delivering 100 MIPS
• Peripheral Enhancement Component support chip
• 4 MB EEPROM with Single Error Correction and Double Error Detection
• 512 KB EEPROM
• 128 MB DRAM with SuperEDAC
• 6U x 220 mm Euro Card form factor
• Max power draw 15 W
• Mass >3 lbs
• Redundant 1553 (interface to spacecraft computer)
• 32-bit 33 MHz PCI (interface to cluster and MIB electronics)

Data Processors

The Data Processor cluster uses COTS based processors with Honeywell's unique architecture called the Adaptive Processing Computer (APC). The APC is a multi-mode device that combines the use of COTS microprocessors and FPGAs on a single platform. The APC employs a COTS IBM PowerPC 750FX microprocessor and a Xilinx VirtexII 6000 FPGA (the IBM 750fx and Xilinx VirtexII devices have been identified as suitable COTS devices for the flight experiment). Figure 1-4 is a block diagram of the APC.
Figure 1-4. Adaptive Processing Computer Block Diagram (COTS side: PPC 750fx with 64 MB DRAM banks, EEPROM and PROM, the VirtexII 6K Processing Element/Processor Controller, Ethernet, RIO, and a PCI-to-PCI bridge with arbiter onto the cPCI connector; radiation hardened side: Actel-based Configuration Manager and Health Manager with configuration memory, watchdog, UARTs, JTAG, thermistor, and power control)

The left side of the APC block diagram shows the COTS computer's resources and the right side the radiation hardened Configuration Manager and supporting functions. The Configuration Manager handles APC mode changes, basic FPGA configuration, FPGA configuration memory scrubbing, low-level health monitoring, and power mode control. The APC implements three primary modes of operation: microprocessor, custom processor, and hybrid processor. The operating mode is determined by the active configuration of the FPGA labeled Processing Element/Processor Controller in the block diagram.

While in microprocessor mode the APC's FPGA is configured as a Processor Controller and the microprocessor is enabled. In this mode the APC behaves much like an SBC. The Processor Controller FPGA hosts all of the support functions for the PPC including IO, memory controller, interrupts, timers, etc. When enabled as a custom processor, the microprocessor is disabled and cannot execute software. In this mode the FPGA is configured as a Processing Element and hosts a full-custom application including all IO and processing logic. The logic in the Processing Element is defined by the image loaded into the FPGA's configuration memory by the Configuration Manager, which in turn takes its commands from software on the System Controller. The third APC capability is hybrid mode operation, in which the FPGA hosts the Processor Controller for the microprocessor as well as application specific modules. This mode can be likened to a co-processor system. Application specific modules include Digital Signal Processing (DSP) functions, data compression, vector processors, etc. As with the custom mode, the use of application specific modules results in high efficiency and performance yields [10]. This mode offers the most flexibility by retaining a programmable microprocessor and access to custom hardware. The APC is capable of dynamic switching between modes, useful in many applications.
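As a rough illustration of the mode-switching logic just described, the sketch below (with a hypothetical ConfigurationManager interface and made-up image names, not Honeywell's actual software) shows how System Controller software might command the Configuration Manager to load an FPGA image and enable or disable the PowerPC for each mode.

    from enum import Enum

    class ApcMode(Enum):
        # each mode is assumed to correspond to one FPGA configuration image
        MICROPROCESSOR = "processor_controller.bit"  # PPC enabled, FPGA is Processor Controller
        CUSTOM = "processing_element.bit"            # PPC disabled, FPGA hosts full-custom logic
        HYBRID = "hybrid_pe_pc.bit"                  # PPC enabled plus application-specific modules

    def set_apc_mode(config_manager, mode):
        """Ask the (hypothetical) Configuration Manager to switch APC modes."""
        config_manager.load_fpga_image(mode.value)
        config_manager.enable_microprocessor(mode is not ApcMode.CUSTOM)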

The APC's flexibility allows application designers to adapt the target processor to a variety of mission level requirements. Greater efficiency can be achieved by using more custom hardware modules in the FPGA. Similarly, greater processing performance can also be realized in FPGA modules. However, for applications that require ease of programmability, the microprocessor mode would be a better fit. The APC facilitates these and other implementation alternatives not typically available in on-board processor modules. Some key features of the APC are listed in Table 1-2.

Table 1-2. APC Key Features
• 750fx @ 650 MHz delivering 1300 MIPS
• VirtexII 6000 Processing Element/Processor Controller
• PCI 32-bit 33 MHz
• Rapid I/O
• 128 MB DRAM with Super EDAC
• 4 MB EEPROM with SECDED EDAC
• Configuration Manager with support for FPGA SEU mitigation
• PCI-to-PCI bridge facilitating a local PCI bus
• Ethernet development interface
• 6U x 220 mm Euro Card form factor
• Mass <3 lbs
• Max power draw 20 W

Packet Switched Fabric

All system modules are interconnected via a Packet Switched Fabric based on the RapidIO (RIO) commercial industry standard [11]. RIO is recognized by many as the leading edge COTS interconnect. State of the art payload data processor interconnects are mostly based upon multidrop configurations such as MODULE BUS, PCI and VME. Multidrop systems distribute available bandwidth over each module in the system but also produce points of contention among participant nodes, often resulting in system level bottlenecks. In contrast, RIO implements a packet-switched, point-to-point interconnect allowing multiple full-bandwidth point-to-point links to be simultaneously established between end-nodes in a network. RIO reduces contention and delivers more bandwidth to the application.

Figure 1-5 shows a RIO system based on two building blocks: a RIO end-node and a RIO switch. Each end-node in the system is outfitted with a RIO network interface having a point-to-point link to a shared RIO Switch.
The switch receives and routes packets to the appropriate destination. The non-blocking nature of RIO allows concurrent routing of multiple packets. For example, sensor data can be stored in Bulk Memory at the same time as the processor accesses the General Purpose IO. By using multiple switches in a system, topologies consisting of hundreds or thousands of nodes can be achieved.

Figure 1-5. RapidIO Example System (two Processors, a Sensor Data source, Bulk Memory, General Purpose I/O, and Non-volatile Memory end-nodes connected through a RapidIO Switch)

RIO interfaces are based on LVDS signaling technology and can achieve bandwidths of up to 60 Gbits/s for each active link. A 16 bit RIO system with two active point-to-point links is capable of 120 Gbits/s, providing a >120x performance increase over a 33 MHz 32 bit Compact PCI based system.

A notable benefit of the RIO protocol is its extensive error detection and recovery mechanism. By combining retry protocols, cyclic redundancy codes (CRC) and single/multiple error detection, RIO handles all in-network errors without application intervention. This inherent error handling and recovery capability proves ideal for space applications that require a highly reliable interconnect.
Environmental Sensor Suite

EAFTC relies, to a great extent, on the system's capability to sense its environment. As part of PSI's proprietary Reconfigurable Environmentally-Adaptive Computing Technology (REACT), a miniature embedded radiation monitor, the SEU Alarm, has been developed. The SEU Alarm is based on flight-proven technology originally developed for PSI's radiation diagnostic instrumentation [12]. An advantage of the SEU Alarm over state of the art sensors, other than its small footprint, is that it is specifically designed to support SEU rate predictions.

The SEU Alarm provides continuous monitoring of the proton and heavy-ion fluxes that cause single event upsets. The basic components of the SEU Alarm are a small block of scintillators coupled to a photodetector. A number of these simple devices can be consolidated onto a single module; Figure 1-6 shows a module with three sensors, controller electronics, and a network interface.

Figure 1-6. SEU Alarm Module Block Diagram (proton, ion, and COTS proton scintillator/photodetector channels, each with threshold output and alarm analog/digital electronics, feeding a backplane controller FPGA with SSM and SSIO interfaces, thermistor, power, and cPCI connector)
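For illustration only, the sketch below converts raw threshold-crossing counts from such a detector channel into a crude flux estimate; the geometric factor and integration window are made-up numbers, and the real calibration and SEU-rate prediction belong to PSI's instrument design.

    def flux_from_counts(counts, window_s, geometric_factor_cm2_sr=1.0):
        """Approximate particle flux in particles/(cm^2*sr*s) from trigger counts."""
        if window_s <= 0:
            raise ValueError("integration window must be positive")
        return counts / (window_s * geometric_factor_cm2_sr)

    # Example: 420 threshold crossings over a 10 s window
    print(flux_from_counts(420, 10.0))  # -> 42.0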
SOFTWARE FRAMEWORK

The Target Computer's software framework is shown in Figure 1-7. The basic components of the framework are the Operating System/System Software, Fault Tolerant System Controller/Node, EAFTC controller, Messaging Middleware, and Reliable Platform Middleware. The objective of the Target Computer software framework is to provide the system developer with a dependable yet familiar software platform. The user's software consists of mission specific Payload Control and Communications hosted on the System Controller, and application processes distributed across the Data Processor cluster. All software components may be developed using COTS environments and associated Application Program Interfaces (APIs).

Figure 1-7. Software Framework (the RHPPC SBC System Controller runs VxWorks with Payload Control and Communications, the EAFTC Controller, the FT System Controller, Messaging Middleware, and system software components; the APC Data Processors #1-#4 run Linux with applications, the FT System Node, Messaging and RP Middleware, and system software components; the nodes are connected via PCI and the RIO switch, with a synchronous serial link to the SEU Sensor)

The proposed Operating Systems are VxWorks [13] for the System Controller and Linux for the Data Processor cluster. VxWorks provides the capabilities necessary for the deployment of real-time control processes such as those implemented by the EAFTC controller, Fault Tolerant Controller, and Payload Control and Communications. VxWorks also provides a familiar platform for developers of these types of applications. The Data Processor cluster, unlike the System Controller, is the domain of the science application developer. In this case Linux is the preferred OS due to its popularity in the scientific community. To mitigate the problems associated with the interaction of heterogeneous operating systems we have introduced the use of a COTS messaging middleware. The messaging component of GoAhead's SelfReliant Middleware [14] provides a common interface for communication between Linux and VxWorks software components along with a variety of practical messaging services such as publish-subscribe and replicated databases. Messaging within the Data Processor cluster is via the Reliable Platform (RP) Middleware, which is also responsible for Software Implemented Fault Tolerance (SIFT) [15] in the cluster. Together, the OS and Middlewares provide the base platform on which other software is implemented.
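The toy publish-subscribe broker below illustrates the messaging pattern referred to above; it is plain Python and is not the SelfReliant or RP API, only a sketch of how topic-based messaging decouples VxWorks-side and Linux-side components.

    from collections import defaultdict

    class MessageBus:
        """Minimal topic-based publish-subscribe broker (illustrative only)."""

        def __init__(self):
            self._subscribers = defaultdict(list)

        def subscribe(self, topic, handler):
            self._subscribers[topic].append(handler)

        def publish(self, topic, payload):
            for handler in self._subscribers[topic]:
                handler(payload)

    bus = MessageBus()
    bus.subscribe("seu.flux", lambda msg: print("flux update:", msg))
    bus.publish("seu.flux", {"particles_per_cm2_s": 3.2e3})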
EAFTC AND RP MIDDLEWARE

EAFTC relies heavily on the use of two key software components, the Reliable Platform Middleware (RP) and the EAFTC controller, as described in this section.

EAFTC Controller

The EAFTC controller is the core control element of an EAFTC based system. Since the integrity and dependability of the entire system rely on the EAFTC controller, its realization must be highly reliable. Hence, we have selected to implement the EAFTC controller as a software component hosted on a highly reliable System Controller. This implementation will give the most flexibility for future use and adaptations.

Figure 1-8 shows the internal functions of the EAFTC controller in the context of a characteristic system implementation. The details of each internal component of the EAFTC controller are described below.

Figure 1-8. EAFTC Controller Block Diagram (the Environmental Server takes in SEU Alarm flux measurements and spacecraft ephemeris; the Alert Level Generator, backed by the History database, feeds the Deployment Generator together with the Deployment Plan and the Health Monitor; CPU and FPGA Configuration Controllers send process deployments and FPGA configuration/refresh commands to, and collect health from, the onboard target computer)

Environmental Server - Given the variety of possible sensory input, a function has been defined to collect and organize sensor signals into abstract representations that may be shared with other EAFTC components. The Environmental Server encapsulates the low-level interfaces to each of the sensors in the system, including the sampling of each signal.

Health Monitor - The Health Monitor is responsible for monitoring the state of each target system compute resource. Signals such as heartbeats, redundant output consistency mismatches, watchdog time-outs, etc. are collected via the Fault Tolerant Controller/Node components and provided to the Health Monitor. Given predefined policies, the Health Monitor makes a determination of the health of each Data Processor in the APC cluster. This information is then shared with the Deployment Generator where it is used in determining the system's task deployment.
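A small sketch of the kind of policy the Health Monitor could apply is shown below; the timeout value, node identifiers, and method names are illustrative assumptions, not the flight software.

    import time

    HEARTBEAT_TIMEOUT_S = 3.0  # example policy value

    class HealthMonitor:
        """Tracks heartbeats and reported faults for each Data Processor."""

        def __init__(self):
            self._last_heartbeat = {}
            self._fault_flags = set()

        def record_heartbeat(self, node):
            self._last_heartbeat[node] = time.monotonic()

        def record_fault(self, node):
            # e.g. a redundant-output mismatch or watchdog time-out reported
            # by the Fault Tolerant Controller/Node components
            self._fault_flags.add(node)

        def healthy_nodes(self):
            now = time.monotonic()
            return [n for n, t in self._last_heartbeat.items()
                    if now - t < HEARTBEAT_TIMEOUT_S and n not in self._fault_flags]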
History Database - Although reacting to immediate sensory input may be adequate for many applications, the ability to predict near future threats to the system can provide significant advantages. In particular, adapting fault tolerance to address anticipated threats reduces the exposure of the system to faults. A History Database is a key component of the predictive filter implemented in the Alert Level Generator. Sensor measurements from prior spacecraft orbits are maintained in this database and subsequently retrieved by the Alert Level Generator.

Alert Level Generator - The process of evaluating the environmental threat to the system is implemented in the Alert Level Generator. Given the current sensory input, the historical database, and a set of system specific thresholds, the Alert Level Generator outputs a discrete threat level for the system. The core algorithm of the Alert Level Generator is an Adaptive Linear Predictive Filter that generates a particle flux prediction. Based on the prediction, a series of user defined thresholds are evaluated to determine the current system alert level to be used by the Deployment Generator in determining the system's process deployment.
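The sketch below shows one way such a predictor can be realized: a normalized least-mean-squares (LMS) adaptive linear filter over recent flux samples, followed by user-defined thresholds. The filter order, step size, and threshold values are illustrative assumptions, not the values used in the EAFTC study.

    class FluxPredictor:
        """Adaptive linear predictive filter (normalized LMS) over recent flux samples."""

        def __init__(self, order=8, mu=0.5):
            self.weights = [0.0] * order
            self.history = [0.0] * order  # most recent samples, newest first
            self.mu = mu

        def update(self, measured_flux):
            """Correct the filter with a new measurement, return the next-sample prediction."""
            predicted = sum(w * x for w, x in zip(self.weights, self.history))
            error = measured_flux - predicted
            norm = sum(x * x for x in self.history) + 1e-12
            self.weights = [w + self.mu * error * x / norm
                            for w, x in zip(self.weights, self.history)]
            self.history = [measured_flux] + self.history[:-1]
            return sum(w * x for w, x in zip(self.weights, self.history))

    def alert_level(predicted_flux, thresholds=(1e2, 1e4, 1e6)):
        """Discrete alert level: how many user-defined thresholds the prediction exceeds."""
        return sum(predicted_flux >= t for t in thresholds)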

Deployment Plan - The on-line behavior of an EAFTC controller varies based on the target environment, system level requirements, target application, target system architecture, and other implementation specific factors. This application specific behavior is captured as a user defined parameter set. In particular, the Deployment Plan describes the desired system dependability for a given spacecraft position, threat level, and time. The Deployment Plan is fine grained and it is defined by the requirements of each individual application process.
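One plausible encoding of a Deployment Plan entry is sketched below: per-task criticality plus the number of replicas required at each alert level. The field names, task names, and numbers are hypothetical.

    from dataclasses import dataclass

    @dataclass
    class TaskPolicy:
        name: str
        criticality: int           # e.g. 0 = best effort ... 3 = mission critical
        replicas_by_alert: tuple   # replicas required at alert levels 0..3

    DEPLOYMENT_PLAN = [
        TaskPolicy("image_compress",        criticality=1, replicas_by_alert=(1, 1, 2, 2)),
        TaskPolicy("feature_extract",       criticality=2, replicas_by_alert=(1, 2, 2, 3)),
        TaskPolicy("payload_health_report", criticality=3, replicas_by_alert=(2, 2, 3, 3)),
    ]

    def required_replicas(plan, alert):
        """Replication demanded by the plan at the given alert level."""
        return {t.name: t.replicas_by_alert[alert] for t in plan}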
Deployment Generator - Once the system threat level has been assessed, the Deployment Generator acts to counter the threat. Given the Deployment Plan, target system health, and alert level, the Deployment Generator produces a new system deployment. The process of generating a new deployment is primarily based on determining the lowest cost distribution of application processes (including number of replicas) across the available target resources. The generated deployment is then sent to each node in the cluster where local actions implemented by the Fault Tolerant Node software fulfil the deployment requests. Specifically, the Fault Tolerant Node collaborates with the RP Middleware, discussed below, to deploy fault tolerance as requested.
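A greedy sketch of this kind of lowest-cost placement is given below: the required replicas of each task (for instance as produced by the Deployment Plan sketch earlier) are spread across distinct, least-loaded healthy Data Processors. The cost model is deliberately naive and purely illustrative.

    def generate_deployment(required_replicas, task_load, healthy_nodes):
        """Return {task: [nodes hosting its replicas]}; raise if a task cannot be placed."""
        load = {node: 0.0 for node in healthy_nodes}
        deployment = {}
        # place the most demanding tasks first
        for task in sorted(required_replicas, key=lambda t: -task_load.get(t, 1.0)):
            n_replicas = required_replicas[task]
            if n_replicas > len(healthy_nodes):
                raise RuntimeError("not enough healthy nodes for " + task)
            chosen = sorted(load, key=load.get)[:n_replicas]  # distinct least-loaded nodes
            for node in chosen:
                load[node] += task_load.get(task, 1.0)
            deployment[task] = chosen
        return deployment

    # Example: two replicas of a critical task, one of a best-effort task, three healthy APCs
    print(generate_deployment({"payload_health_report": 2, "image_compress": 1},
                              {"payload_health_report": 0.2, "image_compress": 1.0},
                              ["apc1", "apc2", "apc3"]))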
Configuration Controllers - The Configuration Controllers are each designed to interface with a particular target system. Given a new deployment, each Configuration Controller generates the low-level signals to effect required changes in the target system. In the proposed system, two Configuration Controller types are implemented. The first is responsible for interaction with APC nodes operating in microprocessor mode. The second interacts with APC nodes operating in custom processor mode.

Reliable Platform Middleware

The role of WW Technology's RP in the overall EAFTC solution is to implement SIFT. The RP manages the fault tolerance of applications and services distributed across clusters of processors by establishing a consistent framework and common context in which the system operates.

Figure 1-9. WWTG Reliable Platform Middleware Block Diagram (hosted applications sit on the Reliable Platform API; beneath it are Monitoring Services (application monitoring, RP component monitoring, element discovery), the Frame Schedule, Data Integrity, and Process Group services, System Capability Management, Cluster Services (synchronization, application management), Local Resource Monitoring and Management, and Local Services (scheduler, networking, OS services) over the native hardware, operating system, and vendor device drivers. © WW Technology Group 2004)

RP consists of a set of services that facilitate the implementation of reliable systems through the dependable management of redundant/replicated resources. The RP is ideally suited for addressing the needs of composing systems utilizing COTS hardware and software components, as it offers a software based dependability solution that provides transparent Fault Detection, Isolation and Removal (FDIR) services, enabling hosted applications to provide uninterrupted delivery of service in the presence of faults.

Figure 1-9 shows a block diagram of the RP and its relations to other software elements of the system. The main RP framework components are described as follows.

Local Services - These services are local to each processor in the distributed system. These services provide local functionality required for a processor to perform useful work in the cluster. Examples of these types of services include networking, local scheduling, timing, and inter process communications.

Cluster Synchronization - A service that establishes a dependable distributed time base that is consistent across the entire system. This service is based on a message passing technique and uses local physical clocks at each component to form a logical system clock. The Cluster Synchronization service is scalable and efficiently establishes the time base across processors. This time base is used as a backbone for scheduling distributed operations across the cluster.
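The fragment below is only a toy illustration of forming a logical time base from local clocks by averaging measured peer offsets; a real cluster synchronization service additionally bounds message delays and excludes faulty clocks, which is not shown.

    import time

    def logical_time(local_clock_s, peer_offsets_s):
        """Correct the local clock by the mean of the measured offsets to peers."""
        if not peer_offsets_s:
            return local_clock_s
        correction = sum(peer_offsets_s) / len(peer_offsets_s)
        return local_clock_s + correction

    # e.g. this node measures itself 2 ms behind one peer and 4 ms ahead of another
    print(logical_time(time.monotonic(), [0.002, -0.004]))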
System Configuration Services - These services are used to establish and control the configuration of the cluster. The cluster configuration comprises the system physical resources and logical capabilities. The System Configuration Service interacts directly with the EAFTC Fault Tolerant Node component, which in turn communicates with the Fault Tolerant Controller. The EAFTC controller sends its generated deployment via the Fault Tolerant Controller/Node to each processor's System Configuration Service where deployment changes are finally effected.

System Monitoring Services - These services supply the system with the ability to dynamically assess the health of the cluster and localize failed processors and application processes. Assessments are made with a cluster wide perspective using distributed decision-making and integrated monitoring information from across the cluster. Failure notifications from this service are forwarded to the EAFTC Health Monitor via the Fault Tolerant Controller/Node components.

Process Group Management - The proposed approach for enhancing the availability and dependability of payload applications relies on replication. The set of replicated instances are managed as a "process group," which is a peer-to-peer entity in which the support services of each replica are constantly checking the performance/behavior of its local replica against that of its remote peers.
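The core of that peer checking can be pictured as a per-frame comparison of replica outputs, as in the sketch below; the replica identifiers and the simple equality-based vote are illustrative assumptions only.

    from collections import Counter

    def check_replicas(outputs):
        """outputs maps replica id -> output value; return (voted value, divergent replica ids)."""
        if not outputs:
            raise ValueError("no replica outputs to compare")
        voted, _ = Counter(outputs.values()).most_common(1)[0]
        divergent = [rid for rid, value in outputs.items() if value != voted]
        return voted, divergent

    value, suspects = check_replicas({"apc1": 42, "apc2": 42, "apc3": 41})
    print(value, suspects)  # -> 42 ['apc3']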
Scheduling - A scheduling mechanism is available to the hosted applications. This mechanism will initially provide simple indications to application processes as to when to perform their execution cycles and when interaction with other support services may be performed. This scheduling mechanism is based on the common time base established through cluster synchronization. Operations controlled by this scheduling service can be coordinated in time across all elements of the cluster.
Data Integrity - A data integrity capability ensures consistent data sets across replicas. A deviation from this consistent data by a replica is to be interpreted as an error by that replica. This capability allows hosted applications to expose internal state data, facilitating warm starts of additional resources as they come on-line. Additional replicas may join an established group by adopting the internal state of the existing replicas.

The RP offers its services in a highly flexible manner, supporting a distribution of applications that is not necessarily tied to the physical realization of the cluster. The RP utilizes a clustering approach to manage a cluster of processors. Application replicas are hosted on each RP-enabled resource via the RP Interface (RPI), rendering the application "unaware" of the fact that it has been replicated, or to what extent it has been replicated. The RP works in the background to monitor application behavior and recognize when a fault has resulted in application divergence. Of particular importance, the RP not only provides dependability to hosted applications, but the RP is in and of itself dependable, capitalizing internally on the same techniques and properties conveyed to hosted applications.

TECHNOLOGY DEVELOPMENT

EAFTC has been developed to a great extent under the sponsorship of the NASA New Millennium Program (NMP) [16]. Under the NMP Space Technology 8 program, a study phase contract conducted from November 2003 through July 2004 advanced the EAFTC technology to Technology Readiness Level 4 (TRL4). To this end the development team implemented prototypes of an EAFTC system and an SEU Sensor. Additionally, low-fidelity predictive performance models were developed for each prototype.

EAFTC System Prototype

An EAFTC system prototype and an associated predictive model have been implemented with all of the corresponding functional elements of the EAFTC technology. The technology components demonstrated in the prototype system are a Target Computer and an EAFTC controller. In addition to the technology components, instrumentation was integrated with the prototype for measurement of key system performance factors.

The prototype consisted of a combination of Honeywell and COTS elements. All components were functionally integrated, including software, to represent a functional EAFTC system. The software included all of the basic functions of an EAFTC system implemented at an adequate level of fidelity for TRL4 validation. The prototype was demonstrated in a lab environment. The space environment and SEU Alarm response were simulated with data captured in an Environmental Scenario file derived from SPENVIS radiation models [17].

SEU Alarm Prototype

An SEU Alarm prototype was implemented and demonstrated in the context of a REACT system. PSI demonstrated, through measurements and modeling, that the SEU Alarm has sufficient sensitivity and count rate capability to detect SEU-producing particles on orbit (for the Xilinx FPGAs, protons with energies greater than a few MeV have the potential to produce SEUs). Furthermore, PSI demonstrated that the SEU Alarm may be used to adapt fault tolerance on-line.

CONCLUSION

EAFTC is an advanced fault tolerant system paradigm with potential use in many NASA and DoD missions. As a follow-on to the study effort, the NMP is sponsoring a flight experiment of select technologies. EAFTC is a candidate for this flight demonstration. The overall goal of the proposed experiment is to show that EAFTC is a competitive and low-risk solution for missions needing COTS high-performance on-board payload processing.
REFERENCES

[1] "An Overview of Earth Science Enterprise," NASA Goddard Space Flight Center, FS-2002-3-040-GSFC, March 2002.

[2] Wallace M. Porter and Harry T. Enmark, "A System Overview of the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS)," JPL, Pasadena, California.

[3] H.L. Huang, "Data Compression of High-spectral Resolution Measurements," Satellite Direct Readout Conference for the Americas, December 2002.

[4] J. Marshall and R. Berger, "A Processor Solution for the Second Century of Powered Space Flight," 19th Digital Avionics Systems Conference (DASC), Vol. 2, 7-13 Oct. 2000, pp. 8.A.2_1-8.A.2_8.

[5] Gary R. Brown, "Radiation Hardened PowerPC 603e™ Based Single Board Computer," 20th Digital Avionics Systems Conference, Oct. 2001.

[6] E. R. Prado et al., "A Standard Approach to Spaceborne Payload Data Processing," IEEE Aerospace Conference, March 2001.

[7] F. Irom et al., "Single-Event Upset in Evolving Commercial Silicon-on-Insulator Microprocessor Technologies," Nuclear and Space Radiation Effects Conference, 2003.

[8] Xilinx Corporation, "QPro Virtex 2.5V Radiation Hardened FPGA," Xilinx Web site, http://www.xilinx.com/, Nov. 2001.

[9] Daniel P. Siewiorek and Robert S. Swarz, Reliable Computer Systems: Design and Evaluation, 3rd edition, MA: A K Peters Ltd., 1998.

[10] J.S. Donaldson, "Push the DSP Performance Envelope," Xilinx Xcell Journal, Spring 2003.

[11] RapidIO Trade Association Web site, http://www.rapidio.org/

[12] Physical Sciences Inc. Web site, http://www.psicorp.com/index.shtml

[13] Wind River Systems Web site, http://www.windriver.com/

[14] GoAhead Web site, http://www.goahead.com/

[15] P. Ellis and C. J. Walter, "Fault Tolerant Discovery and Formation Protocols for Autonomous Composition of Spacecraft Constellations," IEEE Aerospace Conference, 2003, pp. 837-852.

[16] New Millennium Program Web site, http://nmp.jpl.nasa.gov/

[17] Space Environment Information System Web site, http://www.spenvis.oma.be/spenvis/

ACKNOWLEDGEMENTS

We would like to thank the following people and organizations for their contributions to this work: Gary Galica and Robin Cox from Physical Sciences Inc.; Chris Walter, Peter Ellis and Brian LaValley from WW Technology Group; GoAhead Software; the NASA New Millennium Program; and the entire Honeywell team including John Samson, Lee Hoffman, Jeff Wolfe, and Jason Copenhaver.

BIOGRAPHIES

Jeremy Ramos has been a Systems Engineer with Honeywell Defense and Space Electronic Systems since 1999. Mr. Ramos received a B.S. degree in Computer Science and Engineering from the University of South Florida. Prior to his engineering career Mr. Ramos served for 7+ years with the United States Army as a Technician in the Army Ordnance Corps. Most recently Mr. Ramos has served as the Technical Director of the Honeywell Reconfigurable Space Computer (HRSC) project, the NMP Space Technology 8 Study Phase project, and a number of other technology development efforts. Mr. Ramos's expertise includes computer architecture, system simulation, and reconfigurable computing.

Dean Brenner is a Program Manager for the Satellite Enterprise Team of Honeywell Defense and Space Electronic Systems. He has 14 years of experience in the field of digital signal processing, communications systems, and payload processing development for military space operational systems. Prior to becoming a Program Manager for Honeywell, Mr. Brenner was Principal Staff Systems Engineer for the Advanced Systems engineering group and supported several advanced flight development programs including Space-Based Infrared Systems (SBIRS). Mr. Brenner served as the Principal Investigator for the NMP Space Technology 8 Study Phase EAFTC Project.
Chris J. Walter is the founder of WW Technology Group and has been actively involved in the field of fault tolerant and distributed computing for real-time control systems for over 20 years. Dr. Walter coauthored an IEEE Computer tutorial text on Advances in Ultra-Dependable Distributed Systems. His work on a fault tolerant "X-by-Wire" architecture was transitioned to the Navy's New Virginia Class Submarine (VSSN) for the fault tolerant Ship Control System. DARPA, ONR, Army, Navy, Air Force, JPL and NASA have supported his research. He is a member of the IFIP WG 10.4 on Dependable Computing and of the IEEE Fault-Tolerant Technical Committee. His interests include distributed middleware, reconfigurable computing, systematic design methods and analysis tools for dependable systems, on-line diagnosis, and real-time mission critical computing problems.

Gary E. Galica is currently Group Leader of Radiation Technologies at Physical Sciences Inc. (PSI). He has 15 years of experience in developing high-reliability instrumentation for space, aircraft and terrestrial applications. Since joining PSI in 1995, he has been involved in the development of experiments and instrumentation for a variety of space science and environmental applications, including several generations of Light Particle Detectors that measure the high energy protons, electrons, and alpha particles in space. At PSI, Dr. Galica and the Radiation Technologies Group have developed and delivered 5 flight sensors, 3 engineering model sensors, 5 breadboard sensors and 1 ground test facility since 1995. Dr. Galica received his Ph.D. in Physical Chemistry from the Massachusetts Institute of Technology in 1988, and his B.S. in Chemistry from Tufts University in 1983. His doctoral thesis work included the development of novel experimental and theoretical techniques for the study of photodissociation dynamics in small polyatomic molecules.
