Environmentally Adaptive Fault Tolerant Computing (EAFTC)
Jeremy Ramos, Dean W. Brenner, Gary E. Galica, and Chris J. Walter
Honeywell Inc., Defense & Space Electronic Systems
13350 U.S. Highway 19 North
Clearwater, Florida 33764-7290
727-539-4311/ 727-539-2584
jeremy.ramos@honeywell.com
dean.brenner@honeywell.com
galica@psicorp.com
cwalter@wwtechnology.com
Abstract—The application of commercial-off-the-shelf (COTS) processing components in operational space missions with optimal performance and efficiency requires a system-level approach. Of primary concern is the need to handle the inherent susceptibility of COTS components to Single Event Upsets (SEUs). Honeywell, in conjunction with Physical Sciences Incorporated and WW Technology Group, has developed a new paradigm for fault tolerant COTS based onboard computing. The paradigm is called "Environmentally Adaptive Fault Tolerant Computing" (EAFTC). EAFTC combines a set of innovative technologies to enable efficient use of high performance COTS processors in the harsh space environment while maintaining the required system availability.

0-7803-8870-4/05/$20.00 © 2005 IEEE
IEEEAC paper #10787, Version 4, Updated October 27, 2004

TABLE OF CONTENTS

INTRODUCTION
TECHNOLOGY ADVANCE
CONCEPT OF OPERATION
IMPLEMENTATION METHODOLOGY
HARDWARE IMPLEMENTATION
SOFTWARE FRAMEWORK
EAFTC AND RP MIDDLEWARE
TECHNOLOGY DEVELOPMENT
CONCLUSION
REFERENCES
ACKNOWLEDGEMENTS
BIOGRAPHIES

INTRODUCTION

Science and defense missions alike have ever-increasing demands for data returns from their space assets. In recent times we have seen a significant increase in the capability of the instruments deployed in space [1,2,3]. The traditional implementation approach of data gathering, data compression, and data transmission is no longer sustainable. The vast amounts of data being generated cannot be transmitted via available downlink channels in reasonable time. The industry proposed solution is to reduce the demand on the downlink by moving processing onto the spacecraft. This approach is hampered by the limited capabilities of today's on-board processors and the prohibitive cost of developing radiation hardened high-performance electronics [4,5]. This has partly encouraged the industry to consider the use of COTS components [6]. Furthermore, the recent adoption of silicon-on-insulator (SOI) technology by COTS integrated circuit foundries is resulting in devices with moderate space radiation tolerance [7,8]. Despite all of this progress, COTS components continue to be highly susceptible to SEUs. Therefore, technology must be developed that capitalizes on the strengths of COTS devices while overcoming their susceptibility to SEUs without negating their benefits.

The popular approach for mitigating SEUs is to employ fixed component level redundancy [9]. One major disadvantage of fixed redundancy is low efficiency and unrealized system capacity. The EAFTC paradigm facilitates the use of COTS components in SEU abundant environments while maintaining adequate levels of system efficiency and capacity. It accomplishes this by adaptively configuring the level of fault tolerance in the system as mandated by the mission environment and mission application. The three fundamental elements of EAFTC are:

• Real time environmental sensing
• A COTS based computer architecture that supports adaptable configuration levels of fault tolerance
• A system controller to optimize performance and efficiency while maintaining reliable operation

In this paper we discuss the EAFTC paradigm in detail. We also discuss ongoing efforts to advance EAFTC as a technology to adequate levels for use in operational missions.
TECHNOLOGY ADVANCE

State of the art onboard processing computers consist mostly of radiation hardened components based on COTS equivalents. Though COTS compatibility offers many benefits, including adoption of commercial software, large amounts of NRE are required for the initial silicon implementation. Additionally, radiation hardened components lag their commercial counterparts in overall performance and capability by at least 1 to 2 orders of magnitude. There are a number of factors that contribute to this deficiency. Most predominantly, radiation-hardening techniques for microelectronics require the use of fixed transistor or gate level redundancy. The additional logic increases the power required to perform the same unit of computation.

An approach towards improvement is the use of true COTS components.

CONCEPT OF OPERATION

The goal of an EAFTC based system is to optimally employ system level fault tolerance based on historical and in-situ environmental conditions. Figure 1-1 is a block diagram of a representative EAFTC based system. The major functional elements of the system are a controller, an environmental sensor suite, and a target computer (e.g., a payload computer system). The fundamental process implemented in the system consists of three steps:

1. Sense the environment

2. Assess the environmental threat to the system's availability

3. Adapt the processing system's configuration to effectively mitigate the threat presented by the environment

Figure 1-1. Representative EAFTC based system: sensors (the SEU Alarm reporting particle energy levels, and environmental measurements such as temperature and available power) feed the EAFTC controller; the controller combines sensor response, target computer health, and history with the deployment plan to generate a process deployment for the target onboard computer.

IMPLEMENTATION METHODOLOGY

Figure 1-2 shows an implementation methodology that incorporates fault tolerant techniques at each level of the system hierarchy.
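The three-step sense, assess, adapt cycle described under Concept of Operation can be illustrated with a minimal control-loop sketch. All names, alert levels, thresholds, and configuration values below are illustrative assumptions; the paper does not specify the controller's interfaces.

```python
from dataclasses import dataclass

@dataclass
class Config:
    replicas: int          # replicated copies of each application task (assumed knob)
    scrub_interval_s: int  # assumed memory-scrub period, in seconds

# Assumed mapping from discrete alert level to a fault tolerance configuration.
PLAN = {
    "quiet":    Config(replicas=1, scrub_interval_s=60),
    "elevated": Config(replicas=2, scrub_interval_s=10),
    "storm":    Config(replicas=3, scrub_interval_s=1),
}

def assess(flux, thresholds=(10.0, 100.0)):
    """Step 2: map a sensed particle flux onto a discrete alert level."""
    if flux < thresholds[0]:
        return "quiet"
    if flux < thresholds[1]:
        return "elevated"
    return "storm"

def control_step(sense):
    """One sense -> assess -> adapt cycle of the EAFTC loop."""
    flux = sense()        # 1. sense the environment
    level = assess(flux)  # 2. assess the threat to availability
    return PLAN[level]    # 3. adapt the system configuration
```

For example, `control_step(lambda: 250.0)` selects the triplicated "storm" configuration, trading throughput for fault tolerance while the threat persists.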
Figure 1-2. Implementation methodology: fault tolerant techniques at each level of the system hierarchy. System methods (SIFT middlewares, roll-back, check-pointing, algorithm design, and others) address mission faults; software methods address application faults; hardware methods (replication, scrubbing, heart beating, watchdogs, and others) address module faults; and component selection (radiation hardened, COTS, military grade) addresses component faults.

HARDWARE IMPLEMENTATION

[Figure: hardware implementation block diagram. Redundant RHPPC SBC System Controllers (VxWorks) host the system controller, configuration control, configuration memory, and system profiling; multiple APC Data Processors (Linux) host EAFTC system components, horizontal services, experiment management and monitoring, and benchmark applications; the modules are interconnected by redundant switch fabrics, with redundant power supplies fed from 28V power, SEU Alarm environmental sensors (A/B), and redundant EAFTC spacecraft interfaces (A/B).]
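Heart beating and watchdogs, listed among the hardware methods in Figure 1-2, are the mechanisms by which redundant elements such as these are supervised. A minimal heartbeat monitor might look like the following sketch; the node names and timeout value are illustrative assumptions, not the paper's implementation:

```python
import time

class HeartbeatMonitor:
    """Declare a node failed if no heartbeat arrives within `timeout` seconds."""

    def __init__(self, nodes, timeout=1.0):
        self.timeout = timeout
        # Record an initial "heartbeat" at construction time for every node.
        self.last_seen = {n: time.monotonic() for n in nodes}

    def beat(self, node):
        # Called on receipt of a heartbeat message from `node`.
        self.last_seen[node] = time.monotonic()

    def failed_nodes(self, now=None):
        # Return every node whose last heartbeat is older than the timeout.
        now = time.monotonic() if now is None else now
        return [n for n, t in self.last_seen.items() if now - t > self.timeout]
```

A health monitor built this way only detects silence; consistency mismatches between replicated outputs require the comparison services described later in the paper.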
Data Processors
The Data Processor cluster uses COTS based processors with Honeywell's unique architecture called the Adaptive Processing Computer (APC). The APC is a multi-mode device that combines the use of COTS microprocessors and FPGAs on a single platform. The APC employs a COTS IBM PowerPC 750FX microprocessor and a Xilinx Virtex-II 6000 FPGA. Figure 1-7 is a block diagram of the APC.

[Figure: APC block diagram. A PowerPC 750FX with 64 MB RAM and PROM, an Actel interface and arbiter device providing MPC bus control, APC interrupts, timer synchronization, health monitoring, and JTAG, a backplane controller FPGA, thermistors, synchronous serial I/O, general purpose I/O, non-volatile memory, front panel Ethernet, and cPCI connectors.]

[...] the most flexibility by retaining a programmable microprocessor and access to custom hardware. The APC is capable of dynamic switching between modes, useful in many applications. The APC's flexibility allows application designers to adapt [...] board processor modules. Some key features of the APC are [...]

[...] monitor, the SEU Alarm has been developed. The SEU Alarm provides continuous monitoring of the proton and heavy-ion fluxes that cause single event upsets. The basic components of the SEU Alarm are a small block of scintillators coupled to a photodetector. A number of these simple devices can be consolidated onto a single [...] sensors, other than its small footprint, is that it is [...]

Figure 1-7. Software Framework: the EAFTC Controller and FT System Controller are hosted on the System Controller; an FT System Node and the applications are hosted on each Data Processor node; the SEU Sensor and RIO switch devices are attached via synchronous serial I/O and RIO links.
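The SEU Alarm's basic measurement, a pulse count from the scintillator/photodetector over a sampling window, reduces to a simple rate computation and threshold test. The threshold value in this sketch is an illustrative assumption; the paper does not state the alarm's calibration:

```python
def count_rate(counts, window_s):
    """Scintillator/photodetector pulse counts over a sampling window,
    converted to a rate in counts per second."""
    return counts / window_s

def seu_alarm_tripped(counts, window_s, threshold_cps=100.0):
    """True when the particle count rate exceeds the alarm threshold."""
    return count_rate(counts, window_s) > threshold_cps
```

In the EAFTC context this raw rate is not consumed directly; it is one input to the Alert Level Generator described in the next section.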
SOFTWARE FRAMEWORK

The proposed Operating Systems are VxWorks [13] for the System Controller and Linux for the Data Processor cluster. VxWorks provides the capabilities necessary for the deployment of real-time control processes such as those implemented by the EAFTC controller, Fault Tolerant Controller, and Payload Control and Communications. VxWorks also provides a familiar platform for developers of these types of applications. The Data Processor cluster, unlike the System Controller, is the domain of the science application developer. In this case Linux is the preferred OS due to its popularity in the scientific community. To mitigate the problems associated with the interaction of heterogeneous operating systems we have introduced the use of a COTS messaging middleware. The messaging component of GoAhead's SelfReliant Middleware [14] provides a common interface for communication between Linux and VxWorks software components along with a variety of practical messaging services such as publish–subscribe and replicated databases. Messaging within the Data Processor cluster is via Reliable Platform (RP) Middleware, which is also responsible for the Software Implemented Fault Tolerance (SIFT) [15] in the cluster. Together, the OSs and middlewares provide the base platform on which other software is implemented.

EAFTC AND RP MIDDLEWARE

EAFTC relies heavily on the use of two key software components, Reliable Platform (RP) Middleware and the EAFTC controller, as described in this section.

EAFTC Controller

The EAFTC controller is the core control of an EAFTC based system. Since the integrity and dependability of the entire system rely on the EAFTC controller, its realization must be highly reliable. Hence, we have selected the implementation of the EAFTC controller as a software component hosted on a highly reliable System Controller. This implementation will give the most flexibility for future use and adaptations.

Figure 1-8 shows the internal functions of the EAFTC controller in the context of a characteristic system implementation. The details of each internal component in the EAFTC controller are described below.

Figure 1-8. EAFTC Controller Block Diagram: spacecraft ephemeris and SEU Alarm flux measurements feed the Environmental Server; the Alert Level Generator draws on the current measurements and the History database; the Health Monitor tracks target computer health; and the Deployment Generator, guided by the Deployment Plan, drives the CPU and FPGA Configuration Controllers that configure the target onboard computer.

Environmental Server - Given the variety of possible sensory input, a function has been defined to collect and organize sensor signals into abstract representations that may be shared with other EAFTC components. The Environmental Server encapsulates the low-level interfaces to each of the sensors in the system, including the sampling of each signal.

Health Monitor - The Health Monitor is responsible for monitoring the state of each target system compute resource. Signals such as heartbeats, redundant output consistency mismatches, watchdog time-outs, etc. are collected via the Fault Tolerant Controller/Node components and provided to the Health Monitor. Given predefined policies, the Health Monitor makes a determination of the health of each Data Processor in the APC cluster. This information is then shared with the Deployment Generator, where it is used in determining the system's task deployment.

History Database - Although reacting to immediate sensory input may be adequate for many applications, the ability to predict near future threats to the system can provide significant advantages. In particular, adapting fault tolerance to address anticipated threats reduces the exposure of the system to faults. A History Database is a key component of the predictive filter implemented in the Alert Level Generator. Sensor measurements from prior spacecraft orbits are maintained in this database and subsequently retrieved by the Alert Level Generator.

Alert Level Generator - The process of evaluating the environmental threat to the system is implemented in the Alert Level Generator. Given the current sensory input, the historical database, and a set of system specific thresholds, the Alert Level Generator outputs a discrete threat level for the system. The core algorithm of the Alert Level Generator is an Adaptive Linear Predictive Filter that generates a particle flux prediction. Based on the prediction, a series of user defined thresholds are evaluated to determine the current system alert level to be used by the Deployment Generator in determining the system's process deployment.

Deployment Plan - The on-line behavior of an EAFTC controller varies based on the target environment, system level requirements, target application, target system architecture, and other implementation specific factors. This application specific behavior is captured as a user defined parameter set. In particular, the Deployment Plan describes the desired system dependability for a given spacecraft position, threat level, and time. The Deployment Plan is fine grained and is defined by the requirements of each individual application process.
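The Alert Level Generator's two stages, an adaptive linear predictive filter followed by threshold evaluation, can be sketched as follows. The filter order, step size, and threshold values are illustrative assumptions; the paper does not give them:

```python
class LMSFluxPredictor:
    """One-step particle flux predictor: the prediction is a weighted sum of
    the last `order` samples, with weights adapted by the LMS rule."""

    def __init__(self, order=4, mu=1e-6):
        self.w = [0.0] * order        # filter weights
        self.history = [0.0] * order  # most recent samples, newest first
        self.mu = mu                  # LMS adaptation step size

    def update(self, flux):
        # Predict from the current history, then adapt the weights on the error.
        y_hat = sum(w * x for w, x in zip(self.w, self.history))
        err = flux - y_hat
        self.w = [w + self.mu * err * x for w, x in zip(self.w, self.history)]
        self.history = [flux] + self.history[:-1]
        return y_hat

def alert_level(predicted_flux, thresholds=(10.0, 100.0)):
    """Evaluate user defined thresholds against the predicted flux,
    yielding a discrete alert level (0, 1, or 2 in this sketch)."""
    return sum(predicted_flux >= t for t in thresholds)
```

The resulting level, together with spacecraft position and time, would then index the Deployment Plan to retrieve the desired dependability (for example, per-process replica counts) for the Deployment Generator.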
Deployment Generator - Once the system threat level has
been assessed, the Deployment Generator acts to counter
Figure 1-8 EAFTC Controller Block Diagram
6
the threat. Given the Deployment Plan, target system health, and alert level, the Deployment Generator produces a new system deployment. The process of generating a new deployment is primarily based on determining the lowest cost distribution of application processes (including the number of replicas) across the available target resources. The generated deployment is then sent to each node in the cluster, where local actions implemented by the Fault Tolerant Node software fulfill the deployment requests. Specifically, the Fault Tolerant Node collaborates with the RP Middleware, discussed below, to deploy fault tolerance as requested.

Configuration Controllers - The Configuration Controllers are each designed to interface with a particular target system. Given a new deployment, each Configuration Controller generates the low-level signals to effect the required changes in the target system. In the proposed system, two Configuration Controller types are implemented. The first is responsible for interaction with APC nodes operating in microprocessor mode. The second interacts with APC nodes operating in custom processor mode.

Reliable Platform Middleware

The role of WW Technology's RP in the overall EAFTC solution is to implement SIFT. The RP manages the fault tolerance of applications and services distributed across clusters of processors by establishing a consistent framework and common context in which the system operates.

RP consists of a set of services that facilitate the implementation of reliable systems through the dependable management of redundant/replicated resources. The RP is ideally suited for addressing the needs of composing systems utilizing COTS hardware and software components, as it offers a software based dependability solution that provides transparent Fault Detection, Isolation and Removal (FDIR) services, enabling hosted applications to provide uninterrupted delivery of service in the presence of faults.

Figure 1-9 shows a block diagram of the RP and its relations to other software elements of the system. The main RP framework components are described as follows.

Figure 1-9. WWTG Reliable Platform Middleware Block Diagram: hosted applications sit atop the Reliable Platform API and System Organization API, which expose monitoring services and system capability management over services such as Frame Schedule, Data Integrity, and Process Group.

Local Services - These services are local to each processor in the distributed system. These services provide the local functionality required for a processor to perform useful work in the cluster. Examples of these types of services include networking, local scheduling, timing, and interprocess communications.

Cluster Synchronization - A service that establishes a dependable distributed time base that is consistent across the entire system. This service is based on a message passing technique and uses local physical clocks at each component to form a logical system clock. The Cluster Synchronization service is scalable and efficiently establishes the time base across processors. This time base is used as a backbone for scheduling distributed operations across the cluster.

System Configuration Services - These services are used to establish and control the configuration of the cluster. The cluster configuration comprises the system physical resources and logical capabilities. The System Configuration Service interacts directly with the EAFTC Fault Tolerant Node component, which in turn communicates with the Fault Tolerant Controller. The EAFTC controller sends its generated deployment via the Fault Tolerant Controller/Node to each processor's System Configuration Service, where deployment changes are finally effected.

Process Group Management - The proposed approach for enhancing the availability and dependability of payload applications relies on replication. The set of replicated instances is managed as a "process group," a peer-to-peer entity in which the support services of each replica are constantly checking the performance/behavior of its local replica against that of its remote peers.

Scheduling - A scheduling mechanism is available to the hosted applications. This mechanism will initially provide
simple indications to application processes as to when to perform their execution cycle and when interaction with other support services may be performed. This scheduling mechanism is based on the common time base established through cluster synchronization. Operations controlled by this scheduling service can be coordinated in time across all elements of the cluster.

Data Integrity - A data integrity capability ensures consistent data sets across replicas. A deviation from this consistent data by a replica is to be interpreted as an error by that replica. This capability allows hosted applications to expose internal state data, facilitating warm starts of additional resources as they come on-line. Additional replicas may join an established group by adopting the internal state of the existing replicas.

The RP offers its services in a highly flexible manner, supporting a distribution of applications that is not necessarily tied to the physical realization of the cluster. The RP utilizes a clustering approach to manage a cluster of processors. Application replicates are hosted on each RP-enabled resource via the RP Interface (RPI), rendering the application "unaware" of the fact that it has been replicated, or to what extent it has been replicated. The RP works in the background to monitor application behavior and to recognize when a fault has resulted in application divergence. Of particular importance, the RP not only provides dependability to hosted applications, but the RP is in-and-of itself dependable, capitalizing internally on the same techniques and properties conveyed to hosted applications.

TECHNOLOGY DEVELOPMENT

The prototype consisted of a combination of Honeywell and COTS elements. All components were functionally integrated, including software, to represent a functional EAFTC system. The software included all of the basic functions of an EAFTC system implemented at an adequate level of fidelity for TRL4 validation. The prototype was demonstrated in a lab environment. The space environment and SEU Alarm response were simulated with data captured in an Environmental Scenario file derived from SPENVIS radiation models [17].

SEU Alarm Prototype

An SEU Alarm prototype was implemented and demonstrated in the context of a REACT system. PSI demonstrated, through measurements and modeling, that the SEU Alarm has sufficient sensitivity and count rate capability to detect SEU-producing particles on orbit. Furthermore, PSI demonstrated that the SEU Alarm may be used to adapt fault tolerance on-line.

CONCLUSION

EAFTC is an advanced fault tolerant system paradigm with potential use in many NASA and DoD missions. As a follow-on to the study effort, the NMP is sponsoring a flight experiment of select technologies. EAFTC is a candidate for this flight demonstration. The overall goal of the proposed experiment is to show that EAFTC is a competitive and low-risk solution for missions needing COTS high-performance on-board payload processing.
REFERENCES

[1] "An Overview of Earth Science Enterprise", NASA Goddard Space Flight Center, FS-2002-3-040-GSFC, March 2002.

[2] Wallace M. Porter and Harry T. Enmark, "A System Overview of The Airborne Visible/Infrared Imaging Spectrometer (AVIRIS)", JPL, Pasadena, California.

[3] H.L. Huang, "Data Compression of High-spectral Resolution Measurements", Satellite Direct Readout Conference for the Americas, December 2002.

[4] J. Marshall and R. Berger, "A Processor Solution for the Second Century of Powered Space Flight," 19th Digital Avionics Systems Conference (DASC), Vol. 2, 7-13 Oct. 2000, pp. 8.A.2-1 to 8.A.2-8.

[5] Gary R. Brown, "Radiation Hardened PowerPC 603e™ Based Single Board Computer," 20th Digital Avionics Systems Conference, Oct. 2001.

[15] P. Ellis and C. J. Walter, "Fault Tolerant Discovery and Formation Protocols for Autonomous Composition of Spacecraft Constellations," IEEE Aerospace Conference, 2003, pp. 837-852.

[16] New Millennium Program Web site, http://nmp.jpl.nasa.gov/

[17] Space Environment Information System (SPENVIS) Web site, http://www.spenvis.oma.be/spenvis/

ACKNOWLEDGEMENTS

We would like to thank the following people and organizations for their contributions to this work: Gary Galica and Robin Cox from Physical Sciences Inc.; Chris Walter, Peter Ellis and Brian LaValley from WW Technology Group; GoAhead Software; NASA New Millennium Program; and the entire Honeywell team, including John Samson, Lee Hoffman, Jeff Wolfe, and Jason Copenhaver.
BIOGRAPHIES
Chris J. Walter is the founder of WW
Technology Group and has been actively
involved in the field of fault tolerant and
distributed computing for real-time
control systems for over 20 years. Dr.
Walter coauthored an IEEE Computer
tutorial text on Advances in Ultra-
Dependable Distributed Systems. His
work on a fault tolerant “X-by-Wire” architecture was
transitioned to the Navy’s New Virginia Class Submarine
(VSSN) for the fault tolerant Ship Control System. DARPA,
ONR, Army, Navy, Air Force, JPL and NASA have
supported his research. He is a member of the IFIP WG
10.4 on Dependable Computing and of the IEEE Fault-
Tolerant Technical Committee. His interests include
distributed middleware, reconfigurable computing,
systematic design methods and analysis tools for
dependable systems, on-line diagnosis, and real-time
mission critical computing problems.