

IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 16, NO. 4, APRIL 2008

Fault Emulation for Dependability Evaluation of VLSI Systems


David de Andrés, Juan Carlos Ruiz, Daniel Gil, and Pedro Gil

Abstract—Advances in semiconductor technologies are greatly increasing the likelihood of fault occurrence in deep-submicrometer manufactured VLSI systems. The dependability assessment of VLSI critical systems is a hot topic that requires further research. Field-programmable gate arrays (FPGAs) have been recently proposed as a means for speeding up the fault injection process in VLSI system models (fault emulation) and for reducing the cost of fixing any error due to their applicability in the first steps of the development cycle. However, only a reduced set of fault models, mainly stuck-at and bit-flip, have been considered in fault emulation approaches. This paper describes the procedures to inject a wide set of faults representative of deep-submicrometer technology, like stuck-at, bit-flip, pulse, indetermination, stuck-open, delay, short, open-line, and bridging, using the best suited FPGA-based technique. This paper also sets some basic guidelines for comparing VLSI systems in terms of their availability and safety, which is mandatory in mission- and safety-critical application contexts. This represents a step forward in the dependability benchmarking of VLSI systems and towards the definition of a framework for their evaluation and comparison in terms of performance, power consumption, and dependability.

Index Terms—Fault injection, field-programmable gate arrays (FPGAs), run-time reconfiguration, validation of VLSI circuits.

I. INTRODUCTION

NOWADAYS, computer-based systems are widely used in almost any application domain. The steady reduction of transistor size has greatly decreased the overall size of current chips, also allowing for lower power consumption and higher frequency rates, leading altogether to greater performance. These same features that improve the performance of VLSI systems negatively affect their dependability. Low-power, high-speed, and reduced-size transistors greatly increase the likelihood of fault occurrence in these systems [1], [2]. There exists a large number of computer-based systems that people rely on, often for their livelihood, sometimes for their lives, called critical systems, in which the occurrence of a failure can be totally unacceptable. These systems cannot be designed and evaluated only under the same principles as performance-driven systems; a dependability perspective must also be adopted. Hence, a great concern has arisen towards the development of new and efficient techniques for the validation of critical systems in the presence of representative faults.

Manuscript received August 10, 2006. This work was supported by the Spanish MCYT Project TEC 2005-05119/MIC. The authors are with the Fault Tolerant Systems Group (GSTF), Technical University of Valencia (UPV), DISCA-ETS Informática Aplicada, Valencia E-46021, Spain (e-mail: ddandres@disca.upv.es; jcruizg@disca.upv.es; dgil@disca.upv.es; pgil@disca.upv.es). Digital Object Identifier 10.1109/TVLSI.2008.917428

That part of a system's total state that may lead to a failure, i.e., a deviation of the delivered service from the correct one, is called an error, and the cause of an error, due to development, physical, or interaction reasons, is called a fault [3]. The impossibility of observing the system in the field to obtain statistical data makes fault injection [4], which consists in the deliberate introduction of faults into a system, a very valuable methodology in the validation process. Hardware-implemented fault injection (HWIFI) and software-implemented fault injection (SWIFI) techniques [5], [6] use a prototype to introduce faults into the system. The required extra hardware usually increases the cost of HWIFI techniques, and some of them may even damage the system under test. SWIFI techniques are usually a low-cost alternative to HWIFI that can target operating systems and applications. Although SWIFI is a flexible approach, it can only target software-accessible locations. Even though prototypes rapidly execute the experiments, these techniques can only be applied in the last stages of the system development cycle, thus increasing the cost of fixing any error in the design. Model-based fault injection techniques [7], [8] solve that particular problem by injecting faults into system models. Since a final prototype is not required, they allow for the early validation of the system, thus reducing the cost of fixing any error in the design. However, the simulation of modern complex models may take an enormous amount of time. In this context, field-programmable gate arrays (FPGAs) have been mainly used for the logic emulation (prototyping and validation) of VLSI systems [9]. Recently, they have been proposed as a means for speeding up the execution of model-based fault injection experiments. This technique, known as fault emulation, consists in implementing the model of the system on the FPGA to accelerate its execution.
It relies on the reconfigurable capabilities of FPGAs to inject the fault into the system. It is also easier and cheaper than prototype-based fault injection techniques (HWIFI and SWIFI). The large variety of possible faults is sorted into fault models, which group faults that cause similar effects on the system. Recent studies [10] show that the classical stuck-at and bit-flip fault models are not enough to cope with the fault mechanisms of new deep-submicrometer technologies. New fault models must also cover aspects like pulses, indeterminations, delays, stuck-opens, shorts, open-lines, and bridgings. Most FPGA-based fault injection techniques and tools only cover the well-known stuck-at and bit-flip fault models. Hence, although fault emulation techniques can be effectively used to study the robustness of VLSI systems, it is also necessary to study whether they can target the whole set of new fault models for deep-submicrometer manufactured systems.
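To make the distinction between these logic-level fault models concrete, the following sketch (illustrative only; the enum and function names are ours, not from the paper) shows the effect each transient model has on a signal value while the fault is active:

```c
#include <stdbool.h>

/* Illustrative semantics of some logic-level transient fault models:
 * a stuck-at forces a constant value for the fault duration, a
 * bit-flip inverts a stored value, a pulse inverts a combinational
 * line only while the pulse lasts, and an indetermination yields an
 * unknown (here: caller-supplied pseudo-random) logic value. */
typedef enum { STUCK_AT_0, STUCK_AT_1, BIT_FLIP, PULSE, INDETERMINATION } fault_model;

bool apply_fault(fault_model m, bool current, bool fault_active, unsigned rnd)
{
    if (!fault_active)
        return current;                    /* fault not present: value unchanged */
    switch (m) {
    case STUCK_AT_0:      return false;    /* line forced to logic 0 */
    case STUCK_AT_1:      return true;     /* line forced to logic 1 */
    case BIT_FLIP:        return !current; /* stored bit inverted */
    case PULSE:           return !current; /* transient inversion on a line */
    case INDETERMINATION: return rnd & 1u; /* unknown logic value */
    }
    return current;
}
```

Delays, shorts, open-lines, and bridgings are not value transformations of a single signal and therefore do not fit this simple picture; as discussed later, they require manipulating the routing resources instead.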

1063-8210/$25.00 © 2008 IEEE


This paper focuses on the specification of a methodology for the injection of a wide set of faults considered representative of deep-submicrometer technology that had not yet been addressed for fault emulation. Guidelines for the study and comparison of VLSI systems under a dependability perspective, which can be considered a first step towards a general framework for dependability benchmarking, are also provided. Section II studies the existing fault emulation techniques and analyzes which one can achieve a better speedup for the dependability assessment of modern complex VLSI systems. Section III deals with the emulation of the whole set of considered fault models following the technique selected in Section II. The basic operations that can be performed on the configuration memory of a generic FPGA are used to specify the procedures to emulate such faults. FADES (FPGA-based framework for the Analysis of the Dependability of Embedded Systems), a tool for the dependability assessment of VLSI systems which implements the proposed methodology, is described in Section IV. Three different microprocessors are compared in terms of their availability and safety according to the experiments reported in Section V. Section VI identifies some open challenges calling for further research, and Section VII outlines the main conclusions of this paper.

II. RELATED WORK

Nowadays, several different commercial systems enable the rapid prototyping and logic validation of VLSI circuits. These systems, named logic emulators, usually consist of one or several motherboards which hold a certain number of FPGAs depending on the complexity of the model to be implemented. The use of logic emulation principles to speed up model-based fault injection experiments is known as fault emulation. One of the first attempts to use logic emulators for fault injection purposes was proposed in [11] under the name of serial fault emulation. This first approach was mainly aimed at obtaining the test coverage (fault grading) of large circuits. The FPGA configuration file of the fault-free circuit was used to debug the circuit and obtain the expected values. New configuration files were generated for each fault to be injected. These files reconfigured the logic emulator to emulate the behavior of the system in the presence of the considered faults. Although the use of FPGAs accelerated fault injection experiments, it soon became clear that the following two aspects limited the attainable speedup. 1) The implementation of medium- to high-complexity models takes from a few minutes to some hours. As a new model is implemented for each of the considered faults, this can lead to such long implementation times that the methodology may become impracticable. 2) The fault injection process involves the reconfiguration of the logic emulator to inject the fault and, in the case of a transient fault, to delete it later. The full reconfiguration of a typical FPGA, when downloading the file from a PC platform, is in the order of tens of milliseconds. Thus, the global reconfiguration time may greatly exceed the execution time of the experiments. New techniques, intended to minimize the number of required implementations and reconfigurations, and the size of

Fig. 1. (a) Compile-time and (b) run-time reconfiguration flow.

the new reconfiguration files, have evolved from the classical approaches for implementing reconfigurable applications [12]. Under compile-time reconfiguration (CTR), all the different required functionalities of the system are described in the same model. Once implemented on an FPGA, the proper activation of a number of control signals will cause the circuit to assume the requested functionality. The use of CTR for fault emulation [cf. Fig. 1(a)] consists in modifying the original model of the system to include the necessary logic to inject the fault and monitor the system's state. This approach was first proposed in [13] under the name of dynamic fault injection. The injection logic was introduced at different locations, and the activation of a particular fault was achieved by using a circular shift register, a demultiplexer-based structure [14], or a hybrid approach [15]. The most renowned tool that follows this methodology is fault injection by means of FPGA (FIFA), which was successfully used to inject faults into VLSI [16] and microprocessor-based [17] systems. This methodology reduces the number of model syntheses and implementations, since the instrumentation includes the logic required to inject a wide number of faults, and it minimizes the fault injection and deletion time. Nevertheless, the addition of custom logic at every selected injection point largely increases the size of the instrumented model. Depending on the implementation of the fault models, the number of injection points, and the FPGA size, this new instrumented model may not fit in the selected device. Therefore, an injection campaign


usually needs several partial instrumentations of the model to cover all the desired injection points and fault models. Hence, although CTR reduces the number of implementations with respect to the original serial fault emulation approach, it is still not enough when dealing with large and complex systems.

Run-time reconfiguration (RTR) follows a dynamic allocation approach that reallocates the required hardware for each of the desired functionalities. Each application consists of multiple configurations per FPGA, each implementing some fraction of the application. The FPGA implements the original model of the system and, when needed, the internal resources of the FPGA are reallocated to emulate the new functionality. The use of RTR for fault emulation [cf. Fig. 1(b)] focuses on rewriting the FPGA's configuration memory to reflect the behavior of the system in the presence of faults [18]. One of the better known automated tools implementing this methodology is an FPGA-based fault simulator [19]. This methodology only requires the synthesis and implementation of the original system's model, and it somewhat reduces the time required to inject and delete the fault when the partial reconfiguration of the device is viable. However, it introduces some temporal overhead in the execution of the model due to the transfer of information to and from the FPGA. This overhead, which depends on the implementation and location of the fault and on the FPGA's architecture, is largely counterbalanced by the time saved by implementing the model just once. Therefore, RTR seems to be best suited for the dependability assessment of large and complex systems. Although RTR has been selected as the best suited approach to speed up fault injection experiments on complex models, a major problem remains to be solved when dealing with current VLSI systems. Most existing implementations of both fault emulation approaches (CTR and RTR) usually cover only the well-known stuck-at and bit-flip fault models.
However, the dependability of modern critical systems must be evaluated in the presence of faults considered representative of new deep-submicrometer technologies. Hence, this paper focuses on whether RTR techniques can be efficiently used to emulate the occurrence of the whole set of fault models described in [10].
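The contrast between the two approaches can be sketched as follows (an illustrative model under our own naming, not any vendor's real bitstream format). Under CTR, every candidate injection point carries extra multiplexing logic compiled into the model; under RTR, the model is untouched and the injector rewrites configuration memory bits:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stddef.h>

/* CTR: a saboteur multiplexer is compiled into the model at every
 * candidate injection point; a one-hot select (e.g., driven by a
 * circular shift register) activates at most one per experiment. */
typedef struct {
    bool enable;       /* is this saboteur selected? */
    bool fault_value;  /* value to force when selected */
} saboteur;

bool instrumented_net(bool original, const saboteur *s)
{
    return s->enable ? s->fault_value : original;
}

/* RTR: the model is not modified; the fault is injected by rewriting
 * the configuration memory (modelled here as a flat frame of words)
 * and deleted by restoring the saved fault-free frame.  With partial
 * reconfiguration, only this frame crosses the host/FPGA link. */
#define FRAME_WORDS 16

typedef struct { uint32_t w[FRAME_WORDS]; } cfg_frame;

void rtr_inject(cfg_frame *cfg, cfg_frame *saved, size_t word, unsigned bit)
{
    *saved = *cfg;               /* keep the fault-free configuration */
    cfg->w[word] ^= (1u << bit); /* flip one configuration bit */
}

void rtr_delete(cfg_frame *cfg, const cfg_frame *saved)
{
    *cfg = *saved;               /* restore the fault-free frame */
}
```

Note how CTR pays in area (one saboteur per injection point, whether exercised or not) while RTR pays in reconfiguration traffic, which matches the tradeoff discussed above.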

Fig. 2. Generic architecture of an FPGA.

Fig. 3. Structure of a generic configurable block.

III. EMULATION OF DEEP-SUBMICROMETER FAULT MODELS

Critical systems must be validated in the presence of faults considered representative of deep-submicrometer technologies. Since we are focusing on the use of RTR techniques, the fault injection process consists in writing a new configuration file into the FPGA's configuration memory to emulate the behavior of the system in the presence of faults. Information regarding the current state of the system, the current configuration of the FPGA's reconfigurable elements, and the fault to be injected must be gathered to generate this file. The use of RTR techniques is closely related to the architecture of the FPGA being considered. Therefore, a generic FPGA architecture must first be defined to make this study as comprehensive as possible. It must include all the elements that are susceptible of being reconfigured to emulate the behavior of the system in the presence of faults. This section provides a list of all the available operations that can be used for fault emulation using RTR techniques. After that, the set of considered fault models and the procedures to inject/delete them into/from VLSI system models are described.

A. Generic FPGA Architecture

Although somewhat different, the internal structures of the main FPGA families, such as Xilinx's Virtex [20], Altera's Stratix [21], Lattice's LatticeSC [22], and Atmel's AT40K [23], are very similar. From an abstract viewpoint, every FPGA integrates a number of configurable elements (cf. Fig. 2), organized as a grid of configurable blocks (CBs) that are connected by means of programmable matrixes (PMs).

1) Configurable Blocks: CBs (cf. Fig. 3) mainly consist of a function generator, a D-type flip-flop, and a number of multiplexers that determine the CB's functionality by establishing the proper connections among the CB's inputs, outputs, and internal elements. The function generator, usually built as a four-input lookup table (LUT), implements the combinational logic of the circuit. The flip-flop (FF) acts as a storage element that implements the sequential logic of the circuit. The multiplexer InvertFFinMux (MUX1) can be used to invert the logic value of the dedicated sequential input signal. Multiplexer LUTorFFMux (MUX2) determines whether that signal or the LUT's output feeds the FF. A local or global signal will trigger the preset and clear of the FF depending on the configuration of multiplexers PRMux (MUX5) and CLRMux (MUX6), respectively. The multiplexer LSRMux (MUX4) enables the use of the local signal for triggering the set/reset of the FF, and the InvertLSRMux (MUX3) can invert the incoming logic value of that local signal.

2) Programmable Matrixes: PMs interconnect CBs by linking line segments that cross the device in both vertical and horizontal directions. Each connection is usually established by means of a pass transistor (PT) that can be turned on or off.

3) Configuration Memory: The configuration of all these elements is controlled by the configuration memory of the FPGA.
For instance, some bits represent the truth tables of the circuits


TABLE I BASIC OPERATIONS ON FPGAS RECONFIGURABLE ELEMENTS

TABLE II REPRESENTATIVE FAULT MODELS FOR DEEP-SUBMICROMETER TECHNOLOGY

implemented by LUTs, other bits hold the current state of FFs, some more bits set the control inputs of the multiplexers, and other bits turn pass transistors on/off, routing all the lines of the circuit. The configuration memory is loaded with the file obtained from the model synthesis and implementation process, which results in the reconfiguration of all these elements so that they functionally behave as the system described by the model.

B. Operations on the FPGA's Reconfigurable Elements

Different operations can be applied to the reconfigurable elements (LUTs, FFs, multiplexers, and pass transistors) of any FPGA matching the generic architecture depicted in Fig. 2. In general, these operations are related to obtaining or modifying the current configuration of the considered element. The operations in charge of acquiring the contents or determining the state of a configurable element are listed in Table I as read operations. Likewise, the operations that can be used to change the contents or the state of these elements appear as write operations in Table I. As an FF's state can only be changed by normal operation or by triggering its clear or preset signal, no write operation is available for FFs. These basic operations will be used to inject and delete faults in VLSI system models following the RTR approach.

C. Fault Models

Transient faults appear during a short period of time. They usually result from the interference or interaction of the circuitry with its physical environment (transients in the power supply, crosstalk, electromagnetic interference, temperature variation, cosmic radiation, etc.). Permanent faults are related to irreversible physical defects in the circuit. They are usually produced during the manufacturing process or during normal operation due to a wear-out process. In the latter case, they sometimes initially manifest as intermittent faults before causing a permanent one.
A previous work on fault representativeness [10] established the connection between the physical-device and logic levels of the system by theoretically studying the physical causes and mechanisms related to modern deep-submicrometer technologies and how they manifest at the logic level. It covered alpha and cosmic radiation, oxide breakdown, plasma damage, electromigration, hot carrier degradation, stress voiding, packaging and assembly defects, process variations, and manufacturing residuals. Table II lists the set of deduced fault models that are currently considered representative of deep-submicrometer technologies at the logic level. These are the models considered for fault emulation purposes in the rest of this paper.
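Before turning to the emulation procedures, the generic configurable block of Section III-A and the read/write operations of Table I can be sketched as a small data model (simplified, with our own naming; a real device adds carry chains, clock enables, etc.). Emulating, for instance, a stuck-at on a LUT output then reduces to writing a constant truth table:

```c
#include <stdint.h>
#include <stdbool.h>

/* Generic configurable block: a 4-input LUT stored as a 16-bit
 * truth table plus a D flip-flop.  read_lut/write_lut stand in for
 * the basic operations of Table I; the FF state is read-only via
 * the configuration, matching the restriction noted in the text. */
typedef struct {
    uint16_t lut;   /* truth table: bit i = output for input pattern i */
    bool     ff;    /* current flip-flop state */
} config_block;

bool lut_eval(const config_block *cb, unsigned inputs /* 0..15 */)
{
    return (cb->lut >> (inputs & 0xFu)) & 1u;
}

uint16_t read_lut(const config_block *cb)         { return cb->lut; }
void     write_lut(config_block *cb, uint16_t tt) { cb->lut = tt; }

/* Emulating a stuck-at on the LUT output: overwrite the truth table
 * with a constant; the saved table restores fault-free behaviour. */
void inject_stuck_at(config_block *cb, uint16_t *saved, bool value)
{
    *saved = cb->lut;
    cb->lut = value ? 0xFFFFu : 0x0000u;
}
```

With `lut = 0x8000` (a 4-input AND, true only for input pattern 15), injecting a stuck-at-0 makes every evaluation return 0, and restoring the saved truth table via `write_lut` brings back the original function.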

TABLE III ADVANCED OPERATIONS ON FPGAS EXTRACTED INFORMATION

D. Fault Emulation Process

The fault emulation process consists in physically injecting faults into the FPGA to emulate the faults virtually occurring in the system's model. The emulation of any fault involves two different steps: 1) introducing the fault into the system when the injection time is reached and 2) deleting the fault once it disappears from the system. These two phases, clearly differentiated when dealing with transient faults, are not so clearly identified for permanent faults, as these always remain in the system. When an experiment ends, it is necessary to reset the system to its original state and, in the case of permanent faults, it is also required to restore the original fault-free configuration. As this process takes up much more time than just deleting the fault, we propose considering permanent faults as transient ones with a duration equal to that of the fault injection experiment. In that way, all the faults follow the same emulation methodology. The fault emulation process makes use of the basic operations defined in Table I to modify the FPGA's reconfigurable elements in order to emulate the behavior of the system in the presence of the considered faults [cf. Table II] and to restore its fault-free behavior when the fault disappears. The description of the fault emulation process is supported by four additional operations [cf. Table III] which depend on information extracted from the FPGA. Tables IV and V summarize, in C-like pseudo-code, the proposed procedures for emulating the fault models introduced in Table II. They describe which elements must be reconfigured to inject and delete each considered fault into/from the system's model and how to accomplish this by using FPGAs. Due to the synchronous nature of FPGAs, the current logical value of combinational lines cannot be obtained through


TABLE IV PSEUDO-CODE FOR THE EMULATION OF TRANSIENT DEEP-SUBMICROMETER TECHNOLOGY FAULT MODELS

their configuration memory. Thus, fault models like pulse and stuck-open require two experiments for each single fault: one assuming a current logical 0 and another assuming a logical 1. The results from both experiments must be compared to determine the correct result of the fault emulation process. One index identifies elements belonging to the CB/PM affected by the fault, whereas another index refers to elements pertaining to other CBs/PMs. Variables csX hold the current (fault-free) state or configuration of the named element. Variables pin, pinIn, and pinOut reference input and output pins of a CB. Variables srcX (source) and dstX (destination) refer to pass transistors allocated along a given line. A further variable identifies whether a stuck-at-0 or a stuck-at-1 is considered for injection.

IV. FADES: DEPENDABILITY ASSESSMENT OF VLSI SYSTEMS

FADES represents a first step toward the development of a general framework for assessing the dependability of VLSI systems [25]. An overview of its architecture is shown in Fig. 4. The first prototype of this tool uses Xilinx's Virtex FPGAs to implement the model of the system. This decision was based on two facts: 1) these devices allow for dynamic and partial reconfiguration [26], accelerating the fault emulation process as only a fraction of the FPGA must be reconfigured and 2) Xilinx provides a Java package named JBits [27], which consists of a number of functions for working with Virtex devices. This package eases the task of generating new partial FPGA configurations and changing the current one. The software modules that control the fault emulation process have been written in Java, benefiting from the advantages of the JBits package and assuring their portability across platforms. A graphical user interface enables the configuration of the tool for experimentation. FADES implements an improved version of the general RTR flow, which is depicted in Fig. 5. This flow is divided into the following three modules.

1) Experiments and Prototyping Platform Set-Up. The FPGA's configuration file is obtained by following a classical flow for reconfigurable devices. It consists in the synthesis and implementation (placement and routing) of the model, usually described using a hardware description language (HDL). Many different parameters, such as the fault model, location, and duration, the observation points, and the length of the experiments, have to be determined for the definition of fault injection experiments.
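An experiment definition along these lines might be sketched as follows (field names are ours, for illustration). Note how a permanent fault is handled as a transient whose duration equals the remainder of the experiment, as proposed in Section III-D, so both kinds share the same injection/deletion loop:

```c
#include <stdbool.h>

/* One fault injection experiment: the fault is injected when the
 * cycle counter reaches inject_time and deleted duration cycles
 * later.  A permanent fault is given a duration that spans the rest
 * of the experiment, so the same loop handles both fault kinds. */
typedef struct {
    unsigned inject_time;   /* cycle at which the fault appears */
    unsigned duration;      /* cycles the fault stays active    */
    unsigned total_cycles;  /* workload length in cycles        */
} experiment;

experiment make_permanent(unsigned inject_time, unsigned total_cycles)
{
    experiment e = { inject_time, total_cycles - inject_time, total_cycles };
    return e;
}

/* Whether the fault is active at a given cycle of the workload. */
bool fault_active(const experiment *e, unsigned cycle)
{
    return cycle >= e->inject_time && cycle < e->inject_time + e->duration;
}
```

A controller would then step the workload cycle by cycle, performing the injection reconfiguration on the first cycle where `fault_active` becomes true and the deletion (restore) on the first cycle where it becomes false again.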


TABLE V PSEUDO-CODE FOR THE EMULATION OF PERMANENT DEEP-SUBMICROMETER TECHNOLOGY FAULT MODELS

Fig. 4. Architecture of FADES.

2) Execution of Fault Injection Experiments. A fault injection experiment deliberately causes the occurrence of a certain fault in the system under test at the estimated injection time, while executing the selected workload on the system. This module makes use of JBits to control the prototyping board. It reads and analyzes the FPGA's contents, generates files reflecting the changes required to emulate the occurrence of faults, and writes these files into the FPGA's configuration memory. At the end of the experiment, the system is reset to its original state and a new experiment can be launched.

The fault injection and deletion process for each fault model has been implemented following the procedures presented in Tables IV and V.

3) Observation and Analysis of Results. Selected observation points, mainly the significant outputs (service provided) and internal elements (state) of the system, are monitored throughout the execution of the workload, and their state is recorded in a trace file. A trace of the fault-free execution of the workload, named the Golden Run, is also stored. The Golden Run can be compared to the traces of fault injection experiments to determine the effect of faults on the service provided by the system and on its final state. According to these effects, Table VI proposes a failure mode classification useful for studying the robustness of any kind of VLSI system. FADES provides an error syndrome of the system, i.e., the percentage of injected faults that lead the system to behave according to each one of the aforementioned failure modes.

V. EXPERIMENTS AND RESULTS

Several different experiments have been conducted by means of FADES to show the feasibility of using FPGAs to emulate the occurrence of the set of faults defined in Table II. The speedup that FADES can achieve when compared to state-of-the-art simulation-based fault injection tools was presented in [28] and [29]. Therefore, these experiments focus on determining whether fault emulation can be effectively used for the dependability assessment of critical systems.
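The Golden Run comparison can be sketched as follows (our own encoding of the failure modes of Table VI: a deviation in the delivered outputs is a failure, a deviation only in the internal state is a latent error, and no deviation means the fault was silent):

```c
#include <string.h>
#include <stddef.h>

/* Failure-mode classification by Golden Run comparison: FAILURE if
 * the delivered service (outputs) deviates, LATENT if only the
 * internal state deviates, SILENT if neither does. */
typedef enum { SILENT, LATENT, FAILURE } failure_mode;

failure_mode classify(const unsigned char *golden_out,
                      const unsigned char *out, size_t out_len,
                      const unsigned char *golden_state,
                      const unsigned char *state, size_t state_len)
{
    if (memcmp(golden_out, out, out_len) != 0)
        return FAILURE;   /* service deviated from the Golden Run */
    if (memcmp(golden_state, state, state_len) != 0)
        return LATENT;    /* state holds a dormant error */
    return SILENT;        /* fault fully tolerated */
}
```

Accumulating these verdicts over a whole campaign yields the error syndrome described above: the percentage of injected faults falling in each failure mode.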


Fig. 5. Run-time reconfiguration flow of FADES.

TABLE VI CONSIDERED FAILURE MODES

A. Experimental Set-Up

Three different microcontrollers were considered for fault emulation: a PIC165X (PIC) [30] and two different 8051-compatible microcontrollers, an optimized version (MC8051) [31] and a simpler one (I8051) [32]. These different microprocessors allow us to perform significant comparisons between a simple (PIC) and a more complex (MC8051) system, and between two versions of the same microcontroller (MC8051 and I8051). All the microprocessors ran a bubblesort algorithm to sort ten integer numbers. A different number of experiments was carried out for each fault model and microprocessor, so as to ensure that each possible injection point (LUT or FF) was targeted at least once with 95% probability. The injection time and location were uniformly distributed along the execution time and injection points. The selected hardware platform for implementing the different systems was an RC1000PP board from Celoxica [33], which supports JBits and holds a Virtex XCV1000 FPGA. The complexity of these systems in terms of the required internal resources of this FPGA is shown in Table VII.

TABLE VII SYSTEMS' COMPLEXITY IN TERMS OF VIRTEX INTERNAL RESOURCES

B. Analysis of Results

Experiments were classified according to Table VI into failure, latent, or silent. Fig. 6 depicts the percentage of experiments leading to the different failure modes for each of the considered fault models, fault durations, and affected logic. These results can be interpreted in many different ways depending on the purpose of the designed system. For instance, safety-critical systems are computer-based systems in which the occurrence of a fault may endanger human lives or the environment. For these systems, safety is a dependability attribute of prime importance. Silent experiments do not have any impact on the safety of the system, since it has tolerated the fault. On the other hand, latent experiments, which may cause a failure after some time, and failure experiments can lead to catastrophic consequences. Mission-critical systems usually operate in hostile environments, so availability is the major concern for this type of system. Those fault injection experiments leading to a failure have a negative impact on availability. Latent and silent experiments provide the correct service, and thus the system's availability is not affected. The I8051 microcontroller is very badly influenced by faults affecting the routing of the system, such as delays, shorts, open-lines, and bridgings, which lead to a system failure. This clearly shows that its architecture should be modified to cope with that particular problem. The MC8051 is the best option when dealing with these faults, except in the presence of open-lines related to sequential logic, where the PIC is a safer option.

When dealing with faults targeting the logic of the system, bit-flips are the transient faults with the deepest impact on the sequential logic of the analyzed systems. They are of great interest in embedded domains, like the space domain, where cosmic radiation (high-energy neutrons and protons) causes single event upsets (SEUs) in memory cells. The MC8051 microcontroller appears as the best choice for both manned missions (highest safety) and unmanned missions (highest availability), such as space probes or satellites. The I8051 microcontroller ranks second, with the PIC falling far behind. Regarding the transient faults occurring in combinational logic, pulses have an increasing importance in deep-submicrometer technologies. They may occur due to radiation in space applications or electromagnetic noise (crosstalk, power spikes) in industrial environments. In this case, the MC8051 microcontroller presents the best availability rate, whereas the I8051 is the safest microcontroller. The PIC scores last again.


Fig. 6. Syndrome analysis results of the considered microprocessors.

This is also the case when considering the occurrence of transient indeterminations, both in combinational and sequential logic. The MC8051 is the best option for maximizing the availability of the system, and the I8051 for increasing its safety. Permanent faults can be related to manufacturing defects and wear-out mechanisms during the system's normal operation. Electromigration, over-voltage, and over-current are typical mechanisms found in industrial environments with a high noise level. Sustained exposure to high-energy radiation may also cause permanent faults in space applications. When dealing with stuck-at faults targeting the sequential logic of the system, the MC8051 microcontroller is, again, the best option in terms of safety and availability, with the PIC ranking second this time. However, the I8051 and PIC swap positions for stuck-at faults occurring in the combinational logic of the system. The MC8051 microcontroller obtains the highest availability and safety rates in the presence of indeterminations and stuck-opens. The I8051 is the second option, except in the case of indeterminations affecting the sequential logic of the system, where the PIC presents better results. In general, these systems are safer when dealing with faults targeting combinational logic and present a higher availability in the occurrence of faults in sequential logic. This is due, on the one hand, to the effect of intrinsic masking mechanisms in combinational logic (logic, electrical, and temporal masking) and, on the other hand, to sequential elements memorizing errors that may later lead to failures.

All these results show that different VLSI systems can be classified according to diverse criteria, such as the purpose of the system, the target locations, and the considered fault models. The analysis of the results should take into account, among other factors, only those faults representative of the working environment of the system and their relative likelihood of occurrence (considering the duration of the fault, since transient faults are more frequent during the normal execution of the system whilst permanent ones are more frequent during the first and last stages of the system life cycle), and the ratio between the system's combinational and sequential logic, which may cause some faults to have a greater impact than others. The study of the robustness of the system can also help designers to improve the dependability of their systems by pointing out their weaknesses.

VI. OPEN CHALLENGES

Some key challenges for future research have been identified after the study of these systems.

A. Benchmarking of VLSI Systems

The benchmarking process can be understood as the continuous measurement of a system to obtain meaningful measures that can be used for system comparison. Traditionally, the benchmarking of VLSI systems has been limited to the performance of the system in terms of the number of operations per second or a similar measure. Currently, there exists great interest in the development of systems with low power consumption due to the very restrictive requirements of embedded systems. This is another important issue to take into account when selecting a particular component for a system. As previously explained, advances in semiconductor technologies are increasing the likelihood of occurrence of faults in deep-submicrometer manufactured systems. Hence, the development of benchmarks for the dependability assessment of critical systems is a very important topic for the community [34]. The study presented in this paper shows how availability and safety can be experimentally estimated to compare different VLSI systems. In this sense, it constitutes a step forward in VLSI dependability benchmarking. However, other dependability attributes, such as reliability, integrity, and maintainability, still need further research. From all these facts arises the idea of a benchmark for VLSI systems encompassing performance, power consumption, and dependability properties. The measurement of these attributes could be of great interest to determine interdependencies and find tradeoffs among the desired level of dependability, the required power consumption, and the minimum acceptable performance. This comparison will help designers to improve their implementations, system integrators to select the most suitable component for a particular system, and final users to assess the capabilities of their systems.

B. Fault Simulation-Emulation and Design for Validation

Fault emulation presents many benefits but also some drawbacks, which mirror the benefits found in common model-based fault injection techniques (fault simulation for short). Therefore, why not combine fault simulation and fault emulation approaches in a single common validation framework? Such work should guide designers in specifying the system's model so as to benefit from the advantages of the most suitable technique for its validation.
The considered fault models, target locations, and representativeness of the results are some of the variables that could determine which part of the system will be validated by fault emulation or fault simulation, which HDL languages (VHDL, Verilog, or SystemC) and structures should be used to specify the model of the system, and which is the best-suited description level (gate, register-transfer, or behavioral) for that particular case. Technical aspects, like how to synchronize the execution of fault simulation and fault emulation, how to monitor the system, and how to correlate the obtained measurements, are other important issues to handle in this work. All these questions, once answered, may lead to the definition of very interesting design-for-validation rules. This issue could nicely complement the benchmarking of VLSI systems, as systems would then be designed to be effectively validated and monitored in the presence of faults.

VII. CONCLUSION

This paper focuses on the use of run-time reconfiguration techniques to speed up the dependability assessment of VLSI systems by means of FPGAs. The major contribution of this work is the definition of the procedures to be followed for the injection and deletion of a wide set of faults considered representative of deep-submicrometer technologies that had not yet been addressed in fault emulation. A tool named FADES has been developed to show the feasibility of this approach; it constitutes the first step towards the definition of a global framework for the evaluation of the dependability of VLSI systems. Three different microprocessor systems were selected to study their robustness, and significant results were obtained in terms of their availability and safety. In a sense, this work provides the foundations for the specification of dependability benchmarks for VLSI systems. Several aspects, such as the definition of the workload and faultload, the benchmarking process, and the set of measurements, still need further research. This study could later lead to the specification of a global benchmark for VLSI systems encompassing performance, power consumption, and dependability attributes. The integration of fault emulation and simulation techniques may lead to the specification of general guidelines for the design for validation of system models. All these aspects will help the community to design, integrate, and compare VLSI systems from a dependability perspective for their use, for instance, in critical applications.

REFERENCES
[1] T. Karnik, P. Hazucha, and J. Patel, "Characterization of soft errors caused by single event upsets in CMOS processes," IEEE Trans. Depend. Secure Comput., vol. 1, no. 2, pp. 128-143, Apr./Jun. 2004.
[2] C. Constantinescu, "Trends and challenges in VLSI circuit reliability," IEEE Micro, vol. 23, no. 4, pp. 14-19, Jul./Aug. 2003.
[3] A. Avizienis, J.-C. Laprie, B. Randell, and C. Landwehr, "Basic concepts and taxonomy of dependable and secure computing," IEEE Trans. Depend. Secure Comput., vol. 1, no. 1, pp. 11-33, Jan./Mar. 2004.
[4] J. Arlat, M. Aguera, L. Amat, Y. Crouzet, E. Martins, D. Powell, J.-C. Fabre, and J.-C. Laprie, "Fault injection for dependability validation: A methodology and some applications," IEEE Trans. Softw. Eng., vol. 16, no. 2, pp. 166-182, Feb. 1990.
[5] M.-C. Hsueh, T. K. Tsai, and R. K. Iyer, "Fault injection techniques and tools," IEEE Computer, vol. 30, no. 4, pp. 75-82, Apr. 1997.
[6] J. Arlat, Y. Crouzet, J. Karlsson, P. Folkesson, E. Fuchs, and G. H. Leber, "Comparison of physical and software-implemented fault injection techniques," IEEE Trans. Comput., vol. 52, no. 9, pp. 1115-1133, Sep. 2003.
[7] E. Jenn, J. Arlat, M. Rimén, J. Ohlsson, and J. Karlsson, "Fault injection into VHDL models: The MEFISTO tool," in Proc. 24th Int. Symp. Fault-Tolerant Comput., 1994, pp. 66-75.
[8] D. Gil, J. C. Baraza, J. Gracia, and P. Gil, "VHDL simulation-based fault injection techniques," in Fault Injection Techniques and Tools for Embedded Systems Reliability Evaluation, A. Benso and P. Prinetto, Eds. Norwell, MA: Kluwer, 2003, pp. 159-176.
[9] S. Hauck, "The roles of FPGAs in reprogrammable systems," Proc. IEEE, vol. 86, no. 4, pp. 615-638, Apr. 1998.
[10] P. Gil, J. Arlat, H. Madeira, Y. Crouzet, T. Jarboui, K. Kanoun, T. Marteau, J. Durães, M. Vieira, D. Gil, J. C. Baraza, and J. Gracia, "Fault representativeness," Deliverable ETIE2, Dependability Benchmarking Project (IST-2000-25425), 2002.
[11] L. Burgun, F. Reblewski, G. Fenelon, J. Barbier, and O. Lepape, "Serial fault emulation," in Proc. Des. Autom. Conf., 1996, pp. 801-806.
[12] B. L. Hutchings and M. J. Wirthlin, "Implementation approaches for reconfigurable logic applications," in Proc. Int. Workshop Field Program. Logic Appl., 1995, pp. 293-302.
[13] K.-T. Cheng, S.-T. Huang, and W.-J. Dai, "Fault emulation: A new methodology for fault grading," IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., vol. 18, no. 10, pp. 1487-1495, Oct. 1999.
[14] S.-A. Hwang, J.-H. Hong, and C.-W. Wu, "Sequential circuit fault simulation using logic emulation," IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., vol. 17, no. 8, pp. 724-736, Aug. 1998.
[15] M. B. Santos, I. M. Teixeira, and J. P. Teixeira, "Dynamic fault injection optimization for FPGA-based hardware fault simulation," in Proc. IEEE Int. Workshop Des. Diagnostics Electron. Circuits Syst., 2002, pp. 370-373.


[16] P. Civera, L. Macchiarulo, M. Rebaudengo, M. S. Reorda, and M. Violante, "An FPGA-based approach for speeding-up fault injection campaigns on safety-critical circuits," J. Electron. Test.: Theory Appl., vol. 18, pp. 261-271, 2002.
[17] P. Civera, L. Macchiarulo, M. Rebaudengo, M. S. Reorda, and M. Violante, "New techniques for efficiently assessing reliability of SOCs," Microelectron. J., vol. 34, no. 1, pp. 53-61, 2003.
[18] L. Antoni, R. Leveugle, and B. Fehér, "Using run-time reconfiguration for fault injection applications," IEEE Trans. Instrum. Meas., vol. 52, no. 5, pp. 1468-1473, Oct. 2003.
[19] A. Parreira, J. P. Teixeira, and M. Santos, "A novel approach to FPGA-based hardware fault modeling and simulation," in Proc. 6th IEEE Int. Workshop Des. Diagnostics Electron. Circuits Syst., 2003, pp. 17-24.
[20] Xilinx Inc., San Jose, CA, "Virtex-II platform FPGAs: Complete data sheet," Tech. Rep. DS031 v3.3, 2004.
[21] Altera Corp., San Jose, CA, "Stratix II device family data sheet," vol. 1, v4.1, 2006.
[22] Lattice Semiconductor Corp., Hillsboro, OR, "LatticeSC family data sheet," 2006.
[23] Atmel Corp., San Jose, CA, "AT40KAL series FPGA," Tech. Rep. 2818E-FPGA-1/04, 2004.
[24] D. Gil, "Validación de Sistemas Tolerantes a Fallos Mediante Inyección de Fallos en Modelos VHDL," Ph.D. dissertation, Comput. Eng. Dept., Technical Univ. Valencia, Valencia, Spain, 1999.
[25] D. de Andrés, J. C. Ruiz, D. Gil, and P. Gil, "FADES: A fault emulation tool for fast dependability assessment," in Proc. IEEE Int. Conf. Field Program. Technol., 2006, pp. 221-228.
[26] Xilinx Corp., San Jose, CA, "Virtex series configuration architecture user guide," XAPP151, 2004.
[27] S. Guccione, D. Levi, and P. Sundararajan, "JBits: A Java-based interface for reconfigurable computing," in Proc. 2nd Annu. Military Aerosp. Appl. Program. Devices Technol. Conf., 1999 [Online]. Available: http://klabs.org/richcontent/MAPLDCon99/MAPLDCon99.html
[28] D. de Andrés, J. C. Ruiz, D. Gil, and P. Gil, "Run-time reconfiguration for emulating transient faults in VLSI systems," in Proc. IEEE Int. Conf. Dependable Syst. Netw., 2006, pp. 291-300.
[29] D. de Andrés, J. C. Ruiz, D. Gil, and P. Gil, "Fast emulation of permanent faults in VLSI systems," in Proc. IEEE Int. Conf. Field Program. Logic Appl., 2006, pp. 247-252.
[30] E. Romani, "Structural PIC165X microcontroller," Hamburg VHDL Archive, 1998 [Online]. Available: http://tams-www.informatik.uni-hamburg.de/vhdl/
[31] Oregano Systems, Vienna, Austria, "8051 IP core," version 1.4, 2004 [Online]. Available: http://www.oregano.at/ip/8051.htm
[32] Univ. California, Riverside, "Dalton project," 2001 [Online]. Available: http://www.cs.ucr.edu/~dalton/i8051/
[33] Celoxica, Abingdon, U.K., "RC1000 reference manuals," RM-1140-0, 2001.
[34] Information and Communication Technologies, Brussels, Belgium, "Dependability benchmarking project," IST-2000-25425, 2003.

David de Andrés received the M.S. degree in computer science and the Ph.D. degree in computer architecture and technology from the Technical University of Valencia (UPV), Valencia, Spain, in 1998 and 2007, respectively. In 1998, he joined the Fault Tolerant Systems Research Group (GSTF), UPV, where he is currently an Assistant Professor with the Computer Engineering Department (DISCA). His main research interests include the field of fault injection and reconfigurable systems.

Juan Carlos Ruiz received the M.S. degree in computer science from the Universidad Politécnica de Valencia (UPV), Valencia, Spain, in 1998, and the Ph.D. degree from the Institut National Polytechnique de Toulouse, Toulouse, France, in 2002. In 2003, he joined the Fault Tolerant Systems Research Group (GSTF), UPV, where he is currently an Assistant Professor with the Computer Science Department and is also in charge of faculty-enterprise relations. His main research interests include dependability and security benchmarking, fault and attack injection, and fault and intrusion tolerance. Dr. Ruiz serves as a program committee member of high-level international conferences in the dependability domain, like the European Dependable Computing Conference (EDCC) and the IEEE/IFIP Dependable Systems and Networks Conference (DSN), where he also acts as a reviewer. He is a member of the IFIP 10.4 SIG on Dependability Benchmarking and the Spanish Technological Platform for Security and Dependability.

Daniel Gil received the M.S. degree in electrical and electronic physics from the University of Valencia, Valencia, Spain, and the Ph.D. degree in computer architecture and technology from the Technical University of Valencia (UPV), Valencia, Spain, in 1985 and 1999, respectively. In 1990, he joined the Fault Tolerant Systems Research Group (GSTF), UPV, where he is currently an Associate Professor with the Computer Engineering Department (DISCA). His main research interests include the design and validation of fault-tolerant systems and reliability of nanoelectronic systems.

Pedro Gil is a Professor with the Computer Architecture and Technology Department and head of the Computer Engineering Department (DISCA), Technical University of Valencia (UPV), Valencia, Spain. In 1996, he was an Invited Professor with LAAS-CNRS, Toulouse, France. He has been Vice Dean for infrastructure of the Computer Science Faculty, UPV, where he is currently a member of the Postgraduate Committee. He is Co-director of the Fault Tolerant Systems Research Group (GSTF), which is integrated into the Institute for ICT Implementation (ITACA), UPV. His main research interests include the design and validation of real-time fault-tolerant distributed systems, validation by means of fault injection, and dependability and security benchmarking. He has authored or coauthored more than 75 papers and has supervised 9 Ph.D. dissertations on these topics. He has been a reviewer of Spanish research and infrastructure projects. Prof. Gil has served on the program committees of the IEEE/IFIP Dependable Systems and Networks Conference (DSN), the European Dependable Computing Conference (EDCC), and the Latin American Dependable Computing Conference (LADC), where he also acts as a reviewer. In 2010, he will be the general chair of EDCC-8, which will be held in Spain for the first time.
