

IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 15, NO. 2, FEBRUARY 2007

Online Fault Tolerance for FPGA Logic Blocks


John M. Emmert, Senior Member, IEEE, Charles E. Stroud, Fellow, IEEE, and Miron Abramovici, Fellow, IEEE
Abstract: Most adaptive computing systems use reconfigurable hardware in the form of field-programmable gate arrays (FPGAs). For these systems to be fielded in harsh environments where high reliability and availability are a must, the applications running on the FPGAs must tolerate hardware faults that may occur during the lifetime of the system. In this paper, we present new fault-tolerant techniques for FPGA logic blocks, developed as part of the roving self-test areas (STARs) approach to online testing, diagnosis, and reconfiguration [1]. Our techniques can handle large numbers of faults (we show tolerance of over 100 logic faults via actual implementation on an FPGA consisting of a 20 × 20 array of logic blocks). A key novel feature is the reuse of defective logic blocks to increase the number of effective spares and extend the mission life. To increase fault tolerance, we not only use nonfaulty parts of defective or partially faulty logic blocks, but we also use faulty parts of defective logic blocks in nonfaulty modes. By using and reusing faulty resources, our multilevel approach extends the number of tolerable faults beyond the number of currently available spare logic resources. Unlike many column, row, or tile-based methods, our multilevel approach can tolerate not only faults that are evenly distributed over the logic area, but also clusters of faults in the same local area. Furthermore, system operation is not interrupted for fault diagnosis or for computing fault-bypassing configurations. Our fault tolerance techniques have been implemented using ORCA 2C series FPGAs which feature incremental dynamic runtime reconfiguration.

Index Terms: Adaptive computing, fault tolerance, field-programmable gate arrays (FPGA), reconfigurable computing, reconfigurable systems, reliability.

Manuscript received December 30, 2002; revised July 19, 2005. This work was supported by the DARPA ACS program under Contract F33615-98-C-1318. J. M. Emmert is with the Department of Electrical Engineering, Wright State University, Dayton, OH 45435 USA (e-mail: marty.emmert@wright.edu; emmert@ieee.org). C. E. Stroud is with the Department of Electrical and Computer Engineering, Auburn University, AL 36849 USA (e-mail: cestroud@eng.auburn.edu). M. Abramovici is with Design Automation for Flexible Chip Architectures (DAFCA), Framingham, MA 01701 USA (e-mail: miron@dafca.com). Digital Object Identifier 10.1109/TVLSI.2007.891102

I. INTRODUCTION

Adaptive computing systems (ACSs) rely on reconfigurable hardware to adapt the system operation to changes in the external environment, and to extend mission capability by implementing new functions on the same hardware platform. This results in increased functional density and reduced power consumption, features that are very important in many domains, such as space missions or mobile devices. Field-programmable gate arrays (FPGAs) featuring incremental dynamic runtime reconfiguration (RTR) offer additional benefits by allowing the system to continue to execute uninterrupted, while portions of the FPGA are reconfigured for new logic functions. ACSs are often deployed in harsh and/or hostile remote environments, and they are subject to strict high-reliability and high-availability requirements. Marginal defects not causing failures in manufacturing testing (such as a short initially having very high resistance) may become active with the aging of the device, or because of environmental factors. Since direct human intervention for maintenance and repair is impossible in such environments, fault-tolerant (FT) techniques resulting in graceful degradation must be used to achieve the desired mission life span even in the presence of faults. However, traditional FT design, based on replicated modular redundancy and voting, is extremely expensive given the space, weight, and power constraints of ACSs.

In this paper, we present FT techniques for FPGA programmable logic blocks (PLBs), developed as part of the Roving STARs project [1], [2]. In most FT methods, faults are detected within the working part of the system, and they must be located and bypassed quickly so that normal system operation can resume as soon as possible. In the roving self-testing areas (STARs) approach (which is conceptually similar to [3]), we divide the FPGA into two distinct parts: the STARs, where built-in self-test (BIST) and diagnosis take place, and the working area where the system function operates. After completing the test of one area, a STAR exchanges place with one adjacent slice of the system logic, so that eventually the STARs rove across the entire FPGA. A consequence of the roving STARs strategy is that the faults are always detected in a STAR, and thus, they do not affect the working area of the FPGA. Therefore, we do not interrupt the normal system operation to replace the faulty resource by a fault-free one, since the logic in the STAR is not performing any system function when the fault is detected. Another important difference from offline techniques is that fault diagnosis and FT reconfiguration do not have severe real-time constraints.
Since normal system operation continues in the working area, we can allow more time for accurate diagnosis and for computing any required fault-bypassing configurations, compared with approaches where system operation is interrupted for diagnosis and reconfiguration. Fault tolerance is performed after faults are detected and located. While the roving STARs approach encompasses fault tolerance for both logic and interconnect, this paper deals only with fault tolerance for logic resources. First, we determine if the system function can continue to work correctly in the presence of the located faults. In many situations, this is possible and no reconfiguration is needed. If a fault does affect the system function, we determine alternate configurations that avoid the faulty resources. In addition to bypassing faults, reconfiguration can be used to reduce the performance degradation caused by fault avoidance. In other FT approaches, the system clock speed must be set to accommodate the worst case FT configuration that is possible. In the event that the worst case never occurs, the system is penalized with a slower than necessary clock speed. We use the

1063-8210/$25.00 © 2007 IEEE

EMMERT et al.: ONLINE FAULT TOLERANCE FOR FPGA LOGIC BLOCKS


TABLE I ACRONYMS AND ABBREVIATIONS

Fig. 1. FPGA with roving STARs.

concept of an adaptive system clock implemented by a clock generator whose programmable period can be adjusted when the system is reconfigured to bypass faults [4], [5]. The initial clock frequency is set to the maximum value allowed in the fault-free circuit. If the circuit critical timing path changes as a result of FT reconfiguration, the clock rate is adjusted based on post-routing timing analysis. The timing analysis is run incrementally only for the signal nets affected by reconfiguration. Without such an adjustment, the clock must be slow enough to work for the longest paths created by FT reconfigurations. In contrast, adjusting a programmable clock period will introduce timing penalties only when required, as a result of new faults. This contributes to a more graceful system degradation as faults occur.

This paper is organized as follows. In Section II, we overview the roving STARs approach to online BIST and diagnosis of the programmable logic blocks. In Section III, we review some of the prior FPGA-related FT research that led to our approach. In Section IV, we introduce a three-level FT approach that we apply during both logic and interconnect fault tolerance. In Section V, we present the details of our logic FT approaches. We describe faulty logic reuse to increase effective spare capacity, the use of spare allocation strategies to ease local fault tolerance, and we describe multifault strategies to handle a large number of faults in a local area. In Section VI, we present our summary and conclusions. The acronyms and abbreviations used in this paper are summarized in Table I.

II. ROVING STARS

We assume that an embedded processor, referred to as Test and Reconfiguration Controller (TREC), manages online BIST, diagnosis, and fault tolerance for all of the FPGAs in the ACS.

The TREC, which has separate processor-specific fault-tolerance mechanisms, also maintains all the system configurations that may be used in the future. The roving STARs target the permanent faults that appear during the lifetime of the system, including faults in the configuration memory. For online testing of transient faults, the application logic implemented in FPGAs will use a concurrent error-detection technique, such as [6]. Fault tolerance for transient faults is achieved by periodically saving the state of the system (checkpointing) and restoring the last saved state when a transient is detected. The STARs provide areas of the FPGA that are temporarily offline, while the rest of the device is online, performing normal system operation. Fig. 1 depicts an FPGA with a vertical STAR (V-STAR) and a horizontal STAR (H-STAR); the system application resides in the working areas. Partial RTR via the boundary scan interface of the FPGA allows the test configurations used by STARs to be downloaded without any impact on the system operation. After testing of a STAR has been completed, the STAR roves to a new location by exchanging places with an equal-size slice of the working area. The first STAR positions are chosen arbitrarily. During system function mapping, we mask the STAR positions so they are avoided. This way we guarantee the STAR areas are initially available. Subsequently, roving the STARs across the FPGA is implemented by a sequence of precomputed partial reconfigurations and assures that the entire FPGA will eventually be tested. Since our tests are exhaustive, our fault coverage is 100%. Our fault latency is the interval needed to test the entire FPGA. The roving process and the use of roving STARs for test and diagnosis are described in more detail in [1], [7], and [8]. Depending on the positions of the two STARs, the working area may be contiguous, or it may be divided into two or four disjoint regions (as illustrated in Fig. 1).
All horizontal wire segments in H-STAR and all vertical segments in V-STAR are reserved for testing. System signals are allowed to pass through the STARs using only horizontal wire segments through V-STAR or vertical segments through H-STAR. The V-STAR roves across the FPGA while the H-STAR roves up and down the FPGA. A STAR tests both the PLBs and the programmable interconnect within its area. While either STAR can be used to test all PLBs, both STARs are required for testing the programmable interconnect. The size of the STARs is chosen to provide sufficient logic and interconnection resources to implement BIST circuitry that will completely test the PLBs within the STAR while minimizing the number of PLBs allocated to



the STAR and, as a result, unavailable for system logic. For the ORCA 2C architecture, the V-STAR consists of two columns and the H-STAR consists of two rows of PLBs along with their associated routing resources. Our approach tests the entire FPGA, including the spare resources not currently used by the system application. Testing only the system logic, as usually done in offline FT systems, is unsafe for online testing, because new faults are equally likely to occur in spare resources or in the currently unused portion of the operational part of the system. These dormant faults may accumulate and have a detrimental effect on the system reliability since these unused resources may be used as replacements for the faulty ones [9]. Thus, to guarantee a reliable operation, our online test completely checks all the system resources, including spares. Our BIST technique is described in detail in [7] and detects any single or multiple faults in a PLB, and any combination of multiple faulty PLBs. When faults are detected, our diagnostic procedure is used to locate any group of faulty PLBs based on the BIST results. Additional diagnostic BIST configurations can then be downloaded and applied to the faulty PLBs to identify the specific faulty portion(s) of the PLB, such as faulty look-up tables (LUTs) and/or faulty flip-flops, to increase the diagnostic resolution [7]. The diagnostic goal of most other FT techniques for FPGAs is to identify faulty PLBs, which are then bypassed and replaced by spares. The high diagnostic resolution of our approach allows us to introduce a new form of fault tolerance where faulty resources are reused whenever possible. A partially usable block (PUB) is a faulty PLB with identified failing mode(s) of operation or faulty subcircuits. A PUB may be used as a fault-free cell in the working part of the system, provided that its faults do not affect the intended system operation.
Reusing defective hardware resources increases the effective spare capacity and leads to more graceful system degradation and longer mission life. The roving STARs approach offers important advantages in FT applications. In most previous FT work, faults are detected in the working logic, and they must be located and bypassed very quickly to restart the normal operation as soon as possible. In contrast, faults detected in roving STARs do not affect the working logic, so the system operation does not have to be interrupted for fault diagnosis and for computing fault-bypassing reconfigurations. It should be noted that there is some system downtime required when moving a STAR from one location to another. First, when the STAR position is switched, the system clock must be stopped long enough to replace the STAR configuration with the partial system configuration that moves into the vacated STAR location. Second, the state of the system, when the clock is stopped, must be copied to the new system area before the clock is restarted. In an ORCA 2C15 FPGA (a 20 × 20 array), the system clock is stopped for approximately 250 μs for STAR relocation. Total roving and testing time in a fault-free ORCA 2C15 FPGA is approximately 1.34 s when the ORCA 2C boundary scan interface is operated at its maximum specified clock rate of 10 MHz. As a result, in the worst case, a fault could escape detection for almost 1.34 s if it were to occur in a STAR position

that has just been tested. Since diagnosis of faulty PLBs within a BIST tile can be performed based on the failing BIST results, the 1.34 s roving and testing time also includes diagnosis to a faulty PLB. PUB diagnosis, on the other hand, requires a maximum of 7.2 ms in the ORCA 2C15 to be able to identify a faulty LUT and/or flip-flop within the faulty PLB. While we used roving STARs to detect and diagnose faults, we want to emphasize the flexibility of our fault tolerance techniques. Other methods for detection and diagnosis that could potentially be used with our fault tolerance techniques have also been proposed [10]. Our main criterion for testing and diagnosis is that they do not interfere with the normal operation of the system function in the working area of the FPGA.

III. RELATED RESEARCH

In this section, we briefly describe some other techniques for tolerating faults in FPGAs. Work on fault tolerance for FPGAs and other memory devices is too extensive to thoroughly cover in this paper, so we have limited the scope to reflect a few key efforts that provide a background for the direction our work took. For a more detailed, quantitative analysis of different online and offline FT techniques, see [11]. In later sections, we compare and contrast some of the methods in [11] to ours. The emphasis is on describing how those methods could fit into our system as well as advantages and disadvantages over our techniques.

Several techniques employ column or row shifting [12], [13]. In [12], Hatori et al. introduced a single spare column for tolerating faults. They used specialized selector circuitry to reconfigure FPGA circuits in the presence of faults. Similar to methods used for FT in SRAMs, at least one additional column of PLBs is added to the FPGA. If a PLB is faulty, its column is eliminated and all functions mapped to the columns between the faulty column and the closest spare column are shifted toward the spare column. In [13], Durand and Piquet used multiple spare columns. The area overhead and fault tolerance for each of these techniques is dependent on the number of spare columns or rows introduced, but if the number of faults per column (or row) exceeds the spare capacity of that column (or row) then something else must be done to tolerate the fault overages.

To improve fault tolerance by increasing the number of tolerable faults, Narasimhan et al. developed an FT technique for FPGAs or wafer-scale integrated arrays [14], [15]. They use a pebble shift algorithm to reconfigure around faulty PLBs. Their method is flexible in that it is not limited to one fault per row, column, or tile. Their technique is similar to ours in that they make use of unused resources for spares or fault tolerance. Thus, there is no required area overhead.
Their method was targeted for offline application, and no information was given on operating speed implications. Kelly and Ivey use redundancy to bypass faults in applications mapped to FPGAs [16]. Their FT technique relies on a shift method to reconfigure in the presence of faults. They incorporate a reconfiguration switch to reconfigure and they use normal place and route (PAR) tools for mapping circuits to FPGAs. The switch matrix network makes their technique more flexible than the column and row techniques in [12] and [13]. A switch configuration algorithm was used to configure the special switches



around faulty PLBs. Their method requires some spare or unused resources to tolerate faults, and the fault tolerance depends on how many unused resources are initially available and on routability. For an average FPGA utilization of 80%, 20% of the resources should be available for spares.

In [17], Cuddapah and Corba used Xilinx SRAM-based FPGAs to demonstrate the FT capabilities of FPGAs. In their study, they randomly picked PLBs to be faulty. They reconfigured the circuit around these faults using commercially available PAR tools. The main contributions of their work were an algorithm to determine fault coverage (the ability to reconfigure around a given number of faults) of a design and a definition of the fault recovery rate for any given design implemented in an SRAM-based FPGA. Additionally, they demonstrated that fault recovery was feasible on FPGAs by other than modular redundant methods. Similar to Kelly and Ivey, their method requires some unused or spare resources, and the fault tolerance depends on how many spares are available.

Dutt and Hanchek et al. developed a method to increase FPGA yield [18], [19]. Their method used node covering and reserved routing resources to replace the functionality of faulty PLBs. One row (column) of PLBs is reserved for spares. If a PLB in any given column (row) was faulty, the functionality of all PLBs in the column (row) from the faulty PLB to the spare PLB was shifted toward the spare PLB. Spare routing resources were used to eliminate the overhead of rerouting the updated circuit placement. The main advantage of this method is that it is very fast relative to reconfiguration time. Since the spare resources have already been allocated to cover a limited number of faults, the reconfiguration time is linear with respect to the number of faults. Like many of the previous methods, the main problem with this offline technique is the limited number of faults that can be tolerated in each column (row).
At most, they guarantee toleration of one fault per spare row or column.

Emmert and Bhatia developed an FT technique for incrementally reconfiguring FPGA mapped circuits around faulty PLBs [20], [21]. Like [14] and [15], they can tolerate more than just local faults. They used minimax grid matching to match faulty PLB locations to unused spare PLB resources. They incorporated a shift methodology to shift PLB functions between the faulty PLB and its matched spare location toward the spare location. The shift method reduced the performance degradation caused by matching PLB functions to spares that were spatially separated by large distances. They applied their method to Xilinx FPGAs. Later, Lakamraju and Tessier improved on the idea of shifting logic by introducing the idea of shifting logic within a PLB [22], and thus, they reduced the amount of reconfiguration required for logic fault tolerance.

Lach et al. developed low-overhead FT systems [23], [24]. Their basic approach was to partition a design into a number of tiles. Each tile was allocated a number of spare PLBs. In the event a fault in a tile was found, a spare PLB was used to replace the faulty PLB. Their reconfiguration approach provided multiple configurations for each tile. In the event that a fault was detected, the configuration associated with the fault was downloaded. In this method, each tile requires one or more spare resources. The worst case fault tolerance is defined by the number of spares in the tile with the fewest spares as there is no ability to draw spares from other tiles.
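The minimax matching idea mentioned above for [20], [21] can be illustrated with a small sketch: pair every faulty PLB with its own spare so that the largest fault-to-spare distance is as small as possible. The code below is only an illustration under stated assumptions. It uses a plain bottleneck bipartite matching (threshold search plus augmenting paths), not the grid-matching algorithm of the cited work, and it assumes Manhattan distance between (row, column) PLB coordinates; the function names are hypothetical.

```python
def manhattan(a, b):
    """Assumed distance metric between two (row, col) PLB positions."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def _match_within(faults, spares, limit):
    """Try to give every fault its own spare at distance <= limit.

    Uses augmenting paths (Kuhn's algorithm). Returns a dict
    {fault_index: spare_index}, or None if no such assignment exists.
    """
    owner = {}  # spare index -> fault index currently using it

    def try_assign(f, seen):
        for s, spare in enumerate(spares):
            if s in seen or manhattan(faults[f], spare) > limit:
                continue
            seen.add(s)
            # Take a free spare, or re-home the fault that holds this one.
            if s not in owner or try_assign(owner[s], seen):
                owner[s] = f
                return True
        return False

    for f in range(len(faults)):
        if not try_assign(f, set()):
            return None
    return {f: s for s, f in owner.items()}


def minimax_match(faults, spares):
    """Return (max_distance, matching) minimizing the longest move.

    Returns None when there are more faults than spares.
    """
    if len(faults) > len(spares):
        return None
    if not faults:
        return 0, {}
    # Only distances that actually occur can be the optimal bottleneck.
    limits = sorted({manhattan(f, s) for f in faults for s in spares})
    lo, hi, best = 0, len(limits) - 1, None
    while lo <= hi:
        mid = (lo + hi) // 2
        matching = _match_within(faults, spares, limits[mid])
        if matching is not None:
            best = (limits[mid], matching)
            hi = mid - 1
        else:
            lo = mid + 1
    return best
```

For example, with faults at (0, 0) and (0, 1) and spares at (0, 2) and (5, 5), the two possible pairings have longest moves of 9 and 10, so the sketch selects the pairing whose longest move is 9.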

There are several points to make relative to the traditional approaches from the literature and our approach. With the exception of Lach et al., all previous methods rely on reconfiguration around faults to be calculated online (the circuit had to be paused or go offline once a fault was found). This requires fast computations in order to maintain high availability for the FT system. The precompiled configurations of Lach et al., on the other hand, significantly reduce system downtime. Relative to area overhead, with few exceptions the previous approaches required a minimum of 10% area overhead for the fault tolerant technique. In our approach, we make use of faulty logic resources (faulty LUTs used in nonfaulty modes) which require no area overhead and we take advantage of unused nonfaulty resources. We do not require any preallocated or spare resources, but the more available spares, the larger the fault tolerance. Our roving-STARs BIST technique does require at least two columns and two rows of PLBs for detecting and diagnosing faults. Relative to operating speed, since we have no special bypassing circuitry and do not require any preallocated spares, system operation can approach optimum speeds. It should be mentioned that our roving-STARs BIST technique does slightly reduce our maximum operating speed [4]. Relative to power consumption, reconfigurable logic overheads typically require more power than their nonreconfigurable counterparts, but there are few quantifiable numbers from the literature to compare to our method. While our fault tolerant technique does not increase power requirements, our BIST technique does, due to partial configuration of the STARs during testing and due to the testing activity concurrent to normal operation. The power increase due to the roving STARs will be on the order of [...], assuming that the system clock and the test clock are operating at the same frequency, and much less when the system clock is at a higher frequency than the test clock.

IV. MULTILEVEL FT APPROACH

In our FT approach, we apply precompiled configurations to bypass faulty PLBs, but if precompiled configurations are not available (for example, we have multiple faults that cannot be covered by precompiled configurations dealing with a single faulty PLB), we generate any necessary bypassing configurations while the system application continues to run. A major advantage of our approach is that we can tolerate a large number of both logic and interconnect faults. In this section, we describe our multilevel approach.

Level 1 consists of leaving a STAR (or possibly both STARs) parked over the area where faults have been detected. While the STAR is parked over the detected faults, system operation goes on without interruption because the faults are not located in the working area. STAR parking continues while the detected faults are diagnosed. Additional configurations may be required to achieve maximum diagnostic resolution in certain situations, such as having several faulty PLBs within the STAR, or when identifying the faulty LUT or flip-flop within a PLB. The diagnosed faults may affect the next slice of the working logic to be relocated in the parked STAR position. The currently unused logic and routing resources in the working area are considered as spares to be used to bypass faults. The first question



is whether the application logic is mapped such that after relocation it would fall on the newly diagnosed faulty resources. If the faults affect only spare resources, then no action is required. Otherwise, the next question is whether the faulty resources are usable in the context of the functionality required by the system function. The next subsections will discuss usability in more detail. If the fault is not compatible with the current configuration, then we incrementally reconfigure the working area to avoid the faults before the working area is relocated in the next step of the STAR roving process. Some fault-bypassing roving configurations (FABRICs) can be precompiled, but most faults will require new FABRICs to be computed online; this computation also takes place while the STAR is parked over the faults. A major advantage of STAR parking is that testing can continue while a STAR is parked. Let us assume that H-STAR is the parked STAR. After the faults in the H-STAR have been located, and while TREC is computing FABRICs, the V-STAR can continue roving and testing. This process will still test all PLBs and all vertical interconnect, but the horizontal interconnect resources usually tested by the now-parked H-STAR will not be tested during this period. Thus, a partial testing process is active even when one STAR is parked.

Level 2 occurs when we move the parked STAR off of the faults and continue roving. During Level 2, we apply precompiled or newly computed FABRICs. By using the partial reconfiguration capability of FPGAs, the size of these FABRICs is kept relatively small compared to the memory required for storing complete FPGA configurations. A FABRIC replaces a faulty resource with a spare one. Ideally, we would like to always have a spare resource in the neighborhood of the fault. In the next sections, we will discuss spare allocation strategies that attempt to achieve this goal.
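The questions asked while a STAR is parked over newly diagnosed faults can be summarized as a small decision procedure. The following is a minimal sketch, not the TREC implementation; the `DiagnosedFault` record, its field names, and the `Action` labels are hypothetical and only mirror the three outcomes described above (no action, reuse of the faulty resource, or a FABRIC).

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    NO_ACTION = "fault falls only on spare resources"
    REUSE_AS_PUB = "fault is compatible with the relocated logic"
    APPLY_FABRIC = "working area must be reconfigured around the fault"


@dataclass(frozen=True)
class DiagnosedFault:
    # Hypothetical summary of one diagnosed fault under a parked STAR.
    overlaps_system_logic: bool   # would relocated logic land on the fault?
    compatible_with_logic: bool   # could that logic still work correctly?


def choose_action(fault: DiagnosedFault) -> Action:
    """Pick the response for one fault before the STAR resumes roving."""
    if not fault.overlaps_system_logic:
        # Only spare resources are affected, so nothing has to change.
        return Action.NO_ACTION
    if fault.compatible_with_logic:
        # The faulty resource is usable for the function mapped onto it.
        return Action.REUSE_AS_PUB
    # Otherwise a precompiled or newly computed FABRIC is needed.
    return Action.APPLY_FABRIC
```

In this sketch the compatibility flag is supplied by the caller; how it is determined is the subject of Section V.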
Level 3 is entered as the last recourse when all the spare resources in the working area have been used for fault tolerance. In this case, we use some of the PLBs provided by the two STARs to bypass faults. This level is referred to as STAR stealing. By making use of reconfiguration and alternately stealing from one STAR and then from the other, it is possible to continue roving with both STARs, one at a time. We attempt to maintain both STARs roving one at a time for as long as possible. In the event that one STAR is completely used up (for spares) and we start stealing from the other STAR, we can rove with only a partial STAR. Note that the resources taken from the STAR will no longer be spare.

For the roving STARs approach, test latency is a function of the time required to configure the FPGA. For testing and fault tolerance, the faster we can reconfigure the FPGA, the quicker we can detect and tolerate any fault. Partial reconfiguration speeds up this process by reducing the configuration time required when loading test configurations for a specific STAR location and when relocating a STAR to a new location.

V. FT FOR PLBS

Because of the roving process, the system logic does not have a fixed placement. Fig. 2 illustrates how the same physical PLB is time-shared among four different logic cell functions, depending on the positions of the STARs. (A logic cell function is the system function mapped to one PLB.) Our FT techniques

Fig. 2. Four logic cells time sharing one PLB.

correctly deal with this dynamic behavior. Whenever possible, we reuse faulty PLBs as PUBs; this is a drastic change from most of the previous FT techniques that completely avoid faulty resources. We make use of a faulty resource in a nonfaulty mode. For example, if a LUT has a cell stuck-at-0, we can use this LUT to implement a function that requires a 0 in that cell. If this is not possible, we next try to use the nonfaulty logic in the PLB for other functions. This technique allows us to handle more logic faults than there are available spares. If reuse is not possible, we compute FABRICs. Combining reuse of defective PLBs with fault bypassing via FABRICs results in a major advantage of our technique over other techniques: the ability to handle large groups of faults in a tight area. Other techniques limit the number of faults they can handle by providing a limited number of spare logic resources in a subarea or tile of the FPGA. When these local spares are exhausted, those FT techniques fail. For example, the row or column shifting techniques similar to those presented in Hatori [12], Durand [13], and Dutt [18], [19] allow quick reconfiguration using special built-in hardware, but limit the number of faults they can tolerate to one or two per column. The tiling method presented in [23] and [24] could also be used to create the FABRICs for our technique. However, the tiling method is more restrictive than our method. For example, if the total number of faults within a tile exceeds the number of spares by just one, even if the rest of the FPGA is nonfaulty, this method will fail. As described in the following,



our technique is more flexible, allowing a higher number of tolerated faults, and does not require specialized hardware. We take a more global approach. We can draw on spare and PUB resources from anywhere in the working (and in extreme cases even the testing) area of the FPGA. Unlike some of the other methods, we do however require a separate co-processor (TREC) to compute and/or manage replacement configurations. In addition to our local spare allocation strategies, we also provide a global reconfiguration approach that makes use of minimax matching to allow a large number of logic faults in the same region of the FPGA to be tolerated. This is similar to what was done in Narasimhan [14], [15], but they used the pebble shifting algorithm instead of the grid matching algorithm. The grid matching attempts to limit the maximum distance between any fault and its spare, whereas the pebble shifting is less restrictive. It should be noted that we can still use our FABRICs for larger numbers of faults in a local area. The FABRICs will be computed while the system function is online. More details of our approach are provided as follows.

A. Reusing Faulty PLBs

Our diagnosis technique can determine the faulty LUTs or the faulty FFs inside a faulty PLB, or its failing modes of operation. We regard reusability of a faulty PLB as a compatibility relation between the fault in a PLB and the desired logic cell function for that PLB: if the faulty PLB can correctly perform the function, we say that the fault is compatible with the function. A fault that affects only sections of a PLB that are not used in the desired function is compatible with that function. Any fault is compatible with the empty function of a spare PLB. For uniformity, we consider a fault-free PLB as having an empty fault, which is then compatible with any function.
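The compatibility relation defined above can be written as a short predicate. This is a hedged sketch rather than the authors' data model: a PLB is assumed to be described by named sections (such as "LUT0" or "FF1"), a fault by a set of unusable sections plus a map of stuck memory cells, and a logic cell function by the sections it uses and the truth-table bits it loads into each LUT. All names are hypothetical, and only LUTs used as combinational logic are modeled.

```python
def is_compatible(faulty_sections, stuck_cells, used_sections, lut_contents):
    """Return True if the faulty PLB can still perform the function.

    faulty_sections: set of unusable section names, e.g. {"FF1"}
    stuck_cells:     dict mapping (lut_name, cell_index) -> stuck value (0 or 1)
    used_sections:   set of section names the function needs
                     (empty set for the empty function of a spare PLB)
    lut_contents:    dict mapping lut_name -> list of 0/1 truth-table bits
                     for every LUT the function uses
    """
    # A fault in a section the function needs is never tolerated here.
    if faulty_sections & used_sections:
        return False
    for (lut, cell), stuck_value in stuck_cells.items():
        if lut not in used_sections:
            continue  # the stuck cell sits in an unused LUT
        # A stuck cell is harmless only if the function wants that value.
        if lut_contents[lut][cell] != stuck_value:
            return False
    return True
```

With an empty fault (no faulty sections and no stuck cells) the predicate accepts any function, and with an empty function it accepts any fault, matching the two boundary cases stated above.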
The usability of a defective PLB has to be analyzed separately for each one of the four logic cell functions of the system function that may be relocated over it. A simple example of compatibility is a fault in a section of the PLB that is not used in the relocating logic cell function, such as an unused LUT or FF, or a fault affecting an operation not used in the relocating logic cell function, such as multiplication. In that case the faulty cell may be used as is, without any reconfiguration.

As another example, consider a LUT with a memory cell stuck-at-0. If the faulty LUT implements a combinational logic function where the faulty memory cell is set to a logic 0, then the fault is tolerated without requiring any reconfiguration or any spare resources. If the needed value and the stuck value are complementary, or if the faulty LUT is used as a RAM, then the fault is not compatible with the intended system function.

If the faulty LUT/RAM has an unused input/address bit, another way to tolerate a stuck cell is to use only the half of the LUT that does not contain the faulty memory cell. For example, if we have a 4-input LUT/RAM and the logic cell is using only three inputs, we can still use half of the memory cells to implement a 3-input LUT or a RAM with three address bits. This requires a local reconfiguration to tie the unused input to the logic value that selects the fault-free half of the RAM.

An incompatible fault may be made compatible with a desired function if the faulty PLB has unused parts. For example, the fault may affect one of the LUTs, but the desired logic cell

Fig. 3. Length of logic function moves.

Fig. 4. Maximum system operating frequency.

function leaves another LUT unused that can be considered as a spare and connected to replace the faulty LUT. This requires a local reconfiguration of a logic cell function [22].

B. Faulty PLB Reuse Results

To demonstrate the effect of PUBs when reconfiguring for logic fault tolerance, we used empirical data from several benchmark circuits. With other methods, a PLB in which a fault occurs is no longer used. Our purpose was to demonstrate that the number of faults tolerated by our techniques could exceed the number of available spares. The test circuits were implemented on the ORCA 2C15 FPGA [25]. The ORCA 2C15 consists of a 20 × 20 array of PLBs. Since the STARs use two columns (V-STAR) and two rows (H-STAR), we have 18 × 18 = 324 PLBs available for the system function.

The data that follows describes the results for a digital single-sideband modulator circuit (DSSM), a Fibonacci number generator (FIB), and a random-number generator circuit (RNG). Out of 324 possible PLBs in the working area of the FPGA, the DSSM used 229 PLBs (70%), the FIB used 244 PLBs (75%), and the RNG used 139 PLBs (43%). We randomly picked a set of PLBs, and a LUT or FF within each selected PLB, to be faulty. Then, we reconfigured by moving any


Fig. 5. Length of logic function moves.

Fig. 6. Maximum system operating frequency.

Fig. 7. Length of logic function moves.

Fig. 8. Maximum system operating frequency.

logic cell function that was not compatible with a faulty PLB to a new location. We increased the number of faults until we could no longer reconfigure the circuit.

In Figs. 3-8, we see representative test results for reconfiguring with and without using PUBs. In Figs. 3, 5, and 7, the y-axis shows the average distance¹ that logic cell functions mapped to faults were moved so that all incompatible logic cell functions were relocated to compatible locations (spare PLB or PUB locations) for the DSSM, FIB, and RNG benchmark circuits, respectively. The x-axis shows the number of faults injected into the circuit. A vertical line was added to each graph to indicate the total number of spare PLBs available in the working area of the FPGA.

For low numbers of injected faults (with and without PUBs), the average distance moved can be less than one PLB. This may seem impossible, since any move covers at least one PLB, but many of the faults are compatible with the logic programmed there (see Section V-A) and do not require any reconfiguration; these contribute a distance of zero to the average. The data also clearly shows the number of logic faults that
¹ We use Manhattan distance, measured in number of PLBs, to describe the horizontal plus vertical distance a logic cell function moves during FT reconfiguration.

can be tolerated increases with the use of PUBs. In these examples, using PUBs, we can exceed 160 faults (for DSSM), 120 faults (for FIB), and 220 faults (for RNG) before we run out of compatible spares. This is a strong result when we consider that the number of faults is almost twice the number of spares available in the system area. Thus, with PUBs and faulty logic reuse we can have a completely working system even when the number of faulty PLBs exceeds the number of spare PLBs in the working area.

Figs. 4, 6, and 8 show the effect of reconfiguration on system performance for the DSSM, FIB, and RNG benchmark circuits, respectively. On the y-axis, we see the maximum operating frequency as a function of the number of injected faulty PLBs. From the graphs, we see a gradual degradation in performance as the number of faults increases. We can take advantage of this gradual degradation using a programmable system clock.

C. Bypassing Incompatible Faults

If a defect is not compatible with the system function, the system function must be reconfigured to avoid the incompatible fault. One approach is to move the entire logic cell function to a compatible spare PLB location. Note that a spare PLB may not be fault-free, since we allow a spare PLB to be relocated over a faulty one. The condition for replacement is to find a spare


PLB compatible with the desired function (this also includes the fault-free spare case). The actual replacement takes place in the next roving step, where the reconfiguration makes the relocating logic cell function fall on a compatible spare instead of the incompatible faulty PLB. Having the compatible spare in the neighborhood of the faulty PLB helps minimize the extent of the changes in the layout of the system function, including changes in the delays of the signals that have to be rerouted.

Moving a logic cell function to a compatible spare location will work as long as 1) a compatible spare is available and 2) no two logic cell functions need the same spare. When condition 2) does not hold, we use a grid matching technique [20] to perform the reconfiguration and use other spares available in the working area of the FPGA. When no more spares are available in the working area, we move to STAR stealing.

FABRICs for single faulty PLBs can be precompiled and stored to quickly map around faulty PLBs. But in a system where some faults have been bypassed, the initially computed FABRICs may no longer be valid; so after one FABRIC is applied, the TREC updates the other FABRICs that may have been affected by the incremental reconfiguration. The initial FABRICs are precomputed under the assumption of a single fault, so when multiple faults occur in the same region, the TREC has to compute a new FABRIC that concurrently bypasses all the faults. This takes longer than using the precompiled FABRICs, but it does not interfere with the system function execution since a STAR is currently parked over the faults. After several faults have occurred in the same area, it is possible that all the spare PLBs have been used to replace faulty PLBs, so it may become necessary to remap the entire system function and reallocate the remaining spares to achieve a more uniform spare distribution.
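The matching step used when several logic cell functions compete for spares can be sketched as a bottleneck assignment: find the smallest Manhattan distance limit under which every incompatible logic cell function still gets its own compatible spare. The sketch below is a generic formulation (binary search over candidate limits plus augmenting-path bipartite matching) and is not claimed to be the algorithm of [20]; the function names, the `(column, row)` coordinate convention, and the `compatible` callback are assumptions made for illustration.

```python
from typing import Callable, Dict, List, Optional, Tuple

Loc = Tuple[int, int]  # assumed (column, row) position of a PLB


def manhattan(a: Loc, b: Loc) -> int:
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def match_within(faulty: List[Loc], spares: List[Loc], limit: int,
                 compatible: Callable[[int, int], bool]) -> Optional[Dict[Loc, Loc]]:
    """Give every faulty location its own spare at distance <= limit, or return None."""
    owner: Dict[int, int] = {}  # spare index -> index of the function assigned to it

    def assign(i: int, seen: set) -> bool:
        for j, spare in enumerate(spares):
            if j in seen or manhattan(faulty[i], spare) > limit or not compatible(i, j):
                continue
            seen.add(j)
            # Take a free spare, or evict its owner if the owner can move elsewhere.
            if j not in owner or assign(owner[j], seen):
                owner[j] = i
                return True
        return False

    for i in range(len(faulty)):
        if not assign(i, set()):
            return None
    return {faulty[i]: spares[j] for j, i in owner.items()}


def minimax_match(faulty: List[Loc], spares: List[Loc],
                  compatible: Callable[[int, int], bool] = lambda i, j: True):
    """Return (smallest feasible distance limit, matching), or None if no matching exists."""
    if not faulty:
        return 0, {}
    limits = sorted({manhattan(f, s) for f in faulty for s in spares})
    lo, hi, best = 0, len(limits) - 1, None
    while lo <= hi:  # feasibility is monotone in the limit, so binary search applies
        mid = (lo + hi) // 2
        matching = match_within(faulty, spares, limits[mid], compatible)
        if matching is not None:
            best, hi = (limits[mid], matching), mid - 1
        else:
            lo = mid + 1
    return best
```

With three incompatible functions at (1, 1), (2, 1), and (1, 2) and spares at (0, 1), (3, 2), and (1, 4), a limit of one PLB serves only the first function, while a limit of two PLBs matches all three.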
Processing for this remapping is done offline by the TREC, and it requires that the entire system configuration file be reloaded during the next rove.

D. Preallocating Spare Resources

Our methods do not require preallocated spare PLBs. If there is slack available in the system operating speed, we can make use of preassigned spare PLB locations in the working area to improve reconfiguration time for lower numbers of faults, and for use with our precompiled FABRICs. Some preallocation strategies are described here for completeness; a more detailed analysis is presented in [26]. Additionally, some of the other techniques like tiling [23], [24] could also be used to generate FABRICs. By carefully selecting the preallocated spare locations, we can reduce the size and effect of precompiled FABRICs for incremental fault tolerance.

We describe two optional spare PLB allocation strategies for the working area of the FPGA, but we are not limited to just using these [26]. The first strategy guarantees that for a working area PLB utilization of less than 80% (which is typical for most applications), every system logic cell function will be adjacent to at least one spare PLB. The second strategy guarantees that for a working area PLB utilization of less than 92%, every system logic cell function is no further than one PLB from a spare. (It should be noted that most FPGA circuits use only 80% of the available logic in order to enhance routability. As we use logic spares, the total number of signal

Fig. 9. Spare cell allocation patterns where (a) each logic cell function is adjacent to a spare and (b) each logic cell function is no more than one PLB from a spare.

nets does not increase. Thus, as we use logic spares, routing complexity does not significantly increase.)

Initial placement and routing of the system logic is done in the working area of the FPGA. For the first strategy, we constrain the design so that at least 20% of the working area PLBs are left as spares. This is not excessive, since typical PLB utilization is less than 80% for most applications. Higher utilizations make the design difficult to route. We can show that if this constraint is satisfied, we can always find a placement of the system logic so that every system logic cell function is adjacent to at least one spare PLB in the working area of the FPGA. Fig. 9(a) illustrates such a placement for an example 12 × 12 FPGA.

To force standard design tools to generate evenly distributed spares, we reserve the location of spare PLBs by preplacing dummy logic cell functions in spare PLBs before executing the standard PAR algorithm. In our implementation, for each system logic cell function, we select one of its adjacent spares as its preferred replacement. Note that the same spare may be designated as the preferred replacement for several working logic cell functions.

In Fig. 9, equations on the working area coordinates of a position were used to determine whether that position should be reserved as a spare; a working area location whose coordinates satisfy them is a spare location. Additional


TABLE II PERFORMANCE PENALTY WITH SPARE PLBS

spares are needed on the edge of the working area (labeled in Fig. 9) to ensure that fringe PLBs are sufficiently close to a matching spare PLB to meet the defined distance criteria. This requirement slightly lowers our 80% system area availability. Fig. 9(b) demonstrates the same concepts for our second strategy. Note that we do not expect the PLB utilization to be so high in the initial design, but only after a number of faulty PLBs have been replaced by spares.

While using preallocated spares reduces the time required to reconfigure in the presence of faults, there is a price to be paid for using the STAR approach and an additional cost for preallocating spare PLBs. If this cost is within operating limits, then preallocating spares is acceptable. Table II shows the performance penalty for several benchmark circuits implemented in an ORCA 2C series FPGA. The column labeled System gives the worst-case performance for the benchmark circuit without STARs or preallocated spares. This data forms the baseline for determining the performance penalty of adding the STARs and preallocated spares. The column labeled System w/STARs gives the worst-case performance (for any STAR position) for the benchmark circuit with STARs inserted and no preallocated spare PLBs. The column labeled System w/STARs and FT spares shows the worst-case operating frequency for the system function with preallocated spares for each of the 10 STAR positions. The last column shows the percent difference in operating frequency between the circuits without and with preallocated spares. For the circuits tested, the percent difference ranges between 2.5% and 15.1%.

It should be noted that preallocated spares are not a requirement of our FT method; however, they reduce the time for incremental reconfiguration for low numbers of faults. Additionally, there are other methods that can be used for local spare preallocation, like leaving a single LUT and flip-flop unused in every PLB.
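The first preallocation strategy (at least 20% spares, every logic cell function adjacent to a spare) can be illustrated with a small sketch. The congruence used here, `(x + 2*y) % 5 == 0`, is an assumption: it is a well-known grid pattern with exactly this adjacency property, and it is not claimed to be the equation used to produce Fig. 9. The function names are hypothetical.

```python
from typing import List, Tuple


def is_spare(x: int, y: int) -> bool:
    """Assumed spare pattern: reserve one location in five."""
    return (x + 2 * y) % 5 == 0


def uncovered(width: int, height: int) -> List[Tuple[int, int]]:
    """Non-spare locations with no spare among their four in-grid neighbors."""
    result = []
    for y in range(height):
        for x in range(width):
            if is_spare(x, y):
                continue
            neighbors = [(x - 1, y), (x + 1, y), (x, y - 1), (x, y + 1)]
            if not any(0 <= nx < width and 0 <= ny < height and is_spare(nx, ny)
                       for nx, ny in neighbors):
                result.append((x, y))
    return result
```

Under this pattern every interior location has a spare neighbor, so the only locations returned by `uncovered` lie on the edge of the array. This is consistent with the need for additional fringe spares noted above.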
Leaving one LUT and flip-flop unused in every PLB would allow up to 100% PLB utilization, but would require that 25% of the available logic be left unused.

E. Multiple Logic Faults

When two logic cell functions are assigned the same preferred spare PLB location, there is a conflict if they both need to be remapped to a spare location. When this occurs, we use a matching strategy to match logic cell functions to compatible spare locations. Minimax matching [27] can be used to match multiple logic cell functions (those currently mapped to incompatible PLB locations) to compatible spare locations such that the maximum distance between any logic cell function and its corresponding spare location is minimized. We take this approach to reduce the distance a logic cell function moves during fault-tolerant reconfiguration and to reduce any adverse effect on system operating speed.

In Fig. 10, we see an example of using minimax matching to assign incompatible logic cell functions to spare PLB locations in the working area of the FPGA (the STARs are not shown for this example). In Fig. 10(a), we have a configuration where there are three logic cell functions (labeled in the figure) that are incompatible with the faults at their current locations. If we set the minimax length to 1, we can match any one of the incompatible logic cell functions to the adjacent spare, but that leaves two of the cells without a match. By incrementing the minimax length to 2, we can match each of the incompatible logic cell functions to a compatible spare. In Fig. 10(b), we see the matched locations for each of the incompatible logic cell functions for a minimax length of 2. Fig. 10(c) shows the circuit after each incompatible logic cell function has been moved to its spare location. Note that the locations vacated by the incompatible functions are now available for use by other logic cell functions that are compatible with them.

It should be noted that we used the minimax matching strategy when taking the PUBs/No PUBs data shown in Figs. 3-8. When a conflict over spare assignment occurred, we used the minimax strategy to minimize the worst-case distance between the logic cell function and its location after fault tolerance. The data in Figs. 3, 5, and 7 shows two trends. First, we see that the Average Length of Logic Cell Function Moves (y-axis of Figs. 3, 5, and 7) is lower when we use PUBs. This makes sense because PUBs provide more spare resources than just using nonfaulty PLBs. So on average, a fault will be closer to a compatible spare with PUBs. Second, for the larger, more densely packed DSSM and FIB circuits, we see that the Maximum Operating Frequency (y-axis of Figs. 4 and 6) is higher for the PUBs case than for the No PUBs case. Since higher circuit operating speed is desirable, this supports the use of PUBs. It also validates the use of minimax matching to match fault location to spare location in order to reduce system operating frequency degradation in the presence of large numbers of faults.

VI. CONCLUSION

In this paper, we have described our approach for logic fault tolerance in RTR FPGAs. In our approach, we have introduced several new concepts for logic FT. Since regenerating all FT configurations (if they are required) takes place with one or both STARs parked over the faults, the system function is free to continue executing, unaffected by any faults. Thus, when used with the roving STARs approach for test and diagnosis of faults in FPGAs [7], no additional system function downtime is required for fault tolerance. All FT configurations and required changes are determined while the system function is online.

Whenever possible, we make use of faulty programmable resources. For logic, we use the faulty PLB if it is compatible with the function programmed in the faulty resource. If the fault is not compatible with the current logic, we attempt to find logic that is compatible with the fault. If that fails, we use the faulty PLB as a PUB, and make use of the nonfaulty portions of the PLB. This technique is a big improvement over previous techniques


Fig. 10. Working area minimax reconfiguration example for multiple fault incompatibility for (a) three incompatible faults, (b) minimax matching results, and (c) reconfigured system function.

that bypassed the faulty PLB. We have shown that we can actually continue to execute even when the number of faulty PLBs exceeds the number of spare PLB resources by almost a factor of 2.

When we do reconfigure for logic fault tolerance, we can make use of optional, preallocated local spare resources. As our data reflect, the STAR technique incurs a performance penalty relative to the operating speed of the system clock. Any preallocation of spares requires some additional cushion relative to the operating speed. Our spare resource allocation strategies guarantee local (adjacent) spares if the logic resource usage is less than 80%. If local spares are not available, or if a group of PLBs becomes faulty, we use a minimax grid matching strategy to match faulty logic cell locations to compatible spare resources such that the amount of reconfiguration is minimized.

We incorporate the idea of a programmable clock for fault tolerance. This allows the clock to initially be set to its fastest possible rate, and the clock rate is reduced only as necessary when faults affect the circuit's critical paths. For most applications, operating the system at a reduced clock speed is better than not operating at all.

It should also be noted that our methods require no modification to existing commercial FPGAs. We have also used standard commercial CAD tools (available to all commercial users) to implement our methods.

By including both the local and global approach to fault tolerance and using them in conjunction with the roving STARs BIST and diagnosis, we overcome some of the limitations of taking only a local approach like row/column redundancy [12], [13], [18], [19] and tiling [23], [24], or only a more global approach like pebble shifting [14], [15]. Unlike the local approaches, we are not limited to a fixed number of PLBs in a local area like a tile, row, or column, but we can still benefit from having just a few faults in a local area by using our small (relative to memory) precompiled FABRICs. For larger numbers of faults in a local area, more than can be handled by tiling or row/column redundancy, we can still reconfigure using our grid matching approach. In this case, grid matching allows us to make use of spares even if they are not in the local area. The downside is that it takes more time to determine the larger configuration changes.

We have demonstrated our techniques on the ORCA 2C15 FPGA. However, they can also be implemented on many other FPGA families. For the demonstration of our techniques, we rely on commercial CAD tools for mapping and incrementally remapping the circuit. One of the steps this involves is mapping the circuit to avoid the STAR areas. A mask with dummy logic and routing is created to avoid the STAR areas. Once a mask for a STAR location is created, the creation of masks for other STAR areas can be automated.

While the FT techniques presented in this paper have been developed for online applications with the roving STARs approach, we emphasize that most of the techniques presented can be easily adapted to offline applications. This would include the use of FPGAs that do not support RTR features. As a result, these FT techniques are applicable to almost any FPGA in any system, including application to offline manufacturing system-on-chip yield enhancement [28].

ACKNOWLEDGMENT

The authors would like to acknowledge the contributions of graduate and undergraduate students in the Department of Electrical Engineering at the University of Kentucky and the Department of Electrical and Computer Engineering at the University of North Carolina at Charlotte: S. Baumgart, J. Cheatham, M. Cheatham, A. Taylor, and P. Kataria.

REFERENCES
[1] M. Abramovici, C. Stroud, S. Wijesuriya, C. Hamilton, and V. Verma, "Using Roving STARs for on-line testing and diagnosis of FPGAs in fault-tolerant applications," in Proc. Int. Test Conf., 1999, pp. 973-982.
[2] M. Abramovici, C. Stroud, and J. Emmert, "Roving STARs: An integrated approach to on-line testing, diagnosis, and fault tolerance for FPGAs in adaptive computing systems," in Proc. 3rd NASA/DoD Workshop Evolvable Hardw., 2001, pp. 73-92.
[3] L. Shombert and D. Siewiorek, "Using redundancy for concurrent testing and repairing of systolic arrays," in Proc. Fault-Tolerant Comput. Symp., 1987, pp. 244-249.
[4] J. Emmert, C. Stroud, J. Cheatham, A. M. Taylor, P. Kataria, and M. Abramovici, "Performance penalty for fault tolerance in Roving STARs," in Proc. Int. Workshop Field-Program. Logic Appl., 2000, pp. 545-554.
[5] M. Abramovici, J. Emmert, and C. Stroud, "Fault tolerant operation of reconfigurable devices utilizing an adjustable system clock," U.S. Patent 6 874 108, Mar. 29, 2005.
[6] E. McCluskey, "Verification testing - a pseudoexhaustive test technique," IEEE Trans. Comput., vol. C-33, no. 6, pp. 541-546, Jun. 1984.
[7] M. Abramovici, C. Stroud, and J. Emmert, "On-line BIST and BIST-based diagnosis of FPGA logic blocks," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 12, no. 12, pp. 1284-1294, Dec. 2004.
[8] C. Stroud, J. Nall, M. Lashinsky, and M. Abramovici, "BIST and BIST-based diagnosis of FPGA interconnect using roving STARs," in Proc. IEEE Int. On-Line Test Workshop, 2001, pp. 618-627.
[9] A. Steininger and Scherrer, "On the necessity of on-line BIST in safety critical applications," in Proc. 29th Fault-Tolerant Comput. Symp., 1999, pp. 208-215.
[10] N. Shnidman, W. Mangione-Smith, and M. Potkonjak, "On-line fault detection for bus-based field programmable gate arrays," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 6, no. 4, pp. 656-666, Apr. 1998.
[11] J. Cheatham, J. M. Emmert, and S. Baumgart, "A survey of fault tolerant methodologies for FPGAs," ACM Trans. Des. Autom. Electron. Syst., vol. 11, no. 2, pp. 501-533, Apr. 2006.
[12] F. Hatori et al., "Introducing redundancy in field programmable gate arrays," in Proc. IEEE Custom Integr. Circuits Conf., 1993, pp. 7.1.1-7.1.4.
[13] S. Durand and C. Piguet, "FPGAs with self-repair capabilities," in Proc. ACM Int. Symp. FPGAs, 1994, pp. 1-6.
[14] J. Narasimhan, K. Nakajima, C. Rim, and A. Dahbura, "Yield enhancement of programmable ASIC arrays by reconfiguration of circuit placements," IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., vol. 13, no. 8, pp. 976-986, Aug. 1994.
[15] J. Narasimhan, C. Rim, K. Nakajima, and A. Dahbura, "Yield enhancement of wafer scale integrated arrays," in Proc. IEEE Conf. Wafer-Scale Integr., 1991, pp. 178-184.
[16] J. Kelly and P. Ivey, "Defect tolerant SRAM based FPGAs," in Proc. Int. Conf. Comput. Des., 1994, pp. 479-482.
[17] R. Cuddapah and M. Corba, Reconfigurable Logic for Fault Tolerance. New York: Springer-Verlag, 1995.
[18] S. Dutt and F. Hanchek, "REMOD: A new methodology for designing fault-tolerant arithmetic circuits," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 5, no. 1, pp. 34-56, Jan. 1997.
[19] F. Hanchek and S. Dutt, "Methodologies for tolerating logic and interconnect faults in FPGAs," IEEE Trans. Comput., vol. 47, no. 1, pp. 15-33, Jan. 1998.
[20] J. Emmert and D. Bhatia, "Partial reconfiguration of FPGA mapped designs with applications to fault tolerance," in Proc. Int. Workshop Field-Program. Logic Appl., 1997, pp. 141-150.
[21] J. Emmert and D. Bhatia, "A fault tolerant technique for FPGAs," J. Electron. Testing, vol. 16, pp. 591-606, 2000.
[22] V. Lakamraju and R. Tessier, "Tolerating operational faults in cluster based FPGAs," in Proc. ACM Int. Symp. FPGAs, 2000, pp. 194-197.
[23] J. Lach, W. Mangione-Smith, and M. Potkonjak, "Low overhead fault-tolerant FPGA systems," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 6, no. 2, pp. 212-221, Feb. 1998.
[24] J. Lach, W. Mangione-Smith, and M. Potkonjak, "Algorithms for efficient runtime fault recovery on diverse FPGA architectures," in Proc. Int. Symp. Defect Fault Tolerance VLSI Syst., 1999, pp. 386-394.
[25] Field Programmable Gate Arrays Data Book, Lucent Technologies, Allentown, PA, 1998.
[26] J. Emmert, C. Stroud, B. Skaggs, and M. Abramovici, "Dynamic fault tolerance in FPGAs via partial reconfiguration," in Proc. IEEE Symp. Field-Program. Custom Comput. Mach., 2000, pp. 165-174.
[27] F. Leighton and P. Shor, "Tight bounds for minimax grid matching with applications to average case analysis of algorithms," in Proc. Symp. Theory Comput., 1986, pp. 91-103.
[28] M. Abramovici, C. Stroud, and J. Emmert, "Using embedded FPGAs for SoC yield improvement," in Proc. 39th Des. Autom. Conf., 2002, pp. 713-724.

John M. Emmert (S'92-M'93-SM'04) received the Ph.D. degree from the University of Cincinnati, Cincinnati, OH, in 1999. Currently, he is an Associate Professor in the Department of Electrical Engineering, Wright State University, Dayton, OH, and a Lieutenant Colonel in the U.S. Air Force Reserves, Wright Patterson Air Force Base, Dayton, OH.

Charles E. Stroud (S'74-M'88-SM'90-F'04) received the Ph.D. degree from the University of Illinois at Chicago, in 1991. Currently, he is a Professor in the Department of Electrical and Computer Engineering, Auburn University, Auburn, AL. Previously, he was a Distinguished Member of the Technical Staff at Bell Labs, Naperville, IL, where he worked for 15 years as a VLSI and printed circuit board designer with additional work in CAD tool development and BIST for digital and mixed-signal VLSI. He is author of A Designer's Guide to Built-In Self-Test (Kluwer, 2002). He has over 100 publications and 13 issued U.S. patents for various BIST techniques for VLSI and FPGAs. He has been an Editorial Board Member of the IEEE DESIGN AND TEST OF COMPUTERS and IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS.

Miron Abramovici (S'76-M'80-SM'86-F'93) received the Ph.D. degree from the University of Southern California, Los Angeles, in 1980. He is co-founder and CTO of Design Automation for Flexible Chip Architectures (DAFCA), an EDA startup providing tools for in-system at-speed silicon debug. Prior to this, he was a Distinguished Member of the Technical Staff at Bell Labs, Murray Hill, NJ. He was an Adjunct Professor of Computer Engineering at the Illinois Institute of Technology, Chicago. He was the principal investigator of a DARPA-sponsored project on adaptive computing systems and he is now the principal investigator of a project on reconfigurable architectures for SoC infrastructure funded by the Advanced Technology Program of NIST. His research has covered silicon debug and diagnosis, reconfigurable computing, ATPG, DFT, fault simulation, BIST, redundancy identification, special-purpose hardware architectures, on-line testing and fault-tolerance, timing verification, logic optimization, and partitioning. He coauthored Digital Systems Testing & Testable Design, adopted worldwide as the standard textbook in this field. He has 26 issued patents and over 80 publications. Dr. Abramovici is a Member of the Editorial Board of IEEE DESIGN AND TEST OF COMPUTERS and of the Journal of Electronic Testing.
