You are on page 1of 14

IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 26, NO.

1, JANUARY 2007

53

Dual-Threshold CAD Framework for Subthreshold Leakage Power Aware FPGAs


Akhilesh Kumar, Student Member, IEEE, and Mohab Anis, Member, IEEE
AbstractManaging leakage power in eld programmable gate arrays (FPGAs) has become critical for the FPGA industry to remain competitive in the semiconductor market and enter the mobile applications domain. This paper proposes and evaluates several dual-Vt-based designs of FPGA architecture for reducing the leakage power. A dual-Vt FPGA computer-aided design (CAD) framework has been proposed, which is used to develop and evaluate different dual-Vt FPGA architectures. The logic elements and the routing resources are considered as candidates for dual-Vt assignment. The authors estimate the number of the logic elements that can be assigned as high-Vt in the ideal case by using a dual-Vt assignment algorithm in the CAD framework. Based upon this estimate, the authors develop and evaluate two kinds of architectures, homogenous and heterogenous. The results indicate that an average leakage-power savings of up to 50% can be obtained from these architectures. This CAD framework can also be used for developing and evaluating different dual-Vt FPGA architectures other than the ones proposed in this paper. Index TermsCMOS digital integrated circuits, delay estimation, design automation, eld programmable gate arrays (FPGAs), leakage currents.

I. I NTRODUCTION EAKAGE POWER has been recently recognized as a major challenge in the eld programmable gate array (FPGA) industry. This was primarily because other design challenges, such as performance and area, were given more attention, but with technology scaling, leakage power has emerged as a key design challenge [14]. The current-generation FPGAs are being implemented in the 90-nm CMOS technology, which necessitates devising techniques for leakage-power reduction. For the FPGAs to continue retaining its semiconductor market and competitive advantages over the high-performance custom very large scale integration (VLSI) designs, the FPGA industry should adopt new techniques for leakage-power reduction. The work in [4] showed that a 90-nm FPGA consumes too much leakage power to be successfully used in mobile applications. Therefore, for FPGAs to gain popularity in the domains such as wireless personal communication systems or in biomedical applications, the FPGAs need to implement techniques for reducing leakage power. The increase in complexity of the current-generation FPGAs has resulted in more number of transistors in the FPGAs which directly translate into increased leakage power. The resource
Manuscript received April 21, 2005; revised November 19, 2005 and February 22, 2006. This work was supported by the Natural Sciences and Engineering Research Council of Canada. This paper was recommended by Associate Editor F. N. Najm. The authors are with the Department of Electrical and Computer Engineering, University of Waterloo, Waterloo, ON N2L3G1, Canada. Digital Object Identier 10.1109/TCAD.2006.882595

utilization of FPGAs is just over 60%, and the unutilized parts also consume leakage power, which means that reducing leakage power is important both in the used and unused parts of the FPGA. This paper proposes the following for the design of a dual-Vt FPGA for reducing subthreshold leakage power. 1) Dual-Vt FPGA architecture, which involves determining the distribution for the high- and low-Vt devices. 2) Computer-aided design (CAD) framework for designing the dual-Vt FPGA. This CAD framework is used for determining the parameters of the dual-Vt FPGA design. 3) Algorithms associated with the CAD framework. 4) Analysis of different dual-Vt FPGA architectures. The remainder of this paper is organized as follows. In Section II, leakage reduction and the previous work are discussed. Section III explains the static RAM (SRAM)-based FPGA architecture that is used in this paper. In Section IV, FPGA architectures based on dual-threshold-voltage methodology are proposed, and the proposed dual-Vt FPGA CAD ow is discussed. Section V presents the implementation of the proposed dual-Vt FPGA CAD ow. Section VI discusses the results and the evaluation methodology. Section VII proposes some guidelines for designing a dual-Vt FPGA. Section VIII concludes the paper and outlines the future work. II. L EAKAGE R EDUCTION AND P REVIOUS W ORK The subthreshold leakage current through a MOSFET can be modeled as [9] Isub = I0 1 exp Vds VT exp Vgs Vt Vo n VT (1)

where I0 is a constant dependent upon device parameters for a given technology, VT is the thermal voltage, Vo is the offset voltage which determines the channel current at Vgs = 0, Vt is the threshold voltage, and n is the subthreshold swing parameter. Since the leakage current is exponentially dependent on the threshold voltage, increasing the threshold voltage would decrease the leakage current substantially. However, the high-threshold-voltage devices have larger switching delays. The various leakage-current mechanisms and some leakagereduction techniques for CMOS circuits were discussed in [2] and [15]. The techniques for reducing leakage power involve static and dynamic approaches. The dynamic approach involves run-time decision making for leakage reduction. One such popular technique is the use of sleep transistors in multi-threshold CMOS (MTCMOS) circuits for controlling the leakage power [18]. Several optimizations for MTCMOS circuits involving

0278-0070/$25.00 2007 IEEE

54

IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 26, NO. 1, JANUARY 2007

Fig. 1. Dual-Vt design implementation.

the use of sleep transistors have been proposed, such as in [16] and [17]. However, this technique can reduce only the standby leakage power and lead to performance degradation. The static technique does not involve run-time decision making for leakage control. The dual-threshold-voltage-design technique, which is a static approach, has been widely used in the custom VLSI designs for reducing leakage power. The dual-Vt implementation reduces both the active and standby leakages. Further, there is no performance degradation in a dual-Vt implementation for custom VLSI designs. The dual-threshold-voltage-design technique uses two kinds of transistors in the same circuit. Some transistors have a high threshold voltage, while other transistors have a low threshold voltage. The high-threshold-voltage transistors have less subthreshold leakage-power dissipation but also have a larger delay as compared to the low-threshold-voltage transistors. Fig. 1 shows the concept of dual-threshold-voltage implementation in custom VLSI designs. Here, the gates on noncritical paths are assigned as high-Vt, and the gates on the critical path are assigned as low-Vt. The objective is to maximize the number of transistors having high threshold voltage without sacricing the performance of the circuit. Several prior works have proposed algorithms which assign high- and low-Vt to the logic gates of the given circuit [1], [5], [19], [20]. However, the dual-threshold-voltage-design technique proposed in the literature for custom VLSI designs cannot be used for FPGAs. This is because the FPGAs are programmable and we do not know the circuit that would be eventually implemented on it and hence the delays through various paths of the circuit are not known. In this paper, we present a dual-Vt FPGA CAD ow for developing and evaluating different dual-Vt FPGA architectures. We do not know of any published work in which a CAD ow for evaluating and developing dual-Vt FPGA architectures has been proposed. The leakage-current mechanisms for the FPGAs are the same as that for the custom VLSI designs. The two main sources of leakage currents are the subthreshold and gate leakages. There have been very few works targeted at reducing the leakage power in FPGAs. The work in [3] used a technique based on the property that the leakage power consumed by a CMOS circuit is dependent on the state of its inputs and used the signal statistics to alter the state of the inputs in order to reduce the leakage power, in such a way that the functionality of the circuit does not change. However, this paper addressed only active leakage. Further, this paper addressed leakage reduction only in the used

parts of the FPGA. The work in [6] approached the problem of reducing leakage power by dividing the FPGA fabric into small regions, with each of the regions being controlled by a sleep transistor which would be turned on/off depending upon whether that region is being used or not, thus reducing the leakage power. This technique leads to reduction of only the standby leakage power. The work in [7] explored the dual-Vt, body biasing, and gate biasing of NMOS pass transistors for reducing leakage power. The dual-Vt technique used in [7] was based on varying the percentage of high-threshold-voltage elements in the routing resources of the FPGA and studying its effect for a number of benchmarks. It did not specify the leakage savings obtained, and no detailed evaluation of several possible routing architectures was presented. Body-biasing technique requires control circuitry and generation of voltages for body biasing. Further, this technique leads to reduction in leakage savings with technology scaling [21]. The negative gate biasing of NMOS pass transistors has implementation issues and also leads to additional leakage-current component, called gateinduced drain leakage. The work in [24] used a dual-Vdd/Vt architecture to reduce leakage power and dynamic power. Only the conguration SRAM cells were assigned high-Vt to reduce leakage power. The dual-Vdd approach requires design of power supply network with two voltage levels and level converters which lead to area overhead and increased complexity. The dual-Vt technique and the CAD ow that we propose have the following advantages: 1) reduces both active and standby leakage powers; 2) provides a CAD framework for developing and evaluating a dual-Vt FPGA implementation; 3) the inherent area penalty in using sleep transistors is not present in this design technique; 4) the dual-Vt architecture does not require any modication in the existing placement and routing tools from the perspective of the users. III. T ARGETED FPGA A RCHITECTURE The FPGA architecture is very regular in structure. It is made up of two main componentslogic blocks (CLBs) and routing resources. The logic blocks implement the functionality of the given circuit, while the routing resources provide the connectivity for implementing the logic. The logic blocks and the routing resources are congurable. Although many types of architectures have been experimented with, the most popular one is the SRAM-based architecture which is described below and is used in this paper [10]. The logic block of the SRAM-based FPGA is lookup table (LUT)-based and is composed of basic logic elements (BLEs). LUT is an array of SRAM cells. Each BLE consists of a k-input LUT, ip-op, and multiplexer for selecting the output directly from either the output of LUT or the registered output value of the LUT stored in the ip-op. Fig. 2 shows the BLE. Previous works have shown that the four-input LUT is the most optimum size as far as logic density and utilization of resources are concerned, and this has been widely used. Cluster-based logic blocks were investigated in [10], and it was shown that the cluster-based logic blocks are better in speed and area. The structure of a cluster-based logic block is shown in Fig. 3. In the cluster-based logic block, the logic block is made up of

KUMAR AND ANIS: DUAL-THRESHOLD CAD FRAMEWORK FOR SUBTHRESHOLD LEAKAGE POWER AWARE FPGAs

55

Fig. 2.

Basic logic element [10].

have assumed that metal 3 is used for the routing wires. The sizing of the routing switches was done for minimum area delay product, as proposed by the study in [10]. We have used the regular MOS models for low-threshold voltage implementation. The Vdd for this technology is 1.2 V, and Vt is 0.2 V. The high-Vt transistors have threshold voltage 100 mV higher than the low-Vt transistors [27]. SPICE simulation shows that, upon increasing the high-Vt value further, the delays of the subblocks increase without leading to any signicant leakage savings. Since the SRAM cells do not contribute to any run-time delay, we assume that they are implemented with high-Vt transistors to reduce the leakage power, and hence, our results for leakage savings do not include any leakage savings from the SRAM cells [4].

IV. P ROPOSED D UAL -T HRESHOLD FPGA A RCHITECTURE The very nature of programmability of FPGAs makes the dual-Vt design technique for FPGAs different from dual-Vt custom VLSI designs. We propose dual-Vt FPGA architectures and then use a CAD framework for determining the parameters and subsequently evaluating the proposed dual-Vt FPGA architectures. This paper targets the logic elements within the CLBs and the routing resources as candidates for dual-Vt assignment. We evaluate the following architectures:
Fig. 3. Cluster-based logic block [10].

N BLEs. There are I inputs to the logic block, such that each input can connect to all the BLEs. Also, the output of each BLE can drive one of the inputs of each of the BLEs. The clock feeds all the BLEs. The work in [10] showed that the logic clusters containing four to ten BLEs achieve good performance. We consider each subblock to be made up of BLE and the corresponding LUT input multiplexers. The routing resources are again of various types, but the one used in this paper is the island-based architecture. In the island-based architecture, the routing resources form a meshlike structure with the horizontal and vertical routing channels. The horizontal and vertical routing channels are connected by switch boxes which are programmable and thus provide the exibility in making the connections. The logic blocks are connected to the routing channels through the connection boxes which are also programmable. Fig. 4 shows the island style routing architecture [10]. Apart from the logic blocks and the routing resources, the clock distribution is assumed to have a dedicated network. Most of the commercial FPGAs have a structure similar to the one described above or some variant of the above architecture. We have used an industrial CMOS 0.13-m technology node for the technology dependent data for the FPGA architectures. The methodology proposed in [10] was used to extract the required technology dependent data. In extracting the delay data for the logic blocks, SPICE simulation was done. The delay through the subblock is the delay through the LUT input multiplexer, LUT, ip-op, and multiplexer for selecting the output of the ip-op or LUT. The sizing of the buffers was done for equal rise and fall times, as proposed by the study in [10]. We

1) homogenous architecture with only subblocks within CLBs considered for dual-Vt assignment (arch1); 2) homogenous architecture with subblocks and routing resources considered for dual-Vt assignment (arch2); 3) heterogenous architecture with only subblocks considered for dual-Vt assignment (arch3); 4) heterogenous architecture with the subblocks and routing resources considered for dual-Vt assignment (arch4). The proposed homogenous and heterogenous architectures are described as follows.

A. Homogenous Dual-Vt FPGA Architecture This proposed architecture has a xed number of high- and low-Vt subblocks in each of the CLBs. For a cluster size of N , M (M < N ) subblocks are assigned as high-Vt, whereas (N M ) subblocks are assigned as low-Vt in each CLB. For example, Fig. 5 shows a homogenous dual-Vt FPGA architecture in which each CLB has 50% high- and 50% low-Vt subblocks for a cluster size of four. In the case of arch1, we consider only the subblocks as candidates for dual-Vt assignment. We extend this architecture to include the routing resources as candidates for dual-Vt assignment, which leads to the architecture arch2. The switches in the connection blocks and the switch blocks, which drive the routing segments, are considered for dual-Vt assignment. In the architecture arch2, certain fraction of these switches are assigned as high-Vt. Fig. 6 shows the switch block and the associated routing switches. The connection block has similar switches for providing connections between CLBs and routing segments.

56

IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 26, NO. 1, JANUARY 2007

Fig. 4. Island style routing architecture [10].

Fig. 5. Proposed homogenous FPGA architecture. Each CLB has a xed ratio of the high- and low-Vt subblocks.

B. Heterogenous Dual-Vt FPGA Architecture This proposed architecture has two kinds of CLBs distributed uniformly throughout the array. The rst type of CLBs has all high-Vt subblocks, and the second type has some high- and low-Vt subblocks. These two kinds of logic blocks are distributed uniformly throughout the array in such a way that the array is regular. If the logic-block cluster size is N , then the proposed FPGA architecture would have one of the two kinds of CLBs with all N high-Vt subblocks (type1), whereas the second kind of logic blocks would have M high-Vt subblocks (M < N ) and N M low-Vt subblocks. Fig. 7 shows the distribution of the two kinds of the logic blocks in the FPGA, such that 50% of the logic blocks are of type1, whereas the other 50% of the logic blocks are of type2 for a cluster size of four. We selected such an architecture because the number of subblocks assigned as high-Vt was very high and so a large percentage of logic blocks can possibly have all high-Vt subblocks. This heterogenous architecture is extended to include the routing resources for dual-Vt assignment, leading to the architecture arch4, in the same way as arch1 is extended to arch2.

Fig. 6. Switch block. (a) Overall architecture of a switch block. (b) Buffered pass-transistor switch. (c) Pass-transistor-based switch.

C. Proposed Dual-Vt FPGA CAD Framework Developing and evaluating dual-Vt FPGA architectures, as outlined previously, would require a CAD framework which can perform dual-Vt assignment to the subblocks and the routing resources and has a methodology for analyzing different

KUMAR AND ANIS: DUAL-THRESHOLD CAD FRAMEWORK FOR SUBTHRESHOLD LEAKAGE POWER AWARE FPGAs

57

Fig. 7. Proposed heterogenous FPGA architecture. Two kinds of CLBs, one having all high-Vt subblocks and the other having a xed ratio of the high- and low-Vt subblocks.

dual-Vt FPGA architectures. The typical existing FPGA CAD ow within the framework of versatile place and route (VPR), for placing and routing a given netlist, and the proposed generic dual-Vt FPGA CAD ow are shown in Fig. 8. The proposed dual-Vt FPGA CAD ow has been developed using the widely used academic research tools VPR and T-Vpack [10] and a state-dependent leakage-power model that we developed for FPGAs [11] and integrated with the power model proposed by the study in [13]. The salient features of the leakage-power model developed by us are the following. 1) It is a state-dependent model which takes into account the dependencies of leakage power on the state of the input. 2) It models both the gate and subthreshold leakages. 3) It takes into account the various short-channel effects to accurately model the leakage. 4) It is based on analytical equations from BSIM4. The leakage-power model takes into account the state dependence of leakage power by considering the probability of states for each of the circuit element. Consider a circuit element which has n states and the probability of the states are Prob1 , Prob2 , . . . , Probn , such that n Probi = 1. i=1 The leakage power for different states are given as Pleak1 , Pleak2 , . . . , Pleakn . The average leakage power can then be written as
n

as follows. We model the subthreshold leakage of the inverters in both the states, i.e., when the input is zero and when the input is one, and the gate leakage of the inverter when the input is one. With the input at zero, only the subthreshold leakage ows through the NMOS of the inverter and the gate leakage through PMOS is ignored, as shown in Fig. 9(b). When the input to the inverter is one, there is subthreshold leakage through the PMOS and gate leakage through the NMOS. In similar ways, the leakage power through other FPGA circuit elements are computed. VPR is used for placement and routing of the benchmark circuits. After the placement and routing of the given circuit, the power model computes the probability of states for each node of the circuit. For computing the probability of the nodes of the circuit, we have used the work done in [13]. After the probability of states for each of the nodes is computed, the power model looks at each of the circuit elements of the FPGA and computes the leakage for each of the states of that element using the leakage computation engine (LCE). We have implemented the LCE, which computes the leakage for each of the circuit element of the FPGA based on a given input vector (and the state of the SRAM cells, if they are present in the given circuit element). The LCE is basically a library of statedependent leakage calculation functions. This library has the basic leakage equation for the subthreshold and gate leakages, and the computation of the associated parameters. It also has the models for computing the leakage for each of the FPGA circuit elements for a given input vector. For a low-Vt subblock with all the inputs set to zero, the total leakage is 3.46 nA, whereas for a high-Vt subblock, the leakage is 0.139 nA. For a low-Vt buffered pass-transistor routing switch, with the input as zero, the leakage is 0.933 nA, whereas for the high-Vt switch, the leakage is 0.0156 nA. The typical FPGA CAD ow involves circuit optimization using sequential interactive synthesis (SIS) [22] and technology mapping to LUTs using FlowMap [23]. T-Vpack is then used to do the packing and clustering, which generates the netlist le of logic blocks. VPR is then used for placement and routing. The proposed dual-Vt FPGA CAD framework has six stages as shown. The next section describes various stages of the CAD ow.

V. CAD F RAMEWORK I MPLEMENTATION Pavgleak =


i=1

Probi Pleaki .

(2)

We consider both the PMOS and NMOS in various circuit elements as the candidates for subthreshold leakage, but we consider only the NMOS transistors as candidates for gate leakage because the gate leakage in PMOS is considerably smaller than NMOS [12]. Furthermore, the back gate leakage of the NMOS transistors is ignored, and we consider the gate current only from the gate to channel which then gets partitioned and ows into the source and drain of the transistor, as shown in Fig. 9(a). The methodology for computing the leakage through an inverter in the FPGA leakage-power model developed by us is

We modied the VPR to support the dual-Vt assignment to the subblocks and the T-Vpack to support reclustering. Each of the stages of the dual-Vt FPGA CAD ow [Fig. 8(b)] is explained below.

A. Stage 1 In this stage, the .net le is generated, which is a netlist of logic blocks, along with the activity and function les required for the power estimation. The .net le is generated from the .blif (Berkeley logic interchange format) le based upon the specied FPGA architecture, such as cluster size, inputs per cluster, LUT size, etc., by T-Vpack. The activity and function

58

IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 26, NO. 1, JANUARY 2007

Fig. 8. (a) Typical FPGA CAD ow within the VPR and T-Vpack framework. (b) Proposed generic dual-Vt FPGA CAD ow.

le generations are a part of the power model framework [11], [13]. These les serve as inputs to the second stage. B. Stage 2 The second stage is a delay estimation stage, which is used to assign high- and low-Vt to various subblocks, based on the slacks available for each path. The delay calculation can either be accurate or be based upon some estimation. In this paper, we have used accurate delay estimation based on the actual placement and routing of the benchmark. In this stage, it is assumed that all the subblocks are low-Vt, and the delay calculation is done based on the low-Vt delay values for the subblocks and routing resources. VPR does the placement and routing of the benchmark for delay calculation. Since the leakage-power model is integrated into the VPR, we calculate the power for the single low-threshold-voltage implementation of the FPGA for the benchmark, immediately after placement and routing. This leakage-power-consumption result is used in the nal stage of the CAD ow for comparison with the leakage power consumed by the dual-Vt implementation and, hence, calculation of leakage savings. C. Stage 3 Initially, it is assumed that all the subblocks are low-Vt except the conguration SRAM cells which are all high-Vt for both the single low-Vt and dual-Vt implementations. Based on the delays of various paths of the circuit computed in the previous stage, the dual-Vt assignment algorithm assigns highVt to various subblocks. The dual-Vt assignment algorithm is shown in Algorithm 1 [1]. The algorithm works on the timing graph created by the VPR for placement and routing. This

Fig. 9.

(a) Gate leakage in NMOS. (b) Subthreshold leakage in an inverter.

timing graph contains the information for the delays between various pins of the circuit. High-Vt assignment to a subblock is done based on the slack available at the output pin of the subblock. The assignment of high-Vt would change the delay through the subblock. This delay value is updated in the timing graph, and the next subblock is then considered for high-Vt assignment. It should be noted that the dual-Vt assignment is not constrained to keep the CLBs identical in terms of the number of high- and low-Vt subblocks in each CLB. This kind of dual-Vt assignment, which is the ideal assignment, would result in a dual-Vt FPGA implementation which would not have any performance degradation. This dual-Vt assignment gives a theoretical upper limit for the number of subblocks assigned as high-Vt. This dual-Vt assignment methodology is the same as the dual-Vt assignment for custom VLSI designs. Therefore, the dual-Vt assignment produces an output in which there are different numbers of high- and low-Vt subblocks across different CLBs. This is, however, not a feasible solution. This information is needed by the next reclustering stage to create a feasible implementation for the dual-Vt FPGA architectures.

KUMAR AND ANIS: DUAL-THRESHOLD CAD FRAMEWORK FOR SUBTHRESHOLD LEAKAGE POWER AWARE FPGAs

59

Algorithm 1 Dual-Vt assignment [1] Using timing - graph in VPR (after Place and Route) for each subblock do Compute the slack available if slack > 0 then Assign high-Vt to the subblock Recompute the slack if slack < 0 then Re-assign low-Vt to the subblock end if end if end for.

where the third part of the equation denotes the attraction of a BLE to a cluster based on the number of high-/low-Vt BLEs in the current cluster. The parameter Hvt takes a value of one for a high-Vt BLE and zero for a low-Vt BLE. NumHvt denotes the number of high-Vt BLEs in the current cluster, and NumLvt denotes the number of low-Vt BLEs in the current cluster. The value of depends upon the benchmark, but a value between 0.01 and 0.05 was found to give good results for all the benchmarks. The value of was set to 0.8. The number of low-Vt subblocks in the type2 logic blocks is determined as follows: NumLvtBLEs = ClusterSize SumLvtCrit NumFracLB (5)

D. Stage 4 This stage works within the framework of T-Vpack. For the homogenous architectures (arch1 and arch2), we estimate the number of high-Vt subblocks per CLB based upon the results of the dual-Vt assignment to subblocks in the previous stage. Based on these results, we evaluate the homogenous architecture with a xed number of high-Vt subblocks per CLB. The heterogenous architectures (arch3 and arch4) are developed and evaluated in the following way based on the algorithm shown in Algorithm 2. We have modied the clustering algorithm proposed by the study in [10]. The clustering algorithm in [10] picks up a seed BLE and then tries to build up the cluster by attracting the other BLEs based on an attraction function. The attraction function used in [10] to add a BLE B to a current cluster C is shown in the following: Attraction = Criticality(B)+(1) Nets(B) Nets(C) MaxNets (3)

where Criticality(B) represents that part of the attraction function which tries to reduce the delay of the circuit. For computing the delay of the various paths in the circuit during the packing stage, the intracluster delay is assumed to be 0.1, logic delay is assumed to be 0.1, and intercluster delay is assumed to be one [10]. For the modied clustering algorithm, we set the delay of high-Vt BLEs to 0.2 to account for the increased delays of high-Vt BLEs. The exact values of these delays are not very important as long as they represent the correct trend [10]. The methodology for criticality computation is explained in [10], and its value lies between zero and one. The second part of the function tries to cluster the BLE which share maximum nets with the current cluster. To take into account the high-Vt BLEs and increase the type1 (all high-Vt) clusters the modied cost function is shown in the following: Attraction = Criticality(B) + (1 ) + Nets(B) Nets(C) MaxNets

where SumLvtCrit denotes the sum of criticality of all low-Vt subblocks, and NumFracLB denotes the number of logic blocks which have at least one low-Vt subblock so that they can be categorized into type2 logic block. We use this equation to determine the number of low-Vt subblocks in type2 CLBs because the criticality measure takes into account the importance of a subblock with regard to delay, which directly translates to the threshold voltage of the transistors of the subblock. The next step is to assign high-/low-Vt to subblocks in type2 logic blocks so that each of the type2 logic blocks have the same number of low-Vt subblocks. We start by arranging all the subblocks in a logic block in the ascending order of criticality. If the logic block has more low-Vt subblocks than NumLvtBLEs, the BLEs are assigned high-Vt, starting with the lowest critical low-Vt subblock, until the number of low-Vt subblocks is equal to NumLvtBLEs. If the logic block has low-Vt subblocks less than NumLvtBLEs, some of the subblocks are assigned lowVt, starting with the highest critical high-Vt subblock, so that maximum gain in performance might be achieved. This method of assigning high-/low-Vt to subblocks during the reclustering stage is also used for the homogenous architecture, such that the value NumLvtBLEs is used for all the logic blocks. Algorithm 2 Recluster Do clustering based on T-Vpack algorithm with modied attraction function for each logic block do if All-subblocks - high- Vt then Mark the logic block as type1 else Mark the logic block as type2 end if end for Determine the number of low-Vt subblocks in type2 CLB based on (5) for Each type2 logic block do Arrange the subblocks in ascending order of criticality if No.-of-low-Vt-subblocks < NumLvtBLEs then while No.-of-low-Vt-subblocks < NumLvtBLEs do Re-assign a high-Vt subblock to low-Vt based on criticality (Highest critical high-Vt subblock assigned low-Vt rst) end while

Hvt NumHvt + (1 Hvt) NumLvt ClusterSize (4)

60

IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 26, NO. 1, JANUARY 2007

else while No.-of-low-Vt-subblocks > NumLvtBLEs do Re-assign a low-Vt subblock to high-Vt based on criticality (Lowest critical low-Vt subblock assigned high-Vt rst) end while end if end for. This stage produces an output which gives an estimate of the number of type1 and type2 CLBs for the heterogenous architectures (arch3 and arch4) and number of low-Vt subblocks in the type2 CLBs. E. Stage 5 Homogenous architectures: For the architecture arch1, placement and routing is done in this stage. For realizing the architecture arch2, a certain fraction of routing resources are assigned high-Vt, and then, placement and routing is done. The high-Vt assignment to routing switches is done as follows. Consider a routing channel width of 100 with two kinds of segments, such that 50 segments are driven by passtransistor switches in the switch box and 50 segments are driven by buffered pass-transistor switches, as dened by the VPR architecture le. The output pins of the CLB drive all these segments with a buffered switch in the connection box. When we assign high-Vt to, say, 40% of routing segments driven by pass-transistor switches, it would mean that 20 segments of those 50 segments will be driven by high-Vt pass-transistor switches. Also, all the output pin switches of the CLBs, which drive these 20 segments, will be assigned high-Vt. We assume that a gate boosting for these switches are done, which would minimize the voltage degradation when a logic one is propagated [10]. Other techniques, such as the use of level restorer, can be used to eliminate voltage degradation in the routing switches [25]. SPICE simulation shows that logic level one passing through a high-Vt pass transistor with gate boosting to 1.5 V (Vdd = 1.2 V) degrades to only 1.198 from 1.2 V. The SPICE simulation results indicate that, with gate boosting only 1.5%, extra static leakage current ows in the driven buffer through a high-Vt pass transistor as compared to the case when there is no voltage degradation. Therefore, we consider that the static power dissipation due to voltage degradation in high-Vt pass-transistor switches is sufciently small. Heterogenous architectures: Based on the reclustering results, we determine, for a given FPGA array, the number of type1 and type2 CLBs. We estimate these values based on the ratio of type1 and type2 logic blocks in the reclustered netlist, after providing sufcient margin to account for the performance degradation because of the physical restriction of FPGA regularity. These CLBs are assumed to be distributed evenly throughout the FPGA array. For the architecture arch3, constrained placement followed by routing is done. For arch4, a certain fraction x of the total routing switches are assigned high-Vt, and then, constrained placement followed by routing is done. We vary the value of x to determine the

leakage savings and performance tradeoffs. The algorithm for constrained placement is shown in Algorithm 3. Since the VPR uses simulated annealing for timing-driven placement, which is a very effective placement technique, we have not modied the basic placement algorithm. However, we have incorporated delays of the high- and low-Vt parts of the CLBs into the timing graph accordingly, so that VPR can optimize the overall placement for performance. Algorithm 3 Constrained Placement P-ratio = (Num type1 CLBs)/(Num type2 CLBs) Determine physical locations for the two types of CLBs based on P-ratio Assign high-Vt to frac fraction of routing switches Initial Random placement for all the CLBs Start placement based on VPR placement algorithm Allow the type2 CLBs to be placed on physical locations meant for type1 CLBs and vice versa End Placement for each CLB do if Logic-Block-not-on-same-type-CLB then Convert the logic block to type1 or type2 according to its placement end if end for. Since the VPR router is timing driven, it optimizes the overall routing. The delays of the high-Vt switches would be more than the delays of the low-Vt switches, which would force the router to use more of low-Vt switches. Therefore, we provide the increased delays of the high-Vt routing switches to the VPR, so that the router can optimize the overall routing. F. Stage 6 This stage computes the results obtained from the previous stages of the CAD ow. After the computation of delays and power for the dual-Vt FPGA implementation is over, this stage computes the leakage-power savings obtained and the delay penalty for the benchmark. VI. E VALUATION , R ESULTS , AND D ISCUSSIONS This section describes the realization and evaluation of different FPGA architectures and their comparison. The method of developing and evaluating these architectures is by using a number of benchmarks on the proposed dual-Vt FPGA CAD ow to nd out which subblocks can be high-Vt for maximizing leakage savings and reducing the delay penalties. It would be worthwhile to point out the difference between the physical design of the FPGA and the mapping of an application on the FPGA. The physical resources of an FPGA are xed, and the applications are mapped later on. The dualVt FPGA CAD ow that we outlined above is meant for the actual physical design of the FPGA and, later on after the physical design of the FPGA is complete, for the design of associated CAD tools. This dual-Vt FPGA CAD ow is to be used (along with other CAD tools) during the design of the

KUMAR AND ANIS: DUAL-THRESHOLD CAD FRAMEWORK FOR SUBTHRESHOLD LEAKAGE POWER AWARE FPGAs

61

TABLE I LEAKAGE SAVINGS WITH LOGIC BLOCKS ASSIGNED AS DUAL-VT FOR A CLUSTER SIZE OF 12 FOR HOMOGENOUS AND HETEROGENOUS ARCHITECTURES

FPGA, where a number of benchmarks are evaluated before a design for an FPGA is nalized. This CAD ow is not meant for mapping of an application on to an FPGA chip, as all the logic blocks and routing resources would then be xed, which would be a routine job using the standard FPGA CAD tools. Therefore, the dual-Vt FPGA CAD ow is meant for developing a dual-Vt FPGA by determining its architectural parameters. A. Estimating Power Savings We use the power model proposed in [11] for estimating the leakage-power savings. The total power consumption can be divided into two parts, active power and standby power. The total average power can be written as Pavg = tact Pact + to Po tact + to (6) (7)

the standby-leakage-power consumption of the FPGA, because during the standby mode, only the leakage power is dissipated. Hence, reducing leakage power during the standby mode for mobile applications would increase the battery life signicantly. The dual-Vt technique reduces both the standby and active leakages, and hence, it is an attractive method to reduce subthreshold leakage. B. Evaluation Methodology For obtaining the results from the benchmarks, we used the LUT size of four and a cluster size of 12. It was shown in [8] that a LUT size of four and a cluster size of eight or 12 leads to minimization of total power. The inputs per cluster was chosen as K/2(N + 1), where N is the number of subblocks per cluster and K is the LUT size [26]. We used the default routing architecture present in the FPGA architecture le of VPR having 50% routing segments with pass-transistor switches and 50% routing segments with buffered pass-transistor switches. For computing the leakage savings and performance tradeoffs, a xed-size FPGA array of 20 20 was used for all the benchmarks, except bigkey, clma, des, dsip, and s38584.1, for which a 35 35 array was used. A routing channel width of 100 was used for the FPGA architecture. Table I shows the results of the dual-Vt assignment. Based on these results, homogenous architectures with four, six, and eight high-Vt subblocks per CLB were considered. The results shown for the number of type1 CLBs and low-Vt subblocks in type2 CLBs in Table I correspond to a minimum-sized FPGA (required to t a benchmark), and these values are ideal values which does not consider the regularity of the structure of the

T = tact + to

where Pavg , Pact , and Po are the average, active, and standby (off) power. For personal wireless communication systems, typically, the standby time or off time (to ) is 90% of the total time (T ), and active time (tact ) is 10% of the total time. During the active time, the components of power dissipation can be written as Pact = [Pdyn + Psckt + Pactleak ]used + [Pactleak ]unused (8) where Pdyn , Psckt , and Pactleak are the dynamic, short circuit, and active leakage-power consumptions, respectively. Po is

62

IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 26, NO. 1, JANUARY 2007

FPGAs. This kind of dual-Vt assignment leads to an FPGA which would have no delay penalty. This is same as the dual-Vt assignment for ASICs, in which there is no restriction with regard to the logic elements that can be assigned as high-Vt. This kind of dual-Vt assignment leads to nonidentical logic blocks (in terms of high- and low-Vt subblocks in each logic block) in the FPGA, and the reclustering stage estimates the parameters of a uniform dual-Vt FPGA (in which all the logic blocks are identical with respect to number of the high- and low-Vt subblocks), based on the results of the dual-Vt assignment. Hence, to provide sufcient margin and prevent severe degradation of nal placement and routing imposed because of the limitation that the FPGA needs to be regular and uniform in structure, we set the values for type1 CLBs and number of low-Vt subblocks in type2 CLBs as 50% and 10, respectively. C. Realizing and Evaluating Different FPGA Architectures for Leakage Savings and Design Tradeoffs 1) arch1: This is a homogenous architecture in which all the logic blocks are identical, such that each logic block has a xed ratio of the high- and low-Vt subblocks. Based upon these results for dual-Vt assignment, we evaluated three different homogenous architectures with CLBs having eight, six, and four high-Vt subblocks in each CLB. After assigning high-Vt to, say, eight subblocks in each CLB, we placed and routed different benchmarks to evaluate the leakage savings and performance tradeoffs. Table I shows the overall average leakage saving results for these three different cases. The mean performance penalties were 3.7%, 4.8%, and 5.04% for the architectures with four, six, and eight high-Vt subblocks in each CLB, respectively. The delay penalties are close to each other because of two reasons: 1) There are sufcient number of low-Vt subblocks for the critical path(s) and 2) the routing contributes to a larger portion of total delay, which means that, even if there are few high-Vt subblocks on the critical path, this would not cause a large delay penalty. 2) arch2: This is the homogenous architecture with routing resources considered for dual-Vt assignment. The evaluation methodology is same as that of the arch1 case, except that we vary the fraction of the two kinds of routing resources assigned as high-Vt and estimate the leakage savings and the delay tradeoffs. The evaluation was done for the three cases, viz., FPGA architecture with four, six, and eight high-Vt subblocks per CLB for a cluster size of 12. Fig. 10 shows the leakage savings and the performance tradeoffs when the fraction of the high-Vt switches are varied over the base architecture having six high-Vt subblocks per CLB. Fig. 10(a) shows the leakage savings and delay penalty when only the buffered switches are considered for high-Vt assignment along with the CLB output pin switches which can drive these segments (i.e., those segments which can be driven by these high-Vt buffered switches), all the pass-transistor switches remaining as low-Vt. Fig. 10(b) shows the leakage savings and delay penalty when pass-transistor switches are considered for high-Vt assignment along with the CLB output pin switches which can drive these segments (i.e., those segments which can be driven by these high-Vt pass-transistor

Fig. 10. Leakage savings for arch2 with six high-Vt subblocks per CLB. (a) Buffered switches assigned as high-Vt. (b) Pass-transistor switches assigned as high-Vt.

switches), all the buffered switches remaining as low-Vt. The leakage savings increases considerably when the routing resources are assigned as high-Vt with some impact on the performance. This is because a major part of leakage comes from the routing resources and is discussed further in Section VI-E. Leakage savings for both the cases, when the pass-transistor switches are assigned as high-Vt and buffered switches are assigned as high-Vt, are almost the same. Although the buffered switches have more number of devices but the pass-transistor switch is eight times the minimum size, the pass transistor in the buffered switch is two times the minimum size and the buffer is two times the minimum sized buffer, which leads to almost similar leakage savings. The delay penalties for these architectures vary just between 10.7% and 3.32%. This can be attributed to the fact that a large percentage of routing switches are unutilized [4], and therefore, they do not contribute to delay penalties even if they are assigned as high-Vt. Further, the routing optimization effort of the VPR would also increase, tending to use more of the low-Vt switches, which compensates, to some extent, the increased delay of the high-Vt switches. 3) arch3: This is a heterogenous architecture in which there are two kinds of CLBs in the FPGA. Table I shows the leakage savings for this architecture. Since the FPGA architecture needs to be regular in structure, the two kinds of CLBs are distributed uniformly throughout the array. Based on the clustering results, type1 CLBs were xed to 50% of the total CLBs, and type2

KUMAR AND ANIS: DUAL-THRESHOLD CAD FRAMEWORK FOR SUBTHRESHOLD LEAKAGE POWER AWARE FPGAs

63

and 4.8% for different fractions of high-Vt switches. The increase in delay penalty is not too large when routing switches are assigned high-Vt because a large fraction of routing resources are unutilized. We observe that the leakage savings for this architecture is about 3% greater than of the arch2 (six highVt subblocks per CLB), with almost the same delay penalty. D. Design Tradeoffs Before deciding upon any dual-Vt FPGA architecture, we need to consider the design tradeoffs. In this paper, we evaluated two types of architectureshomogenous and heterogenous. The design tradeoffs are the delay penalties and the impact on overall power. We evaluated these for all the FPGA architectures. It was observed that the dynamic power dissipation remains almost constant for all the architectures. A dual-Vt custom VLSI design does not lead to any delay penalty. However, because of the very nature of the programmability of the FPGAs, there would be some delay penalty associated with a circuit implemented on a dual-Vt FPGA as compared to a single low-Vt FPGA. The delay penalties would vary with the benchmark. Essentially, the delay penalty occurs because the critical path of a circuit in the dual-Vt FPGA changes from that in the single low-Vt FPGA. There are three sources of delay penalties in a dual-Vt FPGA: 1) the increased delays of the subblocks, some of which might lie on the critical path; 2) the altered placement and routing in the dual-Vt FPGA because of the different delays through the subblocks which impacts the placement and, subsequently, routing; and 3) increased delays of high-Vt routing switches, some of which might lie on the critical path. Table II shows the leakage savings and the delay penalties for arch2, arch4, and all high-Vt subblocks architectures. The delay penalties for arch2 and arch4 are comparable. These architectures have a very high percentage of high-Vt switches and would be most susceptible to performance degradation. However, the mean delay penalty for arch2 with 80% high-Vt pass-transistor switches is 6.4%, with the maximum delay penalty being 15.3% for a mean leakage savings of 40.3%. For arch4 with 80% high-Vt switches, the mean delay penalty is 5.8%, with the maximum delay penalty being 13.6% for a mean leakage savings of 43.5%. We have also shown the results of all high-Vt architecture, where all the subblocks in all the CLBs are high-Vt. The mean leakage savings is 32.6%, and the mean delay penalty is 19%, with the maximum delay penalty as high as 48.5%. We see that keeping all high-Vt subblocks leads to severe performance degradation in many of the benchmarks. This is because, in the architectures which contain low-Vt subblocks, the placer and the router can search for low-Vt subblocks to minimize the delay, which is not possible in the case of architecture with all high-Vt subblocks. The performance degradation in the all high-Vt subblocks is more pronounced for benchmarks which have a large number of subblocks on the critical path. The benchmark diffeq, which has a severe delay penalty of 48.5%, has 30 subblocks on the critical path for the implementation on the all high-Vt subblocks architecture. The benchmark bigkey has only six subblocks on the critical path, and the routing of the VPR compensates for the increased delay, which leads to

Fig. 11. Leakage savings for arch4. (a) Buffered switches assigned as high-Vt. (b) Pass-transistor switches assigned as high-Vt.

CLBs had ten low-Vt subblocks for this architecture. The number of low-Vt subblocks in type2 CLBs was xed by averaging this value for all the benchmarks. The ratio of type1 and type2 CLBs was xed by averaging and providing sufcient margin to prevent performance degradation for all the benchmarks. The results shown in this paper were obtained after xing these values for all the benchmarks. Table I shows that the overall mean leakage savings for this architecture is 19.3%. The mean delay penalty for this architecture is 6.9%. 4) arch4: This architecture is realized using the architecture arch3. The base architecture is the same as that of arch3, and over that, architecture routing resources are also assigned as high-Vt to realize the architecture arch4. The evaluation methodology is the same as that for arch3, except that the fraction of high-Vt routing resources is varied and the leakage savings and design tradeoffs are calculated. Fig. 11(a) shows the leakage savings and delay penalty when only the buffered switches are considered for high-Vt assignment along with the CLB output pin switches which can drive these segments (i.e., those segments which can be driven by these high-Vt buffered switches), all the pass-transistor switches remaining as low-Vt. Fig. 11(b) shows the leakage savings and delay penalty when pass-transistor switches are considered for high-Vt assignment along with the CLB output pin switches which can drive these segments (i.e., those segments which can be driven by these high-Vt pass-transistor switches), all the buffered switches remaining as low-Vt. The delay penalties vary between 12.4%

64

IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 26, NO. 1, JANUARY 2007

TABLE II DESIGN TRADEOFFS FOR HOMOGENOUS ARCHITECTURE arch2 WITH SIX HIGH-VT SUBBLOCKS AND 80% HIGH-VT PASS-TRANSISTOR SWITCHES, HETEROGENOUS ARCHITECTURE arch4 WITH 80% HIGH-VT PASS-TRANSISTOR SWITCHES, AND ALL HIGH-VT SUBBLOCKS ARCHITECTURE

Fig. 12. (a) Leakage contributions of routing resources, logic resources, and SRAM cells for single low-Vt implementation and (b) and (c) after dual-Vt implementation for alu4.

almost no delay penalty. Hence, having an architecture with all high-Vt subblocks would not be a good design. There is almost no area penalty as the dual-Vt technique inherently does not lead to any area penalty. In the experimental results for all the benchmarks, we kept the routing channel width and the number of CLBs the same for both the baseline single-Vt FPGA architecture and the dual-Vt FPGA architectures. This shows that there is no area penalty associated with the proposed design of the dual-Vt FPGA architectures.

E. Distribution of Leakage Savings In all these evaluations, we have assumed high-Vt SRAM cells for both the single low-Vt and dual-Vt implementations. The leakage savings is solely from the logic and routing parts of the FPGA. Fig. 12 shows the distribution of leakage among the logic blocks and the routing resources. Fig. 12(a) shows the leakage distribution in the single low-Vt implementation for the benchmark alu4. It can be seen that almost two thirds of leakage is from the routing resources. This is because of a large amount of unused routing resources [4]. SRAM leakage is negligible as they have high-Vt transistors to reduce the leakage. Assigning dual-Vt to the logic blocks reduces the leakage component of the logic blocks, as shown in Fig. 12(b). Fig. 12(c) shows the
Fig. 13. Leakage power for alu4. (a) Single low-Vt implementation. (b) Dual-Vt arch3. (c) Dual-Vt arch4 with 60% high-Vt pass-transistor switches.

impact on leakage distribution as a result of dual-Vt assignment to both the logic blocks and the routing resources. The contribution of routing leakage to total leakage increases by a small amount, even though it results in overall leakage savings. This can be explained by the leakage-power results for a benchmark alu4, as shown in Fig. 13. For architecture arch3, the logic leakage reduces by more than 50%, with the routing leakage remaining the same. When 60% of pass-transistors switches

KUMAR AND ANIS: DUAL-THRESHOLD CAD FRAMEWORK FOR SUBTHRESHOLD LEAKAGE POWER AWARE FPGAs

65

Fig. 14. Realizing a dual-Vt FPGA design.

are also assigned as high-Vt, the routing leakage decreases by 28%. This results in an overall leakage savings of 37%, with 19% savings from the logic blocks and 18% savings from the routing resources. This explanation can be extended to all the benchmarks, and the other benchmarks have similar leakage distributions. The leakage-power model in [11] models both the gate and the subthreshold leakage. For CMOS 130 nm, the gate leakage is six orders of magnitude smaller than the subthreshold leakage. Therefore, the gate leakage does not impact our methodology. VII. D ESIGNING A D UAL -V T FPGA Designing a dual-Vt FPGA would require selecting a dual-Vt FPGA architecture and then determining the parameters for the dual-Vt FPGA. The dual-Vt FPGA CAD framework proposed in this paper is intended for this purpose. Fig. 14 shows the steps involved in the design of a dual-Vt FPGA. The previous sections explored the various dual-Vt FPGA architectures for leakage-power savings and performance tradeoffs. It can be seen from the results of the previous section that architectures with high-Vt switches lead to signicantly increased leakage savings with small increase in delay penalties. Further, some of the architectures are more suitable for some benchmarks, whereas other benchmarks are more suitable for another architecture. This is because the number of interconnections between the logic blocks is different for different benchmarks. This implies that some benchmarks having a large number of high-Vt switches would not have a large impact on delay because of a shorter total wire length in the critical path, whereas some benchmarks having a high percentage of high-Vt subblocks per CLB would not have a big impact on performance because there are very few subblocks in the critical path. Homogenous architectures would, in general, be easier in implementation as compared to heterogenous architectures. However, a careful design of heterogenous architectures and the associated CAD tools can lead to increased leakage savings for same delay penalties as that for homogenous architectures. VIII. C ONCLUSION The dual-Vt FPGA architectures explored above indicate that an average leakage-power savings of up to 50% can be obtained by the dual-Vt FPGA architectures. We proposed a dual-Vt FPGA CAD ow for implementation and evaluation of dual-Vt FPGA architectures. The results show that a high

percentage of subblocks are assigned as high-Vt for the ideal dual-Vt assignment. This indicates that a large amount of slack is available with the logic blocks, which can be exploited to reduce the leakage power. We explored two primary architectures, homogenous and heterogenous. Results indicate that there is a large percentage of unused routing and logic resources, such that dual-Vt FPGA architectures can lead to good leakage savings without large delay penalties. The main bottleneck in using any leakage-reduction technique for FPGAs is the very nature of the programmability of the FPGA, leading to an architecture not very suitable for leakage-reduction techniques. For our future paper, we intend to develop novel leakageaware FPGA architectures which would be especially adapted to implement leakage-reduction technique without incurring large delay penalties. R EFERENCES
[1] L. Wei, Z. Chen, K. Roy, M. C. Johnson, Y. Ye, and V. K. De, Design optimization of dual-threshold circuits for low-voltage low-power applications, IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 7, no. 1, pp. 1624, Mar. 1999. [2] J. Kao, S. Narendra, and A. Chandrakasan, Subthreshold leakage modeling and reduction techniques, in Proc. IEEE Int. Conf. Comput.-Aided Des., 2002, pp. 141148. [3] J. H. Anderson, F. N. Najm, and T. Tuan, Active leakage power optimization for FPGAs, in Proc. FPGA, 2004, pp. 3341. [4] T. Tuan and B. Lai, Leakage power analysis of a 90 nm FPGA, in Proc. IEEE Custom Integr. Circuits Conf., 2003, pp. 5760. [5] S. Sirichotiyakul, T. Edwards, C. Oh, R. Panda, and D. Blaauw, Duet: An accurate leakage estimation and optimization tool for dual-Vt circuits, IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 10, no. 2, pp. 7990, Apr. 2002. [6] A. Gayasen, Y. Tsai, N. Vijaykrishnan, M. Kandemir, M. J. Irwin, and T. Tuan, Reducing leakage energy in FPGAs using region-constrained placement, in Proc. FPGA, 2004, pp. 5158. [7] A. Rahman and V. Polavarapuv, Evaluation of low-leakage design techniques for eld programmable gate arrays, in Proc. FPGA, 2004, pp. 2330. [8] F. Li, D. Chen, L. He, and J. Cong, Architecture evaluation for power efcient FPGAs, in Proc. FPGA, 2003, pp. 175184. [9] C. Hu, A. Niknejad, X. Xi, W. Liu, X. Jin, K. M. Cao, M. Dunga, and J. Ou. (2005, Jul. 29). BSIM4 MOS Models, Release 4.5.0. [Online]. Available. http://www.device.eecs.berkeley.edu/~bsim3/bsim4.html [10] V. Betz, J. Rose, and A. Marquardt, Architecture and CAD for DeepSubmicron FPGAs. Norwell, MA: Kluwer, 1999. [11] A. Kumar and M. Anis, An analytical state dependent leakage power model for FPGAs, in Proc. DATE, 2006, pp. 612617. [12] D. Lee, D. Blaauw, and D. Sylvester, Gate oxide leakage current analysis and reduction for VLSI circuits, IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 12, no. 2, pp. 155166, Feb. 2004. [13] K. Poon, A. Yan, and S. J. E. Wilton, A exible power model for FPGAs, in Proc. Int. Conf. Field Programmable Logic and Appl., 2002, pp. 312321. [14] S. Borkar, Design challenges of technology scaling, IEEE Micro, vol. 19, no. 4, pp. 2329, Jul./Aug. 1999. [15] K. Roy, S. Mukhopadhyay, and H. Mahmoodi-Meimand, Leakage current mechanisms and leakage reduction techniques in deepsubmicrometer CMOS circuits, Proc. IEEE, vol. 91, no. 2, pp. 305327, Feb. 2003. [16] M. Anis, M. Mahmoud, M. Elmasry, and S. Areibi, Dynamic and leakage power reduction in MTCMOS circuits using an automated efcient gate clustering technique, in Proc. DAC, 2002, pp. 480485. [17] J. Kao, A. Chandrakasan, and D. Antoniadis, Transistor sizing issues and tools for multi-threshold CMOS technology, in Proc. DAC, 1997, pp. 409414. [18] S. Mutah, T. Douseki, Y. Matsuya, T. Aoki, S. Shigematsu, and J. Yamada, 1-V power supply high speed digital circuit technology with multi-threshold voltage CMOS, IEEE J. Solid-State Circuits, vol. 30, no. 8, pp. 847853, Aug. 1995. [19] N. Kato, Y. Akita, M. Hiraki, T. Tamashita, T. Shimizu, F. Maki, and K. Yano, Random modulation: Multi-threshold-voltage design

66

IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 26, NO. 1, JANUARY 2007

[20] [21] [22] [23]

[24] [25] [26] [27]

methodology in sub-2-V power supply CMOS, IEICE Trans. Electron., vol. E83-C, no. 11, pp. 17471754, Nov. 2000. L. Wei, K. Roy, and C. Koh, Power minimization by simultaneous dualVth assignment and gate-sizing, in Proc. IEEE Custom Integr. Circuits Conf., 2000, pp. 413416. Y.-F. Tsai, D. Duarte, N. Vijaykrishnan, and M. J. Irwin, Implications of technology scaling on leakage reduction techniques, in Proc. DAC, 2003, pp. 187190. E. M. Sentovich et al., SIS: A System for Sequential Circuit Analysis. Berkeley, CA: Univ. California, 1992. J. Cong and Y. Ding, FlowMap: An optimal technology mapping algorithm for delay optimization in lookup-table based FPGA designs, IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 13, no. 1, pp. 112, Jan. 1994. F. Li, Y. Lin, L. He, and J. Cong, Low-power FPGA using predened dual-Vdd/dual-Vt fabrics, in Proc. FPGA, 2004, pp. 4250. G. Lemieux and D. Lewis, Design of Interconnection Networks for Programmable Logic. Norwell, MA: Kluwer, 2004. E. Ahmed and J. Rose, The effect of LUT and cluster size on deep submicron FPGA performance and density, in Proc. FPGA, 2000, pp. 312. M. Hirabayashi, K. Nose, and T. Sakurai, Design methodology and optimization strategy for dual-Vth scheme using commercially available tools, in Proc. Int. Symp. Low Power Electron. and Design, Aug. 2001, pp. 283286.

Mohab Anis (M03) was born in Montreal, Canada, on February 19, 1974. He received the B.Sc. degree (honors) in electronics and communication engineering from Cairo University, Cairo, Egypt, in 1997, and the M.A.Sc. and Ph.D. degrees in electrical engineering from the University of Waterloo, Waterloo, ON, Canada, in 1999 and 2003, respectively. Currently, he is an Assistant Professor and CoDirector of the VLSI Research Group with the University of Waterloo. He has authored/coauthored over 40 papers in international journals and conferences and is the author of the book Multi-Threshold CMOS Digital CircuitsManaging Leakage Power (Kluwer, 2003). His research interests include integrated circuit design and design automation for VLSI systems in the deep-submicrometer regime. Dr. Anis is a member of the program committee for several IEEE conferences and is a regular reviewer for a number of IEEE journals and conferences. He was awarded the 2005 New Opportunity Fund Award from the Canada Foundation for Innovation, the 2004 Douglas R. Colton Medal for Research Excellence in recognition of excellence in research leading to new understanding and novel developments in Microsystems in Canada, and the 2002 IEEE International Low-Power Design Contest Prize.

Akhilesh Kumar (S06) received the B.Tech. (honors) degree in electrical engineering from Indian Institute of Technology, Roorkee, India, in 2001, and the M.A.Sc. degree in electrical and computer engineering from University of Waterloo, Waterloo, ON, Canada, in 2006. He is currently working toward the Ph.D. degree at the Department of Electrical and Computer Engineering, University of Waterloo. He worked with the ST Microelectronics, India, from November 2002 to December 2003 and with the Hughes Software Systems, India, from June 2001 to November 2002. His research interests include the areas of design automation, digital circuit design, and architecture design for deep-submicrometer technologies.

You might also like