In current microprocessors, the number of wires used for inter-module communication has skyrocketed. Furthermore, the increased complexity and high level of integration require higher wire densities, and coupling capacitance has already dominated total wire capacitance for several technology generations. A high coupling capacitance ratio is unfavorable in conventional buses because adjacent wires can switch in opposite directions, yielding a worst-case Miller capacitance factor (MCF) of 2. For example, when MCF = 2, coupling capacitance accounts for over 80% of the total interconnect capacitance for a minimum-pitch intermediate metal layer in 65-nm technology[5]. It is possible to reduce coupling capacitance by increasing spacing or introducing shielding, but this comes at the cost of significant area penalties[1]. Hence, a key challenge in interconnect design is to reduce the worst-case MCF while maintaining the same physical footprint of the interconnect, thereby reducing the
In this project, a new encoding technique is presented that achieves a worst-case MCF of 1 while preserving the best-case MCF of 0. This is done by controlling the timing of rising and falling transitions, namely by always performing rising transitions on the negative edge of the clock and falling transitions on the positive edge of the clock (or vice versa). Since the worst-case switching events are separated by as much as one phase (half a clock cycle), this technique remains robust against process variation. Hence, both the average and worst-case energy can be reduced without increasing the sensitivity to process variation. Average energy savings aid battery life and typical energy costs, while worst-case energy is also a meaningful metric for thermal management and for peak demand on power grids and decoupling capacitance[13]. These savings come at the expense of minimal encoder logic with half-cycle latency and additional clocking. However, we find that the logic and clocking overhead is small in long interconnects where interconnect power consumption is dominant, and we also show that the potential latency overhead can be eliminated or minimized in multi-cycle interconnects.
1.1 SELF-TRANSITIONS AND COUPLING TRANSITIONS
Self-transitions are defined as transitions on the capacitance between a bus line and the substrate (ground), while coupling transitions are defined as transitions on the capacitance between adjacent lines. Figure 1.1 shows a simplified bus model with coupling (ignoring all the resistances). Cs is the self-capacitance from each bus line to ground; Cc is the coupling capacitance between two adjacent lines. The average power consumption on the bus is given by:
$$P_{avg} = (s \cdot C_s + c \cdot C_c) \cdot V_{dd}^2 \cdot f \qquad (1.1)$$
where Vdd is the supply voltage, f is the bus clock frequency, s is the average number of self-transitions per bus cycle and c is the average number of coupling transitions per bus cycle. Only the charging transitions that require current flow from the power supply are counted. For example, only the 0→1 transitions are counted as self-transitions for power consumption.
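As a rough illustration of the quantities above, the following sketch counts charging self-transitions (0→1) between two consecutive bus words and evaluates the bus power under the commonly used form P = (s·Cs + c·Cc)·Vdd²·f. That form, the function names and the numeric values are assumptions made for illustration; they are not taken from the report.

```python
def charging_self_transitions(prev_word, next_word):
    """Count 0 -> 1 transitions between two bus words given as bit strings."""
    return sum(1 for p, n in zip(prev_word, next_word) if p == "0" and n == "1")


def average_bus_power(s, c, cs, cc, vdd, freq):
    """Bus power under the assumed form P = (s*Cs + c*Cc) * Vdd^2 * f.

    s, c : average self- and coupling-transitions per bus cycle
    cs, cc : self and coupling capacitance in farads
    vdd : supply voltage in volts, freq : bus clock frequency in hertz
    """
    return (s * cs + c * cc) * vdd ** 2 * freq


# Hypothetical example: one charging self-transition, no coupling transitions.
n_self = charging_self_transitions("0101", "0011")
power = average_bus_power(s=n_self, c=0, cs=1e-12, cc=2e-12, vdd=1.0, freq=1e9)
```

Only the third bit goes from 0 to 1 in this example, so a single self-transition contributes to the supply power.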
Figure 1.1
1.2 MILLER-COUPLING CAPACITANCE
A simple way of performing timing analysis in the presence of crosstalk noise is to replace the coupling capacitance by an equivalent Miller capacitance to account for the slowdown or the speedup of the victim transition. Hence, we are effectively modeling the delay noise on the victim by simply scaling the coupling capacitance with the appropriate Miller-coupling factor (MCF). This simple approximation decouples the aggressor-victim interconnects (as shown in Figure 1.2) and allows us to compute the signal arrival times using the existing STA framework.
Figure 1.2
The exact value of the Miller-coupling factor for the victim (MCFvictim) depends on the mutual switching directions and the ratio of the slopes of the aggressor and victim waveforms:
$$MCF_{victim} = 1 \pm \frac{a}{v} \qquad (1.2)$$
where a and v are the slopes of the aggressor and victim waveforms, respectively. If the aggressor-victim nets switch in mutually opposite directions (as shown in Figure 1.2), then MCFvictim is obtained by adding 1 to the ratio of the slopes (a/v). Conversely, if the aggressor-victim nets switch in the same direction, then MCFvictim is obtained by subtracting (a/v) from one. Suppose the aggressor-victim nets switch in opposite directions with equal slopes (i.e., a = v). Using Equation 1.2, we obtain the commonly used MCFvictim value of two. In such cases, the Miller capacitance used to decouple the aggressor-victim interconnects is twice the size of the original coupling capacitance. Similarly, if the aggressor-victim nets switch in the same direction with equal slopes, then we obtain an MCFvictim value of zero.
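The rule described above can be written as a small function. This is a minimal sketch; the function and parameter names are hypothetical, and slopes are taken as positive magnitudes with the switching direction passed separately.

```python
def miller_coupling_factor(aggressor_slope, victim_slope, same_direction):
    """MCF for the victim: 1 - a/v for same-direction, 1 + a/v for opposite."""
    ratio = aggressor_slope / victim_slope
    return 1.0 - ratio if same_direction else 1.0 + ratio


def miller_capacitance(coupling_cap, mcf):
    """Equivalent grounded capacitance that replaces the coupling capacitance."""
    return mcf * coupling_cap


# Equal slopes: opposite switching gives 2, same-direction switching gives 0.
worst = miller_coupling_factor(1.0, 1.0, same_direction=False)
best = miller_coupling_factor(1.0, 1.0, same_direction=True)
# A quiet aggressor (slope 0) leaves the coupling capacitance unscaled.
quiet = miller_coupling_factor(0.0, 1.0, same_direction=True)
```

With equal slopes the function reproduces the three familiar values 2, 0 and 1 for opposite, same-direction and quiet neighbors.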
1.3 FLIP-FLOP
In electronics, a flip-flop or latch is a circuit that has two stable states and can be used to store state information. The circuit can be made to change state by signals applied to one or more control inputs and will have one or two outputs. It is the basic storage element in sequential logic. Flip-flops and latches are fundamental building blocks of digital electronics systems used in computers, communications, and many other types of systems.
Flip-flops and latches are used as data storage elements. Such data storage can be used for storage of state, and such a circuit is described as sequential logic. When used in a finite-state machine, the output and next state depend not only on its current input, but also on its current state (and hence, previous inputs). It can also be used for counting of pulses, and for synchronizing variably-timed input signals to some reference timing signal.
Flip-flops can be either simple (transparent or opaque) or clocked (synchronous or edge-triggered); the simple ones are commonly called latches. The word latch is mainly used for storage elements, while clocked devices are described as flip-flops.
1.3.1 Dual Edge Flip-Flops
The edge encoding technique requires dual-edge flip-flops, and hence the number of flip-flops placed in multi-cycle interconnects inevitably increases compared to conventional multi-cycle interconnects with single-edge flip-flops.
Along the critical path of multi-cycle interconnects, both the total setup time and the CLK-Q delay of the flip-flops increase as well. Hold time is not a concern even when using dual-edge flip-flops because the interconnect paths are well-defined with several repeaters and a large wire load, and thus short paths between flip-flops do not exist. Time-borrowing flip-flops have zero or negative setup time, providing a performance benefit. For the proposed edge encoding techniques, time-borrowing flip-flops are considered for the inserted flip-flops to mitigate the increase in total setup time and variation. Several types of flip-flops from [18] were considered for the time-borrowing dual-edge flip-flops, to absorb the setup time and the mismatch between interconnect paths. We selected the pulse-triggered dual-edge flip-flop in Figure 1.3 for the proposed edge encoding because it is more area-efficient and its D-Q path is shorter.
Figure 1.3
1.4 TIME BORROWING WITH LATCHES
The unique property which enables above advent transparent for the duration of an active clock pulse normal edge-to-edge timing requirements of sync long enough and is determining the maximum freq a shorter path in subsequent latch-to-latch stages
Figure 1.4
The figure 1.4 has 2 timing paths: Path 1 from the p a negative-level latch (2), while Path 2 is from the triggered register (3). Let us examine this simple compensate for the delay through the logic cloud A in Path 1, we can have two scenarios of timing (Figures 1.6 and 1.7.)
Figure 1.5
In Case A, data arrives from logic A at Latch 2 be this case, the behavior of the latch is similar to the need to borrow any time to achieve our timing go.
Figure 1.6
In Case B, the negative clock edge enables the la at the input of the latch. So the latch will go to tra from Logic A through to Register B for a while. Bu from Logic A reaches Logic B and passes through Register 2. So if the propagation delay of Logic B, some of the time reserved for Logic B, and the cir this extra time in order to complete its propagation analysis will consider the end of the borrowed time delay.
Figure 1.7
While doing STA, Timing reports will be generate the timing when the latch is enabled is the same element (Figure 1.8)
Figure 1.8
1.5 REPEATER INSERTION
Repeater insertion is a technique for reducing the time delay associated with long wire lines in integrated circuits. The technique involves cutting the long wire into two or more short wires and inserting a repeater between each new pair of short wires. The time it takes for a signal to travel from one end of a wire to the other end is known as wire-line delay or just delay. In an integrated circuit, this delay is characterized by RC, the resistance of the wire (R) multiplied by the wire's capacitance (C). Thus, if the wire's resistance is 100 ohms and its capacitance is 0.01 microfarad (μF), the wire's delay is one microsecond (μs). The resistance of a wire on an integrated circuit is directly proportional to the wire's length. If a 1 mm length of the wire has 100 ohms resistance, then a 2 mm length will have 200 ohms resistance. The capacitance of a wire also increases linearly along its length. If a 1 mm length of the wire has 0.01 μF capacitance, a 2 mm length of the wire will have 0.02 μF, a 3 mm wire will have 0.03 μF, and so on as shown in Table 1.1. Thus, the time delay through a wire increases with the square of the wire's length. This is true, to first order, for any wire whose cross-section remains constant along the length of the wire.

Table 1.1
Wire length    1 mm       2 mm       3 mm
Resistance     100 Ω      200 Ω      300 Ω
Capacitance    0.01 μF    0.02 μF    0.03 μF
Time delay     1 μs       4 μs       9 μs

The interesting consequence of this behavior is that, while a single 2 mm length of wire has a delay of 4 μs, two separate 1 mm wires only have a delay of 1 μs each. The two separate wires cover the same distance in half the time. By cutting the wire in half, we can double its speed. An active circuit must be placed between the two separate wires so as to move the signal from one to the next. An active circuit used for such a purpose is known as a repeater. In a CMOS integrated circuit, the repeater is often a simple inverter. Reducing the delay of a wire by cutting it in half and inserting a repeater is known as repeater insertion. The cost of this procedure is the additional delay through the repeater itself, plus a power cost because the repeater is an active circuit that must be powered, whereas the plain unrepeated wire was originally an unpowered passive component.
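The quadratic growth of delay and the benefit of splitting the wire can be checked numerically. The sketch below uses the per-millimetre values from the text (100 ohms and 0.01 microfarad per mm); the function names and the optional repeater delay parameter are illustrative assumptions.

```python
R_PER_MM = 100.0      # ohms per mm, value used in the text
C_PER_MM = 0.01e-6    # farads per mm (0.01 microfarad), value used in the text


def wire_delay(length_mm):
    """First-order RC delay of an unrepeated wire, in seconds."""
    return (R_PER_MM * length_mm) * (C_PER_MM * length_mm)


def repeated_wire_delay(length_mm, segments, repeater_delay=0.0):
    """Delay of the same wire cut into equal segments joined by repeaters."""
    segment_delay = wire_delay(length_mm / segments)
    return segments * segment_delay + (segments - 1) * repeater_delay


single = wire_delay(2.0)               # about 4 microseconds
split = repeated_wire_delay(2.0, 2)    # about 2 microseconds with ideal repeaters
```

With a non-zero `repeater_delay` the gain shrinks, which reflects the delay cost of the repeater noted above.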
1.6 BASIC LOW POWER DIGITAL DESIGN
Moore's law, which states that the number of transistors that can be placed inexpensively on an integrated circuit will double approximately every two years, has often been subject to the following criticism: while it boldly states the blessing of technology scaling, it fails to expose its bane. A direct consequence of Moore's law is that the power density of the integrated circuit increases exponentially with every technology generation. This implicit trend has arguably brought about some of the most important changes in electronic and computer designs. In the 1970s, the most popular electronics manufacturing technologies used bipolar and nMOS transistors. However, bipolar and nMOS transistors consume energy even in a stable combinatorial state, and consequently, by the 1980s, the power density of bipolar designs was considered too high to be sustainable.
1.6.1 CMOS Transistor Power Consumption
The power consumption of a CMOS transistor can be divided into three different components: dynamic, static (or leakage) and short circuit power consumption.
Figure 1.9 illustrates the three components of power consumption. Dynamic and short circuit power are also collectively known as switching power, and are consumed when transistors change their logic state, but leakage power is consumed merely because the circuit is powered-on.
Switching power, which includes both dynamic power and short-circuit power, is consumed when signals through CMOS circuits change their logic state, resulting in the charging and discharging of load capacitors. Leakage power is primarily due to the sub-threshold currents and reverse biased diodes in a CMOS transistor.
$$P_{total} = P_{dynamic} + P_{short\text{-}circuit} + P_{leakage} \qquad (1.3)$$
Figure 1.9
1.6.2 Switching Power
When signals change their logic state in a CMOS transistor, energy is drawn from the power supply to charge up the load capacitance from 0 to Vdd. For the inverter example in Figure 1.9, the power drawn from the power supply is dissipated as heat in the pMOS transistor during the charging process. Energy is needed whenever charge is moved against some potential. Thus, dE = d(QV). When the output of the inverter transitions from logical 0 to 1, the load capacitance is charged. The energy drawn from the power supply during the charging process is given by
$$dE_P = d(VQ) = V_{dd}\, dQ_L$$
since the power supply provides power at a constant voltage Vdd. Now, since $Q_L = C_L V_L$, we have $dQ_L = C_L\, dV_L$. Therefore,
$$dE_P = V_{dd}\, C_L\, dV_L$$
Integrating for full charging of the load capacitance,
$$E_P = \int_0^{V_{dd}} V_{dd}\, C_L\, dV_L = C_L V_{dd}^2 \qquad (1.4)$$
Thus a total of $C_L V_{dd}^2$ energy is drawn from the power source. The energy EL stored in the capacitor at the end of the transition can be computed as follows:
$$dE_L = d(VQ) = V_L\, dQ_L$$
where VL is the instantaneous voltage across the load capacitance, and QL is the instantaneous charge of the load capacitance during the loading process. Therefore,
$$dE_L = V_L\, C_L\, dV_L$$
Integrating for full charging of the load capacitance,
$$E_L = \int_0^{V_{dd}} C_L V_L\, dV_L = \frac{C_L V_{dd}^2}{2} \qquad (1.5)$$
Comparing Equations 1.4 and 1.5, we notice that only half of the energy drawn from the power supply is stored in the load capacitance; the rest is dissipated as heat. The energy stored in the output capacitance is released during the discharging of the load capacitance, which occurs when the output of the inverter transitions from logical 1 to 0. The load capacitance of the CMOS logic gate consists of the output node capacitance of the logic gate, the effective capacitance of the interconnects, and the input node capacitance of the driven gate.
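The factor of one half between the two energies can be verified numerically. The sketch below evaluates the closed-form results and also integrates C·V·dV step by step; the function names and the 1 fF / 1 V example values are hypothetical.

```python
def supply_energy(c_load, vdd):
    """Energy drawn from the supply for a full 0 -> Vdd charge (Equation 1.4)."""
    return c_load * vdd ** 2


def stored_energy(c_load, vdd):
    """Energy left on the load capacitance after charging (Equation 1.5)."""
    return 0.5 * c_load * vdd ** 2


def stored_energy_numeric(c_load, vdd, steps=1000):
    """Midpoint-rule integration of C * V dV from 0 to Vdd."""
    dv = vdd / steps
    return sum(c_load * (i + 0.5) * dv * dv for i in range(steps))


c_load, vdd = 1e-15, 1.0   # hypothetical 1 fF load at 1 V
heat = supply_energy(c_load, vdd) - stored_energy(c_load, vdd)
```

The difference `heat` equals the stored energy, which matches the statement that half of the supplied energy is dissipated during charging.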
1.6.3 Short Circuit Power
Figure 1.10
Short-circuit power is consumed in a circuit when both nMOS and pMOS are on.
As the input voltage rises, at time t0 we have Vin > VTn, i.e., the input voltage becomes higher than the threshold voltage of the nMOS transistor. At this time a short-circuit current path is established. The short-circuit current first increases as the nMOS transistor turns on and then decreases until, after t1, we have Vin > (Vdd - VTp), i.e., the pMOS transistor turns off, signaling the end of the short-circuit current. Therefore, for the duration in which VTn < Vin < (Vdd - VTp) holds, there is a conductive path between Vdd and GND because both the nMOS and pMOS devices are simultaneously on. Short-circuit power is typically estimated as:
.(1.6)
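The condition VTn < Vin < (Vdd - VTp) can be turned into a time window if the input is assumed to ramp linearly from 0 to Vdd. That linear-ramp assumption, the function name and the example numbers below are illustrative only and are not taken from the report; the sketch computes the conduction window and not the short-circuit power itself.

```python
def short_circuit_window(vdd, vtn, vtp, rise_time):
    """Return (t0, t1) during which both transistors conduct.

    Assumes Vin rises linearly from 0 to vdd over rise_time.
    vtn and vtp are threshold voltage magnitudes.
    Returns None when Vdd <= VTn + VTp, i.e. no overlap exists.
    """
    if vtn + vtp >= vdd:
        return None
    t0 = rise_time * vtn / vdd            # nMOS turns on: Vin = VTn
    t1 = rise_time * (vdd - vtp) / vdd    # pMOS turns off: Vin = Vdd - VTp
    return t0, t1


# Hypothetical values: 1 V supply, 0.3 V thresholds, 100 ps input rise time.
window = short_circuit_window(1.0, 0.3, 0.3, 100e-12)
```

Under these assumptions a slower input edge widens the window proportionally, and the window vanishes when the supply drops below the sum of the thresholds.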
1.6.4 Leakage Power
The third component of power dissipation in CMOS circuits, as shown in Equation 1.3, is the static or leakage power. Even when a transistor is in a stable logic state, simply because it is powered on, it continues to leak small amounts of power at almost all junctions due to various effects.
1.6.4.1 Reverse Biased Diode Leakage
The reverse biased diode leakage is due to the reverse bias current in the parasitic diodes that are formed between the diffusion region of the transistor and the substrate. It results from minority carrier diffusion and drift near the edge of depletion regions, and also from the generation of electron-hole pairs in the depletion regions of reverse-biased junctions. As shown in Figure 1.11, when the input of the inverter in Figure 1.9 is high, a reverse potential difference of Vdd is established between the drain and the n-well, which causes diode leakage through the drain junction. In addition, the n-well region of the pMOS transistor is also reverse biased with respect to the p-type substrate. This also leads to reverse bias leakage at the n-well junction.
The reverse bias leakage current across such a junction follows the diode equation:
$$I_{rev} = A \cdot J_s \cdot \left(e^{V_{bias}/V_{th}} - 1\right)$$
where A is the area of the junction, Js is the reverse saturation current density, Vbias is the reverse bias voltage across the junction, and Vth = kT/q is the thermal voltage. Reverse biased diode leakage will become more important as we continue to heavily dope the n- and p-regions. As a result, Zener and band-to-band tunneling will also become contributing factors to the reverse bias current.
Figure 1.11
1.7 CROSSTALK NOISE
As feature sizes have been shrinking with process-technology scaling, the spacing between adjacent interconnect wires keeps decreasing in every process technology. Also, while the lateral width of interconnect wires has been scaled down significantly, their vertical height has not been scaled in proportion (as shown in Figure 1.12). Both these trends lead to a very rapid increase in the amount of coupling capacitance (essentially like parallel-plate capacitors) between the wires. Coupling capacitance accounts for more than 85% of the total interconnect capacitance in the 90nm technology node.
Figure 1.12
Shrinking of wire geometries in the nanometer process technology leads to an increase in the amount of coupling capacitance.
More aggressive technology scaling will only lead to an increase in the overall contribution of the coupling capacitances to the total interconnect capacitance. Therefore, signal-integrity issues such as crosstalk noise have become important when performing timing verification of VLSI chips. Due to capacitive coupling, the switching characteristic of a net is affected by simultaneous switching of nets that are in close physical proximity. The net under analysis which suffers from coupling noise is referred to as the victim, and all neighboring nets which contribute to coupling noise on the victim are termed aggressors. Figure 1.13 illustrates coupling noise injected on the victim due to the rising transition of its aggressor. As the aggressor transition occurs, the voltage at the victim gets pulled up due to the AC current flowing through the coupling capacitance. The resulting glitch in the victim voltage due to the aggressor transition is referred to as the coupling-noise pulse. The peak of the coupling-noise pulse usually occurs when the aggressor transition is completed and there is no more coupling current flowing through the coupling capacitance. Finally, the coupling-noise pulse gradually dies down once the aggressor transition is completed and the victim node discharges the accumulated charge.
Figure 1.13
It has become imperative to model the signal-integrity issues that can arise due to the charge transfer through the coupling capacitances for VLSI chips in the nanometer process technology.
1.7.1 Preliminaries of Crosstalk Effects
Crosstalk can affect signal delays by changing the times at which signal transitions occur. For example, consider the signal waveforms on the cross-coupled nets A, B, and C in Figure 1.14. Because of capacitive cross-coupling, the transitions on net A and net C can affect the time at which the transition occurs on net B.
A rising-edge transition on net A at the time shown in Figure 1.14 can cause the transition to occur later on net B, possibly contributing to a setup violation for a path containing B.
Figure 1.14
Similarly, a falling-edge transition on net C can cause the transition to occur earlier on net B, possibly contributing to a hold violation for a path containing B. The logic effects of crosstalk on steady-state nets are due to the cross-coupled nets as shown in Figure 1.15.
Figure 1.15
Net B should be constant at logic zero, but the rising edge on net A causes a noise bump or glitch on net B. If the bump is sufficiently large and wide, it can cause an incorrect logic value to be propagated to the next gate in the path containing net B.
A net that receives undesirable cross-coupling effects from a nearby net is called a victim net. A net that causes these effects in a victim net is called an aggressor net. Note that an aggressor net can itself be a victim net, and a victim net can also be an aggressor net. The terms aggressor and victim refer to the relationship between two nets being analyzed. The timing impact of an aggressor net on a victim net depends on several factors: the amount of cross-coupled capacitance; the relative times and slew rates of the signal transitions; the switching directions (rising, falling); and the combination of effects from multiple aggressor nets on a single victim net. Figure 1.16 illustrates the importance of timing considerations for calculating crosstalk effects.
Figure 1.16
The aggressor signal A has a range of possible arrival times, from early to late. If the transition on A occurs at about the same time as the transition on B, it could cause the transition on B to occur later as shown in the figure, possibly contributing to a setup violation, or it could cause the transition to occur earlier, possibly contributing to a hold violation. If the transition on A occurs at an early time, it induces an upward bump or glitch on net B before the transition on B, which has no effect on the timing of signal B. However, a sufficiently large bump can cause unintended current flow by forward-biasing a pass transistor. Similarly, if the transition on A occurs at a late time, it induces a bump on B after the transition on B, also with no effect on the timing of signal B. However, a sufficiently large bump can cause a change in the logic value of the net, which can be propagated down the timing path.
K. Nose and T. Sakurai proposed a new buffer insertion scheme for bi-directional buses, the dual-rail bus (DRB) scheme, which does not have noise problems[3]. A second proposal is a high-speed buffer insertion scheme for uni-directional buses that makes use of staggered firing: the staggered firing bus (SFB), which is proposed and measured. As device dimensions are scaled down, interconnect RC delay becomes the dominant performance limiter in high-performance VLSIs. Another issue in submicron interconnects is a drastic increase in coupling capacitance due to the higher aspect ratio used to reduce the interconnect resistance. The increase in coupling capacitance degrades signal integrity, inducing noise problems and delay fluctuation problems. The original buffer insertion, however, cannot be applied to bi-directional buses because the buffer is uni-directional in nature. These circuits turn out to be prone to malfunction when there is noise from adjacent lines in scaled-down interconnect systems where capacitive coupling is large.
B. Victor and K. Keutzer proposed employing data encoding to eliminate crosstalk delay within a bus. They present a rigorous analysis of the theory behind "self-shielding codes" and give the fundamental theoretical limits on the performance of codes with and without memory[9]. In this paper, they introduced the concept of using data encoding to mitigate crosstalk delay on buses. Further codes could eliminate crosstalk delay while also reducing average power consumption or performing error detection or correction. The propagation delay across long on-chip buses is increasingly becoming a limiting factor in high-speed designs. Crosstalk between adjacent wires on the bus may create a significant portion of this delay. Placing a shield wire between each signal wire alleviates the crosstalk problem but doubles the area used by the bus.
M. Khellah, J. Tschanz, Y. Ye, S. Narendra, and V. De proposed Static Pulsed Bus (SPB) driver, repeater and receiver techniques that reduce the worst-case CCM value to 1[11]. RC delay of long on-chip interconnects continues to be a key limiter to performance and power of microprocessors. Coupling capacitance (Cc) between neighbouring lines remains a large fraction (50%) of the total line capacitance (CINT) with technology scaling. SPB offers significant advantages over SB in delay, energy, total device width and peak Vcc current. These improvements are due to the reduction in CCM and the repeater skewing enabled by monotonic signal transitions. Unlike dynamic schemes, the energy savings of SPB are maintained across all activity factors without any clock power or routing overhead.
R. Arunachalam, E. Acar, and S. Nassif observed that common current methods to decrease coupling noise include shielding and buffering, both of which can increase overall power dissipation[1]. An alternative method is spacing, which has the added benefit of improving the manufacturability (i.e. defect insensitivity) of the design. Capacitive coupling is recognized as one of the most critical problems that designers need to address for deep submicron technologies. A commonly used technique to avoid coupling is to shield signal lines from each other by inserting power/ground lines in between them. Although shielding practically eliminates coupling between signal lines, it does result in increased power and area. Their results demonstrate that spacing is a viable and more effective option than shielding for low power designs, even after budgeting for the noise and delay increase due to coupling. Excessive (unnecessary) shielding may significantly increase the total capacitance of the signal line, which dissipates more dynamic power in operation.
M. Khellah, M. Ghoneima, J. Tschanz, Y. Ye, N. Kurd, J. Barkatullah, S. Nimmagadda and Y. Ismail presented a bus architecture called Skewed Repeater Bus (SRB) for reducing on-chip interconnect energy in microprocessors[5]. By introducing relative delay between neighbouring bus lines, SRB reduces both average and worst-case coupling capacitance between those lines. On-chip interconnect RC delay is not scaling with technology. Therefore, the longest interconnect that can be accommodated in a single clock cycle between two flip-flops (flop distance) shrinks, and the number of clock cycles needed to propagate a signal from one point of the chip to another increases. This adversely impacts overall performance of the processor.
H. Deogun, R. Senger, D. Sylvester, R. Brown, and K. Nowka presented a new dual-VDD bus technique that is well suited for low power operation[12]. This technique adapts the static pulsed bus architecture to use dual-VDD power supplies. During quiescent periods, the bus system idles at the lower of the two VDD supplies, thereby lowering static power dissipation. When actively transitioning, the inverters in the bus system are temporarily boosted to the higher VDD supply to provide the needed drive strength for performance. Minimizing power consumption is a problem that has been, and will continue to be, attacked on a variety of design fronts. A first-order analysis projects that the fraction of total cells in a functional logic block used for repeaters will grow from 6% at the 90nm node to 70% at the 32nm node. This rapid increase in repeaters, which are generally sized aggressively to maintain delay and signal skew, dramatically increases the total power of the integrated circuit. The authors introduced a novel dual-VDD boosted pulsed bus technique for total power reduction. They showed that this technique maintains noise margins very close to those found in a traditional bus design and that the achievable savings are consistent through a wide range of data switching rates.
H. Kaul, J. Seo, M. Anders, D. Sylvester, and R. Krishnamurthy described an alternate repeater insertion technique that uses correct-by-construction polarities to reduce the worst-case Miller coupling factor (MCF) across any multi-segment portion of a repeated bus[7]. The increased integration of multiple cores and large shared caches in microprocessors requires improved energy efficiency for core-to-core, core-to-cache and even intra-core communication to sustain the required performance benefits within shrinking power envelopes. For multi-core chips, improved bus designs need to satisfy several constraints: drop-in replacement with minimal change in design methodology, robust operation and gains across process corners and the design space, and the ability to improve the power-performance of shared buses with multiple driver and receiver points along the bus. The technique is robust across process corners and for bus designs with non-equidistant repeater placement.
C. J. Akl and M. A. Bayoumi proposed a hybrid polarity repeater insertion technique that combines inverting and non-inverting repeater insertion to achieve a constant average effective coupling capacitance per wire transition for all possible switching patterns[8]. This simple yet effective technique is used to minimize delay uncertainty and to reduce the delay and/or energy dissipation and the buffer area of on-chip buses and adjacent signal wires. The reduction in worst-case capacitive coupling reduces peak energy, which is a critical factor for thermal regulation and packaging. With the continuous scaling of CMOS technology, interconnect delays dominate gate delays and become the major limiter to high performance systems. One of the major causes of interconnect delay degradation is the high cross-coupling capacitance between adjacent signal wires in deep sub-micrometer (DSM) technologies. This is mainly due to the dense wiring employed to achieve high integration density, and the increased aspect ratio used to lower interconnect resistance. The proposed technique has negligible overhead and a regular, uniform layout. It can be easily integrated into current automated layout tools, which can be a possible extension of this work.
K. Hirose and H. Yasuura proposed a technique that reduces the maximum bus delay caused by crosstalk by skewing the signal transition timing of adjacent wires to prevent simultaneous opposite transitions[4]. As CMOS technology scales down, the horizontal coupling capacitance makes crosstalk interference between adjacent wires a serious problem in VLSI design. The bus wires are placed so that normal-timing wires alternate with shifted-timing wires, so that no adjacent wires switch at the same transition timing. The time shift can be produced by a delay line made of an inverter chain or by a two-phase clock. Using an approximate equation for bus delay, they show that the technique is effective for repeater-inserted buses.
K. Bernstein, C.-T. Chuang, R. Joshi, and R. Puri described CMOS scaling challenges for sub-90nm designs and the CAD challenges of supporting energy, parameter variation, and micro-architectural issues[20]. Design practice will have to change from today's deterministic design to probabilistic and statistical design. Future high performance microprocessor design with technology scaling beyond 90nm will pose two major challenges: (1) energy and power, and (2) parameter variations. Excessive sub-threshold and gate oxide leakage are emerging as serious problems. The active power density of memory, such as on-chip SRAM cache, is an order of magnitude smaller than that of logic. Therefore, the overall processor performance can be improved in a more energy-efficient manner by using more memory than logic. This effectively reduces the overall activity factor of the chip. Dual-Vt designs can reduce leakage power during active operation, burn-in and standby.
3.1 EXISTING METHOD
The hybrid polarity repeater insertion technique is shown in Figure 3.1. The inverting repeaters at the midpoint of alternate bus lines are replaced with non-inverting repeaters. For an n-bit bus, n/2 inverting repeaters are substituted with non-inverting ones. When a worst-delay transition occurs in the first half of a line, a best-delay transition occurs in the second half, and vice versa.
Figure 3.1
Proposed hybrid polarity repeater insertion with non-inverting repeaters inserted at the midpoint of alternate bus wires.
This leads to constant average coupling capacitance factor for all possible input transitions as shown in Table 3.1. As a result, reduction of both worst case delay and delay uncertainty is achieved. This technique has negligible overhead and it has regular and uniform layout. This technique is useful in many applications and it can be easily integrated into current automated layout tools, which can be a possible extension of this work.
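The constant average can be checked by enumerating all transition patterns around a switching middle line. The sketch below is a simplified model, not the authors' analysis: transitions are encoded as +1 (rise), -1 (fall) or 0 (quiet), slopes are assumed equal, and the hybrid scheme is modeled as flipping the victim's polarity relative to both neighbors in the second half of the line. All names are hypothetical.

```python
from itertools import product


def neighbor_factor(victim, neighbor):
    """Coupling factor a switching victim (+1/-1) sees from one neighbor."""
    return 1 - victim * neighbor   # same direction 0, quiet 1, opposite 2


def conventional_average(left, victim, right):
    """Same repeater polarity on every line: both halves see one pattern."""
    half = neighbor_factor(victim, left) + neighbor_factor(victim, right)
    return (half + half) / 2


def hybrid_average(left, victim, right):
    """Alternate lines change polarity at the midpoint: pattern flips."""
    first = neighbor_factor(victim, left) + neighbor_factor(victim, right)
    second = neighbor_factor(-victim, left) + neighbor_factor(-victim, right)
    return (first + second) / 2


patterns = list(product((-1, 0, 1), (-1, 1), (-1, 0, 1)))
conventional = {conventional_average(l, v, r) for l, v, r in patterns}
hybrid = {hybrid_average(l, v, r) for l, v, r in patterns}
```

In this model the conventional bus spans total factors from 0 to 4 (two neighbors), while the hybrid scheme always yields 2, i.e. an average factor of 1 per neighbor.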
Table 3.1
Average Coupling Capacitance Factor of the middle switching line for all possible input transition patterns
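The constant average can be checked with a small behavioural model. The Python sketch below is illustrative only. It assumes an ideal three-wire bus, a per-neighbour Miller factor of 0, 1 or 2, and a victim line that keeps its inverting midpoint repeater while both neighbours receive non-inverting ones. The function names are hypothetical and are not taken from [8].

```python
from itertools import product

RISE, FALL, HOLD = +1, -1, 0


def mcf(victim: int, neighbour: int) -> int:
    """Miller factor seen by a switching victim from one neighbour."""
    if neighbour == HOLD:
        return 1
    return 0 if neighbour == victim else 2


def average_mcf(victim: int, left: int, right: int, hybrid: bool) -> float:
    """Per-neighbour Miller factor averaged over both halves of the line."""
    first_half = mcf(victim, left) + mcf(victim, right)
    # The victim always passes an inverting midpoint repeater.
    # Assumption: in the hybrid bus both neighbours are non-inverting,
    # in the conventional bus they are inverted as well.
    flip = 1 if hybrid else -1
    second_half = mcf(-victim, flip * left) + mcf(-victim, flip * right)
    return (first_half + second_half) / 4


def averages(hybrid: bool) -> set:
    """Average factors over every pattern in which the victim switches."""
    transitions = (RISE, FALL, HOLD)
    return {
        average_mcf(v, left, right, hybrid)
        for v, left, right in product((RISE, FALL), transitions, transitions)
    }


print("conventional:", sorted(averages(hybrid=False)))
print("hybrid:      ", sorted(averages(hybrid=True)))
```

Under these assumptions the hybrid bus yields a single average of 1.0 for every pattern, while the conventional bus spans the full range from 0 to 2.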
3.1.1 Drawbacks of Existing Method
This technique reduces the overall worst-case MCF of an interconnect to 1, but it also eliminates the best-case MCF of 0 (all adjacent wires switching in the same direction), which limits the savings in average energy consumption.
3.2 PROPOSED METHOD
The edge encoding technique and the coding scheme are used to reduce the power consumption due to coupling capacitance and self transitions, respectively. The edge encoding technique achieves a worst-case MCF of 1 while preserving the best-case MCF of 0. This is done by controlling the timing of rising and falling transitions, namely by always performing rising transitions on the negative edge of the clock and falling transitions on the positive edge of the clock (or vice versa).
3.2.1 Edge Encoding Technique
In a multi-cycle bus structure, the transitions on neighboring wires are synchronized at every flip-flop as the signal propagates down the bus. This often generates simultaneous switching of adjacent wires in the opposite or the same direction. Figure 3.2 shows the worst-case and best-case switching of the conventional bus. The MCF=2 case, where every other wire switches in the opposite direction, generates the worst-case delay, which defines the clock frequency, and also consumes the worst-case energy due to maximum coupling capacitance. To avoid this, we propose to selectively shift rising and falling edges and separate them by as much as a half cycle. For example, as seen in Figure 3.2 (b), if we selectively delay only the rising transitions by a half cycle and keep the falling transitions unaltered, the worst-case MCF is reduced from 2 to 1. We refer to this selective edge shifting as edge encoding.
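The effect on the Miller factor can be illustrated with a few lines of Python. This is a minimal sketch, not the authors' analysis. It assumes ideal transitions and treats a neighbour that switches in the other half cycle as quiet while the victim switches. The names `neighbour_mcf` and `worst_and_best` are hypothetical.

```python
RISE, FALL, HOLD = +1, -1, 0


def neighbour_mcf(victim: int, neighbour: int, edge_encoded: bool = False) -> int:
    """Miller factor contributed by one neighbour while the victim switches."""
    if victim == HOLD:
        raise ValueError("the victim must switch")
    if edge_encoded and neighbour == -victim:
        # Rising and falling edges are half a cycle apart, so an
        # opposite-going neighbour is quiet while the victim moves.
        return 1
    if neighbour == HOLD:
        return 1
    return 0 if neighbour == victim else 2


def worst_and_best(edge_encoded: bool) -> tuple:
    """Largest and smallest factor over all victim/neighbour transitions."""
    values = [
        neighbour_mcf(v, n, edge_encoded)
        for v in (RISE, FALL)
        for n in (RISE, FALL, HOLD)
    ]
    return max(values), min(values)


print("conventional (worst, best):", worst_and_best(edge_encoded=False))
print("edge encoded (worst, best):", worst_and_best(edge_encoded=True))
```

In this model the conventional bus spans factors from 0 to 2, whereas the edge-encoded bus caps the worst case at 1 and still reaches 0 when neighbours switch in the same direction.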
Figure 3.2
Since edge encoding shifts transitions of the same direction together, the advantage of best-case switching is still maintained, which is unachievable in most other approaches. Since the edge-encoded signal transitions at both the positive and negative edges of the clock, we use dual-edge-triggered flip-flops to propagate the signal along long multi-cycle interconnects. Since the signals must be synchronized back to positive-edge-triggered flip-flops after long interconnects, the number of dual-edge flip-flops within the multi-cycle interconnect should be even.
The objective of the edge encoder is to selectively shift the rising and falling transitions by different amounts. This encoding is done simply by performing an AND operation between the original signal and a half-cycle-delayed version of itself. In this way, only the rising edge is delayed by a half cycle, which separates simultaneous rising and falling transitions by a half cycle. Since the encoder logic is very simple, the encoding overhead in terms of power and area is very small. This makes the edge encoding technique a highly practical approach.
We propose two schemes to use the edge-encoding technique effectively in multi-cycle interconnects. The two schemes differ in how they cope with the initial half-cycle latency required for edge encoding and in how they align the signal back to positive-edge-triggered timing at the far end of the wire.
3.2.1.1 Zero Latency (ZL) Scheme
The ZL scheme reduces energy consumption in multi-cycle interconnects without any latency overhead, even though encoding requires a half-cycle delay at the near end of the wire. This scheme exploits the fact that signal propagation in the edge-encoded bus is faster than in the conventional bus because of the reduced coupling capacitance. The block diagram of a multi-cycle interconnect with simple encoder logic is shown in Figure 3.3.
Figure 3.3
The encoding procedure and the propagation of the encoded signal are shown in Figure 3.4. When the data toggles every cycle, the encoder generates a half-cycle pulse (enc_out). As this half-cycle pulse propagates through an even number of dual-edge flip-flops, it automatically aligns back to a positive-edge-triggered signal (ff4_out) at the far end. Therefore, there is no need for any decoder circuit.
Figure 3.4.
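The encoder behaviour described above can be reproduced with a half-cycle-resolution model. The Python sketch below is an illustration under stated assumptions: the bus is sampled once per half cycle, even sample indices stand for positive clock edges, the line idles low before the first sample, and the helper names are hypothetical. It models only the AND-based encoder, not the flip-flop chain.

```python
def to_half_cycles(bits):
    """Hold each clock-cycle value for two half-cycle samples."""
    return [sample for bit in bits for sample in (bit, bit)]


def edge_encode(samples, idle=0):
    """AND each half-cycle sample with the sample one half cycle earlier."""
    delayed = [idle] + samples[:-1]
    return [s & d for s, d in zip(samples, delayed)]


def edges(samples, idle=0):
    """Return (rising, falling) sample indices; even index = positive edge."""
    rising, falling = [], []
    previous = idle
    for k, value in enumerate(samples):
        if value > previous:
            rising.append(k)
        elif value < previous:
            falling.append(k)
        previous = value
    return rising, falling


data = to_half_cycles([1, 0, 1, 0])   # data toggling every cycle
enc_out = edge_encode(data)

print("data    edges:", edges(data))      # ([0, 4], [2, 6])
print("enc_out edges:", edges(enc_out))   # ([1, 5], [2, 6])
```

Rising edges move from even to odd indices, that is, from the positive to the negative clock edge, while falling edges stay where they were. For data that toggles every cycle, enc_out is therefore a half-cycle pulse.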
To achieve overall zero latency, the interconnect system is set up as shown in Figure 3.5. If the conventional scheme requires n cycles to propagate through the entire interconnect, the edge-encoded bus must propagate through it in (2n-1) half cycles, because the encoding itself takes one half cycle and the signal must still be synchronized at the far end of the wire. In Figure 3.5, L1 is the distance between positive-edge-triggered flip-flops in the conventional bus, and L2 is the distance between dual-edge-triggered flip-flops in the edge-encoded bus. Both buses span the same total length, so n L1 = (2n-1) L2, and overall zero latency is achievable if L2 is defined by L1 and n as follows:
$$L_2 = \frac{n \, L_1}{2n - 1} \qquad (3.1)$$
Figure 3.5
For example, in a 9 mm interconnect with n=3 and L1=3 mm, the edge-encoded signal propagates 1.8 mm every half cycle, while the conventional signal propagates 3 mm every cycle. Effectively, the edge-encoded signal travels 20% farther (1.8 mm versus 1.5 mm) during the same half cycle, which is possible when at least a 17% speedup is achieved in the edge-encoded bus due to the coupling capacitance reduction.
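The numbers in this example follow from the flip-flop spacing relation, which can be checked with a short calculation. The Python sketch below assumes that wire delay scales linearly with the distance between flip-flops, which is a simplification. The function names are hypothetical.

```python
import math


def zl_flop_spacing(n_cycles: int, l1_mm: float) -> float:
    """Dual-edge flip-flop spacing L2 that gives zero added latency."""
    return n_cycles * l1_mm / (2 * n_cycles - 1)


def required_delay_reduction(n_cycles: int) -> float:
    """Fractional delay reduction per unit length needed by the ZL scheme.

    Assumption: delay is proportional to distance, so covering more wire
    in the same half cycle needs a proportionally faster wire.
    """
    conventional_per_half_cycle = 0.5            # L1 / 2 with L1 = 1
    distance_ratio = zl_flop_spacing(n_cycles, 1.0) / conventional_per_half_cycle
    return 1.0 - 1.0 / distance_ratio


l2 = zl_flop_spacing(n_cycles=3, l1_mm=3.0)
print(f"L2 = {l2:.2f} mm per half cycle")
print(f"distance ratio = {l2 / 1.5:.2f}")
print(f"required delay reduction = {required_delay_reduction(3):.1%}")
assert math.isclose(l2, 1.8)
```

For n = 3 the required reduction evaluates to 1/6, or about 17%, which matches the figure quoted in the text. The requirement shrinks as n grows, because the single half cycle spent on encoding is spread over more segments.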
3.2.1.2 One Cycle Latency (OCL) Scheme
In multi-cycle interconnects, multiple cycles are required to propagate across the entire wire. In these cases, one additional cycle of latency may be acceptable if a clock frequency increase or aggressive energy reduction is a higher design priority. The OCL scheme is intended to achieve further performance improvement and energy reduction for a fixed throughput at the expense of one cycle of latency. After the encoding, the data must eventually align to the positive edge of the clock at the far end of the wire. To achieve this, we can align the transition at the near end to the positive edge of the clock by encoding with a full one-cycle delay, and then allow for normal signal propagation along the wire. The one-cycle latency is therefore introduced once at the beginning of the wire, and the throughput is not hampered. The block diagram of the OCL edge encoding scheme is shown in Figure 3.6.
Figure 3.6.
The difference in the encoder in Figure 3.6 compared to ZL is that a dual-edge flip-flop is added at the output to intentionally delay enc_in by one cycle and align the rising edge of enc_out at the positive edge of the clock, as shown in Figure 3.6. The timing diagrams of the OCL edge encoding scheme are shown in Figure 3.7.
Figure 3.7.
The corresponding flip-flop placement in the OCL edge encoding scheme is shown in Figure 3.8. Dual-edge flip-flops are placed at intervals equal to half the flop distance of the conventional bus. In the OCL edge-encoded bus, since the worst-case wire delay is reduced due to MCF reduction, we can either increase the clock frequency for high-performance busses or downsize the repeaters for iso-performance to the conventional bus for aggressive energy reduction.
Figure 3.8
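The latency budgets of the two schemes can be compared by counting half cycles. The Python sketch below is only an accounting aid based on the descriptions above. It assumes one half cycle of encoding for ZL, one full cycle of encoding for OCL, and one half cycle per dual-edge flip-flop segment. The function names and scheme labels are hypothetical.

```python
def latency_half_cycles(n_cycles: int, scheme: str) -> int:
    """End-to-end latency, in half cycles, of an n-cycle interconnect."""
    if scheme == "conventional":
        return 2 * n_cycles
    if scheme == "ZL":
        # one half cycle of encoding, then (2n - 1) wire segments
        return 1 + (2 * n_cycles - 1)
    if scheme == "OCL":
        # one full cycle of encoding, then 2n wire segments
        return 2 + 2 * n_cycles
    raise ValueError(f"unknown scheme: {scheme}")


def ocl_flop_spacing(l1_mm: float) -> float:
    """OCL places dual-edge flip-flops at half the conventional distance."""
    return l1_mm / 2


n, l1 = 3, 3.0
for scheme in ("conventional", "ZL", "OCL"):
    print(scheme, latency_half_cycles(n, scheme) / 2, "cycles")
print("OCL segment length:", ocl_flop_spacing(l1), "mm")
```

With n = 3, ZL matches the conventional latency of three cycles and OCL takes four. Because each OCL segment is no longer than the distance a conventional signal covers in a half cycle, the reduced worst-case delay appears as timing slack, which is what allows the higher clock frequency or the smaller repeaters mentioned above.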
3.2.2 Coding Scheme
This scheme minimizes self-transition activity on the bus lines. The present data word is converted into four candidate patterns: inversion of the odd lines, inversion of the even lines, swapping of adjacent bits, and the unchanged data. The number of self transitions between each of these four patterns and the previously coded data is calculated by an energy estimator, a comparator finds the pattern with the fewest self transitions, and that pattern is transmitted on the bus.
Let the data on an n-bit-wide bus at time instant t be denoted as At = $\{a^t_{n-1}, a^t_{n-2}, \ldots, a^t_1, a^t_0\}$. The data transmitted on the bus is denoted as A(t)enc. The function calculateST_n(data1, data2) finds the number of self transitions between data1 and data2, which should both be n bits wide. The function swapAdj_n(At) swaps the adjacent bits in At and gives the output Asw(t) = $\{a^t_{n-2}, a^t_{n-1}, \ldots, a^t_0, a^t_1\}$. The coding scheme is as follows:
1) Let A(t-1)enc be the previously coded data that was transmitted on the bus, and let At be the present data to be encoded and transmitted.
2) Invert the odd lines in At and suffix the result with 00 for decoding purposes. Let this new data be denoted as At(odd). Evaluate st_odd = calculateST_n(At(odd), A(t-1)enc).
3) Similarly, invert the even lines in At and suffix the result with 01. Let this new data be denoted as At(eve). Evaluate st_eve = calculateST_n(At(eve), A(t-1)enc).
4) Let Asw(t) = swapAdj_n(At). Suffix Asw(t) with 10 and let this new data be denoted as At(swp). Evaluate st_swp = calculateST_n(At(swp), A(t-1)enc).
5) Suffix At with 11 and let this new data be denoted as At(unc). Evaluate st_unc = calculateST_n(At(unc), A(t-1)enc).
6) Find min(st_odd, st_eve, st_swp, st_unc).
7) The coded pattern corresponding to the minimum value in step 6 is transmitted.
The block diagram of the proposed encoder is given in Figure 3.9.
The encoder takes an n-bit input. The calculateST_n functions in steps 2 to 5 of the coding scheme are evaluated by the Energy Estimator block. The outputs of the Energy Estimator block (st_odd, st_eve, st_swp, and st_unc) are compared with one another to find the minimum. The encoded data pattern corresponding to the minimum number of self transitions is transmitted on the bus. The decoder used at the receiving end is shown in Figure 3.10.
The two least significant bits of the transmitted pattern are fed to a 2-to-4 decoder, and depending on these two bits, the appropriate decoding procedure is applied.
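The encoder and decoder described above can be sketched as a bit-level model. The Python code below is a hedged illustration, not the project's implementation. It assumes an 8-bit bus as in the appendix listing, takes "odd lines" to mean bit indices 1, 3, 5 and 7, counts self transitions over the full 10-bit pattern including the two tag bits, and breaks ties in the order odd, even, swap, unchanged. All function names are hypothetical.

```python
ODD_MASK = 0b10101010    # bus lines 1, 3, 5, 7
EVEN_MASK = 0b01010101   # bus lines 0, 2, 4, 6
TAG_ODD, TAG_EVEN, TAG_SWAP, TAG_UNCHANGED = 0b00, 0b01, 0b10, 0b11


def swap_adjacent(word: int) -> int:
    """Exchange each pair of neighbouring bits (1<->0, 3<->2, ...)."""
    return ((word & ODD_MASK) >> 1) | ((word & EVEN_MASK) << 1)


def self_transitions(pattern: int, previous: int) -> int:
    """Number of bus lines whose value differs between two patterns."""
    return bin(pattern ^ previous).count("1")


def candidates(word: int) -> list:
    """The four tagged patterns of steps 2 to 5, tag in the two LSBs."""
    return [
        ((word ^ ODD_MASK) << 2) | TAG_ODD,
        ((word ^ EVEN_MASK) << 2) | TAG_EVEN,
        (swap_adjacent(word) << 2) | TAG_SWAP,
        (word << 2) | TAG_UNCHANGED,
    ]


def encode(word: int, previous_pattern: int) -> int:
    """Steps 6 and 7: pick the candidate with the fewest self transitions."""
    return min(candidates(word),
               key=lambda p: self_transitions(p, previous_pattern))


def decode(pattern: int) -> int:
    """Undo the transformation selected by the two tag bits."""
    tag, payload = pattern & 0b11, pattern >> 2
    if tag == TAG_ODD:
        return payload ^ ODD_MASK
    if tag == TAG_EVEN:
        return payload ^ EVEN_MASK
    if tag == TAG_SWAP:
        return swap_adjacent(payload)
    return payload


previous = 0
for word in (0xAA, 0x55, 0xF0, 0x0F):
    pattern = encode(word, previous)
    print(f"{word:08b} -> {pattern:010b} "
          f"({self_transitions(pattern, previous)} transitions)")
    assert decode(pattern) == word
    previous = pattern
```

Because every transformation is its own inverse and the tag travels with the data, the decoder needs no knowledge of earlier patterns.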
3.3 SOFTWARE DESCRIPTION
VHDL is a versatile and powerful hardware description language that is useful for modeling electronic systems at various levels of design abstraction.
3.3.1 VHDL
VHDL (VHSIC Hardware Description Language) is a hardware description language used in electronic design automation to describe digital and mixed-signal systems such as field-programmable gate arrays and integrated circuits.
3.3.1.1 Design
VHDL is commonly used to write text models that describe a logic circuit. Such a model is processed by a synthesis program only if it is part of the logic design. A simulation program is used to test the logic design, using simulation models to represent the logic circuits that interface to the design. This collection of simulation models is commonly called a test bench.
VHDL has constructs to handle the parallelism inherent in hardware designs, but these constructs (processes) differ in syntax from the parallel constructs in Ada (tasks). Like Ada, VHDL is strongly typed and is not case sensitive. To directly represent operations that are common in hardware, VHDL has many features that are not found in Ada, such as an extended set of Boolean operators including nand and nor. VHDL also allows arrays to be indexed in either ascending or descending direction; both conventions are used in hardware, whereas in Ada and most programming languages only ascending indexing is available.
VHDL has file input and output capabilities and can be used as a general-purpose language for text processing, but files are more commonly used by a simulation test bench for stimulus or verification data. Some VHDL compilers build executable binaries. In this case, it might be possible to use VHDL to write a test bench that verifies the functionality of the design, using files on the host computer to define stimuli, to interact with the user, and to compare results with those expected. However, most designers leave this job to the simulator.
It is relatively easy for an inexperienced developer to produce code that simulates successfully but cannot be synthesized into a real device, or is too large to be practical. One particular pitfall is the accidental production of transparent latches rather than D-type flip-flops as storage elements.
One can design hardware in a VHDL IDE (for FPGA implementation, such as Xilinx ISE, Altera Quartus, Synopsys Synplify, or Mentor Graphics HDL Designer) to produce the RTL schematic of the desired circuit. After that, the generated schematic can be verified using simulation software, which shows the waveforms of the inputs and outputs of the circuit once an appropriate test bench has been generated.
To generate an appropriate test bench for a particular circuit or VHDL code, the inputs have to be defined correctly. For example, a clock input requires a loop process or an iterative statement.
A final point is that when a VHDL model is translated into the "gates and wires" that are mapped onto a programmable logic device such as a CPLD or FPGA, it is the actual hardware that is being configured, rather than the VHDL code being "executed" as if on some form of processor chip.
3.3.1.2 Advantages
The key advantage of VHDL, when used for systems design, is that it allows the behavior of the required system to be described (modeled) and verified (simulated) before synthesis tools translate the design into real hardware (gates and wires).
Another benefit is that VHDL allows the description of a concurrent system. VHDL is a dataflow language, unlike procedural computing languages such as BASIC, C, and assembly code, which all run sequentially, one instruction at a time.
A VHDL project is multipurpose. Once created, a calculation block can be reused in many other projects, and many of its structural and functional parameters can be tuned (capacity parameters, memory size, element base, block composition, and interconnection structure).
A VHDL project is portable. A computing device project created for one element base can be ported to another element base, for example VLSI in various technologies.
Figure 4.1 reports a dynamic power of 0.068 W and a quiescent (leakage) power of 0.010 W, which together give a total on-chip power of 0.078 W.
Figure 4.1
5.1 CONCLUSION
APPENDIX CODING
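-- Top-level encoder: builds the four tagged candidate patterns of
-- Section 3.2.2 and transmits the one with the lowest estimator count.
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.NUMERIC_STD.ALL;

entity self is
  Port ( a      : in  STD_LOGIC_VECTOR (7 downto 0);
         encout : out STD_LOGIC_VECTOR (9 downto 0) );
end self;

architecture Behavioral of self is
  signal odd, even, swap, append : STD_LOGIC_VECTOR (9 downto 0);
  component energy is
    Port ( ina  : in  STD_LOGIC_VECTOR (9 downto 0);
           bout : out STD_LOGIC_VECTOR (3 downto 0) );
  end component;
  signal egy1, egy2, egy3, egy4 : STD_LOGIC_VECTOR (3 downto 0);
begin
  -- Candidate patterns; a(7) stays in the most significant position in
  -- all four so that the patterns are compared line by line.
  p1 : process (a)
  begin
    odd    <= (not a(7)) & a(6) & (not a(5)) & a(4) & (not a(3)) & a(2) & (not a(1)) & a(0) & "00";
    even   <= a(7) & (not a(6)) & a(5) & (not a(4)) & a(3) & (not a(2)) & a(1) & (not a(0)) & "01";
    swap   <= a(6) & a(7) & a(4) & a(5) & a(2) & a(3) & a(0) & a(1) & "10";
    append <= a & "11";
  end process p1;

  -- One estimator per candidate pattern
  egymet1 : energy PORT MAP (ina => odd,    bout => egy1);
  egymet2 : energy PORT MAP (ina => even,   bout => egy2);
  egymet3 : energy PORT MAP (ina => swap,   bout => egy3);
  egymet4 : energy PORT MAP (ina => append, bout => egy4);

  -- Steps 6 and 7 of Section 3.2.2: select the pattern with the lowest count.
  -- The tie-breaking order (odd, even, swap, unchanged) is assumed.
  p2 : process (odd, even, swap, append, egy1, egy2, egy3, egy4)
  begin
    if unsigned(egy1) <= unsigned(egy2) and unsigned(egy1) <= unsigned(egy3)
       and unsigned(egy1) <= unsigned(egy4) then
      encout <= odd;
    elsif unsigned(egy2) <= unsigned(egy3) and unsigned(egy2) <= unsigned(egy4) then
      encout <= even;
    elsif unsigned(egy3) <= unsigned(egy4) then
      encout <= swap;
    else
      encout <= append;
    end if;
  end process p2;
end Behavioral;

-- Energy estimator: counts the bits that differ between ina and atp
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;

entity energy is
  Port ( ina  : in  STD_LOGIC_VECTOR (9 downto 0);
         bout : out STD_LOGIC_VECTOR (3 downto 0) );
end energy;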
architecture Behavioral of energy is
  -- atp stands for the previously transmitted pattern. It is never
  -- driven in this listing, so it stays at zero and exr equals ina.
  signal atp, exr : STD_LOGIC_VECTOR (9 downto 0) := "0000000000";
  component FA is
    Port ( a, b, cin : in  STD_LOGIC;
           s         : out STD_LOGIC;
           c         : out STD_LOGIC );
  end component FA;
  component HA is
    Port ( a, b : in  STD_LOGIC;
           s    : out STD_LOGIC;
           c    : out STD_LOGIC );
  end component;
  signal c1 : STD_LOGIC_VECTOR (7 downto 1);
  signal s1 : STD_LOGIC_VECTOR (5 downto 1);
begin
  -- A '1' in exr marks a bus line that changes value
  exr <= ina xor atp;

  -- Adder tree that sums the ten bits of exr into the 4-bit count bout
  FA1 : FA PORT MAP (a => exr(1), b => exr(2), cin => exr(3), s => s1(1), c => c1(1));
  FA2 : FA PORT MAP (a => exr(4), b => exr(5), cin => exr(6), s => s1(2), c => c1(2));
  FA3 : FA PORT MAP (a => exr(7), b => exr(8), cin => exr(9), s => s1(3), c => c1(3));
  FA4 : FA PORT MAP (a => c1(1), b => c1(2), cin => c1(3), s => s1(4), c => c1(4));
  FA5 : FA PORT MAP (a => s1(1), b => s1(2), cin => s1(3), s => s1(5), c => c1(5));
  HA1 : HA PORT MAP (a => exr(0), b => s1(5), s => bout(0), c => c1(6));
  FA6 : FA PORT MAP (a => s1(4), b => c1(6), cin => c1(5), s => bout(1), c => c1(7));
  HA2 : HA PORT MAP (a => c1(7), b => c1(4), s => bout(2), c => bout(3));
end Behavioral;

-- Half adder
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;

entity HA is
  Port ( a, b : in  STD_LOGIC;
         s    : out STD_LOGIC;
         c    : out STD_LOGIC );
end HA;

architecture Behavioral of HA is
begin
  s <= a xor b;
  c <= a and b;
end Behavioral;

-- Full adder
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;

entity FA is
  Port ( a, b, cin : in  STD_LOGIC;
         s         : out STD_LOGIC;
         c         : out STD_LOGIC );
end FA;

-- Standard full-adder equations (architecture body assumed)
architecture Behavioral of FA is
begin
  s <= a xor b xor cin;
  c <= (a and b) or (b and cin) or (a and cin);
end Behavioral;
REFERENCES
[1] A. B. Kahng, S. Muddu, and E. Sarto, (1999) "Interconnect optimization strategies for high-performance VLSI designs," in Proc. Int. Conf. VLSI Des., pp. 464–469.
[2] A. B. Kahng, S. Muddu, and E. Sarto, (2000) "On switch factor based analysis of coupled RC interconnects," in Proc. Des. Autom. Conf. (DAC), pp. 79–84.
[3] B. Victor and K. Keutzer, (2001) "Bus encoding to prevent crosstalk delay," in Proc. Int. Conf. Comput.-Aided Des. (ICCAD), pp. 57–63.
[4] C. J. Akl and M. A. Bayoumi, (Sep. 2008) "Reducing interconnect delay uncertainty via hybrid polarity repeater insertion," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 16, no. 9, pp. 1230–1239.
[5] H. Deogun, R. Senger, D. Sylvester, R. Brown, and K. Nowka, (2006) "A dual-Vdd boosted pulsed bus technique for low power and low leakage operation," in Proc. Int. Symp. Low Power Electron. Des. (ISLPED), pp. 73–78.
[6] H. Kaul, D. Sylvester, M. Anders, and R. Krishnamurthy, (Nov. 2005) "Design and analysis of spatial encoding circuits for peak power reduction in on-chip buses," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 13, no. 11, pp. 1225–1238.
[7] H. Kaul, J. Seo, M. Anders, D. Sylvester, and R. Krishnamurthy, (2008) "A robust alternate repeater technique for high performance busses in the multi-core era," in Proc. Int. Symp. Circuits Syst., pp. 372–375.
[8] H. Partovi, R. Burd, U. Salim, F. Weber, L. DiGregorio, and D. Draper, (1996) "Flow-through latch and edge-triggered flip-flop hybrid elements," in Dig. Tech. Papers IEEE Int. Solid-State Circuits Conf. (ISSCC), pp. 138–139.
[9] J. Eble, V. De, D. Wills, and J. Meindl, (1998) "Minimum repeater count, size, and energy dissipation for gigascale integration (GSI) interconnects," in Proc. Int. Interconnect Technol. Conf., pp. 56–58.
[10] J. Seo, D. Sylvester, D. Blaauw, H. Kaul, and R. Krishnamurthy, (2007) "A robust edge encoding technique for energy-efficient multi-cycle interconnect," in Proc. Int. Symp. Low Power Electron. Des. (ISLPED), pp. 68–73.
[11] J. Tschanz, S. Narendra, Z. Chen, S. Borkar, M. Sachdev, and V. De, (2001) "Comparative delay and energy of single edge-triggered and dual edge-triggered pulsed flip-flops for high-performance microprocessors," in Proc. Int. Symp. Low Power Electron. Des. (ISLPED), pp. 147–152.
[12] K. Bernstein, C.-T. Chuang, R. Joshi, and R. Puri, (2003) "Design and CAD challenges in sub-90 nm CMOS technologies," in Proc. Int. Conf. Comput.-Aided Des. (ICCAD), pp. 129–136.
[13] K. Bowman, J. Tschanz, M. Khellah, M. Ghoneima, Y. Ismail, and V. De, (2006) "Time-borrowing multi-cycle on-chip interconnects for delay variation tolerance," in Proc. Int. Symp. Low Power Electron. Des. (ISLPED), pp. 79–84.
[14] K. Hirose and H. Yasuura, (2000) "A bus delay reduction technique considering crosstalk," in Proc. Des., Autom. Test Eur. (DATE), pp. 441–445.
[15] K. Nose and T. Sakurai, (2001) "Two schemes to reduce interconnect delays in bi-directional and uni-directional buses," in Symp. VLSI Circuits Dig. Tech. Papers, pp. 193–194.
[16] M. Khellah, J. Tschanz, Y. Ye, S. Narendra, and V. De, (2002) "Static pulsed bus for on-chip interconnects," in Symp. VLSI Circuits Dig. Tech. Papers, pp. 78–79.
[17] M. Khellah, M. Ghoneima, J. Tschanz, Y. Ye, N. Kurd, J. Barkatullah, S. Nimmagadda, and Y. Ismail, (2005) "A skewed repeater bus architecture for on-chip energy reduction in microprocessors," in Proc. Int. Conf. Comput. Des. (ICCD), pp. 253–257.
[18] P. P. Sotiriadis, A. Wang, and A. Chandrakasan, (2000) "Transition pattern coding: An approach to reduce energy in interconnect," in Proc. Euro. Solid-State Circuits Conf. (ESSCIRC), pp. 348–351.
[19] R. Arunachalam, E. Acar, and S. Nassif, (2003) "Optimal shielding/spacing metrics for low power design," in Proc. IEEE Comput. Soc. Annu. Symp. VLSI, pp. 167–172.
[20] S.-C. Wong, T. G.-Y. Lee, D.-J. Ma, and C.-J. Chao, (May 2000) "An empirical three-dimensional crossover capacitance model for multilevel interconnect VLSI circuits," IEEE Trans. Semiconductor Manufacturing, vol. 13, no. 2, pp. 219–227.