Activity-Sensitive Flip-Flop and Latch Selection For Reduced Energy

Seongmoo Heo, Ronny Krashinsky, Krste Asanović
MIT - Laboratory for Computer Science
http://www.cag.lcs.mit.edu/scale
ARVLSI March 15, 2001
Flip-Flop and Latch (collectively timing elements)
Critical Timing Elements (TEs) in modern synchronous VLSI systems
  o Significant impact on cycle time
  o Big portion of energy consumption

[Figure: pie chart of the energy breakdown of a MIPS 5-stage pipeline datapath for SPECint95 programs. Components: EqualCheck, Buffer, Shifter, Adder, ALU, RegFile, Mux, Latch, Flipflop. Slice labels: 3%, 1%, 2%, 7%, 23%, 8%, 23%, 23%, 10%.]
[Heo, MS Thesis, 00]
Motivation
Previous work tried to find the most energy-efficient and fastest TEs, assuming
  o a single TE design used uniformly throughout a circuit.
  o a very limited set of data patterns and an un-gated clock signal.
Two important observations
  o There is a wide variation in clock and data activity across different TEs.
  o Many TEs are not on the critical path, and thus have ample timing slack.
Basic Idea
Selection from a heterogeneous library of designs, each tuned to different operating regimes
Operating regimes:
  o Different input and clock signal activities
  o Different speed requirements
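
A minimal sketch of this selection idea, assuming a hypothetical three-design library and a simple linear energy model (energy per cycle = clock cost x clock activity + data cost x data activity). The design names, delays, and energy figures below are illustrative assumptions, not values from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Design:
    name: str
    delay_ps: float  # hypothetical D-Q delay
    clock_fj: float  # hypothetical energy per clocked cycle
    data_fj: float   # hypothetical energy per data transition


def expected_energy(design, clock_activity, data_activity):
    """Energy per cycle under the assumed linear activity model."""
    return design.clock_fj * clock_activity + design.data_fj * data_activity


def select(library, max_delay_ps, clock_activity, data_activity):
    """Return the lowest-energy design that still meets the delay requirement."""
    feasible = [d for d in library if d.delay_ps <= max_delay_ps]
    if not feasible:
        raise ValueError("no design meets the delay requirement")
    return min(
        feasible,
        key=lambda d: expected_energy(d, clock_activity, data_activity),
    )


# Hypothetical library: one fast design and two slower, specialized ones.
LIBRARY = [
    Design("fast", 200.0, 60.0, 40.0),
    Design("low_clock_energy", 450.0, 10.0, 90.0),
    Design("low_data_energy", 500.0, 50.0, 15.0),
]

# A TE with ample slack, a busy clock, and rarely changing data.
choice = select(LIBRARY, max_delay_ps=600.0, clock_activity=1.0, data_activity=0.05)
```

With these made-up numbers, a TE on a tight path always gets the fast design, while a TE with slack gets whichever slower design suits its activity.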
Related Work
The use of timing slack for reduced energy
  o Examples:
    - Traditional transistor sizing
    - Cluster voltage scaling [Usami and Horowitz 95]
    - Multiple threshold voltage or series transistor for reducing leakage current [McPherson et al. 00, Yamashita et al. 00, Johnson et al. 99]
Our Contribution
Detailed energy characterization of a wide range of TEs as a function of signal activities.
Detailed measurement of TE signal activities for a microprocessor running complete programs.
Exploit signal activity to reduce TE energy by using different TE structures.
Overview
Flip-Flop and Latch Designs
Test Bench and Simulation Setup
Delay and Energy Characterization
Energy Analysis with Test Waveforms
Evaluation with Processor
Conclusion
Latch Designs
Transistor sizes optimized for two extremes: Highest speed vs. Lowest power
Flip-Flop Designs
Test Bench
Used fixed, realistic input driver
Determined appropriate output load
  o As large as 200fF output load was used by previous work.
  o We used 7.2fF (4 min-inv cap) because 60% of output loads in the VP microprocessor datapath are smaller than 14.4fF.
  o Further work on load-sensitive analysis at upcoming WVLSI
Sized clock buffer to give equal rise/fall time
Simulation Setup
Custom layout in 0.25µm TSMC CMOS process with Magic layout program
Layout extraction with SPACE 2D extractor
Circuit simulation with Hspice under nominal condition of Vdd=2.5V and T=25°C
  o Hspice .Measure command to measure delay and energy
Delay Characterization
Flip-flop: Minimum D-Q delay [Stojanovic et al. 99]
Latch: D-Q delay
[Figure: delay (ps) of each design for the lowest-power and highest-speed sizings. (a) Flip-flops (b) Latches]
Energy Characterization
Total energy = input energy + internal energy + clock energy - output energy
Accurate energy characterization
  o State-transition technique based on [Zyuban and Kogge 99]
[Figure: state-transition diagram over the clock (C), data (D), and output (Q) signals]
Energy Tables
[Tables: energy per state transition for (a) flip-flops and (b) latches]
State transitions listed for (a) flip-flops (state bits: C D Q):
000->100, 001->100, 010->111, 011->111, 100->000, 110->010, 101->001, 111->011,
000->010, 100->110, 101->111, 001->011, 010->000, 110->100, 111->101, 011->001
Low-Power Flip-Flop
PPCFF 48.4 95.5 95.4 89.2 89.0 47.6 46.3 46.0 101 91.5 49.1 46.8 68.1 19.4 19.2 19.4 68.1 68.0 49.7 49.7 6.9 6.9 6.9 51.2
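
A minimal sketch of how a per-transition energy table can be applied to a waveform, in the spirit of the state-transition technique. It assumes each state is the three bits clock, data, output (CDQ) and that a step with no state change costs nothing. The energy values are hypothetical placeholders, not the characterized numbers from the tables.

```python
from collections import Counter

# Hypothetical energies in fJ for a few (from_state, to_state) transitions.
# Each state is written as three bits, assumed to be clock, data, output (CDQ).
ENERGY_FJ = {
    ("000", "100"): 20.0,  # clock rises, data and output stay low
    ("100", "000"): 20.0,  # clock falls
    ("000", "010"): 5.0,   # data rises while the clock is low
    ("010", "111"): 60.0,  # clock rises and the output captures a 1
    ("111", "011"): 20.0,  # clock falls, output holds
}


def count_transitions(states):
    """Count how often each (from_state, to_state) change occurs in a sequence."""
    return Counter(
        pair for pair in zip(states, states[1:]) if pair[0] != pair[1]
    )


def total_energy_fj(states, table=ENERGY_FJ):
    """Sum the table energy of every state change (unchanged steps cost nothing)."""
    counts = count_transitions(states)
    return sum(table[pair] * n for pair, n in counts.items())


# One clock pulse with idle data, then a 1 is presented and captured.
waveform = ["000", "100", "000", "010", "111", "011"]
energy = total_energy_fj(waveform)
```

Because the energy depends only on transition counts, the same table can be reused with counts collected by a cycle-level simulator instead of an explicit waveform.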
Test Waveforms
Test 1 and 2: high clock activity, no data and output activity
Test 3 and 4: high data activity, no clock and output activity
Test 5, 6, and 7: high clock, data, and output activity (Traditional)
Test 8: high clock and data activity, no output activity
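
The four activity classes can be illustrated with idealized waveforms. The sketch below assumes a positive-edge-triggered flip-flop sampled at two time steps per clock cycle. The waveform shapes are assumptions chosen to match the activity descriptions above, not the actual test vectors.

```python
def transitions(signal):
    """Count value changes between consecutive time steps."""
    return sum(a != b for a, b in zip(signal, signal[1:]))


def flip_flop_output(clock, data, q0=0):
    """Idealized positive-edge flip-flop: Q takes D on each rising clock edge."""
    q = [q0]
    for t in range(1, len(clock)):
        rising = clock[t - 1] == 0 and clock[t] == 1
        q.append(data[t] if rising else q[-1])
    return q


STEPS = 8  # two time steps per clock cycle
running_clock = [t % 2 for t in range(STEPS)]
gated_clock = [0] * STEPS

# (clock, data) pairs; the names describe the intended activity class.
WAVEFORMS = {
    "clock only": (running_clock, [0] * STEPS),
    "data only": (gated_clock, [t % 2 for t in range(STEPS)]),
    "clock, data, and output": (running_clock, [(t // 2) % 2 for t in range(STEPS)]),
    # Data pulses while the clock is low and returns before the rising edge.
    "clock and data, no output": (running_clock, [1 - t % 2 for t in range(STEPS)]),
}


def activity_report():
    """Map each waveform name to (clock, data, output) transition counts."""
    report = {}
    for name, (clock, data) in WAVEFORMS.items():
        q = flip_flop_output(clock, data)
        report[name] = (transitions(clock), transitions(data), transitions(q))
    return report
```

In this model the last waveform corresponds to the Test 8 class: the data input toggles every step, but the value seen at each rising edge is always the same, so the output never changes.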
Energy Analysis
[Figure: energy (fJ/cycle) of the low-power flip-flops and latches for Test 1, Test 3, and Test 5; one off-scale bar is labeled 1131. (a) Flip-flops: PPCFF, SSAFF, SAFF, MSAFF, HLFF, HLSFF, SSAPL, SSASPL, CCPPCFF. (b) Latches: PPCLA, PTLA, SSALA, SSA2LA, CPNLA.]
Processor Design and Simulation
Evaluation on a microprocessor datapath
Vanilla Pekoe Processor
  o A classic 32-bit MIPS RISC 5 stage pipeline with caches and system coprocessor registers (R3000-compatible)
  o Aggressive clock gating to save energy
  o 22 multi-bit flip-flops and latches, totaling 675 individual bits
A fast cycle-accurate simulator [Krashinsky, Heo, Zhang, and Asanovic 00] with the ability of counting TE state transitions
Simulation with 5 programs of SPECint95 benchmarks
  o 1.71 billion instructions and 2.69 billion cycles
Some constraints
  o Cannot track the exact timing of signals
  o Cannot model glitches
Flip-Flops and Latches in Processor
Energy Breakdown
Flip-flops (unit: mJ)
Name         HLFF-hs   Lowest-Energy design   Energy
f_recovpc    25.1      SSAFF-lp               3.57
d_inst       31.2      SSAFF-lp               6.52
d_epc        20.5      SSAFF-lp               2.74
x_epc        20.3      SSAFF-lp               2.62
m_epc        20.2      SSAFF-lp               2.55
x_sd         2.6       SAFF-lp                1.06
x_addr       8.0       SAFF-lp                2.57
m_exe        24.6      SSAFF-lp               4.76
cp0_count    42.6      SSAFF-lp               4.80
cp0_comp     0.1       HLFF-lp                0.03
cp0_baddr    0.3       HLFF-lp                0.18
cp0_epc      0.1       HLFF-lp                0.05
Latches (unit: mJ)
Name         PPCLA-hs  Lowest-Energy design   Energy
p_pc         3.22      SSALA-lp               2.25
f_pc         2.95      SSALA-lp               1.72
d_rsalu      3.27      SSALA-lp               3.16
d_rtalu      2.81      SSALA-lp               2.28
d_rsshmd     0.75      PPCLA-lp               0.70
d_rtshmd     0.65      PPCLA-lp               0.63
d_aluctrl    1.26      SSALA-lp               0.97
m_exe        3.88      SSALA-lp               3.65
x_sdalign    2.74      SSALA-lp               2.42
w_result     0.30      SSA2LA-lp              0.27
32-bit MIPS 5 stage pipeline datapath
SPECint95 benchmarks: perl(test, primes), ijpeg(test), m88ksim(test), go(20,9), and lzw(medtest)
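
As a rough illustration, the columns of the Energy Breakdown slide can be summed to compare a uniform design against picking the lowest-energy design per element. The sketch assumes the row assignment as read from the flattened slide text. It also ignores delay, so its totals need not match the delay-constrained results reported in the following slides.

```python
# Rows: (name, baseline energy in mJ, lowest-energy design, its energy in mJ).
# Baseline is HLFF-hs for flip-flops and PPCLA-hs for latches.
FLIP_FLOPS = [
    ("f_recovpc", 25.1, "SSAFF-lp", 3.57),
    ("d_inst", 31.2, "SSAFF-lp", 6.52),
    ("d_epc", 20.5, "SSAFF-lp", 2.74),
    ("x_epc", 20.3, "SSAFF-lp", 2.62),
    ("m_epc", 20.2, "SSAFF-lp", 2.55),
    ("x_sd", 2.6, "SAFF-lp", 1.06),
    ("x_addr", 8.0, "SAFF-lp", 2.57),
    ("m_exe", 24.6, "SSAFF-lp", 4.76),
    ("cp0_count", 42.6, "SSAFF-lp", 4.80),
    ("cp0_comp", 0.1, "HLFF-lp", 0.03),
    ("cp0_baddr", 0.3, "HLFF-lp", 0.18),
    ("cp0_epc", 0.1, "HLFF-lp", 0.05),
]
LATCHES = [
    ("p_pc", 3.22, "SSALA-lp", 2.25),
    ("f_pc", 2.95, "SSALA-lp", 1.72),
    ("d_rsalu", 3.27, "SSALA-lp", 3.16),
    ("d_rtalu", 2.81, "SSALA-lp", 2.28),
    ("d_rsshmd", 0.75, "PPCLA-lp", 0.70),
    ("d_rtshmd", 0.65, "PPCLA-lp", 0.63),
    ("d_aluctrl", 1.26, "SSALA-lp", 0.97),
    ("m_exe", 3.88, "SSALA-lp", 3.65),
    ("x_sdalign", 2.74, "SSALA-lp", 2.42),
    ("w_result", 0.30, "SSA2LA-lp", 0.27),
]


def totals(rows):
    """Return (baseline total, lowest-energy total, fractional saving) in mJ."""
    baseline = sum(row[1] for row in rows)
    lowest = sum(row[3] for row in rows)
    return baseline, lowest, 1.0 - lowest / baseline


ff_base, ff_low, ff_saving = totals(FLIP_FLOPS)
la_base, la_low, la_saving = totals(LATCHES)
print(f"flip-flops: {ff_base:.1f} mJ -> {ff_low:.2f} mJ ({ff_saving:.0%} saving)")
print(f"latches:    {la_base:.2f} mJ -> {la_low:.2f} mJ ({la_saving:.0%} saving)")
```

The per-element rows also show why no single design wins everywhere: the lowest-energy choice differs between, for example, the frequently clocked pipeline registers and the rarely written cp0 registers.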
Processor Energy Results - Flip-Flop
[Figure: total flip-flop energy (J) versus flip-flop delay (ps). Marked points: HLFF-hs, HLFF-lp, SSAFF-hs, SSAFF-lp, SSASPL-hs, SSASPL-lp. Curves: Uniform (a single design used uniformly throughout a circuit), HLFF-Sizing, HLFF-HSLE (also labeled HLFF-AS), SSASPL-Sizing, SSASPL-AS.]
HS: Highest-Speed, LP: Lowest-Power
HSLE: Activity-Sensitive selection
Ref: Total datapath energy - Total TE energy = around 0.21J
34% energy saving relative to HLFF-hs with conventional transistor sizing
69% energy saving relative to HLFF-hs with activity-sensitive selection
52% energy saving over just transistor sizing with the best performance (HLFF-hs)
Processor Energy Results - Latch
[Figure: total latch energy (J) versus latch delay (ps). Marked points: PPCLA-hs, SSA2LA-lp. Curves: Uniform, PPCLA-Sizing, PPCLA-HSLE.]
6.1% energy saving over just transistor sizing (1)
8.3% energy saving compared to homogeneous design with PPCLA-hs (2)
PPCLA is the fastest and also very energy-efficient.
Summary of Energy Results
63% TE energy saving compared to a homogeneous design with HLFF-hs and PPCLA-hs
46% TE energy saving compared to a design with conventional transistor sizing while keeping the best performance
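
A rough consistency check of these two figures against the per-type results. It assumes the homogeneous baselines are the column sums of the Energy Breakdown slide (195.6 mJ for HLFF-hs, 21.83 mJ for PPCLA-hs) and that the rounded percentages from the flip-flop and latch result slides apply to those baselines. Because the inputs are rounded, only approximate agreement should be expected.

```python
# Assumed homogeneous baselines in joules (column sums of the Energy Breakdown slide).
FF_BASE = 0.1956     # all flip-flops implemented as HLFF-hs
LATCH_BASE = 0.02183  # all latches implemented as PPCLA-hs

# Flip-flops: 34% saving from sizing, 69% from activity-sensitive selection,
# both relative to HLFF-hs.
ff_sizing = FF_BASE * (1 - 0.34)
ff_selected = FF_BASE * (1 - 0.69)

# Latches: 8.3% saving relative to PPCLA-hs, and 6.1% relative to sizing alone.
latch_selected = LATCH_BASE * (1 - 0.083)
latch_sizing = latch_selected / (1 - 0.061)

te_base = FF_BASE + LATCH_BASE
te_sizing = ff_sizing + latch_sizing
te_selected = ff_selected + latch_selected

saving_vs_homogeneous = 1 - te_selected / te_base
saving_vs_sizing = 1 - te_selected / te_sizing

print(f"saving vs homogeneous design: {saving_vs_homogeneous:.1%}")
print(f"saving vs transistor sizing:  {saving_vs_sizing:.1%}")
```

Under these assumptions both values land within about one percentage point of the 63% and 46% quoted above, and the combined figures are dominated by the flip-flops because their baseline energy is roughly nine times that of the latches.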
Conclusion
We showed that activation patterns for various TEs in a circuit differ considerably.
We found that there is wide variation in the optimal TE designs for different regimes.
We provided complete energy and delay characterization.
We applied our technique to a real processor, for which we simulated 2.7 billion cycles of programs, and showed over 63% TE energy reduction without losing any performance.
Difficulty of using a heterogeneous mix of TEs?
  - Designers already perform verification for each local clock, so the added complexity is minimal.
  - Timing verification for non-critical TEs is simple.
