April 2005
The need to reduce power consumption—long recognized as a significant design issue—becomes more
critical as larger, faster ICs go into portable applications. As a result, techniques for managing power
throughout the design flow are evolving to assure that all parts of the product receive power properly and
efficiently, and that the product is reliable. Techniques such as multi-voltage islands and dynamic scaling
of both clock frequency and threshold voltage help conserve battery power in portable applications, while
delivering high performance.
Perhaps more critically, increases in system-on-chip (SoC) size and speed have led to power consumption
challenges across a broad range of designs that have not been viewed traditionally as supply-limited. In
these designs, heat dissipation and reliability issues such as electromigration and IR drop have become
vitally important. (For information on dealing with power-related reliability issues, please consult the
Synopsys Professional Services white paper “Design Planning Strategies to Improve Physical Design
Flows—Floorplanning and Power Planning” http://www.synopsys.com/cgi-bin/sps/wp/dps/paper1.cgi)
Power issues in mainstream deep submicron designs may limit functionality or performance and severely
affect manufacturability and yield. Higher power dissipation increases junction temperature, which slows
transistors and increases interconnect resistance. Design techniques aimed at improving performance
may therefore fall short if power is not considered. Lower-than-expected performance decreases device
yield. Additionally, higher power dissipation requires more system-level measures for thermal management.
In general, these power issues are increasing SoC and system costs. Managing power consumption at
appropriate points in the SoC design flow keeps these costs under control.
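Dynamic power (switching power) is consumed when signals change logic state and load capacitance is
charged and discharged:
$$P_{\text{dynamic}} \propto C \cdot V^{2} \cdot F$$
where C is the load capacitance, V is the voltage swing and F is the number of logic-state transitions
per unit of time.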
As semiconductor structures become smaller, device and interconnect capacitances decrease, allowing for
higher performance and lower power. Countering these factors are power increases due to larger designs
and higher switching rates.
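Static power (leakage power) is consumed while transistors are not switching; it is the product of the
leakage current and the supply voltage:
$$P_{\text{static}} = I_{\text{leak}} \cdot V_{DD}$$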
Although transistors have some reverse-biased diode leakage from drain to substrate, the larger portion of
leakage power is due to the sub-threshold current through a transistor that is turned off. This sub-threshold
current results from the conduction between source and drain through the transistor channel.
The sub-threshold leakage current is problematic because it increases as transistor threshold voltages
(Vth) decrease. In fact, at 130 nanometers (nm) and below, leakage power may reach as much as
50 percent of the total chip power (Figure 1). Higher leakage power also contributes to an exponential
increase in reliability-related failures in chips, even in standby.
[Figure 1: chart of power in watts (axis 0 to 250) versus process technology (0.25µ, 0.18µ, 0.13µ, 0.10µ, 0.07µ).]
Figure 1: Increase in leakage power—Bringing down transistor threshold voltages helps decrease dynamic
power but increases sub-threshold leakage current. A power-aware design flow is thus needed to meet timing
requirements and keep power consumption within acceptable limits. Source: Intel. Published in IC Insights Inc.
2003 Technology Trends.
As CMOS technologies scale down, the main approach for reducing power has been to scale down the
supply voltage VDD. Voltage scaling is a good technique for controlling a chip’s dynamic power because of
the quadratic effect of voltage on power consumption. However, simply reducing the supply voltage degrades
circuit speed because the switching delay is proportional to the load capacitance and to the ratio Vth/VDD.
To maintain sufficient drive strength for fast switching, Vth must decrease in proportion to VDD. This
relationship leads to the leakage power increase. Fortunately, a power-aware design flow helps balance
timing requirements with various power goals.
Power Solutions
The higher the level of design abstraction, the greater the influence on power consumption. At the system
and algorithm levels, for example, using a parallel approach rather than a serial implementation reduces
clock frequencies, which helps to decrease power consumption significantly. The lower power of the parallel
approach may come at the expense of somewhat greater area or slower performance.
To give an example of the effect of parallel vs. sequential architectures, in one chip that received data
samples serially, the samples were processed in parallel to reduce this logic’s clock speed from 80 to 10
MHz. Additionally, the supply voltage was reduced from 1.8V to 1.25V. The parallel processing logic was much
larger than the serial processing equivalent, but the logic’s reduced voltage and operating frequency reduced
the power consumption by 75 percent. The parallel approach saves power because power scales with the
square of the voltage but only linearly with frequency and switching activity. In other designs, the
area penalty has been small but the power savings significant, so the tradeoffs are worth exploring.
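The arithmetic behind this example can be checked with a short sketch. It assumes that dynamic power is proportional to C x V^2 x F and ignores leakage; the function name and the capacitance ratio are illustrative and do not come from the design described above.

```python
# Relative dynamic power under the assumed proportionality P ~ C * V^2 * F.
# The voltages and frequencies come from the example in the text; the
# capacitance multiplier is inferred from them, not reported by the source.

def relative_dynamic_power(cap_ratio: float, v_ratio: float, f_ratio: float) -> float:
    """Return new/old dynamic power for the given new/old ratios of C, V and F."""
    return cap_ratio * v_ratio ** 2 * f_ratio

v_ratio = 1.25 / 1.8      # supply lowered from 1.8 V to 1.25 V
f_ratio = 10.0 / 80.0     # clock lowered from 80 MHz to 10 MHz

# Power ratio if the switched capacitance had stayed the same.
same_cap = relative_dynamic_power(1.0, v_ratio, f_ratio)

# A 75 percent saving is a power ratio of 0.25. Under this model, that
# leaves room for this much growth in switched capacitance.
implied_cap_ratio = 0.25 / same_cap

print(f"power ratio at equal capacitance: {same_cap:.3f}")
print(f"capacitance growth consistent with a 75% saving: {implied_cap_ratio:.1f}x")
```

Under these assumptions the voltage and frequency changes alone would cut power to about 6 percent of the original, so the reported 75 percent saving is consistent with the parallel logic switching roughly four times as much capacitance as the serial version.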
[Figure 2: design-flow diagram; surviving labels: Voltage islands, Floorplanning.]
Figure 2: In the context of the design flow, the potential for power savings and the accuracy of power
estimates is greatest early in the flow
Figure 2 references several power optimization and analysis techniques that can be used throughout an
SoC design flow. The power solutions covered in this paper include:
Because techniques such as clock gating and dividing affect design for test (DFT), that topic is also
addressed. A brief design example at the end of the paper shows the benefits of combining dynamic
frequency and voltage scaling.
When to perform the estimation (during)   How Gates are Calculated   How Load is Calculated   Estimation Tool(s) Used
1. Design/library exploration             Rough estimation           Unknown/In definition    Spreadsheet
2. Pre/early synthesis                    Rough estimation           DC-Wire Load Models      Design Compiler, Power Compiler
3. Post-synthesis                         Accurate (placed)          Wire Load Models/SPEF    Power Compiler, Physical Compiler, PrimePower
4. Post-layout                            Exact                      Extracted - SPEF         PrimePower
The power-analysis spreadsheet includes approximate gate counts, rough activity-per-block values,
side-by-side vendor µW/MHz data, and relative power estimates. The analysis at this point also helps to
show whether a design consumes too much power to be practical, thus avoiding weeks of design effort to
implement a chip that will never be manufactured.
To use the spreadsheet analysis method, it is necessary to estimate each block’s gate count (number of
library cells of each type) and activity level. The amount of energy consumed by the switching of each cell
type is also needed; data from a library vendor’s manuals can be used to assign an appropriate power value
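relative to speed (in µW/MHz). A block’s internal power consumption for a particular type of cell is given by
the equation:
$$P_{\text{internal}} = N_{\text{cells}} \times P_{\text{cell}} \times F \times A$$
where N_cells is the number of cells of that type in the block, P_cell is the vendor’s power value for the
cell (in µW/MHz), F is the operating frequency (in MHz) and A is the activity level.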
Summing these power values for all the different types of cells in a block gives the block’s overall internal
active-power estimate. Before synthesis, gate counts are estimated based on architectural choices and
an understanding of the design. For example, approximate gate counts can be drawn from features such
as bus sizes, word lengths, control layers and memory depth. When the library has been selected, the gate
counts for a block can be estimated by using Design Compiler’s report_reference capability after early
synthesis, which reports the number of instances of each cell type in the design.
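The spreadsheet calculation can be sketched in a few lines. This sketch assumes that the estimate for each cell type is the product of its gate count, its vendor µW/MHz figure, the clock frequency in MHz and its activity level; the cell names, counts and power figures are made-up placeholders, not library data.

```python
# Spreadsheet-style internal power estimate for one block.
# Assumed model: power = cells * (uW per MHz per cell) * clock MHz * activity.
# Cell names, counts and uW/MHz figures below are hypothetical placeholders.

from dataclasses import dataclass


@dataclass
class CellType:
    name: str
    count: int          # instances of this cell type in the block
    uw_per_mhz: float   # vendor figure for one cell, in uW/MHz
    activity: float     # fraction of clock cycles in which the cell switches


def internal_power_uw(cells: list[CellType], clock_mhz: float) -> float:
    """Sum the per-cell-type estimates to get the block's internal power in uW."""
    return sum(c.count * c.uw_per_mhz * clock_mhz * c.activity for c in cells)


block = [
    CellType("NAND2", count=6000, uw_per_mhz=0.010, activity=0.25),
    CellType("DFF", count=1000, uw_per_mhz=0.050, activity=0.50),
]

estimate = internal_power_uw(block, clock_mhz=100.0)
print(f"internal power: {estimate / 1000:.2f} mW")
```

Keeping one row per cell type mirrors the spreadsheet layout, so a different library or a different activity assumption can be compared by changing a single column.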
A key aspect of the power calculation is assigning the activity levels. The gates of a design have different
activity levels that can be estimated with or without a simulation to extract switching activity. After the
library is selected, however, a functional simulation is recommended to determine the switching activity.
Switching activity is measured in terms of a toggle rate (TR). Toggle rate is the number of logic-0-to-logic-1
and logic-1-to-logic-0 transitions of a design object (for example, a net, pin or port) per unit of time. A net
having an activity of 50 logic-1-to-logic-0 transitions and 50 logic-0-to-logic-1 transitions during a 100ns
interval has a TR of 1. A net having an activity of five logic-1-to-logic-0 transitions and five
logic-0-to-logic-1 transitions during a 10ns interval also has a TR of 1. These examples have nanoseconds
as the unit of time.
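The two toggle-rate examples can be reproduced with a small helper. The function name is illustrative; the interval is expressed in nanoseconds, as in the examples.

```python
def toggle_rate(rising: int, falling: int, interval: float) -> float:
    """Transitions (0-to-1 plus 1-to-0) per unit of time."""
    if interval <= 0:
        raise ValueError("interval must be positive")
    return (rising + falling) / interval


tr_long = toggle_rate(rising=50, falling=50, interval=100.0)   # 100 ns window
tr_short = toggle_rate(rising=5, falling=5, interval=10.0)     # 10 ns window
print(tr_long, tr_short)
```

Both calls give a toggle rate of 1 transition per nanosecond, which shows that the rate depends on the ratio of transitions to time rather than on the length of the observation window.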
Keep in mind that power estimates at any level of abstraction are meaningful only when the switching
activity represents the chip’s actual working operation. A common mistake is to use a vector set that
simulates system boot sequences when trying to determine activity. This activity rarely represents actual
working conditions and therefore leads to inaccurate power estimates. An RTL simulator helps to generate
a Switching Activity Interchange Format (SAIF) file automatically, but the activity values are accurate only
if the vector set is realistic. Current tools are not able to generate such vectors automatically—the task
requires an understanding of the circuit’s intent.
Figure 3 shows the Programming Language Interface (PLI) system tasks that can be used within VCS®
to generate an SAIF file during simulation. Power Compiler offers a power_estimate capability that
uses an SAIF file to define libraries and constraints and annotate the design for power estimation. Power
Compiler’s default switching activity for non-annotated ports is 0.25 toggle per positive edge; this value is
applied and propagated throughout the block.
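$set_gate_level_monitoring("rtl_on");  // enable monitoring for an RTL simulation
$set_toggle_region;                    // select the region to monitor (arguments not shown)
$toggle_start;                         // start recording toggles
$toggle_stop;                          // stop recording toggles
$toggle_report;                        // write the SAIF report (arguments not shown)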
Table 2 lists examples of results estimated using the above methods. After calculating internal power,
switching power can be estimated as 30 percent of internal power. Without accurate load and switching
data, this value is only a rough estimate. Such estimates are useful mainly as a way to compare the power
implications of various design strategies rather than as predictors of a chip’s actual power consumption.
As mentioned earlier, however, rough estimates at the RTL stage do provide an early warning that a design
may turn out to be unacceptably hot.
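As a sketch of the 30 percent rule of thumb described above, the following helper turns an internal-power figure into a rough dynamic-power total. The 4.0 mW input is a hypothetical value, and the function name is illustrative.

```python
def rough_dynamic_power(internal_mw: float, switching_fraction: float = 0.30) -> dict[str, float]:
    """Estimate switching power as a fixed fraction of internal power."""
    switching_mw = internal_mw * switching_fraction
    return {
        "internal_mw": internal_mw,
        "switching_mw": switching_mw,
        "total_mw": internal_mw + switching_mw,
    }


# Hypothetical block with 4.0 mW of estimated internal power.
result = rough_dynamic_power(4.0)
print({name: round(value, 2) for name, value in result.items()})
```

Because the fraction is a parameter, the same helper can be used to see how sensitive a comparison between two design strategies is to the assumed switching share.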
report_reference:
Reference   Library    Unit Area   Count   Total Area   Attributes
INV         tech_lib        1.00       1         1.00
MX1P        tech_lib        8.00       8        64.00   n
NAND2       tech_lib        1.00       6         6.00
NAND3       tech_lib        2.00       1         2.00
Total 8 references                             174.00
$set_gate_level_monitoring("on");
Again it must be emphasized that activity values are accurate only when the simulation vectors represent
actual application behavior. Physical Compiler® helps improve the accuracy of the load values by using the
write_parasitics -distributed command after physical optimization. This command produces a
SPEF file annotating Steiner route and RC parasitic estimates.
After layout, a gate-level simulation helps generate a Value Change Dump (VCD) file for use in
PrimePower® analysis. VCD files log changes to signal values during a simulation and provide the design’s
nodal activity, structural data, hierarchical connectivity, path delays, timing and event information.
Note that chip I/Os can be a significant source of inaccuracy if they are numerous, switching at high speed
and driving long wires. If design goals require accurate rather than worst-case power estimates, lumped
load models for the I/Os may produce overly pessimistic results. To get a more accurate picture, HSPICE®
simulations can be performed on critical I/O cell types with accurate distributed-impedance models.
The I/O cell power can then be calculated using numeric methods that determine charge and energy
per rising/falling edge. Given the HSPICE output of current and time, the internal energy per transient
is calculated using the trapezoidal integration method (in Matlab, for example). The I/O activity recorded
during PrimePower analysis is used to scale I/O power, and the total I/O power is combined with the core
power for an overall power estimate.
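The numeric step can be sketched as follows, in Python rather than Matlab. The sketch assumes a constant supply voltage, so that the energy per transition is the supply voltage multiplied by the integral of the supply current; the triangular current pulse and the toggle count are hypothetical stand-ins for HSPICE output and PrimePower activity data.

```python
def trapezoid(y: list[float], x: list[float]) -> float:
    """Integrate y over x with the trapezoidal rule."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least two samples")
    total = 0.0
    for i in range(1, len(x)):
        total += 0.5 * (y[i] + y[i - 1]) * (x[i] - x[i - 1])
    return total


def energy_per_transition(time_s: list[float], current_a: list[float], vdd: float) -> float:
    """Energy (J) = VDD * charge, where charge is the integral of supply current."""
    return vdd * trapezoid(current_a, time_s)


# Hypothetical current pulse: 0 A -> 10 mA -> 0 A over 2 ns, at VDD = 1.8 V.
time_s = [0.0, 1e-9, 2e-9]
current_a = [0.0, 10e-3, 0.0]
energy_j = energy_per_transition(time_s, current_a, vdd=1.8)

# Scale by the recorded I/O activity (hypothetical: 50 million transitions per second).
io_power_w = energy_j * 50e6
print(f"energy per transition: {energy_j * 1e12:.1f} pJ")
print(f"I/O power at 50M transitions/s: {io_power_w * 1e3:.2f} mW")
```

In practice the rising and falling edges would each get their own waveform and energy value, and the two would be weighted by the corresponding transition counts.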
To show how power estimates vary using the methods described here over different phases of the design
and implementation cycle, Figure 4 shows examples based on one block (a high-speed FIR filter) in a
DSP design. This example demonstrates how the power estimates vary depending on the accuracy of the
information supplied. The graph shows how the estimates changed for an example block at four points in
the flow:
• Case 1—An estimate using worst-case switching activity and worst-case wire load estimates
• Case 2—An estimate using more accurate wire load estimates and worst-case activity
• Case 3—An estimate using accurate wire load estimates and realistic activity
• Case 4—An estimate using exact wire loads (extracted) and realistic activity based on
SPICE-accurate simulation
[Figure 4: bar chart of estimated power (axis 0 to 300) for Cases 1 through 4; the surviving bar labels are 260 and 237.]
Figure 4: In the course of a design flow, power estimates can vary considerably.
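[Figure 5: diagram arranging techniques by design stage (1. Physical approach, 2. Design approach, 3. Synthesis approach) along an axis from Static Power to Dynamic Power. Techniques shown: power supply control or voltage island, multi-power supply, multi-clock source, power gating, clock gating.]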
Figure 5: Power optimization techniques for different stages of a design flow (from
top to bottom) and how they affect static or dynamic power (from left to right).
Module clock gating can be applied in a series of levels, including the chip level, domain level (DSP, CPU,
etc.), module and sub-module. When the whole chip is in idle mode but must respond to external wakeup
events, an application can gate the chip clock. The same is true at the lowest level; when no memory
access is needed, the clock to the SDRAM controller can be switched off, given that the SDRAM is first set
to self-refresh mode. In addition to turning clocks on and off, the gating structure can include configurable
clock dividers to change the clock speed to various parts of the design.
Designing such a clock structure depends on an understanding of the chip’s function and insights from
power analysis about how much power can be saved by clock-gating ever-smaller portions of the design. In
general, clock switching power is more than 30 percent of a chip’s total power consumption, so clock gating
at all levels is usually well worth the effort.
While a tool such as Astro™ CTS (clock tree synthesis) synthesizes high-quality clock trees for typical chips,
complex gated clocks and dividers can require manual intervention, largely based on the need to modify
parts of the design outside the purview of the tool. This intervention may be needed to prevent severe clock
phase delay, for example. Clock phase delay might occur because registers and non-CTS cells in a high-
level clock hierarchy are placed far apart, causing an increase in high-level expanded clock tree insertion
delays and thus an increase in clock phase delay. Netweight-based placement control of non-CTS cells
can avoid the problem. This method involves extracting nets that connect the clock gating cells, switching
multiplexers and driven CTS macros, then applying heavy net weights to these nets to pull the cells close
to each other in the placement optimization. The optimization then minimizes the cells’ load and hence cell
delays and output slews.
A poor floorplan for clock distribution can also cause phase delay problems because clock tree synthesis
balances the clock tree according to the delay of the longest clock tree branch. A single long clock path
due to a poor floorplan therefore increases the entire clock tree insertion delay. Careful floorplanning
constraints for better clock tree balancing prevent this problem.
Other sources of clock phase delay are bad placement of non-CTS cells and large slew at non-CTS cell
outputs. The Synopsys Professional Services paper “Clock Distribution and Balancing In a Large and
Complex ASIC: Issues and Solutions” gives solutions to these problems as well as methods for dealing with
three other clock distribution issues: clock skew reduction, clock duty cycle distortion reduction and clock
gating efficiency. (The paper is available at http://www.synopsys.com/sps/techpapers.html.) The paper
also provides a clock-balancing automation strategy. Manual clock tree analysis and balancing methods are
not suitable for complex ASIC designs due to time-to-market constraints. The automation strategy involves
three steps: extracting a common shared clock distribution topology, defining a local balance strategy
for each clock path that does not fit in the common clock distribution, and combining these local balance
constraints with the constraints of the common clock distribution. The result is a clock tree synthesis
constraint for the CTS tools to balance the complete clock distribution automatically.
Another timing issue is the clock glitch that can occur when restarting a clock asynchronously. (Figure 6
shows how this glitch occurs.) It is therefore necessary to include a circuit that times the restart to avoid
the glitch.
[Figure 6: clock-switching schematic and timing diagram; signals shown are CLK0, CLK1, Select and Out Clock, with the glitch marked on Out Clock.]
Figure 6: Clock switching glitch — After “turning off” a clock using clock
gating, the clock restart must be timed to avoid the glitch shown here.
The use of voltage islands or voltage domains offers a way to meet both power consumption and
performance requirements. In this scheme, sections of logic are grouped physically into separate regions
according to their functionality. The logic regions that must operate at the highest speed use the highest
supply voltage, while less timing-critical regions use lower supply voltages.
Because logic switches more slowly at a lower supply voltage, frequency scaling is necessary along with
voltage scaling, so the voltage island approach works well with clock gating. The logic in a clock-gated
block still consumes leakage power, but reducing the supply voltage to this block reduces the leakage.
Multiple supply voltages must be provided through separate power pins or analog voltage regulators
integrated into the device. The efficiency of these voltage regulators must be included in power calculations
for the device. If only a small portion of the design will operate at a lower voltage, more power may be lost
in the voltage regulator than is saved in the lower-voltage logic. Note that voltage island design may require
level-shifter cells to ensure a proper rail shift for signals traveling between voltage domains.
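The regulator trade-off can be illustrated with a rough model. It assumes that the island's power scales with the square of its supply voltage, that the regulator has a fixed conversion efficiency, and that it adds a fixed overhead; every number below is hypothetical.

```python
def regulated_island_saving_mw(
    p_high_mw: float,
    v_high: float,
    v_low: float,
    efficiency: float,
    overhead_mw: float,
) -> float:
    """Net saving (mW) from moving an island to a lower, regulated rail.

    A negative result means the regulator costs more than the island saves.
    """
    if not 0.0 < efficiency <= 1.0:
        raise ValueError("efficiency must be in (0, 1]")
    p_low_mw = p_high_mw * (v_low / v_high) ** 2      # assumed V^2 scaling
    drawn_mw = p_low_mw / efficiency + overhead_mw    # power taken from the source rail
    return p_high_mw - drawn_mw


# Hypothetical islands moved from 1.2 V to 0.9 V through an 85%-efficient
# regulator with 1 mW of fixed overhead.
small = regulated_island_saving_mw(2.0, 1.2, 0.9, efficiency=0.85, overhead_mw=1.0)
large = regulated_island_saving_mw(50.0, 1.2, 0.9, efficiency=0.85, overhead_mw=1.0)
print(f"small island: {small:+.2f} mW, large island: {large:+.2f} mW")
```

With these numbers the small island ends up drawing slightly more power than before, while the large island saves roughly a third of its original power, which is the effect the paragraph above warns about.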
In addition to reducing supply voltages, it is possible to vary the supply voltage of an island depending
on system requirements. Among other challenges, this method requires the use of cells that have been
characterized at all voltages. Synopsys Scalable Polynomial Models (SPMs) support the necessary timing
and power information. Non-Linear look-up table Models (NLMs) can also be used for voltage-island designs.
An SoC can also be designed to power-down certain voltage islands to eliminate their leakage power.
Such islands require the use of power isolation cells, which can be simple AND gates. The outputs from a
powered-down section into an active power domain should never be allowed to float. Power isolation logic
ensures that all inputs to the active power domain are clamped to a stable value. Additionally, a state-
retention technique may be required so that the blocks can resume operation when powered-up. Powering-
down various islands’ voltages or scaling their voltages dynamically may also require power-sequencing
circuitry to ensure correct operation of the chip.
A one or two-pass synthesis flow can be used for multi-threshold designs, depending on the design team’s
methodology or preference. Initial synthesis may be performed with the low-Vth, high-performance library,
followed by an incremental compile using multi-Vth libraries to reduce leakage current. For designs in
which both timing and leakage are important, one-pass synthesis uses multi-Vth libraries simultaneously.
The design is first optimized for timing, then leakage power optimization is performed without affecting
the achieved timing (i.e., the worst negative slack, or WNS). The timing optimization is not degraded by
power optimization. The power optimization is followed by area optimization. The use of multi-Vth libraries
is recommended in the synthesis environment (using Power Compiler with Design Compiler or Physical
Compiler) when optimizing for leakage power for either the one- or two-pass flow.
The flow relies on the use of a reasonable leakage constraint, set in Power Compiler by the
set_max_leakage_power command.
RTL clock gating shuts down the clock to large register banks when the outputs of these flip-flops are not
needed. Figure 7 shows the difference between a clock gating circuit and the synchronous load enable
circuit that Design Compiler would otherwise use. The feedback net and multiplexer of the synchronous
load enable circuit are replaced by a latch and a two-input gate inserted in the register’s clock net.
Fig 7a: Synchronous load-enable implementation without clock gating (elaborate)
always @(posedge CLK)
  if (EN)
    D_out <= D_in;
[Schematic labels: D_in, D_out, EN, CLK, FSM, Reg Bank]
Fig 7b: Synchronous load-enable implementation with clock gating (elaborate -gate_clock)
always @(posedge CLK)
  if (EN)
    D_out <= D_in;
[Schematic labels: D_in, D_out, EN, G_CLK, CLK, FSM, Latch, Reg Bank]
Figure 7: Power optimization during synthesis — Power Compiler automatically inserts clock-gating circuits,
replacing typical Design Compiler implementations (a) with the gating circuit (b).
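The role of the latch in the gating circuit can be seen in a small discrete-time model. This is a sketch under stated assumptions: the two-input gate is taken to be an AND gate, the latch is taken to be transparent while the clock is low, and the enable signal changes in the middle of a clock-high phase. The function names and time steps are illustrative.

```python
from typing import Callable


def clock(t: int, half_period: int = 5) -> int:
    """Square wave: low for the first half period, then high."""
    return (t // half_period) % 2


def simulate(enable_at: Callable[[int], int], steps: int = 40, half_period: int = 5):
    """Return (naive, latched) gated-clock waveforms as lists of 0/1 samples."""
    naive, latched = [], []
    q = 0  # latch output
    for t in range(steps):
        clk = clock(t, half_period)
        en = enable_at(t)
        if clk == 0:          # assumed: latch is transparent while the clock is low
            q = en
        naive.append(clk & en)    # enable ANDed directly with the clock
        latched.append(clk & q)   # latched enable ANDed with the clock
    return naive, latched


def high_pulse_widths(wave: list[int]) -> list[int]:
    """Widths, in time steps, of each run of consecutive 1s."""
    widths, run = [], 0
    for value in wave:
        if value:
            run += 1
        elif run:
            widths.append(run)
            run = 0
    if run:
        widths.append(run)
    return widths


# Enable rises at t=7 and falls at t=27, both while the clock is high.
naive, latched = simulate(lambda t: 1 if 7 <= t < 27 else 0)
print("naive pulse widths:  ", high_pulse_widths(naive))
print("latched pulse widths:", high_pulse_widths(latched))
```

In this model the directly gated clock contains shortened pulses at the start and end of the enable window, while the latched version passes only full-width clock pulses because the enable can change the gate input only while the clock is low.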
Power Compiler can also replace manually inserted clock gates with an integrated clock-gating (ICG) cell
from the library. This feature helps support legacy blocks or IP that have manual clock gates throughout the
physical flow. Power Compiler recognizes the ICG’s power-related attributes, which aid in the placement
of such cells. For advanced users of clock gating, Power Compiler helps obtain greater power savings by
performing multi-stage clock gating. In this technique, one clock gating cell feeds another clock gating cell
instead of a register bank. (This technique is also an RTL-based feature.)
RTL clock gating saves power in several ways. Internal power consumption decreases because the clock
does not continuously feed register banks, switching power decreases because of reduced capacitance on
the clock network, and power decreases further because downstream logic does not change.
When Power Compiler works with Physical Compiler, the placements for the clock gating cells are
optimized. Within the Physical Compiler flow, Power Compiler makes sure that the gate element cells are
placed close together and that the gating element is placed close to the sequential elements it drives. This
layout reduces the clock skew that can otherwise occur with clock gating.
Clock gating can reduce a chip’s testability unless specific DFT features are added. Because the clock
signal is gated with an internal signal, a test engineer cannot control the loading of the DFT scan flip-flops.
This problem is avoided by adding a test pin and assigning a fixed value (1’b1) to it during test compilation.
No specific coding style is required. Figure 8 shows a clock gating circuit with a control point added.
Figure 8: Clock gating circuit with added control point — Because clock gating makes part of the
circuit untestable, clock-gated designs require the addition of control points, as shown here.
The options of Power Compiler’s set_clock_gating_style command improve the chip’s testability
by specifying the amount and type of testability logic added during clock gating. It is possible to add a
control point for testing before or after the clock-gating latch, for example, and choose test_mode or
scan_enable mode. Other options add observability logic or setup and hold-time margin. Before using the
Design Compiler commands check_test or check_dft, first run hookup_testports and
set_test_hold 1 Test_Mode.
Note that clock gating should not be used on designs that have variables (or signals) from which Design
Compiler implements master/slave flip-flops. Design Compiler uses the clocked_on_also signal-type
attribute in implementing these flip-flops. At the abstraction level at which clock gating occurs, Power
Compiler does not recognize this attribute and will gate only the slave clock of the flip-flop. It is possible to
use the set_clock_gating_signals command to exclude specific design variables (or signals) that
are implemented as master-slave flip-flops:
dc_shell> set_clock_gating_signals -design TOP -exclude { A B }
• “If–Else” statements
• Conditional assignments
• “Case” statements
• “For” loops
In addition to RTL optimization, Power Compiler optimizes power simultaneously with timing and area using
the following gate-level optimization techniques (in order of priority):
• Sizing
• Technology mapping
• Pin swapping
• Factoring
• Buffer insertion
• Phase assignment
These optimizations require the use of a power-characterized library. Because Power Compiler maintains
timing automatically and keeps area within the designer’s constraints, the tool provides “push-button” power
savings at the gate level.
The control elements include ARM Intelligent Energy Manager software that balances processor workload
and energy consumption. PowerWise hardware from National monitors performance and communicates
with voltage regulators to scale the supply voltage to the minimum operating level at each operating
frequency. This system compensates for silicon performance variations due to the manufacturing process
as well as run-time performance changes due to temperature fluctuations.
The 240-MHz chip is partitioned into three primary power domains: voltage-scaled CPU and memory
power domains and a standard fixed-voltage domain for the rest of the chip. The independent power
domains allow precise voltage control and current measurement for the CPU and RAM. Standard cells and
level shifters operate in the 0.7-1.32V range.
For cache-intensive workloads, both the power consumption and the precise time to process a workload
were measured to compare dynamic frequency scaling alone with dynamic voltage and frequency scaling.
Figure 9 summarizes the results normalized to the 1.2V operating voltage. Note that this diagram shows
the power savings only for the chip’s dynamic-voltage-and-frequency-scaling subsystem. Normally in such
SoCs, some of the chip will not be voltage scalable. Components such as external memories typically
operate at a fixed voltage, so design partitioning and planning must take into account the system-level
power savings.
The figure shows that voltage and frequency scaling can significantly reduce energy consumption compared
to frequency scaling alone. Running at 120 MHz cuts power requirements by half, for example, but scaling
the supply voltage at the same time slashes power consumption to about 20 percent of full power.
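These figures are consistent with a simple scaling model, sketched below. The sketch assumes dynamic power proportional to V^2 x F, ignores leakage, and assumes that run time for a fixed workload is inversely proportional to frequency; the function names and the derived voltage are illustrative and are not measurements from the test chip.

```python
import math


def relative_power(v_ratio: float, f_ratio: float) -> float:
    """Dynamic power relative to full speed, assuming P ~ V^2 * F."""
    return v_ratio ** 2 * f_ratio


def relative_energy(v_ratio: float, f_ratio: float) -> float:
    """Energy for a fixed workload: power times run time, with run time ~ 1/F."""
    return relative_power(v_ratio, f_ratio) / f_ratio


f_ratio = 120.0 / 240.0                          # half of the 240 MHz clock
freq_only_power = relative_power(1.0, f_ratio)   # frequency scaling alone

# Voltage ratio that would bring power down to about 20 percent of full power.
v_ratio_needed = math.sqrt(0.20 / f_ratio)
v_needed = 1.2 * v_ratio_needed                  # relative to the 1.2 V reference

print(f"frequency scaling alone: {freq_only_power:.2f} of full power")
print(f"supply voltage for 20% power: about {v_needed:.2f} V")
print(f"energy per workload, frequency only: {relative_energy(1.0, f_ratio):.2f}")
print(f"energy per workload, voltage and frequency: {relative_energy(v_ratio_needed, f_ratio):.2f}")
```

Under these assumptions, halving the frequency alone halves the power but leaves the energy per workload unchanged, because the work takes twice as long. Lowering the supply to roughly 0.76 V, which falls inside the 0.7-1.32V range quoted above, reduces both power and energy.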
Summary
Dramatic power reductions such as those achieved by the Synopsys, ARM, National and Artisan test chip are
possible through a combination of high- and low-level power management techniques. The typical SoC may
not require all of these techniques, but mainstream solutions are available to meet all design requirements.
Choosing the right solutions depends on careful power analysis as well as understanding the capabilities
of available tools. Analyzing power requirements as early as possible in the design flow helps avoid power-
related disasters. Early analysis also makes power goals easier to attain because higher-level techniques
save the greatest amount of power.
Brandon Waldo
Senior IC Design Engineer, Synopsys Professional Services
Brandon Waldo is a Senior Design Consultant working in Synopsys Professional Services where he specializes
in low-power design, physical design and signal integrity analysis. He joined Synopsys in 2001 and has over
15 years of experience in semiconductor design. Prior to joining Synopsys, Mr. Waldo worked at Motorola
and Advanced Micro Devices doing full-custom, semi-custom and ASIC designs on several microprocessor
projects. Mr. Waldo holds BS and MS degrees in Electrical Engineering from Texas A&M University.
References
Design Planning Strategies to Improve Physical Design Flows Floorplanning and Power Planning, Synopsys Professional Services
White Paper, August 2003 (authors Sachin Idgunjj, Steve Lloyd, Rick Mitchell, Ron Spillman, Jon Young.)
Clock Distribution and Balancing In a Large and Complex ASIC: Issues and Solutions
<http://wwwin.synopsys.com/sps/docs/marketing/techpapers/13_designcon03_omap.pdf>
DesignCon 2003
James Song, Sandeep Aggarwal, Texas Instruments, Inc.
Kaijian Shi, Stewart Shankel, Synopsys Professional Services
700 East Middlefield Road, Mountain View, CA 94043 T 650 584 5000 www.synopsys.com
©2007 Synopsys, Inc. Synopsys, the Synopsys logo, Design Compiler, Physical Compiler, VCS, PrimePower, and HSPICE are
registered trademarks and Power Compiler and Astro are trademarks of Synopsys, Inc. All other products or service names
mentioned herein are trademarks of their respective holders and should be treated as such.
Printed in the U.S.A. 03/07.CE.WO.06-14884