SoC Design
Jim Flynn – Senior IC Design Engineer, Synopsys Professional Services
Brandon Waldo – Senior IC Design Engineer, Synopsys Professional Services
http://www.synopsys.com/sps
April 2004
Perhaps more critically, increases in system-on-chip (SoC) size and speed have led to power consumption
challenges across a broad range of designs that have not been viewed traditionally as supply-limited. In
these designs, heat dissipation and reliability issues such as electromigration and IR drop have become
vitally important. (For information on dealing with power-related reliability issues, please consult the Synopsys
Professional Services’ White paper “Design Planning Strategies to Improve Physical Design Flows—
Floorplanning and Power Planning” http://www.synopsys.com/cgi-bin/sps/wp/dps/paper1.cgi)
Power issues in mainstream deep submicron designs may limit functionality or performance and severely
affect manufacturability and yield. Higher power dissipation increases junction temperature, which slows
transistors and increases interconnect resistance. Design techniques aimed at improving performance may
therefore fall short if power is not considered. Lower-than-expected performance decreases device yield.
Additionally, higher power dissipation requires more system-level measures for thermal management. In
general, these power issues are increasing SoC and system costs. Managing power consumption at
appropriate points in the SoC design flow keeps these costs under control.
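Dynamic power (switching power) is consumed when transistors switch and charge or discharge load
capacitance. It scales as
$P_{dynamic} \propto C \cdot V^2 \cdot F$
where C is the load capacitance, V is the voltage swing and F is the number of logic-state transitions
per unit time.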
As semiconductor structures become smaller, device and interconnect capacitances decrease, allowing for
higher performance and lower power. Countering these factors are power increases due to larger designs
and higher switching rates.
Static power (leakage power) is consumed while transistors are not switching; it is the product of the
supply voltage and the total leakage current.
Although transistors have some reverse-biased diode leakage from drain to substrate, the larger portion of
leakage power is due to the sub-threshold current through a transistor that is turned off. This sub-threshold
current results from the conduction between source and drain through the transistor channel.
The sub-threshold leakage current is problematic because it increases as transistor threshold voltages (Vth)
decrease. In fact, the move to 130 nanometer (nm) and beyond may boost leakage power as high as 50
percent of the total chip power (Figure 1). Increased leakage power also contributes to an exponential
increase in reliability-related failures in chips (even in standby).
As CMOS technologies scale down, the main approach for reducing power has been to scale down the
supply voltage VDD. Voltage scaling is a good technique for controlling a chip’s dynamic power because of
the quadratic effect of voltage on power consumption. However, just reducing the power supply degrades
circuit speed because the switching delay time is proportional to the load capacitance and the ratio Vth/VDD.
To maintain sufficient drive strength for fast switching, Vth must decrease in proportion to VDD. This relationship
leads to the leakage power increase. Fortunately, a power-aware design flow helps balance timing
requirements with various power goals.
Power solutions
The higher the level of design abstraction, the greater the influence on power consumption. At the system
and algorithm levels, for example, using a parallel approach rather than a serial implementation reduces
clock frequencies, which helps to decrease power consumption significantly. The lower power of the parallel
approach may come at the expense of somewhat greater area or slower performance.
To give an example of the effect of parallel vs. sequential architectures, in one chip that received data
samples serially, the samples were processed in parallel to reduce this logic’s clock speed from 80 to 10
MHz. Additionally, the supply voltage was reduced from 1.8V to 1.25V. The parallel processing logic was
much larger than the serial processing equivalent, but the logic’s reduced voltage and operating frequency
reduced the power consumption by 75 percent. This parallel approach saved power because power depends
quadratically on voltage but only linearly on frequency and switching activity. In other designs, the
area penalty has been small but the power savings significant, so it is worth exploring the tradeoffs.
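The arithmetic behind this example can be sketched in a few lines of Python. The sketch assumes the
common first-order model in which dynamic power is proportional to switched capacitance, voltage squared
and frequency. The function name and the idea of solving for an implied capacitance ratio are
illustrative only and are not taken from the original design.

```python
def relative_dynamic_power(cap_ratio, v_old, v_new, f_old, f_new):
    """Power of the new design relative to the old, assuming P ~ C * V^2 * F."""
    return cap_ratio * (v_new / v_old) ** 2 * (f_new / f_old)


# Values quoted in the text.
voltage_factor = (1.25 / 1.8) ** 2   # effect of lowering the supply from 1.8V to 1.25V
frequency_factor = 10.0 / 80.0       # effect of lowering the clock from 80 to 10 MHz
reported_ratio = 1.0 - 0.75          # a 75 percent saving leaves 25 percent of the power

# Growth in switched capacitance that would be consistent with the reported saving
# under this simple model (an inference, not a figure from the design).
implied_cap_ratio = reported_ratio / (voltage_factor * frequency_factor)

print(f"voltage factor:            {voltage_factor:.3f}")
print(f"frequency factor:          {frequency_factor:.3f}")
print(f"implied capacitance ratio: {implied_cap_ratio:.2f}")
```

Under these assumptions the voltage reduction alone roughly halves the power, and the parallel logic
could switch several times the capacitance of the serial version and still reach the reported saving.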
Figure 2: In the context of the design flow, the potential for power savings and the accuracy of power estimates is
greatest early in the flow.
Figure 2 references several power optimization and analysis techniques that can be used throughout an
SoC design flow. The power solutions covered in this paper include:
Because techniques such as clock gating and dividing affect design for test (DFT), that topic is also
addressed. A brief design example at the end of the paper shows the benefits of combining dynamic
frequency and voltage scaling.
The power-analysis spreadsheet includes approximate gate counts, rough activity-per-block values,
side-by-side vendor µW/MHz data, and relative power estimates. The analysis at this point also helps to
show whether a design consumes too much power to be practical, thus avoiding weeks of design effort to
implement a chip that will never be manufactured.
To use the spreadsheet analysis method, it is necessary to estimate each block’s gate count (number of
library cells of each type) and activity level. The amount of energy consumed by the switching of each cell
type is also needed; data from a library vendor’s manuals can be used to assign an appropriate power
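value relative to speed (in µW/MHz). A block’s internal power consumption for a particular type of cell is
the product of the number of cells of that type, the cell’s power value (in µW/MHz), the clock frequency
(in MHz) and the activity level.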
Summing these power values for all the different types of cells in a block gives the block’s overall internal
active-power estimate. Before synthesis, gate counts are estimated based on architectural choices and an
understanding of the design. For example, approximate gate counts can be drawn from features such as
bus sizes, word lengths, control layers and memory depth. When the library has been selected, the gate
counts for a block can be estimated by using Design Compiler’s report_reference capability after
early synthesis, which reports the number of each instance type for the design.
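The spreadsheet method can be sketched in Python as follows. The product used here (cell count ×
µW/MHz × clock frequency × activity) is an assumption consistent with the quantities named above. The
cell types, counts, µW/MHz values and activity levels are invented for illustration and do not come from
any vendor library.

```python
from dataclasses import dataclass


@dataclass
class CellGroup:
    cell_type: str      # library cell type, e.g. as listed by report_reference
    count: int          # estimated number of instances in the block
    uw_per_mhz: float   # vendor power figure per cell (hypothetical value)
    activity: float     # estimated fraction of clock cycles in which the cell switches


def internal_power_uw(groups, clock_mhz):
    """Sum the per-cell-type internal power estimates for one block, in microwatts."""
    return sum(g.count * g.uw_per_mhz * clock_mhz * g.activity for g in groups)


# Hypothetical block running at 100 MHz.
block = [
    CellGroup("nand2", count=20000, uw_per_mhz=0.01, activity=0.2),
    CellGroup("dff", count=5000, uw_per_mhz=0.05, activity=0.5),
]

internal_uw = internal_power_uw(block, clock_mhz=100.0)
print(f"internal power estimate: {internal_uw / 1000.0:.2f} mW")
```

Keeping the inputs in one small table per block makes it easy to compare architectural alternatives by
changing only the counts or activity levels.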
A key aspect of the power calculation is assigning the activity levels. The gates of a design have different
activity levels that can be estimated with or without a simulation to extract switching activity. After the
library is selected, however, a functional simulation is recommended to determine the switching activity.
Switching activity is measured in terms of a toggle rate (TR). Toggle rate is the number of logic-0-to-logic-1
and logic-1-to-logic-0 transitions of a design object (for example, a net, pin or port) per unit of time. A net
having an activity of 50 logic-1-to-logic-0 transitions and 50 logic-0-to-logic-1 transitions during a 100ns
interval has a TR of 1. A net having an activity of five logic-1-to-logic-0 transitions and five logic-0-to-logic-1
transitions during a 10ns interval also has a TR of 1. These examples have nanoseconds as the unit of
time, and a TR of 1 indicates one activity transition per ns. Power and TR can be related by understanding
that for each transition an amount of energy must be supplied to change the state of an internal circuit
during the time interval of the state change.
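As a minimal sketch, the toggle-rate definition above can be written as a small Python function. The
function name is illustrative; the two calls reproduce the two examples given in the text.

```python
def toggle_rate(rising, falling, interval_ns):
    """Transitions (0-to-1 plus 1-to-0) of a design object per nanosecond."""
    if interval_ns <= 0:
        raise ValueError("interval must be positive")
    return (rising + falling) / interval_ns


# The two nets described in the text.
tr_long = toggle_rate(rising=50, falling=50, interval_ns=100)
tr_short = toggle_rate(rising=5, falling=5, interval_ns=10)
print(tr_long, tr_short)
```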
Figure 3 shows the Programming Language Interface (PLI) system tasks that can be used within VCS® to
generate an SAIF file during simulation. Power Compiler™ offers a power_estimate capability that uses an
SAIF file to define libraries and constraints and annotate the design for power estimation. Power Compiler’s
default switching activity for non-annotated ports is 0.25 toggle per positive edge; this value is applied and
propagated throughout the block.
$set_gate_level_monitoring ("rtl_on");
$set_toggle_region;
$toggle_start;
$toggle_stop;
$toggle_report;
Figure 3: Programming Language Interface (PLI) commands — These commands cause VCS to generate an SAIF
file for use in Power Compiler.
Table 2 lists examples of results estimated using the above methods. After calculating internal power,
switching power can be estimated as 30 percent of internal power. Without accurate load and switching
data, this value is only a rough estimate. Such estimates are useful mainly as a way to compare the power
implications of various design strategies rather than as predictors of a chip’s actual power consumption.
As mentioned earlier, however, rough estimates at the RTL stage do provide an early warning that a design
may turn out to be unacceptably hot.
report_reference:
$set_gate_level_monitoring ("on");
Again it must be emphasized that activity values are accurate only when the simulation vectors represent
actual application behavior. Physical Compiler® helps improve the accuracy of the load values by using the
write_parasitics -distributed command after physical optimization. This command produces a
SPEF file annotating Steiner route and RC parasitic estimates.
After layout, a gate-level simulation helps generate a Value Change Dump (VCD) file for use in PrimePower®
analysis. VCD files log changes to signal values during a simulation and provide the design’s nodal activity,
structural data, hierarchical connectivity, path delays, timing and event information.
Note that chip I/Os can be a significant source of inaccuracy if they are numerous, switching at high speed
and driving long wires. If design goals require accurate rather than worst-case power estimates, lumped
load models for the I/Os may produce overly pessimistic results. To get a more accurate picture, HSPICE®
simulations can be performed on critical I/O cell types with accurate distributed-impedance models. The
I/O cell power can then be calculated using numeric methods that determine charge and energy per
rising/falling edge. Given the HSPICE output of current and time, the internal energy per transient is
calculated using the trapezoidal integration method (in Matlab, for example). The I/O activity recorded
during PrimePower analysis is used to scale I/O power, and the total I/O power is combined with the core
power for an overall power estimate.
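The numeric step can be sketched in Python rather than Matlab. The sketch assumes a constant supply
voltage, so that the energy per transition is the supply voltage times the integrated supply current. The
waveform samples, supply voltage and transition rate are made-up values, not HSPICE or PrimePower output.

```python
def trapezoid(times, values):
    """Trapezoidal integration of sampled values over (possibly non-uniform) times."""
    total = 0.0
    for i in range(1, len(times)):
        total += 0.5 * (values[i] + values[i - 1]) * (times[i] - times[i - 1])
    return total


def energy_per_transition_j(times_s, currents_a, supply_v):
    """Energy drawn from a constant supply during one edge: V * integral(i dt)."""
    charge_c = trapezoid(times_s, currents_a)
    return supply_v * charge_c


def io_power_w(energy_j, transitions_per_s):
    """Scale the per-transition energy by the recorded I/O activity."""
    return energy_j * transitions_per_s


# Hypothetical triangular current pulse: 0 -> 10 mA -> 0 over 2 ns, 3.3V supply.
times = [0.0, 1e-9, 2e-9]
currents = [0.0, 10e-3, 0.0]
energy = energy_per_transition_j(times, currents, supply_v=3.3)
power = io_power_w(energy, transitions_per_s=50e6)
print(f"energy per edge: {energy:.3e} J, I/O power: {power:.3e} W")
```

In practice the rising and falling edges would be integrated separately, since they generally draw
different amounts of charge.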
To show how power estimates vary using the methods described here over different phases of the design
and implementation cycle, Figure 4 shows examples based on one block (a high-speed FIR filter) in a DSP
design. This example demonstrates how the power estimates vary depending on the accuracy of the
information supplied. The graph shows how the estimates changed for an example block at four points in
the flow:
■ Case 1—An estimate using worst-case switching activity and worst-case wire load estimates
■ Case 2—An estimate using more accurate wire load estimates and worst-case activity
■ Case 3—An estimate using accurate wire load estimates and realistic activity
■ Case 4—An estimate using exact wire loads (extracted) and realistic activity based on
SPICE-accurate simulation
[Bar chart: power estimate for Cases 1 through 4 (x-axis: Case; vertical scale 0 to 300). Legible bar
values include 260 and 237.]
Figure 4: In the course of a design flow, power estimates can vary considerably.
Figure 5: Power optimization techniques for different stages of a design flow (from top to bottom) and how they
affect static or dynamic power (from left to right).
Module clock gating can be applied in a series of levels, including the chip level, domain level (DSP, CPU, etc.),
module and sub-module. When the whole chip is in idle mode but must respond to external wakeup
events, an application can gate the chip clock. The same is true at the lowest level; when no memory
access is needed, the clock to the SDRAM controller can be switched off, given that the SDRAM is first
set to self-refresh mode. In addition to turning clocks on and off, the gating structure can include configurable
clock dividers to change the clock speed to various parts of the design.
Designing such a clock structure depends on an understanding of the chip’s function and insights from
power analysis about how much power can be saved by clock-gating ever-smaller portions of the design.
In general, clock switching power is more than 30 percent of a chip’s total power consumption, so clock
gating at all levels is usually well worth the effort.
A poor floorplan for clock distribution can also cause phase delay problems because clock tree synthesis
balances the clock tree according to the delay of the longest clock tree branch. A single long clock path
due to a poor floorplan therefore increases the entire clock tree insertion delay. Careful floorplanning
constraints for better clock tree balancing prevent this problem.
Other sources of clock phase delay are bad placement of non-CTS cells and large slew at non-CTS cell
outputs. The Synopsys Professional Services paper “Clock Distribution and Balancing In a Large and
Complex ASIC: Issues and Solutions” gives solutions to these problems as well as methods for dealing
with three other clock distribution issues: clock skew reduction, clock duty cycle distortion reduction and
clock gating efficiency (the paper is available at http://www.synopsys.com/sps/techpapers.html). The
paper also provides a clock-balancing automation strategy. Manual clock tree analysis and balancing
methods are not suitable for complex ASIC designs due to time-to-market constraints. The automation
strategy involves three steps: extracting a common shared clock distribution topology, defining a local
balance strategy for each clock path that does not fit in the common clock distribution, and combining
these local balance constraints with the constraints of the common clock distribution. The result is a
clock tree synthesis constraint for the CTS tools to balance the complete clock distribution automatically.
Another timing issue is the clock glitch that can occur when restarting a clock asynchronously. (Figure 6
shows how this glitch occurs.) It is therefore necessary to include a circuit that times the restart to avoid
the glitch.
[Timing diagram: CLK0, CLK1, Select and Out Clock waveforms, with the glitch marked.]
Figure 6: Clock switching glitch — After “turning off” a clock using clock gating, the clock restart must be timed to
avoid the glitch shown here.
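The effect can be reproduced with a small behavioral model in Python. This is only a sketch: the clock
periods, the time of the select change and the "switch only while both clocks are low" policy are
assumptions chosen for illustration, and a real design would implement the timing circuit in hardware with
proper synchronizers.

```python
def clk0(t):
    """Hypothetical clock with a 10-unit period and 50 percent duty cycle."""
    return (t % 10) < 5


def clk1(t):
    """Hypothetical clock with a 12-unit period and 50 percent duty cycle."""
    return (t % 12) < 6


def naive_mux(request_time, end=80):
    """Switch from clk0 to clk1 at the instant the select signal changes."""
    return [clk1(t) if t >= request_time else clk0(t) for t in range(end)]


def timed_mux(request_time, end=80):
    """Release clk0 only while it is low, then enable clk1 only while it is low."""
    out = []
    state = "clk0"  # clk0 -> idle -> clk1
    for t in range(end):
        if state == "clk0" and t >= request_time and not clk0(t):
            state = "idle"
        if state == "idle" and not clk1(t):
            state = "clk1"
        if state == "clk0":
            out.append(clk0(t))
        elif state == "clk1":
            out.append(clk1(t))
        else:
            out.append(False)  # output held low between the two clocks
    return out


def high_pulse_widths(wave):
    """Widths of all completed high pulses; a pulse still high at the end is ignored."""
    widths, run = [], 0
    for level in wave:
        if level:
            run += 1
        elif run:
            widths.append(run)
            run = 0
    return widths


naive_widths = high_pulse_widths(naive_mux(request_time=21))
timed_widths = high_pulse_widths(timed_mux(request_time=21))
print("naive switch, shortest high pulse:", min(naive_widths))
print("timed switch, shortest high pulse:", min(timed_widths))
```

With the untimed switch the output contains a runt pulse much shorter than either clock's high phase,
while the timed switch only ever passes full-width pulses.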
The use of voltage islands or voltage domains offers a way to meet both power consumption and performance
requirements. In this scheme, sections of logic are grouped physically into separate regions according to
their functionality. The logic regions that must operate at the highest speed use the highest supply voltage,
while less timing-critical regions use lower supply voltages.
Because logic switches more slowly at a lower supply voltage, frequency scaling is necessary along with
voltage scaling, so the voltage island approach works
well with clock gating. The logic in a clock-gated block constantly consumes leakage power, but reducing
the supply voltage to this block reduces the leakage.
Multiple supply voltages must be provided through separate power pins or analog voltage regulators
integrated into the device. The efficiency of these voltage regulators must be included in power calculations
for the device. If only a small portion of the design will operate at a lower voltage, more power may be lost
in the voltage regulator than is saved in the lower-voltage logic. Note that voltage island design may require
level-shifter cells to ensure a proper rail shift for signals traveling between voltage domains.
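The regulator trade-off can be illustrated with a rough Python sketch. It assumes that the island's
power is purely dynamic and scales with voltage squared at an unchanged clock frequency, and it models the
regulator by a conversion efficiency plus a fixed quiescent power. All numbers and the function name are
hypothetical.

```python
def island_saving_mw(island_power_mw, v_high, v_low, efficiency, quiescent_mw):
    """Net saving from moving a block onto a regulated lower-voltage island."""
    scaled_load = island_power_mw * (v_low / v_high) ** 2     # dynamic power only
    drawn_from_supply = scaled_load / efficiency + quiescent_mw
    return island_power_mw - drawn_from_supply


# Hypothetical regulator: 85 percent efficient, 2 mW quiescent power, 1.8V to 1.2V.
large_island = island_saving_mw(100.0, 1.8, 1.2, efficiency=0.85, quiescent_mw=2.0)
small_island = island_saving_mw(3.0, 1.8, 1.2, efficiency=0.85, quiescent_mw=2.0)
print(f"large island saving: {large_island:.2f} mW")
print(f"small island saving: {small_island:.2f} mW")
```

Under these assumptions the large island yields a clear saving, while the small one draws more power
through the regulator than it consumed on the original supply.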
In addition to reducing supply voltages, it is possible to vary the supply voltage of an island depending on
system requirements. Among other challenges, this method requires the use of cells that have been
characterized at all voltages. Synopsys Scalable Polynomial Models (SPMs) support the necessary timing
and power information. Non-Linear look-up table Models (NLMs) can also be used for voltage-island designs.
An SoC can also be designed to power down certain voltage islands to eliminate their leakage power.
Such islands require the use of power isolation cells, which can be simple AND gates. The outputs from a
powered-down section into an active power domain should never be allowed to float. Power isolation logic
ensures that all inputs to the active power domain are clamped to a stable value. Additionally, a state-
retention technique may be required so that the blocks can resume operation when powered-up.
Powering-down various islands’ voltages or scaling their voltages dynamically may also require power-
sequencing circuitry to ensure correct operation of the chip.
Multiple-threshold design
Multiple supply-voltage islands work well with multi-threshold synthesis. Optimization meets timing goals by
using low-Vth cells on critical timing paths and high-Vth cells on non-critical paths. Note that better leakage
quality of results can be obtained by using state-dependent leakage models, if the silicon vendor provides
such models.
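The idea of keeping low-Vth cells only where timing requires them can be sketched in Python. This is a
deliberately simplified model and not how Power Compiler works internally: it treats each cell's slack as
independent, whereas real paths share cells, and the delay penalty and leakage figures are invented.

```python
from dataclasses import dataclass

# Hypothetical characterization data.
HIGH_VTH_DELAY_PENALTY_NS = 0.05
LEAKAGE_NW = {"low": 50.0, "high": 5.0}


@dataclass
class Cell:
    name: str
    slack_ns: float    # slack of the worst timing path through this cell
    vth: str = "low"   # every cell starts from the fast, leaky library


def assign_vth(cells):
    """Swap a cell to high Vth only if its slack can absorb the added delay."""
    for cell in cells:
        if cell.vth == "low" and cell.slack_ns >= HIGH_VTH_DELAY_PENALTY_NS:
            cell.vth = "high"
            cell.slack_ns -= HIGH_VTH_DELAY_PENALTY_NS
    return cells


def total_leakage_nw(cells):
    return sum(LEAKAGE_NW[c.vth] for c in cells)


cells = [Cell("u1", 0.00), Cell("u2", 0.30), Cell("u3", 0.02)]
before = total_leakage_nw(cells)
assign_vth(cells)
after = total_leakage_nw(cells)
print([(c.name, c.vth) for c in cells], before, after)
```

Only the cell with enough slack moves to the high-Vth library, so the worst slack in the design is
unchanged while total leakage drops.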
A one- or two-pass synthesis flow can be used for multi-threshold designs, depending on the design team’s
methodology or preference. In the two-pass flow, initial synthesis is performed with the low-Vth,
high-performance library, followed by an incremental compile using multi-Vth libraries to reduce leakage
current. For designs in which both timing and leakage are important, one-pass synthesis uses multi-Vth
libraries simultaneously. The design is first optimized for timing; leakage power optimization is then
performed without degrading the achieved timing (i.e., the worst negative slack, or WNS). The power
optimization is followed by area optimization. The use of multi-Vth libraries is recommended in the
synthesis environment (using Power Compiler with Design Compiler or Physical Compiler) when optimizing
for leakage power in either the one- or two-pass flow.
The flow relies on the use of a reasonable leakage constraint, set in Power Compiler by the
set_max_leakage_power command.
RTL clock gating shuts down the clock to large register banks when the outputs of these flip-flops are not
needed. Figure 7 shows the difference between a clock gating circuit and the synchronous load-enable
circuit that Design Compiler would otherwise use. The feedback net and multiplexer of the synchronous
load-enable circuit are replaced by a latch and a two-input gate inserted in the register’s clock net.
always @(posedge CLK)
    if (EN)
        D_out <= D_in;
Fig 7a: Synchronous load-enable implementation without clock gating, produced by elaborate (diagram
labels: D_in, D_out, Reg Bank, EN, FSM, CLK).
Fig 7b: Synchronous load-enable implementation with clock gating, produced by elaborate -gate_clock
(diagram labels: D_in, D_out, Reg Bank, EN, G_CLK, FSM, Latch, CLK).
Figure 7: Power optimization during synthesis. Power Compiler automatically inserts clock-gating circuits,
replacing typical Design Compiler implementations (a) with the gating circuit (b).
This type of clock gating has a relatively low impact on area because the gating circuits replace muxes;
in fact, it reduces the area used by 5 to 15 percent. Power Compiler implements the gating automatically,
and it requires no RTL code change, though it is possible to specify the gating manually using a variety of
coding styles.
When Power Compiler works with Physical Compiler, the placements for the clock gating cells are optimized.
Within the Physical Compiler flow, Power Compiler makes sure that the gate element cells are placed close
together and that the gating element is placed close to the sequential elements it drives. This layout reduces
the clock skew that can otherwise occur with clock gating.
Clock gating can reduce a chip’s testability unless specific DFT features are added. Because the clock signal
is gated with an internal signal, a test engineer cannot control the loading of the DFT scan flip-flops. This
problem is avoided by adding a test pin and assigning a fixed value (1'b1) to it during test compilation. No
specific coding style is required. Figure 8 shows a clock gating circuit with a control point added.
Figure 8: Clock gating circuit with added control point — Because clock gating makes part of the circuit untestable,
clock-gated designs require the addition of control points, as shown here.
The options of Power Compiler’s set_clock_gating_style command improve the chip’s testability by
specifying the amount and type of testability logic added during clock gating. It is possible to add a
control point for testing before or after the clock-gating latch, for example, and choose test_mode or
scan_enable mode. Other options add observability logic or setup and hold-time margin. Before using the
Design Compiler commands check_test or check_dft, run hookup_testports and set_test_hold 1 Test_Mode.
Note that clock gating should not be used on designs that have variables (or signals) from which Design
Compiler implements master/slave flip-flops. Design Compiler uses the clocked_on_also signal-type
attribute in implementing these flip-flops. At the abstraction level at which clock gating occurs, Power
Compiler does not recognize this attribute and will gate only the slave clock of the flip-flop. It is
possible to use the set_clock_gating_signals command to exclude specific design variables (or signals)
that are implemented as master/slave flip-flops:
dc_shell> set_clock_gating_signals -design TOP -exclude { A B }
■ “If–Else” statements
■ Conditional assignments
■ “Case” statements
■ “For” loops
In addition to RTL optimization, Power Compiler optimizes power simultaneously with timing and area using
the following gate-level optimization techniques (in order of priority):
■ Sizing
■ Technology mapping
■ Pin swapping
■ Factoring
■ Buffer insertion
■ Phase assignment
These optimizations require the use of a power-characterized library. Because Power Compiler maintains
timing automatically and keeps area within the designer’s constraints, the tool provides “push-button” power
savings at the gate level.
The control elements include ARM Intelligent Energy Manager software that balances processor workload
and energy consumption. PowerWise hardware from National monitors performance and communicates
with voltage regulators to scale the supply voltage to the minimum operating level at each operating
frequency. This system compensates for silicon performance variations due to the manufacturing process
as well as run-time performance changes due to temperature fluctuations.
The 240-MHz chip is partitioned into three primary power domains: voltage-scaled CPU and memory power
domains and a standard fixed-voltage domain for the rest of the chip. The independent power domains
allow precise voltage control and current measurement for the CPU and RAM. Standard cells and level
shifters operate in the 0.7-1.32V range.
For cache-intensive workloads, both the power consumption and the precise time to process a workload
were measured to compare dynamic frequency scaling alone with dynamic voltage and frequency scaling.
Figure 9 summarizes the results normalized to the 1.2V operating voltage. Note that this diagram shows
the power savings only for the chip’s dynamic-voltage-and-frequency-scaling subsystem. Normally in such
SoCs, some of the chip will not be voltage scalable. Components such as external memories typically
operate at a fixed voltage, so design partitioning and planning must take into account the system-level
power savings.
The figure shows that voltage and frequency scaling can significantly reduce energy consumption compared
to frequency scaling alone. Running at 120 MHz cuts power requirements by half, for example, but scaling
the supply voltage at the same time slashes power consumption to about 20 percent of full power.
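These figures are roughly what the first-order dynamic power model predicts, as the following Python
sketch shows. It assumes power proportional to frequency times voltage squared, normalized to 240 MHz at
1.2V, and ignores leakage. The voltage it derives for the 20 percent point is an inference from that
model, not a measured value from the test chip.

```python
import math


def relative_power(freq_mhz, volts, full_freq_mhz=240.0, full_volts=1.2):
    """Dynamic power relative to full speed, assuming P ~ F * V^2."""
    return (freq_mhz / full_freq_mhz) * (volts / full_volts) ** 2


# Frequency scaling alone: half the clock rate at the full 1.2V supply.
dfs_only = relative_power(120.0, 1.2)

# Supply voltage at which this model would give about 20 percent of full power.
implied_volts = 1.2 * math.sqrt(0.20 / dfs_only)
dvfs = relative_power(120.0, implied_volts)

print(f"frequency scaling only: {dfs_only:.2f} of full power")
print(f"with voltage scaling:   {dvfs:.2f} of full power at about {implied_volts:.2f}V")
```

The derived voltage falls inside the 0.7-1.32V range quoted for the standard cells and level shifters,
which suggests the reported savings are consistent with this simple model.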
Summary
Dramatic power reductions such as those achieved by the Synopsys, ARM, National and Artisan test chip
are possible through a combination of high- and low-level power management techniques. The typical SoC
may not require all of these techniques, but mainstream solutions are available to meet all design requirements.
Choosing the right solutions depends on careful power analysis as well as understanding the capabilities
of available tools. Analyzing power requirements as early as possible in the design flow helps avoid power-
related disasters. Early analysis also makes power goals easier to attain because higher-level techniques
save the greatest amount of power.
Synopsys and Vera are registered trademarks and SystemC and OpenVera are trademarks of Synopsys, Inc.
All other trademarks or registered trademarks mentioned in this release are the intellectual property
of their respective owners and should be treated as such. All rights reserved. Printed in the U.S.A.
©2004 Synopsys, Inc. 05/04.KF.WO.04-12222