Async TWilliams 14may03

Clock Skew and other Myths
Dr. Ted Williams

Infineon Technologies MorphICs
ASYNC 2003 Conference Slide 1 Dr. Ted Williams, Infineon Technologies MorphICs
Overview
• Learning from Asynchronous and Synchronous designs
• Classic Asynchronous benefits
• Planning for real wire delays
• Multiple solutions for “avoiding clock skew”
• Real voltages aren’t digital → Physics always wins
• Handling Min/Max bounds on all timing calculations
• Re-evaluation of Asynchronous claims/myths
• Summary of “analogous benefits” by sync design
• Conclusion: General good design principles
Synchronous Motivations
Clean logic state machines
Separating Logic specification from Timing specification
Separating Logic verification from Timing verification:
• Formal Verification can verify logic without timing
• Performance optimization can be automated independently of logic functionality.
Asynchronous Benefits
Divide-and-Conquer applied to clock distribution
• Claim of avoiding “clock skew penalty”
Robustness with respect to variable delays
• From data, process, voltage, temperature, coupling, characterization
• Scalability and portability
Operation at actual rather than worst-case
• Less margining
More self-consistent for precharged logic
• Many better choices for control of precharge/postcharge intervals
• Dual usage of “completion detection” for asynchronous handshaking
Easier Interfaces
• Arbiters still required whenever sampling signals of unknown phase
Real Systems already both sync & async
Systems → SoC (System-on-a-chip)
Diameter of a clock period → Synchronous horizon
Pipelining in space → 3D Compilers (x,y,t)
[Figure: chip die area with the event horizon drawn at 500MHz and at 100MHz]
Clock domain crossing → Check phase-correlation

Globally Asynchronous - Locally Synchronous (GALS)
Learning from each other
In last 3 decades, synchronous designers have been
usefully challenged by asynchronous designers.
Today, modern SoC designs always have both aspects.
The technical part of this presentation will delve into
some “synchronous details”, but these are useful even for
asynchronous designers, to learn the principles needed.
Goal: High-performance design
Theme: If you can analyze, you can optimize.
Appropriate Goals
Throughout this presentation, it’s important to emphasize the context for judgement of ideas.
Relevant goals for improvement:

• Performance
• Cost
• Area, yield
• Robustness
• Reliability, Manufacturability
• Difficulty
• Schedule, manpower
• Risk
Irrelevant goals:
• Delay insensitivity
• The only reasons it is useful “for its own sake” are intrinsic elegance
or academic purity, and these are unimportant for engineered products.
Gates versus wires
Old view → Gates are what matter
Modern view → Wires are what matter
Changes to basic optimization and planning strategies
Synchronous world-view changes:
• Academics now counting interconnect instead of counting transistors
• “Design compiler” counting cell area versus new “Physical compilers”
• Wire-load model farce becoming increasingly unable to cope
Asynchronous world-view changes:
• Stop focus on gate delays separated by “isochronic forks”
• QDI: Quasi-Delay-Insensitive (but ignores rather than addresses the issue)
• Encourage pipelining on the wires
Planning with linearized signal velocity
Asynchronous design doesn’t change reality that
distance determines delay
• Timing Closure → Performance Closure
• Timing Closure of synchronous hierarchical design is more
parallelizable across a design team.
Floorplan with an early view of timing issues
• Predominant effect on achieving timing goal is wire lengths to actual
physical locations of gates and ports.
• Run early top-level timing budgeting, assuming registered block
inputs/outputs (best case, but add some margin)
• Drives analysis of the long paths, more pre-planning of buses, tuning
of block region ports, guidance for ram instance placements, and
study of actual gate placements along critical paths.
• Separate wire planning from later actual repeater insertion.
Plan for Velocity of signals
µm per ps = mm per ns
During floorplanning, divide the Manhattan distance by the velocity,
and multiply by a routing non-ideality factor.
Assume ideal repeaters (typically every 1mm to 2mm in a
0.13um process) get added later. Without repeaters, it only
gets worse.
The mm per ns metric assumes no fanout. Linearized velocity
delay must be added to delay of buffer trees to drive actual
loads.
Real case (on nets with more than 1 endpoint) will be between
two extremes (so examine both during floorplanning):
• Best case, assuming no fanout, just linearized velocity
delay.
• Worst case, add the delay of a buffer tree to drive the
entire capacitive load of the wire in a Steiner tree linking all
endpoints on net.
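
The floorplanning estimate above can be sketched roughly as follows. This is only an illustration: the function name, the velocity value, the routing factor, and the buffer-tree delay are assumptions chosen for the example, not numbers from the slides or from any real process.

```python
def net_delay_bounds_ns(driver, sinks, velocity_mm_per_ns,
                        routing_factor=1.0, buffer_tree_ns=0.0):
    """Return (best, worst) delay in ns for one net during floorplanning.

    driver and sinks are (x, y) locations in mm. All parameter values are
    assumptions supplied by the caller, not data from a real process.
    """
    # Manhattan distance from the driver to the farthest endpoint.
    farthest_mm = max(abs(x - driver[0]) + abs(y - driver[1]) for x, y in sinks)
    # Best case: no fanout, only linearized velocity delay (ideal repeaters).
    best = farthest_mm * routing_factor / velocity_mm_per_ns
    # Worst case: add the delay of a buffer tree driving the whole net load.
    worst = best + buffer_tree_ns
    return best, worst


# Hypothetical net with two endpoints.
best, worst = net_delay_bounds_ns((0.0, 0.0), [(3.0, 1.0), (1.0, 1.0)],
                                  velocity_mm_per_ns=10.0,
                                  routing_factor=1.25, buffer_tree_ns=0.3)
print(f"best {best:.2f} ns, worst {worst:.2f} ns")
```

Examining both numbers for each long net early is the point; the real delay should land somewhere between them.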
Classic Synchronous Constraints
Classic goals of synchronous design:
Separate task of clock distribution from logic propagation.
Equalize clock arrival times at all registers
Tolerate skew by subtracting from maximum clock
frequency, and adding to required minimum logic delay.
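$\text{ClockPeriod} \ge \text{LogicDelay}_{\max} + \text{Setup}_{D \to \text{Clock}} + \text{Prop}_{\text{Clock} \to Q} + \text{Skew}_{\max}$

$\text{LogicDelay}_{\min} \ge \text{Hold}_{D \to \text{Clock}} + \text{Skew}_{\max}$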
Consider every register to require a known outcome to
2 races, one for setup and one for hold constraints,
across all conditions
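
As a small sketch of the two classic constraints, the function below evaluates both inequalities exactly as written on this slide and returns the slack of each. The function name and all numbers are hypothetical; the hold expression follows the slide and therefore omits the clock-to-Q delay.

```python
def classic_slacks(period, logic_max, logic_min, setup, clk_to_q, hold, skew_max):
    """Slack (>= 0 passes) for the classic setup and hold constraints.

    All arguments share one time unit; the whole skew budget is charged
    to both checks, as in the classic method.
    """
    setup_slack = period - (logic_max + setup + clk_to_q + skew_max)
    hold_slack = logic_min - (hold + skew_max)
    return setup_slack, hold_slack


# Hypothetical numbers in ns: same path, small and large skew budgets.
tight = classic_slacks(2.0, 1.5, 0.2, 0.1, 0.15, 0.05, skew_max=0.1)
loose = classic_slacks(2.0, 1.5, 0.2, 0.1, 0.15, 0.05, skew_max=0.3)
print(tight, loose)
```

A larger skew budget reduces both slacks at once, which is why the classic approach costs frequency on the setup side and hold-fix area on the other.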
Classic Synchronous clocks
Before describing a fresher approach, review steps of “old” way:
• Assume a uniform clock-arrival time at all registers
• Assess the skew that will still be present even after
seeking to make the clock arrive synchronously
everywhere.
• Penalize the attainable clock-frequency by adding
the skew into every setup path computation
• Penalize area by adding delay elements required to
correct hold violations assuming the whole skew
budget
• Work hard to distribute the clock within the
specified skew, even in cases where the paths
actually have plenty of slack.
Classic Synchronous Clock Problems
More complex logic -> More total load.
Today’s SoC can have > 500K registers
loading the clock tree.
Longer latency of clock tree to buffer up to bigger load,
so more proportional variability
More cross-coupling → more variability
Denser clock grids → higher power
→ Low-skew Clock tree distribution becoming harder
Typical Asynchronous Answer
Since low-skew distribution is hard → do away with the
clock entirely?
• Loses the simplicity of synchronous state model
• Compounds logic verification dilemma, by coupling it to fine-grain
event sequencing
• Misses the point that the problem isn’t the clock itself or the
clock latency, but only the handling of the calculable part of the
arrival time difference (“skew”).
Re-think issue of skew
The hard part in clock distribution is getting low-skew, but that
doesn’t mean we can’t have a clock.
If we don’t really force low-skew, then we can save effort and
power in clock distribution.
Rather, just build a timing methodology to robustly ensure we have
full analysis of actual skew effects on both setup and hold.
Tolerate larger skew, and actually get a more conservative design.
To implement, start tracing of races at the point in clock distribution
where the paths to the launching register and receiving register
diverge.
Or backtracing from the registers, find the point in the clock tree
where the reverse paths from launching register and receiving
register re-converge.
Remove “boundary” at register clock pins
Modern, better approach to combine clock tree analysis and improvement
with combinational path optimization:
• Use actual clock paths feeding every launching & receiving register pair
• Trace clock paths toward root only back to point of “reconvergence”
(this feature added to commercial tools just in 2001)
• Don’t require clock distribution to be equal to all points →
Effectively, relaxing of “synchronicity” constraint
• Let unified critical path analysis drive improvement in the clock tree, but
only as required
• Let useful skew help
• Maximum skew doesn’t matter except when in series with top critical
paths
• Make sure analysis includes real max/min path combinations for every
clock check
Effectively, this is making a synchronous design
more asynchronous, or GALS-like, even though
regions are still using all the same root clock.
Setup and Hold Checks
[Figure: Launching Register (D, Q) → Combinational Logic → Receiving Register (D, Q); the Clock reaches both registers through a common part up to the Reconvergent Node; red and green paths are highlighted]
Setup Path Check: Red < Green + Period
Hold Path Check: Red > Green
Observe: No explicit clock skew budget needs to be added, because path tracing already includes effect of any arrival time differences.
Observe: Variability in the “Common part” has no effect on the setup and hold races for this pair of registers.
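
A minimal sketch of this kind of check, assuming a clock tree stored as a child-to-parent map with a (min, max) delay on each segment. The data layout, function names, and all delay values are hypothetical; the slack formulas use the max/min combinations listed later in the deck for rigorous setup and hold checking.

```python
from typing import Dict, List, Tuple

Delay = Tuple[float, float]  # (min, max), same time unit everywhere


def path_from_root(node: str, parent: Dict[str, str]) -> List[str]:
    """Clock-tree nodes from the root down to `node`."""
    path = [node]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path[::-1]


def divergent_clock_delay(launch: str, capture: str, parent: Dict[str, str],
                          seg: Dict[str, Delay]) -> Tuple[Delay, Delay]:
    """(min, max) clock delay to each register, counted only after the
    reconvergence point; seg[n] is the delay of the segment driving node n."""
    lp, cp = path_from_root(launch, parent), path_from_root(capture, parent)
    k = 0
    while k < min(len(lp), len(cp)) and lp[k] == cp[k]:
        k += 1  # skip the common part of the two clock paths

    def total(nodes: List[str]) -> Delay:
        return (sum(seg[n][0] for n in nodes), sum(seg[n][1] for n in nodes))

    return total(lp[k:]), total(cp[k:])


def setup_hold_slack(launch, capture, parent, seg, logic: Delay,
                     clk_to_q: Delay, setup: float, hold: float, period: float):
    (l_min, l_max), (c_min, c_max) = divergent_clock_delay(
        launch, capture, parent, seg)
    # Setup race: latest launch versus earliest capture one period later.
    setup_slack = (c_min + period) - (l_max + clk_to_q[1] + logic[1] + setup)
    # Hold race: earliest launch versus latest capture on the same edge.
    hold_slack = (l_min + clk_to_q[0] + logic[0]) - (c_max + hold)
    return setup_slack, hold_slack


# Hypothetical tree: root -> a -> b -> regL and a -> c -> regC.
parent = {"a": "root", "b": "a", "c": "a", "regL": "b", "regC": "c"}
seg = {"a": (0.4, 0.6), "b": (0.1, 0.15), "c": (0.1, 0.2),
       "regL": (0.05, 0.07), "regC": (0.05, 0.08)}
slacks = setup_hold_slack("regL", "regC", parent, seg, logic=(0.3, 1.2),
                          clk_to_q=(0.1, 0.15), setup=0.1, hold=0.05, period=2.0)
print(slacks)
```

Changing the delay of segment `a`, which lies in the common part, leaves both slacks unchanged, which is the second observation above.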
Why do we have to be careful?
We have to be sure our path-tracing accounts for all
possible paths for both setup and hold checks.
With increasing divergence in actual capacitance due to
various factors, EVERY computation needs to be
viewed as requiring analyses using both min and max
values.
If we miss any paths that did have large differences in
clock arrival times at launching and receiving registers,
we would suffer functional failures (hold violations) or
performance loss (setup violations).
Old method even worse: Adding large enough additive
margin to account for biggest possible difference
between all longest max and shortest min paths
Why is this method still hard?
Most static timing analyzers today still use the model of
initiating all path tracing at the Q outputs of registers.
Then to do this type of analysis, they “correct” the reported
paths by the differences in the traced arrival times.
But the correction must be applied to all paths, prior to
sorting, to be sure the top-N list (that we need to optimize
and fix) really is right.
Also, back-tracing must stop at point of clock-reconvergence.
Applying separate min/max delays to the common part of the
clock tree would be an incorrect and overly severe penalty.
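
To illustrate why the correction has to come before sorting, here is a small sketch. The record fields (`raw_slack`, `launch_clk`, `capture_clk`) are hypothetical, and single arrival values are used for brevity; a fuller version would use the max launching and min receiving clock delays, measured from the reconvergence point.

```python
def top_n_setup_paths(paths, n):
    """Return the n worst (name, corrected setup slack) pairs.

    raw_slack is assumed to be the setup slack traced from the launching Q
    with ideal clocks; launch_clk and capture_clk are traced clock arrivals.
    """
    corrected = [
        (p["name"], p["raw_slack"] + p["capture_clk"] - p["launch_clk"])
        for p in paths
    ]
    # Sort only after every path has been corrected.
    return sorted(corrected, key=lambda item: item[1])[:n]


paths = [
    {"name": "p1", "raw_slack": 0.10, "launch_clk": 0.50, "capture_clk": 0.55},
    {"name": "p2", "raw_slack": 0.30, "launch_clk": 0.80, "capture_clk": 0.40},
    {"name": "p3", "raw_slack": 0.20, "launch_clk": 0.50, "capture_clk": 0.50},
]
worst = top_n_setup_paths(paths, 2)
print(worst)
```

In this made-up data, `p2` has the most raw slack and would be dropped from a top-2 list sorted on raw values, yet it is the worst path once its clock arrival difference is applied.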
Now examine in more detail how to handle min/max issues
and where the spreads come from…
Real circuits aren’t digital
Physics → Nothing ever is instantaneous in real world. Gates don’t “trigger”
Actual signal transition times (often defined as 10% to 90%) are 3x to 10x gate “delays”
[Figure: successive signal waveforms along a path, swinging between Gnd and Vdd and crossing a “Threshold” level; delays t1, t2, t3 are marked between them]
Can even move threshold definition and change individual delays, but still get same sum through a path.
Speed versus “noise margin” tradeoffs, esp. if not full-swing, such as sense amps
Asynchronous designs shouldn’t have cycle times less than transition times!
Accuracy versus uncertainty
Physics → Every quantity has an error bar
A value quoted as a single number is always wrong!
Calculated delays always have a range due to:
• Fabrication parameter spreads (transistors, metals, dielectrics) ← process corners
• Aggressor/victim signal couplings ← address explicitly
• Inaccuracies in capacitance extraction
• Inaccuracies in delay calculation
• Switching threshold approximations
• Power supply noise, including inductive bounces
• Power supply resistive voltage drops
(address the last five by multiplicative margin factors)
Don’t get bogged down in 3% “correlation to spice” when
already ignoring many 10% factors
Margin Types
Little or no margin
• Most common reality is that design teams run STA with just one configuration, and strive (fruitlessly) to make it “accurate”.
Explicit spread between min/max
• Once methodology enhanced to have both min/max calculations, can choose distinct values differently for known effects.
• Appropriate for aggressor cross-coupling where definitive calculations are possible to get relative magnitudes.
Additive margin
• Adding an extra pessimistic value into each setup path and hold path check.
• Independent of path length, or namely “once” per path.
• Additive margin is appropriate to model:
  • Inaccuracies in setup and hold characterization.
Multiplicative margin
• Widen min/max range by multiplying/dividing each capacitance, resistance, or delay by a pessimistic factor.
• Since “per gate”, it gains conservatism proportional to path length.
• Multiplicative margin (sometimes called applying a “library derating factor”) is appropriate to model:
  • Inaccuracies in capacitance extraction
  • Inaccuracies in library or delay calculation
  • Power supply noise, including inductive bounces
  • Power supply resistive voltage drops
Statistical margin
• Gives each data value an error bar, and propagates the potential errors assuming a symmetric gaussian distribution.
• Statistical margin is appropriate for issues that are statistical, such as:
  • Fabrication parameter spreads (transistors, metals, dielectrics)
• Assumes long paths will see cancellation of effects, which may not be true depending on the direction of on-die process “tilt”.
• Not the right substitute for min/max of issues that may not actually cancel in long paths.
Corner “margin”
• Replication of entire computation set with different fabrication/circuit parameters.
• Appropriate for process, because vendors specify representative points (typically hot/typ/cold for libraries, and true circuit
analysis will also include the cross-corners: fast-n/slow-p and fast-p/slow-n)
• Not a substitute for the other types of margin that need to express spreads within a corner, but we do want to run again
using all the other types of margin at every process corner.
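
A rough sketch of how additive and multiplicative margin behave differently with path length. The function name, the factor of 1.25, and the 0.05 additive value are assumptions for illustration only.

```python
def path_delay_bounds(stage_delays, mult_factor, additive):
    """Return (min, max) path delay after applying both margin types.

    The multiplicative factor is applied per stage (equivalent to scaling
    the sum), so its effect grows with path length; the additive margin is
    applied once per path.
    """
    nominal = sum(stage_delays)
    d_max = nominal * mult_factor + additive
    d_min = nominal / mult_factor - additive
    return d_min, d_max


# Hypothetical paths of 4 and 8 identical stages (ns).
short = path_delay_bounds([0.25] * 4, mult_factor=1.25, additive=0.05)
long_ = path_delay_bounds([0.25] * 8, mult_factor=1.25, additive=0.05)
print(short[1] - short[0], long_[1] - long_[0])
```

Doubling the path length doubles the part of the spread that comes from the multiplicative factor, while the additive part stays fixed.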
Margin summary: Use all together!
• Explicit min/max spread for aggressor coupling
• Additive margin at register endpoints
• Multiplicative margin for all gate & wire delays in a path
• Statistical margin for on-chip process tilt
• Corner “margin” for die-to-die variation
Timing analysis should use all of the margins together, since each is
appropriate for different issues.
At each corner - all of the other margins should be applied
(it is unnecessary to combine min/max across corners, which can often differ by large factors)
For asynchronous designs, margin analysis is still needed to quantify the
performance, and thus to verify the meeting of the performance targets.
Now more detail about what goes into the individual margins…
Effective coupling capacitance
[Figure: Effective Capacitance Ratio (vertical axis, ticks -1, 0, 2, 3) plotted against Slew Ratio (horizontal axis, ticks 0.2, 1, 5). Curves: opposing aggressors (aggressor switching in opposite direction as victim signal), no aggressor, and aiding aggressor (aggressor switching in same direction as victim signal). Inset: Voltage versus Time waveforms.]
Effective Capacitance Ratio = (Effective capacitance of switching aggressor) / (Capacitance of quiet aggressor)
Slew Ratio = (Signal transition time of victim) / (Signal transition time of aggressor)
Complementary Min/Max delays
Using a single delay value is never right!
Example: Compare two clock trees with the “same”
computed delay, but of greatly differing height.
The shorter one is better (less min/max spread) but a
single number doesn’t convey the advantage.
Every delay really within a range, so bound by min/max
Every constraint has a “dual”
Every calculation needs the “pessimistic” combination
Unfortunately, often a significant CAD tool change:
• Doubling size of many data structures
• Must pick the correct combinations
Choosing Min/Max values
Max C: $(1+x) \cdot (C_{\text{supply (vdd/vss)}} + 2 \cdot C_{\text{signal-cross-coupling}})$
Min C: $(1-x) \cdot (C_{\text{supply (vdd/vss)}} + 0 \cdot C_{\text{signal-cross-coupling}})$
Using these Max and Min RC values in all Setup/Hold
checks back to point of clock reconvergence forces
attention to both potential data and clock improvements
without paying global arbitrary penalties.
• More sophisticated capacitance munging can take into account some actual
ratios of correlated aggressor coupling to total coupling.
• Min and Max variations (1 +/- x) factors can apply to resistances too.
• Multiplicative factors more important than simple additive ps margins.
• These RC changes are not meant to handle process corners, but to be
applied within a process corner.
• All comparison is done within a process corner, but multiple corners used for
the checks (setup interesting at cold process, hold interesting at hot process)
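
The two capacitance bounds can be written directly as a small helper. The function name and the example values are hypothetical; the formulas are the Max C and Min C expressions from this slide.

```python
def cap_bounds(c_supply, c_coupling, x):
    """Return (min, max) effective capacitance for one net.

    c_supply   : capacitance to vdd/vss
    c_coupling : total signal cross-coupling capacitance
    x          : fractional multiplicative margin, e.g. 0.1 for +/-10%
    """
    c_max = (1 + x) * (c_supply + 2 * c_coupling)  # aggressors oppose
    c_min = (1 - x) * (c_supply + 0 * c_coupling)  # aggressors aid
    return c_min, c_max


# Hypothetical net: 10 fF to supply, 5 fF of coupling, 10% margin.
c_min, c_max = cap_bounds(10.0, 5.0, 0.1)
print(c_min, c_max)
```

The same (1 +/- x) pattern could be applied to resistances, as noted in the bullets above.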
Clock tree tuning based on min/max path tracing
Tuning to fix paths can be either in comb logic or in clock trees
Registers on neighboring clock branches see little clock skew.
Registers on remote clock branches see big skew, and it is real!
Without doing this analysis, we might be inserting delay to “match
skew” when it actually hurts performance or conservatism.
Applies to clock trees both within blocks and again at top level.
Tuning of clock trees and delay matching to account for varying
block insertion delays is done at typical process, but analysis is
seen at the corners.
Practical details with today’s tools
Need to truly choose all min delays and max delays correctly.
Static timing analysis must have both min and max delay available
for each gate transition in the same timing run.
Current versions of PrimeTime do not allow multiple SPEF
capacitance datapoints for the same node, so separate delay
calculation runs must pre-compute SDF delays, which can then be
loaded to simultaneously specify multiple min/max wire delays for
each node.
With min/max SDF delay data on every node, PrimeTime can then
choose the correct ones along clock paths leading separately to
launching and receiving registers, using the mode called “on-chip
variation”.
So, each block needs three PrimeTime runs (2 for SDF calc + 1
main) for each voltage/temperature/process operating condition.
For (cold,typ,hot) process, use a total of 9 runs at block level.
Rigorous Setup and Hold checking using simultaneous min/max delays
[Figure: Launching Register (D, Q) → Combinational Logic → Receiving Register (D, Q), with the clock feeding both registers]
Max RC includes 2x all potential signal cross-coupling capacitances
Min RC contains only capacitance to vdd/vss/substrate
All annotated delays include additional multiplicative margin to account for:
• tool inaccuracies
• on-chip process tilt
• IR-drop supply variation

Setup check                                   Hold check
Max RC in clock feeding Launching Register    Min RC in clock feeding Launching Register
Register clk->q prop delay                    Register clk->q prop delay
Max logic choices in Combinational Logic      Min logic choices in Combinational Logic
Max RC in Combinational Logic                 Min RC in Combinational Logic
Min RC in clock feeding Receiving Register    Max RC in clock feeding Receiving Register
Library Setup spec for Register               Library Hold spec for Register + margin
Handling min/max throughout hierarchy
Currently, commercial Static-Timing-Analysis tools have “on-chip-variation”
modes that allow the correct choices of min/max
delays for setup/hold path tracing within a block only.
For “true-hierarchical” design, must create a parent-level timing run
that does not have to see all the details in every child block.
But, “on-chip-variation” modes that work in blocks aren’t
implemented for creation of abstract timing models for use in a
parent.
So, need to more explicitly force the selection of correct min/max
combinations:
• Annotate mixes of min/max capacitance prior to timing block abstraction
• Using the RC data (.spef) instead of pre-computed delays (.sdf) for model abstraction has
additional advantage that the abstract enables expression of dependencies on input edge-rate
and output loads.
• Consider the different possible arcs that can be generated…
Abstraction of paths into constraint arcs
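[Figure: a block with Input Port, Output Port, and Clock Port, containing registers (D, Q) and combinational logic, is abstracted into three arc types: a Combinational Through-Timing Arc, a Setup/Hold Timing Arc, and an Output Propagation Timing Arc, the latter two referenced to the Clock Port]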
No single choice of min/max net delay annotation is sufficient
[Figure: register-to-register paths crossing a hierarchical boundary, with each combinational logic and clock segment annotated min or max; one clock segment is marked “min or max ?”]
min/max values shown are the ones that would be needed for the parent HOLD check run
No single choice of min/max net delay annotation is sufficient
[Figure, top: block with Input Port, Output Port, and Clock Port in which every combinational logic and clock segment is annotated max. All-max model, often used for setup checks, but still not correct.]
[Figure, bottom: the same block with every segment annotated min. All-min model, often used for hold checks, but still not correct.]
Take annotation from 4 separate pre-processed capacitance
sets, and recombine into 2 models per process corner
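[Figure, top: model needed for setup check in parent. Data paths from the Input Port and to the Output Port are annotated max; the clock path to the input-side register is annotated min and the clock path to the output-side register is annotated max.]
[Figure, bottom: model needed for hold check in parent. Data paths from the Input Port and to the Output Port are annotated min; the clock path to the input-side register is annotated max and the clock path to the output-side register is annotated min.]
Each grouping is a pre-processed timing analysis run type, from which arcs are pulled to create these combined models for use in a parent run.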
Need Multiple STA runs to get all arcs
Need a total of 4 static-timing-analysis runs for each of
the min/max groupings, for each process corner.
For example, to cover (cold,typ,hot) process corners,
need 4*3 = 12 block characterization runs.
The timing arcs for block_input → clock and
clock → block_output taken from above runs, but must
use external timing arc swizzling to combine (from
different block runs) the right arcs into the models
needed for the parent-level setup and hold runs.
Always combine arcs only from the same process
corner. Keep process corners separate, but perform
same swizzling of min/max arcs again at each corner.
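
The swizzling step might be sketched like this for one process corner. The dictionary layout, the key order (data annotation, clock annotation), the arc names, and the numbers are all hypothetical; the choice of which run feeds which model follows the min/max groupings shown in the setup-check and hold-check model figures.

```python
def swizzle_models(runs):
    """Build parent-level setup and hold models for one process corner.

    runs maps (data_annotation, clock_annotation) to the arcs extracted
    from one block-level timing run at that corner.
    """
    setup_model = {
        # Input constraint: latest data against earliest capturing clock.
        "input_to_clock": runs[("max", "min")]["input_to_clock"],
        # Output propagation: latest clock plus latest data.
        "clock_to_output": runs[("max", "max")]["clock_to_output"],
    }
    hold_model = {
        # Input constraint: earliest data against latest capturing clock.
        "input_to_clock": runs[("min", "max")]["input_to_clock"],
        # Output propagation: earliest clock plus earliest data.
        "clock_to_output": runs[("min", "min")]["clock_to_output"],
    }
    return setup_model, hold_model


# Hypothetical arc values (ns) from the four runs of a single corner.
runs = {
    ("max", "min"): {"input_to_clock": 0.9, "clock_to_output": 1.1},
    ("max", "max"): {"input_to_clock": 0.6, "clock_to_output": 1.4},
    ("min", "max"): {"input_to_clock": -0.2, "clock_to_output": 0.9},
    ("min", "min"): {"input_to_clock": 0.1, "clock_to_output": 0.7},
}
setup_model, hold_model = swizzle_models(runs)
print(setup_model, hold_model)
```

Calling the function once per corner, each time with only that corner's four runs, keeps the process corners separate.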
Recombine arcs into setup-check model
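[Figure: setup-check model. The block (Input Port, Output Port, Clock Port) with data paths annotated max, the clock path to the input-side register annotated min, and the clock path to the output-side register annotated max is reduced to a Setup Constraint Timing Arc and an Output Propagation Timing Arc, both referenced to the Clock Port.]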
Recombine arcs into hold-check model
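[Figure: hold-check model. The block (Input Port, Output Port, Clock Port) with data paths annotated min, the clock path to the input-side register annotated max, and the clock path to the output-side register annotated min is reduced to a Hold Constraint Timing Arc and an Output Propagation Timing Arc, both referenced to the Clock Port.]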
Parent runs then get correct min/max treatment recursively
[Figure: Launching Register (D, Q) → Combinational Logic → Receiving Register (D, Q), with the clock feeding both registers]
“Registers” can now be whole Mgate blocks
Max RC includes Miller factors for all potential signal cross-coupling capacitances
Min RC contains only capacitance to vdd/vss/substrate
All annotated delays include additional multiplicative margin to account for:
• tool inaccuracies
• on-chip process tilt
• IR-drop supply variation

Setup check                                   Hold check
Max RC in clock feeding Launching Register    Min RC in clock feeding Launching Register
Register clk->q prop delay                    Register clk->q prop delay
Max logic choices in Combinational Logic      Min logic choices in Combinational Logic
Max RC in Combinational Logic                 Min RC in Combinational Logic
Min RC in clock feeding Receiving Register    Max RC in clock feeding Receiving Register
Library Setup spec for Register               Library Hold spec for Register + margin
Benefits of full-complementary timing analysis
More Robust, even for large integration complexity
• Both additive and multiplicative margin built into delay equations.
• Margins built into all setup/hold checks ensure functionality and timing performance
corresponding to simulation.
• Let min/max methodology encompass clock-skew judgement, to take advantage of useful
skew, and not penalize irrelevant skew.
Lower-power clock distribution
• Different blocks can have different clock-distribution skew requirements
• Enables tall-thin clock trees with reconvergent tracing of launching and receiving clock trees,
applying min/max spread only where paths diverge.
• Increased shippable yield throughout process spread.
• Can take advantage of useful clock skew between registers and blocks
Automated, simultaneous setup + hold improvements
• Normal critical path analysis and optimization improves clock-tree as well as datapaths.
• Can allow buffer insertion and gate-sizing tools (example: Sequence Design PhysStudio) to
improve clock tree as well as combinational logic.
• Delay buffers for hold-fixes inserted at points of maximum setup slack.
• Timing closure: touching fewer items each pass (primarily closure of only hold violations in
later passes).
Summary: Min/Max Flow vs. Traditional Additive Margins
By adding min/max computation into the fundamental path timing
computation, normal critical path sorting will show where there is
the least slack, allowing optimization and designer attention to fix it.
This single robust method automatically handles:
• Analyzing clock tree together with the downstream data paths.
• Allowing different blocks to have different clock-distribution skew
requirements (lower power & area, avoiding unnecessary gates and work)
• Taking advantage of useful clock skew between registers (automatic)
• Taking advantage of useful clock skew between blocks (crafted choices,
but automatic verification)
• Requires no special casing for gated clocks that insert different local
skew. Control signal for gated clock must come from a register inside
block so its own setup/hold checks are handled at block level.
Min/Max timing analysis applies to Async performance
analysis too, even if not required for correctness.
Myth : Clock skew will stop synchronous progress
Problem: Clock skew does get worse in finer geometry
technologies, and with increasingly complex designs.
Reality: Methods that handle the skew analysis along
with other path analysis can tackle this problem, and
enable increasingly higher-frequency synchronous
designs.
The separation of logic verification and timing
verification will continue to promote synchronous
designs.
Myth : Asynchronous designs are safer
Problem: Process variation does get worse in finer
geometry technologies.
Variations in actual operating conditions (process,
voltage, temperature, coupling) do affect circuit speed.
Reality: Consistently handling min/max delays and RC
data can tackle this problem, and enable increasingly
robust synchronous designs too.
Analysis of the expected worst-case is always needed
anyway, to ensure correct functional behavior of the
enclosing system.
Myth : Completely new tools needed to handle Signal Integrity
Problem: Aggressor cross-coupling (as a percentage of
total capacitance) does get worse in finer geometry
technologies.
Reality: Aggressor couplings are one piece of delay
variation that can be accounted for using min/max RC
data in full-complementary timing analysis.
Min/Max analysis does not necessarily require the use
of “Primetime-SI” or any other “signal-integrity” tool.
These tools often haven’t implemented a multiplicative
min/max spread to account for extraction inaccuracy.
But, these tools do have advantage of more
sophisticated exclusion of aggressor couplings that
can’t occur due to non-overlapping switching windows.
Myth : Statistical timing takes the place of margins
Problem: A new theme is that variability in delays should be
handled by timing analysis that projects every delay to be statistical.
Good in the sense of adding an “error bar”, but bad by creating
impression other margins are no longer needed, even though they
are more appropriate for the calculable known effects, such as
aggressor coupling and power-supply drops.
Reality: When propagating through logic, min/max values may not
be gaussian or even symmetrical.
While statistical tools can expose some of the same potential faults,
they won’t be guaranteed to find them as exhaustively as the
min/max bounded approach will. The best use is to combine both
approaches, using a component of statistical variation on top of the
min/max approach for the other effects (such as aggressor coupling
and IR-drops) that won’t cancel out in long paths.
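
A short numerical sketch of why the two approaches diverge on long paths. It assumes identical stages and, for the statistical case, independent per-stage errors that add in quadrature; the per-stage value of 0.01 is arbitrary and the function names are illustrative.

```python
import math


def statistical_spread(per_stage_sigma, stages):
    """Spread of a path when independent per-stage errors add in quadrature."""
    return math.sqrt(stages) * per_stage_sigma


def bounded_spread(per_stage_delta, stages):
    """Spread of a path when every stage shifts the same way (no cancellation)."""
    return stages * per_stage_delta


for n in (4, 16, 64):
    print(n, statistical_spread(0.01, n), bounded_spread(0.01, n))
```

The statistical estimate grows with the square root of path length while the bounded estimate grows linearly, so effects that do not cancel, such as aggressor coupling or IR-drops, are under-covered if treated as purely statistical.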
Myth : Useful clock skew can only apply inside a block
Problem: Previous algorithms for taking advantage of
useful skew only help for individual signals.
Reality: It is possible to deliberately skew the arrival
time of the clock to whole regions, to adjust for the
harder direction of data travel. The fully-complementary
timing analysis can still completely verify the
interconnect timing across hierarchy.
[Figure: Block A and Block B linked by Data buses C and Data buses D, with Clock feeding the blocks]
Myth : Async design means not thinking about timing
Problem: In an isolated sense, a chain of asynchronous
logic will still work independent of actual timing, but not
thinking about timing will almost assuredly mean
inadequate performance.
Reality: Good performance doesn’t come for free.
Even asynchronous pipelines must analyze timing
latency and throughput to ensure desired operation.
Myth : QDI design makes circuits independent of wire delay
Problem: QDI (Quasi-Delay-Insensitive) design focuses
on correctness, but can ignore performance.
Reality: The performance is still dependent on the
actual wire distances (delays).
QDI misses the point by hiding the issue, rather than
exposing it.
Without highlighting of performance loss, insufficient
attention brought to wire distances and floorplanning.
Even after insertion of async handshaking buffers, still
need “performance closure”.
Myth : Performance closure is easier in Async design
Problem: Performance through an asynchronous
pipeline is constrained by the local cycle times of
handshake loops, and forward and reverse latencies of
each stage. Any of these values can be influenced by
physical wire distance, or mis-sized gates.
Asynchronous designs also have “critical paths” that
determine performance, and they often aren’t as cleanly
partitioned, and therefore require more thought to find
and care to fix.
Reality: Synchronous designs allow better separation
of issues, and parallelizable effort.
Myth: Asynchronous designs are more robust across operating
voltage changes
Problem: Synchronous libraries are typically
characterized at only specific voltages.
But, the fundamental switching digital circuits do work
over the same ranges in both async and sync design.
The multiplicative factor in min/max complementary
timing can handle non-linear differences in the scaling
of delays with voltage for different gate types.
Reality: The real problem is in generation of the clock
of exactly the right frequency for the synchronous circuit
to just work. The fact that asynchronous designs are
“self-adjusting” is a valid advantage.
Myth : Timing should use instance-specific power-supply voltages
Problem: Power-supply IR drop does affect delay
timing (almost linearly). If ignored, this can cause
incorrect timing analysis.
Reality: The min/max timing method should choose a
multiplicative factor that does account for the potential
range of power-supply drops.
This will ensure correct robust handling.
But, it is silly to rely upon the drops for correct operation,
which could happen if a timing tool was trying to
calculate a single “corrected” delay accounting for
power-supply drop, without doing min/max bounding.
Myth : Async circuits lower power due to being data-driven
Problem: Circuits that transition when there is no need
for new data, can waste power.
Reality: Much of the benefit of data-driven transitions is
attainable by clock gating in synchronous systems.
Further, dual-monotonic (dual-rail) pairs, often used in
asynchronous precharged logic, can actually consume
more power because they always must have one
polarity transition, instead of synchronous logic that
need not transition if the data values are unchanged.
Synchronous designers believe myths too
Myth : Clock gating introduces too much skew
Problem: Inserting clock gating elements into the clock
tree does change the distribution latency, and can
complicate balancing for matched arrival times.
Also, fine-grained gating introduces more irregularity,
making it less possible to match delays by replication.
Reality: The fully-complementary timing approach
enables easier clock-gating because it doesn’t have to
be an exception to the general timing methodology, and
doesn’t add to a skew budget where it is not used.
The normal min/max analysis and tracing of paths
(through potential clock-gating elements) back to point
of clock-tree reconvergence fully analyzes this, and
therefore allows optimization.
Myth : Chip area is determined by transistor count
Problem: Design complexity historically measured through transistor
count, or effective gate count. Density is usually quoted in gates/mm2,
with the implication of linear scaling.
Reality: Transistors (gates) are the free objects that sit under the
wires. Density of logic is determined by wire connectivity
requirements, for both sync & async.
Maximum total wire density is:
(num effective metal layers) / (wire pitch) = (wire length) / area
Example: A 6-layer process may effectively have 4 usable layers.
For a wire pitch of 0.5 µm, max density is:
4 / 0.5 µm = 8 wires per µm of width = 8000 mm wire / mm² = 8000 mm⁻¹
A 1.25 cm² chip would have a maximum total wire length of 1 km.
(Analogous to “ideal” quoted density of back-to-back packed nand gates)
Typically, routed regions will be lucky to get a third to half of this.
Quoting wire-utilization density also is a fairer way of expressing top-
level area and complexity, where there may be few gates.
Density measurements in wirelength/area better than gates/area
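The arithmetic on this slide can be checked with a short script. The function name is an illustrative assumption. The layer count, pitch, and chip area are the slide's own example values.

```python
def max_wire_density_mm_per_mm2(usable_layers, wire_pitch_um):
    """Upper bound on routed wire length per unit area.

    usable_layers / pitch gives wires per micron of width; multiplying by
    1000 um/mm converts that to mm of wire per mm^2 of area.
    """
    wires_per_um = usable_layers / wire_pitch_um
    return wires_per_um * 1000.0


density = max_wire_density_mm_per_mm2(usable_layers=4, wire_pitch_um=0.5)
chip_area_mm2 = 1.25 * 100.0                   # 1.25 cm^2 = 125 mm^2
total_wire_km = density * chip_area_mm2 / 1e6  # 1e6 mm per km

print(density)        # 8000.0
print(total_wire_km)  # 1.0

# Routed regions typically reach only a third to a half of the bound.
typical_km = (total_wire_km / 3, total_wire_km / 2)
```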
Myth : “Synchronizers” fix asynchronous interfaces
Problem: Any time a signal of unknown phase is
sampled (including in an asynchronous arbiter), it might
be transitioning just at the wrong time, causing an
intermediate voltage value to be trapped.
Metastability can result in the trapped voltage persisting
indefinitely, with recovery based on the time constants
of the feedback loops to amplify the node away from its
midpoint.
Failures occur at the point where the signal branches,
and different receivers treat the analog voltages
differently.
Reality: Using multiple registers in series reduces the
failure probability exponentially by allowing recovery times
greater than a single clock cycle, but it never results in a
zero failure probability.
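The exponential improvement can be sketched with the commonly used metastability model MTBF = exp(t_r / tau) / (T_w * f_clk * f_data), where t_r is the allowed recovery time, tau the regeneration time constant of the feedback loop, and T_w the vulnerable sampling window. This model and every numeric value below are assumptions for illustration. The talk does not give a formula or device parameters.

```python
import math


def synchronizer_mtbf_s(resolve_time_s, tau_s, window_s, f_clk_hz, f_data_hz):
    """Mean time between failures (seconds) under the assumed model."""
    return math.exp(resolve_time_s / tau_s) / (window_s * f_clk_hz * f_data_hz)


# Hypothetical parameters, not taken from the talk.
F_CLK = 500e6      # sampling clock, Hz
F_DATA = 10e6      # rate of asynchronous input transitions, Hz
TAU = 50e-12       # regeneration time constant, s
WINDOW = 100e-12   # vulnerable sampling window, s
PERIOD = 1.0 / F_CLK

# Each extra register in series adds roughly one clock period of recovery time.
mtbf_by_cycles = {
    n: synchronizer_mtbf_s(n * PERIOD, TAU, WINDOW, F_CLK, F_DATA)
    for n in (1, 2, 3)
}
for n, mtbf in mtbf_by_cycles.items():
    print(f"{n} cycle(s) of recovery time: MTBF ~ {mtbf:.3g} s")
```

Each added cycle multiplies the MTBF by exp(PERIOD / TAU), yet the result stays finite for any number of stages, which is the slide's point.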
Valid: Asynchronous better enables precharged logic
Problem: In synchronous design, there is always a
challenge in how to generate precharge control signals.
Common methods either add an interval at the
beginning or end of a clock cycle, or use a clock phase
for precharging, wasting parts of a clock cycle.
Reality: Asynchronous handshake signals are natural
sources for precharge control. Dynamic precharged
logic is more self-consistent with asynchronous design,
and can provide a 2x performance advantage compared
with fully-complementary static logic.
Valid : Good timing principles apply equally to all circuits
Timing dominated by issues outside of gate internals
(wire RC, aggressor coupling, power supply variations).
Physics: Every number has an uncertainty, and
computations should use these bounds.
Min/Max margin analysis puts design improvements
where they really do add safety and performance, for
both sync and async designs.
Analysis of arrival time differences in distribution of
high-fanout clocks and asynchronous control signals.
Delivered performance is determined by signal timing.
Summary: Similar palettes for both
SoC design performance is all about Floorplanning and
Architecture that accounts for physical on-chip distance.
Gate-sizing and repeater insertion (“wire-synthesis”) must be
automated, because they are now fundamental.
Complexity and area are determined by wires, not transistors.
Domain crossings and sampling of unknown-phase signals must
always be handled with care, with correct synchronizers, and
quantified metastability MTBF.
Varying the power supply voltage allows speed/power tradeoff.
Optimizing transition count is optimizing power consumption.
Removing unnecessary low-skew and synchronization constraints
(at register clock pins) allows design to focus improvement where it
matters, and is a step toward more general GALS design.
Conclusions
Many of the historically claimed advantages of
asynchronous design are really Myths, because they
can be solved with equivalently good solutions in
synchronous design.
Example: Clock skew effectively handled through
better techniques that minimize overall penalty.
Asynchronous design can be advantageous for
performance in self-timed precharged pipelines.
Asynchronous designers tend to be good circuit
designers because they don’t get overly
constrained by thinking only in a rigid synchronous
framework.
