[Figure 1: paired plots versus technology (micron)/year of first shipment, from 0.25 micron in 1997 down to 0.05 micron in 2012; panel (b) plots global and local clock frequency in MHz.]
Figure 1. The National Technology Roadmap for Semiconductors: (a) total transistors per chip, (b) on-chip local clock.
12 IEEE MICRO
it became obvious that 1-GHz processors were feasible immediately—not between 2005 and 2010, as projected earlier. The revisions arise from coordinated improvements in scaling, circuit design, and CAD tools.
Feature sizes below 0.1 micron lead to other technical challenges. High-performance processors, which could consume as much as 175 W of power, need special cooling techniques. Enabling GHz signals to travel into and out of the chips requires new circuit designs and algorithms. Preventing latch-up and reducing noise coupling may require new materials such as silicon-on-insulator. Similarly, reducing cross talk and DRAM leakage may require low-κ and high-κ materials, respectively. These challenges demand that designers provide whole-system solutions instead of treating logic design, circuit design, and packaging as independent phases of the design process.

[Figure 3: a three-dimensional trade-off space with power, area, and time axes, showing the server and client processor design regions and the bounds PT³ = k and ATⁿ = k.]
Figure 3. Area, performance (time), and power trade-off trends in server and client processor designs.
Client versus server processors
In the era of deep-submicron technology, two classes of microprocessors are evolving: client and server processors. The majority of implementations are commodity system-on-chip client processors (including network and embedded processors) devoted to end-user applications such as personal computers. These highly cost-sensitive processors are used extensively in consumer electronics. Individual applications may have specific requirements; for example, portable and wireless applications require very low power consumption. The other class consists of high-end server processors, which are performance driven. Here, other parts of the system dominate cost and power issues.
At a fixed feature size, area can be traded off for time. VLSI complexity theorists³ have shown that an ATⁿ bound exists for microprocessor designs, where n usually falls between 1 and 2. By varying the supply voltage, as we show later, it is also possible to trade off area A for power P with a PT³ bound. Figure 3 shows the possible trade-off involving area, time T, and power in a processor design. Client and server processors operate in different design regions of this three-dimensional space. The power and area axes are typically optimized for client processors, whereas the time axis is typically optimized for server processors. Until recently, a single design point existed for both servers and clients, based on economy of scale. With added computational resources (in area and time) becoming available through improved feature sizes, the movement to establish separate design points—differentiated by power—seems inevitable.
An important dimension—computational integrity—does not appear in Figure 3. Later we discuss its nature and effects on area, time, and power.

Time considerations
Microprocessor performance has improved by approximately 50% per year for the past 15 years. This can be attributed to higher clock frequencies, deeper pipelines, and improved exploitation of instruction-level parallelism (ILP). In the deep-submicron era, we can expect performance improvement to result largely from reducing cycle time at the expense of greater power consumption.

JULY–AUGUST 1999 13
DEEP-SUBMICRON DESIGN

Cycle time
Processor clock frequencies have increased by approximately 30% per year for the past 15 years, due partly to faster transistors and partly to fewer logic gates per cycle. Traditionally, digital designs have used edge-triggered flip-flops extensively. Such a system's cycle time Tc is determined by Tc = Pmax + C, where Pmax is the maximum delay required for the combinational logic, and C is the total clock overhead, including setup time, clock-to-output delay, and clock skew. For high-end server processors, the SIA¹ predicts that the clock cycle time will decrease from roughly 16 FO4 (fanout-of-four) inverter delays at present to roughly five FO4 inverter delays at 0.05-micron feature size. As a result, clock overhead takes a significant fraction of the cycle time, and flip-flop clocking systems appear infeasible. Fortunately, a number of circuit techniques can improve deep-submicron microprocessor cycle times:

• Several new flip-flop structures, such as sense-amplifier-based, hybrid latch, and semidynamic flip-flop, have been proposed to lower clock overhead.
• Asynchronous logic⁴ eliminates the need for a global clock. Average latency depends on Pmean (average logic delay) instead of Pmax (maximum logic delay), but completion detection and data initialization incur significant overhead.
• Various forms of dynamic logic⁵ have been proposed to minimize the effects of clock skew. These techniques reduce clock overhead at the expense of power and scalability.
• Wave pipelining⁶ uses Pmin (minimum logic delay) as a storage element to improve cycle time.

We use wave pipelining to illustrate some of the new clocking considerations. For memories and other functional blocks that contain regular interconnect structures, wave pipelining is an attractive choice. The technique relies on the delay inherent in combinatorial logic circuits. Suppose a given logic unit has a maximum interlatch delay of Pmax and a corresponding minimum delay of Pmin with clock overhead C. Then the fastest achievable cycle time ∆t equals Pmax − Pmin + C. As with conventionally clocked systems, system clock rate Tc is the maximum ∆ti over i latched stages. Sophisticated CAD tools can ensure balanced path delays and thus improve cycle time. In practice, using special CAD tools lets us set Pmin to within about 80% to 90% of (Pmax + C). While this would seem to imply clock speedup of more than five times the maximum clock using traditional clocking schemes, environmental issues such as process variation and temperature gradient across a die restrict realizable clock rate speedup to about three times.⁷,⁸ Figure 4 details the register-to-register waveforms of a wave-pipelined vector multiplier.⁷ The achieved rate shown at the bottom of the figure is more than three times faster than the traditional rate determined by the latency between the multiplier input and output (shown in the top two segments of Figure 4).

[Figure 4 waveforms: re0, vector register read (single wave); we2, vector register writeback (single wave); mula, multiplier input (multiple waves); mulr, multiplier output (multiple waves); time axis in nanoseconds.]
Figure 4. Wave-pipelined vector multiplication⁸ fabricated with 0.8-micron technology.

Wave pipelining also exemplifies how aggressive new techniques offer architects and implementers both benefits and challenges. Although allowing significant clock-rate improvements over traditional pipelines, wave pipelines cannot be stalled without losing the in-flight computations. Because individual waves in the pipeline exist only by virtue of circuit delays, they cannot be controlled between pipeline latches. In the case of wave pipelines, architecturally transparent replay buffers⁹ can provide the effect of a stall and extend the applicability of wave pipelining to applications that require stalling the pipeline. Other new techniques may not fit in directly with current architectures and may also require special treatment to be generally applicable.

Performance versus complexity
Processors with faster cycle times often provide better overall performance, although clock overhead and pipeline disruptions during execution may nullify this advantage. Consider the case of a simple pipelined processor, illustrated in Figure 5. Assume that the total time to execute an instruction without clock overhead is T (Figure 5a). Now, T is segmented into S segments to allow pipelining (Figures 5b and 5c). The pipeline completes one instruction per cycle if no interruptions occur; however, disruptions such as unexpected branches result in flushing and restarting the pipeline (Figure 5d). Suppose these interruptions occur with frequency b and have the effect of invalidating S − 1 instructions. The pipeline throughput G becomes

G = [1 / (1 + (S − 1)b)] · [1 / (T/S + C)]

Maximizing G with respect to S gives the optimal pipeline depth

Sopt = √((1 − b)T / (bC))

[Figure 5: (a) total instruction execution time T without pipelining; (b) T segmented into S stages, each with clock overhead C plus skew; (c) stage time T/S plus clock overhead C before the result is available; (d) a disruption delays restart by S − 1 cycles, an effect that grows with increasing clock overhead.]
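These cycle-time and throughput expressions are easy to check numerically. The sketch below uses illustrative parameter values (not figures from the article): it compares a conventional edge-triggered cycle (Tc = Pmax + C) against a wave-pipelined cycle (∆t = Pmax − Pmin + C), then finds the throughput-maximizing pipeline depth and compares it with the closed-form Sopt.

```python
import math

# Illustrative values only (not data from the article).
T = 20.0    # total combinational logic delay per instruction, ns
C = 0.5     # clock overhead per stage (setup, clock-to-output, skew), ns
b = 0.05    # fraction of instructions causing a pipeline disruption

def throughput(S):
    """Pipeline throughput G for depth S: the disruption term
    1/(1 + (S - 1)b) times the clock-rate term 1/(T/S + C)."""
    return 1.0 / (1.0 + (S - 1) * b) / (T / S + C)

# Closed-form optimum from the text: Sopt = sqrt((1 - b) T / (b C)).
s_opt = math.sqrt((1 - b) * T / (b * C))
best_S = max(range(1, 101), key=throughput)
print(f"Sopt = {s_opt:.2f}; best integer depth = {best_S}")

# Cycle-time comparison for a single stage, again illustrative.
Pmax, Pmin = 4.0, 3.4   # max/min interlatch path delays, ns
conventional_cycle = Pmax + C         # edge-triggered flip-flops
wave_cycle = Pmax - Pmin + C          # wave pipelining rides the delay spread
print(f"conventional {conventional_cycle:.1f} ns, wave-pipelined {wave_cycle:.1f} ns")
```

With these numbers the closed-form optimum is about 27.6 stages and the integer search lands within one stage of it; pushing S beyond that loses more to the (S − 1)b restart penalty than it gains in cycle time.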
[Figure 8: normalized performance (0 to 0.8) versus instruction width N (1 to 10), with one curve per overhead value O from 0% to 4%.]
Figure 8. Normalized performance versus instruction-level parallelism (Compress benchmark).

scalar processor running the Compress benchmark. In this graph, normalized performance is defined as IPC · (1 − O · N). Consider the case when the overhead O is 3%. Because of ILP overhead, increasing the instruction width beyond five diminishes the actual performance.
The trade-off between VLIW (very long instruction word) and superscalar machines amounts to reducing overhead with a VLIW machine versus reducing disruption with a superscalar machine. Latency tolerance reduces average latency d and hence improves the ILP available.
For either processor approach (fast ∆t or increased ILP), performance is bounded by implementation overhead and program behavior. We can use our understanding of algorithms to extend both of these limits.

Disruptions
Execution disruptions include execution interruptions (for example, from mispredicted branches). Approaches to reducing execution disruptions involve statically scheduled VLIW processors with a software approach or dynamically scheduled superscalar processors with a hardware approach.
The software approach relies on the compiler to prevent as many disruptions as possible by eliminating the possibility of nonerror disruptions in the code schedule. Even system interrupts can be managed in this manner, although this solution is less applicable to general-purpose computing. The software approach minimizes hardware complexity, providing both cost and performance benefits; however, the requirement to pessimistically schedule the code often results in suboptimal performance.
The hardware approach relies on hardware to predict, detect, and execute correctly in the presence of disruptions. Hardware-based dynamic prediction (driven by recent execution patterns) can achieve greater accuracy than compiler-based branch prediction (driven by either heuristics or profile information). However, with the complexity of dynamic prediction hardware (say, for branch prediction or data prefetching), this approach rapidly reaches the point of diminishing returns.

The "memory wall" obstacle
Currently, popular processors have extended techniques such as branch prediction, data and instruction caches, and out-of-order execution to enable greater performance. Unfortunately, each of these techniques is reaching the point of diminishing performance returns. Like branch prediction, cache memory already achieves a prediction rate in the high 90% range, and performance is now dominated by the ever-increasing processor-memory speed mismatch.

[Figure 9: four panels of CPI versus memory latency (2 to 100 cycles) for the benchmarks in the caption.]
Figure 9. Performance breakdown showing cycles per instruction (CPI) versus memory latency on the four benchmarks (a) espresso, (b) li, (c) alvinn, and (d) ear for an eight-wide VLIW processor with varying memory system penalties.

Figure 9 shows the performance breakdown for an eight-instruction-wide VLIW processor. As memory latency increases, memory-related penalties (the so-called memory wall) rapidly begin to dominate performance.¹³ Branch prediction reduces branch penalties, but the memory penalty still dominates. Related results (not shown in the figure) indicate that little would be gained by adding out-of-order-execution capability to such a processor. Despite the effectiveness of hardware approaches, until the memory wall can be controlled, the memory penalty will continue to dominate performance.

Power considerations
Growing demands for wireless and portable electronic appliances have focused much attention recently on power consumption. The SIA¹ points to increasingly higher power for microprocessor chips because of their higher operating frequency, higher overall capacitance, and larger size.
At the device level, total power dissipation (Ptotal) has three major sources—switching loss, leakage current, and short-circuit current:¹⁴

Ptotal = (C · V² · freq)/2 + Ileakage · V + Isc · V

where C is the device capacitance, V is the supply voltage, freq is the device switching frequency, Ileakage is the leakage current, and Isc is the short-circuit current. Of the three power dissipation sources, switching loss remains dominant.
We can reduce switching loss by lowering the supply voltage. Chen et al.¹⁵ showed that the drain current is proportional to (V − Vth)^1.25, where V is the supply voltage and Vth is the threshold voltage. If we keep the original design but scale the supply voltage, the parasitic capacitances remain the same, and
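The three-term power model above can be sketched numerically to show why switching loss dominates and why voltage scaling helps. All parameter values below are illustrative placeholders, not the article's data; the point is only that the switching term scales with V² while the current-based terms scale linearly with V.

```python
# Illustrative device parameters (placeholders, not the article's data).
C = 50e-9        # total switched capacitance, farads
freq = 500e6     # switching frequency, Hz
I_leak = 5e-3    # leakage current, amperes
I_sc = 10e-3     # short-circuit current, amperes

def p_total(V):
    """Total dissipation: switching + leakage + short-circuit terms."""
    switching = C * V * V * freq / 2.0
    leakage = I_leak * V
    short_circuit = I_sc * V
    return switching + leakage + short_circuit

# Lowering the supply from 2.5 V to 1.8 V cuts the dominant switching
# term by (1.8 / 2.5)**2, about half, while the other terms shrink
# only linearly with V.
for V in (2.5, 1.8):
    print(f"V = {V} V: Ptotal = {p_total(V):.2f} W")
```

Because drain current falls roughly as (V − Vth)^1.25, gates also slow down as V drops, which is the power-for-time exchange behind the PT³ bound mentioned earlier.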
Power management must be implemented from the system architecture and operating system down to the logic gate level.

Table 2. FUPA components and recently announced results (for FPUs only).

Processor        Effective latency (ns)   Normalized area (mm²)   Normalized effective latency (ns)   FUPA (cm²·ns)
DEC 21164        12.38                    77.51                   35.36                               27.41
MIPS R10000      14.25                    70.08                   40.71                               28.53
HP PA8000        24.44                    81.16                   48.89                               39.68
Intel P6         28.25                    51.56                   80.71                               41.61
Sun UltraSparc   23.65                    133.43                  50.32                               67.15
AMD K5           66.00                    48.42                   188.57                              91.32

Area considerations
Another important design trade-off entails determining the optimum die size.¹⁷ In the
server market, the processor may be a relatively small component in a much more costly system dominated by memory and storage costs. In this design area, an increase in processor cost of 10 times (from, say, a die cost of $10 to a die cost of $100) may not significantly affect the overall system's cost. However, even when cost per se is not a design factor, efficient layout and space usage are increasingly important in determining cycle time. McFarland¹⁸ has shown that at 0.1-micron feature size and 1-ns cycle time, a long interconnect delay may make it difficult (or impossible) to have one-cycle cache accesses for caches larger than 32 Kbytes. Using the same assumptions, he suggests that the limit for two-cycle cache access is 128 Kbytes. Thus, even in cost-insensitive server designs, cycle time may still limit chip complexity.
Client processor implementations are extremely cost sensitive, and optimum area use is very important. As technology and costs allow, we will have to increase functionality within the processor to incorporate various signal processing (multimedia and graphics) and memory functions now assigned to separate chips.
Area optimization simply means making the best use of available silicon. Each algorithmic state of the art can be encapsulated in an AT (area-time) curve. In general, we have ATⁿ = k, where n usually ranges from 1 to 2, depending on the nature of communication within the implementation.
Note that k defines the state-of-the-art, or "par," designs at any particular moment. If some implementations use design points in the interior, the resultant design is inferior, or "over par." Over time, k (and par) decreases because of advances in technology and in algorithms themselves.
To study algorithms, we can largely eliminate the effect of technology by using a technology-normalized metric. Along this line, Fu developed a cost-performance metric for evaluating floating-point-unit implementations.¹⁹ The metric, floating-point-unit cost-performance analysis (FUPA), incorporates five key aspects of VLSI systems design: latency, die area, power, minimum feature size, and a profile of applications. FUPA uses technology projections based on scalable device models to identify the design/technology compatibility and lets designers make high-level trade-offs in optimizing FPU designs. Table 2 shows example FUPA applications to specific processors. Processor FPUs with the lowest FUPA define the state of the art.

Computational integrity
The last basic trade-off is determining the level of computational integrity. When rebooting a personal computer after an application has caused the system to crash, we may wonder about the application or the system or both. However, the observed failure is a retrograde problem solved years ago in hardware with the introduction of user and system states and corresponding memory protection. What's lacking is the determination to implement the solution. In looking ahead to improved models of computational integrity, we should consider

• reliability,
• testability,
• serviceability,
• process recoverability, and
• fail-safe computation.

Reliability is a characteristic of the implementation media. Circuits and cells may fail, but this need not lead immediately to demon-
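A useful cross-check on Table 2: the published FUPA column matches normalized effective latency times normalized area, converted from mm²·ns to cm²·ns. This product form is inferred from the table values themselves, not quoted from Fu's definition,¹⁹ so treat the sketch below as a plausibility check rather than the metric's formal statement.

```python
# Table 2 rows: (processor, normalized area mm², normalized effective
# latency ns, published FUPA cm²·ns).
rows = [
    ("DEC 21164",      77.51,  35.36, 27.41),
    ("MIPS R10000",    70.08,  40.71, 28.53),
    ("HP PA8000",      81.16,  48.89, 39.68),
    ("Intel P6",       51.56,  80.71, 41.61),
    ("Sun UltraSparc", 133.43, 50.32, 67.15),
    ("AMD K5",         48.42, 188.57, 91.32),
]

for name, area_mm2, latency_ns, fupa in rows:
    # mm²·ns to cm²·ns conversion divides by 100.
    product = area_mm2 * latency_ns / 100.0
    # Matches the published column to within rounding (±0.02).
    assert abs(product - fupa) < 0.02, name
    print(f"{name:14s} {product:6.2f} vs published {fupa:6.2f}")
```

By this reading, lower is better, and the DEC 21164 FPU leads the field, matching the text's claim that the lowest FUPA defines the state of the art.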
pins or by using direct optical connections to the die. Meeting server performance requirements demands both traditional and nontraditional solutions.
In this cost-is-no-object server region, the need to customize implementations to specific applications may alter manufacturing as well. Although expensive, effective customization may require fabrication microproduction runs to maximize performance. One side effect of such a manufacturing change is the requirement to automate software tool delivery for each implementation. Tensilica took this approach, which delivers not only a custom processor implementation but also customized tools (compiler, assembler, and debugger) that provide basic access to the specific capabilities of a given implementation.
For many server applications, hundreds of processors on a die can be very attractive, possibly even providing sufficient area to include gigabytes of memory. Decoupling requirements (due to long wires, as discussed earlier) limit the overall integration potential, leaving significant area available for redundancy. This redundancy can serve to improve both die yield and computational integrity. In addition, this die area surplus may let designers implement higher performance communication structures; optical communications might help reduce the liability of wire delay across the chip.
Architecturally at least, the challenges for client processors are somewhat simpler than for server processors. The client design point is a processor operating at low power and reduced speed for some applications, perhaps on the order of 100 µW and 100 MHz, respectively, with years, not hours, of battery life. A client processor is limited by long wires and hence is partitioned into multiple units: core processor with cache, various signal processors, wireless (RF) capabilities, and cryptographic arithmetic facilities. The resultant system is also bound by computational integrity requirements, but probably not at the same high level as a server. We see the core processor as a generally conventional design, but the supporting signal processors may use vector/VLIW or other suitable forms of ILP to manage their tasks efficiently. Memory may also occupy the same die, as long as we can predetermine the size requirements and allow sufficient space.

As technology scales, important new opportunities emerge for microprocessor architects. The simple, traditional measures of processor performance—cycle time and cache size—become less meaningful in correctly evaluating application performance. The most significant challenges facing the microprocessor architect include

• Creating high-performance server processors jointly or cooperatively with enabling compiler software. Whether the resultant architectures are vector processors, VLIW, or some other type, the processors must actually deliver the specified performance across a spectrum of applications.
• Using advanced CAD tools to design power-sensitive system-on-chip client processors in a very short design time. System issues such as testing and verification become important challenges.
• Improving ways to preserve the integrity of computation, reliability, and diagnostic features.
• Increasing the use of adaptability in various processor structures, such as cache and signal processors. For example, an adaptive cache would not just prefetch, it would prefetch according to a particular program's history of accessing behavior. Similarly, we may see adaptability in arithmetic, where arithmetic functional units can be redefined, perhaps with the assistance of programmable logic elements in functional units, to improve performance for a range of applications with different computational needs.

Understanding technology trends and specific applications is the main criterion for designing efficient and effective processors. Without such understanding, design complexity will quickly become overwhelming, preventing designers from using a die's full capacity. MICRO

References
1. The National Technology Roadmap for Semiconductors, tech. report, Semiconductor Industry Assn., San Jose, Calif., 1994 and 1997 (updated).
2. H.B. Bakoglu, Circuit, Interconnections, and