[Figure 1: paired plots versus technology (micron)/year of first shipment, from 0.25 micron in 1997 down to 0.05 micron in 2012; panel (b) plots global and local clock frequency in MHz.]
Figure 1. The National Technology Roadmap for Semiconductors: (a) total transistors per chip, (b) on-chip local clock.
12 IEEE MICRO
it became obvious that 1-GHz processors were feasible immediately—not between 2005 and 2010, as projected earlier. The revisions arise from coordinated improvements in scaling, circuit design, and CAD tools.
Feature sizes below 0.1 micron lead to other technical challenges. High-performance processors, which could consume as much as 175 W of power, need special cooling techniques. Enabling GHz signals to travel into and out of the chips requires new circuit designs and algorithms. Preventing latch-up and reducing noise coupling may require new materials such as silicon-on-insulator. Similarly, reducing cross talk and DRAM leakage may require low-κ and high-κ materials, respectively. These challenges demand that designers provide whole-system solutions instead of treating logic design, circuit design, and packaging as independent phases of the design process.

[Figure 3: a three-dimensional trade-off space with power, area, and time axes, showing the server and client processor design regions and the bounds PT³ = k and ATⁿ = k.]
Figure 3. Area, performance (time), and power trade-off trends in server and client processor designs.
Client versus server processors
In the era of deep-submicron technology, two classes of microprocessors are evolving: client and server processors. The majority of implementations are commodity system-on-chip client processors (including network and embedded processors) devoted to end-user applications such as personal computers. These highly cost-sensitive processors are used extensively in consumer electronics. Individual applications may have specific requirements; for example, portable and wireless applications require very low power consumption. The other class consists of high-end server processors, which are performance driven. Here, other parts of the system dominate cost and power issues.
At a fixed feature size, area can be traded off for time. VLSI complexity theorists³ have shown that an ATⁿ bound exists for microprocessor designs, where n usually falls between 1 and 2. By varying the supply voltage, as we show later, it is also possible to trade off area A for power P with a PT³ bound. Figure 3 shows the possible trade-off involving area, time T, and power in a processor design. Client and server processors operate in different design regions of this three-dimensional space. The power and area axes are typically optimized for client processors, whereas the time axis is typically optimized for server processors. Until recently, a single design point existed for both servers and clients, based on economy of scale. With added computational resources (in area and time) becoming available through improved feature sizes, the movement to establish separate design points—differentiated by power—seems inevitable.
An important dimension—computational integrity—does not appear in Figure 3. Later we discuss its nature and effects on area, time, and power.

Time considerations
Microprocessor performance has improved by approximately 50% per year for the past 15 years. This can be attributed to higher clock frequencies, deeper pipelines, and improved exploitation of instruction-level parallelism (ILP). In the deep-submicron era, we can expect performance improvement to result largely from reducing cycle time at the expense of greater power consumption.

JULY–AUGUST 1999 13
DEEP-SUBMICRON DESIGN

Cycle time
Processor clock frequencies have increased by approximately 30% per year for the past 15 years, due partly to faster transistors and partly to fewer logic gates per cycle. Traditionally, digital designs have used edge-triggered flip-flops extensively. Such a system's cycle time Tc is determined by Tc = Pmax + C, where Pmax is the maximum delay required for the combinational logic, and C is the total clock overhead, including setup time, clock-to-output delay, and clock skew. For high-end server processors, the SIA¹ predicts that the clock cycle time will decrease from roughly 16 FO4 (fanout-of-four) inverter delays at present to roughly five FO4 inverter delays at 0.05-micron feature size. As a result, clock overhead takes a significant fraction of the cycle time, and flip-flop clocking systems appear infeasible. Fortunately, a number of circuit techniques can improve deep-submicron microprocessor cycle times:

• Several new flip-flop structures, such as sense-amplifier-based, hybrid latch, and semidynamic flip-flop, have been proposed to lower clock overhead.
• Asynchronous logic⁴ eliminates the need for a global clock. Average latency depends on Pmean (average logic delay) instead of Pmax (maximum logic delay), but completion detection and data initialization incur significant overhead.
• Various forms of dynamic logic⁵ have been proposed to minimize the effects of clock skew. These techniques reduce clock overhead at the expense of power and scalability.
• Wave pipelining⁶ uses Pmin (minimum logic delay) as a storage element to improve cycle time.

We use wave pipelining to illustrate some of the new clocking considerations. For memories and other functional blocks that contain regular interconnect structures, wave pipelining is an attractive choice. The technique relies on the delay inherent in combinatorial logic circuits. Suppose a given logic unit has a maximum interlatch delay of Pmax and a corresponding minimum delay of Pmin with clock overhead C. Then the fastest achievable cycle time ∆t equals Pmax − Pmin + C. As with conventionally clocked systems, system clock rate Tc is the maximum ∆ti over i latched stages. Sophisticated CAD tools can ensure balanced path delays and thus improve cycle time. In practice, using special CAD tools lets us set Pmin to within about 80% to 90% of (Pmax + C). While this would seem to imply clock speedup of more than five times the maximum clock using traditional clocking schemes, environmental issues such as process variation and temperature gradient across a die restrict realizable clock rate speedup to about three times.⁷,⁸ Figure 4 details the register-to-register waveforms of a wave-pipelined vector multiplier.⁷ The achieved rate shown at the bottom of the figure is more than three times faster than the traditional rate determined by the latency between the multiplier input and output (shown in the top two segments of Figure 4).

[Figure 4 waveforms: re0, vector register read (single wave); we2, vector register writeback (single wave); mula, multiplier input (multiple waves); mulr, multiplier output (multiple waves); time axis in nanoseconds.]
Figure 4. Wave-pipelined vector multiplication⁸ fabricated with 0.8-micron technology.

Wave pipelining also exemplifies how aggressive new techniques offer architects and implementers both benefits and challenges. Although allowing significant clock-rate improvements over traditional pipelines, wave pipelines cannot be stalled without losing the in-flight computations. Because individual waves in the pipeline exist only by virtue of circuit delays, they cannot be controlled between pipeline latches. In the case of wave pipelines, architecturally transparent replay buffers⁹ can provide the effect of a stall and extend the applicability of wave pipelining to applications that require stalling the pipeline. Other new techniques may not fit in directly with current architectures and may also require special treatment to be generally applicable.

Performance versus complexity
Processors with faster cycle times often provide better overall performance, although clock overhead and pipeline disruptions during execution may nullify this advantage. Consider the case of a simple pipelined processor, illustrated in Figure 5. Assume that the total time to execute an instruction without clock overhead is T (Figure 5a). Now, T is segmented into S segments to allow pipelining (Figures 5b and 5c). The pipeline completes one instruction per cycle if no interruptions occur; however, disruptions such as unexpected branches result in flushing and restarting the pipeline (Figure 5d). Suppose these interruptions occur with frequency b and have the effect of invalidating S − 1 instructions. The pipeline throughput G becomes

G = [1 / (1 + (S − 1)b)] · [1 / (T/S + C)]

Maximizing G with respect to S gives the optimal pipeline depth

Sopt = √((1 − b)T / (bC))

[Figure 5: (a) total instruction execution time T without pipelining; (b) T segmented into S stages, each with clock overhead C plus skew; (c) stage time T/S plus clock overhead C before the result is available; (d) a disruption delays restart by S − 1 cycles, an effect that grows with increasing clock overhead.]
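These cycle-time and throughput expressions are easy to check numerically. The sketch below uses illustrative parameter values (not figures from the article): it compares a conventional edge-triggered cycle (Tc = Pmax + C) against a wave-pipelined cycle (∆t = Pmax − Pmin + C), then finds the throughput-maximizing pipeline depth and compares it with the closed-form Sopt.

```python
import math

# Illustrative values only (not data from the article).
T = 20.0    # total combinational logic delay per instruction, ns
C = 0.5     # clock overhead per stage (setup, clock-to-output, skew), ns
b = 0.05    # fraction of instructions causing a pipeline disruption

def throughput(S):
    """Pipeline throughput G for depth S: the disruption term
    1/(1 + (S - 1)b) times the clock-rate term 1/(T/S + C)."""
    return 1.0 / (1.0 + (S - 1) * b) / (T / S + C)

# Closed-form optimum from the text: Sopt = sqrt((1 - b) T / (b C)).
s_opt = math.sqrt((1 - b) * T / (b * C))
best_S = max(range(1, 101), key=throughput)
print(f"Sopt = {s_opt:.2f}; best integer depth = {best_S}")

# Cycle-time comparison for a single stage, again illustrative.
Pmax, Pmin = 4.0, 3.4   # max/min interlatch path delays, ns
conventional_cycle = Pmax + C         # edge-triggered flip-flops
wave_cycle = Pmax - Pmin + C          # wave pipelining rides the delay spread
print(f"conventional {conventional_cycle:.1f} ns, wave-pipelined {wave_cycle:.1f} ns")
```

With these numbers the closed-form optimum is about 27.6 stages and the integer search lands within one stage of it; pushing S beyond that loses more to the (S − 1)b restart penalty than it gains in cycle time.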
[Figure 8: normalized performance (0 to 0.8) versus instruction width N (1 to 10), with one curve per overhead value O from 0% to 4%.]
Figure 8. Normalized performance versus instruction-level parallelism (Compress benchmark).

scalar processor running the Compress benchmark. In this graph, normalized performance is defined as IPC · (1 − O · N). Consider the case when the overhead O is 3%. Because of ILP overhead, increasing the instruction width beyond five diminishes the actual performance.
The trade-off between VLIW (very long instruction word) and superscalar machines amounts to reducing overhead with a VLIW machine versus reducing disruption with a superscalar machine. Latency tolerance reduces average latency d and hence improves the ILP available.
For either processor approach (fast ∆t or increased ILP), performance is bounded by implementation overhead and program behavior. We can use our understanding of algorithms to extend both of these limits.

Disruptions
Execution disruptions include execution interruptions (for example, from mispredicted branches). Approaches to reducing execution disruptions involve statically scheduled VLIW processors with a software approach or dynamically scheduled superscalar processors with a hardware approach.
The software approach relies on the compiler to prevent as many disruptions as possible by eliminating the possibility of nonerror disruptions in the code schedule. Even system interrupts can be managed in this manner, although this solution is less applicable to general-purpose computing. The software approach minimizes hardware complexity, providing both cost and performance benefits; however, the requirement to pessimistically schedule the code often results in suboptimal performance.
The hardware approach relies on hardware to predict, detect, and execute correctly in the presence of disruptions. Hardware-based dynamic prediction (driven by recent execution patterns) can achieve greater accuracy than compiler-based branch prediction (driven by either heuristics or profile information). However, with the complexity of dynamic prediction hardware (say, for branch prediction or data prefetching), this approach rapidly reaches the point of diminishing returns.

The "memory wall" obstacle
Currently, popular processors have extended techniques such as branch prediction, data and instruction caches, and out-of-order execution to enable greater performance. Unfortunately, each of these techniques is reaching the point of diminishing performance returns. Like branch prediction, cache memory already achieves a prediction rate in the high 90% range, and performance is now dominated by the ever-increasing processor-memory speed mismatch.

[Figure 9: four panels of CPI versus memory latency (2 to 100 cycles) for the benchmarks in the caption.]
Figure 9. Performance breakdown showing cycles per instruction (CPI) versus memory latency on the four benchmarks (a) espresso, (b) li, (c) alvinn, and (d) ear for an eight-wide VLIW processor with varying memory system penalties.

Figure 9 shows the performance breakdown for an eight-instruction-wide VLIW processor. As memory latency increases, memory-related penalties (the so-called memory wall) rapidly begin to dominate performance.¹³ Branch prediction reduces branch penalties, but the memory penalty still dominates. Related results (not shown in the figure) indicate that little would be gained by adding out-of-order-execution capability to such a processor. Despite the effectiveness of hardware approaches, until the memory wall can be controlled, the memory penalty will continue to dominate performance.

Power considerations
Growing demands for wireless and portable electronic appliances have focused much attention recently on power consumption. The SIA¹ points to increasingly higher power for microprocessor chips because of their higher operating frequency, higher overall capacitance, and larger size.
At the device level, total power dissipation (Ptotal) has three major sources—switching loss, leakage current, and short-circuit current:¹⁴

Ptotal = (C · V² · freq)/2 + Ileakage · V + Isc · V

where C is the device capacitance, V is the supply voltage, freq is the device switching frequency, Ileakage is the leakage current, and Isc is the short-circuit current. Of the three power dissipation sources, switching loss remains dominant.
We can reduce switching loss by lowering the supply voltage. Chen et al.¹⁵ showed that the drain current is proportional to (V − Vth)^1.25, where V is the supply voltage and Vth is the threshold voltage. If we keep the original design but scale the supply voltage, the parasitic capacitances remain the same, and
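The three-term power model above can be sketched numerically to show why switching loss dominates and why voltage scaling helps. All parameter values below are illustrative placeholders, not the article's data; the point is only that the switching term scales with V² while the current-based terms scale linearly with V.

```python
# Illustrative device parameters (placeholders, not the article's data).
C = 50e-9        # total switched capacitance, farads
freq = 500e6     # switching frequency, Hz
I_leak = 5e-3    # leakage current, amperes
I_sc = 10e-3     # short-circuit current, amperes

def p_total(V):
    """Total dissipation: switching + leakage + short-circuit terms."""
    switching = C * V * V * freq / 2.0
    leakage = I_leak * V
    short_circuit = I_sc * V
    return switching + leakage + short_circuit

# Lowering the supply from 2.5 V to 1.8 V cuts the dominant switching
# term by (1.8 / 2.5)**2, about half, while the other terms shrink
# only linearly with V.
for V in (2.5, 1.8):
    print(f"V = {V} V: Ptotal = {p_total(V):.2f} W")
```

Because drain current falls roughly as (V − Vth)^1.25, gates also slow down as V drops, which is the power-for-time exchange behind the PT³ bound mentioned earlier.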
Power management must be implemented from the system architecture and operating system down to the logic gate level.

Table 2. FUPA components and recently announced results (for FPUs only).

Processor        Effective latency (ns)   Normalized area (mm²)   Normalized effective latency (ns)   FUPA (cm²·ns)
DEC 21164        12.38                    77.51                   35.36                               27.41
MIPS R10000      14.25                    70.08                   40.71                               28.53
HP PA8000        24.44                    81.16                   48.89                               39.68
Intel P6         28.25                    51.56                   80.71                               41.61
Sun UltraSparc   23.65                    133.43                  50.32                               67.15
AMD K5           66.00                    48.42                   188.57                              91.32

Area considerations
Another important design trade-off entails determining the optimum die size.¹⁷ In the
server market, the processor may be a relatively small component in a much more costly system dominated by memory and storage costs. In this design area, an increase in processor cost of 10 times (from, say, a die cost of $10 to a die cost of $100) may not significantly affect the overall system's cost. However, even when cost per se is not a design factor, efficient layout and space usage are increasingly important in determining cycle time. McFarland¹⁸ has shown that at 0.1-micron feature size and 1-ns cycle time, a long interconnect delay may make it difficult (or impossible) to have one-cycle cache accesses for caches larger than 32 Kbytes. Using the same assumptions, he suggests that the limit for two-cycle cache access is 128 Kbytes. Thus, even in cost-insensitive server designs, cycle time may still limit chip complexity.
Client processor implementations are extremely cost sensitive, and optimum area use is very important. As technology and costs allow, we will have to increase functionality within the processor to incorporate various signal processing (multimedia and graphics) and memory functions now assigned to separate chips.
Area optimization simply means making the best use of available silicon. Each algorithmic state of the art can be encapsulated in an AT (area-time) curve. In general, we have ATⁿ = k, where n usually ranges from 1 to 2, depending on the nature of communication within the implementation.
Note that k defines the state-of-the-art, or "par," designs at any particular moment. If some implementations use design points in the interior, the resultant design is inferior, or "over par." Over time, k (and par) decreases because of advances in technology and in algorithms themselves.
To study algorithms, we can largely eliminate the effect of technology by using a technology-normalized metric. Along this line, Fu developed a cost-performance metric for evaluating floating-point-unit implementations.¹⁹ The metric, floating-point-unit cost-performance analysis (FUPA), incorporates five key aspects of VLSI systems design: latency, die area, power, minimum feature size, and a profile of applications. FUPA uses technology projections based on scalable device models to identify the design/technology compatibility and lets designers make high-level trade-offs in optimizing FPU designs. Table 2 shows example FUPA applications to specific processors. Processor FPUs with the lowest FUPA define the state of the art.

Computational integrity
The last basic trade-off is determining the level of computational integrity. When rebooting a personal computer after an application has caused the system to crash, we may wonder about the application or the system or both. However, the observed failure is a retrograde problem solved years ago in hardware with the introduction of user and system states and corresponding memory protection. What's lacking is the determination to implement the solution. In looking ahead to improved models of computational integrity, we should consider

• reliability,
• testability,
• serviceability,
• process recoverability, and
• fail-safe computation.

Reliability is a characteristic of the implementation media. Circuits and cells may fail, but this need not lead immediately to demon-
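A useful cross-check on Table 2: the published FUPA column matches normalized effective latency times normalized area, converted from mm²·ns to cm²·ns. This product form is inferred from the table values themselves, not quoted from Fu's definition,¹⁹ so treat the sketch below as a plausibility check rather than the metric's formal statement.

```python
# Table 2 rows: (processor, normalized area mm², normalized effective
# latency ns, published FUPA cm²·ns).
rows = [
    ("DEC 21164",      77.51,  35.36, 27.41),
    ("MIPS R10000",    70.08,  40.71, 28.53),
    ("HP PA8000",      81.16,  48.89, 39.68),
    ("Intel P6",       51.56,  80.71, 41.61),
    ("Sun UltraSparc", 133.43, 50.32, 67.15),
    ("AMD K5",         48.42, 188.57, 91.32),
]

for name, area_mm2, latency_ns, fupa in rows:
    # mm²·ns to cm²·ns conversion divides by 100.
    product = area_mm2 * latency_ns / 100.0
    # Matches the published column to within rounding (±0.02).
    assert abs(product - fupa) < 0.02, name
    print(f"{name:14s} {product:6.2f} vs published {fupa:6.2f}")
```

By this reading, lower is better, and the DEC 21164 FPU leads the field, matching the text's claim that the lowest FUPA defines the state of the art.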
pins or by using direct optical connections to the die. Meeting server performance requirements demands both traditional and nontraditional solutions.
In this cost-is-no-object server region, the need to customize implementations to specific applications may alter manufacturing as well. Although expensive, effective customization may require fabrication microproduction runs to maximize performance. One side effect of such a manufacturing change is the requirement to automate software tool delivery for each implementation. Tensilica took this approach, which delivers not only a custom processor implementation but also customized tools (compiler, assembler, and debugger) that provide basic access to the specific capabilities of a given implementation.
For many server applications, hundreds of processors on a die can be very attractive, possibly even providing sufficient area to include gigabytes of memory. Decoupling requirements (due to long wires, as discussed earlier) limit the overall integration potential, leaving significant area available for redundancy. This redundancy can serve to improve both die yield and computational integrity. In addition, this die area surplus may let designers implement higher performance communication structures; optical communications might help reduce the liability of wire delay across the chip.
Architecturally at least, the challenges for client processors are somewhat simpler than for server processors. The client design point is a processor operating at low power and reduced speed for some applications, perhaps on the order of 100 µW and 100 MHz, respectively, with years, not hours, of battery life. A client processor is limited by long wires and hence is partitioned into multiple units: core processor with cache, various signal processors, wireless (RF) capabilities, and cryptographic arithmetic facilities. The resultant system is also bound by computational integrity requirements, but probably not at the same high level as a server. We see the core processor as a generally conventional design, but the supporting signal processors may use vector/VLIW or other suitable forms of ILP to manage their tasks efficiently. Memory may also occupy the same die, as long as we can predetermine the size requirements and allow sufficient space.

As technology scales, important new opportunities emerge for microprocessor architects. The simple, traditional measures of processor performance—cycle time and cache size—become less meaningful in correctly evaluating application performance. The most significant challenges facing the microprocessor architect include

• Creating high-performance server processors jointly or cooperatively with enabling compiler software. Whether the resultant architectures are vector processors, VLIW, or some other type, the processors must actually deliver the specified performance across a spectrum of applications.
• Using advanced CAD tools to design power-sensitive system-on-chip client processors in a very short design time. System issues such as testing and verification become important challenges.
• Improving ways to preserve the integrity of computation, reliability, and diagnostic features.
• Increasing the use of adaptability in various processor structures, such as cache and signal processors. For example, an adaptive cache would not just prefetch, it would prefetch according to a particular program's history of accessing behavior. Similarly, we may see adaptability in arithmetic, where arithmetic functional units can be redefined, perhaps with the assistance of programmable logic elements in functional units, to improve performance for a range of applications with different computational needs.

Understanding technology trends and specific applications is the main criterion for designing efficient and effective processors. Without such understanding, design complexity will quickly become overwhelming, preventing designers from using a die's full capacity. MICRO

References
1. The National Technology Roadmap for Semiconductors, tech. report, Semiconductor Industry Assn., San Jose, Calif., 1994 and 1997 (updated).
2. H.B. Bakoglu, Circuit, Interconnections, and