Carnegie-Mellon University
SPRINGER-VERLAG
Berlin - Heidelberg - New York
Copyright 1981 Carnegie-Mellon University
Softcover reprint of the hardcover 1st edition 1981
Printing: 1 2 3 4 5    Year: 85 84 83 82 81
The papers in this book were presented at the CMU Conference on VLSI Systems and
Computations, held October 19-21, 1981 in Pittsburgh, Pennsylvania. The conference was
organized by the Computer Science Department, Carnegie-Mellon University and was partially
supported by the National Science Foundation and the Office of Naval Research.
These proceedings focus on the theory and design of computational systems using VLSI.
Until very recently, integrated-circuit research and development were concentrated in the device
physics and fabrication design disciplines and in the integrated-circuit industry itself. Within the
last few years, a community of researchers has been growing to address issues closer to computer
science: the relationship between computing structures and the physical structures that
implement them; the specification and verification of computational processes implemented in
VLSI; the use of massively parallel computing made possible by VLSI; the design of special-
purpose computing architectures; and the changes in general-purpose computer architecture
that VLSI makes possible. It is likely that the future exploitation of VLSI technology depends as
much on structural and design innovations as on advances in fabrication technology.
These papers were selected by the program committee from among 120 extended abstracts
submitted in response to the call for papers. Selection was based on originality and relevance to
the theme of the conference, and was very difficult, owing to the large number of excellent
papers submitted. Among the papers that could not be accepted were some excellent ones in
design automation and computer-aided design, important areas beyond the scope of the
conference.
We wish to express our thanks to the authors for making their works available while
complying with strict deadlines and formats to aid in the timely appearance of the book; to the
invited speakers for their excellent papers and for sharing their insights and experience; and to
the program committee members for their careful evaluation of the many extended abstracts,
despite the limited time made available to them. Especially, our grateful thanks go to
Louis Monier, who contributed greatly in the planning of the conference and the publication of
this book, and to Sharon Carmack, who was not only responsible for conference registration, but
also handled the many details involved in the preparation of the conference.
The logo and cover design appearing on this book and throughout the conference were
designed by E. Heidi Fieschko.
H. T. Kung and Bob Sproull
Fall 1981
Program Committee
Co-Sponsors
Carnegie-Mellon University.
National Science Foundation.
Office of Naval Research.
Authors
Contents
Preface v
Program Committee, Co-Sponsors vii
Authors Index viii
Invited Papers
The Optical Mouse, and an Architectural Methodology for Smart Digital Sensors
R.F. Lyon
Designing a VLSI Processor - Aids and Architectures
F. Baskett 20
Keys to Successful VLSI System Design
J.G. Peterson 21
Programmable LSI Digital Signal Processor Development
A. Sawai 29
Functional Parallelism in VLSI Systems and Computations
N.R. Powell 41
Functional Extensibility: Making The World Safe for VLSI
J. Rattner 50
Models of Computation
Replication of Inputs May Save Computational Resources in VLSI
Z.M. Kedem and A. Zorat 52
Planar Circuit Complexity and the Performance of VLSI Algorithms
J.E. Savage 61
Three-Dimensional Integrated Circuitry
A.L. Rosenberg 69
A Critique and an Appraisal of VLSI Models of Computation
G. Bilardi, M. Pracchi and F.P. Preparata 81
Complexity Theory
On the Complexity of VLSI Computations
T. Lengauer and K. Mehlhorn 89
On the Area Required by VLSI Circuits
G.M. Baudet 100
The VLSI Complexity of Sorting
C.D. Thompson 108
Minimum Edge Length Planar Embeddings of Trees
W.L. Ruzzo and L. Snyder 119
The VLSI Approach to Computational Complexity
D. Cohen 124
Special-Purpose Architectures
Digital Signal Processing Applications of Systolic Algorithms
P.R. Cappello and K. Steiglitz 245
A Two-Level Pipelined Systolic Array for Convolutions
H.T. Kung, L.M. Ruane and D.W.L. Yen 255
Systolic Algorithms for Running Order Statistics in Signal and Image Processing
A. Fisher 265
Systolic Array Processor Developments
K. Bromley, J.J. Symanski, J.M. Speiser and H.J. Whitehouse 273
A Systolic (VLSI) Array for Processing Simple Relational Queries
P.L. Lehman 285
A Systolic Data Structure Chip for Connectivity Problems
C. Savage 296
Multiplier Designs
Fixed-Point High-Speed Parallel Multipliers in VLSI
P. Reusens, W.H. Ku and Y.H. Mao 301
A Mesh-Connected Area-Time Optimal VLSI Integer Multiplier
F.P. Preparata 311
A Regular Layout for Parallel Multiplier of O(log^2 n) Time
W.K. Luk 317
Processors
VLSI Implementations of a Reduced Instruction Set Computer
D.T. Fitzpatrick, J.K. Foderaro, M.G.H. Katevenis, H.A. Landman, D.A.
Patterson, J.B. Peek, Z. Peshkess, C.H. Sequin, R.W. Sherburne and
K.S. Van Dyke 327
MIPS: A VLSI Processor Architecture
J. Hennessy, N. Jouppi, F. Baskett and J. Gill 337
Comparative Survey of Different Design Methodologies for Control Parts of
Microprocessors
M. Obrebska 347
1. Introduction
A mouse is a pointing device used with interactive display-oriented computer systems,
which tracks the movement of a user's hand as the user pushes the mouse about on a pad
(usually on the work surface next to the user's keyboard). Mice have recently become available
in the office products market as a part of the Xerox "Star," the 8010 Professional Workstation
[Business 1981, Seybold 1981-1, and Seybold 1981-2].
The work reported here is motivated by the desire for a high-reliability mouse with no
moving parts (excluding button switches if any). In Xerox research, the mouse has been in
popular use for over eight years, and has been found to be preferable to other pointing devices
[Card et al. 1977]. However, it has not been outstandingly reliable; the balls or wheels can get
dirty and slip on the pad, rather than rolling, or the commutators can get dirty and skip. This
is likely to be a significant problem in maintaining workstations that use the mouse in an
uncontrolled environment. Another disadvantage of the electro-mechanical mouse is that it's
expensive; the one-chip optical mouse is cheap. And the special patterned pad that it needs to
make it work is cheap, too, as it can be printed for about a penny on an ordinary ink press.
The goal of a mouse with no moving parts has been achieved through the use of a
combination of innovations in electro-optics, circuits, geometric combinatorics, and algorithms,
all implemented in a single custom NMOS integrated circuit (patent pending); see figure 1 for
an illustration of the optical mouse.
Figure 1. The Optical Mouse (showing the button, PC board, cable, and patterned pad surface).
The mouse was redesigned at Xerox to use ball-bearings as wheels, and optical shaft
encoders to generate a two-bit quadrature signalling code (see figure 2). That is, the motion of
a wheel caused the two output bits for that dimension to form square waves in quadrature, with
phase and frequency determined by the direction and speed of travel; each bit transition
represented motion of one resolvable step, which was used to move the cursor one pixel on the
screen [Hawley et al. 1975]. The mouse was again redesigned to use a ball instead of two
wheels, eliminating the drag of side-slipping wheels [Rider 1974, and Opocensky 1976];
internally, it was built like a trackball [Koster 1967], with shafts turning against the ball and
using commutators as shaft encoders.
4. Architectural methodology
The optical mouse chip was designed as an experimental application, in a new domain, of
the logic, timing, circuit, and layout design methodologies taught by [Mead & Conway 1980].
It was designed with the goal of fab-line and process-parameter independence, so it utilizes
only very simple and conservative device models, design rules, circuits, and timing techniques.
Those methodologies have been informally extended into an architectural methodology for
sensors, which have to deal with real-world analog effects and convert them to stable and
reliable digital form in the face of wide parameter variations. An architectural methodology is
a set of guidelines and constraints that help a designer pick a system architecture, by showing
how certain architectural concepts can be implemented and made to work, with high certainty.
An architectural methodology for a different domain is discussed in [Lyon 1981].
The layers of design methodologies used to map a concept into a layout must be supported
by a compatible implementation system that will map that layout into working silicon. Such a
system, described in [Hon & Sequin 1980 and Conway et al. 1980], was used to carry out the
implementation of the Optical Mouse design as part of a multiproject chip set, on an outside
vendor fab line.
The benefits of this approach are clear in the resulting chip: design time was very short,
standard switch-level simulation could be used to verify the correctness of the circuits, the first
implementation worked, several orders of magnitude of light-level variation are tolerated, and
the techniques developed are very robust against process parameter variation, temperature
variation, etc.
The idea of using lateral inhibition to make a digital imager was conceived in June 1980;
the rest of the techniques discussed here were developed while writing up the inhibition idea,
in June and July 1980. A chip design was done quickly in the latter part of July, and was
debugged by hand cross-checking of the layout against design sketches (thanks to C. P.
Thacker, some bugs were found and corrected). After the chip was into implementation, our
tools for design rule checking, circuit extraction, and simulation became more available, and the
design was verified as correct except for some non-fatal design rule violations.
Finished chips were delivered by the implementation system in December, and were
quickly tested on a crude test lash-up connected to the mouse port on a personal workstation.
Later, with the help of several interested colleagues, a completely packaged mouse prototype
based on this chip was completed.
The optical mouse chip should be regarded as only the first representative of a new
architectural methodology for smart digital sensors. It seems clear that there will be many
more applications of bits and pieces of this methodology to sensors of all sorts. For example,
even in something so simple as an analog-to-digital converter, great performance enhancements
can be made by using self-timed successive approximation logic to optimize speed while
avoiding metastable conditions.
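A sketch of that last idea (our illustration, not the paper's design: the loop is the standard successive-approximation recurrence, with a comment marking where self-timed logic would wait on the comparator's completion signal rather than on a fixed clock):

    # Successive approximation, one decision per bit from MSB down.
    def sar_convert(sample, bits=8, full_scale=1.0):
        """Return the `bits`-bit code for `sample` in [0, full_scale)."""
        code = 0
        for i in reversed(range(bits)):
            trial = code | (1 << i)                  # propose the next bit
            threshold = full_scale * trial / (1 << bits)
            # Self-timed hardware would wait here for the comparator's own
            # done signal instead of a fixed clock period.
            if sample >= threshold:
                code = trial                         # keep the bit
        return code

    print(sar_convert(0.3))    # 76, i.e. 76/256 of full scale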
[Figure: light-sensor symbol and cross section, showing the diffusion and junction.]
and communicate it out to other circuits. The output voltage from the inverters will start low
when the array is reset, then go toward high as the corresponding dynamic nodes go low due to
light. Figure 4 shows a schematic diagram of this simple "analog" imager cell.
An array of analog imagers of this sort has a digital all-low output initially, then has an
interesting analog image for a while, but eventually ends up in the digital all-high state until it
is reset. Both of its digital states are uninteresting. What we would like is a way to get an
interesting digital bitmap image reliably. A way to do this is to implement a form of
"inhibition" between cells, so that after some cell outputs have gone high, all others are held
low and the picture is stable from then on. This is somewhat analogous to the lateral inhibition
in the retina of most biological vision systems [von Bekesy 1967]. It has the desirable effect of
producing sensible images, almost independent of light level. Such digital sensor arrays can be
built in a self-timed loop of logic that recognizes stable images, latches them, resets, and starts
over, at a rate roughly proportional to the light intensity.
[Figure 4: schematic of the analog imager cell, with Reset and Output signals.]
The simplest imager with mutual inhibition is the two-pixel system shown in figure 5.
Each pixel circuit is essentially a NOR-gate, with one input of each being the light-sensitive
dynamic node, and the second input being the output of the other cell. The initial reset state is
00, with outputs being pulled low by the NOR inputs that are connected to the initially high
dynamic nodes. The final state can be either 01 or 10, since 00 will decay with time and 11 is
not possible as the output of cross-coupled NOR gates.
[Figure 5: the two-pixel mutual-inhibition imager, with signals Sensor-Node-1, Sensor-Node-2, Pixel-Light-2, Ready, Reset, and Done.]
The existence of a final state can be sensed by an OR gate whose logic threshold is higher
than the thresholds of the pixel NOR gates. Intermediate and metastable states will have both
output voltages near the NOR gate thresholds, but distinctly below the OR gate threshold. So
this two-pixel digital imager compares the light level at two points, and indicates when it has
made a decision (but there is no bound on how long it might take, even in bright light).
More complicated logic can be used to detect stable images (Done) in larger sensor arrays
with more complicated inhibition NOR networks.
The concept illustrated by the two-element imager is the use of additional transistors to
convert the image sensing inverters to cross-coupled NOR gates, as in a flip-flop. Any pairs of
elements in an imaging array may be chosen to be connected by these two-transistor mutual
inhibition subcircuits. For example, each pixel may be connected with its eight neighbors in a
square grid, resulting in nine-input NOR gates.
For any pattern of inhibition and any shape and size image array, the set of possible stable
images can be easily enumerated. For example, in a three-by-three array with neighbor
inhibition, the following eight images can be derived by inspection (notice that all 0 bits are
inhibited from changing to 1, by virtue of having a neighbor equal to 1):
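Those images can also be recovered mechanically. The sketch below is a brute-force enumeration assuming only the stability rule just stated (no two 1-cells within the inhibition radius of each other, and every 0-cell held low by some 1 within the radius); it is not a model of the chip circuitry.

    # Brute-force enumeration of stable images under a given inhibition radius.
    from itertools import product

    def neighbors(a, b, n, radius):
        """Cells within `radius` (Euclidean, unit cell spacing) of (a, b)."""
        return [(c, d) for c in range(n) for d in range(n)
                if (c, d) != (a, b) and (c - a)**2 + (d - b)**2 <= radius**2]

    def stable_images(n, radius):
        cells = [(a, b) for a in range(n) for b in range(n)]
        stable = []
        for bits in product((0, 1), repeat=n * n):
            img = dict(zip(cells, bits))
            # A cell may be 1 exactly when nothing in its neighborhood is 1;
            # a 0-cell must have a lit neighbor holding it low.
            if all((not any(img[nb] for nb in neighbors(a, b, n, radius)))
                   == img[(a, b)] for (a, b) in cells):
                stable.append(bits)
        return stable

    imgs = stable_images(3, 1.5)   # radius 1.5 = the eight-neighbor case
    print(len(imgs))               # prints 8, matching the count above

By the stated rule the eight images must be: the single center cell; the two opposite-edge-center pairs; four images of two corners on one side plus the far edge center; and the four-corners image.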
Of course, in larger arrays the images are more interesting, and often more numerous.
In section 9, we will show that by using a four-by-four sensor array, with inhibition of cells
up to 2.9 or more pixels away, it is easy to formulate a simple and reliable tracking algorithm
that follows spots in a hexagonal array and represents their motion with a quadrature code.
The inhibition network is defined by choosing an inhibition neighborhood for each cell.
Generally, we choose neighborhoods symmetrically, such that if A inhibits B, then B inhibits A;
we say A "is coupled with" B, reflecting the cross-coupled NOR structure. In many cases, the
inhibition neighborhood of some cells will be all other cells in the array; Cell Done signals
from such cells will be redundant, but may be implemented just for the convenience of layout
regularity.
Note that we do not use the inhibition NOR gate output itself for done-detection, but a
buffered version of it after a high threshold buffer (inverter pair); this is the easiest way to
prevent false done-detection during a metastable condition [Seitz 1980]. The buffered signal is
not used for inhibition, since that would make it participate in the metastable condition, and
because the extra delay would cause oscillatory metastable states.
If the white and dark line widths are both equal to about twice the sensor spacing, these
images correspond in an obvious way to positions of the stripes relative to the sensors. Any
two adjacent sensor outputs, say A and B, can be used directly as quadrature output signals that
sequence through the states 00, 01, 11, 10, forward or backward, depending on the direction of
motion. The advantage over previous optical quadrature detectors is that no fixed threshold or
specific light level is needed. The sensors will cycle at a rate depending on the light level, and
latched outputs will be made available to the host system.
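For concreteness, host-side decoding of such a quadrature pair can be sketched as follows (an assumed decoder written for illustration, not circuitry from the chip); successive (A, B) samples step through the Gray sequence 00, 01, 11, 10:

    # Decode successive two-bit quadrature samples into motion steps.
    GRAY = [0b00, 0b01, 0b11, 0b10]          # cyclic order of the code

    def quadrature_step(prev, curr):
        """Return +1, -1, or 0 motion units between successive samples."""
        d = (GRAY.index(curr) - GRAY.index(prev)) % 4
        if d == 1:
            return +1                        # one step forward
        if d == 3:
            return -1                        # one step backward
        return 0                             # no move (2 would be an error)

    samples = [0b00, 0b01, 0b11, 0b10, 0b11]
    print(sum(quadrature_step(p, c)          # 2: three steps forward, one back
              for p, c in zip(samples, samples[1:])))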
Another linear tracking scheme that is closer in spirit to our two-dimensional tracker uses
narrow white lines (about one-third white) and a different inhibition pattern. If four imager
cells are used, and we arrange to have each cell inhibit cells up to two steps away (say cells at
distance less than 2.5), then we get a set of three stable images, shown here:
If the white line spacing (imaged onto the chip) is about three cell widths, then these
images correspond in an obvious way to positions of the bright lines relative to the cells
(1 = bright); see figure 6. The figure illustrates a simple digital machine (on the same chip) that
would compare the current image with the previous image (i.e., the machine has only three
states) and output a signal that says moved up or moved down. Thus we have a relative motion
sensor for one dimension of travel. A 2-bit counter is used to convert to the familiar
quadrature signal representation which is convenient for asynchronous sampling by the host
system.
Other spacings, inhibition patterns, numbers of cells, etc., can be applied easily to the
linear motion detector problem. The real challenge is to make it work in two dimensions, and
to make it tolerant of rotation (of the imager with respect to the pattern). After discussion of
inhibition patterns, we show how to extend the 4-element one-dimensional line-tracker to a
4-by-4-element two-dimensional dot-tracker.
[Figure 6: typical configuration of the four-cell linear tracker: a pattern of bright lines on a dark background imaged onto the cells, the up/down direction of line motion relative to the sensor cells (Cell-Done-2 and Cell-Done-3 are redundant here), VDD and Reset wiring, and quadrature signalling output to the user system.]
In many cases, we can specify inhibition neighborhoods as all cells within a certain radius,
by Euclidean distance in the plane, assuming unity cell spacing. We choose a radius such that
no cells fall at exactly that radius, to avoid ambiguity; hence radius 1.5 means cells at distance
1.414 in the plane are inhibited, but cells at distance 2.0 are not. Some inhibition
neighborhoods, however, cannot be specified simply by a radius; two-pair inhibition is an
example.
Figure 7 graphically tabulates a succession of inhibition neighborhoods and the resulting
stable images, for four-element linear sensor arrays and four-by-four two-dimensional sensor
arrays. Square symmetry is assumed to reduce the complexity of the figure.
Notice that radius 2.9 is the smallest inhibition neighborhood such that when comparing
images, no dot can appear to have moved to two different adjacent pixels. That is, this
sequence cannot occur:
    old        new
    0 0 0 0    1 0 0 0
    0 1 0 0    0 0 0 0   (moved up-left or down-right?)
    0 0 0 0    0 0 1 0   (can't happen for radius > 2.83)
    0 0 0 0    0 0 0 0
What appears to be most useful is the "3.0 special" pattern of inhibition, a cross between
the radius 2.9 and radius 3.1 patterns (radius 3.0, where points separated by exactly three pixels
are coupled only if they are corners). The stable images that can result from this inhibition
pattern fall into two classes: either there is a single 1 in the central quad of pixels, or there are
two 1s.
[Figure 7: graphical tabulation of inhibition neighborhoods of radius 1.5 through 3.1 and the resulting stable images, for four-element linear arrays and four-by-four arrays, with counts of distinct images for each radius.]
Figure 8. Various positions of 4x4 imagers with respect to a hexagonal dot array, showing ways to see all the possible stable images for radius 2.9 or more inhibition.
If we use radius 2.9 inhibition instead of "3.0 special" or 3.1, the "four-corners" image
would give us an interesting problem. Although the images of two and three dots are easy to
integrate into a set of images of dots in a hexagonal array, the image of four dots is not.
Worse than that, it is possible for a positioning of two dots near opposite corners to force the
four-dot image to occur; then it is impossible to tell in which pair of opposite corners the dots
were really seen. This is why the "3.0 special" pattern was developed: it eliminates the four-
corner image and the images of three dots, while still allowing all the images of two dots, some
of which would have been eliminated by going to radius 3.1. The images of three dots are not
really missed, since seeing only two of the three dots still guarantees that with movement at
least one of the dots will remain in the field of view, so the image can be tracked by looking at
local dot motion.
Counting all rotations and mirrorings, there are 30 distinct stable images for the "3.0
special" inhibition. Of the 900 combinations of two successive stable images, most have an
obvious interpretation in terms of movement of the white dots with respect to the imager; those
that do not have an obvious interpretation must be handled by the tracking algorithm, but will
probably not occur often.
A possible non-specific implementation of the tracking algorithm is simply a finite-state
machine which takes one stable image as input (possibly encoded in just a few bits), looks also
at its current state (state equals previous input, most likely), and outputs a signal indicating
direction of movement based on the state and input, and also outputs a new state. If the
machine is built of a simple PLA (programmed logic array) with no special encoding, the PLA
can have as many as 32 inputs and 900 product terms, which would occupy most of a
reasonable size NMOS chip. The size could be reduced by first encoding the 30 images into 5
bits (PLA with 10 inputs instead of 32), and by not decoding image pairs which are
meaningless or which correspond to no motion (maybe about 600 terms instead of 900); so it
may fit in a quarter of a chip. We are still free to design the tracking algorithm and specify
the PLA outputs required, and program the PLA accordingly (i.e., the tracking problem may be
regarded as a simple matter of programming). A more specific tracking algorithm and a novel
compact implementation of it will be described in section 11.
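In software, such a machine reduces to a lookup table keyed on the pair (previous stable image, current stable image). The skeleton below is illustrative only: the 16-bit image codes and the two entries shown are placeholders, not the actual 30-image set or the PLA contents.

    # Placeholder table-driven tracker: state = previous stable image,
    # input = current stable image, output = direction of movement.
    MOVE_TABLE = {
        # (previous image, current image): reported movement (examples only)
        (0b0000001000000000, 0b0010000000000000): "up",
        (0b0010000000000000, 0b0000001000000000): "down",
        # ... the remaining meaningful image pairs would be listed here ...
    }

    def track(prev_image, image):
        """One tracker cycle: report movement, take the new image as state."""
        movement = MOVE_TABLE.get((prev_image, image), "none")
        return movement, image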
[Figure 9 signals: digital imager outputs; timing interface lines Ready, Done, Stop, Phi-Long, Phi-Short, Reset; Sensor-Node and Pixel-Light; waveform phases Watching (long time) and Cycling (short time).]
Figure 9. Imager and logic tied together by self-timed clock circuit, with timing waveform diagram.
with the imager array, and it is assumed that the digital logic is fast enough to keep up with the
imager (this assumption becomes a constraint for the designer to satisfy). The generated clocks
are called Phi-long and Phi-short, to indicate which one is of unbounded length; Phi-long
should be used as a quasi-static feedback enable to keep the logic alive and insensitive to light
while waiting for the imager. The steps of operation of the clock generator are in quick
succession as follows:
The good thing about this technique is that it doesn't care how slow the imager is;
everything is willing to wait until there is a solid digital answer. Hopefully, the imager will
receive enough light to cycle faster than once every few hundred microseconds on the average,
so it will be able to get image samples often enough to track mouse motion of several thousand
steps per second.
The counters needed for X and Y simply count through four states, in either direction (up
or down), changing only one bit at a time (i.e., 00, 01, 11, 10). This is a simple case of either a
Gray-code counter or a Johnson counter (Moebius counter). The PLA (tracker machine)
outputs needed to control the counters are just Right-X, Left-X, Up-Y, and Down-Y.
In the scheme actually implemented, the counters run through eight states, so that the
tracking algorithm can report a finer gradation of motion (Up-Half-Y, etc.). Only four states,
representing full steps, would be seen by the host system; the states mentioned above are
simply augmented by an alternating "least significant bit", so the eight-state sequence is 000,
001, 010, 011, 110, 111, 100, 101.
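The stated behavior is easy to model (a sketch of the state sequence only, not of the PLA that implements it):

    # Eight-state up/down counter: the top two bits follow the Gray sequence
    # 00, 01, 11, 10 (the full steps seen by the host) while the low bit
    # alternates to mark half steps.
    SEQ = [0b000, 0b001, 0b010, 0b011, 0b110, 0b111, 0b100, 0b101]

    def count(state, direction):
        """Advance one half-step; direction is +1 or -1."""
        return SEQ[(SEQ.index(state) + direction) % 8]

    state = 0b000
    for _ in range(3):                 # three half-steps forward
        state = count(state, +1)
    print(format(state, "03b"))        # 011: full-step code 01, half bit 1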
edge and a new image from the top edge, write in the square which way it looks like the dots
moved, and by how much (half step or full step).
One quickly develops simple algorithms to describe the reasoning about filling in the
squares. But how do we write some simple rules to do this in a digital machine, without
resorting to precomputing all the cases? To fit the capabilities of VLSI, we have come up with
a distributed local algorithm which can be implemented right in the imager array. Each pixel
saves its old value in a register, and on each cycle compares it with its new value and that of all
its neighbors. Each pixel reports one of eleven results (my dot moved to one of 8 neighbors,
my dot stayed, my dot disappeared, or I didn't have a dot to track) to some decision logic. The
decision logic then just has to see what gets reported, and filter out contradictions (a move and
a stay can be converted to a half-step move).
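Sequentially, the per-pixel rule can be sketched like this (our rendering; the chip evaluates all pixels in parallel, and the 4x4 image representation is an assumption of this sketch). Disappearing dots and empty cells stay silent, as in the distributed version described next.

    # Per-pixel move reporting: each pixel holding a dot in the old image
    # reports which neighbor (or itself) holds the dot in the new image.
    DIRS = {(-1, -1): "up-left",   (-1, 0): "up",     (-1, 1): "up-right",
            (0, -1):  "left",      (0, 0):  "stayed", (0, 1):  "right",
            (1, -1):  "down-left", (1, 0):  "down",   (1, 1):  "down-right"}

    def move_reports(old, new):
        n = len(old)
        reports = []
        for r in range(n):
            for c in range(n):
                if not old[r][c]:
                    continue                 # no dot to track here
                for (dr, dc), name in DIRS.items():
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < n and 0 <= cc < n and new[rr][cc]:
                        reports.append(name)
        return reports

    old = [[0, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
    new = [[0, 0, 0, 0], [0, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 0]]
    print(move_reports(old, new))            # ['down']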
The decision logic can also be partially distributed as a set of nine AND-OR-INVERT
gates running through the array (one for each of the eight move directions and one for the
no-move case; disappearing dot and no dot to track are not reported). These gates report a low
logic state if a pixel had a dot in the old picture, AND the appropriate neighbor has a dot in
the new picture, OR any other pixel met a similar condition. A single 9-input conflict
resolution PLA is needed outside the array to decode combinations of zero, one, or two
reported move directions and to produce the counter control signals (see figure 10). Actually,
of the 36 conceivable patterns of more than one indicated movement, only twelve are both
possible and clearly meaningful (as half-steps); so the logic can be very simple (PLA with only
20 terms, for the eight possible full-steps and the eight possible half-steps, four of which occur
two ways). Any other sequence, whether sensible or not, will produce no count commands.
The eight-state up-down counters are also most easily designed as PLA's.
[Figure 10: an example in which one cell reports Moved-Down and a neighboring cell reports Moved-Right between the old and new images, giving the resultant Half-Step Down-Right; and a block diagram in which the nine report lines (Moved-Up-Left, Moved-Up, Moved-Up-Right, Moved-Left, Stayed-Here, Moved-Right, Moved-Down-Left, Moved-Down, Moved-Down-Right, of which exactly 0, 1, or 2 are true) feed the tracker PLA (22 terms), which produces counter control and test signals (X-Right, X-Half, X-Full; Y-Up, Y-Half, Y-Full; Any-Good; Jump-To-Host) for the eight-state X and Y counters with outputs X-A, X-B, X-L and Y-A, Y-B, Y-L.]
A simple three-pixel example, diagrammed in figure 12, will serve to clarify the properties
of this kind of detector array. Note that when all cells have received light it is possible for the
array to arrive at a stable state in which no dots were detected (Spot Detected = 0, Pixel
Light = 1 in all cells). Any set of dots which is a subset of an image that would have been
detected by the equivalent inhibition pattern in a light spot detector array is a possible stable
image.
Therefore, for the three-pixel neighbor-inhibiting dark spot sensors, we get these stable
images: 000, 100, 010, 001, and 101 (every subset of the light-spot stable images 010 and 101).
Figure 11. The layout of the upper-left Optical Mouse cell (Pixel-Light poly distribution wires and diffusion grounds).
For four-by-four arrays, the additional stable "subset images" are illustrated in figure 13.
One result is that with the radius 2.9 inhibition pattern, seeing spots on opposite corners does
not force the four-corners image, but is actually most likely to give the correct two-corners
image. A more general result is that the spot pattern to be tracked does not need to be so
closely matched to the inhibition pattern, since the circuit is willing to wait for spots to really
be there before it claims to see them; a pseudo-random distribution of dots would probably
work quite well.
With this technique, it would be possible to make a linear motion tracker with only three
cells, each inhibiting all the others, with a dark line spacing of three cells or greater; similarly, a
2-D tracker might be built with just a three-by-three array of cells. For the linear tracker, the
image sequence for uniform motion could be either the 100, 010, 001, 000 cycle or the 100,
010, 001 cycle. These trackers would have to assume that a dot disappearing from one edge
and/or appearing on the other represents a step of motion (or a half step, depending on what
assumptions are made about the line spacing).
[Figure 12: the three-pixel dark-spot detector, with signals Sensor-Node-1, Spot-Detected-1, Spot-Detected-2, Spot-Detected-3, Reset, Ready, and Done; low-threshold (*) and high-threshold (**) gates are marked.]
Figure 13. Additional patterns for the four-by-four dark-spot detector array (stable "subset images" for radius 2.9, 3.0 special, and 3.1 inhibition; * radius 2.9 may be best for the dark-spot detector scheme).
14. Of mice and pens
The optical mouse's compact internals will allow it to be repackaged into various other
forms. For example, a pen-like device with a big base that keeps it from falling over might be
desirable. A "ball-point" tracking device that watches a golfball-like pattern of dots on a
rolling ball in the tip of a pen may also be useful.
15. Summary
The optical mouse embodies several ideas that are not obvious extensions of standard
digital or analog design practices, but which contribute to the design of robust sensors of the
analog-to-digital sort. Using the concept of lateral inhibition, sensor cells that are trivial and
useless alone become powerful in their synergism. A sensor array that forces itself into a useful
and informative stable digital state is very easy to deal with, through standard digital
techniques. It is especially useful if it can decide when it has reached such a stable state, and
when it has been reset enough to be ready to start over, for then it can be regarded as self-
timed, and clocks can be generated that cycle it quickly yet reliably.
The optical mouse is just one simple example of an application of smart digital sensors,
which happens to involve a few stages of logic to arrive at the answer in the desired format.
Fortunately for this project, the NMOS technology that we know and love for logic is also well
suited for sensing photons; so once the ideas and algorithms were firm, the chip design was
relatively routine, and quick-turnaround implementation was available through the standard
well-greased path.
The interrelated inhibition neighborhoods, contrasting patterns, sets of stable images, and
tracking strategies for the optical mouse application have been thoroughly discussed in the text,
and do not seem amenable to summarization here.
References
[Business 1981]
Business Week, "Will the boss go electronic, too?", pp. 106-108, May 11, 1981.
[Card et al. 1977]
S. K. Card, W. K. English, and B. Burr, "Evaluation of mouse, rate-controlled isometric
joystick, step keys, and text keys for text selection on a CRT", Xerox Palo Alto Research
Center SSL-77-1, April, 1977.
[Conway et al. 1980]
L. A. Conway, A. Bell, and M. E. Newell, "MPC79: The Large-Scale Demonstration of
a New Way to Create Systems in Silicon," Lambda, The Magazine of VLSI Design, pp.
10-19, Second Quarter, 1980.
[Engelbart 1970]
D. C. Engelbart, "X-Y position indicator for a display system", U. S. Patent 3,541,541,
Nov. 17, 1970.
[Engelbart & English 1968]
D. C. Engelbart and W. K. English, "A Research Center for Augmenting Human
Intellect", FJCC 1968, Thompson Books, Washington, D. C., p. 395.
[Engelbart et al. 1967]
D. C. Engelbart, W. K. English, and M. L. Berman, "Display-selection techniques for text
manipulation", IEEE Transactions on Human Factors in Electronics, HFE-8, 1, 5, 1967.
[Hawley et al. 1975]
J. S. Hawley, R. D. Bates, and C. P. Thacker, "Transducer for a display-oriented pointing
device", U. S. Patent 3,892,963, July 1, 1975.
[Hon & Sequin 1980]
R. W. Hon and C. H. Sequin, A Guide to LSI Implementation, Xerox PARC Technical
Report SSL-79-7, Palo Alto, California, 1980.
[Koster 1967]
R. A. Koster, "Position control system employing pulse producing means indicative of
magnitude and direction of movement", U. S. Patent 3,304,434. Feb. 14, 1967.
[Lyon 1981]
R. F. Lyon, "A Bit-Serial VLSI Architectural Methodology for Signal Processing", VLSI
81: Very Large Scale Integration (Conference Proceedings, Edinburgh, Scotland; John P.
Gray, editor), Academic Press, August, 1981.
Forest Baskett
Xerox PARC
Palo Alto, California
Stanford University
Palo Alto, California
Keys to Successful VLSI System Design
James G. Peterson
Consultant to TRW DSSG
BACKGROUND
The interest in the potential of VLSI first began to explode
several years ago when G. Moore unveiled his famous curve showing
the exponential increase with time of the available transistor-level
complexity on one integrated circuit. Exciting new system
capabilities were projected for the near future, and many new
architectures proposed. This initial enthusiasm was quickly
tempered, however, by the observation that the effort required to
specify, design, implement, and verify an ultra-complex item such as
a VLSI appeared to be at least LINEAR with the complexity, and highly
unpredictable. To solve this problem in specific cases, more
designers were planned per project. As in software, the added
communication caused the error rate of each designer to increase and
the individual productivity to decrease. In addition, the complexity
of the devices designed made system test development and design
verification more difficult, leading to systems produced with more
built-in design faults and poorer performance than planned. The
performance of many of today's systems is limited more by design cost
and schedule considerations than by the available processing
technology.
This situation of long, unpredictable schedules, extensive
manpower requirements, questionable robustness and potentially
unpredictable performance causes a high cost to be associated with
the use of custom VLSI. Consequently, many military and industrial
systems in which performance could be substantially improved by the
use of custom VLSI do not use it. I believe that many of the initial
enthusiastic predictions of the kinds of systems to be available with
new processing technology remain unfulfilled for these reasons.
A significant and often unplanned-for aspect of the engineering
environment is that no engineering is done in a vacuum. There are
continual forces during the course of a design which change the
problem to be solved and the constraints on the solution. These may be
PROPOSITION
I propose that at any time there is always more useful
regularity or hierarchicalness to be found in any useful problem.
This implies that for many of today's applications, economically
sound VLSI solutions could be found that would greatly improve the
product performance.
PLACES TO LOOK FOR ADDITIONAL REGULARITY IN TODAY'S PROBLEMS
The way to find this additional regularity is to evaluate all the
aspects of the problem for the seven attributes listed above. This
includes the market analysis which determines what product should be
built and how it should be specified, the initial large-scale design
decisions, all the way down to the layout and test of the individual
transistors. Especially good places to look are in the highest-level
specifications and system design decisions. There are many new
optimization techniques available to the VLSI designer. A technology
which has worked in one place, as the above examples demonstrate,
should be vigorously applied wherever it is effective.
It is appropriate when using this technology to disregard the
traditional boundaries between specification, chip, logic, test,
mechanical, software, and system design tasks, and to address all
these aspects simultaneously. At first glance, this seems a very
complex and difficult task, but in situations where hierarchicalness
and/or regularity are present, a single engineer can usually
comprehend all the facets of the entire system at one time. And, he
can then often consider significant tradeoffs which are usually
overlooked.
An example of this thorough application of regularity is found
in refrigerator design. Refrigerator buyers want two sexes of
refrigerators -- left and right hand opening. Initially, two
separate types of refrigerator were made. This gave refrigerator
The CGA's were used in the system because they had already been
designed, and it was thought that inadequate funds were available to
replace them. Most of the standard cells were required in the final
system design to interface the CGA logic to the new chips, and some
to combine the functions of the blocks.
In hindsight, however, it is now apparent that it would actually
have been cheaper to redesign MORE of the system. The additional
available regularity would have actually DECREASED the cost of the
entire system design. One more block would have been developed,
probably a type of state machine, and the functions of the counter
block would have been augmented by the addition of more programming.
This would have REDUCED the number of blocks in the system, and also
REDUCED the chip count from 14 to 7 or 8. We were not zealous enough
in applying our own philosophy.
CONCLUSION
All of the VLSI design methodology improvements yet conceived
are based on some common intangible attributes, and principles for
their application. By consciously using these proven attributes and
principles in a uniform way on all relevant issues, we have a chance of
realizing the full potential of VLSI systems technology. The further
use of this information is up to you.
Programmable LSI Digital Signal
Processor Development
Akira Sawai
C & C Systems Research Laboratories
Nippon Electric Co., Ltd.
Kawasaki, Japan
ABSTRACT
I. INTRODUCTION
The signal processing functions common to all these applications are linear
and nonlinear operations to provide filtering, averaging, prediction, optimization,
adaptation, spectrum analysis or signal energy detection.
In implementing signal processors, the most important requirements are for
accuracy and speed in multiply-add operations, which are frequently used in
digital filters or DFT/FFT processors.
The accuracy determines dynamic range and signal-to-distortion (S/D)
ratio. The standard PCM employs nonlinear 8 bit encoding, which is equivalent
to linear 13 bits for a smaller signal level. Hence, signal processors should be
designed so as not to incur significant S/D degradation for such PCM encoded
signals. This requires additional bits added to the linearized PCM signals.
Practical surveys on various applications, such as voiceband signal filters, DTMF
receivers, modems and ADPCM codecs, have shown that the required accuracy is
in the range of 12-20 bits. However, most of the cases including voice
recognition application would be realized with 16-bit accuracy under careful
software design.
The speed directly affects the amount of signal processing within a given
time period, thereby affecting the size and cost of digital processing systems.
Single chip realization of a DTMF receiver requires about 25 biquad filter
operations in an 8kHz sampling interval. This corresponds to 0.8 million
multiplications per second. Taking into account the figure as well as other
additional control functions, about 50 biquad filter processing capability for an
8kHz sampled signal is the processing speed objective.
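As a rough check on that figure (assuming the standard biquad form with four coefficient multiplications per section, an assumption of this note rather than a number from the text): 25 sections x 4 multiplications x 8000 samples/s = 800,000 multiplications per second, i.e. the 0.8 million cited above; the 50-section objective doubles this to 1.6 million.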
Memory capacity is also an important requirement for signal processor
implementation. To implement 50 different biquad filters by a single chip
processor, at least 100 word capacity is required for variable data memory
(RAM), because one biquad filter needs two delay elements. For 128 real point
FFT or 64 complex point FFT realization (whose possible applications may be
such as adaptive transform coders), 128 words are the minimum requirement.
In addition to the variable data memory, a non-volatile fixed data memory
(Data ROM) should be employed. Several hundred words are required for these
purposes, that is, 200 words for the biquad filter coefficients, around 200 words
for 128 real point FFT twiddle factors and window coefficients, and 256 words
for unidirectional linear/nonlinear PCM code conversion.
Program capacity requirement is another problem, and highly dependent
both on the processor architecture and instruction cycle time.
III-B. Architecture
Figure 1 shows the architecture for signal processor µPD7720. The
processor has a built-in multiplier, Data ROM, Data RAM, Instruction ROM,
ALU, double accumulation registers and several other registers; i.e. temporary
register (TR), multiplier input registers K and L, multiplier output registers M
and N, data pointer (DP), ROM pointer (RP), serial I/O registers (SI and SO), data
register (DR) and status register (SR). These registers are interconnected
through a main bus or sub-buses. The arithmetic operation is carried out by
fixed point arithmetic. The fixed point exists between the first bit (i.e. sign bit)
and the second bit of the 16-bit 2's-complement data words.
The Instruction ROM capacity is chosen to be 512 words x 23 bits to
enable execution of 500 non-repetitive instructions during an 8 kHz sample
interval. In order to increase available program steps for processing lower
sampled signals, a four-level subroutine stack is provided.
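As a simple consistency check (our arithmetic, using the 250 ns instruction cycle reported below): an 8 kHz sampling interval is 125 µs, and 125 µs / 250 ns = 500 instructions, which the 512-word Instruction ROM just accommodates.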
The Data ROM capacity requirements previously given need not be
satisfied at the same time for a single application. A compromise in the form of
a 512 word memory capacity is provided. The word length is determined as 13
bits.
Data loading to registers K and L through the main bus and a sub-bus can be
achieved by using special destination codes.
[Instruction formats: 1) OP/RT instructions (with a ROM pointer decrement field); 2) branch instructions.]
The signal processor chip is fabricated with a 3-µm N-channel E/D MOS
technology [7]. A 250 ns instruction cycle under 8 MHz clock operation is
realized. Total power consumption is 900 mW. Figure 3 shows a chip photo-
micrograph. More than 40,000 transistors are integrated on a 5.47 x 5.20 mm die
area.
The multiplier is made up of two input registers, Booth's decoders, 112
multiplier cells and an adder with a 31-bit output. Each multiplier cell consists
of a partial product generation multiplexer and a full adder, and is realized by 30
transistors in a 200 x 94 µm area. The total multiplier hardware occupies about
3 mm^2 on the die, and executes a 16 x 16-bit multiplication within 250 ns at
280 mW power consumption.
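To illustrate the Booth recoding named above (a behavioral sketch of modified radix-4 Booth arithmetic only; it does not model the 112-cell array or its timing):

    # Modified (radix-4) Booth multiplication: recode y two bits at a time
    # into digits in {-2, -1, 0, 1, 2}, one partial product per digit.
    def booth_multiply(x, y, bits=16):
        """Multiply two signed `bits`-bit integers by Booth recoding."""
        DIGIT = {0b000: 0, 0b001: 1, 0b010: 1, 0b011: 2,
                 0b100: -2, 0b101: -1, 0b110: -1, 0b111: 0}
        acc = 0
        y_ext = y << 1                       # append the implicit y[-1] = 0
        for i in range(0, bits, 2):
            triple = (y_ext >> i) & 0b111    # bits y[i+1], y[i], y[i-1]
            acc += DIGIT[triple] * x << i    # partial product for this pair
        return acc

    print(booth_multiply(-12345, 6789))      # -83810205 == -12345 * 6789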
For comparison, the single chip signal processors announced so far, together
with their architectural and performance features, are listed in Table 1. Only
the Intel 2920 has built-in A/D-D/A converters. It has a feature wherein its I/O
interface is analog; however, it has no hardware multiplier on the chip.
IV. APPLICATION EXAMPLES AND PERFORMANCE
The biquad filter section computes (D: delay element):

    w_i = x_i - b1 w_{i-1} - b2 w_{i-2}
    y_i = w_i + a1 w_{i-1} + a2 w_{i-2}
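A direct software rendering of this recurrence (a floating-point sketch; the µPD7720 computes it with 16-bit fixed-point multiply-accumulate hardware) makes the two delay elements per section explicit:

    # One second-order (biquad) filter section, per the equations above.
    def biquad(x, a1, a2, b1, b2):
        y = []
        w1 = w2 = 0.0                        # the two delay elements
        for xi in x:
            wi = xi - b1 * w1 - b2 * w2
            y.append(wi + a1 * w1 + a2 * w2)
            w1, w2 = wi, w1                  # advance the delays
        return y

    print(biquad([1.0, 0.0, 0.0], a1=0.5, a2=0.25, b1=-0.9, b2=0.4))
    # approximately [1.0, 1.4, 1.11]: start of one section's impulse response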
Table 1. (Continued) Comparison of announced single-chip signal processors. The column headings (processor names) are not legible in this reproduction; the recoverable rows, read across the five machines compared, are:

Instruction cycle time (ns): 400, 300, 250, 800, 250
Second-order filter sections at 8 kHz sampling rate: 19, 50, 55, 39
Power supplies (V): ±5; +5; +5; +5; +5
Power dissipation (W): 0.8, 0.9, 1.5
I/O: 4 MPX analog inputs and 8 DMPX analog outputs (same pins); 8-bit bus plus serial I/O; 8-bit bus plus serial I/O; 16-bit bus plus serial I/O; 16-bit bus plus serial I/O
Development support (items as legible): software simulator, assembler, PC for EPROM, application-oriented compiler; real-time in-circuit emulator; X-asm; X-asm, evakit with programming function; iMDS assembler, FORTRAN/iMDS-based simulator
Steps 5 through 9 of the biquad inner loop (accumulator operation; parallel register transfers; operand magnitudes after the step, as far as they can be recovered):

5: ACCA <- SGN (saturate to +max or -max)
6: ACCA <- ACCA + M; TR <- ACCA(old); DP modify        |W_i|, |a2 W_{i-2}|, |W_i|
7: ACCA <- ACCA + RAM; K <- RAM, L <- ROM; RP decrement        |W_i + a2 W_{i-2}|, |W_{i-1}|, |W_{i-1}|, |a1 - 1|
8: ACCA <- ACCA + M; RAM <- TR; DP modify        |W_i + W_{i-1} + a2 W_{i-2}|, |(a1 - 1) W_{i-1}|, |W_i|
9: RAM <- K        |W_{i-1}|
[Figure: output spectrum, 10 dB per vertical division, 500 Hz per horizontal division; and a system diagram in which a Signal Processor Module (analog/PCM in and out; RST, SCK, CLK, SO/SI, SIEN/SOEN, SYNC; input buffer) connects to a Packet Processor Module for ADPCM conversion.]
V. FUTURE TRENDS
REFERENCES
ABSTRACT
The effectiveness of very large scale integration (VLSI) in re-
ducing the incremental cost per unit of performance of a variety of
flexible system functions can be significantly enhanced by employing
a high degree of functional parallelism with serialized data-flow and
control. Both Functional Parallelism (the parallel use of an array of
high density, low cost, lower performance devices to obtain a high
performance function) and Bit-Serialized Arithmetic (the use of single
bit-stream operations to perform elementary arithmetic functions) have
been factored into VLSI systems and computations to permit advantageous
use of MOS solid-state technologies as well as graceful transitions of
processor implementation from one scale of large scale integration to
the next. Some of the major considerations linking form to function
are noted here with examples illustrating the impact of functional par-
allelism and serialized arithmetic.
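The bit-serial style can be made concrete with a minimal model (ours, not a circuit from the paper): operand bits arrive least-significant bit first, one per clock, and the only state is a carry flip-flop.

    # Bit-serial addition of two equal-length LSB-first bit streams.
    def serial_add(a_bits, b_bits):
        carry, out = 0, []
        for a, b in zip(a_bits, b_bits):
            total = a + b + carry
            out.append(total & 1)            # sum bit emitted this clock
            carry = total >> 1               # carry held for the next clock
        return out

    a = [1, 0, 1, 1, 0]                      # 13, LSB first
    b = [1, 1, 0, 1, 0]                      # 11, LSB first
    print(serial_add(a, b))                  # [0, 0, 0, 1, 1] -> 24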
FUNCTIONAL PARALLELISM
The use of functional parallelism as a means to realize high per-
formance system functions has emerged with the economic availability of
custom LSI and the technology improvements being realized in connection
with the fabrication of MOS-type electronic devices of very high func-
tional density. While serially organized memory devices, such as
shift-registers, have employed functional parallelism to achieve high
performance levels with low performance elements, it was not until the
advent of the CE chip(1), conceived to function as part of an inte-
grated array of functionally parallel devices, that digital arithmetic
units with distributed serialized logically integrated data flow and
control were reported(2). In this device the flow of data and control
can be contrasted as shown in Figure 1 with that of more commonly em-
ployed bit-parallel devices. Rather than to employ bit-parallelism for
arithmetic throughput, effective reduction of overall complexity at
high levels of functional throughput has been achieved by serializing
both arithmetic operations and memory at the device-level while par-
alleling devices at the function-level of design. In simple terms,
this approach can lead to economic solutions to signal processor design
by a substantial reduction in device pin-outs, a higher level of func-
tional operation per device at lower bit rates, direct parallelism of
[Figure 1: contrasting flows of data and control for serial arithmetic (8-bit bytes: data in/out, ground, and function controls on 8 pins) versus parallel arithmetic (22 pins).]
Figure 3. FFT-64
Figure 4. VFFT-2
Figure 5. VFFT-10
The VFFT-10 shown in Figure 5 uses 40 identical printed wire
boards, grouped in five sets of eight of each radix 4 stage. Each
board contains two radix 4 complex arithmetic units, alternating mem-
ories, and multiplexers for performing data permutations. The algo-
rithm utilized is (1)
[Diagram: the CE chip with inputs a, b, c, d and coefficients w, s.]
FUNCTIONAL EXTENSIONS
Clearly, this organization of modular arithmetic with distributed
memory and control contracts directly and gracefully at successively
higher scales of solid-state integration as such consolidation becomes
economically attractive. At the three micron level of geometry, for
example, the machine illustrated would be capable of unpartitioned ex-
ecution of 128 x 128 matrix operations; or, a machine of four times the
complexity shown will execute the 256 x 256 matrix arithmetic directly.
Typical performance of an array 128 x 128 operating at a clock rate of
10 megahertz is a matrix multiply in 516 microseconds, a complex matrix
multiply in 2.06 milliseconds, and a real matrix inversion in 14 milli-
seconds. The 256 x 256 case requires four times these periods with the
same size array and precision described.
REFERENCES
1. Powell, N.R. and Irwin, J.M., "Flexible High Speed FFT with MOS
Monolithic Chips," Eighth Asilomar Conference on Circuits, Systems,
and Computers, December 1974.
2. Powell, N.R. and Irwin, J.M., "A MOS Monolithic Chip for High-
Speed Flexible FFT Microprocessors," ISSCC-75, February 1975.
3. Powell, N.R. and Irwin, J.M., "Signal Processing with Bit-Serial
Word-Parallel Architectures," Real Time Signal Processing, Proc.
SPIE, Vol. 154, 98-104.
Functional Extensibility: Making the World
Safe for VLSI
Justin R. Rattner
Intel Corporation
Special Systems Operation
Aloha, Oregon 97005
1. INTRODUCTION
This research was supported in part by NSF under Grants ECS 80-09938,
MCS 80-25376, MCS 81-04882, DCS 81-10097, and was facilitated by the
use of Theory Net, NSF Grant MCS 78-01689.
VLSI: circular shifts. This proposed construction also shows that the
computation of circular shifts can be naturally partitioned among
several chips, without requiring any inter-chip communication.
2. COMPLEXITY OF TRANSITIVE FUNCTIONS WITH MULTIPLE INPUT COPIES
THEOREM: For any VLSI chip computing a transitive function and for
which replication of inputs both in space and in time is allowed:

    A^3 T^4 = Ω(n^4).

Since this is lower than the bound of AT^2 = Ω(n^2), which is A^3 T^4
= Ω(n^4), valid for such computations when no input replication is
allowed, one may hope that input replication can be used to decrease
the complexity of at least some transitive functions to a value below
the bound of A^3 T^4 = Ω(n^4).
In the next Section it is shown that this can be done for a
circuit that computes the circular shifts of a bit-vector, which is a
transitive function.
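The exponents in the theorem can be made plausible by a short counting sketch (our reconstruction, not necessarily the authors' proof). If each of the n inputs is presented q times, the usual information-transfer argument across a chip bisection weakens by a factor of q, while the nq input events must pass through O(A) pads in T cycles:

    AT^2 = Ω(n^2 / q^2),        q = O(AT / n).

Substituting the second relation into the first gives AT^2 = Ω(n^4 / (A^2 T^2)), that is, A^3 T^4 = Ω(n^4).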
4. THE COMPLEXITY OF THE CIRCULAR SHIFTER
In particular, for the measure A^3 T^4 the optimal choice for k and
m is n^{1/3} and n^{2/3}, respectively, resulting in a complexity of
A^3 T^4 = O(n^{13/3}). While this compares quite favorably with the lower bound of
Ω(n^4) derived earlier, it nevertheless leaves a gap between lower and
upper bounds.
FIGURES

[Figures: a p x p array of memory modules M[0,0] through M[p-1,p-1] and processors P[0] through P[q-1] under a control unit; legend: one-bit subtractor (difference) element, clocked delay element, input and output.]
=========== PHASE 0 ===========
4 5 6 20 21 22
8 9 10 24 25 26
12 13 14 28 29 30
16 17 18 32 33 34
20 21 22 36 37 38
24 25 26 40 41 42
28 29 30 44 45 46
<------ Module 0 ------>  <------ Module 1 ------>
Input wave 1, accepted.
20 21 22 36 37 38
24 25 26 40 41 42
28 29 30 44 45 46
32 33 34 48 49 50
36 37 38 52 53 54
40 41 42 56 57 58
44 45 46 60 61 62
<------ Module 0 ------>  <------ Module 1 ------>
Input waves 2 and 3 (not shown) follow.
Shift up by 3.
32 33 34 48 49 50
36 37 38 52 53 54
40 41 42 56 57 58
44 45 46 60 61 62
48 49 50 0 1 2
?? ?? ?? ?? ?? ?? ??   ?? ?? ?? ?? ?? ?? ??
?? ?? ?? ?? ?? ?? ??   ?? ?? ?? ?? ?? ?? ??
<------ Module 0 ------>  <------ Module 1 ------>
Shift left by 2.
34 ?? ?? 46 47 48 49 50 ?? ??
38 ?? ?? 50 51 52 53 54 ?? ??
42 ?? ?? 54 55 56 57 58 ?? ??
46 ?? ?? 58 59 60 61 62 ?? ??
50 ?? ?? 62 63 0 1 2 ?? ??
?? ?? ?? ?? ?? ?? ??   ?? ?? ?? ?? ?? ?? ??
?? ?? ?? ?? ?? ?? ??   ?? ?? ?? ?? ?? ?? ??
<------ Module 0 ------>  <------ Module 1 ------>
48 49 50 51 52 53 54 0 1 2 3 4 5 6
52 53 54 55 56 57 58 4 5 6 7 8 9 10
56 57 58 59 60 61 62 8 9 10 11 12 13 14
60 61 62 63 0 1 2 12 13 14 15 16 17 18
0 1 2 3 4 5 6 16 17 18 19 20 21 22
4 5 6 7 8 9 10 20 21 22 23 24 25 26
8 9 10 11 12 13 14 24 25 26 27 28 29 30
12 13 14 15 16 17 18 28 29 30 31 32 33 34
Shift up by 3.
0 1 2 12 13 14 15 16 17 18
4 5 6 16 17 18 19 20 21 22
8 9 10 20 21 22 23 24 25 26
12 13 14 24 25 26 27 28 29 30
16 17 18 28 29 30 31 32 33 34
?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ??
?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ??
Shift left by 2.
62 63 0 1 2 ?? ?? 14 15 16 17 18 ?? ??
2 3 4 5 6 ?? ?? 18 19 20 21 22 ?? ??
6 7 8 9 10 ?? ?? 22 23 24 25 26 ?? ??
10 11 12 13 14 ?? ?? 26 27 28 29 30 ?? ??
14 15 16 17 18 ?? ?? 30 31 32 33 34 ?? ??
?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ??
?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ??
<------ Module 0 ------>  <------ Module 1 ------>
Figure 4: (continued)
Planar Circuit Complexity and The
Performance of VLSI Algorithms +
John E. Savage
Brown University
Department of Computer Science
Providence, Rhode Island 02912
1. INTRODUCTION
In 1979 Thompson [1] reported that, under a suitable model for VLSI
chips, the product AT^2 of chip area A and time T to compute the Fast Fourier
Transform on n inputs must satisfy AT^2 = Ω(n^2). His model accounts for the
chip area used by wires as well as for computational elements. He extended
these results in [2] and in addition examined the sorting problem. Brent and
Kung [3] introduced a somewhat different model for VLSI chips in which the
area occupied by wires and circuit elements is convex. They demonstrate that
AT^2 = Ω(n^2) to multiply two n-bit integers, a result obtained with the original
model of Thompson by Abelson and Andreae [4]. Brent and Kung show that
A = Ω(n) and they present algorithms that come close to meeting their lower
bounds. Savage [5] obtained bounds of AT^2 = Ω(p^4) with both models for pxp
matrix multiplication, inversion and transitive closure. Algorithms previously
given by Kung and Leiserson [6] and Guibas et al. [7] were shown to be
optimal. Preparata and Vuillemin [8] subsequently introduced a family of
optimal matrix multiplication algorithms for O(1) ≤ T ≤ O(n).
Vuillemin [9] has extended the Brent-Kung model to pipelined chips. If P
is the period of computations, that is, the time between production of two suc-
cessive results, he shows that AP^2 = Ω(n^2) for transitive functions on n inputs.
He also demonstrates that A = Ω(n) for these problems. Yao [10] considers
VLSI algorithms for x + yz over a finite field F, n = log2 |F|, as well as the
predicate associated with graph isomorphism, and derives bounds of the form
AT^2 = Ω(n^2) under various conditions. Lipton and Sedgewick [11] consider the
performance of VLSI algorithms for the Brent-Kung model, modified somewhat, and
obtain quadratic lower bounds for a number of predicates.
+ This work was supported in part by the National Science Foundation under grant MCS 76-
20023, by the University of Paris-Sud, Orsay and by INRIA, Rocquencourt, France. Preparation was
supported in part by NSF Grant ECS 80-24637.
Keywords: VLSI, planar circuits, complexity, algorithms, tradeoffs, predicates, integer powers
and reciprocals.
In this paper we offer a model of VLSI algorithms which differs from previ-
ous models in important respects and we present a new method of analysis in
which the performance of algorithms can be related to the planar circuit com-
plexity of functions for which they are designed. The model is described as fol-
lows:
A1. The chip realizes a sequential machine constructed from discrete logic
elements and straight wire segments.
A2. Wires have a width of λ and a separation and a length of at least λ. Each
logic element occupies an area of at least λ^2. The chip has ν planes each
of which may contain wires or logic elements.
A3. Inputs are read and outputs produced at times that are data-
independent.
A4. a) Each input variable is supplied exactly once to the chip; or,
b) Each input variable is supplied at one place on the chip.
Assumption A4 b) is weaker than A4 a) because it permits input variables to be
read multiple times but only at the same place on the chip. We say that a VLSI
algorithm that meets A4 a) is semelective† while one that meets A4 b) is
semelocal†.
The Thompson model [1,2] assumes A1, A2 and A4 a), plus assumes that
wires are rectilinear and that each input variable is associated with a unique
place on the chip. In [1] assumption A3 was made. More recently, Thomp-
son [12] is working with assumption A4 b). Our model differs from the Brent-
Kung model in that in addition to A1 and A2 they assume that the region of the
chip occupied by elements and wires is convex. It is the area of this region
that is measured. Lipton and Sedgewick and Yao use this model. However,
they and Thompson [2] do not require assumption A3, which appears to be
essential to our results. Lipton and Sedgewick obtain their results with
assumption A4 b) while Brent and Kung assume A4 a).
† Semelective and semelocal are neologisms formed from the Latin words semel, meaning once, lectio, meaning to read, and locus, meaning place.
C_p(f) ≤ 7m·2^n

The former result demonstrates that planar circuit complexity is no larger than quadratic in the standard circuit complexity. The latter result demonstrates that for most Boolean functions, standard and planar circuit complexity measures have about the same value.
The following computational inequalities relate chip area A and the number of cycles of execution T to the two circuit complexity measures.
Theorem 3: Let f: {0,1}^n → {0,1}^m be computed in T cycles on a VLSI chip of area A with an algorithm that meets conditions A1 through A4. Then, the inequality

C(f) ≤ ν(A/λ²)T

must be satisfied.
Theorem 4: Let f: {0,1}^n → {0,1}^m be computed in T cycles on a VLSI chip of area A. Then, the inequalities

C_p(f) ≤ 12ν(A/λ²)T²,    C_p(f) ≤ 12ν(A/λ²)²T

must be satisfied, where the first holds when VLSI algorithms are semelocal and the second when they are semelective. If the chip algorithm is not semelocal or semelective, the inequalities hold when C_p(f) is replaced by the corresponding multilective planar circuit complexity.
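As a worked step (not from the paper's text, only a direct consequence of Theorem 4): if a function is known to have planar circuit complexity C_p(f) = Ω(n²), the first inequality immediately gives a quadratic area-time bound.

```latex
\[
  C_p(f) \le 12\,\nu\,(A/\lambda^{2})\,T^{2}
  \quad\Longrightarrow\quad
  A\,T^{2} \;\ge\; \frac{\lambda^{2}\,C_p(f)}{12\,\nu}
          \;=\; \Omega\!\left(\frac{\lambda^{2}\,n^{2}}{\nu}\right).
\]
```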
The semelective condition on a planar circuit appears necessary to obtain strong lower bounds to C_p(f). Theorem 3 is a restatement of a result in [14]. Valiant [15] has observed an interesting connection between the second inequality and lower bounds to space and time for uniprocessor machines. He notes that the analysis used by Grigoryev [16] can be extended to the VLSI model to obtain a lower bound to A²T and that all previous bounds obtained with the Grigoryev method apply here. In particular, this means that the semelective condition is unnecessary to show a quadratic lower bound to many multiple-output functions such as polynomial multiplication over GF(2) [16], the discrete Fourier transform [17], binary integer multiplication [18], and matrix inversion [19]. (Grigoryev [16] has also shown an Ω(p⁴) lower bound for p×p matrix multiplication.) Thus, these bounds will hold even for multilective algorithms.
Lipton and Tarjan have applied their separator theorem for planar graphs [20] to the problems of realizing superconcentrators, to shifting and to Boolean matrix multiplication, and they demonstrate that each of these problems requires a planar circuit size which is quadratic in the length of their inputs. In the next section we state certain conditions on functions which permit the application of the Planar Separator Theorem to the derivation of lower bounds to the planar circuit size of many important problems.
3. COMPUTATIONAL INEQUALITIES
We begin with a few definitions.
Definition 2: A function h: {0,1}^s → {0,1}^t is a subfunction of f: {0,1}^n → {0,1}^m if it can be obtained by suppressing some output variables and/or by an assignment to a subset of the input variables.
The next definition identifies a class of predicates.
Definition 3: A function f: {0,1}^n → {0,1} is ω-separated if there exists a subset X of its variables such that for any partition of X into two sets A, B with |A|, |B| ≤ 2|X|/3, there exist at least 2ω pairs {(a_i, b_i)} of variables such that f(a_i, b_j) = 1 if i = j and 0 otherwise. A function f: {0,1}^n → {0,1}^m is also ω-separated if it contains a subfunction which is ω-separated.
associated predicate F: {0,1}^{n+m} → {0,1} is defined by
5. DISCUSSION
Much of the recent literature on the performance of VLSI algorithms con-
cerns quadratic lower bounds to AT2. These results reinforce the notion that
this measure is basic and should be used to evaluate the performance of VLSI
algorithms for all problems. In this section we demonstrate that there are
problems for which the measure A2T is a better measure in that algorithms
can be found for which it is much closer to the lower bounds that can be
derived with it than to lower bounds derived with AT2.
The inequality involving AT2 is stronger than that involving A2T when
AI }..2 ~ T. If a VLS] algorithm is semelective several authors have shown that
the area required for various problems must be at least linear in the length of the input. This is true for the shifting function [22], for binary integer multiplication [3], transitive functions [9], and for matrix multiplication [23]. Thus, in this case, the AT² bound will be the stronger for large problem sizes.
A problem for which the measure A²T is superior to AT² is binary sorting. The standard sorting problem for words consisting of strings over {0,1}^k is modeled by a function f_sort: {0,1}^{nk} → {0,1}^{nk}. If k ≥ ⌈log₂n⌉ + 1, it is easy to show that the function is transitive of order n. Combining this with the observation above, the better measure for this function is AT². However, for binary sorting, namely, when k = 1, the other measure is superior, since one can construct a planar circuit from the schema of Muller and Preparata [24] which uses area as small as O(log²n) and for which AT = O(n).
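To see why binary sorting is so much easier than word sorting, note that for k = 1 the output is determined by the number of ones alone. A minimal illustration (this is the function the circuit computes, not the Muller-Preparata construction itself):

```python
def sort_bits(bits):
    """Binary sorting (k = 1): each word is a single bit, so sorting
    degenerates to counting the ones and emitting them at the end --
    the job done in hardware by a small counting network."""
    ones = sum(bits)
    return [0] * (len(bits) - ones) + [1] * ones

assert sort_bits([1, 0, 1, 1, 0]) == [0, 0, 1, 1, 1]
```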
The lower bounds for the chip area required by semelective algorithms
which are stated above hold for multiple output functions. We exhibit a predi-
cate for which a similar result holds.
Definition 6: P_{g,p}^{(n,m)} is the set of functions f: {0,1}^n → {0,1}^m for which there exist at least g distinct subfunctions of f by some assignment to variables in the set J, for all J such that |J| = p.
The following is a simple extension of a result of Yao [10] which was originally stated for the case |J| = n/2.
Theorem 12: Let f: {0,1}^n → {0,1}^m, a member of P_{g,p}^{(n,m)}, be computed by a semelective VLSI algorithm. Then, the chip area must satisfy

A/λ² ≥ ⌊log₂g⌋/2

even if the reading and writing by the chip is not done in a data-independent manner (assumption A3).
Meyer and Paterson have defined a Boolean function f: {0,1}^n → {0,1} that is contained in P_{g,p}^{(n,1)} for g = 2^p, p = Ω(n/log n), and that has a linear standard circuit size [13, p. 43]. Their circuit can be reworked to produce a linear-sized planar circuit for this function. Thus, the A²T measure is the better, at least for large inputs.
As a last observation, we show that if some natural constraints, such as contiguity of input and output variables on the periphery of a chip, are placed on
VLSI algorithms, then the area required by them can be excessively large.
This is demonstrated with the binary addition function when the Boolean vari-
ables representing each of the three integers, the two inputs and the result,
are required to be contiguous on the boundary of a chip. In this case it is easy
to show that many node-disjoint paths must exist between inputs and outputs
and these must cross at many points. This leads to a lower bound on the chip
area which is quadratic in the length of the inputs. This quadratic area can be
avoided by the standard expedient of overlapping the registers holding the two
binary numbers and loading them in two successive cycles for subsequent
addition with a standard full-adder array. This technique reduces the area
used to linear in the length of the input, thus showing a very large jump in the
area required when the contiguity constraint is dropped.
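The expedient is easy to render in software terms. A hedged sketch (function name hypothetical; in hardware each load loop is a single parallel clock cycle):

```python
def two_cycle_add(a_bits, b_bits):
    """Two-cycle load into one (overlapped) register file, followed by a
    full-adder array; a_bits, b_bits are least-significant bit first."""
    n = len(a_bits)
    reg = [(0, 0)] * n
    for i in range(n):                    # cycle 1: load the first addend
        reg[i] = (a_bits[i], 0)
    for i in range(n):                    # cycle 2: load the second addend
        reg[i] = (reg[i][0], b_bits[i])   # same register cell, second slot
    out, carry = [], 0                    # full-adder array: linear area
    for x, y in reg:
        out.append(x ^ y ^ carry)
        carry = (x & y) | (carry & (x ^ y))
    return out + [carry]
```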
6. CONCLUSIONS
We have outlined an approach to the study of the performance of VLSI
algorithms in which performance is related to the planar circuit complexity of
the functions considered. This complexity measure provides a lower bound to
performance as recorded in the quantities AT² and A²T. We have stated
properties of the planar circuit complexity measure and have stated bounds on
it for a wide variety of functions and predicates. We have also considered
cases for which each of the two measures is the superior, thus illustrating that
the better measure to use is problem dependent.
Before closing, we note that the inequalities stated above can reflect cer-
tain special conditions that are placed on VLSI algorithms. For example, if I/O
is to be done on the boundary of a chip, then the planar circuits will exhibit
this property also and this fact could be used to improve the lower bounds
obtained to planar circuit complexity.
7. REFERENCES
1. C. D. Thompson, "Area-Time Complexity for VLSI," Procs. 11th ACM Ann. Symp. Th. Comp., pp. 81-88 (April 1979).
2. C. D. Thompson, "A Complexity Theory for VLSI," Report No. CMU-CS-80-140, Dept. of Comp. Sci., Carnegie-Mellon U., Pittsburgh, Penn. (August, 1980).
3. R. P. Brent and H. T. Kung, "The Area-Time Complexity of Binary Multiplication," Report No. CMU-CS-79-136, Dept. of Comp. Sci., Carnegie-Mellon U., Pittsburgh, Penn. (July, 1979); revised report also available.
4. H. Abelson and P. Andreae, "Information Transfer and Area-Time Tradeoffs for VLSI Multiplication," CACM 23, pp. 20-23 (January 1980).
5. J. E. Savage, "Area-Time Tradeoffs for Matrix Multiplication and Related Problems in VLSI Models," Jnl. of Comput. and Sys. Sci. (April, 1981).
6. H. T. Kung and C. E. Leiserson, "Algorithms for VLSI Processor Arrays," pp. 271-292 in Introduction to VLSI Systems, ed. C. Mead and L. Conway, Addison-Wesley, Reading, MA (1980).
7. L. J. Guibas, H. T. Kung, and C. D. Thompson, "Direct VLSI Implementation of Combinatorial Algorithms," Proc. Conf. Very Large Scale Integration: Architecture, Design, Fabrication, California Institute of Technology (January 1979).
8. F. Preparata and J. Vuillemin, "Area-Time Optimal VLSI Networks for Multiplying Matrices," Info. Proc. Letters 11(2), pp. 77-80 (Oct. 20, 1980).
9. J. Vuillemin, "A Combinatorial Limit to the Computing Power of VLSI Circuits," Procs. 21st Ann. Symp. Foundations of Computer Science, pp. 294-300 (Oct. 12-14, 1980).
10. A. Yao, "The Entropic Limitations on VLSI Computations," Proc. 13th Ann. ACM Symp. on Theory of Computing, pp. 308-311 (May 11-13, 1981).
11. R. J. Lipton and R. Sedgewick, "Lower Bounds for VLSI," Proc. 13th Ann. ACM Symp. on Theory of Computing, pp. 300-307 (May 11-13, 1981).
12. C. D. Thompson, The VLSI Complexity of Sorting, Division of Computer Science, U. C. Berkeley (1981).
13. J. E. Savage, The Complexity of Computing, John Wiley and Sons (1976).
14. J. E. Savage, "Computational Work and Time on Finite Machines," JACM 19(4), pp. 660-674 (1972).
15. L. G. Valiant, Personal Communication.
16. D. Yu. Grigoryev, "An Application of Separability and Independence Notions for Proving Lower Bounds of Circuit Complexity," Notes of Scientific Seminars, Steklov Math. Inst. 60, pp. 35-48 (1976).
Arnold L. Rosenberg
Duke University
Department of Computer Science
Durham, North Carolina 27706
1. INTRODUCTION
One would anticipate three benefits if we could realize microelectronic circuits in a three-dimensional medium. First, wire-routing should become easier and more systematic. Next, since one could avoid obstacles by using the third dimension, runs of wire should be shorter, at least in the worst case. (This has been noted in the "modestly" three-dimensional thermal conduction module developed by IBM [9].) Finally, since avoiding obstacles in a two-dimensional environment can require area-consuming circuitous routing of wires, one would expect savings in material: the volume of a three-dimensional realization of a circuit should be less than the area of any two-dimensional realization. In this paper we demonstrate that at least the last two expectations are well founded: unbounded savings in both wire length and material are achievable via three-dimensional circuit realization, at least in the worst case; we cannot comment about the first expected benefit.
Now, we have intentionally been vague about the "level" of circuit implementation at which we are proposing the use of a three-dimensional medium: our abstract setting does not distinguish between the problem of laying out gates and their interconnections on (or in) a chip and the problem of laying out chips and their interconnections on (or in) a printed circuit board. Thus the patient reader can view our study as predicting the kind of gains in efficiency that will be achievable when the current technological obstacles to the fabrication of truly three-dimensional chips are overcome; and the less patient reader can view our study as an indication of the currently achievable benefits of three-dimensional circuit "boards."
Although we obviously want to stress the potential benefits of three-dimensional circuit realizations, we should also mention some of the problems inherent in the task of implementing such realizations. Most of these problems accompany the proposal to make three-dimensional chips, the problems with circuit boards being minor in comparison. One problem that exists also with circuit boards (though far less than with chips) is the problem of registration. In the miniature world being discussed, it is difficult to assure that corresponding positions on successive layers of either a chip or a stack of circuit boards line up. In fact, with the current technology for fabricating chips (as described, say, in [5]), even the goal of two or three more metal layers for wire routing poses nontrivial problems. A related second issue that plagues the chip fabrication process is the difficulty of creating true cylindrical holes: current processes are plagued by holes that have accentuated tops and/or
Figure 1. The 8-rearrangeable network R(8) and the 8-point FFT graph F(8).
VOL(P) = Ω(n^{3/2}(log n)^{1/2})
(Two-Dimensional Embeddings)
(e) For any n-permuter P,
    AREA(P) = Ω(n²);
and there exist n-permuters P -- R(n) being an example -- with
    WIRE-LENGTH_2(P) = Ω(n/log n).
6. CONCLUDING REMARKS
We have demonstrated dramatic efficiencies in three-dimensional circuit realizations that are not attainable in two dimensions. Moreover, the benefits of these efficiencies in orders of growth are not delayed by huge constants hidden in the "big-Oh"s; the constants in our constructions are small. These efficiencies are of sufficient magnitude that further research is warranted on two questions. First, how widespread are the advantages of three dimensions over two that we have found here? How, for example, would a three-dimensional realization of a random small-degree graph compare with a two-dimensional realization of the graph? Second, are the technological impediments to three-dimensional circuitry surmountable in the foreseeable future, or are they so tied to the current state of the technology that only a revolutionary advance will overcome them? We plan further study of these and related questions in the near future.
REFERENCES
1. R. Aleliunas and A.L. Rosenberg: On embedding rectangular grids in square grids. IEEE Trans. Comp., to appear.
2. V.E. Benes: Optimal rearrangeable multistage connecting networks. Bell Syst. Tech. J. 43 (1964) 1641-1656.
3. H.W. Lam, A.F. Tasch, Jr., T.C. Holloway: Characteristics of MOSFETs fabricated in laser-recrystallized polysilicon islands with a retaining wall structure on an insulating substrate. IEEE Electron Dev. Lett. EDL-1 (1980) 206-208.
4. C.L. Leiserson: Area-efficient graph layouts (for VLSI). Proc. 21st IEEE Symp. on Foundations of Computer Science, 1980, 270-281.
5. C. Mead and L. Conway: Introduction to VLSI Systems, Addison-Wesley, Reading, MA, 1980.
1. Introduction
Considerable attention has been paid to the definition of a suitable model of computation for Very-Large-Scale Integrated (VLSI) circuits [1], [2], [3], [4]. The basic parameters of any VLSI computation model are chip area A and computation time T. VLSI systems display a trade-off between these two parameters, each of which represents a well-defined cost aspect: chip area is a measure of fabrication cost and computation time is a measure of operating cost.
A general feature of all proposed - and presumably of all future - VLSI models of computation is that a chip is viewed as a computation graph, whose vertices are called nodes and whose arcs are called wires. Nodes are, by and large, devices and are responsible for information processing (computations of boolean functions); wires are just electrical connections, and are responsible for both transfer of information and distribution of power supply and timing waveforms.
A given computation graph is to be laid out in conformity with the rules dictated by technology. These rules are taken into account in the model of computation by the following assumptions.
Area Assumptions
All wires have minimum width λ > 0, and at most ν ≥ 2 wires can overlap at any point. Transistors and I/O ports have minimum area λ².
Time Assumptions
T1. (Propagation time). To propagate a bit along a wire of length ℓ requires:
T1.1 A constant time, irrespective of ℓ [Brent-Kung [2]] (synchronous model).
T1.2 A time O(log ℓ) [Mead-Conway [1]; Thompson [3]] (capacitive model).
T1.3 A time O(ℓ²) [Seitz [5]; Chazelle-Monier [4]] (diffusion model).
T2. (Algorithm time). The computation time of an algorithm is the time of the longest sequence of wire propagation times between beginning and completion of the computation. [All models.]
While the area assumptions are uncontroversial, there is little consensus among researchers about the computation time T, as is
Figure 1. The CMOS configuration (a) and the transistor characteristics (b).
cut-off, with its drain load initially charged to voltage V₀. So, with reference to the (I_DS, V_DS) characteristic curves of figure 1(b), P₁ is the initial operating point of T₁. By applying at t = 0 a step voltage V_g = V₀ at the gate of T₁, the operating point becomes P₂ and, from then on, it moves on the V_g = V₀ curve toward the origin and the transistor
Figure 2. The model (a) and the idealized characteristics (b).
These are instances of the classical diffusion equation (or heat equation), but our boundary conditions deserve special attention. We use a general approach to solve partial differential equations with homogeneous boundary conditions, called the method of separation of variables, which consists of the following steps: 1) find the solutions to eq. (1) satisfying the boundary conditions and expressed as a product of a time function and of a space function; the space functions are called the eigenfunctions of the boundary problem; 2) express the initial conditions as a linear combination of the eigenfunctions; 3) get the final solution by using the superposition principle. For a large class of problems (known as Sturm-Liouville problems [7]), step 2) is simplified by the orthogonality of the eigenfunctions. In our case, while the saturated regime is of the Sturm-Liouville type, the ohmic regime is not. In the latter case, the difficulty has been circumvented by introducing an unconventional inner product with respect to which the eigenfunctions become orthogonal.
We now solve eq. (1) for both the saturated and ohmic regime. The results are expressed in the normalized variables ξ ≡ x/L and τ ≡ t/(rcL²). In both cases the eigenfunctions are sinusoidal of angular frequency μ, where μ must satisfy the characteristic equation determined by the boundary conditions.
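A schematic reconstruction of the separation step may help here; it assumes eq. (1) is the standard one-dimensional diffusion equation for the line, which is consistent with the normalizations ξ = x/L and τ = t/(rcL²) just introduced:

```latex
\[
  u(x,t) = X(x)\,T(t)
  \;\Longrightarrow\;
  \frac{T'(t)}{T(t)} = \frac{1}{rc}\,\frac{X''(x)}{X(x)}
                     = -\frac{\mu^{2}}{rc\,L^{2}},
  \qquad
  X(x) = \sin(\mu x/L), \quad
  T(t) = e^{-\mu^{2}t/(rcL^{2})} = e^{-\mu^{2}\tau}.
\]
```

The admissible μ are the roots μ_i of the characteristic equation fixed by the boundary conditions, and the full solution is a superposition over the μ_i, as in eq. (2) below.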
(1/C₀) Σ_{i=1}^{∞} (g_i^{(1)}/μ_i²)(1 − e^{−μ_i²τ})    (2)
Figure 3. Discharge curves for γ = 10⁻², 10⁻¹, ..., 10³. The broken lines describe the discharge at constant current.
whence

t_D = R₀C₀(1+γ+ρ)(1+Δ(γ,ρ)).    (6)

Letting Δ(γ,ρ) = (γ+ρ) Σ_{i≥1} δ_i/(1+γ+ρ) gives the relative deviation of t_D from R₀C₀(1+γ+ρ), which is linear in ρ and γ and gives the time constant in an idealized capacitive model. In this model the dispersive line is replaced by a single equivalent capacitance of value cL(1+ρ/γ), where ρ/γ is a constant in any given fabrication technology; indeed C₀(1+γ+ρ) = C₀ + cL(1+ρ/γ). A set of contour lines of Δ as a function of ρ and γ is plotted in a logarithmic diagram in figure 4.
r        3.78 × 10³ Ω/m       1.89 × 10⁴ Ω/m
c        3.46 × 10⁻¹⁰ F/m     6.93 × 10⁻¹¹ F/m
ρ/γ      1.10 × 10⁻⁴          0.992 × 10⁻³
L_max    10 mm                50 mm
γ_max    84                   5.65 × 10²
REFERENCES
Different MD-models can vary in the number of strength levels and the kinds of gates they allow. As an example we give the following MD-model: There are three levels of strength: 1 (isolated charge), 2 (connection to GND resp. VDD through a pull-up resistor), and 3 (direct connection to GND resp. VDD). There is only one kind of gate, namely the MOS transistor T. T has three terminals s (source), d (drain), and g (gate). The transition function δ_T is defined as follows: δ_T(s,d,g) = (s',d',g'), where g' = (g_b,1), and s' = (s_b,1), d' = (d_b,1) if g_b = 0; otherwise s' = d' is the strongest of the values s and d. In case of a tie resolve arbitrarily, or set the value to be undefined. Thus a transistor with a 1 on the gate behaves exactly like a wire with two terminals.
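A minimal executable rendering of δ_T (an illustrative sketch, not the paper's formalism: values are encoded here as (bit, strength) pairs, with None standing for the undefined value):

```python
from typing import Optional, Tuple

Value = Tuple[int, int]  # (boolean value b, strength level 1..3)

def transistor_step(s: Value, d: Value, g: Value) -> Tuple[Optional[Value], Optional[Value], Value]:
    """Transition function delta_T(s, d, g) = (s', d', g') of the MD-model
    MOS transistor described above."""
    g_new = (g[0], 1)                     # gate keeps its bit at strength 1
    if g[0] == 0:                         # transistor off: terminals isolated
        return (s[0], 1), (d[0], 1), g_new
    # transistor on: behaves like a wire; the stronger of s, d wins
    if s[1] != d[1]:
        v: Optional[Value] = s if s[1] > d[1] else d
    elif s[0] == d[0]:
        v = s                             # equal strength, agreeing bits
    else:
        v = None                          # tie with conflicting bits: undefined
    return v, v, g_new
```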
It is straightforward to show the following lemma.
Lemma 1: Each circuit in the LS-model can be simulated by a circuit in the above MD-model with area, time, and period only increased by a constant factor.
Sketch of Proof: We can simulate inverters as well as AND- and OR-gates with arbitrary fan-in in the MD-model by using the NOR-gate implementations common in NMOS. These implementations translate easily into the MD-model. Since wires in the MD-model can have arbitrary shape, it is possible to simulate each gate in the LS-model "in place". Any undefined values occurring in the MD-circuit have to have been introduced through incorrect operation of the simulated LS-circuit. □
The following theorem shows that the reverse of Lemma 1 holds for all MD-models. This means that circuits in the LS-model are powerful enough to model also multi-directional VLSI circuits. In the proof of the theorem the properties of the gates in the MD-model chosen do not play a significant role. They only influence the constant factor.
Theorem 2: Choose any MD-model. Any circuit in the model can be simulated by a circuit in the LS-model with area, time, and period only increased by a constant factor. The factor depends on the MD-model chosen.
Sketch of Proof: Let C be the circuit in the MD-model that is to be simulated. Let C run in area A, time T, and period P. We will simulate C "in place" by a Boolean circuit C'. It will turn out that we can fit a layout of C' inside a blowup by a constant factor of the layout of C. This yields the bound on the area. The bounds on time and period follow directly from the definition of C'.
For the purpose of the simulation, values v ∈ B will be encoded in a unary fashion, i.e., by unit Boolean vectors of length 2r. The vector (v_1,...,v_{2r}) with the unique 1 being element v_i encodes the value v = ((i+1) mod 2, (i+1) div 2).
Since each gate has at most k = O(1) terminals, its function can be simulated by a Boolean circuit with A,T,P = O(1) that fits into a blowup by a constant factor of the layout of the gate in C. (Note that we can, in addition, assure the proper location of the terminals in the circuit C'. However, this may require different layouts of the circuit simulating gate g, for different copies of the same gate g in C.)
It remains to show how to simulate the wires. Let (v_1,...,v_n) with v_j = (v_{j,1},...,v_{j,2r}) be the unit vectors encoding the values on the termi-
nals of the wire before switching. The "output" value w of the wire is encoded by a vector (w_1,...,w_{2r}), where for 1 ≤ i ≤ 2r

w_i = ( ∧_{k=i+1}^{2r} ∧_{j=1}^{n} ¬v_{j,k} ) ∧ ( ∨_{j=1}^{n} v_{j,i} ).

(An undefined value is here encoded by a vector which is not a unit vector.) If we allow AND- and OR-gates with arbitrary fan-in this formula has depth 2. For the simulation of one step on the wire 2r such functions have to be computed in parallel. The layout of the circuitry necessary for this can fit inside a blowup by a constant factor (depending on r) of the layout of the wire in C. □
Note that in the proof of Theorem 2 we make extensive use of the
fact that in the LS-model AND- and OR-gates with arbitrary fan-in are
allowed, as well as of the liberty we can take with respect to the shape
of the gates.
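The wire formula can be read off directly in software form. A sketch (0-indexed vectors; this computes the Boolean function each w_i realizes, not its layout):

```python
def wire_output(terminals):
    """terminals: unit vectors v_j (lists of length 2r) encoding the values
    on a wire's terminals. Returns (w_1,...,w_2r): w_i = 1 iff some terminal
    carries value i and no terminal carries a stronger (higher-indexed) value."""
    two_r = len(terminals[0])
    w = []
    for i in range(two_r):
        no_stronger = all(v[k] == 0 for v in terminals for k in range(i + 1, two_r))
        some_here = any(v[i] == 1 for v in terminals)
        w.append(1 if no_stronger and some_here else 0)
    return w
```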
as shown in Figure 1.
We can choose ℓ_t(i) and ℓ_b(i) such that

vol_{d−1}(H_{i,ℓ_t(i)} ∩ M_{i,p}) ≤ 2v_i^{(d−1)/d}  and  vol_{d−1}(H_{i,ℓ_b(i)} ∩ M_{i,q}) ≤ 2(V_i−v_i)^{(d−1)/d},

as well as ℓ_t(i) ≤ (1/2)v_i^{1/d} and ℓ_b(i) ≤ (1/2)(V_i−v_i)^{1/d}. For assume that such an ℓ_t(i) cannot be found. Then

v_i ≥ ∫₀^{(1/2)v_i^{1/d}} vol_{d−1}(H_{i,ℓ} ∩ M_{i,p}) dℓ > (1/2)v_i^{1/d} · 2v_i^{(d−1)/d} = v_i,

which is a contradiction. An analogous argument holds for ℓ_b(i). We include H_{i,ℓ_t(i)} ∩ M_{i,p} and H_{i,ℓ_b(i)} ∩ M_{i,q} in the cut C. Now, if μ(M_{i,m}) ≤ 2/3 we are done. Since μ(M_{i,t}), μ(M_{i,b}) ≤ 2/3 we only have to choose M_{i,o} to be the most expensive of the three sets, and M_i correspondingly to be the union of the other two sets, and we cut M_i as desired.
Sketch of Proof: Since we are only concerned with an upper bound on the area we will not take special care to make the chip fast.
We give the chip n^{1/3} input ports. The i-th input port receives the inputs x_{i·n^{2/3}},...,x_{(i+1)·n^{2/3}−1} in this order (0 ≤ i ≤ n^{1/3}−1). These are n^{2/3} inputs per port. We will give the chip slightly fewer output ports, namely n/(n^{2/3}+n^{1/3}) output ports. Each output port produces n^{2/3}+n^{1/3} output bits. Output port j produces the bits y_{j·(n^{2/3}+n^{1/3})},...,y_{(j+1)·(n^{2/3}+n^{1/3})−1} for 0 ≤ j ≤ n/(n^{2/3}+n^{1/3})−1.
The idea of this arrangement is, informally, that for each value of k there will be one output port j and one input port i, such that the first input bit to be read by input port i has an index that is at most n^{1/3} smaller than the first output bit to be produced by output port j. Thus, if we start reading inputs from input port i we have to store at most n^{1/3} input bits before we can directly output the input bits read starting at output port j. We continue reading inputs in a clockwise fashion and directly produce the corresponding output. At the end we produce the input bits read and stored at the beginning.
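This bookkeeping can be prototyped directly; the sketch below implements the port functions i(k) and j(k) defined formally in the next paragraph (function name hypothetical; n is assumed to be a perfect cube):

```python
def ports(n, k):
    """Input port i(k) and output port j(k) for problem parameter k,
    following the defining inequality in the text; port indices are
    to be taken cyclically ("clockwise")."""
    r = round(n ** (1 / 3))     # n^(1/3): number of input ports
    m = r * r                   # n^(2/3): inputs per port
    # (j(k)-1) * n^(1/3) < k mod n^(2/3) <= j(k) * n^(1/3)
    j = -(-(k % m) // r)        # ceiling of (k mod n^(2/3)) / n^(1/3)
    i = j - (k // m)
    return i, j
```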
Formally we define j = j(k) such that

(j(k)−1)·n^{1/3} < k mod n^{2/3} ≤ j(k)·n^{1/3}

and let i(k) = j(k) − (k div n^{2/3}). We start reading at input port i(k) and producing output at output port j(k). Then the first bit to be output
switching a state. In the first case only "static" energy is expended for
maintaining the value. In the second case "switching" energy is expended
in addition, for changing the value. Whereas static energy consumption is
dominant in the NMOS process, switching energy consumption is dominant in
processes like CMOS. Moreover, switching energy is the energy concept that
is more closely related to computational complexity, and it is the central
energy concept introduced in [MC]. Thompson bounds static energy from be-
low.
We derive lower bounds on switching energy consumption based on the
following assumption:
Every unit of chip area on every layer of the chip consumes one unit
of energy every time it changes its state (from 0 to 1 or vice versa).
Theorem 8: Let f be a transitive function of degree n.¹ Consider any where-oblivious chip computing f with area A, period P, and switching energy E per solved problem. Then

E ≥ n² / (c₁P·log(c₂AP²/n²)),

where c₁ and c₂ are constants involving the technological parameters λ and ν of the LS-model.
Sketch of Proof: We assume the layout of the chip to be overlaid with a grid of mesh-size λ. We consider only cuts along grid lines, as in [Th].

Figure 2. A cut C_i and its vertical sections.
Each C_i cuts the chip in half w.r.t. the I/O-ports. We associate a crossing sequence with each cut as in [LS]. We disregard all values contributed to the crossing sequence by the middle section of C_i. Those are at most (4i+2)ν bits per crossing value.
As shown in [LS] at least Ω(n) bits have to cross the cut C_i for each solved problem. Thus, if the chip is running at full rate, Ω(nT/P) bits have to cross C_i during any time interval of length T, i.e., over the vertical sections at least Ω(nT/P) − O(iT) bits have to cross in this interval. Let L_i be the number of bits contributed to each crossing value by the vertical sections of cut C_i. Let W_i = w_{i,1} ‖ ... ‖ w_{i,L_i} be the concatenation of the crossing sequences associated with each bit contributed to the crossing value by the vertical sections of cut C_i.
¹ The theorem can be proved in the same way for any Boolean function f such that at least 2^n crossing sequences are necessary across any cut that divides a chip computing f. In this class fall all functions for which AT² = Ω(n²) can be shown with the crossing sequence argument.
The w_{ij} are bit strings of length exactly T. We will encode these strings in a special fashion. First we introduce some notation.
Definition: Let w be an arbitrary bit string, say w = 0^{α₁}1^{α₂}0^{α₃}··· (t blocks in all).
a) s(w) is the number of state changes in w, i.e., s(w) = t.
b) bin(w) is the bit string obtained from w after substituting each 0 with a 00 and each 1 with a 11.
c) compress(w) = 0 bin(α₁) 01 bin(α₂) 01 bin(α₃) 01 ··· 01 bin(αₜ). If the first bit of w is a 1, so is the first bit of compress(w).
Example: compress(00011) = 0 11 11 01 11 00.
Lemma 9: |compress(w)| ≤ 4s(w) + 2s(w)·log(|w|/s(w)).
Proof: Not included. □
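The encoding is easy to test in software; a sketch (the assert reproduces the Example above):

```python
def compress(w: str) -> str:
    """Run-length code from the Definition: each maximal run of w is
    replaced by its length written in binary with doubled digits
    (0 -> 00, 1 -> 11), runs separated by '01', and the whole string
    prefixed by the first bit of w."""
    runs, i = [], 0
    while i < len(w):
        j = i
        while j < len(w) and w[j] == w[i]:
            j += 1
        runs.append(j - i)          # length of this maximal run
        i = j
    bin_doubled = lambda a: ''.join(c * 2 for c in format(a, 'b'))
    return w[0] + '01'.join(bin_doubled(a) for a in runs)

assert compress('00011') == '0' + '1111' + '01' + '1100'  # = 0 11 11 01 11 00
```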
Let us now only consider cuts C₀,...,C_D for some D to be chosen later. Let W = W₀ ‖ ... ‖ W_D. Since Ω(nT/P) − O(iT) bits have to cross cut C_i, it follows by summation that |W| ≥ Ω(nTD/P) − O(D²T). Furthermore, since the w_{ij} have the same length T, the mapping

W ↦ ‖_{i=0}^{D} ‖_{j=1}^{L_i} compress(w_{ij})

is injective. Thus, if we define S = Σ_{i=0}^{D} Σ_{j=1}^{L_i} s(w_{ij}) and L = Σ_{i=0}^{D} L_i, and apply Lemma 9, we get the following.
Lemma 10: Ω(nTD/P) − O(D²T) ≤ 4S + 2S·log(TL/S).
Proof: Not included. □
Choosing D = Θ(n/P) appropriately yields

Ω(n²T/P²) ≤ 4S + 2S·log(TL/S),

which implies that

S ≥ n²T / (c₁P²·log(c₂LP²/n²)).
It remains to relate S to the switching energy expended per solved problem, and to bound L from above. Both can be done with the following argument.
Let c be a component crossing the vertical section of cut C_i. It can be shown that we can charge to c an area of size Ω(1) belonging to the layout of component c, such that the areas charged to components crossing different cuts do not overlap, and such that the areas charged to two different components overlap at most ν-fold, i.e., on ν different layers. (The details of this process will not be given here.) From this it follows that both L = O(A) and E_tot = Ω(S), where E_tot is the total switching energy expended in the time interval of length T. Since in this time interval O(T/P) problems are solved, the theorem follows. □
For a chip working at the limits dictated by information transfer, i.e., AP² = Θ(n²), the bound given in Theorem 8 is tight up to a constant factor: E = Ω(n²/P) = Ω(AP) follows from Theorem 8, and E = O(AP) is obvious.
5. ACKNOWLEDGEMENTS
6. REFERENCES
[BK] R.P. Brent, H.T. Kung, "The Area-Time Complexity of Binary Multiplication," JACM 28,3 (July 1981), 521-534.
[Br] R.E. Bryant, "A Switch-Level Model of MOS Logic Circuits," in VLSI 81 (ed. John Gray), Academic Press (August 1981), 329-340.
[CM] B. Chazelle, L. Monier, "A Model of Computation for VLSI with Related Complexity Results," 13th Ann. Symp. on Theory of Computing, ACM (May 1981), 318-325.
[LS] R.J. Lipton, R.S. Sedgewick, "Lower Bounds for VLSI," 13th Ann. Symp. on Theory of Computing, ACM (May 1981), 300-307.
[MC] C. Mead, L. Conway, Introduction to VLSI Systems, Addison-Wesley (1980).
[Ro] A.L. Rosenberg, "Three-Dimensional VLSI, I: A Case Study," Res. Rep. 8745, IBM (March 1981).
[Sa] J.E. Savage, "Planar Circuit Complexity and the Performance of VLSI Algorithms," Dept. of Comp. Sci. Tech. Rep., Brown University, Providence, RI (Jan. 1981).
[Th] C.D. Thompson, "A Complexity Theory for VLSI," Ph.D. Thesis, Carnegie-Mellon University, Pittsburgh, PA (1980).
[Vu] J. Vuillemin, "A Combinatorial Limit to the Computing Power of VLSI Circuits," 21st Symp. on the Foundations of Computer Science, IEEE (1980).
On the Area Required by VLSI Circuits
Gerard M. Baudet
INRIA Rocquencourt
B.P.105
78150 Le Chesnay, France
ABSTRACT
1. - INTRODUCTION
The results of this paper are based on the VLSI model of computation already discussed in [2,7,9,10]. The model is briefly reviewed, and the notations and assumptions are introduced in Section 2, along with the measures of performance used to evaluate integrated circuits. In particular, in addition to the usual parameters of a circuit, the area and the time required to perform an operation, we discuss the notion of the period of a circuit, initially introduced by Vuillemin [10] to capture the potential pipelining of a circuit.
As a first illustration of the method developed in this paper, a very simple proof is given to show that any circuit performing binary multiplication requires an area proportional to the total number of bits input (and output) by the circuit. This result is presented in Section 3 and reestablishes a result of Brent and Kung [2] in a much more general setting. The result is valid for a very general class of functions and applies, in particular, to important problems such as cyclic shift, convolution, etc.
While the result of Section 3 provides us with lower bounds relating the area required by a circuit directly to the function computed by the circuit, the result of Section 4 establishes a general tradeoff between the area and the period of a pipelined circuit. As an application of this result, we derive a new lower bound on the complexity of any circuit performing operations like prefix computation or binary addition. The existence of VLSI designs for these operations shows that the result is optimal up to some constant factor.
In [3], Chazelle and Monier present a different computational model for VLSI. In the last section of this paper, we discuss our results and present lower bounds in relation to their model.
(i.e., among the N input bits taken into account by the restriction f₀). We define N₀ = 0 and N_i = N_{i−1} + n_{i−1}, i = 1, 2, ...; this quantity represents the number of bits input to C prior to time t_i. Also, let s_i be the number of bits encoded within the circuit. Finally, let S_i be the set generated by the output that has not been delivered by time t_i; the size of this set, denoted by |S_i|, corresponds to the number of distinct values that can still be taken by the output after time t_i.
If, at time t_i, the evaluation of f₀ is not completed, any output bit that has yet to be delivered must be evaluated from the N − N_{i+1} input bits that have not been read by the circuit at time t_i, from the n_i input bits that have just been read at time t_i, and from the information memorized within the s_i cells of the circuit (which encodes the input that has been read prior to time t_i). This leads us to the following lemma.
PROOF: Let t_i be the time when the last input bit is read by the circuit. We have N_i + n_i = N_{i+1} = N. Since any output bit depends on all input bits, no output has been produced by the circuit at time t_i, thus S_i = S₀ and |S_i| = 2^N. Equation (3) then follows directly from Lemma 1. □
LEMMA 2: The area A and the period P of a circuit computing some function f must satisfy:

A·P ≥ ρτN + βτ Σ_{0≤i<k} [log₂|S_i| − (N−N_i)].    (4)
PROOF: Define p = P/τ. If some problem P_t starts at time t₀, consider the sequence of previous problems, P_{t−1},...,P_{t−m}, still in progress at time t_i, for 0 ≤ i < p (recall that the circuit may be pipelined). From the definition of the period and by assumption A1, we deduce that, at time t_i, problem P_{t−j} is in the same stage as problem P_t would be at time t_{i+jp} = t_i + jP (i.e., the same variables have been read and the same variables have been produced). Therefore, circuit C has to be able to generate the set S_i for problem P_t, the set S_{i+p} for problem P_{t−1}, etc. Since all the problem instances input to C are independent, the circuit must potentially produce R_i = |S_i|·|S_{i+p}| ··· |S_{i+mp}| distinct values. Then the proof of the lemma is similar to the proof of Lemma 1 with a summation for i = 0, 1, ..., p−1. Note that, at time t_i, the circuit has yet to read Σ_j (N − N_{i+1+jp}) input bits; it has just read Σ_j n_{i+jp} input bits; and it contains r_i memory cells, where r_i satisfies an inequality similar to inequality (2). □
Again, let us consider a restriction f₀(x₁,...,x_N) = (z₁,...,z_N) of function f which is surjective, that is, such that |S₀| = 2^N. For i ≥ 0, let M_i be the number of bits that have been produced by the circuit up to (and including) time t_i. Then we have |S_i| = 2^{N−M_i}, for i = 0, 1, ..., and inequality (4) simplifies as:

A·P ≥ ρτN + βτ Σ_{0≤i<k} (N − M_i).    (5)
We are now able to state a lower bound for binary addition and, more generally, for the computation of any function f such that any output bit z_j of f₀ depends on j input bits, for j = 1,...,N. In the case of binary addition, bit z_j depends on bits x₁,...,x_j.
PROPOSITION 2: The area A, the period P, and the time T of any circuit performing the binary addition of two N-bit numbers must satisfy:

A·P ≥ ρτN + (1/2)βτ N·log_b(N/T).    (6)

PROOF:

t_{i+1} ≥ τ Σ log_b(j − N_i) ≥ (1/2)τ n_i·log_b(n_i),

the sum being extended in the range N_i < j ≤ N_{i+1} (recall that n_i = N_{i+1} − N_i). By summing this last inequality for i = 0, 1, ..., k, the result follows from the convexity of the function x ↦ x·log_b(x). □
As an immediate consequence of Proposition 2, we deduce (by looking separately at the case T < log₂N and T ≥ log₂N) that

A·P·T = Ω(N·log²N).

Since T ≥ P, we also have

A·T² = Ω(N·log²N),

which reestablishes a result of Johnson's [5].
These last two bounds on APT and AT² are weaker forms of the result of Proposition 2. For example, the classical full adder, with A = O(1) and P = T = O(N), is optimal (up to some constant factor) with respect to inequality (6), while it is not optimal with respect to the APT and AT² measures. The same is true for the systolic adder built out of N linearly connected full-adder cells, with A = O(N), P = O(1), and T = O(N). The fast adder described by Brent and Kung [2] is of interest since it shows that the APT measure is indeed a tighter bound than the AT² measure: with A = O(N log N), P = O(1), and T = O(log N), we have APT = O(N log²N) while AT² = O(N log³N). A pipelined version of this fast adder, with A = O(N) and P = T = O(log N), is optimal with respect to both the APT and the AT² complexity measures.
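These figures are easy to tabulate; a small hedged check (constants suppressed, N fixed arbitrarily; the (A, P, T) triples are those quoted above):

```python
import math

N = 2 ** 16
lg = math.log2(N)

# (A, P, T), up to constant factors, for the four adders discussed above
adders = {
    "full adder":      (1,      N,  N),
    "systolic adder":  (N,      1,  N),
    "Brent-Kung fast": (N * lg, 1,  lg),
    "pipelined fast":  (N,      lg, lg),
}

for name, (A, P, T) in adders.items():
    print(f"{name:16s} APT = {A*P*T:14.0f}   AT^2 = {A*T*T:14.0f}")
# The Brent-Kung row shows APT ~ N log^2 N but AT^2 ~ N log^3 N,
# illustrating why APT is the tighter of the two measures there.
```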
5 - CONCLUDING REMARKS
REFERENCES
C.D. Thompson
University of California at Berkeley
Division of Computer Science
Berkeley, California 94720
0. Abstract
The area-time complexity of sorting is analyzed under an updated model of
VLSI computation. The new model has fewer restrictions on chip I/O than previous
models. Also, the definitions of area and time performance have been adjusted to
permit fair comparisons between pipelined and non-pipelined designs.
Using the new model, this paper briefly describes eleven different designs for
VLSI sorters. These circuits demonstrate the existence of an area*time² tradeoff for the sorting problem. The smallest circuit is only large enough to store a few elements at a time; it is, of course, rather slow at sorting N elements. The largest design solves a sorting problem in only O(lg N) clock cycles. The area*time² performance figure for all but three of the designs is close to the limiting value, O(N²).
1. Introduction
Sorting has attracted a great deal of attention over the past few decades of
computer science research. It is easy to see why: sorting is a theoretically
interesting problem with a great deal of practical significance. As many as a
quarter of the world's computing cycles were once devoted to sorting [Knu 73, p.3].
This is probably no longer the case, given the large number of microprocessors
running dedicated control tasks. Nonetheless, sorting and other information-
shuffling techniques are of great importance in the rapidly growing database
industry.
The sorting problem can be defined as the rearrangement of N input values so
that they are in ascending order. This paper examines the complexity of the
sorting problem, assuming it is to be solved on a VLSI chip. Much is already known
about sorting on other types of computational structures [Knu 73, pp. 1-388], and
much of this knowledge is applicable to VLSI sorting. However, VLSI is a novel
computing medium in at least one respect: the size of a circuit is determined as
much by its inter-gate wiring as by its gates themselves. This technological novelty
makes it necessary to re-evaluate sorting circuits and algorithms in the context of
a "VLSI model of computation."
Using a VLSI model, it is possible to demonstrate the existence of an
area*time² tradeoff for sorting circuits. A preliminary study of this tradeoff is
contained in the author's Ph.D. dissertation [Tho 80a], in which two sorting circuits
This work was supported in part by the National Science Foundation under Grant ECS-8110684 and by the U.S. Army Research Office under Grant DAAG29-78-G-0167.
were analyzed. This paper analyzes nine additional designs under an updated model of VLSI computation. The updated model has the advantage of allowing fair comparisons between pipelined and non-pipelined designs.
None of the sorting circuits in this paper is new, since all are based on commonly-known serial algorithms. Still, this paper is the first to lay out these circuits for VLSI and to analyze their area*time² performances. Eight of the sorters will be seen to have an AT² performance in the range O(N²lg²N) to O(N²lg⁵N). Since it is impossible for any design to have an AT² product of less than O(N²) [Vui 80], these designs are area- and time-optimal to within logarithmic factors.
A number of different models for VLSI have been proposed in the past few years [B&K 81, C&M 81, K&Z 81, Tho 80a, Tho 80b, Vui 80]. They differ chiefly in their treatment of chip I/O, placing various restrictions on the way in which a chip accesses its input. Typically, each input value must enter the chip at only one place [Tho 80a] or at only one time and place [B&K 81]. Savage [Sav 81] has characterized these as the "semelocal" and "semelective" assumptions, respectively.
The model of this paper builds on its predecessors, removing as many restrictions on chip I/O as possible. Following Kedem and Zorat, the semelocal assumption is relaxed by allowing a chip to access each input value from several different I/O memories. The intent is to allow redundant input codes: if each input bit appears in k places, a chip's area*time² performance may be enhanced by a factor of k² [K&Z 81].
Additionally, the new model is not semelective, for it allows multiple accesses to problem inputs, outputs, and intermediate results. Here, the intent is to model the use of off-chip RAM storage; the area of the RAM is not included in the total area of the sorting circuit. This omission clarifies the area*time² tradeoff for sorting circuits, since RAM area is involved in an entirely different form of tradeoff. (The recent work of Hong and Kung [H&K 81] indicates that a (time * lg space) tradeoff may describe how local memory affects the speed of a sorting circuit with fixed I/O bandwidth.) Leaving RAM area out of the new model permits the analysis of sublinear size circuits. It also makes the model's area measure more sensitive to the power consumption of a circuit, since memory cells have a low duty cycle and generally consume much less power per unit area than a "processing" circuit.
Other authors have used non-semelective models, although none has elaborated quite so much on the idea. Lipton and Sedgewick [L&S 81] point out that the "standard" AT² lower bound proofs do not depend on semelective assumptions. Hong [Hon 81] defines a non-semelective model of VLSI with a space-time behavior which is polynomially equivalent to that of eleven other models of computation. His equivalence proofs depend upon the fact that VLSI wiring rules can cause at most a quadratic increase in the size of a zero-width-wire circuit. Unfortunately, Hong's transformation does not necessarily generate optimal VLSI circuits from optimal zero-width-wire circuits, since a quadratic factor cannot be ignored when "easy" functions like sorting are being studied.
Lipton and Sedgewick [L&S 81] point out another form of input restriction, one that is not removed in this paper's model. Inputs and outputs must be assigned to fixed I/O ports at the time the circuit is designed, and these assignments must not depend upon the problem input values. Thus this paper's model is "where-oblivious." It is an interesting theoretical question to ask what functions become easier to compute when the "where-oblivious" restriction is removed. Certainly, shifting is such a function; sorting may be another, although this seems unlikely. In any event, there are practical reasons for requiring VLSI circuits to be where-oblivious. Otherwise, permutation networks would have to be used to connect one VLSI circuit to another!
The catalog of input restrictions is not yet complete. In both Vuillemin's [Vui 80] and Thompson's [Tho 80b] models of pipelined VLSI computation, analogous inputs and outputs for different problems must be accessed through identical I/O ports. For example, input #1 of problem #2 must enter the chip at the same place as input #1 of problem #1. While this seems to be a natural assumption for a pipelined chip, it leads to a number of misleading conclusions about the optimality of highly-concurrent designs. For instance, the highly parallelized bubble sort design of Section 3.10 is nearly area*time² optimal under the old models, but it is significantly suboptimal under the model of this paper.
When the restriction on pipelined chip inputs is removed, it becomes impossible to prove an O(N²) lower bound on AT² performance until the definitions of area and time are adjusted. However, no change needs to be made in the definitions of area and time performance for non-concurrent designs.
In the new model, the area performance of a design is its "area per problem," equal to its total area divided by its degree of concurrency. Thus it does not matter how many copies of a chip are being considered as a single design: doubling the number of chips doubles both its concurrency and its total area, leaving its area performance invariant. The old definition of area performance was the total area of a design, with no correction factor for its concurrency.
The time performance of a design is newly defined as the delay between the presentation of one set of problem inputs and the production of the outputs for that problem. The old definition of time performance was the rate at which a design accepted input bits. It is easy to see that duplicating a design doubled its time performance under the old definition, but leaves its time performance invariant under the new definition.
The old and new definitions of area and time can be contrasted by analyzing the combined sorting performance of N independent serial processors. As will be shown in Section 3.1, each one of these processors has an area of O(lg N) and each can solve one sorting problem every O(N lg²N) time units. A collection of N processors would thus consume input data at the rate of one bit every O(lg N) time units. Their total area is O(N lg N), yielding an "impossibly good" area*time² performance of O(N lg³N) under the old definitions of area and time. Under the new definitions, the total area per problem is just O(lg N) and the solution delay is O(N lg²N), so the AT² performance is O(N²lg⁵N).
This paper is organized in the following fashion: Section 2 describes the new VLSI model of computation in full detail; Section 3 sketches eleven different designs for VLSI sorters, analyzing the area-time performance of each; Section 4 compares the performances of each of the designs, with some discussion of the "constant factors" ignored by the asymptotic model; and Section 5 concludes the paper with a discussion of some of the currently open issues in VLSI complexity theory.
2. Model of VLSI Computation
In all theoretical models of VLSI, circuits are made of two basic components: wires and gates. A gate is a localized set of transistors, or other switching elements, which perform a simple logical function. For example, a gate may be a "j-k flip-flop" or a "three input nand." Wires serve to carry signals from the output of one gate to the input of another.
Two parameters of a VLSI circuit are of vital importance, its size and its speed. Since VLSI is essentially two-dimensional, the size of a circuit is best expressed in terms of its area. Sufficient area must be provided in a circuit layout for each gate and each wire. Gates are not allowed to overlap each other at all, and only two (or perhaps three) wires can pass over the same point.
greater than (a fraction of) the speed of light. Right now, the clock frequency and
circuit dimensions are just small enough to allow a signal to propagate from one
side of the circuit to the other in one clock period. Any increase in either speed or
size would make this impossible. The computational limitations of such enhanced
(and hypothetical) technologies could be analyzed under Chazelle and Monier's
linear delay assumption.
Before leaving the subject of wire delay, it should be noted that the model of
this paper makes provision for the "self-timed" regime predicted by [Sei 79]. It
may eventually become very difficult to guarantee that all portions of a VLSI
circuit get a clock signal with the correct frequency and/or phase. Fortunately, it
is feasible to have the long-wire drivers include timing information with the data
being transmitted, so that special "receiver" circuits can resynchronize the data
with respect to the local version of the clock. Also, single-stage, unit-delay
"repeater" pircuits can be used to avoid a driver delay at each vertex in the mesh-
type connection pattern of Section 3.7.
Thus far in the discussion, only "standard features" have been introduced to
the VLSI model. The interested reader is referred to [Tho 80a] for more details on
the practical significance of the model, and to [Sav 79] for an excellent
introduction to the theoretical aspects of VLSI modelling.
A major distinction between the model of this paper and most previous VLSI models is the way in which it treats "I/O memory." Here, only a small area charge is made for the memory used to store problem inputs and outputs, even if this memory is also used for the storage of intermediate results.
In the new model, each input and output bit is assigned a place in a k -bit "Va
memory" attached to one or more "Va ports." Two types of access to the I/o
memory are distinguished. If the bits are accessed in a fixed order, the I/O
memory is organized as a shift register and accessed in 0(1) time. If the access
pattern is more complex, a random access memory (RAM) is used. Such a memory
has an access time of O(lg k) [M&C BO, p. 321]. The random access time covers
both the internal delays of the memory circuit as well as the time it takes the I/O
port to transmit (serially) flg kl address bits to the RAM. Of course, many other
organizations could have been assumed for the Va ports. This paper's bit-serial
interface seems to be the simplest one that allows optimal area-time results.
Allowing more than one I/O port to connect to a single I/O memory makes it easy to model the use of multiport memory chips. However, a few restrictions must be placed on their usage, to remove the (theoretical) temptation to reduce on-chip wiring at the expense of increasing printed-circuit board wiring. All I/O ports connecting to a single memory must be physically adjacent to each other in the chip layout, to avoid any possibility of "rats-nest" wiring to the memory chips. This assumption allows the area*time² lower bound proofs to proceed without difficulty, since all cross-chip communication must use on-chip wires. (Note that a two-port memory provides a communication channel between its two I/O ports.)
The model makes as few assumptions as possible about the actual location of
the I/O memory circuitry, even though this can have a large effect on system
timing. If the memory is placed on a different chip from the processing circuitry,
its access time is considerably increased. Fortunately, this will not always
invalidate the model's timing assumptions. The O(lg k) delay already assumed for
a k-bit RAM will dominate the delay of an off-chip driver, if k is large enough.
Alternatively, if k is small, it should be relatively easy to locate the RAM on the
processor chip. As for off-chip "shift register" I/O memories, there should be no
particular difficulty in implementing these in such a way that one input or output
event can happen every O(1) time units.
As indicated above, time charges for off-chip I/O are problematical and may
be underestimated in the current model. Area charges for I/O are also
troublesome. Here, I/O ports are assumed to have O(1) area even though they are obviously much larger than a unit-area wire crossing or an O(1) area gate. It is also
assumed that a design can have an unlimited number of I/O ports. In reality, chips
are limited to one or two hundred pins, and each pin should be considered a major
expense (in terms of manufacturing, reliability, and circuit board wiring costs). An
attempt is made in Section 4 to use more realistic estimates of I/O costs when
evaluating Section 3's constructions.
The complete model of VLSI computation is summarized in the following list of
assumptions.
Assumption 1: Embedding.
b. The state of a node is changed every time unit, i.e. its FSA undergoes one
state transition per time unit.
c. Logic nodes, repeater nodes, and receiver nodes are limited to O(1) bits of state.
d. Driver nodes have O(lg k) bits of state, one bit for each stage in their amplification chain.
e. The state vector of a "k-bit" I/O memory contains one bit for each of its
assigned problem input and output bits. The assignment of problem bits to
memories is one-to-one and is not data-dependent.
f. There are O(lg k) bits in the state vector of each I/O port attached to a k-
bit memory. These state vectors are used to address specific memory bits, as
explained in Assumptions 4g and 4h. Two different ports may not access the
same bit simultaneously.
g. "RAM-type" k -bit I/O ports run a memory cycle every O(lg k) time units.
During the first 19 k time units of a cycle, the port receives a bit-serial
address on its input wire. The next input signal is interpreted as a read/write
indicator. If a write cycle is indicated, the following input signal is written into
the addressed bit. During the last time unit of a memory cycle, the value of
the addressed bit is available on the I/O port's output wire.
h. "Shift-register-type" I/O ports run a memory cycle every 0(1) time units.
During the first time unit of a cycle, the value of the currently-addressed data
bit is available on the port's output wire. In the last time unit of a cycle, the
signal appearing on the port's input wire is written into this data bit, then the
port's address register is incremented (mod k).
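A behavioral sketch of Assumption 4h (hypothetical class name; one call to cycle() models one O(1)-time memory cycle):

```python
class ShiftRegisterPort:
    """Shift-register-type I/O memory: bits are visited in a fixed cyclic
    order, one bit read and one bit written per memory cycle."""
    def __init__(self, bits):
        self.bits = list(bits)   # the k-bit I/O memory
        self.addr = 0            # currently-addressed data bit
    def cycle(self, write_bit):
        out = self.bits[self.addr]           # first time unit: bit on output wire
        self.bits[self.addr] = write_bit     # last time unit: write input signal
        self.addr = (self.addr + 1) % len(self.bits)   # increment (mod k)
        return out
```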
Assumption 5: Area, time performance.
a. The total area of a chip is the number of unit squares in the smallest
enclosing rectangle.
b. The area performance A of a chip is its total area divided by its degree of
concurrency p. See Assumption 2a.
c. The time performance T of a chip is the maximum number of time units it
takes to solve anyone of its p problem instances.
3. Circuit Constructions
(This section has been omitted, due to space considerations. A complete
version of this paper is available from the author.)
uniprocessor design, since the two designs make very similar demands on their
processors. A drawback of the multiprocessor design is that it requires 19 N
independently addressable memories, one for each processor. The total memory-
processor bandwidth increases proportionately (see Table 2) to 19 N bits per time
unit.
The (lg N)-processor bitonic design has about the same area and time
performance as the (lg N)-processor heap sort design. The former has the
advantage of a slightly simpler control algorithm, and uses the simpler shift-
register type of I/O memory; the latter uses a more efficient sorting algorithm and
hence less memory bandwidth.
The (lg²N)-processor bitonic sorter is smaller than either of the (lg N)-
processor designs, for moderately sized N. Its control algorithm is extremely
simple, so that a "processor" is not much more than a comparison-exchange
module. Its major drawback is that it makes continuous use of
(1/2)(lg N)(lg N - 1) word-parallel shift-register memories, of various sizes.
The ("'N 19 N )-processor bitonic sorter has been entered in Table 2 with a
total area of O(N 19 N), so that there is room on the chip for all of its temporary
storage registers. Otherwise, it would require '"N 19 N separate I/O memories. It
has the same speed and a somewhat beller I/O bandwidth than the (lg2N)-
processor bitonic sorter just discussed. However, the laller's shift registers could
also be placed on the same chip as its processing circuitry, removing its I/O
bandwidth disadvantage. When "constant factors" are taken into consideration, the
(√N lg N)-processor design is clearly much larger than the (lg²N)-processor
design, because it has more processors and a much more complicated control
algorithm.
The (N/2)-processor bubble sorter has a couple of significant advantages that
are not revealed in either Table 1 or Table 2. Its comparators need very little in
the way of control hardware, so that at least for small N, it occupies less area than
any of the preceding designs. Also, it can be used as a "self-sorting memory,"
performing insertions and deletions on-line. (The uniprocessor and the (lg N)-
processor heapsorter can also be used in this fashion.) However, for even
moderately-sized N, the bubble sorter's horrible area*time² performance becomes
noticeable. For example, when N = 256, the (lg²N)-processor's 36 comparators and
491 words of storage probably occupy less room than the 128 comparators in a
bubble sorter. Still, the bubble sorter maintains about a 2:1 delay advantage over
the (lg²N)-processor bitonic sorter, when similar comparators are used.
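The comparator counts quoted above can be checked directly, assuming (as an illustration, not a claim from the paper) that the (lg²N)-processor design uses one comparison-exchange module per stage of the lg N(lg N + 1)/2-stage bitonic network:

from math import log2

N = 256
lgN = int(log2(N))                       # 8
bitonic_modules = lgN * (lgN + 1) // 2   # 36 comparison-exchange modules
bubble_comparators = N // 2              # 128 comparators
print(bitonic_modules, bubble_comparators)   # prints: 36 128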
The N-processor mesh-type bitonic sorter is the first design to solve a sorting
problem in sublinear time. Unfortunately, it occupies a lot of area. Each of its
processors must run a complicated sorting algorithm, reshuffling the data among
themselves after every comparison-exchange operation. Its I/O bandwidth must
also be large, since it solves sorting problems so rapidly. However, constant factor
improvements may be made to its area and bandwidth figures, by reprogramming
the processors so that each handles O(1) data elements at a time. Also, large area
and bandwidth are not always significant problems: in an existing mesh-connected
multiprocessor, the N processors are already in place and the I/O data may be
produced and consumed by local application routines.
The next three designs in Tables 1 and 2 are variants on a fully-parallelized
bitonic sort. The shuffle-exchange processor has a slight area advantage over the
CCC processor, because of its simpler control algorithm. However, the CCC is a
somewhat more regular interconnection pattern, so that it may be easier to wire
up in practice. Both designs are smaller in total asymptotic area than the
(N lg²N)-processor bitonic sorter, which solves lg²N problems at a time.
Nonetheless, the control structure of this last design is so simple that it probably
takes less area than the others for all N < 2²⁰. Of course, if a shuffle-exchange or a
CCC processor has already been built, the additional area cost for programming
the sorting algorithm is very small.
There seems to be little to recommend the final design, the N²-processor
bubble sorter. It has the same I/O bandwidth, a bit more total area, and a much
worse time performance than the (N lg²N)-processor bitonic sorter.
5. Closing Remarks
At the time of this writing, there are a number of important open questions in
VLSI complexity theory. A simply stated but seemingly perplexing problem is to
find out how much area can be saved when additional "layers" of wiring are made
available by technological advances. It is known that a k-level embedding can be
no smaller than 1/k² of the area of a two-level embedding [Tho 80a, pp. 36-38], but
it is not known whether this bound is achievable. (When k grows as the square root
of the two-level embedding's area, it seems that the 1/k² bound is tight.)
A second problem is to derive a better lower bound for the area*time²
complexity of the sorting problem. The original proof that AT² = Ω(N²lg²N) [Tho
80a] applies only to sorting circuits that read entire words of input through their
I/O ports. When input words can be broken down into bits, the largest lower bound
known is Ω(N²) [Vui 80]. I can see how to prove an Ω(N²lg N) bound for the case of
unrestricted inputs, but I know of no proof of Ω(N²lg²N). Of course, it is possible
that it is the upper bound that is too weak.
Another set of problems is opened up by the fact that the area*time² bounds
are affected greatly by nondeterministic, stochastic, or probabilistic assumptions
in the model. For example, equality testing is very easy if one only requires that
the answer be "probably" correct [Yao 79, L&S 81].
A final and very important problem in VLSI theory is the development of a
stable model. Currently there are almost as many models as papers. If this trend
continues, results in the area will become difficult to report and describe.
However, it is far from settled whether wire delays should be treated as being
linear or logarithmic in wire length, and the costs of off-chip communication
remain unknown.
Despite the open problems noted above, VLSI theory has been fairly successful
in obtaining matching upper and lower bounds for the computation of such
functions as Fourier transformation, matrix multiplication, integer multiplication,
integer addition, and sorting [A&A 80, B&K 79, B&K 81, Joh 80, P&V 79, P&V 80, Sav
80, Tho 80a, Tho 80b, Vui 80]. It has led to increased understanding and new
models for the embedding of graphs in the plane [Lei 80, KLLM 81]. The area*time²
performance metric has been shown to be applicable over a wide range of circuit
sizes and speeds, indicating that it is a fundamental type of space-time tradeoff.
References
[A&A 80] H. Abelson and P. Andreae, "Information Transfer and Area-Time
Tradeoffs for VLSI Multiplication," CACM Vol. 23, No.1, pp. 20-23,
January 1980.
[Arm 78] Philip K. Armstrong, U.S. Patent 4131947, issued December 26, 1978.
[B&K 79] R. P. Brent and H. T. Kung, "A Regular Layout for Parallel Adders,"
CMU-CS-79-131, Carnegie-Mellon Computer Science Dept., June 1979.
To appear in IEEE-TC.
[B&K 81] R. P. Brent and H. T. Kung, "The Area-Time Complexity of Binary
Multiplication," JACM Vol. 28, No.3, pp. 521-534, July 1981.
[C&M 81] B. Chazelle and L. Monier, "Towards More Realistic Models of
Computation for VLSI," Proc. 11th Annual ACM Symp. on Theory of
Computing, pp. 209-213, April 1979.
[CLW 80] Kin-Man Chung, Fabrizio Luccio, and C. K. Wong, "On the Complexity of
Sorting in Magnetic Bubble Memory Systems," IEEE-TC, Vol. C-29, No.
7, pp. 553-562, July 1980.
[Des 80] A. Despain, "Very Fast Fourier Transform Algorithms for Hardware
Implementation," IEEE-TC, Vol. C-28, No. 5, pp. 333-341, May 1979.
[Eva 79] S. A. Evans, "Scaling I²L for VLSI," IEEE Journal of Solid-State
Circuits, Vol. SC-14, No. 2, pp. 318-326, April 1979.
[Hon 81] J-W Hong, "On Similarity and Duality of Computation," unpublished
manuscript, Peking Municipal Computing Center, China.
[H&K 81] J-W Hong and H. T. Kung, "I/O Complexity: The Red-Blue Pebble
Game," Proc. 13th Annual ACM Syrnp. on Theory of Compuf:i:ng, pp.
326-333, May 1981.
[Joh 80] R. B. Johnson, "The Complexity of a VLSI Adder," Info. Proc. Letters,
Vol. 11, No.2, pp. 92-93, October 1980.
[Ket 80] M. B. Ketchen, "AC Powered Josephson Miniature System," 1980 Int'l
Conf. on Circuits and Computers, IEEE Computer Society, pp. 874-
877, October 1980.
INTRODUCTION
Valiant [1] showed how to embed any binary tree into the plane in
linear area without crossovers. The edges in this embedding have a
maximum length of O(√n). With Paterson, we [2] showed that a complete
binary tree can be embedded in the plane with maximum edge length of
O(√n/log n) and we argued the importance of short edge length for VLSI
design and layout. Here we show that every binary tree can be embedded
in the plane with all three properties: linear area, no crossovers, and
O(√n/log n) maximum edge length. This improves a result of Bhatt and
Leiserson [3] -- a graph with an n^(1/2-ε) separator theorem can be embedded
(perhaps with crossovers) in linear area and with a maximum edge length
of O(√n/log n) -- for the case of binary trees. In the paper we also
observe that Valiant's result can be extended to the case of oriented
trees [7].
These bounds on edge length are the best possible in the sense that
there are graphs in the families requiring edges of length Ω(√n/log n)
in any planar embedding [2]. Edge length is an important quantity
because it corresponds to wire length in VLSI circuits. Since the
delay in charging or discharging a wire is related to its length, long
wires can significantly influence performance [8]. For example,
families of pipelined circuits in which the maximum length wire is
longer for circuits solving larger problems will have the propagation
delay of the longest wire entering as a multiplicative factor in the
timing complexity of the circuit family. Thus, it is crucial to know
how long wires will be in a layout. Our results provide this information
for trees which must be embedded without crossovers.
The added condition of crossover-freedom is of theoretical interest
since it can be had without incurring an area or maximum edge length
penalty. This fact should be compared with Valiant's results on general
planar graphs [1] where embeddings without crossovers require more area
(Θ(n²)) than those with crossovers (O(n(log n)²)) and consequently
greater maximum edge length.
RESULTS
The model is the same one used in earlier works [1-3]. The set T_n
is the set of all trees on n vertices with vertex degree at most four.
The tree is a guest graph that is to be embedded in a host graph which
is a grid; that is, the vertex set is the set of lattice points in the
first quadrant and the edge set is composed of those edges connecting
the unit distance neighbors of the vertices. An embedding is a 1-1
mapping of the vertices of a tree t ∈ T_n to grid vertices and the tree
edges to vertex disjoint (i.e., without crossovers) paths in the grid.
The area of t ∈ T_n is the area of the minimum square bounding the image of
an embedding of t. The maximum edge length in an embedding of t is the
greatest number of edges in an image path. The edge length of t is the
least maximum edge length over embeddings of t.
First we identify a useful lemma (of some interest in its own right)
that a modified version of Valiant's crossover-free tree embedding can
embed oriented trees [7] in linear area.
Lemma 1. Every t ∈ T_n with a specified orientation can be embedded
in the plane with O(n) area and without crossovers so that the
orientation is preserved.
The proof is direct by making obvious modifications to the
"reconnection" phase of the Valiant algorithm. Notice that the
resulting edge length is still O(√n).
Theorem 2: [Main result] Every t ∈ T_n can be embedded in the plane
with area O(n), without crossovers, and edge length of O(√n/log n).
Proof sketch: The proof takes the form of an algorithm to achieve
the required embedding.
We begin by "balancing" the tree, that is by embedding it in a tree
of height O(log n). As a result, certain guest tree edges will be
mapped to host tree paths. We refer to edges in these host tree paths
as "double" edges because they host segments of two guest tree edges.
If the final embedding is to be planar, we must keep track of the
orientation of these double edges.
Although an indirect technique is known [4] for the balancing
operation, we prefer a more direct technique [5] that has better
dilation, i.e., less edge stretching. The details of how the direct
method performs the balancing are unimportant here. What does matter is
that the two cases of that construction can be laid out in a planar
fashion. Figure 1 illustrates the two relevant balancing transformations.
Next we use a refined version of the Bhatt-Leiserson [3] "hyper H"
embedding on the balanced tree. Their method, motivated by the Mead-
Conway "hyper-H" planar embedding of complete binary trees [8], shortens
the edges near the root of the tree by "pulling" vertices up narrow
Figure 1. Transformations used in the balancing operation. Dotted
edges are host tree edges.
The "hyper H" embedding is used to embed those vertices nearest the
root and is terminated with unembedded subtrees of size O(n/(log n)2).
By the lemma these trees can be embedded in O(n/(log n)2) area without
crossovers and with edge length O(iJl/log n).
As a consequence of the above techniques, we have
Theorem 3. Every t ∈ T_n with a specified orientation and height at
most O(n^(1/2-ε)) can be embedded with O(n) area, without crossovers,
with edge length O(√n/log n) so that the orientation is preserved.
We do not know whether arbitrary oriented trees can be embedded
with O(n) area, without crossovers and with edge length O(√n/log n). If
so, then the property of orientation would not penalize an embedding.
For comparison purposes let us relax our area measurement to be the
area of the smallest bounding convex set; this is legitimate since we
are now interested in lower bounds.
Theorem 4. Every t ∈ T_n with a specified orientation can be embedded
with its leaves on the perimeter of a convex set, without
crossovers, in area Θ(n²) so that the orientation is preserved.
The upper bound is obvious. A tree that forces the lower bound is
shown in Figure 3. The best lower bound for the unoriented case is
Brent and Kung's Ω(n log n) area lower bound for complete binary trees
with leaves on the perimeter of a convex set.
,.- ."
Figure 3. A tree forcing the lower bouno of Theorem 3.
REFERENCES
[1] L. G. Valiant
Universality Considerations in VLSI Circuits
IEEE Transactions on Computers, 1981
[2] M. S. Paterson, W. L. Ruzzo, and L. Snyder
Bounds on Minimax Edge Length for Complete Binary Trees
Proceedings of the Thirteenth Annual Symposium on the Theory of
Computing, 1981
[5] W. L. Ruzzo
Embedding Trees in Balanced Trees
Manuscript, University of Washington, 1981
[7] D. E. Knuth
The Art of Computer Programming
Volume 1, Addison-Wesley, 1968
The rapid advance of VLSI and the trend toward the decrease of the geometrical feature size,
through the submicron and the subnano to the subpico, and beyond, have dramatically reduced the cost
of VLSI circuitry. As a result, many traditionally unsolvable problems can now (or will in the near
future) be easily implemented using VLSI technology.
For example, consider the traveling salesman problem, where the optimal sequence of N nodes
("cities") has to be found. Instead of applying sophisticated mathematical tools that require investment
in human thinking, which because of the rising cost of labor is economically unattractive, VLSI
technology can be applied to construct a simple machine that will solve the problem!
The traveling salesman problem is considered difficult because of the requirement of finding the best
route out of N! possible ones. A conventional single processor would require O(N!) time, but with
clever use of VLSI technology this problem can easily be solved in polynomial time!!
The solution is obtained with a simple VLSI array having only N! processors. Each processor is
dedicated to a single possible route that corresponds to a certain permutation of the set [1, 2, 3, ..., N].
The time to load the distance matrix and to select the shortest route(s) is only polynomial in N. Since
the evaluation of each route is linear in N, the entire system solves the problem in just polynomial
time! Q.E.D.
Readers familiar only with conventional computer architecture may wrongly suspect that the
communication between all of these processors is too expensive (in area). However, with the use of
wireless communication this problem is easily solved without the traditional, conventional area penalty.
If the system fails to obtain from the FCC the required permit to operate in a reasonable domain of the
frequency spectrum, it is always possible to use microlasers and picolasers for communicating either
through a light-conducting substrate (e.g., sapphire) or through a convex light-reflecting surface
mounted parallel to the device. The CSMA/CD (Carrier Sense Multiple Access, with Collision
Detection) communication technology, developed in the early seventies, may be found to be most
helpful for these applications.
If it is necessary to solve a problem with a larger N than the one for which the system was initially
designed, one can simply design another system for that particular value of N, or even a larger one, in
anticipation of future requirements. The advancement of VLSI technology makes this iterative process
feasible and attractive.
This approach is not new. In the early eighties many researchers discovered the possibility of
accelerating the solution of many NP-complete problems by a simple application of systems with an
exponential number of processors.
Even earlier, in the late seventies many scientists discovered that problems with polynomial
complexity could also be solved in lower time (than the complexity) by using a number of processors
which is also a polynomial function of the problem size, typically of a lower degree. N×N matrix
multiplication by systems with N² processors used to be a very popular topic for conversations and
conference papers, even though less popular among system builders. The requirement of dealing with
variable N was (we believe) handled by the simple P/O technique, namely, buying a new system for
any other value of N, whenever needed.
According to the most popular model of those days, the cost of VLSI processors decreases
exponentially. Hence the application of an exponential number of processors does not cause any cost
increase, and the application of only a polynomial number of processors results in a substantial cost
saving!! The fact that the former exponential decrease refers to calendar time and the latter to problem
size probably has no bearing on this discussion and should be ignored.
The famous Moore model of exponential cost decrease was based on plotting the time trend (as has
been observed in the past) on semilogarithmic scale. For that reason this model failed to predict the
present as seen today. Had the same observations been plotted on a simple linear scale, it would be
obvious that the cost of VLSI processors is already (or about to be) negative. This must be the case, or
else there is no way to explain why so many researchers design systems with an exponential number of
processors and compete for solving the same problem with more processors.
Conclusions
With the rapid advances of VLSI technology anything is possible.
The more VLSI processors in a system, the better the paper.
Optimal Placement for River Routing
Charles E. Leiserson and Ron Y. Pinter
Massachusetts Institute of Technology
Laboratory for Computer Science
Cambridge, Massachusetts 02139
Abstract--River routing is the problem of connecting a set of terminals a1, ..., an on
a line to another set b1, ..., bn in order across a rectangular channel. When the terminals
are located on modules, the modules must be placed relative to one another before routing.
This placement problem arises frequently in design systems like bristle-blocks where stretch
lines through a module can effectively break it into several chunks, each of which must be
placed separately. This paper gives concise necessary and sufficient conditions for wirability
which are applied to reduce the optimal placement problem to the graph-theoretic single-
source-longest-paths problem. By exploiting the special structure of graphs that arise from
the placement problem for rectilinear wiring, an optimal solution may be determined in
linear time.
1. Introduction
River routing is a special routing problem which arises often in the design of integrated
circuits, and it has been shown to be optimally solvable in polynomial time for many wiring
models (see in particular [Tompa] and [Dolev et al.]). In this paper we demonstrate that
the placement problem for river routing is also polynomial-time solvable.
The general character of the placement problem for river routing is illustrated in Figure
1. Two sets of terminals a1, ..., an and b1, ..., bn are to be connected by wires across a
rectangular channel so that wire i is routed from ai to bi. The terminals on each side of
the channel are grouped into chunks which must be placed as a unit. The quality of a legal
placement---one for which the channel can be routed---can be measured in terms of the
dimensions of the channel. The separation is the vertical distance between the two lines
of terminals, and the spread is the horizontal dimension of the channel.
The wiring model gives the constraints that the routing must satisfy. Although
our results can be generalized to include a variety of wiring models (see Section 5), we
concentrate on the (one-layer) square-grid model. Crossovers are disallowed in the square-
grid model, and all wires must take disjoint paths through the grid.
The placement problem for river routing arises often during ordinary integrated circuit
design. A common instance is when the terminals of one or more modules are to be
connected to drivers. The various independent "chunks" are the modules, which lie on one
side of the channel, and the drivers, which lie on the other.
Tbis r",earch was suppurted in part by the De,'"nse Advanced Rpsearcb Projects .Ag~ncy unolH Contract
No. ~JOO()J4-80- C-0622.
Figure 1: Two sets of chunks on either side of a rectangular channel.
Terminal ai must be connected to bi for i = 1, ..., 10.
Let us now turn to the river routing problem and examine how this constraint can
be brought to bear. Let a1, ..., an denote both the names of the terminals at the top of
the channel and their x-coordinates, and let the same convention hold for the terminals
b1, ..., bn at the bottom of the channel. Consider a half-open line segment drawn from
terminal aj to terminal bi as shown in Figure 5. The j - i wires emanating from
ai, ..., aj-1 must all cross this line, and similarly for a line drawn from bj to ai. In order
for a channel with separation t to be routable, therefore, it must be the case that
Figure 5: The (j - i) wires from ai, ..., aj-1 must cross the dashed line
between bi and aj.
Although Condition (1) is a new condition for wirability, the analysis that leads to it
is essentially the same as that in [Dolev et al.] and represents previous work in the field.
A more compact condition exists, however, which is equivalent:

ai+t - bi ≥ t  and  bi+t - ai ≥ t,  for 1 ≤ i ≤ n - t,    (2)

since ak+1 ≥ ak + 1 for all 1 ≤ k < n. Thus the two conditions are indeed equivalent.
Figure 6 shows a simple geometric interpretation of Condition (2). The condition
ai+t - bi ≥ t means that a line with unit slope going up and to the right from bi must
intersect the top of the channel at or to the left of terminal ai+t. And if the condition
fails, terminal bj must be to the right of aj for i ≤ j < i + t - 1, that is, each wire
from an aj goes down and to the right, which can be shown to follow from the fact that
aj+1 ≥ aj + 1. (For bi+t - ai ≥ t, the line with slope -1 going down and to the right
from ai must intersect the bottom of the channel at or to the left of terminal bi+t.)
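Condition (2) transcribes directly into code. The sketch below is illustrative rather than part of the original paper; it assumes 1-indexed coordinate lists a and b (index 0 unused), separation t, and n wires.

# Illustrative check of Condition (2).
def satisfies_condition_2(a, b, t, n):
    return all(a[i + t] - b[i] >= t and b[i + t] - a[i] >= t
               for i in range(1, n - t + 1))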
To see that this algorithm works given Condition (2), we must be more precise about
what paths are taken by the wires. Consider without loss of generality a block of consecu-
tive wires that go down and to the right, that is, ai ≤ bi for all wires in the block. For
any horizontal position x such that ai - t < x ≤ bi, define

ηi(x) = max( ai - x, max{ r | bi-r ≥ x } ).

The path of wire i is then described by the locus of points (x + ηi(x), ηi(x)) for ai - t <
x ≤ bi.
A geometric interpretation of this formulation uses the same intuition as was given in
Figure 6. The line with unit slope drawn from (x, 0), where x is in the range ai - t < x ≤ bi,
must cross wire i. The value ηi(x) gives the y-coordinate of wire i where it crosses this line
of unit slope. The two-part maximum in the definition of ηi(x) corresponds to whether
the wire is being routed straight across the channel or whether it is following the contour
of the bottom. The value of ηi(x) for the latter situation is the number of wires to the left
of wire i which must cross the line of unit slope.
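The locus construction can likewise be written out explicitly. The sketch below assumes the two-part maximum reconstructed above, integer terminal positions in 1-indexed arrays, and a block whose wires go down and to the right; the function names are invented.

# Illustrative computation of the wire paths.
def eta(a, b, i, x):
    # max{ r | b[i-r] >= x }: since the b's increase with their index,
    # the qualifying r's form a consecutive run 0, 1, ..., R.
    r = 0
    while i - (r + 1) >= 1 and b[i - (r + 1)] >= x:
        r += 1
    return max(a[i] - x, r)

def wire_path(a, b, t, i):
    # Locus of points (x + eta_i(x), eta_i(x)) for a[i] - t < x <= b[i].
    return [(x + eta(a, b, i, x), eta(a, b, i, x))
            for x in range(a[i] - t + 1, b[i] + 1)]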
We must now show that the locus of points for a wire is a path, that the paths are
disjoint, and that they never leave the channel. That the locus of points is indeed a path
can be seen by observing that as x ranges from ai - t to bi, the initial point is (ai, t - 1),
the final point is (bi, 0), and with a change of one in x the coordinates of the path change
by a single grid unit in exactly one of the two dimensions. To show that the paths are
disjoint, consider two adjacent wires i and i + 1, and observe for ai+1 - t < x ≤ bi that
ai - x < ai+1 - x and max{r | bi-r ≥ x} < max{r | bi+1-r ≥ x}, and therefore ηi(x) < ηi+1(x).
To show a path of a wire never leaves the channel, we demonstrate that ηi(x) < t for all
i and x in the associated range. It is for this part of the proof that we need the assumption
that Condition (2) holds. If for a wire i, the two-part maximum in the definition of ηi(x)
is achieved by ai - x, then ηi(x) must be less than t because x > ai - t. Suppose then,
that the two-part maximum is achieved by the maximal r such that bi-r ≥ x. To show
that r < t, we assume the contrary and obtain a contradiction. But since bi-t ≥ bi-r ≥
x > ai - t, the contradiction is immediate because ai - bi-t ≥ t from Condition (2).
Figure 7: The curve of minimum spread versus separation for the example
of Figure 1.
channel is routable in t tracks. If minimum separation is the goal, for example, binary
search can determine the optimum t in O(lg t) steps. Since the algorithm presented in
the next section determines a placement for fixed t in O(n) time where n is the number
of terminals, and since the separation need never be more than n, a minimum-separation
placement can be achieved in O(n lg n) time. For more general objective functions such as
area, the optimum value can be determined in O(n²) time.
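The minimum-separation search amounts to a binary search wrapped around the fixed-t placement routine. A minimal sketch, assuming a hypothetical routine place(t) that returns a placement in O(n) time or None when no placement exists for separation t:

# Illustrative minimum-separation driver.
def minimum_separation(place, n):
    lo, hi, best = 1, n, None      # the separation need never exceed n
    while lo <= hi:
        t = (lo + hi) // 2
        p = place(t)
        if p is not None:
            best, hi = (t, p), t - 1   # feasible: try fewer tracks
        else:
            lo = t + 1                 # infeasible: need more tracks
    return best                        # O(lg n) probes, O(n lg n) total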
We now examine the character of the placement problem for river routing when the
separation t is given. The n terminals are located on m chunks which are partitioned into
two sets that form the top and bottom of the channel. For convenience, we shall number
the chunks from one to k on the top, and k + 1 to m on the bottom. The order of chunks
on each side of the channel is fixed, but they may be moved sideways so long as they do not
overlap. For each chunk i, a variable vi represents the horizontal position of its left edge.
Any placement can therefore be specified by an assignment of values to these variables.
We also add two variables v0 and vm+1 to the set of variables, which represent the left
and right boundaries of the channel. The spread is thus vm+1 - v0. Figure 8(a) shows
the eight variables for the example from Figure 1.
Since the relative positions of terminals within a chunk are fixed, the wirability con-
straints of Condition (2) can be reexpressed in terms of the chunks themselves to give
placement constraints that any assignment of values to the vi must satisfy. If terminal
ai+t lies on chunk h, and terminal bi lies on chunk j, the constraint ai+t - bi ≥ t can
be rewritten as vh - vj ≥ rhj, where rhj reflects t and the offsets of the terminals from
the left edge of their respective chunks. The constraint between two chunks determined in
this way will be the maximal constraint induced by pairs of terminals.
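The rewriting is a one-line computation. In the illustrative sketch below, off_a and off_b are invented arrays holding each terminal's distance from the left edge of its chunk.

# From a[i+t] = v[h] + off_a[i+t] and b[i] = v[j] + off_b[i], the
# constraint a[i+t] - b[i] >= t becomes v[h] - v[j] >= r_hj with:
def chunk_constraint(t, off_a, off_b, i):
    return t - off_a[i + t] + off_b[i]
# The edge weight between chunks h and j is the maximum of these values
# over all terminal pairs with a[i+t] on chunk h and b[i] on chunk j.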
Figure 8(a): The eight placement variables v0 through v7 for the example of Figure 1.
Additional constraints arise from the relative positions of chunks on either side of the
channel. For each pair of adjacent chunks i and i + 1, the constraint vi+1 - vi ≥ wi must
be added to the set of placement constraints, where wi is the width of chunk i. Four more
constraints are needed which involve the boundary variables v0 and vm+1. For chunks 1
and k + 1 which are leftmost on the top and bottom, the constraints v1 - v0 ≥ 0 and
vk+1 - v0 ≥ 0 enforce that these chunks lie to the right of the left boundary of the
channel. For chunks k and m which are rightmost on the top and bottom, the relations
vm+1 - vk ≥ wk and vm+1 - vm ≥ wm constrain them to lie to the left of the right
boundary, where wk and wm are the widths of the chunks.
Figure 8(b) shows a placement graph which represents the constraints between chunks
for the placement problem of Figure 1 where the separation is 3 tracks. A directed edge
with weight δkl goes from vk to vl if there is a constraint of the form vl - vk ≥ δkl. For
example, the weight of 1 on the cross edge going from v5 to v2 is the maximal constraint
of a9 - b6 ≥ 3 and a10 - b7 ≥ 3, which yield v2 - v5 ≥ -1 and v2 - v5 ≥ 1 since
a9 = v2 + 5, a10 = v2 + 6, b6 = v5 + 1, and b7 = v5 + 4. The side edge from v4 to v5
arises from the constraint that chunk 4, which is 5 units long, must not overlap chunk 5.
The goal of the placement problem is to find an assignment of values to the vi which
minimizes the spread vm+1 - v0 subject to the set of constraints. This formulation is
an instance of linear programming where both the constraints and the objective function
involve only differences of variables. Not surprisingly, this problem can be solved more
efficiently than by using general linear programming techniques. In fact, it reduces to a
single-source-longest-paths problem in the placement graph. The length of a longest path
from v0 to vm+1 corresponds to the smallest spread of the channel that complies with all
the constraints. The placement of each chunk i relative to the left end of the channel is
the longest path from v0 to vi. If the placement graph has a cycle of positive weight, then
no placement is possible for the given separation.
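The reduction can be exercised with the standard Bellman-Ford relaxation (the general O(|V|·|E|) version; the linear-time refinement is developed below). In this illustrative sketch, edges is an invented encoding of the constraints as (k, l, w) triples meaning v[l] - v[k] >= w, with vertex 0 the left boundary and vertex m+1 the right boundary.

# Illustrative Bellman-Ford longest-paths solver for the placement graph.
def longest_paths(nvertices, edges):
    dist = [float('-inf')] * nvertices
    dist[0] = 0
    for _ in range(nvertices - 1):
        for (k, l, w) in edges:
            if dist[k] + w > dist[l]:
                dist[l] = dist[k] + w
    # One extra pass: any further improvement exposes a positive-weight
    # cycle, i.e. no placement exists for this separation.
    for (k, l, w) in edges:
        if dist[k] + w > dist[l]:
            return None
    return dist   # dist[m+1] is the minimum spread; dist[i] places chunk i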
For the placement problem of Figure 1 with a three-track separation, the longest path
from v0 to v2 in the placement graph (Figure 8) is v0 - v1 - v4 - v5 - v2 with weight
13 which corresponds to the positioning of chunk 2 in the optimal placement shown in
Figure 9(a). Figures 9(b) through 9(d) show optimal solutions to the placement problem
of Figure 1 for separations t = 4 through t = 6. The constraints for t = 2 yield a cycle of
positive weight in the placement graph, and thus no placement is possible which achieves
a separation of only two tracks.
The list c of edges is the key to the correctness of the algorithm. The length of a
longest path from the source v0 to a vertex vi converges to the correct value if the edges of
the path form a subsequence of the list c. (This can be proved by adapting the analysis
of [Yen].) In the normal algorithm for a general graph G = (V, E), the list c is |V| - 1
repetitions of an arbitrary ordering of the edges in E, which ensures that every vertex-
disjoint path in G beginning with v0 is a subsequence of c. If there are no cycles of positive
weight in the graph G, then from v0 to each other vertex in G, there is a longest path
that is vertex-disjoint; hence the algorithm is guaranteed to succeed. The condition of
positive-weight cycles can be tested at the end of the algorithm either by checking whether
all constraints are satisfied or by simply running the algorithm through the edges in E one
additional time and testing whether the values of any λ(vi) change.
The list c is also the key to the performance of a Bellman-Ford algorithm. For the
general algorithm on an arbitrary graph G = (V, E), the length of the list is (|V| - 1)·|E|,
and thus the algorithm runs in O(|V|·|E|) time. For a placement graph it is not difficult
to show that both |V| and |E| are O(m), and thus the longest-paths problem can be
solved in O(m²) time by the general algorithm. But a linear-time algorithm can be found
by exploiting the special structure of a placement graph to construct a list c of length
O(m) that guarantees the correctness of the Bellman-Ford algorithm. We now look at the
structure of placement graphs more closely.
The vertices of a placement graph G = (V, E) corresponding to the chunks on the
top of the channel have a natural linear order imposed by the left-to-right order of the
chunks. We define the partial order ≺ as the union of this linear order with the similar
linear order of bottom vertices. Thus u ≺ v for vertices u and v if their chunks lie on
the same side of the channel and the chunk that corresponds to u lies to the left of the
one which corresponds to v. The left-boundary vertex v0 precedes all other vertices, and
all vertices precede the right-boundary vertex vm+1. The partial order ≼ is the natural
extension to ≺ that includes equality.
The next lemma describes some of the structural properties of placement graphs.
Figure 10 illustrates the impossible situations described in Properties (i) and (ii) and shows
the only kind of simple cycle that can occur in a placement graph together with the two
consecutive cross edges that satisfy Property (iii).
(c) Every simple cycle contains at most one vertex from the top or
at most one vertex from the bottom. The edges incident on the
vertex are a consequence of Property (iii).
Each edge in the placement graph is either a top edge, a top-bottom edge, a bottom-
top edge, or a bottom edge. For each of these four sets of edges, there is a natural linear
order of edges based on ≼, where (u, v) precedes (x, y) for two edges in the same set if
u ≼ x and v ≼ y. Property (ii) guarantees that the linear order holds for two cross edges
in the same set. Let TT, TB, BT, and BB be the four lists of edges according to the
natural linear order, and include the two edges out of v0 and the two edges into vm+1 in
either TT or BB as appropriate.
The list c used by the Bellman-Ford algorithm is constructed by a merge of the four
lists which we call MERGE. At each step of MERGE, a tournament is played among the
first elements of each list. If (u, v) and (v, w) are the first elements of two lists, then (u, v)
beats (v, w) if w ⋠ u. Since there may be more than one edge beaten by none of the other
three, ties are broken arbitrarily. The winner is appended to c and removed from the
head of its list. The tournament is then repeated until no edges remain in any of the four
lists. The performance of the tournament can be improved by recognizing that only six of
the twelve possible comparisons of edges need be tried, and that w ⋠ u is guaranteed for
all but two. Figure 11 shows a possible ordering of edges in c for the placement graph in
Figure 8.
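A sketch of MERGE itself, assuming a predicate prec(x, y) for the partial order "x precedes or equals y" and the four naturally ordered lists; by the lemma that follows, some head edge is always beaten by none, so the loop terminates. This is an illustration, not the paper's code.

def beats(e1, e2, prec):
    (u, v), (x, y) = e1, e2
    # (u, v) beats (v, w) if w does not precede or equal u.
    return x == v and not prec(y, u)

def merge_edge_lists(lists, prec):   # lists = [TT, TB, BT, BB]
    c = []
    while any(lists):
        heads = [(i, lst[0]) for i, lst in enumerate(lists) if lst]
        for i, e in heads:
            # A winner is a head edge beaten by none of the others;
            # ties are broken arbitrarily (the first such edge is taken).
            if not any(beats(f, e, prec) for j, f in heads if j != i):
                c.append(lists[i].pop(0))
                break
    return c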
In order for MERGE to be well-defined, the tournament must always produce a winner,
which is a consequence of the next lemma.
Proof. First, we show that the relation R is acyclic so that the edges can indeed be
topologically sorted. By definition of R, a cycle in R induces a cycle in the placement
graph. According to Property (iii), the cycle must have two consecutive cross edges (u, v)
and (v, w) such that w ≼ u. But since (u, v)R(v, w), we also have that w ⋠ u, which is a
contradiction.
The proof that MERGE topologically sorts the edges of E according to R makes use
of the fact that if a vertex v is the tail of an arbitrary edge in any one of the four lists
TT, TB, BT or BB, then for every u ≼ v there is an edge in the same list emanating
from u. Suppose that MERGE does not topologically sort the edges of E according to R.
Then there is a first edge (u, v) in c such that there exists an edge (v, w) earlier in c and
(u, v)R(v, w). Consider the edge (x, y) in the same list as (u, v) that competed with (v, w)
when (v, w) was the winner of the tournament. For each of the possible combinations of
lists for (u, v) and (v, w), it can always be deduced that there is an edge emanating from
y which makes (x, y) an earlier violator of the topological sort than (u, v). ∎
Since each edge of E is included exactly once in the list c created by MERGE, the
Bellman-Ford algorithm applied to c has a running time linear in the number of chunks.
The correct values for longest paths are produced by the algorithm if for every vertex v,
there is a subsequence of c that realizes a longest path from v0 to v, under the assumption
that there are no positive-weight cycles in the placement graph. Since for every longest
path, there is a vertex-disjoint longest path, the following theorem proves the correctness
of this linear-time Bellman-Ford algorithm.
Theorem 3. Let G be a placement graph with left-boundary vertex v0. Then every
vertex-disjoint path beginning with v0 is a subsequence of the list c created by the procedure
MERGE.
Proof. We need only show that every pair of consecutive edges in a vertex-disjoint
path from v0 satisfies R because then Lemma 2 guarantees that the path is a subsequence
of c. Suppose (u, v) and (v, w) are two consecutive edges on a vertex-disjoint path from v0
which violate R, that is, w ≼ u. If either (u, v) or (v, w) is a side edge, the pair must satisfy
R, and thus both must be cross edges with the vertices u and w on the same side. Since
if w = u, the path is not vertex-disjoint, we need only show that w ≺ u is impossible.
Assume, therefore, that w ≺ u, and consider the initial portion of the path from v0
to u. Since v0 ≺ v and v0 ≺ w, there must be an edge (x, y) on the path which goes from
the set of vertices to the left of (v, w) to the right of (v, w) in order to get to u. But then
either Property (i) or Property (ii) is violated depending on whether x ≺ v and w ≺ y,
or y ≺ v and x ≺ w. ∎
5. Concluding Remarks
The reduction from the fixed-separation placement problem in the square-grid model
to the single-source-longest-paths problem is possible because the wirability constraints can
all be written in the form vi - vj ≥ δij. Thus for any wiring model where wiring constraints
can be written in this form, the reduction will succeed. Also, it should be observed that
in general, the performance of the single-source-longest-path algorithm will not be linear,
but will be a function of the number of constraints times the number of variables. This
section reviews other models and gives the necessary and sufficient wirability constraints
for each.
1. One-layer, gridless rectilinear ([Dolev et al.]). Wires in this model must run
horizontally or vertically, and although they need not run on grid points, no two wires
can come within one unit of each other. The wirability constraints for this model are the
same as for the square grid model:
3. One-layer, gridless ([Tompa]). Wires can travel in any direction. The constraints are

ai+r - bi ≥ √(r² - t²)  and  bi+r - ai ≥ √(r² - t²)

for t ≤ r ≤ n and 1 ≤ i ≤ n - r. The placement algorithm runs in O(m³ + n²) time.
4. Multilayer models. All the models presented until now have been one-layer models.
It is natural to generalize to l-layer models in which wires may travel on different layers.
Remarkably, optimal routability can always be achieved with no contact cuts ([Baratz]),
that is, a wire need never switch layers. The necessary and sufficient conditions for these
multilayer models are a natural extension of the one-layer conditions. For example, in the
one-layer, gridless, rectilinear model the conditions are modified for l layers to be
Extensions can be made to the placement algorithm as well as to the wiring model.
For instance, multiple (parallel) horizontal channels are easily handled within the same
graph-theoretic framework. More interesting is the two-dimensional problem illustrated in
Figure 12. Here, a line between two chunks indicates that wires must be routed between
them. Unfortunately, in order to optimally solve this general problem, it appears that
the constraints indicated by the lines must be convex in both dimensions, not just one
as is the case for the models heretofore considered. When the constraints are convex,
however, convex programming can be used to optimize a cost function such as the area
of the bounding box of the layout. One model which gives convex constraints for the
general two-dimensional problem is the one in which all wires must be routed as straight
line segments between terminals such that no minimum spacing rules are violated. This
model is not particularly interesting from a practical standpoint, however. Heuristics for
solving the related two-dimensional compaction problem by repeatedly compacting in one
dimension and then the other can be found in [Hsueh].
Acknowledgments. We would like to thank Howie Shrobe of the MIT Artificial Intel-
ligence Laboratory for posting the plots of the data paths from the Scheme81 chip which
inspired our interest in this placement problem and for his valuable comments on the
practicality of our work. We would also like to thank Alan Baratz and Ron Rivest from
the MIT Laboratory for Computer Science for numerous helpful discussions, and Shlomit
Pinter also from the Laboratory for Computer Science for influencing the direction of our
proof of Theorem 3. Finally, special thanks to Jim Saxe of Carnegie-Mellon University for
his key contributions to the linear-time algorithm for longest-paths.
References
[Baratz] Baratz, A. E., private communication, June 1981.
[Batali et al.] Batali, J., N. Mayle, H. Shrobe, G. Sussman and D. Weise, "The DPL/
Daedalus design environment," Proceedings of the International Confer-
ence on VLSI, Univ. of Edinburgh, August 1981, pp. 183-192.
[Dolev et al.] Dolev, D., K. Karplus, A. Siegel, A. Strong and J. D. Ullman, "Optimal
wiring between rectangles," Proceedings of the Thirteenth Annual ACM
Symposium on Theory of Computing, May 1981, pp. 312-317.
[Dolev and Siegel] Dolev, D. and A. Siegel, "The separation required for arbitrary wiring
barriers," unpublished manuscript, April 1981.
[Hsueh] Hsueh, M.-Y., Symbolic Layout and Compaction of Integrated Circuits,
Memorandum No. UCB/ERL-M79/80 (Ph.D. dissertation), Electronics
Research Laboratory, Univ. of California, Berkeley, December 1979.
[Johannsen] Johannsen, D., "Bristle blocks: a silicon compiler," Proceedings of the
Caltech Conference on VLSI, January 1979, pp. 303-310. Also appears
in the Proceedings of the Sixteenth Design Automation Conference, June
1979, pp. 310-313.
1 INTRODUCTION
The problems of placement and routing in integrated circuit design have been gaining increasing
attention as fabrication technology advances. Although a variety of these problems have been
proven to be NP-hard, progress is being made on restricted versions. Tompa, for example, gives
a quadratic solution to a particular single layer routing problem ([T]). This paper gives efficient
algorithms for finding the separation and the offset in contexts which include his model.
The objective is to compute the space required for wiring without taking the extra time neces-
sary to determine the coordinates for the wires. This information would be useful for placement
and compaction.
Figure 1.1: Wires restricted by rectangular barriers around P1
Since wires must be separated by minimum distances, every routing problem has forbidden
regions restricting the wiring flow. Specifically, for every vertex Pj and index s, there is an s-th
region (specified later) around Pj which cannot be entered by a wire connecting Pj+t with Qj+t
for |t| ≥ s. See Figure 1.1.
With a scheme permitting arbitrarily shaped wires, these regions will be concentric disks
centered at connection point vertices ([T]). It is easy to see that the boundaries conform to the
shapes permitted by the wiring scheme and grid restrictions. In the rectilinear case with an integer
grid they are concentric rectangles. On a quarter-integer grid the separation barriers are no longer
rectangular. See Figure 1.2.
(a) Arbitrary (circular) (b) Rectilinear integer grid (c) Rectilinear quarter-integer grid
Figure 1.2: Separation barriers
In Section 2 we initially consider families of forbidden regions which are convex and geometri-
cally similar. Later the requirements of convexity and similarity will be somewhat relaxed to include
virtually all known wiring schemes.
Given n pairs of vertices (P1, Q1), ..., (Pn, Qn), we take Pi to be both the name of a point on
the bottom row and the horizontal position of that point, and we take Qi to be a similar point on the
upper row. A left block is defined to be a maximal sequence of pairs of points (Pi, Qi), ..., (Pj, Qj)
such that Qk ≤ Pk, for i ≤ k ≤ j. This condition says that all the connections in the block have
a position in the upper row to the left of the corresponding position on the lower row.
We may define a right block in the obvious, symmetric way. We call a left or right block a
block. In the rest of the paper we refer to left blocks.
If a block can be legally wired, then there is a wiring in which all wires move monotonically; a
wire need never reverse direction. Consequently a wireable block can be wired within its rectangular
boundary. This implies that for a fixed offset the separation is determined by the worst block.
Since the wiring can be accomplished with monotone wires, we may alter the separation barri-
ers to include this observation. Imagine a separation barrier centered at the origin, and extend the
barrier in the second quadrant as a constant. The constant represents a possible crossing number
([DKSSU]). Figure 1.3 shows the modified separation barriers in a left block. These regions are
still geometrically convex and similar. We now assume that all barriers are so modified.
It is not hard to see that the separation can be determined from the barriers which emanate
from one specific side, say the families centered at the points of P. Wireability in this instance
is equivalent to all points Qj lying outside the relevant separation regions. The reason is that
modified barriers represent compact wiring. It follows that the separation is max_{i,j}(W(i,j)) where
the separation function W(i,j) is defined (for left blocks) as the height at Qj of the (j - i)th
barrier emanating from Pi. Notice that in Figure 1.3 the separation will be determined by the
height at Qj+3 of Pj's third barrier, W(j, j+3).
If, for example, we require that wiring in the integer rectilinear case leave rows P and Q
vertically for at least one unit in length, then the physical separation is
Figure 1.4 illustrates rectilinear wiring with this restriction. The purpose is to avoid unknown
wires inside modules P and Q (the black boxes denote possible internal wires). It turns out that
all reasonable initialization schemes can be accomplished with simple changes in the definition of
W. Wires with a fixed positive thickness can be accommodated as well.
On the region x < p the function hp(x) is concave. The basic inequalities are the concavity and

W(i,j) = (j - i) h((Qj - Pi)/(j - i)),  if j > i;
W(i,j) = 0,                             if j ≤ i.    (2-6)
This is consistent with wires from P to Q starting in a possibly horizontal direction.
The idea behind the partition property is essentially that rooted line segments connecting Qj1
and Qj2 with Pi1 and Pi2 have the following behavior: if i1 and i2 respectively maximize W(·, j1)
and W(·, j2), then the segments cannot cross. Figure 2.3 illustrates this result in the more general
context of Theorem 1.
Theorem 1. (The Partitioning Theorem)
Let h(x) be a suitable concave barrier function and W(i,j) the corresponding separation func-
tion. Suppose Pi, Pi+r, Qj, and Qj+s are in a left block, where r ≥ 0, s ≥ 0, and j ≥ i + r.
Proof: (sketch) When all four terms in (2-7) are positive, some computation shows that
the difference can be written as a double integral. Under the conditions of the theorem,
each integrand is non-negative. See Figure 2.3.
Lemma 2. Let W(i0, j0) = max_i(W(i, j0)). Then the separation function attains its maximum
at (imax, jmax) where either imax ≤ i0 and jmax ≤ j0, or imax ≥ i0 and jmax > j0.
Theorem 3. Suppose W(i, j) is a separation function which satisfies the partition property. Then
the separation can be found in O(n log n) time.
Proof: Without loss of generality, we assume that P and Q constitute one left block.
We first find a Pi0 maximizing W(i, n/2), the separation induced by the pairs Pi and Qn/2. Lemma
2 ensures that the maximum separation is among separations restricted to the intervals [P1, Pi0] and
[Q1, Qn/2], or [Pi0, Pn] and [Qn/2, Qn]. Repeating this divide-and-conquer step on the Q coordinates
requires a total of O(n log n) comparisons.
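The divide-and-conquer step can be phrased as follows, assuming W is given as a callable and that the partition property (Lemma 2) holds; this is an illustrative sketch, not the paper's code, and it performs O(n log n) evaluations of W overall.

def max_separation(W, ilo, ihi, jlo, jhi):
    if ilo > ihi or jlo > jhi:
        return float('-inf')
    jmid = (jlo + jhi) // 2
    # Scan the middle column for a maximizing row i0.
    i0 = max(range(ilo, ihi + 1), key=lambda i: W(i, jmid))
    best = W(i0, jmid)
    # By Lemma 2, the global maximum lies in one of two sub-rectangles.
    best = max(best, max_separation(W, ilo, i0, jlo, jmid - 1))
    best = max(best, max_separation(W, i0, ihi, jmid + 1, jhi))
    return best
# For one left block: max_separation(W, 1, n, 1, n).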
Corollary 4. The separation for a wiring scheme with circular barriers can be found in O(n log n)
time.
h(x) = 1,        if x < .5;
h(x) = 1.5 - x,  if .5 ≤ x < 1;
h(x) = 0,        if 1 ≤ x.

As before, we define the separation function:

W(i,j) = (j - i) h((Qj - Pi)/(j - i)),  if j > i;
W(i,j) = 0,                             if j ≤ i.
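Transcribed directly into code (an illustration assuming the reconstructed scaled form of W above):

def h(x):
    if x < 0.5:
        return 1.0
    if x < 1.0:
        return 1.5 - x
    return 0.0

def W(P, Q, i, j):
    if j <= i:
        return 0.0
    return (j - i) * h((Q[j] - P[i]) / (j - i))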
The partition property holds for these wiring barriers, but for fixed i, W(i,j) is not even
unimodal when restricted to the part of a block where W(i,j) > 0. Consequently the algorithm
for the rectilinear separation problem is inappropriate for this model. Nevertheless there is a
linear-time algorithm for this separation problem.
Theorem 5. The separation for rectilinear plus 45-degree wiring on a half-integer grid can be
determined in linear time.
Proof: (sketch) The separation function can be decomposed into two parts, one resulting from the
rectilinear (flat) portions of the separation barriers and one from a restriction to the 45-degree pieces. The
maximum contribution from the flat portions can be found from a linear scan as in the rectilinear
case. The other contribution can also be evaluated by a linear scan. In this instance a priority queue
must be maintained to indicate which Pi connection point gives a maximum (restricted) separation
for a current point Qj. When j is incremented a new restricted W(i,j) value is computed, and
the data structure is updated. As i is incremented, Pi is inserted in the data structure. The
linearity results from the fact that during the insertion, old data contributing a separation less
than that from a new Pi entry can be deleted. In addition, Pi entries giving former (but not
current) maximum separation values can be deleted during the updating. ∎
max(Φ(W)) = Φ(max(W)).
Rectilinear wiring on a quarter-integer grid is equivalent to a comparable rectilinear and 45-degree wiring
scheme under the map Φ(x) = ⌈4x⌉/4. This proves
Corollary 7. The separation problem for rectilinear wiring on a quarter-integer grid can be solved
in linear time. ∎
For completeness we observe that one-third and one-half integer grid rectilinear wiring schemes are
identical to the integer grid problems; the separation barriers are the same.
P[d] = P + d
The separation of a left block of P[d] and Q can't decrease as P slides to the right (d increases).
The left separation can increase, and the left block size can increase as neighboring right blocks
are consumed by left blocks. Consequently, for a fixed offset d0, we may find the separation of
the left and right blocks independently. If, say, the left separation exceeds the right, it follows
that a minimal separation occurs for some value d ≥ d0. If the two separation distances are equal,
then d0 is a solution to the offset minimizing problem. (We point out that the last condition is not
necessary; see Figure 3.1.)
Therefore a solution to an integer grid optimal offset problem can be found by using an
appropriate separation algorithm on P[d] and Q where the values d are obtained from a binary
search on the interval [-Pn, Qn]. This gives an operation count of O(ψ(n) log(Pn + Qn)) for the
offset algorithm where ψ(n) is the complexity of the separation problem. While this is sufficient
in most instances, it does suffer from the fact that log(Pn + Qn) is not bounded by any function
of n. We now describe an O(ψ(n) log(n)) complexity integer grid optimal offset algorithm. The
idea behind it is to limit the number of choices used in the offset region [-Pn, Qn]. This task is
simplified by defining the interdigitation number of an offset:

Idn(d) = Cardinality{ (i,j) | Qj ≤ Pi + d }.

Thus Idn(-Pn - 1) = 0 and Idn(Qn) = n². The obvious choice is to try a binary search on
the interdigitation number.
The problems with this approach are:
1. The interdigitation number is insufficient to determine the offset, and
2. It is necessary to find an offset which corresponds to a given bisecting interdigitation number.
We show the first difficulty is not serious, and the second can be overcome.
1. Once we have the interdigitation where a minimal separation occurs, the possible offsets will
be limited to an interval [a, b]. If b - a ≤ 2n, then a binary search on [a, b] will find an optimal
offset and the separation for an additional cost of ψ(n) log n. If b - a > 2n, then d = a + n gives
an offset where the vertices do not interact with each other. Consequently the separation is
just based on the largest number of wires passing a connection point. It is possible that the
offset d = a will allow the connection points to align perfectly, but a test for this is easy to
include.
2. The remaining problem can be stated as follows: given interdigitation numbers D1 and D2 with
corresponding displacements d1 and d2, find a displacement d3 which gives an intermediate
interdigitation between D1 and D2. We relax the condition on the binary search in the sense
that our new value need not cut the interval [D1, D2] in half; it is sufficient to require that the
length of the new subinterval be a fraction of the previous interval's length where that fraction
is bounded by a constant less than one. The following algorithm finds the intermediate value
d3. We note that the actual interdigitation numbers D are not used; they were introduced as
motivation for the algorithm.
Algorithm Intermediate(d1, d2, P, Q):
ARRAY P[1:n], Q[1:n], C[1:n], f[1:n];
k <- 1;
j <- 1;
FOR i <- 1 TO n DO BEGIN
WHILE j ≤ n and Pi + d1 ≥ Qj DO
j <- j + 1;
Qj is the first point to the right of Pi + d1.
WHILE k ≤ n and Pi + d2 ≥ Qk DO
k <- k + 1;
Qk is the first point to the right of Pi + d2.
Ci <- k - j;
Ci is the number of points Qj crossed by the ith point of P, P[d]i, as d increases
from d1 to d2. Formally, Ci = Cardinality{j | P[d1]i < Qj ≤ P[d2]i}.
Note that Σ Ci = D2 - D1.
fi <- Q[j + ⌈Ci/2⌉ - 1] - Pi;
fi is the offset such that P[d]i crosses ⌈Ci/2⌉ points of Q as d increases from d1 to
fi: fi = min{f | Cardinality{j | P[d1]i < Qj ≤ P[f]i} ≥ ⌈Ci/2⌉}.
END FOR;
RETURN WeightedMedian(f, C).
END Intermediate.
In addition, arguments similar to those above show that the continuous optimal offset problem
can be solved in time O(n² log n).
Corollary 12. The offset range problem for wiring defined by geometrically similar convex
separation barriers can be solved in O(n log n) time.
Theorem 13. The offset range problem for wiring defined by similar polygonal separation barriers
can be solved in O(n) time.
Theorem 14. The offset range problem for rectilinear plus 45-degree wiring on a half-integer
grid can be solved in O(n) time.
5 REFERENCES
[AHU] Aho, A. V., J. E. Hopcroft, and J. D. Ullman, The Design and Analysis of Computer
Algorithms, Addison Wesley, Reading, Mass., 1974.
[DKSSU]Dolev, D., K. Karplus, A. Siegel, A. Strong, and J. D. Ullman, "Optimal wiring between
rectangles," TlJirteen Annual ACM Symposium on the Theory of Computing, pp. 312-317,
1981.
[FP] Fischer, M. J. and M. S. Paterson, "Optimal tree layout," Proc. Twelfth Annual ACM
Symposium on the Theory of Computing, pp. 177-189, 1980.
[GJ] Garey, M. and D. Johnson, Computers and Intractability: A Guide to J/ P-completeness,
Freeman, San Francisco 1978.
[J] Johannsen, D., "Bristle blocks: a silicon compiler,' Caltech Conf. on VLSI, pp. 303-310,
Jan., 1979. See also Sixteenth Design Automation Proceedings, pp. 310-313, June, 1979.
[L] LaPaugh, A. S, "A Polynomial Time Algorithm for Optimal Routing Around a Il.ectangle,"
Proc. Twenty-Brst Annual IEEE Symposium on Foundations of Computer ScieIlce, pp.
282-293, 1980.
[LP] Leiserson, C. and R. Pinter, "Optimal Placement for River Routing," Carngie-Mellon
Conference on VLSI Systems and Computatiolls, Oct., 1980. .
[MC] Mead, C. and L. Conway, Introduction to VLSI Systems, Addison Wesley, Reading, Mass.,
1980.
[SLj Storer, J. A., "The node cost measure for embedding graphs on the planar grid," Proc.
Twelfth Annual ACM Symposium on the TllCory of Computing, PI'. 201-210, 1980.
[T] Tompa, M., "An optimal solutioll to a wire-routing problem," Proc. Twelfth AIInual ACM
Symposium on the Theory of Computing, PI'. 161-176, 1980.
[V] Valiant, L., "Universality Considerations in VLSI Circuits," IEEE Transactions OIl Com-
puters, pp. 135-140, February 1981.
Provably Good Channel Routing Algorithms
Ronald L. Rivest, Alan E. Baratz, and Gary Miller
Massachusetts Institute of Technology
Laboratory for Computer Science
Cambridge, Massachusetts 02139
I. Introduction
In this paper we present three new two-layer channel routing algorithms that are provably good in that they never require more than 2d-1 horizontal tracks, where d is the channel density, when each net connects just two terminals. To achieve this result we use a slightly relaxed (but still realistic) wiring model in which wires may run on top of each other for short distances as long as they are on different layers. Two of our algorithms will never use such a "parallel run" of length greater than 2d-1, and our third algorithm will require overlap only at jog points or cross points. Since in this wiring model at least d/2 horizontal tracks are required, these algorithms produce a routing requiring no more than four times the best possible number of horizontal tracks. The second algorithm also has the property that it uses at most 4n contacts, where n is the number of nets being connected.
II. The Model
The (infinite) channel of width t consists of (1) the set V of grid points (x,y) such that the integers x and y satisfy the conditions 0 ≤ y ≤ t+1 and -∞ < x < ∞, (2) the set P of poly segments consisting of all unit-length line segments connecting pairs of adjacent grid points which do not both have y=0 or y=t+1, and (3) the set M of metal segments, which is isomorphic to but disjoint from P. The channel (V,P,M) thus forms a multigraph with vertex-set V and edge-set P ∪ M. If two vertices are adjacent in this graph they are connected by precisely two edges, one of type poly and one of type metal. We define track i of the channel (V,P,M) to be the subgraph composed of all grid points in V with y-coordinate equal to i, and all segments of P ∪ M which connect pairs of these grid points.
A wire W consists of a sequence of distinct grid points separated by segments which connect them:

    W = (p0, s1, p1, s2, ..., sk, pk).

Here p0, ..., pk are the grid points, and si connects p(i-1) to pi. Each si may be of either type, poly or metal, and we define the sets of poly segments and metal segments of wire W as follows:

    P(W) = {si | si ∈ P},
    M(W) = {si | si ∈ M}.

Similarly, the contact points C(W) are defined to be the set of grid points where W starts, ends or changes layers:

    C(W) = {p0, pk} ∪ {pi | 0 < i < k and type(si) ≠ type(s(i+1))}
(where type(si) = poly if si ∈ P(W) and type(si) = metal if si ∈ M(W)).
We say that two wires W1 and W2 are compatible if there does not exist a pair of segments si ∈ W1 and sj ∈ W2 such that si and sj are incident on a common grid point and type(si) = type(sj).
Notice that two compatible wires may "overlap" by connecting to common grid points with
segments of different type, as illustrated in Figure 1.
Figure 1: Two compatible wires which overlap by connecting to common grid points with segments of different type.
Many previous channel routing algorithms employ a more restricted wiring model in which no such "overlap" is permitted. We do not know how to prove our current results without making use of a modest amount of overlap. The current model is certainly a realistic two-layer model, although it does permit wirings which are susceptible to "cross-talk" via the capacitive coupling of long overlapping wires. Our wirings will not have any long sections of overlapping wires; the longest such section will have length at most the width of the channel.
A net Ni = (pi, qi) is an ordered pair of integers specifying an entry (x-)coordinate pi and an exit coordinate qi. A net is said to be rising if qi < pi, falling if pi < qi, and trivial if pi = qi. A channel routing problem is simply a set of n nets, for some integer n, such that no two nets have a common entry coordinate or a common exit coordinate. A solution to a channel routing problem consists of an integer t and a set of n compatible wires W1, ..., Wn in the channel of width t, such that Wi begins at grid point (pi, t+1) and ends at grid point (qi, 0). The optimal width for a channel routing problem is defined to be the least integer t such that the problem has a solution in a channel of width t.
For any real number x, we say that a net Ni = (pi, qi) "crosses x" if either pi ≤ x < qi or qi ≤ x < pi. The channel density of a channel routing problem is defined to be the maximum over all x ∈ R of the number of nets crossing x. It is simple to show that a problem has optimal width at least d/2 if it has density d.
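Since density drives every bound in this paper, a short sketch may help. The following Python function (our illustration, not the authors' code) computes the channel density of a set of two-terminal nets by sweeping the endpoint events, using the half-open crossing condition defined above:

    def channel_density(nets):
        # a net (p, q) crosses x exactly when min(p,q) <= x < max(p,q)
        events = []
        for p, q in nets:
            lo, hi = min(p, q), max(p, q)
            if lo != hi:                  # trivial nets cross nothing
                events.append((lo, +1))
                events.append((hi, -1))   # -1 sorts first on ties: half-open
        events.sort()
        density = current = 0
        for _, delta in events:
            current += delta
            density = max(density, current)
        return density

    # Example: nets (1,4), (2,5), (3,6) all cross x = 3, so d = 3 and the
    # algorithms below would use t = 2d-1 = 5 tracks.
    assert channel_density([(1, 4), (2, 5), (3, 6)]) == 3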
III. A Provably Good Channel Routing Algorithm
Let CRP = {N1, ..., Nn} denote any channel routing problem. We assume without loss of generality that 1 ≤ pi, qi ≤ m for all 1 ≤ i ≤ n and some integer m. Thus the nets Ni ∈ CRP specify end-points which lie within some m "columns" of the channel. We will now describe a polynomial time algorithm which is guaranteed to compute a solution to CRP having channel width exactly t = 2d-1, where d is the channel density of CRP. Since d/2 is a lower bound on the optimal channel width for CRP, this algorithm will never generate a solution with channel width more than four times optimal.
Algorithm 1.
This algorithm proceeds column by column, routing all nets which cross j in step j. The
solution generated will have the properties that t = 2d-1, there will be at most d wires passing from column j to column j+1 for any j, and for some j there will be at least d such wires. Further, wires will pass from a column j to column j+1 only on the odd-numbered tracks; there will be no horizontal segments on the even-numbered tracks. In addition, if there are k nets which cross j then there will be exactly k horizontal segments connecting columns j and j+1. These segments will all lie on distinct odd-numbered tracks and they may be of either type, poly or metal, independently. Finally, if exactly r of the k nets which cross j are rising and f are falling (so that r + f = k), then between columns j and j+1:
(1) The top-most r odd tracks will be devoted to wire segments for the r rising nets,
(2) The "middle" d-r-f odd tracks will be empty, and
(3) The bottom-most f odd tracks will be devoted to wire segments for the f falling nets.
It now remains to demonstrate that this set of invariant properties can be maintained as the algorithm proceeds from column to column. If a column contains a trivial net, the net is wired straight across the column, using the even-numbered tracks to change layers as necessary. No other wiring is needed in such a column.
If a falling net Ni = (pi, j) enters column j from column j-1 on track tz, the algorithm drops a vertical connection from grid point (j, tz) down to grid point (j, 0). The algorithm then "closes up ranks" in column j so that all the empty odd tracks are in the middle of the channel. Figure 2 illustrates how such a wiring can be generated. Rising nets with entry coordinate j are handled similarly.
Finally, any rising net Ni = (pi, j) is routed in column j with a vertical connection from grid point (j, 0) up to grid point (j, tw), where tw is the top-most odd track which would be empty (i.e. contain no horizontal segment between grid points (j, tw) and (j+1, tw)) if net Ni were not present. Similarly any falling net is routed down to the lowest odd track that would otherwise be empty. If both of these situations occur in the same column, a modest amount of "overlap" is required as indicated in Figure 3. However, the situation of Figure 3 is the only place where overlap is needed.
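The between-column invariant can be stated operationally. The following Python sketch (an illustration of the invariant only, not the authors' implementation; the order of nets within each group is left unspecified here, although the "closing up ranks" rule fixes it) lists which odd tracks the nets crossing boundary j occupy between columns j and j+1:

    def invariant_between(nets, j, d):
        # nets are (entry p, exit q) pairs; d is the channel density
        crossing = [(p, q) for (p, q) in nets if p <= j < q or q <= j < p]
        rising = [n for n in crossing if n[1] < n[0]]    # q < p
        falling = [n for n in crossing if n[0] < n[1]]   # p < q
        odd_tracks = list(range(1, 2 * d, 2))            # tracks 1, 3, ..., 2d-1
        assignment = {}
        for track, net in zip(odd_tracks, falling):      # bottom-most f odd tracks
            assignment[track] = net
        for track, net in zip(reversed(odd_tracks), rising):  # top-most r odd tracks
            assignment[track] = net
        return assignment    # the d - r - f middle odd tracks remain empty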
Figure 2. Figure 3.
Theorem 1:
Algorithm 1 is guaranteed to compute a solution to CRP having channel width no more than four times optimal.
In this section we will describe a polynomial time algorithm which, like Algorithm 1, is
guaranteed to compute a solution to CRP having channel width no more than four times optimal,
but unlike Algorithm 1 requires no more than 4n total contact points. This new algorithm
employs the same basic approach as Algorithm 1 and thus its description will be facilitated by
simply noting the differences between the two algorithms.
Algorithm 2.
Similar to Algorithm 1, this algorithm proceeds column by column, routing all nets which cross j in step j. Further, a solution generated by Algorithm 2 will have essentially the same properties as a solution generated by Algorithm 1, with only two significant exceptions. The first of these exceptions is that all horizontal segments belonging to wires of falling nets (with the possible exception of the top-most such segment in each column) will be of type metal. A similar property will hold for rising nets and poly horizontal segments. The second significant exception is that for each column j there may be at most one distinct horizontal segment which is associated with a falling net and connects columns j and j+1 while lying on an even-numbered track. Further, the net of such a segment will not have exit coordinate equal to j+1, and the odd-numbered track immediately below the segment will be empty between columns j and j+1. A similar property will also hold relative to rising nets.
The maintenance of this new set of invariant properties requires a somewhat different set of wiring rules from those employed by Algorithm 1. Consider the case where a falling net Ni = (pi, j) enters column j from column j-1 on track tz. As in the previous algorithm, a vertical connection is dropped from grid point (j, tz) down to grid point (j, 0). Notice, however, that at most one contact point will be required along this connection, since all segments which must be crossed will have the same type. The algorithm must now "close up ranks" so that all blank tracks remain in the middle of the channel. It should be clear that the technique employed by Algorithm 1 in solving this problem can be of no use here. However, the problem can be easily solved by dropping a vertical connection from the top-most track containing a falling net which crosses j down to grid point (j, tz+1), as shown in Figure 4a. The only problem that occurs is
when the net to be dropped has exit coordinate equal to j+1. In this case, however, the algorithm simply drops the next lower net (if any), as shown in Figure 4b. Rising nets with entry coordinate j are handled similarly, and all other cases are handled as in Algorithm 1.
Figure 4.
Theorem 2:
Algorithm 2 is guaranteed to compute a solution to CRP with channel width no more than four times optimal and with no more than 4n total contact points.
Proof:
The proof of Theorem 2 follows from the above discussion and a more detailed case analysis
of the wiring rules applied within each column.
Finally, we note that Algorithm 2 has time complexity O(d·n) for a channel routing problem containing n nets and having density d.
V. Reducing Overlap
Let us now assume that we wish to compute a solution to CRP which has minimal channel width and no segment overlap. In this section we will describe a polynomial time algorithm which is guaranteed to compute a solution to CRP having channel width no more than four times optimal and requiring only "corner overlap". However, the number of contact points required by this algorithm will be O(d·n) rather than O(n).
Algorithm 3.
This algorithm proceeds track by track rather than column by column. The processing at each step involves a pair of adjacent tracks, i and i+1, such that i is odd. Furthermore, the algorithm proceeds bottom-up, beginning with tracks 1 and 2. At each step the algorithm extends all existing wires across both track i and track i+1, in such a way that the density of the subproblem between track i+1 and the top of the channel decreases. This reduction in density will result from horizontal wire extension along the odd-numbered track. Once again the final solution will have the properties that t = 2d-1 and there will be horizontal wire segments lying only on odd-numbered tracks; the even-numbered tracks will be used solely for layer changes along vertically running wires.
When the algorithm begins processing a pair of tracks i and i+1, there will exist exactly n distinct vertical segments connecting a grid point in track i-1 to a grid point in track i. Further, each of these segments will belong to a distinct wire. Since track i-1 is even-numbered and thus used solely for layer changes, we note that the type, poly or metal, of each of these segments can always be assigned as a function of the horizontal routing in track i. We will now describe the
Figure 5.
Theorem 3:
Algorithm 3 is guaranteed to compute a solution to CRP with channel width no more than four times optimal and requiring only "corner overlap".
Proof:
It follows directly from the above discussion that Algorithm 3 will always generate a solution in which the only type of overlap is corner overlap. The upper bound on channel width then follows from the observation that the density between track i and the top of the channel is strictly decreasing as the algorithm proceeds and i increases.
We now point out that Algorithm 3, like the previous two algorithms, has time complexity O(d·n). Unlike the previous two algorithms, however, this algorithm may generate wires which are non-monotonic (i.e. weave back and forth across the channel), thus resulting in increased total wire length.
VI. Conclusions
We have presented three channel routing algorithms which are guaranteed to compute a wiring requiring no more than four times the optimal channel width. Furthermore, one of these algorithms requires only a small number of contact cuts and another requires only a minimal amount of overlap. However, many open questions still remain:
(1) Can the upper bound be improved (e.g. to 3d/2)?
(2) Can this bound be proved in more restricted wiring models (e.g. the model of [D76])?
(3) Can this bound be proved for multi-terminal nets?
(4) Can both the number of contact cuts and the amount of overlap be simultaneously minimized?
VII. Acknowledgements
We would like to thank Charles Leiserson, Ron Pinter and Brenda Baker for helpful discussions.
VIII. References
[D76] Deutsch, D.N., "A Dogleg Channel Router," Proceedings of the 13th IEEE Design Automation Conference (1976), 425-433.
[DKSSU81] Dolev, D.; Karplus, K.; Siegel, A.; Strong, A. and Ullman, J.D., "Optimal Wiring Between Rectangles," 13th Annual ACM STOC Proceedings (Milwaukee, 1981), 312-317.
[HS71] Hashimoto, A. and Stevens, J., "Wire Routing By Optimizing Channel Assignment Within Large Apertures," Proceedings of the 8th IEEE Design Automation Workshop (1971), 155-169.
[T80] Tompa, M., "An Optimal Solution to a Wire-routing Problem," 12th Annual ACM STOC Proceedings (Los Angeles, 1980), 161-176.
This research was supported by NSF grant MCS78-05849 and by DARPA grant N00014-80-C-0622.
Optimal Routing in Rectilinear Channels
Ron Y. Pinter
Massachusetts Institute of Technology
Laboratory for Computer Science
Cambridge, Massachusetts 02139
Abstract: Programs for integrated circuit layout typically have two phases: placement and routing. The router should produce as efficient a layout as possible, but of course the quality of the routing depends heavily on the quality of the placement. On the other hand, the placement procedure would like to know how good a routing it can expect without actually routing the wires. This paper presents fast algorithms for optimal routing and for accurately estimating the area cost of such routings without actually laying them out.
The most common types of junctions occurring in layouts are T-shaped or X-shaped; this paper presents efficient algorithms to measure and produce the optimal rectilinear, two-layer routing in channels formed around such junctions. The ability to do this is based on the new notion of pairwise ordering, which is used to propagate routing constraints from one part of a channel to the rest, and alleviates a fundamental problem plaguing traditional channel routers. In addition we present a greedy algorithm for optimal routing in rectangles with a new type of terminal ordering which comes up frequently in practice but has not been studied before.
1. Introduction
The most common methodology for solving the layout problem for integrated circuits is to decompose it into two subproblems: placement and routing (for a definition of the layout problem see, for example, [LaP80]). For complicated VLSI circuits, we may need to form a hierarchy of such problems (as in [Pr79]). At each level we are given pieces of the circuit, called modules, which have been laid out at the preceding level, and have to lay them out to form a module for the next level up. Each module has terminals located along its boundary, and each terminal is associated with a signal net. In this methodology, we first place the modules, i.e. decide their geometric positions on the chip, and then route paths to interconnect terminals of common signal nets. The objective is to minimize the total area required to realize the circuit subject to various design rules. An example of a typical placement is given in Figure 1.
Naturally, the placement and routing problems are strongly related: the way in which modules are placed relative to each other may affect dramatically the quality and difficulty of the routing phase. Ideally, we would solve the routing problem for each proposed placement (while looking for an optimal one), but this is obviously intractable. One reason why routing is a hard problem is its global nature: the way one signal net is routed affects the potential solutions for other nets. Thus we need good estimates for the area which is going to be needed for routing relative to a given placement without spending the time needed to actually solve the routing problem.
Traditionally ([HaSt71],[Hi74],[Ri81]), a rectangular paradigm is used in VLSI design. Modules have rectangular shapes, and the routing among them is achieved by partitioning the given space into rectangular channels. Each such channel is assigned signals with terminals along its boundary (some are common with modules, some with other channels), and the signals are routed inside. While rectangular modules seem to be generally acceptable, rectangular channels pose difficulties for two reasons: both the evaluation of the placement based on possible routings and the globality of the routing problem are fragmented by the channel structure in a way that makes their solution even harder to achieve.
Therefore, I propose that we start looking at routing problems in non-rectangular channels (but still maintaining rectilinear sides). As long as modules are rectangular, such channels take one of three general shapes: T, X or L (as indicated in Figure 1). While the latter is relatively easy to handle, the other two are more complicated. Some instances of T's and X's yield to some interesting theoretical analyses which are presented in this paper. In general, non-rectangular channels are treated by partitioning them internally along edges and dealing with each section separately; the edges are used to maintain constraints in such a way that overall optimality is achieved. We develop a powerful algebraic abstraction for constraint propagation, called pairwise ordering, which is well suited to the problem, and study it carefully.
The theoretical research that has been done so far on routing in rectangles ([LaP80],[DKSSU81]) has paid little attention to configurations which are common in practice, and even less to the problem of propagating routing constraints through the channel. In [DKSSU81] we find a polynomial-time algorithm to solve a simple situation in which the ordering of the terminals with respect to their signal-nets is the same on two parallel edges. On the other hand, [LaP80] tells us that the problem in general (and even if limited considerably) is NP-complete. In order to fill the gap, we should examine routing of useful patterns both in the newly proposed channels and in rectangular ones.
After describing two relevant wiring models in Section 2, the three subsequent sections describe polynomial-time algorithms for attaining optimal routings for certain configurations in the following three cases: a T-shaped channel (Section 3), an X-shaped channel (Section 4), and arbitrary ordering among the signals along a channel's side (Section 5). The notion of pairwise ordering is defined and discussed in the beginning of Section 4. We conclude with a discussion of implications for a methodology for generalized channel routing.
Now let us define the wiring rules: For any two paths (all four terminals involved are distinct), no turning point of one path may lie on a segment of the other; thus, the paths may cross each other, but cannot share segments, go through each other's terminals, or turn at the same grid-point (see Figure 2(a)). This is somewhat different from the model in [Th80], where turning points may be shared (see Figure 2(b)), but our model conforms with the traditional Manhattan wiring model ([Hi74]) in which two layers are being used, one for each direction (preassigned). In current technology (e.g. nMOS, see for example [MC80]) connections between the layers are facilitated by vias (or contacts). If we adhere to the convention of one layer per direction, we had better refrain from making two turns at the same point (causing two vias to overlap). Some of our results are affected only mildly by this divergence from Thompson's model, but some are quite sensitive to it.
Figure 2.
Figure 3: Terminology for T-shaped channels (top, left/right ends, left/right flanks, left/right legs, bottom end).
a T-shape? First, we decide that the flanks of the channel will remain aligned (i.e. share the same grid-line). Second, it seems unnatural to use the absolute area as the optimality criterion, for various reasons (see [Pr79]). Thus it is natural to consider the distance between the legs (denoted WB, for bottom width) and the distance between the top and the flanks (denoted WT, for top width) as our criteria, in a way to be described in the following paragraph.
Moreover, it is obvious that changing the distance between the lower modules (i.e. changing the leg-to-leg distance) may affect the routing of signals going to the top and the flanks. Thus our strategy will be to minimize WB first, and then minimize WT with respect to it. This approach is practical in most design situations and is also likely to approximately minimize other interesting cost functions, such as area. Notice that minimizing WT first (by setting it to 0) will flatten out the T, i.e. make it into a rectangular channel by pushing the lower modules outward to the ends of the upper one. Also, once WB is known, we can fix the horizontal location of the lower modules with respect to the top one, forming a solid T. Finally, we shall see that WT tends to be much smaller than WB; thus minimizing WB first in an unconstrained manner is preferable from the placement procedure's point of view (since it is better in preserving the T-shape).
Now we restrict ourselves to two-point nets, i.e. to instances of the problem where each signal-net name can appear as the label of exactly two terminals. Also, for sake of simplicity, we exclude the ends as possible sides for terminals to lie on (they can be added at a later stage). Assuming no net connects two terminals lying on the same side or on two adjacent sides of the channel (this is reasonable if these are the sides of single modules), the nets can be divided into five cases according to which kinds of sides they connect:
(i) top to flanks
(ii) top to legs
(iii) flanks to legs (left to right and right to left only)
(iv) flank to flank
(v) leg to leg
The most interesting case to consider is (ii); cases (i), (iv) & (v) are embedded in standard rectangular routing, whereas (iii) is essentially a restriction of (ii). We shall solve case (ii) in the rest of this section.
Figure 4: More terminology for T-shaped channels (T-terms, B-terms, L-/M-/R-terms; left, central and right portions; the crossing edge).
Here (B,G) is an aligned pair inducing a conflict; (A,F) is an aligned pair not inducing a conflict. n = 8, nl = nr = 3, nm = 2.
Definition 1. All terminals on the legs (excluding the corners) are called B-terms (for bottom-terminals), and the terminals on the top T-terms (for top-terminals). We denote by n the number of signal pairs; thus there are n B-terms and n T-terms. T-terms whose x-coordinate is within the range of the left (right) flank (including end points) are called L-terms (R-terms); the rest of the T-terms are called M-terms (for middle-terminals). We denote the number of L-terms, R-terms and M-terms by nl, nr and nm, respectively (thus n = nl + nr + nm).
Definition 2. Two B-terms with the same y-coordinate are called an aligned pair. Alignments of B-terms induce pairing between the corresponding T-terms (i.e. the T-terms bearing the same labels). If the x-coordinate of the top terminal corresponding to the bottom terminal lying on the right leg is smaller than that (i.e. is to the left) of the top terminal corresponding to the bottom terminal lying on the left leg, the pair is considered to be a conflicting pair. In symbols: let S1, S2 be two signal nets, the top terminals of which are S1^T, S2^T, respectively, and the bottom terminals S1^B, S2^B, respectively. If S1^B lies on the left leg, S2^B on the right leg and they are (y-)aligned, then S1 and S2 constitute a conflicting pair if S2^T lies to the left of S1^T. We classify such conflicting pairs according to the subclasses their T-terms fall into: for all possible combinations of X, Y ∈ {L,R,M}, an XY-pair is a conflicting pair in which one T-term is an X-term and the other a Y-term*; e.g. the pair (B,G) in Figure 4 is an LR-pair. We order the pairs according to the positions of their T-terms, i.e. we write (S1, S2) if the x-coordinate of S1^T is smaller than that of S2^T. The number of XY-pairs is denoted by nxy; thus there are nlr LR-pairs.
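The bookkeeping of Definition 2 is mechanical. The following Python sketch (our illustration, with a hypothetical data layout: top[s] and kind[s] give the x-coordinate and class, 'L', 'R' or 'M', of signal s's T-term, and aligned lists the aligned pairs with the left-leg signal first) detects conflicting pairs, orders them by T-term position, and tallies the counts nxy:

    def conflicting_pairs(aligned, top, kind):
        pairs, counts = [], {}
        for s1, s2 in aligned:            # s1's B-term on the left leg, s2's on the right
            if top[s2] < top[s1]:         # right-leg signal's T-term is to the left: conflict
                ordered = (s2, s1)        # order the pair by T-term x-coordinate
                ty = kind[s2] + kind[s1]  # e.g. 'LR' for an LR-pair
                pairs.append(ordered)
                counts[ty] = counts.get(ty, 0) + 1
        return pairs, counts

    # Hypothetical coordinates echoing Figure 4: G on the left leg, B on the
    # right leg, and B's T-term (an L-term) left of G's (an R-term).
    top, kind = {'G': 7, 'B': 2}, {'G': 'R', 'B': 'L'}
    assert conflicting_pairs([('G', 'B')], top, kind) == ([('B', 'G')], {'LR': 1})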
Definition 3. The grid-line segment going from the left end-point of the right flank to the right end-point of the left flank (dashed in Figure 4) is called the crossing edge. The part of the channel above it is called the top part, the one below it the bottom part. The extension of the right (left) leg upwards, until it hits the top (dotted in Figure 4), is called the right (left) edge; the portion of the top part to its right (left) is called the right (left) portion. The portion between the right and the left edges is called the central portion. A grid point residing on an edge and coinciding with a routing path is called a crossing point.
Definition 4. A grid-line segment enclosed by either the top part or the bottom part of the channel is called a track; the tracks' orientations (i.e. horizontal or vertical) are relative to the T-shape, not to its parts.
Figure 5: Routing the bottom part and propagating the resulting constraints to the crossing edge. (Legend: layer 1, layer 2, contact (via), terminal.)
down through the central portion. As for ordering on the left and right edges: signals corresponding to LR-pairs are put as low as possible. This leaves one free track either on the right or the left (depending on the direction used in the construction of the crossovers, as in Lemma 2), which is used by one (any) signal of the corresponding side; all other signals corresponding to L- and R-terms are put above. This strategy yields the situation in Figure 7(a), which is abstracted schematically in Figure 7(b).
Figure 7.
Thus, if we define XT (an indicator that is 1 exactly when an extra horizontal track is forced by the LR-pairs, and 0 otherwise), we have
Theorem 1. If nml = nmm = nmr = 0, then* the width of the top part of a T-shaped channel is WT = max(nl, nr) + XT + 1.
Proof outline: Conflicting LL-pairs and RR-pairs are ordered properly on the crossing edge. Tracks for signals of L-terms and R-terms are then assigned arbitrarily so as to form the situation in Figure 7(b). If XT = 0 then either there are no LR-pairs to worry about, or the extra horizontal track needed to accommodate the LR-pairs in the central portion is being used by a signal either on the right or on the left. Only if XT = 1 are we forced to use an extra horizontal track. ∎
Changing the wiring rules to Thompson's (in which two signals may share a common turning point) does not affect this result.
Notice that we have opted to resolve conflicts only in the top part of the channel. Some such conflicts, however, could have been resolved in the bottom part without loss of optimality there. This is not done since exploiting this possibility would complicate matters considerably. The additional
* We add 1 at the end because we measure width, which is 1 more than the number of tracks.
Figure 8: (a) All conflicting pairs involving M-terms are simple. (b) A more complicated case, in which M-terms are involved in conflicts with L- or R-terms of the far portions. Here n = 14 with nl = 5, nr = 5, and nm = 4. Using the definitions given in the text, we obtain XT = 0, φT = 1, and μT = 0 (since nl = nr). By Theorem 2, WT = 7.
complexity does not justify the effort, since the best it can help is by reducing WT by 1 for Theorem 1 (only if nl = nr, and then by resolving all conflicts) or 2 for Theorem 2 (by resolving most conflicts). Thus, the notion of "optimal" in this section (and in the subsequent one) should be viewed relative to this simplifying decision, but it is not far from being truly so.
The case in which M-terms are involved in conflicts is dealt with by extending the paradigm of Figure 7. First, ML- and MR-pairs whose M-terms are above the crossing points allocated for the corresponding L-term or R-term, respectively, can be accommodated in the appropriate ranges by simply forcing assignments of crossing points to the corresponding L- or R-term (e.g. D and F in Figure 8(a)). The number of ML- (MR-) pairs not handled in such a way is denoted by n'ml (n'mr), and consequently we define n'm = nmm + n'ml + n'mr.
Other conflicts of the above kind and MM-pairs are handled one at a time in the remaining tracks, making full use of track segments left free in the upper-middle part of the central portion. The block of LR-pairs is pushed all the way towards the more congested edge. A greedy approach in assigning tracks at this stage (on a pair-by-pair basis, putting the two crossing points as close to each other as possible) is good enough to attain a minimal WT; again, pairs share tracks as LR-pairs did. The only trouble is with M-terms which are too close to the right and left edges and are involved in conflicts with L-terms and R-terms, respectively, or appear in MM-pairs. Surprisingly, this might cost us at most 1 extra horizontal track:
We say that an M-term is m-adjacent to the right (left) edge if all (possibly 0) grid-points between it and the edge (along the top side) have M-terms located at them. Now we define μr = 1 if an M-term involved in a conflict with an L- or M-term is m-adjacent to the right edge and nr ≥ nl, and μr = 0 otherwise. μl is defined likewise by reversing the roles of left and right in the definition, and μT = max(μr, μl). Also, φT = 1 iff both the left edge and the right edge have m-adjacent M-terms involved in such conflicts and nr = nl.
Now we are ready to state
Theorem 2. The width of the top part of a T-shaped channel is given by
Figure 9.
Figure 10: An X-shaped channel with four arms and a central portion.
Figure 11: Topologically sorting a cycle that does not create a conflict.
Definition 5. Given a set A, a pairwise ordering W of A's elements is a binary, antisymmetric relation over A such that if (ai, aj) ∈ W then ai ≠ aj and neither ai nor aj appears in any other member of W.
The interpretation of (A, W) as a directed graph induces an undirected graph with bounded degree 1 (see Figure 9(a)). The reason this algebraic structure represents channel routing constraints is that exactly two signals are involved in each conflict, no signal is involved in more than one conflict, and the conflicts are directional in nature.
An X-shaped channel (Figure 10) is most naturally partitioned into 5 portions: 4 arms and 1 central portion. If we ignore, for the time being, terminals on the channel's ends, the constraints propagating from the arms inwards to the edges separating the arms from the central portion are simply pairwise orderings. Again, if we restrict ourselves to two-point nets and ignore signal nets connecting terminals in the same arm, the central portion is a rectangular channel with pairwise orderings on its 4 sides. For sake of simplicity, we deal here only with nets having points on opposite edges (but not adjacent ones).
Each edge of the central portion has a pairwise ordering associated with it. In routing this portion, we have to satisfy these constraints. This can be studied by looking at the structure obtained by taking the union of two pairwise orderings, W1 ∪ W2, defined on the same set of elements. This is exemplified in Figure 9(b) and (c). The graph interpretation of the resulting structure induces an undirected graph with bounded degree 2; thus it consists of isolated vertices, open paths and (even length) cycles. Open paths and cycles that are not directed cycles (i.e. there are at least two arcs going in opposite directions) can be arranged on a line such that all arrows go in one direction by topologically sorting the nodes (see Figure 11). Thus the union of two pairwise orderings corresponding to opposite edges of the central portion of an X-shaped channel can be arranged in such a way that signals in the central portion can go straight across unless there is a directed cycle.
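This structure is easy to exploit programmatically. Below is a Python sketch (our illustration, not part of the paper) that takes two pairwise orderings as arc lists and extracts the directed cycles, the only components that cannot be laid out on a line; it relies on the fact, implicit above, that consecutive arcs of a directed walk in W1 ∪ W2 must alternate between the two orderings, since no element appears in two arcs of the same ordering:

    def directed_cycles(w1, w2):
        out = [dict(w1), dict(w2)]        # successor map of each ordering
        seen, cycles = set(), []
        for start in out[0]:
            if start in seen:
                continue
            node, side, walk = start, 0, []
            while node in out[side] and node not in seen:
                seen.add(node)
                walk.append(node)
                node = out[side][node]
                side ^= 1                 # arcs must alternate between orderings
            if walk and node == start:
                cycles.append(walk)       # returned to the start: a directed cycle
        return cycles

    # One directed cycle a -> b -> c -> d -> a, alternating between w1 and w2:
    assert directed_cycles([('a','b'), ('c','d')], [('b','c'), ('d','a')]) \
           == [['a', 'b', 'c', 'd']]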
Directed cycles are the only interesting case to look at. Obviously, a cycle involving k nets can be routed rectilinearly using k+1 tracks in one direction and 1 in the other (Figure 12(a)). This turns
out to be optimal for one cycle, whether we are using our wiring model (of Section 2) or Thompson's.
However, sharing tracks between cycles going in perpendicular directions turns out to be beneficial.
Before we proceed, let us introduce some notation: A cycle whose points are on the horizontal (vertical) edges is called a vertical (horizontal) cycle, because of the directions of the signals. The number of vertical cycles is denoted by v, and of horizontal ones by h; also, c = h+v. Obviously, we need at least h horizontal tracks and v vertical tracks to route across the central portion of an X. Let Δv and Δh denote the number of extra vertical and horizontal tracks, respectively, needed to do the routing.
Theorem 3. For our wiring model, Δv + Δh = c + 1, whereas in Thompson's model Δv = Δh = 1. Moreover, in our model, any pair of values for Δv and Δh satisfying Δv + Δh = c + 1, Δv, Δh ≥ 1 can be attained.
Proof outline: The construction follows the paradigm of Figure 12(b) and (c). Both cases attain the claimed bound, and can be folded around ((b) horizontally and (c) vertically, as indicated by the arrows) to attain all interim values. The result for Thompson's model is achieved by merging corners. The optimality is proven by induction on c. ∎
Using this result, different optimality criteria can be employed to achieve desirable layouts.
Figure 12: Laying out vertical and horizontal cycles in an X-shaped channel. Here h = 2, v = 3. (a) One cycle needs an extra track in each dimension. (b) Δh = h + 1, Δv = v. (c) Δh = h, Δv = v + 1.
Figure 13: Routing across a rectangle with arbitrary terminal ordering (notice the terminals have no labels). (a) A simple case using only one horizontal track. (b) A more complicated case with EC = 3 (due to x).
We restrict ourselves to a channel, C, having terminals on two opposite sides only (as in [DKSSU81]). For sake of presentation, let us assume these are the horizontal sides (as in Figure 13). Let lT(x0) (similarly lB(x0)) be the number of terminals on the top (bottom) side of the channel lying to the left of x0 (i.e. whose x-coordinate is smaller than x0). Then we define the excess number at a point x = x0 to be*

    E(x0) = |lT(x0) - lB(x0)|

(we could have similarly done this with terminals lying to the right of x0). The excess number of the channel, EC, is defined as

    EC = max {E(x0) : xL ≤ x0 ≤ xR},

where xL and xR are the x-coordinates of the left and right ends, respectively, of the channel. Then we have
Theorem 4. The number of horizontal tracks needed to route C is exactly EC, and this is optimal.
Proof outline: This number of tracks can be attained by first routing aligned terminals straight across and then assigning horizontal jog tracks using the greedy algorithm mentioned above. Note that this algorithm does not cause two vertical tracks to overlap, as opposed to the case in [DKSSU81] (see Figure 13(b)). Also, no wire passes through more than two contacts. The lower bound is proven by drawing vertical line segments through the channel and showing that at least at some point as many as EC signals have to be routed from its left side to its right side, thus forcing us to use as many as EC tracks. ∎
The calculation of the excess number is linear once the terminals are sorted. The assignment of tracks, however, takes time O(n log n) (where n is the number of terminals) to allow for the maintenance of the priority queue holding the free tracks. So all in all we have an O(n log n) algorithm, but in case the terminals are presorted, evaluating the channel's width (without routing it) takes linear time.
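A sketch of the width evaluation may be useful. The following Python function (our illustration; it assumes the reconstructed pointwise definition E(x0) = |lT(x0) - lB(x0)| given above, and probes only the half-integer points prescribed by the footnote to that definition) computes EC from the terminal coordinates:

    import bisect

    def excess_number(tops, bottoms):
        tops, bottoms = sorted(tops), sorted(bottoms)
        e_c = 0
        for x in sorted(set(tops) | set(bottoms)):
            for probe in (x - 0.5, x + 0.5):   # just before and just after a terminal
                l_t = bisect.bisect_left(tops, probe)     # top terminals left of probe
                l_b = bisect.bisect_left(bottoms, probe)  # bottom terminals left of probe
                e_c = max(e_c, abs(l_t - l_b))
        return e_c

    # Top terminals at 1,2,3 facing bottom terminals at 4,5,6: all three
    # signals must jog rightward, so EC = 3 horizontal tracks are needed.
    assert excess_number([1, 2, 3], [4, 5, 6]) == 3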
This result can be generalized to dealing with disjoint sets of signals, where the order of terminals within each set is arbitrary, but we are not allowed to mix terminals from different sets. This is achieved by modifying the definition of the excess number to accommodate this constraint. Furthermore, the excess number has the same convexity property as the conflict number discussed in [DKSSU81]; thus it can be used to solve the offset problem in the same fashion as described there.
6. Conclusions
We have shown mat optimal routing for some configurations in rectilinear-polygonal channels
can be obtained efficicmly. Technically some surprisingly compact, yet simple, routing patterns were
discovered. AltllOUgh most seem to be ad hoc. they share a common flavor which is induced by
the pairwise ordering introduced to represent constraint propagation. The results obtained are truly
optimal. i.c. not just in order of complexity (as in [LeiSO]).
Surprisingly, this can be achieved by decomposing me polygons in a natural way, and solving
the parts almost independently while maintaining the constraints that must be shared in a simple
* x0 need not be integral; in fact, we have to look at points of the form l + 1/2 where l is an integer. Moreover, it is superfluous to look at integral points: it is enough to look at points right before (-1/2) and right after (+1/2) terminals, for the application to follow.
Figure 14: Possible routing configurations for rectangular parts of rectilinear channels. p.w.o. stands for pairwise ordering; a two-headed arrow indicates that routing occurs between the sides pointed at. (a) Center portion of a T-shaped channel as discussed in Section 3. (b) Center portion of a T-shaped channel with terminals on flanks. (c) Center portion of a general X-shaped channel.
Figure 15: (a) a T; (b) an X.
manner. This gives rise to a general methodology according to which the routing area of a chip will be divided into polygonal channels, which in turn will be subdivided into rectangular parts. The original channels will be used to form routing constraints in terms of orderings on the sides of these rectangles. The types of orderings on the sides of a rectangle and their interaction (in terms of common signal nets) induce a typing of rectangles. For example, the center portion of the T-shaped channel in Section 3 may be described as arbitrary-fixed-arbitrary-pairwise (going clockwise from the left edge), where nets are split between the first three sides and the last one (Figure 14(a)). Allowing terminals to reside also on the flanks yields a pairwise-fixed-pairwise-pairwise (with similar net splitting) description for the same portion (Figure 14(b)). An X-shaped channel in which two-point nets can be split in any way between two different sides yields a pairwise-pairwise-pairwise-pairwise description with the aforementioned net interaction (Figure 14(c)). Such types can be characterized in terms of the complexity of their optimal routing problem. Some, as we have seen, can be routed optimally efficiently, but other configurations may be intractable (e.g. instances of NP-complete problems); still, good heuristic solutions will be helpful.
For this method to be effective we may need to allow channels to overlap (relaxing the definition given in Section 2). The common areas will reflect constraints arising from more than one polygonal channel which have to be solved simultaneously; trying to find independent solutions and piecing them together is clearly a bad idea. Although this complicates matters slightly, the types of rectangles are essentially the same and the general methodology applies.
A further direction is to consider parameterized modules ([Goo81]) which can be integrated into the constraint propagation methodology to enhance the interrelation between placement and routing even further. Other interesting cases are skewed T's and X's (Figure 15(a) and (b), respectively) in which a side of an internal rectangle might be further subdivided. Solving the offset problem (as defined by [DKSSU81] in a limited context) for such channels is another extension.
Acknowledgments. I am grateful to Charles Leiserson for suggesting some problems that initiated this research, and would like to thank him and Ron Rivest for many helpful discussions.
References
[DKSSU81] Dolev, D., Karplus, K., Siegel, A., Strong, A. & Ullman, J.D.: Optimal Wiring Between Rectangles; Proceedings of the Thirteenth Annual ACM Symposium on Theory of Computing, May 1981, pp. 312-317.
[Goo81] Goodhue, E.: private communication (February 1981).
[HaSt71] Hashimoto, A. & Stevens, J.: Wire Routing by Optimizing Channel Assignment within Large Apertures; Proceedings of the Eighth Design Automation Workshop, IEEE, 1971, pp. 155-169.
[Hi74] Hightower, D.: The Interconnection Problem: A Tutorial; Computer, Vol. 7, No. 4 (April 1974), pp. 18-32.
[LaP80] LaPaugh, A.S.: A Polynomial-time Algorithm for Optimal Routing Around a Rectangle; Proceedings of the Twenty-first Annual IEEE Symposium on Foundations of Computer Science, October 1980, pp. 282-293; also in Algorithms for Integrated Circuit Layout: An Analytic Approach, MIT/LCS/TR-248 (Ph.D. dissertation), November 1980.
[Lei80] Leiserson, C.E.: Area-Efficient Graph Layouts (for VLSI); Proceedings of the Twenty-first Annual IEEE Symposium on Foundations of Computer Science, October 1980, pp. 270-281.
[MC80] Mead, C. & Conway, L.: Introduction to VLSI Systems; Addison-Wesley, Reading, Mass., 1980.
[Pr79] Preas, B.T.: Placement and Routing Algorithms for Hierarchical Integrated Circuit Layout; Computer Systems Laboratory Technical Report No. 180/SEL-79-032 (Ph.D. dissertation), Stanford University, August 1979.
[Ri81] Rivest, R.L.: The "PI" (Placement and Interconnect) System - Progress Report; unpublished manuscript, M.I.T., May 1981.
[Th80] Thompson, C.D.: A Complexity Theory for VLSI; Technical Report CMU-CS-80-140 (Ph.D. dissertation), Carnegie-Mellon University, August 1980.
New Lower Bounds for Channel Width
Donna J. Brown Ronald L. Rivest
University of Illinois Massachusetts Institute of Technology
Coordinated Science Laboratory Laboratory for Computer Science
Urbana, Illinois 61801 Cambridge, Massachusetts 02139
ABSTRACT
We present here a simple yet effective technique for calculating a lower bound on the number of tracks required to solve a given channel-routing problem. The bound applies to the wiring model where horizontal wires run on one layer and vertical wires run on another layer. One of the major results is that at least √(2n) tracks are necessary for any dense channel routing problem with n two-terminal nets that begin and end in different columns. For example, if each net i begins in column i and ends in column i+1, at least √(2n) tracks are required, even though the channel "density" is only 2. This is the first technique which can give results significantly better than the naive channel density arguments. A modification results in the calculation of an improved bound, which we conjecture to be optimal to within a constant factor.
I. INTRODUCTION
The "channel-routing" problem has recently attracted a great
amount of interest and is becoming increasingly important with the
advent of VLSI. The results of this paper are of both practical and
theoretical interest. On the practical side, the techniques allow a
channel-routing algorithm to estimate more accurately a bound on the
number of tracks required to solve a given problem, and thus to know
when to stop looking for an impossibly good solution. From a theoreti-
cal point of view, this paper makes two points. The first is that
channel "density" is not the only factor determining the limits of
channel-routing performance in this wiring model; we must also consider
how many nets must "switch columns" in order to be routed. The second
point is closely related: the "traditional" wiring model - which we
study here - seems to be in some significant sense provably worse than
related wiring models where nets can overlap slightly (say at corners).
In these models twice channel density is provably an upper bound on the
number of tracks required [RBM81].
Related work has been done by, among others [HS71], [D76], [T80],
and [DKSSU81].
II. DEFINITIONS AND THE WIRING MODEL
The (infinite) channel of width t consists of (1) the set V of grid points (x,y) such that x and y are integers and 0 ≤ y ≤ t+1, -∞ < x < ∞, and (2) the set E of edges connecting points (x,y) and (x',y') whenever these points are at distance 1 from each other and y and y' are not both equal to 0 or t+1. Figure 1 shows a channel of width 4. If the width of the channel is t, we say that the channel has t tracks; track i (for 1 ≤ i ≤ t) consists of all grid points with y=i and the
Let m denote the number of nets which must be "moved" (i.e. which must switch columns because pi ≠ qi). The structure of our argument is a track-by-track analysis of how many wires can be moved into their final columns on each track. Consider the first track (i.e. y=1). If below track 1 (i.e. connecting track 0 to track 1) we have m0 = m nets which must be moved, after track 1 (i.e. between tracks 1 and 2) we will have a number m1 of nets to be moved, where m1 ≤ m0. We continue in this manner for each track; when mi = 0 we are done (with t = i).
How many nets can be moved into their target columns in one track? The fundamental but simple observation is that if net i moves from its current column to its target column qi on the track, then column qi must have been empty (i.e. there were no wires in column qi between this track and the previous one). Let ei denote the number of empty columns between tracks i and i+1 in our window. Then clearly m(i+1) ≥ mi - ei, or

    m ≤ Σ(i=0 to t-1) ei.    (*)
The only way to change ei from one track to the next is to route wires from a column inside the window to a column outside the window (which increases ei by one) or vice versa (which decreases ei by one). We also observe that ei - 2 ≤ e(i+1) ≤ ei + 2, since at most two wires can cross the window boundary on any track.
Our initial conditions are e0 = et = w - n (w is the width of the window, n the number of nets), and we have the inequality
Clearly D/2 tracks are required for the D "departing" nets. But

    Σ(i=0 to D/2 - 1) (et + 2i)

"inside" nets might also be routed. Similarly, the A "arriving" nets require at least A/2 tracks, which could also be used to route

    Σ(i=0 to A/2 - 1) (e0 + 2i)

"inside" nets. This leaves max[0, I'] , where

    I' = I - Σ(i=0 to D/2 - 1) (et + 2i) - Σ(i=0 to A/2 - 1) (e0 + 2i),

more "inside" nets to be routed. Bound (*), previously established, gives a minimum number of additional tracks required to route these.
Recalling that T nets pass completely through the window, we obtain

    t ≥ T + D/2 + A/2 + max{0, -(et + D) + √((et + D)² + 2I')}.

This formula is illustrated by the example in Figure 5 (where only the left half has been drawn; the right half is the mirror image of the left). For this problem, T = 0, D = 2, A = 6, I = 42, e0 = 0, et = 4, and the above formula gives

    t ≥ 0 + 1 + 3 + max[0, -6 + √(36 + 64)] = 8.
This minimal number of tracks is in fact achieved by the routing shown.
The above formula can, of course, be extended to subwindows where AL ≠ AR and DL ≠ DR. In addition, small improvements can easily be made by considering relative positions of, say, the D nets and the et empty columns.
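To make the arithmetic concrete, here is a Python evaluation of the bound (our illustration; D and A are taken to be even, as in the formula above, and the clamp of I' at zero is a guard we added):

    from math import ceil, sqrt

    def window_bound(T, D, A, I, e0, et):
        # inside nets that can ride along while departing/arriving nets are routed
        routed_with_D = sum(et + 2 * i for i in range(D // 2))
        routed_with_A = sum(e0 + 2 * i for i in range(A // 2))
        I_prime = max(I - routed_with_D - routed_with_A, 0)
        extra = max(0.0, -(et + D) + sqrt((et + D) ** 2 + 2 * I_prime))
        return T + D // 2 + A // 2 + ceil(extra)

    # The example of Figure 5: T=0, D=2, A=6, I=42, e0=0, et=4 gives t >= 8.
    assert window_bound(0, 2, 6, 42, 0, 4) == 8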
Finally, it should be noted that channel density is, in fact, a
subcase of what we are here computing. If the (maximum) density d is
in column i, then the subwindow of size one which includes i will
require at least d tracks. So the maximum over all subwindows can be
no less than d.
V. CONCLUSIONS
We have presented a new simple but powerful technique for deriving a lower bound on the number of tracks required to solve a traditional channel-routing problem for two-terminal nets. We have as yet found no example for which this bound is more than a constant factor from optimal.
ACKNOWLEDGEMENTS
This research was supported by NSF grants MCS80-08854,
IST80-12240, MCS78-05849, and by DARPA grant N00014-80-C-0622.
REFERENCES
[D76] Deutsch, D., "A Dogleg Channel Router," Proceedings of the 13th Design Automation Conference (IEEE 1976), 425-433.
Figure 5. Illustration for improved lower bound.
Compact Layouts of Banyan/FFT Networks
David S. Wise
Indiana University
Computer Science Department
Bloomington, Indiana 47405
I. INTRODUCTION
This paper offers two results that can both be described as pictures. They are Figures 1 and 3. The perceptive reader may stop here, since the remainder of this paper only describes them.
II. NOTATION
The logarithm-base-2 of x is written lg x. For real functions f and g, we write f(x) = O(g(x)) if there is a constant k and some value x0 such that f(x) ≤ k·g(x) for all x > x0. This notation expresses a proportional, asymptotic upper bound of g for f. If f(x) = O(g(x)) and g(x) = O(f(x)) then we write f(x) = Θ(g(x)). It is only necessary to express g as a single term (e.g. lg x, 2^x, x²) in such contexts.
The abbreviations FFT and VLSI refer to the Fast Fourier Transform and Very Large Scale Integration circuit technology, respectively.
III. AN APPLICATION
This work is immediately motivated by the need for a
switching network between processors and memory in a multi-
processor system of 100 or 1000 processors [4]. In order
to increase bandwidth to memory, reducing contention among
the processors, a banked memory is envisioned. Its access
is through a fast, parallel switching network.
A sui table model for such a network is a banyan
network [7] whose elementary functional unit is a 2x2
crossbar switch. It may be perceived as a QY1~L [2], a
store-and-forward unit with two input lines and two output
lines. Figure 1 might be interpreted as suoh a network
from 2n = 16 processor to 2n memories.
Memory fetch and store instructions are transmitted as packets through the network; duplicate networks pass information in the reverse direction. (Say, honoring a fetch instruction or allocating free nodes from a heap [5].) Each packet initially contains a binary (destination) memory-address followed by a message. Upon arrival at each router, its high-order address bit determines along which path it is to be forwarded. The entire message is shifted left one bit, displacing that address bit for transmission to the next stage. In Figure 1, a zero bit would send the modified packet from a router on a leftward (northwest) line; a one bit would send the remainder of the packet to the right (northeast). It is possible to insert into the vacated low-order bit a value identifying the input line by which a packet entered each router. After a packet has passed through the network to its destination, its destination address will have been shifted out. In its stead (at the end of the packet) would be the address of its source processor when the vacant bits are so used.
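The address-shifting discipline is compact enough to capture in a few lines. The following Python sketch (our illustration, not from the paper) traces one packet through the stages of such a network, consuming the high-order destination bit at each router and shifting in the bit that records the input line used:

    def route_packet(dest, input_bits, stages):
        # dest: destination address of `stages` bits; input_bits: the input
        # line (0 or 1) by which the packet enters each successive router
        path, addr = [], dest
        for came_in_on in input_bits:
            high = (addr >> (stages - 1)) & 1          # high-order address bit
            path.append('left' if high == 0 else 'right')
            addr = ((addr << 1) & ((1 << stages) - 1)) | came_in_on
        return path, addr      # addr now holds the source identification

    # A 16-memory example (4 stages): destination 0b1010, entering on input
    # lines 1,1,0,1, exits carrying source address 0b1101.
    path, src = route_packet(0b1010, [1, 1, 0, 1], 4)
    assert path == ['right', 'left', 'right', 'left'] and src == 0b1101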
This describes a network which uses a crossover pattern to allow as many as n messages to arrive at some of 2n different destinations simultaneously, each over a path of at most lg n routers. As many as n·lg n messages might already be in the switch, pipelined behind the arriving wave.
Michael J. Foster
Carnegie-Mellon University
Computer Science Department
Pittsburgh, Pennsylvania 15213
Abstract
This paper introduces a new technique, called syntax-directed verification, for proving properties
of circuits composed of standard cells. The lengths of proofs using this technique are
independent of the size of the circuits, but depend only on the number of standard cell types
and the complexity of the rules for interconnecting them. Syntax-directed verification is thus
well-suited to VLSI, in which large circuits are built using relatively few types of cells. The
paper describes the syntax-directed verification method, and presents an example of its use.
Introduction
Many current VLSI designs are composed of standard cells, which themselves perform simple
functions but are wired together to perform more complex functions. Often it is not obvious
that the function performed by the cell combination is the one specified, even if we assume that
the cells themselves are correct. Examples of complex circuits formed from simple cells are
Leiserson's systolic priority queue [Leis79], the pattern matcher of Foster and Kung
[FostKung80], and the programmable recognizer array (PRA) [FostKung81]. In all of these
examples, correctness of the circuits has been demonstrated by tracing the action of the circuit
rather than by formal proof.
This paper suggests a syntax-directed technique for verifying circuits composed from standard
cells. This technique allows proofs of correctness to be developed in a mechanical way. It
relies on the use of a context-free grammar to specify both the function and structure of the
legal combinations of cells. The terminal characters in the grammar correspond to the primitive
cells, and non-terminals correspond to combinations of cells. The start symbol corresponds to
the class of circuits whose correctness is to be verified. By proving a single theorem for each
production in the grammar, the correctness of any circuit constructed according to the grammar
may be verified.
This research was supported in part by the Office of Naval Research under Contract N00014-80-C-0236, NR 048-659, by the Defense Advanced Research Projects Agency under Contract F33615-78-C-1551, by the National Science Foundation, and by the Fannie and John Hertz Foundation.
To construct this kind of theorem, we must give the specifications of the cells, tell what
combinations of cells are legal, and give a rule for determining the specifications of any legal
cell combination from its form. In this paper, we assume that the specifications of the cells are
primitive, along with the cell designs. The legal combinations of cells will be precisely those
circuits that are generated by the context-free grammar, and the specification of a circuit will
depend upon its derivation in the grammar.
As with program correctness, proof of circuit correctness proceeds in two steps: development of
verification conditions, followed by their proof. To develop the verification conditions we make
use of syntactic assertions on the values and timings of signals at the ports of each primitive cell
and compound circuit. These assertions correspond to the inductive assertions [Floyd67] of
program verification. The verification conditions are theorems relating the syntactic assertions.
One syntactic assertion is required for each symbol of the grammar. Terminal symbols of the
grammar correspond to primitive cells, and the assertions for these symbols are simply the
primitive cell specifications. Assertions for the non-terminals are specifications of the various
compositions of primitive cells. The assertion for the start symbol is thus the specification for a
complete circuit constructed using the grammar.
Once we have the syntactic assertions we can develop the verification conditions. Each
production of the grammar corresponds to one verification condition, stating that the syntactic
assertions of the symbols on the right side of the production imply the assertion on the left.
Proof of these theorems completes the verification of the circuit family.
An Example
As an example of this technique let us verify that the recognizers described in [FostKung81]
actually recognize the regular languages they are supposed to. Three kinds of primitive cells are
used to build these recognizers, corresponding to types of characters in the regular expression;
one cell type is a comparator for single characters, while the other two types correspond to the
union (+) and Kleene star (*) operators. The three cell types, together with symbols used in
drawing large recognizers, are shown in Figures 1, 2, and 3. Note that each cell type has left and right ports, which are used for cells concatenated to its left and right. In addition, the + and * cells have upper and lower ports for connecting circuits for their operands.
[Figures 1, 2, and 3: the three primitive cell types and the symbols used in drawing large recognizers. Each cell carries RES, CHR, and ENB streams at its ports. Figure 2: OR-Node Cell.]
Each of the primitive cells has one or more data paths passing through it, with three data streams on each data path. The CHR and ENB streams, which flow from right to left, carry the text characters and enable bits. The RES stream, which flows from left to right, carries the result bit. We hook these cells together to form recognizers by connecting the data path at the right side of one cell to a data path at the left side of another cell.
Circuits formed from these cells will take the form of ternary trees. All communication with a recognizer tree takes place at the root, through the right port of the rightmost cell. Any cell with nothing concatenated to its left must have a terminating loop connected to its left port. This is simply a wire running from the ENB output to the RES input, and ensures that RES is always equal to ENB.
A recognizer communicates with its environment through three data streams: the text characters (CHR), the enable bit (ENB), and the result bit (RES). On even clock phases the recognizer inputs CHR, while on odd phases it inputs ENB and outputs RES. If a string in the language generated by the regular expression is input, preceded by a 1 on ENB, the recognizer outputs a 1 on RES immediately after the last character of the string; otherwise it outputs a 0 on RES. Note that a single character in the input stream may be a member of several recognized strings.
We have now described the legal combinations of cells, and have claimed that a legal
combination of cells should recognize its corresponding regular expression (by setting RES to 1
at the right times). Verification of the circuit function will consist of a proof of this claim.
Before proceeding with the proof we must make this circuit specification more precise, as well
as supply specifications for the individual cells. These specifications of circuit function will be
the syntactic assertions used in the proof of correctness. They will consist of predicates on the
sequence of values at each port of a single cell (comparator, + node, or * node) or larger
module (P or R).
The above description of the operation of a recognizer circuit is an informal statement of the assertion for R. By introducing suitable notation for the signals on the ports of a recognizer, we can translate this informal description into a concise predicate. Let $E_t$, $C_t$, and $R_t$ stand for the ENB, CHR, and RES signals at beat t on the right port of a recognizer, and let $E'_t$, $C'_t$, and $R'_t$ stand for the signals on the left port of a primitive recognizer. The assertion for a recognizer for regular expression X is then the predicate $P_R(X)$:

$(\forall t)\; [(\exists n)\; E_{t-2n+1} \wedge ((C_{t-2(n-1)} \cdots C_{t-2} C_t) \text{ in } X)] \leftrightarrow R_{t+1}$

where the expression "$(C_{t-2(n-1)} \cdots C_{t-2} C_t)$ in X" means that the string $C_{t-2(n-1)} \cdots C_{t-2} C_t$ is in the language generated by X.
A primitive expression recognizer (P) has an associated pipe length $\delta$. It sends CHR and ENB from right to left with delay $\delta$, so that data entering the right port at time t leaves the left port at time $t + \delta$. Furthermore, if RES is input from the left $\delta - 1$ beats before the start of a string in the language generated by the primitive expression, then 1 will be output on RES from the right immediately after the last character of the string. Otherwise, a 0 is output on RES on every beat. The assertion $P_P(X)$ for a primitive recognizer of pipe length $\delta$ for expression X is thus the conjunction of:

$(\forall t)\; E'_{t+\delta} \leftrightarrow E_t$
$(\forall t)\; C'_{t+\delta} \leftrightarrow C_t$
$(\forall t)\; [(\exists n)\; R'_{t+\delta-2n+1} \wedge ((C_{t-2(n-1)} \cdots C_{t-2} C_t) \text{ in } X)] \leftrightarrow R_{t+1}$
Our primitive specifications of the cells assert that they function according to the circuit diagrams. Thus a character recognizer for the character "x" obeys the predicate $P_x$:

$(\forall t)\; E'_{t+1} \leftrightarrow E_t$
$(\forall t)\; C'_{t+1} \leftrightarrow C_t$
$(\forall t)\; [R'_t \wedge (C_t = x)] \leftrightarrow R_{t+1}$
For the OR node we denote the upper and lower ports by the superscripts u and l, so that $C^u$ is the character output of the upper port. The predicate $P_{OR}$ is then the conjunction of:

$(\forall t)\; R_{t+1} \leftrightarrow R^l_t \vee R^u_t$
$(\forall t)\; C^l_t \leftrightarrow C^u_t \leftrightarrow C_t$
$(\forall t)\; C'_t \leftrightarrow C_t$
$(\forall t)\; E^l_t \leftrightarrow E^u_t \leftrightarrow R'_t$
$(\forall t)\; E'_t \leftrightarrow E_t$
Using the same convention for the upper port, the predicate P* for the Kleene * is:
Having stated the assertions that apply to each symbol of the grammar, we are ready to state
and prove the verification conditions. The verification condition for each production is a
theorem stating that the assertions for the symbols on the right side imply the assertion for the
left symbol. Each of these theorems assumes of course that the semantic rule associated with
the production is applied. The theorems show, taken together, that a circuit constructed using the semantic rules above will recognize the regular expression that drove the productions. There
are five verification conditions, one for each production in the grammar; we shall state and
prove only one of them here.
To prove the verification condition corresponding to the production R -> P, we must show that if a termination loop is added to a circuit satisfying $P_P(X)$ then the resulting circuit satisfies $P_R(X)$. A circuit satisfying $P_P(X)$, with an added termination loop, satisfies the conjunction of:

$(\forall t)\; E'_{t+\delta} \leftrightarrow E_t$
$(\forall t)\; C'_{t+\delta} \leftrightarrow C_t$
$(\forall t)\; [(\exists n)\; R'_{t+\delta-2n+1} \wedge ((C_{t-2(n-1)} \cdots C_{t-2} C_t) \text{ in } X)] \leftrightarrow R_{t+1}$
$(\forall t)\; R'_t \leftrightarrow E'_t$

The last equivalence here comes from the termination loop. By substitution in the third conjunct, first $E'_{t+\delta-2n+1}$ for $R'_{t+\delta-2n+1}$, then $E_{t-2n+1}$ for $E'_{t+\delta-2n+1}$, we obtain the predicate:

$(\forall t)\; [(\exists n)\; E_{t-2n+1} \wedge ((C_{t-2(n-1)} \cdots C_{t-2} C_t) \text{ in } X)] \leftrightarrow R_{t+1}$

This is precisely $P_R(X)$, so we have proven the theorem corresponding to the production R -> P. The verification conditions for the other four productions are similar in statement and proof.
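As a sanity check of this verification condition, the following Python fragment (ours, not part of the paper's formal system) brute-forces the R -> P case for a single character cell of pipe length $\delta = 1$ recognizing "x", with the termination loop $R'_t = E'_t$ wired in; for the one-character language, $P_R(x)$ reduces to $R_{t+1}$ iff $E_{t-1}$ and $C_t = x$.

    import random

    T = 40
    E = [random.randint(0, 1) for _ in range(T)]   # enable bits on the right port
    C = [random.choice("xy") for _ in range(T)]    # character stream on the right port

    Ep = [0] + E[:-1]                              # primitive spec: E'_{t+1} <-> E_t
    Rp = Ep                                        # termination loop: R'_t <-> E'_t
    R = [0] + [int(Rp[t] and C[t] == "x") for t in range(T - 1)]  # R_{t+1} <-> R'_t and (C_t = x)

    # P_R(x): R_{t+1} iff E_{t-1} and C_t = x
    assert all(R[t + 1] == int(E[t - 1] and C[t] == "x") for t in range(1, T - 1))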
Conclusions
This paper has introduced a method for verifying properties of circuits composed from standard
cells using a context-free grammar. Proof of a small number of theorems using this method can
verify the correctness of any circuit built from the standard cell system. It should be mentioned
that this technique is not a panacea, since we may wish to combine circuits in ways that cannot
be described by context-free grammars. For example, no context-free grammar describes the set
of rectangular arrays, since the language $\{a^c b a^c b \cdots b a^c : c \text{ an integer}\}$ is not context-free.
A wide class of circuits can be composed from standard cells using context-free grammars,
however. For this class of circuits, correctness proofs can be developed at the same time as the
design of the cells and grammar, to provide assurance that large circuits will meet their
specifications. Properties of circuits other than correctness of function, such as timing
properties, are also verifiable by this method. Because of the usefulness of this technique in verifying large circuits, it is worthwhile to try to specify interconnections of standard cells using a context-free grammar. Designers of standard cell systems should apply syntax-directed verification techniques, to help ensure that circuits built with their systems will work as expected.
References
[Floyd67] R. W. Floyd, "Assigning Meanings to Programs", Proc. Amer. Math Soc. Symp. in
Applied Mathematics 19 (1967), pp. 19-31.
[Leis79] C. E. Leiserson, "Systolic Priority Queues," Proc. Caltech Conf. on Very Large Scale Integration, California Institute of Technology, Pasadena, Calif., Jan. 1979, pp. 199-214.
Temporal Specifications of Self-Timed Systems
Yonatan Malachi and Susan S. Owicki
Abstract
Self-timed logic provides a method for managing the complexity of asynchro-
nous module connections; the correctness of a properly constructed self-timed
system is independent of the speed of its components. In this paper we present a
means of formally specifying self-timed systems and modules using temporal logic,
an extension of ordinary logic to include an abstract notion of time. We show by example that temporal logic can describe Seitz's self-timed modules, giving detailed specifications for combinatorial logic, and sketching the treatment of wires, align elements, feedback registers, pipelines and finite state machines. Temporal logic
has an expressive power that makes it well suited to this task; it also provides a
framework for proofs of the properties of self-timed systems.
Introduction
Self-timed logic is a method for managing the complexity of asynchronous connections between system components. Its basis is a signal-acknowledgment
protocol that guarantees that a module remains inactive until its input is available,
and that the input then remains available as long as it is needed. The cycle of
signals and acknowledgments plays a role much like that of a two-phase clock.
However, self-timed logic has the advantage that the correct behavior of a system
does not depend on the speeds of its components or the length of communication
delays.
Seitz [S1, S3] has developed a number of conventions for defining and compos-
ing self-timed modules. One of his basic building blocks is the Combinatorial Logic
(CL) module ([S1]). This class of modules is specified by a set of constraints, called
the weak conditions, that describe permissible timing orders for changes of values
on input and output lines. The simplest CL modules combine Boolean logic with
appropriate timing signals. CL modules can be composed, and the result will itself
be a CL module, as long as a simple set of combining rules are obeyed.
Work supported in part by the Defense Advanced Research Projects Agency under contract No. MDA903-79-C-0680.
Seitz defined the weak conditions using timing inequalities, where $p \le q$ means that q occurs after p or simultaneously with p. This notation is adequate for
expressing timing pre-requisites, but it cannot be used to express requirements
that certain events must eventually occur. For example, one usually expects that
a module whose input is available will eventually produce output. We will call
the first kind of properties safety properties, and the second liveness properties.
Both safety and liveness properties, as well as more complex timing requirements,
are naturally expressible in Temporal Logic, an extension of ordinary logic that
includes an abstract notion of time. In addition to its expressive power, temporal
logic provides a rigorous basis for proving the soundness of composition rules and
for verifying properties of compound modules; one shows that the specifications of
the component imply the specifications of the compound module, using the axioms
of temporal logic.
In this paper we first review the weak conditions and temporal logic, and
then show how temporal logic can be used to specify formally the timing behavior
of Seitz's classes of self-timed modules. The specifications for CL modules are
presented in some detail. Other classes of modules - the align element, feedback
register, pipeline, and finite-state machine - are discussed more briefly. We con-
clude with an assessment of the use of temporal logic in specifying self-timed logic.
If I denotes a set of input lines and O a set of output lines, we can then express facts like "all inputs are undefined" as $\neg d(I)$, "all outputs are defined" as $D(O)$, or "not all outputs are defined" (equivalently, "some output is undefined") as $\neg D(O)$. We have no reason to deal with empty sets of signals, and therefore in the formal treatment of the problem we assume that $D(X) \supset d(X)$.
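A small sketch (ours, not from the paper) of the two predicates, modeling an undefined line as None:

    def d(X):
        # some line in X is defined
        return any(v is not None for v in X)

    def D(X):
        # all lines in X are defined
        return all(v is not None for v in X)

    I = [1, None, 0]
    print(d(I), D(I))    # True False: some, but not all, inputs are defined

For any non-empty X, D(X) implies d(X), matching the assumption above.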
We now present the inequalities for the weak conditions using this notation.
Those inequalities specify the relation between I, the set of input signals of a
module, and 0, the set of output signals of the same module. Later, when we want
to talk about several modules simultaneously, we will introduce subscripts on the
signal set names.
2 Temporal logic
Temporal logic is a formal system that provides operators for reasoning about
transitions in time. Variants of this logic are used for verifying programs and
processes, and have been found especially useful for parallel programs ([MP], [OL], [Ma], [Ha], [HO]). Our version of temporal logic is based on an axiomatization similar to the one in [Ma]; its framework and semantics are fully described in [MP].
The basic modal operators are $\Diamond$ (read diamond), $\Box$ (read box), $\bigcirc$ (read circle), and U, where $\Diamond p$, $\Box p$, $\bigcirc p$, and $p\,U\,q$ denote "eventually p", "henceforth p", "next p" and "p until q" respectively. We assume that time is linear, i.e., each time
instant has exactly one successor, and reflexive, i.e., the present is part of the
future. All the operators state facts about the future, and their informal semantics
are
The until operator in our variant of the system differs from the one described in [MP], and it can be called weak until in the sense that it does not imply the eventual occurrence of its second argument. We use this until because we have found it more suitable for expressing the properties of self-timed systems.

As an example of a temporal formula, one of the axioms of our system is

$\Box p \supset p \wedge \bigcirc p \wedge \bigcirc\Box p$,

stating that henceforth p implies the truth of p at the present, at the next moment, and in all the future of the next moment.
The axioms that characterize our until operator and make our system different from the one in [Ma] are

$p\,U\,q \equiv q \vee (p \wedge \bigcirc(p\,U\,q))$
$\Box p \supset p\,U\,q$

The first states that p U q is true iff q is true now or p is true now and p U q is true in the next state. The second states that if p is always true then so is p U q. The first of the two axioms is valid for both the weak and strong until while the latter characterizes the weak until.
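The operators are easy to evaluate over a finite linear trace, which may help fix intuitions. The following Python sketch (ours; finite traces only, with the end of the trace treated as stuttering for the next operator) implements eventually, henceforth, next, and the weak until just axiomatized:

    def eventually(p, trace):                 # <>p
        return any(p(s) for s in trace)

    def henceforth(p, trace):                 # []p
        return all(p(s) for s in trace)

    def nxt(p, trace):                        # ()p
        return p(trace[1]) if len(trace) > 1 else p(trace[0])

    def weak_until(p, q, trace):              # p U q (weak)
        for s in trace:
            if q(s):
                return True
            if not p(s):
                return False
        return True                           # p held forever, which the weak until allows

    trace = [{"req"}, {"req"}, {"ack"}]
    assert weak_until(lambda s: "req" in s, lambda s: "ack" in s, trace)
    assert eventually(lambda s: "ack" in s, trace)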
3 Derived operators
The specification of the behavior of self-timed systems can be more concise if
we use derived operators whose intuitive meaning is close to the conditions that
we want to specify. The while operator, p W q states that q is true as long as p is
true and can be defined formally as,
Another useful operator is the precedes operator, p P q ([MP]), stating that p must precede q if q is ever going to happen. The leads-to operator, $p \leadsto q$, defined as $\Box(p \supset \Diamond q)$, states that the first operand implies the eventual occurrence of the second. Note that since we use reflexive time, $p \supset q$ implies $p \leadsto q$.
assertion w is always true during its operation." If $\varphi$ is an assertion describing the legitimate initial state, this is expressed by $\varphi \models w$, borrowing notation (used somewhat differently) from Manna and Pnueli [MP] to avoid repeatedly stating the initial conditions. In a CL, $\varphi \equiv \neg d(I) \wedge \neg d(O)$.
[Figure 1: the composite system M built from modules $M_L$ and $M_R$; the connecting signal lines are numbered 1 through 5.]
We now show how the interconnection rules can be specified in our notation. It suffices to specify the interconnection rules for two modules; all other configurations can be built incrementally by adding one module at a time. The rules for two modules are captured by the configuration in figure 1, where I and O are the input and output of the composite system M, while $I_L$, $O_L$, $I_R$, $O_R$ are those of $M_L$ (L for left) and $M_R$ (R for right), respectively. Every line in the diagram represents a set of signal lines. Lines 1 and 5 must both represent non-empty sets; lines 2, 3, and 4 may be empty, but none of I, $I_L$, $I_R$, O, $O_L$, $O_R$ may be empty. Because our assumption is that a line in the diagram represents one point in the circuit (see also section 5.1) we do not need and do not allow lines connecting the input directly to the output. All this can be formalized in the following axioms, labeled according to the lines in the diagram.
C1D. $\vdash D(I) \supset D(I_L)$
C1d. $\vdash d(I_L) \supset d(I)$
C12D. $\vdash D(I_L) \wedge D(I_R) \supset D(I)$
C12d. $\vdash d(I) \supset d(I_L) \vee d(I_R)$
C23D. $\vdash D(I) \wedge D(O_L) \supset D(I_R)$
C23d. $\vdash d(I_R) \supset d(I) \vee d(O_L)$
C34D. $\vdash D(O) \wedge D(I_R) \supset D(O_L)$
C34d. $\vdash d(O_L) \supset d(I_R) \vee d(O)$
C45D. $\vdash D(O_L) \wedge D(O_R) \supset D(O)$
C45d. $\vdash d(O) \supset d(O_L) \vee d(O_R)$
C5D. $\vdash D(O) \supset D(O_R)$
C5d. $\vdash d(O_R) \supset d(O)$
If the interconnection rules are obeyed and $M_L$ and $M_R$ are both CL's then the composite system M will also be a CL. (Note that this fact must be proved using the axioms of temporal logic. We do not supply such a proof in this paper.)
5.1 Wires
In self-timed systems the delay caused by "wires" can be modelled by treating
wires as CL modules. It is easy to prove that a delay line (when we look at it as a
one directional information pipe) satisfies the CL conditions. The CL composition
rule allows us to combine a line and a CL into a single CL. Thus it is always
possible to incorporate the delay between two modules into one of them (or divide
it arbitrarily between them) without changing the logic of the system.
It is clear that SA1 and SA2 imply the axioms that they replace, and therefore any Align is also a CL. Any combination of Align and CL elements that obeys the interconnection rules above is also a CL. Some interconnections of CL and Align modules are Align modules as well.
[Figure 2: an Align element built from a CL module and a C element, with input/acknowledge lines I, $A_I$ and O, $A_O$.]
Essentially these are two 4-phase Muller cycles ([S2]), one in the handshaking of I and $A_I$ and the other between O and $A_O$. Each of these two cycles is similar to the "requesters and granter" example described in [MP].
The C element in the implementation proposed in figure 2 is a conversion
element whose output is defined when the input is undefined and vice versa.
The price that we pay for having an asynchronous pipeline is that each module has to wait for the acknowledgment from the next one before it can issue its own acknowledgment. This price is not too heavy in two important cases:
a. When the computation time within each module is much larger than the
communication time, and all the modules have similar computation time. For
example, this is the case when each module performs many iterations of some simple computation before passing the result and getting the next chunk of data. In general we can move toward this behavior by increasing the amount of buffering within each module.
b. When the variability of computation time is large and depends on the actual
data. In such a case a synchronous system will have to operate with a clock
rate corresponding to the worst case; the asynchronous computation will be
much faster on the average.
Note that we could have given weaker restrictions that allow a PL to acknowledge its input before its output is acknowledged. This would allow greater flexibility; however such a PL would not in general be a CL.
[Figure: a feedback register (FB) built from Align and CL elements.]
6 Conclusions
We have found that it is possible to specify the timing behavior of a variety
of self- timed modules. In fact, since these modules all depend on the same sort of
two-phase signalling conventions, it is not hard to deduce any of the specifications
from those of the OL. Note that in the process of formalizing the specifications of
OL we added restrictions that were not part of the original weak conditions such
as that individual lines are wcll behaved, The need for this restrictions becomes
evident when trying to build sound formal specifications,
The advantages of a formal notation like temporal logic for expressing these
timing specifications are twofold.
1. Temporal logic's expressive power allows us to state various kinds of specifications explicitly and unambiguously. It is more precise than informal specification methods, and seems to be able to express a wider range of important properties than other formal methods.
Acknowledgments
This paper builds on the work of Chuck Seitz on self-timed systems and Zohar
Manna on temporal logic. We are indebted to both of them for valuable discussions
that provided the context for our approach. Gregor v. Bochmann ([Bo]) developed an earlier method of using temporal logic to describe self-timed systems, and
conversations with him contributed to our understanding of the problem. Pierre
Wolper and Amy Lansky provided helpful comments on various drafts of this paper.
References
[Bo] G.v. Bochmann, "Hardware specification with temporal logic: An example,"
Technical Note, Computer Systems Laboratory, Stanford University, June
1980.
[GPSS] D. Gabbay, A. Pnueli, S. Shelah, and J. Stavi, "The temporal analysis of fairness," Proc. of the 7th Symposium on Principles of Programming Languages, Las Vegas, NV (January 1980), pp. 163-173.
[Ha] B.T. Hailpern, "Verification of concurrent processes using temporal logic,"
technical report No. 195, Computer System Laboratory, Stanford Univer-
sity, August 1980.
[HO] B.T. Hailpern and S. Owicki, "Modular Verification of Computer Commu-
nication Protocols," submitted to IEEE Transactions on Communications.
[Ma] Z. Manna, "Verification of deterministic, sequential programs: Temporal
axiomatization," Computer Science Dept., Stanford University, 1981.
[MP] Z. Manna, and A. Pnueli, "Verification of concurrent programs: The tem-
poral framework," Computer Science Department, Stanford University,
June 1981.
[MC] C.A. Mead and L. Conway, Introduction to VLSI Systems, Addison-Wesley,
1980.
[OL] S. Owicki and L. Lamport, "Proving liveness properties of concurrent programs," submitted to ACM Transactions on Programming Languages and Systems.
[S1] C.L. Seitz, "System Timing", Chapter 7 of [MC], pp. 218-262.
[S2] C.L. Seitz, "Ideas about arbiters", LAMBDA First Quarter 1980.
[S3] C.L. Seitz, personal communication.
A Mathematical Approach to Modelling the Flow of Data and Control in Computational Networks
Lennart Johnsson and Danny Cohen
ABSTRACT
This paper proposes a mathematical formalism for the synthesis and qualitative analysis of computational networks that treats data and control in the same manner. Expressions in this notation are given a direct interpretation in the implementation domain. Topology, broadcasting, pipelining, and similar properties of implementations can be determined directly from the expressions.
This treatment of computational networks emphasizes the space/time tradeoff of implementations. A full instantiation in space of most computational problems is unrealistic, even in VLSI (Finnegan [4]). Therefore, computations also have to be at least partially instantiated in the time domain, requiring the use of explicit control mechanisms, which typically cause the data flow to be nonstationary and sometimes turbulent.
INTRODUCTION
The evaluation of mathematical expressions in general requires arithmetic operations as well as communication of data and control signals. The computations may be arranged in several ways, spanning the spectrum from fully parallel to fully sequential. The former approach spreads the computations in space whereas the latter spreads them in time.
1. Mailing addresses: Computer Science, MS-280, Caltech, Pasadena, California 91125 and ISI, 4676 Admiralty Way, Marina del Rey, California 90291.
$$y(n) = \sum_{i=0}^{N-1} a(i)\, x(n-i) \qquad (1)$$
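As a concrete reading of equation (1), the following Python sketch (ours, not from the paper) computes y directly, modeling the delay operator Z used throughout this formalism as a one-step shift on sequences, so that $(Z^i x)(n) = x(n-i)$:

    def Zpow(x, i):
        # i-step delay of sequence x, zeros shifted in
        return [0] * i + x[:len(x) - i]

    def fir(a, x):
        y = [0] * len(x)
        for i, ai in enumerate(a):
            for n, v in enumerate(Zpow(x, i)):
                y[n] += ai * v
        return y

    print(fir([1, 2, 3], [1, 0, 0, 1]))    # [1, 2, 3, 1]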
[Figure 1: The implementation of (2).]
The U↑Y means "enabling" Y to have its value when U=1 and not having any value when U=0 (e.g., E in Figure 3 is a tristate driver). Hence, the explicit control that resets accumulators (the Z-units) to 0 also enables the E-units to drive the output bus. Note that the existence of the output of E is controlled by U, not its value, as in the case of multiplexers.
2. A is called the 0-input and R is the 1-input.
[Figure: a unit with x, u, and y lines passing through it.]

$$\sum_{m=0}^{N-1} \big[(Z^m B) \uparrow (Z^m X)\big], \quad \text{where } B = [k-i-1] \qquad (7)$$

3. This is proven by $Z^j U(t) = d([t-1-j]) = d([t-(j+1)])$.
4. Proof: Unit #i has the phase of i and is selected (enabled) when [k]=i; at that time B=[k-i-1]=[i-i-1]=N-1. Hence, Y is a sum of the last N terms.
The implementation of $(1 - w^m Z)^{-1}$ may not look intuitive to the untrained reader. However, note that $U = (1 - w^m Z)^{-1} V$, which yields:

$$(1 - w^m Z)\,U = V \quad \text{or} \quad U = w^m Z U + V \qquad (11)$$

[Figure 5: The implementation of $U = (1 - w^m Z)^{-1} V$.]
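Unrolled as a recurrence, equation (11) is $u(t) = w^m \cdot u(t-1) + v(t)$; a small Python sketch (ours, not from the paper) shows the behavior the network of Figure 5 realizes:

    def inverse_filter(v, wm):
        # U = (1 - wm*Z)^{-1} V, i.e. u(t) = wm*u(t-1) + v(t)
        u, prev = [], 0
        for vt in v:
            prev = wm * prev + vt
            u.append(prev)
        return u

    print(inverse_filter([1, 0, 0, 0], 2))    # impulse response: [1, 2, 4, 8]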
With the network shown in Figure 5 for $(1 - w^m Z)^{-1}$, the implementation of equation (10) is:

[Figure: the network implementing equation (10), built from delay units Z and multipliers $-w^{mN}$ and $w^m$.]
CONCLUSIONS
A rigorous mathematical formal approach may be applied to the synthesis and the analysis of both the data paths and the control signalling of computational networks. This mathematical tool is instrumental for the investigation of time/space tradeoffs, and supports the transformations of networks between implementations ranging from fully parallel to fully sequential, or combinations such as pipelines.
ACKNOWLEDGMENTS
The authors gratefully acknowledge the support for this research provided generously by the Defense Advanced Research Projects Agency under contract MDA-80-C-0523 with the USC/Information Sciences Institute and contract N00014-79-C-0597 with the California Institute of Technology. Views and conclusions contained in this paper are the authors' and should not be interpreted as representing the official opinion or policy of DARPA, the U.S. Government, nor any person or agency connected with them.
REFERENCES
(1) Cohen, D., "Mathematical approach to iterative computational networks," Proceedings of the Fourth Symposium on Computer Arithmetic, pp. 226-238, October 1978; also published as USC/Information Sciences Institute RR-78-73, November 1978.
(2) Cohen, D., and V.C. Tyree, "VLSI system for Synthetic Aperture Radar (SAR) processing," Proceedings of the Society of Photo-Optical Instrumentation Engineers (SPIE), vol. 186, pp. 166-177, 1979.
(3) Cooley, J.W., and J.W. Tukey, "An algorithm for machine calculation of complex Fourier series," Mathematics of Computation, vol. 19, pp. 297-301, 1965.
(4) Finnegan, J., "The VLSI approach to computation complexity," in these proceedings.
(5) Johnsson, S.L., and D. Cohen, "Computational arrays for the Discrete Fourier Transform," COMPCON 81, February 1981.
(6) Johnsson, S.L., and D. Cohen, "A VLSI approach to real-time computational problems," Proceedings of the Society of Photo-Optical Instrumentation Engineers (SPIE), vol. 298, September 1981.
(7) Johnsson, S.L., U. Weiser, D. Cohen and A. Davis, "Towards a formal treatment of VLSI arrays," Proceedings of the Second Caltech Conference on VLSI, January 1981.
(8) Kung, H.T., and C. E. Leiserson, "Algorithms for VLSI
processor arrays," in (10).
(9) Kung, S.Y., "VLSI matrix computation array processor,"
The MIT Conference on Advanced Research in Integrated
Circuits, February 1980.
(10) Mead, C.A., and L.A. Conway, Introduction to VLSI
Systems, Addison-Wesley, 1980.
(11) Oppenheim, A.V., and R.W. Schafer, Digital Signal
Processing, Prentice-Hall, 1976.
(12) Rabiner, L.R., and B. Gold, Theory and Application of
Digital Signal Processing, Prentice-Hall, 1975.
(13) Weiser, U., and A. Davis, "Mathematical representation for VLSI arrays," University of Utah, Computer Science Department, Report UUCS-80-111, September 1980.
A Wavefront Notation Tool for VLSI Array Design
Uri Weiser and Al Davis
1. INTRODUCTION
Z[KM(x,y,z)]=KM(Z[x],Z[y],Z[z]) (3)
2. MANIPULATION OF WAVEFRONTS
The progression of sets of data elements and computations through
computational networks can be modeled by the concept of a wavefront.
Wavefronts can be defined either graphically in terms of the networks, or mathematically in terms of equations. A wavefront is an ordered set, such that no two elements belong to the same data stream, and all the elements of the set move uniformly in time or in space. A wavefront, denoted as A, represents an ordered set of data elements: {a(1,m), a(2,m), ..., a(N-1,m), a(N,m)}, where m is the "time" subscript. The elements a(i,m) for all m belong to the i-th data stream. For simplicity, the "time" subscript in elements of a wavefront is omitted and a(i,m) will simply be represented as a(i).
This paper introduces definitions of some of the more important
transformations on wavefronts, and presents examples of the technique.
A more complete set of transformations can be found in [6].
2.1 DELAYED WAVEFRONTS
$$A_1 = Z[A] \qquad (4)$$

where:

A = {a(1), a(2), ..., a(N-1), a(N)} and
$A_1$ = {Z[a(1)], Z[a(2)], ..., Z[a(N-1)], Z[a(N)]}
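A minimal sketch (ours, not from the paper) of the delay transformation, representing a wavefront as an ordered tuple of (stream, time) pairs:

    def Z(wavefront, steps=1):
        # delay every element of the wavefront by `steps` time units
        return tuple((stream, t + steps) for (stream, t) in wavefront)

    A = tuple((i, 0) for i in range(1, 5))    # {a(1,0), a(2,0), a(3,0), a(4,0)}
    print(Z(A))                               # {a(1,1), a(2,1), a(3,1), a(4,1)}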
Figure 2-1: Wavefront A and Z[A].
Figure 2-1 shows a wavefront A and its delayed version Z[A].
$$y = KM\{A, X\} = KM\{a(1), x(1), KM\{a(2), x(2), \ldots KM\{a(N-1), x(N-1), KM\{a(N), x(N), ID\}\}\ldots\}\} \qquad (8)$$

where ID is the two-sided identity element for the function KM.

$$y = KM\{R^+[E], R^+[F]\} = KM\{e(1), f(1), KM\{Z[e(2)], Z[f(2)], \ldots KM\{Z^{N-2}[e(N-1)], Z^{N-2}[f(N-1)], KM\{Z^{N-1}[e(N)], Z^{N-1}[f(N)], ID\}\}\ldots\}\} \qquad (9)$$
$$y = KM\{R^+[E], R^+[F]\} = KM\{e(1), f(1), Z[KM\{e(2), f(2), \ldots Z[KM\{e(N-1), f(N-1), Z[KM\{e(N), f(N), ID\}]\}]\ldots\}]\} \qquad (10)$$

[Figure 2-3: networks (a) and (b).]
Both networks presented in Figure 2-3 produce the same output from the
same set of inputs. Network 2-3b exhibits high throughput since the
time step duration is the time required for the function KM to produce
an output from the given inputs.
3. EXAMPLES
The output from the network for the m-th set is equal to:

(12)

The graphical representation of Equation 12 is given in Figure 3-1. The network in Figure 3-1 is slow since the time step duration will be the time required to compute the function KM N times.
Using the equivalence between the networks in Figure 2-3 it is possible to embed delay elements between the KM elements and thus decrease the time step by N, which is equivalent to increasing the throughput by N. Application of a positive rotation on each of the wavefronts will result in:
(13)

where:

$A_1 = Z^{-(N-1)}[A] = \{a(1,m+N-1),\, a(2,m+N-1),\, \ldots,\, a(N-1,m+N-1),\, a(N,m+N-1)\}$   (14)
$X_1 = Z^{-(N-1)}[B] = \{b(1,m+N-1),\, b(2,m+N-1),\, \ldots,\, b(N-1,m+N-1),\, b(N,m+N-1)\}$
[Figure 3-2: $y = KM\{A_1, X_1\}$.]
(17)

the inner product of Z[A] and $S^{-}[X]$. For these new wavefronts the network that computes y(J-1,J) is identical to the one presented in Figure 3-2. It is possible to use the same structure, delaying the wavefront A and shifting the wavefront $X_1$. This process can be repeated for all n and the new structure will contain at all points of the grid the same elements as in the diagonal in Figure 3-2. Delay elements reside in the vertical and horizontal directions.
$$y(J-n, J) = KM\{Z^n[A], S^{-n}[X]\} = Z^{-(s+r+1)}\big[KM\{Z^n[A_1], S^{-n}[X_1]\}\big] \qquad (19)$$
4. CONCLUSIONS
The wavefront notation presented in this paper drastically reduces the complexity of mathematical representation of computational array problems. This encapsulation can also be used as a perceptual tool to represent movement of sets of data elements through the network, and consequently may enhance the appreciation of the interactions between data elements in the array.
REFERENCES
ABSTRACT
I. INTRODUCTION
BEGIN
WHILE PTR @ (*,*) DO
BEGIN
CASE KIND =
(1,*) : FETCH B, Up;
(*,1) : FETCH A,Left;
PROCESSOR ARCHITECTURE
For the purpose of designing processor architecture, local MDFL is treated like a low-level language which has a one-to-one correspondence with the machine language of each processor, directly dictating the hardware. In this sense, each processor is, in effect, a hardware interpreter of local MDFL. However, MDFL is not low level in the conventional sense because of the inclusion of some high-level constructs like REPEAT..UNTIL and IF BEGIN..END. Yet these instructions are treated here as low level as they are directly executed by the processor.
* TERMINATED is a boolean flag providing the mechanism to exit a repe-
titive block and enter another.
ENDCASE;
MOVE Down;
END;
ENDPROGRAM.
SIMULATION
Simulation of a programmable array processor at the register-level
(within each processor) has been undertaken. The entire simulation can
be summarized as follows: The absolute loader, a local MDFL program permanently stored in all processors, is executed first. It loads the user's program. After loading, the user's program is executed. The user's program consists of three major parts: input, program body (algorithm) and output. The input section of the user's program is to ensure that the data is distributed as desired by the algorithm, and the output section is to collect the results; the program body executes the algorithm. The entire user's program is written in MDFL and the sections can be intermixed depending on the situation.
The simulation indicates that the prototype architecture proposed (cf. Figure 2) performs as desired. The entire simulator, written in SIMULA, can be made available on request.
Finally, in summary, we would like to draw the reader's attention to the following remarks:
(1) We have seen that the processor architecture of a programmable
array processor is dictated by the language as the processor is a hard-
ware-interpreter for local MDFL. On the other hand, if a dedicated
non-programmable array processor is desired, the local MDFL program for
that dedicated algorithm should be hardwired in the processor.
(2) Local MDFL thus leads to processor hardware. It is also capable of describing the algorithm from the processor's perspective. Yet global MDFL is indispensable because it is close to the matrix algorithm, and it is easier to program the array in global MDFL rather than write several local MDFL programs.
IV CONCLUSION
REFERENCES
Digital Signal Processing Applications of Systolic Algorithms
Peter R. Cappello and Kenneth Steiglitz
1. INTRODUCTION
VLSI structures and algorithms are given for bit-serial FIR filtering, IIR filter-
ing, and convolution. We also present a bit-parallel FIR filter design. The struc-
tures are highly regular, programmable, and area-efficient. In fact, we will show
that most are within log factors of asymptotic optimality. These structures are
completely pipelined; that is, the throughput rate (bits/second) is independent of
both word size and filter length. This is to be contrasted with algorithms designed
and implemented in terms of, say, multipliers and adders whose throughput rates
may depend on word length.
In this section we present a cellular FIR filter structure. We may express the nth output sample (a B-bit number) of a K-tap FIR filter as $y_n = \sum_{k=0}^{K-1} a_k x_{n-k}$. It is easy to see from [3,4,5] that $y_n$ can be computed using the cellular structure shown in Figure 2.A. In this architecture, the input signal values are piped to the right while the output signal values are formed as they move left. Each process step is of the form $y_n^{(k)} = y_n^{(k-1)} + a_{K-1-k}\, x_{n-K+1+k}$.
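The recurrence can be checked directly in software; the following Python sketch (ours, not the cell hardware) unrolls it sequentially and confirms it reproduces $y_n = \sum_k a_k x_{n-k}$:

    def fir_recurrence(a, x, n):
        # y_n^{(k)} = y_n^{(k-1)} + a_{K-1-k} * x_{n-K+1+k}, out-of-range x taken as 0
        K = len(a)
        y = 0
        for k in range(K):
            idx = n - K + 1 + k
            y += a[K - 1 - k] * (x[idx] if 0 <= idx < len(x) else 0)
        return y

    a, x = [1, 2, 3], [4, 5, 6, 7]
    assert fir_recurrence(a, x, 2) == sum(a[k] * x[2 - k] for k in range(len(a)))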
We now consider some detail of the structure of the inner product step cell. Using two's complement arithmetic we may rewrite $y_n$ as follows:

$$y_n = \sum_{k=0}^{K-1} a_k x_{n-k} = \sum_{k=0}^{K-1} \left( \sum_{b=0}^{B-1} (a_k 2^b)\, x_{n-k,b} \right)$$

The recurrence for $y_n$ becomes

$$y_n^{(0)} = \sum_{b=0}^{B-1} (a_{K-1} 2^b)\, x_{n-K+1,b}$$
$$y_n^{(k)} = y_n^{(k-1)} + \sum_{b=0}^{B-1} (a_{K-1-k} 2^b)\, x_{n-K+1+k,b}$$
$$y_n = y_n^{(K-1)}$$
† This work was supported in part by the National Science Foundation under Grant ECS-7916292, and in part by the U.S. Army Research Office, Durham, NC, under Grant DAAG29-79-C-0024.
where the general recurrence for the sum bits s (and carry bits c) is

$$s^{(b)} = \begin{cases} s^{(b-1)} \oplus c^{(b-1)} \oplus (a_{i-b}\, x_b) & \text{for } 0 \le b < B \\ s^{(b-1)} & \text{else} \end{cases}$$
Figure 2.G illustrates a cellular, bit-serial inner product step structure for K = 4 and B = 4. Figure 2.H gives the bit-serial inner product step algorithm. I/O and computation are overlapped. Finally, the overall FIR filter computation is given in Figure 2.I. Note that input signal values arrive and output signal values depart on every other major cycle.
The time complexity (i.e., the number of clock pulses) using this structure and algorithm is $O(nB+KB)$ for a signal of length n through a K-tap filter, using B-bit input and output signal values (approximately 2 clock pulses per output bit). The FIR filter's area complexity is $O(KB)$. Note that the structure computes the whole output value. Thus if the filter design requires $B_0$-bit coefficients and $B_0$-bit input values, then the B referred to in this structure is $B = 2B_0 + \log K$ (bits). Also, if there are fan-out restrictions then the inner product step cell grows as $O(B \log B)$ rather than $O(B)$. Thus the FIR filter structure has area complexity $O(KB_0 \log(B_0 + \log K) + K \log K \log(B_0 + \log K))$.
It is straightforward to modify the bit-serial FIR filter structure to accommodate coefficient loading, making the FIR filter structure programmable, without increasing its area complexity. Viewed as a finite state machine, such a bit-serial chip must have at least $2^{KB_0-1}$ states. (Consider the case when all the filter coefficients are 1.) Thus $KB_0 - 1$ bits of information are needed to represent the state of the chip, which implies area growth $\Omega(KB_0)$. Thus the FIR filter structure is within no more than log factors of asymptotic optimality.
ture provided that it can get its input on schedule. This turns out to be the case.
Figure 3.A depicts the IIR filter structure. Figure 3.B presents the algorithm for the
structure.
As with the FIR structure, the IIR structure is amenable to structural
enhancements supporting programmability and fast initialization.
The asymptotic time and area complexities of the bit-serial IIR algorithm and
structure are identical to their FIR counterparts (this is true only in the bit-serial
case). The IIR filter structure is likewise within no more than log factors of
asymptotic optimality.
enough to represent the result with no loss of information (i.e., $B = 2B_0 + \log K$, where $B_0$ is the number of bits needed to represent one x (y) signal value). Consequently, and assuming fan-out restrictions, the convolution structure has area complexity $O(KB_0 \log(B_0 + \log K) + K \log K \log(B_0 + \log K))$. A lower bound on the area of a bit-serial convolver is $\Omega(KB_0)$ (see [6]). Thus the structure given is within no more than log factors of asymptotic optimality.
In this section we present a cellular FIR filter structure which operates in bit-
parallel fashion; that is, an entire output word is produced every 2 clock cycles.
Using two's complement arithmetic, we may express the nth output of a K-tap
filter as follows:
$$y_n = \sum_{k=0}^{K-1} a_k x_{n-k} = \sum_{k=0}^{K-1} a_k \sum_{b=0}^{B-1} 2^b x_{n-k,b} = \sum_{b=0}^{B-1} 2^b \left( \sum_{k=0}^{K-1} a_k\, x_{n-k,b} \right) \qquad (5.A)$$
We will call this computation a bit-FIR filter, and a structure for computing it a bFIR (structure). Figure 5.B illustrates a cellular bFIR structure for K = 4 and B = 4. Bit-level recurrences follow.
$$s_b^{(0)} = a_{K-1,b}\, x_{n,b} \quad \text{for } 0 \le b < B$$
$$s_b^{(k)} = s_b^{(k-1)} \oplus c_{b-1}^{(k-1)} \oplus (a_{K-1-k,b}\, x_{n-K+1+k,b}) \quad \text{for } 0 \le b < B,\; 0 < k < K$$
$$c_b^{(0)} = 0 \quad \text{for } 0 \le b < B$$
$$c_{-1}^{(k)} = 0 \quad \text{for } 0 \le k < K$$
$$c_b^{(k)} = \begin{cases} 1 & \text{if } s_b^{(k-1)} + c_{b-1}^{(k-1)} + (a_{K-1-k,b}\, x_{n-K+1+k,b}) > 1 \\ 0 & \text{else} \end{cases} \quad \text{for } 0 \le b < B,\; 0 < k < K$$

Now,

$$\text{bFIR} = \sum_{b=0}^{B-1} 2^b\, s_b^{(K-1)} + \sum_{b=0}^{B-2} 2^{b+1}\, c_b^{(K-1)}.$$
Note that since the bFIR output is the sum of a "sum-bit number" and a "carry-bit number," we need the adder. Data flow in the pipeline algorithm is as follows: The input bits move to the east while the sum bits move west, and carries move southwest. The algorithm for a general cell (b,k) is given in Figure 5.C. Note that input bits arrive and output words depart on every other cycle.
The area of the FIR filter (taken as its width-length product) is $O(B^2 K + B^3 \log B)$. Its data rate is $O(B)$ bits/cycle (1 output word every 2 clock pulses).
6. CONCLUSIONS
We have presented some VLSI structures for digital signal processing that are
completely pipelinable, programmable, and area-efficient. More detail is given in
the full length version of this paper [7]. The development process highlighted the
fact that the tasks of designing VLSI "hardware" and VLSI "software" are insepar-
able. One compelling reason for this is the desire to have data in the right place at
the right time. Systolic algorithms on topologically (quasi-) planar cellular struc-
tures are, thus, well suited for the computations considered. Cellular structures
also lend themselves to hierarchical description - surely a useful method for coping
with descriptive complexity.
On the other hand, to achieve a completely pipelined result, one must verti-
cally integrate the functions involved. For example, we took
1. a systolic algorithm/cellular structure [3,4,5] for computing an FIR filter, and
2. a systolic algorithm/cellular structure for computing an inner product step,
and integrated them to form a single systolic algorithm/cellular structure that har-
moniously computes both functional "levels." References [2,3] also discuss the idea
of bringing the design of systolic algorithms down to the bit level.
The approach of this paper can clearly be applied to other classes of digital sig-
nal processing algorithms, and to other arithmetic systems (such as residue arith-
metic). These applications are currently under investigation.
REFERENCES
[6] Vuillemin, Jean, "A Combinatorial Limit to the Computing Power of VLSI Circuits," Proc. IEEE 21st Annual Symposium on Foundations of Computer Science, 1980.
[7] Cappello, P. R. and K. Steiglitz, "VLSI Structures for Completely Pipelined Processing of Digital Signals," Tech. Rpt. No. 288, EECS Dept., Princeton, N.J., Aug. 1981. Submitted for publication.
Figure 2.A

/* initialization */
Repeat 2KB times:
    Clock the FIR cell with input bit 0;
Repeat 2nB + KB times:
    Clock the FIR cell with the 2nB + KB bit sequence shown in Figure 2.K;
/* IIR algorithm */
(∀ inner product cells)
{
    Shift register x;    /* input bit of denominator FIR structure is sSUM */
    (∀ sum-carry cells)
    {
        Read x, t;    /* sSUM = x of denominator FIR cell(0) */
        if t = 1 then c ← 0;
        s' ← s ⊕ c ⊕ (a·x);
        c' ← if s + c + (a·x) > 1 then 1 else 0;
    }
    Shift register s;    /* output signal value */
    Shift circular register t;    /* timing chain */
}
(SUM cell)
{
    Read sNUM, sDENOM, cSUM;
    sOUT ← sNUM ⊕ sDENOM ⊕ cSUM;
    cOUT ← if sNUM + sDENOM + cSUM > 1 then 1 else 0;
    sSUM ← sOUT · tHEAD;
    cSUM ← cOUT · tHEAD;
    Shift circular register t;    /* timing chain */
}
[Diagram: eight alternating SHIFT X / SHIFT Y steps moving the stream x3 x2 x1 x0 past the stream y0 y1 y2 y3.]
Figure 4.A Data movement scheme for convolution
[Diagram: x and y streams through a bit-serial MULTIPLY BY CONSTANT array, O(B) by O(K).]

Figure 4.B High level convolver structure
Shift register y;
Read y, x;
{
    p ← p ⊕ c ⊕ (x·y);
    c ← if p + c + (x·y) > 1 then 1 else 0;
    Shift register p;    /* product */
}
Shift circular register t;    /* timing chain */

Figure 4.D Modified multiply by a constant algorithm
A Two-Level Pipelined Systolic Array for Convolutions

[Figure: a cell built from a latch, an adder (stages $B_1 \ldots B_A$), a multiplier, and a shift register (stages $B_1 \ldots B_R$).]
$$y_s = \sum_{i=0}^{k-1} w_{i+1} \cdot x_{i+s}, \qquad s = 1, 2, \ldots, n-k+1.$$
Conceptually, if the vector indices increase from left to right, the first result, $y_1$, is obtained by aligning the leftmost element of the W vector with the leftmost element of the X vector, then computing the inner product of the W vector and the section of the X vector it overlaps. The window then slides one position rightward and the inner product calculation is again performed on the overlap to produce the second result. The last result is obtained when the rightmost element of the W vector is aligned with the rightmost element of the X vector.
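The sliding-window computation just described is, in software terms, the following Python sketch (ours, not the systolic implementation):

    def convolve_1d(w, x):
        k, n = len(w), len(x)
        # y_s = sum_{i=0}^{k-1} w[i] * x[i+s], one result per window position
        return [sum(w[i] * x[i + s] for i in range(k)) for s in range(n - k + 1)]

    print(convolve_1d([1, 2], [1, 1, 1, 1]))    # [3, 3, 3]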
In order to execute 1-D convolution on the systolic array described in the preceding section, the number of stages of the shift register must be one greater than the number of stages of the adder, that is, R = A + 1. There is no constraint on the number of multiplier stages, however. For the time being, assume that there are at least as many cells in the array as elements of the window, i.e. C ≥ k. First, a total of C − k zero elements are entered into the pipelined W path, followed by the window elements in the order $w_1$, $w_2$ through $w_k$. (This causes $w_k$ to be placed in the multiplier latch of the leftmost cell. The rightmost C − k cells will be
[Figure 3 (panels 3a-3d): snapshots of the array at successive times through time 15.]
$$y_{rs} = \sum_{i=0}^{k-1} \sum_{j=0}^{p-1} w_{i+1,j+1} \cdot x_{i+r,j+s}$$
[Figure 4 (panels 4a-4d): snapshots of the array at successive times.]
[Figure 5: snapshots of the 1-D and 2-D data streams. 5a - first result; 5b - second result; 5c - invalid result (ignored); 5d - first result of the second row; 5e - final result.]
right until the final result is generated (see Figure 5e). A 6-cell array showing the configuration of the cell shift registers needed to perform the above 2-D convolution is shown in Figure 6. Although in the general case p−1 invalid results are generated for each row of output (except the last), the fraction of total results which are invalid approaches zero since in general n ≫ p.
References
4. H.T. Kung and S.W. Song, "A Systolic 2-D Convolution Chip," Technical Report CMU-CS-81-110, Carnegie-Mellon University, Computer Science Department, March 1981. Also to appear in Non-Conventional Computers and Image Processing: Algorithms and Programs, Leonard Uhr (editor), Academic Press, 1981.
5. D.W.L. Yen and A.V. Kulkarni, "The ESL Systolic Processor for Signal and Image Processing," to appear in Proceedings of the 1981 IEEE Computer Society Workshop on Computer Architecture for Pattern Analysis and Image Database Management, Hot Springs, Virginia, November 11-13, 1981.
Systolic Algorithms for Running Order Statistics in Signal and Image Processing
Allan L. Fisher
Carnegie-Mellon University
Department of Computer Science
Pittsburgh, Pennsylvania 15213
Abstract
Median smoothing, a filtering technique with wide application in digital signal and image
processing, involves replacing each sample in a grid with the median of the samples within some
local neighborhood. As implemented on conventional computers, this operation is extremely
expensive in both computation and communication resources. This paper defines the running
order statistics (ROS) problem, a generalization of median smoothing. It then summarizes some of
the issues involved in the design of special purpose devices implemented with very large scale
integration (VLSI) technology. Finally, it presents algorithms designed for VLSI implementation
which solve the ROS problem and are efficient with respect to hardware resources, computation
time, and communication bandwidth.
1. Introduction
Median smoothing [16] is a filtering operation, widely used in digital signal and image
processing, which involves replacing each sample value with the median of the values found within
some neighborhood of itself. In the two dimensional image processing case, this typically means taking the median of 25 to 100 numbers for each of $10^5$ to $10^6$ pixels. As a result, the computation and memory communication resources required to implement this operation on a conventional computer are very large.
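For concreteness, a direct (non-systolic) Python sketch of median smoothing, ours rather than the paper's; border samples are simply left unchanged here:

    def median_smooth(img, k):
        # replace each interior sample with the median of its k x k neighborhood (k odd)
        h, w, r = len(img), len(img[0]), k // 2
        out = [row[:] for row in img]
        for i in range(r, h - r):
            for j in range(r, w - r):
                nbhd = sorted(img[a][b]
                              for a in range(i - r, i + r + 1)
                              for b in range(j - r, j + r + 1))
                out[i][j] = nbhd[(k * k) // 2]   # rank (k^2+1)/2, zero-based index
        return out

    print(median_smooth([[9, 1, 1], [1, 5, 1], [1, 1, 0]], 3))   # center becomes 1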
The development of very large scale integration (VLSI) technology has made feasible the
production of relatively inexpensive, highly parallel special-purpose computing engines [4, 10] for
the implementation of computationally demanding operations. VLSI algorithms have been
This research was supported in part by the Office of Naval Research, under Contracts N00014-76-C-0370 and N00014-80-C-0236, in part by the National Science Foundation under Grant MCS 78-236-76 and a Graduate Fellowship, and in part by the Defense Advanced Research Projects Agency under Contract F33615-78-C-1551 (monitored by the Air Force Office of Scientific Research).
designed and some prototypes implemented for such applications as pattern matching [3], convolution in image processing [8], and relational database operations [6]. This paper presents efficient VLSI algorithms which solve the running order statistics (ROS) problem, a generalization of median smoothing.
Section 1.1 defines the running order statistics problem, and mentions some of its applications.
Section 1.2 explains the principles underlying the algorithm designs presented, and Section 1.3
describes the approach used in analyzing the complexity of such algorithms. Section 2 describes a
VLSI algorithm for the one dimensional signal processing case. Section 3 gives an algorithm for the
two dimensional image processing case, and describes the extension of that algorithm to problems
of arbitrary dimension. Finally, Section 4 reviews some of the important features of the algorithms
described.
The median smoothing problem can be generalized by considering order statistics other than the median; while the median of k numbers is the one having rank (k + 1)/2 (for odd k), we can consider asking for the element having some arbitrary rank r. We will say that an instance of the running order statistics problem has dimension n if the array of numbers to be filtered has n dimensions. For the sake of simplicity, we will require that the neighborhood around each element over which statistics are taken be in the form of an n-dimensional hypercube with odd edge length centered on that element, and we will say that an instance is of order k if the hypercube has edge length k.
Section 2 describes a VLSI algorithm for the one dimensional ROS problem. The structure described is based on the same idea as Leiserson's systolic priority queue [9], and is presented here mainly in order to lay the groundwork for the discussion of the second algorithm. The second algorithm, described in Section 3, solves the two dimensional ROS problem.
The most common computer system structure is the "von Neumann" architecture, in which a processing unit receives data and instructions from a memory unit and returns the results of its computations to the memory. This architecture has the disadvantage that, for most computations, the operation rate achievable is limited not only by the speed of the processor but also by the bandwidth of the processor-memory communications link. This limitation is commonly referred to as the "von Neumann bottleneck".
One solution to this problem is the concept of systolic arrays [5, 7]. A systolic array is a collection of relatively simple processing units, either all of the same type or a mixture of a few different types, which are connected by a simple communications network and operate in parallel. The performance advantage of a systolic array architecture is that it uses each datum retrieved from memory many times without having to store and retrieve intermediate results, thus potentially allowing speedups relative to memory bandwidth which are proportional to the number of processors used.
The examples in this paper will be discussed in terms of fully synchronous operations for the sake of simplicity, but they could be broken up into self-timed segments [14] communicating by some protocol while maintaining the same asymptotic performance achieved synchronously.
be constructed simply by concatenation of smaller pipelines. Finally, since the interface of a linear
pipeline to the outside world is of bounded size, increases in integrated circuit density can be
exploited while retaining constant chip pinout by laying out pipeline segments on chips in zigzag
fashion.
Logic gates of constant fan-in and fan-out require constant area and switch in constant time.
A constant-width wire of any length can be driven in constant time by drivers which can
be implemented in area proportional to the wire's length. In particular, this means that
wires occupy area proportional to their length.
Given these elements, we can obtain asymptotic measures of the resources required to apply an
algorithm to a problem of some specified size. We will be concerned with the area required for the
implementation of an algorithm, which translates roughly to its hardware cost, and with the time
required to perform the algorid1m.
This section describes a VLSI algorid1m which, for the order k one dimensional running order
statistics problem, yields any particular running order statistic of a vector of length m in time O(m),
while occupying area O(k). The algorithm makes a left-to-right sweep over the input sequence,
computing the required order statistic of each contiguous subsequence of length k. The hardware
structure of the algorithm is a pipeline consisting of k cells, which hold the k values under
consideration at each step. The idea is to keep the k elements in order, so that elements having
particular ranks may be extracted from corresponding positions in the pipeline, and to update the
contents of the pipeline at step n by deleting an_ k and inserting an'
The updating is effected by passing messages from cell to cell down the pipeline; at each step, the
left end of d1e pipeline receives a series of messages, and passes them to its right. First, a message is
sent down the line which seeks out an_ k' the element to be deleted from the array, and causes it to
be deleted. This is followed by a message containing an' d1e new value to be inserted, which passes
Allan L. Fisher 289
down the line until it reaches its appropriate position in order, at which point the value is inserted.
High throughput is achieved by pipelining the processing of the messages, so that many messages
are "cti ve in the pipeline at one time, each in a separate cell.
A check of the correctness of the algorithm can be carried out by checking two assertions. The
first is that the abstract sequence of operations specified by the messages injected yields the desired
result; that is, that the processing of a particular message, in the absence of other messages, leaves
the pipeline in the intended state. The second condition that the pipelining of the operations is
carried out correctly, in that each message causes each cell to perform the same computation in both
the pipelined and non-pipe lined cases.
Complexity analysis of a VLSI algoritllm is concerned with two measures: the area required to
implement the algorithm, and the time that it takes to perfonn a computation. In this case,
assuming that the precision of the numbers to be processed is fixed, so that each comprises a
constant number of bits, the area required by each cell is a constant, independent of k. Thus the
area required for the entire algorithm is proportional to k, since it consists of k constant-size cells.
The time required to process a sequence of numbers is equal to the product of the number of cycles
required to pass the sequence through the machine and the time required to perfonn a machine
cycle. A. sequence of length m requires 3X(m+ k) machine cycles, and cycle time is constant,
regardless of the value of k. Thus, assuming that mk. a sequence of m numbers can be processed
in time O(m).
This section presents an algorithm for the two dimensional running order statistic problem which
may be extended to handle ROS problems of arbitrary dimension. In the two dimensional case, the
algorithm yields a set of s order statistiCs of order k for a matrix with m elements in time
O(mslog log k), while occupying area O(l?log k). Like the algorithm presented for the one
dimensional problem, this algorithm is based on a linear array of cells, down which messages are
passed to maintain an ordered sequence of values.
As in the algorithm of Section 2, many messages are processed simultaneously, in separate cells.
The algorithm for the two dimensional problem has an additional level of parallelism, however, in
that it operates on data belonging to k squares of size kxk simultaneously. Essentially, the
algorithm sweeps a rectangular window of 2k-l rows and k columns across the array, and at each
step produces order statistics for the k overlapping kx k squares contained in such a rectangle.
270 Systolic Algorithms for Running Order Statistics in Signal and Image Processing
Since the algorithm works on more than one square at a time, each array value is tagged with a
row number, ranging from 1 to 2k-l, in ordN to make it possible to calculate to which squares it
belongs. Also, since the mixing of valucs from different squares makes it impossible to compute
results just by reading the contents of particular cells at particular times, order statistics arc gathered
by special messages which cOlInt the number of clements in a given square up to a specified rank,
then pass the value having that rank to the end of the pipeline as the result.
A check of the correctness of the algorithm can be performed in the style of Section 2. Again,
consideration of a uniprocessor simulation makes it apparent that the sequence of operations
applied compute the correct results, and the demonstration of the correctness of the algorithm's
pipelining is identical.
The complexity of the algorithm, though, is more complicated than that of the algorithm for the
one dimensional case, because each cell must handle numbers ranging up to D(kl). In particular,
each cell must be capable of performing comparison and subtraction operations on these numbers.
By encoding the numbers in 2's-complement binary notation, both of these operations can be
expressed in terms of addition and testing for zero value of O(log k)-bit numbers. Brent and
Kung [2] describe a general adder design which yields b-bit adders requiring area O(b) and time
O(1og b). Substituting log k for b, the necessary additions can be performed in area O(log k) and
time O(log log k). Testing for zero value can be performed with a binary tree of OR gates in area
O(log k) and time O(log log k). Thus, since each of 2kl - k cells requires area proportional to log k,
the entire linear array requires area O( kllog k). The time to process m numbers, computing s order
statistics for each square, can be calculated as Oems) machine cycles multiplied by a cycle time
proportional to log log k, yielding the result 0(1I1slog log k).
The algorithm may be extended to handle problems of any dimension II by using a (2k-l)n- 1k
cell pipeline, and sweeping the array with a hyperwindow which has width k in the direction of the
sweep and width 2k-l in every other direction. At each step in the sweep, the algorithm would
read (2k-l)n-l values and produce kn- 1 sets of order statistics. Each value in the pipeline would
be accompanied by n-l numbers indicating to which hypercubes, among the k n- 1 contained in
the window, the value belongs.
Allan L. Fisher 271
4. Conclusion
Acknowledgments
Thanks are due to M. 1. Foster, H. T. Kung, P. L. Lehman, and S. W. Song for helpful criticism,
and to H. T. Kung for suggesting the problem.
References
[1) Andrews, H.C. Monochrome digital image enhancement. Applied Optics ]5(2):495-503,
February, 1976.
[2) Brent, R. P. and H. T. Kung. A regular layoutfor parallel adders. Technical Report CMU-
CS-79-131, Carnegie-Mellon University, Computer Science Department, June, 1979.
[3) Foster, M. 1. and H. T, Kung. The design of special-purpose VLSI chips. Computer
Magazine 13(1):26-40, January, 1980.
[4) Kung, H. T. Let's design algorithmsfor VLSJ systems. Technical Report CMU-CS-79-151,
Carnegie-Mellon University, Computer Science Department, JanuarY,1980.
[6) Kung, H. T. and P. L. Lehman. Systolic (VLSI) arrays for relational database operations. In
Proceedings of ACM SIGMOD 1980 International Conference on Management of Dala, pages 105-
116. Association for Computing Machinery, May, 1980.
272 Systolic Algorithms for Running Order Statistics In Signal and Image Processing
[7] Kung, H. T. ;lOd C. E. Lei~erson. Systolic arrays (for VLSI). In Duff, I. S. and Stewart, O. W.
(editors), Sparse Matrix Proceedings 1978, pages 256-282. Society for Industrial and Applied
l'vlathematics, 1979. A slightly different version appears as Section 8.3 of Mead and Conway [10].
[8] Kung, H. T. and S. W. Song. A systolic 2-D convolution chip. Technical Report CMU-CS-81-
110, Carnegie-Mellon University, Computer Science Department, March, 1981. To appear in Non-
COf/ven/iollal Computers and Image Processing: Algorithms and Programs, Leonard Uhr (editor),
Academic Press, 1981.
[9] Leiserson, Charles E. Systolic priority queues. Technical Report CM U-CS-79-115, Carnegie-
Mellon University, Computer Science Department, April, 1979.
(11] Moore, G. L. Are we really ready for VLSI? In C. L. Seitz (editor), Proceedings of
Conference all Very Large Scale Integration; Architecture, Desig,~ Fabrication, pages 3-14.
California Institute of Technology, 1979.
(12] Noyce, R. N. Hardware prospects and limitations. In M. 1,. Dertouzos and 1. Moses
(editors), The Computer Age; A Twenty- Year View, pages 321-327. Institute of Eicctrical and
Electronics Engineers, 1979.
[14] Seitz, C.l.. System Timing. Chapter 7 of Mead and Conway [10].
[15] Thompson, C. D. A Complexity Theory for VLSI. PhD thesis, Carnegie-Mellon University,
Computer Science Department, August, 1980.
ABSTRACT
The combination of systolic array processing techniques and VLSI
fabrication promises to provide modularity in the implementation of
matrix operations for signal-processing with throughput increasing
linearly with the number of cells utlized. In order to achieve this,
however, many design tradeoffs must be made.
Several fundamental questions need to be addressed: What level of
complexity (control) should the processor incorporate in order to
perform complicated algorithms? Should the control for the processing
element be combinatorial logic or a microprocessor? The broad
application of a systolic processing element will require flexibility in
its architecture if it is to be produced in large enough quantities to
lower the unit cost so that large arrays can be constructed.
In order to have a timely marriage of algorithms and hardware we
must develop both concurrently so that each will affect the other. A
brief description of the hardware for a programmable, reconfigurable
systolic array test-bed, implemented with presently available integrated
circuits and capable of 32 bit floating point arithmetic will be given.
While this hardware requires a small printed circuit board for each
processor, in a few years, one or two custom VLSI chips could be used
instead, yielding a smaller, faster systolic array. The test-bed is
flexible enough to allow experimentation with architecture and
algorithms so that knowledgeable decisions can be made when it comes
time to specify the architecture of a VLSI circuit for a particular set
of applications.
The systolic array testbed system is composed of a minicomputer
system interfaced to the array of systolic processor elements (SPEs).
The minicomputer system is an HP-IOOO, with the usual complement of
printer, disk storage, keyboard-CRT, etc. The systolic array is housed
in a cabinet approximately 28"x19"x21". The interface circuitry uses a
single 16-bit data path from the host HP-IOOO to communicate data and
commands to the array.
Commands and data are generated in the HP-IOOO by the operator using
interface programs written in FORTRAN. Algorithms can be conceived, put
into a series of commands for the systolic array processor, and tested
for validity. Data computed in the array can be read by the host
HP-IOOO and displayed for the operator.
The use of a general purpose minicomputer as the driver for the
systolic array gives unlimited flexibility in developing algorithms.
Through the use of interface routines, algorithms can be tried,
273
274 Systolic Array Processor Developments
8 x 8 ARRAY OF PROCESSORS
CABLE TO
HOST
I
HOST INTERFACE
Host
To Interface
Host,--_...J
A
B
A~
Fig. 3. Rectangular/hexagonal
configuration.
Keith Bromley, et 81 279
X/V
TO HOST TO HOST
TO HOST
8031
MICROPROCESSOR
A
B
C
S
rows, which tells the SPE which routine to perform. The I/O registers
are loaded and read by the microprocessor under program control.
VLSI/VHSIC IMPLEMENTATION
It has been predicted that one or more systolic processing elements
could be put on a single VLSI chip, While the currently implemented
printed circuit board with 18 ICs is undoubtedly not the optimum design
for the future, it is an interesting exercise to calculate the gate
count for a 32-bit version of this SPE. Table 1 shows the estimated
gate count for the present implementation and that which would be used
in a VLSI chip if we eliminate some of the flexibility (programmability)
used in this testbed. These numbers are very rough estimates, perhaps
+30%.
Table 1 - SPE Gate Count (32-bit, Floating Point)
Testbed SPE VLSI
Control Processor 20K 10K
APU 20K 20K
ROM 64K 16K
RAM 16K 16K
I/O 1K 2K
121K 64K
CONCLUSION
In the last two decades numerical analysis has developed many
numerically stable matrix algorithms [2,3] for use with single
arithmetic unit digital computers. However, parallel numerical
algorithms have not been correspondingly developed and only a limited
number of parallel processors or computers have been built.
It is the belief of the authors that concurrent processing with
generalized systolic architectures will provide the capability of
implementing in hardware the matrix processing which currently is
represented by the EISPACK [2] and LINPACK [3] software libraries. The
availability of affordable VLSI matrix processing peripherals for
minicomputers would significantly advance signal processing research.
Similarly, VHSIC [7] implimentations of these matrix processing
peripherals would make advanced signal processing available for real-
time tactical signa] processing applications.
Dr. Mermoz recently addressed the question of spatial signal
processing beyond adaptive beamforming [8]. The conclusions of his
paper were that with adequate computational resources much new
information about the medium through which the signal propagates may be
incorporated into the signal processing with corresponding improvements
in system performance. In particular, he says, "Despite the advances in
computer technology, fast as it may have been in the past decades, such
a program is liable to absorb all the present capacity and probably the
predictable capacity during the next twenty years. Meanwhile, there
284 Systolic Array Processor Developments
will be some trade-offs between complexity and preclslon. But the trend
to introducing the most flexible model, compatible with the array and
the number of sources, is likely to be the right approach toward further
improvements when we deal with complex and unpredictable mediums ... Such
an approach is rather the ~pposite of what has been done so far, except
in advanced research. Of course, anybody would be horrified by the
amount of computing power required. On the other hand, most scientists
have been horrified several times in their career by what turned out to
be current practice within a few years."
Systolic architectures implemented with VLSI should make this type
of signal processing possible for sonar bandwidths within this decade
while VHSIC implementations should provide similar capability at
communication and radar bandwidths.
ACKNOWLEDGEMENTS
The authors wish to acknowledge the support of the Naval Electron-
ics Systems Command (Elex 612, 6l4) and the Naval Ocean Systems Center
Independent Research/Independent Exploratory Development (IR/IED)
program.
REFERENCES
[1AJ Speiser, J.M. and H.J. Whitehouse, "Architectures for Real-
Time Matrix Operations," Proceedings of the 1980 Government Micro-
circuits Applications Conference held at Houston, Texas, 19-2r-NOV.1980.
[IBJ Speiser, J.M., H.J. Whitehouse, and K. Bromley, "Signal
Processing Applications for Systolic Arrays," Record of 14th Asilomar
Conference on Circuits, s~stems and Computers held at Pacific Grove,
California, 17-19 Nov. 19 0, IEEE Catalog No. 80CH1625-3, pp 100-104.
[2] Dongarra, J.J., et al, LINPACK Users'Guide, Society for
Industrial and Applied Mathematics, Philadelphia, Pennsylvania, 1979.
[3] Garbow, B.S., et al, Matrix Eigensystem Routines-EISPACK Guide
Extensions, Springer-Verlag, 1977.
[4] Kung, H.T., "Systolic Arrays for VLSI," in Duff, 1.S. and G.W.
Stewart, Sparse Matrix Proceedings, 1978, Society for Industrial and
Applied Mathematics, Philadelphia, Pennsylvania, 1979 (Reprinted in
Mead, C. and L. Conway, Introduction to VLSI, Addison-Wesley, 1980).
[5] Kung, S.-Y., "VLSI Array Processor for Si gnal Processing,"
presented at Conference on Advanced Research in Integrated Circuits held
at MIT, Cambridge, Massachusetts, Jan. 1980.
[6] Speiser, J.M. and H.J. Whitehouse, "Parallel processing
algorithms and architectures for real-time signal processing," Real-Time
Signal Processing IV, a publication of the SPIE International Technical
Symposium held in San Diego, 25-28 Aug. 1981, Vol. 298.
[7] L.W. Sumney and E.D. Maynard, Jr., "The United States
Department of Defense Program on Very High Speed Integrated Circuits
(VHSIC}," Proc. 1979 Int. Symp. on Circuits and Systems, July 1979,
pp. 559-563.
[8J H.F. Mermoz, "Spatial Processing Beyond Adaptive Beamforming,"
J. Acoust. Soc. Am. 70(1), July 1981, pp. 74-79.
A Systolic (VLSI) Array for Processing
Simple Relational Queries
Philip L. Lehman
CarnegleMelion University
computer Science Department
Pittsburgh, Pennsylvania 15213
1. INTRODUCTION
This paper discusses the use of systolic arrays (a conceptual and design tool for VLSI
systems [11]) to produce VLSI capable of processing simple relational database queries, which are
by far the most frequently executed queries in practical large database systems. We will be
concerned with the exploitation of VLSI technology to process "simplc" relational qucries very
rapidly; the dcsign of an array for this task is described below. The systolic properties of the array
design are considered, and are shown to have analogs in the domain of databases by using the
systolic properties to prove certain consistency and scheduling complexity properties of all
transactions executed by dle array (hereinafter called the simple query array, or SQA). 111e SQA is
intended for use as an integral part of a systolic database machine [13], which would handle very
large databases and is expected to have a high performance gain Clver conventional database
systems. The machinc should also compare quite favorably with other database machine designs
(for example [I, 4,16,17,19]), especially when used for databases with frequent simple queries, i.e.
those databases used by most commercial applications!
2. SIMPLE QUERIES
It has been observed [7, 14] that the following rule applies to the largcst database systems:
Simplicity CharacteristiC: lilmost all of the transactions arc very simple.
For example, in a large banking system most (> 80%) of the transactions run by fue bank on its
Customer_Account Database will be "Dcbi(Account" or "Credi(Account" transactions, as in
Figure 1 [6]. In contrast, the monthly printing of customer statements is more complex, but is
performcd (relatively) very rarely (perhaps 10-7 as often).
This paper, therefore, assumes a model of database system usage in which almost all of the
transactions are drawn from a set of simple transactions that are performed very frequently. 'This
model seems to be satisfied by a wide range of practical applications. including banking, airlines
reservation systems, telephone directory assistance, inventory systems, employee record systems,
etc. The systolic arrays proposed in this paper emphasize high throughput of these transactions
without sacrificing adcquate response. All databases in this paper are assumed to be relational
databases. (sec [2, 3]) In [9] systolic arrays were used to perform "hard" relational database
operations (intersection, difference, union, remove-duplicates, projection, join, division): the
regular structure of systolic arrays was shown to lend itself naturally to the processing of relations,
which are also very regular. (For background on systolic arrays see [8, 11].)
3. CONCEPTUAL OPERATION OF THE ARRAY
As a simple illustration of the use of the SQA, it is appropriate to reconsider a (simplified)
piece of the DESrCCREDIT transaction: the section shown in Figure 2 is of interest, since the
othcr part~ of the transaction are very similar in form to this section. 'Dlis section of the transaction
This research was supported in part by the Office of Naval Research under Contracts NOOO14-76-C-0370 and
NOOO14-80-C-0236. in part by the National Science Foundation under Grant MCS 78-236-76, in part by the Defense
Advanced Research Projects Agency under Contract F3361S-78-ClS51 (monitored by the Air Force Offiee of
Scientific Research). and in part by an IBM Predoctoral Fellowship.
285
286 A Systolic (VLSI) Array for Processing Simple Relational Queries
DEB IT_CREDIT:
!3cgillj'rallsaclioll ;
Get Massage;
<xtrat:t Acct._Num, Delta, Teller, Branch from Message;
,ind !\CCOUNT(Acct_~Jum) in Database;
U' flotJound I Acct_Bal+Delta<O IhenPut "Negative Response";
else do;
Acct_Ba1~Acct_Bai+Delta;
Po~ HISTORY record 011 ACCOUNT (Delta);
CASH_ORi.UER( Te 11 B r) "CASH_DRAlJER (Ta 11 er) +De 1 ta;
BRANCH_BAL(Branch)aBRANCH_BAL(Branch)+Oolta;
Put Massaga(ttNelll Balance =tt Acct_Bal);
end;
Commit;
Figure 1: A simple debit/credit transaction(tj'om [6]).
(In transactions in this paper, field names arc abbreviated for convenience.)
examines the ACCOUNT rclation of a banking database, looking for a single Account_Number
(assuming that there is at most one instance of that AccountNumber in the relation). Jrit finds the
AccountNumbcr, it changes the associated AccountBalance (by Delta). An example ACCOUNT
relation is sho\l;n in Figure 3. As written, the DEBlT_C[U:DIT transaction was intended to be IUn
on a sequential processor. The same transaction-in fact, a balch of such transactions-can instead
be mn on the systolic army shown in Figure 4. In this paper, the term balch refers to a group of
transactions that are processed together and that access the same columns of the same relations in
the same order; an example might be a large batch of bank check cashings.
]~nd ACCOUNT(Acct_Num) in Database;
if Not_Found I Acct_Ba 1+De lta<O then Put "Negat 1ve Response"
else do; Acct_Bal=Acct_Bal+Dalta;
Figure 2: Part of the DEBIT_CImDIT transaction in Figure l.
ACCOUNT Account Number Balance
1215 $ 1234.56
1492 $ 29.95
1701 $ 100000.00
1776 $98765432.10
1812 $ 432.79
1980 $ 1284.73
R2
R1
100000.00
1701 29.95
1492 1234.56
1215
Figure 4: A 3x2 systolic array for simple queries. PI = "= 1701 ?"; Ul = "+ 50"
In the present case, columns Rl (Accoun(Number) and R2 (Account.Bal:mce) are passed into the
two columns of the array, as shown.
For example, suppose the first query is "Add $60 to the AccountBalance of
Account.Number 1701." Then the first lefthand processor is prcloaded with predicate PI: ".
1701 1" The first righthal1d processor is preloaded with update VI: "+ 60." As each element
from the Account.Number column passes predicate PI, the predicate is executed, and the result of
the predicate is passed to the processor containing VI. If the result of the predicate is TR UE, the
update is executed (in this case, on the entry representing the Account_Balance of Account_Number
1701: $100000.00 in the figure).
Noticc that the SQA is synchronous-the processors operate in lock-step. The data is entered
so that the Account.Bal:mce corresponding to a particular Accoun(Number enters the array one
time-step after the Account_Number. This insures that, for example, the result of predicate PIon a
particular Account_Number will reach update VI at the same time as the corresponding
Account.Balance.
Additional queries are placed in succeeding rows of the systolic array. The important
characteristic of this array is that it can contain many queries (hundreds, not just the three shown in
the figure). Hence, many queries can be executed for a single pass of the relation through the
array.
The general philosophy of this array is that its structure is such that many similar queries can
be loaded into a single array simultaneously. Then the relevant relation is passed through the
array, and all of the queries are executed. Even ifit takes SubSl'1ntial time to run:l relation through
an array, the large number of queries processed yields a high throughput (in queries per unit time)
for the combined operations.
288 A Systolic (VLSI) Array for Processing Simple Relational Queries
4. COMBINATION BY PIPELINING
4.1. Updates on Multiple Relations
The DEBlT.CUEDiT transaction updates several relations in the database, including
ACCOUNT, CASH.DRAWER, and BRANCRBALANCE. Each of these updates is conditional
on the appropriate Account.Numbcr being found in the database, but of course this test must only
be done once. As written, the transaction indicates that the following procedure is to be executed.
Once the Account.Numbcr is found (if it is present), the Accouut.Balance is updated. Then, the
CASI(DRAWI~R relation is searched for the entry for the appropriate teller, and that teller's
drawcr balance is updated. Similarly, the BRANCH.BALANCE for the correct branch is updated.
The operations on the ACCOUNT, CASH.DRA WER, and B1~ANCH.BALANCE relations are
alinost identical, and can each be executed by the array shown in Figure 4.
The method described in section 3 will update each of the relations in the transaction, if it i.$
applied separately to each relation. It remains to be shown how to combine multiple instances of
the method to implement a single transaction operating on several relations. The straightforward
method for applying batches to several relations would be to process the batches sequentially,
relation by relation. This method can be improved by exploiting the parallelism that is made
possible by using several instances of the SQA. Specifically, this can be done with a pipeline
scheme, like that shown in Figure S.
Direction of Pipeline
)
Relations Relations Relations Relations
t...-
ARRAY
Queries (STAGE)
r- ' - - ARRAY
Queries (STAGE)
I--
--- ARRAY f--
Queries (STAGE) Queries
1 2 3
operations it is to perfonn arc specified just before they are executed, rather than when the array is
built (as in the case of the "hard" operation arrays). 'DIC users are allowed to specify a wide range
of possible queries to be loaded into the array, in contrast to the operation of the intersection array.
for example, which is completely specified when it is built. Since this programmability makes the
structure of the SQA fairly complex, only relatively simple operations arc allowed as the atomic
programmable operations of the array. This makes the design of the physical array somewhat
simpler.
5.2. Specifications
The functionality of the array derives from a combination of: (a) the op.erations perfonned
by each individual systolic component at each time step, and (b) the method of interconnection
between the components. One of the systolic processors used in the SQA (with the connections
leading to other processors) is shown in Figure 6.
5.2.1. Connections Between Processors
As the figure shows, each processor has six connections to other processors, labelled as
described in Table l. In the present design, data only flows "down" through the systolic array.
Logical results only flow "down" and to the "right." The general execution pattern of the systolic
processors (SPs) is as follows. For each time-step: (a) the inputs are read from the input lines (D t
bt , bl ); (b) they are manipulated according to the operalions programmed for the systolic processor
(described below); (c) and the outputs are put on the output lines (Db' bb; b,). The connections
Dr and Db are used to pass data (elements of the relation) between processors. The bits bl and br
and the bits br and bb are lIsed to pass logical results from one processor to another from left to
right, and top to bottom, respectively. These logical results allow the construction of complex
queries from single systolic processor building blocks, and are used to convey conditional results
between processors.
The connections described are each enhanced with a single bit of memory (see Figure 6).
These memory bits have three modes: TRUE, FALSE, and BYPASS. In BYPASS mode, the
memory bit is transparent, and data is simply passed through the bit as ifit weren't there. If the bit
is not set to BYPASS, then it can be set to TRUE or FALSRby its input line. Then its output line
continues to pass that value until it is changed or the bit is again set to BYPASS. This bit is useful
in handling problems with time-delay, and is intended primarily for use between stages of the
pipeline. This arrangement allows the division of a chip structure into stages to change
One
Dt Input Data entering from the top.
Db Output Data leaving at the bottom.
SQA bt Input Bit entering from the top.
bl~ b
r Output Bit leaving at the bottom.
Processor bb
bl Input Bit entering from the left.
b, Output Bit leaving at the right.
o b
t b
}<'igurc 6: A single systolic processor: Table 1: Data entering and leaving each
Dr and Db are data connections, processor. The directions top, bottom.
bt , bl , b" and bb are logical one-bit connections, left, and right refer to Figure 6.
M's are single-bit memories on the connections.
290 A Systolic (VLSI) Array for Processing Simple Relational Queries
dynamically. The divisions occur where BYPASS is not set. At these points, results are held
between passes, and arc sent on to the next stage during the next pass for that stage.
5.2.2. Operations
The complete operation that can be executed by a systolic processor is specified- by the syntax
"<Input bit exp>=<Primitive op>=<Output bits>." <Primitive op> is either a <Predicate
primitive) (which is a comparison with a constant) or an <Update primitive) (an arithmetic
operation). (Table 2 lisl~ all of me primitive operations.) The predicate operations are used
primarily to implement the conditional parts of queries on the database, like those describcd for
the bank database abovc. The update operations are used to implement the changes to the
database. Only one operation (either predicate or update) may bc loaded into a single systolic
element.
The <Input bit exp) is a logical expression using thc signals from the processors to the top
and left, and is allowcd to bc any of the sixteen possible Boolean combinations of me two input bits
bl and h,: if unspccified, it defaults to TI? UE. <Output bits) is a list of (up to two) bits to be output
to the processors to the right and bottom, along with me BY I' ASS setting for the relevant memory
bits. Both elemcnts in the list have one of the four following forms: "b", "not-bOO, "set(b)",
"clear(b)", where thc b is rimcr b, or bb' The "bOO and "not-b" forms pass TRUE or FALSE on that
output line, respectively, and sct the mcmory bit on mat line to BYPASS mode. The "set(b)" and
"clear(b)" fonns set the associated mcmory bit to TRUE or FALSE, respectively (and turn off
BYPASS).
5.2.3. The Internal Systolic Algorithm
We use the expression internal systolic algorithm to refer to the algorithm executed by each
individual systolic proccssor at each time-stcp. For the SQA this "hardware algorithm" is shown in
Figure 7. Bricfly, it inputs thc data, executes either a predicate or update, and puts the output onto
the output lines.
Pn'rlicate.j).ngatiolls
.symbol Operation (k is a constant) /* Get data and input bits: */
= k? Docs the data cqual k? Read D, and b, and b, from thc input lines;
< k? Is the data less than k? tval+- FALSE; /* Tcmporary variable */
) k? Is the data greatcr than k? D~ +- D ,: /* Temporary variable */
:5 k? Is tllC data at most k? if<Input bit expr) (D(fault: TR UE) then
;:: k? Is thc data at least k? if<Operator) (Default: null) is predicate then
k? Is the data differcnt from k? /* Evaluate the predicate: */
T Constant TRUE. tval+- execute <Operator) on Data
F Constant FALSE. else
Update Operations D~'" execute update <Opcrator) on Dr;
Symbol Operation (k is a constant) tval+- TRUE
+ k Add k to thc data. <Output 'bits) (Default: none) ... IvaI;
k Subtract k from the data, D +- Dl
b b
* k Multiply thc data by k.
+ k Divide the data by k. Figure 7: The internal systolic algorithm
Table 2: Primitive operations. for the SQA proccssors.
Find ACCOUNT(Acct_Num) in Database;
if r~ot_Found I Acct_Ba HOe lta<O then Put "Negat i va Response";
else do;
Acct Bnl=Acct Bal+Dolta;
CASH=DRA\~ER( Te 11ar) =CASH_DR,/\WER( Tall ar) +De 1ta;
BRAtJCH_BAL( Branch) =BRA~JCH_BAL (Branch )+09 1ta
end
Figure 8: A subset of thc DEnrCCREDlT transaction in Figure 1.
Philip L. Lehman 291
5.2.4. An Example
We illustrate the function of the SQA with an example: the implementation of a subset of
the DEBI'CCREDIT transaction shown in Figurc 8. This is implemcnted on the two rows of the
six column SQA shown in Figure 9. The "program" for the array is givcn in Figure 10. The
picture is brokcJl lip (by dashed lines) into the thrce stages us.:d ji)r the three relations. The stages
execute sequel1ti,tlly and arc p:lll of the pipeline. The first stage handles the ACCOUNT relation,
the second CASI I)}RA WER, and the Ulird BlUNCHJIALANCE. In the first stage, processor (I)
checks the Account)'bmhcr. If it matches, processor (2) checks to sec if the A~coulltJ3a\:mce is
large enough. Ifso. processor (4) modifies the AccountBa\:uice. Processor (3) docs nothing. '1l1C
other stages work similarly. Notice that two rows of processors are lIsed for this transaction, and
that they work together by means of the communication wires (shown more clearly in Figure 6;
these wires communicate the logical resulLS from the tests in processors (1) and (2. Processor (4)
performs U1e update as (he relation streams by.
6. CONSISTENCY
Proofs of database consistency that have appeared ill the literature have instead proven the
somewhat stronger condition of serializability [5,12,18] (with a few exceptions, for example [10]).
(A concurrent schedule for a set of transactions is considered to be serializable if its results are
equivalent to a schedule in which the transactions arc mn sequentially.) If only syntactic
informatioll is known about a set of transactions then a schedule from a serializing scheduler is the
best possible. (This is one example of the tradeoff between perfOImance and information in
consistent schedulers [12].) Most of liese proofs have concerned conventional database systems, in
which . locks have been used to guarantee serializability by isolating sections of Ule database for
exclusive access by a single transaction.
In the system considered in this paper, no locks are used. Instead we rely on thc great
regularity of systolic arrays to guarantee the serializability of the transactions executed by those
arrays. Theseserializability results arc summarized below, and concern the SQ,\ and pipeline
discussed above.
Definition 1: A transaction is a mapping F: (DB,I) --t (DB',O), whcre DB and DB' are
dmabase slates: sets of relations, together with the "shapes" of those relations and the values of
their clements; J is the input to F and a is Ule output from F.
292 A Systolic (VLSI) Array for Processing Simple Relational Queries
(2) bl = bb sct(b,)
(3) /lull
(4) bl = + Delta = /lull
(5) bl == Teller? =b,
(6)
(7)
bl = + Delta
null
= sc/(b r)
(8) null
(9) bl = = Branch? =b
(10 ) bl = + Delta = n~l/
( 11) /lull
(12) null
Figure 10: A program for the array in Figure 9 to implement the transaction subset
in Figure 8. The notation "/Iliff' indicates that a field or statement is empty.
(Restricting F produces interesting classes of transactions, including, for example, add-
relation transactions (more relations in DB' than in DB), update-tuples-oilly (only element values
have been changed), etc. [13])
Analogously to the usual definition (but from a different point ofvicw) we define
Definition 2: A transaction F is serially decomposable into transactions FI and F2 if it equals
their composition: F =
F2 F1' The decomposition is said to be syntactic if it can be done
regardless of the semantics of (operations performed by) F, F}, and F2.
This model produces the following results concerning tl1e systolic designs presented above.
Theorem 3: The transaction FSQA executed by the SQA is syntactically serially decomposable
=
into individual queries corresponding to the individual rows of the SQA: FSQA 1'n 0 fn-I 0 0
1; 0J;. Some of these can be re-combined so that more than one row can be used for a single
(complicated) query.
Proof: The proof is by recursion on the rows of the SQA. An SQA with one row is trivially
decomposable into J;. Consider an SQ;\ with k rows (hI). By the inductive assumption, the first
k-I rows (considered as an SQA) are decomposable into J'~_l = fk-l 0f k-2 0 _.. 01; 0ii. 'D1e data
leaving the first k-I rows flows into the kth row and is processed there after the first k-l rows have
completed their work on it. The kth row performs the function f k . Hence the SQA performs FSQA
= ric = fk 0 F",_1'
Theorem 4: The transaction FpIPE executed by several stages (each an SQA) on a single batch
is (syntactically) equivalent to the serial execution of the transactions obtained by taking single
rows acruss the width of the entire pipeline, provided each stage operates on a different relation
(conforming to the usual structure of simple queries). This is the "natural" desired decomposition
into individual transactions, each operating on the same set of several relations.
Proof: The transaction FpIPE = 1'~1 0 0 FI , where each J'~ is a stage (SQA). Hence, by
Theorem 3, FPlPE = fm,n 0 fm,n-l 0 fm,l 0 fm-I,n 0 f m-I ,n-l 0 0 f m-1,l 0 0 ii,n 0 ii,n-l 0
... o/'ll'
, Then by Lemma 5 (below) we can rcorder these: FpIPf:' = f m,n 0 f m-I ,n 0 . . . o/'1 .n 0
Ii!n.n -1 0 f m_]. ,n-1
0 . . . 0 /'1 -1 0
,11
. . . 0 f l o f _} 1 0 . . . 0 /'11' Then since ea~h row of the pipe
m, m, .
(across all stages) Gi = fm,i 0 fm-1,i 0 0 j~.1' we have f;'IPE = Gn 0 0 Gl .
The f's can be switched above because any f's to be switched op(!rate on distinct relations;
this satisfies the condition of lemma S.
Lemma 5: For two transactions Fl and F2: F2 0 F} = r~ 0 "2 provided that F~ and F2
operate 011 independent portions both of the database state and of the I/O infomlation (the
information that accompanies the state as either input or output-in a composition, the output of
one function becomes the input of the next).
Philip L. Lehman 2t3
Proof: By assumption, the state and I/O information arc independent for the two
transactions in question. Hence, we simply "pretend" that each }i locks the appropriate section of
the database and I/O infOlmation. This guarantees that the transactions can operate in either order
(or even concurrently) and still achieve the same results. The lemma easily generalizes (by
associativity of"o") to more than two transactions.
Theorem 6: The transaction schedule resulting from passing more than one batch through the
SQA pipeline is syntactically serially decomposable (into simple queries) provided that no relation
is used more than once in a batch and that the batches arc pipelined so that the batches themselves
arc syntactically serializable (i.e. that the directed graph of the "must-precede-in-serialization"
rel.ation on transactions-induced by the order in which they access relations-is acyclic).
Proof: This is an easy consequence of Theorem 4.
We assume that in running the pipeline, batches are started as soon as they arrive (or in the
next available time slot), provided that this will maintain serializability. The task of maintaining
syntactic serializability (on-the-fly) is delegated to a scheduler. This scheduler is efficient (runs in
polynomial time) for our SQA and pipeline, for two different senses ofserializability, as shown by
the theorems below.
(Papadimitriou [IS] showed that the problem of testing inembership in SR; the set of
serializable transaction histories, is NP-complete. However, there are efficiently recognizable
subsets of SR. The set of transactions we consider ("SQATRANS") is more general than those in
SR, in that the transactions are multistep: in general, they consist of more than one read!then-write
step. SQATRANS has strong analogies to Papadimitriou's DSR set, which has specific rules for
serialization, making it "casy" to recognize. Hence it is not surprising that the polynomial
recognition (or scheduling) algorithms exist for SQATRANS.)
Theorem 7: An efficient (polynomial time) syntactic serializability scheduler exists for the
SQA-pipe.
Proof: The sketch of a simple polynomial-time scheduling algorithm is given in algorithm
Schedule 1 (Figure 11). The algorithm takes time O(q('; + ;;'p2 = O(qw\ where w is the
number of batches scheduled, p (~w) is the number of stages in the pipeline, and q is the length of
the queue of batches waiting to be run (which depends on the arrival rate). The algorithm assumes
that we maintain a wxw matrix, }If, of precedence relations: batch i must precede batch j in a
serialization of the schedule if and only if m ....0. The algorithm is simplified, and its running time
could be improved fairly easily; for exam~le our approximation assumes that computing the
transitive closure of a matrix of size n takes time O(n4).
Definition 8: We say that a schedule is serializable ill the semi-strict sense (following the spirit
of [IS]) if the order in which the transactions begin execution is the same as that in the equivalent
serial schedule.
Theorem 9: An efficient (polynomial time) syntactic semi-strict serializability scheduler exists
for the SQ~-pipe.
Prosf: The scheduling algorithm (algorithm Schedule 2; FIgure 12) is very simple and runs in
time O(qwp3). Notice that this algorithm could also have been used for the proof o/Theorem 7. Its
running time is actually somewhat smaller, but algorithm Schedule 1 may produce better schedules
(in terms of utilization of the pipeline) if the condition of semi-strictness is not required.
7. CONCLUSIONS
In this paper, we have examined a high-level design for a systolic processor array to handle
large batches of simple relational queries with high throughput. This array has the desirable
property of syntactic serializability, and proofs of this are directly derivable from the systolic
structure of its design. Furthermore, serialization can be accomplished with an efficient scheduler,
making the array feasible for use in database machine systems. Less restrictive serializability
constraints are possible, but determining these must involve examining the semantics of the queries
294 A Systolic (VLSI) Array for Processing Simple Relational Queries
to be executed [12]. Such examinations are currently in progress, and should also produce a better
characterization of the power of the SQA, by establishing a precise' mapping between the function
of the systolic array and the class of transactions that it may execute on a database. This mapping
was begun in the characterization of serializability herein reported and serves to show that systolic
structure for VLSI maps readily into other interesting problem domains. Hence, we believe that
the design presented above is strong evidence for the practicality of using systolic designs in high
performance database machines, as well as in other applications.
for each scheduling cycle (wofthem):
for each batch B in the queue (q of them):
1* add B to inatrix M to form test matrix M */
for each liatch Brin M(wofthem):
/* find precedence of Band B */
for each stage in B (at n~ost p): /oreach stage in B (at most p): compare the stages
/* see if there is a cyclk in Af */ '
cOIllPute trans. closure of i\f (time O( w4
schedule some batch; save its M as new M
References
[I] Banerjee, 1., Hsiao, O.K., and Kannan, K.
DBC-ADatabase Computer for Very Large Databases.
IEEE Transactions 011 Computers C- 28( 6):414-429, June, 1979.
[2] Codd, E.F.
A Relational Model of Data for Large Shared Data Banks.
Comlllunicaliollsoflhe ;lei\! 13(6):377-387, June, 1970.
[3] Date, CJ.
An Introduction to Database Systems.
Addison-Wesley, Reading, Mass., 1977.
[4J DeWitt, DJ.
DIRECT-A Multiprocessor Organization for Supporting Relational Database
Management Systems.
IEEE Transactions on ComputersC-28(6):395-406, June, 1979.
[5J Eswaran, K.P., Gray, J.N., Lorie, R.A. and Traiger, I.L.
The Notions of Consistency and Predicate Locks in a Database System.
Communications of the ACl'v119(1l):624-633, November, 1976.
Philip L. Lehman 295
[6J Gray, 1.
Notes on Data Base Operating Systems.
In Bayer, R., Graham, R.M. and SeegmuIler, G. (editors), Lecture Notes ill Computer
Sciellce 60: Operating Systems, pages 393-481. Springer- Ver~ag, Berlin, Gennany,
February. 1978.
[7] Gray, 1.
Private Communication, 1980.
[8] Kung,H.T.
Let's Design Algorithms for VLSI Systems.
In Proc. Conference on Very Large Scale Integration: Architecture, Design, Fabrication,
pages 65-90. California Institute ofTechnology, January,1979.
[9] Kung, H.T. and Lehman. P.L.
Systolic (VLSl) Array~ for Relational Database Operations.
In Proc. ACM-SlGMOD 1980 International Conference on Afanagement ofData, pages 105-
116. ACM, May. 1980.
[10] Kung, H.T. and Lehman, P.L.
Concurrent Manipulation of Binary Search Trees.
ACM Transactions on Database Systems 5(3):354-382, September. 1980.
[11] Kung, I-1.T. and Leiserson, C.E.
Systolic Arrays (for VLSI).
In Duff, I. S. and Stewart. G. W. (editor), Sparse Matrix Proceedings 1978. pages 256-282.
Society for Industrial and Applied Mathematics, 1979.
A slightly different version appears in Introduction to VLSI Systems by C. A. Mead and
L. A. Conway, Addison-Wesley, 1980, Section 8.3.
[12] Kung, H.T. and Papadimitriou, C.H.
An Optimality Theory of Concurrency Control for Databases.
In Proc. ACM SIGMOD 1979 Internatioai Conference on Management of Data, pages 116-
126. ACM, May, 1979.
[13] Lehman, P.L.
The Theory and Design of Systolic Database Machines.
In preparation.
[14] I _orie, R.
Private CommunicatioJ1, 1980.
[15] Papadimitriou, c.B.
The Serializability ofCOJ1current Updates.
Journal of the ACM 26(4):631-653, October, 1979.
[16] Schuster, S.A., Nguyen, H.B., Ozk:irahan, EA, and Smith, K.C.
RAP.2-An Associative Processor for Databases and Its Applications.
IEEE Transactions all COl11puters C-28(6):446-458, June, 1979.
[17) Su, S.Y.W., Nguyen, L.H., Emam. A., and Lipovski, GJ.
The Architectural Features and Implementation Techniques of the Multicell CASSM.
IDlE Transactions all Computers C-28(6):430-445, 1979.
[18] Yannakakis, M., Papadimitri()U, C.H. and Kung, H.T.
Locking Policies: Safety and Freedom from Deadlock.
In Proceedings o/Twentieth Annual Symposium on Foundations of Computer Science, pages
286-297. IEEE. 1979.
[19] Yao, S.B. and Wah, RW.
DIALOG-A Distributed Processor Organization for Database Machine.
Proc. 1980 National Computer Can/. 49:243-253, May, 1980.
A Systolic Data Structure Chip for
Connectivity Problems
Carla Savage
North Carolina State University
Computer Science Department
Raleigh, North Carolina 27650
296
Carla Savage 297
data to be stored and the operations to'be performed on that data would
dictate what sort of data structure chips would be appropriate hardware
to incorporate into the system, under the control of a host machine.
In [7] Kung and Lehman discuss how special purpose chips for performing
data base operations could be incorporated into a data base system.
cell i
i,COMP(i)
BEGIN
IF COlIP (i) = cmax.
J
THEN COHP(i)+cmin.;
J
IF i = uk
THEN cuk+COHP(i)
ELSE IF i = vk
THEN cvk+COHP(i);
cmax.
J
THEN cuk+cmin j ;
IF cVk = cmax j
THEN cVk+cminj
END
Cell n+l needs only to hold values (u,v), cu, cv, cmin, and cmax
for an edge and the logic to execute:
y
cell n + I
,,(u,v) ,,",cv---l> 1
<l--e: (u,v) ,cmln,cmax<J---!
BEGIN
IF cu < cv
THEN BEGIN
cmin+cu;
cmax+cv
END
ELSE BEGIN
cmin+cv;
cmax+cu
END
END
Carta Savage 299
Initially, we assume 'all registers (u,v) , cu, cv, cmin, and cmax
hold the value D. Note the array works correctly. If the operation
UF(e') is inserted before UF(e) and i f UF(e') needs to update COHP(u)
for an endpoint u of e, then when e enters cell u for the first time
to read COMP(u), either e' has already passed through cell u on its
trip left and changed COMP(u), or e and e' will meet in some cell as e
moves right and at that point e can update its cu value to reflect the
change to COMP(u) indicated bye' .
A spanning tree of a connected graph, G, is a connected, acyclic
sub graph of G with the same vertex set as G. The connected component
procedure can be modified to find the edge set of a spanning tree, T,
in the following way. As the edges of G are examined sequentially,
when an edge e is scanned which joins vertices in distinct sets, com-
bine those sets and add e to T. Note that since G is connected, ex-
actly n-l edges will be added to T. Since no edge which joins ver-
tices in distinct components is ever added to T, T contains no cycles.
The UF chip we have described can be used to find a spanning tree
as follows. \lhen an edge e ~ (u,v) enters the right end cell, i f
cu # cv, then u and v are in different components (after processing
all UF instructions preceeding UF(e)), so e is chosen to be an edge of
T. In this case, cmin # cmax and since these values never change as
e moves left, spanning tree edges can be recognized as they leave the
left end cell as those edges with cmin # cmax. Note that to retrieve
this information about an edge takes time D(n). However, a sequence
of m UF operations, including retrievals, can be processed in time
D(m+n). Thus if m>n, we average constant time per operation.
A minimum spanning tree of a vleighted graph, G, is a spanning
tree T such that the sum of the weights of the edges in T is minimum
among all spanning trees of G. It can be shown that if the edges of
G are examined in nondecreasing order of weight, the spanning tree
procedure described above will produce a minimum spanning tree. The
UF chip could be used to find a minimum spanning tree if there were
a way to supply the edges in nondecreasing order of weight: This can
be done by preprocessing the edges using the systolic priority queue
described in [8]. The systolic priority queue is a data structure
chip which maintains a collection of weighted items and is capable of
performing the operations of insertion (constant time) and retrieving
the minimum weight item (constant time). Thus, to find a minimum
spanning tree, the m edges are inserted into the priority queue and
then repeatedly the minimum weight edge is removed and inserted into
the UF chip.
In an area such as image processing where massive amounts of data
are involved, special purpose hardware can be cost effective. The UF
chip described in this paper could be used for storing image data as
well as for solving connectivity problems which arise in region detec-
tion and minimum spanning tree problems which arise in clustering.
3. RELATED HORK
ACKNOWLEDGEMENT
REFERENCES
ABSTRACT
The paper presents techniques to increase the speed of fixed-point
parallel multipliers and reduce the multiplier chip size for VLSI real-
izations. It is shown that a higher order (octal) version of the
Booth's Algorithm will lead to significant improvements in speed,
coupled with a decrease of chip area and power consumption, as compared
to the modified (quaternary) version of the Booth's Algorithm presently
used in or proposed for monolithic multipliers. In addition, further
speed improvements can be obtained by using Wallace trees or optimal
Dadda types of realizations.
The optimal Dadda realizations with minimal number of adders can be
layed out in a regular, rectangular array interleaved with partial pro-
duct generation. The resulting regular structure is suitable for VLSI
implementations. The more complex interconnection wiring which is
needed is shown to be feasable in technologies with at least 3 layers
of interconnections. Layout, interconnection and speed considerations
for the proposed high-speed VLSI parallel multiplier configurations
have been studied.
KEYl>JORDS
Parallel Multiplier, High Speed Multiplier, Wallace Tree Multiplication
Scheme, Booth's Algorithm, Modified Booth's Algorithm.
1. REVIEW OF MULTIPLICATION ALGORITHMS
1.1 REVIEl~ OF SEQUEN'rIAL "ADD AND SHIFT" ~1ULTIPLICATION SCHEMES
Two unsigned numbers can be multiplied by generating partial prod-
ucts one at a time and by adding each partial product to the shifted
sum of all previous partial results. The signed case can be done simi-
larly, except for some correction terms. However a very elegant scheme
was invented by BOOTH [1,2J. Designed for signed and unsigned multipli-
cations, this algorithm generates partial products that are +1,0 or -1
times the multiplicand. Note that the subtraction doesn't pose any par-
ticular difficulty. A complete discussion can be found in, e.g., [3J.
There are time savings also. For a 16x16 case there are 7 levels of
Full Adders versus only 4 in the octal case. If the precalculation
of the times three partial product is done with fast carry propagate
adders, if this precalculation is done during the necessary precharge
time, and if the propagation time for the carry in the Manchester
Adder is 1/4 of the delay D in the Full Adder, then the total delay is
(7 + 4)D in the quat case and (4 + 4lD in the octal case. The speed
improvement is about 25%.
We also expect a decrease of the power consumption, as there are fewer
full adders with restoring logic in the octal case. (They are re-
placed by the pass transistor logic of the multiplexers).
The advantages of the octal case can be extended to other technologies.
The Full Adders are indeed a time critical line in the summing process
of the partial products, while the generation of the partial products
in the multiplexers is less troublesome. As a result the reduction by
50% of the number of adders at the cost of an increased complexity of
the multiplexer will certainly offer similar significant advantages.
4. WALLACE TREE AND DADDA SCHEMES
The multipliers discussed in Section 3 can be realized as a very regular
interconnection of basic cells. The number of adders is minimal, the
computation time however is O(N) and not optimal.
It can be shown that theoretically an O(logN) delay is attainable with
the Wallace Trees [7J or Dadda Schemes [8J. However these tree schemes
were never used in monolithic multipliers, because the speedup they
could give for small N wasn't considered to be significant enough to
justify the more difficult interconnections and the irregularity of
the structure. However for sizes of the order of 16 or higher the use
of the Dadda scheme is attractive, especially in combination with Booth's
algorithm. We worked out an example of a 24x24 multiplier to show that
the routing difficulties are feasible.
For a 24x24 product 13 partial products are generated if the Modified
Booth's Algorithm is used. With a Wallace Tree type of interconnection
scheme these 13 partial products can be summed to form a result in
redundant form (i.e. two bits per column) after 5 Full Adder Delays.
To get the final result the summing has to be completed in some Fast-
Carry-Propagate or Carry-Look ahead Adder. A regular interconnection
scheme for this 24x24 multiplier is shown in Figure 5.
In Figure 5, the bits of the partial products are generated by multi-
plexers. These multiplexers are shaded, and every square represents
a generated bit. For the 24x24 case the partial products are just an
extention of the 8x8 case shown in Figure 3a. The adders are marked
with the level they have in the Wallace tree, the adders of the final
carry chain are marked 7 to 47. The actual interconnections are left
out of Figure 5 for clarity, but we show the expected complexity in
Figure 6. There we give the complete interconnections for the area
marked with dotted lines in Figure 5, where the wiring density is maxi-
mal. (In the final layout of the chip one would of course reorganize
Peter Reusens. et 81 307
REFERENCES
1. "A signed binary multiplication technique", A.D. Booth, Quarterly
Journal of Mechanics and Applied Mathematics, Vol. 4, Part 2, 1951.
2. "A proof of the modified Booth's algorithm for multiplication",
L.P. Rubinfield, IEEE Trans. Computers, Vol. C-24, Oct. 1975,
pp. 1014-1015.
3. Theory and Application of Digital Signal Processing, L. Rabiner
and B. Gold, Prentice-Hall, 1975, Chapter 8, Paragraph 5.
4. "The IBM System 360/Model 91: Floating-point execution unit",
S.F. Anderson et al., IBM Journal, Jan. 1967, pp. 35-53.
5. "The design of a 16 x 16 multiplier", R.T. Masumoto, LAMBDA, First
Quarter 1980, pp. 15-21.
6. "On Parallel Digital Multipliers", L. Dadda, Alta Frequenza, Vol.
45, N. 10, Oct. 1976, pp. 574-580.
7. "A suggestion for a fast multiplier", C.S. Wallace, IEEE Trans. on
Electronic Computers, Feb. 1964, pp. 14-17.
8. "Some schemes for parallel multipliers", L. Dadda, Alta Frequenza,
Vol. 34, 1965, pp. 349-356.
9. "Merged Arithmetic for Signal Processing", E.E. Swartzlander, Jr.,
Fourth IEEE Symposium on Computer Arithmetic, Oct. 25-27, 1978, pp.
239-244.
10. "Multiplication using logarithms implemented with read only memor-
ies", T.A. Brubaker and J.C. Becker, IEEE Trans. Computers, Vol.
C-24, Aug. 1975, pp. 761-765.
11. "Generation of products and quotients using approximate binary
logarithms for digital filtering applications", E. Hall et al.,
IEEE Trans. on Computers, Vol. C-19, N. 2, Feb. 1970.
12. Introduction to VLSI Systems, C. Mead and L. Conway, Addison-Wesley
Publishing Company, 1980.
13. "A compact high-speed multiplication scheme", W.J. Stenzel et al.,
IEEE Trans. on Computers, Vol. C-26, No. 10, Oct. 1977, pp. 948-957.
14. "Two's complement parallel implementation of large multipliers",
H. Kobayashi, T. Yamada, H. Ohara, Proc. of the First IEEE Internat-
ional Conference on Circuits and Computers, pp. 1085-1088.
15. "The area-time complexity of binary multiplication", R.P. Brent
and H.T. Kung, JACM, Vol. 28, No. 3, July 1981, pp. 521-534.
Figure 1: The realization of a (5,5,4) counter with 6 full adders.
Figure 2: A straightforward realization of an unsigned 5x5 parallel multiplier.
Figure 3.a and Figure 3.b: 8x8 parallel multiplier with Booth's modified algorithm; signed operands only (2's complement). Legend: half adder; carry-propagate adder.
Figure 4: Schematic layout of the multiplexer and the interconnections of the full adders for the quaternary case of the modified Booth's algorithm in 4.a, and for the octal case in 4.b. (NMOS; layers: metal, poly, diff.)
Figure 5: Possible layout of the layers of adders and multiplexers for a Wallace Tree interconnection scheme of a parallel 24x24 multiplier.
Figure 6: The complete interconnection scheme of multiplexers and adders for the area indicated with the dashed line in Fig. 5.
A Mesh-Connected Area-Time Optimal VLSI Integer Multiplier
Franco P. Preparata
University of Illinois at Urbana-Champaign
Coordinated Science Laboratory
Thus the network appears as in Figure 1a, in the standard signal flow-
graph notation. (1) If we rearrange the input vector according to the
Figure 2. The flow-graph of Figure 1(b) adapted to the unidimensional array operation. (Note that w⁴ = -w⁰ and w⁶ = -w².)
T = O(R) = O(√N),
i.e. we have realized a mesh multiplier which achieves the lower bound
O(N²) on the AT² measure of complexity. This bound holds both in the
synchronous model of computation and in the one recently proposed by
Chazelle and Monier [10]. Note that in the described network all wires
have minimal length.
References
[10] B. Chazelle and L. Monier, "A model of computation for VLSI with
related complexity results," Tech. Rep., Dept. of Comp. Sci.,
Carnegie-Mellon University, February 1981.
A Regular Layout for Parallel Multiplier of O(Log²N) Time
W.K. Luk
IMAG Laboratory
Computer Architecture Group
B.P.53X
38041 Grenoble Cedex, France
ABSTRACT
An O(log²n)-time n-bit binary multiplier based on the Mead and
Conway VLSI design rules is presented. The layout has a regular, recursive
structure and is directly suitable for practical VLSI implementation.
This multiplier is much faster than the traditional "serial pipeline
multiplier", which has O(n) time, and the "Brent-Kung multiplier", which
has O(√n log n) time. Its layout is of more practical interest than the
multiplier proposed by Preparata and Vuillemin based on the CCC network,
even though that multiplier is optimal and has time O(log²n). The AT²
measure of this multiplier layout is nearly optimal, being O(n²log⁴n),
so it answers the question posed by Brent and Kung concerning the
existence of a practical multiplier having an AT² measure of O(n³).
The detailed VLSI layout and the theoretical and actual exact comple-
xity of the time and area measures will be presented. The actual imple-
mentation of a 16-bit example through the MPC is also discussed.
1. INTRODUCTION
to other area-time results and greatly affects the layout. For example,
as traditionally discussed [Aho & al 74], only three multiplications of
n/2 bits, some additions, and bit testings are used. Since only three mul-
tiplications are needed in the recurrence, it indeed leads to a smaller
area, as seen later; the time complexity remains the same. It is also this
three-multiplication recurrence that improves the sequential multiplica-
tion from O(n²) to O(n^(log 3)) running time; had it been four, no improve-
ment would have been obtained. But since the layout becomes more irregular
than the one proposed using four multiplications, we keep the four-
multiplication version for the time being.
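The three-multiplication recurrence referred to here is the classical one of [Aho & al 74]; a minimal sequential sketch of it (ours, with an assumed terminal case of n ≤ 3 bits, analogous to the trivial terminal multiplier of the layout) shows why three half-size products suffice and why the sequential running time is O(n^(log 3)).

def mult3(x, y, n):
    # Multiply via three recursive half-size products:
    # T(n) = 3 T(n/2) + O(n)  =>  O(n^(log 3)) sequential time.
    if n <= 3:
        return x * y                      # trivial terminal multiplier
    h = n // 2
    xh, xl = x >> h, x & ((1 << h) - 1)
    yh, yl = y >> h, y & ((1 << h) - 1)
    a = mult3(xh, yh, n - h)              # high halves
    d = mult3(xl, yl, h)                  # low halves
    m = mult3(xh + xl, yh + yl, n - h + 1) - a - d   # cross products
    return (a << (2 * h)) + (m << h) + d

import random
for _ in range(100):
    x, y = random.getrandbits(16), random.getrandbits(16)
    assert mult3(x, y, 16) == x * y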
We now present the layout of an n-bit multiplier using the recursive
structure. It is directly suitable for VLSI implementation. The layout
is shown in Figure 1. A and G are the input and output cross-bar networks;
they are the most area-consuming parts, each requiring O(n²) area. C is a
perfect-shuffle network for distributing the results ad and bc to the
adder, and it also has O(n²) area. E is likewise an area of wiring consuming
O(n²) area. B is the area for the four n/2-bit multipliers. The bit size
of the multiplier at each level is half that of the parent, until arriving
at the "deepest" level, where the multiplier is a 1-bit trivial case. One
might choose a 2-bit or 4-bit multiplier as the terminal level, since these
can easily be realized using a ROM or PLA. Area D holds the n-bit adder,
whereas area F holds the 2n-bit adder.
In the recursive layout, each input data line drives the two next lower
levels, and so on recursively. An input driver is indispensable (unless
the fan-in at each level is one) at each level to maintain the signal level,
for practical reasons. A more important reason is that we have assumed
that all other "secondary" devices, e.g. input and output buffers, clock
drivers, wire drivers, wire propagation time, etc., are constant (inde-
pendent of n). The input drivers are thus used to maintain the same recur-
sive structure at all levels. Otherwise, a large overall input driver would
be needed for all the 2 log n fan-ins, and it would also be size-dependent.
These input drivers give rise to an overall time delay of O(log n), but
they form a pipeline structure which might be useful in some applications.
This also points out an O(log n) time lower bound for this type of recursive
structure. There is also an interesting aspect of area-time trade-off at
the I/O ports: the O(log n) delay can be eliminated by using a single input
driver of sufficiently large area. For the present multiplier layout,
neither choice would affect the overall area-time complexity, since both
are of lower order of magnitude than the rest (adding time, wire area),
see equations (2), (3); they affect it only up to a constant factor.
3. AREA-TIME COMPLEXITY
The cost for wiring is quite high in this layout, as in the case of
many other VLSI layouts in which the cost for communication (wire) is of
a higher order of magnitude than the cost for computation (transistor).
This is an important point in present-day silicon chip design.
The multiplier time is dominated by the performance of the adders; the
choice of the type of adder is critical.
The discussed recursive structure allows us to calculate the area and
time complexity directly. The areas of A, C, E, G are O(n²) as discussed.
Assuming that the terminal level is n=1, the area for level n is given
by the recursive equation:
takes the same O(n²) area with w(n) = h(n) = O(n), it is optimal for
this recursive layout. Here w and h stand for width and height respective-
ly. The results obtained for the various multipliers are summarized in
Table 1.
In practice, it is not desirable to have a chip laid out in the
form of a "long bar", in other words with the width of a higher order of
magnitude than the height or vice versa, i.e. w(n) = o(h(n)) or h(n) =
o(w(n)), where w and h are the width and height of the layout respecti-
vely. We are going to show that in our recursive layout the width and
the height have the same order of magnitude for all n.
At level n, the height and width are given by (refer to Fig. 1):

h(n) = max{4w(n/2), A1·n}
w(n) = h(n/2) + C1·n + C2·wa(n)     (7)

A1 is the height constant of the 2n-bit adder; C1 is the width constant
of all the input cross-bar wires, the perfect-shuffle network, and the
wires at E and G; wa(n) is the total width of the two n-bit and 2n-bit
adders: wa(n) = n for the carry-propagate adder and wa(n) = log n for
the Brent-Kung adder. For the asymptotic study, wa(n) is immaterial since
it is of a lower order of magnitude than the other terms. For small n,
A1·n > 4w(n/2), and the solution of equation (7) is h(n) = w(n) = O(n). For
sufficiently large n, 4w(n/2) > A1·n, and equation (7) becomes:
h(n) = 4w(n/2)
w(n) = h(n/2) + O(n)     (8)

Solving (8), we have h(n) = w(n) = O(n log n), and so the layout takes
O(n²log²n) area. This is a factor of O(log n) higher than the O(n²log n)
area already proved for this recursive layout. There remains some im-
provement possible in the running of the wires (how the different blocks
communicate with the others in such a way that the area can be decreased
by a factor of log n).
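A quick numeric check of recurrences (7) and (8) (our sketch; A1 and C1 are set to 1 and the lower-order wa(n) term is dropped, which the asymptotic argument above allows) confirms that height and width stay within a constant factor of each other and grow as Θ(n log n).

from functools import lru_cache
from math import log2

A1 = C1 = 1.0          # assumed unit constants; only growth rates matter

@lru_cache(maxsize=None)
def h(n):
    return float(n) if n <= 2 else max(4 * w(n // 2), A1 * n)

@lru_cache(maxsize=None)
def w(n):
    return float(n) if n <= 2 else h(n // 2) + C1 * n

for k in range(4, 14):
    n = 2 ** k
    # first ratio stays bounded, second tends to a constant:
    # h(n) and w(n) are both Theta(n log n)
    print(n, round(h(n) / w(n), 2), round(h(n) / (n * log2(n)), 2))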
[Table 1: A, T, and AT² for the various multipliers.]
parts are the design of the adder (a fast Brent-Kung adder or a simpler,
slower carry-propagate adder), the basic multiplier, and the register-to-
register data flow timed by a two-phase non-overlapping clock φ1 and φ2.
Some practical evaluation will also be discussed.
We fix the terminal level of the multiplier at n=2. A 2-bit multi-
plier can easily be implemented using a PLA or simple logic. The compu-
ting time and area necessary to do a 2-bit multiplication are then cons-
tants depending on the technology and circuit optimization. Let this time
be τmult and assume such a computation can be completed between two suc-
cessive φ1 and φ2 of the two-phase clock, i.e. τmult < T/2 where T is
the period of either φ1 or φ2.
We first give the design of the carry-propagate adder since it is
simpler. An n-bit carry-propagate adder consists of n full adders, each of
which can be implemented by simple logic. At each level, say n, the n-bit
and 2n-bit adders are formed by cascading the full adders. The time
and area required by a full adder are also constants depending on the
technology. Let this time be τadd and assume it is less than the
time between two successive φ1 and φ2, i.e. τadd < T/2. So an n-bit addi-
tion requires n consecutive φ1 and φ2 clocks and takes O(n) area. Let
T(n) be the time, in terms of the total number of clock pulses of φ1 and
φ2, necessary to perform a multiplication at level n, including the first
clock pulse φ1 for input. Hence we have:
T(2) = 1
T(n) = 1 + T(n/2) + n + 2n = T(n/2) + 3n + 1     (9)
or T(n) = 6n + log₂n - 12
T(2) = 1 means that a 2-bit multiplication requires only a single clock
pulse, as indicated in the previous paragraph. T(n/2) is the time for in-
putting data and computing at level n/2, n+2n is the time for the two
n-bit and 2n-bit additions, and the 1 is for inputting data at level n. In
the expression for T(n), the term 6n accounts for the computations, whereas
the term log₂n accounts for the input pipeline. This exact evaluation of
time will later be used for comparison with the results obtained using the
Brent-Kung adder, presented hereafter.
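Recurrence (9) and its closed form can be checked mechanically; the few lines below (ours, not the paper's) unfold the recurrence and verify the closed form 6n + log2(n) - 12.

from math import log2

def T(n):
    # Equation (9): T(2) = 1, T(n) = T(n/2) + 3n + 1 clock pulses
    return 1 if n == 2 else T(n // 2) + 3 * n + 1

for n in (2, 4, 8, 16, 32, 64):
    assert T(n) == 6 * n + int(log2(n)) - 12   # the stated closed form
    print(n, T(n))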
The Brent-Kung adder [BreKun 79 B] modifies the slow carry propagation
of ordinary adders and allows all the carries to be computed in parallel
in O(log n) time. Let a_i, b_i, c_i, i=1,...,n, be the two n-bit inputs and
the n carries respectively, and s_i, i=1,...,n+1, be the n+1 outputs. An
ordinary adder is then defined by:

c_0 = 0
s_i = a_i ⊕ b_i ⊕ c_{i-1},   i=1,...,n     (10)
c_i = (a_i ∧ b_i) ∨ ((a_i ⊕ b_i) ∧ c_{i-1})
s_{n+1} = c_n

where ⊕, ∧, ∨ represent modulo-2 addition, logic AND, and logic OR res-
pectively. In the Brent-Kung scheme, we further define g_i, p_i, i=1,...,n, as

g_i = a_i ∧ b_i
p_i = a_i ⊕ b_i,   i=1,...,n     (11)

Then a binary adder can be written as equation (12), shown in Figure 2:

c_0 = 0
s_i = p_i ⊕ c_{i-1},   i=1,...,n     (12)
c_i = g_i ∨ (p_i ∧ c_{i-1})

All the g_i, p_i can be found in constant time, which leaves the problem
of constructing all the c_i from the g_i, p_i efficiently.
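The sketch below (ours) evaluates equations (10)-(12) bit-serially. The point of the Brent-Kung construction, not shown here, is that the carry step c_i = g_i ∨ (p_i ∧ c_{i-1}) can instead be computed for all i in O(log n) levels, because the underlying (g, p) composition is associative.

def add_gp(a_bits, b_bits):
    # a_bits, b_bits: least-significant-bit-first lists of 0/1.
    n = len(a_bits)
    g = [a & b for a, b in zip(a_bits, b_bits)]   # (11): g_i = a_i AND b_i
    p = [a ^ b for a, b in zip(a_bits, b_bits)]   #       p_i = a_i XOR b_i
    c = [0] * (n + 1)                             # (12): c_0 = 0
    for i in range(n):
        c[i + 1] = g[i] | (p[i] & c[i])           # c_i = g_i OR (p_i AND c_{i-1})
    s = [p[i] ^ c[i] for i in range(n)]           # s_i = p_i XOR c_{i-1}
    return s + [c[n]]                             # s_{n+1} = c_n

bits = lambda x, n: [(x >> i) & 1 for i in range(n)]
out = add_gp(bits(11, 4), bits(6, 4))
assert sum(b << i for i, b in enumerate(out)) == 17   # 11 + 6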
n                        2    4    8   16   32   64
carry-propagate adder    1   14   39   88  185  382
CONCLUSION
ACKNOWLEDGEMENT
The author would like to thank Prof. F. ANCEAU and the members of
the Research Team of Computer Architecture of the IMAG Laboratory in
Grenoble for providing facilities for this research, and Mme Chaland for
her help in typing this manuscript.
REFERENCES
[Tho 79] C.D. THOMPSON, Area-time complexity for VLSI, Proc. of the 11th
Annual ACM Symposium on the Theory of Computing (SIGACT), May 1979,
pp. 81-88.
[Tho 80] C.D. THOMPSON, A complexity theory for VLSI, PhD Thesis, Dept.
of Computer Science, Carnegie-Mellon University, Pittsburgh, 1980.
[Vui 80] J. VUILLEMIN, A combinatorial limit to the computing power of
VLSI circuits, Proc. of the 21st Symposium on Foundations of Compu-
ter Science, Syracuse, NY, October 1980, pp. 294-300.
[FIGURE 1: RECURSIVE LAYOUT OF AN n-BIT MULTIPLIER (n = 8, NOT TO SCALE), showing the four sub-multipliers AC, AD, BC, BD and the areas C, D, F, G leading to the output xy.]
[FIGURE 2: A BRENT-KUNG ADDER.]
[Timing chart: overlapped 4-bit, 8-bit, and 16-bit multiplications with their 8-bit, 16-bit, and 32-bit additions at successive levels n = 4, 8, 16.]
VLSI Implementations of a Reduced Instruction Set Computer
Daniel T. Fitzpatrick, et al.

1. INTRODUCTION
A general trend in computers today is to increase the complexity of archi-
tectures commensurate with the increasing potential of implementation techno-
logies. Consequences of this complexity are increased design time, more design
errors, inconsistent implementations, and the delay of single-chip implementa-
tion [7]. The Reduced Instruction Set Computer (RISC) Project investigates a
VLSI alternative to this trend. Our initial design is called RISC I.
A judicious choice of a small set of the most often used instructions, com-
bined with an architecture tailored to efficient execution of this set, can yield a
machine of surprisingly high throughput. In addition, a single-chip implementa-
tion of a simpler machine makes more effective use of limited resources such as
the number of transistors, area, and power consumption of present-day VLSI
chips [6]. Simplicity of the instruction set leads to a small control section, a
comparatively short machine cycle, and a reduced design cycle time.
Students taking part in a multi-term course sequence designed two
different NMOS versions of RISC I. The "Gold" group (Fitzpatrick, Foderaro,
Peek, Peshkess, and Van Dyke) designed a complete 32-bit microprocessor,
currently being fabricated. The "Blue" group (Katevenis and Sherburne)
started from the same basic organization but introduced a more sophisticated
timing scheme in order to shorten the machine cycle and also reduce chip area.
At present, only the data path of this more ambitious design has been com-
pleted. The chips were designed using only horizontal and vertical lines
("Manhattan" design), with the simple and scalable Mead-Conway design rules
(fabrication: λ = 2 microns, no buried contacts).
At the outset of the design of RISC I we defined the following goals and con-
straints: (a) find a reasonable compromise between high performance for high-
level language programs and a simple, single-chip implementation; (b) make the
size of all instructions equal to one word and execute all instructions in one
machine cycle; (c) emphasize register-oriented instructions and restrict
memory access to the LOAD and STORE instructions. The resulting architecture
has 31 instructions in two formats, uses 32-bit addresses, and supports 8-, 16-,
and 32-bit data.
The chip area saved by the simplicity of the control circuitry was devoted to a
very large set of 32-bit registers. This permits the processor to allocate a new
set of registers for each procedure call and thus avoids the overhead of saving
registers in memory. By overlapping the windows of registers, parameters
may be passed to a procedure by simply changing a pointer.
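A minimal sketch of the idea (ours; the window size, overlap, and register count are illustrative toy values, not the RISC I ones):

class RegisterWindows:
    WINDOW, OVERLAP = 8, 4          # assumed toy sizes

    def __init__(self, total=64):
        self.regs = [0] * total     # one large physical register file
        self.base = 0               # current window pointer

    def phys(self, i):              # logical register i of the current window
        return self.base + i

    def call(self):                 # a call just slides the pointer: the
        self.base += self.WINDOW - self.OVERLAP
                                    # caller's "out" registers become the
                                    # callee's "in" registers
    def ret(self):
        self.base -= self.WINDOW - self.OVERLAP

cpu = RegisterWindows()
cpu.regs[cpu.phys(5)] = 42    # caller writes an outgoing parameter
cpu.call()
print(cpu.regs[cpu.phys(1)])  # callee reads it: 42, with no memory traffic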
This so-called overlapped register window scheme [1] is largely responsible
for the surprisingly good performance obtained in simulation of high-level
language programs. Simulations of benchmark programs written in C indicate
that RISC I can run faster than many commercial minicomputers. Table 1 shows
the size and execution time of six C programs on RISC I, assuming a machine
cycle of 400 nanoseconds and 250 nsec instruction fetches. Also in the table are
the VAX 11/780, a 32-bit Schottky-TTL minicomputer with a 200 ns microcycle
time, and the Z8002, a 16-bit NMOS microprocessor with a microcycle time of
250 ns. Even though the Z8002 is using only 16-bit addresses and data while
RISC is using 32-bit addresses and data, RISC programs are typically only 10%
larger while running about four times faster. The byte-variable length of VAX
instructions reduces program size by about a third; on the other hand, every C
program that we have run on the RISC simulator has been faster than on the VAX.
In addition to good performance, in this paper we show that the design of
RISC I was also several times faster and required only one-fifth the manpower of
comparable machines. The most visible impact of the reduced instruction set is
that the area dedicated to control logic has dropped from 50% in typical com-
mercial microprocessors to only 6% in RISC I.
Table 1. RISC Program Size and Execution Times Relative to a VAX 11/780 and a Z8000.
2. MICRO-ARCHITECTURE
The simplicity and regularity of RISC permits most instruction executions
to follow the same basic pattern: (1) read two registers, (2) perform an opera-
tion on them, and (3) store the result back into a register. Jump, call, and
return instructions add a register (possibly the PC) and an offset and store the
result into the appropriate PC latch. The load and store instructions violate the
original constraint: in order to allow enough time for access of the main
memory, they add the index register and immediate offset during the first cycle,
and perform the memory access during an additional cycle. In all cases, while
the processor is executing the first cycle of an instruction, the next instruction
is fetched from memory.
[Figure 1: Block diagram of the RISC I data path, showing the register file, ALU, shifter, PC incrementers, and the data and address paths to and from memory.]
temporary latch (DST), and is only written into the register file during the
operation phase of the next cycle. Special "internal forwarding" circuitry takes
care of instructions that use the result of their immediately previous one.
In effect, each instruction now requires three cycles: (1) instruction fetch and
decode; (2) register read, operate, and temporary latching of the result; (3) write
result into the register file. However, in the Blue design, these three operations are
pipelined so that a new instruction begins each cycle (except loads/stores).
Both designs multiplexed the address and data pins, as we could not afford
to use 64 separate lines with current packaging technology. Power consumption
for the Gold chip is estimated to be between 1.2 and 1.9 Watts.
3. CHIP ANALYSIS
We have analyzed the completed Gold chip and the data path of the Blue
chip. Table 2 shows how the chip resources are allocated to the various func-
tional blocks in the Blue and Gold designs (figures for Blue control and I/O are
estimates). These resources are measured in millions of square lambda, thousands
of transistors, and thousands of rectangles. We can make several observations
based on this table:
(1) Control is only 6 to 8% of the total chip. Even if there were no registers,
it would still be only 12%. Section 5 compares this important result to
commercial microprocessors and discusses its significance.
(2) The Blue data path is considerably more compact than the Gold data path.
The register file pitch (42 λ in Blue, 77 λ in Gold) determined the height of
the data path in both versions; the rest of the modules were designed
accordingly. The compactness of the cross-coupled static RAM cell allowed
the 138 32-bit registers of the Blue design to occupy about half the area of
the 78 32-bit registers of the Gold design.
(3) SCAN IN/SCAN OUT (SISO) is less than 5% of the chip. SISO is a technique
which improves chip testability by allowing access to each state bit in a
module. The flip-flops are connected together as a long shift register, allow-
ing serial reading and writing. The Gold chip has complete SISO on the
shifter, ALU, control, and some of the PC registers. As we had spare pins we
used 11 pins for SISO (4% of chip area); we could have used fewer.
Table 2. Area, transistors, and rectangles per Blue and Gold functional block.
4. TOOLS
There is no doubt in our minds that the key to successful VLSI design lies in
appropriate software. While we have plans for a sophisticated design environ-
ment [5], for the moment we have to work with a rather small subset. We used
more than a dozen programs to design our chips. Among those, we felt that the
following six tools were invaluable; we cannot imagine how we would have been
able to get this far without them:
CAESAR color graphics editor John Ousterhout U.C. Berkeley
CIFPLOT plot of mask layers Dan Fitzpatrick U.C. Berkeley
MEXTRA Manhattan circuit extractor Dan Fitzpatrick U.C. Berkeley
SLANG multi-level simulator John Foderaro U.C. Berkeley
MOSSIM switch level simulator Bryant/Terman M.I.T.
DRC layout rules checker Clark Baker M.I.T.
MKPLA PLA generator Howard Landman U.C. Berkeley
Other tools used to some degree and which were particularly helpful for our pro-
ject include:
PRESTO PLA minimizer Fang/Newton U.C. Berkeley
EQNTOTT PLA equation translator Robert Cmelik U.C. Berkeley
SPICE circuit simulator Pederson/Newton U.C. Berkeley
STAT electrical rules checker Forest Baskett Stanford/Xerox PARC
POWEST power estimator Robert Cmelik U.C. Berkeley
A key factor was the glue that held all these various tools together and pro-
duced a cohesive design environment; this function was provided by the UNIX
operating system (4th Berkeley Software Distribution) [2] running on a DEC VAX
11/780.
We started the designs with an ISPS description† of RISC I and a block
diagram similar to Figure 1. The logic, circuits, and initial layouts were
designed on paper, and then entered into Caesar. As the designers became
more comfortable with Caesar, they used it to do all layout. Caesar converts the
graphical description into CIF (the format needed for fabrication submission, as
well as for some of our tools, like CIFPLOT and DRC). The information necessary
to run STAT, MOSSIM, and POWEST was extracted from this same CIF by MEXTRA.
After the bottom level modules were designed, we used SLANG to completely
describe the chip at a mixture of functional and logical levels. We then ran RISC
diagnostic programs on this description to uncover errors. The SLANG descrip-
tion was also used to specify many of the remaining connections in the chip and
to drive the PLA tools to automatically produce the PLA's for RISC. Howard
Landman acted as a "roving critic", scanning for errors that were overlooked by
the design tools.
† The ISPS description was not very useful, because the ISPS simulator does not run on the VAX
(the machine we use to do our design); also, while useful for architecture descriptions, it is awkward
for describing implementations.
The final step was to use SLANG and MOSSIM to compare the original
description with the final masks. SLANG was used to interactively compare,
every half clock phase, the values of hundreds of nodes in both the SLANG
description simulation and the MOSSIM simulation. We ran about a dozen diag-
nostic programs and uncovered several errors.
The submitted Gold design successfully passed the checks provided by all of
the tools.
Table 3. Design metrics for the Z8000, MC68000, iAPX 432 chips, and RISC I.
[Rows: regularization factor, register pitch (microns), size of control (area in sq. mil), percent control, elapsed time to first silicon (months), design time (man-months), and layout time (man-months).]
a There are two ways to count drawn transistors; the pessimistic approach counts every transis-
tor in a cell even if it is derived from simple modifications to a basic cell. The optimistic approach only
counts the transistors that were changed. For the Gold chip the difference is the transistor count for
the register decoders. The optimistic count saves 433 drawn transistors, thereby increasing the regu-
larity factor.
b We estimated these sizes from the photomicrographs of the commercial chips.
c Data provided by Rattner of Intel [10].
d We counted elapsed time from the beginning of the first class to the end of the last class, plus
three months, which should be the time for fabrication (we hope).
e Since the designers also did layout this is a somewhat fuzzy distinction. All work before
1/1/81 is considered design and we have included circuit design as part of layout.
modified the day before the design was submitted for mask generation! An
industrial product can rarely afford such luxury.
6. FINAL COMMENTS
The RISC Project has had a synergistic effect on research at Berkeley in
architecture, VLSI, and CAD. Often, useful tools were created in response to
specific needs. For example, a special extractor for Manhattan geometry was
written because the older extractor, able to handle general geometry, would
have taken too long to find all 44,000 transistors of the Gold chip. As a result of
this synergism our design environment has experienced a dramatic improve-
ment within the last six months.
The gains in performance of the design tools from their restriction to Manhat-
tan designs were well worth the inconveniences in layout; we can find only
small areas in each chip where non-Manhattan geometry could save space.
While we realize that more work is necessary to turn RISC into a full-fledged
microcomputer, we also believe that the most difficult and time-consuming por-
tion of that task has been completed. These results can be duplicated by indus-
try; reduction in elapsed design time, reduction in manpower, and high perfor-
mance are available to those who are willing to take calculated risks.
7. ACKNOWLEDGEMENTS
The RISC project was aided by several people at Berkeley and other places.
We would like to thank them all, but give special thanks to a few:
John Ousterhout created Caesar, the main interface of the designers. The relia-
bility and quality of this graphics editor and his responsiveness to our needs is a
major reason for our reduced design time. Richard Newton allowed us to use his
graduate class to resolve issues related to RISC. We also want to thank others in
the Berkeley community for the use of their tools: Bob Cmelik, Sheng Fang,
Richard Newton, and Donald Pederson. We would especially like to thank the
people in the ARPA-VLSI community who shared their tools: Randy Bryant and
Chris Terman for the switch-level simulator MOSSIM and Clark Baker for the
layout rule checker DRC. In addition to these tools from MIT, from Stanford we
received STAT, a static electrical rules checker created by Forest Baskett. We
gratefully acknowledge helpful discussions with Osamu Tomisawa in the areas of
MOS circuit design and processing. Jim Beck, Bob Cmelik, and Robert Byerle
provided valuable information and ideas on testing the chip. We would also like
to thank the visitors from industry who gave us valuable suggestions on our
designs: Les Crudele from Motorola, Dick Lyon from Xerox PARC, and Peter Stoll
from Intel. Thanks go to Bob Fabry, Richard Fateman, Bill Joy, and Bob Kridle
for providing the computing environment necessary to complete our designs.
We would like to thank Danny Cohen and Lee Richardson of the MOSIS group (U.
of Southern California - Information Sciences Institute), and Alan Bell and Al
Paeth of Xerox Palo Alto Research Center, for fabricating our chips. Finally, we
would like to thank Duane Adams and DARPA for providing the resources that
allow universities to attempt high-risk projects.
This research was sponsored by the Defense Advanced Research Projects
Agency (DoD), ARPA Order No. 3803, and monitored by Naval Electronic Systems
Command under Contract No. N00039-78-G-0013-0004.
8. REFERENCES
1. Frank, E.H. and Sproull, R.F., "An Approach to Debugging Custom
Integrated Circuits," Carnegie-Mellon Computer Science Research Review
1979-80, pp. 21-36 (July 1981).
2. Joy, W.N., Fabry, R.S., and Sklower, K., Seventh Edition, Virtual VAX-11
Version, Computer Science Division, Dept. of EECS, U.C. Berkeley, June 1981.
3. Lattin, W.W., Bayliss, J.A., Budde, D.L., Colley, S.R., Cox, G.W., Goodman, A.L.,
Rattner, J.R., Richardson, W.S., and Swanson, R.C., "A 32b VLSI Micromain-
frame Computer System," Proc. IEEE International Solid-State Circuits
Conference, pp. 110-111 (February 1981).
4. Lattin, W.W., Bayliss, J.A., Budde, D.L., Rattner, J.R., and Richardson, W.S., "A
Methodology for VLSI Chip Design," Lambda, pp. 34-44 (Second Quarter
1981).
5. Newton, A.R., Pederson, D.O., Sangiovanni-Vincentelli, A.L., and Séquin, C.H.,
"Design Aids for VLSI: The Berkeley Perspective," IEEE Transactions on Cir-
cuits and Systems (July 1981).
6. Ousterhout, J., "Caesar: An Interactive Editor for VLSI Circuits," VLSI
Design II(4) (Fourth Quarter 1981), to appear.
7. Patterson, D.A. and Ditzel, D.R., "The Case for the Reduced Instruction Set
Computer," Computer Architecture News 8(6), pp. 25-33 (15 October 1980).
8. Patterson, D.A. and Séquin, C.H., "Design Considerations for Single-Chip
Computers of the Future," IEEE Transactions on Computers C-29(2), pp.
108-116 (February 1980). Joint Special Issue on Microprocessors and Micro-
computers.
9. Patterson, D.A. and Séquin, C.H., "RISC I: A Reduced Instruction Set VLSI
Computer," Proc. Eighth International Symposium on Computer Architec-
ture, pp. 443-457 (May 1981).
10. Rattner, J.R., Private Communication, August 1981.
MIPS: A VLSI Processor Architecture
John Hennessy, et al.

1 Introduction
1. The RISC architecture is simple, both in the instruction set and in the hardware needed to
implement that instruction set. Although the MIPS instruction set has a simple hardware
implementation (i.e. it requires a minimal amount of hardware control), the user-level
instruction set is not as straightforward, and the simplicity of the user-level instruction set is
secondary.
2. The thrust of the RISC design is towards the efficient implementation of a straightforward
instruction set. In the MIPS design, high performance from the hardware engine is a primary
goal, and the microengine is presented to the end user with a minimal amount of interpretation.
This makes most of the microengine's parallelism available at the instruction set level.
3. The RISC project relies on a straightforward instruction set and straightforward compiler
technology. MIPS will require more sophisticated compiler technology and will gain significant
performance benefits from that technology.
MIPS is designed for high performance. To allow the user to get maximum performance, the
complexity of individual instructions is minimized. This allows the execution of these instructions at
significantly higher speeds. To take advantage of simpler hardware and an instruction set that easily
maps to the microinstruction set, additional compiler-type translation is needed. This compiler
technology makes a compact and time-efficient mapping between higher-level constructs and the
simplified instruction set. The shifting of the complexity from the hardware to the software has
several major advantages:
The complexity is paid for only once, during compilation. When a user runs his program on a
complex architecture, he pays the cost of the architectural overhead each time he runs his
program.
It allows the concentration of energies on the software, rather than on constructing a complex
hardware engine, which is hard to design, debug, and efficiently utilize. Software is not
necessarily easier to construct, but the VLSI environment makes hardware simplicity important.
The design of a high performance VLSI processor is dramatically affected by the technology.
Among the most important design considerations are the effect of pin limitations, available silicon
area, and size/speed tradeoffs. Pin limitations force the careful design of a scheme for multiplexing
the available pins, especially when data and instruction fetches are overlapped. Area limitations and
the speed of off-chip communication require choices between on- and off-chip functions as well
as limiting the complete on-chip design. With current state-of-the-art technology, either some vital
component of the processor (such as memory management) must be off-chip, or the size of the chip
will make both its performance and yield unacceptably low. Choosing which functions are migrated
off-chip must be done carefully so that the performance effects of the partitioning are minimized. In
some cases, through careful design, the effects may be eliminated at some extra cost for high-speed
off-chip functions.
Speed/complexity/area tradeoffs are perhaps the most important and difficult phenomena to
deal with. Additional on-chip functionality requires more area, which also slows down the
performance of every other function. This occurs for two equally important reasons: additional
control and decoding logic increases the length of the critical path (by increasing the number of active
elements in the path), and each additional function increases the length of internal wire delays. In the
processor's data path these wire delays can be substantial, since they accumulate both from bus delays,
which occur when the data path is lengthened, and control delays, which occur when the decoding and
control is expanded or when the data path is widened. In the MIPS architecture we have attempted to
control these delays; however, they remain a dominant factor in determining the speed of the
processor.
2 The microarchitecture
The fastest execution of a task on a microengine would be one in which all resources of the
microengine were used at a 100% duty cycle performing a nonredundant and algorithmically efficient
encoding of the task. The MIPS microengine attempts to achieve this goal. The user instruction set is
an encoding of the microengine that makes a maximum amount of the microengine available. This
goal motivated many of the design decisions found in the architecture.
MIPS is a load/store architecture, i.e. data may be operated on only when it is in a register and
only load/store instructions access memory. If data operands are used repeatedly in a basic block of
code, having them in registers will prevent redundant loads/stores and redundant addressing
calculations; this allows higher throughput, since more operations directly related to the computation
can be performed. The only addressing modes supported are immediate, based with offset, indexed,
or base shifted. These addressing modes may require fields from the instruction itself, general
registers, and one ALU or shifter operation. Another ALU operation, available in the last stage of
every instruction, can be used for a (possibly unrelated) computation. Another major benefit derived
from the load/store architecture is the simplicity of the pipeline structure. The simplified structure has a
fixed number of pipestages, each of the same length. Because the stages can be used in varying (but
related) ways, pipeline utilization improves. Also, the absence of synchronization
between stages of the pipe increases the performance of the pipeline and simplifies the hardware.
The simplified pipeline eases the handling of both interrupts and page faults (see Section 4.2).
Although MIPS is a pipelined processor, it does not have hardware pipeline interlocks. The six-
stage pipeline contains three active instructions at any time; either the odd or the even pipestages are
active. The major pipestages and their tasks are shown in Table 1.
Interlocks that are required because of dependencies brought out by pipelining are not
provided by the hardware. Instead, these interlocks must be statically provided where they are
needed by a pipeline reorganizer. This has two benefits:
1. A more regular and faster hardware implementation is possible, since it does not have the usual
complexity associated with a pipelined machine. Hardware interlocks cause small delays for all
instructions, regardless of their relationship to other instructions. Also, interlock hardware
tends to be very complex and nonregular [3, 5]. The lack of such hardware is especially
important for VLSI implementations, where regularity and simplicity are important.
2. Rearranging operations at compile time is better than delaying them at run time. With a good
pipeline reorganizer, most cases where interlocks are avoidable should be found and taken
advantage of. This results in performance better than that of a comparable machine with hardware
interlocks, since the usage of resources will not be delayed. In cases where this is not detected or is
not possible, no-ops must be inserted into the code (see the sketch below). This does not slow down
execution compared to a similar machine with hardware interlocks, but it does increase code size. The
shifting of work to a reorganizer would be a disadvantage if it took excessive amounts of
computation. It appears this is not a problem for our first reorganizer.
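A toy version of the no-op insertion pass (our sketch, not the MIPS reorganizer itself; a real reorganizer would first try to reorder, and the one-cycle hazard distance is an assumption):

def insert_noops(code, hazard=1):
    # code: list of (dest, sources) register names; None dest marks a no-op.
    out = []
    for dest, srcs in code:
        # registers written by the last `hazard` emitted instructions
        busy = {d for d, _ in out[-hazard:] if d is not None}
        while busy & set(srcs):
            out.append((None, ()))                 # pad with a no-op
            busy = {d for d, _ in out[-hazard:] if d is not None}
        out.append((dest, srcs))
    return out

prog = [("r6", ("r4", "r1")),    # ld  (r4,r1),r6
        ("r9", ("r6", "r7"))]    # add r6,r7,r9 -- needs r6 one cycle too soon
print(insert_noops(prog))        # a no-op appears between the two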
In the MIPS pipeline, resource usage is permanently allocated to the various pipe stages. Rather
than having pipeline stages compete for the use of resources through queues or priority schemes, the
machine's resources are dedicated to specific stages so that they are 100% utilized (see Figure 1). To
achieve 100% utilization, primitive operations in the microengine (e.g., load/store, ALU operations)
must be completely packed into macroinstructions. This is not possible, for three reasons:
1. Dependencies can prevent full usage of the microengine, for example when a sequence of
register loads must be done before an ALU operation or when no-ops must be inserted.
2. An encoding that preserved all the parallelism (i.e., the microcontrol word itself) would be too
large. This is not a serious problem, since many of the possible microinstructions are not useful.
Time →
     1    2    3    4    5    6    7    8    9    10
     IF   ID   OD   OS   OF   EX
               IF   ID   OD   OS   OF   EX
                         IF   ID   OD   OS   OF   EX

[Figure 1: Three overlapped instructions in the six-stage pipeline, and the dedication of the instruction memory, ALU, and data memory to fixed pipe stages.]
3. The encoding of the microengine presented in the instruction set sacrifices some functional
specification for immediate data. In the worst case, space in the instruction word used for
loading large immediate values takes up the space normally used for a base register,
displacement, and ALU operation specification. In this case the memory interface and ALU
cannot be used during the pipe stage to which they are dedicated.
Nevertheless, the first results on microengine utilization are encouraging. Many instructions fully utilize
the major resources of the machine. Other instructions, such as load immediate, which use few of the
resources of the machine, would mandate greatly increased control complexity if overlap with
surrounding instructions were attempted in an irregular fashion.
MIPS has one instruction size, and all instructions execute in the same amount of time (one
data memory cycle). This choice simplifies the construction of code generators for the architecture
(by eliminating many nonobvious code sequences for different functions) and makes the construction
of a synchronous, regular pipeline much easier. Additionally, the fact that each macroinstruction is a
single microinstruction of fixed length and execution time means that a minimum amount of internal
state is needed in the processor. The absence of this internal state leads to a faster processor and
minimizes the difficulty of supporting interrupts and page faults.
ALU resources: A high-speed, 32-bit carry-lookahead ALU with hardware support for multiply
and divide, and a barrel shifter with byte insert and extract capabilities. Only one of the ALU
resources is usable at a time. Thus, within the class of ALU resources, the functional units cannot
be fully used even when the class itself is used 100%.
Internal bus resources: Two 32-bit bidirectional busses, each connecting almost all functional
components.
On-chip storage: Sixteen 32-bit general-purpose registers.
Memory resources: Two memory interfaces, one for instructions and one for data. Each of the
parts of the memory resource can be 100% utilized (subject to packing and instruction space
usage), because one store or load from data memory and one instruction fetch can occur
simultaneously.
A multistage PC unit: An incrementable current PC with storage for up to two branch targets as
well as six previous PC values. These are required by the pipelining of instructions and by
interrupt and exception handling.
All MIPS instructions are 32 bits. The user instruction set is a compiler-based encoding (i.e.
code-generation efficiency is used to choose among alternative instructions) of the micromachine. Multiple
simple (and possibly unrelated) instruction pieces are packed together into an instruction word. The
basic instruction pieces are:
1. ALU pieces - these instructions are all register/register (2 and 3 operand formats). They all use
less than 1/2 of an instruction word. Included in this category are byte insert/extract, a two-bit
Booth's multiply step, and a one-bit nonrestoring divide step.
2. Load/store pieces - these instructions load and store memory operands. They use between 16
and 32 bits of an instruction word. When a load instruction is less than 32 bits, it may be
packaged with an ALU instruction, which is executed during the Execution stage of the
pipeline.
3. Control flow pieces - these include straight jumps and compare instructions with relative jumps.
MIPS does not have condition codes, but includes a rich collection of set-conditional and
compare-and-jump instructions. The set-conditional instructions provide a powerful
implementation for conditional expressions (see the sketch after this list). They set a register to
all 1's or 0's based on one of 16 possible comparisons done during the operand decode stage.
During the Execution stage an ALU operation is available for logical operations with other
booleans. The compare-and-jump instructions are direct encodings of the micromachine: the
operand decode stage computes the address of the branch target and the Execution cycle does
the comparison. All branch instructions have a delay in their effect of two instructions; i.e., the
next two sequential instructions are executed.
4. Other instructions - include procedure and interrupt linkage. The procedure linkage
instructions also fit easily into the micromachine format of effective-address calculation and
register-register computation instructions.
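The following sketch (ours; the register width, mnemonics, and choice of comparisons are invented for illustration) shows why a set-conditional result, being all 1's or all 0's, feeds directly into a later logical ALU operation so that a conditional expression needs no branch:

MASK = 0xFFFFFFFF

def set_cond(op, x, y):
    # three of the 16 comparisons, chosen arbitrarily for the example
    taken = {"lt": x < y, "ge": x >= y, "eq": x == y}[op]
    return MASK if taken else 0

a, b = 7, 12
m = set_cond("ge", a, b)             # m = 0 here, since 7 < 12
print((a & m) | (b & ~m & MASK))     # branch-free max(a, b) -> 12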
MIPS is a word-addressed machine. This provides several major performance advantages over a
byte-addressed architecture. First, the use of word addressing simplifies the memory interface, since
extraction and insertion hardware is not needed. This is particularly important, since instruction and
data fetch/store are in a critical path. Second, when byte data (characters) can be handled in word
blocks, the computation is much more efficient. Last, the effectiveness of short offsets from a base
register is multiplied by a factor of four.
MIPS does not directly support floating-point arithmetic. For applications where such
computations are infrequent, floating-point operations implemented with integer operations and field
insertion/extraction sequences should be sufficient. For more intensive applications a numeric co-
processor similar to the Intel 8087 would be appropriate.
4 Systems issues
The key systems issues are the memory system and support for internal traps and external
interrupts.
The use of memory-mapping hardware (off-chip in the current design) is needed to support
virtual memory. Modern microprocessors (e.g. the Motorola 68000) are already faced with the problem that
the sum of the memory access time and the memory mapping time is too long to allow the processor
to run at full speed. This problem is compounded in MIPS; the effect of pipelining is that a single
instruction/data memory would have to provide access at approximately twice the normal rate (for 64k
RAMs).
The solution we have chosen to this problem is to separate the data and instruction memory
systems. Separation of program and data is a regular practice on many machines; in the MIPS system
it allows us to significantly increase performance. Another benefit of the separation is that it allows
the use of a cache only for instructions. Because the instruction memory can be treated as read-only
memory (except when a program is being loaded), the cache control is simple. The use of an
instruction cache allows increased performance by providing more time during the critical instruction
decode pipe stage.
The MIPS architecture will support page faults, externally generated interrupts, and internally
generated traps (e.g. arithmetic overflow). The hardware necessary to handle such things in a pipelined
architecture is usually large and complex [3, 5]. Furthermore, this is an area where the lack of sufficient
hardware support makes the construction of systems software impossible. However, because the
MIPS instruction set is not interpreted by a microengine (with its own state), hardware support for
page faults and interrupts is significantly simplified.
To handle interrupts and page faults correctly, two important properties are required. First, the
architecture must ensure correct shutdown of the pipe, without executing any faulted instructions
(such as the instruction which page-faulted). Most present microprocessors cannot perform this
function correctly (e.g. the Motorola 68000, Zilog Z8000, and the Intel 8086). Second, the processor must
be able to correctly restore the pipe and continue execution as if the interrupt or fault had not
occurred.
These problems are significantly eased in MIPS because of the location of writes within the
pipe stages. In MIPS, all instructions which can page-fault do not write to any storage, either registers
or memory, before the fault is detected. The occurrence of a page fault need only turn off writes
generated by this instruction and by any instructions following it which are already in the pipe. These
following instructions also have not written to any storage before the fault occurs. The instruction
preceding the faulting instruction is guaranteed to be executable, or to fault in a restartable manner,
even after the instruction following it faults. The pipeline is drained and control is transferred to a
general-purpose exception handler. To correctly restart execution, three instructions need to be
re-executed. The multistage PC tracks these instructions and aids in correctly executing them.
5 Software issues
The two major components of the MIPS software system are the compilers and the pipeline
reorganizer. The input to the pipeline reorganizer is a sequence of simple MIPS instructions, or
instruction pieces, generated without taking the pipeline interlocks and instruction-packing features
into account. This relieves the compiler of the task of dealing with the restrictions that the pipeline
constraints impose on legal code sequences. The reorganizer reorders the instructions to make
maximum use of the pipeline while enforcing the pipeline interlocks in the code. It also packs the
instruction pieces to maximize the use of each instruction word. Lastly, the pipeline reorganizer handles
the effect of branch delays.
Since all instructions execute in the same time, and most instructions generated by a code
generator will not use the full MIPS instruction word, the instruction packing can be very effective in
reducing execution time. In fully packed instructions, e.g. a load combined with an ALU instruction,
all the major processor resources (both memory interfaces, the ALU, busses, and control logic) are used
100% of the time.
The example in Figure 2 illustrates the techniques: where possible, short instructions are
moved together into one word. As this is a very short segment, not too many compactions are
possible. Once a basic block has been treated for compaction, the effects of the delayed branch are
processed. In this case it is possible to remove the no-ops, required because of pipeline dependencies
and branch delays, completely.
Figure 2: Source code, original machine code, and reorganized machine code

Source code            Correct Code        Reorganized Code           ALU use    Data mem-
                       with No-Ops                                    OD   EX    ory use
(* A,B,C: global          ld #A, r3
   N,R,S: local *)        ld #B, r4
                          ld #C, r5
For i := 0 To N Do        mv #0, r1         ld N(sp),r2; mv #0,r1     x    x     x
                          ld N(sp), r2      ld #C, r5                            x
                          bgt r1, r2, L30   bgt r1, r2, L30           x    x
                          no-op             ld #B, r4                            x
                          no-op             ld #A, r3                            x
Begin
  A[i]:=B[i]+C[i];   L20: ld (r4,r1), r6    L100: ld (r4,r1), r6      x          x
                          ld (r5,r1), r7    ld (r5,r1),r7; add r6,r8  x    x     x
                          no-op
                          add r7, r6, r9    add r7,r6,r9; add r7,r10  x    x
                          st r9, (r3,r1)    st r9,(r3,r1); add #1,r1  x    x     x
  R := R + B[i];          add r6, r8
  S := S + C[i];          add r7, r10
                          add #1, r1
                          ble r1, r2, L20   ble r1, r2, L100          x    x
                          no-op             st r8, R(sp)              x          x
Note that the code with no-ops was also of reasonable quality: the loading of the array base
addresses is hoisted up, and the store of S is moved out of the loop. (Initialization of S is done outside
the segment considered.) The no-op following "ld (r5,r1), r7" is necessary to take care of the missing
pipeline interlock.
The optimal packing of instructions is obviously a hard problem (at least NP-complete);
however, we are investigating heuristics that we believe will have acceptable running times, yet will
produce nearly optimal code in most cases.
Data path components: completely designed at the transistor level; approximately 50% laid out.
The ALU has been fabricated and performs as simulated, with less than 100 ns required for
addition.
Control: a SLIM [2] program for designing the control PLAs has been written and the PLAs
have been generated.
Software: code generators have been written for both C and Pascal. These code generators
produce simple instructions, relying on a pipeline reorganizer. A first version of the pipeline
reorganizer is running, and an instruction-level simulator is also in use.
Figure 3 shows the floorplan of the chip. The dimensions of the chip are approximately 6.9 by
7.2 mm with a minimum feature size of 4 μ (i.e. λ = 2 μ). The chip area is heavily dedicated to the
data path as opposed to control structure, but not as radically as in the RISC implementation. Early
estimates of performance seem to indicate that we should achieve approximately 2 MIPS (using the
Puzzle program [1] as a benchmark) compared to other architectures executing compiler-generated
code. We expect to have more accurate and complete benchmarks available in the near future.
The following chart compares the MIPS processor to the Motorola 68000 running the Puzzle
benchmark written in Pascal. The same code generator (with different target machine schemata)
generated code for the program. The MIPS numbers are approximate.
1 The 68000 IC technology is much better, and the 68000 performs across a wide range of environmental situations. We do
not expect to achieve this clock speed across the same range of environmental factors.
2 This advantage is not used in the benchmark, i.e. the 68000 version deals with 16-bit objects while MIPS uses 32-bit objects.
3 A highly optimized (by hand) C version of Puzzle runs on the VAX 11/780 in 3.5 sec.
Acknowledgements
Many people have contributed to the MIPS project. Among the most important contributors
are: Thomas Gross, pipeline reorganizer; Alex Strong, 32-bit carry-lookahead ALU; Jim Clark, VLSI
circuit ideas; Cary Kornfeld, Pascal code generators; Chris Rowen, SPICE simulations and multistage
PC layout; Glenn Trewitt, resource usage simulator and unified approach for exception handling;
and Wayne Wolf, redesign of the barrel shifter.
The MIPS project has been supported by the Defense Advanced Research Projects Agency
under contract # MDA903-79-C-0680.
References
1. Baskett, F. Puzzle: an informal compute-bound benchmark; widely circulated and run.
2. Hennessy, J.L. SLIM: A Language for Simulation and PLA Generation in VLSI. Tech. Rept.
195, Computer Systems Laboratory, Stanford University, 1980.
3. Lampson, B.W., McDaniel, G.A., and Ornstein, S.M. An Instruction Fetch Unit for a High
Performance Personal Computer. Tech. Rept. CSL-81-1, Xerox PARC, Jan. 1981.
4. Patterson, D.A. and Séquin, C.H. RISC-I: A Reduced Instruction Set VLSI Computer. Proc. of
the Eighth Annual Symposium on Computer Architecture, Minneapolis, Minn., May 1981.
5. Widdoes, L.C. The S-1 Project: Developing high performance digital computers. Proc. Compcon,
IEEE, San Francisco, Feb. 1980.
Comparative Survey of Different Design
Methodologies for Control Part of
Microprocessors
Monika Obrebska
IMAG Laboratory
Computer Architecture Group
B.P. 53X
38041 Grenoble Cedex, France
ABSTRACT
We present several methodologies used in the design of the control parts of
microprocessors and discuss their classification with respect to the
qualities of the design. All these different methodologies were brought
out by decoding existing integrated circuits. Afterwards, each of
these methodologies was used to redesign a new control part for the MC
6800 microprocessor, its operation part remaining unchanged. By so
doing, we obtained a set of normalized solutions, so that the real effi-
ciency of each method could be estimated in terms of the cost of hard-
ware and design time. The performance, expressed by the cycle time of
each control part, was also calculated, leading to a complete, valid
classification of the different design styles. Lastly, the evolution of
design efficiency versus circuit complexity was studied.
KEY WORDS
Interpretation algorithm - Levels of interpretation - Control part
design - Microprogramming - Programmable logic arrays - Efficiency
evaluation - Regularity factor - Algorithm complexity.
1 - INTRODUCTION
2 - MICROPROCESSOR SPECIFICATION
Microprocessors, like all sequential machines, are defined by their lan-
guages, i.e. by the sets of executed instructions. So the behaviour of
a microprocessor may be described with an algorithm, called the interpre-
tation algorithm, which explains the semantics of the instruction set.
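As a concrete (and entirely illustrative) rendering of this notion, an interpretation algorithm is simply a loop that gives each instruction its semantics; the two-instruction machine below is our own toy example, not the MC 6800.

def interpret(program, state):
    while state["pc"] < len(program):
        opcode, operand = program[state["pc"]]
        state["pc"] += 1
        if opcode == "LDA":            # load accumulator with an immediate
            state["acc"] = operand
        elif opcode == "ADD":          # add a memory cell to the accumulator
            state["acc"] += state["mem"][operand]
    return state

print(interpret([("LDA", 5), ("ADD", 0)],
                {"pc": 0, "acc": 0, "mem": [3]}))   # acc ends at 8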
3 - PRINCIPLES OF INVESTIGATION
As a matter of fact, the different control part design styles were
brought out by microphotographic analysis of the internal architec-
ture of existing circuits [1], [3], [4], [6], [8], [9], [10], [14], [15],
but they could not really be compared. Each one of them was applied to
build a different machine, executing a different algorithm, and was often
implemented in a different technology, so it was not possible to say
which one was better than the others.
The idea then was to apply each of these design methodologies to the
same example in order to obtain corresponding hardware realizations
of the same algorithm. By so doing, we found a set of normalized solu-
tions reflecting the efficiency of each method. The MC 6800 micropro-
cessor was selected to serve as a benchmark in this study. The main
reason is that we knew its internal architecture as well as the inter-
pretation algorithm of its instruction set [11].
The following rules were respected in the redesign of each new
version:
- the operation part remained the same as in the original 6800,
- the interpretation algorithm was exactly the same as that of the
original 6800,
- the evaluation was made following the same layout design rules,
which were in our case the rules of the GSN3 technology.
The efficiency of each design methodology was estimated according
to three essential characteristics: the cost of hardware, the speed of
the device, and the design time.
- the cost of hardware was valued as a function of the total area of the
redesigned circuit, obtained after the layout proposition;
- the design time and ease were estimated according to the percentage of
structures, such as ROMs or PLAs, which can be generated automatically
in each solution. This percentage was also called the regularity factor;
- the speed of the control part was established by examining the delays
of its different functional blocks. The delays for ROMs and PLAs were
calculated by a special program [5], and the global timing was then ana-
lyzed in order to obtain total compatibility with the original.
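To make these three criteria concrete, the sketch below (in Python, with illustrative block
names and numbers that are not from the paper) shows how such an evaluation could be
tabulated: area increase relative to the original, regularity as the fraction of automatically
generated area, and cycle time set by the slowest functional block.

    def evaluate_control_part(blocks, original_area):
        """blocks: (area, auto_generated, delay_us) per functional block.
        All figures here are illustrative, not measurements from the paper."""
        total = sum(area for area, _, _ in blocks)
        area_increase = 100.0 * (total - original_area) / original_area
        regularity = 100.0 * sum(area for area, auto, _ in blocks if auto) / total
        cycle_time = max(delay for _, _, delay in blocks)  # slowest block sets the cycle
        return area_increase, regularity, cycle_time

    # e.g. a sequencing PLA, a command PLA, and some random logic:
    print(evaluate_control_part([(2.0, True, 0.58), (1.5, True, 0.40),
                                 (0.8, False, 0.25)], original_area=4.0))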
. . , ""
--. -...
,. " "'" '"
~
11GT1UC1'ICliCl:COC!U 1151.11('.
t]
I
!OIl
.. ., '00 'I
~--------
..." I5'T TS(
--
t, Of(
,"" ."
Fig.2 - Redesign of MC 6800 using parameter PLAs.
PROPERTIES EXTRACTION
In this approach, the characteristic global properties are brought
out directly from the instruction code by a separate PLA. This makes it
possible to decrease the number of states of the algorithm, because some
of them become independent of the commanded actions. There are no longer
8 but 7 address lines. The property lines are used directly as entry
lines of the command generation PLA. As all the PLAs are optimized, this
version of the MC 6800 needs only 11% more space than the original one,
and its regularity is 55%. It is also faster, because the cycle time
given by the sequencing PLA is about 0.58 µs. Such a technique allows
the building of well-optimized control parts for rather complex
instruction sets.
PARAMETERS EXTRACTION
The parametrization is a direct generation of static commands, i.e.
the characteristic local properties are directly translated by the para-
meter PLA into a set of commands. The other commands, which depend on
sequencing, are computed in a command generation PLA. The switching bet-
ween the static and the computed commands is performed by the selection
lines coming from a validation PLA. The idea of parametrization can
easily be applied to those functional units which execute one parti-
cular function during one instruction interpretation (for example ALU,
RCC). It is possible, however, to make an extension and to describe in a
parameter PLA two or three particular functions chosen by the selection
lines. For this reason several subparts were distinguished in the MC 6800
operation part and the associated parameter PLAs were defined (figure 2).
The total area of the microprocessor is not increased and the regularity
factor is equal to 43%. This result proves that it is possible to opti-
mize the design time and to decrease the difficulty of implementation
without increasing its cost. The speed characteristics, however, are not
so good: the cycle time, fixed by the delay of the sequencing and
validation PLAs, is about 0.77 µs. The use of parameter PLAs gives the
possibility of designing flexible, easy-to-test-and-debug control parts
for complex instruction sets.
4.1.5. CONTROL PART USING TIMING GENERATOR
This approach is based on the fact that each command is characteri-
zed by the moments of its activation during the instruction interpreta-
tion. One instruction may then be described as a set of commands to
activate and a set of moments of their activities.
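As a sketch of this decomposition (with a hypothetical opcode and hypothetical command
names, not taken from the 6800 redesign), an instruction can be tabulated as commands
paired with the timing states that activate them; each command line is then the AND of a
decoder output with a timing line, as described next.

    # moments of activation per command, for a hypothetical instruction
    INSTRUCTIONS = {
        "LDA": {"load_mar": {0}, "read_mem": {1}, "load_acc": {2}},
    }

    def active_commands(opcode, timing_state):
        """Validate the decoder's command lines against the timing line
        that is currently high (an AND-OR validation network)."""
        return {cmd for cmd, moments in INSTRUCTIONS[opcode].items()
                if timing_state in moments}

    print(active_commands("LDA", 1))   # {'read_mem'}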
The commands to activate and other property lines are generated by
the instruction decoder. The moments are described by the timing lines
coming from a timing generator. The timing generator is a small automa-
ton which represents the "skeleton" of the instruction execution. It is
controlled by the property lines of the instruction decoder. Command
lines are finally generated by validating the lines coming from the
decoder with the timing lines. This validation may be performed by an
AND-OR gate network (Z80, Z8000, I8080) or by a PLA (I8085). The redesign
of the MC 6800 using this technique has a timing automaton of 27 states
which generates 10 timing signals. The area of the microprocessor does
not need any increase; it even seems possible to decrease it by about
2%. The regularity factor of 33% is worse, because the timing generator
was evaluated as random logic. The cycle time, fixed by the instruction
decoding PLA, is about 0.4 µs. Although this design methodology has
very good technical characteristics, none of the existing microprocessors
uses it at one level of interpretation. The reason for this is explained
[Table: comparison of the redesigned control parts.

Design style        Area increase (%)  Regularity factor (%)  Irregular elements (%)  Interconnections (%)  Cycle time (µs)
MC 6800 (original)  -                  -                       -                       -                     0.89
Horizontal          26                 54                      31                      15                    -
Vertical            15                 48                      36                      16                    0.86
Properties          11                 55                      34                      11                    0.58
2 levels            -8                 34                      51                      15                    0.41 ]

[Figure 4: area versus complexity for the eight design styles: single PLA with horizontal
or vertical programming, multi-PLA with properties or parameters, random logic, and timing
generator at one or two levels.]
would occupy an area of about 19.4 mm². We have also found that the equi-
valent Moore automaton has about 466 states. This point, plotted in
figure 4, confirms our expectations and underlines the necessity of more
flexible and economical design methodologies.
We must stress that the aim of this generalization was to find the
characteristic trends in the evolution of the efficiency of the different
design techniques. The eight basic design methodologies studied here may
be considered as "pure" styles, and the curves show the influence of
their basic concepts on design qualities. In reality, control parts
using several different ideas may be built, as for example in the
MC 68000 [8] microprocessor, which has a microprogram stored in a ROM
but whose microcommands validate parameter PLAs. The curves that we
established should help the designer judge the impact of the different
concepts on the final control part characteristics.
6 - CONCLUSIONS
This comparative study allowed us to find a valid classification of
eight basic control part design methodologies, applied to the case of the
MC 6800 algorithm and then extended to other algorithms of the same kind.
The classification was made with respect to the total area, the regularity,
and the speed of each designed structure. We can see that in the case of
a circuit of small complexity, and especially when the development of a
family of circuits is not considered, the total area and regularity are
less significant, and other factors must be taken into account when
choosing a design style. In particular, the cost of the algorithm analysis
must be considered. The main conclusion is that nowadays, when circuits
are becoming more and more complex, it is worthwhile to analyze the
problem in such a way that the application of a more elaborate metho-
dology becomes possible. This should lead to a minimum-area, maximum-
regularity solution giving optimum extension and test possibilities.
7 - BIBLIOGRAPHY
[1] J. ABDO, F. RODRIGUEZ, "Analysis of MC 6809 microprocessor", master's report, June 1981.
[2] F. ANCEAU, "Architecture and design of Von Neumann microprocessors", NATO Advanced Summer Institute, July 1980.
[3] Ch. BERNARD, B. LAPLACE, Y. ALEXANDRE, "Analysis of CP 1600 microprocessor", master's report, June 1979.
[4] Ch. BERNARD, "Analysis of MC2 HP microprocessor", IMAG report, 1980.
[5] M. BONNET, J.F. TANT, "Static NMOS PLAs", master's report, June 1981.
[6] A. BOSSEBOEUF, "Internal analysis of MC 68000", IMAG report, June 1980.
[7] J.P. BRAUN, "Design and implementation of a VLSI circuit with CMOS/SOS technology", master's report, June 1978.
[8] M. GUITTET, "Microprogramming of the 68000 microprocessor", master's report, June 1981.
[9] A. GUYOT, "Comparison of Z80 and INTEL 8085 microprocessors", IMAG report, September 1979.
[10] V.S.R. MALLADI, "Analysis of internal architecture of INTEL microprocessors", IMAG report, June 1979.
[11] M. NEMMOUR, "Analysis of instructions execution in 6800 microprocessors", final report EDF/ENSIMAG, no. 511 78 10, May 1979.
[12] T. PEREZ SEGOVIA, "Optimization of PLA's area", IMAG research report no. 216, October 1980.
[13] E. PRESSON, "Analysis of sequencing in MC 6800 microprocessor", NPL IMAG report, 1978.
[14] R. REIS, "Analysis of Z 8000 microprocessor", IMAG report, September 1980.
[15] R. SARWAT, "Analysis of CDP 1802 microprocessor", IMAG report, June 1979.
[16] A.A. SUZIM, "Operation parts using modular elements", IMAG report, September 1979.
C.fast: A Fault Tolerant and Self Testing
Microprocessor
During the spring of 1981, the authors were involved in a project to design a single-chip
fault-tolerant microprocessor. The microprocessor chip is now being fabricated through the
Multi Project Chip (MPC) facilities. This report presents a brief overview of the chip,
examples of the reliability and testability techniques implemented, and some of the trade-off
issues resolved during the design process: partitioning of the control code into several PLA's,
and the increase of PLA size and overall chip size due to testability and reliability constraints.
INTRODUCTION
The C.fast 1 project attempted to accomplish four goals. The first goal was to
provide the authors with experience in designing digital integrated circuits, especially
microprocessors. We hoped that this experience would give us a better basis from which
to build a Design Automation (DA) system using a hierarchical structured design
methodology. A second goal of the project was to explore ways to connect control signals
to the data path part in a simple, structured way with little random routing. A third goal was
to try out some low-cost reliability techniques at the IC design level. Two new ideas
implemented were parity checking on the control PLA's and the concept of using the
data path to act as a visibility bus for testing purposes. Other reliability techniques were
also implemented for the appropriate sections of the chip. A final goal was to produce, as
a by-product of the design effort, a set of register transfer (RT) level building blocks to be
used by our DA programs.
The Fairchild F8 microprocessor [FAIR77] was chosen as the target machine for the
following reasons. 1) It represents a "typical" microprocessor architecture of the mid-70's
era. 2) The original F8 is an nMOS chip, the same as the MPC process. The minimum feature
size used is similar to the current MPC process, where the minimum transistor gate area is
5 microns by 5 microns. 3) The complexity of the F8 is not very great, which made
reimplementing the Instruction Set Processor (ISP) easier. 4) The original F8 is partitioned
in such a way that we could implement the basic CPU chip in less than 24 pins, thus
leaving some pins for our testability and reliability portion of the design. 5) As part of earlier
research work, we had explored the question of implementing low-cost fault-tolerant
features in an F8 system at the IC level.
1C.fast, in the PMS notation [SIEW81], stands for Computer: FAult-tolerant and Self Testing.
CHIP OVERVIEW
The chip can be regarded as consisting of three interrelated sections: the control
part, the data part, and the reliability part (see Figure 1). These sections were each under
the control of a different designer, though naturally there was considerable consultation
between them.
The control is partitioned into two groups of PLA's. Three large PLA's control the
instruction execution, provide correct sequencing for the external data bus, and attempt
recovery after transient errors. These PLA's broadcast encoded commands on a control
bus which traverses the chip in parallel with the data bus. Small decoder PLA's (called
NanoPLA's because of their resemblance to techniques used in nanocoding) produce the
actual control signals for the data path elements using the broadcast microinstructions as
inputs. This partitioning has produced a smaller and faster control section than would
have been produced by the more conventional design methodology.
Information about the state of the data part is fed back to the control part through a
status bus which is available to all PLA's. The extensive use of buses is intended to reduce
random routing and is partly motivated by our Design Automation research. The use of a
command bus allows easy testing since it can be made readable and writable through the
visibility bus and the I/O bus. It also provides a convenient way for the Retry PLA to take
over the data path control when it attempts instruction retry.
The data part is similar to the Caltech OM2 data path chip [MEAD80]. Its
similarities include a two-phase clocking scheme, precharging of buses, use of a
precharged Manchester carry chain in the ALU, and interleaved data buses. It is different
from the OM2 in that all of the storage elements are static, although the latches driving the
ALU are dynamic. Also, several of the elements have been reworked so that there is a
uniform control, power, and ground scheme. One of the buses (called the B bus) is divided
into several sections to increase parallelism. Other changes include passive pullups in
some places, a different spatial ordering of the ALU and Shifter, use of a more specialized
shifter, and wider Vdd and ground wires. Embedded within the data path part are fault-
tolerant devices including a parity checker, parity generator, and shadow registers. Also
added are a zero detector and a status register.
The chip is also serving as a test bed for several reliability and testability techniques.
Testability is enhanced by allowing access to the internal control bus through the I/O bus
via the visibility register. Fault tolerance against transient errors is derived from the
pervasive parity checking and built-in retry algorithms. It is also intended that two chips will
be used in a duplicate-and-compare system with one in the standby slave mode, ready to
take over if an error occurs in the master.
The control scheme makes exclusive use of PLA's. Here, the PLA's can be thought
of as associative read-only memories where, unlike ROMs, only those product terms
(p-terms) actually needed are used. Thus a PLA-based microprogram can continuously
examine the present state of the machine, rather than have it retained implicitly in the
microprogram location counter. The PLA's responsible for generating the microcode
(TIMING and MAIN) have the instruction register and part of the processor state fed
directly in, requiring only five bits of recycled state. There is no need for dispatch ROMs or
condition-code multiplexing as is often used in ROM-based designs. This simplifies
automatic implementation while still resulting in small size.
Our design utilizes a two-level hierarchy of PLA's: a group of decoder PLA's, referred
to as nanoPLA's, and the microcode generation PLA's. No pipelining has been attempted in
this design. The current state of the machine determines the operations performed during
the next microcycle. This was done to preserve the sanity of the microprogrammer
(though he went insane anyway). The microprogram output is broadcast to all the nano-
PLA's during φ2, and their outputs are guaranteed stable by the following φ1.
The microcode generation is broken up into two PLA's, TIMING and MAIN. The
TIMING PLA keeps track of the individual instruction sequencing and controls the timing of
the external processor bus. It also generates the F8 ROMC codes which direct the other
elements in an F8 system. The generation of ROMC codes and next state was combined
into one PLA since they share many of the same p-terms. The specific sequence of
states produced by the TIMING PLA is determined by the instruction being processed. The
MAIN PLA combines the present state with the contents of the instruction register to
determine the next microinstruction to be executed. Few of these p-terms overlap with the
others, so they were placed in their own PLA. This arrangement required less PLA area
than other possibilities, as will be shown later.
When an error is detected during instruction execution, the TIMING and MAIN PLA's
freeze their state and the Retry PLA takes over. It issues its own instructions onto the micro-
instruction bus and attempts to return the system to a known state. The instruction is then
retried from the point of failure.
In Figure 2, the area requirements of several different arrangements of PLA's are
presented. These estimates do not include the effects of adding fault tolerance. As can be
seen from the table, the present arrangement requires 920,000 λ², of which 565,000 λ² are
in the MAIN and TIMING PLA's. If the MAIN and TIMING PLA's were combined into a
single PLA, ten p-terms would be saved, but the combined area would increase to 630,000 λ².
The overall system would also be slower. Another possible arrangement would combine all
PLA's into one giant PLA. This yields an even larger and slower PLA of 1,010,000 λ².
Thus the particular partitioning chosen appears to be a good one.
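The trade-off can be reproduced with a first-order PLA area model: each p-term row costs
a column slot for every input (true and complement) and every output, so merging PLA's
whose p-terms do not overlap widens every row. The sketch below is a generic model with
assumed port counts and an assumed cell constant; none of the numbers are from the
C.fast design.

    def pla_area(n_in, n_out, n_pterms, cell=100.0):
        """First-order PLA area in lambda^2: (2*inputs + outputs) columns
        by p-term rows; `cell` lambda^2 per crosspoint is an assumed constant."""
        return (2 * n_in + n_out) * n_pterms * cell

    # two PLA's with mostly disjoint p-terms vs. one merged PLA (illustrative counts):
    separate = pla_area(12, 20, 40) + pla_area(14, 24, 50)
    merged = pla_area(20, 40, 80)      # union of inputs/outputs, a few shared p-terms
    print(separate, merged)            # the merged PLA comes out larger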
system speed for testing observability. Since there were some questions
concerning the performance of the on-chip clock input I/O pad supplied by the
local MPC symbol library, two I/O pins were used for the two phases of the
system.
A traditional microprocessor design can usually be grouped into the data path part
and the control part. The data part is more observable from the off-chip I/O pins. The
control part is somewhat harder to control and even harder to observe. In most
microprocessor designs, the only way to determine the proper operation of the control part
is by observing the output of the data part. The goal for our testable microprocessor was
to provide a controllable and observable path (the visibility bus)
between the control part and the data part, and route it to the off-chip I/O pins.
In the C.fast microprocessor design, the main control PLA generates control
information for the "control bus", not for the data part directly. We convert this bus into
the "visibility" bus during the test mode. Extra circuitry is used for observing values of the
control part, as well as jamming new values onto this visibility ("control") bus. Thus, we
increased the observability of the control part and the controllability of the data part.
The controllability of the control part Finite State Machine (FSM) is increased by
making it easier to write information into the FSM flip-flops (FF's). During test mode, these
FF's can be loaded from the off-chip data bus I/O pins via the visibility register and the
visibility bus. The operation is very similar to the scan-set ideas, such as the Level
Sensitive Scan Design (LSSD) technique used on some IBM machines [EICH77]. On
C.fast, the FF values can be loaded during the test mode write cycles. The microprocessor
then runs one or more execution cycles. The chip is set back to test mode, and the values
stored by the visibility register are read off. One important difference is that the data
reading and loading use the 8-bit parallel I/O data bus pins. This design does not use
shift-in and shift-out pins as in most scan-set-like designs. Furthermore, the pins for
controlling the test mode functions could be shared with the pins used for the fault
tolerance operations. The extra pins do not visibly impact the total pin count. 2
Additionally, portions of the FSM used for fault tolerance operations are highly
observable during "normal" operations. The microprocessor can easily be fooled into an
error recovery mode where the proper operation of the recovery FSM can be observed.
The built-in error detectors also increase the testability of the chip.
CAD Tools
A dozen or so CAD programs were used in designing the chip. Manual layouts were
done using the Xerox ICARUS interactive IC layout program, running on Xerox ALTOs.
The CIF files generated from the ICARUS layout files were sent over the Ethernet to the
CMU VLSI-VAX. The most interesting path was in generating the CIF files for all the PLA's.
The sequence is illustrated in Figure 5. Many programs were invoked in the prescribed
sequence; however, the tasks were somewhat simplified by using system command files.
The actual bit patterns for the PLA generator were used to drive the ISPS simulation of the
micromachine, providing an independent check on the correctness of the microcode. A
UNIX shell program was used to merge all the leaf-node CIF files, which included 6 ICARUS
files, 10 PLA files, and one small hand-typed CIF file, into one unified C.fast CIF file.
Because of the size of the entire CIF file, design rule checking of the chip was done in
separate chunks, merging only a few leaf-node files at a time.
2We ran out of chip real estate in which to place the Retry PLA. However, all the associated control signals
were placed. Using the visibility bus, the error recovery procedures can still be tested.
SUMMARY
Work on the C.fast chip was initiated in mid-January, 1981, and completed in mid-
June, 1981. Four graduate students were actively involved in the design. Figure 6 shows a
checkplot of the C.fast microprocessor, and Figure 7 identifies the major functional areas
on the chip. The completed design, excluding the RETRY PLA, contains approximately
13000 transistors, of which the TIMING PLA and the MAIN PLA use 4300 transistors. The
data path part, with a 16-byte SPM section, contains 5600 transistors. The nanoPLA's
account for about 3000 transistors. The chip is approximately 6100 microns by 5800
microns, somewhat big for a simple microprocessor. However, we feel that we have
satisfactorily completed the four goals stated at the beginning of this project.
ACKNOWLEDGMENTS
The authors would like to thank Bill Birmingham for the layout of the shifter section,
and Dr. Dennis Lunder, of the Fairchild Microprocessor Product division, who donated an
F387X PEP single board computer system, providing a test vehicle for our finished
product. We would also like to thank our colleagues in CMU's VLSI community. Without
their many wonderful CAD programs, we could not have completed this design.
REFERENCES
[EICH77] E.B. Eichelberger and T.W. Williams. "A Logic Design Structure for
LSI Testing". Proc. 14th Design Automation Conf., June 1977, pp. 462-468.
3The paper presented here is an excerpt of a project report, "The MPC C.fast micro computer", available from
the authors.
While working on this project, the authors were supported by the Defense Advanced Research Projects Agency
(DoD), ARPA Order No. 3579, monitored by the Air Force Avionics Laboratory under contract F33615-78-C-1551,
by National Science Foundation grant ENG-78-25755, and by the Carnegie-Mellon University Department of
Electrical Engineering.
The views and conclusions contained in this document are those of the authors and should not be interpreted
as representing the official policies, either expressed or implied, of the Defense Advanced Research Projects
Agency, the US Government, or other funding agencies.
[Figure 1a: RT-level block diagram of the C.fast microprocessor. Figure 1b: block diagram
of the testability/reliability portion of the C.fast, including the parity checker and
generator, shadow registers, visibility register, status, parity on the A-bus, and
store-and-compare on the I/O pads.]
[Figure 4: Markov diagram showing the states of a duplicate-and-match C.fast system: both
OK, internal error, stop master, stop slave. Figure 5: diagram tracing the design process
through the various CAD programs.]
[Figure 7: major functional areas of the chip: I/O pads, TIMING PLA, MAIN PLA, control
bus, and the data path with SPM, ACC, ALU, and shifter.]
VLSI Processor Arrays for Matrix Manipulation
A. INTRODUCTION
B. ALGORITHMS
A system solver can then be built which solves for [X] in four steps:
register. When all of the elements of [U+] have been calculated and
stored on the LIFO stacks, the shift register containing the intermedi-
ate results will also be full. At this point, switch S1 is opened
and S2 is closed; switch S3 is closed long enough to load the C register
with the results. The previous operations are repeated with the WL
array remaining idle. In each time period a new result, x_i, is calculated
and stored in the shift register.
We have further reduced the functional configuration shown in
Figure 1 to a form that is pipelined and suitable for reduction to hard-
ware. The necessity of pipelining resulted from the constraint that
processors P_i and Q_i in Figure 1 only be allowed to communicate with
nearest neighbors. This constraint was imposed to avoid the speed
and power costs of a global bus.
where cl' c2""'c n , and d are given numbers, and xl,x2"'Xn is the
solution to the system
all xl + a 12x 2 +
a 2l x l + a 22 x 2 +
+ a nnxn b
n
whose determinant is non-zero. The problem can be codified by writing
it as
a b
nn n
-c d
n
or, in abbreviated form,
~
7. (2)
    |  [A]  [B] |
    | -[C]  [D] |                (3)
After Gaussian elimination, in the lower right-hand corner the result
[C][A⁻¹][B] + [D] will appear, where we have used [X] = [A]⁻¹[B].
Other results can be obtained by appropriate choices of the entries in
(3). For example, a common problem is the solution of the linear sys-
tem [A][X] = [B], where [X] and [B] are column vectors. The solution
[X] to this equation can be obtained with the entries

    |  [A]  [B] |
    | -[I]  [0] |                (4)

Here, [I] is the identity matrix. Examination of the Gaussian elimina-
tion procedure shows that only the top line of the identity matrix is
actually being utilized at each step, so the array can be reduced to
(n + 1) x (n + 1), as shown in Figure 2. After each pass of the
Gaussian process, the top line and left column will have been annihi-
lated. If the remaining numbers are then shifted upward and to the
left, the -1 0 0 ... 0 line of the identity matrix can be restored at
each step. The result is that at each pass the matrix shrinks by one
column, but retains the original number of rows. After n passes, it is
found that the remaining column contains the solution vector.
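The bordered-matrix computation can be checked in a few lines of Python/numpy. This is a
minimal sketch of the arithmetic described above, with no pivoting (as in the array, the
pivots are assumed non-zero); it is not the systolic implementation itself.

    import numpy as np

    def faddeev(A, B, C, D):
        """Gaussian elimination on [[A, B], [-C, D]]: annihilating the first
        n columns leaves C A^-1 B + D in the lower-right block."""
        n = A.shape[0]
        M = np.block([[A, B], [-C, D]]).astype(float)
        for k in range(n):
            M[k+1:] -= np.outer(M[k+1:, k] / M[k, k], M[k])  # eliminate column k
        return M[n:, n:]

    A = np.array([[4.0, 1.0], [2.0, 3.0]])
    B = np.array([[1.0], [1.0]])
    X = faddeev(A, B, np.eye(2), np.zeros((2, 1)))  # entries of (4): solves A X = B
    print(np.allclose(A @ X, B))                    # True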
The Faddeev algorithm can be used to solve linear systems rapidly
by using an (n + 1) x (n + 1) array of identical processors to operate
simultaneously on all elements. As a result, the entire calculation can
be performed in O(n) time steps. One possible two-dimensional pipelined
architecture suited to solving [A][X] = [B] is shown in Figure 3. Here
we are considering the case where [C] and [B] are one-dimensional vec-
tors whose elements are initially stored in the processors below the
double line, and to the right of the single line, respectively. The n²
elements of [A] are stored in the processor matrix in the upper left
corner. The function of each processor, P_ij, in the array is indicated.
does the multiplication and addition, [B][C] + [D].
C. PROCESSOR SIMULATIONS
REFERENCES
[Figure 1: functional configuration: Weiner-Levinson lattice array and back-substitution
array, with LIFO stacks, storage registers, storage processors, and a shift register;
switches S1-S3 route the intermediate and final results x_i.]
[Figure 3: processor array data flow for the problem Ax = B.]
[Figure: (n + 1) x (n + 1) processor array with rows of the identity matrix; boundary
processors find ab, interior processors find x = a - bc.]
[Figure: cumulative distribution of RMS error for a 16-section Weiner-Levinson filter,
fixed point (8,19), 50 samples; number exceeding versus RMS error.]
[Figure: processor block diagram: two-word register, shifter, 1's insert, exponent add,
2's complement, carry-save add, accumulator, look-ahead add, left justify, LIFO, and
horizontal bus.]
[Table: area, speed, and power comparison of processor variants (S-P/P, S-S/P, P-S/S)
for single-precision fixed point, double-precision fixed point, and single-precision
floating point; A = area (10⁻³ cm²).]
A General-Purpose CAM-Based System
J. Storrs Hall
Rutgers University
KEYWORDS
content addressable memory, parallelism, general purpose architectures, CAML,
associative processing, applications of VLSI systems, algorithm design for VLSI
systems, impact of VLSI in system design
ABSTRACT
VLSI makes feasible the massive use of content addressable memory in a
general purpose computer. We present a design for memory which is
addressable conventionally and by content, and which supports low-level bit-serial
word-parallel algorithms. CAM provides one of the most easily understood and
programmed frameworks for massively parallel computations. We present a
programming methodology for the use of our design. This includes a
programming language, CAML; a number of algorithms from various fields which
demonstrate the generality of the design and the language; and techniques for
transforming algorithms from conventional to CAM-based structures and methods.
HARDWARE
The CAM consists of words of memory, responder bits, logic to do
comparisons between the bits in the words and the values on the bus, logic to
manipulate the responder bits using the results of the comparisons, logic that
controls whether or not a word is active (comparisons happen, the word can be
written into) depending on the values of the responder bits.
The addressing hardware and the bus are such that values can be read from
and written to individual words by address as in a conventional memory. The
addressing circuitry also controls the CAM activity in associative mode, such that
upper and lower addresses are specified which limit the associative activity. This
provides the ability to deal with blocks of memory as associative structures,
without interfering with the rest of memory.
OPERATIONS ON MEMORY
There are four basic operations possible for the CAM part of the machine:
reading words, writing words, setting flags depending on the data bits, and setting
data bits depending on the flags.
Reading and writing single words are equivalent to the read and write
operations in a conventional memory. Only single words may be read.
Writing may cause data to be written from the data bus into all selected
words. Words are selected by (a) being inside the specified address range, and
(b) having a response bit or bits that agree with the conditions specified. The set
of conditions is just the state of some of the lines on the bus and may be
thought of as part of the address (e.g. "all words from location 234 through 345
in which response bit 0 is 1"). It is possible to specify a bit mask for the write
operation. The write operation only changes bits corresponding to ones in the bit
mask.
When the data is written into each individual word, it is ORed with a value
specified from the flags (response bits) in that word. This can be made 0 by
selecting no flags for this purpose, so that the value on the bus gets written.
Alternatively, by putting 0 on the bus, we may write the value from a flag into a
data bit.
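A software model makes the write semantics precise. The sketch below (Python, with words
stored as bit lists and flags as per-word bit lists; the function name and calling
convention are ours, not the machine's) applies one masked parallel write, ORing the bus
bit with the selected flags of each word:

    def cam_write(words, flags, lo, hi, data, mask, or_flags=(), sel_flag=None):
        """Write `data` into every selected word in [lo, hi]: only bits under
        the mask change, and each written bit is ORed with the word's chosen
        flag bits (0 if no flags are selected, so the bus value is written)."""
        for w in range(lo, hi + 1):
            if sel_flag is not None and not flags[w][sel_flag]:
                continue                       # response bit does not match
            or_val = any(flags[w][f] for f in or_flags)
            for i, (d, m) in enumerate(zip(data, mask)):
                if m:
                    words[w][i] = int(d or or_val)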
Set flags is the most characteristic feature of a CAM. It allows flag bits in
each word to be set from data in that word. This is a parallel operation, in that
it is performed on each selected word in memory independently. In this
operation, a one-bit result is computed for each selected word. Words are
selected by the addressing just described for writing. One or more flag bits are
specified on the bus, and the result for each selected word is written into those
flag bits in that word.
The results for a given word are as follows. Let B_i be the ith data bit in a
word, D_i the ith data line on the bus, and M_i the ith mask line on the bus. Let
R_i = (not M_i) or (B_i eqv D_i), that is, 1 if the bit equals the bus line or if the
bit is not in the mask. The R_i are ANDed together. Before being written into the
specified response bits, the result is XORed with a final line on the bus to give
the processor a chance to invert it.
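Under the same software model, one set-flags cycle looks as follows (again a sketch; the
invert line and flag addressing follow the description above):

    def set_flags(words, flags, lo, hi, data, mask, dest_flags, invert=0,
                  sel_flag=None):
        """For each selected word, compute AND over bits of
        (not mask[i]) or (word[i] == data[i]), XOR with the invert line,
        and write the result into the destination flag bits."""
        for w in range(lo, hi + 1):
            if sel_flag is not None and not flags[w][sel_flag]:
                continue
            r = all((not m) or (b == d)
                    for b, d, m in zip(words[w], data, mask))
            r = int(r) ^ invert                # final bus line may invert the result
            for f in dest_flags:
                flags[w][f] = r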
The addressing logic is such that we can do the following: (a) select a range
of words for associative access; (b) determine the address of the first active
word; (c) determine the number of active words (responders).
A major difficulty in the design of the machine as a whole arises from the
fact that all the lines on the bus must go into each chip. In a conventional
memory architecture, a 16x16K memory can be implemented with 16 1x16K
chips, since there need be no connection between different bits of the same
word. Not so in the CAM, where each bit must be connected with the flags for
that word.
Our design exacerbates the problem, since the "address range" capability
requires that 2 addresses be supplied with each memory access. Furthermore,
usability requires fairly large word sizes, 64 bits at the very least. Data lines plus
mask lines plus two addresses give many more signals than can be fed onto a chip.
We multiplex the signals: since we generally are dealing with fields in the
word, we put only a slice, say 32 bits, on the bus. We can get away with
splitting the bus into data, mask, low address, and high address phases. Since
many of the algorithms use the same range of memory in many successive
accesses, we allow the addresses to be latched into memory. In some non-
negligible portion of cases, we may dispense with the mask by assuming it is all
ones.
We refrain from designing a processor, as a conventional variety is sufficient.
We will assume a microprogrammed one with lots of internal registers, since
some reasonably complex algorithms underlie our "machine language".
OPERATIONS IN MICROCODE
Given the ability in hardware to find all words in which given bits have given
values, we can write bit-serial, word parallel algorithms to find words in which a
given field has a numerical value greater than, less than, etc, a given value. Given
the ability to determine if there are any responders, there are bit-serial, word-
parallel algorithms for finding the maximum or minimum element Given the ability
to count the responders, it is simple to sum the elements of an array in parallel.
Given these and the ability to write into selected bits of selected words
simultaneously, we can add or subtract a given value to selected words with a
bit-serial, word-parallel algorithm. Similarly, we can do arithmetic between fields
in the same word, for all words in parallel.
The hardware can be optimized for these algorithms. For example, to add
the contents of a register to some field of all words, we apply the
following algorithm (adapted from [Foster]) on the unoptimized hardware
described above:
[1] Denote one flag "carry-in" and one "carry-out". Set carry-in in all
words to 0. Set I (an internal variable in the CPU) to the number of the least
significant bit in the field to be added to, and J to the number of the
least significant bit in the register to be added to all the words.
This algorithm requires 4 or 5 memory cycles per bit (depending on the register
bit). Assuming a 200-nanosecond memory, this is a microsecond per bit. We
can speed this and the other arithmetic up by attaching a full adder between two
of the flag bits in such a way that each bit would take a single memory cycle.
This is probably a good tradeoff, since it would cost an extra 5% at most in
hardware and produce a 4- to 5-fold speedup.
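In the same software model, the whole bit-serial, word-parallel addition reads as below (a
sketch after Foster's scheme as cited in the text; each pass over a bit position stands
for the 4 or 5 memory cycles mentioned above):

    def add_register_to_field(words, reg, lsb, width):
        """Add the integer `reg` into the `width`-bit field starting at `lsb`
        of every word, one bit position at a time, all words in parallel."""
        carry = [0] * len(words)               # the per-word carry flag
        for i in range(width):
            rbit = (reg >> i) & 1              # bit J+i of the register
            for w, word in enumerate(words):   # "parallel" across words
                b = word[lsb + i]
                word[lsb + i] = b ^ rbit ^ carry[w]
                carry[w] = (b & rbit) | (b & carry[w]) | (rbit & carry[w])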
CAML
To make possible the independent development of machine architecture and
software, and to facilitate the latter, we have developed a "systems level"
language for programming instead of an assembly language. CAML hides the flag
bits from the programmer in the same way BLISS or C hides registers in a
conventional machine.
The basic data construct in CAML, besides scalars of various types, is the
array of records. A pseudovector is the occurrences of some field in a subset of
the records of the array (e.g., the B field in all records where the A field is
greater than 17). In an array foo this would be written foo[:a>17].b. Only
one array can be the basis of the pseudovectors in any one statement, so the
field names are used alone after the initial mention of the array. For example,
foo[2:34 : x>22 & y=z].z := x-17 means "in records 2 through 34
inclusive of foo, in which the x field is greater than 22 and the y field is equal
to the z field, place in the z field the value of the x field minus 17." Indexing is
zero-based.
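For readers who prefer conventional notation, the statement above corresponds to the
following Python loop over an array of records (a serial restatement of what the CAM
performs in one parallel operation; the sample data is ours):

    foo = [{"x": i, "y": 0, "z": 0} for i in range(40)]   # sample records

    # serial restatement of: foo[2:34 : x>22 & y=z].z := x-17
    for r in foo[2:35]:                        # records 2 through 34 inclusive
        if r["x"] > 22 and r["y"] == r["z"]:
            r["z"] = r["x"] - 17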
The operations available include those which are primitives at the microcode level,
i.e. arithmetic between fields in parallel, count responders (#foo[:x>3]), address of
first responder (@foo[:x>3]), etc.
Control constructs in CAML also reflect the capabilities of the CAM. Iteration
through a pseudovector, for example, is included, since the addressing capability
allows this to be done in time that depends on the number of items in the
pseudovector, but not on the length of the base array (the
original array in which the selected records lie).
ALGORITHMS ON A CAM
We support our contention that a CAM is well-suited as a general-purpose
machine with a wide selection of algorithms from different areas of computer
science. The benefits of simplicity and speed are sometimes available together,
sometimes separately. For algorithms using conventional data structures, usually
the algorithm can be simplified, and occasional speedups occur. Sometimes we
can change the data structure and obtain more radical improvement.
for i := @mem[:g]          ; find the first marked place
    ; iterate through words whose g field contains 1.
    ; i is the address of each successive one.
    ; search for successive words considers only words following
    ; the previous one. Note the difference from while.
  mem[i].g := 0            ; unmark it
  for k := 0 to mem[i].ptr               ; move down
    mem[bot+k] := mem[i+k]               ; full record assignment
    ; end iteration on k, after k=mem[i].ptr
  mem[:-f & ptr=i].f := 1                ; relocate all pointers to it
              .ptr := bot
    ; set the f field to 1, and the ptr field to bot,
    ; in all words where the f field=0 and the ptr field = i
  bot :+= mem[bot].ptr     ; move up for next
; end iteration on i, when no more words with g=1 remain
The conventional form of the same garbage collector is much more complex,
requiring forwarding addresses, an extra relocation pass, and an extra pointer per
record.
Sorting may be done in linear time using an algorithm which would be
quadratic on a conventional machine. This is a gain in simplicity and speed. Note
that "linear" here refers to an algorithm which does n extractions of the largest
element, and that extraction is a bit-serial algorithm (although presumably in microcode); so
there is an implicit extra factor of length-of-key. However, length-of-key is
also an implicit extra factor in a comparison when sorting on a conventional
machine; so we are not taking liberties.
GRAPH ALGORITHMS
The garbage collector is much simpler on the CAM, and runs faster by some
constant factor, but is still a reflection of a data structure designed for a
conventional machine. In some cases, we can speed up an algorithm by a linear
factor by appropriate rearrangement of the data structure.
[1] [initialize] There are n vertices; we are looking for the shortest path
from vertex 1 to all the rest. Array u(n): u(1) = 0, u(i>1) = +infinity.
Array a(n,n): a(i,j) is the arc length from vertex i to vertex j. The set
T contains all vertices except (vertex number) 1. Set k to 1.
size of T, so the whole thing is n² in the number of vertices. We can get around
this, however, by rearranging the data structures. The trick is to distribute u and t
around to the edges so we can do the update-u operation in parallel. This leaves
us with a copy of u(i) for each edge leading into vertex i, with the "true" value
of u(i) being the minimum of that set. Likewise, t is represented by a t for each
edge coming into a vertex; these all change in parallel, so each is the "true" value.
{ edge(n^2) .x, .y (index(n)) .a, .u (number) .t (bit) }
; 0. [initialize]
; assume fields x, y, and a have been initialized
; assume there is at least one record with y=1;
; if not, a dummy may be inserted.
edge[:y=1].u := 0          ; set u and t to 0 for
          .t := 0          ; all edges whose y is 1
edge[:y-=1].u := 999999    ; set u to "infinity" and
           .t := 1         ; t to 1 for all other edges
k := 1                     ; k is a scalar variable
while edge[:t]
  ; iterate as long as any t field in edge is 1
  ; 2. [update u]
  ; essentially the same as above, but we only update the
  ; copy of u(i) associated with (k,i).
  edge[:t & x=k].u :min= min(edge[:y=k].u)+a
    ; in each edge record where the t field is 1
    ; and the x field equals k, set the u field to
    ; the min of its previous value and the sum of
    ; its a value and the minimum u value in all
    ; records where the y field equals k
  ; 3. [new k]
  ; taking a "grand total" minimum of all the u's instead of
  ; doing each u(i) incrementally
  k := edge[:t :min(u)].y
    ; k gets the y value from the record with the
    ; minimum u value from those records in which the
    ; t field is 1
  edge[:y=k].t := 0
    ; set the t field to 0 in all records where
    ; the y value equals k.
; exit the iteration when no record's t field contains 1
This new form of the algorithm runs in time proportional to n, the number of
vertices. It remains to prove that it works.
[1] Every vertex with at least one incoming edge has at least one u
value, since the u's are stored corresponding to the y's.
[b] The fact that the u's for all the y=i have not been updated
does not matter, since the u used in the addition is the result
of the grand-total min as per [a], and the value of u for this
edge can only be higher than or equal to the "correct" u(i) for the
vertex.
[3] The t values are manipulated only by y=k and are thus set and
tested only in blocks corresponding to the individual bits in the serial
version.
Like the garbage collector, the minimal spanning tree algorithm gains a
marvelous simplicity (compared with an efficient conventional version), but it gains
a major speedup as well. The algorithm is one well suited to a CAM: pick edges
of minimum cost that don't form a cycle until all vertices are connected.
{tree{n"2) .x, .y, .bx, .by{index{n
.cost(number) .mstp(bit)}
; these are edges, connecting vertices x and y.
tree. bx : = x ; originally each vertex
by : = y i is a separate subtree
ms t p : = 0 i and no edge is in the spanning tree
while i := @tree[:bx-=by:min{cost)
; find edge of min cost which doesn't form a cycle
tree[i).mstp := I iput this edge in tree
new : = tree [ i ]. bx i change the subtree number of one
old : = tree [ i ) by ; of the subtrees to the other's
tree[:bx=old).bx := new
tree[:by=old).by := new
Fairly small relational databases of the kind often used as knowledge bases in
AI programs can be stored directly as tuples (records) and manipulated with
extreme ease. By an appropriate encoding, contexts of the Conniver variety can
be represented by an extra tag on each tuple, allowing constant-time insertion
and retrieval in the complex structure. An appropriate encoding is merely an
enumeration of the context tree for which an appropriate bit pattern will select
exactly those nodes from the root to some selected node, and of those the
farthest from the root has the highest numeric value. This is another example of
the technique of making structure explicit.
REFERENCES
This is a condensed version of Rutgers LCSR-TR-16, which contains in
particular a more detailed description of CAML.
connected to each other. The delay element must have the ability to
delay each value that passes through it by an arbitrary number of
clocks. In practice, a delay element would be implemented as a small
buffer memory with a structure that will be described shortly. These
delay elements add sufficient logic complexity to the crossbar chip to
effectively and usefully exploit the capabilities of VLSI. The use of
scratch-pad register files to provide the delays is not an adequate
solution since the delay elements, which are intended to facilitate the
scheduling task, would now themselves be resources that need to be
scheduled. This circularity greatly complicates the scheduling task
[3J.
A careful study of a number of possibilities has yielded no
satisfactory alternative to providing a delay element between every
output and input that are directly connected. In view of the
advantages of a small crossbar switch as a building block, it was
decided to design an interconnect chip consisting of a crossbar with a
delay element at each cross-point to facilitate the construction of a
wide variety of interconnection networks for polycyclic processors.
The structure of an individual delay element is determined by the
following considerations. Each value that passes through it will be
written into it once, be read one or more times, and will then be
deleted. The multiple reads result from two or more operations, which
have as input the same value, being scheduled on the same resource.
More than one value may be in the delay element at any one time, and
their arrivals, usages, and deletions may have arbitrary relative
orderings. The memory element implied by this need be nothing more
than a register file with the capability of reading from or writing to
any register.
A further issue is illustrated by the program fragment in Figure
1a, which computes the first 101 terms in the Fibonacci series and
stores them in the array A. Each iteration of the loop computes a new
value (for T0) which is used by the two subsequent iterations.
However, this value has to be reassigned, on successive iterations,
first to T1 and then to T2, in effect performing a programmatic shift
of the value from T0 to T1 to T2. Such programmed shifts can
T1 := 1;                      T1 := 1;
T2 := 1;                      T2 := 1;
A[0] := T2;                   A[0] := T2;
A[1] := T1;                   A[1] := T1;
FOR I := 2 TO 100 DO          FOR I := 2 STEP 3 TO 98 DO
BEGIN                         BEGIN
  T0 := T1 + T2;                T0 := T1 + T2;
  A[I] := T0;                   A[I] := T0;
  T2 := T1;                     T1 := T2 + T0;
  T1 := T0;                     A[I+1] := T1;
END;                            T2 := T0 + T1;
                                A[I+2] := T2;
                              END;
(a)                           (b)
Figure 1. Program Fragment
complicate the scheduling task for much the same reasons that the use
of scratch-pad registers for implementing delays causes problems. If
programmed shifts are to be avoided in this example, the loop of Figure
1a would have to be unrolled as in Figure 1b where three members of the
Fibonacci series are computed on each iteration. In general, this can
increase the size of the program quite substantially if the body of the
original loop is sizable. The need to perform such a shift arises
whenever successive instances of a value are written into the delay
element before previous ones have been deleted. If each instance of
the value is to be accessed using the same address, and if programmed
shifts are to be avoided, the delay element must have a built-in shift
capability to perform the equivalent of the programmed shift in Figure
1a.
Precious pins can be saved if the address that a value is to be
written into in the delay element does not have to be explicitly
provided. Instead, on-chip logic associated with each delay element
maintains a pointer to the location which is to be written into next.
The resulting structure of the delay element consists of a
register file, any location of which may be read from by providing an
explicit read address. Optionally, one may specify that the value
accessed be deleted; this option would be exercised if this is the
last access to that value. The result of doing so is that every value
with an address greater than the address of the deleted value is
shifted down to the location with the next lower address.
Consequently, all values present in the delay element are compacted
into the lowest locations. An incoming value is written into the
lowest empty location, which is always pointed to by the Write Pointer.
The Write Pointer is incremented each time a value is written and is
decremented each time one is deleted. As a consequence of deletions, a
value, during its residence in the delay element, drifts down to lower
addresses, and is read from various locations before it is itself
deleted. A value's current position at each instant during execution
must be known by the compiler so that the appropriate read address may
be specified by the program when the value is to be read. Keeping
track of this is a simple, if somewhat tedious, task which is easily
performed by a compiler during code generation.
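A behavioral sketch of this delay element (ours, not the chip logic; bounds checking
omitted) captures the write pointer and the compact-on-delete rule:

    class DelayElement:
        """Cross-point delay element: write at the write pointer, read by
        explicit address, optional delete with compaction of higher locations."""
        def __init__(self, depth):
            self.regs = [None] * depth
            self.wp = 0                    # lowest empty location

        def write(self, value):
            self.regs[self.wp] = value
            self.wp += 1                   # incremented on every write

        def read(self, addr, delete=False):
            value = self.regs[addr]
            if delete:                     # last access: higher values shift down
                del self.regs[addr]
                self.regs.append(None)
                self.wp -= 1               # decremented on every deletion
            return value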
There were four basic parameters that had to be specified prior to
designing the logic of the interconnect chip:
1) the number of words per register file, d,
2) the width of a word bit slice, b,
3) the number of crossbar input (write) ports, m, and
4) the number of crossbar output (read) ports, n.
These parameters had to be specified in the context of three
constraints that had to be satisfied. These constraints, which reflect
TRW's 3-D bipolar process, fabrication, packaging and testing
technologies, were that
1) the device count could not exceed 20,000,
2) the pin count could not exceed 64, and
3) the power dissipation could not exceed 4 watts per 64 pin chip.
Based on a preliminary design analysis, the following equations were
derived to evaluate the effect of the constraints upon the parameters:
Figure 2. Maximum bit-slice width without exceeding 20,000 devices per chip, and number
of pins (assumes 16 words per cross-point):

    Write ports (m):       1        2        3        4
    Bit-slice width (b):   31 bits  15 bits  9 bits   7 bits
    Pins:                  81       69       65       68
A system-level analysis was used to select one out of these four
options. This consisted of evaluating the total power dissipation and
the total number of chips involved in constructing a crossbar with 4
input ports, 8 output ports, and a 32-bit word width. Figure 3 plots
these quantities for n = 2, 3 and for m = 1, 2, 3, and 4. The four
feasible options are circled. The m = 2, n = 2 option is the best with
respect to both power and chip count. It has the additional advantages
that m and n are powers of 2 and that m and n are equal, thereby
providing improved modularity in both the m and n dimensions when
constructing larger crossbars. A block diagram of the final design is
shown in Figure 4.
Additional features of the chip shown in Figure 4 are a diagnostic
bit-serial read-in/read-out of the input and output registers, and
status bits for detecting register-file-full and register-file-empty
conditions. These status bits potentially allow dynamic as well as
static scheduling of the interconnect (by using external logic).
References
[Figure 3: total power (watts) and total IC count versus number of write ports per chip,
for 2 and 3 read ports per chip; the four feasible options are circled.]
[Figure 4: block diagram of the final design: input registers feeding cross-point cells
CELL(0,0), CELL(0,1), CELL(1,0), CELL(1,1), with write-enable lines WE(0) and WE(1),
data inputs DIN(0) and DIN(1), the write-pointer state machine (WPSM), reset (RST), and
diagnostic clock (DCLK).]
The CMOS SLA Cell Set contains elements which have been present in
all previous SLA implementations. These elements may not be implemented
with exactly the same functionality in CMOS, but they can be used to write
SLA programs which are functionally equivalent to those in NMOS or I²L.
In addition to the "standard" SLA elements, new ones have been added
which greatly increase the flexibility and power of the CMOS SLA Cell
Set. The CMOS SLA Cell Set includes:
- S (set a flip-flop)
- R (reset a flip-flop)
- I (inverter input),
The CMOS process chosen for this SLA cell set includes an n-channel
Si-gate NMOS process, with enhancement- and depletion-mode transistors.
This process is supplemented by the inclusion of an n⁻ well, enclosing a
p-channel device region with enhancement-mode transistors. The n⁻
regions made possible the inclusion of Schottky diodes as the key
elements of the combinational logic. The n⁻ regions used for p-channel
devices were encircled by an n⁺ high-bias ring, which in turn was
surrounded by a p⁺ grounded guard ring. These were included to prevent
the SCR action encountered in CMOS circuits.
The CMOS SLA implementation makes use of the Schottky diode in the
merged AND and OR plane. A composite representing a single Schottky
diode in a row cell is shown in Figure 1. The row of the SLA is an n⁻
stripe with low-resistance n⁺ shunts on each side. The Schottky diode is
formed by contacting metal to the n⁻ diffusion.
Both the AND and the OR planes are formed by identical Schottky
diodes which have their cathodes tied to the row and their anodes tied
to individual columns. Because of this, the two planes are identical
and can be merged together with no space penalty. This
was not true in the I²L and NMOS technologies, which required a
differently configured transistor for the AND plane and for the OR
plane.
[Figure 1: composite of a single Schottky diode in a row cell: metal contact on an
n⁻ stripe row.]
(in [4], the basic row element size was 75 microns wide by 35 microns
high). An obvious advantage of CMOS over NMOS for SLA programming would
be in implementing circuits with a high ratio of combinational logic to
active elements.
[Figure: SLA column cells routing VDD, PHI, and GND through the active cell array.]
[Figure: example SLA program: master-slave flip-flops with set (S) and reset (R) columns,
initialize and add/subtract controls, A and B inputs, carry output, and S output.]
REFERENCES
[1] Goates, G. B.; Waldron III, H. M.; Patil, S. S.; Smith, K. F.; and Tatman, J. A.
Storage/Logic Arrays for VHSIC.
In Proceedings of the 1981 Institute for Defense Analysis Semi-
Custom Integrated Circuit Technology Symposium, May 1981.
[2] Lin, E. S.
A Study of Loading Constraints of Existing Integrated Injection
Logic Realizations of Storage/Logic Arrays.
Master's thesis, Department of Computer Science, University of
Utah, August 1980.
[3] Patil, S. S.
An Asynchronous Logic Array.
Project MAC Technical Memo TM-62, MIT, May 1975.
INTRODUCTION
DEVICE ARCHITECTURE
signal processor, and the sampled data are transferred and sub-
sequently processed in parallel. Within each processor all the compu-
tation functions are performed in the charge domain, and local charge-
domain memories are included for storing the processed signal. Based
on this generic device architecture, a matrix-matrix product device
and a triple-matrix product device have been designed and are
described below.
After this string of output data is serially loaded into the jth
accumulating memory, it is parallel-transferred to the storage wells
and summed with previously stored charge. It follows that after the
Nth (or the last) row of the analog sampled data has been processed by
the same procedure described above, the information stored in the jth
accumulating memory is

    g_kj = Σ_{n=1}^{N} c_kn f_nj ,  for k = 1, 2, ..., K.

It can be seen that this data sequence is equal to the jth column of
the matrix product which is to be computed by this device.
Therefore, the stored data sequence g_1j, g_2j, ..., g_Kj can now be
parallel-transferred to the output shift register and serially clocked
out. In other words, the serial output from each accumulating memory
is the corresponding column of the product matrix. Thus, the
device computes the matrix-matrix product [C][F].
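The accumulation scheme can be written out in a few lines (a numpy sketch of the
arithmetic only; the device of course performs it with analog charge packets rather
than floating point):

    import numpy as np

    def ccd_matrix_product(C, F):
        """Row-by-row accumulation as described above: on cycle n the nth row
        of samples is weighted by the c_kn coefficients and summed into the
        J accumulating memories."""
        K, N = C.shape
        _, J = F.shape
        G = np.zeros((K, J))
        for n in range(N):                 # one row of sampled data per pass
            for j in range(J):             # jth accumulating memory
                G[:, j] += C[:, n] * F[n, j]   # summed with stored charge
        return G                           # equals C @ F

    C = np.arange(6.0).reshape(2, 3)
    F = np.arange(12.0).reshape(3, 4)
    print(np.allclose(ccd_matrix_product(C, F), C @ F))   # True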
We are in the process of designing a CCD matrix-matrix product
device with N, K, and J chosen to be 32, and with 8-bit MDACs. The chip
size is estimated to be 30,000 mil² excluding the digital memory. At
a 10 MHz clock rate, the device performs the equivalent of
3.2 x 10⁸ 8-bit x 8-bit digital multiplications and 10¹⁰ additions per
second.
for k = 1, 2, ..., K and l = 1, 2, ..., L,

where g_kj = Σ_{n=1}^{N} c_kn f_nj. It can be seen that the top part of the device
(i.e., the tapped delay line, the MDACs, the accumulators, and the
digital memory) is identical to the previously described matrix-matrix
product device. The lower part of the device consists of an L-by-K
fixed-weight CCD multiplier bank. All the fixed-weight multipliers on
the same column have a common analog input (i.e. the output from the
CONCLUSION
power and weight efficiencies of these CCDs should have far-reaching
implications for many military and commercial applications such as
radar, communication, and image-processing systems.
REFERENCES
[Figure: generic CCD signal processor: a CCD floating-gate tapped delay line feeding a
bank of MDACs, an NxK digital memory, and a CCD accumulating memory; analog input,
digital input, analog output.]
[Figure: triple-matrix product device: the accumulating memory outputs g_kj feed a
fixed-weight multiplier bank computing

    h_kl = Σ_{i=1}^{J} g_ki d_il ,  i.e.  h_kl = Σ_{i=1}^{J} Σ_{n=1}^{N} c_kn f_ni d_il ,

for k = 1, ..., K and l = 1, ..., L.]