
VLSI
SYSTEMS AND COMPUTATIONS
Editors:
H. T. KUNG, BOB SPROULL, and GUY STEELE

Carnegie-Mellon University

SPRINGER-VERLAG
Berlin - Heidelberg - New York
Copyright 1981 Carnegie-Mellon University
Softcover reprint of the hardcover 1st edition 1981

All rights reserved. No part of this work may be reproduced, transmitted,
or stored in any form or by any means, without the prior written consent
of the publisher, except by a reviewer who may quote brief passages in a
review or as provided for in the Copyright Act of 1976.

Computer Science Press, Inc.


11 Taft Court
Rockville, Maryland 20850 U.S.A.

Printing: 1 2 3 4 5    Year: 85 84 83 82 81

ISBN-13: 978-3-642-68404-3    e-ISBN-13: 978-3-642-68402-9


DOI: 10.1007/978-3-642-68402-9

Cover design by Heidi Fieschko

This volume consists of papers presented at Carnegie-Mellon University's
Conference on VLSI Systems and Computations, October 19-21, 1981.

This book is distributed exclusively by Springer-Verlag Berlin-Heidelberg
in Africa, Australia, South America, and Western Europe.
PREFACE

The papers in this book were presented at the CMU Conference on VLSI Systems and
Computations, held October 19-21, 1981 in Pittsburgh, Pennsylvania. The conference was
organized by the Computer Science Department, Carnegie-Mellon University and was partially
supported by the National Science Foundation and the Office of Naval Research.

These proceedings focus on the theory and design of computational systems using VLSI.
Until very recently, integrated-circuit research and development were concentrated in the device
physics and fabrication design disciplines and in the integrated-circuit industry itself. Within the
last few years, a community of researchers has grown to address issues closer to computer
science: the relationship between computing structures and the physical structures that
implement them; the specification and verification of computational processes implemented in
VLSI; the use of massively parallel computing made possible by VLSI; the design of special-
purpose computing architectures; and the changes in general-purpose computer architecture
that VLSI makes possible. It is likely that the future exploitation of VLSI technology depends as
much on structural and design innovations as on advances in fabrication technology.

The book is divided into nine sections:


- Invited Papers. Six distinguished researchers from industry and academia
presented invited papers.
- Models of Computation. The papers in this section deal with abstracting the
properties of VLSI circuits into models that can be used to analyze the chip area,
time or energy required for a particular computation.
- Complexity Theory. This section shows how computations can be analyzed to
obtain bounds on the resources (chip area, time, energy) required to perform some
computation. The last paper in this section is a light-hearted reminder that
complexity theories must acknowledge reality.
- Layout Theory and Algorithms. Papers in this section describe ways to route
wires that connect together different circuits on a chip. This topic is of importance in
computer-aided design, but also relates to the complexity of circuit layouts.
- Languages and Verification. This section presents several results on the
specification and verification of circuits and of entire systems. The large number of
communicating processes in some VLSI architectures must be designed
methodically to insure proper operation.
- Special-Purpose Architectures. This section deals with systolic computing
architectures and their application to areas such as signal processing.
- Multiplier Designs. The problem of designing an efficient multiplier is of both
practical and theoretical interest. An important application for multipliers is in signal
processing.
- Processors. Two papers in this section describe new designs for single-chip
general-purpose computers whose architecture is influenced by VLSI design
opportunities.
- Systems and Processors. This section contains papers describing frameworks for
entire systems, such as parallel processing arrays and content-addressable
memories.


These papers were selected by the program committee from among 120 extended abstracts
submitted in response to the call for papers. Selection was based on originality and relevance to
the theme of the conference, and was very difficult, owing to the large number of excellent
papers submitted. Among the papers that could not be accepted were some excellent ones in
design automation and computer-aided design, important areas beyond the scope of the
conference.
We wish to express our thanks to the authors for making their works available while
complying with strict deadlines and formats to aid in the timely appearance of the book; to the
invited speakers for their excellent papers and for sharing their insights and experience; and to
the program committee members for their careful evaluation of the many extended abstracts,
despite the limited time made available to them. Especially, our grateful thanks go to
Louis Monier, who contributed greatly in the planning of the conference and the publication of
this book, and to Sharon Carmack, who was not only responsible for conference registration, but
also handled the many details involved in the preparation of the conference.

The logo and cover design appearing on this book and throughout the conference were
designed by E. Heidi Fieschko.
H. T. Kung and Bob Sproull
Fall 1981
Program Committee

Jim Clark, Stanford.


Danny Cohen, ISI, USC.
Jim Kajiya, Caltech.
Phil Kuekes, ESL Inc.
H.T. Kung, CMU.
Ed McCreight, Xerox PARC.
Ron Rivest, MIT.
Bob Sproull, CMU.
Guy Steele, CMU.
Earl Swartzlander, TRW Inc.
Jeff Ullman, Stanford.
Jean Vuillemin, INRIA.
Harper Whitehouse, Naval Ocean Systems Center.

Co-Sponsors

Carnegie-Mellon University.
National Science Foundation.
Office of Naval Research.

Authors

Arun, K.S. 235          Nash, J.G. 367
Baratz, A.E. 153        Nudd, G.R. 367
Baskett, F. 20, 337     Obrebska, M. 347
Baudet, G.M. 100        Owicki, S.S. 203
Bilardi, G. 81          Patterson, D.A. 327
Bromley, K. 273         Peek, J.B. 327
Brown, D.J. 178         Peshkess, Z. 327
Cappello, P.R. 245      Peterson, J.G. 21
Carter, T.M. 396        Pinter, R.Y. 126, 160
Chiang, A.M. 408        Powell, N.R. 41
Cohen, D. 124, 213      Pracchi, M. 81
Davis, A. 226           Preparata, F.P. 81, 311
Dolev, D. 143           Rao, D.V.B. 235
Fisher, A. 265          Rattner, J. 50
Fitzpatrick, D.T. 327   Rau, B.R. 389
Foderaro, J.K. 327      Reusens, P. 301
Foster, M.J. 196        Rivest, R.L. 153, 178
Gill, J. 337            Rosenberg, A.L. 69
Glaeser, C.D. 389       Ruane, L.M. 255
Hall, J.S. 379          Ruzzo, W.L. 119
Hansen, S. 367          Savage, C. 296
Hennessy, J. 337        Savage, J.E. 61
Hu, Y.H. 235            Sawai, A. 29
Hunt, C.E. 396          Sequin, C.H. 327
Johnsson, L. 213        Sherburne, R.W. 327
Jouppi, N. 337          Siegel, A. 143
Katevenis, M.G.H. 327   Siewiorek, D.P. 357
Kedem, Z.M. 52          Smith, K.F. 396
Ku, W.H. 301            Snyder, L. 119
Kuekes, P.J. 389        Speiser, J.M. 273
Kung, H.T. 255          Steiglitz, K. 245
Kung, S.Y. 235          Symanski, J.J. 273
Landman, H.A. 327       Thompson, C.D. 108
Lehman, P.L. 285        Tsao, M.M. 357
Leiserson, C.E. 126     Tseng, C.J. 357
Lengauer, T. 89         Van Dyke, K.S. 327
Luk, W.K. 317           Weiser, U. 226
Lyon, R.F. 1            Whitehouse, H.J. 273
Malachi, Y. 203         Wilson, A.W. 357
Mao, Y.H. 301           Wise, D.S. 186
McGarity, R.C. 357      Yen, D.W.L. 255
Mehlhorn, K. 89         Zorat, A. 52
Miller, G. 153

Contents

Preface v
Program Committee, Co-Sponsors vii
Authors Index viii

Invited Papers
The Optical Mouse, and an Architectural Methodology for Smart Digital Sensors
R.F. Lyon
Designing a VLSI Processor - Aids and Architectures
F. Baskett 20
Keys to Successful VLSI System Design
J.G. Peterson 21
Programmable LSI Digital Signal Processor Development
A. Sawai 29
Functional Parallelism in VLSI Systems and Computations
N.R. Powell 41
Functional Extensibility: Making The World Safe for VLSI
J. Rattner 50

Models of Computation
Replication of Inputs May Save Computational Resources in VLSI
Z.M. Kedem and A. Zorat 52
Planar Circuit Complexity and the Performance of VLSI Algorithms
J.E. Savage 61
Three-Dimensional Integrated Circuitry
A.L. Rosenberg 69
A Critique and an Appraisal of VLSI Models of Computation
G. Bilardi, M. Pracchi and F.P. Preparata 81

Complexity Theory
On the Complexity of VLSI Computations
T. Lengauer and K. Mehlhorn 89
On the Area Required by VLSI Circuits
G.M. Baudet 100
The VLSI Complexity of Sorting
C.D. Thompson 108
Minimum Edge Length Planar Embeddings of Trees
W.L. Ruzzo and L. Snyder 119
The VLSI Approach to Computational Complexity
D. Cohen 124

Layout Theory and Algorithms


Optimal Placement for River Routing
C.E. Leiserson and R.Y. Pinter 126
The Separation for General Single-Layer Wiring Barriers
A. Siegel and D. Dolev 143
Provably Good Channel Routing Algorithms
R.L. Rivest, A.E. Baratz and G. Miller 153
Optimal Routing in Rectilinear Channels
R.Y. Pinter 160

New Lower Bounds for Channel Width


D.J. Brown and R.L. Rivest 178
Compact Layouts of Banyan/FFT Networks
D.S. Wise 186

Languages and Verification


Syntax-Directed Verification of Circuit Function
M.J. Foster 196
Temporal Specifications of Self-Timed Systems
Y. Malachi and S.S. Owicki 203
A Mathematical Approach to Modelling the Flow of Data and Control in
Computational Networks
L. Johnsson and D. Cohen 213
A Wavefront Notation Tool for VLSI Array Design
U. Weiser and A. Davis 226
A Matrix Data Flow Language/Architecture for Parallel Matrix Operations Based on
Computational Wavefront Concept
S.Y. Kung, K.S. Arun, D.V.B. Rao and Y.H. Hu 235

Special-Purpose Architectures
Digital Signal Processing Applications of Systolic Algorithms
P.R. Cappello and K. Steiglitz 245
A Two-Level Pipelined Systolic Array for Convolutions
H.T. Kung, L.M. Ruane and D.W.L. Yen 255
Systolic Algorithms for Running Order Statistics in Signal and Image Processing
A. Fisher 265
Systolic Array Processor Developments
K. Bromley, J.J. Symanski, J.M. Speiser and H.J. Whitehouse 273
A Systolic (VLSI) Array for Processing Simple Relational Queries
P.L. Lehman 285
A Systolic Data Structure Chip for Connectivity Problems
C. Savage 296

Multiplier Designs
Fixed-Point High-Speed Parallel Multipliers in VLSI
P. Reusens, W.H. Ku and Y.H. Mao 301
A Mesh-Connected Area-Time Optimal VLSI Integer Multiplier
F.P. Preparata 311
A Regular Layout for Parallel Multiplier of O(log2n) Time
W.K. Luk 317

Processors
VLSI Implementations of a Reduced Instruction Set Computer
D.T. Fitzpatrick, J.K. Foderaro, M.G.H. Katevenis, H.A. Landman, D.A.
Patterson, J.B. Peek, Z. Peshkess, C.H. Sequin, R.W. Sherburne and
K.S. Van Dyke 327
MIPS: A VLSI Processor Architecture
J. Hennessy, N. Jouppi, F. Baskett and J. Gill 337
Comparative Survey of Different Design Methodologies for Control Parts of
Microprocessors
M. Obrebska 347

C.FAST: A Fault Tolerant and Self Testing Microprocessor


M.M. Tsao, A.W. Wilson, R.C. McGarity, C.J. Tseng and D.P. Siewiorek 357

Systems and Processors


VLSI Processor Arrays for Matrix Manipulation
J.G. Nash, S. Hansen and G.R. Nudd 367
A General-Purpose CAM-Based System
J.S. Hall 379
A Statically Scheduled VLSI Interconnect for Parallel Processors
B.R. Rau, P.J. Kuekes and C.D. Glaeser 389
The CMOS SLA Implementation and SLA Program Structures
K.F. Smith, T.M. Carter and C.E. Hunt 396
A New CCD Parallel Processing Architecture
A.M. Chiang 408
The Optical Mouse, and an Architectural
Methodology for Smart Digital Sensors
Richard F. Lyon
VLSI System Design Area
Xerox Palo Alto Research Center
3333 Coyote Hill Road
Palo Alto, California 94304

1. Introduction
A mouse is a pointing device used with interactive display-oriented computer systems,
which tracks the movement of a user's hand as the user pushes the mouse about on a pad
(usually on the work surface next to the user's keyboard). Mice have recently become available
in the office products market as a part of the Xerox "Star," the 8010 Professional Workstation
[Business 1981, Seybold 1981-1, and Seybold 1981-2].
The work reported here is motivated by the desire for a high-reliability mouse with no
moving parts (excluding button switches if any). In Xerox research, the mouse has been in
popular use for over eight years, and has been found to be preferable to other pointing devices
[Card et al. 1977]. However, it has not been outstandingly reliable; the balls or wheels can get
dirty and slip on the pad, rather than rolling, or the commutators can get dirty and skip. This
is likely to be a significant problem in maintaining workstations that use the mouse in an
uncontrolled environment. Another disadvantage of the electro-mechanical mouse is that it's
expensive; the one-chip optical mouse is cheap. And the special patterned pad that it needs to
make it work is cheap, too, as it can be printed for about a penny on an ordinary ink press.
The goal of a mouse with no moving parts has been achieved through the use of a
combination of innovations in electro-optics, circuits, geometric combinatorics, and algorithms,
all implemented in a single custom NMOS integrated circuit (patent pending); see figure 1 for
an illustration of the optical mouse.
Figure 1. The Optical Mouse (showing the PC board, button, cable, and patterned pad surface).

2. Background on mouse implementations


Electro-mechanical mice were first developed in the 1960's at Stanford Research Institute,
and are described in [Newman & Sproull 1973, Englebart 1970, Englebart & English 1968,
and Englebart et al. 1967]. The original mouse used a pair of wheels turning potentiometer shafts
to encode X and Y motion into analog signals. Each wheel turns as the mouse is moved along
its respective dimension, and slips sideways as the mouse is moved in the orthogonal
dimension; both wheels turn and slip simultaneously as the mouse is moved diagonally.

The mouse was redesigned at Xerox to use ball-bearings as wheels, and optical shaft
encoders to generate a two-bit quadrature signalling code (see figure 2). That is, the motion of
a wheel caused the two output bits for that dimension to form square waves in quadrature, with
phase and frequency determined by the direction and speed of travel; each bit transition
represented motion of one resolvable step, which was used to move the cursor one pixel on the
screen [Hawley et al. 1975]. The mouse was again redesigned to use a ball instead of two
wheels, eliminating the drag of side-slipping wheels [Rider 1974, and Opocensky 1976];
internally, it was built like a trackball [Koster 1967], with shafts turning against the ball and
using commutators as shaft encoders.

Figure 2. Quadrature Encoding of Pointer Motion on a Bitmap Display System (signals XA, XB, YA, YB; XA leading = Left, XB leading = Right; YB leading = Up, YA leading = Down).
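
To make the quadrature signalling concrete, here is a minimal decoding sketch in Python (ours, not part of any mouse hardware or host software; the direction labels follow figure 2, and the sampling interface is assumed). Successive (XA, XB) samples advance around the Gray-code cycle 00, 01, 11, 10 in one direction or the other, and each single-bit transition represents one resolvable step:

    # Hypothetical host-side decoder for a two-bit quadrature code.
    CYCLE = [(0, 0), (0, 1), (1, 1), (1, 0)]      # Gray-code order of states

    def quadrature_steps(samples):
        """Yield +1 or -1 for each one-step transition between (XA, XB) samples."""
        prev = samples[0]
        for cur in samples[1:]:
            if cur == prev:
                continue                           # no motion since the last sample
            delta = (CYCLE.index(cur) - CYCLE.index(prev)) % 4
            if delta == 1:
                yield +1                           # XB leading: Right, per figure 2
            elif delta == 3:
                yield -1                           # XA leading: Left, per figure 2
            else:
                raise ValueError("two bits changed at once; sampled too slowly")
            prev = cur

    # Two steps one way, then one step back: a net displacement of one step.
    print(sum(quadrature_steps([(0, 0), (0, 1), (1, 1), (0, 1)])))   # -> 1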

The concept of an optical mouse, which "watches" the pad and tracks an image of it, is
not entirely new; however, until now the problem of extending the familiar one-dimension
quadrature encoding techniques to two dimensions has not been satisfactorily solved. A
popular attempt has been to use a "grid-tracking" concept to try to directly emulate the
quadrature commutator scheme, using a pair of optical detectors for each dimension.
Unfortunately, there is no known easy way to separate the optical images of the lines for the
two dimensions, and to make the mouse work even when rotated.
In the electro-mechanical mouse and our new spot-tracking optical mouse, motion is
detected relative to the mouse body axes, independent of mouse body rotation; they use "pads"
with no inherent coordinate systems. The grid-tracking idea would detect motion relative to
the pad axes, and degrade with mouse body rotation. It is not obvious which tracking style is
preferable, but the way the electro-mechanical mouse and optical mouse work now is certainly
acceptable.

3. Overview of the imager and motion tracker - a smart digital sensor


The mechanism described in this paper combines two novel concepts to make a one-chip
imaging and tracking system for an optical mouse. The imaging technique may have other
applications, as well (but does not compete with dense CCD analog imagers). The optical
tracking imager for the mouse application has been implemented in the form of an NMOS
chip, which is compatible with the Xerox mouse; it has been packaged in a standard mouse
housing, and is in routine use.
The first concept is a simple "mostly digital" circuit that produces digital image
(bitmap) snapshots of bright features in a dark field, using self-timed circuit techniques and
mutually inhibiting light sensors (a variation on this technique, which detects dark features in a
light field, is also discussed).
The second concept is a tracking algorithm, involving an easy-to-track contrasting
pattern, a detector array and inhibition network matched to the pattern, and the design of the
digital machine that takes images of that pattern as input and tracks relative image motion.
Both concepts apply equally well to either linear or two-dimensional sensor arrays.
There are other novel aspects of the mouse chip, such as the integration of sensors,
memory, and logic in a single array, using a standard MOS logic technology. The chip also
illustrates several interesting layout, circuit, and timing styles that are widely applicable.

4. Architectural methodology
The optical mouse chip was designed as an experimental application, in a new domain, of
the logic, timing, circuit, and layout design methodologies taught by [Mead & Conway 1980].
It was designed with the goal of fab-line and process-parameter independence, so it utilizes
only very simple and conservative device models, design rules, circuits, and timing techniques.
Those methodologies have been informally extended into an architectural methodology for
sensors, which have to deal with real-world analog effects and convert them to stable and
reliable digital form in the face of wide parameter variations. An architectural methodology is
a set of guidelines and constraints that help a designer pick a system architecture, by showing
how certain architectural concepts can be implemented and made to work, with high certainty.
An architectural methodology for a different domain is discussed in [Lyon 1981].
The layers of design methodologies used to map a concept into a layout must be supported
by a compatible implementation system that will map that layout into working silicon. Such a
system, described in [Hon & Sequin 1980 and Conway et al. 1980], was used to carry out the
implementation of the Optical Mouse design as part of a multiproject chip set, on an outside
vendor fab line.
The benefits of this approach are clear in the resulting chip: design time was very short,
standard switch-level simulation could be used to verify the correctness of the circuits, the first
implementation worked, several orders of magnitude of light-level variation are tolerated, and
the techniques developed are very robust against process parameter variation, temperature
variation, etc.
The idea of using lateral inhibition to make a digital imager was conceived in June 1980;
the rest of the techniques discussed here were developed while writing up the inhibition idea,
in June and July 1980. A chip design was done quickly in the latter part of July, and was
debugged by hand cross-checking of the layout against design sketches (thanks to C. P.
Thacker, some bugs were found and corrected). After the chip was into implementation, our
tools for design rule checking, circuit extraction, and simulation became more available, and the
design was verified as correct except for some non-fatal design rule violations.
Finished chips were delivered by the implementation system in December, and were
quickly tested on a crude test lash-up connected to the mouse port on a personal workstation.
Later, with the help of several interested colleagues, a completely packaged mouse prototype
based on this chip was completed.
The optical mouse chip should be regarded as only the first representative of a new
architectural methodology for smart digital sensors. It seems clear that there will be many
more applications of bits and pieces of this methodology to sensors of all sorts. For example,
even in something so simple as an analog-to-digital converter, great performance enhancements
can be made by using self-timed successive approximation logic to optimize speed while
avoiding metastable conditions.

5. Digital imager description


Because it is easily available to us at Xerox, the NMOS integrated circuit technology was
chosen to implement the optical mouse chip; other technologies, such as PMOS, CMOS, or
bipolar, could be used as well. In NMOS, when light strikes the circuit side of a chip, the
photons get converted to hole-electron pairs with some reasonable quantum efficiency (see
figure 3); the holes are generally attracted to the negative-biased p-type silicon substrate, while
the electrons are attracted into n-type diffused source/drain regions and channel regions
[Sequin & Tompsett 1975]. Thus, light is detected by collecting negative charge (electrons). If
a node is isolated by a turned-off transistor, it is said to be a "dynamic node". A dynamic
node which has been charged to a positive voltage will "leak" to a lower voltage as light is
received. An imager is simply an array of subcircuits, with a dynamic node in each, which can
watch the declining voltages and make a sensible bitmap image from them.
The guts of each imager pixel (subcircuit or cell) is therefore a dynamic node, a transistor
to "reset" it high and then isolate it, and an "inverter" circuit to sense the voltage of the node

Figure 3. An NMOS Photo-Diode (symbol, layout showing the aluminum metal interconnect and diffusion, and cross section showing the junction in the p-type silicon substrate).

and communicate it out to other circuits. The output voltage from the inverters will start low
when the array is reset, then go toward high as the corresponding dynamic nodes go low due to
light. Figure 4 shows a schematic diagram of this simple "analog" imager cell.
An array of analog imagers of this sort has a digital all-low output initially, then has an
interesting analog image for a while, but eventually ends up in the digital all-high state until it
is reset. Both of its digital states are uninteresting. What we would like is a way to get an
interesting digital bitmap image reliably. A way to do this is to implement a form of
"inhibition" between cells, so that after some cell outputs have gone high, all others are held
low and the picture is stable from then on. This is somewhat analogous to the lateral inhibition
in the retina of most biological vision systems [von Bekesy 1967]. It has the desirable effect of
producing sensible images, almost independent of light level. Such digital sensor arrays can be
built in a self-timed loop of logic that recognizes stable images, latches them, resets, and starts
over, at a rate roughly proportional to the light intensity.

Figure 4. Simple "Analog" Imager Cell (NMOS circuit diagram and logic diagram; after Reset, the output changes from low to high at a rate proportional to the light level on the photo-diode).



The simplest imager with mutual inhibition is the two-pixel system shown in figure 5.
Each pixel circuit is essentially a NOR-gate, with one input of each being the light-sensitive
dynamic node, and the second input being the output of the other cell. The initial reset state is
00, with outputs being pulled low by the NOR inputs that are connected to the initially high
dynamic nodes. The final state can be either 01 or 10, since 00 will decay with time and 11 is
not possible as the output of cross-coupled NOR gates.
Figure 5. Two-Pixel Digital Imager Logic Diagram (cross-coupled cells with Sensor-Node-1, Sensor-Node-2, Pixel-Light-1, and Pixel-Light-2, plus Ready, Reset, and Done signals).

The existence of a final state can be sensed by an OR gate whose logic threshold is higher
than the thresholds of the pixel NOR gates. Intermediate and metastable states will have both
output voltages near the NOR gate thresholds, but distinctly below the OR gate threshold. So
this two-pixel digital imager compares the light level at two points, and indicates when it has
made a decision (but there is no bound on how long it might take, even in bright light).
More complicated logic can be used to detect stable images (Done) in larger sensor arrays
with more complicated inhibition NOR networks.
The concept illustrated by the two-element imager is the use of additional transistors to
convert the image sensing inverters to cross-coupled NOR gates, as in a flip-flop. Any pairs of
elements in an imaging array may be chosen to be connected by these two-transistor mutual
inhibition subcircuits. For example, each pixel may be connected with its eight neighbors in a
square grid, resulting in nine-input NOR gates.
For any pattern of inhibition and any shape and size image array, the set of possible stable
images can be easily enumerated. For example, in a three-by-three array with neighbor
inhibition, the following eight images can be derived by inspection (notice that all 0 bits are
inhibited from changing to 1, by virtue of having a neighbor equal to 1):

000  101  010  000  101  100  010  001
010  000  000  101  000  001  000  100
000  101  010  000  010  100  101  001

Of course, in larger arrays the images are more interesting, and often more numerous.
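
The enumeration is easy to mechanize. The following brute-force sketch in Python (ours, not the chip's logic) encodes the stability condition directly: no two 1s may be coupled, and every 0 must be coupled with at least one 1. Run on a three-by-three array with neighbor inhibition, it reproduces the eight images above:

    from itertools import product

    def stable_images(rows, cols, coupled):
        """Enumerate stable bitmaps; coupled(a, b) says cells a and b inhibit each other."""
        cells = [(r, c) for r in range(rows) for c in range(cols)]
        for bits in product((0, 1), repeat=len(cells)):
            ones = [p for p, b in zip(cells, bits) if b]
            zeros = [p for p, b in zip(cells, bits) if not b]
            # No two 1s may inhibit each other...
            independent = all(not coupled(a, b)
                              for i, a in enumerate(ones) for b in ones[i + 1:])
            # ...and every 0 must be held low by some coupled 1.
            covered = all(any(coupled(z, o) for o in ones) for z in zeros)
            if independent and covered:
                yield bits

    # Neighbor inhibition: the eight surrounding cells (Euclidean radius 1.5).
    neighbor = lambda a, b: 0 < (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2 < 1.5 ** 2

    print(sum(1 for _ in stable_images(3, 3, neighbor)))   # -> 8, as shown above
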
In section 9, we will show that by using a four-by-four sensor array, with inhibition of cells
up to 2.9 or more pixels away, it is easy to formulate a simple and reliable tracking algorithm
that follows spots in a hexagonal array and represents their motion with a quadrature code.

6. Digital imager logic definition


In the mutually-inhibiting detector array, the cells race to see which can be the first within
a neighborhood to get enough light and inhibit the others. To formally define the logic of
these cells, including general done-detect capability, we use four logic variables in each cell:
Sensor-Node, Pixel-Light, Spot-Detected, and Cell-Done. Start by resetting to the state
Sensor-Node = 1, Pixel-Light = 0, Spot-Detected = 0, and Cell-Done = 0. Then, with the
following logic, wait until Cell-Done = 1 in all the cells:
Sensor-Node discharges from 1 to 0 as light hits.
Pixel-Light = NOR(Sensor-Node, Pixel-Lights from other cells in neighborhood)
Spot-Detected = High-Threshold-Buffer(Pixel-Light)
Cell-Done = OR(Spot-Detected, Spot-Detecteds from other cells in neighborhood)

The inhibition network is defined by choosing an inhibition neighborhood for each cell.
Generally, we choose neighborhoods symmetrically, such that if A inhibits B, then B inhibits A;
we say A "is coupled with" B, reflecting the cross-coupled NOR structure. In many cases, the
inhibition neighborhood of some cells will be all other cells in the array; Cell-Done signals
from such cells will be redundant, but may be implemented just for the convenience of layout
regularity.
Note that we do not use the inhibition NOR gate output itself for done-detection, but a
buffered version of it after a high threshold buffer (inverter pair); this is the easiest way to
prevent false done-detection during a metastable condition [Seitz 1980]. The buffered signal is
not used for inhibition, since that would make it participate in the metastable condition, and
because the extra delay would cause oscillatory metastable states.
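
The race itself can also be modeled in a few lines (a sketch of ours, under the simplifying assumptions that brighter cells always discharge sooner and that there are no exact ties): cells are visited in order of brightness, and a cell's output goes high only if no coupled cell has fired first.

    def race(light, coupled):
        """light maps cell -> brightness; returns the resulting stable image."""
        fired = set()
        for cell in sorted(light, key=lambda c: -light[c]):   # brightest first
            if not any(coupled(cell, f) for f in fired):
                fired.add(cell)        # Pixel-Light goes high, inhibiting the rest
        return {c: int(c in fired) for c in light}

    # Four cells in a line with radius 2.5 inhibition, lit brightly over cell 1:
    coupled = lambda a, b: 0 < abs(a - b) <= 2
    print(race({0: 2, 1: 9, 2: 3, 3: 1}, coupled))
    # -> {0: 0, 1: 1, 2: 0, 3: 0}, i.e. the stable image 0100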

7. One-dimensional tracking imagers


The simplest application that illustrates the digital imager/tracker is a linear motion sensor,
which is built from a row of imager cells looking at white stripes (approximately orthogonal to
the row of imager cells) on a dark background. It is possible to apply our digital imager idea
directly to the familiar quadrature detection scheme, by using four sensors in two interleaved
coupled pairs; i.e., sensors A B C D would have A coupled with C and B coupled with D. The
possible stable images that can result are these four:

0011  0110  1100  1001   (two-pair inhibition)

If the white and dark line widths are both equal to about twice the sensor spacing, these
images correspond in an obvious way to positions of the stripes relative to the sensors. Any
two adjacent sensor outputs, say A and B, can be used directly as quadrature output signals that
sequence through the states 00, 01, 11, 10, forward or backward, depending on the direction of
motion. The advantage over previous optical quadrature detectors is that no fixed threshold or
specific light level is needed. The sensors will cycle at a rate depending on the light level, and
latched outputs will be made available to the host system.
Another linear tracking scheme that is closer in spirit to our two-dimensional tracker uses
narrow white lines (about one-third white) and a different inhibition pattern. If four imager
cells are used, and we arrange to have each cell inhibit cells up to two steps away (say cells at
distance less than 2.5), then we get a set of three stable images, shown here:

1001  0100  0010   (radius 2.5 inhibition)

If the white line spacing (imaged onto the chip) is about three cell widths, then these
images correspond in an obvious way to positions of the bright lines relative to the cells
(1 = bright); see figure 6. The figure illustrates a simple digital machine (on the same chip) that
would compare the current image with the previous image (i.e., the machine has only three
states) and output a signal that says moved up or moved down. Thus we have a relative motion
sensor for one dimension of travel. A 2-bit counter is used to convert to the familiar
quadrature signal representation which is convenient for asynchronous sampling by the host
system.
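
In software terms, that machine amounts to the following sketch (ours; the up/down labels are an assumed convention, and the three stable images are those of the radius-2.5 scheme just introduced, ordered cyclically by line position):

    ORDER = ["1001", "0100", "0010"]   # line over cells {0,3}, cell 1, cell 2

    def track(images):
        """Yield 'up', 'down', or 'none' for each successive pair of stable images."""
        prev = images[0]
        for cur in images[1:]:
            d = (ORDER.index(cur) - ORDER.index(prev)) % 3
            yield {0: "none", 1: "down", 2: "up"}[d]
            prev = cur

    # A line drifting steadily one way past the sensor cells:
    print(list(track(["0100", "0010", "1001", "0100"])))   # -> ['down', 'down', 'down']
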
Other spacings, inhibition patterns, numbers of cells, etc., can be applied easily to the
linear motion detector problem. The real challenge is to make it work in two dimensions, and
to make it tolerant of rotation (of the imager with respect to the pattern). After discussion of
inhibition patterns, we show how to extend the 4-element one-dimensional line-tracker to a
4-by-4-element two-dimensional dot-tracker.

Figure 6. Linear Motion Detector including generalized done-detect (typical configuration: a pattern of bright lines on a dark background, with the image of the lines moving up or down relative to the sensor cells; Cell-Done-2 and Cell-Done-3 are redundant here; quadrature signalling output goes to the user system).



8. More about inhibition


First we need to understand patterns of inhibition. We can do this with pictures like those
above, but showing only a single 1 (in each possible position) and the set of elements that are
inhibited (forced to 0) by being coupled with it. Other elements remain unknown and are
designated + (at least one +, if any exists, must change to a 1 to make a stable image).
For the first one-dimensional tracker, the inhibition patterns are these:

1+0+ +1+0 0+1+ +0+1 (two-pair inhibition)

And for the second they are these:

100+  0100  0010  +001   (radius 2.5 inhibition)

In many cases, we can specify inhibition neighborhoods as all cells within a certain radius,
by Euclidean distance in the plane, assuming unity cell spacing. We choose a radius such that
no cells fall at exactly that radius, to avoid ambiguity; hence radius 1.5 means cells at distance
1.414 in the plane are inhibited, but cells at distance 2.0 are not. Some inhibition
neighborhoods, however, cannot be specified simply by a radius; two-pair inhibition is an
example.
Figure 7 graphically tabulates a succession of inhibition neighborhoods and the resulting
stable images, for four-element linear sensor arrays and four-by-four two-dimensional sensor
arrays. Square symmetry is assumed to reduce the complexity of the figure.
Notice that radius 2.9 is the smallest inhibition neighborhood such that when comparing
images, no dot can appear to have moved to two different adjacent pixels. That is, this
sequence cannot occur:

old         new
0 0 0 0     1 0 0 0
0 1 0 0     0 0 0 0    (moved up-left or down-right?)
0 0 0 0     0 0 1 0    (can't happen for radius > 2.83)
0 0 0 0     0 0 0 0

What appears to be most useful is the "3.0 special" pattern of inhibition, a cross between
the radius 2.9 and radius 3.1 patterns (radius 3.0, where points separated by exactly three pixels
are coupled only if they are corners). The stable images that can result from this inhibition
pattern fall into two classes: either there is a single 1 in the central quad of pixels, or there are
two 1's on opposite edges (but not on adjacent corners).
Figure 7a. Inhibition neighborhoods and stable images for a four-by-one sensor array, tabulating for each inhibition neighborhood and radius the stable images, how many of each, and the total (e.g., two-pair inhibition gives 4 stable images; radius 1.5 gives 3; radius 2.5 gives 3; radius 3.5 gives 4).
Black cells are "responsive" (a dot is detected there); white cells are "inhibited" and not responsive; cells with a + are not inhibited by the indicated responsive cell.

Figure 7b. Inhibition neighborhoods and stable images for a four-by-four sensor array, tabulating for each inhibition radius the stable images and how many of each. Total image counts: radius 1.1, 42; 1.5, 79; 2.1, 43; 2.3, 25; 2.9, 21; "3.0 special", 30; 3.1, 26; 3.2, 14; 3.7, 14; 4.3, 16.
* "3.0 special" was chosen for the optical mouse; its images are simply characterized as either a central dot or a pair of dots on opposite edges but not sharing an edge.
If the 1's represent white dots being imaged onto the chip, any motion of the dots forming
the image will either leave one or two dots within the field of view, or one dot will leave the
field of view and the other will stay. So there is always a dot to track.

9. Image tracking using bitmaps


The general tracking concept is to use a hexagonal array of white dots (which just looks
like dots of constant spacing but no particular orientation when seen through a small window at
an arbitrary angle), and to pick a dot spacing such that bitmap images can be associated with
the dot array easily and movement can easily be detected by comparing successive snapshots.
The white dot spacing should be slightly more than the inhibition distance, as a general rule of
thumb. For example, using radius 2.9, "3.0 special", or 3.1 inhibition with a four-by-four
sensor array, we recommend a dot spacing of about 3.4 pixels, because that is about the average
distance between dots in the stable images with two dots. Then the dots in the stable images
correspond in an obvious way (see figure 8) to positions of one or more dots of the hexagonal
dot array.

Figure 8. Various positions of 4x4 imagers with respect to a hexagonal dot array, showing ways to see all the possible stable images for radius 2.9 or more inhibition.
If we use radius 2.9 inhibition instead of "3.0 special" or 3.1, the "four-corners" image
would give us an interesting problem. Although the images of two and three dots are easy to
integrate into a set of images of dots in a hexagonal array, the image of four dots is not.
Worse than that, it is possible for a positioning of two dots near opposite corners to force the
four-dot image to occur; then it is impossible to tell in which pair of opposite corners the dots
were really seen. This is why the "3.0 special" pattern was developed: it eliminates the four-
corner image and the images of three dots, while still allowing all the images of two dots, some
of which would have been eliminated by going to radius 3.1. The images of three dots are not
really missed, since seeing only two of the three dots still guarantees that with movement at
least one of the dots will remain in the field of view, so the image can be tracked by looking at
local dot motion.
Counting all rotations and mirrorings, there are 30 distinct stable images for the "3.0
special" inhibition. Of the 900 combinations of two successive stable images, most have an
obvious interpretation in terms of movement of the white dots with respect to the imager; those
that do not have an obvious interpretation must be handled by the tracking algorithm, but will
probably not occur often.
A possible non-specific implementation of the tracking algorithm is simply a finite-state
machine which takes one stable image as input (possibly encoded in just a few bits), looks also
at its current state (state equals previous input, most likely), and outputs a signal indicating
direction of movement based on the state and input, and also outputs a new state. If the
machine is built of a simple PLA (programmed logic array) with no special encoding, the PLA
can have as many as 32 inputs and 900 product terms, which would occupy most of a
reasonable size NMOS chip. The size could be reduced by first encoding the 30 images into 5
bits (PLA with 10 inputs instead of 32), and by not decoding image pairs which are
meaningless or which correspond to no motion (maybe about 600 terms instead of 900); so it
may fit in a quarter of a chip. We are still free to design the tracking algorithm and specify
PLA outputs required, and program the PLA accordingly (i.e., the tracking problem may be
regarded as a simple matter of programming). A more specific tracking algorithm and a novel
compact implementation of it will be described in section 11.

10. The self-timed action of the imaging/tracking system


Before going into tracking algorithms, we should indicate how the imager and the
synchronous finite-state machine get tied together by timing logic to make a self-timed
machine, and how they control the output logic that generates the two pairs of quadrature
signals that the host computer wants to see as an indication of mouse movement.
What is needed is a circuit which will generate two-phase nonoverlapping clock signals to
run the digital logic, such that each cycle is synchronized to the reset-done cycle of the imager.
This same clock runs an up-down counter controlled by the PLA for each of X and Y, to
generate quadrature signals which can be communicated off chip. So we have three things to
design, the particulars of which are not interesting in isolation: done and ready detectors, clock
and reset signal generation circuit, and up-down counter with quadrature outputs.
These parts are blocked out, along with logic-level and timing details of the clocking
circuit, in figure 9.

Figure 9. Imager and Logic tied together by Self-timed Clock Circuit, with timing waveform diagram (the digital imager's stable outputs, Ready, and Done signals pass through a timing interface; waveforms are shown for Sensor-Node, Pixel-Light, Ready, Done, Stop, Phi-Long, Phi-Short, and Reset over the Watching (long time) and Cycling (short time) phases).

Clocks are generated through a delay-independent (self-timed) handshake

with the imager array, and it is assumed that the digital logic is fast enough to keep up with the
imager (this assumption becomes a constraint for the designer to satisfy). The generated clocks
are called Phi-long and Phi-short, to indicate which one is of unbounded length; Phi-long
should be used as a quasi-static feedback enable to keep the logic alive and insensitive to light
while waiting for the imager. The steps of operation of the clock generator are in quick
succession as follows:

Start in the initial sensing state, just after Reset ← 0;
Ready = 1 (meaning all Pixel-Lights are 0),
Done = 0 (meaning not a stable image),
Phi-long = 1 (this is during the long, or waiting, clock phase),
Phi-short = 0 (because the other phase is 1),
Stop = 0 (because not Done yet).
After a little light is received, some Pixel-Light output starts toward 1; then:
Ready ← 0 (at some irrelevant time before the picture is stable).
When enough light is received, one or more Spot-Detected's goes to 1, the picture
becomes one of the stable images, and this happens:
Cell-Done ← 1 in all cells,
Done ← 1,
Stop ← 1,
Phi-long ← 0,
Phi-short ← 1,
Stop ← 0,
Reset ← 1,
Sensor-Node ← 1 in all cells,
Pixel-Light ← 0 in all cells,
Spot-Detected ← 0 in all cells,
Done ← 0,
Ready ← 1,
Phi-short ← 0,
Phi-long ← 1,
Reset ← 0,
And it is all back where it started, having gone through a cycle.

The good thing about this technique is that it doesn't care how slow the imager is;
everything is willing to wait until there is a solid digital answer. Hopefully, the imager will
receive enough light to cycle faster than once every few hundred microseconds on the average,
so it will be able to get image samples often enough to track mouse motion of several thousand
steps per second.
The counters needed for X and Y simply count through four states, in either direction (up
or down), changing only one bit at a time (i.e., 00, 01, 11, 10). This is a simple case of either a
Gray-code counter or a Johnson counter (Moebius counter). The PLA (tracker machine)
outputs needed to control the counters are just Right-X, Left-X, Up-Y, and Down-Y.
In the scheme actually implemented, the counters run through eight states, so that the
tracking algorithm can report a finer gradation of motion (Up-Half-Y, etc.). Only four states,
representing full steps, would be seen by the host system; the states mentioned above are
simply augmented by an alternating "least significant bit", so the eight-state sequence is 000,
001, 010, 011, 110, 111, 100, 101.
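
A software model of that counter (a sketch of ours; only the state sequence is taken from the text above) shows how the two host-visible bits step through the Gray-code cycle while the extra bit records half steps:

    SEQ = ["000", "001", "010", "011", "110", "111", "100", "101"]

    class QuadCounter:
        def __init__(self):
            self.i = 0

        def step(self, direction, half=False):
            """direction is +1 or -1; a full step advances two states, a half step one."""
            self.i = (self.i + direction * (1 if half else 2)) % 8
            return SEQ[self.i][:2]      # the two bits the host system samples

    c = QuadCounter()
    print([c.step(+1) for _ in range(4)])   # -> ['01', '11', '10', '00']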

11. Designing and implementing a tracking algorithm


That brings us back to tracking algorithms. The simplest "algorithm-design" technique is
to get a big piece of paper, draw the 30 stable images across the top and again down the left
side, and make 900 little squares to fill in. For each combination of an old image from the left
edge and a new image from the top edge, write in the square which way it looks like the dots
moved, and by how much (half step or full step).
One quickly develops simple algorithms to describe the reasoning about filling in the
squares. But how do we write some simple rules to do this in a digital machine, without
resorting to precomputing all the cases? To fit the capabilities of VLSI, we have come up with
a distributed local algorithm which can be implemented right in the imager array. Each pixel
saves its old value in a register, and on each cycle compares it with its new value and that of all
its neighbors. Each pixel reports one of eleven results (my dot moved to one of 8 neighbors,
my dot stayed, my dot disappeared, or I didn't have a dot to track) to some decision logic. The
decision logic then just has to see what gets reported, and filter out contradictions (a move and
a stay can be converted to a half-step move).
The decision logic can also be partially distributed as a set of nine AND-OR-INVERT
gates running through the array (one for each of the eight move directions and one for the
no-move case; disappearing dot and no dot to track are not reported). These gates report a low
logic state if a pixel had a dot in the old picture, AND the appropriate neighbor has a dot in
the new picture, OR any other pixel met a similar condition. A single 9-input conflict
resolution PLA is needed outside the array to decode combinations of zero, one, or two
reported move directions and to produce the counter control signals (see figure 10). Actually,
of the 36 conceivable patterns of more than one indicated movement, only twelve are both
possible and clearly meaningful (as half-steps); so the logic can be very simple (PLA with only
20 terms, for the eight possible full-steps and the eight possible half-steps, four of which occur
two ways). Any other sequence, whether sensible or not, will produce no count commands.
The eight-state up-down counters are also most easily designed as PLA's.
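
The per-pixel reports and the conflict resolution can be sketched in software as follows (our model with an assumed data layout, not the chip's distributed gates; bitmaps are lists of lists, and motion is returned in half-step units):

    DIRS = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)]   # (0, 0) = stayed

    def reports(old, new):
        """Collect the (dr, dc) move reports from every pixel that had a dot."""
        out, n = set(), len(old)
        for r in range(n):
            for c in range(n):
                if not old[r][c]:
                    continue                   # no dot to track: nothing reported
                for dr, dc in DIRS:
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < n and 0 <= cc < n and new[rr][cc]:
                        out.add((dr, dc))
        return out

    def resolve(moves):
        """0, 1, or 2 reports -> motion in half-step units, else no count."""
        if len(moves) == 1:
            (dr, dc), = moves
            return (2 * dr, 2 * dc)            # a full step
        if len(moves) == 2:                    # e.g., a move plus a stay: a half step
            a, b = moves
            return (a[0] + b[0], a[1] + b[1])
        return None                            # nothing, or a contradiction: no count

    old = [[0,0,0,0], [0,1,0,0], [0,0,0,0], [0,0,0,0]]
    new = [[0,0,0,0], [0,0,0,0], [0,0,1,0], [0,0,0,0]]
    print(resolve(reports(old, new)))          # -> (2, 2): a full step down-right
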
Figure 10. Tracking Spots by Comparing Images (example: comparing the old and new images, one cell reports Moved-Down and another reports Moved-Right, giving a resultant half-step down-right; the nine report lines Moved-Up-Left, Moved-Up, Moved-Up-Right, Moved-Left, Stayed-Here, Moved-Right, Moved-Down-Left, Moved-Down, and Moved-Down-Right feed a Tracker PLA of 22 terms, which drives the eight-state X and Y counters via the X-Right/X-Half/X-Full and Y-Up/Y-Half/Y-Full controls to produce the X-A, X-B, Y-A, Y-B quadrature and test signals; exactly 0, 1, or 2 of the report lines is true).



12. The mouse chip layout


A mouse chip has been designed in NMOS, as a direct one-chip substitution for the
existing electro-mechanical mouse works (to go with a light and three button switches). For
complete compatibility, the chip includes debounce electronics for the button switches. It is
about 3.5mm by 4.5mm in a typical NMOS process (with lambda=2.5 microns, or 5-micron
lines).
There is a single layout for a programmable sensor and logic cell, which can be customized
for each position in the array to implement any inhibition pattern and the described tracking
algorithm. The logic to detect a stable image is also partly programmable and distributed. The
cell layout with programming for the top left position is shown in figure 11.
The layout style used in this first version of the chip treats a sensor cell with its logic and
memory as a low-level cell, and constructs the array by selective programming of the cells in
different positions. This approach costs large amounts of wiring area, since every cell has to
have access to every other cell's Pixel-Light line. This area penalty was not regarded as a
problem, until it was realized that it causes a related light sensitivity problem: about 90% of
the photons get lost in the wires, far from the sensor nodes where they could do some good.
To improve light sensitivity, and also to improve the magnification ratio needed in the optical
path, we have switched to a new layout style, using a densely packed array of N+ diffused
areas as sensor nodes, with all logic in compact regular structures outside of it. A new chip
based on this approach has been designed by M. P. Haeberli, and will be the subject of a
future report.
One other layout feature of note is the regular structure used for the "random" timing
logic. It is essentially just like one plane of a PLA, except that it can also be programmed with
contacts between the lines running orthogonally through it. With a bit of optimization, this
becomes topologically identical to I2L-style gate-matrix layout. Look for more on this in a
future report.

13. Tracking dark spots, instead of light spots


If we redefine the logic of the sensor cells, we can make an array that looks for a set of
images of dark spots in a light field; this approach has some different and interesting
properties. In the design previously described (light-spot detector), the cells race to see which
can be the first within a neighborhood to get enough light and inhibit the others; in this new
technique, the cells want to see which can be the last to get enough light, which requires quite
a different logical approach. To define the logic of this new cell, we use five logic variables in
each cell: Sensor-Node, Pixel-Light, Pixel-Dark, Spot-Detected, and Cell-Done. Start by
resetting to the state Sensor-Node = 1, Pixel-Light = 0, Pixel-Dark = 1, Spot-Detected = 0, and
Cell-Done = 0. Then, with the following logic, wait until Cell-Done = 1 in all the cells
(Sensor-Node goes slowly from 1 to 0 as light hits):

Pixel-Light = NOR(Sensor-Node, Spot-Detected)
Pixel-Dark = Invert(Pixel-Light)
Spot-Detected = NOR(Pixel-Light, Pixel-Darks from other cells in neighborhood)
Cell-Done = High-Threshold-OR(Pixel-Light, Spot-Detected)

A simple three-pixel example, diagrammed in figure 12, will serve to clarify the properties
of this kind of detector array. Note that when all cells have received light it is possible for the
array to arrive at a stable state in which no dots were detected (Spot-Detected = 0,
Pixel-Light = 1 in all cells). Any set of dots which is a subset of an image that would have been
detected by the equivalent inhibition pattern in a light-spot detector array is a possible stable
image.
Figure 11. The layout of the upper left Optical Mouse Cell, showing the Pixel-Light poly distribution wires and diffusion grounds.

Therefore, for the three-pixel neighbor-inhibiting dark-spot sensors, we get these stable
images:

101 (1)   010 (1)   ("complete" images)
100 (2)   000 (1)   ("subset" images)
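
Dropping the requirement that every 0 be covered turns the earlier enumeration sketch into one for dark-spot arrays; this Python fragment (ours) reproduces the five images above for the three-pixel example:

    from itertools import product

    def dark_stable_images(n, coupled):
        """All bitmaps in which no two 1s inhibit each other (subsets included)."""
        for bits in product((0, 1), repeat=n):
            ones = [i for i, b in enumerate(bits) if b]
            if all(not coupled(a, b)
                   for i, a in enumerate(ones) for b in ones[i + 1:]):
                yield "".join(map(str, bits))

    neighbor = lambda a, b: abs(a - b) == 1
    print(list(dark_stable_images(3, neighbor)))
    # -> ['000', '001', '010', '100', '101']: the complete and subset images above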

For four-by-four arrays, the additional stable "subset images" are illustrated in figure 13.
One result is that with the radius 2.9 inhibition pattern, seeing spots on opposite corners does
not force the four-corners image, but is actually most likely to give the correct two-corners
image. A more general result is that the spot pattern to be tracked does not need to be so
closely matched to the inhibition pattern, since the circuit is willing to wait for spots to really
be there before it claims to see them; a pseudo-random distribution of dots would probably
work quite well.
With this technique, it would be possible to make a linear motion tracker with only three
cells, each inhibiting all the others, with a dark line spacing of three cells or greater; similarly, a
2-D tracker might be built with just a three-by-three array of cells. For the linear tracker, the
image sequence for uniform motion could be either the 100, 010, 001, 000 cycle or the 100,
010, 001 cycle. These trackers would have to assume that a dot disappearing from one edge
and/or appearing on the other represents a step of motion (or a half step, depending on what
assumptions are made about the line spacing).
Figure 12. Three-pixel Dark-spot Sensor Example, with nearest-neighbor inhibition (cells with Sensor-Node-1 through Sensor-Node-3 and Spot-Detected-1 through Spot-Detected-3, low-threshold (*) and high-threshold (**) gates, and Reset, Ready, and Done signals).



Figure 13. Additional patterns for the four-by-four dark-spot detector array: the additional "subset" images seen by the dark-spot sensor array, by inhibition radius, with grand totals of 146 for radius 2.1, 72 for 2.3, 60 for 2.9, 43 for "3.0 special", and 39 for 3.1.
* Radius 2.9 may be best for the dark-spot detector scheme.
14. Of mice and pens
The optical mouse's compact internals will allow it to be repackaged into various other
forms. For example, a pen-like device with a big base that keeps it from falling over might be
desirable. A "ball-point" tracking device that watches a golfball-like pattern of dots on a
rolling ball in the tip of a pen may also be useful.

15. Summary
The optical mouse embodies several ideas that are not obvious extensions of standard
digital or analog design practices, but which contribute to the design of robust sensors of the
analog-to-digital sort. Using the concept of lateral inhibition, sensor cells that are trivial and
useless alone become powerful in their synergism. A sensor array that forces itself into a useful
and informative stable digital state is very easy to deal with, through standard digital
techniques. It is especially useful if it can decide when it has reached such a stable state, and
when it has been reset enough to be ready to start over, for then it can be regarded as self-
timed, and clocks can be generated that cycle it quickly yet reliably.
The optical mouse is just one simple example of an application of smart digital sensors,
which happens to involve a few stages of logic to arrive at the answer in the desired format.
Fortunately for this project, the NMOS technology that we know and love for logic is also well
suited for sensing photons; so once the ideas and algorithms were firm, the chip design was
relatively routine, and quick-turnaround implementation was available through the standard
well-greased path.

The interrelated inhibition neighborhoods, contrasting patterns, sets of stable images, and
tracking strategies for the optical mouse application have been thoroughly discussed in the text,
and do not seem amenable to summarization here.

16. Concluding remarks


We have examined a family of smart digital sensors, specifically including motion-sensing
imagers, which may find applications in places other than the mouse. Other applications of
mutually-inhibiting and/or self-timed light detectors can be imagined, such as in character
recognizers, edge detectors, light-controlled oscillators, etc. Other kinds of sensors can benefit
from some of the same techniques.
A complete optical mouse has been in use for many months, with only one minor
problem: when one is forced to use a workstation with an electro-mechanical mouse after
becoming accustomed to the optical mouse, the erratic performance is an annoying contrast.

References
[Business 1981]
Business Week, "Will the boss go electronic, too?" pp. 106-108, May 11, 1981.
[Card et al. 1977]
S. K. Card, W. K. English, and B. Burr, "Evaluation of mouse, rate-controlled isometric
joystick, step keys, and text keys for text selection on a CRT", Xerox Palo Alto Research
Center SSL-77-1, April 1977.
[Conway et al. 1980]
L. A. Conway, A. G. Bell, and M. E. Newell, "MPC79: The Large-Scale Demonstration of
a New Way to Create Systems in Silicon," Lambda - The Magazine of VLSI Design, pp.
10-19, Second Quarter, 1980.
[Englebart 1970]
D. C. Englebart, "X-Y position indicator for a display system", U. S. Patent 3,541,541,
Nov. 17, 1970.
[Englebart & English 1968]
D. C. Englebart and W. K. English, "A Research Center for Augmenting Human
Intellect", FJCC 1968, Thompson Books, Washington, D. C., p. 395.
[Englebart et al. 1967]
D. C. Englebart, W. K. English, and M. L. Berman, "Display-selection techniques for text
manipulation", IEEE Transactions on Human Factors, HFE-8, 1, 5, 1967.
[Hawley et al. 1975]
J. S. Hawley, R. D. Bates, and C. P. Thacker, "Transducer for a display-oriented pointing
device", U. S. Patent 3,892,963, July 1, 1975.
[Hon & Sequin 1980]
R. W. Hon and C. H. Sequin, A Guide to LSI Implementation, Xerox PARC Technical
Report SSL-79-7, Palo Alto, California, 1980.
[Koster 1967]
R. A. Koster, "Position control system employing pulse producing means indicative of
magnitude and direction of movement", U. S. Patent 3,304,434, Feb. 14, 1967.
[Lyon 1981]
R. F. Lyon, "A Bit-Serial VLSI Architectural Methodology for Signal Processing", VLSI
81: Very Large Scale Integration (Conference Proceedings, Edinburgh, Scotland, John P.
Gray, editor), Academic Press, August 1981.

[Mead & Conway, 1980]


C. A. Mead and L A. Conway, Introduction to VLSI Systems, Addison-Wesley, Reading,
Mass., 1980.
[Newman & Sproull 1973]
W. M. Newman and R. F. Sproull, Principles of Interactive Computer Graphics, McGraw-
Hill, 1973.
[Opocensky 1976]
W. J. Opocensky, "Cursor position device", U. S. Patent 3,987,685, Oct. 26, 1976.
[Rider 1974]
R. E. Rider, "Position indicator for a display system", U. S. Patent 3,835,464, Sept. 10,
1974.
[Seitz 1980]
C. L. Seitz, "Ideas about arbiters," Lambda - The Magazine of VLSI Design, pp. 10-14,
First Quarter, 1980.
[Sequin & Tompsett 1975]
C. H. Sequin and M. F. Tompsett, Charge Transfer Devices, Academic Press Inc., New
York, 1975.
[Seybold 1981-1]
The Seybold Report, "Xerox's 'Star'." Vol. 10, No. 16. April 27, 1981.
[Seybold 1981-2]
The Seybold Report on Word Processing, "The Xerox Star - A 'Professional' Workstation,"
Vol. 4, No. 5, May 1981.
[von Bekesy 1967]
G. von Bekesy, Sensory Inhibition, Princeton University Press, 1967.
Designing a VLSI Processor -
Aids and Architectures

Forest Baskett
Xerox PARC
Palo Alto, California
Stanford University
Palo Alto, California

Paper not available.

Keys to Successful VLSI System Design
James G. Peterson
Consultant to TRW DSSG

The designability of successful VLSI systems has been a major topic of discussion
at conferences over the last three to four years. I will present a method of
approaching the overall problem which has been successfully applied at TRW for
several years.

BACKGROUND
The interest in the potential of VLSI first began to explode
several years ago when G. Moore unveiled his famous curve showing
the exponential increase with time of the available transistor-level
complexity on one integrated circuit. Exciting new system
capabilities were projected for the near future, and many new
architectures proposed. This initial enthusiasm was quickly
tempered, however, by the observation that the effort required to
specify, design, implement, and verify an ultra-complex item such as
a VLSI appeared to be at least LINEAR with the complexity, and highly
unpredictable. To solve this problem in specific cases, more
designers were planned per project. As in software, the added
communication caused the error rate of each designer to increase and
the individual productivity to decrease. In addition, the complexity
of the devices designed made system test development and design
verification more difficult, leading to systems produced with more
built-in design faults and poorer performance than planned. The
performance of many of today's systems is limited more by design cost
and schedule considerations than by the available processing
technology.
This situation of long, unpredictable schedules, extensive
manpower requirements, questionable robustness and potentially
unpredictable performance causes a high cost to be associated with
the use of custom VLSI. Consequently, many military and industrial
systems in which performance could be substantially improved by the
use of custom VLSI do not use it. I believe that many of the initial
enthusiastic predictions of the kinds of systems to be available with
new processing technology remain unfulfilled for these reasons.
A significant and often unplanned-for aspect of the engineering
environment is that no engineering is done in a vacuum. There are
continual forces during the course of a design which change the
problem to be solved and the constraints on the solution. These may be
the result of specification changes, inter-engineer communication
misunderstandings or inadequacies, technology improvement, political
maneuvering, fund cutbacks, etc. This factor is more important in
chip design than other types of engineering because of the long time
required to modify, process, and test a design in response to
changes.
A variety of techniques for improving engineering accuracy and
productivity appeared in response to this problem. This response
started with specialized interactive graphics systems for VLSI,
continued with sophisticated CAD tools for simulation and layout
checking, included new design methodologies such as standard cells,
CGA's, PLA's, etc., and extended to such economic opportunities as
the silicon foundry.

WHAT IS THERE ABOUT WHAT WE ALREADY DO THAT WORKS?


A common thread can be found to run throughout all of the
current tools, techniques, and opportunities. This thread is that
each approach increases the degree to which lower or later levels of
the design have certain desirable attributes. I believe that these
attributes are the "basis vectors" which determine the relative
utility of the different approaches. This allows the utility of
these techniques to be understood. Furthermore, I believe that
direct application of this understanding throughout the solution of
the problem will result in the largest economic and schedule
benefits. I call this understanding "regularity technology".
Figure 1 shows these attributes, arranged in order from most
directly to least directly relevant to the day-to-day problems
encountered in a VLSI systems design. The left-hand side is a list
of the desirable attributes. The right-hand side shows the benefits
which are derived from having each attribute. I do not have an exact
mathematical definition of the attributes and therefore do not have a
precise method for their measurement. It is nevertheless possible to
apply the concepts they represent with resultant concrete benefits.

ATTRIBUTES WHICH AFFECT
SYSTEM DESIGNABILITY               REAL BENEFITS

Reusability of design effort       CAD usage made effective.
Decrease in design complexity      Humans made more effective.
Changeability                      Accommodation to customer easy.
                                   Design errors easily corrected.
Repetitiveness                     Good packing of components and wiring.
                                   Predictable high performance.
Programmability                    Parallel development possible.
Regularity                         Ease of exact analysis.
Hierarchicalness                   Testability and robustness.

FIGURE 1 -- Subjective attributes of successful system designs.
In general, problems either have these attributes or they don't.
The presence or absence of these attributes in solutions, on the
other hand, is controlled by the system designer(s) and specifier(s).
The key to successful VLSI systems design is to use these attributes
to evaluate potential specifications and designs before they are
implemented. Designers and customers can then choose to increase or
decrease the amount of these attributes in a design after conscious
consideration of the effect on system cost.
The first three attributes are reusability of design
effort, decrease in complexity of design task, and changeability.
These attributes have the most directly observable impact on the
final economics of a VLSI system design.
Most design methodology improvements have some element of the
concept of design reuse; for example, the utility of the CGA concept
is based on the reuse of a single gate design, a layout of an array
of gates, and completed mask layers. Computer-aided design is also
based on this concept -- design algorithms are programmed to make
them reusable for many systems. Increasing the presence of design
reuse in a system increases the effectiveness and utility of CAD.
Some of the design methodology improvement schemes utilize the
concept of complexity reduction. The intent of standard cell chip
layout schemes is to use this concept; theoretically the logic
designer does not need to know anything about the layout of the gates
on the chip, or their internal electrical construction, or even which
technology they are used with. Reducing the complexity of the system
(or the sub-system on which a single designer must work) greatly
increases the effectiveness of a human designer. While a human is
very creative, he is only effective when considering a small number
of things at once.
Some of the methodology improvement schemes enhance the
changeability attribute of the design. These include ROM and PLA
"programs, the wiring mat of a CGA, and the power level of the
individual gates in a CGA. An excellent example of changeability is
the stored program computer. Increasing the changeability of a
design results in easier accommodation to new customer requirements,
and easier repair of design errors -- found before or after shipment.
The next two attributes, repetitiveness and programmability, are
precursors of the first three -- their presence early in a design
tends to control the degree to which the first three will be present
in the later design stages and the final system. Therefore, they are
important solution evaluation criteria during the early design
phases.
Repetitiveness is very widely exploited - the use of loops in a
computer program is based on repetitiveness, as is the design of bit
sliced microprocessors and most CPU arithmetic units. The main
benefits of the presence of repetition in a system design is that it
allows the physical implementation of the system to be closely packed
together, which improves the final system performance, and almost
always implies reuse of design effort.
Programmability is used here to indicate the number of simple
1-out-of-n choices available to the designer. Examples of this are
the personalization of a ROM, strapping options on circuit boards,
etc. The main benefits of programmability are that it allows
parallel development of the item to be programmed and its program,
and that it almost always implies reuse of design effort and
changeability.
Repetition and programmability are often synergistic - an item
used repetitively whose instances can be individually programmed is
extremely useful, as in the cases of ROM's, PLA's, and some CGA's.
The parallel development technique can therefore be used
successfully.
The chief desirable attributes are regularity and
hierarchicalness. Repetitiveness and programmability are only
restricted forms of regularity or hierarchicalness. Things which are
regular or hierarchical have a much better chance of being designed
in a repetitive or programmable fashion. For example, while many
arithmetic algorithms which have logarithmic performance are not
directly repetitive, they may usually be stated in a regular form
which uses programmed repetitions of repetitions. These attributes
are valuable because they usually imply that high-quality analytical
tools can be developed to analyze the problem. Increasing the
presence of these attributes therefore allows more thorough analyses
to be done on larger parts of the system, which results in more
robust systems designs.

PROPOSITION
I propose that at any time there is always more useful
regularity or hierarchicalness to be found in any useful problem.
This implies that for many of today's applications, economically
sound VLSI solutions could be found that would greatly improve the
product performance.
PLACES TO LOOK FOR ADDITIONAL REGULARITY IN TODAY'S PROBLEMS
The way to find this additional regularity is to evaluate all the
aspects of the problem for the seven attributes listed above. This
includes the market analysis which determines what product should be
built and how it should be specified, the initial large-scale design
decisions, all the way down to the layout and test of the individual
transistors. Especially good places to look are in the highest-level
specifications and system design decisions. There are many new
optimization techniques available to the VLSI designer. A technology
which has worked in one place, as the above examples demonstrate,
should be vigorously applied wherever it is effective.
It is appropriate when using this technology to disregard the
traditional boundaries between specification, chip, logic, test,
mechanical, software, and system design tasks, and to address all
these aspects simultaneously. At first glance, this seems a very
complex and difficult task, but in situations where hierarchicalness
and/or regularity are present, a single engineer can usually
comprehend all the facets of the entire system at one time. And, he
can then often consider significant tradeoffs which are usually
overlooked.
An example of this thorough application of regularity is found
in refrigerator design. Refrigerator buyers want two sexes of
refrigerators -- left and right hand opening. Initially, two
separate types of refrigerator were made. This gave refrigerator
retailers and distributors inventory headaches, until the sex of
refrigerators was made programmable in the store. Now only one type
of door is made and stocked, designed to be mounted so as to open
from either side. The retailer (or end-user) need only move a few
bolts (a half-hour's work, even if you haven't done it before) to
produce the correct sex of refrigerator. The inventory and
production advantages of this breakthrough are significant. The use
of changeable color panels on refrigerators and other appliances is
a similar application of "regularity technology".
Yet another example can be found in the auto industry, where
only a small number of types of frames and engines are made, and then
many different types of bodies are put on top of them.
The common point of all of these uses of regularity is to give
great variability in product appearance, performance, and purpose
with minimal engineering and manufacturing cost.
IMPLICATIONS
The application of this design regularity technology to VLSI
design could result in some startling changes in the system design
business. It is possible that in a few years, many apparently "one
of a kind" military or industrial systems will be built with
changeable custom VLSI systems. This might result in the savings of
billions of dollars of defense spending annually (or in the creation
of even more exciting systems than are possible today).
The fact that systems could be designed to be more robust might
result in totally new perceptions of system reliability - the number
of design bugs expected to be found hiding deep in a piece of
hardware or software might be very greatly reduced.
Systems with far more degrees of freedom of application will
appear. Note that autos were once assembled at the factory to the
buyer's specification for no additional charge, until the auto makers
discovered that users wanted even less variability at lower cost.
Imagine the effect the availability of cheap customized systems might
have on the electronic system business. On the other hand, systems
buyers may also mature into wanting less variability and
near-duplication of capability at lowered cost.
SPECIFIC EXAMPLES OF APPLICATION
I imagine that there are probably a large number of readers with
objections at this point... What about glue chips? No one can
comprehend all of the system I work on at once! etc... To show how
a real problem might be solved this way, I will give several examples
of the application of this technology to chips and systems built at
TRW.
The first sample application is to a product line consisting of
8, 12, 16, and 24-bit multipliers and 8, 12, and 16-bit
multiplier-accumulators. The chips use either 3 or 4 data ports, and
some are available in either two's complement or magnitude-only
arithmetic. There are several applications of regularity technology
in this family:
1. The family consists of several chips with different word
lengths and arithmetic conventions. Internally, they are all based
on a single design, redone for every technology advance, and
restepped by computer for each family member.
2. The inherent regularity of the final chip was enhanced by
the use of a repetitive algorithm for the actual multiplication.
This allowed the chip design problems, mechanical constraints, and
test issues for the entire family to be handled simultaneously by a
single designer.
3. The decrease in the total design effort required for the
chip family allowed exceptionally robust designs to be produced in a
short time. This increased customer satisfaction and improved
production yields.
4. A tendency to strongly customize the specifications for each
member of the family was avoided to allow the designers to use the
natural regularity of the multiply function. The resulting
similarities in specification gave additional benefits to the manual
writer and the end user, in that the functional characteristics of the
different members of the chip family are very similar.
TRW has applied this concept to other chip families, including
both digital and digital-to-analog convolvers, and flash A/D
converters.
The other sample application is to the redesign of a 5000-gate
satellite-based spread-spectrum transponder system. The problem
statement was initially as follows:
Reduce the power required by the digital part of the
transponder by changing the implementation from off-the-shelf CMOS
and TRW 2u3D standard cell chips to all-custom TRW 2u3D chips,
and constrained by:
One year and apparently inadequate number of personnel
available to do the job.
Suggested problem solutions were evaluated by looking for the
presence of the seven attributes mentioned above. The most difficult
part of the chosen solution was to convince management that the
problem statement was overly constrained as initially stated. They
had to be shown that it would be more economical to re-specify and
redesign almost all the logic to manifest algorithm and system
regularities which had been hidden by the initial standard cell logic
implementation. It is clear now that this was an essential part of
the success of the redesign.
Other characteristics of the chosen solution are:
1. The new logic design uses a small set (3) of functional
blocks to capitalize on the newly available regularity.
2. The logic blocks used are as large as possible (between 100
and 1000 equivalent gates), which decreases the number of them
required. This in turn decreases design complexity.
3. Highly programmable logic blocks were used, to keep the
required set small and increase the amount of design reuse.
4. The logic blocks are internally repetitive, to further
decrease the total required design effort.
5. The logic blocks were called by a newly invented name
("slices"), to keep up with the Joneses.

These parts of the solution act synergistically to produce a
very compact system.
The three logic blocks chosen for this system were a
pseudo-random noise generator, a shift register with exclusive-OR
gate attached to the first stage, and a programmable counter. All
three blocks have logic attached to detect special states of the
registers, to allow electrical, mask, and pin programming, and to
facilitate testing. The speed-power tradeoff, input and output
inversions, function length in bits, and constants used in algorithms
can all be changed on one mask layer for all the blocks.
Literally dozens of programming points per bit are required to
implement all this programmability. A programming sheet, filled out
by the designer, specifies to a computer program where to insert
optional contacts on the layout.
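As an illustrative sketch of that flow (the sheet format, instance names, and
coordinate table below are hypothetical stand-ins, not TRW's actual tool):

    # Hypothetical programming-sheet reader: each entry names a block
    # instance, a programming point, and a 0/1 choice; the output lists
    # the optional contacts to insert on the layout.

    def contacts_from_sheet(entries, contact_xy):
        """Map (block, point, value) entries to layout contact coordinates."""
        inserts = []
        for block, point, value in entries:
            if value == 1:                       # "1" means contact present
                inserts.append((block, point, contact_xy[(block, point)]))
        return inserts

    # Example: program two points of a counter block.
    xy = {("ctr0", "invert_in"): (120, 48), ("ctr0", "len_b3"): (134, 48)}
    print(contacts_from_sheet([("ctr0", "invert_in", 1),
                               ("ctr0", "len_b3", 0)], xy))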
Figure 2 shows the net improvement in the system performance
achieved by this redesign. Note the 3:1 improvement in power and the
5:1 improvement in parts count. There are 35 of the large scale
logic blocks used in 5 custom chip types, of which 2 types are used
twice. These accounted for approximately 4200 gates, for an average
of 120 gates apiece. About 900 gates were implemented with CGA's and
standard cells. The total design effort spent on the CGA's and
standard cells was about 5 times that spent on the blocks, because of
the complexity of dealing with that many little pieces. This
implementation was accomplished in about 9-10 months by 6 designers,
4 of whom had never built chips before.

                     RANDOM-LOGIC DESIGN        REGULAR LOGIC DESIGN
                     (STANDARD CELLS, CGA'S)    (FUNCTION BLOCKS)
Power                8.1 W                      2.7 W
Chip count           77                         14
Speed                6 MHz                      6 MHz
Process technology   TRW 2u3D                   TRW 2u3D

FIGURE 2 -- Satellite spread-spectrum transponder design statistics.

The CGA's were used in the system because they had already been
designed, and it was thought that inadequate funds were available to
replace them. Most of the standard cells were required in the final
system design to interface the CGA logic to the new chips, and some
to combine the functions of the blocks.
In hindsight, however, it is now apparent that it would actually
have been cheaper to redesign MORE of the system. The additional
available regularity would have actually DECREASED the cost of the
entire system design. One more block would have been developed,
probably a type of state machine, and the functions of the counter
block would have been augmented by the addition of more programming.
This would have REDUCED the number of blocks in the system, and also
REDUCED the chip count from 14 to 7 or 8. We were not zealous enough
in applying our own philosophy.

CONCLUSION
All of the VLSI design methodology improvements yet conceived
are based on some common intangible attributes, and principles for
their application. By consciously using these proven attributes and
principles in a uniform way on all relevant issues, we have a chance of
realizing the full potential of VLSI systems technology. The further
use of this information is up to you.
Programmable LSI Digital Signal
Processor Development
Akira Sawai
C & C Systems Research Laboratories
Nippon Electric Co., Ltd.
Kawasaki, Japan

ABSTRACT

Single chip LSI signal processor μPD 7720 development is explained.
General requirements, the employed architecture, device features and processor
performance are presented. Comparisons among the single chip signal processors
announced so far are briefly commented upon. Future trends are also discussed.

I. INTRODUCTION

Recent advances in LSI technologies are bringing forth great impacts in
various fields. The integration level in digital LSIs becomes much higher and is
increasing more rapidly than in analog ICs. Hence, the digital signal processing
(DSP) cost is decreasing and its application fields are expanding each year.
This trend, in turn, incurs the situation wherein the DSP technologies
themselves are being made more sophisticated and the corresponding LSI circuits
become more complicated. Also, a fully custom designed LSI will take more
time to be developed. Accordingly, the judgement on either to use general
purpose DSP LSI chips or to fabricate new custom designed LSI chips, becomes a
more crucial matter for the system designers.
Recently announced single chip programmable signal processors [1]-[6] are
aimed at reducing LSI production cost by keeping the total production volume
high while coping with various demands by different stored programs.
In the following, possible application fields and general requirements for
the signal processors are first presented. Then, detailed discussions are given on
NEC signal processor μPD 7720, especially on its architecture, device features
and performance capabilities with application examples. The single chip
signal processors announced so far are briefly surveyed. Future trends in signal processors
are also discussed.
II. REQUIREMENTS

Needs for real time processing mostly come from telecommunications
applications, such as:
(a) DTMF receivers.
(b) Low bit rate voice coding like PCM-ΔM conversion or PCM-ADPCM
conversion.
(c) Data modems.
(d) Voice synthesis.
(e) Voice recognition.
The signal processing functions common to all these applications are linear
and nonlinear operations to provide filtering, averaging, prediction, optimization,
adaptation, spectrum analysis or signal energy detection.
In implementing signal processors, the most important requirements are for
accuracy and speed in multiply-add operations, which are frequently used in
digital filters or DFT/FFT processors.
The accuracy determines dynamic range and signal-to-distortion (S/D)
ratio. The standard PCM employs nonlinear 8 bit encoding, which is equivalent
to linear 13 bits for a smaller signal level. Hence, signal processors should be
designed so as not to incur significant S/D degradation for such PCM encoded
signals. This requires additional bits added to the linearized PCM signals.
Practical surveys on various applications, such as voiceband signal filters, DTMF
receivers, modems and ADPCM codecs, have shown that the required accuracy is
in the range of 12-20 bits. However, most of these cases, including the voice
recognition application, can be realized with 16-bit accuracy under careful
software design.
The speed directly affects the amount of signal processing within a given
time period, thereby affecting the size and cost of digital processing systems.
Single chip realization of a DTMF receiver requires about 25 biquad filter
operations in an 8kHz sampling interval. This corresponds to 0.8 million
multiplications per second. Taking into account the figure as well as other
additional control functions, about 50 biquad filter processing capability for an
8kHz sampled signal is the processing speed objective.
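As a worked check on these figures (assuming the usual four multiplications per
biquad section): 25 sections x 4 multiplications x 8000 samples/second
= 0.8 x 10^6 multiplications per second, and the 50-biquad objective similarly
corresponds to about 1.6 million multiplications per second.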
Memory capacity is also an important requirement for signal processor
implementation. To implement 50 different biquad filters by a single chip
processor, at least 100 word capacity is required for variable data memory
(RAM), because one biquad filter needs two delay elements. For 128 real point
FFT or 64 complex point FFT realization (whose possible applications may be
such as adaptive transform coders), 128 words are the minimum requirement.
In addition to the variable data memory, a non-volatile fixed data memory
(Data ROM) should be employed. Several hundred words are required for these
purposes, that is, 200 words for the biquad filter coefficients, around 200 words
for 128 real point FFT twiddle factors and window coefficients, and 256 words
for unidirectional linear/nonlinear PCM code conversion.
Program capacity requirement is another problem, and highly dependent
both on the processor architecture and instruction cycle time.

III. SIGNAL PROCESSOR μPD 7720 DESIGN


III-A. Design Concept

The architecture design is based on state-of-the-art three micron
N-channel E/D MOS technology. Since the multiply-add operation plays a major
role in digital signal processing and is most time consuming, architecture should
be constructed so that the multiply-add operation time is minimized. Key design
concepts employed are as follows.
Built-in multiplier Considering the speed requirement, multiplier integration
on the chip is inevitable. Among various multiplier structures, the parallel-
parallel multiplier, based on Booth's algorithm, is most preferable to satisfy
speed requirement and to simplify programming procedure. Preliminary study
showed that a single 16 x 16 bit multiplication can be executed within 250
nanoseconds and requires about 4000 transistors with the adopted multiplier
structure and device design. Improvements over conventional parallel array
multipliers are that the multiplication time is halved and that the hardware
amount is reduced by 20%.
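For intuition about why Booth recoding reduces the work of a conventional
array, a minimal radix-2 Booth multiplication sketch follows (illustrative
only; the chip realizes a parallel hardware array, presumably with a
modified-Booth variant, not this sequential loop):

    def booth_multiply(x, y, bits=16):
        """Multiply x by y (a bits-wide two's-complement word) via Booth recoding."""
        prod, prev = 0, 0
        for i in range(bits):
            cur = (y >> i) & 1
            if (cur, prev) == (0, 1):
                prod += x << i      # end of a run of ones: add x * 2^i
            elif (cur, prev) == (1, 0):
                prod -= x << i      # start of a run of ones: subtract x * 2^i
            prev = cur
        return prod

    assert booth_multiply(123, -45) == 123 * -45

Runs of ones in the multiplier are handled with one subtraction and one
addition instead of one addition per bit, which is the source of the savings.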


Microprogram control In order to exploit the benefit of adopting the high
speed parallel multiplier, horizontal microprogramming technique is employed.
Micro-instructions control various functional portions, such as a multiplier, ALU
and RAM in parallel. The micro-instruction cycle time is determined as 250
nanoseconds in accordance with the multiplication time. A total of 500 micro-
instruction steps are therefore available during an 8kHz sampling interval.
Multi-bus structure Data transfer to the multiplier also affects the processing
speed. Hence, bus structure should be designed so that two kinds of data,
multiplicand and multiplier, are simultaneously transferred to the multiplier
inputs. The two data will be delivered from various sources; that is, both from
RAM, one from RAM and the other from DATA ROM, and so on. This
requirement results in multi-bus structure employment.

III-B. Architecture

Figure 1 shows the architecture for signal processor μPD 7720. The
processor has a built-in multiplier, Data ROM, Data RAM, Instruction ROM,
ALU, double accumulation registers and several other registers; i.e. temporary
register (TR), multiplier input registers K and L, multiplier output registers M
and N, data pointer (DP), ROM pointer (RP), serial I/O registers (SI and SO), data
register (DR) and status register (SR). These registers are interconnected
through a main bus or sub-buses. The arithmetic operation is carried out by
fixed point arithmetic. The fixed point exists between the first bit (i.e. sign bit)
and the second bit of the 16-bit 2's-complement data words.
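As a small sketch of this number format (plain Python, assuming the standard
Q1.15 reading of the sentence above):

    def to_q15(x):
        """Encode a real value in [-1, +1) as a 16-bit word, sign bit first."""
        if not -1.0 <= x < 1.0:
            raise ValueError("value out of range [-1, +1)")
        return min(int(round(x * 32768)), 32767) & 0xFFFF

    def from_q15(w):
        """Decode a 16-bit fixed-point word back to a real value."""
        if w & 0x8000:              # sign bit set: negative two's-complement
            w -= 0x10000
        return w / 32768.0

    assert from_q15(to_q15(-0.75)) == -0.75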
The Instruction ROM capacity is chosen to be 512 words x 23 bits to
enable execution of 500 non-repetitive instructions during an 8 kHz sample
interval. In order to increase available program steps for processing lower
sampled signals, a four-level subroutine stack is provided.
The Data ROM capacity requirements previously given need not be
satisfied at the same time for a single application. A compromise in the form of
a 512 word memory capacity is provided. The word length is determined as 13

Figure 1. Signal processor μPD 7720 architecture.


32 Programmable LSI DIgNal Signal Processor Development

bits, satisfying almost every possible coefficient accuracy requirement.


The Data RAM has a "dual-read" capability to improve processor through-
put. Two 16-bit data words, whose addresses differ only in the most significant
bit (MSB), are read out in parallel. The Data RAM is addressed by the contents
of the DP register, which consists of a 4-bit up/down counter and a 3-bit register
with modify logic circuits.
The double accumulation registers, each accompanied with flag registers,
are intended to handle double precision (32 bit) data or single precision complex
data, or to preserve main job data during an interrupt job execution. The flag
registers store flags for sign, zero, carry and overflow status in the current ALU
output, and also store a special flag to indicate the resultant overflow in four-
term consecutive additions. Temporary bit extension capability beyond MSD is
provided in the accumulation register. The overflow often leads to large
amplitude limit cycle oscillations in recursive digital filters. Software overflow
correction is executed by replacing the overflown accumulation register content
with a saturation value given by generator SGN. Unnecessary overflow correc-
tions during consecutive additions are thus avoided.
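A minimal sketch of that correction (assumed semantics; the hardware's SGN
generator supplies the saturation constants):

    Q15_MAX, Q15_MIN = 32767, -32768    # the +max / -max values from SGN

    def correct_overflow(acc, overflowed, result_positive):
        """Replace an overflowed accumulator content with a saturation value."""
        if not overflowed:
            return acc
        return Q15_MAX if result_positive else Q15_MIN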
Multiply-add operation is performed by pipeline processing. During one
micro-instruction cycle, the following three functions are executed in parallel.
(1) Two data to be multiplied are stored in multiplier input registers K and L.
(2) The upper and the lower 16 bits of the product between the data previously
set in registers K and L, are set to multiplier output registers M and N. With the
upper 16 bit data being supplied through a sub-bus, the ALU immediately adds it
to the accumulator content, performing multiply-add operation.
(3) Next addresses for RAM and Data ROM are set in address pointers DP and
RP, respectively, so that new data may be stored in registers K and L at the next
accumulation step.
Multiplier input registers K and L, data register DR, and temporary
register TR are used for scratchpad purposes. Efficient utilization of these four
scratchpad registers makes it possible to greatly reduce the required number of
program steps.
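A behavioral sketch of one such cycle (register names follow the text; the
Q1.15 scaling and the dictionary representation are assumptions, not NEC's
simulator):

    def cycle(state, k_new, l_new, dp_new, rp_new):
        """One micro-instruction cycle: the three actions above, in parallel."""
        # (2) the product of the PREVIOUSLY loaded K and L appears in M/N;
        #     its upper 16 bits are added to the accumulator via a sub-bus
        product = state["K"] * state["L"]          # Q1.15 x Q1.15
        state["M"] = product >> 15                 # upper 16 bits, Q1.15 again
        state["ACC"] += state["M"]
        # (1) load the next multiplicand and multiplier
        state["K"], state["L"] = k_new, l_new
        # (3) set the next RAM and Data ROM addresses
        state["DP"], state["RP"] = dp_new, rp_new
        return state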
As well as parallel 8080 bus and DMA interfaces, the processor has
separate serial input and output ports (SI and SO), that can be used to
asynchronously interface with external systems at up to 2.048 Mb/s clock rate.
Either an LSB-first or an MSB-first data word can be accepted. The word length
can also be selected between 8 bits and 16 bits.
III-C. Instruction Formats

The processor is controlled by 23-bit horizontal micro-instructions. Their


formats are classified into three categories; (i) "OP/RT" instruction set, (ii)
"BRANCH" instruction set and (iii) "LOAD DATA" instruction set, as shown in
Fig.2.
The "OP/RT" instruction can perform the following tasks in parallel:
Execute ALU function
Transfer data via internal data buses
Modify memory address pointers, DP and RP (within a limited range)
Subroutine Return
"Multiply" is automatically executed in parallel, whenever new data are available
in multiplier input registers K and L. Data transfer, pointer modify and
subroutine return, specified in the present instruction step, become valid in the
next instruction step.
Although data transfer is normally executed from one source to one
destination through the internal main bus, it is so designed that the simultaneous
data loading to registers K and L through the main bus and a sub-bus can be
achieved by using special destination codes.

[Figure 2 residue: the OP/RT format carries fields for ALU function,
accumulator selection, data pointer control, ROM pointer decrement, source and
destination codes, and a return bit; the LOAD DATA format carries a 16-bit
immediate data field and a destination code; the third category is the branch
format.]

Figure 2. Instruction formats.

Figure 3. Signal processor μPD 7720 photomicrograph.


The lower 4 bits of the data pointer DP can be incremented or
decremented, reset to zero, or kept at the current value, while the upper 3 bits can be
modified bit by bit by exclusive-OR operation with the current value. The ROM
pointer RP can also be decremented by this instruction.
III-D. Device Feature

The signal processor chip is fabricated with a 3-μm N-channel E/D MOS
technology [7]. A 250 ns instruction cycle under 8 MHz clock operation is
realized. Total power consumption is 900 mW. Figure 3 shows a chip photo-
micrograph. More than 40,000 transistors are integrated on a 5.47 x 5.20 mm die
area.
The multiplier is made up of two input registers, Booth's decoders, 112
multiplier cells and an adder with a 31-bit output. Each multiplier cell consists
of a partial product generation multiplexer and a full adder, and is realized by 30
transistors in a 200 x 94 μm area. The total multiplier hardware occupies about
3 mm² on the die, and executes a 16 x 16-bit multiplication within 250 ns at
280 mW power consumption.
For comparison, the single chip signal processors announced so far, together
with their architectural and performance features, are listed in Table 1. Only
Intel 2920 has built-in A/D-D/A converters. It has a feature wherein its I/O
interface is analog; however, it has no hardware multiplier on a chip.
IV. APPLICATION EXAMPLES AND PERFORMANCE

Table 2 shows a program example for implementing a biquad digital filter


section shown in Fig. 4. Input x_i is converted into output y_i according to the
given two equations. Two words in the Data RAM are assigned for W_{i-1} and
W_{i-2}. Four coefficients, a_1-1, a_2, 1-b_1 and -b_2, stored in the Data ROM, are
used. Note that the a_1 and b_1 values in the current example, assumed to be
greater than 1 but smaller than 2, are replaced by a_1-1 and 1-b_1, since all the data
to be handled should be in the range of [-1, +1). Also note that the accumulator
content is initialized with input x_i.
        x_i --> [ BIQUAD FILTER SECTION ] --> y_i          D: Delay

            W_i = x_i - b_1 W_{i-1} - b_2 W_{i-2}
            y_i = W_i + a_1 W_{i-1} + a_2 W_{i-2}

Figure 4. Biquad filter configuration.
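A direct floating-point transcription of these two equations (a reference model
only, not the μPD 7720 microcode of Table 2):

    def biquad(x, a1, a2, b1, b2):
        """Filter sequence x through one second-order (biquad) section."""
        w1 = w2 = 0.0                         # W_{i-1}, W_{i-2}: the two RAM words
        y = []
        for xi in x:
            wi = xi - b1 * w1 - b2 * w2       # recursive part
            y.append(wi + a1 * w1 + a2 * w2)  # non-recursive part
            w2, w1 = w1, wi                   # shift the delay line
        return y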


Table 1. Announced signal processors.

Feature                   INTEL 2920          AMI S2811           NEC μPD7720         BELL LABS           ITT MA1000

Technology                N-MOS/EPROM         V-MOS               N-MOS               N-MOS               N-MOS
Die size (mm²)            30.4                26.1                28.4                68.5                35
Package pins              28                  28                  28                  40                  --
Program memory            192x24 bit EPROM    250x17 bit ROM      512x23 bit ROM      1024x16 bit ROM     512x26 bit ROM
Data memory: RAM          40x25 bit           128x16 bit          128x16 bit          128x20 bit          32x8 bit (coeff.),
                                                                                                          128x24 bit (data)
Data memory: ROM          --                  128x16 bit          512x13 bit          --                  128x8 bit
Accumulator word
  length (bits)           25                  16                  16                  40                  24
Hardware multiplier       software            12x12 -> 16 bit,    16x16 -> 31 bit,    20x16 bit,          16x8 bit,
                          (multi-shift)       parallel-parallel   parallel-parallel   parallel-parallel   parallel-parallel
Unique hardware           on-chip A/D and     FIFO input +        4-level stack       4 static outputs    16 bit I/F;
                          D/A converter       scratchpad RAM                          for ALU flags       internal 24 bit
                                                                                                          word length
EPROM version             Yes                 No                  Yes                 No                  No
Instruction cycle
  time (ns)               400                 300                 250                 800                 250
Second order filter
  sections at 8 kHz
  sampling rate           19                  50                  55                  39                  --
Power supplies (V)        ±5                  +5                  +5                  +5                  +5
Power dissipation (W)     0.8                 --                  0.9                 1.5                 --
I/O                       4 MPX analog        8 bit bus I/O,      8 bit bus I/O,      16 bit bus I/O,     16 bit bus I/O,
                          inputs, 8 DMPX      serial I/O          serial I/O          serial I/O          serial I/O
                          analog outputs                                              (same pin)
Development support       software            real-time           X-asm for iMDS;     assembler           --
                          simulator;          in-circuit          FORTRAN/iMDS-
                          assembler; PC for   emulator;           based simulator
                          EPROM progr.;       evakit with
                          appl.-oriented      progr. function
                          compiler

Table 2. Program example for a biquad digital filter section.

STEP  ALU OPERATION                                    DATA MOVE
1     ACCA <- ACCA - RAM                               K <- RAM /W_{i-1}/, L <- ROM /-b_1+1/;
      /x_i - W_{i-1}/                                  RP decrement, DP modify
2     ACCA <- ACCA + M                                 K <- RAM /W_{i-2}/, L <- ROM /-b_2/;
      /x_i - W_{i-1} + (-b_1+1)W_{i-1}/                RP decrement
3     ACCA <- ACCA + M                                 K <- RAM /W_{i-2}/, L <- ROM /a_2/;
      /x_i - b_1 W_{i-1} - b_2 W_{i-2}/                RP decrement
4     JP TO STEP 6 IF OVA1 = 0
5     ACCA <- SGN
      /+max or -max/
6     ACCA <- ACCA + M                                 TR <- ACCA(OLD) /W_i/;
      /W_i + a_2 W_{i-2}/                              DP modify
7     ACCA <- ACCA + RAM /W_{i-1}/                     K <- RAM /W_{i-1}/, L <- ROM /a_1-1/;
      /W_i + a_2 W_{i-2} + W_{i-1}/                    RP decrement
8     ACCA <- ACCA + M                                 RAM <- TR /W_i/;
      /W_i + W_{i-1} + (a_1-1)W_{i-1} + a_2 W_{i-2}/   DP modify
9                                                      RAM <- K /W_{i-1}/
Step 1 shows that (i) decreasing the accumulator content by W_{i-1}, (ii)
transferring W_{i-1} to register K and (iii) transferring (1-b_1) to L are all
performed in parallel. The multiplier immediately operates and the result is
taken into register M at the end of the instruction cycle corresponding to step 1.
At step 2, the contents of registers K and L are replaced by W_{i-2} and -b_2,
while the multiplication result in register M is accumulated in the accumulator
to produce (ACC) = x_i - W_{i-1} + (1-b_1)W_{i-1} = x_i - b_1 W_{i-1}.
Similar operations appear at step 3 and step 7. Up to step 3 and step 8, the
new value W_i and the output y_i are calculated, respectively. At step 6, value W_i is
temporarily stored in temporary register TR and then stored in the Data RAM as
the new W_{i-1} at step 8. The old value W_{i-1}, kept in register K, is also stored as
the new W_{i-2} at step 9. Overflow correction is performed at steps 4 and 5, if
necessary.
If additional biquad digital filter sections need be calculated, steps similar
to steps 1 through 9 may be repeated, however, referring to different RAM area
and different Data ROM area. The accumulator contents can be used as the
inputs to the following filter sections, thereby giving 9 effective steps for
calculating each biquad digital filter section.
A program by which signal processor μPD 7720 can be used either as (i) a
16-stage biquad filter or (ii) a 63-tap transversal filter, was developed and
incorporated into sample chips. Various filter characteristics can be realized by
initializing all the filter coefficients. Figure 5 shows a frequency response for a
fourth order recursive filter (elliptic) realized by a sample chip.
Many applications are possible as mentioned in Section II. Some bench-
marks for signal processor J,lPD 7720 are listed in Table 3. Cases where the
number of dynamic steps is greater than that of static steps, show that repeated

Figure 5. Fourth order recursive filter response (10 dB/vertical division,
500 Hz/horizontal division).


uses of subroutine calls exist in dynamic operations. PCM/ΔM conversion and
PCM/ADPCM conversion with adaptive prediction (AP) assume linear PCM
inputs. In the cases of 64 complex point and 128 real point FFTs, instruction
steps for data I/O are not included in the given numbers.
In actual device operation, the program execution must be synchronized
with the external systems. Program synchronization can be achieved either by

Table 3. Signal processor μPD7720 benchmarks.

Functions                  No. of Instructions     Execution Time   Comments
                           Static     Dynamic
μ-law/linear
  code conversion          5          5             1.25 μsec       ROM table look-up operations
linear/μ-law
  code conversion          29         51            12.75 μsec      arithmetic operations
Biquad filter section      9          9/8           2.25/2 μsec     overflow/non-overflow
32-tap transversal
  filter section           37         37            9.25 μsec       including data shift operations
DTMF receiver              260        480           120 μsec        filter bank
PCM/ΔM conversion          96         360           90 μsec         ΔM (CVSD), 4 bits/PCM word
PCM/ADPCM with AP          300        480           120 μsec        4-bit ADPCM word
64 complex point FFT       310        6,400         1.6 msec        including bit reversal operations
128 real point FFT         370        8,000         2.0 msec        including bit reversal operations
9600 bps modem (V.29)      2,500      6,500         --              dynamic steps/baud (2400 baud)

[Figure 6 residue: a single-chip PCM codec (analog input, 1.544/2.048 MHz
PCM clock, word sync) is connected to the μPD 7720 serial I/O pins (SI, SO,
SCK, SIEN, SOEN); the processor runs from an 8 MHz clock with reset/interrupt
lines and delivers the ADPCM output.]

Figure 6. Single-chip PCM codec and signal processor combination.


supplying the signal processor with a reset pulse or an interrupt pulse at every
sampling interval, or by internally checking I/O flag information by software. A
circuit implementation example is shown in Fig. 6, where a commercially avail-
able single-chip PCM codec and signal processor μPD 7720 combination is
depicted.
Multiprocessor configuration, using signal processors, is another interesting

[Figure 7 residue: signal processor modules, with input and output buffers,
arranged in parallel and/or cascade and feeding a packet processor module.]

Figure 7. Multiprocessor configuration example.


item to be studied. An example [8] is shown in Fig. 7, where signal processor
modules with communication capability are arranged in parallel and/or cascade.
Note that all the signal processor modules are required to operate synchronously
with a system clock.

V. FUTURE TRENDS

It is easy to presume that the integration scale will be doubled or even
quadrupled in the next five years. General purpose 32-bit microprocessors, now
laboratory products, will be commercialized. Supercomputers, 100 times
more efficient than the CRAY-1, are planned to be developed. However, the need
will continue for single-chip VLSI signal processors with enhanced real-time
capability.
Future signal processors will be reinforced in (i) RAM/ROM size, (ii) data
precision, (iii) operation speed and/or (iv) I/O capability.
Completely different processor architectures with vector operations or
data-flow control philosophy [9] ,[10] are quite attractive in the VLSI age. Also,
the modular concept with communication capability, as previously mentioned
[8] , will be another interesting approach for the future signal processors.

REFERENCES

[1] M. E. Hoff, Jr. and M. Townsend, "An analog input/output microprocessor
for signal processing," ISSCC Dig. Tech. Papers, pp. 178-179, Feb. 1979.
[2] R. W. Blasco, "V-MOS chip joins microprocessor to handle signals in real
time," Electronics, pp. 131-138, Aug. 30, 1979.
[3] Y. Kawakami, T. Nishitani, E. Sugimoto, E. Yamauchi and M. Suzuki, "A
single chip digital signal processor for voiceband applications," ISSCC Dig.
Tech. Papers, pp. 40-41, Feb. 1980.
[4] T. Nishitani, Y. Kawakami, R. Maruta and A. Sawai, "LSI signal processor
development for communications equipment," Proceedings ICASSP, pp. 386-389,
March 1980.
[5] J. S. Thompson and J. R. Boddie, "An LSI digital signal processor,"
Proceedings ICASSP, pp. 383-385, March 1980.
[6] R. P. Capece, "Digital n-MOS chip handles analog signals at record
200 kHz," Electronics, pp. 79-80, Dec. 4, 1980.
[7] T. Nishitani, R. Maruta, Y. Kawakami and H. Goto, "A single-chip digital
signal processor for telecommunication applications," IEEE J. Solid-State
Circuits, vol. SC-16, no. 4, pp. 372-376, Aug. 1981.
[8] A. Ogata, T. Kumahara, T. Matoba, A. Sawai and Y. Ohtake, "Real-time
processor for a tokamak," 11th Symposium on Fusion Technology, Oxford,
Sept. 1980.
[9] H. T. Kung and C. L. Leiserson, "Systolic arrays (for VLSI)," in Sparse
Matrix Proceedings 1978, edited by I. S. Duff and G. W. Stewart, SIAM, 1979,
pp. 256-282.
[10] T. Temma, S. Hasegawa and S. Hanaki, "Dataflow processor for image pro-
cessing," Proceedings Mini'80 Asilomar, Monterey, California, Jan. 30-Feb. 1,
1980.
Functional Parallelism in VLSI Systems
and Computations
Noble R. Powell
General Electric Company
Electronics Laboratory
Syracuse, New York 13221

ABSTRACT
The effectiveness of very large scale integration (VLSI) in re-
ducing the incremental cost per unit of performance of a variety of
flexible system functions can be significantly enhanced by employing
a high degree of functional parallelism with serialized data-flow and
control. Both Functional Parallelism (the parallel use of an array of
high density, low cost, lower performance devices to obtain a high
performance function) and Bit-Serialized Arithmetic (the use of single
bit-stream operations to perform elementary arithmetic functions) have
been factored into VLSI systems and computations to permit advantageous
use of MOS solid-state technologies as well as graceful transitions of
processor implementation from one scale of large scale integration to
the next. Some of the major considerations linking form to function
are noted here with examples illustrating the impact of functional par-
allelism and serialized arithmetic.
FUNCTIONAL PARALLELISM
The use of functional parallelism as a means to realize high per-
formance system functions has emerged with the economic availability of
custom LSI and the technology improvements being realized in connection
with the fabrication of MOS-type electronic devices of very high func-
tional density. While serially organized memory devices, such as
shift-registers, have employed functional parallelism to achieve high
performance levels with low performance elements, it was not until the
advent of the CE chip (1), conceived to function as part of an inte-
grated array of functionally parallel devices, that digital arithmetic
units with distributed, serialized, logically integrated data flow and
control were reported (2). In this device the flow of data and control
can be contrasted as shown in Figure 1 with that of more commonly em-
ployed bit-parallel devices. Rather than to employ bit-parallelism for
arithmetic throughput, effective reduction of overall complexity at
high levels of functional throughput has been achieved by serializing
both arithmetic operations and memory at the device-level while par-
alleling devices at the function-level of design. In simple terms,
this approach can lead to economic solutions to signal processor design
by a substantial reduction in device pin-outs, a higher level of func-
tional operation per device at lower bit rates, direct parallelism of
control, direct control of precision and scaling, and the elimination
of recurring software costs.
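To make the contrast concrete, here is an elementary bit-serial adder of the
kind such devices serialize (an illustrative sketch, not the CE chip's actual
logic): two LSB-first bit streams are combined one bit per clock through a
single full adder with a carry flip-flop.

    def serial_add(a_bits, b_bits):
        """Add two equal-length, LSB-first bit streams, one bit per clock."""
        carry, out = 0, []
        for a, b in zip(a_bits, b_bits):
            out.append(a ^ b ^ carry)               # full-adder sum bit
            carry = (a & b) | (carry & (a ^ b))     # carry into next bit time
        return out

    # 5 + 3 in 4-bit words, LSB first: 5 = [1,0,1,0], 3 = [1,1,0,0] -> 8
    assert serial_add([1, 0, 1, 0], [1, 1, 0, 0]) == [0, 0, 0, 1]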

[Figure 1 residue: a serial-arithmetic device needs only 8 pins (serial data
in and out, function controls, power and ground), while the corresponding
parallel-arithmetic device handling 8-bit bytes needs 22 pins.]

Figure 1. Arithmetic Contrast


Examples of such an approach (3), for the purpose of illustrating the
consequences of using successively higher levels of functional par-
allelism, are shown in Figure 2 through Figure 5, and can be understood
by observing the form taken by the serialized computational element
shown in Figure 6 as well as the interconnection of such CE devices as
repetitively arrayed in the form of a quad-CE (QCE) module shown in
Figure 7.
                                 FFT-64      VFFT-2      VFFT-10
RECORD SIZE (COMPLEX WORDS)      64          1024        1024
FFT TIME                         500 μs      512 μs      102.4 μs
SUSTAINED I,Q SAMPLING RATE      128 kHz     2 MHz       10 MHz
NO. OF RADIX 4 COMPLEX           1           16          80
  ARITHMETIC UNITS
  (8 CE CHIPS EACH)
PRECISION                        19 bits     15 bits     16 bits
INTEGRATED CIRCUITS INCLUDING    131         687         1891
  SPECIAL INTERFACES
POWER DISSIPATION                37 Watts    98 Watts    244 Watts

Figure 2. Fourier Transform Processors



Figure 3. FFT-64

Figure 4. VFFT-2

Figure 5. VFFT-10

The VFFT-10 shown in Figure 5 uses 40 identical printed wire
boards, grouped in five sets of eight, one set for each radix 4 stage. Each
board contains two radix 4 complex arithmetic units, alternating mem-
ories, and multiplexers for performing data permutations. The algor-
ithm utilized is (1)

    T_1024 = [(I_256 x P_4)(I_256 x D_4)(I_256 x T_4)]        <-- trivial mult.
             [(I_64 x P_16)(I_64 x D_16)(I_256 x T_4)]        <-- coarse angles
             [(I_16 x P_64)(I_16 x D_64)(I_256 x T_4)]
             [(I_4 x P_256)(I_4 x D_256)(I_256 x T_4)]
             [(I_1 x P_1024)(I_1 x D_1024)(I_256 x T_4)]      <-- fine angles



[Figure 6 residue: the CE chip, with serial inputs a, b, c, d and serial
lines w and s.]

Figure 6. Serialized Computational Element

[Figure 7 residue: four CE devices interconnected as a quad-CE module,
sharing the serial streams R256, R768, I256, I768.]

Figure 7. Interconnected Computational Elements


High functional throughput of FFT processors employing distribut-
ed arithmetic and control modules partitioned at this level can be
flexibly achieved through appropriate choice of memory decimation
modulo-p, clock frequency f, and radix r. The computation time T for
computing an N-word, b-bit Fourier transformation with modules organized
from such devices, illustrated in Figure 6, into complex-arithmetic
matrix elements, illustrated in Figure 7, is given by

    T = ((b+3)/r) (N/p) log_r N / f                    (1)

for the case of multiple-pass processor organizations. Thus, with six-
teen levels of functional parallelism (p=16) a radix 4, 1024-point
transformer operating at a clock rate of 2.5 MHz can execute 2000
transforms per second at 16 bits of precision. A cascade realization
of this architecture therefore achieves 10,000 transforms per second
at the same clock rate, or more than 50,000 transforms per second at
CMOS/SOS clock rates.
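Plugging the cited numbers into equation (1) (whose exact grouping was
reconstructed above, so treat this as a sketch):

    from math import log

    def fft_time(b, N, r, p, f):
        """Equation (1): T = ((b+3)/r) (N/p) log_r(N) / f."""
        return (b + 3) / r * (N / p) * log(N, r) / f

    T = fft_time(b=16, N=1024, r=4, p=16, f=2.5e6)
    print(f"T = {T * 1e6:.0f} microseconds")   # about 600 us per transform,
                                               # on the order of 2000/s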
The function performed by each set of eight devices (the 4 x 4
complex-element matrix-multiply of a complex four-vector) is currently
feasible as a single device under forty-thousand square mils in total
area with three micron geometry in a package consisting of twenty-four
pins or less. Thus, the Fourier transformer shown in Figure 5, which
executes 1024-point transforms to 16 bit precision in 100 microseconds
with current CE chips, would be reduced from 40 arithmetic boards plus
four control and interface boards to five arithmetic boards plus four
control and interface boards, given no corresponding improvement in
the control interface electronics. Since functional throughput
measured in terms of the uninterrupted flow of data transformed is at
twenty-million 16-bit words/second at the 2.5 MHz clock (ten-million
complex samples per second) or ten-thousand transforms per second,
machines consisting of the eight-fold increase in functional density
per device as suggested for 3 μm geometry translate into one-hundred-
sixty-million 16-bit words/second at the same level of physical com-
plexity and cost illustrated in Figure 5.
The key to a graceful transition from one level of large-scale
integration to the next higher level is the ability to sustain consoli-
dation of functional density per chip without an offsetting expansion
in either pin-outs per device or board layout complexity leaving con-
trol flow and generation relatively unchanged to preserve system modu-
larity and distributed-functional control. Bit-serial arithmetic and
word-parallel data flow and control provide such architectural freedom,
so that by distributing control electronics throughout the array a sin-
gle command controlling precision and speed of execution can be broad-
cast to all arithmetic devices simultaneously. In addition, the im-
position of identical control-flow and data-flow constraints at the
level illustrated in Figure 7 leaves functional configuration and con-
trol transparent to the incorporation of very large scale integrated
device designs as technology improvements make such designs attractive
and memory improvements support such changes.
REGULARITY OF SERIALIZED ARRAYS
The regularity of arrays of such serialized devices encourages the
use of redundancy since the mechanization of interconnecting and multi-
plexing on a device or subarray level is straight-forward. Techniques
of voting for self-test and of test function utilization are naturally
suited to integrated array functional processors of this class since
interconnect and functional switching of input and output are greatly
reduced.
The integrated matrix functional processor illustrated in Figure 8
provides an illustration of the manner in which the modularized concept
of functional partitioning for VLSI potential consolidation has been
interwoven with conventional hardware. This figure portrays an 8 x 8
module conceived as an element of an architecture for solving recursive
matrix equations arising as boundary-value problems, image filtering,
adaptive beamforming, and in non-linear optimization problems. Gen-
erally, the primitive executed is

    S_N = R_{N-1} + C_{N-1} S_{N-1}                    (2)

Figure 8. Matrix Functional Processor


where R_{N-1}, C_{N-1}, S_{N-1} are conformable matrices of dimension k x m,
k x n, n x m, respectively; and S_N is the solution matrix, k x m. This
equation may be considered typical, although augmentable, of recursive
matrix equations which have been implemented in the form shown in Fig-
ure 9, again with single layer boards. The array of custom devices, ex-
ecuting r x r simultaneous dot products, is operating in a bit-serial
(single word lines), word-parallel architecture. Column and row out-
put arrays as well as the associated memories are similarly organized
with all the busing word-parallel bit serial. A suitable microproc-
essor is used for control, array scaling, and miscellaneous logically
intensive tasks. A host computer is shown supplying data matrices;
however, the more typical channelized application employs direct pipe-
lining of the source data to the row and column memories which are used
in scratch-pad fashion as part of the execution of highly intricate
matrix equations before outputting results to a display, storage de-
vice, or control configuration.
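A plain-Python rendering of primitive (2), standing in for the bit-serial,
word-parallel dot-product array (shapes follow the conformability statement
above; the equation's form was itself reconstructed from that statement):

    def step(R, C, S):
        """One recursion S_N = R_{N-1} + C_{N-1} S_{N-1}: k x m dot products."""
        k, n, m = len(C), len(C[0]), len(S[0])
        return [[R[i][j] + sum(C[i][t] * S[t][j] for t in range(n))
                 for j in range(m)] for i in range(k)]

    # k=1, n=1, m=2: [[1, 0]] + [[2]] x [[3, 4]] = [[7, 8]]
    assert step([[1, 0]], [[2]], [[3, 4]]) == [[7, 8]]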
All of the usual matrix operations are thus available including
multiplication, addition, transposition, complex multiplication, and
inversion. The latter is executed with algorithms of the general form,
(3)

Figure 9. 16 x 16 Matrix Functional Processor


since all operations are carried out in double precision. For ex-
ample, the processor shown in Figure 9 is a 16 x 16 array of custom de-
vices designed to allow 16 bit precision of results from the inversion
of 256 by 256 complex element matrices with condition numbers of 1000
by employing block floating-point arithmetic and forty bit word lengths
internal to the array and row/column memories. Precision and speed
control is broadcast as a single word command to all array devices sim-
ultaneously, thus variable precision-speed modes of operation can be
controlled dynamically as required by the application. Modularity is
achieved in a manner similar to the FFT processor previously described
as an associated set of identical devices having common inputs either
by row, column, or both. By straightforward partitioning of matrices,
the dimension of which exceeds that of the array, inversion operations
are performed as a conformable succession of array multiplications;
and, by zero-filling of matrices, the dimension of which is less than
that of the array, inversion operations can be similarly performed.
Since the component operations of matrix addition, transposition, mult-
iplication, and inversion are both elementary to processor algorithm
execution as well as intrinsic to the array of device types that can
be conceived for this purpose, the matrix functional processor has two
control components configured as a two level hierarchy. On the higher
level, a microcomputer controls the algorithm of matrix functions being
performed, and on the lower level, a PROM driven sequencer controls the
implementation of each matrix function.

FUNCTIONAL EXTENSIONS
Clearly, this organization of modular arithmetic with distributed
memory and control contracts directly and gracefully at successively
higher scales of solid-state integration as such consolidation becomes
economically attractive. At the three micron level of geometry, for
example, the machine illustrated would be capable of unpartitioned ex-
ecution of 128 x 128 matrix operations; or, a machine of four times the
complexity shown will execute the 256 x 256 matrix arithmetic directly.
Typical performance of an array 128 x 128 operating at a clock rate of
10 megahertz is a matrix multiply in 516 microseconds, a complex matrix
multiply in 2.06 milliseconds, and a real matrix inversion in 14 milli-
seconds. The 256 x 256 case requires four times these periods with the
same size array and precision described.
REFERENCES
1. Powell, N.R. and Irwin, J.M., "Flexible High Speed FFT with MOS
Monolithic Chips," Eighth Asilomar Conference on Circuits, Systems,
and Computers, December 1974.
2. Powell, N.R. and Irwin, J.M., "A MOS Monolithic Chip for High-
Speed Flexible FFT Microprocessors," ISSCC-75, February 1975.
3. Powell, N.R. and Irwin, J.M., "Signal Processing with Bit-Serial
Word-Parallel Architectures," Real Time Signal Processing, Proc.
SPIE, Vol. 154, pp. 98-104.
Functional Extensibility: Making the World
Safe for VLSI

Justin R. Rattner
Intel Corporation
Special Systems Operation
Aloha, Oregon 97005

The greatly improved access to VLSI technology, now available to
non-specialists, has sparked considerable interest in the design of
special function architectures (SFAs) that exploit its unique
characteristics and advantages. For example, many SFA's take the form
of pipelines or arrays in order to capitalize on the economies of
replication that come with VLSI. Application of these VLSI SFA's has
reflected current research interests in areas such as pattern
recognition, image analysis, data base retrieval, interactive
graphics, and speech processing.

While most of these designs are experimental at present, their
successful demonstration has made the commercial exploitation of the
concept quite appealing. The problem that remains, however, is just
how to couple these special purpose machines to general purpose
computing environments in a manner that is efficient, in terms of
performance and throughput, regular and convenient in terms of its
software interface, and economical in terms of die cost, board area,
power and so on.

In this paper we propose that the above requirements are best
addressed by viewing the attachment of an SFA as a functional
extension to the host architecture. When viewed in this way, much of
the burden for a clean interface is returned to the host architecture.
Functional extension is thus an issue for which the host architecture
and not the SFA has the primary responsibility.

We examine several of the traditional architectural approaches to
functional extension, such as instruction set extension, and find them
to be cumbersome and restrictive. We also examine the attachment of
SFA's as I/O devices and find this approach to be unnatural and
complicated by operating system issues when the SFA is not intended to
provide an I/O function.

To eliminate these difficulties we recommend that SFA's be viewed
as hardware processes and, like their software counterparts, that they
base their communication with the host on the exchange of messages. We
also recommend that, to coexist in the host architectural framework,
the SFA be physically connected to the host hardware as an additional
tightly-coupled multiprocessor. In this way the SFA is made to observe
a uniform set of conventions for addressing, access control,
interprocess and interprocessor communication, greatly simplifying
50
Justin R. Rattner 51

the host software interface. To


insure the cost of interconnection
and architectural conformance do
not unduly burden the SFA, a
standard VLSI interface unit is
proposed that integrates all these
functions into a single chip.
The recently announced
iAPX-432 is given as an example of
a VLSI-based architecture that
supports functional extension in
the manner proposed. The 43203
Interface Processor is used to show
how the support functions can be
integrated on a single chip.
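The recommended coupling has a direct software analogy (our
illustration; the queue-based encoding is invented for exposition and
is not Intel's interface): to host software, the SFA is simply another
process reached by message exchange.

    # Sketch: an SFA modeled as a hardware process communicating by messages.
    # request_q and reply_q stand in for the interface unit's message ports.
    import queue, threading

    def sfa_process(request_q, reply_q):
        while True:
            op, data = request_q.get()          # receive a message from the host
            if op == "halt":
                break
            reply_q.put(("done", [v * 2 for v in data]))  # placeholder function

    request_q, reply_q = queue.Queue(), queue.Queue()
    threading.Thread(target=sfa_process, args=(request_q, reply_q)).start()
    request_q.put(("compute", [1, 2, 3]))
    print(reply_q.get())                        # ('done', [2, 4, 6])
    request_q.put(("halt", None))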
Replication of Inputs May Save
Computational Resources In VLSI
Zvi M. Kedem and Alessandro Zorat
State University of New York at Stony Brook
Department of Computer Science
Stony Brook, New York 11794

1. INTRODUCTION

Several authors have investigated the complexity of a VLSI chip


that solves a certain problem or class of problems. The results are
stated in terms of a function R(A,T) of the chip area (A) and
execution time (T) and are generally based on an abstract model of
VLSI computation first defined by [T79] and further developed in
[BK81] and [V80].
An assumption necessary for the technique used to prove lower
bounds and to derive upper bounds on R(A,T) is that each of the inputs
is given only at one point in time and space during the course of the
computation. This assumption will be relaxed here to allow replication
of inputs in time and/or space (area).
The concept of replication of inputs was first introduced in
[KZ81], in which the implications of such replications were studied
both in terms of lower and upper bounds. That paper concentrated on
the special case of replication in space.
In this extended abstract the implications of input replications
both in space and in time are examined. A general lower bound on the
complexity of any chip computing transitive functions as defined in
[V80] will be derived. A construction will then be presented to show
that input replication can be used to drastically reduce the resource
requirements of a familiar computation of theoretical interest in

This research was supported in part by NSF under Grants ECS 80-09938,
MCS 80-25376, MCS 81-04882, and MCS 81-10097, and was facilitated by
the use of TheoryNet, NSF Grant MCS 78-01689.

VLSI: circular shifts. This proposed construction also shows that the
computation of circular shifts can be naturally partitioned among
several chips, without requiring any inter-chip communication.

2. COMPLEXITY OF TRANSITIVE FUNCTIONS WITH MULTIPLE INPUT COPIES

In this extended abstract the term "transitive function" will be
used to denote a group of functions {F_e(x_0,...,x_{n-1}) =
(z_0,...,z_{n-1}) | e = 0,...,E-1} with n inputs and n outputs, such
that for any i,j (0 <= i,j <= n-1) there exists a function in the
group that maps x_i into z_j. In [V80] it was shown that any chip
computing a transitive function with no replication of inputs allowed
(neither in space nor in time) must have an area A and an execution
time T such that AT^2 = Ω(n^2).
Relaxing this requirement results in the following theorem:

THEOREM: For any VLSI chip computing a transitive function and for
which replication of inputs both in space and in time is allowed:
A^3T^4 = Ω(n^4).

Since this is lower than the bound of AT^2 = Ω(n^2) (equivalently,
A^2T^4 = Ω(n^4)) valid for such computations when no input replication
is allowed, one may hope that input replication can be used to
decrease the complexity of at least some transitive functions to a
value below the bound of A^2T^4 = Ω(n^4).
In the next section it is shown that this can be done for a
circuit that computes the circular shifts of a bit-vector, which is a
transitive function.
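For concreteness (our example, not part of the abstract): the family
of circular shifts is transitive, since for any input position i and
output position j some shift maps x_i onto z_j.

    # The circular shifts F_e(x)_j = x_{(j+e) mod n} form a transitive family:
    # choosing e = (i - j) mod n makes F_e carry input bit i to output slot j.
    def F(e, x):
        n = len(x)
        return [x[(j + e) % n] for j in range(n)]

    x = list(range(8))
    i, j = 5, 2
    assert F((i - j) % 8, x)[j] == x[i]   # x_i appears at output position j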

3. A CIRCUIT FOR CIRCULAR SHIFTS

The circuit considered here computes all circular shifts of n
bits and consists of k modules M[i,j] (0 <= i,j <= p-1) arranged in a
square mesh with p = √k modules on its side. All modules are under
the control of a centralized control unit (see Fig. 1). Each module
contains m' = 4m - 4√m + 1 processors, arranged in a square mesh,
with q = 2√m - 1 processors on its side (so m' = q^2), as shown in
Fig. 2. Thus, there is a total of km' processors. Each processor
comprises an input pad, an output pad, and some amount of logic, which
is of constant size, independent of the size of the input.
This circular shifter can be viewed as a multiprocessor
ascribable to the SIMD variety of Flynn's [F66] classification. It
receives its input from an external input unit and sends the output to
an external output unit. The instances in time at which the input
unit provides the inputs and the output unit retrieves the outputs are
fixed in advance and do not depend on the particular shift being
performed. Only the m processors in the √m by √m upper left-hand
corner of each module (see Fig. 2) are connected to the output unit;
the set of such processors is called the "window." The outputs
produced by all other processors are simply ignored.
The computation is divided into n/(km) phases. During phase φ
(φ = 0,...,n/(km)-1), the k modules output km bits, corresponding to
the output bits (z[φ·km], z[φ·km+1],...,z[φ·km+km-1]). During each
phase, module M[i,j] contributes m output bits, corresponding to the
(i·p+j)-th segment of the km bits that are output during that phase.
A similar statement holds at the processor level. In other words, the
final output (z[0],...,z[n-1]) is obtained by:
i) concatenating the outputs of the processors of each module to
form the m outputs of that module;
ii) concatenating the outputs of each module to form the km outputs
of that phase;
iii) concatenating the outputs of each phase to form the
(n/(km))·(km) = n outputs of the entire computation.
To perform a circular shift by s (0 <= s <= n-1) positions the
following operations are executed:
i) The control unit computes the value s/m and stores it in a
counter Q and a register Qold.
ii) All the n inputs are presented to each module in n/m "input
waves" of m bits each. Each wave corresponds to a segment of m
bits of the input vector.
iii) The control unit decrements Q by one for each input wave entered;
when Q reaches zero, the corresponding input wave is "accepted"
by all processors. To do the decrementing of the counter Q
efficiently, several subtractions are carried out concurrently in
a pipelined fashion. The structure of this part of the control
unit is shown in Fig. 3.
iv) After all the input waves for the current phase have been
entered, the control unit causes all processors to perform some
local shifting, within their modules, to align the accepted bits
for output.
v) Finally, the aligned bits are output from all processors that
constitute the window of each module, in one "output wave" of km
bits.
Concurrently with the operations in ii) - v), the control unit
computes the new value of Q to be used in the next phase, that is:
Qnew = Qold + km. After all the operations in v) have been completed,
Qnew is loaded into Q and Qold, and the process from ii) to v) is
repeated for the remaining phases.
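To make the schedule concrete, the following sketch simulates the
shifter functionally. It is our reconstruction from the text and from
Fig. 4 (in particular, the assignment of bit w·m + μ·m + r·i + j to
processor P[i,j] of module μ at wave w is inferred from the example,
not stated explicitly by the authors); it assumes m is a perfect
square and km divides n.

    import math

    def circular_shift(x, s, k, m):
        # Functional model of the k-module shifter: phase phi outputs
        # z[phi*k*m .. phi*k*m + k*m - 1], where z[j] = x[(j + s) mod n].
        n, r = len(x), math.isqrt(m)      # r = sqrt(m) is the window side
        z = []
        for phi in range(n // (k * m)):
            d = s + phi * k * m           # cyclic offset for this phase
            w = (d // m) % (n // m)       # index of the accepted input wave
            up, left = (d % m) // r, (d % m) % r  # local shift-up / shift-left
            for mu in range(k):           # each module's r x r window, row-major
                z += [x[(w*m + mu*m + r*(i + up) + (j + left)) % n]
                      for i in range(r) for j in range(r)]
        return z

    x = list(range(64))                   # the parameters of Fig. 4
    assert circular_shift(x, 30, k=2, m=16) == [x[(j + 30) % 64] for j in range(64)]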
Notice that there is no communication between the various
modules. This suggests that each module could be a separate chip. In
this case, the proposed circuit shows that a seemingly indivisible
problem can be partitioned into a number of independent subproblems.
On the other hand, the entire circuit could be implemented on a single
chip. In this case it is advantageous to organize the modules and the
processors as the leaves of an H-tree (see Figs. 1 and 2), with the
control unit connected to the root of the tree.
Fig. 4 shows an example of a cyclic shift by s = 30 positions for
n = 64, k = 2, m' = 49, and m = 16. The processors that constitute
the window are enclosed by a rectangular boundary for clarity.

4. THE COMPLEXITY OF THE CIRCULAR SHIFTER

The resource requirements of the circular shifter are: A = O(km +
log n), with 1 <= km <= n, and T = O((n/m + √m)·n/(km)). In [LS81] it
was suggested that one may want to optimize different functions of
A and T. Consider then a function of the form A^aT^t that is to be
minimized. Substituting the above values of A and T one obtains:
A^aT^t = O((km + log n)^a · ((n/m + √m)·n/(km))^t).
The optimal choices for the parameters k and m depend on the
values of a and t and are as follows:
i) For a < t: choose k = n^{1/3} and m = n^{2/3}, for which
A^aT^t = O(n^{a+t/3}).
ii) For a = t: choose 1 <= k <= n^{1/3} and m = n^{2/3}, for which
A^aT^t = O(n^{4t/3}).
iii) For t < a < 2t: choose k = 1 and m = n^{2/3}, for which
A^aT^t = O(n^{2a/3 + 2t/3}).
iv) For a = 2t: choose k = 1 and log n <= m <= n^{2/3}, for which
A^aT^t = O(n^{2t}).
v) For 2t < a: choose k = 1 and m = log n, for which
A^aT^t = O(n^{2t}(log n)^{a-2t}).

In particular, for the measure A^3T^4 the optimal choices for k and
m are n^{1/3} and n^{2/3}, respectively, resulting in a complexity of
A^3T^4 = O(n^{13/3}). While this compares quite favorably with the
lower bound of Ω(n^4) derived earlier, it nevertheless leaves a gap
between lower and upper bounds.
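As a quick numerical check of this choice (our sketch; constants are
ignored and the orders A = km + log n and T = (n/m + √m)·n/(km) are
used directly), the ratio A^3T^4 / n^{13/3} stays bounded as n grows:

    import math

    def A3T4(n):
        k, m = round(n ** (1/3)), round(n ** (2/3))
        A = k * m + math.log2(n)
        T = (n / m + math.sqrt(m)) * n / (k * m)
        return A**3 * T**4

    for n in (2**12, 2**18, 2**24):
        print(n, A3T4(n) / n ** (13/3))   # ratios approach a constant (about 16)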

In this extended abstract a family of implementations of a
circuit to compute the circular shifts of a bit-vector was presented.
Note that the area requirements of the proposed circular shifter were
reduced to a sublinear function of the size of the input. This was
achieved by replication of the input, both in time and in space. One
could object that this corresponds to allowing "free" memory outside
the chip, which, if counted, would give an area complexity at least
linear in the size of the input. However, the situation here is
analogous to the one encountered in the study of space complexity in
classical Turing machine computational complexity. There, too, there
existed the problem of space (i.e., tape) requirements for storing the
initial input, thus not permitting any sublinear space complexity.
This limitation was circumvented in [LSH65] by introducing the model
of an off-line Turing machine, in which the machine has a read-only
input tape, whose size is not taken into account when computing the
space requirements of a computation.
In the proposed circular shifter, the input unit is the analog of
the read-only tape of the off-line Turing machine, and hence its area
requirements are not included in the complexity analyses. It may also
be worthwhile to note that the input unit behaves "obliviously"
[HU79], in the sense that the locations and times at which the input
bits are delivered are independent both of their values and of the
amount of shifting to be performed. The concept of oblivious
computations in the context of VLSI has been recently mentioned in
[LS81]; in that work the resource requirements of a computation were
reduced by relaxing the obliviousness assumptions, whereas here such
assumptions were maintained.

REFERENCES
[BK81] Brent, R.P. and Kung, H.T.: "The Area-Time Complexity of
       Binary Multiplication", JACM, v. 28, n. 3, pp. 521-534, 1981.
[F66]  Flynn, M.J.: "Very High Speed Computing Systems", Proc. IEEE,
       v. 54, pp. 1901-1909, 1966.
[HU79] Hopcroft, J.E. and Ullman, J.D.: "Introduction to Automata
       Theory, Languages, and Computation", Addison-Wesley, 1979.
[KZ81] Kedem, Z.M. and Zorat, A.: "On Relations Between Input and
       Communication/Computation in VLSI", Proc. 22-nd Annual Sym. on
       Foundations of Computer Science (to appear), 1981.
[LS81] Lipton, R.J. and Sedgewick, R.: "Lower Bounds for VLSI", Proc.
       13-th Annual Sym. on Theory of Computing, pp. 300-307, 1981.
[LSH65] Lewis, P.M., Stearns, R.E., and Hartmanis, J.: "Memory Bounds
       for Recognition of Context-Free and Context-Sensitive
       Languages", Proc. 6-th Annual Sym. on Switching Circuit Theory
       and Logical Design, pp. 191-202, 1965.
[T79]  Thompson, C.D.: "Area-Time Complexity for VLSI", Proc. 11-th
       Annual Sym. on Theory of Computing, pp. 81-88, 1979.
[V80]  Vuillemin, J.: "A Combinatorial Limit to the Computing Power
       of V.L.S.I. Circuits", Proc. 21-st Annual Sym. on Foundations
       of Computer Science, pp. 294-300, 1980.

FIGURES

Figure 1: The layout of k = p x p modules M[i,j] and the central
control unit.

Figure 2: The layout of m' = q x q processors P[i,j] within a module.
The connections from P[i,j] to P[i,j-1] and P[i-1,j], used in the
"shift-up" and "shift-left" operations, are not shown.

Figure 3: Pipelined decrementer, built from one-bit subtractor and
clocked delay elements.

Figure 4: Example of the operation of the circular shifter
(s = 30, n = 64, k = 2, m = 16, m' = 49).

PHASE 0: Input wave 0, not accepted. Input wave 1, accepted.
Input waves 2 and 3 (not shown) follow. Shift up by 3; shift left
by 2. First wave of outputs: (x[30], x[31], ..., x[61]).

PHASE 1: Input waves 0, 1, and 2 (not shown) follow. Input wave 3,
accepted. Shift up by 3; shift left by 2. Second wave of outputs:
(x[62], x[63], x[0], ..., x[29]).
Planar Circuit Complexity and The
Performance of VLSI Algorithms +

John E. Savage
Brown University
Department of Computer Science
Providence, Rhode Island 02912

1. INTRODUCTION
In 1979 Thompson [1] reported that, under a suitable model for VLSI
chips, the product AT^2 of chip area A and time T to compute the Fast
Fourier Transform on n inputs must satisfy AT^2 = Ω(n^2). His model
accounts for the chip area used by wires as well as for computational
elements. He extended these results in [2] and in addition examined
the sorting problem. Brent and Kung [3] introduced a somewhat
different model for VLSI chips in which the area occupied by wires and
circuit elements is convex. They demonstrate that AT^2 = Ω(n^2) to
multiply two n-bit integers, a result obtained with the original model
of Thompson by Abelson and Andreae [4]. Brent and Kung show that
A = Ω(n) and they present algorithms that come close to meeting their
lower bounds. Savage [5] obtained bounds of AT^2 = Ω(p^4) with both
models for p×p matrix multiplication, inversion and transitive
closure. Algorithms previously given by Kung and Leiserson [6] and
Guibas et al. [7] were shown to be optimal. Preparata and
Vuillemin [8] subsequently introduced a family of optimal matrix
multiplication algorithms for O(1) <= T <= O(n).
Vuillemin [9] has extended the Brent-Kung model to pipelined chips.
If P is the period of computations, that is, the time between
production of two successive results, he shows that AP^2 = Ω(n^2) for
transitive functions on n inputs. He also demonstrates that A = Ω(n)
for these problems. Yao [10] considers VLSI algorithms for x + yz
over a finite field F, n = ⌈log_2 |F|⌉, as well as the predicate
associated with graph isomorphism, and derives bounds of the form
AT^2 = Ω(n^2) under various conditions. Lipton and Sedgewick [11]
consider the performance of VLSI algorithms for the Brent-Kung model
modified somewhat and obtain quadratic lower bounds for a number of
predicates.

+ This work was supported in part by the National Science Foundation
under grant MCS 76-20023, by the University of Paris-Sud, Orsay, and
by INRIA, Rocquencourt, France. Preparation was supported in part by
NSF Grant ECS 80-24637.
Keywords: VLSI, planar circuits, complexity, algorithms, tradeoffs,
predicates, integer powers and reciprocals.


In this paper we offer a model of VLSI algorithms which differs from previ-
ous models in important respects and we present a new method of analysis in
which the performance of algorithms can be related to the planar circuit com-
plexity of functions for which they are designed. The model is described as fol-
lows:
A1. The chip realizes a sequential machine constructed from discrete logic
elements and straight wire segments.
A2. Wires have a width of λ and a separation and a length of at least
λ. Each logic element occupies an area of at least λ^2. The chip has
ν planes each of which may contain wires or logic elements.
A3. Inputs are read and outputs produced at times that are data-
independent.
A4. a) Each input variable is supplied exactly once to the chip; or,
b) Each input variable is supplied at one place on the chip.
Assumption A4 b) is weaker than A4 a) because it permits input variables to be
read multiple times but only at the same place on the chip. We say
that a VLSI algorithm that meets A4 a) is semelective† while one that
meets A4 b) is semelocal.†
The Thompson model [1,2] assumes A1, A2 and A4 a), plus assumes that
wires are rectilinear and that each input variable is associated with a unique
place on the chip. In [1] assumption A3 was made. More recently, Thomp-
son [12] is working with assumption A4 b). Our model differs from the Brent-
Kung model in that in addition to A1 and A2 they assume that the region of the
chip occupied by elements and wires is convex. It is the area of this region
that is measured. Lipton and Sedgewick and Yao use this model. However,
they and Thompson [2] do not require assumption A3, which appears to be
essential to our results. Lipton and Sedgewick obtain their results with
assumption A4 b) while Brent and Kung assume A4 a).

2. PLANAR CIRCUITS AND COMPUTATIONAL INEQUALITIES


Circuits and circuit complexity play an important role in the subsequent
development [13]. Planar circuits are circuits realized with two-input Boolean
logic elements in which the graph of the circuit is planar.
Definition 1: The circuit complexity C(f) of a Boolean function
f: {0,1}^n → {0,1}^m is the size (number of logic elements) of the
smallest logic circuit for f. The planar circuit complexity Cp(f) of
a function f is the size of the smallest planar circuit for f. If in
addition the circuit has one node associated with each input variable,
namely, it is semelective, then Cp^s(f) is the relevant measure.
We have the following two results concerning planar circuit
complexity.
Theorem 1: For all functions f: {0,1}^n → {0,1}^m,
C(f) <= Cp(f) <= Cp^s(f) <= 6[C(f)]^2.

Theorem 2: For any 0 < δ < 1, a fraction of at least 1 - 2^{-δm2^n}
of the functions f: {0,1}^n → {0,1}^m have
C(f) >= m(2^n/n)(1 - δ - o(1))
if log m = O(n); and for all such functions and all values of m and n,
Cp^s(f) <= 7m·2^n.
The former result demonstrates that planar circuit complexity is no
larger than quadratic in the standard circuit complexity. The latter
result demonstrates that for most Boolean functions, standard and
planar circuit complexity measures have about the same value.
† Semelective and semelocal are neologisms formed from the Latin
words semel, meaning once, lectio, meaning to read, and locus, meaning
place.
The following computational inequalities relate chip area A and the
number of cycles of execution T to the two circuit complexity
measures.
Theorem 3: Let f: {0,1}^n → {0,1}^m be computed in T cycles on a VLSI
chip of area A with an algorithm that meets conditions A1 through A4.
Then the inequality
C(f) <= ν(A/λ^2)T
must be satisfied.
Theorem 4: Let f: {0,1}^n → {0,1}^m be computed in T cycles on a VLSI
chip of area A. Then the inequalities
Cp(f) <= 12ν(A/λ^2)T^2,   Cp^s(f) <= 12ν(A/λ^2)^2·T
must be satisfied, where the first holds when VLSI algorithms are
semelocal and the second when they are semelective. If the chip
algorithm is not semelocal or semelective, the inequalities hold when
Cp^s(f) is replaced by Cp(f).
The semelective condition on a planar circuit appears necessary to
obtain strong lower bounds to Cp^s(f). Theorem 3 is a restatement of
a result in [14]. Valiant [15] has observed an interesting connection
between the second inequality and lower bounds to space and time for
uniprocessor machines. He notes that the analysis used by
Grigoryev [16] can be extended to the VLSI model to obtain a lower
bound to A^2T and that all previous bounds obtained with the Grigoryev
method apply here. In particular, this means that the semelective
condition is unnecessary to show a quadratic lower bound for many
multiple-output functions such as polynomial multiplication over
GF(2) [16], the discrete Fourier transform [17], binary integer
multiplication [18], and matrix inversion [19]. (Grigoryev [16] has
also shown an Ω(p^4) lower bound for p×p matrix multiplication.)
Thus, these bounds will hold even for multilective algorithms.
Lipton and Tarjan have applied their separator theorem for planar
graphs [20] to the problems of realizing superconcentrators, to
shifting, and to Boolean matrix multiplication, and they demonstrate
that each of these problems requires a planar circuit size which is
quadratic in the length of their inputs. In the next section we state
certain conditions on functions which permit the application of the
Planar Separator Theorem to the derivation of lower bounds to the
planar circuit size of many important problems.

3. COMPUTATIONAL INEQUALITIES
We begin with a few definitions.
Definition 2: A function h: {0,1}^s → {0,1}^t is a subfunction of
f: {0,1}^n → {0,1}^m if it can be obtained by suppressing some output
variables and/or by an assignment to a subset of the input variables.
The next definition identifies a class of predicates.
Definition 3: A function f: {0,1}^n → {0,1} is w-separated if there
exists a subset X of its variables such that for any partition of X
into two sets A, B with |A|, |B| <= 2|X|/3, there exist at least 2^w
pairs {(a_i, b_i)} of assignments such that f(a_i, b_j) = 1 if i = j
and 0 otherwise. A function f: {0,1}^n → {0,1}^m is also w-separated
if it contains a subfunction which is w-separated.

Yao [21] has a framework in which to consider the computation of
functions in which information must be distributed between two
processors. He has used this to obtain lower bounds on the
performance of VLSI algorithms. His results can be obtained by
showing that a function meets a condition very similar to that given
above. Lipton and Sedgewick [11] also use a condition similar to
this, where the sets A and B have equal size. The following theorem
is similar in spirit to results obtained by these authors. Its
derivation uses the Planar Separator Theorem [20] and the pigeon-hole
principle.
Theorem 5: If f: {0,1}^n → {0,1}^m is w-separated, then
Cp^s(f) >= w^2/32.
The next definition states a condition similar to those used
implicitly by Thompson and Brent-Kung for multiple-output
functions [5].
Definition 4: Consider a function f: {0,1}^n → {0,1}^m and let X and
Y be subsets of the sets of input and output variables of f,
respectively. It is said to have a w-flow if for all partitions of
X ∪ Y into two sets A, B with |A|, |B| <= 2(|X| + |Y|)/3, there are
sets S ⊆ X and S' ⊆ Y, where S is a subset of A and S' a subset of B
or vice versa, such that for some assignment to variables not in S,
the resulting subfunction h of f from S to S' has at least 2^w points
in the image of its domain. A function f also has a w-flow if it
contains a subfunction with this property.
The following illustrates the importance of this definition. It also
uses the Planar Separator Theorem and the pigeon-hole principle.
Theorem 6: If f: {0,1}^n → {0,1}^m has a w-flow, then
Cp^s(f) >= w^2/8.
With Definitions 3 and 4 and these two theorems, bounds on the
performance of VLSI algorithms can be obtained when predicates are
shown to be w-separated or when functions are shown to have a w-flow.
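For instance (our arithmetic, chaining Theorem 6 with the second
inequality of Theorem 4): for a semelective algorithm computing a
function with a w-flow,
w^2/8 <= Cp^s(f) <= 12ν(A/λ^2)^2·T,
whence A^2·T >= w^2·λ^4/(96ν).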
Lower bounds on the planar circuit complexity of certain predicates
can be obtained directly if they are known to be related to functions
which have a large w-flow, as indicated in the following definition.
Definition 5: Given a function f: {0,1}^n → {0,1}^m,
f(x_1,...,x_n) = (y_1,...,y_m), its associated predicate
F: {0,1}^{n+m} → {0,1} is defined by
F(x_1,...,x_n,z_1,...,z_m) = 1 if f(x_1,...,x_n) = (z_1,...,z_m),
and 0 otherwise,
where x_1,...,x_n and z_1,...,z_m are the variables of F.
Theorem 7: If f: {0,1}^n → {0,1}^m has a w-flow, then its associated
predicate F: {0,1}^{n+m} → {0,1} is w-separated.
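In programming terms (our illustrative encoding, not the paper's), the
associated predicate simply checks a proposed output vector against f:

    # Associated predicate F of f: F(x, z) = 1 exactly when f(x) = z.
    def associated_predicate(f):
        return lambda x, z: int(tuple(f(x)) == tuple(z))

    f = lambda x: (x[0] ^ x[1], x[0] & x[1])     # a toy f: {0,1}^2 -> {0,1}^2
    F = associated_predicate(f)
    print(F((1, 0), (1, 0)), F((1, 0), (0, 1)))  # 1 0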
As a consequence of this theorem and subsequent results, it follows
that the predicates that recognize whether or not one integer is the
square root of the other, whether or not one matrix is the inverse of
another, and whether or not one integer is the product of two others
(also announced by Lipton and Sedgewick [11]), for example, all
require that AT^2 and A^2T be quadratic in the lengths of the inputs
for the particular conditions under which these inequalities are
stated.

4. BOUNDS ON PLANAR CIRCUIT COMPLEXITY

We now state lower bounds on the size of the smallest planar circuits
for a number of important multiple-output functions. The first of
these is a restatement of a result of Lipton and Tarjan [20].
Theorem 8: The shifting function f_s^(n): {0,1}^{n+k} → {0,1}^{2n-1},
which shifts a binary n-tuple by 0 to n-1 places, has an (n/18)-flow.
Vuillemin has defined the class of transitive functions [9], which is
the subject of the following theorem.
Theorem 9: Let f_T^(n): {0,1}^n → {0,1}^n be a transitive function of
order n. Then it has a (2n/9)-flow.
The proof of this result has important differences with that of [9]
for the Brent-Kung model. In particular, from Definition 4 we cannot
guarantee that a partition of the inputs and outputs will insure a
near equal division of either the inputs or the outputs into sets of
about equal size, as is assumed in [9].
The following result on matrix multiplication borrows some ideas
from [5] and improves by about six orders of magnitude on the bound to
planar circuit size given by Lipton and Tarjan [20].
Theorem 10: The matrix multiplication function f_MM^(p), defined by
the product C = D×E of two p×p Boolean matrices, has a (p^2/12)-flow.
Theorem 8 can be applied to a number of problems. It follows
immediately that the n-bit binary integer multiplication function
f_M^(n): {0,1}^{2n} → {0,1}^{2n} contains the shifting function
f_s^(n): {0,1}^{n+k} → {0,1}^{2n-1} as a subfunction, as does the
Boolean convolution function f_C^(n)(x_1,...,x_n,y_1,...,y_n) =
(z_2,...,z_{2n}), where z_k = Σ_{i+j=k} x_i·y_j. It is also
straightforward to reduce binary multiplication to binary squaring.
Thus, in each of these cases, the lower bound to planar circuit size
for shifting yields a lower bound for the particular problem.
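The convolution reduction is transparent enough to spell out (our
sketch, 0-based where the text is 1-based): fixing y to a unit vector
turns Boolean convolution into a shift of x.

    # Boolean convolution: z_k = OR over i+j=k of x_i AND y_j, k = 0..2n-2.
    def boolean_convolution(x, y):
        n = len(x)
        return [any(x[i] and y[k - i]
                    for i in range(max(0, k - n + 1), min(k, n - 1) + 1))
                for k in range(2 * n - 1)]

    x, s, n = [1, 0, 1, 1], 2, 4
    e_s = [int(j == s) for j in range(n)]        # unit vector selecting shift s
    assert boolean_convolution(x, e_s) == [0]*s + x + [0]*(n - 1 - s)

The next result extends the class of such problems in interesting
directions.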
Theorem 11: The functions f_R^(n,e) and f_P^(n,e) defined by
f_R^(n,e) = ⌊(2^n/y)^e⌋,   f_P^(n,e) = ⌊2^{2n}·x^e⌋,
in which 1 <= x,y <= 2^n - 1, e = q/2^k > 0, k and q integers
independent of n, and e > 1 for f_P^(n,e), contain the shifting
function f_s^(m), m = Θ(n), as a subfunction. For 0 < e < 1,
f_P^(n,e) contains the 2's complement of the shifting function in
Θ(n) variables as a subfunction.
A wide range of problems satisfies the conditions of this theorem.
They include the reciprocal and square root functions and many powers
and reciprocals of integers. It is also clear that, since the
reciprocal of a binary integer reduces to binary integer division, a
quadratic lower bound applies to the planar circuit complexity of this
function also.

5. DISCUSSION
Much of the recent literature on the performance of VLSI algorithms
concerns quadratic lower bounds to AT^2. These results reinforce the
notion that this measure is basic and should be used to evaluate the
performance of VLSI algorithms for all problems. In this section we
demonstrate that there are problems for which the measure A^2T is a
better measure, in that algorithms can be found for which it is much
closer to the lower bounds that can be derived with it than to lower
bounds derived with AT^2.
The inequality involving AT^2 is stronger than that involving A^2T
when A/λ^2 >= T. If a VLSI algorithm is semelective, several authors
have shown that

the area required for various problems must be at least linear in the
length of the input. This is true for the shifting function [22], for
binary integer multiplication [3], for transitive functions [9], and
for matrix multiplication [23]. Thus, in this case, the AT^2 bound
will be the stronger for large problem sizes.
A problem for which the measure A^2T is superior to AT^2 is binary
sorting. The standard sorting problem for words consisting of strings
over {0,1}^k is modeled by a function
f_sort^(n,k): {0,1}^{nk} → {0,1}^{nk}. If k >= ⌈log_2 n⌉ + 1, it is
easy to show that the function is transitive of order n. Combining
this with the observation above, the better measure for this function
is AT^2. However, for binary sorting, namely when k = 1, the other
measure is superior, since one can construct a planar circuit from the
schema of Muller and Preparata [24] which uses area as small as
O(log^2 n) and for which AT = O(n).
The lower bounds for the chip area required by semelective algorithms
which are stated above hold for multiple-output functions. We exhibit
a predicate for which a similar result holds.
Definition 6: P_{g,p}^{(n,m)} is the set of functions
f: {0,1}^n → {0,1}^m for which there exist at least g distinct
subfunctions of f by some assignment to the variables in the set J,
for all J such that |J| = p.
The following is a simple extension of a result of Yao [10] which was
originally stated for the case |J| = n/2.
Theorem 12: Let f: {0,1}^n → {0,1}^m, a member of P_{g,p}^{(n,m)}, be
computed by a semelective VLSI algorithm. Then the chip area must
satisfy
A/λ^2 >= (log_2 g)/2,
even if the reading and writing by the chip is not done in a
data-independent manner (assumption A3).
Meyer and Paterson have defined a Boolean function
f_MP: {0,1}^n → {0,1} that is contained in P_{g,p}^{(n,1)} for g = 2^p
and p = Ω(n/log n) and that has a linear standard circuit size
[13, p. 43]. Their circuit can be reworked to produce a linear-sized
planar circuit for this function. Thus, the A^2T measure is the
better, at least for large inputs.
As a last observation, we show that if some natural constraints, such
as contiguity of input and output variables on the periphery of a
chip, are placed on VLSI algorithms, then the area required by them
can be excessively large. This is demonstrated with the binary
addition function when the Boolean variables representing each of the
three integers, the two inputs and the result, are required to be
contiguous on the boundary of a chip. In this case it is easy to show
that many node-disjoint paths must exist between inputs and outputs
and these must cross at many points. This leads to a lower bound on
the chip area which is quadratic in the length of the inputs. This
quadratic area can be avoided by the standard expedient of overlapping
the registers holding the two binary numbers and loading them in two
successive cycles for subsequent addition with a standard full-adder
array, sketched below. This technique reduces the area used to linear
in the length of the input, thus showing a very large jump in the area
required when the contiguity constraint is dropped.
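A minimal model of the expedient (ours; the two-cycle register loading
is abstracted into two bit-lists):

    # Full-adder array on two operands loaded in successive cycles (LSB first);
    # one adder cell per bit position gives area linear in the input length.
    def ripple_add(a_bits, b_bits):
        carry, out = 0, []
        for a, b in zip(a_bits, b_bits):
            out.append(a ^ b ^ carry)
            carry = (a & b) | (carry & (a ^ b))
        return out + [carry]

    print(ripple_add([1, 1, 0, 1], [1, 0, 1, 1]))  # 11 + 13 = 24 -> [0, 0, 0, 1, 1]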

6. CONCLUSIONS
We have outlined an approach to the study of the performance of VLSI
algorithms in which performance is related to the planar circuit
complexity of the functions considered. This complexity measure
provides a lower bound to performance as recorded in the quantities
AT^2 and A^2T. We have stated properties of the planar circuit
complexity measure and have stated bounds on it for a wide variety of
functions and predicates. We have also considered cases for which
each of the two measures is the superior, thus illustrating that the
better measure to use is problem dependent.
Before closing, we note that the inequalities stated above can reflect
certain special conditions that are placed on VLSI algorithms. For
example, if I/O is to be done on the boundary of a chip, then the
planar circuits will exhibit this property also, and this fact could
be used to improve the lower bounds obtained to planar circuit
complexity.

7. REFERENCES
1. C. D. Thompson, "Area-Time Complexity for VLSI," Procs. 11th ACM
   Ann. Symp. Th. Comp., pp. 81-88 (April 1979).
2. C. D. Thompson, "A Complexity Theory for VLSI," Report No.
   CMU-CS-80-140, Dept. of Comp. Sci., Carnegie-Mellon U.,
   Pittsburgh, Penn. (August 1980).
3. R. P. Brent and H. T. Kung, "The Area-Time Complexity of Binary
   Multiplication," Report No. CMU-CS-79-136, Dept. of Comp. Sci.,
   Carnegie-Mellon U., Pittsburgh, Penn. (July 1979); revised report
   also available.
4. H. Abelson and P. Andreae, "Information Transfer and Area-Time
   Tradeoffs for VLSI Multiplication," CACM 23, pp. 20-23 (January
   1980).
5. J. E. Savage, "Area-Time Tradeoffs for Matrix Multiplication and
   Related Problems in VLSI Models," Jnl. of Comput. and Sys. Sci.
   (April 1981).
6. H. T. Kung and C. E. Leiserson, "Algorithms for VLSI Processor
   Arrays," pp. 271-292 in Introduction to VLSI Systems, ed. C. Mead
   and L. Conway, Addison-Wesley, Reading, MA (1980).
7. L. J. Guibas, H. T. Kung, and C. D. Thompson, "Direct VLSI
   Implementation of Combinatorial Algorithms," Proc. Conf. Very
   Large Scale Integration: Architecture, Design, Fabrication,
   California Institute of Technology (January 1979).
8. F. Preparata and J. Vuillemin, "Area-Time Optimal VLSI Networks
   for Multiplying Matrices," Info. Proc. Letters 11(2), pp. 77-80
   (Oct. 20, 1980).
9. J. Vuillemin, "A Combinatorial Limit to the Computing Power of
   VLSI Circuits," Procs. 21st Ann. Symp. on Foundations of Computer
   Science, pp. 294-300 (Oct. 12-14, 1980).
10. A. Yao, "The Entropic Limitations on VLSI Computations," Proc.
    13th Ann. ACM Symp. on Theory of Computing, pp. 308-311 (May
    11-13, 1981).
11. R. J. Lipton and R. Sedgewick, "Lower Bounds for VLSI," Proc.
    13th Ann. ACM Symp. on Theory of Computing, pp. 300-307 (May
    11-13, 1981).
12. C. D. Thompson, The VLSI Complexity of Sorting, Division of
    Computer Science, U. C. Berkeley (1981).
13. J. E. Savage, The Complexity of Computing, John Wiley and Sons
    (1976).
14. J. E. Savage, "Computational Work and Time on Finite Machines,"
    JACM 19(4), pp. 660-674 (1972).
15. L. G. Valiant, personal communication.
16. D. Yu. Grigoryev, "An Application of Separability and
    Independence Notions for Proving Lower Bounds of Circuit
    Complexity," Notes of Scientific Seminars, Steklov Math. Inst.
    60, pp. 35-48 (1976).
17. M. Tompa, "Time-Space Tradeoffs for Computing Functions, Using
    Connectivity Properties of Their Circuits," Proc. 10th Ann. ACM
    Symp. Th. Comp., pp. 196-204 (May 1978).
18. J. E. Savage and S. Swamy, "Space-Time Tradeoffs for Oblivious
    Integer Multiplication," pp. 498-504 in Lecture Notes in Computer
    Science, ed. H. A. Maurer, Springer-Verlag, Berlin, Heidelberg,
    New York (July 1979).
19. Joseph Ja'Ja', personal communication.
20. R. J. Lipton and R. E. Tarjan, "Applications of a Planar
    Separator Theorem," SIAM J. Comput. 9(3), pp. 615-627 (Aug.
    1980).
21. A. C. Yao, "Some Complexity Questions Related to Distributive
    Computing," Procs. 11th ACM Ann. Symp. Th. Comp., pp. 209-213
    (April 1979).
22. G. M. Baudet, personal communication.
23. C. A. Heintz, personal communication.
24. D. E. Muller and F. P. Preparata, "Bounds to Complexities of
    Networks for Sorting and Switching," JACM 22(2), pp. 195-201
    (1975).
Three-Dimensional Integrated Circuitry

Arnold L. Rosenberg
Duke University
Department of Computer Science
Durham, North Carolina 27706

ABSTRACT. This paper is devoted to the question: What benefits would
accrue if we could implement microcircuits in a three-dimensional
medium? The patient reader can view the reported results as
demonstrating the efficiencies to be gained once the current
technological obstacles to three-dimensional VLSI chips are overcome.
The less patient reader can view the results as indicating the
currently attainable advantages of three-dimensional printed circuit
boards.
The results obtained indicate (roughly) that bounds on area (both
upper and lower) in the neighborhood of order n^2 in the
two-dimensional case translate to bounds on volume in the neighborhood
of order n^{3/2} in the three-dimensional case; moreover, most of the
upper bounds are attainable via (idealized) realizations that have
active devices on only one level and that use the third dimension only
for wire-routing; such realizations place somewhat fewer demands on
the fabrication technology. However, it is also shown that
unrestricted use of the third dimension can yield realizations that
are more conservative of volume (by the factor (log n)^{1/2}) than any
"one-active-level" realization can be. Finally, it is shown that
three-dimensional circuit realizations can afford one significant
savings in device-to-device wire length: examples are presented
wherein two-dimensional realizations require runs of wire as long as
n/log n, while three-dimensional realizations can always get by with
wire lengths not exceeding n^{1/2}. Thus, at least in the worst case,
there are substantive benefits to be gained from three-dimensional
circuit realization, in terms of savings of both material (area vs.
volume) and time (wire length).


1. INTRODUCTION
One would anticipate three benefits if we could realize
microelectronic circuits in a three-dimensional medium. First,
wire-routing should become easier and more systematic. Next, since
one could avoid obstacles by using the third dimension, runs of wire
should be shorter, at least in the worst case. (This has been noted
in the "modestly" three-dimensional thermal conduction module
developed by IBM [9].) Finally, since avoiding obstacles in a
two-dimensional environment can require area-consuming circuitous
routing of wires, one would expect savings in material: the volume of
a three-dimensional realization of a circuit should be less than the
area of any two-dimensional realization. In this paper we demonstrate
that at least the last two expectations are well founded: unbounded
savings in both wire length and material are achievable via
three-dimensional circuit realization, at least in the worst case; we
cannot comment about the first expected benefit.
Now, we have intentionally been vague about the "level" of circuit
implementation at which we are proposing the use of a
three-dimensional medium: our abstract setting does not distinguish
between the problem of laying out gates and their interconnections on
(or in) a chip and the problem of laying out chips and their
interconnections on (or in) a printed circuit board. Thus the patient
reader can view our study as predicting the kind of gains in
efficiency that will be achievable when the current technological
obstacles to the fabrication of truly three-dimensional chips are
overcome; and the less patient reader can view our study as an
indication of the currently achievable benefits of three-dimensional
circuit "boards."
Although we obviously want to stress the potential benefits of
three-dimensional circuit realizations, we should also mention some of
the problems inherent in the task of implementing such realizations.
Most of these problems accompany the proposal to make
three-dimensional chips, the problems with circuit boards being minor
in comparison. One problem that exists also with circuit boards
(though far less than with chips) is the problem of registration. In
the miniature world being discussed, it is difficult to assure that
corresponding positions on successive layers of either a chip or a
stack of circuit boards line up. In fact, with the current technology
for fabricating chips (as described, say, in [5]), even the goal of
two or three more metal layers for wire routing poses nontrivial
problems. A related second issue that plagues the chip fabrication
process is the difficulty of creating true cylindrical holes: current
processes are plagued by holes that have accentuated tops and/or
bottoms (see, e.g., Figure 4.26 of [5, Sect. 4.7]) because of effects
like diffraction, scattering, and nonuniform exposure to solvents; and
these effects increase with the depth of the hole. However, recent
work on x-ray beam and refined optical lithography suggests that both
of the preceding problems will be less critical in the future than
they are now. A third major issue in the fabrication of
three-dimensional chips is the problem of cooling. Even with current
three-layer chips, one must pay careful attention to power consumption
and concomitant heat generation. A solid three-dimensional chip would
present formidable problems in this regard. This problem may
disappear if a radically different technology becomes dominant, and
would diminish in severity if three-dimensional chips proved to
require much smaller linear dimensions than their two-dimensional
counterparts. However, even with current technology, one avenue for
possibly ameliorating this problem would be to design
three-dimensional chips that have active devices (transistors, gates,
etc.) on only one level and that use the third dimension only for wire
routing. Our formal framework can model this restriction, and we show
that the dramatic savings in both material and wire length are often
attainable even in its presence. (Warning: there are examples wherein
these dramatic savings are not attainable with "one-active-level"
realizations, and other examples where dramatic savings are
attainable, but not to the extent achievable with unrestricted
three-dimensional realizations.) The fourth problem we mention is one
that would be encountered if one tried to build (unrestricted)
three-dimensional chips using MOS technology. Active devices in this
technology are built upon a substrate of monocrystalline silicon.
Placing active devices deep inside a three-dimensional chip would
require one to deposit multiple layers of this form of silicon; and
subsequent processing of the chip could easily cause its delicate
crystal structure to break up into the electrically dissimilar
polycrystalline form of silicon. But recent work on devices (see,
e.g., [3]) suggests that full layers of monocrystalline silicon are
not needed: acceptable transistors can be fabricated on islands of
monocrystalline silicon that sit on a sea of oxide. Moreover, this
problem, as well as the earlier one, would be ameliorated by
restricting attention to one-active-level three-dimensional
realizations. In what follows, we consider both unrestricted and
one-active-level three-dimensional circuit realizations.
Our notion of the realization of a circuit follows the approach of
[4,7,10,11,12,14,16]: we view circuits as graphs whose vertices
correspond to the active devices (transistors, gates, etc. in the chip
model, and chips in the circuit board model) of the circuit, and whose
edges correspond to the wires connecting these devices. We view the
medium in which the circuits are to be realized as (two-dimensional or
three-dimensional) grids. A circuit realization is an edge-disjoint
embedding of the circuit-graph in the grid. The material used in an
embedding is just the area (in the two-dimensional case) or the volume
(in the three-dimensional case) of the grid holding the circuit-graph.
We assume throughout that a unit length of wire has unit area in the
two-dimensional case and a unit-area cross section in the
three-dimensional case; the unit-width assumption follows the
reasoning in the cited sources, and the rationale behind the
unit-height assumption in three dimensions is that this space will
diminish the coupling effects that could plague closely overlapping
runs of wire.
The results reported here summarize the main results of [10,11], to
which the reader is referred for proofs.
2. THE FORMAL FRAMEWORK

A. Graphs and Grids.

A graph G consists of a set V of vertices and a set E of doubleton
sets of elements of V called edges. A path in the graph G = <V,E> is
a finite sequence of vertices of the form v_1,v_2,...,v_k where each
set {v_i,v_{i+1}} is an edge of G. We shall term a set of paths in a
graph independent if no two paths share an edge.
The W x L planar grid is the graph whose vertex-set is the set of
pairs [W]x[L] and whose edges connect vertices <a,b> and <c,d> just
when |a-c| + |b-d| = 1. (Here and throughout, [n] denotes the set
[n] = {1,2,...,n}.)
The HxWxL solid grid is the graph whose vertex-set is the set of
triples [H]x[W]x[L] and whose edges connect vertices <a,b,c> and
<d,e,f> just when |a-d| + |b-e| + |c-f| = 1. We shall view solid
grids as being composed of strata having dimensions length x width,
each stratum comprising two levels. The two levels of a stratum allow
us to lay wires out on a stratum with crossovers; the crossovers are
eliminated by cuts between the two levels. (We shall never allow more
than two wires to cross at a point.) When we are discussing
one-active-level realizations, the active stratum plays a materially
different role than other strata, since it is where the active devices
reside. The purpose of the two levels of the bottom stratum, in this
case, is to permit the construction of these devices
(cf. [5, Chap. 1]). With this understanding about the use of levels,
we shall never have to refer to them explicitly again; therefore, we
shall henceforth refer to strata as atomic objects.
Note. All of the (proved and cited) results here remain true if we
weaken the assumption of rectangularity of grids to convexity.
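Equivalently, in code (our restatement): two grid vertices are
adjacent exactly when their Manhattan distance is one, in either two
or three dimensions.

    # Adjacency in a planar (<a,b>) or solid (<a,b,c>) grid: unit Manhattan distance.
    def adjacent(u, v):
        return sum(abs(a - b) for a, b in zip(u, v)) == 1

    print(adjacent((1, 2, 3), (1, 3, 3)), adjacent((1, 2, 3), (2, 3, 3)))  # True False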
B. Embedding Graphs in Grids.
An embedding of the graph G in the grid Γ (solid or planar) is a
one-to-one association of the vertices of G with vertices of Γ,
together with a one-to-one association of the edges of G with
independent paths in Γ connecting the images of their endpoints. An
embedding in a solid grid is one-active-level if all vertices of G are
associated with vertices of Γ of the form <i_0,j,k> for some fixed
stratum i_0 ∈ [H].
C. The Costs of an Embedding.
We consider two costs of an embedding of a graph in either a solid or
a planar grid: the amount of material consumed by the embedding
(measured by area in the two-dimensional and volume in the
three-dimensional case); and the maximum length of any run of wire
that does not encounter a device. This length is important because of
its effect on the speed of a circuit: depending on the technology in
question, this effect is either linear in the length of a run (the
transmission line model) or almost quadratic in the length of the run
(the diffusion model). The reader is referred to [7] for a discussion
of the problems caused by long runs of wires in circuit realizations.
The volume (resp., area) of an embedding of the graph G in a solid
(resp., planar) grid Γ is the product of the dimensions of Γ. The
volume (resp., area) of the graph G, VOL(G) (resp., AREA(G)), is the
minimum volume (resp., area) of any embedding of G in a solid (resp.,
planar) grid. The one-active-level volume of the graph G,
VOL_1AL(G), is the minimum volume of any one-active-level embedding of
G in a solid grid.
The wire-length of an embedding ε of the graph G in a grid is the
maximum length of any path ε(e) over all edges e of G. The solid
(resp., planar) wire-length of the graph G, WIRE-LENGTH_3(G) (resp.,
WIRE-LENGTH_2(G)), is the minimum wire-length of any embedding of G in
a solid (resp., planar) grid. The one-active-level wire-length of the
graph G, WIRE-LENGTH_1AL(G), is the minimum wire-length of any
one-active-level embedding of G in a solid grid.

Figure 1. The 8-rearrangeable network R(8) and the 8-point FFT graph
F(8).

3. TWO SPECIAL CIRCUITS

In the full paper [10], we present efficient three-dimensional
embeddings of two special graphs and contrast them with their
two-dimensional counterparts. The results are summarized in the
following two theorems.
REARRANGEABLE PERMUTATION NETWORKS: Let R(n) be the n-rearrangeable
network of Benes [2].
(General Three-Dimensional Embeddings)
(a) VOL(R(n)) = O(n^{3/2});
and
WIRE-LENGTH_3(R(n)) = O(n^{1/2}).
(b) For any n-permuter P,
VOL(P) = Ω(n^{3/2});
and there exist n-permuters P -- R(n) being an example -- with
WIRE-LENGTH_3(P) = Ω(n^{1/2}/log n).
(One-Active-Level Embeddings)
(c) VOL_1AL(R(n)) = O(n^{3/2} log n); and
WIRE-LENGTH_1AL(R(n)) = O(n^{1/2}).
(d) For any n-permuter P having maximum fanout d,
VOL_1AL(P) = Ω(n^{3/2}(log_d n)^{1/2}).
(Two-Dimensional Embeddings)
(e) For any n-permuter P,
AREA(P) = Ω(n^2);
and there exist n-permuters P -- R(n) being an example -- with
WIRE-LENGTH_2(P) = Ω(n/log n).

Remarks. (a) One easily verifies that the stated results are valid
also for the n-point FFT graph, which is a subgraph of R(n).
(b) F. Preparata [8] has recently exhibited a family {P(n)} of
n-permuters that have VOL_1AL(P(n)) = O(n^{3/2}(log n)^{1/2}); thus
the lower bound of Part (d) is tight, and the upper bound of Part (c)
is not.
This theorem indicates the dramatic savings in both material and
wire-length attainable by the exploitation of three-dimensional
routing. We now indicate that, even when a graph is efficiently
embeddable in a two-dimensional grid (in terms of AREA), one may be
able to realize substantial savings in wire-length via
three-dimensional circuitry. (This example indicates also that the
advantages of three-dimensional circuit realization may not be
attainable if one insists on one-active-level embeddings.)

COMPLETE BINARY TREES. For any n-node complete binary tree T,
WIRE-LENGTH_3(T) = O(n^{1/3}),
but both
WIRE-LENGTH_1AL(T) = Ω(n^{1/2}/log n)
and
WIRE-LENGTH_2(T) = Ω(n^{1/2}/log n).
Of course, VOL(T) = AREA(T) = O(n).
4. LAYOUT WITHOUT ROUTING: AN APPLICATION OF PERMUTERS
A placement of the graph G is a one-to-one mapping of the vertices of
G into the set N×N of pairs of positive integers. The dilation of a
placement is the maximum (Manhattan) distance in the plane between
images of adjacent G-vertices. Thus dilation is the two-dimensional
analog of bandwidth. Aleliunas and Rosenberg [1] prove that, at the
cost of increasing dilation by a small factor (say 3), one can
restrict attention to placements of G in square regions of side
2|G|^{1/2}; we shall so restrict our attention.
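In code, the definition reads (our sketch):

    # Dilation of a placement: the maximum Manhattan distance between the
    # images of the two endpoints of any edge of G.
    def dilation(placement, edges):
        return max(abs(placement[u][0] - placement[v][0]) +
                   abs(placement[u][1] - placement[v][1])
                   for u, v in edges)

    place = {"a": (1, 1), "b": (1, 2), "c": (4, 1)}
    print(dilation(place, [("a", "b"), ("a", "c")]))   # 3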
An edge-coloring of the graph G is a labelling of the edges of G with
"colors" in such a way that edges incident on the same vertex get
distinct labels. A well-known theorem of Vizing [15] asserts that
one can always edge-color a graph G with either d or d+1 colors,
where d = maxdegree(G).
Edge-colorings are important to us here since, as one easily verifies,
for any color R, the transformation of Vertices(G) defined by:
vertex u gets mapped to vertex v just when u and v are connected
by an R-colored edge
is a permutation of Vertices(G).
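Since a color class is a matching, the induced map is an involution:
matched vertices swap and all others stay fixed (our sketch; the
original leaves the fixed points implicit).

    # The permutation of Vertices(G) induced by one color class of an edge-coloring.
    def color_class_permutation(one_color_edges, vertices):
        perm = {v: v for v in vertices}        # unmatched vertices stay fixed
        for u, v in one_color_edges:
            perm[u], perm[v] = v, u            # each matched pair swaps
        return perm

    print(color_class_permutation([(1, 2), (3, 4)], [1, 2, 3, 4, 5]))
    # {1: 2, 2: 1, 3: 4, 4: 3, 5: 5}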
We can combine the material in the preceding
paragraphs with our efficient three-dimensional
realization of n-rearrangeable permutation networks to
prove the following.

LAYOUT WITHOUT ROUTING. If the n-vertex graph G admits a
dilation-δ(n) placement, then there is an embedding of G in an
O(n^{1/2}) x O(n^{1/2}) x O(δ(n)) three-dimensional grid, so that
VOL(G) = O(n·δ(n));
WIRE-LENGTH_3(G) = O(δ(n)).
In particular, since δ(n) = O(n^{1/2}), every small-degree graph G
can be realized with
VOL(G) = O(n^{3/2})
and
WIRE-LENGTH_3(G) = O(n^{1/2}).
Moreover, there are n-vertex graphs G with
VOL(G) = Ω(n^{3/2})
and
WIRE-LENGTH_3(G) = Ω(n^{1/2}/log n).

The use of permuters as "conduits" for the edges of a graph has its
origins in the work of Offman [6] and Valiant [13]. The major
difference between our use and theirs is that we advocate pruning the
permuter in order to obtain a specific edge association, whereas they,
in their quests for universality, retain the settability of the
permuter switches.
5. CONTROLLING ECCENTRICITY AND LINEAR DIMENSIONS
Two benefits accrue when a three-dimensional chip is almost a cube
(i.e., the ratios H/W, W/L, L/H are all close to unity). First, such
cubehood minimizes linear dimensions (for fixed VOLUME) and,
therefore, ameliorates the problem of heat dissipation. Second, just
as excessive deviation from squarehood in the two-dimensional case can
lead to "packaging" difficulties, excessive deviation from cubehood
can have a similar effect in the three-dimensional case. Therefore,
we have sought a three-dimensional analog of the results of [1],
wherein rectangular grids were deformed into square grids of almost
the same AREA (the techniques expanded AREA by a factor of between 1.2
and 1.8), without stretching edges excessively (edges were stretched
by a factor between 3 and 15).

CUBE-ING UP THREE-DIMENSIONAL GRIDS. Say that the graph G is embedded in the H×W×L three-dimensional grid, with WIRE-LENGTH w. There is an iterative procedure that, after k iterations, will embed G in the three-dimensional grid of dimensions (HWL)^{a(k)} × (HWL)^{a(k)} × (HWL)^{a(k)}, where |1/3 − a(k)| ≤ 2^{−k};

VOLUME ≤ β^k·HWL for some β ∈ [1.2, 1.8];

WIRE-LENGTH₃ ≤ γ^{⌈2k/3⌉}·w for some γ ∈ [3, 15].
Since one can view a two-dimensional grid as a three-dimensional grid with H = 1, one can use the process in the previous result to obtain an almost cubic three-dimensional embedding from any two-dimensional embedding.

6. CONCLUDING REMARKS
We have demonstrated dramatic efficiencies in three-dimensional circuit realizations that are not attainable in two dimensions. Moreover, the benefits of these efficiencies in orders of growth are not delayed by huge constants hidden in the "big-Oh"s; the constants in our constructions are small. These efficiencies are of sufficient magnitude that further research is warranted on two questions. First, how widespread are the advantages of three dimensions over two that we have found here? How, for example, would a three-dimensional realization of a random small-degree graph compare with a two-dimensional realization of the graph? Second, are the technological impediments to three-dimensional circuitry surmountable in the foreseeable future, or are they so tied to the current state of the technology that only a revolutionary advance will overcome them? We plan further study of these and related questions in the near future.
REFERENCES
1. R. Aleliunas and A.L. Rosenberg: On embedding rectangular grids in square grids. IEEE Trans. Comp., to appear.
2. V.E. Benes: Optimal rearrangeable multistage connecting networks. Bell Syst. Tech. J. 43 (1964) 1641-1656.
3. H.W. Lam, A.F. Tasch, Jr., T.C. Holloway: Characteristics of MOSFETs fabricated in laser-recrystallized polysilicon islands with a retaining wall structure on an insulating substrate. IEEE Electron Dev. Lett. EDL-1 (1980) 206-208.
4. C.L. Leiserson: Area-efficient graph layouts (for VLSI). Proc. 21st IEEE Symp. on Foundations of Computer Science, 1980, 270-281.
5. C. Mead and L. Conway: Introduction to VLSI Systems, Addison-Wesley, Reading, MA, 1980.
6. Yu. P. Offman: A universal automaton. Trans. Moscow Math. Soc. 14 (1965) 200-215.
7. M.S. Paterson, W.L. Ruzzo, L. Snyder: Bounds on minimax edge length for complete binary trees. Proc. 13th ACM Symp. on Theory of Computing, 1981, 293-299.
8. F. Preparata: Optimal three-dimensional VLSI layouts. Typescript, June, 1981.
9. D. Rosen: TCM - it's a new word for density in logic packaging. THINK, Jan./Feb., 1981, p. 23.
10. A.L. Rosenberg: Three-dimensional VLSI, I: a case study. IBM Report RC-8745, 1981; submitted for publication.
11. A.L. Rosenberg: Three-dimensional VLSI, II: The general layout problem. In preparation, 1981.
12. C.D. Thompson: Area-time complexity for VLSI. Proc. 11th ACM Symp. on Theory of Computing, 1979, 81-88.
13. L.G. Valiant: Universal circuits. Proc. 8th ACM Symp. on Theory of Computing, 1976, 196-203.
14. L.G. Valiant: Universality considerations in VLSI circuits. IEEE Trans. Comp., C-30 (1981) 135-140.
15. V.G. Vizing: On an estimate of the chromatic class of a p-graph (Russian). Diskret. Analiz. 3 (1964) 25-30.
16. J. Vuillemin: A combinatorial limit to the computing power of V.L.S.I. circuits. Proc. 21st IEEE Symp. on Foundations of Computer Science, 1980, 294-300.
A Critique and an Appraisal of VLSI
Models of Computation
G. Bilardi, M. Pracchi, and F.P. Preparata
University of Illinois at Urbana-Champaign
Coordinated Science Laboratory

1. Introduction
Considerable attention has been paid to the definition of a suitable model of computation for Very-Large-Scale Integrated (VLSI) circuits [1], [2], [3], [4]. The basic parameters of any VLSI computation
model are chip area A and computation time T. VLSI systems display a
trade-off between these two parameters, each of which represents a well-
defined cost aspect: chip area is a measure of fabrication cost and
computation time is a measure of operating cost.
A general feature of all proposed - and presumably of all future -
VLSI models of computation is that a chip is viewed as a computation
graph, whose vertices are called nodes and whose arcs are called wires.
Nodes are, by and large, devices and are responsible for information
processing (computations of boolean functions); wires are just electri-
cal connections, and are responsible for both transfer of information
and distribution of power supply and timing waveforms.
A given computation graph is to be laid out in conformity with the rules dictated by technology. These rules are taken into account in the model of computation by the following assumptions.
Area Assumptions
All wires have minimum width λ > 0, and at most ν ≥ 2 wires can overlap at any point. Transistors and I/O ports have minimum area λ².
Time Assumptions
T1. (Propagation time). To propagate along a wire of length ℓ, a bit requires:
T1.1 A constant time, irrespectively of ℓ [Brent-Kung [2]] (synchronous model).
T1.2 A time O(ℓ) [Mead-Conway [1]; Thompson [3]] (capacitive model).
T1.3 A time O(ℓ²) [Seitz [5]; Chazelle-Monier [4]] (diffusion model).
T2. (Algorithm time). The computation time of an algorithm is the time of the longest sequence of wire propagation times between beginning and completion of the computation. [All models.]
While the area assumptions are uncontroversial, there is little consensus among researchers about the computation time T, as is

This work was supported in part by the National Science Foundation


Grants MCS-78-13642 and MCS-81-05552, and by the Joint Services
Electronics Program Contract N00014-79-C-0424.


reflected in the different choices for T1. Asymptotically T1.3 is valid because a wire is characterized by a resistance and a capacitance which (in a given technology) both grow linearly with the length ℓ; therefore, the time constant of the transistor load grows proportionally to ℓ². As noted by Chazelle-Monier in [4], the consequences of T1.3 are drastic. Indeed, chip wires of substantially different lengths are ruled out and connections must exist only between devices in very close proximity. As a consequence, the only permissible computation graphs are of the mesh type (or closely related), which rules out very fast parallel computation, such as performed by computing structures of the type of the shuffle-exchange [6], the cube-connected-cycles [7], or the tree-connected machine [8].
However, the asymptotics of VLSI have a much closer horizon than
those of the Turing machine. This horizon, in fact, is set by
realistic bounds on the expectations - in the current technology - of
minimum feature size and maximum chip size. Within this horizon, the
line parameters must be weighed against the nonnegligible output im-
pedance of the driving transistor and the input impedance of the driven
transistor. To appraise this interaction, it is therefore appropriate
to take a critical look at the actual physical phenomena occurring
during the transmission of an output switching from a driving
transistor to a driven transistor.

2. A Mathematical Model of Wire Switching


Perhaps the most characteristic feature of present-day VLSI tech-
nology is the fact that, irrespective of the choice of the devices
(MOS-FET versus bipolar, for example) wires are realized as dispersive
lines, and with reference to dispersive line VLSI technology, any
reasonable device selection is representative of the general problem.
In particular, we shall carry out our analysis with reference to
the CMOS technology [6]. In figure 1(a) we have illustrated the circuit being considered. T1 is an n-channel MOS transistor, initially cut-off, with its drain load initially charged to voltage V₀.

Figure 1. The CMOS configuration and the transistor characteristics.

So, with reference to the (I_DS, V_DS) characteristic curves of figure 1(b), P₁ is the initial operating point of T1. By applying at t = 0 a step voltage V_g = V₀ at the gate of T1, the operating point becomes P₂ and, from then on, it moves on the V_g = V₀ curve toward the origin and the transistor

discharges through the channel.
The circuit is modeled as in figure 2(a), where C₀ is the gate capacitance of T2 and the line, of length ℓ, has resistance r and capacitance c per unit of length. The behavior of T1 on the V_g = V₀ curve is approximated as in figure 2(b) with two straight-line segments (saturated regime and ohmic regime), meeting at the pinch-off voltage V_PO.

Figure 2. The model (a) and the idealized characteristics (b).
2.1 General solution
Let v(x,t) and i(x,t) denote the values of the line voltage and current at abscissa x and time t, respectively. From Ohm's law and the definition of capacitance we obtain

∂v/∂x = −ri,   ∂i/∂x = −c·∂v/∂t,

whence

∂²v/∂x² = rc·∂v/∂t,   ∂²i/∂x² = rc·∂i/∂t.   (1)

These are instances of the classical diffusion equation (or heat equation), but our boundary conditions deserve special attention. We use a general approach to solve partial differential equations with homogeneous boundary conditions, called the method of separation of variables, which consists of the following steps: 1) find the solutions to eq. (1) satisfying the boundary conditions and expressed as a product of a time function and of a space function; the space functions are called the eigenfunctions of the boundary problem; 2) express the initial conditions as a linear combination of the eigenfunctions; 3) get the final solution by using the superposition principle. For a large class of problems (known as Sturm-Liouville problems [7]), step 2) is simplified by the orthogonality of the eigenfunctions. In our case, while the saturated regime is of the Sturm-Liouville type, the ohmic regime is not. In the latter case, the difficulty has been circumvented by introducing an unconventional inner product with respect to which the eigenfunctions become orthogonal.
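As an aside (a numerical sketch, not the authors' method), eq. (1) can also be integrated directly by finite differences; here the end samples are simply held fixed for illustration, whereas the boundary conditions of interest involve the driving transistor and the load capacitor:

    def rc_line_step(v, dt, dx):
        # One explicit Euler step of dv/dtau = d2v/dxi2 in normalized
        # variables; stability requires dt <= dx*dx/2. Endpoints held fixed.
        return [v[0]] + [v[m] + dt * (v[m+1] - 2.0*v[m] + v[m-1]) / (dx*dx)
                         for m in range(1, len(v) - 1)] + [v[-1]]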
We now solve eq. (1) for both the saturated and the ohmic regime. The results are expressed in the normalized variables ξ = x/ℓ and τ = t/(rcℓ²). In both cases the eigenfunctions are sinusoidal of angular frequency μ, where μ must satisfy the characteristic equation determined by the boundary conditions.

2.2 Saturated Regime

In the saturated regime the initial condition is i(x,0) = 0 and the boundary conditions are i(0,t) = −I₀ and C₀·(∂v/∂t)(ℓ,t) = i(ℓ,t). We decompose the current into its stationary and transient components; the first is easily determined from the boundary conditions and the second satisfies homogeneous boundary conditions, so that the general method can be used. We get the current I(ξ,τ) as the stationary component plus a transient series in the eigenfunctions. Here γ = cℓ/C₀, μᵢ (i = 1,2,...) is the i-th positive solution of the characteristic equation μ = −γ tan μ, and gᵢ(ξ) is the corresponding eigenfunction.

The expression of V(1,τ), the voltage at the capacitor at the end of the line, is obtained from I(1,τ) and the capacitor equation, as

V(1,τ) = (I₀rcℓ²/C₀)·[τ + Σ_{i=1}^{∞} (gᵢ(1)/μᵢ²)(1 − e^{−μᵢ²τ})].   (2)

From this, by integrating I(ξ,τ) along the line, we obtain

V(0,τ) = V(1,τ) + rℓ·∫₀¹ I(η,τ) dη.

From this expression for V(0,τ) we can determine the time τ_PO at which V(0,τ) = V_PO, i.e. the time at which the regime changes. Assuming that V_PO/V₀ = 0.8, by numerical evaluation we have ascertained that for γ ≤ 10³, at τ = τ_PO the transient term I₁(ξ,τ_PO) is all but negligible (at most 10⁻⁸·I(ξ,τ_PO)). Therefore in this range of γ, we may safely assume I₀(ξ) as the initial condition for the current in the ohmic regime.
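The eigenvalues of the saturated regime are easy to obtain numerically (a sketch, not from the paper): the i-th positive solution of μ = −γ tan μ lies in ((i − 1/2)π, iπ), where μ + γ tan μ increases from −∞ to a positive value, so plain bisection suffices.

    import math

    def mu_saturated(i, gamma):
        # i-th positive root of mu = -gamma * tan(mu), by bisection on
        # ((i - 1/2)*pi, i*pi), where f goes from -inf to a positive value.
        f = lambda mu: mu + gamma * math.tan(mu)
        lo, hi = (i - 0.5) * math.pi + 1e-9, i * math.pi - 1e-9
        for _ in range(200):
            mid = (lo + hi) / 2
            lo, hi = (mid, hi) if f(mid) < 0 else (lo, mid)
        return (lo + hi) / 2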

2.3 Ohmic Regime

In this case the initial condition is i(x,0) = I₀(ξ) and the boundary conditions are v(0,t) = −R₀·i(0,t) and C₀·(∂v/∂t)(ℓ,t) = i(ℓ,t), where R₀ = V_PO/I₀. Solving for the current we get

I(ξ,τ) = I₀·Σ_{i=0}^{∞} (γ/(γ+1))·(Hᵢ − Kᵢ)·gᵢ(ξ)·e^{−μᵢ²τ}

where, with ρ = rℓ/R₀, μᵢ (i = 0,1,...) is the i-th positive solution of the characteristic equation

tan μ = (γρ/(γ+ρ))·(1/μ) − μ/(γ+ρ)   (3)

and gᵢ(ξ) is the corresponding eigenfunction. As before we can get the solution for the voltage, which turns out to be a series of the same form,

V(ξ,τ) = Σ_{i=0}^{∞} Bᵢ·gᵢ(ξ)·e^{−μᵢ²τ}.   (4)

3. Discussion and Conclusions

Expressions (2) and (4), which respectively give the voltage V(ξ,τ) in the saturated and ohmic regimes, are the objective of our analysis. In any given technology the ratio γ/ρ = cR₀/(rC₀) is a constant; therefore only one parameter describes the behavior. Several discharge curves have been plotted in figure 3, for the values of γ = 10⁻², 10⁻¹, 1, 10, 10², 10³.

Figure 3. Discharge curves for γ = 10⁻², 10⁻¹, ..., 10³. The broken lines describe the discharge at constant current.

In general the discharging phenomenon is dominated by the ohmic regime, during which the voltage is given by the approximate - but basically valid - expression

V(ξ,τ) ≃ K·g₀(ξ)·e^{−μ₀²τ},   (5)

where μ₀ is the smallest eigenvalue. Therefore the time constant of the discharge, in unnormalized time, is t_D = rcℓ²/μ₀². If in (3) we
expand tan μ₀ in Taylor series we obtain

(γρ/(γ+ρ))·(1/μ₀) − μ₀/(γ+ρ) = μ₀ + μ₀³/3 + ...,

whence

t_D = R₀C₀·(1+γ+ρ)·(1+Δ(γ,ρ)).   (6)

Letting

Δ(γ,ρ) = ((γ+ρ)/(1+γ+ρ))·Σ_{i≥1} tᵢ·μ₀^{2i}

gives the relative deviation of t_D from R₀C₀(1+γ+ρ), which is linear in ρ and γ and gives the time constant in an idealized capacitive model. In this model the dispersive line is replaced by a single equivalent capacitance of value cℓ(1+ρ/γ), where ρ/γ is a constant in any given fabrication technology; indeed C₀(1+γ+ρ) = C₀ + cℓ(1+ρ/γ). A set of contour lines of Δ as a function of ρ and γ is plotted in a logarithmic diagram in figure 4.

Figure 4. Contour lines of Δ(ρ,γ).

In this diagram, it seems reasonable to (somewhat arbitrarily) define three regions: (1) the region of the synchronous model - shown in heavy shading - as the one where (1+γ+ρ)(1+Δ) ≤ 2 (see eq. (6) and the definition of Δ), i.e., where t_D ≤ 2R₀C₀; (2) the region of the diffusion model - shown in light shading - as the one where Δ ≥ 1 (a deviation which corresponds to at least doubling the propagation delay of the idealized capacitive model); (3) the region of the capacitive model, in between. In the same diagram each technology is represented by a straight line of slope +1, since - as we noted - in any given technology ρ = K₁γ (K₁ a constant).
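The quantities entering this classification are easy to compute (a numerical sketch, not from the paper): the smallest eigenvalue μ₀ of eq. (3) lies in (0, π/2), and since γρ = rcℓ²/(R₀C₀), the normalized time constant is t_D/(R₀C₀) = γρ/μ₀², from which Δ of eq. (6) follows.

    import math

    def mu0(gamma, rho):
        # Smallest positive root of eq. (3), by bisection on (0, pi/2);
        # f goes from -inf near 0 to +inf near pi/2.
        f = lambda mu: math.tan(mu) - (gamma*rho/(gamma+rho))/mu + mu/(gamma+rho)
        lo, hi = 1e-9, math.pi/2 - 1e-9
        for _ in range(200):
            mid = (lo + hi) / 2
            lo, hi = (mid, hi) if f(mid) < 0 else (lo, mid)
        return (lo + hi) / 2

    def delta(gamma, rho):
        # Relative deviation of t_D from R0*C0*(1+gamma+rho), cf. eq. (6),
        # using t_D/(R0*C0) = gamma*rho/mu0^2.
        return gamma*rho / mu0(gamma, rho)**2 / (1 + gamma + rho) - 1.0

For small γ and ρ the deviation is near 0, recovering the idealized capacitive time constant.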
Below, parameter values are reported both for current MOS tech-
nology and for a foreseeable scaled-down technology, following the

established conventions [1], [8] on the relation between feature width, λ, and device sizes.
                             Current technology     Future technology
λ                            2.5 μm                 0.5 μm
Field oxide thickness        1 μm                   1 μm
Gate oxide thickness         600 Å                  150 Å
Aluminum thickness           1 μm                   1 μm
Power supply voltage         5 V                    3 V
I₀                           0.98 mA                1.4 mA
R₀                           4.05 × 10³ Ω           1.69 × 10³ Ω
C₀                           4.12 × 10⁻² pF         6.5 × 10⁻³ pF
r                            3.78 × 10³ Ω/m         1.89 × 10⁴ Ω/m
c                            3.46 × 10⁻¹⁰ F/m       6.93 × 10⁻¹¹ F/m
ρ/γ                          1.10 × 10⁻⁴            0.992 × 10⁻³

The corresponding curves are plotted in figure 4. The points representing the maximum chip width L_max assumed are also reported and the corresponding values are

L_max                        10 mm                  50 mm
γ_max                        84                     5.65 × 10²
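The dimensionless parameters follow directly from the table (a quick recomputation, current-technology column, with ℓ = L_max = 10 mm as assumed above): γ = cℓ/C₀ and ρ = rℓ/R₀ reproduce the reported γ_max = 84 and ρ/γ = 1.10 × 10⁻⁴ up to rounding.

    r, c = 3.78e3, 3.46e-10      # ohm/m, F/m (current technology)
    R0, C0 = 4.05e3, 4.12e-14    # ohm, F
    l = 10e-3                    # m, the assumed maximum chip width
    gamma, rho = c * l / C0, r * l / R0
    print(gamma, rho / gamma)    # ~84, ~1.1e-4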

The conclusion we extract from the preceding analysis is that current MOS-FET VLSI technology falls in the domain of either the synchronous or the capacitive model; in the latter, propagation delay is proportional to the length of the wires. Note, however, that this propagation delay is computed in the hypothesis that both the driving and the driven transistors be standard (i.e., of minimum size). However, by raising the channel width of the driving transistor, the current I₀ increases and the propagation delay decreases. Indeed - as suggested by Carver Mead [1] and Thompson [3] - if the channel width is proportional to the capacitive load for all transistors, one approaches constant propagation time, and presumably, current density becomes the limiting factor.
On the other hand, projected future technology may reach the (conventional) boundary of the capacitive model region. Beyond this boundary a possible design philosophy - as suggested by Chazelle-Monier [4] - is to introduce repeaters on long wires in order to achieve delay proportional to the wire length. Note, however, that in this case we can no longer avail ourselves of channel width control. Another alternative, entirely in the realm of speculation, is the development of integrated nondispersive transmission lines, where speed-of-light considerations are the controlling phenomena.

REFERENCES

1. C.A. Mead and L. Conway, Introduction to VLSI Systems, Addison-Wesley, Reading, MA 1979.
2. R.P. Brent and H.T. Kung, "The chip complexity of binary arithmetic," Proceedings of the 12th Symposium on Theory of Computing, Los Angeles, pp. 190-200; April 1980.
3. C.D. Thompson, "A complexity theory for VLSI," Ph.D. Thesis, Department of Computer Science, Carnegie-Mellon University, Pittsburgh, PA, August 1980.
4. B. Chazelle and L. Monier, "A model of computation for VLSI with related complexity results," Tech. Rep., Department of Computer Science, Carnegie-Mellon University, February 1981.
5. C.L. Seitz, "System timing," Chap. 7, in C. Mead and L. Conway, Introduction to VLSI Systems, Addison-Wesley, Reading, MA 1979.
6. Engineering Staff of AMI, MOS Integrated Circuits, Van Nostrand, New York 1972.
7. R.L. Street, The Analysis and Solution of Partial Differential Equations, Brooks/Cole Publishing Company, Monterey, CA 1973.
8. B.G. Streetman, Solid State Electronic Devices, Second Edition, Prentice-Hall, Inc., Englewood Cliffs, NJ 1980.
On the Complexity of VLSI Computations
Thomas Lengauer and Kurt Mehlhorn
Universität des Saarlandes
Fachbereich 10
6600 Saarbrücken, West Germany

Abstract: We present four results on the complexity of VLSI computations:

a) We further justify the Boolean circuit model [Vu, Sa, LS] by showing that it is able to model multi-directional VLSI devices (e.g. pass transistors, pre-charged bus drivers).
b) We prove a general cutting theorem for compact regions in R^d (d ≥ 2) that allows us to drop the convexity assumption in lower bound proofs based on the crossing sequence argument.
c) We exhibit an Ω(n^{1/3}) asymptotically tight lower bound on the area of strongly where-oblivious chips for transitive functions.
d) We prove a lower bound on the switching energy needed for computing transitive functions.

Keywords: Complexity Theory, Lower Bounds, VLSI Models, Switching Energy.

1. THE BOOLEAN CIRCUIT MODEL AND MULTI-DIRECTIONALITY OF VLSI CIRCUITS

In a number of recent papers on the complexity of VLSI computations


[Vu, Sa, LS] VLSI chips are modelled as synchronous Boolean circuits that are embedded into the plane. In Boolean circuits wires have a specific direction (from the output of one gate to inputs of one or more other gates). Moreover, this fact is explicitly used in numerous lower bound arguments (see [LS]). However, current MOS technology supplies us with a number of truly multi-directional devices, such as pass transistors and pre-charged bus drivers, and these devices are extensively used in circuit design (cf. [MC]). (Note that Thompson [Th] relaxes the mono-directionality of wires somewhat. But he is not able either to model e.g. a bus driver with many ports and constant delay.)
In this section we introduce a class of models for VLSI circuits that are also able to model multi-directional components. We show that the circuits in these models can be simulated by circuits in the LS-model¹ (and vice versa) with area and time increased only by a constant factor. Thus we give evidence for the fact that the mono-directionality of Boolean circuits is indeed no serious restriction if we are only concerned with asymptotic analysis.


These results contrast with experiences made in the area of switch-level simulators for MOS circuits, where, due to the greater level of detail in which the circuits have to be modeled - especially because one has to take special care in modelling faulty circuits - multi-directional circuit models have great advantages over the Boolean circuit model (cf. [Br]).
The multi-directional circuit models (MD-models) are closely related to models used in switch-level simulators for MOS circuits. Their main characteristic is a different concept of a wire. A wire in an MD-model is not a passive but an active device. It determines its value as a symmetric function of the values on its terminals. The ingredients of an MD-model are an MD-circuit and a layout fulfilling the following requirements:
1. The MD-circuit consists of components which are either wires or gates. It operates on Boolean values (0 and 1) at any of a finite number of strengths, increasing from 1 to a constant r. Thus a value is a pair v = (v_b, v_s) ∈ B := {0,1} × {1,...,r}.
2. Each gate and wire of the circuit has a finite number of terminals. Gates may have at most k terminals, where k is some specified constant. (We do not distinguish between input and output terminals.) Each terminal of a wire (resp. gate) is either an input or an output to the MD-circuit, or is coincident with exactly one terminal of a gate (resp. wire).
3. Each gate g has an associated transition function δ_g that computes the values on the terminals of the gate. Here δ_g: B^m → B^m, where m ≤ k is the number of terminals of the gate, maps values on the terminals before switching to values on the terminals after switching of the gate.
4. The circuit operates fully synchronously. Each step consists of four phases:
a) Each gate reads the values on its terminals.
b) Each gate uses its transition function to determine the new values on each of its terminals.
c) Each wire reads the values on all of its terminals.
d) Each wire determines a value w ∈ B to be put on all of its terminals. This value is the strongest Boolean value put on any of its terminals. In a case of a tie between 0 and 1 the value w is undefined. (Essentially a wire is a special sort of gate that can have arbitrarily many inputs and that computes some kind of threshold function. Gates and wires switch alternately.)
5. The layout is defined analogously to [LS]. Each component corresponds to a connected compact region in R². Components that have common terminals in the circuit must touch in the layout. At most ν components intersect each λ×λ square of the layout, where λ is defined as in the LS-model.

¹ Throughout this paper we use a slight modification of the model presented in [LS]. We assume that any region in R² representing a gate or a wire is connected and compact, and that each point in the region lies inside a square of size λ² that is completely contained in the region. We call this modified version the LS-model.

Different MD-models can vary in the number of strength levels and the kinds of gates they allow. As an example we give the following MD-model: There are three levels of strength: 1 (isolated charge), 2 (connection to GND resp. V_DD through a pull-up resistor), and 3 (direct connection to GND resp. V_DD). There is only one kind of gate, namely the MOS transistor T. T has three terminals s (source), d (drain), and g (gate). The transition function δ_T is defined as follows: δ_T(s,d,g) = (s',d',g'), where g' = (g_b,1), and s' = (s_b,1), d' = (d_b,1) if g_b = 0; otherwise s' = d' is the strongest of the values s and d. In case of a tie resolve arbitrarily, or set the value to be undefined. Thus a transistor with a 1 on the gate behaves exactly like a wire with two terminals. It is straightforward to show the following lemma.
Lemma 1: Each circuit in the LS-model can be simulated by a circuit in the above MD-model with area, time, and period only increased by a constant factor.
Sketch of Proof: We can simulate inverters as well as AND- and OR-gates with arbitrary fan-in in the MD-model by using the NOR-gate implementations common in NMOS. These implementations translate easily into the MD-model. Since wires in the MD-model can have arbitrary shape it is possible to simulate each gate in the LS-model "in place". Any undefined values occurring in the MD-circuit have to have been introduced through incorrect operation of the simulated LS-circuit. □
The following theorem shows that the reverse of Lemma 1 holds for all MD-models. This means that circuits in the LS-model are powerful enough to model also multi-directional VLSI circuits. In the proof of the theorem the properties of the gates in the MD-model chosen do not play a significant role. They only influence the constant factor.
Theorem 2: Choose any MD-model. Any circuit in the model can be simulated by a circuit in the LS-model with area, time, and period only increased by a constant factor. The factor depends on the MD-model chosen.
Sketch of Proof: Let C be the circuit in the MD-model that is to be simulated. Let C run in area A, time T, and period P. We will simulate C "in place" by a Boolean circuit C'. It will turn out that we can fit a layout of C' inside a blowup by a constant factor of the layout of C. This yields the bound on the area. The bounds on time and period follow directly from the definition of C'.
For the purpose of the simulation, values v ∈ B will be encoded in a unary fashion, i.e., by unit Boolean vectors of length 2r. The vector (v₁,...,v_{2r}) with the unique 1 being element v_i encodes the value v = ((i+1) mod 2, (i+1) div 2).
Since each gate has at most k = O(1) terminals its function can be simulated by a Boolean circuit with A,T,P = O(1) that fits into a blowup by a constant factor of the layout of the gate in C. (Note that we can, in addition, assure the proper location of the terminals in the circuit C'. However, this may require different layouts of the circuit simulating gate g, for different copies of the same gate g in C.)
It remains to show how to simulate the wires. Let (v₁,...,v_n) with v_j = (v_{j,1},...,v_{j,2r}) be the unit vectors encoding the values on the
terminals of the wire before switching. The "output" value w of the wire is encoded by a vector (w₁,...,w_{2r}), where for 1 ≤ i ≤ 2r

w_i = (⋀_{k=i+1}^{2r} ⋀_{j=1}^{n} ¬v_{j,k}) ∧ (⋁_{j=1}^{n} v_{j,i}).

(An undefined value is here encoded by a vector which is not a unit vector.) If we allow AND- and OR-gates with arbitrary fan-in this formula has depth 2. For the simulation of one step on the wire 2r such functions have to be computed in parallel. The layout of the circuitry necessary for this can fit inside a blowup by a constant factor (depending on r) of the layout of the wire in C. □
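A minimal sketch (an illustration, not the authors' circuitry) of this simulation step, with the unary encoding of values and the depth-2 formula for w:

    def encode(bit, strength, r):
        # Unit vector of length 2r; the single 1 at position i encodes the
        # value ((i+1) mod 2, (i+1) div 2), i.e. i = 2*strength - 1 + bit.
        i = 2 * strength - 1 + bit
        return [1 if pos == i else 0 for pos in range(1, 2 * r + 1)]

    def wire_output(vectors, r):
        # w_i = (AND over k > i and all j of NOT v_{j,k}) AND (OR over j of
        # v_{j,i}); a result that is not a unit vector (e.g. all zeros when
        # nothing drives the wire) plays the role of an undefined value.
        w = []
        for i in range(1, 2 * r + 1):
            nothing_above = all(v[k - 1] == 0 for v in vectors
                                for k in range(i + 1, 2 * r + 1))
            some_at_i = any(v[i - 1] == 1 for v in vectors)
            w.append(1 if nothing_above and some_at_i else 0)
        return w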
Note that in the proof of Theorem 2 we make extensive use of the
fact that in the LS-model AND- and OR-gates with arbitrary fan-in are
allowed, as well as of the liberty we can take with respect to the shape
of the gates.

2. THE CONVEXITY ASSUMPTION

Most papers on the complexity of VLSI circuits assume the convexity of their layout. (The notable exception is Thompson's original paper [Th]. However, he assumes that VLSI circuits are embedded into the planar grid.) We believe that this assumption is unsatisfying, because:
a) Convexity is not an inherent property of VLSI circuits.
b) Dropping the convexity assumption and modelling a chip as a compact region of the plane (holes allowed!) will considerably strengthen our lower bounds. It will allow us to measure the area occupied by only those circuit components that actually process information, and affect the yield during chip production.
c) Taking power and ground into account non-convexity may be called for (cf. the discussion in [CM] on how convexity plus power consumption imply large area).
In the following we will prove a cut theorem for d-dimensional sets (d ≥ 2) that will allow us to drop the convexity assumption. Note that the theorem holds for all dimensions d ≥ 2, such that the result is applicable also to three-dimensional models of VLSI that have been mentioned in several places in the literature (cf. [Ro]).
Theorem 3: Let M be a compact region in R^d, such that ∂M is the disjoint union of the images of finitely many injective continuous functions from the d-dimensional sphere into R^d. Let μ be a measure on M such that μ(M) ≤ 1. Then M can be cut by a set C of cuts along at most 2d−1 hyperplanes into two compact sets M_ℓ and M_r such that
1. M_ℓ ∪ M_r = M, M_ℓ ∩ M_r = C;
2. vol_{d−1}(C) ≤ (2d − 3/2)·2^{1/d}·vol_d(M)^{(d−1)/d};
3. μ(M_ℓ), μ(M_r) ≤ 2/3.
(Here vol_d(M) denotes the d-dimensional volume of the set M ⊆ R^d.)

Proof: Let M be a d-dimensional set satisfying the premise of the theorem. The condition about the smoothness of the borders of M is designed to ensure that all integrals that occur in this proof are defined.
We will cut M in a sequence of at most d steps. Specifically, in step i we cut a set M_i, to be defined inductively and satisfying μ(M_i) > 2/3, into two parts M_{i,ℓ} and M_{i,r} such that μ(M_{i,ℓ}), μ(M_{i,r}) ≤ 2/3. At the end we sum up the (d−1)-dimensional volumes of all cut surfaces.
We start out with M₁ = M. (Note that if μ(M) ≤ 2/3 we can define M_ℓ = M, M_r = ∅, C = ∅.) In general, for 1 ≤ i ≤ d−1, we cut M_i as follows: Since the integral of any integrable function is continuous, by the Bolzano-Weierstrass Theorem we can find a hyperplane H_i through d-dimensional space with the normal vector pointing into the i-th dimension, such that H_i divides M_i into an upper half M_{i,p} and a lower half M_{i,q}, and such that μ(M_{i,p}) = μ(M_i)/2. Let V_i = vol_d(M_i) and v_i = vol_d(M_{i,p}); then V_i − v_i = vol_d(M_{i,q}).
We now find two hyperplanes H_{i,ℓt(i)} and H_{i,ℓb(i)} parallel to H_i, one cutting M_{i,p} and one cutting M_{i,q}. Here ℓt(i) resp. ℓb(i) denote the signed distances of H_{i,ℓt(i)} resp. H_{i,ℓb(i)} to H_i (ℓt(i) > 0, ℓb(i) < 0). Thus H_{i,ℓt(i)} and H_{i,ℓb(i)} cut M_i into the sets M_{i,t}, M_{i,m}, and M_{i,b}, as shown in Figure 1.

Figure 1. The hyperplanes H_i, H_{i,ℓt(i)}, and H_{i,ℓb(i)}, cutting M_i into M_{i,t}, M_{i,m}, and M_{i,b}.

We can choose ℓt(i) and ℓb(i) such that vol_{d−1}(H_{i,ℓt(i)} ∩ M_{i,p}) ≤ 2v_i^{(d−1)/d} and vol_{d−1}(H_{i,ℓb(i)} ∩ M_{i,q}) ≤ 2(V_i − v_i)^{(d−1)/d}, as well as ℓt(i) ≤ (1/2)v_i^{1/d} and ℓb(i) ≥ −(1/2)(V_i − v_i)^{1/d}. For assume that such an ℓt(i) cannot be found. Then

v_i ≥ ∫₀^{(1/2)v_i^{1/d}} vol_{d−1}(H_{i,ℓ} ∩ M_{i,p}) dℓ > (1/2)v_i^{1/d} · 2v_i^{(d−1)/d} = v_i,

which is a contradiction. An analogous argument holds for ℓb(i). We include H_{i,ℓt(i)} ∩ M_{i,p} and H_{i,ℓb(i)} ∩ M_{i,q} in the cut C. Now, if μ(M_{i,m}) ≤ 2/3 we are done: since μ(M_{i,t}), μ(M_{i,b}) ≤ 2/3 we only have to choose M_{i,ℓ} to be the most expensive of the three sets, and M_{i,r} correspondingly to be the union of the other two sets, and we cut M_i as desired.

If on the other hand μ(M_{i,m}) > 2/3, we choose M_{i+1} = M_{i,m} and continue cutting. After succeeding in cutting M_{i+1} into M_{i+1,ℓ} and M_{i+1,r} we let M_{i,ℓ} be the most expensive of these two sets, and M_{i,r} be the rest of M_i.
If after the (d−1)-st step we end up with a set M_d such that μ(M_d) > 2/3, we can cut M_d exactly in half by a hyperplane with the normal vector pointing into direction of dimension d. The cut surface can this time be limited by all the 2(d−1) cuts done before and will have a (d−1)-dimensional volume of size at most Π_{i=1}^{d−1} (1/2)[v_i^{1/d} + (V_i − v_i)^{1/d}].
Thus the total volume of C is at most

vol_{d−1}(C) ≤ Σ_{i=1}^{d−1} [2v_i^{(d−1)/d} + 2(V_i − v_i)^{(d−1)/d}] + Π_{i=1}^{d−1} (1/2)[v_i^{1/d} + (V_i − v_i)^{1/d}].

We know that for all 1 ≤ i ≤ d, V_i ≤ V := vol_d(M). Also, the function x ↦ x^c + (a−x)^c is, for all 0 < c < 1, maximized at the point x = a/2. Thus

vol_{d−1}(C) ≤ (d−1)·4(V/2)^{(d−1)/d} + (V/2)^{(d−1)/d} = (2d − 3/2)·2^{1/d}·V^{(d−1)/d}. □
If we are only concerned with two-dimensional layouts we can improve upon the constant factor somewhat.
Theorem 4: If d = 2 then the constant factor can be improved from (2d − 3/2)·2^{1/d} = 5/√2 to 2.
Proof: Not included. □
3. I/O CONVENTIONS AND BOUNDS ON AREA

Two I/O conventions prevail in the literature:


a) Times and locations at which the input and output bits are available at I/O ports are independent of the input [BK, Vu]. Such chips are called when- and where-oblivious in [LS]. Several linear lower bounds on the area for when- and where-oblivious chips are known ([BK] for integer multiplication, [Vu] for transitive functions).
b) Only the locations are fixed. In this case the chip actively requests the inputs at its input ports. We distinguish two cases here: Either the chip requests inputs by name, i.e., it may request x₁₇, and hence the chip environment has to adapt to input dependent requests, e.g., by means of a random access memory for each port. Output bits are also produced at input-dependent times, and the chip identifies each output value when it produces it. [LS] call such chips where-oblivious. We add the following other possibility:
c) The chip may only request the next input bit at each port, i.e., a queue is associated with each port and the ordering in these queues is independent of the input. Similarly the order in which output bits are produced at each output port is fixed. We call such chips strongly where-oblivious.

Theorem 5: Let f be a transitive function of degree n. Consider any strongly where-oblivious chip computing f with storage area A_S, input area A_I, and output area A_O. Then A_I·A_O·(A_S + λ²/ν) ≥ (λ⁶/ν³)·n. In particular A ≥ (λ²/ν)·(n^{1/3} − 1).
Proof: Let f(x₁...x_n, s₁...s_p) = (y₁...y_n) be a transitive function of degree n (cf. [Vu]) with p control inputs. Assume that the chip under consideration has k_I input and k_O output ports. Assume further that through input port i the bits x_{in(i,1)},...,x_{in(i,p_i)} are read in this order. Here p₁+...+p_{k_I} = n. We do not consider the times at which the s_i are read in. Analogously define the outputs y_{out(j,1)},...,y_{out(j,q_j)}. Again q₁+...+q_{k_O} = n. Let Out₁ = {y_{out(j,1)} | 1 ≤ j ≤ k_O} be the set of output bits that have to be output first.
Let G be the group computed by f. Consider any fixed y ∈ Out₁. Then {g⁻¹(y) | g ∈ G} is a multiset with exactly |G| elements; moreover, each x_i, 1 ≤ i ≤ n, appears exactly |G|/n times in that multiset. Hence each x_i appears exactly k_O|G|/n times in the multiset G⁻¹(Out₁) = {x_j | g(x_j) ∈ Out₁, g ∈ G}.
Now let, for b ∈ N, In_b = {x_{in(i,ℓ)} | 1 ≤ i ≤ k_I, 1 ≤ ℓ ≤ min(b,p_i)}; i.e., In_b is the set of all x_i that are in the first b positions of the queues associated with all input ports. Certainly |In_b| ≤ k_I·b.
We define witness_b(g) for all g ∈ G to be an arbitrary element of In_b ∩ g⁻¹(Out₁), if one exists. Let b_all = min{b | for all g ∈ G, witness_b(g) exists}. We prove the following claim.
Claim: At least b_all − 1 bits have to be stored by the chip.
Proof of Claim: Assume the chip is only able to store b < b_all − 1 bits. Then there is some g ∈ G such that witness_{b+1}(g) does not exist, and a setting s₁...s_p such that the chip computes g. Since witness_{b+1}(g) does not exist, we have In_{b+1} ∩ g⁻¹(Out₁) = ∅, i.e., at least b+2 inputs have to be read before the first output is produced. Thus the chip must be able to store at least b+1 bits. This is a contradiction. □
Now by our counting argument above, we have for any x_i with 1 ≤ i ≤ n: |{g | x_i ∈ g⁻¹(Out₁)}| = k_O|G|/n. Thus |{g | witness_b(g) exists}| ≤ k_I·b·k_O·|G|/n, by our upper bound on the size of In_b. For b = b_all the left side of this inequality becomes |G| and we get

|G| ≤ k_I·b_all·k_O·|G|/n.

Since A_I ≥ λ²k_I/ν, A_O ≥ λ²k_O/ν and A_S ≥ λ²(b_all − 1)/ν, the theorem follows. The lower bound on A is a consequence of the formula A ≥ max(A_I, A_O, A_S). □

For certain transitive functions the lower bound given in Theorem 5 can be matched up to a constant factor with an upper bound. We consider here the function f_CS computing cyclic shifts, i.e. the function f_CS(x₀...x_{n−1}, k) = (y₀...y_{n−1}) where 0 ≤ k ≤ n−1 and y_i = x_{(i−k) mod n}.
Theorem 6: There is a strongly where-oblivious chip computing f_CS with area O(n^{1/3}).

Sketch of Proof: Since we are only concerned with an upper bound on the area we will not take special care to make the chip fast.
We give the chip n^{1/3} input ports. The i-th input port receives the inputs x_{i·n^{2/3}},...,x_{(i+1)·n^{2/3}−1} in this order (0 ≤ i ≤ n^{1/3}−1). These are n^{2/3} inputs per port. We will give the chip slightly fewer output ports, namely n/(n^{2/3}+n^{1/3}) output ports. Each output port produces n^{2/3}+n^{1/3} output bits. Output port j produces the bits y_{j·(n^{2/3}+n^{1/3})},...,y_{(j+1)·(n^{2/3}+n^{1/3})−1} for 0 ≤ j ≤ n/(n^{2/3}+n^{1/3}) − 1.
The idea of this arrangement is, informally, that for each value of k there will be one output port j and one input port i, such that the first input bit to be read by input port i has an index that is at most n^{1/3} smaller than the first output bit to be produced by output port j. Thus, if we start reading inputs from input port i we have to store at most n^{1/3} input bits before we can directly output the input bits read, starting at output port j. We continue reading inputs in a clockwise fashion and directly produce the corresponding output. At the end we produce the input bits read and stored at the beginning.
Formally we define j = j(k) such that

(j(k) − 1)·n^{1/3} < k mod n^{2/3} ≤ j(k)·n^{1/3}

and let i(k) = j(k) − (k div n^{2/3}). We start reading at input port i(k) and producing output at output port j(k). Then the first bit to be output is

y_{j(k)·(n^{2/3}+n^{1/3})} = x_{[j(k)·(n^{2/3}+n^{1/3}) − k] mod n} = x_{[i(k)·n^{2/3} + m] mod n}

where m = j(k)·(n^{2/3}+n^{1/3}) − k − i(k)·n^{2/3} = j(k)·n^{1/3} − k mod n^{2/3} ∈ [0, n^{1/3}−1].
Thus we have shown that for the above algorithm A_I, A_O, A_S = O(n^{1/3}). We still have to argue that the computation necessary for selecting the correct routing in dependence of k can be done in small area. The details of this will not be given here. We end up with a chip using area A = O(n^{1/3}) and time T = O(n·POLYLOGLOG(n)). □
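The index bookkeeping of this proof can be checked mechanically (a sketch, not from the paper, assuming n a perfect cube and input ports taken cyclically):

    def check_schedule(c):
        # c = n^(1/3); verifies m in [0, c-1] and the index identity
        # j(k)*(n^(2/3)+n^(1/3)) - k = i(k)*n^(2/3) + m (mod n) for all k.
        n23, n = c * c, c ** 3
        for k in range(n):
            r = k % n23
            j = -(-r // c)              # smallest j with (j-1)*c < r <= j*c
            i = j - (k // n23)          # input port, taken cyclically
            m = j * c - r
            assert 0 <= m < c
            assert (j * (n23 + c) - k) % n == (i * n23 + m) % n

    check_schedule(8)                   # n = 512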
Theorem 6 shows that there are strongly where-oblivious chips for f_CS that are strictly smaller than any where- and when-oblivious chip for f_CS. If we consider all where-oblivious chips, we can achieve a further reduction in area.
Lemma 7: There is a where-oblivious chip computing f_CS with A = O(log n).
Proof: The chip has one input port and one output port. Outputs are always produced in the order y₀,...,y_{n−1}. Inputs are requested in the right order to be produced directly at the output port. The necessary computation for requesting the input bits can be done in area A = O(log n). □

4. BOUNDS ON ENERGY CONSUMPTION

Thompson [Th] derives a lower bound on energy consumption based on the


assumption that one unit of energy is consumed by one unit of chip area
every time that it is involved in the transmission of a signal. However,
there is a definite difference between transmitting a 0 followed by a 0,
i.e., maintaining a state, and transmitting a 0 followed by a 1, i.e.,
Thomas Lengauer and Kurt Mehlhorn 97

switching a state. In the first case only "static" energy is expended for maintaining the value. In the second case "switching" energy is expended in addition, for changing the value. Whereas static energy consumption is dominant in the NMOS process, switching energy consumption is dominant in processes like CMOS. Moreover, switching energy is the energy concept that is more closely related to computational complexity, and it is the central energy concept introduced in [MC]. Thompson bounds static energy from below.
We derive lower bounds on switching energy consumption based on the following assumption:
Every unit of chip area on every layer of the chip consumes one unit of energy every time it changes its state (from 0 to 1 or vice versa).
Theorem 8: Let f be a transitive function of degree n.¹ Consider any where-oblivious chip computing f with area A, period P, and switching energy E per solved problem. Then

E ≥ n² / (c₁·P·log(c₂·A·P²/n²))

where c₁ and c₂ are constants involving the technological parameters λ and ν of the LS-model.
Sketch of Proof: We assume the layout of the chip to be overlaid with a grid of mesh-size λ. We consider only cuts along grid lines. As in [Th] it follows that a series of cuts C₁,...,C_i,... can be found such that C_i has the shape shown in Figure 2.

Figure 2. The shape of cut C_i: two vertical sections (at distance 0 or λ) and a middle section.
Each C_i cuts the chip in half w.r.t. the I/O-ports. We associate a crossing sequence with each cut as in [LS]. We disregard all values contributed to the crossing sequence by the middle section of C_i. Those are at most (4i+2)ν bits per crossing value.
As shown in [LS] at least Ω(n) bits have to cross the cut C_i for each solved problem. Thus, if the chip is running at full rate, Ω(nT/P) bits have to cross C_i during any time interval of length T, i.e., over the vertical sections at least Ω(nT/P) − O(iT) bits have to cross in this interval. Let L_i be the number of bits contributed to each crossing value by the vertical sections of cut C_i. Let W_i = w_{i,1} || ... || w_{i,L_i} be the concatenation of the crossing sequences associated with each bit contributed to the crossing value by the vertical sections of cut C_i.

¹ The theorem can be proved in the same way for any Boolean function f such that at least 2^{Ω(n)} crossing sequences are necessary across any cut that divides a chip computing f. In this class fall all functions for which AT² = Ω(n²) can be shown with the crossing sequence argument.

The w_{ij} are bit strings of length exactly T. We will encode these strings in a special fashion. First we introduce some notation.
Definition: Let w be an arbitrary bit string, say w = 0^{α₁} 1^{α₂} 0^{α₃} ... ^{α_t}.
a) s(w) is the number of state changes in w, i.e., s(w) = t.
b) bin(w) is the bit string obtained from w after substituting each 0 with a 00 and each 1 with a 11.
c) compress(w) = 0 bin(α₁) 01 bin(α₂) 01 bin(α₃) 01 ... 01 bin(α_t). If the first bit of w is a 1 so is the first bit of compress(w).
Example: compress(00011) = 0 11 11 01 11 00.
Lemma 9: |compress(w)| ≤ 4s(w) + 2s(w)·log(|w|/s(w)).
Proof: Not included. □
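A small sketch (not from the paper) of this encoding; everything after the leading bit is pair-aligned, so decoding reads two bits at a time (00 → 0, 11 → 1, 01 → run separator) and compress is uniquely decodable:

    from itertools import groupby

    def s(w):
        # number of runs of w, the quantity t of the definition
        return sum(1 for _ in groupby(w))

    def compress(w):
        runs = [sum(1 for _ in g) for _, g in groupby(w)]
        blocks = ["".join(2 * b for b in format(a, "b")) for a in runs]
        return w[0] + "01".join(blocks)

    assert compress("00011") == "0" + "1111" + "01" + "1100"  # the example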
Let us now only consider cuts C₁,...,C_δ for some δ to be chosen later. Let W = W₁ || ... || W_δ. Since Ω(nT/P) − O(iT) bits have to cross cut C_i, it follows by summation that |W| ≥ Ω(nTδ/P) − O(δ²T). Furthermore, since the w_{ij} all have the same length T, the mapping

W ↦ ||_{i=1}^{δ} ||_{j=1}^{L_i} compress(w_{ij})

is injective. Thus, if we define S = Σ_{i=1}^{δ} Σ_{j=1}^{L_i} s(w_{ij}) and L = Σ_{i=1}^{δ} L_i, and apply Lemma 9, we get the following.
Lemma 10: Ω(nTδ/P) − O(δ²T) ≤ 4S + 2S·log(TL/S).
Proof: Not included. □
Choosing δ = Θ(n/P) appropriately yields

Ω(n²T/P²) ≤ 4S + 2S·log(TL/S)

which implies that

S ≥ n²T / (c₁·P²·log(c₂·L·P²/n²)).

It remains to relate S to the switching energy expended per solved problem, and to bound L from above. Both can be done with the following argument.
Let c be a component crossing the vertical section of cut C_i. It can be shown that we can charge to c an area of size Ω(1) belonging to the layout of component c, such that the areas charged to components crossing different cuts do not overlap, and such that the areas charged to two different components overlap at most ν-fold, i.e., on ν different layers. (The details of this process will not be given here.) From this it follows that both L = O(A) and E_tot = Ω(S), where E_tot is the total switching energy expended in the time interval of length T. Since in this time interval O(T/P) problems are solved, the theorem follows. □
For a chip working at the limits dictated by information transfer, i.e., AP² = Θ(n²), the bound given in Theorem 8 is tight up to a constant factor: E = Ω(n²/P) = Ω(AP) follows from Theorem 8, and E = O(AP) is obvious.

5. ACKNOWLEDGEMENTS

Bob Tarjan pointed out to us how to generalize the cutting theorem


(Theorem 3) to dimensions d>2.

6. REFERENCES

[BK] R.P. Brent, H.T. Kung, "The Area-Time Complexity of Binary Multiplication," JACM 28,3 (July 1981), 521-534.
[Br] R.E. Bryant, "A Switch-Level Model of MOS Logic Circuits," in VLSI 81 (ed. John Gray), Academic Press (August 1981), 329-340.
[CM] B. Chazelle, L. Monier, "A Model of Computation for VLSI with Related Complexity Results," 13th Ann. Symp. on Theory of Computing, ACM (May 1981), 318-325.
[LS] R.J. Lipton, R.S. Sedgewick, "Lower Bounds for VLSI," 13th Ann. Symp. on Theory of Computing, ACM (May 1981), 300-307.
[MC] C. Mead, L. Conway, Introduction to VLSI Systems, Addison Wesley (1980).
[Ro] A.L. Rosenberg, "Three-Dimensional VLSI, I: A Case Study," Res. Rep. 8745, IBM (March 1981).
[Sa] J.E. Savage, "Planar Circuit Complexity and the Performance of VLSI Algorithms," Dept. of Comp. Sci. Tech. Rep., Brown University, Providence, RI (Jan. 1981).
[Th] C.D. Thompson, "A Complexity Theory for VLSI," Ph.D. Thesis, Carnegie-Mellon University, Pittsburgh, PA (1980).
[Vu] J. Vuillemin, "A Combinatorial Limit to the Computing Power of VLSI Circuits," 21st Symp. on the Foundations of Computer Science, IEEE (1980).
On the Area Required by VLSI Circuits
Gerard M. Baudet
INRIA Rocquencourt
B.P. 105
78150 Le Chesnay, France

ABSTRACT

A technique is developed and used to derive lower bounds on the


area required by a VLSI circuit by taking into account the amount of
information that has to be memorized in the course of the computation.
Simple arguments show, in particular, that any circuit performing
operations such as cyclic shift and binary multiplication requires an
area at least proportional to its output size. By extending the techni-
que, it is also possible to obtain general tradeoffs between the area,
the time, and the period (a measure of the pipeline rate) of a circuit
performing operations like binary addition. The existence of VLSI
designs for these operations shows that all the lower bounds are opti-
mal up to some constant factor.

1 - INTRODUCTION

Most lower bounds on the performance of VLSI circuits have essentially been derived through a technique initially developed by Thompson [9]. It consists in looking at the necessary flow of information between the two sides of an arbitrary partition of a circuit. As a consequence, the results derived through this method account exclusively for the space taken up by the wires required to transmit information [1,2,8,9,10].

Using a similar approach, Yao [11] presents various results based on the observation of the flow of information required by a VLSI circuit. On the other hand, by looking at the computation graph of specific operations, Hong and Kung [4] obtain lower bound results corresponding to the limitations due to the I/O requirements of VLSI circuits.
In contrast with these results, we develop a new technique that allows us to account for the memory required by a circuit. General lower bounds can be derived by looking at the amount of information that has to be retained by a circuit in order to memorize or encode the input that has already been read when the output has not yet been released completely.


The results of this paper are based on the VLSI model of computation already discussed in [2,7,9,10]. The model is briefly reviewed, and the notations and assumptions are introduced in Section 2, along with the measures of performance used to evaluate integrated circuits. In particular, in addition to the usual parameters of a circuit, the area and the time required to perform an operation, we discuss the notion of the period of a circuit, initially introduced by Vuillemin [10] to capture the potential pipeline of a circuit.
As a first illustration of the method developed in this paper, a very simple proof is given to show that any circuit performing binary multiplication requires an area proportional to the total number of bits input (and output) by the circuit. This result is presented in Section 3 and reestablishes a result of Brent and Kung [2] in a much more general setting. The result is valid for a very general class of functions and applies, in particular, to important problems such as cyclic shift, convolution, etc.
While the result of Section 3 provides us with lower bounds relating the area required by a circuit directly to the function computed by the circuit, the result of Section 4 establishes a general tradeoff between the area and the period of a pipelined circuit. As an application of this result, we derive a new lower bound on the complexity of any circuit performing operations like prefix computation or binary addition. The existence of VLSI designs for these operations shows that the result is optimal up to some constant factor.
In [3], Chazelle and Monier present a different computational model for VLSI. In the last section of this paper, we discuss our results and present lower bounds in relation to their model.

2 - A VLSI MODEL OF COMPUTATION

Successive versions of a computational model for VLSI circuits have been developed and refined in recent texts and papers [2,7,9,10]. Most of our notations follow Brent and Kung [2], and the notion of the period of a circuit was originally introduced by Vuillemin [10]. The model is briefly discussed below.
A VLSI circuit can be viewed as a graph where nodes correspond to I/O ports and logic gates while edges correspond to wires. I/O ports and logic gates have minimal areas ρ and β, respectively; and a signal requires a minimal delay τ to propagate through either type of node. Since we are only dealing with lower bounds on the performance of circuits, we can neglect the time for a signal to propagate along a wire (part of this propagation time could also be included in the delay τ). This delay will be used throughout as the time unit. A logic node, with area β, has a maximum fan-in b, and it is able to retain one bit of information at a time.
Owing to design principles, we restrict ourselves to considering VLSI circuits satisfying the following assumptions:
(A1) Both input and output are data independent.
(A2) Any input variable is read only once.
Assumption A1 states that input and output must be read and delivered according to a pre-determined sequence of time instants and on

pre-specified locations which depend entirely on the circuit design and not on the data. Assumption A2 states that an input bit cannot be recycled but must be encoded within the circuit if it has to be reused in a later stage of the computation (this requires an area at least β).
The performance of a VLSI circuit is evaluated with respect to several parameters of the circuit. The area A and the time T required by a circuit to solve a problem constitute important parameters (and have been considered so far the usual measures) of the circuit. As will be illustrated in the case of binary addition, however, these two measures cannot, by themselves, usually capture the full complexity of a circuit. Vuillemin [10] introduced a useful complement to these measures with the notion of the period of a circuit, P, which corresponds to the minimal time interval between the input (or the output) of two consecutive instances of a problem solved by the circuit (used in a pipelined fashion). We feel that the period is a very important parameter of an integrated circuit since it characterizes its maximum throughput, and, therefore, in contrast with the time, it is able to take more completely into account the computing power of the circuit. This is so because it measures not the time to solve just one problem but the time elapsed between the solutions of two consecutive problems input to the circuit (their executions taking place simultaneously or in pipeline). In addition, since any circuit satisfies T ≥ P, lower bounds involving the period usually correspond to stronger results than similar lower bounds involving the time.

3 - A LOWER BOUND ON THE AREA OF VLSI CIRCUITS

Let f be a function computed by some VLSI circuit C. An important (and obvious) remark is that, if circuit C is able to compute function f, then it is certainly capable of computing any restriction f₀ of f. The restriction f₀ can consist of the computation of any subset of the output of f from any subset of the input of f (the remaining input variables are set to some fixed values and are considered a parameter of f₀). As an example, let us consider binary addition: f(x,y) = z = x+y, where x = (x₀,...,x_{n−1}) and likewise for y and z. By setting y to some fixed value c, one possible restriction which we will consider is given by f₀(x) = z = f(x,c) = x+c. In particular, notice that, when c = 0, the restriction f₀ comes to the identity function. Similarly, in the case of binary multiplication: f(x,y) = z = x·y, we will consider the restriction f₀(x₀,...,x_{n/2−1}) = (z_{n/2},...,z_{n−1}), which, again, comes to the identity function (over n/2 bits) when y is given the value 2^{n/2}.
The results of the paper are based on the observation of the progress made by circuit C towards the evaluation of a restriction f₀(x₁,...,x_N) = (z₁,...,z_N) of f. At the time of observation, we want to relate the quantity of information still to be produced by the circuit (loosely speaking, the number of bits not yet delivered) to the quantity of information still available from the input (i.e., in addition to the number of bits still to be read, the information already read and encoded within the circuit). Before developing these results, let us introduce some notations.
Let t_i = iτ, i = 0,1,..., be the sequence of observation times. At time t_i, let n_i be the number of relevant bits input by circuit C (i.e., among the N input bits taken into account by the restriction f₀). We define N₀ = 0 and N_i = N_{i−1} + n_{i−1}, i = 1,2,...; this quantity represents the number of bits input to C prior to time t_i. Also, let s_i be the number of bits encoded within the circuit. Finally, let S_i be the set generated by the output that has not been delivered by time t_i; the size of this set, denoted by |S_i|, corresponds to the number of distinct values that can still be taken by the output after time t_i.
If, at time t_i, the evaluation of f₀ is not completed, any output bit that has yet to be delivered must be evaluated from the N − N_{i+1} input bits that have not been read by the circuit at time t_i, from the n_i input bits that have just been read at time t_i, and from the information memorized within the s_i cells of the circuit (which encodes the input that has been read prior to time t_i). This leads us to the following lemma.

LEMMA 1: The area A of a circuit C computing function f must satisfy:

A ≥ ρn_i + βs_i,   (1)

with:

s_i ≥ log₂|S_i| − (N − N_i).   (2)

PROOF: Equation (1) is a direct consequence of our notations. For equation (2), we observe that, at time t_i, circuit C can produce at most 2^{N−N_{i+1}} × 2^{n_i} × 2^{s_i} = 2^{(N−N_i)+s_i} distinct values (since N_{i+1} = N_i + n_i), while the evaluation of f₀ still requires the production of |S_i| possible states. □
Lemma 1 provides us directly with a lower bound on the area required by a circuit to compute some function f. The lower bound which we can expect to derive from this result is, however, limited by the size of the output of function f, and, in particular, it cannot be more than linear in the number of output variables produced by f. Nevertheless, this linear lower bound appears to be tight in a number of cases, in particular for operations like cyclic shift, binary multiplication, etc. These operations share the property that any output bit of f₀ depends¹ (with respect to f) on all N input bits of f₀ and that f₀ is surjective (in other words, |S₀| = 2^N). This general result is stated in the following proposition.

PROPOSITION 1: The area of any circuit which computes a function f with the above property must satisfy

A ≥ min(ρN, βN).   (3)

PROOF: Let t_i be the time when the last input bit is read by the circuit. We have N_i + n_i = N_{i+1} = N. Since any output bit depends on all input bits, no output has been produced by the circuit at time t_i, thus S_i = S₀ and |S_i| = 2^N. Equation (3) then follows directly from Lemma 1. □

¹Formally, an output bit z_j depends on an input bit x_i if there exist two distinct values of the input which differ only by bit x_i such that bit z_j takes on two different values.
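This notion of dependency is mechanically checkable for toy sizes (a brute-force sketch, not from the paper; the 4-bit cyclic shift with 2 control bits is purely illustrative):

    from itertools import product

    def depends(f, n_in, j, i):
        # z_j depends on x_i iff flipping x_i changes z_j for some input
        for bits in product((0, 1), repeat=n_in):
            flipped = list(bits)
            flipped[i] ^= 1
            if f(bits)[j] != f(tuple(flipped))[j]:
                return True
        return False

    def cyclic_shift(bits):
        # 4 data bits followed by 2 control bits encoding the shift amount k
        x, k = bits[:4], bits[4] + 2 * bits[5]
        return tuple(x[(i - k) % 4] for i in range(4))

    # every output bit of the cyclic shift depends on every data bit
    assert all(depends(cyclic_shift, 6, j, i)
               for j in range(4) for i in range(4))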

4 - A GENERAL TRADEOFF FOR VLSI CIRCUITS

As we have already mentioned, a linear lower bound on the area of a circuit is the best we can expect as a direct consequence of the result of the previous section. One reason for this is that the inequality is based on an instantaneous observation of the circuit rather than on a continuous observation over the entire execution. In this section we show how we can strengthen the result of Lemma 1, and, as an application, we are able to derive a new lower bound on the product A·P for any circuit performing binary addition. Using the fact that T ≥ P, we improve on Johnson's lower bound on AT² [5].
Define k by T = kτ. Note that we can easily obtain a lower bound on the product A·T (rather than the product A·P) through a summation of inequality (1) for i = 0,1,...,k−1, namely, A·T ≥ ρτN + βτ·Σs_i, where s_i satisfies inequality (2). Again, since T ≥ P, this is a weaker form of the result stated in the following lemma.

LEMMA 2: The area A and the period P of a circuit computing some function f must satisfy:

    A·P ≥ ρτN + βτ Σ_{0≤i<k} [log₂|S_i| − (N − N_i)].    (4)

PROOF: Define p = P/τ. If some problem P_t starts at time t_0, consider the sequence of previous problems, P_{t−1}, P_{t−2}, ..., still in progress at time t_i, for 0 ≤ i < p (recall that the circuit may be pipelined). From the definition of the period and by assumption A1, we deduce that, at time t_i, problem P_{t−j} is in the same stage as problem P_t would be at time t_i + jP = t_{i+jp} (i.e., the same variables have been read and the same variables have been produced). Therefore, circuit C has to be able to generate the set S_i for problem P_t, the set S_{i+p} for problem P_{t−1}, etc. Since all the problem instances input to C are independent, the circuit must potentially produce R_i = |S_i|·|S_{i+p}|···|S_{i+mp}| distinct values. The proof of the lemma is then similar to the proof of Lemma 1, with a summation for i = 0, 1, ..., p−1. Note that, at time t_i, the circuit has yet to read Σ_j (N − N_{i+1+jp}) input bits; it has just read Σ_j n_{i+jp} input bits; and it contains s_i memory cells, where s_i satisfies an inequality similar to inequality (2). □
Again, let us consider a restriction f_0(x_1, ..., x_N) = (z_1, ..., z_N) of function f which is surjective, that is, such that |S_0| = 2^N. For i ≥ 0, let M_i be the number of bits that have been produced by the circuit up to (and including) time t_i. Then we have |S_i| = 2^{N−M_i}, for i = 0, 1, ..., and inequality (4) simplifies to:

    A·P ≥ ρτN + βτ Σ_{0≤i<k} (N − M_i).    (5)
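As a hedged illustration of how inequality (5) is used, the following sketch evaluates its right-hand side for a given output schedule M_0, ..., M_{k−1}; the schedule and the unit constants are assumptions chosen for the example:

    def ap_lower_bound(N, M, rho=1.0, beta=1.0, tau=1.0):
        """A*P >= rho*tau*N + beta*tau * sum_{0<=i<k} (N - M_i),
        where M[i] is the number of output bits produced by time t_i."""
        return rho * tau * N + beta * tau * sum(N - m for m in M)

    # Example: a bit-serial adder emitting one output bit per step keeps
    # N - M_i large for most of the run, forcing a large A*P product.
    N = 8
    M_serial = [min(i, N) for i in range(N)]   # M_i = i
    print(ap_lower_bound(N, M_serial))         # 8 + (8+7+...+1) = 44.0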

We are now able to state a lower bound for binary addition and, more generally, for the computation of any function f such that any output bit z_j of f_0 depends on j input bits, for j = 1, ..., N. In the case of binary addition, bit z_j depends on bits x_1, ..., x_j. Other
applications include the computation of a chain of prefixes [6].

PROPOSITION 2: The area A, the period P, and the time T of any circuit performing the binary addition of two N-bit numbers must satisfy:

    A·P ≥ ρτN + (1/2) βτ N log_b(τN/T).    (6)

PROOF: Let us consider inequality (5). Rather than finding directly an upper bound on the quantities M_i given the sequence N_i, we will look at the dual problem: given N_0, N_1, ..., we want to find a lower bound on the time t'_j at which output bit z_j can be output. Consider an output bit z_j such that N_i < j ≤ N_{i+1}. Since bit z_j depends on j input bits, we deduce that

    t'_j ≥ t_{i+1} + τ log_b(j − N_i)

(recall that b is the maximum fan-in of a logic node). Hence, the global contribution of all bits z_j with N_i < j ≤ N_{i+1} satisfies

    Σ (t'_j − t_{i+1}) ≥ τ Σ log_b(j − N_i) ≥ (1/2) τ n_i log_b(n_i),

the sum being extended over the range N_i < j ≤ N_{i+1} (recall that n_i = N_{i+1} − N_i). By summing this last inequality for i = 0, 1, ..., k, the result follows from the convexity of the function x → x·log_b(x). □
As an immediate consequence of Proposition 2, we deduce (by looking separately at the cases T < log₂N and T ≥ log₂N) that

    A·P·T = Ω(N·log²N).

Since T ≥ P, we also have

    A·T² = Ω(N·log²N),

which reestablishes a result of Johnson's [5].
These last two bounds on APT and AT² are weaker forms of the result of Proposition 2. For example, the classical full adder, with A = O(1) and P = T = O(N), is optimal (up to some constant factor) with respect to inequality (6), while it is not optimal with respect to the APT and AT² measures. The same holds for the systolic adder built out of N linearly connected full-adder cells, with A = O(N), P = O(1), and T = O(N). The fast adder described by Brent and Kung [2] is of interest since it shows that the APT measure is indeed a tighter bound than the AT² measure: with A = O(N log N), P = O(1), and T = O(log N), we have APT = O(N log²N) while AT² = O(N log³N). A pipelined version of this fast adder, with A = O(N) and P = T = O(log N), is optimal with respect to both the APT and the AT² complexity measures.
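The comparison among the four adders can be made concrete with the following sketch, which evaluates the APT and AT² measures numerically; all constant factors are set to 1, an assumption the asymptotic statements leave open:

    import math

    def measures(A, P, T):
        return {"APT": A * P * T, "AT2": A * T * T}

    def compare(N):
        lg = math.log2(N)
        designs = {
            "full adder":       measures(1, N, N),
            "systolic adder":   measures(N, 1, N),
            "Brent-Kung adder": measures(N * lg, 1, lg),
            "pipelined B-K":    measures(N, lg, lg),
        }
        for name, m in designs.items():
            print(f"{name:17s} APT={m['APT']:14.0f}  AT2={m['AT2']:16.0f}")

    compare(1024)  # the pipelined design minimizes both measures (up to constants)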

5 - CONCLUDING REMARKS

We feel that the technique developed in this paper to study the complexity of VLSI circuits offers a useful complement to techniques developed earlier (e.g., [4, 9, 11]). For example, while results derived by Thompson [9] exclusively account for the space taken up by the wires (i.e., transmission of information), the results we derive exclusively
account for the logic required by a circuit (i.e., memorization of information). These two techniques seem to lead to optimal lower bounds for two distinct classes of problems: the class of quadratic problems, such as integer, polynomial, or matrix multiplication, cyclic shift, etc., for which AP² and AT² grow quadratically with the size of the output; and the class of sub-quadratic problems, such as binary addition, prefix computation, etc. Notice that the systolic adder, with A = O(N) and P = O(1), satisfies AP² = O(N).
Another complementary aspect of the two techniques is that, while AP² (or AT²) corresponds to a natural complexity measure with Thompson's technique, the results derived from Lemma 2 lead to lower bounds on the AP complexity measure. This measure is an estimate of the energy dissipated by the circuit in solving one instance of the problem.
In [3], Chazelle and Monier discuss a new computational model for VLSI circuits in which the time required to drive a signal through a wire is no longer a constant but is proportional to the length of the wire. A purely geometrical consequence of their model is that the time required to compute one output variable which depends on n input variables grows at least as τ√n for some time unit τ. This is to be contrasted with τ·log_b n in the model we have considered. With very few changes, the results of this paper can be applied to this new model as well. For example, we can easily deduce that any circuit performing the binary addition of two N-bit numbers must satisfy

A·P ≥ ρτN + (2/3) βτ N √(τN/T).


From this inequality we derive that, for 0 ≤ α ≤ 1/2, we have

    A·P·T^α = Ω(N^{1+α}).

REFERENCES

[1] Abelson, H. and P. Andreae. Information transfer and area-time tradeoffs for VLSI multiplication. CACM, Vol. 23, No. 1, 1980, pp. 20-23.
[2] Brent, R.P. and H.T. Kung. The chip complexity of binary arithmetic. 12th Annual ACM Symposium on Theory of Computing, 1980, pp. 190-200.
[3] Chazelle, B. and L. Monier. A model of computation for VLSI with related complexity results. 13th Annual ACM Symposium on Theory of Computing, 1981, pp. 318-325.
[4] Hong, J.-W. and H.T. Kung. I/O complexity: The red-blue pebble game. 13th Annual ACM Symposium on Theory of Computing, 1981, pp. 326-333.
[5] Johnson, R.B., Jr. The complexity of a VLSI adder. IPL, Vol. 11, No. 2, 1980, pp. 92-93.
[6] Ladner, R.E. and M.J. Fischer. Parallel prefix computation. International Conference on Parallel Processing, 1977, pp. 218-223.
[7] Mead, C.A. and L.A. Conway. Introduction to VLSI systems. Addison-Wesley, 1980.
[8] Savage, J.E. Area-time tradeoffs for matrix multiplication and related problems in VLSI models. Brown University, CS-50, 1979.
[9] Thompson, C.D. Area-time complexity for VLSI. 11th Annual ACM Symposium on Theory of Computing, 1979, pp. 81-88.
[10] Vuillemin, J.E. A combinatorial limit to the computing power of VLSI circuits. 21st Annual Symposium on Foundations of Computer Science, 1980, pp. 294-300.
[11] Yao, A.C. The entropic limitations on VLSI computations. 13th Annual ACM Symposium on Theory of Computing, 1981, pp. 308-311.
The VLSI Complexity of Sorting

C.D. Thompson
University of California at Berkeley
Division of Computer Science
Berkeley, California 94720

0. Abstract
The area-time complexity of sorting is analyzed under an updated model of VLSI computation. The new model has fewer restrictions on chip I/O than previous models. Also, the definitions of area and time performance have been adjusted to permit fair comparisons between pipelined and non-pipelined designs.
Using the new model, this paper briefly describes eleven different designs for VLSI sorters. These circuits demonstrate the existence of an area*time² tradeoff for the sorting problem. The smallest circuit is only large enough to store a few elements at a time; it is, of course, rather slow at sorting N elements. The largest design solves a sorting problem in only O(lg N) clock cycles. The area*time² performance figure for all but three of the designs is close to the limiting value, Ω(N²).

1. Introduction
Sorting has attracted a great deal of attention over the past few decades of
computer science research. It is easy to see why: sorting is a theoretically
interesting problem with a great deal of practical significance. As many as a
quarter of the world's computing cycles were once devoted to sorting [Knu 73, p.3].
This is probably no longer the case, given the large number of microprocessors
running dedicated control tasks. Nonetheless, sorting and other information-
shuffling techniques are of great importance in the rapidly growing database
industry.
The sorting problem can be defined as the rearrangement of N input values so
that they are in ascending order. This paper examines the complexity of the
sorting problem, assuming it is to be solved on a VLSI chip. Much is already known
about sorting on other types of computational structures [Knu 73, pp. 1-388], and
much of this knowledge is applicable to VLSI sorting. However, VLSI is a novel
computing medium in at least one respect: the size of a circuit is determined as
much by its inter-gate wiring as by its gates themselves. This technological novelty
makes it necessary to re-evaluate sorting circuits and algorithms in the context of
a "VLSI model of computation."
Using a VLSI model, it is possible to demonstrate the existence of an
area*time² tradeoff for sorting circuits. A preliminary study of this tradeoff is contained in the author's Ph.D. dissertation [Tho 80a], in which two sorting circuits

This work was supported in part by the National Science Foundation under Grant ECS-8110684 and by the U.S. Army Research Office under Grant DAAG29-78-G-0167.

were analyzed. This paper analyzes nine additional designs under an updated model of VLSI computation. The updated model has the advantage of allowing fair comparisons between pipelined and non-pipelined designs.
None of the sorting circuits in this paper is new, since all are based on commonly-known serial algorithms. Still, this paper is the first to lay out these circuits for VLSI and to analyze their area*time² performances. Eight of the sorters will be seen to have an AT² performance in the range O(N² lg² N) to O(N² lg⁵ N). Since it is impossible for any design to have an AT² product of less than Ω(N²) [Vui 80], these designs are area- and time-optimal to within logarithmic factors.
A number of different models for VLSI have been proposed in the past few years [B&K 81, C&M 81, K&Z 81, Tho 80a, Tho 80b, Vui 80]. They differ chiefly in their treatment of chip I/O, placing various restrictions on the way in which a chip accesses its input. Typically, each input value must enter the chip at only one place [Tho 80a] or at only one time and place [B&K 81]. Savage [Sav 81] has characterized these as the "semelocal" and "semelective" assumptions, respectively.
The model of this paper builds on its predecessors, removing as many
restrictions on chip I/O as possible. Following Kedem and Zorat, the semelocal
assumption is relaxed by allowing a chip to access each input value from several
different I/O memories. The intent is to allow redundant input codes: if each input
bit appears in k places, a chip's area*time² performance may be enhanced by a factor of k² [K&Z 81].
Additionally, the new model is not semelective, for it allows multiple accesses
to problem inputs, outputs, and intermediate results. Here, the intent is to model
the use of off-chip RAM storage; the area of the RAM is not included in the total
area of the sorting circuit. This omission clarifies the area*time² tradeoff for
sorting circuits, since RAM area is involved in an entirely different form of tradeoff.
(The recent work of Hong and Kung [H&K 81] indicates that a (time × lg space) tradeoff may describe how local memory affects the speed of a sorting circuit with fixed I/O bandwidth.)
of sublinear size circuits. It also makes the model's area measure more sensitive
to the power consumption of a circuit, since memory cells have a low duty cycle
and generally consume much less power per unit area than a "processing" circuit.
Other authors have used non-semelective models, although none has
elaborated quite so much on the idea. Lipton and Sedgewick [L&S 81] point out
that the "standard" AT2 lower bound proofs do not depend on semelective
assumptions. Hong [Hon 81] defines a non-semelective model of VLSI with a space-
time behavior which is polynomiaUy equivalent to that of eleven other models of
computation. His equivalence proofs depend upon the fact that VLSI wiring rules
can cause at most a quadratic increase in the size of a zero-width-wire circuit.
Unfortunately, Hong's transformation does not necessarily generate optimal VLSI
circuits from optimal zero-width-wire circuits, since a quadratic factor cannot be
ignored when "easy" functions like sorting are being studied.
Lipton and Sedgewick [L&S 81] point out another form of input restriction, one
that is not removed in this paper's model. Inputs and outputs must be assigned to
fixed I/O ports at the time the circuit is designed, and these assignments must not
depend upon the problem input values. Thus this paper's model is "where-
oblivious." It is an interesting theoretical question to ask what functions become
easier to compute when the "where-oblivious" restriction is removed. Certainly,
shifting is such a function; sorting may be another, although this seems unlikely.
In any event, there are practical reasons for requiring VLSI circuits to be where-
oblivious. Otherwise, permutation networks would have to be used to connect one
VLSI circuit to another!
The catalog of input restrictions is not yet complete. In both Vuillemin's [Vui 80] and Thompson's [Tho 80b] models of pipelined VLSI computation, analogous inputs and outputs for different problems must be accessed through identical I/O ports. For example, input #1 of problem #2 must enter the chip at the same place as input #1 of problem #1. While this seems to be a natural assumption for a pipelined chip, it leads to a number of misleading conclusions about the optimality of highly-concurrent designs. For instance, the highly parallelized bubble sort design of Section 3.10 is nearly area*time² optimal under the old models, but it is significantly suboptimal under the model of this paper.
When the restriction on pipelined chip inputs is removed, it becomes impossible to prove an Ω(N²) lower bound on AT² performance until the definitions of area and time are adjusted. However, no change needs to be made in the definitions of area and time performance for non-concurrent designs.
In the new model, the area performance of a design is its "area per problem,"
equal to its total area divided by its degree of concurrency. Thus it does not
matter how many copies of a chip are being considered as a single design: doubling
the number of chips doubles both its concurrency and its total area, leaving its
area performance invariant. The old definition of area performance was the total
area of a design, with no correction factor for its concurrency.
The time performance of a design is newly defined as the delay between the
presentation of one set of problem inputs and the production of the outputs for
that problem. The old definition of time performance was the rate at which a
design accepted input bits. It is easy to see that duplicating a design doubles its time performance under the old definition, but leaves it invariant under the new definition.
The old and new definitions of area and time can be contrasted by analyzing the combined sorting performance of N independent serial processors. As will be shown in Section 3.1, each one of these processors has an area of O(lg N) and each can solve one sorting problem every O(N lg² N) time units. A collection of N processors would thus consume input data at the rate of one bit every O(lg N) time units. Their total area is O(N lg N), yielding an "impossibly good" area*time² performance of O(N lg³ N) under the old definitions of area and time. Under the new definitions, the total area per problem is just O(lg N) and the solution delay is O(N lg² N), so the AT² performance is O(N² lg⁵ N).
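The following sketch redoes the bookkeeping of this example numerically; the constant factors (set to 1) and the helper name `measures` are assumptions for illustration only:

    import math

    def measures(N):
        lg = math.log2(N)
        area_per_processor = lg        # O(lg N) uniprocessor area
        delay = N * lg * lg            # O(N lg^2 N) to sort one problem
        # Old definitions: total area, time = one input bit per lg N steps.
        old_AT2 = (N * area_per_processor) * lg**2   # N lg^3 N
        # New definitions: area per problem, time = the solution delay.
        new_AT2 = area_per_processor * delay**2      # N^2 lg^5 N
        return old_AT2, new_AT2

    print(measures(1 << 16))  # the "impossibly good" old figure vs. the new one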
This paper is organized in the following fashion: Section 2 describes the new VLSI model of computation in full detail; Section 3 sketches eleven different designs for VLSI sorters, analyzing the area-time performance of each; Section 4 compares the performances of each of the designs, with some discussion of the "constant factors" ignored by the asymptotic model; and Section 5 concludes the paper with a discussion of some of the currently open issues in VLSI complexity theory.

2. Model of VLSI Computation
In all theoretical models of VLSI, circuits are made of two basic components: wires and gates. A gate is a localized set of transistors, or other switching elements, which performs a simple logical function. For example, a gate may be a "j-k flip-flop" or a "three input nand." Wires serve to carry signals from the output of one gate to the input of another.
Two parameters of a VLSI circuit are of vital importance, its size and its speed.
Since VLSI is essentially two-dimensional, the size of a circuit is best expressed in
terms of its area. Sufficient area must be provided in a circuit layout for each gate
and each wire. Gates are not allowed to overlap each other at all, and only two (or perhaps three) wires can pass over the same point.
A convenient unit of area is the square of the minimum separation between parallel wires. In the terminology of [M&C 80], this paper's unit of area is equal to (4λ)², where λ is a constant determined by the processing technology. Each unit of area thus contains one, two, or three overlapping wires; or else it contains a fraction of a gate. The actual size of this area unit becomes smaller as technology improves. In 1978, it was typically 150 µm² = 1.5×10⁻⁶ cm²; eventually, it may be as small as .4 µm² [M&C 80, p. 35].
The speed of a synchronous VLSI circuit can be measured by the number of clock pulses it takes to complete its computation. Once again, the actual size of this time unit is a technological variable. In 1978, a typical MOS clock period was 30 to 50 ns; and this may decrease to as little as 2 to 4 ns [M&C 80]. For the superconducting technology of Josephson junctions, a clock period of 1 to 3 ns is achievable today, using a process for which the area unit is 25 µm² [Ket 80].
The speed of a VLSI circuit may be adversely affected by the presence of very long wires, unless special measures are taken. In many MOS processes, a minimum-sized transistor cannot send a signal from one end of the chip to the other in one clock period. To accomplish such cross-chip communication, special "driver" circuits are employed. These drivers amplify the current of the signal; O(lg k) stages of amplification are required to drive a length-k wire [M&C 80, p. 14]. The use of these driver circuits is reflected in the VLSI model's "logarithmic delay" assumption, that a length-k wire has O(lg k) delay. Each stage of a driver's amplifier chain is individually clocked, so that the driver behaves like an O(lg k)-bit shift register. Note that this design for long-wire drivers achieves unit bandwidth. Every wire, even the longest one, has a throughput of one bit per time unit.
The logarithmic delay assumption is used here because it leads to realistic
circuit designs and time bounds. As it turns out, the time bounds obtained for VLSI
sorting under this assumption are no different from the ones that would be
obtained under a "unit-delay" assumption (in which each gate is able to transmit
its output all the way across the circuit, in one clock period). In the circuits of
Section 3, the delay of the drivers is overlapped with the delays of comparison
operations. The sole effect of the logarithmic delay assumption is thus to ensure
that the VLSI designer strives for such an overlap.
It may be argued that the logarithmic delay assumption is too severe or too
lenient, depending on the technology. The former is currently the case in the I²L [Eva 79] and Josephson junction processes. As of now, both are really unit-delay
technologies; no drivers are needed for cross-chip communication. However, the
results of this paper still apply if the drivers are omitted from the circuit
constructions of Section 3.
It seems unlikely that synchronous MOS circuits will ever violate the logarithmic delay assumption. Seitz [Sei 79] projects a signal transmission velocity of (1 cm)/(3 ns) in a fully-developed MOS technology. This means that a cross-chip communication will only take a few clock periods, even if the "chip" is as large as a present-day "wafer." In other words, the time performance of the fully-developed MOS technology is only slightly overestimated by the logarithmic delay
assumption -- the true delay would best be modeled as logarithmic plus a small constant. Modelling delay as a linear function of distance, as suggested by Chazelle and Monier [C&M 81], would greatly exaggerate the importance of delay in the determination of the speed of such circuits.
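A hedged sketch of the three delay assumptions discussed here, with the time unit τ and the fitted constants as placeholders rather than measured values:

    import math

    def wire_delay(k, model, tau=1.0):
        """Delay of a length-k wire under each modelling assumption."""
        if model == "unit":
            return tau                            # whole-chip reach in one period
        if model == "log":
            return tau * max(1.0, math.log2(k))   # driver with lg k stages
        if model == "linear":
            return tau * k                        # Chazelle-Monier [C&M 81]
        raise ValueError(model)

    for k in (4, 64, 1024):
        print(k, [round(wire_delay(k, m), 1) for m in ("unit", "log", "linear")])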
If circuits ever become much faster or much larger than envisioned today, the
logarithmic delay assumption may become invalid. As a case in point, consider the
Josephson junction circuit assemblies currently built by IBM. They are 10 cm on a side, and they run on a 1 to 3 ns clock [Ket 80]. The wires in these circuits are
superconductors, but of course they cannot transmit information at a velocity
greater than (a fraction of) the speed of light. Right now, the clock frequency and
circuit dimensions are just small enough to allow a signal to propagate from one
side of the circuit to the other in one clock period. Any increase in either speed or
size would make this impossible. The computational limitations of such enhanced
(and hypothetical) technologies could be analyzed under Chazelle and Monier's
linear delay assumption.
Before leaving the subject of wire delay, it should be noted that the model of
this paper makes provision for the "self-timed" regime predicted by [Sei 79]. It
may eventually become very difficult to guarantee that all portions of a VLSI
circuit get a clock signal with the correct frequency and/or phase. Fortunately, it
is feasible to have the long-wire drivers include timing information with the data
being transmitted, so that special "receiver" circuits can resynchronize the data
with respect to the local version of the clock. Also, single-stage, unit-delay "repeater" circuits can be used to avoid a driver delay at each vertex in the mesh-type connection pattern of Section 3.7.
Thus far in the discussion, only "standard features" have been introduced to
the VLSI model. The interested reader is referred to [Tho 80a] for more details on
the practical significance of the model, and to [Sav 79] for an excellent
introduction to the theoretical aspects of VLSI modelling.
A major distinction between the model of this paper and most previous VLSI models is the way in which it treats "I/O memory." Here, only a small area charge is made for the memory used to store problem inputs and outputs, even if this memory is also used for the storage of intermediate results.
In the new model, each input and output bit is assigned a place in a k-bit "I/O memory" attached to one or more "I/O ports." Two types of access to the I/O memory are distinguished. If the bits are accessed in a fixed order, the I/O memory is organized as a shift register and accessed in O(1) time. If the access pattern is more complex, a random access memory (RAM) is used. Such a memory has an access time of O(lg k) [M&C 80, p. 321]. The random access time covers both the internal delays of the memory circuit as well as the time it takes the I/O port to transmit (serially) ⌈lg k⌉ address bits to the RAM. Of course, many other organizations could have been assumed for the I/O ports. This paper's bit-serial interface seems to be the simplest one that allows optimal area-time results.
Allowing more than one I/O port to connect to a single I/O memory makes it
easy to model the use of multiport memory chips. However, a few restrictions
must be placed on their usage, to remove the (theoretical) temptation to reduce
on-chip wiring at the expense of increasing printed-circuit board wiring. All I/O ports connecting to a single memory must be physically adjacent to each other in the chip layout, to avoid any possibility of "rats-nest" wiring to the memory chips.
This assumption allows the area*time² lower bound proofs to proceed without
difficulty, since all cross-chip communication must use on-chip wires. (Note that a
two-port memory provides a communication channel between its two I/O ports.)
The model makes as few assumptions as possible about the actual location of
the I/O memory circuitry, even though this can have a large effect on system
timing. If the memory is placed on a different chip from the processing circuitry,
its access time is considerably increased. Fortunately, this will not always
invalidate the model's timing assumptions. The O(lg k) delay already assumed for
a k-bit RAM will dominate the delay of an off-chip driver, if k is large enough.
Alternatively, if k is small, it should be relatively easy to locate the RAM on the
processor chip. As for off-chip "shift register" I/O memories, there should be no
particular difficulty in implementing these in such a way that one input or output
event can happen every 0(1) time units.
As indicated above, time charges for off-chip I/O are problematical and may
be underestimated in the current model. Area charges for I/O are also
troublesome. Here, I/O ports are assumed to have O(1) area even though they are obviously much larger than a unit-area wire crossing or an O(1)-area gate. It is also assumed that a design can have an unlimited number of I/O ports. In reality, chips are limited to one or two hundred pins, and each pin should be considered a major expense (in terms of manufacturing, reliability, and circuit board wiring costs). An attempt is made in Section 4 to use more realistic estimates of I/O costs when evaluating Section 3's constructions.
The complete model of VLSI computation is summarized in the following list of
assumptions.
Assumption 1: Embedding.

a. Wires are one unit wide.


b. Two wires may cross over each other at right angles (in one unit square).
c. A logic node occupies O(1) area. It has O(1) input wires and O(1) output wires, none of which are more than O(1) units long.
d. Each logic node belongs to a self-timed region. All wires connecting to a logic node lie entirely within its self-timed region.
e. A self-timed region may be as much as O(lg N) units wide or long.
f. A driver node of O(k) area has an output wire that is k units long. Its input wire is O(1) units long. The output wire may pass through any number of self-timed regions before it connects to the input of a repeater or receiver node.
g. A receiver node occupies O(1) area. Its output wire is O(1) units long. Its input wire may be of any length.
h. A repeater node of O(k) area has one output wire that is k units long. Its other output wire is only O(1) units long and must be connected to a receiver node. Its input wire is at least Ω(k) units long.
i. An I/O memory and its associated I/O ports occupy O(1) area. Each I/O port has one input wire and one output wire, each of O(1) length.
Assumption 2: Problem definition.

a. A chip has degree of concurrency p if it solves p problem instances simultaneously.
b. Each of the N input variables in a problem instance takes on one of M different values with equal likelihood.
c. M = N^{1+ε}, for some fixed ε > 0. Furthermore, a nearly non-redundant code must be used to represent input and output values as O(lg N)-bit words. (This assumption makes it possible to express area and time bounds in terms of N alone.)
d. The output values of a problem instance are a permutation of its input values into increasing order.
Assumption 3: Timing.
a. Wires have unit bandwidth. They carry at most one bit of information in a unit of time.
b. Logic nodes, repeater nodes, and receiver nodes have O(1) delay.
c. The driver node for a wire of length k has O(lg k) delay.
Assumption 4: Transmission functions.
a. A deterministic finite-state automaton (FSA) is associated with each node.
The "state" of a node is a bit vector encoding the current state of its FSA.
There is a fixed mapping between the (single-bit) signals appearing on the
input and output wires of a node, and the inputs and outputs of its FSA.
b. The state of a node is changed every time unit, i.e. its FSA undergoes one state transition per time unit.
c. Logic nodes, repeater nodes, and receiver nodes are limited to O(1) bits of state.
d. Driver nodes have O(lg k) bits of state, one bit for each stage in their amplification chain.
e. The state vector of a "k-bit" I/O memory contains one bit for each of its assigned problem input and output bits. The assignment of problem bits to memories is one-to-one and is not data-dependent.
f. There are O(lg k) bits in the state vector of each I/O port attached to a k-bit memory. These state vectors are used to address specific memory bits, as explained in Assumptions 4g and 4h. Two different ports may not access the same bit simultaneously.
g. "RAM-type" k -bit I/O ports run a memory cycle every O(lg k) time units.
During the first 19 k time units of a cycle, the port receives a bit-serial
address on its input wire. The next input signal is interpreted as a read/write
indicator. If a write cycle is indicated, the following input signal is written into
the addressed bit. During the last time unit of a memory cycle, the value of
the addressed bit is available on the I/O port's output wire.
h. "Shift-register-type" I/O ports run a memory cycle every 0(1) time units.
During the first time unit of a cycle, the value of the currently-addressed data
bit is available on the port's output wire. In the last time unit of a cycle, the
signal appearing on the port's input wire is written into this data bit, then the
port's address register is incremented (mOd k).
Assumption 5: Area, time performance.
a. The total area of a chip is the number of unit squares in the smallest enclosing rectangle.
b. The area performance A of a chip is its total area divided by its degree of concurrency p. See Assumption 2a.
c. The time performance T of a chip is the maximum number of time units it takes to solve any one of its p problem instances.
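The following is a minimal, illustrative simulation of the two port disciplines of Assumptions 4g and 4h; the class names and the bit-serial framing details are assumptions for the sketch, not circuitry from the paper:

    import math

    class ShiftRegisterPort:
        """Assumption 4h: one bit read (and optionally rewritten) per
        O(1)-time cycle; the address auto-increments mod k."""
        def __init__(self, bits):
            self.bits, self.addr = list(bits), 0

        def cycle(self, write_bit=None):
            out = self.bits[self.addr]
            if write_bit is not None:
                self.bits[self.addr] = write_bit
            self.addr = (self.addr + 1) % len(self.bits)
            return out

    class RamPort:
        """Assumption 4g: a cycle takes O(lg k) time units -- lg k address
        bits arrive serially, then a read/write indicator, then the data."""
        def __init__(self, bits):
            self.bits = list(bits)

        def cycle(self, addr_bits, write, data_bit=None):
            assert len(addr_bits) == math.ceil(math.log2(len(self.bits)))
            addr = int("".join(map(str, addr_bits)), 2)
            if write:
                self.bits[addr] = data_bit
            return self.bits[addr]   # available in the cycle's last time unit

    sr = ShiftRegisterPort([1, 0, 1])
    print(sr.cycle(), sr.cycle())            # 1 0
    mem = RamPort([0] * 8)
    mem.cycle([0, 1, 1], write=True, data_bit=1)
    print(mem.cycle([0, 1, 1], write=False))  # 1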

3. Circuit Constructions
(This section has been omitted, due to space considerations. A complete
version of this paper is available from the author.)

4. Comparison of the Designs


The area and time performance of the eleven sorting circuits is summarized in Table 1, below. It is easy to see that most of the designs are within a factor of O(lg^k N) of being optimal in an area*time² sense. This indicates that the lower bound of Ω(N²) is nearly tight over a wide range of circuit speeds and sizes.
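Most of the designs compared below are based on Batcher's bitonic sort. For reference, here is a standard serial sketch of the algorithm (the textbook recursion, not any of the omitted Section 3 constructions); it assumes the input length is a power of two:

    def bitonic_sort(a, ascending=True):
        if len(a) <= 1:
            return a
        half = len(a) // 2
        first = bitonic_sort(a[:half], True)     # ascending half
        second = bitonic_sort(a[half:], False)   # descending half -> bitonic
        return bitonic_merge(first + second, ascending)

    def bitonic_merge(a, ascending):
        if len(a) <= 1:
            return a
        half = len(a) // 2
        for i in range(half):                    # one comparator stage
            if (a[i] > a[i + half]) == ascending:
                a[i], a[i + half] = a[i + half], a[i]
        return (bitonic_merge(a[:half], ascending) +
                bitonic_merge(a[half:], ascending))

    print(bitonic_sort([7, 3, 0, 5, 6, 2, 4, 1]))  # [0, 1, 2, 3, 4, 5, 6, 7]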
Of course, a sorting circuit should not be selected just because it is
asymptotically optimal. A circuit designer is interested only in actual speeds and
sizes. Unfortunately, the model of computation of this paper is not exact enough
to permit such comparisons. It is possible, however, to make some statements
about the relative sizes and speeds of the designs.
The smallest design is clearly the O(lg N) area uniprocessor. Somewhat surprisingly, this design is nearly area*time² optimal if it is programmed to use any of the O(N lg N)-step serial algorithms.
If more sorting speed is desired, the (lg N)-processor heapsort design becomes attractive. It requires almost exactly lg N times as much area as the
Design                    Area          Time      Area*Time²

(Lower bound)             --            --        Ω(N²)
Uniprocessor              lg N          N lg²N    N² lg⁵N
lg N - PE heapsort        lg²N          N lg N    N² lg⁴N
lg N - PE bitonic         lg²N          N lg N    N² lg⁴N
lg²N - PE bitonic         lg³N          N         N² lg³N
√(N lg N) - PE bitonic    √(N lg³N)     N         N^{5/2} lg^{3/2}N
N/2 - PE bubble           N lg N        N         N³ lg N
N - PE bitonic, mesh      N lg²N        √N        N² lg²N
N - PE bitonic, S-E       N²/lg²N       lg³N      N² lg⁴N
N - PE bitonic, CCC       N²/lg²N       lg³N      N² lg⁴N
N lg²N - PE bitonic       N²/lg²N       lg³N      N² lg⁴N
N² - PE bubble            N lg N        N         N³ lg N

Table 1: Area-time bounds for the sorting problem.

Design                    Total Area    I/O Bandwidth

Uniprocessor              lg N          1
lg N - PE heapsort        lg²N          lg N
lg N - PE bitonic         lg²N          lg²N
lg²N - PE bitonic         lg³N          lg³N
√(N lg N) - PE bitonic    N lg N        lg N
N/2 - PE bubble           N lg N        lg N
N - PE bitonic, mesh      N lg²N        √N lg N
N - PE bitonic, S-E       N²/lg²N       N/lg²N
N - PE bitonic, CCC       N²/lg²N       N/lg²N
N lg²N - PE bitonic       N²            N
N² - PE bubble            N²            N

Table 2: Other performance measures.

uniprocessor design, since the two designs make very similar demands on their processors. A drawback of the multiprocessor design is that it requires lg N independently addressable memories, one for each processor. The total memory-processor bandwidth increases proportionately (see Table 2) to lg N bits per time unit.
The (lg N)-processor bitonic design has about the same area and time performance as the (lg N)-processor heapsort design. The former has the advantage of a slightly simpler control algorithm, and uses the simpler shift-register type of I/O memory; the latter uses a more efficient sorting algorithm and hence less memory bandwidth.
The (lg²N)-processor bitonic sorter is smaller than either of the (lg N)-processor designs, for moderately sized N. Its control algorithm is extremely simple, so that a "processor" is not much more than a comparison-exchange module. Its major drawback is that it makes continuous use of (1/2)(lg N)(lg N − 1) word-parallel shift-register memories, of various sizes.
The ("'N 19 N )-processor bitonic sorter has been entered in Table 2 with a
total area of O(N 19 N), so that there is room on the chip for all of its temporary
storage registers. Otherwise, it would require '"N 19 N separate I/O memories. It
has the same speed and a somewhat beller I/O bandwidth than the (lg2N)-
processor bitonic sorter just discussed. However, the laller's shift registers could
also be placed on the same chip as its processing circuitry, removing its I/O
bandwidth disadvantage. When "constant factors" are taken into consideration, the (√(N lg N))-processor design is clearly much larger than the (lg²N)-processor design, because it has more processors and a much more complicated control algorithm.
The (N/2)-processor bubble sorter has a couple of significant advantages that are not revealed in either Table 1 or Table 2. Its comparators need very little in the way of control hardware, so that at least for small N, it occupies less area than any of the preceding designs. Also, it can be used as a "self-sorting memory," performing insertions and deletions on-line. (The uniprocessor and the (lg N)-processor heapsorter can also be used in this fashion.) However, for even moderately-sized N, the bubble sorter's horrible area*time² performance becomes noticeable. For example, when N = 256, the (lg²N)-processor's 36 comparators and 491 words of storage probably occupy less room than the 128 comparators in a bubble sorter. Still, the bubble sorter maintains about a 2:1 delay advantage over the (lg²N)-processor bitonic sorter, when similar comparators are used.
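A quick check of the N = 256 figures above (hedged: the comparator-count formula (1/2)·lg N·(lg N + 1) is inferred here from the stated figure of 36, and is an assumption rather than a formula given in the text):

    import math

    N = 256
    lg = int(math.log2(N))                       # 8
    bitonic_comparators = lg * (lg + 1) // 2     # 36 for N = 256
    bubble_comparators = N // 2                  # 128 for N = 256
    print(bitonic_comparators, bubble_comparators)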
The N-processor mesh-type bitonic sorter is the first design to solve a sorting
problem in sublinear time. Unfortunately, it occupies a lot of area. Each of its
processors must run a complicated sorting algorithm, reshuffling the data among
themselves after every comparison-exchange operation. Its I/O bandwidth must
also be large, since it solves sorting problems so rapidly. However, constant factor
improvements may be made to its area and bandwidth figures, by reprogramming
the processors so that each handles 0(1) data elements at a time. Also, large area
and bandwidth are not always significant problems: in an existing mesh-connected
multiprocessor, the N processors are already in place and the I/o data may be
produced and consumed by local application routines.
The next three designs in Tables 1 and 2 are variants on a fully-parallelized
bitonic sort. The shuffle-exchange processor has a slight area advantage over the
CCC processor. because of its simpler control algorithm. However, the eee is a
somewhat more regular interconnection pattern, so that it may be easier to wire
up in practice. Both designs are smaller in total asymptotic area than the
{N 192N)-processor bitonic sorter, which solves 192N problems at a time.
Nonetheless, the control structure of this last design is so simple that it probably
takes less area than the others for all N < 220. Of course, if a shuffle-exchange or a
CCC processor has already been built, the additional area cost for programming
the sorting algorithm is very small.
There seems to be little to recommend the final design, the N²-processor bubble sorter. It has the same I/O bandwidth, a bit more total area, and a much worse time performance than the (N lg²N)-processor bitonic sorter.

5. Closing Remarks
At the time of this writing, there are a number of important open questions in
VLSI complexity theory. A simply stated but seemingly perplexing problem is to find out how much area can be saved when additional "layers" of wiring are made available by technological advances. It is known that a k-level embedding can be no smaller than 1/k² of the area of a two-level embedding [Tho 80a, pp. 36-38], but it is not known whether this bound is achievable. (When k grows as the square root of the two-level embedding's area, it seems that the 1/k² bound is tight.)
A second problem is to derive a better lower bound for the area*time² complexity of the sorting problem. The original proof that AT² = Ω(N² lg² N) [Tho 80a] applies only to sorting circuits that read entire words of input through their I/O ports. When input words can be broken down into bits, the largest lower bound known is Ω(N²) [Vui 80]. I can see how to prove an Ω(N² lg N) bound for the case of unrestricted inputs, but I know of no proof of Ω(N² lg² N). Of course, it is possible that it is the upper bound that is too weak.
Another set of problems is opened up by the fact that the area*time² bounds are affected greatly by nondeterministic, stochastic, or probabilistic assumptions in the model. For example, equality testing is very easy if one only requires that the answer be "probably" correct [Yao 79, L&S 81].
A final and very important problem in VLSI theory is the development of a
stable model. Currently there are almost as many models as papers. If this trend
continues, results in the area will become difficult to report and describe.
However, it is far from settled whether wire delays should be treated as being
linear or logarithmic in wire length, and the costs of off-chip communication
remain unknown.
Despite the open problems noted above, VLSI theory has been fairly successful
in obtaining matching upper and lower bounds for the computation of such
functions as Fourier transformation, matrix multiplication, integer multiplication,
integer addition, and sorting [A&A BO, B&K 79, B&K Bl, Joh BO, P&V 79, P&V BO, Sav
80, Tho BOa, Tho BOb, Vui 80]. It has led to increased understanding and new
models for the embedding of graphs in the plane [Lei BO, KLLM B1]. The area*time2
performance metric has been shown to be applicable over a wide range of circuit
sizes and speeds, indicating that it is a fundamental type of space-time tradeoff.

References
[A&A 80] H. Abelson and P. Andreae, "Information Transfer and Area-Time Tradeoffs for VLSI Multiplication," CACM Vol. 23, No. 1, pp. 20-23, January 1980.
[Arm 78] Philip K. Armstrong, U.S. Patent 4131947, issued December 26, 1978.
[B&K 79] R. P. Brent and H. T. Kung, "A Regular Layout for Parallel Adders," CMU-CS-79-131, Carnegie-Mellon Computer Science Dept., June 1979. To appear in IEEE-TC.
[B&K 81] R. P. Brent and H. T. Kung, "The Area-Time Complexity of Binary Multiplication," JACM Vol. 28, No. 3, pp. 521-534, July 1981.
[C&M 81] B. Chazelle and L. Monier, "Towards More Realistic Models of Computation for VLSI," Proc. 13th Annual ACM Symp. on Theory of Computing, pp. 318-325, May 1981.
[CLW 80] Kin-Man Chung, Fabrizio Luccio, and C. K. Wong, "On the Complexity of Sorting in Magnetic Bubble Memory Systems," IEEE-TC Vol. C-29, No. 7, pp. 553-562, July 1980.
[Des 80] A. Despain, "Very Fast Fourier Transform Algorithms for Hardware Implementation," IEEE-TC Vol. C-28, No. 5, pp. 333-341, May 1979.
[Eva 79] S. A. Evans, "Scaling I²L for VLSI," IEEE Journal of Solid-State Circuits, Vol. SC-14, No. 2, pp. 318-326, April 1979.
[Hon 81] J-W Hong, "On Similarity and Duality of Computation," unpublished manuscript, Peking Municipal Computing Center, China.
[H&K 81] J-W Hong and H. T. Kung, "I/O Complexity: The Red-Blue Pebble Game," Proc. 13th Annual ACM Symp. on Theory of Computing, pp. 326-333, May 1981.
[Joh 80] R. B. Johnson, "The Complexity of a VLSI Adder," Info. Proc. Letters, Vol. 11, No. 2, pp. 92-93, October 1980.
[Ket 80] M. B. Ketchen, "AC Powered Josephson Miniature System," 1980 Int'l Conf. on Circuits and Computers, IEEE Computer Society, pp. 874-877, October 1980.
[KLLM 81] D. Kleitman, F. T. Leighton, M. Lepley, and G. L. Miller, "New Layouts for the Shuffle-Exchange Graph," Extended Abstract, MIT Applied Mathematics Dept., 1981.
[Knu 73] D. E. Knuth, The Art of Computer Programming, Vol. 3: Sorting and Searching, Addison-Wesley, 1973.
[K&Z 81] Zvi M. Kedem and Alessandro Zorat, "Replication of Inputs May Save Computational Resources in VLSI," Proc. 22nd Symp. on the Foundations of Computer Science, IEEE Computer Society, October 1981.
[Lei 80] C. E. Leiserson, "Area-Efficient Graph Layouts (for VLSI)," Proc. 21st Symp. on the Foundations of Computer Science, IEEE Computer Society, October 1980.
[L&S 81] Richard J. Lipton and Robert Sedgewick, "Lower Bounds for VLSI," Proc. 13th Annual ACM Symp. on Theory of Computing, pp. 300-307, May 1981.
[M&C 80] C. Mead and L. Conway, Introduction to VLSI Systems, Addison-Wesley, 1980.
[Mor 79] Hans P. Moravec, "Fully Interconnecting Multiple Computers with Pipelined Sorting Nets," IEEE-TC Vol. C-28, No. 10, pp. 795-798, October 1979.
[P&V 79] F. Preparata and J. Vuillemin, "The Cube-Connected Cycles: A Versatile Network for Parallel Computation," 20th Annual Symp. on Foundations of Computer Science, IEEE Computer Society, pp. 140-147, October 1979.
[P&V 80] F. Preparata and J. Vuillemin, "Area-Time Optimal VLSI Networks for Multiplying Matrices," Info. Proc. Letters, Vol. 11, No. 2, pp. 77-80, October 1980.
[Sav 79] J. Savage, "Area-Time Tradeoffs for Matrix Multiplication and Related Problems in VLSI Models," TR-CS-50, Brown University Dept. of Computer Science, August 1979.
[Sav 81] J. Savage, "Planar Circuit Complexity and the Performance of VLSI Algorithms," TR-CS-68, Brown University Dept. of Computer Science, July 1981.
[Sei 79] C. L. Seitz, "Self-timed VLSI Systems," Proc. Caltech Conf. on VLSI, Caltech Computer Science Dept., pp. 345-356, January 1979.
[Sto 71] H. Stone, "Parallel Processing with the Perfect Shuffle," IEEE-TC Vol. C-20, No. 2, pp. 153-161, February 1971.
[Tho 80a] C. D. Thompson, A Complexity Theory for VLSI, Ph.D. Thesis, Carnegie-Mellon Computer Science Dept., August 1980.
[Tho 80b] C. D. Thompson, "Fourier Transforms in VLSI," UCB/ERL M80/51, October 1980.
[Vui 80] J. Vuillemin, "A Combinatorial Limit to the Computing Power of VLSI Circuits," Proc. 21st Symp. on the Foundations of Computer Science, IEEE Computer Society, pp. 294-300, October 1980.
[Yao 79] A. C. Yao, "Some Complexity Questions Related to Distributive Computing," Proc. 11th Annual ACM Symp. on Theory of Computing, pp. 209-213, May 1979.
[Yao 81] A. C. Yao, "The Entropic Limits of VLSI Computations," Proc. 13th Annual ACM Symp. on Theory of Computing, pp. 308-311, May 1981.
Minimum Edge Length Planar Embeddings
of Trees

Walter L. Ruzzo* Lawrence Snyder* *


University of Washington Purdue University
Seattle, Washington 98195 West Lafayette, Indiana 47907

INTRODUCTION

Valiant [1] showed how to embed any binary tree into the plane in linear area without crossovers. The edges in this embedding have a maximum length of O(√n). With Paterson, we [2] showed that a complete binary tree can be embedded in the plane with maximum edge length of O(√n/log n), and we argued the importance of short edge length for VLSI design and layout. Here we show that every binary tree can be embedded in the plane with all three properties: linear area, no crossovers, and O(√n/log n) maximum edge length. This improves a result of Bhatt and Leiserson [3] -- a graph with an n^{1/2−ε} separator theorem can be embedded (perhaps with crossovers) in linear area and with a maximum edge length of O(√n/log n) -- for the case of binary trees. In the paper we also observe that Valiant's result can be extended to the case of oriented trees [7].
These bounds on edge length are the best possible in the sense that there are graphs in the families requiring edges of length Ω(√n/log n) in any planar embedding [2]. Edge length is an important quantity because it corresponds to wire length in VLSI circuits. Since the delay in charging or discharging a wire is related to its length, long wires can significantly influence performance [8]. For example, families of pipelined circuits in which the maximum length wire is longer for circuits solving larger problems will have the propagation delay of the longest wire entering as a multiplicative factor in the timing complexity of the circuit family. Thus, it is crucial to know how long wires will be in a layout. Our results provide this information for trees which must be embedded without crossovers.
The added condition of crossover-freedom is of theoretical interest since it can be had without incurring an area or maximum edge length penalty. This fact should be compared with Valiant's results on general planar graphs [1], where embeddings without crossovers require more area (Θ(n²)) than those with crossovers (O(n(log n)²)) and consequently greater maximum edge length.

*Supported in part by National Science Foundation grant ECS80-07428.


**Supported in part by Office of Naval Research Contracts N00014-80-K-0516 and N00014-81-K-0360; the latter is Task SRO-100.
There is also some practical interest in achieving shorter edge length embeddings even though VLSI technologies (e.g., nMOS) provide modest crossover capability [8]. In particular the functional part of
modest crossover capability [8]. In particular the functional part of
the circuit abstracted by the tree is not the only object in the layout.
There are power and ground wires and there are often timing signal wires
which tend to be dedicated to certain layers (e.g., metal and poly).
Also, the tree may be only one of several subcircuits occupying the
same layout. Crossover-free embeddings can be a helpful simplification
in the difficult task of VLSI design.

RESULTS

The model is the same one used in earlier works [1-3]. The set T_n is the set of all trees on n vertices with vertex degree at most four. The tree is a guest graph that is to be embedded in a host graph which is a grid; that is, the vertex set is the set of lattice points in the first quadrant and the edge set is composed of those edges connecting the unit distance neighbors of the vertices. An embedding is a 1-1 mapping of the vertices of a tree t ∈ T_n to grid vertices and the tree edges to vertex disjoint (i.e., without crossovers) paths in the grid. The area of t ∈ T_n is the area of the minimum square bounding the image of an embedding of t. The maximum edge length in an embedding of t is the greatest number of edges in an image path. The edge length of t is the least maximum edge length over embeddings of t.
First we identify a useful lemma (of some interest in its own right) that a modified version of Valiant's crossover-free tree embedding can embed oriented trees [7] in linear area.
Lemma 1. Every t ∈ T_n with a specified orientation can be embedded in the plane with O(n) area and without crossovers so that the orientation is preserved.
The proof is direct by making obvious modifications to the "reconnection" phase of the Valiant algorithm. Notice that the resulting edge length is still O(√n).
Theorem 2: [Main result] Every t ∈ T_n can be embedded in the plane with area O(n), without crossovers, and edge length of O(√n/log n).
Proof sketch: The proof takes the form of an algorithm to achieve the required embedding.
We begin by "balancing" the tree, that is by embedding it in a tree
of height O(log n). As a result, certain guest tree edges will be
mapped to host tree paths. We refer to edges in these host tree paths
as "double" edges because they host segments of two guest tree edges.
If the final embedding is to be planar, we must keep track of the
orientation of these double edges.
Although an indirect technique is known [4] for the balancing
operation, we prefer a more direct technique [5] that has better
dilation, i.e., less edge stretching. The details of how the-direct
method performs the balancing are unimportant here. What does matter is
that the two cases of that construction can be laid out in a planar
fashion. Figure I illustrates the two relevant balancing transformations.
Next we use a refined version of the Bhatt-Leiserson [3] "hyper-H" embedding on the balanced tree. Their method, motivated by the Mead-Conway "hyper-H" planar embedding of complete binary trees [8], shortens the edges near the root of the tree by "pulling" vertices up narrow
Figure 1. Transformations used in the balancing operation. Dotted
edges are host tree edges.

H-shaped channels to be closer to their parents. (See Figure 2.) We must be cognizant of the fact that the balanced tree actually contains double edges whose orientation must be respected. The inductive hypothesis is that an n node oriented forest (initially just the balanced tree) has a bisector, a path from the root to a leaf of one of the trees, that separates the forest into left and right pieces of essentially equal size that does not change the orientation. The bisector has length at most O(log n).

Figure 2. Channel layout schema for Bhatt-Leiserson construction.


The "hyper H" embedding is used to embed those vertices nearest the
root and is terminated with unembedded subtrees of size O(n/(log n)2).
By the lemma these trees can be embedded in O(n/(log n)2) area without
crossovers and with edge length O(iJl/log n).
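For orientation, the following sketch generates the wire segments of the classic Mead-Conway hyper-H embedding of a complete binary tree, which the refined construction above builds on; it is the textbook layout, not the authors' channel-based refinement:

    def h_tree(depth, x=0.0, y=0.0, arm=1.0, horizontal=True):
        """Return [(node_xy, child_xy), ...] wire segments of an H-tree."""
        if depth == 0:
            return []
        dx, dy = (arm, 0.0) if horizontal else (0.0, arm)
        segments = []
        for sign in (-1, 1):
            cx, cy = x + sign * dx, y + sign * dy
            segments.append(((x, y), (cx, cy)))
            # children alternate direction and halve the arm length
            segments += h_tree(depth - 1, cx, cy, arm / 2.0, not horizontal)
        return segments

    segs = h_tree(4)
    print(len(segs))   # 2^(4+1) - 2 = 30 wires for the 31-node tree
    print(max(abs(a - c) + abs(b - d) for (a, b), (c, d) in segs))  # longest wire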
As a consequence of the above techniques, we have
Theorem 3. Every t ∈ T_n with a specified orientation and height at most O(n^{1/2−ε}), can be embedded with O(n) area, without crossovers, with edge length O(√n/log n) so that the orientation is preserved.
We do not know whether arbitrary oriented trees can be embedded with O(n) area, without crossovers and with edge length O(√n/log n). If so, then the property of orientation would not penalize an embedding.
For comparison purposes let us relax our area measurement to be the
area of the smallest bounding convex set; this is legitimate since we
are now interested in lower bounds.
Theorem 4. Every t ∈ T_n with a specified orientation can be embedded with its leaves on the perimeter of a convex set, without crossovers, in area Θ(n²) so that the orientation is preserved.
The upper bound is obvious. A tree that forces the lower bound is shown in Figure 3. The best lower bound for the unoriented case is Brent and Kung's Ω(n log n) area lower bound for complete binary trees with leaves on the perimeter of a convex set.

Figure 3. A tree forcing the lower bound of Theorem 4.

REFERENCES

[1] L. G. Valiant
Universality Considerations for VLSI Circuits
IEEE Transactions on Computers, 1981
[2] M. S. Paterson, W. L. Ruzzo, and L. Snyder
Bounds on Minimax Edge Length for Complete Binary Trees
Proceedings of the Thirteenth Annual Symposium on the Theory of Computing, 1981

[3] S. N. Bhatt, and C. E. Leiserson
Minimizing the Longest Edge in a VLSI Layout
Manuscript, MIT, 1981
[4] Hong Jia-Wei, Kurt Mehlhorn, and A. L. Rosenberg
Cost Tradeoffs in Graph Embeddings
ICALP 81, 1981

[5] W. L. Ruzzo
Embedding Trees in Balanced Trees
Manuscript, University of Washington, 1981

[6] R. P. Brent, and H. T. Kung
On the Area of Binary Tree Layouts
CMU Technical Report, 1979

[7] D. E. Knuth
The Art of Computer Programming
Volume 1, Addison Wesley, 1968

[8] Carver Mead, and Lynn Conway


Introduction to VLSI Systems
Addison Wesley, 1980
The VLSI Approach to
Computational Complexity
Professor J. Finnegan
University of Oceanview, Kansas
(Formerly with the DP department of the
First National Bank of Oceanview)

The rapid advance of VLSI and the trend toward the decrease of the geometrical feature size,
through the submicron and the subnano to the subpico, and beyond, have dramatically reduced the cost
of VLSI circuitry. As a result, many traditionally unsolvable problems can now (or will in the near
future) be easily implemented using VLSI technology.

For example, consider the traveling salesman problem, where the optimal sequence of N nodes
("cities") has to be found. Instead of applying sophisticated mathematical tools that require investment
in human thinking, which because of the rising cost of labor is economically unattractive, VLSI
technology can be applied to construct a simple machine that will solve the problem!

The traveling salesman problem is considered difficult because of the requirement of finding the best
route out of N! possible ones. A conventional single processor would require O(N!) time, but with
clever use of VLSI technology this problem can easily be solved in polynomial time!!

The solution is obtained with a simple VLSI array having only N! processors. Each processor is dedicated to a single possible route that corresponds to a certain permutation of the set {1, 2, 3, ..., N}. The time to load the distance matrix and to select the shortest route(s) is only polynomial in N. Since the evaluation of each route is linear in N, the entire system solves the problem in just polynomial time! Q.E.D.

Readers familiar only with conventional computer architecture may wrongly suspect that the
communication between all of these processors is too expensive (in area). However, with the use of
wireless communication this problem is easily solved without the traditional, conventional area penalty.
If the system fails to obtain from the FCC the required permit to operate in a reasonable domain of the
frequency spectrum, it is always possible to use microlasers and picolasers for communicating either
through a light-conducting substrate (e.g., sapphire) or through a convex light-reflecting surface
mounted parallel to the device. The CSMAlCD (Carrier Sense Multiple Access, with Collision
Detection) communication technology, developed in the early seventies, may be found to be most
helpful for these applications.

If it is necessary to solve a problem with a larger N than the one for which the system was initially
designed, one can simply design another system for that particular value of N, or even a larger one, in
anticipation of future requirements. The advancement of VLSI technology makes this iterative process
feasible and attractive.

This approach is not new. In the early eighties many researchers discovered the possibility of
accelerating the solution of many NP-complete problems by a simple application of systems with an
exponential number of processors.

Even earlier, in the late seventies many scientists discovered that problems with polynomial
complexity could also be solved in lower time (than the complexity) by using a number of processors
which is also a polynomial function of the problem size, typically of a lower degree. N×N matrix
multiplication by systems with N² processors used to be a very popular topic for conversations and
conference papers, even though less popular among system builders. The requirement of dealing with
variable N was (we believe) handled by the simple P/O technique, namely, buying a new system for
any other value of N, whenever needed.

According to the most popular model of those days, the cost of VLSI processors decreases
exponentially. Hence the application of an exponential number of processors does not cause any cost
increase, and the application of only a polynomial number of processors results in a substantial cost
saving!! The fact that the former exponential decrease refers to calendar time and the latter to problem
size probably has no bearing on this discussion and should be ignored.

The famous Moore model of exponential cost decrease was based on plotting the time trend (as has
been observed in the past) on semilogarithmic scale. For that reason this model failed to predict the
present as seen today. Had the same observations been plotted on a simple linear scale, it would be
obvious that the cost of VLSI processors is already (or about to be) negative. This must be the case, or
else there is no way to explain why so many researchers design systems with an exponential number of
processors and compete for solving the same problem with more processors.

Conclusions
With the rapid advances of VLSI technology anything is possible.
The more VLSI processors in a system, the better the paper.
Optimal Placement for River Routing

Charles E. Leiserson and Ron Y. Pinter
Massachusetts Institute of Technology
Laboratory for Computer Science
Cambridge, Massachusetts 02139

Abstract. River routing is the problem of connecting a set of terminals a_1, ..., a_n on
a line to another set b_1, ..., b_n in order across a rectangular channel. When the terminals
are located on modules, the modules must be placed relative to one another before routing.
This placement problem arises frequently in design systems like bristle-blocks where stretch
lines through a module can effectively break it into several chunks, each of which must be
placed separately. This paper gives concise necessary and sufficient conditions for wirability
which are applied to reduce the optimal placement problem to the graph-theoretic single-
source-longest-paths problem. By exploiting the special structure of graphs that arise from
the placement problem for rectilinear wiring, an optimal solution may be determined in
linear time.

1. Introduction
River routing is a special routing problem which arises often in the design of integrated
circuits, and it has been shown to be optimally solvable in polynomial time for many wiring
models (see in particular [Tompa] and [Dolev et al.]). In this paper we demonstrate that
the placement problem for river routing is also polynomial-time solvable.

The general character of the placement problem for river routing is illustrated in Figure
1. Two sets of terminals a_1, ..., a_n and b_1, ..., b_n are to be connected by wires across a
rectangular channel so that wire i is routed from a_i to b_i. The terminals on each side of
the channel are grouped into chunks which must be placed as a unit. The quality of a legal
placement (one for which the channel can be routed) can be measured in terms of the
dimensions of the channel. The separation is the vertical distance between the two lines
of terminals, and the spread is the horizontal dimension of the channel.

The wiring model gives the constraints that the routing must satisfy. Although
our results can be generalized to include a variety of wiring models (see Section 5), we
concentrate on the (one-layer) square-grid model. Crossovers are disallowed in the square-
grid model, and all wires must take disjoint paths through the grid.

The placement problem for river routing arises often during ordinary integrated circuit
design. A common instance is when the terminals of one or more modules are to be
connected to drivers. The various independent "chunks" are the modules, which lie on one
side of the channel, and the drivers, which lie on the other.

This research was supported in part by the Defense Advanced Research Projects Agency under Contract
No. N00014-80-C-0622.


Figure 1: Two sets of chunks on either side of a rectangular channel.
Terminal a_i must be connected to b_i for i = 1, ..., 10.

A more interesting manifestation of the placement problem occurs in the context
of such design systems as bristle-blocks [Johannsen] and DPL/Daedalus [Batali et al.].
These systems encourage a designer to build plug-together modules so that the difficulties
associated with general routing can be avoided. A designer may specify stretch lines which
run through a module and allow the module to be expanded perpendicular to the stretch
line, as demonstrated in Figure 2. When two independently designed modules are plugged
together, stretch lines permit the terminals to be pitch aligned, that is, the distances
between pairs of adjacent terminals are made to match the distances between their mates,
and routing is avoided because the separation of the channel is zero. Unfortunately, this
approach may not succeed unless stretch lines are put between every pair of adjacent
terminals. The stretch lines may not only disrupt the internal structure of the modules,
but the consequence may be an inordinate amount of stretching that leaves the channel
with a large spread.
The other extreme is to forego stretching altogether and river route between the
terminals. But the cost may still be large if a large separation is required in order to
achieve a routing. A reasonable compromise is to place stretch lines where it is convenient,
and then do a little stretching and a little routing. Determining how much of each to do
is exactly the optimal placement problem for river routing.
The remainder of this paper demonstrates that optimal solutions to the placement
problem can be achieved efficiently. Section 2 gives a concise necessary and sufficient
condition for a channel to be routable in the square-grid model. Section 3 shows that the
form of this condition allows the placement problem to be reduced to the graph-theoretic
problem of finding the longest paths from a source vertex to all other vertices in a graph.
A linear-time algorithm for placement is given in Section 4. Section 5 extends the results
of the paper to other wiring models and discusses further placement problems.

2. Necessary and Sufficient Conditions for Wirability


To demonstrate the results of this paper, we adopt an extremely simple wiring model:
the (one-layer) square-grid model. All wires are constrained to run on an underlying grid
of integer lattice points. The terminals a_1, ..., a_n and b_1, ..., b_n occupy grid points on
opposite sides of the channel. No two wires may occupy a single grid point, which enforces
unit separation of wires. Figure 3 shows a solution to the problem of Figure 1 using this
model.

Figure 2: A module before and after stretching (courtesy of John Batali).

Figure 3: A possible solution to the problem in Figure 1 for which the
separation is 5 and the spread is 27.

In order to establish constraints on wirability in this model, consider a straight line
segment drawn from (x_1, y_1) to, but not including, (x_2, y_2). We ask the question, "How
many wires can cross this line?" A simple analysis shows the answer is max(|x_2 − x_1|, |y_2 − y_1|).
Without loss of generality, assume the situation is as in Figure 4, and look at the grid
points immediately below the line, that is,

    {(x, ⌊y_1 + ((y_2 − y_1)/(x_2 − x_1))(x − x_1)⌋) | x = x_1, ..., x_2 − 1}.

Any wire crossing the line must perforce occupy one of these grid points, and therefore
the number of such wires is bounded by the cardinality of this set.

Figure 4: The number of wires crossing the half-open line segment is at
most the number of grid points immediately below the line.

Let us now turn to the river routing problem and examine how this constraint can
be brought to bear. Let a_1, ..., a_n denote both the names of the terminals at the top of
the channel and their x-coordinates, and let the same convention hold for the terminals
b_1, ..., b_n at the bottom of the channel. Consider a half-open line segment drawn from
terminal a_j to terminal b_i as shown in Figure 5. The j − i wires emanating from
a_i, ..., a_{j−1} must all cross this line, and similarly for a line drawn from b_j to a_i. In order
for a channel with separation t to be routable, therefore, it must be the case that

    max(a_j − b_i, t) ≥ j − i   and   max(b_j − a_i, t) ≥ j − i        (1)

for 1 ≤ i ≤ j ≤ n.

Figure 5: The (j − i) wires from a_i, ..., a_{j−1} must cross the dashed line
between b_i and a_j.

Although Condition (1) is a new condition for wirability, the analysis that leads to it
is essentially the same as that in [Dolev et al.] and represents previous work in the field.
A more compact condition exists, however, which is equivalent:

    a_{i+t} − b_i ≥ t   and   b_{i+t} − a_i ≥ t        (2)

for 1 ≤ i ≤ n − t. The channel is always routable if t ≥ n.


Condition (1) implies Condition (2) because Condition (2) is a refinement of Condition
(1). For the opposite direction, suppose first that j − i < t; then max(a_j − b_i, t) ≥ t >
j − i. If j − i ≥ t, on the other hand, then

    a_j − b_i = a_{i+t+(j−i−t)} − b_i
              ≥ a_{i+t} − b_i + (j − i − t)
              ≥ t + (j − i − t)
              = j − i

since a_{k+1} ≥ a_k + 1 for all 1 ≤ k < n. Thus the two conditions are indeed equivalent.
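Condition (2) is mechanical to check. A minimal Python sketch, assuming the terminals are given as 0-indexed lists a and b of integer x-coordinates (the names are ours, not part of the paper):

    def routable(a, b, t):
        # Condition (2): with separation t, the channel is routable iff
        # a[i+t] - b[i] >= t and b[i+t] - a[i] >= t for every valid i;
        # any channel is routable once t >= n.
        n = len(a)
        if t >= n:
            return True
        return all(a[i + t] - b[i] >= t and b[i + t] - a[i] >= t
                   for i in range(n - t))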
Figure 6 shows a simple geometric interpretation of Condition (2). The condition
a_{i+t} − b_i ≥ t means that a line with unit slope going up and to the right from b_i must
intersect the top of the channel at or to the left of terminal a_{i+t}. And if the condition
fails, terminal b_j must be to the right of a_j for i ≤ j < i + t − 1, that is, each wire
from a_j goes down and to the right, which can be shown to follow from the fact that
a_{j+1} ≥ a_j + 1. (For b_{i+t} − a_i ≥ t the line with slope −1 going down and to the right
from a_i must intersect the bottom of the channel at or to the left of terminal b_{i+t}.)

Figure 6: Geometric interpretation of a_{i+t} ≥ b_i + t and b_{i+t} ≥ a_i + t.


This geometric interpretation can be used to show that Condition (2) is not only a
necessary condition for routability of the channel, but a sufficient condition as well. In
fact, a simple greedy algorithm will successfully route a routable channel. Processing
terminals left to right, the greedy algorithm routes each wire across the channel until it
hits a previously routed wire; then it follows the contour of the opposite side until it reaches
its destination.

To see that this algorithm works given Condition (2), we must be more precise about
what paths are taken by the wires. Consider without loss of generality a block of consecu-
tive wires that go down and to the right, that is, a_i ≤ b_i for all wires in the block. For
any horizontal position x such that a_i − t < x ≤ b_i, define

    η_i(x) = max(a_i − x, max{r | b_{i−r} ≥ x}).

The path of wire i is then described by the locus of points (x + η_i(x), η_i(x)) for a_i − t <
x ≤ b_i.
A geometric interpretation of this formulation uses the same intuition as was given in
Figure 6. The line with unit slope drawn from (x, 0), where x is in the range a_i − t < x ≤ b_i,
must cross wire i. The value η_i(x) gives the y-coordinate of wire i where it crosses this line
of unit slope. The two-part maximum in the definition of η_i(x) corresponds to whether
the wire is being routed straight across the channel or whether it is following the contour
of the bottom. The value of η_i(x) for the latter situation is the number of wires to the left
of wire i which must cross the line of unit slope.
We must now show that the locus of points for a wire is a path, that the paths are
disjoint, and that they never leave the channel. That the locus of points is indeed a path
can be seen by observing that as x ranges from a_i − t to b_i, the initial point is (a_i, t − 1),
the final point is (b_i, 0), and with a change of one in x the coordinates of the path change
by a single grid unit in exactly one of the two dimensions. To show that the paths are
disjoint, consider two adjacent wires i and i + 1, and observe for a_{i+1} − t < x ≤ b_i that
a_i − x < a_{i+1} − x and max{r | b_{i−r} ≥ x} < max{r | b_{i+1−r} ≥ x}, and therefore η_i(x) < η_{i+1}(x).

To show a path of a wire never leaves the channel, we demonstrate that η_i(x) < t for all
i and x in the associated range. It is for this part of the proof that we need the assumption
that Condition (2) holds. If for a wire i, the two-part maximum in the definition of η_i(x)
is achieved by a_i − x, then η_i(x) must be less than t because x > a_i − t. Suppose then,
that the two-part maximum is achieved by the maximal r such that b_{i−r} ≥ x. To show
that r < t, we assume the contrary and obtain a contradiction. But since b_{i−t} ≥ b_{i−r} ≥
x > a_i − t, the contradiction is immediate because a_i − b_{i−t} ≥ t from Condition (2).

3. The Structure of the Placement Problem


The objective of a placement algorithm is to set up a routing problem that is solvable
and minimizes some cost function. Many criteria can be adopted to measure the cost of
a placement for river routing, whether in terms of area (total or channel) or some other
function of spread and separation. A plot of minimal spread versus given separation reveals
that the region of feasible placements may not be convex, although the curve is guaranteed
to be monotonically decreasing. (Figure 7 shows the plot for the problem of Figure 1.)
Any measure of placement cost that is a function of spread and separation and which is
monotonically increasing in each of spread and separation will therefore find a minimum
on this curve.

Thus we content ourselves with producing points on this curve, that is, determining
a placement which achieves the minimum spread for a given separation t, if indeed the
channel is routable in t tracks.

Figure 7: The curve of minimum spread versus separation for the example
of Figure 1.

If minimum separation is the goal, for example, binary search can determine the optimum
t in O(lg t) steps. Since the algorithm presented in the next section determines a placement
for fixed t in O(n) time, where n is the number of terminals, and since the separation need
never be more than n, a minimum-separation placement can be achieved in O(n lg n) time.
For more general objective functions such as area, the optimum value can be determined
in O(n²) time.
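The binary search over separations is the standard search over a monotone predicate. A Python sketch, where feasible(t) stands in for the O(n) fixed-separation placement routine of the next section (its interface here is an assumption):

    def min_separation(n, feasible):
        # Find the smallest t in [1, n] with feasible(t) true; by the
        # discussion above the separation need never exceed n, and
        # feasibility is monotone in t.
        lo, hi = 1, n
        while lo < hi:
            mid = (lo + hi) // 2
            if feasible(mid):
                hi = mid
            else:
                lo = mid + 1
        return lo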
We now examine the character of the placement problem for river routing when the
separation t is given. The n terminals are located on m chunks which are partitioned into
two sets that form the top and bottom of the channel. For convenience, we shall number
the chunks from one to k on the top, and k + 1 to m on the bottom. The order of chunks
on each side of the channel is fixed, but they may be moved sideways so long as they do not
overlap. For each chunk i, a variable v_i represents the horizontal position of its left edge.
Any placement can therefore be specified by an assignment of values to these variables.
We also add two variables v_0 and v_{m+1} to the set of variables, which represent the left
and right boundaries of the channel. The spread is thus v_{m+1} − v_0. Figure 8(a) shows
the eight variables for the example from Figure 1.
Since the relative positions of terminals within a chunk are fixed, the wirability con-
straints of Condition (2) can be reexpressed in terms of the chunks themselves to give
placement constraints that any assignment of values to the v_i must satisfy. If terminal
a_{i+t} lies on chunk h, and terminal b_i lies on chunk j, the constraint a_{i+t} − b_i ≥ t can
be rewritten as v_h − v_j ≥ δ_{hj}, where δ_{hj} reflects t and the offsets of the terminals from
the left edges of their respective chunks. The constraint between two chunks determined in
this way will be the maximal constraint induced by pairs of terminals.

(a) Assignment of variables to chunks and channel boundaries.

(b) The placement graph for separation 3.

Figure 8: Representing the placement constraints as a graph for the ex-
ample of Figure 1.

Additional constraints arise from the relative positions of chunks on either side of the
channel. For each pair of adjacent chunks i and i + 1, the constraint v_{i+1} − v_i ≥ w_i must
be added to the set of placement constraints, where w_i is the width of chunk i. Four more
constraints are needed which involve the boundary variables v_0 and v_{m+1}. For chunks 1
and k + 1, which are leftmost on the top and bottom, the constraints v_1 − v_0 ≥ 0 and
v_{k+1} − v_0 ≥ 0 enforce that these chunks lie to the right of the left boundary of the
channel. For chunks k and m, which are rightmost on the top and bottom, the relations
v_{m+1} − v_k ≥ w_k and v_{m+1} − v_m ≥ w_m constrain them to lie to the left of the right
boundary, where w_k and w_m are the widths of the chunks.
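The constraint generation is mechanical. The following Python sketch (the data layout is ours, not the paper's) builds the edge set just described, combining parallel constraints by taking the maximum:

    import math

    def placement_graph(widths_top, widths_bottom, cross):
        # widths_top / widths_bottom: widths of chunks 1..k (top) and
        # k+1..m (bottom); cross: triples (j, h, delta) encoding the
        # wirability constraints v_h - v_j >= delta from Condition (2).
        # Returns {(tail, head): weight}; vertices 0 and m+1 are the
        # channel boundaries. Assumes at least one chunk per side.
        k, m = len(widths_top), len(widths_top) + len(widths_bottom)
        edges = {}
        def add(tail, head, delta):
            edges[(tail, head)] = max(edges.get((tail, head), -math.inf), delta)
        for i in range(1, k):                      # side edges, top
            add(i, i + 1, widths_top[i - 1])
        for i in range(k + 1, m):                  # side edges, bottom
            add(i, i + 1, widths_bottom[i - k - 1])
        add(0, 1, 0)                               # leftmost chunks
        add(0, k + 1, 0)
        add(k, m + 1, widths_top[-1])              # rightmost chunks
        add(m, m + 1, widths_bottom[-1])
        for (j, h, delta) in cross:                # cross edges
            add(j, h, delta)
        return edges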
Figure 8(b) shows a placement graph which represents the constraints between chunks
for the placement problem of Figure 1 where the separation is 3 tracks. A directed edge
with weight δ_{kl} goes from v_k to v_l if there is a constraint of the form v_l − v_k ≥ δ_{kl}. For
example, the weight of 1 on the cross edge going from v_5 to v_2 is the maximal constraint
of a_9 − b_6 ≥ 3 and a_10 − b_7 ≥ 3, which yield v_2 − v_5 ≥ −1 and v_2 − v_5 ≥ 1 since
a_9 = v_2 + 5, a_10 = v_2 + 6, b_6 = v_5 + 1, and b_7 = v_5 + 4. The side edge from v_4 to v_5
arises from the constraint that chunk 4, which is 5 units long, must not overlap chunk 5.

The goal of the placement problem is to find an assignment of values to the v_i which
minimizes the spread v_{m+1} − v_0 subject to the set of constraints. This formulation is
an instance of linear programming where both the constraints and the objective function
involve only differences of variables. Not surprisingly, this problem can be solved more
efficiently than by using general linear programming techniques. In fact, it reduces to a
single-source-longest-paths problem in the placement graph. The length of a longest path
from v_0 to v_{m+1} corresponds to the smallest spread of the channel that complies with all
the constraints. The placement of each chunk i relative to the left end of the channel is
the longest path from v_0 to v_i. If the placement graph has a cycle of positive weight, then
no placement is possible for the given separation.
For the placement problem of Figure 1 with a three-track separation, the longest path
from v_0 to v_2 in the placement graph (Figure 8) is v_0 - v_1 - v_4 - v_5 - v_2 with weight
13, which corresponds to the positioning of chunk 2 in the optimal placement shown in
Figure 9(a). Figures 9(b) through 9(d) show optimal solutions to the placement problem
of Figure 1 for separations t = 4 through t = 6. The constraints for t = 2 yield a cycle of
positive weight in the placement graph, and thus no placement is possible which achieves
a separation of only two tracks.

4. A Linear-Time Algorithm for the Placement Problem


The analysis of Section 3 showed that the optimal placement problem for fixed-
separation river routing is reducible to the single-source-longest-paths problem on a
placement graph. For a general graph G = (V, E) this problem can be solved in time
O(|V|·|E|) by a Bellman-Ford algorithm [Lawler]. Better performance is possible, however,
due to the special structure of placement graphs. This section reviews the Bellman-Ford
algorithm, and shows how it can be adapted to give an O(m)-time algorithm for the
longest-paths problem on a placement graph, where m is the number of chunks. Since the
placement constraints can be generated in O(n) time, where n is the number of terminal
pairs, this algorithm leads to an optimal linear-time algorithm for the fixed-separation
placement problem. The discovery of a linear-time algorithm represents joint research
with James B. Saxe of Carnegie-Mellon University.
The linear-time algorithm is a refinement of the standard Bellman-Ford algorithm
which for each vertex v_i, where i = 1, ..., m + 1, iteratively updates the length λ(v_i) of a
tentative longest path from v_0 to v_i. The algorithm initializes λ(v_0) to zero, and all other
λ(v_i) to −∞; then it sequences through a list ℓ of edges, and for each edge (v_i, v_j) with
weight δ_{ij} updates λ(v_j) by

    λ(v_j) ← max(λ(v_j), λ(v_i) + δ_{ij}).

The list ℓ of edges is the key to the correctness of the algorithm. The length of a
longest path from the source v_0 to a vertex v_i converges to the correct value if the edges of
the path form a subsequence of the list ℓ. (This can be proved by adapting the analysis
of [Yen].)

(a) Separation 3, spread 27.

(b) Separation 4, spread 26.

(c) Separation 5, spread 26.

(d) Separation 6, spread 23.

Figure 9: Optimal placements and routings for the problem of Figure 1
with separations ranging from t = 3 to t = 6.

In the normal algorithm for a general graph G = (V, E), the list ℓ is |V| − 1
repetitions of an arbitrary ordering of the edges in E, which ensures that every vertex-
disjoint path in G beginning with v_0 is a subsequence of ℓ. If there are no cycles of positive
weight in the graph G, then from v_0 to each other vertex in G, there is a longest path
that is vertex-disjoint; hence the algorithm is guaranteed to succeed. The condition of
positive-weight cycles can be tested at the end of the algorithm either by checking whether
all constraints are satisfied or by simply running the algorithm through the edges in E one
additional time and testing whether the values of any λ(v_i) change.
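In code the scheme is compact. A Python sketch under an assumed data layout (vertices 0..m+1 with source 0, weights as a dictionary over edges, and ell the edge list just described):

    import math

    def bellman_ford_longest(num_vertices, weights, ell):
        # Longest paths from vertex 0: relax each edge of ell in order,
        # then make one extra pass over E; any further improvement
        # betrays a positive-weight cycle (no placement exists).
        lam = [-math.inf] * num_vertices
        lam[0] = 0
        for (i, j) in ell:
            lam[j] = max(lam[j], lam[i] + weights[(i, j)])
        for (i, j), w in weights.items():
            if lam[i] + w > lam[j]:
                return None
        return lam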
The list ℓ is also the key to the performance of a Bellman-Ford algorithm. For the
general algorithm on an arbitrary graph G = (V, E), the length of the list is (|V| − 1)·|E|,
and thus the algorithm runs in O(|V|·|E|) time. For a placement graph it is not difficult
to show that both |V| and |E| are O(m), and thus the longest-paths problem can be
solved in O(m²) time by the general algorithm. But a linear-time algorithm can be found
by exploiting the special structure of a placement graph to construct a list ℓ of length
O(m) that guarantees the correctness of the Bellman-Ford algorithm. We now look at the
structure of placement graphs more closely.
The vertices of a placement graph G = (V, E) corresponding to the chunks on the
top of the channel have a natural linear order imposed by the left-to-right order of the
chunks. We define the partial order ≺ as the union of this linear order with the similar
linear order of bottom vertices. Thus u ≺ v for vertices u and v if their chunks lie on
the same side of the channel and the chunk that corresponds to u lies to the left of the
one which corresponds to v. The left-boundary vertex v_0 precedes all other vertices, and
all vertices precede the right-boundary vertex v_{m+1}. The partial order ⪯ is the natural
extension to ≺ that includes equality.
The next lemma describes some of the structural properties of placement graphs.
Figure 10 illustrates the impossible situations described in Properties (i) and (ii) and shows
the only kind of simple cycle that can occur in a placement graph together with the two
consecutive cross edges that satisfy Property (iii).

Lemma 1. Any placement graph G = (V, E) has the following properties:

(i) There do not exist cross edges (u, v) and (x, y) such that u ≺ x and y ≺ v.
(ii) There do not exist cross edges (u, v) and (x, y) such that v ≺ x and y ≺ u.
(iii) All cycles have two consecutive cross edges (u, v) and (v, w) such that w ⪯ u.
Proof. Properties (i) and (ii) can be proved by considering which of the terminal
constraints from Condition (2) induce the edges in the placement graph. For each of these
cases, suppose the edge (u, v) was caused by the terminals i in u and i + t in v, and the
edge (x, y) came from the terminals j in x and j + t in y. For Property (i) we have u ≺ x
and y ≺ v, and thus i < j and j + t < i + t. Canceling t from this latter inequality
obtains the contradiction. The assumption to be proved impossible in (ii) is that v ≺ x
and y ≺ u, which implies i + t < j and j + t < i. Since t is nonnegative, we gain a
contradiction.

To prove Property (iii), we need only consider simple (vertex-disjoint) cycles. Since
no cycle can consist solely of side edges, every simple cycle must have a cross edge (u, v)
going from bottom to top. In order to complete the cycle, there must be a top-to-bottom
edge (w, x) such that v ⪯ w and x ⪯ u. If v = w or x = u, then the pair of edges satisfies
Property (iii). But if v ≠ w and x ≠ u, then the pair of edges violates Property (ii). ∎

(a) The situation forbidden by Property (i).

(b) The situation forbidden by Property (ii).

(c) Every simple cycle contains at most one vertex from the top or
at most one vertex from the bottom. The edges incident on the
vertex are a consequence of Property (iii).

Figure 10: The properties of the placement graph enumerated in Lemma 1.

Each edge in the placement graph is either a top edge, a top-bottom edge, a bottom-
top edge, or a bottom edge. For each of these four sets of edges, there is a natural linear
order of edges based on ⪯, where (u, v) precedes (x, y) for two edges in the same set if
u ⪯ x and v ⪯ y. Property (ii) guarantees that the linear order holds for two cross edges
in the same set. Let TT, TB, BT, and BB be the four lists of edges according to the
natural linear order, and include the two edges out of v_0 and the two edges into v_{m+1} in
either TT or BB as appropriate.

The list ℓ used by the Bellman-Ford algorithm is constructed by a merge of the four
lists which we call MERGE. At each step of MERGE, a tournament is played among the
first elements of each list.

Figure 11: A possible ordering of edges in ℓ for the placement graph in
Figure 8.

If (u, v) and (v, w) are the first elements of two lists, then (u, v)
beats (v, w) if w ⋠ u. Since there may be more than one edge beaten by none of the other
three, ties are broken arbitrarily. The winner is appended to ℓ and removed from the
head of its list. The tournament is then repeated until no edges remain in any of the four
lists. The performance of the tournament can be improved by recognizing that only six of
the twelve possible comparisons of edges need be tried, and that w ⋠ u is guaranteed for
all but two. Figure 11 shows a possible ordering of edges in ℓ for the placement graph in
Figure 8.
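The tournament itself is easily expressed. A Python sketch under assumed interfaces (edges as vertex pairs, precedes_eq deciding the partial order ⪯; the helper names are ours):

    def tournament_merge(TT, TB, BT, BB, precedes_eq):
        # Each round appends a head edge beaten by no other head, where
        # (u, v) beats (v, w) when not precedes_eq(w, u); ties broken
        # arbitrarily by taking the first undominated head.
        heads = [list(lst) for lst in (TT, TB, BT, BB)]
        ell = []
        while any(heads):
            for h in heads:
                if not h:
                    continue
                v, w = h[0]
                beaten = any(g is not h and g and g[0][1] == v
                             and not precedes_eq(w, g[0][0])
                             for g in heads)
                if not beaten:
                    ell.append(h.pop(0))
                    break
            else:
                raise RuntimeError("no winner; Lemma 2 below rules "
                                   "this out for placement graphs")
        return ell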
In order for MERGE to be well-defined, the tournament must always produce a winner,
which is a consequence of the next lemma.

Lemma 2. The list ℓ produced by MERGE is a topological sort of the edges of E
according to the relation R where (u, v)R(v, w) if w ⋠ u.

Proof. First, we show that the relation R is acyclic so that the edges can indeed be
topologically sorted. By definition of R, a cycle in R induces a cycle in the placement
graph. According to Property (iii), the cycle must have two consecutive cross edges (u, v)
and (v, w) such that w ⪯ u. But since (u, v)R(v, w), we also have that w ⋠ u, which is a
contradiction.

The proof that MERGE topologically sorts the edges of E according to R makes use
of the fact that if a vertex v is the tail of an arbitrary edge in any one of the four lists
TT, TB, BT or BB, then for every u ⪯ v there is an edge in the same list emanating
from u. Suppose that MERGE does not topologically sort the edges of E according to R.
Then there is a first edge (u, v) in ℓ such that there exists an edge (v, w) earlier in ℓ and
(u, v)R(v, w). Consider the edge (x, y) in the same list as (u, v) that competed with (v, w)
when (v, w) was the winner of the tournament. For each of the possible combinations of
lists for (u, v) and (v, w), it can always be deduced that there is an edge emanating from
y which makes (x, y) an earlier violator of the topological sort than (u, v). ∎
Since each edge of E is included exactly once in the list ℓ created by MERGE, the
Bellman-Ford algorithm applied to ℓ has a running time linear in the number of chunks.
The correct values for longest paths are produced by the algorithm if for every vertex v,
there is a subsequence of ℓ that realizes a longest path from v_0 to v, under the assumption

that there are no positive-weight cycles in the placement graph. Since for every longest
path, there is a vertex-disjoint longest path, the following theorem proves the correctness
of this linear-time Bellman-Ford algorithm.

Theorem 3. Let G be a placement graph with left-boundary vertex v_0. Then every
vertex-disjoint path beginning with v_0 is a subsequence of the list ℓ created by the procedure
MERGE.

Proof. We need only show that every pair of consecutive edges in a vertex-disjoint
path from v_0 satisfies R, because then Lemma 2 guarantees that the path is a subsequence
of ℓ. Suppose (u, v) and (v, w) are two consecutive edges on a vertex-disjoint path from v_0
which violate R, that is, w ⪯ u. If either (u, v) or (v, w) is a side edge, the pair must satisfy
R, and thus both must be cross edges with the vertices u and w on the same side. Since
if w = u, the path is not vertex-disjoint, we need only show that w ≺ u is impossible.

Assume, therefore, that w ≺ u, and consider the initial portion of the path from v_0
to u. Since v_0 ≺ v and v_0 ≺ w, there must be an edge (x, y) on the path which goes from
the set of vertices to the left of (v, w) to the right of (v, w) in order to get to u. But then
either Property (i) or Property (ii) is violated depending on whether x ≺ v and w ≺ y,
or y ≺ v and x ≺ w. ∎

5. Concluding Remarks

The reduction from the fixed-separation placement problem in the square-grid model
to the single-source-longest-paths problem is possible because the wirability constraints can
all be written in the form v_i − v_j ≥ δ_{ij}. Thus for any wiring model where wiring constraints
can be written in this form, the reduction will succeed. Also, it should be observed that
in general, the performance of the single-source-longest-paths algorithm will not be linear,
but will be a function of the number of constraints times the number of variables. This
section reviews other models and gives the necessary and sufficient wirability constraints
for each.
1. One-layer, gridless rectilinear ([Dolev et al.]). Wires in this model must run
horizontally or vertically, and although they need not run on grid points, no two wires
can come within one unit of each other. The wirability constraints for this model are the
same as for the square-grid model:

    a_{i+t} − b_i ≥ t   and   b_{i+t} − a_i ≥ t

for 1 ≤ i ≤ n − t. As with the square-grid model, the fixed-separation placement algorithm
for this model can be made to run in linear time.
2. One-layer, gridless, rectilinear and forty-five degree ([Dolev and Siegel]). This model
is the same as the gridless rectilinear, but in addition wires may run with slope ±1.
The constraints in this case are

    a_{i+r} − b_i ≥ r√2 − t   and   b_{i+r} − a_i ≥ r√2 − t

for t/√2 ≤ r ≤ t and 1 ≤ i ≤ n − r. The placement algorithm for this model runs in
O(min(tm² + tn, m³ + tn, m³ + n²)) time.

3. One-layer, gridless ([Tompa]). Wires can travel in any direction. The constraints are

    a_{i+r} − b_i ≥ √(r² − t²)   and   b_{i+r} − a_i ≥ √(r² − t²)

for t ≤ r ≤ n and 1 ≤ i ≤ n − r. The placement algorithm runs in O(m³ + n²) time.
4. Multilayer models. All the models presented until now have been one-layer models.
It is natural to generalize to l-layer models in which wires may travel on different layers.
Remarkably, optimal routability can always be achieved with no contact cuts ([Baratz]),
that is, a wire need never switch layers. The necessary and sufficient conditions for these
multilayer models are a natural extension of the one-layer conditions. For example, in the
one-layer, gridless, rectilinear model the conditions are modified for l layers to be

    a_{i+lt} − b_i ≥ t   and   b_{i+lt} − a_i ≥ t

for 1 ≤ i ≤ n − lt.
5. Nonriver routing. The placement algorithm gives optimal placement for river
routing, but there are other routing problems for which it works optimally as well. One
example is the two-layer, any-to-any routing problem where two sets of terminals must be
connected across a channel, but they may be connected in any order.

There are some wiring models, however, where upper and lower bounds for wirability
do not meet. For these models a constraint graph which represents upper bounds will give
the best possible placement for those bounds. A graph representing lower bounds will give
lower bounds on the best possible placement. Together, bounds can be established for some
of these models, and heuristic algorithms invoked to attempt routing within the feasible
range of optimality.

Figure 12: A two-dimensional extension to the river-routing problem. A
solid line between two modules indicates routing occurs between
them.

Extensions can be made to the placement algorithm as well as to the wiring model.
For instance, multiple (parallel) horizontal channels are easily handled within the same
graph-theoretic framework. More interesting is the two-dimensional problem illustrated in
Figure 12. Here, a line between two chunks indicates that wires must be routed between
them. Unfortunately, in order to optimally solve this general problem, it appears that
the constraints indicated by the lines must be convex in both dimensions, not just one
as is the case for the models heretofore considered. When the constraints are convex,
however, convex programming can be used to optimize a cost function such as the area
of the bounding box of the layout. One model which gives convex constraints for the
general two-dimensional problem is the one in which all wires must be routed as straight
line segments between terminals such that no minimum spacing rules are violated. This
model is not particularly interesting from a practical standpoint, however. Heuristics for
solving the related two-dimensional compaction problem by repeatedly compacting in one
dimension and then the other can be found in [Hsueh].

Acknowledgments. We would like to thank Howie Shrobe of the MIT Artificial Intel-
ligence Laboratory for posting the plots of the data paths from the Scheme81 chip which
inspired our interest in this placement problem and for his valuable comments on the
practicality of our work. We would also like to thank Alan Baratz and Ron Rivest from
the MIT Laboratory for Computer Science for numerous helpful discussions, and Shlomit
Pinter, also from the Laboratory for Computer Science, for influencing the direction of our
proof of Theorem 3. Finally, special thanks to Jim Saxe of Carnegie-Mellon University for
his key contributions to the linear-time algorithm for longest-paths.

References

[Baratz] Baratz, A. E., private communication, June 1981.

[Batali et al.] Batali, J., N. Mayle, H. Shrobe, G. Sussman and D. Weise, "The DPL/
Daedalus design environment," Proceedings of the International Confer-
ence on VLSI, Univ. of Edinburgh, August 1981, pp. 183-192.

[Dolev et al.] Dolev, D., K. Karplus, A. Siegel, A. Strong and J. D. Ullman, "Optimal
wiring between rectangles," Proceedings of the Thirteenth Annual ACM
Symposium on Theory of Computing, May 1981, pp. 312-317.

[Dolev and Siegel] Dolev, D. and A. Siegel, "The separation required for arbitrary wiring
barriers," unpublished manuscript, April 1981.

[Hsueh] Hsueh, M.-Y., Symbolic Layout and Compaction of Integrated Circuits,
Memorandum No. UCB/ERL-M79/80 (Ph.D. dissertation), Electronics
Research Laboratory, Univ. of California, Berkeley, December 1979.

[Johannsen] Johannsen, D., "Bristle blocks: a silicon compiler," Proceedings of the
Caltech Conference on VLSI, January 1979, pp. 303-310. Also appears
in the Proceedings of the Sixteenth Design Automation Conference, June
1979, pp. 310-313.

[Lawler] Lawler, E. L., Combinatorial Optimization: Networks and Matroids,
Holt, Rinehart and Winston, New York, 1976.

[Tompa] Tompa, M., "An optimal solution to a wire-routing problem," Proceed-
ings of the Twelfth Annual ACM Symposium on Theory of Computing,
April-May 1980, pp. 161-176.

[Yen] Yen, J. Y., "An algorithm for finding shortest routes from all source
nodes to a given destination in general networks," Quarterly of Applied
Mathematics, Vol. 27, No. 4, July 1970, pp. 526-530.
The Separation for General Single-Layer Wiring Barriers

Alan Siegel                    Danny Dolev
Stanford University            IBM Research
Stanford, California           San Jose, California

1 INTRODUCTION

The problems of placement and routing in integrated circuit design have been gaining increasing
attention as fabrication technology advances. Although a variety of these problems have been
proven to be NP-hard, progress is being made on restricted versions. Tompa, for example, gives
a quadratic solution to a particular single-layer routing problem ([T]). This paper gives efficient
algorithms for finding the separation and the offset in contexts which include his model.

The objective is to compute the space required for wiring without taking the extra time neces-
sary to determine the coordinates for the wires. This information would be useful for placement
and compaction.

1.1 The Wiring Problems

We consider the problem of connecting n corresponding points {P_i} with {Q_i}. Each collection
is collinear. The rows of points are parallel, and both sets are ordered. The connecting wires will
be subject to various constraints (design rules) described later. We may assume that the two rows
of points are horizontal, with {Q} above {P}. We define the horizontal displacement of the set {P}
to be the offset, and call the vertical distance between the two rows the separation.

We examine:
1. The Separation Problem:
Given an offset and a wiring rule, find the minimum separation permitting a legal wiring.
2. The Optimal Offset Problem:
Given a wiring rule, find an offset minimizing the separation.

The conclusions are:
1. Under a large variety of wiring rules, the separation can be found in O(n log n) time. Moreover,
in virtually all practical cases, the time is O(n).
2. In all reasonable models, the optimal offset problem can be solved in time O(Ψ(n) log n) where
Ψ(n) is the time to find the separation.
1.2 The Models

In VLSI design, wires on a given layer have specific restrictions. For mathematical convenience,
we shall take the minimum distance between wires to be 1, and set the width of wires to be
0. Another wiring restriction is that of shape. Wires are frequently required to be rectilinear.
Sometimes 45-degree lines are permitted as well. Other processes include a variety of slopes. As
a matter of theoretical interest, we also examine models permitting arbitrary curves. Implicit in
this discussion of shape is a grid restriction. A circular figure, for example, can be approximated
arbitrarily well by rectilinear segments. This, of course, is useless in terms of a fabrication process,
since any such process must have a limit on its resolution. Our model reflects this limit by requiring
all line segments to end on grid points. We examine integer, fractional and continuous grids.

Figure 1.1: Wires restricted by rectangular barriers around P_1

Since wires must be separated by minimum distances, every routing problem has forbidden
regions restricting the wiring flow. Specifically, for every vertex P_j and index s, there is an s-th
region (specified later) around P_j which cannot be entered by a wire connecting P_{j+t} with Q_{j+t}
for |t| ≥ s. See Figure 1.1.

With a scheme permitting arbitrarily shaped wires, these regions will be concentric disks
centered at connection point vertices ([T]). It is easy to see that the boundaries conform to the
shapes permitted by the wiring scheme and grid restrictions. In the rectilinear case with an integer
grid they are concentric rectangles. On a quarter-integer grid the separation barriers are no longer
rectangular. See Figure 1.2.

(a) Arbitrary (circular)   (b) Rectilinear integer grid   (c) Rectilinear quarter-integer grid

Figure 1.2: Separation barriers
In Section 2 we initially consider families of forbidden regions which are convex and geometri-
cally similar. Later the requirements of convexity and similarity will be somewhat relaxed to include
virtually all known wiring schemes.
Given n pairs of vertices (P_1, Q_1), ..., (P_n, Q_n), we take P_i to be both the name of a point on
the bottom row and the horizontal position of that point, and we take Q_j to be a similar point on the
upper row. A left block is defined to be a maximal sequence of pairs of points (P_i, Q_i), ..., (P_j, Q_j)
such that Q_k ≤ P_k for i ≤ k ≤ j. This condition says that all the connections in the block have
a position in the upper row to the left of the corresponding position on the lower row.

We may define a right block in the obvious, symmetric way. We call a left or right block a
block. In the rest of the paper we refer to left blocks.
If a block can be legally wired, then there is a wiring in which all wires move monotonically; a
wire need never reverse direction. Consequently a wireable block can be wired within its rectangular
boundary. This implies that for a fixed offset the separation is determined by the worst block.

Since the wiring can be accomplished with monotone wires, we may alter the separation barri-
ers to include this observation. Imagine a separation barrier centered at the origin, and extend the
barrier in the second quadrant as a constant. The constant represents a possible crossing number
([DKSSU]). Figure 1.3 shows the modified separation barriers in a left block. These regions are
still geometrically convex and similar. We now assume that all barriers are so modified.

It is not hard to see that the separation can be determined from the barriers which emanate
from one specific side, say the families centered at the points of P. Wireability in this instance
is equivalent to all points Q_j lying outside the relevant separation regions. The reason is that
modified barriers represent compact wiring. It follows that the separation is max_{i,j}(W(i,j)).

Figure 1.3: A separation caused by circular barriers

Figure 1.4: A physical separation - rectilinear wiring

The separation function W(i,j) is defined (for left blocks) as the height at Q_j of the (j − i)-th
barrier emanating from P_i. Notice that in Figure 1.3 the separation will be determined by
P_j's third barrier.
If, for example, we require that wiring in the integer rectilinear case leave rows P and Q
vertically for at least one unit in length, then the physical separation is

    1,                 if P_i = Q_i for all i
    2 + max(W(i,j)),   otherwise.

Figure 1.4 illustrates rectilinear wiring with this restriction. The purpose is to avoid unknown
wires inside modules P and Q (the black boxes denote possible internal wires). It turns out that
all reasonable initialization schemes can be accomplished with simple changes in the definition of
W. Wires with a fixed positive thickness can be accommodated as well.

2 THE SEPARATION PROBLEM

Finding the minimal separation required between two parallel blocks is an O(n²) process if we
check all W(i,j). Figure 2.1 illustrates the problems in determining the separation: one example
shows many wiring barriers interacting with one wire (P_3 to Q_3 in this case); the other shows the
nonmonotonic behavior of the separation function.

2.1 The Partition Property

The partitioning property is an inequality that allows the use of a divide-and-conquer approach
to reduce the complexity of finding the separation. We first require that the separation curves be
concave and similar as in Figure 2.2. In Section 2.6 the requirement is relaxed.

This family of barrier curves will be represented as H_p(x), where p is the index. Our family
H_p(x) satisfies

    0 < H_p(x) ≤ p   for x < p,        (2-1)

and

    H_p(x) = 0   for x ≥ p.            (2-2)



(a) Interacting barriers   (b) Nonmonotonic separation

Figure 2.1: The complications of wiring

Figure 2.2: Similar concave barriers

On the region x < p the function H_p(x) is concave. The basic inequalities are: the concavity,

    H_p''(x) ≤ 0   for x < p,          (2-3)

and the monotonicity,

    H_p'(x) ≤ 0   for x < p.           (2-4)

Any family of similar curves which is concentric with respect to the origin can be represented
as

    H_p(x) = p h(x/p).                 (2-5)

In our case p is appropriately a positive integer. For left blocks, the separation function is:

    W(i,j) = (j − i) h((Q_j − P_i)/(j − i)),   if j > i
           = 0,                                if j ≤ i.        (2-6)

This is consistent with wires from P to Q starting in a possibly horizontal direction.
This is congistent with wires from P to Q starting in a possibly horizontal direction.
The idea behind the partition property is essentially that rooted line segments connecting Qi,
and Qj, with Pi" and Pi, have the following behavior: if i l and i2 respectively maximize W(.,jr)
and W(., j2), then the segments cannot ~ross. Figure 2.3 illustrates this result in the more general
context, of Theorem 1.
Theorem 1. (The Partitioning Theorem)

Let h(x) be a suitable concave barrier function and W(i,j) the corresponding separation func-
tion. Suppose P_i, P_{i+r}, Q_j, and Q_{j+s} are in a left block, where r ≥ 0, s ≥ 0, and j ≥ i + r.

    If W(i,j) ≤ W(i + r, j), then W(i, j + s) ≤ W(i + r, j + s).        (2-7)

Proof: (sketch) When all four terms in (2-7) are positive, some computation shows that

    W(i+r, j+s) − W(i, j+s) = {W(i+r, j) − W(i, j)}
        + ∫₀ˢ [ h'((z+s+u)/(k+s)) − h'((x+r+z+s+u)/(r+k+s)) ] du
        + ∫₀ʳ [ h'((z+r+v)/(k+r)) − h'((s+z+r+v)/(s+k+r)) ] dv
        + ∫₀ʳ ∫₀ˢ [ −((k−z)²/(k+u+v)³) h''((z+u+v)/(k+u+v)) ] du dv.

Under the conditions of the theorem, each integrand is non-negative. See Figure 2.3.

Figure 2.3: A left block separation (note k = j − i − r)

Variables in the proof are as shown. Variables between the heavy lines are barrier numbers (p)
while x, y, z, r, and s outside the lines are measures of distance. Note that only z can be negative.

2.2 General Wiring

In this section we shall use the partitioning property to produce an O(n log n) algorithm for
determining the separation. The following lemma is a consequence of the partitioning theorem.

Lemma 2. Let W(i_{j_0}, j_0) = max_i(W(i, j_0)). Then the separation function attains its maximum
at (i_max, j_max) where either i_max ≤ i_{j_0} and j_max ≤ j_0, or i_max ≥ i_{j_0} and j_max > j_0.

Theorem 3. Suppose W(i,j) is a separation function which satisfies the partition property. Then
the separation can be found in O(n log n) time.

Proof: Without loss of generality, we assume that P and Q constitute one left block.

We first find a P_{i*} maximizing W(i, n/2), the separation induced by the pairs P_i and Q_{n/2}.
Lemma 2 ensures that the maximum separation is among separations restricted to the intervals
[P_1, P_{i*}] and [Q_1, Q_{n/2}], or [P_{i*}, P_n] and [Q_{n/2}, Q_n]. Repeating this divide-and-conquer
step on the Q coordinates requires a total of O(n log n) comparisons.
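A Python sketch of this divide-and-conquer, assuming W is available as a callable on 1-indexed terminal pairs (the recursion bookkeeping is ours):

    def separation(W, n):
        # O(n log n) evaluations of W: fix the middle column jmid, find
        # the maximizing row istar by a scan, and recurse on the two
        # quadrants that Lemma 2 leaves as candidates.
        def solve(ilo, ihi, jlo, jhi):
            if jlo > jhi:
                return 0
            jmid = (jlo + jhi) // 2
            istar = max(range(ilo, ihi + 1), key=lambda i: W(i, jmid))
            best = W(istar, jmid)
            best = max(best, solve(ilo, istar, jlo, jmid - 1))
            best = max(best, solve(istar, ihi, jmid + 1, jhi))
            return best
        return solve(1, n, 1, n)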

Corollary 4. The separation for a wiring scheme with circular barriers can be found in O(n log n)
time.

2.3 The Rectilinear Integer Grid Case

The separation problem in this case can be solved in linear time with a linear scan over P_i
and Q_i (see [DKSSU]). The fundamental property which allows this method to work is local
monotonicity: (for a left block) for k < j,

    W(i,j) > 0 implies W(i,j) > W(i,k).

2.4 The 45-Degree Case

Another wiring model uses rectilinear and 45-degree wiring on a half-integer grid. For this model the
left separation barriers are as in Figure 1.2(c).

More precisely,

    h(x) = 1,         if x < .5
         = 1.5 − x,   if .5 ≤ x < 1
         = 0,         if 1 ≤ x.

As before, we define the separation function:

    W(i,j) = (j − i) h((Q_j − P_i)/(j − i)),   if j > i
           = 0,                                if j ≤ i.

The partition property holds for these wiring barriers, but for fixed i, W(i,j) is not even
unimodal when restricted to the part of a block where W(i,j) > 0. Consequently the algorithm
for the rectilinear separation problem is inappropriate for this model. Nevertheless there is a
linear-time algorithm for this separation problem.
Theorem 5. The separation for rectilinear plus 45-degree wiring on a half-integer grid can be
determined in linear time.

Proof: (sketch) The separation function can be decomposed into two parts, one resulting from the
rectilinear (flat) portions of the separation barriers and one from a restriction to the 45-degree pieces. The
maximum contribution from the flat portions can be found from a linear scan as in the rectilinear
case. The other contribution can also be evaluated by a linear scan. In this instance a priority queue
must be maintained to indicate which P_i connection point gives a maximum (restricted) separation
for a current point Q_j. When j is incremented a new restricted W(i,j) value is computed, and
the data structure is updated. As i is incremented, P_i is inserted in the data structure. The
linearity results from the fact that during the insertion, old data contributing a separation less
than that from a new P_i entry can be deleted. In addition, P_h entries giving former (but not
current) maximum separation values can be deleted during the updating. ∎
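The amortization behind this linearity is the familiar monotone-queue pattern. A generic Python illustration (not the authors' data structure, whose details depend on the restricted separation function):

    from collections import deque

    def scan_keeping_maxima(items, dominates):
        # Each new item first deletes the older entries it dominates,
        # so every item is inserted and removed at most once and the
        # whole scan costs O(n) amortized.
        q = deque()
        for item in items:
            while q and dominates(item, q[-1]):
                q.pop()
            q.append(item)
        return q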

2.5 Similar polygonal barriers

The algorithm sketched above can be adapted to line segments of arbitrary slope. Hence

Corollary 6. The separation problem for similar polygonal barriers can be solved in time O(kn)
where k is the number of sides of the polygon and n is the number of points to be wired. ∎

2.6 A weakening of similarity and curvature restrictions

We note that any algorithm which finds the separation for a particular wiring scheme will, with
one proviso, find the separation for the same wiring scheme where the definition of separation dis-
tance is redefined. The proviso is that the new measure of separation distance Φ be a nondecreasing
function of the original distance measure. Then

    max(Φ(W)) = Φ(max(W)).

Rectilinear wiring on a quarter-integer grid is equivalent to a comparable rectilinear and 45-degree wiring
scheme under the map Φ(x) = (1/4)⌈4x⌉. This proves

Corollary 7. The separation problem for rectilinear wiring on a quarter-integer grid can be solved
in linear time. ∎

For completeness we observe that one-third and one-half integer grid rectilinear wiring schemes are
identical to the integer grid problems; the separation barriers are the same.

3 THE OPTIMAL OFFSET PROBLEM

The separation problem allows one degree of freedom, the distance between the two modules
being wired. The optimal offset problem introduces a second degree of freedom, lateral motion.
We now seek an offset which has a minimal solution to the separation problem for an integer grid.

Figure 3.1: A tight offset


Without loss of generality we set Q_1 and P_1 to 0. Define the offset of P by a distance d to be

    P[d] = P + d.

Lemma 8. The separation is a unimodal function of offset.

The separation of a left block of P[d] and Q can't decrease as P slides to the right (d increases).
The left separation can increase, and the left block size can increase as neighboring right blocks
are consumed by left blocks. Consequently, for a fixed offset d_0, we may find the separation of
the left and right blocks independently. If, say, the left separation exceeds the right, it follows
that a minimal separation occurs for some value d ≤ d_0. If the two separation distances are equal,
then d_0 is a solution to the offset minimizing problem. (We point out that the last condition is not
necessary; see Figure 3.1.)
Therefore a solution to an integer grid optimal offset problem can be found by using an
appropriate separation algorithm on P[d] and Q where the values d are obtained from a binary
search on the interval [−P_n, Q_n]. This gives an operation count of O(Ψ(n) log(P_n + Q_n)) for the
offset algorithm where Ψ(n) is the complexity of the separation problem. While this is sufficient
in most instances, it does suffer from the fact that log(P_n + Q_n) is not bounded by any function
of n. We now describe an O(Ψ(n) log n) complexity integer grid optimal offset algorithm. The
idea behind it is to limit the number of choices used in the offset region [−P_n, Q_n]. This task is
simplified by defining the interdigitation number of an offset:

    Idn(d) = Cardinality{(i,j) | P[d]_i ≥ Q_j}

Thus Idn(−P_n − 1) = 0 and Idn(Q_n) = n². The obvious choice is to try a binary search on
the interdigitation number.
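For reference, the interdigitation number itself is a simple count. A direct Python sketch (a merge-style scan of the two sorted rows gives the same count in O(n)):

    def idn(P, Q, d):
        # Idn(d) = |{(i, j) : P[d]_i >= Q_j}|, counted naively in O(n^2).
        return sum(1 for p in P for q in Q if p + d >= q)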
The problems with this approach are:
1. The interdigitation number is insufficient to determine the offset, and
2. It is necessary to find an offset which corresponds to a given bisecting interdigitation number.

We show the first difficulty is not serious, and the second can be overcome.
1. Once we have the interdigitation where a minimal separation occurs, the possible offsets will
be limited to an interval [a, b]. If b − a ≤ 2n, then a binary search on [a, b] will find an optimal
offset and the separation for an additional cost of Ψ(n) log n. If b − a > 2n, then d = a + n gives
an offset where the vertices do not interact with each other. Consequently the separation is
just based on the largest number of wires passing a connection point. It is possible that the
offset d = a will allow the connection points to align perfectly, but a test for this is easy to
include.
2. The remaining problem can be stated as follows: given interdigitation numbers D_1 and D_2 with
corresponding displacements d_1 and d_2, find a displacement d_3 which gives an intermediate
interdigitation between D_1 and D_2. We relax the condition on the binary search in the sense
that our new value need not cut the interval [D_1, D_2] in half; it is sufficient to require that the
length of the new subinterval be a fraction of the previous interval's length where that fraction
is bounded by a constant less than one. The following algorithm finds the intermediate value
d_3. We note that the actual interdigitation numbers D are not used; they were introduced as
motivation for the algorithm.
Algorithm Intermediate(d_1, d_2, P, Q):
    ARRAY P_{1:n}, Q_{1:n}, C_{1:n}, e_{1:n};
    k ← 1;
    j ← 1;
    FOR i ← 1 TO n DO BEGIN
        WHILE j ≤ n and P_i + d_1 ≥ Q_j DO
            j ← j + 1;
        { Q_j is now the first point to the right of P_i + d_1 }
        WHILE k ≤ n and P_i + d_2 ≥ Q_k DO
            k ← k + 1;
        { Q_k is now the first point to the right of P_i + d_2 }
        C_i ← k − j;
        { C_i is the number of points Q_j crossed by the i-th point of P, P[d]_i, as d increases
          from d_1 to d_2. Formally, C_i = Cardinality{j | P[d_1]_i < Q_j ≤ P[d_2]_i}.
          Note that Σ_i C_i = D_2 − D_1. }
        e_i ← Q_{j+⌈C_i/2⌉−1} − P_i;
        { e_i is the offset such that P[d]_i crosses ⌈C_i/2⌉ points of Q as d increases from d_1 to
          e_i: e_i = min{e | Cardinality{j | P[d_1]_i < Q_j ≤ P[e]_i} ≥ ⌈C_i/2⌉}. }
    END FOR;
    RETURN WeightedMedian(e, C).
END Intermediate.

3.1 Analysis of Intermediate

If we weight the e_i with weights C_i, then the (fast) weighted median e of the e_i is essentially
d_3, the bisecting offset desired. The offsets e_i ≤ e account for at least half of the total weight
D_2 − D_1. By definition those offsets ensure interdigitation increases of at least half of their weights
for a total of at least ⌈(D_2 − D_1)/4⌉. Similarly any offset d < e will increase the interdigitation by at
most ⌈3(D_2 − D_1)/4⌉. It follows that e or the next smaller offset is a bisection point which reduces the
interdigitation range by at least 25%.


It remains to account for the complexity of the scheme. The set of interdigitation increments
{C_i} is found in O(n) time. This is done above with a linear scan of P and Q. The fast median is
detailed in [AHU]. A fast weighted median profits from the same technique of divide-and-conquer
as does the fast median. The conclusion is that these medians are linear in time. We therefore
have

Theorem 9. The optimal offset problem for an integer grid can be solved in time O(Ψ(n) log n)
where Ψ(n) is the time to find the separation. ∎

We note the following

Corollary 10. The optimal offset problem for a fractional grid with a density of k points per
unit can be solved in time O(Ψ(n) log kn).

In addition, arguments similar to those above show that the continuous optimal offset problem
can be solved in time O(n² log n).

4 THE OFFSET RANGE PROBLEM

The Offset Range problem is due to Leiserson and Pinter [LP]: given a wiring rule and a
separation, find all offsets permitting a legal wiring for that separation. The principal result in
their paper is that a particular multiple module placement problem for river routing can be solved
in the same time complexity as the offset range problem. They also show that the offset range
problem for rectilinear wiring on an integer grid can be solved in linear time.
Given a fixed separation w and a fixed offset (of, say, 0), we may define the right offset
function R(i, j) as the horizontal directed distance from Q_j to the (j − i)th left barrier emanating
from P_i. See Figure 4.1. Thus

R(i, j) = Q_j − P_i − (j − i)·h⁻¹(w/(j − i)),  if j − i > w;
R(i, j) = +∞,  if j − i ≤ w.

Figure 4.1: The right offset function.


Similarly the left offset function may be defined as:

L(i, j) = Q_j − P_i + (i − j)·h⁻¹(w/(i − j)),  if i − j > w;
L(i, j) = −∞,  if i − j ≤ w.
Then the relative offset interval for P is [max L(i, j), min R(i, j)].
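As a naive baseline for the O(n log n) and O(n) results below, the interval can be computed by evaluating the two offset functions over all index pairs; the following sketch (ours) takes L and R as callables, since their exact form depends on the barrier function h.

    import math

    def offset_interval(n, L, R):
        # Brute-force relative offset interval [max L(i,j), min R(i,j)] over
        # all pairs 1 <= i, j <= n; an empty interval (lo > hi) means no
        # legal offset exists at this separation.  O(n^2) evaluations.
        lo, hi = -math.inf, math.inf
        for i in range(1, n + 1):
            for j in range(1, n + 1):
                lo = max(lo, L(i, j))
                hi = min(hi, R(i, j))
        return lo, hi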
The same techniques used in Section 2 for the separation problems apply to the computation
of the offset range. Consequently we have
Theorem 11. Let h(x) be a suitable concave barrier function and R(i, j) the corresponding right
offset function. Suppose r ≥ 0, s ≥ 0, and j ≥ i + r.

If R(i, j) ≥ R(i + r, j), then R(i, j + s) ≥ R(i + r, j + s).

Corollary 12. The offset range problem for wiring defined by geometrically similar convex
separation barriers can be solved in O(n log n) time.

Theorem 13. The offset range problem for wiring defined by similar polygonal separation barriers
can be solved in O(n) time.

Theorem 14. The offset range problem for rectilinear plus 45-degree wiring on a half-integer
grid can be solved in O(n) time.

5 REFERENCES

[AHU] Aho, A. V., J. E. Hopcroft, and J. D. Ullman, The Design and Analysis of Computer
Algorithms, Addison-Wesley, Reading, Mass., 1974.
[DKSSU] Dolev, D., K. Karplus, A. Siegel, A. Strong, and J. D. Ullman, "Optimal wiring between
rectangles," Thirteenth Annual ACM Symposium on the Theory of Computing, pp. 312-317,
1981.
[FP] Fischer, M. J. and M. S. Paterson, "Optimal tree layout," Proc. Twelfth Annual ACM
Symposium on the Theory of Computing, pp. 177-189, 1980.
[GJ] Garey, M. and D. Johnson, Computers and Intractability: A Guide to NP-completeness,
Freeman, San Francisco, 1978.
[J] Johannsen, D., "Bristle blocks: a silicon compiler," Caltech Conf. on VLSI, pp. 303-310,
Jan. 1979. See also Sixteenth Design Automation Proceedings, pp. 310-313, June 1979.
[L] LaPaugh, A. S., "A Polynomial Time Algorithm for Optimal Routing Around a Rectangle,"
Proc. Twenty-first Annual IEEE Symposium on Foundations of Computer Science, pp.
282-293, 1980.
[LP] Leiserson, C. and R. Pinter, "Optimal Placement for River Routing," Carnegie-Mellon
Conference on VLSI Systems and Computations, Oct. 1981.
[MC] Mead, C. and L. Conway, Introduction to VLSI Systems, Addison-Wesley, Reading, Mass.,
1980.
[St] Storer, J. A., "The node cost measure for embedding graphs on the planar grid," Proc.
Twelfth Annual ACM Symposium on the Theory of Computing, pp. 201-210, 1980.
[T] Tompa, M., "An optimal solution to a wire-routing problem," Proc. Twelfth Annual ACM
Symposium on the Theory of Computing, pp. 161-176, 1980.
[V] Valiant, L., "Universality Considerations in VLSI Circuits," IEEE Transactions on Computers,
pp. 135-140, February 1981.
Provably Good Channel Routing Algorithms
Ronald L. Rivest, Alan E. Baratz, and Gary Miller
Massachusetts Institute of Technology
Laboratory for Computer Science
Cambridge, Massachusetts 02139-1986

I. Introduction
In this paper we present three new two-layer channel routing algorithms that are provably
good in that they never require more than 2d − 1 horizontal tracks, where d is the channel density,
when each net connects just two terminals. To achieve this result we use a slightly relaxed (but
still realistic) wiring model in which wires may run on top of each other for short distances as
long as they are on different layers. Two of our algorithms will never use such a "parallel run"
of length greater than 2d − 1, and our third algorithm will require overlap only at jog points or cross
points. Since in this wiring model at least d/2 horizontal tracks are required, these algorithms
produce a routing requiring no more than four times the best possible number of horizontal
tracks. The second algorithm also has the property that it uses at most 4n contacts, where n
is the number of nets being connected.
II. The Model
The (infinite) channel of width t consists of (1) the set V of grid points (x, y) such that the
integers x and y satisfy the conditions 0 ≤ y ≤ t + 1 and −∞ < x < ∞, (2) the set P of poly segments
consisting of all unit-length line segments connecting pairs of adjacent grid points which do not
both have y = 0 or y = t + 1, (3) the set M of metal segments, which is isomorphic to but disjoint
from P. The channel (V, P, M) thus forms a multigraph with vertex-set V and edge-set P ∪ M. If
two vertices are adjacent in this graph they are connected by precisely two edges, one of type
poly and one of type metal. We define track i of the channel (V, P, M) to be the subgraph
composed of all grid points in V with y-coordinate equal to i, and all segments of P ∪ M which
connect pairs of these grid points.
A wire W consists of a sequence of distinct grid points separated by segments which connect
them:

W = (p_0, s_1, p_1, s_2, ..., s_k, p_k).

Here p_0, ..., p_k are the grid points, and s_i connects p_{i−1} to p_i. Each s_i may be of either type, poly or
metal, and we define the sets of poly segments and metal segments of wire W as follows:

P(W) = {s_i | s_i ∈ P},
M(W) = {s_i | s_i ∈ M}.

Similarly, the set of contact points C(W) is defined to be the set of grid points where W starts, ends or
changes layers:

C(W) = {p_0, p_k} ∪ {p_i | 0 < i < k and type(s_i) ≠ type(s_{i+1})}
(type(s_i) = poly if s_i ∈ P(W) and type(s_i) = metal if s_i ∈ M(W).)

We say that two wires W_1 and W_2 are compatible if there does not exist a pair of segments
s_i ∈ W_1 and s_j ∈ W_2 such that s_i and s_j are incident on a common grid point and type(s_i) = type(s_j).
Notice that two compatible wires may "overlap" by connecting to common grid points with
segments of different type, as illustrated in Figure 1.

Figure 1.
Many previous channel routing algorithms employ a more restricted wiring model in which
no such "overlap" is permitted. We do not know how to prove our current results without
making use of a modest amount of overlap. The current model is certainly a realistic two-layer
model, although it does permit wirings which are susceptible to "cross-talk" via the capacitive
coupling of long overlapping wires. Our wirings will not have any long sections of overlapping
wires; the longest such section will have length at most the width of the channel.
A net N_i = (p_i, q_i) is an ordered pair of integers specifying an entry (x-)coordinate p_i and an
exit coordinate q_i. A net is said to be rising if q_i < p_i, falling if p_i < q_i, and trivial if p_i = q_i. A
channel routing problem is simply a set of n nets, for some integer n, such that no two nets have a
common entry coordinate or a common exit coordinate. A solution to a channel routing problem
consists of an integer t and a set of n compatible wires W_1, ..., W_n in the channel of width t, such
that W_i begins at grid point (p_i, t + 1) and ends at grid point (q_i, 0). The optimal width for a
channel routing problem is defined to be the least integer t such that the problem has a solution
in a channel of width t.
For any real number x, we say that a net N_i = (p_i, q_i) "crosses x" if either p_i ≤ x < q_i or q_i ≤ x < p_i.
The channel density of a channel routing problem is defined to be the maximum over all x ∈ R of
the number of nets crossing x. It is simple to show that a problem has optimal width at least d/2
if it has density d.
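Channel density itself is easy to compute with a sweep over the net endpoints; the sketch below is ours, not part of the paper.

    def channel_density(nets):
        # Density: the maximum, over all x, of the number of nets (p, q)
        # crossing x, i.e. p <= x < q or q <= x < p.  Sweep-line version:
        # +1 at min(p, q), -1 at max(p, q); departures sort before arrivals
        # at equal coordinates, since a net does not cross its larger endpoint.
        events = []
        for p, q in nets:
            if p != q:  # trivial nets cross nothing
                events.append((min(p, q), +1))
                events.append((max(p, q), -1))
        density = current = 0
        for _, delta in sorted(events):
            current += delta
            density = max(density, current)
        return density

For instance, channel_density([(1, 3), (2, 5), (4, 2)]) returns 3.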
III. A Provably Good Channel Routing Algorithm
Let CRP = {N_1, ..., N_n} denote any channel routing problem. We assume without loss of
generality that 1 ≤ p_i, q_i ≤ m for all 1 ≤ i ≤ n and some integer m. Thus the nets N_i ∈ CRP specify
end-points which lie within some m "columns" of the channel. We will now describe a
polynomial time algorithm which is guaranteed to compute a solution to CRP having channel
width exactly t = 2d − 1, where d is the channel density of CRP. Since d/2 is a lower bound on the
optimal channel width for CRP, this algorithm will never generate a solution with channel width
more than four times optimal.
Algorithm 1.
This algorithm proceeds column by column, routing all nets which cross j in step j. The
solution generated will have the properties that t = 2d − 1, there will be at most d wires passing from
column j to column j + 1 for any j, and for some j there will be at least d such wires. Further,
wires will pass from a column j to column j + 1 only on the odd-numbered tracks; there will be
no horizontal segments on the even-numbered tracks. In addition, if there are k nets which cross
j then there will be exactly k horizontal segments connecting columns j and j + 1. These segments
will all lie on distinct odd-numbered tracks and they may be of either type, poly or metal,
independently. Finally, if exactly r of the k nets which cross j are rising and f are falling (so that
r + f = k), then between columns j and j + 1:
(1) The top-most r odd tracks will be devoted to wire segments for the r rising nets,
(2) The "middle" d − r − f odd tracks will be empty, and
(3) The bottom-most f odd tracks will be devoted to wire segments for the f falling nets (see the sketch after this list).
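The sketch referred to above (ours, with illustrative names) makes the odd-track bookkeeping explicit; rising and falling are the nets crossing the column boundary, and d is the channel density.

    def odd_track_assignment(rising, falling, d):
        # Between columns j and j+1: with tracks numbered 1..2d-1, the odd
        # tracks are 1, 3, ..., 2d-1 (bottom to top); the top-most r of them
        # carry the rising nets, the bottom-most f the falling nets, and the
        # d - r - f odd tracks in the middle stay empty.
        odd = list(range(1, 2 * d, 2))
        r, f = len(rising), len(falling)
        assert r + f <= d  # at most d nets cross any column boundary
        tracks = {}
        for i, net in enumerate(rising):
            tracks[net] = odd[d - r + i]  # top-most r odd tracks
        for i, net in enumerate(falling):
            tracks[net] = odd[i]          # bottom-most f odd tracks
        return tracks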
It now remains to demonstrate that this set of invariant properties can be maintained as the
algorithm proceeds from column to column. If a column contains a trivial net, the net is wired
straight across the column, using the even-numbered tracks to change layers as necessary. No
other wiring is needed in such a column.
If a falling net N_i = (p_i, j) enters column j from column j − 1 on track t_z, the algorithm drops a
vertical connection from grid point (j, t_z) down to grid point (j, 0). The algorithm then "closes up
ranks" in column j so that all the empty odd tracks are in the middle of the channel. Figure 2
illustrates how such a wiring can be generated. Rising nets with entry coordinate j are handled
similarly.
Finally, any rising net N_i = (p_i, j) is routed in column j with a vertical connection from grid
point (j, 0) up to grid point (j, t_w), where t_w is the top-most odd track which would be empty (i.e.
contain no horizontal segment between grid points (j, t_w) and (j + 1, t_w)) if net N_i were not present.
Similarly any falling net is routed down to the lowest odd track that would otherwise be empty.
If both of these situations occur in the same column, a modest amount of "overlap" is required as
indicated in Figure 3. However, the situation of Figure 3 is the only place where overlap is
needed.
Figure 2.  Figure 3.
Theorem 1:
Algorithm 1 is guaranteed to compute a solution to CRP having channel width no more than
four times optimal.


Proof:
The proof of Theorem 1 follows directly from the above discussion.
At this point it should also be clear that the running time of Algorithm 1 is bounded by a
polynomial in m, the number of columns which must be processed. In practice, however, we only
need process columns having index equal to some net entry or exit coordinate. Thus with the
appropriate output representation, Algorithm 1 is O(dn) for a channel routing problem containing
n nets and having density d.
Although Algorithm 1 never generates a solution with channel width more than four times
optimal, it does generate solutions containing as many as d·n contact points. Further, it generates
solutions containing overlapping parallel runs as long as length 2d − 1. In the remainder of this
paper we present algorithms which cope with these two problems independently.
IV. Bounding the Number of Contacts

In this section we will describe a polynomial time algorithm which, like Algorithm 1, is
guaranteed to compute a solution to CRP having channel width no more than four times optimal,
but unlike Algorithm 1 requires no more than 4n total contact points. This new algorithm
employs the same basic approach as Algorithm 1, and thus its description will be facilitated by
simply noting the differences between the two algorithms.
Algorithm 2.
Similar to Algorithm 1, this algorithm proceeds column by column, routing all nets which cross
j in step j. Further, a solution generated by Algorithm 2 will have essentially the same properties
as a solution generated by Algorithm 1, with only two significant exceptions. The first of these
exceptions is that all horizontal segments belonging to wires of falling nets (with the possible
exception of the top-most such segment in each column) will be of type metal. A similar
property will hold for rising nets and poly horizontal segments. The second significant exception
is that for each column j there may be at most one distinct horizontal segment which is associated
with a falling net and connects columns j and j + 1 while lying on an even-numbered track.
Further, the net of such a segment will not have exit coordinate equal to j + 1, and the odd-numbered
track immediately below the segment will be empty between columns j and j + 1. A
similar property will also hold relative to rising nets.
The maintenance of this new set of invariant properties requires a somewhat different set of
wiring rules from those employed by Algorithm 1. Consider the case where a falling net
N_i = (p_i, j) enters column j from column j − 1 on track t_z. As in the previous algorithm, a vertical
connection is dropped from grid point (j, t_z) down to grid point (j, 0). Notice, however, that at
most one contact point will be required along this connection since all segments which must be
crossed will have the same type. The algorithm must now "close up ranks" so that all blank
tracks remain in the middle of the channel. It should be clear that the technique employed by
Algorithm 1 in solving this problem can be of no use here. However, the problem can be easily
solved by dropping a vertical connection from the top-most track containing a falling net which
crosses j down to grid point (j, t_z + 1), as shown in Figure 4a. The only problem that occurs is
when the net to be dropped has exit coordinate equal to j + 1. In this case, however, the
algorithm simply drops the next lower net (if any) as shown in Figure 4b. Rising nets with entry
coordinate j are handled similarly and all other cases are handled as in Algorithm 1.

Figure 4.
Theorem 2:
Algorithm 2 is guaranteed to compute a solution to CRP with channel width no more than
four times optimal and with no more than 4n total contact points.
Proof:
The proof of Theorem 2 follows from the above discussion and a more detailed case analysis
of the wiring rules applied within each column.
Finally, we note that Algorithm 2 has time complexity O(dn) for a channel routing problem
containing n nets and having density d.
V. Reducing Overlap
Let us now assume that we wish to compute a solution to CRP which has minimal channel
width and no segment overlap. In this section we will describe a polynomial time algorithm
which is guaranteed to compute a solution to CRP having channel width no more than four times
optimal and requiring only "corner overlap". However, the number of contact points required by
this algorithm will be O(d·n) rather than O(n).
Algorithm 3.
This algorithm proceeds track by track rather than column by column. The processing at each
step involves a pair of adjacent tracks, i and i + 1, such that i is odd. Furthermore, the algorithm
proceeds bottom-up beginning with tracks 1 and 2. At each step the algorithm extends all
existing wires across both track i and track i + 1, in such a way that the density of the subproblem
between track i + 1 and the top of the channel decreases. This reduction in density will result
from horizontal wire extension along the odd-numbered track. Once again the final solution will
have the properties that t = 2d − 1 and there will be horizontal wire segments lying only on odd-numbered
tracks; the even-numbered tracks will be used solely for layer changes along vertically
running wires.
When the algorithm begins processing a pair of tracks i and i + 1, there will exist exactly n
distinct vertical segments connecting a grid point in track i − 1 to a grid point in track i. Further,
each of these segments will belong to a distinct wire. Since track i − 1 is even-numbered and thus
used solely for layer changes, we note that the type, poly or metal, of each of these segments can
always be assigned as a function of the horizontal routing in track i. We will now describe the
procedure for routing nets across track i.


The processing of track i is performed in either a left-to-right or a right-to-left fashion,
depending on how track i − 2 was processed. The processing direction for track i is initially set to
be the opposite of that for track i − 2.
Let us assume that track i is to be processed in a left-to-right fashion; an analogous procedure is
employed for the right-to-left case. Further, assume that column j_1 is the left-most column
containing a vertical segment connecting grid point (j_1, i−1) to grid point (j_1, i) and belonging to a
rising net N_k = (p_k, q_k) for which p_k > j_1. Thus net N_k requires extension to the right. If no such
column exists then track i is processed in a right-to-left fashion.
Now let W_k denote the wire associated with net N_k. Note that W_k ends at grid point (j_1, i). The
algorithm then simply extends wire W_k horizontally to the right from grid point (j_1, i) until it
reaches either column p_k (the entry coordinate of net N_k) or a column j_2 containing the terminus
of a wire W_r for a net N_r = (p_r, q_r) with p_r > p_k (i.e. W_r is a wire which must be extended farther
right than W_k).
In the latter case wire W_k ends at column j_2 and wire W_r is extended to the right in a manner
similar to the extension of W_k. In the former case wire W_k ends at column p_k and the algorithm
searches to the right for the first wire requiring some extension.
Let column j_3 denote the left-most column (if any) such that j_3 ≥ p_k and the point (j_3, i) is the
terminus of a wire W_s for a net N_s = (p_s, q_s) with p_s ≠ j_3. Thus wire W_s requires some horizontal
extension, either to the right or to the left. Further, if j_3 > p_k then N_s must additionally be a rising
net, so that W_s requires extension to the right. The wire W_s is then the next wire to be extended.
The only difference in the manner of extension occurs when N_s happens to be a falling net. In
this case W_s is extended to the right only until it reaches a column j_4 such that the point (j_4, i) is
not the terminus of a wire for a net with entry coordinate equal to j_4. This will allow W_s to
extend to the left, without generating segment overlap, when track i + 2 is processed.
Once the processing of track i has been completed, all wires are extended vertically across
track i + 1 and the horizontal processing of track i + 2 begins. The entire procedure for tracks i
and i + 1 is illustrated in Figure 5. Notice that a wire W for a net N = (p, q) is never extended
horizontally once its terminus lies in column p. Therefore, Algorithm 3 terminates when no
further horizontal extension is necessary.
Figure 5.
Theorem 3:
Algorithm 3 is guaranteed to compute a solution to CRP with channel width no more than
four times optimal and requiring only "corner overlap".
Proof:
It follows directly from the above discussion that Algorithm 3 will always generate a solution
in which the only type of overlap is corner overlap. The upper bound on channel width then
follows from the observation that the density between track i and the top of the channel is strictly
decreasing as the algorithm proceeds and i increases.
We now point out that Algorithm 3, like the previous two algorithms, has time complexity
O(dn). Unlike the previous two algorithms, however, this algorithm may generate wires which
are non-monotonic (i.e. weave back and forth across the channel), thus resulting in increased total
wire length.
VI. Conclusions
We have presented three channel routing algorithms which are guaranteed to compute a wiring
requiring no more than four times the optimal channel width. Furthermore, one of these
algorithms requires only a small number of contact cuts and another requires only a minimal
amount of overlap. However, many open questions still remain:
(1) Can the upper bound be improved (e.g. to 3d/2)?
(2) Can this bound be proved in more restricted wiring models (e.g. the model of
[D76])?
(3) Can this bound be proved for multi-terminal nets?
(4) Can both the number of contact cuts and the amount of overlap be simultaneously
minimized?
VII. Acknowledgements
We would like to thank Charles Leiserson, Ron Pinter and Brenda Baker for helpful discussions.
VIII. References
[D76] Deutsch, D. N., "A Dogleg Channel Router," Proceedings of the 13th IEEE Design
Automation Conference (1976), 425-433.
[DKSSU81] Dolev, D.; Karplus, K.; Siegel, A.; Strong, A. and Ullman, J. D., "Optimal Wiring
Between Rectangles," 13th Annual ACM STOC Proceedings (Milwaukee, 1981), 312-317.
[HS71] Hashimoto, A. and Stevens, J., "Wire Routing By Optimizing Channel Assignment
Within Large Apertures," Proceedings of the 8th IEEE Design Automation Workshop (1971),
155-169.
[T80] Tompa, M., "An Optimal Solution to a Wire-routing Problem," 12th Annual ACM
STOC Proceedings (Los Angeles, 1980), 161-176.

This research was supported by NSF grant MCS78-05849 and by DARPA grant N00014-80-C-0622.
Optimal Routing in Rectilinear Channels

Ron Y. Pinter
Massachusetts Institute of Technology
Laboratory for Computer Science
Cambridge, Massachusetts 02139-1986

Abstract: Programs for integrated circuit layout typically have two phases: placement and
routing. The router should produce as efficient a layout as possible, but of course the quality of
the routing depends heavily on the quality of the placement. On the other hand, the placement
procedure would like to know how good a routing it can expect without actually routing the wires.
This paper presents fast algorithms for optimal routing and for accurately estimating the area cost of
such routings without actually laying them out.
The most common types of junctions occurring in layouts are T-shaped or X-shaped; this paper
presents efficient algorithms to measure and produce the optimal rectilinear, two-layer routing in
channels formed around such junctions. The ability to do this is based on the new notion of pairwise
ordering, which is used to propagate routing constraints from one part of a channel to the rest, and
alleviates a fundamental problem plaguing traditional channel routers. In addition we present a
greedy algorithm for optimal routing in rectangles with a new type of terminal ordering which comes
up frequently in practice but has not been studied before.

1. Introduction
The most common methodology for solving the layout problem for integrated circuits is to
decompose it into two subproblems: placement and routing (for a definition of the layout problem
see, for example, [LaP80]). For complicated VLSI circuits, we may need to form a hierarchy of such
problems (as in [Pr79]). At each level we are given pieces of the circuit, called modules, which have
been laid out at the preceding level, and have to lay them out to form a module for the next level up.
Each module has terminals located along its boundary, and each terminal is associated with a signal
net. In this methodology, we first place the modules, i.e. decide their geometric positions on the chip,
and then route paths to interconnect terminals of common signal nets. The objective is to minimize
the total area required to realize the circuit subject to various design rules. An example of a typical
placement is given in Figure 1.

This research was supported in part by the Defense Advanced Research Projects Agency under
Contract No. N00014-80-C-0622.

Naturally, the placement and routing problems are strongly related: the way in which modules
are placed relative to each other may affect dramatically the quality and difficulty of the routing
phase. Ideally, we would solve the routing problem for each proposed placement (while looking for
an optimal one), but this is obviously intractable. One reason why routing is a hard problem is its
global nature: the way one signal net is routed affects the potential solutions for other nets. Thus we
need good estimates for the area which is going to be needed for routing relative to a given placement
without spending the time needed to actually solve the routing problem.
Traditionally ([HaSt71],[Hi74],[Ri81]), a rectangular paradigm is used in VLSI design. Modules
have rectangular shapes, and the routing among them is achieved by partitioning the given space
into rectangular channels. Each such channel is assigned signals with terminals along its boundary
(some are common with modules, some with other channels), and the signals are routed inside. While
rectangular modules seem to be generally acceptable, rectangular channels pose difficulties for two
reasons: both the evaluation of the placement based on possible routings and the globality of the
routing problem are fragmented by the channel structure in a way that makes their solution even
harder to achieve.
Therefore, I propose that we shall start looking at routing problems in non-rectangular channels
(but still maintaining rectilinear sides). As long as modules are rectangular, such channels take one
of three general shapes: T, X or L (as indicated in Figure 1). While the latter is relatively easy to
handle, the other two are more complicated. Some instances of T's and X's yield to some interesting
theoretical analyses which are presented in this paper. In general, non-rectangular channels are
treated by partitioning them internally along edges and dealing with each section separately; the
edges are used to maintain constraints in such a way that overall optimality is achieved. We develop
a powerful algebraic abstraction for constraint propagation, called pairwise ordering, which is well
suited to the problem, and study it carefully.

Figure 1: T and X junctions appear frequently in a typical layout.



The theoretical research that has been done so far on routing in rectangles ([LaP80],[DKSSU81])
has paid little attention to configurations which are common in practice, and even less to the problem
of propagating routing constraints through the channel. In [DKSSU81] we find a polynomial-time
algorithm to solve a simple situation in which the ordering of the terminals with respect to their
signal-nets is the same on two parallel edges. On the other hand, [LaP80] tells us that the problem
in general (and even if limited considerably) is NP-complete. In order to fill the gap, we should
examine routing of useful patterns both in the newly proposed channels and in rectangular ones.
After describing two relevant wiring models in Section 2, the three subsequent sections describe
polynomial-time algorithms for attaining optimal routings for certain configurations in the following
three cases: a T-shaped channel (Section 3), an X-shaped channel (Section 4), and allowing arbitrary
ordering among the signals along a channel's side (Section 5). The notion of pairwise ordering is
defined and discussed in the beginning of Section 4. We conclude with a discussion of implications
on a methodology for generalized channel routing.

2. The Wiring Models


Throughout, we assume an underlying rectangular grid. A point in the plane is a grid-point if
both its coordinates are integers. A line in the plane of the form x = m or y = m, where m is an
integer, is a vertical or horizontal, respectively, grid-line.
A module is a polygon all of whose sides are either parallel or perpendicular to each other;
furthermore, the dimensions of all sides are integral. Also, a set of distinct points, called terminals, is
associated with each module such that each terminal lies on one of the module's sides at an integral
distance from (either of) its ends. The terminals are usually labeled by names of signal-nets. Thus,
if we pick an arbitrary corner of the module* and align the two sides meeting at it with a horizontal
grid-line and a vertical grid-line, all terminals and other corners will fall at grid points; furthermore,
all sides will coincide with segments of either vertical or horizontal grid-lines.
A path between two terminals, P and Q, is a sequence of grid-points R_1, R_2, ..., R_k such that
P = R_1, Q = R_k and each two successive points, R_i, R_{i+1} for i = 1, ..., k − 1, lie on the same
grid-line. These line segments (including their end points) are called the path's segments and are
denoted by [R_i, R_{i+1}] for i = 1, ..., k − 1. Furthermore, the points R_i (for i = 1, ..., k) are
called turning points, and we insist that at each point a vertical segment meets a horizontal one.
A placement of a set of modules is an assignment of each module to a portion of the grid such
that its corners coincide with grid points and its sides with grid lines, and no two (areas covered
by the) modules overlap. Furthermore, no two sides of modules or their corners may coincide. Thus
the distance between any two points within or on the boundary of two modules is at least 1. The
orientations of the modules may be constrained or not, but this is of no interest to us in this paper.
Once a placement on the grid is given, we can subdivide the remaining area into channels.
Channels are polygonal, non-overlapping areas with corners and sides conforming with the grid
(as modules do). Channels may share sides (and corners) with each other and with modules. A
terminal belonging to a module whose side is shared with a channel's side is now associated with
that channel's side. Obviously, the channels and modules do not cover the whole (infinite) grid; the
objective of the layout problem is usually to minimize the area of the smallest rectangle enclosing the
ensemble.

Now let us define the wiring rules: For any two paths (all four terminals involved are distinct)
no turning point of one path may lie on a segment of the other; thus, the paths may cross each
other, but cannot share segments, go through each other's terminals or turn at the same grid-point
(see Figure 2(a)). This is somewhat different from the model in [Th80] where turning points may
be shared (see Figure 2(b)), but our model conforms with the traditional Manhattan wiring model
([Hi74]) in which two layers are being used, one for each direction (preassigned). In current
technology (e.g. nMOS, see for example [Me80]) connections between the layers are facilitated by
vias (or contacts). If we adhere to the convention of one layer per direction, we had better refrain
from making two turns at the same point (causing two vias to overlap). Some of our results are
affected only mildly by this divergence from Thompson's model, but some are quite sensitive to it.

(a) legal in both models    (b) legal only in Thompson's model

Figure 2: The two wiring models.
Finally, a path running in a channel cannot share any of its segments with any of the sides of
the channel; thus the first segment of a path starting at a terminal is always perpendicular to the
terminal's side.
All in all, we have a rectilinear, two-layer model for routing among the modules, which
conforms with a unit separation design rule throughout the layout.
3. Routing in T-shaped Channels

A T-shaped channel is shown in Figure 3. Its sides are named top, flanks, legs and ends, for
obvious reasons; these names are qualified by the appropriate directions when needed. In general,
terminals can lie on any side, and routing the channel means to connect terminals with the same labels
to each other using paths lying within the channel, obeying the design rules.
The decision problem associated with routing a channel is: Can we route it in its given dimensions?
A procedure for solving this can be relevant in the placement phase of layout. The related
minimization problem is somewhat more interesting: How can we minimize the area required in
order to route the channel? This gives rise to the following ambiguity: Even if we assume that the
sides of the channel (except for the ends) belong to three modules in the natural way (as implied
by Figure 3), what movements of the modules are allowed? When does the channel cease to have
a T-shape? First, we decide that the flanks of the channel will remain aligned (i.e. share the same
grid-line). Second, it seems unnatural to use the absolute area as the optimality criterion for various
reasons (see [Pr79]). Thus it is natural to consider the distance between the legs (denoted w_B, for
bottom width) and the distance between the top and the flanks (denoted w_T, for top width) as our
criteria in a way to be described in the following paragraph.

Figure 3: A T-shaped channel and its boundary.

Moreover, it is obvious that changing the distance between the lower modules (i.e. changing
the leg-to-leg distance) may affect the routing of signals going to the top and the flanks. Thus our
strategy will be to minimize w_B first, and then minimize w_T with respect to it. This approach is most
practical in most design situations and is also likely to approximately minimize other interesting cost
functions, such as area. Notice that minimizing w_T first (by setting it to 0) will flatten out the T, i.e.
make it into a rectangular channel by pushing the lower modules outward to the ends of the upper
one. Also, once w_B is known, we can fix the horizontal location of the lower modules with respect to
the top one, forming a solid T. Finally, we shall see that w_T tends to be much smaller than w_B; thus
minimizing w_B first in an unconstrained manner is preferable from the placement procedure's point
of view (since it is better in preserving the T-shape).
Now we restrict ourselves to two-point nets, i.e. to instances of the problem where each signal-net
name can appear as the label of exactly two terminals. Also, for sake of simplicity, we exclude
the ends as possible sides for terminals to lie on (they can be added at a later stage). Assuming no
net connects two terminals lying on the same side or on two adjacent sides of the channel (this is
reasonable if these are the sides of single modules), the nets can be divided into five cases according
to which kinds of sides they connect:
(i) top to flanks
(ii) top to legs
(iii) flanks to legs (left to right and right to left only)
(iv) flank to flank
(v) leg to leg
The most interesting case to consider is (ii); cases (i), (iv) & (v) are embedded in standard
rectangular routing, whereas (iii) is essentially a restriction of (ii). We shall solve case (ii) in the rest of
this section.

3.1. Terminology and Notation

This subsection summarizes the terminology required for describing and discussing the routing
results presented in the next subsection. We assume here that the flanks have been positioned relative
to the top; this makes sense since the leg-to-leg distance will be computed immediately from the
input without involving any of the notations to follow. Most of the terminology is summarized in
Figure 4.

Figure 4: More terminology for T-shaped channels.
Here (B,G) is an aligned pair inducing a conflict; (A,F) is an aligned
pair not inducing a conflict. n = 8, n_l = n_r = 3, n_m = 2.

Definition 1. All terminals on the legs (excluding the corners) are called B-terms (for bottom-terminals),
and the terminals on the top, T-terms (for top-terminals). We denote by n the number
of signal pairs; thus there are n B-terms and n T-terms. T-terms whose x-coordinate is within the
range of the left (right) flank (including end points) are called L-terms (R-terms); the rest of the T-terms
are called M-terms (for middle-terminals). We denote the number of L-terms, R-terms and
M-terms by n_l, n_r and n_m, respectively (thus n = n_l + n_r + n_m).
Definition 2. Two B-terms with the same y-coordinate are called an aligned pair. Alignments
of B-terms induce pairing between the corresponding T-terms (i.e. the T-terms bearing the same
labels). If the x-coordinate of the top terminal corresponding to the bottom terminal lying on the
right leg is smaller than that (i.e. is to the left) of the top terminal corresponding to the bottom
terminal lying on the left leg, the pair is considered to be a conflicting pair. In symbols: let S_1, S_2
be two signal nets, the top terminals of which are S_1^T, S_2^T, respectively, and the bottom terminals are
S_1^B, S_2^B, respectively. If S_1^B lies on the left leg, S_2^B on the right leg and they are (y-)aligned, then
S_1 and S_2 constitute a conflicting pair if S_2^T lies to the left of S_1^T. We classify such conflicting pairs
according to the subclasses their T-terms fall into: for all possible combinations of X, Y ∈ {L,R,M}
an XY-pair is a conflicting pair in which one T-term is an X-term and the other a Y-term*; e.g. the
pair (B,G) in Figure 4 is an LR-pair. We order the pairs according to the positions of their T-terms,
i.e. we write (S_1, S_2) if the x-coordinate of S_1^T is smaller than that of S_2^T. The number of XY-pairs is
denoted by n_xy; thus there are n_lr LR-pairs.
Definition 3. The grid-line segment going from the left end-point of the right flank to the
right end-point of the left flank (dashed in Figure 4) is called the crossing edge. The part of the
channel above it is called the top part, the one below it, the bottom part. The extension of the right
(left) leg upwards, until it hits the top (dotted in Figure 4), is called the right (left) edge; the portion
of the top part to its right (left) is called the right (left) portion. The portion between the right and the
left edges is called the central portion. A grid point residing on an edge and coinciding with a routing
path is called a crossing point.
Definition 4. A grid-line segment enclosed by either the top part or the bottom part of the
channel is called a track; the tracks' orientations (i.e. horizontal or vertical) are relative to the T-shape,
not to its parts.

3.2. Routing results

As noted above, the bottom part of the channel is routed first. By assigning one vertical track for
each signal, we can easily route the B-terms to the crossing edge, attaining w_B = n + 1 by sharing
horizontal tracks between aligned pairs regardless of the positions of the other B-terms (see Figure
5); the only constraint this imposes on the crossing edge is that aligned pairs will appear on it in the
order corresponding to their legs. We shall see how to handle this, from the top part's point of view,
in what follows, whether the pair is conflicting or not. The ordering among the other signals may
be arbitrary, and this will be taken advantage of heavily. Since the number of signals having to cross
the common edge is n, and they have to abide by the design rules, we have

Lemma 1. The bottom part of a T-shaped channel can be routed optimally (w_B = n + 1).

Figure 5: Routing the bottom part and propagating the resulting constraints to
the crossing edge.
This is of no consequence unless we can route the top part efficiently. First we notice that signals
corresponding to L-terms and R-terms can be assigned to arbitrary horizontal tracks in the left and
right portions (respectively) without any loss of optimality (of w_T) there: each such (non-M-)term
has to jog in order to get to the corresponding edge, so it needs a horizontal track; since there are
no conflicts with respect to vertical tracks, we have complete freedom in the way we assign these
horizontal tracks. Of course, some logic has to be applied in order to avoid chaos in the central
portion if we want (naturally) to use the same horizontal tracks. Now the first problem which comes
to mind is how to accommodate conflicting pairs. Let us consider LR-pairs first: at first it seems that
routing them will require as many as 2n_lr tracks in the central portion, as might be implied by Figure
6(a), but we have:

Lemma 2. The LR-pairs can be routed from the left and right edges using n_lr + 1 horizontal
tracks, one of which is unused to the left of the leftmost crossing point and another to the right of
the rightmost crossing point. This is optimal.
Proof outline: The lemma is proven by the construction of Figure 6(b); notice that nothing can
be gained by interleaving the crossing points of different pairs. The construction can be described as
follows: The signal corresponding to the L-term of the first pair is routed all the way to the rightmost
vertical track, and jogs down; the signal corresponding to the R-term of the pair is assigned to the
next higher track and jogs down immediately to the left of the former signal, thus forming the desired
crossover. The second pair uses for its L-term the latest track that was used for the former R-term,
thus sharing a track, and the crossover is formed to the left of the former one, etc. Notice that (a)
there is no significance whatsoever to which order the pairs are picked in, and (b) we could have
likewise gone from left to right in the crossover ordering, putting signals for R-terms below the ones
for L-terms. The optimality is proven by induction. ■
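The bound of Lemma 2 is visible in the indices alone; the following fragment (ours) writes down the track plan the construction produces for k pairs.

    def lr_pair_tracks(k):
        # Track plan for k LR-pairs on k + 1 horizontal tracks: pair i's
        # L-term signal runs on track i, its R-term signal on track i + 1
        # (so pair i+1 reuses track i + 1), and pair i's crossover lies to
        # the left of pair (i-1)'s, i.e. crossovers are ordered right to left.
        return [{"pair": i, "L_track": i, "R_track": i + 1,
                 "crossover_from_right": i} for i in range(1, k + 1)]

Only tracks 1 through k + 1 appear, which is the claimed n_lr + 1.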

(a) a single pair    (b) interleaving many pairs. X_i is in conflict with Y_i (for i = 1, 2, 3).

Figure 6: Routing conflicting pairs in a T-shaped channel.

The next stage is the ordering of signals along the crossing edge: On the left we put those signals
associated with L-terms and not involved in conflicts with non-L-terms; likewise on the right. The
LR-pairs are being put in the middle. M-terms not involved in conflicting pairs are routed straight
down through the central portion. As for ordering on the left and right edges: Signals corresponding
to LR-pairs are put as low as possible. This leaves one free track either on the right or the left
(depending on the direction used in the construction of the crossovers, as in Lemma 2) which is used
by one (any) signal of the corresponding side; all other signals corresponding to L- and R-terms
are put above. This strategy yields the situation in Figure 7(a), which is abstracted schematically in
Figure 7(b).
(a) detailed layout    (b) schematic layout

Figure 7: Routing in absence of conflicting M-terms.

Thus, if we define

χ_T = 1, if n_lr ≠ 0 and n_l = n_r = n_lr;  χ_T = 0, otherwise,

we have
Theorem 1. If n_ml = n_mm = n_mr = 0, then* the width of the top part of a T-shaped
channel is w_T = max(n_r, n_l) + χ_T + 1.
Proof outline: Conflicting LL-pairs and RR-pairs are ordered properly on the crossing edge.
Tracks for signals of L-terms and R-terms are then assigned arbitrarily so as to form the situation
in Figure 7(b). If χ_T = 0 then either there are no LR-pairs to worry about, or the extra horizontal
track needed to accommodate the LR-pairs in the central portion is being used by a signal either on
the right or on the left. Only if χ_T = 1 are we forced to use an extra horizontal track. ■
Changing the wiring rules to Thompson's (in which two signals may share a common turning
point) does not affect this result.
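When no M-terms are involved in conflicts, Theorem 1 makes w_T a one-line computation; the following fragment (ours) evaluates it together with the indicator χ_T as defined above.

    def top_width_no_m_conflicts(n_l, n_r, n_lr):
        # w_T from Theorem 1, valid when n_ml = n_mm = n_mr = 0; chi_T = 1
        # exactly when there are LR-pairs and n_l = n_r = n_lr.
        chi_T = 1 if (n_lr != 0 and n_l == n_r == n_lr) else 0
        return max(n_r, n_l) + chi_T + 1  # +1: width is one more than #tracks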
Notice that we have opted to resolve conflicts only in the top part of the channel. Some such
conflicts, however, could have been resolved in the bottom part without loss of optimality there.
This is not done since exploring this possibility would complicate matters considerably.
* We add 1 at the end because we measure width, which is 1 more than the number of tracks.
Figure 8: Complete routing in a T-shaped channel.
(a) All conflicting pairs involving M-terms are simple. (b) A more complicated
case, in which M-terms are involved in conflicts with L- or R-terms of the far
portions; here n = 14 with n_l = 5, n_r = 5, and n_m = 4. Using the definitions
given in the text, we obtain χ_T = 0, φ_T = 1, and μ_T = 0 (since n_l = n_r); by
Theorem 2, w_T = 7.
The additional complexity does not justify the effort, since the best it can help is by reducing w_T by 1 for Theorem 1
(only if n_l = n_r, and then by resolving all conflicts) or 2 for Theorem 2 (by resolving most conflicts).
Thus, the notion of "optimal" in this section (and in the subsequent one) should be viewed relative to
this simplifying decision, but it is not far from being truly so.
The case in which M-terms are involved in conflicts is dealt with by extending the paradigm of
Figure 7. First, ML- and MR-pairs whose M-terms are above the crossing points allocated for the
corresponding L-term or R-term, respectively, can be accommodated in the appropriate ranges by
simply forcing assignments of crossing points to the corresponding L- or R-term (e.g. D and F in
Figure 8(a)). The number of ML- (MR-) pairs not handled in such a way is denoted by n′_ml (n′_mr),
and consequently we define n′_m = n_mm + n′_ml + n′_mr.
Other conflicts of the above kind and MM-pairs are handled one at a time in the remaining
tracks, making full use of track segments left free in the upper-middle part of the central portion.
The block of LR-pairs is pushed all the way towards the more congested edge. A greedy approach
in assigning tracks at this stage (on a pair-by-pair basis, putting the two crossing points as close to
each other as possible) is good enough to attain a minimal w_T; again, pairs share tracks as LR-pairs
did. The only trouble is with M-terms which are too close to the right and left edges and are involved
in conflicts with L-terms and R-terms, respectively, or appear in MM-pairs. Surprisingly, this might
cost us at most 1 extra horizontal track:
We say that an M-term is m-adjacent to the right (left) edge if all (possibly 0) grid-points between
it and the edge (along the top side) have M-terms located at them. Now we define μ_T^r = 1 if
an M-term involved in a conflict with an L- or M-term is m-adjacent to the right edge and n_r ≥ n_l,
and μ_T^r = 0 otherwise. μ_T^l is defined likewise by reversing the roles of left and right in the definition, and
μ_T = max(μ_T^r, μ_T^l). Also, φ_T = 1 iff both the left edge and the right edge have m-adjacent M-terms
involved in such conflicts and n_r = n_l.
Now we are ready to state
Theorem 2. The width of the top part of a T-shaped channel is given by

w_T = max(n_r, n_l) + max(χ_T, φ_T) + μ_T + max(0, n′_m − (n_lr − 1 − min(n_r, n_l))) + 1.


The addend before last* is due to the fact that there might be too many MX-pairs (for
X = M, L, R) to fit into previously allocated tracks, so the excess has to be allocated new tracks. Figure
8(b) shows an elaborate (but not exhaustive) case of routing in a T-shaped channel. The proof of
Theorem 2 is too complicated to be outlined here.
If we use Thompson's wiring model, μ_T and φ_T disappear from the sum; but since their product
is always 0, their sum is at most 1, and this effect is quite insignificant.
The dominant operation involved in the algorithms used to attain this optimum is sorting of the
terminals; thus the complexity of both routing and finding the optimum is O(n log n), but the actual
routing takes many more operations to complete. In case the terminals are presorted (which is not
uncommon), the complexity is O(n).
* The 1 at the end appears for the same reason as in Theorem 1.

4. Pairwise Ordering and Routing X-shaped Channels

The major notion emerging from the previous section is that of sharing the constraints of two
almost independent rectangular channel routing problems by an ordering of points on their common
edge. This ordering is very sparse, but extremely powerful.

Figure 9: The graphic representation of pairwise orderings and their union.
A = {a_1, ..., a_8}
W_1 = {(a_2, a_4), (a_5, a_6), (a_7, a_3)}
W_2 = {(a_2, a_5), (a_3, a_4), (a_6, a_8)}

Figure 10: An X-shaped channel and its parts.

Figure 11: Topologically sorting a cycle that does not create a conflict.

Definition 5. Given a set A, a pairwise ordering W of A's elements is a binary, antisymmetric
relation over A such that if (a_i, a_j) ∈ W then a_i ≠ a_j and neither a_i nor a_j appears in any other
member of W.
The interpretation of (A, W) as a directed graph induces an undirected graph with bounded degree
1 (see Figure 9(a)). The reason this algebraic structure represents channel routing constraints is
that exactly two signals are involved in each conflict, no signal is involved in more than one conflict,
and the conflicts are directional in nature.
An X-shaped channel (Figure 10) is most naturally partitioned into 5 portions: 4 arms and 1
central portion. If we ignore, for the time being, terminals on the channel's ends, the constraints
propagating from the arms inwards to the edges separating the arms from the central portion are
simply pairwise orderings. Again, if we restrict ourselves to two-point nets and ignore signal nets
connecting terminals in the same arm, the central portion is a rectangular channel with pairwise
orderings on its 4 sides. For sake of simplicity, we deal here only with nets having points on opposite
edges (but not adjacent ones).
Each edge of the central portion has a pairwise ordering associated with it. In routing this
portion, we have to satisfy these constraints. This can be studied by looking at the structure obtained
by taking the union of two pairwise orderings, W_1 ∪ W_2, defined on the same set of elements. This
is exemplified in Figure 9(b) and (c). The graph interpretation of the resulting structure induces an
undirected graph with bounded degree 2; thus it consists of isolated vertices, open paths and (even-length)
cycles. Open paths and cycles that are not directed cycles (i.e. there are at least two arcs going
in opposite directions) can be arranged on a line such that all arrows go in one direction by topologically
sorting the nodes (see Figure 11). Thus the union of two pairwise orderings corresponding to
opposite edges of the central portion of an X-shaped channel can be arranged in such a way that
signals in the central portion can go straight across unless there is a directed cycle.
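A small sketch of ours makes the test explicit: represent each pairwise ordering as a set of arcs, take their union, and report the directed cycles, the only components that cannot be straightened by a topological sort.

    def directed_cycles(w1, w2):
        # w1, w2: pairwise orderings as sets of arcs (a, b); each element has
        # at most one arc per ordering, so the components of the union are
        # isolated vertices, open paths, or even-length cycles.  A component
        # is a directed cycle iff its arcs chain head-to-tail around it.
        succ = {}
        for a, b in set(w1) | set(w2):
            succ.setdefault(a, []).append(b)
        cycles, seen = [], set()
        for start in list(succ):
            if start in seen:
                continue
            node, trail = start, []
            while node in succ and len(succ[node]) == 1 and node not in seen:
                seen.add(node)
                trail.append(node)
                node = succ[node][0]
            if node == start and trail:  # the walk returned to its start
                cycles.append(trail)
        return cycles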
Directed cycles are the only interesting case to look at. Obviously, a cycle involving k nets can be
routed rectilinearly using k + 1 tracks in one direction and 1 in the other (Figure 12(a)). This turns
out to be optimal for one cycle, whether we are using our wiring model (of Section 2) or Thompson's.
However, sharing tracks between cycles going in perpendicular directions turns out to be beneficial.
Before we proceed, let us introduce some notation: A cycle whose points are on the horizontal
(vertical) edges is called a vertical (horizontal) cycle, because of the directions of the signals. The
number of vertical cycles is denoted by v, and of horizontal ones, by h; also, c = h + v. Obviously,
we need at least h horizontal tracks and v vertical tracks to route across the central portion of an X.
Let Δv and Δh denote the number of extra vertical and horizontal, respectively, tracks needed to do
the routing.
Theorem 3. For our wiring model, Δv + Δh = c + 1, whereas in Thompson's model
Δv = Δh = 1. Moreover, in our model, any pair of values for Δv and Δh satisfying Δv + Δh =
c + 1, Δv, Δh ≥ 1 can be attained.
Proof outline: The construction follows the paradigm of Figure 12(b) and (c). Both cases attain
the claimed bound, and can be folded around ((b) horizontally and (c) vertically, as indicated by the
arrows) to attain all interim values. The result for Thompson's model is achieved by merging corners.
The optimality is proven by induction on c. ■
Using this result, different optimality criteria can be employed to achieve desirable layouts.
(a) One cycle needs an extra track in each dimension.  (b) Δh = h + 1, Δv = v.  (c) Δh = h, Δv = v + 1.

Figure 12: Laying out vertical and horizontal cycles in an X-shaped channel.
Here h = 2, v = 3.

5. Routing Across a Rectangle in Arbitrary Order

Consider the case in which a side of a rectangular channel belongs to a module in which the
order of the signals can be changed arbitrarily by easy changes to the module (e.g. I/O to a PLA, I/O
pads, data registers). Then a set of terminals is associated with a set of signals (of equal cardinality),
and we create the individual association between signals and terminals as part of the routing procedure.
In this case we might take advantage of the freedom we are given and reduce the width of the
channel that would occur if we were using an algorithm for fixed ordering (like in [DKSSU81]). This
is exemplified by Figure 13(a).
The key observation is that signals which are aligned across the channel have to be connected
straight across; the rest are processed in order (from one end to the other) and each is assigned the
highest available horizontal track as its (only) jog track. A more complicated example is shown in
Figure 13(b).

(a) A simple case using only one horizontal track.  (b) A more complicated case with E_C = 3 (due to x).

Figure 13: Routing across a rectangle with arbitrary terminal ordering (notice the
terminals have no labels).

We restrict ourselves to a channel, C, having terminals on two opposite sides only (like in
[DKSSU81]). For sake of presentation, let us assume these are the horizontal sides (as in Figure 13).
Let l_T(x_0) (similarly l_B(x_0)) be the number of terminals on the top (bottom) side of the channel lying
to the left of x_0 (i.e. whose x-coordinate is smaller than x_0). Then we define the excess number at a
point x = x_0 to be*

E(x_0) = |l_T(x_0) − l_B(x_0)|

(we could have similarly done this with terminals lying to the right of x_0).
The excess number of the channel, E_C, is defined as

E_C = max_{x_L ≤ x_0 ≤ x_R} E(x_0),

where x_L and x_R are the x-coordinates of the left and right ends, respectively, of the channel. Then
we have
Theorem 4. The number of horizontal tracks needed to route C is exactly E_C, and this is
optimal.
Proof outline: This number of tracks can be attained by first routing aligned terminals straight
across and then assigning horizontal jog tracks using the greedy algorithm mentioned above. Note that
this algorithm does not cause two vertical tracks to overlap, as opposed to the case in [DKSSU81] (see
Figure 13(b)). Also, no wire passes through more than two contacts. The lower bound is proven by
drawing vertical line segments through the channel and showing that at least at some point as many
as E_C signals have to be routed from its left side to its right side, thus forcing us to use as many as E_C
tracks. ∎
The calculation of the excess number is linear once the terminals are sorted. The assignment
of tracks, however, takes time O(n log n) (where n is the number of terminals) to allow for the main-
tenance of the priority queue holding the free tracks. So all in all we have an O(n log n) algorithm,
but in case the terminals are presorted, evaluating the channel's width (without routing it) takes
linear time.
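A minimal Python sketch of the width computation (the function name and data layout are an
editor's illustration, not from the paper) makes the sweep concrete: it evaluates E_C in a single
left-to-right pass over the sorted terminal positions, in keeping with the linear-time claim above.

```python
from collections import Counter

def excess_number(top, bottom):
    # top, bottom: x-coordinates of the terminals on the two horizontal
    # sides of the channel.  E(x0) = |l_T(x0) - l_B(x0)|, evaluated just
    # past each terminal; E_C is the maximum over the whole channel.
    delta = Counter()
    for x in top:
        delta[x] += 1
    for x in bottom:
        delta[x] -= 1          # an aligned terminal pair cancels here
    best = diff = 0
    for x in sorted(delta):    # sweep left to right
        diff += delta[x]       # running l_T - l_B just right of x
        best = max(best, abs(diff))
    return best

# Three top terminals packed left, three bottom terminals packed right:
print(excess_number([1, 2, 3], [6, 7, 8]))   # -> 3 horizontal tracks
```

By Theorem 4, the value returned is exactly the number of horizontal tracks the channel needs;
the greedy jog-track assignment itself would add the O(n log n) priority-queue bookkeeping
described above.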
This result can be generalized to dealing with disjoint sets of signals, where the order of terminals
within each set is arbitrary but we are not allowed to mix terminals from different sets. This is achieved
by modifying the definition of the excess number to accommodate this constraint. Furthermore, the
excess number has the same convexity property as the conflict number discussed in [DKSSU81],
thus it can be used to solve the offset problem in the same fashion as described there.

6. Conclusions
We have shown that optimal routing for some configurations in rectilinear-polygonal channels
can be obtained efficiently. Technically, some surprisingly compact, yet simple, routing patterns were
discovered. Although most seem to be ad hoc, they share a common flavor which is induced by
the pairwise ordering introduced to represent constraint propagation. The results obtained are truly
optimal, i.e. not just in order of complexity (as in [Lei80]).
Surprisingly, this can be achieved by decomposing the polygons in a natural way, and solving
the parts almost independently while maintaining the constraints that must be shared in a simple
* x_0 need not be integral; in fact, we have to look at points of the form l + 1/2 where l is an integer. Moreover,
it is superfluous to look at integral points: it is enough to look at points right before (−ε) and right after (+ε)
terminals, for the application to follow.

Figure 14: Possible routing configurations for rectangular parts of rectilinear chan-
nels. p.w.o. stands for pairwise ordering; a two-headed arrow indicates
that routing occurs between the sides pointed at. (a) Center portion of a T-
shaped channel as discussed in Section 3. (b) Center portion of a T-shaped
channel with terminals on flanks. (c) Center portion of a general X-shaped channel.

(a) a T (b) an X

Figure 15: Skewed rectilinear channels.



manner. This gives rise to a general methodology according to which the routing area of a chip
will be divided into polygonal channels, which in turn will be subdivided into rectangular parts.
The original channels will be used to form routing constraints in terms of orderings on the sides of
these rectangles. The types of orderings on the sides of a rectangle and their interaction (in terms
of common signal nets) induce a typing of rectangles. For example, the center portion of the T-
shaped channel in Section 3 may be described as arbitrary-fixed-arbitrary-pairwise (going clockwise
from the left edge) where nets are split between the first three sides and the last one (Figure 14(a)).
Allowing terminals to reside also on the flanks yields a pairwise-fixed-pairwise-pairwise (with similar
net splitting) description for the same portion (Figure 14(b)). An X-shaped channel in which two-
point nets can be split in any way between two different sides yields a pairwise-pairwise-pairwise-
pairwise description with the aforementioned net interaction (Figure 14(c)). Such types can be
characterized in terms of the complexity of their optimal routing problem. Some, as we have seen,
can be routed optimally efficiently, but other configurations may be intractable (e.g. instances of NP-
complete problems); still, good heuristic solutions will be helpful.
For this method to be effective we may need to allow channels to overlap (relaxing the definition
given in Section 2). The common areas will reflect constraints arising from more than one polygonal
channel which have to be solved simultaneously; trying to find independent solutions and piecing
them together is clearly a bad idea. Although this complicates matters slightly, the types of rectangles
are essentially the same and the general methodology applies.
A further direction is to consider parameterized modules ([Goo81]) which can be integrated into
the constraint propagation methodology to enhance the interrelation between placement and routing
even further. Other interesting cases are skewed T's and X's (Figure 15(a) and (b), respectively) in
which a side of an internal rectangle might be further subdivided. Solving the offset problem (as
defined by [DKSSU81] in a limited context) for such channels is another extension.

Acknowledgments. I am grateful to Charles Leiserson for suggesting some problems that initiated
this research, and would like to thank him and Ron Rivest for many helpful discussions.

References

[DKSSU81] Dolev,D., Karplus,K., Siegel,A., Strong,A. & Ullman,J.D.: Optimal Wiring Between
Rectangles; Proceedings of the Thirteenth Annual ACM Symposium on Theory of
Computing, May 1981, pp. 312-317.
[Goo81] Goodhue,E.: Private communication (February 1981).
[HaSt71] Hashimoto,A. & Stevens,J.: Wire Routing by Optimizing Channel Assignment within
Large Apertures; Proceedings of the Eighth Design Automation Workshop, IEEE, 1971,
pp. 155-169.
[Hi74] Hightower,D.: The Interconnection Problem: A Tutorial; Computer, Vol. 7, No. 4 (April
1974), pp. 18-32.
[LaP80] LaPaugh,A.S.: A Polynomial-time Algorithm for Optimal Routing Around a Rectangle;
Proceedings of the Twenty-first Annual IEEE Symposium on Foundations of Computer
Science, October 1980, pp. 282-293; also in Algorithms for Integrated Circuit Layout:
An Analytic Approach, MIT/LCS/TR-248 (Ph.D. dissertation), November 1980.

[Lei80] Leiserson,C.E.: Area-Efficient Graph Layouts (for VLSI); Proceedings of the Twenty-
first Annual IEEE Symposium on Foundations of Computer Science, October 1980, pp.
270-281.
[MC80] Mead,C. & Conway,L.: Introduction to VLSI Systems; Addison-Wesley, Reading,
Mass., 1980.
[Pr79] Preas,B.T.: Placement and Routing Algorithms for Hierarchical Integrated Circuit Lay-
out; Computer Systems Laboratory Technical Report No. 180/SEL-79-032 (Ph.D.
dissertation), Stanford University, August 1979.
[Ri81] Rivest,R.L.: The "PI" (Placement and Interconnect) System - Progress Report; un-
published manuscript, M.I.T., May 1981.
[Th80] Thompson,C.D.: A Complexity Theory for VLSI; Technical Report CMU-CS-80-140
(Ph.D. dissertation), Carnegie-Mellon University, August 1980.
New Lower Bounds for Channel Width
Donna J. Brown Ronald L. Rivest
University of Illinois Massachusetts Institute of Technology
Coordinated Science Laboratory Laboratory for Computer Science
Urbana, Illinois 61801 Cambridge, Massachusetts 02139

ABSTRACT
We present here a simple yet effective technique for calculating
a lower bound on the number of tracks required to solve a given channel-
routing problem. The bound applies to the wiring model where horizontal
wires run on one layer and vertical wires run on another layer. One of
the major results is that at least √2n tracks are necessary for any
dense channel routing problem with n two-terminal nets that begin and
end in different columns. For example, if each net i begins in column
i and ends in column i+1, at least √2n tracks are required, even though
the channel "density" is only 2. This is the first technique which can
give results which are significantly better than the naive channel den-
sity arguments. A modification results in the calculation of an im-
proved bound, which we conjecture to be optimal to within a constant
factor.
I. INTRODUCTION
The "channel-routing" problem has recently attracted a great
amount of interest and is becoming increasingly important with the
advent of VLSI. The results of this paper are of both practical and
theoretical interest. On the practical side, the techniques allow a
channel-routing algorithm to estimate more accurately a bound on the
number of tracks required to solve a given problem, and thus to know
when to stop looking for an impossibly good solution. From a theoreti-
cal point of view, this paper makes two points. The first is that
channel "density" is not the only factor determining the limits of
channel-routing performance in this wiring model; we must also consider
how many nets must "switch columns" in order to be routed. The second
point is closely related: the "traditional" wiring model - which we
study here - seems to be in some significant sense provably worse than
related wiring models where nets can overlap slightly (say at corners).
In these models twice channel density is provably an upper bound on the
number of tracks required [RBM81].
Related work has been done by, among others, [HS71], [D76], [T80],
and [DKSSU81].
II. DEFINITIONS AND THE WIRING MODEL
The (infinite) channel of width t consists of (1) the set V of
grid points (x,y) such that x and y are integers and 0 ≤ y ≤ t+1,
−∞ < x < ∞, and (2) the set E of edges connecting points (x,y) and (x',
y') whenever these points are at distance 1 from each other and y and y'
are not both equal to 0 or t+1. Figure 1 shows a channel of width 4.
If the width of the channel is t, we say that the channel has t tracks;
track i (for 1 ≤ i ≤ t) consists of all grid points with y=i and the


(horizontal) edges connecting these points.


A (two-terminal) net N_i consists of a pair of integers (p_i, q_i).
The intent is that a net specifies that a connection must be made be-
tween the point (q_i, t+1) and the point (p_i, 0); these points are the
terminals of the net.

A connection is made by a wire; a wire is defined to be a simple
path (v_0, e_0, v_1, e_1, ..., v_{k−1}, e_{k−1}, v_k) connecting v_0 to v_k.
(Here e_{i−1} ∈ E is the edge connecting grid points v_{i−1} and v_i.) A
channel-routing problem is defined to be a set of nets (with p_i ≠ p_j
and q_i ≠ q_j for i ≠ j). A solution to a channel-routing problem is an
integer t and a
i ~ j). A solution to a channel-routing problem is an integer t and a
set of wires in the channel of width t, such that one wire connects the
terminals of each net, and satisfying the restriction that two distinct
wires can meet at a grid point only if one wire has only vertical edges
touching that grid point and the other wire has only horizontal edges
touching that grid point. (This grid point is then a crossover point.)
This corresponds to the traditional model using one layer for
horizontal wires and another for vertical wires.
Given a channel-routing problem, it is desired to find the least
t permitting a solution. Szymanski [S81] has proved that this minimi-
zation problem is NP-hard if each net may require connection of an
arbitrary number of terminals; it is natural to conjecture (but as yet
unproven) that it is also NP-hard if each net connects only two
terminals (as in our case).
An obvious lower bound on the minimum achievable channel width is
the channel density: this is the maximum (over x) of the number of nets
N_i = (p_i, q_i) for which p_i ≤ x < q_i or q_i ≤ x < p_i; i.e., the maximum
number of nets whose wires must cross or touch some vertical line x in
order to make the necessary connection. Previous to this paper, no
better lower bound has been published.
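For comparison with the bounds developed below, the following Python sketch (a hypothetical
rendering by the editor, not from the paper) computes the channel density by sweeping over
the nets' column intervals, treating them as closed so that wires which merely touch a
vertical line are counted.

```python
def channel_density(nets):
    # nets: list of (p_i, q_i) pairs of bottom/top terminal columns.
    # A net's wire must cross or touch every vertical line through the
    # closed interval [min(p,q), max(p,q)].
    events = []
    for p, q in nets:
        lo, hi = min(p, q), max(p, q)
        events.append((lo, 0, +1))     # interval opens
        events.append((hi, 1, -1))     # interval closes (opens sort first)
    density = current = 0
    for _x, _order, delta in sorted(events):
        current += delta
        density = max(density, current)
    return density

# Shift-right-one with n = 4 nets: net i runs from column i+1 to i.
print(channel_density([(i + 1, i) for i in range(1, 5)]))   # -> 2
```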

III. A SIMPLE LOWER BOUND


We begin with an investigation of the simple "shift-right-one"
channel-routing problem with n nets. Here the top terminal of net i
is in column q_i = i and the bottom terminal of net i is in column
p_i = i+1 for i = 1, ..., n. We can show that at least √2n tracks are re-
quired for this problem, even though the channel density is only 2. In
fact, our simple argument does not depend upon the structure of the
problem except that it contains n closely-packed nets with different
starting and ending columns. Thus, essentially the same argument can be
applied to any channel-routing problem with two-terminal nets.
Suppose that the leftmost terminal of any net occurs in column 1
(i.e. with p_i or q_i equal to 1), and that the rightmost terminal of any
net occurs in column w. We consider the "window" of columns 1 to w,
where obviously w ≥ n. While wires do not need to lie entirely within
the window, every wire must both start and end within the window. For
our shift-right-one problem we have a window of size w = n+1.

Let m denote the number of nets which must be "moved" (i.e. which
must switch columns because p_i ≠ q_i). The structure of our argument is
a track-by-track analysis of how many wires can be moved into their
final columns on each track. Consider the first track (i.e. y=1). If
below track 1 (i.e. connecting track 0 to track 1) we have m_0 = m nets
which must be moved, after track 1 (i.e. between tracks 1 and 2) we
will have a number m_1 of nets to be moved, where m_1 ≤ m_0. We continue
in this manner for each track; when m_i = 0 we are done (with t = i).
How many nets can be moved into their target columns in one track?
The fundamental but simple observation is that if net i moves from its
current column to its target column q_i on the track, then column q_i
must have been empty (i.e. there were no wires in column q_i between
this track and the previous one). Let e_i denote the number of empty
columns between tracks i and i+1 in our window. Then clearly

m_i − m_{i+1} ≤ e_i, or m ≤ Σ_{i=0}^{t−1} e_i.

The only way to change e_i from one track to the next is to route
wires from a column inside the window to a column outside the window
(which increases e_i by one) or vice versa (which decreases e_i by one).
We also observe that e_i − 2 ≤ e_{i+1} ≤ e_i + 2, since at most two wires can
cross the window boundary on any track.
Our initial conditions are e_0 = e_t = w − n (w is the width of the
window, n the number of nets), and we have the inequality

e_i ≤ min{e_0 + 2i, e_t + 2(t−i)}.

This implies that, for t ≥ 3:

m ≤ Σ_{i=0}^{⌊t/2⌋} (e_0 + 2i) + Σ_{i=⌊t/2⌋+1}^{t−1} (e_t + 2(t−i))

and so

m ≤ t(w−n) + 2 Σ_{i=0}^{⌊t/2⌋} i + 2 Σ_{i=1}^{⌊(t−1)/2⌋} i

m ≤ t(w−n) + t²/2

t ≥ −(w−n) + ⌈√((w−n)² + 2m)⌉. (*)


Thus, in our shift-right-one example we have w−n = 1 and m = n,
yielding:

t ≥ −1 + ⌈√(2n+1)⌉.

Figure 2 illustrates a routing for this problem with n = 13; the lower
bound of t ≥ −1 + ⌈√27⌉ = 5 is achieved. It is not true, however, that
the shift-right-one example can always achieve −1 + ⌈√(2n+1)⌉ tracks.
For instance, for n = 12, shift-right-one cannot be implemented using

fewer than five tracks.
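Bound (*) is easily evaluated; this small Python rendering (an editor's illustration, not from
the paper) reproduces the two shift-right-one cases just discussed.

```python
from math import ceil, sqrt

def track_bound(w, n, m):
    # Bound (*): t >= -(w-n) + ceil(sqrt((w-n)^2 + 2m)), with w the
    # window width, n the number of nets, m the nets that must move.
    s = w - n
    return -s + ceil(sqrt(s * s + 2 * m))

print(track_bound(14, 13, 13))   # shift-right-one, n = 13 -> 5
print(track_bound(13, 12, 12))   # n = 12 -> 4 (yet 5 tracks are needed)
```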


Notice that the argument outlined above does not apply solely to
the shift-right-one example, and the bound (*) applies to all two-ter-
minal channel-routing problems. It is in fact possible to show that
the bound (*) is tight in the sense that, for any values of w−n and m,
there is some channel-routing problem which achieves the minimum number
of tracks given by (*). Figure 3 illustrates a particular routing
problem with m = 12, w−n = 1 for which the lower bound of t = 4 can
actually be achieved (even though it cannot be for the shift-right-one).

IV. AN IMPROVED LOWER BOUND


Let us examine the channel-routing problem specified by Figure 4a.
The density is three, and since w−n = 4 and m = 16, (*) tells us that at
least three tracks are required for a routing. But consider just the
left side of this problem, using a window of size eight: w−n = 0, m = 8.
The bound (*) guarantees that at least four tracks are required for
this subproblem, and so it is certainly not possible to achieve a
routing with fewer than four tracks for the entire problem. Therefore
the routing illustrated in Figure 4a is indeed optimal. What is the
problem with our (*) bound? The difficulty is that the bound nowhere
takes into account the details of the particular routing or the loca-
tions of the initially empty columns. As shown by Figure 4b, there is
some channel-routing problem with w−n = 4 and m = 16 which can be routed
using only three tracks. But the empty columns are more spread out, so
no subproblem is as "dense" as in Figure 4a.
We modify our Section III argument to consider not only the largest
window but also all O(n²) "subwindows" it contains. A bound like (*) is
computed for each subwindow, and the overall lower bound is then the
maximum of the individual lower bounds.
The bound on a subwindow involves computing a solution to a quad-
ratic formula as we did above. The formula is, however, more compli-
cated because some nets may have only one terminal in the subwindow and
some nets may have both terminals on opposite sides of (and outside of)
the subwindow.
Consider a (sub)window of width w. Let D denote the number of nets
whose top terminal is within the window and whose bottom terminal is
outside of the window. Such a "departing" net may have its bottom
terminal either to the "left" or to the "right" of the window; there
are D_L and D_R of these, respectively (D = D_L + D_R). Similarly, let A
denote the number of "arriving" nets, those with top terminal outside
and bottom terminal inside the window; there are A_L (A_R) of these with
top terminal to the left (right) of the window. There are T nets which
have their terminals on opposite sides of the window and so must pass
all the way "through". Finally, there are I nets which have both their
terminals "inside" the window. (Note that w = D + I + e_t = A + I + e_0.)
In this extended abstract, we omit the (somewhat complicated)
derivation of the modified bound for an arbitrary subwindow and consider
only subwindows for which D_L = D_R and A_L = A_R.

Clearly D/2 tracks are required for the D "departing" nets. But
within these tracks, as many as

Σ_{i=0}^{D/2 − 1} (e_t + 2i)

"inside" nets might also be routed. Similarly, the A "arriving" nets
require at least A/2 tracks, which could also be used to route

Σ_{i=0}^{A/2 − 1} (e_0 + 2i)

"inside" nets. This leaves max{0, I′}, where

I′ = I − Σ_{i=0}^{D/2 − 1} (e_t + 2i) − Σ_{i=0}^{A/2 − 1} (e_0 + 2i),

more "inside" nets to be routed. Bound (*), previously established,
gives a minimum number of additional tracks required to route these.
Recalling that T nets pass completely through the window, we obtain

t ≥ T + D/2 + A/2 + max{0, −(e_t + D) + √((e_t + D)² + 2I′)}.

This formula is illustrated by the example in Figure 5 (where only
the left half has been drawn; the right half is the mirror image of the
left). For this problem, T = 0, D = 2, A = 6, I = 42, e_0 = 0, e_t = 4,
and the above formula gives

t ≥ 1 + 3 + max{0, −6 + √(36 + 64)} = 8.
This minimal number of tracks is in fact achieved by the routing shown.
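A Python sketch of the calculation for the symmetric case D_L = D_R, A_L = A_R (an editor's
illustrative rendering of the formula above; maximizing it over all subwindows yields the
Section IV bound):

```python
from math import ceil, sqrt

def window_bound(T, D, A, I, e0, et):
    # "Inside" nets absorbed by the departing and arriving tracks:
    absorbed = sum(et + 2 * i for i in range(D // 2)) \
             + sum(e0 + 2 * i for i in range(A // 2))
    I_rem = max(0, I - absorbed)          # the I' that remain
    s = et + D
    extra = max(0, -s + ceil(sqrt(s * s + 2 * I_rem)))
    return T + D // 2 + A // 2 + extra

# Figure 5: T=0, D=2, A=6, I=42, e0=0, et=4.
print(window_bound(0, 2, 6, 42, 0, 4))    # -> 8
```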
The above formula can, of course, be extended to subwindows where
A_L ≠ A_R and D_L ≠ D_R. In addition, small improvements can easily be
made by considering relative positions of, say, the D nets and the e_t
empty columns.
Finally, it should be noted that channel density is, in fact, a
subcase of what we are here computing. If the (maximum) density d is
in column i, then the subwindow of size one which includes i will
require at least d tracks. So the maximum over all subwindows can be
no less than d.

V. CONCLUSIONS
We have presented a new, simple, but powerful technique for deriving
a lower bound on the number of tracks required to solve a traditional
channel-routing problem for two-terminal nets. We have as yet found
no example for which this bound is more than a constant factor from
optimal.

ACKNOWLEDGEMENTS
This research was supported by NSF grants MCS80-08854,
IST80-12240, MCS78-05849, and by DARPA grant N00014-80-C-0622.

REFERENCES

[D76] Deutsch, D. "A Dogleg Channel Router," Proceedings of the
13th Design Automation Conference (IEEE 1976), 425-433.

[DKSSU81] Dolev, D., K. Karplus, A. Siegel, A. Strong, and J. D.
Ullman, "Optimal Wiring between Rectangles," Proceedings
of the 13th Annual ACM Symposium on Theory of Computing
(1981), 312-317.

[HS71] Hashimoto, A. and J. Stevens, "Wire Routing by Optimizing
Channel Assignment," Proceedings of the 8th Design Auto-
mation Conference (IEEE 1971), 214-224.

[RBM81] Rivest, R., A. Baratz, and G. Miller, "Provably Good
Channel Routing Algorithms," to appear.

[S81] Szymanski, T. Personal communication.

[T80] Tompa, M. "An Optimal Solution to a Wire-Routing Problem,"
Proceedings of the 12th Annual ACM Symposium on Theory of
Computing (1980), 161-176.

Figure 1. (Infinite) channel of width 4.



Figure 2. Shift-right-one example for n = 13.

Figure 3. Example achieving bound (*) for w-n = 1, m = 12.



(a) Optimal but does not achieve bound (*).

(b) Does achieve bound (*).

Figure 4. Examples for w−n = 4, m = 16.

Figure 5. Illustration for improved lower bound.
Compact Layouts of Banyan/FFT Networks
David S. Wise
Indiana University
Computer Science Department
Bloomington, Indiana 47405

ABSTRACT: A two-layer pattern is presented for the
crossover pattern that appears as the FFT signal flow graph
and in many switching networks like the banyan, delta, or
omega nets. It is characterized by constant path length
and a regular pattern, providing uniform propagation delay
and capacitance, and ameliorating design problems for VLSI
implementation. These are important issues since path
length grows linearly with the number of inputs to such
networks, even though switching delay seems to grow only
logarithmically.
Independent of that pattern, an arrangement for
stacking planes of such planar crossover patterns in three
dimensions is described. Whereas a planar crossover
pattern of Θ(m) inputs and outputs has at best Θ(m) path
length, the stacked pattern allows O(√m) path length. The
scheme provides for stacking 2k planar nets (perhaps VLSI
chips), each with k inputs/outputs, into a network of k²
inputs/outputs. Using this pattern, all such paths would
have length (propagation delays) of O(k).

Key Words and Phrases: VLSI, switching networks, delta
networks, omega networks, multiprocessing.
CR Categories: 6.1,4.32,3.81,3.83.

I. INTRODUCTION
This paper offers two results that can both be des-
cribed as pictures. They are Figures 1 and 3. The percep-
tive reader may stop here, since the remainder of this
paper only describes them.

*This research was supported in part by the National
Science Foundation under grant number MCS 77-22325.


II. NOTATION
The logarithm-base-2 of x is written lg x. For real
functions f and g, we write f(x) = O(g(x)) if there is a
constant k and some value x_0 such that f(x) ≤ k·g(x) for
all x > x_0. This notation expresses a proportional,
asymptotic upper bound of g for f. If f(x) = O(g(x)) and
g(x) = O(f(x)) then we write f(x) = Θ(g(x)). It is only
necessary to express g as a single term (e.g. lg x, 2^x,
x²) in such contexts.
The abbreviations FFT and VLSI refer to the Fast
Fourier Transform and Very Large Scale Integration circuit
technology, respectively.
III. AN APPLICATION
This work is immediately motivated by the need for a
switching network between processors and memory in a multi-
processor system of 100 or 1000 processors [4]. In order
to increase bandwidth to memory, reducing contention among
the processors, a banked memory is envisioned. Its access
is through a fast, parallel switching network.
A suitable model for such a network is a banyan
network [7] whose elementary functional unit is a 2x2
crossbar switch. It may be perceived as a router [2], a
store-and-forward unit with two input lines and two output
lines. Figure 1 might be interpreted as such a network
from 2n = 16 processors to 2n memories.
Memory fetch and store instructions are transmitted as
packets through the network; duplicate networks pass
information in the reverse direction. (Say, honoring a
fetch instruction or allocating free nodes from a heap
[5].) Each packet initially contains a binary
(destination) memory-address followed by a message. Upon
arrival at each router, its high-order address bit
determines along which path it is to be forwarded. The
entire message is shifted left one bit, displacing that
address bit for transmission to the next stage. In Figure
1, a zero bit would send the modified packet from a router
on a leftward (northwest) line; a one bit would send the
remainder of the packet to the right (northeast). It is
possible to insert into the vacated low-order bit a value
identifying the input line by which a packet entered each
router. After a packet has passed through the network to
its destination, its destination address will have been
shifted out. In its stead (at the end of the packet) would
be the address of its source processor when the vacant bits
are so used.
This describes a network which uses the crossover pattern
to allow as many as n messages to arrive at some of 2n
different destinations simultaneously, each over a path of
at most lg n routers. As many as n·lg n messages might be
already in the switch, pipelined behind the arriving wave.

On the other hand, contention is still possible (for
instance, all messages going to one destination) since each
router can only handle one message at a time.
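The address handling at each router is easy to trace in code. The Python sketch below is an
editor's illustration of the scheme described above, not taken from the paper; since the input
line a packet arrives on is fixed by the wiring, the sketch simply takes those bits as a
parameter.

```python
def route(packet, input_lines):
    # Follow one packet through len(input_lines) ranks of routers.
    # At each router the high-order bit selects the output line
    # (0 = northwest, 1 = northeast); the packet shifts left one bit,
    # and the bit naming the input line the packet entered on fills
    # the vacated low-order position.
    decisions = []
    for entered_on in input_lines:
        decisions.append(packet[0])
        packet = packet[1:] + [entered_on]
    return decisions, packet

# Destination address 101, then a 4-bit message, through 3 stages:
path, packet = route([1, 0, 1, 0, 1, 1, 0], input_lines=[0, 1, 0])
print(path)     # [1, 0, 1] -- one routing decision per stage
print(packet)   # [0, 1, 1, 0, 0, 1, 0] -- message leads, source bits trail
```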
IV. PLANAR LAYOUT WITH CONSTANT PATH-LENGTH.
Figure 1 presents an alternative to the wiring diagram
which is the essential component of many switching
networks, such as the banyan network, the delta network,
the omega network, and the Benes network (as either a
circuit switch or a packet switch) [6,9]. It also appears
as the broadcast pattern for intermediate results in the
FFT [1].
The functional boxes, pictured as shaded boxes in the
figure, are of little importance here. Their definitions
vary with the various applications of this crossover
network. The only assumption we make is that they are all
of uniform design, perhaps mirror images of one another.
Their size affects the aggregate size of the network.
Most analyses of such networks emphasize these
functional boxes as the critical cost of such networks.
That is, any path from the n input boxes (at the bottom)
through the n output boxes (at the top) of such a graph
passes through exactly lg n such boxes. If such a network
is implemented in VLSI technology, however, n can become
locally large and the apparently low O(log n) delay may be
unattainable when timing considerations due to line delay
are included [3]. While Figure 1 has the property that all
lines at any stage are of equal length, it still exhibits
this line-delay problem. Drawn with uniform inter-wire
spacing, it shows that the spacing between the stages,
between functional boxes along any path, grows
exponentially (top-to-bottom). Thus, although there are
only lg n stages, every path length is Θ(n).
For any planar graph with n inputs/outputs, the
longest path length is at least kn, for some constant k. This is
easily proved, since the upper-left output must have a path
from the lower-right input, and the network has width n.
Figure 1 exhibits the desirable property that all paths are
of length kn, so that special considerations due to line
delay, attenuation, and capacitance are lessened.
Functional boxes may be defined more uniformly, a
desideratum in VLSI design.
Also important for planar fabrication technology is
the fact that the crossover network is exactly two layers
thick. While it exhibits more wire crossings than the
common planar patterns (e.g. the cover of [1]), the others
require three or more layers. Here one layer is composed
of all diagonal wires running "northeast" (lower-left to
upper-right) and the other layer is composed of "northwest"
wires (lower-right to upper-left). The northeast wires
might be in the metal layer of a VLSI chip and the
northwest wires in the diffusion or (better) a second metal

Figure 1. Crossover Net for n=8



layer. Alternatively, the northeast wires might be on one
side of a printed circuit board and the northwest wires on
the other. The D-shaped blobs in the figure indicate
contact cuts through the insulator connecting the layers.
The size of the contact cut is one constraint on the
height of the network. According to Mead and Conway [8],
their diameter, including insulating spacing, is 7λ in
VLSI, larger than any insulated wire. Let that wire
diameter be w, and let us measure the width of one
functional box, including insulation, as a multiple of w:
qw.* As Figure 2 illustrates, the angle θ of either diagonal
is then arcsin 1/q. This is another constraint on the
height of the network.
In fact, the height of the n input/output crossover
net of Figure 1 can then be calculated accurately as

Σ_{0 ≤ i < lg n} 2^i qw tan θ = w(n−1)q/√(q²−1).

To this must be added the height of (lg n + 1) functional
boxes.
The path length along any path is the height of the
network times cosec θ = q, plus the path within (lg n + 1)
functional boxes. This is not quite accurate, since
additional width of the crossover net is necessary to
prevent contact cuts from overlapping. Over all, the width
must be increased by wn/4 to allow for the n/4 pairs of
cuts across the second level of Figure 1; the top level can
be wired without contact cuts.
It is then possible to derive the exact area of the
crossover network. The width is nw(q+0.25) and the area of
the wiring is w²q(q+0.25)(q²−1)^{−1/2}(n²−n) = Θ(n²). To this
must be added w(q+0.25)n(lg n+1) times the height of one
functional box. This can be compared with the asymptotic
area of a shuffle-exchange network [6] of Θ((n/log n)²);
that network performs unpipelined FFT or packet switching
(not circuit switching) with n functional boxes in (lg n+1)
iterations, accounting for one factor of log n in its
compression of area.
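Evaluating these expressions is routine; the Python sketch below (an editor's illustration,
with arbitrary parameter values) computes the height, width, and wiring area of the planar net
from the formulas above.

```python
from math import sqrt, log2

def crossover_dimensions(n, w, q, box_height):
    # Height: sum over stages of 2^i * q*w * tan(theta), which equals
    # w*(n-1)*q/sqrt(q^2-1), plus (lg n + 1) functional boxes.
    stages = int(log2(n))
    height = w * (n - 1) * q / sqrt(q * q - 1) + (stages + 1) * box_height
    width = n * w * (q + 0.25)            # the extra wn/4 keeps cuts apart
    wiring_area = w**2 * q * (q + 0.25) * (n * n - n) / sqrt(q * q - 1)
    return height, width, wiring_area

print(crossover_dimensions(n=16, w=1.0, q=4.0, box_height=8.0))
```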
V. RACKING NETWORKS IN CUBIC SPACE.
Regardless of the technology or design of a planar
crossover network of the sort presented above, extending it
to large n -- say 10^3 or 10^4 -- requires a large plane. As
we have seen, and as Franklin [3] points out, both width

*This does not allow horizontal spacing for
immediately adjacent contact cuts -- the facing D's in
Figure 1 -- but we are measuring vertical distance.

Figure 2. θ = arcsin 1/q.



and height grow linearly with n. Obviously some dicing of
larger nets will be required as n increases.
Figure 3 illustrates an elegant decomposition from the
plane into cubic space. It is elegant because of its
decomposition into planes, and because of a net reduction
in the height of the network without increasing its volume.
Built of 4n planar nets, each n functional boxes wide, it
is the equivalent of a network with 2n² functional boxes on
either edge. The planar nets are arranged into two racks
of 2n planes each; the racks are stacked one atop the other
but with orthogonal racking arrangements.
The interconnection between the racks is very simple:
there are no wire crossings there; all wires run nearly
vertically. Therefore, there is no required distance be-
tween the racks, except a constant spacing that depends on
the technology used to rack the planes. Thus, the height
of such a network is only about twice the height of one
planar component, linear in n as discussed above. The
height of a network that is 2n² functional boxes "wide" is
only O(n) tall. More importantly, the path length is also
only O(n) instead of the Θ(n²) of equivalent planar net-
works.
The two outputs from one of the final functional boxes
on a lower plane are routed to corresponding initial boxes
on each of two adjacent upper planes; the dual perspective
is that inputs to one functional box on an upper plane come
from corresponding output boxes on each of two adjacent
lower planes.
That such a pattern is the equivalent to extensions of
Figure 1 (that have 2^{2k+1} boxes across the bottom and top)
can be seen from the following argument. Split such a
network by temporarily removing the middle crossover web.
The top half is 2^{k+1} disjoint planar networks of 2^k boxes
in width. Although the wires cross, the bottom half may
be untangled into exactly the same pattern. (In Figure 1,
for instance, the first and last boxes in the bottom row
are connected to the middle two boxes in the next row
above, just as the left two boxes in the top row are
connected with those immediately under them.) If we
visualize the 2^{k+1} networks in the top half as stacked but
with every other "card" in the stack reversed in mirror
image, and if the 2^{k+1} networks in the bottom half are
untangled and stacked, then the missing middle web is the
analog of the vertical connections described above.
Using the switching example of Section III, this
racking can be perceived in a different way. The purpose of
the entire network is to switch a packet through to a
destination which is at an (x,y) co-ordinate in the top
plane. Its address is a bit concatenation of x and y. The
lower rack uses the high order (x) bits of the address to
route the packet to a correct position on a planar circuit
oriented along the x-axis. Then it passes vertically onto
an identical board oriented along the y-axis, which uses

Figure 3. Racking of Planar n Networks in Cubic Space.



the low order bits of the address to forward it to its
correct destination.
This example points out an application of the racking
in which all 4n planar circuits, including functional
boxes, are identical, making it quite attractive for pla-
nar fabrication techniques, such as those used in current
VLSI or printed-circuit-board technology. Alternatively it
shows how existing planar circuits for FFT or switching can
be used in racks to square their effectiveness.
VI. CONCLUSION.
This paper shows how to lay out a crossover network
either on a plane in just two layers and with uniform wire
length, or in cubic space in order to minimize path length.
Since the latter pattern (Figure 3) is fabricated from
planar nets, the former results apply for it as well.
Although the two solutions have a common motivation,
they may be used independently of one another, either in a
case where a planar layout is required (Figure 1) or where
a planar net is readily available but an extension is
needed.
A feature of both solutions is that no connections to
function boxes are made horizontally. In Figure 1, they
are accessible from the sides for busses across each stage.
Common signals -- like power busses -- can be passed across
the boxes without interfering with the crossover net under
consideration. In Figure 3 the racked planes run
completely across the circuit, leaving the edges available
for maintenance lines, and gaps between the planes (in the
case of printed circuit technology) for cooling.
It is possible to build up an arrangement equivalent
to Figure 3 from smaller planar nets (say identical to
Figure 1) using the recurrence of Figure 3 repeatedly. The
resulting arrangement would not have the features of simple
stacking just mentioned, but it might be necessary if
fabrication techniques could not provide sufficiently large
planar circuits, and might be useful when technologies must
be mixed as suggested below.
Fabrication technology is a constraint on the importance
of the asymptotic analysis of these circuits. In many
applications, like Section III above, each line in the
figures represents a parallel communication bus of from 64
to 256 wires. Large planar circuits are not now available
from VLSI techniques when the number of such input/output
lines is high. The inter-rack connection of Figure 3,
though simple, is not now possible in the same scale as the
VLSI planar network. The "short" vertical wires between
the racks would be very, very long today. Thus the
importance of Figure 3 for planar circuits on a chip is now
conceptual rather than practical; path length is probably
not reduced under current fabrication techniques, but the
design is still a useful modularization of Figure 1. Thus,

we might well use Figure 1 to lay out a small planar VLSI
chip (say n=8), then use Figure 3 to build a printed
circuit board with 32 such chips (n=128), and finally rack
such boards using Figure 3 to build a very large network
(n=32768). The path lengths would not be significantly
reduced by using these designs on one board, but the
regular pattern on a chip, and within the rack of boards
(instead of a backplane), would make the aggregate speed of
the circuit faster and easier to clock than conventional
planar layouts of that size.
ACKNOWLEDGEMENT: I thank John O'Donnell for introducing me
to the depth of VLSI technology, George Epstein for
suggesting [1], C. E. Leiserson for pointing out [6], and
Nancy Garrett, able typist.
REFERENCES
1. N. Ahmed and K.R. Rao. Orthogonal Transforms for
Digital Signal Processing. Berlin, Springer (1975).
2. J.B. Dennis, G.A. Boughton, and C.K.C. Leung. Building
blocks for data flow prototypes. Proc. 7th Symp. on
Computer Architecture (IEEE order no. 80CH1494-4C),
ACM SIGARCH Newsletter 8, 3 (May, 1980), 1-8.
3. M.A. Franklin. VLSI performance comparison of banyan
and crossbar communication networks. IEEE Trans.
Computers C-30, 4 (April, 1981), 283-291.
4. D.P. Friedman and D.S. Wise. Aspects of applicative
programming for parallel processing. IEEE Trans.
Comput. C-27, 4 (April, 1978), 289-296.
5. S.D. Johnson. Connection networks for output-driven
list multi-processing. Technical Report 114, Computer
Science Department, Indiana University (October, 1981).
6. D. Kleitman, F.T. Leighton, M. Lepley, and G.L. Miller.
New layouts for the shuffle-exchange graph. Proc. 13th
ACM Symp. on Theory of Computing [ACM Order No.
508810] (1981), 278-292.
7. G.M. Masson, G.C. Gingher, and S. Nakamura. A sampler
of circuit switching networks. Computer 12, 6 (June,
1979), 32-48.
8. C. Mead and L. Conway. Introduction to VLSI Systems,
Reading, MA: Addison-Wesley (1980).
9. H. Siegel. Interconnection networks for SIMD machines.
Computer 12, 6 (June, 1979), 57-65.
Syntax-Directed Verification of
Circuit Function

Michael J. Foster
Carnegie-Mellon University
Computer Science Department
Pittsburgh, Pennsylvania 15213

Abstract
This paper introduces a new technique, called syntax-directed verification, for proving properties
of circuits composed of standard cells. The lengths of proofs using this technique are
independent of the size of the circuits, but depend only on the number of standard cell types
and the complexity of the rules for interconnecting them. Syntax-directed verification is thus
well-suited to VLSI, in which large circuits are built using relatively few types of cells. The
paper describes the syntax-directed verification method, and presents an example of its use.

Introduction
Many current VLSI designs are composed of standard cells, which themselves perform simple
functions but are wired together to perform more complex functions. Often it is not obvious
that the function performed by the cell combination is the one specified, even if we assume that
the cells themselves are correct. Examples of complex circuits formed from simple cells are
Leiserson's systolic priority queue [Leis79], the pattern matcher of Foster and Kung
[FostKung80], and the programmable recognizer array (PRA) [FostKung81]. In all of these
examples, correctness of the circuits has been demonstrated by tracing the action of the circuit
rather than by formal proof.

This paper suggests a syntax-directed technique for verifying circuits composed from standard
cells. This technique allows proofs of correctness to be developed in a mechanical way. It
relies on the use of a context-free grammar to specify both the function and structure of the
legal combinations of cells. The terminal characters in the grammar correspond to the primitive
cells, and non-terminals correspond to combinations of cells. The start symbol corresponds to
the class of circuits whose correctness is to be verified. By proving a single theorem for each
production in the grammar, the correctness of any circuit constructed according to the grammar
may be verified.

This research was supported in part by the Office of Naval Research under Contract N00014-80-C-0236, NR 048-659,
by the Defense Advanced Research Projects Agency under Contract F33615-78-C-1551, by the National Science
Foundation, and by the Fannie and John Hertz Foundation.


Syntax-directed verification can be used to prove properties of circuits constructed using
standard cells. Together with other techniques, such as simulation, these proofs promote
confidence that large circuits will work as expected. As VLSI systems become larger and more
complex, validation methods of this type will be essential. The success of syntax-directed
verification provides hope that other formal techniques can be discovered to replace the ad hoc
methods in current use.

The Verification Method


To prove the correctness of circuits composed of primitive cells, we must prove a theorem of
this form:
If all primitive cells are correct, then any legal combination of cells will be correct.

To construct this kind of theorem, we must give the specifications of the cells, tell what
combinations of cells are legal, and give a rule for determining the specifications of any legal
cell combination from its form. In this paper, we assume that the specifications of the cells are
primitive, along with the cell designs. The legal combinations of cells will be precisely those
circuits that are generated by the context-free grammar, and the specification of a circuit will
depend upon its derivation in the grammar.

As with program correctness, proof of circuit correctness proceeds in two steps: development of
verification conditions, followed by their proof. To develop the verification conditions we make
use of syntactic assertions on the values and timings of signals at the ports of each primitive cell
and compound circuit. These assertions correspond to the inductive assertions [Floyd67] of
program verification. The verification conditions are theorems relating the syntactic assertions.

One syntactic assertion is required for each symbol of the grammar. Terminal symbols of the
grammar correspond to primitive cells, and the assertions for these symbols are simply the
primitive cell specifications. Assertions for the non-terminals are specifications of the various
compositions of primitive cells. The assertion for the start symbol is thus the specification for a
complete circuit constructed using the grammar.

Once we have the syntactic assertions we can develop the verification conditions. Each
production of the grammar corresponds to one verification condition, stating that the syntactic
assertions of the symbols on the right side of the production imply the assertion on the left.
Proof of these theorems completes the verification of the circuit family.

An Example
As an example of this technique let us verify that the recognizers described in [FostKung81]
actually recognize the regular languages they are supposed to. Three kinds of primitive cells are
used to build these recognizers, corresponding to types of characters in the regular expression;
one cell type is a comparator for single characters, while the other two types correspond to the
union (+) and Kleene star (*) operators. The three cell types, together with symbols used in
drawing large recognizers, are shown in Figures 1, 2, and 3. Note that each cell type has left
and right ports, which are used for cells concatenated to its right and left. In addition, the +
and * cells have upper and lower ports for connecting circuits for their operands.

Figure 1: Comparator Cell

Figure 2: OR-Node Cell

Figure 3: Kleene * Cell

Each of the primitive cells has one or more data paths passing through it, with three data
streams on each data path. The CHR and ENB streams, which flow from right to left, carry the
text characters and enable bits. The RES stream, which flows from left to right, carries the result
bit. We hook these cells together to form recognizers by connecting the data path at the right
side of one cell to a data path at the left side of another cell.

Circuits formed from these cells will take the form of ternary trees. All communication with a
recognizer tree takes place at the root, through the right port of the rightmost cell. Any cell
with nothing concatenated to its left must have a terminating loop connected to its left port.
This is simply a wire running from the ENB output to the RES input, and ensures that RES is
always equal to ENB.

We construct a circuit for a regular expression using the grammar:


R ::= P | RP
P ::= <letter> | (R)* | (R + R)
The non-terminal R is the start symbol; circuits corresponding to R recognize regular
expressions. The non-terminal P corresponds to primitive regular expressions - those whose top-
level operator is not concatenation. To build the circuit, we parse the regular expression using
this grammar. Each of the productions in the derivation of the expression corresponds to a
connection of two or more circuits for subexpressions, or to the introduction of a single cell.
The five productions, with their associated circuit constructions, are:
R -> P
Put a terminating loop on the left port of P.
R -> RP
Connect the left port of P to the right port of R.
P -> <letter>
Use a new comparator for P.
P -> (R)*
Connect the right port of R to the top port of a new * node.
P -> (R+R)
Connect the right ports of the two R's to the top and bottom ports of a new + node.
A circuit for a complex regular expression constructed using these rules is shown in Figure 4.

Figure 4: Circuit for (((A)*B)* + (DC)*)(E)*F
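The five productions read directly as a recursive-descent parser that introduces one cell or
connection per production. The Python sketch below is an editor's illustration (whitespace
removed from the expression; the tuple labels are invented) that builds such a circuit tree for
the expression of Figure 4.

```python
def parse_R(s, i=0):
    # R ::= P | RP -- a left-to-right sequence of primitive circuits,
    # each connected to the right port of the one before it.
    parts = []
    while i < len(s) and s[i] not in ')+':
        p, i = parse_P(s, i)
        parts.append(p)
    return ('concat', parts), i

def parse_P(s, i):
    # P ::= <letter> | (R)* | (R + R) -- each branch names the cell
    # the corresponding production introduces.
    if s[i] != '(':
        return ('comparator', s[i]), i + 1       # new comparator cell
    r1, i = parse_R(s, i + 1)
    if s[i] == '+':                              # (R + R): new + node
        r2, i = parse_R(s, i + 1)
        assert s[i] == ')'
        return ('or-node', r1, r2), i + 1
    assert s[i] == ')' and s[i + 1] == '*'       # (R)*: new * node
    return ('star-node', r1), i + 2

tree, _ = parse_R("(((A)*B)*+(DC)*)(E)*F")
print(tree)
```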


A recognizer constructed from these cells is a clocked circuit with a two-phase clock. Three
data streams are available at the right port of a recognizer: the text stream (CHR), the enable bit
200 SyntaxDlrected Verification of Circuit Function

(ENB), and the result bit (RES). On even clock phases the recognizer inputs CHR, while on odd
phases it inputs ENB and outputs RES. If a string in the language generated by the regular
expression is input, preceded by a 1 on ENB, the recognizer outputs a 1 on RES immediately
after the last character of the string; otherwise it outputs a 0 on RES. Note that a single
character in the input stream may be a member of several recognized strings.

We have now described the legal combinations of cells, and have claimed that a legal
combination of cells should recognize its corresponding regular expression (by setting RES to 1
at the right times). Verification of the circuit function will consist of a proof of this claim.
Before proceeding with the proof we must make this circuit specification more precise, as well
as supply specifications for the individual cells. These specifications of circuit function will be
the syntactic assertions used in the proof of correctness. They will consist of predicates on the
sequence of values at each port of a single cell (comparator, + node, or * node) or larger
module (P or R).

The above description of the operation of a recognizer circuit is an informal statement of the
assertion for R. By introducing suitable notation for the signals on the ports of a recognizer, we
can translate this informal description into a concise predicate. Let E_t, C_t, and R_t stand for the
ENB, CHR, and RES signals at beat t on the right port of a recognizer, and let E'_t, C'_t, and R'_t
stand for the signals on the left port of a primitive recognizer. The assertion for a recognizer
for regular expression X is then the predicate P_R(X):

(∀t) [(∃n) E_{t−2n+1} ∧ ((C_{t−2(n−1)} ... C_{t−2} C_t) in X)] ↔ R_{t+1},

where the expression "(C_{t−2(n−1)} ... C_{t−2} C_t) in X" means that the string C_{t−2(n−1)} ... C_{t−2} C_t
is in the language generated by X.

A primitive expression recognizer (P) has an associated pipe length δ. It sends CHR and ENB
from right to left with delay δ, so that data entering the right port at time t leaves the left port
at time t + δ. Furthermore, if a 1 is input on RES from the left δ−1 beats before the start of a string in
the language generated by the primitive expression, then 1 will be output on RES from the right
immediately after the last character of the string. Otherwise, a 0 is output on RES on every beat.

The assertion P_P(X) for a primitive recognizer of pipe length δ for expression X is thus the
conjunction of:

(∀t) E'_{t+δ} ↔ E_t
(∀t) C'_{t+δ} ↔ C_t
(∀t) [(∃n) R'_{t+δ−2n+1} ∧ ((C_{t−2(n−1)} ... C_{t−2} C_t) in X)] ↔ R_{t+1}.

Our primitive specifications of the cells assert that they function according to the circuit
diagrams. Thus a character recognizer for the character "x" obeys the predicate P_x:

(∀t) E'_{t+1} ↔ E_t
(∀t) C'_{t+1} ↔ C_t
(∀t) [R'_t ∧ (C_t = x)] ↔ R_{t+1},

as can be seen from Figure 1.

For the OR node we denote the upper and lower ports by the superscripts u and l, so that C^u is
the character output of the upper port. The predicate P_OR is then the conjunction of:

(∀t) R_t ↔ R^l_t ∨ R^u_t
(∀t) C^l_t ↔ C^u_t ↔ C'_t ↔ C_t
(∀t) E^l_t ↔ E^u_t ↔ R'_t
(∀t) E'_t ↔ E_t

Using the same convention for the upper port, the predicate P_* for the Kleene * is:

(∀t) R_t ↔ E^u_t ↔ R'_t ∨ R^u_t
(∀t) C'_t ↔ C^u_t ↔ C_t
(∀t) E'_t ↔ E_t
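As a sanity check, the cell predicates can be executed. The Python sketch below is an editor's
illustration (it glosses over the two-phase clocking detail, using None for beats before the
pipe fills) of the smallest recognizer: a comparator with a terminating loop, for which
R'_t = E'_t = E_{t−1}.

```python
def single_letter_recognizer(x, E, C):
    # Comparator predicate P_x with a terminating loop: R_{t+1} holds
    # exactly when E_{t-1} held and C_t = x.
    R = [None, None]                      # R_0, R_1 not determined here
    for t in range(1, len(C)):
        R.append(E[t - 1] and C[t] == x)  # R_{t+1} <-> E_{t-1} & (C_t = x)
    return R

E = [True, False, False, False]           # a 1 on ENB at beat 0
C = [None, 'a', 'b', 'a']
print(single_letter_recognizer('a', E, C))
# -> [None, None, True, False, False]: RES = 1 right after the enabled 'a'
```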

Having stated the assertions that apply to each symbol of the grammar, we are ready to state
and prove the verification conditions. The verification condition for each production is a
theorem stating that the assertions for the symbols on the right side imply the assertion for the
left symbol. Each of these theorems assumes, of course, that the semantic rule associated with
the production is applied. The theorems show, taken together, that a circuit constructed using
the semantic rules above will recognize the regular expression that drove the productions. There
are five verification conditions, one for each production in the grammar; we shall state and
prove only one of them here.

To prove the verification condition corresponding to the production R -> P, we must show that
if a termination loop is added to a circuit satisfying P_P(X) then the resulting circuit satisfies
P_R(X). A circuit satisfying P_P(X), with an added termination loop, satisfies the conjunction of:

(∀t) E'_{t+δ} ↔ E_t
(∀t) C'_{t+δ} ↔ C_t
(∀t) [(∃n) R'_{t+δ−2n+1} ∧ ((C_{t−2(n−1)} ... C_{t−2} C_t) in X)] ↔ R_{t+1}
(∀t) R'_t ↔ E'_t

The last equivalence here comes from the termination loop. By substitution in the third
conjunct, first E'_{t+δ−2n+1} for R'_{t+δ−2n+1}, then E_{t−2n+1} for E'_{t+δ−2n+1}, we obtain the
predicate:

(∀t) [(∃n) E_{t−2n+1} ∧ ((C_{t−2(n−1)} ... C_{t−2} C_t) in X)] ↔ R_{t+1}.

This is precisely P_R(X), so we have proven the theorem corresponding to the production R ->
P. The verification conditions for the other four productions are similar in statement and proof.

Conclusions
This paper has introduced a method for verifying properties of circuits composed from standard
cells using a context-free grammar. Proof of a small number of theorems using this method can
verify the correctness of any circuit built from the standard cell system. It should be mentioned
that this technique is not a panacea, since we may wish to combine circuits in ways that cannot
be described by context-free grammars. For example, no context-free grammar describes the set
of rectangular arrays, since the language {a^c b a^c b ... b a^c : c an integer} is not context-free.

A wide class of circuits can be composed from standard cells using context-free grammars,
however. For this class of circuits, correctness proofs can be developed at the same time as the

design of the cells and grammar, to provide assurance that large circuits will meet their
specifications. Properties of circuits other than correctness of function, such as timing
properties, are also verifiable by this method. Because of the usefulness of this technique in
verifying large circuits, it is worthwhile to try to specify interconnections of standard cells using
a context-free grammar. Designers of standard cell systems should apply syntax-directed
verification techniques, to help ensure that circuits built with their systems will work as
expected.

References
[Floyd67] R. W. Floyd, "Assigning Meanings to Programs", Proc. Amer. Math. Soc. Symp. in
Applied Mathematics 19 (1967), pp. 19-31.

[FostKung80] M. J. Foster and H. T. Kung, "The Design of Special-Purpose VLSI Chips,"
Computer, Vol. 13, No. 1, Jan. 1980, pp. 26-40.

[FostKung81] M. J. Foster and H. T. Kung, "Recognize Regular Languages With
Programmable Building-Blocks," in: VLSI-81, J. P. Gray, editor, Academic Press, 1981, pp. 75-
84.

[Leis79] C. E. Leiserson, "Systolic Priority Queues," Proc. Caltech Conf. Very Large Scale
Integration, California Institute of Technology, Pasadena, Calif., Jan. 1979, pp. 199-214.
Temporal Specifications of Self-Timed
Systems

Yonatan Malachi and Susan S. Owicki


Stanford University

Abstract
Self-timed logic provides a method for managing the complexity of asynchro-
nous module connections; the correctness of a properly constructed self-timed
system is independent of the speed of its components. In this paper we present a
means of formally specifying self-timed systems and modules using temporal logic,
an extension of ordinary logic to include an abstract notion of time. We show by
example that temporal logic can describe Seitz's self-timed modules, giving detailed
specifications for combinatorial logic, and sketching the treatment of wires, align
elements, feedback registers, pipelines and finite state machines. Temporal logic
has an expressive power that makes it well suited to this task; it also provides a
framework for proofs of the properties of self-timed systems.

Introduction
Self-timed logic is a method for managing the complexity of asynchronous
connections between system components. Its basis is a signal-acknowledgment
protocol that guarantees that a module remains inactive until its input is available,
and that the input then remains available as long as it is needed. The cycle of
signals and acknowledgments plays a role much like that of a two-phase clock.
However, self-timed logic has the advantage that the correct behavior of a system
does not depend on the speeds of its components or the length of communication
delays.
Seitz [S1, S3] has developed a number of conventions for defining and compos-
ing self-timed modules. One of his basic building blocks is the Combinatorial Logic
(CL) module ([S1]). This class of modules is specified by a set of constraints, called
the weak conditions, that describe permissible timing orders for changes of values
on input and output lines. The simplest CL modules combine Boolean logic with
appropriate timing signals. CL modules can be composed, and the result will itself
be a CL module, as long as a simple set of combining rules are obeyed.

Work supported in part by the Defense Advanced Research Projects Agency under contract No. MDA903-
79-C-0680

Seitz defined the weak conditions using timing inequalities, where p ≤ q means
that q occurs after p or simultaneously with p. This notation is adequate for
expressing timing pre-requisites, but it cannot be used to express requirements
that certain events must eventually occur. For example, one usually expects that
a module whose input is available will eventually produce output. We will call
the first kind of properties safety properties, and the second liveness properties.
Both safety and liveness properties, as well as more complex timing requirements,
are naturally expressible in Temporal Logic, an extension of ordinary logic that
includes an abstract notion of time. In addition to its expressive power, temporal
logic provides a rigorous basis for proving the soundness of composition rules and
for verifying properties of compound modules; one shows that the specifications of
the component imply the specifications of the compound module, using the axioms
of temporal logic.
In this paper we first review the weak conditions and temporal logic, and
then show how temporal logic can be used to specify formally the timing behavior
of Seitz's classes of self-timed modules. The specifications for CL modules are
presented in some detail. Other classes of modules - the align element, feedback
register, pipeline, and finite-state machine - are discussed more briefly. We con-
clude with an assessment of the use of temporal logic in specifying self-timed logic.

1 Combinatorial logic and the weak conditions


Seitz's modules use a signal-acknowledge protocol in which data lines take on
values from the set {0, 1, ⊥}, where ⊥ (bottom) denotes the undefined value. The
status of the signal should be well defined not only when the value is 0 or 1 but
also when it is ⊥, and every transition from 0 to 1 or vice versa should be through
⊥. These requirements can be satisfied, for example, if we adopt the double-rail
code for signals proposed in [S1].
Before we present the original specifications, we introduce some notation. For
a signal line x we say "x is defined" and denote it by d(x) if the value of x is one
of the standard binary values. For a set of signal lines X, we will use

d(X) =def (∃x ∈ X).d(x)  and  D(X) =def (∀x ∈ X).d(x).

If I denotes a set of input lines and O a set of output lines, we can then express
facts like "all inputs are undefined" as ¬d(I), "all outputs are defined" as D(O), or
"not all outputs are defined" (equivalently, "some output is undefined") as ¬D(O).
We have no reason to deal with empty sets of signals, and therefore in the formal
treatment of the problem we assume that D(X) ⊃ d(X).
We now present the inequalities for the weak conditions using this notation.
Those inequalities specify the relation between I, the set of input signals of a
module, and 0, the set of output signals of the same module. Later, when we want
to talk about several modules simultaneously, we will introduce subscripts on the
signal set names.

WC1. d(I) ≤ d(O)
WC2. D(I) ≤ D(O)
WC3. D(O) ≤ ¬D(I)
WC4. ¬D(I) ≤ ¬D(O)
WC5. ¬d(I) ≤ ¬d(O)
WC6. ¬d(O) ≤ d(I)
Informally, the weak conditions state that some input to the module must become
defined before any output becomes defined (WC1), all inputs must be defined
before all outputs become defined (WC2), all outputs must be defined before any
input becomes undefined (WC3), and similarly through a sequence where all values
become undefined (WC4 - WC6).
Note that while WC1, WC2, WC4, and WC5 are functional constraints on the
CL, WC3 and WC6 are restrictions on the environment (domain) that supplies the
inputs to the CL. The interpretation is that if the environment specifications WC3
and WC6 are satisfied the CL module will obey the rest of the conditions.
In the next section we introduce temporal logic, which allows us to state the
weak conditions formally. While the weak conditions specify only that some event
must precede another event, temporal logic enables us to extend the restrictions
on the behavior of CL's by specifying cases in which a certain event must occur.

2 Temporal logic
Temporal logic is a formal system that provides operators for reasoning about
transitions in time. Variants of this logic are used for verifying programs and
processes, and have been found especially useful for parallel programs ([MP], [OL],
[Ma], [Ha], [HO]). Our version of temporal logic is based on an axiomatization
similar to the one in [Ma]; its framework and semantics are fully described in [MP].
The basic modal operators are ◇ (read diamond), □ (read box), ○ (read circle),
and U, where ◇p, □p, ○p, and p U q denote "eventually p", "henceforth p", "next
p", and "p until q" respectively. We assume that time is linear, i.e., each time
instant has exactly one successor, and reflexive, i.e., the present is part of the
future. All the operators state facts about the future, and their informal semantics
are

◇p - p is true in some future instant (or in the present),
□p - p is true in all future instants (including the present),
○p - p is true in the next instant,
p U q - p will be true in all time instants preceding the (first) instant
in which q is true.

The until operator in our variant of the system differs from the one described in
[MP], and it can be called weak until in the sense that it does not imply the eventual
occurrence of its second argument. We use this until because we have found it more
suitable for expressing the properties of self-timed systems.
As an example of a temporal formula, one of the axioms of our system is

□p ⊃ p ∧ ○p ∧ ○□p,

stating that henceforth p implies the truth of p at the present, at the next moment,
and in all the future of the next moment.

The axioms that characterize our until operator and make our system different
from the one in [Ma] are

p U q ≡ q ∨ (p ∧ ○(p U q)),  and
□p ⊃ p U q.

The first states that p U q is true iff q is true now or p is true now and p U q is true
in the next state. The second states that if p is always true then so is p U q. The
first of the two axioms is valid for both the weak and strong until, while the latter
characterizes the weak until.

3 Derived operators
The specification of the behavior of self-timed systems can be more concise if
we use derived operators whose intuitive meaning is close to the conditions that
we want to specify. The while operator p W q states that p is true as long as q is
true, and can be defined formally as

p W q =def p U ¬q.

Another useful operator is the precedes operator p P q ([MP]), stating that p must
precede q if q is ever going to happen. The definition is

p P q =def (¬q) U p,

making P a sort of dual of W.


Frequently we want to express the requirement that once some assertion be-
comes true, it remains true as long as some other assertion is true. This can be
expressed with the derived operator latched while:

p LW q =def p ⊃ p W q.

Another useful abbreviation is the entails operator ([OL]) p ⇝ q. It is defined by

p ⇝ q =def p ⊃ ◇q,

stating that the first operand implies the eventual occurrence of the second. Note
that since we use reflexive time, p ⊃ q implies p ⇝ q.
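
These operators are easy to experiment with on finite traces. The following sketch (ours, not from the paper; all function names are hypothetical) evaluates the basic and derived operators over a finite list of states, treating the final state as stuttering forever so that the operators have a sensible finite-trace reading:

    # A minimal finite-trace evaluator (our illustration, not the authors' code).
    # A formula is a Python predicate on (trace, position); the last state is
    # treated as stuttering forever, an approximation of infinite time.

    def box(p):           # box p: p holds now and at every later instant
        return lambda tr, i: all(p(tr, j) for j in range(i, len(tr)))

    def diamond(p):       # diamond p: p holds now or at some later instant
        return lambda tr, i: any(p(tr, j) for j in range(i, len(tr)))

    def nxt(p):           # circle p: p holds at the next instant
        return lambda tr, i: p(tr, min(i + 1, len(tr) - 1))

    def until(p, q):      # p U q, the WEAK until: q need never occur
        def f(tr, i):
            for j in range(i, len(tr)):
                if q(tr, j):
                    return True
                if not p(tr, j):
                    return False
            return True
        return f

    def neg(p):        return lambda tr, i: not p(tr, i)
    def W(p, q):       return until(p, neg(q))       # p W q = p U (not q)
    def P(p, q):       return until(neg(q), p)       # p P q = (not q) U p
    def LW(p, q):      return lambda tr, i: (not p(tr, i)) or W(p, q)(tr, i)
    def entails(p, q): return lambda tr, i: (not p(tr, i)) or diamond(q)(tr, i)

    # Check the two until axioms on sample traces.
    p = lambda tr, i: tr[i]["p"]
    q = lambda tr, i: tr[i]["q"]
    t1 = [{"p": True, "q": False}, {"p": True, "q": False}, {"p": False, "q": True}]
    lhs = until(p, q)(t1, 0)
    rhs = q(t1, 0) or (p(t1, 0) and nxt(until(p, q))(t1, 0))
    assert lhs == rhs == True
    t2 = [{"p": True, "q": False}] * 3            # p always, q never
    assert box(p)(t2, 0) and until(p, q)(t2, 0)   # box p implies p U q (weak)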

4 Formal specification of CL modules


Now, having introduced both the formal system and the weak conditions,
we specify the self-timed systems formally. We use LW to specify stability
properties that correspond to the safety properties from the weak conditions, and
⇝ to specify liveness or eventuality properties that are missing from the weak
conditions.
We introduce one further piece of notation. Our temporal logic specifications
typically take the form, "if the module starts in a legitimate initial state, then some
assertion w is always true during its operation." If φ is an assertion describing the
legitimate initial state, this is expressed by

φ ⊃ □w.

We abbreviate such a specification as

⊨ w,

borrowing notation (used somewhat differently) from Manna and Pnueli [MP] to
avoid repeatedly stating the initial conditions. In a CL, φ ≡ ¬d(I) ∧ ¬d(O).

S1. ⊨ ¬d(O) LW ¬d(I)
S2. ⊨ ¬D(O) LW ¬D(I)
L2. ⊨ D(I) ⇝ D(O)
E3. ⊨ D(I) LW ¬D(O)
S4. ⊨ D(O) LW D(I)
S5. ⊨ d(O) LW d(I)
L5. ⊨ ¬d(I) ⇝ ¬d(O)
E6. ⊨ ¬d(I) LW d(O)
E7. ⊨ [D(I) P ¬d(I)] ⊃ (∀i ∈ I).[d(i) LW ¬D(I)]
S8. ⊨ [D(O) P ¬d(O)] ⊃ (∀o ∈ O).[d(o) LW ¬D(O)]
E9. ⊨ [¬d(I) P D(I)] ⊃ (∀i ∈ I).[¬d(i) LW d(I)]
S10. ⊨ [¬d(O) P D(O)] ⊃ (∀o ∈ O).[¬d(o) LW d(O)]

S1, S2, S4, and S5 are stability properties corresponding to WC1, WC2, WC4,
and WC5 respectively, while L2 and L5 are liveness properties. E3 and E6 are
restrictions on the behavior of the environment. We associate no liveness statement
with WC1 because we allow the system to stay in the state where both outputs
and inputs are undefined if no new input is supplied. Each individual input or
output line is expected to behave consistently; we specify this by axioms E7 - S10.
(This requirement is an important one; it was mentioned informally in the original
exposition of CL but is not included in the weak conditions.)
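
As an illustration of how such specifications can be exercised, the following sketch (ours; the trace encoding is hypothetical) checks the stability axioms S1, S2, S4, and S5 against a recorded definedness trace of a CL module, where each instant records which input and output lines carry defined values:

    # Checking the stability axioms S1, S2, S4, S5 on a recorded trace (ours).
    # Each instant is a pair (ins, outs) of booleans, True = "line is defined".

    def d(lines): return any(lines)
    def D(lines): return all(lines)

    def holds_LW(trace, p, q):
        # p LW q: whenever p holds, p must keep holding as long as q holds.
        for i in range(len(trace)):
            if not p(trace[i]):
                continue
            for j in range(i + 1, len(trace)):
                if not q(trace[j - 1]):     # q released the latch
                    break
                if not p(trace[j]):
                    return False
        return True

    # One handshake cycle of a 2-input, 1-output CL module.
    trace = [
        ((False, False), (False,)),
        ((True,  False), (False,)),
        ((True,  True),  (False,)),
        ((True,  True),  (True,)),
        ((False, True),  (True,)),
        ((False, False), (True,)),
        ((False, False), (False,)),
    ]

    assert holds_LW(trace, lambda s: not d(s[1]), lambda s: not d(s[0]))   # S1
    assert holds_LW(trace, lambda s: not D(s[1]), lambda s: not D(s[0]))   # S2
    assert holds_LW(trace, lambda s: D(s[1]), lambda s: D(s[0]))           # S4
    assert holds_LW(trace, lambda s: d(s[1]), lambda s: d(s[0]))           # S5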

4.1 Interconnection rules


The notion of CL specified above provides a valuable tool for building systems.
Compositions of CL modules are themselves CL's, as long as the following simple
rules are obeyed:
1. No dangling inputs, i.e., all inputs to the composite module are inputs to some
component,
2. No dangling outputs, i.e., every output of a component is either an input to
another component or an output of the composite system,
3. No closed data paths (loops).
[Figure 1. Interconnection scheme: the composite module M contains ML (left) and MR (right). Line 1 carries inputs from I into I_L, line 2 carries inputs from I into I_R, line 3 carries outputs of ML into I_R, and lines 4 and 5 carry O_L and O_R into O.]

We now show how the interconnection rules can be specified in our notation. It
suffices to specify the interconnection rules for two modules; all other configurations
can be built incrementally by adding one module at a time. The rules for two
modules are captured by the configuration in Figure 1, where I and O are the input
and output of the composite system M, while I_L, O_L, I_R, O_R are those of ML (L
for left) and MR (R for right), respectively. Every line in the diagram represents
a set of signal lines. Lines 1 and 5 must both represent non-empty sets; lines 2,
3, and 4 may be empty, but none of I, I_L, I_R, O, O_L, O_R may be empty. Because
our assumption is that a line in the diagram represents one point in the circuit
(see also section 5.1) we do not need and do not allow lines connecting the input
directly to the output. All this can be formalized in the following axioms, labeled
according to the lines in the diagram.
according to the lines in the diagram.
C1D. ⊢ D(I) ⊃ D(I_L)
C1d. ⊢ d(I_L) ⊃ d(I)
C12D. ⊢ D(I_L) ∧ D(I_R) ⊃ D(I)
C12d. ⊢ d(I) ⊃ d(I_L) ∨ d(I_R)
C23D. ⊢ D(I) ∧ D(O_L) ⊃ D(I_R)
C23d. ⊢ d(I_R) ⊃ d(I) ∨ d(O_L)
C34D. ⊢ D(O) ∧ D(I_R) ⊃ D(O_L)
C34d. ⊢ d(O_L) ⊃ d(I_R) ∨ d(O)
C45D. ⊢ D(O_L) ∧ D(O_R) ⊃ D(O)
C45d. ⊢ d(O) ⊃ d(O_L) ∨ d(O_R)
C5D. ⊢ D(O) ⊃ D(O_R)
C5d. ⊢ d(O_R) ⊃ d(O)
If the interconnection rules are obeyed and ML and MR are both CL's, then
the composite system M will also be a CL. (Note that this fact must be proved
using the axioms of temporal logic. We do not supply such a proof in this paper.)

5 Additional self-timed modules


In this section we mention other categories of modules ([S3]), comment on their
temporal logic specifications, and explain the construction of some of the compound
modules.

5.1 Wires
In self-timed systems the delay caused by "wires" can be modelled by treating
wires as CL modules. It is easy to prove that a delay line (when we look at it as a
one directional information pipe) satisfies the CL conditions. The CL composition
rule allows us to combine a line and a CL into a single CL. Thus it is always
possible to incorporate the delay between two modules into one of them (or divide
it arbitrarily between them) without changing the logic of the system.

5.2 The Align module


The Align element is used for synchronization in building more complex mod-
ules. It is a special kind of CL, with the property that all of its input lines must
be (un-)defined before any of its outputs becomes (un-)defined. Its specification is
only a slight modification of the CL specification, namely S1 and S2 are replaced
by SA1 while S4 and S5 are replaced by SA2:

SA1. ⊨ ¬d(O) LW ¬D(I)
SA2. ⊨ D(O) LW d(I).

It is clear that SA1 and SA2 imply the axioms that they replace, and therefore
any Align is also a CL. Any combination of Align and CL elements that obeys the
interconnection rules above is also a CL. Some interconnections of CL and Align
modules are Align modules as well.
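
In its definedness protocol the Align behaves much like a Muller C-element. A toy behavioral model (ours, under that reading of SA1 and SA2; the names are hypothetical) is:

    # A toy model (ours) of the Align's definedness behavior, read as a Muller
    # C-element: the output may change only when ALL inputs agree.

    def align_step(inputs_defined, output_defined):
        if all(inputs_defined):
            return True          # all inputs defined: output becomes defined
        if not any(inputs_defined):
            return False         # all inputs undefined: output becomes undefined
        return output_defined    # otherwise the output holds its state

    out, history = False, []
    for ins in [(False, False), (True, False), (True, True),
                (False, True), (False, False)]:
        out = align_step(ins, out)
        history.append(out)
    assert history == [False, False, True, True, False]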

5.3 Feedback register FB


FB is a buffering module that makes its output defined when its input becomes
undefined, and vice versa. Typically the output is just a copy of the previously
defined input. The specification of the FB is like that of the CL, except that
the roles of "defined" and "undefined" are reversed in the output part of the
specification, and the value in the future is specified in terms of the value at present.

5.4 Pipeline modules PL


A CL must be placed in an environment that behaves properly, i.e., changes
the CL input only when that is permissible. This means that there must be some
feedback from the output of the CL to the source of its input. Pipeline modules
provide for this feedback with explicit acknowledgment lines. Pipeline elements
can be connected in sequence in such a way that they operate in parallel while
transmitting data down the pipe. A pipeline element can be built from CL's and
Aligns. Its specifications are somewhat more complex than those of the CL, since
the special role of the acknowledgment lines must be made explicit. However, they
are based on exactly the same sort of two-phase signalling conventions that we have
used already. A sequence of pipelines is itself a pipeline. In addition, a pipeline is
a CL when we exclude the acknowledgment lines from the input set I and output
set O. A pipeline can be combined with CL's according to the usual composition
rules.

[Figure 2. A possible implementation of PL, built from a CL, an Align element, and a C (conversion) element; I and O are the data lines, A_I and A_O the input and output acknowledgment lines.]

The temporal specifications of the PL are

PL1. ⊨ D(I) LW ¬d(A_I)
PL2. ⊨ ¬d(A_I) LW [¬d(A_O) ∨ ¬D(I)]
PL3. ⊨ ¬d(I) LW d(A_I)
PL4. ⊨ d(A_I) LW [d(A_O) ∨ d(I)]
PL5. ⊨ D(O) LW ¬d(A_O)
PL6. ⊨ ¬d(A_O) LW ¬D(O)
PL7. ⊨ ¬d(O) LW [d(A_O) ∨ ¬D(I)]
PL8. ⊨ d(A_O) LW D(O)
PL9. ⊨ D(I) ⇝ d(A_I)
PL10. ⊨ D(I) ⇝ D(O)
PL11. ⊨ D(O) ⇝ d(A_O)
PL12. ⊨ ¬d(I) ⇝ ¬d(A_I)
PL13. ⊨ ¬d(I) ⇝ ¬d(O)
PL14. ⊨ ¬d(O) ⇝ ¬d(A_O),

where here ⊨ abbreviates the initial conditions:

¬d(I) ∧ ¬d(O) ∧ ¬d(A_I) ∧ ¬d(A_O).

Essentially these are two 4-phase Muller cycles ([S2]), one in the handshaking
of I and A_I and the other between O and A_O. Each of these two cycles is similar
to the "requesters and granter" example described in [MP].
The C element in the implementation proposed in Figure 2 is a conversion
element whose output is defined when the input is undefined and vice versa.
The price that we pay for having an asynchronous pipeline is that each module
has to wait for the acknowledgment from the next one before it can issue its own
acknowledgment. This price is not too heavy in two important cases:
a. When the computation time within each module is much larger than the
communication time, and all the modules have similar computation times. For
example, this is the case when each module performs many iterations of some
simple computation before passing the result and getting the next chunk of
data. In general we can move toward this behavior by increasing the amount
of buffering within each module.
b. When the variability of computation time is large and depends on the actual
data. In such a case a synchronous system will have to operate with a clock
rate corresponding to the worst case; the asynchronous computation will be
much faster on the average.
Note that we could have given weaker restrictions that allow a PL to acknowl-
edge its input before its output is acknowledged. This would allow greater
flexibility; however, such a PL would not in general be a CL.

5.5 Finite state machine


A finite state machine can be built from CL's, Aligns, and FB's. Its timing
behavior, and thus its temporal logic specification, is that of a CL or PL, depending
on the presence or absence of the acknowledgment lines. In Figure 3 we suggest an
implementation of an FSM that behaves like a CL, assuming the initial state to
be ¬d(I_FB) ∧ D(O_FB) ∧ ¬d(I) ∧ ¬d(O). If we add the acknowledgment lines (or
replace the Align and CL combination by a PL) we get an FSM that behaves like
a PL.

[Figure 3. A CL implementation of FSM: a CL and an Align element in the data path, with an FB feedback register carrying the state (its input I_FB taken from the output side and its output O_FB fed back to the input side).]

6 Conclusions
We have found that it is possible to specify the timing behavior of a variety
of self-timed modules. In fact, since these modules all depend on the same sort of
two-phase signalling conventions, it is not hard to deduce any of the specifications
from those of the CL. Note that in the process of formalizing the specifications of
CL we added restrictions that were not part of the original weak conditions, such
as that individual lines are well behaved. The need for these restrictions becomes
evident when trying to build sound formal specifications.
The advantages of a formal notation like temporal logic for expressing these
timing specifications are twofold.
1. Temporal logic's expressive power allows us to state various kinds of spec-
ifications explicitly and unambiguously. It is more precise than informal
specification methods, and seems to be able to express a wider range of
important properties than other formal methods.

2. Temporal logic provides a reasoning system that can be used to formally
validate composition rules, as well as to verify properties of particular com-
pound modules, such as an implementation of a finite state machine. In ad-
dition, temporal logic can be used to express data-value specifications ([OL]),
as well as the timing specifications we have been discussing. This gives us a
uniform framework for combined reasoning about both kinds of properties.
We next plan to validate the composition rules proposed by Seitz, using the
axioms and inference rules of temporal logic. (The proofs provided in [S1] extend
only to CL composition, and are quite informal.) We will also consider proofs of
the correctness of particular constructions, such as those for pipelines and finite
state machines.

Acknowledgments
This paper builds on the work of Chuck Seitz on self-timed systems and Zohar
Manna on temporal logic. We are indebted to both of them for valuable discussions
that provided the context for our approach. Gregor v. Bochmann ([Bo]) developed
an earlier method of using temporal logic to describe self-timed systems, and
conversations with him contributed to our understanding of the problem. Pierre
Wolper and Amy Lansky provided helpful comments on various drafts of this paper.

References
[Bo] G. v. Bochmann, "Hardware specification with temporal logic: An example," Technical Note, Computer Systems Laboratory, Stanford University, June 1980.
[GPSS] D. Gabbay, A. Pnueli, S. Shelah, and J. Stavi, "The temporal analysis of fairness," Proc. of the 7th Symposium on Principles of Programming Languages, Las Vegas, NV (January 1980), pp. 163-173.
[Ha] B. T. Hailpern, "Verification of concurrent processes using temporal logic," Technical Report No. 195, Computer Systems Laboratory, Stanford University, August 1980.
[HO] B. T. Hailpern and S. Owicki, "Modular verification of computer communication protocols," submitted to IEEE Transactions on Communications.
[Ma] Z. Manna, "Verification of deterministic, sequential programs: Temporal axiomatization," Computer Science Dept., Stanford University, 1981.
[MP] Z. Manna and A. Pnueli, "Verification of concurrent programs: The temporal framework," Computer Science Department, Stanford University, June 1981.
[MC] C. A. Mead and L. Conway, Introduction to VLSI Systems, Addison-Wesley, 1980.
[OL] S. Owicki and L. Lamport, "Proving liveness properties of concurrent programs," submitted to ACM Transactions on Programming Languages and Systems.
[S1] C. L. Seitz, "System Timing," Chapter 7 of [MC], pp. 218-262.
[S2] C. L. Seitz, "Ideas about arbiters," LAMBDA, First Quarter 1980.
[S3] C. L. Seitz, personal communication.
A Mathematical Approach to Modelling the Flow
of Data and Control in Computational Networks

Lennart Johnsson¹  Danny Cohen¹


Caltech/ISI  ISI/Caltech

ABSTRACT
This paper proposes a mathematical formalism for the
synthesis and qualitative analysis of computational networks
that treats data and control in the same manner.
Expressions in this notation are given a direct
interpretation in the implementation domain. Topology,
broadcasting, pipelining, and similar properties of
implementations can be determined directly from the
expressions.
This treatment of computational networks emphasizes the
space/time tradeoff of implementations. A full
instantiation in space of most computational problems is
unrealistic, even in VLSI (Finnegan [4]). Therefore,
computations also have to be at least partially instantiated
in the time domain, requiring the use of explicit control
mechanisms, which typically cause the data flow to be
nonstationary and sometimes turbulent.

INTRODUCTION
The evaluation of mathematical expressions in general
requires arithmetic operations as well as communication of
data and control signals. The computations may be arranged
in several ways, spanning the spectrum from fully parallel
to fully sequential. The former approach spreads the
computations in space whereas the latter spreads them in
time.

¹ Mailing addresses: Computer Science, MS-280, Caltech,
Pasadena, California 91125 and ISI, 4676 Admiralty Way,
Marina del Rey, California 90291.


High computation throughput is typically achieved by
spreading the computations entirely in space. The data flow
is stationary and often laminar. Such a flow is easy to
understand and to describe graphically.
Unfortunately, the cost of a full instantiation in space is
often prohibitive for large problems. Therefore, the
alternative instantiation in time has to be considered.
The spread in time requires a more complicated control
mechanism than is needed for the fully parallel
implementations. When computations are spread fully in
space, the control of the operations may also be in the
space domain (e.g., the size of an array may determine its
computational response). On the other hand, with spread in
time the control may often be in the time domain too. In
addition, the data flow is typically not laminar, and
turbulence (in the form of loops) occurs.
Typically, control signalling varies between the parallel
and the sequential regimes, just as with operations on data.
Therefore, it is possible to use the same mathematical tools
for treating the control and data flow through networks.
This work treats time/space tradeoffs mathematically. The
required computations are spread in several ways in order to
optimize the computational throughput of the implementations
at a reasonable cost.
For many problems, arithmetic operations have to be
performed simultaneously, such as those expressed by the
mathematical symbols "sigma" for addition and "pi" for
multiplication. The direct circuit implementation of
simultaneous operations has lower throughput than sequential
additions. Therefore, pipelined structures are generally
sought.
The transformation of certain networks into pipelined
structures affects the nature of the propagation of both
data and control. These effects are studied and treated
formally.
The organization of VLSI arrays is an active area of
research. For example, systolic arrays for a few typical
problems in numerical linear algebra and for the Discrete
and Fast Fourier Transforms have been proposed and described
by H.T. Kung [8], as has a two-dimensional array for solving
a linear system of equations by S.Y. Kung [9]. The key
issues in VLSI design are discussed in Mead and Conway [10].

To the best of our knowledge no proof of the correctness of
these systems is available.
Two-dimensional arrays for the multiplication of band
matrices have been described formally by Weiser and Davis
[13].

All of the above systems use implicit control.

The computations of a Finite Impulse Response (FIR) filter
and of the Discrete Fourier Transform (DFT) are used to
illustrate the concept of mathematical modelling of both
data flow and explicit control.

FINITE IMPULSE RESPONSE FILTERS


The delay operator, Z, defined by Zx(n) = x(n-1), is used
for modelling storage (delay) elements. The properties of
this operator and examples of using it for the synthesis and
analysis of computational networks may be found in Cohen [1]
and in Johnsson et al. [7].
There exists a straightforward correspondence between
mathematical expressions and networks that contain only
combinational logic and delay (memory) units.
With symbolic manipulation and operator calculus it is
possible to generate for a given expression (and the
corresponding network) several alternatives that compute the
same results while having different properties like
throughput and delay.
The transformations between computationally equivalent
networks and their evaluation with respect to a given set of
criteria may be carried out formally, in a way which may
eventually be automated.
For example, a general FIR filter (Oppenheim and Schafer
[11], Rabiner and Gold [12], and Cohen [1]) is defined by:

y(n) = Σ_{i=0}^{N-1} a(i) x(n-i)    (1)

which may be expressed as

y(n) = Σ_{i=0}^{N-1} a(i) Z^i x(n)    (2)

or as

y(n) = Σ_{i=0}^{N-1} Z^i (a(i) x(n))    (3)

Even though (2) and (3) are computationally equivalent, they
have different characteristics. For example, it is clear
that (2) requires a simultaneous addition of N elements,
whereas (3) uses pipelined additions which may be performed
within a shorter cycle than needed for the former, hence
yielding a higher throughput.
It is also clear that (3) requires the broadcasting of the
input signal X = {x(n)} to all the computation elements,
unlike (2), which does not require it.
Figures 1 and 2 show the networks corresponding to these
expressions.

[Figure 1: The implementation of (2)]

[Figure 2: The implementation of (3)]
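
The computational equivalence of (2) and (3) is easy to confirm by simulation. The sketch below (ours, not from the paper; the function names are hypothetical) implements (2) as taps on a delay line and (3) as the transposed form in which each Z is a register between adders; both produce identical outputs:

    # A simulation (ours) confirming that forms (2) and (3) compute the same y(n).

    def fir_form2(a, xs):
        # y(n) = sum_i a(i) Z^i x(n): read the taps of a delay line.
        N = len(a)
        delay = [0.0] * N
        out = []
        for x in xs:
            delay = [x] + delay[:-1]       # delay[i] now holds Z^i x(n)
            out.append(sum(a[i] * delay[i] for i in range(N)))
        return out

    def fir_form3(a, xs):
        # y(n) = sum_i Z^i (a(i) x(n)): broadcast x, pipeline the partial sums;
        # r[i] models the register (the Z) between adder stages i-1 and i.
        N = len(a)
        r = [0.0] * (N + 1)
        out = []
        for x in xs:
            out.append(a[0] * x + r[1])
            r = [0.0] + [a[i] * x + r[i + 1] for i in range(1, N)] + [0.0]
        return out

    a = [1.0, 2.0, 3.0]
    xs = [1.0, 0.0, 0.0, 4.0, 5.0]
    assert fir_form2(a, xs) == fir_form3(a, xs)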

Both of these networks spread the computations entirely in
space, thereby achieving fully parallel implementations. In
the network of Figure 1 all the computations for each y(n)
are simultaneous, whereas in that of Figure 2 the
computations are skewed in the time/space domain, which is
typical of pipelines.
Both of these implementations have implicit control embedded
in space, by virtue of the array size, N.
By explicitly addressing the control issue (Cohen and Tyree
[2]) these networks can be transformed into sequential
implementations with significantly different
characteristics, not only of performance but also regarding
issues of error containment and hardware reduction.
As a result, the mathematical requirement to operate on all
the input elements, {x(n)}, with all the constant
coefficients, {a(i)}, (implemented in the parallel network
by moving the data along the stationary coefficients), is
implemented in the sequential network by the dual approach
of (relatively) stationary input elements and moving
coefficients.
An explicit control signal, U, is used to implement the
accumulation of a variable number, N, of elements, where N
is determined only at run time. The expression U'A + UB, which
is equal to A when U=0 and to B when U=1 (U' denoting the
complement of U), is used to model a
multiplexer² for the variable length accumulation, as shown
in Figure 3. Note that since the value 0 is connected to
the 1-input of the multiplexer the effect of U=1 is to clear
(reset) the accumulator (the Z-element).
The function performed by the network in Figure 3 is
described by the following expression:

Y = UT,  where  T = X + Z(U'T + U·0) = X + ZU'T    (4)

The UT means "enabling" Y to have the value T when U=1 and
not having any value when U=0 (e.g., E in Figure 3 is a
tristate driver). Hence, the explicit control that resets
the accumulators (the Z-units) to 0 also enables the E-units to
drive the output bus. Note that the existence of the output
of E is controlled by U, not its value, as in the case of
multiplexers.

² A is called the 0-input and B is the 1-input.

[Figure 3: A finite accumulation network, with input X, control U, and output Y.]

The output of this network, Y, is always the accumulated
sum of all the input values, X, since the last time U=1.
The right-hand side of equation (4) can be expanded a few
times to make the nature of the network more explicit:

T = X + ZU'T    (5)
  = X + ZU'(X + ZU'T)
  = X + ZU'X + ZU'ZU'(X + ZU'T)
  = X + ZU'X + ZU'ZU'X + ZU'ZU'ZU'(X + ZU'T) = ...

The expansion can be carried on indefinitely. However, if
U=1 every N cycles, then at most N-1 consecutive values of
U' may be equal to 1. When U=1 the output Y is equal to the
sum of the last N input values.
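
A behavioral sketch (ours, with hypothetical names) of the finite accumulation network makes this reset behavior concrete:

    # A behavioral sketch (ours) of the finite accumulation network of Figure 3:
    # T = X + Z(U'T), with U' the complement of U; Y is driven only when U = 1.

    def finite_accumulator(xs, N, phase=0):
        t_reg = 0.0                        # the Z element
        ys = []
        for k, x in enumerate(xs):
            u = 1 if (k - phase) % N == 0 else 0
            t = x + t_reg                  # T = X + Z(U'T)
            ys.append(t if u == 1 else None)   # tristate: driven only when U = 1
            t_reg = 0.0 if u == 1 else t   # U = 1 clears the accumulator
        return ys

    # With N = 3 and phase 2, each driven output is the sum of the last 3 inputs.
    ys = finite_accumulator([1, 2, 3, 4, 5, 6, 7], N=3, phase=2)
    assert ys[2] == 1 + 2 + 3 and ys[5] == 4 + 5 + 6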
To simplify the expressions for the case with cyclic
control, the following notation is introduced. Let [k] be
the remainder of k when divided by N, and let ⟨k⟩ = k - [k],
which is the largest multiple of N that does not exceed k.
Hence, k = ⟨k⟩ + [k] for all values of k.
A signal, U, having the value 1 every N cycles and the value
0 in between is expressed by U(t) = d([t-j]), where d(0)=1
and d(x≠0)=0. This j is the "phase" of the control signal.

If the phase of U is j, then the phase of ZU is j+1, because
ZU is delayed one cycle behind the U signal.³
The expression above, (5), for the time k, is rewritten as

Y(t=k) = Σ_{m=0}^{[k-1-j]} Z^m X    (6)

Only when [k-1-j] = N-1, which is when [k] = [j], is the output
the sum of the last N input values.
The network shown in Figure 3 produces such an N-term sum
only once in every N cycles. If such a sum is needed for
every cycle, then N units are required. These units should
be controlled by U's of different phases.
In equation (3) the {a(i)} were assumed to be time
independent, therefore not affected by the Z-operator.
However, one may arrange a network where the {a(i)} travel
in a circular shift register of length N, such that the a(i)
coefficient follows, rather than precedes, the a(i+1)
coefficient. In this case Za(i) = a(i+1) and a(i) = Z^i a(0).
This arrangement of coefficients is called "dynamic
coefficients."
By introducing the notions of controlled selection
(multiplexing), finite accumulation, and dynamic
coefficients, it is possible to obtain implementations
consisting of an array of sequential elements, as shown in
Figure 4. Such arrays effectively perform a cyclic
convolution.
The network in Figure 4 computes

Y = Σ_i (Z^i U) Σ_{m=0}^{R} Z^m (a · Z^i X),  where R = [k-i-1]    (7)

³ This is proven by ZU(t) = d([t-1-j]) = d([t-(j+1)]).

[Figure 4: Two sequential elements for an FIR network.]

Note that the outer "sigma" and the Z^i U represent the
selection of the output of the i-th unit exactly when the
output of that unit is the sum of the last N products of x's
and a's.⁴ Hence it expresses the same function as equations
(1), (2), and (3), and the network in Figure 4 is
functionally equivalent to the networks in Figures 1 and 2.
The implementation represented in Figure 4 requires extra
circuitry associated with the explicit control. The
previous implementations do not need the extra circuitry
because the value of N is built implicitly into their
configurations. Another nice aspect of those networks is
that the data flow inside them is always laminar and easy to
track, unlike the turbulent flow in the network of Figure 4.

⁴ Proof: Unit #i has the phase of i and is selected
(enabled) when [k]=i; at that time R = [k-i-1] = [i-i-1] = N-1.
Hence, Y is a sum of the last N terms.

However, the sequential implementation has some important
advantages deserving attention.
In networks that are fully instantiated in space, every
module (multiplier, adder, and delay) and every connection
are directly involved in the computation of every y(k).
Hence, any component failure in the entire network causes
every output to be erroneous. On the other hand, in the
network of Figure 4, failure of a single adder or a single
accumulator (a Z-element) affects only 1/N of the output
set.
If the number of the required {y(k)} is smaller than the
number of {x(k)}, as for down-sampling by a factor of K, the
hardware can be reduced by using only N/K units, instead of
N as needed for the networks of Figures 2 and 3. Down-
sampling by those networks can be done only by discarding
K-1 out of every K output values, not by reducing the
computation, as in the network of Figure 4.
From a signal processing point of view this computation scheme
is better than first down-sampling and then filtering.
There is no way to use any of the former networks for
lengths greater than the original N. The latter network,
with its explicit control, may easily be modified to handle
shorter lengths than the original N, which is the number of
units in the configuration. If a longer length, M, has to
be handled, the latter network can still be used. It will
produce only some (N/M) of the {y(k)}. This is better than
the former networks, which cannot produce any of the {y(k)}.
Even with the extra hardware, the latter network has several
advantages over the former ones.

THE DISCRETE FOURIER TRANSFORM


A more interesting case is the implementation of the
continuous DFT computation (Rabiner and Gold [12], Johnsson
and Cohen [6]). A continuous DFT is a DFT that is updated
every input cycle. If computed on windows of length N, then
the straightforward implementation (as a matrix
multiplication) requires NM complex multiplications, where M
is the number of the required components of the DFT. The use
of the FFT (Cooley and Tukey [3]) requires only (N log N)/2
steps, independent of the value of M.
Fully parallel implementations of any interesting DFT
computation require more components than it is practical to
consider.
A formal treatment of FFT arrays has been presented in
Johnsson and Cohen [5]. Several networks for the DFT can be
derived formally from its definition. One sample
implementation is derived and shown here.
The implementation resulting from our formal treatment of
the DFT requires only 2M (or possibly M) multipliers.
(Compare this with NM multipliers for a full instantiation
in space and (N log N)/2 multipliers for the FFT.)
The m-th DFT component at time t=n is defined by

y(m,n) = Σ_{i=0}^{N-1} w^{im} x(n-i)    (8)

which is rewritten as

y(m,n) = Σ_{i=0}^{N-1} w^{im} Z^i x(n) = Σ_{i=0}^{N-1} (w^m Z)^i x(n)    (9)

Since this is a geometric series its sum is

y(m,n) = Σ_{i=0}^{N-1} (w^m Z)^i x(n) = (1 - w^{mN} Z^N)(1 - w^m Z)^{-1} x(n)    (10)

These two operators commute and may be combined
("multiplied") in either order. However, for response
reasons the order used in (10) is preferred.
Even though theoretically w^{mN} = 1, it may be beneficial not
to eliminate the multiplication by 1, in order to treat the
effects of roundoff errors in limited precision
implementations.

The implementation of (1 - w^m Z)^{-1} may not look intuitive
to the untrained reader. However, note that
U = (1 - w^m Z)^{-1} V yields:

(1 - w^m Z) U = V,  or  U = w^m ZU + V    (11)

This equation is implemented by the network in Figure 5.

[Figure 5: The implementation of U = (1 - w^m Z)^{-1} V]

With the network shown in Figure 5 for (1 - w^m Z)^{-1} the
implementation of equation (10) is:

[Figure 6: The continuous implementation of y(m,n), with multiplier coefficients -w^{mN} and w^m and delays Z^N and Z.]
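
The recursion of (10) and (11) can be checked numerically against the window definition (8). The sketch below is ours, not from the paper; it assumes the usual DFT kernel w = exp(-2πj/N), applies the comb stage (1 - w^{mN} Z^N) first, and then the resonator of Figure 5:

    # A numerical check (ours) of the recursive continuous DFT of equations
    # (10)-(11) against the window definition (8).

    import cmath, math

    def continuous_dft(xs, m, N):
        w = cmath.exp(-2j * math.pi / N)
        comb = [0j] * N                        # the Z^N delay line
        u = 0j
        ys = []
        for x in xs:
            v = x - (w ** (m * N)) * comb[-1]  # w^{mN} = 1, kept for roundoff study
            comb = [x] + comb[:-1]
            u = (w ** m) * u + v               # the network of Figure 5
            ys.append(u)
        return ys

    def direct_dft(xs, m, N, n):               # equation (8) on a length-N window
        w = cmath.exp(-2j * math.pi / N)
        return sum((w ** (i * m)) * (xs[n - i] if n - i >= 0 else 0)
                   for i in range(N))

    xs = [1.0, 2.0, -1.0, 0.5, 3.0, -2.0, 1.5, 0.0]
    N, m = 4, 1
    ys = continuous_dft(xs, m, N)
    assert all(abs(ys[n] - direct_dft(xs, m, N, n)) < 1e-9 for n in range(len(xs)))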

CONCLUSIONS
A rigorous mathematical formal approach may be applied to
the synthesis and the analysis of both the data paths and
the control signalling of computational networks.
This mathematical tool is instrumental for the investigation
of time/space tradeoffs, and supports the transformations of
networks between implementations ranging from fully parallel
to sequential, or both, such as in pipelines.

The same formalism used for the manipulation of the data
flow is also used for control signalling.

In addition to its value for the synthesis of networks, this
formalism is also very useful for verification of
correctness and qualitative analysis. Network properties
such as pipelining and broadcasting can be determined
directly from expressions in the notation.

ACKNOWLEDGMENTS
The authors gratefully acknowledge the support for this
research provided generously by the Defense Advanced
Research Projects Agency under contract MDA-80-C-0523 with
the USC/Information Sciences Institute and contract N00014-
79-C-0597 with the California Institute of Technology.
Views and conclusions contained in this paper are the
authors' and should not be interpreted as representing the
official opinion or policy of DARPA, the U.S. Government,
or any person or agency connected with them.

REFERENCES
(1) Cohen, D., "Mathematical approach to iterative computational networks," Proceedings of the Fourth Symposium on Computer Arithmetic, pp. 226-238, October 1978; also published as USC/Information Sciences Institute RR-78-73, November 1978.
(2) Cohen, D., and V.C. Tyree, "VLSI system for Synthetic Aperture Radar (SAR) processing," Proceedings of the Society of Photo-Optical Instrumentation Engineers (SPIE), vol. 186, pp. 166-177, 1979.
(3) Cooley, J.W., and J.W. Tukey, "An algorithm for the machine calculation of complex Fourier series," Mathematics of Computation, vol. 19, pp. 297-301, 1965.
(4) Finnegan, J., "The VLSI approach to computation complexity," in these proceedings.
(5) Johnsson, S.L., and D. Cohen, "Computational arrays for the Discrete Fourier Transform," COMPCON 81, February 1981.
(6) Johnsson, S.L., and D. Cohen, "A VLSI approach to real-time computational problems," Proceedings of the Society of Photo-Optical Instrumentation Engineers (SPIE), vol. 298, September 1981.
(7) Johnsson, S.L., U. Weiser, D. Cohen and A. Davis, "Towards a formal treatment of VLSI arrays," Proceedings of the Second Caltech Conference on VLSI, January 1981.
(8) Kung, H.T., and C.E. Leiserson, "Algorithms for VLSI processor arrays," in [10].
(9) Kung, S.Y., "VLSI matrix computation array processor," The MIT Conference on Advanced Research in Integrated Circuits, February 1980.
(10) Mead, C.A., and L.A. Conway, Introduction to VLSI Systems, Addison-Wesley, 1980.
(11) Oppenheim, A.V., and R.W. Schafer, Digital Signal Processing, Prentice-Hall, 1976.
(12) Rabiner, L.R., and B. Gold, Theory and Application of Digital Signal Processing, Prentice-Hall, 1975.
(13) Weiser, U., and A. Davis, "Mathematical representation for VLSI arrays," University of Utah, Computer Science Department, Report UUCS-80-111, September 1980.
A Wavefront Notation Tool for VLSI
Array Design

Uri Weiser and Al Davis


University of Utah
Computer Science Department
Salt Lake City, Utah 84112

1. INTRODUCTION

This paper presents an overview of an extension to a


mathematically based methodology for mapping an algorithmic description
into a concurrent implementation on silicon. The result of this
methodology yields a systolic array [4]. The basic mathematical method
was initially described by Cohen [1]. Extensions were made by Weiser
and Davis [5]; Johnsson, Weiser, Cohen, and Davis [2]; Cohen and
Johnsson [3]; and Weiser [6]. This approach focuses on the
correspondence between equations defining a certain computation and
networks which perform the computation. As the complexity of problems
increases, a hierarchical approach can reduce the complexity by hiding
detail and thus reduce the design complexity at each level. The
purpose of this paper is to introduce a method for treating sets of
data as wavefront entities in the equations of the mathematical
methodology and in the graphical representation.
In order to represent streams of recurring data elements
(sequences), a special delay operator Z is defined [1]. When x is a
data element at a particular point in a computational network, Z[x] is
defined as the data element that was at the same point on the previous
time step. The notion of time step is used to simplify the discussion,
but does not necessarily correspond to a strictly synchronous view.
A sequence x(1), x(2), x(3), ..., x(i-1), x(i), x(i+1), ... is defined,
where x(i) precedes the arrival of x(i+1) by one time step. An
underlined subscript represents a positional index in the sequence.
The order of the sequence corresponds to the time at which the data
elements arrive at a certain point in the computational network. The
Delay operator will be seen to modify only time subscripts. The Delay
operator is defined such that:

Z[x(i)] = x(i-1)    (1)

The Z operator may be implemented by a simple register in a
clocked system. Equation 2 is merely a recursive definition of
Equation 1:

Z^n[x(i)] = Z[Z^{n-1}[x(i)]] = x(i-n)    (2)


The graphical duality of Equation 2 is shown in Figure 1-1.

[Figure 1-1: A chain of delay elements (Equation 2), producing x(i-1), ..., x(i-4) from x(i).]

This paper deals with functions KM [1, 5] which obey the
transformation expressed in Equation 3:

Z[KM(x,y,z)] = KM(Z[x], Z[y], Z[z])    (3)

This equation represents the commutativity of delay and function
operation.
One particular property of an "attractive" equation is that
recurring data sequences have been transformed in such a way that they
can be implemented as a pipelined data stream. A data stream is
defined as a set of data elements along a directed path through nodes
of a computational network.
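
The commutation of Equation 3 can be spot-checked on sampled sequences. The sketch below is ours (not from the paper) and models Z as prepending one time step of empty (zero) data:

    # A spot check (ours) of Equation 3 on sampled sequences: delaying the inputs
    # of a memoryless KM equals delaying its output (Z modelled as a zero-fill).

    def Zseq(seq):
        return [0] + seq[:-1]

    def apply_KM(f, *seqs):
        return [f(*vals) for vals in zip(*seqs)]

    f = lambda x, y, z: x * y + z          # any memoryless (combinational) KM
    x, y, z = [1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]
    lhs = Zseq(apply_KM(f, x, y, z))
    rhs = apply_KM(f, Zseq(x), Zseq(y), Zseq(z))
    assert lhs[1:] == rhs[1:]              # equal past the initial boundary value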

2. MANIPULATION OF WAVEFRONTS
The progression of sets of data elements and computations through
computational networks can be modeled by the concept of a wavefront.
Wavefronts can be defined either graphically in terms of the networks,
or mathematically in terms of equations. A wavefront is an ordered
set, such that no two elements belong to the same data stream, and all
the elements of the set move uniformly in time or in space. A
wavefront, denoted as Ā, represents an ordered set of data elements:
{a(1,m), a(2,m), ..., a(N-1,m), a(N,m)}, where m is the "time" subscript.
The elements a(i,m) for all m belong to the i-th data stream. For
simplicity, the "time" subscript in elements of a wavefront is omitted
and a(i,m) will simply be represented as a(i).
This paper introduces definitions of some of the more important
transformations on wavefronts, and presents examples of the technique.
A more complete set of transformations can be found in [6].
2.1 DELAYED WAVEFRONTS

The application of the delay operator to a wavefront Ā creates a
new set of data elements, each of which is a delayed version of the
corresponding data element in the set Ā. This is shown in Equation 4:

Ā₁ = Z[Ā]    (4)

where:
Ā = {a(1), a(2), ..., a(N-1), a(N)}  and
Ā₁ = {Z[a(1)], Z[a(2)], ..., Z[a(N-1)], Z[a(N)]}

[Figure 2-1: Wavefront Ā and Z[Ā].]

Figure 2-1 shows a wavefront Ā and its delayed version Z[Ā].

2.2 ROTATION OF WAVEFRONTS

A positive rotation function R₊[Ā] creates a new wavefront
in which the i-th element is delayed (i-1) times:

R₊[Ā] = {a(1), Z[a(2)], Z²[a(3)], ..., Z^{N-1}[a(N)]}

This corresponds to a
rotation of the wavefront in the graphical domain as shown in Figure 2-2,
where the pivot point is the first element of the set.
Similarly the negative rotation R₋ is defined as a rotation using the
last element of the set as the pivot element:

R₋[Ā] = {Z^{N-1}[a(1)], Z^{N-2}[a(2)], ..., Z[a(N-1)], a(N)}

Composition of the rotation functions R₊ and R₋ results in a
delayed wavefront.

[Figure 2-2: Rotation of a wavefront.]



2.3 SHIFTED WAVEFRONTS

A shift function S on a wavefront removes the first element and
adds a new last element to the original wavefront. For example
S[Ā] = {a(2), a(3), ..., a(N), a(N+1)} where Ā = {a(1), a(2), a(3), ..., a(N)}. The
inverse of the shift function will remove the last element and add a
new first element, i.e., S⁻¹[Ā] = {a(0), a(1), ..., a(N-1)} when
Ā = {a(1), a(2), ..., a(N-1), a(N)}. The value of this new element is implied
by the application. This function is useful in two-dimensional arrays
and its application is demonstrated in the matrix multiplication
example.
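
These transformations can be modelled concretely by tagging each wavefront element with its stream index and accumulated delay. The following sketch (ours; the encoding is hypothetical) verifies, for instance, that composing the two rotations yields an (N-1)-fold delayed wavefront:

    # Wavefront transformations as operations on (stream, delay) pairs: element
    # i of a wavefront is stream i's value, delayed the recorded number of steps.

    def Z(wf):           # delay every element once
        return [(s, d + 1) for (s, d) in wf]

    def R_pos(wf):       # positive rotation: the i-th element gains (i-1) delays
        return [(s, d + i) for i, (s, d) in enumerate(wf)]

    def R_neg(wf):       # negative rotation: pivot on the last element instead
        n = len(wf)
        return [(s, d + (n - 1 - i)) for i, (s, d) in enumerate(wf)]

    def S(wf):           # shift: drop the first element, append a new last one
        return wf[1:] + [(wf[-1][0] + 1, wf[-1][1])]   # assumes consecutive streams

    A = [(i, 0) for i in range(1, 5)]          # the wavefront {a(1), ..., a(4)}
    assert R_neg(R_pos(A)) == Z(Z(Z(A)))       # R- composed with R+ = Z^(N-1)
    assert S(A) == [(2, 0), (3, 0), (4, 0), (5, 0)]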

2.4 IMPLEMENTATION OF THE FUNCTION KM


KM is a function applied to two wavefronts, each consisting of N
data streams, and produces a new data stream. The function element KM,
in this case, has three inputs: an element of the set Ā, an element of
the set X̄, and a partial result which is the output carry of the
neighboring KM element. The output of KM is a new partial result or
output carry. The mathematical representation of this implementation
is given in Equation 8:

y = KM{Ā, X̄} = KM{a(1), x(1), KM{a(2), x(2), ... KM{a(N-1), x(N-1), KM{a(N), x(N), ID}} ... }}    (8)

where ID is the two-sided identity element for the function KM.

All elements appear at "time" m: y = y(m), a(i) = a(i,m), for all i.
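
Equation 8 is simply a right-to-left fold of KM with the identity ID, as the following sketch (ours; the helper name is hypothetical) shows for the inner-product instance KM(a, x, c) = a·x + c with ID = 0:

    # Equation 8 as a right-to-left fold of KM with identity ID (our sketch).

    from functools import reduce

    def km_chain(km, ident, a_wf, x_wf):
        return reduce(lambda carry, pair: km(pair[0], pair[1], carry),
                      zip(reversed(a_wf), reversed(x_wf)), ident)

    # The inner product is one KM instance: KM(a, x, c) = a*x + c, with ID = 0.
    assert km_chain(lambda a, x, c: a * x + c, 0, [1, 2, 3], [4, 5, 6]) == 32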


y = KM{R₊[Ē], R₊[F̄]} can be implemented in two different ways, as
expressed in Equations 9 and 10:

y = KM{R₊[Ē], R₊[F̄]} = KM{e(1), f(1), KM{Z[e(2)], Z[f(2)], ... KM{Z^{N-2}[e(N-1)], Z^{N-2}[f(N-1)], KM{Z^{N-1}[e(N)], Z^{N-1}[f(N)], ID}} ... }}    (9)

Using the distribution of delay and function rule presented in the
previous section, this equation can be written in the form:

y = KM{R₊[Ē], R₊[F̄]} = KM{e(1), f(1), Z[KM{e(2), f(2), ... Z[KM{e(N-1), f(N-1), Z[KM{e(N), f(N), ID}]}] ... }]}    (10)

The graphical representation of Equations 10 and 9 is given in
Figure 2-3.

[Figure 2-3: Graphical representation of Equations 10 and 9 (parts (a) and (b)).]

Both networks presented in Figure 2-3 produce the same output from the
same set of inputs. Network 2-3b exhibits the higher throughput, since
its time step duration is only the time required for the function KM to
produce an output from the given inputs.

3. EXAMPLES

Two examples using wavefront transformations are presented in this


section. The first is a string matching network, and the second is
band matrix multiplication.

3.1 EXAMPLE 1: STRING MATCHING


Two strings of length N are compared element by element. New
strings are applied to the network at every time step. Each string is
applied to the network through N terminals. The input strings are
defined as wavefronts Ā and X̄.

The function involved in the matching procedure is KM, where:

KM(a, x, R2) = T if a = x and R2 = T; F otherwise    (11)

The output from the network for the m-th set is equal to:

y(m) = KM{Ā, X̄}    (12)

The graphical representation of Equation 12 is given in Figure 3-1.

[Figure 3-1: y = KM{Ā, X̄}.]

The network in Figure 3-1 is slow, since the time step duration
will be the time required to compute the function KM N times.
Using the equivalence between the networks in Figure 2-3 it is possible to
embed delay elements between the KM elements and thus decrease the time
step by N, which is equivalent to increasing the throughput by
N. Application of a positive rotation on each of the wavefronts will
result in:

y(m) = KM{R₊[Ā₁], R₊[X̄₁]}    (13)

where:

Ā₁ = Z^{-(N-1)}[Ā] = {a(1,m+N-1), a(2,m+N-1), ..., a(N-1,m+N-1), a(N,m+N-1)}    (14)

X̄₁ = Z^{-(N-1)}[X̄] = {x(1,m+N-1), x(2,m+N-1), ..., x(N-1,m+N-1), x(N,m+N-1)}

The resulting network is shown in Figure 3-2.


[Figure 3-2: y = KM{R₊[Ā₁], R₊[X̄₁]}.]
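
Behaviorally, the matching network computes a fold of the comparison KM of Equation 11. A small functional check (ours, ignoring the timing skew introduced by the rotations) is:

    # A functional check (ours) of the string matcher of equations (11)-(12):
    # the chained KM is T exactly when the two strings match element by element.

    def km_match(a, x, r2):
        return (a == x) and r2

    def match(A, X):
        carry = True                    # ID for this KM is T
        for a, x in zip(reversed(A), reversed(X)):
            carry = km_match(a, x, carry)
        return carry

    assert match("VLSI", "VLSI") is True
    assert match("VLSI", "VLSJ") is False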

3.2 EXAMPLE 2: BAND MATRIX MULTIPLICATION

Let matrix Y represent the product of band matrices A and X with
band widths r₁+s₁+1 and r₂+s₂+1 respectively (see Figure 3-3). Input
and output data elements will flow through data streams, each related
to the diagonals of the matrices. One strategy is to try to produce
one column of the matrix Y as an output of the network every time step:
y(J-n, J), as n varies from -r = -(r₁+r₂) to s = s₁+s₂, represents column J
of matrix Y.
[Figure 3-3: Band matrix multiplication. The band matrices A and X (band widths r₁+s₁+1 and r₂+s₂+1) and their product Y, with entries a(i,j), x(i,j), and y(i,j) laid out along the matrix diagonals; the dotted lines mark the elements entering the product y(J-1,J).]

Wavefront notation is used here in order to simplify the equations and
to enhance the perception of the interaction between data elements
inside the network. Wavefront Ā is defined as a row in matrix A:

Ā = {a(J, J-r₁), a(J, J-r₁+1), ..., a(J, J+s₁-1), a(J, J+s₁)}    (15)

Wavefront X̄ is defined as a column in matrix X:

X̄ = {x(J-s₂, J), x(J-s₂+1, J), ..., x(J+r₂-1, J), x(J+r₂, J)}    (16)

The element y(J,J) is calculated by the inner product of wavefront
Ā (a row in matrix A) and X̄ (a column in matrix X):

y(J,J) = KM{Ā, X̄}    (17)

The function KM is an inner product function. The network producing
y(J,J) is identical to the one presented in Example 1, shown in Figure
3-2, where:

Ā₁ = Z^{-(s₁+r₁+1)}[Ā]    (18)
X̄₁ = Z^{-(s₁+r₁+1)}[X̄]

The element y(J-1,J) is the inner product of the (J-1)-th row of matrix
A (Z[Ā]) and the J-th column of matrix X, but the last element of the
wavefront X̄ does not participate in the product (see the dotted lines
in Figure 3-3). This means that the element y(J-1,J) is calculated by
the inner product of Z[Ā] and S⁻¹[X̄]. For these new wavefronts the
network that computes y(J-1,J) is identical to the one presented in
Figure 3-2. It is possible to use the same structure, delaying the
wavefront Ā₁ and shifting the wavefront X̄₁. This process can be
repeated for all n, and the new structure will contain at all points of
the grid the same elements as on the diagonal in Figure 3-2. Delay
elements reside in the vertical and horizontal directions.

y(J-n, J) = KM{Zⁿ[Ā], S⁻ⁿ[X̄]} = Z^{-(s₁+r₁+1)}[KM{Zⁿ[Ā₁], S⁻ⁿ[X̄₁]}]    (19)

The resulting network is given in Figure 3-4.

[Figure 3-4: Band matrix multiplication, output Y in columns (Equation 19).]

The shaded elements shown in Figure 3-4 function only as delay
elements and can be eliminated. Note that only the unshaded elements
are needed; therefore matrix multiplication can be implemented using
(r₁+s₁+1)(r₂+s₂+1) modular P elements. The result in this case is a
cross section of all data streams.
In [5] the same problem was presented without using the wavefront
notation, and the representation is clearly more complex. The
throughput of this computational network is three times higher than the
throughput of a similar network presented by H.T. Kung [4], which uses
the same number and type of modular elements.

4. CONCLUSIONS
The wavefront notation presented in this paper drastically reduces
the complexity of the mathematical representation of computational
array problems. This encapsulation can also be used as a perceptual
tool to represent the movement of sets of data elements through the
network, and consequently may enhance the appreciation of the
interactions between data elements in the array.

REFERENCES

1. Cohen, D. Mathematical approach to iterative computation networks. Proceedings of the Fourth Symposium on Computer Arithmetic, October 1978, pp. 226-238. Also ISI/RR-78-73, USC/Information Sciences Institute [4676 Admiralty Way, Marina del Rey, CA 90291], November 1978.
2. Johnsson, L., Weiser, U., Cohen, D., and Davis, A. L. Towards a formal treatment of VLSI arrays. Proceedings of the Caltech Conference on Very Large Scale Integration, January 1981.
3. Johnsson, L. and Cohen, D. Computational arrays for the Discrete Fourier Transform. Compcon 81, February 1981, pp. 236-244.
4. Kung, H. T. and Leiserson, C. E. Systolic arrays (for VLSI). Tech. Rept. CMU-CS-79-103, Carnegie-Mellon University, 1978.
5. Weiser, U. and Davis, A. L. Mathematical representations for VLSI processor arrays. Tech. Rept. UUCS-80-111, University of Utah, September 1980.
6. Weiser, U. Mathematical and Graphical Tools for the Creation of Computational Arrays. Ph.D. Thesis, Dept. of Computer Science, University of Utah, July 1981.
A Matrix Data Flow Language/Architecture for
Parallel Matrix Operations Based on
Computational Wavefront Concept*
S.Y. Kung, K.S. Arun, D.V. Bhaskar Rao, and Y.H. Hu
University of Southern California
Department of Electrical Engineering - Systems
Los Angeles, California 90007

ABSTRACT

This paper focuses on revolutionary parallel architecture and


language concepts for a class of matrix operations which are fundamental
for signal processing computations. Based on the notion of computational
wavefront, a data flow language for matrix array processors is developed
and a new processor architecture tailored to this language is proposed.
Simulations were done at the global and local levels, and both report
encouraging success.

I. INTRODUCTION

A major impact of the revolutionary VLSI device technology will be
the massively available parallel processing capability, since "millions
of logic gates" will become virtually free in VLSI hardware. However, the
availability of this almost unlimited hardware has not solved all the
problems in VLSI system design. In fact, it has induced several untradi-
tional and rather critical issues. First, communications between logic
gates in VLSI circuits will consume a large portion of time, area and
energy [1]. Moreover, the software and design cost in large scale inte-
gration will become unusually high. Consequently, one has to resort to
some revolutionary architectural and language concepts to accomplish an
effective VLSI system design.
Traditionally, programming languages for multiprocessors focus exclu-
sively on the description of parallel data executions, with little men-
tion of the data movements. Due to the communication problem in VLSI
systems, the issue of data availability and management becomes critical,
and it has become very desirable to have a new language capable of ex-
pressing parallel data movements in a computing network. While the data
flow language [2] offers a good means of tracing data movements, to act-
ually build a dataflow machine is yet very involved. To simplify the ar-
chitectural design and language description it is very sensible to res-
trict the application to a special class of compatible operations, so as
to take advantage of the special structure shared by them.
It has been recently indicated [3],[4] that a major portion of the
* Research supported in part by the Office of Naval Research under contracts
N00014-81-K-0191 and N00014-80-C-0457, and by the National Science Foundation
under Grant ECS-80-16581.

computational needs for signal processing applications can in fact be
reduced to a basic set of matrix operations such as matrix multipli-
cation, inversion, LU decomposition, convolution, Toeplitz solution, etc.
Therefore, special purpose parallel processors and a parallel programming
language for the matrix operation set appear to be a promising solution
in order to effectively utilize the massive VLSI hardware for signal
processing applications.
For matrix operations, there exists between the topological pro-
perties of the computational algorithm and the computing structure a
simple and natural mapping. This mapping in turn allows a large amount
of data flow in a regular and simple pattern. Based on this principle,
H.T. Kung [1] proposed the concept of the "systolic array," in which oper-
ations such as matrix multiplication and LU decomposition can be
processed in parallel in a highly synchronized and regular manner.
Later, S.Y. Kung [3] proposed a variant of the systolic array in
which the notion of a computational wavefront is introduced. The notion fa-
cilitates tracing the parallelism in a synchronous or asynchronous
computing network and therefore is applicable to a large class of signal
processing oriented matrix operations [3]-[7]. By offering an elegant
description of the data movements for matrix algorithms, the wavefront
notion has conceptually shortened the gap between the algorithm analysis
and the design of dataflow computing structures, as will be demonstrated
later.
COMPUTATIONAL WAVEFRONT
The notion of computational wavefront can best be explained by a
simple example. Here we shall consider only the matrix multiplication
case. Let A = [a_ij], B = [b_ij] and C = A x B = [c_ij] all be N x N matrices.
Decompose the matrix A into columns [A_i] and B into rows [B_j], and therefore

    C = A_1 B_1 + A_2 B_2 + ... + A_N B_N .    (1)

The matrix multiplication can be carried out in N recursions, executing

    C^(k) = C^(k-1) + A_k B_k

recursively for k = 1,2,...,N.
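(A minimal sketch in Python, ours rather than the paper's, of the recursion above: each step k folds one column-times-row outer product A_k B_k into C, and the final C agrees with the ordinary matrix product.)

    # Matrix multiplication as N rank-one update recursions, C^(k) = C^(k-1) + A_k B_k.
    N = 3
    A = [[i + j for j in range(N)] for i in range(N)]
    B = [[i * j + 1 for j in range(N)] for i in range(N)]

    C = [[0] * N for _ in range(N)]           # C^(0) = 0
    for k in range(N):                        # recursion k uses column A_k and row B_k
        for i in range(N):
            for j in range(N):
                C[i][j] += A[i][k] * B[k][j]

    ref = [[sum(A[i][k] * B[k][j] for k in range(N)) for j in range(N)] for i in range(N)]
    assert C == ref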
The exploitation of parallelism now becomes obvious with the
availability of N x N processors: a parallel algorithm for matrix
multiplication is almost trivial. However, most existing parallel programs
would need global interconnections in the computing network, while
localized interconnections and data flow are much more desirable in VLSI
design.
The topology of the matrix multiplication algorithm can be naturally
mapped onto a square, orthogonal N x N matrix array (cf. Fig. 1). To
manage a smooth data movement on a localized communication network, we
make use of the notion of computational wavefront. In our interpretation,
a wavefront on a computing network corresponds to a mathematical
recursion in an algorithm. Successive pipelining of the wavefronts
will accomplish the computation of all recursions in the algorithm.
As an example, the computational wavefront for the first recursion in
matrix multiplication is now examined. Suppose that the registers of
all cells are initially set to zero, the entries of A are stored in the
memory modules to the left (in columns), and those of B in the memory
modules on the top (in rows). The process starts with cell (1,1), where

    C^(1)(1,1) = C^(0)(1,1) + a_11 b_11 = a_11 b_11    (2)

is computed. It then activates its successor neighbor cells (1,2) and (2,1),
executing the C^(1)(1,2) and C^(1)(2,1) computations. These in turn
activate their successor neighbor cells (3,1), (2,2) and (1,3), creating
a computation wavefront traveling down the processor array. Once the
wavefront sweeps through all the cells, the first recursion is over. As
the first wavefront propagates, we can execute the second recursion in
parallel by pipelining a second wavefront immediately after the first
one. The pipelining is feasible because the wavefronts of the two recursions
will never intersect (Huygens' wavefront principle), assuring
that they will be using different processors and avoiding any
contention problems.
Without going into further details, we note here that the same
wavefront concept applies as well to the other classes of matrix operations,
such as LU decomposition, convolution, etc. [3]-[7].
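(The following Python sketch of the pipelining is ours; the schedule it assumes, recursion k firing at cell (i,j) at time i+j+k with 0-based indices, is one natural reading of the wavefront picture, not a construct from the paper. The assertion checks the non-intersection property: no two wavefronts ever occupy the same cell at the same time.)

    # Pipelined wavefronts on an N x N array: wavefront k reaches cell (i, j)
    # at time t = i + j + k, so successive wavefronts trail each other by one step.
    N = 4
    schedule = {}                       # (i, j, t) -> wavefront index k
    for k in range(N):                  # one wavefront per recursion
        for i in range(N):
            for j in range(N):
                t = i + j + k
                assert (i, j, t) not in schedule   # wavefronts never collide
                schedule[(i, j, t)] = k

    # The last firing occurs at t = 3(N-1), so all N pipelined recursions
    # finish in 3N - 2 steps rather than N separate array sweeps.
    assert max(t for (_, _, t) in schedule) == 3 * (N - 1)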
MATRIX DATA FLOW LANGUAGE (MDFL)
Since most general purpose languages cannot simultaneously describe
parallel processing, data flow, and computational wavefronts
within a multi-processor array, a special purpose language
for matrix operations is needed.
The language described in this paper is based on the progress of
computational wavefronts and the corresponding data flow in matrix
algorithms. Hence, it is named "MATRIX DATA FLOW LANGUAGE" (MDFL). The
key features in the development of MDFL are: (i) utilization of the
topology of the processor array, (ii) localized interconnection/communication,
(iii) facilitation of the description of computational wavefronts,
(iv) minimal data dependence, and (v) self-timing.
Two versions of MDFL are presented in this paper: global MDFL and
local MDFL. Global MDFL describes matrix algorithms from the perspective
of the entire matrix array of processors, while local MDFL describes
the operation of each individual processor. More precisely, the perspective
of global MDFL is of one wavefront passing across all the processors,
while the perspective of local MDFL is of one processor encountering
a series of wavefronts. The conversion from a global MDFL
program to the corresponding local MDFL programs is simple and
straightforward.

II GLOBAL MATRIX DATA FLOW LANGUAGE

The global version of MDFL relies on the key concept of wavefront.
More specifically, this language is the tool for the description of the
activities of a wavefront. Hence, a global MDFL program should describe
completely the "life-cycle" of each wavefront, from the instant it is
initiated through its progress across the entire array of processors.
This language uses "POINTERS" pointing to currently "active" processors
in order to trace wavefronts. Since all wavefronts are identical in
most applications, a global MDFL program need describe only one
wavefront. Furthermore, each instruction in a global MDFL program refers
to all processors within that wavefront.
In most matrix algorithms, a wavefront performs the same set of
tasks at each new processor that it encounters. Global MDFL describes
this repetitive loop by the construct "WHILE PTR @ (*,*) DO BEGIN
.. END;". The statements within the WHILE loop indicate what a
wavefront does at one set of processors. Therefore, in order to construct
a global MDFL program, one needs only to specify what the wavefront
must do at a particular set of processors, and then enclose it
within a WHILE loop.
It is common in matrix algorithms that slightly different tasks be
performed concurrently by different processors in the same wavefront. To
help describe this, a construct
    CASE KIND =
        (LOCATION) : BEGIN .. END;
    ENDCASE;
is provided. Instructions within this "CASE KIND" construct are
executed only by the processors at the indicated locations.
Other instructions provided by the language include BRANCH, MOVE
and ACTIVATE* for wavefront movement; FLOW and FETCH for data movement;
and ADD, SUB, MULT, DIV and SQRT for arithmetic processing of data. For
logical processing of data, CMP and TST are provided, whose results can
be used to change the sequencing of an MDFL program with IF** constructs.
No conditional branching or GOTO statements are allowed, in order to
retain block structure in MDFL.
In data flow languages, the arrival of data initiates processing. Similarly,
in global MDFL, an instruction WAIT FOR (DATA) is used to ensure
that data has arrived before the actual processing in an active processor.
Finally, to fully exploit the inherent parallelism in matrix algorithms,
global MDFL allows instructions to be executed in parallel
within each processor, so as to estimate the best speed of the parallel
algorithm.
We have found that the above instructions constitute a complete
set for the description of most matrix algorithms. For illustration, a
global MDFL program for matrix multiplication is provided below. (For
more programming examples, the reader is referred to a later full paper
or [7].)
Example 1. Global MDFL program for Matrix Multiplication

BEGIN
  WHILE PTR @ (*,*) DO
    BEGIN
      CASE KIND =
        (1,*) : FETCH B,Up;
        (*,1) : FETCH A,Left;
        (2,2) : BRANCH (1,1);
      ENDCASE;
      WAIT FOR (A,B,C) READY;
      IN PARALLEL DO
        BEGIN
          FLOW A,Right;
          FLOW B,Down;
          MULT A,B,B;
          ADD B,C,C;
        END;
      MOVE Right,Down;
    END;
ENDPROGRAM.

* The BRANCH instruction here is used to create new wavefronts. MOVE
refers to a shift movement of the entire wavefront and has nothing
to do with the data movement. Active processors can ACTIVATE adjacent
ones. Note that in the MOVE instruction, the current processors will
be deactivated.
** IF ( ) BEGIN .. END; is the construct provided.
INTERPRETER
To accomplish the first phase of the simulation work (the validation
of parallel algorithms), an INTERPRETER has been developed. This package
can interpret a global MDFL source program so that it can be executed
sequentially. This establishes a test-bed for verifying the correctness
of parallel algorithms.
Just like global MDFL, the INTERPRETER is wavefront oriented. A
number of wavefronts may be simultaneously progressing across the array,
with each of them working on the same global MDFL program. In addition,
the instruction execution time for each ALU operation is parameterized
so that the package can cope with various kinds of arithmetic
units. Parallel operations within a processor are permitted in order to
estimate the maximum speed achievable. In order to interpret a language
such as MDFL on a uniprocessor machine, the INTERPRETER has to "freeze"
the time-count to perform sequentially all the supposedly parallel
operations. Snapshots of the array at various "frozen" times are printed.
The INTERPRETER package in PASCAL is available at USC upon request.
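(The toy Python sketch below, ours and much cruder than the PASCAL package, illustrates the "frozen" time-count idea: cell firings are grouped by time step, each group of supposedly parallel operations is executed sequentially, and a snapshot could be printed after each frozen step. The time schedule t = i + j + k is an assumption of ours.)

    # Sequentially interpreting parallel wavefront activity by freezing the time-count.
    from collections import defaultdict

    N = 3
    A = [[1, 2, 0], [0, 1, 3], [2, 0, 1]]
    B = [[1, 0, 1], [2, 1, 0], [0, 3, 1]]
    C = [[0] * N for _ in range(N)]

    by_time = defaultdict(list)              # frozen time step -> cell firings
    for k in range(N):
        for i in range(N):
            for j in range(N):
                by_time[i + j + k].append((i, j, k))

    for t in sorted(by_time):
        for i, j, k in by_time[t]:           # executed one after another on a uniprocessor
            C[i][j] += A[i][k] * B[k][j]
        # a snapshot of the array could be printed here for "frozen" time t

    assert C == [[sum(A[i][k] * B[k][j] for k in range(N)) for j in range(N)]
                 for i in range(N)]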

III LOCAL MDFL AND PROCESSOR ARCHITECTURE

So far, we have seen that global MDFL serves to describe wavefront
computation. However, another major objective of the MDFL design is to
provide a tool for designing processor architecture. For this purpose,
a local MDFL version is introduced. While the perspective in
global MDFL is of one wavefront passing through all the processors, the
perspective in local MDFL is of one processor encountering a series of
wavefronts. In spite of seemingly high level constructs such as
IF BEGIN .. END, MDFL is treated as a low level language, as these
constructs are directly executed. With MDFL viewed as a low level
language, the new perspective leads to a version of MDFL which focuses
on the action within each processor, thereby dictating the processor
hardware.
As all wavefronts are identical (cf. section II), each processor
repeats the same sequence of tasks with each new wavefront. Therefore,
we need a repetitive loop in local MDFL. In fact, the statements
within the repetitive loop should be the set of statements within the
WHILE loop of the corresponding global MDFL program*.
* As was seen earlier, the statements within a WHILE loop of a global
MDFL program indicate what a wavefront does at one set of processors.
It is obvious that they also indicate what a processor should do when
a wavefront passes through it.
Data input/output can also be done in a pipeline of wavefronts.
This establishes the need for having more than one repetitive loop in
the program. To allow for two or more loops, local MDFL provides a
"REPEAT ... UNTIL TERMINATED"* construct. Thus, the first step in the
conversion of a global MDFL program to local MDFL is the replacement of
the program construct
    WHILE PTR @ (*,*) DO BEGIN .. END;
by the repetitive loop
    REPEAT .... UNTIL TERMINATED;
For conversion purposes, all other global MDFL instructions
(except the CASE construct) can also be treated as local MDFL instructions,
the only difference being that they now refer to the current processor
alone, instead of all the processors in the wavefront.
"CASE KIND" is the only instruction left which is purely global
and needs to be converted. This conversion can be achieved by selecting
only the relevant portion of the CASE statement. In this way, a global
~IDFL program is split into smaller programs with each of these programs
describing the action of a particular processor. However, as most of
these processors will perform the same operations, a major portion of
this program is common to all the processors.
In summary, a global MDFL program is converted into its corres-
ponding local MDFL program by replacing (i) WHILE constructs by REPEAT
.. UNTIL and (ii) Selecting relevant portion of the CASE statement.
To illustrate this conversion process the local MDFL version of the
matrix multiplication algorithm is listed below :
Example 2. Local MDFL programs for matrix multiplication

(1,1) processor:
BEGIN
  REPEAT
    FETCH B,Up;
    FETCH A,Left;
    WAIT FOR (A,B,C) READY;
    FLOW A,Right;
    FLOW B,Down;
    MULT A,B,B;
    ADD B,C,C;
    MOVE Right,Down;
  UNTIL TERMINATED
ENDPROGRAM.

(2,2) processor:
BEGIN
  REPEAT
    BRANCH (1,1);
    WAIT FOR (A,B,C) READY;
    FLOW A,Right;
    FLOW B,Down;
    MULT A,B,B;
    ADD B,C,C;
    MOVE Right,Down;
  UNTIL TERMINATED
ENDPROGRAM.

PROCESSOR ARCHITECTURE
For the purpose of designing the processor architecture, local MDFL is
treated like a low level language which has a one-to-one correspondence
with the machine language of each processor, directly dictating the hardware.
In this sense, each processor is, in effect, a hardware interpreter
of local MDFL. However, MDFL is not low level in the conventional sense,
because of the inclusion of some high level constructs like REPEAT .. UNTIL
and IF BEGIN .. END. Yet these instructions are treated here as low level,
as they are directly executed by the processor.
* TERMINATED is a boolean flag providing the mechanism to exit a repetitive
block and enter another.
The hardware of each processing unit has to be able to execute
MDFL instructions. This determines the registers, status flags and
interconnections that must exist. For instance, pointers are implemented
by active flags at each processor. In addition, each processor
has to be capable of activating orthogonally adjacent processors and
of accessing data in them. This is the only form of communication
necessary.
A programmable array processor requires the loading of local MDFL
programs. This has an effect on the hardware of the individual processor:
it must be capable of supporting the loading process. In
order that I/O can be done in a pipelined fashion, the boundary processors
are required to have some I/O capability. Lastly, parallelism
within a processor increases its hardware complexity, so a tradeoff
between complexity and parallelism has to be made. Considering all the
above factors, a typical architecture of a processor is proposed in
Figure 2.
LOADING AND INPUT-OUTPUT
For a programmable array, the loading is rather complicated, because
different programs have to be stored in different processors. This
requires some preprocessing of the global MDFL program. Statements
within a CASE construct meant for specific processors are to be labeled
with a destination address (linking). The WHILE construct is to be
replaced by the REPEAT .. UNTIL construct. The preprocessor also
assembles all instructions so that they are ready for execution (code
generation).
During the loading process, address matching is done before each
processor accepts and loads each statement of the program into program
memory. The loading process is carried out by executing a loader program,
written in MDFL, which is stored permanently in the processor. It
is important to note that the loading can be done in a pipelined fashion,
retaining the wavefront nature. A global MDFL program for the loader
is shown below:
LOADER:
BEGIN
  WHILE PTR @ (*,*) DO
    BEGIN
      CASE KIND =
        (1,1) : BEGIN
                  DEC_COUNT;        (* track number of times repeated *)
                  READ;             (* input one line of assembled code *)
                  ACTIVATE Right;
                END;
        FIRSTROW : BEGIN
                  FETCH P,Left,P;   (* fetch program from left *)
                  ACTIVATE Right;
                END;
        (2,2) : BEGIN
                  BRANCH (1,1);
                  FETCH P,Up,P;
                END;
        REST : FETCH P,Up,P;
      ENDCASE;
      MOVE Down;
    END;
ENDPROGRAM.

Input/output is another aspect of the array processor that deserves
special attention. Data I/O can be done in a pipelined manner, and so
again the wavefront nature is inherent. It is left to the user to
write a program in MDFL for I/O purposes. Below, a sample global MDFL
loop is shown for pipelined data input from memory stacks on top of the
processor array.
INPUT LOOP:
WHILE PTR @ (*,*) DO
  BEGIN
    CASE KIND =
      (1,*) : ACTIVATE Right;
    ENDCASE;
    WAIT FOR (A) USED;   (* wait for data in A register to be used,
                            i.e. accessed by processor below *)
    FETCH A,Up,A;
    MOVE Down;
  END;

SIMULATION
Simulation of a programmable array processor at the register level
(within each processor) has been undertaken. The entire simulation can
be summarized as follows: the absolute loader, a local MDFL program
permanently stored in all processors, is executed first. It loads the
user's program. After loading, the user's program is executed. The
user's program consists of three major parts: input, program body
(algorithm) and output. The input section ensures that the data is
distributed as desired by the algorithm, the output section collects
the results, and the program body executes the algorithm. The entire
user's program is written in MDFL, and the sections can be intermixed
depending on the situation.
The simulation indicates that the proposed prototype architecture (cf.
Figure 2) performs as desired. The entire simulator, written in SIMULA,
can be made available on request.
Finally, in summary, we would like to draw the reader's attention
to the following remarks:
(1) We have seen that the processor architecture of a programmable
array processor is dictated by the language, as the processor is a
hardware interpreter for local MDFL. On the other hand, if a dedicated
non-programmable array processor is desired, the local MDFL program for
that dedicated algorithm should be hardwired in the processor.
(2) Local MDFL thus leads to processor hardware. It is also
capable of describing the algorithm from the processor's perspective.
Yet global MDFL is indispensable because it is close to the matrix algorithm,
and it is easier to program the array in global MDFL than to write
several local MDFL programs. The preprocessing and selective loading
together perform the conversion of the user's global MDFL program into
its local version.
(3) Global MDFL packages for common matrix algorithms can be made
available upon request.

IV CONCLUSION

Facing the next generation of parallel computers, conventional
languages such as FORTRAN, which are based on a global state model of
computer operation, will soon become unsuitable and will eventually
be abandoned for large scale scientific computation [8]. This paper
presents a promising alternative in the form of a special purpose
language for a special purpose parallel machine. It offers a solution to
the problem of efficiently exploiting concurrency in large scale
computations.
The development of a matrix-oriented special purpose language also
led to the design of a programmable matrix array processor, which avoids
the typical problems of global communication, shared bus, and
shared memory. The prototype architecture for the processors of an
MDFL programmable array processor, as proposed in section III, is simple
enough to be implemented on a single chip even with today's commercially
available device technology. Therefore, the notion of an MDFL programmable,
microprocessor-based array processor is not only theoretically
appealing, but also commercially feasible.

ACKNOWLEDGEMENTS : The authors wish to thank Dr. David Lin (formerly at


USC) and Ron Gal-Ezer ar USC for the very valuable discussions and their
helpful participation in the project.

REFERENCES

[1] Mead, C. and L. Conway, "Introduction to VLSI Systems," Addison-Wesley,
1980, Chapter 8 (by H.T. Kung).
[2] Ackerman, W.B., "Data Flow Languages," AFIPS NCC Proc., Vol. 48,
June 1979, pp. 1087-1095.
[3] Kung, S.Y., "VLSI Array Processor for Signal Processing," MIT Conf.
on Advanced Research on I.C., Cambridge, MA, 1980.
[4] Speiser, J.M. and H.J. Whitehouse, "Architectures for Real-time
Matrix Operations," Proc. GOMAC, Nov. 1980.
[5] Kung, S.Y. and D.V. Bhaskar Rao, "Highly Parallel Architectures for
Solving Linear Equations," Proc. ICASSP '81, Atlanta, pp. 39-42.
[6] Kung, S.Y. and Y.H. Hu, "Fast and Parallel Algorithms for Solving
Toeplitz Systems," Inter. Symp. Mini & Microcomputers in Control and
Measurement, San Francisco, May 20-22, 1981.
[7] Kung, S.Y., "Matrix Data Flow Language for Matrix Operation Dedicated
Array Processors," Proc. ECCTD '81, The Hague, The Netherlands,
Aug. 25-28, 1981.
[8] Dennis, J.B., "Dataflow Supercomputers," IEEE Computer, Nov. 1980,
pp. 48-56.
Fig. 1 - Configuration for N x N Square Matrix Array (memory modules along the left and top edges; first and second wavefronts shown sweeping the array diagonally)

Fig. 2 - Architecture of an MDFL array processor cell (control unit, ALU, program memory PM, buffer B for loading, word count WC for loading, return address RET-ADDR and counter COUNT used to track and exit Repeat-Until loops; some control points are driven by the local processor alone, others jointly with the adjacent processors)
Digital Signal Processing Applications
of Systolic Algorithms

Peter R. Cappello and Kenneth Steiglitz†


Princeton University
Department of Electrical Engineering and Computer Science
Princeton, New Jersey 08544

1. INTRODUCTION

VLSI structures and algorithms are given for bit-serial FIR filtering, IIR filtering,
and convolution. We also present a bit-parallel FIR filter design. The structures
are highly regular, programmable, and area-efficient; in fact, we will show
that most are within log factors of asymptotic optimality. These structures are
completely pipelined; that is, the throughput rate (bits/second) is independent of
both word size and filter length. This is to be contrasted with algorithms designed
and implemented in terms of, say, multipliers and adders, whose throughput rates
may depend on word length.

2. A TWO'S COMPLEMENT BIT-SERIAL FIR FILTER

In this section we present a cellular FIR filter structure. We may express the
nth output sample (a B-bit number) of a K-tap FIR filter as

    y_n = Σ_{k=0}^{K-1} a_k x_{n-k} .

It is easy to see from [3,4,5] that y_n can be computed using the cellular structure
shown in Figure 2.A. In this architecture, the input signal values are piped to the
right while the output signal values are formed as they move left. Each process
step is of the form

    y_n^(k) = y_n^(k-1) + a_{K-1-k} x_{n-K+1+k} .

We now consider some detail of the structure of the inner product step cell.
Using two's complement arithmetic we may rewrite y_n as follows:

    y_n = Σ_{k=0}^{K-1} a_k x_{n-k} = Σ_{k=0}^{K-1} ( Σ_{b=0}^{B-1} (a_k 2^b) x_{n-k,b} ) .

The recurrence for y_n becomes

    y_n^(0) = Σ_{b=0}^{B-1} (a_{K-1} 2^b) x_{n-K+1,b}

    y_n^(k) = y_n^(k-1) + Σ_{b=0}^{B-1} (a_{K-1-k} 2^b) x_{n-K+1+k,b}

    y_n = y_n^(K-1) .

Substituting a bit-level recurrence for multiplication by a constant into the
recurrence for y_n, we obtain

    y_n^(0) = s_{K-1}^(B-1) = Σ_{b=0}^{B-1} 2^b s_{K-1,b}^(B-1)

† This work was supported in part by the National Science Foundation under Grant ECS-7916292, and in
part by the U.S. Army Research Office, Durham, NC, under Grant DAAG29-79-C-0024.
    y_n^(k) = y_n^(k-1) + s_{K-1-k}^(B-1) = y_n^(k-1) + Σ_{b=0}^{B-1} 2^b s_{K-1-k,b}^(B-1)

    y_n = y_n^(K-1) ,

where the general recurrence for the sum bits s (and carry bits c) is

    s_i^(b) = s_i^(b-1) ⊕ c_i^(b-1) ⊕ (a_{i-b} x_b)    for 0 < b < B, b ≤ i < B
    s_i^(b) = s_i^(b-1)                                 otherwise

    c_i^(b) = 1  if s_i^(b-1) + c_i^(b-1) + (a_{i-b} x_b) > 1,   for 0 < b < B, b ≤ i < B
    c_i^(b) = 0  else,                                           for 0 < b < B, b ≤ i < B
    c_i^(b) = 0                                                  for 0 < b < B, 0 ≤ i < b

Figure 2.G illustrates a cellular, bit-serial inner product step structure for K = 4
and B = 4. Figure 2.H gives the bit-serial inner product step algorithm; I/O and
computation are overlapped. Finally, the overall FIR filter computation is given in
Figure 2.I. Note that input signal values arrive and output signal values depart on
every other major cycle.
The time complexity (i.e., the number of clock pulses) using this structure
and algorithm is O(nB + KB) for a signal of length n through a K-tap filter, using
B-bit input and output signal values (approximately 2 clock pulses per output bit).
The FIR filter's area complexity is O(KB). Note that the structure computes the
whole output value. Thus if the filter design requires B_0-bit coefficients and B_0-bit
input values, then the B referred to in this structure is B = 2B_0 + log K (bits).
Also, if there are fan-out restrictions, then the inner product step cell grows as
O(B log B) rather than O(B). Thus the FIR filter structure has area complexity
O(K B_0 log(B_0 + log K) + K log K log(B_0 + log K)).
It is straightforward to modify the bit-serial FIR filter structure to accommodate
coefficient loading, making the FIR filter structure programmable, without
increasing its area complexity. Viewed as a finite state machine, such a bit-serial
chip must have at least 2^(KB_0 - 1) states. (Consider the case when all the filter
coefficients are 1.) Thus KB_0 - 1 bits of information are needed to represent the
state of the chip, which implies area growth Ω(KB_0). Thus the FIR filter structure
is within no more than log factors of asymptotic optimality.
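(The bit-level decomposition behind this structure can be checked with a short Python sketch; the sketch is ours and is restricted to nonnegative samples so that the sign bit of the two's complement representation needs no special care.)

    # K-tap FIR output rebuilt from the decomposition y_n = sum_k sum_b (a_k 2^b) x_{n-k,b}.
    K, B = 4, 8
    a = [3, 1, 4, 1]                      # coefficients a_0 .. a_{K-1}
    x = [5, 9, 2, 6, 5, 3, 5, 8]          # input samples, each < 2**B

    def bit(v, b):
        return (v >> b) & 1               # b-th bit of sample v

    for n in range(K - 1, len(x)):
        y_word = sum(a[k] * x[n - k] for k in range(K))
        y_bits = sum(a[k] * (1 << b) * bit(x[n - k], b)
                     for k in range(K) for b in range(B))
        assert y_word == y_bits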

3. A TWO'S COMPLEMENT BIT-SERIAL IIR FILTER

In this section we present a cellular IIR filter structure. It is a simple adaptation
of the FIR filter structure of the previous section. Let the IIR filter output
y_n be defined as follows:

    y_n = Σ_{k=0}^{K-1} a_k x_{n-k} + Σ_{k=1}^{K-1} b_k y_{n-k} .

Each summation in this equation has the form of an FIR filter. Clearly
Σ_{k=0}^{K-1} a_k x_{n-k} can be computed using the FIR structure. The summation
Σ_{k=1}^{K-1} b_k y_{n-k} can be computed by the FIR structure provided that it can
get its input on schedule. This turns out to be the case. Figure 3.A depicts the IIR
filter structure, and Figure 3.B presents the algorithm for the structure.
As with the FIR structure, the IIR structure is amenable to structural
enhancements supporting programmability and fast initialization.
The asymptotic time and area complexities of the bit-serial IIR algorithm and
structure are identical to their FIR counterparts (this is true only in the bit-serial
case). The IIR filter structure is likewise within no more than log factors of
asymptotic optimality.

4. A TWO'S COMPLEMENT BIT-SERIAL CONVOLVER

In this section we present a cellular structure for computing a convolution.
We may express the nth output of the convolution of two sequences of length K
as

    z_n = Σ_{k=0}^{n} x_k y_{n-k} .

To compute the z_i sequentially for 0 ≤ i < 2K, there is an obvious data movement
scheme for the input sequences: starting with the x signal on the left and the y
signal on the right, the signals are shifted past one another. This is illustrated for K
= 4 in Figure 4.A. At each point in the cycle we need to multiply the x and y
values that correspond positionally, then sum these products to get z_i. Done bit-serially,
we can use a variant of the multiply by a constant cell to effect the bit-serial
multiplication, summing the (not more than K) products with a bit-serial K-ary
adder to get z_i. Figure 4.B illustrates the high level structure of this bit-serial
convolver. At this level, the data flow in the pipeline algorithm is as follows: y values
flow to the left while x values flow to the right, and product bits flow up into the
K-ary adder.
The multiplication by a constant cell is the one described in section 2, with two
modifications:
It is positioned "vertically" to reduce the width-length product of the convolution
structure.
It is "programmable" - the "constant" is shifted into place prior to the bit-serial
multiplication (see Figure 4.C for the structure and Figure 4.D for the algorithm).
The K-ary bit-serial adder is a generalization of a bit-serial binary adder. The
structural consequence of this generalization is a complete binary tree with K
leaves†, where each cell is a 1-bit full adder. The area complexity of this structure
is O(K log K); each enhanced‡ 1-bit full adder occupies constant area.
The algorithm for the convolution structure as a whole is simply a matter of
passing the x and y input streams through the convolution structure. As with the
other structures, to accommodate complete pipelining, B is assumed to be large
enough to represent the result with no loss of information (i.e., B = 2B_0 + log K,
where B_0 is the number of bits needed to represent one x (y) signal value). Consequently,
and assuming fan-out restrictions, the convolution structure has area
complexity O(K B_0 log(B_0 + log K) + K log K log(B_0 + log K)). A lower bound on
the area of a bit-serial convolver is Ω(KB_0) (see [6]). Thus the structure given is
within no more than log factors of asymptotic optimality.
† We assume for simplicity that K is a power of 2.
‡ Enhanced, that is, to permit pipelining.
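(A small Python sketch, ours, of the data movement just described: as the two signals shift past one another, x_k is aligned with y_{i-k} after i relative shift steps, and summing the positionally corresponding products yields z_i.)

    # Convolution via the shift-past-each-other scheme of Figure 4.A.
    K = 4
    x = [2, 7, 1, 8]                      # x_0 .. x_{K-1}
    y = [3, 1, 4, 1]                      # y_0 .. y_{K-1}

    z = []
    for i in range(2 * K - 1):            # one output per relative alignment
        z.append(sum(x[k] * y[i - k]
                     for k in range(K) if 0 <= i - k < K))

    assert z[0] == x[0] * y[0]            # first overlap
    assert z[2] == x[0]*y[2] + x[1]*y[1] + x[2]*y[0]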

5. A TWO'S COMPLEMENT BIT-PARALLEL FIR FILTER

In this section we present a cellular FIR filter structure which operates in bit-parallel
fashion; that is, an entire output word is produced every 2 clock cycles.
Using two's complement arithmetic, we may express the nth output of a K-tap
filter as follows:

    y_n = Σ_{k=0}^{K-1} a_k x_{n-k} = Σ_{k=0}^{K-1} a_k Σ_{b=0}^{B-1} 2^b x_{n-k,b} = Σ_{b=0}^{B-1} 2^b ( Σ_{k=0}^{K-1} a_k x_{n-k,b} )    (5.A)

Equation 5.A expresses algebraically what we may call a shift-add implementation
of an FIR filter. The high level structure for the FIR filter, illustrated in Figure
5.A, reflects this organization. As can be seen from that figure, the shifts are fixed
and thus can be hardwired. Brent and Kung describe in [1] a cellular, pipelinable,
bit-parallel adder that is suitable for use in this add tree.
We now turn our attention to that part of the computation which is
represented algebraically by the quantity in parentheses in Eq. 5.A. Consider a K-tap
filter whose input signal is simply a sequence of bits; that is, output y_n is
defined as follows:

    y_n = Σ_{k=0}^{K-1} a_k b_{n-k} , where b_{n-k} ∈ {0,1} .

We will call this computation a bit-FIR filter, and a structure for computing it
a bFIR (structure). Figure 5.B illustrates a cellular bFIR structure for K = 4 and B
= 4. Bit-level recurrences follow.
    s_b^(0) = a_{K-1,b} x_{n,b}                                          for 0 ≤ b < B
    s_b^(k) = s_b^(k-1) ⊕ c_{b-1}^(k-1) ⊕ (a_{K-1-k,b} x_{n-K+1+k,b})    for 0 ≤ b < B, 0 < k < K
    c_b^(0) = 0                                                          for 0 ≤ b < B
    c_{-1}^(k) = 0                                                       for 0 ≤ k < K
    c_b^(k) = 1  if s_b^(k-1) + c_{b-1}^(k-1) + (a_{K-1-k,b} x_{n-K+1+k,b}) > 1, else 0,
                                                                         for 0 ≤ b < B, 0 < k < K
Now,

    bFIR = Σ_{b=0}^{B-1} 2^b s_b^(K-1) + Σ_{b=0}^{B-2} 2^{b+1} c_b^(K-1) .

Note that since the bFIR output is the sum of a "sum bit number" and a "carry bit
number," we need the adder. Data flow in the pipeline algorithm is as follows: the
input bits b_{n-k} move to the east while the sum bits move west, and carries move
southwest. The algorithm for a general cell_b^(k) is given in Figure 5.C. Note that
input bits arrive and output words depart on every other cycle.
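(The recurrences above are a carry-save accumulation; the Python sketch below, ours and with arbitrary small operands, folds one gated coefficient word per step into the sum and carry bits and confirms that the output is the "sum bit number" plus the shifted "carry bit number".)

    # Carry-save bFIR: step k adds the word a_{K-1-k} gated by input bit b_{n-K+1+k}.
    K, B = 4, 8
    a = [5, 3, 0, 6]                      # coefficients, small enough that y_n fits in B bits
    bits = [1, 0, 1, 1]                   # one input *bit* per tap position

    s = [0] * B                           # sum bits s_b (weight 2^b)
    c = [0] * B                           # carry bits c_b (weight 2^(b+1))
    for k in range(K):
        w = a[K - 1 - k] if bits[k] else 0
        s_new, c_new = [0] * B, [0] * B
        for b in range(B):
            cin = c[b - 1] if b > 0 else 0
            wb = (w >> b) & 1
            s_new[b] = s[b] ^ cin ^ wb
            c_new[b] = 1 if s[b] + cin + wb > 1 else 0
        s, c = s_new, c_new

    out = sum(s[b] << b for b in range(B)) + sum(c[b] << (b + 1) for b in range(B - 1))
    assert out == sum(a[K - 1 - k] * bits[k] for k in range(K))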
The area of the FIR filter (taken as its width-length product) is
O(B²K + B³ log B). Its data rate is O(B) bits per cycle (1 output word every 2 clock
pulses).

6. CONCLUSIONS

We have presented some VLSI structures for digital signal processing that are
completely pipelinable, programmable, and area-efficient. More detail is given in
the full length version of this paper [7]. The development process highlighted the
fact that the tasks of designing VLSI "hardware" and VLSI "software" are inseparable.
One compelling reason for this is the desire to have data in the right place at
the right time. Systolic algorithms on topologically (quasi-) planar cellular structures
are thus well suited for the computations considered. Cellular structures
also lend themselves to hierarchical description - surely a useful method for coping
with descriptive complexity.
On the other hand, to achieve a completely pipelined result, one must vertically
integrate the functions involved. For example, we took
1. a systolic algorithm/cellular structure [3,4,5] for computing an FIR filter, and
2. a systolic algorithm/cellular structure for computing an inner product step,
and integrated them to form a single systolic algorithm/cellular structure that harmoniously
computes both functional "levels." References [2,3] also discuss the idea
of bringing the design of systolic algorithms down to the bit level.
The approach of this paper can clearly be applied to other classes of digital signal
processing algorithms, and to other arithmetic systems (such as residue arithmetic).
These applications are currently under investigation.

REFERENCES

[1] Brent, R.P. and H.T. Kung, "The Chip Complexity of Binary Arithmetic,"
Proc. 12th Annual ACM Symposium on the Theory of Computing, 1980.
[2] Foster, M.J. and H.T. Kung, "The Design of Special-Purpose VLSI Chips,"
Computer Magazine, Vol. 13, pp. 26-40, 1980.
[3] Kung, H.T., "Let's Design Algorithms for VLSI Systems," Proc. Conference
on Very Large Scale Integration: Architecture, Design, Fabrication, California
Institute of Technology, pp. 65-90, January, 1979.
[4] Kung, H.T., "Special-purpose devices for signal and image processing: an
opportunity in very large scale integration (VLSI)," Proc. of Society of
Photo-optical Instrumentation Engineers, Volume 241, Real-Time Signal Processing
III, July, 1980.
[5] Kung, H.T. and Charles E. Leiserson, "Algorithms for VLSI Processor
Arrays," Section 8.3 of Introduction to VLSI Systems, C. Mead and L. Conway,
Addison-Wesley, 1980.
[6] Vuillemin, Jean, "A Combinatorial Limit to the Computing Power of VLSI
Circuits," Proc. IEEE 21st Annual Symposium on Foundations of Computer
Science, 1980.
[7] Cappello, P.R. and K. Steiglitz, "VLSI Structures for Completely Pipelined
Processing of Digital Signals," Tech. Rpt. No. 288, EECS Dept., Princeton,
N.J., Aug. 1981. Submitted for publication.

Figure 2.A FIR filter structure

/* Inner product algorithm */
Shift register x;              /* input signal */
(∀ sum-carry cells)
{
    Read x, t;
    if t = 1 then c ← 0;
    s' ← s ⊕ c ⊕ (a·x);
    c' ← if s + c + (a·x) > 1 then 1 else 0;
}
Shift register s;              /* output signal */
Shift circular register t;     /* timing chain */

Figure 2.H Inner product algorithm

Figure 2.G Inner product structure

/* initialization */
Repeat 2KB times:
    Clock the FIR cell with input bit 0;
Repeat 2nB + KB times:
    Clock the FIR cell with the 2nB + KB bit sequence shown in Figure 2.K;

Figure 2.J FIR filter algorithm


/* IIR algorithm */
(∀ inner product cells)
{
    Shift register x;          /* input bit of denominator FIR structure is sSUM */
    (∀ sum-carry cells)
    {
        Read x, t;             /* sSUM = x of denominator FIR cell */
        if t = 1 then c ← 0;
        s' ← s ⊕ c ⊕ (a·x);
        c' ← if s + c + (a·x) > 1 then 1 else 0;
    }
    Shift register s;          /* output signal value */
    Shift circular register t; /* timing chain */
}
(SUM cell)
{
    Read sNUM, sDENOM, cSUM;
    sOUT ← sNUM ⊕ sDENOM ⊕ cSUM;
    cOUT ← if sNUM + sDENOM + cSUM > 1 then 1 else 0;
    sSUM ← sOUT · tHEAD;
    cSUM ← cOUT · tHEAD;
    Shift circular register t; /* timing chain */
}

Figure 3.B IIR algorithm

Figure 3.A IIR filter structure


Figure 4.A Data movement scheme for convolution

Figure 4.C Modified multiplication by a constant structure for B = 4


Figure 4.B High level convolver structure (x and y streams pass through a row of bit-serial multiply-by-a-constant cells of height O(B) and width O(K); product bits feed a bit-serial K-ary adder tree of height O(log K))

/* Modified multiply by a constant algorithm */
(mode cell)
{
    Read t;
    if t = 1 then m ← ¬m;
}
(∀ sum-carry cells)
{
    Read t;
    if t = 0 then { p ← 0; c ← 0; };
    Read m;
    if m = 0 then
    {
        Shift register x;
        Read xHEAD, y;
    }
    else
    {
        Shift register y;
        Read yHEAD, x;
    };
    p ← p ⊕ c ⊕ (x·y);
    c ← if p + c + (x·y) > 1 then 1 else 0;
    Shift register p;          /* product */
}
Shift circular register t;     /* timing chain */

Figure 4.D Modified multiply by a constant algorithm
Figure 5.A High level FIR structure

Figure 5.B A cellular bFIR for K = 4 and B = 4

/* Bit FIR algorithm for a general cell */
Shift register b, s, c;        /* input, sum bit, and carry bit, respectively */
(∀ sum-carry cells)
{
    s' ← s ⊕ c ⊕ (b·a);
    c' ← if s + c + (b·a) > 1 then 1 else 0;
}

Figure 5.C Bit FIR algorithm for a general cell


A Two-Level Pipelined Systolic Array for
Convolutions

H.T. Kung¹, Lawrence M. Ruane, and David W.L. Yen

ESL Inc.²
Advanced Processor Technology Group
San Jose, California 95131

Abstract - Pipelining computations over a large array of cells
has been an important feature of systolic arrays. To achieve
even higher degrees of concurrency, it is desirable to have the
cells of a systolic array themselves be pipelined as well. The
resulting two-level pipelined systolic array would enjoy in
principle a K-fold increase in its throughput, where K is the
ratio of the time to perform the entire cell computation to
that to perform just one of its pipeline stages. This paper
describes such a two-level pipelined systolic array that is
capable of performing convolutions of any dimension. The
designs take full advantage of the pipelining assumed to be
available at each cell.
Multi-stage pipelined arithmetic units built from discrete
components have been used in most high-performance
computers. With the advent of VLSI, these pipelined units will
surely be implemented in one or a few chips. This paper shows
for the first time how a large number of these pipelined chips
can be efficiently combined to form a systolic array.
I. INTRODUCTION

Multi-dimensional convolutions constitute some of the most
compute-intensive tasks in signal or image processing. For example,
a 2-D convolution using a general 4 x 4 kernel requires 16
multiplications and 15 additions to be performed for generating each
output. If the dimensionality is higher or the kernel is larger, even
more arithmetic operations would be required. Though computationally
demanding, convolution is nevertheless a highly regular computation.
Therefore, by exploiting the regularity inherent to the computation,
cost-effective high-throughput pipelined systems can be built to
perform multi-dimensional convolutions, as argued in some recent
articles [1-7].
Past approaches that we are aware of, however, suffer from the
drawback that they do not take advantage of the possibility that the
arithmetic units could themselves be pipelined. Note that in VLSI
implementations high-speed arithmetic chips are likely to be pipelined
in order to make efficient use of silicon. Consider, for example, the
implementation of a bit-parallel integer multiplication chip. If a
single-stage combinational circuit is used, then at any given moment
only the small section of silicon where the computation is rippling
through is actively used. To circumvent this inefficiency, it is
natural to latch the circuit into multiple pipeline stages so that
several portions of the chip can work simultaneously. Introducing
latches to a circuit requires additional registers and control
circuits, but this overhead will usually be over-compensated by the
increase of throughput. In fact, existing LSI multipliers such as
the TRW LSI multipliers [8] already support two-stage pipelining. This
paper demonstrates how a large number of these pipelined chips can be
efficiently combined to form a systolic convolution array. The
systolic array is capable of executing convolutions of any
dimensionality, and is extendable to an arbitrary number of processing
elements of the same type.
¹ On leave from the Department of Computer Science, Carnegie-Mellon
University, during January and August, 1981.
² ESL is a subsidiary of TRW Inc.

II. SYSTOLIC ARRAY ORGANIZATION

The systolic array consists of C identical linearly-connected
processing elements, or cells, as depicted in Figure 1. The cell
structure is shown in Figure 2. Each cell contains a multiplier, a
latch, an adder, and a shift register. The three inputs to the array
are the three inputs of cell 1; the only output of the array is the
adder output of cell C. The other outputs of cell C are left
unconnected.

Figure 1. A Systolic Array Model

The cells of the systolic array may be considered as the
individual segments of a pipeline. In addition, as depicted in Figure
2, the adder and the multiplier of each cell are themselves pipelined
to any degree.
The design of such a two-level pipelined systolic array is in
general more complex than the corresponding one-level pipelined
systolic array. It is convenient to define the following parameters:

M is the number of pipeline stages of the multiplier unit (M = 0
corresponds to a combinational multiplier, M = 1 corresponds to a
combinational multiplier with input registers, M = 2 corresponds to a
multiplier with 2 stages of pipeline registers, etc.).

A is the number of pipeline stages of the adder unit.

R is the number of stages of the shift register.

In a design, M and A may be considered as given, and an important task
is to figure out the value of R for the systolic cell.
Note that there is no broadcast or fan-in path in the system,
either for input or output. Each cell communicates only with its
immediate neighbors. This allows indefinite expansion of the array
which, for electrical and timing reasons, is not possible if global
communication paths exist.

Figure 2. A Systolic Cell Model (a latch; an adder with stages S1,...,SA; a multiplier with stages S1,...,SM; and a shift register with stages S1,...,SR)

III. EXECUTION OF 1-DIMENSIONAL CONVOLUTION

Consider a vector (or signal) X = {x_i}, i = 1,2,...,n, and a
vector (or window) of weighting coefficients W = {w_j}, j = 1,2,...,k.
In general, n >> k. The convolution of X by W, giving Y, is defined as:

    y_s = Σ_{i=0}^{k-1} w_{i+1} · x_{i+s} ,    s = 1,2,...,n-k+1 .

Conceptually, if the vector indices increase from left to right,
the first result, y_1, is obtained by aligning the leftmost element of
the W vector with the leftmost element of the X vector, then computing
the inner product of the W vector and the section of the X vector it
overlaps. The window then slides one position rightward, and the inner
product calculation is again performed on the overlap to produce the
second result. The last result is obtained when the rightmost element of
the W vector is aligned with the rightmost element of the X vector.
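(A direct Python restatement of this definition, ours, using 0-based list indexing for the 1-indexed vectors of the text:)

    # Sliding-window convolution y_s = sum_{i=0}^{k-1} w_{i+1} x_{i+s}.
    n, k = 8, 3
    x = [4, 1, 5, 9, 2, 6, 5, 3]          # signal, length n
    w = [2, 0, 1]                         # window, length k

    y = [sum(w[i] * x[i + s] for i in range(k))    # s = 0 .. n-k (0-based)
         for s in range(n - k + 1)]

    assert y[0] == 2*4 + 0*1 + 1*5        # leftmost window position
    assert len(y) == n - k + 1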
In order to execute 1-D convolution on the systolic array
described in the preceding section, the number of stages of the shift
register must be one greater than the number of stages of the adder,
that is, R = A + 1. There is no constraint on the number of multiplier
stages, however. For the time being, assume that there are at least as
many cells in the array as elements of the window, i.e., C ≥ k.
First, a total of C - k zero elements are entered into the
pipelined W path, followed by the window elements in the order w_1,
w_2 through w_k. (This causes w_k to be placed in the multiplier
latch of the leftmost cell. The rightmost C - k cells will be
essentially unused for this computation.) These values remain in place
for the duration of the computation. Then, execution begins as X
elements stream into the x_1 input. A constant zero stream is fed
into the y_1 input. Results stream from the y_{C+1} output. Figures
3a through 3d show snapshots of an execution of a 1-D convolution on a
particular configuration of the systolic array (namely, M=4, A=2, R=3,
C=4). In the figure, x_1 enters the array at time 0, and p_i
designates the partial result of y_i.

Figure 3. Snapshots of an Execution of a 1-D Convolution
with M=4, A=2, R=3, and C=4 (3a at time 0 through 3d at time 15)

IV. EFFICIENT HANDLING OF ZERO WINDOW ELEMENTS

From the preceding discussion, it would appear that a zero window
element would cause the corresponding cell to be "wasted," that is, it
would perform no useful work. The present section shows that this
situation can be avoided if the number of stages (R) of the individual
cell shift registers can be adjusted at set up time. Since any n-D
convolution problem can be converted to a 1-D convolution with a window
consisting of possibly a large number of zeros, as demonstrated later
in section V, the result of this section has a more far-reaching effect
than just saving a few cells.
Suppose that cell i contains a zero window element. Then the only
effect of that cell in the execution of the 1-D convolution is to delay
the Y stream by A cycles and the X stream by A+1 cycles. It should be
apparent then that if this cell were replaced by a zero cycle delay for
the Y stream (a direct path) and a single cycle delay for the X stream,
the correct result stream would be generated, although each result
would appear A cycles earlier than before (the latency would therefore
be reduced). Now, this degenerate cell i may be absorbed into cell i-1
by increasing the number of shift register stages of cell i-1 by one,
to A+2.
The procedure, then, is to load the non-zero window elements into
consecutive cells. Let q be the number of such window elements. Let
Z[i] be the number of zero elements in the window between the window
elements stored in cell i and cell i+1, and let R[i] be the number of
stages of the shift register of cell i, i = 1,2,...,q-1. The shift
registers are then configured as follows: R[i] = A+1+Z[i]. Execution
may then begin.
As an example, if A = 2 and the window is [3,0,0,5,9,0,6], only
4 cells are needed, with 6 stored in cell 1, 9 in cell 2, 5 in cell 3,
and 3 in cell 4. The vector [R[1],R[2],R[3]] is set to [4,3,5]. A
series of snapshots of this example is shown in Figures 4a through 4d.
If A+1+Z[i] exceeds the physical capacity of the shift register of
cell i, the problem can still be solved by using a dummy cell, cell i',
having a zero as its window element, between cell i and cell i+1, and
setting R[i] and R[i'] such that R[i] + R[i'] = 2A+1+Z[i].
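(The set-up rule can be exercised with a short Python sketch, ours, which reproduces the worked example above; the function name configure is our own.)

    # Configure shift registers so that each cell absorbs the zeros that
    # separate its window element from its neighbor's: R[i] = A + 1 + Z[i].
    def configure(window, A):
        nonzero = [i for i, w in enumerate(window) if w != 0]
        weights = [window[i] for i in reversed(nonzero)]    # cell 1 holds the last element
        gaps = [nonzero[j + 1] - nonzero[j] - 1             # zeros between neighbors
                for j in range(len(nonzero) - 1)]
        R = [A + 1 + z for z in reversed(gaps)]
        return weights, R

    weights, R = configure([3, 0, 0, 5, 9, 0, 6], A=2)
    assert weights == [6, 9, 5, 3]        # cell contents, as in the text's example
    assert R == [4, 3, 5]                 # [R[1], R[2], R[3]]

A capacity check against the physical register length, with the dummy-cell rule R[i] + R[i'] = 2A+1+Z[i], could be layered on top of this.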

V. EXECUTION OF N-DIMENSIONAL CONVOLUTION

Applying the technique of the preceding section, multi-dimensional
convolution becomes a straightforward generalization of 1-D
convolution. To make the discussion concrete, 2-D convolution will be
examined in detail.
Consider an m by n matrix (or image) X = {x_ij}, i = 1,2,...,m;
j = 1,2,...,n, and a k by p window with weighting coefficients
W = {w_ij}, i = 1,2,...,k; j = 1,2,...,p.

The convolution of X by W, giving Y, is defined as:

    y_rs = Σ_{i=0}^{k-1} Σ_{j=0}^{p-1} w_{i+1,j+1} · x_{i+r,j+s}

    for r = 1,2,...,m-k+1; s = 1,2,...,n-p+1 .

Figure 4. Snapshots of the Convolution Execution with
Efficient Handling of Zero Window Elements

This computation can be viewed as a 1-D convolution problem, where
the signal is composed as follows:

    X = x_1*, x_2*, ..., x_m* ,

where x_i* = x_i1, x_i2, ..., x_in; that is, the 1-D signal is the
concatenation of the rows of the 2-D signal (or image). The length of
the signal is mn. The window is composed as follows (u!v stands for
the value v repeated u times):

    W = w_1*, (n-p)!0, w_2*, (n-p)!0, ..., (n-p)!0, w_k* ,

where w_i* = w_i1, ..., w_ip; that is, the 1-D window is the
concatenation of the rows of the 2-D window, with a vector of (n-p)
zero elements inserted between each consecutive pair of rows. The
length of the 1-D window is therefore n(k-1)+p.
For illustration, consider a scaled-down example, where the image
is 5 rows by 4 columns (m = 5, n = 4), and the window is 3 rows by 2
columns (k = 3, p = 2). The 1-D signal is formed from the 2-D image as
follows (a 2-D element is specified by its 2-D row-column index):

    X = [11, 12, 13, 14, 21, 22, 23, 24, 31, 32, 33, 34, 41, 42, 43, 44, 51, 52, 53, 54]

and the 1-D window is formed from the 2-D window as follows:

    W = [11, 12, 0, 0, 21, 22, 0, 0, 31, 32].
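(This row-concatenation translation can be checked mechanically. The Python sketch below, ours, builds the 1-D signal and window for this example and verifies that the 1-D convolution taken at offset (r-1)n + (s-1) equals the 2-D result y_rs; it uses 0-based offsets.)

    # 2-D convolution as a 1-D convolution with a zero-padded window.
    m, n, k, p = 5, 4, 3, 2
    image = [[10 * i + j for j in range(1, n + 1)] for i in range(1, m + 1)]
    window = [[10 * i + j for j in range(1, p + 1)] for i in range(1, k + 1)]

    x = [v for row in image for v in row]             # row concatenation, length mn
    w = []
    for i, row in enumerate(window):
        w += row
        if i < k - 1:
            w += [0] * (n - p)                        # (n-p) zeros between window rows
    assert len(w) == n * (k - 1) + p

    def conv1d(x, w, off):
        return sum(w[i] * x[i + off] for i in range(len(w)))

    for r in range(m - k + 1):
        for s in range(n - p + 1):
            y2d = sum(window[i][j] * image[i + r][j + s]
                      for i in range(k) for j in range(p))
            assert conv1d(x, w, r * n + s) == y2d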


Figure 5 shows snapshots of several window positions, each
corresponding to a result, from both the 1-D and 2-D viewpoints. In
the figure, window indices are shown over signal indices. Generation
of the first result, y_11, is shown in Figure 5a. The window moves
one position to the right to produce the second result, y_12, as shown
in Figure 5b. After the last result of the first row, y_13, is
produced, the window slides one position to the right. At this point
an invalid result is generated and is to be ignored. This situation is
shown in Figure 5c. Next, the first element of the second row is
produced, as shown in Figure 5d. The window continues to slide to the
right until the final result, y_33, is generated (see Figure 5e).

Figure 5. Snapshots of a 2-D Convolution Operation,
From both the 1-D and 2-D Viewpoints (5a: first result; 5b: second result;
5c: invalid result, ignored; 5d: first result of the second row; 5e: final result)

A 6-cell array showing the configuration of the cell shift registers
needed to perform the above 2-D convolution is shown in Figure 6.
Although in the general case p-1 invalid results are generated for
each row of output (except the last), the fraction of total results
which are invalid approaches zero, since in general n >> p.

Figure 6. A Linear Systolic 2-D Convolution Array with
A=2 and M=4

This method of translating a 2-D problem to a 1-D problem can
very easily be generalized to the translation of an n-D problem to a
1-D problem. See [9] for details.

VI. CONCLUDING REMARKS

The paper has demonstrated a two-level pipelined systolic array
for performing convolutions of any dimensionality. With the two-level
pipelining, the throughput of the array is bounded by the stage time of
the pipelines inside each cell rather than the total cell cycle time.
This suggests that research should be carried out to devise multi-level
pipelined systolic arrays for other computations. Research along this
line is particularly worthwhile in view of the fact that VLSI will make
the implementation of multiple-stage pipelined arithmetic chips
feasible.

References

1. H.T. Kung and C.E. Leiserson, "Systolic Arrays (for VLSI)," in
I.S. Duff and G.W. Stewart (editors), Sparse Matrix Proceedings
1978, Society for Industrial and Applied Mathematics, pp. 256-282,
1979. A slightly different version appears in Introduction to VLSI
Systems by C.A. Mead and L.A. Conway, Addison-Wesley, 1980, Section
8.3.
2. H.T. Kung, "Why Systolic Architectures?" To appear in Computer,
1981.
3. H.T. Kung, "Special-Purpose Devices for Signal and Image
Processing: An Opportunity in VLSI," in Proceedings of the SPIE,
Vol. 241, Real-Time Signal Processing III, The Society of
Photo-Optical Instrumentation Engineers, pp. 76-84, July, 1980.

4. H.T. Kung and S.W. Song, "A Systolic 2-D Convolution Chip,"
Technical Report CMU-CS-81-110, Carnegie-Mellon University,
Computer Science Department, March, 1981. Also to appear in
Non-Conventional Computers and Image Processing: Algorithms and
Programs, Leonard Uhr (editor), Academic Press, 1981.

5. D.W.L. Yen and A.V. Kulkarni, "The ESL Systolic Processor for
Signal and Image Processing," to appear in Proceedings of the 1981
IEEE Computer Society Workshop on Computer Architecture for Pattern
Analysis and Image Database Management, Hot Springs, Virginia,
November 11-13, 1981.

6. L. Schirm IV, "Multiplier-Accumulator Application Notes," TRW LSI
Products, El Segundo, California, January, 1980.

7. H.T. Kung and R.L. Picard, "Hardware Pipelines for
Multi-dimensional Convolution and Resampling," to appear in
Proceedings of the 1981 IEEE Computer Society Workshop on Computer
Architecture for Pattern Analysis and Image Database Management,
Hot Springs, Virginia, November 11-13, 1981.

8. TRW LSI Products, "TRW LSI Multipliers - MJ series," TRW Inc.,
Redondo Beach, California, 1978.
9. L.M. Ruane, D.W.L. Yen, and H.T. Kung, "A Two-Level Pipelined
Systolic Array for N-Dimensional Convolutions," submitted to IEEE
Transactions on Computers for publication.
Systolic Algorithms for Running Order
Statistics in Signal and Image Processing

Allan L. Fisher
Carnegie-Mellon University
Department of Computer Science
Pittsburgh, Pennsylvania 15213

Abstract

Median smoothing, a filtering technique with wide application in digital signal and image
processing, involves replacing each sample in a grid with the median of the samples within some
local neighborhood. As implemented on conventional computers, this operation is extremely
expensive in both computation and communication resources. This paper defines the running
order statistics (ROS) problem, a generalization of median smoothing. It then summarizes some of
the issues involved in the design of special purpose devices implemented with very large scale
integration (VLSI) technology. Finally, it presents algorithms designed for VLSI implementation
which solve the ROS problem and are efficient with respect to hardware resources, computation
time, and communication bandwidth.

1. Introduction

Median smoothing [16] is a filtering operation, widely used in digital signal and image
processing, which involves replacing each sample value with the median of the values found within
some neighborhood of itself. In the two dimensional image processing case, this typically means
taking the median of 25 to 100 numbers for each of 10^5 to 10^6 pixels. As a result, the computation
and memory communication resources required to implement this operation on a conventional
computer are very large.

The development of very large scale integration (VLSI) technology has made feasible the
production of relatively inexpensive, highly parallel special-purpose computing engines [4, 10] for
the implementation of computationally demanding operations. VLSI algorithms have been
designed and some prototypes implemented for such applications as pattern matching [3],
convolution in image processing [8], and relational database operations [6]. This paper presents
efficient VLSI algorithms which solve the running order statistics (ROS) problem, a generalization
of median smoothing.

This research was supported in part by the Office of Naval Research, under Contracts N00014-76-C-0370 and N00014-80-
C-0236, in part by the National Science Foundation under Grant MCS 78-236-76 and a Graduate Fellowship, and in part by
the Defense Advanced Research Projects Agency under Contract F33615-78-C-1551 (monitored by the Air Force Office of
Scientific Research).

Section 1.1 defines the running order statistics problem and mentions some of its applications.
Section 1.2 explains the principles underlying the algorithm designs presented, and Section 1.3
describes the approach used in analyzing the complexity of such algorithms. Section 2 describes a
VLSI algorithm for the one dimensional signal processing case. Section 3 gives an algorithm for the
two dimensional image processing case, and describes the extension of that algorithm to problems
of arbitrary dimension. Finally, Section 4 reviews some of the important features of the algorithms
described.

1.1. The Running Order Statistics Problem

The running order statistics problem is a generalization of the median smoothing problem.
Median smoothing has been widely used in speech and image processing [1, 13], especially in the
elimination of outliers, spurious values caused by noise or other technical error. Median filtering
has several properties which are particularly useful in image processing. One is that the operation
does not introduce intermediate pixel values not found in the original image, as convolution
methods may, and hence preserves sharp region boundaries. Also, since the median of a group of
numbers is generally insensitive to the presence of a small number of outliers, median smoothing is
not subject to the problem of artifact ringing, the propagation of an erroneous value through a
region of an image. For this reason, median smoothing is sometimes used as a preprocessing step
before an image is filtered by other means.

The median smoothing problem can be generalized by considering order statistics other than the median; while the median of k numbers is the one having rank (k + 1)/2 (for odd k), we can consider asking for the element having some arbitrary rank r. We will say that an instance of the running order statistics problem has dimension n if the array of numbers to be filtered has n dimensions. For the sake of simplicity, we will require that the neighborhood around each element over which statistics are taken be in the form of an n-dimensional hypercube with odd edge length centered on that element, and we will say that an instance is of order k if the hypercube has edge length k.
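
As a concrete point of reference, the definition can be restated as a brute-force program. The sketch below (a hypothetical helper written for exposition only; it is not the systolic algorithm, and it ignores boundary handling) computes the one dimensional ROS directly:

    # Brute-force reference for the order k, rank r, one dimensional ROS
    # problem: sort each window and select; O(k log k) work per sample.
    def running_order_statistic(a, k, r):
        assert k % 2 == 1 and 1 <= r <= k
        out = []
        for i in range(len(a) - k + 1):
            window = sorted(a[i:i + k])     # the k samples in the window
            out.append(window[r - 1])       # the element of rank r
        return out

    # Median smoothing is the special case r = (k + 1) / 2:
    samples = [3, 9, 1, 7, 200, 5, 4]       # 200 is an outlier
    print(running_order_statistic(samples, 3, 2))   # -> [3, 7, 7, 7, 5]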

Section 2 describes a VLSI algorithm for the one dimensional ROS problem. The structure described is based on the same idea as Leiserson's systolic priority queue [9], and is presented here mainly in order to lay the groundwork for the discussion of the second algorithm. The second algorithm, described in Section 3, solves the two dimensional ROS problem.

1.2. Systolic Algorithms


The chief performance advantage offered by VLSI technology is the availability of massive
parallelism, achieved by the harnessing together of many processing units. The exploitation of this
potential requires more than the creation of the raw processing power to solve a problem. It is also
necessary to provide for data transfer between individual processing units and between the
processing units and mass storage. The need for mass storage persists even in the face of advances
in miniaturization; technological forecasts [11, 12] make it clear that it will not be possible to have a
number of processors comparable to the number of data items used by an application at any time in
the foreseeable future. Thus, the communication architecture of a system is a dominating factor in
its performance.

The most common computer system structure is the "von Neumann" architecture, in which a
processing unit receives data and instructions from a memory unit and returns the results of its
computations to the memory. This architecture has the disadvantage that, for most computations,
the operation rate achievable is limited not only by the speed of the processor but also by the
bandwidth of the processor-memory communications link. This limitation is commonly referred to
as the "von Neumann bottleneck".

One solution to this problem is the concept of systolic arrays [5, 7]. A systolic array is a collection
of relatively simple processing units, either all of the same type or a mixture of a few different types,
which are connected by a simple communications network and operate in parallel. The
performance advantage of a systolic array architecture is that it uses each datum retrieved from
memory many times without having to store and retrieve intermediate results, thus potentially
allowing speedups relative to memory bandwidth which are proportional to the number of
processors used.

In order for fabrication of a systolic system to be reasonable in practice, the communication structure which connects the processors must be simple and regular. In particular, a linear pipeline
structure has several important features. First, it requires memory bandwidth which is independent of the size of the array, as contrasted with a two dimensional structure. Second, a large pipeline can be constructed simply by concatenation of smaller pipelines. Finally, since the interface of a linear pipeline to the outside world is of bounded size, increases in integrated circuit density can be exploited while retaining constant chip pinout by laying out pipeline segments on chips in zigzag fashion.

The examples in this paper will be discussed in terms of fully synchronous operations for the sake of simplicity, but they could be broken up into self-timed segments [14] communicating by some protocol while maintaining the same asymptotic performance achieved synchronously.

1.3. Complexity Measures for VLSI


In order to obtain size and performance measurements for the machines proposed here, we
require a model of VLSI technology which is amenable to asymptotic complexity analysis.
Thompson [15], among others, proposes such a model. The analyses carried out here rest on a
simplified version of Thompson's model. The pertinent features of the model are as follows:

- Logic gates of constant fan-in and fan-out require constant area and switch in constant time.

- A constant-width wire of any length can be driven in constant time by drivers which can be implemented in area proportional to the wire's length. In particular, this means that wires occupy area proportional to their length.

Given these elements, we can obtain asymptotic measures of the resources required to apply an
algorithm to a problem of some specified size. We will be concerned with the area required for the
implementation of an algorithm, which translates roughly to its hardware cost, and with the time
required to perform the algorithm.

2. Algorithm for the One Dimensional ROS Problem

This section describes a VLSI algorithm which, for the order k one dimensional running order
statistics problem, yields any particular running order statistic of a vector of length m in time O(m),
while occupying area O(k). The algorithm makes a left-to-right sweep over the input sequence,
computing the required order statistic of each contiguous subsequence of length k. The hardware
structure of the algorithm is a pipeline consisting of k cells, which hold the k values under consideration at each step. The idea is to keep the k elements in order, so that elements having particular ranks may be extracted from corresponding positions in the pipeline, and to update the contents of the pipeline at step n by deleting a_{n-k} and inserting a_n.

The updating is effected by passing messages from cell to cell down the pipeline; at each step, the left end of the pipeline receives a series of messages, and passes them to its right. First, a message is sent down the line which seeks out a_{n-k}, the element to be deleted from the array, and causes it to be deleted. This is followed by a message containing a_n, the new value to be inserted, which passes

down the line until it reaches its appropriate position in order, at which point the value is inserted.
High throughput is achieved by pipelining the processing of the messages, so that many messages
are "cti ve in the pipeline at one time, each in a separate cell.

A check of the correctness of the algorithm can be carried out by checking two assertions. The first is that the abstract sequence of operations specified by the messages injected yields the desired result; that is, that the processing of a particular message, in the absence of other messages, leaves the pipeline in the intended state. The second is that the pipelining of the operations is carried out correctly, in that each message causes each cell to perform the same computation in both the pipelined and non-pipelined cases.

Complexity analysis of a VLSI algorithm is concerned with two measures: the area required to implement the algorithm, and the time that it takes to perform a computation. In this case, assuming that the precision of the numbers to be processed is fixed, so that each comprises a constant number of bits, the area required by each cell is a constant, independent of k. Thus the area required for the entire algorithm is proportional to k, since it consists of k constant-size cells. The time required to process a sequence of numbers is equal to the product of the number of cycles required to pass the sequence through the machine and the time required to perform a machine cycle. A sequence of length m requires 3(m+k) machine cycles, and cycle time is constant, regardless of the value of k. Thus, assuming that m ≫ k, a sequence of m numbers can be processed in time O(m).

3. Algorithm for the Two Dimensional ROS Problem

This section presents an algorithm for the two dimensional running order statistic problem which may be extended to handle ROS problems of arbitrary dimension. In the two dimensional case, the algorithm yields a set of s order statistics of order k for a matrix with m elements in time O(ms log log k), while occupying area O(k^2 log k). Like the algorithm presented for the one dimensional problem, this algorithm is based on a linear array of cells, down which messages are passed to maintain an ordered sequence of values.

As in the algorithm of Section 2, many messages are processed simultaneously, in separate cells. The algorithm for the two dimensional problem has an additional level of parallelism, however, in that it operates on data belonging to k squares of size k x k simultaneously. Essentially, the algorithm sweeps a rectangular window of 2k-1 rows and k columns across the array, and at each step produces order statistics for the k overlapping k x k squares contained in such a rectangle.
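
For reference, the computation the array performs can again be stated brute-force (a hypothetical sketch in the style of the one dimensional reference above, ignoring boundary handling):

    # Brute-force reference: the rank r statistic of every k-by-k square
    # of the image x; O(k^2 log k) work per output pixel.
    def ros_2d(x, k, r):
        m, n = len(x), len(x[0])
        return [[sorted(x[i + di][j + dj]
                        for di in range(k) for dj in range(k))[r - 1]
                 for j in range(n - k + 1)]
                for i in range(m - k + 1)]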

Since the algorithm works on more than one square at a time, each array value is tagged with a row number, ranging from 1 to 2k-1, in order to make it possible to calculate to which squares it belongs. Also, since the mixing of values from different squares makes it impossible to compute results just by reading the contents of particular cells at particular times, order statistics are gathered by special messages which count the number of elements in a given square up to a specified rank, then pass the value having that rank to the end of the pipeline as the result.

A check of the correctness of the algorithm can be performed in the style of Section 2. Again, consideration of a uniprocessor simulation makes it apparent that the sequence of operations applied computes the correct results, and the demonstration of the correctness of the algorithm's pipelining is identical.

The complexity of the algorithm, though, is more complicated than that of the algorithm for the one dimensional case, because each cell must handle numbers ranging up to O(k^2). In particular, each cell must be capable of performing comparison and subtraction operations on these numbers. By encoding the numbers in 2's-complement binary notation, both of these operations can be expressed in terms of addition and testing for zero value of O(log k)-bit numbers. Brent and Kung [2] describe a general adder design which yields b-bit adders requiring area O(b) and time O(log b). Substituting log k for b, the necessary additions can be performed in area O(log k) and time O(log log k). Testing for zero value can be performed with a binary tree of OR gates in area O(log k) and time O(log log k). Thus, since each of 2k^2 - k cells requires area proportional to log k, the entire linear array requires area O(k^2 log k). The time to process m numbers, computing s order statistics for each square, can be calculated as O(ms) machine cycles multiplied by a cycle time proportional to log log k, yielding the result O(ms log log k).
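
The reduction of comparison to addition plus simple bit tests can be illustrated in a few lines (a hypothetical sketch for b-bit words; the sign-bit test for "less than" assumes the operands are small enough that the subtraction does not overflow):

    # x - y computed as x + (~y + 1), modulo 2^b; a zero test (an OR-tree
    # over the bits) and the sign bit then give equality and ordering.
    def compare_2c(x, y, b):
        mask = (1 << b) - 1
        diff = (x + ((~y + 1) & mask)) & mask
        return bool(diff >> (b - 1)), diff == 0    # (x < y, x == y)

    print(compare_2c(3, 5, 8))   # -> (True, False)
    print(compare_2c(5, 5, 8))   # -> (False, True)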

The algorithm may be extended to handle problems of any dimension n by using a (2k-1)^(n-1) k cell pipeline, and sweeping the array with a hyperwindow which has width k in the direction of the sweep and width 2k-1 in every other direction. At each step in the sweep, the algorithm would read (2k-1)^(n-1) values and produce k^(n-1) sets of order statistics. Each value in the pipeline would be accompanied by n-1 numbers indicating to which hypercubes, among the k^(n-1) contained in the window, the value belongs.

4. Conclusion

In addition to providing cost-effective solutions to a computationally difficult family of problems, the algorithms described in this paper illustrate some important issues in the design of special-purpose parallel computing devices. First, because of their linear structure, they conserve memory bandwidth. The number of bits transferred in each cycle by the algorithm for the two dimensional problem grows as log k, rather than at least k as in the case of a two dimensional array of processors. Thus a large device can run with essentially the same memory bandwidth as a small device without wasting any of its processing power. Also, each value is retrieved from memory a small constant number of times (no more than 2 if a 2k^2 - k cell shift register is used to keep track of values to be deleted), so the total external communication required by the algorithm is small. This feature is the result of an economy of memory reference which is central to systolic algorithms: when a value is read from memory, at least half of the computations which depend on that value are performed. This is accompanied by a dual economy of computation: no two input values are ever compared more than a small constant number of times.

Acknowledgments

Thanks are due to M. J. Foster, H. T. Kung, P. L. Lehman, and S. W. Song for helpful criticism,
and to H. T. Kung for suggesting the problem.

References

[1] Andrews, H. C. Monochrome digital image enhancement. Applied Optics 15(2):495-503, February, 1976.

[2] Brent, R. P. and H. T. Kung. A regular layout for parallel adders. Technical Report CMU-CS-79-131, Carnegie-Mellon University, Computer Science Department, June, 1979.

[3] Foster, M. J. and H. T. Kung. The design of special-purpose VLSI chips. Computer Magazine 13(1):26-40, January, 1980.

[4] Kung, H. T. Let's design algorithms for VLSI systems. Technical Report CMU-CS-79-151, Carnegie-Mellon University, Computer Science Department, January, 1980.

[5] Kung, H. T. Notes on VLSI Computation. To be published by Cambridge University Press.

[6] Kung, H. T. and P. L. Lehman. Systolic (VLSI) arrays for relational database operations. In Proceedings of ACM SIGMOD 1980 International Conference on Management of Data, pages 105-116. Association for Computing Machinery, May, 1980.

[7] Kung, H. T. and C. E. Leiserson. Systolic arrays (for VLSI). In Duff, I. S. and Stewart, G. W. (editors), Sparse Matrix Proceedings 1978, pages 256-282. Society for Industrial and Applied Mathematics, 1979. A slightly different version appears as Section 8.3 of Mead and Conway [10].

[8] Kung, H. T. and S. W. Song. A systolic 2-D convolution chip. Technical Report CMU-CS-81-110, Carnegie-Mellon University, Computer Science Department, March, 1981. To appear in Non-Conventional Computers and Image Processing: Algorithms and Programs, Leonard Uhr (editor), Academic Press, 1981.

[9] Leiserson, Charles E. Systolic priority queues. Technical Report CMU-CS-79-115, Carnegie-Mellon University, Computer Science Department, April, 1979.

[10] Mead, C. A. and L. A. Conway. Introduction to VLSI Systems. Addison-Wesley, Reading, Mass., 1980.

[11] Moore, G. E. Are we really ready for VLSI? In C. L. Seitz (editor), Proceedings of Conference on Very Large Scale Integration: Architecture, Design, Fabrication, pages 3-14. California Institute of Technology, 1979.

[12] Noyce, R. N. Hardware prospects and limitations. In M. L. Dertouzos and J. Moses (editors), The Computer Age: A Twenty-Year View, pages 321-327. Institute of Electrical and Electronics Engineers, 1979.

[13] Rabiner, L. R., M. R. Sambur, and C. E. Schmidt. Applications of a nonlinear smoothing algorithm to speech processing. IEEE Transactions on Acoustics, Speech, and Signal Processing 23(6):552-557, December, 1975.

[14] Seitz, C. L. System Timing. Chapter 7 of Mead and Conway [10].

[15] Thompson, C. D. A Complexity Theory for VLSI. PhD thesis, Carnegie-Mellon University, Computer Science Department, August, 1980.

[16] Tukey, J. W. Exploratory Data Analysis. Addison-Wesley, Reading, Mass., 1977.


Systolic Array Processor Developments

Keith Bromley, J.J. Symanski, J.M. Speiser, and H.J. Whitehouse
Naval Ocean Systems Center
San Diego, California 92152

ABSTRACT
The combination of systolic array processing techniques and VLSI
fabrication promises to provide modularity in the implementation of
matrix operations for signal-processing with throughput increasing
linearly with the number of cells utilized. In order to achieve this,
however, many design tradeoffs must be made.
Several fundamental questions need to be addressed: What level of
complexity (control) should the processor incorporate in order to
perform complicated algorithms? Should the control for the processing
element be combinatorial logic or a microprocessor? The broad
application of a systolic processing element will require flexibility in
its architecture if it is to be produced in large enough quantities to
lower the unit cost so that large arrays can be constructed.
In order to have a timely marriage of algorithms and hardware we
must develop both concurrently so that each will affect the other. A
brief description of the hardware for a programmable, reconfigurable
systolic array test-bed, implemented with presently available integrated
circuits and capable of 32 bit floating point arithmetic will be given.
While this hardware requires a small printed circuit board for each
processor, in a few years, one or two custom VLSI chips could be used
instead, yielding a smaller, faster systolic array. The test-bed is
flexible enough to allow experimentation with architecture and
algorithms so that knowledgeable decisions can be made when it comes
time to specify the architecture of a VLSI circuit for a particular set
of applications.
The systolic array testbed system is composed of a minicomputer
system interfaced to the array of systolic processor elements (SPEs).
The minicomputer system is an HP-1000, with the usual complement of
printer, disk storage, keyboard-CRT, etc. The systolic array is housed
in a cabinet approximately 28"x19"x21". The interface circuitry uses a
single 16-bit data path from the host HP-1000 to communicate data and
commands to the array.
Commands and data are generated in the HP-1000 by the operator using
interface programs written in FORTRAN. Algorithms can be conceived, put
into a series of commands for the systolic array processor, and tested
for validity. Data computed in the array can be read by the host
HP-1000 and displayed for the operator.
The use of a general purpose minicomputer as the driver for the
systolic array gives unlimited flexibility in developing algorithms.
Through the use of interface routines, algorithms can be tried,

evaluated, changed and tried again in a few minutes. Also, in cases
where the output must be manipulated and fed back into the array, the
manipulation of the data can be done either in the host using the high
order language capability (for optimum flexibility), or in a dedicated
microprocessor interfacing the systolic array to the host (for optimum
speed).
INTRODUCTION
Signal processing theory currently suggests many improved process-
ing methods which cannot be implemented in real-time because of the
computational burden [1]. A partial solution is provided by faster
device technology. However, parallelism will always be required when
the data acquisition rate is comparable to the arithmetic cycle time and
multiple operations are performed per data point. Current trends in
VLSI/VHSIC technology support the development of highly parallel
computational structures to an extent which has never previously been
practical. While some previously developed parallel algorithms and
architectures deserve consideration because of changed economic and
technical constraints, most of the algorithm and architecture
development remains to be done. It is desirable to realize economies of
scale for the chip set, processors, and system software. This makes it
desirable to select a set of primitive operations at each level which is
broad enough for wide applicability, but sufficiently structured to
permit high efficiency and regularity of design with a small set of
primitives. Fortunately, the computations which provide the bulk of the
computational load for signal processing are highly regular in their
arithmetic operations and data flow.
It has previously been shown that the major computational require-
ments for many important real-time signal processing tasks can be
reduced to a common set of basic matrix operations [1A, 1B]. These
include matrix-vector multiplication, matrix-matrix multiplication and
addition, matrix inversion, solution of linear systems, least-squares
approximate solution of linear systems, eigensystem solution,
generalized eigensystem solution, and singular value decomposition. For
implementation on a single processor, the results of extensive research
in numerical linear algebra is a set of numerically stable, well
documented routines for linear systems and least squares problems
LINPACK [2] and one for eigensystem problems-EISPACK [3]. Parallel
processor designs can utilize the available studies of the numerical
stability of algorithms, but much less is known about effective
parallelization of algorithms.
The array processor has been a powerful and popular augmentation of
the general purpose computer, permitting the rapid implementation of
operations with circulant matrices, since the eigenvectors of a
circulant are the basis vectors of a discrete Fourier transform.
Similarly, the eigenvectors of a Toeplitz matrix are asymptotically
approximated by sinusoids; shift-invariant linear systems have
Toeplitz kernels, and stationary random processes have Toeplitz
covariance matrices.
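
The circulant case rests on a standard identity, stated here for
concreteness (it is not spelled out in this form above): if C is a
circulant matrix with first column c and F is the DFT matrix, then

    C x = F^(-1) diag(F c) (F x),

so a circulant matrix-vector product reduces to a few Fourier
transforms, which is exactly what an array processor does well.
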
Our objective is to provide a matrix processor to augment the
general purpose computer and array processor, providing modular
parallelism in the hardware implementation of the equivalent of the
LINPACK/EISPACK functions. This would lift the requirement of special
structure in the matrices for real-time computation. On the other hand,
irregular computations or data reorganization may be performed at a
lower rate by the general purpose host computer, permitting a simplifi-
cation of the operation set of the matrix processor.
Parallel processing architectures for the matrix operations have
been surveyed [1A, 1B] and it was concluded that the systolic architec-
tures [4,5] provide the most promising combination of characteristics
for utilizing VLSI/VHSIC technology for real-time signal processing:
modular parallelism with throughput directly proportional to the number
of cells, simple control, synchronous data flow, local interconnects,
and sufficient versatility for implementing the matrix operations needed
for signal processing. Previously reported systolic architectures
include linear [4], hexagonal [4] and rectangular/hexagonal [5,6]
arrays. The linear configurations perform matrix-vector multiplication
and solution of triangular linear systems. The hexagonal configurations
perform matrix multiplication/accumulation and L-U decomposition. The
engagement processor rectangular/hexagonal array provides matrix
multiplication/accumulation with improved efficiency for dense matrices,
and may also perform as a hexagonal array for L-U decomposition [6].
Although the regularity of the arithmetic operations and data flow is
high for non-partitioned operations, partitioned operations and mixed
types of computations require careful consideration of "edge-effects"
and incorporation of corresponding features in the hardware. An
implementation using commercially available microprocessor components
provides the opportunity to test a variety of algorithms on a variety of
systolic array configurations, and especially to understand the
architectural features needed by an algorithm's data movement and
decision-making requirements. Such hardware requirements are difficult
to learn by simulation alone, and expensive to learn by trial and error
design of dedicated chips at the VLSI/VHSIC level of complexity.
SYSTOLIC ARRAY PROCESSOR (SAP)
The systolic array concept involves the inherent high throughput and
simplicity offered by a lattice of identical processing elements, all
operating in parallel on data synchronously flowing through the
structure. The prospects for fabricating an array element (or several
elements) on a single chip appear very good. However, many details of
the algorithms, data flow, control, input/output, numerical accuracy,
speed, etc. have to be determined before a particular chip design can be
undertaken.
The goal of this work is to build a systolic array testbed which is
flexible enough to allow experimentation with algorithms and configura-
tions so that intelligent decisions can be made when it comes time to
specify the chip architecture for a particular set of applications. The
hardware implementation is shown conceptually in Figure 1. The array
consists of a cabinet containing an 8-by-8 array of systolic processor
circuit boards, a mother board, and a rack for the host-interface
electronics.

Fig. 1. Systolic array configuration (an 8-by-8 array of processors,
connected through the host interface and a cable to the host).


DESIGN HISTORY AND RATIONALE
In the initial design of this testbed, many questions regarding
tradeoffs and decisions had to be answered. For instance, should we use
bit-serial or bit-parallel computation? What dynamic range is necessary
for a useful processor? What speed of computation is reasonable? How
complex or smart should each processor be? How many elements will be
reasonable to connect? And what will be the packaging approach? Only
the final design will be presented here, without describing the many
possible alternatives.
Since the overriding concern in this testbed is for flexibility to
allow experimentation, it was decided to use a microprocessor with EPROM
and RAM, to allow the maximum programmability in the systolic processing
element.
As for the dynamic range, consideration of signal processing uses
led us to settle on a 32-bit floating-point capability. Here a tradeoff
between software and hardware implementation led to the use of an
Arithmetic Processing Unit (APU), e.g., Intel-8231 or AMD-9511.
Bit-serial computation was considered as a possibility because of
its expandability and low pin count. However, the need for wide dynamic
range and the long design time a bit-serial design would require, made
the bit-parallel approach of the APU more attractive.
The type of communication path to the SPE (serial or parallel)
effects the speed of operation and the hardware required for an array.
Serial communication was found to be more advantageous for several
reasons. Since each SPE has six I/O ports, an eight-bit parallel path
would require 48 pins and 48 driver/receiver buffers in each SPE. In
addition, as in VLSI design, interconnection could become a major
problem. Furthermore, to obtain flexibility in the intercommunication
of the array, multiplexers have been placed in some of the data paths.
The use of parallel communication would have significantly increased the
amount of hardware required.
After the basic processor complexity was determined, the system was
partitioned into a large (approximately 23-by-16 inch) mother board with
each systolic processor on a separate 2.5-by-8 inch printed circuit
board, mated to the mother board by using an edge card connector on a
short side as shown in Figure 1.
SYSTOLIC ARRAY TESTBED ARCHITECTURE
The systolic array testbed system is composed of a minicomputer
system interfaced to the array of systolic processor elements. The host
is a minicomputer with the usual complement of printer, disk storage,
keyboard-CRT, etc. The systolic array is housed in a cabinet approxi-
mately 28 by 19 by 21 inches. The interface circuitry uses a single
16-bit data path from the host minicomputer to communicate data and
commands to the array.
Commands and data are generated in the host by the operator, using
interface programs written in FORTRAN. Algorithms can be conceived, put
into a series of commands for the systolic array processor, and tested
for validity. Data computed in the array can be read by the host mini-
computer and displayed for the operator.
Many other papers have discussed the theoretical aspects of the
systolic array's communication of data and other properties. We have
implemented the original H. T. Kung hexagonal interconnect architecture,
as shown in reference [4]. By substitution of squares for hexagons,
appropriate rotation of the communication paths, and realignment of the
processors on a square grid, we have the square array. Now the A data
paths are horizontal, the B paths are vertical, and the C paths are
along a diagonal.
In this implementation, there are also virtual rows and columns
along the edges of the array, as shown in Figure 2. These virtual rows
and columns perform the interfacing between the parallel data path from
the host and the serial communications of the systolic array processing
elements. These virtual rows and columns can be thought of as existing
on either side of the array, since the data paths from the SPEs on the
"far side" of the array wrap around and are also connected to the left
of the virtual A, B, or C rows and columns.
To achieve architectural flexibility, the serial data paths from
the virtual rows and columns pass through multiplexers so that several
options for data flow are possible. The data path can be selected in
real time by the host processor.
It is important to note, and the essence of the great flexibility
of this testbed, that thru the use of the microprocessor in the SPE we
can interchange the roles of A, B and C at will.

Fig. 2. The square systolic array with virtual rows and columns.
ARRAY CONFIGURATIONS
There are many array configurations available, five of which will
be described here. The first is the basic engagement configuration
shown in Figure 3. Here we have the data flowing in the usual A, B and
C directions thru the square array of processors. This configuration
will be used for (a) matrix multiplication/accumulation of full matrices
[6] with no special processors and (b) L-U decomposition, which requires
special boundary processors and data interchange. For instance, the top
left processor performs a division while the left column outputs the C
data back into the array along the A data path and the top row outputs
the C data back into the array along the B data path.
Fig. 3. Rectangular/hexagonal configuration.
The second configuration converts the square array to a linear
array by routing the right output of a given row to the left input of
the next row as shown in Figure 4. The routing of data is determined
with multiplexers placed in the data path between the virtual rows and
columns and the processors on the left and top edges. Note that in the
linear array the x and y vectors move along the A data path, i.e., the
rows. In performing the matrix-vector multiplication y = Ax, where
y = (y_i), A = (a_ij), and x = (x_j), the matrix A is fed vertically down
in an appropriate manner to form products with the elements of x which
move from left to right on a given row. The product vector y moves from
right to left along the row data paths which are bidirectional and time
multiplexed. This configuration will be used for (a) matrix-vector
multiplication with no special processors and (b) triangular linear
equation solution, which requires a subtraction/division at one end of
the linear array.
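
The multiply-accumulate dataflow of the linear array can be mirrored
in software (a hypothetical sequential model; the counter-moving x and
y streams and the time multiplexing of the real array are not captured
here):

    # Cell j holds x_j; each partial sum y_i picks up one a_ij * x_j
    # term as it passes cell j.
    def linear_array_matvec(A, x):
        y = []
        for row in A:                    # y_i enters the array empty
            acc = 0
            for j, xj in enumerate(x):   # ... and streams past every cell
                acc += row[j] * xj
            y.append(acc)
        return y

    print(linear_array_matvec([[1, 2], [3, 4]], [5, 6]))   # -> [17, 39]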

Fig. 4. Linear array configuration.


The third configuration shown in Figure 5 essentially broadcasts
the same data to the top of all columns of the array. This
configuration can be used to multiply a Hankel matrix by an arbitrary
matrix or to form a skewed outer product [6]. A similar configuration
can be used to multiply a Toeplitz matrix by an arbitrary matrix [6].
The fourth configuration, shown in Figure 6, uses a second array
coupled to the first along the rows or A data path. This configuration
can be used to double system throughput in the computation with complex
matrices. The connection between arrays along rows enables the swapping
of data from one array to the other for the computation of the AB + CD
product necessary for complex matrix operations as shown in Figure 7.
This configuration can be used to perform a large DFT via modular
decomposition [6].
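
The need for the AB + CD form follows from writing the complex product
in its real and imaginary parts (a standard identity, stated here for
exposition):

    (A + jB)(C + jD) = (AC - BD) + j(AD + BC),

so each of the two result matrices is a sum (or difference) of two real
matrix products, one factor pair computed in each of the coupled arrays.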

Fig. 5. Broadcast configuration.

Fig. 6. Dual array configuration.



Fig. 7. Complex matrix multiplication using two real systolic arrays.

The fifth configuration is implemented in a single array and
enables the transposition of a resident matrix in a straightforward
manner as shown in Figure 8. Here we are utilizing the capability to
interchange the roles of the A, B and C data as well as external data
path multiplexers.

Fig. 8. Matrix transposition configuration.

THE SYSTOLIC PROCESSOR ELEMENT (SPE)


The block diagram for the systolic array processing element is
shown in Figure 9. The microcomputer used is the Intel 8031. This
device was chosen for its speed, internal RAM, many I/O pins, and ease
of bit and byte manipulation. The data moves about in the SPE over a
multiplexed 8-bit data/address bus.

Fig. 9. Systolic processor element block diagram (an 8031
microprocessor with the A, B, C, and S serial I/O registers).

Computation is accomplished in the Arithmetic Processing Unit
(APU). This unit is capable of several formats (16-bit fixed point,
32-bit fixed point, and 32-bit floating point). Many operations are
available such as add, subtract, multiply, divide, square root, several
trigonometric functions, etc.
The EPROM and RAM are standard devices. The EPROM is 4K x 8 bits.
The RAM is 1K x 8 bits. The EPROM is used to store routines which
perform data manipulation. This gives the system a hierarchical approach
so that a single byte transmitted to the processor initiates a sequence
of operations. The RAM can be used for data storage. This results in a
"3rd dimension" of matrix storage which is important for partitioned
matrix operations [6]. The RAM can also be used for storage of programs
during algorithm development. Once the algorithms have been perfected,
they will be put into EPROM.
The I/O from the processor is bit serial. The main reason for
serial I/O is to minimize pinouts and driver/receiver requirements.
Four universal 8-bit parallel/serial shift registers constitute the I/O
ports. Three registers are used for the A, B and C data. The fourth
register, the S register, receives the instruction, broadcast along

rows, which tells the SPE which routine to perform. The I/O registers
are loaded and read by the microprocessor under program control.
VLSI/VHSIC IMPLEMENTATION
It has been predicted that one or more systolic processing elements
could be put on a single VLSI chip. While the currently implemented
printed circuit board with 18 ICs is undoubtedly not the optimum design
for the future, it is an interesting exercise to calculate the gate
count for a 32-bit version of this SPE. Table 1 shows the estimated
gate count for the present implementation and that which would be used
in a VLSI chip if we eliminate some of the flexibility (programmability)
used in this testbed. These numbers are very rough estimates, perhaps
±30%.
Table 1 - SPE Gate Count (32-bit, Floating Point)

                        Testbed SPE    VLSI
    Control Processor       20K         10K
    APU                     20K         20K
    ROM                     64K         16K
    RAM                     16K         16K
    I/O                      1K          2K
    Total                  121K         64K

CONCLUSION
In the last two decades numerical analysis has developed many
numerically stable matrix algorithms [2,3] for use with single
arithmetic unit digital computers. However, parallel numerical
algorithms have not been correspondingly developed and only a limited
number of parallel processors or computers have been built.
It is the belief of the authors that concurrent processing with
generalized systolic architectures will provide the capability of
implementing in hardware the matrix processing which currently is
represented by the LINPACK [2] and EISPACK [3] software libraries. The
availability of affordable VLSI matrix processing peripherals for
minicomputers would significantly advance signal processing research.
Similarly, VHSIC [7] implementations of these matrix processing
peripherals would make advanced signal processing available for real-
time tactical signal processing applications.
Dr. Mermoz recently addressed the question of spatial signal
processing beyond adaptive beamforming [8]. The conclusions of his
paper were that with adequate computational resources much new
information about the medium through which the signal propagates may be
incorporated into the signal processing with corresponding improvements
in system performance. In particular, he says, "Despite the advances in
computer technology, fast as it may have been in the past decades, such
a program is liable to absorb all the present capacity and probably the
predictable capacity during the next twenty years. Meanwhile, there
will be some trade-offs between complexity and precision. But the trend
to introducing the most flexible model, compatible with the array and
the number of sources, is likely to be the right approach toward further
improvements when we deal with complex and unpredictable mediums ... Such
an approach is rather the opposite of what has been done so far, except
in advanced research. Of course, anybody would be horrified by the
amount of computing power required. On the other hand, most scientists
have been horrified several times in their career by what turned out to
be current practice within a few years."
Systolic architectures implemented with VLSI should make this type
of signal processing possible for sonar bandwidths within this decade
while VHSIC implementations should provide similar capability at
communication and radar bandwidths.
ACKNOWLEDGEMENTS
The authors wish to acknowledge the support of the Naval Electron-
ics Systems Command (Elex 612, 614) and the Naval Ocean Systems Center
Independent Research/Independent Exploratory Development (IR/IED)
program.
REFERENCES
[1A] Speiser, J.M. and H.J. Whitehouse, "Architectures for Real-
Time Matrix Operations," Proceedings of the 1980 Government Micro-
circuits Applications Conference held at Houston, Texas, 19-21 Nov. 1980.
[1B] Speiser, J.M., H.J. Whitehouse, and K. Bromley, "Signal
Processing Applications for Systolic Arrays," Record of 14th Asilomar
Conference on Circuits, Systems and Computers held at Pacific Grove,
California, 17-19 Nov. 1980, IEEE Catalog No. 80CH1625-3, pp. 100-104.
[2] Dongarra, J.J., et al, LINPACK Users' Guide, Society for
Industrial and Applied Mathematics, Philadelphia, Pennsylvania, 1979.
[3] Garbow, B.S., et al, Matrix Eigensystem Routines-EISPACK Guide
Extensions, Springer-Verlag, 1977.
[4] Kung, H.T., "Systolic Arrays for VLSI," in Duff, I.S. and G.W.
Stewart, Sparse Matrix Proceedings, 1978, Society for Industrial and
Applied Mathematics, Philadelphia, Pennsylvania, 1979 (Reprinted in
Mead, C. and L. Conway, Introduction to VLSI, Addison-Wesley, 1980).
[5] Kung, S.-Y., "VLSI Array Processor for Signal Processing,"
presented at Conference on Advanced Research in Integrated Circuits held
at MIT, Cambridge, Massachusetts, Jan. 1980.
[6] Speiser, J.M. and H.J. Whitehouse, "Parallel processing
algorithms and architectures for real-time signal processing," Real-Time
Signal Processing IV, a publication of the SPIE International Technical
Symposium held in San Diego, 25-28 Aug. 1981, Vol. 298.
[7] L.W. Sumney and E.D. Maynard, Jr., "The United States
Department of Defense Program on Very High Speed Integrated Circuits
(VHSIC)," Proc. 1979 Int. Symp. on Circuits and Systems, July 1979,
pp. 559-563.
[8] H.F. Mermoz, "Spatial Processing Beyond Adaptive Beamforming,"
J. Acoust. Soc. Am. 70(1), July 1981, pp. 74-79.
A Systolic (VLSI) Array for Processing
Simple Relational Queries
Philip L. Lehman
Carnegie-Mellon University
Computer Science Department
Pittsburgh, Pennsylvania 15213

1. INTRODUCTION
This paper discusses the use of systolic arrays (a conceptual and design tool for VLSI
systems [11]) to produce VLSI capable of processing simple relational database queries, which are
by far the most frequently executed queries in practical large database systems. We will be
concerned with the exploitation of VLSI technology to process "simple" relational queries very
rapidly; the design of an array for this task is described below. The systolic properties of the array
design are considered, and are shown to have analogs in the domain of databases by using the
systolic properties to prove certain consistency and scheduling complexity properties of all
transactions executed by the array (hereinafter called the simple query array, or SQA). The SQA is
intended for use as an integral part of a systolic database machine [13], which would handle very
large databases and is expected to have a high performance gain over conventional database
systems. The machine should also compare quite favorably with other database machine designs
(for example [1, 4, 16, 17, 19]), especially when used for databases with frequent simple queries, i.e.
those databases used by most commercial applications!
2. SIMPLE QUERIES
It has been observed [7, 14] that the following rule applies to the largest database systems:
Simplicity Characteristic: Almost all of the transactions are very simple.
For example, in a large banking system most (> 80%) of the transactions run by the bank on its
Customer_Account Database will be "Debit_Account" or "Credit_Account" transactions, as in
Figure 1 [6]. In contrast, the monthly printing of customer statements is more complex, but is
performed (relatively) very rarely (perhaps 10^-7 as often).
This paper, therefore, assumes a model of database system usage in which almost all of the
transactions are drawn from a set of simple transactions that are performed very frequently. This
model seems to be satisfied by a wide range of practical applications, including banking, airlines
reservation systems, telephone directory assistance, inventory systems, employee record systems,
etc. The systolic arrays proposed in this paper emphasize high throughput of these transactions
without sacrificing adequate response. All databases in this paper are assumed to be relational
databases (see [2, 3]). In [9] systolic arrays were used to perform "hard" relational database
operations (intersection, difference, union, remove-duplicates, projection, join, division); the
regular structure of systolic arrays was shown to lend itself naturally to the processing of relations,
which are also very regular. (For background on systolic arrays see [8, 11].)
3. CONCEPTUAL OPERATION OF THE ARRAY
As a simple illustration of the use of the SQA, it is appropriate to reconsider a (simplified)
piece of the DEBIT_CREDIT transaction: the section shown in Figure 2 is of interest, since the
other parts of the transaction are very similar in form to this section. This section of the transaction

This research was supported in part by the Office of Naval Research under Contracts N00014-76-C-0370 and
N00014-80-C-0236, in part by the National Science Foundation under Grant MCS 78-236-76, in part by the Defense
Advanced Research Projects Agency under Contract F33615-78-C-1551 (monitored by the Air Force Office of
Scientific Research), and in part by an IBM Predoctoral Fellowship.

DEBIT_CREDIT:
  Begin_Transaction;
  Get Message;
  Extract Acct_Num, Delta, Teller, Branch from Message;
  Find ACCOUNT(Acct_Num) in Database;
  if Not_Found | Acct_Bal+Delta<0 then Put "Negative Response";
  else do;
    Acct_Bal = Acct_Bal + Delta;
    Post HISTORY record on ACCOUNT (Delta);
    CASH_DRAWER(Teller) = CASH_DRAWER(Teller) + Delta;
    BRANCH_BAL(Branch) = BRANCH_BAL(Branch) + Delta;
    Put Message("New Balance = " Acct_Bal);
  end;
  Commit;

Figure 1: A simple debit/credit transaction (from [6]).
(In transactions in this paper, field names are abbreviated for convenience.)
examines the ACCOUNT relation of a banking database, looking for a single Account_Number
(assuming that there is at most one instance of that Account_Number in the relation). If it finds the
Account_Number, it changes the associated Account_Balance (by Delta). An example ACCOUNT
relation is shown in Figure 3. As written, the DEBIT_CREDIT transaction was intended to be run
on a sequential processor. The same transaction-in fact, a batch of such transactions-can instead
be run on the systolic array shown in Figure 4. In this paper, the term batch refers to a group of
transactions that are processed together and that access the same columns of the same relations in
the same order; an example might be a large batch of bank check cashings.
Find ACCOUNT(Acct_Num) in Database;
if Not_Found | Acct_Bal+Delta<0 then Put "Negative Response";
else do; Acct_Bal = Acct_Bal + Delta;

Figure 2: Part of the DEBIT_CREDIT transaction in Figure 1.
ACCOUNT    Account_Number    Balance

               1215       $     1234.56
               1492       $       29.95
               1701       $   100000.00
               1776       $ 98765432.10
               1812       $      432.79
               1980       $     1284.73

Figure 3: The bank ACCOUNT relation.


In Figure 4, each box represents a systolic processor. Each row of the systolic array contains
two processors which together handle a single query. (For complex queries, more than one row
can be used.) The SQA is preloaded with the queries, one to a row. In this case the lefthand
processor in a row is to contain a predicate on the Account_Number, and the righthand processor is
to contain an update on the Account_Balance. The general strategy is to pass tuples of the relation
in question into the top of the SQA. One column of the relation is processed by one column of the
array. Each tuple passes through each row of processors and is acted upon (in parallel) by each
query. This is the usual technique for using systolic arrays to manipulate relational databases [9].

Figure 4: A 3x2 systolic array for simple queries. P1 = "= 1701 ?"; U1 = "+ 60".
In the present case, columns R1 (Account_Number) and R2 (Account_Balance) are passed into the
two columns of the array, as shown, with the Account_Numbers (1215, 1492, 1701, ...) entering
one time-step ahead of the corresponding Balances ($1234.56, $29.95, $100000.00, ...).
For example, suppose the first query is "Add $60 to the Account_Balance of
Account_Number 1701." Then the first lefthand processor is preloaded with predicate P1: "=
1701 ?" The first righthand processor is preloaded with update U1: "+ 60." As each element
from the Account_Number column passes predicate P1, the predicate is executed, and the result of
the predicate is passed to the processor containing U1. If the result of the predicate is TRUE, the
update is executed (in this case, on the entry representing the Account_Balance of Account_Number
1701: $100000.00 in the figure).
Notice that the SQA is synchronous-the processors operate in lock-step. The data is entered
so that the Account_Balance corresponding to a particular Account_Number enters the array one
time-step after the Account_Number. This insures that, for example, the result of predicate P1 on a
particular Account_Number will reach update U1 at the same time as the corresponding
Account_Balance.
Additional queries are placed in succeeding rows of the systolic array. The important
characteristic of this array is that it can contain many queries (hundreds, not just the three shown in
the figure). Hence, many queries can be executed for a single pass of the relation through the
array.
The general philosophy of this array is that its structure is such that many similar queries can
be loaded into a single array simultaneously. Then the relevant relation is passed through the
array, and all of the queries are executed. Even if it takes substantial time to run a relation through
an array, the large number of queries processed yields a high throughput (in queries per unit time)
for the combined operations.
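
The logical effect of one pass through the array can be sketched as follows (a hypothetical sequential model with invented names; in the hardware all rows act on the stream in parallel, in lock-step):

    # Each query row pairs a predicate on column R1 with an update on
    # column R2; every tuple is examined by every row.
    def run_sqa(relation, queries):
        out = []
        for acct_num, balance in relation:      # tuples stream downward
            for predicate, update in queries:
                if predicate(acct_num):         # lefthand processor
                    balance = update(balance)   # righthand processor
            out.append((acct_num, balance))
        return out

    accounts = [(1215, 1234.56), (1492, 29.95), (1701, 100000.00)]
    queries = [(lambda a: a == 1701, lambda b: b + 60)]    # P1, U1
    print(run_sqa(accounts, queries))
    # -> [(1215, 1234.56), (1492, 29.95), (1701, 100060.0)]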

4. COMBINATION BY PIPELINING
4.1. Updates on Multiple Relations
The DEBIT_CREDIT transaction updates several relations in the database, including
ACCOUNT, CASH_DRAWER, and BRANCH_BALANCE. Each of these updates is conditional
on the appropriate Account_Number being found in the database, but of course this test must only
be done once. As written, the transaction indicates that the following procedure is to be executed.
Once the Account_Number is found (if it is present), the Account_Balance is updated. Then, the
CASH_DRAWER relation is searched for the entry for the appropriate teller, and that teller's
drawer balance is updated. Similarly, the BRANCH_BALANCE for the correct branch is updated.
The operations on the ACCOUNT, CASH_DRAWER, and BRANCH_BALANCE relations are
almost identical, and can each be executed by the array shown in Figure 4.
The method described in section 3 will update each of the relations in the transaction, if it is
applied separately to each relation. It remains to be shown how to combine multiple instances of
the method to implement a single transaction operating on several relations. The straightforward
method for applying batches to several relations would be to process the batches sequentially,
relation by relation. This method can be improved by exploiting the parallelism that is made
possible by using several instances of the SQA. Specifically, this can be done with a pipeline
scheme, like that shown in Figure 5.
Figure 5: The pipeline of systolic stages (three array stages, each
holding a batch of queries; relations flow through the stages in the
direction of the pipeline).


4.2. The Pipeline
Each systolic stage of the pipeline can be thought of as an instance of the SQA. At any given
time, each stage of the pipeline contains a different batch of queries. When a batch enters the
system, it enters stage 1. The first relevant relation is then passed through the stage. Updates on
that relation are performed by the array. Then, the batch is moved to the systolic array in the
second stage, and the second relevant relation is passed through the batch. At the same time, a new
batch enters stage 1 and is processed by that stage. Next, the first batch moves to stage 3, the
second batch to stage 2, and a new batch enters the system at stage 1. Eventually, each batch-
having "met" all of the relevant relations in some stage-exits the system (at the right side of the
figure). The batches in the system move synchronously from stage to stage-no batch is allowed to
"leap over" another in the pipeline.
The pipeline enhances the concurrency of the overall database system. If several stages are
used, the throughput of the system depends only on the time required to run a relation through a
single array, not on the number of relations involved in a transaction.
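
Concretely (illustrative arithmetic, not a figure from the paper): if each stage holds a batch of q queries and a relation takes time T to stream through one array, a full pipeline finishes a batch every T, giving a throughput of roughly q/T queries per unit time, however many relations (stages) each transaction touches.
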
5. HOW THE ARRAY WORKS-A FEW DETAILS
5.1. General Plan
This section contains (a sketch of) one possible high-level design for the implementation of
an array to perform the simple relational transactions. The primitive operations that can be
performed by the array components and the method of connection are specified in some detail.
The systolic array discussed in this section (the SQA) differs from those proposed to do
"hard" relational operations [9] in several ways. Primarily, the SQA is programmable: the specific

operations it is to perform are specified just before they are executed, rather than when the array is
built (as in the case of the "hard" operation arrays). The users are allowed to specify a wide range
of possible queries to be loaded into the array, in contrast to the operation of the intersection array,
for example, which is completely specified when it is built. Since this programmability makes the
structure of the SQA fairly complex, only relatively simple operations are allowed as the atomic
programmable operations of the array. This makes the design of the physical array somewhat
simpler.
5.2. Specifications
The functionality of the array derives from a combination of: (a) the operations performed
by each individual systolic component at each time step, and (b) the method of interconnection
between the components. One of the systolic processors used in the SQA (with the connections
leading to other processors) is shown in Figure 6.
5.2.1. Connections Between Processors
As the figure shows, each processor has six connections to other processors, labelled as
described in Table 1. In the present design, data only flows "down" through the systolic array.
Logical results only flow "down" and to the "right." The general execution pattern of the systolic
processors (SPs) is as follows. For each time-step: (a) the inputs are read from the input lines (D_t,
b_t, b_l); (b) they are manipulated according to the operations programmed for the systolic processor
(described below); (c) and the outputs are put on the output lines (D_b, b_b, b_r). The connections
D_t and D_b are used to pass data (elements of the relation) between processors. The bits b_l and b_r
and the bits b_t and b_b are used to pass logical results from one processor to another from left to
right, and top to bottom, respectively. These logical results allow the construction of complex
queries from single systolic processor building blocks, and are used to convey conditional results
between processors.
The connections described are each enhanced with a single bit of memory (see Figure 6).
These memory bits have three modes: TRUE, FALSE, and BYPASS. In BYPASS mode, the
memory bit is transparent, and data is simply passed through the bit as if it weren't there. If the bit
is not set to BYPASS, then it can be set to TRUE or FALSE by its input line. Then its output line
continues to pass that value until it is changed or the bit is again set to BYPASS. This bit is useful
in handling problems with time-delay, and is intended primarily for use between stages of the
pipeline. This arrangement allows the division of a chip structure into stages to change

    D_t    Input     Data entering from the top.
    D_b    Output    Data leaving at the bottom.
    b_t    Input     Bit entering from the top.
    b_b    Output    Bit leaving at the bottom.
    b_l    Input     Bit entering from the left.
    b_r    Output    Bit leaving at the right.

Figure 6: A single systolic processor: D_t and D_b are data connections;
b_t, b_l, b_r, and b_b are logical one-bit connections; M's are
single-bit memories on the connections.
Table 1: Data entering and leaving each processor. The directions top,
bottom, left, and right refer to Figure 6.

dynamically. The divisions occur where BYPASS is not set. At these points, results are held
between passes, and are sent on to the next stage during the next pass for that stage.
5.2.2. Operations
The complete operation that can be executed by a systolic processor is specified by the syntax
"<Input bit exp> = <Primitive op> = <Output bits>." <Primitive op> is either a <Predicate
primitive> (which is a comparison with a constant) or an <Update primitive> (an arithmetic
operation). (Table 2 lists all of the primitive operations.) The predicate operations are used
primarily to implement the conditional parts of queries on the database, like those described for
the bank database above. The update operations are used to implement the changes to the
database. Only one operation (either predicate or update) may be loaded into a single systolic
element.
The <Input bit exp> is a logical expression using the signals from the processors to the top
and left, and is allowed to be any of the sixteen possible Boolean combinations of the two input bits
b_t and b_l; if unspecified, it defaults to TRUE. <Output bits> is a list of (up to two) bits to be output
to the processors to the right and bottom, along with the BYPASS setting for the relevant memory
bits. Both elements in the list have one of the four following forms: "b", "not-b", "set(b)",
"clear(b)", where the b is either b_r or b_b. The "b" and "not-b" forms pass TRUE or FALSE on that
output line, respectively, and set the memory bit on that line to BYPASS mode. The "set(b)" and
"clear(b)" forms set the associated memory bit to TRUE or FALSE, respectively (and turn off
BYPASS).
5.2.3. The Internal Systolic Algorithm
We use the expression internal systolic algorithm to refer to the algorithm executed by each
individual systolic processor at each time-step. For the SQA this "hardware algorithm" is shown in
Figure 7. Briefly, it inputs the data, executes either a predicate or an update, and puts the output onto
the output lines.
Predicate Operations
    Symbol   Operation (k is a constant)
    = k?     Does the data equal k?
    < k?     Is the data less than k?
    > k?     Is the data greater than k?
    ≤ k?     Is the data at most k?
    ≥ k?     Is the data at least k?
    ≠ k?     Is the data different from k?
    T        Constant TRUE.
    F        Constant FALSE.

Update Operations
    Symbol   Operation (k is a constant)
    + k      Add k to the data.
    - k      Subtract k from the data.
    * k      Multiply the data by k.
    ÷ k      Divide the data by k.

Table 2: Primitive operations.

    /* Get data and input bits: */
    Read Dt, bt, and bl from the input lines;
    tval ← FALSE;        /* Temporary variable */
    Db′ ← Dt;            /* Temporary variable */
    if <Input bit expr> (Default: TRUE) then
        if <Operator> (Default: null) is a predicate then
            /* Evaluate the predicate: */
            tval ← execute <Operator> on the data
        else
            Db′ ← execute update <Operator> on Db′;
            tval ← TRUE;
    <Output bits> (Default: none) ← tval;
    Db ← Db′

Figure 7: The internal systolic algorithm for the SQA processors.
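
As an illustration only (not from the paper), the internal systolic algorithm of Figure 7 can be rendered in Python roughly as follows; the representation of an operator as a tagged function is an assumption:

    # One time-step of a systolic processor, following Figure 7. An operator
    # is either ("pred", f), a predicate on the data, or ("update", f), an
    # arithmetic update; input_bit_expr is a Boolean function of (bt, bl).
    def sp_step(Dt, bt, bl, input_bit_expr=None, operator=None):
        tval = False                     # temporary logical result
        D = Dt                           # data passes through by default
        enabled = True if input_bit_expr is None else input_bit_expr(bt, bl)
        if enabled and operator is not None:
            kind, f = operator
            if kind == "pred":
                tval = f(D)              # evaluate the predicate on the data
            else:
                D = f(D)                 # apply the update to the data
                tval = True
        return D, tval                   # Db and the value for <Output bits>

    # Example: a processor checking "=AcctNum?" (1234 is a made-up constant):
    Db, br = sp_step(Dt=1234, bt=True, bl=True,
                     operator=("pred", lambda d: d == 1234))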
Find ACCOUNT(Acct_Num) in Database;
if Not_Found | Acct_Bal+Delta<0 then Put "Negative Response";
else do;
    Acct_Bal = Acct_Bal + Delta;
    CASH_DRAWER(Teller) = CASH_DRAWER(Teller) + Delta;
    BRANCH_BAL(Branch) = BRANCH_BAL(Branch) + Delta
end
Figure 8: A subset of the DEBIT-CREDIT transaction in Figure 1.

5.2.4. An Example
We illustrate the function of the SQA with an example: the implementation of a subset of
the DEBIT-CREDIT transaction shown in Figure 8. This is implemented on the two rows of the
six-column SQA shown in Figure 9. The "program" for the array is given in Figure 10. The
picture is broken up (by dashed lines) into the three stages used for the three relations. The stages
execute sequentially and are part of the pipeline. The first stage handles the ACCOUNT relation,
the second CASH_DRAWER, and the third BRANCH_BALANCE. In the first stage, processor (1)
checks the Account_Number. If it matches, processor (2) checks to see if the Account_Balance is
large enough. If so, processor (4) modifies the Account_Balance. Processor (3) does nothing. The
other stages work similarly. Notice that two rows of processors are used for this transaction, and
that they work together by means of the communication wires (shown more clearly in Figure 6);
these wires communicate the logical results from the tests in processors (1) and (2). Processor (4)
performs the update as the relation streams by.

[Figure 9 graphics omitted: a piece of the SQA, divided by dashed lines into STAGE 1, STAGE 2 and STAGE 3.]

Figure 9: A piece of the SQA to implement the transaction subset in Figure 8.
The stars (*) indicate memory bits that are not set to BYPASS.

6. CONSISTENCY
Proofs of database consistency that have appeared in the literature have instead proven the
somewhat stronger condition of serializability [5,12,18] (with a few exceptions, for example [10]).
(A concurrent schedule for a set of transactions is considered to be serializable if its results are
equivalent to those of a schedule in which the transactions are run sequentially.) If only syntactic
information is known about a set of transactions, then a schedule from a serializing scheduler is the
best possible. (This is one example of the tradeoff between performance and information in
consistent schedulers [12].) Most of these proofs have concerned conventional database systems, in
which locks have been used to guarantee serializability by isolating sections of the database for
exclusive access by a single transaction.
In the system considered in this paper, no locks are used. Instead we rely on the great
regularity of systolic arrays to guarantee the serializability of the transactions executed by those
arrays. These serializability results are summarized below, and concern the SQA and pipeline
discussed above.
Definition 1: A transaction is a mapping F: (DB, I) → (DB′, O), where DB and DB′ are
database states: sets of relations, together with the "shapes" of those relations and the values of
their elements; I is the input to F and O is the output from F.

(1)  null = =AcctNum? = br
(2)  bl = ≥ -Delta? = bb, set(br)
(3)  null
(4)  bl = + Delta = null
(5)  bl = =Teller? = br
(6)  bl = + Delta = set(br)
(7)  null
(8)  null
(9)  bl = =Branch? = br
(10) bl = + Delta = null
(11) null
(12) null

Figure 10: A program for the array in Figure 9 to implement the transaction subset
in Figure 8. The notation "null" indicates that a field or statement is empty.
(Restricting F produces interesting classes of transactions, including, for example, add-relation
transactions (more relations in DB′ than in DB), update-tuples-only (only element values
have been changed), etc. [13])
Analogously to the usual definition (but from a different point of view) we define
Definition 2: A transaction F is serially decomposable into transactions F1 and F2 if it equals
their composition: F = F2 ∘ F1. The decomposition is said to be syntactic if it can be done
regardless of the semantics of (the operations performed by) F, F1, and F2.
This model produces the following results concerning the systolic designs presented above.
Theorem 3: The transaction F_SQA executed by the SQA is syntactically serially decomposable
into individual queries corresponding to the individual rows of the SQA: F_SQA = f_n ∘ f_{n-1} ∘ ... ∘
f_2 ∘ f_1. Some of these can be re-combined so that more than one row can be used for a single
(complicated) query.
Proof: The proof is by induction on the rows of the SQA. An SQA with one row is trivially
decomposable into f_1. Consider an SQA with k rows (k > 1). By the inductive assumption, the first
k-1 rows (considered as an SQA) are decomposable into F_{k-1} = f_{k-1} ∘ f_{k-2} ∘ ... ∘ f_2 ∘ f_1. The data
leaving the first k-1 rows flows into the kth row and is processed there after the first k-1 rows have
completed their work on it. The kth row performs the function f_k. Hence the SQA performs F_SQA
= F_k = f_k ∘ F_{k-1}.
Theorem 4: The transaction F_PIPE executed by several stages (each an SQA) on a single batch
is (syntactically) equivalent to the serial execution of the transactions obtained by taking single
rows across the width of the entire pipeline, provided each stage operates on a different relation
(conforming to the usual structure of simple queries). This is the "natural" desired decomposition
into individual transactions, each operating on the same set of several relations.
Proof: The transaction F_PIPE = F_m ∘ ... ∘ F_1, where each F_i is a stage (SQA). Hence, by
Theorem 3, F_PIPE = f_{m,n} ∘ f_{m,n-1} ∘ ... ∘ f_{m,1} ∘ f_{m-1,n} ∘ f_{m-1,n-1} ∘ ... ∘ f_{m-1,1} ∘ ... ∘ f_{1,n} ∘ f_{1,n-1} ∘
... ∘ f_{1,1}. Then by Lemma 5 (below) we can reorder these: F_PIPE = f_{m,n} ∘ f_{m-1,n} ∘ ... ∘ f_{1,n} ∘
f_{m,n-1} ∘ f_{m-1,n-1} ∘ ... ∘ f_{1,n-1} ∘ ... ∘ f_{m,1} ∘ f_{m-1,1} ∘ ... ∘ f_{1,1}. Then, since each row of the pipe
(across all stages) is G_i = f_{m,i} ∘ f_{m-1,i} ∘ ... ∘ f_{1,i}, we have F_PIPE = G_n ∘ ... ∘ G_1.
The f's can be switched above because any f's to be switched operate on distinct relations;
this satisfies the condition of Lemma 5.
Lemma 5: For two transactions F1 and F2: F2 ∘ F1 = F1 ∘ F2, provided that F1 and F2
operate on independent portions both of the database state and of the I/O information (the
information that accompanies the state as either input or output; in a composition, the output of
one function becomes the input of the next).

Proof: By assumption, the state and I/O information are independent for the two
transactions in question. Hence, we simply "pretend" that each F_i locks the appropriate section of
the database and I/O information. This guarantees that the transactions can operate in either order
(or even concurrently) and still achieve the same results. The lemma easily generalizes (by
associativity of "∘") to more than two transactions.
Theorem 6: The transaction schedule resulting from passing more than one batch through the
SQA pipeline is syntactically serially decomposable (into simple queries) provided that no relation
is used more than once in a batch and that the batches are pipelined so that the batches themselves
are syntactically serializable (i.e., that the directed graph of the "must-precede-in-serialization"
relation on transactions, induced by the order in which they access relations, is acyclic).
Proof: This is an easy consequence of Theorem 4.
We assume that in running the pipeline, batches are started as soon as they arrive (or in the
next available time slot), provided that this will maintain serializability. The task of maintaining
syntactic serializability (on-the-fly) is delegated to a scheduler. This scheduler is efficient (runs in
polynomial time) for our SQA and pipeline, for two different senses of serializability, as shown by
the theorems below.
(Papadimitriou [15] showed that the problem of testing membership in SR, the set of
serializable transaction histories, is NP-complete. However, there are efficiently recognizable
subsets of SR. The set of transactions we consider ("SQATRANS") is more general than those in
SR, in that the transactions are multistep: in general, they consist of more than one read-then-write
step. SQATRANS has strong analogies to Papadimitriou's DSR set, which has specific rules for
serialization, making it "easy" to recognize. Hence it is not surprising that polynomial
recognition (or scheduling) algorithms exist for SQATRANS.)
Theorem 7: An efficient (polynomial-time) syntactic serializability scheduler exists for the
SQA-pipe.
Proof: The sketch of a simple polynomial-time scheduling algorithm is given in algorithm
Schedule 1 (Figure 11). The algorithm takes time O(q(w⁴ + wp²)) = O(qw⁴), where w is the
number of batches scheduled, p (≤ w) is the number of stages in the pipeline, and q is the length of
the queue of batches waiting to be run (which depends on the arrival rate). The algorithm assumes
that we maintain a w×w matrix, M, of precedence relations: batch i must precede batch j in a
serialization of the schedule if and only if m_{ij} ≠ 0. The algorithm is simplified, and its running time
could be improved fairly easily; for example, our approximation assumes that computing the
transitive closure of a matrix of size w takes time O(w⁴).
Definition 8: We say that a schedule is serializable in the semi-strict sense (following the spirit
of [15]) if the order in which the transactions begin execution is the same as that in the equivalent
serial schedule.
Theorem 9: An efficient (polynomial-time) syntactic semi-strict serializability scheduler exists
for the SQA-pipe.
Proof: The scheduling algorithm (algorithm Schedule 2; Figure 12) is very simple and runs in
time O(qwp³). Notice that this algorithm could also have been used for the proof of Theorem 7. Its
running time is actually somewhat smaller, but algorithm Schedule 1 may produce better schedules
(in terms of utilization of the pipeline) if the condition of semi-strictness is not required.

7. CONCLUSIONS
In this paper, we have examined a high-level design for a systolic processor array to handle
large batches of simple relational queries with high throughput. This array has the desirable
property of syntactic serializability, and proofs of this are directly derivable from the systolic
structure of its design. Furthermore, serialization can be accomplished with an efficient scheduler,
making the array feasible for use in database machine systems. Less restrictive serializability
constraints are possible, but determining these must involve examining the semantics of the queries
to be executed [12]. Such examinations are currently in progress, and should also produce a better
characterization of the power of the SQA, by establishing a precise mapping between the function
of the systolic array and the class of transactions that it may execute on a database. This mapping
was begun in the characterization of serializability reported herein, and serves to show that systolic
structure for VLSI maps readily into other interesting problem domains. Hence, we believe that
the design presented above is strong evidence for the practicality of using systolic designs in high-performance
database machines, as well as in other applications.
for each scheduling cycle (w of them):
    for each batch B in the queue (q of them):
        /* add B to matrix M to form test matrix M′ */
        for each batch Br in M′ (w of them):
            /* find precedence of B and Br */
            for each stage in B (at most p):
                for each stage in Br (at most p):
                    compare the stages
        /* see if there is a cycle in M′ */
        compute transitive closure of M′ (time O(w⁴))
    schedule some batch; save its M′ as the new M

Figure 11: Algorithm Schedule 1


for each scheduling cycle (w of them):
    for each batch Bq in the queue (q of them):
        /* try scheduling Bq */
        for each running batch Br (p of them):
            /* find precedence of Bq and Br */
            for each stage in Bq (at most p):
                for each stage in Br (at most p):
                    compare the stages
            if Bq can precede Br then reject Bq
    schedule some batch Bq that was not rejected

Figure 12: Algorithm Schedule 2
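
Algorithm Schedule 2 lends itself to a compact illustration. The Python sketch below is ours and rests on two assumptions not in the paper: a batch is represented as the list of relation names its stages touch, and "can precede" is approximated by the conservative test that two batches touch a common relation:

    # Sketch of one cycle of Algorithm Schedule 2 (semi-strict sense).
    def can_precede(a, b):
        # compare the stages of a and b pairwise (O(p^2) comparisons)
        return any(ra == rb for ra in a for rb in b)

    def schedule_cycle(queue, running):
        """Pick a queued batch that need not precede any running batch."""
        for bq in queue:
            if not any(can_precede(bq, br) for br in running):
                queue.remove(bq)        # schedule bq in this time slot
                running.append(bq)
                return bq
        return None                     # nothing schedulable this cycle

    queue = [["ACCOUNT", "CASH_DRAWER"], ["BRANCH_BAL", "ACCOUNT"]]
    running = [["BRANCH_BAL"]]
    print(schedule_cycle(queue, running))   # ['ACCOUNT', 'CASH_DRAWER']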

References
[1] Banerjee, J., Hsiao, D.K., and Kannan, K. DBC - A Database Computer for Very Large Databases. IEEE Transactions on Computers C-28(6):414-429, June, 1979.
[2] Codd, E.F. A Relational Model of Data for Large Shared Data Banks. Communications of the ACM 13(6):377-387, June, 1970.
[3] Date, C.J. An Introduction to Database Systems. Addison-Wesley, Reading, Mass., 1977.
[4] DeWitt, D.J. DIRECT - A Multiprocessor Organization for Supporting Relational Database Management Systems. IEEE Transactions on Computers C-28(6):395-406, June, 1979.
[5] Eswaran, K.P., Gray, J.N., Lorie, R.A. and Traiger, I.L. The Notions of Consistency and Predicate Locks in a Database System. Communications of the ACM 19(11):624-633, November, 1976.
[6] Gray, J. Notes on Data Base Operating Systems. In Bayer, R., Graham, R.M. and Seegmuller, G. (editors), Lecture Notes in Computer Science 60: Operating Systems, pages 393-481. Springer-Verlag, Berlin, Germany, February, 1978.
[7] Gray, J. Private Communication, 1980.
[8] Kung, H.T. Let's Design Algorithms for VLSI Systems. In Proc. Conference on Very Large Scale Integration: Architecture, Design, Fabrication, pages 65-90. California Institute of Technology, January, 1979.
[9] Kung, H.T. and Lehman, P.L. Systolic (VLSI) Arrays for Relational Database Operations. In Proc. ACM-SIGMOD 1980 International Conference on Management of Data, pages 105-116. ACM, May, 1980.
[10] Kung, H.T. and Lehman, P.L. Concurrent Manipulation of Binary Search Trees. ACM Transactions on Database Systems 5(3):354-382, September, 1980.
[11] Kung, H.T. and Leiserson, C.E. Systolic Arrays (for VLSI). In Duff, I.S. and Stewart, G.W. (editors), Sparse Matrix Proceedings 1978, pages 256-282. Society for Industrial and Applied Mathematics, 1979. A slightly different version appears in Introduction to VLSI Systems by C.A. Mead and L.A. Conway, Addison-Wesley, 1980, Section 8.3.
[12] Kung, H.T. and Papadimitriou, C.H. An Optimality Theory of Concurrency Control for Databases. In Proc. ACM SIGMOD 1979 International Conference on Management of Data, pages 116-126. ACM, May, 1979.
[13] Lehman, P.L. The Theory and Design of Systolic Database Machines. In preparation.
[14] Lorie, R. Private Communication, 1980.
[15] Papadimitriou, C.H. The Serializability of Concurrent Updates. Journal of the ACM 26(4):631-653, October, 1979.
[16] Schuster, S.A., Nguyen, H.B., Ozkarahan, E.A., and Smith, K.C. RAP.2 - An Associative Processor for Databases and Its Applications. IEEE Transactions on Computers C-28(6):446-458, June, 1979.
[17] Su, S.Y.W., Nguyen, L.H., Emam, A., and Lipovski, G.J. The Architectural Features and Implementation Techniques of the Multicell CASSM. IEEE Transactions on Computers C-28(6):430-445, 1979.
[18] Yannakakis, M., Papadimitriou, C.H. and Kung, H.T. Locking Policies: Safety and Freedom from Deadlock. In Proceedings of Twentieth Annual Symposium on Foundations of Computer Science, pages 286-297. IEEE, 1979.
[19] Yao, S.B. and Wah, B.W. DIALOG - A Distributed Processor Organization for Database Machine. Proc. 1980 National Computer Conf. 49:243-253, May, 1980.
A Systolic Data Structure Chip for
Connectivity Problems

Carla Savage
North Carolina State University
Computer Science Department
Raleigh, North Carolina 27650

1. INTRODUCTION - SYSTOLIC DATA STRUCTURE CHIPS

In this paper we present an example of a design for a "data structure
chip" and suggest how it can be used for problem solving in a
digital system. In particular, we describe a systolic structure which
can be used, for a graph, to find the connected components, a spanning
tree, or, when used in conjunction with a systolic priority queue, a
minimum spanning tree.
A data structure can be viewed as a structured collection of objects,
together with operations to be performed on the collection. A
STACK, for example, is a data structure whose structured collection of
objects is an ordered sequence of elements, one end of the sequence
designated as top, with the operations of PUSH (add an element to the
top), POP (delete the top element), TOP (read the top element) and
EMPTY (test for empty). The operations can be classified as "updates",
which alter the structured collection of objects (PUSH, POP) and "retrieves",
which only return information about the collection (TOP,
EMPTY). By a "data structure chip" we mean a device capable of storing
the structured collection of objects of a data structure and performing
the specified set of operations. A performance goal for such a chip
would be to have constant time response for each retrieve operation and
the ability to request either a retrieve or update within constant time
after any update operation. A design for a systolic STACK chip which
satisfies these goals has been proposed [3].
We subscribe to the philosophy of Foster and Kung [1] and Kung and
Leiserson [5] that a "good" chip implementation of a data structure
should be built up from a collection of simple cells with limited inter-cell
communication, regular data flow, and extensive pipelining and
multiprocessing. An implementation with these properties has been
termed "systolic". Systolic implementations for many algorithms have
been studied [1,2,5,6]. The systolic priority queue proposed by
Leiserson [8] is an example of a systolic data structure chip.
In the same way that data structures are used in algorithm design,
we feel that data structure chips can be useful in system design. In
algorithm design, the relevant data structures are identified and implemented,
and the role of the algorithm is to describe when and which
data structure operations should be performed. In the design of special
purpose systems, image processing, for example, the structure of the
data to be stored and the operations to be performed on that data would
dictate what sort of data structure chips would be appropriate hardware
to incorporate into the system, under the control of a host machine.
In [7] Kung and Lehman discuss how special purpose chips for performing
data base operations could be incorporated into a data base system.

2. A CHIP DESIGN FOR CONNECTIVITY PROBLEMS

One approach to finding connected components of an undirected
graph G is to maintain a collection of disjoint sets of vertices of G,
as the edges of G are examined sequentially. Initially, each vertex is
in a set by itself. When an edge e is scanned which joins vertices in
distinct sets, those sets are combined. After all edges are scanned,
each set is the vertex set of a connected component of G. In this context,
the operation of combining two sets is called a "union" and determining
to which set an element belongs is called a "find". We describe
a chip which maintains the collection of disjoint sets and performs the
operation UF(e): given an edge e = (u, v), determine whether u and v
are in the same set and, if not, combine the two sets. Although a single
UF operation will require time O(n), where n is the number of vertices,
a sequence of m UF operations can be pipelined to require total
time O(m+n).
To implement this data structure, we propose an array of n+1 cells
where cells 1, ..., n correspond to vertices 1, ..., n. To perform an
operation UF(e), edge e enters the left end of the array and, with each
clock pulse, travels right one cell until it reaches the right end and
begins to travel back left. Each cell contains a program and the odd
and even numbered cells execute on alternate clock pulses. A new UF
operation may begin after a delay of two pulses. Each cell i contains
a value COMP(i) indicating the number of the component containing vertex i.
Initially COMP(i) = i. As edge e = (u,v) travels right, it must
determine the component numbers of u and v, which values it carries
along as cu and cv. Initially, cu and cv are 0, but when e enters cell
u, cu becomes COMP(u), and similarly for v and cv. The values cu and cv
may get updated (by a leftmoving edge) as e moves right, so that by the
time e reaches the right end cell, cu and cv are guaranteed to be the
components which vertices u and v will be in after all UF operations
preceding UF(e) have completed. At this time, the values cmin and
cmax are determined as the min and max, respectively, of cu and cv.
Edge e then proceeds left to combine components cu and cv into a single
component, labeled cmin. Any cell i with COMP(i) = cmax encountered by
e has COMP(i) changed to cmin. As e moves left, it may encounter a
rightmoving edge e' = (u', v') with cu' equal to cmax. That is, e'
"thinks" u' is in component cu' = cmax, but after UF(e) has completed,
u' will be in component cmin. In this case, the value of cu' is changed
to cmin. The case is analogous for cv'.
Each cell i, 1 ≤ i ≤ n, needs only enough registers to store i,
COMP(i), the values (u, v), cu, cv for a rightmoving edge and the values
(u, v), cmin, cmax for a leftmoving edge. It needs only the logic
necessary to execute the following program:

cell i
    registers: i, COMP(i)
    rightmoving edge e_k: (u_k, v_k), cu_k, cv_k enters from the left;
    leftmoving edge e_j: (u_j, v_j), cmin_j, cmax_j enters from the right.

BEGIN
    IF COMP(i) = cmax_j
        THEN COMP(i) ← cmin_j;
    IF i = u_k
        THEN cu_k ← COMP(i)
    ELSE IF i = v_k
        THEN cv_k ← COMP(i);
    IF cu_k = cmax_j
        THEN cu_k ← cmin_j;
    IF cv_k = cmax_j
        THEN cv_k ← cmin_j
END

Cell n+1 needs only to hold values (u,v), cu, cv, cmin, and cmax
for an edge and the logic to execute:

cell n+1
    rightmoving edge e: (u,v), cu, cv enters from the left;
    leftmoving edge e: (u,v), cmin, cmax leaves to the left.

BEGIN
    IF cu < cv
        THEN BEGIN
            cmin ← cu;
            cmax ← cv
        END
        ELSE BEGIN
            cmin ← cv;
            cmax ← cu
        END
END

Initially, we assume all registers (u,v), cu, cv, cmin, and cmax
hold the value 0. Note the array works correctly: if the operation
UF(e') is inserted before UF(e) and if UF(e') needs to update COMP(u)
for an endpoint u of e, then when e enters cell u for the first time
to read COMP(u), either e' has already passed through cell u on its
trip left and changed COMP(u), or e and e' will meet in some cell as e
moves right, and at that point e can update its cu value to reflect the
change to COMP(u) indicated by e'.
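
For concreteness, here is a small Python sketch (not from the paper) of the array's net effect on the COMP values; it processes edges one at a time, so the pipelined overlap and the patching of cu and cv between crossing edges are intentionally omitted:

    # Sequential model of the UF array's net effect. comp[i] is COMP(i),
    # the component label of vertex i; index 0 is unused so that cells
    # 1..n correspond to vertices 1..n as in the text.
    def make_array(n):
        return list(range(n + 1))

    def UF(comp, e):
        u, v = e
        cu, cv = comp[u], comp[v]              # gathered on the trip right
        cmin, cmax = min(cu, cv), max(cu, cv)  # computed in cell n+1
        if cmin != cmax:                       # distinct sets: combine them
            for i in range(1, len(comp)):      # relabeling on the trip left
                if comp[i] == cmax:
                    comp[i] = cmin
        return cmin != cmax                    # TRUE iff the sets differed

    comp = make_array(5)
    tree = [e for e in [(1, 2), (2, 3), (1, 3), (4, 5)] if UF(comp, e)]
    print(tree)   # [(1, 2), (2, 3), (4, 5)]; (1, 3) closed a cycle
    print(comp)   # [0, 1, 1, 1, 4, 4]: components {1,2,3} and {4,5}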
A spanning tree of a connected graph, G, is a connected, acyclic
subgraph of G with the same vertex set as G. The connected component
procedure can be modified to find the edge set of a spanning tree, T,
in the following way. As the edges of G are examined sequentially,
when an edge e is scanned which joins vertices in distinct sets, combine
those sets and add e to T. Note that since G is connected, exactly
n-1 edges will be added to T. Since no edge which joins vertices
in the same component is ever added to T, T contains no cycles.
The UF chip we have described can be used to find a spanning tree
as follows. When an edge e = (u,v) enters the right end cell, if
cu ≠ cv, then u and v are in different components (after processing
all UF instructions preceding UF(e)), so e is chosen to be an edge of
T. In this case, cmin ≠ cmax and since these values never change as
e moves left, spanning tree edges can be recognized as they leave the
left end cell as those edges with cmin ≠ cmax. Note that to retrieve
this information about an edge takes time O(n). However, a sequence
of m UF operations, including retrievals, can be processed in time
O(m+n). Thus if m > n, we average constant time per operation.
A minimum spanning tree of a weighted graph, G, is a spanning
tree T such that the sum of the weights of the edges in T is minimum
among all spanning trees of G. It can be shown that if the edges of
G are examined in nondecreasing order of weight, the spanning tree
procedure described above will produce a minimum spanning tree. The
UF chip could be used to find a minimum spanning tree if there were
a way to supply the edges in nondecreasing order of weight. This can
be done by preprocessing the edges using the systolic priority queue
described in [8]. The systolic priority queue is a data structure
chip which maintains a collection of weighted items and is capable of
performing the operations of insertion (constant time) and retrieving
the minimum weight item (constant time). Thus, to find a minimum
spanning tree, the m edges are inserted into the priority queue and
then repeatedly the minimum weight edge is removed and inserted into
the UF chip.
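
Reusing make_array and UF from the sketch above, this flow looks as follows (Python's heapq standing in, illustratively, for the systolic priority queue of [8]):

    import heapq

    # Feed edges to the UF array in nondecreasing order of weight.
    def minimum_spanning_tree(n, weighted_edges):
        pq = list(weighted_edges)          # (weight, u, v) triples
        heapq.heapify(pq)                  # insert the m edges
        comp, tree = make_array(n), []
        while pq:
            w, u, v = heapq.heappop(pq)    # repeatedly remove the min edge
            if UF(comp, (u, v)):           # it joins two distinct components
                tree.append((u, v, w))
        return tree

    print(minimum_spanning_tree(4, [(3, 1, 2), (1, 2, 3), (2, 1, 3), (5, 3, 4)]))
    # [(2, 3, 1), (1, 3, 2), (3, 4, 5)]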
In an area such as image processing where massive amounts of data
are involved, special purpose hardware can be cost effective. The UF
chip described in this paper could be used for storing image data as
well as for solving connectivity problems which arise in region detection
and minimum spanning tree problems which arise in clustering.

3. RELATED WORK

The problem of designing a chip to solve connectivity problems
has been considered in [2] and [4]. In [2], Guibas et al. describe
a systolic array for computing the transitive closure of an n×n matrix in
time O(n) using an n×n array of cells with 2n input and 2n output
ports. Hambrusch shows in [4] how to find connected components in
time O(n^{3/2}) on a √n × √n array of cells using √n input ports. The UF
chip described in this paper has area × time O(nm), which compares
favorably with the other two models for sparse graphs and has the advantage
of very limited (constant) I/O requirements.
A conventional sequential machine can be programmed to solve the
connected component problem in O(n+m) operations. Thus even with
pipelining, we have not asymptotically improved the number of operations.
However, we expect that the time to compute connected components
in hardware should be substantially less than in software,
especially for problems in which a large amount of data is involved,
when these chips can be used for storage as well as for processing.

ACKNOWLEDGEMENT

I would like to thank Leo Guibas for helpful discussions on this
material.

REFERENCES

1. M.J. Foster and H.T. Kung, The design of special-purpose VLSI chips, Computer, January, 1980.

2. L.J. Guibas, H.T. Kung, and C.D. Thompson, Direct VLSI implementation of combinatorial algorithms, Proc. Caltech Conf. on VLSI, January, 1979.

3. Leo J. Guibas and Frank M. Liang, Systolic stacks, queues, and counters, manuscript.

4. Susanne E. Hambrusch, VLSI algorithms for the connected component problem, Technical Report CS-81-9, The Pennsylvania State University, Computer Science Department, March 1981.

5. H.T. Kung and C.E. Leiserson, Systolic arrays (for VLSI), in Sparse Matrix Proceedings 1978, SIAM, Philadelphia, 1979, I.S. Duff and G.W. Stewart, eds.

6. H.T. Kung, Let's design algorithms for VLSI systems, Proceedings Caltech Conference on VLSI, January 1979.

7. H.T. Kung and Philip L. Lehman, Systolic (VLSI) arrays for relational database operations, Technical Report CMU-CS-80-114, Carnegie-Mellon University, Computer Science Department, October 1979.

8. Charles E. Leiserson, Systolic priority queues, Proceedings Caltech Conference on VLSI, January, 1979.
Fixed-Point High-Speed Parallel Multipliers
in VLSI

Peter Reusens, Walter H. Ku*, and Yuhai Mao


Cornell University
School of Electrical Engineering
Ithaca, New York 14853

ABSTRACT
The paper presents techniques to increase the speed of fixed-point
parallel multipliers and reduce the multiplier chip size for VLSI realizations.
It is shown that a higher order (octal) version of
Booth's Algorithm will lead to significant improvements in speed,
coupled with a decrease of chip area and power consumption, as compared
to the modified (quaternary) version of Booth's Algorithm presently
used in or proposed for monolithic multipliers. In addition, further
speed improvements can be obtained by using Wallace trees or optimal
Dadda types of realizations.
The optimal Dadda realizations with a minimal number of adders can be
laid out in a regular, rectangular array interleaved with partial product
generation. The resulting regular structure is suitable for VLSI
implementations. The more complex interconnection wiring which is
needed is shown to be feasible in technologies with at least 3 layers
of interconnections. Layout, interconnection and speed considerations
for the proposed high-speed VLSI parallel multiplier configurations
have been studied.
KEYWORDS
Parallel Multiplier, High Speed Multiplier, Wallace Tree Multiplication
Scheme, Booth's Algorithm, Modified Booth's Algorithm.
1. REVIEW OF MULTIPLICATION ALGORITHMS
1.1 REVIEW OF SEQUENTIAL "ADD AND SHIFT" MULTIPLICATION SCHEMES
Two unsigned numbers can be multiplied by generating partial products
one at a time and by adding each partial product to the shifted
sum of all previous partial results. The signed case can be done similarly,
except for some correction terms. However, a very elegant scheme
was invented by Booth [1,2]. Designed for signed and unsigned multiplications,
this algorithm generates partial products that are +1, 0 or -1
times the multiplicand. Note that the subtraction doesn't pose any particular
difficulty. A complete discussion can be found in, e.g., [3].

*National Research and Resource Facility for Submicron Structures.



The only speed-up of these sequential schemes can be the generation of
fewer partial products. Instead of generating the partial products by
scanning through the multiplier one bit at a time, one could take two
bits at a time. However, this requires the generation of partial products
that are the multiplicand times 0, 1, 2 and 3, which is considered
impractical because of the times-3 multiplication. If however one uses
this quaternary scheme to generate partial products in Booth's algorithm,
then the necessary partial products are 0, ±1, ±2 times the multiplicand,
which can be realized easily and which makes the Modified Booth's
Algorithm a very attractive scheme. It was used in the IBM 360/91 [4]
and it is the basis for more recently designed parallel multipliers [5].
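
The quaternary recoding itself is easy to state in executable form. The sketch below (Python, illustrative only; the helper names are not from the paper) recodes an nbits two's-complement multiplier into radix-4 digits in {-2, ..., +2} and checks the result on a small product:

    # Radix-4 (quaternary) modified Booth recoding: each overlapping bit
    # triple (b[2i+1], b[2i], b[2i-1]), with b[-1] = 0, yields one digit.
    def booth4_digits(y, nbits=8):
        u = y & ((1 << nbits) - 1)                      # two's-complement bits
        b = [0] + [(u >> i) & 1 for i in range(nbits)]  # b[0] pads b_{-1}
        return [b[j] + b[j+1] - 2*b[j+2] for j in range(0, nbits, 2)]

    def booth4_multiply(x, y, nbits=8):
        # only 0, +-1, +-2 times the multiplicand x are ever needed
        return sum(d * x * 4**i for i, d in enumerate(booth4_digits(y, nbits)))

    assert booth4_multiply(93, 57) == 93 * 57
    assert booth4_multiply(93, -57) == 93 * -57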
1.2 REVIEW OF PARALLEL MULTIPLICATION SCHEMES IN LSI
The fastest way to generate the product of two numbers involves
the generation of all partial products at once, followed by the summing
of these "summands" in some optimal way. Parallel multipliers do not
have an optimal throughput per chip (in LSI realizations) or per area
of silicon (for monolithic designs), if compared with bit-serial and/or
pipelined multipliers [15]. However, they offer the shortest possible
latency time and therefore they were investigated thoroughly, with main
efforts concentrated on: 1. Optimal ways to combine the summands to
form a result; 2. Methods to reduce the number of summands.
This led to the proposal of the use of parallel counters to combine
more bits of the partial products and to form the result more quickly
[6,13,14]. Other investigations have proposed optimal schemes to combine
the partial results with some type of counter in the fastest possible
way, e.g. Wallace [7], Dadda [8] and others [9]. The use of
ROM's combined with a logarithmic conversion was proposed by several
researchers, e.g. [10,11].
1.3 THE USE OF PARALLEL COUNTERS IN VLSI PARALLEL MULTIPLICATION
The use of very complex parallel counters is not the optimal way
to realize parallel multipliers in VLSI. The smallest parallel counter
is the Full Adder. This adder is a (3,2)-counter, as it reduces three
input bits of the same order to two output bits. One can make any
counter that takes n inputs to form ⌈log₂(n+1)⌉ outputs. Even counters
that take several input bits of different order were proposed [13].
Any higher type of parallel counter can be realized as a combination of
smaller counters, ultimately with only Full Adders. (As an example we show
in Figure 1 the realization of a (5,5,4)-counter with 6 Full Adders.)
In VLSI realizations the use of higher order counters is counterproductive.
First of all it is meaningless to build bigger counters as
a hierarchical interconnection of smaller ones, as the optimal realization
of the multiplier with composite counters can in no way be better
(i.e. smaller and/or faster) than the optimal use of the constituent
counters. On the other hand the design complexity of bigger counters
that are not composite grows exponentially, as shown in the following
paragraph.
EXAMPLE: To realize a (3,2)-counter (= Full Adder) with a PLA
design, 3 inputs, 2 outputs and a total of 7 product-terms are needed.
A (7,3)-counter can be replaced by four (3,2)-counters. With PLA techniques
this (7,3)-counter needs between 64 and 128 product-terms and 10
input-output lines. As the size of the PLA grows proportionally to the
number of inputs and outputs and to the number of product-terms, the
non-composite (7,3)-counter needs 4 to 8 times more space than the composite
one.
This larger size will cause the non-composite (7,3)-counter to be
slower than the (3,2)-counter or the full adder. But even if the (7,3)-counter
were as fast as a full adder, the theoretically attainable speed-up
would be small. The optimal combination of (7,3)-counters would be
less than twice as fast [= log(7/3)/log(3/2)] as a scheme with full adders
combined in a Wallace tree.
Added to this there are the difficulties associated with the interconnection
of the larger counters on silicon in a monolithic realization.
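
The composition argument can be spelled out in a few lines. The sketch below (ours, not from the paper) builds a (7,3)-counter from exactly four full adders and verifies it exhaustively:

    # A composite (7,3)-counter from four (3,2)-counters (full adders).
    def full_adder(a, b, c):
        s = a ^ b ^ c                        # sum bit (weight 1)
        carry = (a & b) | (a & c) | (b & c)  # carry bit (weight 2)
        return s, carry

    def counter_7_3(bits):
        a, b, c, d, e, f, g = bits
        s1, c1 = full_adder(a, b, c)
        s2, c2 = full_adder(d, e, f)
        w1, c3 = full_adder(s1, s2, g)       # weight-1 output
        w2, w4 = full_adder(c1, c2, c3)      # weight-2 and weight-4 outputs
        return w1, w2, w4

    from itertools import product
    assert all(sum(bits) == w1 + 2*w2 + 4*w4
               for bits in product((0, 1), repeat=7)
               for w1, w2, w4 in [counter_7_3(bits)])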
CONCLUSION: When comparing different approaches for VLSI realizations
of parallel multipliers, logic with (3,2)-counters or Full Adders
is optimal from an area viewpoint, and it is also "asymptotically
optimal" from a timing viewpoint. We mean that with any type of parallel
counter the optimal adding time is indeed O(log N).
2. PARALLEL MULTIPLIERS WITH FULL ADDERS
The ease of realizing parallel multipliers with full adders has made this
the only method presently used to realize monolithic parallel multipliers.
2.1 EXAMPLE OF THE UNSIGNED N×N MULTIPLIER
The earliest parallel scheme to realize parallel multipliers was
the straightforward adding scheme shown in Figure 2 (see also [3]).
For an N×N multiplication, N partial products are generated with a
total of N² bits, summed by an optimal number of N(N-2) Full Adders and
N Half Adders. The timing is less optimal. If every Full Adder generates
a sum bit after a delay of Ds and a carry bit after Dc, and assuming
Ds > Dc, then the maximal delay is:
Delay = (N-1)Ds + (N-1)Dc
The layout is shown in Figure 2 for N=5. Horizontal and vertical
wires carry the multiplier and the multiplicand. Where wires cross, an
AND-gate generates a bit of the partial products. The Full Adders are
located in the meshes of this grid. Carry bits propagate vertically,
sum bits diagonally. The diagonal sum bit propagation can however be
replaced by a horizontal and vertical one, making this layout suitable
for technologies with only 2 layers of interconnections (e.g. NMOS with
2 + 1/2 layers).
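
Arithmetically, the straightforward scheme amounts to the following (an illustrative Python rendering; the hardware of Figure 2 performs the same sum with a grid of AND gates and full adders):

    # The unsigned NxN scheme: N partial products (rows of AND gates),
    # each shifted by its row index and then accumulated.
    def array_multiply(x, y, N=5):
        rows = [((y >> i) & 1) * x for i in range(N)]
        return sum(r << i for i, r in enumerate(rows))

    assert array_multiply(27, 21) == 27 * 21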
A parallel version of Booth's Algorithm can be implemented with a
similar layout scheme.
2.2 THE PARALLEL VERSION OF MODIFIED BOOTH'S ALGORITHM
Booth's Modified Algorithm is commonly proposed to improve the
straightforward multiplier of the previous paragraph. An interconnection
scheme is shown in Figure 3a for an 8×8 parallel multiplier. Every
square stands for a full adder; every character represents an output
of a multiplexer generating the partial products. This multiplier can
execute a signed, unsigned or mixed multiplication. Only 27 adders
plus a 16-stage carry-propagate adder are needed, versus 54 adders plus
a 9-stage carry-propagate adder with the original Booth's algorithm.
The drawback is that a more complex multiplexer is necessary to generate
the partial summands. But as the complexity of the multiplexer is much
smaller than that of the Full Adder, considerable savings are made. The
case of a signed-only multiplication with modified Booth's Algorithm
requires even fewer adders and is shown in Figure 3b for N=8.
2.3. A HIGHER ORDER (OCTAL) VERSION OF BOOTH'S ALGORITHM
In the straightforward parallel multiplier the number of adders is
N² + O(N). In the Modified Booth's Algorithm the total is N²/2 + O(N).
If we realize an octal version of Booth's Algorithm we need only
N²/3 + O(N) full adders. This means less area for full adders and fewer
propagation stages for the results.
An octal version of Booth's Algorithm requires partial products that are
0, ±1, ±2, ±3, ±4 times the multiplicand. The multiplication times 3 poses
problems. For a sequential multiplication the extra hardware and time
needed to precalculate this multiplication times three makes this
approach not useful. But for a straightforward parallel multiplier
this approach is valid. The computation time will be O(N) also, and
the area savings are significant. To show this we did a complete
design of the necessary basic cells in NMOS for both the quaternary
and octal versions of Booth's Algorithm. We based our layouts and
transistor schemes on a paper by Masumoto in Lambda [5], describing
a 16×16 multiplier in NMOS.
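
The octal recoding extends the quaternary sketch given after Section 1.1 above (again illustrative Python, not the paper's circuitry); note that only the times-3 partial product requires a precalculation:

    # Radix-8 (octal) Booth recoding: each overlapping group of four bits
    # yields a digit in {-4, ..., +4}; nbits is assumed divisible by 3.
    def booth8_digits(y, nbits=9):
        u = y & ((1 << nbits) - 1)                      # two's-complement bits
        b = [0] + [(u >> i) & 1 for i in range(nbits)]  # b[0] pads b_{-1}
        return [b[j] + b[j+1] + 2*b[j+2] - 4*b[j+3]
                for j in range(0, nbits, 3)]

    def booth8_multiply(x, y, nbits=9):
        x3 = 3 * x                                      # the precalculated 3x
        table = {0: 0, 1: x, 2: 2 * x, 3: x3, 4: 4 * x}
        return sum((1 if d >= 0 else -1) * table[abs(d)] * 8**i
                   for i, d in enumerate(booth8_digits(y, nbits)))

    assert booth8_multiply(93, 57) == 93 * 57
    assert booth8_multiply(93, -57) == 93 * -57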
3. COMPARISON BETWEEN QUATERNARY AND OCTAL VERSION OF BOOTH'S
ALGORITHM IN NMOS
3.1 INTERCONNECTION SCHEME
The schematic interconnection scheme of the basic cell for the quaternary
version is given in Figure 4a. We have to realize multiplexers
generating the partial products interspaced with layers of full adders.
Power and multiplicand are fed in from above on metal lines. Horizontal
poly-lines are used as select lines in the multiplexers. Where the
poly-lines cross vertical diffusion lines, the pass transistors of the
multiplexer are formed. The distribution of C, C̄, S and S̄ signals
(for double-rail signal distribution) is done by metal lines as well.
The C signal travels diagonally in between adders; the S and S̄ signal
has to be distributed one row down and two columns to the right. So
C and S signals will run on parallel metal lines crossing the multiplexer
diagonally. This brings the carry to its destination. To route
the sum signal correctly, it has to travel horizontally first over one
grid length to the right on poly or diffusion, before crossing the
multiplexer diagonally.
The fact that power supply and multiplicand travel on the same metal
level doesn't allow them to be routed vertically alone. As a result
the power lines will go vertically in between the adders but diagonally
across the multiplexer. The multiplicand signal has to travel vertically.
This forces us to feed this signal in a zigzag way: vertically in between
the adders, horizontally to the left above the multiplexer, diagonally
across the multiplexer (see again Figure 4a). The increased
propagation time doesn't pose any problems. Furthermore, the zigzag
pattern brings the multiplicand to the input of two adjacent multiplexing
units. But every multiplexing unit needs the multiplicand and
its shifted version, precisely what the zigzag accomplishes.
What now in the case of the octal version of Booth's Algorithm?
The advantage is that fewer Full Adders are necessary. This saves area
and shortens the propagation chain of the signals. The following
changes are required: 1. We have to precalculate the multiplicand
times three. 2. We have to distribute this precalculated result.
3. The carry and save signals propagate somewhat differently. 4. The
multiplicand is needed at the inputs of three adjacent multiplexing
units. 5. The multiplexers get more complex: more select lines are
necessary. On the other hand the precalculation of the multiplicand
times three can be done very fast with the use of Manchester adders (see [5,12]).
A schematic layout for the octal case is given in Figure 4b. (Together
with the explanation of Figure 4a, this is self-explanatory.)
The height of the octal multiplexer is precisely twice that of the
quaternary one in our NMOS design.
3.2 COMPARISON OF AREA, SPEED AND POWER
We realized complete NMOS layouts for the full adder, and for the quaternary
and octal versions of the multiplexer. We used the Mead and Conway
design rules [12]. In our design the height Hm of the quaternary
multiplexer is 40λ. The height Ha of the adder is 80λ.
The octal multiplexer was twice as high as the quaternary one.
For the case of a 16×16 multiplier (signed, unsigned and mixed) we got
a total height for all adders and multiplexers of:
8.5Hm + 7Ha = 900λ
in the quaternary case. For the octal case we found a height:
5.5(2Hm) + 4Ha = 760λ
(Note: In the quaternary case 9 multiplexers are needed, of which one
is only half as complex as the others. This explains the 8.5Hm
total height. A similar reasoning explains the 5.5(2Hm) height in
the octal case.)
The area for the adders and the multiplexers in the quaternary case is
18% larger than in the octal case. For the complete chip with path
drivers, line drivers (taking 50% more space in the octal case), the
extra adders and lines for the calculation and distribution of the times-three
partial product, etc., an area saving of 15% is obtained.
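
These height figures can be checked with a line of arithmetic (the numbers below are exactly those quoted above):

    # Height and area-ratio check for the quoted 16x16 design figures.
    Hm, Ha = 40, 80                      # multiplexer and adder heights (lambda)
    quat = 8.5 * Hm + 7 * Ha             # 900 lambda
    octal = 5.5 * (2 * Hm) + 4 * Ha      # 760 lambda
    print(quat, octal, round(quat / octal - 1, 2))   # 900.0 760.0 0.18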
There are time savings also. For a 16×16 case there are 7 levels of
Full Adders versus only 4 in the octal case. If the precalculation
of the times-three partial product is done with fast carry-propagate
adders, if this precalculation is done during the necessary precharge
time, and if the propagation time for the carry in the Manchester
adder is 1/4 of the delay D in the Full Adder, then the total delay is
(7+4)D in the quaternary case and (4+4)D in the octal case. The speed
improvement is about 25%.
We also expect a decrease of the power consumption, as there are fewer
full adders with restoring logic in the octal case. (They are replaced
by the pass transistor logic of the multiplexers.)
The advantages of the octal case can be extended to other technologies.
The Full Adders are indeed a time critical line in the summing process
of the partial products, while the generation of the partial products
in the multiplexers is less troublesome. As a result the reduction by
50% of the number of adders, at the cost of an increased complexity of
the multiplexer, will certainly offer similar significant advantages.
4. WALLACE TREE AND DADDA SCHEMES
The multipliers discussed in Section 3 can be realized as a very regular
interconnection of basic cells. The number of adders is minimal; the
computation time however is O(N) and not optimal.
It can be shown that theoretically an O(log N) delay is attainable with
the Wallace trees [7] or Dadda schemes [8]. However, these tree schemes
were never used in monolithic multipliers, because the speedup they
could give for small N wasn't considered to be significant enough to
justify the more difficult interconnections and the irregularity of
the structure. However, for sizes of the order of 16 or higher the use
of the Dadda scheme is attractive, especially in combination with Booth's
algorithm. We worked out an example of a 24×24 multiplier to show that
the routing difficulties are feasible.
For a 24×24 product, 13 partial products are generated if the Modified
Booth's Algorithm is used. With a Wallace tree type of interconnection
scheme these 13 partial products can be summed to form a result in
redundant form (i.e. two bits per column) after 5 Full Adder delays.
To get the final result the summing has to be completed in some fast-carry-propagate
or carry-lookahead adder. A regular interconnection
scheme for this 24×24 multiplier is shown in Figure 5.
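
The count of 5 Full Adder delays follows from the (3,2)-counter recurrence, as the short sketch below (ours) verifies:

    # Each Wallace/Dadda level replaces every 3 rows by 2: k -> ceil(2k/3).
    def wallace_levels(k):
        levels = 0
        while k > 2:
            k = -(-2 * k // 3)    # ceil(2k/3) rows survive one level
            levels += 1
        return levels

    print(wallace_levels(13))     # 5, matching the five Full Adder delays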
In Figure 5, the bits of the partial products are generated by multiplexers.
These multiplexers are shaded, and every square represents
a generated bit. For the 24×24 case the partial products are just an
extension of the 8×8 case shown in Figure 3a. The adders are marked
with the level they have in the Wallace tree; the adders of the final
carry chain are marked 7 to 47. The actual interconnections are left
out of Figure 5 for clarity, but we show the expected complexity in
Figure 6. There we give the complete interconnections for the area
marked with dotted lines in Figure 5, where the wiring density is maximal.
(In the final layout of the chip one would of course reorganize
Figure 5 from a parallelogram shape to a more appropriate rectangular
shape, without basic changes in the interconnections.)
Let us now consider the difficulty of the interconnections in Figures 5
and 6. There is a vertical stream of carries, sum bits and partial
product bits, crossing the horizontal select lines in the multiplexers.
The power supply could run parallel with either direction. The distribution
of the multiplicand is not horizontal but diagonal. If we
compare the wiring density in Figure 6 with the complexity of the
straightforward scheme of Figure 3 we find that in Figure 6 the rows
of adders are crossed by groups of up to 4 wires per column. In
Figure 3 there are only wires interconnecting adjacent layers of adders.
The fact that we need the same total number of adders for the Dadda
scheme as for the ripple-through scheme of Section 3 would allow us
to use a similar chip size. This would however only be so if the
extra wires for the Wallace trees (as shown in Figure 5) could be
routed above the adders, on an independent extra level. If only 2
levels of interconnections are available this is impossible, but in
a technology with for instance 2 layers of metalization the extra
wires can be laid out on the extra metal level. The density of
these extra wires is not prohibitive, as shown in Figure 6.
This allows us to conclude that the speed improvement by using the
Wallace tree is certainly within the limits of existing VLSI technologies.
Note: If the routing of the long wires, necessary for the Wallace
tree, is done without an extra level of interconnection, then the
multiplier chip will need a substantially larger area. Indeed, in
that case the extra wiring would be laid out in between the C-S
adders and this would certainly constitute a very poor usage of the
chip area.
5. SUMMARY
Fast multipliers are essential building blocks for high-speed signal
processing applications. Techniques to increase the speed of fixed-point
parallel multipliers are presented. New parallel multiplier
configurations are derived using an octal version of Booth's Algorithm,
with additional speed improvement by using the Wallace tree
or optimal Dadda types of realizations. A regular array structure for
Dadda schemes is obtained which is suitable for VLSI implementation.
The more complex interconnection wiring which is needed is shown to
be feasible in technologies with at least 3 layers of interconnections.
Layout, interconnection and speed considerations for the proposed high-speed
VLSI parallel multiplier configurations have been studied. In
general, significant improvements in speed and chip area have been
estimated.
ACKNOWLEDGEMENT
The authors wish to acknowledge the partial support of the DOD VHSIC
Program under the direction of Mr. Larry Sumney of OUSDRE.

REFERENCES
1. "l\ signed binary multiplication technique", A.D. Booth, Quarterly
Journal of tlechanics and Applied rlathematics, Vol.4 pot.2., 1951.
2. "A proof of the modified Booth's algorithm for multiplication",
L.P. ~ubenfield, IEEE Trans. Computers, Vol. C-24, Oct. 1975,
pp. 1014-1015.
3. Theory and Appl ication of Digital Signal Processing, L. Rabiner
and B. Gold, Prentice Hall, 1975, Chapter 8, Paragraph 5.
4. "The IBt1 System 360/Model 91 : Floating-point execution unit",
Anderson S.F. et al., IBM Journal, Jan 1967, pp.35-53.
5. "The design of a 16 x 16 multiplier", R.T. Masumoto, LAr-1BDA, First
Quarter 1980, pp.15-21.
6. "On Parallel Digital Multipliers", L. Dadda. Alta Freguenza, Vol.
45, N.10, Oct. 1976, pp. 574-580.
7. "A suggestion for a fast multiplier", C.S. ~Iallace, IEEE Trans. on
Electronic Computers, Feb. 1964, pp. 14-17.
8. "Some schemes for parallel multipliers", L. Dadda, Alta Freguenza,
Vol. 34, pp.349-356,1965.
9. "Merged Arithmetic for Signal Processing", E.E. Swartzlander, Jr.,
Fourth IEEE Symposium on Computer Arithmetic, Oct 25-27, 1978, pp.
239-244.
10. "~'ultipl ication using logarithms implemented with read only memor-
ies", T.A. Brubaker and J.C. Becker, IEEE Trans. Computers, Vol.
C-24, Aug. 1975,pp. 761-765.
11. "Generation of products and quotients using approximate binary
logarithms for digital filtering applications", E. Hall, et al.,
IEEE Trans on Computers, Vol C-19, N.2, Feb. 1970.
12. Introduction to VLSI Systems, C.r1ead and L. Conway, Addison-~Iesley
Publishing Company, 1980.
13. "A compact high-speed multiplication scheme", 14.J. Stenzel et al.,
IEEE Trans. on Computers, Vol. C-26, No. 10, Oct. 1977, pp. 948-57.
14 "Two's complement parallel implementation of large multipliers",
H. Kobayashi, T. Yamada, H.Ohara, Proc of the first IEEE Internat-
ional Conference on Circuits and Computers, pp. 1085-1088.
15. "The area-time complexity of binary multiplication", R.P. Brent
and H.T. Kung, JACM, Vol. 28, No.3, July 1981, pp. 521-534.
[Figures 1-6: graphics omitted. The recoverable captions are:
Figure 1: The realization of a (5,5,4)-counter with 6 Full Adders.
Figure 2: A straightforward realization of an unsigned 5x5 parallel multiplier.
Figure 3.a: An 8x8 parallel multiplier with the modified Booth's algorithm (signed, unsigned or mixed operands); half adders and carry-propagate adders are marked separately.
Figure 3.b: An 8x8 parallel multiplier with Booth's modified algorithm, signed operands only (2's complement).
Figure 4: Schematic layout of the multiplexer and the interconnections of the full adders for the quaternary case of the modified Booth's algorithm in 4.a, and for the octal case in 4.b (NMOS; metal, poly and diffusion lines distinguished).
Figure 5: Possible layout of the layers of adders and multiplexers for a Wallace tree interconnection scheme of a parallel 24x24 multiplier.
Figure 6: The complete interconnection scheme of multiplexers and adders for the area indicated with the dashed line in Figure 5.]
A Mesh-Connected Area-Time Optimal VLSI
Integer Multiplier

Franco P. Preparata
University of Illinois at Urbana-Champaign
Coordinated Science Laboratory

VLSI circuits for integer multiplication based on the Fourier
Transform have been discussed in the literature [1-3]. The circuits
presented in [2] meet the area × (time)² lower bound for multiplication
established by various workers [1,4,5] in the well-known VLSI synchronous
model of computation [6]. This bound states that, if the numbers
to be multiplied are represented with N bits and A and T are respectively
chip area and computation time, then any VLSI multiplication
network must satisfy the condition AT² = Ω(N²).
The above circuits [2] meet this bound for all values of T in the
range Ω(log N) ≤ T ≤ O(√N). They are structurally closely related to
the cube-connected-cycles architecture [7], and therefore contain wires
of considerably different lengths. It is the purpose of this note to
show that an AT²-optimal integer multiplication network can also be
built with the square-mesh structure, which uses wires of uniform (and
small) length. Such a design, however, seems to exist only for T = O(√N).
The method proposed in this note is based, as those reported in
[1] and [2], on transforming a convolution product (or polynomial product)
into an ordinary product, by "releasing the carries". In turn,
convolution is obtained via the Discrete Fourier Transform (DFT), which
is the crucial component of the technique. Such an apparently complicated
method may seem mainly of theoretical interest. However, when
considering the regularity of the circuit structure and of its timing,
the possibility of implementations, especially for very large operand
sizes, should not be hastily ruled out.
As is well-known, the calculation of the DFT of an n-component
vector over some field F could be viewed as a matrix-vector multiplication
(see, e.g., [1]). Our method is an interesting combination of the
following two ideas (we assume that n = m²):
1. The Discrete Fourier Transform (A_0, A_1, ..., A_{n-1}) of a vector
(a_0, a_1, ..., a_{n-1}) can be obtained as a two-dimensional DFT, by
arranging the vector in row-major order as an m × m matrix
A = [a_{ij}], where a_{ij} = a_{mi+j} (i, j < m). (Note that indexing starts
from 0 rather than from 1.) We then have

    A_{rs} = Σ_{i,j} a_{ij} ω^{(mi+j)(mr+s)} = Σ_{j=0}^{m-1} (ω^m)^{jr} ω^{sj} Σ_{i=0}^{m-1} a_{ij} (ω^m)^{is}
This work was supported in part by National Science Foundation
Grants MCS-78-13642 and MCS-81-05552, and by the Joint Services
Electronics Program Contract N00014-79-C-0424.

Therefore A_{rs} can be computed by the following program:

    (i)   for 0 ≤ s, j < m do A′_{sj} ← Σ_{i=0}^{m-1} a_{ij} (ω^m)^{is}      /* column-DFT */
    (ii)  for 0 ≤ s, j < m do A″_{sj} ← ω^{sj} A′_{sj}
    (iii) for 0 ≤ r, s < m do A_{rs} = A‴_{sr} ← Σ_{j=0}^{m-1} A″_{sj} (ω^m)^{jr}   /* row-DFT */
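
The decomposition of steps (i)-(iii) can be checked numerically. The following Python sketch (illustrative; the small parameters p = 17, ω = 4, m = 2 are our choices, with ω of order n = 4 in F_17) compares it with the direct DFT:

    # Row-column (two-dimensional) DFT of steps (i)-(iii) over F_p,
    # checked against the direct n-point DFT.
    def dft(v, w, p):
        n = len(v)
        return [sum(v[k] * pow(w, t * k, p) for k in range(n)) % p
                for t in range(n)]

    def dft_2d(a, w, p, m):
        wm = pow(w, m, p)                                   # root of order m
        A1 = [[sum(a[m*i + j] * pow(wm, i*s, p) for i in range(m)) % p
               for j in range(m)] for s in range(m)]        # (i)  column-DFTs
        A2 = [[A1[s][j] * pow(w, s*j, p) % p
               for j in range(m)] for s in range(m)]        # (ii) twiddles w^{sj}
        return [sum(A2[s][j] * pow(wm, j*r, p) for j in range(m)) % p
                for r in range(m) for s in range(m)]        # (iii) row-DFTs

    p, w, m = 17, 4, 2
    a = [1, 2, 3, 4]
    assert dft_2d(a, w, p, m) == dft(a, w, p)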
2. As shown in [2] and [7], a unidimensional m-module array (where
m = 2^r for convenience) can be used to compute the DFT of an m-vector.
Specifically, the unidimensional array emulates a
cube-connected array in computing the Fast Fourier Transform of
the input vector.
It is convenient to review the general method described in [7] in
the specific context of the FFT executed according to the well-known
decimation-in-frequency scheme (see [11], pp. 240-250). We recall the
relations which are the basis of this method:

    b_j = a_j + a_{j+n/2},    c_j = (a_j - a_{j+n/2}) ω^j,    j = 0, 1, ..., n/2-1,

where the even-indexed outputs A_{2t} form the (n/2)-point DFT of (b_j)
and the odd-indexed outputs A_{2t+1} that of (c_j).
Thus the network appears as in Figure 1a, in the standard signal-flow-graph
notation.(1) If we rearrange the input vector according to the
[Graphics omitted.]

Figure 1. Decimation-in-frequency signal-flow graphs.

(1) The convention is that each node of the signal-flow-graph outputs the
linear combination of the outputs of its predecessors, where the
multipliers are the arc labels (the multiplier 1 is normally
omitted).

If we rearrange the input vector according to the bit-reversal permutation (BRP), then we obtain a signal-flow-graph which for n = 8 is shown in Figure 1(b). Two facts are worth noting: (i) in this flow-graph both input and output are indexed according to the BRP; (ii) with identical conventions about the arrangement of input and output, the signal-flow-graphs of the decimation-in-time and of the decimation-in-frequency schemes are topologically similar. (In the decimation-in-time scheme the root-power multipliers are different and the shuffles precede the multiplications by the root powers.) Each module of the array corresponds to a horizontal line in the signal-flow-graph; if we index the modules from the top starting at 0, each odd-numbered module successively uses a nontrivial sequence of root-power multipliers.
The reason why decimation-in-frequency is preferred to decimation-in-time is that for any given module the root powers form a sequence of the type $\omega^j, \omega^{2j}, \omega^{4j}, \omega^{8j}, \ldots$, that is, each power is the square of the preceding one; by contrast, in the decimation-in-time scheme, the calculation of successive powers involves square-rooting, and squaring is far simpler than computing the square root. In addition, the shuffle operations required in the flow-graph can be implemented by appropriate sequences of parallel interchanges between adjacent modules; indeed the flow-graph of Figure 1(b) can be expressed as in Figure 2, where each operation is either a "butterfly" or an exchange. All operations on the same vertical line are performed simultaneously. It is easy to realize that, if the array has $2^r$ modules, the total number of exchange steps is $2^r - r - 1$.

Figure 2. The flow-graph of Figure 1(b) adapted to the unidimensional array operation. (Note that $\omega^4 = -\omega^0$ and $\omega^6 = -\omega^2$.)
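For reference, the BRP ordering visible at the inputs of Figure 2 ($a_0, a_4, a_2, a_6, a_1, a_5, a_3, a_7$ for n = 8) is generated by reversing the r-bit binary index, e.g. as in this small Python helper (the function name is ours):

    def bit_reverse(i, r):
        """Reverse the r-bit binary representation of i."""
        out = 0
        for _ in range(r):
            out = (out << 1) | (i & 1)
            i >>= 1
        return out

    # for r = 3: [0, 4, 2, 6, 1, 5, 3, 7], the input order of Figure 2
    print([bit_reverse(i, 3) for i in range(8)])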


Each module must have the following capabilities:
(i) Multiply two operands;
(ii) Add two operands;
(iii) Exchange one operand with a neighboring module.
We assume that all operands are viewed as integers in a prime field $F_p$, where p is a suitably chosen prime (see below). Therefore all operands will be represented with $s = \lceil \log_2 p \rceil$ bits, and all arithmetic will be

carried out modulo p. The module therefore contains a pipeline multiplier [8], consisting of 2s "cells", which obtains the product in 2s-1 shift-and-add steps by convolving the two factors. The modular product is obtained with an insignificant increase in time. The same unit can easily be adapted to perform addition; further, it contains, by its own construction, three data registers (the two operands and the result). The multiplier-adder itself has the linear (unidimensional) cyclic array structure; it is indeed an array of cells, each of which contains a "full adder" and a few bits of storage capability. Therefore, each cell can be laid out in constant area. Moreover, it is proposed that the 2s cells of the multiplier be embedded in a square $\lceil\sqrt{2s}\rceil \times \lceil\sqrt{2s}\rceil$ mesh of cells. Various simple choices are available: for example, in Figure 3 two possible arrangements are shown for 2s = 22. With this arrangement an operand exchange between adjacent cells can be effected in time $O(\sqrt{s})$ (by shifting either along rows or along columns), while

L.. l i
.....
1- L -I
1--'
1 r
I t

Figure 3. Two possible layouts of an 11-bit pipeline multiplier.

a multiplication or addition is completed in time O(s). Note that all


wire lengths are of the order of the distance between neighboring cells.
To obtain the DFT of $n = m^2$ numbers in $F_p$, we construct an $m \times m$ mesh of modules $M_{ij}$ $(0 \le i, j < m)$ of the type described above. The area of the resulting circuit is O(sn). The prime number p is chosen of the form $p = nq + 1$ (this is always possible [1], [9], with $\log q = O(\log n)$), so that $F_p$ contains a primitive root of unity, $\omega$, of order n. Therefore $s = \lceil \log_2 p \rceil$ is at least of order $\log m$.
Each column (or row) of the mesh is viewed as a unidimensional array capable of performing the DFT of the data it contains. Working out the details of the algorithm, it is easy to realize that each module must store at most three constants (powers of the primitive root $\omega$) to be used, respectively, in steps (i), (ii), (iii) of the DFT, as described above. This does not significantly affect the chip area. Recalling that $m = 2^r$, the algorithm uses $4r + 1$ multiplication steps, $2r$ addition/subtraction steps, and $2(2^r - r - 1)$ exchange steps. On the basis of our preceding estimates of the running times for each of these operations, we have for the total computation time

$$T = O(rs + 2^r\sqrt{s}).$$

If we add the condition that $\sqrt{s} < K\,2^r/r$ (for some constant K), then $T = O(2^r\sqrt{s}) = O(\sqrt{ns})$. We have already observed that $A = O(ns)$.
The above scheme can now be readily adapted to perform integer multiplication. Let $N = ns'$ be the number of bits used by the arithmetic. The N-bit sequence of each operand is partitioned into n blocks of length s', and each such block is considered as an integer in $F_p$. The prime p is chosen of the form $p = n \cdot 2^{2s'} q + 1$, which simultaneously guarantees the existence in $F_p$ of a primitive root of order n and exceeds the size of the largest convolution term. Although s' can be chosen in the range $[O(\log n), O(\sqrt{N/\log N})]$, it is convenient to let $s' = O(\log n)$.
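As an illustration (the helper names are ours, and trial division stands in for a realistic primality test), a prime of the required form and a root of unity of order n can be found as follows:

    def find_prime_and_root(n, sp):
        """Find p = n * 2**(2*sp) * q + 1 prime, and an order-n root of unity mod p.
        Assumes n is a power of 2, as in the text (n = 2**(2r))."""
        def is_prime(v):
            if v < 2:
                return False
            d = 2
            while d * d <= v:
                if v % d == 0:
                    return False
                d += 1
            return True
        base = n << (2 * sp)
        q = 1
        while not is_prime(base * q + 1):
            q += 1
        p = base * q + 1
        for g in range(2, p):
            w = pow(g, (p - 1) // n, p)      # w has order dividing n
            if pow(w, n // 2, p) != 1:       # order exactly n (n is a power of 2)
                return p, w

For example, find_prime_and_root(16, 4) yields p = 12289 = 16·256·3 + 1.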
Letting $a = (a_0, a_1, \ldots, a_{n-1})$ and $b = (b_0, b_1, \ldots, b_{n-1})$, with $a_j, b_j \in [0, 2^{s'})$, the integer multiplication is carried out by first computing the DFTs A and B of a and b, respectively, multiplying A and B componentwise, and obtaining the inverse DFT of the latter. The result is the convolution product, which can be transformed into the ordinary product by releasing the carries. The network which can be used for the purpose is a rather straightforward modification of the previously described DFT network; the area and time both increase by small multiplicative factors. In conclusion we have:

$$A = O(ns) = O(ns') = O(N)$$
$$T = O(\sqrt{ns}) = O(\sqrt{ns'}) = O(\sqrt{N})$$

i.e. we have realized a mesh multiplier which achieves the lower bound $\Omega(N^2)$ for the $AT^2$ measure of complexity. This bound holds both in the synchronous model of computation and in the one recently proposed by Chazelle and Monier [10]. Note that in the described network all wires have minimal length.
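A serial Python sketch of this arithmetic (with an explicit convolution standing in for the DFT/inverse-DFT pair, and illustrative parameter names) may clarify the "carry release" step:

    def multiply_via_convolution(x, y, sp=4, n=8):
        """Multiply two integers of at most n*sp bits via convolution of blocks."""
        base = 1 << sp
        a = [(x >> (sp * k)) & (base - 1) for k in range(n)]   # n blocks of s' bits
        b = [(y >> (sp * k)) & (base - 1) for k in range(n)]
        # convolution product c_k = sum over i+j=k of a_i * b_j
        # (this is what the componentwise-multiplied DFTs invert to)
        c = [sum(a[i] * b[k - i] for i in range(max(0, k - n + 1), min(k + 1, n)))
             for k in range(2 * n - 1)]
        # release the carries: evaluate the convolution at the radix 2^s'
        result, carry = 0, 0
        for k, ck in enumerate(c):
            t = ck + carry
            result |= (t & (base - 1)) << (sp * k)
            carry = t >> sp
        return result | (carry << (sp * len(c)))

    assert multiply_via_convolution(123456, 654321) == 123456 * 654321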

References

[1] R. P. Brent and H. T. Kung, "The chip complexity of binary arithmetic," Proc. of 12th ACM Symposium on Theory of Computing, (Los Angeles), pp. 190-200, May 1980.

[2] F. P. Preparata and J. Vuillemin, "Area-time optimal VLSI networks for computing integer multiplication and Discrete Fourier Transform," Proceedings of I.C.A.L.P. Symposium, Haifa, Israel, July 1981.

[3] F. P. Preparata and J. Vuillemin, "Area-time optimal VLSI networks based on the cube-connected-cycles," Tech. Rep. ACT-21, Coord. Sci. Lab., Univ. of Ill., February 1980.

[4] H. Abelson and P. Andreae, "Information transfer and area-time trade-offs for VLSI multiplication," Communications of the ACM, vol. 23, no. 1, pp. 20-22, January 1980.

[5] J. E. Vuillemin, "A combinatorial limit to the computing power of VLSI circuits," Proc. 21st Symposium on Foundations of Computer Science (Syracuse), pp. 294-300, October 1980.

[6] C. D. Thompson, "A complexity theory for VLSI," Ph.D. Thesis, Department of Computer Science, Carnegie-Mellon University, Pittsburgh, PA, August 1980.

[7] F. P. Preparata and J. Vuillemin, "The cube-connected-cycles: a versatile network for parallel computation," Communications of the ACM, vol. 24, no. 5, pp. 300-309, May 1981.

[8] H. T. Kung and C. E. Leiserson, "Algorithms for VLSI processor arrays," Symposium on Sparse Matrix Computations, Knoxville, TN, November 1978.

[9] S. S. Wagstaff, Jr., "Greatest and least primes in arithmetic progressions having a given modulus," Math. Comp., 33:1073-1083, 1979.

[10] B. Chazelle and L. Monier, "A model of computation for VLSI with related complexity results," Tech. Rep., Dept. of Comp. Sci., Carnegie-Mellon University, February 1981.

[11] L. R. Rabiner and C. M. Rader, Eds., Digital Signal Processing, IEEE Press, New York, 1972.
A Regular Layout for Parallel Multiplier
of O(log²N) Time

W.K. Luk*
IMAG Laboratory
Computer Architecture Group
B.P.53X
38041 Grenoble Cedex, France

ABSTRACT

An $O(\log^2 n)$ time n-bit binary multiplier based on the Mead and Conway VLSI design rules is presented. The layout has a regular, recursive structure and is directly suitable for practical VLSI implementation. This multiplier is much faster than the traditional "serial pipeline multiplier" having O(n) time and the "Brent-Kung multiplier" having $O(n^{1/2}\log n)$ time. Its layout is of more practical interest than the multiplier proposed by Preparata and Vuillemin based on the CCC network, though the latter is optimal and also has time $O(\log^2 n)$. The $AT^2$ measure of this multiplier layout is nearly optimal, being $O(n^2\log^4 n)$, so it answers the question posed by Brent and Kung about the existence of a practical multiplier having an $AT^2 = O(n^3)$ measure.

The detailed VLSI layout and the theoretical and actual exact complexities of the time and area measures will be presented. The actual implementation of a 16-bit example through the MPC is also discussed.

1. INTRODUCTION

Since the development of VLSI technologies, computation relies very much on parallel computation using a large number of processing elements. The design rules for VLSI [MeaCon 80] and the VLSI space-time computation model [Tho 80] permit us to design and also evaluate VLSI parallel computation systems. In this paper, we present a layout of an n-bit binary multiplier of $O(\log^2 n)$ time using VLSI technology and evaluate its space-time merits. It has a simple recursive structure and is particularly suitable for practical VLSI implementation.

Traditional sequential multiplication of two n-bit binary numbers requires $O(n\log n\log\log n)$ time [SchStr 71] and is considered too slow with present-day technology, in which a large number of processing elements are relatively cheaply available to do the work in parallel. Recently, $O(\log^2 n)$ time has been achieved [PreVui 80] for integer multiplication based on the CCC network [PreVui 79], but it is too complicated for simple realization. The theoretical lower bound for the space-time trade-off of integer multiplication given by several authors [AbeAnd 80], [BreKun 80], [Tho 79], [Vui 80] is $AT^{2\alpha} = \Omega(n^{1+\alpha})$. The proof is strongly based on the computation graph used in the VLSI model [Tho 80], and the interested reader is referred to it.

* The author is now with the Laboratoire de Recherche en Informatique, Université de Paris-Sud.

A simple realization of an n-bit multiplier which is optimal within a constant factor under the $AT^2$ measure is not yet known. An example typically used in digital signal processing is the "serial pipeline multiplier", which requires area A = O(n) and time T = O(n) [Lyo 76], [Jac 68]. Brent & Kung [BreKun 79] have proposed an implementation based on the convolution theorem to compute the product of two integers. It has area $A = O(n\log n)$ and $T = O(n^{1/2}\log n)$, but it is not of practical interest. Preparata & Vuillemin have proposed an optimal multiplier using the CCC network [PreVui 79, 80], but also without practical interest.

In section 2 we present the design and layout of an n-bit binary multiplier having a simple recursive structure, based on the VLSI design rules of Mead and Conway. It can easily be extended to any number of bits n, n being a power of 2. We calculate the area-time complexity in section 3. This multiplier has area $A = O(n^2\log^2 n)$ and time $T = O(\log^2 n)$. Though it is only nearly optimal in the usual area-time sense, its fast running time $O(\log^2 n)$ and its simplicity are particularly useful. We also show how, if one is willing to slightly modify the regular structure, one can obtain the better result of $A = O(n^2)$ and $T = O(\log^2 n)$. In section 4, the actual VLSI layout of a 16-bit multiplier through the MPC is considered in detail, including the fast Brent-Kung adder and some exact evaluation of the chip.

2. LAYOUT OF AN n-BIT BINARY MULTIPLIER

It is well known that the sequential binary multiplication of two n-bit numbers can be done by basically four n/2-bit multiplications and two or three additions. Our layout of the multiplier is derived from this recursive scheme. Let x, y be the two n-bit binary numbers (without further specification, we assume that n is a power of 2), a, c be the most significant n/2 bits of x and y respectively, and b, d be the least significant n/2 bits of x and y respectively, i.e. $x = 2^{n/2}a + b$ and $y = 2^{n/2}c + d$. Then, we can express the overall product as follows:

$$xy = (2^{n/2}a + b)(2^{n/2}c + d) = 2^n ac + 2^{n/2}(ad + bc) + bd \qquad (1)$$

We see that ac, ad, bc and bd are the four n/2-bit multiplications, and additional additions and shiftings are necessary to complete the calculation, as follows. The result xy, which is the sum of $2^n ac$, $2^{n/2}(ad+bc)$ and bd, is 2n bits long. Summing can be performed by two or three additions. First, we sum ad and bc, giving an (n+1)-bit number; then we sum this with the n/2 most significant bits of bd, with suitable shifting of the bits of (ad+bc); and finally we sum in ac (using only the most significant (n/2+2) bits of the above), with suitable shifting of the bits of ac. This requires two n-bit additions and one (n+1)-bit addition in total. We can also obtain the product xy by a single n-bit addition and a single 2n-bit addition, as follows. First, sum ad and bc, giving an (n+1)-bit number; since $2^n ac$ and bd can be treated together as a single 2n-bit number, an additional 2n-bit addition then suffices.

In other words, an n-bit multiplication can be recursively performed by four n/2-bit multiplications, two n-bit additions and an (n+1)-bit addition, or by four n/2-bit multiplications, a single n-bit addition and a single 2n-bit addition. The choice of either form does not affect the overall area-time complexity, and we shall adopt the latter in the discussion that follows.
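A sketch of this recurrence in Python (serial, for exposition only; the function name is ours) makes the scheme concrete:

    def rmul(x, y, n):
        """Multiply two n-bit numbers by the four-multiplication recurrence (1);
        n must be a power of 2."""
        if n == 1:                               # terminal level: 1-bit trivial case
            return x & y
        h = n // 2
        a, b = x >> h, x & ((1 << h) - 1)        # most/least significant halves of x
        c, d = y >> h, y & ((1 << h) - 1)        # most/least significant halves of y
        ac, ad = rmul(a, c, h), rmul(a, d, h)
        bc, bd = rmul(b, c, h), rmul(b, d, h)
        mid = ad + bc                            # one n-bit addition, (n+1)-bit result
        # 2^n * ac and bd together form a single 2n-bit number;
        # one 2n-bit addition then completes the product
        return ((ac << n) | bd) + (mid << h)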
It is important to point out that the above recursive scheme is not the only way to perform the multiplication. Different schemes may lead to other area-time results and greatly affect the layout. For example, as traditionally discussed [Aho & al 74], only three multiplications of n/2 bits, some additions, and bit testings may be used (a sketch is given below). Since only three multiplications are needed in the recurrence, it leads to a smaller area, as seen later; the time complexity remains the same. It is also this three-multiplication recurrence that improves sequential multiplication from $O(n^2)$ to $O(n^{\log 3})$ running time; had it been four, no improvement would have been obtained. But, since the layout becomes more irregular than the one proposed using four multiplications, we keep the four-multiplication scheme for the time being.
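A hedged sketch of the three-multiplication recurrence (the classical scheme discussed in [Aho & al 74]; the size handling here is ours):

    def kmul(x, y):
        """Three-multiplication (Karatsuba-style) recursive product."""
        if x < 2 or y < 2:
            return x * y                   # 0- or 1-bit operand: trivial product
        h = max(x.bit_length(), y.bit_length()) // 2
        a, b = x >> h, x & ((1 << h) - 1)
        c, d = y >> h, y & ((1 << h) - 1)
        ac, bd = kmul(a, c), kmul(b, d)
        # the single extra product (a+b)(c+d) replaces both ad and bc
        cross = kmul(a + b, c + d) - ac - bd      # equals ad + bc
        return (ac << (2 * h)) + (cross << h) + bd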
We now present the layout of an n-bit multiplier using the recursive structure. It is directly suitable for VLSI implementation. The layout is shown in figure 1. A and G are the input and output cross-bar networks, and they are the most "area consuming": each requires $O(n^2)$ area. C is a perfect-shuffle network for distributing the results ad and bc to the adder, and it also has $O(n^2)$ area. E is also an area of wiring consuming $O(n^2)$ area. B is the area for the four n/2-bit multipliers. The bit size of the multiplier at each level is half that of its parent, until arriving at the "deepest" level, where the multiplier is a 1-bit trivial case. One might choose a 2-bit or 4-bit multiplier as the terminal level, since these can easily be realized using a ROM or PLA. Area D holds the n-bit adder, whereas area F holds the 2n-bit adder.

In the recursive layout, each input data line drives the two next lower levels, and so on recursively. An input driver is indispensable at each level (unless the fan-in at each level is one) to maintain the signal level, for practical reasons. A more important reason is that we have assumed that all other "secondary" devices, e.g. input and output buffers, clock drivers, wire drivers, wire propagation time, etc., are constant (independent of n). The input drivers are thus used to maintain the same recursive structure at all levels. Otherwise, a large overall input driver would be needed for all the 2log n fan-ins, and it would also be size-dependent. These input drivers give rise to an overall time delay of O(log n), but they form a pipeline structure which might be useful in some applications. This also points out an O(log n) time lower bound for this type of recursive structure. There is also an interesting aspect of area-time trade-off at the I/O ports: elimination of the O(log n) delay by using a single input driver of sufficiently large area. For the present multiplier layout, neither choice would affect the overall area-time complexity, since both are of lower order of magnitude than the rest (adding time, wire area); see equations (2), (3). They only contribute up to a constant difference.

3. AREA-TIME COMPLEXITY

The cost of wiring is quite high in this layout, as in the case of many other VLSI layouts in which the cost of communication (wire) is of a higher order of magnitude than the cost of computation (transistor). This is an important point in present-day design of silicon chips. The multiplier time is dominated by the performance of the adders; the choice of the type of adder is critical.

The discussed recursive structure allows us to calculate the area and time complexities directly. The areas of A, C, E, G are $O(n^2)$ as discussed. Assuming that the terminal level is n = 1, the area for level n is given by the recursive equation:

$$A(1) = \text{constant}; \qquad A(n) = 4A(n/2) + O(n^2) + B(n) \qquad (2)$$

where B(n) is the total area for the n-bit and 2n-bit adders. Similarly, we can calculate the running time. We assume that the time for transmitting a signal through a wire is independent of the length of the wire [MeaCon 80], so the time is ruled only by the time to perform the multiplications and additions. We have the following recurrence equation:

$$T(1) = \text{constant}; \qquad T(n) = T(n/2) + C(n) \qquad (3)$$

where C(n) is the total time required to compute the n-bit and 2n-bit additions successively.
The choice of the adder is critical to the area-time complexity. A simple type of adder, called the carry propagate adder, has area O(n) (width O(1) and height O(n)) and running time O(n) for performing an n-bit addition, since n basic adding units are needed and n clock pulses are required to propagate the carries throughout. By substituting $B_1(n) = O(n)$ and $T_1(n) = O(n)$ into equations (2) and (3), we obtain:

$$A(n) = 4A(n/2) + O(n^2) + O(n) \qquad \text{and} \qquad T(n) = T(n/2) + O(n) \qquad (4)$$

Solving (4) we have $A(n) = O(n^2\log n)$ and T(n) = O(n), which is of no theoretical or practical interest; one can easily find a simpler layout to achieve this.
We turn to the discussion of using the Brent-Kung adder [BreKun 79B] (details on this adder are given in section 4). We simply apply here the results for the adder in order to study our multiplier. An n-bit Brent-Kung adder is a parallel "carry look-ahead" adder having a faster time of O(log n) and area of O(n log n) (width O(log n), height O(n)). Putting $B_2(n) = O(n\log n)$ and $T_2(n) = O(\log n)$ into equations (2) and (3), we obtain:

$$A(n) = 4A(n/2) + O(n^2) + O(n\log n) \qquad \text{and} \qquad T(n) = T(n/2) + O(\log n) \qquad (5)$$

from which the area complexity is the same, being $A(n) = O(n^2\log n)$, but the time complexity is of great interest, being $T(n) = O(\log^2 n)$. The adder's faster adding time improves the time of our multiplier, while its larger area does not worsen the area complexity. As pointed out earlier, the overall chip area is dominated by the wire, which has at least $O(n^2)$ area; hence any additional device of area $o(n^2)$ does not affect the area complexity.
Although the $AT^2$ measure is beyond the theoretical lower bound, the design is of particular practical interest owing to its regular layout, as discussed next. It also has a much lower running time than the multiplier used in traditional signal processing, which has O(n) time, and the multiplier proposed by [BreKun 79], which has $O(n^{1/2}\log n)$ time. It also presents a practical design having $AT^{2\alpha} = O(n^{1+2\alpha})$, answering the question posed in [BreKun 79].
It is important to point out that by using only three multiplications rather than four, with some additional additions and bit testings [Aho & al 74], the recurrence equation (2) becomes:

$$A(1) = \text{constant}; \qquad A(n) = 3A(n/2) + O(n^2) + B(n) \qquad (6)$$

We have assumed that in the additional additions, the bit testing is of area $o(n^2)$, i.e. of lower order of magnitude than the wiring's $O(n^2)$. Hence the area is reduced to $A(n) = O(n^2)$. The time complexity is the same, since the additional work requires at most as much time as C(n) in equation (3). Using this 3-multiplication scheme, the $AT^2$ measure is reduced to $AT^2 = O(n^2\log^4 n)$. Indeed, it can also be shown that the actual layout

takes the same $O(n^2)$ area with w(n) = h(n) = O(n); it is optimal for this recursive layout. Here w, h stand for width and height respectively. The results obtained for the various multipliers are summarized in Table 1.
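A quick numeric check of the two recurrences (the helper below is ours, with unit constants) shows the claimed behavior: with four sub-multiplications the area grows like $n^2\log n$, while with three it stays $O(n^2)$:

    def area(n, branches):
        """Evaluate A(n) = branches * A(n/2) + n^2 with A(1) = 1."""
        return 1 if n == 1 else branches * area(n // 2, branches) + n * n

    for n in (2**k for k in range(4, 11)):
        # the first ratio grows like log n; the second converges to a constant
        print(n, area(n, 4) / n**2, area(n, 3) / n**2)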
In practice, it is not interesting to have a chip laid out in the form of a "long bar", in other words with the width of a higher order of magnitude than the height or vice versa, i.e. w(n) = o(h(n)) or h(n) = o(w(n)), where w and h are the width and height of the layout respectively. We are going to show that in our recursive layout, the width and the height have the same order of magnitude for all n.
At level n, the height and width are given by (refer to fig. 1):

$$h(n) = \max\{4w(n/2),\, A_1 n\}; \qquad w(n) = h(n/2) + C_1 n + C_2 w_a(n) \qquad (7)$$

$A_1$ is the height constant of the 2n-bit adder, $C_1$ is the width constant of all the input cross-bar wire, the perfect-shuffle network and the wire at E and G, and $w_a(n)$ is the total width of the two n-bit and 2n-bit adders; $w_a(n) = n$ for the carry-propagate adder and $w_a(n) = \log n$ for the Brent-Kung adder. For the asymptotic study, $w_a(n)$ is immaterial, since it is of a lower order of magnitude than the other terms. For small n, $A_1 n > 4w(n/2)$, and the solution of equation (7) is h(n) = w(n) = O(n). For sufficiently large n, $4w(n/2) > A_1 n$, and equation (7) becomes:

$$h(n) = 4w(n/2); \qquad w(n) = h(n/2) + O(n) \qquad (8)$$

Solving (8), we have $h(n) = w(n) = O(n\log n)$, which takes $O(n^2\log^2 n)$ area. This is a factor of O(log n) higher than the $O(n^2\log n)$ area already proved for this recursive layout. There remains some room for improvement in the running of the wires (how the different blocks communicate with one another in such a way that the area can be decreased by a factor of log n).

                                          A              T               AT²

theoretical lower bound                   -              -               Ω(n²)
Brent-Kung scheme [BreKun79]              n log n        n^(1/2) log n   n² log³ n
serial pipeline multiplier [Lyo76,Jac68]  n              n               n³
this paper                                n² log² n      log² n          n² log⁶ n
this paper with 3 multiplications         n²             log² n          n² log⁴ n
Preparata, Vuillemin [PreVui80]           (n/log² n)²    log² n          n²

Table 1 - Area-time complexity for various multipliers

4. VLSI IMPLEMENTATION OF A 16-BIT MULTIPLIER

A 16-bit multiplier implementation is based on the layout described in the previous sections. Some details are discussed in this section. The multiplier is implemented through the MPC of the Research Team of Computer Architecture of the IMAG Laboratory. The design rule chosen is the one commonly adopted, as presented in the book by Mead and Conway.

Section 2 gave the skeleton of the layout. The remaining important parts are the design of the adder (a fast Brent-Kung adder or a simpler, slower carry propagate adder), the basic multiplier, and the register-to-register data flow timed by a two-phase non-overlapping clock φ₁ and φ₂. Some practical evaluation will also be discussed.

We fix the terminal level of the multiplier at n = 2. A 2-bit multiplier can easily be implemented using a PLA or simple logic. The computing time and area necessary to do a 2-bit multiplication are then constants depending on the technology and circuit optimization. Let this time be τ_mult and assume such a computation can be completed between two successive φ₁ and φ₂ of the two-phase clock, i.e. τ_mult < T/2 where T is the period of either φ₁ or φ₂.
We first give the design of the carry propagate adder, since it is simpler. An n-bit carry propagate adder consists of n full adders, each of which can be implemented by simple logic. At each level, say n, the n-bit and 2n-bit adders are formed by cascading the full adders. The time and area required by a full adder are also constants depending on the technology. Let this time be τ_add and assume it is less than the time between two successive φ₁ and φ₂, i.e. τ_add < T/2. So an n-bit addition requires n consecutive φ₁ and φ₂ clocks and takes area O(n). Let T(n) be the time, in terms of the total number of clock pulses of φ₁ and φ₂, necessary to perform a multiplication at level n, including the first clock pulse φ₁ for input. Hence we have:

$$T(2) = 1; \qquad T(n) = 1 + T(n/2) + n + 2n = T(n/2) + 3n + 1 \qquad (9)$$
$$\text{or} \quad T(n) = 6n + \log_2 n - 12$$

T(2) = 1 means that a 2-bit multiplication requires only a single clock pulse, as indicated in the previous paragraph. T(n/2) is the time for inputting data and computing at level n/2, n + 2n is the time for the two n-bit and 2n-bit additions, and the 1 is for inputting data at level n. In the expression for T(n), the term 6n accounts for doing the computations, whereas the term log₂n accounts for the input pipeline. This exact evaluation of time will later be used for comparison with the results obtained using the Brent-Kung adder, presented hereafter.
The Brent-Kung adder [BreKun 79B] modifies the slow carry propagation of ordinary adders, and it allows all the carries to be computed in parallel in O(log n) time. Let $a_i$, $b_i$, $c_i$, $i = 1, \ldots, n$, be the two n-bit inputs and the n carries respectively, and $s_i$, $i = 1, \ldots, n+1$, be the n+1 outputs. An ordinary adder is then defined by:

$$c_0 = 0; \quad s_i = a_i \oplus b_i \oplus c_{i-1}; \quad c_i = (a_i \wedge b_i) \vee ((a_i \oplus b_i) \wedge c_{i-1}), \quad i = 1, \ldots, n; \quad s_{n+1} = c_n \qquad (10)$$

where $\oplus$, $\wedge$, $\vee$ represent modulo-2 addition, logic AND, and logic OR respectively. In the Brent-Kung scheme, one further defines $g_i$, $p_i$, $i = 1, \ldots, n$, as

$$g_i = a_i \wedge b_i; \qquad p_i = a_i \oplus b_i, \qquad i = 1, \ldots, n \qquad (11)$$

Then a binary adder can be written as equation (12) and is shown in figure 2:

$$c_0 = 0; \quad s_i = p_i \oplus c_{i-1}; \quad c_i = g_i \vee (p_i \wedge c_{i-1}), \quad i = 1, \ldots, n \qquad (12)$$

All the $g_i$, $p_i$'s can be found in constant time, and this leaves the problem of efficiently constructing all the $c_i$'s from the $g_i$, $p_i$'s.

The Brent-Kung scheme defines an operator $\circ$ by:

$$(g, p) \circ (\tilde{g}, \tilde{p}) = (g \vee (p \wedge \tilde{g}),\; p \wedge \tilde{p}) \qquad (13)$$

where $g$, $p$, $\tilde{g}$, $\tilde{p}$ are any boolean variables. It can be shown that the operator $\circ$ is associative, and so the $c_i$, $i = 1, \ldots, n$, can be computed by $c_i = G_i$, where

$$(G_1, P_1) = (g_1, p_1); \qquad (G_i, P_i) = (g_i, p_i) \circ (G_{i-1}, P_{i-1}) = (g_i, p_i) \circ (g_{i-1}, p_{i-1}) \circ \cdots \circ (g_1, p_1) \quad \text{for } i = 2, \ldots, n \qquad (14)$$

The associative property of $\circ$ allows us to embed the processing elements for $\circ$ in a binary tree structure of depth O(log n). For example,

$$(G_{16}, P_{16}) = (g_{16}, p_{16}) \circ \cdots \circ (g_9, p_9) \circ (g_8, p_8) \circ \cdots \circ (g_1, p_1)$$

can be split into a left part $(g_{16}, p_{16}) \circ \cdots \circ (g_9, p_9)$ and a right part $(g_8, p_8) \circ \cdots \circ (g_1, p_1)$, and so on. An example for n = 32 computing $(G_{32}, P_{32})$, $(G_{16}, P_{16})$, $(G_8, P_8)$, $(G_4, P_4)$, $(G_2, P_2)$ and $(G_1, P_1)$ is shown in the lower part of figure 3 (levels 1-5). Each ○ is a processing element performing $\circ$ and each △ is simply a gate for transmitting data and driving the next stages. The general problem of computing all the $(G_i, P_i)$ for $i = 1, \ldots, n$ is similar to such a tree structure, except that one more tree structure in the reverse order is superimposed. An example for n = 32, used to implement our 16-bit multiplier, is shown in figure 3. At the final outputs, we keep only the $G_i$'s, which are the carries $c_i$.
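As an illustration (the bit-list representation and function name are ours; the scan below is done serially, whereas the hardware evaluates it in the O(log n)-depth tree of figure 3), the whole adder of equations (11)-(14) in Python:

    def brent_kung_add(a, b):
        """Add two n-bit numbers given as bit lists (LSB first); returns n+1 bits."""
        n = len(a)
        g = [ai & bi for ai, bi in zip(a, b)]    # generate,  equation (11)
        p = [ai ^ bi for ai, bi in zip(a, b)]    # propagate, equation (11)
        op = lambda x, y: (x[0] | (x[1] & y[0]), x[1] & y[1])   # equation (13)
        GP = [(g[0], p[0])]                      # (G_1, P_1)
        for i in range(1, n):                    # prefix scan by associativity, eq. (14)
            GP.append(op((g[i], p[i]), GP[-1]))
        c = [G for G, _ in GP]                   # c_i = G_i
        s = [p[0]] + [p[i] ^ c[i-1] for i in range(1, n)]       # equation (12), c_0 = 0
        return s + [c[-1]]                       # s_{n+1} = c_n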
We see that each processing element ○ and each gate △ drives at most two inputs; they can be treated as basic computing units having constant time depending only on the technology. So can the output modulo-2 adders, the input modulo-2 adders and the input AND gates. We assume that all these time constants are less than the time T/2 defined previously.

It is obvious that the adder requires time O(log n) and area O(n log n), since it has width O(n) and depth O(log n). The exact total number of levels of the carry-computing tree is $2\log_2 n - 1$. Hence it takes a total of $(2\log_2 n - 1) + 2$ consecutive φ₁ and φ₂ clock pulses to perform an n-bit addition. Similarly to the exact time evaluation for the previous adder, equation (9) becomes:

$$T(2) = 1; \qquad T(n) = 1 + T(n/2) + (2\log_2 n + 1) + (2\log_2(2n) + 1) = T(n/2) + 4\log_2 n + 5 \qquad (15)$$
$$\text{or} \quad T(n) = 2\log_2^2 n + 7\log_2 n - 8$$

In terms of the number of system clocks for the two types of adders used, the multiplication time T(n) is shown in Table 2. We see that for n ≥ 8, the multiplier using the Brent-Kung adder has a faster running time, as predicted.
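The closed forms can be evaluated directly (illustrative helper names):

    from math import log2

    def t_carry_propagate(n):      # equation (9):  T(n) = 6n + log2(n) - 12
        return 6 * n + int(log2(n)) - 12

    def t_brent_kung(n):           # equation (15): T(n) = 2 log2(n)^2 + 7 log2(n) - 8
        m = int(log2(n))
        return 2 * m * m + 7 * m - 8

    for n in (2, 4, 8, 16, 32):
        print(n, t_carry_propagate(n), t_brent_kung(n))   # cf. Table 2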
The timing diagram of a 16-bit multiplier using the Brent-Kung adder is shown in figure 4. One should note the regular φ₁, φ₂ timing at the different levels, which is straightforward to implement.

n                        2     4     8     16    32    64
carry propagate adder    1     14    39    88    185   382
Brent-Kung adder         1     14    31    52    77    106

Table 2 - Multiplication time T(n)



CONCLUSION

We have presented a recursive, regular VLSI layout for an n-bit multiplier, for any n being a power of 2. This method can be applied to other computations which have a recursive structure or can traditionally be solved by divide-and-conquer; this includes, for example, Batcher's bitonic sort, matrix multiplication, and modular arithmetic.

This layout has a fast running time of $O(\log^2 n)$ and a simplicity suited to direct VLSI implementation. It is of greater interest than others which are slower or too complicated to realize. To further improve the $AT^2$ measure based on this layout, one needs to determine whether or not it can be embedded in area $o(n^2)$ while having adders of O(log n) time.

ACKNOWLEDGEMENT

The author would like to thank Prof. F. ANCEAU and the members of the Research Team of Computer Architecture of the IMAG Laboratory in Grenoble for providing facilities for this research, and Mme Chaland for her help in typing this manuscript.

REFERENCES

[AbeAnd 80] H. ABELSON & P. ANDREAE, Information transfer and area-time trade-offs for VLSI multiplication, CACM, vol. 23, January 1980, pp. 20-23.
[Aho & al 74] A. AHO, J. HOPCROFT & J. ULLMAN, The design and analysis of computer algorithms, Addison-Wesley, 1974.
[BreKun 79] R.P. BRENT & H.T. KUNG, The area-time complexity of binary multiplication, Technical Report, Dept. of Computer Science, Carnegie-Mellon University, July 1979.
[BreKun 79B] R.P. BRENT & H.T. KUNG, A regular layout for parallel adders, Technical Report CMU-CS-79-131, Dept. of Computer Science, Carnegie-Mellon University, June 1979.
[BreKun 80] R.P. BRENT & H.T. KUNG, The chip complexity of binary arithmetic, Proc. 12th Annual ACM Symposium on Theory of Computing, ACM, May 1980, pp. 190-200.
[Jac 68] L.B. JACKSON, S.F. KAISER & H.S. McDONALD, An approach to the implementation of digital filters, IEEE Trans. Audio and Electroacoustics, AU-16, September 1968, pp. 413-421.
[Lyo 76] R.F. LYON, Two's complement pipeline multipliers, IEEE Trans. Comm., COM-24, 4, April 1976, pp. 418-425.
[MeaCon 80] C.A. MEAD & L.A. CONWAY, Introduction to VLSI systems, Addison-Wesley, Reading, Mass., 1980.
[PreVui 79] F.P. PREPARATA & J. VUILLEMIN, The cube-connected cycles: a versatile network for parallel computation, Proc. 20th Annual IEEE Symposium on Foundations of Computer Science, Puerto Rico, October 1979.
[PreVui 80] F.P. PREPARATA & J. VUILLEMIN, Area-time optimal VLSI network based on the cube-connected cycles, INRIA Report no. 13, Rocquencourt, 1980.

[Tho 79] C.D. THOMPSON, Area-time complexity for VLSI, Proc. of the 11th Annual ACM Symposium on the Theory of Computing (SIGACT), May 1979, pp. 81-88.
[Tho 80] C.D. THOMPSON, A complexity theory for VLSI, PhD Thesis, Dept. of Computer Science, Carnegie-Mellon University, Pittsburgh, 1980.
[Vui 80] J. VUILLEMIN, A combinatorial limit to the computing power of VLSI circuits, Proc. of the 21st Symposium on Foundations of Computer Science, Syracuse, NY, October 1980, pp. 294-300.

FIGURE 1  RECURSIVE LAYOUT OF AN n-BIT MULTIPLIER (n = 8, NOT TO SCALE)

FIGURE 2  A BRENT-KUNG ADDER

FIGURE 3  COMPUTATION OF ALL THE CARRIES $C_i$ FOR n = 32

FIGURE 4  TIMING DIAGRAM FOR n = 16 (USING BRENT-KUNG ADDER)


VLSI Implementations of a Reduced
Instruction Set Computer

Daniel T. Fitzpatrick, Manolis G.H. Katevenis, David A.
Patterson, Zvi Peshkess, Robert W. Sherburne, John K.
Foderaro, Howard A. Landman, James B. Peek, Carlo
H. Sequin, and Korbin S. Van Dyke
University of California at Berkeley
Computer Science Division/EECS Department
Berkeley, California 94720

1. INTRODUCTION

A general trend in computers today is to increase the complexity of architectures commensurate with the increasing potential of implementation technologies. Consequences of this complexity are increased design time, more design errors, inconsistent implementations, and the delay of single-chip implementation[7]. The Reduced Instruction Set Computer (RISC) Project investigates a VLSI alternative to this trend. Our initial design is called RISC I.

A judicious choice of a small set of the most often used instructions, combined with an architecture tailored to efficient execution of this set, can yield a machine of surprisingly high throughput. In addition, a single-chip implementation of a simpler machine makes more effective use of limited resources such as the number of transistors, area, and power consumption of present-day VLSI chips[6]. Simplicity of the instruction set leads to a small control section, a comparatively short machine cycle, and a reduced design cycle time.
Students taking part in a multi-term course sequence designed two different NMOS versions of RISC I. The "Gold" group (Fitzpatrick, Foderaro, Peek, Peshkess, and Van Dyke) designed a complete 32-bit microprocessor, currently being fabricated. The "Blue" group (Katevenis and Sherburne) started from the same basic organization but introduced a more sophisticated timing scheme in order to shorten the machine cycle and also reduce chip area. At present, only the data path of this more ambitious design has been completed. The chips were designed using only horizontal and vertical lines ("Manhattan" design), with the simple and scalable Mead-Conway design rules (fabrication: λ = 2 microns, no buried contacts).

At the outset of the design of RISC I we defined the following goals and constraints: (a) find a reasonable compromise between high performance for high-level language programs and a simple, single-chip implementation; (b) make the size of all instructions equal to one word and execute all instructions in one machine cycle; (c) emphasize register-oriented instructions and restrict memory access to the LOAD and STORE instructions. The resulting architecture has 31 instructions in two formats, uses 32-bit addresses, and supports 8-, 16-, and 32-bit data.
The chip area saved by the simplicity of the control circuitry was devoted to a very large set of 32-bit registers. This permits the processor to allocate a new set of registers for each procedure call and thus avoids the overhead of saving registers in memory. By overlapping the register windows, parameters may be passed to a procedure by simply changing a pointer.

This so-called overlapped register window scheme[1] is largely responsible for the surprisingly good performance obtained in simulation of high-level language programs. Simulations of benchmark programs written in 'C' indicate that RISC I can run faster than many commercial minicomputers. Table 1 shows the size and execution time of six C programs on RISC I, assuming a machine cycle of 400 nanoseconds and 250 nsec instruction fetches. Also in the table are the VAX 11/780, a 32-bit Schottky-TTL minicomputer with a 200 ns microcycle time, and the Z8002, a 16-bit NMOS microprocessor with a microcycle time of 250 ns. Even though the Z8002 uses only 16-bit addresses and data while RISC uses 32-bit addresses and data, RISC programs are typically 10% larger while running about four times faster. The byte-variable length of VAX instructions reduces program size by about a third; on the other hand, every C program that we have run on the RISC simulator has been faster than on the VAX.

In addition to good performance, in this paper we show that the design of RISC I also was several times faster and required only one-fifth the manpower of comparable machines. The most visible impact of the reduced instruction set is that the area dedicated to control logic has dropped from 50% in typical commercial microprocessors to only 6% in RISC I.

Table 1.
RISC Program Size and Execution Times Relative to a VAX 11/780 and a Z8000

                 Size (bytes)                               Execution time (secs)
Program          RISC    VAX    VAX/   Z8000   Z8000/       RISC   VAX    VAX/    Z8000   Z8000/
Name                            RISC           RISC                       RISC            RISC
Acker            208     120    0.58   238     1.15         3.2    5.1    1.8     8.8     2.8
qsort            144     436    0.88   848     1.01         0.8    1.8    2.3     4.7     5.9
puzzle(sub)      2488    11168  0.88   1812    0.7          4.7    9.5    2.0     19.2    4.2
puzzle(ptr)      1480    1700   0.88   1856    0.7          3.2    4.0    1.3     7.5     1.3
sed              17588   14936  0.83   17500   1.01         5.1    5.7    1.1     22.2    4.4
towers           132     100    0.76   242     1.82         8.8    12.2   1.8     28.7    4.2

Average          3883    3060   0.7±.1 3849    1.1±.4       4.0    6.4    1.7±.4  15.2    4.0±1.2

2. MICRO-ARCHITECTURE

The simplicity and regularity of RISC permit most instruction executions to follow the same basic pattern: (1) read two registers, (2) perform an operation on them, and (3) store the result back into a register. Jump, call, and return instructions add a register (possibly the PC) and an offset and store the result into the appropriate PC latch. The load and store instructions violate the original constraint: in order to allow enough time for access of the main memory, they add the index register and immediate offset during the first cycle, and perform the memory access during an additional cycle. In all cases, while the processor is executing the first cycle of an instruction, the next instruction is fetched from memory.

The micro-architectures of the two implementations are tailored according to the above characteristics. The CPU can be subdivided naturally into the following functional blocks: the register file, the ALU, the shifter, a set of Program Counter (PC) registers, the Data I/O latches, the Program Status Word (PSW) register, and control (which contains the instruction register, instruction decoder, and internal clock circuits). Since two operands are required simultaneously, the register file needs at least two independent busses and a two-port cell design. For speed, the registers are read with dynamically precharged bit lines. This requires the following basic timing sequence: (1) register read, (2) arithmetic/logic/shift operations, (3) register write, and (4) bus precharge for the next read. The cycle time is determined by this sequence of operations. For the price of a third bus, phase (4) can be eliminated: as the result is written back into the register file by this extra bus, the two read busses are precharged for the following read phase. This simple but robust 3-phase scheme has been adopted in the Gold processor. The basic organization with busses A, B (read only) and C (write only) is shown in Figure 1. It permits simple instruction fetch and execution.

During the progress of the Gold RISC I design, it became apparent that a three-bus register cell incurred a significant area penalty. Since a large fraction of the chip area is devoted to the register file, more attention was focused on the design of a compact bit cell. The classic six-transistor static RAM cell was chosen for its compactness in the second, more ambitious Blue chip. Reading is accomplished by selectively discharging the precharged bit line busses. Contrary to commercially available static RAMs, no sense amplifiers are used, yielding a speed penalty. However, a two-port reading capability is obtained. Writing is accomplished by putting both the data and its complement onto the two busses, as in a typical static RAM. Before reading, the busses must be precharged for proper operation.

The Blue RISC I design (Figure 2) was based on this two-bus, two-port register cell. The reduced cell size allowed a significant performance improvement due to the shorter RC delay in the polysilicon control lines running across the data path. The overall chip size was also considerably reduced.
Further improvements were made by overlapping the register write with the execution of the following instruction. The result of an operation is kept in a temporary latch (DST), and is only written into the register file during the operation phase of the next cycle. Special "internal forwarding" circuitry takes care of the instructions that use the result of their immediately preceding one. In effect, each instruction now requires three cycles: (1) instruction fetch and decode; (2) register read, operate, and temporary latching of the result; (3) write result into the register file. However, in the Blue design, these three operations are pipelined so that a new instruction begins each cycle (except for loads/stores).

Figure 1: The Gold Data-Path.

Figure 2: The Blue Data-Path.
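A toy illustration (not the chip logic; the stage and instruction names are ours) of how the three operations overlap so that one instruction completes per cycle:

    # print which instruction occupies which pipeline stage in each cycle
    stages = ("fetch/decode", "read/operate", "register-write")
    instrs = ["i%d" % k for k in range(5)]
    for cycle in range(len(instrs) + len(stages) - 1):
        active = [(instrs[cycle - s], stages[s]) for s in range(len(stages))
                  if 0 <= cycle - s < len(instrs)]
        print("cycle", cycle, active)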
Both designs multiplexed the address and data pins, as we could not afford to use 64 separate lines with current packaging technology. Power consumption for the Gold chip is estimated to be between 1.2 and 1.9 Watts.

3. CHIP ANALYSIS

We have analyzed the completed Gold chip and the data path of the Blue chip. Table 2 shows how the chip resources are allocated among the various functional blocks in the Blue and Gold designs (figures for Blue control and I/O are estimates). These resources are measured in millions of square lambda, thousands of transistors, and thousands of rectangles. We can make several observations based on this table:

(1) "Control" is only 6 to 8% of the total chip. Even if there were no registers, it would still be only 12%. Section 5 compares this important result to commercial microprocessors and discusses its significance.

(2) The Blue data path is considerably more compact than the Gold data path. The register file pitch (42λ in Blue, 77λ in Gold) determined the height of the data path in both versions; the rest of the modules were designed accordingly. The compactness of the cross-coupled static RAM cell allowed the 138 32-bit registers of the Blue design to occupy about half the area of the 78 32-bit registers of the Gold design.

(3) SCAN IN/SCAN OUT (SISO) is less than 5% of the chip. SISO is a technique which improves chip testability by allowing access to each state bit in a module. The flip-flops are connected together as a long shift register, allowing serial reading and writing. The Gold chip has complete SISO on the shifter, ALU, control, and some of the PC registers. As we had spare pins we used 11 pins for SISO (4% of chip area); we could have used fewer.
Table 2.
Area, transistors, and rectangles per Blue and Gold functional block.
[The body of Table 2 is not legibly recoverable in this reproduction. It breaks down area (millions of square lambda), transistors (thousands), and rectangles (thousands), with percentages, for each functional block of the two designs: the data path (registers, register decoders, shifter, ALU, PC's, data I/O logic, scan in/scan out), control (PLA's, latches, routing, scan in/scan out), I/O (routing, pads, scan in/scan out), and unused area. The legible totals give the Gold CPU 20.00 million square lambda, 44.42K transistors, and 1138.3K rectangles; the Blue CPU roughly 13.70 million square lambda, ~40.00K transistors, and ~480.0K rectangles.]

(4) The number of transistors relative to that of rectangles, in each functional block, is strongly correlated; there exist about ten rectangles for each transistor.

(5) Area utilization varies strongly. The highly regular data path, consisting mainly of carefully optimized cells, contains more than 90% of the transistors and rectangles but only half to two-thirds of the area. On the other hand, the I/O area, which is dominated by random interconnections, contains 20 to 30% of the area but less than 4% of the transistors or rectangles.

4. TOOLS

There is no doubt in our minds that the key to successful VLSI design lies in appropriate software. While we have plans for a sophisticated design environment[5], for the moment we have to work with a rather small subset. We used more than a dozen programs to design our chips. Among those, we felt that the following six tools were invaluable - we cannot imagine how we would have been able to get this far without them:

CAESAR   color graphics editor        John Ousterhout   U.C. Berkeley
CIFPLOT  plot of mask layers          Dan Fitzpatrick   U.C. Berkeley
MEXTRA   Manhattan circuit extractor  Dan Fitzpatrick   U.C. Berkeley
SLANG    multi-level simulator        John Foderaro     U.C. Berkeley
MOSSIM   switch-level simulator       Bryant/Terman     M.I.T.
DRC      layout rules checker         Clark Baker       M.I.T.
MKPLA    PLA generator                Howard Landman    U.C. Berkeley

Other tools used to some degree and which were particularly helpful for our project include:

PRESTO   PLA minimizer                Fang/Newton       U.C. Berkeley
EQNTOTT  PLA equation translator      Robert Cmelik     U.C. Berkeley
SPICE    circuit simulator            Pederson/Newton   U.C. Berkeley
STAT     electrical rules checker     Forest Baskett    Stanford/Xerox PARC
POWEST   power estimator              Robert Cmelik     U.C. Berkeley

A key factor was the glue that held all these various tools together and produced a cohesive design environment; this function was provided by the UNIX operating system (4th Berkeley Software Distribution) [2] running on a DEC VAX 11/780.

We started the designs with an ISPS description† of RISC I and a block diagram similar to Figure 1. The logic, circuits, and initial layouts were designed on paper, and then entered into Caesar. As the designers became more comfortable with Caesar, they used it to do all layout. Caesar converts the graphical description into CIF (the format needed for fabrication submission, as well as for some of our tools, like CIFPLOT and DRC). The information necessary to run STAT, MOSSIM, and POWEST was extracted from this same CIF by MEXTRA. After the bottom-level modules were designed, we used SLANG to completely describe the chip at a mixture of functional and logical levels. We then ran RISC diagnostic programs on this description to uncover errors. The SLANG description was also used to specify many of the remaining connections in the chip and to drive the PLA tools to automatically produce the PLA's for RISC. Howard Landman acted as a "roving critic", scanning for errors that were overlooked by the design tools.
† The ISPS description was not very useful, because the ISPS simulator does not run on the VAX (the machine we use to do our design); also, while useful for architecture descriptions, ISPS is awkward for describing implementations.
The final step was to use SLANG and MOSSIM to compare the original description with the final masks. SLANG was used to interactively compare, every half clock phase, the values of hundreds of nodes in both the SLANG description simulation and the MOSSIM simulation. We ran about a dozen diagnostic programs and uncovered several errors.

The submitted Gold design successfully passed the checks provided by all of the tools.

5. CONTROL AREA, DESIGN TIME, AND REGULARITY

Table 3 compares several design metrics for RISC to those of some commercial microprocessors. The information for the Z8000 and MC68000 comes from [1]; the information on the 432 comes from [3], [4], and [10]. The data provides quantitative support to some of the points made earlier.

The Gold chip is larger than most University (Mead-Conway) projects, but still comparable to the latest microprocessors. It is about 6% larger than the 43203; on the other hand, the Blue chip may be 25% smaller than the 43203. The most significant difference is that the simple instruction set of RISC I leads to a much smaller control section than that of the comparison processors; control occupies less than 10%, as compared to a more typical value of around 50% for the other chips. When comparing absolute areas one finds that the smallest control section is still a factor of 5 larger than RISC's.

The smaller control section leaves correspondingly more area for the structured data path and, in particular, for the very regular and compact register file. Besides making effective use of the silicon area, this also increases the overall regularity of the chip. Lattin[4] defined a "regularity factor" as the total number of devices on the chip (excluding ROM's), divided by the number of drawn transistors. The table entries for the number of devices in the Z8000 and MC68000 are estimates, but the entries for the 432 and RISC are actual measurements. By this measure, RISC is 2 to 5 times more "regular".

The increased chip regularity was a key factor leading to a short design cycle. The elapsed time was considerably shorter for RISC; also, the effort spent (man-months) was about a factor of five less. Control was by far the most time-consuming part of the design and layout. Since the control section was reduced from half the chip to less than 10%, a significant reduction in design effort resulted.

Another factor in the reduced design cycle is our integrated design environment. All of the various tools, which performed graphic editing, check-plotting, design rule checking, layout rule checking, architectural simulation, and switch-level simulation, were at the fingertips of the designers and ran on the same machine. This reduced the overhead associated with the usage of such tools. Special credit is deserved by Caesar, the Manhattan-based graphic layout editor. Caesar was developed by Ousterhout[6] in close interaction with the chip designers. Bugs were corrected and designer wishes implemented, sometimes overnight. This responsiveness has produced a superb tool that is well-liked by the users, since it dramatically improves their layout productivity.

Other factors affecting layout time were the use of the simplified Mead-Conway design rules, the restriction to Manhattan features, and (for the Gold design) the concentration on correct function rather than highest performance (SPICE simulations were used only in some of the most critical paths; however, extensive switch-level simulation was performed). Certainly, another portion of the reduced design time is due to the fact that, as academics, we only had to deal with a set of rather loose, self-imposed constraints. For example, our master clock is generated off-chip, and the Gold chip has a rather poor external bus interface. The selection of op-codes was kept flexible to the very end; they were

Table 3.
Design metrics for Z8000, MC68000, Intel iAPX-432, and RISC I.
[The body of Table 3 is not legibly recoverable in this reproduction. For the Z8000, MC68000, the three iAPX-432 chips (43201, 43202, 43203), and the RISC I Gold and Blue chips, it tabulates: total devices, total devices less ROM, drawn devices (a), regularization factor, area and metal pitch, size and percentage of control (b, c), elapsed time to first silicon in months (d), and design and layout time in man-months (e). Among the legible entries are a total of 44K devices and 6% control area for the Gold chip.]
(a) There are two ways to count drawn transistors; the pessimistic approach counts every transistor in a cell even if it is derived from simple modifications to a basic cell. The optimistic approach only counts the transistors that were changed. For the Gold chip the difference is the transistor count for the register decoders. The optimistic count saves 433 drawn transistors, thereby increasing the regularity factor.
(b) We estimated these sizes from the photomicrographs of the commercial chips.
(c) Data provided by Rattner of Intel [10].
(d) We counted elapsed time from the beginning of the first class to the end of the last class, plus three months which should be the time for fabrication (we hope).
(e) Since the designers also did layout this is a somewhat fuzzy distinction. All work before 1/1/81 is considered design and we have included circuit design as part of layout.

modified the day before the design was submitted for mask generation! An industrial product can rarely afford such luxury.

6. FINAL COMMENTS

The RISC Project has had a synergistic effect on research at Berkeley in architecture, VLSI, and CAD. Often, useful tools were created in response to specific needs. For example, a special extractor for Manhattan geometry was formulated, since the older extractor, able to handle general geometry, would have taken too long to find all 44,000 transistors of the Gold chip. As a result of this synergism our design environment has experienced a dramatic improvement within the last six months.

The gains in performance of the design tools from their restriction to Manhattan designs were well worth the inconveniences in layout; we can find only small areas in each chip where non-Manhattan geometry could save space. While we realize that more work is necessary to turn RISC into a full-fledged microcomputer, we also believe that the most difficult and time-consuming portion of that task has been completed. These results can be duplicated by industry; reduction in elapsed design time, reduction in manpower, and high performance are available for those who are willing to take calculated risks.

7. ACKNOWLEDGEMENTS
The RISC project was aided by several people at Berkeley and other places.
We would like to thank them all, but give special thanks to a few:
John Oullterbout created Caesar, the main interface of the designers. The relia-
bility and quality of this graphics editor and his responsiveness to our needs is a
major reason for our reduced design time. Richard Newton allowed us to use his
graduate class to resolve issues related to RISC. We also want to thank others in
the Berkeley community for the use of their tools: Bob Cmelik, Sheng Fang,
Richard Newton, and Donald Pederson. We would especially like to thank the
people in the ARPA-VLSI community that shared their tools: Randy Bryant and
Chris Terman for the switch-level simulator /lOSSIM and Clark Baker for the
layout rule checker DRC. In addition to these tools from MIT, from Stanford we
received STAT, a static electrical rules checker created by Forest BaBkett. We
Iratefully acknowledge helpful discussions with Osamu Tami:.... in the areas of
MOS circuit design and processing. lim Beck, Bob Cmelik, and Robert Byerle
provided valuable information and ideas on testing the chip. We would also like
to thank the visitors from industry who gave us valuable suggestions on our
designs: Le. Credele from Motorola, Dick Lyon from Xerox PARC, and Peter Stoll
from Intel. Thanks go to Bob Fabry, Richard Fateman, Bill Joy, and Bob Kridle
for providing the computing environment necessary to complete our designs.
We would like to thank Danny Cohen and Lee Richardson of the MOSIS group (U.
of Southern California - Information Sciences Institute), and Alan Bell and Al
Pateth of Xerox Palo Alto Research Center, for fabricating our chips. Finally, we
would like to thank Duane Adams and DARPA for providing the resources that
allow universities to attempt high-risk projects.
This research was sponsored by the Defense Advanced Research Projects
Agency (DoD), ARPA Order No. 3803, and monitored by Naval Electronic Systems
Command under Contract No. N00039-78-G-0013-0004.

8. REFERENCES
1. Frank, E.H. and Sproull, R.F., "An Approach to Debugging Custom
Integrated Circuits," Carnegie-Mellon Computer Science Research Review
1979-80, pp. 21-36 (July, 1981).

2. Joy, W.N., Fabry, R.S., and Sklower, K., Seventh Edition, Virtual VAX-11 Ver-
sion, Computer Science Division, Dept. of EECS, U.C. Berkeley (June, 1981).
3. Lattin, W.W., Bayliss, J.A., Budde, D.L., Colley, S.R., Cox, G.W., Goodman, A.L.,
Rattner, J.R., Richardson, W.S., and Swanson, R.C., "A 32b VLSI Micromain-
frame Computer System," Proc. IEEE International Solid-State Circuits
Conference, pp. 110-111 (February, 1981).
4. Lattin, W.W., Bayliss, J.A., Budde, D.L., Rattner, J.R., and Richardson, W.S., "A
Methodology for VLSI Chip Design," Lambda, pp. 34-44 (Second Quarter,
1981).
5. Newton, A.R., Pederson, D.O., Sangiovanni-Vincentelli, A.L., and Séquin, C.H.,
"Design Aids for VLSI: The Berkeley Perspective," IEEE Transactions on Cir-
cuits and Systems (July 1981).
6. Ousterhout, J., "Caesar: An Interactive Editor for VLSI Circuits," VLSI
Design II(4) (Fourth Quarter, 1981), to appear.
7. Patterson, D.A. and Ditzel, D.R., "The Case for the Reduced Instruction Set
Computer," Computer Architecture News 8(6), pp. 25-33 (15 October 1980).
8. Patterson, D.A. and Séquin, C.H., "Design Considerations for Single-Chip
Computers of the Future," IEEE Transactions on Computers C-29(2), pp.
108-116 (February 1980). Joint Special Issue on Microprocessors and Micro-
computers.
9. Patterson, D.A. and Séquin, C.H., "RISC I: A Reduced Instruction Set VLSI
Computer," Proc. Eighth International Symposium on Computer Architec-
ture, pp. 443-457 (May 1981).
10. Rattner, J.R., Private Communication, August 1981.
MIPS: A VLSI Processor Architecture

John Hennessy, Norman Jouppi, Forest Baskett, and John Gill
Stanford University
Departments of Electrical Engineering and Computer Science

1 Introduction

MIPS (Microprocessor without Interlocked Pipe Stages) is a general purpose processor
architecture designed to be implemented on a single VLSI chip. The main goal of the design is high
performance in the execution of compiled code. The architecture is experimental since it is a radical
break with the trend of modern computer architectures. The basic philosophy of MIPS is to present
an instruction set that is a compiler-driven encoding of the microengine. Thus, little or no decoding is
needed and the instructions correspond closely to microcode instructions. The processor is pipelined
but provides no pipeline interlock hardware; this function must be provided by software.
The MIPS architecture presents the user with a fast machine with a simple instruction set. This
approach is currently in use within the RISC project at Berkeley [4]; it is directly opposed to the
approach taken by architectures such as the VAX. However, there are significant differences between
the RISC approach and the approach used in MIPS:

1. The RISC architecture is simple both in the instruction set and the hardware needed to
implement that instruction set. Although the MIPS instruction set has a simple hardware
implementation (i.e., it requires a minimal amount of hardware control), the user level
instruction set is not as straightforward, and the simplicity of the user level instruction set is
secondary.
2. The thrust of the RISC design is towards efficient implementation of a straightforward
instruction set. In the MIPS design, high performance from the hardware engine is a primary
goal, and the microengine is presented to the end user with a minimal amount of interpretation.
This makes most of the microengine's parallelism available at the instruction set level.
3. The RISC project relies on a straightforward instruction set and straightforward compiler
technology. MIPS will require more sophisticated compiler technology and will gain significant
performance benefits from that technology.

MIPS is designed for high performance. To allow the user to get maximum performance, the
complexity of individual instructions is minimized. This allows the execution of these instructions at
significantly higher speeds. To take advantage of simpler hardware and an instruction set that easily
maps to the microinstruction set, additional compiler-type translation is needed. This compiler
technology makes a compact and time-efficient mapping between higher level constructs and the
simplified instruction set. The shifting of the complexity from the hardware to the software has
several major advantages:


The complexity is paid for only once during compilation. When a user runs his program on a
complex architecture, he pays the cost of the architectural overhead each time he runs his
program.
It allows the concentration of energies on the software, rather than constructing a complex
hardware engine, which is hard to design, debug, and efficiently utilize. Software is not
necessarily easier to construct, but the VLSI environment makes hardware simplicity important.

The design of a high performance VLSI processor is dramatically affected by the technology.
Among the most important design considerations are: the effect of pin limitations, available silicon
area, and size/speed tradeoffs. Pin limitations force the careful design of a scheme for multiplexing
the available pins, especially when data and instruction fetches are overlapped. Area limitations and
the speed of off-chip intercommunication require choices between on- and off-chip functions as well
as limiting the complete on-chip design. With current state-of-the-art technology either some vital
component of the processor (such as memory management) must be off-chip, or the size of the chip
will make both its performance and yields unacceptably low. Choosing what functions are migrated
off-chip must be done carefully so that the performance effects of the partitioning are minimized. In
some cases, through careful design, the effects may be eliminated at some extra cost for high speed
off-chip functions.
Speed/complexity/area tradeoffs are perhaps the most important and difficult phenomena to
deal with. Additional on-chip functionality requires more area, which also slows down the
performance of every other function. This occurs for two equally important reasons: additional
control and decoding logic increases the length of the critical path (by increasing the number of active
elements in the path) and each additional function increases the length of internal wire delays. In the
processor's data path these wire delays can be substantial, since they accumulate both from bus delays,
which occur when the data path is lengthened, and control delays, which occur when the decoding and
control is expanded or when the data path is widened. In the MIPS architecture we have attempted to
control these delays; however, they remain a dominant factor in determining the speed of the
processor.

2 The microarchitecture

2.1 Design philosophy

The fastest execution of a task on a microengine would be one in which all resources of the
microengine were used at a 100% duty cycle performing a nonredundant and algorithmically efficient
encoding of the task. The MIPS microengine attempts to achieve this goal. The user instruction set is
an encoding of the microengine that makes a maximum amount of the microengine available. This
goal motivated many of the design decisions found in the architecture.
MIPS is a load/store architecture, i.e. data may be operated on only when it is in a register and
only load/store instructions access memory. If data operands are used repeatedly in a basic block of
code, having them in registers will prevent redundant load/stores and redundant addressing
calculations; this allows higher throughput since more operations directly related to the computation
can be performed. The only addressing modes supported are immediate, based with offset, indexed,
or base shifted. These addressing modes may require: fields from the instruction itself, general
registers, and one ALU or shifter operation. Another ALU operation available in the last stage of
every instruction can be used for a (possibly unrelated) computation. Another major benefit derived
from the load/store architecture is simplicity of the pipeline structure. The simplified structure has a
fixed number of pipestages, each of the same length. Because the stages can be used in varying (but
related) ways, the result is that pipeline utilization improves. Also, the absence of synchronization

between stages of the pipe increases the performance of the pipeline and simplifies the hardware.
The simplified pipeline eases the handling of both interrupts and page faults (see Section 4.2).
Although MIPS is a pipelined processor it does not have hardware pipeline interlocks. The six
stage pipeline contains three active instructions at any time; either the odd or even pipestages are
active. The major pipestages and their tasks are shown in Table 1.

Table 1: Major pipestages and their functions

Stage               Mnemonic  Task

Instruction Fetch   IF        Send out the PC, increment it
Instruction Decode  ID        Decode instruction
Operand Decode      OD        Compute effective address and send to
                              memory if load or store, use ALU
Operand Store       OS        Send out operand if store
Operand Fetch       OF        Receive operand if load
Execution           EX        Execution cycle, use ALU

Interlocks that are required because of dependencies brought out by pipelining are not
provided by the hardware. Instead, these interlocks must be statically provided where they are
needed by a pipeline reorganizer. This has two benefits:

1. A more regular and faster hardware implementation is possible since it does not have the usual
complexity associated with a pipelined machine. Hardware interlocks cause small delays for all
instructions, regardless of their relationship to other instructions. Also, interlock hardware
tends to be very complex and nonregular [3, 5]. The lack of such hardware is especially
important for VLSI implementations, where regularity and simplicity are important.
2. Rearranging operations at compile time is better than delaying them at run time. With a good
pipeline reorganizer, most cases where interlocks are avoidable should be found and taken
advantage of. This results in performance better than a comparable machine with hardware
interlocks, since usage of resources will not be delayed. In cases where this is not detected or is
not possible, no-ops must be inserted into the code (a minimal sketch of such insertion follows
this list). This does not slow down execution compared to a similar machine with hardware
interlocks, but does increase code size. The shifting of work to a reorganizer would be a
disadvantage if it took excessive amounts of computation. It appears this is not a problem for
our first reorganizer.
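The following is a minimal sketch, in C, of the no-op insertion mentioned above. It assumes a
deliberately simplified hazard rule (a register written by a load may not be read by the very next
instruction) and illustrative instruction encodings; a real reorganizer would first try to move an
independent instruction into the slot, falling back to a no-op only when that fails.

    /* Minimal sketch of software interlock insertion (not the actual MIPS
       reorganizer).  Simplified hazard rule: a register written by a load
       may not be read by the instruction that immediately follows it.
       All encodings and names here are illustrative. */
    #include <stdio.h>
    #include <string.h>

    typedef struct {
        char op[8];
        int  dest;        /* register written, -1 if none */
        int  src1, src2;  /* registers read,   -1 if none */
    } Insn;

    static int is_load(const Insn *i) { return strcmp(i->op, "ld") == 0; }

    /* Copy the program into out[], inserting a no-op wherever the next
       instruction reads the destination of a load; returns the new length. */
    int insert_noops(const Insn *in, int n, Insn *out)
    {
        const Insn noop = { "no-op", -1, -1, -1 };
        int j = 0;
        for (int i = 0; i < n; i++) {
            out[j++] = in[i];
            if (is_load(&in[i]) && i + 1 < n &&
                (in[i + 1].src1 == in[i].dest || in[i + 1].src2 == in[i].dest))
                out[j++] = noop;
        }
        return j;
    }

    int main(void)
    {
        Insn prog[] = {
            { "ld",  7, 5, 1 },   /* ld (r5,r1),r7           */
            { "add", 9, 7, 6 },   /* add r7,r6,r9 - reads r7 */
        };
        Insn out[8];
        int n = insert_noops(prog, 2, out);
        for (int i = 0; i < n; i++) printf("%s\n", out[i].op);  /* ld, no-op, add */
        return 0;
    }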

In the MIPS pipeline resource usage is permanently allocated to various pipe stages. Rather
than having pipeline stages compete for the use of resources through queues or priority schemes, the
machine's resources are dedicated to specific stages so that they are 100% utilized (see Figure 1). To
achieve 100% utilization, primitive operations in the microengine (e.g., load/store, ALU operations)
must be completely packed into macroinstructions. This is not possible for three reasons:

1. Dependencies can prevent full usage of the microengine, for example when a sequence of
register loads must be done before an ALU operation or when no-ops must be inserted.
2. An encoding that preserved all the parallelism (i.e., the microcontrol word itself) would be too
large. This is not a serious problem since many of the possible microinstructions are not useful.

Figure 1: Resource Allocation by Pipestage

Time -->
   1   2   3   4   5   6   7   8   9   10
   IF  ID  OD  OS  OF  EX
           IF  ID  OD  OS  OF  EX
                   IF  ID  OD  OS  OF  EX

   Instruction memory: used during IF and ID
   Data memory:        used during OS and OF
   ALU:                reserved for use by OD and EX

3. The encoding of the microengine presented in the instruction set sacrifices some functional
specification for immediate data. In the worst case, space in the instruction word used for
loading large immediate values takes up the space normally used for a base register,
displacement, and ALU operation specification. In this case the memory interface and ALU
can not be used during the pipe stage for which they are dedicated.

Nevertheless, first results on microengine utilization are encouraging. Many instructions fully utilize
the major resources of the machine. Other instructions, such as load immediate, which use few of the
resources of the machine, would mandate greatly increased control complexity if overlap with
surrounding instructions were attempted in an irregular fashion.
MIPS has one instruction size, and all instructions execute in the same amount of time (one
data memory cycle). This choice simplifies the construction of code generators for the architecture
(by eliminating many nonobvious code sequences for different functions) and makes the construction
of a synchronous regular pipeline much easier. Additionally, the fact that each macroinstruction is a
single microinstruction of fixed length and execution time means that a minimum amount of internal
state is needed in the processor. The absence of this internal state leads to a faster processor and
minimizes the difficulty of supporting interrupts and page faults.

2.2 Resources of the microengine

The major functional components of the microengine include:

ALU resources: A high speed, 32-bit carry lookahead ALU with hardware support for multiply

and divide; and a barrel shifter with byte insert and extract capabilities. Only one of the ALU
resources is usable at a time. Thus within the class of ALU resources, functional units can not
be fully used even when the class itself is used 100%.
Internal bus resources: Two 32-bit bidirectional busses, each connecting almost all functional
components.
On chip storage: Sixteen 32-bit general purpose registers.
Memory resources: Two memory interfaces, one for instructions and one for data. Each of the
parts of the memory resource can be 100% utilized (subject to packing and instruction space
usage) because either one store or load from data memory and one instruction fetch can occur
simultaneously.
A multistage PC unit: An incrementable current PC with storage of up to two branch targets as
well as six previous PC values. These are required by the pipelining of instructions and
interrupt and exception handling.

3 The instruction set

All MIPS instructions are 32 bits. The user instruction set is a compiler-based encoding (i.e.,
code generation efficiency is used to choose alternative instructions) of the micromachine. Multiple
simple (and possibly unrelated) instruction pieces are packed together into an instruction word. The
basic instruction pieces are:

1. ALU pieces - these instructions are all register/register (2 and 3 operand formats). They all use
less than 1/2 of an instruction word. Included in this category are byte insert/extract, a two-bit
Booth's multiply step, and a one-bit nonrestoring divide step.
2. Load/store pieces - these instructions load and store memory operands. They use between 16
and 32 bits of an instruction word. When a load instruction is less than 32 bits, it may be
packaged with an ALU instruction, which is executed during the Execution stage of the
pipeline.
3. Control flow pieces - these include straight jumps and compare instructions with relative jumps.
MIPS does not have condition codes, but includes a rich collection of set conditional and
compare and jump instructions. The set conditional instructions provide a powerful
implementation for conditional expressions: they set a register to all 1's or all 0's based on one of
16 possible comparisons done during the operand decode stage. During the Execution stage an
ALU operation is available for logical operations with other booleans (a C-level analogue
appears after this list). The compare and jump instructions are direct encodings of the
micromachine: the operand decode stage computes the effective address of the branch target and
the Execution cycle does the comparison. All branch instructions have a delay in their effect of
two instructions; i.e., the next two sequential instructions are executed.
4. Other instructions - include procedure and interrupt linkage. The procedure linkage
instructions also fit easily into the micromachine format of effective address calculation and
register-register computation instructions.
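As an aside, the following C fragment is an analogue (not MIPS code) of the set conditional
mechanism described in item 3 above: a comparison yields a word of all 1's or all 0's, so combining
two conditions reduces to one ordinary bitwise ALU operation, which MIPS can perform in the
Execution stage.

    /* C-level analogue of set-conditional instructions: -(x < y) is a
       word of all 1's when the comparison holds, all 0's otherwise, so
       boolean combination becomes plain ALU logic.  Illustrative only. */
    #include <stdio.h>

    int main(void)
    {
        int a = 3, b = 5, c = 9, d = 2;

        unsigned t1 = (unsigned) - (a < b);   /* all 1's: a < b holds */
        unsigned t2 = (unsigned) - (c > d);   /* all 1's: c > d holds */

        /* (a < b) && (c > d) becomes a single bitwise AND */
        unsigned both = t1 & t2;

        printf("t1=%08x t2=%08x both=%08x\n", t1, t2, both);
        return 0;
    }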

MIPS is a word-addressed machine. This provides several major performance advantages over a
byte addressed architecture. First, the use of word addressing simplifies the memory interface since
extraction and insertion hardware is not needed. This is particularly important, since instruction and
data fetch/store are in a critical path. Second, when byte data (characters) can be handled in word
blocks, the computation is much more efficient. Last, the effectiveness of short offsets from a base
register is multiplied by a factor of four.
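To make the byte handling concrete, the following C sketch shows the kind of insert/extract
sequence (a word access plus shifts and masks) that replaces byte addressing hardware on a
word-addressed machine. The byte order within a word is an assumption made only for this example.

    /* Byte access on a word-addressed machine: one word load plus
       shift-and-mask sequences (the insert/extract ALU pieces).
       Byte 0 is taken to be the least significant byte. */
    #include <stdio.h>
    #include <stdint.h>

    uint32_t extract_byte(uint32_t word, int k)      /* k = 0..3 */
    {
        return (word >> (8 * k)) & 0xff;
    }

    uint32_t insert_byte(uint32_t word, int k, uint32_t b)
    {
        uint32_t mask = (uint32_t)0xff << (8 * k);
        return (word & ~mask) | ((b & 0xff) << (8 * k));
    }

    int main(void)
    {
        uint32_t w = 0x11223344;
        printf("%02x\n", extract_byte(w, 1));        /* prints 33       */
        printf("%08x\n", insert_byte(w, 3, 0xaa));   /* prints aa223344 */
        return 0;
    }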

MIPS does not directly support floating point arithmetic. For applications where such
computations are infrequent, floating point operations implemented with integer operations and field
insertion/extraction sequences should be sufficient. For more intensive applications a numeric co-
processor similar to the Intel 8087 would be appropriate.
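As a hint of what such integer sequences look like, the sketch below unpacks the sign, exponent,
and fraction fields of a single-precision value using only shifts and masks; a full software add or
multiply would be built from sequences of this kind. The IEEE-style field layout is assumed purely
for illustration.

    /* Field extraction flavor of software floating point: unpack a
       32-bit float into sign, exponent, and fraction with integer
       shifts and masks (IEEE-754 single layout assumed). */
    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    int main(void)
    {
        float f = -6.25f;
        uint32_t bits;
        memcpy(&bits, &f, sizeof bits);   /* reinterpret the bit pattern */

        uint32_t sign = bits >> 31;
        uint32_t expo = (bits >> 23) & 0xff;
        uint32_t frac = bits & 0x7fffff;

        printf("sign=%u exponent=%u fraction=0x%06x\n", sign, expo, frac);
        return 0;
    }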

4 Systems issues

The key systems issues are the memory system, and internal traps and external interrupt
support.

4.1 The memory system

The use of memory mapping hardware (off chip in the current design) is needed to support
virtual memory. Modern microprocessors (e.g., the Motorola 68000) are already faced with the problem that
the sum of the memory access time and the memory mapping time is too long to allow the processor
to run at full speed. This problem is compounded in MIPS; the effect of pipelining is that a single
instruction/data memory must provide access at approximately twice the normal rate (for 64K
RAMs).
The solution we have chosen to this problem is to separate the data and instruction memory
systems. Separation of program and data is a regular practice on many machines; in the MIPS system
it allows us to significantly increase performance. Another benefit of the separation is that it allows
the use of a cache only for instructions. Because the instruction memory can be treated as read-only
memory (except when a program is being loaded), the cache control is simple. The use of an
instruction cache allows increased performance by providing more time during the critical instruction
decode pipe stage.
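The simplicity of a read-only cache can be made concrete: a direct-mapped lookup needs only a tag
compare and a refill on a miss, with no dirty bits or write-back path. The following C sketch is
illustrative only; the size, the backing-store stub, and all names are assumptions, not the MIPS design.

    /* Direct-mapped, read-only instruction cache: control reduces to a
       tag compare plus a refill.  Sizes and names are hypothetical. */
    #include <stdio.h>
    #include <stdint.h>

    #define LINES 256                 /* assumed cache size, in words */

    static uint32_t tags[LINES], data[LINES];
    static int      valid[LINES];

    /* Stand-in for the instruction memory (dummy contents). */
    static uint32_t memory_word(uint32_t addr) { return addr * 3u + 1u; }

    uint32_t icache_fetch(uint32_t addr)           /* addr is a word address */
    {
        uint32_t line = addr % LINES, tag = addr / LINES;
        if (!valid[line] || tags[line] != tag) {   /* miss: refill the line */
            data[line]  = memory_word(addr);
            tags[line]  = tag;
            valid[line] = 1;
        }
        return data[line];                         /* hit path: tag compare only */
    }

    int main(void)
    {
        printf("%u\n", icache_fetch(42));   /* miss, refill */
        printf("%u\n", icache_fetch(42));   /* hit          */
        return 0;
    }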

4.2 Faults and interrupts

The MIPS architecture will support page faults, externally generated interrupts, and internally
generated traps (arithmetic overflow). The hardware necessary to handle such things in a pipelined
architecture is usually large and complex [3, 5]. Furthermore, this is an area where the lack of sufficient
hardware support makes the construction of systems software impossible. However, because the
MIPS instruction set is not interpreted by a microengine (with its own state), hardware support for
page faults and interrupts is significantly simplified.
To handle interrupts and page faults correctly, two important properties are required. First, the
architecture must ensure correct shutdown of the pipe, without executing any faulted instructions
(such as the instruction which page faulted). Most present microprocessors can not perform this
function correctly (e.g., the Motorola 68000, Zilog Z8000, and the Intel 8086). Second, the processor must
be able to correctly restore the pipe and continue execution as if the interrupt or fault had not
occurred.
These problems are significantly eased in MIPS because of the location of writes within the
pipe stages. In MIPS all instructions which can page fault do not write to any storage, either registers
or memory, before the fault is detected. The occurrence of a page fault need only turn off writes
generated by this and any instructions following it which are already in the pipe. These following
instructions also have not written to any storage before the fault occurs. The instruction preceding
the faulting instruction is guaranteed to be executable or to fault in a restartable manner even after
the instruction following it faults. The pipeline is drained and control is transferred to a general
purpose exception handler. To correctly restart execution three instructions need to be reexecuted.
A multistage PC tracks these instructions and aids in correctly executing them.

5 Software issues

The two major components of the MIPS software system are compilers and pipeline
reorganizers. The input to a pipeline reorganizer is a sequence of simple MIPS instructions or
instruction pieces generated without taking the pipeline interlocks and instruction packing features
into account. This relieves the compiler from the task of dealing with the restrictions that are imposed
by the pipeline constraints on legal code sequences. The reorganizer reorders the instructions to make
maximum use of the pipeline while enforcing the pipeline interlocks in the code. It also packs the
instruction pieces to maximize use of each instruction word. Lastly, the pipeline reorganizer handles
the effect of branch delays.
Since all instructions execute in the same time, and most instructions generated by a code
generator will not be full MIPS instructions, the instruction packing can be very effective in
reducing execution time. In fully packed instructions, e.g. a load combined with an ALU instruction,
all the major processor resources (both memory interfaces, the ALU, busses, and control logic) are used
100% of the time.
The example in Figure 2 illustrates the techniques: where possible, short instructions are
moved together into one word. As this is a very short segment, not too many compactions are
possible. Once a basic block has been treated for compaction, the effects of the delayed branch are
processed. In this case it is possible to remove the no-ops, required because of pipeline dependencies
and branch delays, completely.

Figure 2: Source code, original machine code, and reorganized machine code

Source code                  Correct code with no-ops   Reorganized code
(* A,B,C: global
   N,R,S: local *)                 ld  #A, r3
                                   ld  #B, r4
                                   ld  #C, r5
For i := 0 To N Do                 mv  #0, r1            ld N(sp),r2; mv #0,r1
                                   ld  N(sp), r2         ld #C, r5
                                   bgt r1, r2, L30       bgt r1, r2, L30
                                   no-op                 ld #B, r4
                                   no-op                 ld #A, r3
Begin
  A[i] := B[i] + C[i];       L20:  ld  (r4,r1), r6  L100: ld (r4,r1),r6
                                   ld  (r5,r1), r7       ld (r5,r1),r7; add r6,r8
                                   no-op
                                   add r7, r6, r9        add r7,r6,r9; add r7,r10
                                   st  r9, (r3,r1)       st r9,(r3,r1); add #1,r1
  R := R + B[i];                   add r6, r8
  S := S + C[i];                   add r7, r10
                                   add #1, r1
                                   ble r1, r2, L20       ble r1, r2, L100
                                   no-op                 st r8, R(sp)
                                   no-op                 st r10, S(sp)
End                                st  r8, R(sp)
                                   st  r10, S(sp)
                             L30:  ....            L30:  ....

Size:  21 words                                    12 words
Time:  120 units                                   75 units

(The original figure also marks, per instruction, the use of the two ALU
slots (OD and EX) and of the data memory.)

Note that the code with no-ops was also of reasonable quality: the loading of the array base
addresses is hoisted up, and the store of S is moved out of the loop. (Initialization of S is done outside
the segment considered.) The no-op following "ld (r5,r1), r7" is necessary to take care of the missing
pipeline interlock.

The optimal packing of instructions is obviously a hard problem (at least NP-complete);
however, we are investigating heuristics that we believe will have acceptable running times, yet will
produce nearly optimal code in most cases.
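As a toy illustration of the packing problem, the C sketch below greedily pairs short instruction
pieces into 32-bit words by first fit. The piece widths are illustrative, dependence checking is
omitted, and at most two pieces share a word; it is meant only to show the flavor of a cheap
heuristic, not the reorganizer's actual algorithm.

    /* First-fit packing of instruction pieces into 32-bit words.
       Widths are illustrative; data dependences are ignored; at most
       two pieces per word (e.g. a load piece plus an ALU piece).
       Supports up to 64 pieces. */
    #include <stdio.h>

    int pack_words(const int *width, int n)
    {
        int used[64] = {0};
        int words = 0;
        for (int i = 0; i < n; i++) {
            if (used[i]) continue;
            int room = 32 - width[i];
            for (int j = i + 1; j < n; j++) {     /* first later piece that fits */
                if (!used[j] && width[j] <= room) {
                    used[j] = 1;
                    break;                        /* at most two pieces per word */
                }
            }
            words++;
        }
        return words;
    }

    int main(void)
    {
        int width[] = { 16, 24, 16, 32, 14 };         /* e.g. loads and ALU pieces */
        printf("%d words\n", pack_words(width, 5));   /* prints: 4 words */
        return 0;
    }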

6 Present status and conclusions

The present status of the MIPS project is:

Data path components: completely designed at the transistor level; approximately 50% laid out.
The ALU has been fabricated and performs as simulated, with less than 100 ns required for
addition.
Control: a SLIM [2] program for designing the control PLA's has been written and the PLAs
have been generated.
Software: code generators have been written for both C and Pascal. These code generators
produce simple instructions, relying on a pipeline reorganizer. A first version of the pipeline
reorganizer is running, and an instruction-level simulator is also in use.

Figure 3 shows the floorplan of the chip. The dimensions of the chip are approximately 6.9 by
7.2 mm with a minimum feature size of 4 µ (i.e., λ = 2 µ). The chip area is heavily dedicated to the
data path as opposed to control structure, but not as radically as in the RISC implementation. Early
estimates of performance seem to indicate that we should achieve approximately 2 MIPS (using the
Puzzle program [1] as a benchmark) compared to other architectures executing compiler generated
code. We expect to have more accurate and complete benchmarks available in the near future.
The following chart compares the MIPS processor to the Motorola 68000 running the Puzzle
benchmark written in Pascal. The same code generator (with different target machine schema)
generated code for the program. The MIPS numbers are approximate.

                          Motorola 68000    MIPS

Transistor Count          65,000            25,000
Clock speed               8 MHz             8 MHz (1)
Data path width           16 bits           32 bits (2)
Puzzle Instruction Count  650               477
Instruction Bytes         2630              1508
Execution Time (sec)      21.3              3.6 (3)

(1) The 68000 IC technology is much better, and the 68000 performs across a wide range of environmental situations. We do
not expect to achieve this clock speed across the same range of environmental factors.

(2) This advantage is not used in the benchmark, i.e., the 68000 version deals with 16-bit objects while MIPS uses 32-bit objects.

(3) A highly optimized (by hand) C version of Puzzle runs on the VAX 11/780 in 3.5 sec.

Figure 3: MIPS Floorplan

Data Pads & MIU functions


I Address Pads & MIU functions

EDT IBusB EAT /BusA


0 I I 31
EDT isolators
InstrClass
PLA PC1 through PC6
,.....
Address Reqister and EAT isolators
C 0 PC
PC+1 PC+2
0 e
Loadl s + 2 through 5
n C
Store/ Disolacement Generator
t 0
Mise
I r d
PLA 0
R L 0 e r
I r Register File
i
Muxes. S S V
shifter. B
ALU3 etc. e
S u U
PLA r
s n s Small Constant Inout 4 bits
0 d
W e
i r
t Barrel Shifter & Control
n
ALU2 h e
PLA
a
t Pitcllmatch

h Multiply I Divide Special Registers


Booth Coo
and dition
con logic
Master
dilion
Pipeline
control
PLA ALU
PLA

(wire channel for pads below)

External Bus Control, Timing, Status, Etc.



Acknowledgements

Many people have contributed to the MIPS project. Among the most important contributors
are: Thomas Gross, pipeline reorganizer; Alex Strong, 32-bit carry lookahead ALU; Jim Clark, VLSI
circuit ideas; Cary Kornfeld, Pascal code generators; Chris Rowen, Spice simulations and multistage
PC layout; Glenn Trewitt, resource usage simulator and unified approach for exception handling;
and Wayne Wolf, redesign of the barrel shifter.
The MIPS project has been supported by the Defense Advanced Research Projects Agency
under contract # MDA903-79-C-0680.

References

1. Baskett, F. Puzzle: an informal compute bound benchmark. Widely circulated and run.

2. Hennessy, J.L. SLIM: A Language for Simulation and PLA Generation in VLSI. Tech. Rept.
195, Computer Systems Laboratory, Stanford University, 1980.

3. Lampson, B.W., McDaniel, G.A., and Ornstein, S.M. An Instruction Fetch Unit for a High
Performance Personal Computer. Tech. Rept. CSL-81-1, Xerox PARC, Jan. 1981.

4. Patterson, D.A. and Séquin, C.H. RISC-I: A Reduced Instruction Set VLSI Computer. Proc. of
the Eighth Annual Symposium on Computer Architecture, Minneapolis, Minn., May, 1981.

5. Widdoes, L.C. The S-1 Project: Developing high performance digital computers. Proc. Compcon,
IEEE, San Francisco, Feb. 1980.
Comparative Survey of Different Design
Methodologies for Control Part of
Microprocessors
Monika Obrebska
IMAG Laboratory
Computer Architecture Group
B.P. 53X
38041 Grenoble Cedex, France
ABSTRACT
We present several methodologies used in the design of control parts of
microprocessors and we discuss their classification with respect to the
qualities of the design. All these different methodologies were brought
out by decoding existing integrated circuits. Afterwards each one of
these methodologies was used to redesign a new control part of the MC
6800 microprocessor, its operation part remaining unchanged. By so
doing, we obtained a set of normalized solutions so that the real effi-
ciency of each method could be estimated in terms of the cost of hard-
ware and design time. The performance expressed by the cycle time of
each control part was also calculated leading to the complete, valid
classification of different design styles. Finally, the evolution of the
design efficiency versus the circuit complexity was studied.

KEY WORDS
Interpretation algorithm - Levels of interpretation - Control part
design - Microprogramming - Programmable logic arrays - Efficiency
evaluation - Regularity factor - Algorithm complexity.

1 - INTRODUCTION

The technological improvements induce the increase of the total circuit
area and, for devices like microprocessors, the increase of the cir-
cuit (algorithm) complexity. This evolution leads to new design pro-
blems [2] because of the great amount of information to be handled by
the designer. Great progress has been achieved in the operation part
design, where the use of predefined, ready-to-assemble cells seems to be
the solution [16]. On the other hand, in the control part design domain,
there has not been till now any comparative evaluation of the characte-
ristics of different design methodologies.

2 - MICROPROCESSOR SPECIFICATION
Microprocessors, like all sequential machines, are defined by their lan-
guages, i.e., by the sets of executed instructions. So the behaviour of
a microprocessor may be described with an algorithm, called the interpre-
tation algorithm, which explains the semantics of the instruction set.


The microprocessor circuit is in fact one of the possible hardware rea-
lizations of the interpretation algorithm.
The interpretation algorithm is written in the definition language
which does not necessarily correspond to the description of the physical
machine behaviour. The translation of this algorithm to the physical
machine may be done through one or several steps considered as levels of
interpretation. At each level of interpretation an appropriate language
is used to describe the instructions execution of the preceding level.
The lowest level description should be directly translatable to hard-
ware. It serves to define the hardware operation part which performs
the data memorizations and manipulations. All functional blocks (regis-
ters, ALUs) of this operation part are defined from the variables used
in the algorithm description. The communication paths between the
blocks are defined from the transfers needed by the algorithm. Notice
that the operation part is defined first and should be completely speci-
fied before the control part design.
The control part animates the operation part through a set of con-
trol lines in order to perform the given sequences of operations. Its
structure is determined from the algorithm flow chart and must take into
account the operation part specification, external control lines and
instructions coding. The hardware implementation of the control part
depends on the employed design methodology.

3 - PRINCIPLES OF INVESTIGATION
As a matter of fact, different control part design styles were
brought out by the microphotographic analysis of the internal architec-
ture of existing circuits [1], [3], [4], [6], [8], [9], [10], [14], [15],
but they could not really be compared. Each one of them was applied to
build a different machine, executing a different algorithm, and was often
implemented with a different technology, so it was not possible to say
which one was better than the others.
The idea then was to apply each of these design methodologies to the
same example in order to obtain the corresponding hardware realization
of the same algorithm. By so doing, we found a set of normalized solu-
tions reflecting the efficiency of each method. The MC 6800 micropro-
cessor was selected to serve as a benchmark in this study. The main
reason is that we knew its internal architecture as well as the inter-
pretation algorithm of its instruction set [11].
The following rules were respected in the redesign of each new
version:
- the operation part remained the same as in the original 6800,
- the interpretation algorithm was exactly the same as that of the
original 6800,
- the evaluation was made following the same design rules for the lay-
out which were in our case the rules of GSN3 technology.
The efficiency of each design methodology was estimated according
to three essential characteristics: the cost of hardware, the speed of
the device and the design time.
- the cost of hardware was valued as a function of the total area of the
redesigned circuit, obtained after the layout proposition.
- the design time and ease were estimated according to the percentage of
structures which can be generated automatically, like ROMs or PLAs, in
each solution. This percentage is also called the regularity factor.

- the speed of the control part was established by examining the delays
of its different functional blocks. The delays for ROMs and PLAs were
calculated by a special program [5] and then the global timing was ana-
lyzed in order to obtain total compatibility with the original.

4 - COMPARATIVE SURVEY OF DESIGN METHODOLOGIES


In this section we present the basic principles of each one of the
different methodologies which we studied and the characteristics of the
corresponding MC 6800 redesign.

4.1. SINGLE LEVEL INTERPRETATION CONTROL PARTS


Single level interpretation control parts are obtained when the
lowest level description of the interpretation algorithm is used for the
design. The operation code, the external control lines and the opera-
tion part status lines are considered to generate directly all the
commands at each clock cycle.

4.1.1. RANDOM LOGIC CONTROL PARTS


In order to design these control parts the flow chart of the inter-
pretation algorithm of the instruction set is directly translated into
a set of gates and dynamic shift registers (MC 6800, MC 6809). This
technique is usually considered as the most economical one from the si-
licon area consumption point of view because it allows a high density
layout. However, this methodology is difficult to design and debug and
may only be used for small or medium complexity instruction sets. It may
also be used to implement small automata.
The original MC 6800 microprocessor control part is implemented
with this technique, and so the only regular block it contains is the
instruction decoder, which occupies 14.5% of the control area. The ran-
dom logic sequencer occupies 39% of the control part while the remainder
is devoted to the auxiliary control circuits, control pads and inter-
connections (figure 1). Notice that the MC 6800 recognizes 197 opera-
tion codes which correspond to 72 different instructions. The instruc-
tion execution can take from 2 to 12 clock cycles and the interpretation
algorithm has 167 different states. The control part generates 63 com-
mands for the operation part and 17 commands for the condition register.
The MC 6800 works with a two-phase non-overlapping clock (Φ1, Φ2).
Its minimum cycle time (0.5 µs) is given by the delay in the critical
gate chain.

4.1.2. SINGLE PLA CONTROL PARTS


Such a control part is built as a single automaton implemented with
a single PLA. The "and" matrix decodes the executed instruction, the
status of the operation part and the step register. The "or" matrix ge-
nerates the next content of the step register and the commands for the
operation part (e.g., NS SC/MP).
We found that such a single PLA realizing the control function for
the MC 6800 microprocessor would have 28 input lines and 80 output lines.
If it is organized as a classical PLA, i.e. the inputs are parallel to
the outputs, then its total area is very large. Furthermore a vertical
layout is needed, which causes a great waste of unused area. A horizontal
layout is possible with an optimized [12] PLA, giving a very good
regularity factor (52%), but the total area of the microprocessor is still
increased by about 40%. The cycle time, fixed by the delay through the
PLA, is about 1.2 µs, so the performance of this solution is not good.
This technique gives a control part easy to design and to implement, but
it should only be used for small complexity instruction sets. Indeed, even
for the optimized PLA the area grows too large when the complexity of the
interpretation algorithm increases.
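To see why the single-PLA area grows so quickly with algorithm complexity, a rough,
conventional PLA area model is enough: an AND plane about 2 x (inputs) columns wide, an OR
plane (outputs) columns wide, and one row per product term. The C sketch below applies this model
to the 28-input, 80-output PLA quoted above; the grid-cell constant is an illustrative assumption,
not a GSN3 figure.

    /* Back-of-the-envelope PLA area: (2*inputs + outputs) columns times
       one row per product term.  The cell constant is assumed. */
    #include <stdio.h>

    double pla_area_mm2(int inputs, int outputs, int pterms)
    {
        const double cell_mm2 = 0.0004;   /* assumed area of one grid point */
        return (2.0 * inputs + outputs) * pterms * cell_mm2;
    }

    int main(void)
    {
        /* the single-PLA 6800 control: 28 inputs, 80 outputs */
        for (int pterms = 100; pterms <= 400; pterms += 100)
            printf("%3d p-terms: %6.2f mm2\n",
                   pterms, pla_area_mm2(28, 80, pterms));
        return 0;
    }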
4.1.3. MICROPROGRAMMED CONTROL PARTS
This well known technique consists of storing in a ROM a micropro-
gram which describes the sequence of control actions to be performed to
interpret the instruction set (e.g., 52752 FAST, MC 68000). The next
address and test indications are recorded in each microinstruction. The
decoding is performed by an extra PLA which recognizes the operation code,
and the branching address is selected by a simple logic circuit.
HORIZONTAL MICROPROGRAMMING
In horizontal microprogramming one bit of the microinstruction
corresponds to one command. This induces a rather big waste of space but
is useful from the topological point of view and easy to implement. The
6800 algorithm needs 169 microinstructions of 70 bits, where 15 bits are
used for sequencing and 55 for commanding the operation part. In the
layout proposed, a double-word organized ROM is used. The total area of
the microprocessor is increased by 26%, and the regularity factor is very
good, 54%. Horizontal microprogramming is worthwhile when the amount of
ROM needed is not very large and the coding and decoding of the com-
mands are too costly.
VERTICAL MICROPROGRAMMING
In vertical microprogramming, the coding of the commands decreases
the ROM area but the cost of decoders and interconnections must be added.
The coding of the 6800 commands saves 23 bits of the microinstruction
while 6 decoders are needed. In this redesign the total area of the mi-
croprocessor is increased by 15% only. The regularity factor of 48% is
a little worse than in the previous version because of the interconnec-
tion area. This technique allows the building of optimized control parts,
easy to design and to implement, for medium complexity instruction sets.
In both solutions, the microaddress is calculated during Φ1, while
the microinstruction is read during Φ2. The cycle time, fixed by the ROM
read delay, is about 0.89 µs in horizontal microprogramming. In vertical
microprogramming, the commands decoding delay must be added but the ROM
delay is smaller, so the cycle time is about 0.86 µs.
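The ROM-size arithmetic behind the horizontal/vertical comparison can be made explicit; the
sketch below uses only the figures quoted above (169 microinstructions of 70 bits, with 23 command
bits saved by encoding) and deliberately ignores the area of the 6 decoders and their
interconnections, which is what erodes part of the saving.

    /* Raw microprogram ROM sizes for the two styles (decoder and
       interconnection area not modeled). */
    #include <stdio.h>

    int main(void)
    {
        int words = 169;                    /* microinstructions         */
        int horizontal_bits = 70;           /* one bit per command       */
        int vertical_bits   = 70 - 23;      /* 23 bits saved by encoding */

        printf("horizontal ROM: %d bits\n", words * horizontal_bits);  /* 11830 */
        printf("vertical ROM:   %d bits\n", words * vertical_bits);    /*  7943 */
        return 0;
    }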
4.1.4. MULTIPLE PLAs CONTROL PARTS
We have shown that the single PLA version is not acceptable because
of the tremendous increase of the circuit area. In the microprogramming
version there is still a lot of space lost because of the need of having
a full address field in each microinstruction. A multiple PLAs approach
seems to bring some kind of solution to these problems. The basic idea
here is first to separate sequencing and command generation and second
to extract (and use) characteristic properties which do not vary during
the whole instruction interpretation (e.g., HP MC2, GI CP 1600).
The sequencing is achieved by a PLA which recognizes the instruction
code, the status of the microprocessor and the current step of the algorithm,
and generates the address of the next step. The command generation may
be implemented with two different philosophies, leading to two quite
different solutions.

Fig.1 - MC 6800 layout. [Diagram: instruction decoder, random logic control,
and auxiliary control blocks of the original chip; detail not recoverable.]

Fig.2 - Redesign of MC 6800 using parameter PLAs.
PROPERTIES EXTRACTION
In this approach, the characteristic, global properties are brought
out directly from the instruction code by a separate PLA. This allows a
decrease in the number of states of the algorithm because some of them be-
come independent of the commanded actions. There are no longer 8 but 7
address lines. The property lines are directly used in the command gene-
ration PLA as entry lines. As all the PLAs are optimized, this version of
the MC 6800 needs only 11% more space than the original one, and its re-
gularity is 55%. It also performs better, because the cycle time, given
by the sequencing PLA, is about 0.58 µs. Such a technique allows the
building of well optimized control parts for rather complex instruction
sets.
PARAMETERS EXTRACTION
The parametrization is a direct generation of static commands, i.e.
the characteristic, local properties are directly translated by the para-
meter PLA into the set of commands. The other commands, which depend on
sequencing, are computed in a command generation PLA. The switching bet-
ween the static and the computed commands is performed by the selection
lines coming from a validation PLA. The idea of parametrization can be
easily applied for those functional units which execute one parti-
cular function during one instruction interpretation (for example, the
ALU or the condition register). It is possible, however, to make an
extension and to describe in a parameter PLA two or three particular
functions chosen by the selection lines.
For this reason several subparts were distinguished in the MC 6800 ope-
ration part and the associated parameter PLAs were defined (figure 2).
The total area of the microprocessor is not increased and the regularity
factor is equal to 43%. This result proves that it is possible to opti-
mize the design time and to decrease the difficulty of implementation
without increasing its cost. The speed characteristics, however, are not
so good. The cycle time, fixed by the delay of the sequencing and validation
PLAs, is about 0.77 µs. The use of parameter PLAs gives the possibility
of designing flexible, easy to test and debug control parts for complex
instruction sets.
4.1.5. CONTROL PART USING TIMING GENERATOR
This approach is based on the fact that each command is characteri-
zed by the moments of its activation during the instruction interpreta-
tion. One instruction may then be described as a set of commands to
activate and a set of moments of their activities.
The commands to activate and other property lines are generated by
the instruction decoder. The moments are described by the timing lines
coming from a timing generator. The timing generator is a small automa-
ton which represents the "skeleton" of the instruction execution. It is
controlled by the property lines of the instruction decoder. Command
lines are finally generated through the validation of lines coming from
the decoder by timing lines. This validation may be performed by an and-or
gate net (Z 80, Z 8000, I 8080) or by a PLA (I 8085). The redesign of
the MC 6800 using this technique has a timing automaton of 27 states
which generates 10 timing signals. The area of the microprocessor does
not need any increase. It seems even possible to decrease it by about
2%. The regularity factor of 33% is worse because the timing generator
was evaluated as random logic. The cycle time, fixed by the instruction
decoding PLA, is about 0.4 µs. Although this design methodology has
very good technical characteristics, none of the existing microprocessors
use it at one level of interpretation. The reason for this is explained

in the next section.


4.2. CONTROL PARTS USING SEVERAL LEVELS OF INTERPRETATION
Control parts using more than one level of interpretation are built
when an upper level description of the interpretation algorithm is chosen
to supervise the instruction execution. The operation code, the external
control lines and the operation part status lines are first considered
at the highest level to generate the command lines which are used as ins-
truction inputs for the level just below. In this way, several single le-
vel control parts may be stacked, each level interpreting the instruc-
tions of the level just above. Any kind of single level control part may
be used, but among the existing circuits we find only the stacking of
control parts using timing generators (I 8080, I 8085, Z 80, Z 8000).
The reasons for this are that:
- this kind of control part uses a validation network whose inputs (coming
from an instruction decoder) are of the same kind as its outputs;
- this technique reduces the total number of timing lines by the facto-
rization of the timing instants (e.g., machine cycle x timing states).
The redesign of the MC 6800 using two levels of interpretation (fi-
gure 3), controlled by the machine cycle and timing state generators,
seems to allow an area decrease of about 8%. A single PLA is used for
validation of all commands. Note that the timing generators occupy less
of the control area and the regularity factor is better (34%) than in
the preceding solution. The clock cycle time estimated for this control
part is equal to 0.34 µs.
4.3. RECAPITULATION
The results of our investigation are shown in Table 1, which recalls
the circuit area estimation, the regularity factor and the cycle time for
each solution. The percentage of the irregular elements, unchanged or
added during this study, and the area available for the interconnections
are also given. The most interesting results concern those three metho-
dologies where the circuit area does not increase while the regularity
factor is much better than the original. In the case of the methodologies
using timing generators, it seems even possible to decrease the original
circuit area. Indeed such an expectation may be a little too optimistic
if we keep in mind the error of the estimation method. However, this
result allows us to think that the random logic control part technique
is not necessarily the least silicon consuming one. We have seen that
other economical techniques, much easier to implement and debug, may
be used with success.
5 - GENERALIZATION
The redesign of the MC 6800 microprocessor gave us an idea of the
trade-off between the total area and the regularity for different control
part design methodologies used in the specific case of the 6800 algorithm
implementation. We were interested then in the evolution of design effi-
ciency for other, more or less complex algorithms typical of general
purpose microprocessors.
The complexity was defined as the number of states of the equivalent
Moore automaton executing the interpretation algorithm. We made the
assumption that all the characteristics of the algorithm, such as the number
of operation codes, of tests, of branchings, etc., are linear functions
of the complexity. Then the relations between the characteristics of the
algorithm and the parameters of each methodology were established, allow-
ing us to evaluate all the elements of each solution for different values

Fig.3 - Redesign of MC 6800 with two levels of interpretation using timing generators.

                              AREA      REGULARITY  IRREGULAR  INTERCON-  CYCLE
                              INCREASE  FACTOR      ELEMENTS   NECTIONS   TIME
                              %         %           %          %          µs

MC 6800 (original)               -        14.5        ...        ...      0.5
SINGLE PLA                      40        52          24         24       1.2
MICROPROGRAMMING, HORIZONTAL    26        54          31         15       0.89
MICROPROGRAMMING, VERTICAL      15        48          36         16       0.86
MULTIPLE PLAs, PROPERTIES       11        55          34         11       0.58
MULTIPLE PLAs, PARAMETERS        0        43          40         17       0.77
TIMING GENERATOR, 1 LEVEL       -2        33          52         15       0.40
TIMING GENERATOR, 2 LEVELS      -8        34          51         15       0.41

[Entries marked ... are illegible in this reproduction.]

Tab.1 - Design methodologies classification.



of complexity. In this way, the curves showing the evolution of the
total control part area versus complexity could be obtained (figure 4).
We can notice that the gap between the different design methodologies
widens with complexity. We can also see that three of them, the multiple-
parameter-PLAs methodology and the two timing generator methodologies, are
getting competitive with respect to the random logic technique, especially
at the complexity where not the total area but the design time
becomes the decisive criterion.
To validate these results, we have examined the MC 6809 microproces-
sor, which is an extension of the MC 6800. It is realized with the HMOS1
technology and its control part is designed with random logic. We have
calculated that if it were realized with GSN3 technology, the control part
Fig.4 - Control part area versus complexity. [Plot of control part area (mm²)
against algorithm complexity (number of states, about 100 to 500); one curve
per methodology: single PLA, horizontal µprogramming, vertical µprogramming,
multiple PLAs - properties, random logic, multiple PLAs - parameters, timing
generator 1 level, timing generator 2 levels.]

would occupy an area of about 19.4 mm². We have also found that the equi-
valent Moore automaton has about 466 states. This point, plotted in
figure 4, confirms our expectations and underlines the necessity of more
flexible and economical design methodologies.
We must stress that the aim of this generalization was to find the
characteristic trends in the evolution of the efficiency of different
design techniques. The eight basic design methodologies which were stu-
died here may be considered as "pure" styles, and the curves show the
influence of their basic concepts on design qualities. In reality,
control parts using several different ideas may be built, as for example
in the MC 68000 [8] microprocessor, which has a microprogram stored in a
ROM but whose microcommands validate the parameter PLAs. The curves
that we established should help the designer in his judgement of the
impact of different concepts on final control part characteristics.
6 - CONCLUSIONS
This comparative study allows us to find a valid classification of
eight basic control part design methodologies applied to the case of the
MC 6800 algorithm and then extended to other algorithms of the same kind.
The classification was made with respect to the total area, the regularity

and the speed of each designed structure. We can see that in the case of
a small complexity circuit, and especially when the development of a family
of circuits is not considered, the total area and regularity are less
significant and other factors must be taken into account at the time of
the design style choice. In particular, the cost of the algorithm analysis
must be considered. The main conclusion is that nowadays, when the cir-
cuits are becoming more and more complex, it is worthwhile to analyze
the problem in such a way that the application of a more elaborate metho-
dology becomes possible. This should lead to a minimum area, maximum
regularity solution giving optimum extension and test possibilities.

7 - BIBLIOGRAPHY
[1] J. ABDO, F. RODRIGUEZ, "Analysis of MC 6809 microprocessor", master's
report, June 1981.
[2] F. ANCEAU, "Architecture and design of Von Neumann microprocessors",
NATO Advanced Summer Institute, July 1980.
[3] Ch. BERNARD, B. LAPLACE, Y. ALEXANDRE, "Analysis of CP 1600 micropro-
cessor", master's report, June 1979.
[4] Ch. BERNARD, "Analysis of MC2 HP microprocessor", IMAG report, 1980.
[5] M. BONNET, J.F. TANT, "Static NMOS PLAs", master's report, June 1981.
[6] A. BOSSEBOEUF, "Internal analysis of MC 68000", IMAG report, June 1980.
[7] J.P. BRAUN, "Design and implementation of a VLSI circuit with CMOS/SOS
technology", master's report, June 1978.
[8] M. GUITTET, "Microprogramming of the 68000 microprocessor", master's
report, June 1981.
[9] A. GUYOT, "Comparison of Z80 and INTEL 8085 microprocessors", IMAG
report, September 1979.
[10] V.S.R. MALLADI, "Analysis of internal architecture of INTEL micro-
processors", IMAG report, June 1979.
[11] M. NEMMOUR, "Analysis of instructions execution in 6800 microproces-
sors", final report EDF/ENSIMAG, no. 511 78 10, May 1979.
[12] T. PEREZ SEGOVIA, "Optimization of PLA's area", IMAG research report
no. 216, October 1980.
[13] E. PRESSON, "Analysis of sequencing in MC 6800 microprocessor",
NPL IMAG report, 1978.
[14] R. REIS, "Analysis of Z 8000 microprocessor", IMAG report, September
1980.
[15] R. SARWAT, "Analysis of COP 1802 microprocessor", IMAG report, June 1979.
[16] A.A. SUZIM, "Operation parts using modular elements", IMAG report,
September 1979.
C.fast: A Fault Tolerant and Self Testing
Microprocessor

Michael M. Tsao, Andrew W. Wilson, Ralph C. McGarity,
Chia-Jeng Tseng, and Daniel P. Siewiorek
Carnegie-Mellon University
Departments of Electrical Engineering and Computer Science
Pittsburgh, Pennsylvania 15213

During the spring of 1981, the authors were involved in a project to design a single
chip fault tolerant microprocessor. The microprocessor chip is now being fabricated by
the Multi Project Chip (MPC) facilities. This report presents a brief overview of the chip,
examples of the reliability/testability techniques implemented, and some of the trade-off
issues resolved during the design process: partitioning of control code into several PLA's,
and the increase of PLA size and overall chip size due to testability/reliability constraints.

INTRODUCTION

The C.fast(1) project attempted to accomplish four goals. The first goal was to
provide the authors with experience in designing digital integrated circuits, especially
microprocessors. We hoped that this experience would give us a better basis from which
to build a Design Automation (DA) system using a hierarchical structured design
methodology. A second goal of the project was to explore ways to connect control signals
to the data path part in a simple structured way with little random routing. A third goal was
to try out some low cost reliability techniques at the IC design level. Two new ideas
implemented were parity checking on the control PLA's and the concept of using the
data path to act as a visibility bus for testing purposes. Other reliability techniques were
also implemented for the appropriate sections of the chip. A final goal was to produce, as
a by product of the design effort, a set of register transfer (RT) level building blocks to be
used by our DA programs.
The Fairchild F8 microprocessor [FAIR77] was chosen as the target machine for the
following reasons: 1) It represents a "typical" microprocessor architecture of the mid-70's
era. 2) The original F8 is an nMOS chip, the same as the MPC process. The minimum feature
size used is similar to the current MPC process, where the minimum transistor gate area is
5 microns by 5 microns. 3) The complexity of the F8 is not very great, which made
reimplementing the Instruction Set Processor (ISP) easier. 4) The original F8 is partitioned

(1) C.fast, in the PMS notation [SIEW81], stands for Computer: FAult-tolerant and Self Testing.


in such a way that we could implement the basic CPU chip in less than 24 pins, thus
leaving some pins for the testability/reliability portion of the design. 5) As part of earlier
research work, we had explored the question of implementing low cost fault tolerant
features on an F8 system at the IC level.

CHIP OVERVIEW

The chip can be regarded as consisting of three interrelated sections: the control
part, the data part and the reliability part (see Figure 1). These sections were each under
the control of a different designer, though naturally there was considerable consultation
between them.
The control is partitioned into two groups of PLA's. Three large PLA's control the
instruction execution, provide correct sequencing for the external data bus, and attempt
recovery after transient errors. These PLA's broadcast encoded commands on a control
bus which traverses the chip in parallel with the data bus. Small decoder PLA's (called
NanoPLA's because of their resemblance to techniques used in nanocoding) produce the
actual control signals for the data path elements using the broadcast microinstructions as
inputs. This partitioning has produced a smaller and faster control section than would
have been produced by the more conventional design methodology.
Information about the state of the data part is fed back to the control part through a
status bus which is available to all PLA's. The extensive use of buses is intended to reduce
random routing and is partly motivated by our Design Automation research. The use of a
command bus allows easy testing since it can be made readable and writable through the
visibility bus and the I/O bus. It also provides a convenient way for the Retry PLA to take
over the data path control when it attempts instruction retry.
The data part is similar to the Caltech OM2 data path chip [MEAD80]. Its
similarities include a two-phase clocking scheme, precharging of buses, use of a
precharged Manchester carry chain in the ALU, and interleaved data buses. It is different
from the OM2 in that all of the storage elements are static, although the latches driving the
ALU are dynamic. Also, several of the elements have been reworked so that there is a
uniform control, power and ground scheme. One of the buses (called the B bus) is divided
into several sections to increase parallelism. Other changes include passive pullups in
some places, a different spatial ordering of ALU and Shifter, use of a more specialized
shifter, and wider Vdd and ground wires. Embedded within the data path part are fault
tolerant devices including a parity checker, parity generator, and shadow registers. Also
added are a zero detector and a status register.
The chip is also serving as a test bed for several reliability and testability techniques.
Testability is enhanced by allowing access to the internal control bus through the I/O bus
via the visibility register. Fault tolerance against transient errors is derived from the
pervasive parity checking and built-in retry algorithms. It is also intended that two chips will
be used in a duplicate-and-compare system with one in the standby slave mode, ready to
take over if an error occurs in the master.

AREA TRADE-OFFS FOR THE CONTROL PLA'S

The control scheme makes exclusive use of PLA's. Here, the PLA's can be thought
of as associative read-only memories where, unlike ROMs, only those product terms
(P-terms) actually needed are used. Thus a PLA-based microprogram can continuously
examine the present state of the machine, rather than have it retained implicitly in the
microprogram location counter. The PLA's responsible for generating the microcode
(TIMING and MAIN) have the instruction register and part of the processor state fed
directly in, requiring only five bits of recycled state. There is no need for dispatch ROMs or
condition code multiplexing as is often used in ROM-based designs. This simplifies
automatic implementations while still resulting in small size.
Our design utilizes a two-level hierarchy of PLA's: a group of decoder PLA's, referred
to as nanoPLA's, and microcode generation PLA's. No pipelining has been attempted in
this design. The current state of the machine determines the operations performed during
the next microcycle. This was done to preserve the sanity of the microprogrammer
(though he went insane anyway). The microprogram output is broadcast to all the
nanoPLA's during phi-2 and their outputs are guaranteed stable by the following phi-1.
The microcode generation is broken up into two PLA's, TIMING and MAIN. The
TIMING PLA keeps track of the individual instruction sequencing and controls timing of the
external processor bus. It also generates the F8 ROMC codes which direct the other
elements in an F8 system. The generation of ROMC codes and next state was combined
into one PLA since they shared many of the same P-terms. The specific sequence of
states produced by the TIMING PLA is determined by the instruction being processed. The
MAIN PLA combines the present state with the contents of the instruction register to
determine the next microinstruction to be executed. Few of these P-terms overlap with the
others, so they were placed in their own PLA. This arrangement required less PLA area
than other possibilities, as will be shown later.
When an error is detected during instruction execution, the TIMING and MAIN PLA's
freeze their state and the Retry PLA takes over. It issues its own instructions onto the
microinstruction bus and attempts to return the system to a known state. The instruction is
then retried from the point of failure.
In Figure 2, the area requirements of several different arrangements of PLA's are
presented. These estimates do not include the effects of adding fault tolerance. As can be
seen from the table, the present arrangement requires 920,000 λ², of which 565,000 λ² are
in the MAIN and TIMING PLA's. If the MAIN and TIMING PLA's were combined into a
single PLA, ten P-terms would be saved, but the total area would increase to 630,000 λ².
Also, the overall system would be slower. Another possible arrangement would combine all
PLA's into one giant PLA. This yields an even larger and slower PLA of 1,010,000 λ².
Thus the particular partitioning chosen appears to be a good one.
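As a rough cross-check on these numbers, consider a first-order area model (our
assumption for illustration, not the authors' layout estimator): a PLA with I inputs,
O outputs, and P product terms occupies on the order of (2I + O) x P crosspoints, so
P-terms that share few inputs and outputs are cheaper to keep in separate arrays.

    # Crude first-order PLA area model (an assumption, for illustration only):
    # AND plane (true and complement input columns) plus OR plane,
    # times the number of product terms.
    def pla_crosspoints(inputs, outputs, p_terms):
        return (2 * inputs + outputs) * p_terms

    # input/output/P-term counts taken from Figure 2
    print(pla_crosspoints(14, 11, 80))    # TIMING alone:        3120
    print(pla_crosspoints(14, 8, 110))    # MAIN alone:          3960 (sum 7080)
    print(pla_crosspoints(14, 19, 180))   # merged TIMING+MAIN:  8460
    print(pla_crosspoints(15, 48, 181))   # all-in-one:         14118

The model reproduces the ordering reported above: merging saves a few P-terms, but
forces every P-term to span the union of all inputs and outputs.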

DESIGN FOR FAULT TOLERANCE

In order to achieve a more reliable microprocessor system, we decided to use fault
tolerance techniques, rather than fault avoidance techniques [ELKI81]. Since we had no
control over the processing and manufacturing aspects of the project, we could not do
anything about the fault avoidance area of reliability improvement. Furthermore, we
wanted to explore the design space for low cost fault tolerance alternatives. For this chip,
we employed a four-tier approach in improving the fault tolerant properties of this
microprocessor. The goals were to improve the error detection and error recovery
capability of the system.

1. Conservative circuit level designs


A static latch design was used for all the flip flop implementations, including
the scratch pad memory (SPM) and all the temporary buffers. In this manner,
one could hold off any one phase of the two-phase clock, thus trading off

system speed for testing observability. Since there were some questions
concerning the performance of the on-chip clock input I/O pad, supplied by the
local MPC symbol library, two I/O pins were used for the two phases of the
system.

2. Register Transfer (RT) level functional blocks

a. Data path part


Parity was used on all registers and the SPM's for error detection.
Shadow copy registers were inserted for the accumulator (ACC) and two
other key registers, allowing for instruction retry. The issue of detecting
ALU errors was handled at the system level.
b. PLA implementation
Several alternative schemes for implementing a self-checking PLA were
investigated. The simplest one to implement involved restricting the
product terms (P-terms) such that, for any input pattern, one and only
one P-term is activated. The "even" and "odd" P-terms were alternately
placed next to each other, such that a shorted wire will activate terms
from both classes. Additionally, two extra OR-array output lines were
inserted, one for the "even" P-terms, one for the "odd" P-terms. In this
scenario, one checks for the unique activation of these two "parity"
lines. There are several more complicated schemes that could have
been employed, but all required designing a great deal of additional
circuitry.
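The scheme is easy to model in a few lines (our sketch with an idealized PLA,
not the chip's checker circuit): fault-free operation activates exactly one
P-term, hence exactly one of the two parity lines.

    # p_term_hits[i] is true when P-term i matches the input pattern;
    # even/odd P-terms feed the "even" and "odd" OR-plane parity lines.
    def check_pla(p_term_hits):
        even = any(hit for i, hit in enumerate(p_term_hits) if i % 2 == 0)
        odd = any(hit for i, hit in enumerate(p_term_hits) if i % 2 == 1)
        return even != odd              # unique activation -> no error

    assert check_pla([False, True, False])        # one P-term fired: OK
    assert not check_pla([True, True, False])     # a short fired two: error
    assert not check_pla([False, False, False])   # an open fired none: error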

3. The content of the microcode (functionality of the control PLA)


On detection of an external bus transaction error (e.g., a memory read), the entire
system does a bus retry. (This may not always be possible for all
microprocessor systems. It is dependent upon the types of support chips used
in a microcomputer system.) On detection of internal errors, such as those
reported by the ALU input bus parity checker, the current target machine
instruction is retried. This can be done only if there is sufficient redundant
information. The goal was to make the fault tolerance aspect of the chip
transparent to F8 software. Under this type of fault tolerant design, one can
run an existing F8 program unmodified. Figure 3 lists the additional PLA area
increases due to this aspect of the fault tolerant design.

4. At the system level


Several possible scenarios were explored for this project. We implemented a
master-slave, duplicate-and-compare system. In this scenario, the slave
microprocessor watches the output of the master in a lock-step fashion. At the
same time, the on-board error detection mechanism is also in operation. For
additional error detection capabilities, the slave microprocessor compares the
output from the master with its own copy of the "proposed" output pattern.
On disagreement, and also on detection of internal error, a system "retry"
signal is activated. Both microprocessors perform one retry of the current
target machine instruction. At the end of the retry cycle, the system can be in
any one of the states shown in Figure 4. With this scheme, total error
detection and recovery is not guaranteed. However, the system can recover
from certain types of error conditions.

DESIGN FOR TESTABILITY

A traditional microprocessor design can usually be grouped into the data path part
and the control part. The data part is more observable from the off-chip I/O pins. The
control part is somewhat harder to control and even harder to observe. In most
microprocessor designs, the only way to determine the proper operation of the control part
is by observing the output of the data part. The goal for our testable microprocessor was
to design a place where we can put a controllable and observable path (the visibility bus)
between the control part and the data part, and route it to the off-chip I/O pins.
In the C.fast microprocessor design, the main control PLA generates control
information for the "control bus", not for the data part directly. We convert this bus into
the "visibility" bus during the test mode. Extra circuitry is used for observing values of the
control part, as well as jamming new values onto this visibility ("control") bus. Thus, we
increased the observability of the control part and the controllability of the data part.
The controllability of the control part Finite State Machine (FSM) is increased by
making it easier to write information into the FSM flip flops (FF's). During test mode, these
FF's can be loaded from the off-chip data bus I/O pins via the visibility register and the
visibility bus. The operation is very similar to all the scan-set ideas, such as the Level
Sensitive Scan Design (LSSD) technique used on some IBM machines [EICH77]. On
C.fast, the FF values can be loaded during the test mode write cycles. The microprocessor
runs one or more executing cycles. The chip is set back to test mode, and the values
stored by the visibility register are read off. One important difference is that the data
reading and loading uses the 8-bit parallel I/O data bus pins. This design does not use
shift-in and shift-out pins as in most scan-set-like designs. Furthermore, the pins for
controlling the test mode functions could be shared with the pins used for the fault
tolerance operations. The extra pins do not visibly impact the total pin count.2
Additionally, portions of the FSM used for fault tolerance operations are highly
observable during "normal" operations. The microprocessor can easily be fooled into an
error recovery mode where the proper operation of the recovery FSM can be observed.
The built-in error detectors also increase the testability of the chip.

CAD TOOLS

A dozen or so CAD programs were used in designing the chip. Manual layouts were
done using the Xerox ICARUS interactive IC layout program, running on the Xerox ALTO's.
The CIF files generated from the ICARUS layout files were sent over the Ethernet to the
CMU VLSI-VAX. The most interesting path was in generating the CIF files for all the PLA's.
The sequence is illustrated in Figure 5. Many programs were invoked in the prescribed
sequence. However, the tasks were somewhat simplified by using system command files.
The actual bit patterns for the PLA generator were used to drive the ISPS simulation of the
micromachine, providing an independent check on the correctness of the microcode. A
UNIX shell program was used to merge all the leaf node CIF files, which included 6 ICARUS
files, 10 PLA files, and one small hand-typed CIF file, into one unified C.fast CIF file.
Because of the size of the entire CIF file, design rule checking of the chip was done in
separate chunks, merging only a few leaf node files at a time.

2We ran out of chip real estate in which to place the Retry PLA. However, all the associated control signals
were placed. Using the visibility bus, the error recovery procedures can still be tested.

SUMMARY

Work on the C.fast chip was initiated in mid-January, 1981, and completed in mid-
June, 1981. Four graduate students were actively involved in the design. Figure 6 shows a
checkplot of the C.fast microprocessor, and Figure 7 identifies the major functional areas
on the chip. The completed design, excluding the RETRY PLA, contains approximately
13000 transistors, of which the TIMING PLA and the MAIN PLA use 4300 transistors. The
data path part, with a 16-byte SPM section, contains 5600 transistors. The "nanoPLA's"
account for about 3000 transistors. The chip is approximately 6100 microns by 5800
microns, somewhat big for a simple microprocessor. However, we feel that we have
satisfactorily completed the four goals stated at the beginning of this project.

ACKNOWLEDGMENTS

The authors would like to thank Bill Birmingham for the layout of the shifter section,
and Dr. Dennis Lunder, of the Fairchild Microprocessor Product division, who donated an
F387X PEP single board computer system, providing a test vehicle for our finished
product. We would also like to thank our colleagues in CMU's VLSI community. Without
their many wonderful CAD programs, we could not have completed this design.

REFERENCES

[EICH77] E.B. Eichelberger and T.W. Williams. "A Logic Design Structure for
LSI Testing." Proc. 14th Design Automation Conf., June 1977, pp. 462-468.

[ELKI81] S.A. Elkind. "Reliability and Availability Techniques." In "The Theory
and Practice of Reliable System Design," by D.P. Siewiorek and
R. Swarz. Digital Press, 1981.

[FAIR77] "F8 User's Guide." Fairchild, 1977.

[MEAD80] C. Mead and L. Conway. "Introduction to VLSI Systems." Addison-
Wesley, 1980.

[SIEW81] D.P. Siewiorek, C.G. Bell, and A. Newell. "Computer Structures:
Principles and Examples." McGraw-Hill, 1981.

3The paper presented here is an excerpt of a project report, "The MPC C.fast micro computer", available from
the authors.

While working on this project, the authors were supported by the Defense Advanced Research Projects Agency
(DOD), ARPA Order No. 3579, monitored by the Air Force Avionics Laboratory under contract F33615-78-C-1551,
by National Science Foundation grant ENG-78-25755, and by the Carnegie-Mellon University Department of
Electrical Engineering.

The views and conclusions contained in this document are those of the authors and should not be interpreted
as representing the official policies, either expressed or implied, of the Defense Advanced Research Projects
Agency or the US Government, or other funding agencies.
Figure 1a: RT level block diagram of the C.fast microprocessor (labels include
"parity on A-bus" and "store & compare on I/O pads").
Figure 1b: Block diagram of the testability-reliability portion of the C.fast.

PLA Areas and Timing Delays

Configuration                 Inputs  Outputs  P-Terms  Area (Kλ²)  Delay (τ)
TIMING PLA                      14      11        80       251         700
MAIN PLA                        14       8       110       314         970
Total PLA                       14      19       180       633        2070
SPM PLA                         12      33        33       163         140
Hypothetical all-in-one PLA     15      48       181      1010        2150

Comparison of Three Possible Partitions

                          Area (Kλ²)  Delay (τ)
Present Scheme               920        1010
Combined TIMING & MAIN       990        2210
All-in-one PLA              1200        2280

Figure 2: Table of PLA area trade-offs

PLA's   Inputs  Outputs  P-Terms  Area Increase (%)   Notes
                                    over original
TIMING    16      16       116         35             extra states
MAIN      14      11       114          9             extra nano codes
RETRY     12      22        61          -             F8 instruction retry
SPM       12      34        35          5
ACC        8      17        23         91             shadow registers
CON       12      11        18          9
ALU        9      13        12          0
SSI       11      14        20         41             shadow status bits
VRE        4       7         8          -             visibility registers

Figure 3: Table of PLA area increases due to testability-reliability requirements
Figure 4: Markov diagram showing the states of a duplicate-match C.fast system
(0 = device down, 1 = device OK; states include "both OK", "stop slave", and
"stop master" on internal errors).

Figure 5: Diagram tracing the design process through various CAD programs.


Figure 6: Video monitor display of the CIF check plot for the C.fast chip

Figure 7: Major functional areas of the C.fast chip (I/O pads top and bottom;
TIMING PLA, MAIN PLA, control bus; visibility register, SPM, ACC, CON, ALU,
shifter, and status register along the data path)


VLSI Processor Arrays for Matrix Manipulation*
J.G. Nash, S. Hansen, and G.R. Nudd
Hughes Research Laboratories
Malibu, California

A. INTRODUCTION

It is generally recognized that computing system throughput can be


significantly increased by concurrent processing techniques. The compo-
nent densities available from VLSI have made this potential economically
feasible, and at the same time have considerably enhanced circuit per-
formance. Implementation of concurrent architectures can be achieved by
mapping desired algorithms into data flow forms suitable for pipelining.
However, the utility of this approach suffers because different hard-
ware organization is required for every algorithm. A preferable
approach is based on the observation that most of the signal processing
algorithms being investigated today can be formatted in terms of matrix
manipulations [1]. A matrix operation is also well-suited to concurrent
implementations in which a number of small, identical processors oper-
ate simultaneously on the matrix elements. Thus, a set of "matrix
operator" chips made using VLSI would, when coupled to a general purpose
host computer, provide both a high computational throughput and the
flexibility to perform a wide range of signal processing algorithms.
In Section B of this paper we describe two important matrix manipu-
lation algorithms and functional implementations of these. The first is
a Toeplitz system solver [2] that can be used for adaptive filtering,
spectrum analysis, and beam forming. The second algorithm [3] is suited
to general purpose, non-Toeplitz matrix operations. The matrix opera-
tions performed are functions of the input data. This built-in flexi-
bility allows the solution of general linear systems and performance of
matrix multiplication, addition and inversion, all using the same
set of processing elements.
Section C describes some statistical simulations we have performed
in order to verify correctness of data flow within the matrix processors,
and in order to predict effects of finite register lengths, roundoff
techniques, and pivoting schemes. In Section D we discuss various
speed, power, and area tradeoffs associated with actual hardware imple-
mentations of the Toeplitz linear system solver.

*This work was supported in part by National Science Foundation Grant
No. ECS 8016581, and a contract from the Office of Naval Research, No.
N00014-81-K-0191.


B. ALGORITHMS

1. The Toeplitz System Solver

The Toeplitz system of linear equations,

[R] [X] = [C]                                                    (1)

where [R] is a Toeplitz matrix, [X] is an unknown vector, and [C] is a
known vector, can be solved by a lattice processor array [2]. It uses the
Weiner-Levinson (WL) algorithm, which takes advantage of the symmetry of
[R] to produce [X] in O(n) time steps, using O(n) processors for a total
of O(n^2) operations. A conventional approach would require O(n^3)
operations. To solve (1) we note that [R] can be written

[R] = [U+] [D] [U]

where [U] is provided by LU decomposition of [R], and [D] is a diagonal
matrix with elements 1/u11, 1/u22, ..., 1/unn. Therefore,

[X] = [U]-1 [D]-1 [U+]-1 [C]

A system solver can then be built which solves for [X] in four steps:

Generate [U+]                 (Lattice array implementation of WL algorithm)
Calculate [φ] = [U+]-1 [C]    (Back substitution)
Calculate [ψ] = [D]-1 [φ]     (Multiplication)
Calculate [X] = [U]-1 [ψ]     (Back substitution).

A functional diagram of the system solver for a symmetric Toeplitz matrix
is shown in Figure 1. It consists of two basic sections: a lattice
array of processors which produces the [U+] matrix using the WL algo-
rithm, and a back substitution section which solves for the intermediate
result [ψ] and the final result [X]. Between the two basic sections is
a LIFO (last-in-first-out) stack which stores the successive elements of
[U+] for later output as [U].
To generate elements of [U+], the WL lattice array operates recur-
sively on rows of [R] to produce a vector, which we denote as [A], that
is used in conjunction with a second "auxiliary row" [B]. At each
recursion a "reflection coefficient", K, is generated which successively
multiplies the two rows, so that when they are subtracted from each
other, the new result [A] is the next row of [U+]. A new auxiliary vec-
tor, [B], is also produced. The reflection coefficient is obtained as
the ratio of the first elements of the two rows.
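For concreteness, a compact Python rendering of the WL recursion follows (our
sketch of the underlying mathematics, not the lattice hardware itself). Here
r[k] holds the matrix element R[i][i+k] of the symmetric Toeplitz matrix, c is
the right-hand side, and the quantity eps plays the role of the reflection
coefficient K; the whole solve takes O(n^2) operations.

    def levinson_solve(r, c):
        n = len(c)
        f = [1.0 / r[0]]               # forward vector: R_k f = (1, 0, ..., 0)
        x = [c[0] / r[0]]              # solution of the leading k-by-k system
        for k in range(1, n):
            eps = sum(r[k - i] * f[i] for i in range(k))  # reflection coeff.
            d = 1.0 - eps * eps
            f = [(fi - eps * bi) / d
                 for fi, bi in zip(f + [0.0], [0.0] + f[::-1])]
            ex = sum(r[k - i] * x[i] for i in range(k))
            # backward vector is the reverse of f, by the symmetry of [R]
            x = [xi + (c[k] - ex) * bi for xi, bi in zip(x + [0.0], f[::-1])]
        return x

    # 2x2 check: [[2,1],[1,2]] x = [1,1]  ->  x = [1/3, 1/3]
    print(levinson_solve([2.0, 1.0], [1.0, 1.0]))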
The system solver is initialized by loading both the A and B regis-
ters with the elements of the top row of [R]. The elements of [C] are
loaded in register C. Also, switch S1 is closed and switches S2 and S3
are open. When operation begins, A and B supply inputs to the WL array
and new results are returned to the same registers. The new values of
A correspond to successive rows of [U+]. These results are simultane-
ously stored in the LIFO stack for later use and supplied to the back
substitution section. In each time cycle new values of Ai, Bi, Ki, and
Ci are calculated and the intermediate results, ψi, are stored in a shift

register. When all of the elements of [U+] have been calculated and
stored on the LIFO stacks, the shift register containing the intermedi-
ate results, ψi, will also be full. At this point, switch S1 is opened
and S2 is closed; switch S3 is closed long enough to load the C register
with the results, ψi. The previous operations are repeated with the WL
array remaining idle. Each time period a new result, Xi, is calculated
and stored in the shift register.
We have further reduced the functional configuration shown in
Figure 1 to a form that is pipelined and suitable for reduction to hard-
ware. The necessity of pipelining resulted from the constraint that
processors Pi and Qi in Figure 1 only be allowed to communicate with
nearest neighbors. This constraint was imposed to avoid the speed
penalties and power consumption of a global bus.

2. The Faddeev Algorithms

The Faddeev algorithm [3] is useful for solving more general non-
Toeplitz linear systems problems and for matrix manipulations. To illus-
trate the algorithm, consider the case of finding

    c1 x1 + c2 x2 + ... + cn xn + d

where c1, c2, ..., cn, and d are given numbers, and x1, x2, ..., xn is the
solution to the system

    a11 x1 + a12 x2 + ... + a1n xn = b1
    a21 x1 + a22 x2 + ... + a2n xn = b2
        ...
    an1 x1 + an2 x2 + ... + ann xn = bn

whose determinant is non-zero. The problem can be codified by writing
it as

    a11  ...  a1n | b1
     .         .  |  .
    an1  ...  ann | bn
    ==============+====
    -c1  ...  -cn |  d

or, in abbreviated form,

     [A] | [B]
    =====+=====                                                  (2)
    -[C] |  d

where [B] is a column vector and [C] is a row vector. It can be shown
that if suitable linear combinations of the rows of [A] and the elements
of [B] are found and added to the row beneath the double line so that
[C] = 0, the desired result, [C][X] + d, will appear in the lower
right corner. The simplicity of the algorithm is due to the absence of
a necessity to actually identify multipliers of the rows of [A] and
elements of [B]; it is only necessary to "annul the last row" (the
elements in the lower left corner of (2)) by addition of a suitable
linear combination of the first n rows. This can be done by ordinary
Gaussian elimination. Note that this modification of the Gaussian
elimination algorithm avoids the usual back substitution in the triangu-
lar system and obtains the values of the unknowns directly at the end
of the forward course of the computation, resulting in a considerable
savings in added processing and storage.
This result can be generalized to the case where we deal with two-
dimensional matrices [C], [B], and [D], or

     [A] | [B]
    =====+=====                                                  (3)
    -[C] | [D]

After Gaussian elimination, in the lower right hand corner the result
[C][A]-1[B] + [D] will appear, where we have used [X] = [A]-1[B].
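A few lines of Python make the mechanics concrete (our illustration of the
scheme in (3), with no pivoting, so a matrix [A] with nonzero leading minors
is assumed):

    def faddeev(A, B, C, D):
        n, m, p = len(A), len(B[0]), len(C)
        # work array [[A, B], [-C, D]]
        W = [A[i] + B[i] for i in range(n)] + \
            [[-c for c in C[i]] + D[i] for i in range(p)]
        # ordinary Gaussian elimination: row k annuls column k of every
        # row below it, including the rows beneath the double line
        for k in range(n):
            for i in range(k + 1, n + p):
                factor = W[i][k] / W[k][k]
                W[i] = [wij - factor * wkj for wij, wkj in zip(W[i], W[k])]
        # the lower-right p x m block now holds [C][A]-1[B] + [D]
        return [row[n:] for row in W[n:]]

    # check with [A] = [I]: the result is [C][B] + [D]
    print(faddeev([[1.0, 0.0], [0.0, 1.0]], [[5.0], [7.0]],
                  [[1.0, 2.0]], [[1.0]]))       # -> [[20.0]]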
Other results can be obtained by appropriate choices of the entries in
(3). For example, a common problem is the solution to the linear sys-
tem [A] [X] = [B], where [X] and [B] are column vectors. The solution
[X] to this equation can be obtained with the entries

     [A] | [B]
    =====+=====                                                  (4)
    -[I] |  0
Here, [I] is the identity matrix. Examination of the Gaussian elimina-
tion procedure shows that only the top line of the identity matrix is
actually being utilized at each step, so the array can be reduced to
(n + 1) x (n + 1), as shown in Figure 2. After each pass of the
Gaussian process, the top line and left column will have been annihi-
lated. If the remaining numbers are then shifted upward and to the
left, the -100 . 0 line of the identity matrix can be restored at
each step. The result is that at each pass the matrix shrinks by one
column, but retains the original number of rows. After n passes, it is
found that the remaining column contains the solution vector.
The Faddeev algorithm can be used to solve linear systems rapidly
by using an (n + 1) x (n + 1) array of identical processors to operate
simultaneously on all elements. As a result, the entire calculation can
be performed in O(n) time steps. One possible two-dimensional pipelined
architecture suited to solving [A][X] = [B] is shown in Figure 3. Here
we are considering the case where [C] and [B] are one-dimensional vec-
tors whose elements are initially stored in the processors below the
double line, and to the right of the single line, respectively. The n^2
elements of [A] are stored in the processor matrix in the upper left
corner. The function of each processor, Pij, in the array is indicated.

Communication between processors is along horizontal, vertical, and
diagonal nearest neighbor paths. At each time step active processors
perform the calculations indicated and immediately transfer data to
adjacent processors for their use. While this processor organization is
of the "wavefront" variety, there are also broadcast schemes that could
be useful as well. For larger calculations involving matrices [B],
[C], and [D], an augmented processor array would be required.
Before leaving this subject, we should note that other matrix
operations are also possible by appropriate choice of input data. For
example,

     [I] | [C]
    =====+=====
    -[B] | [D]

does the multiplication and addition, [B][C] + [D].
C. PROCESSOR SIMULATIONS

In order to gain greater understanding of the matrix manipulation


processors, it was necessary to perform numerous detailed numerical
studies. These statistical studies were done by simulating the various
architectures using the high level programming language, APL, which has
the capability of operating on arrays of numbers in a straightforward
and intuitive way. The simulations have served three main functions.
First, they have provided a means for verifying the correctness of the
data flow within the Toeplitz and Faddeev matrix processors. Second,
they have allowed us to study the numerical stability of the algorithms
for representative signal processing data. Finally, they have been
useful in predicting effects of finite register lengths, roundoff tech-
niques, and pivoting schemes.
The test procedures for both the Toeplitz system solver simulation
and the Faddeev simulation were essentially identical. All tests were
performed on a randomly generated Toeplitz matrix designed to be a
reasonable simulation of an 8-bit output of the autocorrelation stage in
a radar processor. A vector [A] was generated from numbers selected at
random in the range 1/256 to 255/256, with all numbers being multiples
of 1/256. The length of each vector was equal to the dimension of the
Toeplitz matrix. The vector [A], which was sorted in descending
sequence, became the top line of the symmetrical Toeplitz matrix, thus
defining all of the other members of the matrix. The same vector was
used as the "right hand side" of the linear system. Only Toeplitz
matrices were used in Faddeev simulations.
Most of our simulations were based on 16x16 Toeplitz matrices, a
size corresponding to some system problems of interest to us. For the
Toeplitz system solver the register lengths needed were quite large; an
average RMS output error in [X] of 0.001 required 22-bit floating
point registers (18-bit mantissa and 4-bit exponent), or 28-bit fixed
point registers. A typical cumulative error distribution for 50 dif-
ferent randomly generated sets of 8-bit input data is shown in Figure 4.
A fixed point register with 8 bits to the left of the decimal point and
19 bits to the right of the decimal point was used.

D. HARDWARE IMPLEMENTATION OF ARRAY PROCESSORS

Reduction to hardware for functional processor architectures


requires an examination of the tradeoffs associated with different
implementations. Using the 16x16 Toeplitz system solver as an example,
we have evaluated the performance capabilities of a wide range of such
possibilities in terms of speed, power consumption, and chip area.
The numerous hardware configurations of the Toeplitz system solver
we considered are shown in Table 1. The terminology in Table 1 can be
understood by considering that the Toeplitz system solver operates with
n sections (corresponding to an n x n matrix [R]) in parallel, each
section requiring three "ax-b" operations per recursion, as indicated
in Figure 1. If these "ax-b" operations are done in parallel using
three processors, then the first term in the nomenclature X-X/X becomes
P-X/X. If they are done serially, using only one "ax-b" processor,
then the appropriate nomenclature is S-X/X. The arithmetical "ax-b"
calculation can also be done in various fashions, as shown in Table 2.
We did not consider the P/S arithmetic.
In order to make the basis of comparison for fixed and floating
point processor sections the same, we used register lengths described
in Section C that gave the same average RMS error of 0.001.
The hardware configurations checked in Table 1 were broken down
into all of the necessary component circuits. Rough layouts were
obtained for each of these so that each section of the array could be
analyzed in terms of speed, power consumption, and area required. From
this information, overall single section performance specifications
could be obtained for each architecture. It was assumed that control
for each processor could be obtained from a separate PLA chip. In this
way it would not be necessary to consider the more detailed concerns of
control and timing. All data was calculated for a depletion load NMOS
process with one metal layer at 5 μm nominal design rules; however, a
constant field scaling parameter, λ, was included (λ = 5 for 5 μm design
rules) in order that estimates could be obtained for other design rules.
This should be accurate to dimensions as small as 2 μm, where linear
scaling begins to break down. The results of this effort are pre-
sented in Table 3.
This information can be tabulated in forms that might be more
suitable to a system designer who chooses architectures based on a fig-
ure of merit. Three possible figures of merit have been calculated and
are also shown in Table 3. These are power-delay product, throughput,
and throughput/power. Throughput is the product of maximum operating
speed (frequency) times the processor density. This figure of merit
allows a slower, but denser architecture to be compared to a faster,
but more sparse architecture. Equal figures of merit for throughput
would imply equal rates of arithmetic calculations. The other figures
of merit should be self-explanatory.
An example of an S-S/P floating point processor section layout is
shown in Figure 5. The computing and storage elements are all arranged
on a horizontal parallel bus with a single parallel vertical bus con-
nection to other processor sections. The LIFO stack is at one end of
the bus and the computing elements are at the opposite end. Sufficient
logic circuitry is available to align mantissas for addition and

subtraction, and to perform re-normalization at the end of each
calculation. A carry-save adder is used in conjunction with a fast
look-ahead adder in order to perform more efficient multiplications.

REFERENCES

1. J.M. Speiser and H.J. Whitehouse, "Parallel Processing Algorithms
and Architectures for Real-Time Signal Processing," 25th Annual
SPIE Conference, San Diego, August 24, 1981.

2. S.Y. Kung, "Highly Parallel Architecture for Solving Linear
Equations," IEEE ICASSP, Atlanta, Ga., March 30 - April 1, 1981.

3. V.N. Faddeeva, Computational Methods of Linear Algebra, translated
by Curtis D. Benster (Dover Publications, 1959).

Figure 1. Block diagram of Toeplitz linear system solver of section
size n (Weiner-Levinson lattice array with LIFO stacks, and a back
substitution array with storage registers, processors, and a shift
register; switches S1-S3 route [C], ψi, and Xi). Processors Pi perform
two "ax-b" operations and processors Qi perform one "ax-b" operation
for each recursion. Processor P1 also contains a divider circuit
which is used to obtain a "reflection coefficient" as part of the
WL algorithm.

Figure 2. Data flow in Faddeev algorithm (problem: Ax = B; the
numerical tableaux of the worked example are not reproducible here).
After each recursion, data is shifted up and to the left and the row
[-1, 0, ..., 0] of -[I] is restored. Although this example [A] is
Toeplitz, it can be any matrix with a non-zero determinant.

Figure 3. Block diagram of (n + 1) x (n + 1) array of pipelined
processors for solving linear systems problems; the top row holds
rows of I. Processor types: one that finds the reciprocal a^-1, one
that finds x = a - bc, one that finds ab, and a storage register
for a. This array is of the "wavefront" variety with processors
active along diagonal lines at times t1, t2, and t3 (t3 > t2 > t1).

Figure 4. Cumulative error distribution for Toeplitz system solver
(Weiner-Levinson, 16 sections, fixed point (8,19), 50 samples; number
exceeding vs. RMS error). The "saturated" curve corresponds to results
obtained when the fixed point register is set to its maximum value for
overflow. The section size of 16 corresponds to the size of the
Toeplitz matrix [R].

Figure 5. Block diagram of S-P/S 2's complement floating-point
processor for Toeplitz system solver. The LIFO stack and the computing
elements (two-word register, shifter, 1's insert, exponent add, 2's
complement, carry-save add, accumulator, look-ahead add, left justify)
share a horizontal bus. (Elements are not drawn to scale.)

Table 1. Architectures Examined for Toeplitz System Solver are
Indicated by Symbol "x"

                               P-P/P  S-P/P  P-S/P  S-S/P  P-S/S  S-S/S
DOUBLE PRECISION FIXED POINT     x      x      x      x
SINGLE PRECISION FIXED POINT     x      x      x      x      x
FLOATING POINT                   x      x
Table 2. Arithmetic "ax-b" Hardware Possibilities

NOMENCLATURE   MEANING
P/P            WORD PARALLEL, BIT PARALLEL
S/P            WORD SERIAL, BIT PARALLEL (e.g., SHIFT-AND-ADD
               MULTIPLY ALGORITHM)
S/S            WORD SERIAL, BIT SERIAL
P/S            WORD PARALLEL, BIT SERIAL

Table 3. Toeplitz System Solver Processor Comparisons

        SINGLE PRECISION, FIXED POINT   DOUBLE PRECISION, FIXED POINT   SINGLE PRECISION, FLOATING POINT
         S     P    A    PT   T   T/P    S     P    A    PT   T   T/P    S     P     A    PT   T   T/P
P-P/P   0.41  87   56   41   4.4 0.05   0.66  28   14   18   11  0.39   0.22  42    25   9.2  18  0.43
S-P/P   0.92  34   20   31   5.4 0.16   2.0   15   6.1  30  8.2  0.55   0.65  17.4  9.4  11   16  0.92
P-S/P   0.68  14   5.6  9.5  26  1.9    0.60  17   7.4  10   23  1.4
S-S/P   1.9   10   1.9  19   28  2.8    1.7   9.7  2.8  16   21  2.2
P-S/S   3.9   9.7  1.2  38   21  2.2

S = SPEED (10^3 λ nsec)

P = POWER (λ^2 mW)

A = AREA (10^-3 λ^2 cm^2)

PT = POWER-DELAY (10^3 λ^3 pJ)

T = THROUGHPUT = S^-1 x A^-1 (10^2 λ^3 nsec-cm^2)^-1

T/P = THROUGHPUT/POWER = (S x A x P)^-1 (10^2 λ^5 nsec-cm^2-W)^-1


A General-Purpose CAM-Based System

J. Storrs Hall
Rutgers University

KEYWORDS
content addressable memory, parallelism, general purpose architectures, CAML,
associative processing, applications of VLSI systems, algorithm design for VLSI
systems, impact of VLSI in system design

ABSTRACT
VLSI makes feasible the massive use of content addressable memory in a
general purpose computer. We present a design for memory which is
addressable conventionally and by content, and which supports low-level bit-serial
word-parallel algorithms. CAM provides one of the most easily understood and
programmed frameworks for massively parallel computations. We present a
programming methodology for the use of our design. This includes a
programming language, CAML; a number of algorithms from various fields which
demonstrate the generality of the design and the language; and techniques for
transforming algorithms from conventional to CAM-based structures and methods.

We do not attempt to better the performance of specialized hardware for


particular tasks. Our contention is that CAM is a practical technique for the
application to general-purpose use of the massive parallelism available with VLSI.

HARDWARE
The CAM consists of words of memory, responder bits, logic to do
comparisons between the bits in the words and the values on the bus, logic to
manipulate the responder bits using the results of the comparisons, and logic that
controls whether or not a word is active (comparisons happen, the word can be
written into) depending on the values of the responder bits.

The addressing hardware and the bus are such that values can be read from
and written to individual words by address as in a conventional memory. The
addressing circuitry also controls the CAM activity in associative mode, such that
upper and lower addresses are specified which limit the associative activity. This
provides the ability to deal with blocks of memory as associative structures,
without interfering with the rest of memory.


The memory is controlled by a rather conventional processor.

OPERATIONS ON MEMORY
There are four basic operations possible for the CAM part of the machine:
reading words, writing words, setting flags depending on the data bits, and setting
data bits depending on the flags.
Reading and writing single words are equivalent to the read and write
operations in a conventional memory. Only single words may be read.
Writing may cause data to be written from the data bus into all selected
words. Words are selected by (a) being inside the specified address range, and
(b) having a response bit or bits that agree with the conditions specified. The set
of conditions is just the state of some of the lines on the bus and may be
thought of as part of the address (e.g. "all words from location 234 through 345
in which response bit 0 is 1"). It is possible to specify a bit mask for the write
operation. The write operation only changes bits corresponding to ones in the bit
mask.
When the data is written into each individual word, it is ORed with a value
specified from the flags (response bits) in that word. This can be made 0 by
selecting no flags for this purpose, so that the value on the bus gets written.
Alternatively, by putting 0 on the bus, we may write the value from a flag into a
data bit.
Set flags is the most characteristic feature of a CAM. It allows flag bits in
each word to be set from data in that word. This is a parallel operation, in that
it is performed on each selected word in memory independently. In this
operation, a one-bit result is computed for each selected word. Words are
selected by the addressing just described for writing. One or more flag bits are
specified on the bus, and the result for each selected word is written into those
flag bits in that word.
The results for a given word are as follows. Let Bi be the ith data bit in a
word; Di be the ith data line in the bus; Mi be the ith mask line in the bus. Let
Ri be -Mi + (Bi eqv Di), that is, 1 if the bit equals the bus or 1 if not in the mask.
The Ri are ANDed together. Before being written into the specified response
bits, the result is XORed with a final line on the bus to give the processor a
chance to invert it.
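A small Python model of this operation may help (ours; the real chip evaluates
all selected words simultaneously, where the loop below stands in for that
parallelism):

    def set_flags(words, flags, lo, hi, data, mask, invert):
        # words: list of bit-lists; flags: one response bit per word
        for w in range(lo, hi + 1):              # address-range selection
            r = all((not m) or (b == d)          # R_i = -M_i + (B_i eqv D_i)
                    for b, d, m in zip(words[w], data, mask))
            flags[w] = r != invert               # XOR with the invert line

    words = [[1, 0, 1], [1, 1, 1], [0, 0, 1]]
    flags = [False] * 3
    # match pattern 1x1 (mask selects bits 0 and 2) over words 0..2
    set_flags(words, flags, 0, 2, data=[1, 0, 1], mask=[1, 0, 1], invert=False)
    print(flags)    # -> [True, True, False]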

The addressing logic is such that we can do the following: (a) select a range
of words for associative access; (b) determine the address of the first active
word; (c) determine the number of active words (responders).
A major difficulty in the design of the machine as a whole arises from the
fact that all the lines on the bus must go into each chip. In a conventional
memory architecture, a 16x16K memory can be implemented with 16 1x16K

chips, since there need be no connection between different bits of the same
word. Not so in the CAM, where each bit must be connected with the flags for
that word.
Our design exacerbates the problem, since the "address range" capability
requires that 2 addresses be supplied with each memory access. Furthermore,
usability requires fairly large word sizes, 64 bits at the very least. Data lines plus
mask lines plus two addresses gives many more than can be fed onto a chip.
We multiplex the signals: since we generally are dealing with fields in the
word, we put only a slice, say 32 bits, on the bus. We can get away with
splitting the bus into data, mask, low address, and high address phases. Since
many of the algorithms use the same range of memory in many successive
accesses, we allow the addresses to be latched into memory. In some non-
negligible portion of cases, we may dispense with the mask by assuming it is all
ones.
We refrain from designing a processor, as a conventional variety is sufficient.
We will assume a microprogrammed one with lots of internal registers, since
some reasonably complex algorithms underlie our "machine language".

OPERATIONS IN MICROCODE

Given the ability in hardware to find all words in which given bits have given
values, we can write bit-serial, word-parallel algorithms to find words in which a
given field has a numerical value greater than, less than, etc., a given value. Given
the ability to determine if there are any responders, there are bit-serial, word-
parallel algorithms for finding the maximum or minimum element. Given the ability
to count the responders, it is simple to sum the elements of an array in parallel.
Given these and the ability to write into selected bits of selected words
simultaneously, we can add or subtract a given value to selected words with a
bit-serial, word-parallel algorithm. Similarly, we can do arithmetic between fields
in the same word, for all words in parallel.
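As one example, the bit-serial, word-parallel maximum search can be modeled as
follows (our sketch): scan from the most significant bit down, and at each
position keep only the responders with a 1 there whenever there are any. Each
bit costs a constant number of CAM cycles, independent of the number of words.

    def cam_max(words, width):
        candidates = list(range(len(words)))
        for bit in range(width - 1, -1, -1):
            ones = [w for w in candidates if (words[w] >> bit) & 1]
            if ones:                   # the "any responders?" test
                candidates = ones
        return candidates              # indices of words holding the maximum

    print(cam_max([5, 9, 3, 9], width=4))    # -> [1, 3]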

The hardware can be optimized for these algorithms. For example, to add
the contents of a register to some field of all words, we apply the
following algorithm (adapted from [Foster]) on the unoptimized hardware
described above:

[1] Denote one flag "carry-in" and one "carry-out". Set carry-in in all
words to 0. Set I (an internal variable in the CPU) to the number of the
least significant bit in the field to be added to, and J to the number of
the least significant bit in the register to be added to all the words.
[2] Set carry-out in all words to 0. If bit J of the register is 1, go to
[4].

[3] (Register bit J is 0) In words where carry-in is 1, write the value of
word bit I into carry-out. In words where carry-in is 1, write
not(carry-out) into bit I. Go to step [5].

[4] (Register bit J is 1) In words where carry-in is 1, set carry-out
to 1. In words where carry-in is 0, write the value of word bit I
into carry-out. In words where carry-in is 0, write not(carry-out)
into bit I.

[5] I := I - 1; J := J - 1 (that is, moving toward more significant bits).
In all words, write the value of carry-out into carry-in. If I and J
still point to bits in the field, go to [2].

This algorithm requires 4 or 5 memory cycles per bit (depending on the register
bit). Assuming a 200-nanosecond memory, this is a microsecond per bit. We
can speed this and the other arithmetic up by attaching a full adder between two
of the flag bits in such a way that each bit would take a single memory cycle.
This is probably a good tradeoff since it would cost an extra 5% at most in
hardware and produce a 4- to 5-fold speedup.
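The five steps translate directly into a Python model (ours; bits are numbered
with the most significant bit first, matching step [5], and the inner loop
stands in for the CAM's word parallelism):

    def cam_field_add(words, field_msb, field_width, reg_bits):
        carry_in = [0] * len(words)                    # step [1]
        for k in range(field_width - 1, -1, -1):       # LSB toward MSB
            i = field_msb + k                          # word bit I
            j = k                                      # register bit J
            carry_out = [0] * len(words)               # step [2]
            for w, word in enumerate(words):           # "parallel" over words
                if reg_bits[j] == 0:                   # step [3]
                    if carry_in[w]:
                        carry_out[w] = word[i]
                        word[i] = 1 - carry_out[w]
                else:                                  # step [4]
                    if carry_in[w]:
                        carry_out[w] = 1
                    else:
                        carry_out[w] = word[i]
                        word[i] = 1 - carry_out[w]
            carry_in = carry_out[:]                    # step [5]

    # add binary 011 (= 3) to a 3-bit field of the words 010 (2) and 100 (4)
    words = [[0, 1, 0], [1, 0, 0]]
    cam_field_add(words, field_msb=0, field_width=3, reg_bits=[0, 1, 1])
    print(words)    # -> [[1, 0, 1], [1, 1, 1]], i.e. 5 and 7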

Further speed optimization can be obtained by adding a sequencer to each


chip, essentially importing the microcode algorithms for the more common
operations. The speedup here depends on the difference between on-chip and
interchip speeds; the cost depends on chip size. Here the tradeoff is less clear.
As a reductio ad absurdum, we could place an adder at each bit; this would
almost certainly cost more than it would be worth.

CAML
To make possible the independent development of machine architecture and
software, and to facilitate the latter, we have developed a "systems level"
language for programming instead of assembly language. CAML hides the flag
bits from the programmer in the same way BLISS or C hides registers in a
conventional machine.

The basic data construct in CAML, besides scalars of various types, is the
array of records. A pseudovector is the occurrences of some field in a subset of
the records of the array (e.g., the B field in all records where the A field is
greater than 17). In an array foo this would be written foo[:a>17].b. Only
one array can be the basis of the pseudovectors in any one statement, so the
fieldnames are used alone after the initial mention of the array. For example,
foo[2:34 : x>22 & y=z].z := x-17 means "in records 2 through 34
inclusive of foo, in which the x field is greater than 22 and the y field is equal
to the z field, place in the z field the value of the x field minus 17." Indexing is
zero-based.
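For intuition, here is a Python rendering (ours, not CAML itself) of that
statement over an array of records; on the CAM every selected record is updated
in one parallel step, where the loop below stands in for the hardware:

    foo = [dict(x=0, y=0, z=0) for _ in range(40)]
    foo[5].update(x=30, y=4, z=4)

    for rec in foo[2:35]:                 # records 2 through 34, inclusive
        if rec["x"] > 22 and rec["y"] == rec["z"]:
            rec["z"] = rec["x"] - 17      # foo[2:34 : x>22 & y=z].z := x-17

    print(foo[5])    # -> {'x': 30, 'y': 4, 'z': 13}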

CAML allows operations between pseudovectors exactly where these can be
done in parallel on the hardware. Thus it is appropriate for the development of
algorithms which take advantage of the CAM architecture. Primitive functions
include those which are primitives at the microcode level, i.e., arithmetic between
fields in parallel, count responders (#foo[:x>3]), address of first responder
(@foo[:x>3]), etc.

Control constructs in CAML also reflect the capabilities of the CAM. Iteration
through a pseudovector, for example, is included since the deaddressing capability
allows this to be done in time that depends on the number of items in the
pseudovector, but which does not depend on the length of the base array (the
original array in which the selected records lie).

CAML is two-dimensional, i.e., indentation is part of the syntax.

ALGORITHMS ON A CAM
We support our contention that a CAM is well-suited as a general-purpose
machine with a wide selection of algorithms from different areas of computer
science. The benefits of simplicity and speed are sometimes available together,
sometimes separately. For algorithms using conventional data structures, usually
the algorithm can be simplified, and occasional speedups occur. Sometimes we
can change the data structure and obtain more radical improvement.

A compacting garbage collector can be implemented on a CAM much more


simply than on a conventional machine, since it is possible to change all the
pointers pointing to a given address to another one in a single (fast) operation.
This eliminates the need for "forwarding addresses." Besides whatever speed it
gains from its simplicity, it saves time "sweeping" since it can find each mark bit
in time that doesn't depend on the amount of empty space between.
This is a garbage collector for a conventional heap, with blocks of pointers.
There are N words. There are two sets of comments; those on the same lines
as the code tell what the purpose of the code is in terms of the algorithm, and
those under the statements tell what they do in terms of the machine and data.
; structure declarations go between {}
{ mem(n) .f, .g(bit) .ptr(index(n)) ;this is the heap
  global scalar root ;from which pointers are traced
  local scalar bot
    ;moves through memory in the compacting phase
}

; memory is full of blocks with pointers.
; the first word in a block has the length of the
; block (not counting itself) instead of a pointer

mem.f & g := 0 ;clear mark bits
    ;set fields f and g in all words to 0
mem[root].f := 1 ;mark phase: mark from the root
    ;set field f of the word root points to to 1

while i := @mem[:f] ;find half-marked cell
                    ;(ie one not marked from)
    ;iterate through words whose f field is 1, letting i be
    ;the address of each successively
    ;the entire array mem is searched on each iteration
  mem[i].g := 1 ;indicate fully marked
    ;set the g field of the i'th word to 1
  for k := 1 to mem[i].ptr ;half-mark every cell
                           ;it points to
    ;iterate with k becoming 1 through mem[i].ptr
    mem[mem[k+i].ptr].f := 1
      ;set the f field of the word pointed to by the ptr
      ;field of the k+i'th word to 1
    ;end iteration on k, after k=mem[i].ptr
  mem[:g].f := 0 ;remove half-marks from full-marked cells
    ;set the f field of all the words whose g field is 1, to 0
  ;end iteration on i, when no word with f=1 is found

bot := 0 ;(a scalar) sweep phase:

mem.f := g ;f now means relocated. setting it to g
           ;guards the lengths, which shouldn't be relocated.
    ;set the f field of each word to the value of its g field

for i := @mem[:g] ;find the first marked place
    ;iterate through words whose g field contains 1.
    ;i is the address of each successive one.
    ;search for successive words considers only words following
    ;the previous one. Note the difference from while.
  mem[i].g := 0 ;unmark it
  for k := 0 to mem[i].ptr ;move down
    mem[bot+k] := mem[i+k] ;full record assignment
    ;end iteration on k, after k=mem[i].ptr
  mem[:-f & ptr=i].f := 1 ;relocate all pointers to it
             .ptr := bot
    ;set the f field to 1, and the ptr field to bot,
    ;in all words where the f field=0 and the ptr field = i
  bot :+= mem[bot].ptr ;move up for next
  ;end iteration on i, when no more words with g=1 remain
The conventional form of the same garbage collector is much more complex,
requiring forwarding addresses, an extra relocation pass, and an extra pointer per
record.
Sorting may be done in linear time using an algorithm which would be
quadratic on a conventional machine. This is a gain in simplicity and speed. Note

that "linear" here refers to an algorithm which does n extractions of the largest
element and that is a bit-serial algorithm (although presumably in microcode); so
there is an implicit extra factor of length-ot-key. However, length-ot-key is
also an implicit extra factor in a comparison when sorting on a conventional
machine; so we are not taking liberties.

The technique for conversion here is to find an algorithm (possibly not an


optimal one) which has a "bottleneck" loop which can be reduced to a parallel
CAM operation. This same technique can be applied to matrix inversion by Gauss
elimination, for example.

Parenthetically, there is a novel sorting algorithm for a CAM based on taking


each record, deciding where it should go in the output based on how many
records in the input have a lower key. This is one of the most succinct
examples of the usefulness of the interaction of coordinate (conventional) and
content addressability:
{ data(n) .key(number) .junk(whatever) ;input array
  sorted(n) .key2(number) .junk2(whatever) ;output array
}
for i := 0 to n
  sorted[#data[:key<data[i].key]] := data[i]
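In Python terms the enumeration sort reads as follows (our rendering; as in the
CAML version, distinct keys are assumed, since equal keys would claim the same
output slot). On the CAM the count is a single parallel operation per record,
giving O(n) steps overall:

    data = [dict(key=5, junk="a"), dict(key=2, junk="b"), dict(key=9, junk="c")]
    sorted_out = [None] * len(data)
    for rec in data:
        pos = sum(1 for other in data if other["key"] < rec["key"])
        sorted_out[pos] = rec
    print([r["key"] for r in sorted_out])    # -> [2, 5, 9]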

GRAPH ALGORITHMS

The garbage collector is much simpler on the CAM, and runs faster by some
constant factor, but is still a reflection of a data structure designed for a
conventional machine. In some cases, we can speed up an algorithm by a linear
factor by appropriate rearrangement of the data structure.

Dijkstra's algorithm for shortest path in a weighted digraph can be
implemented on a CAM in linear time (linear in vertices). The basic algorithm is
quadratic in vertices in a conventional implementation. Sophisticated conventional
implementations can be linear in the number of edges, but this can still be
quadratic in vertices for many graphs. Similarly, minimal spanning tree can be
done on the CAM in time linear in vertices, where the conventional algorithm is n
log n in edges.

[1] [initialize] There are n vertices; we are looking for the shortest path
from vertex 1 to all the rest. Array u(n): u(1) = 0, u(i>1) = +infinity.
Array a(n,n): a(i,j) is the arc length from vertex i to vertex j. The set
T contains all vertices except (vertex number) 1. Set k to 1.

[2] [update u] For all i in T: u(i) := min(u(i), u(k)+a(k,i))

[3] [new k] k := i such that u(i) is a minimum for i in T. Remove k
from T. Repeat steps 2 and 3 until T is empty.

In a straightforward implementation steps 2 and 3 are each linear time in the



size of T, so the whole thing is n^2 in the number of vertices. We can get around
this, however, by rearranging the data structures. The trick is to distribute u and t
around to the edges so we can do the update-u operation in parallel. This leaves
us with a copy of u(i) for each edge leading into vertex i, with the "true" value
of u(i) being the minimum of that set. Likewise t is represented by a t for each
edge coming into a vertex; these all change in parallel, so each is the "true" value.
{ edge(n^2) .x, .y(index(n)) .a, .u(number) .t(bit) }
; 0. [initialize]
; assume fields x, y, and a have been initialized
; assume there is at least one record with y = 1;
; if not, a dummy may be inserted.
edge[:y=1].u := 0      ; set u and t to 0 for
          .t := 0      ; all edges whose y is 1
edge[:y>1].u := 999999 ; set u to "infinity" and
          .t := 1      ; t to 1 for all other edges
k := 1                 ; k is a scalar variable

while edge[:t]
; iterate as long as any t field in edge is 1

; 2. [update u]
; essentially the same as above, but we only update the
; copy of u(i) associated with (k,i).
edge[:t & x=k].u :min= min(edge[:y=k].u)+a
; in each edge record where the t field is 1
; and the x field equals k, set the u field to
; the min of its previous value and the sum of
; its a value and the minimum u value in all
; records where the y field equals k

; 3. [new k]
; taking a "grand total" minimum of all the u's instead of
; doing each u(i) incrementally
k := edge[:t:min(u)].y
; k gets the y value from the record with the
; minimum u value from those records in which the
; t field is 1

edge[:y=k].t := 0
; set the t field to 0 in all records where
; the y value equals k.
; exit the iteration when no record's t field contains 1

This new form of the algorithm runs in time proportional to n, the number of
vertices. It remains to prove that it works.
[1] Every vertex with at least one incoming edge has at least one u
value, since the u's are stored corresponding to the y's.

[2] The successive values of u(i) in the conventional version correspond
to the values assigned at the various incoming edges in the CAM
version.

    [a] The current value of a u(i) at any point in the conventional
    form is a running minimum of the values presented. Thus the
    min(u) is the same as a "grand total" min over all the u's in the
    parallel form.

    [b] The fact that the u's for all the y=i have not been updated
    does not matter, since the u used in the addition is the result
    of the grand-total min as per [a], and the value of u for this
    edge can only be higher than or equal to the "correct" u(i) for
    the vertex.

[3] The t values are manipulated only by y=k and are thus set and
tested only in blocks corresponding to the individual bits in the serial
version.
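
A Python/NumPy simulation of the edge-parallel form (our sketch; each masked
assignment stands in for one parallel CAM operation over the edge records, and
the dummy edge into vertex 1 is included as the comments in the listing require):

import numpy as np

def cam_dijkstra(x, y, a):
    # Edge records x[e] -> y[e] with length a[e]; one copy of u and t is
    # kept per edge.  A dummy edge into vertex 1 is assumed present.
    INF = 999999.0
    u = np.where(y == 1, 0.0, INF)           # 0. [initialize]
    t = (y != 1)
    k = 1
    while t.any():
        uk = u[y == k].min()                 # grand-total min into k
        sel = t & (x == k)                   # 2. [update u], in parallel
        u[sel] = np.minimum(u[sel], uk + a[sel])
        live = np.flatnonzero(t)             # 3. [new k]
        k = y[live[np.argmin(u[live])]]
        t[y == k] = False
    return u

# digraph 1->2 (1), 1->3 (4), 2->3 (2), plus the dummy edge 1->1 (0)
x = np.array([1, 1, 1, 2]); y = np.array([1, 2, 3, 3])
a = np.array([0.0, 1.0, 4.0, 2.0])
print(cam_dijkstra(x, y, a))   # min of u over the edges into v is dist(v)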

Like the garbage collector, the minimal spanning tree algorithm gains a
marvelous simplicity (compared with an efficient conventional version), but it gains
a major speedup as well. The algorithm is one well suited to a CAM: Pick edges
of minimum cost that don't form a cycle until all vertices are connected.
{ tree(n^2) .x, .y, .bx, .by(index(n))
  .cost(number) .mstp(bit) }
; these are edges, connecting vertices x and y.
tree.bx := x       ; originally each vertex
    .by := y       ; is a separate subtree
    .mstp := 0     ; and no edge is in the spanning tree
while i := @tree[:bx~=by:min(cost)]
; find edge of min cost which doesn't form a cycle
    tree[i].mstp := 1     ; put this edge in tree
    new := tree[i].bx     ; change the subtree number of one
    old := tree[i].by     ; of the subtrees to the other's
    tree[:bx=old].bx := new
    tree[:by=old].by := new
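
A Python/NumPy simulation of the same loop (ours; the masked assignments
correspond to the parallel subtree-relabelling operations):

import numpy as np

def cam_mst(x, y, cost):
    # Each edge carries subtree labels bx, by for its endpoints; the
    # cheapest edge joining two different subtrees is taken, then one
    # label is renamed to the other everywhere, in parallel.
    bx, by = x.copy(), y.copy()          # each vertex its own subtree
    mstp = np.zeros(len(x), dtype=bool)
    while True:
        cand = np.flatnonzero(bx != by)  # edges that don't close a cycle
        if cand.size == 0:
            break
        i = cand[np.argmin(cost[cand])]  # the min-cost such edge
        mstp[i] = True                   # put this edge in the tree
        new, old = bx[i], by[i]          # merge subtree old into new
        bx[bx == old] = new
        by[by == old] = new
    return mstp

x = np.array([1, 1, 2, 3]); y = np.array([2, 3, 3, 4])
cost = np.array([1.0, 3.0, 2.0, 1.0])
print(cam_mst(x, y, cost))               # [ True False  True  True]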

These CAM algorithms use a technique we call "recomposition," which is the
opposite of "decomposition" in relational databases. By this we merely mean
storing explicitly the composition of relations which would conventionally be
stored separately. In the graph algorithms, we typically take information that
would normally be associated with a vertex and associate a copy of it with each
edge entering the vertex. These algorithms also use the technique of making
structural information explicit, i.e., explicitly storing the vertices for each edge
instead of making that information implicit in a data structure.

Fairly small relational databases of the kind often used as knowledge bases in
AI programs can be stored directly as tuples (records) and manipulated with
extreme ease. By the appropriate encoding, contexts of the Conniver variety can
be represented by an extra tag on each tuple, allowing constant-time insertion
and retrieval in the complex structure. An appropriate encoding is merely an
enumeration of the context tree for which an appropriate bit pattern will select
exactly those nodes from the root to some selected node, and of those the
farthest from the root has the highest numeric value. This is another example of
the technique of making structure explicit.
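
One concrete realization (our sketch; the paper gives no code, and all names
here are illustrative): number the contexts so that an ancestor always carries a
smaller number than its descendants, tag each tuple with the context that
asserted it, and answer a query in context c by selecting the tuples whose tags
lie on the root-to-c path and keeping the one farthest from the root:

def lookup(db, parent, key, c):
    # db: list of (key, value, tag) tuples, tag = asserting context number.
    # A tuple is selected if its tag lies on the root..c path; among those
    # selected, the deepest (highest-numbered) context wins.
    path = set()
    while c is not None:                 # collect the root-to-c path
        path.add(c)
        c = parent.get(c)
    hits = [(tag, val) for k, val, tag in db if k == key and tag in path]
    return max(hits)[1] if hits else None

parent = {1: 0, 2: 1, 3: 1}              # context tree rooted at 0
db = [('color', 'red', 0), ('color', 'blue', 2)]
print(lookup(db, parent, 'color', 2))    # 'blue' shadows 'red' in context 2
print(lookup(db, parent, 'color', 3))    # context 3 still sees 'red'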

REFERENCES
This is a condensed version of Rutgers LCSR-TR-16, which contains in
particular a more detailed description of CAML.

Foster, Caxton C.: Content Addressable Parallel Processors.
Van Nostrand Reinhold, New York, 1976.

Kohonen, Teuvo: Content-Addressable Memories.
Springer-Verlag, Berlin, 1980.

Mead, Carver and Conway, Lynn: Introduction to VLSI Systems.
Addison-Wesley, Reading, Mass., 1980.
A Statically Scheduled VLSI Interconnect
for Parallel Processors

B.R. Rau, P.J. Kuekes, and C.D. Glaeser


ESL Inc.
2249 Zanker Road
San Jose, California 95131

The interconnect mechanism of a computer system plays a central
role in determining the performance of the system. Consequently, much
effort has been expended in designing interconnection structures and in
attempting to bring the full power of VLSI technology to bear upon this
issue. One particularly attractive candidate for VLSI implementation
is a crossbar switch of relatively small dimensions (e.g., 2x2 or 4x4).
A crossbar switch of this type can be used to build larger crossbar
switches (which may be either sparse or complete), multi-bus
interconnects with the crossbar chips serving to connect the various
resources to the multiple buses, and multi-stage networks such as the
delta network [1].
However, a quick analysis reveals the inappropriateness of such a
circuit for VLSI implementation due to the excessive amount of pin I/O
in relation to logic. In fact, a 1-bit wide 8x8 crossbar chip, when
used to build a crossbar network, would reduce the board area consumed
by only a factor of 3 in comparison to an MSI implementation, and would
cost twice as much. The successful exploitation of VLSI technology
requires that some way be found of effectively utilizing the number of
devices that can be placed on a chip which has a large number of pins.
The polycyclic architecture [2,3] provides such an opportunity.
It consists of a number of resources (processing elements, memories and
buses) which are connected together by an interconnect network. The
topology of the interconnect network may be arbitrary in that each
output of each resource may be connected to one, some or all of the
inputs of the resources. The polycyclic architecture possesses
parallelism by virtue of its multiple processing elements and permits
intermediate values to go directly from the output of one resource to
the input of another resource without being written into and read from
the main memory. The effective use of a polycyclic processor requires
that the individual operations in the computation be scheduled upon the
hardware resources so as to maximize the throughput. Optimal and
near-optimal techniques for scheduling important computation structures
have been developed for this purpose [3]. The resulting schedules lead
to situations where a value is transmitted at a certain point in time
(a fixed number of cycles after the corresponding operation was
scheduled) but is not needed until some time later when the operation
to which it is an input is scheduled.
A straightforward solution is to provide a dedicated delay element
between each resource output and resource input which are directly
connected to each other. The delay element must have the ability to
delay each value that passes through it by an arbitrary number of
clocks. In practice, a delay element would be implemented as a small
buffer memory with a structure that will be described shortly. These
delay elements add sufficient logic complexity to the crossbar chip to
effectively and usefully exploit the capabilities of VLSI. The use of
scratch-pad register files to provide the delays is not an adequate
solution since the delay elements, which are intended to facilitate the
scheduling task, would now themselves be resources that need to be
scheduled. This circularity greatly complicates the scheduling task
[3].
A careful study of a number of possibilities has yielded no
satisfactory alternative to providing a delay element between every
output and input that are directly connected. In view of the
advantages of a small crossbar switch as a building block, it was
decided to design an interconnect chip consisting of a crossbar with a
delay element at each cross-point to facilitate the construction of a
wide variety of interconnection networks for polycyclic processors.
The structure of an individual delay element is determined by the
following considerations. Each value that passes through it will be
written into it once, be read one or more times, and will then be
deleted. The multiple reads result from two or more operations, which
have as input the same value, being scheduled on the same resource.
More than one value may be in the delay element at any one time, and
their arrivals, usages and deletions may have arbitrary relative
orderings. The memory element implied by this need be nothing more
than a register file with the capability of reading from or writing to
any register.
A further issue is illustrated by the program fragment in Figure
1a, which computes the first 101 terms in the Fibonacci series and
stores them in the array A. Each iteration of the loop computes a new
value (for T0) which is used by the two subsequent iterations.
However, this value has to be reassigned, on successive iterations,
first to T1 and then to T2, in effect performing a programmatic shift
of the value from T0 to T1 to T2. Such programmed shifts can
(a)                                 (b)

T1 := 1;                            T1 := 1;
T2 := 1;                            T2 := 1;
A[0] := T2;                         A[0] := T2;
A[1] := T1;                         A[1] := T1;
FOR I := 2 TO 100 DO                FOR I := 2 STEP 3 TO 98 DO
BEGIN                               BEGIN
  T0 := T1 + T2;                      T0 := T1 + T2;
  A[I] := T0;                         A[I] := T0;
  T2 := T1;                           T1 := T2 + T0;
  T1 := T0;                           A[I+1] := T1;
END;                                  T2 := T0 + T1;
                                      A[I+2] := T2;
                                    END;

Figure 1. Program Fragment
complicate the scheduling task for much the same reasons that the use
of scratch-pad registers for implementing delays causes problems. If
programmed shifts are to be avoided in this example, the loop of Figure
1a would have to be unrolled as in Figure 1b where three members of the
Fibonacci series are computed on each iteration. In general, this can
increase the size of the program quite substantially if the body of the
original loop is sizable. The need to perform such a shift arises
whenever successive instances of a value are written into the delay
element before previous ones have been deleted. If each instance of
the value is to be accessed using the same address, and if programmed
shifts are to be avoided, the delay element must have a built-in shift
capability to perform the equivalent of the programmed shift in Figure
1a.
Precious pins can be saved if the address that a value is to be
written into in the delay element does not have to be explicitly
provided. Instead, on-chip logic associated with each delay element
maintains a pointer to the location which is to be written into next.
The resulting structure of the delay element consists of a
register file, any location of which may be read from by providing an
explicit read address. Optionally, one may specify that the value
accessed be deleted. This option would be exercised if this is the
last access to that value. The result of doing so is that every value
with an address greater than the address of the deleted value is
shifted down to the location with the next lower address.
Consequently, all values present in the delay element are compacted
into the lowest locations. An incoming value is written into the
lowest empty location which is always pointed to by the Write Pointer.
The Write Pointer is incremented each time a value is written and is
decremented each time one is deleted. As a consequence of deletions, a
value, during its residence in the delay element, drifts down to lower
addresses, and is read from various locations before it is itself
deleted. A value's current position at each instant during execution
must be known by the compiler so that the appropriate read address may
be specified by the program when the value is to be read. Keeping
track of this is a simple, if somewhat tedious, task which is easily
performed by a compiler during code-generation.
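
A behavioral Python sketch of such a delay element (ours; names are
illustrative): writes always land at the Write Pointer, and a read-with-delete
compacts every deeper value down one location, which is why the compiler must
track each value's drifting address:

class DelayElement:
    # Compacting register file: live values occupy locations 0..wp-1.
    def __init__(self, depth=16):
        self.regs = [None] * depth
        self.wp = 0                    # Write Pointer: lowest empty location

    def write(self, value):
        self.regs[self.wp] = value     # always into the lowest empty slot
        self.wp += 1                   # incremented on every write

    def read(self, addr, delete=False):
        value = self.regs[addr]
        if delete:                     # last access: compact the file
            for i in range(addr, self.wp - 1):
                self.regs[i] = self.regs[i + 1]
            self.wp -= 1               # decremented on every deletion
        return value

d = DelayElement()
d.write('a'); d.write('b'); d.write('c')
d.read(0, delete=True)                 # 'a' leaves; 'b' and 'c' drift down
print(d.read(0), d.read(1))            # b c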
There were four basic parameters that had to be specified prior to
designing the logic of the interconnect chip:
1) the number of words per register file, d,
2) the width of a word bit slice, b,
3) the number of crossbar input (write) ports, m, and
4) the number of crossbar output (read) ports, n.
These parameters had to be specified in the context of three
constraints that had to be satisfied. These constraints, which reflect
TRW's 3-D bipolar process, fabrication, packaging and testing
technologies, were that
1) the device count could not exceed 20,000,
2) the pin count could not exceed 64, and
3) the power dissipation could not exceed 4 watts per 64 pin chip.
Based on a preliminary design analysis, the following equations were
derived to evaluate the effect of the constraints upon the parameters:
devices/chip = 35(mb + mn + n(⌈log2 m⌉ + ⌈log2 d⌉ + 2))
                     + 71nb + mn(60d + 90b + 24bd)                  (1)

pins/chip = b(m + n) + mn + n(⌈log2 m⌉ + ⌈log2 d⌉ + 2) + 10         (2)

power/chip = mnb(0.02/m + 0.032(1 + 2/b))                           (3)
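
A small Python sketch (ours) that evaluates Equations 1-3 as printed, with
d = 16, and searches out the largest b meeting the device budget; it reproduces
the kind of trade-off table shown in Figure 2, up to any constants garbled in
reproduction:

from math import ceil, log2

def devices(m, n, b, d=16):                                # Equation (1)
    return 35 * (m*b + m*n + n*(ceil(log2(m)) + ceil(log2(d)) + 2)) \
           + 71*n*b + m*n*(60*d + 90*b + 24*b*d)

def pins(m, n, b, d=16):                                   # Equation (2)
    return b*(m + n) + m*n + n*(ceil(log2(m)) + ceil(log2(d)) + 2) + 10

def power(m, n, b):                                        # Equation (3),
    return 0.02*n*b + 0.032*m*n*(b + 2)                    # expanded form

for n in range(1, 5):
    for m in range(1, 5):
        b = 0
        while devices(m, n, b + 1) <= 20000:               # device budget
            b += 1
        print(f"m={m} n={n}: b_max={b:2d}  pins={pins(m, n, b):3d}  "
              f"power={power(m, n, b):.2f} W")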

When the interconnect chip is used as a building block in
constructing larger crossbars, the number of input and output ports and
the word width may all be selected to be any multiple of the
corresponding parameters of the chip. To facilitate the construction
of crossbars with more than two input ports, the chip's outputs are
implemented using tri-state logic. However, the number of words per
delay element (cross-point) in the larger crossbar cannot be increased
over the number per delay element in each chip. Consequently, it is
important that the number be large enough for potential applications.
Analysis of a diverse set of candidate problems has led us to conclude
that 16 words per register file is adequate and allows for a reasonable
margin of safety.

Figure 2 shows, for various combinations of m and n, the maximum
value of b (based on Equation 1) subject to the 20,000 device count
constraint. It also displays, for this maximum value of b, the number
of pins per chip yielded by Equation 2. The 64 pin constraint
restricts serious consideration to the four options enclosed by the
heavy lines in Figure 2. These four options are desirable in that they
simultaneously approach both the device and the pin constraints and, in
so doing, exploit the full potential of this VLSI technology. All four
options also turn out to have power dissipations that are well below
the 4 watt limit.

                            WRITE PORTS (m)

READ PORTS (n)       1           2           3           4

      1           31 BITS     15 BITS      9 BITS      7 BITS
                  81 PINS     69 PINS     65 PINS     68 PINS

      2           14 BITS      6 BITS      4 BITS      2 BITS
                  70 PINS     60 PINS     64 PINS     62 PINS

      3            9 BITS      3 BITS      2 BITS      1 BIT
                  73 PINS     63 PINS     71 PINS     74 PINS

      4            6 BITS      2 BITS      1 BIT       0 BITS
                  76 PINS     72 PINS     81 PINS     92 PINS

Figure 2. Maximum Bit Slice Width Without Exceeding 20,000 Devices Per Chip, and Number of Pins
(Assumes 16 Words/Cross-Point)
A system level analysis was used to select one of these four
options. This consisted of evaluating the total power dissipation and
the total number of chips involved in constructing a crossbar with 4
input ports, 8 output ports and a 32-bit word width. Figure 3 plots
these quantities for n = 2, 3, and for m = 1, 2, 3 and 4. The four
feasible options are circled. The m = 2, n = 2 option is the best with
respect to both power and chip count. It has the additional advantages
that m and n are powers of 2 and that m and n are equal, thereby
providing improved modularity in both the m and n dimensions when
constructing larger crossbars. A block diagram of the final design is
shown in Figure 4.

Additional features of the chip shown in Figure 4 are a diagnostic
bit-serial read-in/read-out of the input and output registers, and
status bits for detecting register-file-full and register-file-empty
conditions. These status bits potentially allow dynamic as well as
static scheduling of the interconnect (by using external logic).

References

1. J. H. Patel, "Processor-memory interconnections for multiprocessors",
   Proc. 6th Annual Symposium on Computer Architecture, pp. 168-177,
   April 1979.

2. B. R. Rau, R. L. Picard, C. D. Glaeser and E. M. Greenawalt, "The
   polycyclic architecture: a statically scheduled dataflow architecture",
   (submitted for publication).

3. B. R. Rau and C. D. Glaeser, "Some scheduling techniques and an easily
   schedulable horizontal architecture for high performance scientific
   computing", Proc. 14th Annual Workshop on Microprogramming, Oct. 1981
   (to appear).
[Two plots: total power (watts) and total IC count versus number of write ports, with curves for 2 and 3 read ports per chip; the four feasible options are circled.]

Figure 3. System Level Analysis (Assuming a 4 x 8 x 32 Bit Crossbar)


[Block diagram: two input registers with write enables WE(0) and WE(1) and data inputs DIN(0) and DIN(1), the four delay-element cells CELL(0,0) through CELL(1,1) with full and empty status outputs F and E, control inputs WPSM, RST, and DCLK, and data outputs DOUT(0) and DOUT(1).]

Figure 4. Interconnect Chip Block Diagram


The CMOS SLA Implementation and SLA
Program Structures

K.F. Smith, T.M. Carter, and C.E. Hunt


University of Utah
Department of Computer Science
Salt Lake City, Utah 84112

INTRODUCTION - THE SLA CONCEPT

The storage/logic array (SLA) is a form of structured logic which
is well suited to VLSI design. The SLA concept, which was derived from
the PLA, was originally conceived by Patil [3] and later elaborated upon
by Patil and Welch [4]. The SLA differs from the PLA in several major
respects. The SLA has both the AND and the OR planes from the PLA, but
these planes are superimposed or folded on top of each other. This
folding of the AND and OR planes generates a structure in which AND
terms are generated on the rows of the SLA and the OR terms are
generated on the columns. The single AND/OR plane of the SLA contains
column wires which can serve as inputs to the SLA rows (AND plane) or as
outputs from the OR plane. This functional duality of SLA columns means
not only that the SLA can be arbitrarily segmented, but that inputs to
and outputs from segments of the SLA can be arbitrarily interleaved.
Also due to the functional duality of the SLA columns, the SLA can
contain memory elements embedded within its structure, which merges
feedback loops into the array itself. This allows for the specification
and implementation of independent finite state machines and data path
modules within a single integrated structure. In addition to memory,
inverters or other standard logic gates can be placed in the SLA to
provide multiple levels of logic, whereas the conventional PLA can only
generate one level of AND data and one level of OR data.
Adding row breaks placed between adjacent columns and column breaks
placed between adjacent rows allows great flexibility in segmenting the
array. Segments of the array need not be rectangular but may be
polygonal (where the polygon has orthogonal sides).
Aside from the physical advantages and flexibility of the SLA, it
has several logical and design automation advantages. The symbolic
nature of the SLA program specification gives the circuit designer an
immediate perception of the logical function of the circuit being
designed. Each SLA logic symbol maps directly onto a member of the SLA
cell set, giving the SLA designer a simultaneous perception of both the
logical function of the SLA and its physical layout. Given a set of
established SLA cells and rules for using them, a circuit designer can,
for the most part, ignore both the electronics of the circuit and the
layout while concentrating on the logical function.
The SLA should ideally be technology independent. That is, one
program should be transferable, without change, between different
processes. Initial experience with I2L and NMOS SLA
implementations [5, 6] has shown that, in practice, this is not the
case. Different process technologies not only have different SLA
programming design rules, but have radically different advantages and
disadvantages. Specifically, it was shown that I2L [2] cannot
adequately implement large gates on the SLA rows or columns. An NMOS
implementation proved to be able to handle large row and column gates,
but at the price of high power consumption. Folding the AND and OR
planes in the I2L and NMOS SLA implementations has resulted in
relatively poor space utilization. The new CMOS SLA overcomes the gate
size, power, and space problems encountered in I2L and NMOS. The CMOS
SLA uses Schottky diodes as the combinational logic elements in both the
AND and OR planes and thus significantly increases packing density of
combinational logic over both the I2L and the NMOS SLAs. The CMOS SLA
will also have the speed and driving capability of a conventional CMOS
circuit but will consume only about one-fourth of the power consumed in
NMOS SLAs. The packing density will be comparable to that of an NMOS
ROM.

THE CMOS SLA CELL SET

The CMOS SLA Cell Set contains elements which have been present in
all previous SLA implementations. These elements may not be implemented
with exactly the same functionality in CMOS, but can be used to write
SLA programs which are functionally equivalent to those in NMOS or I2L.
In addition to the "standard" SLA elements, new ones have been added
which greatly increase the flexibility and power of the CMOS SLA Cell
Set. The CMOS SLA Cell Set includes:

- memory elements (flip-flops [latches] composed of cross-coupled
  NAND gates),

- inverters (both a single inverter and two inverters in series,
  generating both the true and not-true of the input signal),

- elements which act on memory elements and inverters:

  - S (set a flip-flop)
  - R (reset a flip-flop)
  - I (inverter input),

- elements which detect the state of memory elements and inverters
  which are driving onto a column:

  - 0 (detect the reset state of a flip-flop
       or the false output from an inverter)
  - 1 (detect the set state of a flip-flop
       or the true output from two inverters
       in series),

- CMOS pass transistors,

- ohmic contacts from a column to a row,

- various interconnect, power buss, row and column loading
  cells, and filler cells.

Since SLA programs can be represented as cells placed on a grid,
each of these CMOS SLA cells is designed to be placed arbitrarily (with
certain restrictions) and at the SLA programmer's discretion on an SLA
grid. Each cell is designed so that freedom of placement is restrained
as little as possible.

The size of each cell is an integer multiple of the size of the
smallest combinational element (the 1, 0, S, or R). This requirement
that cell sizes be integer multiples of the atomic cell assures that
placement on the SLA grid results in perfect alignment of the cells.
Since layout design rules are checked in each cell and between each pair
of cells, no layout design rule checking of the entire SLA design is
necessary.

The memory elements in the CMOS SLA Cell Set were designed as
simple NAND gate latches and do not include any of the read/write
enabling found in the NMOS and I2L implementations. Master-slave flip-
flops similar to those found in the NMOS and I2L designs would have to
be implemented using the SLA row and column logic and these latches.

CMOS PROCESS DESCRIPTION

The CMOS process chosen for this SLA cell set includes an n-channel
Si-gate NMOS process, with enhancement and depletion-mode transistors.
This process is supplemented by the inclusion of an n- well, enclosing a
p-channel device region with enhancement-mode transistors. The n-
regions made possible the inclusion of Schottky diodes as the key
elements of the combinational logic. The n- regions used for p-channel
devices were encircled by an n+ high-bias ring, which in turn was
surrounded by a p+ grounded guard ring. These were included to prevent
the SCR (latch-up) action encountered in CMOS circuits.

THE DIODE ARRAY

The CMOS SLA implementation makes use of the Schottky diode in the
merged AND and OR plane. A composite representing a single Schottky
diode in a row cell is shown in Figure 1. The row of the SLA is an n-
stripe with low-resistance n+ shunts on each side. The Schottky diode is
formed by contacting metal to the n- diffusion.

Both the AND and the OR planes are formed by identical Schottky
diodes which have their cathodes tied to the row and their anodes tied
to individual columns. Because of this, both of the planes are
identical and they can be merged together with no space penalty. This
was not true in the I2L and the NMOS technologies, which required a
differently configured transistor for the AND plane and for the OR
plane.
[Composite layout: a Schottky diode formed by metal contact to an n- stripe row.]

Figure 1 - Schottky Diode Composite

Representative AND and OR planes are shown in Figures 2 and 3 and
are implemented using negative true logic. The inputs to the AND plane
in Figure 2 are the three columns which are connected to the anodes of
the diodes. When an active element (such as a latch, flip-flop or
inverter) drives through a diode to the row, the current drive of the
active element exceeds the pulldown capability of the row current-source
and the row is forced high. The output of the AND gate is the
horizontal row which is pulled low by the current source. This current
source is actually a depletion-mode N-channel transistor with its gate
and source grounded. The row will become true (low) when all of the
inputs (columns) are true (low). Thus this row is the logical AND of the
three columns. The OR plane in Figure 3 has inputs (rows) at the
cathodes of the three Schottky diodes and has its output on the column.
The column will become true (low) when any of the three rows are true
(low). Thus this column is the logical OR of the three rows.
Therefore, the SLA rows are both the outputs of the AND plane and the
inputs to the OR plane.

The proper operation of the OR plane is dependent upon the ratio of
the current being sourced by the column pull-up transistor and the
current being sunk by the current source on the row. Any one of the
three rows must be capable of sinking the current from the depletion-
mode pullup transistor. The sinking current must be greater than the
sum of all sourcing currents.
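
Read with low as the asserted level, the two planes reduce to ordinary AND and
OR; a tiny Python model of one row and one column (ours, with 0 standing for a
low wire):

LOW, HIGH = 0, 1

def and_row(columns):
    # AND plane: any high column drives the row high through its diode;
    # the row current-source pulls it low only when every column is low.
    return LOW if all(c == LOW for c in columns) else HIGH

def or_column(rows):
    # OR plane: the column goes low when any attached row sinks the
    # column pull-up current.
    return LOW if any(r == LOW for r in rows) else HIGH

print(and_row([LOW, LOW, LOW]), and_row([LOW, HIGH, LOW]))          # 0 1
print(or_column([HIGH, HIGH, HIGH]), or_column([HIGH, LOW, HIGH]))  # 1 0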
The combinational logic cells (1, 0, S, and R) are identical
diodes. The only difference between these cells is their notation in
the program itself, and this difference is maintained strictly for
clarity in the SLA file. Since the logic array cells are all identical
individual diodes, with cell sizes of 14.5 microns wide by 27 microns
high, the packing density of the CMOS array is about 30% of that in NMOS
(in [4], the basic row element size was 75 microns wide by 35 microns
high). An obvious advantage of CMOS over NMOS for SLA programming would
be in implementing circuits with a high ratio of combinational logic to
active elements.

Figure 2 - AND Plane          Figure 3 - OR Plane

BLOCK STRUCTURE USING THE CMOS SLA CELL SET

The necessity of including n- wells with their associated guard
rings to produce CMOS circuits introduces a cell design problem. The
inclusion of guard rings causes the well size of the individual active
elements to become prohibitively large if a cell were able to stand
alone within the SLA grid. The problem is solved by requiring that
active elements be programmed in the SLA as horizontal "bands" with
shared n- wells, as shown in Figure 4. At each end of the band is an
"end cap" for the well, which closes the guard rings and is two column
wires wide. This design constraint is considered to be a reasonable
tradeoff. Previous SLA designs in other processes have typically had
active cells programmed in horizontal bands already. Horizontal banding
of active cells is not likely to become a serious restriction in SLA
programming.
The cell size is further improved over previous implementations by
not carrying power, ground and clocks in each column. All power,
ground, and clock lines are brought in horizontally in the active cell
bands, with column wires tunneled underneath in the vertical direction.
This permits a much smaller minimum size for the array cells by not
carrying power, ground, and clocks everywhere in the array. Since there
would be no combinational logic in the band regions filled by the active
elements, power, ground and clocks are easily bussed in on metal wires.
The location of horizontal bands and their frequency is left totally to
the SLA programmer, as required by the circuit being designed.
[Band structure: horizontal bands of active cells sharing n- wells, with VDD, PHI, PHI-, and GND bussed horizontally through each band and column wires tunneled underneath.]

Figure 4 - Block Structure of an SLA Program


Active elements in the CMOS implementation are static and
asynchronous. The logic being implemented may, however, be synchronous.
To provide the possibility of using a system clock, two clock lines are
bussed in with the power and ground in the horizontal bands of active
cells. Clocking schemes are included as a part of the SLA program using
the array cells. (This is done to take advantage of the greater density
of combinational logic in the array.) The clocks must be tapped using a
"bender" cell, which connects a power buss wire to a column wire, and
detected on a row in order to provide clocked synchronization.
Similarly, the ground must be tapped by a bender cell to supply the
necessary grounding for the row pulldown current-sources. The pull-ups
on input wires are placed immediately above the particular active cell,
and the power is accessed directly, without the need for a bender cell.

DEFINING DIGITAL SYSTEMS USING CMOS SLAS

Design of digital systems is facilitated through the use of the SLA
partially because of the simultaneous perception the designer has of the
circuit's logical function and physical layout. The segmentability of
the SLA allows for the separate definition and layout of control modules
and data paths while simultaneously allowing the interface between
control and data path to be optimized by physically interleaving the
signal paths as needed within the array. In this way, modules can be
custom fitted to each other, resulting in a significant interconnect area
reduction.

With a predefined set of SLA cells, large scale digital systems can
be built directly by writing an SLA program which is the logical
description of the system as well as the mask level description of the
circuit to be built.
To summarize the functionality of the CMOS SLA Cell Set, we can
cite four characteristics of the cell set: 1) the S, R, I, 0, and 1
combinational elements are all identical Schottky diodes which result in
very dense structures, 2) the memory elements are simple cross-coupled
set-reset flip-flops without read/write enables, 3) the elements are
basically asynchronous and clocking is accomplished by ANDing clock
signals on array rows, and 4) the power buss structure is arranged in
horizontal bands rather than in global sets of vertical wires. The more
complex SLA structures are built from these low level blocks rather than
being predefined as was the case in SLA implementations in other
technologies.
Building more complex elements from a lower level cell set results
in several advantages over prior SLA implementations:

- complex elements (macros) can be custom designed so they fit
  well into the overall circuit,

- other standard circuit elements, such as dynamic shift-
  registers, register stacks, multiplexers, etc., can be easily
  implemented using the cell set,

- unnecessary functionality and/or circuit elements are not
  inserted into the SLA,

- synchronous and asynchronous circuits can be implemented with
  the same cell set.

CONSTRUCTION OF MACRO ELEMENTS

Macro elements are built from the simpler elements of the CMOS SLA
Cell Set. The construction of macro elements is best illustrated by an
example, a master-slave set/reset flip-flop (MSFF). This type of flip-
flop has the set/reset inputs gated by one clock (Phi-A) and the outputs
gated by another non-overlapping clock (Phi-B).

Figures 5 and 6 show two different SLA program implementations of a
MSFF using the CMOS SLA Cell Set. The two cells perform identical logic
functions but are very different both in shape and in the cells used to
implement them. The version of the MSFF illustrated in Figure 5 uses
pass transistors (PTA) enabled while clock Phi-A is high to gate the
set/reset inputs into the master flip-flop, and pass transistors (PTB)
enabled while clock Phi-B is high to gate the outputs from the master
into the slave. Single inverters are placed on the outputs of the slave
flip-flop solely to increase the drive capabilities of the MSFF. Note
that, in this configuration for the MSFF, the set/reset inputs to the
MSFF come only from below the MSFF.

The version of the MSFF in Figure 6 achieves the same effect as the
version in Figure 5, but through a very different gating technique.
Two inverters in series buffer the set/reset inputs of the MSFF. This
is necessary because the set/reset inputs to the MSFF are generated by
SLA rows and SLA rows cannot drive other SLA rows directly. The true
("1") output from the double inverters (BIBs) which buffer the set/reset
inputs is of the correct polarity to be sensed by SLA rows.
[SLA program listings showing the row/column placement of S, R, 0, 1, pass-transistor, and inverter cells for the two implementations.]

Figure 5 - Master-Slave Flip-Flop 1     Figure 6 - Master-Slave Flip-Flop 2



Phi-A is routed through a single inverter (INB) because the
clocking scheme used requires that the clocks be inverted before being
sensed by SLA rows (the 0's in column 10). The left segment of row 7
ANDs the buffered set input (the 1 in column 3) with an inverted Phi-A
clock to generate the set signal for the master flip-flop, and row 8 ANDs
the buffered reset input (the 1 in column 6) with an inverted Phi-A
clock to generate the reset signal. Phi-B is similarly inverted to
clock the transfer of data from the master to the slave (the right
segment of row 7 and row 9). The cells which appear as PD6 or PU4 are
devices for loading the rows and columns. The cells labeled GND bend
the ground wire required in the flip-flops and inverters to a column
wire so that the row loads can have access to ground. The cells labeled
"THRU (IN)" tunnel the column wire under the power buss.

It is readily apparent that this CMOS SLA Cell Set will allow great
flexibility in creating macros with a variety of different physical and
logical configurations. For example, a MSFF macro which included
separate read and write enable inputs could be easily implemented,
allowing the implementation of a hardware stack. Other circuit elements,
such as dynamic shift registers, would be built using the same
techniques used in constructing master-slave flip-flops.
Other larger macros can also be built. Figure 7 contains the SLA
program for a bit-serial adder/subtractor of the type used in designing
a much larger machine, the Utah Serial CORDIC Machine (USCM) [1]. Three
internally identical bit-serial adder/subtractors were required in the
CMOS SLA implementation of the CORDIC. The top eight rows of the SLA
program for the bit-serial adder/subtractor contain a master-slave flip-
flop of the type illustrated in Figure 6. This MSFF is used to store
the serial carry generated by each successive bit-add or bit-subtract.
The cells labeled BIB at the bottom of the macro buffer the A, B,
ADD/SUBTRACT, and INITIALIZE inputs (which are generated by SLA rows
outside the adder/subtractor) as well as buffering the output of the
adder/subtractor (which is generated by SLA rows inside the
adder/subtractor) for use in SLA rows outside the adder/subtractor.
The final version of this adder/subtractor which was used in
actually implementing the USCM consisted mostly of the same elements,
but their physical structure was rearranged so that the
adder/subtractors could be efficiently connected to both the shift-
registers and the control state machines of the CORDIC.

CONCLUSIONS - THE UTAH SERIAL CORDIC MACHINE AND SLA PROGRAMMING

The generation of a medium to large size subsystem, the Utah Serial
Cordic Machine or USCM, demonstrated the advantages of the CMOS SLA Cell
Set over previous SLA implementations. It showed that macro elements
could be effectively used in the SLA context. Macro elements were used
to construct sixteen-bit dynamic shift-registers, adder/subtractors,
master-slave set/reset flip-flops, and parallel loaders for dynamic
shift-registers. The CMOS SLA Cell Set provided a flexibility in
generating macros that no other previous SLA implementation had. This
flexibility was noted in different configurations of the same functional
macro (MSFFs) and in implementing sub-circuits which were difficult if
not impossible to build in previous SLA implementations (i.e., dynamic
shift-registers).

[SLA program listing for the bit-serial adder/subtractor, with inputs A, B, ADD/SUBTRACT, and INITIALIZE and outputs CARRY and S.]

Figure 7 - The USCM Adder/Subtractor
The building of the USCM entailed using the basic CMOS SLA Cell Set
for generating some random logic and showed that the more complex memory
elements (flip-flops with read/write enable) of previous SLA
implementations were not always needed or even desirable.
Using a lower level SLA cell set, such as the CMOS set, gives the
SLA circuit designer (programmer) far greater flexibility in writing SLA
programs. Working typically at a high level of abstraction with macros
such as MSFFs, he retains the ability to drop down to a more primitive
level when desirable.
The use of CMOS for implementing SLAs makes the SLA a viable
candidate for the design of VLSI systems. Advantages discovered while
building circuits using SLAs implemented in other processes are now
enhanced by improvements in density, power consumption, and programming
flexibility. Through increased development of the relatively new n-
well CMOS process, density can be improved still further with the
elimination of guard rings and, of course, shrinking geometries. The
use of Schottky diodes as the junction device in the row elements
removes the need for bussing ground or clock lines to the array,
allowing for very small combinational logic cells.
ACKNOWLEDGMENTS

This work was supported in part by a contract from Boeing Aerospace
Company under Air Force Contract Number F33615-80-C-1196. The authors
gratefully acknowledge the assistance of the General Instrument
Corporate Research and Development Center in Chandler, Arizona, in
defining the CMOS process and layout design rules used in implementing
the CMOS SLA. They also acknowledge the contributions of Dr. Lee
A. Hollaar of the University of Utah Department of Computer Science
faculty to the logical definition of the CMOS SLA Cell Set.
REFERENCES

[1] Goates, G. B.; Waldron III, H. M.; Patil, S. S.; Smith, K. F.; and
    Tatman, J. A. Storage/Logic Arrays for VHSIC. In Proceedings of the
    1981 Institute for Defense Analysis Semi-Custom Integrated Circuit
    Technology Symposium, May, 1981.

[2] Lin, E. S. A Study of Loading Constraints of Existing Integrated
    Injection Logic Realizations of Storage/Logic Arrays. Master's thesis,
    Department of Computer Science, University of Utah, August, 1980.

[3] Patil, S. S. An Asynchronous Logic Array. Project MAC Technical Memo
    TM-62, MIT, May, 1975.

[4] Patil, S. S. and Welch, T. A. A Programmable Logic Approach for VLSI.
    IEEE Transactions on Computers C-28(9):594-601, September, 1979.

[5] Smith, K. F. Design of Stored Logic Arrays in I2L. In Proceedings of
    the 1981 IEEE International Symposium on Circuits and Systems, pages
    105-110. IEEE Circuits and Systems Society, April, 1981. IEEE Catalog
    No. 81CH1635-2.

[6] Smith, K. F. Implementation of SLAs in NMOS Technology. In Proceedings
    of the VLSI 81 International Conference, Edinburgh, UK, pages 247-256,
    August, 1981.
A New CCD Parallel Processing Architecture
A.M. Chiang
Massachusetts Institute of Technology
Lincoln Laboratory
Lexington, Massachusetts 02173

INTRODUCTION

Charge-coupled signal processing devices are attractive for
applications in such systems as communication, radar and sonar because
of the ability of a single rather simple device to perform the
equivalent of a large number of arithmetic operations per second. For
example, a 32-point transversal filter (1) operating at 20 MHz is
performing 1.2 x 10^9 operations per second, which is the equivalent of
a fair size digital processor. However, conventional CCDs have not
gained widespread acceptance in commercial or military systems
because of the lack of availability of CCDs with sufficient cost,
power, weight and throughput advantages over competing digital tech-
niques to make them attractive for integration into otherwise digital
hardware.
We have been developing new architectures which allow CCDs to
operate at high speed (2) and with enormous computation power. A new
CCD parallel processing architecture is described here which will
allow devices to be built which can perform many high-level mathe-
matical operations such as vector-matrix products, matrix-matrix
products and triple matrix products. A vector-matrix product device
could perform functions such as discrete Fourier transforms, while a
device implementing the matrix-matrix product could do doppler
processing for many range cells in the range window of a pulse radar
system. A triple-matrix product device could perform two-dimensional
image transforms and image reconstruction for video bandwidth reduc-
tion in image processing systems, or it could perform two-dimensional
Fourier transforms for simultaneous multiple beam forming in a
bistatic radar system.

DEVICE ARCHITECTURE

The basic device structure (shown in Fig. 1) consists of a
floating-gate CCD tapped delay line and an array of CCD signal
processors. The delay line is for shifting and holding analog sampled
data which are in the form of charge packets. At each stage of delay
a floating-gate sensing electrode is coupled to a corresponding CCD
signal processor, and the sampled data are transferred and sub-
sequently processed in parallel. Within each processor all the compu-
tation functions are performed in the charge domain, and local charge
domain memories are included for storing the processed signal. Based
on this generic device architecture, a matrix-matrix product device
and a triple-matrix product device have been designed and are
described below.

(This work was sponsored by the Department of the Air Force.)

A MATRIX-MATRIX PRODUCT DEVICE

A schematic of a matrix-matrix product device is shown in Fig. 2.
This device is capable of computing the function

    g_kj = Σ(n=1..N) c_kn f_nj    for j = 1, 2, ..., J and k = 1, 2, ..., K,

where the f_nj are sampled analog data and the c_kn are digital numbers.
The device consists of a J-point, floating-gate tapped delay line;
J M-bit charge-domain multiplying D-to-A converters (MDACs); J K-stage
CCD accumulating memories, each with separate input and output shift
registers; and a digital memory for N-by-K M-bit words. The digital
memory can be either on-chip or off-chip. All the MDACs have common
digital inputs, but the output of each MDAC goes to a corresponding
accumulating memory as shown in Fig. 2.
The details of the MDAC are depicted in Fig. 3, and show the
device to be a multiple-input CCD structure. The logic levels control
the potential of the input diodes and thereby perform a multiplication
of the charge flow to the input gates by 0 or 1. The area of the
input gate corresponding to bit m is proportional to 2^(m-1), and the
quantity of charge, Q_m, transferred by the gate is then proportional
to the product of the value of bit m, the gate area and the analog
signal.
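
A behavioral Python sketch of the MDAC (ours): bit m gates an input area
proportional to 2^(m-1), so the total transferred charge is the analog sample
scaled by the digital word:

def mdac_charge(analog, bits):
    # bits[m-1] is bit m, least significant first; the gate for bit m has
    # area ~ 2**(m-1), so its charge ~ bit * area * analog signal.
    return sum(bit * (2 ** m) * analog for m, bit in enumerate(bits))

print(mdac_charge(0.5, [1, 1, 0, 1]))   # 5.5 == 11 * 0.5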
The matrix-matrix product device operates as follows. After the
first row of analog sampled data, f_11, f_12, ..., f_1J, is serially
loaded into the tapped delay line, the CCD clock is stopped. The sig-
nal charge at each tap controls the analog input of the corresponding
MDAC as indicated in Fig. 2. The first column of the digital memory,
c_11, c_21, ..., c_K1, is then sequentially applied to the common digital
input ports of all the MDACs. (Note there are M bits for each digital
word applied in parallel to the MDAC.) The output from the jth MDAC
is a sequence of charge packets, c_11 f_1j, c_21 f_1j, ..., c_K1 f_1j.

The string of output data from each MDAC is serially loaded into
the corresponding CCD accumulating memory. After the whole string of
data is loaded in the memory, the data set is parallel transferred to
the storage well of the memory. The second row of the analog sampled
data, f_21, f_22, ..., f_2J, is then loaded into the CCD tapped delay
line. The same process is repeated, but this time the second column
of the digital memory, c_12, c_22, ..., c_K2, is sequentially applied to
the common input ports of all the MDACs. Thus, at the output of the
jth MDAC, there is a sequence of output data, c_12 f_2j, c_22 f_2j, ...,
c_K2 f_2j.
After this string of output data is serially loaded into the jth
accumulating memory, it is parallel transferred to the storage wells
and summed with the previously stored charge. It follows that after the
Nth (or the last) row of the analog sampled data is processed by the
same procedure described above, the information stored in the jth
accumulating memory is the sequence

    Σ(n=1..N) c_Kn f_nj , ..., Σ(n=1..N) c_2n f_nj , Σ(n=1..N) c_1n f_nj .

It can be seen that this data sequence is equal to the jth column
elements of the [G] matrix, which is what the device is to compute.
Therefore, the stored data sequence g_1j, g_2j, ..., g_Kj can now be
parallel transferred to the output shift register and serially clocked
out. In other words, the serial output from each accumulating memory
is the corresponding column of the [G] matrix. Thus, the
device computes the matrix-matrix product [G] = [C][F].
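
A Python/NumPy sketch of the whole procedure (ours): row n of F sits in the
delay line while column n of the digital memory streams through the MDACs, and
each accumulating memory sums its string of K charge packets:

import numpy as np

def ccd_matmul(C, F):
    # C: K-by-N digital weights; F: N-by-J analog samples.
    K, N = C.shape
    J = F.shape[1]
    G = np.zeros((K, J))               # J accumulating memories, K stages each
    for n in range(N):                 # row n of F sits in the delay line
        row = F[n]                     # one analog sample per tap
        for k in range(K):             # column n of C streams through the MDACs
            G[k] += C[k, n] * row      # J MDAC outputs, summed into storage
    return G

C = np.array([[1, 2], [3, 4]])
F = np.array([[5.0, 6.0], [7.0, 8.0]])
print(ccd_matmul(C, F))                # equals C @ F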
We are in the process of designing a CCD matrix-matrix product
device with N, K and J chosen to be 32, and 8-bit MDACs. The chip
size is estimated to be 30,000 mil^2, excluding the digital memory. At
a 10 MHz clock rate, the device performs the equivalent of
3.2 x 10^8 8-bit x 8-bit digital multiplications and 10^10 additions per
second.

A TRIPLE MATRIX PRODUCT DEVICE

The schematic of a triple-matrix product device is shown in
Fig. 4. The device calculates the function

    h_kl = Σ(j=1..J) g_kj d_jl    for k = 1, 2, ..., K and l = 1, 2, ..., L,

where g_kj = Σ(n=1..N) c_kn f_nj. It can be seen that the top part of the device
(i.e., the tapped delay line, the MDACs, the accumulators, and the
digital memory) is identical to the previously described matrix-matrix
product device. The lower part of the device consists of an L-by-J
fixed-weight CCD multiplier bank. All the fixed-weight multipliers on
the same column have a common analog input (i.e., the output from the
jth accumulating memory is coupled to the inputs of the jth column of
the fixed-weight multipliers). All the multipliers on the same row
have a common output node.
The details of a CCD fixed-weight multiplier are described here.
If an analog signal is to be multiplied by a fixed number W, the input
gate area of this CCD multiplier will be made equal to W Amin, where
Amin is a process-determined minimum gate area. It is easy to recog-
nize that if the analog input is f, the amount of charge transferred
into this multiplier is proportional to fW. Therefore, the output of
a fixed-weight multiplier is always proportional to the analog input
by the same scaling factor.
The device operates as follows. Two consecutive matrix-matrix
product steps are used to calculate the triple matrix product. The
CCD delay line, MDACs and accumulating memory are used to compute the
first matrix-matrix product (i.e., the G matrix). Because of the
unique structure of the device, when the input matrix F is loaded into
the device row by row, the calculated G matrix can be accessed row by
row. A fixed-weight vector-matrix product device can now be used to
complete the triple-matrix product. Now consider the case when the
first row of the [G] matrix is clocked out from the J accumulating
memories, as shown in Fig. 4: g_11 is applied to the first column of
the L x J multiplier bank, g_12 to the 2nd column and g_1j to the jth
column. Since all the multipliers on the same row have a common out-
put node, the total signal charge transferred to the summing node
of the 1st row of the multiplier is proportional to

    Σ(j=1..J) g_1j d_j1 = h_11 .

The output charge of the 2nd row of the multiplier is

    Σ(j=1..J) g_1j d_j2 = h_12 .

It follows that after the 1st row elements of the [G] matrix are
clocked out from the J accumulating memories, there is one summed
charge output from each row of the multiplier bank. These L parallel
output data correspond to the first row of the [H] matrix (i.e., h_11,
h_12, ..., h_1L). Finally, after the last row of the [G] matrix has
been clocked out from the accumulating memory, the multiplier banks
compute the last row elements of H. Thus the device has computed the
triple-matrix product [H] = [G][D].
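
Extending the previous sketch (ours), the fixed-weight bank turns each emerging
row of [G] into a row of [H] by summing charge along its output rows:

import numpy as np

def ccd_triple_product(C, F, D):
    # [H] = [G][D] with [G] = [C][F]: each row of G emerges in turn from
    # the accumulating memories and drives the L-row fixed-weight bank.
    G = C @ F                          # first matrix-matrix step, as above
    K = G.shape[0]
    L = D.shape[1]
    H = np.zeros((K, L))
    for k in range(K):                 # row k of G is clocked out ...
        H[k] = G[k] @ D                # ... and all L row sums form h_kl
    return H

C = np.array([[1, 0], [0, 2]])
F = np.array([[1.0, 2.0], [3.0, 4.0]])
D = np.array([[1.0, 1.0], [0.0, 1.0]])
print(np.allclose(ccd_triple_product(C, F, D), C @ F @ D))   # True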
The chip area of a CCD triple-matrix product device with J, K, L
chosen to be 16 and 8-bit MDACs is estimated to be 35,000 mil^2. At a
10 MHz clock rate, the fixed-weight multiplier bank performs the
equivalent of 2 x 10^9 8-bit x 8-bit multiplications per sec.

CONCLUSION

In summary, a new CCD parallel-processing architecture is described
which allows us to build CCD devices with enormous computation power.
We have achieved this by keeping the signal and performing the com-
putation in the charge domain, and processing large numbers of sampled
data in parallel. Two specific devices, a matrix-matrix product
device and a triple matrix product device, were described. The cost,
power and weight efficiencies of these CCDs should have far reaching
implications for many military and commercial applications such as
radar, communication and image processing systems.

REFERENCES

1. A. M. Chiang and B. E. Burke, A High-Speed Digitally Programmable CCD
   Transversal Filter, 1980 GOMAC Digest of Papers, Houston, Texas (1980),
   pp. 182-183.

2. B. E. Burke and W. T. Lindley, New CCD Programmable Transversal Filter,
   Electronics Letters 13, 521 (1977).

[Block diagram: a CCD floating-gate tapped delay line feeding an array of CCD signal processors, with analog output and digital input ports.]

Figure 1. A CCD parallel processing architecture where computations
involving analog sampled data are done in the charge domain and large
numbers of data points are processed in parallel.
[Block diagram: the CCD floating-gate tapped delay line drives J MDACs fed from an N x K digital memory; each MDAC output accumulates in a K-stage CCD accumulating memory.]

Figure 2. A CCD matrix-matrix product device which computes the func-
tion [G] = [C][F], where the F matrix elements are analog sampled data
and the C matrix elements are digital numbers.
[Schematic: analog input gates with binary-weighted areas, gated by the digital input bits.]

Figure 3. Schematic of a charge domain multiplying D/A converter.
[Block diagram: the matrix-matrix product section (tapped delay line, MDACs, accumulating memories) feeding an L-row fixed-weight multiplier bank whose row sums form h_kl = Σ(j=1..J) g_kj d_jl.]

Figure 4. A CCD triple-matrix product device which computes the
function [H] = [G][D] = [C][F][D], where the F matrix elements are
analog sampled data and the C and D matrix elements are digital
numbers.
