
Digital Systems Design 2

Programmable Logic and Storage Devices


Chapter 8: Advanced Digital Design with the Verilog HDL, Michael D. Ciletti.
Chapter 10 (Memory, CPLDs and FPGAs): Digital Design Principles and Practices, John F. Wakerly, Prentice Hall, 2001, Third Edition.

Programmable Logic and Storage Devices

With the advancement of hardware technology, the density, complexity, and size of field-programmable gate arrays (FPGAs) make them an attractive and cost-efficient alternative to semi-custom application-specific integrated circuits (ASICs).

The opportunity to realize large circuits in FPGAs has created pressure for a change in the method by which circuits are designed for FPGA-based applications:

Schematic entry tools can be productive and efficient when designs are small.
The trend is toward larger and larger designs targeted for FPGAs.
Thus, a language-based design methodology has become essential to FPGA-based design flows.

Programmable Logic and Storage Devices

Technologies available for implementing digital circuits range from standard integrated circuits (ICs), used in low-density/low-performance applications, to cell-based and full-custom ICs for high-density/high-performance circuits.

Standard ICs:

Can be manufactured cheaply.
Implement very limited, basic functionality at low levels of integration.

Customized ICs:

Implement specialized functionality with a high level of integration.
Have a small market.
Create inventory risk because the quantities that could be sold do not warrant the expense of their development and production.

Programmable Logic Devices:

Lie between the two extremes of density and performance that characterize standard parts and full-custom circuits.
Were born out of necessity created by two conflicting realities:

Large, dense, high-performance circuits cannot be built practically or economically from discrete devices.
Dedicated ICs cannot be produced economically to satisfy a diversity of low-volume applications.

Programmable Logic and Storage Devices

PLDs:

Read-Only Memories (ROM)
Programmable Logic Arrays (PLA)
Programmable Array Logic (PAL)
Complex Programmable Logic Devices (CPLD)
Field-Programmable Gate Arrays (FPGA), and
Mask-Programmable Gate Arrays (MPGA).

Programmable Logic Devices

For the most up-to-date PLDs see:
www.e-insite.net/ednmag

PLDs have a fixed architecture.
Functionality is programmed for a specific application.
Programming is done by:

The manufacturer: mask-programmable logic devices (MPLD)
The end user: field-programmable logic devices (FPLD).

Three basic characteristics distinguish PLDs from each other:
1. An architecture of identical basic functional units,
2. A programmable interconnection fabric, and
3. A programming technology.

Programmable Logic Devices

The first type of PLD considered has the AND-OR plane structure shown in the figure.
This type of architecture is used to implement ROMs, PLAs, and PALs.
It implements Boolean expressions in sum-of-products (SOP) form:

The AND plane forms product terms selectively from the inputs, and
The OR plane forms outputs from sums of selected product terms.

A programmable interconnect fabric joins the two planes, so that the outputs implement sum-of-products expressions of the inputs.

Whether and how a plane can be programmed determines the particular type of PLD that is implemented by the overall structure.

[Figure: AND-OR plane structure of a programmable logic device. Inputs enter the AND plane, product terms connect the AND plane to the OR plane, and the OR plane drives the outputs.]
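As a rough illustration of the AND-OR structure, the following Python sketch evaluates a sum-of-products circuit in two steps, one per plane. The product terms, the output connections, and the function names are hypothetical and chosen only for this example; they do not describe any particular device.

```python
def and_plane(inputs, product_terms):
    """Form each product term from the selected inputs.

    A term maps an input index to the literal it needs:
    True for the input itself, False for its complement.
    """
    return [all(inputs[i] == literal for i, literal in term.items())
            for term in product_terms]


def or_plane(term_values, output_connections):
    """Form each output as the OR of its selected product terms."""
    return [any(term_values[t] for t in terms) for terms in output_connections]


# Hypothetical programming for inputs (a, b, c):
#   f0 = a.b + a'.c
#   f1 = a.b + b'.c'
PRODUCT_TERMS = [{0: True, 1: True}, {0: False, 2: True}, {1: False, 2: False}]
OUTPUT_CONNECTIONS = [[0, 1], [0, 2]]


def evaluate(inputs):
    """Pass the inputs through the AND plane, then the OR plane."""
    return or_plane(and_plane(inputs, PRODUCT_TERMS), OUTPUT_CONNECTIONS)
```

In this sketch, programming the AND plane corresponds to editing `PRODUCT_TERMS`, and programming the OR plane corresponds to editing `OUTPUT_CONNECTIONS`.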

Storage Devices

The architecture used to implement PLDs lends itself to the implementation of storage devices.
Storage devices can be:

Read-only, or
Random-access,

depending on whether the contents of a memory cell can be written during normal operation of the device.
ROM (read-only memory) is a device programmed to hold certain contents, which remain unchanged during operation and after power is removed from the device.
RAM (random-access memory), in contrast, has contents that can be changed during operation, and they vanish when the power is removed.

Read-Only Memory (ROM)

ROM Interface:

A 2^n x b ROM consists of an addressable array of semiconductor memory cells organized as 2^n words of b bits each.
It has n inputs, called address lines, and b outputs, called bit lines.

ROM is non-volatile memory: its content is preserved even if no power is applied.

Read-Only Memory (ROM)

AND-OR planes for a ROM:

[Figure: The address inputs A(0) ... A(n-1) drive a nonprogrammable address decoder (the AND plane), which forms 2^n minterms (word lines). The word lines drive the OR plane, a 2^n x b memory array, which produces the b outputs D(0) ... D(b-1) (bit lines).]

Using ROM for Random Combinational Functions

A ROM can be used to perform any combinational function: for each input bit pattern (input address), the ROM stores the corresponding output bit pattern.
Example: Truth table for a 3-input, 4-output combinational logic function.

[Table: truth table with inputs A2, A1, A0 and outputs D3, D2, D1, D0.]

[Figure: Equivalent 2-to-4 decoder with output-polarity control.]
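To make the "address in, stored pattern out" idea concrete, here is a small Python sketch of an 8 x 4 ROM holding a 2-to-4 decoder with output-polarity control. The polarity convention (A2 = 0 gives active-low outputs, A2 = 1 gives active-high outputs) and the function names are assumptions made for this illustration; the slide's own table entries may differ.

```python
def decoder_with_polarity(a2, a1, a0):
    """2-to-4 decoder on (a1, a0); a2 chooses the output polarity (assumed)."""
    one_hot = 1 << ((a1 << 1) | a0)            # exactly one of D3..D0 set
    return one_hot if a2 else (~one_hot & 0xF)  # invert all four bits when a2 = 0


# ROM contents: one 4-bit word (D3 D2 D1 D0) per 3-bit address (A2 A1 A0).
ROM = [decoder_with_polarity((addr >> 2) & 1, (addr >> 1) & 1, addr & 1)
       for addr in range(8)]


def read_rom(address):
    """A ROM read is a table lookup; no logic is evaluated at read time."""
    return ROM[address]
```

Once the table is filled in, the same eight-word array could hold any other 3-input, 4-output function simply by storing different words.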

Using ROM for Random Combinational Functions

Another example that can be built with ROM is unsigned binary multiplication.
A typical realization of a 4x4 multiplier requires too high a number of product terms (36) to fit in one pass through a conventional PLD's AND-OR array.
With a ROM, one can realize the function with one pass through a 2^8 x 8 (256 x 8) ROM.
The contents of a ROM are normally specified by a file that contains one entry for every address in the ROM.
The nice thing about ROM-based design is that one can usually write a simple program in a high-level language to calculate what should be stored in the ROM.
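A program of the kind described might look like the following Python sketch, which computes the 256 entries of a 4x4 unsigned multiplier ROM. The address layout (multiplicand in the upper four address bits, multiplier in the lower four) and the text format of the listing are assumptions for illustration, not a format required by any ROM programmer.

```python
def build_multiplier_rom():
    """Return the 256 x 8 contents of a 4x4 unsigned multiplier ROM."""
    rom = []
    for address in range(256):
        x = address >> 4      # upper four address bits (assumed layout)
        y = address & 0xF     # lower four address bits
        rom.append(x * y)     # product is at most 15 * 15 = 225, so it fits in 8 bits
    return rom


def to_hex_lines(rom):
    """Format one 'address: data' entry per ROM location, in hexadecimal."""
    return [f"{address:02X}: {value:02X}" for address, value in enumerate(rom)]
```

For example, `to_hex_lines(build_multiplier_rom())[0x37]` is the entry for 3 x 7.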

Two-dimensional decoding

Suppose that one wants to build a 128 x 1 ROM.

The straightforward solution requires a 7-to-128 decoder:

128 7-input NAND gates, and
14 buffers and inverters with a fanout of 64 each.

ROMs with 1M bits or more are available commercially, and they do not use a linear decoder structure, which would require a 20-to-1,048,576 decoder.

The structure used instead is called two-dimensional decoding.
It reduces the decoder size to something on the order of the square root of the number of addresses.

The basic idea in two-dimensional decoding is to arrange the ROM cells in an array that is as close as possible to square.
The next illustration depicts a possible internal structure for a 128x1 ROM.

Two-dimensional decoding

Two-dimensional decoding

As can be seen, two-dimensional decoding allows a 128x1 ROM to be built with a 3-to-8 decoder and a 16-input multiplexer (whose complexity is comparable to that of a 4-to-16 decoder).
A 1M x 1 ROM could be built with a 10-to-1024 decoder and a 1024-input multiplexer, which is a lot simpler than the one-dimensional alternative.
Besides reducing decoding complexity, two-dimensional decoding has one other benefit: it leads to a chip whose physical dimensions are close to square, which is important for chip fabrication and packaging.
In ROMs with multiple data outputs, the storage arrays corresponding to each data output may be made narrower in order to achieve an overall chip layout that is closer to square. For example, the next figure shows a possible layout of a 32K x 8 ROM.
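The 128x1 case can be sketched in Python as a row decoder followed by a column multiplexer. The split of the address (upper three bits to the 3-to-8 row decoder, lower four bits to the 16-input multiplexer) and the names used here are assumptions for illustration.

```python
ROWS, COLUMNS = 8, 16  # 8 rows x 16 columns = 128 one-bit cells


def build_array(bits):
    """Arrange 128 stored bits as an 8 x 16 array of cells."""
    assert len(bits) == ROWS * COLUMNS
    return [bits[r * COLUMNS:(r + 1) * COLUMNS] for r in range(ROWS)]


def read_bit(array, address):
    """Read one bit from the array using two-dimensional decoding."""
    row_address = (address >> 4) & 0x7   # assumed: upper 3 address bits
    column_address = address & 0xF       # assumed: lower 4 address bits

    # 3-to-8 decoder: exactly one row-select line is active.
    row_select = [int(r == row_address) for r in range(ROWS)]

    # Each column line carries the cell from the selected row.
    column_lines = [
        int(any(row_select[r] and array[r][c] for r in range(ROWS)))
        for c in range(COLUMNS)
    ]

    # 16-input multiplexer picks one column line as the data output.
    return column_lines[column_address]
```

The decoder here has 8 outputs and the multiplexer 16 inputs, compared with the 128 outputs of a one-dimensional 7-to-128 decoder.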

Possible layout of a 32K x 8 ROM

Commercial ROM Types

A modern ROM is fabricated as a single IC chip; one that stores 4M bits can be purchased for under $5.
Various methods are used to program the information stored in a ROM:

Mask-Programmable ROMs.

The manufacturer has to be provided with the ROM content in order to create one or more customized masks to manufacture ROMs with the required pattern.
ROM manufacturers impose a mask charge of several thousand dollars for the customized aspects of mask-ROM production.
Because of mask charges and the four-week delay typically required to obtain programmed chips, mask ROMs are normally used today only in very high-volume applications.
For low-volume applications there are more cost-effective choices, discussed next.

Commercial ROM Types

Programmable read-only memory (PROM)

Similar to a mask ROM, except that the customer may store data values (program the PROM) in just a few minutes.
A PROM is manufactured with all of its diodes or transistors connected. This corresponds to having all bits at a particular value (typically 1). A PROM programmer can then be used to set desired bits to the opposite value.
In bipolar PROMs this is done by vaporizing tiny fusible links inside the PROM corresponding to each bit.
A link is vaporized by selecting it using the PROM's address and data lines, and then applying a high-voltage pulse (10-30 V) to the device through a special input pin.
Early reliability problems with vaporized-link technology were solved, and reliable fusible-link technology is used nowadays not only in bipolar PROMs but also in bipolar PLD circuits.

Commercial ROM Types

Erasable programmable read-only memory (EPROM):

An EPROM is programmable just like a PROM.
In addition, it can be erased to the all-1s state by exposing it to ultraviolet light.
EPROM uses a different technology, called floating-gate MOS.
EPROM manufacturers guarantee that a properly programmed bit will retain 70% of its charge for at least 10 years, even if the part is stored at 125 °C.
The insulating material surrounding the floating gate becomes slightly conductive when exposed to ultraviolet light of a certain wavelength, which allows the EPROM content to be erased.
The most common application of EPROMs is to store programs in microprocessor systems.
EPROMs are typically used during development. ROMs and PROMs are used once the program is finalized because they usually cost less than EPROMs of similar capacity.

Commercial ROM Types

Electrically Erasable Programmable Read-Only Memory (EEPROM).

It is like an EPROM, except that individual stored bits may be erased electrically.

Floating gates in an EEPROM are surrounded by a much thinner insulating layer and can be erased by applying a voltage of the opposite polarity to the charging voltage to the non-floating gate.

Large EEPROMs (1M bit and larger) allow erasing only in fixed-size blocks, typically 128-512 Kbits (16-64 Kbytes) at a time. These memories are typically called flash EPROMs or flash memories.

An EEPROM can be reprogrammed only a limited number of times (the insulating layer wears out).

[Figure: Logic symbols for standard EPROMs in 28-pin dual in-line packages.]

ROM Applications

In addition to the most common application of ROMs, program storage in microprocessor systems, there are many other applications in which a ROM provides a low-cost realization of a complex or random combinational logic function.
Example of Voice Signals:

When an analog voice signal enters a typical telephone system, it is sampled 8,000 times per second and converted into a sequence of 8-bit bytes representing the analog signal at each sampling point.

This example will show how ROM-based circuits can easily deal with this highly encoded information.

Coding Voice Samples

The simplest 8-bit encoding of the sign and amplitude of an analog signal would be an 8-bit integer in the two's-complement or signed-magnitude system.
8-bit linear encoding yields a range of only 2^8 = 256 different values.
This corresponds to a dynamic range in signal power of 20*log10(256) ≈ 48 dB.
By comparison, compact audio discs use a 16-bit linear encoding with a theoretical dynamic range of 20*log10(2^16) ≈ 96 dB.
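The two figures can be checked with a few lines of Python. The helper name `dynamic_range_db` is introduced here only for illustration.

```python
import math


def dynamic_range_db(ratio):
    """Dynamic range in dB for a given ratio of largest to smallest value."""
    return 20 * math.log10(ratio)


eight_bit = dynamic_range_db(2 ** 8)     # 8-bit linear encoding
sixteen_bit = dynamic_range_db(2 ** 16)  # 16-bit linear encoding
```

Each additional bit of linear encoding doubles the ratio and so adds about 6 dB.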

Coding of Voice Samples

The North American telephone network uses an 8-bit companded encoding called μ-law PCM (pulse code modulation).
The next figure shows the format of an 8-bit coded byte: a sort of floating-point representation containing sign (S), exponent (E) and mantissa (M) fields.
The analog value V represented by a byte in this format is given by the formula:

V = (1 - 2S) * [2^E * (2M + 33) - 33]

An analog signal represented in this format can range from -8159*k to +8159*k, where k is an arbitrary scale factor.
The range of the signals is 2*8159 and the smallest difference that can be represented is only 2 (when E=0), so the dynamic range is 20*log10(8159) ≈ 78 dB.

[Figure: format of the coded byte. Bit 7: S (sign); bits 6-4: E (exponent); bits 3-0: M (mantissa).]
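The formula can be written directly as a small Python function. This sketch assumes the fields are taken from the byte exactly as drawn (bit 7 = S, bits 6-4 = E, bits 3-0 = M), with no bit inversion applied; the function name is illustrative.

```python
def mu_law_decode(byte):
    """Apply V = (1 - 2S) * [2^E * (2M + 33) - 33] to an 8-bit code."""
    s = (byte >> 7) & 0x1   # sign field, bit 7
    e = (byte >> 4) & 0x7   # exponent field, bits 6-4
    m = byte & 0xF          # mantissa field, bits 3-0
    return (1 - 2 * s) * ((2 ** e) * (2 * m + 33) - 33)


# Step between adjacent codes for each exponent value: 2, 4, 8, ...
steps = [mu_law_decode((e << 4) | 1) - mu_law_decode(e << 4) for e in range(8)]
```

The list `steps` shows the companding: small signals (E = 0) are resolved in steps of 2, and the step doubles with each increment of E.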

Coding of Voice Samples

In many types of phone connections, the voice signal is purposely attenuated by a few decibels to make things work better.
Given a μ-law PCM byte, a digital attenuator must produce a different PCM byte that represents the original analog signal multiplied by a specified attenuation factor.
One way to build a digital attenuator is shown in the next figure.
Each block in the figure can be built with perhaps a dozen MSI chips, or with a CPLD or FPGA.

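[Figure: digital attenuator. An 8-bit μ-law byte enters a μ-law-to-linear decoder; its 14-bit linear output passes through a 14x14 multiplier, and a linear-to-μ-law encoder converts the 14-bit product back to μ-law.]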

Coding of Voice Samples

An alternative realization of the digital attenuator uses a single inexpensive 8K x 8 ROM instead.
This ROM can apply any of 32 different attenuation factors to a μ-law input byte.
The high-order address bits select a table, and the low-order address bits select an entry.
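The contents of such a ROM could be generated along the following lines. This is only a sketch under stated assumptions: the 32 attenuation factors are taken to be 0 dB through 31 dB in 1 dB steps, the encoder simply picks the code whose decoded value is nearest, and the byte fields are used without bit inversion. None of these choices comes from the slides.

```python
import bisect


def mu_law_decode(byte):
    """Apply V = (1 - 2S) * [2^E * (2M + 33) - 33] to an 8-bit code."""
    s = (byte >> 7) & 0x1
    e = (byte >> 4) & 0x7
    m = byte & 0xF
    return (1 - 2 * s) * ((2 ** e) * (2 * m + 33) - 33)


# Decoded magnitudes of the 128 non-negative codes, in increasing order.
MAGNITUDES = [mu_law_decode(code) for code in range(128)]


def mu_law_encode(value):
    """Return the code whose decoded value is nearest to value (clamped)."""
    sign = 0x80 if value < 0 else 0x00
    magnitude = abs(value)
    i = bisect.bisect_left(MAGNITUDES, magnitude)
    if i == 0:
        code = 0
    elif i == len(MAGNITUDES):
        code = len(MAGNITUDES) - 1
    elif MAGNITUDES[i] - magnitude < magnitude - MAGNITUDES[i - 1]:
        code = i
    else:
        code = i - 1
    return sign | code


# Assumption: table t attenuates by t dB, for t = 0 .. 31.
FACTORS = [10 ** (-db / 20) for db in range(32)]


def build_attenuator_rom(factors):
    """One 256-entry table per factor: 32 x 256 = 8K bytes."""
    rom = []
    for factor in factors:           # high-order address bits select the table
        for byte in range(256):      # low-order address bits select the entry
            rom.append(mu_law_encode(mu_law_decode(byte) * factor))
    return rom


ROM = build_attenuator_rom(FACTORS)


def attenuate(table, byte):
    """Look up the attenuated byte at address (table << 8) | byte."""
    return ROM[(table << 8) | byte]
```

With this layout, the decoder, multiplier, and encoder of the block diagram collapse into a single table lookup per sample.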

Digital Conference Circuit

In the analog telephone network, it is easy to make a conference connection between three or more parties:

Just connect the analog phone wires together and you get an analog summing junction.

In the digital network, a digital conference circuit must include a digital adder that produces output samples corresponding to the sums of the input samples.
We have seen how to create binary adders for 8-bit operands.
However, binary adders cannot process μ-law PCM bytes directly:

The 8-bit μ-law PCM bytes must be converted to 14-bit linear format,
The signals can then be added, and
The resulting signal must then be converted back to 8-bit μ-law PCM, as in the previous example.

Again, one could create a complex adder, or alternatively the same function can be performed by a single 64K x 8 ROM.

The ROM has 16 address inputs, accommodating two 8-bit μ-law PCM operands.
For each pair of operand values, the corresponding ROM address contains the pre-computed 8-bit μ-law PCM sum.
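A program to pre-compute the 64K x 8 conference ROM might be sketched as follows. The address layout (first operand in the upper eight address bits), the nearest-value encoder, and the clamping of out-of-range sums to the largest code are assumptions made for this illustration.

```python
import bisect


def mu_law_decode(byte):
    """Apply V = (1 - 2S) * [2^E * (2M + 33) - 33] to an 8-bit code."""
    s = (byte >> 7) & 0x1
    e = (byte >> 4) & 0x7
    m = byte & 0xF
    return (1 - 2 * s) * ((2 ** e) * (2 * m + 33) - 33)


# Decoded magnitudes of the 128 non-negative codes, in increasing order.
MAGNITUDES = [mu_law_decode(code) for code in range(128)]


def mu_law_encode(value):
    """Return the code nearest to value; out-of-range values are clamped."""
    sign = 0x80 if value < 0 else 0x00
    magnitude = abs(value)
    i = bisect.bisect_left(MAGNITUDES, magnitude)
    if i == 0:
        code = 0
    elif i == len(MAGNITUDES):
        code = len(MAGNITUDES) - 1      # saturate instead of wrapping around
    elif MAGNITUDES[i] - magnitude < magnitude - MAGNITUDES[i - 1]:
        code = i
    else:
        code = i - 1
    return sign | code


def build_conference_rom():
    """Pre-compute the encoded sum for every pair of 8-bit operands."""
    rom = []
    for a in range(256):         # upper 8 address bits (assumed layout)
        for b in range(256):     # lower 8 address bits
            rom.append(mu_law_encode(mu_law_decode(a) + mu_law_decode(b)))
    return rom


ROM = build_conference_rom()


def conference_sum(a, b):
    """Look up the pre-computed sum at address (a << 8) | b."""
    return ROM[(a << 8) | b]
```

Handling the out-of-range case takes one extra branch in the generating program and no extra hardware, since the clamped result is simply stored in the table.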

ROM-based Designs (Advantages)

The previous two examples illustrate many advantages of building complex combinational functions with ROMs.
Most complex functions:

Are generally difficult to design with custom digital logic.
By contrast, a ROM realization of those functions is straightforward.

For a moderately complex function, a ROM-based circuit is usually faster than a circuit using multiple SSI/MSI devices and PLDs, and often faster than an FPGA or custom LSI chip in a comparable technology.
The program that generates the ROM contents can easily be structured to handle unusual or undefined cases that would require additional hardware in any other design. For example, the adder function of the previous example can easily handle out-of-range sums.
A ROM's function is easily modified just by changing the stored pattern, usually without changing any external connections. For example, the PCM attenuator and adder ROMs in the previous examples can be changed to use 8-bit A-law PCM, the standard digital voice coding in Europe.
The prices of ROMs and other structured logic devices are always dropping, making them more economical, and their densities are always increasing, expanding the scope of problems that can be solved with a single chip.

ROM-based Designs (Disadvantages)

For simple to moderately complex functions, a ROM-based circuit

may cost more,
consume more power, or
run slower

than a circuit using a few SSI/MSI devices and PLDs or a small FPGA.

For functions with more than 20 inputs, a ROM-based circuit is impractical because of the limit on ROM sizes that are available. For example, one wouldn't build a 16-bit adder in ROM: it would require billions and billions of bits.
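The growth in ROM size can be estimated with a one-line calculation: a ROM with n address inputs and b data outputs stores 2^n x b bits. The sketch below assumes a 16-bit adder with two 16-bit operands, no carry input, and a 17-bit result; the function name is illustrative.

```python
def rom_bits(address_inputs, data_outputs):
    """Total storage, in bits, of a ROM with the given interface."""
    return (2 ** address_inputs) * data_outputs


multiplier_4x4 = rom_bits(8, 8)      # 256 x 8 multiplier ROM
attenuator = rom_bits(13, 8)         # 8K x 8 attenuator ROM
conference = rom_bits(16, 8)         # 64K x 8 conference ROM
adder_16_bit = rom_bits(32, 17)      # assumed: 32 address inputs, 17 outputs
```

Every added input doubles the storage, which is why the 32-input adder lands in the tens of billions of bits while the 16-input conference circuit needs only half a million.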

Complex Programmable Logic Devices

Since their inception years ago, programmable logic devices have been very flexible workhorses of digital design.
As IC technology advanced, there was naturally great interest in creating larger PLD architectures to take advantage of increased chip density. The question is, why didn't manufacturers just scale the existing architectures?
For example, if DRAM densities increased by a factor of 64 over the last 10 years, why couldn't manufacturers scale the 16V8 (16 input signals and their complements, and a number of 16-variable product terms) to create a 128V64? Such a device would have 64 I/O pins and some number (say 8) of 128-variable product terms for each of its 64 logic macrocells.

Complex Programmable Logic Devices

Couldn't this new 128V64 chip combine the functions of a larger collection of 16V8s and offer terrific performance and flexibility, using any input in any output function?

This new chip would be very flexible, but it would not have good performance.

How to expand PLD architecture?

Increase the number of inputs and outputs in a conventional PLD?

E.g., 16V8 --> 20V8 --> 22V10.
Why not --> 32V16 --> 128V64 ?

Problems:

n times the number of inputs and outputs requires n^2 times as much chip area -- too costly.
Logic gets slower as the number of inputs to the AND array increases.

Solution: multiple PLDs with a relatively small programmable interconnect.

Less general than a single large PLD, but a software fitter can be used to partition a design into smaller PLD blocks.

CPLDs vs. FPGAs

CPLD architecture:
A small number of largish PLDs (e.g., 36V18) on a single chip.
Programmable interconnect between the PLDs.

CPLDs vs. FPGAs

FPGA architecture:
A much larger number of smaller programmable logic blocks.
Embedded in a sea of lots and lots of programmable interconnect.

CPLD families

Identical individual PLD blocks (Xilinx FBs) replicated in different family members.

Different number of PLD blocks
Different number of I/O pins

Many CPLDs have fewer I/O pins than macrocells.

Buried macrocells -- provide needed logic terms internally, but their outputs are not connected externally.

IC package size dictates the number of I/O pins but not the total number of macrocells.

Typical CPLD families have devices with differing resources in the same IC package.

Xilinx XC9500 CPLD Family

The Xilinx XC9500 series is a family of CPLDs with a similar architecture but varying numbers of external input/output pins and internal PLDs (which Xilinx calls function blocks, or FBs).
Each internal PLD has 36 inputs and 18 macrocells and outputs, and might be called a 36V18.
As shown in the table on the next slide, devices in the family are named according to the number of macrocells they contain.
The smallest has 2 FBs and 36 macrocells, and the largest has 16 FBs and 288 macrocells.

Xilinx CPLDs

Notice overlap in resource availability in a particular package.

Xilinx CPLDs

Another feature of this family is that a given chip, such as the XC95108, is available in several different packages. This is important not only to accommodate different manufacturing practices but also to provide some choice and potential savings in the number of external I/O pins provided. In most applications, it is not necessary for all internal signals of a state machine or subsystem to be visible to and used by the rest of the system.
Thus, even though the XC95108 has 108 macrocells, the outputs of at most 69 of them can be connected externally in the 84-pin PLCC version of the device.

In fact, many of the 69 I/O pins would typically be used for inputs, in which case even fewer outputs would be visible externally.
Note that the remaining macrocell outputs are still quite usable internally, since they can be hooked up internally through the CPLD's programmable interconnect.
Macrocells whose outputs are usable only internally are sometimes called buried macrocells.

Xilinx 9500-family CPLD


architecture

Xilinx 9500-family CPLD


architecture

I/O pins can be used as input, output or bidirectional pins according to the device's programming.
Special-purpose pins:

GCK: global clock
GSR: global set/reset
GTS: global three-state controls; one of these signals can be selected in each macrocell to output-enable the corresponding output driver when the macrocell's output is hooked up to an external I/O pin.

Only 4 FBs are shown in the previous schematic diagram; however, the XC9500 architecture scales to accommodate 16 FBs in the XC95288.

Xilinx 9500-family CPLD


architecture

Regardless of the specific family member, each FB programmably receives 36 signals from the switch matrix.
The inputs to the switch matrix are the 18 macrocell outputs from each of the FBs and the external inputs from the I/O pins.
Each FB also has 18 outputs that run under the switch matrix, as shown in the previous figure, and connect to the I/O blocks.

These are merely the output-enable signals for the I/O block output drivers;
they are used when the FB macrocell's output is hooked up to an external I/O pin.

9500-family function blocks (FBs)


architecture

18 macrocells per FB
36 inputs per FB (partitioning challenge, but also
reason for relatively compact size of FBs)
Macrocell outputs can go to I/O cells or back into
switch matrix to be routed to this or other FBs.

9500-family function blocks (FBs)


architecture

The basic XC9500 FB programmable AND array has just 90 product terms.
However, it also has product-term allocation.

This mechanism allows a macrocell's unused product terms to be used by other nearby macrocells in the same FB.
The next slide depicts a logic diagram of the XC9500 product-term allocator and macrocell.

9500-series macrocell (18 per FB)

[Figure: macrocell logic diagram, with callouts for up to 5 product terms, programmable inversion or XOR product term, set control, reset control, global clock or product-term clock, and OE control.]

9500-series product-term allocator

[Figure: product-term allocator, with callouts for programmable steering elements and sharing terms from above and below.]

9500-series I/O block

Analog controls in addition to logic ones:

1. Slew-rate control. The rise and fall time of the output signals can be set to be fast or slow.
2. Pull-up resistor. When enabled, the pull-up resistor prevents output pins from floating as the CPLD is powered up. This is useful if the outputs drive active-low enable inputs of other logic that is not supposed to be enabled during power-up.
3. User-programmable ground. This feature reallocates an I/O pin to be a ground pin rather than a signal pin. Extra ground pins are needed to handle the high dynamic currents that flow when multiple outputs switch simultaneously.

Switch matrix for XC95108

Could be anything from a limited set of multiplexers to a full crossbar.
Multiplexer -- small, fast, but difficult fitting.
Crossbar -- easy fitting, but large and slow.
Finding a complete set of connections through a sparse switch matrix is an NP-complete problem.
For each different CPLD-based design, a set of switch-matrix connections must be found by the fitter software.
Typically this software, together with the overall CPLD design, is part of the manufacturer's "secret sauce".

FPGAs

Historically, FPGA architectures and companies began around the same time as CPLDs.

FPGAs are closer to programmable ASICs -- large emphasis on interconnection routing.

Xilinx launched the world's first commercial FPGA in 1985, with the vintage XC2000 device family.
The XC3000 and XC4000 families soon followed, setting the stage for today's Spartan and Virtex device families.
Each evolution of devices brought improvements in density, performance, voltage levels, pin counts, and functionality.
The XC4000, Spartan and Spartan/XL devices have the same basic architecture.
Timing is difficult to predict -- multiple hops vs. the fixed delay of a CPLD's switch matrix.
But FPGAs are more scalable to large sizes.

FPGA programmable logic blocks have only a few inputs and 1 or 2 flip-flops, but there are a lot more of them compared to the number of macrocells in a CPLD.

General FPGA chip architecture

[Figure: general FPGA chip architecture. The logic block is a.k.a. CLB -- configurable logic block.]

Xilinx 4000-series FPGAs

FPGA specsmanship

Two flip-flops per CLB, plus two per I/O cell.
25 gates per CLB if used for logic.
32 bits of RAM per CLB if not used for logic.
All of this is valid only if your design has a perfect fit.

Configurable Logic Block (CLB)

CLB function generators (F, G, H)

Use RAM to store a truth table

F, G: 4 inputs, 16 bits of RAM each

H: 3 inputs, 8 bits of RAM:

16x2 dual port RAM or


32X1 single port RAM.

RAM is loaded from an external PROM at system initialization.

Broad capability using F, G, and H:

Any 2 funcs of 4 vars, plus a func of 3 vars
Any func of 5 vars
Any func of 4 vars, plus some funcs of 6 vars
Some funcs of 9 vars, including parity and 4-bit cascadable equality checking
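The "any func of 5 vars" case can be illustrated with a Python sketch in which each function generator is a small truth table held in a list. It assumes that F and G share the same four inputs and that H takes the F output, the G output, and the fifth variable, acting as a 2-to-1 selector; the helper names are hypothetical.

```python
def make_lut(func, n_inputs):
    """Fill a 2^n-entry truth table (the generator's RAM) from a function."""
    return [func(*(((i >> k) & 1) for k in range(n_inputs))) & 1
            for i in range(2 ** n_inputs)]


def lut_read(table, *bits):
    """Use the input bits as the RAM address and return the stored bit."""
    index = sum(bit << k for k, bit in enumerate(bits))
    return table[index]


def five_input_function(func):
    """Build a CLB-style evaluator for an arbitrary 5-variable function."""
    # F holds the function with e = 0, G holds it with e = 1 (16 bits each).
    f = make_lut(lambda a, b, c, d: func(a, b, c, d, 0), 4)
    g = make_lut(lambda a, b, c, d: func(a, b, c, d, 1), 4)
    # H (8 bits) selects between the F and G outputs using e.
    h = make_lut(lambda f_out, g_out, e: g_out if e else f_out, 3)

    def clb(a, b, c, d, e):
        return lut_read(h, lut_read(f, a, b, c, d), lut_read(g, a, b, c, d), e)

    return clb
```

For example, `five_input_function(lambda a, b, c, d, e: a ^ b ^ c ^ d ^ e)` returns an evaluator for 5-input parity built only from table lookups.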

Dedicated Fast Carry and Borrow


Logic

The F and G function generators of the XC4000 family have:

separate dedicated logic for fast carry and borrow generation,
with dedicated routing to link the extra signal to the function generator in the adjacent CLB.

One function generator (F) can be used to add a0+b0, and the second function generator (G) can generate a1+b1.
The fast carry logic forwards the carry to the next CLB above or below.
Fast carry and borrow logic increases the efficiency and performance of adders, subtractors, accumulators, comparators, and counters.

CLB input and output connections


-- buried in the sea of interconnect

XC4000 Interconnect Resources

Three types of general-purpose interconnect:

1. Single-length lines,
2. Double-length lines, and
3. Long lines.

A grid of horizontal and vertical single-length lines connects an array of switch boxes.

Switch boxes provide a reduced number of connections between signal paths within each box (not a full crossbar switch).

XC4000 Interconnect Resources

In the XC4000 there is a rich set of connections between single-length lines and the CLB inputs and outputs.

Capabilities for nearest-neighbor and across-the-chip connection between CLBs.
The two Single groups are optimized for flexible connectivity between adjacent blocks, without the small-number and unidirectional limitations of the wires in the Direct Connect groups.
With Single wires it is possible to connect a CLB to another that's more than one hop away, but the signal has to go through a programmable switch for each hop, which adds delay.
Wires in the Double groups travel past two CLBs before hitting a switch, so they provide shorter delays for longer connections.
The Long groups of wires do not go through any programmable switches at all; instead, they travel all the way across or down a row or column and are driven by three-state drivers near the CLB.

Detail connections controlled by


RAM bits

Programmable Switch Matrix (PSM)

Each diamond in the shaded area indicating the PSM:

Is a programmable switch element (PSE) that can connect any line to any other, as shown in the next slide under (b).

While the PSM is essential, using it has a price: signals incur a small delay each time they hop through a PSE.

High-quality FPGA fitter software does not search for just any CLB placement and wire connections that work.

The placement and routing tool spends a lot of time trying to optimize device performance by finding a placement that allows short connections, and then routing the connections themselves.

Programmable Switch Matrix


programmable switch element

turning the corner, etc.

The fitter's job

Partition logic functions into CLBs


Arrange the CLBs
Interconnect the CLBs
Minimize the number of CLBs used
Minimize the size and delay of interconnect
used
Work with constraints

Locked I/O pins


Critical-path delays
Setup and hold times of storage elements

I/O blocks

Problems common to CPLDs and FPGAs

Pin locking

Small changes, and certainly large ones, can cause the fitter to pick a different allocation of I/O blocks and pinout.
Locking too early may make the resulting circuit slower or not fit at all.

Running out of resources

A design may "blow up" if it doesn't all fit on a single device.
On-chip interconnect resources are much richer than off-chip; e.g., barrel-shifter example.
Larger devices are exponentially more expensive.
