
Chapter 3 Parallel and Pipelined Processing

ECE734 VLSI Arrays for Digital Signal Processing

Basic Ideas
Parallel processing: less inter-processor communication, but more complicated processor hardware.

Pipelined processing: more inter-processor communication, but simpler processor hardware.

[Figure: timing diagrams for processors P1 to P4. Parallel processing: each processor handles one data stream over time (P1: a1 a2 a3 a4; P2: b1 b2 b3 b4; P3: c1 c2 c3 c4; P4: d1 d2 d3 d4). Pipelined processing: rows a1 b1 c1 d1; a2 b2 c2 d2; a3 b3 c3 d3; a4 b4 c4 d4.]

Colors: different types of operations performed
a, b, c, d: different data streams processed

Data Dependence

Parallel processing requires NO data dependence between processors.

Pipelined processing will involve inter-processor communication.

[Figure: two timing diagrams of processors P1 to P4 versus time, one for parallel processing and one for pipelined processing.]

Usage of Pipelined Processing


By inserting latches or registers between combinational logic circuits, the critical path can be shortened.
Consequence: reduced clock cycle time, increased clock frequency.

Suitable for DSP applications that have (infinitely) long data streams.

Method to incorporate pipelining: cut-set retiming.

Cut set: a set of edges of a graph such that, if these edges are removed from the original graph, the remaining graph becomes two separate graphs.

Retiming: the timing of an algorithm is re-adjusted while keeping the partial ordering of execution unchanged, so that the results remain correct.

Graphic Transpose Theorem


The transfer function of a signal flow graph remains unchanged if:
The direction of each arc is reversed
The input and output labels are switched

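[Figure: 3-tap FIR filter signal flow graph with input x[n], two z^-1 delays, coefficients h[0], h[1], h[2], and output y[n], compared ("?=") with its transposed graph, which has input x[n], internal node u[n], the same delays and coefficients, and output y[n].]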

Data broadcast structure

An algorithm transformation may lead to a pipelined structure without adding additional delays.

Given an FIR filter SFG:
Critical path T_M + 2T_A (T_M: multiplication time, T_A: addition time)

Use the graph transposition theorem:
Reverse all arcs
Reverse input/output

We obtain the data broadcast structure:
Critical path T_M + T_A
No additional delay added!

Fine-grain pipelining

To further reduce T_M, the multiplier itself is pipelined into two stages with delays T_M1 and T_M2.

Critical path = max{T_M1, T_M2, T_A}
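
To make the three critical-path expressions concrete, here is a small worked example in Python. The delay values (T_M = 10, T_A = 2, and a 6 + 4 split of the multiplier) are assumptions for illustration only and are not taken from the slides.

```python
def critical_paths(t_m, t_a, t_m1, t_m2):
    """Return the critical path of each 3-tap FIR structure discussed above."""
    return {
        "direct form": t_m + 2 * t_a,            # one multiply, two adds in series
        "data broadcast": t_m + t_a,             # one multiply, one add
        "fine-grain": max(t_m1, t_m2, t_a),      # slowest single pipeline stage
    }

# Assumed delays in arbitrary time units; the multiplier is split as 10 = 6 + 4.
paths = critical_paths(t_m=10, t_a=2, t_m1=6, t_m2=4)
for name, t in paths.items():
    print(f"{name}: {t}")
```

Under these assumed values the clock period could shrink from 14 to 12 to 6 time units, so the achievable clock frequency rises accordingly.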


Block Processing
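One form of vectorized parallel processing of DSP algorithms (not parallel processing in the most general sense).
Block vector: [x(3k) x(3k+1) x(3k+2)]
Clock cycle: can be 3 times longer

Original (FIR filter):
\[ y(n) = a\,x(n) + b\,x(n-1) + c\,x(n-2) \]

Rewrite 3 equations at a time:
\[
\begin{bmatrix} y(3k) \\ y(3k+1) \\ y(3k+2) \end{bmatrix}
= a \begin{bmatrix} x(3k) \\ x(3k+1) \\ x(3k+2) \end{bmatrix}
+ b \begin{bmatrix} x(3k-1) \\ x(3k) \\ x(3k+1) \end{bmatrix}
+ c \begin{bmatrix} x(3k-2) \\ x(3k-1) \\ x(3k) \end{bmatrix}
\]

Define block vector:
\[ \mathbf{x}(k) = \begin{bmatrix} x(3k) \\ x(3k+1) \\ x(3k+2) \end{bmatrix} \]

Block formulation:
\[
\mathbf{y}(k) =
\begin{bmatrix} a & 0 & 0 \\ b & a & 0 \\ c & b & a \end{bmatrix} \mathbf{x}(k)
+ \begin{bmatrix} 0 & c & b \\ 0 & 0 & c \\ 0 & 0 & 0 \end{bmatrix} \mathbf{x}(k-1)
\]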
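
The block formulation can be checked against the sample-by-sample filter. The following Python sketch assumes zero initial conditions and an input length that is a multiple of 3. The function names and the test values are illustrative only.

```python
def fir(x, a, b, c):
    """Reference filter: y(n) = a x(n) + b x(n-1) + c x(n-2), zero initial state."""
    def at(n):
        return x[n] if n >= 0 else 0
    return [a * at(n) + b * at(n - 1) + c * at(n - 2) for n in range(len(x))]

def block_fir(x, a, b, c):
    """Block form: three outputs per iteration, y(k) = A0 x(k) + A1 x(k-1)."""
    if len(x) % 3 != 0:
        raise ValueError("input length must be a multiple of the block size 3")
    A0 = [[a, 0, 0], [b, a, 0], [c, b, a]]   # acts on the current block
    A1 = [[0, c, b], [0, 0, c], [0, 0, 0]]   # acts on the previous block
    prev = [0, 0, 0]
    y = []
    for k in range(len(x) // 3):
        cur = x[3 * k : 3 * k + 3]
        for row0, row1 in zip(A0, A1):
            y.append(
                sum(m * v for m, v in zip(row0, cur))
                + sum(m * v for m, v in zip(row1, prev))
            )
        prev = cur
    return y

x = [1, 2, 3, 4, 5, 6]
print(fir(x, 3, 2, 1))
print(block_fir(x, 3, 2, 1))
```

Both calls print the same output sequence. In hardware the three rows would be evaluated concurrently in one, three-times-longer clock cycle; the inner loop here only models that arithmetic.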


Block Processing


General approach for block processing


Block Processing for IIR Digital Filter


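Original formulation:
\[ y(n) = a\,y(n-2) + x(n) \]

Time indices:
n: sample index (sampling period)
k: block index (processor clock period); block k holds samples n = 2k and n = 2k+1

Rewrite:
\[ y(2k) = a\,y(2k-2) + x(2k) \]
\[ y(2k+1) = a\,y(2k-1) + x(2k+1) \]

Define block vectors:
\[
\mathbf{x}(k) = \begin{bmatrix} x(2k) \\ x(2k+1) \end{bmatrix}, \quad
\mathbf{y}(k) = \begin{bmatrix} y(2k) \\ y(2k+1) \end{bmatrix}
\]

Then:
\[ \mathbf{y}(k) = a\,\mathbf{y}(k-1) + \mathbf{x}(k) \]

Note:
Pipelining: clock period = sampling period.
Block (parallel): clock period not equal to sampling period.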
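
The block recursion can be compared with the original sample-by-sample recursion. The sketch below is a minimal Python model, assuming zero initial conditions and an even input length. The function names and the test values are illustrative only.

```python
def iir(x, a):
    """Reference filter: y(n) = a*y(n-2) + x(n), zero initial conditions."""
    y = []
    for n, xn in enumerate(x):
        y.append(a * (y[n - 2] if n >= 2 else 0) + xn)
    return y

def block_iir(x, a):
    """Block form: two samples per clock, y(k) = a*y(k-1) + x(k) on 2-element vectors."""
    if len(x) % 2 != 0:
        raise ValueError("input length must be a multiple of the block size 2")
    y_prev = [0, 0]                       # y(k-1) = [y(2k-2), y(2k-1)]
    out = []
    for k in range(len(x) // 2):
        x_k = x[2 * k : 2 * k + 2]        # x(k) = [x(2k), x(2k+1)]
        y_k = [a * yp + xv for yp, xv in zip(y_prev, x_k)]
        out.extend(y_k)                   # parallel-to-serial conversion
        y_prev = y_k
    return out

x = [1, 2, 3, 4, 5, 6]
print(iir(x, 2))
print(block_iir(x, 2))
```

Both calls print the same sequence. The two elements of each block depend only on the previous block, not on each other, so they can be computed by two independent units in the same clock period.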


Block IIR Filter

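[Figure: block IIR filter. A serial-to-parallel (S/P) converter splits x(n) into x(2k) and x(2k+1). Two branches compute y(2k) and y(2k+1) using the fed-back values y(2(k-1)) and y(2(k-1)+1). A parallel-to-serial (P/S) converter merges the branch outputs into y(n).]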


Timing Comparison
[Figure: timing diagrams comparing three schedules. MAC: inputs x(1) to x(4) with outputs y(1) to y(4). Pipelining: separate Add and Mul rows, inputs x(1) to x(7) with outputs y(1) to y(7), and the product a y(1) marked. Block processing: two input rows, x(2) x(4) x(6) x(8) and x(1) x(3) x(5) x(7).]
