Abstract
Time-multiplexed FPGAs have the potential to dramatically improve logic density by time-sharing logic, and have
become an active research area in reconfigurable computing. The
partitioning problem for time-multiplexed FPGAs differs
from the traditional partitioning problem in that the nodes
have precedence constraints among them, so the widely used
iterative improvement partitioning methods such as K&L and
FM [14,15] are no longer applicable. All previous approaches
[1,2,3] used list scheduling heuristics.
In this paper, we present a network flow based algorithm
for multi-way precedence constrained partitioning, which can
handle the precedence constraints while minimizing the net-cut size. The experimental results on the MCNC benchmark
circuits show that our algorithm outperforms list scheduling
by a large margin, with an average improvement of over 50%
for bipartitioning and 20% for multi-way partitioning.
1 Introduction
One of the major benefits provided by FPGAs is the ability
to reconfigure at run time. Currently there is growing interest in dynamically reconfigurable FPGAs (DRFPGAs). A
large virtual logic design is partitioned into multiple stages that
share the same smaller physical device in a time-multiplexed
fashion.
Time-multiplexed FPGAs have the potential to dramatically improve logic density by time-sharing logic. Several
different architectures have been proposed, such as the Xilinx
model [1], Dharma [6], the Dynamically Programmable Gate
Array [7,8], and the Virtual Element Gate Array [9]. These
DRFPGAs allow dynamic reuse of the logic blocks and wire
segments by having more than one on-chip SRAM bit controlling them. Thus logic blocks and interconnect can be
changed by reading a different SRAM bit, which takes only
time on the order of nanoseconds. Currently, partially reconfigurable FPGAs are available commercially, such as the
AT6000 from Atmel and the XC6200 from Xilinx.
Figure 1 shows the Xilinx time-multiplexed FPGA configuration model [1]. The FPGA emulates a large device by sequencing through multiple configurations called micro-cycles.
One pass through all the micro-cycles is called a user cycle.
In each micro-cycle, the CLBs (Configurable Logic Blocks)
are re-used to evaluate logic. The target architecture consists
the nodes that use the value of the flip-flop use the same
value: the value of the flip-flop from the previous user
cycle.
$V = \bigcup_{i=1}^{k} V_i$
if $P(v) \le P(u)$, then $s(v) \le s(u)$
Problem Statement
than all its input c-nodes. This rule guarantees that flip-flop input values are calculated before they are stored.
Each FF-node must be scheduled in a stage no earlier
than all its output nodes. This rule guarantees that all
[Figure: net modeling for a two-terminal net and a multi-terminal net]
Network flow is an excellent approach to finding min-cuts because of the celebrated max-flow min-cut theorem [5]. Yang
and Wong [16] successfully applied the network flow approach
to balanced bipartitioning and employed an incremental flow
technique for efficient implementation. They gave a net modeling [16] under which the min-cut in the constructed network corresponds to the min-net-cut in the original circuit. But with this
net modeling, nodes in the same net are symmetric and do
not have precedence constraints among them, so the min-cut
found is not necessarily a uni-directional cut. For our precedence constrained partitioning problem, it is
necessary to find a proper net modeling so that the precedence constraints among the nodes will be maintained.
In another related work, Cong et al. [4] first used the iterative max-flow min-cut method to find a uni-directional min-cut on a combinational circuit in a logic synthesis algorithm.
But they only modeled two-terminal nets in a combinational
circuit.
In the following sections, we present net modeling for
both two-terminal and multi-terminal nets in combinational
and sequential circuits, so that the min-cut found by the max-flow computation
preserves the precedence constraints. We further
prove the correctness of the net modeling.
3.1
A proper net modeling must meet two requirements: (1) it correctly models a net cut, so that a net is counted exactly once
if it is cut; (2) it correctly models the precedence constraints
among the nodes.
For a net $n = \{v_1, \ldots, v_p\}$ in $N_c$, $v_1$ is a c-node and $P(v_1) \le P(v_j)$
for $2 \le j \le p$ (i.e. $v_1$ must be in a stage no later than
its output nodes). We construct network $G' = (V', N')$ from
3.2
$V'$, i.e. $V \subset V'$
3.3
Figure 6: Net modeling for a two-terminal net and a multi-terminal net in $N_f$.
Algorithm FBP-u:
Flow-based $\alpha$-bounded uni-directional bipartitioning
begin
1. Construct $G'$ from $G$ by net modeling;
2. Pick a pair of nodes $s$ and $t$ in $G'$ as source and sink;
3. Max-flow computation, find a min-cut $C$ in $G'$;
   Let $X$ be the sub-circuit reachable from $s$ through
   augmenting paths, and $\bar{X}$ be the rest;
4. if $(1-\epsilon)\alpha \le w(X) \le (1+\epsilon)\alpha$ then
   stop and return $C$ as the answer;
5. if $w(X) < (1-\epsilon)\alpha$ then
   5.1 collapse all nodes in $X$ to $s$;
   5.2 pick a node $v \in \bar{X}$ and collapse $v$ to $s$;
   5.3 goto step 3;
6. if $w(X) > (1+\epsilon)\alpha$ then
   6.1 collapse all nodes in $\bar{X}$ to $t$;
   6.2 pick a node $v \in X$, and collapse $v$ to $t$;
   6.3 goto step 3;
end
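The loop in steps 3 to 6 of FBP-u can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it handles two-terminal nets only, it models each driver-to-sink net with a capacity-1 bridging edge plus a reverse edge of very large capacity (`INF` stands in for $\infty$), it recomputes the max-flow from scratch in every iteration instead of incrementally, and it realizes "collapse to $s$ or $t$" by adding large-capacity edges. The names `min_cut`, `fbp_u`, `in_s`, and `in_t` are hypothetical.

```python
from collections import deque

INF = 10**9  # large finite stand-in for the infinite capacity (assumption)

def min_cut(n, edges, s, t):
    """Edmonds-Karp max-flow; returns (cut size, source-side node set X)."""
    cap = [[0] * n for _ in range(n)]
    for u, v, c in edges:
        cap[u][v] += c
    flow = 0
    while True:
        parent = [-1] * n
        parent[s] = s
        queue = deque([s])
        while queue and parent[t] == -1:
            u = queue.popleft()
            for v in range(n):
                if parent[v] == -1 and cap[u][v] > 0:
                    parent[v] = u
                    queue.append(v)
        if parent[t] == -1:  # no augmenting path left: reached nodes form X
            return flow, {v for v in range(n) if parent[v] != -1}
        bottleneck, v = INF, t
        while v != s:
            bottleneck = min(bottleneck, cap[parent[v]][v])
            v = parent[v]
        v = t
        while v != s:
            cap[parent[v]][v] -= bottleneck
            cap[v][parent[v]] += bottleneck
            v = parent[v]
        flow += bottleneck

def fbp_u(n, nets, weight, s, t, alpha, eps):
    """nets: (driver, sink) pairs; returns (cut size, X) or None if stuck."""
    preds = {v: set() for v in range(n)}
    succs = {v: set() for v in range(n)}
    base = []
    for u, v in nets:
        preds[v].add(u)
        succs[u].add(v)
        base += [(u, v, 1), (v, u, INF)]  # bridging edge, precedence edge
    in_s, in_t = {s}, {t}  # nodes collapsed to the source / sink so far
    lo, hi = (1 - eps) * alpha, (1 + eps) * alpha
    while True:
        edges = (base + [(s, v, INF) for v in in_s if v != s]
                      + [(v, t, INF) for v in in_t if v != t])
        cut, X = min_cut(n, edges, s, t)                      # step 3
        w = sum(weight[v] for v in X)
        if lo <= w <= hi:                                     # step 4
            return cut, X
        if w < lo:                                            # step 5
            in_s |= X
            # candidates whose predecessors are all already in X
            ready = [v for v in range(n)
                     if v not in X and v not in in_t and preds[v] <= X]
            if not ready:
                return None
            in_s.add(ready[0])
        else:                                                 # step 6
            x_bar = set(range(n)) - X
            in_t |= x_bar
            # candidates whose successors are all already in the sink side
            ready = [v for v in sorted(X)
                     if v not in in_s and succs[v] <= x_bar]
            if not ready:
                return None
            in_t.add(ready[0])

# A six-node chain with unit weights, alpha = 3 and eps = 0.
chain = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
result = fbp_u(6, chain, [1] * 6, 0, 5, 3, 0)
print(result)
```

On the chain, the first two cuts are too light, so nodes 1 and 2 are collapsed to the source in turn before the weight bound is met.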
In step 1 of algorithm FBP-u, the network $G'$ is constructed from $G$ by the net modeling discussed in sections
3.1 and 3.2. Step 2 selects the source $s$ and sink $t$. Unlike
FBB [16], the source and sink cannot be selected randomly.
The source $s$ should be a node such that there is no $v$ with
$P(v) \le P(s)$ (that is, $s$ has no predecessor), and the sink $t$ should be a node such that there
is no $v$ with $P(t) \le P(v)$ (that is, $t$ has no successor).
In step 3, a min-cut is found in $G'$ by the max-flow computation. In step 4, if the total weight of $X$ is within range,
then $X$ is returned as the result. In step 5, if $w(X)$ is less than
$(1-\epsilon)\alpha$, then the nodes in $X$ are collapsed to $s$ and a node $v$ from
$\bar{X}$ is collapsed to $s$, so that in the next iteration more flow
can be pushed through the network to explore a different cut
with a larger weight in $X$. The node $v$ collapsed to $s$ is chosen
such that for any $u$ with $P(u) \le P(v)$, $u$ is already in $X$. In
step 6, if $w(X)$ is greater than $(1+\epsilon)\alpha$, then all nodes in $\bar{X}$
are collapsed to $t$, and a node $v$ from $X$ is collapsed to $t$ in
step 6.2. The node $v$ collapsed to $t$ is chosen such that for
any $u$ with $P(v) \le P(u)$, $u$ is already in $\bar{X}$.
Similar to FBB [16], an incremental flow technique is employed for efficient implementation. It is not necessary to
calculate the max-flow from scratch in each iteration. Only
additional flow is added through the network from the source
to the sink to saturate the bridging edges during the max-flow
computation. Similar to the proof in [16], the time complexity for the repeated max-flow min-cut is asymptotically the
same as one max-flow computation. The time complexity for
FBP-u is $O(|V||E|)$.
Figure 8 shows an example of finding an $\alpha$-bounded uni-directional bipartitioning with $\alpha = 6$. In the first iteration,
the min-cut is 1 after the max-flow computation, and $w(X) = 1$.
We have the following lemmas about the correctness of
the net modeling for nets both in $N_c$ and $N_f$.
Lemma 1: Any min-cut in $G'$ corresponds to a net-cut
in $G$.
Proof: After the max-flow min-cut computation in $G'$,
every cut edge from $X'$ to $\bar{X}'$ has capacity 1 and is saturated.
Only the bridging edge for a net can be a forward cut edge
from $X'$ to $\bar{X}'$. Since a net has exactly one bridging edge, if
it is cut, it contributes exactly 1 to the min-cut in $G'$. On
the other hand, only a cut net will be counted in the min-cut. Therefore, the min-cut size in $G'$ equals the number
of cut nets in $G$. $\Box$
Lemma 2: Any min-cut in $G'$ corresponds to a minimum
uni-directional cut in $G$.
Proof: For a min-cut $(X', \bar{X}')$ in $G'$, all the forward edges
from $X'$ to $\bar{X}'$ must be saturated after the max-flow computation and all the backward edges from $\bar{X}'$ to $X'$ have zero
flow. In both the two-terminal and multi-terminal
net modeling for nets in $N_c$ and $N_f$, for any two nodes $v$, $u$,
if $P(v) \le P(u)$, then there is an edge from node $u$ to $v$ with
capacity $\infty$. So it will never happen that $u$ is in $X'$ and $v$
is in $\bar{X}'$. Thus for any min-cut $(X', \bar{X}')$, either $v$ and $u$ are
in the same partition, or $v \in X'$ and $u \in \bar{X}'$. Therefore a
min-cut in $G'$ corresponds to a uni-directional cut in $G$.
We now prove that the uni-directional cut in $G$ is minimum. Suppose there is another uni-directional cut $(Y, \bar{Y})$
with a smaller cut size, then let $(Y', \bar{Y}')$ be the corresponding
cut in $G'$. Then $(Y', \bar{Y}')$ would be a smaller cut than $(X', \bar{X}')$,
Figure 7: A cut in $G'$ and the corresponding net-cut in $G$.
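The two lemmas can be checked by brute force on a tiny circuit. The sketch below is an illustration under assumptions, not the paper's construction: it covers two-terminal nets only and assumes each net from driver $u$ to sink $v$ is modeled by a bridging edge $u \to v$ of capacity 1 and a reverse edge $v \to u$ of infinite capacity, which is consistent with the edges described in the proofs. The helper names (`model`, `cut_capacity`, `cut_nets`, `is_unidirectional`) and the example netlist are hypothetical.

```python
from itertools import combinations

INF = float("inf")

def model(nets):
    """Assumed two-terminal net model: bridging edge plus precedence edge."""
    edges = []
    for u, v in nets:
        edges.append((u, v, 1))    # bridging edge, counted once if cut
        edges.append((v, u, INF))  # forbids sink before driver
    return edges

def cut_capacity(edges, X):
    """Total capacity of edges leaving the source side X."""
    return sum(c for u, v, c in edges if u in X and v not in X)

def cut_nets(nets, X):
    """Number of nets with terminals on both sides of the cut."""
    return sum(1 for u, v in nets if (u in X) != (v in X))

def is_unidirectional(nets, X):
    """True if no net has its sink in X while its driver is outside X."""
    return all(not (v in X and u not in X) for u, v in nets)

# Small example: 0 drives 1 and 2, both drive 3, and 3 drives 4.
nets = [(0, 1), (0, 2), (1, 3), (2, 3), (3, 4)]
s, t, inner = 0, 4, [1, 2, 3]
edges = model(nets)

# Enumerate every s-t cut by choosing which inner nodes join the source.
cuts = []
for r in range(len(inner) + 1):
    for extra in combinations(inner, r):
        X = {s, *extra}
        cuts.append((cut_capacity(edges, X), X))

finite = [(c, X) for c, X in cuts if c < INF]
best = min(finite, key=lambda pair: pair[0])
print(best)
```

Every finite-capacity cut in this example is uni-directional and its capacity equals its number of cut nets, which is the behavior Lemmas 1 and 2 describe for the min-cut.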
Algorithm FBP-m:
Flow-based multi-way precedence constrained partitioning
begin
3.1. $s = (\bigcup_{j=1}^{i-1} V_j) \cup P_i$, and let $w(s) = w(P_i)$;
     $t = \{v \mid AS(v) > i\}$, and let $w(t) = w(P_{i+1})$;
3.2. $F = \{v \mid AS(v) \le i$, s.t. $v \notin V_j$, $1 \le j < i\}$;
3.3. construct $F'$ from $F \cup s \cup t$ by net modeling;
3.4. find an $\alpha$-bounded uni-directional min-net-cut
     $(X, \bar{X})$ by algorithm FBP-u;
3.5. assign nodes in $X$ to stage $i$, $V_i = P_i \cup (X - s)$;
3.6. for $v \in F$ with $AL(v) = i + 1$, assign $s(v) = i + 1$;
end
The partitioning process has three major steps.
Step 1 performs As Soon As Possible (ASAP) and As Late
As Possible (ALAP) scheduling. In the ASAP scheduling,
each node is assigned to the earliest possible stage. In the
ALAP scheduling, each node is assigned to the latest possible stage. For each node $v$, let $AS(v)$ and $AL(v)$ be the stages
assigned to $v$ in the ASAP and ALAP scheduling respectively.
We decide $AS(v)$ and $AL(v)$ as follows. In the ASAP
scheduling, each node $v$ is first labeled with the earliest level
by breadth-first search. Let $e(v) = \{u \mid P(u) \le P(v)\}$ be the
subset of nodes which have a higher precedence than $v$, and let
$l_u$ be the earliest level for $u$. If $e(v) = \emptyset$, then $l_v = 1$, else
$l_v = \max\{l_u \mid u \in e(v)\} + 1$. The earliest stage for $v$ is $AS(v) =$
In the ALAP scheduling, each node is first labeled with
the latest level by breadth-first search. Let $e(v) = \{u \mid P(v) \le P(u)\}$
be the subset of nodes which have lower precedence than
$v$, and let $l_u$ be the latest level for $u$. If $e(v) = \emptyset$, then
$l_v = depth$, else $l_v = \min\{l_u \mid u \in e(v)\} - 1$. Then the
latest stage for $v$ is $AL(v) =$
Each node $v$ is assigned an interval $[AS(v), AL(v)]$ after
the ASAP and ALAP scheduling. If $AS(v) = AL(v) = j$,
then $v$ must be scheduled in stage $j$. In this case we call $v$
a fixed node. If $AS(v) < AL(v)$, then $v$ can be assigned to
any stage from $AS(v)$ to $AL(v)$. We call $v$ a flexible node.
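The level labeling behind the ASAP and ALAP schedules can be sketched as follows. This is a minimal sketch under assumptions: precedence is given as explicit directed edges of a DAG, the maximum over $e(v)$ is taken over direct predecessors only (which gives the same level as the maximum over all predecessors), and, because the mapping from levels to stages is not reproduced here, levels are compared directly to separate fixed from flexible nodes. The function name `asap_alap_levels` and the example graph are hypothetical.

```python
from collections import deque

def asap_alap_levels(n, edges):
    """Return (earliest, latest) level lists for a DAG on nodes 0..n-1."""
    preds = [[] for _ in range(n)]
    succs = [[] for _ in range(n)]
    for u, v in edges:
        succs[u].append(v)
        preds[v].append(u)

    # Topological order (Kahn's algorithm).
    indeg = [len(p) for p in preds]
    order = []
    queue = deque(v for v in range(n) if indeg[v] == 0)
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in succs[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)

    # Earliest level: 1 with no predecessor, else max over predecessors + 1.
    earliest = [1] * n
    for v in order:
        if preds[v]:
            earliest[v] = max(earliest[u] for u in preds[v]) + 1

    # Latest level: depth with no successor, else min over successors - 1.
    depth = max(earliest)
    latest = [depth] * n
    for v in reversed(order):
        if succs[v]:
            latest[v] = min(latest[u] for u in succs[v]) - 1
    return earliest, latest

# Critical path 0 -> 1 -> 2 -> 3 plus a shorter side path 0 -> 4 -> 3.
edges = [(0, 1), (1, 2), (2, 3), (0, 4), (4, 3)]
earliest, latest = asap_alap_levels(5, edges)
fixed = [v for v in range(5) if earliest[v] == latest[v]]
flexible = [v for v in range(5) if earliest[v] < latest[v]]
print(earliest, latest, fixed, flexible)
```

In this example the nodes on the critical path come out fixed, while node 4 on the shorter path has the interval [2, 3] and is flexible.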
In step 2, let $P_i$ be the subset of nodes fixed to stage
$i$ ($1 \le i \le k$) based on the ASAP and ALAP scheduling,
i.e. $P_i = \{v \mid AS(v) = AL(v) = i\}$. Note that the nodes
on a critical path are fixed, but many nodes on the non-critical paths are flexible and can be assigned to different stages.
The assignment of a flexible node influences other nodes through
the precedence constraints. In our partitioning process, the
goal is to assign a stage to each of the flexible nodes while
balancing the number of nodes in each stage and minimizing
the number of interconnections between the stages.
Step 3 iteratively calls the network flow based bipartitioning algorithm FBP-u to partition the flexible nodes between
stages $i$ and $i+1$ ($1 \le i < k$). For the $i$th iteration, the
details of the partitioning process to find $V_i$ are as follows.
In step 3.1, the source $s$ and sink $t$ of the network are
decided. The source is a subset of nodes where $s = (\bigcup_{j=1}^{i-1} V_j) \cup P_i$,
and $w(s) = w(P_i)$. The source $s$ contains all the nodes
assigned to stages prior to $i$ and the fixed nodes in $P_i$. The
sink is $t = \{v \mid AS(v) > i\}$ with $w(t) = w(P_{i+1})$; $t$ contains the nodes
which can only be put in a stage later than $i$. In step 3.2, all
the flexible nodes that can be put in stage $i$ or $i+1$ form a
subset $F = \{v \mid AS(v) \le i$, s.t. $v \notin V_j$, $1 \le j < i\}$. Notice
that nodes in $F$ can be put in either stage $i$ or $i+1$. We
want to find a uni-directional min-cut $(X, \bar{X})$ in $F$ such that
$X$ has the desired total weight.
In step 3.3, network $F'$ is constructed from $F \cup s \cup t$ by
the net modeling. Algorithm FBP-u is applied on $F'$ to find
an $\alpha$-bounded min-cut $(X, \bar{X})$.
In step 3.5, the nodes in $X$
are assigned to stage $i$, such that $V_i = P_i \cup (X - s)$. Next,
in step 3.6, all the unassigned nodes with $AL(v) = i+1$ are
assigned to stage $i+1$, i.e. $P_{i+1} = P_{i+1} \cup \{v \mid AL(v) = i+1\}$.
Then $i$ is increased by 1 and control goes back to step 3.1
to start the next iteration. In step 1, the ASAP and ALAP
scheduling takes $O(|V|)$ time. Each iteration in step 3 takes
$O(|V||E|)$ time, so the time complexity for algorithm FBP-m
is $O(k|V||E|)$.
Figure 8: Example of $\alpha$-bounded uni-directional bipartitioning (iterations 1 to 3, with min-cut = 1, 2, 3; marked nodes are collapsed to the source or sink).
In the example of Figure 8, after the first iteration node $a \in \bar{X}$
is collapsed to $s$ (i.e. $w(s) = 2$ now) so
that more flow can be pushed through the network in the
next iteration. After the max-flow in the second iteration,
the min-cut size is 2, $w(X) = 7$ and $w(\bar{X}) = 4$. Nodes in $\bar{X}$
are merged to $t$ and node $i$ from $X$ is collapsed to $t$. In the
third iteration, min-cut = 3 and $X$ reaches the area limit with
$w(X) = 5$. So $(X, \bar{X})$
forms an $\alpha$-bounded min-cut with cut
size 3. We can then find the corresponding uni-directional
net cut in the original netlist $G$.
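Steps 3.1 and 3.2 of FBP-m amount to a few set constructions, which the following sketch spells out. It is a simplified illustration, not the authors' code: stages are numbered from 1, `AS` and `AL` are plain lists indexed by node, `assigned` maps each earlier stage $j$ to its node set $V_j$, and $P_i$ is recomputed from $AS(v) = AL(v) = i$ only, ignoring the nodes that step 3.6 adds to it in earlier iterations. The function name `iteration_sets` and the input data are hypothetical.

```python
def iteration_sets(i, AS, AL, assigned):
    """Build source s, sink t and candidate set F for iteration i."""
    nodes = range(len(AS))

    # Union of V_j over all stages j < i that are already decided.
    prior = set()
    for j in range(1, i):
        prior |= assigned[j]

    # Nodes fixed to stage i by the ASAP/ALAP interval.
    P_i = {v for v in nodes if AS[v] == AL[v] == i}

    source = prior | P_i                        # step 3.1, s
    sink = {v for v in nodes if AS[v] > i}      # step 3.1, t
    F = {v for v in nodes if AS[v] <= i and v not in prior}  # step 3.2
    return source, sink, F

# Intervals for five nodes; node 4 may go to stage 2 or stage 3.
AS = [1, 2, 3, 4, 2]
AL = [1, 2, 3, 4, 3]
source, sink, F = iteration_sets(2, AS, AL, {1: {0}})
print(source, sink, F)
```

For iteration 2 in this example, node 4 is the only member of `F` outside the source, so it is the only node whose stage the bipartitioning actually decides.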
Experimental Results
Circuit   # Nodes   # Nets   PIO   Depth
c3540     1038      1016     72    38
Because of the precedence constraints, all the related research [1,2,3] used a variant of the list scheduling heuristic.
List scheduling labels each node with a priority, and the nodes
are greedily assigned to a stage one at a time according to
their priority. The assignment of one node influences the priority of its neighboring nodes. Our experiments show that
[Result tables, bodies not recoverable. Surviving labels: Circuit, List (max comm., ave. comm.), FBP-m (max comm., ave. comm.), runtime (sec.), FBP-m Imprv. (max comm., ave. comm.), s38584, Average. Surviving values: 5127, 316, 57.72, 93.8%, 56.1%.]
References
[1] Steve Trimberger, Scheduling Designs into a Time-Multiplexed FPGA, International Symposium on Field Programmable Gate Arrays, Feb. 1998.
[2] Douglas Chang and Malgorzata Marek-Sadowska, Partitioning Sequential Circuits on Dynamically Reconfigurable FP-
Conclusion