
Network Flow Based Circuit Partitioning for Time-multiplexed FPGAs

Huiqun Liu and D. F. Wong


Department of Computer Sciences, University of Texas at Austin, TX 78712
Email: {hqliu, wong}@cs.utexas.edu

of an array of augmented XC4000E-style CLBs [17, 1]. Each CLB includes a micro register to hold CLB results between configurations. Micro registers hold combinational logic intermediate values for use in later micro-cycles of the same user cycle, and also hold flip-flop values for use in the next user cycle. Every configuration memory cell of the original FPGA is backed by eight inactive memory cells. A micro-cycle starts by saving all CLB results from the previous micro-cycle in micro registers. After this state saving, a new configuration is loaded into active configuration memory.

Abstract

Time-multiplexed FPGAs have the potential to dramatically improve logic density by time-sharing logic, and have become an active research area in reconfigurable computing. The partitioning problem for time-multiplexed FPGAs is different from the traditional partitioning problem in that the nodes have precedence constraints among them, and the widely used iterative improvement partitioning methods such as K&L and FM [14, 15] are no longer applicable. All previous approaches [1, 2, 3] used list scheduling heuristics.
In this paper, we present a network flow based algorithm for multi-way precedence constrained partitioning, which can handle the precedence constraints while minimizing the net-cut size. The experimental results on the MCNC benchmark circuits show that our algorithm out-performs list scheduling by a big margin, with an average improvement of over 50% for bipartitioning and 20% for multi-way partitioning.

[Figure 1 labels: Logic & Interconnect Array; Configuration SRAM]

1 Introduction
One of the major benefits provided by FPGAs is the ability of run-time reconfiguration. Currently there is a growing interest in dynamically reconfigurable FPGAs (DRFPGAs). A large virtual logic design is partitioned into multiple stages to share the same smaller physical device in a time-multiplexed fashion.
Time-multiplexed FPGAs have the potential to dramatically improve logic density by time-sharing logic. Several different architectures have been proposed, such as the Xilinx model [1], Dharma [6], the Dynamically Programmable Gate Array [7, 8], and the Virtual Element Gate Array [9]. These DRFPGAs allow dynamic reuse of the logic blocks and wire segments by having more than one on-chip SRAM bit controlling them. Thus logic blocks and interconnect can be changed by reading a different SRAM bit, which only takes time on the order of nanoseconds. Currently, there are partially reconfigurable FPGAs available commercially, such as the AT6000 from Atmel and the XC6200 from Xilinx.
Figure 1 shows the Xilinx time-multiplexed FPGA configuration model [1]. The FPGA emulates a large device by sequencing through multiple configurations called micro-cycles. One pass through all the micro-cycles is called a user cycle. In each micro-cycle, the CLBs (Configurable Logic Blocks) are re-used to evaluate logic. The target architecture consists

Figure 1: Xilinx time-multiplexed FPGA configuration model.


Because the logic and interconnect needed for a circuit are time-multiplexed on a DRFPGA, the traditional FPGA design flow needs to be modified. The traditional design flow involves logic synthesis, technology mapping, placement and routing. Now it is essential to have a good partitioning strategy to ensure the correctness of the execution, to minimize the number of interconnections among the partitions, and to satisfy both the area and pin constraints of a physical FPGA.
The multi-way partitioning for time-multiplexed FPGAs is different from the traditional circuit partitioning problem. One important reason is that in a time-multiplexed FPGA, the order of the execution of the nodes must follow the precedence constraints. For example, in a combinational circuit, a node must be in a stage no later than all its output nodes. Therefore, a cut should be a uni-directional cut. The traditional partitioning approaches, such as K&L [15] and FM [14] based methods, do not handle this constraint and are no longer applicable for this partitioning problem. The partitioning problem for time-multiplexed FPGAs can be formulated as a Directed Acyclic Graph (DAG) scheduling problem. All previous research [1, 2, 3] used traditional DAG scheduling methods, such as list scheduling. However, the list scheduling heuristics do not take into account the cost of buffering a signal between non-adjacent time frames.
In this paper, we propose a network flow based approach for multi-way precedence constrained partitioning with application in time-multiplexed FPGAs. It can handle the precedence constraints and minimize the number of interconnections at the same time. We show how to correctly model the

This work was partially supported by the Texas Advanced Research Program under Grant No. 003658288 and by a grant from the Intel Corporation.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
ICCAD98, San Jose, CA, USA
© 1998 ACM 1-58113-008-2/98/0011.$5.00


the nodes that use the value of the flip-flop use the same value: the value of the flip-flop from the previous user cycle.

nets in both combinational and sequential circuits, so that by the max-flow computation, the min-cut corresponds to a minimum uni-directional net cut which satisfies the precedence constraints among the nodes. A bipartitioning algorithm based on the repeated max-flow min-cut computation is presented and then it is iteratively applied to partitioning a netlist into multiple stages.
The organization of the paper is as follows. In section 2, we give the formulation for the precedence constrained partitioning problem. In section 3 we first present the proper net modeling method for both combinational and sequential circuits, and then present the network flow based approach for $\alpha$-bounded uni-directional bipartitioning. Section 4 explains the multi-way precedence constrained partitioning algorithm for time-multiplexed FPGAs. Experimental results are discussed in section 5.

The above constraints define a partial temporal ordering on the nodes in the circuit. Let $P(v)$ be the precedence of a node $v$. For two nodes $v$ and $u$, we define $P(v) \le P(u)$ if $v$ must be scheduled no later than node $u$. We can rephrase the above constraints as follows: for a net $n = \{v_1, \ldots, v_p\}$, let $v_1$ be the input to $v_j$ ($2 \le j \le p$).

1. If $v_1$ is a c-node, then $P(v_1) \le P(v_j)$ for $2 \le j \le p$;
2. If $v_1$ is a FF-node, then $P(v_j) \le P(v_1)$ for $2 \le j \le p$.

A circuit can be represented by a hypergraph G = (V,N),


where V is a set of nodes, N is a set of nets where each net
is a subset of nodes which are interconnected. The nodes in

1. $V = \bigcup_{i=1}^{k} V_i$;
2. for any nodes $v$ and $u$, if $P(v) \le P(u)$, then $s(v) \le s(u)$,

with the objective of

1. Minimizing the interconnection between the partitions;
2. Minimizing $\max\{w(V_i) \mid 1 \le i \le k\}$.
The subsets $V_i$ ($1 \le i \le k$) share the same physical FPGA device in a time-multiplexed way, and $s(v)$ is the stage that $v$ is assigned to. It is important to minimize the amount of interconnection among the different partitions, which greatly influences the placement and routing process. For Xilinx's architecture [1], the values of the cut nets to be passed to a later stage are stored in micro registers.
The nodes in $V_i$ are called virtual nodes, while the nodes (e.g. LUTs) in a physical FPGA device are called real nodes. The number of virtual nodes in each stage should be no larger than the number of real nodes $r$, i.e. $\max\{w(V_i) \mid 1 \le i \le k\} \le r$, so that the virtual nodes fit into the physical FPGA device. Minimizing $\max\{w(V_i) \mid 1 \le i \le k\}$ allows the design to fit into a smaller physical FPGA. To do so, we need to let the weight in each stage be as close to $w(V)/k$ as possible, which is the average of the total weight.
A uni-directional cut is a cut $(X, \bar{X})$ such that for two nodes $v$ and $u$, if $P(v) \le P(u)$, then either $v$ and $u$ are in the same subset, or $v \in X$ and $u \in \bar{X}$. By definition, a uni-directional cut satisfies the precedence constraints.
The $\alpha$-bounded uni-directional min-cut bipartitioning with respect to two nodes $s$ and $t$ is the problem of partitioning a circuit of total weight $W$ into two disjoint subsets $X$ and $\bar{X}$, where $s \in X$ and $t \in \bar{X}$, with the minimum number of uni-directional cut nets such that $w(X)$ is as close to $\alpha$ as possible. We allow $w(X)$ to deviate from $(1 - \epsilon)\alpha$ to $(1 + \epsilon)\alpha$, e.g. $\epsilon = 5\%$. For $k$-way precedence constrained partitioning, $\alpha = w(V)/k$ for each stage.
The $k$-way precedence constrained partitioning problem can be reduced to finding $k - 1$ $\alpha$-bounded uni-directional cuts which partition the circuit into $k$ stages, with each cut $(\bigcup_{j \le i} V_j, \bigcup_{j > i} V_j)$ for $1 \le i < k$ being uni-directional.
Figure 3(a) shows an $\alpha$-bounded uni-directional min-cut with $\alpha = 5$. Net $n_1 = \{a, b, f\}$ and node $a$ is the input to nodes $b$ and $f$. According to the constraint, $a$ must be in a stage no later than $b$ and $f$. Figure 3(b) shows an example in which a min-cut is not a uni-directional cut. Net $n_1$ is cut; since $P(a) \le P(b)$, $a$ should be in $X$, so $a \in \bar{X}$ and $b \in X$ violates the definition of a uni-directional cut.
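The definitions above can be checked mechanically on small examples. The following is a minimal sketch, not part of the original algorithm: the function names `is_unidirectional`, `count_cut_nets` and `within_bound` are hypothetical, precedence is assumed to be given as explicit pairs $(v, u)$ with $P(v) \le P(u)$, and a net counts as cut when its nodes lie on both sides.

```python
def is_unidirectional(X, precedence_pairs):
    """Return True if the cut (X, complement of X) is uni-directional.

    precedence_pairs holds pairs (v, u) meaning P(v) <= P(u).
    """
    X = set(X)
    # The only forbidden placement is v on the later side with u on the earlier side.
    return not any(v not in X and u in X for v, u in precedence_pairs)


def count_cut_nets(X, nets):
    """Count nets that have nodes on both sides of the cut."""
    X = set(X)
    return sum(
        1 for net in nets
        if any(v in X for v in net) and any(v not in X for v in net)
    )


def within_bound(weight_X, alpha, eps=0.05):
    """Check (1 - eps) * alpha <= w(X) <= (1 + eps) * alpha."""
    return (1 - eps) * alpha <= weight_X <= (1 + eps) * alpha


# Net n1 = {a, b, f} with a driving b and f, as in the Figure 3 discussion.
pairs = [("a", "b"), ("a", "f")]
nets = [["a", "b", "f"]]
```

With these inputs, placing only `a` in $X$ gives a uni-directional cut, while placing only `b` in $X$ reproduces the violation described for Figure 3(b).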

Figure 2: Temporal partitioning of a circuit into 4 stages.

V are classified into two types, combinational nodes (c-nodes) and flip-flop nodes (FF-nodes). Each node $v$ in $V$ has a weight $w(v)$, and the weight of a subset $X$ of $V$, denoted by $w(X)$, is the total weight of all the nodes in $X$.
For a net $n = \{v_1, \ldots, v_p\}$ with $p$ nodes, let $v_1$ be the fanout node whose output signal is the input signal to $v_j \in n$ ($2 \le j \le p$); $v_1$ is the input of $v_j$ ($2 \le j \le p$), and $v_j$ ($2 \le j \le p$) is the output of $v_1$. We further define $N = N_c \cup N_f$. A net $n \in N_c$ if $v_1$ is a c-node, and $n \in N_f$ if $v_1$ is a FF-node. If a net only connects two nodes (i.e. $p = 2$), then it is a two-terminal net; if it connects more than two nodes (i.e. $p > 2$), then it is a multi-terminal net.

The $k$-way precedence constrained partitioning problem is to partition a circuit $G = (V, N)$ into $k$ non-overlapping subsets $V_1, \ldots, V_k$, and for each node $v$, let $s(v) = j$ if $v \in V_j$, subject to

2 Problem Statement

For a time-multiplexed FPGA, a circuit is partitioned into $k$ stages (or partitions), such that the logic in different stages temporally shares the same physical FPGA device (Figure 2). In the following discussions in the paper, we use stage and partition interchangeably when there is no confusion. The $k$ stages form one user cycle, and one user cycle should produce the same results on the outputs as would be seen by a non-time-multiplexed device.
In order for a partitioned time-multiplexed circuit to produce the correct result in one user cycle, the nodes must be evaluated in a proper order. According to Xilinx's architecture [1], the following three precedence constraints must be satisfied:

1. Each c-node must be scheduled in a stage no later than all its output nodes.
2. Each FF-node must be scheduled in a stage no earlier than all its input c-nodes. This rule guarantees that flip-flop input values are calculated before they are stored.
3. Each FF-node must be scheduled in a stage no earlier than all its output nodes. This rule guarantees that all


Figure 3: (a) An example of an $\alpha$-bounded uni-directional min-cut with $\alpha = 5$. (b) An example showing that a min-cut may not be a uni-directional cut.

The problem of finding an $\alpha$-bounded uni-directional min-cut is NP-hard even if all nodes have $w(v) = 1$ [13]. In this paper, we present a network flow based approach which uses an iterative max-flow min-cut technique for $\alpha$-bounded uni-directional bipartitioning, and then apply the method to multi-way precedence constrained partitioning.

Figure 4: Net modeling for a two-terminal net and a multi-terminal net in $N_c$.

3 Network Flow Based Uni-directional Bipartitioning

Network flow is an excellent approach to finding min-cuts because of the celebrated max-flow min-cut theorem [5]. Yang and Wong [16] successfully applied the network flow approach to balanced bipartitioning and employed an incremental flow technique for efficient implementation. They gave a net modeling so that the min-cut in the constructed network corresponds to the min-net-cut in the original circuit. But with this net modeling, nodes in the same net are symmetric and do not have precedence constraints among them, so the min-cut found is not necessarily a uni-directional cut. For our precedence constrained partitioning problem, it is necessary to find a proper net modeling so that the precedence constraints among the nodes are maintained.
In another related work, Cong et al. [4] first used the iterative max-flow min-cut method to find uni-directional min-cuts on combinational circuits in a logic synthesis algorithm. But they only modeled two-terminal nets in a combinational circuit.
In the following sections, we present net modeling for both two-terminal and multi-terminal nets in combinational and sequential circuits, so that by the max-flow computation, the min-cut preserves the precedence constraints. We further prove the correctness of the net modeling.

3.1 Net Modeling for Combinational Circuit

A proper net modeling must meet two requirements: (1) it correctly models a net cut, so that a net is counted exactly once if it is cut; (2) it correctly models the precedence constraints among the nodes.
For a net $n = \{v_1, \ldots, v_p\}$ in $N_c$, $v_1$ is a c-node and $P(v_1) \le P(v_j)$ for $2 \le j \le p$ (i.e. $v_1$ must be in a stage no later than its output nodes). We construct network $G' = (V', N')$ from $G = (V, N)$ by the following net modeling (Figure 4):

1. All the nodes in $V$ are in $V'$, i.e. $V \subseteq V'$.
2. For a two-terminal net $\{v_1, v_2\}$, add a bridging edge $v_1 \to v_2$ in $G'$ with capacity 1, and add an edge $v_2 \to v_1$ in $G'$ with capacity $\infty$.
3. For a multi-terminal net $n = \{v_1, \ldots, v_p\}$ where $p > 2$, add a node $x$ in $G'$ with $w(x) = 0$. Add a bridging edge from $v_1$ to $x$ with capacity 1, and add an edge from $x$ to each node $v_j$ ($2 \le j \le p$) with capacity $\infty$. Add an edge from each node $v_j$ ($2 \le j \le p$) to $v_1$ with capacity $\infty$.

Here we distinguish the net modeling of a two-terminal net and a multi-terminal net. For two-terminal nets, we add fewer edges and nodes, which speeds up the max-flow computation. For each net, exactly one bridging edge with capacity 1 is added, and all the other edges have $\infty$ capacity. Notice that nodes in the same net are not symmetric: the bridging edge starts from $v_1$. There is an edge with $\infty$ capacity from $v_j$ ($2 \le j \le p$) to $v_1$. After the max-flow computation, for a min-cut $(X', \bar{X}')$ in $G'$, all the forward edges from $X'$ to $\bar{X}'$ must be saturated (i.e. flow equals the capacity) and all the backward edges from $\bar{X}'$ to $X'$ have zero flow. If a net is cut, then only the bridging edge will be the forward cut edge from $X'$ to $\bar{X}'$, therefore $v_1$ must be in $X'$, which preserves the precedence constraints.
If nets have different weights, then the capacity of the bridging edge can be set according to the net weight. For a min-cut $(X', \bar{X}')$ in $G'$, we get the corresponding cut $(X, \bar{X})$ in $G$ as follows. If $v \in X'$ in $G'$, then $v \in X$ in $G$. If $v \in \bar{X}'$ in $G'$, then $v \in \bar{X}$ in $G$. Figure 5 shows how to get the corresponding net cut in $G$ from a cut in $G'$.

Figure 5: A cut in $G'$ and the corresponding net-cut in $G$.

3.2 Net Modeling for Sequential Circuit

For a net $n = \{v_1, \ldots, v_p\}$ in $N_f$, $v_1$ is a FF-node and $P(v_j) \le P(v_1)$ for $2 \le j \le p$ (i.e. $v_1$ must be in a stage no earlier than all its output nodes). The following is the modeling of a net in $N_f$ (Figure 6).

1. For a two-terminal net $\{v_1, v_2\}$ where $v_1$ is a FF-node, add a bridging edge from $v_2$ to $v_1$ with capacity 1, and add an edge from $v_1$ to $v_2$ with capacity $\infty$.
2. For a multi-terminal net $n = \{v_1, \ldots, v_p\}$, add a node $x$ with $w(x) = 0$. Add a bridging edge from $x$ to $v_1$ with capacity 1. Add an edge from each $v_j$ ($2 \le j \le p$) to $x$ with capacity $\infty$. Add an edge from $v_1$ to each node $v_j$ ($2 \le j \le p$) with capacity $\infty$.
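The two net-modeling rule sets for $N_c$ and $N_f$ nets can be written down as a small graph constructor. This is an illustrative sketch only: the function name `build_flow_network`, the representation of a net as a list whose first entry is $v_1$, the `("net", i)` naming of the auxiliary node $x$, and the edge-dictionary output are all assumptions made for the example.

```python
INF = float("inf")


def build_flow_network(nets, ff_nodes=frozenset()):
    """Build the capacities of G' as a dict {(tail, head): capacity}.

    nets: list of lists, where net[0] is the driving node v1.
    ff_nodes: set of nodes that are flip-flops (their nets are in N_f).
    """
    cap = {}

    def add(u, v, c):
        # Parallel edges between the same pair are merged by summing capacities.
        cap[(u, v)] = cap.get((u, v), 0) + c

    for i, net in enumerate(nets):
        v1, outputs = net[0], net[1:]
        is_ff = v1 in ff_nodes
        if len(outputs) == 1:
            v2 = outputs[0]
            if is_ff:
                add(v2, v1, 1)      # bridging edge v2 -> v1
                add(v1, v2, INF)    # precedence edge v1 -> v2
            else:
                add(v1, v2, 1)      # bridging edge v1 -> v2
                add(v2, v1, INF)    # precedence edge v2 -> v1
        else:
            x = ("net", i)          # auxiliary node with weight 0
            if is_ff:
                add(x, v1, 1)       # bridging edge x -> v1
                for vj in outputs:
                    add(vj, x, INF)
                    add(v1, vj, INF)
            else:
                add(v1, x, 1)       # bridging edge v1 -> x
                for vj in outputs:
                    add(x, vj, INF)
                    add(vj, v1, INF)
    return cap
```

Each net contributes exactly one capacity-1 edge, so a max-flow routine run on the returned capacities counts every cut net once, which is the first requirement stated for a proper net modeling.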

For each net $n \in N_f$, only the bridging edge has capacity 1; all the other edges have capacity $\infty$. The FF-node $v_1$ has an edge with $\infty$ capacity to each of its output nodes $v_j$ ($2 \le j \le p$), so if $n$ is cut, $v_1$ must be in $\bar{X}'$. This guarantees the precedence constraint that $v_1$ must be in a stage no earlier than its output nodes.
Figure 7 shows an example of the net modeling and the corresponding net cut in $G$ from a min-cut in $G'$. There are three nets: $n_1 \in N_f$, and $n_2$, $n_3$ are $N_c$ nets. For min-cut $(X, \bar{X})$, $n_1$ and $n_2$ are cut. In $G'$, all the cut edges from $X'$ to $\bar{X}'$ have capacity 1. For net $n_1$, the FF-node $b$ and c-node $f$ are in $\bar{X}$; $d$ and $e$ are in $X$.

Figure 6: Net modeling for a two-terminal net and a multi-terminal net in $N_f$.

Figure 7: A cut in $G'$ and the corresponding net-cut in $G$.

We have the following lemmas about the correctness of the net modeling for nets both in $N_c$ and $N_f$.
Lemma 1: Any min-cut in $G'$ corresponds to a net-cut in $G$.
Proof: After the max-flow min-cut computation in $G'$, every cut edge from $X'$ to $\bar{X}'$ has capacity 1 and is saturated. Only the bridging edge for a net can be the forward cut edge from $X'$ to $\bar{X}'$. Since a net has exactly one bridging edge, if it is cut, it contributes exactly 1 to the min-cut in $G'$. On the other hand, only a cut net will be counted in the min-cut. Therefore, the min-cut size in $G'$ equals the number of cut-nets in $G$. □
Lemma 2: Any min-cut in $G'$ corresponds to a minimum uni-directional cut in $G$.
Proof: For a min-cut $(X', \bar{X}')$ in $G'$, all the forward edges from $X'$ to $\bar{X}'$ must be saturated after the max-flow computation and all the backward edges from $\bar{X}'$ to $X'$ have zero flow. In both the two-terminal and multi-terminal net modeling for nets in $N_c$ and $N_f$, for any two nodes $v$, $u$, if $P(v) \le P(u)$, then there is an edge from node $u$ to $v$ with capacity $\infty$. So it will never happen that $u$ is in $X'$ and $v$ is in $\bar{X}'$. Thus for any min-cut $(X', \bar{X}')$, either $v$ and $u$ are in the same partition, or $v \in X'$ and $u \in \bar{X}'$. Therefore a min-cut in $G'$ corresponds to a uni-directional cut in $G$.
We now prove that the uni-directional cut in $G$ is minimum. Suppose there is another uni-directional cut $(Y, \bar{Y})$ with a smaller cut size, and let $(Y', \bar{Y}')$ be the corresponding cut in $G'$. Then $(Y', \bar{Y}')$ would be a smaller cut than $(X', \bar{X}')$, which contradicts the fact that $(X', \bar{X}')$ is a min-cut in $G'$. Hence, the corresponding cut is a minimum uni-directional cut in $G$. □
Lemma 1 and Lemma 2 lead to the following theorem.
Theorem 1: The min-cut size in $G'$ equals the minimum number of uni-directional cut-nets in $G$.

3.3 $\alpha$-bounded Uni-directional Bipartitioning

In this section, we present a network flow based uni-directional bipartitioning algorithm FBP-u. Since the nodes on one side of the min-cut found by the max-flow min-cut computation can have arbitrary total weight, we apply a repeated max-flow min-cut heuristic to find an $\alpha$-bounded uni-directional bipartition which minimizes the number of crossing nets.

Algorithm FBP-u:
Flow-based $\alpha$-bounded uni-directional bipartitioning
begin
1. Construct $G'$ from $G$ by net modeling;
2. Pick a pair of nodes $s$ and $t$ in $G'$ as source and sink;
3. Max-flow computation, find a min-cut $C$ in $G'$;
   Let $X$ be the sub-circuit reachable from $s$ through augmenting paths, and $\bar{X}$ be the rest;
4. if $(1 - \epsilon)\alpha \le w(X) \le (1 + \epsilon)\alpha$ then
      stop and return $C$ as the answer;
5. if $w(X) < (1 - \epsilon)\alpha$ then
   5.1 collapse all nodes in $X$ to $s$;
   5.2 pick a node $v \in \bar{X}$ and collapse $v$ to $s$;
   5.3 goto step 3;
6. if $w(X) > (1 + \epsilon)\alpha$ then
   6.1 collapse all nodes in $\bar{X}$ to $t$;
   6.2 pick a node $v \in X$ and collapse $v$ to $t$;
   6.3 goto step 3;
end

In step 1 of algorithm FBP-u, the network $G'$ is constructed from $G$ by the net modeling discussed in sections 3.1 and 3.2. Step 2 selects the source $s$ and sink $t$. Unlike FBB [16], the source and sink cannot be selected randomly. The source $s$ should be a node such that there is no $v$ with $P(v) \le P(s)$, and the sink $t$ should be a node such that there is no $v$ with $P(t) \le P(v)$.
In step 3, a min-cut is found in $G'$ by the max-flow computation. In step 4, if the total weight of $X$ is within range, then $X$ is returned as the result. In step 5, if $w(X)$ is less than $(1 - \epsilon)\alpha$, then the nodes in $X$ are collapsed to $s$ and a node $v$ from $\bar{X}$ is collapsed to $s$, so that in the next iteration more flow can be pushed through the network to explore a different cut with a larger weight in $X$. The node $v$ collapsed to $s$ is chosen such that for any $u$ with $P(u) \le P(v)$, $u$ is already in $X$. In step 6, if $w(X)$ is greater than $(1 + \epsilon)\alpha$, then all nodes in $\bar{X}$ are collapsed to $t$, and a node $v$ from $X$ is collapsed to $t$ in step 6.2. The node $v$ collapsed to $t$ is chosen such that for any $u$ with $P(v) \le P(u)$, $u$ is already in $\bar{X}$.
Similar to FBB [16], an incremental flow technique is employed for efficient implementation. It is not necessary to calculate the max-flow from scratch in each iteration. Only additional flow is added through the network from the source to the sink to saturate the bridging edges during the max-flow computation. Similar to the proof in [16], the time complexity for the repeated max-flow min-cut is asymptotically the same as one max-flow computation. The time complexity for FBP-u is $O(|V||E|)$.
Figure 8 shows an example of finding an $\alpha$-bounded uni-directional bipartitioning with $\alpha = 6$. In the first iteration, min-cut is 1 after the max-flow computation, and $w(X) = 1$.
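The repeated max-flow loop of FBP-u can be illustrated with a compact sketch. Several simplifications are assumed here and are not taken from the paper: the max-flow is recomputed from scratch with a plain Edmonds-Karp routine instead of the incremental flow technique, collapsing a node to $s$ (or $t$) is modeled by adding an $\infty$-capacity edge from $s$ to it (or from it to $t$), the node to collapse is picked deterministically with `min`, and the names `min_cut_source_side`, `fbp_u`, `preds` and `succs` are hypothetical. The example network is a chain of two-terminal $N_c$ nets.

```python
from collections import deque

INF = float("inf")


def min_cut_source_side(cap, s, t):
    """Edmonds-Karp max-flow. Returns (flow value, nodes reachable from s at the end)."""
    res = {}
    for (u, v), c in cap.items():
        res.setdefault(u, {})
        res.setdefault(v, {})
        res[u][v] = res[u].get(v, 0) + c
        res[v].setdefault(u, 0)
    res.setdefault(s, {})
    res.setdefault(t, {})
    flow = 0
    while True:
        parent = {s: None}
        queue = deque([s])
        while queue and t not in parent:
            u = queue.popleft()
            for v, c in res[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            return flow, set(parent)
        bottleneck, v = INF, t
        while parent[v] is not None:
            bottleneck = min(bottleneck, res[parent[v]][v])
            v = parent[v]
        if bottleneck == INF:
            raise ValueError("s and t are joined by infinite-capacity edges")
        v = t
        while parent[v] is not None:
            u = parent[v]
            res[u][v] -= bottleneck
            res[v][u] += bottleneck
            v = u
        flow += bottleneck


def fbp_u(cap, weight, preds, succs, s, t, alpha, eps=0.05, max_iter=100):
    """Return (cut size, X) for an alpha-bounded cut, or None if none is found."""
    cap = dict(cap)
    nodes = set(weight)          # real circuit nodes; auxiliary net nodes carry no weight
    in_s, in_t = {s}, {t}
    for _ in range(max_iter):
        cut, reach = min_cut_source_side(cap, s, t)
        X = reach & nodes
        w_x = sum(weight[v] for v in X)
        if (1 - eps) * alpha <= w_x <= (1 + eps) * alpha:
            return cut, X
        if w_x < (1 - eps) * alpha:
            # Step 5: grow X by a node whose predecessors are all in X already.
            cand = [v for v in nodes - X - in_t if preds.get(v, set()) <= X]
            if not cand:
                return None
            grow = X | {min(cand)}
            for v in grow - in_s:
                cap[(s, v)] = INF
            in_s |= grow
        else:
            # Step 6: shrink X by a node whose successors are all outside X already.
            x_bar = nodes - X
            cand = [v for v in X - in_s if succs.get(v, set()) <= x_bar]
            if not cand:
                return None
            shrink = x_bar | {min(cand)}
            for v in shrink - in_t:
                cap[(v, t)] = INF
            in_t |= shrink
    return None


# Chain a -> b -> c -> d -> e of two-terminal combinational nets, unit weights.
chain = ["a", "b", "c", "d", "e"]
cap = {}
for u, v in zip(chain, chain[1:]):
    cap[(u, v)] = 1      # bridging edge
    cap[(v, u)] = INF    # precedence edge
weight = {v: 1 for v in chain}
preds = {v: {u} for u, v in zip(chain, chain[1:])}
succs = {u: {v} for u, v in zip(chain, chain[1:])}
result = fbp_u(cap, weight, preds, succs, "a", "e", alpha=3)
```

On this chain the first two iterations find $w(X) = 1$ and $w(X) = 2$, each below the bound, so `b` and then `c` are collapsed to the source; the third iteration stops with a cut of size 1.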

Algorithm FBP-m:
Flow-based multi-way precedence constrained partitioning
begin
3.1. $s = (\bigcup_{j<i} V_j) \cup P_i$, and let $w(s) = w(P_i)$;
     $t = \{v \mid AS(v) > i\}$, and let $w(t) = w(P_{i+1})$;
3.2. $F = \{v \mid AS(v) \le i$, s.t. $v \notin V_j$, $1 \le j < i\}$;
3.3. construct $F'$ from $F \cup s \cup t$ by net modeling;
3.4. find an $\alpha$-bounded uni-directional min-net-cut $(X, \bar{X})$ by algorithm FBP-u;
3.5. assign nodes in $X$ to stage $i$, $V_i = P_i \cup (X - s)$;
3.6. for $v \in F$ with $AL(v) = i + 1$, assign $s(v) = i + 1$;
end
end
The partitioning process has three major steps.
Step 1 performs As Soon As Possible (ASAP) and As Late As Possible (ALAP) scheduling. In the ASAP scheduling, each node is assigned to the earliest possible stage. In the ALAP scheduling, each node is assigned to the latest possible stage. For each node $v$, let $AS(v)$ and $AL(v)$ be the stages assigned to $v$ in the ASAP and ALAP scheduling respectively.
We decide $AS(v)$ and $AL(v)$ as follows. In the ASAP scheduling, each node $v$ is first labeled with the earliest level by breadth-first search. Let $e(v) = \{u \mid P(u) \le P(v)\}$ be the subset of nodes which have a higher precedence than $v$, and let $l_u$ be the earliest level for $u$. If $e(v) = \emptyset$, then $l_v = 1$, else $l_v = \max\{l_u \mid u \in e(v)\} + 1$. The earliest stage for $v$ is $AS(v) = \lceil l_v / L \rceil$.
In the ALAP scheduling, each node is first labeled with the latest level by breadth-first search. Let $e(v) = \{u \mid P(v) \le P(u)\}$ be the subset of nodes which have lower precedence than $v$, and let $l_u$ be the latest level for $u$. If $e(v) = \emptyset$, then $l_v = depth$, else $l_v = \min\{l_u \mid u \in e(v)\} - 1$. Then the latest stage for $v$ is $AL(v) = \lceil l_v / L \rceil$.
Each node $v$ is assigned an interval $[AS(v), AL(v)]$ after the ASAP and ALAP scheduling. If $AS(v) = AL(v) = j$, then $v$ must be scheduled in stage $j$. In this case we call $v$ a fixed node. If $AS(v) < AL(v)$, then $v$ can be assigned to any stage from $AS(v)$ to $AL(v)$. We call $v$ a flexible node.
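The ASAP and ALAP labeling can be sketched for a purely combinational DAG as follows. This is a hedged illustration: the function name `asap_alap_stages` and the successor-list input are assumptions, the level-to-stage mapping $\lceil l_v / L \rceil$ with $L = \lceil depth / k \rceil$ is an assumed reading of the level rule, and flip-flop handling for sequential circuits is left out.

```python
from math import ceil


def asap_alap_stages(succs, k):
    """Return {node: (AS, AL)} for a DAG given as node -> list of successor nodes."""
    nodes = set(succs) | {u for us in succs.values() for u in us}
    preds = {v: [] for v in nodes}
    for v, us in succs.items():
        for u in us:
            preds[u].append(v)

    # Topological order (Kahn's algorithm); the list grows while it is scanned.
    indeg = {v: len(preds[v]) for v in nodes}
    order = [v for v in nodes if indeg[v] == 0]
    for v in order:
        for u in succs.get(v, []):
            indeg[u] -= 1
            if indeg[u] == 0:
                order.append(u)

    # Earliest level: 1 for nodes without predecessors, else max over predecessors + 1.
    early = {}
    for v in order:
        early[v] = 1 + max((early[p] for p in preds[v]), default=0)
    depth = max(early.values())

    # Latest level: depth for nodes without successors, else min over successors - 1.
    late = {}
    for v in reversed(order):
        late[v] = min((late[u] for u in succs.get(v, [])), default=depth + 1) - 1

    levels_per_stage = ceil(depth / k)
    return {
        v: (ceil(early[v] / levels_per_stage), ceil(late[v] / levels_per_stage))
        for v in nodes
    }


# Critical path a -> b -> c -> d plus a side path a -> e -> d, split into 2 stages.
stages = asap_alap_stages({"a": ["b", "e"], "b": ["c"], "c": ["d"], "e": ["d"]}, k=2)
```

In this example the critical-path nodes come out fixed, while `e` receives the interval [1, 2] and is therefore a flexible node.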
In step 2, let $P_i$ be the subset of nodes fixed to stage $i$ ($1 \le i \le k$) based on the ASAP and ALAP scheduling, i.e. $P_i = \{v \mid AS(v) = AL(v) = i\}$. Note that the nodes on a critical path are fixed, but many nodes on the non-critical paths are flexible and can be assigned to different stages. The assignment of a flexible node influences other nodes through the precedence constraints. In our partitioning process, the goal is to assign a stage to each flexible node while balancing the number of nodes in each stage and minimizing the number of interconnections between the stages.
Step 3 iteratively calls the network flow based bipartitioning algorithm FBP-u to partition the flexible nodes between stages $i$ and $i + 1$ ($1 \le i < k$). For the $i$th iteration, the details of the partitioning process to find $V_i$ are as follows.
In step 3.1, the source $s$ and sink $t$ of the network are decided. The source is a subset of nodes where $s = (\bigcup_{j<i} V_j) \cup P_i$, and $w(s) = w(P_i)$. The source $s$ contains all the nodes assigned to stages prior to $i$ and the fixed nodes in $P_i$. The sink $t = \{v \mid AS(v) > i\}$ and $w(t) = w(P_{i+1})$. $t$ contains nodes

[Figure 8 legend: marked node = node to be collapsed to the source or sink]
Figure 8: Example of $\alpha$-bounded uni-directional bipartitioning.

Then node $a \in \bar{X}$ is collapsed to $s$ (i.e. $w(s) = 2$ now) so that more flow can be pushed through the network in the next iteration. After the max-flow in the second iteration, the min-cut size is 2, $w(X) = 7$ and $w(\bar{X}) = 4$. Nodes in $\bar{X}$ are merged to $t$ and node $i$ from $X$ is collapsed to $t$. In the third iteration, min-cut = 3 and $X$ reaches the area limit with $w(X) = 5$. So $(X, \bar{X})$ forms an $\alpha$-bounded min-cut with cut size 3. We can then find the corresponding uni-directional net cut in the original netlist $G$.

4 Multi-way Precedence Constrained Partitioning

Now we present our network flow based multi-way precedence constrained partitioning algorithm FBP-m. Since each cut $(\bigcup_{j \le i} V_j, \bigcup_{j > i} V_j)$ between two adjacent stages $i$ and $i + 1$ (for $1 \le i < k$) must be uni-directional, we repeatedly apply the bipartitioning algorithm FBP-u $k - 1$ times to find a uni-directional cut.
Since the length of the critical path is usually longer than the number of stages, there will be more than one level of nodes in one stage. Let $depth$ be the number of nodes on the longest critical path in the netlist. For sequential circuits, $depth$ is the longest path of the combinational part between flip-flops. When partitioning into $k$ stages, the number of levels in one stage is $L = \lceil depth / k \rceil$ in order to make the design as fast as possible. If the time limit for each stage is known a priori, then the number of levels can be calculated accordingly. Since we also want to minimize the maximum number of nodes in any stage in order to allow the design to fit into a smaller physical FPGA, we want to make each stage have weight as close to the average as possible (i.e. $\alpha = w(V)/k$).

and $V_1 = \{a, b, c, d, e\}$. Since $AL(h) = 2$, $h$ is assigned to stage 2. Figure 9(c) shows the next iteration to find a cut between stage 2 and 3, and Figure 9(d) shows the final 3-way partitioning result.
Lemmas 3 and 4 show the correctness of algorithm FBP-m.
Lemma 3: For any node $v$, $AS(v) \le s(v) \le AL(v)$.
Proof: If $AS(v) = AL(v) = j$, then $v \in P_j$, $v$ is fixed to stage $j$ and $AS(v) = s(v) = AL(v)$. If $AS(v) < AL(v)$, $v$ is a flexible node. By the construction of $F$ in step 3.3, $v$ is in $F$ only in the $i$th iteration when $AS(v) \le i$, and $v$ can only be assigned to a stage either in step 3.5 or 3.6. If $v \in X$, then $v$ is assigned to stage $i$ in step 3.5, else $v$ is assigned to a stage equal to $AL(v)$ by step 3.6. In both cases, $AS(v) \le s(v)$. Step 3.6 guarantees that $v$ is assigned to a stage no later than $AL(v)$, so $s(v) \le AL(v)$. Therefore, $AS(v) \le s(v) \le AL(v)$. □
Lemma 4: The precedence constraints among the nodes are satisfied, i.e. for any two nodes $v$ and $u$, if $P(v) \le P(u)$, then $s(v) \le s(u)$.
Proof: Let $S_i$ be the set of nodes which are assigned to a stage later than $i$, such that $S_i = \bigcup_{j > i} V_j$. First, we prove that between the nodes in $V_i$ and the nodes in $S_i$, the precedence constraints are satisfied. When deciding stage $i$, all the nodes with $AS(v) > i$ are in the sink $t$, and by the net modeling and algorithm FBP-u, the min-cut found is uni-directional; therefore, nodes in $X$ preserve precedence constraints with nodes in $S_i$. So the precedence constraints are satisfied among nodes in $V_i$ and $S_i$. Since this is true for all stages $1 \le i \le k$, the nodes in stages $i$ and $j$ ($i < j$) meet the precedence constraints. Therefore, all the nodes satisfy the constraints in the multi-way partitioning. □

which can only be put in a stage later than $i$. In step 3.2, all the flexible nodes that can be put in stage $i$ or $i + 1$ form a subset $F = \{v \mid AS(v) \le i$, s.t. $v \notin V_j$, $1 \le j < i\}$. Notice that nodes in $F$ can either be put in stage $i$ or $i + 1$. We want to find a uni-directional min-cut $(X, \bar{X})$ in $F$ such that $X$ has the desired total weight.
In step 3.3, network $F'$ is constructed from $F \cup s \cup t$ by the net modeling. Algorithm FBP-u is applied on $F'$ to find an $\alpha$-bounded min-cut $(X, \bar{X})$. In step 3.5, the nodes in $X$ are assigned to stage $i$, such that $V_i = P_i \cup (X - s)$. Next in step 3.6, all the unassigned nodes with $AL(v) = i + 1$ are assigned to stage $i + 1$, i.e. $P_{i+1} = P_{i+1} \cup \{v \mid AL(v) = i + 1\}$. Then $i$ is increased by 1 and control goes back to step 3.1 to start the next iteration. In step 1, the ASAP and ALAP scheduling takes $O(|V|)$ time. Each iteration in step 3 takes $O(|V||E|)$ time, so the time complexity for algorithm FBP-m is $O(k|V||E|)$.

[Figure 9 panels: (a) after ASAP and ALAP scheduling, nodes labeled with $[AS(v), AL(v)]$; (b) iteration 1: $V_1 = \{a, b, c, d, e\}$]

5 Experimental Results

We implemented algorithms FBP-u and FBP-m in C++ on a 200 MHz Intel Pentium Pro with 32 MB memory, and experimented on the MCNC Partitioning93 benchmark circuits. Table 1 shows the characteristics of the MCNC benchmark circuits. In column 5, depth refers to the number of nodes on the longest critical path.

[Figure 9 panels: (c) iteration 2: $V_2 = \{f, i, j, k, h\}$; (d) final partition into Stage 1, Stage 2 and Stage 3]

Table 1: MCNC Partitioning93 benchmark circuits

Circuit   # Nodes   # Nets   # PIO   Depth
c3540     1038      1016     72      38

Figure 9: Example of multi-way partitioning into 3 stages.

Figure 9 shows a simple example of multi-way partitioning into 3 stages. The depth of the critical path is 9, so each stage has a maximum of 3 levels of nodes. Since there are a total of 15 nodes, each stage should have an average of 5 nodes assuming each node has the same weight. Figure 9(a) shows the $[AS(v), AL(v)]$ interval for each node after the ASAP and ALAP scheduling. Nodes on the critical path are fixed, such that $P_1 = \{a, b, c, e\}$, $P_2 = \{f, i, j\}$, and $P_3 = \{l, n, o\}$. Node $d$ is a flexible node that can be put in either stage 1 or 2, and node $g$ can be put in stages from 1 to 3. Figure 9(b) shows the partitioning to find a cut between stage 1 and 2. First $s$, $t$ and $F$ are constructed, where $F = \{d, g, h\}$, $w(s) = 4$ and $w(t) = 3$. Then $F'$ is constructed from $F$ and a min-cut with size 3 is found by max-flow computation,

Because of the precedence constraints, all the related research [1, 2, 3] used a variant of the list scheduling heuristic. List scheduling labels each node with a priority, and the nodes are greedily assigned to a stage one at a time according to their priority. The assignment of one node influences the priority of its neighboring nodes. Our experiments show that

Table 2: Comparing partitioning results of uni-directional bipartitioning
[Table body not recoverable from the extracted text. Surviving column headers: Circuit; List (max comm., ave. comm.); FBP-m (max comm., ave. comm., runtime (sec.)); FBP-m Imprv. (max comm., ave. comm.). Surviving values: s38584, 5127, 316, 57.72; Average, 93.8%, 56.1%.]

gorithm FBP-m iteratively applies algorithm FBP-u to find


a multi-way precedence constrained partitioning. All previous
work on precedence constrained partitioning has used
list scheduling; our experiments show that the network flow
based algorithm outperforms list scheduling by a big margin.

network flow based partitioning method can be successfully
used to solve the scheduling problem and has the advantage
of yielding much smaller net cuts.
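
For orientation, the list scheduling baseline used in this comparison can be sketched as below. This is only a generic illustration, not the heuristic of [1, 2, 3]: the function name `list_schedule`, the per-stage node `capacity`, and the user-supplied `priority(v, stage)` callback are assumptions. The callback is re-evaluated at every step, so it may depend on the nodes already assigned.

```python
def list_schedule(succ, priority, num_stages, capacity):
    """Greedy list scheduling of a DAG into stages 1..num_stages.

    succ maps every node to the list of its successors.
    priority(v, stage) scores a ready node given the partial assignment.
    capacity is a hypothetical limit on the number of nodes per stage.
    """
    pending = {v: 0 for v in succ}          # unassigned predecessors
    for u in succ:
        for v in succ[u]:
            pending[v] += 1
    earliest = {v: 1 for v in succ}         # lowest stage allowed by precedence
    load = [0] * (num_stages + 1)           # load[s] = nodes in stage s
    stage = {}
    ready = [v for v in succ if pending[v] == 0]
    while ready:
        # Pick the ready node with the highest current priority.
        v = max(ready, key=lambda n: priority(n, stage))
        ready.remove(v)
        s = earliest[v]
        while s <= num_stages and load[s] >= capacity:
            s += 1                          # stage full, try the next one
        if s > num_stages:
            raise ValueError("no stage has room for node %r" % (v,))
        stage[v] = s
        load[s] += 1
        for w in succ[v]:
            earliest[w] = max(earliest[w], s)   # successors cannot go earlier
            pending[w] -= 1
            if pending[w] == 0:
                ready.append(w)
    return stage


# Hypothetical 4-node example: chain a -> b -> c plus an isolated node d.
succ = {"a": ["b"], "b": ["c"], "c": [], "d": []}
result = list_schedule(succ, lambda v, st: len(succ[v]), num_stages=2, capacity=2)
```

Because each node is placed greedily without looking at the resulting cut, a scheduler of this kind has no direct control over the number of nets crossing stage boundaries.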
Table 2 compares the number of uni-directional cut nets
of bipartitioning by FBP-m (when k = 2) with that of list
scheduling. The number of levels in each of the two stages is
set to be ⌈…⌉ and α is set to be … with ±5% variation.
On all the benchmark circuits in our experiments, FBP-m
results in fewer cut nets, with an average improvement of 56%
over list scheduling. We observed that our
algorithm consistently achieved larger improvements on the
larger circuits. Table 2 also shows the CPU time for FBP-m.
Since incremental flow computation is employed, FBP-m
is very efficient. For netlists with more than 20000 nodes,
FBP-m takes only one minute of CPU time.
In the next set of experiments, we performed multi-way
partitioning into 3 to 12 stages by FBP-m. Tables 3 and 4
show the experimental results of partitioning into 4 and 8
stages, and compare the communication cost with that of list
scheduling. Columns 2 and 4 show the maximum number of
communications between any two stages, which is the number
of micro registers required to store values of the cut nets to
be used by a later stage or the next user cycle. If the nodes
in a net n span from stage s1 to s2, then one micro
register is used in stages from s1 to s2 - 1 to store the value.
For a net whose value comes from an FF-node, the value will need
to be stored in a micro register to be passed to the next user cycle.
Minimizing the number of communications among the stages
will facilitate the task of placement and routing. Columns
3 and 5 show the average number of micro registers for a
stage. Column 6 shows the run-time of FBP-m. Columns 7
and 8 show the percentage of improvement of FBP-m over
list scheduling. All of our experiments show that FBP-m
outperforms list scheduling by a big margin, with an average
improvement of over 20%.
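
The micro register accounting described above can be illustrated with a short sketch. It is not taken from our implementation: the function name `micro_register_usage`, the (s1, s2) span representation of a net, and averaging over the first k - 1 stages are assumptions made for illustration, and FF-node values carried to the next user cycle are not modeled.

```python
def micro_register_usage(net_spans, num_stages):
    """Count micro registers per stage for nets spanning several stages.

    net_spans is a list of (s1, s2) pairs with 1 <= s1 <= s2 <= num_stages,
    the first and last stage touched by each net.
    Returns (usage, worst, mean), where usage[i] is the count for stage i + 1.
    """
    usage = [0] * (num_stages - 1)      # stages 1 .. k-1 can hold forward values
    for s1, s2 in net_spans:
        # One register in every stage from s1 to s2 - 1.
        for stage in range(s1, s2):
            usage[stage - 1] += 1
    worst = max(usage, default=0)
    mean = sum(usage) / len(usage) if usage else 0.0
    return usage, worst, mean


# Hypothetical example: four nets partitioned into three stages.
usage, worst, mean = micro_register_usage([(1, 2), (1, 3), (2, 3), (2, 2)], 3)
```

A net confined to a single stage, such as (2, 2), contributes no register, while a net spanning stages 1 to 3 occupies a register in both stage 1 and stage 2.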
Here we used the flat netlists from the MCNC benchmark in
the experiments, instead of the CLB-level netlists which are
derived after the technology mapping step. Our algorithm
can easily run on those CLB-level netlists.
The experiments show that, with proper net modeling, the
network flow based approach is suitable for solving
the precedence constrained partitioning problem. First, the
net modeling captures the precedence of the nodes, so that a
min-cut satisfies the precedence constraints. Secondly, by
using ASAP and ALAP scheduling first, some nodes are fixed
in a certain stage. Therefore, instead of being selected at random,
the source and sink are fixed in advance before partitioning between
stages i and i + 1. This facilitates the max-flow min-cut
computation. Next, the repeated max-flow min-cut computation
balances the number of nodes in each stage.
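
How ASAP and ALAP scheduling fix some nodes in a stage can be sketched as follows. This is a minimal illustration under stated assumptions, not our implementation: the function name `stage_intervals` is hypothetical, every node is assumed to occupy one level, and a level is mapped to a stage by ceil(level / levels_per_stage).

```python
from math import ceil


def stage_intervals(succ, levels_per_stage):
    """Return {node: (earliest_stage, latest_stage)} for a DAG.

    succ maps every node to the list of its successors.
    A node whose two stages coincide is fixed; otherwise it is flexible.
    """
    pred = {v: [] for v in succ}
    for u in succ:
        for v in succ[u]:
            pred[v].append(u)
    # Topological order (Kahn's algorithm); the list grows while iterated.
    indeg = {v: len(pred[v]) for v in succ}
    order = [v for v in succ if indeg[v] == 0]
    for u in order:
        for v in succ[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                order.append(v)
    # ASAP level: one more than the deepest predecessor.
    asap = {}
    for v in order:
        asap[v] = 1 + max((asap[u] for u in pred[v]), default=0)
    depth = max(asap.values())
    # ALAP level: one less than the shallowest successor.
    alap = {}
    for v in reversed(order):
        alap[v] = min((alap[w] for w in succ[v]), default=depth + 1) - 1
    return {v: (ceil(asap[v] / levels_per_stage),
                ceil(alap[v] / levels_per_stage)) for v in succ}


# Hypothetical example: chain a -> b -> c -> d plus a side input x -> d.
dag = {"a": ["b"], "b": ["c"], "c": ["d"], "d": [], "x": ["d"]}
intervals = stage_intervals(dag, levels_per_stage=2)
```

In this example the chain nodes are fixed (a and b in stage 1, c and d in stage 2), while x may go to stage 1 or stage 2, so only x remains to be decided by the min-cut computation.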
Our FBP-m algorithm can also be applied to partitioning
for time-multiplexed I/O and buffer minimization [3], and to
pipeline design. Algorithm FBP-u has many other applications
in logic synthesis and placement algorithms [10, 11, 12].

References

[1] Steve Trimberger, Scheduling Designs into a Time-Multiplexed FPGA, International Symposium on Field Programmable Gate Arrays, Feb. 1998.
[2] Douglas Chang and Malgorzata Marek-Sadowska, Partitioning Sequential Circuits on Dynamically Reconfigurable FPGAs, International Symposium on Field Programmable Gate Arrays, Feb. 1998.
[3] Douglas Chang and Malgorzata Marek-Sadowska, Buffer Minimization and Time-multiplexed I/O on Dynamically Reconfigurable FPGAs, International Symposium on Field Programmable Gate Arrays, Feb. 1997.
[4] Sasan Iman, Massoud Pedram, Charles Fabian and Jason Cong, Finding Uni-Directional Cuts Based on Physical Partitioning and Logic Restructuring, 4th International Workshop on Physical Design, 1993.
[5] L.R. Ford and D.R. Fulkerson, Flows in Networks, Princeton University Press, 1962.
[6] N.B. Bhat, K. Chaudhary and E.S. Kuh, Performance-oriented fully routable dynamic architecture for a field programmable logic device, Memorandum No. UCB/ERL M93/42, University of California, Berkeley, 1993.
[7] Jeremy Brown, Derrick Chen, et al., DELTA: Prototype for a first-generation dynamically programmable gate array, Transit Note 112, MIT, 1995.
[8] Andre DeHon, DPGA-coupled microprocessors: Commodity ICs for the early 21st century, In IEEE Workshop on FPGAs for Custom Computing Machines, 1994.
[9] D. Jones and D.M. Lewis, A time-multiplexed FPGA architecture for logic emulation, In IEEE Custom Integrated Circuits Conference, 1995.
[10] R. S. Tsay, E. S. Kuh and C. P. Hsu, Proud: A sea-of-gates placement algorithm, In Proceedings of the IEEE International Conference on Computer Aided Design, pp 318-323, Nov. 1988.
[11] J. M. Kleinhans, G. Sigl, F. M. Johannes and K. J. Antreich, GORDIAN: VLSI placement by quadratic programming and slicing optimization, IEEE Transactions on Computer Aided Design, March 1991.
[12] M. Pedram and N. Bhat, Layout driven technology mapping, In Proceedings of the 28th Design Automation Conference, pp 99-105, June 1991.
[13] M. R. Garey and D. S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness, W. H. Freeman and Company, 1979.
[14] C. M. Fiduccia and R. M. Mattheyses, A linear-time heuristic for improving network partitions, In Proceedings of the 19th Design Automation Conference, pp 175-181, June 1982.
[15] B. W. Kernighan and S. Lin, An efficient heuristic procedure for partitioning graphs, IEEE Transactions on Computers, pp 1064-1068, Nov. 1978.
[16] Honghua Yang and D. F. Wong, Efficient Network Flow Based Min-Cut Balanced Partitioning, Proc. International Conference on Computer Aided Design, 1994.
[17] Xilinx, The Programmable Logic Data Book, 1996.

Conclusion

Time-multiplexed FPGAs have the potential to dramatically
improve logic density by time-sharing logic, and have become
an active research area for reconfigurable computing.
We present a network flow based method for precedence
constrained partitioning. We first give a general net modeling
for both combinational and sequential circuits to find a
uni-directional min-net-cut in a netlist; then algorithm FBP-u is
developed, which uses repeated max-flow min-cut computation
to find an α-bounded uni-directional min-net-cut. Al-
