
White Paper

Solving the ASIC Prototype Partition Problem with Synopsys ProtoCompiler
Best Practices for FPGA-Based Prototype Design Using Synopsys ProtoCompiler with
Synopsys HAPS Systems
July 2014

Author

Bob Erickson
ProtoCompiler Partitioner Architect, Synopsys, Inc.

Introduction

When developing a multi-FPGA prototype of an ASIC or SOC, you have many decisions to make: how to
distribute clocks; where to put the daughter boards with real-world interfaces; which modules should be
assigned to each FPGA; where and how many cables connect the FPGAs; and how to squeeze all the signals
into those cables. All these decisions need to result in the fastest possible prototype that you can build and
debug in the allotted time. And every week the RTL changes, and sometimes it seems that every decision you
make forces you to revisit all the decisions that came before. There is a better way.
To help you make these decisions, Synopsys applied its years of experience working with prototypers to
create the new Synopsys ProtoCompiler design automation and debug software for the Synopsys HAPS
Series of FPGA-based prototyping systems. ProtoCompiler is an integrated prototyping toolset with built-in
HAPS hardware knowledge. It automates the development of a HAPS prototype from RTL compilation, to
partitioning, to synthesis, place and route, and debug.
In this paper we will describe the overall flow for developing a prototype with ProtoCompiler. Then we will focus
on ProtoCompiler's partitioning subsystem and how you can use it to develop your prototype. We also explain
why a HAPS-aware partitioning solution yields superior results compared to a generic partitioning tool that is
not aware of the special characteristics of the underlying platform.

Overview of ProtoCompiler's Multi-FPGA Flow


[Figure: flow diagram. Designer inputs (HDL source, instrumentation controls, timing constraints, target system
spec, partition constraints, modified target system spec, routing constraints) feed the processing steps: RTL
compilation and debug instrumentation, pre-partition, partition, system route, and system generate. Data passed
between steps: RTL netlist, area estimates, prepared netlist, partition results, partitioned netlist, TSS, routed
netlist, partitioned RTL, and synthesis, place and route scripts. Each step produces reports.]

Figure 1: ProtoCompiler Multi-FPGA Flow

Figure 1 shows a high-level view of the ProtoCompiler data flow with designer input on the left, the processing
steps in the center with the key data generated by each step, and reports on the right. Steps with the loop symbol
were designed for rapid iteration to help you quickly converge to a good solution.

RTL Compilation and Debug Instrumentation

The RTL compilation step reads the HDL source code of the ASIC design and generates a hierarchical netlist that is
the starting point for the subsequent steps in the ProtoCompiler flow. The HDL source delivered to the prototyping
team is often quite raw, having passed some amount of the functional verification suite, but not synthesized for
either ASIC or FPGA implementation. The prototyping team must refine the HDL source to make it FPGA friendly.
Required changes may include modifications of clock generation, implementation of memory models with FPGA
block RAMs, and modifications of non-synthesizable constructs in the simulation-oriented code. Often these
refinements require manual modifications to design source code. Catching and fixing errors introduced in making
the ASIC design FPGA-ready is a time consuming and tedious task. A tool which can catch such errors quickly is a
great asset in this step and can help reduce prototype bring-up time dramatically. The ProtoCompiler compilation
step was designed for quick iterations to help you converge quickly on the final HDL source.
An optional but highly recommended step prior to partitioning is to add debug instrumentation to the design. Using the
ProtoCompiler instrumentor, you can add watch points and triggers to key interfaces and to status and control registers
of the design. When the design is running in the prototype system, the ProtoCompiler RTL debugger will monitor these
signals to help you identify and fix any functional problems with your prototype. You will also have a chance to insert
additional debug instrumentation after partitioning. ProtoCompiler's ability to add instrumentation prior to partitioning
enables you to debug even those modules which are partitioned across multiple FPGAs in the prototype.

Pre-Partition
The pre-partition step reads the synthesized RTL netlist from the compilation step and performs some global
optimizations (like constant propagation) on the netlist and makes sure the prepared netlist is ready for partitioning.
If you have timing constraints, you can apply them to the design here and they will be carried forward through
the partitioning process. At a minimum, timing constraints must define the clocks of the design so that the
ProtoCompiler partitioner and system router can do a good job of routing clocks. The partitioner and router have
been architected to use the timing constraints to improve the performance of the prototype. The pre-partition step
also runs a fast logic synthesis on the design to estimate the FPGA resource usage of each module. These area
estimates are an important input to the partitioner.

[Figure: four FPGAs (A, B, C and D) joined by inter-FPGA cables. Each FPGA has its own connectors and daughter
cards, and global clocks are distributed to the FPGAs.]

Figure 2: Target HAPS System Diagram

The Target System Specification (TSS) defines the topology of the target HAPS system and is an input to the
pre-partition step and an optional input to the partition step. Figure 2 shows a diagram of a typical target HAPS
system. The TSS describes the details of your prototype to the partitioner, including the FPGAs, daughter boards,
external connectors, global clock resources and the cables used for inter-FPGA interconnect. The TSS does not
need to be complete and correct for the pre-partition step, but it must at least define the number and type(s) of
FPGAs used in the prototype. It must be complete and correct before you can run the system generate step.
Later in this paper we will describe the abstract TSS process for refining a TSS so that it meets your goals for the
prototype. The ability to initially define the target system abstracted in this manner is a very helpful way to explore
potential partition solutions.
The compiled TSS that the partitioner and router use has extensive information about the details of the HAPS
system. The partitioner and router are able to automate many tedious and error-prone tasks because they have
access to this HAPS-specific data.

Partition
The partition step assigns each of the modules of the prepared design to one of the FPGAs in the design. You can
supply to the partitioner one or more Partition Constraints Format (PCF) file(s) containing constraints that direct
the automated partition and system route engines based on your knowledge of the architecture of the design. You
may optionally modify the TSS to meet the needs of your prototype. This step was designed for rapid iteration to
help you achieve the best possible prototype by changing the HAPS system topology and partition constraints until
you have achieved the goals for your prototype. The partition step generates a partitioned netlist that contains a
top-level module for each FPGA. Each FPGA has a logic hierarchy that parallels the prepared design. The partition
step also generates a compiled TSS and partition results that it forwards to the system route step.

System Route
The system route step determines how the signals of the design will be routed using the cables and connectors
of the HAPS prototype. In almost all practical cases, the required number of inter-FPGA signals will exceed
the number of available prototype connections, and so each wire must transmit multiple signals using time-division
multiplexing (TDM). The number of signals transmitted on a single wire is the TDM Ratio of that wire. If
the prototype has no physical connection between two FPGAs, the partitioner may decide that the best solution
requires routing signals through an intermediate FPGA; we call this a feedthrough. The output of the system route
step is a routed netlist with TDM logic included in the design.
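As a rough illustration of the arithmetic, the smallest TDM ratio that could serve a pair of FPGAs is the number of signals that must cross divided by the number of physical wires, rounded up. The sketch below is a back-of-the-envelope model only, not ProtoCompiler's algorithm; the function name and the example numbers are hypothetical, and a real TDM implementation supports only certain ratios and adds its own overhead.

```python
import math


def min_tdm_ratio(crossing_signals: int, wires: int) -> int:
    """Smallest whole number of signals per wire that carries every signal."""
    if wires <= 0:
        raise ValueError("at least one physical wire is required")
    if crossing_signals <= 0:
        return 0  # nothing to transmit
    return math.ceil(crossing_signals / wires)


# Hypothetical example: 1,000 signals must cross on 96 wires.
print(min_tdm_ratio(1000, 96))  # 11
```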

System Generate
The final step of ProtoCompiler's multi-FPGA flow is the system generate step. This step generates a synthesis,
place & route script for each of the FPGAs in the prototype. You will launch these scripts in ProtoCompiler's single
FPGA flow to complete the generation of your prototype.

Overview of the ProtoCompiler Partitioner and System Router


As the complexity of prototypes has increased, the need for automated tools to facilitate the partitioning and
routing of the prototype has also increased. The ProtoCompiler partitioner and system router were developed
from the ground up to meet these needs. They provide an unprecedented level of automation for FPGA-based
prototype design using HAPS Series systems while providing extensive controls so that prototypers do not have
to sacrifice the performance they need to get the automation they want. Some of the key features of the partitioner
and system router are:
• HAPS-aware partitioning and routing
• Fast, routing-aware automatic partitioning with extensive user controls. In benchmark tests the system
  can typically find a feasible fit for a 48 million ASIC gate capacity (4 FPGA, HAPS-70 S48) system in less
  than 5 minutes.
• Fast, fully automatic routing and insertion of high-speed time-domain multiplexing (HSTDM) interconnect.
  HSTDM uses source-synchronous transmission of high speed serial data to multiplex many signals on a
  single wire pair.
• Extensive log files and reports that give clear feedback about problems and results

Partitioner
Partitioning must optimize many objectives simultaneously:
• Do not exceed the maximum utilization of any FPGA
• Assign interface logic so that top-level I/Os can be routed directly to connectors or daughter cards
• Reduce the total number of nets that cross FPGAs
• Reduce the number of clock signals that cross FPGAs. Clocks that cross FPGAs can create unwanted
  skew, causing performance problems or non-functional prototypes.
• Make the nets that cross FPGAs match the available cables. For example, if there are more cables
  between FPGAs A and B compared to other pairs, assign the logic so that more wires are needed
  between A and B compared to other pairs (with little or no increase in the total number of nets that
  cross FPGAs).
• Reduce the number of required feedthroughs
The ProtoCompiler partitioner has sophisticated algorithms to address all of the above objectives simultaneously.
The partitioner algorithm uses a built-in global router to understand the interconnect requirements of every
intermediate solution it evaluates. During a run, it may evaluate millions of alternative solutions. It will attempt to
find a partition solution that matches the connectivity of the prototype, as defined in the TSS. The partitioner tries
simultaneously to minimize feedthroughs and to minimize the worst TDM ratio in the system, while observing the
constraint that some signals, such as top-level I/Os, clocks and tristate nets, cannot use TDM or feedthroughs.
The partitioner has access to HAPS-specific data that makes it possible to avoid many potentially disastrous
problems that a generic partitioner may create. In particular, it has detailed knowledge of the requirements for
HSTDM, and for clock routing, and can fully automate, together with the system router, the insertion of HSTDM
logic and the routing of clocks.
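To make the objectives listed above concrete, here is a toy model of how one candidate partition might be scored. It is a sketch under stated assumptions, not the ProtoCompiler cost function: the function name, the data layout, and the module names are all hypothetical, and counting one wire per FPGA pair for a multi-FPGA net is a simplification.

```python
from collections import defaultdict


def evaluate_partition(assignment, areas, nets, capacity, max_util=0.75):
    """Score one candidate partition (toy model).

    assignment: module name -> FPGA name
    areas:      module name -> estimated LUT count
    nets:       list of nets, each a list of the module names it connects
    capacity:   LUTs available in one FPGA
    """
    # Objective 1: no FPGA may exceed the utilization limit.
    used = defaultdict(int)
    for module, fpga in assignment.items():
        used[fpga] += areas[module]
    over_utilized = sorted(f for f, luts in used.items() if luts > max_util * capacity)

    # Objectives 2 and 3: count nets that cross FPGAs, and the demand on each FPGA pair.
    cut_nets = 0
    pair_demand = defaultdict(int)
    for net in nets:
        fpgas = sorted({assignment[m] for m in net})
        if len(fpgas) > 1:
            cut_nets += 1
            for i, a in enumerate(fpgas):
                for b in fpgas[i + 1:]:
                    pair_demand[(a, b)] += 1
    return over_utilized, cut_nets, dict(pair_demand)


# Hypothetical four-module design on two FPGAs of 1,200 LUTs each.
areas = {"cpu": 600, "dsp": 500, "ddr": 300, "usb": 200}
nets = [["cpu", "dsp"], ["cpu", "ddr"], ["dsp", "usb"], ["cpu", "ddr", "usb"]]
assignment = {"cpu": "A", "ddr": "A", "dsp": "B", "usb": "B"}
print(evaluate_partition(assignment, areas, nets, capacity=1200))
```

Comparing the per-pair demand with the wires actually available between each pair is what ties a score like this to the target system, which is the role the built-in global router plays in the description above.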


Sometimes the TSS is established before you begin partitioning; this may be the case if you are retargeting
an existing prototype configuration. But often the configuration of cables and external I/Os is flexible. The
ProtoCompiler partitioner can use an abstract TSS to help you to quickly converge to a detail TSS. Using
abstract TSS PCF commands, you can temporarily change the board interconnect or add port bins to experiment
with partitioning solutions. Unlike the detail TSS, the abstract TSS does not have to be feasible. For example,
given a single HAPS-70 S48 you have approximately 1,100 GPIO signals available at the HapsTrak 3 connectors of
each PCB module or approximately 4,400 GPIO signals total. Say you want to design a prototype that integrates two
HAPS-70 S48 systems (8 FPGAs). With an abstract TSS, you can temporarily exceed the physical limit of roughly 4,400
GPIO signals per system and instead specify that each HAPS-70 S48 system has 10,000 connections to join the two
clusters of four FPGAs. Since inter-system connections are readily available, the partitioner will concentrate on
finding a good solution for each system. You can then use the partitioner log file and reports to determine the
required inter-system connections for the next iterations. When you have a feasible abstract TSS that delivers the
results you like, you need to build a detail TSS description that matches it.
Although the algorithms of the partition engine can come up with a solution without any user input, the real-world
requirements of the prototype cannot be met without user input. The partition engine supports a wide range of
constraints and directives. The most obvious constraints define assignment of logic to FPGAs and connection of
external ports to connectors and daughter cards. In addition, you can:
• Cluster cells and ports to make sure they stay together
• Replicate logic into multiple FPGAs
• Control both the maximum and minimum utilization of FPGAs
• Control which hierarchical modules are dissolved
• Control the routing of clocks
Partitioning is an NP-complete problem, which places it among the most difficult problems in computer science. The
partitioner's algorithms are powerful heuristics that come up with good results in a short time. Heuristics always have
limitations; however, you can help the partitioner find the solution you want with a few high-level constraints called
floorplan constraints:
• You can assign a large module to a specific FPGA and then lock that FPGA so that the partitioner cannot
  assign any other logic to it. This is a good idea if the interface to that module is small and clean.
• As an alternative to assigning a module to a particular FPGA, you can assign a module to several FPGAs.
  That module and all of its sub-modules will be constrained to those FPGAs. For example, if a module A
  is too big to fit in a single FPGA, then you can assign A to two FPGAs and the partitioner will find a good
  way to partition that module between the two FPGAs while also optimizing the connections to the two
  FPGAs by adding additional logic to them.
• When assigning top-level I/O ports to a connector or daughter board interface, you can force the
  connected logic to be in the FPGA that connects to them.
Often, a few floorplan constraints can help the partitioner to repeatably find the same high quality solution.

System Router
The ProtoCompiler system router consists of two parts: a global router and a detail router.
The global router considers the cables that connect each pair of bins (FPGAs, connectors and daughter cards).
It decides which signals should be assigned to each cable and the TDM Ratio for each signal. It understands the
clock routing requirements for the target system and tries to make low-skew connections for each clock.
The detail router uses the global router's results to assign each signal to a particular wire in the prototype. Because
of TDM, many wires will have more than one assigned signal.
You control the system router through PCF constraints. To complete the prototype, you must provide enough
details in the constraints so that the router will route the correct signals to the correct connector or daughter
card pin. The system router also takes constraints that configure the TDM interfaces, including control over clock
frequency and available TDM ratios.


The Design Partitioning Work Flow


The ProtoCompiler partitioner and system router were designed for rapid iteration. You should expect to perform
many partitioner and system router runs before you have completed your prototype. Between iterations, you may
change the run time options, the partition constraints and/or the TSS.
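[Figure: flowchart. Pre-partition produces area estimates and the prepared netlist. Partition takes the partition
constraints and the modified target system spec and produces partition results, the partitioned netlist, the TSS,
and reports and results. A "Partition OK?" check follows: No returns to the partition inputs, Yes continues to
system route. System route takes the routing constraints and produces the routed netlist and reports and results.
A second check follows: No returns for another iteration, Yes continues to system generate.]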

Figure 3: Partitioning iteration flow

Figure 3 shows a conceptual flow diagram for the partitioning and routing process. During your process, you may
start with no constraints and a simple TSS. As you run the partitioner and evaluate the results, you will gradually
add constraints and refine the TSS until the required details of the prototype are correct (typically external ports,
daughter card connections and clock distribution) and the partition results meet your performance goal. At the end
of the process, you may want to guarantee repeatability of the result by fully constraining all logic and ports. For a
working prototype, you will need to specify the required constraints listed in Table 1.
Constraint: Bin utilization
Description: We typically recommend using no more than about 75% of the available FPGA resources.
By default, the limits are set to 100%. You need to specify the limits for each FPGA
using the PCF bin_utilization command to make sure that each of your FPGAs can
successfully complete synthesis, place and route.

Constraint: TDM control
Description: Specify the TDM type and the required resources for generating the TDM logic.

Constraint: External I/O assignments
Description: Each external port in your prototype must get to the right pin on the right connector.

Constraint: Global clock assignments
Description: The HAPS system has a flexible capability for generation and routing of global clocks,
that is, clocks that drive more than one FPGA with low skew. You will need to determine
the source (external, PLL or FPGA-generation) for each global clock and how it will use
the HAPS global clock resources. PCF has special commands to help you specify global
clocks.

Constraint: Local clock distribution
Description: Often you cannot avoid having clocks that are sourced from one FPGA and used in
another. If the logic can tolerate the skew, then you need only make sure that the clocks
are using clock-capable pins on the receiving FPGA (the log file clearly reports all clock
crossings). If not, then you can either move the clock to the global clock network or you
can replicate the logic that generates the clock into both FPGAs. Either solution will result
in reduced clock skew.

Table 1: Required constraints for a working prototype

Recommended Organization of Partition Constraints


As you can see from Table 1, you will need to specify a lot of constraints before you finish your prototype.
Organizing them into a few files that match their modification frequency will make it easier to do quick experiments
as you converge on a solution. Table 2 suggests an organization of the PCF files.


File: setup.pcf
Recommended contents: bin_utilization, tdm_control, dissolve controls
Description: The most stable PCF file captures the decisions that are least likely to change.

File: ports.pcf
Recommended contents: cluster_port, assign_port
Description: The ports file contains all the information about the non-clock ports. In the
early stages of the process you may cluster ports that must stay together,
such as the interface to a particular daughter card. Later you may assign
these ports to the port bins that connect to the daughter board, and finally
assign each bit to a particular connector pin. We recommend that you work to
converge this file, at least to the bin stage, early in the process because port
assignments will help to stabilize the partition results. This file will also serve
as the port constraints for the system router.

File: clocks.pcf
Recommended contents: assign_global_net, net_attribute, assign_port, replicate_cell
Description: The clocks and resets file specifies the clock and reset distribution for the
prototype. We recommend that you work to converge at least the global
clock definitions early in the process, and to make sure that the partitioner/
router's analysis of which signal is a clock or reset is correct. The results of the
analysis are reported in the partition report file.

File: floorplan.pcf
Recommended contents: assign_cell, bin_attribute -locked
Description: The floorplan file contains cell assignments to help the partitioner find the best
solution. Often, a few high-level constraints are all that's needed.

File: abstract_tss.pcf
Recommended contents: abstract_tss
Description: The abstract target system file augments the detail TSS by redefining the
connections in the system. In the earliest stages, it may contain both
additional port bins and inter-FPGA connections. As the ports and clocks
converge, we recommend that you add the required information to the detail
TSS file, and restrict the abstract TSS file to just inter-FPGA connections. Of
course, when you have converged to a solution, you will need to have all the
information in the detail TSS file, and this file will no longer be necessary.

Table 2: Recommended organization of partition and system route constraints

Finding a Good Partition


Finding the best match between your design and your hardware configuration is a journey of discovery. With just
100 objects to partition into a four-FPGA system, there are 4^100 (roughly 1.6 x 10^60) possible assignments. Most of
them are bad; some of them are OK; but only a few of them are excellent. When you add in the ability to change the system
interconnect, the number of possible solutions is much more than we want to count. Luckily, the ProtoCompiler
partitioner has sophisticated heuristics to find good solutions among the many, and we have added controls to
help you guide it to an excellent solution.
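To put a number on the search space: the count depends on what is counted. If each of the 100 objects can be placed independently on any of the four FPGAs, there are 4^100 raw assignments, while merely choosing four objects out of 100 already gives about 3.9 million combinations. The short calculation below shows both figures. It ignores capacity limits, equivalent relabelings of the FPGAs, and changes to the interconnect, so treat it as an order-of-magnitude illustration only.

```python
import math

objects = 100
fpgas = 4

# Each object is placed independently on one of the FPGAs.
labelings = fpgas ** objects

# For comparison: the number of ways to pick 4 objects out of 100.
choose_four = math.comb(objects, fpgas)

print(f"{labelings:.3e}")  # 1.607e+60
print(choose_four)         # 3921225
```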
Let's explore partitioning an eight-FPGA design built from two HAPS-70 S48 systems (mb1 and mb2). We start
with the setup, ports and clocks files complete. The setup file is as follows:
bin_utilization -all_bins -resource_ratio {LUT 0.75}
bin_utilization -min -all_bins -resource_ratio {LUT 0.20}
dissolve_control -max_ratio {ALL 0.3}
We have set the maximum utilization for any FPGA to 75%. We have also set the minimum utilization to 20% to
make sure that all the FPGAs get used. We will dissolve all modules with area greater than 30% of an FPGA.
We also start with a fully connected abstract tss:
abstract_tss
abstract_tss
abstract_tss
abstract_tss
abstract_tss
abstract_tss
abstract_tss
abstract_tss
abstract_tss
abstract_tss
abstract_tss
abstract_tss
abstract_tss

-clear_fpga_traces
-add_trace_group {mb1.uA
-add_trace_group {mb1.uA
-add_trace_group {mb1.uA
-add_trace_group {mb1.uB
-add_trace_group {mb1.uB
-add_trace_group {mb1.uC
-add_trace_group {mb2.uA
-add_trace_group {mb2.uA
-add_trace_group {mb2.uA
-add_trace_group {mb2.uB
-add_trace_group {mb2.uB
-add_trace_group {mb2.uC

mb1.uB}
mb1.uC}
mb1.uD}
mb1.uC}
mb1.uD}
mb1.uD}
mb2.uB}
mb2.uC}
mb2.uD}
mb2.uC}
mb2.uD}
mb2.uD}

Solving the ASIC Prototype Partition Problem with Synopsys ProtoCompiler

-width
-width
-width
-width
-width
-width
-width
-width
-width
-width
-width
-width

96
96
96
96
96
96
96
96
96
96
96
96

abstract_tss
abstract_tss
abstract_tss
abstract_tss
abstract_tss
abstract_tss
abstract_tss
abstract_tss
abstract_tss
abstract_tss
abstract_tss
abstract_tss
abstract_tss
abstract_tss
abstract_tss
abstract_tss

-add_trace_group
-add_trace_group
-add_trace_group
-add_trace_group
-add_trace_group
-add_trace_group
-add_trace_group
-add_trace_group
-add_trace_group
-add_trace_group
-add_trace_group
-add_trace_group
-add_trace_group
-add_trace_group
-add_trace_group
-add_trace_group

{mb1.uA
{mb1.uA
{mb1.uA
{mb1.uA
{mb1.uB
{mb1.uB
{mb1.uB
{mb1.uB
{mb1.uC
{mb1.uC
{mb1.uC
{mb1.uC
{mb1.uD
{mb1.uD
{mb1.uD
{mb1.uD

mb2.uA}
mb2.uB}
mb2.uC}
mb2.uD}
mb2.uA}
mb2.uB}
mb2.uC}
mb2.uD}
mb2.uA}
mb2.uB}
mb2.uC}
mb2.uD}
mb2.uA}
mb2.uB}
mb2.uC}
mb2.uD}

-width
-width
-width
-width
-width
-width
-width
-width
-width
-width
-width
-width
-width
-width
-width
-width

96
96
96
96
96
96
96
96
96
96
96
96
96
96
96
96

Every FPGA connects to every other FPGA with 96 wires. The trace width of 96 is equivalent to two HAPS cables.
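Writing 28 near-identical lines by hand invites typos, so a file like this can be generated. The script below is a convenience sketch, not part of ProtoCompiler: it only reuses the abstract_tss syntax shown in the listing, emits the pairs in a different order from that listing, and assumes that the order of the trace groups does not matter.

```python
from itertools import combinations

systems = ["mb1", "mb2"]
fpgas = ["uA", "uB", "uC", "uD"]
width = 96  # wires per FPGA pair, as in the example above

# Bin names follow the <system>.<fpga> pattern used in the listing.
bins = [f"{s}.{f}" for s in systems for f in fpgas]

lines = ["abstract_tss -clear_fpga_traces"]
for a, b in combinations(bins, 2):
    lines.append(f"abstract_tss -add_trace_group {{{a} {b}}} -width {width}")

print("\n".join(lines))
```

Eight bins yield 28 unordered pairs, so the script prints 29 lines including the initial clear command.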

[Figure: the eight FPGAs mb1.uA through mb1.uD and mb2.uA through mb2.uD, with every pair joined by a
connection of width 96.]

Figure 4: Abstract TSS example for a full interconnect

We run the partitioner with the ProtoCompiler command:


run partition -optimization_priority nets -hierarchy_dissolve_pass 1
Setting the optimization priority to nets is useful in the early stages of partition exploration. It attempts to find a
solution with the fewest possible nets that cross FPGAs and uses the routability of the prototype as a secondary
criterion. Setting the number of dissolve passes to 1 saves some run time during the exploration phase. Later, we may
see if a second pass will improve the results.
When we look at the log file, first we check for warnings. All warnings in the log file begin with @W. For the most
part, you should be able to run the partitioner without warnings.


Seeing that the run was free of warnings, we notice that one of the cells that the partitioner dissolves has a lot of
random logic at its top level (local logic):
Dissolving: I: Cell1 (3460) C: Cell1_mod L: 6
Local: LUT 197402 DFF 7213
Total: LUT 224063 DFF 40372
The report indicates that a Verilog module instance (I: Cell1), 6 levels from the top of the design (L: 6), contains
197402 LUTs and 7213 DFFs of local logic. The total logic (Total: LUT 224063 DFF 40372) shows that the bulk of the
LUT logic in Cell1 is local. While our initial directive was to dissolve any module that exceeded 30% of FPGA
capacity, dissolving a module with lots of local logic can lead to more signal congestion between FPGAs and
thus raise multiplex ratios, so we add a no_dissolve constraint to the setup file:
no_dissolve Cell1
When we run again, we look first at the final partition cost function report:
@S3.1.3.7 AP141 |Decompose and optimize
@N: AP373 |After Decomposition: Cell Clusters: 4471 Port Clusters: 53
Nets: 398277
Total Cost = 5.62259e+008 x 1;
Routing Opt: Solution 314 Nets 26365 Feedthru 0 MaxRatio 61.26 Channel 74.59
Overflow 0.00
Clock Crossing: 0 FF Cost 0 Cost 0
Minimum Bin Size: mb.uA LUT 214034(17%) Cost 803619
We see that there are 26365 nets that cross FPGAs, no feedthroughs (that's good!) and a maximum TDM ratio of
61.26. To meet our goal of a TDM ratio of 20, we will need to roughly triple the number of cables connecting at
least one pair of FPGAs.
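The factor of three follows from simple scaling. The sketch below assumes that the worst-case ratio falls in inverse proportion to the number of wires on that FPGA pair while the set of crossing nets stays the same; that is an approximation for planning purposes, since the partitioner may choose a different solution once the widths change.

```python
import math

current_width = 96   # traces per FPGA pair in the abstract TSS
max_ratio = 61.26    # MaxRatio from the cost function report
target_ratio = 20    # goal stated in the text

scale = max_ratio / target_ratio
needed_width = math.ceil(current_width * scale)

print(round(scale, 2))  # 3.06
print(needed_width)     # 295
```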
This is a good first result but let's see if we can do better. Among the many blocks in this design are two major
sub-blocks that will each need two to four FPGAs to implement. You can see this by looking at the dissolve reports
in the log file. Let's make a floorplan constraint for them to keep them in separate parts of the prototype.
assign_cell {module_1} {mb1.uA mb1.uB mb1.uC mb1.uD}
assign_cell {module_2} {mb2.uA mb2.uB mb2.uC mb2.uD}
In looking at the design in the schematic viewer, we also see that a submodule of module_1 is glue logic that
connects many other submodules. It is best if we dissolve that cell so it can be broken up to improve the partition.
must_dissolve {module_1.glue}
Now we can see that the number of nets that cross FPGAs has improved, from 26365 to 23798:
@S3.1.3.7 AP141 |Decompose and optimize
@N: AP373 |After Decomposition: Cell Clusters: 4507 Port Clusters: 53
Nets: 398277
Total Cost = 5.71109e+008 x 1;
Routing Opt: Solution 160 Nets 23798 Feedthru 0 MaxRatio 60.83 Channel 67.98
Overflow 0.00
Clock Crossing: 0 FF Cost 0 Cost 0
Minimum Bin Size: mb1.uD LUT 187421(15%) Cost 3.37084e+006
The Solution Summary shows how the FPGAs are used:
FPGA BINS  Cells  isLocked  LUT  LUTM  DFF  BRAM  DSP  IO
----------------------------------------------------------------------------
mb.uD   1841  886437(71%)  482307(19%)  856(65%)
mb.uC   1976  557374(45%)  391793(16%)  905(69%)  16(1%)
mb.uB   58    366257(29%)  81517(3%)    18(1%)
mb.uA   487   855712(69%)  610523(25%)  1035(78%)  13(1%)  4(0%)
mb1.uD  11    187421(15%)  45383(2%)    55(4%)
mb1.uC  17    220228(18%)  143330(6%)   288(22%)
mb1.uB  97    677392(54%)  472612(19%)  1235(94%)
mb1.uA  20    793931(64%)  631347(25%)  1057(80%)
2(0%)
4(0%)

The maximum LUT utilization is 71%, which is well within our target of 75%. The minimum LUT utilization is 15%,
which is less than our target of 20%. While the maximum utilization constraint is hard, i.e., the partitioner must
obey it, the minimum utilization constraint is soft, so the partitioner can use less if doing so will improve the results.
However, the penalty for violating the constraint increases as the violation increases.
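The same check can be done mechanically. The snippet below copies the LUT percentages from the Solution Summary and compares them with the 75% maximum and 20% minimum set in the setup file; the variable names are illustrative only.

```python
# LUT utilization in percent, copied from the Solution Summary.
lut_percent = {
    "mb.uD": 71, "mb.uC": 45, "mb.uB": 29, "mb.uA": 69,
    "mb1.uD": 15, "mb1.uC": 18, "mb1.uB": 54, "mb1.uA": 64,
}
MAX_LIMIT = 75   # hard limit from bin_utilization
MIN_TARGET = 20  # soft target from bin_utilization -min

over_max = [b for b, p in lut_percent.items() if p > MAX_LIMIT]
under_min = [b for b, p in lut_percent.items() if p < MIN_TARGET]

print(over_max)   # []
print(under_min)  # ['mb1.uD', 'mb1.uC']
```

No bin breaks the hard limit, and two bins fall short of the soft minimum.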
The Partitioner Ratio Estimate Report tells us how the cables between the FPGAs are used:
Partitioner Ratio Estimate Report
mb.uC<->mb.uD
Available Traces:
Ratio: 38.15
mb.uB<->mb.uD
Available Traces:
Ratio: 0.00
mb.uA<->mb.uD
Available Traces:
Ratio: 43.02
mb.uD<->mb1.uD
Available Traces:
Ratio: 0.00
mb.uD<->mb1.uC
Available Traces:
Ratio: 0.00
mb.uD<->mb1.uB
Available Traces:
Ratio: 0.00
mb.uD<->mb1.uA
Available Traces:
Ratio: 0.00
mb.uB<->mb.uC
Available Traces:
Ratio: 0.47
mb.uA<->mb.uC
Available Traces:
Ratio: 20.10
mb.uC<->mb1.uD
Available Traces:
Ratio: 0.00
mb.uC<->mb1.uC
Available Traces:
Ratio: 0.00
mb.uC<->mb1.uB
Available Traces:
Ratio: 1.06
mb.uC<->mb1.uA
Available Traces:
Ratio: 0.81
mb.uA<->mb.uB
Available Traces:
Ratio: 33.16
mb.uB<->mb1.uD
Available Traces:
Ratio: 0.00
mb.uB<->mb1.uC
Available Traces:
Ratio: 0.00
mb.uB<->mb1.uB
Available Traces:
Ratio: 0.00
mb.uB<->mb1.uA
Available Traces:
Ratio: 0.00
mb.uA<->mb1.uD
Available Traces:
Ratio: 0.00
mb.uA<->mb1.uC
Available Traces:
Ratio: 0.00
mb.uA<->mb1.uB
Available Traces:
Ratio: 28.86
mb.uA<->mb1.uA
Available Traces:
Ratio: 0.00
mb1.uC<->mb1.uD
Available Traces:
Ratio: 0.00
mb1.uB<->mb1.uD
Available Traces:
Ratio: 6.29
mb1.uA<->mb1.uD
Available Traces:
Ratio: 0.00
mb1.uB<->mb1.uC
Available Traces:
Ratio: 0.44

96

Net Usage: (TDM 3662 DIRECT 0)

96

Net Usage: (TDM 0 DIRECT 0)

96

Net Usage: (TDM 4130 DIRECT 0)

96

Net Usage: (TDM 0 DIRECT 0)

96

Net Usage: (TDM 0 DIRECT 0)

96

Net Usage: (TDM 0 DIRECT 0)

96

Net Usage: (TDM 0 DIRECT 0)

96

Net Usage: (TDM 45 DIRECT 0)

96

Net Usage: (TDM 1930 DIRECT 0)

96

Net Usage: (TDM 0 DIRECT 0)

96

Net Usage: (TDM 0 DIRECT 0)

96

Net Usage: (TDM 102 DIRECT 0)

96

Net Usage: (TDM 78 DIRECT 0)

96

Net Usage: (TDM 3183 DIRECT 0)

96

Net Usage: (TDM 0 DIRECT 0)

96

Net Usage: (TDM 0 DIRECT 0)

96

Net Usage: (TDM 0 DIRECT 0)

96

Net Usage: (TDM 0 DIRECT 0)

96

Net Usage: (TDM 0 DIRECT 0)

96

Net Usage: (TDM 0 DIRECT 0)

96

Net Usage: (TDM 2771 DIRECT 0)

96

Net Usage: (TDM 0 DIRECT 0)

96

Net Usage: (TDM 0 DIRECT 0)

96

Net Usage: (TDM 604 DIRECT 0)

96

Net Usage: (TDM 0 DIRECT 0)

96

Net Usage: (TDM 42 DIRECT 0)

Solving the ASIC Prototype Partition Problem with Synopsys ProtoCompiler

10

mb1.uA<->mb1.uC
Ratio: 14.70
mb1.uA<->mb1.uB
Ratio: 60.83

Available Traces: 96

Net Usage: (TDM 1411 DIRECT 0)

Available Traces: 96

Net Usage: (TDM 5840 DIRECT 0)
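The reported ratios match the TDM net count divided by the available traces (for example, 5840 / 96 = 60.83). That relationship is inferred here from the numbers in the report rather than stated by the tool, so the sketch below should be read as a planning aid under that assumption. It recomputes the ratio for a few entries copied from the report, estimates how many traces each pair would need for the target ratio of 20, and lists the pairs that carry no nets.

```python
import math

TRACES = 96
TARGET_RATIO = 20

# TDM net counts for a few FPGA pairs, copied from the report.
tdm_nets = {
    "mb.uC<->mb.uD": 3662,
    "mb.uA<->mb.uD": 4130,
    "mb.uB<->mb.uD": 0,
    "mb.uA<->mb.uB": 3183,
    "mb1.uA<->mb1.uB": 5840,
}

# Assumed relationship: ratio = TDM nets / available traces.
ratios = {pair: round(nets / TRACES, 2) for pair, nets in tdm_nets.items()}

# Traces needed so that no wire carries more than TARGET_RATIO signals.
needed = {pair: math.ceil(nets / TARGET_RATIO) for pair, nets in tdm_nets.items()}

# Pairs with no nets are candidates for removal from the abstract TSS.
unused = [pair for pair, nets in tdm_nets.items() if nets == 0]

for pair in tdm_nets:
    print(f"{pair:16s} ratio {ratios[pair]:6.2f}  traces needed: {needed[pair]}")
print("unused:", unused)
```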

We can see from this report that many connections are not used at all. Let's delete them from the abstract TSS
and run again.
abstract_tss -clear_fpga_traces
abstract_tss -add_trace_group {mb1.uA mb1.uB} -width 96
abstract_tss -add_trace_group {mb1.uA mb1.uC} -width 96
abstract_tss -add_trace_group {mb1.uA mb1.uD} -width 96
abstract_tss -add_trace_group {mb1.uB mb1.uC} -width 96
abstract_tss -add_trace_group {mb1.uC mb1.uD} -width 96
abstract_tss -add_trace_group {mb2.uA mb2.uB} -width 96
abstract_tss -add_trace_group {mb2.uA mb2.uC} -width 96
abstract_tss -add_trace_group {mb2.uA mb2.uD} -width 96
abstract_tss -add_trace_group {mb2.uB mb2.uC} -width 96
abstract_tss -add_trace_group {mb1.uA mb2.uB} -width 96
abstract_tss -add_trace_group {mb1.uC mb2.uA} -width 96
abstract_tss -add_trace_group {mb1.uC mb2.uB} -width 96
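
The pruning decision above can also be checked mechanically. The following Python sketch takes a few rows copied from the report, recomputes each ratio as TDM nets divided by available traces, and lists the connections that carry no nets. The ratio relationship is inferred from the reported numbers rather than taken from ProtoCompiler documentation, and the tuple layout and function names are illustrative assumptions.

```python
# Rows copied from the partition report:
# (connection, available traces, TDM nets, direct nets).
REPORT_ROWS = [
    ("mb.uC<->mb1.uA", 96, 78, 0),
    ("mb.uA<->mb.uB", 96, 3183, 0),
    ("mb.uB<->mb1.uD", 96, 0, 0),
    ("mb.uA<->mb1.uB", 96, 2771, 0),
    ("mb1.uC<->mb1.uD", 96, 0, 0),
    ("mb1.uA<->mb1.uB", 96, 5840, 0),
]


def tdm_ratio(traces, tdm_nets):
    """Assumed relationship: number of TDM nets sharing each available trace."""
    return tdm_nets / traces


def unused_connections(rows):
    """Return the connections that carry neither TDM nor direct nets."""
    return [name for name, _traces, tdm, direct in rows if tdm == 0 and direct == 0]


if __name__ == "__main__":
    for name, traces, tdm, _direct in REPORT_ROWS:
        print(f"{name:18s} ratio {tdm_ratio(traces, tdm):6.2f}")
    print("unused:", unused_connections(REPORT_ROWS))
```

Connections returned by unused_connections are candidates for removal from the abstract TSS; the decision itself remains a judgment call, since a later RTL drop may need them.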

We can see that the partitioner was not quite able to reproduce the result with this TSS file:
@S3.1.3.7 AP141 |Decompose and optimize
@N: AP373 |After Decomposition: Cell Clusters: 4507 Port Clusters: 53 Nets: 398277
Total Cost = 1.24468e+009 x 1;
Routing Opt: Solution 317 Nets 25244 Feedthru 5 MaxRatio 65.93 Channel 79.98 Overflow 0.00
Clock Crossing: 0 FF Cost 0 Cost 0
Minimum Bin Size: mb.uD LUT 183642(15%) Cost 7.1327e+006
Let's see what happens if we change the optimization priority to tdm_ratio:
run partition -optimization_priority tdm_ratio -hierarchy_dissolve_pass 1
When we run again, we see this result:
@S3.1.3.7 AP141 |Decompose and optimize
@N: AP373 |After Decomposition: Cell Clusters: 4507 Port Clusters: 53 Nets: 398277
Total Cost = 6.77332e+007 x 1;
Routing Opt: Solution 178 Nets 27302 Feedthru 0 MaxRatio 39.83 Channel 53.96 Overflow 0.00
Clock Crossing: 0 FF Cost 0 Cost 0
Minimum Bin Size: mb1.uD LUT 220227(18%) Cost 477855
As expected, the partitioner was able to reduce the TDM ratio (MaxRatio fell from 65.93 to 39.83) at the expense
of the number of nets that cross FPGAs (which rose from 25244 to 27302). At this point, we have two options:
• We could accept this partition and lock it by exchanging the floorplan file for the partition assignments
  from this run (which you can export from the ProtoCompiler database). With the partition locked, we
  could reduce the TDM ratio by scaling the width of the connections.
• Or we could continue to explore by changing the widths of the connections in the abstract TSS and perhaps
  by doing more floorplanning.
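
For the first option, a rough estimate of how wide a connection must be to reach a target TDM ratio follows from the relationship the report numbers suggest (ratio is approximately TDM nets divided by available traces). The Python sketch below is back-of-the-envelope arithmetic under that assumption only; the partitioner and system router may distribute nets differently, and the target ratio of 32 is an arbitrary value chosen for illustration.

```python
import math


def width_for_target_ratio(tdm_nets, target_ratio):
    """Smallest trace count for which tdm_nets / width <= target_ratio."""
    if target_ratio <= 0:
        raise ValueError("target_ratio must be positive")
    return math.ceil(tdm_nets / target_ratio)


if __name__ == "__main__":
    # mb1.uA<->mb1.uB carried 5840 TDM nets over 96 traces (ratio 60.83)
    # in the earlier report; 32 is a hypothetical target ratio.
    print(width_for_target_ratio(5840, 32))
```

The result is only a starting value for the -width setting of a trace group; it still has to fit the cables and connectors physically available on the HAPS system.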
In either case, we have converged to a pretty good solution in just a few iterations by changing run options,
changing the abstract TSS, and adding a few floorplan constraints.
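
When iterating like this, it helps to compare the "Routing Opt:" summary lines of successive runs side by side. The Python sketch below parses the two summary lines quoted in this section and prints the change in each field. The regular expression is an assumption based only on the two lines shown here, not on a documented log format.

```python
import re

# "Routing Opt:" lines from the two runs, with the wrapped "Overflow" field rejoined.
RUN_DEFAULT = ("Routing Opt: Solution 317 Nets 25244 Feedthru 5 "
               "MaxRatio 65.93 Channel 79.98 Overflow 0.00")
RUN_TDM_RATIO = ("Routing Opt: Solution 178 Nets 27302 Feedthru 0 "
                 "MaxRatio 39.83 Channel 53.96 Overflow 0.00")

FIELD = re.compile(r"(Solution|Nets|Feedthru|MaxRatio|Channel|Overflow)\s+([0-9.]+)")


def parse_routing_opt(line):
    """Return the numeric fields of a 'Routing Opt:' line as a dict of floats."""
    return {key: float(value) for key, value in FIELD.findall(line)}


def compare(before, after):
    """Return after-minus-before for every field present in both runs."""
    return {key: round(after[key] - before[key], 2) for key in before if key in after}


if __name__ == "__main__":
    delta = compare(parse_routing_opt(RUN_DEFAULT), parse_routing_opt(RUN_TDM_RATIO))
    for key, value in delta.items():
        print(f"{key:9s} {value:+.2f}")
```

A negative MaxRatio delta together with a positive Nets delta is the trade-off described above: a lower multiplexing ratio bought with more inter-FPGA nets.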


Finishing the Job


To complete the job, we will need to convert the abstract TSS into a detail TSS. The most efficient way to do this
is to use the launch tss -mode shell command, which puts you in an interactive Tcl window and provides immediate
feedback as you edit your TSS file.
With the detail TSS in place, run the partitioner once again and then run the system router. You may need to edit
ports and clock constraints to get through the router without errors or warnings. The router's log file will give you
feedback on how each of the cables is used, and the router report file provides details about how each net that
crosses FPGAs will be routed in the prototype.
When the system router has completed successfully, you can proceed to system generate and then to running the
FPGA synthesis scripts to get programming files for the FPGAs of the HAPS system.

When the RTL Changes


During the course of your project you will get many RTL drops. Some drops will have major changes and others
will have changes confined to logic deep in the hierarchy.
For major changes, the floorplan constraints you set up to get your first partition are your best starting point. The
partitioner will give you a new result, and in most cases, this result will be similar to the previous result. However, it
is likely that every FPGA has some changes so all of them must go through synthesis, place & route again.
Sometimes RTL changes are limited to a module which is on a single FPGA and do not impact the partition or
routing. In this case, you can identify the FPGA(s) that need to be rebuilt and run the synthesis, place & route
scripts only on those FPGAs.
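
Identifying the FPGAs to rebuild amounts to looking up each changed module in the previous partition assignments. The Python sketch below illustrates the idea with a plain dictionary; the module paths, the dictionary format, and the function names are hypothetical and do not reflect how ProtoCompiler stores or exports assignments.

```python
def owning_fpga(assignments, module_path):
    """Find the FPGA of the closest assigned ancestor of a hierarchical path."""
    parts = module_path.split(".")
    while parts:
        candidate = ".".join(parts)
        if candidate in assignments:
            return assignments[candidate]
        parts.pop()  # walk one level up the hierarchy
    return None


def fpgas_to_rebuild(assignments, changed_modules):
    """Return (sorted FPGAs to rebuild, changed modules with no known owner)."""
    fpgas, unresolved = set(), []
    for module in changed_modules:
        fpga = owning_fpga(assignments, module)
        if fpga is None:
            unresolved.append(module)
        else:
            fpgas.add(fpga)
    return sorted(fpgas), unresolved


if __name__ == "__main__":
    # Hypothetical assignments from a previous partition run.
    previous = {"top.cpu0": "mb1.uA", "top.cpu1": "mb1.uB", "top.ddr": "mb1.uC"}
    print(fpgas_to_rebuild(previous, ["top.cpu0.alu", "top.ddr.phy", "top.new_ip"]))
```

Any module reported as unresolved is new logic with no previous owner, which suggests that the change does affect the partition and that a partial rebuild is not sufficient.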
Sometimes even minor RTL changes will affect the partition and routing. In this case, we recommend that you
use the resulting assignments from your previous partition as the input to the partitioner instead of using the
floorplan constraints. The log file will report warnings for any cells that are not found, and will partition any new
logic with most of the design set in place. This method will provide the highest level of stability in the results.
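
The effect of seeding the partitioner with previous assignments can be pictured as a three-way split of the cells. The Python sketch below is a conceptual illustration only, not the partitioner's actual algorithm; the cell names and data structures are hypothetical.

```python
def seed_from_previous(previous, current_cells):
    """Split cells into kept assignments, stale entries, and new logic.

    previous      -- dict mapping cell name to FPGA from the last partition
    current_cells -- iterable of cell names present in the new RTL drop
    """
    current = set(current_cells)
    kept = {cell: fpga for cell, fpga in previous.items() if cell in current}
    not_found = sorted(set(previous) - current)   # would be reported as warnings
    unassigned = sorted(current - set(previous))  # new logic left to the partitioner
    return kept, not_found, unassigned


if __name__ == "__main__":
    previous = {"top.cpu0": "mb1.uA", "top.dma": "mb1.uB", "top.old_ip": "mb1.uC"}
    current_cells = ["top.cpu0", "top.dma", "top.new_ip"]
    kept, not_found, unassigned = seed_from_previous(previous, current_cells)
    print("kept:", kept)
    print("not found:", not_found)
    print("to partition:", unassigned)
```

Because most cells fall into the kept group, only the small unassigned group can move, which is why this approach gives the most stable results from one RTL drop to the next.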

Conclusion
The ProtoCompiler partition and system route engines were designed from the ground up for efficient prototype
partitioning. The Target System Specification (TSS) and the Partition Constraints Format (PCF) provide simple
and intuitive controls for you to get the best possible results for your prototype while accounting for both the
topology of the HAPS system and the architecture of the ASIC design. Each step in ProtoCompiler is architected
for rapid iteration to help you to converge to a good solution and to get your prototype up and running quickly.
ProtoCompiler's HAPS-aware partitioning solution yields superior results compared to a generic partitioning tool
that is not aware of the special characteristics of the underlying prototyping platform.


For More Information


This paper complements the following documents, all of which are available for free from the Synopsys or
Xilinx websites.
• Automating SoC RTL to Operational Prototype, SNUG Silicon Valley 2014, Marceno and Jagtiani
• Faster Time to Prototype: A Rapid Bring-Up Methodology with HAPS-DX, White Paper
• Synopsys ProtoCompiler User Guide
• Synopsys ProtoCompiler Reference Guide
• Synopsys HAPS-70 Hardware Reference Guide
• FPGA-Based Prototyping Methodology Manual, Doug Amos, Austin Lesea, and René Richter (ISBN: 978-1-61730-004-2)
• Xilinx 7 Series FPGAs Overview, DS180
• Breaking the Three Laws Blog, Mick Posner, Synopsys.com

Glossary
Terms used with the Synopsys ProtoCompiler and automated partition engine.
Bins: Groups of logical elements of the design (like I/O ports) or physical elements of the HAPS system (like
FPGAs and clock sources) referenced by the Target System Specification (TSS).
System Route: Assignment of inter-FPGA nets to TDM logic and traces.
FPGA-Based Prototyping Methodology Manual (FPMM): A book co-authored by Synopsys and Xilinx containing
Design-For-Prototyping (DFP) best practices.
Global Route: An inter-FPGA net connection that has been assigned to a channel.
Local Random Logic: Logic other than hierarchical modules, such as gates, multiplexers and flip-flops, that is
instantiated in the defining netlist of a cell.
Partition Constraint Format (PCF): A Tcl-syntax description of constraints for the logic partition phase of the
ProtoCompiler partition engine.
Time Domain Multiplexing (TDM): A technique to reduce signal congestion between FPGAs by serializing
multiple signals through a single physical I/O.
Target System Specification (TSS): A Tcl-syntax description of a HAPS system's topology (daughter boards,
motherboards, interconnect, etc.).
Time To First Prototype (TTFP): The time elapsed between the RTL-available milestone and the
operational-prototype milestone in the ASIC/SoC prototype development schedule.

Synopsys, Inc. 700 East Middlefield Road Mountain View, CA 94043 www.synopsys.com
© 2014 Synopsys, Inc. All rights reserved. Synopsys is a trademark of Synopsys, Inc. in the United States and other countries. A list of Synopsys trademarks is
available at http://www.synopsys.com/copyright.html. All other names mentioned herein are trademarks or registered trademarks of their respective owners.
06/14.AP.CS4163.