
Systolic Algorithm Design: Hardware Merge Sort and Spatial FPGA Cell Placement Case Studies

Henry Barnor

Mentor: André DeHon

October 1, 2004

Contents

1 Introduction and motivation
  1.1 Algorithms in Hardware
  1.2 Systolic Nature of Hardware Algorithms
  1.3 Hardware Merge Sort
  1.4 Spatial FPGA cell placement

2 Hardware Merge Sort
  2.1 Algorithm: Overview
  2.2 Algorithm: Detailed Description
    2.2.1 Splitter
    2.2.2 Merger
    2.2.3 Queue
  2.3 Implementation
    2.3.1 Pipelining Proof
  2.4 Results

3 Spatial FPGA Cell Placement
  3.1 Algorithm Overview
  3.2 Implementation Progress
    3.2.1 Entropy Assembly
    3.2.2 Accumulator Assembly
    3.2.3 Swap Assembly
    3.2.4 SwapMemory Assembly
    3.2.5 Memory Assembly
  3.3 PositionUpdate Assembly
    3.3.1 Control Assembly

4 Methods
  4.1 Merge Sort Implementation

5 Conclusion

6 Future Work

7 Acknowledgements

List of Figures

1 Hardware Merge Sort: Algorithm Flow Chart
2 Hardware sort data structure
3 State Diagram for Splitter
4 Pseudo-Schematic for Splitter
5 State Diagram for Merger
6 Pseudo-Schematic for Merger
7 Systolic Placer: High level block diagram of PE

List of Tables

1 Operating frequency for different data widths with block RAM and distributed RAM implementations of the FIFO structure
2 Total resource usage for different data widths
3 Resource usage per component for different data widths

Abstract

The availability and increasing power of Field Programmable Gate Arrays (FPGAs) are causing a shift towards implementation of algorithms in spatially programmable hardware. This suggests that, in the future, basic algorithms will be implemented in programmable hardware to achieve higher performance than is possible with software running on sequential processors. Algorithms implemented on programmable gate arrays are inherently systolic. This project takes two algorithms that have previously been implemented in software to run on sequential-architecture processors and transforms them into systolic algorithms implemented in hardware. This is achieved by rethinking each algorithm in terms of a group of cells working together to achieve one aim, with each cell designed for a specific task in the flow of the algorithm.

1 Introduction and motivation

1.1 Algorithms in Hardware

Computation-intensive algorithms have, for the most part, been designed to run on everyday computing machines (sequential-architecture microprocessors). This invariably leads to a software implementation of such algorithms. To achieve gains in speed and computational power, these algorithms are parallelized and run on a connected grid of multiple sequential microprocessors. This trend is spurred by the availability and low per-unit cost of sequential-architecture microprocessors. However, recent trends in the per-unit cost and per-unit power of programmable logic chips are changing the status quo. The trend is to do as much as possible in hardware as opposed to software.

The shift in ideology is due not to cost alone but, more importantly, to gains in speed. Algorithms in hardware have access to distributed embedded memory, eliminating the memory-access bottleneck characteristic of sequential microprocessors. There is no time-sharing of processing power: the algorithm is the processor. In addition, we can have multiple processing elements running on the same chip without paying a huge speed cost for inter-process communication.

To sum up, algorithms in hardware have higher communication bandwidth, can exploit spatial parallelism, and have quicker memory access than the traditional software approach. With the decreasing cost of programmable logic, hardware algorithms are now practical.

1.2 Systolic Nature of Hardware Algorithms

A systolic system is defined as a network of processors which rhythmically compute and pass data through the system [2]. Algorithms that can be mapped to such a system are called systolic algorithms. Hardware algorithm design is best achieved by breaking the system into task-specific processing elements. Each such element computes on data it receives and sends the results to the next element for more computation if necessary. The end result is a pipelined, multi-processor system. This is essentially a systolic system, so we can conclude that hardware algorithms are inherently systolic in nature and that systolic algorithms can be easily implemented in hardware.

1.3 Hardware Merge Sort

Sorting is a basic and necessary computation for most computers. It is believed that up to 25 percent of non-numerical computer time is spent sorting [4]. This overhead can be eliminated by moving the sort to a specialized hardware component. The basic sequential methods of sorting are probably the most thoroughly understood topic in computer science. The same cannot be said of systolic methods, but with such a deep understanding of the sequential methods, do we need to develop new ones? Most sorting algorithms use a divide-and-conquer approach, and merge sort in particular is a natural iterative divide-and-conquer algorithm. This property allows us to design a systolic sorting processor.

1.4 Spatial FPGA cell placement

Reconfigurable computing is a hot topic in academia and research. A large and growing community of researchers has used field-programmable gate arrays (FPGAs) to accelerate computing applications and has achieved performance gains of one to two orders of magnitude over general-purpose processors [1]. The IEEE organized its 11th Symposium on Field-Programmable Custom Computing Machines in April 2003 [3], yet reconfigurable computing has not made it onto the consumer market. Before this can happen, a number of key drawbacks need to be overcome. One such drawback is the time required to map program logic to physical programmable resources every time the machine reconfigures itself for a task. Spatial FPGA cell placement is one possible solution to this drawback.

2 Hardware Merge Sort

2.1 Algorithm: Overview

Given n unsorted inputs, the algorithm proceeds by using the fact that each input by itself is sorted. Merging any two of these inputs in order will produce a sorted array of size 2. This size-2 sorted array can be merged in order with another size-2 sorted array to produce a size-4 sorted array. By merging iteratively in this way, the algorithm produces n sorted outputs. We note that there is a simple recurring structure in this divide-and-conquer algorithm: a split followed by a merge. By pipelining an array of processing elements to split and merge the input, we create a systolic merge sort machine.
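For example, sorting 16 inputs takes log2(16) = 4 split/merge stages: sixteen size-1 runs become eight size-2 runs, then four size-4 runs, two size-8 runs, and finally one fully sorted run of 16. This is exactly the 16-input/4-stage sorter characterized in Section 2.4.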

[Figure 1: Hardware Merge Sort: Algorithm Flow Chart]

As we prove in Section 2.3.1, a queue of size n+1 is sufficient to have the system running synchronously without any handshakes.

2.2 Algorithm: Detailed Description

Input: A sequence of k-bit values on which it is possible to do binary comparisons.

Output: An ordered permutation, from largest to smallest, of the input values.

Structure of Data: The data will be (k+2)-bit values. The two most significant bits are used as flags in the algorithm and do not contribute to the data value. These bits represent end of sorted subset (EOS) and end of input (EOI). The EOS bit signals to the algorithm the end of one sorted stream and the beginning of a new sorted stream; thus all inputs to the sorter start with their EOS bit set. The EOI bit is used to separate streams of numbers being sorted. The layout is shown in Figure 2.
[Figure 2: Hardware sort data structure. A (k+2)-bit word: the flag bits EOS (End of Sorted Subset) and EOI (End of Input) occupy the two most significant positions, above the k-bit data word b[k-1]..b[0].]
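To make the word layout concrete, the following VHDL package is a minimal sketch of the (k+2)-bit format. The package name, helper functions, and default width are illustrative assumptions, not declarations from the paper's code.

library ieee;
use ieee.std_logic_1164.all;

-- Sketch of the sort-word layout: two flag bits above k data bits.
-- Names and the default width K are assumptions for illustration.
package sort_word_pkg is
  constant K : integer := 16;           -- data width (the paper varies this)
  subtype sort_word is std_logic_vector(K+1 downto 0);
  constant EOI_BIT : integer := K + 1;  -- end of input
  constant EOS_BIT : integer := K;      -- end of sorted subset
  function eos(w : sort_word) return boolean;
  function eoi(w : sort_word) return boolean;
end package;

package body sort_word_pkg is
  function eos(w : sort_word) return boolean is
  begin
    return w(EOS_BIT) = '1';
  end function;
  function eoi(w : sort_word) return boolean is
  begin
    return w(EOI_BIT) = '1';
  end function;
end package body;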

Hardware Elements: There will be three independent hardware logic blocks in the algorithm: the splitter, the merger, and the queue.

2.2.1 Splitter

Takes a single stream of inputs and outputs two streams of numbers.

Input:
    input    - (k+2)-bit input data value
    input_da - data available signal
    clk      - clock signal
    reset    - reset signal

Output:
    output0 - (k+2)-bit output data value
    output1 - (k+2)-bit output data value
    rd_en   - read enable signal for the input queue
    wr_en0  - write enable signal to queue 0
    wr_en1  - write enable signal to queue 1

Functional Summary: Outputs its input on one output bus until it sees an active EOS bit, then switches to the other output bus. It asserts the write enable signal to the queue as it outputs data, and stops demanding and outputting data when the current queue becomes full.

Pseudo-Code for State Machine: In the pseudo-code, an assignment of 1 means enable and an assignment of 0 means disable. (A VHDL sketch of the same machine follows the figures below.)

S0: RD_EN = 1
    IF INPUT_DA = 1
        OUT0_REG = INPUT
        WR_EN0 = 1
        WR_EN1 = 0
        IF (!EOS(INPUT))
            GOTO S0
        ELSE
            GOTO S1
        ENDIF
    ELSE
        WR_EN0 = 0
        WR_EN1 = 0
        OUT0_REG = 0
    ENDIF
END S0
--

S1: RD_EN = 1
    IF INPUT_DA = 1
        OUT1_REG = INPUT
        WR_EN1 = 1
        WR_EN0 = 0
        IF (!EOS(INPUT))
            GOTO S1
        ELSE
            GOTO S0
        ENDIF
    ELSE
        WR_EN0 = 0
        WR_EN1 = 0
        OUT1_REG = 0
    ENDIF
END S1
--

[Figure 3: State Diagram for Splitter]

[Figure 4: Pseudo-Schematic for Splitter]
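The S0/S1 machine above maps directly onto a clocked VHDL process. The sketch below is one way to write it, using the generic width and port names listed earlier; it is an illustration, not the paper's actual implementation (which, in particular, also stops demanding data when the current queue is full).

library ieee;
use ieee.std_logic_1164.all;

-- Illustrative splitter FSM (not the paper's actual code). The EOS flag
-- is bit K of the (K+2)-bit word. Queue-full backpressure is omitted.
entity splitter is
  generic (K : integer := 16);
  port (
    clk      : in  std_logic;
    reset    : in  std_logic;
    input    : in  std_logic_vector(K+1 downto 0);
    input_da : in  std_logic;
    output0  : out std_logic_vector(K+1 downto 0);
    output1  : out std_logic_vector(K+1 downto 0);
    rd_en    : out std_logic;
    wr_en0   : out std_logic;
    wr_en1   : out std_logic
  );
end entity;

architecture rtl of splitter is
  type state_t is (S0, S1);
  signal state : state_t := S0;
begin
  rd_en <= '1';  -- demand-and-receive: this sketch always demands

  process (clk)
  begin
    if rising_edge(clk) then
      if reset = '1' then
        state  <= S0;
        wr_en0 <= '0';
        wr_en1 <= '0';
      elsif input_da = '1' then
        if state = S0 then
          output0 <= input;  -- OUT0_REG in the pseudo-code
          wr_en0  <= '1';
          wr_en1  <= '0';
        else
          output1 <= input;  -- OUT1_REG in the pseudo-code
          wr_en1  <= '1';
          wr_en0  <= '0';
        end if;
        if input(K) = '1' then  -- EOS: switch to the other output bus
          if state = S0 then state <= S1; else state <= S0; end if;
        end if;
      else
        wr_en0 <= '0';
        wr_en1 <= '0';
      end if;
    end if;
  end process;
end architecture;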

2.2.2 Merger

Takes two sorted inputs and produces a single sorted output of the two inputs.

Input:
    input0, input1       - (k+2)-bit input data values
    input0_da, input1_da - data available signals for input0 and input1 respectively
    clk                  - clock signal
    reset                - reset signal
    sys_en               - system enable signal

Output:
    output - (k+2)-bit output data value (sorted)
    rd_en0 - read enable signal for input queue 0
    rd_en1 - read enable signal for input queue 1
    wr_en  - write enable signal for the output queue

Functional Summary: Compares the two numbers on its inputs and outputs the greater of the two until it sees an EOS on one of its inputs. The EOS on one input causes it to pipe the other input through until it sees an EOS on that input as well.

[Figure 5: State Diagram for Merger]

Pseudo-Code for State Machine:

START: DEMAND A
       DEMAND B
       IF (NO A AND NO B) GOTO START
       IF A
           GOTO WAITB
       ELSIF B
           GOTO WAITA
       ELSE
           GOTO BOTH_AVAIL
       ENDIF
END START
--

WAITA: DEMAND A
       IF A
           GOTO BOTH_AVAIL
       ELSE
           GOTO WAITA
       ENDIF
END WAITA
--

WAITB: DEMAND B
       IF B
           GOTO BOTH_AVAIL
       ELSE
           GOTO WAITB
       ENDIF
END WAITB
--

BOTH_AVAIL: IF (A_IN > B_IN)
                OUT_REG = A_IN
                OUT_REG.EOS = 0
                DEMAND A
                IF (EOS(A_IN))
                    GOTO PASS_B
                ELSE
                    GOTO BOTH_AVAIL
                ENDIF
            ELSE
                OUT_REG = B_IN
                OUT_REG.EOS = 0
                DEMAND B
                IF (EOS(B_IN))
                    GOTO PASS_A
                ELSE
                    GOTO BOTH_AVAIL
                ENDIF
            ENDIF
END BOTH_AVAIL
--

PASS_A: IF (EOS(A_IN))
            GOTO START
        ELSE
            OUT_REG = A_IN
            DEMAND A
        ENDIF
END PASS_A
--

PASS_B: IF (EOS(B_IN))
            GOTO START
        ELSE
            OUT_REG = B_IN
            DEMAND B
        ENDIF
END PASS_B

[Figure 6: Pseudo-Schematic for Merger]
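The heart of the BOTH_AVAIL state is a compare-and-select: output the larger of the two words with its EOS bit cleared, and demand a replacement from the side that won. The following combinational VHDL sketch shows just that step; the entity name and ports are assumptions, and the surrounding state machine and handshaking are omitted.

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Compare-and-select core of the merger's BOTH_AVAIL state (sketch only).
entity merge_step is
  generic (K : integer := 16);
  port (
    a_in, b_in : in  std_logic_vector(K+1 downto 0);
    out_word   : out std_logic_vector(K+1 downto 0);
    take_a     : out std_logic  -- '1' = demand a new A, '0' = demand a new B
  );
end entity;

architecture rtl of merge_step is
begin
  process (a_in, b_in)
    variable winner : std_logic_vector(K+1 downto 0);
  begin
    -- Compare only the K data bits; the flags do not contribute to the value.
    if unsigned(a_in(K-1 downto 0)) > unsigned(b_in(K-1 downto 0)) then
      winner := a_in;
      take_a <= '1';
    else
      winner := b_in;
      take_a <= '0';
    end if;
    winner(K) := '0';  -- clear EOS: the merged stream is one sorted subset
    out_word  <= winner;
  end process;
end architecture;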

2.2.3 Queue

The queue is basically a hardware first-in first-out (FIFO) data structure.
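The paper's queues were generated with Xilinx Coregen (Section 4.1), so no VHDL for them appears in the text; the circular-buffer sketch below is only meant to make the queue's role concrete. The generics, including a depth of n+1 = 17 for the 16-deep case (see the proof in Section 2.3.1), are illustrative assumptions.

library ieee;
use ieee.std_logic_1164.all;

-- Generic synchronous FIFO sketch (the actual queues came from Coregen).
entity queue is
  generic (
    WIDTH : integer := 18;  -- (k+2)-bit words
    DEPTH : integer := 17   -- n+1, per the pipelining proof
  );
  port (
    clk, reset   : in  std_logic;
    wr_en, rd_en : in  std_logic;
    din          : in  std_logic_vector(WIDTH-1 downto 0);
    dout         : out std_logic_vector(WIDTH-1 downto 0);
    empty, full  : out std_logic
  );
end entity;

architecture rtl of queue is
  type mem_t is array (0 to DEPTH-1) of std_logic_vector(WIDTH-1 downto 0);
  signal mem            : mem_t;
  signal rd_ptr, wr_ptr : integer range 0 to DEPTH-1 := 0;
  signal count          : integer range 0 to DEPTH   := 0;
begin
  empty <= '1' when count = 0     else '0';
  full  <= '1' when count = DEPTH else '0';
  dout  <= mem(rd_ptr);  -- first word falls through

  process (clk)
  begin
    if rising_edge(clk) then
      if reset = '1' then
        rd_ptr <= 0; wr_ptr <= 0; count <= 0;
      else
        if wr_en = '1' and count < DEPTH then
          mem(wr_ptr) <= din;
          wr_ptr <= (wr_ptr + 1) mod DEPTH;
        end if;
        if rd_en = '1' and count > 0 then
          rd_ptr <= (rd_ptr + 1) mod DEPTH;
        end if;
        -- Net occupancy update, covering simultaneous read and write.
        if (wr_en = '1' and count < DEPTH) and not (rd_en = '1' and count > 0) then
          count <= count + 1;
        elsif (rd_en = '1' and count > 0) and not (wr_en = '1' and count < DEPTH) then
          count <= count - 1;
        end if;
      end if;
    end if;
  end process;
end architecture;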

2.3 Implementation

The system has been implemented in an industry-standard hardware description language (HDL), VHDL (VHSIC Hardware Description Language), and verified by simulation. Considerable effort was applied to optimizing the algorithm to output data every clock cycle. This was achieved in two ways. First, both the splitter and the merger were designed with an extra register/buffer to hold values before they were output. This allowed us to have data available to output on the rising edge of the clock. Second, a demand-and-receive model of handshaking was used. This replaced request-and-acknowledge handshaking, which takes more than one cycle to accomplish.

A better optimization would be to get rid of all handshaking and have the system run synchronously. This can only be achieved if we can guarantee that there will be no stall in the system. By making the queue sizes arbitrarily large, we can ensure that no stalls occur. However, this increases the number of resources used, which is not an optimal solution. We suggest and prove below that, instead of making the queue size arbitrarily large, a queue size of n+1 is sufficient to ensure that no stalls occur in the system.

2.3.1 Pipelining Proof

We consider two arbitrary empty queues, A and B, each of size n. We note that 2n clocks would fully fill up both queues without stalling. Without loss of generality, we assume that queue A was filled first. On the (2n+1)-th clock, queue A is scheduled to receive an input. Two things can happen in that case.

CASE 1: Queue A has not been popped. If queue A has not been popped, then A is full and the input can go into the spare queue register/buffer that we claim is sufficient. The fact that queue A has not been popped implies that queue A is sorted; this requires that for the next n clocks A will be popped, freeing up space for more inputs on A. Thus there will be no stall, provided the queue has the extra register/buffer.

CASE 2: Queue A has been popped. Assume m (m > 0) A's have been popped. On the (2n+1)-th clock we have no problem, since space is available for at least one more input. For subsequent clocks, the worst case occurs if only B's are popped. We note that if m A's have been popped in the first 2n clocks, then n - m B's have been popped, leaving us with m B's. Thus only m B's can be popped, meaning only m A inputs will be pushed onto the A queue before an A is popped, and we therefore will not have any stalls in the system.
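As a concrete instance of Case 2, take n = 4 and m = 2. In the first 8 clocks, two A's and two B's were popped, so two B's remain in queue B and two slots are free in queue A. In the worst case only B's are popped from here on, but only those two remaining B's can be; during those clocks at most two new inputs are pushed onto A, exactly filling its free slots, after which an A must be popped. Hence no stall occurs.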

2.4 Results

A 16-input/4-stage sorter was implemented and characterized for this paper. The code was run through the full FPGA design flow except for generating a programming bit-file for the FPGA. Table 1 contains running-speed data obtained from Xilinx ISE. Total resource usage in terms of LUTs (look-up tables, the basic unit of FPGA resources) was also obtained from Xilinx ISE and is shown in Table 2. The Synplify Pro synthesis tool was used to obtain resource usage for the merger and splitter components, whereas Xilinx Coregen provided footprint data for the queue components.

In all three tables, data is presented for two cases:

Distributed memory - the queue component is implemented using embedded distributed memory.
Block RAM - the queue component is implemented using dedicated blocks of memory.

Data Width                 8      16     32    64
Distributed Memory (MHz)   117.4  105.5  92.2  71.6
Block RAM (MHz)            85.8   86.0   61.7  n/a

Table 1: Operating frequency for different data widths with block RAM and distributed RAM implementations of the FIFO structure.

Data is not available for the 64-bit block RAM implementation because it required more block RAM than was available on the Spartan-3 FPGA in the current configuration.

Data Width                      8    16   32    64
Distributed Memory (# of LUTs)  674  947  1374  2182
Block RAM (# of LUTs)           617  812  1076  n/a
Block RAM (# of block RAMs)     10   10   10    n/a

Table 2: Total resource usage for different data widths.

Data Width                           8      16      32   64
Merger (# of LUTs)                   56-64  90-123  118  159-221
Splitter (# of LUTs)                 23     39      71   135
Distributed Memory FIFO (# of LUTs)  41     49      65   97
Block RAM FIFO (# of LUTs)           37     37      37   37
Block RAM FIFO (# of block RAMs)     1      1       1    2

Table 3: Resource usage per component for different data widths.

3 Spatial FPGA Cell Placement

3.1 Algorithm Overview

Simulated annealing is a technique for solving optimization problems. It is based on the manner in which crystals form from liquids or gases [5]. At high temperatures, the molecules of the crystal move randomly. As the temperature decreases, they move less and settle into their final crystalline positions. In solving the placement problem with simulated annealing, each element to be placed acts as a molecule and tries to find a position of low energy (i.e., a position of least contribution to global cost). However, a simulated system temperature adds stochastic behavior by causing elements to make random moves that do not necessarily decrease global cost. As the system temperature drops, such random movements decrease until the elements settle into their final positions.

To apply simulated annealing to placement, given a circuit for placement on an FPGA, divide the FPGA into a systolic array of processing elements and randomly assign each logic block in the circuit to a processing element. Each processing element then considers swapping positions with each of its neighbours in turn: swap regardless of the change in cost if the system temperature is high enough; otherwise, swap based on the reduction in total cost [6].

[Figure 7: Systolic Placer: High level block diagram of PE. The Control assembly issues the swap decision and a Done signal; the Entropy, Swap, SwapMemory, Accumulator, Memory, and PositionUpdate assemblies are connected through a neighbour mux that selects among the Up, Left, Right, and Down neighbours and carries the external delta cost.]

3.2 Implementation Progress

Implementation of the placer is still ongoing at the time of writing. VHDL is being used to implement the algorithm in hardware. The recurring structure in this algorithm is the processing element that models the crystal molecule.

Processing Element: The processing element can be broken into seven major blocks: the Entropy assembly, the Accumulator assembly, the Memory assembly, the PositionUpdate assembly, the Swap assembly, the SwapMemory assembly, and finally the Control assembly.

3.2.1 Entropy Assembly

The entropy assembly is responsible for the randomness and the cooling schedule of the simulated annealing process. It consists of two main elements: a variable cooling schedule that can be set at synthesis time, and a linear-feedback shift register (LFSR) used as a random number generator. The initial system will only have a linearly decreasing cooling schedule. More cooling schedules can be added using the VHDL feature that allows a designer to define more than one architecture for an entity. The possibility of changing the cooling schedule at run-time will also be explored.
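As an illustration of the random-number element, a maximal-length 16-bit Fibonacci LFSR fits in a few lines of VHDL. The width, the taps (16, 15, 13, 4), and the seed below are assumptions; the paper does not specify its LFSR parameters.

library ieee;
use ieee.std_logic_1164.all;

-- Illustrative 16-bit maximal-length Fibonacci LFSR (taps 16, 15, 13, 4).
entity lfsr16 is
  port (
    clk, reset : in  std_logic;
    enable     : in  std_logic;
    rand_out   : out std_logic_vector(15 downto 0)
  );
end entity;

architecture rtl of lfsr16 is
  signal reg : std_logic_vector(15 downto 0) := x"ACE1";  -- any non-zero seed
begin
  rand_out <= reg;

  process (clk)
    variable fb : std_logic;
  begin
    if rising_edge(clk) then
      if reset = '1' then
        reg <= x"ACE1";
      elsif enable = '1' then
        fb  := reg(15) xor reg(14) xor reg(12) xor reg(3);  -- taps 16,15,13,4
        reg <= reg(14 downto 0) & fb;
      end if;
    end if;
  end process;
end architecture;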

3.2.2 Accumulator Assembly

This block calculates the delta cost for the PE. It breaks down further into the following blocks.

CurrentCost Accumulator: Based on the belief that the primitive for a placement engine is position (i.e., all cost functions can be expressed as functions of position), the CurrentCost Accumulator calculates the PE's contribution to global cost using the information it has about the positions of the logic blocks it is connected to. This block will also be implemented to allow the cost function to be changed at synthesis time or run time.

Position: For the purposes of this implementation, a position will be a four-tuple data structure consisting of the XY-position of the upper-left corner, a width, and a height.

HypoCost Accumulator: This is the counterpart to the CurrentCost Accumulator. It calculates the cost assuming a swap is made with the current neighbour under consideration.

Diff Accumulator: The Diff Accumulator calculates the delta cost for the PE if the swap is taken.
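A VHDL sketch of the four-tuple position is shown below, together with a Manhattan-distance helper of the kind a wirelength-style cost might be built from. The record fields, coordinate ranges, and the distance function are assumptions; the paper deliberately leaves the cost function configurable.

-- Illustrative four-tuple position and a hypothetical cost primitive.
package position_pkg is
  type position_t is record
    x, y   : integer range 0 to 255;  -- upper-left corner (range assumed)
    width  : integer range 0 to 255;
    height : integer range 0 to 255;
  end record;

  function manhattan(a, b : position_t) return integer;
end package;

package body position_pkg is
  function manhattan(a, b : position_t) return integer is
  begin
    return abs (a.x - b.x) + abs (a.y - b.y);
  end function;
end package body;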

3.2.3 Swap Assembly

The swap assembly decides whether a swap is to be made. A swap is made if the entropy signal is high, irrespective of delta costs. If the entropy signal is low, it considers the total delta cost and issues a swap decision based on that.
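The decision rule is simple enough to be purely combinational. The sketch below assumes the entropy signal and a signed total delta cost as inputs; the port names are hypothetical.

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Combinational swap decision (sketch): swap when entropy forces it,
-- otherwise swap only if the move reduces total cost.
entity swap_assembly is
  port (
    entropy       : in  std_logic;            -- high => swap regardless of cost
    delta_total   : in  signed(15 downto 0);  -- total delta cost if swapped
    swap_decision : out std_logic
  );
end entity;

architecture rtl of swap_assembly is
begin
  swap_decision <= '1' when entropy = '1' or delta_total < 0 else '0';
end architecture;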

3.2.4 SwapMemory Assembly

This block handles swapping the PE's memory with a neighbour's whenever a swap is made.

3.2.5 Memory Assembly

The processing element keeps track of the logic blocks that are connected to it in the memory assembly. This information is necessary to calculate costs.

3.3 PositionUpdate Assembly

Whenever a logic block performs a swap, it needs to alert the logic blocks connected to it so that cost-function calculations reflect the current state of the system. The PositionUpdate assembly of each processing element acts as a leaf node on an H-tree. An H-tree allows us to propagate the update information such that all processing elements of the array receive the information at the same time.

3.3.1 Control Assembly

The control block will be a finite state machine responsible for initiating a cycle of swap decisions. This involves enabling the appropriate neighbour through the input decoder (mux), starting the swap-decision process, and initiating an actual swap if necessary. A VHDL sketch follows the pseudo-code.

Pseudo-Code for State Machine:

S0:   clock entropy
      initiate AccumulatorAssembly
      enable next neighbour in mux
      GOTO WAIT

WAIT: IF swapdecision = 1
          initiate SwapMemoryAssembly
          GOTO SWAP
      ELSE
          GOTO S0
      ENDIF

SWAP: IF swap = done
          GOTO S0
      ENDIF
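A direct VHDL rendering of this three-state machine might look like the sketch below. The handshake signal names (acc_start, swap_start, swap_done, and so on) are assumptions for illustration.

library ieee;
use ieee.std_logic_1164.all;

-- Sketch of the control FSM; handshake signal names are hypothetical.
entity control_assembly is
  port (
    clk, reset    : in  std_logic;
    swap_decision : in  std_logic;  -- from the Swap assembly
    swap_done     : in  std_logic;  -- from the SwapMemory assembly
    entropy_clk   : out std_logic;  -- advance the LFSR / cooling schedule
    acc_start     : out std_logic;  -- kick off the Accumulator assembly
    next_neigh    : out std_logic;  -- advance the neighbour mux
    swap_start    : out std_logic   -- kick off the SwapMemory assembly
  );
end entity;

architecture rtl of control_assembly is
  type state_t is (S0, WAIT_DEC, SWAP);
  signal state : state_t := S0;
begin
  process (clk)
  begin
    if rising_edge(clk) then
      -- defaults: all strobes low, so each is a one-cycle pulse
      entropy_clk <= '0'; acc_start <= '0';
      next_neigh  <= '0'; swap_start <= '0';
      if reset = '1' then
        state <= S0;
      else
        case state is
          when S0 =>
            entropy_clk <= '1';  -- clock entropy
            acc_start   <= '1';  -- initiate AccumulatorAssembly
            next_neigh  <= '1';  -- enable next neighbour in mux
            state       <= WAIT_DEC;
          when WAIT_DEC =>
            if swap_decision = '1' then
              swap_start <= '1'; -- initiate SwapMemoryAssembly
              state      <= SWAP;
            else
              state      <= S0;
            end if;
          when SWAP =>
            if swap_done = '1' then
              state <= S0;
            end if;
        end case;
      end if;
    end if;
  end process;
end architecture;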

4 Methods

4.1 Merge Sort Implementation

Code Generation: The majority of the code for the sorter was written by hand in a text editor. The FIFO structures, however, were generated using Xilinx Core Generator 6.2.03i (http://www.xilinx.com/products/logicore/coregen/) with the following settings:

Memory type: distributed memory / block RAM
Data width: varied
FIFO depth: 16 (minimum depth possible)

Code Simulation: The VHDL code was verified by simulation using ModelSim SE PLUS 5.7G from Mentor Graphics (http://www.model.com/products/60/se.asp). The verification was done by writing a testbench VHDL file that simulated inputs to the design. The post-synthesis generated VHDL file was also simulated and verified using the same testbench.

Synthesis: Synplify Pro 7.6.1 (http://www.synplicity.com/products/synplifypro/) was used to synthesize the code into an EDIF file for a Xilinx Spartan-3 XC3S400 part with a speed grade of -4 (the speed grade specifies how fast the FPGA can run). Synthesis was done with a user-specified clock constraint of 200 MHz for the design. A mapped VHDL netlist was also generated and verified as explained above.

Placement and Routing: Placement and routing of the design onto the Spartan-3 chip was done using Xilinx ISE 6.2.03i (http://www.xilinx.com/products/). Inputs to the ISE were the Coregen project files and the Synplify-generated EDIF file.

5 Conclusion

Frequency/Speed: We note from Table 1 that frequency decreases as data width increases for the distributed-memory implementation, whereas frequency is relatively constant for block RAM. This hints that the block RAM is the limiting factor. However, Xilinx data sheets claim that block RAMs can run at speeds close to 200 MHz. We conclude that the speed limitation must instead be due to inadequate pipelining between the block RAM and the other components.

Resource Usage: A cursory examination would suggest that both the block-RAM and distributed-RAM implementations use more resources as data width increases. A closer examination shows that distributed RAM has a higher rate of increase. Considering the data in Table 3, we note that up until about 64 bits of data width, the block-RAM FIFO adds a constant resource usage, as opposed to the increasing resource usage of the distributed-memory FIFO.

We conclude from this that we can build a system that uses distributed memory for the smaller merges in the algorithm flow and switches over to block RAM for the bigger merges. This would enable us to implement a bigger design on a single chip and to use the available resources more efficiently.

6 Future Work

The conclusions drawn above point us in the direction of doing a better job of pipelining the data. This would enable the system to run at higher speeds than have been demonstrated in this paper.

7 Acknowledgements

I would like to thank my advisor, Professor DeHon, first and foremost for his support and guidance, and secondly for taking a chance on me and welcoming me into his lab even though I lacked the necessary background. I would also like to thank Nachiket Kapre for explaining to me innumerable times how to think about VHDL code. I am also grateful to Michael Wrighton for helping me understand his thesis work [6]. My gratitude also goes to the SURF committee for giving me the opportunity to do summer research. Thank you to everyone at the IC lab.

References

[1] André DeHon. The density advantage of configurable computing. Computer, 33(4):41-49, April 2000.

[2] David J. Evans. Systolic algorithms. In David J. Evans, editor, Systolic Algorithms, number 3 in Topics in Computer Mathematics. Gordon and Breach, 1991.

[3] IEEE. Field-Programmable Custom Computing Machines, 11th Annual IEEE Symposium on, April 2003.

[4] G. M. Megson. An Introduction to Systolic Algorithm Design. Oxford University Press, 1992.

[5] Maogang Wang, Majid Sarrafzadeh, and Xiaojian Yang. Modern Placement Techniques. Kluwer Academic Publishers, Norwell, USA, 2003.

[6] Michael Wrighton. Spatial Approach to FPGA Cell Placement by Simulated Annealing. Master's thesis, California Institute of Technology, 2003.
