
Project Report

Flow Based Load Balancer on NETFPGA














CHAPTER 1
1. Abstract:
Two main kinds of load balancer are currently available in the market: the Layer 7 load balancer and
the Layer 4 load balancer. A Layer 7 load balancer runs at the application layer; this incurs a large
overhead and hence high latency, but it provides additional features such as protection against DoS
attacks and content caching. A Layer 4 load balancer runs at the network layer and has lower latency,
but it does not provide the additional features of a Layer 7 load balancer. This gives us the
motivation of combining the strengths of both load balancers into one. In this project, we
propose an FPGA-based solution that routes packets to the least recently used node.
Operating at the physical layer, it does not suffer from large overhead. The design is compared
against a socket program run on the reference router in DeterLab. The main benefits of this project
are low latency and the ability to reroute packets on the fly.
2. Introduction
The NetFPGA is a low-cost platform, primarily designed as a tool for teaching networking
hardware and router design. It has also proved to be a useful tool for networking researchers.
Through partnerships and donations from the project's sponsors, the NetFPGA is widely
available to students, teachers, researchers, and anyone else interested in experimenting with
new ideas in high-speed networking hardware.
2.1 Usage Models
At a high level, the board contains four 1 Gigabit/second Ethernet (GigE) interfaces, a user
programmable Field Programmable Gate Array (FPGA), and four banks of locally-attached
Static and Dynamic Random Access Memory (SRAM and DRAM). It has a standard PCI
interface allowing it to be connected to a desktop PC or server. A reference design can be
downloaded from the http://NetFPGA.org website that contains a hardware-accelerated Network
Interface Card (NIC) or an Internet Protocol Version 4 (IPv4) router that can be readily
configured into the NetFPGA hardware. The router kit allows the NetFPGA to interoperate with
other IPv4 routers.





The NetFPGA offloads processing from a host processor. The host's CPU has access to main
memory and can use DMA to read and write registers and memories on the NetFPGA. Unlike other
open-source projects, the NetFPGA provides a hardware-accelerated datapath. The
NetFPGA provides a direct hardware interface connected to four GigE ports and multiple banks
of local memory installed on the card. NetFPGA packages (NFPs) are available that contain
source code (both hardware and software) implementing networking functions. Using the
reference router as an example, there are three main ways that a developer can use an NFP. In
the first usage model, the default router hardware is configured into the FPGA and the
software is modified to implement a custom protocol.


Another way to modify the NetFPGA is to start with the reference router and extend the design
with a custom user module. Finally, it is also possible to implement a completely new design
where the user can place their own logic and data processing functions directly in the FPGA.
1. Use the hardware as is as an accelerator and modify the software to implement new
protocols. In this scenario, the NetFPGA board is programmed with IPv4 hardware and
the Linux host uses the Router Kit Software distributed in the NFP. The Router Kit
daemon mirrors the routing table and ARP cache from software to the tables in the
hardware allowing for IPv4 routing at line rate. The user can modify Linux to implement
new protocols and test them using the full system.
2. Start with the provided hardware from the official NFP (or from a third-party NFP),
modify it by using modules from the NFP's library or by writing your own Verilog code,
then compile the source code using industry standard design tools. The implemented
bitfile can then be downloaded to the FPGA. The new functionality can be complemented
by additional software or modifications to the existing software. For the IPv4 router, an
example of this would be implementing a Trie longest prefix match (LPM) lookup
instead of the currently implemented CAM LPM lookup for the hardware routing table.
Another example would be to modify the router to implement NAT or a firewall.
3. Implement a new design from scratch: The design can use modules from the official
NFP's library or third party modules to implement the needed functionality or can use
completely new source code.
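The trie-based LPM mentioned in the second usage model can be sketched in software. The following is a hypothetical Python model of a binary trie for IPv4 longest-prefix match (not the NFP's Verilog; class and method names are illustrative):

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # bit (0 or 1) -> TrieNode
        self.next_hop = None  # set if a prefix terminates at this node

class LPMTrie:
    """Binary trie for IPv4 longest-prefix match."""
    def __init__(self):
        self.root = TrieNode()

    def insert(self, prefix, length, next_hop):
        # Walk the prefix bit by bit from the most significant bit.
        node = self.root
        for i in range(length):
            bit = (prefix >> (31 - i)) & 1
            node = node.children.setdefault(bit, TrieNode())
        node.next_hop = next_hop

    def lookup(self, addr):
        # Remember the deepest matching prefix seen so far.
        node, best = self.root, None
        for i in range(32):
            if node.next_hop is not None:
                best = node.next_hop
            child = node.children.get((addr >> (31 - i)) & 1)
            if child is None:
                break
            node = child
        else:
            if node.next_hop is not None:
                best = node.next_hop
        return best
```

For example, with 10.0.0.0/8 and 10.1.0.0/16 installed, an address under 10.1/16 matches the longer prefix while other 10.x addresses fall back to the /8.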




2.2 Major Components
The NetFPGA platform contains one large Xilinx Virtex2-Pro 50 FPGA which is programmed
with user-defined logic and has a core clock that runs at 125MHz. The NetFPGA platform also
contains one small Xilinx Spartan II FPGA holding the logic that implements the control logic
for the PCI interface to the host processor.
Two 18 MBit external Cypress SRAMs are arranged in a configuration of 512k words by 36 bits
(4.5 Mbytes total) and operate synchronously with the FPGA logic at 125 MHz. One bank of
external Micron DDR2 SDRAM is arranged in a configuration of 16M words by 32 bits (64
MBytes total). Using both edges of a separate 200 MHz clock, the memory has a bandwidth of
400 MWords/second (1,600 MBytes/s = 12,800 Mbits/s).
The Broadcom BCM5464SR Gigabit/second external physical-layer transceiver (PHY) sends
packets over standard category 5, 5e, or 6 twisted-pair cables. The quad PHY interfaces with four
Gigabit Ethernet Media Access Controllers (MACs) instantiated as a soft core on the FPGA. The
NetFPGA also includes two interfaces with Serial ATA (SATA) connectors that enable multiple
NetFPGA boards in a system to exchange traffic directly without use of the PCI bus.



















CHAPTER 2

Pipeline:

The underlying processor is a general-purpose processor with five pipeline stages: IF, ID, EX,
MEM, WB. The processor employs pipelining so that all parts of the processing and memory
systems can operate continuously. Typically, while the result of one instruction is written back
to the register file, its successor is performing a memory operation, a third instruction is being
executed, a fourth instruction is being decoded and a fifth instruction is being fetched from
instruction memory.
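The overlap described above can be visualized with a small simulation. This is an illustrative Python sketch (not the processor's RTL) assuming one instruction issues per cycle with no stalls:

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def pipeline_schedule(n_instructions):
    """Return {cycle: [(instr, stage), ...]} for an ideal 5-stage pipeline.
    Instruction i occupies stage k during cycle i + k."""
    schedule = {}
    for instr in range(n_instructions):
        for stage_idx, stage in enumerate(STAGES):
            cycle = instr + stage_idx
            schedule.setdefault(cycle, []).append((instr, stage))
    return schedule

# In cycle 4 all five stages are busy at once: instruction 0 is in WB,
# 1 in MEM, 2 in EX, 3 in ID and 4 in IF, exactly as described above.
```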



































2.1 IF Stage:

This is the first stage of the pipeline, where the processor fetches the instruction from instruction
memory. The instruction memory is logically divided into two memories to implement
multithreading, and it is ensured that each thread accesses a different instruction memory. As shown
in the figure, two program counters are required to access the instruction memory. The instruction
width is 32 bits. The thread scheduler determines which program counter becomes active for a
particular thread.


[Fig: IF stage. Two program counters, PC-0 and PC-1 (clocked by clk_pc0 and clk_pc1), are selected by the thread scheduler and address Instr-Mem1 and Instr-Mem2; the 32-bit Instruction(31:0) is latched into the IF/ID register.]









2.2 ID Stage:

Instruction decoding is carried out in this stage; that is, the control signals required for executing
instructions are generated here. The decoding logic determines which type of instruction has been
brought in, for example whether the incoming instruction is R-type, LW or SW, and
correspondingly generates the control signals required for executing it.



[Fig: ID stage. Instruction(31:0) from IF/ID feeds the Control Unit (Opcode(3:0), Func(3:0)) and the Register Bank (rs_in(4:0), rt_in(4:0), regwrite, wdata(63:0), waddr(4:0)); the ID/EX register receives rs_data(63:0), rt_data(63:0), rd_add(4:0), rt_addr(4:0), Offset(8:0), ALU sel(3:0) and five control signals.]


















2.3 EX Stage:

The following operations are done in the EX stage:

1. The ALU performs the arithmetic or logical operation for register-to-register instructions.
2. The ALU calculates the data address for load and store instructions.
3. The jump instruction is executed, taking the thread id into consideration.
4. The branch condition is evaluated and, if found true, the branch instruction is executed.


The ALU block is capable of performing operations like addition, subtraction, set less than
(SLT), logical AND, OR, etc. The operation that the ALU performs is selected by the four ALU
control signals generated from the instruction in the ID stage.


[Fig: EX stage. The ALU (controlled by ALU SELECT(3:0) and Alu_src) operates on rs_data(63:0) and either rt_data(63:0) or the sign-extended offset mem_off(63:0) produced from Offset(8:0); an equality checker and branch-and-jump logic (Branch, Jump, Thread_id) drive the PC address selector with PC_OUT(8:0); Reg_dest selects between rt(4:0) and rd(4:0) for wr_in(4:0); out(63:0), rt_out(63:0) and the control signals Regwrite, mem_en and mem_to_reg pass into the EX/MEM register.]








2.4 MEM Stage:

The MEM stage performs the following operations:

1. It performs memory accesses for load and store instructions on the FIFO.
2. It stores the packet information in the FIFO, which can later be used by the processor to
perform operations on it.





[Fig: MEM stage. Alu_out(63:0) and rt_out(63:0) from EX/MEM access the FIFO (Fifo_add/Fifo_data versus proc_add/proc_data, holding header info and payload info); wr_add(4:0), regwrite, mem_to_reg, Mem_out(63:0) and Alu_out(63:0) pass into the MEM/WB register.]








2.5 WB stage:

The WB stage selects the data from either the ALU or the FIFO and sends the result to the ID stage
for writing into the register bank.


[Fig: WB stage. A multiplexer selects between Alu_out(63:0) and Mem_out(63:0) using Mem_to_reg; wdata(63:0), waddr(4:0) and Reg_write are sent back to the register bank.]



CHAPTER 3
3.1 Multithreaded Multicore architecture:

Each processor is designed to run two threads, so that it can exploit thread-level parallelism on
the incoming packets. Fine-grained multithreading is employed, executing the threads alternately.
The thread id is obtained by dividing the input clock frequency by 2. The two program counters are
enabled by two different clocks generated by the CE signal in the thread scheduler.

There are two such dual-threaded cores that process packets simultaneously, which increases the
throughput of the router.



[Fig: Multithreaded multicore architecture. in_fifo_data_p feeds two FIFOs, each serving a dual-threaded processor with its own Header Parser, Re-routing and Checksum, B/W Calculator and Output Port Lookup blocks.]


3.2 Thread Scheduler:
The thread scheduler uses a T flip-flop and a demux to generate thread ids. The two clock enables
CE_PC0 and CE_PC1 for the program counters are generated using a T flip-flop as shown in the
figure below.
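The toggling behaviour can be modelled in software. A minimal Python sketch of a T flip-flop whose output and its inverse serve as the two program-counter clock enables (the initial state and function name are assumptions, not taken from the RTL):

```python
def thread_enables(n_cycles):
    """Model a T flip-flop toggling every cycle; Q drives CE_PC0 and
    its inverse drives CE_PC1, so the two threads alternate strictly."""
    q = 0
    enables = []
    for _ in range(n_cycles):
        ce_pc0, ce_pc1 = q, 1 - q
        enables.append((ce_pc0, ce_pc1))
        q ^= 1  # T input tied high: toggle on each clock edge
    return enables
```

Exactly one enable is active per cycle, which is what guarantees the two threads never fetch in the same cycle.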


[Fig: Thread scheduler. A T flip-flop (inputs T, CE, clk, clr, with T tied to vdd) drives a demux that produces Thread1_ID and Thread2_ID; a second T flip-flop with an inverter generates the clock enables CE_PC0 and CE_PC1.]







3.3 Working of FIFO and Processor:


There are three states, namely WriteFIFO, ReadFIFO and CPU. In the
WriteFIFO state the packet comes into the pipeline and is stored in the FIFO memory. The CPU
processes the packet in the CPU state. After the CPU has operated on the packets, they are read out
from the FIFO in the ReadFIFO state.
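The three-state control loop can be modelled as follows. This is a hypothetical Python sketch; the state-to-output mapping is read loosely from the state diagram and the transition predicates are simplified, so both should be checked against the RTL:

```python
# Per-state control outputs, as suggested by the state diagram.
OUTPUTS = {
    "WriteFIFO": dict(fallthrough_read=0, fifo_mem_select=0,
                      program_ctr_clr=1, out_rdy_en=0),
    "CPU":       dict(fallthrough_read=0, fifo_mem_select=1,
                      program_ctr_clr=0, out_rdy_en=0),
    "ReadFIFO":  dict(fallthrough_read=1, fifo_mem_select=0,
                      program_ctr_clr=1, out_rdy_en=1),
}

def next_state(state, packet_stored=False, cpu_idle=False, readout_done=False):
    """Simplified transition function for the WriteFIFO -> CPU -> ReadFIFO loop."""
    if state == "WriteFIFO" and packet_stored:
        return "CPU"
    if state == "CPU" and cpu_idle:       # CPU_idle in the diagram
        return "ReadFIFO"
    if state == "ReadFIFO" and readout_done:
        return "WriteFIFO"
    return state
```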


[Fig: State diagram. Three states, WriteFIFO, CPU and ReadFIFO, each setting the control outputs fallthrough_read, fifo_mem_select, program_ctr_clr and out_rdy_en; transitions are driven by ~RPeqWP || (RPeqWP & ~firstword) and by CPU_idle.]












3.4 User Datapath:

The following figure shows the components of the router as implemented on the NetFPGA.


[Figure: User datapath. Input Arbiter -> Load Balancer -> Output Port Lookup -> Output Queues, attached to the Ethernet interfaces nf2c0, nf2c1, nf2c2 and nf2c3.]














CHAPTER 4

4.1 The router consists of three Hardware Accelerators:
4.1.1 Header Parser:
This hardware accelerator extracts the source IP address, destination port number and protocol,
which facilitates the implementation of the 3-tuple method of {IP, Port, Proto}. As a packet enters the
stage, the Header Parser pulls the relevant fields from the packet and concatenates them. This forms
the flow header.
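The concatenation step can be expressed directly. This hypothetical Python helper packs the 3-tuple into one integer; the field widths (32-bit IP, 16-bit port, 8-bit protocol) are taken from the header parser's pins, while the bit positions chosen here are illustrative:

```python
def flow_header(src_ip, dst_port, proto):
    """Concatenate {IP, Port, Proto} into a 56-bit flow header:
    src_ip in bits [55:24], dst_port in [23:8], proto in [7:0]."""
    assert src_ip < (1 << 32) and dst_port < (1 << 16) and proto < (1 << 8)
    return (src_ip << 24) | (dst_port << 8) | proto
```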

4.1.2 Re-Routing and Checksum Calculation:
A hash-based algorithm is used for classifying the packets. First a lookup
table is created for flows, so that each of its entries corresponds to a flow. The hash of each packet's
3-tuple field is calculated and stored in the CAM lookup table. When a packet comes in, the hash
calculated from its 3-tuple field is used to access the lookup table, and the index obtained points to
another block memory which stores the destination IP address and destination port number to which
the packet needs to be routed.
        If the hash of the 3-tuple field does not match any hash-table entry, the packet is routed
to the output link that has the minimum bandwidth, and the lookup table is updated with this entry.
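The lookup-and-miss behaviour just described can be sketched in software. In this hypothetical Python model a dict stands in for the CAM and block memory, and the hash function and table size are purely illustrative (the report does not specify the hardware hash):

```python
TABLE_SIZE = 256  # illustrative; the real CAM depth may differ

def flow_hash(src_ip, dst_port, proto):
    """Toy hash of the 3-tuple; stands in for the hardware hash function."""
    return (src_ip ^ (dst_port << 8) ^ proto) % TABLE_SIZE

class FlowTable:
    def __init__(self, min_bw_link):
        self.table = {}                  # hash index -> (dst_ip, dst_port)
        self.min_bw_link = min_bw_link   # callable: current least-loaded node

    def route(self, src_ip, dst_port, proto):
        idx = flow_hash(src_ip, dst_port, proto)
        if idx in self.table:            # hit: existing flow keeps its node
            return self.table[idx]
        dest = self.min_bw_link()        # miss: pick the least-loaded node
        self.table[idx] = dest           # and pin this flow to it
        return dest
```

Pinning the flow on first sight is what keeps all packets of one flow on the same server node.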
        Once the packet gets its destination IP address and port number, the checksum of the packet
needs to be re-calculated and modified. The new checksum is calculated using the formula:

H_new = ~(~H_old + ~SUM_old + SUM_new)

where H_old is the old checksum, H_new is the new checksum, and SUM_old and SUM_new are the
ones' complement sums computed over the source IP address, destination IP address, source port,
destination port and window size with the old and new values respectively (the only fields that
change their value). "~" indicates a bitwise complement operation, and "+" indicates a ones'
complement sum operation.
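The incremental update formula can be checked numerically against a full recomputation. The following Python sketch implements 16-bit ones' complement arithmetic; the field values used in testing are made up:

```python
MASK = 0xFFFF

def oc_add(a, b):
    """16-bit ones' complement addition with end-around carry."""
    s = a + b
    return (s & MASK) + (s >> 16)

def oc_sum(words):
    total = 0
    for w in words:
        total = oc_add(total, w)
    return total

def checksum(words):
    """Internet checksum: complement of the ones' complement sum."""
    return (~oc_sum(words)) & MASK

def update_checksum(h_old, old_fields, new_fields):
    """Incremental update: H_new = ~(~H_old + ~SUM_old + SUM_new),
    all arithmetic in ones' complement."""
    t = oc_add((~h_old) & MASK, (~oc_sum(old_fields)) & MASK)
    t = oc_add(t, oc_sum(new_fields))
    return (~t) & MASK
```

Replacing only the changed 16-bit words through update_checksum yields the same result as recomputing the checksum over the whole header, which is why the hardware never needs to re-read the unchanged fields.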


4.1.3 Minimum Bandwidth Calculation:
This accelerator finds the output link with the minimum bandwidth. The bandwidth is calculated
over a time interval of T seconds: the total number of bits sent out on each link during the interval
determines that link's bandwidth.
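The measurement can be modelled with per-link counters sampled every T seconds. Names and structure here are illustrative, not taken from the Verilog:

```python
class BandwidthMonitor:
    """Counts bits sent per link over a window of T seconds and
    reports the least-loaded link when queried."""
    def __init__(self, n_links):
        self.bits = [0] * n_links

    def record(self, link, packet_len_bytes):
        # Accumulate the packet length, in bits, on the sending link.
        self.bits[link] += packet_len_bytes * 8

    def min_bw_link(self):
        """Index of the link with the fewest bits in the current window."""
        return min(range(len(self.bits)), key=lambda i: self.bits[i])

    def reset(self):
        """Called by the timer at the end of each T-second interval."""
        self.bits = [0] * len(self.bits)
```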




4.2 NETFPGA BASED ROUTER:




























[Figure: NetFPGA-based router. CPU RX and MAC RX queues for nf2c0, nf2c1, nf2c2 and nf2c3 feed a FIFO into the dual-core, dual-threaded processor with its header parser, minimum-B/W calculator and rerouting-and-checksum blocks; output traffic leaves through the corresponding CPU TX and MAC TX queues to the Ethernet ports.]


4.3 Fields Description:


















[Fig: Fields description. Header fields as carried on the datapath: dst MAC (48), src MAC (16 + 32), ethertype (16), version/length/TOS (16), total length (16), ID (16), flags + fragment offset (16), TTL (8), protocol (8), checksum (16), src IP (32), dst IP (16 + 16), UDP src port (16), UDP dst port (16), UDP length (16), UDP checksum (16) and packet sequence number (32); output-queue short events (num, port ev) feed the input FIFO.]
4.4 Hardware Accelerators:





















[Fig: Hardware accelerators. The Header Parser extracts the header fields and forms the flow header; a hash function indexes the CAM lookup table, which drives output-port lookup; on a miss, the least-B/W link is chosen.]



4.5 Minimum Bandwidth Calculation Accelerator









[Figure: Minimum bandwidth calculation accelerator. in_fifo_data feeds the FIFO and processor; a timer-driven minimum-bandwidth-calculation block monitors Link 1, Link 2 and Link 3.]

4.6 Header parser pin diagram:

The header parser block fetches the corresponding source IP address, destination port number and
protocol ID. These values are stored in the hash table to maintain a lookup table. They are further
used to calculate the corresponding destination MAC address, which helps determine the target IP
address so the routing can complete successfully.



[Pin diagram: Header Parser. Inputs: in_wr, clk, reset, in_fifo_data_p(63:0), in_fifo_ctrl_p(7:0); outputs: src_ip(31:0), dest_port(15:0), proto(7:0).]






4.7 Rerouting and checksum calculation:

Rerouting is done by replacing the MAC address and the destination port number. The checksum is
calculated from the difference between the current IP address and the new IP address; the resulting
value is written into the checksum field, and the packet is routed on to the desired node based on the
least recently used bandwidth. The least bandwidth is recalculated after a constant time interval,
identifying the least recently used node.

[Pin diagram: Rerouting and checksum. Pins: in_wr, clk, reset, in_fifo_data_p(63:0), in_fifo_ctrl_p(63:0), Proto(7:0), Flow_header(112:0).]



4.8 Minimum Bandwidth Calculation:
A timer is run which expires after T = 250,000,000 clock cycles (2 seconds at the 125 MHz core
clock). At each interval the per-node packet counters are updated; the traffic count is accumulated
by fetching each packet's length field.


[Pin diagram: Minimum bandwidth calculation. Pins: in_wr, clk, reset, T(7:0), pac_len(15:0), Min_bw_node(1:0).]






CHAPTER 5

Instruction Set Architecture:
5.1 Instruction Field Format:
Each instruction has a fixed width of 32 bits. The processor contains a register
file of 32 registers, so the source register (rs), transfer register (rt) and
destination register (rd) fields are each 5 bits wide. The immediate addressing
mode has a 9-bit offset field, so it can encode 512 distinct offset values. The
instruction is divided into the following fields:

opcode(31:28) | rs(27:23) | rt(22:18) | rd(17:13) | offset(12:4) | function(3:0)

Figure: ISA Format
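Given those field boundaries, encoding and decoding can be expressed directly. This is a hypothetical Python helper (the report does not list the opcode values themselves, so they are passed in as numbers):

```python
def encode(opcode, rs, rt, rd, offset, func):
    """Pack fields per the ISA format: opcode[31:28], rs[27:23],
    rt[22:18], rd[17:13], offset[12:4], function[3:0]."""
    assert opcode < 16 and rs < 32 and rt < 32 and rd < 32
    assert offset < 512 and func < 16
    return ((opcode << 28) | (rs << 23) | (rt << 18) |
            (rd << 13) | (offset << 4) | func)

def decode(instr):
    """Unpack a 32-bit instruction word into its named fields."""
    return {
        "opcode": (instr >> 28) & 0xF,
        "rs":     (instr >> 23) & 0x1F,
        "rt":     (instr >> 18) & 0x1F,
        "rd":     (instr >> 13) & 0x1F,
        "offset": (instr >> 4)  & 0x1FF,
        "func":   instr & 0xF,
    }
```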

5.2 Instruction Description:

The processor supports several types of instructions: register-to-register
logic instructions, arithmetic instructions, immediate arithmetic
instructions, conditional instructions, memory-based instructions and
shift instructions.
General-purpose registers rs, rt and rd are indicated with a $ sign.
The register file consists of 32 registers, hence any register is
represented by 5 bits. The register $0 is hardwired to zero.




5.3 Instructions:
Register Arithmetic
ADD
Syntax add $rd,$rs,$rt
Description Adds the contents of rs and rt and stores into rd


SUB
Syntax sub $rd,$rs,$rt
Description Subtracts rt from rs and stores the result in rd.



Immediate Instruction:

ADDI
Syntax addi $rt,$rs,immediate
Description Adds rs and a sign-extended immediate value and
stores the result in rt.





SUBI
Syntax subi $rt,$rs,immediate
Description Subtracts the immediate value from rs and stores the
result in rt.

SLTI
Syntax slti $rt,$rs,immediate
Description Sets rt if the content of rs is less than the immediate
value, else resets it.

Register Logic Instruction

AND
Syntax and $rd,$rs,$rt
Description Bitwise ANDs contents of rs and rt and stores the
result in rd.


OR
Syntax or $rd,$rs,$rt
Description Bitwise ORs contents of rs and rt and stores the
result in rd.




NOR
Syntax nor $rd,$rs,$rt
Description Bitwise NORs contents of rs and rt and stores the
result in rd.



XNOR
Syntax xnor $rd,$rs,$rt
Description Bitwise XNORs contents of rs and rt and stores
the result in rd.


NOT
Syntax not $rd,$rs
Description Bitwise negates the contents of rs and stores the
result in rd.

Shift Instructions

SLL
Syntax sll $rd,$rs
Description Shifts the value contained in rs left by one bit and
stores it in rd. Zeros are shifted in.


SRL
Syntax slr $rd,$rs
Description Shifts the value contained in rs right by one bit and
stores it in rd. Zeros are shifted in.



Conditional Instructions:


SLT
Syntax slt $rd,$rs,$rt
Description Sets rd if the content of rs is less than the content of
rt, else resets it.



BEQ
Syntax beq $rs, $rt, offset
Description Branches to the location pointed by the offset if (rs)
and (rt) are equal.


Memory Based Instructions:

LW
Syntax lw $rt, offset($rs)
Description A word (data) is loaded into rt from the memory
location calculated by adding the contents of rs to the
offset value.


SW
Syntax sw $rt, offset ($rs)
Description A word (data) is stored from rt in the memory
location calculated by adding the contents of rs with
the offset value.

Control Unit:
The ALU resides in the EX stage of the pipeline.

Table: Control Unit Description




Chapter 6

Purpose of the Compiler: A compiler is a computer program
that converts source code written in one programming language
into another language, usually into the binary
equivalent of the code. The binaries are then stored in the instruction
memory of the processor and used to execute a program.

Compiling Process: The gcc compiler runs on x86 Linux machines
and converts the source code into MIPS assembly.
This MIPS assembly code is translated into the custom ISA of the processor.
Finally this ISA is converted into the corresponding binaries, which
are loaded into instruction memory using a Perl script. There are
instances where complex instructions are not supported by the
processor; these are broken down into instructions that the processor
does support. This translator is written in C. The
operations generally performed by the processor are sorting the
contents of the payload, swapping the contents, etc.










The MIPS instructions supported by the processor, along with their
translations, are as follows:

MIPS ISA        CUSTOM ISA
add             add
addi            addi
addu            add
addiu           addi
sub             sub
subi            subi
mov $x,$y       add $x,$y,$0
li              addi
slt             slt
slti            slti
sll             sll
slr             slr
lw              lw
sw              sw
beq             beq
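The table above can be driven by a small translator. The real translator is written in C; the following is a simplified Python sketch, and the line parsing here is illustrative (operand rewriting for li and other special cases is elided):

```python
# Direct mnemonic substitutions from the translation table.
DIRECT = {"add": "add", "addi": "addi", "addu": "add", "addiu": "addi",
          "sub": "sub", "subi": "subi", "li": "addi", "slt": "slt",
          "slti": "slti", "sll": "sll", "slr": "slr", "lw": "lw",
          "sw": "sw", "beq": "beq"}

def translate(line):
    """Translate one MIPS assembly line into the custom ISA."""
    mnemonic, _, operands = line.strip().partition(" ")
    if mnemonic == "mov":                 # mov $x,$y -> add $x,$y,$0
        x, y = [op.strip() for op in operands.split(",")]
        return "add {},{},$0".format(x, y)
    if mnemonic in DIRECT:
        return "{} {}".format(DIRECT[mnemonic], operands.strip())
    raise ValueError("unsupported instruction: " + mnemonic)
```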

We insert two NOPs between consecutive instructions issued to the
processor because of dependency hazards in the 5-stage pipeline.
There is no internal forwarding implemented in the pipeline, but because
of the fine-grained scheduling only two NOPs are sufficient. Register
renaming is also employed to counter the problem of the same registers
being updated by both threads: the register file is divided between the
threads at the compiler level, splitting the 32-entry register file into
two halves and allocating 16 registers to each thread.
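Both compiler-level fixes can be sketched together: padding each instruction with two NOPs, and shifting thread 1's registers into the upper half of the file. This is hypothetical Python; the register syntax, the NOP mnemonic, and the treatment of $0 are assumptions (a real scheme would keep $0 pinned to the zero register):

```python
import re

def insert_noops(program, n=2):
    """Insert n NOPs after each instruction to cover the lack of
    internal forwarding in the 5-stage pipeline."""
    out = []
    for instr in program:
        out.append(instr)
        out.extend(["noop"] * n)
    return out

def rename_registers(instr, thread_id):
    """Thread 0 keeps $0-$15; thread 1 is shifted to $16-$31 so the
    two threads never write the same physical registers."""
    if thread_id == 0:
        return instr
    return re.sub(r"\$(\d+)", lambda m: "$" + str(int(m.group(1)) + 16), instr)
```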
