
The QPACE Network Processor

QPACE Collaboration: H. Baier1, H. Boettiger1, M. Drochner2, N. Eicker2,3, U. Fischer1, Z. Fodor3, A. Frommer3, C. Gomez10,
G. Goldrian1, S. Heybrock4, M. Hüsken3, D. Hierl4, T. Huth1, B. Krill1, J. Lauritsen1, T. Lippert2,3, J. McFadden1,
T. Maurer4, N. Meyer4, A. Nobile4, I. Ouda6, M. Pivanti4,5, D. Pleiter7, A. Schäfer4, H. Schick1, F. Schifano8,
H. Simma7,9, S. Solbrig4, T. Streuer4, K.-H. Sulanke7, R. Tripiccione8, J. S. Vogt1, T. Wettig4, F. Winter7
1IBM Böblingen, 2FZ Jülich, 3Univ. Wuppertal, 4Univ. Regensburg, 5INFN Trento, 6IBM Rochester,
7DESY Zeuthen, 8Univ. Ferrara, 9Univ. Milano Bicocca, 10IBM La Gaude

We present an overview of the design and implementation of the QPACE Network Processor. The Network Processor implements a standard Ethernet network and a high-speed communication network that allows for a tight coupling of the processing nodes. By using an FPGA we have the flexibility to further optimize our design and to adapt it to different application requirements.

QPACE (QCD Parallel Computing on the Cell Broadband Engine) is a massively parallel supercomputer optimized for Lattice QCD calculations, providing a tight coupling of processing nodes by a custom network:

• Node Cards with IBM PowerXCell 8i and custom-designed Network Processor FPGA
• Nearest-neighbor 3D-torus network, 6 GByte/s communication bandwidth per node, remote LS-to-LS DMA communication, low latency, partitionable (see the sketch below)
• Gigabit Ethernet network
• Global Signal Tree: evaluation of global conditions and synchronization
• 256 Node Cards per rack, 4 GByte memory per node
• 51.2 (25.6) TFlops single (double) precision per rack
• Efficient, low-cost watercooling solution, max. 33 kW per rack
• Capability machine (scalable architecture)

[Node Card diagram: PowerXCell 8i processor and memory coupled to the Network Processor (FPGA), whose PHYs connect to the torus network and to Ethernet.]
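The torus item above implies a simple programming model: an SPE moves data directly from its Local Store into the Local Store of a neighboring node and waits for completion. The sketch below illustrates only this model; the function and constant names (qpace_tnw_put, qpace_tnw_wait, TNW_XPLUS, ...) are hypothetical placeholders, not the actual QPACE communication API.

```c
/* Conceptual sketch of nearest-neighbor LS-to-LS communication over the
 * QPACE torus. All names below (qpace_tnw_put, qpace_tnw_wait, TNW_*) are
 * hypothetical placeholders, NOT the real QPACE API: they only illustrate
 * the model of pushing a buffer from the local SPE Local Store into a
 * neighbor's Local Store via the Network Processor and waiting for
 * completion. */
#include <stddef.h>
#include <stdint.h>

enum tnw_link { TNW_XPLUS, TNW_XMINUS, TNW_YPLUS, TNW_YMINUS, TNW_ZPLUS, TNW_ZMINUS };

/* Assumed prototypes, for illustration only. */
int qpace_tnw_put(enum tnw_link link, const void *ls_src,
                  uint32_t remote_ls_offset, size_t bytes, int tag);
int qpace_tnw_wait(int tag);

/* Send one halo buffer to the +x neighbor and wait for the transfer. */
static void exchange_xplus(const float *halo_out, uint32_t remote_offset,
                           size_t bytes)
{
    const int tag = 1;
    qpace_tnw_put(TNW_XPLUS, halo_out, remote_offset, bytes, tag); /* start remote put  */
    /* ... overlap local computation with the transfer here ...                          */
    qpace_tnw_wait(tag);                                           /* wait for completion */
}
```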

FPGAs

FPGAs are user-programmable hardware chips:

• Configured with a logic circuit diagram to specify how the chip will work
• Ability to update the functionality at any time offers vast advantages in development and operation

They are built up from basic elements called "slices", interconnected using "switch matrices". A slice is made up of a number of Flip-Flops, Look-Up Tables and Multiplexers:

• Construct desired logic by setting up a number of these elements
• Trade-off between performance and resource usage

They also provide other primitives like Block RAMs, Ethernet MACs, processor cores, high-speed transceivers, etc.
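As a purely illustrative aside (not FPGA configuration code), the following minimal C model mimics the slice elements just described: a 4-input look-up table realizes an arbitrary Boolean function from a 16-entry truth table, and a flip-flop registers the LUT output once per clock cycle. The example function f is chosen arbitrarily.

```c
/* Minimal software model of the slice elements described above: a 4-input
 * LUT stores a 16-entry truth table and can therefore realize any Boolean
 * function of four inputs; a flip-flop registers the LUT output on a clock
 * edge. Illustration only, not FPGA configuration code. */
#include <stdint.h>
#include <stdio.h>

/* Evaluate a 4-input LUT: 'config' is the truth table, one bit per input
 * combination; the inputs form the index of the bit that is read out. */
static unsigned lut4(uint16_t config, unsigned a, unsigned b, unsigned c, unsigned d)
{
    unsigned index = (a & 1) | ((b & 1) << 1) | ((c & 1) << 2) | ((d & 1) << 3);
    return (config >> index) & 1;
}

int main(void)
{
    /* Build the truth table for the example function f = (a AND b) XOR (c OR d). */
    uint16_t config = 0;
    for (unsigned i = 0; i < 16; i++) {
        unsigned a = i & 1, b = (i >> 1) & 1, c = (i >> 2) & 1, d = (i >> 3) & 1;
        config = (uint16_t)(config | ((((a & b) ^ (c | d)) & 1u) << i));
    }

    unsigned ff = 0; /* flip-flop state, updated once per simulated clock cycle */
    for (unsigned cycle = 0; cycle < 4; cycle++) {
        unsigned a = cycle & 1, b = (cycle >> 1) & 1;
        ff = lut4(config, a, b, 0, 1); /* register the combinational LUT output */
        printf("cycle %u: ff = %u\n", cycle, ff);
    }
    return 0;
}
```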
We chose a Xilinx Virtex-5 LX110T-FF1738-3:

• just enough High Speed Serial Transceivers
• just enough pins to connect all 6 Torus links
• highest speed grade and sufficient capacity to hold our logic

Network Processor

The FPGA acts as Southbridge to the Cell Processor. Its logic is designed to work as a fast network fabric:

• 2x 20G links to Cell BE: Rambus FlexIO, 16b at 2.5GHz
• IBM-proprietary internal bus interface: 128b at 208MHz
• 6x 10G links to torus network: XGMII, 32b at 250MHz
• Gigabit Ethernet link for booting and disk I/O: RGMII, 4b at 250MHz
• Serial interfaces: 2x UART, SPI, Global Signals
• Most logic controlled through Device Control Register (DCR) Bus (access pattern sketched below)

[Network Processor block diagram: Rocket IO SerDes carrying the FlexIO link (2 byte, 2.5GHz) to the Cell BE; IBM Logic with Master and Slave Interfaces at 128 bit, 208MHz; Inbound/Outbound Read paths, Inbound-Write and Outbound-Write Controllers, DCR Master, Flash Reader, UART, SPI Master, and Configuration/Status/Version registers; 6 XGMII torus links (32 bit, 250MHz) to the torus transceivers, RGMII/MDIO (4 bit, 250MHz) to the Ethernet transceiver, Global Signals to the Global Signal Tree, UART/SPI to the Service Processor, and a Flash interface. The legend distinguishes logic blocks contributed by the SFB from those contributed by IBM.]
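Since most of the logic is controlled through the DCR bus, the typical control pattern is to write configuration registers and poll status registers. The sketch below illustrates only this generic pattern; the base address, register offsets, and bit layouts are invented for illustration and are not the real QPACE register map.

```c
/* Hypothetical sketch of DCR-style control of the Network Processor logic.
 * The base address, register offsets, and bit assignments below are
 * invented placeholders -- the real QPACE register map is not given in the
 * text. Only the generic pattern (memory-mapped configuration writes and
 * status polling) is illustrated. */
#include <stdint.h>

#define NWP_REGS_BASE   ((volatile uint32_t *)0xF0000000u) /* assumed mapping          */
#define NWP_REG_CONFIG  0x00  /* hypothetical: torus link enable bits                   */
#define NWP_REG_STATUS  0x01  /* hypothetical: torus link-training status               */
#define NWP_REG_VERSION 0x02  /* hypothetical: design version register                  */

static inline uint32_t nwp_read(unsigned reg)              { return NWP_REGS_BASE[reg]; }
static inline void     nwp_write(unsigned reg, uint32_t v) { NWP_REGS_BASE[reg] = v;    }

/* Enable all six torus links and poll until every link reports "up". */
static uint32_t nwp_bring_up_torus(void)
{
    nwp_write(NWP_REG_CONFIG, 0x3Fu);                   /* bits 0..5: links x+ .. z- */
    while ((nwp_read(NWP_REG_STATUS) & 0x3Fu) != 0x3Fu)
        ;                                               /* spin until links ready    */
    return nwp_read(NWP_REG_VERSION);                   /* report design version     */
}
```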

Major Challenges

• Implementing the FlexIO on an FPGA was (is) a major challenge; only possible due to special features of Xilinx V5 RocketIO(TM) GTP Low-Power Transceivers [1]. However:
  – No 100% compatibility of Rambus FlexIO and Xilinx GTPs
  – Training of interface has proven difficult
  – Designed logic cannot be re-used for other processors
• Designing at edge of FPGA capabilities:
  – Routing of signals difficult due to large number of clocks
  – Most package pins (of largest package) used
  – Complexity of logic limits internal bandwidths
  – On-chip debugging difficult due to high logic utilization rates

Utilization

  Slices          16,029 out of 17,280   92%
  PINs               656 out of    680   86%
  LUT-FF Pairs    51,018 out of 69,120   73%
  Registers       38,212 out of 69,120   55%
  LUTs            36,939 out of 69,120   53%
  BlockRAM/FIFO       53 out of    148   35%

This splits up into:

                                    FlipFlops    LUTs    Percent of FFs
  Total                               38,212    36,939       100%
  IBM Logic                           20,225    16,915        53%
  Torus                               13,672    14,252        36%
  Ethernet                             1,537       894         4%
  IWC (Inbound-Write Controller)         583       132         1.5%
  OWC (Outbound-Write Controller)        446       642         1.2%

Evaluation

• Reaching target clock frequencies becomes very difficult as FPGA fill rate increases. Current status:
  – FlexIO at 2GHz, goal is 2.5GHz (verified plain link at 3GHz)
  – IBM Bus at 166MHz, goal is 208MHz (verified without torus logic at 208MHz)
• Bandwidth currently up to 0.8 GByte/s per LS-to-LS link (bottleneck is development time and effort)
• Latency goal of 1 µs missed. Current SPE-to-SPE latency about 3 µs. (Long time between start of data move operation in processor and data entering links.)
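A rough consistency check of these figures (assuming the 6 GByte/s per-node number counts one direction per link): the raw XGMII rate per torus link is

\[
32\,\text{bit} \times 250\,\text{MHz} = 8\,\text{Gbit/s} = 1\,\text{GByte/s per link and direction},
\qquad
6 \times 1\,\text{GByte/s} = 6\,\text{GByte/s per node},
\]

so the measured 0.8 GByte/s per LS-to-LS link corresponds to roughly 80% of the raw link rate.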
Further Reading:

[1] I. Ouda and K. Schleupen, Application Note: FPGA to IBM Power Processor Interface Setup (2008).
[2] G. Goldrian et al., QPACE: Quantum Chromodynamics Parallel Computing on the Cell Broadband Engine, Comput. Sci. Eng. 10 (2008) 26.
