
Roadrunner Tutorial

An Introduction to Roadrunner and the Cell Processor

Cornell Wright, HPC-DO (cornell@lanl.gov)
Paul Henning, CCS-2 (phenning@lanl.gov)
Ben Bergen, CCS-2 (bergen@lanl.gov)

February 7, 2008
LA-UR-08-2818
Contents

Roadrunner Project
Roadrunner Architecture
Cell Processor Architecture
Cell SDK
DaCS and ALF
Accelerating an Existing Application
References

2 Roadrunner Tutorial February 7, 2008




Roadrunner Project Phases

Phase I: Redtail Base System
- 76 teraflop/s Opteron cluster
- Replacement for Q

Phase II: Advanced Algorithms
- Evaluation of Cell potential for HPC

Phase III: Roadrunner
- 1.3 petaflop/s Cell-accelerated Opteron cluster


Roadrunner Phase 3 work plan - Draft Schedules

[Gantt chart, Aug 2008 through Jan 2010, with three estimated milestone groups:]
1. System delivery; connect file systems; start and finish of acceptance (estimated)
2. Start and finish of system science runs and code stabilization (estimated)
3. Start of secure science runs (including VPIC weapons science); secure system availability after initial integration; start and finish of accreditation (estimated)
Roadrunner Phase 1 CU layout (Redtail)

[Cluster-unit diagram:] 131 compute nodes (4 dual-core Opterons each) and 12 I/O nodes on an IB 4x SDR 288-port switch (144+144 ports), with 96 IB 4x SDR links up to the 2nd-stage switch; a service node, terminal server, SMC switch, and disk; GE CVLAN-n and GE MVLAN management networks.


Roadrunner Phase 3 layout is almost identical

[Cluster-unit diagram:] 180 compute triblades (2 dual-core Opterons + 4 Cell accelerators each) and 12 I/O nodes on an IB ISR9288 DDR switch (192+96 ports), with 96 IB 4x DDR links up to the 2nd-stage DDR switch; a service node, terminal server, BCMM/BC switch, and disk; GE CVLAN-n and GE MVLAN management networks.




Interest in Hybrid Computing

Traditional clusters are straining to reach petaflop scale:
- Processor core performance is slowing
- Practical limits on network size and cost
- Programming challenges with 10,000s of nodes
- A technology discontinuity is driving price/performance
- Accelerators offer the promise of a 10x reduction in $/MF


Why Accelerators?

- Specialized engines can perform selected tasks more efficiently (faster, cheaper, cooler, etc.) than general-purpose cores.
- Hardware standards (e.g. PCI Express) and software standards (e.g. Linux) provide convenient attachment points.
- Parallelism (e.g. SIMD) exploits the increasing number of gates available in each chip generation.
- Accelerators have been around for a long time! FPS AP120B, IBM 3838, Intel 80387/80487, Weitek, Atari ANTIC, S3 911


Roadrunner Goal: PetaFlop Performance

- Provide a large capacity-mode computing resource
  - Purchased in FY2006 and presently in production
  - Robust HPC architecture with known usability for LANL codes
- Upgrade to a petascale-class hybrid accelerated architecture in 2008
  - Follows future trends toward hybrid/heterogeneous computers: more and varied cores and special function units
  - Capable of supporting future LANL weapons physics and science at scale
  - IBM & LANL strategic collaboration
  - Capable of achieving a sustained PetaFlop/s


Roadrunner is a petascale system in 2008
Full Roadrunner specifications:
- 6,912 dual-core Opterons: 49.8 TF DP peak; 55.2 TB Opteron memory
- 12,960 Cell eDP chips: 1.33 PF DP peak; 2.65 PF SP peak; 51.8 TB Cell memory; 277 TB/s Cell memory BW
- 3,456 nodes on a 2-stage IB 4X DDR fabric: 13.8 TB/s aggregate BW (bi-dir, 1st stage); 6.9 TB/s aggregate BW (bi-dir, 2nd stage); 3.5 TB/s bisection BW (bi-dir, 2nd stage)
- 432 10 GigE I/O links on 216 I/O nodes: 432 GB/s aggregate I/O BW (uni-dir, IB limited)
- 18 CU clusters; eight 2nd-stage IB 4X DDR switches; 12 links per CU to each of the 8 switches


Roadrunner Phase 3 is Cell-accelerated, not a cluster of Cells

[Diagram: multi-socket, multi-core Opteron cluster nodes (100s of them) and I/O gateway nodes on a scalable-unit cluster interconnect switch/fabric; Cells are added to each individual compute node.]

Node-attached Cells is what makes Roadrunner different!


Roadrunner at a glance
- Cluster of 18 Connected Units
  - 6,912 AMD dual-core Opterons
  - 12,960 IBM Cell eDP accelerators
  - 49.8 Teraflops peak (Opteron)
  - 1.33 Petaflops peak (Cell eDP)
  - 1 PF sustained Linpack
- InfiniBand 4x DDR fabric
  - 2-stage fat-tree; all-optical cables
  - Full bisection BW within each CU: 384 GB/s (bi-directional)
  - Half bisection BW among CUs: 3.45 TB/s (bi-directional)
  - Non-disruptive expansion to 24 CUs
- 107 TB aggregate memory: 55 TB Opteron, 52 TB Cell
- 216 GB/s sustained file system I/O: 216x2 10G Ethernets to Panasas
- Software: RHEL & Fedora Linux; SDK for Multicore Acceleration; xCAT cluster management; system-wide GigE network
- Power: 3.9 MW; 0.35 GF/Watt
- Area: 296 racks; 5500 ft2


Integrated Hybrid Node

- LS21 AMD host blade
  - Dual-socket, dual-core AMD Opteron
  - DDR2 direct-attach DIMMs
- Expansion card
  - 2 HT2100 HT<->PCIe bridges
- QS22-based accelerator blades
  - Dual PowerXCell 8i sockets
  - DDR2 direct-attach DIMMs
- AMD host to Cell eDP connectivity
  - Two x8 PCIe host-to-QS22 links


PowerXCell 8i / AMD TriBlade (dual-core Opteron, IB-DDR)

[Block diagram: an LS21 host blade, a new expansion card, and two QS22 accelerator blades (CB2), linked by HT x16 and PCIe.]

- LS21 host blade
  - Dual-socket, dual-core AMD Opteron (2 x 7.2 GFLOPS)
  - DDR2 direct-attach DIMMs, 16 GB total
  - 10.7 GB/s/socket (0.48 B/FLOP)
- New expansion card
  - 2 HT2100 HT<->PCIe bridges
  - New PCIe redrive card; HSDC connector to IB 4x DDR
- QS22 accelerator blade
  - Dual PowerXCell 8i sockets: 204 GFLOPS @ 3.2 GHz (2 x 102 GFLOPS)
  - DDR2 direct-attach DIMMs, 8 GB
  - 25.6 GB/s per PowerXCell 8i chip (0.25 B/FLOP)
- AMD host to PowerXCell 8i connectivity
  - Two x8 PCIe host-to-CB2 links: ~2+2 GB/s/link, ~4+4 GB/s total
- Node design points
  - One Cell chip per Opteron core
  - ~400 GF/s double precision; ~800 GF/s single precision
  - 16 GB Cell memory and 16 GB Opteron memory
Roadrunner nodes have a memory hierarchy

- QS22 Cell blades (2 per node)
  - 256 KB of working memory per SPE (25.6 GB/s off-SPE BW; ~200 GB/s per Cell on the EIB)
  - 4 GB of memory per Cell (21.3 GB/s/chip), 8 GB of shared NUMA memory per blade, 16 GB of distributed shared memory per node
- LS21 Opteron blade (one Cell chip per Opteron core)
  - 4 GB of memory per core (5.4 GB/s/core), 8 GB of shared NUMA memory per socket, 16 GB of shared memory per node
- Links: PCIe x8 to each Cell blade (2 GB/s, 2 us); ConnectX IB 4X DDR to the cluster (2 GB/s, 2 us)


Hybrid Node System Software Stack

[Diagram of the per-node stack:]
- Opteron blade: application and MPI (to other cluster nodes), accelerated libraries, ALF/DaCS, DaCSd, host O/S and device drivers, plus IDE, gdb, trace analysis, and other tooling
- Cell blades (two per node): accelerated libraries, ALF/DaCS, DaCSd, compilers/profiling/etc., and CellBE Linux, with IDE and analysis tooling


Three types of processors work together.

- Parallel computing on the Cell (SPE compiler; ALF or libSPE): data partitioning, work-queue pipelining, and process management & synchronization across the 8 SPEs and the PPE (PowerPC compiler) over the EIB.
- Remote communication to/from the Cell (DaCS, over OpenMPI): data communication & synchronization and process management & synchronization between the PPE and the Opteron (x86 compiler) over PCIe, for computationally intense offload.
- Cluster communication: the Opterons run OpenMPI across the cluster over IB.

MPI remains the foundation.


Using Roadrunner's memory hierarchy:
today with hybrid DaCS, supplemented tomorrow with ALF

- SPE (256 KB of working memory per SPE): ALF pre-fetches work tiles with DMA gets and writes results behind with DMA puts.
- Cell blade (4 GB of shared memory per Cell, 21.3 GB/s/chip): holds single-physics mesh data plus additional temporary data.
- Opteron node (4 GB of shared memory per core, 2.7 GB/s/core): holds multi-physics mesh data. DaCS puts download and upload data over PCIe x8 (2 GB/s, 2-4 us); ConnectX IB 4X DDR (2 GB/s, 2 us) links the node to the cluster.


Roadrunner Early Hardware (Poughkeepsie)





Three Major Limiters to Processor Performance

- Frequency Wall
  - Diminishing returns from deeper pipelines
- Memory Wall
  - Processor frequency vs. DRAM memory latency
  - Latency introduced by multiple levels of memory
- Power Wall
  - Limits in CMOS technology
  - A hard limit on acceptable system power


Where have all of the transistors gone?

- Memory and memory management
  - Larger and larger caches are needed for visible improvement
- Trying to extract parallelism
  - Long pipelines with lots of interlock logic
  - Superscalar designs (multiple pipelines) with even more interlocks!
  - Branch prediction to utilize the pipelines effectively
  - Speculative execution
  - Out-of-order execution
  - Hardware threads

The number of transistors doing direct computation is shrinking relative to the total number of transistors.


Techniques for Better Efficiency
- Chip-level multiprocessors
  - Go back to simpler core designs and use the extra chip area for multiple cores.
- Vector units/SIMD
  - Have the same instruction-execution logic operate on multiple pieces of data concurrently; this allows more throughput with little increase in overhead.
- Rethink memory organization
  - At the register level, hidden registers with register renaming are transparent to the end user but expensive to implement. Can a compiler do better with more architected registers in the instruction set?
  - Caches are automatic but not necessarily the most efficient use of chip area. Temporal and spatial locality mean little when processing streams of data.


Today's Microprocessor Design: Power Constrained

- Transistor performance scaling
  - Channel off-current and gate-oxide tunneling challenge supply-voltage scaling
  - Material and structural changes are required to stay on Moore's Law
  - Power per (switched) transistor now decreases only slowly
- Chip power consumption increases faster than before
  - Microprocessor performance is limited by transistor power efficiency
  - We already know how to design processors that we cannot reasonably cool or power
- Moore's Law: 2x transistor density every 18-24 months

Net: increasing performance requires increasing power efficiency.


Hardware Accelerators Concept

Streaming systems use architecture to address the power/area/performance challenge:
- A processing unit with many specialized co-processor cores
- A storage hierarchy (on-chip private, on-chip shared, off-chip) explicit at the software level
- Applications coded for parallelism, with localized data access to minimize off-chip access
- Demonstrated advantage per chip and per Watt of ~100x on well-behaved applications (graphics, media, DSP, ...)

Concept / key ideas:
- Augment a standard CPU with many (8-32) small, efficient hardware accelerators (SIMD or vector) with private memories visible to applications.
- Rewrite code to a highly parallel model with explicit on-chip vs. off-chip memory.
- Stream data between cores on the same chip to reduce off-chip accesses.


Cell BE Solutions
- Increase concurrency
  - Multiple cores
  - SIMD/vector operations in each core
  - Start memory movement early so data is available when needed
- Increase efficiency
  - Simpler cores devote more resources to actual computation
  - Programmer-managed memory is more efficient than dragging data through caches
  - Large register files give the compiler more flexibility and eliminate the transistors needed for register renaming
  - Specialized processor cores for specific tasks


Other (Micro)Architectural Decisions

- Large shared register file
- Local store size tradeoffs
- Dual-issue, in-order execution
- Software branch prediction
- Channels

The microarchitecture decisions, even more than the architecture decisions, show a bias toward compute-intensive codes.


The Cell BE Concept

- Compatibility with the 64b Power Architecture
  - Builds on and leverages IBM's investment and community
- Increased efficiency and performance
  - Attacks the Power Wall: non-homogeneous coherent multiprocessor; high design frequency at a low operating voltage with advanced power management
  - Attacks the Memory Wall: streaming DMA architecture; 3-level memory model (main storage, local storage, register files)
  - Attacks the Frequency Wall: highly optimized implementation; large shared register files and software-controlled branching to allow deeper pipelines
- Interface between the user and the networked world
  - Image-rich information, virtual reality, shared reality
  - Flexibility and security
  - Multi-OS support, including RTOS / non-RTOS
  - Combines the real-time and non-real-time worlds


Cell Synergy

Cell is not a collection of different processors, but a synergistic whole:
- Operation paradigms, data formats, and semantics are consistent
- Shared address translation and memory protection model
- PPE for operating systems and program control
- SPEs optimized for efficient data processing
  - SPEs share Cell system functions provided by the Power Architecture
  - The MFC implements the interface to memory: copy-in/copy-out to local storage
- PowerPC provides system functions
  - Virtualization
  - Address translation and protection
  - External exception handling
- The EIB integrates the system as its data transport hub


State of the Art: Intel Core 2 Duo

Guess where the cache is?



State of the Art: IBM Power 5

How about on this one?



Unconventional State of the Art: Cell BE

memory warehousing vs. in-time data processing



Cell/B.E. space & power vs. traditional approaches

- Cell/B.E.: 3.2 GHz; 9 cores; ~230 SP GFlops
- Example dual core: 349 mm2; 3.4 GHz @ 150 W; 2 cores; ~54 SP GFlops

Today's x86 quad-core processors are dual-chip modules (DCMs): two of the dual-core processors illustrated here, stacked vertically and packaged together. On any traditional processor, the cache, prediction, and related support logic illustrated here remain at ~50% of the chip area.


Cell BE Architecture



Cell Architecture is 64b Power Architecture

[Diagram: two Power ISA cores, each with an MMU/BIU, on a coherent bus with memory and I/O translation.]

Includes coherence/memory; compatible with 32/64b Power Architecture applications and OSs.
Cell Architecture is 64b Power Architecture Plus Memory Flow Control (MFC)

[Diagram: the same two Power cores, now with RMT added at the ISA and MMU/BIU levels, plus MMU/DMA units (+RMT), Local Stores aliased into the memory map, and resource allocation (+RAG) on the coherent bus.]


Cell Architecture is 64b Power Architecture + MFC Plus Synergistic Processors

[Diagram: synergistic processor cores (Syn. Proc. ISA) attached to the MMU/DMA units and Local Stores, alongside the Power cores on the coherent bus.]


Coherent Offload Model

- DMA into and out of Local Store is equivalent to Power core loads & stores
  - Governed by Power Architecture page and segment tables for translation and protection
  - Shared-memory model with Power-architecture-compatible addressing
- MMIO capabilities for SPEs
- Local Store is mapped (aliased), allowing LS-to-LS DMA transfers
- DMA equivalents of locking loads & stores
- OS management/virtualization of SPEs
  - Pre-emptive context switch is supported (but not efficient)


Cell Broadband Engine(TM): A Heterogeneous Multi-core Architecture

* Cell Broadband Engine is a trademark of Sony Computer Entertainment, Inc.
Cell BE Block Diagram

[Diagram: 8 SPEs (each an SPU/SXU with Local Store and MFC) on the EIB (up to 96B/cycle, 16B/cycle per port); the PPE (PPU with L1/PXU and L2, 32B and 16B/cycle paths) shares the bus with the MIC (to XDR(TM) memory) and the BIC (to dual FlexIO(TM)).]

64-bit Power Architecture with VMX


1 PPE core:
- VMX unit
- 32 KB L1 caches
- 512 KB L2 cache
- 2-way SMT


8 SPEs:
- 128-bit SIMD instruction set
- 128 x 128-bit register file
- 256 KB local store
- MFC
- Isolation mode


Element Interconnect Bus (EIB):
- 96 B/cycle bandwidth


System Memory Interface:
- 16 B/cycle
- 25.6 GB/s (1.6 GHz)


I/O Interface:
- 16 B/cycle x 2




PPE Block Diagram

[Pipeline diagram: the L2 interface and L1 instruction cache (with pre-decode) feed fetch control with branch scan; the two threads alternate fetch and dispatch cycles through the SMT dispatch queue, microcode, and decode/dependency/issue stages. Issue feeds the branch, load/store, and fixed-point execution units, plus a VMX/FPU issue queue driving VMX and FPU arithmetic/logic and load/store/permute units, each with its own completion stage, joined at completion/flush against the L1 data cache.]




SPE Block Diagram

[Diagram: the SPU core (SXU) with its channel unit, local store, and MFC (DMA unit), connected down to the Element Interconnect Bus.]

- SPU core: registers & logic
- Channel unit: message-passing interface for I/O
- Local store: 256 KB of SRAM private to the SPU core
- DMA unit: transfers data between the local store and main memory


Local Store

- Never misses: no tags, backing store, or prefetch engine
  - Predictable real-time behavior
  - Less wasted bandwidth
  - An easier programming model for achieving very high performance
- Software-managed caching: acts as a large register file or local memory
  - DMAs are fast to set up, almost like normal load instructions
  - Can move data from one local store to another
- No translation on local-store accesses
  - A multiuser operating system runs on the control processor
  - The LS can be mapped as system memory; cached copies are non-coherent with respect to SPU loads/stores



DMA & Multibuffering

- DMA commands move data between system memory & Local Store
- DMA commands are processed in parallel with software execution
  - Double buffering
  - Software multithreading
- 16 queued commands, up to 16 KB per command
  - Up to 16 transfers of 128 bytes each in flight on the on-chip interconnect
- Richer than typical cache prefetch instructions
  - Scatter-gather
  - Flexible DMA command status; can achieve a low-power wait

[Diagram: two software threads alternate; each initiates a DMA fetch, switches to the other thread, waits for its data, computes, then initiates a DMA store and the next fetch.]



Channels: Message-Passing I/O

- The interface from the SPU core to the rest of the system
- Channels have capacity, which allows pipelining
- Instructions: read, write, read capacity
- Effects appear at the SPU core interface in instruction order
- Blocking reads & writes stall the SPU in a low-power wait mode
- Example facilities accessible through the channel interface:
  - DMA control
  - Counter-timer
  - Interrupt controller
  - Mailboxes
  - Status
  - Interrupts, BISLED



SPE Block Diagram (Detailed)

[Diagram: the instruction issue unit / instruction line buffer feeds two execution pipes (floating-point, fixed-point, permute, load-store, branch, and channel units) through a shared register file with result forwarding and staging; the 256 KB single-port SRAM local store (128 B read / 128 B write ports) is shared between the load-store unit and the DMA unit on the on-chip coherent bus. Datapath widths range from 8 to 128 bytes/cycle.]


SPE Instruction Issue

- In-order
- Dual issue requires alignment according to type
  - An instruction swap forces single issue
  - Saves a pipeline stage & simplifies resource checking
- 9 units & 2 pipes, chosen for best balance

[Diagram: the instruction from address 0 goes to pipe 0 (simple fixed, shift, single precision, floating integer, byte); the instruction from address 4 goes to pipe 1 (permute, local store, channel, branch).]


SPE Pipeline Diagram

Instruction Class       Pipe  Latency
Simple Fixed (FX)        0      2
Shift (FX)               0      4
Single Precision (FP)    0      6
Floating Integer (FP)    0      7
Byte (BYTE)              0      4
Permute (PERM)           1      4
Load (LS)                1      6
Branch                   1      4
Channel                  1      6


SPE Branch Considerations

- Mispredicts cost 18 cycles
- No hardware history mechanism
- Branch-penalty avoidance techniques:
  - Write the frequent path as inline code
  - Compute both paths and use a select instruction
  - Unroll loops (also reduces dependency stalls)
  - Load the target into the branch target buffer with a branch hint instruction
    - Issue it 16 cycles ahead of the branch instruction
    - Single-entry BTB



SPE Instructions

- Scalar processing is supported on a data-parallel substrate
  - All instructions are data parallel and operate on vectors of elements
  - A scalar operation is defined by instruction use, not by opcode
  - The vector instruction form is used to perform the operation
- Preferred-slot paradigm
  - Scalar arguments to instructions are found in the preferred slot
  - Computation can be performed in any slot



Register Scalar Data Layout

- The preferred slot is bytes 0-3
  - By convention, for procedure interfaces
  - Used by instructions expecting scalar data: addresses, branch conditions, generated controls for insert



Memory Management & Mapping





MFC Detail

[Diagram: the local store and SPU beside the memory flow control system: DMA engine and queue, MMU, RMT, atomic facility, and bus interface/control with MMIO.]

- DMA unit
  - LS <-> LS, LS <-> system memory, and LS <-> I/O transfers
  - 8 PPE-side command queue entries; 16 SPU-side command queue entries
- MMU, similar to the PowerPC MMU
  - 8 SLBs, 256 TLBs
  - 4K, 64K, 1M, and 16M page sizes
  - Software/hardware page table walk
  - PT/SLB misses interrupt the PPE
- Atomic cache facility
  - 4 cache lines for atomic updates
  - 2 cache lines for cast-out/MMU reload
- Up to 16 outstanding DMA requests in the BIU
- Resource / bandwidth management tables
  - Token-based bus access management
  - TLB locking
- Isolation mode support (security feature)
  - Hardware-enforced isolation: SPU and Local Store not visible (bus or JTAG), with a small untrusted LS area for communication
  - Secure boot: chip-specific key; decrypt/authenticate boot code
  - Secure vault runtime isolation support: isolate load and isolate exit features


Per-SPE Resources (PPE Side)

Each group of registers sits on its own 4K physical page boundary.

Problem state:
- 8-entry MFC command queue interface
- DMA command and queue status; DMA tag status query mask; DMA tag status
- 32-bit mailbox status and data from the SPU
- 32-bit mailbox status and data to the SPU (4-deep FIFO)
- Signal notification 1 and 2
- SPU run control; SPU next program counter; SPU execution status
- Optionally mapped 256K local store

Privileged 1 state (OS):
- SPU privileged control; SPU channel counter initialize; SPU channel data initialize
- SPU signal notification control; SPU decrementer status & control
- MFC DMA control; MFC context save/restore registers; SLB management registers
- Optionally mapped 256K local store

Privileged 2 state (OS or hypervisor):
- SPU master run control; SPU ID; SPU ECC control, status, and address
- SPU 32-bit PU interrupt mailbox; MFC interrupt mask and status
- MFC DMA privileged control; MFC command error register; MFC command translation fault register
- MFC SDR (PT anchor); MFC ACCR (address compare); MFC DSSR (DSI status); MFC DAR (DSI address)
- MFC LPID (logical partition ID); MFC TLB management registers


Per-SPE Resources (SPU Side)

SPU direct-access resources:
- 128 x 128-bit GPRs
- External event status (channel 0): decrementer event, tag status update event, DMA queue vacancy event, SPU incoming mailbox event, signal 1 and 2 notification events, reservation lost event
- External event mask (channel 1); external event acknowledgement (channel 2)
- Signal notification 1 (channel 3); signal notification 2 (channel 4)
- Set decrementer count (channel 7); read decrementer count (channel 8)
- 16-entry MFC command queue interface (channels 16-21)
- DMA tag group query mask (channel 22); request tag status update (channel 23): immediate, conditional-ALL, conditional-ANY
- Read DMA tag group status (channel 24)
- DMA list stall-and-notify tag status (channel 25) and acknowledgement (channel 26)
- Lock line command status (channel 27)
- Outgoing mailbox to PU (channel 28); incoming mailbox from PU (channel 29); outgoing interrupt mailbox to PU (channel 30)

SPU indirect-access resources (via EA-addressed DMA):
- System memory
- Memory-mapped I/O
- This SPU's local store, and other SPUs' local stores
- Other SPUs' signal registers
- Atomic update (cacheable memory)


Memory Flow Controller Commands

DMA commands:
- put - transfer from Local Store to EA space
- puts - transfer and start SPU execution
- putr - put result (arch.; scarf into L2)
- putl - put using a DMA list in Local Store
- putrl - put result using a DMA list in LS (arch.)
- get - transfer from EA space to Local Store
- gets - transfer and start SPU execution
- getl - get using a DMA list in Local Store
- sndsig - send signal to SPU

Command modifiers <f,b>:
- f: embedded tag-specific fence - the command will not start until all previous commands in the same tag group have completed
- b: embedded tag-specific barrier - the command and all subsequent commands in the same tag group will not start until previous commands in the same tag group have completed

Command parameters:
- LSA - Local Store address (32-bit)
- EA - effective address (32- or 64-bit)
- TS - transfer size (16 bytes to 16 KB)
- LS - DMA list size (8 bytes to 16 KB)
- TG - tag group (5-bit)
- CL - cache management / bandwidth class

SL1 cache management commands:
- sdcrt - data cache region touch (DMA get hint)
- sdcrtst - data cache region touch for store (DMA put hint)
- sdcrz - data cache region zero
- sdcrs - data cache region store
- sdcrf - data cache region flush

Synchronization commands:
- Lockline (atomic update) commands:
  - getllar - DMA 128 bytes from EA to LS and set reservation
  - putllc - conditionally DMA 128 bytes from LS to EA
  - putlluc - unconditionally DMA 128 bytes from LS to EA
- barrier - all previous commands complete before subsequent commands are started
- mfcsync - results of all previous commands in the tag group are remotely visible
- mfceieio - results of all preceding put commands in the same group are visible with respect to succeeding get commands
Element Interconnect Bus





Internal Bandwidth Capability

- The EIB data rings handle internal communication
  - Four 16-byte data rings supporting multiple concurrent transfers
  - 96 B/cycle peak bandwidth
  - Over 100 outstanding requests
- Each EIB data port supports 25.6 GB/s* in each direction
- The EIB command bus streams commands fast enough to support 102.4 GB/s for coherent commands and 204.8 GB/s for non-coherent commands
- The EIB data rings can sustain 204.8 GB/s for certain workloads, with transient rates as high as 307.2 GB/s between bus units

* The above numbers assume a 3.2 GHz core frequency; internal bandwidth scales with core frequency.
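As a consistency check (assuming, as in public Cell documentation, that the EIB runs at half the core clock), the per-port figure follows directly from the 16-byte port width:

```latex
16~\mathrm{B/cycle} \times \frac{3.2~\mathrm{GHz}}{2} = 25.6~\mathrm{GB/s} \text{ per direction, per port}
```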
Element Interconnect Bus - Data Topology

- Four 16B data rings connecting 12 bus elements
  - Two clockwise, two counter-clockwise
  - Physically overlaps all processor elements
- A central arbiter supports up to three concurrent transfers per data ring
  - Two-stage, dual round-robin arbiter
- Each element port simultaneously supports 16B-in and 16B-out data paths
  - The ring topology is transparent to the element data interface

[Diagram: the PPE, SPE0-SPE7, MIC, BIF/IOIF0, and IOIF1 arranged around the four rings, each element with 16B-in/16B-out ports to the central data arbiter.]


Example of eight concurrent transactions

[Diagram: the twelve ring ramps (PPE, SPE0-SPE7, MIC, BIF/IOIF0, IOIF1), each with its own controller under the central data arbiter, carrying eight simultaneous transfers across rings 0-3.]


I/O and Memory Interfaces





I/O and Memory Interfaces

- The I/O provides wide bandwidth
  - Dual XDR(TM) memory controller (25.6 GB/s @ 3.2 Gbps)
  - Two configurable I/O interfaces (76.8 GB/s @ 6.4 Gbps)
  - Configurable number of byte lanes
  - Coherent or I/O protection
- Allows for multiple system configurations


Cell BE Processor Can Support Many Systems

- Game console systems
- Blades
- HDTV
- Home media servers
- Supercomputers

[Diagrams: a single Cell BE processor with XDR(TM) memory and IOIF links; two Cell BE processors joined over the BIF; and four Cell BE processors joined through a BIF switch, each with its own XDR memory and IOIF links.]




Cell Application Affinity Target Applications

77 Roadrunner Tutorial February 7, 2008


Contents

Roadrunner Project
Roadrunner Architecture
Cell Processor Architecture
Cell SDK
DaCS and ALF
Accelerating an Existing Application
References

78 Roadrunner Tutorial February 7, 2008


Cell Software Environment
[Figure: Cell software environment stack]
- Programmer experience (development tools stack): code dev tools, debug tools, performance tools, verification, miscellaneous tools
- End-user experience (execution environment): samples, workloads, demos, application libs, SPE management lib
- Runs on Linux PPC64 with Cell extensions, over a hypervisor, on hardware or the system-level simulator
- Standards: language extensions, ABI
79 Roadrunner Tutorial February 7, 2008


Cell Based Systems Software Stack

In addition, SDK 3.0 has: MAMBO simulator enhancements and overall simulation speedup; GA-quality release on RHEL5.1; a more comprehensive performance and integration test suite; system test for hybrid systems; RHEL5.1 Enterprise Linux distro.

- Applications: partners, ISVs, universities, labs, etc.
- Sector-specific libraries (ISVs, universities, labs, open source, etc.): H.264 encoder and decoder, BLAS, ALF samples, Black-Scholes sample
- Cluster and scale-out systems: cluster systems management, accelerator management, cluster file systems and protocols, etc.
- Market-segment-specific library framework: Data Communication and Synchronization layer (DaCS), Accelerated Library Framework (ALF); prototype ALF and DaCS hybrid x86-Cell function; newlib enhancements
- Application tooling and environment (ubiquitous for all markets): programming models/APIs for Cell and hybrid, Eclipse IDE, performance tools; performance tooling and visualization (VPA); code-analysis tooling (OProfile, SPU timing, PDT, FDPR-Pro); prototype gdbserver support for combined hybrid (Cell/x86) remote debugging; IDE ALF/DaCS data-description templates, wizards, and code builders; IDE PDT integration; prototype IDE hybrid performance analysis
- Compilers (C, C++, Fortran, etc.): XLC GA with additional distro-hosted platform coverage, OpenMP, and Fortran; gcc C++ library support and exception handling; gcc and xlC continued work on auto-vectorization/SIMDization; gcc Fortran on PPE and SPE; Ada on PPE
- Core libraries (e.g. SPE intrinsics): libspe 2.0 enhancements, newlib enhancements, tag manager and SPE timers on SPEs, infix vector operations
- Linux operating system features (e.g. SPE exploitation, Cell BE support): RHEL5.1 distro support; Fedora 7; IPMI and Power Executive support; RAS enhancements; MSI on QS21; Perfmon2; SPE statistics; SPE context in crash files; Axon DDR2 for swap
- Device drivers: QS20, QS21 support; F7: Triblade prototype
- Firmware (e.g. blades, development platforms): QS20, QS21 support; F7: Triblade prototype
- Cell BE based hardware

80 Roadrunner Tutorial February 7, 2008


SDK v3.0 Themes & Enhancements

Product-level tested
- Multiple HW platform support: QS20 (CB1) Fedora only; QS21 (CB+) production support
- Linux support: Fedora 7 (kernel 2.6.22); Red Hat Enterprise Linux v5.1 (kernel 2.6.18)
- Toolchain packages: gcc 4.1.1, binutils 2.17+, newlib 1.15+, gdb 6.6+

Programmer productivity: development
- Eclipse IDE plug-ins
- Dual-source XLC; dual-source XLF Fortran (beta); single-source XLC (beta)
- Cell and hybrid HPC software sample code
- Enhanced GNU toolchain support: GNU Fortran for PPE & SPE; GNU Ada (GNAT) for PPE; gcc auto-vectorization and performance enhancements

Programmer productivity: performance tools
- VPA (Visual Performance Analyzer), PDT (Performance Debugging Tool), PEP/Lock Analyzer & trace-analysis tools, CodeAnalyzer
- Enhanced OProfile support, FDPR-Pro for Cell, Hybrid Code Analyzer
- Hybrid system performance and tracing facility

Programmer productivity: runtime
- Product-level ALF and DaCS for Cell; hybrid DaCS/ALF (prototype)
- Productization of the combined PPE/SPE gdb debugger
- SPE-side software-managed cache (from iRT technology)

Market-segment library enablement
- Highly optimized SIMD and MASS math libraries; highly optimized BLAS; highly optimized libFFT
- Monte Carlo RNG library
- Cell Security Technology (prototype/preview)
81 Roadrunner Tutorial February 7, 2008
IBM SDK Development Worldwide Teams
QDC development sites and extended teams:
- Boeblingen: LTC (kernel, toolchain, performance); XLC compiler
- Toronto: SWG XLC compiler
- Rochester: programming models, test, libraries, cluster enablement
- Yorktown: XLC compiler
- India: programming model, libraries
- Austin: programming models, libraries, applications, marquee customer engagement, SDK integration, integration test; extended teams: LTC (kernel, toolchain, IDE), ARL (Mambo), STG Performance (tools and analysis)
- Beijing: programming model, advanced libraries
- Brazil: LTC (IDE)
- Haifa: STG Performance (tools)
- Canberra: LTC (kernel and toolchain)
- Raleigh: RISCWatch (TBD)
82 Roadrunner Tutorial February 7, 2008


Supported Environments
[Table: application build and execution environments]
- Build on Fedora 7: QS20/QS21, Power, or x86/x86_64 (F7 cross with the 3.0 compiler); execute on QS20/QS21 under Fedora 7 (tested) or on the Simulator with the F7 sysroot; running these builds on RHEL5 does not work
- Build on RHEL5: QS21, Power, or x86/x86_64 (cross with the 3.0 compiler); execute on QS21 under RHEL5 (service offering); the Simulator is not required on RHEL5
83 Roadrunner Tutorial February 7, 2008


Packaging
RHEL5.1 Product
Contains GA function (ALF, DaCS, BLAS, etc.)
Uses Cell enabled Kernel and LibSPE2 in RHEL5.1
Shipped on IBM Passport Advantage and Internal Extreme Leverage C13UYEN.TAR
Contains installer RPM and CellSDK-Product-RHEL_3.0.0.1.0.iso
RHEL5.1 Developer
Same content as RHEL5.1 Product
Shipped on developerWorks and BSC - CellSDK-Devel-RHEL_3.0.0.1.0.iso
Can be upgraded to RHEL5.1 Product
RHEL5.1 Extras
Extra unsupported packages, i.e. beta and prototype libraries and tools
Shipped on developerWorks and BSC - CellSDK-Extras-RHEL_3.0.0.1.0.iso
Fedora 7 Developer
Same content as RHEL5.1 Developer but adds the F7 kernel and LibSPE2
Shipped on developerWorks and BSC - CellSDK-Devel-Fedora_3.0.0.1.0.iso
Fedora 7 Extras
Same content as RHEL5.1 Extras and adds OProfile, CPC tools, Simulator
Shipped on developerWorks and BSC - CellSDK-Extras-Fedora_3.0.0.1.0.iso

84 Roadrunner Tutorial February 7, 2008


SDK 3.0 Packages
Packages: RHEL5u1 Product, RHEL5u1 Devel, RHEL5u1 Extras, Fedora 7 Devel, Fedora 7 Extras
Licenses: IPLA (RHEL5u1 Product); ILAN (RHEL5u1 and Fedora 7 Devel); ILAER (RHEL5u1 and Fedora 7 Extras)

GA SDK components:
- Kernel, LibSPE: included in RHEL (Product); BSC download (Devel)
- GCC compiler: BSC download
- Performance tools (FDPR-Pro, PDT, PDTR): Product and Devel
- Development libraries (ALF, DaCS, BLAS, SIMDMath, MASS): Product and Devel
- Cell Eclipse IDE: Product and Devel
- Examples, demos, tutorial, docs: Product and Devel

Beta SDK components (Extras packages):
- XL C/C++ single-source compilers
- Development libraries (FFT, MC, SPU Timer, ALF/DaCS for Hybrid)
- Performance tools (SPU Timing)
- Performance tools (OProfile): BSC download
- Mambo Simulator, Isolation, and CPC: Fedora 7 Extras only

Non-SDK:
- GA XL C/C++ dual-source compilers: separate product with a 60-day trial
- GA Fortran dual-source compiler: separate product (beta, then GM 4Q07); not supported on Fedora
- Visual Performance Analyzer: alphaWorks download

85 Roadrunner Tutorial February 7, 2008


Installation
YUM-based with a wrapper script
Repositories for BSC and ISO images
YUM groups with mandatory, default and optional RPMs
Cell Runtime Environment (only installed with the --runtime flag)
Cell Development Libraries
Cell Development Tools
Cell Performance Tools
Cell Programming Examples
Cell Simulator

GUI install provided by pirut (using the --gui flag)

Script also supports


Verify: lists what is installed
Uninstall
Update: apply a service pack (RHEL5.1 only)
Backout: remove the previous service pack (RHEL5.1 only)

86 Roadrunner Tutorial February 7, 2008


Overview of Installation
1. Install Operating System on hardware (diskless QS21 requires remote boot)
2. Uninstall SDK 2.1 or SDK 3.0 early release
3. Install pre-requisites including RHEL5.1 specifics:
Install compat-libstdc++
Install libspe2 runtimes
Create and install ppu-sysroot RPMs for cross-compilation
4. Download installer RPM and required ISO images
Physical media and product TAR file contain both and additional instructions (README.1st)
5. Install the SDK installer
rpm -ivh cell-install-3.0.0.1.0.noarch.rpm
6. Start the install
cd /opt/cell
./cellsdk [--iso <isodir>] [--gui] install
7. Perform post-install configuration
RHEL 5.1 specifics (elfspe on QS21, libspe2 and netpbm development libraries)
Complete IDE install into Eclipse and CDT
Configure DaCS daemons for DaCS for Hybrid-x86 and ALF for Hybrid-x86
Sync up the Simulator sysroot (Fedora 7 only)

87 Roadrunner Tutorial February 7, 2008


SDK 3.0 Documentation
Supplied as:
PDFs, Man Pages, READMEs, XHTML (for accessibility)
Contained in:
cell-install-3.0.0-1.0.noarch.rpm (Installation Guide in the installer RPM)
cell-documentation-3.0-5.noarch.rpm
cell-extras-documentation-3.0-5.noarch.rpm
alfman-3.0-10.noarch.rpm
dacsman-3.0-6.noarch.rpm
libspe2man-2.2.0-5.noarch.rpm
simdman-3.0-6.noarch.rpm
Individual other RPMs e.g. Tutorial, IDE
Located in:
/opt/cell/sdk/docs - subdirectories used for different parts of the SDK
developerWorks - http://www-128.ibm.com/developerworks/power/cell/documents.html
Other Documentation:
XL C/C++ compilers see http://www.ibm.com/software/awdtools/xlcpp/library/
QS20/QS21 hardware see dW documentation site for links
CBEA Architecture docs - see dW documentation site for links

88 Roadrunner Tutorial February 7, 2008


SDK 3.0 Documentation
Software Development Kit:
- IBM SDK for Multicore Acceleration Installation Guide
- Cell Broadband Engine Programming Handbook
- Cell Broadband Engine Programming Tutorial
- Cell Broadband Engine Programmer's Guide
- OProfile (SDK Programmer's Guide)
- PDT (SDK Programmer's Guide)
- Security SDK V3.0 Installation and Users Guide

Programming Tools Documentation:
- Performance Analysis with the IBM Full-System Simulator
- IBM Full-System Simulator User's Guide
- XL C/C++ Compiler Information: Installation Guide, Getting Started, Compiler Reference, Language Reference, Programming Guide
- XL Fortran Compiler Information: Installation Guide, Getting Started, Compiler Reference, Language Reference, Programming Guide, Using the single-source compiler
- IBM Visual Performance Analyzer User's Guide

Programming Standards:
- C/C++ Language Extensions for Cell Broadband Engine Architecture
- SPU Application Binary Interface Specification
- SIMD Math Library Specification for Cell Broadband Engine Architecture
- Cell Broadband Engine Linux Reference Implementation Application Binary Interface Specification
- SPU Assembly Language Specification

Programming Library Documentation:
- Data Communication and Synchronization Programmer's Guide and API Reference
- Data Communication and Synchronization for Hybrid-x86 Programmer's Guide and API Reference
- SPE Runtime Management Library
- SPE Runtime Management Library Version 1.2 to 2.2 Migration Guide (revised name)
- Accelerated Library Framework Programmer's Guide and API Reference
- Accelerated Library Framework for Hybrid-x86 Programmer's Guide and API Reference
- Software Development Kit 3.0 SIMD Math Library Specifications
- Basic Linear Algebra Subprograms Programmer's Guide and API Reference
- Example Library API Reference
- Cell BE Monte Carlo Library API Reference Manual
- Cell Broadband Engine Security Software Development Kit - Installation and User's Guide
- SPU Timer Library
- Mathematical Acceleration Subsystem (MASS)

89 Roadrunner Tutorial February 7, 2008


Related Products
IBM XL C/C++ for Multicore Acceleration for Linux
http://www-306.ibm.com/software/awdtools/ccompilers/
http://www-306.ibm.com/software/awdtools/xlcpp/features/multicore/

IBM XL Fortran for Multicore Acceleration for Linux on System p


http://www.alphaworks.ibm.com/tech/cellfortran (GA planned for 11/30)

Visual Performance Analyzer (VPA) on alphaWorks


http://www.alphaworks.ibm.com/tech/vpa

IBM Assembly Visualizer for Cell Broadband Engine on alphaWorks


http://www.alphaworks.ibm.com/tech/asmvis

IBM Interactive Ray Trace Demo on alphaWorks


http://www.alphaworks.ibm.com/tech/irt

90 Roadrunner Tutorial February 7, 2008


Linux Kernel

Patches made to the Linux 2.6.22 kernel provide the services required to support the Cell BE hardware facilities
Patches distributed by the Barcelona Supercomputing Center
http://www.bsc.es/projects/deepcomputing/linuxoncell
For the QS20/QS21:
- the kernel is installed into the /boot directory
- yaboot.conf is modified
- a reboot is needed to activate this kernel

91 Roadrunner Tutorial February 7, 2008


Kernel and LibSPE2
Distro Support
RHEL5.1
Fedora7
QS21 Support
IPMI and Power Executive Support
RAS Enhancements
Perfmon2
MSI support for QS21 (PCIe interrupt signaling)
Axon DDR2 for Swap
SPE utilization statistics
SPE preemptive scheduling
SPE Context in Crash Files
Enabling for Secure SDK (under CDA only)

92 Roadrunner Tutorial February 7, 2008


IBM Full-System Simulator
Emulates the behavior of a full system that contains a Cell BE processor.
Can start Linux on the simulator and run applications on the simulated operating
system.
Supports the loading and running of statically-linked executable programs and
standalone tests without an underlying operating system.
Simulation models
Functional-only simulation: Models the program-visible effects of instructions without modeling
the time it takes to run these instructions.
For code development and debugging.
Performance simulation: Models internal policies and mechanisms for system components, such
as arbiters, queues, and pipelines. Operation latencies are modeled dynamically to account for
both processing time and resource constraints.
For system and application performance analysis.
Improvements in SDK 3.0:
New Fast execution mode on 64-bit platforms
Improved performance models
New Graphical User Interface features
Improved diagnostic and statistics reporting
New device emulation support to allow booting of unmodified Linux kernels
Supplied with Fedora 7 Sysroot Image

93 Roadrunner Tutorial February 7, 2008


GCC and GNU Toolchain
Base toolchain
Based on GCC 4.1.1 extended by PPE and SPE support
binutils 2.18, SPE newlib 1.15.0+, GDB 6.6+
Support additional languages
GNU Fortran for PPE and SPE
No SPE-specific extensions (e.g. intrinsics)
GNU Ada for PPE only
Will provide Ada bindings for libspe2
Compiler performance enhancements
Improved auto-vectorization capabilities
Extract parallelism from straight-line code, outer loops
Other SPE code generation improvements
If-conversion, modulo-scheduling enhancements
New hardware support
Code generation for SPE with enhanced double precision FP

94 Roadrunner Tutorial February 7, 2008


GCC and GNU Toolchain

Help simplify Cell/B.E. application development


Syntax extension to allow use of infix operators (+, -, ...) on vectors
Additional PPU VMX intrinsics
Simplify embedding of SPE binaries into PPE objects
SPE static stack-space requirement estimation
Extended C99/POSIX run-time library support on SPE
Integrated PPE address-space access on SPE
Syntax extension to provide address-space qualified types
Access PPE-side symbols in SPE code
Integrated software-managed cache for data access
Combined PPE/SPE debugger enhancements
Extended support for debugging libspe2 code
Improved resolution of multiply-defined symbols

95 Roadrunner Tutorial February 7, 2008


GNU tool chain
Contains the GCC compiler for the PPU and the SPU.
ppu-gcc, ppu-g++, ppu32-gcc, ppu32-g++, spu-gcc, spu-g++
For the PPU, GCC replaces the native GCC on PPC platforms and it is a cross-compiler on x86.
The GCC for the PPU is preferred and the makefiles are configured to use it when building the
libraries and samples.
For the SPU, GCC contains a separate SPE cross-compiler that supports the standards defined in
the following documents:
C/C++ Language Extensions for Cell BE Architecture V2.4
SPU Application Binary Interface (ABI) Specification V1.7
SPU Instruction Set Architecture V1.2
The assembler and linker are common to both the GCC and XL C/C++ compilers.
ppu-ld, ppu-as, spu-ld, spu-as
The GCC associated assembler and linker additionally support the SPU Assembly Language
Specification V1.5.
GDB support is provided for both PPU and SPU debugging
The debugger client can be in the same process or a remote process.
GDB also supports combined (PPU and SPU) debugging.
ppu-gdb, ppu-gdbserver, ppu32-gdbserver

96 Roadrunner Tutorial February 7, 2008


XL C/C++/Fortran Compilers
IBM XL C/C++ for Multicore Acceleration for Linux, V9.0 (dual source compiler)
Product quality and support
Performance improvements in auto-SIMD
Improved diagnostic capabilities for detecting SIMD opportunities (-qreport)
Enablement of high optimization levels (O4, O5) on the SPE
Automatic generation of code overlays
IBM XL Fortran for Multicore Acceleration for Linux, V11.1 (dual source compiler)
Beta level (with GA targeted for 11/30/07)
Optimized Fortran code generation for PPE and SPE
Support for Fortran 77, 90 and 95 standards as well as many features from the Fortran 2003 standard
Auto-SIMD optimizations
Automatic generation of code overlays
IBM XL C/C++ Alpha Edition for Multicore Acceleration, V0.9 (single source compiler)
Beta level
Allows programmer to use OpenMP directives to specify parallelism on PPE and SPE
Compiler hides complexity of DMA transfers, code partitioning, overlays, etc.. from the
programmer
See http://www.research.ibm.com/journal/sj/451/eichenberger.html
97 Roadrunner Tutorial February 7, 2008
IBM XL C/C++ compiler
A cross-compiler hosted on a x86 and PPC platform.
Requires the GCC Tool chain for cross-assembling and cross-linking
applications for both the PPE and SPE.
Supports the revised 2003 International C++ Standard ISO/IEC
14882:2003(E), Programming Languages -- C++ and the ISO/IEC
9899:1999, Programming Languages -- C standard, also known as C99,
and the C89 Standard and K&R style, and language extensions for vector
programming
The XL C/C++ compiler provides the following invocation commands:
ppuxlc, ppuxlc++
spuxlc, spuxlc++
The XL C/C++ compiler includes the following base optimization levels:
-O0: almost no optimization
-O2: strong, low-level optimization that benefits most programs
-O3: intense, low-level optimization analysis with basic loop optimization
-O4: all of -O3 and detailed loop analysis and good whole-program analysis at link time
-O5: all of -O4 and detailed whole-program analysis at link time.
Auto-SIMDization is enabled by default at -O3 -qhot, -O4, and -O5.

98 Roadrunner Tutorial February 7, 2008


Eclipse IDE
IBM IDE, which is built upon the Eclipse and C/C++ Development Tools (CDT)
platform, integrates Cell GNU tool chain, XLC/GCC compilers, IBM Full-System
Simulator for the Cell BE, and other development components in order to provide a
comprehensive, user-friendly development platform that simplifies Cell BE
development.

[Figure: the Cell IDE layered on CDT and Eclipse, integrating the Cell toolchain, simulator, and performance tools]

99 Roadrunner Tutorial February 7, 2008


Cell IDE Key Features
Cell C/C++ PPE/SPE managed make project support
A C/C++ editor that supports syntax highlighting; a customizable template;
and an outline window view for procedures, variables, declarations, and
functions that appear in source code.
Full configurable build properties.
A rich C/C++ source-level PPE and/or SPE Cell GDB debugger integrated
into Eclipse.
Seamless integration of Cell BE Simulator into Eclipse
Automatic makefile generator, builder, performance tools, and several other
enhancements.
Support development platforms (x86, x86_64, Power PC, Cell)
Support target platforms
Local Cell Simulator
Remote Cell Simulator
Remote Native Cell Blade
Performance tools support
Automatic embedSPU integration
ALF programming model support
SOMA support

100 Roadrunner Tutorial February 7, 2008


ALF and DaCS: IBM's Software Enablement Strategy for Multi-core Memory-Hierarchy Systems
[Figure: ALF and DaCS in the software stack. The application and library layers sit on DaCS (topology management, process management, synchronization, error handling, send/receive, remote DMA, mailbox), ALF (data partitioning, workload distribution, error handling), and tooling (IDE, compilers, gdb, trace analysis, others), all above the platform layer (libSPE, MFC, PCIe, 10GigE).]

101 Roadrunner Tutorial February 7, 2008


DaCS Data Communications and Synchronization
Focused on data movement primitives
DMA-like interfaces (put, get)
Message interfaces (send/recv)
Mailbox
Endian Conversion
Provides Process Management, Accelerator Topology Services
Based on Remote Memory windows and Data channels architecture
Common API Intra-Accelerator (CellBE), Host - Accelerator
Double and multi-buffering
Efficient data transfer to maximize available bandwidth and minimize inherent latency
Hide complexity of asynchronous compute/communicate from developer
Supports ALF, directly used by US National Lab for Host-Accelerator
HPC environment
Thin layer of API support on CELLBE native hardware
Hybrid DaCS
Prototyped on IB Verbs (incomplete)
Developed on Sockets (prototype in SDK3)
Developed on PCI-e (Tri-blade) SDK3.0 internal, SDK4.0 GA

102 Roadrunner Tutorial February 7, 2008


DaCS Components Overview
Process Management
- Supports remote launching of an accelerator's process from a host process
Topology Management
- Identify the number of accelerators of a certain type
- Reserve a number of accelerators of a certain type
Data Movement Primitives
- Remote Direct Memory Access (rDMA): put/get, put_list/get_list
- Message passing: send/receive
- Mailbox: write to mailbox / read from mailbox
Synchronization
- Mutex / barrier
Error Handling

[Figure: a host element, HE (x86_64), over accelerator elements, AE (CBE), each in turn over AE (SPE) elements]

103 Roadrunner Tutorial February 7, 2008


Accelerator Library Framework (ALF) Overview
Aims at workloads that are highly parallelizable
e.g. Raycasting, FFT, Monte Carlo, Video Codecs
Provides a simple user-level programming framework for Cell library
developers that can be extended to other hybrid systems.
Division of Labor approach
ALF provides wrappers for computational kernels
Frees programmers from writing their own architectural-dependent code including:
data transfer, task management, double buffering, data communication
Manages data partitioning
Provides efficient scatter/gather implementations via CBE DMA
Extensible to variety of data partitioning patterns
Host and accelerator describe the scatter/gather operations
Accelerators gather the input data from, and scatter the output data to, the host's memory
Manages input/output buffers to/from SPEs
Remote error handling
Utilizes DaCS library for some low-level operations (on Hybrid)

104 Roadrunner Tutorial February 7, 2008


ALF Data Partitioning

105 Roadrunner Tutorial February 7, 2008


SPU Overlays
Overlays must be used if the sum of the lengths of all the code segments of a program, plus the lengths of the data areas required by the program, exceeds the size of SPU local storage.

They may also be used in other circumstances; for example, performance might be improved if the size of a data area can be increased by moving rarely used functions (such as error or exception handlers) into overlays.

[Figure: code and data segments in main storage mapped, a few at a time, into the smaller SPU local storage via overlays]

106 Roadrunner Tutorial February 7, 2008


SPE Software Managed Cache
SPE memory accesses are to Local Store Addresses only. Access
to main memory requires explicit DMA calls.
This represents a new programming model

Software cache has many benefits in SPE environment


Simplifies programming model
familiar load/store effective address model can be used
Decreases time to port to SPE
Take advantage of locality of reference
Can be easily optimized to match data access patterns

107 Roadrunner Tutorial February 7, 2008


SIMD Math Library
Completed Implementation of JSRE SIMD Math Library by adding:
tgammaf4 (PPU/SPU) expm1d2 (SPU)
tgammad2 (SPU) expm1f4 (PPU/SPU)
lgammaf4 (PPU/SPU) hypotf4 (PPU/SPU)
erff4 (PPU/SPU) sincosd2 (SPU)
erfcf4 (PPU/SPU) sincosf4 (PPU/SPU)
fpclassifyd2 (SPU) tanhd2 (SPU)
fpclassifyf4 (PPU/SPU) tanhf4 (PPU/SPU)
nextafterd2 (SPU) atand2dp
nextafterf4 (PPU/SPU) acoshd2 (SPU)
modff4 (PPU/SPU) acoshf4 (PPU/SPU)
lldivi2 (SPU) asinhd2 (SPU)
lldivu2 (SPU) asinhf4 (PPU/SPU)
iroundf4 (PPU/SPU) atanhd2 (SPU)
irintfr (PPU/SPU) atanhf4 (PPU/SPU)
log1pd2 (SPU) atan2d2 (SPU)
log1pf4 (PPU/SPU) atan2f4 (PPU/SPU)
108 Roadrunner Tutorial February 7, 2008
MASS and MASS/V Library
Mathematical Acceleration SubSystem
High-performance alternative to standard system math libraries
i.e. libm, SIMDmath
Versions exist for PowerPCs, PPU, SPU
Up to 23x faster than libm functions
PPU MASS
57 scalar functions, 60 vector functions
both single and double precision
SPU MASS
SDK 2.1 contains 28 SIMD functions and 28 vector functions (SP only)
Expanded SPU MASS to include all single-precision functions in PPU MASS
Added 8 new SP functions
erf, erfc, expm1, hypot, lgamma, log1p, vpopcnt4, vpopcnt8
Improved tuning of existing functions

109 Roadrunner Tutorial February 7, 2008


BLAS Library
BLAS on PPU
Conforming to standard BLAS interface
Easy port of existing applications
Selected routines optimized utilizing SPUs
Only real single-precision and real double-precision versions are supported; complex versions are not.
Selected Routines (Based on use in Cholesky factorization/LU)
BLAS I (scal, copy, axpy, dot, i_amax)
BLAS II (gemv)
BLAS III (gemm, syrk, trsm)
Focus on single precision optimization
BLAS on SPU
Offer SPU Kernel routines to SPU applications
Underlying functionality implemented on the SPE
Operate on data (input/output) residing in local store
Similar to the corresponding PPU routine but not conforming to APIs

110 Roadrunner Tutorial February 7, 2008


FFT Library (Prototype)
1D, 2D Square, 2D rectangular, 3D cube, 3D rectangular
Integer, Single Precision, Double Precision
Complex to Real, Real to Complex, Complex to Complex
Row size is Power-of-2, Power-of-Low-Primes, factorization of row size includes large primes
SPU, BE, ppu
In place, out of place
Forward, Inverse

[Table: FFT library coverage]
- 1D single precision: C2R, R2C, and C2C, for both power-of-2 and low-primes row sizes; 2 to 8192 (SPU and BE); out of place; forward (forward/inverse for C2C)
- 2D square single precision: 32 to 2048 (BE), in/out of place, forward/inverse; also 2 to 2048 (BE), out of place, forward
- 2D rectangular single precision: 2 to 2048 (BE), out of place, forward
- 2D rectangular double precision: 32 to 2048 (BE), in/out of place, forward/inverse
111 Roadrunner Tutorial February 7, 2008


Monte Carlo Random Number Generator Library (Prototype)
Types of Random Number Generators
- True: not repeatable; hardware support on IBM blades (not the Simulator)
- Quasi: repeatable with the same seed; attempts to uniformly fill an n-dimensional space; Sobol
- Pseudo: repeatable with the same seed; Mersenne Twister, Kirkpatrick-Stoll

112 Roadrunner Tutorial February 7, 2008


Choosing a Random Number Generator

Algorithm          Size      Speed     Randomness
Hardware           Small     Slowest   Random
Kirkpatrick-Stoll  Moderate  Fast      Pseudo
Mersenne Twister   Moderate  Moderate  Pseudo
Sobol              Large     Fastest   Quasi

113 Roadrunner Tutorial February 7, 2008


SPU Timer Library (Prototype)
Provides virtual clock and timer services for SPU programs

Virtual Clock
Software managed 64-bit timebase counter
Built on top of 32-bit decrementer register
Can be used for high resolution time measurements

Virtual Timers
Interval timers built on virtual clock
User registered handler is called on requested interval
Can be used for statistical profiling
Up to 4 timers can be active simultaneously, with different intervals

114 Roadrunner Tutorial February 7, 2008


Performance Tools: Static Analysis
FDPR-Pro
- Performs global optimization on the entire executable
SPU Timing
- Analysis of SPE instruction scheduling

115 Roadrunner Tutorial February 7, 2008


Performance Tools: Dynamic Analysis
Performance Debugging Tool (PDT)
Generalized tracing facility; instrumentation of DaCS and ALF
Hybrid system support, e.g. PDT on Opteron, etc.
PDTR: PDT post-processor
Post processes PDT traces
Provide analysis and summary reports (Lock analysis, DMA analysis, etc.)
OProfile (Fedora 7 only)
PPU Time and event profiling
SPU time profiling
Hardware Performance Monitoring (Fedora 7 only)
Collect performance monitoring events
Perfmon2 support and enablement for PAPI, etc.
Hybrid System Performance Monitoring and Tracing facility
Launch, activate and dynamically configure tools on CellBlades and Opteron host
blade
Synchronize, merge and aggregate traces
Visual Performance Analyzer (from alphaWorks)

116 Roadrunner Tutorial February 7, 2008


Cell/B.E. Security Technology & Security SDK
Cell/B.E. has a security architecture which vaults and protects SPE applications
- A hardware key is used to check/decrypt applications
- Secure boot
- Hardware anti-tampering

[Figure: an isolated SPE vault alongside the PPE, where the operating system (the "hotel manager") hosts guest applications over the shared bus]

Because of these robust security features, Cell/B.E. makes an ideal platform for A&D applications
- Fulfills MILS requirements

The Security SDK provides tools for encrypting/signing applications and for managing keys
- Using industry standards (X.509)
117 Roadrunner Tutorial February 7, 2008
Contents

Roadrunner Project
Roadrunner Architecture
Cell Processor Architecture
Cell SDK
DaCS and ALF
Accelerating an Existing Application
References

118 Roadrunner Tutorial February 7, 2008


Hybrid Node System Software Stack
[Figure: hybrid node software stack. The Opteron blade runs the application and MPI (to other cluster nodes) over an accelerated library, ALF/DaCS, the DaCS daemon (DaCSd), tooling (IDE, gdb, trace analysis), and the host O/S device driver. Each attached Cell blade mirrors the stack: IDE, DaCSd, tooling, accelerated library, analysis, ALF/DaCS, and compilers/profiling on Cell BE Linux.]
119 Roadrunner Tutorial February 7, 2008


ALF and DaCS: IBM's Software Enablement Strategy for Multi-core Memory-Hierarchy Systems
[Figure: ALF and DaCS in the software stack. The application and library layers sit on DaCS (topology management, process management, synchronization, error handling, send/receive, remote DMA, mailbox), ALF (data partitioning, workload distribution, error handling), and tooling (IDE, compilers, gdb, trace analysis, others), all above the platform layer (libSPE, MFC, PCIe, 10GigE).]

120 Roadrunner Tutorial February 7, 2008


DaCS Data Communications and Synchronization
Focused on data movement primitives
DMA-like interfaces (put, get)
Message interfaces (send/recv)
Mailbox
Endian Conversion
Provides Process Management, Accelerator Topology Services
Based on Remote Memory windows and Data channels architecture
Common API Intra-Accelerator (CellBE), Host - Accelerator
Double and multi-buffering
Efficient data transfer to maximize available bandwidth and minimize inherent latency
Hide complexity of asynchronous compute/communicate from developer
Supports ALF, directly used by US National Lab for Host-Accelerator
HPC environment
Thin layer of API support on CELLBE native hardware
Hybrid DaCS
Prototyped on IB Verbs (incomplete)
Developed on Sockets (prototype in SDK3)
Developed on PCI-e (Tri-blade) SDK3.0 internal, SDK4.0 GA

121 Roadrunner Tutorial February 7, 2008


DaCS Components Overview
Process Management
- Supports remote launching of an accelerator's process from a host process
Topology Management
- Identify the number of accelerators of a certain type
- Reserve a number of accelerators of a certain type
Data Movement Primitives
- Remote Direct Memory Access (rDMA): put/get, put_list/get_list
- Message passing: send/receive
- Mailbox: write to mailbox / read from mailbox
Synchronization
- Mutex / barrier
Error Handling

[Figure: a host element, HE (x86_64), over accelerator elements, AE (CBE), each in turn over AE (SPE) elements]

122 Roadrunner Tutorial February 7, 2008


DaCS Component: Topology Management

- Identify the topology of accelerator elements for a specified host element
- Reserve a specific accelerator element
- Reserve a number of accelerators of a certain type
- Release accelerator elements
- Heartbeat accelerators for availability

[Figure: an HE (Opteron) above four AE (CBE) elements; each CBE in turn fronts its AE (SPE) elements.]
DaCS Components: Process Management

- Remote launching and termination of an accelerator's process from a host process
- DaCS provides a method to start an accelerator by sending executables and associated libraries
- DaCS uses a DaCS daemon to facilitate process launching, error detection, and orphan-process cleanup
- One DaCS daemon (dacsd) per reserved accelerator
DaCS APIs

Init / Term:
  dacs_runtime_init, dacs_runtime_exit

Reservation Service:
  dacs_get_num_avail_children, dacs_reserve_children, dacs_release_de_list

Process Management:
  dacs_de_start, dacs_num_processes_supported, dacs_num_processes_running,
  dacs_de_wait, dacs_de_test

Group Functions:
  dacs_group_init, dacs_group_add_member, dacs_group_close,
  dacs_group_destroy, dacs_group_accept, dacs_group_leave, dacs_barrier_wait

Data Communication:
  dacs_remote_mem_create, dacs_remote_mem_share, dacs_remote_mem_accept,
  dacs_remote_mem_release, dacs_remote_mem_destroy, dacs_remote_mem_query,
  dacs_put, dacs_get, dacs_put_list, dacs_get_list, dacs_send, dacs_recv,
  dacs_mailbox_write, dacs_mailbox_read, dacs_mailbox_test,
  dacs_wid_reserve, dacs_wid_release, dacs_test, dacs_wait

Locking Primitives:
  dacs_mutex_init, dacs_mutex_share, dacs_mutex_accept, dacs_mutex_lock,
  dacs_mutex_try_lock, dacs_mutex_unlock, dacs_mutex_release, dacs_mutex_destroy

Error Handling:
  dacs_errhandler_reg, dacs_strerror, dacs_error_num, dacs_error_code,
  dacs_error_str, dacs_error_de, dacs_error_pid
ALF

- Division-of-labor approach: ALF provides wrappers for computational kernels and synthesizes kernels with data partitioning
- Initialization and cleanup of a group of accelerators; groups are dynamic
- Mutex locks provide the synchronization mechanism for data and processing
- Remote error handling
- Data partitioning and list creation:
  - Efficient scatter/gather implementations
  - Stateless embarrassingly parallel processing, strided partitioning, butterfly communications, etc.
  - Extensible to a variety of data-partitioning patterns
- SPMD (SDK 2.1), MPMD (SDK 3.0)
- Target prototypes: FFT, TRE, Sweep3D, Black-Scholes, linear algebra
ALF Core Concepts

[Figure: on the host (PPE), the main application and acceleration library use the host API over the Accelerated Library Framework runtime (host); on the accelerators (SPEs), compute tasks and their computation kernel(s) use the accelerator API over the ALF runtime (accelerator); the two sides connect across the Broadband Engine bus. Input and output data are partitioned into work blocks that flow through a work queue from the host to the accelerators.]
ALF / DaCS Hybrid Implementation

- Model 1: at the API level, the Opteron acts as host and the SPEs are the accelerators. The PPE is a facilitator only; programmers do not interact with the PPEs directly.
- Model 2: at the API level, the Opteron acts as host and the CellBE processors are the accelerators. The PPE runs as an ALF accelerator and as an ALF host to the SPE accelerators; Opteron programmers do not interact with the SPEs directly.
- ALF is implemented using DaCS as the basic data-mover and process-management layer.
Overview: ALF on a Data Parallel Problem

[Figures: a data-parallel input array is split into pieces that are processed independently and assembled into the output.]
ALF Workblock

- The work block is the basic data unit of ALF
- Work block = partition information (input data, output data) + parameters
- The input data and output data for a workload can be divided into many work blocks

[Figure: a work block holding an input description, an output description, and parameters, mapping a region of the input to a region of the output.]
ALF Compute Task

- A compute task processes a work block
- A task takes in the input data, context, and parameters and produces the output data

[Figure: a task with a context consumes a work block (input description, output description, parameters) and writes the output.]
ALF Task Context

- Provides a persistent data buffer across work blocks
- Can be used for all-reduce operations such as min, max, sum, average, etc.
- Can have both a read-only section and a writable section
  - The writable section can be returned to host memory once the task is finished
  - The ALF runtime does not provide data-coherency support if there is a conflict in writing the section back to host memory
  - Programmers should create a unique task context for each instance of a compute kernel on an accelerator
ALF Task Context (continued)

[Figure: the host/task main thread holds a shared read-only context; each accelerator/task instance has its own writable context buffer and processes its own stream of work blocks (WB).]
ALF Data Transfer List

- Work block input and output data descriptions are stored as data transfer lists
- The input data transfer list is used to gather data from host memory
- The output data transfer list is used to scatter data to host memory

[Figure: a work block's input description, output description, and parameters form a data transfer list that gathers scattered host-memory regions (A-J) into contiguous accelerator memory and scatters results back.]
ALF: Two Approaches to Generate the Data Transfer List

- Generate on the host side
  - (+) Straightforward and easier to program
  - (-) The host might not be able to support all accelerators
- Generate on the accelerator side
  - (-) Indirect and harder to program (parameters will be passed in)
  - (+) Many accelerators are stronger than a single control node

[Figure: accelerator-side data partitioning: the work block carries only parameters, and the data transfer list is built in accelerator memory.]
ALF Queues

Two different queues are important to programmers:
- Work block queue, one per task
  - Pending work blocks are issued to the work queue
  - The task instance on each accelerator node fetches from this queue
- Task queue, one per ALF runtime
  - Multiple tasks can execute at one time, except where the programmer specifies dependencies
  - Future tasks can be issued; they are placed on the task queue awaiting execution
The ALF runtime manages both queues.
ALF: Buffer Management on Accelerators

- ALF manages buffer allocation in the accelerators' local memory
- The ALF runtime provides pointers to five different buffers to a computational kernel:
  - Task context buffer (RO and RW sections)
  - Work block parameters
  - Input buffer
  - Input/output buffer
  - Output buffer
- ALF implements a best-effort double-buffering scheme
  - ALF determines whether there is enough local memory for double buffering
- Double-buffer scenarios supported by ALF:
  - 4 buffers: [In0, Out0; In1, Out1]
  - 3 buffers: [In0, Out0; In1] : [Out1; In2, Out2]
ALF Double-Buffering Schemes

[Figure: timeline of input (I), compute (C), and output (O) phases for work blocks WB0-WB3 under three schemes: (b) four buffers, where each block in flight gets a dedicated input and output buffer pair; (c) three buffers, where one buffer is shared between an outgoing output and the next input; and (d) two overlapped I/O buffers, where each buffer is filled, computed in place, and drained in turn.]
Synchronization Support

- Barrier
  - All work blocks enqueued before the barrier are guaranteed to be finished before any work block added after the barrier can be processed on any of the accelerators
  - Programmers can register a callback function for when the barrier is encountered; the main PPE application thread cannot proceed until the callback function returns
- Notification
  - Allows programmers to query for a specific work block's completion
  - Allows programmers to register a callback function for when this particular work block has finished
  - Does not provide any ordering
ALF Parallel Task Execution

[Figure: the main thread creates tasks and inserts work blocks, barriers, and notifications into the task queue in order; ALF execution threads process work blocks in parallel between synchronization points, invoking barrier and notify callbacks on the invoking thread, until the final task wait.]
Basic Structure of an ALF Application

[Figure: on the control node, the application performs initialization, creates the task, creates work blocks, waits on the task, and terminates; on the accelerator node, the ALF runtime prepares the input data transfer list, runs the computing kernel, and prepares the output data transfer list.]
Previous Development Environment

[Figure: without ALF, the PPU main application sits on an acceleration library that must itself implement task management (task generation and management, a task queue), scheduling (task dispatch and load balancing, a data dispatcher), and a message interface (messaging and synchronization) for cooperating with computing kernels on the SPUs; each SPU carries its own data movement (DMA, scattering/gathering), local memory management (single or double buffering), and a compute core highly optimized for the specific library.]
Development Environment with ALF

[Figure: with ALF, the PPU main application and acceleration library sit on ALF (PPU), which now provides the task management (task queue), scheduling (load balancer, data dispatcher), and message interface; each SPU runs its compute core on ALF (SPU), which supplies the data movement and local memory management. Only the computational kernel, highly optimized for the specific library, remains the library writer's job.]
Contents

Roadrunner Project
Roadrunner Architecture
Cell Processor Architecture
Cell SDK
DaCS and ALF
Accelerating an Existing Application
References
Preliminaries

- Assumption: you have a working MPI code in some language.
- Strategy: acceleration takes place within an MPI rank.
- Under these conditions, acceleration can happen incrementally.
Incremental Acceleration

1. Identify portions of code to accelerate.
2. Move them to another MPI rank.
3. Move them to the Cell PPE.
4. Move them to the Cell SPEs.
5. Vectorize.
Acceleration Opportunities

- Large portions of code, not necessarily hot loops.
- Ideals:
  - Compact data
  - Embarrassingly parallel
  - Single precision (always runs faster)
Move to Another MPI Rank

- Send your data to another MPI rank for remote processing.
- Ensures you know your data requirements.
  - Beware COMMON or member data.
- Allows you to debug using familiar tools.
Move to the Cell PPE

- Replace your MPI communications with DaCS; execute routines on the PPE.
- Need to address byte swapping.
- This is just a PowerPC: no fancy programming required.
Move to the Cell SPEs

- Compile the same code on the SPEs.
- Requires starting and managing N asynchronous SPE threads.
- Requires DMAs between Cell main memory and the SPE local stores.
Vectorize SPE Code

- Take advantage of the SPE SIMD operations and data structures.
- Every SPE instruction operates on 128-bit values (including load/store!).
- Scalar code is converted into vectors and back: lots of extra instructions.
- Like SSE or AltiVec: back-port what you did!
Hints

- Abstract your communication mechanisms.
- Be more concerned with minimizing data motion than instructions.
- Let the Opterons do some of the work in cooperation with the accelerators.
- Switch to C/C++ for your Cell implementation.
- Allow the accelerator to rotate through different algorithms.
Plan!

- Think all the way through the process before starting to implement.
- The ability to achieve acceptable levels of acceleration may depend on decisions made early in the process.
  - Especially data structures and communication costs!
Contents

Roadrunner Project
Roadrunner Architecture
Cell Processor Architecture
Cell SDK
DaCS and ALF
Accelerating an Existing Application
References
References

- Roadrunner Web Site: http://lanl.gov/roadrunner/
  - See the Roadrunner Technical Assessments
- Roadrunner Applications Portal: http://rralgs.lanl.gov/portal
- Cell Resource Center: http://www.ibm.com/developerworks/power/cell/
  - Select the Docs tab
  - Introduction to Cell programming:
    - Cell Broadband Engine Programmer's Guide
    - Cell Broadband Engine Programming Tutorial
    - Cell Broadband Engine Programming Handbook
  - Programming reference:
    - C/C++ Language Extensions for Cell Broadband Engine Architecture
    - Data Communication and Synchronization Library for Hybrid-x86 Programmer's Guide and API Reference
  - Hardware reference:
    - Cell Broadband Engine Architecture
    - Synergistic Processor Unit Instruction Set Architecture
- See me for a tarball of all Cell docs
Thank You

Questions . . .
Special Notices -- Trademarks
This document was developed for IBM offerings in the United States as of the date of publication. IBM may not make these offerings available in
other countries, and the information is subject to change without notice. Consult your local IBM business contact for information on the IBM
offerings available in your area. In no event will IBM be liable for damages arising directly or indirectly from any use of the information contained
in this document.
Information in this document concerning non-IBM products was obtained from the suppliers of these products or other public sources. Questions
on the capabilities of non-IBM products should be addressed to the suppliers of those products.
IBM may have patents or pending patent applications covering subject matter in this document. The furnishing of this document does not give
you any license to these patents. Send license inquiries, in writing, to IBM Director of Licensing, IBM Corporation, New Castle Drive, Armonk, NY
10504-1785 USA.
All statements regarding IBM future direction and intent are subject to change or withdrawal without notice, and represent goals and objectives
only.
The information contained in this document has not been submitted to any formal IBM test and is provided "AS IS" with no warranties or
guarantees either expressed or implied.
All examples cited or described in this document are presented as illustrations of the manner in which some IBM products can be used and the
results that may be achieved. Actual environmental costs and performance characteristics will vary depending on individual client configurations
and conditions.
IBM Global Financing offerings are provided through IBM Credit Corporation in the United States and other IBM subsidiaries and divisions
worldwide to qualified commercial and government clients. Rates are based on a client's credit rating, financing terms, offering type, equipment
type and options, and may vary by country. Other restrictions may apply. Rates and offerings are subject to change, extension or withdrawal
without notice.
IBM is not responsible for printing errors in this document that result in pricing or information inaccuracies.
All prices shown are IBM's United States suggested list prices and are subject to change without notice; reseller prices may vary.
IBM hardware products are manufactured from new parts, or new and serviceable used parts. Regardless, our warranty terms apply.
Many of the features described in this document are operating system dependent and may not be available on Linux. For more information,
please check: http://www.ibm.com/systems/p/software/whitepapers/linux_overview.html
Any performance data contained in this document was determined in a controlled environment. Actual results may vary significantly and are
dependent on many factors including system hardware configuration and software design and configuration. Some measurements quoted in this
document may have been made on development-level systems. There is no guarantee these measurements will be the same on generally-
available systems. Some measurements quoted in this document may have been estimated through extrapolation. Users of this document
should verify the applicable data for their specific environment.

Revised January 19, 2006
Special Notices (Cont.) -- Trademarks
The following terms are trademarks of International Business Machines Corporation in the United States and/or other countries: alphaWorks, BladeCenter,
Blue Gene, ClusterProven, developerWorks, e business(logo), e(logo)business, e(logo)server, IBM, IBM(logo), ibm.com, IBM Business Partner (logo),
IntelliStation, MediaStreamer, Micro Channel, NUMA-Q, PartnerWorld, PowerPC, PowerPC(logo), pSeries, TotalStorage, xSeries; Advanced Micro-
Partitioning, eServer, Micro-Partitioning, NUMACenter, On Demand Business logo, OpenPower, POWER, Power Architecture, Power Everywhere, Power
Family, Power PC, PowerPC Architecture, POWER5, POWER5+, POWER6, POWER6+, Redbooks, System p, System p5, System Storage, VideoCharger,
Virtualization Engine.

A full list of U.S. trademarks owned by IBM may be found at: http://www.ibm.com/legal/copytrade.shtml.

Cell Broadband Engine and Cell Broadband Engine Architecture are trademarks of Sony Computer Entertainment, Inc. in the United States, other countries,
or both.
Rambus is a registered trademark of Rambus, Inc.
XDR and FlexIO are trademarks of Rambus, Inc.
UNIX is a registered trademark in the United States, other countries or both.
Linux is a trademark of Linus Torvalds in the United States, other countries or both.
Fedora is a trademark of Redhat, Inc.
Microsoft, Windows, Windows NT and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries or both.
Intel, Intel Xeon, Itanium and Pentium are trademarks or registered trademarks of Intel Corporation in the United States and/or other countries.
AMD Opteron is a trademark of Advanced Micro Devices, Inc.
Java and all Java-based trademarks and logos are trademarks of Sun Microsystems, Inc. in the United States and/or other countries.
TPC-C and TPC-H are trademarks of the Transaction Performance Processing Council (TPPC).
SPECint, SPECfp, SPECjbb, SPECweb, SPECjAppServer, SPEC OMP, SPECviewperf, SPECapc, SPEChpc, SPECjvm, SPECmail, SPECimap and
SPECsfs are trademarks of the Standard Performance Evaluation Corp (SPEC).
AltiVec is a trademark of Freescale Semiconductor, Inc.
PCI-X and PCI Express are registered trademarks of PCI SIG.
InfiniBand is a trademark of the InfiniBand Trade Association.
Other company, product and service names may be trademarks or service marks of others.

Revised July 23, 2006

Special Notices - Copyrights

(c) Copyright International Business Machines Corporation 2005.


All Rights Reserved. Printed in the United States September 2005.

The following are trademarks of International Business Machines Corporation in the United States, or other countries, or both.
IBM IBM Logo Power Architecture

Other company, product and service names may be trademarks or service marks of others.

All information contained in this document is subject to change without notice. The products described in this document are
NOT intended for use in applications such as implantation, life support, or other hazardous uses where malfunction could result
in death, bodily injury, or catastrophic property damage. The information contained in this document does not affect or change
IBM product specifications or warranties. Nothing in this document shall operate as an express or implied license or indemnity
under the intellectual property rights of IBM or third parties. All information contained in this document was obtained in specific
environments, and is presented as an illustration. The results obtained in other operating environments may vary.

While the information contained herein is believed to be accurate, such information is preliminary, and should not be relied
upon for accuracy or completeness, and no representations or warranties of accuracy or completeness are made.

THE INFORMATION CONTAINED IN THIS DOCUMENT IS PROVIDED ON AN "AS IS" BASIS. In no event will IBM be liable
for damages arising directly or indirectly from any use of the information contained in this document.

IBM Microelectronics Division The IBM home page is http://www.ibm.com


1580 Route 52, Bldg. 504 The IBM Microelectronics Division home page is
Hopewell Junction, NY 12533-6351 http://www.chips.ibm.com
