LA-UR-08-2818
Contents
Roadrunner Project
Roadrunner Architecture
Cell Processor Architecture
Cell SDK
DaCS and ALF
Accelerating an Existing
Application
References
Advanced Algorithms
Evaluation of Cell potential for HPC
Roadrunner: 1.3 petaflop/s Cell Accelerated Opteron Cluster
[Figure: Roadrunner schedule, Aug 2008 to Jan 2010. Recoverable labels: system delivery; connect file systems; start and finish of acceptance; start and finish of system and code stabilization (estimated); science runs, including VPIC; weapons science; initial integration; secure system availability.]
5 Roadrunner Tutorial February 7, 2008
Roadrunner Phase 1 CU layout (Redtail)
[Figure: connected unit (CU) layout. Recoverable labels: 131 compute nodes; 4 dual-core Opterons; 12 I/O nodes; service node; terminal server; disk; IB 288-port SDR switch (144+144); IB4x SDR links, 96 to the 2nd-stage switch; IB SMC switch; GE CVLAN-n; GE MVLAN.]
18 CU clusters
(100s of such
cluster nodes)
Node-attached Cells is what makes Roadrunner different!
[Figure: node memory and links. Recoverable labels: 4 GB of shared memory (per Cell); 8 GB of NUMA shared memory (per blade); 16 GB of distributed memory (per node); 8 GB; HT x16; PCIe x8 (2 per blade) (2 GB/s, 2 us); 21.3 GB/s/chip.]
[Figure: Roadrunner software stack. Recoverable labels: IDE; gdb; Accelerated Lib; Tooling; Trace Analysis; DaCSd; ALF/DaCS; Host O/S DD; Opteron Blade.]
Cell: PowerPC compiler, PPE
Opteron: x86 compiler
DaCS (OpenMPI) over PCIe: remote communication to/from the Cell
data communication & synchronization
process management & synchronization
computationally-intense offload
OpenMPI (cluster) over IB
Frequency Wall
Diminishing returns from deeper pipelines
Memory Wall
Processor frequency vs. DRAM memory latency
Latency introduced by multiple levels of memory
Power Wall
Limits in CMOS technology
Hard limit to acceptable system power
[Figure: 64b Power Architecture cores (Power ISA, MMU/BIU) on a coherent bus, with memory and IO translation, incl. coherence/memory.]
Compatible with 32/64b Power Arch. applications and OSs
Cell Architecture is 64b Power Architecture
Plus Memory Flow Control (MFC)
[Figure: Power cores and synergistic processors (Syn. Proc.) on a coherent bus. Recoverable labels: Power ISA; MMU/BIU +RMT; MMU/DMA +RMT; COHERENT BUS (+RAG); memory and IO translation; Local Store memory; LS alias.]
DMA into and out of Local Store equivalent to Power core loads &
stores
Governed by Power Architecture page and segment tables for
translation and protection
Shared memory model
Power architecture compatible addressing
MMIO capabilities for SPEs
Local Store is mapped (alias) allowing LS to LS DMA transfers
DMA equivalents of locking loads & stores
OS management/virtualization of SPEs
Pre-emptive context switch is supported (but not efficient)
[Figure: Cell block diagram. Eight SPEs, each an SPU with SXU and LS (16B/cycle), and the PPE with PXU, L1 and L2 (32B/cycle, 16B/cycle), plus XDR memory and FlexIO interfaces.]
[Figure: PPE and SPE execution units. PPE labels: Microcode; Decode; Dependency; Issue; Thread A / Thread B; L1 Data Cache; Branch Execution Unit; Load/Store Unit; Fixed-Point Unit; VMX/FPU Issue (Queue); VMX Arith./Logic Unit; VMX Load/Store/Permute; FPU Arith/Logic Unit; FPU Load/Store; Completion/Flush. SPE labels: Channel Unit; Local Store; MFC (DMA Unit).]
No translation
Multiuser operating system is running on control processor
Can be mapped as system memory; cached copies are non-coherent with respect to SPU loads/stores
Instruction Issue Unit / Instruction Line Buffer 128B Read 128B Write
Instruction class   Pipeline   Latency (cycles)
Shift (FX)          0          4
Byte (BYTE)         0          4
Permute (PERM)      1          4
Load (LS)           1          6
Branch              1          4
Channel             1          6
* The above numbers assume a 3.2 GHz core frequency; internal bandwidth scales with core frequency.
Element Interconnect Bus - Data Topology
Four 16B data rings connecting 12 bus elements
Two clockwise / Two counter-clockwise
Physically overlaps all processor elements
Central arbiter supports up to three concurrent transfers per data ring
Two stage, dual round robin arbiter
Each element port simultaneously supports 16B in and 16B out data path
Ring topology is transparent to element data interface
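As a rough upper bound, the ring figures above can simply be multiplied out. The sketch below is only arithmetic on the numbers from this slide. The bus clock of half the 3.2 GHz core frequency is an assumption that is not stated here, and sustained bandwidth on real hardware will be lower than this bound.

```python
# Back-of-the-envelope EIB data bandwidth from the figures on this slide.
RINGS = 4                      # four data rings
BYTES_PER_TRANSFER = 16        # each ring is 16B wide
MAX_TRANSFERS_PER_RING = 3     # central arbiter limit per ring
CORE_CLOCK_HZ = 3.2e9          # core frequency quoted in the tutorial

# ASSUMPTION (not stated on this slide): the bus runs at half the core clock.
BUS_CLOCK_HZ = CORE_CLOCK_HZ / 2


def eib_peak_bytes_per_bus_cycle() -> int:
    """Bytes moved per bus cycle if every ring carries its maximum transfers."""
    return RINGS * MAX_TRANSFERS_PER_RING * BYTES_PER_TRANSFER


def eib_peak_gb_per_s(bus_clock_hz: float = BUS_CLOCK_HZ) -> float:
    """Upper bound on aggregate data bandwidth in GB/s (1 GB = 1e9 bytes)."""
    return eib_peak_bytes_per_bus_cycle() * bus_clock_hz / 1e9


def port_gb_per_s_each_way(bus_clock_hz: float = BUS_CLOCK_HZ) -> float:
    """Per-element port bandwidth in one direction (16B in and 16B out)."""
    return BYTES_PER_TRANSFER * bus_clock_hz / 1e9


if __name__ == "__main__":
    print("bytes per bus cycle:", eib_peak_bytes_per_bus_cycle())
    print("aggregate upper bound, GB/s:", eib_peak_gb_per_s())
    print("per port, each direction, GB/s:", port_gb_per_s_each_way())
```

Passing a different `bus_clock_hz` shows how the bound scales with frequency.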
[Figure: EIB data topology. Twelve ramps (numbered 0 to 11) connect the PPE, SPE0 to SPE7, MIC, BIF/IOIF0 and IOIF1 to four 16B rings (Ring0 to Ring3) under a central data arbiter.]
[Figure: Cell BE system configurations for HDTV, home media servers and supercomputers; Cell BE processors attach through IOIF and connect to each other through BIF.]
[Figure: SDK overview. Recoverable labels: End-User Experience; Verification; Hypervisor; Standards: language extensions, ABI.]
Red Hat Enterprise Linux v5.1 (kernel level 2.6.18)
Toolchain packages: gcc 4.1.1, binutils 2.17+, newlib 1.15+, gdb 6.6+
GNU Fortran for PPE & SPE
GNU Ada (GNAT) for PPE
gcc autovectorization and performance enhancements
Programmer Productivity - Performance Tools:
VPA (Visual Performance Analyzer)
PDT (Performance Debugging Tool)
PEP/Lock Analyzer & Trace Analysis Tools
CodeAnalyzer
Enhanced Oprofile support
FDPR-Pro for Cell
Hybrid Code Analyzer
Hybrid System Performance and Tracing Facility
Programmer Productivity - Runtime:
Product-level ALF and DaCS for Cell
Hybrid DaCS/ALF (prototype)
Productization of combined PPE/SPE gdb debugger
SPE-side Software Managed Cache (from iRT technology)
Market Segment Library Enablement:
Highly optimized SIMD and MASS Math Libraries
Highly optimized BLAS
Highly optimized libFFT
Monte Carlo RNG Library
Cell Security Technology (prototype/preview)
IBM SDK Development Worldwide Teams
[Figure: map of SDK development teams. Recoverable labels: Boeblingen; Toronto; Rochester; Yorktown; India; QDC and QDC Extended Team; LTC (kernel, toolchain, performance); SWG XLC Compiler; Prog models, Test, Libs, Cluster enablement; Programming model, libs.]
[Table residue: SDK platform support. Recoverable entries: Fedora 7 on Power; Fedora 7 (Simulator with F7 sysroot); Fedora 7 on x86/x86_64 (F7 cross with 3.0 compiler); RHEL5; RHEL5 on x86/x86_64 (cross with 3.0 compiler); "Does not Work"; "Not Supported"; GA Fortran DS compiler: separate product with Beta and then GM 4Q07; Visual Performance Analyzer: alphaWorks download.]
IBM SDK for Multicore Acceleration Installation Guide
Cell Broadband Engine Programming Handbook
Cell Broadband Engine Programming Tutorial
Cell Broadband Engine Programmer's Guide: Oprofile (SDK Programmer's Guide), PDT (SDK Programmer's Guide)
Security SDK V3.0 Installation and Users Guide
Programming Tools Documentation: Performance Analysis with the IBM Full-System Simulator, IBM Full-System Simulator User's Guide
XL C/C++ Compiler Information: Installation Guide, Getting Started, Compiler Reference, Language Reference, Programming Guide
XL Fortran Compiler Information: Installation Guide, Getting Started, Compiler Reference, Language Reference, Programming Guide
Using the single-source compiler
IBM Visual Performance Analyzer User's Guide
C/C++ Language Extensions for Cell Broadband Engine Architecture
SPU Application Binary Interface Specification
SIMD Math Library Specification for Cell Broadband Engine Architecture
Cell Broadband Engine Linux Reference Implementation Application Binary Interface Specification
SPU Assembly Language Specification
Programming Library Documentation
Data Communication and Synchronization Programmer's Guide and API Reference
Data Communication and Synchronization for Hybrid-x86 Programmer's Guide and API Reference
SPE Runtime Management Library
SPE Runtime Management Library Version 1.2 to 2.2 Migration Guide (revised name)
Accelerated Library Framework Programmer's Guide and API Reference
Accelerated Library Framework for Hybrid-x86 Programmer's Guide and API Reference
Software Development Kit 3.0 SIMD Math Library Specifications
Basic Linear Algebra Subprograms Programmer's Guide and API Reference
Example Library API Reference
Cell BE Monte Carlo Library API Reference Manual
Cell Broadband Engine Security Software Development Kit - Installation and User's Guide
SPU Timer Library
Mathematical Acceleration Subsystem (MASS)
[Figure: Cell IDE on Eclipse CDT. Other recoverable labels: Library; Synchronization; gdb; Platform; libSPE; MFC; PCIe; 10GigE.]
[Figure: code segments and data segments mapped into Local Storage.]
[Table residue: FFT library options. Recoverable entries: 2D; 32 to 2048 (BE); rectangular; in/out of place; double; forward/inverse.]
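Quasi-random
repeatable with same seed
attempts to uniformly fill n-dimension space
generator: Sobol
Pseudo-random
repeatable with same seed
generators: Mersenne Twister, Kirkpatrick-Stoll
[Table residue: Mersenne Twister row reads "Moderate, Moderate, Pseudo"; column headings not recoverable.]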
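The "repeatable with same seed" property can be illustrated without the Cell SDK. The sketch below uses Python's standard `random` module, whose core generator is a Mersenne Twister, purely as a stand-in; it is not the SDK Monte Carlo library, whose API is not shown in this tutorial, and the helper name `draw` is hypothetical.

```python
import random


def draw(seed: int, n: int) -> list:
    """Return n pseudo-random floats in [0, 1) from a generator seeded with `seed`.

    random.Random is a Mersenne Twister generator; it stands in here for the
    SDK library, which is NOT being called.
    """
    rng = random.Random(seed)  # private generator, independent of global state
    return [rng.random() for _ in range(n)]


if __name__ == "__main__":
    first = draw(seed=2008, n=3)
    second = draw(seed=2008, n=3)
    # Same seed gives the same stream, which makes a Monte Carlo run reproducible.
    print(first == second)
```

Using a private `random.Random` instance rather than the module-level functions keeps each stream independent, which matters when several workers each need their own reproducible sequence.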
Virtual Clock
Software managed 64-bit timebase counter
Built on top of 32-bit decrementer register
Can be used for high resolution time measurements
Virtual Timers
Interval timers built on virtual clock
User registered handler is called on requested interval
Can be used for statistical profiling
Up to 4 timers can be active simultaneously, with different intervals
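The virtual clock idea, a 64-bit software counter layered on a 32-bit down-counting register, can be sketched as follows. This is a host-side illustration only, not the SPU Timer Library: `VirtualClock` and `FakeDecrementer` are hypothetical names, and the sketch assumes the clock is sampled at least once per full wrap of the 32-bit register.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


class FakeDecrementer:
    """Hypothetical stand-in for a 32-bit register that counts down and wraps."""

    def __init__(self, start: int) -> None:
        self.value = start & MASK32

    def advance(self, ticks: int) -> None:
        # Counting down past zero wraps around to 0xFFFFFFFF.
        self.value = (self.value - ticks) & MASK32

    def read(self) -> int:
        return self.value


class VirtualClock:
    """64-bit tick counter maintained in software on top of a 32-bit decrementer."""

    def __init__(self, read_decrementer) -> None:
        self._read = read_decrementer
        self._last = read_decrementer() & MASK32
        self._ticks = 0

    def now(self) -> int:
        """Return total elapsed ticks; must be called at least once per wrap."""
        current = self._read() & MASK32
        # Modular subtraction gives the elapsed count even across a wrap.
        elapsed = (self._last - current) & MASK32
        self._ticks = (self._ticks + elapsed) & MASK64
        self._last = current
        return self._ticks


if __name__ == "__main__":
    dec = FakeDecrementer(start=5)
    clock = VirtualClock(dec.read)
    dec.advance(10)          # crosses zero and wraps
    print(clock.now())
```

A real implementation would typically take the periodic sample from a decrementer interrupt handler so that no wrap is missed; that part is omitted here.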
[Figure: ALF overview. The host element (HE, Opteron) uses the host API to set up a compute task; accelerator elements (AE, SPEs) run the computation kernel(s) through the accelerator API. A work block carries parameters, an input description and an output description.]
Input data and output data for a workload can be divided into many work blocks
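A minimal sketch of that partitioning, in plain Python rather than the ALF API: the data is cut into fixed-size work blocks, a kernel runs on each block, and the results are written back at the matching offsets. The names `WorkBlock`, `make_work_blocks` and `run_task` are hypothetical, and the kernel is assumed to return as many elements as it receives.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class WorkBlock:
    """Hypothetical descriptor: which slice of the data one block covers."""
    offset: int
    count: int


def make_work_blocks(total: int, block_size: int) -> List[WorkBlock]:
    """Cover `total` elements with blocks of at most `block_size` elements."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [
        WorkBlock(offset, min(block_size, total - offset))
        for offset in range(0, total, block_size)
    ]


def run_task(
    data: Sequence[float],
    block_size: int,
    kernel: Callable[[Sequence[float]], List[float]],
) -> List[float]:
    """Apply `kernel` block by block and reassemble the output in order."""
    output: List[float] = [0.0] * len(data)
    for wb in make_work_blocks(len(data), block_size):
        chunk = data[wb.offset:wb.offset + wb.count]          # "input" side
        output[wb.offset:wb.offset + wb.count] = kernel(chunk)  # "output" side
    return output


if __name__ == "__main__":
    doubled = run_task(list(range(10)), 4, lambda c: [2 * x for x in c])
    print(doubled)
```

Because each block is independent, the loop body is the unit that a runtime could hand to any available accelerator; here it simply runs sequentially.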
[Figure: a task consists of a context and many work blocks (WB); each work block has an input description and an output description.]
[Figure: data transfer list. A work block's parameters and data transfer list map data items A to J in host memory to locations in accelerator memory.]
ALF - Queues
[Figure: buffer usage over timeline steps 0 to 9; I = input, C = computation, O = output. Column alignment is not recoverable.]
(a) I/O and computation operations
WB0: I0 C0 O0
WB1: I1 C1 O1
WB2: I2 C2 O2
WB3: I3 C3 O3
(c) Buffer usage of 3 buffers
Buf0: I0 WB0 WB1 O1 I3 WB3
Buf1: WB0 O0 I2 WB2 WB3 O3
Buf2: I1 WB1 WB2 O2
(d) Buffer usage of overlapped I/O buffers
Buf0: I0 WB0 O0 I2 WB2 O2
Buf1: I1 WB1 O1 I3 WB3 O3
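The two-buffer pattern in panel (d) can be reproduced with a small scheduling sketch. It is an idealized model, not ALF code: every operation is assumed to take one time step, and the output of the previous block and the input of the next block are both assumed to fit within one compute step on the idle buffer. The function names are hypothetical, and `C{k}` marks the compute phase that the figure labels `WB{k}`.

```python
from typing import List, Tuple

Step = List[Tuple[int, str]]  # (buffer index, operation) pairs active in one step


def double_buffer_schedule(n_blocks: int) -> List[Step]:
    """Idealized schedule: compute block k while the other buffer does O(k-1), I(k+1)."""
    steps: List[Step] = []
    if n_blocks <= 0:
        return steps
    steps.append([(0, "I0")])                      # prime the first buffer
    for k in range(n_blocks):
        buf, other = k % 2, 1 - (k % 2)
        step: Step = [(buf, f"C{k}")]              # compute in the current buffer
        if k >= 1:
            step.append((other, f"O{k - 1}"))      # drain the previous result
        if k + 1 < n_blocks:
            step.append((other, f"I{k + 1}"))      # prefetch the next input
        steps.append(step)
    steps.append([((n_blocks - 1) % 2, f"O{n_blocks - 1}")])  # final write-back
    return steps


def buffer_timeline(steps: List[Step], buf: int) -> List[str]:
    """Operations seen by one buffer, in order."""
    return [op for step in steps for b, op in step if b == buf]


if __name__ == "__main__":
    schedule = double_buffer_schedule(4)
    print("steps:", len(schedule), "vs", 3 * 4, "without overlap")
    print("Buf0:", " ".join(buffer_timeline(schedule, 0)))
    print("Buf1:", " ".join(buffer_timeline(schedule, 1)))
```

Under these assumptions, n blocks finish in n + 2 steps instead of 3n, which is the benefit the overlapped layout is meant to show.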
[Figure: ALF synchronization constructs. Numbered groups 1 to 5 are separated by Barrier, Notify (with Callback) and Task Wait points.]
Basic Structure of an ALF Application
[Figure: control node and accelerator node. Control node: Initialization; Create Task; Create Workblock. Accelerator node: Prepare Input DTL; Computing Kernel; ALF Runtime.]
Acceleration Library: specifies methods for cooperating with computing kernels on SPUs
Task Management: task generation and management
Scheduling: task dispatch and load balance
Msg. interface: message, synchronization
[Figure: Accelerated Library Framework layering. On the PPU, the Acceleration Library sits on ALF (PPU), which provides the task queue, load balancer and data dispatcher above the OS. Each SPU runs a compute core on ALF (SPU), which handles data movement and memory management.]
Questions . . .
Special Notices -- Trademarks
This document was developed for IBM offerings in the United States as of the date of publication. IBM may not make these offerings available in
other countries, and the information is subject to change without notice. Consult your local IBM business contact for information on the IBM
offerings available in your area. In no event will IBM be liable for damages arising directly or indirectly from any use of the information contained
in this document.
Information in this document concerning non-IBM products was obtained from the suppliers of these products or other public sources. Questions
on the capabilities of non-IBM products should be addressed to the suppliers of those products.
IBM may have patents or pending patent applications covering subject matter in this document. The furnishing of this document does not give
you any license to these patents. Send license inquiries, in writing, to IBM Director of Licensing, IBM Corporation, New Castle Drive, Armonk, NY
10504-1785 USA.
All statements regarding IBM future direction and intent are subject to change or withdrawal without notice, and represent goals and objectives
only.
The information contained in this document has not been submitted to any formal IBM test and is provided "AS IS" with no warranties or
guarantees either expressed or implied.
All examples cited or described in this document are presented as illustrations of the manner in which some IBM products can be used and the
results that may be achieved. Actual environmental costs and performance characteristics will vary depending on individual client configurations
and conditions.
IBM Global Financing offerings are provided through IBM Credit Corporation in the United States and other IBM subsidiaries and divisions
worldwide to qualified commercial and government clients. Rates are based on a client's credit rating, financing terms, offering type, equipment
type and options, and may vary by country. Other restrictions may apply. Rates and offerings are subject to change, extension or withdrawal
without notice.
IBM is not responsible for printing errors in this document that result in pricing or information inaccuracies.
All prices shown are IBM's United States suggested list prices and are subject to change without notice; reseller prices may vary.
IBM hardware products are manufactured from new parts, or new and serviceable used parts. Regardless, our warranty terms apply.
Many of the features described in this document are operating system dependent and may not be available on Linux. For more information,
please check: http://www.ibm.com/systems/p/software/whitepapers/linux_overview.html
Any performance data contained in this document was determined in a controlled environment. Actual results may vary significantly and are
dependent on many factors including system hardware configuration and software design and configuration. Some measurements quoted in this
document may have been made on development-level systems. There is no guarantee these measurements will be the same on generally-available
systems. Some measurements quoted in this document may have been estimated through extrapolation. Users of this document
should verify the applicable data for their specific environment.
A full list of U.S. trademarks owned by IBM may be found at: http://www.ibm.com/legal/copytrade.shtml.
Cell Broadband Engine and Cell Broadband Engine Architecture are trademarks of Sony Computer Entertainment, Inc. in the United States, other countries,
or both.
Rambus is a registered trademark of Rambus, Inc.
XDR and FlexIO are trademarks of Rambus, Inc.
UNIX is a registered trademark in the United States, other countries or both.
Linux is a trademark of Linus Torvalds in the United States, other countries or both.
Fedora is a trademark of Red Hat, Inc.
Microsoft, Windows, Windows NT and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries or both.
Intel, Intel Xeon, Itanium and Pentium are trademarks or registered trademarks of Intel Corporation in the United States and/or other countries.
AMD Opteron is a trademark of Advanced Micro Devices, Inc.
Java and all Java-based trademarks and logos are trademarks of Sun Microsystems, Inc. in the United States and/or other countries.
TPC-C and TPC-H are trademarks of the Transaction Processing Performance Council (TPC).
SPECint, SPECfp, SPECjbb, SPECweb, SPECjAppServer, SPEC OMP, SPECviewperf, SPECapc, SPEChpc, SPECjvm, SPECmail, SPECimap and
SPECsfs are trademarks of the Standard Performance Evaluation Corp (SPEC).
AltiVec is a trademark of Freescale Semiconductor, Inc.
PCI-X and PCI Express are registered trademarks of PCI SIG.
InfiniBand is a trademark of the InfiniBand Trade Association.
Other company, product and service names may be trademarks or service marks of others.
The following are trademarks of International Business Machines Corporation in the United States, or other countries, or both.
IBM IBM Logo Power Architecture
All information contained in this document is subject to change without notice. The products described in this document are
NOT intended for use in applications such as implantation, life support, or other hazardous uses where malfunction could result
in death, bodily injury, or catastrophic property damage. The information contained in this document does not affect or change
IBM product specifications or warranties. Nothing in this document shall operate as an express or implied license or indemnity
under the intellectual property rights of IBM or third parties. All information contained in this document was obtained in specific
environments, and is presented as an illustration. The results obtained in other operating environments may vary.
While the information contained herein is believed to be accurate, such information is preliminary, and should not be relied
upon for accuracy or completeness, and no representations or warranties of accuracy or completeness are made.
THE INFORMATION CONTAINED IN THIS DOCUMENT IS PROVIDED ON AN "AS IS" BASIS. In no event will IBM be liable
for damages arising directly or indirectly from any use of the information contained in this document.