OpenCL: The Open Standard for Programming Heterogeneous Parallel Hardware
Master Seminar "Multicore Parallel Programming", Winter Term 2008/09
Peter Thoman
04-12-2008
Outline
GPGPU Programming History
Task-based Multicore CPU Programming
OpenCL
  Design Overview
  Components
  Execution Model
  Memory Model
  Examples
Introduction
Recent years:
Higher level of parallelism & specialization generally yields higher maximum performance
[Chart: peak computation (GFlop/s) and memory bandwidth (GB/s) of recent hardware]
GPGPU History
Early GPGPU (2003-07):
Graphics APIs (DirectX/OpenGL) used to write GPGPU programs
Current GPU computing (2007-?):
Vendor-supplied APIs
Early GPGPU
Rendering with pixel shaders and "ping-ponging"
Disadvantages:
Programmer must know graphics APIs and concepts
Overheads introduced by the graphics pipeline
No communication and synchronization primitives
Current GPU Computing (CUDA)
Advantages:
Standard C with simple extensions
Arbitrary reads/writes from/to memory (no texture restriction)
Small high-speed shared memory, usable as a manual cache or for communication
Traditional CPU functionality like bitwise integer operations
CudaZone lists 144 projects in a large variety of fields, with speedups (over CPU) from factor 2 to 480
Task-based Multicore CPU Programming
As opposed to a data-parallel model like on GPUs
Long history of implicitly task-based systems:
MPI or other message passing
Basic threading
Even fork-join OpenMP
Research projects like Star Superscalar
OpenMP 3.0 Tasks:
Simple spawning and synchronization of tasks
Same memory model as existing OMP constructs
No dependency handling
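As an illustration of this tasking model, here is a minimal sketch using the OpenMP 3.0 task constructs; recursive Fibonacci is a standard textbook example, not one taken from the talk:

    #include <omp.h>

    /* Each call spawns two child tasks; taskwait synchronizes with them.
       Note that there is no way to express dependencies between tasks. */
    long fib(int n) {
        long a, b;
        if (n < 2) return n;
        #pragma omp task shared(a)
        a = fib(n - 1);
        #pragma omp task shared(b)
        b = fib(n - 2);
        #pragma omp taskwait
        return a + b;
    }

    /* Tasks must be spawned from inside a parallel region,
       typically by a single thread. */
    long run(int n) {
        long r;
        #pragma omp parallel
        #pragma omp single
        r = fib(n);
        return r;
    }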
OpenCL
Important:
Specification not yet released; all information is based on public presentations given at SIGGRAPH and SC08
Timeline:
Next version of Apple's Mac OS X will most likely include the first implementation
Design Goals
Enable use of all computational resources in a system: allow programming of GPUs, CPUs, Cell, etc.
Support data- and task-parallel compute models
Approachable low-level, high-performance abstraction with silicon portability
Familiar C-like parallel programming model
Drive future hardware requirements, including floating-point precision limits
Close integration with OpenGL for visualization
Components:
Platform Layer
Runtime System
Compiler / Language Specification

Platform Layer:
Query, select and initialize devices
Create compute contexts and command queues

Runtime System:
Resource management (memory, program scheduling)
Executing compute kernels
Compiler / Language Specification:
Builds components written in the compute kernel language
Based on ISO C99; no recursion or function pointers

Built-in types:
Scalar and vector data types, pointers
Data type conversion functions
Image-related types

Built-in functions (required):
Work-item and synchronization functions
Math (math.h), relational and geometric functions
Functions to read and write images

Built-in functions (optional):
Double precision support and rounding modes
Atomics to global and shared memory
Writes to 3D images
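For a flavor of the kernel language, a small sketch using vector types, a conversion function and a work-item built-in; it is written against the OpenCL 1.0 specification as later released, so details may differ from what was presented:

    /* Scales a float4 per work-item and stores an integer copy:
       component-wise vector math, a convert_* built-in with an
       explicit rounding mode, and get_global_id. */
    __kernel void scale_and_round(__global float4 *data,
                                  __global int4   *rounded,
                                  float factor) {
        size_t i = get_global_id(0);       /* work-item built-in */
        float4 v = data[i] * factor;       /* component-wise vector math */
        data[i] = v;
        rounded[i] = convert_int4_rte(v);  /* round-to-nearest-even */
    }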
Components:
Compute Kernels: basic units of computation, similar to C functions
Compute Programs: collections of kernels and internal functions
Components are queued in a command queue to execute on a specific device
Two different execution models:
Data-Parallel
Task-Parallel
Data-Parallel Execution
The kernel is executed once per work-item; the total number of items = the global work size
The global work size is the maximum degree of parallelism for this computation
Work-items are mapped into work-groups, either explicitly or implicitly
Items in a group can communicate and synchronize
Work-groups can also be executed in parallel
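A sketch of how these concepts surface in kernel code, again assuming the built-ins of the later 1.0 release: each work-item handles one element, and the items of a work-group cooperate through local memory and a barrier:

    /* Reverses each work-group's chunk of the input: work-items stage
       data in local memory, synchronize, then write it back reordered. */
    __kernel void reverse_chunks(__global const float *in,
                                 __global float *out,
                                 __local float *tmp) {
        size_t gid = get_global_id(0);    /* index in the global work size */
        size_t lid = get_local_id(0);     /* index within the work-group */
        size_t lsz = get_local_size(0);   /* work-group size */
        tmp[lid] = in[gid];
        barrier(CLK_LOCAL_MEM_FENCE);     /* group-wide synchronization */
        out[gid] = tmp[lsz - 1 - lid];
    }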
Task-Parallel Execution
A task is a kernel executed as a single work-item
Most current GPUs probably won't support it
Unlike data-parallel kernels, tasks can be written in either the OpenCL kernel language or native C/C++
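In the API as later released, this distinction surfaces as two enqueue calls; a hedged fragment, where host_task is an illustrative name and queue/kernel are created as in the host-code walkthrough below:

    #include <CL/cl.h>

    /* A native C function run as a task; no kernel language involved. */
    void host_task(void *args) {
        /* ... arbitrary C/C++ code ... */
    }

    void submit_tasks(cl_command_queue queue, cl_kernel kernel) {
        /* Run an OpenCL kernel as a single work-item task. */
        clEnqueueTask(queue, kernel, 0, NULL, NULL);
        /* Run a native function as a task, only on devices that support it. */
        clEnqueueNativeKernel(queue, host_task, NULL, 0, 0, NULL, NULL,
                              0, NULL, NULL);
    }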
Memory Model
Relaxed-consistency shared memory model
Multiple distinct address spaces, which can be collapsed on some devices:
Private Memory: per work-item
Local Memory: per compute unit
Global/Constant Memory: per device
Qualifiers: __private, __local, __constant and __global
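A small sketch of where these qualifiers appear in practice (the kernel itself is illustrative):

    /* Kernel arguments live in the __global and __constant spaces;
       the automatic variable v is in __private space. A __local
       buffer, shared per work-group, was shown in the earlier
       data-parallel example. */
    __kernel void scale_offset(__global float *data,
                               __constant float *params) {
        size_t i = get_global_id(0);
        float v = data[i];                     /* __private by default */
        data[i] = v * params[0] + params[1];   /* reads from __constant */
    }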
Examples

Host Code: initialization of a GPU device and an associated context / command queue:
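A minimal sketch of these steps, written against the OpenCL 1.0 API released in December 2008; since the talk predates the final specification, the exact calls are an assumption:

    #include <CL/cl.h>

    /* Create a GPU context and command queue (error handling elided). */
    static cl_command_queue init_gpu(cl_context *ctx_out, cl_device_id *dev_out) {
        cl_platform_id plat;
        cl_int err;
        /* Pick a platform and the first GPU device it exposes. */
        clGetPlatformIDs(1, &plat, NULL);
        clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 1, dev_out, NULL);
        /* Create a compute context holding that device... */
        *ctx_out = clCreateContext(NULL, 1, dev_out, NULL, NULL, &err);
        /* ...and a command queue used to submit work to it. */
        return clCreateCommandQueue(*ctx_out, *dev_out, 0, &err);
    }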
Host Code: allocating device memory buffers and creating / building the program:
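Continuing the sketch; the kernel source string and the kernel name "scale" are illustrative assumptions:

    /* Allocate a device buffer, build a program from source, and
       extract a kernel object from it by name. */
    static cl_kernel setup_kernel(cl_context ctx, cl_device_id dev,
                                  const char *src, size_t n, cl_mem *buf_out) {
        cl_int err;
        /* Device memory buffer, read and written by the kernel. */
        *buf_out = clCreateBuffer(ctx, CL_MEM_READ_WRITE,
                                  n * sizeof(float), NULL, &err);
        /* Create the program object from source and compile it. */
        cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
        clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
        return clCreateKernel(prog, "scale", &err);
    }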
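To round off the walkthrough, a hedged sketch of the remaining steps one would expect at this point: setting kernel arguments, enqueueing a data-parallel execution, and reading the result back:

    #include <stdlib.h>

    /* Launch the "scale" kernel over n elements and read the result back. */
    static float *run_scale(cl_command_queue queue, cl_kernel kernel,
                            cl_mem buf, size_t n, float factor) {
        clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);
        clSetKernelArg(kernel, 1, sizeof(float), &factor);
        /* One work-item per element: global work size = n; the
           work-group size is left to the implementation (NULL). */
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &n, NULL,
                               0, NULL, NULL);
        /* Blocking read: copies the buffer back into host memory. */
        float *result = malloc(n * sizeof(float));
        clEnqueueReadBuffer(queue, buf, CL_TRUE, 0, n * sizeof(float),
                            result, 0, NULL, NULL);
        return result;
    }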
Research Opportunities
Shared code for vastly different hardware enables research opportunities, e.g. the distribution of kernels among devices:
Run a given kernel on the GPU or the CPU, or maybe split it?
Requires:
Analysis of kernels, either statically or dynamically
Lookup or benchmarking of available hardware at runtime (see the sketch below)
A fast decision algorithm using this information
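For the runtime hardware lookup, the released API offers device queries; a small sketch in which the selected properties are merely examples:

    #include <CL/cl.h>

    /* Query properties that a kernel-distribution heuristic might use. */
    static void query_device(cl_device_id dev) {
        cl_uint units;
        cl_ulong mem;
        clGetDeviceInfo(dev, CL_DEVICE_MAX_COMPUTE_UNITS,
                        sizeof(units), &units, NULL);
        clGetDeviceInfo(dev, CL_DEVICE_GLOBAL_MEM_SIZE,
                        sizeof(mem), &mem, NULL);
        /* ... feed into the decision: run on CPU, GPU, or split the work ... */
    }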
Summary
Heterogeneous parallel hardware: worth the headache because of its performance potential
OpenCL:
An open standard platform for programming such systems
Data- and task-parallel execution models, in the tradition of GPU programming models for the former and mainstream CPU parallelization for the latter
Distinct, collapsible address spaces
Release soon!
Thank you!
Consult the accompanying seminar document for a complete list of references.