
OpenCL

The Open Standard for Programming Heterogeneous Parallel Hardware Master Seminar Winter Term 2008/09 Multicore Parallel Programming

Peter Thoman

04-12-2008

Outline

- Introduction & Motivation
- Background:
  - GPGPU Programming History
  - Task-based Multicore CPU Programming
- OpenCL:
  - Design Overview
  - Components
  - Execution Model
  - Memory Model
  - Examples
- Open Questions & Research Opportunities


Introduction

Recent years have seen a proliferation of parallel computing devices:

- Multicore CPUs, GPUs, Cell, ... soon: manycore CPUs

A standardized programming environment is desirable, and OpenCL is intended to be that standard:

- Allow targeting various computing devices with the same program
- Simplify development for exotic hardware
- Stimulate further growth beyond HPC & research


Motivation: Why Bother?

A higher level of parallelism & specialization generally yields higher maximum performance.

[Chart: peak computation (GFlop/s) and bandwidth (Gb/s), scale 0-1000, for Intel Core 2 Quad Q9450, IBM Cell BE and NVIDIA GeForce GTX 260]


GPGPU History

Starting around 2003:

- Programmable shaders introduced on GPUs
  - Originally intended for lighting calculations on surfaces etc.
- Side effect: allows GPUs to be used as general-purpose computing devices
  - GPGPU is born

Two broad phases so far:

- Early GPGPU (2003-07): graphics APIs (DirectX/OpenGL) used to write GPGPU programs
- Current GPU computing (2007-?): vendor-supplied APIs


Early GPGPU

Graphics APIs used:

- Rendering with pixel shaders and "ping-ponging" between buffers

Disadvantages:

- Programmer must know graphics APIs and concepts
- Overheads introduced by the graphics pipeline
- No communication or synchronization primitives


Current GPU Computing


Vendor-supplied APIs: CUDA, CTM

- CUDA is far more popular:
  - CUDA Zone lists 144 projects in a large variety of fields, with speedups (over CPU) from factor 2 to 480

Advantages:

- Standard C with simple extensions
- Arbitrary reads/writes from/to memory (no texture restriction)
- Small high-speed shared memory, usable as a manual cache or for communication
- Traditional CPU functionality like bitwise integer ops

Disadvantage: vendor/hardware specific


Task-based Parallelism on CPUs


As opposed to the data-parallel model used on GPUs.

Long history of implicitly task-based systems:

- MPI or other message passing
- Basic threading
- Even fork-join

Explicitly task-based models are rather new:

- OpenMP 3.0
- Research projects like Star Superscalar (presented last week!)


OpenMP 3.0 Task Model


- Simple spawning and synchronization of tasks
- Same memory model as existing OpenMP constructs
- No dependency handling

Example: parallel postorder tree traversal (see the sketch below)
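The original slide's listing is not preserved in this extraction; below is a minimal sketch, closely following the well-known example from the OpenMP 3.0 specification. node_t and process() are illustrative names:

    #include <omp.h>

    typedef struct node {
        struct node *left, *right;
    } node_t;

    void process(node_t *p);  /* visit a single node (application-defined) */

    void postorder_traverse(node_t *p) {
        if (p->left) {
            #pragma omp task          /* traverse left subtree as a task */
            postorder_traverse(p->left);
        }
        if (p->right) {
            #pragma omp task          /* traverse right subtree as a task */
            postorder_traverse(p->right);
        }
        #pragma omp taskwait          /* wait until both subtrees are done */
        process(p);                   /* postorder: visit node after children */
    }

Called from inside a parallel region (e.g. #pragma omp parallel followed by #pragma omp single), so that the spawned tasks can be picked up by the team's threads.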



OpenCL

Important:

- The specification has not yet been released; all information is based on public presentations given at SIGGRAPH and SC08.

Timeline:

[Figure: OpenCL development timeline]


OpenCL

Broad industry support:

- The next version of Apple's OS X will most likely include the first implementation.


OpenCL Design Goals

- Enable use of all computational resources in a system: allow programming GPUs, CPUs, Cell, etc.
- Support data- and task-parallel compute models
- Approachable: a low-level, high-performance abstraction with silicon portability
- Familiar C-like parallel programming model
- Drive future hardware requirements, including floating-point precision limits
- Close integration with OpenGL for visualization


OpenCL Design Illustration

Convergence of both hardware and programming models:

[Figure: spectrum of hardware converging towards OpenCL targets, from GPUs (NVIDIA GTX 280, ATI/AMD RV770) via STI Cell to multicore CPUs (Intel Nehalem, AMD Phenom)]


OpenCL Components (1)

OpenCL consists of 3 components:

- Platform Layer
- Runtime System
- Compiler/Language Specification

Platform Layer:

- Query, select and initialize devices
- Create compute contexts and command queues

Runtime System:

- Resource management (memory, program scheduling)
- Executing compute kernels


OpenCL Components (2)

Compiler (either online or offline compilation):

- Builds components written in the compute kernel language

Language:

- Based on ISO C99; no recursion or function pointers
- Built-in types:
  - Scalar and vector data types, pointers
  - Data type conversion functions
  - Image-related types
- Built-in functions:
  - Required: work-item and synchronization functions; math (math.h), relational and geometric functions; functions to read and write images
  - Optional: double precision support and rounding modes; atomics to global and shared memory; writes to 3D images

A small kernel sketch illustrating vector types and conversions follows below.
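As an illustration of the built-in vector types and conversion functions, a minimal sketch (kernel and parameter names chosen for illustration):

    __kernel void vec_demo(__global const float4 *in,
                           __global int4 *out)
    {
        size_t i = get_global_id(0);        /* built-in work-item function */
        float4 v = in[i] * (float4)(2.0f);  /* component-wise vector arithmetic */
        out[i] = convert_int4(v);           /* built-in type conversion */
    }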


OpenCL Execution Model

Components:

- Compute kernels: basic units of computation, similar to C functions
- Compute programs: collections of kernels and internal functions

Components are queued in a command queue to execute on a specific device.

Two different execution models:

- Data-parallel
- Task-parallel


OpenCL Data-Parallel Model


- Programmer specifies an N-dimensional computation domain
- Every element is a work-item
  - Total number of items = global work size
  - The global work size is the maximum degree of parallelism for this computation
- Work-items can be grouped into work-groups
  - Mapped either explicitly or implicitly
  - Items within a group can communicate and synchronize
  - Work-groups can also be executed in parallel

(See the host-side sketch below for how these sizes are specified.)
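A sketch of how global and local work sizes might be passed when enqueuing a kernel, based on the OpenCL 1.0 host API (released shortly after this talk); queue and kernel are assumed to be already set up:

    /* 1024 work-items total, in work-groups of 64 -> 16 work-groups */
    size_t global_work_size = 1024;
    size_t local_work_size  = 64;
    cl_int err = clEnqueueNDRangeKernel(queue, kernel,
                                        1,    /* work dimensions */
                                        NULL, /* global offset (NULL in 1.0) */
                                        &global_work_size,
                                        &local_work_size,
                                        0, NULL, NULL); /* no wait events */

Passing NULL for the local work size instead lets the implementation pick the grouping, i.e. the implicit mapping mentioned above.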



OpenCL Task-Parallel Model

Optional for compute devices:

- Most current GPUs probably won't support it
- Unlike data-parallel kernels, tasks can be written either in the OpenCL kernel language or in native C/C++

Tasks are executed as a single work-item.

No clearer specification for now; conjectured to be similar to the OpenMP 3.0 model. (See the sketch below.)
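In the OpenCL 1.0 API as released, submitting a kernel as a task looks like this (a sketch; queue and kernel assumed to exist):

    /* Equivalent to an NDRange launch with global and local size 1 */
    cl_int err = clEnqueueTask(queue, kernel, 0, NULL, NULL);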


OpenCL Memory Model


- Relaxed-consistency shared memory model
- Multiple distinct address spaces, which can be collapsed on some devices:
  - Private memory: per work-item
  - Local memory: per compute unit
  - Global/constant memory

Qualifiers: __private, __local, __constant and __global
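A sketch of how the qualifiers appear in kernel code, staging data through per-work-group local memory (names illustrative):

    __kernel void stage_and_copy(__global const float *src,
                                 __global float *dst,
                                 __local float *tmp)  /* per work-group scratch */
    {
        size_t g = get_global_id(0);
        size_t l = get_local_id(0);
        tmp[l] = src[g];               /* stage into fast local memory */
        barrier(CLK_LOCAL_MEM_FENCE);  /* synchronize the work-group */
        dst[g] = tmp[l];
    }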


OpenCL Examples (1)

Simple vector addition kernel (compute-device) code:
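The slide's original listing is not preserved; a minimal, functionally equivalent sketch in the OpenCL kernel language:

    __kernel void vec_add(__global const float *a,
                          __global const float *b,
                          __global float *c)
    {
        size_t i = get_global_id(0);  /* one work-item per vector element */
        c[i] = a[i] + b[i];
    }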


OpenCL Examples (2)

Host code: initialization of a GPU device and an associated context / command queue:
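The original listing is lost in extraction; a sketch based on the OpenCL 1.0 host API, with error handling abbreviated:

    cl_device_id device;
    cl_int err;

    /* Select the first available GPU device */
    err = clGetDeviceIDs(NULL, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

    /* Create a compute context and a command queue for that device */
    cl_context context = clCreateContext(NULL, 1, &device,
                                         NULL, NULL, &err);
    cl_command_queue queue = clCreateCommandQueue(context, device, 0, &err);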


OpenCL Examples (3)

Host code: allocate device memory buffers and create / build the program:
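Again a sketch of the released 1.0 API, assuming context and device from the previous step, a kernel source string source, and an element count n:

    /* Device buffers: two read-only inputs, one write-only output */
    cl_mem a_buf = clCreateBuffer(context, CL_MEM_READ_ONLY,
                                  n * sizeof(float), NULL, &err);
    cl_mem b_buf = clCreateBuffer(context, CL_MEM_READ_ONLY,
                                  n * sizeof(float), NULL, &err);
    cl_mem c_buf = clCreateBuffer(context, CL_MEM_WRITE_ONLY,
                                  n * sizeof(float), NULL, &err);

    /* Create the program from source and compile it online */
    cl_program program = clCreateProgramWithSource(context, 1,
                                                   &source, NULL, &err);
    err = clBuildProgram(program, 1, &device, NULL, NULL, NULL);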


OpenCL Examples (4)

Host code: create and run the compute kernel:
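A sketch continuing the vector addition example; results is a host array of n floats:

    /* Create the kernel object and bind its buffer arguments */
    cl_kernel kernel = clCreateKernel(program, "vec_add", &err);
    err  = clSetKernelArg(kernel, 0, sizeof(cl_mem), &a_buf);
    err |= clSetKernelArg(kernel, 1, sizeof(cl_mem), &b_buf);
    err |= clSetKernelArg(kernel, 2, sizeof(cl_mem), &c_buf);

    /* Launch n work-items; NULL local size lets the runtime choose */
    size_t global = n;
    err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                                 &global, NULL, 0, NULL, NULL);

    /* Blocking read of the result back to the host */
    err = clEnqueueReadBuffer(queue, c_buf, CL_TRUE, 0,
                              n * sizeof(float), results, 0, NULL, NULL);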


OpenCL Examples (5)

Kernel code: matrix transpose:
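The slide's listing is lost; a simple (non-tiled) sketch of a transpose kernel, one work-item per matrix element:

    __kernel void transpose(__global const float *in,
                            __global float *out,
                            const unsigned int width,
                            const unsigned int height)
    {
        size_t x = get_global_id(0);
        size_t y = get_global_id(1);
        if (x < width && y < height)
            out[x * height + y] = in[y * width + x];  /* swap row/column */
    }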


Open Questions / Research


Shared code for vastly different hardware enables research opportunities, e.g. the distribution of kernels across devices:

- Run a given kernel on the GPU or the CPU, or maybe split it?
- Requires:
  - Analysis of kernels, either static or dynamic
  - Lookup or benchmarking of the available hardware at runtime
  - A fast decision algorithm using this information, either analytical or based on machine learning


Summary

Modern and future systems contain massively parallel, heterogeneous hardware:

- Worth the headache because of the performance potential

OpenCL:

- An open standard platform for programming such systems
- Data- and task-parallel execution models, in the tradition of GPU programming models for the former and mainstream CPU parallelization for the latter
- Relaxed-consistency shared memory with distinct, collapsible address spaces
- Release soon!



Thank you!
Consult the accompanying seminar document for a complete list of references.
