OpenCL: The Open Standard for Programming Heterogeneous Parallel Hardware
Master Seminar "Multicore Parallel Programming", Winter Term 2008/09
Peter Thoman
04-12-2008
Outline
GPGPU Programming History
Task-based Multicore CPU Programming
OpenCL
  Design Overview
  Components
  Execution Model
  Memory Model
  Examples
Introduction
Recent years:
Higher level of parallelism & specialization generally yields higher maximum performance
[Chart: peak computation (GFlop/s) and memory bandwidth (GB/s) of recent hardware]
GPGPU History
Early GPGPU (2003-07):
Graphics APIs (DirectX/OpenGL) used to write GPGPU programs
Current GPU computing (2007-?):
Vendor-supplied APIs
Early GPGPU
Rendering with pixel shaders and "ping-ponging"
Disadvantages:
Programmer must know graphics APIs and concepts
Overheads introduced by the graphics pipeline
No communication and synchronization primitives
Current GPU Computing (CUDA)
Advantages:
Standard C with simple extensions
Arbitrary reads/writes from/to memory (no texture restriction)
Small high-speed shared memory, usable as a manual cache or for communication
Traditional CPU functionality like bitwise integer operations
CudaZone lists 144 projects in a large variety of fields, with speedups (over CPU) from factor 2 to 480
Task-based Multicore CPU Programming
As opposed to a data-parallel model like on GPUs
Long history of implicitly task-based systems:
MPI or other message passing
Basic threading
Even fork-join OpenMP
Research projects like Star Superscalar
OpenMP 3.0 Tasks:
Simple spawning and synchronization of tasks
Same memory model as existing OMP constructs
No dependency handling
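As an illustration of this tasking model, here is a minimal sketch using the OpenMP 3.0 task constructs; recursive Fibonacci is a standard textbook example, not one taken from the talk:

    #include <omp.h>

    /* Each call spawns two child tasks; taskwait synchronizes with them.
       Note that there is no way to express dependencies between tasks. */
    long fib(int n) {
        long a, b;
        if (n < 2) return n;
        #pragma omp task shared(a)
        a = fib(n - 1);
        #pragma omp task shared(b)
        b = fib(n - 2);
        #pragma omp taskwait
        return a + b;
    }

    /* Tasks must be spawned from inside a parallel region,
       typically by a single thread. */
    long run(int n) {
        long r;
        #pragma omp parallel
        #pragma omp single
        r = fib(n);
        return r;
    }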
OpenCL
Important:
Specification not yet released; all information is based on public presentations given at SIGGRAPH and SC08
Timeline:
Next version of Apple's Mac OS X will most likely include the first implementation
Design Goals
Enable use of all computational resources in a system: allow programming of GPUs, CPUs, Cell, etc.
Support data- and task-parallel compute models
Approachable low-level, high-performance abstraction with silicon portability
Familiar C-like parallel programming model
Drive future hardware requirements, including floating-point precision limits
Close integration with OpenGL for visualization
Components:
Platform Layer
Runtime System
Compiler / Language Specification

Platform Layer:
Query, select and initialize devices
Create compute contexts and command queues

Runtime System:
Resource management (memory, program scheduling)
Executing compute kernels
Compiler / Language Specification:
Builds components written in the compute kernel language
Based on ISO C99; no recursion or function pointers

Built-in types:
Scalar and vector data types, pointers
Data type conversion functions
Image-related types

Built-in functions (required):
Work-item and synchronization functions
Math (math.h), relational and geometric functions
Functions to read and write images

Built-in functions (optional):
Double precision support and rounding modes
Atomics to global and shared memory
Writes to 3D images
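For a flavor of the kernel language, a small sketch using vector types, a conversion function and a work-item built-in; it is written against the OpenCL 1.0 specification as later released, so details may differ from what was presented:

    /* Scales a float4 per work-item and stores an integer copy:
       component-wise vector math, a convert_* built-in with an
       explicit rounding mode, and get_global_id. */
    __kernel void scale_and_round(__global float4 *data,
                                  __global int4   *rounded,
                                  float factor) {
        size_t i = get_global_id(0);       /* work-item built-in */
        float4 v = data[i] * factor;       /* component-wise vector math */
        data[i] = v;
        rounded[i] = convert_int4_rte(v);  /* round-to-nearest-even */
    }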
Components:
Compute Kernels: basic units of computation, similar to C functions
Compute Programs: collections of kernels and internal functions
Components are queued in a command queue to execute on a specific device
Two different execution models:
Data-Parallel
Task-Parallel
Data-Parallel Execution
The kernel is executed once per work-item; the total number of items = the global work size
The global work size is the maximum degree of parallelism for this computation
Work-items are mapped into work-groups, either explicitly or implicitly
Items in a group can communicate and synchronize
Work-groups can also be executed in parallel
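A sketch of how these concepts surface in kernel code, again assuming the built-ins of the later 1.0 release: each work-item handles one element, and the items of a work-group cooperate through local memory and a barrier:

    /* Reverses each work-group's chunk of the input: work-items stage
       data in local memory, synchronize, then write it back reordered. */
    __kernel void reverse_chunks(__global const float *in,
                                 __global float *out,
                                 __local float *tmp) {
        size_t gid = get_global_id(0);    /* index in the global work size */
        size_t lid = get_local_id(0);     /* index within the work-group */
        size_t lsz = get_local_size(0);   /* work-group size */
        tmp[lid] = in[gid];
        barrier(CLK_LOCAL_MEM_FENCE);     /* group-wide synchronization */
        out[gid] = tmp[lsz - 1 - lid];
    }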
Task-Parallel Execution
A task is a kernel executed as a single work-item
Most current GPUs probably won't support it
Unlike data-parallel kernels, tasks can be written in either the OpenCL kernel language or native C/C++
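In the API as later released, this distinction surfaces as two enqueue calls; a hedged fragment, where host_task is an illustrative name and queue/kernel are created as in the host-code walkthrough below:

    #include <CL/cl.h>

    /* A native C function run as a task; no kernel language involved. */
    void host_task(void *args) {
        /* ... arbitrary C/C++ code ... */
    }

    void submit_tasks(cl_command_queue queue, cl_kernel kernel) {
        /* Run an OpenCL kernel as a single work-item task. */
        clEnqueueTask(queue, kernel, 0, NULL, NULL);
        /* Run a native function as a task, only on devices that support it. */
        clEnqueueNativeKernel(queue, host_task, NULL, 0, 0, NULL, NULL,
                              0, NULL, NULL);
    }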
Memory Model
Relaxed-consistency shared memory model
Multiple distinct address spaces, which can be collapsed on some devices:
Private Memory: per work-item
Local Memory: per compute unit
Global/Constant Memory: per device
Qualifiers: __private, __local, __constant and __global
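A small sketch of where these qualifiers appear in practice (the kernel itself is illustrative):

    /* Kernel arguments live in the __global and __constant spaces;
       the automatic variable v is in __private space. A __local
       buffer, shared per work-group, was shown in the earlier
       data-parallel example. */
    __kernel void scale_offset(__global float *data,
                               __constant float *params) {
        size_t i = get_global_id(0);
        float v = data[i];                     /* __private by default */
        data[i] = v * params[0] + params[1];   /* reads from __constant */
    }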
Examples

Host Code: initialization of a GPU device and an associated context / command queue:
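A minimal sketch of these steps, written against the OpenCL 1.0 API released in December 2008; since the talk predates the final specification, the exact calls are an assumption:

    #include <CL/cl.h>

    /* Create a GPU context and command queue (error handling elided). */
    static cl_command_queue init_gpu(cl_context *ctx_out, cl_device_id *dev_out) {
        cl_platform_id plat;
        cl_int err;
        /* Pick a platform and the first GPU device it exposes. */
        clGetPlatformIDs(1, &plat, NULL);
        clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 1, dev_out, NULL);
        /* Create a compute context holding that device... */
        *ctx_out = clCreateContext(NULL, 1, dev_out, NULL, NULL, &err);
        /* ...and a command queue used to submit work to it. */
        return clCreateCommandQueue(*ctx_out, *dev_out, 0, &err);
    }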
Host Code: allocating device memory buffers and creating / building the program:
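Continuing the sketch; the kernel source string and the kernel name "scale" are illustrative assumptions:

    /* Allocate a device buffer, build a program from source, and
       extract a kernel object from it by name. */
    static cl_kernel setup_kernel(cl_context ctx, cl_device_id dev,
                                  const char *src, size_t n, cl_mem *buf_out) {
        cl_int err;
        /* Device memory buffer, read and written by the kernel. */
        *buf_out = clCreateBuffer(ctx, CL_MEM_READ_WRITE,
                                  n * sizeof(float), NULL, &err);
        /* Create the program object from source and compile it. */
        cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
        clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
        return clCreateKernel(prog, "scale", &err);
    }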
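To round off the walkthrough, a hedged sketch of the remaining steps one would expect at this point: setting kernel arguments, enqueueing a data-parallel execution, and reading the result back:

    #include <stdlib.h>

    /* Launch the "scale" kernel over n elements and read the result back. */
    static float *run_scale(cl_command_queue queue, cl_kernel kernel,
                            cl_mem buf, size_t n, float factor) {
        clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);
        clSetKernelArg(kernel, 1, sizeof(float), &factor);
        /* One work-item per element: global work size = n; the
           work-group size is left to the implementation (NULL). */
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &n, NULL,
                               0, NULL, NULL);
        /* Blocking read: copies the buffer back into host memory. */
        float *result = malloc(n * sizeof(float));
        clEnqueueReadBuffer(queue, buf, CL_TRUE, 0, n * sizeof(float),
                            result, 0, NULL, NULL);
        return result;
    }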
Research Opportunities
Shared code for vastly different hardware enables research opportunities, e.g. the distribution of kernels among devices:
Run a given kernel on the GPU or the CPU, or maybe split it?
Requires:
Analysis of kernels, either statically or dynamically
Lookup or benchmarking of available hardware at runtime (see the sketch below)
A fast decision algorithm using this information
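For the runtime hardware lookup, the released API offers device queries; a small sketch in which the selected properties are merely examples:

    #include <CL/cl.h>

    /* Query properties that a kernel-distribution heuristic might use. */
    static void query_device(cl_device_id dev) {
        cl_uint units;
        cl_ulong mem;
        clGetDeviceInfo(dev, CL_DEVICE_MAX_COMPUTE_UNITS,
                        sizeof(units), &units, NULL);
        clGetDeviceInfo(dev, CL_DEVICE_GLOBAL_MEM_SIZE,
                        sizeof(mem), &mem, NULL);
        /* ... feed into the decision: run on CPU, GPU, or split the work ... */
    }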
Summary
Heterogeneous parallel hardware: worth the headache because of its performance potential
OpenCL:
An open standard platform for programming such systems
Data- and task-parallel execution models, in the tradition of GPU programming models for the former and mainstream CPU parallelization for the latter
Distinct, collapsible address spaces
Release soon!
Thank you!
Consult the accompanying seminar document for a complete list of references.