
OpenCL 1.0
Jason Yang, Advanced Micro Devices

Copyright Khronos Group, 2008

Processor Parallelism

CPUs: multiple cores driving performance increases
  Multi-processor programming, e.g. OpenMP
GPUs: increasingly general-purpose data-parallel computing, improving numerical precision
  Graphics APIs and Shading Languages
Emerging Intersection: Heterogeneous Computing
  OpenCL

OpenCL: Open Computing Language

Open, royalty-free standard for portable, parallel programming of heterogeneous CPUs, GPUs, and other processors

Anatomy of OpenCL

Language Specification
  C-based cross-platform programming interface
  Subset of ISO C99 with language extensions - familiar to developers
  Well-defined numerical accuracy - IEEE 754 rounding behavior with specified maximum error
  Online or offline compilation and build of compute kernel executables
  Includes a rich set of built-in functions
Platform Layer API
  A hardware abstraction layer over diverse computational resources
  Query, select and initialize compute devices
  Create compute contexts and work-queues
Runtime API
  Execute compute kernels
  Manage scheduling, compute, and memory resources

Talk Overview
OpenCL Architecture
OpenCL Framework
Vector Addition Example
OpenCL Extended Features
OpenGL Interoperability

OpenCL Architecture
Hierarchy of Models


OpenCL Platform Model (Section 3.1)

One Host + one or more Compute Devices
  Each Compute Device is composed of one or more Compute Units
    Each Compute Unit is further divided into one or more Processing Elements

OpenCL Execution Model (Section 3.2)

OpenCL Program:
  Kernels
    Basic unit of executable code, similar to a C function
    Data-parallel or task-parallel
  Host Program
    Collection of compute kernels and internal functions
    Analogous to a dynamic library

Kernel Execution
  The host program invokes a kernel over an index space called an NDRange
    NDRange = "N-Dimensional Range"
    NDRange can be a 1-, 2-, or 3-dimensional space
  A single kernel instance at a point in the index space is called a work-item
    Work-items have unique global IDs from the index space
  Work-items are further grouped into work-groups
    Work-groups have a unique work-group ID
    Work-items have a unique local ID within a work-group

Kernel Execution

Total number of work-items = Gx × Gy
Size of each work-group = Sx × Sy
Global ID can be computed from work-group ID and local ID
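The relation between work-group ID, local ID and global ID can be sketched in plain C. This is an illustration only, not OpenCL API code: the `Id2` type and both function names are hypothetical, and the sketch assumes the global size is an exact multiple of the work-group size in each dimension.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper type: one (x, y) pair in a 2-D index space. */
typedef struct { size_t x, y; } Id2;

/* Global ID = work-group ID * work-group size + local ID, per dimension. */
Id2 global_id(Id2 group_id, Id2 group_size, Id2 local_id)
{
    Id2 g;
    assert(local_id.x < group_size.x && local_id.y < group_size.y);
    g.x = group_id.x * group_size.x + local_id.x;
    g.y = group_id.y * group_size.y + local_id.y;
    return g;
}

/* Number of work-groups per dimension for a Gx x Gy range split into
   Sx x Sy groups (assumes exact divisibility). */
Id2 num_groups(Id2 global_size, Id2 group_size)
{
    Id2 n;
    assert(global_size.x % group_size.x == 0);
    assert(global_size.y % group_size.y == 0);
    n.x = global_size.x / group_size.x;
    n.y = global_size.y / group_size.y;
    return n;
}
```

For example, with work-groups of size 4 × 8, the work-item with local ID (3, 5) in work-group (2, 1) has global ID (2·4 + 3, 1·8 + 5) = (11, 13).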

Contexts and Queues (Section 3.2.1)



Contexts are used to contain and manage the state of the "world"
Kernels are executed in contexts defined and manipulated by the host
  Devices
  Kernels - OpenCL functions
  Program objects - kernel source and executable
  Memory objects
Command-queue - coordinates execution of kernels
  Kernel execution commands
  Memory commands - transfer or mapping of memory object data
  Synchronization commands - constrains the order of commands
Applications queue compute kernel execution instances
  Queued in-order
  Executed in-order or out-of-order
  Events are used to implement appropriate synchronization of execution instances

OpenCL Memory Model (Section 3.3)



Shared memory model
  Relaxed consistency
Multiple distinct address spaces
  Address spaces can be collapsed depending on the device's memory subsystem
Address spaces
  Private - private to a work-item
  Local - local to a work-group
  Global - accessible by all work-items in all work-groups
  Constant - read-only global space
Implementations map this hierarchy to available physical memories

[Figure: a Compute Device contains Compute Units 1..N. Each Compute Unit holds Work Items 1..M, each with its own Private Memory, plus one Local Memory per unit. The device has a Global / Constant Memory Data Cache; Global Memory resides in Compute Device Memory.]

Relaxed Memory Consistency (Section 3.3.1)


The state of memory visible to a work-item is not guaranteed to be consistent across the collection of work-items at all times
Within a work-item, memory has load/store consistency
Within a work-group, local memory is consistent across work-items at a barrier
Global memory is consistent within a work-group at a barrier, but is not guaranteed to be consistent across different work-groups
Consistency of memory shared between commands is enforced through synchronization

Programming Model
Data-Parallel (Section 3.4.1)
  Data-parallel execution model must be implemented by all OpenCL compute devices
  Define N-Dimensional computation domain
  Execute multiple work-groups in parallel
    Work-items in group can communicate with each other
    Can synchronize execution among work-items in group to coordinate memory access

Task-Parallel (Section 3.4.2)
  Some compute devices such as CPUs can also execute task-parallel compute kernels
    Executes as a single work-item
    A compute kernel written in OpenCL
    A native C / C++ function

OpenCL Framework


Basic OpenCL Program Structure


Host program
  Platform Layer
    Query compute devices
    Create contexts
  Runtime
    Create memory objects associated to contexts
    Compile and create kernel program objects
    Issue commands to command-queue
    Synchronization of commands
    Clean up OpenCL resources
Kernels
  Language
    C code with some restrictions and extensions

Language for Compute Kernels (Chapter 6)



OpenCL C Programming Language
  Derived from ISO C99
  A few restrictions: recursion, function pointers, functions in C99 standard headers ...
  Preprocessing directives defined by C99 are supported
Built-in Data Types
  Scalar and vector data types, pointers
  Data-type conversion functions: convert_type<_sat><_roundingmode>
  Image types: image2d_t, image3d_t and sampler_t
Built-in Functions - Required
  Work-item functions, math.h, read and write image
  Relational, geometric functions, synchronization functions
Built-in Functions - Optional
  Double precision, atomics to global and local memory
  Selection of rounding mode, writes to image3d_t surface
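To make the `convert_type<_sat><_roundingmode>` naming concrete, here is a host-side C sketch of what a saturating float-to-uchar conversion is understood to do. The function names are hypothetical emulations, not the OpenCL built-in itself. The sketch assumes that conversions to integer types round toward zero by default and that saturated mode converts NaN to 0.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical emulation of convert_uchar_sat(float):
   clamp to [0, 255], truncate toward zero, NaN becomes 0. */
uint8_t emu_convert_uchar_sat(float x)
{
    if (x != x)      return 0;    /* NaN compares unequal to itself */
    if (x <= 0.0f)   return 0;    /* saturate at the low end */
    if (x >= 255.0f) return 255;  /* saturate at the high end */
    return (uint8_t)x;            /* C float-to-int conversion truncates */
}

/* A few sample conversions, written as checks. */
void emu_convert_uchar_sat_examples(void)
{
    assert(emu_convert_uchar_sat(300.7f) == 255);
    assert(emu_convert_uchar_sat(-4.0f)  == 0);
    assert(emu_convert_uchar_sat(12.9f)  == 12);
}
```

Without the `_sat` suffix, out-of-range inputs are not clamped, which is why image-style code that packs floats into 8-bit channels usually wants the saturating form.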

Example: Vector Addition - Kernel


__kernel void vec_add (__global const float *a,
                       __global const float *b,
                       __global float *c)
{
    int gid = get_global_id(0);
    c[gid] = a[gid] + b[gid];
}

Spec Guide
  __kernel: Section 6.7.1
  __global: Section 6.5.1
  get_global_id(): Section 6.11.1
  Data types: Section 6.1
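One way to read the kernel is as the body of a loop that the runtime spreads across work-items. The plain C sketch below emulates that sequentially on the host. It is an illustration under stated assumptions, not how any OpenCL implementation schedules work: both function names are hypothetical, and the `gid` parameter stands in for `get_global_id(0)`.

```c
#include <assert.h>
#include <stddef.h>

/* Body of vec_add for a single work-item; gid replaces get_global_id(0). */
void vec_add_work_item(const float *a, const float *b, float *c, size_t gid)
{
    c[gid] = a[gid] + b[gid];
}

/* Sequential stand-in for a 1-D NDRange launch of global_work_size items.
   A real device may run these iterations in parallel and in any order. */
void run_vec_add(const float *a, const float *b, float *c,
                 size_t global_work_size)
{
    size_t gid;
    assert(a != NULL && b != NULL && c != NULL);
    for (gid = 0; gid < global_work_size; ++gid)
        vec_add_work_item(a, b, c, gid);
}
```

Because each work-item touches only its own element, no iteration depends on another, which is what makes the kernel data-parallel.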

Vector Addition - Kernel: Function Qualifiers

__kernel qualifier declares a function as a kernel
Kernels can call other kernel functions

Vector Addition - Kernel: Address Space Qualifiers

__global, __local, __constant, __private
Pointer kernel arguments must be declared with an address space qualifier

Vector Addition - Kernel: Work-item Functions

Query work-item identifiers
  get_work_dim(), get_global_id(), get_local_id(), get_group_id()

Language Highlights
Image functions
  Images must be accessed through built-in functions
  Reads/writes performed through sampler objects
Synchronization functions
  Barriers - all work-items within a work-group must execute the barrier function before any work-item can continue
  Memory fences - provides ordering between memory operations

Restrictions
Pointers to functions are not allowed
Pointers to pointers allowed within a kernel, but not as an argument
Bit-fields are not supported
Variable length arrays and structures are not supported
Recursion is not supported
Writes to a pointer of types less than 32-bit are not supported
Double types are not supported, but reserved
3D Image writes are not supported

Some restrictions are addressed through extensions

Example: Vector Addition - Host API (1)


// create the OpenCL context on a GPU device
cl_context context = clCreateContextFromType(0, CL_DEVICE_TYPE_GPU, NULL, NULL, NULL);

// get the list of GPU devices associated with context
size_t cb;
clGetContextInfo(context, CL_CONTEXT_DEVICES, 0, NULL, &cb);
cl_device_id *devices = malloc(cb);
clGetContextInfo(context, CL_CONTEXT_DEVICES, cb, devices, NULL);

// create a command-queue
cl_command_queue cmd_queue = clCreateCommandQueue(context, devices[0], 0, NULL);

// allocate the buffer memory objects
cl_mem memobjs[3];
memobjs[0] = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                            sizeof(cl_float)*n, srcA, NULL);
memobjs[1] = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                            sizeof(cl_float)*n, srcB, NULL);
memobjs[2] = clCreateBuffer(context, CL_MEM_WRITE_ONLY,
                            sizeof(cl_float)*n, NULL, NULL);

Spec Guide
  Contexts and context creation: Section 4.3
  Command Queues: Section 5.1
  Creating buffer objects: Section 5.2.1


Platform Layer (Chapter 4)



The platform layer allows applications to query for platform-specific features
Querying devices (Section 4.2)
  Find devices: CPUs, GPUs, or Accelerators
  Get device information:
    Number of compute cores
    NDRange limits
    Maximum work-group size
    Sizes of the different memory spaces (constant, local, global)
    Maximum memory object size
Creating contexts (Section 4.3)
  clCreateContext() and clCreateContextFromType()
  Contexts are used by the OpenCL runtime to manage objects and execute kernels on one or more devices
  Contexts are associated to one or more devices
    Multiple contexts could be associated to the same device

Example: Vector Addition - Host API (1): Context Creation

clCreateContextFromType() creates the context from the GPU device type
A handle to the created context is returned

Example: Vector Addition - Host API (1): Querying devices and information

The two clGetContextInfo() calls query the devices: the first returns the size of the device list, the second fills it in

Command-Queues (Section 5.1)


Command-queues store a set of operations to perform
Command-queues are associated to a context
Multiple command-queues can be created to handle independent commands that don't require synchronization
Execution of the command-queue is guaranteed to be completed at sync points

Example: Vector Addition - Host API (1): Command-Queue

clCreateCommandQueue() creates the command-queue from the context

Memory Objects (Section 5.2)

Buffer objects
  One-dimensional collection of objects (like C arrays)
  Valid elements include scalar and vector types as well as user defined structures
  Buffer objects can be accessed via pointers in the kernel
Image objects
  Two- or three-dimensional texture, frame-buffer, or images
  Must be addressed through built-in functions
Sampler objects
  Describes how to sample an image in the kernel
    Addressing modes
    Filtering modes

Creating Memory Objects



clCreateBuffer(), clCreateImage2D(), and clCreateImage3D()
Memory objects are created with an associated context
Memory can be created as read only, write only, or read-write
Where objects are created in the platform memory space can be controlled
  Device memory
  Device memory with data copied from a host pointer
  Host memory
  Host memory associated with a pointer
    Memory at that pointer is guaranteed to be valid at synchronization points
Image objects are also created with a channel format
  Channel order (e.g., RGB, RGBA, etc.)
  Channel type (e.g., UNORM INT8, FLOAT, etc.)

Example: Vector Addition - Host API (1): Creating Memory Objects - Buffers

The three clCreateBuffer() calls create the two read-only input buffers and the write-only output buffer

Manipulating Object Data



Object data can be copied to host memory, from host memory, or to other objects
Memory commands are enqueued in the command-queue and processed when the command is executed
  clEnqueueReadBuffer(), clEnqueueReadImage()
  clEnqueueWriteBuffer(), clEnqueueWriteImage()
  clEnqueueCopyBuffer(), clEnqueueCopyImage()
Data can be copied between Image and Buffer objects
  clEnqueueCopyImageToBuffer()
  clEnqueueCopyBufferToImage()
Regions of the object data can be accessed by mapping into the host address space
  clEnqueueMapBuffer(), clEnqueueMapImage()
  clEnqueueUnmapMemObject()

Example: Vector Addition - Host API (1): Host memory associated with a pointer

Example: Vector Addition - Host API (2)


// create the program
cl_program program = clCreateProgramWithSource(context, 1, &program_source, NULL, NULL);

// build the program
cl_int err = clBuildProgram(program, 0, NULL, NULL, NULL, NULL);

// create the kernel
cl_kernel kernel = clCreateKernel(program, "vec_add", NULL);

// set the args values
err  = clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *)&memobjs[0]);
err |= clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *)&memobjs[1]);
err |= clSetKernelArg(kernel, 2, sizeof(cl_mem), (void *)&memobjs[2]);

Spec Guide
  Creating program objects: Section 5.4.1
  Building program executables: Section 5.4.2
  Creating kernel objects: Section 5.5.1
  Setting kernel arguments: Section 5.5.2


Program Objects (Section 5.4)


Program objects encapsulate:
  An associated context
  Program source or binary
  Latest successful program build, list of targeted devices, build options
  Number of attached kernel objects

Build process
  1. Create program object
       clCreateProgramWithSource()
       clCreateProgramWithBinary()
  2. Build program executable
       clBuildProgram()
       Compile and link from source or binary for all devices or specific devices in the associated context

Example: Vector Addition - Host API (2): Create Program Object

clCreateProgramWithSource() creates the program object from the kernel source

Example: Vector Addition - Host API (2): Build the Program Executable

clBuildProgram() builds the program executable


Kernel Objects (Section 5.5)

Kernel objects encapsulate
  Specific kernel functions declared in a program
  Argument values used for kernel execution
Setting arguments
  clSetKernelArg(<kernel>, <argument index>)
  Each argument data must be set for the kernel function
  Argument values are copied and stored in the kernel object
Kernel vs. program objects
  Kernels are related to program execution
  Programs are related to program source

Example: Vector Addition - Host API (2): Set Kernel Arguments

Each clSetKernelArg() call sets one argument of vec_add: index 0 is a, index 1 is b, index 2 is c

Example: Vector Addition - Host API (3)


// set work-item dimensions
size_t global_work_size[1] = { n };

// execute kernel
err = clEnqueueNDRangeKernel(cmd_queue, kernel, 1, NULL, global_work_size, NULL,
                             0, NULL, NULL);

// read output array
err = clEnqueueReadBuffer(cmd_queue, memobjs[2], CL_TRUE, 0, n*sizeof(cl_float), dst,
                          0, NULL, NULL);

Spec Guide
  Executing Kernels: Section 5.6
  Reading, writing, and copying buffer objects: Section 5.2.2

Kernel Execution (Section 5.6)



A command to execute a kernel must be enqueued to the command-queue
clEnqueueNDRangeKernel()
  Data-parallel execution model
  Describes the index space for kernel execution
  Requires information on NDRange dimensions and work-group size
clEnqueueTask()
  Task-parallel execution model (multiple queued tasks)
  Kernel is executed on a single work-item
clEnqueueNativeKernel()
  Task-parallel execution model
  Executes a native C/C++ function not compiled using the OpenCL compiler
  This mode does not use a kernel object so arguments must be passed in
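When an explicit work-group size is passed to clEnqueueNDRangeKernel(), OpenCL 1.0 expects the global work size to be evenly divisible by it. A common workaround, sketched below, pads the global size up to the next multiple. The helper name is hypothetical and this is only one possible approach. It assumes the kernel receives the true element count `n` as an extra argument and skips padding work-items with a check such as `if (gid < n)`.

```c
#include <assert.h>
#include <stddef.h>

/* Round n up to the next multiple of local_size (hypothetical helper). */
size_t round_up_global_size(size_t n, size_t local_size)
{
    assert(local_size > 0);
    return ((n + local_size - 1) / local_size) * local_size;
}
```

For 1000 elements and a work-group size of 64, this yields a global size of 1024, so the last 24 work-items do no work. The vector addition example avoids the issue by passing NULL for the work-group size and letting the implementation choose one.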

Example: Vector Addition - Host API (3): Data-Parallel Execution Model

clEnqueueNDRangeKernel() enqueues the kernel over a one-dimensional NDRange of n work-items

Command-Queue Execution
Execution model signals when commands are complete or data is ready
Command-queue could be explicitly flushed to the device
Command-queues execute in-order or out-of-order
  In-order - commands complete in the order queued and memory is consistent
  Out-of-order - no guarantee when commands are executed or memory is consistent without synchronization

Synchronization
Signals when commands are completed to the host or other commands in queue
Blocking calls - commands do not return until complete
Event objects - tracks execution status of a command
  Some commands can be blocked until event objects signal a completion of previous command
    clEnqueueNDRangeKernel() can take an event object as an argument and wait until a previous command (e.g., clEnqueueWriteBuffer) is complete
  Profiling
Queue barriers - queued commands that can block command execution

Example: Vector Addition - Host API (3): Blocking Call

clEnqueueReadBuffer() is called with CL_TRUE, so it returns only when the command is complete

OpenCL Extended Features


Optional Extensions (Chapter 9)


Extensions are optional features exposed through OpenCL
The OpenCL working group has already approved many extensions that are supported by the OpenCL specification
  Double precision floating-point types (Section 9.3)
  Built-in functions to support doubles (Section 9.3)
  Atomic functions (Section 9.5, 9.6, 9.7)
  3D Image writes (Section 9.8)
  Byte addressable stores (writing to pointers with types less than 32-bits) (Section 9.9)
  Built-in functions to support half types (Section 9.10)
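Because these features are optional, a host program typically checks for them before relying on them. A device reports its extensions as a space-separated string of names (queried with clGetDeviceInfo() and CL_DEVICE_EXTENSIONS). The sketch below shows one way to test that string for a whole-name match. The helper name is hypothetical and the code deliberately avoids any OpenCL calls so that it stands alone.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Return 1 if name appears as a complete space-delimited token in ext_list,
   0 otherwise (hypothetical helper). */
int has_extension(const char *ext_list, const char *name)
{
    size_t len;
    const char *p;

    assert(ext_list != NULL && name != NULL);
    len = strlen(name);
    if (len == 0)
        return 0;

    p = ext_list;
    while ((p = strstr(p, name)) != NULL) {
        int starts = (p == ext_list) || (p[-1] == ' ');
        int ends   = (p[len] == '\0') || (p[len] == ' ');
        if (starts && ends)
            return 1;
        p += len;  /* partial match only; keep searching */
    }
    return 0;
}
```

The token check matters because a plain substring search would treat a prefix of a longer extension name as a match. For instance, `has_extension(exts, "cl_khr_fp64")` would guard a code path that uses double precision.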

OpenGL Interoperability (Appendix B)

Both standards under one IP framework
  Efficient, inter-API communication
OpenCL can efficiently share resources with OpenGL
  Textures, Buffer Objects and Renderbuffers
  Data is shared, not copied
  OpenCL objects are created from OpenGL objects
    clCreateFromGLBuffer(), clCreateFromGLTexture2D(), clCreateFromGLRenderbuffer()
Applications can select compute device(s) to run OpenGL and OpenCL
  Efficient queuing of OpenCL and OpenGL commands into the hardware
  Flexible scheduling and synchronization
Examples
  Vertex and image data generated with OpenCL and then rendered with OpenGL
  Images rendered with OpenGL and post-processed with OpenCL kernels

Summary
A new compute language that works across a variety of parallel processors
  C99 with extensions
  Familiar to developers
  Includes a rich set of built-in functions
Makes it easy to develop data- and task-parallel compute programs
Defines hardware and numerical precision requirements
Open standard for heterogeneous parallel computing

More OpenCL news and live demos this afternoon
  2:50pm AMD talk
  4:10pm talk by Mark Harris

http://www.khronos.org/opencl/
