Yang Opencl Intro

OpenCL 1.0
Jason Yang, Advanced Micro Devices
Copyright Khronos Group, 2008
Processor Parallelism
- CPUs
  - Multiple cores driving performance increases
  - Multi-processor programming, e.g. OpenMP
- GPUs
  - Increasingly general purpose data-parallel computing
  - Improving numerical precision
  - Graphics APIs and Shading Languages
- Emerging Intersection
  - OpenCL: Heterogeneous Computing
OpenCL: Open Computing Language

Open, royalty-free standard for portable, parallel programming of heterogeneous systems of CPUs, GPUs, and other processors
Anatomy of OpenCL
- Language Specification
  - C-based cross-platform programming interface
  - Subset of ISO C99 with language extensions, familiar to developers
  - Well-defined numerical accuracy: IEEE 754 rounding behavior with specified maximum error
  - Online or offline compilation and build of compute kernel executables
  - Includes a rich set of built-in functions
- Platform Layer API
  - A hardware abstraction layer over diverse computational resources
  - Query, select and initialize compute devices
  - Create compute contexts and work-queues
- Runtime API
  - Execute compute kernels
  - Manage scheduling, compute, and memory resources
Talk Overview
- OpenCL Architecture
- OpenCL Framework
  - Vector Addition Example
- OpenCL Extended Features
  - OpenGL Interoperability
OpenCL Architecture
Hierarchy of Models
OpenCL Platform Model (Section 3.1)
- One Host + one or more Compute Devices
- Each Compute Device is composed of one or more Compute Units
- Each Compute Unit is further divided into one or more Processing Elements
OpenCL Execution Model (Section 3.2)
OpenCL Program
- Kernels
  - Basic unit of executable code, similar to a C function
  - Data-parallel or task-parallel
- Host Program
  - Collection of compute kernels and internal functions
  - Analogous to a dynamic library
Kernel Execution
- The host program invokes a kernel over an index space called an NDRange
  - NDRange = N-Dimensional Range
  - NDRange can be a 1, 2, or 3-dimensional space
- A single kernel instance at a point in the index space is called a work-item
  - Work-items have unique global IDs from the index space
- Work-items are further grouped into work-groups
  - Work-groups have a unique work-group ID
  - Work-items have a unique local ID within a work-group
Kernel Execution
- Total number of work-items = Gx × Gy
- Size of each work-group = Sx × Sy
- Global ID can be computed from work-group ID and local ID
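The relation between the three IDs can be sketched per dimension in plain host-side C. This is only an illustration: the helper names global_id_from_group and num_groups are hypothetical (they are not OpenCL functions), and the sketch assumes a zero global offset and a global size that is an exact multiple of the work-group size.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not an OpenCL API: global ID along one dimension,
 * built from the work-group ID, the work-group size S, and the local ID.
 * Assumes the index space starts at 0 (no global offset). */
size_t global_id_from_group(size_t group_id, size_t group_size, size_t local_id)
{
    assert(local_id < group_size);   /* a local ID always lies inside its group */
    return group_id * group_size + local_id;
}

/* Number of work-groups along one dimension, assuming the global size G
 * is an exact multiple of the work-group size S. */
size_t num_groups(size_t global_size, size_t group_size)
{
    assert(group_size > 0 && global_size % group_size == 0);
    return global_size / group_size;
}
```

For a 2-dimensional NDRange the same formula would be applied independently to x (with Sx) and y (with Sy).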
Contexts and Queues (Section 3.2.1)

- Contexts are used to contain and manage the state of the "world"
- Kernels are executed in contexts defined and manipulated by the host
  - Devices
  - Kernels - OpenCL functions
  - Program objects - kernel source and executable
  - Memory objects
- Command-queue - coordinates execution of kernels
  - Kernel execution commands
  - Memory commands - transfer or mapping of memory object data
  - Synchronization commands - constrains the order of commands
- Applications queue compute kernel execution instances
  - Queued in-order
  - Executed in-order or out-of-order
  - Events are used to implement appropriate synchronization of execution instances
OpenCL Memory Model (Section 3.3)

- Shared memory model
  - Relaxed consistency
- Multiple distinct address spaces
  - Address spaces can be collapsed depending on the device's memory subsystem
- Address spaces
  - Private - private to a work-item
  - Local - local to a work-group
  - Global - accessible by all work-items in all work-groups
  - Constant - read-only global space
- Implementations map this hierarchy to available physical memories
[Figure: memory hierarchy. Each work-item (1..M) has its own Private Memory; the work-items of each Compute Unit (1..N) share that unit's Local Memory; the Compute Device holds a Global / Constant Memory Data Cache in front of Global Memory, which resides in Compute Device Memory.]
Relaxed Memory Consistency (Section 3.3.1)

- The state of memory visible to a work-item is not guaranteed to be consistent across the collection of work-items at all times
- Within a work-item, memory has load/store consistency
- Within a work-group at a barrier, local memory has consistency across work-items
- Global memory is consistent within a work-group at a barrier, but not guaranteed across different work-groups
- Consistency of memory shared between commands is enforced through synchronization
Programming Model
Data-Parallel (Section 3.4.1)
- Data-parallel execution model must be implemented by all OpenCL compute devices
- Define N-Dimensional computation domain
- Execute multiple work-groups in parallel
  - Work-items in group can communicate with each other
  - Can synchronize execution among work-items in group to coordinate memory access
Task-Parallel (Section 3.4.2)
- Some compute devices such as CPUs can also execute task-parallel compute kernels
  - Executes as a single work-item
  - A compute kernel written in OpenCL
  - A native C / C++ function
OpenCL Framework
Basic OpenCL Program Structure

Kernels
- C code with some restrictions and extensions
Host program
- Query compute devices
- Create contexts
- Create memory objects associated to contexts
- Compile and create kernel program objects
- Issue commands to command-queue
- Synchronization of commands
- Clean up OpenCL resources
[Diagram labels: Language, Platform Layer, Runtime]
Language for Compute Kernels (Chapter 6)

- OpenCL C Programming Language
  - Derived from ISO C99
  - A few restrictions: recursion, function pointers, functions in C99 standard headers ...
  - Preprocessing directives defined by C99 are supported
- Built-in Data Types
  - Scalar and vector data types, pointers
  - Data-type conversion functions: convert_type<_sat><_roundingmode>
  - Image types: image2d_t, image3d_t and sampler_t
- Built-in Functions - Required
  - Work-item functions, math.h, read and write image
  - Relational, geometric functions, synchronization functions
- Built-in Functions - Optional
  - Double precision, atomics to global and local memory
  - Selection of rounding mode, writes to image3d_t surface
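To make the _sat suffix of the conversion functions concrete, here is a host-side C emulation of what a saturating float-to-uchar conversion is expected to do. It is a sketch, not the built-in itself: the names sat_convert_uchar and sat_convert_selfcheck are hypothetical, rounding is assumed to be toward zero (the default for conversions to integer types), and mapping NaN to 0 is an assumption that should be checked against the conversion rules in Chapter 6 of the specification.

```c
#include <assert.h>

/* Host-side emulation (hypothetical name) of a saturating float -> uchar
 * conversion: out-of-range inputs clamp to the nearest representable value
 * instead of wrapping around. */
unsigned char sat_convert_uchar(float x)
{
    if (x != x)      return 0;     /* NaN: assumed to map to 0 */
    if (x <= 0.0f)   return 0;     /* clamp at the low end of uchar */
    if (x >= 255.0f) return 255;   /* clamp at the high end of uchar */
    return (unsigned char)x;       /* in range: C truncates toward zero */
}

/* Small self-check showing the clamping behaviour. */
void sat_convert_selfcheck(void)
{
    assert(sat_convert_uchar(300.0f) == 255);
    assert(sat_convert_uchar(-4.5f) == 0);
    assert(sat_convert_uchar(12.9f) == 12);
}
```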
Example: Vector Addition - Kernel

    __kernel void vec_add(__global const float *a,
                          __global const float *b,
                          __global float *c)
    {
        int gid = get_global_id(0);
        c[gid] = a[gid] + b[gid];
    }

Spec Guide
- __kernel: Section 6.7.1
- __global: Section 6.5.1
- get_global_id(): Section 6.11.1
- Data types: Section 6.1
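For comparison, a serial host-side reference of the same computation can be written in plain C. The function name vec_add_reference is hypothetical and is not part of the talk; the loop index stands in for the value that get_global_id(0) returns in each work-item, so each loop iteration corresponds to one work-item.

```c
#include <assert.h>
#include <stddef.h>

/* Serial reference (hypothetical name): one loop iteration does the work
 * that one work-item does in the vec_add kernel. */
void vec_add_reference(const float *a, const float *b, float *c, size_t n)
{
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t gid = 0; gid < n; ++gid)   /* gid plays the role of get_global_id(0) */
        c[gid] = a[gid] + b[gid];
}
```

A reference like this is a convenient way to validate the device result after reading the output buffer back to the host.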
Vector Addition - Kernel

- Function Qualifiers
  - __kernel qualifier declares a function as a kernel
  - Kernels can call other kernel functions
- Address Space Qualifiers
  - __global, __local, __constant, __private
  - Pointer kernel arguments must be declared with an address space qualifier
- Work-item Functions
  - Query work-item identifiers
  - get_work_dim(), get_global_id(), get_local_id(), get_group_id()
Language Highlights
- Image functions
  - Images must be accessed through built-in functions
  - Reads/writes performed through sampler objects
- Synchronization functions
  - Barriers - all work-items within a work-group must execute the barrier function before any work-item can continue
  - Memory fences - provides ordering between memory operations
Restrictions
- Pointers to functions are not allowed
- Pointers to pointers allowed within a kernel, but not as an argument
- Bit-fields are not supported
- Variable length arrays and structures are not supported
- Recursion is not supported
- Writes to a pointer of types less than 32-bit are not supported
- Double types are not supported, but reserved
- 3D Image writes are not supported
- Some restrictions are addressed through extensions
Example: Vector Addition - Host API (1)

    // create the OpenCL context on a GPU device
    cl_context context = clCreateContextFromType(0, CL_DEVICE_TYPE_GPU,
                                                 NULL, NULL, NULL);

    // get the list of GPU devices associated with context
    size_t cb;
    clGetContextInfo(context, CL_CONTEXT_DEVICES, 0, NULL, &cb);
    cl_device_id *devices = malloc(cb);
    clGetContextInfo(context, CL_CONTEXT_DEVICES, cb, devices, NULL);

    // create a command-queue
    cl_command_queue cmd_queue = clCreateCommandQueue(context, devices[0],
                                                      0, NULL);

    // allocate the buffer memory objects
    cl_mem memobjs[3];
    memobjs[0] = clCreateBuffer(context,
                                CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                sizeof(cl_float)*n, srcA, NULL);
    memobjs[1] = clCreateBuffer(context,
                                CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                sizeof(cl_float)*n, srcB, NULL);
    memobjs[2] = clCreateBuffer(context, CL_MEM_WRITE_ONLY,
                                sizeof(cl_float)*n, NULL, NULL);

Spec Guide
- Contexts and context creation: Section 4.3
- Command Queues: Section 5.1
- Creating buffer objects: Section 5.2.1

Platform Layer (Chapter 4)

- The platform layer allows applications to query for platform specific features
- Querying devices (Section 4.2)
  - Find devices: CPUs, GPUs, or Accelerators
  - Get device information
    - Number of compute cores
    - NDRange limits
    - Maximum work-group size
    - Sizes of the different memory spaces (constant, local, global)
    - Maximum memory object size
- Creating contexts (Section 4.3)
  - clCreateContext() and clCreateContextFromType()
  - Contexts are used by the OpenCL runtime to manage objects and execute kernels on one or more devices
  - Contexts are associated to one or more devices
  - Multiple contexts could be associated to the same device

Context Creation
- Create context from GPU device type
- A handle is returned to the created context
Querying devices and information
- clGetContextInfo() returns the list of GPU devices associated with the context
Command-Queues (Section 5.1)

- Command-queues store a set of operations to perform
- Command-queues are associated to a context
- Multiple command-queues can be created to handle independent commands that don't require synchronization
- Execution of the command-queue is guaranteed to be completed at sync points

Memory Objects (Section 5.2)
- Buffer objects
  - One-dimensional collection of objects (like C arrays)
  - Valid elements include scalar and vector types as well as user defined structures
  - Buffer objects can be accessed via pointers in the kernel
- Image objects
  - Two- or three-dimensional texture, frame-buffer, or images
  - Must be addressed through built-in functions
- Sampler objects
  - Describes how to sample an image in the kernel
    - Addressing modes
    - Filtering modes
Creating Memory Objects

- clCreateBuffer(), clCreateImage2D(), and clCreateImage3D()
- Memory objects are created with an associated context
- Memory can be created as read only, write only, or read-write
- Where objects are created in the platform memory space can be controlled
  - Device memory
  - Device memory with data copied from a host pointer
  - Host memory
  - Host memory associated with a pointer
    - Memory at that pointer is guaranteed to be valid at synchronization points
- Image objects are also created with a channel format
  - Channel order (e.g., RGB, RGBA, etc.)
  - Channel type (e.g., UNORM INT8, FLOAT, etc.)

Manipulating Object Data

- Object data can be copied to host memory, from host memory, or to other objects
- Memory commands are enqueued in the command buffer and processed when the command is executed
  - clEnqueueReadBuffer(), clEnqueueReadImage()
  - clEnqueueWriteBuffer(), clEnqueueWriteImage()
  - clEnqueueCopyBuffer(), clEnqueueCopyImage()
- Data can be copied between Image and Buffer objects
  - clEnqueueCopyImageToBuffer()
  - clEnqueueCopyBufferToImage()
- Regions of the object data can be accessed by mapping into the host address space
  - clEnqueueMapBuffer(), clEnqueueMapImage()
  - clEnqueueUnmapMemObject()


    // create the program
    cl_program program = clCreateProgramWithSource(context, 1,
                                                   &program_source,
                                                   NULL, NULL);

    // build the program
    cl_int err = clBuildProgram(program, 0, NULL, NULL, NULL, NULL);

    // create the kernel (the kernel name is passed as a string)
    cl_kernel kernel = clCreateKernel(program, "vec_add", NULL);

    // set the args values (argument size precedes the value pointer)
    err  = clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *)&memobjs[0]);
    err |= clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *)&memobjs[1]);
    err |= clSetKernelArg(kernel, 2, sizeof(cl_mem), (void *)&memobjs[2]);

Spec Guide
- Creating program objects: Section 5.4.1
- Building program executables: Section 5.4.2
- Creating kernel objects: Section 5.5.1
- Setting kernel arguments: Section 5.5.2

Program Objects (Section 5.4)

Program objects encapsulate:
- An associated context
- Program source or binary
- Latest successful program build, list of targeted devices, build options
- Number of attached kernel objects
Build process
1. Create program object
   - clCreateProgramWithSource()
   - clCreateProgramWithBinary()
2. Build program executable
   - clBuildProgram()
   - Compile and link from source or binary for all devices or specific devices in the associated context

Kernel Objects (Section 5.5)
- Kernel objects encapsulate
  - Specific kernel functions declared in a program
  - Argument values used for kernel execution
- Setting arguments
  - clSetKernelArg(<kernel>, <argument index>)
  - Each argument data must be set for the kernel function
  - Argument values are copied and stored in the kernel object
- Kernel vs. program objects
  - Kernels are related to program execution
  - Programs are related to program source
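The point that argument values are copied into the kernel object can be illustrated with a small mock in plain C. Nothing below is OpenCL API: mock_arg_slot, mock_set_arg and mock_get_int are hypothetical names, and the fixed 16-byte slot is an arbitrary choice for the sketch. It only shows the consequence of copy semantics, namely that changing the host variable after setting the argument does not change the stored value.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MOCK_MAX_ARG_SIZE 16

/* Hypothetical stand-in for one argument slot of a kernel object. */
typedef struct {
    unsigned char bytes[MOCK_MAX_ARG_SIZE];
    size_t size;
} mock_arg_slot;

/* Copies the argument value into the slot, mirroring the slide's statement
 * that argument values are copied and stored in the kernel object. */
void mock_set_arg(mock_arg_slot *slot, size_t size, const void *value)
{
    assert(size <= MOCK_MAX_ARG_SIZE);
    memcpy(slot->bytes, value, size);
    slot->size = size;
}

/* Reads the stored copy back as an int (illustration only). */
int mock_get_int(const mock_arg_slot *slot)
{
    int v;
    assert(slot->size == sizeof v);
    memcpy(&v, slot->bytes, sizeof v);
    return v;
}
```

Under this model, a host variable can be reused or go out of scope once the argument has been set, because the slot holds its own copy.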


    // set work-item dimensions
    size_t global_work_size[1] = { n };

    // execute kernel
    err = clEnqueueNDRangeKernel(cmd_queue, kernel, 1, NULL,
                                 global_work_size, NULL, 0, NULL, NULL);

    // read output array (blocking read, enqueued on the command-queue)
    err = clEnqueueReadBuffer(cmd_queue, memobjs[2], CL_TRUE, 0,
                              n*sizeof(cl_float), dst, 0, NULL, NULL);

Spec Guide
- Executing Kernels: Section 5.6
- Reading, writing, and copying buffer objects: Section 5.2.2
Kernel Execution (Section 5.6)

- A command to execute a kernel must be enqueued to the command-queue
- clEnqueueNDRangeKernel()
  - Data-parallel execution model
  - Describes the index space for kernel execution
  - Requires information on NDRange dimensions and work-group size
- clEnqueueTask()
  - Task-parallel execution model (multiple queued tasks)
  - Kernel is executed on a single work-item
- clEnqueueNativeKernel()
  - Task-parallel execution model
  - Executes a native C/C++ function not compiled using the OpenCL compiler
  - This mode does not use a kernel object so arguments must be passed in
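When an explicit work-group size is passed to clEnqueueNDRangeKernel(), OpenCL 1.0 expects the global size in each dimension to be evenly divisible by it. A common host-side workaround, sketched here in plain C, is to round the global size up to the next multiple. The helper name round_up_global_size is hypothetical, and the sketch assumes the kernel itself guards against the extra work-items (for example with a bounds check on the global ID).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: smallest multiple of local_size that is >= n.
 * The padded value would be used as the global work size. */
size_t round_up_global_size(size_t n, size_t local_size)
{
    assert(local_size > 0);
    size_t remainder = n % local_size;
    return remainder == 0 ? n : n + (local_size - remainder);
}
```

The vector addition example avoids the issue by passing NULL for the work-group size, which leaves the choice to the implementation.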

Command-Queue Execution
- Execution model signals when commands are complete or data is ready
- Command-queue could be explicitly flushed to the device
- Command-queues execute in-order or out-of-order
  - In-order - commands complete in the order queued and memory is consistent
  - Out-of-order - no guarantee when commands are executed or memory is consistent without synchronization
Synchronization
- Signals when commands are completed to the host or other commands in queue
- Blocking calls - commands do not return until complete
- Event objects - tracks execution status of a command
  - Some commands can be blocked until event objects signal a completion of previous command
    - clEnqueueNDRangeKernel() can take an event object as an argument and wait until a previous command (e.g., clEnqueueWriteBuffer) is complete
  - Profiling
- Queue barriers - queued commands that can block command execution

- Blocking call in the example: clEnqueueReadBuffer() with CL_TRUE returns only when the command is complete
OpenCL Extended Features
Optional Extensions (Chapter 9)

- Extensions are optional features exposed through OpenCL
- The OpenCL working group has already approved many extensions that are supported by the OpenCL specification
  - Double precision floating-point types (Section 9.3)
  - Built-in functions to support doubles (Section 9.3)
  - Atomic functions (Section 9.5, 9.6, 9.7)
  - 3D Image writes (Section 9.8)
  - Byte addressable stores (writing to pointers with types less than 32-bits) (Section 9.9)
  - Built-in functions to support half types (Section 9.10)
OpenGL Interoperability (Appendix B)
- Both standards under one IP framework
  - Efficient, inter-API communication
- OpenCL can efficiently share resources with OpenGL
  - Textures, Buffer Objects and Renderbuffers
  - Data is shared, not copied
  - OpenCL objects are created from OpenGL objects
    - clCreateFromGLBuffer(), clCreateFromGLTexture2D(), clCreateFromGLRenderbuffer()
- Applications can select compute device(s) to run OpenGL and OpenCL
  - Efficient queuing of OpenCL and OpenGL commands into the hardware
  - Flexible scheduling and synchronization
- Examples
  - Vertex and image data generated with OpenCL and then rendered with OpenGL
  - Images rendered with OpenGL and post-processed with OpenCL kernels
Summary
- A new compute language that works across a variety of parallel processors
  - C99 with extensions
  - Familiar to developers
  - Includes a rich set of built-in functions
- Makes it easy to develop data- and task-parallel compute programs
- Defines hardware and numerical precision requirements
- Open standard for heterogeneous parallel computing
- More OpenCL news and live demos this afternoon
  - 2:50pm AMD talk
  - 4:10pm talk by Mark Harris
http://www.khronos.org/opencl/
