Jason Yang Advanced Micro Devices
Processor Parallelism
[Figure: CPUs (multiple cores driving performance increases) and GPUs meet at an emerging intersection, heterogeneous computing, labeled OpenCL.]
Anatomy of OpenCL
Language Specification
  C-based cross-platform programming interface
  Subset of ISO C99 with language extensions, familiar to developers
  Well-defined numerical accuracy: IEEE 754 rounding behavior with specified maximum error
  Online or offline compilation and build of compute kernel executables
  Includes a rich set of built-in functions
Platform Layer API
  A hardware abstraction layer over diverse computational resources
  Query, select and initialize compute devices
  Create compute contexts and work-queues
Runtime API
  Execute compute kernels
  Manage scheduling, compute, and memory resources
Talk Overview
OpenCL Architecture
OpenCL Framework
Vector Addition Example
OpenCL Architecture
Hierarchy of Models
  One Host + one or more Compute Devices
  Each Compute Device is composed of one or more Compute Units
  Each Compute Unit is further divided into one or more Processing Elements
Kernel Execution
  The host program invokes a kernel over an index space called an NDRange
    NDRange = N-Dimensional Range
    NDRange can be a 1-, 2-, or 3-dimensional space
  A single kernel instance at a point in the index space is called a work-item
    Work-items have unique global IDs from the index space
  Work-items are further grouped into work-groups
    Work-groups have a unique work-group ID
    Work-items have a unique local ID within a work-group
Copyright Khronos Group, 2008
Kernel Execution
  Total number of work-items = Gx x Gy
  Size of each work-group = Sx x Sy
  Global ID can be computed from work-group ID and local ID
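The relationship between work-group ID, local ID, and global ID can be sketched in plain C. This is an illustrative host-side model, not OpenCL API code: the helper names (`global_id_1d`, `global_id_2d`, `num_groups_1d`) and the `id2` type are hypothetical, and the sketch assumes a zero global offset and a global size that is an exact multiple of the work-group size.

```c
#include <assert.h>
#include <stddef.h>

/* Global ID along one dimension, assuming a zero global offset:
   global_id = group_id * local_size + local_id */
static size_t global_id_1d(size_t group_id, size_t local_size, size_t local_id)
{
    assert(local_id < local_size);  /* a local ID always lies inside its group */
    return group_id * local_size + local_id;
}

/* Hypothetical 2-D index type for illustration only. */
typedef struct { size_t x, y; } id2;

/* Same computation applied independently to each dimension:
   (gx, gy) = (wx * Sx + sx, wy * Sy + sy) */
static id2 global_id_2d(id2 group_id, id2 local_size, id2 local_id)
{
    id2 g;
    g.x = global_id_1d(group_id.x, local_size.x, local_id.x);
    g.y = global_id_1d(group_id.y, local_size.y, local_id.y);
    return g;
}

/* Number of work-groups along one dimension, assuming the global size
   is an exact multiple of the work-group size. */
static size_t num_groups_1d(size_t global_size, size_t local_size)
{
    assert(local_size > 0 && global_size % local_size == 0);
    return global_size / local_size;
}
```

For example, with work-groups of size Sx = 16 by Sy = 8, the work-item with local ID (3, 4) in work-group (1, 2) has global ID (1 * 16 + 3, 2 * 8 + 4) = (19, 20).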
[Figure: memory hierarchy. Each work-item (1 to M) has its own Private Memory; the work-items of each Compute Unit (1 to N) share a Local Memory; all compute units of the Compute Device share Global Memory.]
Programming Model
Data-Parallel (Section 3.4.1)
  Data-parallel execution model must be implemented by all OpenCL compute devices
  Define N-Dimensional computation domain
  Execute multiple work-groups in parallel
    Work-items in a group can communicate with each other
    Can synchronize execution among work-items in a group to coordinate memory access
Task-Parallel
  Executes as a single work-item
  A compute kernel written in OpenCL
  A native C / C++ function
OpenCL Framework
Language
Platform Layer
Runtime
Function Qualifiers
  __kernel qualifier declares a function as a kernel
  Kernels can call other kernel functions
Address Space Qualifiers
  __global, __local, __constant, __private
  Pointer kernel arguments must be declared with an address space qualifier
Work-item Functions
  Query work-item identifiers
  get_work_dim(), get_global_id(), get_local_id(), get_group_id()
Spec Guide
  __kernel: Section 6.7.1
  __global: Section 6.5.1
  get_global_id(): Section 6.11.1
  Data types: Section 6.1
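The talk overview lists a vector addition example, but the kernel source does not survive in this text. The following is a minimal sketch of what such a kernel might look like, using the __kernel qualifier, __global pointer arguments, and get_global_id(). The kernel name `vec_add` is hypothetical. So that the sketch also compiles as ordinary C, the block wraps a small host-side shim in `#ifndef __OPENCL_VERSION__`: the empty macros, the `get_global_id` stand-in, and `run_vec_add` are assumptions for illustration only and are not part of OpenCL, where the qualifiers and get_global_id() are built in.

```c
#ifndef __OPENCL_VERSION__
/* Host-side shim (assumption, NOT part of OpenCL): lets the kernel body
   compile as plain C by emptying the qualifiers and faking get_global_id(). */
#include <assert.h>
#include <stddef.h>
#define __kernel
#define __global
static size_t shim_global_id;  /* set by the emulation loop below */
static size_t get_global_id(unsigned dim) { (void)dim; return shim_global_id; }
#endif

/* One work-item adds one pair of elements. */
__kernel void vec_add(__global const float *a,
                      __global const float *b,
                      __global float *c)
{
    size_t gid = get_global_id(0);  /* this work-item's global ID in dimension 0 */
    c[gid] = a[gid] + b[gid];
}

#ifndef __OPENCL_VERSION__
/* Emulates a 1-D NDRange of n work-items by running them one after another.
   A real device may run them in parallel and in any order. */
static void run_vec_add(const float *a, const float *b, float *c, size_t n)
{
    assert(a != NULL && b != NULL && c != NULL);
    for (shim_global_id = 0; shim_global_id < n; ++shim_global_id)
        vec_add(a, b, c);
}
#endif
```

The kernel contains no loop: the iteration over elements is expressed by the size of the NDRange the host enqueues, which the shim's `run_vec_add` models with a sequential loop.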
Language Highlights
Image functions
  Images must be accessed through built-in functions
  Reads are performed through sampler objects
Synchronization functions
  Barriers - all work-items within a work-group must execute the barrier function before any work-item can continue
  Memory fences - provide ordering between memory operations
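To illustrate why barriers matter, here is a plain C model of a work-group sum reduction. It is a sketch under stated assumptions, not OpenCL code: the function name `work_group_sum`, the fixed `WG_SIZE`, and the `scratch` array standing in for __local memory are all hypothetical. Each inner loop over `lid` plays the role of all work-items in the group executing one phase; the end of each loop is where a real kernel would call barrier(CLK_LOCAL_MEM_FENCE) so that no work-item reads a partial sum before it has been written.

```c
#include <assert.h>
#include <stddef.h>

#define WG_SIZE 8  /* hypothetical work-group size; must be a power of two */

/* Sums WG_SIZE inputs the way a work-group would, one phase at a time. */
static float work_group_sum(const float *in)
{
    float scratch[WG_SIZE];  /* stands in for __local memory */
    assert(in != NULL);

    /* Phase 0: every work-item copies one element into local memory. */
    for (size_t lid = 0; lid < WG_SIZE; ++lid)
        scratch[lid] = in[lid];
    /* A real kernel would call barrier(CLK_LOCAL_MEM_FENCE) here. */

    /* Tree reduction: each phase halves the number of active work-items. */
    for (size_t stride = WG_SIZE / 2; stride > 0; stride /= 2) {
        for (size_t lid = 0; lid < WG_SIZE; ++lid) {
            if (lid < stride)
                scratch[lid] += scratch[lid + stride];
        }
        /* A real kernel would call barrier(CLK_LOCAL_MEM_FENCE) here,
           so the next phase only starts once every partial sum is stored. */
    }
    return scratch[0];
}
```

Without the barrier between phases, a work-item could read `scratch[lid + stride]` before its neighbor had finished the previous phase, which is exactly the hazard the barrier rule above prevents.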
Restrictions
  Pointers to functions are not allowed
  Pointers to pointers allowed within a kernel, but not as an argument
  Bit-fields are not supported
  Variable length arrays and structures are not supported
  Recursion is not supported
  Writes to a pointer of types less than 32-bit are not supported
  Double types are not supported, but reserved
  3D Image writes are not supported
Spec Guide
  Contexts and context creation: Section 4.3
  Command Queues: Section 5.1
  Creating buffer objects: Section 5.2.1
Context Creation (Section 4.3)
  clCreateContext() and clCreateContextFromType()
  Contexts are used by the OpenCL runtime to manage objects and execute kernels on one or more devices
  Contexts are associated with one or more devices
  Multiple contexts could be associated with the same device
Memory Objects (Section 5.2)
Buffer objects
  One-dimensional collection of objects (like C arrays)
  Valid elements include scalar and vector types as well as user-defined structures
  Buffer objects can be accessed via pointers in the kernel
Image objects
  Two- or three-dimensional texture, frame-buffer, or images
  Must be addressed through built-in functions
Sampler objects
  Describes how to sample an image in the kernel
    Addressing modes
    Filtering modes
Image objects are also created with a channel format
  Channel order (e.g., RGB, RGBA, etc.)
  Channel type (e.g., UNORM INT8, FLOAT, etc.)
Spec Guide
  Creating program objects: Section 5.4.1
  Building program executables: Section 5.4.2
  Creating kernel objects: Section 5.5.1
  Setting kernel arguments: Section 5.5.2
Build process
1. Create program object
   clCreateProgramWithSource()
   clCreateProgramWithBinary()
Kernel Objects (Section 5.5)
Kernel vs. program objects
  Kernels are related to program execution
  Programs are related to program source
Spec Guide
  Executing Kernels: Section 5.6
  Reading, writing, and copying buffer objects: Section 5.2.2
Command-Queue Execution
  Execution model signals when commands are complete or data is ready
  Command-queue could be explicitly flushed to the device
  Command-queues execute in-order or out-of-order
    In-order - commands complete in the order queued and memory is consistent
    Out-of-order - no guarantee when commands are executed or memory is consistent without synchronization
Synchronization
  Signals to the host or to other commands in the queue when commands are complete
  Blocking calls - commands do not return until complete
  Event objects - track execution status of a command
  Some commands can be blocked until event objects signal completion of a previous command
    clEnqueueNDRangeKernel() can take an event object as an argument and wait until a previous command (e.g., clEnqueueWriteBuffer) is complete
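The effect of event dependencies on an out-of-order queue can be modeled in a few lines of plain C. This is a toy scheduler for illustration, not the OpenCL runtime: the `command` struct, `NO_EVENT`, and `run_out_of_order` are hypothetical names, and the "scan from the back of the queue" policy is just one arbitrary order an out-of-order queue might choose. A command runs only once the command it waits on has completed.

```c
#include <assert.h>
#include <stddef.h>

#define NO_EVENT (-1)

/* Hypothetical model of an enqueued command. */
typedef struct {
    const char *name;
    int wait_on;   /* index of the command whose event it waits on, or NO_EVENT */
    int complete;  /* models the command's event reaching the complete state */
} command;

/* Executes commands in a deliberately reversed order, but never before the
   command they wait on has completed. Writes the execution order to
   order_out and returns the number of commands executed. */
static int run_out_of_order(command *q, int n, int *order_out)
{
    int done = 0;
    assert(q != NULL && order_out != NULL && n > 0);
    while (done < n) {
        int progressed = 0;
        for (int i = n - 1; i >= 0; --i) {
            if (q[i].complete)
                continue;  /* already executed */
            if (q[i].wait_on != NO_EVENT) {
                assert(q[i].wait_on >= 0 && q[i].wait_on < n);
                if (!q[q[i].wait_on].complete)
                    continue;  /* still blocked on its event */
            }
            q[i].complete = 1;
            order_out[done++] = i;
            progressed = 1;
        }
        if (!progressed)
            break;  /* circular dependency: nothing can run */
    }
    return done;
}
```

If three commands (write buffer, run kernel, read buffer) are queued with the kernel waiting on the write and the read waiting on the kernel, this model executes them in queue order 0, 1, 2. With no events at all, the same scheduler runs them as 2, 1, 0, which is the hazard that synchronization on an out-of-order queue is meant to prevent.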
Profiling
(Section 9.10)
OpenGL Interoperability
(Appendix B)
  Both standards under one IP framework
    Efficient, inter-API communication
  OpenCL can efficiently share resources with OpenGL
    Textures, Buffer Objects and Renderbuffers
    Data is shared, not copied
    OpenCL objects are created from OpenGL objects
    clCreateFromGLBuffer(), clCreateFromGLTexture2D(), clCreateFromGLRenderbuffer()
  Applications can select compute device(s) to run OpenGL and OpenCL
    Efficient queuing of OpenCL and OpenGL commands into the hardware
    Flexible scheduling and synchronization
  Examples
    Vertex and image data generated with OpenCL and then rendered with OpenGL
    Images rendered with OpenGL and post-processed with OpenCL kernels
Summary
  A new compute language that works across a variety of parallel processors
    C99 with extensions
    Familiar to developers
    Includes a rich set of built-in functions
  Makes it easy to develop data- and task-parallel compute programs
  Defines hardware and numerical precision requirements
  Open standard for heterogeneous parallel computing
http://www.khronos.org/opencl/