OpenCL Optimization Case Study: Simple Reductions
Bryan Catanzaro
8/24/2010
Introduction
OpenCL allows developers to write portable, high-performance code that can target all varieties of parallel processing platforms, including AMD CPUs and GPUs. As with any parallel programming model, achieving good efficiency requires careful attention to how the computation is mapped to the hardware platform and executed. Since performance is a prime motivation for using OpenCL, understanding the issues that arise when optimizing OpenCL code is a natural part of learning how to use OpenCL itself.
This article discusses simple reductions. A reduction is a very simple operation that takes an array of data and reduces it down to a single element,
for example by summing all the elements in the array. Consider this simple C code, which sums all the elements in an array:
float reduce_sum(float* input, int length) {
    float accumulator = input[0];
    for (int i = 1; i < length; i++)
        accumulator += input[i];
    return accumulator;
}
This code is completely sequential! There's no way to parallelize the loop as written, since every iteration of the loop depends on the iteration before it. How can we parallelize this code?
It turns out that we can parallelize many reductions by taking advantage of the properties of the reduction we're trying to perform. As counterintuitive as it may seem, reductions are a fundamental data-parallel primitive used in many applications, from databases to physical simulation
and machine learning. There are many different kinds of reductions, depending on the type of data being reduced and the operator which is being
used to perform the reduction. For example, reductions can be used to find the sum of all elements in a vector, find the maximum or minimum
element of a vector, or find the index of the maximum or minimum element of a vector.
The performance of parallel reductions can strongly depend on the details of how the reduction is mapped to a parallel platform. In this article, we will see how selecting the right reduction strategy can yield an order-of-magnitude speedup over a naive reduction algorithm, on both CPU devices, represented by the AMD Phenom II X4 965 CPU, and GPU devices, represented by the ATI Radeon HD 5870 GPU.
Reduction
The simple sequential sum reduction we just saw is not parallel at all: there's a sequential dependency on the accumulator variable that requires this reduction be performed in a particular order, from the front to the back of the input array.
If our reduction operator gives us some freedom in the order of the operations, however, we can parallelize the computation in a variety of ways. We'll take a look at how to do this by starting from the bottom and going up.
Associativity and Commutativity
As we just mentioned, if our reduction operator gives us some flexibility in the order in which the operations must be performed, we can parallelize a sequential reduction. Addition is a common reduction operator that gives us a lot of flexibility. Let's consider summing a vector of three numbers: [10, 20, 30]. The sequential sum would do two additions: ((10 + 20) + 30). But we'd get the same answer if we had grouped the additions differently: (10 + (20 + 30)), or even if we had reordered the additions: ((30 + 10) + 20).
You've probably heard of these properties before: if an operator allows us to regroup the operations and still get the same result, we call it associative, and if it allows us to reorder the operations and still get the same result, we call it commutative.
It turns out that these properties are key to parallelizing a reduction. We can take advantage of associativity to divide up the reduction into independent pieces, and then combine results from the independent pieces. For example, a+b+c+d = (a+b)+(c+d). (a+b) can be computed in parallel with (c+d), and then the two partial reductions are combined to complete the reduction. This can be generalized to reductions on vectors of arbitrary size by recursively dividing the input vectors, computing partial reductions, and then reducing the partial reductions to form the result.
A couple of things to note about this code. Firstly, it assumes that the number of work-items in the work-group is a power of 2; otherwise it will try to access illegal elements in the scratch space. For reductions where the size is not a power of two, we'll just pad the vector with a few extra elements on the host. Secondly, it assumes that local memory has a legal element for every work-item in the work-group. To take care of situations where this is not true, it initializes the local memory with the identity element for the operation we're performing. Thirdly, you'll see that we have not reordered any of the reduction operations, but only regrouped them. If you look at the code, you'll see that we've encoded the reduction tree very directly: on the first iteration of the reduction loop, the condition

    ((local_index & mask) == 0)

will be true for every other work-item in the work-group. On the second iteration, the condition will be true for every 4th work-item in the work-group. On the third iteration, the condition is true for every 8th work-item in the work-group, and so on until we've completed the reduction. This ensures that the reductions are not reordered, but this implementation leads to a very sparse utilization of the work-items in the work-group, which is inefficient when mapped onto SIMD processors such as AMD GPUs.
Figure 1 shows how this reduction tree proceeds in hardware when mapped onto a SIMD processor. Recall that we have assigned one work-item to each element of the input array we're reducing, and that each work-item is going to be mapped to a SIMD lane in the GPU hardware wavefront. Figure 1 shows how each work-item will be used during the reduction, and how data will be transferred through the reduction tree. As you can see, the work-items are going to be used sparsely, and at each step of the reduction tree, the active work-items get sparser and sparser. This leads to poor SIMD efficiency: in the example in Figure 1, only about 30% of the work-items are active, on average.
Multi-stage Reduction
1. Fully-parallel multi-stage reductions. This approach, built from the recursive reduction trees we just saw, expresses the largest amount of parallelism, and it can be written to take advantage of associativity alone, for operators which are not commutative. However, it can be less efficient.
2. Two-stage reductions. In this approach, which we will explain in greater detail later, we express just enough parallelism to fill the machine, and then follow with a final global reduction. Taking advantage of commutativity, we can then perform most of the work sequentially, which improves efficiency compared to the fully-parallel multi-stage reduction. Additionally, we only have to launch two kernels per reduction.
3. Reductions using atomics. Instead of using an explicitly multi-stage algorithm, you can use atomic memory operations in OpenCL, such as atom_add(), to reduce the partial results from each local reduction. Of course, atomic transactions will limit you to the operators and data types which are supported by the platform you're targeting. Many applications need reductions which are not supported by atomic operations, so we'll just mention that atomics can be useful in some situations, but won't give details in this article.
This observation motivates the two-stage reduction, where the input is divided into p chunks, where p is large enough to keep all of our processors busy. In OpenCL, each chunk will be processed by a work-group. Taking advantage of commutativity, we can have each work-group process its chunk by iterating over work-group-sized pieces, with every work-item keeping a running reduction as it goes. After we've processed the entire array, each work-group writes out a single reduction result, which we assemble into another array and then reduce with a final reduction call.
Technically, we could do a two-stage reduction without taking advantage of commutativity by having each work-item sequentially reduce a large contiguous block of the array, then finishing with a parallel reduction in each work-group, followed by a final reduction call. However, it is difficult to make this approach efficient: because each work-item loads data from a separate region of memory, the loads from a wavefront of work-items are not contiguous, which reduces bandwidth utilization substantially. We'll leave it as an exercise for the reader to create an efficient two-stage, associative-only reduction, but will note that since a great number of reduction operators are commutative anyway, this problem is mostly just a curiosity.
By launching only enough work-groups to fill the compute device, we ensure that most of the reduction happens sequentially, which maximizes SIMD efficiency and drastically improves performance. In the fully-parallel, recursive reduction style mentioned above, the number of parallel reductions performed during the first reduction phase is proportional to the vector length n. Using the two-stage reduction style, we only perform a constant number of parallel reductions, regardless of how large our input is, since we only do parallel reductions for as many work-groups as we need to fill the compute device. This makes the reduction much more efficient.
For this experiment, we found that launching 80 work-groups was a good choice for the ATI Radeon HD 5870 GPU. There are 20 compute units
on the GPU, so launching 80 work-groups provides for simultaneous scheduling of multiple work-groups on the same compute unit, which can help
cover memory latencies during execution.
This code performs much better on our CPU device. When we launch a kernel with only one work-group, using just one of the four compute units
the AMD Phenom CPU provides, we average 1.2 GigaReductions/second, or 22% of our bound for this device. When we parallelize the reduction
across all the compute units of our CPU device, we average 3.0 GigaReductions/second, or 57% of our bound. Vectorizing the reductions improves performance further, with one caveat about the ?: operator we have been using in these code examples: when used on a SIMD vector on the CPU, the ?: operator introduces control flow into the SIMD vector, causing the OpenCL compiler to emit non-vectorized code and resulting in performance losses instead of performance gains.
Overall, getting 62% of our bandwidth-limited performance bound shows that OpenCL can provide good performance on CPU devices as well as GPU devices.
Conclusion
When we started this article, we were faced with a challenge: how to take a sequential reduction loop and parallelize it. We took a look at how associativity and commutativity allow us to restructure a sequential loop into reduction trees, and then looked at several strategies for building efficient reduction trees. Perhaps surprisingly, we found that the most parallel reduction trees were also very inefficient, because they required lots of communication and synchronization, which is expensive on parallel platforms. We then found that performing the reduction as serially as possible provided the best performance, on both the CPU and the GPU. We saw a 15x performance improvement on the GPU by taking advantage of commutativity to reduce the number of local parallel reductions we executed, compared to the fully-parallel, recursive reduction. We saw a 2.8x performance improvement on the CPU by using all the cores and the SIMD units, compared to a sequential reduction.
Since reductions are such an important part of data-parallel programming, many OpenCL programmers will encounter the need to write them at
some point. Hopefully this article has given you some ideas about what approaches will work well for your problem, whether on the CPU or the
GPU.
© 2014 Advanced Micro Devices, Inc. OpenCL and the OpenCL logo are trademarks of Apple, Inc., used with permission by Khronos.