OpenCL Optimization Case Study: Simple Reductions
Reductions, which take a vector of data and reduce it to a single element, are widely used in data-parallel programming. In this article, we examine strategies for efficiently mapping reductions onto the ATI Radeon HD 5870 GPU and the AMD Phenom II X4 965 CPU. Taking advantage of the properties of the reduction being performed, as well as matching the style of reduction to the hardware platform, can result in performance improvements of up to 15x compared to naive code.

Bryan Catanzaro

8/24/2010

Introduction
OpenCL allows developers to write portable, high-performance code that can target all varieties of parallel processing platforms, including AMD CPUs and GPUs. As with any parallel programming model, achieving good efficiency requires careful attention to how the computation is mapped to the hardware platform and executed. Since performance is a prime motivation for using OpenCL, understanding the issues which arise when optimizing OpenCL code is a natural part of learning how to use OpenCL itself.
This article discusses simple reductions. A reduction is a very simple operation that takes an array of data and reduces it down to a single element,
for example by summing all the elements in the array. Consider this simple C code, which sums all the elements in an array:
float reduce_sum(float* input, int length) {
  float accumulator = input[0];
  for(int i = 1; i < length; i++)
    accumulator += input[i];
  return accumulator;
}

This code is completely sequential! There's no way to parallelize the loop as written, since every iteration depends on the iteration before it. How can we parallelize this code?
It turns out that we can parallelize many reductions by taking advantage of the properties of the reduction we're trying to perform. As counterintuitive as it may seem, reductions are a fundamental data-parallel primitive, used in many applications from databases to physical simulation and machine learning. There are many different kinds of reductions, depending on the type of data being reduced and the operator being used to perform the reduction. For example, reductions can be used to find the sum of all elements in a vector, find the maximum or minimum element of a vector, or find the index of the maximum or minimum element of a vector.
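For instance, a sequential argmin reduction looks much like the sum above. Here is a minimal sketch (the function name is ours, for illustration):

int reduce_argmin(float* input, int length) {
  // Track the index of the smallest element seen so far
  int best = 0;
  for(int i = 1; i < length; i++)
    if (input[i] < input[best]) best = i;
  return best;
}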
The performance of parallel reductions can depend strongly on the details of how the reduction is mapped to a parallel platform. In this article, we will see how selecting the right reduction strategy can be an order of magnitude faster than a naive reduction algorithm, both on CPU devices, represented by the AMD Phenom II X4 965 CPU, and on GPU devices, represented by the ATI Radeon HD 5870 GPU.

Reduction
The simple sequential sum reduction we just saw is not parallel at all: there's a sequential dependency on the accumulator variable that requires the reduction be performed in a particular order, from front to back of the input array. However, if the reduction operator has the right properties, we can relax that ordering and restructure the computation in a variety of ways. We'll take a look at how to do this by starting from the bottom and going up.
Associativity and Commutativity
As we just mentioned, if our reduction operator gives us some flexibility in the order the operations must be performed, we can parallelize a sequential reduction. Addition is a common reduction operator that gives us a lot of flexibility; let's consider summing a vector of three numbers: [10, 20, 30]. The sequential sum would do two additions: ((10 + 20) + 30). But we'd get the same answer if we had grouped the additions differently: (10 + (20 + 30)), or even if we had reordered the additions: ((30 + 10) + 20).
You've probably heard of these properties before: if an operator allows us to regroup the operations and still get the same result, we call it associative, and if it allows us to reorder the operations and still get the same result, we call it commutative.
It turns out that these properties are key to parallelizing a reduction. We can take advantage of associativity to divide the reduction into independent pieces, and then combine the results from those pieces. For example, a+b+c+d = (a+b)+(c+d); (a+b) can be computed in parallel with (c+d), and then the two partial reductions are combined to complete the reduction. This generalizes to vectors of arbitrary size by recursively dividing the input vector, computing partial reductions, and then reducing the partial reductions to form the result.
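To make the regrouping concrete, here is a sequential C sketch of the divide-and-conquer idea (the helper is ours, for illustration):

float reduce_sum_tree(float* input, int length) {
  if (length == 1) return input[0];
  int half = length / 2;
  // The two recursive calls are independent partial reductions;
  // a parallel implementation could evaluate them concurrently.
  return reduce_sum_tree(input, half)
       + reduce_sum_tree(input + half, length - half);
}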

Figure 1: Associative Reduction Tree and SIMD Mapping


Figure 1 illustrates an associative reduction tree on an eight-element vector. This reduction tree does not assume commutativity of the operator: none of the additions are reordered, only regrouped.

Building a Reduction for GPU Devices


Let's build a parallel reduction for the GPU, starting at the OpenCL work-group level. We'll take advantage of associativity to break the vector into small chunks, build an independent reduction tree for each chunk, and execute the trees independently, in parallel. We'll make sure each chunk is small enough to fit in local memory, and then we'll assign one work-item per element.
At each stage of the reduction tree, we'll be loading and storing partial reductions as we compute, so it's crucial to use local memory to communicate between work-items in the work-group. We'll then execute the reduction tree by using a for loop in conjunction with OpenCL barriers. For example, see the following code, which performs a min reduction to find the smallest element in a vector:
__kernel
void reduce(__global float* buffer,
            __local float* scratch,
            __const int length,
            __global float* result) {
  int global_index = get_global_id(0);
  int local_index = get_local_id(0);
  // Load data into local memory
  if (global_index < length) {
    scratch[local_index] = buffer[global_index];
  } else {
    // Infinity is the identity element for the min operation
    scratch[local_index] = INFINITY;
  }
  barrier(CLK_LOCAL_MEM_FENCE);
  for(int offset = 1;
      offset < get_local_size(0);
      offset <<= 1) {
    int mask = (offset << 1) - 1;
    if ((local_index & mask) == 0) {
      float other = scratch[local_index + offset];
      float mine = scratch[local_index];
      scratch[local_index] = (mine < other) ? mine : other;
    }
    barrier(CLK_LOCAL_MEM_FENCE);
  }
  if (local_index == 0) {
    result[get_group_id(0)] = scratch[0];
  }
}

A couple of things to note about this code. Firstly, it assumes that the number of work-items in the work-group is a power of 2; otherwise it will try to access illegal elements in the scratch space. For reductions where the size is not a power of two, we'll just pad the vector with a few extra elements on the host, as sketched below. Secondly, it assumes that local memory has a legal element for every work-item in the work-group. To take care of situations where this is not true, it initializes the local memory with the identity element for the operation we're performing. Thirdly, you'll see that we have not reordered any of the reduction operations, only regrouped them. If you look at the code, you'll see that we've encoded the reduction tree very directly: on the first iteration of the reduction loop, the condition
((local_index & mask) == 0)
will be true for every other work-item in the work-group. On the second iteration, the condition will be true for every 4th work-item in the work-group. On the third iteration, the condition is true for every 8th work-item in the work-group, and so on until we've completed the reduction. This ensures that the reductions are not reordered, but this implementation leads to very sparse utilization of the work-items in the work-group, which is inefficient when mapped onto SIMD processors such as AMD GPUs.
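To make the padding note above concrete, here is a hedged host-side sketch that rounds the vector up to the next power of two and fills the tail with the identity element for min (the helper name is ours):

#include <stdlib.h>
#include <math.h>

float* pad_to_pow2(const float* input, int length, int* padded_length) {
  int n = 1;
  while (n < length) n <<= 1; // next power of two >= length
  float* padded = (float*)malloc(n * sizeof(float));
  for(int i = 0; i < length; i++) padded[i] = input[i];
  // INFINITY is the identity element for the min operation
  for(int i = length; i < n; i++) padded[i] = INFINITY;
  *padded_length = n;
  return padded;
}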
Figure 1 shows how this reduction tree proceeds in hardware when mapped onto a SIMD processor. Recall that we have assigned one work-item to each element of the input array we're reducing, and that each work-item will be mapped to a SIMD lane in the GPU hardware wavefront. Figure 1 shows how each work-item is used during the reduction, and how data is transferred through the reduction tree. As you can see, the work-items are used sparsely, and at each step of the reduction tree the active work-items get sparser and sparser. This leads to poor SIMD efficiency: in the example in Figure 1, only about 30% of the work-items are active, on average.

Taking Advantage of Commutativity


Almost all simple reduction operators, such as addition, min and max, or argmin and argmax (used to find the index of the smallest or largest element in a vector), are both associative and commutative. We'll take advantage of this to improve efficiency.
SIMD reduction efficiency can be improved by using commutativity to reorder the operations into a more SIMD-friendly structure, as shown in Figure 2.


Figure 2: Commutative Reduction and SIMD Mapping


With a commutative reduction tree, the active work-items are compacted into contiguous blocks. Assuming the work-group is mapped onto multiple SIMD wavefronts, as a wavefront becomes completely unused during the reduction it will be dropped from execution. This leads to better SIMD efficiency: if we assume a wavefront size of four work-items for the tree shown in Figure 2, our SIMD efficiency is 58%, about twice that of the associative reduction. The code needed to do this is basically identical to the associative version, just with the for loop restructured as follows:
for(int offset = get_local_size(0) / 2;
    offset > 0;
    offset >>= 1) {
  if (local_index < offset) {
    float other = scratch[local_index + offset];
    float mine = scratch[local_index];
    scratch[local_index] = (mine < other) ? mine : other;
  }
  barrier(CLK_LOCAL_MEM_FENCE);
}

Multi-stage Reduction

Figure 3: Multi-stage Reduction


So far, we've been discussing parallel reductions on vectors which are short enough to fit in a single work-group. In order to reduce over larger vectors, we need to move to a multi-stage reduction, where we do local reductions followed by larger reductions. Three possible ways of doing this are:
1. Recursive multi-stage reductions. In this approach, illustrated in Figure 3, the results produced by each local reduction are gathered into a new, smaller vector, which is then reduced in the same way, recursively, until a single element remains. This approach expresses the largest amount of parallelism, and it can be written to take advantage of associativity only, for operators which are not commutative. However, it can be less efficient.
2. Two-stage reductions. In this approach, which we will explain in greater detail later, we express just enough parallelism to fill the machine, and then follow with a final global reduction. Taking advantage of commutativity, we can then perform most of the work sequentially, which improves efficiency compared to the fully-parallel multi-stage reduction. Additionally, we only have to launch two kernels per reduction.
3. Reductions using atomics. Instead of using an explicitly multi-stage algorithm, you can use atomic memory operations in OpenCL, such as atom_add(), to reduce the partial results from each local reduction. Of course, atomic transactions limit you to the operators and data types supported by the platform you're targeting. Many applications need reductions which are not supported by atomic operations, so we'll just mention that they can be useful in some situations; a minimal sketch follows this list.
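To illustrate the third approach, here is a minimal sketch of an atomics-based sum reduction. It is our example rather than the article's: it assumes integer data and the cl_khr_global_int32_base_atomics extension, since atom_add() does not exist for floats.

#pragma OPENCL EXTENSION cl_khr_global_int32_base_atomics : enable

__kernel
void reduce_atomic(__global int* buffer,
                   __local int* scratch,
                   __const int length,
                   __global int* result) {
  int global_index = get_global_id(0);
  int local_index = get_local_id(0);
  // 0 is the identity element for addition
  scratch[local_index] = (global_index < length) ? buffer[global_index] : 0;
  barrier(CLK_LOCAL_MEM_FENCE);
  // Commutative local reduction, as before
  for(int offset = get_local_size(0) / 2; offset > 0; offset >>= 1) {
    if (local_index < offset)
      scratch[local_index] += scratch[local_index + offset];
    barrier(CLK_LOCAL_MEM_FENCE);
  }
  // One atomic per work-group folds the partial sum into the result
  if (local_index == 0) atom_add(result, scratch[0]);
}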

Recursive Multi-stage Reduction Performance


The recursive, multi-stage reduction is often the first reduction approach people try, since it's maximally parallel. Because GPUs are highly parallel processors that can execute thousands of work-items, it's tempting to parallelize things as much as possible. As we'll see, sometimes it's better to choose a more serial approach.
Figure 4 shows the performance achieved on the ATI Radeon HD 5870 GPU, for multi-stage reduction kernels which find the minimum element in
vectors of floats. The purely associative reduction averages 1.6 GigaReductions/second for large input sizes, while the multi-stage commutative
kernel, which permutes the reduction tree to better fit the SIMD nature of the GPU, averages 2.1 GigaReductions/second.

Figure 4: Multi-stage GPU Performance


To analyze how well this performs, note that for a vector of n elements, n-1 operations must be performed to complete the reduction. This means that our reduction kernels are performing at 1.6 or 2.1 GFLOP/s. All n elements need to be loaded from off-chip memory, which means the peak performance we could expect, if we were limited by off-chip memory bandwidth, would be:
157.6 GB/s * (1 FLOP / 4 bytes) = 39.4 GFLOP/s (single precision)
Accordingly, our multi-stage reductions are performing at about 5% of peak, which shows we've got quite a bit of room to improve performance.
Two-stage Reduction
All those parallel reductions in the multi-stage reductions we've been discussing are fairly expensive, because they involve lots of synchronization and communication between work-items.
This observation motivates the two-stage reduction, where the input is divided into p chunks, where p is large enough to keep all of our processors busy. In OpenCL, each chunk will be processed by a work-group. Taking advantage of commutativity, we can have each work-group process its chunk by iterating over work-group-sized pieces, with every work-item keeping a running reduction as it goes. After the entire array has been processed, each work-group writes out a single reduction result; we assemble these into another array and reduce it with a final reduction call.
Technically, we could do a two-stage reduction without taking advantage of commutativity, by having each work-item sequentially reduce a large contiguous block of the array, then finishing with a parallel reduction in each work-group, followed by a final reduction call. However, it is difficult to make this approach efficient: each work-item would be loading data from a separate region of memory, so the loads from a wavefront of work-items are not contiguous, which substantially reduces bandwidth utilization. We'll leave creating an efficient two-stage, associative reduction as an exercise for the reader, but note that since a great number of reduction operators are commutative anyway, this problem is mostly a curiosity.

Figure 5: Two-stage Reduction


Figure 5 illustrates how the two-stage reduction works. The vector is divided into work-group-sized chunks, which are then parceled out to the compute units on the device. In this example, we show a parallelization for two work-groups, a blue work-group and a green work-group. The work-groups loop over their chunks of the input vector, performing sequential reductions. Once the entire vector has been processed, the work-groups perform parallel reductions to finish the first stage. The second stage performs a parallel reduction on all the partial results from each work-group.
__kernel
void reduce(__global float* buffer,
            __local float* scratch,
            __const int length,
            __global float* result) {
  int global_index = get_global_id(0);
  float accumulator = INFINITY;
  // Loop sequentially over chunks of the input vector
  while (global_index < length) {
    float element = buffer[global_index];
    accumulator = (accumulator < element) ? accumulator : element;
    global_index += get_global_size(0);
  }
  // Perform parallel reduction
  int local_index = get_local_id(0);
  scratch[local_index] = accumulator;
  barrier(CLK_LOCAL_MEM_FENCE);
  for(int offset = get_local_size(0) / 2;
      offset > 0;
      offset >>= 1) {
    if (local_index < offset) {
      float other = scratch[local_index + offset];
      float mine = scratch[local_index];
      scratch[local_index] = (mine < other) ? mine : other;
    }
    barrier(CLK_LOCAL_MEM_FENCE);
  }
  if (local_index == 0) {
    result[get_group_id(0)] = scratch[0];
  }
}

By launching only enough work-groups to fill the compute device, we ensure that most of the reductions happen sequentially, which maximizes SIMD efficiency and drastically improves performance. In the fully-parallel, recursive reduction style mentioned above, the number of parallel reductions done during the first reduction phase is proportional to the vector length n. Using the two-stage reduction style, we perform only a constant number of parallel reductions, regardless of how large the input is, since we only do parallel reductions for as many work-groups as we need to fill the compute device. This makes the reduction much more efficient.
For this experiment, we found that launching 80 work-groups was a good choice for the ATI Radeon HD 5870 GPU. There are 20 compute units
on the GPU, so launching 80 work-groups provides for simultaneous scheduling of multiple work-groups on the same compute unit, which can help
cover memory latencies during execution.
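For reference, here is a hedged sketch of what the host-side launch might look like; the variable and function names are ours, and the kernel, queue, and buffers are assumed to have been created already:

#include <CL/cl.h>

cl_int launch_first_stage(cl_command_queue queue, cl_kernel reduce_kernel,
                          cl_mem input, cl_mem partials, cl_int length) {
  size_t local_size = 256; // work-items per work-group (an assumption)
  size_t num_groups = 80;  // enough to fill the HD 5870's 20 compute units
  size_t global_size = num_groups * local_size;
  clSetKernelArg(reduce_kernel, 0, sizeof(cl_mem), &input);
  clSetKernelArg(reduce_kernel, 1, local_size * sizeof(cl_float), NULL);
  clSetKernelArg(reduce_kernel, 2, sizeof(cl_int), &length);
  clSetKernelArg(reduce_kernel, 3, sizeof(cl_mem), &partials);
  // The kernel's while loop strides by get_global_size(0), so these
  // work-groups cover the whole vector regardless of its length.
  return clEnqueueNDRangeKernel(queue, reduce_kernel, 1, NULL,
                                &global_size, &local_size, 0, NULL, NULL);
}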

Two-stage Reduction Performance

Figure 6: Two-stage GPU Performance


Figure 6 shows the performance we achieve using the two-stage approach. As you can see, it's quite a bit faster, averaging 27.3 GigaReductions/second for large vectors. This is 70% of our peak bandwidth-bound performance, quite a big difference. Vectorizing this kernel so that it performs operations on float4 vectors instead of scalar float elements improved performance even more, bringing us to 30.3 GigaReductions/second, or 77% of our bound. It's important to note that vectorizing reductions requires that the reduction operator be commutative, since the vectorization process inherently reorders the reduction tree.
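To show what the vectorized variant might look like, here is a hedged sketch of the first-stage kernel using float4 loads. This is our reconstruction, not the article's exact kernel; it assumes the input length is a multiple of four so the buffer can be viewed as float4 elements:

__kernel
void reduce_vec4(__global float4* buffer,
                 __local float* scratch,
                 __const int length4, // length in float4 elements
                 __global float* result) {
  int global_index = get_global_id(0);
  float4 acc = (float4)(INFINITY);
  // Sequential reduction over this work-item's strided float4 elements
  while (global_index < length4) {
    acc = fmin(acc, buffer[global_index]); // component-wise min
    global_index += get_global_size(0);
  }
  // Folding the four lanes to a scalar reorders the reduction tree,
  // which is why this optimization requires commutativity
  float accumulator = fmin(fmin(acc.x, acc.y), fmin(acc.z, acc.w));
  int local_index = get_local_id(0);
  scratch[local_index] = accumulator;
  barrier(CLK_LOCAL_MEM_FENCE);
  for(int offset = get_local_size(0) / 2; offset > 0; offset >>= 1) {
    if (local_index < offset)
      scratch[local_index] = fmin(scratch[local_index],
                                  scratch[local_index + offset]);
    barrier(CLK_LOCAL_MEM_FENCE);
  }
  if (local_index == 0) result[get_group_id(0)] = scratch[0];
}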
This result is perhaps somewhat counterintuitive: switching from the most parallel algorithm we could devise to a more serial algorithm brought us great efficiency benefits. Even though GPU hardware needs a lot of parallelism to perform well, for some algorithms it's best to choose just enough parallelism to fill the compute device, but no more.
Reductions on CPU Devices


Figure 7: CPU Performance


Figure 7 shows the realized performance of several reduction kernels on the AMD Phenom II X4 965 processor, which has four compute units. Running the two-stage kernel which provided the greatest performance on the GPU brings us 79 MegaReductions/second for large inputs. Our bound is:
21.3 GB/s * (1 FLOP / 4 bytes) = 5.3 GFLOP/s (single precision)
This means that the two-stage kernel is yielding only 1.5% of peak performance, which clearly shows room for improvement.
As we've shown, serial reductions are more efficient than parallel reductions, so the following code serializes the reduction as much as possible. It is still a two-stage reduction, but it expects each work-group to be a serial work-group with only one work-item. Each work-group performs its reduction serially, then writes out the result for its chunk of the input vector. We then instantiate a few work-groups, just enough to fill all the compute units of our CPU device, and finish the reduction with a very small, sequential reduction in host code. The following is the OpenCL kernel for this serial reduction.
__kernel
void reduce(__global float* buffer,
            __const int block,
            __const int length,
            __global float* result) {
  int global_index = get_global_id(0) * block;
  float accumulator = INFINITY;
  int upper_bound = (get_global_id(0) + 1) * block;
  if (upper_bound > length) upper_bound = length;
  while (global_index < upper_bound) {
    float element = buffer[global_index];
    accumulator = (accumulator < element) ? accumulator : element;
    global_index++;
  }
  result[get_group_id(0)] = accumulator;
}
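For completeness, a hedged sketch of the host-side arithmetic that might accompany this kernel (the variable names are ours): compute the block size so each work-item covers a contiguous chunk, then finish with a tiny sequential reduction over the per-work-group partial results.

// One serial work-group per compute unit of the CPU
int num_groups = 4;
int block = (length + num_groups - 1) / num_groups; // ceiling division
// ... launch the kernel, read the 'partials' array back, then:
float final_result = INFINITY;
for(int i = 0; i < num_groups; i++)
  final_result = (partials[i] < final_result) ? partials[i] : final_result;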

This code performs much better on our CPU device. When we launch a kernel with only one work-group, using just one of the four compute units
the AMD Phenom CPU provides, we average 1.2 GigaReductions/second, or 22% of our bound for this device. When we parallelize the reduction
across all the compute units of our CPU device, we average 3.0 GigaReductions/second, or 57% of our bound. Vectorizing the reductions improves performance further still, but to get that improvement we had to avoid the ?: operator we have been using in these code examples. On the CPU, applying the ?: operator to a SIMD vector introduces control flow into the SIMD computation, causing the OpenCL compiler to emit non-vectorized code and resulting in performance losses instead of performance gains.
Overall, getting 62% of our bandwidth-limited performance shows that OpenCL can provide good performance on CPU devices as well as GPU devices.
Conclusion
When we started this article, we were faced with a challenge: how to take a sequential reduction loop and parallelize it. We looked at how associativity and commutativity allow us to restructure a sequential loop into reduction trees, and then examined several strategies for building efficient reduction trees. Perhaps surprisingly, we found that the most parallel reduction trees were also very inefficient, because they required lots of communication and synchronization, which is expensive on parallel platforms. Performing the reduction as serially as possible provided the best performance, on both the CPU and the GPU. We saw a 15x performance improvement on the GPU by taking advantage of commutativity to reduce the number of local parallel reductions we executed, compared to the fully parallel, recursive reduction. We saw a 2.8x performance improvement on the CPU by using all the cores and the SIMD units, compared to a sequential reduction.
Since reductions are such an important part of data-parallel programming, many OpenCL programmers will encounter the need to write them at
some point. Hopefully this article has given you some ideas about what approaches will work well for your problem, whether on the CPU or the
GPU.

© 2014 Advanced Micro Devices, Inc. OpenCL and the OpenCL logo are trademarks of Apple, Inc., used with permission by Khronos.
