Begin Parallel Programming With OpenMP
Improve program performance using OpenMP on multi-core systems
uusheikh, 5 Jun 2007
Introduction
Parallel processing has long been an attractive way to improve program performance. Lately,
more computers ship with multiple processors or multi-core CPUs, making parallel processing
available to the masses. However, having multiple processing units in the hardware does not
mean existing programs will take advantage of them. Programmers must take the initiative and
add parallel processing capabilities to their programs to fully utilize the available hardware.
OpenMP is a set of programming APIs consisting of several compiler directives and a library of
support functions. It was first developed for use with Fortran, and it is now available for C and
C++ as well.
Types of Parallel Programming
Before we begin with OpenMP, it is important to know why we need parallel processing. In a
typical case, sequential code executes in a single thread, which runs on a single processing
unit. Thus, on a computer with two or more processors (or two cores, or one processor with
HyperThreading), only one processor is used for execution, and the other processing power is
wasted. Rather than letting the other processor sit idle (or process other threads from other
programs), we can use it to speed up our algorithm.
Parallel processing can be divided into two groups: task based and data based.
1. Task based: Divide different tasks among different CPUs to be executed in parallel. For
example, a printing thread and a spell-checking thread running simultaneously in a word
processor. Each thread is a separate task.
2. Data based: Execute the same task, but divide the workload over the data across several
CPUs. For example, to convert a color image to grayscale, we can convert the top half of
the image on the first CPU while the lower half is converted on the second CPU (and so on,
for as many CPUs as you have), thus processing in roughly half the time. A minimal sketch
of both flavors is shown below.
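To make the distinction concrete, here is a minimal sketch of both flavors using OpenMP (the
parallel sections directive shown for the task-based case is an OpenMP construct not otherwise
covered in this article, and doTaskA/doTaskB are hypothetical tasks):

#include <omp.h>
#include <stdio.h>

static void doTaskA() { printf("Task A on thread %d\n", omp_get_thread_num()); }
static void doTaskB() { printf("Task B on thread %d\n", omp_get_thread_num()); }

int main()
{
    // Task based: each section is a different task,
    // executed by a different thread
    #pragma omp parallel sections
    {
        #pragma omp section
        doTaskA();
        #pragma omp section
        doTaskB();
    }

    // Data based: the same operation, with the iterations
    // divided among the threads
    unsigned char data[1000];
    #pragma omp parallel for
    for (int i = 0; i < 1000; i++)
        data[i] = (unsigned char)(i & 0xFF);

    return 0;
}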
There are several methods to do parallel processing:
1. Use MPI (Message Passing Interface): MPI is best suited for systems with multiple
processors and distributed memory, for example a cluster of computers, each with its own
local memory. You can use MPI to divide the workload across the cluster and merge the
results when finished. Available with the Microsoft Compute Cluster Pack.
2. Use OpenMP: OpenMP is suited for shared-memory systems like the ones we have on our
desktop computers, that is, systems with multiple processors that share a single memory
subsystem. Using OpenMP is just like writing your own smaller threads but letting the
compiler do it. Available in Visual Studio 2005 Professional and Team Suite.
3. Use SIMD intrinsics: Single Instruction Multiple Data (SIMD) has long been available on
mainstream processors: Intel's MMX, SSE, SSE2 and SSE3, Motorola's (or IBM's) AltiVec,
and AMD's 3DNow!. SIMD intrinsics are primitive functions that parallelize data processing
at the CPU register level. For example, adding two unsigned char values normally occupies
a whole register, although the data type is only 8 bits wide, leaving the remaining 24 bits
of a 32-bit register zero-filled and wasted. Using SIMD (such as MMX), we can load eight
unsigned chars (or four shorts, or two integers) into one register and add them in parallel
at the register level. Available in Visual Studio 2005 using SIMD intrinsics, or with the
Visual C++ Processor Pack for Visual C++ 6.0. A small intrinsics sketch follows this list.
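As an illustration, here is a minimal sketch using SSE2 intrinsics (SSE2 is used instead of MMX
purely for convenience, since it avoids MMX's EMMS bookkeeping; the arrays and values are
arbitrary), adding sixteen unsigned chars in a single instruction:

#include <emmintrin.h>  // SSE2 intrinsics
#include <stdio.h>

int main()
{
    unsigned char a[16], b[16], sum[16];
    for (int i = 0; i < 16; i++) { a[i] = (unsigned char)i; b[i] = 100; }

    // Load 16 bytes into each 128-bit register (unaligned load),
    // add all 16 lanes with a single instruction, then store back
    __m128i va   = _mm_loadu_si128((const __m128i*)a);
    __m128i vb   = _mm_loadu_si128((const __m128i*)b);
    __m128i vsum = _mm_add_epi8(va, vb);
    _mm_storeu_si128((__m128i*)sum, vsum);

    for (int i = 0; i < 16; i++)
        printf("%d ", sum[i]);   // prints 100 through 115
    return 0;
}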
Getting Started
Setting Up Visual Studio
To use OpenMP with Visual Studio 2005, you need:
1. Visual Studio 2005 Professional or better
2. A multi-processor or multi-core system, to see the speed improvement
3. An algorithm to parallelize!
Visual Studio 2005 implements the OpenMP 2.0 standard; the runtime lives in the vcomp.dll
dynamic library, located under the Windows\WinSxS directory in the
(ia64/amd64/x86)...Microsoft.VC80.(Debug)OpenMP... folders.
To use OpenMP, you must do the following:
1. Include <omp.h>
2. Enable the OpenMP compiler switch (/openmp) in Project Properties, under C/C++ > Language
3. That's all!
Understanding the Fork-and-Join Model
OpenMP uses the fork-and-join parallelism model. In fork-and-join, parallel threads are created
and branched out from a master thread to execute an operation, and they remain only until the
operation has finished; then all the extra threads are destroyed, leaving only the master thread.
The process of forking and joining the threads, including the synchronization of the end result,
is handled by OpenMP.
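A minimal sketch of the model (the messages are illustrative; the number of threads created at
the fork depends on your system and settings):

#include <omp.h>
#include <stdio.h>

int main()
{
    printf("Master thread only, before the fork\n");

    // Fork: a team of threads is created here; every thread
    // in the team executes the block below
    #pragma omp parallel
    {
        printf("One of the parallel threads doing work\n");
    }
    // Join: the threads synchronize and are destroyed here;
    // only the master thread continues

    printf("Master thread only, after the join\n");
    return 0;
}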
How Many Threads Do I Need?
A typical question: how many threads do I actually need? Are more threads better? How do I
control the number of threads in my code? Is the number of threads related to the number of
CPUs?
The number of threads needed to solve a problem is generally limited to the number of CPUs
you have. As the fork-and-join model above shows, whenever threads are created, a little time
is taken to create them and, later, to join the end results and destroy them. When the problem
is small and the number of CPUs is less than the number of threads, the total execution time
will be longer (slower), because more time is spent creating the threads, and later switching
between them (due to preemptive scheduling), than actually solving the problem. Whenever a
thread context is switched, data must be saved to and loaded from memory, and this takes time.
The rule is simple: since all the threads will be executing the same operation (and hence have
the same priority), one thread per CPU (or core) is sufficient. The more CPUs you have, the
more threads you can create.
OpenMP uses the OMP_NUM_THREADS environment variable to determine the default number of
threads to create. You can also control the number of threads from code with the following
functions:
// Get the number of processors in this system
int iCPU = omp_get_num_procs();
// Now set the number of threads
omp_set_num_threads(iCPU);
Of course, you can assign any value to iCPU in the code above (if you do not want to call
omp_get_num_procs), and you can call omp_set_num_threads as many times as you like, for
different parts of your code, for maximum control. If omp_set_num_threads is not called,
OpenMP uses the OMP_NUM_THREADS environment variable.
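For example, a sketch of using different thread counts for different loops (the function and
data are hypothetical):

#include <omp.h>

void process(int* data, int n)
{
    // Use every available processor for the heavy loop...
    omp_set_num_threads(omp_get_num_procs());
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        data[i] *= 2;

    // ...but only two threads for a lighter one
    omp_set_num_threads(2);
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        data[i] += 1;
}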
Parallel for Loop
Let us start with a simple parallel for loop. The following code converts a 32-bit color image
to an 8-bit grayscale image.
// pDest is an unsigned char array of size width * height
// pSrc is an unsigned char array of size width * height * 4 (32-bit)
// To avoid floating point operations, the floating point weights
// in the original grayscale formula have been scaled to integer
// approximations (see Additional Notes below). The weights are
// applied as B, G, R to bytes 0, 1, 2, which suggests the source
// uses the BGRA byte order typical of Windows bitmaps.
// Use #pragma omp parallel for to make a parallel for loop
omp_set_num_threads(threads);
#pragma omp parallel for
for(int z = 0; z < height*width; z++)
{
    pDest[z] = (pSrc[z*4+0]*3735 + pSrc[z*4+1]*19234 + pSrc[z*4+2]*9797) >> 15;
}
The #pragma omp parallel for directive parallelizes the for loop according to the number of
threads set. The following is the performance gained for a 3264x2488 image on a 1.66 GHz
Core Duo system (2 cores).
Thread(s) : 1 Time 0.04081 sec
Thread(s) : 2 Time 0.01906 sec
Thread(s) : 4 Time 0.01940 sec
Thread(s) : 6 Time 0.02133 sec
Thread(s) : 8 Time 0.02029 sec
As you can see, by executing the problem using 2 threads on a dual-core CPU, the time has been
cut roughly in half. However, as the number of threads is increased further, the performance
does not improve, due to the increased time spent forking and joining.
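The benchmarking code itself is not shown in the article; a minimal sketch of how such timings
can be taken with OpenMP's portable timer, omp_get_wtime, might look like this (convertToGray
is a hypothetical wrapper around the loop above):

#include <omp.h>
#include <stdio.h>

void convertToGray(unsigned char* pDest, const unsigned char* pSrc,
                   int width, int height, int threads)
{
    omp_set_num_threads(threads);
    #pragma omp parallel for
    for (int z = 0; z < height * width; z++)
        pDest[z] = (pSrc[z*4+0]*3735 + pSrc[z*4+1]*19234 + pSrc[z*4+2]*9797) >> 15;
}

void benchmark(unsigned char* pDest, const unsigned char* pSrc,
               int width, int height)
{
    for (int threads = 1; threads <= 8; threads *= 2)
    {
        double t0 = omp_get_wtime();   // wall-clock time in seconds
        convertToGray(pDest, pSrc, width, height, threads);
        double t1 = omp_get_wtime();
        printf("Thread(s) : %d Time %.5f sec\n", threads, t1 - t0);
    }
}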
Parallel double for Loop
The same problem above (converting color to grayscale) can also be written as a double for
loop, like this:
for(int y = 0; y < height; y++)
    for(int x = 0; x < width; x++)
        pDest[x+y*width] = (pSrc[x*4 + y*4*width + 0]*3735 +
            pSrc[x*4 + y*4*width + 1]*19234 + pSrc[x*4 + y*4*width + 2]*9797) >> 15;
In this case, there are two solutions.
Solution 1
Here we have made the inner loop parallel using the parallel for directive. As the timings
below show, when 2 threads are used, the execution time has actually increased! This is
because a fork-join operation is performed for every iteration of y, and this overhead adds up
to the increased execution time.
for(int y = 0; y < height; y++)
{
    #pragma omp parallel for
    for(int x = 0; x < width; x++)
    {
        pDest[x+y*width] = (pSrc[x*4 + y*4*width + 0]*3735 +
            pSrc[x*4 + y*4*width + 1]*19234 + pSrc[x*4 + y*4*width + 2]*9797) >> 15;
    }
}
Thread(s): 1 Time 0.04260 sec
Thread(s): 2 Time 0.05171 sec
Solution 2
Instead of making the inner loop parallel, the outer loop is the better choice. Here, another
piece of syntax is introduced: the private clause. The private clause directs the compiler to
give each thread its own copy of the listed variables, so the threads do not share (and
corrupt) a single loop counter. As the timings show, the execution time is indeed cut roughly
in half.
int x = 0;
// private(x) gives each thread its own copy of x
#pragma omp parallel for private(x)
for(int y = 0; y < height; y++)
{
    for(x = 0; x < width; x++)
    {
        pDest[x+y*width] = (pSrc[x*4 + y*4*width + 0]*3735 +
            pSrc[x*4 + y*4*width + 1]*19234 + pSrc[x*4 + y*4*width + 2]*9797) >> 15;
    }
}
Thread(s) : 1 Time 0.04039 sec
Thread(s) : 2 Time 0.02020 sec
Get to Know the Number of Threads
At any time, we can query the OpenMP runtime about the threads that are running. Note the
distinction between two similarly named functions:

// Returns the number of threads in the current team
// (1 when called outside a parallel region)
int omp_get_num_threads();
// Returns the ID of the calling thread, from 0 (the master)
// to omp_get_num_threads() - 1
int omp_get_thread_num();
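For example (a minimal sketch; the output order varies from run to run):

#include <omp.h>
#include <stdio.h>

int main()
{
    #pragma omp parallel
    {
        // Each thread reports its own ID and the team size
        printf("Thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }
    return 0;
}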
Summary
By using OpenMP, you can gain performance on multi-core systems almost for free, without much
coding other than a line or two. There is no excuse not to use OpenMP: the benefits are there,
and the coding is simple.
There are several other OpenMP features, such as firstprivate, critical sections and
reductions, which are left for another article.
Points of Interest
You can also perform a similar operation using Windows threads, but it would be more difficult
to implement because of issues such as synchronization and thread management.
Additional Notes
The original RGB to Grayscale formula is given as Y = 0.299 R + 0.587 G + 0.114 B.
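These are the integer weights used in the code above: scaling each weight by 2^15 = 32768 and
truncating gives 0.299 × 32768 ≈ 9797, 0.587 × 32768 ≈ 19234 and 0.114 × 32768 ≈ 3735, and
the final >> 15 in the code divides the weighted sum by 32768 again to undo the scaling.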
References
1. Parallel Programming in C with MPI and OpenMP (1st ed.), Michael J. Quinn, McGraw-Hill, 2004
2. Scientific Parallel Computing, L. Ridgway Scott, Terry Clark, Babak Bagheri, Princeton
University Press, 2005
3. Introduction to Parallel Computing, W. P. Petersen, P. Arbenz, Oxford University Press, 2004
4. Digital Color Imaging Handbook, Gaurav Sharma, CRC Press, 2003
History
June 6, 2007 - Added: MPI capabilities available with the Microsoft Compute Cluster Pack
June 6, 2007 - Initial article (This is my first article on CodeProject!)
License
This article, along with any associated source code and files, is licensed under The Code Project
Open License (CPOL)
About the Author
uusheikh
Engineer, Malaysia
Hi, I'm Usman and I am from Malaysia, where I am doing my postgraduate studies in Image
Processing. Most of the time I code in C++, and CodeProject has been a really helpful 'friend'
to me!