Maximizing Application Performance by Threading,
SIMD and Microarchitecture Tuning
Koby Gottlieb
Intel Corporation
Feb 27 2007
Copyright © 2004 Intel Corporation. All Rights Reserved.
Agenda
Threading gains and challenges
Optimization methodology, project milestones
– Developing Benchmark
– VTune™ Performance Analyzer
– Threading: Overview of approaches
– Intel® Thread Checker
– Intel® Thread Profiler
– Streaming SIMD Extensions (SSE) and microarchitectural issues
Project example
– 2 Cores 2x resources
Use threads to exploit full resources of dual core processors
Efficiently Utilize Dual Cores
Threads Defined
OS creates process for each program loaded
– Each process executes as a separate thread
Additional threads can be created within the process
– Each thread has its own Stack and Instruction Pointer
– All threads share code and data
[Figure: a Process containing Data and Code, with threads thread1(), thread2() … threadN(), each with its own Stack and IP]
Efficiently Utilize Dual Cores
Threading Software
OpenMP* threads
– http://www.openmp.org/
Windows* threads
– http://msdn.microsoft.com/
[Figure: the total run time T_Total divided into a serial part (1 − P) and a parallel part P]
$$T_{Parallel} = \left\{ (1 - P) + \frac{P}{N} + O \right\} T_{Total}$$
If only 1/2 of the code is parallel, 2X speedup is unlikely
P = parallel portion of process
N = number of processors (cores)
O = parallel overhead
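As a rough illustration of the relationship between P, N and O on this slide, the following minimal sketch evaluates the parallel run time as ((1 - P) + P/N + O) times the total serial time. The function names are hypothetical, and O is assumed to be expressed as a fraction of the total serial time, as the formula implies.

```python
def parallel_time(t_total, p, n, o=0.0):
    """Estimated threaded run time: ((1 - P) + P/N + O) * T_Total."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    if n < 1:
        raise ValueError("n must be >= 1")
    return ((1.0 - p) + p / n + o) * t_total


def speedup(p, n, o=0.0):
    """Speedup over the serial run: T_Total / T_Parallel."""
    return 1.0 / ((1.0 - p) + p / n + o)


if __name__ == "__main__":
    # Compare a half-parallel, a mostly parallel and a fully parallel program
    # on two cores, ignoring overhead (O = 0).
    for p in (0.5, 0.9, 1.0):
        print(f"P={p:.1f}, N=2: speedup = {speedup(p, 2):.2f}x")
```

With P = 0.5 and N = 2 the estimate is about 1.33x even with zero overhead, which is why a 2X speedup is unlikely when only half of the code is parallel.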
Challenges Unique to Threading
Intel® Threading Tools can help!
– Develop a Benchmark
• You must define a representative benchmark before optimizing.
• A good benchmark must be automatic (VTune™ analyzer tuning
assistant), not too short (above 30 seconds) and not too long.
• Surprisingly, selecting a good benchmark is time consuming and
difficult.
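One way to make a benchmark automatic is a small timing harness that runs the workload several times without user interaction and reports a stable figure. The sketch below is only an illustration: `run_benchmark` and `sample_workload` are hypothetical names, and the stand-in workload is far shorter than the 30 seconds recommended above.

```python
import statistics
import time


def run_benchmark(workload, repeats=5):
    """Time workload() `repeats` times; return per-run seconds and the median."""
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        workload()
        times.append(time.perf_counter() - start)
    return {"runs": times, "median": statistics.median(times)}


def sample_workload():
    # Hypothetical stand-in for a full, representative application run.
    return sum(i * i for i in range(200_000))


if __name__ == "__main__":
    result = run_benchmark(sample_workload)
    print(f"median of {len(result['runs'])} runs: {result['median']:.4f} s")
```

Reporting the median rather than a single run reduces the influence of one-off disturbances, so that before and after measurements are easier to compare.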
Threading
The most challenging part of the project: how to thread.
– Added difficulty: shared resources such as the FSB or the L2 cache may eliminate the
potential speedup.
– Functional or data decomposition?
– In many cases you can find mostly functional parallelism, which only scales
to 2-3 threads.
– Examples:
• Identify the stages and let thread 0 work on the front end of data element N+1 while
thread 1 works on the back end of data element N.
• Assign thread per channel in stereo.
– For good data decomposition, the code should be designed in advance to be
threaded.
• A desirable goal is to maintain the exact results in order to simplify testing. Breaking
the input into chunks therefore does not work if there is any history between data
elements.
– If data decomposition covers only a relatively small part of the project, there is
almost no speedup because of the synchronization overhead.
OpenMP is very convenient for data decomposition experimentation.
• Supported by the Intel® compiler.
• It gained legitimacy with its introduction in the MS .NET 2005 compiler*.
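The point about history between data elements can be seen in a minimal sketch. The two processing stages below (`square`, which has no history, and `smooth`, a running filter whose output depends on the previous output) are hypothetical examples, not taken from any of the projects. The sketch checks only whether chunking preserves exact results; Python threads in standard CPython do not run CPU-bound code in parallel, so it says nothing about speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def smooth(samples, state=0.0, alpha=0.5):
    """Running filter: each output depends on the previous output (history)."""
    out = []
    for x in samples:
        state = alpha * x + (1.0 - alpha) * state
        out.append(state)
    return out


def square(samples):
    """Element-wise stage with no history between data elements."""
    return [x * x for x in samples]


def chunked(fn, samples, n_chunks=2):
    """Data decomposition: split the input and process each chunk in its own thread."""
    size = max(1, (len(samples) + n_chunks - 1) // n_chunks)
    chunks = [samples[i:i + size] for i in range(0, len(samples), size)]
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        parts = list(pool.map(fn, chunks))
    return [y for part in parts for y in part]


if __name__ == "__main__":
    data = [1.0, 2.0, 3.0, 4.0]
    print(chunked(square, data) == square(data))  # chunking keeps exact results
    print(chunked(smooth, data) == smooth(data))  # chunking changes the results
```

In the `smooth` case the second chunk restarts from the initial state instead of the state left by the first chunk, so its outputs differ from the serial run. Keeping exact results would require carrying that state across the chunk boundary, which reintroduces a serial dependence.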
Screen shot: Intel® Thread Checker
Diagnostics List
Verbose diagnostics
Diagnostics List in Terse mode
Summary and legend
Screen shot: Intel® Thread Checker
Each diagnostic in the list links to its source code line(s)
Screen shot: Intel® Thread Checker
1) Right-click here . . .
2) More help!
Example: From Sphinx final report.
Threading, Performance
Check what percentage of the code is threaded.
– Setting the upper bound for potential performance.
– Can use VTune™ analyzer to see how much time each thread runs.
– Check if the total instruction count of the threaded app is equal to the instruction
count of the original app.
• In many cases there is a huge overhead for threading, or just a bug (doing
some work twice).
Evaluate the amount of parallel work.
– Even if both threads spend the same amount of time, they may not be doing it at
the same time.
– If an (already debugged) threaded app runs much slower than the scalar app, look
for false sharing issues:
• “No, converting each local variable to an array of 2 variables is not a good
idea for threading efficiency.” From one of my meetings, trying to explain
how come the threaded app is 14X slower than the original app.
Check the critical path.
– Intel® Thread Profiler is great for the job once you figure out how to use it and
its cryptic terminology.
– Note that Win32 API Thread Profiler is not the same tool as the OpenMP Thread
Profiler.
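The remark that two threads can spend the same amount of time without running at the same time can be checked with a small calculation over per-thread busy intervals. The sketch below assumes hypothetical (start, end) timestamps in seconds, sorted and non-overlapping within each thread, for example transcribed by hand from a profiler timeline; it does not use any VTune or Thread Profiler API.

```python
def busy_time(intervals):
    """Total time covered by a list of (start, end) busy intervals."""
    return sum(end - start for start, end in intervals)


def overlap_time(a, b):
    """Time during which both threads are busy at once."""
    i = j = 0
    total = 0.0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if hi > lo:
            total += hi - lo
        # Advance whichever interval ends first.
        if a[i][1] <= b[j][1]:
            i += 1
        else:
            j += 1
    return total


if __name__ == "__main__":
    thread0 = [(0.0, 5.0)]
    thread1 = [(5.0, 10.0)]  # same busy time as thread0, but strictly afterwards
    print(busy_time(thread0), busy_time(thread1), overlap_time(thread0, thread1))
```

Here both threads are busy for 5 seconds, yet the overlap is zero, so the threaded run is no faster than the serial one.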
Intel® Threading Tools
Thread Profiler Critical Path
[Figure: Critical Path View, with a Time axis from 0 to 15]
– Start with the critical path
– Separate according to system utilization
– Add overhead
– Further analyze by thread state
Analysis shown for 2-way system
From GainMPEG:
So what’s wrong
with this picture?
High branch mispredictions impact
Many L2 Demand Misses
Example:
From http://techreport.com/reviews/2005q2/pentium-xe-840/index.x?pg=11
LAME audio encoding
LAME MT is, as you might have guessed, a multithreaded version of the LAME MP3
encoder. LAME MT was created as a demonstration of the benefits of multithreading
specifically on a Hyper-Threaded CPU like the Pentium 4. You can even download a paper
(in Word format) describing the programming effort.
Rather than run multiple parallel threads, LAME MT runs the MP3 encoder's psycho-
acoustic analysis function on a separate thread from the rest of the encoder using simple
linear pipelining. That is, the psycho-acoustic analysis happens one frame ahead of
everything else, and its results are buffered for later use by the second thread. The author
notes, "In general, this approach is highly recommended, for it is exponentially harder to
debug a parallel application than a linear one."
We have results for two different 64-bit versions of LAME MT from different compilers,
one from Microsoft and one from Intel, doing two different types of encoding, variable bit
rate and constant bit rate. We are encoding a massive 10-minute, 6-second 101MB WAV
file here, as we have done in our previous CPU reviews.
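The linear pipelining described in the excerpt above, where the analysis stage runs one frame ahead and its results are buffered for the second thread, can be sketched as a two-stage producer/consumer pipeline. This is only an illustration of the pattern, not the LAME MT code: `analyze` and `encode` are hypothetical stand-ins for the psycho-acoustic analysis and the rest of the encoder.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the frame stream


def analyze(frame):
    # Hypothetical stand-in for the psycho-acoustic analysis stage.
    return sum(frame) / len(frame)


def encode(frame, analysis):
    # Hypothetical stand-in for the rest of the encoder.
    return [round(x - analysis, 3) for x in frame]


def pipelined_encode(frames, depth=1):
    """Stage 1 analyzes frame N+1 while stage 2 encodes frame N."""
    buf = queue.Queue(maxsize=depth)  # buffered analysis results

    def stage1():
        for frame in frames:
            buf.put((frame, analyze(frame)))
        buf.put(_DONE)

    producer = threading.Thread(target=stage1)
    producer.start()
    out = []
    while True:
        item = buf.get()
        if item is _DONE:
            break
        frame, analysis = item
        out.append(encode(frame, analysis))
    producer.join()
    return out


def serial_encode(frames):
    """Reference single-threaded version, used to check for identical output."""
    return [encode(f, analyze(f)) for f in frames]
```

Because frames pass through the queue in order and each stage sees exactly the inputs it would see in the serial version, `pipelined_encode(frames)` returns the same output as `serial_encode(frames)`. That property makes this kind of decomposition easy to test.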
An educational tool used for learning about MP3 encoding. Its goal
is to improve:
– Psycho-acoustics quality.
– The speed of MP3 encoding.
LAME is the most popular state-of-the-art MP3 encoder/decoder
used by today’s leading products.
Project goals:
– Speeding up the encoding of an audio stream.
– Turning LAME into a Multi-Threaded (MT) engine.
– Be 1:1 bit compatible with the original version.
– Optimize specifically for SMT platforms.
– 64-bit port and CMP-related optimizations.
[Figure: an Audio Stream divided into Frame 1, Frame 2, Frame 3, Frame 4. Encoder stage labels: Read Frame, Psycho-Acoustic, Perceptual Model, Analysis Filterbank, MDCT, Quantization, Huffman Encoding, Bitstream Encode.]
Specifically in LAME
The intuitive approach:
[Figure: Frame 1 … Frame 6 distributed between Thread 1 and Thread 2]
This is actually Data Decomposition
An unbreakable dependence due to Huffman Encoding
LAME MT – Functional Decomposition
T2:
Integer Intensive
Results
Some Observations
What can be accepted:
– Threading. There is always something to thread, but not always with
significant gain.
– Streaming SIMD Extensions opportunities.
– 64-bit porting.
• A huge opportunity. Can be used if the student can’t find other options.
• Porting the assembly code will definitely show benefit. It is a big task
waiting to be done.
Things that didn't go as expected:
– Finding good and influential candidates. It becomes more difficult
every semester.
– One semester is too short for many apps.
– Returning code to the moderators:
• Only some parts of some projects were accepted by the open source
moderator.
• None of the projects were fully accepted.
Backup