
Copyright © 2004 Intel Corporation. All Rights Reserved.

Maximizing Application Performance by Threading,
SIMD and Micro-architecture Tuning
Koby Gottlieb

Intel Corporation

Feb 27 2007


Agenda
▪ Threading gains and challenges
▪ Optimization methodology, project milestones
– Developing a benchmark
– VTune™ Performance Analyzer
– Threading: overview of approaches
– Intel® Thread Checker
– Intel® Thread Profiler
– Streaming SIMD Extensions (SSE) and micro-architectural issues
▪ Project example

[Mark] is a trademark or registered trademark of Intel Corporation or its subsidiaries in the United States or other countries

Efficiently Utilize Dual Cores
Dual-Core Systems
▪ One package with 2 cores
▪ Software impact
– 2 cores → 2 processors
– 2 cores → 2x resources

Use threads to exploit full resources of dual core processors

Efficiently Utilize Dual Cores
Threads Defined
▪ OS creates a process for each program loaded
– Each process executes as a separate thread
▪ Additional threads can be created within the process
– Each thread has its own Stack and Instruction Pointer
– All threads share code and data

[Figure: a process containing shared Data and Code, with thread1(), thread2() ... threadN(), each holding its own Stack and IP]

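The thread model above can be sketched in a few lines. This is only an illustration in C++ with `std::thread`, which the slides do not use; the names `shared_counter`, `worker` and `run_threads` are hypothetical. Each thread keeps `local` on its own stack, while every thread sees the same `shared_counter`.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Shared by every thread in the process (part of the process data).
std::atomic<int> shared_counter{0};

// Each call runs on its own thread, with its own stack and instruction pointer.
void worker(int iterations) {
    int local = 0;                       // private: lives on this thread's stack
    for (int i = 0; i < iterations; ++i) ++local;
    shared_counter += local;             // visible to all threads (atomic add)
}

// Starts n additional threads inside the current process and waits for them.
int run_threads(int n, int iterations) {
    assert(n >= 1);
    shared_counter = 0;
    std::vector<std::thread> threads;
    for (int t = 0; t < n; ++t) threads.emplace_back(worker, iterations);
    for (auto& th : threads) th.join();
    return shared_counter.load();
}
```

The atomic counter is used here only so the sketch is free of the data race discussed on the following slides.
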
Efficiently Utilize Dual Cores
Threading Software
▪ OpenMP* threads
– http://www.openmp.org/
▪ Windows* threads
– http://msdn.microsoft.com/
▪ POSIX* threads (pthreads)
– http://www.ieee.org/

If both cores fully busy, then 2x speedup possible

*Other names and brands may be claimed as the property of others.

Challenges Unique to Threading
Correctness Bug: Data Races
▪ Suppose: a=1, b=2

Thread 1        Thread 2
x = a + b       b = 42

▪ What is the value of x if:
– Thread1 runs before Thread2? x = 3
– Thread2 runs before Thread1? x = 43
▪ Data race: concurrent read, modify, write of same address

Outcome depends on thread execution order

Challenges Unique to Threading
Solving Data Races: Synchronization

Thread1         Thread2
Acquire(L)      Acquire(L)
a = 1           b = 42
b = 2           Release(L)
x = a + b
Release(L)

▪ Acquisition of mutex L ensures atomic access
– Only one thread can hold the lock at a time
▪ Example APIs:
– EnterCriticalSection(), LeaveCriticalSection()
– pthread_mutex_lock(), pthread_mutex_unlock()

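A minimal sketch of the Acquire/Release pattern, written with C++ `std::mutex` rather than the Win32 or pthreads calls named on the slide. The `Shared` struct and the function names are hypothetical. Because Thread1 writes `b = 2` and reads it under the same lock, `x` is always 3, whichever thread gets the lock first.

```cpp
#include <cassert>
#include <functional>
#include <mutex>
#include <thread>

struct Shared {
    std::mutex L;   // the lock "L" from the slide
    int a = 1;
    int b = 2;
    int x = 0;
};

// Thread1 body: write a and b, then read them, all while holding L.
void thread1(Shared& s) {
    std::lock_guard<std::mutex> hold(s.L);   // Acquire(L); released at scope exit
    s.a = 1;
    s.b = 2;
    s.x = s.a + s.b;
}

// Thread2 body: the conflicting write to b, under the same lock.
void thread2(Shared& s) {
    std::lock_guard<std::mutex> hold(s.L);
    s.b = 42;
}

// Runs both threads concurrently and returns x.
int run_once() {
    Shared s;
    std::thread t1(thread1, std::ref(s));
    std::thread t2(thread2, std::ref(s));
    t1.join();
    t2.join();
    assert(s.b == 2 || s.b == 42);   // final b still depends on which thread ran last
    return s.x;
}
```

The lock makes each critical section atomic; it does not impose an order between the two threads, so the final value of `b` can still differ from run to run.
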
Efficiently Utilize Dual Cores
Amdahl’s Law

[Figure: T_Total split into a serial portion (1-P) and a parallel portion P]

$$T_{Parallel} = \left\{ (1 - P) + \frac{P}{N} + O \right\} T_{Total}$$

P = parallel portion of process
N = number of processors (cores)
O = parallel overhead

If only 1/2 of the code is parallel, 2X speedup is unlikely

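As a worked instance of the formula: with P = 0.5, N = 2 and O = 0, the parallel time is (0.5 + 0.25) × T_Total = 0.75 × T_Total, a speedup of about 1.33X. The small sketch below evaluates the same expression; the function names are hypothetical and overhead is expressed as a fraction of T_Total.

```cpp
#include <cassert>

// Parallel run time as a fraction of T_Total: (1 - P) + P / N + O.
double parallel_time_fraction(double p, int n, double overhead) {
    assert(p >= 0.0 && p <= 1.0 && n >= 1 && overhead >= 0.0);
    return (1.0 - p) + p / n + overhead;
}

// Speedup relative to the original run: T_Total / T_Parallel.
double speedup(double p, int n, double overhead) {
    return 1.0 / parallel_time_fraction(p, n, overhead);
}
```

Any non-zero overhead O lowers the result further, which is why the threaded portion needs to cover most of the run time before a dual-core system approaches 2X.
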
Challenges Unique to Threading
Threads Introduce a New Class of Problems
▪ Correctness bugs (Intel® Thread Checker finds correctness bugs)
• Data races
• Deadlock
• and more…
▪ Performance bottlenecks (Thread Profiler feature pinpoints bottlenecks)
• Overhead
• Load balance
• and more…

Intel® Threading Tools can help!

Methodology & Milestones: Getting Started
– Most of the world's apps are not threaded:
• There are 106,177 registered projects on SourceForge (http://sourceforge.net/).
• Almost all of these applications are not performance sensitive.
• Some performance-sensitive apps are too small, too big, or too complex.
– Are these apps a representative picture of the real software world?
– If so, we have a problem in our multi-core strategy.
– Learning the app
• No need to understand every algorithm, but an overall understanding is a must.
• The call graph of the VTune™ analyzer is a great tool for this task.
– Develop a benchmark
• A representative benchmark must be defined before optimizing.
• A good benchmark must be automatic (VTune™ analyzer tuning assistant), not too short (above 30 seconds) and not too long.
• Surprisingly, selecting a good benchmark is time consuming and difficult.

Using VTune™ Performance Analyzer
▪ Sampling is surprisingly easy to use:
– Easy to get good results from sampling without any training.
– Time breakdown is the first step in the threading decision-making process.
– Hot spots are candidates for vectorization.
▪ Call graph is a tool to understand the code and select the threading direction.
– Set the /fixed:no flag for the linker.
– Call graph provides a hierarchical view and overall timing.
– Call graph overhead makes it too inaccurate for timing; use Sampling for correct time estimates.

Threading
▪ The most challenging part of the project: how to thread.
– Added difficulty: shared resources like the FSB or L2 cache may eliminate the speedup potential.
– Functional or data decomposition?
– In many cases you can find mostly functional parallelism, which only scales to 2-3 threads.
– Examples:
• Identify the stages and let thread 0 work on the front end of data element N+1 while thread 1 works on the back end of data element N.
• Assign a thread per channel in stereo.
– For good data decomposition, the code should be designed in advance to be threaded.
• A desirable goal is to maintain the exact results in order to simplify testing, so breaking the input into chunks does not work if there is any history between data elements.
– If data decomposition works on only a relatively small part of the project → almost no speedup because of the synchronization overhead.
▪ OpenMP is very convenient for data decomposition experimentation.
• Supported by the Intel® compiler.
• It became more legitimate with its introduction in the MS .NET 2005 compiler*.
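A minimal sketch of data decomposition with OpenMP, assuming a loop whose iterations are fully independent (no history between elements). The functions `apply_gain` and `matches_serial` are hypothetical and not taken from any of the projects. The pragma takes effect only when the compiler's OpenMP option is enabled (for example `-fopenmp` or `/openmp`); without it the loop simply runs serially, which makes this kind of experiment cheap to try and to back out.

```cpp
#include <cassert>
#include <vector>

// Applies a gain to every sample. Each iteration touches only samples[i],
// so the iteration space can be split across threads without changing results.
void apply_gain(std::vector<float>& samples, float gain) {
    const long n = static_cast<long>(samples.size());
    #pragma omp parallel for
    for (long i = 0; i < n; ++i) {
        samples[i] *= gain;
    }
}

// Compares the (possibly threaded) loop against a plain serial reference,
// in the spirit of "maintain the exact results in order to simplify testing".
bool matches_serial(const std::vector<float>& input, float gain) {
    std::vector<float> expected(input);
    for (float& s : expected) s *= gain;

    std::vector<float> actual(input);
    apply_gain(actual, gain);
    assert(actual.size() == expected.size());
    return actual == expected;
}
```

A loop that carries state from one element to the next (a filter with memory, for instance) would not pass this kind of exact comparison if split into chunks this way.
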

Debugging the Threaded App
▪ Convert the app to serial code and debug first while running thread 0 before thread 1, and then in reverse order.
– This methodology is good for 75% of the bugs and does not require any tricky debugging technique.
– Then try running in parallel and start looking for shared data elements.
▪ Intel® Thread Checker to the rescue.
– “No, it is not broken, just build a very small example and be patient.” It takes a long time.
– Intel® Thread Checker gives excellent analysis capabilities:
• The location of the faulty data element allocation
• The read location
• The write location
• The call stack that brings us to this location
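The "run thread 0 before thread 1, then in reverse order" step can be sketched as a small serial harness. This is an assumption about how such a check might look, not the tooling used in the projects; `order_independent`, `Vars` and the three thread bodies are hypothetical names, with `Vars` reusing the a/b/x example from the data race slide.

```cpp
#include <cassert>

// Runs the two thread bodies serially in both orders, each time on a fresh
// copy of the state, and reports whether the final state is the same.
template <typename State, typename Body0, typename Body1>
bool order_independent(const State& initial, Body0 thread0, Body1 thread1) {
    State forward = initial;
    thread0(forward);
    thread1(forward);

    State reverse = initial;
    thread1(reverse);
    thread0(reverse);

    return forward == reverse;
}

struct Vars { int a; int b; int x; };

bool operator==(const Vars& l, const Vars& r) {
    return l.a == r.a && l.b == r.b && l.x == r.x;
}

void compute_x(Vars& v) { v.x = v.a + v.b; }  // reads b
void set_b(Vars& v)     { v.b = 42; }         // writes b: conflicts with compute_x
void set_a(Vars& v)     { v.a = 7; }          // writes a only: independent of set_b
```

A `false` result points at shared data that one body reads and the other writes. A `true` result does not prove the threaded version is correct, since real interleavings are finer-grained than these two orders.
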

Intel® Thread Checker 2.0 Features
▪ Locates threading bugs:
– Data races (storage conflicts)
– Deadlocks (potential and actual)
– Win32 threading API usage problems
– Memory leaks and overwrites
▪ Isolates bugs to source code line
▪ Describes possible causes of errors and suggests resolutions
▪ Categorizes errors by severity level

Screen shot: Intel® Thread Checker Diagnostics List

[Screenshot callouts: Verbose diagnostics; Diagnostics List in Terse mode; Summary and legend]

Screen shot: Intel® Thread Checker Source Code View

[Screenshot callout: Each diagnostic in the list links to its source code line(s)]

Screen shot: Intel® Thread Checker Help with Diagnostics

[Screenshot callouts: 1) Right-click here . . . 2) More help!]

Intel® Thread Checker

Example: from the Sphinx final report.

It shows two errors, one in


void cmn_pr

Threading, Performance
▪ Check what percentage of the code is threaded.
– This sets the upper bound for potential performance.
– Can use the VTune™ analyzer to see how much time each thread runs.
– Check whether the total instruction count of the threaded app is equal to the instruction count of the original app.
• In many cases there is a huge overhead for threading, or just a bug (doing some work twice).
▪ Evaluate the amount of parallel work.
– Even if both threads spend the same amount of time, they may not be doing it at the same time.
– If an (already debugged) threaded app runs much slower than the serial app, look for false sharing issues:
• “No, converting each local variable to an array of 2 variables is not a good idea for threading efficiency.” From one of my meetings, trying to explain why the threaded app was 14X slower than the original app.
▪ Check the critical path.
– Intel® Thread Profiler is great for the job after you figure out how to use it and its cryptic terminology.
– Note that the Win32 API Thread Profiler is not the same tool as the OpenMP Thread Profiler.
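The false sharing case in the quote (an array of 2 variables, one per thread) puts both elements on the same cache line, so each write by one thread invalidates the other core's copy. One common remedy, sketched here under the assumption of a 64-byte cache line, is to pad each per-thread element to its own line. `PaddedCounter`, `count_up` and `run_two_counters` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>

constexpr std::size_t kCacheLine = 64;  // assumed line size; check the target CPU

// One counter per thread, each aligned and padded to a full cache line.
struct alignas(kCacheLine) PaddedCounter {
    long value = 0;
};

static_assert(sizeof(PaddedCounter) == kCacheLine, "one counter per cache line");

void count_up(PaddedCounter& c, long iterations) {
    for (long i = 0; i < iterations; ++i) ++c.value;
}

// Two threads update adjacent array elements; the padding keeps the two
// elements on separate cache lines.
long run_two_counters(long iterations) {
    assert(iterations >= 0);
    PaddedCounter counters[2];
    std::thread t0(count_up, std::ref(counters[0]), iterations);
    std::thread t1(count_up, std::ref(counters[1]), iterations);
    t0.join();
    t1.join();
    return counters[0].value + counters[1].value;
}
```

This shows the data layout only. It is not a benchmark, since an optimizing compiler may keep the counter in a register inside the loop.
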
Intel® Threading Tools
The Thread Profiler Feature
▪ Pinpoints threading performance bottlenecks in apps threaded with:
– Microsoft* Windows* threads on Microsoft* Windows* systems
– POSIX* pthreads on Linux* systems
– OpenMP* on Microsoft* Windows* and Linux* systems
▪ Plugs into VTune™ environment
– Microsoft* Windows* for IA-32 systems
– Linux* for IA-32 systems

Intel® Threading Tools
Thread Profiler Feature Analysis
▪ Monitors execution flows to find the Critical Path
– Longest execution flow is the Critical Path
▪ Analyzes the Critical Path
– System utilization
• Over-subscribed vs. under-subscribed
– Thread state transitions
• Blocked -> Running
▪ Captures threads timeline
– Visualize threading structure

Intel® Threading Tools
Thread Profiler Critical Path
▪ Start with the critical path
▪ Separate according to system utilization
▪ Add overhead
▪ Further analyze by thread state

Analysis shown for 2-way system

[Figure: Critical Path View with a time axis from 0 to 15. Timeline of Threads 1-3 over T0 to T15 with events "Acquire lock L", "Release L", "Wait for L" and "Wait for Threads 2 & 3". Legend: Idle, Serial, Under-subscribed, Parallel, Over-subscribed, Overhead; Cruise time, Blocking time, Impact time]

Intel® Thread Profiler (OpenMP)

Example: from the FAAD final report.

Intel® Thread Profiler (Win32 API)

From FAAD

From GainMPEG: So what’s wrong with this picture?

Streaming SIMD Extensions Coding & Micro-architecture
▪ Intel® Streaming SIMD Extensions
– Optimize the slow thread first in case of functional decomposition.
– In C++, use the class libraries.
– In C, use intrinsics.
– Use inline assembly if the compiler does not behave as expected.
– For integer code or code with many shuffle instructions, inline assembly might be the only solution.
• But will it be accepted back into the open source tree?
▪ Micro-architectural issues
– Use the VTune™ analyzer tuning assistant
• It's simpler than trying to learn all the ugly stuff.
• It actually works and finds big issues in some cases.

[Figure: chart with axis "Clock Ticks (ms)"]
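To illustrate the intrinsics route, here is a small sketch that adds two float arrays four elements at a time with SSE intrinsics from `<xmmintrin.h>`. The function name `add_arrays` and the `SKETCH_HAVE_SSE` macro are hypothetical. The guard falls back to plain scalar code on targets without SSE, so the same source builds everywhere.

```cpp
#include <cassert>
#include <cstddef>
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define SKETCH_HAVE_SSE 1
#else
#define SKETCH_HAVE_SSE 0
#endif

// out[i] = a[i] + b[i] for n floats.
void add_arrays(const float* a, const float* b, float* out, std::size_t n) {
    assert(a && b && out);
    std::size_t i = 0;
#if SKETCH_HAVE_SSE
    // Four single-precision adds per instruction; unaligned loads and stores
    // avoid any alignment requirement on the caller's buffers.
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));
    }
#endif
    // Scalar tail (and the whole loop when SSE is unavailable).
    for (; i < n; ++i) out[i] = a[i] + b[i];
}
```

Intrinsics keep the code in C/C++ and leave register allocation to the compiler, which is usually easier to get accepted upstream than inline assembly.
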

Micro-arch tuning: VTune™ Tuning Assistant

Phase 1 – identify main slow-down reasons

[Screenshot callouts: High branch mispredictions impact; The CPI is high; Many L2 Demand Misses]

Use precise events to focus on instructions of interest.

Example: Phase 2 – focus on problem sources

[Screenshot callouts: L2 load misses; Branch mispredictions]

Impact: Web Publications
▪ The successful projects have high impact.

From http://techreport.com/reviews/2005q2/pentium-xe-840/index.x?pg=11

LAME audio encoding
LAME MT is, as you might have guessed, a multithreaded version of the LAME MP3
encoder. LAME MT was created as a demonstration of the benefits of multithreading
specifically on a Hyper-Threaded CPU like the Pentium 4. You can even download a paper
(in Word format) describing the programming effort.
Rather than run multiple parallel threads, LAME MT runs the MP3 encoder's psycho-
acoustic analysis function on a separate thread from the rest of the encoder using simple
linear pipelining. That is, the psycho-acoustic analysis happens one frame ahead of
everything else, and its results are buffered for later use by the second thread. The author
notes, "In general, this approach is highly recommended, for it is exponentially harder to
debug a parallel application than a linear one."
We have results for two different 64-bit versions of LAME MT from different compilers,
one from Microsoft and one from Intel, doing two different types of encoding, variable bit
rate and constant bit rate. We are encoding a massive 10-minute, 6-second 101MB WAV
file here, as we have done in our previous CPU reviews.

The successful projects have big impact



The LAME example: What is the LAME Project?
▪ An educational tool used for learning about MP3 encoding. Its goal is to improve:
– Psycho-acoustics quality.
– The speed of MP3 encoding.
▪ LAME is the most popular state-of-the-art MP3 encoder/decoder used by today’s leading products.
▪ Project goals:
– Speeding up the encoding of an audio stream.
– Turning LAME into a Multi-Threaded (MT) engine.
– Be 1:1 bit compatible with the original version.
– Optimize specifically for SMT platforms.
– 64 bit port and CMP related optimizations.

FOR MORE INFO... http://lame.sourceforge.net

MP3 Encoding Overview

Break up the audio stream into frames (uniform chunks, typically ~1K)

[Figure: Audio Stream split into Frame 1, Frame 2, Frame 3, Frame 4]

[Figure: encoder block diagram with the stages Read Frame, Psycho-Acoustic / Perceptual Model, Analysis Filterbank, MDCT, Quantization, Huffman Encoding, Bitstream Encode; annotated "Specifically in LAME"]

LAME MT – Intuitive approach

The intuitive approach: this is actually Data Decomposition

[Figure: Frames 1-6 distributed between Thread 1 and Thread 2]

An unbreakable dependence due to Huffman Encoding

LAME MT – Functional Decomposition

[Figure: Frames 1-6 flow through a two-thread pipeline. T1 is labeled Floating Point Intensive and T2 Integer Intensive; the stages are Read Frame, Psycho-Acoustic, Analysis Filterbank, MDCT, Quantization, Huffman Encoding]
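The functional decomposition can be sketched as a two-stage pipeline: one thread runs the front-end stage ahead and buffers its results, while the other consumes them in frame order. This is a toy model, not LAME code. `front_end` and `back_end` are hypothetical stand-ins, `back_end` carries a `history` value to mimic the dependence between frames, and the buffer here is unbounded where a real encoder would limit how far ahead the first stage may run.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Hypothetical stand-ins for the two halves of the per-frame work.
int front_end(int frame) { return frame * 3 + 1; }
int back_end(int analysis, int history) { return analysis + history; }

// Serial reference: both stages on one thread.
std::vector<int> encode_serial(const std::vector<int>& frames) {
    std::vector<int> out;
    int history = 0;
    for (int f : frames) {
        history = back_end(front_end(f), history);
        out.push_back(history);
    }
    return out;
}

// Two-thread pipeline: a producer thread runs front_end ahead, the calling
// thread runs back_end strictly in frame order, so the output is unchanged.
std::vector<int> encode_pipelined(const std::vector<int>& frames) {
    std::queue<int> buffer;
    std::mutex m;
    std::condition_variable ready;
    bool done = false;

    std::thread producer([&] {
        for (int f : frames) {
            int analysis = front_end(f);
            { std::lock_guard<std::mutex> lock(m); buffer.push(analysis); }
            ready.notify_one();
        }
        { std::lock_guard<std::mutex> lock(m); done = true; }
        ready.notify_one();
    });

    std::vector<int> out;
    int history = 0;
    for (;;) {
        std::unique_lock<std::mutex> lock(m);
        ready.wait(lock, [&] { return !buffer.empty() || done; });
        if (buffer.empty()) break;          // producer finished and buffer drained
        int analysis = buffer.front();
        buffer.pop();
        lock.unlock();                      // do the back-end work outside the lock
        history = back_end(analysis, history);
        out.push_back(history);
    }
    producer.join();
    assert(out.size() == frames.size());
    return out;
}
```

Because the order-dependent stage stays on a single thread, the pipelined output matches the serial output exactly, which is the property the 1:1 bit-compatibility goal asks for.
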

Results


Results due to Multi-Threading

                              SMT Platform    SMP Platform
                              CBR / VBR       CBR / VBR
Using Microsoft’s Compiler*   22% / 32%       38% / 62%
Using Intel® Compiler 8.1     20% / 29%       44% / 59%

Overall Performance Results

                                            HT Platform    CMP Platform
                                            CBR / VBR      CBR / VBR
LAME MT code + Using Intel® Compiler 8.1    52% / 70%      78% / 109%

The LAME example: high quality threading job.

Some Observations
▪ What can be accepted:
– Threading. There is always something to thread, but not always with significant gain.
– Streaming SIMD Extensions opportunities.
– 64 bit porting.
• A huge opportunity. Can be used if the student can’t find other options.
• Porting the assembly code will definitely show benefit. It is a big task waiting to be done.
▪ Things that didn't go as expected:
– Finding good and influential candidates. It becomes more difficult every semester.
– One semester is too short for many apps.
– Returning code to the moderators:
• Only some parts of some projects were accepted by the open source moderator.
• None of the projects were fully accepted.

Backup
