
Multithreading and Parallel Microprocessors

Stephen Jenks
Electrical Engineering and Computer Science
sjenks@uci.edu

[Die photos: Intel Core Duo and AMD Athlon 64 X2]
Mostly Worked on Clusters



Also Build Really Big Displays

HIPerWall:
• 200 million pixels
• 50 displays
• 30 Power Mac G5s


Outline
• Parallelism in Microprocessors
• Multicore Processor Parallelism
• Parallel Programming for Shared Memory
  - OpenMP
  - POSIX Threads
  - Java Threads
• Parallel Microprocessor Bottlenecks
• Parallel Execution Models to Address Bottlenecks
  - Memory interface
  - Cache-to-cache (coherence) interface
• Current and Future CMP Technology
Parallelism in Microprocessors
• Pipelining is most prevalent
  - Developed in the 1960s
  - Used in everything, even microcontrollers
  - Decreases cycle time
  - Allows up to 1 instruction per cycle (IPC)
  - No programming changes
  - Some Pentium 4s have more than 30 stages!

[Pipeline diagram: Fetch → Decode → Register Access → ALU → Write Back, with buffers between stages]



More Microprocessor Parallelism
• Superscalar allows Instruction Level Parallelism (ILP)
  - Replaces the single ALU with multiple functional units
  - Dispatches several instructions at once
• Out of Order Execution
  - Execute based on data availability
  - Requires a reorder buffer
• More than 1 IPC
• No program changes

[Diagram: the single ALU becomes multiple functional units (FP, INT, INT, Load/Store)]
Thread-Level Parallelism
• Simultaneous Multithreading (SMT)
  - Execute instructions from several threads at the same time
  - Intel Hyperthreading; IBM Power 5/6, Cell
• Chip Multiprocessors (CMP)
  - More than 1 CPU per chip
  - AMD Athlon 64 X2, Intel Core Duo; IBM Power 4/5/6, Xenon, Cell

[Diagrams: an SMT core with Int, FP, and Load/Store units shared by Thread 1 and Thread 2; a CMP with CPU1 and CPU2 sharing an L2 cache and system/memory interface]



Chip Multiprocessors
• Several CPU cores
  - Independent execution
  - Symmetric (for now)
• Shared memory hierarchy
  - Private L1 caches
  - Shared L2 cache (Intel Core) or private L2 caches (AMD, kept coherent via crossbar)
  - Shared memory interface
  - Shared system interface
• Lower clock speed
• Shared resources can help or hurt!

[Die photos: Intel Core Duo and AMD Athlon 64 X2; images from Intel and AMD]
Quad Cores Today

[Diagrams of three quad-core organizations: Core 2 Xeon (Mac Pro), with pairs of CPUs sharing L2 caches and a frontside bus to the memory controller; Dual-Core Opteron, with private L2 caches, an on-chip memory controller, and a HyperTransport link; and Core 2 Quad/Extreme, with two dual-core dies, each pair sharing an L2 cache]



Shared Memory Parallel Programming
• Could just run multiple programs at once
  - Multiprogramming
  - Good idea, but long tasks still take long
• Need to partition work among processors
  - Implicitly (get the compiler to do it)
    · Intel C/C++/Fortran compilers do pretty well
    · OpenMP code annotations help
    · Not reasonable for complex code
  - Explicitly (thread programming)
• Primary needs
  - Scientific computing
  - Media encoding and editing
  - Games



Multithreading
• Definitions
  - Process: a program in execution
    · CPU state (registers, PC)
    · Resources
    · Address space
  - Thread: a lightweight process
    · CPU state
    · Stack
    · Shares resources and address space with other threads in the same process
• Thread operations
  - Create / spawn
  - Join / destroy
  - Suspend & resume
• Uses
  - Solve a problem together (divide & conquer)
  - Do different things
    · Manage game economy, NPC actions
    · Manage screen drawing, sound, input handling



OpenMP Programming Model
• Implicit parallelism with source code annotations:

    #pragma omp parallel for private(i,k)
    for (i = 0; i < nx; i++)
        for (k = 0; k < nz; k++) {
            ez[i][0][k] = 0.0; ez[i][1][k] = 0.0; …

• Compiler reads the pragma and parallelizes the loop
  - Partitions work among threads (1 per CPU)
  - Vars i and k are private to each thread
  - Other vars (the ez array, for example) are shared across all threads
• Can force parallelization of “unsafe” loops
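Below is a minimal, compilable sketch of the same pattern; the array name ez and the sizes are illustrative stand-ins, not the FDTD code behind the slide.

    /* Minimal OpenMP sketch (illustrative names and sizes).
     * Build with e.g. gcc -fopenmp sketch.c */
    #include <stdio.h>
    #define NX 64
    #define NZ 64

    static double ez[NX][NZ];

    int main(void)
    {
        int i, k;

        /* i and k are private to each thread; ez is shared by all */
        #pragma omp parallel for private(i, k)
        for (i = 0; i < NX; i++)
            for (k = 0; k < NZ; k++)
                ez[i][k] = 0.0;

        printf("ez[0][0] = %g\n", ez[0][0]);
        return 0;
    }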
Thread Pitfalls
• Shared data
  - Two threads perform A = A + 1:
      Thread 1:            Thread 2:
      1) Load A into R1    1) Load A into R1
      2) Add 1 to R1       2) Add 1 to R1
      3) Store R1 to A     3) Store R1 to A
  - Mutual exclusion preserves correctness (see the sketch after this list)
    · Locks/mutexes
    · Semaphores
    · Monitors
    · Java “synchronized”
• False sharing
  - Non-shared data packed into the same cache line:
      int thread1data;
      int thread2data;
  - The cache line ping-pongs between CPUs when the threads access their data
• Locks for heap access
  - malloc() is expensive because of mutual exclusion
  - Use private heaps
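A sketch putting both fixes together (illustrative, not from the slides): a mutex protects the shared increment, and padding keeps each thread's private counter in its own cache line.

    /* Illustrative sketch: a mutex makes A = A + 1 safe, and padding
     * avoids false sharing. Assumes 64-byte cache lines. */
    #include <pthread.h>
    #include <stdio.h>

    #define CACHE_LINE 64
    #define ITERATIONS 1000000

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static long A = 0;                        /* shared data */

    struct padded {                           /* one per thread */
        long value;
        char pad[CACHE_LINE - sizeof(long)];  /* fills out the line */
    };
    static struct padded counters[2];

    static void *worker(void *param)
    {
        struct padded *mine = param;
        for (int i = 0; i < ITERATIONS; i++) {
            mine->value++;                    /* private: no lock needed */
            pthread_mutex_lock(&lock);        /* shared: must be protected */
            A = A + 1;
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t tids[2];
        for (int i = 0; i < 2; i++)
            pthread_create(&tids[i], NULL, worker, &counters[i]);
        for (int i = 0; i < 2; i++)
            pthread_join(tids[i], NULL);
        printf("A = %ld\n", A);               /* always 2000000 with the lock */
        return 0;
    }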



POSIX Threads
• IEEE 1003.4 (Portable Operating System Interface) committee
• Lightweight “threads of control”/processes operating within a single address space
• A typical “process” contains a single thread in its address space
• Threads run concurrently and allow
  - Overlapping I/O and computation
  - Efficient use of multiprocessors
• Also called pthreads



Concept of Operation
1. When the program starts, the main thread is running
2. The main thread spawns child threads as needed
3. Main thread and child threads run concurrently
4. Child threads finish and join with the main thread
5. The main thread terminates when the process ends



Approximate Pi with pthreads

/* the thread control function; iterations, threadCount, and
   resultArray are globals defined elsewhere */
void* PiRunner(void* param)
{
    int threadNum = (int)(size_t) param;  /* thread index passed by value */
    int i;
    double h, sum, mypi, x;

    printf("Thread %d starting.\n", threadNum);

    h = 1.0 / (double) iterations;
    sum = 0.0;
    for (i = threadNum + 1; i <= iterations; i += threadCount) {
        x = h * ((double)i - 0.5);
        sum += 4.0 / (1.0 + x*x);
    }
    mypi = h * sum;

    /* now store the result into the result array */
    resultArray[threadNum] = mypi;

    printf("Thread %d exiting.\n", threadNum);

    pthread_exit(0);
}
More Pi with pthreads: main()

/* get the default attributes and set up for creation */
for (i = 0; i < threadCount; i++) {
    pthread_attr_init(&attrs[i]);
    /* system-wide contention */
    pthread_attr_setscope(&attrs[i], PTHREAD_SCOPE_SYSTEM);
}

/* create the threads, passing each its index through the void* param */
for (i = 0; i < threadCount; i++) {
    pthread_create(&tids[i], &attrs[i], PiRunner, (void*)(size_t)i);
}

/* now wait for the threads to exit */
for (i = 0; i < threadCount; i++)
    pthread_join(tids[i], NULL);

pi = 0.0;
for (i = 0; i < threadCount; i++)
    pi += resultArray[i];
Java Threads
• Threading and synchronization built in
• An object can have an associated thread
  - Subclass Thread or implement Runnable
  - The “run” method is the thread body
  - “synchronized” methods provide mutual exclusion
• Main program
  - Calls the “start” method of Thread objects to spawn
  - Calls “join” to wait for completion



Parallel Microprocessor Problems

[Diagram: “Then” (one CPU and memory) vs. “Now” (CPU1 and CPU2 sharing an L2 cache and system/memory interface)]

• Memory interface too slow for 1 core/thread
• Now multiple threads access memory simultaneously, overwhelming the memory interface
• Parallel programs can run as slowly as sequential ones!



Our Solution: Producer/Consumer Parallelism Using The Cache

[Diagram: conventional approach, where Thread 1 and Thread 2 each do half the work on data in memory and hit a memory bottleneck, vs. the producer/consumer approach, where producer and consumer threads communicate through the cache]
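The following is a minimal sketch of this idea, not the actual SPPM runtime: the producer fills a block and publishes a per-block flag; the consumer waits on the flag and processes the block while it is likely still cache-resident. The block/flag scheme and the C11 atomics are illustrative assumptions.

    /* Producer/consumer through the cache, assuming C11 atomics.
     * Build with e.g. gcc -std=c11 -pthread sketch.c */
    #include <stdatomic.h>
    #include <pthread.h>
    #include <stdio.h>

    #define BLOCKS 64
    #define BLOCK_SIZE 1024

    static double data[BLOCKS][BLOCK_SIZE];
    static atomic_int ready[BLOCKS];         /* 1 = block produced */
    static double results[BLOCKS];

    static void *producer(void *param)
    {
        (void) param;
        for (int b = 0; b < BLOCKS; b++) {
            for (int i = 0; i < BLOCK_SIZE; i++)
                data[b][i] = b + i * 0.5;    /* half the work */
            atomic_store_explicit(&ready[b], 1, memory_order_release);
        }
        return NULL;
    }

    static void *consumer(void *param)
    {
        (void) param;
        for (int b = 0; b < BLOCKS; b++) {
            while (!atomic_load_explicit(&ready[b], memory_order_acquire))
                ;                            /* wait for the producer */
            double sum = 0.0;
            for (int i = 0; i < BLOCK_SIZE; i++)
                sum += data[b][i];           /* the other half, from cache */
            results[b] = sum;
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t p, c;
        pthread_create(&p, NULL, producer, NULL);
        pthread_create(&c, NULL, consumer, NULL);
        pthread_join(p, NULL);
        pthread_join(c, NULL);
        printf("results[0] = %f\n", results[0]);
        return 0;
    }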



Converting to Producer/Consumer

for (i = 1; i < nx - 1; i++) {
    for (j = 1; j < ny - 1; j++) {
        /* Update Magnetic Field */
        for (k = 1; k < nz - 1; k++) {
            double invmu = 1.0/mu[i][j][k];
            double tmpx = rx*invmu; double tmpy = ry*invmu; double tmpz = rz*invmu;
            hx[i][j][k] += tmpz * (ey[i][j][k+1] - ey[i][j][k])
                         - tmpy * (ez[i][j+1][k] - ez[i][j][k]);
            hy[i][j][k] += tmpx * (ez[i+1][j][k] - ez[i][j][k])
                         - tmpz * (ex[i][j][k+1] - ex[i][j][k]);
            hz[i][j][k] += tmpy * (ex[i][j+1][k] - ex[i][j][k])
                         - tmpx * (ey[i+1][j][k] - ey[i][j][k]);
        }
        /* Update Electric Field */
        for (k = 1; k < nz - 1; k++) {
            double invep = 1.0/ep[i][j][k];
            double tmpx = rx*invep; double tmpy = ry*invep; double tmpz = rz*invep;
            ex[i][j][k] += tmpy * (hz[i][j][k] - hz[i][j-1][k])
                         - tmpz * (hy[i][j][k] - hy[i][j][k-1]);
            ey[i][j][k] += tmpz * (hx[i][j][k] - hx[i][j][k-1])
                         - tmpx * (hz[i][j][k] - hz[i-1][j][k]);
            ez[i][j][k] += tmpx * (hy[i][j][k] - hy[i-1][j][k])
                         - tmpy * (hx[i][j][k] - hx[i][j-1][k]);
        }
    }
}
Synchronized Pipelined Parallelism Model (SPPM)

[Diagrams: conventional spatial decomposition vs. producer/consumer (SPPM)]



SPPM Features
• Benefits
  - Memory bandwidth same as the sequential version
  - Performance improvement (usually)
  - Easy in concept
• Drawbacks
  - Complex programming
  - Some synchronization overhead
  - Not always faster than SDM (or sequential)



SPPM Performance (Normalized)

[Charts: normalized SPPM performance for FDTD and a Red-Black equation solver]



So What’s Up With AMD CPUs?
• How can SPPM be slower than Seq?
  - Fetching from the other core’s cache is slower than fetching from memory!
  - That makes the consumer slower than the producer!



CMP Cache Coherence Stinks!



Private Cache Solution: Polymorphic Threads
• Cache-to-cache transfers are too slow
  - Therefore we can’t move much data between cores
• Polymorphic threads
  - Each thread morphs between producer and consumer for each block
  - Only synchronization data passes between caches
  - But a more complex program
  - Good on private caches!
  - Not faster on shared caches



Ongoing Research
• C++ runtime to make SPPM & Polymorphic Threads programming easier
• Exploration of the problem space
  - Media encoding
  - Data-stream handling (gzip)
  - Fine-grain concurrency in applications (protocol processing, I/O, etc.)
• Hardware architecture improvements
  - Better communications between cores



Future CMP Technology
• 8 cores soon
• Room for improvement
  - Multi-way caches are expensive
  - Coherence protocols perform poorly
• Stream programming
  - GPU or multi-core
  - See GPGPU.org for details

[Diagram: possible hybrid AMD multi-core design, with an Athlon 64 CPU and ATI GPU joined by a crossbar (XBAR) to a HyperTransport link and memory controller]



What About The Cell Processor?

[Diagram from the IBM Cell Broadband Engine Programmer Handbook, 10 May 2006]

• PowerPC Processing Element with SMT
• 8 Synergistic Processing Elements
  - Optimized for SIMD
  - 256KB local storage, no cache
• 4 x 16-byte-wide rings @ 96 bytes per clock cycle



CBE PowerPC Performance

FDTD 80x80x80x1000, runtime in seconds:

              Seq      SDM      SPPM     PTM
  Cell        160.27   122.63   114.84   110.67
  Core2Duo     41.47    33.83    24.98    26.09



Summary
• Parallel microprocessors provide tremendous peak performance
  - Need proper programming to achieve it
• Thread programming is not hard, but requires more care
• Architecture & implementation bottlenecks require additional work for good performance
  - Performance is architecture-dependent
• Non-uniform cache interconnects will become more common
Additional Slides



Spawning Threads
• Initialize attributes (pthread_attr_init)
  - Default attributes OK
• Put the thread in system-wide scheduling contention
  - pthread_attr_setscope(&attrs, PTHREAD_SCOPE_SYSTEM);
• Spawn the thread (pthread_create)
  - Creates a thread identifier
  - Needs an attribute structure for the thread
  - Needs a function where the thread starts
  - One pointer-sized parameter can be passed (void *)
Thread Spawning Issues
• How does a thread know which thread it is? Does it matter?
  - Yes, it matters if threads are to work together
  - Could pass some identifier in through the parameter (see the sketch after this list)
  - Could contend for a shared counter in a critical section
  - pthread_self() returns the thread ID, but that doesn’t help
• How big is a thread’s stack?
  - By default, not very big. (What are the ramifications?)
  - pthread_attr_setstacksize() changes the stack size
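A minimal sketch of both answers (names are illustrative): the identifier travels through the void* parameter, and pthread_attr_setstacksize() requests a larger stack before spawning.

    /* Illustrative sketch: thread IDs via the parameter, bigger stacks
     * via the attributes. */
    #include <pthread.h>
    #include <stdio.h>

    #define NUM_THREADS 4

    static void *worker(void *param)
    {
        int myId = (int)(size_t) param;   /* recover the identifier */
        printf("I am thread %d\n", myId);
        return NULL;
    }

    int main(void)
    {
        pthread_t tids[NUM_THREADS];
        pthread_attr_t attr;

        pthread_attr_init(&attr);
        pthread_attr_setstacksize(&attr, 4 * 1024 * 1024);  /* 4 MB stack */

        for (int i = 0; i < NUM_THREADS; i++)
            pthread_create(&tids[i], &attr, worker, (void *)(size_t)i);
        for (int i = 0; i < NUM_THREADS; i++)
            pthread_join(tids[i], NULL);
        return 0;
    }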



Join Issues
• Main thread must join with child threads (pthread_join)
  - Why? So it knows when they are done
• pthread_join can pass back a pointer-sized value
  - Can be used as a pointer to pass back a result
  - What kind of variable can be passed back that way? Local? Static? Global? Heap? (See the sketch below.)
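A minimal sketch of the answer (illustrative names): the value must outlive the thread, so heap (or static/global) storage works; a local on the thread’s stack does not.

    /* Illustrative sketch: return a heap-allocated result through
     * pthread_exit/pthread_join. A local variable would vanish with
     * the thread's stack; heap, static, or global storage survives. */
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>

    static void *compute(void *param)
    {
        (void) param;
        double *result = malloc(sizeof *result);  /* heap: safe to pass back */
        *result = 3.14159;
        pthread_exit(result);                     /* delivered to the joiner */
    }

    int main(void)
    {
        pthread_t tid;
        void *retval;

        pthread_create(&tid, NULL, compute, NULL);
        pthread_join(tid, &retval);               /* receives the pointer */
        printf("result = %f\n", *(double *)retval);
        free(retval);
        return 0;
    }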

