
Multithreading and Parallel Microprocessors

Stephen Jenks
Electrical Engineering and Computer Science
sjenks@uci.edu

[Die photos: Intel Core Duo and AMD Athlon 64 X2]
Mostly Worked on Clusters



Also Build Really Big Displays

HIPerWall:
• 200 million pixels
• 50 displays
• 30 Power Mac G5s


Outline
• Parallelism in Microprocessors
• Multicore Processor Parallelism
• Parallel Programming for Shared Memory
  - OpenMP
  - POSIX Threads
  - Java Threads
• Parallel Microprocessor Bottlenecks
• Parallel Execution Models to Address Bottlenecks
  - Memory interface
  - Cache-to-cache (coherence) interface
• Current and Future CMP Technology
Parallelism in Microprocessors
• Pipelining is most prevalent
  - Developed in the 1960s
  - Used in everything, even microcontrollers
  - Decreases cycle time
  - Allows up to 1 instruction per cycle (IPC)
  - No programming changes
  - Some Pentium 4s have more than 30 stages!

[Pipeline diagram: Fetch → Decode → Register Access → ALU → Write Back, with buffers between stages]



More Microprocessor Parallelism
• Superscalar allows Instruction Level Parallelism (ILP)
  - Replaces the single ALU with multiple functional units
  - Dispatches several instructions at once
• Out of Order Execution
  - Execute based on data availability
  - Requires a reorder buffer
• More than 1 IPC
• No program changes

[Diagram: the single ALU becomes multiple functional units (FP, INT, INT, Load/Store)]
Thread-Level Parallelism
• Simultaneous Multithreading (SMT)
  - Execute instructions from several threads at the same time
  - Intel Hyperthreading; IBM Power 5/6, Cell
• Chip Multiprocessors (CMP)
  - More than 1 CPU per chip
  - AMD Athlon 64 X2, Intel Core Duo; IBM Power 4/5/6, Xenon, Cell

[Diagrams: an SMT core with Int, FP, and Load/Store units shared by Thread 1 and Thread 2; a CMP with CPU1 and CPU2 sharing an L2 cache and system/memory interface]



Chip Multiprocessors
• Several CPU cores
  - Independent execution
  - Symmetric (for now)
• Shared memory hierarchy
  - Private L1 caches
  - Shared L2 cache (Intel Core) or private L2 caches (AMD, kept coherent via crossbar)
  - Shared memory interface
  - Shared system interface
• Lower clock speed
• Shared resources can help or hurt!

[Die photos: Intel Core Duo and AMD Athlon 64 X2; images from Intel and AMD]
Quad Cores Today

[Diagrams of three quad-core organizations: Core 2 Xeon (Mac Pro), with pairs of CPUs sharing L2 caches and a frontside bus to the memory controller; Dual-Core Opteron, with private L2 caches, an on-chip memory controller, and a HyperTransport link; and Core 2 Quad/Extreme, with two dual-core dies, each pair sharing an L2 cache]



Shared Memory Parallel Programming
• Could just run multiple programs at once
  - Multiprogramming
  - Good idea, but long tasks still take long
• Need to partition work among processors
  - Implicitly (get the compiler to do it)
    · Intel C/C++/Fortran compilers do pretty well
    · OpenMP code annotations help
    · Not reasonable for complex code
  - Explicitly (thread programming)
• Primary needs
  - Scientific computing
  - Media encoding and editing
  - Games



Multithreading
• Definitions
  - Process: a program in execution
    · CPU state (registers, PC)
    · Resources
    · Address space
  - Thread: a lightweight process
    · CPU state
    · Stack
    · Shares resources and address space with other threads in the same process
• Thread operations
  - Create / spawn
  - Join / destroy
  - Suspend & resume
• Uses
  - Solve a problem together (divide & conquer)
  - Do different things
    · Manage game economy, NPC actions
    · Manage screen drawing, sound, input handling



OpenMP Programming Model
• Implicit parallelism with source code annotations:

    #pragma omp parallel for private(i,k)
    for (i = 0; i < nx; i++)
        for (k = 0; k < nz; k++) {
            ez[i][0][k] = 0.0; ez[i][1][k] = 0.0; …

• Compiler reads the pragma and parallelizes the loop
  - Partitions work among threads (1 per CPU)
  - Vars i and k are private to each thread
  - Other vars (the ez array, for example) are shared across all threads
• Can force parallelization of “unsafe” loops
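Below is a minimal, compilable sketch of the same pattern; the array name ez and the sizes are illustrative stand-ins, not the FDTD code behind the slide.

    /* Minimal OpenMP sketch (illustrative names and sizes).
     * Build with e.g. gcc -fopenmp sketch.c */
    #include <stdio.h>
    #define NX 64
    #define NZ 64

    static double ez[NX][NZ];

    int main(void)
    {
        int i, k;

        /* i and k are private to each thread; ez is shared by all */
        #pragma omp parallel for private(i, k)
        for (i = 0; i < NX; i++)
            for (k = 0; k < NZ; k++)
                ez[i][k] = 0.0;

        printf("ez[0][0] = %g\n", ez[0][0]);
        return 0;
    }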
Thread Pitfalls
• Shared data
  - Two threads perform A = A + 1:
      Thread 1:            Thread 2:
      1) Load A into R1    1) Load A into R1
      2) Add 1 to R1       2) Add 1 to R1
      3) Store R1 to A     3) Store R1 to A
  - Mutual exclusion preserves correctness (see the sketch after this list)
    · Locks/mutexes
    · Semaphores
    · Monitors
    · Java “synchronized”
• False sharing
  - Non-shared data packed into the same cache line:
      int thread1data;
      int thread2data;
  - The cache line ping-pongs between CPUs when the threads access their data
• Locks for heap access
  - malloc() is expensive because of mutual exclusion
  - Use private heaps
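A sketch putting both fixes together (illustrative, not from the slides): a mutex protects the shared increment, and padding keeps each thread's private counter in its own cache line.

    /* Illustrative sketch: a mutex makes A = A + 1 safe, and padding
     * avoids false sharing. Assumes 64-byte cache lines. */
    #include <pthread.h>
    #include <stdio.h>

    #define CACHE_LINE 64
    #define ITERATIONS 1000000

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static long A = 0;                        /* shared data */

    struct padded {                           /* one per thread */
        long value;
        char pad[CACHE_LINE - sizeof(long)];  /* fills out the line */
    };
    static struct padded counters[2];

    static void *worker(void *param)
    {
        struct padded *mine = param;
        for (int i = 0; i < ITERATIONS; i++) {
            mine->value++;                    /* private: no lock needed */
            pthread_mutex_lock(&lock);        /* shared: must be protected */
            A = A + 1;
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t tids[2];
        for (int i = 0; i < 2; i++)
            pthread_create(&tids[i], NULL, worker, &counters[i]);
        for (int i = 0; i < 2; i++)
            pthread_join(tids[i], NULL);
        printf("A = %ld\n", A);               /* always 2000000 with the lock */
        return 0;
    }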



POSIX Threads
• IEEE 1003.4 (Portable Operating System Interface) committee
• Lightweight “threads of control”/processes operating within a single address space
• A typical “process” contains a single thread in its address space
• Threads run concurrently and allow
  - Overlapping I/O and computation
  - Efficient use of multiprocessors
• Also called pthreads



Concept of Operation
1. When the program starts, the main thread is running
2. The main thread spawns child threads as needed
3. Main thread and child threads run concurrently
4. Child threads finish and join with the main thread
5. The main thread terminates when the process ends



Approximate Pi with pthreads

/* the thread control function; iterations, threadCount, and
   resultArray are globals defined elsewhere */
void* PiRunner(void* param)
{
    int threadNum = (int)(size_t) param;  /* thread index passed by value */
    int i;
    double h, sum, mypi, x;

    printf("Thread %d starting.\n", threadNum);

    h = 1.0 / (double) iterations;
    sum = 0.0;
    for (i = threadNum + 1; i <= iterations; i += threadCount) {
        x = h * ((double)i - 0.5);
        sum += 4.0 / (1.0 + x*x);
    }
    mypi = h * sum;

    /* now store the result into the result array */
    resultArray[threadNum] = mypi;

    printf("Thread %d exiting.\n", threadNum);

    pthread_exit(0);
}
More Pi with pthreads: main()

/* get the default attributes and set up for creation */
for (i = 0; i < threadCount; i++) {
    pthread_attr_init(&attrs[i]);
    /* system-wide contention */
    pthread_attr_setscope(&attrs[i], PTHREAD_SCOPE_SYSTEM);
}

/* create the threads, passing each its index through the void* param */
for (i = 0; i < threadCount; i++) {
    pthread_create(&tids[i], &attrs[i], PiRunner, (void*)(size_t)i);
}

/* now wait for the threads to exit */
for (i = 0; i < threadCount; i++)
    pthread_join(tids[i], NULL);

pi = 0.0;
for (i = 0; i < threadCount; i++)
    pi += resultArray[i];
Java Threads
• Threading and synchronization built in
• An object can have an associated thread
  - Subclass Thread or implement Runnable
  - The “run” method is the thread body
  - “synchronized” methods provide mutual exclusion
• Main program
  - Calls the “start” method of Thread objects to spawn
  - Calls “join” to wait for completion



Parallel Microprocessor Problems

[Diagram: “Then” (one CPU and memory) vs. “Now” (CPU1 and CPU2 sharing an L2 cache and system/memory interface)]

• Memory interface too slow for 1 core/thread
• Now multiple threads access memory simultaneously, overwhelming the memory interface
• Parallel programs can run as slowly as sequential ones!



Our Solution: Producer/Consumer Parallelism Using The Cache

[Diagram: conventional approach, where Thread 1 and Thread 2 each do half the work on data in memory and hit a memory bottleneck, vs. the producer/consumer approach, where producer and consumer threads communicate through the cache]
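The following is a minimal sketch of this idea, not the actual SPPM runtime: the producer fills a block and publishes a per-block flag; the consumer waits on the flag and processes the block while it is likely still cache-resident. The block/flag scheme and the C11 atomics are illustrative assumptions.

    /* Producer/consumer through the cache, assuming C11 atomics.
     * Build with e.g. gcc -std=c11 -pthread sketch.c */
    #include <stdatomic.h>
    #include <pthread.h>
    #include <stdio.h>

    #define BLOCKS 64
    #define BLOCK_SIZE 1024

    static double data[BLOCKS][BLOCK_SIZE];
    static atomic_int ready[BLOCKS];         /* 1 = block produced */
    static double results[BLOCKS];

    static void *producer(void *param)
    {
        (void) param;
        for (int b = 0; b < BLOCKS; b++) {
            for (int i = 0; i < BLOCK_SIZE; i++)
                data[b][i] = b + i * 0.5;    /* half the work */
            atomic_store_explicit(&ready[b], 1, memory_order_release);
        }
        return NULL;
    }

    static void *consumer(void *param)
    {
        (void) param;
        for (int b = 0; b < BLOCKS; b++) {
            while (!atomic_load_explicit(&ready[b], memory_order_acquire))
                ;                            /* wait for the producer */
            double sum = 0.0;
            for (int i = 0; i < BLOCK_SIZE; i++)
                sum += data[b][i];           /* the other half, from cache */
            results[b] = sum;
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t p, c;
        pthread_create(&p, NULL, producer, NULL);
        pthread_create(&c, NULL, consumer, NULL);
        pthread_join(p, NULL);
        pthread_join(c, NULL);
        printf("results[0] = %f\n", results[0]);
        return 0;
    }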



Converting to Producer/Consumer

for (i = 1; i < nx - 1; i++) {
    for (j = 1; j < ny - 1; j++) {
        /* Update Magnetic Field */
        for (k = 1; k < nz - 1; k++) {
            double invmu = 1.0/mu[i][j][k];
            double tmpx = rx*invmu; double tmpy = ry*invmu; double tmpz = rz*invmu;
            hx[i][j][k] += tmpz * (ey[i][j][k+1] - ey[i][j][k])
                         - tmpy * (ez[i][j+1][k] - ez[i][j][k]);
            hy[i][j][k] += tmpx * (ez[i+1][j][k] - ez[i][j][k])
                         - tmpz * (ex[i][j][k+1] - ex[i][j][k]);
            hz[i][j][k] += tmpy * (ex[i][j+1][k] - ex[i][j][k])
                         - tmpx * (ey[i+1][j][k] - ey[i][j][k]);
        }
        /* Update Electric Field */
        for (k = 1; k < nz - 1; k++) {
            double invep = 1.0/ep[i][j][k];
            double tmpx = rx*invep; double tmpy = ry*invep; double tmpz = rz*invep;
            ex[i][j][k] += tmpy * (hz[i][j][k] - hz[i][j-1][k])
                         - tmpz * (hy[i][j][k] - hy[i][j][k-1]);
            ey[i][j][k] += tmpz * (hx[i][j][k] - hx[i][j][k-1])
                         - tmpx * (hz[i][j][k] - hz[i-1][j][k]);
            ez[i][j][k] += tmpx * (hy[i][j][k] - hy[i-1][j][k])
                         - tmpy * (hx[i][j][k] - hx[i][j-1][k]);
        }
    }
}
Synchronized Pipelined Parallelism Model (SPPM)

[Diagrams: conventional spatial decomposition vs. producer/consumer (SPPM)]



SPPM Features
• Benefits
  - Memory bandwidth same as the sequential version
  - Performance improvement (usually)
  - Easy in concept
• Drawbacks
  - Complex programming
  - Some synchronization overhead
  - Not always faster than SDM (or sequential)



SPPM Performance (Normalized)

[Charts: normalized SPPM performance for FDTD and a Red-Black equation solver]



So What’s Up With AMD CPUs?
• How can SPPM be slower than Seq?
  - Fetching from the other core’s cache is slower than fetching from memory!
  - That makes the consumer slower than the producer!



CMP Cache Coherence Stinks!



Private Cache Solution: Polymorphic Threads
• Cache-to-cache transfers are too slow
  - Therefore we can’t move much data between cores
• Polymorphic threads
  - Each thread morphs between producer and consumer for each block
  - Only synchronization data passes between caches
  - But a more complex program
  - Good on private caches!
  - Not faster on shared caches



Ongoing Research
• C++ runtime to make SPPM & Polymorphic Threads programming easier
• Exploration of the problem space
  - Media encoding
  - Data-stream handling (gzip)
  - Fine-grain concurrency in applications (protocol processing, I/O, etc.)
• Hardware architecture improvements
  - Better communications between cores



Future CMP Technology
• 8 cores soon
• Room for improvement
  - Multi-way caches are expensive
  - Coherence protocols perform poorly
• Stream programming
  - GPU or multi-core
  - See GPGPU.org for details

[Diagram: possible hybrid AMD multi-core design, with an Athlon 64 CPU and ATI GPU joined by a crossbar (XBAR) to a HyperTransport link and memory controller]



What About The Cell Processor?

[Diagram from the IBM Cell Broadband Engine Programmer Handbook, 10 May 2006]

• PowerPC Processing Element with SMT
• 8 Synergistic Processing Elements
  - Optimized for SIMD
  - 256KB local storage, no cache
• 4 x 16-byte-wide rings @ 96 bytes per clock cycle



CBE PowerPC Performance

FDTD 80x80x80x1000, runtime in seconds:

              Seq      SDM      SPPM     PTM
  Cell        160.27   122.63   114.84   110.67
  Core2Duo     41.47    33.83    24.98    26.09



Summary
• Parallel microprocessors provide tremendous peak performance
  - Need proper programming to achieve it
• Thread programming is not hard, but requires more care
• Architecture & implementation bottlenecks require additional work for good performance
  - Performance is architecture-dependent
• Non-uniform cache interconnects will become more common
Additional Slides



Spawning Threads
• Initialize attributes (pthread_attr_init)
  - Default attributes OK
• Put the thread in system-wide scheduling contention
  - pthread_attr_setscope(&attrs, PTHREAD_SCOPE_SYSTEM);
• Spawn the thread (pthread_create)
  - Creates a thread identifier
  - Needs an attribute structure for the thread
  - Needs a function where the thread starts
  - One pointer-sized parameter can be passed (void *)
Thread Spawning Issues
• How does a thread know which thread it is? Does it matter?
  - Yes, it matters if threads are to work together
  - Could pass some identifier in through the parameter (see the sketch after this list)
  - Could contend for a shared counter in a critical section
  - pthread_self() returns the thread ID, but that doesn’t help
• How big is a thread’s stack?
  - By default, not very big. (What are the ramifications?)
  - pthread_attr_setstacksize() changes the stack size
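A minimal sketch of both answers (names are illustrative): the identifier travels through the void* parameter, and pthread_attr_setstacksize() requests a larger stack before spawning.

    /* Illustrative sketch: thread IDs via the parameter, bigger stacks
     * via the attributes. */
    #include <pthread.h>
    #include <stdio.h>

    #define NUM_THREADS 4

    static void *worker(void *param)
    {
        int myId = (int)(size_t) param;   /* recover the identifier */
        printf("I am thread %d\n", myId);
        return NULL;
    }

    int main(void)
    {
        pthread_t tids[NUM_THREADS];
        pthread_attr_t attr;

        pthread_attr_init(&attr);
        pthread_attr_setstacksize(&attr, 4 * 1024 * 1024);  /* 4 MB stack */

        for (int i = 0; i < NUM_THREADS; i++)
            pthread_create(&tids[i], &attr, worker, (void *)(size_t)i);
        for (int i = 0; i < NUM_THREADS; i++)
            pthread_join(tids[i], NULL);
        return 0;
    }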



Join Issues
• Main thread must join with child threads (pthread_join)
  - Why? So it knows when they are done
• pthread_join can pass back a pointer-sized value
  - Can be used as a pointer to pass back a result
  - What kind of variable can be passed back that way? Local? Static? Global? Heap? (See the sketch below.)
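A minimal sketch of the answer (illustrative names): the value must outlive the thread, so heap (or static/global) storage works; a local on the thread’s stack does not.

    /* Illustrative sketch: return a heap-allocated result through
     * pthread_exit/pthread_join. A local variable would vanish with
     * the thread's stack; heap, static, or global storage survives. */
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>

    static void *compute(void *param)
    {
        (void) param;
        double *result = malloc(sizeof *result);  /* heap: safe to pass back */
        *result = 3.14159;
        pthread_exit(result);                     /* delivered to the joiner */
    }

    int main(void)
    {
        pthread_t tid;
        void *retval;

        pthread_create(&tid, NULL, compute, NULL);
        pthread_join(tid, &retval);               /* receives the pointer */
        printf("result = %f\n", *(double *)retval);
        free(retval);
        return 0;
    }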

