
Parallel Architectures

Lecture at UFRO
February/March 2011
Carsten Trinitis
Lehrstuhl für Rechnertechnik und Rechnerorganisation (LRR)
Institut für Informatik, Technische Universität München
Germany
LRR-TUM, 2011


Literature

Books

David E. Culler, Jaswinder Pal Singh, Anoop Gupta: Parallel Computer Architecture: A Hardware/Software Approach, Morgan Kaufmann, 1999, ISBN 1-55860-343-3


Goals of Parallel Computing


Reduction of applications' execution time
Increased extensibility and configurability
Natural organization for systems with special-purpose processors
Scientific interest


Parallelism for Performance

Processor
  Bit-level: up to 128 bit
  Instruction-level: pipelining, functional units
  Latency becomes very important: branch prediction
  Toleration of latency: hyperthreading
  Multiprocessors on a chip
Memory: multiple memory banks
IO: hardware DMA, RAID arrays
Multiple processors


Architecture with Physically Centralized Memory

[Figure: several processors connected via a bus/network to a single centralized memory.]


Architecture with Physically Distributed Memory


Performance Goal

Speedup
  speedup(p processors) = performance(p processors) / performance(1 processor)

  Scientific computing: performance = 1/time
  speedup(p processors) = time(1 processor) / time(p processors)

Efficiency
  efficiency(p processors) = speedup(p processors) / p
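As a numerical illustration (made-up numbers): if a computation takes 100 s on one processor and 8 s on 16 processors, then speedup(16 processors) = 100/8 = 12.5 and efficiency(16 processors) = 12.5/16 ≈ 0.78.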


Speedup based on Throughput

Performance = throughput = transactions per minute (tpm)

speedup(p processors) = tpm(p processors) / tpm(1 processor)


Classification
Parallel Systems
  SIMD
    Array
    Vector
  MIMD
    Distributed Memory
      MPP
      Cluster
      NOW
    Shared Memory
      UMA
      NUMA
        ccNUMA
        nccNUMA
      COMA


Classification
Parallel systems
  SIMD (Single Instruction Multiple Data):
    Array processors: synchronized execution of the same instruction on a set of ALUs.
    Vector processors: high-end processors performing vector operations.
  MIMD (Multiple Instruction Multiple Data): asynchronous execution of different instructions.

M. Flynn, Very High-Speed Computing Systems, Proceedings of the IEEE, 54, 1966


MIMD computers
Distributed Memory - DM (multicomputer):
  Building blocks are nodes with a private physical address space. Communication is based on messages.
Shared Memory - SM (multiprocessor):
  The system provides a shared address space. Communication is based on read/write operations from/to global addresses.


Distributed Memory
Distributed Memory - DM (multicomputer):
  Building blocks are nodes with a private physical address space. Communication is based on messages.
  Massively Parallel Systems (MPP): dedicated systems with a single operating system instance.
  Clusters: collections of workstations dedicated to parallel computing, with a dedicated high-performance network. Individual OS instances on each machine.
  Networks of Workstations (NOW): no dedicated high-performance network.


Shared Memory
Uniform Memory Access - UMA (symmetric multiprocessors - SMP):
  Centralized shared memory; accesses to global memory from all processors have the same latency.
Non-uniform Memory Access - NUMA (Distributed Shared Memory - DSM):
  Memory is distributed among the nodes; local accesses are much faster than remote accesses.


NUMA Systems
Cache-coherent NUMA - ccNUMA:
  The home location of data is fixed. Copies of shared data in the processor caches are automatically kept coherent, i.e. new values are automatically propagated.
Non-cache-coherent NUMA - nccNUMA:
  The home location of data is fixed. Copies of shared data are independent of the original location.
Cache-only memory - COMA:
  Data migrates between the memories of the nodes, i.e. the home location changes.


Communication in Parallel Systems


The programming model specifies the communication abstraction:
  Shared memory:
    global addresses
    explicit synchronization
  Message passing:
    explicit exchange of messages
    implicit synchronization

Communication hardware:
  Shared memory (bus-based shared memory systems, symmetric multiprocessors - SMPs)
  Message passing

Communication Architecture
Layers (top to bottom):
  Programming model, including the communication abstraction
  Compiler / libraries
  System interface / OS
  Hardware interface / hardware


Symmetric Multiprocessors (SMPs)


Symmetric Multiprocessors (SMPs)

Symmetric Multiprocessors (SMPs):
  Global physical address space.
  Symmetric access to all of main memory from any processor, i.e. the same latency for all accesses.

SMPs dominate the server market and are becoming more common on the desktop.
Throughput engines for sequential jobs with varying memory and CPU requirements.
Parallel programming: automatic parallelizers are available (e.g. Intel compilers).
Important building blocks for larger-scale systems.

Design of a Bus-based SMP

[Figure: processors P0 ... Pn, memory, and an I/O device attached to a shared bus.]


Memory Semantics of a Sequential Computer

One program:
  A read should return the last value written to that location.
  Operations are executed in program order.
Multiple programs (time sharing):
  The same condition holds for read operations.
  Operations are executed in some order that respects the individual program order of each program.

Hardware ensures that these semantics are preserved even though it uses:
  write buffers
  caches
  ...

Cache Coherency


Cache Coherency

Cache Coherency:
  The processor must access valid data, in the cache and in main memory!
  Instruction cache: read only!
  Data cache: read & write!
  Cache, memory, or both are updated upon a write hit:
  - an update strategy is required.
  A read access must not refer to invalid data!


Cache Coherency

Update strategies

Write-Through

Copy-Back

Provide cache coherency if only one master exists.

Hardware requirements vary.

Some processors support both.


Cache Coherency

Write-Through

Main memory always updated upon write.

Upon write hit, cache is also updated.

Memory and cache content are always consistent, but the memory write is slow!

Any ideas :) ?


Cache Coherency

Write-Through

Buffered-Write-Through
  Buffer between cache and memory.
  Cache control can initiate subsequent cache accesses before the data has been written to memory.
  Easy to implement.
  Overlaps program execution and memory update.


Cache Coherency

Write-Through

No-Write-Allocation
Upon write-miss, only memory is updated.

Write-Allocation
Upon write miss, cache and memory are updated.


Cache Coherency

Copy-Back (Write-Back)
  Upon a write, the cache is always updated.
  Memory is updated when the cache line is evicted.
  Whether a cache line is copied back depends on its Dirty-Bit (flagged copy-back).
  The Dirty-Bit indicates whether the data in the cache line has been modified.


Cache Coherency

Update Strategy for Caches with Valid- and Dirty-Bit


Read-Hit:
  Write-Through (No-Write-Alloc. and Write-Alloc.) and Copy-Back: Cache-Data --> CPU

Read-Miss:
  Write-Through (both variants): Mem-Block, Tag --> Cache; Mem-Data --> CPU; V=1
  Copy-Back:                     Mem-Block, Tag --> Cache; Mem-Data --> CPU; V=1, D=0

Write-Hit:
  Write-Through (both variants): CPU-Data --> Cache, Mem
  Copy-Back:                     CPU-Data --> Cache; D=1

Write-Miss:
  Write-Through, No-Write-Alloc.: CPU-Data --> Mem
  Write-Through, Write-Alloc.:    Mem-Block, Tag --> Cache, V=1; CPU-Data --> Cache, Mem
  Copy-Back:                      IF D==1: Cache-Line --> Mem; Mem-Block, Tag --> Cache, V=1; CPU-Data --> Cache; D=1


Cache Coherency

How to maintain consistency?

Bus-Snooping
  The cache control monitors the bus for other masters' accesses:
  Write-Through: upon a Write-Hit by another master, set the local cache entry to invalid.
  Copy-Back: upon a Write-Hit by another master, set the local cache entry to invalid, or update it and set it to dirty.


Cache Coherency

How to maintain consistency?

More complex for systems with several masters that have caches
  Example: MESI protocol
Multiprocessor systems:
  e.g. ccNUMA
  Directory-based cache coherency


Update Strategy / Cache Coherency

How to maintain consistency?

Example: two processors with distributed caches and shared main memory.

[Figure: two CPUs, each with a private cache, connected to main memory that forms a shared address space.]


Update Strategy / Cache Coherency

MESI protocol

Each cache is equipped with snoop logic and control signals:
  Invalidate: invalidates entries in other caches.
  Shared: indicates whether a block to be loaded already exists as a cache line elsewhere.
  Retry: requests that another processor stop loading until the cache line has been written back to the memory block.


Update Strategy / Cache Coherency

MESI protocol

Two additional status bits per cache line indicate the current protocol state:
  - Invalid (I)
  - Shared (S)
  - Exclusive (E)
  - Modified (M)


Update Strategy / Cache Coherency

MESI protocol

Status bits:
Invalid (I): the cache line is invalid.
  - A read/write access to this line triggers loading of the memory block into the cache line.
  - Other caches indicate via the Shared signal whether they hold this block:
      1: Shared Read Miss
      0: Exclusive Read Miss
  - The state changes to S or E accordingly.
  - Upon a Write Miss, the state changes to M and the Invalidate signal is set.


Update Strategy / Cache Coherency

MESI protocol

Status bits:
Shared (S): the memory block exists in the local cache line and may exist in other caches.
  - Read-Hit: the state is not changed.
  - Write-Hit: the cache line is updated and the state changes to M.
    The Invalidate signal is set; other caches holding this line change from S to I.


Update Strategy / Cache Coherency

MESI protocol

Status bits:
Exclusive (E): the memory block exists only as a local copy.
  - The processor can read and write without bus access.
  - Upon a write access, the state changes to M.
  - Other caches are not affected.


Update Strategy / Cache Coherency

MESI protocol

Status bits:
Modified (M), "exclusive modified": the memory block exists only as a local copy and has been modified.
  - The processor can read and write without bus access.
  - Upon a read/write access from another processor (snoop hit), the line must be written back to the memory block; the state changes to S or I.
  - The processor that wants to access this memory block is signalled to wait via Retry.
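As a compact summary of these local-access transitions, here is a minimal C sketch (an assumed, simplified model; the enum and function names are illustrative, and snoop-triggered remote transitions as well as bus signalling are omitted):

#include <stdbool.h>

/* MESI states of a cache line (simplified model). */
typedef enum { I, S, E, M } mesi_state;

/* Local read: on a miss (state I) the block is loaded; the Shared
 * signal from the other caches decides between S and E. Hits keep the state. */
mesi_state on_local_read(mesi_state st, bool other_caches_have_block) {
    if (st == I)
        return other_caches_have_block ? S : E;
    return st;
}

/* Local write: the line always ends up Modified; starting from I or S,
 * the Invalidate signal would be raised on the bus (not modelled here). */
mesi_state on_local_write(mesi_state st) {
    (void)st;
    return M;
}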


Update Strategy / Cache Coherency

MESI state diagram (local transitions)

[Figure: MESI state diagram showing the local-access transitions between I, S, E, and M, with edges labelled Read-Hit, Write-Hit, Write-Miss, Shared Read-Miss, and Exclusive Read-Miss.]

1 Cache line is copied back to the memory block (line flush).
2 Corresponding lines in other caches are invalidated.

Update Strategy / Cache Coherency

MESI state diagram

Remote state transitions (triggered by bus snooping).

[Figure: remote MESI state transitions, with edges labelled "Snoop-Hit on a Read" and "Snoop-Hit on a Write".]

3 The Retry signal is set; then the cache line is copied back to the memory block.



Development of Microprocessors


Development of Microprocessors

Higher clock rates
  Increased power consumption
    proportional to f and U
    higher frequency requires higher voltage
  Small structures: energy loss through leakage currents
    increases heat output and cooling requirements
  Limited chip size (speed of light)
  At a fixed technology (e.g. 32 nm):
    smaller number of transistor levels per pipeline stage possible
    more, but simplified, pipeline stages (P4: >30 stages)
    higher penalty for pipeline stalls (on conflicts, e.g. branch misprediction)

Development of Microprocessors

More parallelism

Increased bit width (now: 64 bit architectures)


SIMD

Instruction Level Parallelism (ILP)


exploits parallelism found in an instruction stream
limited by data/control dependencies
can be increased by speculation
modern superscalar processors can hardly get any better


Development of Microprocessors

More parallelism

Thread Level Parallelism (TLP)
  Hardware multithreading (e.g. SMT: Hyper-Threading)
    - better exploitation of superscalar execution units
  Multiple cores
    Legacy software must be parallelized
    - a challenge for the whole software industry
    - Intel tried to delay this development (see P4 clock rates)


Development of Microprocessors

More parallelism

Data Level Parallelism (vector registers)


Increasing width due to increasing # of transistors
- Initially from multimedia (MMX, graphics cards, ...)
GPU: Hundreds of shader cores
Software must be adapted
- Parallelizing compilers
- Vector intrinsics (not compatible!)
- New language extensions (Parallel Building Blocks, CUDA,
OpenCL, ...)
- Growing number of shader cores / increasing vector width
- SSE: 128bit, AVX: 256bit, KNF: 512bit, ...

Multi-Core Architectures

SMPs on a single chip

Chip Multi-Processors (CMP)

Advantage

Efficient exploitation of available transistor budget


Improves throughput / speed of parallelized applications
Allows tight coupling of cores
better communication between cores than in SMP
shared caches

Lower power consumption


lower clock rates
idle cores can be suspended


Multi-Core Architectures

SMPs on a single chip, Chip Multi-Processors (CMP)

Disadvantage

Only improves speed of parallelized applications


Increased gap to memory speed


Multi-Core Architectures: Design Issues

homogeneous vs. heterogeneous


specialized accelerator cores
- SIMD (vector)
- GPU operations
- cryptography
- DSP functions (e.g. FFT)
- FPGA (programmable circuits)
access to memory
- private memory area (distributed memory)
- via cache hierarchy (shared memory)

Connection between cores


internal bus / cross bar connect
cache architecture

Multi-Core Architectures: Examples


[Figure: two example designs.
 Left (homogeneous with shared caches and cross bar): four cores, each with a private L1 cache, two shared L2 caches, a shared L3 cache, and I/O.
 Right (heterogeneous with caches, local store and ring bus): one core (2x SMT) with L1 and L2 caches plus several cores with local stores, an L3 cache, and I/O, connected by a ring bus.]


Shared Cache Design


[Figure: left - traditional design: multiple single-core processors, each with a private L1 cache, reaching a shared off-chip L2 cache and memory through a switch; right - multicore architecture: cores with private L1 caches sharing an on-chip L2 cache, connected through a switch to memory.]

Traditional design: multiple single-cores with the shared cache off-chip.
Multicore architecture: shared caches on-chip.


Shared Caches: Advantages

No coherency protocol required at shared cache level

Lower communication latency

Processors with overlapping working set

One processor may prefetch data for the other

Smaller cache size required

Better usage of loaded cache lines before eviction (spatial locality)

Less congestion on limited memory connection

Dynamic sharing

if one processor needs less space, other can use more



Shared Caches: Disadvantages

Multiple CPUs lead to more complex requirements:
  higher bandwidth
  the cache should be larger (larger means higher latency)
  hit latency is higher due to the switch logic above the cache
  the design is more complex
  one CPU might evict another CPU's data


Synchronization


Synchronization

Explicit synchronization among processes is required to ensure a specific execution order of their operations.
Three types of synchronization:
  Mutual exclusion:
    access to a critical code region is restricted.
  Point-to-point events:
    processes signal other processes that they have reached a specific point of execution.
  Global events:
    an event constituting that a set of processes has reached a specific point of execution.

Components of a Synchronization Event


Acquire method
  A method by which a process tries to acquire the right to synchronize, e.g. to enter a critical section or to proceed past the event synchronization point.
Waiting algorithm
  A method by which a process waits for the synchronization to become available.
Release method
  A method for a process to enable other processes to proceed past a synchronization event.

Waiting Algorithms
Busy-waiting: the process spins, waiting for a variable to change.
Blocking: the process suspends itself and is released by the OS.

Trade-offs:

Blocking has higher overhead due to OS involvement

Blocking frees the processor

Busy-waiting consumes memory bandwidth while waiting

Hybrid waiting strategies combine both approaches.
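A minimal sketch of such a hybrid strategy, assuming Pthreads and C11 atomics (the event_t type and the function names are illustrative, not from the lecture):

#include <pthread.h>
#include <stdatomic.h>

/* Illustrative event type: a flag plus a mutex/condition variable. */
typedef struct {
    atomic_int      flag;
    pthread_mutex_t lock;
    pthread_cond_t  cond;
} event_t;

event_t ev = { 0, PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER };

void event_wait(event_t *e) {
    for (int i = 0; i < 10000; i++)            /* phase 1: busy-wait */
        if (atomic_load(&e->flag)) return;
    pthread_mutex_lock(&e->lock);              /* phase 2: block in the OS */
    while (!atomic_load(&e->flag))
        pthread_cond_wait(&e->cond, &e->lock);
    pthread_mutex_unlock(&e->lock);
}

void event_signal(event_t *e) {
    pthread_mutex_lock(&e->lock);
    atomic_store(&e->flag, 1);
    pthread_cond_broadcast(&e->cond);          /* wake up blocked waiters */
    pthread_mutex_unlock(&e->lock);
}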


Mutual Exclusion - Locks


Hardware locks
  Special bus lines for locks: holding a line means holding the lock.
  Hardware locks were mainly used for implementing higher-level software locks in memory.

Simple software lock

lock:   ld   register, location   // copy location to register
        cmp  register, #0         // compare with 0
        bnz  lock                 // if not 0, try again
        st   location, #1         // store 1 into location
        ret                       // return to caller

unlock: st   location, #0         // write 0 to location
        ret                       // return to caller

Note: the load and the store are separate operations, so two processors can both read 0 and both acquire the lock; this motivates the atomic operations on the next slides.

Atomic Operations
Definition:
  An operation during which a processor can simultaneously read a location and write to it in the same bus operation.
  This prevents any other processor or I/O device from writing or reading memory until the operation is complete.
  Atomic implies indivisibility and irreducibility, so an atomic operation must be performed entirely or not at all.


Locking with Atomic Test&Set Operation


Atomic Test&Set operations do not allow intervening accesses to the same location.
T&S loads the location's content into a register and assigns 1 to the location.

lock:   t&s  register, location   // copy location to register
                                  // and assign 1 to location
        bnz  register, lock       // compare old value with 0; if not 0, try again
        ret                       // return to caller

unlock: st   location, #0         // write 0 to location
        ret                       // return to caller
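For comparison, a minimal sketch of the same spin lock written with C11 atomics (illustrative code, not from the lecture; atomic_flag_test_and_set plays the role of the atomic t&s instruction):

#include <stdatomic.h>

atomic_flag lock_flag = ATOMIC_FLAG_INIT;   /* the "location", initially 0 */

void lock(void) {
    /* spin while the previous value was 1, i.e. the lock was already held */
    while (atomic_flag_test_and_set(&lock_flag))
        ;
}

void unlock(void) {
    atomic_flag_clear(&lock_flag);          /* write 0 to the location */
}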

Other Atomic Operations


Compare&swap: Exchanges value in location with the
value in register
Fetch&op: Fetches value from location and writes value
obtained by applying the operation, e.g.
fetch&increment
fetch&add
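A minimal sketch of both primitives using C11 atomics (illustrative names, not from the lecture):

#include <stdatomic.h>
#include <stdbool.h>

atomic_int location = 0;

/* fetch&add: atomically adds 'inc' and returns the old value. */
int fetch_and_add(int inc) {
    return atomic_fetch_add(&location, inc);
}

/* compare&swap: writes 'desired' only if the location still holds
 * 'expected'; returns true on success. */
bool compare_and_swap(int expected, int desired) {
    return atomic_compare_exchange_strong(&location, &expected, desired);
}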


Global (Barrier) Event Synchronization

Centralized software implementation

struct bar_type {
    int counter;
    struct lock_type lock;
    int flag = 0;
} bar_name;

Barrier(bar_name, p) {
    lock(bar_name.lock);                 // lock barrier
    if (bar_name.counter == 0)
        bar_name.flag = 0;               // reset flag if first to reach
    mycount = ++bar_name.counter;        // mycount is a private variable
    unlock(bar_name.lock);
    if (mycount == p) {                  // last to arrive?
        bar_name.counter = 0;            // reset counter for next barrier
        bar_name.flag = 1;               // release waiting processes
    } else
        while (bar_name.flag == 0) {}    // busy-wait for release
}

Hardware Barriers

Synchronization bus
  The barrier is simply a wired-AND of lines.
  A processor sets its input high when it reaches the barrier and waits until the output goes high.
Advantageous if barriers are executed frequently, e.g. with automatic parallelization.
Complications:
  synchronization among a subgroup of processors
  migration of processes
  multiple processes on a single processor


Synchronization Summary

Some bus-based machines have provided full hardware support for synchronization.
Limited flexibility leads to supporting only simple atomic operations in hardware.
Higher-level primitives can easily be built on top of those.
Synchronization primitives are supported by libraries.


Parallel Programming


Application Programming with OpenMP


!$omp parallel do schedule(static)
do i=1,1000
  do j=1,1000
    b(i,j) = (a(i-1,j) + a(i+1,j))/2
  enddo
enddo
!$omp end parallel do

www.openmp.org
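For reference, a rough C equivalent of the loop above (a sketch; array dimensions and names are assumed):

#include <omp.h>
#define N 1000

double a[N+2][N+2], b[N+2][N+2];

void smooth(void) {
    /* the iterations over i are independent and are divided
       statically among the threads */
    #pragma omp parallel for schedule(static)
    for (int i = 1; i <= N; i++)
        for (int j = 1; j <= N; j++)
            b[i][j] = (a[i-1][j] + a[i+1][j]) / 2.0;
}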


Multicore Architectures

SMPs on a single chip

Motivation
  Big gap between CPU and memory speed
  High clock frequencies limit chip size
  Long pipelines introduce a big penalty for misprediction
  Energy consumption is proportional to f and U; it increases heat output and cooling requirements
  ILP is limited
Multicore as a solution
  Improves throughput and the speed of parallelized applications
  Better communication between cores than in an SMP
  Saves energy

Multicore Processors

IBM Power 7

SUN UltraSparc

AMD Magny Cours / Bulldozer

Intel Pentium 4 D (Netburst...)

Intel Itanium (Montecito)

Intel Core Duo, Core 2

Intel Nehalem

Intel Sandy Bridge

Intel MIC (Knights Ferry, Knights Corner, and beyond...)


Intel Itanium 2 Dual Core

Two Itanium 2 cores

Multi-threading (2 Threads)
Simultaneous multi-threading for memory hierarchy resources
Temporal multi-threading for core resources
- Besides end of time slice, an event, typically an L3 cache miss,
might lead to a thread switch.

Caches
L1D 16 KB, L1I 16 KB
L2D 256 KB, L2I 1 MB (Itanium 2 unified 256 KB)
L3 12 MB (Itanium 2 9 MB)

Caches private to cores

1.7 billion transistors



Distributed Memory Multiprocessors


Distributed Memory Architecture


Interconnect Properties

Cost / Complexity
  Hardware effort
  Number of switching elements and interconnect lines
Throughput / Bandwidth
  Amount of data that can be transferred across the interconnect within a given period of time.
Bisection bandwidth
  If the network is segmented into two equal parts, this is the bandwidth between the two parts.
  Typically refers to the worst-case segmentation.
Latency
  Time between the start of a message send and the reception of its first bit.


Structural Properties

Connect degree
  # of (direct) connections to other nodes
  A higher connect degree means higher cost.
Network diameter
  Maximum # of connections that need to be used for communication between two nodes (maximum path length).
  Latency increases with the diameter.
  A lower diameter requires a higher connect degree.


Example

Bisection bandwidth = 4 x the bandwidth of the individual links

Interconnect Properties

Extendability
  Unlimited, or up to a maximum # of nodes.
  Continuous, or in given increments of nodes (e.g. powers of 2).
Scalability
  Essential properties must be maintained when increasing the # of nodes.
  More precisely:
    Latency should not increase considerably.
    Bisection bandwidth should increase proportionally.
    Network cost should not increase significantly.


Routing

Non-blocking:
  A connection between two nodes can be set up at any time (regardless of already existing connections).
Fault tolerance:
  Communication between two nodes is still possible when transmission lines or switches have failed.
  Requires more than one path between two nodes.
Routing complexity:
  Routing should be implemented by the node/switch hardware.


Classes of Interconnect

Direct interconnect:
  A node is connected to each switch.
  E.g. ring, mesh, hypercube
Indirect interconnect:
  Nodes are connected to a combinatorial switching circuit.
  E.g. bus, crossbar


Direct Interconnect

[Figure: ring (1D), mesh (2D), and hypercube (4D) topologies.]


Indirect Interconnect

Crossbar Switch
  Two disjoint sets of nodes (e.g. processors and memory modules).
  All possible disjoint pairs can be connected at any time.
  Non-blocking.
  Hardware effort: n^2 switching elements for n elements per set.


Properties of Some Topologies


Topology   | Degree    | Diameter      | Ave Dist   | Bisection | Ave Dist (N=1024)
-----------|-----------|---------------|------------|-----------|------------------
1D Array   | 2         | N-1           | N/3        | 1         | huge
1D Ring    | 2         | N/2           | N/4        | 2         | huge
2D Mesh    | 4         | 2(N^1/2 - 1)  | 2/3 N^1/2  | N^1/2     | 21
2D Torus   | 4         | N^1/2         | 1/2 N^1/2  | 2 N^1/2   | 16
Hypercube  | n = log N | n             | n/2        | N/2       | n/2 (= 5)


Properties of Distributed Memory Architectures

No shared memory or shared address space.
Communication through message passing.
Dedicated libraries for portable parallel code, e.g. MPI (Message Passing Interface).
The latency imposed by software is usually much higher than that imposed by hardware.
Interconnect: direct or indirect.


Cabling of Crossbars

Single Program Multiple Data (SPMD)

[Figure: a sequential program with a data distribution is turned into a sequential node program with message passing; identical copies run on P0 ... P3, distinguished only by their process identifications.]


Message Passing Programming


/* The Parallel Hello World Program */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int node;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &node);
    printf("Hello World from Node %d\n", node);
    MPI_Finalize();
    return 0;
}
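With a typical MPI installation the program can be compiled and started with, e.g., mpicc hello.c -o hello and mpirun -np 10 ./hello (the exact commands depend on the MPI distribution); each process then prints one line, in arbitrary order, as on the next slide.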

Message Passing Programming


Hello World from Node 2
Hello World from Node 0
Hello World from Node 4
Hello World from Node 9
Hello World from Node 3
Hello World from Node 8
Hello World from Node 7
Hello World from Node 1
Hello World from Node 6
Hello World from Node 5


Grid Computing


Introduction
Web: uniform naming of and access to documents (http://).
Grid: uniform, high-performance access to computational resources.
  On-demand creation of powerful virtual computing systems.

[Figure: a grid connecting software catalogs, computers, colleagues, and data archives.]

What is the Grid?

Term borrowed from the electricity grid.
Enables communities ("virtual organizations") to share geographically distributed resources as they pursue common goals.


Elements of the Problem

Resource sharing
  Computers, storage, sensors, networks, ...
  Sharing is always conditional: issues of trust, policy, negotiation, payment, ...
Coordinated problem solving
  Beyond client-server: distributed data analysis, computation, collaboration, ...
Dynamic, multi-institutional virtual organizations
  Community overlays on classic organizational structures
  Large or small, static or dynamic


Why Grids?

A biochemist exploits 10,000 computers to screen 100,000 compounds in an hour.
1,000 physicists worldwide pool resources for peta-op analyses of petabytes of data.
Civil engineers collaborate to design, execute, & analyze shake-table experiments.
Climate scientists visualize, annotate, & analyze terabyte simulation datasets.
An emergency response team couples real-time data, weather models, and population data.


Why Grids? (cont'd)

A multidisciplinary analysis in aerospace couples code and data in four companies.
A home user invokes architectural design functions at an application service provider.
An application service provider purchases cycles from compute cycle providers.
Scientists working for a multinational soap company design a new product.
A community group pools members' PCs to analyze alternative designs for a local road.


Data Grids for High Energy Physics


[Figure: tiered data grid for high energy physics (image courtesy Harvey Newman, Caltech).
 The online system receives ~PBytes/sec from the detector (a bunch crossing every 25 nsecs, 100 triggers per second, each triggered event ~1 MByte) and delivers ~100 MBytes/sec to an offline processor farm of ~20 TIPS at the CERN Computer Centre (Tier 0; 1 TIPS is approximately 25,000 SpecInt95 equivalents).
 Tier 1 regional centres (France, Germany, Italy, FermiLab ~4 TIPS, Caltech ~1 TIPS) are fed at ~622 Mbits/sec (or air freight, deprecated); Tier 2 centres (~1 TIPS each) and institutes (~0.25 TIPS) with physics data caches connect at ~622 Mbits/sec, and physicist workstations (Tier 4) at ~1 MBytes/sec.
 Physicists work on analysis channels; each institute will have ~10 physicists working on one or more channels, and data for these channels should be cached by the institute server.]

The Global Grid Forum and the Globus Project

Global Grid Forum: development of standard protocols and APIs for Grid computing
  Development and promotion of standard Grid protocols to enable interoperability and shared infrastructure
  Development and promotion of standard Grid software APIs and SDKs to enable portability and code sharing
  www.gridforum.org

The Globus Toolkit: open-source reference software base for building grid infrastructure and applications
  www.globus.org

Job Submission

Crossgrid

European Grid Project - 2003

21 partners

www.eu-crossgrid.org

IST (Information Society Technology)

Focusing on interactive applications

One of the applications is flooding simulation


Threading


Things to Consider

Design approach and structure of your application

Threading API (e.g. PThreads, OpenMP, ...)

Compilers and Tools (icc / icpc / ifort 12.0)

Target platforms, i.e. your hardware (Core, Nehalem, ...)


What is a Thread?

Definition:
A thread is a sequence of related instructions that is
executed independently of other instruction sequences.

Every program has at least one thread!
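As a concrete illustration, a minimal Pthreads program that creates one additional thread (a sketch, not part of the original slides):

#include <pthread.h>
#include <stdio.h>

/* Function executed by the second thread. */
void *worker(void *arg) {
    printf("Hello from the second thread\n");
    return NULL;
}

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, worker, NULL);  /* the program now has two threads */
    pthread_join(t, NULL);                   /* wait for the worker to finish */
    printf("Hello from the main thread\n");
    return 0;
}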


Types of Threads

User-level thread:
  Thread created and manipulated by the application software.
Kernel-level thread:
  Thread implemented by the operating system.
Hardware thread:
  How threads appear to the execution resources in the hardware (on the processor).


Hello World in OpenMP


#include <omp.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int nthreads, tid;
    /* Fork a team of threads with their own copies of variables */
    #pragma omp parallel private(nthreads, tid)
    {
        tid = omp_get_thread_num();            /* Obtain thread number */
        printf("Hello World from thread = %d\n", tid);
        if (tid == 0) {                        /* Only master thread does this */
            nthreads = omp_get_num_threads();
            printf("Number of threads = %d\n", nthreads);
        }
    } /* All threads join master thread and disband */
    return 0;
}

OpenMP

Code has no function for explicit thread creation!


Threads are created automatically.
More on OpenMP if you play with your installation :) .


Processor, Processes, and Threads

A processor runs a program, which consists of one or more processes.
A process consists of one or more threads.
Processes are mapped to the MMU (memory).
Threads are mapped to processors (affinity).
Each process has its own (virtual) address space.


User-Level Threads -> Kernel-Level Threads

One-to-one (1:1), kernel-level threading (Linux NPTL & Win32).
Many-to-one (M:1), user-level threading (GNU Portable Threads).
Many-to-many (M:N), hybrid threading (high complexity, Windows 7).



Kernel Level Threading


User Level Threading


Hybrid Threading


User-Level Threads -> Physical Threads

The mapping can be specified through an affinity type, e.g. KMP_AFFINITY in Intel OpenMP:

Type = none (default)
  Does not bind OpenMP threads to particular thread contexts.
Type = compact
  Binds OpenMP thread <n>+1 to a free thread context as close as possible to the thread context where OpenMP thread <n> was bound.
Type = scatter
  Distributes the threads as evenly as possible across the entire system.
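In practice (assuming the Intel OpenMP runtime), the type is selected through the environment before the run, e.g. export KMP_AFFINITY=compact or export KMP_AFFINITY=scatter; further modifiers such as verbose depend on the compiler version.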


States of a Thread
[Figure: thread state diagram. A new thread enters the Ready state; the scheduler dispatches it to Running; an interrupt puts it back to Ready; waiting for an event moves it to Waiting; event completion returns it to Ready; on exit the running thread terminates.]


Parallel Programming Concepts


Designing your Program

Traditional way:
Program starts with main() and works through tasks
sequentially.
Only one thing is happening at any given moment!
Parallel approach:
Rethinking necessary for parallel design.
Decompose program by
Task,
Data, or
Data flow.


Task Decomposition

Gardening example with two gardeners:
  The first gardener mows the lawn,
  the second gardener weeds.
  Mowing and weeding are two separate, independent tasks.
  However, some coordination is required (the gardeners should not interfere with each other).
Programming example: OpenOffice Writer
  The user is entering text.
  Pagination happens simultaneously.
  Think of early versions of Word!


Data Decomposition

Gardening example with two gardeners:


First, both gardeners mow half the property.
Then, both weed half the flower beds.
Property and flower beds should be large enough so that
this decomposition makes sense.
Hence, problem sizes can be increased.
Programming example: OpenOffice Calc
Each thread recalculates half the values in a large
spreadsheet.
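A minimal OpenMP sketch of this kind of data decomposition (illustrative names; cells[] and recalc() stand in for the spreadsheet data and the per-cell recalculation):

#include <omp.h>
#define NCELLS 1000

double cells[NCELLS];

/* Hypothetical recalculation of one cell. */
static void recalc(double *c) { *c = *c * 1.01; }

void recalculate_all(void) {
    /* Data decomposition: each of two threads handles half of the cells. */
    #pragma omp parallel num_threads(2)
    {
        int half  = omp_get_thread_num();   /* 0 or 1 */
        int start = half * (NCELLS / 2);
        int end   = start + NCELLS / 2;
        for (int i = start; i < end; i++)
            recalc(&cells[i]);
    }
}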


Data Flow Decomposition

Gardening example with two gardeners:


First gardener prepares the tools (i.e. put gas in the mower,
clean shears, etc.) for both.
Second gardener is delayed.
Then parallel work can begin.
Programming example: Compiler
An input file must be parsed.
Then code generation begins.


Challenges

Threads let you improve performance (hopefully significantly!).

However, complexity increases (significantly!).

Four types of problems:

Synchronization (coordinate threads' activities, e.g. one thread waits for


another to finish).

Communication: bandwidth/latency issues (see earlier slides :-) ).

Load balancing: Distribute work evenly among threads.

Scalability: Obtain good parallel efficiency (see earlier slides :-) ).


Parallel Programming Patterns

Task Level Parallelism:


Focus on the tasks themselves, i.e. decompose problem into set of independent tasks.
Dependencies might need to be removed.
Embarrassingly parallel problems: no dependencies between threads.
Divide and Conquer:
Problem is divided into parallel sub-problems.
These are solved independently.
Results are aggregated into final solution.
Example: Merge sort.
Geometric Decomposition:
Parallelization of data structures.
Each thread works on data chunks.
Example: Heat flow, wave propagation.
Pipeline:
Think of assembly lines!
Computation is broken down into stages (like in a processor :) ).
Wavefront:
Data elements are processed along a diagonal in a 2D grid.


Example: Floyd-Steinberg Dithering


Example: Floyd-Steinberg Dithering

Assume an eight-bit grayscale image to start with.
If the value is in the range [0, 127], set PixelValue to 0.
If the value is in the range [128, 255], set PixelValue to 1.
For each pixel, determine the error, i.e. the deviation from its original value:
  err = OriginalValue - 255 * PixelValue
Finally, propagate the error to the neighbours for the next iteration according to the following distribution (relative to the current pixel *):

        *    7/16
  3/16 5/16  1/16


Example: Floyd-Steinberg Dithering


void floyd_steinberg(unsigned int width, unsigned int height,
                     unsigned short **InputImage, unsigned short **OutputImage)
{
    for (unsigned int i = 0; i < height; i++)
        for (unsigned int j = 0; j < width; j++)
        {
            if (InputImage[i][j] < 128)
                OutputImage[i][j] = 0;
            else
                OutputImage[i][j] = 1;
            int err = InputImage[i][j] - 255 * OutputImage[i][j];
            /* distribute the error to the neighbours (border handling omitted) */
            InputImage[i][j+1]   += err * 7/16;
            InputImage[i+1][j-1] += err * 3/16;
            InputImage[i+1][j]   += err * 5/16;
            InputImage[i+1][j+1] += err * 1/16;
        }
}


Example: Floyd-Steinberg Dithering

Looks sequential at first glance:
  the previous pixel's error must be known to compute the next pixel's value.
Look at it from the receiving pixel's perspective (weights of the neighbours it receives error from, current pixel *):

  1/16 5/16 3/16
  7/16   *


Producer-Consumer like Approach


[Figure: image rows assigned to Thread 1 - Thread 4; each thread starts on its row once the pixels it depends on in the row above have been processed, so the threads advance in a staggered, producer-consumer fashion.]


Performance Tuning


Performance Tuning

Toolchain

Intel Parallel Tools

Intel Composer XE 2011


Intel C++ Compiler
Intel Visual Fortran Compiler
Intel Integrated Performance Primitives
Intel Math Kernel Library
Intel Parallel Building Blocks

Intel VTune Amplifier XE 2011


(Formerly Intel VTune Performance Analyzer with Intel Thread Profiler)

Intel Inspector XE 2011 (Formerly Intel Thread Checker)


Intel Compilers & Libraries


Latest compiler version: 12
  Optimized for the latest micro-architectures (up to Sandy Bridge), including 128- and 256-bit vectorization.
  Supports C, C++, and FORTRAN.
Integrated Performance Primitives (IPP): functions for multimedia, data processing, and communications.
Math Kernel Library (MKL): provides numerical functions (BLAS, LAPACK, ScaLAPACK, Sparse Solvers, Fast Fourier Transforms, Vector Math) optimized for each architecture.
Parallel Building Blocks: CilkPlus, ArBB, TBB, ...

Intel Amplifier XE
Better known as VTune :)

Tool for software performance analysis for x86 and x64 based
architectures

Both GUI and command line interfaces.

Available for both Linux and Windows.

Supports C, C++, and FORTRAN


Intel Libraries/Extensions

Integrated Performance Primitives (IPP): functions for multimedia, data processing, and communications.
Math Kernel Library (MKL): provides numerical functions (BLAS, LAPACK, ScaLAPACK, Sparse Solvers, Fast Fourier Transforms, Vector Math) optimized for each architecture.
Parallel Building Blocks: CilkPlus, ArBB, TBB, ...
