
Parallel Architectures

Lecture at UFRO
February/March 2011
Carsten Trinitis
Lehrstuhl für Rechnertechnik und Rechnerorganisation (LRR)
Institut für Informatik, Technische Universität München
Germany
LRR-TUM, 2011


Literature

Books

David E. Culler, Jaswinder Pal Singh, Anoop Gupta: Parallel Computer Architecture: A Hardware/Software Approach, Morgan Kaufmann, 1999, ISBN 1-55860-343-3


Goals of Parallel Computing


Reduction of applications' execution time
Increased extensibility and configurability
Natural organization for systems with special-purpose processors
Scientific interest


Parallelism for Performance

Processor
  Bit-level: up to 128 bit
  Instruction-level: pipelining, functional units
  Latency becomes very important: branch prediction
  Toleration of latency: hyperthreading
  Multiprocessors on a chip
Memory: multiple memory banks
IO: hardware DMA, RAID arrays
Multiple processors


Architecture with Physically Centralized Memory

[Figure: several processors connected via a bus/network to a single centralized memory.]


Architecture with Physically Distributed Memory


Performance Goal

Speedup
  speedup(p processors) = performance(p processors) / performance(1 processor)

  Scientific computing: performance = 1/time
  speedup(p processors) = time(1 processor) / time(p processors)

Efficiency
  efficiency(p processors) = speedup(p processors) / p
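As a numerical illustration (made-up numbers): if a computation takes 100 s on one processor and 8 s on 16 processors, then speedup(16 processors) = 100/8 = 12.5 and efficiency(16 processors) = 12.5/16 ≈ 0.78.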


Speedup based on Throughput

Performance = throughput = transactions per minute (tpm)

speedup(p processors) = tpm(p processors) / tpm(1 processor)


Classification
Parallel Systems
  SIMD
    Array
    Vector
  MIMD
    Distributed Memory
      MPP
      Cluster
      NOW
    Shared Memory
      UMA
      NUMA
        ccNUMA
        nccNUMA
      COMA


Classification
Parallel systems
  SIMD (Single Instruction Multiple Data):
    Array processors: synchronized execution of the same instruction on a set of ALUs.
    Vector processors: high-end processors performing vector operations.
  MIMD (Multiple Instruction Multiple Data): asynchronous execution of different instructions.

M. Flynn, Very High-Speed Computing Systems, Proceedings of the IEEE, 54, 1966


MIMD computers
Distributed Memory - DM (multicomputer):
  Building blocks are nodes with a private physical address space. Communication is based on messages.
Shared Memory - SM (multiprocessor):
  The system provides a shared address space. Communication is based on read/write operations from/to global addresses.


Distributed Memory
Distributed Memory - DM (multicomputer):
  Building blocks are nodes with a private physical address space. Communication is based on messages.
  Massively Parallel Systems (MPP): dedicated systems with a single operating system instance.
  Clusters: collections of workstations dedicated to parallel computing, with a dedicated high-performance network. Individual OS instances on each machine.
  Networks of Workstations (NOW): no dedicated high-performance network.


Shared Memory
Uniform Memory Access - UMA (symmetric multiprocessors - SMP):
  Centralized shared memory; accesses to global memory from all processors have the same latency.
Non-uniform Memory Access - NUMA (Distributed Shared Memory - DSM):
  Memory is distributed among the nodes; local accesses are much faster than remote accesses.


NUMA Systems
Cache-coherent NUMA - ccNUMA:
  The home location of data is fixed. Copies of shared data in the processor caches are automatically kept coherent, i.e. new values are automatically propagated.
Non-cache-coherent NUMA - nccNUMA:
  The home location of data is fixed. Copies of shared data are independent of the original location.
Cache-only memory - COMA:
  Data migrates between the memories of the nodes, i.e. the home location changes.


Communication in Parallel Systems


The programming model specifies the communication abstraction:
  Shared memory:
    global addresses
    explicit synchronization
  Message passing:
    explicit exchange of messages
    implicit synchronization

Communication hardware:
  Shared memory (bus-based shared memory systems, symmetric multiprocessors - SMPs)
  Message passing

Communication Architecture
Layers (top to bottom):
  Programming model, including the communication abstraction
  Compiler / libraries
  System interface / OS
  Hardware interface / hardware


Symmetric Multiprocessors (SMPs)


Symmetric Multiprocessors (SMPs)

Symmetric Multiprocessors (SMPs):
  Global physical address space.
  Symmetric access to all of main memory from any processor, i.e. the same latency for all accesses.

SMPs dominate the server market and are becoming more common on the desktop.
Throughput engines for sequential jobs with varying memory and CPU requirements.
Parallel programming: automatic parallelizers are available (e.g. Intel compilers).
Important building blocks for larger-scale systems.

Design of a Bus-based SMP

[Figure: processors P0 ... Pn, memory, and an I/O device attached to a shared bus.]


Memory Semantics of a Sequential Computer

One program:
  A read should return the last value written to that location.
  Operations are executed in program order.
Multiple programs (time sharing):
  The same condition holds for read operations.
  Operations are executed in some order that respects the individual program order of each program.

Hardware ensures that these semantics are preserved even though it uses:
  write buffers
  caches
  ...

Cache Coherency


Cache Coherency

Cache Coherency:
  The processor must access valid data, in the cache and in main memory!
  Instruction cache: read only!
  Data cache: read & write!
  Cache, memory, or both are updated upon a write hit:
  - an update strategy is required.
  A read access must not refer to invalid data!


Cache Coherency

Update strategies

Write-Through

Copy-Back

Provide cache coherency if only one master exists.

Hardware requirements vary.

Some processors support both.


Cache Coherency

Write-Through

Main memory always updated upon write.

Upon write hit, cache is also updated.

Memory and cache content are always consistent, but the memory write is slow!

Any ideas :) ?


Cache Coherency

Write-Through

Buffered-Write-Through
  Buffer between cache and memory.
  Cache control can initiate subsequent cache accesses before the data has been written to memory.
  Easy to implement.
  Overlaps program execution and memory update.


Cache Coherency

Write-Through

No-Write-Allocation
Upon write-miss, only memory is updated.

Write-Allocation
Upon write miss, cache and memory are updated.


Cache Coherency

Copy-Back (Write-Back)
  Upon a write, the cache is always updated.
  Memory is updated when the cache line is evicted.
  Whether a cache line is copied back depends on its Dirty-Bit (flagged copy-back).
  The Dirty-Bit indicates whether the data in the cache line has been modified.


Cache Coherency

Update Strategy for Caches with Valid- and Dirty-Bit


Read-Hit:
  Write-Through (No-Write-Alloc. and Write-Alloc.) and Copy-Back: Cache-Data --> CPU

Read-Miss:
  Write-Through (both variants): Mem-Block, Tag --> Cache; Mem-Data --> CPU; V=1
  Copy-Back:                     Mem-Block, Tag --> Cache; Mem-Data --> CPU; V=1, D=0

Write-Hit:
  Write-Through (both variants): CPU-Data --> Cache, Mem
  Copy-Back:                     CPU-Data --> Cache; D=1

Write-Miss:
  Write-Through, No-Write-Alloc.: CPU-Data --> Mem
  Write-Through, Write-Alloc.:    Mem-Block, Tag --> Cache, V=1; CPU-Data --> Cache, Mem
  Copy-Back:                      IF D==1: Cache-Line --> Mem; Mem-Block, Tag --> Cache, V=1; CPU-Data --> Cache; D=1


Cache Coherency

How to maintain consistency?

Bus-Snooping
  The cache control monitors the bus for other masters' accesses:
  Write-Through: upon a Write-Hit by another master, set the local cache entry to invalid.
  Copy-Back: upon a Write-Hit by another master, set the local cache entry to invalid, or update it and set it to dirty.


Cache Coherency

How to maintain consistency?

More complex for systems with several masters that have caches
  Example: MESI protocol
Multiprocessor systems:
  e.g. ccNUMA
  Directory-based cache coherency


Update Strategy / Cache Coherency

How to maintain consistency?

Example: two processors with distributed caches and shared main memory.

[Figure: two CPUs, each with a private cache, connected to main memory that forms a shared address space.]


Update Strategy / Cache Coherency

MESI protocol

Each cache is equipped with snoop logic and control signals:
  Invalidate: invalidates entries in other caches.
  Shared: indicates whether a block to be loaded already exists as a cache line elsewhere.
  Retry: requests that another processor stop loading until the cache line has been written back to the memory block.


Update Strategy / Cache Coherency

MESI protocol

Two additional status bits per cache line indicate the current protocol state:
  - Invalid (I)
  - Shared (S)
  - Exclusive (E)
  - Modified (M)


Update Strategy / Cache Coherency

MESI protocol

Status bits:
Invalid (I): the cache line is invalid.
  - A read/write access to this line triggers loading of the memory block into the cache line.
  - Other caches indicate via the Shared signal whether they hold this block:
      1: Shared Read Miss
      0: Exclusive Read Miss
  - The state changes to S or E accordingly.
  - Upon a Write Miss, the state changes to M and the Invalidate signal is set.


Update Strategy / Cache Coherency

MESI protocol

Status bits:
Shared (S): the memory block exists in the local cache line and may exist in other caches.
  - Read-Hit: the state is not changed.
  - Write-Hit: the cache line is updated and the state changes to M.
    The Invalidate signal is set; other caches holding this line change from S to I.


Update Strategy / Cache Coherency

MESI protocol

Status bits:
Exclusive (E): the memory block exists only as a local copy.
  - The processor can read and write without bus access.
  - Upon a write access, the state changes to M.
  - Other caches are not affected.


Update Strategy / Cache Coherency

MESI protocol

Status bits:
Modified (M), "exclusive modified": the memory block exists only as a local copy and has been modified.
  - The processor can read and write without bus access.
  - Upon a read/write access from another processor (snoop hit), the line must be written back to the memory block; the state changes to S or I.
  - The processor that wants to access this memory block is signalled to wait via Retry.
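As a compact summary of these local-access transitions, here is a minimal C sketch (an assumed, simplified model; the enum and function names are illustrative, and snoop-triggered remote transitions as well as bus signalling are omitted):

#include <stdbool.h>

/* MESI states of a cache line (simplified model). */
typedef enum { I, S, E, M } mesi_state;

/* Local read: on a miss (state I) the block is loaded; the Shared
 * signal from the other caches decides between S and E. Hits keep the state. */
mesi_state on_local_read(mesi_state st, bool other_caches_have_block) {
    if (st == I)
        return other_caches_have_block ? S : E;
    return st;
}

/* Local write: the line always ends up Modified; starting from I or S,
 * the Invalidate signal would be raised on the bus (not modelled here). */
mesi_state on_local_write(mesi_state st) {
    (void)st;
    return M;
}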


Update Strategy / Cache Coherency

MESI state diagram (local transitions)

[Figure: MESI state diagram showing the local-access transitions between I, S, E, and M, with edges labelled Read-Hit, Write-Hit, Write-Miss, Shared Read-Miss, and Exclusive Read-Miss.]

1 Cache line is copied back to the memory block (line flush).
2 Corresponding lines in other caches are invalidated.

Update Strategy / Cache Coherency

MESI state diagram

Remote state transitions (triggered by bus snooping).

[Figure: remote MESI state transitions, with edges labelled "Snoop-Hit on a Read" and "Snoop-Hit on a Write".]

3 The Retry signal is set; then the cache line is copied back to the memory block.



Development of Microprocessors


Development of Microprocessors

Higher clock rates
  Increased power consumption
    proportional to f and U
    higher frequency requires higher voltage
  Small structures: energy loss through leakage currents
    increases heat output and cooling requirements
  Limited chip size (speed of light)
  At a fixed technology (e.g. 32 nm):
    smaller number of transistor levels per pipeline stage possible
    more, but simplified, pipeline stages (P4: >30 stages)
    higher penalty for pipeline stalls (on conflicts, e.g. branch misprediction)

Development of Microprocessors

More parallelism

Increased bit width (now: 64 bit architectures)


SIMD

Instruction Level Parallelism (ILP)


exploits parallelism found in an instruction stream
limited by data/control dependencies
can be increased by speculation
modern superscalar processors can hardly get any better


Development of Microprocessors

More parallelism

Thread Level Parallelism (TLP)
  Hardware multithreading (e.g. SMT: Hyper-Threading)
    - better exploitation of superscalar execution units
  Multiple cores
    Legacy software must be parallelized
    - a challenge for the whole software industry
    - Intel tried to delay this development (see P4 clock rates)


Development of Microprocessors

More parallelism

Data Level Parallelism (vector registers)


Increasing width due to increasing # of transistors
- Initially from multimedia (MMX, graphics cards, ...)
GPU: Hundreds of shader cores
Software must be adapted
- Parallelizing compilers
- Vector intrinsics (not compatible!)
- New language extensions (Parallel Building Blocks, CUDA,
OpenCL, ...)
- Growing number of shader cores / increasing vector width
- SSE: 128bit, AVX: 256bit, KNF: 512bit, ...

Multi-Core Architectures

SMPs on a single chip

Chip Multi-Processors (CMP)

Advantage

Efficient exploitation of available transistor budget


Improves throughput / speed of parallelized applications
Allows tight coupling of cores
better communication between cores than in SMP
shared caches

Lower power consumption


lower clock rates
idle cores can be suspended


Multi-Core Architectures

SMPs on a single chip, Chip Multi-Processors (CMP)

Disadvantage

Only improves speed of parallelized applications


Increased gap to memory speed


Multi-Core Architectures: Design Issues

homogeneous vs. heterogeneous


specialized accelerator cores
- SIMD (vector)
- GPU operations
- cryptography
- DSP functions (e.g. FFT)
- FPGA (programmable circuits)
access to memory
- private memory area (distributed memory)
- via cache hierarchy (shared memory)

Connection between cores


internal bus / cross bar connect
cache architecture

Multi-Core Architectures: Examples


[Figure: two example designs.
 Left (homogeneous with shared caches and cross bar): four cores, each with a private L1 cache, two shared L2 caches, a shared L3 cache, and I/O.
 Right (heterogeneous with caches, local store and ring bus): one core (2x SMT) with L1 and L2 caches plus several cores with local stores, an L3 cache, and I/O, connected by a ring bus.]


Shared Cache Design


[Figure: left - traditional design: multiple single-core processors, each with a private L1 cache, reaching a shared off-chip L2 cache and memory through a switch; right - multicore architecture: cores with private L1 caches sharing an on-chip L2 cache, connected through a switch to memory.]

Traditional design: multiple single-cores with the shared cache off-chip.
Multicore architecture: shared caches on-chip.


Shared Caches: Advantages

No coherency protocol required at shared cache level

Lower communication latency

Processors with overlapping working set

One processor may prefetch data for the other

Smaller cache size required

Better usage of loaded cache lines before eviction (spatial locality)

Less congestion on limited memory connection

Dynamic sharing

if one processor needs less space, other can use more



Shared Caches: Disadvantages

Multiple CPUs lead to more complex requirements:
  higher bandwidth
  the cache should be larger (larger means higher latency)
  hit latency is higher due to the switch logic above the cache
  the design is more complex
  one CPU might evict another CPU's data


Synchronization


Synchronization

Explicit synchronization among processes is required to ensure a specific execution order of their operations.
Three types of synchronization:
  Mutual exclusion:
    access to a critical code region is restricted.
  Point-to-point events:
    processes signal other processes that they have reached a specific point of execution.
  Global events:
    an event constituting that a set of processes has reached a specific point of execution.

Components of a Synchronization Event


Acquire method
  A method by which a process tries to acquire the right to synchronize, e.g. to enter a critical section or to proceed past the event synchronization point.
Waiting algorithm
  A method by which a process waits for the synchronization to become available.
Release method
  A method for a process to enable other processes to proceed past a synchronization event.

Waiting Algorithms
Busy-waiting: the process spins, waiting for a variable to change.
Blocking: the process suspends itself and is released by the OS.

Trade-offs:

Blocking has higher overhead due to OS involvement

Blocking frees the processor

Busy-waiting consumes memory bandwidth while waiting

Hybrid waiting strategies combine both approaches.
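A minimal sketch of such a hybrid strategy, assuming Pthreads and C11 atomics (the event_t type and the function names are illustrative, not from the lecture):

#include <pthread.h>
#include <stdatomic.h>

/* Illustrative event type: a flag plus a mutex/condition variable. */
typedef struct {
    atomic_int      flag;
    pthread_mutex_t lock;
    pthread_cond_t  cond;
} event_t;

event_t ev = { 0, PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER };

void event_wait(event_t *e) {
    for (int i = 0; i < 10000; i++)            /* phase 1: busy-wait */
        if (atomic_load(&e->flag)) return;
    pthread_mutex_lock(&e->lock);              /* phase 2: block in the OS */
    while (!atomic_load(&e->flag))
        pthread_cond_wait(&e->cond, &e->lock);
    pthread_mutex_unlock(&e->lock);
}

void event_signal(event_t *e) {
    pthread_mutex_lock(&e->lock);
    atomic_store(&e->flag, 1);
    pthread_cond_broadcast(&e->cond);          /* wake up blocked waiters */
    pthread_mutex_unlock(&e->lock);
}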


Mutual Exclusion - Locks


Hardware locks
  Special bus lines for locks: holding a line means holding the lock.
  Hardware locks were mainly used for implementing higher-level software locks in memory.

Simple software lock

lock:   ld   register, location   // copy location to register
        cmp  register, #0         // compare with 0
        bnz  lock                 // if not 0, try again
        st   location, #1         // store 1 into location
        ret                       // return to caller

unlock: st   location, #0         // write 0 to location
        ret                       // return to caller

Note: the load and the store are separate operations, so two processors can both read 0 and both acquire the lock; this motivates the atomic operations on the next slides.

Atomic Operations
Definition:
  An operation during which a processor can simultaneously read a location and write to it in the same bus operation.
  This prevents any other processor or I/O device from writing or reading memory until the operation is complete.
  Atomic implies indivisibility and irreducibility, so an atomic operation must be performed entirely or not at all.


Locking with Atomic Test&Set Operation


Atomic Test&Set operations do not allow intervening accesses to the same location.
T&S loads the location's content into a register and assigns 1 to the location.

lock:   t&s  register, location   // copy location to register
                                  // and assign 1 to location
        bnz  register, lock       // compare old value with 0; if not 0, try again
        ret                       // return to caller

unlock: st   location, #0         // write 0 to location
        ret                       // return to caller
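For comparison, a minimal sketch of the same spin lock written with C11 atomics (illustrative code, not from the lecture; atomic_flag_test_and_set plays the role of the atomic t&s instruction):

#include <stdatomic.h>

atomic_flag lock_flag = ATOMIC_FLAG_INIT;   /* the "location", initially 0 */

void lock(void) {
    /* spin while the previous value was 1, i.e. the lock was already held */
    while (atomic_flag_test_and_set(&lock_flag))
        ;
}

void unlock(void) {
    atomic_flag_clear(&lock_flag);          /* write 0 to the location */
}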

Other Atomic Operations


Compare&swap: Exchanges value in location with the
value in register
Fetch&op: Fetches value from location and writes value
obtained by applying the operation, e.g.
fetch&increment
fetch&add
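A minimal sketch of both primitives using C11 atomics (illustrative names, not from the lecture):

#include <stdatomic.h>
#include <stdbool.h>

atomic_int location = 0;

/* fetch&add: atomically adds 'inc' and returns the old value. */
int fetch_and_add(int inc) {
    return atomic_fetch_add(&location, inc);
}

/* compare&swap: writes 'desired' only if the location still holds
 * 'expected'; returns true on success. */
bool compare_and_swap(int expected, int desired) {
    return atomic_compare_exchange_strong(&location, &expected, desired);
}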


Global (Barrier) Event Synchronization

Centralized software implementation

struct bar_type {
    int counter;
    struct lock_type lock;
    int flag = 0;
} bar_name;

Barrier(bar_name, p) {
    lock(bar_name.lock);                 // lock barrier
    if (bar_name.counter == 0)
        bar_name.flag = 0;               // reset flag if first to reach
    mycount = ++bar_name.counter;        // mycount is a private variable
    unlock(bar_name.lock);
    if (mycount == p) {                  // last to arrive?
        bar_name.counter = 0;            // reset counter for next barrier
        bar_name.flag = 1;               // release waiting processes
    } else
        while (bar_name.flag == 0) {}    // busy-wait for release
}

Hardware Barriers

Synchronization bus
  The barrier is simply a wired-AND of lines.
  A processor sets its input high when it reaches the barrier and waits until the output goes high.
Advantageous if barriers are executed frequently, e.g. with automatic parallelization.
Complications:
  synchronization among a subgroup of processors
  migration of processes
  multiple processes on a single processor


Synchronization Summary

Some bus-based machines have provided full hardware support for synchronization.
Limited flexibility leads to supporting only simple atomic operations in hardware.
Higher-level primitives can easily be built on top of those.
Synchronization primitives are supported by libraries.


Parallel Programming


Application Programming with OpenMP


!$omp parallel do schedule(static)
do i=1,1000
  do j=1,1000
    b(i,j) = (a(i-1,j) + a(i+1,j))/2
  enddo
enddo
!$omp end parallel do

www.openmp.org
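For reference, a rough C equivalent of the loop above (a sketch; array dimensions and names are assumed):

#include <omp.h>
#define N 1000

double a[N+2][N+2], b[N+2][N+2];

void smooth(void) {
    /* the iterations over i are independent and are divided
       statically among the threads */
    #pragma omp parallel for schedule(static)
    for (int i = 1; i <= N; i++)
        for (int j = 1; j <= N; j++)
            b[i][j] = (a[i-1][j] + a[i+1][j]) / 2.0;
}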


Multicore Architectures

SMPs on a single chip

Motivation
  Big gap between CPU and memory speed
  High clock frequencies limit chip size
  Long pipelines introduce a big penalty for misprediction
  Energy consumption is proportional to f and U; it increases heat output and cooling requirements
  ILP is limited
Multicore as a solution
  Improves throughput and the speed of parallelized applications
  Better communication between cores than in an SMP
  Saves energy

Multicore Processors

IBM Power 7

SUN UltraSparc

AMD Magny Cours / Bulldozer

Intel Pentium 4 D (Netburst...)

Intel Itanium (Montecito)

Intel Core Duo, Core 2

Intel Nehalem

Intel Sandy Bridge

Intel MIC (Knights Ferry, Knights Corner, and beyond...)


Intel Itanium 2 Dual Core

Two Itanium 2 cores

Multi-threading (2 Threads)
Simultaneous multi-threading for memory hierarchy resources
Temporal multi-threading for core resources
- Besides end of time slice, an event, typically an L3 cache miss,
might lead to a thread switch.

Caches
L1D 16 KB, L1I 16 KB
L2D 256 KB, L2I 1 MB (Itanium 2 unified 256 KB)
L3 12 MB (Itanium 2 9 MB)

Caches private to cores

1.7 billion transistors



Distributed Memory Multiprocessors


Distributed Memory Architecture


Interconnect Properties

Cost / Complexity
  Hardware effort
  Number of switching elements and interconnect lines
Throughput / Bandwidth
  Amount of data that can be transferred across the interconnect within a given period of time.
Bisection bandwidth
  If the network is segmented into two equal parts, this is the bandwidth between the two parts.
  Typically refers to the worst-case segmentation.
Latency
  Time between the start of a message send and the reception of its first bit.


Structural Properties

Connect degree
  # of (direct) connections to other nodes
  A higher connect degree means higher cost.
Network diameter
  Maximum # of connections that need to be used for communication between two nodes (maximum path length).
  Latency increases with the diameter.
  A lower diameter requires a higher connect degree.


Example

Bisection bandwidth = 4 x the bandwidth of the individual links

Interconnect Properties

Extendability
  Unlimited, or up to a maximum # of nodes.
  Continuous, or in given increments of nodes (e.g. powers of 2).
Scalability
  Essential properties must be maintained when increasing the # of nodes.
  More precisely:
    Latency should not increase considerably.
    Bisection bandwidth should increase proportionally.
    Network cost should not increase significantly.


Routing

Non-blocking:
  A connection between two nodes can be set up at any time (regardless of already existing connections).
Fault tolerance:
  Communication between two nodes is still possible when transmission lines or switches have failed.
  Requires more than one path between two nodes.
Routing complexity:
  Routing should be implemented by the node/switch hardware.


Classes of Interconnect

Direct interconnect:
  A node is connected to each switch.
  E.g. ring, mesh, hypercube
Indirect interconnect:
  Nodes are connected to a combinatorial switching circuit.
  E.g. bus, crossbar


Direct Interconnect

[Figure: ring (1D), mesh (2D), and hypercube (4D) topologies.]


Indirect Interconnect

Crossbar Switch
  Two disjoint sets of nodes (e.g. processors and memory modules).
  All possible disjoint pairs can be connected at any time.
  Non-blocking.
  Hardware effort: n^2 switching elements for n elements per set.


Properties of Some Topologies


Topology   | Degree    | Diameter      | Ave Dist   | Bisection | Ave Dist (N=1024)
-----------|-----------|---------------|------------|-----------|------------------
1D Array   | 2         | N-1           | N/3        | 1         | huge
1D Ring    | 2         | N/2           | N/4        | 2         | huge
2D Mesh    | 4         | 2(N^1/2 - 1)  | 2/3 N^1/2  | N^1/2     | 21
2D Torus   | 4         | N^1/2         | 1/2 N^1/2  | 2 N^1/2   | 16
Hypercube  | n = log N | n             | n/2        | N/2       | n/2 (= 5)


Properties of Distributed Memory Architectures

No shared memory or shared address space.
Communication through message passing.
Dedicated libraries for portable parallel code, e.g. MPI (Message Passing Interface).
The latency imposed by software is usually much higher than that imposed by hardware.
Interconnect: direct or indirect.


Cabling of Crossbars

Single Program Multiple Data (SPMD)

[Figure: a sequential program with a data distribution is turned into a sequential node program with message passing; identical copies run on P0 ... P3, distinguished only by their process identifications.]


Message Passing Programming


/* The Parallel Hello World Program */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int node;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &node);
    printf("Hello World from Node %d\n", node);
    MPI_Finalize();
    return 0;
}
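With a typical MPI installation the program can be compiled and started with, e.g., mpicc hello.c -o hello and mpirun -np 10 ./hello (the exact commands depend on the MPI distribution); each process then prints one line, in arbitrary order, as on the next slide.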

Message Passing Programming


Hello World from Node 2
Hello World from Node 0
Hello World from Node 4
Hello World from Node 9
Hello World from Node 3
Hello World from Node 8
Hello World from Node 7
Hello World from Node 1
Hello World from Node 6
Hello World from Node 5


Grid Computing


Introduction
Web: uniform naming of and access to documents (http://).
Grid: uniform, high-performance access to computational resources.
  On-demand creation of powerful virtual computing systems.

[Figure: a grid connecting software catalogs, computers, colleagues, and data archives.]

What is the Grid?

Term borrowed from the electricity grid.
Enables communities ("virtual organizations") to share geographically distributed resources as they pursue common goals.


Elements of the Problem

Resource sharing
  Computers, storage, sensors, networks, ...
  Sharing is always conditional: issues of trust, policy, negotiation, payment, ...
Coordinated problem solving
  Beyond client-server: distributed data analysis, computation, collaboration, ...
Dynamic, multi-institutional virtual organizations
  Community overlays on classic organizational structures
  Large or small, static or dynamic


Why Grids?

A biochemist exploits 10,000 computers to screen 100,000 compounds in an hour.
1,000 physicists worldwide pool resources for peta-op analyses of petabytes of data.
Civil engineers collaborate to design, execute, & analyze shake-table experiments.
Climate scientists visualize, annotate, & analyze terabyte simulation datasets.
An emergency response team couples real-time data, weather models, and population data.


Why Grids? (cont'd)

A multidisciplinary analysis in aerospace couples code and data in four companies.
A home user invokes architectural design functions at an application service provider.
An application service provider purchases cycles from compute cycle providers.
Scientists working for a multinational soap company design a new product.
A community group pools members' PCs to analyze alternative designs for a local road.


Data Grids for High Energy Physics


[Figure: tiered data grid for high energy physics (image courtesy Harvey Newman, Caltech).
 The online system receives ~PBytes/sec from the detector (a bunch crossing every 25 nsecs, 100 triggers per second, each triggered event ~1 MByte) and delivers ~100 MBytes/sec to an offline processor farm of ~20 TIPS at the CERN Computer Centre (Tier 0; 1 TIPS is approximately 25,000 SpecInt95 equivalents).
 Tier 1 regional centres (France, Germany, Italy, FermiLab ~4 TIPS, Caltech ~1 TIPS) are fed at ~622 Mbits/sec (or air freight, deprecated); Tier 2 centres (~1 TIPS each) and institutes (~0.25 TIPS) with physics data caches connect at ~622 Mbits/sec, and physicist workstations (Tier 4) at ~1 MBytes/sec.
 Physicists work on analysis channels; each institute will have ~10 physicists working on one or more channels, and data for these channels should be cached by the institute server.]

The Global Grid Forum and the Globus Project

Global Grid Forum: development of standard protocols and APIs for Grid computing
  Development and promotion of standard Grid protocols to enable interoperability and shared infrastructure
  Development and promotion of standard Grid software APIs and SDKs to enable portability and code sharing
  www.gridforum.org

The Globus Toolkit: open-source reference software base for building grid infrastructure and applications
  www.globus.org

Job Submission

Crossgrid

European Grid Project - 2003

21 partners

www.eu-crossgrid.org

IST (Information Society Technology)

Focusing on interactive applications

One of the applications is flooding simulation


Threading


Things to Consider

Design approach and structure of your application

Threading API (e.g. PThreads, OpenMP, ...)

Compilers and Tools (icc / icpc / ifort 12.0)

Target platforms, i.e. your hardware (Core, Nehalem, ...)


What is a Thread?

Definition:
A thread is a sequence of related instructions that is
executed independently of other instruction sequences.

Every program has at least one thread!
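As a concrete illustration, a minimal Pthreads program that creates one additional thread (a sketch, not part of the original slides):

#include <pthread.h>
#include <stdio.h>

/* Function executed by the second thread. */
void *worker(void *arg) {
    printf("Hello from the second thread\n");
    return NULL;
}

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, worker, NULL);  /* the program now has two threads */
    pthread_join(t, NULL);                   /* wait for the worker to finish */
    printf("Hello from the main thread\n");
    return 0;
}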


Types of Threads

User-level thread:
  Thread created and manipulated by the application software.
Kernel-level thread:
  Thread implemented by the operating system.
Hardware thread:
  How threads appear to the execution resources in the hardware (on the processor).


Hello World in OpenMP


#include <omp.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int nthreads, tid;
    /* Fork a team of threads with their own copies of variables */
    #pragma omp parallel private(nthreads, tid)
    {
        tid = omp_get_thread_num();            /* Obtain thread number */
        printf("Hello World from thread = %d\n", tid);
        if (tid == 0) {                        /* Only master thread does this */
            nthreads = omp_get_num_threads();
            printf("Number of threads = %d\n", nthreads);
        }
    } /* All threads join master thread and disband */
    return 0;
}

OpenMP

Code has no function for explicit thread creation!


Threads are created automatically.
More on OpenMP if you play with your installation :) .


Processor, Processes, and Threads

A processor runs a program, which consists of one or more processes.
A process consists of one or more threads.
Processes are mapped to the MMU (memory).
Threads are mapped to processors (affinity).
Each process has its own (virtual) address space.


User-Level Threads -> Kernel-Level Threads

One-to-one (1:1), kernel-level threading (Linux NPTL & Win32).
Many-to-one (M:1), user-level threading (GNU Portable Threads).
Many-to-many (M:N), hybrid threading (high complexity, Windows 7).



Kernel Level Threading


User Level Threading


Hybrid Threading


User-Level Threads -> Physical Threads

The mapping can be specified through an affinity type, e.g. KMP_AFFINITY in Intel OpenMP:

Type = none (default)
  Does not bind OpenMP threads to particular thread contexts.
Type = compact
  Binds OpenMP thread <n>+1 to a free thread context as close as possible to the thread context where OpenMP thread <n> was bound.
Type = scatter
  Distributes the threads as evenly as possible across the entire system.
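In practice (assuming the Intel OpenMP runtime), the type is selected through the environment before the run, e.g. export KMP_AFFINITY=compact or export KMP_AFFINITY=scatter; further modifiers such as verbose depend on the compiler version.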


States of a Thread
[Figure: thread state diagram. A new thread enters the Ready state; the scheduler dispatches it to Running; an interrupt puts it back to Ready; waiting for an event moves it to Waiting; event completion returns it to Ready; on exit the running thread terminates.]


Parallel Programming Concepts


Designing your Program

Traditional way:
Program starts with main() and works through tasks
sequentially.
Only one thing is happening at any given moment!
Parallel approach:
Rethinking necessary for parallel design.
Decompose program by
Task,
Data, or
Data flow.


Task Decomposition

Gardening example with two gardeners:
  The first gardener mows the lawn,
  the second gardener weeds.
  Mowing and weeding are two separate, independent tasks.
  However, some coordination is required (the gardeners should not interfere with each other).
Programming example: OpenOffice Writer
  The user is entering text.
  Pagination happens simultaneously.
  Think of early versions of Word!


Data Decomposition

Gardening example with two gardeners:


First, both gardeners mow half the property.
Then, both weed half the flower beds.
Property and flower beds should be large enough so that
this decomposition makes sense.
Hence, problem sizes can be increased.
Programming example: OpenOffice Calc
Each thread recalculates half the values in a large
spreadsheet.
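A minimal OpenMP sketch of this kind of data decomposition (illustrative names; cells[] and recalc() stand in for the spreadsheet data and the per-cell recalculation):

#include <omp.h>
#define NCELLS 1000

double cells[NCELLS];

/* Hypothetical recalculation of one cell. */
static void recalc(double *c) { *c = *c * 1.01; }

void recalculate_all(void) {
    /* Data decomposition: each of two threads handles half of the cells. */
    #pragma omp parallel num_threads(2)
    {
        int half  = omp_get_thread_num();   /* 0 or 1 */
        int start = half * (NCELLS / 2);
        int end   = start + NCELLS / 2;
        for (int i = start; i < end; i++)
            recalc(&cells[i]);
    }
}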


Data Flow Decomposition

Gardening example with two gardeners:


First gardener prepares the tools (i.e. put gas in the mower,
clean shears, etc.) for both.
Second gardener is delayed.
Then parallel work can begin.
Programming example: Compiler
An input file must be parsed.
Then code generation begins.


Challenges

Threads let you improve performance (hopefully significantly!).

However, complexity increases (significantly!).

Four types of problems:

Synchronization (coordinate threads' activities, e.g. one thread waits for


another to finish).

Communication: bandwidth/latency issues (see earlier slides :-) ).

Load balancing: Distribute work evenly among threads.

Scalability: Obtain good parallel efficiency (see earlier slides :-) ).


Parallel Programming Patterns

Task Level Parallelism:


Focus on the tasks themselves, i.e. decompose problem into set of independent tasks.
Dependencies might need to be removed.
Embarrassingly parallel problems: no dependencies between threads.
Divide and Conquer:
Problem is divided into parallel sub-problems.
These are solved independently.
Results are aggregated into final solution.
Example: Merge sort.
Geometric Decomposition:
Parallelization of data structures.
Each thread works on data chunks.
Example: Heat flow, wave propagation.
Pipeline:
Think of assembly lines!
Computation is broken down into stages (like in a processor :) ).
Wavefront:
Data elements are processed along a diagonal in a 2D grid.


Example: Floyd-Steinberg Dithering


Example: Floyd-Steinberg Dithering

Assume an eight-bit grayscale image to start with.
If the value is in the range [0, 127], set PixelValue to 0.
If the value is in the range [128, 255], set PixelValue to 1.
For each pixel, determine the error, i.e. the deviation from its original value:
  err = OriginalValue - 255 * PixelValue
Finally, propagate the error to the neighbours for the next iteration according to the following distribution (relative to the current pixel *):

        *    7/16
  3/16 5/16  1/16


Example: Floyd-Steinberg Dithering


void floyd_steinberg(unsigned int width, unsigned int height,
                     unsigned short **InputImage, unsigned short **OutputImage)
{
    for (unsigned int i = 0; i < height; i++)
        for (unsigned int j = 0; j < width; j++)
        {
            if (InputImage[i][j] < 128)
                OutputImage[i][j] = 0;
            else
                OutputImage[i][j] = 1;
            int err = InputImage[i][j] - 255 * OutputImage[i][j];
            /* distribute the error to the neighbours (border handling omitted) */
            InputImage[i][j+1]   += err * 7/16;
            InputImage[i+1][j-1] += err * 3/16;
            InputImage[i+1][j]   += err * 5/16;
            InputImage[i+1][j+1] += err * 1/16;
        }
}


Example: Floyd-Steinberg Dithering

Looks sequential at first glance:
  the previous pixel's error must be known to compute the next pixel's value.
Look at it from the receiving pixel's perspective (weights of the neighbours it receives error from, current pixel *):

  1/16 5/16 3/16
  7/16   *


Producer-Consumer like Approach


[Figure: image rows assigned to Thread 1 - Thread 4; each thread starts on its row once the pixels it depends on in the row above have been processed, so the threads advance in a staggered, producer-consumer fashion.]


Performance Tuning


Performance Tuning

Toolchain

Intel Parallel Tools

Intel Composer XE 2011


Intel C++ Compiler
Intel Visual Fortran Compiler
Intel Integrated Performance Primitives
Intel Math Kernel Library
Intel Parallel Building Blocks

Intel VTune Amplifier XE 2011


(Formerly Intel VTune Performance Analyzer with Intel Thread Profiler)

Intel Inspector XE 2011 (Formerly Intel Thread Checker)


Intel Compilers & Libraries


Latest compiler version: 12
  Optimized for the latest micro-architectures (up to Sandy Bridge), including 128- and 256-bit vectorization.
  Supports C, C++, and FORTRAN.
Integrated Performance Primitives (IPP): functions for multimedia, data processing, and communications.
Math Kernel Library (MKL): provides numerical functions (BLAS, LAPACK, ScaLAPACK, Sparse Solvers, Fast Fourier Transforms, Vector Math) optimized for each architecture.
Parallel Building Blocks: CilkPlus, ArBB, TBB, ...

Intel Amplifier XE
Better known as VTune :)

Tool for software performance analysis for x86 and x64 based
architectures

Both GUI and command line interfaces.

Available for both Linux and Windows.

Supports C, C++, and FORTRAN


Intel Libraries/Extensions

Integrated Performance Primitives (IPP): functions for multimedia, data processing, and communications.
Math Kernel Library (MKL): provides numerical functions (BLAS, LAPACK, ScaLAPACK, Sparse Solvers, Fast Fourier Transforms, Vector Math) optimized for each architecture.
Parallel Building Blocks: CilkPlus, ArBB, TBB, ...
