Lecture at UFRO
February/March 2011
Carsten Trinitis
Lehrstuhl für Rechnertechnik und Rechnerorganisation (LRR)
Institut für Informatik, Technische Universität München
Germany
LRR-TUM, 2011
Microprocessor
Parallel Architectures
Literature
Books
Multiprocessors on a chip - multiple processors

[Figure: several processors connected via a bus / network to memory]
Performance Goal

Speedup:

speedup(p processors) = performance(p processors) / performance(1 processor)
                      = time(1 processor) / time(p processors)

Efficiency:

efficiency(p processors) = speedup(p processors) / p

Measured via throughput (tpm, e.g. transactions per minute):

speedup(p processors) = tpm(p processors) / tpm(1 processor)
Classification

Parallel Systems
- SIMD
  - Array
  - Vector
- MIMD
  - Distributed Memory
    - MPP
    - NOW
    - Cluster
  - Shared Memory
    - UMA
    - NUMA
      - ccNUMA
      - nccNUMA
    - COMA
Classification

Parallel systems

SIMD (Single Instruction Multiple Data):
- Array Processors: Synchronized execution of the same instruction on a set of ALUs.
- Vector Processors: High-end processors performing vector operations.
MIMD computers

Distributed Memory - DM (multicomputer):
Shared Memory

Uniform Memory Access - UMA (symmetric multiprocessors - SMP):
NUMA Systems
Cache-coherent NUMA - ccNUMA:
Shared Memory
- Global addresses
- Explicit synchronization

Message Passing
- Explicit exchange of messages
- Implicit synchronization
- Communication hardware
Lecture @UFRO, March 2011
09/03/11
Communication Architecture (layers, top to bottom):
- Programming model, including the communication abstraction
- Compiler, libraries
- System interface, OS
- Hardware interface
- Hardware
[Figure: shared-memory multiprocessor - processors P0 … Pn, memory, and an I/O device on a common interconnect]
One program
Multiple programs
write buffers
cache
...
Cache Coherency
Cache Coherency
Update strategies
Write-Through
Copy-Back
Cache Coherency
Write-Through
Any ideas :) ?
Cache Coherency
Write-Through
Buffered-Write-Through
Buffer between cache and memory.
Cache control can initiate subsequent cache accesses before the data has
been written to memory.
Easy to implement.
Overlap program execution and memory update.
Cache Coherency
Write-Through
No-Write-Allocation
Upon write-miss, only memory is updated.
Write-Allocation
Upon write miss, cache and memory are updated.
Cache Coherency
Copy-Back (Write-Back)
Cache Coherency

Actions on cache events (V = valid bit, D = dirty bit):

Write-Through:
- Write-Hit: CPU data --> cache, memory
- Write-Miss: CPU data --> cache, memory

Copy-Back with Write-Allocation:
- Write-Miss:
  IF D==1: cache line --> memory
  memory block, tag --> cache; V=1
  CPU data --> cache; D=1
Cache Coherency
Bus-Snooping

Cache control monitors the bus for other masters' accesses:
- Write-Through: upon a Write-Hit from another master, set the cache entry to invalid.
- Copy-Back: upon a Write-Hit, set the cache entry to invalid, or update it and set it to dirty.
Cache Coherency
Multiprocessor systems:
E.g. CC-NUMA
Directory based cache coherency
[Figure: two nodes, each with CPUs, caches, and local main memory, forming a shared address space]
MESI protocol

States:
- Invalid (I)
- Shared (S)
- Exclusive (E)
- Modified (M)
MESI protocol

Status bits:

Invalid (I): Cache line is invalid.
- Read/write access to this line triggers loading of the memory block into the cache line.
- Other caches indicate through the shared signal whether they hold this block:
  1: Shared Read Miss
  0: Exclusive Read Miss
- State changes to S or E accordingly.
- Upon a Write Miss, state changes to M and the Invalidate signal is set.
MESI protocol

Status bits:

Shared (S): Memory block exists in the local cache line and may exist in other caches.
- Read-Hit: state is not changed.
- Write-Hit: cache line is updated, state changes to M. The Invalidate signal is set; other caches holding this line change its state S --> I.
MESI protocol

Status bits:

Exclusive (E): Memory block exists only as this local copy.
- Processor can read and write without bus access.
- Write access: state changes to M.
- Other caches are not affected.
MESI protocol

Status bits:

Modified (M), Exclusive Modified: Memory block exists only as this local copy and has been modified.
- Processor can read and write without bus access.
- Upon read/write access from another processor (snoop hit), the line must be written back to the memory block. State changes to S or I.
- A processor that wants to access this memory block is signalled to wait through Retry.
[Figure: MESI state transition diagram - Read-/Write-Hits, Shared/Exclusive Read-Misses, and Write-Misses (w.r.) move a line between I, S, E, and M; a Snoop-Hit on a read moves M to S, a Snoop-Hit on a write invalidates the line]
Development of Microprocessors

More parallelism
Multi-Core Architectures
Advantage
Multi-Core Architectures
Disadvantage
[Figure: two multi-core designs - left: homogeneous cores with private L1 caches, shared L2/L3 caches, I/O, and a crossbar; right: heterogeneous cores with local stores, L2/L3 caches, I/O, and a ring bus]

Homogeneous with shared caches and cross bar
Heterogeneous with caches, local store and ring bus
[Figure: left - multiple single-core processors, each with a private L1 cache and a switch, sharing off-chip L2 caches and memory; right - multicore architecture with cores sharing on-chip L2 caches]

Traditional design: multiple single-cores with shared cache off-chip
Multicore architecture: shared caches on-chip
Dynamic sharing

Higher bandwidth
Synchronization
Synchronization
Point-to-point events:
Processes signal other processes that they have reached a specific
point of execution.
Global events:
Event constituting that a set of processes has reached a specific
point of execution.
Waiting algorithm
Release method
Waiting Algorithms
Busy-waiting: spins for a variable to change
Blocking: suspends and is released by the OS
Trade-offs:
Special bus lines for locks: holding a line means holding the lock

lock:   ld  register, location    // copy location to register
        cmp register, #0          // compare with 0
        bnz lock                  // branch back if lock is held
        st  location, #1          // write 1 to location (acquire lock)
        ret                       // return to caller
unlock: st  location, #0          // write 0 to location
        ret                       // return to caller
Atomic Operations
Definition:
lock:   t&s register, location    // copy location to register
                                  // and assign 1 to location (atomically)
        bnz lock                  // spin until the lock was free
        ret                       // return to caller
unlock: st  location, #0          // write 0 to location
        ret                       // return to caller
struct bar_type {
    int counter;
    struct lock_type lock;
    int flag;                          /* initially 0 */
} bar_name;

Barrier(bar_name, p) {
    lock(bar_name.lock);               // lock barrier
    if (bar_name.counter == 0)
        bar_name.flag = 0;             // reset flag if first to reach
    mycount = ++bar_name.counter;      // mycount is a private variable
    unlock(bar_name.lock);
    if (mycount == p) {                // last to arrive?
        bar_name.counter = 0;          // reset counter for next barrier
        bar_name.flag = 1;             // release waiting processes
    } else
        while (bar_name.flag == 0) {}  // busy-wait for release
}
Hardware Barriers
Synchronization bus
Migration of processes
Synchronization Summary
Parallel Programming
PARALLEL DO
www.openmp.org
Multicore Architectures
Motivation
ILP is limited
Multicore as solution
Saves energy
Multicore Processors
IBM Power 7
SUN UltraSparc
Intel Nehalem
Multi-threading (2 Threads)
Simultaneous multi-threading for memory hierarchy resources
Temporal multi-threading for core resources
- Besides the end of a time slice, an event, typically an L3 cache miss,
  may lead to a thread switch.
Caches
L1D 16 KB, L1I 16 KB
L2D 256 KB, L2I 1 MB (Itanium 2 unified 256 KB)
L3 12 MB (Itanium 2 9 MB)
Interconnect Properties
Cost / Complexity
Hardware effort
Throughput / Bandwidth
Bisection bandwidth
Latency
Structural Properties
Connect degree
Network diameter
Example

Bisection bandwidth = 4 x the bandwidth of the individual links
Interconnect Properties
Extendability
Scalability
More precisely:
Latency should not be increased considerably.
Routing
Non blocking:
Fault tolerance:
Routing complexity:
Classes of Interconnect
Direct interconnect:
Indirect interconnect:
Direct Interconnect
Ring (1D)
Mesh (2D)
Hypercube (4D)
Indirect Interconnect

Crossbar Switch
- Non-blocking.
- Hardware effort: n^2 switching elements for n connected elements per side.
Topology     Degree      Diameter        Avg. distance     Bisection width
1D array     2           N-1             ~N/3              1
1D ring      2           N/2             ~N/4              2
2D mesh      4           2(N^1/2 - 1)    ~2/3 N^1/2        N^1/2
2D torus     4           N^1/2           1/2 N^1/2         2 N^1/2
Hypercube    n = log N   n               n/2               N/2
Cabling of Crossbars

[Figure: crossbar cabling between processors P0, P1, P2, P3]
[Figure: "World from Node n" views for nodes 0-9]
Grid Computing
Introduction

Web (http://): uniform naming and access to documents.

Grid: uniform, high-performance access to computational resources.
- On-demand creation of powerful virtual computing systems
- Resources: software catalogs, computers, colleagues, data archives
Resource sharing
Why Grids?
[Figure: LHC multi-tier computing model - an online system (~20 TIPS) feeds Tier 0 at ~100 MBytes/sec (or air freight, deprecated); Tier 1 regional centres (France, Germany, Italy, FermiLab ~4 TIPS, Caltech ~1 TIPS) connect at ~622 Mbits/sec; Tier 2 centres (~1 TIPS each) serve institutes (~0.25 TIPS) with physics data caches and physicist workstations (Tier 4, ~1 MBytes/sec). Image courtesy Harvey Newman, Caltech]
www.gridforum.org
Job Submission
Crossgrid
21 partners
www.eu-crossgrid.org
Threading
Things to Consider
What is a Thread?
Definition:
A thread is a sequence of related instructions that is
executed independently of other instruction sequences.
Types of Threads
Hardware thread: how threads appear to execution resources in the hardware (on the processor).
Operational Flow
OpenMP
Hybrid Threading
Type = compact
Binds OpenMP thread <n>+1 to a free thread context as close as possible to
the context where OpenMP thread <n> was bound.

Type = scatter
Distributes the threads as evenly as possible across the entire system.
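These types are selected through the Intel OpenMP runtime's KMP_AFFINITY environment variable; a minimal sketch (the program name is a placeholder):

```shell
# Pin threads in the Intel OpenMP runtime before launching the program;
# "./my_openmp_program" is a placeholder name.
export OMP_NUM_THREADS=8
export KMP_AFFINITY=compact   # pack threads onto neighbouring contexts
./my_openmp_program

export KMP_AFFINITY=scatter   # spread threads across the whole system
./my_openmp_program
```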
States of a Thread

[Figure: thread state diagram - New --(Enter)--> Ready --(Scheduler Dispatch)--> Running; Running --(Interrupt)--> Ready; Running --(Event Wait)--> Waiting; Waiting --(Event Completion)--> Ready; Running --(Exit)--> Terminate]
Traditional way:
Program starts with main() and works through tasks
sequentially.
Only one thing is happening at any given moment!
Parallel approach:
Rethinking necessary for parallel design.
Decompose program by
Task,
Data, or
Data flow.
Task Decomposition
114
Data Decomposition
Challenges
[Figure: Floyd-Steinberg error diffusion - a pixel's quantization error err is distributed to its neighbours with weights 7/16 (right, "+= err * 7/16"), 3/16 (below-left), 5/16 (below, "+= err * 5/16"), and 1/16 (below-right)]
Performance Tuning

Toolchain
Intel Amplifier XE
Better known as VTune :)
Tool for software performance analysis for x86- and x64-based architectures.
Intel Libraries/Extensions