5th Edition
Chapter 6: Parallel Processors from Client to Cloud
6.1 Introduction
Multiprocessors
Scalability, availability, power efficiency
Multicore microprocessors
Hardware
Software
Parallel Programming
Difficulties:
Partitioning
Coordination
Communications overhead
Synchronization
Subword Parallelism
Cache Coherence
Amdahl's Law
Example: aiming for 90× speedup on 100 processors
Speedup = 1 / ((1 − F_parallelizable) + F_parallelizable/100) = 90
Solving: F_parallelizable ≈ 0.999, i.e. the sequential part can be at most about 0.1% of the original execution time
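The algebra above can be checked numerically. A short Python sketch (the function names are mine) evaluates Amdahl's formula and solves it for the parallelizable fraction needed to reach a given speedup:

```python
def speedup(f_par, n_procs):
    """Amdahl's Law: the serial fraction (1 - f_par) runs as-is,
    the parallelizable fraction is divided across n_procs."""
    return 1.0 / ((1.0 - f_par) + f_par / n_procs)

def fraction_needed(target, n_procs):
    """Solve speedup(f, n_procs) = target for f."""
    return (1.0 - 1.0 / target) / (1.0 - 1.0 / n_procs)

f = fraction_needed(90, 100)  # fraction that must be parallelizable for 90x on 100 processors
```

Running `fraction_needed(90, 100)` gives about 0.999, matching the solution above.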
Scaling Example
Strong scaling: problem size fixed, as in the example above
Weak scaling: problem size grows in proportion to the number of processors
10 processors, 10 × 10 matrix: Time = 10 × tadd + 100/10 × tadd = 20 × tadd
100 processors, 32 × 32 matrix: Time = 10 × tadd + 1000/100 × tadd = 20 × tadd
Constant performance in this case
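Both times follow from one formula: 10 sequential scalar additions plus an n × n matrix sum divided across the processors. A quick sketch (note the slide rounds 32 × 32 = 1024 down to 1000):

```python
def time_tadd(n_procs, n):
    """Execution time in units of t_add: 10 sequential scalar adds,
    plus n*n parallelizable adds split evenly across n_procs."""
    return 10 + n * n / n_procs

t_strong = time_tadd(10, 10)    # 20.0 units of t_add
t_weak = time_tadd(100, 32)     # about 20.2 units of t_add
```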
An alternate classification

                                      Data Streams
                              Single                  Multiple
Instruction    Single         SISD:                   SIMD:
Streams                       Intel Pentium 4         SSE instructions of x86
               Multiple       MISD:                   MIMD:
                              No examples today       Intel Xeon e5345
Example: DAXPY (Y = a × X + Y)

Conventional MIPS code:
      l.d     $f0,a($sp)      ;load scalar a
      addiu   r4,$s0,#512     ;upper bound of what to load
loop: l.d     $f2,0($s0)      ;load x(i)
      mul.d   $f2,$f2,$f0     ;a × x(i)
      l.d     $f4,0($s1)      ;load y(i)
      add.d   $f4,$f4,$f2     ;a × x(i) + y(i)
      s.d     $f4,0($s1)      ;store into y(i)
      addiu   $s0,$s0,#8      ;increment index to x
      addiu   $s1,$s1,#8      ;increment index to y
      subu    $t0,r4,$s0      ;compute bound
      bne     $t0,$zero,loop  ;check if done

Vector MIPS code:
      l.d     $f0,a($sp)      ;load scalar a
      lv      $v1,0($s0)      ;load vector x
      mulvs.d $v2,$v1,$f0     ;vector-scalar multiply
      lv      $v3,0($s1)      ;load vector y
      addv.d  $v4,$v2,$v3     ;add y to product
      sv      $v4,0($s1)      ;store the result
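The contrast between the two sequences — one element per loop iteration versus whole-vector operations — can be mirrored in a short Python sketch (the function names are mine):

```python
def daxpy_scalar(a, x, y):
    """One element per iteration, like the scalar MIPS loop."""
    for i in range(len(x)):
        y[i] = a * x[i] + y[i]
    return y

def daxpy_vector(a, x, y):
    """Whole-vector form, like the vector MIPS code:
    one vector-scalar multiply, then one vector add."""
    ax = [a * xi for xi in x]                     # mulvs.d
    return [axi + yi for axi, yi in zip(ax, y)]   # addv.d
```

Each vector instruction replaces an entire trip around the scalar loop, which is why the vector version needs no loop bookkeeping at all.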
Vector Processors
SIMD
Simplifies synchronization
Reduced instruction control hardware
Works best for highly data-parallel applications
Multithreading
Fine-grain multithreading: switch threads after each instruction
Coarse-grain multithreading: switch threads only on long stalls (e.g., L2-cache miss)
Simultaneous multithreading (SMT): schedule instructions from multiple threads in a multiple-issue processor
Multithreading Example
Future of Multithreading
Shared Memory

half = 100;
repeat
  synch();
  if (half%2 != 0 && Pn == 0)
    sum[0] = sum[0] + sum[half-1];
    /* Conditional sum needed when half is odd;
       Processor0 gets missing element */
  half = half/2; /* dividing line on who sums */
  if (Pn < half) sum[Pn] = sum[Pn] + sum[Pn+half];
until (half == 1);
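The reduction above can be simulated serially: each pass of the inner loop below plays the role of processors Pn = 0..half−1 acting in parallel after the barrier (synch). A Python sketch (function name is mine):

```python
def sum_reduction(sums):
    """Serial simulation of the shared-memory tree reduction:
    repeatedly halve the number of active 'processors', folding
    the top half of the partial sums onto the bottom half."""
    half = len(sums)
    while half != 1:
        if half % 2 != 0:
            sums[0] += sums[half - 1]  # Pn == 0 picks up the odd element
        half //= 2
        for pn in range(half):         # done by all Pn < half in parallel
            sums[pn] += sums[pn + half]
    return sums[0]
```

After log2(100) ≈ 7 passes, sums[0] holds the total of all 100 partial sums.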
History of GPUs
3D graphics processing
GPU Architectures
Programming languages/APIs
DirectX, OpenGL
C for Graphics (Cg), High Level Shader Language (HLSL)
Compute Unified Device Architecture (CUDA)
Streaming multiprocessor: 8 × streaming processors
Streaming Processors
Executed in parallel, SIMD style: 8 SPs × 4 clock cycles
Hardware contexts for 24 warps: registers, PCs, …
Classifying GPUs

                                 Static: Discovered       Dynamic: Discovered
                                 at Compile Time          at Runtime
Instruction-Level Parallelism    VLIW                     Superscalar
Data-Level Parallelism           SIMD or Vector           Tesla Multiprocessor
Feature                                             Multicore with SIMD    GPU
SIMD processors                                     4 to 8                 8 to 16
SIMD lanes/processor                                2 to 4                 8 to 16
Multithreading hardware support for SIMD threads    2 to 4                 16 to 32
Ratio of single- to double-precision performance    2:1                    2:1
Largest cache size                                  8 MB                   0.75 MB
Size of memory address                              64-bit                 64-bit
Size of main memory                                 8 GB to 256 GB         4 GB to 6 GB
Memory protection at level of page                  Yes                    Yes
Demand paging                                       Yes                    No
Integrated scalar processor/SIMD processor          Yes                    No
Cache coherent                                      Yes                    No
Message Passing
Reduction
Grid Computing
Interconnection Networks
Network topologies
Bus
Ring
2D Mesh
N-cube (N = 3)
Fully connected
Multistage Networks
Network Characteristics
Performance
Link bandwidth
Total network bandwidth
Bisection bandwidth
Cost
Power
Routability in silicon
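Two of the performance measures above — total network bandwidth and bisection bandwidth — can be computed directly for the topologies shown earlier. A sketch assuming unit link bandwidth (function names are mine):

```python
def ring_bandwidth(p, link_bw=1):
    """A p-node ring has p links in total; cutting the network
    into two equal halves severs exactly 2 links."""
    total = p * link_bw
    bisection = 2 * link_bw
    return total, bisection

def fully_connected_bandwidth(p, link_bw=1):
    """Every node pair has a link: p*(p-1)/2 links in total;
    a half/half cut severs (p/2) * (p/2) links."""
    total = p * (p - 1) // 2 * link_bw
    bisection = (p // 2) * (p // 2) * link_bw
    return total, bisection
```

For 64 nodes, the ring gives total/bisection bandwidth of 64 and 2, while the fully connected network gives 2016 and 1024 — illustrating the bandwidth/cost trade-off between the two extremes.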
Job-level parallelism
Parallel Benchmarks
Code or Applications?
Traditional benchmarks
Modeling Performance
Roofline Diagram
Attainable GFLOPs/sec
= Min ( Peak Memory BW × Arithmetic Intensity, Peak FP Performance )
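The roofline formula is a one-liner: performance is capped by the memory roof (bandwidth × intensity) at low arithmetic intensity and by the compute roof (peak FP) at high intensity. A sketch with hypothetical machine parameters:

```python
def attainable_gflops(arithmetic_intensity, peak_mem_bw, peak_fp):
    """Roofline model: attainable performance is the lower of the
    memory-bound ceiling (GB/s x FLOPs/byte) and the compute ceiling."""
    return min(peak_mem_bw * arithmetic_intensity, peak_fp)

# Hypothetical machine: 16 GB/s memory bandwidth, 16 GFLOP/s peak FP.
low = attainable_gflops(0.5, 16.0, 16.0)   # memory-bound region
high = attainable_gflops(4.0, 16.0, 16.0)  # compute-bound region
```

The "ridge point" where the two roofs meet (here at intensity 1.0) marks the minimum arithmetic intensity needed to reach peak FP performance.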
Comparing Systems
Optimizing Performance
Optimize FP performance
Software prefetch
Memory affinity
Optimizing Performance
Arithmetic intensity is not always fixed
Caching reduces memory accesses and so increases arithmetic intensity
Rooflines
Benchmarks
Performance Summary
Multithreading DGEMM: use OpenMP
Multithreaded DGEMM
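OpenMP parallelizes DGEMM by splitting the outer row loop across threads with a parallel-for pragma. The same row partitioning can be sketched in Python (illustrative only: CPython threads share the GIL, so this shows the work division rather than a real speedup; function names are mine):

```python
from concurrent.futures import ThreadPoolExecutor

def dgemm(a, b):
    """Naive reference DGEMM: c = a @ b for square lists-of-lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def dgemm_threaded(a, b, n_threads=4):
    """Each task computes one contiguous block of rows of c,
    mirroring how '#pragma omp parallel for' splits the i-loop."""
    n = len(a)
    c = [[0.0] * n for _ in range(n)]

    def do_rows(lo, hi):
        for i in range(lo, hi):
            for j in range(n):
                c[i][j] = sum(a[i][k] * b[k][j] for k in range(n))

    chunk = (n + n_threads - 1) // n_threads
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        for lo in range(0, n, chunk):
            pool.submit(do_rows, lo, min(lo + chunk, n))
    return c  # pool exit waits for all row blocks to finish
```

The partitioning is safe without locks because each thread writes a disjoint set of rows of c.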
Fallacies
Pitfalls
Concluding Remarks