
EE 4504 Computer Organization
Section 11: Parallel Processing Overview

Introduction

This course has concentrated on single-processor architectures and techniques to improve upon their performance
– Efficient algebraic hardware implementations
– Enhanced processor operation through pipelined instruction execution and multiplicity of functional units
– Memory hierarchy
– Control unit design
– I/O operations
Through these techniques and implementation improvements, the processing power of a computer system has increased by an order of magnitude every 5 years
We are (still) approaching performance bounds due to physical limitations of the hardware

Multiple Processor Systems

At any given level of performance, the demand for higher performance machines has and will continue to exist
– Perform computationally intense applications faster and more accurately than ever before
Several approaches are possible
– Improve the basic performance of a single processor machine
» Architecture / organization improvements
» Implementation improvements
    SSI --> VLSI --> ULSI
    Clock speed
    Packaging
– Multiple processor system architectures
» Tightly coupled system
» Loosely coupled system
» Distributed computing system

Tightly coupled systems
– Multiple processors
– Shared, common memory system
– Processors under the integrated control of a common operating system
– Data is exchanged between processors by accessing common shared variable locations in memory
– Common shared memory ultimately presents an overall system bottleneck that effectively limits the sizes of these systems to a fairly small number of processors (dozens)
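As an illustration (mine, not part of the original notes), a minimal Python sketch of this shared-variable exchange, with threads standing in for the processors and a dict entry standing in for a shared memory location:

    import threading

    # A tightly coupled exchange in miniature: both workers read and
    # write the same location, with a lock standing in for the
    # hardware's arbitration of access to common memory.
    shared = {"total": 0}
    lock = threading.Lock()

    def worker(values):
        for v in values:
            with lock:                  # serialize access to the shared variable
                shared["total"] += v    # data is exchanged through common memory

    threads = [threading.Thread(target=worker, args=(chunk,))
               for chunk in ([1, 2, 3], [4, 5, 6])]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(shared["total"])              # 21

The lock is also a reminder of why shared memory becomes the bottleneck: every processor contends for the same locations.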

Loosely coupled systems
– Multiple processors
– Each processor has its own independent memory system
– Processors under the integrated control of a common operating system
– Data exchanged between processors via inter-processor messages
– This definition does not agree with the one given in the text

[Figure 16.1: Tightly coupled system]
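In contrast to the shared-memory sketch above, the loosely coupled exchange can be sketched with message passing. This Python fragment (mine, using the standard multiprocessing module's queue as a stand-in for the interconnect) moves data only through explicit messages:

    import multiprocessing as mp

    # Each process has its own private memory; the only way to move
    # data is an explicit inter-processor message, modeled by a queue.
    def producer(q):
        q.put({"partial_sum": 21})      # send a message

    def consumer(q):
        print(q.get()["partial_sum"])   # receive it; no memory is shared

    if __name__ == "__main__":
        q = mp.Queue()
        procs = [mp.Process(target=producer, args=(q,)),
                 mp.Process(target=consumer, args=(q,))]
        for p in procs:
            p.start()
        for p in procs:
            p.join()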

Distributed computing systems
– Collections of relatively autonomous computers, each capable of independent operation
– Example systems are local area networks of computer workstations
» Each machine is running its own “copy” of the operating system
» Some tasks are done on different machines (e.g., mail handler is on one machine)
» Supports multiple independent users
» Load balancing between machines can cause a user’s job on one machine to be shifted to another

[Figure: Loosely coupled system, from [HwB84]]

Performance bounds
– For a system with n processors, we would like a net processing speedup (meaning lower overall execution time) of nearly n times when compared to the performance of a similar uniprocessor system
– A number of poor performance “upper bounds” have been proposed over the years
» Maximum speedup of O(log n)
» Maximum speedup of O(n / ln n)
– These “bounds” were based on runtime performance of applications and were not necessarily valid in all cases
– They reinforced the computer industry’s hesitancy to “get into” parallel processing

[Figure: Performance “bounds” from [HwB84] -- speedup (1 to 10000, log scale) vs. number of processors (16 to 1024), comparing the ideal speedup n against the log n and n/ln n curves]
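To get a feel for how pessimistic these bounds are, the short Python fragment below (not from the notes; a base-2 logarithm is assumed for the O(log n) bound, though the base only scales the curve) tabulates them against the ideal speedup over the plotted range of n:

    import math

    # Compare the two proposed "upper bounds" with the ideal speedup of n.
    for n in (16, 64, 256, 1024):
        print(f"n={n:4d}  ideal={n:4d}  "
              f"log n={math.log2(n):5.1f}  n/ln n={n / math.log(n):6.1f}")

At n = 1024 the log n bound allows a speedup of only 10 and the n/ln n bound about 148, versus the ideal 1024.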

Parallel Processors

“The real world is inherently concurrent; yet our computational perception of it has been strained through 300 years of basically sequential mathematics, 50 years of sequential algorithm development, and 15 years of sequential FORTRAN programming. Is it any wonder that those searching for parallelism or concurrency in a FORTRAN do-loop cannot find much?”
Thurber and Patton in IEEE T/C, Dec 1973

The tightly coupled and loosely coupled architectures can have 10s, 100s, or 1000s of processors
These machines are the true parallel processors (also called concurrent processors)
These machines fall into Flynn’s taxonomy classes of SIMD and MIMD systems
– SIMD: single instruction stream and multiple data streams
– MIMD: multiple instruction streams and multiple data streams

SIMD Overview

Single “control unit” computer and an array of “computational” computers
Control unit executes control-flow instructions and scalar operations and passes vector instructions to the processor array
Processor instruction types
– Extensions of scalar instructions
» Adds, stores, multiplies, etc. become vector operations executed in all processors concurrently
– Must add the ability to transfer vector and scalar data between processors to the instruction set -- attributes of a “parallel language”

[Figure: SIMD system architecture of [Fly66] -- a single control unit issues one instruction stream to processors 0 through N-1, each operating on its own data stream]

SIMD Examples

Vector addition
– C(I) = A(I) + B(I)
– Complexity O(n) in SISD systems:
    for I = 1 to n do
        C(I) = A(I) + B(I)
– Complexity O(1) in SIMD systems
Matrix multiply
– A, B, and C are n-by-n matrices
– Compute C = A x B
– Complexity O(n^3) in SISD systems
» n^2 dot products, each of which is O(n)
– Complexity O(n^2) in SIMD systems
» Perform n dot products in parallel across the array
Image smoothing
– Smooth an n-by-n pixel image to reduce “noise”
– Each pixel is replaced by the average of itself and its 8 nearest neighbors
– Complexity O(n^2) in SISD systems
– Complexity O(n) in SIMD systems

[Figure: Pixel and its 8 neighbors]
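The SIMD flavor of all three examples can be mimicked in NumPy (a sketch, mine rather than course material): each whole-array expression plays the role of a single vector instruction broadcast to the processor array.

    import numpy as np

    n = 8
    A, B = np.random.rand(n), np.random.rand(n)
    C = A + B                 # vector addition: one array-wide operation

    MA, MB = np.random.rand(n, n), np.random.rand(n, n)
    MC = MA @ MB              # matrix multiply as a single array expression

    # Image smoothing: each interior pixel becomes the average of itself
    # and its 8 neighbors, computed by summing nine shifted copies of
    # the image and dividing by 9.
    img = np.random.rand(n, n)
    shifts = [img[1 + di:n - 1 + di, 1 + dj:n - 1 + dj]
              for di in (-1, 0, 1) for dj in (-1, 0, 1)]
    smoothed = sum(shifts) / 9.0    # smoothed version of img[1:-1, 1:-1]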

MIMD systems

MIMD systems differ from SIMD ones in that the “lock-step” operation requirement is removed
Each processor has its own control unit and can execute an independent stream of instructions
– Rather than forcing all processors to perform the same task at the same time, processors can be assigned different tasks that, when taken as a whole, complete the assigned application
SIMD applications can be executed on an MIMD structure
– Each processor executes its own copy of the SIMD algorithm
Application code can be decomposed into communicating processes
– Distributed simulation is a good example of a very hard MIMD application

[Figure: MIMD system architecture of [Fly66] -- control units 0 through N-1 each issue an independent instruction stream to their own processor, each processor operating on its own data stream]
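A toy illustration of this task-level decomposition (mine, not the notes’; Python worker processes stand in for the independently controlled processors), with two different instruction streams running concurrently:

    from concurrent.futures import ProcessPoolExecutor

    def sum_of_squares(n):              # one task: numeric work
        return sum(i * i for i in range(n))

    def count_words(text):              # an unrelated task
        return len(text.split())

    if __name__ == "__main__":
        with ProcessPoolExecutor() as pool:
            # Two independent instruction streams; taken as a whole,
            # they complete the "application".
            f1 = pool.submit(sum_of_squares, 10_000)
            f2 = pool.submit(count_words, "to be or not to be")
            print(f1.result(), f2.result())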

Keys to high MIMD performance are
– Process synchronization
– Process scheduling
Process synchronization targets keeping all processors busy and not suspended awaiting data from another processor
Process scheduling can be performed
– By the programmer through the use of parallel language constructs
» Specify a priori what processes will be instantiated and where they will be executed
– During program execution by spawning processes off for execution on available processors
» Fork-join construct in some languages (see the sketch after the system examples below)

System examples

SIMD
– Illiac IV
» One of the first massively parallel systems
» 64 processors
– Goodyear Staran
» 256 bit-serial processors
– Current system from Cray Computer Corp. uses a supercomputer (Cray 3) front end coupled to an SIMD array of 1000s of processors
MIMD
– Intel hypercube series
» Supported up to several hundred CISC processors
» Next generation is the Paragon
– Cray Research T3D
» Cray Y-MP coupled to a massive array of DEC Alphas
» Target: sustained teraflop performance
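A minimal fork-join sketch (using Python’s standard multiprocessing pool as a stand-in for a parallel language construct, not a construct from any system named above): the parent forks workers onto available processors, then joins their partial results.

    import multiprocessing as mp

    def partial_sum(chunk):
        return sum(chunk)               # each forked worker handles one piece

    if __name__ == "__main__":
        data = list(range(100))
        chunks = [data[i::4] for i in range(4)]
        with mp.Pool(processes=4) as pool:            # "fork": spawn workers
            partials = pool.map(partial_sum, chunks)  # "join": wait, collect
        print(sum(partials))                          # 4950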

Problems
– Hardware is relatively easy to build
– Massively parallel systems just take massive amounts of money to build
– How should/can the large numbers of processors be interconnected?
– The real trick is building software that will exploit the capabilities of the system
Reality check:
– Outside of a limited number of high-profile applications, parallel processing is still a “young” discipline
– Parallel software is still fairly sparse
– Risky for companies to adopt parallel strategies -- just wait for the next new SISD system.


