
Parallel Processors

Session 1
Introduction
Applications of Parallel Processors
• Structural Analysis
• Weather Forecasting
• Petroleum Exploration
New Applications
• Fusion Energy Research
• Medical Diagnosis
• Aerodynamics Simulations
• Artificial Intelligence
• Expert Systems
• Industrial Automation
• Remote Sensing
• Military Applications
• Genetic Engineering
• Socioeconomics
• Encryption
• And Many Other Applications
Common Requirement: a high volume of processing and computation
in a limited time, i.e., more performance
Architecture

[Figure: Application requirements and Technological constraints both feed into the ARCHITECTURE]

• Requirements and constraints are often at odds with
each other!
• Architecture ---> making tradeoffs
• Architecture translates technology’s gifts into
performance and capability
High Performance Computing
• Achieving high performance depends on:
– Fast and reliable hardware (Technology Driven)
– Computer architectures (for example, carry-lookahead addition
increases the speed of operation)
– Processing techniques (better algorithms can result in higher speed of operation)
• Yet another way:
– Performing as many operations as possible simultaneously, concurrently, in
parallel, instead of sequentially
• Cost is important !
• Cost-effective solutions for high performance computing:
– Advanced computer architectures
– theories of parallel computing
– optimal resource allocation
– fast algorithms
– efficient programming languages
– …
• Requires knowledge of algorithms, languages, software, hardware,
performance evaluation, and computing alternatives
Advanced Computer Architectures
• Pipelined Computers
• Array Processors
• Multiprocessor Systems

– Hardware Structure
– Software Structure
– Parallel Computing Algorithms
– Optimal Allocations of Resources
Definition
Parallel processing provides a cost-effective
solution to achieve high system
performance through concurrent activities
It is a method of organizing the operations
in a computing system so that more than
one operation is performed simultaneously
Scalability
• Scalability is a major objective in the
design of advanced parallel computers
• Scalability means: a proportional increase
in performance with increasing system
resources
• System resources include: Processors,
Memory Capacity, I/O Bandwidth
Course Description
• Introductory Graduate Course
• For Computer Group
• Prerequisite: Computer Organization and Programming Concepts

• Course Work:
– Case Study (Due: Week 8)
– Project (Due: Week 16)
– Possible Homework Assignments
– A Final Exam
• References:
– Advanced Computer Architecture, Parallelism, Scalability, Programmability (Kai
Hwang)
– Scalable Parallel Computing Technology, Architecture, Programming (Kai Hwang
and Zhiwei Xu)
– Parallel Computer Architecture: A Hardware/Software Approach (David Culler,
J.P. Singh, Anoop Gupta)
– Introduction to Parallel Computing (Ted G. Lewis, Hesham El-Rewini)
Course Outline
• Introduction, History, Applications, Classification
• Principles of Parallel Processing and Basic
Concepts
• Parallel Computer Models and Structures
• Programming Requirements
• Interconnection Networks
• Performance and Scalability
• Parallel and Scalable Architectures
• Parallel Programming and Models
History
• Mechanical Computers before 1945
• Five generations of electronic computers
1. (1945-54), Vacuum Tubes, Relay Memories, Fixed-Point Arithmetic, Machine Language
(The age of Dinosaurs!)
2. (1955-64), Discrete Transistors, Multiplexed Memory Access, Floating-Point Arithmetic,
High Level Languages and Compilers, batch processing
3. (1965-74), Integrated Circuits, Microprogramming, Pipelining, Cache, Multiprogramming
and Time-Sharing OS, Multiuser applications
4. (1975-90), LSI/VLSI, Semiconductor Memory, Multiprocessors, Vector Supercomputers,
Multicomputers, Multiprocessor OS, Languages, Compilers, Environment for Parallel
Processing
5. ULSI/VHSIC Processors, Memory, and Switches, High-Density Packaging, Scalable
Architectures, Massively Parallel Processing, Teraflops (10^12 floating-point operations per
second)
• Introduction of Concurrency:
• In early von Neumann models every operation was performed sequentially (the instruction is
fetched, the operands are fetched, the operation is executed, the results are stored)
• Prefetching introduced some degree of concurrency
• Extra ALUs allowed multiple execution units within the CPU capable of operating in parallel
• Pipelined operation was introduced in the third generation
• More CPUs were added to the computers to be able to perform instructions in parallel and
independently
Evolution
• From a different perspective the evolution
of computers has gone through three
waves:
– First wave: Mainframes
– Second wave: Minicomputers, high-performance
supercomputers
– Third wave: Personal Computers, Networked
computers
• Parallel computers are the next wave
Levels of Parallelism
• Concurrency is achieved in different levels:
– Job or Program Level: Multiple jobs or programs are processed concurrently
through multiprogramming, timesharing and multiprocessing
• Requires the development of parallel processable algorithms
• Efficient allocation of limited hardware and software resources to multiple programs
– Task or Procedure Level: Multiple procedures or tasks (program segments)
within the same program are executed in parallel
• Requires the decomposition of the program into multiple tasks
– Interinstruction Level: Multiple instructions are executed concurrently
• Requires data dependency analysis
– Intrainstruction Level: Faster and concurrent operations are executed within each
instruction
• Software involvement is the highest in the first level and the lowest in the
last level
• Hardware involvement increases as hardware speed improves and its cost drops, while
software is getting more expensive
Alternatives
• Parallelism in Uniprocessor Systems
• Multiprocessor Systems
• Distributed Computers
– Cluster (Networked) Computers
– Web Computing
• Parallel Computers with Centralized
Computing Facilities
Elements of Modern Computers
• Computer Architecture:
– Not only structure of the hardware
– But also:
• Instruction Set
• System Software
• Application Programs
• User Interfaces
• Depending on the nature of the problems,
the solutions may require different
computing resources, for example:
– Numerical Problems require
mathematical formulations and integer
and floating-point operations (Numerical
Computing)
– Alphanumerical Problems require
database management and information
retrieval operations (Transaction
Processing)
– Artificial Intelligence requires logic
inferences and symbolic manipulations
(Logical Reasoning)
• Respectively, the algorithms and data
structures will be different
• The mapping of the system resources to
the appropriate algorithms used for specific
computing problems is an objective of
parallel computer design
• Mapping includes:
– Processor Scheduling
– Memory Maps
– Interprocessor Communications
– …
[Figure: Computing Problem, Algorithms and Data Structures, High-Level Languages, Programming, Binding (Compile, Load), Applications Software, Mapping, Operating System, Hardware Architecture, Performance Evaluation]
Elements of Modern Computers
• Coordinated effort by hardware
resources, operating system, and
application software determines the
power of a modern computer
system
• The operating system manages the
allocation and deallocation of
resources during the execution of
user programs
• The mapping of algorithmic and
data structures onto the machine
architecture relies on efficient
compiler and operating system
support
• Parallelism can be exploited at:
– Algorithm design
– Programming
– Compilation
– Run time
• Techniques for exploiting
parallelism at the above levels form
the core of parallel processing
technology
• Standard benchmark programs are
needed for performance evaluation
[Figure: System Architecture — the same elements as on the previous slide, with the Hardware Architecture detailed as Processors, Memory, I/O and Peripheral Devices]
Classification of Parallel Computers
• Flynn Classification:
– Single Instruction Single Data stream
system (SISD)
– Single Instruction Multiple Data stream
system (SIMD)
– Multiple Instruction Single Data stream
system (MISD)
– Multiple Instruction Multiple Data stream
system (MIMD)
Example

• Customers in a bank are serviced by bank tellers
• Customers are tasks and tellers are processors
SISD System

• If there is one teller for all customers, they are all serviced in sequence in a line
• This is a conventional single-processor system and the SISD model
• The total processing time is the sum of all processing times
Parallel System
• If there are more tellers, they can serve the customers in parallel
• One teller per customer: all customers are served simultaneously
and the total processing time is the largest individual processing time
Load Balancing
• If there are more customers than tellers, customers must be assigned
to tellers such that all tellers are utilized efficiently, the load is
fairly distributed among them, and customers are served
in the shortest possible time
• Load balancing and scheduling become important (a scheduling sketch follows)
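A minimal sketch of the scheduling idea, assuming a simple greedy least-loaded heuristic (the teller count and service times below are invented for illustration):

```python
import heapq

def assign_customers(service_times, num_tellers):
    """Greedy load balancing: each customer goes to the currently least-loaded teller."""
    tellers = [(0.0, t) for t in range(num_tellers)]   # (accumulated work, teller id)
    heapq.heapify(tellers)
    assignment = []
    for customer, minutes in enumerate(service_times):
        load, teller = heapq.heappop(tellers)          # least-loaded teller so far
        assignment.append((customer, teller))
        heapq.heappush(tellers, (load + minutes, teller))
    finish_time = max(load for load, _ in tellers)     # time until the last teller is done
    return assignment, finish_time

# Example: 8 customers with different service times, 3 tellers
assignment, finish = assign_customers([5, 3, 8, 2, 7, 4, 6, 1], 3)
print(assignment, finish)
```

Handing each customer to whichever teller has the least accumulated work keeps the load roughly even and the overall finishing time short.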
Dependency and Coordination
• Assume customer A deposits some money into
a joint account with teller #1
• At the same time customer B wants to withdraw
some money from the same account with teller
#2
• Obviously both transactions cannot be
completed at the same time
• Coordination is required in parallel processes
when a task depends on other tasks (a minimal illustration follows)
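To see concretely why coordination is needed, here is a minimal sketch of the lost-update problem for customers A and B (the balance and amounts are hypothetical); the lock mechanism described later under MIMD is one way to prevent it:

```python
# Shared joint account balance (hypothetical starting value)
balance = 100

# Interleaving of an unsynchronized deposit (A, +50) and withdrawal (B, -30):
a_read = balance          # teller #1 reads 100
b_read = balance          # teller #2 also reads 100, before #1 writes back
balance = a_read + 50     # teller #1 writes 150
balance = b_read - 30     # teller #2 writes 70 -- the deposit is lost
print(balance)            # 70, but the correct result is 120
```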
Pipelined Processing
• The bank tellers are organized into a coordinated line of
workers
• Each teller is given a fine-grained task to perform rather
than a whole transaction
– #1 gets the customer's account book
– #2 Uses the book to validate the customer and his account
– #3 Updates the account
– #4 Takes or returns cash or check to or from the customer
Pipelined Processing
• By overlapping the tasks, all tellers will always be busy if
the line of customers is full
• The customers will be served in parallel, but in a different
way compared to the case where each customer is serviced
completely by one teller (a pipeline sketch follows)
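A minimal sketch of the four-teller pipeline, assuming one worker thread and one queue per stage (the stage names mirror the list above; the customer identifiers are invented):

```python
import queue
import threading

def stage(name, inbox, outbox):
    """One 'teller': repeatedly take a customer, do one fine-grained step, pass it on."""
    while True:
        item = inbox.get()
        if item is None:                  # shutdown marker: forward it and stop
            if outbox is not None:
                outbox.put(None)
            break
        item.append(name)                 # stand-in for this stage's real work
        if outbox is not None:
            outbox.put(item)              # hand the customer to the next teller
        else:
            print("served:", item)        # last teller: the customer is done

names = ["get account book", "validate customer", "update account", "hand over cash"]
qs = [queue.Queue() for _ in names]       # one input queue per stage
workers = []
for i, name in enumerate(names):
    outbox = qs[i + 1] if i + 1 < len(names) else None
    w = threading.Thread(target=stage, args=(name, qs[i], outbox))
    w.start()
    workers.append(w)

for customer in ["A", "B", "C"]:          # the line of customers enters the pipeline
    qs[0].put([customer])
qs[0].put(None)                           # drain and shut down every stage
for w in workers:
    w.join()
```

Once the queues fill up, every stage is busy at the same time, so customers overlap exactly as described above.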
SIMD or Data Parallel Processing
• Data handling is the key in SIMD
• Assume that, for every customer, a teller must go to a shelf, get the customer's
account, update it, and return to his desk
• Instead, if all the accounts are delivered to the tellers at the right time, this
overhead is saved
• A systematic mechanism is used:
– Phase 1: Data is prepared and delivered to all tellers
– Phase 2: All tellers do the processing at the same time
• This model is useful when there is a lot of data that needs to be
processed in the same way (vector/array processing); a sketch follows
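A minimal sketch of the two-phase data-parallel pattern (the account balances and deposit amounts are invented); a NumPy vector operation stands in for the lockstep processing elements:

```python
import numpy as np

# Phase 1: prepare and deliver the data -- one account balance per processing element
balances = np.array([120.0, 85.0, 300.0, 42.0])   # hypothetical accounts
deposits = np.array([ 10.0, 25.0,   5.0, 50.0])   # one pending deposit each

# Phase 2: every PE executes the same instruction on its own data, in lockstep
balances = balances + deposits                     # a single "add" broadcast to all PEs

print(balances)   # [130. 110. 305.  92.]
```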
SIMD or Data Parallel Processing
• Coordination is still required for customers A and B in the previous example
• Coordination can be done by a high level supervisor
• The supervisor can use a simple strategy such as doing all deposits first
and withdrawals second
• The performance depends on the number of deposits and the number of
withdrawals; for the best results they must be equal
• The total processing time depends on the number of tellers
MISD
• Assume there are several customers with a joint account
• Assume all come to the bank for different types of transactions
• Each customer goes to a different teller
• The most efficient way of processing these customers is to pass the
account file to one of the tellers and then circulate it
among the tellers one after the other
MISD
• If there are several similar cases waiting in line for transactions, the
same procedure can be repeated in a pipeline for the next set of
customers (in fact, for the next account file)
• The time to go and get the account file is then saved for all the
tellers
• This mechanism is good when there is a large amount of data and a
fixed set of operations to be performed on each data item (a sketch follows)
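A minimal sketch of this flow, assuming three invented teller operations: one account file at a time is circulated through all the tellers, each applying a different operation to the same data:

```python
# Each "teller" applies a different instruction to the same account file
transactions = [
    lambda acct: {**acct, "balance": acct["balance"] + 50},       # teller 1: deposit
    lambda acct: {**acct, "balance": acct["balance"] - 20},       # teller 2: withdrawal
    lambda acct: {**acct, "statements": acct["statements"] + 1},  # teller 3: print a statement
]

accounts = [{"balance": 100, "statements": 0},    # the line of waiting account files
            {"balance": 250, "statements": 2}]

for acct in accounts:                 # the next file enters once the previous one moves on
    for teller in transactions:       # the same file circulates teller to teller
        acct = teller(acct)
    print(acct)
```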
MIMD
• The tellers perform different services on different customer files
• There is no global synchronization; the tellers are on their own and
do not interact with each other
• This saves a lot of overhead in accessing the files when each teller needs
to perform several operations on each file
MIMD
• Simultaneous transactions for customers A and B are still prohibited
• A simple lock mechanism synchronizes the transactions for A and B
which have data dependency
• Whenever a teller processes the file for A, he puts a lock on the file
• The file cannot be processed by any other teller until the lock is
removed by the teller who placed it (see the sketch below)
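A minimal sketch of the lock mechanism just described, using a Python threading.Lock as the "lock on the file" (the account and amounts are hypothetical):

```python
import threading

account = {"balance": 100}
account_lock = threading.Lock()          # the "lock on the file"

def deposit(amount):
    with account_lock:                   # the teller locks the file before touching it
        account["balance"] += amount     # no other teller can interleave here

def withdraw(amount):
    with account_lock:
        account["balance"] -= amount

# Customers A and B hit the joint account from two independent tellers
t1 = threading.Thread(target=deposit, args=(50,))
t2 = threading.Thread(target=withdraw, args=(30,))
t1.start(); t2.start()
t1.join(); t2.join()
print(account["balance"])                # always 120, regardless of ordering
```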
Classification of Parallel Computers
• Flynn Classification:
– Single Instruction Single Data stream
system (SISD)
– Single Instruction Multiple Data stream
system (SIMD)
– Multiple Instruction Single Data stream
system (MISD)
– Multiple Instruction Multiple Data stream
system (MIMD)
SISD

[Block diagram: the Control Unit (CU) sends an Instruction Stream (IS) to the Processing Unit (PU), which exchanges a Data Stream (DS) with the Memory Unit (MU); I/O is attached]

• Basic single-processor or uniprocessor system
• The von Neumann architecture is an SISD system
• The SISD class also includes processors with multiple functional units
and/or pipelining
SIMD
[Block diagram: the CU broadcasts the Instruction Stream (IS), loaded from the host, to Processing Elements PE 1 … PE n; each PE exchanges its own Data Stream (DS) with its Local Memory LM 1 … LM n]

• A number of processors simultaneously execute the same instruction transmitted by
the control unit
• Each instruction is executed on a different set of data transmitted to each processing
element from a local memory
• The results are stored temporarily in the local memory
• There is a bidirectional bus between the local memories and the main memory
• The program is stored in the main memory and transmitted to the control unit
• This system is also called an Array Processor or Vector Computer
MISD

[Block diagram: Control Units CU 1 … CU n each send their own Instruction Stream (IS) to Processing Units PU 1 … PU n; a single Data Stream (DS) passes from the shared Memory (Program and Data) through PU 1, PU 2, …, PU n and back, with I/O attached]

• A sequence of data is transmitted to a series of processors
• Each processor is controlled by a separate control unit
• Each processor executes a separate instruction sequence
• This is referred to as systolic array and is used for pipelined execution
of specific algorithms
• This is different from pipelining inside a single processor, where the
pipeline stages belong to the same processor and are controlled by the same
control unit
MIMD
[Block diagram: Control Units CU 1 … CU n each send an Instruction Stream (IS) to their own Processing Units PU 1 … PU n; each PU exchanges a Data Stream (DS) with a Shared Memory, and each CU has its own I/O]

• A set of n processors simultaneously execute different
instruction sequences on different data sets
• Multiprocessors and parallel computers are
MIMD systems
Coordination Mechanisms of
Parallel Programs
Parallel Computer
– Synchronous: Pipelining, SIMD (Vector/Array), MISD (Systolic Array)
– Asynchronous: MIMD

• Different operations and tasks that have
dependencies of any kind must be coordinated
• Coordination can be done using synchronous
mechanisms built into the hardware or
asynchronous mechanisms
Other Classifications
• Flynn’s approach is not the only
classification and it does not cover all
possible configurations
• Another widely accepted classification has
been given by Enslow
Enslow’s Definition
• A multiprocessor must satisfy the following four
properties:
– It must contain two or more processors of
approximately comparable capabilities
– All processors share access to a common memory.
This does not preclude the existence of local
memories for each or some of the processors
– All processors share access to I/O channels, control
units, and devices. This does not preclude the
existence of some local I/O interface and devices
– The entire system is controlled by one Operating
System
Enslow’s Definition
[Block diagram: processors P1 … Pn, memories M1 … Mi, and I/O units IO 1 … IO j all connected through a common Communication Network]

• A multiprocessor conforming to Enslow’s
definition is denoted as a “Tightly Coupled
Multiprocessor”
Enslow’s Definition
• A loosely coupled multiprocessor has
fewer shared and more local resources
• A loosely coupled multiprocessor is more
likely to have additional OS environments
at each individual processor
• A loosely coupled multiprocessor could be
regarded as a “Computer Network”
• Such systems are also recognized as “Distributed
Systems”
Enslow’s Definition
[Block diagram: each processor P1 … Pn has its own Local Memory LM and I/O on a Local Bus; the processors are connected through a Communication Network]

• A Distributed Computer System has:
– A multiplicity of general purpose, physical and logical resources that can be assigned to
specific tasks on a dynamic basis
– A physical distribution of the above resources interacting through a communication network
– A high-level Operating System that unifies and integrates the control of the distributed
components. Individual processors may have their own local OS.
– System Transparency, which permits services to be requested by name only, without having
to identify the serving resources
– Cooperative Autonomy, which permits a serving resource to refuse or delay a request for
service if it is busy processing another task. There is no hierarchy of control within the
system
Levels of Concurrency
• Job: The highest level, consists of one or more
tasks
• Task: A unit of scheduling to be assigned to one
or more processors. Consists of one or more
processes
• Process: A collection of program instructions,
executed on one processor. An indivisible unit
with respect to processor allocation
• Instruction: A simple unit of execution at the
lowest level
