
INTRODUCTION TO PARALLEL COMPUTING
B S RAMANJANEYULU

System Software Development Group,


CDAC, Bangalore.

1
Presentation Outline

• Need for Parallel Computing


• Requirements of Parallel Computing
• Parallel Computing Terminology
• Parallel computer architectures
• Designing parallel algorithms
• Architectural taxonomy (SISD, SIMD, MISD and
MIMD)
• Symmetric multiprocessing (SMP)
• Clusters
• Parallel programming models
2
How to Run Applications Faster?

 There are 3 ways to improve performance:


• Work Harder
• Work Smarter
• Get Help (multiple workers)
 Computer Analogy
• Use faster hardware: e.g. reduce the time per instruction
• Use optimized algorithms and techniques
• Use multiple computers to solve the problem

3
Sequential vs. Parallel

Sequential

Parallel

4
Sequential vs. Parallel (Contd…)

 Traditional sequential programs execute one instruction at a time using one processor
 Parallelism implies executing tasks simultaneously (on multiple
processors) to complete the job faster
 Parallelism can be done by:
− Breaking up the task into smaller tasks
− Assigning the smaller tasks to multiple workers (processors) to
work on simultaneously
− Coordinating the workers (processors)
 Parallel problem solving is natural. Examples: building construction; automobile manufacturing

5
The Need For Faster Machines

 Grand Challenge Problems:


 Climate Modeling
 Computational Fluid dynamics
 Combustion Systems
 Human Genome
 Structural Mechanics
 Molecular Modeling
 Astrophysical Calculations
 Seismic Data Processing

6
Data Parallelism
 The same operation is performed simultaneously on different pieces of data by different CPUs.

Example:

if CPU = "1" then
    start = 1
    end = 50
else if CPU = "2" then
    start = 51
    end = 100
end if

do i = start, end
    Task on d(i)
end do
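Below is a minimal C sketch of the same data-parallel split using MPI (covered later in this presentation). The data size of 100, the task() routine and the rank-to-range mapping are illustrative assumptions, not part of the original slide:

#include <stdio.h>
#include <mpi.h>

/* assumed per-element work; here it only reports which rank handles element i */
static void task(int i, int rank) { printf("rank %d works on d(%d)\n", rank, i); }

int main(int argc, char *argv[]) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int n = 100;                              /* assumed total data size */
    int chunk = n / size;                     /* elements per process */
    int start = rank * chunk + 1;             /* 1-based, as in the slide */
    int end = (rank == size - 1) ? n : start + chunk - 1;

    for (int i = start; i <= end; i++)
        task(i, rank);                        /* same operation, different data */

    MPI_Finalize();
    return 0;
}

Run with, for example, mpirun -np 2 so that ranks 0 and 1 cover elements 1-50 and 51-100, matching the pseudocode above.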
7
Task Parallelism

 Multiple tasks executing concurrently is called task parallelism.

 All the CPUs execute separate code blocks simultaneously.

Example:

if CPU = "1" then
    do "Task 1"
else if CPU = "2" then
    do "Task 2"
end if
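A corresponding C/MPI sketch of this task-parallel dispatch; task1() and task2() are illustrative placeholders for the two separate code blocks:

#include <stdio.h>
#include <mpi.h>

/* illustrative placeholders for the two independent code blocks */
static void task1(void) { printf("Task 1 running\n"); }
static void task2(void) { printf("Task 2 running\n"); }

int main(int argc, char *argv[]) {
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0)        /* plays the role of "CPU 1" in the slide */
        task1();
    else if (rank == 1)   /* plays the role of "CPU 2" in the slide */
        task2();

    MPI_Finalize();
    return 0;
}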

8
Definition


 From the computer architecture point of view, a parallel computer is a “collection of processing elements that communicate and co-operate to solve large problems fast”.

 When this architecture is combined with a parallel algorithm, we get a ‘parallel computing system’.

9
Sequential vs. Parallel Computing

SEQUENTIAL COMPUTING

 Fetch/Store

 Compute

PARALLEL COMPUTING

 Fetch/Store

 Compute
 Communicate

10
Execution Time

• Sequential system
– Execution time as a function of size of input
• Parallel system
– Execution time as a function of input size,
and number of processors used

11
Terminology of Parallel Computing

Speedup: Speedup ‘Sp’ is defined as the ratio of the serial runtime of the best sequential algorithm for solving a problem to the time taken by the parallel algorithm to solve the same problem on ‘p’ processors:

Sp = T(sequential) / T(parallel)

The ‘p’ processors used by the parallel algorithm are assumed to be identical to the one used by the sequential algorithm.

Efficiency: Ratio of speedup to the number of processors:

Efficiency E = Sp / p
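A worked example with assumed figures: if T(sequential) = 100 s and T(parallel) = 25 s on p = 8 processors, then Sp = 100 / 25 = 4 and E = 4 / 8 = 0.5, i.e. 50% efficiency.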

12
Terminology of Parallel Computing (Contd…)

Throughput (in FLOPS): The peak throughput is obtained by taking the clock rate of the given system and dividing it by the number of clock cycles a floating-point instruction requires.
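For example (figures assumed for illustration): a 1 GHz processor that needs 4 clock cycles per floating-point instruction has a peak throughput of about 10^9 / 4 = 250 MFLOPS.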

Cost: The cost of solving a problem on a parallel system is the product of the parallel runtime and the number of processors used, i.e., Cost = p x T(parallel)
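Continuing the assumed figures above: with p = 8 and T(parallel) = 25 s, the cost is 8 x 25 = 200 processor-seconds, double the 100 s sequential runtime, so that parallel solution is not cost-optimal.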

13
Requirements for Parallel Computing

 Multiple processors
(The workers)

 Network
(Link between workers)

 OS support

14
Requirements for Parallel Computing (Contd…)

Parallel Programming Paradigms


 Message Passing (MPI, PVM); a short message-passing sketch follows this slide
 Data Parallel (Fortran 90 / High Performance Fortran)
 Multi-Threading
 Hybrid
 Others (OpenMP, shmem)
 Decomposition of the problem into pieces that multiple
workers can perform.
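As noted above, here is a minimal C sketch of the message-passing paradigm using MPI; the integer payload and the roles of ranks 0 and 1 are illustrative assumptions (run with at least two processes):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    int rank, value = 42;                 /* illustrative payload */
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* worker 0 sends one integer to worker 1 */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* worker 1 receives the integer from worker 0 */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}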

15
Issues in Parallel Computing

• Parallel computer architectures


• Efficient parallel algorithms
• Parallel programming models
• Parallel computer languages
• Methods for evaluating parallel algorithms
• Parallel programming tools

16
Designing Parallel Algorithms

 Detect and exploit any inherent parallelism in an existing sequential algorithm

 Invent a new parallel algorithm

 Adapt another parallel algorithm that solves a similar problem

17
Decomposition Techniques

The process of splitting the computations in a problem into a set of concurrent tasks is referred to as decomposition.

 Decomposing a problem effectively is of paramount importance in parallel computing.

 Without a good decomposition, we may not be able to achieve a high degree of concurrency.

 The decomposition must also ensure good load balance.

18
Decomposition Techniques (Contd…)

What is meant by a good decomposition?

 It should lead to a high degree of concurrency (fine granularity).

 The interaction among tasks should be as little as possible (coarse granularity).

• The ratio between computation and communication is known as granularity.
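An illustrative example with assumed timings: if each task computes for 10 ms and then spends 1 ms communicating, the computation-to-communication ratio is 10:1 and the decomposition is relatively coarse-grained; if the two times were equal, communication overhead would dominate.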

19
Success depends on the combination of

 Architecture, Compiler, Choice of Right Algorithm

 Portability, Maintainability, and Efficient implementation

20
Architectural Taxonomy

Flynn's taxonomy classifies computer architectures by the relationship of program instructions to program data. The four categories are:
 SISD – Single Instruction, Single Data Stream
 SIMD – Single Instruction, Multiple Data Stream
 MISD - Multiple Instruction, Single Data Stream
(no practical examples)
 MIMD - Multiple Instruction, Multiple Data Stream

21
SISD Model features
 Not a parallel computer
 Conventional serial, scalar von Neumann computer
 A single instruction is issued in each clock cycle
 Each instruction operates on a single (scalar) data element
 Performance measured in MIPS
 Examples: most PCs and single CPU workstations

22
SIMD Model features

 Also von Neumann architectures, but with more powerful instructions
 Each instruction may operate on more than one data element
 Usually an intermediate host executes the program logic and broadcasts instructions to the other processors
 Examples: Array Processors and Vector Processors (used in the supercomputers of the 1970s and ’80s)

23
MIMD Model features

 Parallelism achieved by connecting multiple processors together
 Each processor executes its own instruction stream, independent of the other processors, on a unique data stream
 Advantages
 Processors can execute multiple job streams simultaneously
 Each processor can perform any operation regardless of what other processors are doing
 Disadvantages
 Load balancing overhead - synchronization needed to coordinate processors at the end of a parallel structure in a single application
 Can be difficult to program

24
MIMD Block Diagram

25
MIMD Classification

26
Parallel Computer Architecture Memory Models

(Diagrams: Shared Memory, Distributed Memory, Hybrid Memory)

27
Symmetric Multiprocessors (SMP)

28
Symmetric Multiprocessors (SMP)
(Contd…)

• Uses commodity microprocessors with on-chip and off-chip cache.
• Processors are connected to a shared memory through a high-speed bus.
• Single address space.
• Easy application development.
• Difficult to scale.
• Difficult to repair/replace a faulty node (when compared to clusters).

29
SMP, MPP and clusters

30
Competing Architectures

• Massively Parallel Processors (MPP) – proprietary systems built for specific purposes
– high cost and a low performance/price ratio
• Symmetric Multiprocessors (SMP)
– suffer from limited scalability
• Distributed Systems
– difficult to extract high performance
• Clusters
– High Performance Computing --- with commodity processors
– High Availability Computing --- for critical applications

31
What is a Cluster?

 A cluster is a type of parallel or distributed processing system, which consists of a collection of interconnected stand-alone/complete computers cooperatively working together as a single, integrated computing resource.

 A typical cluster consists of:

• A faster, closer connection network than a typical LAN
• Low-latency communication protocols
• A looser connection than SMP

32
Motivation for using Clusters

 The communications bandwidth between workstations is increasing as new networking technologies and protocols are implemented in LANs and WANs.

 Workstation clusters are easier to integrate into existing networks than special parallel computers.

33
Cluster Computer Architecture

34
Components of Cluster Computers
• Multiple High Performance Computers
– PCs
– Workstations
– SMPs
• State-of-the-art Operating Systems
– Layered
– Micro-kernel based
• High Performance Networks/Switches
– Gigabit Ethernet
– PARAMNet
– Myrinet
• Network Interface Cards (NICs)
• Fast Communication Protocols and Services
– Active Messages (AM)
– Virtual Interface Architecture (VIA)
35
Components of Cluster Computers (Contd…)

• Parallel Programming Environments and Tools


– Compilers
– PVM [Parallel Virtual Machine]
– MPI [Message Passing Interface]
• Applications

– Sequential
– Parallel or Distributed

36
Parallel programming models -- MPI, PVM and OpenMP

• MPI – Message Passing Interface
• PVM – Parallel Virtual Machine
• Both MPI and PVM are based on the message-passing mechanism.
• Both MPI and PVM can be used with shared-memory and distributed-memory architectures.
• MPI
  - MPI is mainly for data-parallel problems.
  - Collective and asynchronous operations are more powerful in MPI than in PVM.
• OpenMP – Open Multiprocessing
  - OpenMP is thread-based multiprocessing.
  - OpenMP is more suitable to SMP systems.

37
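A minimal C sketch of OpenMP's thread-based multiprocessing on a shared-memory (SMP) node, assuming a compiler with OpenMP support (e.g. gcc -fopenmp); the array size and the per-element work are illustrative:

#include <stdio.h>
#include <omp.h>

int main(void) {
    int n = 100;                     /* assumed data size */
    double a[100], sum = 0.0;

    /* all threads share a[] through the single address space of the SMP node */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++) {
        a[i] = i * 0.5;              /* illustrative work on each element */
        sum += a[i];
    }

    printf("sum = %f using up to %d threads\n", sum, omp_get_max_threads());
    return 0;
}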


Features of CDAC’s PARAM Supercomputers

 Distributed memory at the system level and shared memory at the node level.

 Nodes connected by low-latency, high-throughput System Area Networks: PARAMNet and Fast/Gigabit Ethernet.

 Standard Message Passing Interface (MPI) implementations, i.e. SUN MPI, IBM MPI, public-domain MPI and C-DAC’s own MPI (CMPI).

 C-DAC’s High Performance Computing and Communication Software (HPCC) for parallel program development and run-time support.

38
References

• http://www.llnl.gov/computing/tutorials/parallel_comp/
• Tutorials located in the Maui High Performance Computing Center's "SP Parallel Programming Workshop"
• Linux Parallel Processing HOWTO: http://www.tldp.org/HOWTO/Parallel-Processing-HOWTO.html

39
Thank you.

40
