
CSC 580 - Chapter 2

PARALLEL PLATFORMS
AHMAD FIRDAUS BIN AHMAD FADZIL
Implicit Parallelism: Trends in Microprocessor Architectures
i. Microprocessor clock speeds have posted impressive gains over the past two decades (two to
three orders of magnitude).

ii. Higher levels of device integration have made available a large number of transistors.

iii. The question of how best to utilize these resources is an important one.

iv. Current processors use these resources in multiple functional units and execute multiple
instructions in the same cycle.

v. The precise manner in which these instructions are selected and executed provides impressive
diversity in architectures.
Implicit Parallelism: Trends in Microprocessor Architectures
Pipelining and Superscalar Execution
i. Technique used in the design of computers to increase their instruction throughput (the
number of instructions that can be executed in a unit of time).
ii. The basic instruction cycle is broken up into a series of steps called a pipeline.
iii. Rather than processing each instruction sequentially (one at a time, finishing one instruction
before starting the next), each instruction is split up into a sequence of steps so different
steps can be executed concurrently (at the same time) and in parallel (by different circuitry).
Implicit Parallelism: Trends in Microprocessor Architectures
Pipelining and Superscalar Execution

i. Each instruction is split into a sequence of dependent steps. The first step is
always to fetch the instruction from memory; the final step is usually
writing the results of the instruction to processor registers or to memory.

ii. Pipelining seeks to let the processor work on as many instructions as there
are dependent steps, just as an assembly line builds many vehicles at once,
rather than waiting until one vehicle has passed through the line before
admitting the next one.

iii. Just as the goal of the assembly line is to keep each assembler productive at
all times, pipelining seeks to keep every portion of the processor busy with
some instruction.

iv. Pipelining lets the computer's cycle time be the time of the slowest step,
and ideally lets one instruction complete in every cycle.
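A minimal sketch of this idea in Python, using assumed (illustrative) stage latencies: the cycle time is set by the slowest stage, and for a large number of instructions throughput approaches one instruction per cycle.

```python
# Minimal sketch (assumed stage latencies, in ns) of how the pipeline cycle
# time is set by the slowest stage and how throughput compares with a
# non-pipelined design that runs all stages back-to-back per instruction.
stage_latencies = {"fetch": 1.0, "decode": 0.8, "execute": 1.2, "write_back": 0.9}

cycle_time = max(stage_latencies.values())        # slowest stage sets the clock
unpipelined_time = sum(stage_latencies.values())  # one instruction at a time

n_instructions = 1_000_000
pipelined_total = (len(stage_latencies) - 1 + n_instructions) * cycle_time
unpipelined_total = n_instructions * unpipelined_time

print(f"cycle time             : {cycle_time} ns")
print(f"ideal speedup (large n): {unpipelined_total / pipelined_total:.2f}x")
```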
Implicit Parallelism: Trends in Microprocessor Architectures
Pipelining and Superscalar Execution
Analogy of Pipelining
• Every car needs to go through a series of steps on the production line before it is complete.
• Each robot on the production line has a specific task.
• Notice that every robot is occupied; none of them is idle.
Implicit Parallelism: Trends in Microprocessor Architectures
Pipelining and Superscalar Execution
i. Pipelining, however, has several limitations.
ii. The speed of a pipeline is eventually limited by the slowest stage.
iii. For this reason, conventional processors rely on very deep pipelines (20-stage pipelines in
state-of-the-art Pentium processors).
iv. However, in typical program traces, every fifth or sixth instruction is a conditional jump! This
requires very accurate branch prediction.
v. The penalty of a misprediction grows with the depth of the pipeline, since a larger number
of instructions will have to be flushed.
vi. One simple way of alleviating these bottlenecks is to use multiple pipelines.
vii. The question then becomes one of selecting these instructions.
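A rough model of the misprediction penalty, with assumed (not measured) parameters, shows why deep pipelines need accurate branch prediction: the flush cost per misprediction grows roughly with pipeline depth.

```python
# Rough model (assumed parameters) of how misprediction cost grows with
# pipeline depth: each mispredicted branch flushes the instructions fetched
# behind it, raising the average cycles per instruction (CPI).
def effective_cpi(pipeline_depth, branch_freq=1 / 6, mispredict_rate=0.05, base_cpi=1.0):
    flush_penalty = pipeline_depth - 1   # assume flush cost ~ pipeline depth
    return base_cpi + branch_freq * mispredict_rate * flush_penalty

for depth in (5, 10, 20):
    print(f"{depth:>2}-stage pipeline: effective CPI = {effective_cpi(depth):.3f}")
```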
Implicit Parallelism: Trends in Microprocessor Architectures
Pipelining and Superscalar Execution
i. A superscalar CPU architecture implements a form of parallelism called instruction-level
parallelism within a single processor.
ii. It therefore allows faster CPU throughput than would otherwise be possible at a given clock rate.
iii. A superscalar processor executes more than one instruction during a clock cycle by
simultaneously dispatching multiple instructions to different functional units on the processor.
iv. Each functional unit is not a separate CPU core but an execution resource within a single CPU
such as an arithmetic logic unit, a bit shifter, or a multiplier.
v. In Flynn's taxonomy, a single-core superscalar processor is classified as an SISD processor (Single
Instruction stream, Single Data stream), though many superscalar processors support short
vector operations and so could be classified as SIMD (Single Instruction stream, Multiple Data
streams). A multi-core superscalar processor is classified as an MIMD processor (Multiple
Instruction streams, Multiple Data streams).
Implicit Parallelism: Trends in Microprocessor Architectures
Pipelining and Superscalar Execution
The superscalar technique is traditionally associated with several identifying
characteristics (within a given CPU core):

i. Instructions are issued from a sequential instruction stream.
ii. CPU hardware dynamically checks for data dependencies between instructions at run time
(versus software checking at compile time).
iii. The CPU processes multiple instructions per clock cycle.

Implicit Parallelism: Trends in Microprocessor Architectures
Pipelining and Superscalar Execution
Scheduling of instructions is determined by a number of factors:

◦ True Data Dependency: The result of one operation is an input to the next.

◦ Resource Dependency: Two operations require the same resource.

◦ Branch Dependency: Scheduling instructions across conditional branch statements cannot be done
deterministically a-priori.

◦ The scheduler, a piece of hardware, looks at a large number of instructions in an instruction queue and
selects an appropriate number of instructions to execute concurrently based on these factors.

◦ The complexity of this hardware is an important constraint on superscalar processors.


Implicit Parallelism: Trends in Microprocessor Architectures
Pipelining and Superscalar Execution
Superscalar Execution: Issue Mechanisms
i. In the simpler model, instructions can be issued only in the order in which they are
encountered. That is, if the second instruction cannot be issued because it has a data
dependency with the first, only one instruction is issued in the cycle. This is called in-order
issue.
ii. In a more aggressive model, instructions can be issued out of order. In this case, if the
second instruction has data dependencies with the first, but the third instruction does not,
the first and third instructions can be co-scheduled. This is also called dynamic issue.
iii. Performance of in-order issue is generally limited.
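The toy Python sketch below (not a cycle-accurate model; the instruction names and dependencies are made up) contrasts the two issue policies on a 2-way machine: in-order issue stalls at the first blocked instruction, while dynamic issue skips past it.

```python
# Toy illustration of in-order vs. out-of-order (dynamic) issue on a 2-way
# machine. Each instruction lists the instructions it depends on; "ready"
# means all of its dependencies issued in an earlier cycle.
instrs = {"i1": [], "i2": ["i1"], "i3": [], "i4": []}

def schedule(out_of_order, width=2):
    done, cycles = set(), []
    pending = list(instrs)
    while pending:
        issued = []
        for name in list(pending):
            if len(issued) == width:
                break
            if all(dep in done for dep in instrs[name]):
                issued.append(name)
            elif not out_of_order:
                break            # in-order issue stalls at the first blocked instruction
        for name in issued:
            pending.remove(name)
        done.update(issued)
        cycles.append(issued)
    return cycles

print("in-order :", schedule(out_of_order=False))   # 3 cycles
print("dynamic  :", schedule(out_of_order=True))    # 2 cycles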
Implicit Parallelism: Trends in Microprocessor Architectures
Pipelining and Superscalar Execution
Superscalar Execution: Efficiency Considerations
i. Not all functional units can be kept busy at all times.
ii. If during a cycle, no functional units are utilized, this is referred to as vertical waste.
iii. If during a cycle, only some of the functional units are utilized, this is referred to as
horizontal waste.
iv. Due to limited parallelism in typical instruction traces, dependencies, or the inability of the
scheduler to extract parallelism, the performance of superscalar processors is eventually
limited.
v. Conventional microprocessors typically support four-way superscalar execution.
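A small illustration of the two kinds of waste, for a 4-way machine and a hypothetical per-cycle utilization trace (the numbers are assumptions, not measurements):

```python
# Vertical waste = cycles in which no functional unit did useful work;
# horizontal waste = idle issue slots in cycles that were partially used.
units_used_per_cycle = [4, 2, 0, 3, 0, 1]   # assumed trace for a 4-way machine
width = 4

vertical_waste = sum(1 for u in units_used_per_cycle if u == 0)
horizontal_waste = sum(width - u for u in units_used_per_cycle if u > 0)
utilization = sum(units_used_per_cycle) / (width * len(units_used_per_cycle))

print(f"vertical waste  : {vertical_waste} cycles")
print(f"horizontal waste: {horizontal_waste} issue slots")
print(f"utilization     : {utilization:.0%}")
```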
Implicit Parallelism: Trends in Microprocessor Architectures
Very Long Instruction Word Processor
i. Very long instruction word (VLIW) refers to processor architectures designed to take advantage of
instruction level parallelism (ILP).
ii. Whereas conventional processors mostly allow programs only to specify instructions that will be
executed in sequence, a VLIW processor allows programs to explicitly specify instructions that
will be executed at the same time (that is, in parallel).
iii. This type of processor architecture is intended to allow higher performance without the inherent
complexity of some other approaches.
iv. The hardware cost and complexity of the superscalar scheduler is a major consideration in
processor design.
v. To address these issues, VLIW processors rely on compile-time analysis to identify and bundle
together instructions that can be executed concurrently.
vi. These instructions are packed and dispatched together, and thus the name very long instruction
word.
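The sketch below mimics, in Python, the kind of compile-time bundling a VLIW compiler performs; the instruction list, bundle width, and dependency check are all simplifying assumptions, not the algorithm of any particular compiler.

```python
# Greedily pack instructions into long instruction words of (assumed) width 4,
# starting a new word whenever an instruction reads a register that is
# produced inside the current word (a true data dependence).
instrs = [("load", "r1"), ("add", "r2", "r1"), ("mul", "r3", "r1"),
          ("load", "r4"), ("add", "r5", "r4"), ("sub", "r6", "r2", "r3")]

def bundle(instrs, width=4):
    words, current = [], []
    for op, dst, *srcs in instrs:
        writes_in_word = {d for _, d, *_ in current}
        if len(current) == width or any(s in writes_in_word for s in srcs):
            words.append(current)
            current = []
        current.append((op, dst, *srcs))
    words.append(current)
    return words

for i, word in enumerate(bundle(instrs)):
    print(f"word {i}: {word}")
```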
Implicit Parallelism: Trends in Microprocessor Architectures
Very Long Instruction Word Processor
VLIW Considerations
i. Issue hardware is simpler.
ii. Compiler has a bigger context from which to select co-scheduled instructions.
iii. Compilers, however, do not have runtime information such as cache misses. Scheduling is,
therefore, inherently conservative.
iv. Branch and memory prediction is more difficult.
v. VLIW performance is highly dependent on the compiler. A number of techniques such as
loop unrolling, speculative execution, and branch prediction are critical.
vi. Typical VLIW processors are limited to 4-way to 8-way parallelism.
Limitations of Memory System Performance
i. Memory system, and not processor speed, is often the bottleneck for many applications.
ii. Memory system performance is largely captured by two parameters, latency and bandwidth.
iii. Latency is the time from the issue of a memory request to the time the data is available at
the processor.
iv. Bandwidth is the rate at which data can be pumped to the processor by the memory system.
Limitations of Memory System Performance
Bandwidth vs Latency
i. It is very important to understand the difference between latency and bandwidth.
ii. Consider the example of a fire-hose. If the water comes out of the hose two seconds after
the hydrant is turned on, the latency of the system is two seconds.
iii. Once the water starts flowing, if the hydrant delivers water at the rate of 5 gallons/second,
the bandwidth of the system is 5 gallons/second.
iv. If you want immediate response from the hydrant, it is important to reduce latency.
v. If you want to fight big fires, you want high bandwidth.
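The same split applies to memory requests. A minimal sketch, using the fire-hose numbers from the slide (2 s latency, 5 gallons/second bandwidth):

```python
# Total delivery time = latency + size / bandwidth.
def transfer_time(size, latency, bandwidth):
    return latency + size / bandwidth

print(transfer_time(size=1,    latency=2.0, bandwidth=5.0))   # 2.2 s  (latency dominates)
print(transfer_time(size=1000, latency=2.0, bandwidth=5.0))   # 202 s  (bandwidth dominates)
```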
Limitations of Memory System Performance
Improving Effective Memory Latency using Caches
i. Caches are small and fast memory elements between the processor and DRAM.
ii. This memory acts as a low-latency high-bandwidth storage.
iii. If a piece of data is repeatedly used, the effective latency of this memory system can be
reduced by the cache.
iv. The fraction of data references satisfied by the cache is called the cache hit ratio of the
computation on the system.
v. Cache hit ratio achieved by a code on a memory system often determines its performance.
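A short sketch of how the hit ratio determines effective latency, with assumed access times (1 ns cache, 100 ns DRAM; real numbers vary by machine):

```python
# Effective (average) memory access time under a cache.
def effective_latency(hit_ratio, t_cache=1.0, t_dram=100.0):
    return hit_ratio * t_cache + (1.0 - hit_ratio) * t_dram

for h in (0.0, 0.5, 0.9, 0.99):
    print(f"hit ratio {h:4.2f} -> {effective_latency(h):6.2f} ns average access time")
```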
Limitations of Memory System Performance
Improving Effective Memory Latency using Caches
i. Cache hits are served by reading data from the cache, which is faster than recomputing a
result or reading from a slower data store; thus, the more requests can be served from the
cache, the faster the system performs.
ii. To be cost-effective and to enable efficient use of data, caches are relatively small.
Nevertheless, caches have proven themselves in many areas of computing because access
patterns in typical computer applications exhibit locality of reference.
iii. Access patterns exhibit temporal locality if recently requested data is requested again,
while spatial locality refers to requests for data stored close to data that has recently
been requested.

*Locality of Reference - phenomenon describing the same value, or related storage locations, being frequently accessed
Limitations of Memory System Performance
Impact of Memory Bandwidth
i. Memory bandwidth is the rate at which data can be read from or stored into a
semiconductor memory by a processor.
ii. Memory bandwidth that is advertised for a given memory or system is usually the maximum
theoretical bandwidth. In practice the observed memory bandwidth will be less than (and is
guaranteed not to exceed) the advertised bandwidth.
iii. Memory bandwidth is determined by the bandwidth of the memory bus as well as the
memory units.
iv. Memory bandwidth can be improved by increasing the size of memory blocks.
v. The underlying system takes l time units (where l is the latency of the system) to deliver b
units of data (where b is the block size).
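Under that simple model, effective bandwidth is b/l, so larger blocks amortize the fixed latency; a small sketch with an assumed latency of 100 cycles (the benefit is only realized if the program actually uses the extra words, i.e. has spatial locality):

```python
# The memory system needs l time units to deliver a block of b words,
# so effective bandwidth = b / l.
def effective_bandwidth(block_size_words, latency_cycles):
    return block_size_words / latency_cycles

latency = 100  # assumed access latency in cycles
for b in (1, 4, 16, 64):
    print(f"block of {b:>2} words: {effective_bandwidth(b, latency):.2f} words/cycle")
```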
Limitations of Memory System Performance
Impact of Memory Bandwidth
i. Exploiting spatial and temporal locality in applications is critical for amortizing memory
latency and increasing effective memory bandwidth.
ii. The ratio of the number of operations to number of memory accesses is a good indicator of
anticipated tolerance to memory bandwidth.
iii. Memory layouts and organizing computation appropriately can make a significant impact on
the spatial and temporal locality.
Limitations of Memory System Performance
Tradeoffs of Multithreading and Prefetching
Alternate approach for hiding memory latency
Consider the problem of browsing the web on a very slow network connection. We deal with the problem in one of
three possible ways:
i. we anticipate which pages we are going to browse ahead of time and issue requests for them in advance;
ii. we open multiple browsers and access different pages in each browser, thus while we are waiting for one page to
load, we could be reading others; or
iii. we access a whole bunch of pages in one go - amortizing the latency across various accesses.
Limitations of Memory System Performance
Tradeoffs of Multithreading and Prefetching
Multithreading for latency hiding
i. Multithreading is the ability of a program or an operating system to serve more than one
user at a time and to manage multiple simultaneous requests without the need to have
multiple copies of the programs running within the computer.
ii. To support this, central processing units have hardware support to efficiently execute
multiple threads. This approach is distinguished from multiprocessing systems (such as
multi-core systems).
iii. Where multiprocessing systems include multiple complete processing units, multithreading
aims to increase utilization of a single core by using thread-level as well as instruction-level
parallelism.
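A toy demonstration of latency hiding with Python threads, where a sleep stands in for a long-latency memory access (the 0.1 s figure is purely illustrative):

```python
import threading
import time

# One thread serving four "memory accesses" waits ~0.4 s in total; four
# threads overlap their waiting and finish in roughly ~0.1 s of wall time.
def access_memory(latency=0.1):
    time.sleep(latency)      # stand-in for a long-latency memory operation

start = time.time()
for _ in range(4):
    access_memory()
print(f"single thread: {time.time() - start:.2f} s")

start = time.time()
threads = [threading.Thread(target=access_memory) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"four threads : {time.time() - start:.2f} s")
```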
Limitations of Memory System Performance
Tradeoffs of Multithreading and Prefetching
Prefetching for latency hiding
i. Adding a cache can provide faster access to needed instructions. Prefetching occurs when a
processor requests an instruction from main memory before it is actually needed. Once the
instruction comes back from memory, it is placed in a cache.
ii. Since programs are generally executed sequentially, performance is likely to be best when
instructions are prefetched in program order.
iii. Alternatively, the prefetch may be part of a complex branch prediction algorithm, where the
processor tries to anticipate the result of a calculation and fetch the right instructions in
advance.
Limitations of Memory System Performance
Tradeoffs of Multithreading and Prefetching
Tradeoffs of both
i. Multithreading and prefetching are critically impacted by the memory bandwidth
ii. Bandwidth requirements of a multithreaded system may increase very significantly because
of the smaller cache residency of each thread.
iii. Multithreaded systems become bandwidth bound instead of latency bound.
iv. Multithreading and prefetching only address the latency problem and may often worsen the
bandwidth problem.
v. Multithreading and prefetching also require significantly more hardware resources in the
form of storage.
Dichotomy of Parallel Computing Platforms
i. Dichotomy of parallel computing platform: Based on logical organization and physical organization of
parallel platforms.
ii. Logical organization: Refers to a programmer's view of the platform.
iii. Physical organization: Refers to the actual hardware organization of the platform.
iv. Critical components of parallel computing from a programmer's perspective:
i. Ways of expressing parallel tasks (control structures).
ii. Mechanisms for specifying interaction between these tasks (communication model).
Dichotomy of Parallel Computing Platforms
Control Structures of Parallel Platform
i. Parallel tasks can be specified at various levels of granularity. At one extreme, each program in a set of programs can be
viewed as one parallel task. At the other extreme, individual instructions within a program can be viewed as parallel
tasks. Between these extremes lie a range of models for specifying the control structure of programs and the
corresponding architectural support for them.

ii. Processing units in parallel computers either operate under the centralized control of a single control unit or work
independently.
Dichotomy of Parallel Computing Platforms
Control Structures of Parallel Platform
SIMD (Single Instruction Multiple Data Streams)
i. A single control unit dispatches instructions to each processing
unit.
ii. A global clock drives all PEs. That is, at any given time, all PEs are
synchronously executing the same instruction but on different
sets of data; hence the name SIMD.
iii. Analogy: A drill sergeant gives orders to a platoon of soldiers. The
sergeant barks out a single order, “Attention!!”, and all soldiers
execute it in parallel, each one using his or her own arms, legs,
and body. The next order, say “Present Arms”, is not issued until
the sergeant is satisfied that the previous one has been
completed. Again, the soldiers all carry out this next order
simultaneously, with each one presenting his or her own weapon.
iv. These architectural enhancements rely on the highly structured
(regular) nature of the underlying computations, for example in
image processing and graphics, to deliver improved performance.
Dichotomy of Parallel Computing Platforms
Control Structures of Parallel Platform
MIMD (Multiple Instructions Multiple Data Streams)

i. Computers in which each processing element is capable of executing a
different program independent of the other processing elements.

ii. Analogy - A good analogy to asynchronous parallelism is the way that
different craftspeople work to build a house. Carpenters, masons,
electricians, and plumbers all work independently on their own tasks and at
their own rate. Occasionally, they need to communicate with each other to
find out if a specific task has been completed or to provide someone else
with information about what they are doing.

iii. SPMD (single program, multiple data) is a technique employed to achieve
parallelism; it is a subcategory of MIMD. Tasks are split up and run
simultaneously on multiple processors with different input in order to obtain
results faster. SPMD is the most common style of parallel programming.

iv. SPMD model is widely used by many parallel platforms and requires minimal
architectural support. Examples of such platforms include the Sun Ultra
Servers, multiprocessor PCs, workstation clusters, and the IBM SP.
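A minimal SPMD-style sketch in Python (using the standard multiprocessing module as a stand-in for the parallel platforms named above): every worker runs the same program on a different slice of the data, and the partial results are combined at the end.

```python
from multiprocessing import Pool

def partial_sum(chunk):
    # The "single program" every worker executes, on its own data.
    return sum(chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))
    n_workers = 4
    chunk_size = len(data) // n_workers
    chunks = [data[i * chunk_size:(i + 1) * chunk_size] for i in range(n_workers)]
    with Pool(n_workers) as pool:
        print(sum(pool.map(partial_sum, chunks)))   # same result as sum(data)
```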
Dichotomy of Parallel Computing Platforms
Control Structures of Parallel Platform
SIMD vs MIMD

i. SIMD computers require less hardware and memory than MIMD computers (single control unit).
ii. Furthermore, SIMD computers require less memory because only one copy of the program needs
to be stored.
iii. However, the relative unpopularity of SIMD processors as general purpose compute engines can
be attributed to their specialized hardware architectures, economic factors, design constraints,
product life-cycle, and application characteristics.
iv. In contrast, MIMD computers store the program and operating system at each processor.
v. Platforms supporting the MIMD paradigm can be built from inexpensive off-the-shelf
components with relatively little effort in a short amount of time.
Dichotomy of Parallel Computing Platforms
Communication Model of Parallel Platforms
i. There are two primary forms of data exchange between parallel tasks:
i. Accessing a shared data space.
ii. Exchanging messages.

ii. Platforms that provide a shared data space are called shared-address-space machines or
multiprocessors.
iii. Platforms that support messaging are also called message passing platforms or
multicomputers.
Dichotomy of Parallel Computing Platforms
Communication Model of Parallel Platforms
Shared Address Space
i. The "shared-address-space" view of a parallel platform supports a common data space that
is accessible to all processors. Processors interact by modifying data objects stored in this
shared-address-space.
ii. Shared-address-space platforms supporting SPMD programming are also referred to as
multiprocessors.
iii. If the time taken by a processor to access any memory word in the system (global or local) is
identical, the platform is classified as a uniform memory access (UMA) machine; otherwise, it
is a non-uniform memory access (NUMA) machine.
Dichotomy of Parallel Computing Platforms
Communication Model of Parallel Platforms
Shared Address Space vs Shared Memory

i. It is important to note the difference between the terms shared address space and shared
memory.
ii. We refer to the former as a programming abstraction and to the latter as a physical machine
attribute.
iii. It is possible to provide a shared address space using a physically distributed memory.
Dichotomy of Parallel Computing Platforms
Communication Model of Parallel Platforms
Message Passing Platforms

i. The logical machine view of a message-passing platform consists of p processing nodes,
each with its own exclusive address space. Each of these processing nodes can either be a
single processor or a shared-address-space multiprocessor – a trend that is fast gaining
momentum in modern message-passing parallel computers.
ii. Instances of such a view come naturally from clustered workstations and non-shared-
address-space multicomputers. On such platforms, interactions between processes running
on different nodes must be accomplished using messages, hence the name message
passing.
iii. Since interactions are accomplished by sending and receiving messages, the basic
operations in this programming paradigm are send and receive (the corresponding calls may
differ across APIs but the semantics are largely identical).
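A sketch of the send/receive style described above, using two Python processes connected by a pipe (the call names send/recv mirror the basic operations; real message-passing APIs such as MPI differ in detail):

```python
from multiprocessing import Process, Pipe

def worker(conn):
    data = conn.recv()        # blocking receive of the work
    conn.send(sum(data))      # send the result back
    conn.close()

if __name__ == "__main__":
    parent_conn, child_conn = Pipe()
    p = Process(target=worker, args=(child_conn,))
    p.start()
    parent_conn.send(list(range(10)))   # send a message to the other "node"
    print(parent_conn.recv())           # prints 45
    p.join()
```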
Physical Organization of Parallel Platforms
Architecture of an Ideal Parallel Computer
Parallel Random Access Machine (PRAM)
i. A natural extension of the serial model of computation (the Random Access Machine, or
RAM) consists of p processors and a global memory of unbounded size that is uniformly
accessible to all processors.
ii. All processors access the same address space. Processors share a common clock but may
execute different instructions in each cycle. This ideal model is also referred to as a parallel
random access machine (PRAM).
iii. Since PRAMs allow concurrent access to various memory locations, depending on how
simultaneous memory accesses are handled, PRAMs can be divided into four subclasses.
Physical Organization of Parallel Platforms
Architecture of an Ideal Parallel Computer
Parallel Random Access Machine (PRAM)

i. Exclusive-read, exclusive-write (EREW) PRAM. In this class, access to a memory location is
exclusive. No concurrent read or write operations are allowed. This is the weakest PRAM
model, affording minimum concurrency in memory access.
ii. Concurrent-read, exclusive-write (CREW) PRAM. In this class, multiple read accesses to a
memory location are allowed. However, multiple write accesses to a memory location are
serialized.
iii. Exclusive-read, concurrent-write (ERCW) PRAM. Multiple write accesses are allowed to a
memory location, but multiple read accesses are serialized.
iv. Concurrent-read, concurrent-write (CRCW) PRAM. This class allows multiple read and write
accesses to a common memory location. This is the most powerful PRAM model.
Physical Organization of Parallel Platforms
Architecture of an Ideal Parallel Computer
Allowing concurrent read access does not create any semantic discrepancies in the program.
However, concurrent write access to a memory location requires arbitration. Several protocols
are used to resolve concurrent writes. The most frequently used protocols are as follows:
i. Common, in which the concurrent write is allowed if all the values that the processors are
attempting to write are identical.
ii. Arbitrary, in which an arbitrary processor is allowed to proceed with the write operation
and the rest fail.
iii. Priority, in which all processors are organized into a predefined prioritized list, and the
processor with the highest priority succeeds and the rest fail.
iv. Sum, in which the sum of all the quantities is written (the sum-based write conflict
resolution model can be extended to any associative operator defined on the quantities
being written).
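A small sketch of the four protocols, where a dictionary of processor-id to value stands in for the concurrent writes arriving at one memory location in a single step (the numbers are illustrative):

```python
# Resolve concurrent writes to a single location under each CRCW protocol.
def resolve(writes, protocol):
    values = list(writes.values())
    if protocol == "common":
        if len(set(values)) != 1:
            raise ValueError("common protocol requires identical values")
        return values[0]
    if protocol == "arbitrary":
        return next(iter(values))      # any one writer succeeds
    if protocol == "priority":
        return writes[min(writes)]     # lowest processor id assumed highest priority
    if protocol == "sum":
        return sum(values)             # any associative operator would do
    raise ValueError(protocol)

writes = {3: 7, 1: 2, 2: 5}
for proto in ("arbitrary", "priority", "sum"):
    print(proto, "->", resolve(writes, proto))
```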
Physical Organization of Parallel Platforms
Interconnection Network
i. Interconnection networks carry data between processors and to memory.
ii. Interconnects are made of switches and links (wires, fiber).
iii. Interconnects are classified as static or dynamic.
iv. Static networks consist of point-to-point communication links among processing nodes and
are also referred to as direct networks.
v. Dynamic networks are built using switches and communication links. Dynamic networks are
also referred to as indirect networks.
Physical Organization of Parallel Platforms
Interconnection Network
Static vs Dynamic Network

Classification of interconnection networks: (a) a static network; and (b) a dynamic network.
Physical Organization of Parallel Platforms
Interconnection Network
Network Topologies

i. A variety of network topologies have been proposed and implemented.


ii. These topologies tradeoff performance for cost.
iii. Commercial machines often implement hybrids of multiple topologies for reasons of
packaging, cost, and available components.
Physical Organization of Parallel Platforms
Interconnection Network
Bus Network
i. Some of the simplest and earliest parallel machines
used buses.
ii. All processors access a common bus for exchanging
data.
iii. The distance between any two nodes is O(1) in a bus.
The bus also provides a convenient broadcast medium.
iv. However, the bandwidth of the shared bus is a major
bottleneck.
v. Typical bus based machines are limited to dozens of
nodes. Sun Enterprise servers and Intel Pentium based
shared-bus multiprocessors are examples of such
architectures.
Physical Organization of Parallel Platforms
Interconnection Network
Crossbar Network
i. A crossbar network uses a p×m grid of
switches to connect p inputs to m
outputs in a non-blocking manner.
ii. The cost of a crossbar of p processors
grows as O(p²).
iii. This is generally difficult to scale for
large values of p.
iv. Examples of machines that employ
crossbars include the Sun Ultra HPC
10000 and the Fujitsu VPP500.
Physical Organization of Parallel Platforms
Interconnection Network
Multistage Network
i. Crossbars have excellent performance
scalability but poor cost scalability.
ii. Buses have excellent cost scalability, but
poor performance scalability.
iii. Multistage interconnects strike a
compromise between these extremes.
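A rough switch-count comparison behind this compromise, assuming a p×p crossbar (p² crosspoints) and an Omega-style multistage network built from 2×2 switches, which needs (p/2)·log₂p switches:

```python
import math

def crossbar_switches(p):
    return p * p                            # p x p crosspoints

def omega_switches(p):
    return (p // 2) * int(math.log2(p))     # log2(p) stages of p/2 switches

for p in (8, 64, 1024):
    print(f"p={p:>4}: crossbar {crossbar_switches(p):>8}  omega {omega_switches(p):>6}")
```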
Physical Organization of Parallel Platforms
Interconnection Network
Multistage Omega Network
i. One of the most commonly used
multistage interconnects is the Omega
network.
ii. Each stage of the Omega network
implements a perfect shuffle: input i is
connected to output 2i for i < p/2, and to
output 2i + 1 − p otherwise (a left rotation
of the binary representation of i).
iii. The perfect shuffle patterns are
connected using 2×2 switches.
iv. The switches operate in two modes –
crossover or passthrough.

Two switching configurations of the 2 × 2 switch: (a) Pass-through; (b) Cross-over.
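Following the mapping given above, the perfect shuffle for one stage can be written in a few lines (p = 8 here is just an example size; p must be a power of 2):

```python
# Perfect-shuffle mapping used by each stage of an Omega network:
# a left rotation of the log2(p)-bit representation of the input index i.
def shuffle(i, p):
    return 2 * i if i < p // 2 else 2 * i + 1 - p

p = 8
for i in range(p):
    print(f"input {i} -> output {shuffle(i, p)}")
```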
Physical Organization of Parallel Platforms
Interconnection Network
Multistage Omega Network

A complete Omega network connecting eight inputs and eight outputs.
Physical Organization of Parallel Platforms
Interconnection Network
Completely Connected Network
i. Each processor is connected to every
other processor.
ii. The number of links in the network
scales as O(p²).
iii. While the performance scales very
well, the hardware complexity is not
realizable for large values of p.
iv. In this sense, these networks are
static counterparts of crossbars.

A completely-connected network of eight nodes.
Physical Organization of Parallel Platforms
Interconnection Network
Star Connected Network
i. Every node is connected only to a
common node at the center.
ii. Distance between any pair of nodes is
O(1). However, the central node
becomes a bottleneck.
iii. In this sense, star connected networks
are static counterparts of buses.

A star connected network of nine nodes.


End of Parallel Platforms
References
1. Grama, A. (Ed.). (2003). Introduction to parallel computing. Pearson Education.

2. Ozturk, O. Parallel Computing Platforms. Retrieved March 13, 2015, from http://www.cs.bilkent.edu.tr/~ozturk/cs426/set2.pdf

3. SIMD. Retrieved March 13, 2015, from http://en.wikipedia.org/wiki/SIMD

4. SISD. Retrieved March 13, 2015, from http://en.wikipedia.org/wiki/SISD

5. MIMD. Retrieved March 13, 2015, from http://en.wikipedia.org/wiki/MIMD

6. MISD. Retrieved March 13, 2015, from http://en.wikipedia.org/wiki/MISD

7. Pipeline (computing). Retrieved March 13, 2015, from http://en.wikipedia.org/wiki/Pipeline_(computing)

8. Superscalar. Retrieved March 13, 2015, from http://en.wikipedia.org/wiki/Superscalar

9. Distributed Memory. Retrieved March 13, 2015, from http://en.wikipedia.org/wiki/Distributed_memory

10. Shared Memory. Retrieved March 13, 2015, from http://en.wikipedia.org/wiki/Shared_memory

11. Flynn's Taxonomy. (2004). Retrieved March 13, 2015, from https://web.cimne.upc.edu/groups/sistemes/Servicios%20de%20calculo/Barcelona/public/14716537-Flynns-Taxonomy-and-SISD-SIMD-MISD-MIMD.pdf
