PARALLEL PLATFORMS
AHMAD FIRDAUS BIN AHMAD FADZIL
Implicit Parallelism: Trends in Microprocessor Architectures
i. Microprocessor clock speeds have posted impressive gains over the past two decades (two to
three orders of magnitude).
ii. Higher levels of device integration have made available a large number of transistors.
iii. The question of how best to utilize these resources is an important one.
iv. Current processors use these resources in multiple functional units and execute multiple
instructions in the same cycle.
v. The precise manner in which these instructions are selected and executed provides impressive
diversity in architectures.
Pipelining and Superscalar Execution
i. Pipelining is a technique used in the design of computers to increase their instruction throughput (the number of instructions that can be executed in a unit of time).
ii. The basic instruction cycle is broken up into a series of steps called a pipeline.
iii. Rather than processing each instruction sequentially (one at a time, finishing one instruction
before starting the next), each instruction is split up into a sequence of steps so different
steps can be executed concurrently (at the same time) and in parallel (by different circuitry).
i. Each instruction is split into a sequence of dependent steps. The first step is
always to fetch the instruction from memory; the final step is usually
writing the results of the instruction to processor registers or to memory.
ii. Pipelining seeks to let the processor work on as many instructions as there
are dependent steps, just as an assembly line builds many vehicles at once,
rather than waiting until one vehicle has passed through the line before
admitting the next one.
iii. Just as the goal of the assembly line is to keep each assembler productive at
all times, pipelining seeks to keep every portion of the processor busy with
some instruction.
iv. Pipelining lets the computer's cycle time be the time of the slowest step,
and ideally lets one instruction complete in every cycle.
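As a rough sketch (not from the original slides), this ideal timing behaviour can be expressed in a few lines of Python; the stage count and instruction count below are arbitrary assumptions:

    # Ideal pipeline timing: a k-stage pipeline needs k cycles to fill,
    # then retires one instruction per cycle: total = k + (n - 1) cycles.
    def pipelined_cycles(n, k):
        return k + (n - 1)

    def sequential_cycles(n, k):
        return n * k  # no overlap: every instruction takes all k stages

    n, k = 1000, 5  # assumed instruction count and pipeline depth
    speedup = sequential_cycles(n, k) / pipelined_cycles(n, k)
    print(f"ideal speedup with {k} stages: {speedup:.2f}")  # approaches k as n grows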
Analogy of Pipelining
• Every car needs to go through a series of steps on the production line before it is complete
• Each robot in the production line has a
specific task
• Notice how each of the robots is occupied, meaning that none of them is idle.
i. Pipelining, however, has several limitations.
ii. The speed of a pipeline is eventually limited by the slowest stage.
iii. For this reason, conventional processors rely on very deep pipelines (20-stage pipelines in state-of-the-art Pentium processors).
iv. However, in typical program traces, roughly every fifth or sixth instruction is a conditional jump, which requires very accurate branch prediction.
v. The penalty of a misprediction grows with the depth of the pipeline, since a larger number of in-flight instructions will have to be flushed.
vi. One simple way of alleviating this bottleneck is to use multiple pipelines.
vii. The question then becomes one of selecting which instructions to issue concurrently to these pipelines.
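A back-of-the-envelope sketch of why deep pipelines magnify misprediction costs; the 20-stage depth and one-branch-in-five frequency follow the slide, while the 90% prediction accuracy and base CPI of 1 are assumptions:

    # Effective cycles per instruction (CPI) with branch mispredictions,
    # assuming a misprediction flushes the entire pipeline.
    def effective_cpi(depth, branch_every, accuracy):
        mispredict_rate = (1.0 / branch_every) * (1.0 - accuracy)
        return 1.0 + mispredict_rate * depth

    print(effective_cpi(depth=20, branch_every=5, accuracy=0.90))
    # -> 1.4: even 90% accuracy adds ~40% extra cycles at depth 20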
i. A superscalar CPU architecture implements a form of parallelism called instruction-level
parallelism within a single processor.
ii. It therefore allows faster CPU throughput than would otherwise be possible at a given clock rate.
iii. A superscalar processor executes more than one instruction during a clock cycle by
simultaneously dispatching multiple instructions to different functional units on the processor.
iv. Each functional unit is not a separate CPU core but an execution resource within a single CPU
such as an arithmetic logic unit, a bit shifter, or a multiplier.
v. In Flynn's taxonomy, a single-core superscalar processor is classified as an SISD processor (Single
Instruction stream, Single Data stream), though many superscalar processors support short
vector operations and so could be classified as SIMD (Single Instruction stream, Multiple Data
streams). A multi-core superscalar processor is classified as an MIMD processor (Multiple
Instruction streams, Multiple Data streams).
The superscalar technique is traditionally associated with several identifying
characteristics (within a given CPU core):
◦ True Data Dependency: The result of one operation is an input to the next.
◦ Branch Dependency: Scheduling instructions across conditional branch statements cannot be done deterministically a priori.
◦ The scheduler, a piece of hardware, looks at a large number of instructions in an instruction queue and selects an appropriate number of instructions to execute concurrently, based on these factors.
*Locality of Reference - the phenomenon of the same value, or related storage locations, being frequently accessed.
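The dependency constraints above can be illustrated with a tiny hypothetical instruction sequence (a sketch, not processor code): the first two statements are independent and could be dispatched to two functional units in the same cycle, while the third carries a true data dependency on both:

    a, b = 2, 3
    x = a + b   # independent of the next statement...
    y = a * b   # ...so both could issue in the same cycle
    z = x + y   # true data dependency: must wait for x and y
    print(z)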
Limitations of Memory System Performance
Impact of Memory Bandwidth
i. Memory bandwidth is the rate at which data can be read from or stored into a
semiconductor memory by a processor.
ii. Memory bandwidth that is advertised for a given memory or system is usually the maximum
theoretical bandwidth. In practice the observed memory bandwidth will be less than (and is
guaranteed not to exceed) the advertised bandwidth.
iii. Memory bandwidth is determined by the bandwidth of the memory bus as well as the
memory units.
iv. Memory bandwidth can be improved by increasing the size of memory blocks.
v. The underlying system takes l time units (where l is the latency of the system) to deliver b
units of data (where b is the block size).
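To make the latency/block-size relation concrete, here is a small sketch; the 100-unit latency and one-time-unit-per-word transfer rate are assumptions, not figures from the slides:

    # Effective bandwidth when a request pays a fixed latency l and then
    # delivers a block of b words: larger blocks amortize the latency.
    def effective_bandwidth(b, l, t_word=1.0):
        return b / (l + b * t_word)  # words per time unit

    for b in (1, 4, 16, 64):
        print(b, round(effective_bandwidth(b, l=100.0), 4))
    # bandwidth grows with block size, approaching 1 word per time unit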
i. Exploiting spatial and temporal locality in applications is critical for amortizing memory
latency and increasing effective memory bandwidth.
ii. The ratio of the number of operations to number of memory accesses is a good indicator of
anticipated tolerance to memory bandwidth.
iii. Choosing memory layouts and organizing computation appropriately can have a significant impact on spatial and temporal locality.
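A standard illustration of spatial locality (a sketch, not from the slides): traversing a 2-D structure in row order touches adjacent memory, while column order strides across it. In a compiled language the gap is dramatic; in Python it is visible but muted:

    import time

    N = 1000
    matrix = [[i * N + j for j in range(N)] for i in range(N)]

    t0 = time.perf_counter()
    s1 = sum(matrix[i][j] for i in range(N) for j in range(N))  # row order
    t1 = time.perf_counter()
    s2 = sum(matrix[i][j] for j in range(N) for i in range(N))  # column order
    t2 = time.perf_counter()

    print(f"row-major:    {t1 - t0:.3f} s")
    print(f"column-major: {t2 - t1:.3f} s")  # typically slower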
Tradeoffs of Multithreading and Prefetching
Alternate approach for hiding memory latency
Consider the problem of browsing the web on a very slow network connection. We deal with the problem in one of
three possible ways:
i. we anticipate which pages we are going to browse ahead of time and issue requests for them in advance;
ii. we open multiple browsers and access different pages in each browser, thus while we are waiting for one page to
load, we could be reading others; or
iii. we access a whole bunch of pages in one go - amortizing the latency across various accesses.
Multithreading for latency hiding
i. Multithreading is the ability of a program or an operating system to serve more than one
user at a time and to manage multiple simultaneous requests without the need to have
multiple copies of the programs running within the computer.
ii. To support this, central processing units have hardware support to efficiently execute
multiple threads. This approach is distinguished from multiprocessing systems (such as
multi-core systems).
iii. Where multiprocessing systems include multiple complete processing units, multithreading
aims to increase utilization of a single core by using thread-level as well as instruction-level
parallelism.
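A minimal sketch of latency hiding with threads, mirroring the multiple-browser analogy; the one-second "page load" is simulated with sleep, and the page count is arbitrary:

    import threading, time

    def fetch(page_id):
        time.sleep(1.0)  # simulated memory/network latency
        print(f"page {page_id} loaded")

    t0 = time.perf_counter()
    threads = [threading.Thread(target=fetch, args=(i,)) for i in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(f"4 overlapped requests: {time.perf_counter() - t0:.1f} s, not 4 s")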
Prefetching for latency hiding
i. Adding a cache can provide faster access to needed instructions. Prefetching occurs when a
processor requests an instruction from main memory before it is actually needed. Once the
instruction comes back from memory, it is placed in a cache.
ii. Since programs are generally executed sequentially, performance is likely to be best when
instructions are prefetched in program order.
iii. Alternatively, the prefetch may be part of a complex branch prediction algorithm, where the
processor tries to anticipate the result of a calculation and fetch the right instructions in
advance.
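A software analogue of prefetching (a sketch, with fetch latency simulated by sleep): a background thread requests the next block while the current one is being processed, so the fetch latency overlaps with useful work:

    import threading, time, queue

    def fetch_block(i):
        time.sleep(0.5)  # simulated fetch latency
        return f"block-{i}"

    def prefetcher(n_blocks, buf):
        for i in range(n_blocks):
            buf.put(fetch_block(i))  # runs ahead of the consumer

    buf = queue.Queue(maxsize=2)  # small prefetch buffer (the "cache")
    threading.Thread(target=prefetcher, args=(4, buf), daemon=True).start()

    for _ in range(4):
        block = buf.get()  # usually ready: the latency was hidden
        time.sleep(0.5)    # simulated processing work
        print("processed", block)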
Tradeoffs of both
i. Multithreading and prefetching are critically impacted by memory bandwidth.
ii. Bandwidth requirements of a multithreaded system may increase very significantly because
of the smaller cache residency of each thread.
iii. Multithreaded systems become bandwidth bound instead of latency bound.
iv. Multithreading and prefetching only address the latency problem and may often worsen the
bandwidth problem.
v. Multithreading and prefetching also require significantly more hardware resources in the
form of storage.
Dichotomy of Parallel Computing Platforms
i. The dichotomy of parallel computing platforms is based on the logical organization and the physical organization of parallel platforms.
ii. Logical organization: Refers to a programmer's view of the platform.
iii. Physical organization: Refers to the actual hardware organization of the platform.
iv. Critical components of parallel computing from a programmer's perspective:
i. Ways of expressing parallel tasks (control structures).
ii. Mechanisms for specifying interaction between these tasks (communication model).
Control Structures of Parallel Platforms
i. Parallel tasks can be specified at various levels of granularity. At one extreme, each program in a set of programs can be
viewed as one parallel task. At the other extreme, individual instructions within a program can be viewed as parallel
tasks. Between these extremes lie a range of models for specifying the control structure of programs and the
corresponding architectural support for them.
ii. Processing units in parallel computers either operate under the centralized control of a single control unit or work
independently.
SIMD (Single Instruction Stream, Multiple Data Streams)
i. A single control unit dispatches instructions to each processing
unit.
ii. A global clock drives all processing elements (PEs). That is, at any given time, all PEs are synchronously executing the same instruction but on different sets of data; hence the name SIMD.
iii. Analogy: A drill sergeant gives orders to a platoon of soldiers. The
sergeant barks out a single order, “Attention!!”, and all soldiers
execute it in parallel, each one using his or her own arms, legs,
and body. The next order, say “Present Arms”, is not issued until
the sergeant is satisfied that the previous one has been
completed. Again, the soldiers all carry out this next order
simultaneously, with each one presenting his or her own weapon.
iv. These architectural enhancements rely on the highly structured
(regular) nature of the underlying computations, for example in
image processing and graphics, to deliver improved performance.
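In the drill-sergeant spirit, NumPy's vectorized operations give a convenient sketch of SIMD-style execution, one operation applied to many data elements at once (NumPy is an assumption here, not part of the slides):

    import numpy as np

    a = np.arange(8)               # one data element per processing element
    b = np.ones(8, dtype=a.dtype)
    print(a + b)                   # a single "add" order executed by all eight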
MIMD (Multiple Instruction Streams, Multiple Data Streams)
i. In MIMD computers, each processing element is capable of executing a different program independent of the other processing elements.
ii. A simple variant of this model, called the Single Program Multiple Data (SPMD) model, relies on multiple instances of the same program executing on different data; a sketch follows below.
iii. The SPMD model is widely used by many parallel platforms and requires minimal architectural support. Examples of such platforms include the Sun Ultra Servers, multiprocessor PCs, workstation clusters, and the IBM SP.
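A minimal SPMD sketch using Python's multiprocessing: every worker runs the same program text but selects its own share of the data by rank (the data size and process count are illustrative):

    from multiprocessing import Pool

    DATA = list(range(16))
    NPROCS = 4

    def work(rank):
        chunk = DATA[rank::NPROCS]  # this rank's share of the data
        return rank, sum(chunk)

    if __name__ == "__main__":
        with Pool(NPROCS) as pool:
            for rank, partial in pool.map(work, range(NPROCS)):
                print(f"rank {rank}: partial sum = {partial}")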
SIMD vs MIMD
i. SIMD computers require less hardware than MIMD computers because they have only a single control unit.
ii. Furthermore, SIMD computers require less memory because only one copy of the program needs
to be stored.
iii. However, the relative unpopularity of SIMD processors as general purpose compute engines can
be attributed to their specialized hardware architectures, economic factors, design constraints,
product life-cycle, and application characteristics.
iv. In contrast, MIMD computers store the program and operating system at each processor.
v. Platforms supporting the MIMD paradigm can be built from inexpensive off-the-shelf
components with relatively little effort in a short amount of time.
Communication Model of Parallel Platforms
i. There are two primary forms of data exchange between parallel tasks:
i. Accessing a shared data space.
ii. Exchanging messages.
ii. Platforms that provide a shared data space are called shared-address-space machines or
multiprocessors.
iii. Platforms that support messaging are also called message passing platforms or multicomputers.
Shared Address Space
i. The "shared-address-space" view of a parallel platform supports a common data space that
is accessible to all processors. Processors interact by modifying data objects stored in this
shared-address-space.
ii. Shared-address-space platforms supporting SPMD programming are also referred to as multiprocessors.
iii. If the time taken by a processor to access any memory word in the system (global or local) is identical, the platform is classified as a uniform memory access (UMA) machine; otherwise, it is a non-uniform memory access (NUMA) machine.
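A small sketch of the shared-address-space view using threads, which share one address space by construction; the lock serializes updates to the shared data object:

    import threading

    counter = 0
    lock = threading.Lock()

    def increment(times):
        global counter
        for _ in range(times):
            with lock:       # protect the shared location
                counter += 1

    threads = [threading.Thread(target=increment, args=(10_000,)) for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(counter)  # 40000: every update landed in the shared space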
Shared Address Space vs Shared Memory
i. It is important to note the difference between the terms shared address space and shared
memory.
ii. We refer to the former as a programming abstraction and to the latter as a physical machine
attribute.
iii. It is possible to provide a shared address space using a physically distributed memory.
Message Passing Platforms
Figure: Classification of interconnection networks: (a) a static network; and (b) a dynamic network.
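By contrast with the shared-address-space sketch above, a minimal message-passing sketch: two processes share no data and interact only through explicit send and receive over a pipe, which stands in for the interconnection network:

    from multiprocessing import Process, Pipe

    def worker(conn):
        msg = conn.recv()   # receive an explicit message...
        conn.send(msg * 2)  # ...and send an explicit reply
        conn.close()

    if __name__ == "__main__":
        parent, child = Pipe()
        p = Process(target=worker, args=(child,))
        p.start()
        parent.send(21)
        print("reply:", parent.recv())  # -> 42
        p.join()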
Physical Organization of Parallel Platforms
Interconnection Network
Network Topologies