Parallel Computing

Parallel computing
tectures are sometimes used alongside traditional proces-

sors, for accelerating specic tasks.
In some cases parallelism is transparent to the program-
mer, such as in bit-level or instruction-level parallelism,
but explicitly parallel algorithms, particularly those that
use concurrency, are more dicult to write than se-
quential ones,[7] because concurrency introduces several
new classes of potential software bugs, of which race
conditions are the most common. Communication and
synchronization between the dierent subtasks are typi-
cally some of the greatest obstacles to getting good par-
allel program performance.
IBMs Blue Gene/P massively parallel supercomputer. A theoretical upper bound on the speed-up of a single
program as a result of parallelization is given by Amdahls
Parallel computing is a type of computation in which law.
many calculations or the execution of processes are car-
ried out simultaneously.[1] Large problems can often be
divided into smaller ones, which can then be solved at the 1 Background
same time. There are several dierent forms of parallel
computing: bit-level, instruction-level, data, and task par-
allelism. Parallelism has been employed for many years, Traditionally, computer software has been written for se-
mainly in high-performance computing, but interest in it rial computation. To solve a problem, an algorithm is
has grown lately due to the physical constraints preventing constructed and implemented as a serial stream of in-
frequency scaling.[2] As power consumption (and conse- structions. These instructions are executed on a central
quently heat generation) by computers has become a con- processing unit on one computer. Only one instruction
cern in recent years,[3] parallel computing has become the may execute at a timeafter [8]
that instruction is nished,
dominant paradigm in computer architecture, mainly in the next one is executed.
the form of multi-core processors.[4] Parallel computing, on the other hand, uses multiple pro-
cessing elements simultaneously to solve a problem. This
Parallel computing is closely related to concurrent com-
putingthey are frequently used together, and often con- is accomplished by breaking the problem into indepen-
dent parts so that each processing element can execute its
ated, though the two are distinct: it is possible to have
parallelism without concurrency (such as bit-level par- part of the algorithm simultaneously with the others. The
processing elements can be diverse and include resources
allelism), and concurrency without parallelism (such as
[5][6] such as a single computer with multiple processors, sev-
multitasking by time-sharing on a single-core CPU).
eral networked computers, specialized hardware, or any
In parallel computing, a computational task is typically [8]
broken down in several, often many, very similar subtasks combination of the above.
that can be processed independently and whose results Frequency scaling was the dominant reason for improve-
are combined afterwards, upon completion. In contrast, ments in computer performance from the mid-1980s until
in concurrent computing, the various processes often do 2004. The runtime of a program is equal to the number
not address related tasks; when they do, as is typical in of instructions multiplied by the average time per instruc-
distributed computing, the separate tasks may have a var- tion. Maintaining everything else constant, increasing the
ied nature and often require some inter-process commu- clock frequency decreases the average time it takes to ex-
nication during execution. ecute an instruction. An increase in frequency thus de-
[9]
Parallel computers can be roughly classied according to creases runtime for all compute-bound programs.
the level at which the hardware supports parallelism, with However, power consumption P by a chip is given by the
multi-core and multi-processor computers having multi- equation P = C V 2 F, where C is the capacitance
ple processing elements within a single machine, while being switched per clock cycle (proportional to the num-
clusters, MPPs, and grids use multiple computers to work ber of transistors whose inputs change), V is voltage, and
on the same task. Specialized parallel computer archi- F is the processor frequency (cycles per second).[10] In-
1
2 1 BACKGROUND
creases in frequency increase the amount of power used Optimally, the speedup from parallelization would be
in a processor. Increasing processor power consumption lineardoubling the number of processing elements
led ultimately to Intel's May 8, 2004 cancellation of its should halve the runtime, and doubling it a second time
Tejas and Jayhawk processors, which is generally cited as should again halve the runtime. However, very few par-
the end of frequency scaling as the dominant computer allel algorithms achieve optimal speedup. Most of them
architecture paradigm.[11] have a near-linear speedup for small numbers of process-
Moores law is the empirical observation that the num- ing elements, which attens out into a constant value for
ber of transistors in a microprocessor doubles every 18 large numbers of processing elements.
to 24 months.[12] Despite power consumption issues, and The potential speedup of an algorithm on a parallel com-
repeated predictions of its end, Moores law is still in ef- puting platform is given by Amdahls law[13]
fect. With the end of frequency scaling, these additional
transistors (which are no longer used for frequency scal-
ing) can be used to add extra hardware for parallel com- 1
Slatency (s) = ,
puting. 1 p + ps
where
1.1 Amdahls law and Gustafsons law
S is the potential speedup in latency of the ex-
Amdahl's Law
ecution of the whole task;
20.00
18.00
s is the speedup in latency of the execution of the
16.00
Parallel portion
50%
parallelizable part of the task;
75%
14.00 90%
12.00
95%
p is the percentage of the execution time of the
whole task concerning the parallelizable part of the
Speedup
10.00
8.00
task before parallelization.
6.00
4.00 Since S < 1/(1 - p), it shows that a small part of the
2.00 program which cannot be parallelized will limit the over-
0.00 all speedup available from parallelization. A program
16384
32768
65536
1024
2048
4096
8192
128
256
512
16
32
64
1
solving a large mathematical or engineering problem will

Number of processors
typically consist of several parallelizable parts and several
non-parallelizable (serial) parts. If the non-parallelizable
A graphical representation of Amdahls law. The speedup of a part of a program accounts for 10% of the runtime (p =
program from parallelization is limited by how much of the pro- 0.9), we can get no more than a 10 times speedup, re-
gram can be parallelized. For example, if 90% of the program gardless of how many processors are added. This puts
can be parallelized, the theoretical maximum speedup using par- an upper limit on the usefulness of adding more parallel
allel computing would be 10 times no matter how many proces-
execution units. When a task cannot be partitioned be-
sors are used.
cause of sequential constraints, the application of more
eort has no eect on the schedule. The bearing of a
child takes nine months, no matter how many women are
Two independent parts A B assigned.[14]
Original process
Make B 5x faster
Make A 2x faster
Assume that a task has two independent parts, A and B. Part

B takes roughly 25% of the time of the whole computation. By
working very hard, one may be able to make this part 5 times
faster, but this only reduces the time for the whole computation by
a little. In contrast, one may need to perform less work to make
part A be twice as fast. This will make the computation much
faster than by optimizing part B, even though part B's speedup is
greater by ratio, (5 times versus 2 times). A graphical representation of Gustafsons law.
1.3 Race conditions, mutual exclusion, synchronization, and parallel slowdown 3
Amdahls law only applies to cases where the problem In this example, instruction 3 cannot be executed before
size is xed. In practice, as more computing resources (or even in parallel with) instruction 2, because instruc-
become available, they tend to get used on larger prob- tion 3 uses a result from instruction 2. It violates condi-
lems (larger datasets), and the time spent in the paral- tion 1, and thus introduces a ow dependency.
lelizable part often grows much faster than the inher- 1: function NoDep(a, b) 2: c := a * b 3: d := 3 * b 4: e
ently serial work.[15] In this case, Gustafsons law gives := a + b 5: end function
a less pessimistic and more realistic assessment of paral-
lel performance:[16] In this example, there are no dependencies between the
instructions, so they can all be run in parallel.
Bernsteins conditions do not allow memory to be shared
Slatency (s) = 1 p + sp. between dierent processes. For that, some means of en-
forcing an ordering between accesses is necessary, such
Both Amdahls law and Gustafsons law assume that the as semaphores, barriers or some other synchronization
running time of the serial part of the program is indepen- method.
dent of the number of processors. Amdahls law assumes
that the entire problem is of xed size so that the total
amount of work to be done in parallel is also independent
of the number of processors, whereas Gustafsons law as- 1.3 Race conditions, mutual exclusion,
sumes that the total amount of work to be done in parallel synchronization, and parallel slow-
varies linearly with the number of processors. down
Subtasks in a parallel program are often called threads.

1.2 Dependencies Some parallel computer architectures use smaller,
lightweight versions of threads known as bers, while oth-
Understanding data dependencies is fundamental in im- ers use bigger versions known as processes. However,
plementing parallel algorithms. No program can run threads is generally accepted as a generic term for sub-
more quickly than the longest chain of dependent calcu- tasks. Threads will often need to update some variable
lations (known as the critical path), since calculations that that is shared between them. The instructions between
depend upon prior calculations in the chain must be exe- the two programs may be interleaved in any order. For
cuted in order. However, most algorithms do not consist example, consider the following program:
of just a long chain of dependent calculations; there are
usually opportunities to execute independent calculations If instruction 1B is executed between 1A and 3A, or if
in parallel. instruction 1A is executed between 1B and 3B, the pro-
gram will produce incorrect data. This is known as a race
Let Pi and Pj be two program segments. Bernsteins condition. The programmer must use a lock to provide
conditions[17] describe when the two are independent and mutual exclusion. A lock is a programming language con-
can be executed in parallel. For Pi, let Ii be all of the in- struct that allows one thread to take control of a variable
put variables and Oi the output variables, and likewise for and prevent other threads from reading or writing it, un-
Pj. Pi and Pj are independent if they satisfy til that variable is unlocked. The thread holding the lock
is free to execute its critical section (the section of a pro-
gram that requires exclusive access to some variable), and
Ij Oi = , to unlock the data when it is nished. Therefore, to guar-
antee correct program execution, the above program can
Ii Oj = , be rewritten to use locks:
Oi Oj = . One thread will successfully lock variable V, while the
other thread will be locked outunable to proceed until
Violation of the rst condition introduces a ow depen- V is unlocked again. This guarantees correct execution
dency, corresponding to the rst segment producing a of the program. Locks, while necessary to ensure correct
result used by the second segment. The second condi- program execution, can greatly slow a program.
tion represents an anti-dependency, when the second seg-
ment produces a variable needed by the rst segment. Locking multiple variables using non-atomic locks intro-
The third and nal condition represents an output depen- duces the possibility of program deadlock. An atomic
dency: when two segments write to the same location, the lock locks multiple variables all at once. If it cannot
result comes from the logically last executed segment.[18] lock all of them, it does not lock any of them. If two
threads each need to lock the same two variables using
Consider the following functions, which demonstrate sev- non-atomic locks, it is possible that one thread will lock
eral kinds of dependencies: one of them and the second thread will lock the second
1: function Dep(a, b) 2: c := a * b 3: d := 3 * c 4: end variable. In such a case, neither thread can complete, and
function deadlock results.
4 2 TYPES OF PARALLELISM
Many parallel programs require that their subtasks act in eral ways. Petri nets, which were introduced in Carl
synchrony. This requires the use of a barrier. Barriers are Adam Petris 1962 doctoral thesis, were an early at-
typically implemented using a software lock. One class of tempt to codify the rules of consistency models. Dataow
algorithms, known as lock-free and wait-free algorithms, theory later built upon these, and Dataow architec-
altogether avoids the use of locks and barriers. However, tures were created to physically implement the ideas of
this approach is generally dicult to implement and re- dataow theory. Beginning in the late 1970s, process
quires correctly designed data structures. calculi such as Calculus of Communicating Systems and
Not all parallelization results in speed-up. Generally, as a Communicating Sequential Processes were developed to
permit algebraic reasoning about systems composed of
task is split up into more and more threads, those threads
spend an ever-increasing portion of their time communi- interacting components. More recent additions to the
process calculus family, such as the -calculus, have
cating with each other. Eventually, the overhead from
communication dominates the time spent solving the added the capability for reasoning about dynamic topolo-
gies. Logics such as Lamports TLA+, and mathemati-
problem, and further parallelization (that is, splitting the
workload over even more threads) increases rather than cal models such as traces and Actor event diagrams, have
decreases the amount of time required to nish. This is also been developed to describe the behavior of concur-
known as parallel slowdown. rent systems.
See also: Relaxed sequential
1.4 Fine-grained, coarse-grained, and em-

barrassing parallelism
1.6 Flynns taxonomy
Applications are often classied according to how often
their subtasks need to synchronize or communicate with Michael J. Flynn created one of the earliest classication
each other. An application exhibits ne-grained paral- systems for parallel (and sequential) computers and pro-
lelism if its subtasks must communicate many times per grams, now known as Flynns taxonomy. Flynn classied
second; it exhibits coarse-grained parallelism if they do programs and computers by whether they were operat-
not communicate many times per second, and it exhibits ing using a single set or multiple sets of instructions, and
embarrassing parallelism if they rarely or never have to whether or not those instructions were using a single set
communicate. Embarrassingly parallel applications are or multiple sets of data.
considered the easiest to parallelize. The single-instruction-single-data (SISD) classication is
equivalent to an entirely sequential program. The single-
instruction-multiple-data (SIMD) classication is analo-
1.5 Consistency models gous to doing the same operation repeatedly over a large
data set. This is commonly done in signal processing ap-
Main article: Consistency model plications. Multiple-instruction-single-data (MISD) is a
rarely used classication. While computer architectures
to deal with this were devised (such as systolic arrays),
Parallel programming languages and parallel computers
few applications that t this class materialized. Multiple-
must have a consistency model (also known as a memory
instruction-multiple-data (MIMD) programs are by far
model). The consistency model denes rules for how op-
the most common type of parallel programs.
erations on computer memory occur and how results are
produced. According to David A. Patterson and John L. Hen-
nessy, Some machines are hybrids of these categories,
One of the rst consistency models was Leslie Lamport's
of course, but this classic model has survived because
sequential consistency model. Sequential consistency is
it is simple, easy to understand, and gives a good
the property of a parallel program that its parallel execu-
rst approximation. It is alsoperhaps because of its
tion produces the same results as a sequential program.
understandabilitythe most widely used scheme.[20]
Specically, a program is sequentially consistent if "
the results of any execution is the same as if the opera-
tions of all the processors were executed in some sequen-
tial order, and the operations of each individual proces- 2 Types of parallelism
sor appear in this sequence in the order specied by its
program.[19] 2.1 Bit-level parallelism
Software transactional memory is a common type of con-
sistency model. Software transactional memory borrows Main article: Bit-level parallelism
from database theory the concept of atomic transactions
and applies them to memory accesses. From the advent of very-large-scale integration (VLSI)
Mathematically, these models can be represented in sev- computer-chip fabrication technology in the 1970s un-
2.3 Task parallelism 5
til about 1986, speed-up in computer architecture was All modern processors have multi-stage instruction
driven by doubling computer word sizethe amount of pipelines. Each stage in the pipeline corresponds to a dif-
information the processor can manipulate per cycle.[21] ferent action the processor performs on that instruction in
Increasing the word size reduces the number of instruc- that stage; a processor with an N-stage pipeline can have
tions the processor must execute to perform an operation up to N dierent instructions at dierent stages of com-
on variables whose sizes are greater than the length of pletion and thus can issue one instruction per clock cycle
the word. For example, where an 8-bit processor must (IPC = 1). These processors are known as scalar pro-
add two 16-bit integers, the processor must rst add the cessors. The canonical example of a pipelined processor
8 lower-order bits from each integer using the standard is a RISC processor, with ve stages: instruction fetch
addition instruction, then add the 8 higher-order bits us- (IF), instruction decode (ID), execute (EX), memory ac-
ing an add-with-carry instruction and the carry bit from cess (MEM), and register write back (WB). The Pentium
the lower order addition; thus, an 8-bit processor requires 4 processor had a 35-stage pipeline.[23]
two instructions to complete a single operation, where a
16-bit processor would be able to complete the operation
with a single instruction.
Historically, 4-bit microprocessors were replaced with 8-
bit, then 16-bit, then 32-bit microprocessors. This trend
generally came to an end with the introduction of 32-bit
processors, which has been a standard in general-purpose
computing for two decades. Not until the early twothou-
sands, with the advent of x86-64 architectures, did 64-bit
processors become commonplace.
2.2 Instruction-level parallelism A canonical ve-stage pipelined superscalar processor. In the

best case scenario, it takes one clock cycle to complete two instruc-
Main article: Instruction-level parallelism tions and thus the processor can issue superscalar performance
A computer program is, in essence, a stream of in- (IPC = 2 > 1).
Most modern processors also have multiple execution

units. They usually combine this feature with pipelin-
ing and thus can issue more than one instruction per
clock cycle (IPC > 1). These processors are known
A canonical processor without pipeline. It takes ve clock cy- as superscalar processors. Instructions can be grouped
cles to complete one instruction and thus the processor can issue together only if there is no data dependency between
subscalar performance (IPC = 0.2 < 1). them. Scoreboarding and the Tomasulo algorithm (which
is similar to scoreboarding but makes use of register re-
naming) are two of the most common techniques for im-
plementing out-of-order execution and instruction-level
parallelism.
2.3 Task parallelism

A canonical ve-stage pipelined processor. In the best case sce-
Main article: Task parallelism
nario, it takes one clock cycle to complete one instruction and thus
the processor can issue scalar performance (IPC = 1).
Task parallelisms is the characteristic of a parallel pro-
structions executed by a processor. Without instruction- gram that entirely dierent calculations can be per-
level parallelism, a processor can only issue less than one formed on either the same or dierent sets of data.[24]
instruction per clock cycle (IPC < 1). These processors This contrasts with data parallelism, where the same cal-
are known as subscalar processors. These instructions culation is performed on the same or dierent sets of
can be re-ordered and combined into groups which are data. Task parallelism involves the decomposition of a
then executed in parallel without changing the result of task into sub-tasks and then allocating each sub-task to a
the program. This is known as instruction-level paral- processor for execution. The processors would then ex-
lelism. Advances in instruction-level parallelism domi- ecute these sub-tasks simultaneously and often coopera-
nated computer architecture from the mid-1980s until the tively. Task parallelism does not usually scale with the
mid-1990s.[22] size of a problem.[25]
6 3 HARDWARE
3 Hardware memory, a crossbar switch, a shared bus or an intercon-

nect network of a myriad of topologies including star,
ring, tree, hypercube, fat hypercube (a hypercube with
3.1 Memory and communication
more than one processor at a node), or n-dimensional
mesh.
Main memory in a parallel computer is either shared
memory (shared between all processing elements in a sin- Parallel computers based on interconnected networks
gle address space), or distributed memory (in which each need to have some kind of routing to enable the passing of
processing element has its own local address space).[26] messages between nodes that are not directly connected.
Distributed memory refers to the fact that the memory The medium used for communication between the pro-
is logically distributed, but often implies that it is phys- cessors is likely to be hierarchical in large multiprocessor
ically distributed as well. Distributed shared memory machines.
and memory virtualization combine the two approaches,
where the processing element has its own local memory
and access to the memory on non-local processors. Ac- 3.2 Classes of parallel computers
cesses to local memory are typically faster than accesses
to non-local memory. Parallel computers can be roughly classied according to
the level at which the hardware supports parallelism. This
Processor Processor Processor Processor Processor Processor Processor Processor classication is broadly analogous to the distance between
basic computing nodes. These are not mutually exclusive;
Bus Bus
for example, clusters of symmetric multiprocessors are
Memory Memory relatively common.
Distributed shared memory

network with directory 3.2.1 Multi-core computing
A logical view of a non-uniform memory access (NUMA) archi-

Main article: Multi-core processor
tecture. Processors in one directory can access that directorys
memory with less latency than they can access memory in the A multi-core processor is a processor that includes mul-
other directorys memory. tiple processing units (called cores) on the same chip.
This processor diers from a superscalar processor,
Computer architectures in which each element of
which includes multiple execution units and can issue
main memory can be accessed with equal latency and
multiple instructions per clock cycle from one instruction
bandwidth are known as uniform memory access (UMA)
stream (thread); in contrast, a multi-core processor can is-
systems. Typically, that can be achieved only by a shared
sue multiple instructions per clock cycle from multiple in-
memory system, in which the memory is not physically
struction streams. IBM's Cell microprocessor, designed
distributed. A system that does not have this property is
for use in the Sony PlayStation 3, is a prominent multi-
known as a non-uniform memory access (NUMA) archi-
core processor. Each core in a multi-core processor can
tecture. Distributed memory systems have non-uniform
potentially be superscalar as wellthat is, on every clock
memory access.
cycle, each core can issue multiple instructions from one
Computer systems make use of cachessmall and fast thread.
memories located close to the processor which store tem-
Simultaneous multithreading (of which Intels Hyper-
porary copies of memory values (nearby in both the phys-
Threading is the best known) was an early form of
ical and logical sense). Parallel computer systems have
pseudo-multi-coreism. A processor capable of simulta-
diculties with caches that may store the same value in
neous multithreading includes multiple execution units
more than one location, with the possibility of incorrect
in the same processing unitthat is it has a super-
program execution. These computers require a cache
scalar architectureand can issue multiple instructions
coherency system, which keeps track of cached values
per clock cycle from multiple threads. Temporal multi-
and strategically purges them, thus ensuring correct pro-
threading on the other hand includes a single execution
gram execution. Bus snooping is one of the most com-
unit in the same processing unit and can issue one instruc-
mon methods for keeping track of which values are being
tion at a time from multiple threads.
accessed (and thus should be purged). Designing large,
high-performance cache coherence systems is a very dif-
cult problem in computer architecture. As a result, 3.2.2 Symmetric multiprocessing
shared memory computer architectures do not scale as
well as distributed memory systems do.[26] Main article: Symmetric multiprocessing
Processorprocessor and processormemory communi-
cation can be implemented in hardware in several ways, A symmetric multiprocessor (SMP) is a computer sys-
including via shared (either multiported or multiplexed) tem with multiple identical processors that share mem-
3.2 Classes of parallel computers 7
ory and connect via a bus.[27] Bus contention prevents bus a TCP/IP Ethernet local area network.[30] Beowulf tech-
architectures from scaling. As a result, SMPs generally nology was originally developed by Thomas Sterling and
do not comprise more than 32 processors.[28] Because of Donald Becker. The vast majority of the TOP500 super-
the small size of the processors and the signicant re- computers are clusters.[31]
duction in the requirements for bus bandwidth achieved Because grid computing systems (described below) can
by large caches, such symmetric multiprocessors are ex- easily handle embarrassingly parallel problems, modern
tremely cost-eective, provided that a sucient amount clusters are typically designed to handle more dicult
of memory bandwidth exists.[27] problemsproblems that require nodes to share inter-
mediate results with each other more often. This re-
3.2.3 Distributed computing quires a high bandwidth and, more importantly, a low-
latency interconnection network. Many historic and cur-
rent supercomputers use customized high-performance
Main article: Distributed computing
network hardware specically designed for cluster com-
puting, such as the Cray Gemini network.[32] As of 2014,
A distributed computer (also known as a distributed most current supercomputers use some o-the-shelf stan-
memory multiprocessor) is a distributed memory com- dard network hardware, often Myrinet, InniBand, or
puter system in which the processing elements are con- Gigabit Ethernet.
nected by a network. Distributed computers are highly
scalable.
Massively parallel computing Main article:
Massively parallel (computing)
Cluster computing Main article: Computer cluster A massively parallel processor (MPP) is a single
A cluster is a group of loosely coupled computers that
A Beowulf cluster.
work together closely, so that in some respects they can A cabinet from IBM's Blue Gene/L massively parallel
be regarded as a single computer.[29] Clusters are com- supercomputer.
posed of multiple standalone machines connected by a
network. While machines in a cluster do not have to be computer with many networked processors. MPPs have
symmetric, load balancing is more dicult if they are many of the same characteristics as clusters, but MPPs
not. The most common type of cluster is the Beowulf have specialized interconnect networks (whereas clusters
cluster, which is a cluster implemented on multiple iden- use commodity hardware for networking). MPPs also
tical commercial o-the-shelf computers connected with tend to be larger than clusters, typically having far
8 3 HARDWARE
more than 100 processors.[33] In an MPP, each CPU walked into AMD, they called us 'the socket stealers.'
contains its own memory and copy of the operating Now they call us their partners.[36]
system and application. Each subsystem communicates
with the others via a high-speed interconnect.[34]
General-purpose computing on graphics processing
IBM's Blue Gene/L, the fth fastest supercomputer in the units (GPGPU) Main article: GPGPU
world according to the June 2009 TOP500 ranking, is an General-purpose computing on graphics processing units
MPP.
Grid computing Main article: Grid computing
Grid computing is the most distributed form of paral-

lel computing. It makes use of computers communicat-
ing over the Internet to work on a given problem. Be-
cause of the low bandwidth and extremely high latency Nvidias Tesla GPGPU card
available on the Internet, distributed computing typically
deals only with embarrassingly parallel problems. Many (GPGPU) is a fairly recent trend in computer engineering
distributed computing applications have been created, research. GPUs are co-processors that have been heav-
of which SETI@home and Folding@home are the best- ily optimized for computer graphics processing.[37] Com-
known examples.[35] puter graphics processing is a eld dominated by data par-
allel operationsparticularly linear algebra matrix oper-
Most grid computing applications use middleware (soft-
ations.
ware that sits between the operating system and the appli-
cation to manage network resources and standardize the In the early days, GPGPU programs used the normal
software interface). The most common distributed com- graphics APIs for executing programs. However, sev-
puting middleware is the Berkeley Open Infrastructure eral new programming languages and platforms have
for Network Computing (BOINC). Often, distributed been built to do general purpose computation on GPUs
computing software makes use of spare cycles, per- with both Nvidia and AMD releasing programming en-
forming computations at times when a computer is idling. vironments with CUDA and Stream SDK respectively.
Other GPU programming languages include BrookGPU,
PeakStream, and RapidMind. Nvidia has also released
3.2.4 Specialized parallel computers specic products for computation in their Tesla series.
The technology consortium Khronos Group has released
Within parallel computing, there are specialized parallel the OpenCL specication, which is a framework for writ-
devices that remain niche areas of interest. While not ing programs that execute across platforms consisting of
domain-specic, they tend to be applicable to only a few CPUs and GPUs. AMD, Apple, Intel, Nvidia and others
classes of parallel problems. are supporting OpenCL.
Recongurable computing with eld-programmable Application-specic integrated circuits Main arti-

gate arrays Recongurable computing is the use of a cle: Application-specic integrated circuit
eld-programmable gate array (FPGA) as a co-processor
to a general-purpose computer. An FPGA is, in essence, Several application-specic integrated circuit (ASIC) ap-
a computer chip that can rewire itself for a given task.
proaches have been devised for dealing with parallel
FPGAs can be programmed with hardware description applications.[38][39][40]
languages such as VHDL or Verilog. However, program- Because an ASIC is (by denition) specic to a given
ming in these languages can be tedious. Several vendors application, it can be fully optimized for that applica-
have created C to HDL languages that attempt to emu- tion. As a result, for a given application, an ASIC
late the syntax and semantics of the C programming lan- tends to outperform a general-purpose computer. How-
guage, with which most programmers are familiar. The ever, ASICs are created by UV photolithography. This
best known C to HDL languages are Mitrion-C, Impulse process requires a mask set, which can be extremely
C, DIME-C, and Handel-C. Specic subsets of SystemC expensive. A mask set can cost over a million US
based on C++ can also be used for this purpose. dollars.[41] (The smaller the transistors required for the
AMDs decision to open its HyperTransport technology chip, the more expensive the mask will be.) Mean-
to third-party vendors has become the enabling technol- while, performance increases in general-purpose comput-
ogy for high-performance recongurable computing.[36] ing over time (as described by Moores law) tend to wipe
According to Michael R. D'Amour, Chief Operating Of- out these gains in only one or two chip generations.[36]
cer of DRC Computer Corporation, when we rst High initial cost, and the tendency to be overtaken by
4.2 Automatic parallelization 9
Moores-law-driven general-purpose computing, has ren- Passing Interface (MPI) is the most widely used message-
dered ASICs unfeasible for most parallel computing ap- passing system API.[43] One concept used in program-
plications. However, some have been built. One example ming parallel programs is the future concept, where one
is the PFLOPS RIKEN MDGRAPE-3 machine which part of a program promises to deliver a required datum
uses custom ASICs for molecular dynamics simulation. to another part of a program at some future time.
CAPS entreprise and Pathscale are also coordinating their
Vector processors Main article: Vector processor eort to make hybrid multi-core parallel programming
A vector processor is a CPU or computer system that can (HMPP) directives an open standard called OpenHMPP.
The OpenHMPP directive-based programming model
oers a syntax to eciently ooad computations on
hardware accelerators and to optimize data movement
to/from the hardware memory. OpenHMPP directives
describe remote procedure call (RPC) on an accelerator
device (e.g. GPU) or more generally a set of cores. The
directives annotate C or Fortran codes to describe two
sets of functionalities: the ooading of procedures (de-
noted codelets) onto a remote device and the optimization
of data transfers between the CPU main memory and the
accelerator memory.
The rise of consumer GPUs has led to support for
compute kernels, either in graphics APIs (referred to as
The Cray-1 is a vector processor. compute shaders), in dedicated APIs (such as OpenCL),
or in other language extensions.
execute the same instruction on large sets of data. Vector
processors have high-level operations that work on linear
arrays of numbers or vectors. An example vector opera- 4.2 Automatic parallelization
tion is A = B C, where A, B, and C are each 64-element
vectors of 64-bit oating-point numbers.[42] They are Main article: Automatic parallelization
closely related to Flynns SIMD classication.[42]
Cray computers became famous for their vector- Automatic parallelization of a sequential program by a
processing computers in the 1970s and 1980s. How- compiler is the holy grail of parallel computing. Despite
ever, vector processorsboth as CPUs and as full com- decades of work by compiler researchers, automatic par-
puter systemshave generally disappeared. Modern allelization has had only limited success.[44]
processor instruction sets do include some vector process- Mainstream parallel programming languages remain ei-
ing instructions, such as with Freescale Semiconductor's ther explicitly parallel or (at best) partially implicit, in
AltiVec and Intel's Streaming SIMD Extensions (SSE). which a programmer gives the compiler directives for
parallelization. A few fully implicit parallel programming
languages existSISAL, Parallel Haskell, SequenceL,
4 Software System C (for FPGAs), Mitrion-C, VHDL, and Verilog.
4.1 Parallel programming languages

4.3 Application checkpointing
Main article: List of concurrent and parallel program-
ming languages Main article: Application checkpointing
Concurrent programming languages, libraries, APIs, and As a computer system grows in complexity, the mean
parallel programming models (such as algorithmic skele- time between failures usually decreases. Application
tons) have been created for programming parallel com- checkpointing is a technique whereby the computer sys-
puters. These can generally be divided into classes based tem takes a snapshot of the applicationa record of
on the assumptions they make about the underlying mem- all current resource allocations and variable states, akin
ory architectureshared memory, distributed memory, to a core dump; this information can be used to re-
or shared distributed memory. Shared memory program- store the program if the computer should fail. Applica-
ming languages communicate by manipulating shared tion checkpointing means that the program has to restart
memory variables. Distributed memory uses message from only its last checkpoint rather than the beginning.
passing. POSIX Threads and OpenMP are two of the While checkpointing provides benets in a variety of sit-
most widely used shared memory APIs, whereas Message uations, it is especially useful in highly parallel systems
10 7 HISTORY
with a large number of processors used in high perfor- 7 History

mance computing.[45]
Main article: History of computing
The origins of true (MIMD) parallelism go back to Luigi
5 Algorithmic methods
As parallel computers become larger and faster, it be-
comes feasible to solve problems that previously took too
long to run. Parallel computing is used in a wide range of
elds, from bioinformatics (protein folding and sequence
analysis) to economics (mathematical nance). Common
types of problems found in parallel computing applica-
tions are:[46]
dense linear algebra;

sparse linear algebra;
spectral methods (such as CooleyTukey fast
Fourier transform)
ILLIAC IV, the most infamous of supercomputers.[48]
N-body problems (such as BarnesHut simulation);
structured grid problems (such as Lattice Boltzmann Federico Menabrea and his Sketch of the Analytic Engine
methods); Invented by Charles Babbage.[49][50][51]
unstructured grid problems (such as found in nite In April 1958, S. Gill (Ferranti) discussed parallel pro-
element analysis); gramming and the need for branching and waiting.[52]
Also in 1958, IBM researchers John Cocke and Daniel
Monte Carlo method; Slotnick discussed the use of parallelism in numerical
calculations for the rst time.[53] Burroughs Corporation
combinational logic (such as brute-force crypto- introduced the D825 in 1962, a four-processor com-
graphic techniques); puter that accessed up to 16 memory modules through
graph traversal (such as sorting algorithms); a crossbar switch.[54] In 1967, Amdahl and Slotnick pub-
lished a debate about the feasibility of parallel process-
dynamic programming; ing at American Federation of Information Processing
Societies Conference.[53] It was during this debate that
branch and bound methods; Amdahls law was coined to dene the limit of speed-up
graphical models (such as detecting hidden Markov due to parallelism.
models and constructing Bayesian networks); In 1969, company Honeywell introduced its rst Multics
system, a symmetric multiprocessor system capable of
nite-state machine simulation.
running up to eight processors in parallel.[53] C.mmp, a
1970s multi-processor project at Carnegie Mellon Uni-
versity, was among the rst multiprocessors with more
6 Fault-tolerance than a few processors.[50] The rst bus-connected multi-
processor with snooping caches was the Synapse N+1 in
Further information: Fault-tolerant computer system 1984.[50]
SIMD parallel computers can be traced back to the
Parallel computing can also be applied to the design of 1970s. The motivation behind early SIMD computers
fault-tolerant computer systems, particularly via lockstep was to amortize the gate delay of the processors control
systems performing the same operation in parallel. This unit over multiple instructions.[55] In 1964, Slotnick had
provides redundancy in case one component should fail, proposed building a massively parallel computer for the
and also allows automatic error detection and error cor- Lawrence Livermore National Laboratory.[53] His design
rection if the results dier. These methods can be used was funded by the US Air Force, which was the earli-
to help prevent single event upsets caused by transient est SIMD parallel-computing eort, ILLIAC IV.[53] The
errors.[47] Although additional measures may be required key to its design was a fairly high parallelism, with up
in embedded or specialized systems, this method can pro- to 256 processors, which allowed the machine to work on
vide a cost eective approach to achieve n-modular re- large datasets in what would later be known as vector pro-
dundancy in commercial o-the-shelf systems. cessing. However, ILLIAC IV was called the most infa-
11
mous of supercomputers, because the project was only primary method of improving processor performance
one fourth completed, but took 11 years and cost almost Even representatives from Intel, a company generally as-
four times the original estimate.[48] When it was nally sociated with the 'higher clock-speed is better' position,
ready to run its rst real application in 1976, it was out- warned that traditional approaches to maximizing perfor-
performed by existing commercial supercomputers such mance through maximizing clock speed have been pushed
to their limits.
as the Cray-1.
[5] Concurrency is not Parallelism, Waza conference Jan 11,
2012, Rob Pike (slides) (video)
8 See also
[6] Parallelism vs. Concurrency. Haskell Wiki.
List of important publications in concurrent, paral- [7] Hennessy, John L.; Patterson, David A.; Larus, James
lel, and distributed computing R. (1999). Computer organization and design: the hard-
ware/software interface (2. ed., 3rd print. ed.). San Fran-
List of distributed computing conferences cisco: Kaufmann. ISBN 1-55860-428-6.
Concurrency (computer science)
[8] Barney, Blaise. Introduction to Parallel Computing.
Synchronous programming Lawrence Livermore National Laboratory. Retrieved
2007-11-09.
Content Addressable Parallel Processor
[9] Hennessy, John L.; Patterson, David A. (2002). Com-
Manycore puter architecture / a quantitative approach. (3rd ed.). San
Francisco, Calif.: International Thomson. p. 43. ISBN 1-
Serializability 55860-724-2.
Transputer [10] Rabaey, Jan M. (1996). Digital integrated circuits : a de-

sign perspective. Upper Saddle River, N.J.: Prentice-Hall.
Parallel programming model p. 235. ISBN 0-13-178609-1.
vector processing [11] Flynn, Laurie J. (8 May 2004). Intel Halts Development
Of 2 New Microprocessors. New York Times. Retrieved
Multi tasking
5 June 2012.
Fengs Classication
[12] Moore, Gordon E. (1965). Cramming more components
onto integrated circuits (PDF). Electronics Magazine. p.
4. Retrieved 2006-11-11.
9 References
[13] Amdahl, Gene M. (1967). Validity of the single pro-
[1] Gottlieb, Allan; Almasi, George S. (1989). Highly cessor approach to achieving large scale computing capa-
parallel computing. Redwood City, Calif.: Ben- bilities. Proceeding AFIPS '67 (Spring) Proceedings of
jamin/Cummings. ISBN 0-8053-0177-1. the April 1820, 1967, spring joint computer conference:
483485. doi:10.1145/1465482.1465560.
[2] S.V. Adve et al. (November 2008). Parallel Computing
Research at Illinois: The UPCRC Agenda (PDF). Paral- [14] Brooks, Frederick P. (1996). The mythical man month es-
lel@Illinois, University of Illinois at Urbana-Champaign. says on software engineering (Anniversary ed., repr. with
The main techniques for these performance benets corr., 5. [Dr.] ed.). Reading, Mass. [u.a.]: Addison-
increased clock frequency and smarter but increasingly Wesley. ISBN 0-201-83595-9.
complex architecturesare now hitting the so-called
power wall. The computer industry has accepted that fu- [15] Michael McCool; James Reinders; Arch Robison (2013).
ture performance increases must largely come from in- Structured Parallel Programming: Patterns for Ecient
creasing the number of processors (or cores) on a die, Computation. Elsevier. p. 61.
rather than making a single core go faster.
[16] Gustafson, John L. (May 1988). Reevaluating Amdahls
[3] Asanovic et al. Old [conventional wisdom]: Power is free, law. Communications of the ACM. 31 (5): 532533.
but transistors are expensive. New [conventional wisdom] doi:10.1145/42411.42415.
is [that] power is expensive, but transistors are free.
[17] Bernstein, A. J. (1 October 1966). Analysis of
[4] Asanovic, Krste et al. (December 18, 2006). The Programs for Parallel Processing. IEEE Transac-
Landscape of Parallel Computing Research: A View tions on Electronic Computers. EC-15 (5): 757763.
from Berkeley (PDF). University of California, Berke- doi:10.1109/PGEC.1966.264565.
ley. Technical Report No. UCB/EECS-2006-183. Old
[conventional wisdom]: Increasing clock frequency is the [18] Roosta, Seyed H. (2000). Parallel processing and paral-
primary method of improving processor performance. lel algorithms : theory and computation. New York, NY
New [conventional wisdom]: Increasing parallelism is the [u.a.]: Springer. p. 114. ISBN 0-387-98716-9.
12 9 REFERENCES
[19] Lamport, Leslie (1 September 1979). How to Make a [39] Shimokawa, Y.; Fuwa, Y.; Aramaki, N. (1821
Multiprocessor Computer That Correctly Executes Multi- November 1991). A parallel ASIC VLSI neuro-
process Programs. IEEE Transactions on Computers. C computer for a large number of neurons and billion
28 (9): 690691. doi:10.1109/TC.1979.1675439. connections per second speed. International Joint
Conference on Neural Networks. 3: 21622167.
[20] Patterson and Hennessy, p. 748. doi:10.1109/IJCNN.1991.170708. ISBN 0-7803-0227-
3.
[21] Singh, David Culler ; J.P. (1997). Parallel computer ar-
chitecture ([Nachdr.] ed.). San Francisco: Morgan Kauf- [40] Acken, Kevin P.; Irwin, Mary Jane; Owens, Robert
mann Publ. p. 15. ISBN 1-55860-343-3. M. (July 1998). A Parallel ASIC Architecture
for Ecient Fractal Image Coding. The Jour-
[22] Culler et al. p. 15. nal of VLSI Signal Processing. 19 (2): 97113.
doi:10.1023/A:1008005616596.
[23] Patt, Yale (April 2004). "The Microprocessor Ten Years
From Now: What Are The Challenges, How Do We Meet [41] Kahng, Andrew B. (June 21, 2004) "Scoping the Prob-
Them? (wmv). Distinguished Lecturer talk at Carnegie lem of DFM in the Semiconductor Industry. Univer-
Mellon University. Retrieved on November 7, 2007. sity of California, San Diego. Future design for man-
ufacturing (DFM) technology must reduce design [non-
[24] Culler et al. p. 124. recoverable expenditure] cost and directly address man-
ufacturing [non-recoverable expenditures]the cost of a
[25] Culler et al. p. 125. mask set and probe cardwhich is well over $1 million
at the 90 nm technology node and creates a signicant
[26] Patterson and Hennessy, p. 713. damper on semiconductor-based innovation.
[27] Hennessy and Patterson, p. 549. [42] Patterson and Hennessy, p. 751.
[28] Patterson and Hennessy, p. 714. [43] The Sidney Fernbach Award given to MPI inventor Bill
Gropp refers to MPI as the dominant HPC communica-
[29] What is clustering? Webopedia computer dictionary. Re- tions interface
trieved on November 7, 2007.
[44] Shen, John Paul; Mikko H. Lipasti (2004). Modern pro-
[30] Beowulf denition. PC Magazine. Retrieved on Novem- cessor design : fundamentals of superscalar processors (1st
ber 7, 2007. ed.). Dubuque, Iowa: McGraw-Hill. p. 561. ISBN 0-
07-057064-7. However, the holy grail of such research
[31] Architecture share for 06/2007. TOP500 Supercomput- automated parallelization of serial programshas yet to
ing Sites. Clusters make up 74.60% of the machines on materialize. While automated parallelization of certain
the list. Retrieved on November 7, 2007. classes of algorithms has been demonstrated, such suc-
cess has largely been limited to scientic and numeric ap-
[32] Interconnect. plications with predictable ow control (e.g., nested loop
structures with statically determined iteration counts) and
[33] Hennessy and Patterson, p. 537. statically analyzable memory access patterns. (e.g., walks
over large multidimensional arrays of oat-point data).
[34] MPP Denition. PC Magazine. Retrieved on November
7, 2007. [45] Encyclopedia of Parallel Computing, Volume 4 by David
Padua 2011 ISBN 0387097651 page 265
[35] Kirkpatrick, Scott (2003). COMPUTER SCIENCE:
Rough Times Ahead. Science. 299 (5607): 668669. [46] Asanovic, Krste, et al. (December 18, 2006). The
doi:10.1126/science.1081623. PMID 12560537. Landscape of Parallel Computing Research: A View
from Berkeley (PDF). University of California, Berkeley.
[36] D'Amour, Michael R., Chief Operating Ocer, DRC Technical Report No. UCB/EECS-2006-183. See table
Computer Corporation. Standard Recongurable Com- on pages 1719.
puting. Invited speaker at the University of Delaware,
February 28, 2007. [47] Dobel, B., Hartig, H., & Engel, M. (2012) Operating sys-
tem support for redundant multithreading. Proceedings
[37] Boggan, Sha'Kia and Daniel M. Pressel (August 2007). of the tenth ACM international conference on Embedded
GPUs: An Emerging Platform for General-Purpose Com- software, 8392. doi:10.1145/2380356.2380375
putation (PDF). ARL-SR-154, U.S. Army Research Lab.
Retrieved on November 7, 2007. [48] Patterson and Hennessy, pp. 74950: Although success-
ful in pushing several technologies useful in later projects,
[38] Maslennikov, Oleg (2002). Systematic Generation of the ILLIAC IV failed as a computer. Costs escalated from
Executing Programs for Processor Elements in Parallel the $8 million estimated in 1966 to $31 million by 1972,
ASIC or FPGA-Based Systems and Their Transformation despite the construction of only a quarter of the planned
into VHDL-Descriptions of Processor Element Control machine . It was perhaps the most infamous of super-
Units. Lecture Notes in Computer Science, 2328/2002: computers. The project started in 1965 and ran its rst
p. 272. real application in 1976.
13
[49] Menabrea, L. F. (1842). Sketch of the Analytic Engine In- Designing and Building Parallel Programs, by Ian
vented by Charles Babbage. Bibliothque Universelle de Foster
Genve. Retrieved on November 7, 2007. quote: when
a long series of identical computations is to be performed, Internet Parallel Computing Archive
such as those required for the formation of numerical ta-
bles, the machine can be brought into play so as to give Parallel processing topic area at IEEE Distributed
several results at the same time, which will greatly abridge Computing Online
the whole amount of the processes.
Parallel Computing Works Free On-line Book
[50] Patterson and Hennessy, p. 753.
Frontiers of Supercomputing Free On-line Book
[51] R.W Hockney, C.R Jesshope. Parallel Computers 2: Covering topics like algorithms and industrial appli-
Architecture, Programming and Algorithms, Volume 2. cations
1988. p. 8 quote: The earliest reference to parallelism
in computer design is thought to be in General L. F. Universal Parallel Computing Research Center
Menabreas publication in 1842, entitled Sketch of the
Analytical Engine Invented by Charles Babbage". Course in Parallel Programming at Columbia Uni-
versity (in collaboration with IBM T.J Watson X10
[52] Parallel Programming, S. Gill, The Computer Journal Vol. project)
1 #1, pp2-10, British Computer Society, April 1958.
Parallel and distributed Grobner bases computation
[53] Wilson, Gregory V (1994). The History of the Devel- in JAS
opment of Parallel Computing. Virginia Tech/Norfolk
State University, Interactive Learning with a Digital Li- Course in Parallel Computing at University of
brary in Computer Science. Retrieved 2008-01-08. Wisconsin-Madison
[54] Anthes, Gry (November 19, 2001). The Power of Paral- OpenHMPP, A New Standard for Manycore
lelism. Computerworld. Retrieved 2008-01-08.
Berkeley Par Lab: progress in the parallel com-
[55] Patterson and Hennessy, p. 749. puting landscape, Editors: David Patterson, Dennis
Gannon, and Michael Wrinn, August 23, 2013
The trouble with multicore, by David Patterson,
10 Further reading posted 30 Jun 2010
Rodriguez, C.; Villagra, M.; Baran, B. (29 The Landscape of Parallel Computing Research: A
August 2008). Asynchronous team algorithms View From Berkeley (one too many dead link at this
for Boolean Satisability. Bio-Inspired Mod- site)
els of Network, Information and Computing Sys-
tems, 2007. Bionetics 2007. 2nd: 6669.
doi:10.1109/BIMNICS.2007.4610083.
Sechin, A.; Parallel Computing in Photogrammetry.

GIM International. #1, 2016, pp. 2123.
11 External links
Go Parallel: Translating Multicore Power into Ap-
plication Performance
Instructional videos on CAF in the Fortran Standard

by John Reid (see Appendix B)
Parallel computing at DMOZ
Lawrence Livermore National Laboratory: Intro-

duction to Parallel Computing
Comparing programmability of Open MP and

pthreads
What makes parallel programming hard?

14 12 TEXT AND IMAGE SOURCES, CONTRIBUTORS, AND LICENSES
12 Text and image sources, contributors, and licenses

12.1 Text
Parallel computing Source: https://en.wikipedia.org/wiki/Parallel_computing?oldid=761695779 Contributors: The Anome, Awaterl,
Arvindn, Aldie, SimonP, Maury Markowitz, Heron, RTC, Michael Hardy, Nixdorf, Pnm, Kku, Wapcaplet, Alo, Egil, Fantasy, Ronz,
J'raxis, Docu, Angela, BigFatBuddha, Expatrick, Emperorbma, Viajero, Maximus Rex, Furrykef, Stephane Simard, Wiwaxia, Raul654,
Fredrik, RedWolf, Sander~enwiki, Kuszi, Gantlord, Ojigiri~enwiki, Hadal, Qlmatrix, Dave Bass, Mfc, Giftlite, DocWatson42, DavidCary,
Brona, Michael Devore, Jason Quinn, Gracefool, AlistairMcMillan, Mobius, Manuel Anastcio, Gdr, LiDaobing, Beland, APH, Sam Ho-
cevar, Qiq~enwiki, Dhuss, Chris j wood, Rama, Antaeus Feldspar, Dyl, ZeroOne, *drew, Walden, Art LaPella, Femto, Cmdrjameson,
Nk, Mdd, Alansohn, Liao, Guy Harris, Diego Moya, DariuszT, Yamla, Mechonbarsa, Tony Sidaway, RainbowOfLight, Sciurin, Oleg
Alexandrov, Woohookitty, Henrik, Arcann, Ruud Koot, Noetica, Eluchil, Rpwoodbu, Qwertyus, Kbdank71, Dwarf Kirlston, Tlroche,
Rjwilmsi, Strait, JoshuacUK, Martychen, Brighterorange, Matt Deres, Johnnyw, RobertG, Rbonvall, Gurch, BMF81, Michael Suess,
Chobot, Siddhant, YurikBot, Wavelength, Hairy Dude, Koeyahoo, Gaius Cornelius, Bovineone, Ugur Basak, CarlHewitt, Jmacaulay,
Tony1, Amwebb, Psy guy, Omeros~enwiki, Knotnic, 6a4fe8aa039615ebd9ddb83d6acf9a1dc1b684f7, BorgQueen, Curpsbot-unicodify,
GrEp, Boggie~enwiki, SmackBot, Kellen, Charleswarner, Gilliam, Skizzik, NickGarvey, Amux, Artoftransformation, TimBentley, Lin-
guistAtLarge, Persian Poet Gal, EncMstr, LaggedOnUser, Stevage, Wikibarista, Adamstevenson, Nbarth, Leiting, Trekphiler, Can't sleep,
clown will eat me, Nixeagle, JonHarder, TKD, Allan McInnes, Cybercobra, Caniago, Hslayer, Mlpkr, Vina-iwbot~enwiki, Autopilot,
Lambiam, Kuru, Disavian, Almkglor, Jaganath, Statsone, Soumyasch, PseudoSudo, Ckatz, Ripe, Ehheh, SandyGeorgia, Quaeler, Irides-
cent, JoeBot, GDallimore, JForget, Thedemonhog, Raysonho, Jesse Viviano, Mpotse, Neelix, Mblumber, Grahamec, Tawkerbot4, Stevag,
Omicronpersei8, Gimmetrow, Mattisse, Malleus Fatuorum, Epbr123, Biruitorul, Hcberkowitz, Marek69, Dabcanboulet, Ideogram, Dgies,
Dawnseeker2000, Apantomimehorse, Widefox, RU.Siriuz, Darklilac, JAnDbot, East718, Dsigal, Magioladitis, VoABot II, Ottojschlosser,
, Tedickey, Cic, Catgut, GermanX, MartinBot, Uvainio, Jack007, R'n'B, J.delanoy, DrKay, Vi2, Bogey97, Charliepeck, Robert-
greer, Jevansen, Treisijs, Brvman, Waraqa, JRS, ACSE, Butwhatdoiknow, GimmeBot, Auryon~enwiki, BwDraco, Gwizard, Vanishe-
dUserABC, ThomHImself, !dea4u, ParallelWolverine, Jimmi Hugh, Brendanwood, Graham Beards, Anderston~enwiki, Bentogoa, Arbor
to SJ, Allmightyduck, Nuttycoconut, Bichito, Realist2, Kumioko, Martarius, ClueBot, The Thing That Should Not Be, Tomas e, Razimantv,
Gareldspence, Ceiling Cat, Stow44, Piledhigheranddeeper, Mad031683, Shaiguitar, Ykhwong, Ceilican, Dekisugi, Hmeleka, Hariharan
wiki, BlueCannon, Damiansoul, Versus22, Marcoacostareyes, Bearsona, Asafshelly, Stickee, Laser brain, Skarebo, Mm40, Out4thecount,
Addbot, Ramu50, MrOllie, Quod~enwiki, 5 albert square, Romainhk, Legobot, EchetusXe, TaBOT-zerem, Vincnet, Talayco, Robert Mi-
jakovi, AnomieBOT, Galoubet, Commander Shepard, Piano non troppo, HRV, Zadneram, Citation bot, Quad4rax, LilHelpa, Xqbot,
TheAMmollusc, Lina6329, P99am, Martijnthie, Miym, Zenioss, SassoBot, Loltowne, Mangst, Howard McCay, Gastonhillar, FrescoBot,
Liridon, StaticVision, Ashelly, Ezekiel56, Nacho Insular, Socreddy, Ajaxartar, Mr Rav, Romanlezner, IvanM89, RaulMetumtam, Mag-
gyero, Guarani.py, Edderso, Glenn Maddox, Delightm, Jesse V., RjwilmsiBot, JamieHanlon, Gvaz151, John of Reading, Primefac, Kl-
brain, TuHan-Bot, AvicBot, ZroBot, Daonguyen95, F, Handheldpenguin, AvicAWB, Anthiety, AManWithNoPlan, Prathikh, Conscious
Code, Chris.schaeer, Gwen-chan, Cswierkowski, ClueBot NG, Rezabot, StitchProgramming, Widr, MerlIwBot, Andreygeo, Uwsbel,
CrADHD, , BattyBot, Decentprady001, Agirault, Cccrrraaaiiiggg, Dexbot, Br'er Rabbit, SoledadKabocha, DougBoost,
Frosty, Wangjipu, Phamnhatkhanh, Michael Ten, Pdecalculus, M Stocker199, Almeria.raul, Hollylilholly, Interrexconsul, Monkbot, Wag-
gie, Mansoor-siamak, Soa Koutsouveli, Sbala142, Dmunoz5, Stokkie64, Todaypress, CAPTAIN RAJU, Ashwin Geet Dsa, CLCStudent,
Eno Lirpa, Christopher Overbeck, Fmadd, Zerkly, Eltoro2197, Tufty20 and Anonymous: 288
12.2 Images
File:AmdahlsLaw.svg Source: https://upload.wikimedia.org/wikipedia/commons/e/ea/AmdahlsLaw.svg License: CC BY-SA 3.0 Con-
tributors: Own work based on: File:AmdahlsLaw.png Original artist: Daniels220 at English Wikipedia
File:Beowulf.jpg Source: https://upload.wikimedia.org/wikipedia/commons/8/8c/Beowulf.jpg License: GPL Contributors: ? Original
artist: User Linuxbeak on en.wikipedia
File:BlueGeneL_cabinet.jpg Source: https://upload.wikimedia.org/wikipedia/commons/a/a7/BlueGeneL_cabinet.jpg License: CC-BY-
SA-3.0 Contributors: ? Original artist: ?
File:Commons-logo.svg Source: https://upload.wikimedia.org/wikipedia/en/4/4a/Commons-logo.svg License: PD Contributors: ? Origi-
nal artist: ?
File:Cray_1_IMG_9126.jpg Source: https://upload.wikimedia.org/wikipedia/commons/6/6e/Cray_1_IMG_9126.jpg License: CC BY-
SA 2.0 fr Contributors: Own work Original artist: Rama
File:En-Parallel_computing.ogg Source: https://upload.wikimedia.org/wikipedia/commons/3/3b/En-Parallel_computing.ogg License:
CC BY-SA 3.0 Contributors:
Derivative of Parallel computing Original artist: Speaker: Mangst
Authors of the article
File:Fivestagespipeline.png Source: https://upload.wikimedia.org/wikipedia/commons/2/21/Fivestagespipeline.png License: CC-BY-
SA-3.0 Contributors: ? Original artist: ?
File:Folder_Hexagonal_Icon.svg Source: https://upload.wikimedia.org/wikipedia/en/4/48/Folder_Hexagonal_Icon.svg License: Cc-by-
sa-3.0 Contributors: ? Original artist: ?
File:Gustafson.png Source: https://upload.wikimedia.org/wikipedia/commons/d/d7/Gustafson.png License: CC BY-SA 3.0 Contributors:
Own work Original artist: Peahihawaii
File:IBM_Blue_Gene_P_supercomputer.jpg Source: https://upload.wikimedia.org/wikipedia/commons/d/d3/IBM_Blue_Gene_P_
supercomputer.jpg License: CC BY-SA 2.0 Contributors: originally posted to Flickr as Blue Gene / P Original artist: Argonne National
Laboratorys Flickr page
File:ILLIAC_4_parallel_computer.jpg Source: https://upload.wikimedia.org/wikipedia/commons/9/91/ILLIAC_4_parallel_
computer.jpg License: CC BY 2.0 Contributors: Flickr Original artist: Steve Jurvetson from Menlo Park, USA
12.3 Content license 15
File:Nopipeline.png Source: https://upload.wikimedia.org/wikipedia/commons/2/2c/Nopipeline.png License: CC-BY-SA-3.0 Contribu-

tors: ? Original artist: ?
File:Numa.svg Source: https://upload.wikimedia.org/wikipedia/commons/7/7b/Numa.svg License: CC BY-SA 3.0 Contributors: Own
work Original artist: Raul654
File:NvidiaTesla.jpg Source: https://upload.wikimedia.org/wikipedia/commons/3/32/NvidiaTesla.jpg License: Public domain Contribu-
tors: Camera Original artist: Mahogny
File:Optimizing-different-parts.svg Source: https://upload.wikimedia.org/wikipedia/commons/4/40/Optimizing-different-parts.svg
License: Public domain Contributors: Own work Original artist: Gorivero
File:Sound-icon.svg Source: https://upload.wikimedia.org/wikipedia/commons/4/47/Sound-icon.svg License: LGPL Contributors:
Derivative work from Silsor's versio Original artist: Crystal SVG icon set
File:Superscalarpipeline.svg Source: https://upload.wikimedia.org/wikipedia/commons/4/46/Superscalarpipeline.svg License: CC BY-
SA 3.0 Contributors: Own work Original artist: Amit6, original version (File:Superscalarpipeline.png) by User:Poil
File:Wikibooks-logo-en-noslogan.svg Source: https://upload.wikimedia.org/wikipedia/commons/d/df/Wikibooks-logo-en-noslogan.
svg License: CC BY-SA 3.0 Contributors: Own work Original artist: User:Bastique, User:Ramac et al.
File:Wikiversity-logo.svg Source: https://upload.wikimedia.org/wikipedia/commons/9/91/Wikiversity-logo.svg License: CC BY-SA 3.0
Contributors: Snorky (optimized and cleaned up by verdy_p) Original artist: Snorky (optimized and cleaned up by verdy_p)
12.3 Content license

Creative Commons Attribution-Share Alike 3.0

Parallel Computing

Uploaded by

Document Information

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Parallel Computing

Uploaded by

Copyright:

Available Formats

Parallel computing

tectures are sometimes used alongside traditional proces-

solving a large mathematical or engineering problem will

Assume that a task has two independent parts, A and B. Part

Subtasks in a parallel program are often called threads.

1.4 Fine-grained, coarse-grained, and em-

2.2 Instruction-level parallelism A canonical ve-stage pipelined superscalar processor. In the

Most modern processors also have multiple execution

2.3 Task parallelism

3 Hardware memory, a crossbar switch, a shared bus or an intercon-

Distributed shared memory

A logical view of a non-uniform memory access (NUMA) archi-

Grid computing Main article: Grid computing

Grid computing is the most distributed form of paral-

Recongurable computing with eld-programmable Application-specic integrated circuits Main arti-

4.1 Parallel programming languages

with a large number of processors used in high perfor- 7 History

dense linear algebra;

Transputer [10] Rabaey, Jan M. (1996). Digital integrated circuits : a de-

Sechin, A.; Parallel Computing in Photogrammetry.

Instructional videos on CAF in the Fortran Standard

Parallel computing at DMOZ

Lawrence Livermore National Laboratory: Intro-

Comparing programmability of Open MP and

What makes parallel programming hard?

12 Text and image sources, contributors, and licenses

File:Nopipeline.png Source: https://upload.wikimedia.org/wikipedia/commons/2/2c/Nopipeline.png License: CC-BY-SA-3.0 Contribu-

12.3 Content license

You might also like