(10 X 2 = 20)
Super computer
Mainframe Computer
Minicomputer
Microcomputer
for (i = 0; i < 100; i = i + 1) {
    A[i+1] = A[i] + C[i]; /* S1 */
    B[i+1] = B[i] + A[i]; /* S2 */
}
What are the dependences between S1 and S2 in the loop?
Answer
There are two different dependences:
1. S1 uses a value computed by S1 in an earlier iteration, since iteration i computes A[i+1],
which is read in iteration i+1. The same is true of S2 for B[i] and B[i+1].
2. S2 uses the value, A[i+1], computed by S1 in the same iteration.
These two dependences are different and have different effects. To see how they differ, let's assume
that only one of these dependences exists at a time. Because the dependence of statement S1 is on an
earlier iteration of S1, this dependence is loop carried. This dependence forces successive iterations
of this loop to execute in series.
The second dependence (S2 depending on S1) is within an iteration and is not loop carried. Thus, if
this were the only dependence, multiple iterations of the loop could execute in parallel, as long as
each pair of statements in an iteration were kept in order. We saw this type of dependence in an
example in Section 3.2, where unrolling was able to expose the parallelism.
It is also possible to have a loop-carried dependence that does not prevent parallelism, as the next
example shows.
Part B
Answer All the Questions.
(5X16=80)
11. a. Explain the concepts and challenges of Instruction Level Parallelism (ILP).
Instruction-level parallelism (ILP) is a measure of how many of the operations in a computer
program can be performed simultaneously. The potential overlap among instructions is called
instruction-level parallelism.
There are two largely separable approaches to exploiting ILP: an approach that relies on
hardware to help discover and exploit the parallelism dynamically, and an approach that relies on
software technology to find parallelism, statically at compile time. Processors using the dynamic,
hardware-based approach, including the Intel Pentium series, dominate in the market; those using the
static approach, including the Intel Itanium, have more limited uses in scientific or application-specific environments.
The simplest and most common way to increase the ILP is to exploit parallelism among
iterations of a loop. This type of parallelism is often called loop-level parallelism. Here is a simple
example of a loop, which adds two 1000-element arrays, that is completely parallel:
for (i = 1; i <= 1000; i = i + 1)
    x[i] = x[i] + y[i];
Every iteration of the loop can overlap with any other iteration, although within each loop iteration
there is little or no opportunity for overlap.
Data Dependences and Hazards
Determining how one instruction depends on another is critical to determining how much
parallelism exists in a program and how that parallelism can be exploited. In particular, to exploit
instruction-level parallelism we must determine which instructions can be executed in parallel. If two
instructions are parallel, they can execute simultaneously in a pipeline of arbitrary depth without
causing any stalls, assuming the pipeline has sufficient resources (and hence no structural hazards
exist). If two instructions are dependent, they are not parallel and must be executed in order, although
they may often be partially overlapped. The key in both cases is to determine whether an instruction is dependent on another instruction.
There are three different types of dependences: data dependences (also called true data dependences), name dependences, and control dependences. An instruction j is data dependent on instruction i if either of the following holds:
1. instruction i produces a result that may be used by instruction j, or
2. instruction j is data dependent on instruction k, and instruction k is data dependent on instruction i (a chain of dependences of the first type).
Data Hazards
A hazard is created whenever there is a dependence between instructions and they are close
enough that the overlap during execution would change the order of access to the operand involved
in the dependence. Because of the dependence, we must preserve what is called program order, that
is, the order that the instructions would execute in if executed sequentially one at a time as
determined by the original source program. The goal of both our software and hardware techniques is
to exploit parallelism by preserving program order only where it affects the outcome of the program.
Detecting and avoiding hazards ensures that necessary program order is preserved.
RAW (read after write): j tries to read a source before i writes it, so j incorrectly gets the old value.
This hazard is the most common type and corresponds to a true data dependence. Program order
must be preserved to ensure that j receives the value from i.
WAW (write after write): j tries to write an operand before it is written by i. The writes end up being
performed in the wrong order, leaving the value written by i rather than the value written by j in the
destination. This hazard corresponds to an output dependence. WAW hazards are present only in
pipelines that write in more than one pipe stage or allow an instruction to proceed even
when a previous instruction is stalled.
11. b. What is multithreading? Discuss different types of multithreading in detail.
In a multithreaded application, there are several points of execution within the same memory
space.
hardware multithreading
Increasing utilization of a processor by switching to another thread when one thread is stalled.
thread
A thread includes the program counter, the register state, and the stack. It is a lightweight process;
whereas threads commonly share a single address space, processes don't.
process
A process includes one or more threads, the address space, and the operating system state. Hence, a
process switch usually invokes the operating system, but not a thread switch.
fine-grained multithreading
A version of hardware multithreading that implies switching between threads after every instruction.
coarse-grained multithreading
A version of hardware multithreading that implies switching between threads only after significant
events, such as a last-level cache miss.
simultaneous multithreading (SMT)
A version of multithreading that lowers the cost of multithreading by utilizing the resources needed
for a multiple-issue, dynamically scheduled microarchitecture.
In this design, each core has its own execution pipeline, and each core has the resources
required to run without blocking resources needed by the other software threads.
While the example in Figure 2 shows a two-core design, there is no inherent limitation in the number
of cores that can be placed on a single chip. Intel has committed to shipping dual-core processors in
2005, but it will add additional cores in the future. Mainframe processors today use more than two
cores, so there is precedent for this kind of development.
The multi-core design enables two or more cores to run at somewhat slower speeds and at much
lower temperatures. The combined throughput of these cores delivers processing power greater than
the maximum available today on single-core processors and at a much lower level of power
consumption. In this way, Intel increases the capabilities of server platforms as predicted by Moore's
Law while the technology no longer pushes the outer limits of physical constraints.
12. b. Discuss Amdahl's Law and how Processor Speedup is calculated; explain with an example.
Amdahl's Law states that the performance improvement to be gained from using some faster
mode of execution is limited by the fraction of the time the faster mode can be used. Amdahl's Law
defines the speedup that can be gained by using a particular feature. What is speedup? Suppose that
we can make an enhancement to a computer that will improve performance when it is used.
Speedup is the ratio

Speedup = Performance for entire task using the enhancement when possible / Performance for entire task without using the enhancement

or, equivalently,

Speedup = Execution time for entire task without using the enhancement / Execution time for entire task using the enhancement when possible

Speedup tells us how much faster a task will run using the computer with the enhancement as
opposed to the original computer.
Amdahls Law gives us a quick way to find the speedup from some enhancement, which depends on
two factors:
1. The fraction of the computation time in the original computer that can be converted to take
advantage of the enhancement. For example, if 20 seconds of the execution time of a
program that takes 60 seconds in total can use an enhancement, the fraction is 20/60. This
value, which we will call Fraction_enhanced, is always less than or equal to 1.
2. The improvement gained by the enhanced execution mode; that is, how much faster the task would
run if the enhanced mode were used for the entire program. This value is the time of the original
mode over the time of the enhanced mode. If the enhanced mode takes, say, 2 seconds for a portion
of the program, while it is 5 seconds in the original mode, the improvement is 5/2. We will call this
value, which is always greater than 1, Speedup_enhanced.
The execution time using the original computer with the enhanced mode will be the time spent using
the unenhanced portion of the computer plus the time spent using the enhancement:

Execution time_new = Execution time_old * ((1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced)

The overall speedup is the ratio of the execution times:

Speedup_overall = Execution time_old / Execution time_new = 1 / ((1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced)
13. a. Explain trends in power, energy, cost and technology in integrated circuits with example.
Energy and Power within a Microprocessor
For CMOS chips, the traditional primary energy consumption has been in switching transistors, also
called dynamic energy. The energy required per transistor is proportional to the product of the
capacitive load driven by the transistor and the square of the voltage:

Energy_dynamic ∝ Capacitive load × Voltage^2

This equation is the energy of a pulse of the logic transition 0 -> 1 -> 0 or 1 -> 0 -> 1. The energy of a
single transition (0 -> 1 or 1 -> 0) is then:

Energy_dynamic ∝ 1/2 × Capacitive load × Voltage^2

The power required per transistor is just the product of the energy of a transition multiplied by the
frequency of transitions:

Power_dynamic ∝ 1/2 × Capacitive load × Voltage^2 × Frequency switched
For a fixed task, slowing clock rate reduces power, but not energy. Clearly, dynamic power and
energy are greatly reduced by lowering the voltage, so voltages have dropped from 5V to just under
1V in 20 years. The capacitive load is a function of the number of transistors connected to an output
and the technology, which determines the capacitance of the wires and the transistors.
Example: Some microprocessors today are designed to have adjustable voltage, so a 15%
reduction in voltage may result in a 15% reduction in frequency. What would be the impact on
dynamic energy and on dynamic power?
Answer: Since the capacitance is unchanged, the answer for energy is the ratio of the voltages squared:

Energy_new / Energy_old = (Voltage × 0.85)^2 / Voltage^2 = 0.85^2 = 0.72

so energy drops to about 72% of the original. For power, we also multiply by the ratio of the frequencies:

Power_new / Power_old = 0.72 × 0.85 ≈ 0.61

so power drops to about 61% of the original.
Integrated circuit costs are becoming a greater portion of the cost that varies between computers,
especially in the high-volume, cost-sensitive portion of the market. Indeed, with personal mobile
devices' increasing reliance on whole systems on a chip (SOCs), the cost of the integrated
circuits is much of the cost of the PMD. Thus, computer designers must understand the costs of chips
to understand the costs of current computers. Although the costs of integrated circuits have dropped
exponentially, the basic process of silicon manufacture is unchanged: a wafer is still tested and
chopped into dies that are packaged. Thus, the cost of a packaged integrated circuit is

Cost of integrated circuit = (Cost of die + Cost of testing die + Cost of packaging and final test) / Final test yield
In this section, we focus on the cost of dies, summarizing the key issues in testing and packaging at
the end. Learning how to predict the number of good chips per wafer requires first learning how
many dies fit on a wafer and then learning how to predict the percentage of those that will work.
From there it is simple to predict cost:

Cost of die = Cost of wafer / (Dies per wafer × Die yield)
13. b.i). Explain in detail the various types of dependencies with suitable example.
There are 5 types of data dependencies. They are as follows:
Flow dependence:
A statement S2 is flow-dependent on statement S1 if an execution path exists from S1 to S2 and if at least one output of S1 feeds in as input to S2.
Ex:
S1: load R1, A
S2: add R2, R1
Anti-dependence:
Statement S2 is anti-dependent on statement S1 if S2 follows S1 in program order and if the output of S2 overlaps the input to S1.
Ex:
S1: add R2, R1
S2: move R1, R3
Output dependence:
Two statements are output dependent if they produce the same output variable.
Ex:
S1: load R1, A
S2: move R1, R3
I/O dependence:
Read and write are I/O statements. I/O dependence occurs not because the same
variable is involved but because the same file is referenced by both I/O statements.
Unknown dependence:
The dependence relation between two statements cannot be determined at compile time, for example when a subscript is itself subscripted, when the subscript does not contain the loop index variable, or when the subscript is nonlinear in the loop index variable.
Answer
The following dependences exist among the four statements:
1. There are true dependences from S1 to S3 and from S1 to S4 because of Y[i]. These are not
loop carried, so they do not prevent the loop from being considered parallel. These
dependences will force S3 and S4 to wait for S1 to complete.
2. There is an antidependence from S1 to S2, based on X[i].
3. There is an antidependence from S3 to S4 for Y[i].
4. There is an output dependence from S1 to S4, based on Y[i].
The following version of the loop eliminates these false (or pseudo) dependences:
After the loop, the variable X has been renamed X1. In code that follows the loop, the compiler can
simply replace the name X by X1. In this case, renaming does not require an actual copy operation
but can be done by substituting names or by register allocation. In other cases, however, renaming
will require copying.
14. a. Explain vector architecture with neat diagram and give the suitable example
We begin with a vector processor consisting of the primary components that Fig shows. This
processor, which is loosely based on the Cray-1, is the foundation for discussion throughout this
section. We will call this instruction set architecture VMIPS; its scalar portion is MIPS, and its vector
portion is the logical vector extension of MIPS. The rest of this subsection examines how the basic
architecture of VMIPS relates to other processors.
The primary components of the instruction set architecture of VMIPS are the following:
Vector registers: Each vector register is a fixed-length bank holding a single vector. VMIPS
has eight vector registers, and each vector register holds 64 elements, each 64 bits wide. The
vector register file needs to provide enough ports to feed all the vector functional units. These
ports will allow a high degree of overlap among vector operations to different vector
registers. The read and write ports, which total at least 16 read ports and 8 write ports, are
connected to the functional unit inputs or outputs by a pair of crossbar switches.
Vector functional units: Each unit is fully pipelined, and it can start a new operation on
every clock cycle. A control unit is needed to detect hazards, both structural hazards for
functional units and data hazards on register accesses. Fig shows that VMIPS has five
functional units. For simplicity, we focus exclusively on the floating-point functional units.
Vector load/store unit: The vector memory unit loads or stores a vector to or from memory.
The VMIPS vector loads and stores are fully pipelined, so that words can be moved between
the vector registers and memory with a bandwidth of one word per clock cycle, after an initial
latency. This unit would also normally handle scalar loads and stores.
A set of scalar registers: Scalar registers can also provide data as input to the vector
functional units, as well as compute addresses to pass to the vector load/store unit. These are
the normal 32 general-purpose registers and 32 floating-point registers of MIPS. One input of
the vector functional units latches scalar values as they are read out of the scalar register file.
ii). How are multiple lanes used to process more than one element per clock cycle, and how are loops whose length is not equal to 64 handled?
Beyond One Element per Clock Cycle
A critical advantage of a vector instruction set is that it allows software to pass a large amount of
parallel work to hardware using only a single short instruction. A single vector instruction can
include scores of independent operations yet be encoded in the same number of bits as a
conventional scalar instruction. The parallel semantics of a vector instruction allow an
implementation to execute these elemental operations using a deeply pipelined functional unit, as in
the VMIPS implementation we've studied so far; an array of parallel functional units; or a
combination of parallel and pipelined functional units. Figure 4.4 illustrates how to improve vector
performance by using parallel pipelines to execute a vector add instruction.
The size of all the vector operations depends on n, which may not even be known until run
time! The value of n might also be a parameter to a procedure containing the above loop and
therefore subject to change during execution.
The solution to these problems is to create a vector-length register (VLR). The VLR controls
the length of any vector operation, including a vector load or store. The value in the VLR, however,
cannot be greater than the length of the vector registers. This solves our problem as long as the real
length is less than or equal to the maximum vector length (MVL). The MVL determines the number
of data elements in a vector of an architecture. This parameter means the length of vector registers
can grow in later computer generations without changing the instruction set; as we shall see in the
next section, multimedia SIMD extensions have no equivalent of MVL, so they change the
instruction set every time they increase their vector length.
What if the value of n is not known at compile time and thus may be greater than the MVL?
To tackle the second problem where the vector is longer than the maximum length, a technique called
strip mining is used. Strip mining is the generation of code such that each vector operation is done
for a size less than or equal to the MVL. We create one loop that handles any number of iterations
that is a multiple of the MVL and another loop that handles any remaining iterations and must be less
than the MVL. In practice, compilers usually create a single strip-mined loop that is parameterized to
handle both portions by changing the length. We show the strip-mined version of the DAXPY loop in
C:
15. b. i). How will you detect and enhance loop-level parallelism?
Loop-level parallelism is normally analyzed at the source level or close to it, while most
analysis of ILP is done once instructions have been generated by the compiler. Loop-level analysis
involves determining what dependences exist among the operands in a loop across the iterations of
that loop.
The analysis of loop-level parallelism focuses on determining whether data accesses in later
iterations are dependent on data values produced in earlier iterations; such dependence is called a
loop-carried dependence. Most of the examples we considered in Section 3.2 have no loop-carried
dependences and, thus, are loop-level parallel. To see that a loop is parallel, let us first look at the
source representation:
In this loop, there is a dependence between the two uses of x[i], but this dependence is within a
single iteration and is not loop carried. There is a dependence between successive uses of i in
different iterations, which is loop carried, but this dependence involves an induction variable and can
be easily recognized and eliminated.
Because finding loop-level parallelism involves recognizing structures such as loops, array
references, and induction variable computations, the compiler can do this analysis more easily at or
near the source level, as opposed to the machine-code level.
Consider the loop:
for (i = 0; i < 100; i = i + 1) {
    A[i+1] = A[i] + C[i]; /* S1 */
    B[i+1] = B[i] + A[i]; /* S2 */
}
What are the dependences between S1 and S2 in the loop?
Answer
There are two different dependences:
1. S1 uses a value computed by S1 in an earlier iteration, since iteration i computes A[i+1],
which is read in iteration i+1. The same is true of S2 for B[i] and B[i+1].
2. S2 uses the value, A[i+1], computed by S1 in the same iteration.
These two dependences are different and have different effects. To see how they differ, let's assume
that only one of these dependences exists at a time. Because the dependence of statement S1 is on an
earlier iteration of S1, this dependence is loop carried. This dependence forces successive iterations
of this loop to execute in series.
The second dependence (S2 depending on S1) is within an iteration and is not loop carried. Thus, if
this were the only dependence, multiple iterations of the loop could execute in parallel, as long as
each pair of statements in an iteration were kept in order. We saw this type of dependence in an example
in Section 3.2, where unrolling was able to expose the parallelism.
It is also possible to have a loop-carried dependence that does not prevent parallelism, as the next
example shows.