Aca305 2000

FIRST SEMESTER EXAMINATIONS June 2000
ADVANCED COMPUTER ARCHITECTURE (623.305)
The University of Western Australia

Department of Electrical and Electronic Engineering

This paper contains:

10 questions;
XX pages.
Candidates should attempt ALL questions
QUESTION 1 (10 Marks)
Compare buses, crossbar switches, and multistage networks for building a
multiprocessor system with n processors and m shared-memory modules.
Assume a word length of w bits and that 2 × 2 switches are used in building the
multistage networks. To facilitate the comparison process, complete the following
table.
Network Characteristics                 Bus System               Multistage Network        Crossbar Switch
Minimum latency for unit data transfer
Bandwidth per processor
Wiring complexity
Switching complexity
Connectivity and routing capability
Remarks                                 Assume n processors on   n × n MIN using k × k     Assume n × n crossbar
                                        the bus; bus width is    switches with line        with line width of
                                        w bits                   width of w bits           w bits
Answer:
Network Characteristics                 Bus System               Multistage Network        Crossbar Switch
Minimum latency for unit data transfer  Constant                 O(log_k n)                Constant
Bandwidth per processor                 O(w/n) to O(w)           O(w) to O(nw)             O(w) to O(nw)
Wiring complexity                       O(w)                     O(nw log_k n)             O(n^2 w)
Switching complexity                    O(n)                     O(n log_k n)              O(n^2)
Connectivity and routing capability     Only one-to-one at a     Some permutations and     All permutations, one
                                        time                     broadcast, if network     at a time
                                                                 unblocked
Remarks                                 Assume n processors on   n × n MIN using k × k     Assume n × n crossbar
                                        the bus; bus width is    switches with line        with line width of
                                        w bits                   width of w bits           w bits
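The switching-complexity row can be made concrete with a small count. The sketch below
assumes an Omega- or delta-style MIN in which each of the log_k n stages holds n/k
switches; the function names are illustrative and not part of the exam paper.

```python
def min_stage_and_switch_count(n, k=2):
    """Return (stages, switches) for an n x n MIN built from k x k switches.

    Assumption: an Omega/delta-style network with log_k(n) stages of n/k
    switches each, so n must be an exact power of k.
    """
    stages, size = 0, 1
    while size < n:
        size *= k
        stages += 1
    if size != n:
        raise ValueError("n must be a power of k")
    return stages, (n // k) * stages


def crossbar_crosspoints(n):
    """An n x n crossbar needs one crosspoint per (input, output) pair."""
    return n * n


if __name__ == "__main__":
    # Compare the growth of the two switch counts for a few network sizes.
    for n in (4, 16, 64):
        stages, switches = min_stage_and_switch_count(n)
        print(n, stages, switches, crossbar_crosspoints(n))
```

With 2 × 2 switches the MIN count grows as (n/2) log_2 n, while the crossbar count grows
as n^2, which is the gap the O(n log_k n) and O(n^2) entries describe.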
QUESTION 2 (15 Marks)
Analyse the data dependencies among the following statements:
S1: Load R1, 1024 / R1 ← 1024 /
S2: Load R2, M(10) / R2 ← Memory(10) /
S3: Add R1, R2 / R1 ← (R1) + (R2) /
S4: Store M(1024), R1 / Memory(1024) ← (R1) /
S5: Store M((R2)), 1024 / Memory(64) ← 1024 /
Note that (Ri) denotes the content of register Ri, and that Memory(10) contains 64
initially.
a) Draw a dependence graph to show all the dependencies.
b) Are there any resource dependencies if only one copy of each functional unit
is available in the CPU?
Answer:
a) [Figure: dependence graph with nodes S1, S2, S3, S4, and S5.]
b) There are storage dependencies between instruction pairs (S2, S5) and (S4, S5).
There is a resource dependence between S1 and S2 on the load unit, and another
between S4 and S5 on the store unit.
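The register and memory dependences asked for in part (a) can be cross-checked
mechanically. The sketch below is one way to do that, not the marking scheme: the read
and write sets are an assumed encoding of S1 to S5 (S5 is taken to read R2 for its
address, since Memory(10) holds 64), and it reports only flow, anti, and output
dependences on named locations, not resource conflicts.

```python
# Assumed (name, reads, writes) sets for S1..S5; "M64" is the cell addressed via R2.
STATEMENTS = [
    ("S1", set(), {"R1"}),
    ("S2", {"M10"}, {"R2"}),
    ("S3", {"R1", "R2"}, {"R1"}),
    ("S4", {"R1"}, {"M1024"}),
    ("S5", {"R2"}, {"M64"}),
]


def data_dependences(stmts):
    """Return a set of (kind, source, sink, location) tuples in program order."""
    deps = set()
    for i, (si, reads_i, writes_i) in enumerate(stmts):
        for j in range(i + 1, len(stmts)):
            sj, reads_j, writes_j = stmts[j]
            between = stmts[i + 1:j]

            def untouched(loc):
                # A dependence is direct only if no statement in between rewrites loc.
                return not any(loc in writes for _, _, writes in between)

            for loc in writes_i & reads_j:
                if untouched(loc):
                    deps.add(("flow", si, sj, loc))
            for loc in reads_i & writes_j:
                if untouched(loc):
                    deps.add(("anti", si, sj, loc))
            for loc in writes_i & writes_j:
                if untouched(loc):
                    deps.add(("output", si, sj, loc))
    return deps


if __name__ == "__main__":
    for dep in sorted(data_dependences(STATEMENTS)):
        print(dep)
```

Under these assumptions the only output dependence is S1 to S3 on R1, and S2 reaches S5
through R2.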
QUESTION 3
Compare the PRAM models with physical models of real parallel computers in each of
the following categories:
a) Which PRAM variant can best model SIMD machines and how?
b) Repeat the question in (a) for shared-memory MIMD machines.
Answer:
a) Since the processing elements of a SIMD machine read and write data from
different memory modules synchronously, no access conflicts should arise.
Thus, any PRAM variant can be used to model SIMD machines.
b) The processors in a MIMD machine can read the same memory location
simultaneously. However, writing to the same memory location is
prohibited. Thus, the CREW-PRAM can best model a MIMD machine.
QUESTION 4
Vectorizing compilers generally detect loops that can be executed on a pipelined
vector computer. Are the vectorization algorithms used by vectorizing compilers
suitable for MIMD machine parallelization? Justify your answer.
Answer:
Vectorization is a subset of MIMD parallel processing, so the parallelization algorithms
that detect vectorization all detect code that can be multiprocessed. However, there
are several complicating factors: synchronization is implicit in vectorization,
whereas it must normally be explicit with multiprocessing, and it is frequently
time-consuming. Thus, the crossover vector length at which the compiler switches from
scalar to vector processing will normally be greater on the MIMD machine. Some vector
operations are also more difficult on a MIMD machine, so some of the code generation
techniques may be different.
QUESTION 5 (16 Marks)
Consider the following pipelined processor with four stages. This pipeline has a total
evaluation time of six clock cycles. All successor stages must be used after each clock
cycle.
[Figure: four-stage pipeline with stages S1, S2, S3, and S4 between Input and Output.]
a) Specify the reservation table for this pipeline with six columns and four rows.
b) List the set of forbidden latencies between task initiations.
c) Draw the state diagram which shows all possible latency cycles.
d) List all greedy cycles from the state diagram.
e) Determine the minimal average latency (MAL).
Answer:
a) Reservation table:
      1  2  3  4  5  6
S1    X           X
S2       X           X
S3          X
S4             X
b) Forbidden latency: 4. Collision vector: (1 0 0 0).
c) [Figure: state transition diagram.]
d) Simple cycles: (1,5), (1,1,5), (1,1,1,5), (1,2,5), (1,2,3,5), (1,2,3,2,5), (1,2,3,2,1,5),
(2,5), (2,1,5), (2,1,2,5), (2,1,2,3,5), (2,3,5), (3,5), (3,2,5), (3,2,1,5), (3,2,1,2,5), (5),
(3,2,1,2), and (3).
e) Greedy cycles: (1,1,1,5) and (1,2,3,2).
f) MAL = (1+1+1+5)/4 = 2.
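The greedy cycles and the MAL can be re-derived from the forbidden latency alone. The
sketch below is an illustrative check, not the method used in the paper: it encodes each
state as a bit mask (bit d − 1 set when latency d is forbidden), treats latency 5 as
standing for every latency greater than 4, and finds the MAL by brute force over simple
cycles. All function names are assumptions.

```python
from fractions import Fraction


def state_diagram(forbidden):
    """Build the latency state diagram from a set of forbidden latencies."""
    m = max(forbidden)
    initial = sum(1 << (d - 1) for d in forbidden)
    edges, todo = {}, [initial]
    while todo:
        state = todo.pop()
        if state in edges:
            continue
        edges[state] = {}
        for latency in range(1, m + 2):  # m + 1 stands for any latency > m
            if latency <= m and (state >> (latency - 1)) & 1:
                continue  # this initiation would collide
            nxt = (state >> latency) | initial
            edges[state][latency] = nxt
            todo.append(nxt)
    return initial, edges


def greedy_cycles(edges):
    """Follow the smallest permitted latency from every state until it loops."""
    found = set()
    for start in edges:
        seen, latencies, state = {}, [], start
        while state not in seen:
            seen[state] = len(latencies)
            latency = min(edges[state])
            latencies.append(latency)
            state = edges[state][latency]
        cycle = latencies[seen[state]:]
        rotations = [tuple(cycle[i:] + cycle[:i]) for i in range(len(cycle))]
        found.add(min(rotations))  # canonical rotation, so each cycle appears once
    return sorted(found)


def minimal_average_latency(edges):
    """Smallest average latency over all simple cycles (small graphs only)."""
    best = None

    def walk(start, state, visited, total, count):
        nonlocal best
        for latency, nxt in edges[state].items():
            if nxt == start:
                average = Fraction(total + latency, count + 1)
                if best is None or average < best:
                    best = average
            elif nxt not in visited:
                walk(start, nxt, visited | {nxt}, total + latency, count + 1)

    for start in edges:
        walk(start, start, {start}, 0, 0)
    return best


if __name__ == "__main__":
    initial, edges = state_diagram({4})
    print(format(initial, "04b"), len(edges))
    print(greedy_cycles(edges))
    print(minimal_average_latency(edges))
```

Both greedy cycles average 2, which also matches the lower bound given by the maximum
number of marks in any row of the reservation table.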
QUESTION 6
A uniprocessor computer can operate in either scalar or vector mode. In vector mode,
computations can be performed nine times faster than in scalar mode. A certain
benchmark program took time T to run on this computer. Further, it was found that
25% of T was attributed to the vector mode. In the remaining time, the machine
operated in the scalar mode.
a) Calculate the effective speedup under the above conditions as compared with the
condition when the vector mode is not used at all. Also calculate α, the percentage
of code that has been vectorized in the above program.
b) Suppose we double the speed ratio between the vector mode and the scalar mode
by hardware improvements. Calculate the effective speedup that can be achieved.
c) Suppose the same speedup obtained in part (b) must be obtained by compiler
improvements instead of hardware improvements. What would be the new
vectorization ratio α that should be supported by the vectorizing compiler for the
same benchmark program?
Answer:
a) If the vector mode is not used at all, the execution time will be:
0.75T + 9 × 0.25T = 3T
Therefore, effective speedup = 3T/T = 3. Let the fraction of vectorized code be α.
Then α = 9 × 0.25T/3T = 0.75.
b) Suppose that the speed ratio between the vector mode and the scalar mode is
doubled. The execution time becomes:
0.75T + 0.25T/2 = 0.875T
The effective speedup is 3T/0.875T = 24/7 = 3.43.
c) Suppose the speed for vector mode computation is still nine times as fast as that
for scalar mode. To maintain the effective speedup of 3.43, the vectorization ratio
α must satisfy the following relation:
1 / (α/9 + (1 − α)/1) = 24/7
Solving the equation, we obtain α = 51/64 ≈ 0.8.
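The three results can be checked with exact arithmetic. This is a minimal sketch: the
helper names `speedup` and `required_alpha` are illustrative, and α is taken, as in the
answer, to be the vectorizable fraction of the all-scalar execution time.

```python
from fractions import Fraction


def speedup(alpha, ratio):
    """Speedup over all-scalar execution when a fraction alpha of the scalar
    work runs `ratio` times faster in vector mode."""
    return 1 / (alpha / ratio + (1 - alpha))


def required_alpha(target, ratio):
    """Invert speedup() for alpha at a fixed vector/scalar speed ratio.

    Pass target as a Fraction to keep the result exact.
    """
    return (1 - 1 / target) / (1 - Fraction(1, ratio))


if __name__ == "__main__":
    vector_time = Fraction(1, 4)   # 25% of T is spent in vector mode
    scalar_time = Fraction(3, 4)   # the remaining 75% of T
    all_scalar = scalar_time + 9 * vector_time   # 3T without vector mode
    alpha = 9 * vector_time / all_scalar         # part (a)

    print(alpha, speedup(alpha, 9))              # part (a)
    print(speedup(alpha, 18))                    # part (b)
    print(required_alpha(Fraction(24, 7), 9))    # part (c)
```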
QUESTION 7 (15 Marks)
On a Fujitsu VP2000, the vector processing unit is equipped with two load/store
pipelines plus five functional pipelines as shown below.
[Figure: Fujitsu VP2000 block diagram. Labels: Main Storage Unit; System Storage Unit;
Channel Processor; Vector Unit with Vector Registers, Mask Registers, two Mask Pipelines,
two Load/Store Pipelines, two ALU Pipelines, and a Divide Pipeline; Scalar Units with
Buffer Storage and Scalar Execution Unit.]
Consider the execution of the following compound vector function (CVF):
A(I) = B(I) × C(I) + D(I) × E(I) + F(I) × G(I)
for I = 1, 2, ..., N. Initially, all vector operands are in memory, and the final vector
result must be stored in memory. Show the space-time diagram for pipelined execution of
the CVF. Note that two vector loads can be carried out simultaneously on the two
vector-access pipes. At the end of the computation, one of the two access pipes is used
for storing the A array.
Answer:
Pipeline   Operations, in time order
Access 1   Load B     Load D     Load F     Store A
Access 2   Load C     Load E     Load G
Multiply   Mul B,C    Mul D,E    Mul F,G
Add        (BC+DE)    BC+DE + FG
Time
QUESTION 8 (10 Marks)
Given the following assembly language code, exploit the maximum degree of
parallelism among the 16 instructions, assuming no resource conflicts and multiple
functional units are available simultaneously. For simplicity, no pipelining is assumed.
All instructions take one machine cycle to execute. Ignore all other overhead.
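1.  Load R1, A          / R1 ← Mem(A) /
2.  Load R2, B          / R2 ← Mem(B) /
3.  Mul R3, R1, R2      / R3 ← (R1) × (R2) /
4.  Load R4, D          / R4 ← Mem(D) /
5.  Mul R5, R1, R4      / R5 ← (R1) × (R4) /
6.  Add R6, R3, R5      / R6 ← (R3) + (R5) /
7.  Store X, R6         / Mem(X) ← (R6) /
8.  Load R7, C          / R7 ← Mem(C) /
9.  Mul R8, R7, R4      / R8 ← (R7) × (R4) /
10. Load R9, E          / R9 ← Mem(E) /
11. Add R10, R8, R9     / R10 ← (R8) + (R9) /
12. Store Y, R10        / Mem(Y) ← (R10) /
13. Add R11, R6, R10    / R11 ← (R6) + (R10) /
14. Store U, R11        / Mem(U) ← (R11) /
15. Sub R12, R6, R10    / R12 ← (R6) − (R10) /
16. Store V, R12        / Mem(V) ← (R12) /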
a) Draw a program graph with 16 nodes to show the flow relationships among the 16
instructions.
b) Consider the use of a three-issue superscalar processor to execute this program
fragment in minimum time. The processor can issue one memory-access instruction
(Load or Store but not both), one Add/Sub instruction, and one Mul (multiply)
instruction per cycle. The Add unit, Load/Store unit, and Multiply unit can be used
simultaneously if there is no data dependence.
Answer:
(a) [Figure: program graph with 16 nodes, numbered 1 to 16, and edges labelled R1 to R12.]
(b)
Cycle   Mem. Acc.   Add   Mul
1       1
2       2
3       4                 3
4       8                 5
5       10          6     9
6       7           11
7       12          13
8       14          15
9       16
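The nine-cycle schedule can be reproduced with a simple in-order list scheduler. The
sketch below is an illustration under stated assumptions: the register operands in
`PROGRAM` are an assumed reading of the 16 instructions (each register is written once),
every instruction takes one cycle, and each of the three units issues at most one
instruction per cycle.

```python
# (number, unit, destination register or None, source registers); operands assumed.
PROGRAM = [
    (1, "mem", "R1", []),
    (2, "mem", "R2", []),
    (3, "mul", "R3", ["R1", "R2"]),
    (4, "mem", "R4", []),
    (5, "mul", "R5", ["R1", "R4"]),
    (6, "add", "R6", ["R3", "R5"]),
    (7, "mem", None, ["R6"]),
    (8, "mem", "R7", []),
    (9, "mul", "R8", ["R7", "R4"]),
    (10, "mem", "R9", []),
    (11, "add", "R10", ["R8", "R9"]),
    (12, "mem", None, ["R10"]),
    (13, "add", "R11", ["R6", "R10"]),
    (14, "mem", None, ["R11"]),
    (15, "add", "R12", ["R6", "R10"]),
    (16, "mem", None, ["R12"]),
]


def schedule(program):
    """Return one dict per cycle mapping unit name to the instruction issued."""
    producer, deps = {}, {}
    for number, _, dest, sources in program:
        deps[number] = [producer[reg] for reg in sources]
        if dest is not None:
            producer[dest] = number

    done, table, cycle = {}, [], 0
    while len(done) < len(program):
        cycle += 1
        issued = {}
        for number, unit, _, _ in program:  # program order gives the priority
            if number in done or unit in issued:
                continue
            # Ready only if every producer finished in an earlier cycle.
            if all(done.get(dep, cycle) < cycle for dep in deps[number]):
                issued[unit] = number
        for number in issued.values():
            done[number] = cycle
        table.append(issued)
    return table


if __name__ == "__main__":
    for cycle, issued in enumerate(schedule(PROGRAM), start=1):
        print(cycle, issued.get("mem", ""), issued.get("add", ""), issued.get("mul", ""))
```

Giving priority to the earliest instruction in program order is a design choice that
happens to be sufficient here; it is not guaranteed to be optimal for other programs.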
QUESTION 9
Explain the applicability and the restrictions involved in using Amdahl’s law and
Gustafson’s law to estimate the speedup performance of an n-processor system
compared with that of a single-processor system. Ignore all communication overheads.
(Please answer in one paragraph.)
Answer:
Amdahl’s law is based on a fixed workload, where the problem size is fixed regardless
of the machine size. Gustafson’s law is based on a scaled workload, where the problem
size is increased with the machine size so that the solution time is the same for
sequential and parallel executions.
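The contrast between the two workload assumptions shows up directly in the commonly
quoted forms of the two laws. The sketch below is illustrative only: the symbol α for the
parallelizable fraction and the function names are assumptions, and communication
overhead is ignored as the question specifies.

```python
from fractions import Fraction


def amdahl_speedup(alpha, n):
    """Fixed workload: alpha is the parallelizable fraction of the
    single-processor run time, and the problem size does not change."""
    return 1 / ((1 - alpha) + alpha / n)


def gustafson_speedup(alpha, n):
    """Scaled workload: alpha is the parallel fraction of the run time on the
    n-processor machine, and the problem grows to keep that run time fixed."""
    return (1 - alpha) + alpha * n


if __name__ == "__main__":
    alpha = Fraction(9, 10)
    for n in (10, 100, 1000):
        print(n, float(amdahl_speedup(alpha, n)), float(gustafson_speedup(alpha, n)))
```

For a fixed α the first function saturates at 1/(1 − α) as n grows, while the second
keeps growing linearly in n.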
QUESTION 10
Why are hypercube networks, which were very popular in first-generation
multicomputers, being replaced by 2D or 3D meshes or tori in the second and third
generations of multicomputers?
Answer:
In VLSI systems, networks with many dimensions require more and longer wires than
low-dimensional networks. Thus, high-dimensional networks cost more and run more
slowly than low-dimensional networks. Under the assumption of constant wire
bisection, low-dimensional networks have wide channels, and high-dimensional
networks have narrow channels. With the wormhole routing method, which is used by
most of the second- and third-generation multicomputers, the wider channels provide a
lower latency, less contention, and higher hot-spot throughput.
