Advanced Computer Architecture Solutions

TEST-3 SOLUTIONS Subject: Advanced Computer Architecture
1) Consider the following pipeline reservation table.

[Reservation table: stages S1, S2 and S3 (rows) against clock cycles 1 to 8 (columns); the positions of the X marks were lost in extraction.]

(a) What are the forbidden latencies?
(b) Draw the state transition diagram.
(c) List all the simple cycles and greedy cycles.
(d) Determine the optimal constant latency cycle and the minimal average latency.
(e) Let the pipeline clock period be τ = 20 ns. Determine the throughput of the pipeline.
(10 Marks)

Sol:
Forbidden latencies: 2, 4, 5 and 7
Permissible latencies: 1, 3, 6 and 8 or more (8+)
Collision vector: C7C6C5C4C3C2C1 = 1011010

Each next state is the collision vector ORed with the present state (PS) shifted right by the latency.

CASE 1: latency 3
Present state   Collision vector   PS with 3 shifts   Next state
1011010         1011010            0001011            1011011
1011011         1011010            0001011            1011011

CASE 2: latency 6
Present state   Collision vector   PS with 6 shifts   Next state
1011010         1011010            0000001            1011011
1011011         1011010            0000001            1011011

CASE 3: latency 1
Present state   Collision vector   PS with 1 shift    Next state
1011010         1011010            0101101            1111111

CASE 4: latency 8 (or more)
Present state   Collision vector   PS with 8 shifts   Next state
1011010         1011010            0000000            1011010
1011011         1011010            0000000            1011010
1111111         1011010            0000000            1011010
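State transition diagram (* marks the minimum latency leaving a state with several outgoing edges):
1011010 --1*--> 1111111
1011010 --3, 6--> 1011011
1011010 --8+--> 1011010
1011011 --3*, 6--> 1011011
1011011 --8+--> 1011010
1111111 --8+--> 1011010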
Latency cycles: (1, 8) (1, 8, 8) (1, 8, 3, 8) (1, 8, 8, 3, 8) (1, 8, 6, 8) (1, 8, 8, 6, 8) (8) (3) (6) (1, 8, 8, 6, 6, 8)
Simple cycles: (3) (6) (8) (1, 8) (3, 8) (6, 8)
Greedy cycles: (3) and (1, 8)
Optimal latency cycle: (3)
MAL bounds:
Lower bound = 3 (the maximum number of marks in any row of the reservation table)
Upper bound = 4 + 1 = 5 (the number of 1s in the collision vector, plus one)
Average latency of greedy cycle (1, 8) = (1 + 8) / 2 = 4.5, so MAL ≤ 4.5
MAL = 3, achieved by the greedy cycle (3).
Given: τ = 20 ns
Throughput of the pipeline = 1 / (MAL × τ) = 1 / (3 × 20 × 10^-9 s) ≈ 16.67 × 10^6 initiations per second (about 16.67 MIPS).
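The state-diagram procedure used in this solution can be sketched in a few lines of Python. This is a minimal sketch rather than a reference implementation: the function names (`collision_vector`, `state_diagram`, `simple_cycles`, `greedy_cycles`, `minimal_average_latency`) are illustrative, states are held as integers with C1 in the lowest bit, and every latency larger than the largest forbidden latency is represented by the single edge m + 1 (the "8+" edge here).

```python
from typing import Dict, Set, Tuple

Diagram = Dict[int, Dict[int, int]]  # state -> {latency: next state}


def collision_vector(forbidden: Set[int]) -> int:
    """Bit (i - 1) is 1 when latency i is forbidden, so C1 is the lowest bit."""
    cv = 0
    for latency in forbidden:
        cv |= 1 << (latency - 1)
    return cv


def state_diagram(forbidden: Set[int]) -> Diagram:
    """Build all reachable states: next = (state >> latency) OR collision vector."""
    m = max(forbidden)
    cv = collision_vector(forbidden)
    diagram: Diagram = {}
    todo = [cv]
    while todo:
        state = todo.pop()
        if state in diagram:
            continue
        edges = {}
        for latency in range(1, m + 1):
            if not (state >> (latency - 1)) & 1:  # latency is permissible here
                edges[latency] = (state >> latency) | cv
        edges[m + 1] = cv  # stands for every latency greater than m
        diagram[state] = edges
        todo.extend(edges.values())
    return diagram


def simple_cycles(diagram: Diagram) -> Set[Tuple[int, ...]]:
    """Latency cycles in which no state repeats, each listed from its smallest state."""
    cycles: Set[Tuple[int, ...]] = set()

    def walk(start: int, state: int, path: Tuple[int, ...], seen: frozenset) -> None:
        for latency, nxt in diagram[state].items():
            if nxt == start:
                cycles.add(path + (latency,))
            elif nxt > start and nxt not in seen:
                walk(start, nxt, path + (latency,), seen | {nxt})

    for start in diagram:
        walk(start, start, (), frozenset([start]))
    return cycles


def greedy_cycles(diagram: Diagram) -> Set[Tuple[int, ...]]:
    """Cycles reached by always taking the minimum latency out of each state."""
    found: Set[Tuple[int, ...]] = set()
    for start in diagram:
        path, seen, state = [], {}, start
        while state not in seen:
            seen[state] = len(path)
            latency = min(diagram[state])
            path.append((state, latency))
            state = diagram[state][latency]
        loop = path[seen[state]:]
        k = loop.index(min(loop))  # rotate so the smallest state comes first
        found.add(tuple(lat for _, lat in loop[k:] + loop[:k]))
    return found


def minimal_average_latency(diagram: Diagram) -> float:
    return min(sum(c) / len(c) for c in simple_cycles(diagram))


if __name__ == "__main__":
    d = state_diagram({2, 4, 5, 7})
    print(sorted(format(s, "07b") for s in d))
    print(sorted(simple_cycles(d)))
    print(sorted(greedy_cycles(d)))
    print(minimal_average_latency(d))
```

Run with the forbidden latencies of this question, {2, 4, 5, 7}, the sketch can be used to cross-check the three states, the six simple cycles and the two greedy cycles listed in the solution.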
2) Describe the mechanisms for instruction pipelining in terms of prefetch buffers and multiple functional units. (10 Marks)
Sol:
Prefetch buffers:
[Figure: prefetch buffers feeding the instruction pipeline. Recoverable labels: Memory, Fetch cache, Sequential buffer 1, Sequential buffer 2, Target buffer 1, Target buffer 2 (instructions from branched locations), Instruction pipeline.]
There are three types of prefetch buffers, used to match the instruction fetch rate to the pipeline consumption rate:
1. Sequential buffers
2. Target buffers
3. Loop buffers

Sequential buffers: Sequential instructions are loaded into a pair of sequential buffers for in-sequence pipelining.

Target buffers: Instructions from a branch target are loaded into a pair of target buffers for out-of-sequence pipelining. Both kinds of buffer operate in FIFO fashion and become part of the pipeline as additional stages. A conditional branch instruction causes both the sequential buffers and the target buffers to fill with instructions. After the branch condition is checked, the appropriate instructions are taken from one of the two buffers, and the instructions in the other buffer are discarded. Within each pair, the two buffers alternate to prevent a collision between instructions flowing into and out of the pipeline.

Loop buffers: These buffers hold the sequential instructions contained in a small loop. The loop buffers are maintained by the fetch stage of the pipeline. Prefetched instructions in the loop body are executed repeatedly until all iterations complete execution. The loop buffer operates in two steps.
a. It contains instructions sequentially ahead of the current instruction. This saves the instruction fetch time from memory.
b. It recognizes when the target of a branch falls within the loop boundary.

Multiple functional units: The architecture considered here is a pipelined scalar architecture with multiple functional units. To resolve data dependences and resource dependences among successive instructions entering the pipeline, reservation stations (RS) are used with each functional unit. Operands can wait in the reservation stations until their data dependences have been resolved. Each reservation station is uniquely identified by a tag, which is monitored by a tag unit. The tag unit keeps checking the tags from all currently used registers or reservation stations. This register-tagging technique allows the hardware to resolve conflicts between source and destination registers assigned to multiple instructions. Besides resolving conflicts, the reservation stations also serve as buffers that interface the pipelined functional units with the decode and issue units. The multiple functional units are supposed to operate in parallel once the dependences are resolved.
[Figure: pipelined processor with multiple functional units. Recoverable labels: Instructions from memory, Instruction fetch unit, Decode and issue unit, Tag unit, Register file, Reservation stations (RS), Functional units (FU), Load registers, Memory.]
PART-2 Answer any Two full questions.
3) Consider the five-stage pipelined processor specified by the following reservation table.

[Reservation table: stages S1 to S5 (rows) against clock cycles 1 to 6 (columns); the positions of the X marks were lost in extraction.]

(a) What are the forbidden latencies?
(b) Draw the state transition diagram.
(c) List all the simple cycles and greedy cycles.
(d) Determine the optimal constant latency cycle and the minimal average latency (MAL).
(10 Marks)
Sol:
Forbidden latencies: 3, 4 and 5
Permissible latencies: 1, 2 and 6 or more (6+)
Collision vector: C5C4C3C2C1 = 11100

CASE 1: latency 1
Present state   Collision vector   PS with 1 shift    Next state
11100           11100              01110              11110
11110           11100              01111              11111

CASE 2: latency 2
Present state   Collision vector   PS with 2 shifts   Next state
11100           11100              00111              11111
Latency 2 is permissible only from the initial state, because bit C2 is 1 in both 11110 and 11111.

CASE 3: latency 6 (or more)
Present state   Collision vector   PS with 6 shifts   Next state
11100           11100              00000              11100
11110           11100              00000              11100
11111           11100              00000              11100

State transition diagram (* marks the minimum latency leaving a state with several outgoing edges):
11100 --1*--> 11110
11100 --2--> 11111
11100 --6+--> 11100
11110 --1*--> 11111
11110 --6+--> 11100
11111 --6+--> 11100

Simple cycles: (6), (1, 6), (2, 6), (1, 1, 6)
Greedy cycle: (1, 1, 6)
A constant cycle (2) is not possible: initiating a task every 2 cycles places two tasks 4 cycles apart, and 4 is a forbidden latency. The only constant cycle in the diagram is (6).
Optimal constant latency cycle: (6)
MAL bounds:
Lower bound = 2
Upper bound = 3 + 1 = 4
Average latency of greedy cycle (1, 1, 6) = (1 + 1 + 6) / 3 = 8/3 ≈ 2.67
MAL = 8/3 ≈ 2.67

4) Consider the following pipelined processor with four stages. This pipeline has a total evaluation time of six clock cycles. All successor stages must be used after each clock cycle.
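[Figure: four-stage pipeline with stages S1, S2, S3 and S4 between Input and Output; the connections between the stages were lost in extraction.]

(a) Specify the reservation table for this pipeline with six columns and four rows.
(b) List the set of forbidden latencies between task initiations.
(c) Draw the state diagram which shows all possible latency cycles.
(d) List all greedy cycles from the state diagram.
(e) What is the value of the minimal average latency (MAL)?
(10 Marks)

Sol:
Reservation table:
[Reservation table: stages S1 to S4 (rows) against clock cycles 1 to 6 (columns); the positions of the X marks were lost in extraction.]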
Forbidden latencies: 2 and 4
Permissible latencies: 1, 3 and 5 or more (5+)
Collision vector: C4C3C2C1 = 1010

CASE 1: latency 1
Present state   Collision vector   PS with 1 shift    Next state
1010            1010               0101               1111

CASE 2: latency 3
Present state   Collision vector   PS with 3 shifts   Next state
1010            1010               0001               1011
1011            1010               0001               1011

CASE 3: latency 5 (or more)
Present state   Collision vector   PS with 5 shifts   Next state
1010            1010               0000               1010
1011            1010               0000               1010
1111            1010               0000               1010
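State diagram (* marks the minimum latency leaving a state with several outgoing edges):
1010 --1*--> 1111
1010 --3--> 1011
1010 --5+--> 1010
1011 --3*--> 1011
1011 --5+--> 1010
1111 --5+--> 1010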
Simple cycles: (3), (5), (3, 5), (1, 5)
Greedy cycles: (3) and (1, 5)
Average latency of greedy cycle (1, 5) = (1 + 5) / 2 = 3
MAL bounds:
Lower bound = 3
Upper bound = 2 + 1 = 3
MAL = 3
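A latency cycle read off the state diagram can also be checked directly against the forbidden latencies, without the diagram. The sketch below is only an illustration under stated assumptions: the helper name `collides` and the `rounds` parameter are made up for this example, and the check repeats the cycle a fixed number of times instead of proving collision freedom for all time.

```python
from itertools import cycle, islice
from typing import Iterable, Set


def collides(forbidden: Set[int], latencies: Iterable[int], rounds: int = 20) -> bool:
    """Return True if repeating the latency cycle ever places two
    initiations a forbidden number of clock cycles apart."""
    latencies = list(latencies)
    starts = [0]  # clock cycle at which each task is initiated
    for latency in islice(cycle(latencies), rounds * len(latencies)):
        starts.append(starts[-1] + latency)
    return any(
        (later - earlier) in forbidden
        for i, earlier in enumerate(starts)
        for later in starts[i + 1:]
    )


if __name__ == "__main__":
    forbidden = {2, 4}  # forbidden latencies of this question
    for candidate in [(3,), (5,), (3, 5), (1, 5), (1,), (2,)]:
        average = sum(candidate) / len(candidate)
        print(candidate, "collides" if collides(forbidden, candidate) else "ok", average)
```

With the forbidden set {2, 4}, the four simple cycles of this question pass the check, while a constant latency of 1 or 2 does not.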
5) Design an arithmetic pipeline unit for fixed-point multiplication of 8-bit integers using CSAs and a CPA. (10 Marks)
Sol: An arithmetic pipeline unit for fixed-point multiplication of 8-bit integers using carry-save adders (CSA) and a carry-propagate adder (CPA):
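The arithmetic performed by such a unit can be sketched in software: shifted partial products are reduced three at a time by carry-save adders until two words remain, and a single carry-propagate addition produces the product. This is a behavioural sketch only, not the schematic asked for; the function names `csa` and `multiply_csa` and the grouping of operands are assumptions, and the number of CSA levels here need not match any particular hardware design.

```python
from typing import List, Tuple


def csa(a: int, b: int, c: int) -> Tuple[int, int]:
    """Carry-save adder: three words in, a sum word and a shifted carry word out.
    The invariant a + b + c == sum_word + carry_word always holds."""
    sum_word = a ^ b ^ c
    carry_word = ((a & b) | (b & c) | (a & c)) << 1
    return sum_word, carry_word


def multiply_csa(x: int, y: int, bits: int = 8) -> int:
    """Multiply two unsigned `bits`-bit integers with CSA reduction and one CPA."""
    # One shifted partial product per multiplier bit.
    terms: List[int] = [(x << i) if (y >> i) & 1 else 0 for i in range(bits)]
    while len(terms) > 2:
        groups = len(terms) // 3
        reduced: List[int] = []
        for g in range(groups):
            reduced.extend(csa(*terms[3 * g: 3 * g + 3]))
        reduced.extend(terms[3 * groups:])  # words left over pass through unchanged
        terms = reduced
    return terms[0] + terms[1]  # final carry-propagate addition


if __name__ == "__main__":
    print(format(multiply_csa(0b1111, 0b1111, bits=4), "b"))
    print(multiply_csa(200, 123))
```

Because each CSA preserves the sum of its three inputs, the final addition equals the ordinary product, which makes the sketch easy to check against `x * y`.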
PART-3 Answer any Two full questions.
6) How is the dot product operation $S = \sum_{i=1}^{n} a_i \times b_i$ implemented without data forwarding? What are the advantages that accrue with internal data forwarding? (5+5 = 10 Marks)
Sol: The dot product operation is $S = \sum_{i=1}^{n} a_i \times b_i$.
For example: A = (1, 2, 3, 4), B = (4, 5, 6, 7)
A · B = (1×4 + 2×5 + 3×6 + 4×7) = 60
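The worked example can be checked with a short sketch that separates the multiply step from the add step, mirroring the two functional units involved. The function name `dot_product` and the variable names are illustrative only; the loop models the arithmetic, not the timing of either implementation.

```python
from typing import Sequence


def dot_product(a: Sequence[int], b: Sequence[int]) -> int:
    """Accumulate a[i] * b[i], one multiply followed by one add per element."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    total = 0
    for a_i, b_i in zip(a, b):
        product = a_i * b_i      # multiply unit
        total = total + product  # add unit consumes the product
    return total


if __name__ == "__main__":
    print(dot_product((1, 2, 3, 4), (4, 5, 6, 7)))
```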
Fig.: Implementing the dot-product operation with internal data forwarding between a multiply unit and an add unit.
Advantages: Without internal data forwarding, the three instructions must be executed sequentially in a looping structure. With data forwarding, the output of the multiplier is fed directly into the input register R4 of the adder; at the same time, the output of the multiplier is also routed to register R3, as shown in the figure. Internal data forwarding between the two functional units therefore reduces the total execution time through the pipelined processor.
7) Design a binary multiply pipeline unit for two 4-bit operands. Use the minimum number of CSAs and CPAs. Show all interconnections and bus widths in the schematic diagram. Calculate the output of each CSA and CPA. (5+5 = 10 Marks)
Sol: A binary multiply unit for two 4-bit operands:
For example, take the two 4-bit operands 1111 x 1111.
Shifted partial products: 1111, 11110, 111100, 1111000
Product: 11100001

In the CSA outputs below, S is the bitwise sum word and C is the carry word already shifted left by one bit.

CSA1:
    001111
    011110
    111100
S = 101101
C = 111100

CSA2:
    0101101
    0111100
    1111000
S = 1101001
C = 1111000

CPA:
1101001 + 1111000
S = 11100001

8) Describe dynamic instruction scheduling achieved in Tomasulo's register-tagging scheme built into the IBM 360/91 processor. (10 Marks)
Sol: Dynamic instruction scheduling achieved in Tomasulo's register-tagging scheme built into the IBM 360/91 processor:
This hardware dependence-resolution scheme was implemented with the multiple floating-point units of the IBM 360/91 processor. In the Model 91, three RSs are used in the floating-point adder and two pairs in the floating-point multiplier. The scheme resolves resource conflicts as well as data dependences, using register tagging to allocate and deallocate the source and destination registers. An issued instruction whose operands are not available is forwarded to an RS associated with the functional unit it will use. It waits there until its data dependences have been resolved and its operands become available. The dependence is resolved by monitoring the result bus. When all operands for an instruction are available, it is dispatched to the functional unit for execution. All working registers are tagged. If a source register is busy when an instruction reaches the issue stage, the tag for the source register is forwarded to an RS. When the register becomes available, the tag signals its availability.
Total execution time is 13 cycles, from cycle 4 to cycle 16
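The register-tagging idea described in this answer can be illustrated with a toy model. It is a heavily simplified sketch and not the Model 91 hardware: the class and function names (`Register`, `Station`, `issue`, `broadcast`, `execute_ready`) are invented for illustration, there is no timing, and each reservation station holds exactly one two-operand instruction.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Union


@dataclass
class Register:
    value: float = 0.0
    tag: Optional[str] = None  # name of the station that will produce the value


@dataclass
class Station:
    op: str                           # "add" or "mul"
    operands: List[Union[float, str]]  # a float is a ready value, a str is a tag

    def ready(self) -> bool:
        return not any(isinstance(x, str) for x in self.operands)


def issue(regs: Dict[str, Register], stations: Dict[str, Station],
          name: str, op: str, dest: str, src1: str, src2: str) -> None:
    """Copy ready values, or the tags of busy registers, into station `name`."""
    operands: List[Union[float, str]] = []
    for src in (src1, src2):
        reg = regs[src]
        operands.append(reg.tag if reg.tag is not None else reg.value)
    stations[name] = Station(op, operands)
    regs[dest].tag = name  # the destination register now waits on this station


def broadcast(regs: Dict[str, Register], stations: Dict[str, Station],
              tag: str, value: float) -> None:
    """Result bus: every waiting station and register with a matching tag takes the value."""
    for station in stations.values():
        station.operands = [value if x == tag else x for x in station.operands]
    for reg in regs.values():
        if reg.tag == tag:
            reg.value, reg.tag = value, None


def execute_ready(regs: Dict[str, Register], stations: Dict[str, Station]) -> Optional[str]:
    """Run one station whose operands are all available; return its name, or None."""
    for name, station in list(stations.items()):
        if station.ready():
            a, b = station.operands
            result = a + b if station.op == "add" else a * b
            del stations[name]
            broadcast(regs, stations, name, result)
            return name
    return None


if __name__ == "__main__":
    regs = {r: Register() for r in ("R1", "R2", "R3", "R4", "R5")}
    regs["R1"].value, regs["R2"].value, regs["R3"].value = 2.0, 3.0, 4.0
    stations: Dict[str, Station] = {}
    issue(regs, stations, "MUL1", "mul", "R4", "R1", "R2")  # R4 = R1 * R2
    issue(regs, stations, "ADD1", "add", "R5", "R4", "R3")  # R5 = R4 + R3, waits on MUL1
    while execute_ready(regs, stations):
        pass
    print(regs["R4"].value, regs["R5"].value)
```

In this sketch the add instruction is issued before its operand R4 exists; it holds the tag MUL1 in its station and picks up the product when that tag appears on the result bus.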
