96

1. _____ implements the translation of a program's address space to physical
addresses.
(A) DRAM
(B) Main memory
(C) Physical memory
(D) Virtual memory
Answer: (D)
2. To track whether a page has been written since it was read into memory, a
____ is added to the page table.
(A) valid bit
(B) tag index
(C) dirty bit
(D) reference bit
Answer: (C)
3. (Refer to the CPU architecture of Figure 1 below) Which of the following
statements is correct for a load word (LW) instruction?
(A) MemtoReg should be set to 0 so that the correct ALU output can be sent to
the register file.
(B) MemtoReg should be set to 1 so that the Data Memory output can be sent to
the register file.
(C) We do not care about the setting of MemtoReg. It can be either 0 or 1.
(D) MemWrite should be set to 1.
Answer: (B)








[Figure 1 shows the single-cycle MIPS datapath: PC, instruction memory, register file, sign-extend and shift-left-2 units, ALU with ALU control, data memory, the branch adder, and multiplexers driven by the control signals RegDst, ALUSrc, MemtoReg, RegWrite, MemRead, MemWrite, Branch, ALUOp, and PCSrc.]

Figure 1

4. The IEEE 754 binary representation of a 32-bit floating-point number is shown
below (normalized single-precision representation with bias = 127):

31       30 ~ 23    22 ~ 0
S        exponent   fraction
1 bit    8 bits     23 bits
(S)      (E)        (F)

What is the correct binary representation of (-0.75)_ten in IEEE single-precision
float format?
(A) 1011 1111 0100 0000 0000 0000 0000 0000
(B) 1011 1111 1010 0000 0000 0000 0000 0000
(C) 1011 1111 1101 0000 0000 0000 0000 0000
(D) 1011 1110 1000 0000 0000 0000 0000 0000
Answer: (A)
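The bit pattern in (A) can be double-checked with Python's struct module (a quick verification sketch, not part of the original exam):

```python
import struct

# Bit pattern of -0.75 as an IEEE 754 single-precision float.
bits = struct.unpack(">I", struct.pack(">f", -0.75))[0]
print(f"{bits:032b}")  # 10111111010000000000000000000000, i.e. choice (A)
```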
5. Using the format of Question 4, what is the decimal number represented by the
word below?

Bit position   31   30 ~ 23    22 ~ 0
Binary value   1    10000011   0110 0000 0000 0000 0000 000

(A) -10
(B) -11
(C) -22
(D) -44

Answer: (C). The value is (-1)^1 × 1.011_two × 2^(131 - 127) = -1.375 × 16 = -22.
6. Assume that the following assembly code is run on a machine with 2 GHz clock.
The number of cycles for assembly instruction is shown in Table 1.
add $t0, $zero, $zero
loop: beq $a1, $zero, finish
add $t0, $t0, $a0
addi $a1, $a1, -1
j loop
finish: addi $t0, $t0, 100
add $v0, $t0, $zero

Instruction        Cycles
add, addi, sub     1
lw, beq, j         2
Table 1
Assume $a0 = 3, $a1 = 20 at initial time, select the correct value of $v0 at the
final cycle:
(A) 157
(B) 160
(C) 163
(D) 166
Answer: (B)
7. According to Question 6, calculate the MIPS (million instructions per
second) rating of this assembly code:
(A) 1342
(B) 1344
(C) 1346
(D) 1348
Answer: (B)
MIPS = instruction count / (execution time × 10^6)
     = (instruction count × clock rate) / (clock cycles × 10^6)
     = (84 × 2 × 10^9) / (125 × 10^6)
     = 1344
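The counts used in Questions 6-7 (84 instructions, 125 cycles, final $v0) can be confirmed with a small simulation (an illustrative sketch; registers are modeled as plain Python variables):

```python
# Simulate the loop with $a0 = 3, $a1 = 20; add/addi/sub take 1 cycle, beq/j take 2.
a0, a1, t0 = 3, 20, 0
instructions, cycles = 1, 1                 # add $t0, $zero, $zero
while True:
    instructions += 1; cycles += 2          # beq $a1, $zero, finish
    if a1 == 0:
        break
    t0 += a0                                # add $t0, $t0, $a0
    a1 -= 1                                 # decrement $a1
    instructions += 3; cycles += 1 + 1 + 2  # add, sub, j
v0 = t0 + 100                               # addi $t0, $t0, 100; add $v0, $t0, $zero
instructions += 2; cycles += 2
mips = instructions * 2e9 / cycles / 1e6    # 2 GHz clock
print(v0, instructions, cycles, mips)       # 160 84 125 1344.0
```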




Questions 8-11. Link the following terms ((1) ~ (4))
(1) Microsoft Word
(2) Operating system
(3) Internet
(4) CD-ROM
to the most related terminology shown below (A, B, C,..., K), choose the most
related one ONLY (answer format: e.g., (1) K, for mapping item (1) to
terminology K).
A Applications software
B High-level programming language
C Input device
D Integrated circuit
E Output device
F Personal computer
G Semiconductor
H Super computer
I Systems software
K Computer Networking
Please write down the answers in the answer table together with the choice
questions.
8. (1) Microsoft word
9. (2) Operating system
10. (3) Internet
11. (4) CD-ROM
Answer:
8. (1) Microsoft word A
9. (2) Operating system I
10. (3) Internet K
11. (4) CD-ROM C

Questions 12-15. Match the memory hierarchy element on the left with the closest
phrase on the right: (answer format: e.g., (1) d, for mapping item (1) (left) to
item d (right))
(1). L1 cache a. A cache for a cache
(2). L2 cache b. A cache for disks
(3). Main memory c. A cache for a main memory
(4). TLB d. A cache for page table entries
Please write down the answers in the answer table together with the choice
questions.
12. (1) L1 cache
13. (2) L2 cache
14. (3) Main memory
15. (4) TLB

Answer:
12. (1) L1 cache a
13. (2) L2 cache c
14. (3) Main memory b
15. (4) TLB d

Questions 16-25. Based on the function of the seven control signals and the datapath
of the MIPS CPU in Figure 1 (the same figure used for Question 3), complete the
settings of the control lines in Table 2 (use 0, 1, and X (don't care) only) for the
two MIPS CPU instructions (beq and add). X (don't care) can help to reduce the
implementation complexity, so put X wherever possible.
Instr.  Branch  ALUSrc  RegWrite  RegDst  MemtoReg  MemWrite  MemRead  ALUOp1  ALUOp0
beq     (16)    (17)    (18)      (19)    (20)      (21)      (22)     0       1
add     (23)    (24)    (25)
Table 2
Please write down the answers in the answer table together with the choice
questions.
16. (16) =
17. (17) =
18. (18) =
19. (19) =
20. (20) =
21. (21) =
22. (22) =
23. (23) =
24. (24) =
25. (25) =
Answer:
16. (16) = 1
17. (17) = 0
18. (18) = 0
19. (19) = X
20. (20) = X
21. (21) = 0
22. (22) = 0
23. (23) = 1
24. (24) = 0
25. (25) = 0

95

Questions 1-4. Choose ALL the correct answers for each of the following questions.
Note that credit will be given only if all choices are correct.
1. With pipelines:
(A) Increasing the depth of pipelining increases the impact of hazards.
(B) Bypassing is a method to resolve a control hazard.
(C) If a branch is taken, the branch prediction buffer will be updated.
(D) In static multiple issue scheme, multiple instructions in each clock cycle are
fixed by the processor at the beginning of the program execution.
(E) Predication is an approach to guess the outcome of an instruction and to
remove the execution dependence.
Answer: (A)
(B) False (bypassing resolves a data hazard, not a control hazard)
(C) False (the prediction buffer is updated when the guess is wrong)
(D) False (the issue packet is fixed by the compiler, not the processor)
(E) False (this describes speculation, not predication)

2. Increasing the degree of associativity of a cache scheme will
(A) Increase the miss rate.
(B) Increase the hit time.
(C) Increase the number of comparators.
(D) Increase the number of tag bits.
(E) Increase the complexity of LRU implementation.
Answer: (B), (C), (D), (E)
(A) False (it decreases the miss rate)

3. With caching:
(A) Write-through scheme improves the consistency between main memory and
cache.
(B) Split cache applies parallel caches to improve cache speed.
(C) TLB (translation-lookaside buffer) is a cache on page table, and could help
accessing the virtual addresses faster.
(D) No more than one TLB is allowed in a CPU to ensure consistency.
(E) A one-way set-associative cache performs the same as a direct-mapped
cache.
Answer: (A), (B), (E)
(C) False (the TLB speeds up access to physical addresses)

(D) False, (MIPS R3000 and Pentium 4 have two TLBs)

4. In a Pentium 4 PC,
(A) DMA mechanism can be applied to delegate responsibility from the CPU.
(B) AGP bus can be used to connect MCH (Memory Control Hub) and a
graphical output device.
(C) USB 2.0 is a synchronous bus using handshaking protocol.
(D) The CPU can fetch and translate IA-32 instructions.
(E) The CPU can reduce instruction latency with deep pipelining.
Answer: (A), (B), (D)
(C) False, (USB 2.0 is an asynchronous bus)
(E) False (pipelining cannot reduce a single instruction's latency; it improves throughput)

5. Examine the following two CPUs, each running the same instruction set. The first
one is a Gallium Arsenide (GaAs) CPU. A 10 cm (about 4 inch) diameter GaAs wafer
costs $2000. The manufacturing process creates 4 defects per square cm. The
CPU fabricated in this technology is expected to have a clock rate of 1000 MHz,
with an average clock cycles per instruction (CPI) of 2.5 if we assume an infinitely
fast memory system. The size of the GaAs CPU is 1.0 cm × 1.0 cm.
The second one is a CMOS CPU. A 20 cm (about 8 inch) diameter CMOS wafer
costs $1000 and has 1 defect per square cm. The 1.0 cm × 2.0 cm CPU executes
multiple instructions per clock cycle to achieve an average CPI of 0.75,
assuming an infinitely fast memory, while achieving a clock rate of 200 MHz.
(The CPU is larger because it has on-chip caches and executes multiple
instructions per clock cycle.)
Assume α for both GaAs and CMOS is 2. Yields for GaAs and CMOS wafers
are 0.8 and 0.9 respectively. Most of this information is summarized in the
following table:

       Wafer       Wafer   Cost    Defects    Freq.   CPI    Die Area    Test Dies
       Diam. (cm)  Yield   ($)     (1/cm^2)   (MHz)          (cm × cm)   (per wafer)
GaAs   10          0.80    $2000   3.0        1000    2.5    1.0 × 1.0   4
CMOS   20          0.90    $1000   1.8        200     0.75   1.0 × 2.0   4
Hint: Here are two equations that may help:

dies per wafer = π × (wafer diameter / 2)^2 / die area
                 − π × wafer diameter / sqrt(2 × die area) − test dies per wafer

die yield = wafer yield × (1 − defects per unit area × die area / α)^α
(a) Calculate the average execution time for each instruction with an infinitely
fast memory. Which CPU is faster, and by what factor?
(b) How many seconds will each CPU take to execute a one-billion-instruction
program?

(c) What is the cost of a GaAs die for the CPU? Repeat the calculation for CMOS
die. Show your work.
(d) What is the ratio of the cost of the GaAs die to the cost of the CMOS die?
(e) Based on the costs and performance ratios of the CPU calculated above, what
is the ratio of cost/performance of the CMOS CPU to the GaAs CPU?
Answer:
(a) Execution time (GaAs) for one instruction = 2.5 × 1 ns = 2.5 ns
Execution time (CMOS) for one instruction = 0.75 × 5 ns = 3.75 ns
The GaAs CPU is faster by 3.75/2.5 = 1.5 times
(b) Execution time (GaAs) = 1 × 10^9 × 2.5 ns = 2.5 seconds
Execution time (CMOS) = 1 × 10^9 × 3.75 ns = 3.75 seconds
(c) GaAs: dies per wafer = 67
die yield = 0.8 × (1 − 3 × 1.0 / 2)^2 = 0.2
Cost of a GaAs CPU die = 2000 / (67 × 0.2) ≈ $149.25
CMOS: dies per wafer = π × (20/2)^2 / 2 − π × 20 / sqrt(2 × 2) − 4 ≈ 121
die yield = 0.9 × (1 − 1.8 × 2 / 2)^2 = 0.576
Cost of a CMOS CPU die = 1000 / (121 × 0.576) ≈ $14.35
(d) The cost of a GaAs die is 149.25 / 14.35 ≈ 10.4 times that of a CMOS die
(e) The ratio of cost/performance of the CMOS CPU to the GaAs CPU is 10.4 / 1.5 ≈ 6.93
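The arithmetic in (c)-(d) can be replayed from the key's dies-per-wafer and die-yield figures (a quick check; the inputs are taken from the answer above):

```python
# Die cost = wafer cost / (dies per wafer * die yield), using the key's figures.
def die_cost(wafer_cost, dies_per_wafer, die_yield):
    return wafer_cost / (dies_per_wafer * die_yield)

gaas = die_cost(2000, 67, 0.2)       # GaAs die
cmos = die_cost(1000, 121, 0.576)    # CMOS die
print(round(gaas, 2), round(cmos, 2), round(gaas / cmos, 1))  # 149.25 14.35 10.4
```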

6. Given the following 8 possible solutions for a POP or a PUSH operation in a
STACK: (1) Read from Mem(SP), Decrement SP; (2) Read from Mem(SP),
Increment SP; (3) Decrement SP, Read from Mem(SP) (4) Increment SP, Read
from Mem(SP) (5) Write to Mem(SP), Decrement SP; (6) Write to Mem(SP),
Increment SP; (7) Decrement SP, Write to Mem(SP); (8) Increment SP, Write to
Mem(SP).
Choose only ONE of the above solutions for each of the following questions.
(a) Solution of a PUSH operation for a Last Full stack that grows ascending.
(b) Solution of a POP operation for a Next Empty stack that grows ascending.
(c) Solution of a PUSH operation for a Next Empty stack that grows ascending.
(d) Solution of a POP operation for a Last Full stack that grows ascending.
Answer:
(a) (8) (b) (3) (c) (6) (d) (1)


[Figure: stack pointer (SP) conventions — in a Last Full stack the SP points at the last item pushed, while in a Next Empty stack the SP points at the first free slot; addresses grow from small to large.]

Last Full PUSH: (1) Increment SP; (2) Write to Mem(SP)
Last Full POP: (1) Read from Mem(SP); (2) Decrement SP
Next Empty PUSH: (1) Write to Mem(SP); (2) Increment SP
Next Empty POP: (1) Decrement SP; (2) Read from Mem(SP)
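The two ascending-stack disciplines can be modeled directly (an illustrative sketch, not from the exam):

```python
class AscendingStack:
    """Stack growing toward higher addresses; mem models the memory array."""
    def __init__(self, last_full=True):
        self.mem = [None] * 8
        self.last_full = last_full       # True: SP points at the last pushed item
        self.sp = -1 if last_full else 0
    def push(self, v):
        if self.last_full:               # (8) Increment SP, Write to Mem(SP)
            self.sp += 1; self.mem[self.sp] = v
        else:                            # (6) Write to Mem(SP), Increment SP
            self.mem[self.sp] = v; self.sp += 1
    def pop(self):
        if self.last_full:               # (1) Read from Mem(SP), Decrement SP
            v = self.mem[self.sp]; self.sp -= 1
        else:                            # (3) Decrement SP, Read from Mem(SP)
            self.sp -= 1; v = self.mem[self.sp]
        return v

ne = AscendingStack(last_full=False)
ne.push(1); ne.push(2)
print(ne.pop(), ne.pop())  # 2 1
```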

7. Execute the following Copy loop on a pipelined machine:
Copy: lw $10, 1000($20)
sw $10, 2000($20)
addiu $20, $20, -4
bne $20, $0, Copy
Assume that the machine datapath neither stalls nor forwards on hazards, so you
must add nop instructions.
(a) Rewrite the code inserting as few nop instructions as needed for proper
execution;
(b) Use multi-clock-cycle pipeline diagram to show the correctness of your
solution.
Answer: Suppose that a register read and a register write can occur in the same clock cycle.
(a) lw $10, 1000($20)
Copy: addiu $20, $20, -4
nop
sw $10, 2000($20)
bne $20, $0, Copy
lw $10, 1000($20)
(b)
        1   2   3   4   5   6   7   8   9   10  11
lw      IF  ID  EX  MEM WB
addiu       IF  ID  EX  MEM WB
nop             IF  ID  EX  MEM WB
sw                  IF  ID  EX  MEM WB
bne                     IF  ID  EX  MEM WB
lw                          IF  ID  EX  MEM WB

8. In a Personal Computer, the optical drive has a rotation speed of 7500 rpm, a
40,000,000 bytes/second transfer rate, and a 60 ms seek time. The drive is served
with a 16 MHz bus, which is 16 bits wide.
(a) How long does the drive take to read a random 100,000-byte sector?
(b) When transferring the 100,000-byte data, what is the bottleneck?
Answer:
(a) 60 ms + 0.5 × (60 / 7500) s + 100,000 / 40,000,000 s
= 60 ms + 4 ms + 2.5 ms = 66.5 ms
(b) The time for the bus to transfer 100,000 bytes is
(100,000 / 2) / (16 × 10^6) s = 3.125 ms,
so the optical drive is the bottleneck.
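The same arithmetic in a few lines (a sanity check using the drive and bus parameters from the question):

```python
# Seek + rotational latency (half a revolution at 7500 rpm) + transfer time.
seek_ms = 60.0
rotational_ms = 0.5 * (60 / 7500) * 1000
transfer_ms = 100_000 / 40_000_000 * 1000
total_ms = seek_ms + rotational_ms + transfer_ms
bus_ms = (100_000 / 2) / 16e6 * 1000   # 16-bit bus moves 2 bytes per cycle
print(round(total_ms, 3), round(bus_ms, 3))  # 66.5 3.125
```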

9. A processor has a 16 KB, 4-way set-associative data cache with 32-byte blocks.
(a) What is the number of sets in L1 cache?
(b) The memory is byte-addressable and addresses are 35 bits long. Show the
breakdown of the address into its cache access components.
(c) How many total bytes are required for cache?
(d) Memory is connected via a 16-bit bus. It takes 100 clock cycles to send a
request to memory and to receive a cache block. The cache has 1-cycle hit
time and 95% hit rate. What is the average memory access time?
(e) A software program consists of 25% memory-access instructions. What is
the average number of memory-stall cycles per instruction if we run this
program?
Answer:
(a) 16 KB / (32 × 4) = 128 sets
(b)
tag      index    block offset   byte offset
23 bits  7 bits   3 bits         2 bits
(c) 2^7 × 4 × (1 + 23 + 32 × 8) = 143,360 bits = 140 Kbits = 17.5 KB
(d) Average memory access time = 1 + 0.05 × 100 = 6 clock cycles
(e) (6 − 1) × 1.25 = 6.25 clock cycles (1.25 memory accesses per instruction:
one instruction fetch plus 0.25 data accesses)
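The geometry in (a)-(c) follows mechanically from the cache parameters (a quick check):

```python
# 16 KB, 4-way set-associative, 32-byte blocks, 35-bit byte addresses.
cache_bytes, ways, block_bytes, addr_bits = 16 * 1024, 4, 32, 35
sets = cache_bytes // (ways * block_bytes)        # (a) number of sets
index_bits = sets.bit_length() - 1                # log2(128) = 7
offset_bits = block_bytes.bit_length() - 1        # 5 (3-bit word + 2-bit byte offset)
tag_bits = addr_bits - index_bits - offset_bits   # (b) 23
total_bits = sets * ways * (1 + tag_bits + block_bytes * 8)   # (c) valid + tag + data
print(sets, tag_bits, total_bits, total_bits / 8 / 1024)      # 128 23 143360 17.5
```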

94

1. Compare two memory system designs for a classic 5-stage pipelined processor.
Both memory systems have a 4-KB instruction cache. But system A has a
4K-byte data cache, with a miss rate of 10% and a hit time of 1 cycle; and system
B has an 8K-byte data cache, with a miss rate of 5% and a hit time of 2 cycles
(the cache is not pipelined). For both data caches, cache lines hold a single word
(4 bytes), and the miss penalty is 10 cycles. What are the respective average
memory access times for data retrieved by load instructions for the above two
memory system designs, measured in clock cycles?
Answer:
Average memory access time for system A = 1 + 0.1 × 10 = 2 cycles
Average memory access time for system B = 2 + 0.05 × 10 = 2.5 cycles
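The comparison reduces to the standard AMAT formula (a minimal sketch):

```python
# Average memory access time = hit time + miss rate * miss penalty.
def amat(hit_time, miss_rate, miss_penalty):
    return hit_time + miss_rate * miss_penalty

print(amat(1, 0.10, 10), amat(2, 0.05, 10))  # 2.0 2.5 -> system A is faster
```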

2. (a) Describe at least one clear advantage a Harvard architecture (separate
instruction and data caches) has over a unified cache architecture (a single
cache memory array accessed by a processor to retrieve both instructions and
data).
(b) Describe one clear advantage a unified cache architecture has over the
Harvard architecture.
Answer:
(a) Cache bandwidth is higher for a Harvard architecture than a unified cache
architecture
(b) Hit ratio is higher for a unified cache architecture than a Harvard architecture

3. (a) What is RAID?
(b) Match the RAID levels 1, 3, and 5 to the following phrases for the best match.
Use each only once.
Data and parity striped across multiple disks
Can withstand selective multiple disk failures
Requires only one disk for redundancy
Answer:
(a) An organization of disks that uses an array of small and inexpensive disks so
as to increase both performance and reliability
(b) RAID 5 Data and parity striped across multiple disks
RAID 1 Can withstand selective multiple disk failures
RAID 3 Requires only one disk for redundancy


4. (a) Explain the differences between a write-through policy and a write-back
policy.
(b) Tell which policy cannot be used in a virtual memory system, and describe
the reason.
Answer:
(a) Write through: always write the data into both the cache and the memory
Write back: updating values only to the block in the cache, then writing the
modified block to the lower level of the hierarchy when the block is replaced
(b) Write-through will not work for virtual memory, since writes take too long.
Instead, virtual memory systems use write-back

5. (a) What is a denormalized number (denorm or subnormal)?
(b) Show how to use gradual underflow to represent a denorm in a floating point
number system.
Answer:
(a) For an IEEE 754 floating-point number, if the exponent is all 0s but the
fraction is non-zero, then the value is a denormalized number, which does not
have an assumed leading 1 before the binary point. Thus, this represents the
number (-1)^s × 0.f × 2^-126, where s is the sign bit and f is the fraction.
(b) Denormalized numbers allow a number to degrade in significance until it
becomes 0, which is called gradual underflow.
For example, the smallest positive single-precision normalized number is
1.0000 0000 0000 0000 0000 000_two × 2^-126
but the smallest single-precision denormalized number is
0.0000 0000 0000 0000 0000 001_two × 2^-126, or 1.0_two × 2^-149
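Both boundary values can be decoded from their bit patterns (a quick check with Python's struct; Python floats are doubles, so these small singles are represented exactly):

```python
import struct

def f32(bits):
    # Interpret a 32-bit pattern as an IEEE 754 single-precision float.
    return struct.unpack(">f", struct.pack(">I", bits))[0]

smallest_normal = f32(0x00800000)   # exponent field 1, fraction 0 -> 2^-126
smallest_denorm = f32(0x00000001)   # exponent field 0, fraction 1 -> 2^-149
print(smallest_normal == 2.0 ** -126, smallest_denorm == 2.0 ** -149)  # True True
```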

6. Show the following structure in the memory map of a 64-bit big-endian
machine by plotting the answer in a two-row map where each row contains 8
bytes.
struct {
    int a;     // 0x11121314
    char c[7]; // 'A', 'B', 'C', 'D', 'E', 'F', 'G'
    short e;   // 0x2122
} s;
Answer:
 0  1  2  3  4  5  6  7
11 12 13 14  A  B  C  D
 E  F  G 21 22
(An int occupies 4 bytes, each char 1 byte, and the short 2 bytes; the fields are
packed with no padding, and on a big-endian machine the most significant byte of
each object is stored at the lowest address.)
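The map can be reproduced with Python's struct module (illustrative; ">" selects big-endian byte order and packs the fields without padding, matching the answer):

```python
import struct

# 4-byte int, seven characters, 2-byte short, big-endian and unpadded.
data = struct.pack(">I7sH", 0x11121314, b"ABCDEFG", 0x2122)
print(data.hex(" "))  # 11 12 13 14 41 42 43 44 45 46 47 21 22 ('A' = 0x41)
```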

7. Assume we have the following 3 ISA styles:
(1) Stack: All operations occur on top of stack where PUSH and POP are the only
instructions that access memory;
(2) Accumulator: All operations occur between an Accumulator and a memory
location;
(3) Load-Store: All operations occur in registers, and register-to-register
instructions use 3 registers per instruction.
(a) For each of the above ISAs, write an assembly code for the following
program segment using LOAD, STORE, PUSH, POP, ADD, and SUB and
other necessary assembly language mnemonics.
{ A = A + C;
  D = A - B;
}
(b) Some operations are not commutative (e.g., subtraction). Discuss what are
the advantages and disadvantages of the above 3 ISAs when executing
non-commutative operations.
Answer:
(a)
(1) Stack (2) Accumulator (3) Load-Store
PUSH A LOAD A LOAD R1, A
PUSH C ADD C LOAD R2, C
ADD STORE A ADD R1, R1, R2
POP A SUB B STORE R1, A
PUSH A STORE D LOAD R2, B
PUSH B SUB R1, R1, R2
SUB STORE R1, D
POP D
(b) In the Stack and Accumulator ISAs, one operand of every operation is
implicit (the top of stack or the accumulator), so for non-commutative
operations the compiler must load the operands in the required order at
compile time, which constrains instruction scheduling; the advantage is
shorter, denser instructions. In the Load-Store ISA, all three registers are
named explicitly, so operand order is explicit and the compiler is free to
schedule independent instructions; the disadvantage is longer instructions.


8. The program below divides two integers through repeated addition and was
originally written for a non-pipelined architecture. The divide function takes in as
its parameter a pointer to the base of an array of three elements where X is the
first element at 0($a0), Y is the second element at 4($a0), and the result Z is to be
stored into the third element at 8($a0). Line numbers have been added to the left
for use in answering the questions below.
1 DIVIDE: add $t3, $zero, $zero
2 add $t2, $zero, $zero
3 lw $t1, 4($a0)
4 lw $t0, 0($a0)
5 LOOP: beq $t2, $t0, END
6 addi $t3, $t3, 1
7 add $t2, $t2, $t1
8 j LOOP
9 END: sw $t3, 8($a0)
(a) Given a pipelined processor as discussed in the textbook, where will data be
forwarded (ex. Line 10 EX.MEM? Line 11 EX.MEM)? Assume that
forwarding is used whenever possible, but that branches have not been
optimized in any way and are resolved in the EX stage.
(b) How many data hazard stalls are needed? Between which instructions should
the stall bubble(s) be introduced (ex. Line 10 and Line 11)? Again, assume
that forwarding is used whenever possible, but that branches have not been
optimized in any way and are resolved in the EX stage.
(c) If X = 6 and Y = 3,
(i) How many times is the body of the loop executed?
(ii) How many times is the branch beq not taken?
(iii) How many times is the branch beq taken?
(d) Rewrite the code assuming delayed branches are used. If it helps, you may
assume that the answer to X/Y is at least 2. Assume that forwarding is used
whenever possible and that branches are resolved in IF/ID. Do not worry
about reducing the number of times through the loop, but arrange the code to
use as few cycles as possible by avoiding stalls and wasted instructions.
Answer:
(a) Line 4 MEM.WB
(b) 1 stall is needed, between Line 4 and Line 5
(c) (i) 2 (ii) 2 (iii) 1
(d) DIVIDE: add $t2, $zero, $zero
lw $t0, 0($a0)
add $t3, $zero, $zero
lw $t1, 4($a0)
LOOP: beq $t2, $t0, END
add $t2, $t2, $t1


j LOOP
addi $t3, $t3, 1
END: sw $t3, 8($a0)
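A direct simulation of the loop confirms the counts in (c) (an illustrative sketch; registers are modeled as Python variables):

```python
# Repeated-addition divide: t3 counts how many times Y fits into X.
def divide(x, y):
    t3, t2 = 0, 0                    # $t3 = quotient, $t2 = running sum
    body = not_taken = taken = 0
    while True:
        if t2 == x:                  # beq $t2, $t0, END
            taken += 1
            break
        not_taken += 1
        t3 += 1                      # addi $t3, $t3, 1
        t2 += y                      # add  $t2, $t2, $t1
        body += 1
    return t3, body, not_taken, taken

print(divide(6, 3))  # (2, 2, 2, 1): quotient 2; body runs twice; beq taken once
```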

93

1. Explain how each of the following six features contributes to the definition of a
RISC machine: (a) Single-cycle operation, (b) Load/Store design, (c) Hardwired
control, (d) Relatively few instructions and addressing modes, (e) Fixed
instruction format, (f) More compile-time effort.
Answer:
(a) Single-cycle operation: most instructions complete in a single (pipelined)
clock cycle, which keeps control simple and allows a short cycle time.
(b) Load/Store design: only load and store instructions access memory; all other
operations work register-to-register, so memory delays are confined to two
instruction types and the datapath stays simple.
(c) Hardwired control: control signals are generated by fixed combinational
logic instead of microcode, making instruction decode fast.
(d) Relatively few instructions and addressing modes: the decoder and datapath
are simpler, which shortens the critical path.
(e) Fixed instruction format: opcode and operand fields sit in fixed positions,
so decoding and register fetch can proceed in parallel.
(f) More compile-time effort: optimization and scheduling work is moved from
run-time hardware to the compiler, trading software effort for simpler,
faster hardware.

2. (1) Give an example of structural hazard.
(2) Identify all of the data dependencies in the following code. Show which
dependencies are data hazards and how they can be resolved via
forwarding?
add $2, $5, $4
add $4, $2, $5
sw $5, 100($2)
add $3, $2, $4
Answer:
(1) Suppose the datapath has a single shared memory for instructions and data,
and consider:
    lw $5, 100($2)
    add $2, $7, $4
    add $4, $2, $5
    sw $5, 100($2)
In clock cycle 4, instruction 1 (lw) is in its MEM stage while instruction 4 is in
its IF stage; both need the single memory in the same cycle, which is a
structural hazard.
(2) Number the instructions:
    1 add $2, $5, $4
    2 add $4, $2, $5
    3 sw $5, 100($2)
    4 add $3, $2, $4

Register   Data dependency        Data hazard
$2         (1,2) (1,3) (1,4)      (1,2) (1,3)
$4         (2,4)                  (2,4)
Take the pair (1, 2) for example. We don't need to wait for the first instruction
to complete before trying to resolve the data hazard. As soon as the ALU creates
the sum for the first instruction, we can supply it as an input for the second
instruction.







3. Explain (1) what is a precise interrupt? (2) what does RAID mean? (3) what does
TLB mean?
Answer:
(1) An interrupt or exception that is always associated with the correct instruction
in pipelined computers.
(2) An organization of disks that uses an array of small and inexpensive disks so
as to increase both performance and reliability.
(3) A cache that keeps track of recently used address mappings to avoid an access
to the page table.

4. Consider a 32-byte direct-mapped write-through cache with 8-byte blocks.
Assume the cache updates on write hits and ignores write misses. Complete the
table below for a sequence of memory references which occur from left to right.
(Redraw the table in your answer sheet)
address 00 16 48 08 56 16 08 56 32 00 60
read/write r r r r r r r w w r r
index 0 2
tag 0 0
hit/miss miss
Answer:
Suppose the address is 6 bits. A 32-byte direct-mapped cache with 8-byte blocks
has 4 blocks in the cache; block offset = 3 bits, [2:0]; index = 2 bits, [4:3];
tag = 6 − 3 − 2 = 1 bit, [5].
address           tag             index
decimal  binary   binary decimal  binary decimal
00       000000   0      0        00     0
16       010000   0      0        10     2
48       110000   1      1        10     2
08       001000   0      0        01     1
56       111000   1      1        11     3
16       010000   0      0        10     2
08       001000   0      0        01     1
56       111000   1      1        11     3
32       100000   1      1        00     0
00       000000   0      0        00     0
60       111100   1      1        11     3

address 00 16 48 08 56 16 08 56 32 00 60
read/write r r r r r r r w w r r
index 0 2 2 1 3 2 1 3 0 0 3
tag 0 0 1 0 1 0 0 1 1 0 1
hit/miss miss miss miss miss miss miss hit hit miss hit hit
Note: the cache updates on write hits and ignores write misses. The write to
address 32 (reference 9) is a write miss, so it does not replace the block at
index 0; the following read of address 00 (reference 10) therefore still hits.
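The whole trace can be replayed with a small model. The sketch below (hypothetical `simulate` helper) assumes the policies stated in the question: direct-mapped, write-through, update on write hits, ignore write misses (no allocation on a write miss):

```python
BLOCK, NBLOCKS = 8, 4          # 8-byte blocks, 32-byte cache -> 4 lines

def simulate(trace):
    tags = [None] * NBLOCKS    # tag stored per cache line
    results = []
    for addr, op in trace:
        index = (addr // BLOCK) % NBLOCKS
        tag = addr // (BLOCK * NBLOCKS)
        hit = tags[index] == tag
        results.append("hit" if hit else "miss")
        if not hit and op == "r":   # allocate on read miss only
            tags[index] = tag
    return results

trace = [(0, "r"), (16, "r"), (48, "r"), (8, "r"), (56, "r"), (16, "r"),
         (8, "r"), (56, "w"), (32, "w"), (0, "r"), (60, "r")]
print(simulate(trace))
```

Running it reproduces the hit/miss row of the table above.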

5. (1) List two Branch Prediction strategies and (2) compare their differences.
Answer:
(1) Static prediction and dynamic prediction.
(2) (a) Static prediction fixes the decision before run time (e.g., always predict
not taken, or predict by branch direction); it uses no run-time information.
Dynamic prediction uses run-time behavior (e.g., a branch history table) to
predict each branch.
(b) Dynamic prediction generally achieves higher accuracy, so the average
misprediction penalty paid is lower; static prediction mispredicts more often.
(c) Static prediction needs little or no extra hardware, while dynamic
prediction requires prediction tables and update logic.

6. Explain how the reference bit in a page table entry is used to implement an
approximation to the LRU replacement strategy.
Answer:
The operating system periodically clears the reference bits and later records them
so it can determine which pages were touched during a particular time period.
With this usage information, the operating system can select a page that is among

the least recently referenced.

7. Trace Booth's algorithm step by step for the multiplication of 2 × (-6).
Answer:
2(ten) × (-6(ten)) = 0010(two) × 1010(two) = 1111 0100(two) = -12(ten)

Iteration  Step                      Multiplicand  Product
0          Initial values            0010          0000 1010 0
1          00: no operation          0010          0000 1010 0
           Shift right product       0010          0000 0101 0
2          10: prod = prod - Mcand   0010          1110 0101 0
           Shift right product       0010          1111 0010 1
3          01: prod = prod + Mcand   0010          0001 0010 1
           Shift right product       0010          0000 1001 0
4          10: prod = prod - Mcand   0010          1110 1001 0
           Shift right product       0010          1111 0100 1
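The trace can be reproduced in software. This sketch of Booth's algorithm (hypothetical `booth_multiply` helper; 4-bit operands, a 9-bit product register whose extra low bit holds the previously examined multiplier bit) mirrors the steps above:

```python
def booth_multiply(mcand, mplier, bits=4):
    mask = (1 << bits) - 1
    width = 2 * bits + 1                     # product register width
    prod = (mplier & mask) << 1              # multiplier plus extra bit on the right
    for _ in range(bits):
        pair = prod & 0b11                   # current bit and the bit to its right
        if pair == 0b10:                     # 10: product -= multiplicand
            prod = (prod - ((mcand & mask) << (bits + 1))) & ((1 << width) - 1)
        elif pair == 0b01:                   # 01: product += multiplicand
            prod = (prod + ((mcand & mask) << (bits + 1))) & ((1 << width) - 1)
        sign = prod >> (width - 1)           # arithmetic shift right by 1
        prod = (prod >> 1) | (sign << (width - 1))
    result = prod >> 1                       # drop the extra bit
    if result & (1 << (2 * bits - 1)):       # interpret as signed
        result -= 1 << (2 * bits)
    return result

print(booth_multiply(2, -6))   # expect -12 (1111 0100 in 8 bits)
```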


8. What are the differences between Trap and Interrupt?
Answer:
An interrupt is caused by an event external to the processor (e.g., an I/O device
request) and arrives asynchronously with respect to the running program. A trap
is caused by an event internal to the processor (e.g., arithmetic overflow or a
system call) and occurs synchronously, as a direct result of executing an
instruction. The hardware response is similar, but a trap is attributable to a
specific instruction, while an interrupt is not.

92

1. A certain machine with a 10 ns (10 × 10^-9 s) clock period can perform jumps (1
cycle), branches (3 cycles), arithmetic instructions (2 cycles), multiply
instructions (5 cycles), and memory instructions (4 cycles). A certain program has
10% jumps, 10% branches, 50% arithmetic, 10% multiply, and 20% memory
instructions. Answer the following question. Show your derivation in sufficient
detail.
(1) What is the CPI of this program on this machine?
(2) If the program executes 10^9 instructions, what is its execution time?
(3) A 5-cycle multiply-add instruction is implemented that combines an
arithmetic and a multiply instruction. 50% of the multiplies can be turned into
multiply-adds. What is the new CPI?
(4) Following (3) above, if the clock period remains the same, what is the
program's new execution time?
Answer:
(1) CPI = 1 × 0.1 + 3 × 0.1 + 2 × 0.5 + 5 × 0.1 + 4 × 0.2 = 2.7
(2) Execution time = 10^9 × 2.7 × 10 ns = 27 s
(3) CPI = (1 × 0.1 + 3 × 0.1 + 2 × 0.45 + 5 × 0.05 + 4 × 0.2 + 5 × 0.05) / (0.1 +
0.1 + 0.45 + 0.05 + 0.2 + 0.05) = 2.6 / 0.95 = 2.74
(4) Execution time = 10^9 × 0.95 × 2.74 × 10 ns = 26.03 s

Note: after fusing, the instruction-count fractions no longer sum to 100%, so the
CPI in (3) must be renormalized by the new total instruction count (0.95 of the
original).
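The arithmetic in (1)-(4), including the renormalization by the new instruction count, can be checked with a few lines (hypothetical variable names):

```python
cycles = {"jump": 1, "branch": 3, "arith": 2, "mult": 5, "mem": 4, "madd": 5}
old_mix = {"jump": 0.1, "branch": 0.1, "arith": 0.5, "mult": 0.1, "mem": 0.2}

cpi = sum(cycles[k] * f for k, f in old_mix.items())              # (1) 2.7
exec_time = 1e9 * cpi * 10e-9                                     # (2) 27 s

# (3) half the multiplies fuse with an arithmetic instruction into a madd
new_mix = {"jump": 0.1, "branch": 0.1, "arith": 0.45,
           "mult": 0.05, "mem": 0.2, "madd": 0.05}
total = sum(new_mix.values())                                     # 0.95 of old IC
new_cpi = sum(cycles[k] * f for k, f in new_mix.items()) / total  # (3) ~2.74
new_time = 1e9 * total * new_cpi * 10e-9                          # (4) ~26 s
print(cpi, exec_time, round(new_cpi, 2), round(new_time, 2))
```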

2. Answer True (O) or False (X) for each of the following. (No penalty for wrong
answers.)
(1) Most computers use direct mapped page tables.
(2) Increasing the block size of a cache is likely to take advantage of temporal
locality.
(3) Increasing the page size tends to decrease the size of the page table.
(4) Virtual memory typically uses a write-back strategy, rather than a
write-through strategy.
(5) If the cycle time and the CPI both increase by 10% and the number of
instructions decreases by 20%, then the execution time will remain the same.
(6) A page fault occurs when the page table entry cannot be found in the
translation lookaside buffer.
(7) To store a given amount of data, direct mapped caches are typically smaller
than either set associative or fully associative caches, assuming that the
block size for each cache is the same.
(8) The two's complement of a negative number is always a positive number in
the same number format.

(9) A RISC computer will typically require more instructions than a CISC
computer to implement a given program.
(10) Pentium 4 is based on the RISC architecture.
Answer:
(1) (2) (3) (4) (5) (6) (7) (8) (9) (10)
 X   X   O   O   X   X   O   X   O   X
Note: modern CPUs like the Athlon XP and Pentium 4 are based on a mixture of
RISC and CISC.

3. The average memory access time (AMAT) is defined as
AMAT = hit time + miss_rate miss_penalty
Answer the following two questions. Show your derivation in sufficient detail.
(1) Find the AMAT of a 100MHz machine, with a miss penalty of 20 cycles, a hit
time of 2 cycles, and a miss rate of 5%.
(2) Suppose doubling the size of the cache decreases the miss rate to 3%, but
causes the hit time to increase to 3 cycles and the miss penalty to increase to 21
cycles. What is the AMAT of the new machine?
Answer:
(1) AMAT = (2 + 0.05 × 20) × 10 ns = 30 ns
(2) AMAT = (3 + 0.03 × 21) × 10 ns = 36.3 ns

4. If a pipelined processor has 5 stages and takes 100 ns to execute N instructions,
how long will it take to execute 2N instructions, assuming the clock rate is 500
MHz and no pipeline stalls occur?
Answer:
Clock cycle time = 1 / (500 × 10^6) = 2 ns; N + 4 = 100 / 2, so N = 46.
The execution time of 2N instructions = 2 × 46 + 4 = 96 clock cycles = 192 ns




96

1. Answer the following questions briefly:
(a) Typically one CISC instruction, since it is more complex, takes more time to
complete than a RISC instruction. Assume that an application needs N CISC
instructions and 2N RISC instructions, and that one CISC instruction takes an
average of 5T ns to complete, and one RISC instruction takes 2T ns. Which
processor has the better performance?
(b) Which of the following processors have a CISC instruction set architecture?
ARM AMD Opteron
Alpha 21164 IBM PowerPC
Intel 80x86 MIPS
Sun UltraSPARC
(c) True & False questions;
(1) There are four types of data hazards; RAR, RAW, WAR, and WAW.
(True or False?)
(2) AMD and Intel recently added 64-bit capability to their processors
because most programs run much faster with 64-bit instructions. (True or
False?)
(3) With a modern processor capable of dynamic instruction scheduling and
out-of-order execution, it is better that the compiler does not optimize
the instruction sequences. (True or False?)
Answer:
(a) CISC time = N × 5T = 5NT ns
RISC time = 2N × 2T = 4NT ns
RISC time < CISC time, so the RISC architecture has better performance.
(b) Intel 80x86, AMD Opteron
(c) (1) False, RAR does not cause data hazard.
(2) False; the benefit comes from 64-bit processors (notably the larger
address space), not from faster "64-bit instructions".
(3) False, the compiler still tries to help improve the issue rate by placing the
instructions in a beneficial order.

2. For commercial applications, it is important to keep data on-line and safe in
multiple places.
(a) Suppose we want to backup 100GB of data over the network. How many
hours does it take to send the data by FTP over the Internet? Assume the
average bandwidth between the two places is 1Mbits/sec.


(b) Would it be better if you burn the data onto DVDs and mail the DVDs to the
other site? Suppose it takes 10 minutes to burn a DVD which has 4GB
capacity and the fast delivery service can deliver in 12 hours.
Answer:
(a) (100 GB × 8 bits/byte) / (1 Mbit/s) = 800 × 1024 s = 819,200 s ≈ 227.56 hours
(b) (100 GB / 4 GB) × 10 minutes = 250 minutes ≈ 4.17 hours
4.17 + 12 = 16.17 hours < 227.56 hours
So it is better to burn the data onto DVDs and mail them to the other site.

3. Suppose we have an application running on a shared-memory multiprocessor.
With one processor, the application runs for 30 minutes.
(a) Suppose the processor clock rate is 2GHz. The average CPI (assuming that all
references hit in the cache) on a single processor is 0.5. How many instructions
are executed in the application?
(b) Suppose we want to reduce the run time of the application to 5 minutes with 8
processors. Let's optimistically assume that parallelization adds zero overhead
to the application, i.e. no extra instructions, no extra cache misses, no
communications, etc. What fraction of the application must be executed in
parallel?
(c) Suppose 100% of our application can be executed in parallel. Let's now
consider the communication overhead. Assume the multiprocessor has a 200
ns time to handle reference to a remote memory and processors are stalled on
a remote request. For this application, assume 0.02% of the instructions
involve a remote communication reference, no matter how many processors
are used. How many processors are needed at least to make the run time be
less than 5 minutes?
(d) Following the above question, but let's assume the remote communication
references in the application increases as the number of processors increases.
With N processors, 0.02 × (N - 1)% of the instructions involve a remote
communication reference. How many processors will deliver the maximum
speedup?
Answer:
(a) 30 × 60 s = Instruction count × 0.5 × 0.5 ns
Instruction count = 1800 s / 0.25 ns = 7200 × 10^9
(b) Let F be the fraction of the application that must be executed in parallel.
30 / 5 = 1 / ((1 - F) + F/8)
Solving gives F = 20/21 = 0.952
(c) Let N be the number of processors that makes the run time < 5 minutes:
(30 × 60)/N + 7200 × 10^9 × 0.0002 × 200 ns < 5 × 60, i.e. 1800/N + 288 < 300, so N > 150
So at least 150 processors are needed to make the run time < 5 minutes.
(d) Run time T(N) = (30 × 60)/N + 7200 × 10^9 × 0.0002 × (N - 1) × 200 ns
= 1800/N + 288(N - 1) seconds
Setting the derivative to zero: -1800/N^2 + 288 = 0, so N = 2.5
2.5 processors will deliver the maximum speedup (in practice 2 or 3; T(3) =
1176 s edges out T(2) = 1188 s).
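A quick numeric sweep (assuming the run-time model T(N) = 1800/N + 288(N - 1) seconds; `run_time` is a hypothetical helper) confirms the minimum at N = 2.5:

```python
def run_time(n):                      # seconds, for N = n processors
    return 1800 / n + 288 * (n - 1)

candidates = [n / 10 for n in range(10, 101)]   # 1.0 .. 10.0 processors
best = min(candidates, key=run_time)
print(best, run_time(best))           # 2.5, 1152.0
```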

4. Number representation.
(a) What range of integer number can be represented by 16-bit 2's complement
number?
(b) Perform the following 8-bit 2's complement number operation and check
whether arithmetic overflow occurs. Check your answer by converting to
decimal sign-and-magnitude representation.
11010011 - 11101100
Answer:
(a) -2^15 ~ +(2^15 - 1)
(b) 11010011 - 11101100 = 11010011 + 00010100 = 11100111
Check: -45 - (-20) = -45 + 20 = -25
The range for 8-bit 2's complement numbers is -2^7 ~ +(2^7 - 1), so no overflow.
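Part (b) can be checked bit by bit. The sketch below (hypothetical `sub8` helper) wraps the result to 8 bits and tests the overflow condition directly:

```python
def sub8(a, b):
    """Subtract two 8-bit two's-complement values given as bit strings."""
    to_signed = lambda v: v - 256 if v & 0x80 else v
    x, y = int(a, 2), int(b, 2)
    diff = (x - y) & 0xFF                       # wrap to 8 bits
    signed = to_signed(diff)
    # overflow iff the true result falls outside the 8-bit range [-128, 127]
    overflow = not (-128 <= to_signed(x) - to_signed(y) <= 127)
    return format(diff, "08b"), signed, overflow

print(sub8("11010011", "11101100"))   # ('11100111', -25, False)
```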

5. Bus
(a) Draw a graph to show the memory hierarchy of a system that consists of CPU,
Cache, Memory and I/O devices. Mark where memory bus and I/O bus is.
(b) Assume system 1 has a synchronous 32-bit bus with clock rate = 33 MHz
running at 2.5V. System 2 has a 64-bit bus with clock rate = 66 MHz running
at 1.8V. Assume the average capacitance on each bus line is 2pF for the bus in
system 1. What is the maximum average capacitance allowed for the bus of
system 2 so the peak power dissipation of system 2 bus will not exceed that of
the system 1 bus?
(c) Serial bus protocol such as SATA has gained popularity in recent years. To
design a serial bus that supports the same peak throughput as the bus in
system 2, what is the clock frequency of this serial bus?

Answer:
(a) (Figure: CPU with its cache connected over the memory bus to main memory;
I/O devices attach through an I/O bus, bridged to the memory bus.)
(b) Power dissipation = f × C × V^2
The peak power dissipation for system 1 = 33 × 10^6 × (2 × 10^-12 × 32) × 2.5^2 = 13.2 mW
Let C be the total capacitance of the system 2 bus:
66 × 10^6 × C × 1.8^2 < 13.2 mW, so C < 61.73 pF
(Note: 61.73 pF is the total over all 64 lines of system 2; the average
capacitance per bus line is 61.73 / 64 ≈ 0.96 pF.)
(c) Since SATA uses a single signal path to transmit data serially (bit by bit),
the clock frequency should be designed as 66 MHz × 64 = 4.224 GHz to support
the same peak throughput as the system 2 bus.
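The power numbers in (b) can be verified directly (assumption: peak dynamic power modeled as f × C_total × V^2 with every line switching, as above; variable names are illustrative):

```python
f1, c_per_line1, lines1, v1 = 33e6, 2e-12, 32, 2.5
p1 = f1 * (c_per_line1 * lines1) * v1 ** 2     # system 1 peak power: 13.2 mW

f2, v2 = 66e6, 1.8
c2_max = p1 / (f2 * v2 ** 2)                   # system 2 max total capacitance
print(p1 * 1e3, c2_max * 1e12)                 # in mW and pF
# per-line average for the 64-bit bus: c2_max / 64 (~0.96 pF)
```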

95

PART I:
Please answer the following questions in the format listed below. If you do not follow
the format, you will get zero points for these questions.
1. (1) T or F
(2) T or F
(3) T or F
(4) T or F
(5) T or F
2. X = Y =
Stall cycles =
3. Option is times faster than the old machine
4. 1-bit predictor: 2-bit predictor:
1. True & False Questions
(1) If an address translation for a virtual page is present in the TLB, then that
virtual page must be mapped to a physical memory page.
(2) The set index decreases in size as cache associativity is increased (assume
cache size and block size remain the same)
(3) It is impossible to have a TLB hit and a data cache miss for the same data
reference.
(4) An instruction takes less time to execute on a pipelined processor than on a
nonpipelined processor (all other aspects of the processors being the same).
(5) A multi-cycle implementation of the MIPS processor requires that a single
memory be used for both instructions and data.
Answer:
(1) T
(2) T
(3) F
(4) F
(5) T

2. Consider the following program:
int A[100]; /* size(int) = 1 word */
for (i = 0; i < 100; i++)
A[i] = A[i] + 1;
The code for this program on a MIPS-like load/store architecture looks as
follows:
ADDI R1, R0, #X
ADDI R2, R0, A ; A is the base address of array A
LOOP: LD R3, 0(R2)

ADDI R3, R3, #1
SD R3, 0(R2)
ADDI R2, R2, #Y
SUBI R1, R1, #1
BNE R1, R0, LOOP
Consider a standard 5-stage MIPS pipeline. Assume that the branch is resolved
during the instruction decode stage, and full bypassing/register forwarding are
implemented. Assume that all memory references hit in the cache and TLBs. The
pipeline does not implement any branch prediction mechanism. What are values
of #X and #Y, and how many stall cycles are in one loop iteration including stalls
caused by the branch instruction?
Answer:
X = 100
Y = 4
Stall cycles = 3: (1) one between LD and ADDI (load-use), (2) one between SUBI
and BNE, (3) one after BNE.
Since the branch decision is resolved during the ID stage and SUBI produces R1
at the end of EX, a stall cycle is needed between SUBI and BNE; with no branch
prediction, one more cycle is lost after BNE.

3. Suppose you had a computer that, on average, exhibited the following properties
on the programs that you run:
Instruction miss rate: 2%
Data miss rate: 4%
Percentage of memory instructions: 30%
Miss penalty: 100 cycles
There is no penalty for a cache hit (i.e. the cache can supply the data as fast as the
processor can consume it.) You want to update the computer, and your budget will
allow one of the following:
Option #1: Get a new processor that is twice as fast as your current
computer. The new processor's cache is twice as fast too, so it
can keep up with the processor.
Option #2: Get a new memory that is twice as fast.
Which is a better choice? And what is the speedup of the chosen design compared
to the old machine?
Answer:
Option 2 is 4.2/2.6 = 1.62 times faster than the old machine.
Suppose that the base CPI = 1:
CPI_old  = 1 + 0.02 × 100 + 0.04 × 0.3 × 100 = 4.2
CPI_opt1 = 0.5 + 0.02 × 100 + 0.04 × 0.3 × 100 = 3.7
CPI_opt2 = 1 + 0.02 × 50 + 0.04 × 0.3 × 50 = 2.6


Note on option #1: the faster processor and cache halve the cycle time, but the
memory stall time in ns is unchanged. Measured in the old cycle time, the base
CPI effectively becomes 0.5 while the miss penalty stays at 100 (old) cycles, so
the CPI values above compare execution times directly.
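Expressing both options in old clock cycles, the comparison reduces to a few lines (hypothetical variable names):

```python
miss_i, miss_d, mem_frac, penalty = 0.02, 0.04, 0.3, 100

cpi_old  = 1.0 + miss_i * penalty + mem_frac * miss_d * penalty           # 4.2
cpi_opt1 = 0.5 + miss_i * penalty + mem_frac * miss_d * penalty           # 3.7
cpi_opt2 = 1.0 + miss_i * penalty / 2 + mem_frac * miss_d * penalty / 2   # 2.6
print(cpi_old, cpi_opt1, cpi_opt2, round(cpi_old / cpi_opt2, 2))          # 1.62
```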

4. The following series of branch outcomes occurs for a single branch in a program.
(T means the branch is taken, N means the branch is not taken).
TTTNNTTT
How many instances of this branch instruction are mis-predicted with a 1-bit and
2-bit local branch predictor, respectively? Assume that the BHT are initialized to
the N state. You may assume that this is the only branch in the program.
Answer:
1-bit predictor: 3 2-bit predictor: 5
Note: depending on the exact 2-bit FSM used, the 2-bit predictor mispredicts
either 5 or 6 times.
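Both predictors are easy to simulate. The sketch below uses a hypothetical saturating-counter FSM initialized to strongly-not-taken, one of the 2-bit variants that yields 5 mispredictions:

```python
outcomes = "TTTNNTTT"

def mispredicts_1bit(seq, state="N"):
    wrong = 0
    for o in seq:
        if o != state:
            wrong += 1
            state = o                     # 1-bit: remember the last outcome
    return wrong

def mispredicts_2bit(seq, counter=0):     # 0..1 predict N, 2..3 predict T
    wrong = 0
    for o in seq:
        pred = "T" if counter >= 2 else "N"
        if o != pred:
            wrong += 1
        counter = min(counter + 1, 3) if o == "T" else max(counter - 1, 0)
    return wrong

print(mispredicts_1bit(outcomes), mispredicts_2bit(outcomes))   # 3 5
```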

PART II:
For the following questions in Part II, please make sure that you summarize all your
answers in the format listed below. The answers are short, such as letters, numbers,
or yes/no. You do not have to show your calculations. There is no partial credit for
incorrect answers.

(5a) (5b)
(6a) (6b) (6c)
(7a) (7b) (7c)
(8a) (8b) (8c) (8d) (8e)
(9a) (9b) (9c) (9d) (9e)
5. Consider the following performance measurements for a program:
Measurement Computer A Computer B Computer C
Instruction Count 12 billion 12 billion 10 billion
Clock Rate 4 Ghz 3 Ghz 2.8 Ghz
Cycles Per Instruction 2 1.5 1.4
(5a) Which computer is faster?
(5b) Which computer has the higher MIPS rating?
Answer:
(5a) Computer C

Execution time for computer A = (12 × 10^9 × 2) / (4 × 10^9) = 6 s
Execution time for computer B = (12 × 10^9 × 1.5) / (3 × 10^9) = 6 s
Execution time for computer C = (10 × 10^9 × 1.4) / (2.8 × 10^9) = 5 s
(5b) The MIPS rates for all computers are the same.
MIPS for computer A = (4 × 10^9) / (2 × 10^6) = 2000
MIPS for computer B = (3 × 10^9) / (1.5 × 10^6) = 2000
MIPS for computer C = (2.8 × 10^9) / (1.4 × 10^6) = 2000
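The formulas execution time = IC × CPI / clock rate and MIPS = clock rate / (CPI × 10^6) can be evaluated directly (hypothetical helper names):

```python
machines = {"A": (12e9, 4e9, 2.0), "B": (12e9, 3e9, 1.5), "C": (10e9, 2.8e9, 1.4)}

def exec_time(ic, clock, cpi):
    return ic * cpi / clock          # seconds

def mips_rating(clock, cpi):
    return clock / (cpi * 1e6)

for name, (ic, clock, cpi) in machines.items():
    print(name, exec_time(ic, clock, cpi), mips_rating(clock, cpi))
# A and B take 6 s, C takes 5 s; all three rate 2000 MIPS
```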

6. Consider the following two components in a computer system:
A CPU that sustains 2 billion instructions per second.
A memory backplane bus capable of sustaining a transfer rate of 1000
MB/sec
If the workload consists of 64 KB reads from the disk, and each read operation
takes 200,000 user instructions and 100,000 OS instructions.
(6a) Calculate the maximum I/O rate of CPU.
(6b) Calculate the maximum I/O rate of memory bus.
(6c) Which of the two components is likely to be the bottleneck for I/O?
Answer:
(6a) 6667
(6b) 15625
(6c) CPU
The maximum I/O rate of the CPU = (2 × 10^9) / (200,000 + 100,000) ≈ 6667 reads/s
The maximum I/O rate of the memory bus = (1000 × 10^6) / (64 × 10^3) = 15625 reads/s


7. You are going to enhance a computer, and there are two possible improvements:
either make multiply instructions run four times faster than before, or make
memory access instructions run two times faster than before. You repeatedly run
a program that takes 100 seconds to execute. Of this time, 20% is used for
multiplication, 50% for memory access instructions, and 30% for other tasks.
Calculate the speedup:
(7a) Speedup if we improve only multiplication:
(7b) Speedup if we only improve memory access:
(7c) Speedup if both improvements are made:
Answer:

(7a) Speedup = 1 / (0.2/4 + 0.8) = 1.18
(7b) Speedup = 1 / (0.5/2 + 0.5) = 1.33
(7c) Speedup = 1 / (0.2/4 + 0.5/2 + 0.3) = 1.67


8. Multiprocessor designs have become popular for today's desktop and mobile
computing. Given a 2-way symmetric multiprocessor (SMP) system where both
processors use write-back caches, write update cache coherency, and a block size
of one 32-bit word. Let us examine the cache coherence traffic with the following
sequence of activities involving shared data. Assume that all the words already
exist in both caches and are clean. Fill-in the last column (8a)-(8e) in the table to
identify the coherence transactions that should occur on the bus for the sequence.
Step Processor Memory activity Memory address
Transaction
required
(Yes or No)
1 Processor 1 1-word write 100 (8a)
2 Processor 2 1-word write 104 (8b)
3 Processor 1 1-word read 100 (8c)
4 Processor 2 1-word read 104 (8d)
5 Processor 1 1-word read 104 (8e)
Answer:



(8a) Yes
(8b) Yes
(8c) No
(8d) No
(8e) No

9. False sharing can lead to unnecessary bus traffic and delays. Follow the direction
of Question 8, except change the cache coherency policy to write-invalidate and
block size to four words (128-bit). Reveal the coherence transactions on the bus
by filling-in the last column (9a)-(9e) in the table below.

Step Processor Memory activity Memory address
Transaction
required
(Yes or No)
1 Processor 1 1-word write 100 (9a)
2 Processor 2 1-word write 104 (9b)
3 Processor 1 1-word read 100 (9c)
4 Processor 2 1-word read 104 (9d)
5 Processor 1 1-word read 104 (9e)
Answer:
(9a) Yes
(9b) Yes
(9c) Yes
(9d) No
(9e) No

Note: depending on the details of the snoopy protocol, (9d) could instead be Yes
(e.g., if processor 2's copy was invalidated when processor 1 read the block in
step 3).


94

1. Suppose we have a 32 bit MIPS-like RISC processor with the following
arithmetic and logical instructions (along with their descriptions):
Addition
add rd, rs, rt Put the sum of registers rs and rt into register rd.
Addition immediate
addi rt, rs, imm Put the sum of register rs and the sign-extended
immediate into register rt.
Subtract
sub rd, rs, rt Register rt is subtracted from register rs and the result is
put in register rd.
AND
and rd, rs, rt Put the logical AND of register rs and rt into register rd.
AND immediate
andi rt, rs, imm Put the logical AND of register rs and the zero-extended
immediate into register rt.
Shift left logical
sll rd, rt, imm Shift the value in register rt left by the distance (i.e. the
number of bits) indicated by the immediate (imm) and
put the result in register rd. The vacated bits are filled
with zeros.
Shift right logical
srl rd, rt, imm Shift the value in register rt right by the distance (i.e. the
number of bits) indicated by the immediate (imm) and
put the result in register rd. The vacated bits are filled
with zeros.
Please use at most one instruction to generate assembly code for each of the
following C statements (assuming variable a and b are unsigned integers). You
can use the variable names as the register names in your assembly code.
(a) b = a / 8; /* division operation */
(b) b = a % 16; /* modulus operation */
Answer:
(a) srl b, a, 3
(b) and b, a, 15

Note: for unsigned a, a % 16 keeps only the low 4 bits (AND with 15 = 1111two),
and a / 8 discards the low 3 bits (logical shift right by 3). For example, if
a = 10010011, then a % 16 = 00000011 and a / 8 = 00010010.
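The two identities behind the answer can be verified exhaustively over all 8-bit unsigned values:

```python
for a in range(256):
    assert a // 8 == a >> 3        # division by 8 == logical shift right by 3
    assert a % 16 == a & 15        # mod 16 == keep the low 4 bits
print("identities hold for all 8-bit values")
```

They hold for any unsigned width because 8 and 16 are powers of two; for signed values an arithmetic shift would behave differently, which is why the question stipulates unsigned integers.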

2. Assume a RISC processor has a five-stage pipeline (as shown below) with each
stage taking one clock cycle to finish. The pipeline will stall when encountering
data hazards.
IF ID EXE MEM WB
IF: Instruction fetch
ID: Instruction decode and register file read
EXE: Execution or address calculation
MEM: Data memory access
WB: Write back to register file
(a) Suppose we have an add instruction followed immediately by a subtract
instruction that uses the add instruction's result:
add r1, r2, r3
sub r5, r1, r4
If there is no forwarding in the pipeline, how many cycle(s) will the pipeline
stall for?
(b) If we want to use forwarding (or bypassing) to avoid the pipeline stall caused
by the code sequence above, choosing from the denoted 6 points (A to F) in
the following simplified data path of the pipeline, where (from which point to
which point) should the forwarding path be connected?



(c) Suppose the first instruction of the above code sequence is a load of r1 instead
of an add (as shown below).
load r1, [r2]
sub r5, r1, r4
Assuming we have a forwarding path from point E to point C in the pipeline
data path, will there be any pipeline stall for this code sequence? If so, how
many cycle(s)? (If your first answer is yes, you have to answer the second
question correctly to get the 5 pts credit.)
Answer:
(a) If the register file can be written in the first half of a clock cycle and read in
the second half, the pipeline stalls for 2 cycles; otherwise it stalls for 3 clock
cycles.
(b) D to C
(c) Yes, 1 clock cycle

3. Cache misses are classified into three categories: compulsory, capacity, and
conflict. What types of misses could be reduced if the cache block size is
increased?

Answer: compulsory misses (larger blocks exploit spatial locality, so fewer
first-reference misses occur).

(Figure for question 2(b): simplified pipeline datapath with stages IF, ID, EXE,
MEM, WB and labeled tap points A-F; consistent with the answers, C is the ALU
input, D the output latch of the EXE stage, and E the output latch of the MEM
stage.)

4. Consider three types of methods for transferring data between an I/O device and
memory: polling, interrupt driven, and DMA. Rank the three techniques in terms
of lowest impact on processor utilization
Answer: (1) DMA, (2) Interrupt driven, (3) Polling

5. Assume an instruction set that contains 5 types of instructions: load, store,
R-format, branch and jump. Execution of these instructions can be broken into 5
steps: instruction fetch, register read, ALU operations, data access, and register
write. Table 1 lists the latency of each step assuming perfect caches.
Instruction  Instruction  Register  ALU        Data    Register
class        fetch        read      operation  access  write
Load         2 ns         1 ns      1 ns       2 ns    1 ns
Store        2 ns         1 ns      1 ns       2 ns
R-format     2 ns         1 ns      1 ns               1 ns
Branch       2 ns         1 ns      1 ns
Jump         2 ns
Table 1
(a) What is the CPU cycle time assuming a multicycle CPU implementation (i.e.,
each step in Table 1 takes one cycle)?
(b) Assuming the instruction mix shown below, what is the average CPI of the
multicycle processor without pipelining? Assume that the I-cache and
D-cache miss rates are 3% and 10%, and the cache miss penalty is 12 CPU
cycles
Instruction Type Frequency
Load 40%
Store 30%
R-format 15%
Branch 10%
Jump 5%
(c) To reduce the cache miss rate, the architecture team is considering increasing
the data cache size. They find that by doubling the data cache size, they can
eliminate half of data cache misses. However, the data access stage now takes
4 ns. Do you suggest them to double the data cache size? Explain your
answer.
Answer:
(a) In a multicycle implementation the cycle time is set by the slowest step, so
the CPU cycle time = 2 ns.
(b) CPI without considering cache misses = 5 × 0.4 + 4 × 0.3 + 4 × 0.15 + 3 × 0.1
+ 1 × 0.05 = 4.15
Average CPI = 4.15 + 0.03 × 12 + (0.3 + 0.4) × 0.1 × 12 = 5.35

(c) With the doubled data cache the cycle time becomes 4 ns, so the 24 ns miss
penalty is now 6 cycles.
CPI after doubling the data cache = 4.15 + 0.03 × 6 + (0.3 + 0.4) × 0.05 × 6 = 4.54
Average instruction execution time before doubling the data cache = 5.35 × 2 ns =
10.7 ns
Average instruction execution time after doubling the data cache = 4.54 × 4 ns =
18.16 ns
Doubling the data cache makes the average instruction time longer, so we do not
suggest doubling the data cache size.
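The CPI arithmetic in (b) and (c) can be rechecked in a few lines (hypothetical names):

```python
mix = {"load": (5, 0.40), "store": (4, 0.30), "r_format": (4, 0.15),
       "branch": (3, 0.10), "jump": (1, 0.05)}
base_cpi = sum(c * f for c, f in mix.values())               # 4.15

mem_frac = 0.40 + 0.30                  # fraction of loads + stores
cpi_b = base_cpi + 0.03 * 12 + mem_frac * 0.10 * 12          # (b) 5.35
cpi_c = base_cpi + 0.03 * 6 + mem_frac * 0.05 * 6            # (c) 4.54
print(cpi_b * 2, cpi_c * 4)   # average instruction time: 10.7 ns vs 18.16 ns
```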

93

1. Consider a system with an average memory access time of 50 nanoseconds, a
three level page table (meta-directory, directory, and page table). For full credit,
your answer must be a single number and not a formula.
(a) If the system had an average page fault rate of 0.01% for any page accessed
(data or page table related), and an average page fault took 1 millisecond to
service, what is the effective memory access time (assume no TLB or memory
cache)?
(b) Now assume the system has no page faults, we are considering adding a TLB
that will take 1 nanosecond to lookup an address translation. What hit rate in
the TLB is required to reduce the effective access time to memory by a factor
of 2.5?
Answer:
(a) page fault effective memory access time = 4 50
= 200 ns (meta-directorydirectorypage table
data access)page fault rate = 0.01%effective memory access time =
200 + 4 0.01% 1000000ns = 600 ns
(b) (200 / 2.5) = 50ns + 1ns + 150ns (1 H) H = 0.81
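Both parts reduce to one-line formulas; this sketch assumes the 3-level walk plus one data access described above (illustrative variable names):

```python
mem_ns, levels = 50, 3
base = (levels + 1) * mem_ns                        # 200 ns with no faults

# (a) each of the 4 accesses faults with p = 0.01%, service time = 1 ms
eff_a = base + (levels + 1) * 0.0001 * 1_000_000    # 600 ns

# (b) 1 ns TLB; a hit skips the 150 ns walk; target is 200/2.5 = 80 ns
# 80 = 1 + 50 + (1 - H) * 150, solved for H
h = 1 - (80 - 1 - mem_ns) / (levels * mem_ns)
print(eff_a, round(h, 2))
```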

2. In this problem set, show your answers in the following format:
<a> ? CPU cycles
Derive your answer.
<b> CPI = ?
Derive your answer.
<c> Machine ? is ?% faster than ?
Derive your answer.
<d> ? CPU cycles
Derive your answer.
Both machine A and B contain one-level on-chip caches. The CPU clock rates
and cache configurations for these two machines are shown in Table 1. The
respective instruction/data cache miss rates in executing program P are also
shown in Table 1. The frequency of load/store instructions in program P is 20%.
On a cache miss, the CPU stalls until the whole cache block is fetched from the
main memory. The memory and bus system have the following characteristics:
1. the bus and memory support 16-byte block transfer;
2. 32-bit synchronous bus clocked at 200 MHz, with each 32-bit transfer taking
1 bus clock cycle, and 1 bus clock cycle required to send an address to
memory (assuming shared address and data lines);
3. assuming there is no cycle needed between each bus operation;
4. a memory access time for the first 4 words (16 bytes) is 250 ns, each
additional set of four words can be read in 25 ns. Assume that a bus transfer

of the most recently read data and a read of the next four words can be
overlapped.
Machine A Machine B
CPU clock rate 800 MHz 400 MHz
I-cache
configuration
Direct-mapped,
32-byte block, 8K
2-way, 32-byte block,
128K
D-cache
configuration
2-way, 32-byte block,
16K
4-way, 32-byte block,
256K
I-cache miss rate 6% 1%
D-cache miss rate 15% 4%
Table 1
To answer the following questions, you don't need to consider the time required
for writing data to the main memory:
(1) What is the data cache miss penalty (in CPU cycles) for machine A?
(2) What is the average CPI (Cycle per Instruction) for machine A in executing
program P? The CPI (Cycle per Instruction) is 1 without cache misses.
(3) Which machine is faster in executing program P and by how much? The CPI
(Cycle per Instruction) is 1 without cache misses for both machine A and B.
(4) What is the data cache miss penalty (in CPU cycles) for machine A if the bus
and memory system support 32-byte block transfer? All the other memory/bus
parameters remain the same as defined above.
Answer:
(a) 440 CPU cycles
Since the bus clock rate is 200 MHz, one bus clock = 5 ns.
Time to transfer one 32-byte block from memory to cache = 2 × (1 + 250/5 + 1 ×
4) × 5 ns = 550 ns (address cycle + 250 ns access + four 32-bit transfers, done
twice for the two 16-byte halves).
The data miss penalty for machine A = 550 ns / (1 / 800 MHz) = 440 CPU cycles
(b) CPI = 40.6
Average CPI = 1 + 0.06 × 440 + 0.2 × 0.15 × 440 = 40.6
(c) Machine B is 4.09 times as fast as machine A.
Machines A and B have the same 32-byte block size and memory system, so the
miss penalty is 550 ns for both; at machine B's 400 MHz clock this is 220 clock
cycles.
Machine B average CPI = 1 + 0.01 × 220 + 0.2 × 0.04 × 220 = 4.96
Execution time for machine A = 40.6 × 1.25 ns × IC = 50.75 × IC
Execution time for machine B = 4.96 × 2.5 ns × IC = 12.4 × IC
Machine B is 50.75 / 12.4 = 4.09 times as fast as machine A.
(d) 240 CPU cycles
Time to transfer one block = (1 + 250/5 + 25/5 + 4) × 5 ns = 300 ns
The data miss penalty for machine A = 300 ns / (1 / 800 MHz) = 240 CPU cycles

3. Given the bit pattern 10010011, what does it represent assuming
(a) it's a two's complement integer?
(b) it's an unsigned integer?
Write down your answer in decimal format.
Answer:
(a) -2^7 + 2^4 + 2^1 + 2^0 = -109
(b) 2^7 + 2^4 + 2^1 + 2^0 = 147
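The two interpretations differ only in how the sign bit is weighted:

```python
bits = "10010011"
unsigned = int(bits, 2)                                   # 147
signed = unsigned - 256 if bits[0] == "1" else unsigned   # -109
print(signed, unsigned)
```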

4. Draw the schematic for a 4-bit 2's complement adder/subtractor that produces
A + B if K = 1, and A - B if K = 0. In your design try to use the minimum number
of the following basic logic gates (1-bit adders, AND, OR, INV, and XOR).
Answer:
K = 0: S = A + (B XOR 1) + 1 = A + (NOT B) + 1 = A - B
K = 1: S = A + (B XOR 0) + 0 = A + B
So each bit of B is XORed with NOT K, and NOT K is also fed in as the carry-in c0.
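The XOR trick in the equations above can be modeled at the word level (hypothetical `add_sub` helper; K = 1 adds, K = 0 subtracts, result taken mod 2^4):

```python
def add_sub(a, b, k, bits=4):
    """K = 1: A + B;  K = 0: A - B (two's complement, mod 2**bits)."""
    notk = 1 - k
    mask = (1 << bits) - 1
    b_eff = (b ^ mask) if notk else b    # XOR each bit of B with NOT K
    return (a + b_eff + notk) & mask     # NOT K is also the carry-in c0

print(add_sub(0b0101, 0b0011, 1))   # 5 + 3 = 8
print(add_sub(0b0101, 0b0011, 0))   # 5 - 3 = 2
```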










5. We want to add four 4-bit numbers, A[3:0], B[3:0], C[3:0], D[3:0], together
using carry-save addition. Draw the schematic using 1-bit full adders.
Answer:

(Schematic: a first level of four 1-bit full adders combines A, B, and C in
carry-save form, producing sum and carry vectors; a second level adds D; a final
adder combines the remaining sum and carry vectors into the result.)

6. We have an 8-bit carry-ripple adder that is too slow. We want to speed it up by
adding one pipeline stage. Draw the schematic of the resulting pipelined adder.
How many 1-bit pipeline registers do you need? Assuming the delay of a 1-bit
adder is 1 ns, what's the maximum clock frequency the resulting pipelined adder
can operate at?
Answer:
(1) Schematic: split the 8-bit ripple chain after bit 3. Stage 1 computes s0-s3
and c4; pipeline registers capture s0-s3, c4, and the not-yet-used inputs a4-a7
and b4-b7; stage 2 computes s4-s7 and c8.
(2) 13 1-bit pipeline registers (4 sum bits + 1 carry bit + 8 operand bits).
(3) Each stage contains four 1-bit adders, so the stage delay is 4 ns and the
maximum clock frequency is 1/4 ns = 250 MHz.




92

1. A pipelined processor architecture consists of 5 pipeline stages: instruction fetch
(IF), instruction decode and register read (ID), execution or address calculation
(EX), data memory access (MEM), and register write back (WB). The delay of
each stage is summarized below: IF = 2 ns, ID = 1.5 ns, EX = 4 ns, MEM = 2.5 ns,
WB = 2 ns.
(1) What's the maximum attainable clock rate of this processor?
(2) What kind of instruction sequence will cause a data hazard that cannot be
resolved by forwarding? What's the performance penalty?
(3) To improve the clock rate of this processor, the architect decided to add
one pipeline stage. The location of the existing pipeline registers cannot be
changed. Where should this pipeline stage be placed? What's the maximum
clock rate of the 6-stage processor? (Assume there is no delay penalty when
adding pipeline stages.)
(4) Repeat the analysis in (2) for the new 6-stage processor. Is there any other
type of instruction sequence that causes a data hazard that cannot be resolved
by forwarding? Comparing the 5-stage and 6-stage designs, what effect does
adding one pipeline stage have on data hazard resolution?
Answer:
(1) The slowest stage (EX) takes 4 ns, so the maximum clock rate = 1 / (4 × 10^-9)
= 250 MHz.
(2) A load instruction immediately followed by an instruction that uses the loaded
value (a load-use sequence) cannot be resolved by forwarding alone, because the
data is not available until the end of MEM. The performance penalty is a
1-clock-cycle stall.
(3) (a) The EX stage has the longest delay, so split it into two 2 ns stages, EX1
and EX2.
(b) The slowest stage is now MEM at 2.5 ns, so the maximum clock rate =
1 / (2.5 × 10^-9) = 400 MHz.
(4) (a) A load-use sequence now stalls for 2 clock cycles. In addition, an
instruction that immediately uses the result of a preceding ALU instruction now
causes a data hazard that forwarding cannot fully hide, stalling for 1 clock cycle,
since the ALU result is not ready until the end of EX2.
(b) In general, adding a pipeline stage increases the stall penalty of data hazards.

2. (1) What type of cache misses (compulsory, conflict, capacity) can be reduced
by increasing the cache block size?
(2) Can increasing the degree of cache associativity always reduce the average
memory access time? Explain your answer.
Answer:

(1) Compulsory
(2) No. AMAT = hit time + miss rate × miss penalty. Increasing the degree of
cache associativity may decrease the miss rate but will lengthen the hit time;
therefore, the average memory access time is not necessarily reduced.

3. List two types of cache write policies. Compare the pros and cons of these two
polices.
Answer:
(1) Write-through: A scheme in which writes always update both the cache and the
memory, ensuring that data is always consistent between the two.
Write-back: A scheme that handles writes by updating values only to the
block in the cache, then writing the modified block to the lower level of the
hierarchy when the block is replaced.
(2)
Write-through
Pros: simple to implement; memory always holds up-to-date data, so a replaced
block never needs to be written back.
Cons: every write goes to main memory, so write traffic is high and the CPU may
stall on writes (unless a write buffer is used).
Write-back
Pros: individual words are written at cache speed; multiple writes to one block
need only a single write to memory, reducing memory-bandwidth demand.
Cons: more complex to implement; memory can hold stale data, and dirty blocks
must be written back on replacement.



4. Briefly describe the difference between synchronous and asynchronous bus
transactions.
Answer:
Bus type       Synchronous Bus                      Asynchronous Bus
Difference     Includes a clock in the control      Not clocked; uses a handshaking
               lines and a fixed protocol for       protocol instead
               communication relative to the
               clock
Advantage      Requires very little logic and       Can accommodate a wide range of
               can run very fast                    devices; can be lengthened without
                                                    worrying about clock skew
Disadvantage   Every device on the bus must run     Requires a handshaking protocol,
               at the same clock rate; to avoid     which adds overhead
               clock skew, the bus cannot be
               long if it is fast



96

1. The following MIPS assembly program tries to copy words from the address in
register $a0 to the address in $a1, counting the number of words copied in
register $v0. The program stops copying when it finds a word equal to 0. You do
not have to preserve the contents of registers $v1, $a0, and $a1. This terminating
word should be copied but not counted.
loop: lw $v1, 0($a0) # read next word from source
addi $v0, $v0, 1 # Increment count words copied
sw $v1, 0($a1) # Write to destination
addi $a0, $a0, 1 # Advance pointer to next word
addi $a1, $a1, 1 # Advance pointer to next word
bne $v1, $zero, loop # Loop if word copied != zero
There are multiple bugs in this MIPS program; fix them and turn in a bug-free
version.
Answer:
addi $v0, $zero, -1 # initialize count to -1 so the terminating word is copied but not counted
Loop: lw $v1, 0($a0)
addi $v0, $v0, 1
sw $v1, 0($a1)
addi $a0, $a0, 4
addi $a1, $a1, 4
bne $v1, $zero, Loop

2. Carry lookahead is often used to speed up the addition operation in ALU. For a
4-bit addition with carry lookahead, assuming the two 4-bit inputs are a3a2a1a0
and b3b2b1b0, and the carry-in is c0,
(a) First derive the recursive equations of carry-out ci+1 in terms of ai and bi and
ci, where i = 0, 1,.., 3.
(b) Then by defining the generate (gi) and propagate (pi) signals, express c1, c2,
c3, and c4 in terms of only gi's, pi's, and c0.
(c) Estimate the speed up for this simple 4-bit carry lookahead adder over the
4-bit ripple carry adder (assuming each logic gate introduces T delay).
Answer:
(a) ci+1 = aibi + aici + bici
(b) c1 = g0 + (p0c0)
c2 = g1 + (p1g0) + (p1p0c0)
c3 = g2 + (p2g1) + (p2p1g0) + (p2p1p0c0)
c4 = g3 + (p3g2) + (p3p2g1) + (p3p2p1g0) + (p3p2p1p0c0)
(c) The critical path delay for the 4-bit ripple carry adder = 2T × 4 = 8T
The critical path delay for the 4-bit carry lookahead adder = 2T + T = 3T

Speedup = 8T/3T = 2.67
(Note: the critical path delay is determined by the carry computation.)
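The lookahead equations in (b) can be checked exhaustively against the ripple recurrence in (a); this is an illustrative sketch using Python integers for the bits:

```python
def cla_carries(a_bits, b_bits, c0):
    """Carries c1..c4 from the lookahead forms, using the generate
    (g_i = a_i AND b_i) and propagate (p_i = a_i OR b_i) signals.
    Bit 0 is the least significant."""
    g = [a & b for a, b in zip(a_bits, b_bits)]
    p = [a | b for a, b in zip(a_bits, b_bits)]
    c = [c0]
    for i in range(4):
        # c_{i+1} = g_i + p_i c_i; expanding this recursion yields the
        # two-level expressions for c1..c4 given in the answer
        c.append(g[i] | (p[i] & c[i]))
    return c[1:]

def ripple_carries(a_bits, b_bits, c0):
    # Reference recurrence from (a): c_{i+1} = a_i b_i + a_i c_i + b_i c_i
    c, out = c0, []
    for a, b in zip(a_bits, b_bits):
        c = (a & b) | (a & c) | (b & c)
        out.append(c)
    return out

# The two formulations agree on every pair of 4-bit inputs
for x in range(16):
    for y in range(16):
        a = [(x >> i) & 1 for i in range(4)]
        b = [(y >> i) & 1 for i in range(4)]
        assert cla_carries(a, b, 0) == ripple_carries(a, b, 0)
```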


3. When performing arithmetic addition and subtraction, overflow might occur. Fill
in the blanks in the following table of overflow conditions for addition and
subtraction.
Operation   Operand A   Operand B   Result indicating overflow
A + B       ≥ 0         ≥ 0         (a)
A + B       < 0         < 0         (b)
A − B       ≥ 0         < 0         (c)
A − B       < 0         ≥ 0         (d)

Prove that the overflow condition can be determined simply by checking to see if
the CarryIn to the most significant bit of the result is not the same as the CarryOut
of the most significant bit of the result.
Answer:
(1)
Operation   Operand A   Operand B   Result indicating overflow
A + B       ≥ 0         ≥ 0         (a) < 0
A + B       < 0         < 0         (b) ≥ 0
A − B       ≥ 0         < 0         (c) < 0
A − B       < 0         ≥ 0         (d) ≥ 0
(2) Build a table that shows all possible combinations of Sign and CarryIn to the
sign bit position and derive the CarryOut, Overflow, and related information.
Thus:
SignA SignB CarryIn CarryOut Sign of result Correct sign Overflow? CarryIn XOR CarryOut Notes
0     0     0       0        0              0            No        0
0     0     1       0        1              0            Yes       1                    Carries differ
0     1     0       0        1              1            No        0                    |A| < |B|
0     1     1       1        0              0            No        0                    |A| > |B|
1     0     0       0        1              1            No        0                    |A| > |B|
1     0     1       1        0              0            No        0                    |A| < |B|
1     1     0       1        0              1            Yes       1                    Carries differ
1     1     1       1        1              1            No        0
From this table an XOR of the CarryIn and CarryOut of the sign bit serves to
detect overflow.
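The proof can also be checked exhaustively for small operands; a sketch assuming 4-bit two's-complement values for brevity:

```python
def overflow_by_carries(a, b):
    # Ripple-add the 4-bit values and compare the carry INTO the sign
    # bit with the carry OUT of the sign bit.
    carry = 0
    for i in range(4):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        carry_in = carry
        carry = (ai & bi) | (ai & carry_in) | (bi & carry_in)
    return carry_in ^ carry  # carry_in here is the carry into bit 3

def overflow_by_range(a, b):
    # Ground truth: the true sum falls outside the 4-bit range [-8, 7]
    signed = lambda x: x - 16 if x & 8 else x
    return int(not -8 <= signed(a) + signed(b) <= 7)

# XOR of the carries matches the range check for every operand pair
for a in range(16):
    for b in range(16):
        assert overflow_by_carries(a, b) == overflow_by_range(a, b)
```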

4. Assume all memory addresses are translated to physical addresses before the
cache is accessed. In this case, the cache is physically indexed and physically
tagged. Also assume a TLB is used. (a) Under what circumstance can a memory
reference encounter a TLB miss, a page table hit, and a cache miss? Briefly
explain why. (b) To speed up cache accesses, a processor may index the cache
with virtual addresses. This is called a virtually addressed cache, and it uses tags
that are virtual addresses. However, a problem called aliasing may occur. Explain
what aliasing is and why. (c) In today's computer systems, virtual memory and
cache work together as a hierarchy. When the operating system decides to move a
page back to disk, the contents of that page may have been brought into the cache
already. What should the OS do with the contents that are in the cache?
Answer:
(a) The data/instruction is in memory but not in the cache, and the page table
contains the mapping but the TLB does not (for example, the first reference to a
page after its TLB entry has been evicted).
(b) A situation in which the same object is accessed by two addresses; can occur
in virtual memory when there are two virtual addresses for the same physical
page.
(c) If the contents in the cache are dirty, force a write-back to memory and
invalidate the cache entries; after that, copy the page back to disk. If they are
not dirty, simply invalidate the cache entries and copy the page back to disk.

5. The following three instructions are executed using MIPS 5-stage pipeline.
1. lw $2, 20($1)
2. sub $4, $2, $5
3. or $4, $2, $6
Since there is one cycle delay between lw and sub, a hazard detection unit is
required. Furthermore, by the time the hazard is detected, sub and or may have
already been fetched into the pipeline. Therefore it is also required to turn sub
into a nop and delay the execution of sub and or by one cycle as shown below.
1. lw $2, 20($1)
2. nop
3. sub $4, $2, $5
4. or $4, $2, $6
(a) In which stage should the hazard detection unit be placed? Why? (b) How can
you turn sub into a nop in MIPS 5-stage pipeline? (c) How can you prevent sub
and or from making progress and force these two instructions to repeat in the next
clock cycle? (d) Explain why there is one cycle delay between lw and sub.
Answer:
(a) ID: Instruction Decode and register file read stage.
(b) Deassert all nine control signals for the EX, MEM, and WB stages by zeroing
the control fields written into the ID/EX pipeline register.
(c) Set both control signals PCWrite and IF/IDWrite to 0 to prevent the PC
register and IF/ID pipeline register from changing.

(d) As shown in the following diagram, after 1-cycle stall between lw and sub,
the forwarding logic can handle the dependence and execution proceeds. (If
there were no forwarding, then 2 cycle delay is needed)
lw   IF  ID  EX  MEM  WB
nop      IF  ID  EX   MEM  WB
sub          IF  ID   EX   MEM  WB

6. Answer the following questions briefly.
(a) Will addition "0010 + 1110" cause an overflow using the 4-bit two's
complement signed-integer form? (Simply answer yes or no).
(b) What would you get after performing an arithmetic right shift by one bit on
1100_two?
(c) If one wishes to increase the accuracy of the floating-point numbers that can
be represented, then he/she should increase the size of which part in the
floating-point format?
(d) Name one event, other than branches or jumps, by which the normal flow of
instruction execution is changed, e.g., by switching to a routine in the
operating system.
Answer:
(a) NO
(b) 1110_two
(c) Fraction
(d) Arithmetic overflow

7. A MIPS instruction takes five stages in a pipelined CPU design: (1) IF:
instruction fetch, (2) ID: instruction decode/register fetch, (3) ALU: execution or
memory address calculation, (4) MEM: access an operand in data memory, and (5)
WB: write a result back into the register file. Label one appropriate stage in which
each of the following actions needs to be executed. (Note that A and B are two
source operands, while ALUOut is the output register of the ALU, PC is the
program counter, IR is the instruction register. MDR is the memory data register,
Memory[k] is the k-th word in the memory, and Reg[k] is the k-th registers in the
register file.)
(a) Reg[IR[20-16]] = MDR;
(b) ALUOut = PC + (sign-extend (IR[15-0]) << 2);
(c) Memory[ALUOut] = B;
Answer:
(a) WB
(b) ID

(c) MEM

95

1. (1) Can you come up with a MIPS instruction that behaves like a NOP? The
instruction is executed by the pipeline but does not change any state.
(2) In a MIPS computer a main program can use "jal procedure address" to make a
procedure call and the callee can use "jr $ra" to return to the main program.
What is saved in register $ra during this process?
(3) Name and explain the three principal components that can be combined to
yield runtime.
Answer:
(1) sll $zero, $zero, 0
(2) The address of the instruction following the jal (Return address)
(3) Runtime = instruction count × CPI (cycles per instruction) × clock cycle time

2. (1) Briefly explain the purpose of having a write buffer in the design of a
write-through cache.
(2) Large cache blocks tend to decrease the cache miss rate due to better spatial
locality. However, it has been observed that too large a block can actually
increase the miss rate, especially in a very small cache. Why?
Answer:
(1) After writing the data into the write buffer, the processor can continue
execution without waiting for the memory update to complete. CPU performance
can thus be increased.
(2) The number of blocks that can be held in the cache will become small, and
there will be a great deal of competition for those blocks. As a result, a block
will be bumped out of the cache before many of its words are accessed.

3. (1) Dynamic branch prediction is often used in today's machine. Consider a loop
branch that branches nine times in a row, and then is not taken once. What is
the prediction accuracy for this branch, assuming a simple 1-bit prediction
scheme is used and the prediction bit for this branch remains in the prediction
buffer? Briefly explain your result.
(2) What is the prediction accuracy if a 2-bit prediction scheme is used? Again
briefly explain your result.
Answer:
(1) The steady-state prediction behavior will mispredict on the first and last loop
iterations. Mispredicting the last iteration is inevitable since the prediction bit
will say taken. The misprediction on the first iteration happens because the bit
is flipped on prior execution of the last iteration of the loop, since the branch
was not taken on that exiting iteration. Thus, the prediction accuracy for this
branch is 80% (two incorrect predictions and eight correct ones).

(2) The prediction accuracy with a 2-bit prediction scheme is 90%, since only the
last loop iteration is mispredicted.
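Both schemes can be simulated on the stated pattern (taken nine times in a row, then not taken once, repeated); the simulation below is an illustrative sketch:

```python
def run_1bit(outcomes):
    state, correct = 1, 0           # 1-bit predictor, initially "taken"
    for taken in outcomes:
        correct += state == taken
        state = taken               # remember only the last outcome
    return correct / len(outcomes)

def run_2bit(outcomes):
    state, correct = 3, 0           # 2-bit saturating counter, 3 = strongly taken
    for taken in outcomes:
        correct += (state >= 2) == bool(taken)
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return correct / len(outcomes)

pattern = ([1] * 9 + [0]) * 100     # taken 9 times, not taken once, repeated
one_bit = run_1bit(pattern)         # ~0.80: first and last loop iterations miss
two_bit = run_2bit(pattern)         # 0.90: only the exit iteration misses
```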

4. Answer the following questions briefly:
(1) In a pipelined CPU design, what kind of problem may occur as it executes
instructions corresponding to an if-statement in a C program? Name one
possible scheme to mitigate this problem.
(2) Consider the possible actions in the Instruction Decode stage of a pipelined
CPU. In addition to setting up the two input operands of ALU, what is the
other possible action? (Hint: consider the execution of a jump instruction)
(3) What is x if the maximum number of memory words you can use in a 32-bit
MIPS machine in a single program is expressed as 2^x? (Note: MIPS uses a
byte addressing scheme.)
Answer:
(1) Control hazard.
Solutions: insert nop instructions, delayed branches, branch prediction
(2) Decode instruction, sign-extend 16 bits immediate constant, jump address
calculation, branch target calculation, register comparison, load-use data
hazard detection.
(3) A single program in a 32-bit MIPS machine can use 256 MB = 2^28 bytes = 2^26
words. So, x = 26.

5. Consider the following flow chart of a sequential multiplier. We assume that the
64-bit multiplicand register is initialized with the 32-bit original multiplicand in
the right half and 0 in the left half. The final result is to be placed in a product
register. Fill in the missing descriptions in blanks A and B.
start
→ Test Multiplier[0]
    Multiplier[0] = 1: Blank A
    Multiplier[0] = 0: (no action)
→ Shift the Multiplicand register left by 1 bit
→ Blank B
→ 32nd repetition?
    No (< 32 repetitions): loop back to "Test Multiplier[0]"
    Yes (32 repetitions): Done


Answer:
Blank A: add Multiplicand to product and place the result in the Product register
Blank B: shift the Multiplier register right 1 bit

6. Schedule the following instruction segment into a superscalar pipeline for MIPS.
Assume that the pipeline can execute one ALU or branch instruction and one data
transfer instruction concurrently. For the best, the instruction segment can be
executed in four clock cycles. Fill in the instruction identifiers into the table. Note
that data dependency should be taken into account.
(Identifier) (Instruction)
ln-1 Loop: lw $t0, 0($s1)
ln-2 addu $t0, $t0, $s2
ln-3 sw $t0, 0($s1)
ln-4 addi $s1, $s1, 4
ln-5 bne $s1, $zero, Loop

Clock Cycle ALU or branch instruction Data transfer instruction
1
2
3
4
Answer:
Clock Cycle   ALU or branch instruction   Data transfer instruction
1                                         ln-1 (lw)
2             ln-4 (addi)
3             ln-2 (addu)
4             ln-5 (bne)                  ln-3 (sw)


7. Suppose a computer's address size is k bits (using byte addressing), the cache size
is S bytes, the block size is B bytes and the cache is A-way set-associative.
Assume that B is a power of two, so B = 2^b. Figure out what the following
quantities are in terms of S, B, A, b, and k:
(1) the number of sets in the cache
(2) the number of index bits in the address
(3) the number of bits needed to implement the cache
Answer:
Address size: k bits
Cache size: S bytes/cache
Block size: B = 2^b bytes/block
Associativity: A blocks/set
(1) Number of sets in the cache = S/(A × B)
(2) Number of index bits in the address = log2(S/(A × B)) = log2(S/A) − b
    Number of tag bits = k − log2(S/(A × B)) − b = k − log2(S/A)
(3) Number of bits needed to implement the cache
    = sets/cache × associativity × (data + tag + valid)
    = (S/(A × B)) × A × (8B + k − log2(S/A) + 1)
    = (S/B) × (8B + k − log2(S/A) + 1) bits
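Plugging a concrete configuration into these formulas makes them easy to check; the 4 KB cache with 16-byte blocks and a 32-bit address below is an assumed example:

```python
import math

def cache_layout(k, S, B, A):
    """Address size k bits, cache size S bytes, block size B bytes,
    A-way set associative. Returns (sets, index bits, tag bits, total bits)."""
    b = int(math.log2(B))
    sets = S // (A * B)
    index = int(math.log2(sets))            # log2(S/(A*B)) = log2(S/A) - b
    tag = k - index - b                     # k - log2(S/A)
    total = sets * A * (8 * B + tag + 1)    # data + tag + valid bit per block
    return sets, index, tag, total

print(cache_layout(32, 4096, 16, 1))   # direct-mapped: (256, 8, 20, 38144)
print(cache_layout(32, 4096, 16, 2))   # two-way:       (128, 7, 21, 38400)
```

Note how doubling the associativity halves the sets, shortens the index by one bit, and lengthens the tag by one bit.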

8. To compare the maximum bandwidth for a synchronous and an asynchronous bus,
assume that the synchronous bus has a clock cycle of 50 ns, and each bus
transmission takes 1 clock cycle. The asynchronous bus requires 40 ns per
handshake and the asynchronous handshaking protocol consists of seven steps to
read a word from memory and receive it in an I/O device as shown below. The
data portion of both buses is 32 bits wide. Find the bandwidth for each bus in
MB/sec when performing one-word reads from a 200-ns memory.
[Figure: asynchronous handshaking protocol — ReadReq, Data, Ack, and DataRdy
lines, with the seven steps numbered.]
Answer:
(1) For the synchronous bus, which has 50-ns bus cycles, the steps and times
required are as follows:
1. Send the address to memory: 50 ns
2. Read the memory: 200 ns
3. Send the data to the device: 50 ns
Thus, the total time is 300 ns.
The maximum bus bandwidth = 4 bytes/300ns = 13.3 MB/second
(2) For the asynchronous bus, the memory receives the address at the end of step
1 and does not need to put the data on the bus until the beginning of step 5;
step 2, 3, and 4 can overlap with the memory access time. This leads to the
following timing:
Step 1: 40 ns
Step 2, 3, 4: maximum(3 × 40 ns, 200 ns) = 200 ns

Step 5, 6, 7: 3 × 40 ns = 120 ns
Thus, the total time is 360 ns.
The maximum bus bandwidth = 4 bytes/360ns = 11.1 MB/second
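The two bandwidth calculations can be written as a short sketch; the step structure follows the answer above:

```python
def sync_bus_bw_mb(cycle_ns, mem_ns, word_bytes=4):
    # send address (1 cycle) + read memory + send data (1 cycle)
    total_ns = cycle_ns + mem_ns + cycle_ns
    return word_bytes * 1000.0 / total_ns   # bytes/ns scaled to MB/s

def async_bus_bw_mb(handshake_ns, mem_ns, word_bytes=4):
    # step 1, then steps 2-4 overlapped with the memory read, then steps 5-7
    total_ns = handshake_ns + max(3 * handshake_ns, mem_ns) + 3 * handshake_ns
    return word_bytes * 1000.0 / total_ns

print(round(sync_bus_bw_mb(50, 200), 1))    # 13.3 MB/s over 300 ns
print(round(async_bus_bw_mb(40, 200), 1))   # 11.1 MB/s over 360 ns
```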

9. Bus arbitration is needed in deciding which bus master gets to use the bus next
in a computer system. There are a wide variety of schemes for bus arbitration;
these may involve special hardware or extremely sophisticated bus protocols. In
a bus arbitration scheme, a device (or the processor) wanting to use the bus
signals a bus request and is later granted the bus. After a grant, the device can
use the bus, later signaling to the arbiter that the bus is no longer required. The
arbiter can then grant the bus to another device. Most multiple-master buses
have a set of bus lines for performing bus requests and grants. A bus release line
is also needed if each device does not have its own request line. Sometimes the
signals used for bus arbitration have physically separate lines, while in other
systems the data lines of the bus are used for this function. Arbitration schemes
usually try to balance two factors in choosing which device to grant the bus,
namely, the priority and the fairness. In general, bus arbitration schemes can be
divided into four broad classes. What are those four classes? Briefly explain
those four classes of bus arbitration schemes.
Answer:
1. Daisy chain arbitration: the bus grant line is run through the device from
highest priority to lowest
2. Centralized, parallel arbitration: multiple request lines are used and a centralized
arbiter chooses from among the devices requesting the bus (PCI backplane bus)
3. Distributed arbitration by self-selection: Each device wanting the bus places a
code indicating its identity on the bus. (Apple Macintosh II Nubus)
4. Distributed arbitration by collision detection: Each device independently
requests the bus. The collision is detected when multiple simultaneous
requests occur. (Ethernet)











94

1. How many addressing modes are used in the following MIPS code? Please select
at least one instruction from the assembly code to explain different addressing
modes.


search: $v1, 0($v0)
sw $fp, 20($sp) lw $v0, 32($fp)
sw $gp,16($sp) bne $v1, $v0, $L4
move $fp, $sp lw $v1, 8($fp)
sw $a0, 24 ($fp) move $v0, $v1
sw $a1, 28($fp) j $L1
sw $a2, 32 ($fp) $L4:
sw $zero, 8($fp) lw $v0, 8($fp)
$L2: addu $v1, $v0, 1
lw $v0, 8($fp) sw $v1, 8($fp)
lw $v1, 28($fp) j $L2
slt $v0, $v0, $v1 $L3:
bne $v0, $zero, $L5 li $v0, -1
j $L3 j $L1
$L5: $L1:
lw $v0, 8($fp) move $sp, $fp
move $v1, $v0 lw $fp, 20 ($sp)
sll $v0, $v1, 2 addu $sp, $sp, 24
lw $v1, 24($fp) j $ra
addu $v0, $v0, $v1 .end search
Answer:
Addressing Modes Example instruction
(1) Register addressing slt $v0, $v0, $v1
(2) Base or displacement addressing lw $v0, 32($fp)
(3) Immediate addressing li $v0, -1
(4) PC-relative addressing bne $v0, $zero, $L5
(5) Pseudodirect addressing j $L1


2. Answer the following two yes or no questions about the MIPS assembly language.
a. Is it true that instruction slt $s1, $s2, $s3 will set $s1 to 1 if $s2 is less than
$s3?
b. Is it true that the so-called jump and link instruction, e.g., jal 2500, is mainly to
support the return action from a procedure call to its caller function?
Addressing mode examples

Answer:
a. Yes
b. No. (It jumps from the caller to the callee and simultaneously saves the address
of the following instruction in register $ra.)

3. Consider a pipelined CPU design with the 5 stages being (1) instruction fetch, (2)
decoding, (3) execution, (4) memory access, and (5) register write back.
(a) List all instruction stages in which we need to read or write the register file.
(b) What is the minimum number of IO ports required for the register file if the
access of the register file takes one full clock cycle? (Note: an IO port can be
used for either a read or write operation.)
(c) What is the minimum number of IO ports required for the register file if the
access of the register file takes only half a clock cycle? (Note: we will be able
to perform two accesses to or from the register files within one clock cycle.)
Answer:
(a) decoding and register write back stages
(b) 3
(c) 2

4. Consider the following summary table for execution steps that need to be
performed by four major instruction classes: (1) arithmetic-logic (or called
R-type), (2) memory-reference, (3) branch, and (4) jump instructions.
(a) Complete the missing action in entry marked ? under the column for the
memory-reference instructions.
(b) What is the instruction class of the entry marked (o)?
Execution Step       R-type             Memory-Reference          (o)                   Branch
Instruction Fetch    IR ← Memory[PC]; PC ← PC + 4;
Instruction Decode   A ← Reg[IR[25-21]]; B ← Reg[IR[20-16]];
                     ALUOut ← PC + (sign-extend(IR[15-0]) << 2);
Execution            ALUOut ← A op B;   ALUOut ← A +              PC ← PC[31-28] ||     If (A == B) then
                                        sign-extend(IR[15-0]);    (IR[25-0] << 2);      PC ← ALUOut;
Memory Access                           Load: MDR ← (?);
                                        Store: omitted;
Register Write Back  Reg[IR[15-11]]     Load: Reg[IR[20-16]]
                     ← ALUOut;          ← MDR;

Answer:
(a) Memory[ALUOut]
(b) jump instruction


5. Answer the following questions:
a) Explain (and draw a diagram for) how each of the following methods resolves
multiple simultaneous interrupt requests from the I/O devices: (1) Daisy chain,
(2) Polling.
b) Explain the following two modes, (1) cycle stealing and (2) block mode, for a
DMA controller to transfer data from an I/O device to memory. Which mode is
transparent (unknown) to CPU operation? Why?
c) Now suppose that the CPU executes a maximum of 10^6 instructions/sec. An
average instruction execution requires five machine cycles, three of which use
the memory bus; a memory read/write uses one machine cycle to transfer
one word. What is the DMA transfer rate (words/sec) for the above two DMA
controller modes?
Answer:
(a) (1) Daisy chain: the bus grant line runs from the CPU (or arbiter) through the
I/O devices in priority order. A requesting device intercepts the grant signal,
identifies itself, and blocks the grant from propagating to lower-priority devices;
the Request and Release lines are shared (wired-OR) among all devices.
[Figure: Bus arbiter → Grant → Device 1 (highest priority) → Device 2 → … →
Device N (lowest priority), with shared Release and Request (wired-OR) lines.]
(2) Polling: the CPU examines the status register of each I/O device in turn and
services the first requesting device it finds; priority is determined by the
polling order.
[Figure: CPU connected to Device 1, Device 2, …, Device n.]
(b) (1) Cycle stealing: DMA steals memory cycles from CPU, transferring one or a
few words at a time before returning control.
(2) Block mode: an entire block is transferred in a single continuous burst; this
is needed for magnetic-disk drives etc., where data transmission cannot be
stopped or slowed down without loss of data.
So, cycle stealing is transparent to CPU operation.
(c) Cycle stealing: (5 − 3) × 10^6 = 2 × 10^6 words/second
DMA block transfer: 5 × 10^6 words/second

6. The interrupt breakpoint or the DMA breakpoint is the instant when the CPU
responds to an interrupt request (INTR) or a DMA request while it is executing
an instruction (several micro-steps or machine cycles are required to execute
each instruction in one instruction cycle).
(1) Where the interrupt breakpoint occurs?
(2) Where the DMA breakpoint occurs?
(3) After receiving the INTA signal from CPU, how the I/O device identifies
itself to CPU?
Answer:
(1) The interrupt breakpoint occurs at the end of the current instruction cycle:
the CPU finishes executing the instruction before responding to INTR.
(2) The DMA breakpoint can occur between any two machine (processor) cycles:
the CPU releases the bus as soon as the current machine cycle completes.
(3) An interrupting device identifies itself to the CPU by:
- Sending the address of the interrupt handling routine (vectored interrupt)
- Putting an identifier in a Cause register
[Figure: an instruction cycle divided into processor cycles — fetch instruction,
decode instruction, fetch operand, execute instruction, store result. DMA
breakpoints can occur between processor cycles, while the interrupt breakpoint
occurs at the end of the instruction cycle, when the processor interrupt is taken.]











93

1. Consider the representations of the floating-point numbers.
(a) A number is often denoted as (−1)^S × F × 2^E. What are the English names and
meanings of F and E, respectively?
(b) In the IEEE 754 standard format, a number is denoted as
(−1)^S × (1.F) × 2^(E−Bias). For single and double precision numbers, what are
the values of Bias, respectively?
Answer:
(a) F: Fraction (or significand)
    E: Exponent
(b) Single precision bias: 127
    Double precision bias: 1023

2. Explain the meaning of each of the following MIPS instructions using an
if-statement. Denote the program counter as PC when needed. Note that an
instruction has four bytes.
slt $s1, $s2, $s3
slti $s1, $s2, 100
bne $s1, $s2, 25
Answer:
(1) if ($s2 < $s3) then $s1 = 1 else $s1 = 0
(2) if ($s2 < 100) then $s1 = 1 else $s1 = 0
(3) if ($s1 ≠ $s2) then goto (PC + 4) + 25 × 4 = (PC + 4) + 100

3. In a 5-stage pipelined computer, 20% of the instructions are assumed to be
branch instructions that could cause one-cycle pipeline stalls, if not properly
handled.
(a) Ignoring all other hazards, what is the CPI of this computer by taking into
account this branch-related control hazard?
(b) If the probability of a branch instruction being taken is 30% on the average,
then what is the average CPI under the predict-not-taken branch prediction
scheme?
Answer:
(a) CPI = 1 + 0.2 × 1 = 1.2
(b) CPI = 1 + 0.2 × 0.3 × 1 = 1.06
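Both CPI figures are instances of one expression; a minimal sketch:

```python
def cpi(base, branch_frac, stall_prob, stall_cycles=1):
    # average CPI = base CPI + (branch frequency * probability that a
    # branch stalls * stall cycles per stalling branch)
    return base + branch_frac * stall_prob * stall_cycles

every_branch_stalls = cpi(1.0, 0.2, 1.0)   # (a) 1.2
predict_not_taken   = cpi(1.0, 0.2, 0.3)   # (b) 1.06, only taken branches stall
```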


4. Consider a computer system with a cache of 4K blocks, a four-word block size, a
4-byte word size, and a 32-bit address.
(1) What are the total number of sets and the total number of tag bits for caches
that are (i) direct-mapped, (ii) two-way set associative, (iii) four-way set
associative, and (iv) fully associative?
(2) Draw 4 diagrams of the 32-bit address for the above four types of caches and
indicate in each diagram which bit fields are used for the tag, the index, the
block offset, etc., respectively.
(3) What block number does byte address 1200 map to in the four types of caches,
respectively?
Answer:
(1)(2) 4K blocks → index = 12 bits; 4-word blocks → block offset = 2 bits;
4-byte words → byte offset = 2 bits
(i) Direct-mapped: total tag bits = (32 − 12 − 2 − 2) × 4K = 16 × 4K = 64K bits
    | Tag: 16 bits | Index: 12 bits | Block offset: 2 bits | Byte offset: 2 bits |
(ii) Two-way set associative: total tag bits = (16 + 1) × 2K × 2 = 68K bits
    | Tag: 17 bits | Index: 11 bits | Block offset: 2 bits | Byte offset: 2 bits |
(iii) Four-way set associative: total tag bits = (17 + 1) × 1K × 4 = 72K bits
    | Tag: 18 bits | Index: 10 bits | Block offset: 2 bits | Byte offset: 2 bits |
(iv) Fully associative: total tag bits = (32 − 0 − 2 − 2) × 1 × 4K = 112K bits
    | Tag: 28 bits | Block offset: 2 bits | Byte offset: 2 bits |
(3) Block address = ⌊1200 / 16⌋ = 75
(i) Direct-mapped: block number 75
(ii) Two-way set associative: block number 75, set number 75
(iii) Four-way set associative: block number 75, set number 75
(iv) Fully associative: block number 75, set number 0
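The mapping of byte address 1200 can be reproduced directly; a small sketch of the index arithmetic for this 4K-block, 16-byte-block cache:

```python
BYTES_PER_BLOCK = 16                 # 4 words x 4 bytes
NUM_BLOCKS = 4096                    # 4K blocks

block_address = 1200 // BYTES_PER_BLOCK          # 75

def set_index(block_addr, ways):
    # a block maps to set (block address mod number of sets)
    return block_addr % (NUM_BLOCKS // ways)

direct   = set_index(block_address, 1)      # block 75
two_way  = set_index(block_address, 2)      # set 75
four_way = set_index(block_address, 4)      # set 75
fully    = 0                                # one set: the block can go anywhere
```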


5. Suppose we have a processor with a base CPI (clock cycles per instruction) of 1.0,
assuming all references hit in the primary cache, and a clock rate of 500 MHz.
Assume a main memory access time of 100 ns, including all the miss handling.
Suppose the miss rate per instruction at the primary cache is 5%.
(1) What is the miss penalty to main memory in clock cycles and the effective
CPI for this one-level caching processor?
(2) What will the effective CPI and how much faster will the machine be if we
add a secondary cache that has a 10-ns access time for either a hit or a miss
and the secondary cache is large enough to reduce the miss rate to main
memory to 2%?
Answer:
(1) CPU clock cycle time = 1 / 500 MHz = 2 ns
Miss penalty for main memory = 100 / 2 = 50 clock cycles
CPI = 1 + 0.05 × 50 = 3.5
(2) Miss penalty for the second-level cache = 10 / 2 = 5 clock cycles
CPI = 1 + 0.05 × 5 + 0.02 × 50 = 2.25
Speedup = 3.5 / 2.25 = 1.56
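The same arithmetic as a sketch:

```python
CLOCK_NS = 2.0                             # 500 MHz clock
main_penalty = 100 / CLOCK_NS              # 50 cycles to main memory
l2_penalty = 10 / CLOCK_NS                 # 5 cycles to the secondary cache

cpi_l1_only = 1.0 + 0.05 * main_penalty                        # 3.5
cpi_with_l2 = 1.0 + 0.05 * l2_penalty + 0.02 * main_penalty    # 2.25
speedup = cpi_l1_only / cpi_with_l2                            # ~1.56
```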

6. (1) Explain the purpose of the jump-and-link (jal) instruction.
(2) Explain why most of today's computer systems use 2's complement instead of
signed-magnitude in their hardware implementations.
(3) Explain why the geometric mean may be useful in comparing machine
performance.
(4) Powers of 2 are normally used in the design of a computer. Is it possible to
construct a five-way set associative cache? Why?
(5) Is MIPS (million instructions per second) an accurate measure for comparing
the performance of different architectures? Why?
Answer:
(1) jal jumps to a procedure and simultaneously saves the address of the following
instruction (PC + 4) in register $ra, so that the callee can return to the caller.
(2) Signed-magnitude has several drawbacks: it has two representations of 0 (+0
and −0), which can confuse programmers; its adder needs an extra step to set the
sign of the result; and it is not obvious where to place the sign bit. 2's
complement avoids these problems and keeps the addition hardware simple.
(3) The geometric mean is independent of which data series we use for
normalization because it has the property:
Geometric mean(X_i) / Geometric mean(Y_i) = Geometric mean(X_i / Y_i)

(4) Yes. A five-way set associative cache can work: each set simply holds 5 blocks.
Only the quantities used to index address bits (the block size in bytes and the
number of sets) need to be powers of 2; the associativity does not.
(5) No, since there are 3 problems with using MIPS:
- MIPS specifies the instruction execution rate but does not take into account
  the capabilities of the instructions
- MIPS varies between programs on the same computer
- MIPS can vary inversely with performance

7. How will you fill in a personal record such as Tom Lien in the following table
using little-endian? Assume each row consists of 4 bytes.


Answer:
(space) m o T
   n    e i L
(Each row holds one 4-byte word with the highest-addressed byte on the left;
the leftmost cell of the first row is the space character in "Tom Lien".)
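The byte layout can be reproduced with `struct`: pack each 4-byte group of the name as a little-endian word, then view that word most-significant-byte first, which is how the table rows above read.

```python
import struct

name = b"Tom Lien"
# interpret each 4-byte group as a little-endian 32-bit word
words = [struct.unpack("<I", name[i:i + 4])[0] for i in (0, 4)]
# display each word with its most significant byte first, as in the table
rows = [struct.pack(">I", w) for w in words]
print(rows)  # [b' moT', b'neiL']
```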


8. The following figure is a 32-bit ALU constructed from 32 1-bit ALUs. CarryOut
of the less significant bit is connected to the CarryIn of the more significant bit.
Can you add simple logic to detect whether an overflow occurs?
[Figure: 32 cascaded 1-bit ALUs (ALU0–ALU31); each ALUi has inputs a_i, b_i,
CarryIn, and an Operation control, and produces Result_i and CarryOut, with each
CarryOut feeding the next stage's CarryIn.]

Answer:
Overflow = CarryIn[31] XOR CarryOut[31]

[Figure: the same 32-bit ALU with an XOR gate on the CarryIn and CarryOut of
ALU31; its output is the Overflow signal.]




92

1. Compute the value of the following floating-point number A based on the IEEE
standard. (Note: this floating-point number is composed of three fields, i.e., a sign
bit, 8 exponent bits, and 23 significand bits)
A = (11000000101000000000000000000000)
Answer:
A = (−1)^1 × (1.01)_two × 2^(129−127) = −(101)_two = −5

2. Consider the addition process of the following two binary numbers A and B.
Determine the so-called carry generate and carry propagate signals for each bit.
A = (00011010)
B = (11100101)
Answer:
A   = 0 0 0 1 1 0 1 0
B   = 1 1 1 0 0 1 0 1
p_i = 1 1 1 1 1 1 1 1
g_i = 0 0 0 0 0 0 0 0
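Because B is the bitwise complement of A here, every bit position propagates a carry and none generates one; this can be checked directly:

```python
A = 0b00011010
B = 0b11100101   # bitwise complement of A

g = [((A >> i) & 1) & ((B >> i) & 1) for i in range(8)]  # g_i = a_i AND b_i
p = [((A >> i) & 1) | ((B >> i) & 1) for i in range(8)]  # p_i = a_i OR b_i

assert g == [0] * 8   # no position generates a carry
assert p == [1] * 8   # every position propagates a carry
```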

3. Consider the process of adding up two floating-point numbers in a
microprocessor.
(1) Derive the proper sequence of the following three operations:
    (i) Addition of significands
    (ii) Normalization
    (iii) Alignment of exponents
(2) What operation is still needed in addition to the above three operations?
Answer:
(1) (iii) Alignment of exponents → (i) Addition of significands → (ii) Normalization
(2) Round the sum


4. One extension of the MIPS instruction set architecture has two new instructions
called movn (move if not zero) and movz (move if zero). For example, the
instruction
movn $8, $11, $4
copies the contents of register 11 into register 8, provided that the value in
register 4 is nonzero (otherwise it does nothing). The movz instruction is similar,
but copying takes place only if the register's value is zero. Show how to use the
new instructions to put whichever is larger, register 8's value or register 11's
value, into register 10. If the values are equal, copy either into register 10. You

may use register 1 as an extra register for temporary use. Do not use any
conditional branches.
Answer:
slt $1, $8, $11
movn $10, $11, $1
movz $10, $8, $1

5. Consider three machines with different cache configurations:
Cache 1: Direct-mapped with one-word blocks.
Cache 2: Direct-mapped with four-word blocks.
Cache 3: Two-way set associative with four-word blocks.
The following miss measurements have been made:
Cache 1: Instruction miss rate is 4%; data miss rate is 8%.
Cache 2: Instruction miss rate is 2%; data miss rate is 5%.
Cache 3: Instruction miss rate is 2%; data miss rate is 4%.
For these machines, one-half of the instructions contain a data reference. Assume
that the cache miss penalty is 6 + Block size in words. The CPI for this workload
was measured on a machine with cache 1 and found to be 2.0. (1) Determine
which machine spends the most cycles on cache misses. (2) The cycle time for the
three machines are 10 ns for the first and second machines and 12 ns for the third
machine. Determine which machine is the fastest and which is the slowest.
Answer:
(1) C1 spends the most cycles on cache misses:
Cache  Miss penalty  I-cache miss      D-cache miss      Total miss cycles/instr
C1     6 + 1 = 7     4% × 7 = 0.28     8% × 7 = 0.56     0.28 + 0.56/2 = 0.56
C2     6 + 4 = 10    2% × 10 = 0.2     5% × 10 = 0.5     0.2 + 0.5/2 = 0.45
C3     6 + 4 = 10    2% × 10 = 0.2     4% × 10 = 0.4     0.2 + 0.4/2 = 0.4
(2) We need to calculate the base CPI that applies to all three processors. Since
we are given CPI = 2 for C1: CPI_base = CPI − CPI_misses = 2 − 0.56 = 1.44
Execution time for C1 = 2 × 10 ns × IC = 20 × 10^-9 × IC
Execution time for C2 = (1.44 + 0.45) × 10 ns × IC = 18.9 × 10^-9 × IC
Execution time for C3 = (1.44 + 0.4) × 12 ns × IC = 22.1 × 10^-9 × IC
Therefore C2 is fastest and C3 is slowest.
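The miss-cycle table and the timing comparison can be reproduced as a sketch:

```python
def miss_cycles(penalty, i_miss, d_miss, data_ref_frac=0.5):
    # stall cycles per instruction: every instruction is fetched, but only
    # half of the instructions make a data reference
    return i_miss * penalty + data_ref_frac * d_miss * penalty

c1 = miss_cycles(7, 0.04, 0.08)     # 0.56
c2 = miss_cycles(10, 0.02, 0.05)    # 0.45
c3 = miss_cycles(10, 0.02, 0.04)    # 0.40
base_cpi = 2.0 - c1                 # 1.44, from the measured CPI of 2.0 on C1

ns_per_instr = [(base_cpi + c1) * 10,   # C1: 20.0 ns
                (base_cpi + c2) * 10,   # C2: 18.9 ns (fastest)
                (base_cpi + c3) * 12]   # C3: ~22.1 ns (slowest)
```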

6. A program repeatedly performs a three-step process: It reads in a 4-KB block of
data from disk, does some processing on the data, and then writes out the result as
another 4-KB block elsewhere on the disk. Each block is contiguous and
randomly located on a single track on the disk. The disk drive rotates at
7200 RPM, has an average seek time of 8 ms, and has a transfer rate of 20 MB/sec.
The controller overhead is 2 ms. No other program is using the disk or processor,

and there is no overlapping of disk operation with processing. The processing step
takes 20 million clock cycles, and the clock rate is 400 MHz. What is the overall
speed of the system in blocks processed per second?
Answer:
(Seek time + Rotational delay + Data transfer time + Controller time) × 2 +
processing time = (8 ms + 0.5/(7200/60) sec + 4 KB/(20 MB/sec) + 2 ms) × 2 +
(20 × 10^6)/(400 × 10^6) sec = (8 + 4.17 + 0.2 + 2) × 2 + 50 = 78.74 ms
Blocks processed/second = 1/78.74 ms ≈ 12.70
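The throughput calculation as a sketch (4 KB is taken as 4096 bytes here, which shifts the transfer time slightly from the rounded 0.2 ms used above):

```python
seek_ms = 8.0
rotation_ms = 0.5 / (7200 / 60) * 1000   # half a revolution on average: ~4.17 ms
transfer_ms = 4096 / 20e6 * 1000         # 4 KB at 20 MB/sec: ~0.20 ms
controller_ms = 2.0
io_ms = seek_ms + rotation_ms + transfer_ms + controller_ms

processing_ms = 20e6 / 400e6 * 1000      # 20M cycles at 400 MHz: 50 ms
total_ms = 2 * io_ms + processing_ms     # one read + one write + processing
blocks_per_second = 1000 / total_ms      # ~12.7
```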

7. Suppose register $s0 has the binary number
1111 1111 1111 1111 1111 1111 1111 1111 two

and that register $s1 has the binary number
0000 0000 0000 0000 0000 0000 0000 0000 two

What are the values of registers $t0 and $t1 after these two instructions?
slt $t0, $s0, $s1 # set on less than signed comparison
sltu $t1, $s0, $s1 # set on less than unsigned comparison
Answer:
(1) $t0 = 1
(2) $t1 = 0

8. Explain
(1) spatial locality
(2) write-back cache
(3) page table
(4) compulsory misses
(5) branch delay slot
Answer:
(1) The locality principle stating that if a data location is referenced, data
locations with nearby addresses will tend to be referenced soon.
(2) A scheme that handles writes by updating values only to the block in the
cache, then writing the modified block to the lower level of the hierarchy
when the block is replaced.
(3) The table containing the virtual to physical address translations in a virtual
memory system. The table, which is stored in memory, is typically indexed by
the virtual page number; each entry in the table contains the physical page
number for that virtual page if the page is currently in memory.
(4) Also called cold start miss. A cache miss caused by the first access to a block
that has never been in the cache.
(5) The slot directly after a delayed branch instruction, which in the MIPS
architecture is filled by an instruction that does not affect the branch.

9. Which change is more effective on a certain machine: speeding up 10-fold the
floating point square root operation only, which takes up 20% of execution time,
or speeding up 2-fold all other floating point operations, which take up 50% of
total execution time? Assume that the cost of accomplishing either change is the
same, and the two changes are mutually exclusive.
Answer:
Speedup1 = 1 / ((1 − 0.2) + 0.2/10) = 1/0.82 ≈ 1.22
Speedup2 = 1 / ((1 − 0.5) + 0.5/2) = 1/0.75 ≈ 1.33
So, speeding up 2-fold all other floating point operations is more effective
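Both figures are instances of Amdahl's law, which can be checked with a two-line helper (a sketch; `amdahl` is a hypothetical name):

```python
def amdahl(fraction, speedup):
    # overall speedup when `fraction` of execution time is accelerated by `speedup`
    return 1 / ((1 - fraction) + fraction / speedup)

s1 = amdahl(0.2, 10)   # 10x faster FP square root (20% of time)
s2 = amdahl(0.5, 2)    # 2x faster for all other FP ops (50% of time)
print(round(s1, 2), round(s2, 2))  # 1.22 1.33
```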





96

1. Here is a series of address reference given as word addresses: 1, 4, 8, 5, 20, 17, 19,
56, 9, 11, 4, 43, 5, 6, 9, 17. Show the hits and misses and final cache contents for
a direct-mapped cache with four-word blocks and a total size of 16 words.
Answer:
Referenced address (decimal)  Block address  Tag  Index  Hit/Miss
1 0 0 0 Miss
4 1 0 1 Miss
8 2 0 2 Miss
5 1 0 1 Hit
20 5 1 1 Miss
17 4 1 0 Miss
19 4 1 0 Hit
56 14 3 2 Miss
9 2 0 2 Miss
11 2 0 2 Hit
4 1 0 1 Miss
43 10 2 2 Miss
5 1 0 1 Hit
6 1 0 1 Hit
9 2 0 2 Miss
17 4 1 0 Hit

index contents
0 16, 17, 18, 19
1 4, 5, 6, 7
2 8, 9, 10, 11
3
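A short simulation reproduces the table (a sketch; block address = word address ÷ 4 and index = block address mod 4, as in the answer):

```python
refs = [1, 4, 8, 5, 20, 17, 19, 56, 9, 11, 4, 43, 5, 6, 9, 17]
NUM_BLOCKS, WORDS_PER_BLOCK = 4, 4
cache = [None] * NUM_BLOCKS            # one tag per direct-mapped block

outcomes = []
for addr in refs:
    block = addr // WORDS_PER_BLOCK
    index, tag = block % NUM_BLOCKS, block // NUM_BLOCKS
    if cache[index] == tag:
        outcomes.append("hit")
    else:
        outcomes.append("miss")
        cache[index] = tag
print(outcomes.count("hit"), outcomes.count("miss"), cache)  # 6 10 [1, 0, 0, None]
```

The final tags [1, 0, 0, None] correspond to blocks 16–19, 4–7, 8–11 and an empty index 3, matching the contents table.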



2. The following program tries to copy words from the address in register $a0 to the
address in register $a1 and count the number of words copied in register $v0. The
program stops copying when it finds a word equal to 0. You do not have to
preserve the contents of registers $v1, $a0, and $a1. This terminating word should
be copied but not counted.
Loop: lw $v1, 0($a0) # read next word from source
addi $v0, $v0, 1 # increment count words copied
sw $v1, 0($a0) # write to destination
addi $a0, $a0, 1 # advance pointer to the next source
addi $a1, $a1, 1 # advance pointer to the next
destination
bne $v1, $zero, loop # loop if word copied is not zero
There are multiple bugs in this MIPS program. Please fix them and turn in a
bug-free version.
Answer:
addi $v0, $zero, -1
Loop: lw $v1, 0($a0)
addi $v0, $v0, 1
sw $v1, 0($a1)
addi $a0, $a0, 4
addi $a1, $a1, 4
bne $v1, $zero, Loop


95

1. Assume the critical path of a new computer implementation is memory access for
loads and stores. This causes the design to run at a clock rate of 500MHz instead
of the target clock rate of 750MHz. What is the solution with minimum
multi-cycle path to make the machine run at its targeted clock rate? Using the
table shown below, determine how much faster of the approach used on the
previous answer is compared with the 500 MHz machine with single-cycle
memory access. Assume all jumps and branches take the same number of cycles
and that the set instructions and arithmetic immediate instructions are
implemented as R-type instructions.
Instruction class  Frequency  Cycles per instruction on the 500 MHz machine
Loads              22%        5
Stores             11%        4
R-type             49%        4
Jump/branch        18%        3
Answer:
(1) If the execution of memory access can be divided into two clock cycles, the
machine may run at its targeted clock rate of 750MHz.
(2) CPI for the single-cycle memory access machine =
5 × 0.22 + 4 × 0.11 + 4 × 0.49 + 3 × 0.18 = 4.04
For the multi-cycle memory access machine, the CPI for loads is 5 + 1 = 6 and for
stores is 4 + 1 = 5.
CPI for the multi-cycle memory access machine =
6 × 0.22 + 5 × 0.11 + 4 × 0.49 + 3 × 0.18 = 4.37
The average instruction time for the single-cycle machine = 4.04 × 2 ns = 8.08 ns
The average instruction time for the multi-cycle machine = 4.37 × 1.33 ns ≈ 5.81 ns
The machine with multi-cycle memory access is 8.08/5.81 = 1.39 times faster
than the machine with single-cycle memory access.
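These figures can be verified directly (a sketch; the exact 750 MHz cycle time of 4/3 ns is used in place of the rounded 1.33 ns, which gives the same 1.39 speedup):

```python
freq = {"load": 0.22, "store": 0.11, "rtype": 0.49, "branch": 0.18}
cpi_single = {"load": 5, "store": 4, "rtype": 4, "branch": 3}
cpi_multi = {"load": 6, "store": 5, "rtype": 4, "branch": 3}   # memory ops take one extra cycle

def avg(cpi):
    return sum(freq[c] * cpi[c] for c in freq)

t_single = avg(cpi_single) * 2.0           # ns per instruction at 500 MHz
t_multi = avg(cpi_multi) * (1000 / 750)    # ns per instruction at 750 MHz
print(round(avg(cpi_single), 2), round(avg(cpi_multi), 2), round(t_single / t_multi, 2))
```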

2. A C procedure that swaps two locations in memory is shown below:
swap(int v[], int k)
{int temp;
temp = v[k];
v[k] = v[k + 1];
v[k + 1] = temp;
}
(1) Find the hazard in the following code from the body of the swap procedure.
(2) Reorder the instructions to avoid as many pipeline stalls as possible.


# reg $2 has the address of v[k]
lw $15, 0($2) # reg $15(temp) = v[k]
lw $16, 4($2) # reg $16 = v[k + 1]
sw $16, 0($2) # v[k] = reg $16
sw $15, 4($2) # v[k + 1] = reg $15(temp)
Answer:
(1) there is a data hazard for register $16 between the second load word
instruction and the first store word instruction.
(2)
lw $15, 0($2)
lw $16, 4($2)
sw $15, 4($2)
sw $16, 0($2)
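The stall counts can be checked with a small model (a sketch; it flags only the load-use case assumed by this problem, where a register is read by the instruction immediately after the lw that loads it):

```python
# Each instruction is (op, dest, list_of_source_registers).
def load_use_stalls(prog):
    stalls = 0
    for prev, cur in zip(prog, prog[1:]):
        if prev[0] == "lw" and prev[1] in cur[2]:
            stalls += 1                 # dependent instruction right after a load
    return stalls

original = [("lw", "$15", ["$2"]), ("lw", "$16", ["$2"]),
            ("sw", None, ["$16", "$2"]), ("sw", None, ["$15", "$2"])]
reordered = [("lw", "$15", ["$2"]), ("lw", "$16", ["$2"]),
             ("sw", None, ["$15", "$2"]), ("sw", None, ["$16", "$2"])]
print(load_use_stalls(original), load_use_stalls(reordered))  # 1 0
```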

3. Bus A is a bus with separate 32-bit address and 32-bit data. Each transmission
takes one bus cycle. A read to the memory incurs a three-cycle latency, then
starting with the fourth cycle, the memory system can deliver up to 8 words at a
rate of 1 word every bus cycle. For a write, the first word is transmitted with the
address; after a three-cycle latency up to 7 additional words may be transmitted at
the rate of 1 word every bus cycle. Evaluate the bus assuming only 1 word
requests where 60% of the requests are reads and 40% are writes. Find the
maximum bandwidth that each bus and memory system can provide in words per
bus cycle.
Answer:
The latency for reading 8 words = 3 + 8 = 11 cycles
The maximum bandwidth for reads = 8/11 words/cycle
The latency for writing 7 additional words = 1 + 3 + 7 = 11 cycles
The maximum bandwidth for writes = 7/11 words/cycle
The maximum bandwidth that the bus and memory system can provide =
(8/11) × 0.6 + (7/11) × 0.4 = 0.69 words/cycle

94

1. Here is a string of address references given as word addresses: 1, 4, 8, 5, 20, 17,
19, 56, 9, 11, 4, 43, 5, 6, 9, 17. Show the hits and misses and final cache contents
for a two-way set associative cache with one-word blocks and a total size of 16
words. Assume LRU replacement.
Answer: length of offset = 0 bits, length of index = 3 bits
Referenced address (decimal)  Referenced address (binary)  Tag  Index  Hit/Miss  Contents (Block0 | Block1)
1 000001 0 1 Miss 1
4 000100 0 4 Miss 4
8 001000 1 0 Miss 8
5 000101 0 5 Miss 5
20 010100 2 4 Miss 4 20
17 010001 2 1 Miss 1 17
19 010011 2 3 Miss 19
56 111000 7 0 Miss 8 56
9 001001 1 1 Miss 9 17
11 001011 1 3 Miss 19 11
4 000100 0 4 Hit 4 20
43 101011 5 3 Miss 43 11
5 000101 0 5 Hit 5
6 000110 0 6 Miss 6
9 001001 1 1 Hit 9 17
17 010001 2 1 Hit 9 17

Set Block0 Block1
0 8 56
1 9 17
2
3 43 11
4 4 20
5 5
6 6
7
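A simulation with LRU replacement reproduces the table (a sketch; index = word address mod 8 and tag = word address ÷ 8, as above):

```python
refs = [1, 4, 8, 5, 20, 17, 19, 56, 9, 11, 4, 43, 5, 6, 9, 17]
NUM_SETS, WAYS = 8, 2
sets = [[] for _ in range(NUM_SETS)]   # per set: tags ordered LRU -> MRU

hits = 0
for addr in refs:
    index, tag = addr % NUM_SETS, addr // NUM_SETS
    ways = sets[index]
    if tag in ways:
        hits += 1
        ways.remove(tag)       # re-appended below as most recently used
    elif len(ways) == WAYS:
        ways.pop(0)            # evict the least recently used tag
    ways.append(tag)
print(hits, sets[0], sets[1])  # 4 hits; set 0 holds tags 1, 7 (words 8 and 56)
```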


2. Consider three machines with different cache configurations:
Cache 1: Direct mapped with one-word blocks.
Cache 2: Direct mapped with four-word blocks.
Cache 3: 2-way set associative with four-word blocks.
The following miss rate measurements have been made:
Cache 1: Instruction miss rate is 4%; data miss rate 8%.
Cache 2: Instruction miss rate is 2%; data miss rate 5%.
Cache 3: Instruction miss rate is 2%; data miss rate 4%.
For these machines, one-half of the instructions contain a data reference. Assume
that the cache miss penalty is 6 + Block size in words. The CPI for this workload
was measured on a machine with cache 1 and was found to be 2.0. Answer the
following two questions.
(1) Determine which machine spends the most cycles on cache misses.
(2) The cycle times for the three machines are 10 ns for the first and second
machines and 12 ns for the third machine. Determine which machine is the
fastest and which is the slowest.
Answer:
(1) C1
Cache  Miss penalty  I-cache miss cycles  D-cache miss cycles  Total miss cycles per instruction
C1     6 + 1 = 7     4% × 7 = 0.28        8% × 7 = 0.56        0.28 + 0.56/2 = 0.56
C2     6 + 4 = 10    2% × 10 = 0.2        5% × 10 = 0.5        0.2 + 0.5/2 = 0.45
C3     6 + 4 = 10    2% × 10 = 0.2        4% × 10 = 0.4        0.2 + 0.4/2 = 0.4
(2) We need to calculate the base CPI that applies to all three processors. Since
we are given CPI = 2 for C1,
CPI_base = CPI − CPI_misses = 2 − 0.56 = 1.44
Execution time for C1 = 2 × 10 ns × IC = 20 × 10^-9 × IC
Execution time for C2 = (1.44 + 0.45) × 10 ns × IC = 18.9 × 10^-9 × IC
Execution time for C3 = (1.44 + 0.4) × 12 ns × IC = 22.1 × 10^-9 × IC
Therefore C2 is the fastest and C3 is the slowest.




93

1. How does DMA increase system concurrency? How does it complicate hardware
design?
Answer:
(1) DMA increases system concurrency by freeing the CPU to perform other
tasks while it handles data transfer to/from the disk.
(2) The hardware design of a system with DMA is complicated because a special
DMA controller must be integrated into the system so that DMA and normal
CPU operations can coexist.

2. There are six relative conditions between the values of two registers. In this
problem we consider two of them. Assuming that variable i corresponds to
register $19 and variable j to $20, show the MIPS code for the conditions
corresponding to the following two C codes:
(1) if (i == j) goto L1;
(2) if (i < j) goto L1;
Answer:
(1) beq $19, $20, L1
(2) slt $at, $19, $20
bne $at, $zero, L1

3. Consider a virtual memory system with the following properties:
(1) 40 bit virtual address
(2) 16 KB pages
(3) 36-bit physical address
Assume that the valid, protection, dirty, and use bits take a total of 4 bits and that
all the virtual pages are in use. Assume that disk addresses are not stored in the
page table. What is the total size of the page table for each process on this
machine?
Answer:
Each page is 16 KB, so the page offset is 14 bits.
The virtual page number has 40 − 14 = 26 bits, so there are 2^26 entries in the page table.
Each entry requires 36 − 14 = 22 bits to store the physical page number and an
additional 4 bits for the valid, protection, dirty, and use bits. We round the 26 bits
up to a full word per entry, so this gives a total size of 2^26 × 32 bits, or 256 MB.
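The sizing works out as follows (a sketch using the parameters from the question):

```python
VA_BITS, PA_BITS, PAGE = 40, 36, 16 * 1024

offset_bits = PAGE.bit_length() - 1          # 14-bit page offset
entries = 2 ** (VA_BITS - offset_bits)       # 2^26 virtual pages
entry_bits = (PA_BITS - offset_bits) + 4     # 22-bit PPN + valid/protection/dirty/use
entry_bits_rounded = 32                      # round the 26 bits up to a full word
size_bytes = entries * entry_bits_rounded // 8
print(entry_bits, size_bytes // 2**20)  # 26 256
```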


4. Assume that a hard disk in a computer transfers data in one-word chunks and can
transfer at 2 MB/sec. Assume that no transfer can be missed. Assume that the
number of clock cycles for polling operation is 100 and that the processor
executes with a 50 MHz clock. Determine the fraction of CPU time consumed by
the hard disk assuming that you poll often enough so that no data is ever lost.
Answer:
We must poll at a rate equal to the data rate in one-word chunks, which is 500K
times per second (2 MB per second / 4 bytes per transfer). Thus,
Cycles per second for polling = 500K × 100 = 50 × 10^6.
Ignoring the discrepancy between decimal and binary units,
Fraction of the processor consumed = (50 × 10^6) / (50 × 10^6) = 100%



92

1. A program runs in 10 seconds on computer A, which has a 100 MHz clock. We
are trying to help a computer designer build a machine B that will run this
program in 6 seconds. The designer has determined that a substantial increase in
the clock rate is possible, but this increase will affect the rest of the CPU design,
causing machine B to require 1.2 times as many clock cycles as machine A for
this program. What clock rate should the designer target for machine B?
Answer:
Cycles for executing the program on computer A = 10 s × 100 MHz = 10^9
Cycles needed for executing it on computer B = 1.2 × 10^9
Suppose the clock rate of computer B is R; then 1.2 × 10^9 / R = 6
R = 200 MHz

2. Use Booth's algorithm to compute 2 ten × (−3) ten.
Answer:
Iteration  Step                       Multiplicand  Product
0          Initial values             0010          0000 1101 0
1          10: Prod = Prod − Mcand    0010          1110 1101 0
           Shift right product        0010          1111 0110 1
2          01: Prod = Prod + Mcand    0010          0001 0110 1
           Shift right product        0010          0000 1011 0
3          10: Prod = Prod − Mcand    0010          1110 1011 0
           Shift right product        0010          1111 0101 1
4          11: No operation           0010          1111 0101 1
           Shift right product        0010          1111 1010 1
The product is the upper 8 bits, 1111 1010 two = −6 ten.
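The recoding behind the table can be checked with an arithmetic model (a sketch that works on Booth's recoded digits rather than on the shift register; `booth_recode_multiply` is a hypothetical name):

```python
def booth_recode_multiply(mcand, mplier_bits, bits=4):
    # Scan multiplier pairs (b_i, b_{i-1}): 10 -> subtract mcand*2^i (start of a
    # run of 1s), 01 -> add mcand*2^i (just past the end of a run).
    prod, prev = 0, 0
    for i in range(bits):
        cur = (mplier_bits >> i) & 1
        if (cur, prev) == (1, 0):
            prod -= mcand << i
        elif (cur, prev) == (0, 1):
            prod += mcand << i
        prev = cur
    return prod

print(booth_recode_multiply(2, 0b1101))  # 1101 is -3 in 4 bits -> -6
```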


3. Given the bit pattern 1000 1111 1110 1111 1100 0000 0000 0000. What does it
represent, assuming that it is
(1) a two's complement integer?
(2) an unsigned integer?
(3) a single precision floating-point number?
(4) a MIPS instruction?
Answer:
(1) −(2^30 + 2^29 + 2^28 + 2^20 + 2^14) ten
(2) (2^31 + 2^27 + 2^26 + 2^25 + 2^24 + 2^23 + 2^22 + 2^21 + 2^19 + 2^18 + 2^17 + 2^16 + 2^15 + 2^14) ten
(3) −1.110111111 two × 2^−96
(4) lw $t7, C000 hex ($ra)
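All four interpretations can be extracted from the same 32-bit word (a sketch; the field positions follow the MIPS I-format):

```python
import struct

bits = 0x8FEFC000   # 1000 1111 1110 1111 1100 0000 0000 0000

signed = bits - (1 << 32) if bits & (1 << 31) else bits   # (1) two's complement
unsigned = bits                                           # (2) unsigned
single = struct.unpack(">f", bits.to_bytes(4, "big"))[0]  # (3) IEEE 754 single

opcode = bits >> 26          # (4) 100011 two = 35 = lw
rs = (bits >> 21) & 0x1F     # 31 = $ra
rt = (bits >> 16) & 0x1F     # 15 = $t7
imm = bits & 0xFFFF          # C000 hex
print(signed, unsigned, single, opcode, rs, rt, hex(imm))
```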

4. Suppose you want to perform two sums: one is a sum of two scalar variables and
one is a matrix sum of a pair of two-dimensional arrays, size 1000 by 1000. What
speedup do you get with 1000 processors?
Answer:
Suppose T is the time required to add two variables.
Execution time for one processor = (1 + 1000 × 1000) T = 1,000,001 T
Execution time for 1000 processors = (1 + (1000 × 1000)/1000) T = 1001 T
So, speedup = 1000001 T / 1001 T ≈ 999



96

1. Finite State Machine (FSM) can be divided into two types, Moore machine and
Mealy machine. Please use an example to demonstrate that Mealy machine would
have glitches or spikes at the output.
Answer:
The output of the Mealy machine can change either when the state changes or
when the input changes. This may cause temporary false outputs to occur. These
temporary false outputs are referred to as glitches and spikes.
For example, in the following circuit, A is a signal directly from a primary input,
B is a signal from a state output (the Q output of a D flip-flop), and C is a
circuit output produced by a gate combining A and B.
(Circuit figure omitted.)

If signal A changes earlier than signal B, then a glitch occurs at output C.
The following is the timing diagram for the Mealy machine above.
(Timing diagram: A changes before B; in the interval between the two transitions,
C momentarily takes a false value — the glitch.)


2. Your company uses a benchmark C to evaluate the performance of a computer A
used in your company. But the computer A can only execute integer instructions,
and it uses a sequence of integer instructions to emulate a single floating-point
instruction. The computer A is rated at 200 MIPS on the benchmark C. Now,
your boss would like to attach a floating-point coprocessor B to the computer A
such that the floating-point instructions can be executed by the coprocessor for
performance improvement. Note that, however, the combination of computer A
and the coprocessor B is rated only at 60 MIPS on the same benchmark C. The
following symbols are used in this problem:
I: the number of integer instructions executed on the benchmark C.
F: the number of floating-point instructions executed on the benchmark C.

N: the number of integer instructions to emulate a floating-point instruction.
Y: time to execute the benchmark C on the computer A alone.
Z: time to execute the benchmark C on the combination of computer A and the
coprocessor B.
a. Write an equation for the MIPS rating of computer A using the symbols
above.
b. Given I = 5 × 10^6, F = 5 × 10^5, N = 30, find Y and Z.
c. Do you agree with your boss from the performance point of view? Please
state the reasons to justify your answer.
Answer:
(a) MIPS_A = (I + F × N) / (Y × 10^6)
(b) From (a), Y = (I + F × N) / (MIPS_A × 10^6)
      = (5 × 10^6 + 5 × 10^5 × 30) / (200 × 10^6) = 100 ms
    MIPS_{A+B} = (I + F) / (Z × 10^6), so Z = (I + F) / (MIPS_{A+B} × 10^6)
      = (5 × 10^6 + 5 × 10^5) / (60 × 10^6) = 91.67 ms
(c) Yes. Although the MIPS rating of the processor/coprocessor combination seems
lower than that of the processor alone, the performance is not. This is clearly
seen from the execution times, since it takes only 91.67 ms to execute the
program with the coprocessor present as opposed to 100 ms without it.
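The execution times follow directly from the two MIPS equations (a sketch using the given values):

```python
I, F, N = 5e6, 5e5, 30
MIPS_A, MIPS_AB = 200, 60

Y = (I + F * N) / (MIPS_A * 1e6)   # A alone executes I + F*N instructions
Z = (I + F) / (MIPS_AB * 1e6)      # with the coprocessor: only I + F instructions
print(Y * 1e3, round(Z * 1e3, 2))  # 100.0 91.67
```

Despite the lower MIPS rating, Z < Y, which is the point of part (c).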

3. Suppose that a computer's address size is 32 bits (using byte addressing), the
cache size is 32 Kbytes, the block size is 1-word, and the cache is 4-way set
associative. (a) What is the number of sets in the cache? (b) What is the total
number of bits needed to implement the cache? Please show your answer as the
exact total number of bits.
Answer:
(a) The number of blocks in the cache = 32 KB / 4 bytes = 8K
The number of sets in the cache = 8K / 4 = 2K
(b) The length of the index field = log2(2K) = 11 bits
The length of the byte-offset field = 2 bits
The length of the tag field = 32 − 11 − 2 = 19 bits
The size of the cache = 2K × (1 + 19 + 32) × 4 = 416 Kbits
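The bit count can be recomputed from the parameters (a sketch; each block stores a valid bit, the tag, and one 32-bit data word):

```python
ADDR_BITS, CACHE_BYTES, BLOCK_BYTES, WAYS = 32, 32 * 1024, 4, 4

blocks = CACHE_BYTES // BLOCK_BYTES          # 8K blocks
num_sets = blocks // WAYS                    # 2K sets
index_bits = num_sets.bit_length() - 1       # 11
offset_bits = BLOCK_BYTES.bit_length() - 1   # 2
tag_bits = ADDR_BITS - index_bits - offset_bits     # 19
total_bits = num_sets * WAYS * (1 + tag_bits + 32)  # valid + tag + data per block
print(num_sets, tag_bits, total_bits // 1024)  # 2048 19 416
```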


4. A virtual memory system often implements a TLB to speed up the
virtual-to-physical address translation. A TLB has the following characteristics.
Assume each TLB entry has a valid bit, a dirty bit, a tag, and the page number.
Determine the exact total number of bits to implement this TLB.
It is direct-mapped
It has 16 entries
The page size is 4 Kbytes
The virtual address space is 4 Gbytes
The physical memory is 1 Gbytes
Answer:
The length of the virtual page number = log2(4 GB / 4 KB) = 20 bits
The length of the physical page number = log2(1 GB / 4 KB) = 18 bits
The index field = log2(16) = 4 bits
The tag field = 20 − 4 = 16 bits
The bits in each entry of the TLB = 2 + 16 + 18 = 36
The size of the TLB = 36 × 16 = 576 bits

5. Suppose that we have a system with the following characteristics:
A memory and bus system supporting block access of 4 words.
A 64-bit synchronous bus clocked at 200 MHz, with each 64-bit transfer
taking 1 clock cycle, and 1 clock cycle required to send an address to
memory.
2 clock cycles needed between each bus transaction. (Assume the bus is idle
before an access).
A memory access time of 4 words is 300 ns.
Find the sustained bandwidth for a read of 256 words. Provide your answer in
MB/sec.
Answer:
1. 1 clock cycle to send the address to memory
2. 300 ns / (5 ns/cycle) = 60 clock cycles to read memory
3. 2 clock cycles to send the data from the memory
4. 2 idle clock cycles between this transfer and the next
This is a total of 65 cycles, and 256/4 = 64 transactions are needed, so the entire
transfer takes 65 × 64 = 4160 clock cycles. Thus the latency is 4160 cycles × 5
ns/cycle = 20,800 ns. The bus bandwidth is (256 × 4) bytes × (1 sec / 20,800 ns) =
49.23 MB/sec
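The four components above combine as follows (a sketch; MB is treated as 10^6 bytes):

```python
CYCLE_NS = 1e9 / 200e6                        # 5 ns per cycle at 200 MHz
cycles_per_txn = 1 + 300 / CYCLE_NS + 2 + 2   # address + memory read + data + idle
txns = 256 // 4                               # one transaction per 4-word block
latency_ns = cycles_per_txn * txns * CYCLE_NS
bandwidth = (256 * 4) / (latency_ns * 1e-9) / 1e6   # MB/sec
print(cycles_per_txn, latency_ns, round(bandwidth, 2))  # 65.0 20800.0 49.23
```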



95

1. Design an array multiplier that multiplies two 3-bit integers in two's-complement
format and produces one 6-bit integer, also in two's-complement format.
Answer:
The product is formed from the partial-product array

                 b2    b1    b0
          ×      a2    a1    a0
        ------------------------
               a0b2  a0b1  a0b0
         a1b2  a1b1  a1b0
 + a2b2  a2b1  a2b0
 -------------------------------
   c5    c4    c3    c2    c1    c0

For two's-complement operands, the partial products that involve exactly one
sign bit (a0b2, a1b2, a2b1 and a2b0, shown complemented in the original figure)
carry negative weight. The original answer therefore builds the array from
full-adder (FA) cells for the ordinary partial products and full-subtractor (FS)
cells for these sign-bit terms, chaining the carries/borrows C0–C4 to produce
the 6-bit result c5…c0.
FA: full adder
FS: full subtractor
(Gate-level array diagrams omitted.)
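Behaviorally, the array computes the following (a sketch; the sign-bit row of the multiplier carries weight −4, which is why its partial products are subtracted; `mul3` is a hypothetical name):

```python
def mul3(a_bits, b_bits):
    # 3-bit two's complement: bit 2 has weight -4
    def val(x):
        return (x & 0b011) - (x & 0b100)
    a = val(a_bits)
    # rows of the array: +a*b0*1, +a*b1*2, -a*b2*4
    p = a * ((b_bits >> 0) & 1)
    p += a * ((b_bits >> 1) & 1) * 2
    p -= a * ((b_bits >> 2) & 1) * 4
    return p & 0x3F          # 6-bit two's-complement result

print(format(mul3(0b101, 0b010), "06b"))  # (-3) * 2 = -6 -> 111010
```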

2. Design a synchronous sequential machine that has one input X(t) and one output
Y(t). Y(t) should be 1 if there have been more 1s than 0s in the input over
the past 3 time steps, and 0 otherwise. Below is a sample sequence:

t 0 1 2 3 4 5 6 7 8 9 10
X(t) 0 1 0 1 1 0 1 0 1 0 1
Y(t) - - - 0 1 1 1 1 0 1 0
Answer:
X(t-3) X(t-2) X(t-1) Y(t)
0 0 0 0
0 0 1 0
0 1 0 0
0 1 1 1
1 0 0 0
1 0 1 1
1 1 0 1
1 1 1 1
Y(t) = X(t-3)X(t-2) + X(t-2)X(t-1) + X(t-3)X(t-1)

(Circuit: X(t) feeds a 3-bit shift register whose taps X(t−1), X(t−2), X(t−3)
drive the two-level majority logic producing Y(t).)
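The majority function and the sample sequence can be cross-checked (a sketch; `majority_window` is a hypothetical name):

```python
def majority_window(xs):
    # Y(t) = 1 when at least two of X(t-3), X(t-2), X(t-1) are 1
    return [int(xs[t - 3] + xs[t - 2] + xs[t - 1] >= 2) for t in range(3, len(xs))]

X = [0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1]
print(majority_window(X))  # [0, 1, 1, 1, 1, 0, 1, 0]
```

The output matches the Y(t) row of the sample sequence for t = 3…10.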

82
3. Design a vector-interrupt controller (VIC) that has four interrupt sources A, B, C
and D with fixed priority A < B < C < D. If any interrupt occurs, the
VIC should output the ID of the interrupting source with the highest priority. For
example, if (A, B, C, D) = (0, 1, 0, 1), then the VIC should set Interrupt
Occurred to 1 and Source ID to 11, indicating that D is the interrupt source for
the host to serve. On the other hand, if (A, B, C, D) = (0, 0, 0, 0), then the VIC sets
Interrupt Occurred to 0, indicating that no service is required.





Answer:
Inputs (A B C D)      Outputs (ID1 ID0 INT)
0 0 0 0               X X 0
1 0 0 0               0 0 1
X 1 0 0               0 1 1
X X 1 0               1 0 1
X X X 1               1 1 1











ID1 = C + D
ID0 = D + BC'
INT = A + B + C + D
(The Karnaugh maps for ID1 and ID0 and the resulting gate-level circuit follow
directly from the truth table above.)
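The three equations can be exercised against the specification (a sketch; `vic` is a hypothetical name returning the Interrupt Occurred flag and the 2-bit Source ID):

```python
def vic(a, b, c, d):
    # Fixed priority A < B < C < D
    int_occ = int(a or b or c or d)          # INT = A + B + C + D
    id1 = int(c or d)                        # ID1 = C + D
    id0 = int(d or (b and not c))            # ID0 = D + B·C'
    return int_occ, (id1 << 1) | id0

print(vic(0, 1, 0, 1))  # (1, 3): D wins, Source ID = 11
print(vic(0, 0, 0, 0))  # (0, 0): no service required
```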

4. Consider a loop branch that branches nine times in a row, then is not taken once.
Assume that we are using a dynamic branch prediction scheme.
(a) What is the prediction accuracy for this branch if a simple 1-bit prediction
scheme is used?
(b) What is the prediction accuracy for this branch if a 2-bit prediction scheme is
used?
(c) Please draw the finite state machine for a 2-bit prediction scheme.
Answer:
(a) 8/10 × 100% = 80%
(b) 9/10 × 100% = 90%
(c) (Figure: the standard 2-bit saturating-counter FSM — two "predict taken"
states and two "predict not taken" states; each taken branch moves one state
toward strongly taken, each not-taken branch one state toward strongly not
taken, so the prediction flips only after two consecutive mispredictions.)
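Both accuracies can be simulated (a sketch; the predictors start in the state left behind by the previous pass through the loop, which is what gives the steady-state 80% and 90% figures):

```python
def one_bit(outcomes, pred=0):
    # pred=0: the previous pass ended with a not-taken branch
    correct = 0
    for o in outcomes:
        correct += (pred == o)
        pred = o
    return correct / len(outcomes)

def two_bit(outcomes, state=2):
    # 2-bit saturating counter; state 2 = weakly taken after the previous pass
    correct = 0
    for o in outcomes:
        correct += ((state >= 2) == o)
        state = min(state + 1, 3) if o else max(state - 1, 0)
    return correct / len(outcomes)

loop = [1] * 9 + [0]     # taken nine times, then not taken
print(one_bit(loop), two_bit(loop))  # 0.8 0.9
```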

5. What feature of a write-through cache makes it more desirable than a write-back
cache in a multiprocessor system (with a shared memory)? On the other hand,
what feature of a write-back makes it more desirable than a write-through cache
in the same system?
Answer:
(a) Write-through keeps the shared memory and the cache consistent, thereby
reducing the complexity of the cache-coherence protocol.
(b) Write-back reduces bus traffic and thereby allows more processors on a single
bus.

6. Consider a fully associative cache and a direct mapped cache with the same cache
size.
(a) Explain which one has a lower cache miss rate and why?
(b) The majority of processor caches today are direct-mapped, two-way set
associative, or four-way set associative, but not fully associative. Why?
Answer:

(a) The fully associative cache, because full associativity eliminates the misses
caused by multiple memory locations competing for the same cache location
(conflict misses).
(b) Because the cost of the extra comparators and the delay imposed by having to
do the comparison for a fully associative cache are too high.

94

1. The terms big-endian and little-endian were originally found in Jonathan Swift's
book, Gulliver's Travels. Now all processors must be designated as either
big-endian or little-endian. For example, DEC Alpha RISC and Intel 80x86
processors are little-endian. Motorola 6800 microprocessors and Sun
SuperSPARC are big-endian.
(a) Briefly explain the differences between big-endian and little-endian.
(b) Please illustrate big-endian and little-endian by considering the number 4097
stored in a 4-byte integer.

Address Big-Endian representation Little-Endian representation
00
01
02
03
Answer:
(a) In a big-endian system, the most significant value in the sequence is stored at
the lowest storage address (i.e., first). In a little-endian system, the least
significant value in the sequence is stored first.
(b) 4097 = 00001001 hex

Address  Big-endian representation  Little-endian representation
00       00 hex                     01 hex
01       00 hex                     10 hex
02       10 hex                     00 hex
03       01 hex                     00 hex

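The two byte orders can be produced directly (a sketch using the standard struct module):

```python
import struct

big = struct.pack(">I", 4097)     # big-endian: most significant byte first
little = struct.pack("<I", 4097)  # little-endian: least significant byte first
print(big.hex(), little.hex())    # 00001001 01100000
```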


2. A computer whose processes have 1024 pages in their address spaces keeps its
page tables in memory. The overhead required for reading a word from the page
table is 500 nsec. In order to reduce the overhead, the computer has a TLB
(Translation Look-aside Buffer), which holds 32 (virtual page, physical page
frame) pairs, and can do a look up in 100 nsec. What hit rate is needed to reduce
the mean overhead to 200 nsec?
Answer:
100 ns + (1 − H) × 500 ns = 200 ns; H = 0.8

3. Assume the instruction ADD R0, R1, R2, LSL#2 in one instruction set
architecture performs the operation R0 = R1 + R2 × 4. Could you compute R0 = 99
× R1 using two ADD instructions? Write down the code if your answer is YES;
otherwise, state the reasons.

Answer: YES
ADD R0, R1, R1, LSL#5 // R0 = R1 + 32R1 = 33R1
ADD R0, R0, R0, LSL#1 // R0 = 33R1 + 66R1 = 99R1
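The shift-and-add arithmetic checks out (a sketch with an arbitrary sample value for R1):

```python
r1 = 7                  # any sample value
r0 = r1 + (r1 << 5)     # ADD R0, R1, R1, LSL#5 -> R1 + 32*R1 = 33*R1
r0 = r0 + (r0 << 1)     # ADD R0, R0, R0, LSL#1 -> 33*R1 + 66*R1 = 99*R1
print(r0, 99 * r1)  # 693 693
```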


4. Forwarding is a technique to eliminate the data hazards occurred among the
pipelining instructions. However, not all data hazards can be handled by the
forwarding. If an instruction following a LOAD instruction depends on the results
of the LOAD instruction, the data hazard is occurred, and the pipeline is stalled
one cycle.
(a) Assume the percentage of LOAD instruction is 20% is a program, and half the
time the instruction following a LOAD instruction needs the result of the
LOAD instruction. What is the performance degradation due to the data
hazard?
(b) Pipeline scheduling or instruction scheduling techniques could be used to
eliminate the data hazard mentioned in (a). What is the philosophy of pipeline
scheduling? Use an example to demonstrate that pipeline scheduling can
eliminate the data hazard.
(c) What is the possible overhead of pipeline scheduling?
Answer:
(a) With no hazards the pipelined machine has a CPI of 1, and each load-use data
hazard stalls the pipeline one clock cycle. With 20% of instructions being
LOADs and half of them immediately followed by an instruction that uses the
loaded value, CPI = 1 + 0.2 × 0.5 × 1 = 1.1, so the pipeline's performance
degradation is (1.1 − 1)/1 = 10%.
(b) Rather than just allowing the pipeline to stall, the compiler could try to
schedule the pipeline to avoid these stalls by rearranging the code sequence
to eliminate the hazards.
In the example below, each add uses the result of the immediately preceding
lw, so each lw/add pair incurs one stall. Moving the independent instructions
into the load delay removes both data hazards:
Original sequence:
lw $t2, 4($t0)
add $t3, $t1, $t2
sub $t6, $t6, $t7
lw $t4, 8($t0)
add $t5, $t1, $t4
and $t8, $t8, $t9
Scheduled sequence:
lw $t2, 4($t0)
lw $t4, 8($t0)
sub $t6, $t6, $t7
add $t3, $t1, $t2
add $t5, $t1, $t4
and $t8, $t8, $t9
(c) Pipeline scheduling increases the number of registers used and compiler
overhead.

5. State two reasons that MIPS is not an accurate measure for comparing
performance among computers.
Answer: (any two of the following)

1. MIPS specifies the instruction execution rate but does not take into account
the capabilities of the instructions
2. MIPS varies between programs on the same computer
3. MIPS can vary inversely with performance

6. (a) Consider the following sequence of address references given as word
addresses:
22, 10, 26, 30, 23, 18, 10, 14, 30, 11, 15, 19
For a 2-way set associative cache with a block size of 8 bytes, a word size of 4
bytes, a data capacity of 64 bytes and the LRU replacement, label each
reference in the sequence as a hit or a miss. Assume that the cache is initially
empty.
(b) Determine the number of bits required in each entry of a TLB that has the
following characteristics:
- The TLB is direct-mapped
- The TLB has 32 entries
- The page size is 1024 bytes
- Virtual byte addresses are 32 bits wide
- Physical byte addresses are 31 bits wide
Note that you only need to consider the following items for each entry:
- The valid bit
- The tag
- The physical page number
Answer: (a)
Referenced address (decimal)  Referenced address (binary)  Tag  Index  Hit/Miss  Set  Contents (Block0 | Block1)
22 10110 10 11 Miss 3 22,23
10 01010 01 01 Miss 1 10,11
26 11010 11 01 Miss 1 10,11 26,27
30 11110 11 11 Miss 3 22,23 30,31
23 10111 10 11 Hit 3 22,23 30,31
18 10010 10 01 Miss 1 18,19 26,27
10 01010 01 01 Miss 1 18,19 10,11
14 01110 01 11 Miss 3 22,23 14,15
30 11110 11 11 Miss 3 30,31 14,15
11 01011 01 01 Hit 1 18,19 10,11
15 01111 01 11 Hit 3 30,31 14,15
19 10011 10 01 Hit 1 18,19 10,11
(b) The page size is 1024 bytes, so the page offset has 10 bits.
Hence, the physical page number has 31 − 10 = 21 bits.
The TLB has 32 entries, so the index has 5 bits, and the tag size = 32 − 10 − 5 = 17 bits.
So, the number of bits in each entry = 1 + 17 + 21 = 39 bits

7. (1) The average memory access time (AMAT) is defined as
AMAT = time for a hit + miss rate × miss penalty
Consider the following two machines:
- Machine 1: 100 MHz, a hit time of 1 clock cycle, a miss rate of 5% and a
miss penalty of 20 clock cycles
- Machine 2: 100 MHz, a hit time of 1.2 clock cycles, a miss rate of 3% and
a miss penalty of 25 clock cycles
Determine which machine has smaller AMAT
(2) Assume that you are running a program which uses a lot of data and that 50%
of the data the program needs causes page faults and must be retrieved from a
disk array. If the program needs data at the rate of 600 Mbytes/second and each
disk in the disk array can supply data at the rate of 30 Mbytes/second, what is
the minimum number of disks required in the disk array? You do not need to
worry about disk errors.
(3) Assume that there are 10 pairs of processors and disk arrays placed all over a
network. Each processor needs data at the rate of 600 Mbytes/second from its
disk array across the network. If one-third of the total traffic crosses the
bisection of the network, what is the bisection bandwidth needed (in
Mbytes/second)?
Answer:
(1) AMAT for Machine 1 = 10 ns × (1 + 0.05 × 20) = 20 ns
AMAT for Machine 2 = 10 ns × (1.2 + 0.03 × 25) = 19.5 ns
Hence, Machine 2 has the smaller AMAT
(2) (600 × 0.5) / 30 = 10 disks
(3) 10 × 600 × (1/3) = 2000 MB/second
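All three parts are direct plug-ins (a sketch; `amat_ns` is a hypothetical helper name):

```python
def amat_ns(clock_mhz, hit_cycles, miss_rate, penalty_cycles):
    cycle_ns = 1e3 / clock_mhz
    return cycle_ns * (hit_cycles + miss_rate * penalty_cycles)

m1 = amat_ns(100, 1.0, 0.05, 20)    # 20 ns
m2 = amat_ns(100, 1.2, 0.03, 25)    # 19.5 ns
disks = (600 * 0.5) / 30            # only page-faulting data hits the disk array
bisection = 10 * 600 / 3            # one-third of total traffic crosses the bisection
print(m1, m2, disks, bisection)
```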

93

1. Consider a 16-bit processor which includes a register file of 4 registers (R0-R3).
R3 is hardwired to act as the program counter. This processor has only one
instruction Rd = Rs + #immed, and is implemented using a 3-stage pipeline.
(1) Instruction Fetch (IF): Instruction Register (IR) = MEM[R3]; R3 += 2;
(2) Instruction Decoding (ID): Decode IR to determine Rd, Rs, and #immed.
(3) Execution (EX): Execute Rd = Rs + #immed.
The operation R3 += 2 of the IF-stage and the write-back of Rd in the EX-stage
occur at the very end of each clock cycle (CC). If there is a conflict (both EX: Rd
and IF: R3 are writing to the register R3), the write-back R3 overrides the
operation R3 += 2 in the IF-stage. Consider executing the following instruction
sequence and its timing chart of the pipeline operation
Instruction                     Clock cycle (CC)
Address  Instruction            1    2    3    4    5    6    7
0x0100   R0 = R3 + #0x001       IF   ID   EX
0x0102   R3 = R3 + #0x010            IF   ID   EX
0x0104   R2 = R3 + #0x100                 IF   ID   EX
0x0106   R2 = R0 + #0x200                      IF   ID   EX
0x????   R1 = R0 + #0x300                           IF   ID   EX
(a) Right after CC = 1, what is the hexadecimal value stored in the register R3?
(b) Right after CC = 3, what is the hexadecimal value stored in the register R0?
(c) Right after CC = 4, what is the hexadecimal value stored in the register R3?
(d) Right after CC = 6, what is the hexadecimal value stored in the register R2?
(e) What is the hexadecimal address of last instruction in the table?
Answer:
(a) 0x0102  (b) 0x0103  (c) 0x0114  (d) 0x0303  (e) 0x0114
(Note: registers are read in the ID stage at the beginning of the clock cycle, so
an instruction decoded in cycle t sees the R3 value produced at the end of cycle
t − 1; for example, the first instruction, decoded in cycle 2, reads R3 = 0x0102.)
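A cycle-by-cycle sketch of this 3-stage pipeline confirms all five answers (the instruction encoding is simplified to (dest, src, immediate) tuples; registers are read in ID at the start of a cycle, and the EX write-back at the end of a cycle overrides the IF-stage R3 += 2 when both target R3):

```python
mem = {0x0100: ("R0", "R3", 0x001), 0x0102: ("R3", "R3", 0x010),
       0x0104: ("R2", "R3", 0x100), 0x0106: ("R2", "R0", 0x200),
       0x0114: ("R1", "R0", 0x300)}
regs = {"R0": 0, "R1": 0, "R2": 0, "R3": 0x0100}
if_latch = id_latch = None
fetched, r3_trace = [], []
for cc in range(1, 8):
    ex = id_latch                                   # op entering EX this cycle
    id_latch = (if_latch[0], regs[if_latch[1]] + if_latch[2]) if if_latch else None
    if_latch = mem.get(regs["R3"])                  # IF: fetch at current R3
    if if_latch:
        fetched.append(regs["R3"])
    new_r3 = regs["R3"] + 2                         # IF-stage increment...
    if ex:
        regs[ex[0]] = ex[1]                         # ...EX write-back wins on conflict
    if not (ex and ex[0] == "R3"):
        regs["R3"] = new_r3
    r3_trace.append(regs["R3"])
print(hex(r3_trace[0]), hex(regs["R0"]), hex(r3_trace[3]),
      hex(regs["R2"]), hex(fetched[-1]))  # 0x102 0x103 0x114 0x303 0x114
```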


2. Suppose you are given an instruction set architecture (ISA) which includes only
two instruction formats
SUB Rd, Rs, #immed /* Rd = Rs - #immed */
ADD Rd, Rs, Rt /* Rd = Rs + Rt */
where Rd is the 8-bit destination register, Rs and Rt are the 8-bit source registers
and the immediate value can be an integer between -4 to 3. How would you
translate each of the following pseudo-instructions into one or multiple real ISA
instructions?
(1) MOV Rd, Rs /*Rd = Rs*/
(2) INC Rd /*Rd++ */
(3) MOV Rd, Rs, lsl 1 /* Rd = Rs << 1; i.e., left shift */
(4) CLEAR Rd /* Rd = 0 */
Answer:
(1) SUB Rd, Rs, #0
(2) SUB Rd, Rd, #-1
(3) ADD Rd, Rs, Rs
(4) ADD Rd, Rd, Rd
ADD Rd, Rd, Rd
ADD Rd, Rd, Rd
ADD Rd, Rd, Rd
ADD Rd, Rd, Rd
ADD Rd, Rd, Rd
ADD Rd, Rd, Rd
ADD Rd, Rd, Rd
(Note: doubling an 8-bit register eight times shifts it left by 8, leaving 0.)
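The four translations can be checked on 8-bit registers (a sketch; `sub` and `add` model the two real ISA instructions with 8-bit wraparound):

```python
MASK = 0xFF   # 8-bit registers

def sub(rs, imm):
    return (rs - imm) & MASK      # SUB Rd, Rs, #immed

def add(rs, rt):
    return (rs + rt) & MASK       # ADD Rd, Rs, Rt

r = 0x2B
mov = sub(r, 0)                   # (1) MOV
inc = sub(r, -1)                  # (2) INC (immediate -1 is within -4..3)
lsl = add(r, r)                   # (3) left shift by 1
clear = r
for _ in range(8):                # (4) CLEAR: eight doublings = << 8 = 0 mod 256
    clear = add(clear, clear)
print(mov, inc, lsl, clear)
```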

3. Suppose Booth's algorithm is used as our approach to multiplying two 8-bit
unsigned integer numbers. How many additions and subtractions are needed to
multiply 123 by 123?
Answer: Multiplier: 123 ten = 0111 1011 two
Scanning the multiplier from right to left (with an appended initial 0), Booth's
algorithm subtracts at the start of each run of 1s (bit pairs 10, at bits 0 and 3)
and adds just past the end of each run (bit pairs 01, at bits 2 and 7):
additions: 2, subtractions: 2

4. In a computer memory hierarchy, determine which of the following five
combinations of events (for locating a page of memory) in the cache, TLB, page
table and main memory are possible to occur. Answer Yes or No to each of (1), (2),
(3), (4) and (5).

TLB  Page Table  Cache  Main Memory
(1) Hit Hit Hit Miss
(2) Hit Hit Miss Miss
(3) Miss Hit Hit Miss
(4) Miss Miss Hit Hit
(5) Hit Miss Hit Hit
Answer: (1) No (2) No (3) No (4) No (5) No

5. For a CPU to effectively handle service requests from peripheral devices,
vectored interrupt is a popular mechanism. To implement vectored interrupt
function a combinational circuit called priority encoder is commonly used.
(1) What is the operation principle of vectored interrupt?
(2) What is a priority encoder? How is it different from an ordinary encoder?
Answer:
(1) An interrupt vector is the memory address of an interrupt handler, or an index
into an array called an interrupt vector table. Interrupt vector tables contain
the memory addresses of interrupt handlers. When an interrupt is generated,
the processor saves its execution state, and begins execution of the interrupt
handler at the interrupt vector.
(Diagram: a peripheral asserts the interrupt line to the interrupt controller, which passes the interrupt vector to the processor.)

(2) A priority encoder encodes only the highest-order active input, even if
multiple inputs are activated.
The ordinary encoder has the limitation that only one input can be active at
any given time. If two inputs are active simultaneously, the output produces
an undefined combination.

92

1. (1) Give the flow diagram of the procedures for multiplying two binary
floating-point numbers.
(2) Multiply the two decimal numbers 0.75_ten and -0.375_ten by using the steps from
your answer in (1). Show the step-by-step intermediate results in your answer.
Answer:
(1) The procedure consists of five steps: add the exponents, multiply the significands, normalize the product, round the product, and set the sign.
(2) In binary, the task is 1.100_two × 2^(-1) times -1.100_two × 2^(-2).

Step 1: Adding the exponents: (-1 + 127) + (-2 + 127) - 127 = 124 (the biased exponent for -3).

Step 2: Multiplying the significands: 1.100_two × 1.100_two = 10.010000_two, so the product is 10.010000_two × 2^(-3); keeping it to 4 bits, it is 10.01_two × 2^(-3).

Step 3: Normalizing the product: 1.001_two × 2^(-2). Since -126 <= -2 <= 127, there is no overflow or underflow.

Step 4: Rounding the product makes no change: 1.001_two × 2^(-2).

Step 5: Making the sign of the product negative: -1.001_two × 2^(-2).

Converting to decimal: -1.001_two × 2^(-2) = -0.01001_two = -9/2^5 = -0.28125_ten
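The five steps can be replayed numerically. A sketch in which Python floats stand in for the significands, and BIAS is the single-precision bias of 127:

```python
# Walk the multiplication steps for 0.75 x (-0.375) in single precision.
BIAS = 127

sig1, e1 = 1.5, -1      # 0.75  = 1.100_two x 2^-1
sig2, e2 = 1.5, -2      # 0.375 = 1.100_two x 2^-2

# Step 1: add the biased exponents, subtracting one bias.
biased = (e1 + BIAS) + (e2 + BIAS) - BIAS
assert biased == 124                      # true exponent -3

# Step 2: multiply the significands.
sig = sig1 * sig2                         # 2.25 = 10.01_two
assert sig == 2.25

# Step 3: normalize (shift the significand right, bump the exponent).
e = biased - BIAS
while sig >= 2.0:
    sig /= 2.0
    e += 1
assert (sig, e) == (1.125, -2)            # 1.001_two x 2^-2

# Step 5: apply the negative sign and convert to decimal.
result = -sig * 2.0 ** e
assert result == -0.28125
assert result == 0.75 * -0.375
```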






2. (a) What are the five steps required for the normal MIPS instructions? Briefly
describe each step in one sentence.
(b) Consider the following two contiguous MIPS instructions.
add $s0, $t0, $t1
sub $t2, $s0, $t3
What solution can be used to resolve the data hazard problem in the two
instructions? Give a graphical instruction-pipeline representation of your
solution.
Answer:
(a) 1. Instruction Fetch
2. Instruction Decode and Register Fetch
3. Execution, Memory Address Computation, or Branch Completion
4. Memory Access or R-type instruction completion
5. Memory Read Completion
(b) Forwarding: the ALU result of add $s0, $t0, $t1 is forwarded from the EX/MEM pipeline register directly to the ALU input of sub $t2, $s0, $t3, so no stall is needed.








3. Suppose we have a processor with a base CPI of 1.0, assuming all references hit in
the primary cache, and a clock rate of 800 MHz. Assume a main memory access
time of 125 ns, including all the miss handling. Suppose the miss rate per
instruction at the primary cache is 4%. What is the total CPI for this machine with
one level of caching? Now we add a secondary cache that has a 20 ns access time
for either a hit or a miss, and the secondary cache is large enough to reduce the
miss rate to main memory to 2%. What is the total CPI for this machine with a
two-level cache?
Answer:
(1) CPU clock cycle time = 1 / 800 MHz = 1.25 ns
Miss penalty for main memory = 125 / 1.25 = 100 clock cycles
CPI for machine with one-level cache = 1 + 100 × 0.04 = 5
(2) Miss penalty for second-level cache = 20 / 1.25 = 16 clock cycles
CPI for machine with two-level cache = 1 + 0.04 × 16 + 0.02 × 100 = 3.64
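The same arithmetic, spelled out as a small sketch following the numbers in the problem:

```python
# CPI with one and with two levels of caching (800 MHz clock).
clock_ns = 1.25                       # 1 / 800 MHz
main_penalty = 125 / clock_ns         # 100 clock cycles
l2_penalty = 20 / clock_ns            # 16 clock cycles

cpi_one_level = 1.0 + 0.04 * main_penalty
cpi_two_level = 1.0 + 0.04 * l2_penalty + 0.02 * main_penalty

assert cpi_one_level == 5.0
assert abs(cpi_two_level - 3.64) < 1e-12
```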


4. (1) What is the ideal performance improvement for an n-stage pipeline machine?
(2) Give two reasons, other than data hazards, why a pipelined machine cannot
achieve this ideal performance.
(3) What are the methods to remove data hazards?
(4) Describe what a carry-save adder tree is.
Answer:
(1) n times faster than the machine without pipelining
(2) (a) The stages may be imperfectly balanced
(b) The delay due to pipeline registers
(c) Control hazard.
(d) Time to fill and drain the pipeline
(3) (a) Insert nop instruction by compiler.
(b) Reorder code sequence by compiler.
(c) Forwarding by hardware.
(4) It is a tree of carry-save adders arranged to add the operands in parallel.
Each carry-save adder in each level adds three operands and produces two
results. A carry-save adder tree is usually used for the addition of the partial
products in multiplication.
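The defining property of one carry-save level, three operands in and two out, is easy to verify. A sketch on plain Python integers, with word widths ignored:

```python
# One carry-save adder level: three operands in, two (sum, carry) out.
def csa(a, b, c):
    s = a ^ b ^ c                                # bitwise sum without carries
    carry = ((a & b) | (a & c) | (b & c)) << 1   # carries, shifted into place
    return s, carry

# The pair (s, carry) represents a + b + c; one final normal add combines them.
s, carry = csa(13, 27, 8)
assert s + carry == 13 + 27 + 8
```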

5. RAID (redundant arrays of inexpensive disks) have been widely used to speed up
the disk access time. Several levels of RAID are supported. Please make the right
binding between the following RAID levels and explanations.
(1) RAID-0 (A) block-interleaved parity
(2) RAID-1 (B) non-redundant striping
(3) RAID-4 (C) mirrored disks
(4) RAID-5 (D) block-interleaved distributed parity
Answer:
RAID-0: (B) non-redundant striping
RAID-1: (C) mirrored disks
RAID-4: (A) block-interleaved parity
RAID-5: (D) block-interleaved distributed parity


6. DSP processors are increasingly employed in embedded systems for supporting
audio and video applications. Explain the key features of DSP processors that
differ from conventional general-purpose processors.
Answer:
The essential difference between a DSP and a microprocessor is that a DSP
processor has features designed to support high-performance, repetitive,
numerically intensive tasks. In contrast, general-purpose processors are not

specialized for a specific kind of applications.
Features that accelerate performance in DSP applications include:
1. Single-cycle multiply-accumulate capability. High-performance DSPs often
have two multipliers that enable two multiply-accumulate operations per
instruction cycle.
2. Specialized addressing modes. DSPs generally feature multiple-access
memory architectures that enable DSPs to complete several accesses to
memory in a single instruction cycle.
3. Specialized execution control. Usually, DSP processors provide a loop
instruction that allows tight loops to be repeated without spending any
instruction cycles for updating and testing the loop counter or for jumping
back to the top of the loop.
4. DSP processors are known for their irregular instruction sets, which generally
allow several operations to be encoded in a single instruction.



96

1. (Choice)
(1) Which is (are) correct?
a. Suppose there were a 16-bit IEEE 754-like floating-point format with 5
exponent bits. The likely range of numbers it could represent is
1.0000 0000 00_two × 2^(-15) to 1.1111 1111 11_two × 2^(14), plus ±0, ±∞, and NaN.
b. For the 32-bit IEEE 754 floating-point standard, the smallest positive normalized
number is 1.0000 0000 0000 0000 0000 000_two × 2^(-125).
c. For the 32-bit IEEE 754 floating-point standard, the smallest denormalized
number is 0.0000 0000 0000 0000 0000 001_two × 2^(-126).
Answer: c
(a should be 1.0000 0000 00_two × 2^(-14) to 1.1111 1111 11_two × 2^(15), plus ±0, ±∞, and NaN)
(b should be 1.0000 0000 0000 0000 0000 000_two × 2^(-126))


(2) Some programming languages allow two's complement integer arithmetic on
variables declared byte and half word, i.e., 16 bits. What MIPS instructions would
be used?
a. Load with lbu, lhu; arithmetic with add, sub, mult, div; then storing using sb,
sh.
b. Load with lb, lh, arithmetic with add, sub, mult, div; then storing using sb, sh.
c. Loads with lb, lh; arithmetic with add, sub, mult, div; using and to mask result
to 8 or 16 bits after operation; then store using sb, sh.
Answer: b

(3) A carry-lookahead adder can diminish the carry delay that dominates the delay
of a ripple-carry adder. The generate (g_i) and propagate (p_i) functions are the two
main operations of a carry-lookahead adder. Assume a and b are the two operands
and c_(i+1) is the carry out of level i and the carry in of level i + 1. Which is (are) correct?
a. g_i = a_i · b_i
b. p_i = (a_i + b_i) · c_i
c. If g_i equals 1, we can say the carry out of level i is 1.
d. A carry-lookahead adder can be extended to a multi-level style. The first group
generate of a 3-bit group can then be defined as G_0 = g_2 + (p_2 · g_1) + (p_2 · p_1 · g_0)
Answer: a, c, d
(b should be p_i = a_i + b_i)











Structure a Structure b
(4) The above figure shows two multiplication structures. Which is correct?
a. The shift operation on the multiplicand in structure a is shift-right.
b. The shift operation on the multiplier in structure a is shift-right.
c. The multiplier is stored in the right part of the product register in structure b.
d. In structure b, one control signal for shifting the multiplicand register is missing.
Answer: b
(a is wrong: the multiplicand in structure a is shifted left)
(c is wrong: the multiplier is stored in the right part of the product register only
initially; it is gradually shifted out as the multiplication completes)

(5) About the 32-bit MIPS instructions, which description is correct?
a. MIPS has 32 registers inside the CPU because it is a 32-bit CPU.
b. The add instruction cannot directly store the addition result to memory.
c. Since the memory structure is byte-addressed, the address offset in the beq
instruction is counted in bytes.
d. In MIPS, "branch-if-less-than" is realized using slt and beq/bne, since its
design principle is that two faster instructions are more useful than one slow,
complicated instruction.
Answer: b
(a: "32-bit CPU" means the length of a register is 32 bits, not the number of registers)
(c: the address offset in the beq instruction is counted in words)
(d: the design principle is "smaller is faster", i.e., keeping the instruction set small)

2. (a) What is a procedure frame? Also, a stack pointer and a frame pointer are used to
maintain the procedure frame. Why does the procedure frame require two pointers?
(b) A procedure has to spill registers to memory (save and then restore). The caller
must take care of the $aX and $tX registers, and the callee must take care of $ra
and the $sX registers. The following code requires corrections for spilling
registers. Correct the errors and state your reasons.




fact:
    addi $sp, $sp, -4
    sw   $ra, 0($sp)
    slti $t0, $a0, 1
    beq  $t0, $zero, L1
    addi $v0, $zero, 1
    addi $sp, $sp, 4
    jr   $ra
L1:
    addi $a0, $a0, -1
    jal  fact
    lw   $ra, 0($sp)
    addi $sp, $sp, 4
    mul  $v0, $a0, $v0
    jr   $ra
Answer:
(a) A procedure frame is the segment of the stack containing a procedure's saved
registers and local variables.
A frame pointer points to the location of the saved registers and local variables
of a given procedure. The stack pointer might change during the procedure, so
references to a local variable in memory might have different offsets depending
on where they occur in the procedure, making the procedure harder to understand.
A frame pointer instead offers a stable base register within a procedure for local
memory references.
(b)
fact:
    addi $sp, $sp, -8
    sw   $ra, 4($sp)
    sw   $a0, 0($sp)
    slti $t0, $a0, 1
    beq  $t0, $zero, L1
    addi $v0, $zero, 1
    addi $sp, $sp, 8
    jr   $ra
L1:
    addi $a0, $a0, -1
    jal  fact
    lw   $a0, 0($sp)
    lw   $ra, 4($sp)
    addi $sp, $sp, 8
    mul  $v0, $a0, $v0
    jr   $ra

Since procedure fact recursively calls itself and the argument in register $a0
will still be used after the call returns, the content of register $a0 must be
saved before the call is made.

3. Please explain the concept of non-restoring division algorithm.
Answer:
Let r be the negative remainder after a failed subtraction. Restoring division adds
the divisor back to get (r + d), shifts to get (r + d) × 2, and subtracts the divisor:
(r + d) × 2 - d. Nonrestoring division instead shifts the negative remainder to get
r × 2 and then adds the divisor: r × 2 + d. Since (r + d) × 2 - d = r × 2 + d, the
results are identical, but nonrestoring division saves the restoring addition in each
step, improving performance.
4. We wish to compare the performance of two different computers: M1 and M2.
Following measurements have been made on these computers: Program 1
executes for 2.0 seconds on M1, and 1.5 seconds on M2, whereas Program 2
executes for 5.0 seconds on M1, and 10.0 seconds on M2.
(a) Which computer is faster for each program, and how many times as fast is it?
The following additional measurements were then made: Program 1 executes 5 × 10^9
instructions on M1, and 6 × 10^9 instructions on M2.
(b) Find instruction execution rate (instructions/second) for each computer when
running Program 1.
Suppose M1 costs $500 and M2 costs $800. A user requires that Program 1 must
be executed 1600 times each hour. Any remaining time is used to run Program 2.
If the computer has enough performance to execute Program 1 the required
number of times per hour, then performance is measured by the throughput for
Program 2.
(c) Which computer is faster for this workload? Why?
(d) Which computer is more cost-effective? Show your calculations.
Answer:
(a) For program 1, M2 is 2/1.5 = 1.33 times as fast as M1.
For program 2, M1 is 10/5 = 2 times as fast as M2.
(b) The instruction execution rate for M1 = (5 × 10^9)/2 = 2.5 × 10^9 instr./sec.
The instruction execution rate for M2 = (6 × 10^9)/1.5 = 4 × 10^9 instr./sec.
(c) Executing program 1 1600 times on M1 takes 2(1600) = 3200 sec which
leaves 400 sec for program 2. Hence it will execute 400/5 = 80 times.
Executing program 1 1600 times on M2 takes 1.5(1600) = 2400 sec which
leaves 1200 sec for program 2. This program takes 10 sec on M2 so it will
execute 1200/10 = 120 times during the hour.
Therefore M2 is faster for this workload
(d) As for cost-effectiveness, we can compare $500/80 = $6.25 per (run/hour)
for M1 with $800/120 = $6.67 per (run/hour) for M2, so M1 is more
cost-effective.
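A small sketch of the hourly-workload arithmetic; the function name and tuple layout are illustrative only:

```python
# Hourly workload: 1600 runs of program 1, remaining time runs program 2.
def workload(t1, t2, cost):
    remaining = 3600 - 1600 * t1          # seconds left in the hour
    runs = remaining / t2                 # program-2 throughput per hour
    return runs, cost / runs              # (throughput, dollars per run/hour)

m1 = workload(2.0, 5.0, 500)
m2 = workload(1.5, 10.0, 800)
assert m1[0] == 80.0 and m2[0] == 120.0   # M2 has the higher throughput
assert m1[1] < m2[1]                      # but M1 costs less per run/hour
```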

5. Given the code sequence:
lw $t1, 8($t7) ; assume mem($t7+8) contains (+72)_ten

addi $t2, $zero, #10
nor $t3, $t1, $t2
beq $t1, $t2, Label
add $t4, $t2, $t3
sw $t4, 108($t7)
Label: ...
According to the multi-cycle implementation scheme in the textbook (see figure
below),
(a) How many cycles will it take to execute this code?
(b) What is going on during the 19th cycle of execution?
(c) In which cycle does the actual addition of 108 and $t7 take place?
Step name: Instruction fetch
All classes: IR <- Memory[PC]; PC <- PC + 4

Step name: Instruction decode/register fetch
All classes: A <- Reg[IR[25-21]]; B <- Reg[IR[20-16]];
ALUOut <- PC + (sign-extend(IR[15-0]) << 2)

Step name: Execution, address computation, branch/jump completion
R-type: ALUOut <- A op B
Memory-reference: ALUOut <- A + sign-extend(IR[15-0])
Branch: if (A == B) then PC <- ALUOut
Jump: PC <- {PC[31-28], (IR[25-0] << 2)}

Step name: Memory access or R-type completion
R-type: Reg[IR[15-11]] <- ALUOut
Load: MDR <- Memory[ALUOut]; Store: Memory[ALUOut] <- B

Step name: Memory read completion
Load: Reg[IR[20-16]] <- MDR
Answer:
(a) 5 + 4 + 4 + 3 + 4 + 4 = 24
(b) The contents of registers $t2 and $t3 are added by the ALU (the execution
cycle of the add instruction).
(c) The 23rd cycle (the address-computation cycle of the sw instruction).
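The cycle count and the cycle-19 question can be checked with a short sketch, using the per-class cycle counts of the multi-cycle scheme:

```python
# Cycles per instruction class in the multi-cycle implementation.
CYCLES = {"lw": 5, "addi": 4, "nor": 4, "beq": 3, "add": 4, "sw": 4}
program = ["lw", "addi", "nor", "beq", "add", "sw"]

assert sum(CYCLES[op] for op in program) == 24      # answer (a)

# Find the instruction that is executing during cycle 19.
end = 0
for op in program:
    end += CYCLES[op]
    if end >= 19:
        break
assert op == "add"   # the add occupies cycles 17-20; cycle 19 is its EX cycle
```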

6. Instruction count, CPI, and clock rate are three key factors to measure
performance. The performance of a program depends on the algorithm, the
programming language, the compiler, the instruction set architecture, and the
actual hardware used.
(a) What performance factor(s) above may be affected by using different
Instruction Set Architectures? Why?
(b) MIPS (Million Instructions per Second) of running a benchmark program on
machine A is higher than that of running the same benchmark on machine B.
Which machine is faster? Why?
Answer:
(a) Instruction count, CPI, and clock rate
(b) We cannot tell which machine is faster from the MIPS measure alone unless
the capabilities of the ISAs of the two machines are also given.

7. To implement these five MIPS instructions: [lw, sb, addi, xor, beq],
(a) If simple single-cycle design is used, at least how many adders must be used?
What each of these adders is used for?
(b) Similarly, at least how many memories are there? What each of them is used
for?
(c) Repeat (a) for multi-cycle design.
(d) Repeat (b) also for multi-cycle design.
Answer:
(a) Two adders are needed for the single-cycle design: one for the PC + 4
calculation and the other for the branch target address calculation.
(b) Two memories are needed for the single-cycle design: one for instruction
fetch and the other for data access.
(c) No adders are needed for the multi-cycle design; the ALU performs all the
additions.
(d) One memory is needed, used for both instruction and data access.

8. Assume the three caches below, each consisting of 16 words. Given the series of
address references as word addresses: 2, 3, 4, 16, 18, 16, 4, 2. Please label each
reference as a hit or a miss for the three caches (a), (b), and (c) below. Assuming
that LRU is used for cache replacement algorithm and all the caches are initially
empty.
(a) a direct-mapped cache with 16 one-word blocks;
(b) a direct-mapped cache with 4 four-word blocks;
(c) a four-way set associative cache with block size of one-word.
Answer:
(a)
Decimal  Binary  Tag  Index  Hit/Miss  3C
2 00010 0 0010 Miss compulsory
3 00011 0 0011 Miss compulsory
4 00100 0 0100 Miss compulsory
16 10000 1 0000 Miss compulsory
18 10010 1 0010 Miss compulsory
16 10000 1 0000 Hit
4 00100 0 0100 Hit
2 00010 0 0010 Miss conflict
(b)
Decimal  Binary  Tag  Index  Hit/Miss  3C
2 00010 0 00 Miss compulsory
3 00011 0 00 Hit
4 00100 0 01 Miss compulsory

16 10000 1 00 Miss compulsory
18 10010 1 00 Hit
16 10000 1 00 Hit
4 00100 0 01 Hit
2 00010 0 00 Miss conflict

(c)
Decimal  Binary  Tag  Index  Hit/Miss  3C
2 00010 000 10 Miss compulsory
3 00011 000 11 Miss compulsory
4 00100 001 00 Miss compulsory
16 10000 100 00 Miss compulsory
18 10010 100 10 Miss compulsory
16 10000 100 00 Hit
4 00100 001 00 Hit
2 00010 000 10 Hit
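Configuration (a) can be replayed with a tiny direct-mapped simulator. A sketch: with one-word blocks the index is the address mod 16 and the tag is the address div 16:

```python
# Direct-mapped cache hit/miss trace for configuration (a).
def trace(addresses, num_blocks):
    cache = {}                      # index -> tag currently stored
    results = []
    for addr in addresses:
        index, tag = addr % num_blocks, addr // num_blocks
        results.append("Hit" if cache.get(index) == tag else "Miss")
        cache[index] = tag
    return results

refs = [2, 3, 4, 16, 18, 16, 4, 2]
assert trace(refs, 16) == ["Miss"] * 5 + ["Hit", "Hit", "Miss"]
```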

9. Continued from above question 8:
(a) For each of above (a), (b), and (c) caches, how many misses are compulsory
misses?
(b) For each of above (a), (b), and (c) caches, how many misses are conflict
misses?
(c) What type of cache misses (compulsory, conflict and capacity) can be reduced
by increasing the cache block size?
(d) What type of cache misses can be reduced by increasing set associativity?
Answer:
Cache configuration                                        (a) compulsory  (b) conflict
direct-mapped cache with 16 one-word blocks                5               1
direct-mapped cache with 4 four-word blocks                3               1
four-way set associative cache with one-word blocks        5               0

(c) compulsory
(d) conflict

10. What is the average CPI for each of the following 4 schemes taking to execute the
code sequence below? (Note: For the pipeline scheme, there are five stages: IF,
ID, EX, MEM, and WB. We assume the reads and writes of register file can
occur in the same clock cycle, and the stall circuits are available.)
add $t3, $s1, $s2
sub $t1, $s1, $s2
lw $t2, 100($t3)
sub $s1, $t1, $t2
(a) single cycle scheme;
(b) multi-cycle scheme without pipelining;
(c) pipelined scheme without data forwarding hardware;
(d) pipelined scheme with data forwarding hardware (one from EX/MEM to ALU
input, and the other from MEM/WB to ALU input) available.
Answer:
(a) CPI = 1
(b) CPI = (4 + 4 + 5 + 4)/4 = 4.25
(c) The clocks for executing this code = (5 - 1) + 4 + 3 = 11, so CPI = 11/4 = 2.75
(d) The clocks for executing this code = (5 - 1) + 4 + 1 = 9, so CPI = 9/4 = 2.25


95

1. Booth's algorithm is an elegant approach to multiplying signed numbers. It starts
with the observation that with the ability to both add and subtract, there are
multiple ways to compute a product. The key to Booth's insight is his classifying
groups of bits into the beginning, the middle, or the end of a run of 1s.
Is Booth's algorithm always better? Why does Booth's algorithm work for
multiplication of two's complement signed integers?
Answer:
(1) No. For a multiplier whose bits alternate between 0 and 1 (e.g., 01010101),
every bit position begins or ends a run of 1s, so Booth's algorithm performs an
add or a subtract at nearly every step and does more operations than the ordinary
shift-and-add algorithm.
(2) Let the multiplier be a = a_31 a_30 ... a_0 (two's complement) and the
multiplicand be b. At step i, Booth's algorithm examines the bit pair
(a_i, a_(i-1)), with a_(-1) = 0:

a_i  a_(i-1)  Operation
0    0        Do nothing
0    1        Add b
1    0        Subtract b
1    1        Do nothing

The contribution of step i is therefore (a_(i-1) - a_i) · b · 2^i, and the
accumulated value is
(a_30 - a_31) · b · 2^31
+ (a_29 - a_30) · b · 2^30
+ ...
+ (a_0 - a_1) · b · 2^1
+ (0 - a_0) · b · 2^0
This sum telescopes to
b × (-a_31 · 2^31 + a_30 · 2^30 + ... + a_1 · 2^1 + a_0 · 2^0) = b × a,
since -a_31 · 2^31 + a_30 · 2^30 + ... + a_0 · 2^0 is exactly the value of a in
two's complement. Hence Booth's algorithm produces the correct signed product.

2. The general division algorithm is called restoring division, since each time the
result of subtracting the divisor from the dividend is negative you must add the
divisor back into the dividend to restore the original value. An even faster
algorithm does not immediately add the divisor back if the remainder is negative.
This nonrestoring division algorithm takes 1 clock per step. Use the expression
(r + d) × 2 - d = r × 2 + d to explain the non-restoring algorithm.
Answer:
Let r be the negative remainder after a failed subtraction. Restoring division first
adds the divisor back to get (r + d), then shifts left to get (r + d) × 2, and
subtracts the divisor again: (r + d) × 2 - d. Nonrestoring division skips the
restore: it simply shifts the negative remainder to get r × 2 and then adds the
divisor: r × 2 + d. Because (r + d) × 2 - d = r × 2 + d, the two algorithms
produce the same result, but the nonrestoring version needs only one add or
subtract per step, which improves performance.

3. Multiple forms of addressing are generally called addressing modes. The MIPS
addressing modes are the following:
(1) Register addressing, where the operand is a register.
(2) Base or displacement addressing, where the operand is at the memory
location whose address is the sum of a register and a constant in the
instruction.
(3) Immediate addressing, where the operand is a constant within the instruction
itself.
(4) PC-relative addressing, where the address is the sum of the PC and a constant
in the instruction.
(5) Pseudodirect addressing, where the jump address is the 26 bits of the
instruction concatenated with the upper bits of the PC.
The following binary codes are corresponding to their MIPS instructions,
respectively. Indicate these two instructions belonging to which of the above
addressing modes and according to them, find binary codes of [add $s4, $t3, $t2]
and [lw $s0, 48($t1)].
add $t0, $s1, $s2 00000010 00110010 01000000 00100000
lw $t0, 32($s2) 10001110 01001000 00000000 00100000
Answer:
1. add $s4, $t3, $t2 belongs to Register addressing and its binary code is
00000001 01101010 10100000 00100000
2. lw $s0, 48($t1) belongs to Base or displacement addressing and its binary
code is 10001101 00110000 00000000 00110000
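The two encodings can be checked mechanically. A sketch containing just the register numbers the answer needs:

```python
# Encode the two instructions from the answer and check the binary codes.
REG = {"$t0": 8, "$t1": 9, "$t2": 10, "$t3": 11, "$s0": 16, "$s4": 20}

def r_type(rs, rt, rd, funct, op=0, shamt=0):
    return (op << 26) | (REG[rs] << 21) | (REG[rt] << 16) | \
           (REG[rd] << 11) | (shamt << 6) | funct

def i_type(op, rs, rt, imm):
    return (op << 26) | (REG[rs] << 21) | (REG[rt] << 16) | (imm & 0xFFFF)

# add $s4, $t3, $t2  (funct 0x20)
assert r_type("$t3", "$t2", "$s4", 0x20) == 0x016AA020
# lw $s0, 48($t1)    (opcode 0x23)
assert i_type(0x23, "$t1", "$s0", 48) == 0x8D300030
```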

4. A compiler designer is trying to decide between two code sequences for a
particular machine. The hardware designers have supplied the following facts: the
CPI (clocks per instruction) of instruction class A is 1, the CPI of instruction class
B is 2 and the CPI of instruction class C is 3. For a particular high-level-language
statement, the compiler writer is considering two code sequences that require the
following instruction counts: Sequence-1 executes 2 As, 1 B, and 2 Cs, and
Sequence-2 executes 4 As, 1 B, and 1 C.
Sequence-1 executes 2 + 1 + 2 = 5 instructions. Sequence-2 executes 4 + 1 + 1 =
6 instructions. So Sequence-1 executes fewer instructions. We also know that
CPU clock cycles_1 = (2 × 1) + (1 × 2) + (2 × 3) = 10 cycles and
CPU clock cycles_2 = (4 × 1) + (1 × 2) + (1 × 3) = 9 cycles.
So Sequence-2 is faster, even though it actually executes one extra instruction.
Since Sequence-2 takes fewer overall clock cycles but has more instructions, it
must have a lower CPI.
CPI_1 = CPU clock cycles_1 / instruction count_1 = 10/5 = 2.
CPI_2 = CPU clock cycles_2 / instruction count_2 = 9/6 = 1.5.
The above shows the danger of using only one factor (instruction count) to assess
performance. When comparing two machines, you must look at all three
components, which combine to form execution time. If some of the factors are
identical, like the clock rate in the above example, performance can be
determined by comparing all the non-identical factors.
The question is: based on CPI_1/CPI_2 = 2/1.5, can we provide two versions of
the particular machine, in which (clock rate of M_sequence-1) / (clock rate of
M_sequence-2) = 4/3, to let the two code sequences have the same execution time?
Answer: No.
Suppose that the clock rate of M_sequence-1 is 4 GHz and that of M_sequence-2 is 3 GHz.
The execution time for Sequence-1 = (5 × 2) / (4 × 10^9) = 2.5 ns.
The execution time for Sequence-2 = (6 × 1.5) / (3 × 10^9) = 3 ns.
Even though the clock-rate ratio of the two machines is 4/3, their execution
times are different.

5. Following the above question, assume (clock rate of M_sequence-1) / (clock rate
of M_sequence-2) = 6/5. You are asked to adjust the CPI of instruction class C to
let the two code sequences have the same execution time.
Answer:
Suppose that the adjusted CPI of class C instructions is x.
Then ((2 × 1) + (1 × 2) + (2 × x)) / 6 = ((4 × 1) + (1 × 2) + (1 × x)) / 5
x = 4
The CPI of instruction class C should be adjusted to 4.
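A quick check that x = 4 equalizes the two execution times when the clock ratio is 6/5, taking the clock rates as 6 and 5 in arbitrary units:

```python
# Verify that CPI_C = 4 makes the two sequences take equal time.
def cycles(counts, cpi):                 # per-class counts and CPIs (A, B, C)
    return sum(n * c for n, c in zip(counts, cpi))

x = 4
t1 = cycles((2, 1, 2), (1, 2, x)) / 6    # Sequence-1 on the 6-unit clock
t2 = cycles((4, 1, 1), (1, 2, x)) / 5    # Sequence-2 on the 5-unit clock
assert t1 == t2 == 2.0
```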

6. Suppose a program runs in 100 seconds on a machine, with multiply operations
responsible for 80 seconds of this time. The execution time of the program after
making the improvement is given by the following equation:
execution time after improvement = (exec time affected by improvement / amount
of improvement) + (exec time unaffected)
If one wants the program to run two times faster, that is 50 seconds, then
execution time after improvement = (80 seconds / n) + (100 - 80) seconds
So, 50 seconds = (80 seconds / n) + 20 seconds. Thus n, the amount of
improvement, is 8/3.
The performance enhancement possible with a given improvement is limited by
the amount that the improved feature is used. This concept is referred to as
Amdahl's law in computing.
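The little solve for n can be written down directly:

```python
# Solve 50 = 80/n + 20 for n, the required speedup of the multiply hardware.
affected, unaffected, target = 80.0, 20.0, 50.0
n = affected / (target - unaffected)
assert abs(n - 8 / 3) < 1e-12      # n = 8/3, about 2.67x
```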

Let's consider a general model in which the subsystem-A operations are responsible
for a seconds and the subsystem-B operations for b seconds of the total execution
time of t seconds. We also recognize that improvement comes at a cost. Assume that
subsystem-A needs a cost of C_A to get a 10/9 improvement, and that it needs
another C_A to get a further 10/9 improvement of the already-improved subsystem-A,
i.e., a 100/81 improvement of the original subsystem-A. The improvement is
restricted to this discrete function. Subsystem-B follows the same rule, with
discrete 10/9 improvements at a discrete cost of C_B. Suppose subsystem-A
undergoes n_A improvements and subsystem-B undergoes n_B improvements.
The question is: under the total cost limitation C_L, you are asked to discuss how
to formulate the problem of getting the maximum improvement by improving both
subsystem-A and subsystem-B.
(a) Calculate the costs to improve subsystem-A and subsystem-B.
(b) Calculate the responsible times of the improved subsystem-A and subsystem-B.
(c) Formulate the problem you are going to solve.
Answer:
(a) The cost for subsystem-A = C_A × n_A
The cost for subsystem-B = C_B × n_B
(b) The responsible time of the improved subsystem-A = a × (9/10)^(n_A) seconds
The responsible time of the improved subsystem-B = b × (9/10)^(n_B) seconds
(c) Minimize over n_A and n_B:
a × (9/10)^(n_A) + b × (9/10)^(n_B) + (t - a - b),
subject to C_A × n_A + C_B × n_B <= C_L
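With concrete numbers plugged in, the formulation in (c) can be solved by brute force over the discrete choices of n_A and n_B. The values of a, b, t, C_A, C_B, and C_L below are made up for illustration; they are not part of the problem:

```python
# Brute-force search over discrete improvement counts (n_A, n_B).
a, b, t = 40.0, 30.0, 100.0        # responsible times (seconds), illustrative
C_A, C_B, C_L = 3.0, 2.0, 12.0     # per-step costs and total budget, illustrative

best = None
for n_A in range(int(C_L // C_A) + 1):
    for n_B in range(int((C_L - C_A * n_A) // C_B) + 1):
        time = a * (9 / 10) ** n_A + b * (9 / 10) ** n_B + (t - a - b)
        if best is None or time < best[0]:
            best = (time, n_A, n_B)

assert best is not None
assert C_A * best[1] + C_B * best[2] <= C_L    # budget respected
assert best[0] < t                             # some improvement was found
```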


7. A simple, single-cycle implementation of MIPS processor capable of executing
{lw, sw, add, sub, and, or, slt, beq, j} instructions is given in the Patterson and
Hennessy book.
(a) How many adders does this implementation need? And what does each of the
adders do?
(b) If we change the implementation to a multi-cycle style, then at least how
many adders do we still need? And what does each of these adders do?
Answer:

(a) Two adders are required for single-cycle implementation. One is for
computing the branch target, and the other is for computing the next
instruction address (PC + 4).
(b) No adders are needed, because the ALU can compute the next instruction
address in the first cycle and the branch target in the second cycle.

8. Comparing the following two implementations for MIPS processor: Multi-cycle
approach (all cycles dedicated to executing a single instruction), and pipelined
approach:
(a) There are two major advantages of the multi-cycle approach. What are they?
(b) For the pipelined approach, what extra hardware costs will it require? (State
only the principle, and you do not need to list the exact hardware items.)
(c) What is the most noticeable advantage of the pipelined approach?
Answer:
(a) The ability to allow instructions to take different numbers of clock cycles and
the ability to share functional units within the execution of a single
instruction.
(b) Pipeline registers, memory and adders.
(c) Pipelined approach gains efficiency by overlapping the execution of multiple
instructions, increasing hardware utilization and improving performance.

9. Given a microprogram controlled MIPS processor, its control store contents, and
microprogram sequencer together with two dispatch tables (in which the value
field indicates the microinstruction address in the control store) below:




















(Figure: the control store contents and the microprogram sequencer.)


Label    | ALU control | SRC1 | SRC2    | Register control | Memory    | PCWrite control | Sequencing
Fetch    | Add         | PC   | 4       |                  | Read PC   | ALU             | Seq
         | Add         | PC   | Extshft | Read             |           |                 | Dispatch 1
Mem1     | Add         | A    | Extend  |                  |           |                 | Dispatch 2
LW2      |             |      |         |                  | Read ALU  |                 | Seq
         |             |      |         | Write MDR        |           |                 | Fetch
SW2      |             |      |         |                  | Write ALU |                 | Fetch
Rformat1 | Func code   | A    | B       |                  |           |                 | Seq
         |             |      |         | Write ALU        |           |                 | Fetch
BEQ1     | Subt        | A    | B       |                  |           | ALUOut-cond     | Fetch
JUMP1    |             |      |         |                  |           | Jump address    | Fetch

Dispatch ROM 1
Op     | Opcode name | Value
000000 | R-format    | 0110
000010 | jmp         | 1001
000100 | beq         | 1000
100011 | lw          | 0010
101011 | sw          | 0010

Dispatch ROM 2
Op     | Opcode name | Value
100011 | lw          | 0011
101011 | sw          | 0101















(Note that SRC1 and SRC2 in the control store are represented as ALUSrcA and
ALUSrcB in the processor diagram.)
(a) How many cycles will it need to execute the following code sequence:
lw $t1, 0($t3)
addi $t3, $t3, #2
sub $t1, $t1, $t2
sw $t1, 0($t4)
addi $t4, $t4, #4
beq $t3, $t5, Label

(b) What operations are undertaken in the third cycle of an R-format instruction?
Include the involved latches, multiplexers, and other function units in your
answer.
(c) Repeat (b) for the second cycle of a memory-reference instruction.
Answer:
(a) 5 + 4 + 4 + 4 + 4 + 3 = 24 cycles
(b) The ALU executes the function specified by the instruction's function code;
its two operands are selected by setting ALUSrcA = 1 (to choose register A) and
ALUSrcB = 00 (to choose register B).
(c) Read the register file, and use the ALU to compute the branch target
PC + Extshft by setting ALUSrcA = 0 (to choose the PC) and ALUSrcB = 11 (to
choose SignExt[IR[imm16]] << 2).

10. In the pipelined datapath design below, there is an obvious problem:













(a) Point out the problem, and explain it.
(b) Then, indicate how the problem can be corrected.
Answer:
(a) Consider a load instruction in the MEM/WB pipeline register. The instruction
in the IF/ID pipeline register supplies the write register number, yet this
instruction occurs considerable after the load instruction.
(b) We need to preserve the destination register number in the load instruction.
Load must pass the register number from the ID/EX through EX/MEM to the
MEM/WB pipeline register for use in the WB stage.










11. Given a 2^S-byte cache with 2^L-byte lines in an M-bit, byte-addressable memory
system,
(a) What is the range of the index field size, in number of bits, in a memory address
while accessing the cache?
(b) Repeat (a) for the tag field size.
Answer:
The number of blocks in the cache = 2^S / 2^L = 2^(S-L)

                   Tag     Index   Offset
Direct-mapped      M - S   S - L   L
Fully associative  M - L   0       L

(a) The index field ranges from 0 bits (fully associative) to S - L bits (direct-mapped).
(b) The tag field ranges from M - S bits (direct-mapped) to M - L bits (fully associative).
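The two table rows can be checked with a small helper. A sketch in which `ways` is the number of blocks per set, so 1 means direct-mapped and 2^(S-L) means fully associative:

```python
# Address-field sizes for a 2^S-byte cache with 2^L-byte lines, M-bit addresses.
def fields(M, S, L, ways):
    num_sets = 2 ** (S - L) // ways
    index = num_sets.bit_length() - 1        # log2 of a power of two
    return (M - index - L, index, L)         # (tag, index, offset)

M, S, L = 32, 16, 6
assert fields(M, S, L, 1) == (M - S, S - L, L)            # direct-mapped
assert fields(M, S, L, 2 ** (S - L)) == (M - L, 0, L)     # fully associative
```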

12. In the RAID design, seven levels of the RAIDs are introduced in a commonly
used textbook.
(a) Which level of RAID uses the least storage redundancy? How much is the
redundancy?
(b) Which level uses the most redundancy, and how much is it?
(c) What is the most noticeable drawback of RAID 4 (block-interleaved parity)?
And how does RAID 5 correct this drawback?
Answer:
(a) (1) RAID 0; (2) 0 redundancy
(b) (1) RAID 1; (2) the number of data disks
(c) (1) Parity disk is the bottleneck;
(2) Spread the parity information throughout all the disks to avoid single
parity disk bottlenecks.


13. In parallel processors sharing data, answer the following:
(a) In uniform memory access (UMA) designs, do all processors use the same
address space?
(b) Do all UMA processors access memory at the same speed?
(c) Draw a system diagram showing how the processors and memory (modules)
are connected.
Answer:
(a) Yes
(b) Yes
(c)


















(Diagram: three processors, each with its own cache, connected by a single bus
to the shared memory and I/O.)

94

1. (a) What is response time? Who will care about the response time?
(b) What is throughput? Who will care about the throughput?
(c) Think of an example in which if we improve the response time of a computer
system, its throughput will be worsened.
(d) Now think of an example in which if we improve the throughput of a computer
system, its response time will be worsened
Answer:
(a) (1) The time between the start and the completion of a task. (2) Individual
computer users.
(b) (1) The total amount of work done in a given time. (2) Data center managers.
(c) With many processes, round-robin CPU scheduling with a small time slice
improves response time, but the frequent context switches worsen throughput.
(d) With many processes, shortest-job-first CPU scheduling improves throughput,
but long processes may be postponed indefinitely, worsening their response time.

2. Given three classes of instructions, class A, B, and C, having CPI_A = a,
CPI_B = b, and CPI_C = c, where CPI stands for cycles per instruction.
(a) If we can tune the clock rate to 120% without affecting any of CPI_A, CPI_B,
and CPI_C, what is the performance gain
G(a) = [performance_new / performance_original - 1] = ?
(b) Increasing the clock rate to 150% makes CPI_A become 1.5 × CPI_A, while
CPI_B and CPI_C remain unchanged. If class A instructions account for 40% of all
dynamic instructions, what is the performance gain G(b) = ?
(c) Now let the compiler come into play. Given the original clock rate, if for
every class A instruction to be eliminated, there must be x class B instructions
and y class C instructions added into the execution stream, under what condition
would you want to eliminate class A instructions?
Answer:
(a) G(a) = 1.2 − 1 = 0.2
(b) G(b) = 1.5 / (0.4 × 1.5 + 0.6) − 1 = 1.5 / 1.2 − 1 = 0.25
(c) x·b + y·c < a
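A quick numeric check of these gains (the helper names are ours; (b) follows the worked solution in applying the 40% weight directly to cycles, i.e. treating the class CPIs as equal):

```python
def gain_clock_only(clock_factor):
    # (a): CPIs unchanged, so performance scales with the clock rate.
    return clock_factor - 1

def gain_b(clock_factor, cpi_a_factor, frac_a):
    # (b): class A's cycle contribution grows by cpi_a_factor; the rest stays.
    return clock_factor / (frac_a * cpi_a_factor + (1 - frac_a)) - 1

def worth_eliminating_a(a, b, c, x, y):
    # (c): replacing one class-A instruction pays off when the added cycles
    # x*b + y*c are fewer than the a cycles removed.
    return x * b + y * c < a

assert abs(gain_clock_only(1.2) - 0.2) < 1e-9
assert abs(gain_b(1.5, 1.5, 0.4) - 0.25) < 1e-9
assert worth_eliminating_a(4, 1, 1, 2, 1)       # 2*1 + 1*1 = 3 < 4 cycles
assert not worth_eliminating_a(2, 1, 1, 2, 1)   # 3 >= 2, keep class A
```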


114
3. MIPS has only a few addressing modes.
(a) What are these addressing modes?
(b) What makes MIPS use these addressing modes, but not others, and not more
modes? To answer this question properly, you should formulate the principles
behind the selection of these modes.
Answer:
(a) (b)
1. Register addressing Simplicity favors regularity
2. Base or displacement addressing
3. Immediate addressing Make the common case fast
4. PC-relative addressing Good design demands good compromises
5. Pseudodirect addressing


4. Given A = a_{n-1} a_{n-2} … a_2 a_1 a_0, B = b_{n-1} b_{n-2} … b_2 b_1 b_0,
and carry-in c_0:
(a) What is the equation for g_i, which indicates that a carry c_{i+1} must be
generated regardless of c_i?
(b) What is the equation for p_i, which indicates that c_{i+1} = c_i?
(c) c_n = f(g_x, p_x, c_0) | x = (n − 1) ~ 0 = ?
Answer:
(a) g_i = a_i · b_i
(b) p_i = a_i + b_i
(c) c_n = g_{n-1} + p_{n-1}g_{n-2} + p_{n-1}p_{n-2}g_{n-3} + … + p_{n-1}p_{n-2}…p_1p_0c_0
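These equations can be verified exhaustively for small operands with a short Python check (an illustrative sketch, not part of the original answer):

```python
# With g_i = a_i AND b_i and p_i = a_i OR b_i, iterating c_{i+1} = g_i + p_i*c_i
# (the recurrence whose expansion is the sum in (c)) reproduces the true
# carry-out of n-bit addition.
def carry_out(a, b, c0, n=4):
    c = c0
    for i in range(n):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        g, p = ai & bi, ai | bi          # generate and propagate for bit i
        c = g | (p & c)
    return c

for a in range(16):
    for b in range(16):
        for c0 in (0, 1):
            assert carry_out(a, b, c0) == (a + b + c0) >> 4
```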



115
[Figure: flow chart of the sequential multiplication algorithm. Start; test Product0; if Product0 = 1, add the multiplicand to the left half of the Product register and place the result in the left half; shift the Product register right 1 bit; if fewer than 32 repetitions, repeat; after the 32nd repetition, Done]
5. Given the hardware and the flow chart of a multiplication algorithm,







(a) Modify the hardware as little as possible for Booth's multiplication algorithm.
Draw the modified hardware.
(b) Based on the modified hardware designed in (a), redraw the
flow chart for Booth's multiplication algorithm.
(c) Describe the characteristics and advantages of Booth's algorithm.
Answer: (a) (b)















(c) The major features of Booth's algorithm are that it handles signed-number
multiplication, and, if shifting is faster than addition, it is faster than the
traditional sequential multiplier in the average case.
[Figures for (a) and (b): the modified hardware appends an extra bit (Product-1) to the right of Product0. The control tests the pair Product0/Product-1 each step: on 1 0, subtract the multiplicand from the left half of the Product register; on 0 1, add it; on 0 0 or 1 1, do nothing; then shift the Product register right 1 bit, for 32 repetitions]

116
6. Given the simplified datapath of a pipelined computer and a sequence of code:






add $3, $1, $2 (Instruction 1)
lw $4, 100($3) (Instruction 2)
sub $2, $4, $3 (Instruction 3)
sw $4, 100($1) (Instruction 4)
and $6, $3, $2 (Instruction 5)
(a) Identify all of the data dependencies in the code by the following
representation:
Ri: Ij → Ik
It means that Instruction k (Ik) depends on Instruction j (Ij) for Register i
(Ri).
(b) Which dependencies are data hazards that may be data forwarded?
(c) Which dependencies are data hazards that must be resolved via stalling?
Answer:
(a) R3: I1 → I2   (b) R3: I1 → I2   (c) R4: I2 → I3
    R3: I1 → I3       R3: I1 → I3
    R3: I1 → I5       R4: I2 → I4
    R4: I2 → I3       R2: I3 → I5
    R4: I2 → I4
    R2: I3 → I5
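The dependency list in (a) can be derived mechanically. The sketch below (our own encoding of the five instructions) scans each destination register forward until a later instruction redefines it:

```python
code = [("add", "$3", ["$1", "$2"]),   # I1: (mnemonic, dest, sources)
        ("lw",  "$4", ["$3"]),         # I2
        ("sub", "$2", ["$4", "$3"]),   # I3
        ("sw",  None, ["$4", "$1"]),   # I4: a store writes memory, not a reg
        ("and", "$6", ["$3", "$2"])]   # I5

deps = []
for j, (_, dest_j, _) in enumerate(code):
    if dest_j is None:
        continue
    for k in range(j + 1, len(code)):
        _, dest_k, srcs_k = code[k]
        if dest_j in srcs_k:
            deps.append((dest_j, j + 1, k + 1))   # (Ri, Ij, Ik)
        if dest_k == dest_j:
            break                      # a later write redefines Ri

assert deps == [("$3", 1, 2), ("$3", 1, 3), ("$3", 1, 5),
                ("$4", 2, 3), ("$4", 2, 4), ("$2", 3, 5)]
```

The output matches the list in (a); the lw-use pair ($4: I2 → I3) is the one that must stall even with forwarding.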

7. Given the datapath of a single-cycle computer and the definition and formats of
its instructions:














[Figure: single-cycle MIPS datapath. PC and instruction memory feed the register file and sign-extend unit; the ALU (its ALU control driven by Instruction[5-0] and ALUOp) addresses data memory; multiplexers select the write register, the ALU source, and the write-back data under RegDst, ALUSrc, MemtoReg, RegWrite, MemRead, MemWrite, and PCSrc]

117
add $rd, $rs, $rt #rd = $rs + $rt R-format
lw $rt, addr($rs) #$rt = Memory[$rs + sign-extended addr] I-format
beq $rs, $rt, addr #if ($rs == $rt) go to PC + 4 + 4 × addr I-format

Field size 6 bits 5 bits 5 bits 5 bits 5 bits 6 bits
R-format op rs rt rd shamt funct
I-format op rs rt address/immediate
Complete the following table for the setting of the control lines of each instruction.
Assume that the control signal Branch and the Zero output of the ALU are ANDed
together to become the control signal PCSrc.
Instruction RegDst ALUSrc MemtoReg RegWrite MemRead MemWrite Branch
add
lw
beq
(1: control line asserted, 0: control line deasserted, x: don't care)
Answer:
Instruction RegDst ALUSrc MemtoReg RegWrite MemRead MemWrite Branch
add 1 0 0 1 0 0 0
lw 0 1 1 1 1 0 0
beq x 0 x 0 0 0 1
(For beq, RegDst and MemtoReg are don't-cares because RegWrite is 0; Branch
ANDed with the ALU's Zero output forms PCSrc.)


8. Suppose we have a processor with a base CPI of 1.0, assuming all references hit
in the primary cache, and a clock rate of 5 GHz. Assume a main memory access
time of 100 ns, including all the miss handling. Suppose the miss rate per
instruction at the primary cache is 2%. How much faster will the processor be if
we add a secondary cache that has a 5 ns access time for either a hit or miss and is
large enough to reduce the miss rate to main memory to 0.5%?
Answer:
The miss penalty to main memory is 100 / 0.2 = 500 clock cycles.
For the processor with one level of caching, total CPI = 1.0 + 500 × 2% = 11.0
The miss penalty for an access to the second-level cache is 5 / 0.2 = 25 clock
cycles.
For the two-level cache, total CPI = 1.0 + 2% × 25 + 0.5% × 500 = 4.0
Thus, the processor with the secondary cache is faster by 11.0/4.0 = 2.75 ≈ 2.8
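Replaying the arithmetic in Python (the penalties are pre-converted to clock cycles; 11.0/4.0 is exactly 2.75, which the answer rounds to 2.8):

```python
# At 5 GHz the cycle time is 0.2 ns, so:
main_penalty = 500                     # 100 ns / 0.2 ns per cycle
l2_penalty = 25                        # 5 ns / 0.2 ns per cycle

cpi_l1_only = 1.0 + 0.02 * main_penalty                          # 11.0
cpi_two_level = 1.0 + 0.02 * l2_penalty + 0.005 * main_penalty   # 4.0
speedup = cpi_l1_only / cpi_two_level                            # 2.75

assert abs(cpi_l1_only - 11.0) < 1e-9
assert abs(cpi_two_level - 4.0) < 1e-9
assert abs(speedup - 2.75) < 1e-9
```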


118
9. According to the following MIPS program, complete the given table by using the
technique of loop unrolling for superscalar pipelines. Write your answer with the
leading (1), (2), ... and (9).
Loop: lw $t0, 0($s1)
addu $t0, $t0, $s2
sw $t0, 0($s1)
addi $s1, $s1, -4
bne $s1, $zero, Loop
ALU or branch inst. Data transfer inst. Clock cycle
Loop: addi $s1, $s1, -16 lw $t0, 0($s1) 1
(blank) lw $t1, 12($s1) 2
(1) lw $t2, 8($s1) 3
(2) (6) 4
addu $t2, $t2, $s2 (7) 5
(3) (8) 6
(4) (9) 7
(5) sw $t3, 4($s1) 8
Answer:
(1) addu $t0, $t0, $s2
(2) addu $t1, $t1, $s2
(3) addu $t3, $t3, $s2
(4) (blank)
(5) bne $s1, $zero, Loop
(6) lw $t3, 4($s1)
(7) sw $t0, 16($s1)
(8) sw $t1, 12($s1)
(9) sw $t2, 8($s1)

10. The total number of bits needed for a cache is the sum of the data, tags, and
valid bits. Assuming a 32-bit byte address, a direct-mapped cache has 2^n
blocks with 2^m-word (2^(m+2)-byte) blocks. What is the number of bits in such a
cache?
Answer:
The size of a tag = 32 − (n + m + 2) = 30 − n − m
The size of the cache = 2^n × (32 × 2^m + 30 − n − m + 1) bits
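Plugging the formula into code, with example sizes (n = 10, m = 2, i.e. 1024 four-word blocks) chosen by us for illustration:

```python
def cache_bits(n, m):
    # Direct-mapped cache, 32-bit byte addresses: per block we store
    # 32 * 2^m data bits, a (30 - n - m)-bit tag, and 1 valid bit.
    tag = 30 - n - m
    return 2**n * (32 * 2**m + tag + 1)

# 1024 blocks of 4 words: 128 data bits + 18 tag bits + 1 valid bit each.
assert cache_bits(10, 2) == 1024 * (128 + 18 + 1)   # 150,528 bits total
```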


119
11. (a) Assume an instruction cache miss rate for a program is 2% and a data cache
miss rate is 4%. If a processor has a CPI of 2 without any memory stalls and
the miss penalty is 100 cycles for all misses, determine how much faster a
processor would run with perfect instruction and data caches that never
missed. Assume the frequency of all loads and stores is 36%.
(b) Suppose we increase the performance of the processor by doubling its clock
rate. How much faster will the processor be with the faster clock, assuming
the same miss rates and the absolute time to handle a cache miss does not
change?
Answer:
(a) CPI considering miss penalty = 2 + 0.02 × 100 + 0.04 × 0.36 × 100 = 5.44
5.44 / 2 = 2.72 times faster
(b) Measured in faster clock cycles, the new miss penalty will be 200 cycles.
Total miss cycles per instruction = 2% × 200 + 36% × (4% × 200) = 6.88
Faster system with cache misses, CPI = 2 + 6.88 = 8.88
Slower system with cache misses, CPI = 5.44
The faster clock system will be
Execution time of slow clock / Execution time of fast clock
= (I × CPI_slow × Cycle time) / (I × CPI_fast × Cycle time / 2)
= 5.44 / (8.88 × 1/2) ≈ 1.23 times faster
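The same numbers in executable form (miss rates and the 36% load/store frequency come from the problem statement):

```python
base_cpi, i_miss, d_miss, ls_freq = 2.0, 0.02, 0.04, 0.36

# (a) 100-cycle penalty for every instruction and data miss.
cpi_a = base_cpi + i_miss * 100 + ls_freq * d_miss * 100     # 5.44
assert abs(cpi_a - 5.44) < 1e-9
assert abs(cpi_a / base_cpi - 2.72) < 1e-9   # speedup with perfect caches

# (b) Doubling the clock halves the cycle time, so the same 100 ns miss
# costs 200 of the new cycles.
cpi_b = base_cpi + i_miss * 200 + ls_freq * d_miss * 200     # 8.88
speedup = cpi_a / (cpi_b * 0.5)              # compare CPI x cycle time
assert round(speedup, 2) == 1.23
```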

12. This is about I/O system design. Consider the following computer system: (1) a
CPU that sustains 3 billion instructions per second and averages 100,000
instructions in the operating system per I/O operation, (2) a memory backplane
bus capable of sustaining a transfer rate of 1000 MB/sec, (3) SCSI Ultra320
controllers with a transfer rate of 320 MB/sec and accommodating up to 7 disks,
and (4) disk drives with a read/write bandwidth of 75 MB/sec and, an average
seek plus rotational latency of 6 ms. If the workload consists of 64 KB reads
(where the block is sequential on a track) and the user program needs 200,000
instructions per I/O operation, find the maximum sustainable I/O rate and the
number of disks and SCSI controllers required. Assume that the reads can always
be done on an idle disk if one exists (i.e., ignore disk conflicts).
Answer:
The two fixed components of the system are the memory bus and the CPU. Let's
first find the I/O rate that these two components can sustain and determine which
of these is the bottleneck. Each I/O takes 200,000 user instructions and 100,000
OS instructions,
so Maximum I/O rate of CPU
= Instruction execution rate / Instructions per I/O
= 3 × 10^9 / ((200 + 100) × 10^3) = 10,000 I/Os per second


120
Each I/O transfers 64 KB, so
Maximum I/O rate of bus = Bus bandwidth / Bytes per I/O
= 1000 × 10^6 / (64 × 10^3) = 15,625 I/Os per second
The CPU is the bottleneck, so we can now configure the rest of the system to
perform at the level dictated by the CPU, 10,000 I/Os per second.
Let's determine how many disks we need to be able to accommodate 10,000 I/Os
per second. To find the number of disks, we first find the time per I/O operation
at the disk:
Time per I/O at disk = Seek + rotational time + Transfer time
= 6 ms + 64 KB / (75 MB/sec) ≈ 6.9 ms
Thus, each disk can complete 1000 ms / 6.9 ms ≈ 146 I/Os per second. To saturate
the CPU requires 10,000 I/Os per second, or 10,000/146 ≈ 69 disks.
To compute the number of SCSI buses, we need to check the average transfer rate
per disk to see if we can saturate the bus, which is given by
Transfer rate = Transfer size / Transfer time = 64 KB / 6.9 ms ≈ 9.56 MB/sec
The maximum number of disks per SCSI bus is 7, which won't saturate this bus.
This means we will need 69/7, or 10 SCSI buses and controllers.
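The whole sizing calculation fits in a few lines (decimal KB/MB units are assumed here, which matches the worked figures to within rounding):

```python
import math

cpu_rate = 3e9 / (200_000 + 100_000)     # 10,000 I/Os per second
bus_rate = 1000e6 / 64e3                 # 15,625 I/Os per second for 64 KB I/Os
io_rate = min(cpu_rate, bus_rate)        # the CPU is the bottleneck

time_per_io = 6e-3 + 64e3 / 75e6         # seek+rotation + transfer, ~6.85 ms
ios_per_disk = 1 / time_per_io           # ~146 I/Os per second per disk
n_disks = math.ceil(io_rate / ios_per_disk)

per_disk_bw = 64e3 / time_per_io         # ~9.3 MB/s, far below 320 MB/s,
n_controllers = math.ceil(n_disks / 7)   # so 7 disks never saturate one bus

assert io_rate == 10_000
assert n_disks == 69 and n_controllers == 10
```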

121
93

1. (1) What is the disadvantage of not applying Amdahl's law?
(2) What is the meaning of CPI = 1.5?
(3) Can the CPI be smaller than 1? Why?
(4) Do we need to know the ISA while designing a good compiler? Why?
Answer:
(1) Without Amdahl's law we cannot predict the overall speedup of an
enhancement, so we may spend effort improving an aspect that accounts for
only a small fraction of the execution time.
(2) Each instruction takes, on average, 1.5 CPU clock cycles to execute.
(3) Yes. A superscalar computer can issue more than one instruction per clock
cycle, so the total clock cycles can be fewer than the instruction count,
making CPI smaller than 1.
(4) Yes. The compiler must know the ISA to generate correct and efficient code,
e.g., to choose instructions and addressing modes and to allocate the
available registers.

2. Please draw the formats of five MIPS addressing modes.
Answer:
[Figure: the five MIPS addressing-mode formats.
1. Immediate addressing: op | rs | rt | Immediate (the operand is a constant in the instruction)
2. Register addressing: op | rs | rt | rd | … | funct (the operand is a register)
3. Base addressing: op | rs | rt | Address (the operand is at Memory[register + Address], as a byte, halfword, or word)
4. PC-relative addressing: op | rs | rt | Address (the branch target is PC + Address)
5. Pseudodirect addressing: op | Address (the jump target is the Address field concatenated with the upper bits of the PC)]

122
3. Compare the number of gate delays for the critical paths of two 16-bit adders, one
using ripple carry and the other using two-level carry lookahead.
Answer:
(1) In a ripple carry adder, each bit's carry takes 2 gate delays, so the
critical path of a 16-bit ripple carry adder is 2 × 16 = 32 gate delays.
(2) In a two-level carry lookahead adder, g_i and p_i take 1 gate delay; the
group signals P_i and G_i take 2 more gate delays; and the carries out of the
carry-lookahead unit take another 2 gate delays. The critical path is
therefore 1 + 2 + 2 = 5 gate delays.
The critical path is the longest path a carry must propagate through the adder.

4. Prove that Booth's algorithm works for multiplication of two's complement
signed integers.
Answer:
Suppose that a is the multiplier, b is the multiplicand, and a_i is the i-th bit
of a. Booth's algorithm applies the following rule at each bit (with a_{-1} = 0):

a_i a_{i-1} Operation
0   0       Do nothing
0   1       Add b
1   0       Subtract b
1   1       Do nothing

That is, at bit i the algorithm adds (a_{i-1} − a_i) × b × 2^i. Summing over all
32 bits:

(a_{-1} − a_0)·b·2^0 + (a_0 − a_1)·b·2^1 + … + (a_29 − a_30)·b·2^30 + (a_30 − a_31)·b·2^31
= b × (a_31 × (−2^31) + a_30 × 2^30 + … + a_1 × 2^1 + a_0 × 2^0)
= b × a

since a_31 × (−2^31) + a_30 × 2^30 + … + a_0 × 2^0 is exactly the value of a in
two's complement representation.
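The identity the proof rests on can also be spot-checked numerically over all 4-bit two's-complement values (a small sketch of ours, not part of the original proof):

```python
def booth_sum(a, b, n):
    # Sum of (a_{i-1} - a_i) * b * 2^i over the n bits of a, with a_{-1} = 0.
    bits = [(a >> i) & 1 for i in range(n)]   # two's-complement bits of a
    prev, total = 0, 0
    for i in range(n):
        total += (prev - bits[i]) * b * (1 << i)
        prev = bits[i]
    return total

for a in range(-8, 8):        # every 4-bit two's-complement multiplier
    for b in range(-8, 8):
        assert booth_sum(a, b, 4) == a * b
```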


5. According to the following figure, what are the values of a(1), a(2), ..., d(7) in
the table?

Instruction RegDst ALUSrc MemtoReg RegWrite MemRead MemWrite Branch ALUOp1 ALUOp0
R-format a(1) a(2) a(3) a(4) a(5) a(6) a(7) 1 0
lw b(1) b(2) b(3) b(4) b(5) b(6) b(7) 0 0
sw c(1) c(2) c(3) c(4) c(5) c(6) c(7) 0 0
beq d(1) d(2) d(3) d(4) d(5) d(6) d(7) 0 1



123

















Answer:
Instruction RegDst ALUSrc MemtoReg RegWrite MemRead MemWrite Branch ALUOp1 ALUOp0
R-format 1 0 0 1 0 0 0 1 0
lw 0 1 1 1 1 0 0 0 0
sw x 1 x 0 0 1 0 0 0
beq x 0 x 0 0 0 1 0 1


6. Why may DMA I/O data transfer cause the stale data problem? How can it be
overcome? Give three different approaches to resolve this problem and explain
your reasons clearly.
Answer:
(1) Consider a read from disk that the DMA unit places directly into memory. If
some of the locations into which the DMA writes are in the cache, the
processor will receive the old value when it does a read. Similarly, if the
cache is write-back, the DMA may read a value directly from memory when a
newer value is in the cache, and the value has not been written back. This is
called the stale data problem.
(2)
1. One approach is to route the I/O activity through the cache. This ensures
that reads see the latest value while writes update any data in the cache.
2. A second choice is to have the OS selectively invalidate the cache for an
I/O read or force write-backs to occur for an I/O write (often called cache
flushing).
[Figure for question 5: single-cycle MIPS datapath with control. The Control unit decodes Instruction[31-26] into RegDst, ALUSrc, MemtoReg, RegWrite, MemRead, MemWrite, Branch, and ALUOp; Branch ANDed with the ALU Zero output forms PCSrc]

124
3. The third approach is to provide a hardware mechanism for selectively
flushing (or invalidating) cache entries.

7. What is called split transaction protocol used in I/O data bus design?
Answer:
With a split transaction protocol, the bus is released while a request is being
serviced (e.g., during the memory access), so other masters can use it; the
reply comes back later as a separate bus transaction. This increases the
effective bus bandwidth by not holding the bus while it is not transmitting
information.

8. There are three ways to schedule the branch delay slot in order to reduce or
eliminate the control hazard. Give simple examples to explain their principle
clearly and briefly.
Answer:
There are three ways to fill the branch delay slot: (a) from before the branch,
(b) from the branch target, and (c) from the fall-through path. (a) is the best;
use (b) or (c) when (a) is impossible because of a data dependency. (b) is only
valuable when the branch is taken, but it is safe to execute that instruction
even when the branch is not taken. (c) is only valuable when the branch is not
taken, but it is safe to execute that instruction even when the branch is taken.



125
9. It is well known that multi-level cache design is one of the most important ways
to upgrade the CPU performance. Suppose that we have a processor with a base
CPI of 1.0, assuming all references hit in the primary cache, and a clock rate of
1000 MHz. Assume a main memory access time of 200 ns, including all miss
handling. Suppose the miss rate per instruction at the primary cache is 5%. How
much faster will the machine be if we add a secondary cache that has a 20 ns
access time for either a hit or a miss and is large enough to reduce the miss rate to
main memory to 2%? What are the global miss rate as well as local miss rate for
this two level cache machine?
Answer:
(1) The CPU clock cycle time = 1/1000 MHz = 1 ns
CPI for the one-level cache = 1 + 200 × 0.05 = 11
CPI for the two-level cache = 1 + 20 × 0.05 + 200 × 0.02 = 6
The machine with the two-level cache is faster than the machine with the
one-level cache by 11/6 ≈ 1.83
(2)
Miss rate Primary cache Secondary cache
Global 5% 2%
Local 5% 0.02/0.05 = 40%
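Checking the numbers (at 1 GHz the cycle is 1 ns, so penalties in cycles equal the access times in ns):

```python
cpi_one = 1 + 0.05 * 200                  # primary misses go to memory
cpi_two = 1 + 0.05 * 20 + 0.02 * 200      # L2 on every L1 miss, memory on 2%
assert abs(cpi_one - 11) < 1e-9
assert abs(cpi_two - 6) < 1e-9
assert round(cpi_one / cpi_two, 2) == 1.83

global_l2 = 0.02                          # memory-bound misses / all refs
local_l2 = 0.02 / 0.05                    # fraction of L2 accesses that miss
assert abs(local_l2 - 0.4) < 1e-9         # 40%
```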


126
92

1. A base processor and two options for improving its hardware and compiler design
are described as follows:
(a) The base machine, M_base:
M_base has a clock rate of 200 MHz and the following measures:
Instruction class CPI Frequency
A 2 50%
B 3 20%
C 4 30%
(b) The machine with improved hardware, M_hw:
M_hw has a clock rate of 250 MHz and the following measures:
Instruction class CPI Frequency
A 1 50%
B 2 20%
C 3 30%
(c) The combination of the improved compiler and the base machine, M_comp:
The instruction improvements from this enhanced compiler are as follows:
Instruction class % of instructions executed vs. M_base
A 70%
B 80%
C 60%

(1) What is the CPI (clock cycles per instruction) for each machine?
(2) How much faster is each of M_hw and M_comp than M_base?
Answer:
(1) CPI_Mbase = 0.5 × 2 + 0.2 × 3 + 0.3 × 4 = 2.8
CPI_Mhw = 0.5 × 1 + 0.2 × 2 + 0.3 × 3 = 1.8
Suppose there are I instructions to be run in M_base; then there are
0.5 × 0.7 × I + 0.2 × 0.8 × I + 0.3 × 0.6 × I = 0.69 I instructions to be run in
M_comp.
So, CPI_Mcomp = (0.5×0.7×2 + 0.2×0.8×3 + 0.3×0.6×4) × I / (0.69 I) = 2.75

(2) ExeTime of M_base / ExeTime of M_hw
= (2.8 × I / (200 × 10^6)) / (1.8 × I / (250 × 10^6)) = 1.94
ExeTime of M_base / ExeTime of M_comp
= (2.8 × I / (200 × 10^6)) / (2.75 × 0.69 I / (200 × 10^6)) = 1.48
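The comparison in executable form (note the unrounded compiler ratio is 2.8/1.9 ≈ 1.47; the worked answer rounds CPI_Mcomp to 2.75 first and reports 1.48):

```python
cpi_base = 0.5*2 + 0.2*3 + 0.3*4            # 2.8
cpi_hw = 0.5*1 + 0.2*2 + 0.3*3              # 1.8
count_scale = 0.5*0.7 + 0.2*0.8 + 0.3*0.6   # 0.69 of the instructions remain
cpi_comp = (0.5*0.7*2 + 0.2*0.8*3 + 0.3*0.6*4) / count_scale   # ~2.75

t_base = cpi_base / 200e6                   # time per original instruction
t_hw = cpi_hw / 250e6
t_comp = count_scale * cpi_comp / 200e6     # fewer instructions, same clock

assert round(t_base / t_hw, 2) == 1.94
assert round(t_base / t_comp, 2) == 1.47    # 1.48 after rounding CPI first
```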

127
2. (a) Describe the basic concepts and advantages of Booth's algorithm.
(b) Explain the difference between the restoring division algorithm and the
non-restoring division algorithm.
(c) Calculate the largest and smallest positive normalized numbers for the IEEE
754 standard single-precision floating-point operand format.
Answer:
(a) Booth's algorithm replaces a string of 1s in the multiplier with an initial
subtract when we first see a 1 and a later add when we see the bit after the
last 1. If the machine performs a shift faster than an addition, then on
average Booth's algorithm speeds up the computation. Besides, Booth's
algorithm handles signed numbers well.
(b) The restoring division algorithm adds the divisor back whenever a
subtraction leaves a negative remainder, restoring it before the next step.
The non-restoring algorithm skips the restore: when the remainder is
negative, it shifts and adds the divisor in the next step instead, so it
needs only one add or subtract per step.
(c) Largest number: 1.1111 1111 1111 1111 1111 111 × 2^127
Smallest number: 1.0 × 2^−126


3. Describe the following different implementations of a computer and compare
their advantages and disadvantages: single-cycle, multi-cycle, and pipelined
implementations.
Answer:
Single-cycle: an implementation in which an instruction is executed in one
clock cycle. Advantage: simple. Disadvantage: poor performance.
Multi-cycle: an implementation in which an instruction is executed in multiple
clock cycles. Advantage: less hardware overhead. Disadvantage: the control is
complex.
Pipelined: an implementation in which multiple instructions are overlapped in
execution, much like an assembly line. Advantage: high performance.
Disadvantage: hazards need to be resolved.



128
4. It is well known that control hazard is one of the main bottlenecks during
pipelining execution of instructions. You are required to describe the principles of
the following two resolving methods by using simple examples clearly and
briefly.
(a) What is called delayed branch technique? How to utilize the delay slot by
inserting appropriate instruction instead of no-op?
(b) How to use 2-bit branch prediction technique to reduce the branch hazard
penalty?
Answer:
(a) The compiler detects branch instructions and rearranges the instruction
sequence to eliminate the branch hazard penalty. We can place an instruction
that is not affected by the branch (a safe instruction) in the branch delay
slot. For example, a safe instruction such as add $s1, $s2, $s3 from before
the branch can be moved into the delay slot.
(b) A 2-bit branch prediction scheme keeps a 2-bit saturating counter per
branch, so a prediction must be wrong twice before it changes. The processor
fetches from the predicted path immediately, and loop-closing branches
mispredict only once per loop exit, reducing the branch hazard penalty.
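The 2-bit scheme in (b) can be modeled as a saturating counter with states 0..3, predicting taken when the counter is 2 or 3 (a toy model of ours, not from the exam):

```python
def run_predictor(outcomes, state=3):
    # state 0..3; two consecutive mispredictions are needed to flip
    # the prediction, which helps on loop-closing branches.
    mispredicts = 0
    for taken in outcomes:
        if (state >= 2) != taken:
            mispredicts += 1
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return mispredicts

# A loop branch taken 9 times then not taken, run through twice:
loop = ([True] * 9 + [False]) * 2
# Only the final not-taken of each pass mispredicts: 2 misses in 20 branches.
assert run_predictor(loop) == 2
```

A 1-bit predictor would mispredict twice per pass here (on the exit and again on re-entry), which is exactly what the second bit avoids.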

5. What is called n-way set associative address mapping technique used in cache
memory design? Give an example to explain its principle clearly. Usually, it has
better hit ratio than that of direct mapping technique. Is it true? Why?
Answer:
In a set-associative cache there are a fixed number of locations (at least two)
where each block can be placed; a set-associative cache with n locations for a
block is called an n-way set-associative cache. For a small cache it is true
that set-associative mapping has a better hit ratio than direct mapping,
because it reduces the miss rate due to conflict misses.

6. Compare the main differences among the following three I/O data transfer
techniques: polling, interrupt, and DMA. Also describe their main advantages and
disadvantages clearly and briefly
Answer:

129
Types Polling Interrupt DMA
Differences
The processor
periodically checking
the status of an I/O
device to determine
the need to service
the device
I/O devices employs
interrupts to indicate
to the processor that
they need attention
DMA approach
provides a device
controller the ability
to transfer data
directly to or from
the memory without
involving the
processor
Advantages
Simple Can eliminate the
need for the
processor to poll the
device and allows the
processor to focus on
executing programs
DMA can be used to
interface a hard disk
without consuming
all the processor
cycles
Disadvantages
Waste a lot of
processor time
More complex than
polling
Require hardware
support

7. Given the datapath for a multi-cycle computer and the definition and formats of
its instructions,
add $rd, $rs, $rt #$rd = $rs + $rt R-format
lw $rt, addr($rs) #$rt = Memory[$rs + addr] I-format
sw $rt, addr($rs) #Memory[$rs + addr] = $rt I-format
beq $rs, $rt, addr #if ($rs = $rt) goto PC + 4 + 4 addr I-format
j addr #go to 4 addr J-format


Name Fields Comments
Field size 6 bits 5 bits 5 bits 5 bits 5 bits 6 bits all MIPS instructions 32 bits
R-format op rs rt rd shamt funct arithmetic instruction format
I-format op rs rt address/immediate transfer, branch, imm format
J-format op target address jump instruction format

130
(a) Write the steps taken to execute the add instruction. How many clock cycles
are required for this instruction?
(b) Repeat (a) for the lw instruction.
(c) Repeat (b) for the beq instruction.
Answer:
(a) 1. instruction fetch; 2. instruction decode and register fetch;
3. execution; 4. write back. It takes 4 clock cycles.
(b) 1. instruction fetch; 2. instruction decode and register fetch;
3. address calculation; 4. memory access; 5. write back. It takes 5 clock cycles.
(c) 1. instruction fetch; 2. instruction decode and register fetch (including
target address calculation); 3. branch completion. It takes 3 clock cycles.


131
96

1. PC-relative addressing
(a) What is the PC-relative addressing?
(b) What is the major advantage for this addressing mode?
(c) What is the major limitation of this addressing mode implemented in a RISC
with the fixed-length instruction format?
(d) Assume instructions are always word-aligned and the immediate field is
12 bits long. What is the target range that a PC-relative branch instruction can
go to? (a word = 4 bytes)
Answer:
(a) PC-relative addressing: where the address is the sum of the PC and a constant
in the instruction.
(b) PC-relative addressing is useful in connection with conditional jumps,
because we usually only want to jump to some nearby instruction. Another
advantage of program-relative addressing is that the code may be
position-independent, i.e. it can be loaded anywhere in memory without the
need to adjust any addresses.
(c) The range that a PC-relative branch instruction can go to is limited.
(d) About ±2^11 words of the current instruction.

2. Pipelining
In general, the speedup of a 5-stage pipelined scalar processor against its
non-pipelined counterpart hardly achieves 5. Please give at least 4 reasons.
Answer:
(1) The stages may be imperfectly balanced.
(2) The delay due to pipeline register.
(3) Data hazard and control hazard.
(4) Time to fill pipeline and time to drain it reduces speedup

3. Virtual memory
(a) Why can the virtual memory mechanism provide the memory protection
among processes in a multi-processing environment?
(b) Describe how a virtual address is translated into a physical address.
(c) What is the translation lookaside buffer (TLB) designed for? How does it
work?
(d) What are the benefits if a larger page size is chosen? (List at least 3.) What is
the drawback? (List at least 1.)
Answer:
(a) The hardware provides three basic capabilities to protect among processes.

132
1. Support two modes that indicate whether the running process is a user
process or an operating system process called supervisor process.
2. Provide a portion of the processor state that a user process can read but
not write.
3. Provide mechanisms whereby the processor can go from user mode to
supervisor mode, and vice versa.
(b) In virtual memory systems, a page table is used to translate a virtual
address into a physical address. The page table contains the virtual-to-physical
address translations.
(c) The TLB is used to make address translation fast. The TLB (a cache of the
page table) keeps track of recently used address mappings to avoid an access
to the page table.
(d) Benefits: (1) more efficient to amortize the high access time
(2) more spatial and temporal localities
(3) smaller page table size
Drawbacks: (1) more internal fragmentation
(2) higher miss penalty

4. Amdahl's law
Amdahl's law is useful to predict the expected speedup for certain technique.
Now apply it to power reduction. Assume the power of a circuit is
proportional to V^2 F, where V is the supply voltage and F is the working clock
frequency. Assume the throughput is directly proportional to F.
(a) For a technique A that can reduce V by a factor of 2. Assume the technique A
only affects 40% of the circuit. Please derive the power reduction factor.
(b) Following (a), further assume if reducing V by a factor of 2 will result in the
maximum frequency F reduced by 80% for that part and 20% for other part.
Please derive the improvement of throughput-power product.
Answer:
(a) Power reduction factor
= V^2 F / (0.4 × (V/2)^2 F + 0.6 × V^2 F) = 1 / 0.7 ≈ 1.43
(b) Power = 0.4 × (V/2)^2 × F × (1 − 0.8) + 0.6 × V^2 × F × (1 − 0.2) = 0.5 V^2 F
Throughput-power product = (0.4 × 0.2 F + 0.6 × 0.8 F) × 0.5 V^2 F = 0.28 V^2 F^2
Throughput-power product improvement = V^2 F^2 / (0.28 V^2 F^2) ≈ 3.57
(In (b), throughput is proportional to F, so the baseline throughput-power
product is F × V^2 F = V^2 F^2.)
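Replaying the arithmetic with V = F = 1 (the symbolic factors cancel in both ratios):

```python
V = F = 1.0
base_power = V**2 * F
# (a) Technique A halves V on 40% of the circuit.
power_a = 0.4 * (V/2)**2 * F + 0.6 * V**2 * F
assert round(base_power / power_a, 2) == 1.43

# (b) The affected 40% also runs at 0.2F; the other 60% runs at 0.8F.
power_b = 0.4 * (V/2)**2 * (0.2 * F) + 0.6 * V**2 * (0.8 * F)   # 0.5 V^2 F
throughput_b = 0.4 * 0.2 * F + 0.6 * 0.8 * F                    # 0.56 F
base_tpp = F * V**2 * F                                          # V^2 F^2
new_tpp = throughput_b * power_b                                 # 0.28 V^2 F^2
assert abs(power_b - 0.5) < 1e-9
assert round(base_tpp / new_tpp, 2) == 3.57
```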




133
95

1. Performance enhancement:
(1) You have two possible improvements on a computer: either make multiply
instructions run four times faster than before, or make memory access
instructions run two times faster than before. You repeatedly run a program
that takes 10000 seconds to execute. Of this time, 20% is used for
multiplication, 50% for memory access instructions, and 30% for other tasks.
(a) What will the speedup be if you improve only memory access?
(b) What will the speedup be if both improvements are made?
(2) If the gate delay of AND/OR/XOR are the same, 2ns, (regardless of the
number of inputs), show the design of the 16-bit two level carry look ahead
adder (first level is 4-bit in a group) and the speedup over the 16-bit ripple
adder.
(3) In the design of instruction set, one principle is "make the common case fast".
Take one example of MIPS instruction set design to illustrate this principle.
Answer:
(1)
(a) Speedup = 1 / (0.5 + 0.5/2) = 1.33
(b) Speedup = 1 / (0.3 + 0.5/2 + 0.2/4) = 1.67

(2)
Propagate signals (4):
P0 = p3p2p1p0
P1 = p7p6p5p4
P2 = p11p10p9p8
P3 = p15p14p13p12
Generate signals (4):
G0 = g3 + (p3g2) + (p3p2g1) + (p3p2p1g0)
G1 = g7 + (p7g6) + (p7p6g5) + (p7p6p5g4)
G2 = g11 + (p11g10) + (p11p10g9) + (p11p10p9g8)
G3 = g15 + (p15g14) + (p15p14g13) + (p15p14p13g12)
Carries (4):
C1 = G0 + c0P0
C2 = G1 + G0P1 + c0P0P1
C3 = G2 + G1P2 + G0P1P2 + c0P0P1P2
C4 = G3 + G2P3 + G1P2P3 + G0P1P2P3 + c0P0P1P2P3

134














In a ripple carry adder each bit's carry takes 2 gate delays, so the 16-bit
ripple carry adder's critical path delay is 2 × 16 × 2 ns = 64 ns. In the carry
lookahead adder, g_i and p_i take 1 gate delay, P_i and G_i take 2 more, and
the carries out of the lookahead unit take another 2, so the critical path
delay is (1 + 2 + 2) × 2 ns = 10 ns. Speedup = 64/10 = 6.4.
(3) For example, MIPS gives conditional branches a PC-relative address field:
branch targets are usually close to the branch, so this common case is made
fast, while rare distant transfers use the jump instruction.

2. Pipelining:
(1) Execution time = Instruction Count * Cycle Per Instruction * Cycle Time is
a popular performance equation. Explain why the pipelining technique can
increase the computer performance in terms of this equation.
(2) In reality, it is impossible to get an n-fold speedup by an n-stage pipelining.
Give the reasons.
Answer:
(1) (a) Pipeline can reduce the average CPI by overlapping the execution of
instructions
(b) By dividing the datapath into stages, pipeline can increase the clock rate
and thus shorten the cycle time.
(2) (a) The stages may be imperfectly balanced.
(b) The delay due to pipeline register.
(c) Data hazard and control hazard.
(d) Time to fill pipeline and time to drain it reduces speedup
[Figure for question 1(2): 16-bit two-level carry lookahead adder. Operand bits a0..a15 and b0..b15 feed four 4-bit ALUs (results 0-3, 4-7, 8-11, 12-15); each ALU produces group signals Pi and Gi, and the carry-lookahead unit computes C1, C2, C3, C4 (and CarryOut) from P0-P3, G0-G3, and CarryIn]

135
3. Pipeline hazards:
(1) Identify types of data hazards and explain them briefly.
(2) What techniques can be used to reduce the performance penalty caused by
control hazards? List at least 3 techniques.
Answer:
(1) Read after Write (RAW): the first instruction may not have finished writing to
the operand, the second instruction may use incorrect data.
Write after Read (WAR): the write may have finished before the read, the
read instruction may incorrectly get the new written value.
Write after Write (WAW): Two instructions that write to the same operand
are performed. The first one issued may finish second, and therefore leave the
operand with an incorrect data value.
(2) 1. Branch prediction
2. Move the branch decision earlier in the pipeline
3. Delayed branch

4. Memory system:
(1) What is the main objective of the memory hierarchy?
(2) What is the fundamental principle that makes the memory hierarchy work?
Describe the principle briefly.
(3) Briefly describe the three common strategies for block placement.
(4) Compare the strategies mentioned in (3) in terms of the cache miss rate and
the hardware implementation cost.
Answer:
(1) To present the user with as much memory as is available in the cheapest
technology, while providing access at the speed offered by the fastest memory.
(2) The principle of locality: programs access only a relatively small portion of
their address space at any instant of time.
Temporal locality: if an item is referenced, it will tend to be referenced
again soon.
Spatial locality: if an item is referenced, items whose addresses are close by
will tend to be referenced soon.

(3) Direct-mapped cache: A cache structure in which each memory location is
mapped to exactly one location in the cache.
Set-associative cache: A cache that has a fixed number of locations (at least
two) where each block can be placed.
Fully associative cache: A cache structure in which a block can be placed in
any location in the cache.
(4)

136
Strategy            Miss rate   Cost
Direct-mapped       High        Low
Set-associative     Medium      Medium
Fully associative   Low         High


137
94
1. A 1-bit full adder cell is represented as the following symbol.
[Symbol: a full adder cell with inputs a, b, and Ci (carry-in), and outputs S (sum) and CO (carry-out)]
Assume the time delay is T for all input-to-output paths. Design an adder for R =
A + B + C + D by ONLY using these full adder cells. A, B, C, and D are all 4-bit
2's complement values. A can also be represented as {A3, A2, A1, A0}, where A3 is
the MSB and A0 is the LSB. The same rule applies to B, C, and D.
(1) What is the minimum bit width for R to be able to store all possible results?
(2) Draw your design in terms of the given symbol. You should minimize the
number of required adder cells. Report the number of adder cells used in your
design as well.
(3) What is the worst-case time delay of your adder design in terms of T? You
should minimize this time delay.
Answer:
(1) 6 bits (hint: the addition of two n-bit numbers yields an (n + 1)-bit result.)
(2) 12 adder cells
[Figure: adder tree built from 12 full adder cells combining A, B, C (c3-c0), and D (d3-d0)]


(3) 6T




138
2. Three enhancements with the following speedups are proposed for a new
architecture: speedup1 = 30, speedup2 = 20, speedup3 = 15. Only one
enhancement is usable at a time.
(1) Please derive Amdahl's law for multiple enhancements of which only one is
usable at a time; that is, give the speedup formula, where FEi is the fraction
of time that enhancement i can be used and SEi is the speedup of enhancement i.
For a single enhancement the equation should reduce to the familiar form of
Amdahl's law.
(2) If enhancements 1 and 2 are each usable for 25% of the time, what fraction of
the time must enhancement 3 be used to achieve an overall speedup of 10?
(3) Assume the enhancements can be used 25%, 25% and 10% of the time for
enhancements 1, 2, 3, respectively. For what fraction of the reduced execution
time is no enhancement in use?
(4) Assume, for some benchmark, the possible fraction of use is 15% for each of
enhancements 1 and 2 and 70% for enhancement 3. We want to maximize
performance. If only one enhancement can be implemented, which should it
be? If two enhancements can be implemented, which two should be chosen?
Answer:
(1)
|
.
|

\
|
+
=

= =
i
i i
i
i
FE
SE
FE
Speedup
3
1
3
1
1
1

(2) 10 = 1 / [(1 - 0.25 - 0.25 - f) + 0.25/30 + 0.25/20 + f/15], which gives f = 45%
(3) Suppose t is the execution time before improvement.
Execution time after improvement = t × (0.25/30 + 0.25/20 + 0.1/15 + 0.4) = 0.4275 t
No enhancement is in use during the 0.4 t portion, so the fraction of the
reduced execution time with no enhancement in use = 0.4 t / 0.4275 t = 0.94.

(4) Speedup1 = 1 / [(1 - 0.15) + 0.15/30] = 1.1696
Speedup2 = 1 / [(1 - 0.15) + 0.15/20] = 1.1662
Speedup3 = 1 / [(1 - 0.7) + 0.7/15] = 2.8846

If only one enhancement can be implemented, enhancement 3 should be
chosen.
Speedup12 = 1 / [(1 - 0.15 - 0.15) + 0.15/30 + 0.15/20] = 1.40


139
Speedup13 = 1 / [(1 - 0.15 - 0.7) + 0.15/30 + 0.7/15] = 4.96
Speedup23 = 1 / [(1 - 0.15 - 0.7) + 0.15/20 + 0.7/15] = 4.90

If two enhancements can be implemented, enhancements 1 and 3 should be chosen.

3. Consider a virtual memory system with the following characteristics:
a. Total of 1 million pages;
b. 4K bytes of space in each page;
c. Each entry within the page table has 12 bits, including 1 valid bit.
(1) What is the total addressable physical memory space using the virtual
memory system?
(2) How many bits are required for the virtual address, including bits for the
virtual page number and the page offset?
(3) Explain the meaning of page fault in a virtual memory system. What will
happen to the valid bit within the page table if page fault occurs?
Answer:
(1) The size of a physical page number = 12 - 1 = 11 bits. The number of pages in
physical memory = 2^11 = 2K.
The total addressable physical memory space = 2K × 4KB = 8 MB.
(2) Since the virtual memory has 1 million = 2^20 pages, the size of a virtual
page number is 20 bits.
Since there are 4K = 2^12 bytes in each page, the size of the page offset is
12 bits.
The virtual address therefore requires 20 + 12 = 32 bits.
(3) Page fault: an event that occurs when an accessed page is not present in main
memory.
If the valid bit for a virtual page is off, a page fault occurs.


140
4. The following figure shows the pipelined datapath with the control signals
identified. A sequence of MIPS instructions is given as follows:
add $s1, $s2, $s3 # Register $s1 = $s2 + $s3
sw $s1, 100 ($s4) # Store register $s1 into Memory [$s4 + 100]
(1) Because the setting of the control lines depends only on the opcode values
(Instruction [31-26]), we define whether control signal should be 0 (not
activated), 1 (activated), or X (don't care), for each of the instructions.
Complete the table by specifying the values of A, B, C, D and E.
Instruction   EX stage                       MEM stage                  WB stage
              RegDst ALUOp1 ALUOp0 ALUSrc   Branch MemRead MemWrite   RegWrite MemtoReg
add           1      1      0      0        0      A       0          1        B
sw            C      0      0      1        0      0       1          D        E
(2) The inputs carrying the register number to the register file are all 5 wide in the
machine language instructions. Identify the bit ranges of Read register 1 and
Read register 2 separately.
(3) The above two instructions are dependent; that is, the sw instructions uses the
results calculated by the add instruction. To resolve this hazard, we must first
detect such a hazard and then forward the proper value. Specify all the
necessary inputs to the forwarding unit (not shown in the figure) so that any
data dependence can be detected. One input is done for you,
ID/EX.Instruction [20-16].

Answer:
(1)
A = 0, B = 1, C = X, D = 0, E = X

141
(2)
Read register 1 bit range is [25-21]
Read register 2 bit range is [20-16]
(3)
ID/EX.Instruction[25-21] (i.e., ID/EX.RegisterRs)
ID/EX.Instruction[20-16] (i.e., ID/EX.RegisterRt)
EX/MEM.RegWrite
EX/MEM.RegisterRd
MEM/WB.RegWrite
MEM/WB.RegisterRd

5. An I/O device tries to fetch data from memory using the following asynchronous
handshaking protocol: (grey signals are asserted by the I/O device, where
memory asserts the signals in solid black; numbered arrows are referred to the
following steps)
[Timing diagram: waveforms for ReadReq, Data, Ack, and DataRdy, with numbered arrows 1-7 marking the transitions described in the steps below]
The steps in the asynchronous protocol begin immediately after the I/O device
signals a request by raising ReadReq and putting the address on the Data lines:
(1) When memory sees the ReadReq line, it reads the address from the data bus
and raises Ack to indicate it has been seen.
(2) I/O device sees the Ack line high and releases the ReadReq and data lines.
(3) Memory sees the ReadReq is low and drops the Ack line to acknowledge the
ReadReq signal.
(4) This step starts when the memory has the data ready. It places the data from
the read request on the data lines and raises DataRdy.
(5) The I/O device sees DataRdy, reads the data from the bus, and signals that it
has the data by raising Ack.
(6) The memory sees the Ack signal, drops DataRdy, and releases the data lines.
(7) Finally, the I/O device, seeing DataRdy goes low, drops the Ack line, which
indicates that the transmission is completed.
A new bus transaction can now begin.
Your task is to implement the above asynchronous handshaking protocol as a finite
state machine. Note that you only have to show the signal flow, without the
hardware design.
Answer:

142
The handshake can be drawn as two cooperating finite state machines:

Memory FSM:
  State 1 (entered when ReadReq is asserted): record the address from the data
    lines and assert Ack; stay here while ReadReq remains high.
  States 3, 4 (when ReadReq goes low): drop Ack; when the data is ready, put the
    memory data on the data lines and assert DataRdy.
  State 6 (when Ack is asserted): release the data lines and deassert DataRdy,
    then return to the initial state.

I/O device FSM:
  Start (on a new I/O request): put the address on the data lines and assert
    ReadReq.
  State 2 (when Ack is asserted): release the data lines and deassert ReadReq.
  State 5 (when DataRdy is asserted): read the memory data from the data lines
    and assert Ack.
  State 7 (when DataRdy goes low): deassert Ack; the transmission is complete
    and a new I/O request can begin.




143
96

1. Consider a MIPS processor with an additional floating point unit. Assume
functional unit delays in the processor are as follows: memory (2 ns), ALU and
adders (2 ns), FPU add (8 ns), FPU multiply (16 ns), register file access (1 ns),
and the remaining units (0 ns). Also assume instruction mix as follows: loads
(31%), stores (21%), R-format instructions (27%), branches (5%), jumps (2%),
FP adds and subtracts (7%), and FP multiplies and divides (7%).
(1) What is the delay in nanosecond to execute a load, store, R-format, branch,
jump, FP add/subtract, and FP multiply/divide instruction in a single-cycle
MIPS design?
(2) What is the averaged delay in nanosecond to execute a load, store, R-format,
branch, jump, FP add/subtract, and FP multiply/divide instruction in a
multicycle MIPS design?
Answer:
(1) 20 ns
Instruction   Memory   Register   ALU / FPU add / FPU multiply   Memory   Register   Delay (ns)
load 2 1 2 2 1 8
store 2 1 2 2 0 7
R-format 2 1 2 0 1 6
branch 2 1 2 0 0 5
jump 2 0 0 0 0 2
FP add/sub 2 1 8 0 1 12
FP mul/div 2 1 16 0 1 20
(2) Average delay = (5 × 0.31 + 4 × 0.21 + 4 × 0.27 + 3 × 0.05 + 3 × 0.02 +
4 × 0.07 + 4 × 0.07) × 16 ns = 4.24 × 16 ns = 67.84 ns

2. Consider a cache with 4 memory blocks. Assume that the cache contains no
memory block initially. How many cache misses will be introduced by the
direct-mapped, 2-way set associative, and fully associative caches if the
memory blocks with addresses 0, 8, 0, 6 and 8 are fetched sequentially?

144
Answer:
Block         Direct-mapped        2-way set associative   Fully associative
address       Tag  Index  H/M      Tag  Index  H/M         Tag  H/M
0             0    0      Miss     0    0      Miss        0    Miss
8             2    0      Miss     4    0      Miss        8    Miss
0             0    0      Miss     0    0      Hit         0    Hit
6             1    2      Miss     3    0      Miss        6    Miss
8             2    0      Miss     4    0      Miss        8    Hit
No. of misses      5                    4                       3

3. Which of the following techniques can resolve control hazards?
(1) Branch prediction
(2) Stall
(3) Delayed branch
Answer: all three techniques can resolve control hazards.
(1) Execution of the branch instruction is continued in the pipeline by
predicting the branch is to take place or not. If the prediction is wrong,
the instructions that are being fetched and decoded are discarded
(flushed).
(2) Pipeline is stalled until the branch is complete. The penalty will be
several clock cycles.
(3) A delayed branch always executes the following instruction. Compilers
and assemblers try to place an instruction that does not affect the branch
after the branch in the branch delay slot.

4. Write a C program which exhibits the temporal and spatial localities. The C
program cannot exceed 5 lines.
Answer:
void clear1(int array[], int size)
{
    int i;
    for (i = 0; i < size; i += 1)   /* sequential accesses to array[]: spatial locality */
        array[i] = 0;               /* i and size reused every iteration: temporal locality */
}


145
95

1. The following program tries to copy words from the address in register $a0 to the
address in register $a1 and count the number of words copied in register $v0. The
program stops copying when it finds a word equal to 0. You do not have to
preserve the contents of registers $v1, $a0, and $a1. This terminating word should
be copied but not counted.
Loop: lw $v1, 0($a0)
addi $v0, $v0, 1
sw $v1, 0($a0)
addi $a0, $a0, 1
addi $a1, $a1, 1
bne $v1, $zero, loop
There are multiple bugs in this MIPS program. Please fix them and turn in a
bug-free version.
Answer:
addi $v0, $zero, -1
Loop: lw $v1, 0($a0)
addi $v0, $v0, 1
sw $v1, 0($a1)
addi $a0, $a0, 4
addi $a1, $a1, 4
bne $v1, $zero, Loop

2. (a) Fill in the following table using the index provided in the keywords (1)-(5) to
determine the 3-bit Booth algorithm. Assume that you have both the
multiplicand and twice the multiplicand already in registers.
(1) None (2) Add the multiplicand (3) Add twice the multiplicand (4) Subtract
the multiplicand (5) Subtract twice the multiplicand.
(b) Assume x is 010101two and y is 011011two. Please use the 2-bit and 3-bit Booth
algorithms to do the y*x operation.
(c) Will the 3-bit Booth algorithm always have fewer operations than the 2-bit
Booth algorithm? Justify your answer with a brief description.
Current bits        Previous bit    Operation
a(i+1)   a(i)       a(i-1)
0        0          0
0        0          1
0        1          0
0        1          1
1        0          0
1        0          1
1        1          0
1        1          1

146
Answer:
(a)
Current bits        Previous bit    Operation
a(i+1)   a(i)       a(i-1)
0        0          0               (1)
0        0          1               (2)
0        1          0               (2)
0        1          1               (3)
1        0          0               (5)
1        0          1               (4)
1        1          0               (4)
1        1          1               (1)

(b) 2-bit:
Iteration   Step                  Multiplicand   Product
0           initial values        011011         000000 010101 0
1           10: - multiplicand    011011         100101 010101 0
            shift right product   011011         110010 101010 1
2           01: + multiplicand    011011         001101 101010 1
            shift right product   011011         000110 110101 0
3           10: - multiplicand    011011         101011 110101 0
            shift right product   011011         110101 111010 1
4           01: + multiplicand    011011         010000 111010 1
            shift right product   011011         001000 011101 0
5           10: - multiplicand    011011         101101 011101 0
            shift right product   011011         110110 101110 1
6           01: + multiplicand    011011         010001 101110 1
            shift right product   011011         001000 110111 0

3-bit:
Iteration   Step                  Multiplicand   Product
0           initial values        011011         000000 010101 0
1           010: + multiplicand   011011         011011 010101 0
            shift right product   011011         000110 110101 0
2           010: + multiplicand   011011         100001 110101 0
            shift right product   011011         001000 011101 0
3           010: + multiplicand   011011         100011 011101 0
            shift right product   011011         001000 110111 1
(c) No, not always strictly fewer. Each 3-bit (radix-4) iteration handles two
multiplier bits with at most one add/subtract, so for x = 010101two it uses 3
operations versus 6 for the 2-bit version. But the count is not always strictly
fewer; for a multiplier such as 000000two both versions perform zero operations.


147
3. Define zero, de-normalized number, floating point number, infinity, and NaN
(Not a number) in IEEE 754 double precision format by giving the range of their
exponents and significands, respectively. Fill in your answer in the following
table.
             zero   de-normalized   floating point   infinity   NaN
Exponent
Significand
Answer:
             zero   de-normalized   floating point   infinity   NaN
Exponent     0      0               1 - 2046         2047       2047
Significand  0      nonzero         anything         0          nonzero

4. Consider a CPU with following instructions:
Instruction   Example           Meaning
add           add $1, $2, $3    $1 = $2 + $3
sub           sub $1, $2, $3    $1 = $2 - $3
There are five pipeline stages:
(1) IF-Instruction fetch
(2) ID-Instruction decode and register fetch
(3) EX-Execution or calculate effective address
(4) MEM-Access data memory
(5) WB-Write back to registers
Now, consider a program segment S:
add $1, $2, $3
sub $4, $1, $5
add $6, $1, $7
add $8, $4, $1
sub $2, $7, $9
(a) If we stall the pipeline when there is a data hazard (no forwarding), how many
cycles will it take to complete this program segment. Draw the resulting
pipeline.
(b) Is it possible to produce the same result in few cycles by reordering
instruction? If, so show the reordering, depict the new pipeline and indicate
how many cycles it will take to complete this program segment S.
Answer:
(a) Suppose that register read and write can happen in the same clock cycle.
Therefore, we must stall 2 cycles between lines 1 and 2, and stall 1 cycle
before line 4 (which depends on line 2).
The total cycles to complete the segment = (5 - 1) + 5 + 3 = 12 clock cycles


148
1 2 3 4 5 6 7 8 9 10 11 12
add IF ID EX MEM WB
sub IF ID ID ID EX MEM WB
add IF IF IF ID EX MEM WB
add IF ID ID EX MEM WB
sub IF IF ID EX MEM WB

(b) add $1, $2, $3
sub $2, $7, $9
sub $4, $1, $5
add $6, $1, $7
add $8, $4, $1
1 2 3 4 5 6 7 8 9 10 11
add IF ID EX MEM WB
sub IF ID EX MEM WB
sub IF ID ID EX MEM WB
add IF IF ID EX MEM WB
add IF ID ID EX MEM WB
The total cycles to complete the segment = (5 - 1) + 5 + 2 = 11 clock cycles

149
94

1. Use the Booth algorithm to calculate the following:
(a) multiplicand × multiplier = 10111 × 10011 = (-9) × (-13) = 117
Iteration   Step             Multiplicand   Product
0           initial values   10111          00000 100110
1                            10111
2                            10111
3                            10111
4                            10111
5                            10111          00011 10101 1
(b) Prove the correctness of the Booth algorithm. The main idea of the Booth
algorithm is that a sequence of k 1s (k additions) is replaced by one addition
and one subtraction. You must explain why only one subtraction (without a
matching addition) is needed for the last sequence of 1s when the multiplier
is negative.
Answer: (a)
Iteration   Step                      Multiplicand   Product
0           initial values            10111          00000 100110
1           10: prod = prod - Mcand   10111          01001 100110
            shift right product       10111          00100 110011
2           11: no operation          10111          00100 110011
            shift right product       10111          00010 011001
3           01: prod = prod + Mcand   10111          11001 011001
            shift right product       10111          11100 101100
4           00: no operation          10111          11100 101100
            shift right product       10111          11110 010110
5           10: prod = prod - Mcand   10111          00111 010110
            shift right product       10111          00011 10101 1
(b) Suppose that a is the multiplier, b is the multiplicand, and a_i is the i-th
bit of a. The Booth algorithm performs the following operation at each bit i
(with a_-1 = 0):

a_i   a_i-1   Operation
0     0       Do nothing
0     1       Add b
1     0       Subtract b
1     1       Do nothing

That is, at bit position i the algorithm adds (a_i-1 - a_i) × b × 2^i to the
product, so the computed result is:

(a_-1 - a_0) × b × 2^0 + (a_0 - a_1) × b × 2^1 + (a_1 - a_2) × b × 2^2 + ...
+ (a_29 - a_30) × b × 2^30 + (a_30 - a_31) × b × 2^31

Collecting the terms for each a_i (the sum telescopes):

= b × (-a_31 × 2^31 + a_30 × 2^30 + a_29 × 2^29 + ... + a_1 × 2^1 + a_0 × 2^0)
= b × a

since -a_31 × 2^31 + (a_30 × 2^30 + ... + a_0 × 2^0) is exactly the value of a
as a 32-bit two's complement number. In particular, when the multiplier is
negative, a_31 = 1 and the last run of 1s extends to the MSB: that run is
opened with one subtraction but is never closed by an addition, which is why
only a subtraction is needed for the last sequence of 1s.
150


2. What is the biased single precision IEEE 754 floating point format of 0.9375?
What is the purpose of biasing the exponent of floating point numbers?
Answer:
(1) 0.9375ten = 0.1111two = 1.111two × 2^-1
S   E          F
0   01111110   11100000000000000000000
(2) Biasing makes the most negative exponent the all-0s pattern and the most
positive exponent the all-1s pattern, so floating point numbers can be compared
and sorted using simple unsigned integer comparisons of their bit patterns.

3. Why do ripple carry adders perform additions in a sequential manner? The
carry-lookahead adder is one of the fast-carry schemes used to improve adder
performance over ripple carry adders. What is the principle of these fast-carry
schemes? Briefly explain.
Answer:
(1) Because each bit position's carry-in is the carry-out of the next lower
position: bit i cannot finish until bit i-1 has produced its carry, so the
carries must ripple through the adder one position at a time.
(2) Instead of waiting for carries to ripple, fast-carry schemes compute the
carries directly from the primary inputs using generate (gi = ai · bi) and
propagate (pi = ai + bi) signals, so every carry is available after only a few
gate delays, regardless of position.

4. Assume the following:
(1) k is the number of bits of the computer's address size (using byte addressing)
(2) S is the cache size in bytes
(3) B is the block size in bytes, B = 2^b
(4) A stands for an A-way associative cache
Figure out the following quantities in terms of S, B, A, and k:
(a) the number of sets in the cache
(b) the number of index bits in the address, and
(c) the number of bits needed to implement the cache
Answer:
Address size: k bits
Cache size: S bytes/cache
Block size: B = 2
b
bytes/block
Associativity: A blocks/set

151
Number of sets in the cache = S / (A × B)
Number of bits for the index = log2(S / (A × B))
Number of bits for the tag = k - log2(S / (A × B)) - b = k - log2(S / A)
Number of bits needed to implement the cache
= sets/cache × associativity × (data + tag + valid)
= (S / (A × B)) × A × (8B + k - log2(S / A) + 1)
= (S / B) × (8B + k - log2(S / A) + 1) bits

152
93

1. Given the following bit pattern:
(0100 0000 0010 1101 1111 1000 0100 1101)two
What decimal number does it represent? Assume that it is an IEEE 754 single
precision floating point number.
Answer:
(1) (-1)^sign × (1 + significand) × 2^(exponent - bias)
= (-1)^0 × (1 + 2^-2 + 2^-4 + 2^-5 + 2^-7 + 2^-8 + 2^-9 + 2^-10 + 2^-11 + 2^-12
+ 2^-17 + 2^-20 + 2^-21 + 2^-23) × 2^1
= 2 + 2^-1 + 2^-3 + 2^-4 + 2^-6 + 2^-7 + 2^-8 + 2^-9 + 2^-10 + 2^-11 + 2^-16
+ 2^-19 + 2^-20 + 2^-22
≈ 2.7185242

2. Show the minimal MIPS instruction sequence for a new instruction called not that
takes the one's complement of a source register and places it in a destination
register. Convert this instruction (accepted by the MIPS assembler): not $s0, $s1
(Hint: It can be done in one instruction if you use the new logical instruction)
Answer: nor $s0, $s1, $zero

3. Consider the following measurements made on a pair of SPARCstation 10s
running Solaris 2.3, connected to two different types of networks, and using
TCP/IP for communication:
Characteristic                         Ethernet       ATM
Bandwidth from node to network         1.25 MB/sec    10 MB/sec
Interconnect latency                   18 µs          42 µs
HW latency to/from network             5 µs           9 µs
SW overhead sending to network         198 µs         211 µs
SW overhead receiving from network     249 µs         356 µs
(HW: Hardware, SW: Software)
Find the host-to-host latency for a 250-byte message using each network.
Answer:
The transmission time (Ethernet) = 250 bytes / (1.25 × 10^6 bytes/sec) = 200 µs
The transmission time (ATM) = 250 bytes / (10 × 10^6 bytes/sec) = 25 µs
The total latency to send and receive the packet is the sum of the transmission
time and the hardware and software overheads:
Total time (Ethernet) = 198 + 5 + 18 + 5 + 249 + 200 = 675 µs
Total time (ATM) = 211 + 9 + 42 + 9 + 356 + 25 = 652 µs

153
4. Suppose there is a processor running at 1.5 GHz and a hard disk. The hard disk
has a transfer rate of 8 MB/sec and uses DMA. Assume that the initial setup of a
DMA transfer takes 800 clock cycles for the processor, and assume the handling
of the interrupt at DMA completion requires 400 clock cycles for the processor. If
the average transfer from the disk is 16 KB, what fraction of this processor is
consumed if the disk is actively transferring 100% of the time? Ignore any impact
from bus contention between the processor and DMA controller.
Answer:
Each DMA transfer takes 16 KB / (8 MB/sec) = 2 × 10^-3 seconds. So if the disk is
constantly transferring, it requires (800 + 400) / (2 × 10^-3) = 600 × 10^3 clock
cycles per second.
Fraction of processor consumed = (600 × 10^3) / (1.5 × 10^9) = 0.4 × 10^-3 = 0.04%

154
92

1. Since assembly language is the interface to higher-level software, the assembler
can also treat common variations of machine language instructions as if they were
instructions in their own right. However, these instructions need not be
implemented in hardware. Such instructions are called pseudoinstructions. And
many such instruction sets appear in MIPS programs.
For each pseudoinstruction in the following table, produce a minimal sequence of
actual MIPS instructions to accomplish the same thing. You may need to use $at
for some of the sequences. In the following table, big refers to a specific number
that requires 32 bits to represent and small to a number that can be expressed
using 16 bits.
Pseudoinstruction    What it accomplishes        Solution
move $t1, $t2        $t1 = $t2                   ex: add $t1, $t2, $zero
clear $t0            $t0 = 0
beq $t1, small, L    if ($t1 == small) go to L
beq $t2, big, L      if ($t2 == big) go to L
li $t1, small        $t1 = small
li $t2, big          $t2 = big
ble $t3, $t5, L      if ($t3 <= $t5) go to L
bgt $t4, $t5, L      if ($t4 > $t5) go to L
bge $t5, $t3, L      if ($t5 >= $t3) go to L
addi $t0, $t2, big   $t0 = $t2 + big
lw $t5, big($t2)     $t5 = Memory[$t2 + big]
Answer:
Pseudoinstruction    What it accomplishes        Solution
move $t1, $t2        $t1 = $t2                   add $t1, $t2, $zero
clear $t0            $t0 = 0                     add $t0, $zero, $zero
beq $t1, small, L    if ($t1 == small) go to L   li $at, small
                                                 beq $t1, $at, L
beq $t2, big, L      if ($t2 == big) go to L     li $at, big
                                                 beq $at, $t2, L
li $t1, small        $t1 = small                 addi $t1, $zero, small
li $t2, big          $t2 = big                   lui $t2, upper(big)
                                                 ori $t2, $t2, lower(big)
ble $t3, $t5, L      if ($t3 <= $t5) go to L     slt $at, $t5, $t3
                                                 beq $at, $zero, L
bgt $t4, $t5, L      if ($t4 > $t5) go to L      slt $at, $t5, $t4
                                                 bne $at, $zero, L
bge $t5, $t3, L      if ($t5 >= $t3) go to L     slt $at, $t5, $t3
                                                 beq $at, $zero, L

155
addi $t0, $t2, big   $t0 = $t2 + big             li $at, big
                                                 add $t0, $t2, $at
lw $t5, big($t2)     $t5 = Memory[$t2 + big]     li $at, big
                                                 add $at, $at, $t2
                                                 lw $t5, 0($at)

2. Suppose we add an addressing mode to MIPS that allows arithmetic instructions
to directly access memory. We add an instruction, addm, as is found in the 80x86,
with the following brief description:
addm $t2, 100($t3) # $t2 = $t2 + Memory[$t3 + 100]
then please describe the steps of this instruction addm might take. Then write a
paragraph or two explaining why it would be hard to add this instruction to the
MIPS pipeline. (Hint: You may have to add one or more additional stages to the
pipeline.)
Answer:
(1) The steps of this instruction addm might take:
Step1: instruction fetch
Step2: instruction decode and register fetch
Step3: memory address calculation
Step4: memory access
Step5: execution
Step6: write back
(2) With addm the MIPS pipeline would need 6 stages instead of 5, because the
instruction must calculate a memory address, access memory, and then still
execute the ALU operation before write-back. A longer pipeline adds hardware,
can lower the clock rate, and, more importantly, increases the hazard
penalties: loads and branches would have more stages between fetch and
resolution, so stalls and flushes become more expensive. This is why such a
memory-operand instruction is hard to add to the MIPS pipeline.

3. Here are two different I/O systems intended for use in transaction processing:
System A can support 1500 I/O operations per second.
System B can support 1000 I/O operations per second.
The systems use the same processor that executes 500 million instructions per
second. The latency of an I/O operation for these two systems differs. The latency
for an I/O on system A is equal to 20 ms, while for system B the latency is 18 ms
for the first 500 I/Os per second and 25 ms per I/O for each I/O between 500 and
1000 I/Os per second. In the workload, every 10th transaction depends on the
immediately preceding transaction and must wait for its completion. What is the
maximum transaction rate that still allows every transaction to complete in 1
second and that does not exceed the I/O bandwidth of the machine? (Assume that
each transaction requires 5 I/O operations and that each I/O operation requires
10,000 instructions. And for simplicity, assume that all transaction requests arrive
at the beginning of a 1-second interval.)

156
Answer:
System A:
              9 independent transactions   Compute   1 dependent transaction
I/Os:         45                                     5
Times:        900 ms                       100 µs    100 ms   -> total exceeds 1 s
Thus system A can only support 9 transactions per second.

System B (first 500 I/Os per second):
              9 independent transactions   Compute   1 dependent   1 dependent
I/Os:         45                                     5             5
Times:        810 ms                       100 µs    90 ms         90 ms   -> 990.1 ms
Thus system B can support 11 transactions per second.


157
96

1. Given the following MIPS instruction code segment, please answer each question
below.
16 L1: addi $t0, $t0, 4
20 lw $s1, 0($t0)
24 sw $s1, 32($t0)
28 lw $t1, 64($t0)
32 slt $s0, $t1, $zero
36 bne $s0, $zero, L1
(a) Given a pipeline processor which has 5 stages: IF, ID, EX, ME, WB. Assume
no forwarding unit is available. There are hazards in the code, please detect
the hazards and point out where to insert no-ops (or bubbles) to make the
pipeline datapath execute the code correctly. You don't need to rewrite the
entire code segment. You can simply indicate the location where you would
insert the no-ops. For example, if you want to insert 6 no-ops between the
instruction addi at address 16 and lw at address 20, you can state something
like "6 no-ops between 16 and 20".
(b) Assume a forwarding unit is available to only forward data from ME and/or
WB to EX. Please reorder/rewrite the code to maximize its performance. Note
that you should consider maximizing the performance based on the
assumption that the loop might be iterated a few times. You may insert no-ops
in the code segment to resolve inevitable hazards if any.
Answer:
(a) 2 no-ops between 16 and 20
2 no-ops between 20 and 24
2 no-ops between 28 and 32
2 no-ops between 32 and 36
1 no-op after 36
(b)
L1: addi $t0, $t0, 4
lw $t1, 64($t0)
lw $s1, 0($t0)
slt $s0, $t1, $zero
nop
nop
bne $s0, $zero, L1
sw $s1, 32($t0)


158
2. Assume you are asked to design the architecture of the memory hierarchy for a
computer with a 32-bit 4 GHz MIPS processor. The processor has a 64 KB 1st-level
cache and a 256 KB 2nd-level cache on chip. The 1st-level cache is 2-way
associative and the 2nd-level cache is 8-way associative. Assume the word size is
32 bits and the block size for both caches is 8 bytes. Assume both caches are
virtually addressed. The size of the physical memory is 2 GB. The memory space
is byte-addressed. Based on the given information, please answer the following
questions.
(a) Please locate virtual address 0x0000 ABCD in both caches. That is, show
which set the address will be in if it is in the 1st-level cache and the
2nd-level cache, respectively.
(b) Suppose the update policy of the 1st-level cache is write allocate, write back,
and LRU replacement. Execute each of the following instructions and indicate
whether it is a hit or a miss for (1) to (5) on the 1st-level cache. (Assume
initially the content of $s0 = 0x0000 0000, $s1 = 0xFEDC 0000, $s2 = 0x8000 0000,
and both caches are empty.)
Instruction Cache hit or miss
lb $t0, 0x001F($s0) miss
lb $t1, 0x801D($s1) (1)
lb $t2, 0x0018($s1) (2)
sb $t1, 0x0018($s0) (3)
lb $t0, 0x001C($s2) (4)
sb $t0, 0x001A($s1) (5)
Finally, after executing this piece of code, has the memory been updated?
Please answer yes or no.
(c) Suppose the access time to main memory with the 2nd-level cache disabled is
100 ns, including all the miss handling. Suppose the base CPI of the processor
is 2, assuming all references hit in the 1st-level cache. Further assume the test
program you use to test the memory hierarchy has a 4% miss rate per instruction
for the 1st-level cache. Now with the 2nd-level cache enabled, the test program
has a miss rate of 0.2%. Suppose the access time of the 2nd-level cache is 20 ns
for either hit or miss. How much performance improvement will you get with the
2nd-level cache enabled?
Answer:

159
(a)
                 L1 cache                     L2 cache
Cache size       64 KB                        256 KB
Mapping          2-way                        8-way
Block size       8 B                          8 B
# of sets        64 KB / (2 × 8 B) = 4K      256 KB / (8 × 8 B) = 4K
Address format   Tag 17 | Set 12 | Offset 3  Tag 17 | Set 12 | Offset 3

Since 0x0000 ABCD = 0000 0000 0000 0000 1010 1011 1100 1101two,
the set address for both the level-1 and level-2 cache is 010101111001two
(address bits [14:3]).

(b)
Instruction            Memory address   Tag                 Index          Offset   Hit/miss
lb $t0, 0x001F($s0)    0000001F         00000000000000000   000000000011   111      Miss
lb $t1, 0x801D($s1)    FEDC801D         11111110110111001   000000000011   101      Miss
lb $t2, 0x0018($s1)    FEDC0018         11111110110111000   000000000011   000      Miss
sb $t1, 0x0018($s0)    00000018         00000000000000000   000000000011   000      Miss
lb $t0, 0x001C($s2)    8000001C         10000000000000000   000000000011   100      Miss
sb $t0, 0x001A($s1)    FEDC001A         11111110110111000   000000000011   010      Miss
No, memory has not been updated, since the write-back strategy is used.
(c)
The miss penalty to main memory is 100 ns / 0.25 ns = 400 clock cycles.
For the processor with one level of caching, CPI = 2.0 + 400 × 4% = 18.
The miss penalty for an access to the second-level cache is 20 ns / 0.25 ns = 80
clock cycles.
For the two-level cache, total CPI = 2.0 + 4% × 80 + 0.2% × 400 = 6.
Thus, the processor with the secondary cache is faster by 18 / 6 = 3.








160
3. True or false:
(a) In processor implementation, single-cycle implementation is not as good as
multi-cycle implementation because single-cycle implementation tends to
have a longer clock cycle and higher CPI than multi-cycle implementation.
(b) Thrashing occurs if a program constantly accesses more virtual memory than
it has physical memory, causing continuously swapping between memory and
disk.
(c) RAID 3,4, and 5 all have the capability of performing parallel reads and
writes.
(d) Suppose a program runs in 60 seconds on a machine, with multiplication
responsible for 40 seconds of the time. According to Amdahl's law, we can
simply improve the speed of multiplication to have the program run 3 times
faster.
(e) The idea of using two levels of cache is that the 1st-level cache is to
minimize the cache miss ratio and the 2nd-level cache is to reduce the cache
hit time.
Answer:
(a) False; single-cycle implementation has a lower CPI (CPI = 1) than multi-cycle
implementation. Its drawback is the long clock cycle, not a higher CPI.
(b) True
(c) False; for small accesses, RAID 4 and 5 do have the capability of performing
parallel reads and writes, but RAID 3 can serve only one request at a time,
regardless of whether the access is small or large.
(d) False; suppose x is the multiplication speedup. Then 60/3 = 20 + 40/x requires
40/x = 0, i.e., x would have to be infinite, so a 3x overall speedup is impossible.
(e) False. The 1st-level cache is designed to minimize the cache hit time, and the
2nd-level cache is designed to reduce the cache miss ratio.
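Statement (d) can be checked numerically. A Python sketch of the Amdahl's-law equation above (60 s total, 40 s of multiplication):

```python
# Sketch: remaining run time after speeding up only multiplication by x.
# Original: 60 s total = 40 s multiplication + 20 s everything else.
def improved_time(x):
    return 20 + 40 / x

# The 3x target is 60/3 = 20 s, but even x -> infinity only approaches 20 s.
for x in (2, 10, 1000):
    print(x, improved_time(x))
```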

4. Design a direct memory access (DMA) controller in a multi-master bus-based
system.
(a) Show a generic design that can be used for transferring data between the main
memory and the I/O. Specify the functionality of the registers used and the
interface signals of the DMA controller.
(b) Using the interface signals, elaborate the DMA operations that transfer a
block of data from memory to the I/O.
Answer:
(a)

[Figure: the DMA controller sits between the CPU/memory data bus and the I/O bus
(with I/O devices attached); it contains a data-count register, a data register, an
address register, and control logic, and exchanges the signals memory address, data
count, device number, DMA request, DMA acknowledge, and interrupt.]

Data count: indicates the amount of data to be transferred
Data register: buffers the data to/from memory
Address register: indicates which memory address is to be accessed
DMA request: asks the CPU for a DMA transfer
DMA acknowledge: the CPU's acknowledgement of a DMA request
Interrupt: signals the CPU when the DMA controller needs attention
(b)
1. The DMA controller asks the CPU for a DMA transfer via DMA request
2. The CPU responds to the DMA request via DMA acknowledge
3. The CPU initializes the DMA controller, telling it:
   - read/write
   - device address
   - starting address of the memory block for the data
   - amount of data to be transferred
4. The CPU carries on with other work
5. The DMA controller handles the transfer
6. The DMA controller sends an interrupt when finished
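The six steps can be sketched as an event trace. This is purely illustrative Python (the function name and arguments are invented for the example), not a model of any real controller:

```python
# Sketch: the DMA handshake above as a simple event trace (illustrative only).
def dma_transfer(block_addr, device, length):
    trace = []
    trace.append("DMA controller -> CPU: DMA request")
    trace.append("CPU -> DMA controller: DMA acknowledge")
    trace.append(f"CPU programs controller: write, device {device}, "
                 f"addr {block_addr:#x}, count {length}")
    trace.append("CPU continues with other work")
    trace.append("DMA controller moves the block over the bus")
    trace.append("DMA controller -> CPU: interrupt (done)")
    return trace

for step in dma_transfer(0x8000, 1, 512):
    print(step)
```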


5. Fill in the appropriate term or terminology for the underlined fields:
(a) move $s1, $zero = addi __, __, __
(b) CPU execution time = Instruction count × ________ × clock cycle time.
(c) After a silicon ingot is sliced, it is called a ________.
(d) For a 32-bit register, if the least significant byte (B0) is stored at memory
address 4N, where N is an integer ≥ 0, this storage order is called ____ endian.
(e) For a 32-bit register, if the least significant byte (B0) is stored at memory
address 4N + 3, where N is an integer ≥ 0, this storage order is called ____ endian.
Answer:
(a) $s1, $zero, 0
(b) CPI (cycles per instruction)
(c) blank wafer
(d) little
(e) big
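The two storage orders in (d) and (e) can be demonstrated with Python's struct module, which exposes both byte orders:

```python
# Sketch: byte order of a 32-bit value at increasing addresses.
import struct

value = 0x0A0B0C0D                    # B3 = 0x0A ... B0 = 0x0D
little = struct.pack('<I', value)     # byte at offset 0 is the LSB
big    = struct.pack('>I', value)     # byte at offset 0 is the MSB
print(little.hex())  # 0d0c0b0a -> LSB at address 4N: little endian
print(big.hex())     # 0a0b0c0d -> LSB at address 4N+3: big endian
```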

95

1. For the pipeline processor shown below, the following sequence of instructions
causes the pipeline hazard due to load-use dependency.
lw $4, 100($2)
add $8, $4, $4
Assuming the lw instruction will take 2 data memory cycles to get the data from
the memory and a forwarding circuit is employed, detail the design of the hazard
detection unit for this processor assuming MIPS-like ISA is used. Sketch your
design in the processor pipeline diagram and explain the signals you use. Write
down the behavioral code for the logic of the hazard detection unit.








Answer:
Since lw will take 2 memory cycles to get the data, we need to stall the pipeline
for 2 clock cycles to solve the load-use data hazard.
[Figure: the pipeline — PC, IF/ID, ID/EX, EX/MEM1, MEM1/MEM2 (EX/MEM2), and MEM2/WB
registers around the IM, RF, EX, DM1, DM2 stages; a hazard detection unit watches
IF/ID, ID/EX, and EX/MEM1 and drives the control signals, alongside the forwarding unit.]

Signals:
- IF/ID.RegisterRs, IF/ID.RegisterRt, ID/EX.RegisterRt
- ControlSignalClear
- ID/EX.MemRead, EX/MEM1.MemRead, EX/MEM1.RegisterRd
- PCWrite, IF/IDWrite

Behavioral code:
IF ((ID/EX.MemRead) and
    ((ID/EX.RegisterRt = IF/ID.RegisterRs) or
     (ID/EX.RegisterRt = IF/ID.RegisterRt)))
  stall the pipeline
IF ((EX/MEM1.MemRead) and
    ((EX/MEM1.RegisterRd = IF/ID.RegisterRs) or
     (EX/MEM1.RegisterRd = IF/ID.RegisterRt)))
  stall the pipeline
(Note: a load's destination register comes from the Rt field of the instruction; after
the EX stage it is carried in the pipeline register's RegisterRd field, which is why the
second condition compares EX/MEM1.RegisterRd.)
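The two stall conditions can be expressed as a small Python predicate. The dictionary-based pipeline-register representation is an assumption made for the sketch:

```python
# Sketch (hypothetical dict-based pipeline registers): the stall logic above.
def must_stall(id_ex, ex_mem1, if_id):
    """id_ex / ex_mem1: state of the ID/EX and EX/MEM1 pipeline registers;
    if_id: source registers of the instruction currently in decode."""
    if id_ex['MemRead'] and id_ex['Rt'] in (if_id['Rs'], if_id['Rt']):
        return True   # load result not ready until after DM2: stall
    if ex_mem1['MemRead'] and ex_mem1['Rd'] in (if_id['Rs'], if_id['Rt']):
        return True   # load still in the first memory stage: stall again
    return False

# lw $4, 100($2) in ID/EX followed by add $8, $4, $4 in decode -> stall
print(must_stall({'MemRead': True, 'Rt': 4},
                 {'MemRead': False, 'Rd': 0},
                 {'Rs': 4, 'Rt': 4}))
```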

2. For the above pipelined processor, come up with the behavioral code for the logic
of the forwarding unit assuming MIPS-like ISA is used. State you assumptions if
any.
Answer:
EX hazard:
IF (EX/MEM1.RegWrite
    and (EX/MEM1.RegisterRd ≠ 0)
    and (EX/MEM1.RegisterRd = ID/EX.RegisterRs)) ForwardA = 01
IF (EX/MEM1.RegWrite
    and (EX/MEM1.RegisterRd ≠ 0)
    and (EX/MEM1.RegisterRd = ID/EX.RegisterRt)) ForwardB = 01
MEM1 hazard:
IF (EX/MEM2.RegWrite
    and (EX/MEM2.RegisterRd ≠ 0)
    and not (EX/MEM1.RegWrite and (EX/MEM1.RegisterRd ≠ 0)
             and (EX/MEM1.RegisterRd = ID/EX.RegisterRs))
    and (EX/MEM2.RegisterRd = ID/EX.RegisterRs)) ForwardA = 10
IF (EX/MEM2.RegWrite
    and (EX/MEM2.RegisterRd ≠ 0)
    and not (EX/MEM1.RegWrite and (EX/MEM1.RegisterRd ≠ 0)
             and (EX/MEM1.RegisterRd = ID/EX.RegisterRt))
    and (EX/MEM2.RegisterRd = ID/EX.RegisterRt)) ForwardB = 10
MEM2 hazard:
IF (MEM2/WB.RegWrite
    and (MEM2/WB.RegisterRd ≠ 0)
    and not (EX/MEM2.RegWrite and (EX/MEM2.RegisterRd ≠ 0)
             and (EX/MEM2.RegisterRd = ID/EX.RegisterRs))
    and not (EX/MEM1.RegWrite and (EX/MEM1.RegisterRd ≠ 0)
             and (EX/MEM1.RegisterRd = ID/EX.RegisterRs))
    and (MEM2/WB.RegisterRd = ID/EX.RegisterRs)) ForwardA = 11
IF (MEM2/WB.RegWrite
    and (MEM2/WB.RegisterRd ≠ 0)
    and not (EX/MEM2.RegWrite and (EX/MEM2.RegisterRd ≠ 0)
             and (EX/MEM2.RegisterRd = ID/EX.RegisterRt))
    and not (EX/MEM1.RegWrite and (EX/MEM1.RegisterRd ≠ 0)
             and (EX/MEM1.RegisterRd = ID/EX.RegisterRt))
    and (MEM2/WB.RegisterRd = ID/EX.RegisterRt)) ForwardB = 11


3. Assume you are asked to design the architecture of the memory hierarchy for a
computer which has a 32-bit MIPS processor with a clock rate of 2 GHz. The
processor has a 32 KB (kilobyte) 1st-level cache and a 256 KB 2nd-level cache on
chip. The 1st-level cache is 4-way associative and the 2nd-level cache is fully
associative. Assume the word size is 32 bits and the block size for both caches is
32 bytes. The size of the physical memory is 2 GB (gigabyte). The memory
space is byte-addressed. Based on the given information, please answer the
following questions.
(1) How many bits are needed for each of the fields in the following structure to
index the 1st-level cache and the 2nd-level cache, respectively? Note: show the
answers for the 1st-level cache and the 2nd-level cache separately.
Tag | Index | Block Offset

(2) Suppose the access time to main memory with the 2nd-level cache disabled is
250 ns. That is, the access time includes 1st-level miss handling. Suppose the
base CPI of the processor is 2, assuming all references hit in the 1st-level
cache. Further assume the test program you use to test the memory hierarchy
has a 3% miss rate per instruction for the 1st-level cache. Now with the 2nd-level
cache enabled, the test program has a miss rate of 0.2%. Suppose the access
time of the 2nd-level cache is 20 ns for either a hit or a miss. How much
performance improvement will you get with the 2nd-level cache enabled?
(3) Suppose this computer has a 32-bit virtual address space and a 4 KB page size.
- How many virtual pages are there?
- How many physical pages are there?
- Assuming each entry in the page table consumes 1 word, what is the size of the
page table in bytes?
(4) Following the specification in (3), given the page table below, please derive
the physical address of the virtual address 0x00001004, and then locate the
address in the 1st-level cache. That is, show which set the address maps to
if it is in the 1st-level cache.
Page Entry no Valid Dirty Ref Physical page address
0 1 1 1 0x0001 1000
1 1 0 0 0x0004 1000
2 1 0 0 0x0001 2000
3 1 1 1 0x0003 3000
4 1 0 1 0x000F E000

Answer:
(1) Level-one cache:
Tag | Index | Block Offset
18  | 8     | 3

Level-two cache:
Tag | Index | Block Offset
26  | 0     | 3
Since there are 32/4 = 8 words in a block, the block offset = log2 8 = 3 bits.
(2) CPI with the 2nd-level cache disabled = 2 + 0.03 × (250/0.5) = 17
CPI with the 2nd-level cache enabled = 2 + 0.03 × (20/0.5) + 0.002 × (250/0.5) = 4.2
The performance improvement = 17/4.2 ≈ 4.05 times
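A Python sketch of the two CPI computations (a 2 GHz clock gives a 0.5 ns cycle, so 250 ns is 500 cycles):

```python
# Sketch: CPI with the L2 cache disabled vs. enabled (problem values).
cycle_ns = 0.5                      # 2 GHz clock
mem_cycles = 250 / cycle_ns         # 500 cycles per L1 miss to memory
l2_cycles = 20 / cycle_ns           # 40 cycles per L2 access

cpi_disabled = 2 + 0.03 * mem_cycles                     # 2 + 15 = 17
cpi_enabled = 2 + 0.03 * l2_cycles + 0.002 * mem_cycles  # 2 + 1.2 + 1 = 4.2
print(round(cpi_disabled, 2), round(cpi_enabled, 2),
      round(cpi_disabled / cpi_enabled, 2))
```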
(3) - 2^32 / 4K = 2^20 = 1M virtual pages
- 2G / 4K = 2^31 / 2^12 = 2^19 = 0.5M physical pages
- 1M × 4 bytes = 4 MB
(4) virtual address = 00001004 (hex) = 0000 0000 0000 0000 0001 0000 0000 0100 (binary)
virtual page no. = 0000 0000 0000 0000 0001 (binary)
page offset = 0000 0000 0100 (binary)
Looking up page entry no. 1, the physical page address is 0004 1000 (hex), giving the
physical page number 0000 0000 0000 0100 0001 (binary).
The physical address = 0000 0000 0000 0100 0001 0000 0000 0100 (binary)
Tag (18)           | Index (8) | Block Offset (3) | Byte Offset (2)
000000000000100000 | 10000000  | 001              | 00
So the address maps to cache set number 128.
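The translation and cache lookup can be replayed in Python (the page table below keeps only the physical page numbers from the table above):

```python
# Sketch: VA 0x00001004 -> PA -> L1 set index (32 KB, 4-way, 32-byte blocks).
PAGE_TABLE = {0: 0x00011, 1: 0x00041, 2: 0x00012, 3: 0x00033, 4: 0x000FE}
# physical page numbers, i.e. the physical page addresses shifted right by 12

va = 0x00001004
vpn, offset = va >> 12, va & 0xFFF       # 4 KB pages: 12-bit offset
pa = (PAGE_TABLE[vpn] << 12) | offset
sets = (32 * 1024) // (32 * 4)           # 256 sets -> 8 index bits
index = (pa >> 5) & (sets - 1)           # skip the 5 offset bits of a 32-byte block
print(hex(pa), index)
```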

4. Suppose you run photoshop to load a 4 MB(Mega-Byte) image file from the hard
disk to the memory for editing. Unfortunately, your disk is so fragmented that all
data blocks associated with this file is scattered around the disk randomly. The
parameters of the disk are listed below.
Average seek time: 12 milli-second
Rotational speed: 5000 RPM(rotation per minute)
Block size: 512 bytes
Transfer rate: 0.4 MB/sec
Ignore all other overheads. How long does the photoshop program need to wait
for the file transfer to finish from the hard disk to the memory?
Answer:
We have 4 MB / 0.5 KB = 8K blocks to load.
Moving one block from the disk requires:
12 ms (seek) + 0.5 × (60/5000) × 1000 ms (rotational latency) + (0.5 KB)/(0.4 MB/s) (transfer)
= 12 + 6 + 1.25 = 19.25 ms
The total time to load the file = 8K × 19.25 ms ≈ 157.7 s
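The per-block and total times can be recomputed in Python, using the same decimal K/M convention as the worked answer:

```python
# Sketch: time to load 4 MB of randomly scattered 512-byte blocks.
seek_ms = 12.0
rot_ms = 0.5 * (60 / 5000) * 1000        # average rotational latency: 6 ms
xfer_ms = (0.5e3 / 0.4e6) * 1000         # 512 B at 0.4 MB/s: 1.25 ms
per_block = seek_ms + rot_ms + xfer_ms   # 19.25 ms per block
blocks = 4 * 1024 * 1024 // 512          # 8192 blocks
total_s = blocks * per_block / 1000
print(per_block, round(total_s, 1))
```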




94

1. Which of the following is (are) true?
(a) For a fixed size cache memory, the larger the line size is the smaller the tag
memory the cache uses.
(b) For a fixed size cache memory, the larger the line size is the larger the tag
memory the cache uses.
(c) For a direct-mapped cache, no address tag is the same in the tag memory.
(d) For a two-way associative cache, no address tag is the same in the tag
memory.
Answer: (a)

2. Which of the following is (are) true for a 64KB cache with a line size of 32 bytes?
Assume that the cachable memory is 1 GB.
(a) In a direct-mapped implementation, the tag length is 16 bits; the index field is
11 bits in length.
(b) In a direct-mapped implementation, the tag length is 14 bits; the index field is
16 bits in length.
(c) In a direct-mapped implementation, the tag length is 14 bits; the field
determining line size is 5 bits in length.
(d) In a two-way implementation, the tag length is 15 bits; the index field is 10
bits in length.
Answer: (c)(d)
3. Which of the following is (are) true?
(a) A non-blocking cache allows hit under miss to hide miss latency.
(b) A non-blocking cache does not allow miss under hit to hide miss latency.
(c) Miss under miss allows multiple outstanding cache misses.
(d) A non-blocking cache allows a load instruction to access the cache if the
previous load is a cache miss.
Answer: (a)(b)(c)(d)

4. Which of the following is (are) true for the forwarding unit in a 5-stage pipelined
processor?
(a) The forwarding unit is used to detect the instruction cache stalling.
(b) The forwarding unit is a combinational circuit which detects the true data
dependency for EXE pipeline stage and selects the forwarded results for the
execution unit
(c) The forwarding unit is a pipeline register which detects the true data
dependency for EXE pipeline stage and selects the forwarded results for the

execution unit.
(d) The forwarding unit compares the source register number of the instructions
in the MEM and WB stages with the destination register number of the
instruction in the decode stage.
Answer: (b)

5. Which of the following is (are) not true?
(a) A control hazard is the delay in determining the proper data to load in the
MEM stage of a pipeline processor.
(b) A load-use data hazard occurs because the pipeline flushes the instructions
behind.
(c) To flush instructions in the pipeline means to load the pipeline with the
requested instructions using the predicted PC.
(d) A branch prediction buffer is a buffer that the compiler uses to predict a
branch.
Answer: (a), (b), (c), (d)

6. Which of the following is (are) true for the combinations of events in the TLB,
virtual memory system, and cache?
(a) It is possible that an access results in a TLB hit, a page table hit, and a cache
miss.
(b) It is possible that an access results in a TLB hit, a page table miss, and a cache
miss.
(c) It is possible that an access results in a TLB hit, a page table miss, and a cache
hit.
(d) It is possible that an access results in a TLB miss, a page table hit, and a cache
miss.
Answer: (a), (d)

7. Which of the following is (are) true?
(a) Virtual memory technique treats the main memory as a fully-set associative
write-back cache.
(b) Virtual address must be always larger than the physical address.
(c) TLB can be seen as the cache of a page table.
(d) If the valid bit for a virtual address is off, a page fault occurs.
Answer: (a), (c), (d)

8. Which of the following is (are) true?
(a) Memory-mapped I/O is an I/O scheme in which special designed I/O
instructions are used to access the memory space.
(b) The process of periodically checking status bits to see if it is time for the next
I/O operation is called interrupt.
(c) DMA is a mechanism that provides a device controller the ability to transfer
data directly to or from memory without involving the processor. DMA is
also a bus master.
(d) In a cache-based system, DMA cannot be used because of the cache coherence
problem.
Answer: (c)

9. Which of the following is (are) true?
(a) Computers have been built in the same, old-fashioned way for far too long,
and this antiquated model of computation is running out of steam.
(b) Dynamic power = Capacitive load × Voltage² × Frequency switched
(c) Static power is due to the small operating current in CMOS.
(d) Yield = the percentage of good dies from the total number of dies on the
wafer.
Answer: (b), (d)

10. Which of the following is (are) true?
(a) ISA (instruction set architecture) is an abstraction which is the interface
between the hardware and the low-level software (assembly instructions).
This abstract interface enables different implementations of the same ISA to
run identical software.
(b) A caller is the program that is called by the procedure which gives the call.
(c) A basic block is a sequence of instructions with branch at the beginning and at
the end.
(d) A register file is a large memory for storing files
Answer: (a)

11. Which of the following statements conform to the design principle "simplicity
favors regularity"?
(a) Keeping all instructions in a single size.
(b) Always requiring three operands in arithmetic instruction
(c) Keeping the register fields in the same place in each instruction format.
(d) Having the same opcode field in the same place in each instruction format.
Answer: (a), (b), (c)

12. Which of the following is (are) true?
(a) Page fault is signaled by software.
(b) TLB exception can only be handled in hardware.
(c) A cache miss is handled in hardware.
(d) A page fault is handled in software.
Answer: (a), (c), (d)
Note: a TLB miss can be handled either in hardware or in software.

13. Which of the following is (are) true?
(a) When a cache write hit occurs, the written data are also updated in the next
level of memory. This is the write-through policy.
(b) There is no cache coherency problem for the write-through cache since the
data are written into the next level of memory.
(c) When a cache write hit occurs, the written data are only updated in the cache.
This is the write-back policy.
(d) Cache data inconsistency appears in a write-back cache when an I/O master
writes data into the memory block which is cached.
Answer: (a), (b), (c), (d)

14. Which of the following affects the CPI (clock per instruction)?
(a) Cache structure
(b) Memory data bus width
(c) Process technology
(d) Clock cycle time
Answer: (a), (b), (c)

15. Which of the following is (are) true?
(a) A C compiler compiles a C program into assembly language program for the
target machine.
(b) Pseudoinstructions are instructions which are not implemented in hardware.
(c) A label is a pseudoinstruction
(d) Pseudoinstructions are directives in an assembly language program
Answer: (a), (b), (d)
Note: the compiler transforms the C program into an assembly language program, a
symbolic form of what the machine understands.

16. Which of the following is (are) true?
(a) In a pipeline processor, a structure hazard means that the hardware cannot
support the combination of instructions that are executed in the same clock
cycle.
(b) A structure hazard is caused by the branch instruction which is mispredicted.
(c) A structure hazard occurs if a unified cache is accessed both by the instruction
fetch and the data load at the same clock.
(d) A structure hazard is an exception which causes the processor to fetch
instruction from the exception handler.
Answer: (a), (c)

17. Which of the following is (are) true?
(a) Pipelining reduces the instruction execution latency to one cycle.
(b) Pipelining not only improves the instruction throughput but also the
instruction latency.
(c) Pipelining improves the instruction throughput rather than individual
instruction execution time.
(d) Pipelining improves the instruction throughput other than individual
instruction execution time.
Answer: (c)

18. Which of the following is (are) true?
(a) Temporal locality means the tendency to use data items that are close in
location.
(b) Temporal locality means the tendency to reuse data items that are recently
accessed.
(c) Spatial locality means the tendency to use data items that are close in location.
(d) Spatial locality means the tendency to reuse data items that are recently
accessed.
Answer: (b), (c)

19. Which of the following is (are) data transfer instructions?
(a) jal subroutine_1
(b) sw R1, 100(R2)
(c) beq R1, R2, start
(d) or R1, R2, R3
Answer: (b)

20. Which of the following instruction(s) performs NOT operation assuming R0 = 0?
(a) OR R1, R0, R3
(b) AND R1, R0, R3
(c) NOR R1, R0, R3
(d) ADD R1, R0, R3
Answer: (c)
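Answer (c) relies on the identity NOR(0, x) = NOT(0 OR x) = NOT x. A quick Python check over 32-bit values:

```python
# Sketch: NOR R1, R0, R3 with R0 = 0 computes NOT R3.
MASK = 0xFFFFFFFF   # 32-bit registers

def nor(a, b):
    return ~(a | b) & MASK

R0 = 0
R3 = 0x0000FFFF
R1 = nor(R0, R3)          # NOR R1, R0, R3
print(hex(R1))            # the bitwise complement of R3
```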


96

1. (a) Assume variable h is associated with register $s2 and the base address of the
array A is in $s3. Now the C assignment statement is as below:
A[12] = h + A[8]
Write the compiled MIPS assembly code by filling the blanks (A), (B), (C).
lw $t0, (A) ($s3)
add (B) , (C) , $t0
sw $t0, 48($s3)
(b) If the program is run on a machine with a 50 MHz clock, and it needs to execute
the code in (a) 10000 times. Below is the number of cycles for each class of
instruction. How many microseconds will it take to execute this program?
Instruction Cycles
Arithmetic 1
Data transfer 3
Jump 2
(c) If the machine in (b) is a 4-way VLIW machine, what is the MOPS (million
operations per second) of this machine?
Answer:
(a)
(A) (B) (C)
32 $t0 $s2
(b) (3 + 1 + 3) × 10000 × 0.02 µs = 1400 µs
(c) Since the VLIW is 4-way, the data transfer instruction can be completed in one
clock cycle, during which its 3 operations are done. The three instructions (7
operations in total) can be done in 3 cycles, so at 50 MHz,
MOPS = 7 × (50/3) ≈ 116.67

2. (a) A PC has 4 MB of RAM beginning at address 00000000H. Calculate the very
last address (in hex) of this 4 MB block.
(b) If the starting address and the ending address of the ROM block are 008000H
and 010000H, calculate the size of the ROM in K.
Answer:
(a) 003FFFFFH
(b) 010000H − 008000H + 1 = 8000H + 1 = 1000 0000 0000 0000 (binary) + 1
= 2^15 + 1 = (32K + 1) bytes
(Note: problem (b) appears in Stallings, Computer Organization and Architecture,
where the answer is given as 32K bytes.)
3. Please draw the block diagram to build a simple MIPS datapath, only using the
following four components: "ALU", "Sign extend", "Data memory", "Registers".







Answer:











4. (a) Use the block diagram of 1-bit full adder as a basic block to construct a 32-bit
ripple adder (S = A + B).
(b) Add some logic blocks to the design of ripple adder so that it can do 2's
complement subtraction (S = A B).
(c) Using 4-bit carry-lookahead blocks to form a 16-bit carry- lookahead adder.
Draw the block diagram and write down the corresponding logic equations.
(d) Compare the number of "gate delays" for the critical paths of two 16-bit adders,
one use ripple carry and one using two-level carry lookahead.
Answer:
(a)
[Figure: 32-bit ripple-carry adder — 1-bit full adders chained so that the carry-out of
bit i feeds the carry-in of bit i+1; inputs a0,b0 … a31,b31, carry-in c0, sum outputs
s0 … s31, final carry-out c32.]






[Figure (answer to problem 3): simple MIPS datapath — Instruction fields select Read
register 1, Read register 2, and Write register (5 bits each) in the Registers block;
Read data 1 and Read data 2 feed the ALU, with Sign extend expanding the 16-bit
immediate to 32 bits as the alternative second operand; the ALU result (with the Zero
output and a 4-bit ALU operation control) is the Data memory address, and the Read
data from memory is written back to the Registers; control signals: RegWrite,
MemRead, MemWrite.]

[Figure (for answer (c)): 16-bit carry-lookahead adder — four 4-bit ALU blocks handle
bits 0–3 (ALU0), 4–7 (ALU1), 8–11 (ALU2), and 12–15 (ALU3); each block forms per-bit
pi, gi and block propagate/generate signals P0–P3, G0–G3, which feed a
carry-lookahead unit producing the block carry-ins C1–C4 and the final CarryOut.]
(b) S = A + B when M = 0; S = A − B when M = 1.
[Figure: 4-bit adder/subtractor — each bi passes through an XOR gate with control
input M before entering the full adder, and M also drives the carry-in c0, so M = 1
adds the two's complement of B.]

(c)
Block propagate signals (4):
P0 = p3p2p1p0
P1 = p7p6p5p4
P2 = p11p10p9p8
P3 = p15p14p13p12
Block generate signals (4):
G0 = g3 + (p3g2) + (p3p2g1) + (p3p2p1g0)
G1 = g7 + (p7g6) + (p7p6g5) + (p7p6p5g4)
G2 = g11 + (p11g10) + (p11p10g9) + (p11p10p9g8)
G3 = g15 + (p15g14) + (p15p14g13) + (p15p14p13g12)
Second-level carries (4):
C1 = G0 + c0P0
C2 = G1 + G0P1 + c0P0P1
C3 = G2 + G1P2 + G0P1P2 + c0P0P1P2
C4 = G3 + G2P3 + G1P2P3 + G0P1P2P3 + c0P0P1P2P3
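The P, G, and C equations can be evaluated directly. A Python sketch for 16-bit operands (bit twiddling stands in for the gates):

```python
# Sketch: the two-level carry-lookahead carries C1..C4 for 16-bit a, b.
def cla_carries(a, b, c0=0):
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(16)]   # per-bit propagate
    g = [((a >> i) & (b >> i)) & 1 for i in range(16)]   # per-bit generate
    P = [p[4*j] & p[4*j+1] & p[4*j+2] & p[4*j+3] for j in range(4)]
    G = []
    for j in range(4):
        g0, g1, g2, g3 = g[4*j:4*j+4]
        p1, p2, p3 = p[4*j+1], p[4*j+2], p[4*j+3]
        G.append(g3 | (p3 & g2) | (p3 & p2 & g1) | (p3 & p2 & p1 & g0))
    C1 = G[0] | (c0 & P[0])
    C2 = G[1] | (G[0] & P[1]) | (c0 & P[0] & P[1])
    C3 = G[2] | (G[1] & P[2]) | (G[0] & P[1] & P[2]) | (c0 & P[0] & P[1] & P[2])
    C4 = (G[3] | (G[2] & P[3]) | (G[1] & P[2] & P[3])
          | (G[0] & P[1] & P[2] & P[3]) | (c0 & P[0] & P[1] & P[2] & P[3]))
    return C1, C2, C3, C4

# 0xFFFF + 1 generates a carry in block 0 that propagates through every block.
print(cla_carries(0xFFFF, 0x0001))
```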
















(d)
(1) In the ripple-carry adder, each bit position adds 2 gate delays to the carry chain,
so the critical path of the 16-bit ripple-carry adder is 2 × 16 = 32 gate delays.
(2) In the carry-lookahead adder, forming gi and pi takes 1 gate delay, the block
signals Pi and Gi take 2 more gate delays, and the second-level carries Ci take
another 2 gate delays. So the critical path of the two-level carry-lookahead
adder is 1 + 2 + 2 = 5 gate delays.

5. (a) Figure 5.1 shows the partial finite state machine with control line settings to
control the datapath in Figure 5.2. Figure 5.2 below shows the MIPS
multicycle datapath with exception handling. Please fill in the names and
values of the control lines that need to be changed in the empty states A to E
such that the finite state machine can control the datapath correctly.
(b) Assume only exception, arithmetic overflow, can occur in this MIPS CPU.
Please redraw the finite state machine to handle this exception using the
datapath shown in Figure 5.2.
[Partial finite state machine with empty states A to E]

Figure 5.1



Figure 5.2
Answer:
(a)
A B C D E
ALUSrcA =1
ALUSrcB =10
ALUOp = 00
ALUSrcA =1
ALUSrcB =00
ALUOp = 10
ALUSrcA =1
ALUSrcB =00
ALUOp = 01
PCWriteCond
PCSource = 01
PCWrite
PCSource = 10
RegDst = 1
RegWrite
MemtoReg = 0

(b) [Figure: the finite state machine redrawn with an overflow-exception state reached
from the R-type execution state; it saves the return address in EPC (PC − 4, computed
by the ALU), writes the Cause register, sets the PC to the exception handler address,
and returns to instruction fetch.]
6. (a) While executing the MIPS code shown below, what is the target address of the
branch instruction if it is taken? (Assume the starting address of this code
segment is 28 decimal.)
lw $4, 50($7)
beq $1, $4, 3
add $5, $3, $4
sub $6, $4, $3
or $7, $5, $2
slt $8, $5, $6
(b) Assume this code is executed on a MIPS CPU with 5 pipeline stages and data
forwarding capability. If this CPU uses "always assume branch not taken"
strategy to handle branch instruction but the branch is taken in this example,
how many clock cycles are required to complete this program? Please explain
your answers in detail.
Answer:
(a) The branch target address will be the address of the slt instruction; therefore, the
target address = 28 + 5 × 4 = 48 decimal.


(b) Suppose the branch decision is made in the ID stage. Since the branch is taken,
3 instructions are executed in this code sequence. 2 stall cycles are needed between
the lw and beq instructions because the branch decision is made in the ID stage.
Besides, 1 instruction (add) is flushed. Therefore, the total number of cycles to
complete the code sequence = (5 − 1) + 3 + 2 + 1 = 10 clock cycles.

7. (a) Please briefly explain the relationship between virtual memory, TLBs, and
caches in the memory system of modern computers.
(b) Assume there are two small caches, each consisting of six one-word blocks.
One cache is direct mapped, and the other cache is two-way set associative.
Please find the number of misses of each cache organization given the
following sequence of block address: 0, 15, 12, 3, 15, 0. Besides the number of
misses, please also explain your answers in detail.
Answer:
(a) The TLB contains a subset of the virtual-to-physical page mappings that are in
the page table. On every reference, we look up the virtual page number in the
TLB. Under the best of circumstances, a virtual address is translated by the
TLB and sent to the cache where the appropriate data is found, retrieved, and
sent back to the processor.
(b) Both organizations miss on every access (6 misses each).
Direct-mapped:
Block address | Cache block | Hit/Miss
0  | 0 | Miss
15 | 3 | Miss
12 | 0 | Miss
3  | 3 | Miss
15 | 3 | Miss
0  | 0 | Miss
Two-way set associative:
Block address | Set no. | Hit/Miss
0  | 0 | Miss
15 | 0 | Miss
12 | 0 | Miss
3  | 0 | Miss
15 | 0 | Miss
0  | 0 | Miss
In the direct-mapped cache (6 blocks), 0 and 12 conflict in block 0 while 3 and 15
conflict in block 3, so every access evicts a block that is needed again. In the
two-way cache (3 sets), all six block addresses map to set 0 (address mod 3 = 0), and
with only two ways, LRU replacement evicts each block just before it is reused.
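The two traces can be simulated in a few lines of Python (LRU modeled with a bounded deque):

```python
# Sketch: miss counts for a 6-block direct-mapped cache vs. a 3-set two-way cache.
from collections import deque

trace = [0, 15, 12, 3, 15, 0]

def direct_mapped(trace, nblocks=6):
    blocks, misses = {}, 0
    for addr in trace:
        idx = addr % nblocks
        if blocks.get(idx) != addr:
            misses += 1
            blocks[idx] = addr
    return misses

def two_way_lru(trace, nsets=3):
    sets = [deque(maxlen=2) for _ in range(nsets)]
    misses = 0
    for addr in trace:
        s = sets[addr % nsets]
        if addr in s:
            s.remove(addr)       # refresh LRU order on a hit
        else:
            misses += 1          # appending may evict the LRU way
        s.append(addr)           # most recently used at the right
    return misses

print(direct_mapped(trace), two_way_lru(trace))
```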


95

1. Answer the following problem briefly.
(1) Is the ARM processor a RISC or a CISC machine?
(2) How many general-purpose registers does the ARM processor have in
supervisor mode?
(3) The ARM processor supports only little-endian, only big-endian, or both
little-endian and big-endian memory addressing modes.
(4) List two key objectives which the USB (Universal Serial Bus) has been
designed to meet.
Answer:
(1) RISC
(2) There are 13 general-purpose registers (R0–R12) in supervisor mode.
(3) Little-endian
(4) USB was designed:
- to allow peripherals to be connected using a single standardized interface socket;
- to improve plug-and-play capabilities by allowing devices to be connected and
disconnected without rebooting the computer;
- to supply power to low-consumption devices without the need for an external power
supply; and
- to allow many devices to be used without requiring manufacturer-specific,
individual device drivers to be installed.
Notes: there are 15 general-purpose registers (R0–R14) in user mode; the byte
ordering on a MIPS chip is big-endian.


2. Design an 8-bit carry select adder using 4-bit ripple carry adders and 2-input
multiplexers.
(1) Draw the block diagram of the 8-bit carry select adder using the block
diagrams of 4-bit ripple carry adder and 2-input multiplexer shown in Fig. 1,
and explain how the 8-bit carry select adder works.
[Block diagrams: a 4-bit ripple-carry adder with inputs a[3:0], b[3:0], Cin and
outputs s[3:0], Cout; and a 2-input multiplexer with data inputs x, y, select input
Sel, and output z.]

Figure 1: Block diagrams of 4-bit ripple carry adder and 2-input multiplexer.
(2) Assume the critical delay of a 4-bit ripple carry adder and a 2-input
multiplexer is 4ns and 0.2ns, respectively. Calculate the critical delay of the
8-bit carry select adder.

Answer:
(1)
[Figure: 8-bit carry-select adder — one 4-bit ripple-carry adder adds a[3:0] and
b[3:0] to produce s[3:0] and the carry c4; two further 4-bit ripple-carry adders add
a[7:4] and b[7:4] in parallel, one with Cin = 0 and one with Cin = 1; 2-input
multiplexers with c4 as the select signal choose the correct s[7:4] and carry-out.]
The upper half does not wait for the lower half's carry: both possible upper sums are
computed speculatively, and the actual carry c4 simply selects between them through
the multiplexers.
(2) Critical delay of the 8-bit carry select adder = 4 + 0.2 = 4.2 ns
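The selection idea in (1) can be sketched functionally in Python — compute both speculative upper halves, then let c4 choose:

```python
# Sketch: 8-bit carry-select addition from two 4-bit ripple halves.
def ripple4(a, b, cin):
    total = a + b + cin
    return total & 0xF, total >> 4          # 4-bit sum, carry-out

def carry_select8(a, b):
    lo, c4 = ripple4(a & 0xF, b & 0xF, 0)
    hi0, c8_0 = ripple4(a >> 4, b >> 4, 0)  # speculative, carry-in 0
    hi1, c8_1 = ripple4(a >> 4, b >> 4, 1)  # speculative, carry-in 1
    hi, c8 = (hi1, c8_1) if c4 else (hi0, c8_0)   # mux selected by c4
    return (hi << 4) | lo, c8

print(carry_select8(0x7F, 0x01))
```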

3. A computer system has L1 and L2 caches. The local hit rates for L1 and L2 are
90% and 80%, respectively. The miss penalties are 10 and 50 cycles, respectively.
Assuming a CPI of 1.2 without any cache misses and an average of 1.1 memory
accesses per instruction:
(1) What is the effective CPI after cache misses are factored in?
(2) Taking the two levels of caches as a single cache memory, what are its miss
rate and miss penalty?
Answer:
(1) The effective CPI = 1.2 + 1.1 × (0.1 × 10 + 0.1 × 0.2 × 50) = 3.4
(2) Hit rate = 0.9 + 0.1 × 0.8 = 0.98; miss rate = 1 − 0.98 = 0.02 = 2%
Miss penalty = 50 cycles

4. Figure 2 depicts a 4-state branch prediction scheme that corresponds to keeping 2
bits of history. As long as a branch continues to be taken, we predict that it will
be taken the next time (the "Predict taken" state in the Fig. 2). After the first
misprediction, the state is changed, but we continue to predict that the branch will
be taken. A second misprediction causes another change of state, this time to a
state that causes the opposite prediction. Now a processor runs a program that
consists of two nested loops, with a single branch instruction at the end of each
loop and no other branch instruction anywhere. Also, the outer loop is executed
10 times and the inner loop 20 times.
Determine the accuracy of the following two branch prediction strategies:
(1) always predict taken,
(2) use the branch prediction scheme shown in Fig. 2.

(Hint: accuracy is defined as the ratio of the number of correct predictions to the
total number of branch predictions.)

Figure 2: A 4-state branch prediction scheme.
Answer:
The total number of branch instructions executed in the inner loop is 20 × 10 = 200.
The total number of branch instructions executed in the outer loop is 10.
(1) The inner branch is mispredicted 10 times and the outer branch once.
Hence accuracy = (210 − 11)/210 ≈ 0.9476
(2) If the 2-bit prediction scheme is initialized in the "predict taken" state, the
inner branch is again mispredicted 10 times and the outer branch once.
Hence accuracy = (210 − 11)/210 ≈ 0.9476
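The 2-bit counter's behavior on this loop pattern can be simulated in Python (states 0–3, with 2 and 3 predicting taken, initialized to strongly taken):

```python
# Sketch: 2-bit saturating counter on the nested-loop branch outcomes.
def run(pattern, state=3):           # 3/2 predict taken, 1/0 predict not taken
    correct = 0
    for taken in pattern:
        predict = state >= 2
        correct += (predict == taken)
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return correct

inner = ([True] * 19 + [False]) * 10   # inner branch: 20 executions x 10 passes
outer = [True] * 9 + [False]           # outer branch: 10 executions
correct = run(inner) + run(outer)
print(correct, correct / 210)          # 11 mispredictions out of 210
```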

5. In MIPS machine assembly language notation, op a, b, c means an instruction that
performs operation op on the two variables b and c and puts the result in a. Now a
C language statement is:
f = (g + h) − (i + j);
The variables f, g, h, i, and j can be assigned to the registers $s0, $s1, $s2, $s3,
and $s4, respectively. Now we use two temporary registers $t0 and $t1 to write
the compiled MIPS assembly code as follows:
add $t0, $s1, (A)
(C) $t1, $s3, (B)
sub $s0, (D), (E)
Please fill the results on the blank (A), (B), (C), (D), and (E).
Answer:
(A) (B) (C) (D) (E)
$s2 $s4 add $t0 $t1


6. Please answer the following questions:
(1) Discuss the differences between "RISC" and "CISC" machine.
(2) A performance metric on processor is called "MOPS". What is MOPS? If a
machine has the same metric on MIPS and MOPS, what does it means?
Answer:
(1) RISC vs. CISC:
- RISC: a small set of simple, fixed-length instructions; a load/store architecture in
which only loads and stores access memory; instructions that typically complete in
one cycle; more of the work is left to the compiler.
- CISC: a large set of complex, variable-length instructions; many instructions may
access memory directly; instructions can take many cycles and are often implemented
in microcode.
(2) MOPS means Millions of Operations Per Second.
If a machine has the same value for MIPS and MOPS, this machine is a single-cycle
machine.

7. Consider two different implementations, M1 and M2, of the same instruction set.
There are four classes of instruction (A, B, C, and D) in the instruction set. M1
has a clock rate of 500 MHz and M2 has a clock rate of 750 MHz.
Instruction class CPI (MachineM1) CPI (Machine M2)
A 1 2
B 2 2
C 3 4
D 4 4
(Hint: CPI means clock cycles per instruction)
(1) Assume the peak performance is defined as the fastest rate that a machine can
execute an instruction sequence chosen to maximum that rate. What are the
peak performances of M1 and M2? Please express as instructions per second?
(2) If the number of instructions executed in a certain program is divided equally
among the classes of instructions. How much faster is M2 than M1?
Answer:
(1) Peak performance for M1 = (500 × 10^6)/1 = 500 × 10^6 instructions per second
Peak performance for M2 = (750 × 10^6)/2 = 375 × 10^6 instructions per second
(2) CPI for M1 = (1 + 2 + 3 + 4)/4 = 2.5
CPI for M2 = (2 + 2 + 4 + 4)/4 = 3
Suppose the instruction count of the program is IC.
Execution time for M1 = (2.5 × IC)/(500 × 10^6) = 5 × IC (ns)
Execution time for M2 = (3 × IC)/(750 × 10^6) = 4 × IC (ns)
M2 is faster than M1 by 5/4 = 1.25 times


96

1. An unpipelined processor has a 1 ns clock cycle and uses 4
cycles for ALU and branch operations and 5 cycles for memory operations.
Assume that the relative frequencies of these operations are 40%, 20%, and 40%,
respectively. Suppose that due to clock skew and setup, pipelining the processor
adds 0.2 ns of overhead to the clock. Ignoring any latency impact, how much
speedup in the instruction execution rate will we gain from a pipeline
implementation?
Answer:
Average instruction execution time
= 1 ns × ((40% + 20%) × 4 + 40% × 5)
= 4.4 ns
Speedup from pipelining
= Average instruction time unpipelined/Average instruction time pipelined
= 4.4 ns/1.2 ns = 3.7

2. A computer system has L1 and L2 caches. The local hit rates for L1 and L2 are
95% and 80%, respectively. The miss penalties are 8 and 60 cycles, respectively.
(a) Assuming a CPI (Cycles per Instruction) of 1.2 without any cache misses and an
average of 1.1 memory accesses per instruction, what is the effective CPI after cache
misses are factored in? (b) Taking the two levels of caches as a single cache
memory, what are its miss rate and miss penalty?
Answer:
(a) Effective CPI = 1.2 + 1.1 × [(1 - 0.95) × 8 + (1 - 0.95) × (1 - 0.8) × 60] = 2.3
(b) Hit rate = 95% + 5% × 80% = 99%; Miss rate = 1 - 99% = 1%
(or Miss rate = 0.05 × 0.2 = 0.01)
Total miss penalty = 60 cycles

3. Engineers in your company developed two different hardware implementations,
M1 and M2, of the same instruction set, which has three classes of instructions: I
(Integer arithmetic), F (Floating-point arithmetic), and N (Non-arithmetic). M1's
clock rate is 1.2GHz and M2's clock cycle time is 1ns. The average CPI for the
three instruction classes on M1 and M2 are shown below:
Class CPI for M1 CPI for M2
I 3.2 3.8
F 5.6 4.2
N 2.4 2.0
Please answer the following questions:
a. What are the peak performances of M1 and M2 in MIPS?

b. If 50% of all instructions executed in a program are from class N and the rest
are divided equally among F and I, which machine is faster and by what
factor?
c. The designers of M1 plan to redesign the machine to improve its
performance.
With the instruction mix given in question b, please evaluate each of the
following 4 redesign options and rank them according to their performance
improvement.
1. Using a faster floating-point unit which doubles the speed of
floating-point arithmetic execution.
2. Adding a second integer ALU to reduce the integer CPI to 1.6.
3. Using faster logic that allows a clock rate of 1.5GHz with the same CPIs.
4. The CPIs given in the table include the effect of instruction cache misses
at an average rate of 5%. Each cache miss adds 10 cycles to the effective
CPI of the instruction causing the miss. A new redesign option is to use a
larger instruction cache that would reduce the miss rate from 5% to 2%.
d. If you prefer the M2 implementation and would like to work out test
programs that run faster on M2 than on M1. Let x and y be the fraction of
instructions belonging to class I and F respectively. What kind of relationship
between x and y will you maintain?
Answer:
a. The ideal instruction sequence for M1 is one composed entirely of
instructions from class N, so M1's peak performance is (1.2 × 10^9)/2.4 = 500
MIPS. The ideal sequence for M2 also contains only instructions from N, so
M2's peak performance is (1 × 10^9)/2 = 500 MIPS.
b. The average CPI of M1 = 0.5 × 2.4 + 0.25 × 3.2 + 0.25 × 5.6 = 3.4
The average CPI of M2 = 0.5 × 2.0 + 0.25 × 3.8 + 0.25 × 4.2 = 3
Time per instruction: M1 = 3.4/(1.2 × 10^9) ≈ 2.83 ns, M2 = 3/(1 × 10^9) = 3 ns,
so M1 is 3/2.83 ≈ 1.06 times faster than M2


c. We compare the instruction execution time for each design.
CPI_1 = 0.5 × 2.4 + 0.25 × 3.2 + 0.25 × 5.6 × 0.5 = 2.7
Instruction execution time for design 1 = 2.7/1.2 GHz = 2.25 ns
CPI_2 = 0.5 × 2.4 + 0.25 × 1.6 + 0.25 × 5.6 = 3
Instruction execution time for design 2 = 3/1.2 GHz = 2.5 ns
Instruction execution time for design 3 = 3.4/1.5 GHz ≈ 2.27 ns
For design 4: Effective CPI = 3.4 = CPI_base + 10 × 0.05, so CPI_base = 2.9
New CPI = 2.9 + 10 × 0.02 = 3.1
Instruction execution time for design 4 = 3.1/1.2 GHz ≈ 2.58 ns
Hence, the relative performance is
Design 1 > Design 3 > Design 2 > Design 4

(d) M2 faster than M1 ⇔ Instruction Time for M2 < Instruction Time for M1:
[3.8x + 4.2y + 2(1 - x - y)]/(1 GHz) < [3.2x + 5.6y + 2.4(1 - x - y)]/(1.2 GHz)
⇒ 1.2 × (1.8x + 2.2y + 2) < 0.8x + 3.2y + 2.4
⇒ 1.36x < 0.56y, i.e., x/y < 0.41


4. Please answer the following questions about memory hierarchy.
a. Please write short C codes to demonstrate the locality of memory access.
b. What are TLB and page table? Please describe clearly and systematically how
a memory access is completed by the processor cache/main
memory/TLB/page table/hard disk.
c. A computer system has a cache memory with 128K bytes. The 32-bit memory
address format is as follows: Tag bits: 31~15, Index (or Set) bits: 14~4, Offset
bits:3~0. Please derive the number of degrees of set associativity in this
cache.
Answer:
a. /* A sequential array walk: consecutive elements exhibit spatial
      locality, and the repeated use of i and list exhibits temporal
      locality. */
   int list[100];
   int i;
   for (i = 0; i != 100; i++)
       list[i] = i;
b. TLB: A cache that keeps track of recently used address mappings to avoid an
access to the page table.
Page table: The table containing the virtual to physical address translations in
a virtual memory system.
A virtual address issued from CPU is translated by the TLB. When a TLB
miss occurs, the entry of the mapping will move from page table to TLB. If
page table can not find this mapping, then a page fault occurs. Operating
System moves the missing page from hard disk to the physical memory and
the mapping thus will exist in the page table. We use the translated physical
address to search data/instruction in the cache. If cache hit, data/instruction
will send to CPU; otherwise, cache miss occurs. A separate control will move
the missing block from memory to cache and then send to CPU.
c. Offset = 4 bits ⇒ block size = 2^4 = 16 bytes
The number of blocks in the cache = 128 KB/16 B = 8K
The index field has 11 bits ⇒ the cache has 2K sets
The number of blocks in a set = 8K/2K = 4
Hence, the degree of set associativity = 4

95

1. Two machines, M1 and M2, run the same instruction set. The
instruction set is composed of 3 classes of instructions (A, B, and C). M1 runs
at 100 MHz, and M2 runs at 250 MHz. The average number of cycles per
instruction for each implementation is as follows:
Instruction Class CPI of M1 CPI of M2
A 1 2
B 1 2
C 3 2
(1) Define the peak performance as the fastest rate at which a machine could
execute an instruction sequence chosen to maximize the rate. What are the
peak performances of M1 and M2 in instructions per second?
(2) If a benchmark program consists of 30%, 30%, and 40% of all instructions for
class A, B, and C, respectively. Which machine will execute the program
faster, and by how much?
Answer:
(1) Peak performance for M1 = 100M/1 = 100M instructions per second
Peak performance for M2 = 250M/2 = 125M instructions per second
(2) The average execution time of an instruction for M1
= (0.3 × 1 + 0.3 × 1 + 0.4 × 3)/100M = 18 ns
The average execution time of an instruction for M2
= (0.3 × 2 + 0.3 × 2 + 0.4 × 2)/250M = 8 ns
M2 will execute the program faster, by 18/8 = 2.25 times

2. Consider a 6-stage pipeline, if it needs two clock cycles in stage 3 and three
clock cycles in stage 4, while each of the other stages only needs one clock
cycle.
(1) Please draw a figure to indicate the execution flow of 5 instructions for the
above pipeline machine in ideal case (without any hazard).
(2) Please specify the number of clock cycles that is required to execute n
instructions for the above pipeline machine in ideal case (without any hazard).
(3) Please use figure to explain how the superscalar technique can be applied to
improve the performance of the above pipeline machine?
Answer:

(1)
S1 S2 S3 S4 S5 S6
I1 c1
I2 I1 c2
I3 I2 I1 c3
I4 I3 I2 I1 c4
I5 I4 I3 I2 I1 c5
I5 I4 I3 I2 I1 c6
I5 I4 I3 I2 I1 c7
I5 I4 I3 I2 I1 c8
I5 I4 I3 I2 I1 c9
I5 I4 I3 I2 c10
I5 I4 I3 c11
I5 I4 c12
I5 c13
(2) The total clock cycles = ((6 - 1) + 1 + 2) + n = 8 + n
(3) See the figure below; suppose a 2-issue superscalar is used for (1). Only 11
clock cycles are needed, rather than 13, to execute these instructions.
Superscalar allows more than one instruction to be executed in each stage,
hence the performance can be increased.
[Figure: 2-issue superscalar version of the pipeline in (1). Instructions I1-I5
move through stages S1-S6 up to two at a time, completing at cycle c11 instead
of c13.]




3. You are designing a memory system similar to the one shown in the following
figure. Your memory system design uses 16 KB pages, a 40-bit virtual byte
address, and a 32-bit physical address. The TLB contains 16 entries. The cache
has 4K blocks with a 4-word block size (1 word = 4 bytes) and is 4-way set
associative.




















(1) What is the total size of the physical page number in the page table for each
process on this processor? (Assuming that all the virtual pages are in use.)
(2) What is the total number of tag bits for the cache?
(3) What kind of associativity (direct-mapped, full-associative or set-associative)
will you choose for this TLB and why?
(4) Assume that you choose 2-way set associativity for TLB and that each block
has 4 words and the initial TLB is empty. After a series of address references
given as word addresses: 1, 4, 8, 5, 17, 32, 19, 1, 56, 9, 25. Please label each
reference in the list as hit or miss. (Note: the first word address of each block
is a multiple of 4 and LRU is used.)
(5) A memory reference in this system may encounter three types of misses: a
TLB miss, a page fault and cache miss. Consider all the 8 combinations of the
three events (with hit/miss) and identify/explain which cases are impossible.
(6) To improve the cache performance, you are allowed to change C cache size
C associativity C block size. Please describe their positive and negative
affects on the performance.
Answer:
(1) The size of a physical page number = 32 - log2(16K) = 32 - 14 = 18 bits

[Figure: address translation and cache lookup. The 40-bit virtual address splits
into a virtual page number and a 14-bit page offset; the TLB (tag, valid, dirty)
yields an 18-bit physical page number; the 32-bit physical address then splits
into an 18-bit cache tag, cache index, block offset, and byte offset for the
4-way set-associative cache lookup.]

The number of entries of the page table = 2^40/16K = 2^26
The total size of the physical page numbers in the page table = 2^26 × 18 bits
(2) The size of a tag = 32 - log2(4K/4) - log2(16) = 32 - 10 - 4 = 18 bits
The total number of tag bits = 2^10 × 4 × 18 = 72K bits
(3) Since this TLB has only 16 entries, fully associative mapping may be the
most appropriate, because it gives the smallest miss rate.
(4) The size of the offset field = 2 bits; the size of the index field = log2(16/2) = 3 bits
Word address
Decimal Binary Tag Index Hit/Miss
1 000001 0 0 Miss
4 000100 0 1 Miss
8 001000 0 2 Miss
5 000101 0 1 Hit
17 010001 0 4 Miss
32 100000 1 0 Miss
19 010011 0 4 Hit
1 000001 0 0 Hit (since 2-way)
56 111000 1 6 Miss
9 001001 0 2 Hit
25 011001 0 6 Miss
(5)

TLB
Page
table
Cache Identify/Explain
hit hit miss
Possible, although the page table is never really
checked if TLB hits.
miss hit hit
TLB misses, but entry found in page table: after retry,
data is found in cache.
miss hit miss
TLB misses, but entry found in page table; after retry,
data misses in cache.
miss miss miss
TLB misses and is followed by a page fault; after retry,
data must miss in cache.
hit miss miss
Impossible: cannot have a translation in TLB if page is
not present in memory.
hit miss hit
Impossible: cannot have a translation in TLB if page is
not present in memory.
miss miss hit
Impossible: data cannot be allowed in cache if the page
is not in memory.


(6)
Positive effects Negative effects
Cache size
Increasing the cache size reduces the capacity miss rate
Increasing the cache size increases the hit time
Associativity
Increasing the associativity reduces the conflict miss rate
Increasing the associativity increases the hit time
Block size
Increasing the block size may reduce the miss rate
Increasing the block size increases the miss penalty, and too large a
block size will also increase the miss rate















94

1. True or False. (If the statement is false, explain the answer briefly.)
(1) If we write a 32-bit (4-byte) data word, 0x12345678, to the address 0x2000 in
a big-endian system, then the byte stored in 0x2000 is 0x78.
(2) A write-through cache will have the same miss rate as a write-back cache.
(3) The case of TLB miss, Page Table miss, Cache hit is possible.
(4) In memory hierarchy design, increasing the block size will help to decrease
the miss penalty.
(5) Conflict misses will not happen in fully associate caches.
Answer:
(1) False (the byte stored at 0x2000 is 0x12)
(2) True
(3) False (the page is not in memory, so its data cannot be in the cache either)
(4) False (increasing block size increases the time to transfer a block between
cache and memory and thus increases the miss penalty)
(5) True

2. Please compare "write-through" and "write-back" in cache system.
Answer:
Policies Write-through Write-back
Definition
A scheme in which writes always update both the cache and the memory,
ensuring that data is always consistent between the two.
A scheme that handles writes by updating values only in the block in the
cache, then writing the modified block back to memory when the block is
replaced.
Memory traffic
Every CPU write is sent to memory, so write traffic is high.
The CPU writes only to the cache; a modified block goes to memory only
once, when it is replaced, so write traffic is lower.




3. Use Booth's algorithm to compute 5 × (-3) (4-bit numbers) = -15 (8-bit number).
Complete the following table
Iteration Step Product
0 Initial step
1
(No) operation
Shift
2
(No) operation
Shift
3
(No) operation
Shift
4
(No) operation
Shift
Answer:
Iteration Step Product
0 Initial step 0000 1101 0
1
10 Prod - Mcand 1011 1101 0
Shift right 1101 1110 1
2
01 Prod +Mcand 0010 1110 1
Shift right 0001 0111 0
3
10 Prod - Mcand 1100 0111 0
Shift right 1110 0011 1
4
11 No operation 1110 0011 1
Shift right 1111 0001 1

4. (1) If the execution time of each instruction is t, how long does it take to execute n
instructions on an ideal 6-stage pipeline machine (assuming pipeline hazards and
overhead are ignored)?
(2) Under what condition may the execution sequence of instructions be out of order
in a pipeline system?
Answer:
(1) Each stage takes t/6, so the total time = (t/6) × ((6 - 1) + n) = (n + 5)t/6
(2) With hardware support for dynamic pipeline scheduling, instructions may be
executed out of program order.

5. Please describe the criteria which will affect the encoding and length of
instruction set and the design considerations of a RISC processor.
Answer:

Single-cycle operation: most instructions complete in one clock cycle, which
keeps the pipeline simple and the clock fast.
Load/Store design: only load/store instructions access memory; all arithmetic
operates on registers, so the CPU datapath is not slowed by memory operands.
Hardwired control: control is implemented in hardwired logic rather than
microcode, giving shorter control delays.
Relatively few instructions and addressing modes: this simplifies instruction
decoding and the datapath.
Fixed instruction format: all instructions are the same length with regular
fields, making fetch and decode easy to pipeline.

93

1. Explain the following terms:
(1) Associative cache organization
(2) Nanoprogramming
(3) Horizontal and vertical microinstruction
(4) Describe the basic idea of Booth's multiplier and write down the conversion
table.
(5) Two-bit dynamic branch prediction.
(6) What is the carry-save adder (CSA)? Give the structure of adding 4 numbers
by CSA.
Answer:
(1) A compromise between a direct mapped cache and a fully associative cache
where each address is mapped to a certain set of cache locations. For example
in an n-way set associative cache with S sets and n cache locations in each
set, block b is mapped to set b mod S and may be stored in any of the n
locations.
(2) A combination of vertical and horizontal microinstructions in a two-level
scheme is called nanoprogramming. Many microinstructions occur several
times through the micro program. In this case, the distinct microinstructions
are placed in a small control storage. The nanostore then contains the index in
the microcontrol store of the appropriate microinstruction.
(3) A vertical microinstruction is highly encoded and looks like a simple
macroinstruction; it might contain a single opcode field and one or two
operand specifiers.
A horizontal microinstruction might be completely unencoded and each
control signal may be assigned to a separate bit position in the
microinstruction format.
(4) Booth's algorithm follows this scheme by performing an addition when it
encounters the first digit of a block of ones (0 1) and a subtraction when it
encounters the end of the block (1 0). This works for a negative multiplier as
well. When the ones in a multiplier are grouped into long blocks, Booth's
algorithm performs fewer additions and subtractions than the normal
multiplication algorithm. The following shows the conversion table.
a_i  a_{i-1}  Operation
0 0 Do nothing
0 1 Add b
1 0 Subtract b
1 1 Do nothing
(5) A branch prediction scheme. A prediction must be wrong twice before it is
changed.
(6) A Carry-Save Adder is just a set of one-bit full-adders, without any
carry-chaining. Therefore, an n-bit CSA receives three n-bit operands, namely
A(n-1)..A(0), B(n-1)..B(0), and CIN(n-1)..CIN(0), and generates two n-bit
result values, SUM(n-1)..SUM(0) and COUT(n-1)..COUT(0).
To add four numbers W, X, Y, Z: a first CSA reduces W, X, Y to a sum vector
and a carry vector; a second CSA combines those two vectors with Z; a final
carry-propagate adder adds the last sum and carry vectors to give the result.












2. Compare the instruction-set architectures in RISC and CISC processors in terms
of at least 5 important characteristics.
Answer:
RISC (reduced instruction set computer) CISC (complex instruction set computer)
All instructions are the same size
(32 bits on the MIPS)
Instructions are not the same size
Few addressing modes are supported Support a lot of addressing modes
Only a few instruction formats
(makes decoding easier)
Support a lot of instruction formats
Arithmetic instructions can only work
on registers
Arithmetic instructions can work on
memory
Data in memory must be loaded into
registers before processing
Data in memory can be processed
directly without using load/store
instructions

3. Explain the basic idea (giving the key points and the reasons) of the two major
division algorithms:
(1) restoring and
(2) nonrestoring divisions.
Answer:
(1) Restoring division: keep subtracting the divisor until the remainder goes
negative, then restore the previous value, shift right one place, and continue. In
restoring division the remainder is always positive (or zero).
(2) Nonrestoring division: subtract the divisor until the dividend changes sign, then
shift right one place. In the next iteration add the divisor until the dividend
changes sign, then shift right one place and go back to the first step (subtracting).
In nonrestoring division, the remainder may be positive or negative.

92
1. Answer the following questions:
(a) Write down the IEEE 754 representation (in hex format) for the value
-13.125
(b) What is the advantage of two's complement representation when compared
with signed-magnitude representation?
(c) Given an n-bit 2's complement representation (X_{n-1} X_{n-2} ... X_1 X_0), what is the
value that it represents (write down the equation but NOT any specific
example)?
(d) Write down the Boolean equation for testing overflow in n-bit 2's complement
addition.
(e) What is CPI?
Answer:
(a) -13.125_10 = -1101.001_2 = -1.101001_2 × 2^3
sign = 1, exponent = 3 + 127 = 130 = 10000010_2, fraction = 10100100000000000000000
1 10000010 10100100000000000000000 = C1520000_16
(b) Compared with two's complement, signed-magnitude has these drawbacks:
(1) there are two representations of 0 (+0 and -0), which confuses programmers;
(2) addition and subtraction need an extra step to handle the signs;
(3) it is not obvious where to put the sign bit for the adder hardware.
(c) Value = -X_{n-1} × 2^{n-1} + Σ_{i=0}^{n-2} X_i × 2^i
(d) Overflow = c_n ⊕ c_{n-1}, where c_{n-1} and c_n are the carry-in and
carry-out bits of the (n-1)th bit, respectively.
(e) CPI = clock cycles per instruction: the average number of CPU clock
cycles needed to execute one instruction.

2. About nanoprogramming technique:
(a) Sketch and explain the block diagram of nanoprogramming.
(b) Explain the reason why people use nanoprogramming for CISC processor
control design.
Answer:
(a) If the microstore is wide, and has lots of the same words, then we can save
microstore memory by placing one copy of each unique microword in a
nanostore, and then use the microstore to index into the nanostore. Figure 1a
illustrates the space requirement for the original microstore ROM. There are n
= 2048 words that are each 41 bits wide. Suppose now that there are 100
unique microwords in the ROM. Figure 1b illustrates a configuration that uses
a nanostore, in which an area savings can be realized if there are a number of
bit patterns that recur in the original microcode sequence. The unique

microwords (100 for this case) form a nanoprogram, which is stored in a
ROM that is 100 words deep by 41 bits wide. The microprogram now indexes
into the nanostore.
[Fig. 1a: the original microprogram ROM, n = 2048 words × 41 bits.
Fig. 1b: a 2048-word × 7-bit microprogram that indexes a nanostore of
m = 100 nanowords × 41 bits.]
(b) A CISC computer has many complex instructions, so its microprogram is large.
With horizontal encoding the microstore becomes very wide and expensive, while
pure vertical encoding slows control down. For CISC control design,
nanoprogramming keeps a short, vertically encoded microprogram that indexes a
small store of wide horizontal nanowords, greatly reducing the total
control-store size.

3. For the hierarchical carry-lookahead adder design, please write down the
Boolean equations for the following signals.
(a) 16-bit Group Propagate based on 4-bit Propagate Pi and 4-bit Generate Gi
(i = 0, 1, 2, 3).
(b) 16-bit Group Generate based on 4-bit Propagate Pi and 4-bit Generate Gi
(i = 0, 1, 2, 3).
Answer:
(a) 16-bit Group Propagate = P3P2P1P0
(b) 16-bit Group Generate = G3 + (P3G2) + (P3P2G1) + (P3P2P1G0)

4. Assume there is a 4 KB cache with set-associative address mapping. And the
cache is partitioned into 32 sets with 4 blocks in each set. The memory-address
size is 23 bits, and the smallest addressable unit is byte.
(a) To what set of the cache is the address 000010AF_16 assigned?
(b) If the addresses 000010AF_16 and FFFF7xyz_16 can be assigned to the same
cache set, what values can the address digits xyz have?
Answer:
(a) block size = 4 KB/(32 × 4) = 32 bytes, so the byte offset is 5 bits
byte address = 000010AF_16 = 000 0000 0001 0000 1010 1111_2 (23 bits)
block address = byte address/32 = 000 0000 0001 0000 101_2
block address mod 32 = 00101_2, so the set number = 5
(b) Address 000010AF_16 maps to set 5, so FFFF7xyz_16 must also have bits 9 to 5
equal to 00101_2. These bits come from the digits x and y: bits 9 and 8 (the low
two bits of x) must be 00, and bits 7 to 5 (the high three bits of y) must be 101.
Hence x ∈ {0, 4, 8, C}, y ∈ {A, B}, and z may be any hex digit.


5. (a) What kinds of hazard may occur on the pipeline architecture?
(b) Consider a pipeline machine with 5 stages (instruction fetch, instruction decode,
execution, memory access, and write back) and load-store instruction set. What
hazards will occur in the following program (please indicate the numbers of the
instructions)? And how to solve it?
(1) Load R1, 3(R2) ;R1 ← Mem(R2 + 3)
(2) Add R3, R2, R7 ;R3 ← R2 + R7
(3) Store 0(R4), R3 ;Mem(R4) ← R3
(4) Sub R2, R1, R5 ;R2 ← R1 - R5
(5) Load R6, 4(R3) ;R6 ← Mem(R3 + 4)
(6) Add R8, R6, R1 ;R8 ← R6 + R1
(7) OR R6, R4, R5 ;R6 ← R4 or R5
(8) Sub R3, R7, R2 ;R3 ← R7 - R2
Answer:
(a) structural hazard, data hazard, and control hazard
(b) A register is written and then read within too few clock cycles in the pairs
(2, 3) and (5, 6), so these are data hazards. The hazard between (2, 3) can be
resolved by forwarding. The hazard between (5, 6) is a load-use hazard: one
stall cycle is needed in addition to forwarding; alternatively the compiler can
insert a NOP or reschedule an independent instruction to remove the data hazard.

6. What is daisy-chain arbitration? And what is its application in the computer
systems?
Answer:
(1) Daisy chain arbitration the grant line runs through the connected devices
from highest priority to lowest priority with priorities determined by position
of the devices. This scheme is simple, but a low priority device may be locked
out indefinitely, and the use of a daisy chain grant limits the bus speed.










(2) Daisy-chain arbitration is used in bus systems to decide which bus master
(the CPU or an I/O device) is granted ownership of the bus.
[Figure: daisy-chain arbitration, in which the grant line runs from the bus
arbiter through Device 1 (highest priority), Device 2, ..., Device N (lowest
priority); the request and release lines are shared (wired-OR).]

96

I.
1. In assembly coding, an integer multiplication by a power of 2 can be replaced by
a left shift, and an integer division by a power of 2 can be replaced by a logical
right shift.
Answer:
That an integer division by a power of 2 can be replaced by a logical right shift is
true only for unsigned integers.

2. For a cache, one way to improve the performance of the write-through scheme is
to use a write buffer. With a write buffer, the processor does not need to stall while
performing a write.
Answer:
False; the CPU must still stall when the write buffer is full, until there is
space for the write.

3. RAID 2 may recover a single-bit failure with 3 extra bits of error-correction
information. RAID 3 can achieve the same goal with only one extra bit of
information and thus reduces the overhead. Explain briefly how RAID 3 does it.
Answer:
RAID 3 needs only one extra parity bit to hold the check information in case there
is a failure. When a bit fails, you subtract all the data in the good bits from
the parity bit; the remaining information must be the missing information.

II.
1. Pipeline.
(a) Explain the law of performance. That is, how can CPU time be factored into
three terms? (Hint: CPI and cycle time) Also, briefly explain what issues may
have impact on these three terms.
(b) Instead of a traditional 5-stage pipeline processor, suppose we have a new
design that only allows register operands for load/store instructions, with no
offset. Specifically, all load/stores with nonzero offsets would become:
lw r3, 30(r5) is changed into addi r1, r5, 30
lw r3, (r1)
Can you give a 4-stage pipeline for the new design? Sketch the pipeline
organization diagram.
(c) Does the new design still require a "forwarding unit" or a "stall detection unit"
respectively? Why or why not?
(d) Referring to Question (a), please give the effects of the three terms due to the
new design in (b) and briefly explain your arguments.

203
Answer:
(a)
Time = Seconds/Program
     = (Instructions/Program) × (Clock cycles/Instruction) × (Seconds/Clock cycle)
     = Instruction count × CPI × Cycle time

Term Impact Factors
Instruction
count
Algorithm, Programming language, Compiler, Instruction
set architecture
CPI
Algorithm, Programming language, Compiler, Instruction
set architecture, Computer organization
Clock rate
Instruction set architecture, Computer organization, VLSI
technology

(b)
[Figure: 4-stage pipeline IF (IM) → ID (Reg) → EX/MEM (DM) → WB (Reg);
the execute and memory-access stages are merged.]

(c) The new pipeline structure still needs forwarding unit because data hazard
would happen if two consecutive instructions have data dependency. But it is
not necessary to keep the stall detection unit since memory access can be
done in the third stage and no stall is needed when load-use hazard occurs.
That is, the load-use hazard can be resolved just by forwarding unit.
(d) Instruction count will increase, since one memory-access instruction in the
5-stage pipeline may require two instructions in the new 4-stage pipeline structure.
CPI may decrease, since the impact of hazards decreases.
Clock rate may decrease, since there is more work to do in the third stage; the
lengthened latency leads to a longer clock cycle time.


2. For a typical processor with five-stage pipeline, the following code is executed.
I1: add $5, $1, $3
I2: sub $1, $5, $4
I3: lw $3, 10($2)
I4: and $2, $5, $6
I5: bne $7, $8, Label
I6: add $9, $10, $11
I7: add $9, $11, $12
Label: lw $13, 25($14)
where In indicates the nth instruction in this code.
(a) Classify all true dependencies in the above code in terms of In.
(b) Identify all types of hazards in the above code in terms of In.
(c) Give the methods or techniques to resolve the hazards in (b).
Answer:
(a) (I1 and I2), (I1 and I4)
(b) Data hazard: (I1 and I2)
Control hazard: I5
(c)
Type Solutions
Data hazard
Software solution: (1) the compiler inserts no-operation (nop) instructions;
(2) the compiler reorders instructions to remove the data hazard.
Hardware solution: (1) forwarding resolves most data hazards;
(2) for a load-use hazard, forwarding alone is not enough, so one stall
cycle is inserted before forwarding.
Control hazard
Software solution: (1) the compiler inserts no-operation (nop) instructions;
(2) delayed branch: the compiler fills the slot after the branch with a
useful instruction.
Hardware solution: (1) stall the pipeline until the branch outcome is known;
(2) (static or dynamic) branch prediction: the pipeline continues along the
predicted path and is flushed on a misprediction.


95

I.
1. For floating-point operations, (A + B) + C is equal to A + (B + C).
2. A 1GHz RISC CPU is faster than a 1 GHz CISC CPU, because the CPI of RISC
is smaller than that of CISC.
3. The law of performance indicates that CPU time can be shown as a product of 3
terms. What are those three terms? Explain what factors may have impact on the
three terms respectively.
4. What are advantages and side effects with increasing the block size for a given
cache size?
5. How many bits are required to store the ROM entries for a ROM with m-bit
inputs and k-bit output?
6. When the equality is true, why is an instruction beq r1, r2, imm16 operated as:
PC ← PC + 4 + (sign_extension(Imm16) || 00_b)?
Answer:
1. Wrong; floating-point addition is not associative. That is, (A + B) + C ≠ A +
(B + C) in general.
2. Wrong; this statement does not consider the instruction capability of both CPUs.
3.
Term Factors
Instruction count Algorithm, Programming language, Compiler, Instruction
set architecture
CPI Possibly algorithm, Programming language, Compiler,
Instruction set architecture, Organization
Clock rate Organization, Technology
4. Increasing the block size decreases the miss rate due to spatial locality, but a
very large block could increase the miss rate. Increasing the block size also
increases the cache miss penalty.
5. 2^m × k bits
6. The 16-bit immediate in beq is a word offset relative to the PC. It is
sign-extended to 32 bits and shifted left by 2 (two 0 bits are appended) to
convert it into a byte offset. Because the PC has already been incremented by 4
when the branch executes, the target address is
PC + 4 + (sign_extension(Imm16) || 00_b).


II.
1. Pipelining is a key technique to improve the performance of a CPU. It allows
multiple instructions to be overlapped in execution so that the CPU can take less
time to finish the execution of an application. To implement the datapath, we can
use a single-cycle or multicycle approach. Please answer the following questions
briefly.
(a) Compared to single-cycle approach, what are the advantages and
disadvantages of the multicycle approach.
(b) Give the designing principles of the multicycle approach.
(c) Write the impact of pipeline using the multicycle approach on clock rate, CPI
(clock cycles per instruction), instruction count, and branch miss penalty.
Answer:
(a) In a single-cycle implementation every instruction completes in one (long)
clock cycle, while in the multicycle approach an instruction takes several
shorter cycles, one per step.
Advantage: it is easier to balance the execution time among the steps, and the
clock cycle is set by the slowest step rather than the slowest instruction.
Disadvantage: it requires extra registers to store the data produced in each clock cycle.
(b) Balance the jobs that should be done in each clock
Balance the jobs that should be done in each stage
(c) Clock rate: increase
CPI: increase
Instruction count: may increase (penalty increase and more NOP instruction
may be needed to resolve the hazards)
Branch miss penalty: increase

2. Given a static RAM and its operation in the following figures with the following
pin definitions:
CE': The chip enable input, which is active low. When CE' = 1,
the SRAM's data pins are disabled, and when CE' = 0, the data
pins are enabled.
R/W': The control signal indicating whether the current operation
is a read (i.e. R/W' = 1) or a write (R/W' = 0). Read and write are
normally specified relative to the CPU, so read means reading
from RAM and write means writing to RAM.
Adrs: Specifying the address for the read or write.
Data: Denoting a bi-directional bundle of signals for data transfer.
When R/W' = 1, the pins are output, and when R/W' = 0, the data
pins are inputs.
(a) Please design a 2M × 16-bit SRAM system built from the 1M × 8-bit
SRAM as shown below.
(b) Please illustrate the timing diagram of your design for the new memory
system.




















Answer:
(a)
[Figure: four 1M × 8 SRAM chips M1-M4. Adrs 0-19 and R/W' go to all four
chips; Adrs 20 drives the CE' inputs so that it selects either the pair
(M1, M2) or the pair (M3, M4). M1 and M3 connect to Data 0-7, and M2 and M4
connect to Data 8-15, forming the 16-bit word.]
(b)
[Timing diagram: Adrs 20 selects the chip pair; with Adrs 0-19 stable,
R/W' = 1 makes the selected SRAMs drive Data (read), and R/W' = 0 latches
Data from the CPU (write), shown first for (M1, M2) and then for (M3, M4).]





94

I. True or False. (Explain the answer briefly if the statement is false.)
1. In comparison with one's complement, the major advantage of two's complement
is that for an algorithm it usually does fewer multiplications.
2. The critical path of the ripple carry adder is linearly proportional to the width of
the adder, while the critical path of the carry look-ahead adder (CLA) is
independent of the width of the adder.
3. The communication schemes between CPU and peripherals include polling,
interrupt, DMA, and write-through.
4. In the design of bus, the decoder is used to decide which bus master has the bus
ownership.
5. The major difference between the hardwired control unit and the
microprogramming control unit is that the former needs the support of program
counter.
6. Booth's algorithm is faster when doing multiplications since it combines the
product and multiplier registers.
7. DMA (Direct Memory Access) can be used to improve the performance of CPU
by direct load/store instructions.
8. In coding assembly, an integer multiply by a power of 2 can be replaced by a left
shift, and an integer division by a power of 2 can be replaced by a right shift.
9. Amdahl's law is a rule stating that the performance enhancement within a given
improvement is unlimited, so that the performance of a chip can be doubled
every two years.
10. Hierarchical caches (such as 2nd-level cache) are aimed at reducing average hit
time.
11. Since the memory is very cheap today, horizontal microinstruction or VLIW can
offer a cost-performance solution for embedded systems.
12. Widely variable instruction lengths (such as X86) still can give a deep pipeline
design as each stage can be balanced by reading a single instruction byte.
13. In general, the lookup of TLB can be in parallel with accessing the first level
cache.
14. PCI bus can operate up to 133MHz so that it can support fast memory transfer of
DDR memory in the north bridge.
15. In the SOC design for embedded systems, we can use popular CPU such as
Pentium 4 for the processor as well as the AGP bus for the on-chip bus to
integrate as a new system.
Answer:
1. False; the advantage of two's complement over one's complement is a unique
representation of zero and simpler addition/subtraction, not fewer multiplications.
2. The critical path of the CLA is not independent of the width of the adder
3. Does not include write-through

4. The arbiter is used to decide which bus master has the bus ownership
5. The microprogramming control unit needs the support of program counter
6. False; combining the product and multiplier registers saves hardware, but it is not what makes Booth's algorithm faster.
7. False; DMA transfers data between the device and memory directly, without CPU load/store instructions.
8. Only unsigned integer division by a power of 2 can be replaced by a right
shift.
9. Performance enhancement possible with a given improvement is limited by
the amount that improved feature is used
10. Multiple-level caches are used to reduce the miss penalty
11. Embedded systems require small memory to reduce memory access time for
real-time computation
12. Reading a single instruction byte can not balance the job done by each stage
13. In general, the TLB should be looked up before the first-level cache is accessed
14. PCI is connected to the south-bridge chip
15. AGP bus is used to connect Pentium 4 to outside world and can not be used as
an on-chip bus
II.
1. The Chung Cheng Computer Company has announced two versions of
CPUs, CCU1 and CCU2, for BARM Inc. (Better than ARM).
(1) CCU1 running at 100 MHz has the instruction fractions and cycles as follows.
Give the CPI and MIPS rate.
    Instruction class    ALU    lw/store    branch
    Frequency            50%    30%         20%
    CPI of instruction   1      2           3
(2) CCU1 looks like a normal MIPS CPU with the below pipeline. Explain how
data-forwarding (or called bypassing) can be used to reduce the effects of load
delays and why we cannot eliminate them completely.
instruction fetch → decode/register fetch → execute → memory access → register WB
(3) Now in CCU2 we add a new stage by allowing the second operand to be
shifted by an arbitrary amount before ALU computation as follows. Give a
possible data path diagram to support the new design.
instruction fetch → decode/register fetch → shift/rotate operand 2 → execute → memory access → register WB
(4) With CCU2 running at 150MHz, if half of ALU instructions can be merged
into the shift operations and CPI of instructions are the same, please give CPI
and MIPS rate for CCU2. What is the speedup over CCU1?
(5) What data hazards may occur in CCU2 and how could they be resolved in the
above pipeline?

Answer:
(1) CPI = 1 × 0.5 + 2 × 0.3 + 3 × 0.2 = 1.7
    MIPS = (100 × 10^6) / (1.7 × 10^6) = 58.82
(2) Suppose we execute the following two instructions on CCU1: lw $t0, 0($s1)
followed by add $t2, $t0, $t1. During clock 4 the data is still being read from
memory by the load instruction while the ALU in stage 3 is already performing the
operation for the following instruction, so forwarding alone cannot completely
remove the load delay. If we stall one clock cycle between the load and the
following instruction and then apply forwarding, the load delay is eliminated
completely.
(3)
[Datapath diagram: IM → Reg → shift/rotate → ALU → DM → Reg, i.e. the standard five-stage datapath with a shift/rotate unit inserted on the second operand before the ALU.]
(4) CPI = (1 × 0.25 + 2 × 0.3 + 3 × 0.2) / (0.25 + 0.3 + 0.2) = 1.93
    MIPS = (150 × 10^6) / (1.93 × 10^6) = 77.72
    Speedup = (1.7 × 10 ns) / (0.75 × 1.93 × 6.67 ns) = 1.76
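The arithmetic in (1) and (4) can be sanity-checked with a short script (our addition, not part of the original solution; the variable names are ours):

```python
# Recompute the CCU1/CCU2 figures from parts (1) and (4).
freq = {"ALU": 0.50, "lw/store": 0.30, "branch": 0.20}
cpi_class = {"ALU": 1, "lw/store": 2, "branch": 3}

cpi1 = sum(freq[c] * cpi_class[c] for c in freq)           # 1.7

# In CCU2 half of the ALU instructions merge into shifts, so the
# relative instruction count drops to 0.25 + 0.3 + 0.2 = 0.75.
count2 = {"ALU": 0.25, "lw/store": 0.30, "branch": 0.20}
cpi2 = sum(count2[c] * cpi_class[c] for c in count2) / sum(count2.values())

t1 = cpi1 * (1000 / 100)            # ns per instruction at 100 MHz
t2 = 0.75 * cpi2 * (1000 / 150)     # ns per original instruction at 150 MHz
speedup = t1 / t2
print(round(cpi1, 2), round(cpi2, 2), round(speedup, 2))
```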
(5) In addition to EX and MEM hazards, there are shift/rotate hazards in CCU2.
We can extend the forwarding unit to resolve them by also forwarding results
into the shift/rotate stage.
[Diagram: two overlapped instruction timelines (IM, Reg, shift/rotate, DM, Reg) with the added forwarding paths into the shift/rotate stage.]


2. One of the most important aspects of computer design is instruction set
architecture (ISA) design because it affects many aspects of the computer system
including implementation of CPU and compiler.
(1) Give the useful information that must be encoded in an instruction set (ISA).
(2) What are the principles of designing the instruction set?
Answer:
(1) Memory operands, registers, the instruction format, and the addressing modes.
(2) Simplicity favors regularity.
    Smaller is faster.
    Good design demands good compromises.
    Make the common case fast.

3. The choice of associativity in memory hierarchy depends on the time cost of
cache miss and the hardware cost of implementation.
(1) Give the advantages and disadvantages of caches moving from direct-mapped
to set-associative caches.
(2) For a cache of 16KB with 16B per block and 32b input address, please
compute the total tag bits respectively required for caches with direct-mapped,
4-way set associative, and fully associative.
(3) Compare the different considerations for the choice of associativity in
designing caches, TLB, and virtual memory.
Answer:
(1) Advantage: decrease miss rate
Disadvantage: increase hit time and hardware overhead
(2) Direct-mapped:
    offset = 4 bits, number of blocks = 16KB / 16B = 1K = 2^10
    tag field = 32 − 4 − 10 = 18 bits
    total tag bits = 1K × 18 = 18 Kbits
    4-way set associative:
    number of sets = 1K / 4 = 256 = 2^8
    tag field = 32 − 4 − 8 = 20 bits
    total tag bits = 256 × 4 × 20 = 20 Kbits
    Fully associative:
    tag field = 32 − 4 = 28 bits
    total tag bits = 1 × 1K × 28 = 28 Kbits
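These tag-bit totals can be verified mechanically; the sketch below is our addition and assumes only the parameters stated in the question (16 KB cache, 16 B blocks, 32-bit addresses):

```python
import math

CACHE_BYTES, BLOCK_BYTES, ADDR_BITS = 16 * 1024, 16, 32
BLOCKS = CACHE_BYTES // BLOCK_BYTES            # 1024 blocks
OFFSET = int(math.log2(BLOCK_BYTES))           # 4-bit byte offset

def total_tag_bits(ways):
    sets = BLOCKS // ways
    index = int(math.log2(sets))
    tag = ADDR_BITS - OFFSET - index           # bits per tag
    return BLOCKS * tag                        # one tag per block

# direct-mapped, 4-way, fully associative
print(total_tag_bits(1), total_tag_bits(4), total_tag_bits(BLOCKS))
```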
(3)
Cache: the miss penalty is moderate, so direct-mapped, set-associative, and fully associative designs are all used; the choice depends largely on the hit time.
TLB: the TLB is a cache of the page table; because it is small and a TLB miss is expensive, a high associativity (often fully associative) is typical.
Virtual memory: the page-fault penalty is enormous, so minimizing the miss rate dominates, and placement is effectively fully associative.







93

I. True/False (1~2 points each)
1. For a C program, the length of its binary code produced for a RISC is always
longer than that produced for a CISC.
2. As compared to the 1's complement number system, the advantage of the 2's
complement number system includes easy management of the sign bit.
3. In the design of a pipelined control unit, it is possible to encounter the following
hazards: data hazards, memory hazards, control hazards, and structural hazards.
4. In an on-chip bus like AMBA, the arbiter is used to decide which bus master has
the bus ownership.
5. For floating-point numbers, A + (B + C) is equal to (A + B) + C.
6. If the cache associativity is increased, the miss rate will decrease, the cost will
increase, and the access time will decrease.
7. In the MIPS instruction set, a branch instruction with a 16-bit offset field can give a
maximal distance of the target address of 2^16 − 1 bytes from the current PC
address.
8. The reasons that the PLA can be more efficient than ROM for control unit are (1)
PLA has no duplicate entries and (2) PLA is easier to decode the address.
9. Increasing the depth of pipelining may not always improve performance, mainly
because of larger memory access time.
10. Instruction set architecture may have significant impact on compiler design.
Answer:
1. True; a RISC needs more instructions than a CISC to do the same work, so the binary code is longer.
2. False; there is no separate sign bit to manage in 2's complement representation.
3. False, memory hazards
4. True
5. False; floating-point addition is not associative because of rounding.
6. False, access time will increase
7. False; the reach is about 2^15 words, i.e. 2^17 bytes, from the current PC address.
8. False; statement (2) is incorrect. In addition to statement (1), a PLA can also
share product terms and can take don't-cares into account.
9. False; mainly because of the larger penalty of hazards, not memory access time.
10. True, because the output of a compiler is a sequence of instructions from the ISA.

II.
1. The Chung Cheng Computer Company has designed two versions of
CPUs for Outel Inc. (named CCU1 and CCU2, "CCU outside"), which can run
at 100 MHz and 200 MHz respectively. The average number of cycles and
frequency for each instruction class are as follows:
Instruction class   Frequency   Cycles on CCU1   Cycles on CCU2
A                   50%         1                2
B                   30%         2                3
C                   20%         3                3
(1) Which CPU is faster when executing the same program? What is speedup?
(2) What is "MIPS"? Compute the MIPS rating of CCU1 and CCU2 respectively.
(3) Give all possible techniques that could make CCU2 run at a faster clock if
the instruction set architecture is not changed.
(4) If the company claims that CCU1 is a 32-bit CPU and CCU2 is a 64-bit CPU, how do
you define the differences between a 32-bit CPU and a 64-bit CPU?
Answer:
(1) CPI_CCU1 = 1 × 0.5 + 2 × 0.3 + 3 × 0.2 = 1.7
    CPI_CCU2 = 2 × 0.5 + 3 × 0.3 + 3 × 0.2 = 2.5
    Suppose IC represents the instruction count. Then
    ExTime_CCU1 = (1.7 × IC) / (100 × 10^6) = (17 × IC) ns
    ExTime_CCU2 = (2.5 × IC) / (200 × 10^6) = (12.5 × IC) ns. Hence, CCU2 is faster.
    Speedup = ExTime_CCU1 / ExTime_CCU2 = (17 × IC) / (12.5 × IC) = 1.36
(2) MIPS: a measure of program execution speed, in millions of instructions
executed per second.
    MIPS_CCU1 = (100 × 10^6) / (1.7 × 10^6) = 58.82
    MIPS_CCU2 = (200 × 10^6) / (2.5 × 10^6) = 80
(3) Advance VLSI technology, faster components, advance computer
organization
(4) Whether a CPU is "32-bit" or "64-bit" refers to its native word width: the
width of its general-purpose registers and datapath, i.e. how many bits it
processes in one operation.

2. Now L
4
has designed another simple version of CPU (named, CCU0), which
does not support interrupt. Argue which of the following design techniques are
not possible for CCU0. Why?
(1) Pipelining
(2) Virtual memory
(3) Polling I/O
(4) Data forwarding
(5) Cache memory
Answer:

(1) possible: pipelining itself does not depend on interrupt support.
(2) impossible: virtual memory relies on the page-fault mechanism; on a page
fault the CPU must be interrupted so the OS can move the page from the hard
disk into main memory and restart the access.
(3) possible: with polling, the CPU actively checks device status itself, so no
interrupt is needed.
(4) possible: data forwarding is internal to the CPU pipeline and is independent
of interrupts.
(5) possible: cache-miss handling is done with the processor control unit and
with a separate controller that initiates the memory access and refills the
cache; on a miss the CPU simply stalls, so no interrupt is required.

3. The following figure shows the cache miss rate versus block size for five
different cache sizes in a memory system. Assume the memory system takes 40
clock cycles of overhead and then delivers 16 bytes every 2 clock cycles. Assume
a cache hit takes 1 cycle.
(1) Give the access time to fetch a data block for each block size
(2) Give which block size has the lowest average memory access time for 4K,
and 64K
(3) Give your observations from the figure. Explain why the lowest miss rate
occurs at different block size for different cache size
Cache size
Block size 1K 4K 16K 64K 256K
16 15.0% 8.5% 4.0% 2.0% 1.0%
32 13.0% 7.0% 3.0% 1.5% 0.7%
64 13.5% 7.0% 2.5% 1.0% 0.5%
128 17.0% 8.0% 3.0% 1.0% 0.5%
256 22.0% 9.5% 3.5% 1.2% 0.5%
Answer: (1), (2)
Block size   Block transfer time (cycles)   AMAT, 4K cache (cycles)    AMAT, 64K cache (cycles)
16           40 + 1 × 2 = 42                1 + 42 × 0.085 = 4.57      1 + 42 × 0.02 = 1.84
32           40 + 2 × 2 = 44                1 + 44 × 0.07 = 4.08       1 + 44 × 0.015 = 1.66
64           40 + 4 × 2 = 48                1 + 48 × 0.07 = 4.36       1 + 48 × 0.01 = 1.48
128          40 + 8 × 2 = 56                1 + 56 × 0.08 = 5.48       1 + 56 × 0.01 = 1.56
256          40 + 16 × 2 = 72               1 + 72 × 0.095 = 7.84      1 + 72 × 0.012 = 1.864
(2) For 4K cache, block size 32 has the smallest AMAT
For 64K cache, block size 64 has the smallest AMAT
(3) Increasing the block size exploits spatial locality and at first lowers the
miss rate. Once blocks become too large relative to the cache, however, fewer
blocks fit, conflicts increase, and the miss rate rises again. The larger the
cache, the larger the block size at which this turnaround occurs, which is why
the lowest miss rate appears at a different block size for each cache size.
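The table and the choices in (2) can be regenerated from the miss rates; this sketch is our addition, with the memory timing taken from the question (40-cycle overhead, 16 bytes per 2 cycles, 1-cycle hit):

```python
# Miss rates from the table for the 4K and 64K caches.
miss = {
    16:  {"4K": 0.085, "64K": 0.020},
    32:  {"4K": 0.070, "64K": 0.015},
    64:  {"4K": 0.070, "64K": 0.010},
    128: {"4K": 0.080, "64K": 0.010},
    256: {"4K": 0.095, "64K": 0.012},
}

def block_transfer_cycles(block_bytes):
    return 40 + (block_bytes // 16) * 2     # 40-cycle overhead, 16 B / 2 cycles

def amat(block_bytes, cache):
    return 1 + block_transfer_cycles(block_bytes) * miss[block_bytes][cache]

best_4k = min(miss, key=lambda b: amat(b, "4K"))
best_64k = min(miss, key=lambda b: amat(b, "64K"))
print(best_4k, best_64k)
```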


4. Pipelining- consider a 5-stage pipeline like MIPS
(1) Explain why instruction work at each stage should be as balanced as possible.
Give an example to support your arguments.
(2) How to solve the data hazard due to an immediate use on a previous data load
from memory? Does it still require a stall? Why?
(3) The finish time of branch instructions can be moved early from MEM to ID.
What are the costs behind that? Is it possible to move earlier to the IF stage?
Answer:
(1) The clock cycle time of a pipeline is set by its slowest stage, so unbalanced
stages waste time in the faster ones. Example: consider two 4-stage pipelined
machines, each with 8 ns of total logic. Machine 1 has stages of 1 ns, 5 ns, 1 ns,
1 ns, so its clock rate is 200 MHz; machine 2 has balanced stages of 2 ns each,
so its clock rate is 500 MHz. For 100 instructions:
Execution time for machine 1 = ((4 − 1) + 100) × 5 ns = 515 ns
Execution time for machine 2 = ((4 − 1) + 100) × 2 ns = 206 ns
With the same total logic, the balanced pipeline achieves a higher clock rate and
much better performance.
(2) One stall cycle plus forwarding resolves this data hazard. In stage 4 the
data is still being read from memory by the load instruction while the ALU in
stage 3 is already performing the operation for the following instruction, so
forwarding alone cannot completely remove the load delay. If we stall one clock
cycle between the load and the following instruction and then apply forwarding,
the load delay is eliminated completely. A stall is therefore still required.
(3) Moving the branch decision to ID requires comparing the two register values
as they come out of the register file, e.g. with a 32-bit XOR array whose outputs
are combined; the extra cost is about 32 XOR gates and a NOR gate, plus moving
the target-address adder into ID. The decision cannot be moved to the IF stage,
because the registers have not yet been read in IF, so there is nothing to
compare.


92

I. True/False:
1. MIPS (million instructions per second) is not a good metric to measure the
performance of two processors with the cycle time, which is impractical for real
applications.
2. The decimal value of a hexadecimal fixed-point number, 0x1F.3 is 31.3.
3. In implementing pipelining for branch instructions, the adder computing target
addresses at the ID stage can be eliminated because the ALU can be free in the
decoding phase.
4. The reasons that the PLA can be more efficient than ROM for control unit are (1)
PLA has no duplicate entries and (2) PLA is easier to decode the address.
5. Cache miss rate is dependent on both the block size and cache associativity. The
cache miss rate decreases as the block size increases.
6. There are two ways of writing data in a hierarchical memory with cache:
write-through and write-back. The write-through policy costs less than write-back
because the write-through writes only a single word rather than a cache block.
7. For a given size of L1 cache, using the multi-level caches can reduce the average
miss penalty instead of the miss rate of L1 cache.
8. A daisy chain bus uses a bus grant line that chains through each device from
lowest to highest priority.
9. There are three ways in interfacing processors and peripherals: polling, interrupt,
and DMA. Among them, the polling I/O consumes the most amount of processor
time.
10. If a processor does not support precise interrupt, the virtual memory is not
possible as the OS has no way to resume the execution.
Answer:
1. True
2. False, should be 31.1875
3. False, the adder computing the target address at the ID stage cannot be eliminated
4. False, no address decode is required for PLA
5. False, too large block size increase miss rate
6. False, write-through policy costs more than write-back does
7. True
8. False, from highest to lowest priority
9. True
10. True


II.
1. Use 4-bit carry lookahead adders to design a 16-bit carry lookahead adder.
(1) Give the block diagram of the 16-bit adder.
(2) Describe how carry signals, generate signals, and propagate signals are passed
between 4-bit carry lookahead adders.
(3) Assume that each 4-bit carry-lookahead adder takes d time units to generate an
output carry after it receives the input carry. What's the total addition time of
this 16-bit carry-lookahead adder?
Answer:
(1) See the block diagram below: four 4-bit ALU/CLA blocks chained through a carry-lookahead unit.
(2) Propagate signals:
P0 = p3·p2·p1·p0
P1 = p7·p6·p5·p4
P2 = p11·p10·p9·p8
P3 = p15·p14·p13·p12
Generate signals:
G0 = g3 + p3·g2 + p3·p2·g1 + p3·p2·p1·g0
G1 = g7 + p7·g6 + p7·p6·g5 + p7·p6·p5·g4
G2 = g11 + p11·g10 + p11·p10·g9 + p11·p10·p9·g8
G3 = g15 + p15·g14 + p15·p14·g13 + p15·p14·p13·g12
Carry signals:
C1 = G0 + c0·P0
C2 = G1 + G0·P1 + c0·P0·P1
C3 = G2 + G1·P2 + G0·P1·P2 + c0·P0·P1·P2
C4 = G3 + G2·P3 + G1·P2·P3 + G0·P1·P2·P3 + c0·P0·P1·P2·P3

(3) Generating each pi and gi takes 1 gate delay; forming Pi and Gi from them takes
2 more gate delays; and the carry-lookahead unit needs another 2 gate delays to
produce the block carries, so the last 4-bit adder receives its input carry after
1 + 2 + 2 = 5 gate delays. Adding the d time units for that adder to generate its
output carry and about 3 gate delays for the final sum, the total addition time is
5 gate delays + d + 3 gate delays = 8 gate delays + d.

[Block diagram: inputs a0–a15, b0–b15 feed four 4-bit ALUs (ALU0–ALU3) producing Result0–3, Result4–7, Result8–11, and Result12–15. Each ALU computes its per-bit pi, gi and sends group P0–P3, G0–G3 to a carry-lookahead unit, which returns the carries C1–C4; CarryIn (c0) enters ALU0 and CarryOut leaves ALU3.]
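To make the P/G/C equations above concrete, the following sketch (our addition; the example operands are chosen arbitrarily) evaluates them bit by bit and cross-checks each block carry against ordinary binary addition:

```python
# 16-bit CLA: per-bit p, g; 4-bit group P, G; block carries C1..C4.
a, b, c0 = 0x0F0F, 0x00FF, 0
p = [((a >> i) | (b >> i)) & 1 for i in range(16)]
g = [((a >> i) & (b >> i)) & 1 for i in range(16)]

P, G = [], []
for blk in range(4):
    p4, g4 = p[4*blk:4*blk+4], g[4*blk:4*blk+4]
    P.append(p4[3] & p4[2] & p4[1] & p4[0])
    # G = g3 + p3·g2 + p3·p2·g1 + p3·p2·p1·g0 (indices within the block)
    G.append(g4[3] | (p4[3] & g4[2]) | (p4[3] & p4[2] & g4[1])
             | (p4[3] & p4[2] & p4[1] & g4[0]))

C = [c0]
for i in range(4):
    C.append(G[i] | (P[i] & C[i]))      # C(i+1) = Gi + Pi·Ci

# Cross-check each block carry against plain binary addition.
for i in range(1, 5):
    expected = ((a & ((1 << 4*i) - 1)) + (b & ((1 << 4*i) - 1)) + c0) >> (4*i)
    assert C[i] == expected
print(P, G, C[1:])
```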

2. Interrupt:
(1) Describe the differences between interrupt, exception, and trap.
(2) Write the detailed procedure of I/O devices with interrupt mechanism step by
step and show how these steps are performed via CPU, operating systems, or
devices.
Answer:
(1) Interrupt: an asynchronous event raised outside the CPU, typically by an I/O
device, requesting the processor's attention.
Exception: an event detected inside the CPU while executing an instruction, e.g.
divide by 0, overflow, or an undefined opcode.
Trap: a deliberate, program-generated transfer of control into the operating
system (e.g. a system call), handled by the processor like a synchronous exception.
(2)
1. The I/O device signals its device controller, which raises an interrupt request to the CPU.
2. The CPU finishes the current instruction, saves the program counter and processor state, and jumps to the interrupt handler in the operating system.
3. The OS identifies the interrupting device and runs the corresponding service routine for the requesting process.
4. When the service routine finishes, the CPU restores the saved state.
5. The interrupted process resumes from the saved program counter as if nothing had happened.

3. Pipelining:
We plan to construct a new CPU: CCU (creative computing unit). Assume the
operation times of the major components are: Mem = 4 ns, ALU = 2 ns, ID & register
read = 2 ns, and register WB = 1 ns. The instruction mix for applications is 25%
loads, 10% stores, 50% R-format operations, and 15% branches. Consider an
organization with 5-stage pipelining:
(1) Determine max clock rates for 3 implementations of single-cycle,
multiple-cycle, and ideal pipeline (no hazard, no cache stalls) respectively.
(2) Compute the CPI for three implementations and their performance in term of
the Average Execution Time per Instruction. (TPI = CPI * cycle time).
(3) Considering the pipelining implementation, actually we have 3 cycle stall for
a branch. If we invent a way to reduce the stall with only 1 cycle, compute the
speedup obtained for such an invention.
(4) In fact Mem of 4 ns is only estimated. If we build a cache for the CCU, the
cache access time can be 2 ns, that is , Mem needs only 2 ns instead of 4 ns.
However, we need to stall and pay a memory latency penalty of 40 ns for
those 5% cache misses. How many stall cycles does the new pipelined CCU
have? Also compute the TPI for the new CPU.
Answer:

(1)
Machine      single-cycle                multiple-cycle   pipeline
Cycle time   4 + 2 + 2 + 4 + 1 = 13 ns   4 ns             4 ns
Clock rate   76.92 MHz                   250 MHz          250 MHz
(2)
Machine   single-cycle     multiple-cycle                                  pipeline
CPI       1                5 × 0.25 + 4 × 0.1 + 4 × 0.5 + 3 × 0.15 = 4.1   1
TPI       1 × 13 = 13 ns   4.1 × 4 = 16.4 ns                               1 × 4 = 4 ns

(3) CPI before improvement = 1 + 0.15 × 3 = 1.45
    CPI after improvement = 1 + 0.15 × 1 = 1.15
    Speedup = Execution time before / Execution time after
            = CPI before / CPI after = 1.45 / 1.15 = 1.26
(4) (a) miss penalty = 40 ns / 2 ns = 20 cycles
    memory stall cycles per instruction = (1 + 0.25 + 0.1) × 0.05 × 20 = 1.35
    (b) new CPI = 1 + 1.35 = 2.35
    new TPI = 2.35 × 2 ns = 4.7 ns
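The figures in (2)–(4) can be reproduced with a few lines (our addition; variable names are ours):

```python
mix = {"load": 0.25, "store": 0.10, "R": 0.50, "branch": 0.15}
cycles = {"load": 5, "store": 4, "R": 4, "branch": 3}

cpi_multi = sum(mix[k] * cycles[k] for k in mix)          # 4.1
tpi = {"single": 1 * 13, "multi": cpi_multi * 4, "pipe": 1 * 4}

speedup_branch = (1 + 0.15 * 3) / (1 + 0.15 * 1)          # part (3)

# Part (4): 2 ns cycle, 5% miss on instruction + data references, 40 ns penalty.
stalls = (1 + mix["load"] + mix["store"]) * 0.05 * (40 // 2)
tpi_new = (1 + stalls) * 2
print(round(cpi_multi, 2), round(speedup_branch, 2), round(stalls, 2), round(tpi_new, 2))
```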

4. Performance analysis of two bus schemes:
Suppose we have a system with the following characteristics:
(1) A memory and bus system supporting block access of 4 to 16 32-bit words.
(2) A 64-bit synchronous bus clocked at 200 MHz, with each 64-bit transfer
taking 1 clock cycle. And 1 clock cycle required to send an address to
memory.
(3) Two clock cycles needed between each bus operation. (Assume the bus is idle
before an access.)
(4) A memory access time for the first four words of 200 ns; each additional set of
four words can be read in 20 ns. Assume that a bus transfer of the most
recently read data and a read of the next four words can be overlapped.
(a) Find the sustained bandwidth and the latency for a read of 256 words for transfers
that use 4-word blocks and for transfers that use 16-word blocks.
(b) Compute the effective number of bus transactions per second for each case.
Recall that a single bus transaction consists of an address transmission followed
by data.
Answer:
(a) For the 4-word block transfers, each block takes
1. 1 clock cycle to send the address to memory
2. 200 ns / (5 ns/cycle) = 40 clock cycles to read memory
3. 2 clock cycles to send the data from the memory
4. 2 idle clock cycles between this transfer and the next
This is a total of 45 cycles. The bus bandwidth is (4 × 4) bytes × (1 sec / (45 ×
5 ns)) = 71.11 MB/sec.
Each block takes 45 cycles and 256/4 = 64 transactions are needed, so the
entire transfer takes 45 × 64 = 2880 clock cycles. Thus the latency is 2880
cycles × 5 ns/cycle = 14,400 ns.
For the 16-word block transfers, the first block requires
1. 1 clock cycle to send an address to memory
2. 200 ns or 40 cycles to read the first four words in memory
3. 2 cycles to send the data of the block, during which time the read of the
four words in the next group is started
4. 2 idle cycles between transfers, during which the read of the next
group is completed
Each of the three remaining 4-word groups requires repeating only the last
two steps. Thus, the total number of cycles for each 16-word block is 1 + 40
+ 4 × (2 + 2) = 57 cycles. The bus bandwidth with 16-word blocks is (16 × 4)
bytes × (1 sec / (57 × 5 ns)) = 224.56 MB/second.
Each block takes 57 cycles and 256/16 = 16 transactions are needed, so the
entire transfer takes 57 × 16 = 912 cycles. Thus the latency is 912 cycles × 5
ns/cycle = 4560 ns.
(b) For the 4-word block transfers:
The number of bus transactions per second is 64 transactions × (1 sec /
14,400 ns) = 4.44M transactions/second.
For the 16-word block transfers:
The number of bus transactions per second is 16 transactions × (1 sec /
4560 ns) = 3.51M transactions/second.
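Both block-size cases can be recomputed directly (our addition, assuming the bus parameters given in the question):

```python
CYCLE_NS = 5                                   # 200 MHz bus clock

# 4-word blocks: address + 40-cycle read + 2 send + 2 idle.
cyc4 = 1 + 40 + 2 + 2                          # 45 cycles per block
lat4 = (256 // 4) * cyc4 * CYCLE_NS            # ns for all 256 words

# 16-word blocks: reads of later 4-word groups overlap the 2+2 cycles.
cyc16 = 1 + 40 + 4 * (2 + 2)                   # 57 cycles per block
lat16 = (256 // 16) * cyc16 * CYCLE_NS

bw4 = (4 * 4) / (cyc4 * CYCLE_NS) * 1e3        # MB/s (bytes per ns x 1e3)
bw16 = (16 * 4) / (cyc16 * CYCLE_NS) * 1e3
tps4 = (256 // 4) / lat4 * 1e3                 # millions of transactions/s
tps16 = (256 // 16) / lat16 * 1e3
print(cyc4, cyc16, lat4, lat16, round(bw4, 2), round(bw16, 2))
```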













96

1. Identify two differences between the following terminology pairs.
(1) Computer Networks vs Cluster Computers
(2) Multi-Core Server vs Multi-processor Server
(3) VLIW vs Superscalar
(4) Synchronous DRAM vs Cache DRAM
(5) TLB (Translation Lookaside Buffer) vs Page Table
Answer:
(1) A network is a collection of computers and devices connected to each other.
The network allows computers to communicate with each other and share
resources and information.
A computer cluster is a group of linked computers, working together closely
so that in many respects they form a single computer. The components of a
cluster are commonly connected to each other through fast local area
networks.
(2) A multi-core Server is a computer that has two or more independent cores
(normally a CPU) that are packaged into a single IC.
Multi-processor Server is a computer that has two or more processors that
have common access to a main memory.
(3) Both VLIW and superscalar processors can issue more than one instruction to the
execution units per cycle. But in a VLIW approach, the compiler decides which
instructions can be run in parallel, whereas in a superscalar approach the hardware
decides which instructions can be run in parallel.
(4) Both Synchronous DRAM and Cache DRAM are used to improve the
performance of DRAM.
A DRAM with an on-chip cache, called the cache DRAM. That is, cache
DRAM integrates SRAM cache onto generic DRAM chips.
Typical DRAM is asynchronous but Synchronous DRAM is synchronous
which exchanges data with the processor synchronized to an external clock
signal and running at full speed of the processor/memory bus without
imposing wait states.
(5) The TLB is a cache of the page table: the TLB is held in fast cache-like
storage while the page table is stored in memory, and TLB entries have a tag
field while page-table entries do not.






2. Consider a hypothetical 32-bit microprocessor having 32-bit instructions
composed of two fields: the first byte contains the OP code and the remainder the
immediate operand or an operand address. Assume that the local address bus is 32
bits and the local data bus is 16 bits. No time multiplexing between the address
and data buses.
(1) What is the maximum directly addressable memory capacity (in bytes)?
(2) What is the minimum bit numbers required for the program counter?
(3) Assume the direct addressing mode is applied. How many address and data
bus cycles are required to fetch an instruction and its corresponding operand
from memory?
Answer:
Instruction format:   OP (8 bits) | address / immediate (24 bits)
(1) The maximum directly addressable memory capacity = 2^24 = 16 Mbytes
(2) The minimum number of bits required for the program counter = min(24, 32) = 24 bits
(3)
                    Address bus cycles   Data bus cycles
Instruction fetch   1                    2
Operand fetch       1                    2

3. Perform the following three Intel x86 instructions,
    MOV AX, 0248H
    MOV BX, 0564H
    CMP AX, BX
and list the Carry Flag (CF), Overflow Flag (OF), Parity Flag (PF), Sign Flag (SF),
and Zero Flag (ZF).
Answer:
CMP computes AX − BX = 0248H − 0564H:
  0000001001001000₂ − 0000010101100100₂
= 0000001001001000₂ + 1111101010011100₂ (adding the two's complement of BX)
= 1111110011100100₂ = FCE4H (no carry out of bit 15, i.e. a borrow occurred)

Carry Flag   Overflow Flag   Parity Flag   Sign Flag   Zero Flag
1            0               1             1           0

Instruction 1 moves the hexadecimal constant 0248 into register AX, instruction 2
moves 0564 into BX, and instruction 3 compares the registers by computing AX − BX
and setting the flags. Since AX < BX as unsigned values, the subtraction borrows,
so CF = 1. The result is nonzero (ZF = 0), its sign bit is 1 (SF = 1), and there
is no signed overflow (OF = 0). The low byte of the result, E4H = 11100100₂,
contains an even number (four) of 1 bits, so PF = 1; PF = 0 otherwise.
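The flag values can be derived mechanically. This helper is our addition and encodes the standard x86 flag semantics for CMP (a subtraction that discards the result):

```python
def cmp_flags(ax, bx, bits=16):
    """Flags set by CMP ax, bx (i.e. by computing ax - bx)."""
    mask = (1 << bits) - 1
    sign = 1 << (bits - 1)
    res = (ax - bx) & mask
    return {
        "CF": int(ax < bx),                                  # unsigned borrow
        "OF": int(bool((ax ^ bx) & (ax ^ res) & sign)),      # signed overflow
        "PF": int(bin(res & 0xFF).count("1") % 2 == 0),      # low-byte even parity
        "SF": int(bool(res & sign)),                         # sign of result
        "ZF": int(res == 0),
    }

print(cmp_flags(0x0248, 0x0564))
```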

4. Analysis of Program Structures.
(1) Analyze the following program, and find out how many times the statement
"sum++" is executed.
    sum = 0;
    for (i = 0; i < n; i++) {
        h = i + 1;
        for (j = 0; j < h * h; j++)
            sum++;
    }
(2) Analyze the following program, and find out how many times the statement
"A(i, j, k)" is executed.
    For k = 1 to n
        For i = 0 to k-1
            For j = 0 to k-1
                If i ≠ j then A(i, j, k)
            End
        End
    End
Answer:
(1) For a given outer iteration, "sum++" executes h² times in the inner loop.
Since i runs from 0 to n − 1, h takes the values 1 to n, so "sum++" executes
    Σ_{i=1}^{n} i² = n(n + 1)(2n + 1) / 6 times.
(2) For a given k, A(i, j, k) executes k × k − k = k² − k times in the two
inner loops (the k cases with i = j are skipped). Since k runs from 1 to n,
A(i, j, k) is executed
    Σ_{k=1}^{n} (k² − k) = n(n + 1)(2n + 1)/6 − n(n + 1)/2 = (n − 1)n(n + 1)/3 times.
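Both closed forms can be checked by brute force; the loops below (our addition) mirror the programs above:

```python
def count_sum(n):
    total = 0
    for i in range(n):          # i = 0 .. n-1
        h = i + 1
        total += h * h          # inner loop body runs h*h times
    return total

def count_A(n):
    total = 0
    for k in range(1, n + 1):
        for i in range(k):
            for j in range(k):
                if i != j:
                    total += 1
    return total

# Compare against the closed forms for a range of n.
for n in range(1, 30):
    assert count_sum(n) == n * (n + 1) * (2 * n + 1) // 6
    assert count_A(n) == (n - 1) * n * (n + 1) // 3
print("closed forms verified")
```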

5. Hamming error correction codes.
(1) How many check-bits are needed if the Hamming error correction code is
used to detect single bit errors in a 1024-bit data word?
(2) For the 8-bit word 00111001, the check bits stored with it would be 0111.
Suppose when the word is read from memory, the check bits are calculated to
be 1101. What is the data word that was read from memory?

Answer:
(1) 2^k − 1 ≥ 1024 + k ⟹ k = 11
(2) Layout with check bits at the power-of-two positions:

Position   12   11   10   9    8    7    6    5    4    3    2    1
Bit        M8   M7   M6   M5   C8   M4   M3   M2   C4   M1   C2   C1
Stored     0    0    1    1    0    1    0    0    1    1    1    1

C1 = M1 ⊕ M2 ⊕ M4 ⊕ M5 ⊕ M7 = 1 ⊕ 0 ⊕ 1 ⊕ 1 ⊕ 0 = 1
C2 = M1 ⊕ M3 ⊕ M4 ⊕ M6 ⊕ M7 = 1 ⊕ 0 ⊕ 1 ⊕ 1 ⊕ 0 = 1
C4 = M2 ⊕ M3 ⊕ M4 ⊕ M8 = 0 ⊕ 0 ⊕ 1 ⊕ 0 = 1
C8 = M5 ⊕ M6 ⊕ M7 ⊕ M8 = 1 ⊕ 1 ⊕ 0 ⊕ 0 = 0

             C8   C4   C2   C1
Stored       0    1    1    1
Calculated   1    1    0    1
XOR          1    0    1    0

The syndrome 1010₂ = 10 indicates that bit position 10, which contains M6, is in error:

Position    12   11   10   9    8    7    6    5    4    3    2    1
Bit         M8   M7   M6   M5   C8   M4   M3   M2   C4   M1   C2   C1
Corrected   0    0    0    1    0    1    0    0    1    1    1    1

The data word read from memory should be: 00011001
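The syndrome computation can be scripted; this sketch is our addition, with M1..M8 placed at the non-power-of-two positions exactly as in the table above:

```python
def check_bits(data):
    """Even-parity C1, C2, C4, C8 for data bits M1..M8 placed at
    positions 3, 5, 6, 7, 9, 10, 11, 12."""
    positions = [3, 5, 6, 7, 9, 10, 11, 12]         # M1..M8 in order
    word = dict(zip(positions, data))
    return {c: sum(bit for pos, bit in word.items() if pos & c) % 2
            for c in (1, 2, 4, 8)}

stored_data = [1, 0, 0, 1, 1, 1, 0, 0]    # M1..M8 of 00111001 (M8 first)
stored = check_bits(stored_data)          # C8 C4 C2 C1 = 0 1 1 1
read = {1: 1, 2: 0, 4: 1, 8: 1}           # check bits calculated on read: 1101
syndrome = sum(c for c in (1, 2, 4, 8) if stored[c] != read[c])
print(syndrome)                            # position of the flipped bit
```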

6. Consider a cache and a main memory hierarchy, in which cache = 32K words,
main memory = 128M words, cache block size = 8 words, and word size = 4
bytes.
(1) Show physical address format for Direct Mapping (How many bits in Tag,
Block, and Word?)
(2) Show physical address format for 4-way Set Associative Mapping (How
many bits in Tag, Set, and Word?)
(3) Show physical address format for Sector Mapping with 16 blocks per sector.
(How many bits in Sector, Block, and Word?)
Answer:
Memory = 128M words = 512 Mbytes = 2^29 bytes, so a physical address has 29 bits.
A block is 8 words × 4 bytes = 32 bytes, so the word (byte-offset) field is 5 bits.
(1) Number of blocks = 32K words / 8 words = 4K = 2^12
    Tag                 Block (index)   Word (offset)
    29 − 17 = 12 bits   12 bits         5 bits
(2) Number of sets = 32K words / (4 × 8 words) = 1K = 2^10
    Tag                 Set (index)     Word (offset)
    29 − 15 = 14 bits   10 bits         5 bits
(3) With 16 blocks per sector, the block field is log2(16) = 4 bits:
    Sector              Block           Word (offset)
    29 − 9 = 20 bits    4 bits          5 bits
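The three field breakdowns can be computed from the stated sizes (a sketch we added; the constants follow the question):

```python
import math

ADDR_BITS = int(math.log2(128 * 2**20 * 4))    # 2^29 bytes -> 29
OFFSET = int(math.log2(8 * 4))                 # 32-byte block -> 5
BLOCKS = (32 * 1024) // 8                      # 4096 blocks in the cache

index_dm = int(math.log2(BLOCKS))              # direct-mapped index
tag_dm = ADDR_BITS - index_dm - OFFSET

index_sa = int(math.log2(BLOCKS // 4))         # 4-way set index
tag_sa = ADDR_BITS - index_sa - OFFSET

block_f = int(math.log2(16))                   # block-within-sector field
sector = ADDR_BITS - block_f - OFFSET

print((tag_dm, index_dm, OFFSET), (tag_sa, index_sa, OFFSET), (sector, block_f, OFFSET))
```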




96

If some questions are unclear or not well defined to you, you can make your own
assumptions and state them clearly in the answer sheet.

1. Choose the most appropriate answer (one only) to each following question.
1.1 Which of the following MIPS addressing mode means that the operand is a
constant within the instruction itself? (a) Register addressing (b) Immediate
addressing (c) Base addressing (d) PC-relative addressing
1.2 Which of the following feature is typical for the RISC machine? (a)
Powerful instructions (b) Large CPI (c) More addressing modes (d) Poor
code density
1.3 Which is the IEEE 754 binary representation of the floating-point number
−0.4375₁₀ in single precision? (a) 1 11111110 11100000000000000000000
(b) 1 11111110 11000000000000000000000 (c) 1 01111101
11100000000000000000000 (d) 1 01111101 11000000000000000000000
1.4 A program runs in 10 seconds on computer A (which has a 4 GHz clock)
and 6 seconds on computer B. Computer B requires 1.2 times as many
clock cycles as computer A for this program. What clock rate does computer
B have? (a) 6 GHz (b) 7 GHz (c) 8 GHz (d) 9 GHz
1.5 Pipelining improves (a) Instruction throughput (b) Individual instruction
execution time (c) Individual instruction latency (d) All of the above are
correct
1.6 Which of the following technique is associated primarily with a
hardware-based approach to exploiting instruction-level parallelism? (a)
Very long instruction word (VLIW) (b) Explicitly parallel instruction
computer (EPIC) (c) Dynamic pipeline scheduling (d) Register renaming
1.7 Consider a cache with 64 blocks and a block size of 16 bytes. What block
number does byte address 1200 map to? (a) 10 (b) 11 (c) 12 (d) 13
1.8 Which of the following statement about "write back" is incorrect? (a) new
value is written only to the block in the cache (b) the modified block is
written to the lower level of the hierarchy when it is replaced (c) more
complex to implement than write-through (d) can ensure that data is always
consistent between cache and memory
1.9 Which of the following statements is incorrect? (a) The compiler must
understand the pipeline to achieve the best performance. (b) Deeper
pipelining usually increases clock rate. (c) Increasing the associativity of a
cache may slow access time, leading to lower overall performance. (d) The
addition of a second-level cache can reduce the miss rate of the first-level cache.
1.10 In a magnetic disk, the disks containing the data are constantly rotating. On
average it should take half a revolution for the desired data on the disk to
spin under the read/write head. Assuming that the disk is rotating at 10,000
revolutions per minute (RPM), what is the average time for the data to rotate
under the disk head? (a) 0.1 ms (b) 0.2 ms (c) 3 ms (d) 6 ms

Answer:
1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 1.10
(b) (d) (d) (c) (a) (c) (b) (d) (d) (c)

2. Performance Analysis
2.1 The following measurements have been made on two different computers: M1
and M2.
Program Time on M1 Time on M2
1 2.0 seconds 1.5 seconds
2 5.0 seconds 10.0 seconds

Program   Instructions executed on M1   Instructions executed on M2
1         5 × 10^9                      6 × 10^9

If the clock rates of M1 and M2 are 4 GHz and 6 GHz, respectively, find the
clock cycles per instruction (CPI) for program 1 on both computers.
Answer:
CPI for M1 = (2.0 × 4 × 10^9) / (5 × 10^9) = 1.6
CPI for M2 = (1.5 × 6 × 10^9) / (6 × 10^9) = 1.5

2.2 Assuming the CPI for program 2 on each computer in Problem 2.1 is the same
as the CPI for program 1, find the instruction count for program 2 running on
each computer.
Answer:
Instruction count = (execution time × clock rate) / CPI:
Instruction count for M1 = (5.0 × 4 × 10^9) / 1.6 = 12.5 × 10^9
Instruction count for M2 = (10.0 × 6 × 10^9) / 1.5 = 40 × 10^9


2.3 A compiler designer is trying to decide between two code sequences for a
particular computer. The hardware designers have supplied the following facts:
          CPI for this instruction class
          A   B   C
CPI       1   2   3
For a particular high-level-language statement, the compiler writer is
considering two code sequences that require the following instruction counts:
                 Instruction counts for instruction class
Code sequence    A   B   C
1                2   1   2
2                4   1   1
Which code sequence executes the most instructions? Which will be faster?
What is the CPI for each sequence?
Answer:
(1) Instruction count for code sequence 1 = 2 + 1 + 2 = 5
Instruction count for code sequence 2 = 4 + 1 + 1 = 6
Hence, code sequence 2 executes the most instructions.
(2) Clock cycles for code sequence 1 = 2 × 1 + 1 × 2 + 2 × 3 = 10
Clock cycles for code sequence 2 = 4 × 1 + 1 × 2 + 1 × 3 = 9
Hence, code sequence 2 is faster than code sequence 1.
(3) CPI for code sequence 1 = 10 / 5 = 2
CPI for code sequence 2 = 9 / 6 = 1.5
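The counts above can be tallied with a short script (a sketch; the class-CPI table and instruction counts mirror the problem statement):

```python
CLASS_CPI = {"A": 1, "B": 2, "C": 3}

def analyze(counts):
    """Return (instruction count, total cycles, CPI) for a code sequence."""
    instructions = sum(counts.values())
    cycles = sum(CLASS_CPI[cls] * n for cls, n in counts.items())
    return instructions, cycles, cycles / instructions

seq1 = analyze({"A": 2, "B": 1, "C": 2})
seq2 = analyze({"A": 4, "B": 1, "C": 1})
print(seq1)  # (5, 10, 2.0)
print(seq2)  # (6, 9, 1.5)
```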

2.4 You could speed up a Java program on a new computer by adding hardware
support for garbage collection. Garbage collection currently comprises 20% of
the cycles of the program. You have two possible changes to the machine. The
first would be to handle garbage collection automatically in hardware, which
increases the cycle time by a factor of 1.2. The second would be to add new
hardware instructions to the ISA for use during garbage collection; this would
halve the number of instructions needed for garbage collection but increase the
cycle time by a factor of 1.1. Which of these two options, if either, should you
choose? Why?
Answer:
Automatic garbage collection in hardware: the execution time of the new
machine is (1 − 0.2) × 1.2 = 0.96 times that of the original.
Special garbage collection instructions: the execution time of the new
machine is (1 − 0.2/2) × 1.1 = 0.99 times that of the original.
Therefore, the first option is the best choice.
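Both options follow the same pattern: remove a fraction of the cycles, then stretch the cycle time. A minimal sketch (names are our own):

```python
def relative_time(frac_removed, stretch):
    """New execution time as a fraction of the original, when a fraction of
    the cycles is eliminated and the cycle time grows by `stretch`."""
    return (1 - frac_removed) * stretch

hw_gc = relative_time(0.2, 1.2)        # hardware GC removes all 20% of GC cycles
new_isa = relative_time(0.2 / 2, 1.1)  # new instructions halve the GC work
print(round(hw_gc, 2), round(new_isa, 2))  # 0.96 0.99
```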


3. Datapath and Control
3.1 Consider the following machines, and compare their performance using the
following instruction frequencies: 25% Loads, 13% Stores, 47% R-type
instructions, and 15% Branch/Jump.
M1: The multicycle datapath shown in Fig. 1 with a 4 GHz clock.
M2: A machine like the multicycle datapath of Fig. 1, except that register updates
are done in the same clock cycle as a memory read or ALU operation. Thus
in Fig. 2 (which shows the complete finite state machine control of Fig. 1),
states 6 and 7 and states 3 and 4 are combined. This machine has a 3.2 GHz
clock, since the register update increases the length of the critical path.
M3: A machine like M2 except that effective address calculations are done in the
same clock cycle as a memory access. Thus states 2, 3, and 4 can be
combined, as can 2 and 5, as well as 6 and 7. This machine has a 2.8 GHz
clock because of the long cycle created by combining address calculation
and memory access.
Find the effective CPI and MIPS (million instructions per second) for all
machines.


Figure 1








Figure 2
Answer:
Instruction     Frequency   M1 CPI   M2 CPI   M3 CPI
Loads           25%         5        4        3
Stores          13%         4        4        3
R-type          47%         4        3        3
Branch/jump     15%         3        3        3
Effective CPI               4.1      3.38     3
MIPS                        976      946      933
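The effective CPI and MIPS figures can be recomputed from the table (a sketch; MIPS = clock rate in MHz / effective CPI, and the table's MIPS column mixes rounding and truncation):

```python
freq = {"load": 0.25, "store": 0.13, "rtype": 0.47, "branch": 0.15}
cpi = {
    "M1": {"load": 5, "store": 4, "rtype": 4, "branch": 3},
    "M2": {"load": 4, "store": 4, "rtype": 3, "branch": 3},
    "M3": {"load": 3, "store": 3, "rtype": 3, "branch": 3},
}
clock_mhz = {"M1": 4000, "M2": 3200, "M3": 2800}

for m in ("M1", "M2", "M3"):
    eff = sum(freq[k] * cpi[m][k] for k in freq)  # frequency-weighted CPI
    mips = clock_mhz[m] / eff
    print(m, round(eff, 2), round(mips, 1))
```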


3.2 Exception detection is an important aspect of execution handling. Try to identify
the cycle in which the following exceptions can be detected for the multicycle
datapath in Fig. 1. Consider the following exceptions:
a. Overflow exception
b. Invalid instruction
c. External interrupt
d. Invalid instruction memory address
e. Invalid data memory address
Answer:
a b c d e
Detection time cycle 4 cycle 2 any cycle cycle 1 cycle 3

4. Pipelining
4.1 Consider the following code segment in C:
A = B + E;
C = B + F;
Here is the generated MIPS code for this segment, assuming all variables are in
memory and are addressable as offsets from $t0:
lw $t1, 0($t0)
lw $t2, 4($t0)
add $t3, $t1, $t2
sw $t3, 12($t0)
lw $t4, 8($t0)
add $t5, $t1, $t4
sw $t5, 16($t0)
Find the hazards in the code segment and reorder the instructions to avoid any
pipeline stalls.
Answer:
Both add instructions have a hazard because of their respective dependence on
the immediately preceding lw instruction. Notice that bypassing eliminates
several other potential hazards including the dependence of the first add on the
first lw and any hazards for store instructions. Moving up the third lw
instruction eliminates both hazards:
lw $t1, 0($t0)
lw $t2, 4($t0)
lw $t4, 8($t0)
add $t3, $t1, $t2
sw $t3, 12($t0)
add $t5, $t1, $t4
sw $t5, 16($t0)

4.2 MIPS instructions classically take five steps (IF, ID, EX, MEM, WB) to execute
in the pipeline. To resolve control hazards, the decision about whether to branch
in the MIPS architecture is moved from the MEM stage to the ID stage. Explain the
advantages, the difficulties, and how to overcome the difficulties when moving
the branch execution to the ID stage.
Answer:
Moving the branch execution from MEM to ID means only one instruction needs to
be flushed on a taken branch. Moving the branch decision up requires two actions
to occur earlier:
(1) Compute the branch target: move the branch adder from the EX stage to the
ID stage.
(2) Test the two registers for equality: XOR their respective bits and NOR all
the results.

4.3 Compare the performance of single-cycle, multicycle, and pipelined control by
the average instruction time, using the following instruction frequencies (25%
loads, 10% stores, 11% branches, 2% jumps, and 52% ALU instructions) and
functional unit times (200 ps for memory access, 100 ps for ALU operation, and
50 ps for register file read or write). For the multicycle design, the number of
clock cycles for each instruction class is shown in Fig. 2. For the pipelined design,
loads take 1 clock cycle when there is no load-use dependence and 2 when there
is. Branches take 1 when predicted correctly and 2 when not. Jumps always pay 1
full clock cycle of delay, so their average time is 2 clock cycles. Other
instructions take 1 clock cycle. For pipelined execution, assume that half of the
load instructions are immediately followed by an instruction that uses the result
and that one-quarter of the branches are mispredicted. Ignore any other hazards.
Answer:
For the single-cycle machine:
CPI = 1
Clock cycle time = 200 + 50 + 100 + 200 + 50 = 600 ps
Average instruction time = 1 × 600 = 600 ps
For the multicycle machine:
CPI = 0.25 × 5 + 0.1 × 4 + 0.11 × 3 + 0.02 × 3 + 0.52 × 4 = 4.12
Clock cycle time = 200 ps
Average instruction time = 4.12 × 200 = 824 ps
For the pipelined machine:
Effective CPI = 1.5 × 0.25 + 1 × 0.1 + 1 × 0.52 + 1.25 × 0.11 + 2 × 0.02 = 1.17
Clock cycle time = 200 ps
Average instruction time = 1.17 × 200 = 234 ps
The relative performance of the three machines, by average instruction time, is
pipelined > single-cycle > multicycle.
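The three averages can be verified with a few lines (a sketch; the frequencies and latencies are those stated in the problem):

```python
# single-cycle: every instruction pays the worst-case datapath delay
single = 200 + 50 + 100 + 200 + 50            # 600 ps

# multicycle: frequency-weighted CPI at a 200 ps clock
multi_cpi = 0.25 * 5 + 0.1 * 4 + 0.11 * 3 + 0.02 * 3 + 0.52 * 4
multi = multi_cpi * 200

# pipelined: effective CPI from the stated stall assumptions, 200 ps clock
pipe_cpi = 0.25 * 1.5 + 0.1 * 1 + 0.11 * 1.25 + 0.02 * 2 + 0.52 * 1
pipe = pipe_cpi * 200

print(single, round(multi), round(pipe, 1))  # 600 824 234.5
```

(The text rounds the pipelined CPI of 1.1725 down to 1.17, hence 234 ps.)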

4.4 Suppose the memory access became 2 clock cycles long. Find the relative
performance of the single-cycle and multicycle designs by the average instruction
time, as described in Problem 4.3.
Answer:
For the single-cycle machine:
CPI = 1
Clock cycle time = 200 + 50 + 100 + 200 + 50 = 600 ps
Average instruction time = 1 × 600 = 600 ps
For the multicycle machine:
CPI = 0.25 × 7 + 0.1 × 6 + 0.11 × 4 + 0.02 × 4 + 0.52 × 5 = 5.47
Clock cycle time = 100 ps
Average instruction time = 5.47 × 100 = 547 ps

5. Memory Hierarchy
5.1 Suppose we have a processor with a base CPI of 1.0, assuming all references hit
in the primary cache, and a clock rate of 5 GHz. Assume a main memory access
time of 100 ns, including all the miss handling. Suppose the miss rate per
instruction at the primary cache is 2%. How much faster will the processor be if
we add a secondary cache that has a 5 ns access time for either a hit or a miss and
is large enough to reduce the miss rate to main memory to 0.6%?
Answer:
The miss penalty to main memory is 100 ns / 0.2 ns = 500 clock cycles.
For the processor with one level of caching, total CPI = 1.0 + 2% × 500 = 11.0.
The miss penalty for an access to the second-level cache is 5 ns / 0.2 ns = 25
clock cycles.
For the two-level cache, total CPI = 1.0 + 2% × 25 + 0.6% × 500 = 4.5.
Thus, the processor with the secondary cache is faster by 11.0 / 4.5 = 2.44 times.
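As a numeric check (a sketch; a 5 GHz clock gives a 0.2 ns cycle):

```python
cycle_ns = 1 / 5                     # 5 GHz clock -> 0.2 ns per cycle

main_penalty = 100 / cycle_ns        # main-memory miss penalty in cycles (500)
l2_penalty = 5 / cycle_ns            # second-level cache access in cycles (25)

cpi_one_level = 1.0 + 0.02 * main_penalty
cpi_two_level = 1.0 + 0.02 * l2_penalty + 0.006 * main_penalty
print(cpi_one_level, cpi_two_level, round(cpi_one_level / cpi_two_level, 2))
```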


5.2 Consider a memory hierarchy using one of the following three organizations for
main memory: (a) one-word-wide memory organization, (b) wide memory
organization, and (c) interleaved memory organization. Assume that the cache
block size is 16 words, that the width of organization (b) is four words, and that
the number of banks in organization (c) is four. If the main memory latency for a
new access is 10 memory bus clock cycles, the transfer time is 1 memory bus
clock cycle, and it takes 1 clock cycle to send the address to the main
memory, what are the miss penalties for each of these organizations?
Answer:
(a) One-word-wide memory organization:
the miss penalty is 1 + 16 × 10 + 16 × 1 = 177 clock cycles.
(b) Wide memory organization (four words wide):
the miss penalty is 1 + 4 × 10 + 4 × 1 = 45 clock cycles.
(c) Interleaved memory organization:
the miss penalty is 1 + 4 × 10 + 16 × 1 = 57 clock cycles.
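All three penalties share the form "address send + access cycles + transfer cycles", which a small helper makes explicit (a sketch; parameter names are our own):

```python
def miss_penalty(address_cycles, accesses, latency, transfers, transfer_time=1):
    """Cycles to fill one cache block: send address, wait for memory
    accesses, then move the words over the bus."""
    return address_cycles + accesses * latency + transfers * transfer_time

one_word_wide = miss_penalty(1, 16, 10, 16)   # 16 sequential one-word accesses
four_word_wide = miss_penalty(1, 4, 10, 4)    # 4 accesses, 4 four-word transfers
interleaved = miss_penalty(1, 4, 10, 16)      # 4 access rounds, 16 word transfers
print(one_word_wide, four_word_wide, interleaved)  # 177 45 57
```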


95

If some questions are unclear or not well defined to you, you can make your own
assumptions and state them clearly in the answer sheet.
1. Short Questions:
Answer and explain the following questions.
(1) Given the bit pattern: 1000 1110 1110 1111 0100 0000 0000 0000
What does it represent, assuming that it is
(a) a two's complement integer?
(b) a single-precision floating-point number?
(2) Explain what "exception" is and how exceptions are handled.
(3) In the simplest implementation for MIPS instruction set, every instruction
begins execution on one clock edge and completes execution on the next
clock edge. Please explain disadvantages of the single-cycle
implementation.
Answer:
(1) As a two's complement integer:
−2^31 + 2^27 + 2^26 + 2^25 + 2^23 + 2^22 + 2^21 + 2^19 + 2^18 + 2^17 + 2^16 + 2^14
As a single-precision floating-point number:
−1.110111101₂ × 2^(29 − 127) = −1.110111101₂ × 2^(−98)
(2) Exception: an unscheduled event (from within the CPU) that disrupts
program execution.
Handling: save the address of the offending instruction in the EPC and
transfer control to the operating system at some specified address.
(3) Disadvantages: a single-cycle machine is inefficient in both performance
and hardware cost, because:
1. The clock cycle of a single-cycle machine equals the worst-case delay
over all instructions, so the penalty for using a fixed clock cycle is
significant.
2. Each functional unit can be used only once per clock; therefore, some
functional units must be duplicated, raising the cost of the hardware.

2. Performance Analysis:
(1) Consider the machine with three instruction classes X, Y, and Z. The
corresponding CPIs for these instruction classes are 1, 2, and 3,
respectively. Now suppose we measure the code for the same program
from two different compilers and obtain the following data:
Code from Instruction counts (in billions) for each instruction class
X Y Z
Compiler 1 5 1 1
Compiler 2 10 1 1
Assume that the machine's clock rate is 500 MHz. Which code sequence will
execute faster according to MIPS and according to execution time?
(2) The table below shows the number of floating-point operations executed in
two different programs and the runtime for those programs on three different
machines:

Program     Floating-point operations   Execution time in seconds
                                        Machine A   Machine B   Machine C
Program 1   100,000,000                 1000        100         40
Program 2   1,000,000                   1           10          40
Which machine is fastest according to total execution time?
(3) Assume that equal amounts of time will be spent running each program on
some machine. Which machine is fastest using the data of Table 2 and
assuming a weighting that generates equal execution time for each benchmark
on machine A? Which machine is fastest if we assume a weighting that
generates equal execution time for each benchmark on machine B?
(4) There are two possible improvements: either make multiply instruction run
four times faster than before, or make memory access instructions run two
times faster than before. You repeatedly run a program that takes 100 seconds
to execute. Of this time, 20% is used for multiplication, 40% for memory
access instructions, and 40% for other tasks. What will the speedup be if you
improve only memory access? What will the speedup be if both
improvements are made?
Answer:
(1) Execution time for compiler 1 = (5 × 1 + 1 × 2 + 1 × 3) × 10^9 / (500 × 10^6) = 20 s
Execution time for compiler 2 = (10 × 1 + 1 × 2 + 1 × 3) × 10^9 / (500 × 10^6) = 30 s
MIPS for compiler 1 = (5 + 1 + 1) × 10^9 / (20 × 10^6) = 350
MIPS for compiler 2 = (10 + 1 + 1) × 10^9 / (30 × 10^6) = 400
According to execution time, the code sequence from compiler 1 is faster.
According to MIPS, the code sequence from compiler 2 is faster.
(2) Execution time for machine A = 1000 + 1 = 1001 s
Execution time for machine B = 100 + 10 = 110 s
Execution time for machine C = 40 + 40 = 80 s
Hence, machine C is fastest.
(3) Weights that give equal execution time on machine A:

Program       Weight    Machine A   Machine B   Machine C
Program 1     1/1000    1000        100         40
Program 2     1         1           10          40
Weighted AM             2           10.1        40

Weighted AM for machine A = (0.001/1.001) × 1000 + (1/1.001) × 1 ≈ 2
Weighted AM for machine B = (0.001/1.001) × 100 + (1/1.001) × 10 ≈ 10.1
Weighted AM for machine C = (0.001/1.001) × 40 + (1/1.001) × 40 = 40
Hence, machine A is the fastest.

Weights that give equal execution time on machine B:

Program       Weight    Machine A   Machine B   Machine C
Program 1     1/10      1000        100         40
Program 2     1         1           10          40
Weighted AM             91.8        18.2        40

Weighted AM for machine A = (0.1/1.1) × 1000 + (1/1.1) × 1 ≈ 91.8
Weighted AM for machine B = (0.1/1.1) × 100 + (1/1.1) × 10 ≈ 18.2
Weighted AM for machine C = (0.1/1.1) × 40 + (1/1.1) × 40 = 40
Hence, machine B is the fastest.
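The weighted means can be recomputed directly (a sketch; the function normalizes the weights, so [1/1000, 1] and [0.001/1.001, 1/1.001] give the same result):

```python
def weighted_am(times, weights):
    """Weighted arithmetic mean of per-program execution times."""
    total = sum(weights)
    return sum(w * t for w, t in zip(weights, times)) / total

times = {"A": [1000, 1], "B": [100, 10], "C": [40, 40]}

for label, w in (("equal time on A", [1 / 1000, 1]),
                 ("equal time on B", [1 / 10, 1])):
    print(label, [round(weighted_am(times[m], w), 1) for m in "ABC"])
```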
(4) Speedup by improving memory access = 1 / (0.4/2 + 0.6) = 1.25
Speedup by improving both = 1 / (0.2/4 + 0.4/2 + 0.4) ≈ 1.54
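Both speedups are instances of Amdahl's law with one or more enhanced fractions (a sketch; names are our own):

```python
def speedup(enhancements):
    """Amdahl's law for simultaneous enhancements.
    `enhancements` is a list of (fraction of original time, improvement factor);
    time outside all listed fractions is unaffected."""
    improved = sum(f for f, _ in enhancements)
    new_time = (1 - improved) + sum(f / s for f, s in enhancements)
    return 1 / new_time

memory_only = speedup([(0.4, 2)])          # 1 / (0.6 + 0.2)
both = speedup([(0.2, 4), (0.4, 2)])       # 1 / (0.4 + 0.05 + 0.2)
print(round(memory_only, 2), round(both, 2))  # 1.25 1.54
```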


3. Instruction Set:
(1) What is the "addressing mode"? Please explain "displacement addressing" and
"PC-relative addressing".
(2) Memory-memory and load-store are two architectural styles of instruction
sets. We can calculate the instruction bytes fetched and the memory data
bytes transferred using the following assumptions about the two instruction
sets:
The opcode is always 1 byte (8 bits)
All memory addresses are 2 bytes (16 bits)
All data operands are 4 bytes (32 bits)
All instructions are an integral number of bytes in length
There are no optimizations to reduce memory traffic
For the following C code, write an equivalent assembly language program
in each architecture style (assume all variables are initially in memory):
a = b + c;
b = a + c;
d = a - b;
For each code sequence, calculate the instruction bytes fetched and the
memory data bytes transferred (read or written). Which architecture is most
efficient as measured by code size? Which architecture is most efficient as
measured by total memory bandwidth required (code + data)?

Answer:
(1) Multiple forms of addressing are generically called
addressing modes.
Displacement addressing: the operand is at the memory location whose
address is the sum of a register and a constant in the instruction
PC-relative addressing: the address is the sum of the PC and a constant in the
instruction
(2) Memory-memory:

Instruction     Code bytes           Data bytes
add a, b, c     1 + 2 + 2 + 2 = 7    4 + 4 + 4 = 12
add b, a, c     1 + 2 + 2 + 2 = 7    4 + 4 + 4 = 12
sub d, a, b     1 + 2 + 2 + 2 = 7    4 + 4 + 4 = 12
Total           21                   36

Total memory bandwidth = 21 + 36 = 57 bytes

Load-store (assuming 16 registers in the CPU, so a register specifier takes
4 bits = 4/8 byte):

Instruction       Code bytes                 Data bytes
load $1, b        1 + 4/8 + 2 = 3.5 → 4      4
load $2, c        1 + 4/8 + 2 = 3.5 → 4      4
add $3, $1, $2    1 + 3 × (4/8) = 2.5 → 3    0
add $4, $3, $2    1 + 3 × (4/8) = 2.5 → 3    0
sub $5, $3, $4    1 + 3 × (4/8) = 2.5 → 3    0
store $3, a       1 + 4/8 + 2 = 3.5 → 4      4
store $4, b       1 + 4/8 + 2 = 3.5 → 4      4
store $5, d       1 + 4/8 + 2 = 3.5 → 4      4
Total             29                         20

Total memory bandwidth = 29 + 20 = 49 bytes

According to code size, the memory-memory architecture is more efficient.
According to total memory bandwidth, the load-store architecture is more
efficient.
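The byte totals follow mechanically from the encoding assumptions (a sketch; 16 registers imply 4-bit, i.e. half-byte, register specifiers, and instruction lengths round up to whole bytes):

```python
import math

OPCODE, ADDR, OPERAND, REG = 1, 2, 4, 0.5  # sizes in bytes

def instr_bytes(n_addrs, n_regs):
    """Whole-byte length of an instruction with the given address/register fields."""
    return math.ceil(OPCODE + n_addrs * ADDR + n_regs * REG)

mem_mem_code = 3 * instr_bytes(3, 0)   # three 3-address instructions
mem_mem_data = 3 * 3 * OPERAND         # each reads two operands and writes one

load_store_code = 5 * instr_bytes(1, 1) + 3 * instr_bytes(0, 3)
load_store_data = 5 * OPERAND          # 2 loads + 3 stores move data

print(mem_mem_code + mem_mem_data, load_store_code + load_store_data)  # 57 49
```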

4. Pipelining
(1) MIPS instructions classically take five steps to execute in pipeline. Please
explain the detailed operations of the five-stage pipeline used in MIPS
instructions.
(2) Explain the three different hazards: data hazards, control hazards, and
structural hazards. Describe the schemes for resolving these hazards.
(3) For each pipeline register in Fig. 1, label each portion of the pipeline register
with the name of the value that is loaded into the register. Determine the
length of each field in bits. For example, the IF/ID pipeline register contains
two fields, one of which is an instruction field that is 32 bits wide.





















Fig. 1
Answer:
(1)
1. IF: Instruction fetch
2. ID: Instruction Decode and register file read
3. EX: Execution or address calculation
4. MEM: Data memory access
5. WB: Write back
(2) Structural hazards: the hardware cannot support the combination of
instructions that we want to execute in the same clock cycle.
Data hazards: an instruction attempts to use an item before it is ready.
Control hazards: an instruction attempts to make a decision before the
condition has been evaluated.

Type                Solutions
Structural hazard   Duplicate the hardware resource (e.g., use two memories,
                    one for instructions and one for data).
Data hazard         Software: the compiler inserts no-operation (nop)
                    instructions or reorders the code to avoid the hazard.
                    Hardware: forwarding resolves most data hazards; a
                    load-use data hazard additionally requires a one-cycle
                    stall before forwarding.
Control hazard      Software: the compiler inserts no-operation (nop)
                    instructions.
                    Hardware: stall or flush the pipeline, or use (static or
                    dynamic) branch prediction to keep the pipeline full.


(3)
IF/ID: PC + 4 (32 bits); instruction (32 bits). Total = 64 bits.
ID/EX: PC + 4 (32 bits); register data 1 (32 bits); register data 2 (32 bits);
sign-extension unit output (32 bits); register no. Rt (5 bits); register no.
Rd (5 bits). Total = 138 bits.
EX/MEM: branch target address (32 bits); Zero indicator (1 bit); ALU result
(32 bits); register data 2 (32 bits); destination register no. (5 bits).
Total = 102 bits.
MEM/WB: memory data (32 bits); ALU result (32 bits); destination register no.
(5 bits). Total = 69 bits.


5. Memory Hierarchy:
(1) Please explain what memory hierarchy is and why it is necessary.
(2) Cache misses can be sorted into three simple categories: Compulsory,
Capacity, and Conflict. Please explain why they occur and how to reduce
them.
(3) Consider three machines with different cache configurations:
Cache 1: Direct-mapped with one-word blocks
Cache 2: Direct-mapped with four-word blocks
Cache 3: Two-way set associative with four-word blocks
The following miss rate measurements have been made:
Cache 1: Instruction miss rate is 4%; data miss rate is 8%
Cache 2: Instruction miss rate is 2%; data miss rate is 6%
Cache 3: Instruction miss rate is 2%; data miss rate is 4%
For these machines, one-half of the instructions contain a data reference.
Assume that the cache miss penalty is 6 + Block size in words. The CPI for
this workload was measured on a machine with cache 1 and was found to be

2.0. Determine which machine spends the most cycles on cache misses.
(4) The cycle times for the machines in Problem 5.3 are 2 ns for the first and
second machines and 2.5 ns for the third machine. Determine which machine
is the fastest and which is the slowest.
Answer:
(1) Memory hierarchy: A structure that uses multiple levels of memories; as the
distance from the CPU increases, the size of the memories and the access time
both increase.
The reasons for using Memory hierarchy:
C To take the advantage of the principle of locality
C To provide the user with as much memory as is available in the cheapest
technology, while providing access at the speed offered by the fastest
memory.
(2)
Miss type    Why it occurs                       How to reduce it
Compulsory   First access to a block             Increase the block size
Capacity     The cache cannot contain all the    Increase the cache size
             blocks accessed by the program
Conflict     Multiple memory locations map to    Increase the associativity
             the same cache location

(3) C1 spends the most cycles on cache misses.

Cache   Miss penalty   I-cache miss cycles   D-cache miss cycles   Total miss cycles
C1      6 + 1 = 7      4% × 7 = 0.28         8% × 7 = 0.56         0.28 + 0.56/2 = 0.56
C2      6 + 4 = 10     2% × 10 = 0.2         6% × 10 = 0.6         0.2 + 0.6/2 = 0.5
C3      6 + 4 = 10     2% × 10 = 0.2         4% × 10 = 0.4         0.2 + 0.4/2 = 0.4
(4)
We need the base CPI that applies to all three processors. Since we are given
CPI = 2 for C1, CPI_base = CPI − CPI_misses = 2 − 0.56 = 1.44
Execution time for C1 = 2 × 2 ns × IC = 4 × 10^−9 × IC
Execution time for C2 = (1.44 + 0.5) × 2 ns × IC = 3.88 × 10^−9 × IC
Execution time for C3 = (1.44 + 0.4) × 2.5 ns × IC = 4.6 × 10^−9 × IC
Therefore C2 is fastest and C3 is slowest.
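The three execution times can be checked numerically (a sketch; the instruction count cancels out of the comparison, so it is taken as 1):

```python
base_cpi = 2.0 - 0.56                 # measured CPI on C1 minus its miss cycles

configs = {
    "C1": (0.56, 2.0),   # (miss cycles per instruction, cycle time in ns)
    "C2": (0.50, 2.0),
    "C3": (0.40, 2.5),
}
for name, (miss_cycles, cycle_ns) in configs.items():
    exec_ns = (base_cpi + miss_cycles) * cycle_ns   # time per instruction
    print(name, round(exec_ns, 2))
```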







94

1. Short Questions: Answer and explain the following questions. Credit will be
given only if explanation is provided.
(1) Explain the problems with using MIPS (million instructions per second) as a
measure for comparing machines.
(2) There are two possible improvements to enhance a machine: either make
multiply instructions run four times faster than before, or make memory
access instructions run two times faster than before. You repeatedly run a
program that takes 100 seconds to execute. Of this time, 10% is used for
multiplication, 50% for memory access instructions, and 40% for other tasks.
What will the speedup be if you improve only memory access? What will the
speedup be if both improvements are made?
(3) Explain and compare microprogrammed control and hardwired control.
(4) There are two basic options when writing to the cache: write through and
write back. Please explain write through and write back, and describe their
advantages.
(5) Explain the differences among superpipelining, superscalar and dynamic
pipeline scheduling.
Answer:
(1)
1. MIPS specifies the instruction execution rate but does not take into
account the capabilities of the instructions (we cannot compare
computers with different instruction sets using MIPS, since the
instruction counts will certainly differ).
2. MIPS varies between programs on the same computer.
3. MIPS can vary inversely with performance.
(2)
1. Speedup (improve memory access) = 100 / (10 + 50/2 + 40) = 100 / 75 = 1.33
2. Speedup (improve both) = 100 / (10/4 + 50/2 + 40) = 100 / 67.5 = 1.48

(3)
Microprogrammed control: a method of specifying control that uses microcode
rather than a finite-state representation.
Advantages: (1) flexibility: changes can be made late in the design cycle;
(2) generality: multiple instruction sets can be implemented on the same
machine; (3) more powerful instruction sets can be implemented.
Disadvantages: (1) hard to pipeline; (2) costly to implement.
Hardwired control: an implementation of finite-state machine control,
typically using programmable logic arrays (PLAs) or collections of PLAs and
random logic.
Advantages: (1) easy to pipeline; (2) less costly to implement.
Disadvantages: (1) hard to design; (2) lack of flexibility (hard to change);
(3) lack of generality (an instruction set only for one machine).

(4)
Write-through: a scheme in which writes always update both the cache and the
memory, ensuring that data is always consistent between the two.
Advantage: misses are simpler and cheaper to handle, because the memory
always holds the current copy of every block.
Write-back: a scheme that handles writes by updating values only in the block
in the cache, then writing the modified block back to memory when the block
is replaced.
Advantage: the CPU can write at cache speed, and multiple writes to the same
block require only one final write to memory.

(5) Superpipelining: An advanced pipelining technique that increases the depth of
the pipeline to overlap more instructions
Superscalar: An advanced pipelining technique that enables the processor to
execute more than one instruction per clock cycle
Dynamic pipeline scheduling: Hardware support for reordering the order of
instruction execution so as to avoid stalls

2. Performance Analysis:
(1) The PowerPC, made by IBM and Motorola and used in the Apple Macintosh,
shares many similarities with MIPS. The primary difference is two more
addressing modes (indexed addressing and update addressing) plus a few
operations. Please explain the two addressing modes provided by PowerPC.
(2) Consider an architecture that is similar to MIPS except that it supports update
addressing for data transfer instructions. If we run gcc using this architecture,
some percentage of the data transfer instructions shown in Fig. 1 will be able
to make use of the new instructions, and for each instruction changed, one
arithmetic instruction can be eliminated. If 20% of the data transfer
instructions can be changed, which will be faster for gcc, the modified MIPS
architecture or the unmodified architecture? How much faster? (Assume that
the modified architecture has its cycle time increasing by 10% in order to
accommodate the new instructions.)
(3) When designing memory systems, it becomes useful to know the frequency of
memory reads versus writes as well as the frequency of data accesses versus
instruction accesses. Using the instruction-mix information for MIPS for the
program gcc in Fig. 1, find the following:
(a) The percentage of all memory accesses that are for data (vs. instructions).
(b) The percentage of all memory accesses that are writes (vs. reads).
Answer:
(1) Indexed addressing:
lw $t1, $a0 + $s3 #$t1 = Memory[$a0 + $s3]
Update addressing (update a register as part of load):
lwu $t0,4($s3) #$t0 = Memory[$s3 + 4]; $s3 = $s3 + 4
(2) Execution_unmodified = (1 × 0.48 + 1.4 × 0.35 + 1.7 × 0.15 + 1.2 × 0.02) × IC × T
= 1.249 × IC × T
Changing 20% of the data transfer instructions eliminates one arithmetic
instruction each, so the new instruction count is
(0.48 − 0.2 × 0.35 + 0.35 + 0.15 + 0.02) × IC = 0.93 × IC, and
Execution_modified = (1 × (0.48 − 0.2 × 0.35) + 1.4 × 0.35 + 1.7 × 0.15 + 1.2 × 0.02)
× IC × 1.1T = 1.179 × 1.1 × IC × T = 1.297 × IC × T
The unmodified architecture is faster than the modified architecture by
1.297 / 1.249 = 1.038 times.
(3) (a) 0.35 / 1.35 = 26%
(b) Assuming half of the data transfer instructions are stores (writes):
0.35 × 0.5 / 1.35 = 13%
(Note: the data transfer class includes loads and stores; lui does not
access memory.)

3. Computer Arithmetic:
(1) Suppose you wanted to add four numbers (A, B, E, F) using 1-bit full adders.
There are two approaches to compute the sum as shown in Fig. 2(a) and 2(b):
cascaded of traditional ripple carry adders and carry save adders. If A, B, E, F
are 4-bit numbers, draw the detailed architecture (consists of 1-bit full adders)
of the carry save addition shown in Fig. 2(b).
(2) Assume that the time delay through each 1-bit full adder is 2T. Calculate and
compare the times of adding four 8-bit numbers using the two different
approaches.
(3) Try Booth's algorithm for the signed multiplication of two numbers:
2 × (−3) = −6 in decimal, i.e. 0010two × 1101two = 1111 1010two. Explain the
operations step by step.
Answer:

(1) (The answer is a diagram: a 4-bit carry-save adder array built from 1-bit
full adders, shown beside the cascaded 4-bit ripple-carry adders.)

(2) (a) Carry-save adders: the critical path passes through 2 full-adder
levels before entering the final 8-bit ripple chain of 8 full adders, so the
delay is (2 + 8) × 2T = 20T.
(b) Cascaded ripple-carry adders: the critical path passes through 2
full-adder levels before a chain of 9 full adders, so the delay is
(2 + 9) × 2T = 22T.
(3)
Iteration   Step                        Multiplicand   Product
0           Initial values              0010           0000 1101 0
1           10 → Prod = Prod − Mcand    0010           1110 1101 0
            Shift right product         0010           1111 0110 1
2           01 → Prod = Prod + Mcand    0010           0001 0110 1
            Shift right product         0010           0000 1011 0
3           10 → Prod = Prod − Mcand    0010           1110 1011 0
            Shift right product         0010           1111 0101 1
4           11 → No operation           0010           1111 0101 1
            Shift right product         0010           1111 1010 1
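The table can be reproduced by a direct implementation of Booth's algorithm (a sketch; as in the table, the product register keeps one extra bit on the right and is shifted right arithmetically each iteration):

```python
def booth_multiply(mcand, mplier, bits=4):
    """Booth's algorithm on `bits`-bit two's-complement operands.
    Returns the 2*bits-bit product as a binary string."""
    mask = (1 << bits) - 1
    width = 2 * bits + 1                 # product register plus the extra low bit
    prod = (mplier & mask) << 1          # multiplier in the middle, extra bit = 0
    for _ in range(bits):
        pair = prod & 0b11               # current bit and the bit to its right
        if pair == 0b01:                 # 01: add multiplicand to the upper half
            prod += (mcand & mask) << (bits + 1)
        elif pair == 0b10:               # 10: subtract multiplicand from the upper half
            prod -= (mcand & mask) << (bits + 1)
        prod &= (1 << width) - 1
        sign = prod >> (width - 1)       # arithmetic shift right by one bit
        prod = (prod >> 1) | (sign << (width - 1))
    prod >>= 1                           # drop the extra bit
    return format(prod, f"0{2 * bits}b")

print(booth_multiply(0b0010, 0b1101))    # 11111010, i.e. -6 in 8-bit two's complement
```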

4. Pipelining:
(a) MIPS instructions classically take five steps to execute in pipeline. Please
explain the detailed operations of the five-stage pipeline used in MIPS
instructions.
(b) Fig. 3 shows a pipelined datapath of the MIPS processor. Please explain the
function of the hazard detection unit and the forwarding unit in Fig. 3 and how
they resolve data hazards.
(c) Dynamic branch prediction is usually used to resolve control hazards.
Consider a loop branch that branches nine times in a row, then is not taken
once. What is the prediction accuracy for this branch when applying 1-bit and
2-bit prediction schemes respectively? (1-bit predictor updates the prediction
bit on a mispredict, a prediction in 2-bit predictor must miss twice before it is
changed)

Answer:
(a) 1. IF: Instruction fetch
2. ID: Instruction Decode and register file read
3. EX: Execution or address calculation
4. MEM: Data memory access
5. WB: Write back
(b) The forwarding unit resolves data hazards against the instruction in the
EX stage: if EX/MEM.RegisterRd = ID/EX.RegisterRs, the first ALU-input
multiplexor is set to 10 so the value in the EX/MEM pipeline register is
forwarded to the ALU; if EX/MEM.RegisterRd = ID/EX.RegisterRt, the second
ALU-input multiplexor is set to 10.
For hazards against the instruction in the MEM stage: if MEM/WB.RegisterRd =
ID/EX.RegisterRs, the first multiplexor is set to 01 so the value in the
MEM/WB pipeline register is forwarded to the ALU; if MEM/WB.RegisterRd =
ID/EX.RegisterRt, the second multiplexor is set to 01.
The hazard detection unit detects whether a load-use data hazard exists
between two consecutive instructions; if so, it stalls the pipeline for one
clock cycle, since forwarding alone cannot resolve a load-use hazard.
(c) 1-bit prediction: 80%
2-bit prediction: 90%

5. Cache:
(1) How does the control unit deal with cache misses? Please describe the steps to
be taken on an instruction cache miss as clear as possible.
(2) Please explain the function of three portions (Tag, Index, and Block offset) in
the address of Fig. 4. How many total bits are required for the cache?
(3) Assume an instruction cache miss rate for gcc of 2% and a data cache miss
rate of 4%. If a machine has a CPI of 2 without any memory stalls and the
miss penalty is 50 cycles for all misses, determine how much faster a machine
would run with a perfect cache that never missed. Use the instruction
frequencies for gcc from Fig. 1.
(4) Suppose we increase the performance of the machine in the previous question
by doubling its clock rate. Since the main memory speed is unlikely to change,
assume that absolute time to handle a cache miss does not change. Assuming
the same miss rate as the previous question, how much faster will the machine
be with the faster clock?
Answer:

(1) The control unit deals with cache misses as follows:
1. Send the original PC value (the current PC − 4) to the memory.
2. Instruct main memory to perform a read and wait for the memory to
complete its access
3. Write the cache entry, putting the data from memory in the data portion of
the entry, writing the upper bits of the address into the tag field, and
turning the valid bit on
4. Restart the instruction execution at the first step, which will refetch the
instruction, this time finding it in the cache.
(2) Tag: contains the address information required to identify whether the
associated block in the hierarchy corresponds to a requested word.
Index: is used to select the block.
Block offset: specify a word within a block.
Total bits = 2^12 × (1 + 16 + 4 × 32) = 580 Kbits
(3) CPI with the non-perfect cache = 2 + 0.02 × 50 + 0.04 × 0.35 × 50 = 3.7
Hence, the machine with a perfect cache is faster than the one with the
non-perfect cache by 3.7 / 2 = 1.85 times.
(4) The miss penalty becomes 100 clock cycles.
New CPI = 2 + 0.02 × 100 + 0.04 × 0.35 × 100 = 5.4
Hence the machine with the faster clock is 3.7 / (5.4/2) = 1.37 times faster.
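The two CPIs and the resulting speedup can be verified with a small helper (a sketch; parameter names are our own):

```python
def cpi_with_stalls(base_cpi, i_miss_rate, d_miss_rate, data_ref_freq, penalty):
    """CPI including instruction-cache and data-cache miss stall cycles."""
    return base_cpi + i_miss_rate * penalty + d_miss_rate * data_ref_freq * penalty

slow_clock = cpi_with_stalls(2, 0.02, 0.04, 0.35, 50)
fast_clock = cpi_with_stalls(2, 0.02, 0.04, 0.35, 100)  # doubled clock doubles the penalty
speedup = slow_clock / (fast_clock / 2)                 # fast machine's cycle is half as long
print(round(slow_clock, 1), round(fast_clock, 1), round(speedup, 2))  # 3.7 5.4 1.37
```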


Instruction class    MIPS examples         Average CPI   Frequency (gcc)   Frequency (spice)
Arithmetic           add, sub, addi        1.0           48%               50%
Data transfer        lw, sw, lb, sb, lui   1.4           35%               41%
Conditional branch   beq, bne, slt, slti   1.7           15%               7%
Jump                 j, jr, jal            1.2           2%                2%
Fig. 1











Fig. 2(a) Fig. 2(b)



















Fig. 3


















Fig. 4
(Figure label text omitted. Fig. 3 shows the pipelined datapath with IF/ID,
ID/EX, EX/MEM, and MEM/WB pipeline registers plus the hazard detection and
forwarding units; Fig. 4 shows a direct-mapped cache with 4K entries, a 16-bit
tag, a 12-bit index, a 2-bit block offset, and 128-bit four-word data blocks.)

93

If some questions are unclear or not well defined to you, you can make your own
assumptions and state them clearly in the answer sheet.
I. Multiple choice: please choose the most appropriate answer (one only) to
each of the following questions.
1. Which of the following metrics will not benefit from the forwarding
technique? (a) program instruction count (b) execution time (c) data hazard
stalls (d) effective CPI
2. Considering a direct-mapped cache with 64 blocks and a block size of 16
bytes, what block number does memory address (256)₁₀ map to? (a) 4 (b) 16
(c) 64 (d) 256
3. Which of the following styles of instruction sets is most likely to achieve
the smallest instruction count while compiling a program? (a) accumulator (b)
load-store (c) stack (d) memory-memory
4. How many bits of ROM are required to implement a Moore finite-state machine
with 8 states, 2 inputs, and 3 output control signals? (a) 48 (b) 60 (c) 96 (d) 192
5. To get a speed up of 5 from 20 processors, which number is closest to the
minimum percentage of the original program that has to be sequential? (a) 5%
(b) 10% (c) 16% (d) 80%
6. What is the main advantage of adding some complex instructions to an
existing instruction set? (a) reduced CPI (b) less instruction count (c) faster
clock cycle (d) increased MIPS.
7. Which of the following RISC addressing modes involves a memory
operation? (a) Base addressing (b) Register addressing (c) PC-relative
addressing (d) Immediate addressing
8. Which of the following floating-point numbers, represented in the IEEE 754
standard (1 bit for sign, 8 bits for exponent), is the largest? (a) 0 11111111
10000000000000000000000 (b) 0 01000000 10000000000000000000000 (c) 1
11000000 10000000000000000000000 (d) 0 10000000
00000000000000000000000
9. Which of the following statements about computer arithmetic is correct? (a)
The basic Booth's algorithm can always improve the multiplication speed (b) The
addition of two floating-point numbers won't overflow (c) The subtraction of
two two's complement numbers won't overflow (d) The floating-point addition
is not associative.
10. What's not an advantage of adding a second-level cache? (a) reduced
miss penalty (b) reduced miss rate of the first-level cache (c) reduced effective
CPI (d) reduced program execution time
11. Compiler techniques can help improve many metrics except (a) average
CPI (b) clock frequency (c) miss rate (d) control hazard
12. Which of the following features is not typical of a RISC machine? (a)
powerful instruction set (b) small CPI (c) limited addressing modes (d) simple
hardware architecture
13. Which of the following statements about the use of a larger block size is correct?
(a) It can always reduce the miss rate. (b) It can reduce the miss rate because of
the temporal locality of the program. (c) It can reduce compulsory misses. (d)
It can reduce the miss penalty.
14. By increasing the pipeline depth of the machine, (a) the execution time can
always be reduced (b) the chance of hazards becomes smaller (c) the average
CPI may increase (d) None of the above is correct.
15. Which of the following statements about the single-cycle, multicycle, and
pipelined implementations of a MIPS machine is correct? (a) The single-cycle
machine requires the least amount of hardware (b) The pipelined machine has the
smallest effective CPI (c) The multicycle machine requires the least amount of
hardware (d) The clock frequency of the multicycle machine is the slowest.
Answer:
1 (a)  2 (b)  3 (d)  4 (d)  5 (c)  6 (b)  7 (a)  8 (d)  9 (d)  10 (b)  11 (b)  12 (a)  13 (c)  14 (c)  15 (c)

(4) 8 states need 3 state bits; with 2 inputs the ROM address has 2 + 3 = 5 bits, and
each word holds 3 control signals plus 3 next-state bits, so
ROM size = 2^(2+3) × (3 + 3) = 192 bits.
(5) Let x be the sequential fraction. Then 5 = 1 / (x + (1 − x)/20), which gives
x = 0.1578, i.e. about 16%.
(8) (a) is not a number (an all-ones exponent with a nonzero fraction is NaN).
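The equation in (5) is easy to check numerically; a quick sketch of Amdahl's law with 20 processors:

```python
def speedup(seq_frac, n_proc=20):
    # Amdahl's law: the sequential part runs serially, the rest on n_proc processors
    return 1.0 / (seq_frac + (1.0 - seq_frac) / n_proc)

x = 3 / 19  # exact solution of 5 = 1/(x + (1 - x)/20), about 15.8%
print(round(x, 4), round(speedup(x), 6))  # 0.1579 5.0
```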

II. Computer Datapath
Fig. 1 shows the complete datapath of the multi-cycle implementation of MIPS
processor.
1. Describe the function of the following segment of the program. ($zero is one
MIPS register and always 0. Register $a1 equals 10 initially.)
add $t0, $zero, $zero
loop1: add $t1, $t0, $t0
add $t1, $t1, $t1
add $t2, $a0, $t1
sw $zero, 0($t2)
addi $t0, $t0, 1
slt $t3, $t0, $a1
bne $t3, $zero, loop1
2. Evaluate the CPI of the individual instruction in the above program, and also
calculate the overall CPI for running the entire program.
3. Let's suppose the overflow of an ALU operation will lead to an exception. When
this exception occurs, the machine will jump to a special interrupt service routine
at 0x000C, and the old PC will be written to a specific register $k1.
Discuss how to modify the datapath to support this exception's behavior.
4. If we pipeline the datapath shown in Fig. 1 to achieve a five-stage pipelined
MIPS, some pipelining hazard may occur while running the above program.
Find out what hazards will occur and how many cycles the machine will stall
related to the execution of the instruction bne $t3, $zero, loop1. (Assume no
data forwarding mechanism is used, and the register read/write can happen in
the same cycle.)
5. One way to cope with the hazard is to use a branch delay slot. Explain this
concept and illustrate how to apply it to the program shown as above.
Answer:
1. Clear 10 elements in an array whose starting address is contained in register
$a0.
2.
Instruction: add  sw  addi  slt  bne  | Overall
CPI:         4    4   4     4    3    | 3.86
(Instruction count = 1 + 10 × 7 = 71; total CPU cycles = 4 + 10 × (4 + 4 + 4
+ 4 + 4 + 4 + 3) = 274. Overall CPI = 274/71 = 3.86)
3. [Figure: the datapath is extended with a path that writes the old PC into register
$k1 and an extra PC mux input that loads 0x000C when the ALU overflow signal is
asserted.]
4. (a) Data hazard, control hazard, and structural hazard.
(b) The data hazard between slt and bne stalls the pipeline for 2 clock cycles;
the bne itself causes a further 1-clock-cycle stall due to the control hazard.
5. The delayed branch allows one or more instructions following the branch to
be executed in the pipeline whether the branch is taken or not. The instruction
following a branch or jump is called the delay slot. Compilers and assemblers
try to place a safe instruction, which is not affected by the branch, in the
branch delay slot.
We can move the sw instruction into the delay slot behind the bne instruction, as shown
below.
add $t0, $zero, $zero
loop1: add $t1, $t0, $t0
add $t1, $t1, $t1
add $t2, $a0, $t1
addi $t0, $t0, 1
slt $t3, $t0, $a1
bne $t3, $zero, loop1
sw $zero, 0($t2)

III. Cache
Suppose the gcc program is run on some 200 MHz RISC machine with a two-way
set-associative unified cache of 64 KB and four-word blocks. The miss rates of
instruction and data memory accesses are 2% and 6%, respectively. The CPI of the
machine is 2 when there is no memory stall. The miss penalty is 40 cycles for all
misses. The instruction mix of the gcc program is 22% load, 8% store, 50%
R-type, 18% branch and 2% jump instructions. The instruction length of this
processor is 32 bits.
1. Draw the cache organization and calculate the overall size of the cache
including the tag and the valid bit.
2. Calculate the MIPS of this machine for running the gcc program.
3. In some cache organization, some dirty bit will be used. Discuss the purpose
of the use of dirty bit.
4. Virtual memory can be regarded as another level of the memory hierarchy.
Compare the virtual memory and data cache from the following aspects:
block placement scheme, block replacement strategies and write policy.
Answer:
1. A four-word block gives 4 bits of offset; the number of blocks = 64 KB/16 B = 4K;
the number of sets = 4K/2 = 2K. Hence, the index field has 11 bits and the tag field =
32 − 11 − 4 = 17 bits.
The total cache size = 2K × 2 × (1 + 17 + 128) = 584 Kbits.
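The sizing arithmetic above can be sketched in a few lines:

```python
block_bytes = 4 * 4                          # four 32-bit words
blocks = 64 * 1024 // block_bytes            # 4K blocks
sets = blocks // 2                           # two ways -> 2K sets
offset_bits, index_bits = 4, 11              # 2^4 = 16 bytes/block, 2^11 = 2048 sets
tag_bits = 32 - index_bits - offset_bits     # 17
total_bits = sets * 2 * (1 + tag_bits + block_bytes * 8)   # valid + tag + data per block
print(tag_bits, total_bits // 1024)  # 17 584
```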







[Figure: two-way set-associative cache; the address splits into a 17-bit tag, an 11-bit index (2048 sets), and a 4-bit offset; each way holds a valid bit, a tag, and a 128-bit data block, with per-way tag comparators and 2-to-1 data MUXes driving Hit and Data.]


2. Average CPI = 2 + 0.02 × 40 + 0.06 × (0.22 + 0.08) × 40 = 3.52
MIPS = (200 × 10^6) / (3.52 × 10^6) = 56.82
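The same stall accounting, as a short sketch:

```python
base_cpi, miss_penalty = 2.0, 40
instr_miss, data_miss = 0.02, 0.06
data_refs = 0.22 + 0.08                    # loads + stores per instruction

cpi = base_cpi + instr_miss * miss_penalty + data_miss * data_refs * miss_penalty
mips = 200 / cpi                           # 200 MHz clock
print(round(cpi, 2), round(mips, 2))  # 3.52 56.82
```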
3. The dirty bit is used with a write-back cache. On a write hit, the cache block is
updated and its dirty bit is set to 1. When a block whose dirty bit is 1 is replaced,
it must first be written back to memory; clean blocks can simply be discarded.
4. For virtual memory, the page-fault penalty is enormous, so the design minimizes
the miss rate: block placement is fully associative, block replacement uses LRU,
and the write policy is write-back.
For caches, the miss penalty is much smaller: block placement may be
direct-mapped, set-associative, or fully associative; block replacement uses LRU
or random replacement; and the write policy is write-back or write-through.


IV. Computer arithmetic
1. Draw the gate-level implementation of a full adder.
2. Draw the detailed architecture of a simple ALU, as shown in Fig. 2, capable of
performing 4-bit two's complement addition and subtraction. When op equals
1, it performs the operation (A+B); when op equals 0, it performs the
operation (A−B). This ALU sets the flag f_over (=1) when overflow
occurs. The other flag, f_comp, is set when the two numbers are equal (i.e.
A=B).
Answer:
1. [Figure: gate-level full adder.]
2. [Figure: four cascaded full adders with inputs a3..a0 and b3..b0, control input op,
sum outputs s3..s0, carry-out c4, and flags f_over and f_comp.]

Fig. 1 [figure not recoverable from extraction]

Fig. 2 [Figure: ALU block with 4-bit inputs A = (a3, a2, a1, a0) and B = (b3, b2, b1, b0), control input op, 4-bit output S = (s3, s2, s1, s0), and flags f_over and f_comp.]

92

1. Answer and explain the following questions. Credit will be given only if
explanation is provided.
(1) What is an addressing mode? Enumerate and explain two addressing
modes that are frequently used in reduced instruction set computers
(RISCs).
(2) RISCs generally have poor code density (larger code size) compared with
CISCs (complex instruction set computers). Please explain how RISCs can
reduce code size.
(3) Pentium 4 has a much deeper pipeline than Pentium III. Please explain the
advantage and disadvantage of deeper pipeline.
(4) Is the floating-point addition performed in a computer associative? Why?
(5) Derive the IEEE 754 binary representation for the floating-point
number −10.75 (decimal) in single precision.
(6) Assume that multiply instructions take 10 cycles and account for 20% of the
instructions in a typical program and that the other 80% of the instructions
require an average of 5 cycles for each instruction. What percentage of time
does the CPU spend doing multiplication?
(7) Assume that in 1000 memory references there are 30 misses in the first-level
(L1) cache and 6 misses in the second-level (L2) cache. The miss penalty from
the L2 cache to memory is 100 clock cycles, the hit time of the L2 cache is 10
clock cycles, the hit time of L1 is 1 clock cycle, and there are 1.5 memory
references per instruction. What is the average memory access time? Ignore
the impact of writes.
Answer:
(1) 1. Register addressing, where the operand is a register:
add $s0, $s1, $s2   # $s0 ← $s1 + $s2
2. Base or displacement addressing, where the operand is at the memory
location whose address is the sum of a register and a constant in the
instruction:
lw $s0, 20($s1)   # $s0 ← Memory[20 + $s1]
(2) Compiler optimizations to reduce code size: such as strength reduction, dead
code elimination, and common subexpression elimination.
Hardware techniques to reduce code size: such as dictionary compression
where identical code sequences are identified and each occurrence is assigned
a variable-length codeword based on the frequency of occurrence. (more
frequently occurring instruction sequences are assigned shorter codewords)
(3) Extending the length of a pipeline can have major benefits for speed, but at
the same time a deeper pipeline needs more complex circuitry to prevent
hazards, and if such hazards cannot be prevented, the overall performance of
the CPU will suffer.

(4) No.
For example, suppose x = −1.5 × 10^38, y = 1.5 × 10^38, and z = 1.0, and
that these are all single precision numbers.
Then x + (y + z) = −1.5 × 10^38 + (1.5 × 10^38 + 1.0) = −1.5 × 10^38 + 1.5 × 10^38 = 0.0,
while (x + y) + z = (−1.5 × 10^38 + 1.5 × 10^38) + 1.0 = 0.0 + 1.0 = 1.0.
Therefore, x + (y + z) ≠ (x + y) + z.
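The same non-associativity shows up with ordinary double-precision floats, since 1.5 × 10^38 is large enough that adding 1.0 is absorbed by rounding:

```python
x, y, z = -1.5e38, 1.5e38, 1.0

left = x + (y + z)    # y + z rounds back to 1.5e38, so the total is 0.0
right = (x + y) + z   # x + y is exactly 0.0, so the total is 1.0
print(left, right)    # 0.0 1.0
```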
(5) −10.75 (decimal) = −1010.11 (binary) = −1.01011 × 2^3
IEEE 754 single precision format = 1 10000010 01011000000000000000000
(6) (10 × 0.2) / (10 × 0.2 + 5 × 0.8) = 0.3333 = 33.33%
(7) Average memory access time = 1.5 × (1 + 0.03 × 10 + 0.006 × 100) = 2.85
clock cycles
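A sketch of the two-level access-time arithmetic (per-reference miss rates taken from the 1000-reference counts):

```python
refs = 1000
l1_miss = 30 / refs                      # 0.03 misses per reference
l2_miss = 6 / refs                       # 0.006 references go all the way to memory
l1_hit, l2_hit, mem_penalty = 1, 10, 100
refs_per_instr = 1.5

amat_per_ref = l1_hit + l1_miss * l2_hit + l2_miss * mem_penalty   # 1.9 cycles
print(round(refs_per_instr * amat_per_ref, 2))  # 2.85
```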

2. (1) There are three methods to implement the datapath: single-cycle (M1),
multicycle (M2), and pipeline (M3). The operation times for the major
functional units in these implementations are 2 ns for memory access, 2 ns for
ALU operation, and 1 ns for register file read or write. Assuming that the
multiplexors, control unit, PC accesses, sign extension unit, and wires have no
delay. The other details of these three implementations are listed as follows:
M1: The critical path of single-cycle implementation for the different
instruction classes is:
Instruction class: functional units used by the instruction class
ALU type: Instruction fetch, Register access, ALU, Register access
Load word: Instruction fetch, Register access, ALU, Memory access, Register access
Store word: Instruction fetch, Register access, ALU, Memory access
Branch: Instruction fetch, Register access, ALU
Jump: Instruction fetch
M2: Multicycle implementation uses the control shown in <Fig. 1>.
M3: For the pipelined implementation, assume that half of the load instructions
are immediately followed by an instruction that uses the result, that the
branch delay on misprediction is 1 clock cycle, and that one-quarter of the
branches are mispredicted. Assume that jumps always pay 1 full clock
cycle of delay, so their average time is 2 clock cycles.
If the instruction mix is 23% loads, 13% stores, 43% ALU instructions, 19%
branches, and 2% jumps. Please calculate the average CPI (clock cycles per
instruction) and the average instruction time for the three implementations.
(2) Consider the five instructions in the following program. These instructions
execute on the five-stage pipelined datapath of <Fig. 2>. The five stages are:
instruction fetch (IF), instruction decode and fetch operand (ID), instruction
execution (EX), memory access (MEM), and register write back (WB).
Assume that each stage takes one cycle to complete its execution and the first
instruction starts from clock cycle 1.

add $1, $2, $3
add $4, $5, $1
lw $6, 50($7)
sub $8, $6, $9
add $10, $11, $8
(a) At the end of the fifth cycle of execution, which registers are being read
and which register will be written? How many cycles will it take to execute
this program?
(b) Explain what the forwarding unit and the hazard detection unit are doing
during the fifth cycle of execution. If any comparisons are being made,
mention them.
Answer:
(1)
CPI for M1 = 1
CPI for M2 = 5 × 0.23 + 4 × 0.13 + 4 × 0.43 + 3 × 0.19 + 3 × 0.02 = 4.02
CPI for M3 = 1 + 0.23 × 0.5 × 1 + 0.19 × 0.25 × 1 + 0.02 × 1 = 1.1825
Average instruction time for: M1 = 1 × 8 ns = 8 ns
M2 = 4.02 × 2 ns = 8.04 ns
M3 = 1.1825 × 2 ns = 2.365 ns
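These CPI numbers can be checked with a short script (instruction mix and per-class cycle counts as given above):

```python
mix = {"load": 0.23, "store": 0.13, "alu": 0.43, "branch": 0.19, "jump": 0.02}
m2_cycles = {"load": 5, "store": 4, "alu": 4, "branch": 3, "jump": 3}

cpi_m2 = sum(mix[c] * m2_cycles[c] for c in mix)
# M3: base CPI of 1 plus the stall assumptions stated in the question
cpi_m3 = 1 + mix["load"] * 0.5 * 1 + mix["branch"] * 0.25 * 1 + mix["jump"] * 1
print(round(cpi_m2, 2), round(cpi_m3, 4), round(cpi_m2 * 2, 2), round(cpi_m3 * 2, 3))
# 4.02 1.1825 8.04 2.365
```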
(2)
(a) Register $1 is being written, and registers $6 and $9 are being read.
Since there is a load-use data hazard between instructions 3 and 4, one
clock stall is required; therefore, executing this program needs (5 − 1) + 5 +
1 = 10 clock cycles.
(b) The forwarding unit is comparing $1 = $7? $4 = $7? $1 = $6? $4 = $6?
The hazard detection unit is comparing $6 = $6? $6 = $9?


3. (1) Suppose we have made the following measurements:
Frequency of floating-point (FP) instructions = 5%
Frequency of ALU instructions = 25%
Average CPI of FP instructions = 20.0
Average CPI of ALU instructions = 4.0
Average CPI of other instructions = 2.0
Assume that the two design alternatives are to decrease the average CPI of FP
instructions to 8.0 or to decrease the average CPI of ALU instructions to 2.0.
Compare the performance of these two design alternatives.
(2) Consider the following three processors with the same instruction architecture:
(a) A simple processor running at a clock rate of 1.2 GHz and achieving a
pipeline CPI of 1.0. This processor has a cache system that yields 0.01
misses per instruction.
(b) A deeply pipelined processor with slightly smaller caches and a 1.5 GHz
clock rate. The pipeline CPI of the processor is 1.2, and the smaller caches

yield 0.014 misses per instruction on average.
(c) A speculative superscalar processor. This processor has the smallest caches
and a 1 GHz clock rate. The pipeline CPI of the processor is 0.4, and the
smallest caches lead to 0.02 misses per instruction, but it hides 20% of the
miss penalty on every miss by dynamic scheduling.
Assume that the main memory time (which sets the miss penalty) is 100 ns.
Determine the relative performance of these three processors.
Answer:
(1) CPI for Design 1 = 0.05 × 8 + 0.25 × 4 + 0.7 × 2 = 2.8
CPI for Design 2 = 0.05 × 20 + 0.25 × 2 + 0.7 × 2 = 2.9
The performance of Design 1 is 2.9 / 2.8 = 1.0357 times better than
Design 2. Hence, decreasing the average CPI of FP instructions is better than
decreasing the average CPI of ALU instructions.
(2) (a) machine: cycle time = 1/1.2 GHz = 0.83 ns
Average instruction time = (1 + 0.01 × 100/0.83) × 0.83 = 1.83 ns
(b) machine: cycle time = 1/1.5 GHz = 0.67 ns
Average instruction time = (1.2 + 0.014 × 100/0.67) × 0.67 = 2.2 ns
(c) machine: cycle time = 1/1.0 GHz = 1 ns
Average instruction time = (0.4 + 0.02 × 0.8 × 100) × 1 = 2 ns
The performance relationship of machines (a) (b) (c) is (a) > (c) > (b).
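A sketch checking the three average instruction times (100 ns memory time, clock rates and pipeline CPIs as given):

```python
# name -> (clock_ghz, pipeline_cpi, misses_per_instr, fraction_of_miss_penalty_paid)
machines = {"a": (1.2, 1.0, 0.010, 1.0),
            "b": (1.5, 1.2, 0.014, 1.0),
            "c": (1.0, 0.4, 0.020, 0.8)}  # (c) hides 20% of every miss penalty
mem_ns = 100.0

results = {}
for name, (ghz, cpi, mpi, paid) in machines.items():
    cycle_ns = 1.0 / ghz
    miss_cycles = mpi * paid * mem_ns / cycle_ns   # memory stall cycles per instruction
    results[name] = round((cpi + miss_cycles) * cycle_ns, 2)
print(results)  # {'a': 1.83, 'b': 2.2, 'c': 2.0}
```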

4. (1) Suppose the branch frequencies (as percentages of all instructions) are as follows:
Conditional branches: 20%
Unconditional branches: 1%
Conditional branches that are taken: 60%
Consider a five-stage pipelined machine where the branch is resolved at the
end of the third cycle for conditional branches and at the end of the second
cycle for unconditional branches. Assuming that only the first pipe stage can
always be done independent of whether the branch goes and ignoring other
pipeline stalls, how much faster would the machine be without any branch
hazards?
(2) A superscalar MIPS processors can issue two instructions per clock cycle. One
of the instructions could be an integer ALU operation or branch, and the other
could be a load or store.
(a) Enumerate the possible extra resources required for extending a simple
MIPS pipeline into the superscalar pipeline so that it wouldn't be hindered
by structural hazards.
(b) Unroll the following loop once under the assumption that the loop index is
a multiple of two. How would this unrolled loop be scheduled on the
superscalar pipeline for MIPS? Reorder the instructions to avoid as many
pipeline stalls as possible and show your unrolled and scheduled code as
the following table.


Loop: lw $t0, 0($s1)
addu $t0, $t0, $s2
sw $t0, 0($s1)
addi $s1, $s1, -4
bne $s1, $zero, Loop
(Show your answer in a table with the columns: ALU or branch instruction | Data transfer instruction | Clock cycle.)
Answer:
(1) The penalty for a conditional branch hazard = 2 clock cycles
The penalty for an unconditional branch hazard = 1 clock cycle
The average CPI considering branch hazards = 1 + 0.2 × 0.6 × 2 + 0.01 × 1 =
1.25
The machine without branch hazards is 1.25 times faster than the machine with
branch hazards.
(2) (a) Extra hardware includes:
Register file: 2 read ports for the ALU operation, 2 read ports for the store, one
write port for the ALU result, and one write port for the load
A separate adder for the address calculation of data transfers
(b)
Loop: addi $s1, $s1, -8
lw $t0, 8($s1)
addu $t0, $t0, $s2
sw $t0, 8($s1)
lw $t1, 4($s1)
addu $t1, $t1, $s2
sw $t1, 4($s1)
bne $s1, $zero, Loop

ALU or branch instruction | Data transfer instruction | Clock cycle
Loop: addi $s1, $s1, -8   | lw $t0, 0($s1)            | 1
                          | lw $t1, 4($s1)            | 2
addu $t0, $t0, $s2        |                           | 3
addu $t1, $t1, $s2        | sw $t0, 8($s1)            | 4
bne $s1, $zero, Loop      | sw $t1, 4($s1)            | 5


Fig. 1 [figure not recoverable from extraction; question 2 refers to it for the multicycle control]

Fig. 2 [Figure: five-stage pipelined MIPS datapath with IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers, a hazard detection unit, and a forwarding unit.]


96

1. [ADC] The successive approximation converter is one of the most widely used
types of ADC. Given the following simplified block diagram:
(a) Explain how the control logic works using a flowchart.
(b) Assume V_A = 10.4 V; use a simple four-bit converter with a step size of 1 V to
illustrate the conversion process.
Answer:
(a)
Start: clear all bits in the register and begin at the MSB.
1. Set the current bit to 1.
2. If the DAC output is greater than V_AX, clear the bit back to 0; otherwise leave it set.
3. If this was not the LSB, move to the next lower bit and go back to step 1.
4. Otherwise the conversion is finished and the result is the number in the register.
End.


(b) 10.4 > 8, so b3 = 1
10.4 − 8 = 2.4 < 4, so b2 = 0
2.4 > 2, so b1 = 1
2.4 − 2 = 0.4 < 1, so b0 = 0
The result is 1010.
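The successive-approximation loop in (b) is easy to model in code; a minimal sketch with an ideal DAC (step size 1 V, as in the question):

```python
def sar_convert(v_in, n_bits=4, step=1.0):
    """Successive approximation: decide each bit from MSB to LSB."""
    code = 0
    for bit in reversed(range(n_bits)):
        trial = code | (1 << bit)      # tentatively set the bit
        if trial * step <= v_in:       # DAC output does not exceed the input:
            code = trial               # keep the bit; otherwise it stays cleared
    return code

print(format(sar_convert(10.4), "04b"))  # 1010
```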


2. [Instruction Set Architecture, Cache, Performance]
(a) One of the differences between RISC architectures and CISC architecture is
supposed to be the reduced types of instructions available. A student thinks it
would be a good idea to simplify the instruction set even more to remove the
special case instructions that take immediate operands such as "li", "addi", etc.
Explain to him/her why this might not be such a good idea.
(b) Explain how a memory system that pages to secondary storage depends on
locality of reference for efficient operation.
(c) Program A consists of 2000 consecutive add instructions, while program B
consists of a loop that executes a single add instruction 2000 times. You run
both programs on a certain machine and find that program B consistently
executes faster. Explain.
Answer:
(a) Common operations will now require multiple instructions, without any
corresponding improvement in cycle time.
(b) Without locality, the memory system would perform at disk speed, as every
access could require an access to disk. A working set that fits within physical
memory is needed for efficient operation.
(c) Program B fits easily in the instruction cache, but program A is too large to
fit and so takes more time to be fetched.

3. [Adder] A half-adder takes two input bits A and B to produce sum (S) and carry
(C_out) outputs.
(a) Use basic logic gates to construct the circuits for a half adder.
(b) Use exactly two half-adders and one OR gate to construct a full adder.
Answer:
(a) Truth table for half adder:
A B | S | C_out
0 0 | 0 | 0
0 1 | 1 | 0
1 0 | 1 | 0
1 1 | 0 | 1
S = A ⊕ B, C_out = AB
[Circuit: an XOR gate produces S; an AND gate produces C_out.]
(b) Truth table for full adder:
A B C_in | S | C_out
0 0 0 | 0 | 0
0 0 1 | 1 | 0
0 1 0 | 1 | 0
0 1 1 | 0 | 1
1 0 0 | 1 | 0
1 0 1 | 0 | 1
1 1 0 | 0 | 1
1 1 1 | 1 | 1
S = (A ⊕ B) ⊕ C_in
C_out = A'B·C_in + AB'·C_in + AB·C_in' + AB·C_in = AB + (A ⊕ B)·C_in
[Circuit: two half adders in series produce S; an OR gate combines their carries to
produce C_out.]
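The two-half-adder construction in (b) can be verified exhaustively; a small sketch:

```python
def half_adder(a, b):
    return a ^ b, a & b                  # (sum, carry)

def full_adder(a, b, cin):
    s1, c1 = half_adder(a, b)            # first half adder
    s, c2 = half_adder(s1, cin)          # second half adder
    return s, c1 | c2                    # one OR gate merges the two carries

# exhaustive check against integer addition
for a in (0, 1):
    for b in (0, 1):
        for cin in (0, 1):
            s, cout = full_adder(a, b, cin)
            assert a + b + cin == (cout << 1) | s
print("full adder verified")
```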


4. [Pipelining] Refer to the following figure. If the time for an ALU operation can be
shortened by 25%: (a) Will it affect the speedup obtained by pipelining? If yes,
by how much? Otherwise, why? (b) What if the ALU now takes 25% more time?









Answer:
(a) Shortening the ALU operation will not affect the speedup obtained from
pipelining. It would not affect the clock cycle, which will be determined by
the operation that takes the most time, i.e., instruction fetch or data access.
(b) If the ALU operation takes 25% more time, it becomes the bottleneck in the
pipeline. The clock cycle needs to be 250 ps.
Suppose the original instruction time = x.
The change in speedup = (x/250) / (x/200) = 200/250 = 0.8
So, the speedup would be 20% less.

5. [Memory] Given a computer system that features:
a single processor
32-bit virtual addresses
a cache of 2^10 sets that are four-way set-associative and have 8-byte blocks
a main memory of 2^26 bytes
a page size of 2^12 bytes






(a) Does this system cache virtual or physical addresses?
(b) How many bytes of data from memory can the cache hold? (excluding tags)
(c) In the cache, each block of data must have a tag associated with it. How many
bits long are these tags?
(d) How many comparators are needed to build this cache while allowing single
cycle access?
(e) At any one time, what is the greatest number of page-table entries that can
have their valid bit set to 1?
Answer:
(a) Virtual.
(b) 2^10 × 4 × 8 bytes = 32 Kbytes
(c) 32 − 10 − 3 = 19 bits
(d) 4 comparators
(e) 2^32 / 2^12 bytes/page = 2^20 pages.
All of them can be valid because they can all alias the same physical page.
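A sketch of the sizing arithmetic behind (b), (c), (d), and (e):

```python
sets, ways, block_bytes = 2**10, 4, 8
capacity = sets * ways * block_bytes           # (b) data bytes in the cache
tag_bits = 32 - 10 - 3                         # (c) 32-bit address minus index and offset
comparators = ways                             # (d) one tag comparator per way
pages = 2**32 // 2**12                         # (e) number of virtual pages
print(capacity // 1024, tag_bits, comparators, pages)  # 32 19 4 1048576
```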

[Figure: CPU → Cache → Page Table → Main memory → Disk.]

95

1. [I/O]
(1) Both networks and buses connect components together. Which of the following
are true about them?
(a) Networks and I/O buses are almost always standardized.
(b) Shared media networks and multimaster buses need an arbitration scheme.
(c) Local area networks and processor-memory buses are almost always
synchronous.
(d) High-performance networks and buses use similar techniques compared to
their lower-performance alternatives: they are wider, send many words per
transaction, and have separate address and data lines.
(2) In ranking of the three ways of doing I/O, which statements are true?
(a) If we want the lowest latency for an I/O operation to a single I/O device, the
order is polling, DMA, and interrupt driven.
(b) In terms of lowest impact on processor utilization from a single I/O device,
the order is DMA, interrupt driven, and polling
Answer:
(1) (a), (b)
(2) (a), (b)

2. [RAID] What does RAID stand for? Regarding RAID levels 1, 3, 4, 5, and 6,
which one has the highest check disk overhead? Which one has worst throughput
for small writes?
Answer:
(1) RAID (Redundant Arrays of Inexpensive Disks): An organization of disks
that uses an array of small and inexpensive disks so as to increase both
performance and reliability.
(2) RAID 1 has the highest check disk overhead.
(3) For small writes, RAID 3 has the worst throughput.

3. [Memory Hierarchies]
(1) Which of the following statements (if any) are generally true?
(a) There is no way to reduce compulsory misses.
(b) Fully associate caches have no conflict misses.
(c) In reducing misses, associativity is more important than capacity.
(2) A new processor can use either a write-through or write-back cache selectable
through software.
(a) Assume the processor will run data intensive applications with a large
number of load and store operations. Explain which cache write policy
should be used.

(b) Consider the same question but this time for a safety critical system in
which data integrity is more important.
Answer:
(1) (b)
(2) (a) For the data-intensive application, the cache should be write-back. A write
buffer is unlikely to keep up with this many stores, so write-through
would be too slow.
(b) For the safety-critical system, the processor should use the write-through
cache. This way the data will be written in parallel to the cache and main
memory, and we could reload bad data in the cache from memory in the
case of an error.

4. [Multiprocessors, Amdahl's law] A program takes T_s seconds when executed on a
single CPU. Now assume that we have p processors which can be used for
parallel processing. (a) If only a fraction f of the program can be sped up to
take advantage of parallel processing, what is the speedup S? (b) Now assume
that the improvement in performance from using p processors can be formulated as
p(1 − px) for the parallelizable portion; find the value of p that will maximize the
overall speedup.
Answer:
(a) S = 1 / ((1 − f) + f/p)
(b) Differentiate p(1 − px) and set the result equal to 0: 1 − 2px = 0, giving
p = 1/(2x). That is, when p = 1/(2x), the expression p(1 − px) attains its
maximum value.
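A quick numeric check of (b); x = 0.05 here is a hypothetical value, since the question leaves x unspecified:

```python
x = 0.05                                  # hypothetical coefficient
gain = lambda p: p * (1 - p * x)          # performance model from the question
best = max(range(1, 100), key=gain)       # brute-force search over integer p
print(best, 1 / (2 * x))  # 10 10.0
```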

5. [Floating-Point Representation] Consider a shortened version of the IEEE standard
floating-point format with only 12 bits: one sign bit, 5 bits for the exponent, and 6
bits for the significand.
(a) Represent 1.5 and -0.75 with this format.
(b) What is the range of numbers it could represent? (excluding denormalized
numbers.)
Answer:
(a) 1.5 (decimal) = 1.1 (binary) = 1.1 × 2^0; floating-point format = 0 01111 100000
−0.75 (decimal) = −0.11 (binary) = −1.1 × 2^−1; floating-point format = 1 01110 100000
(b) ±1.000000 × 2^−14 to ±1.111111 × 2^15, 0, ±∞, and NaN
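A small encoder for this 12-bit format can confirm the bit patterns in (a); a sketch that assumes normalized inputs and a bias of 15:

```python
import math

def encode12(x, exp_bits=5, frac_bits=6):
    """Encode a normalized value in the 12-bit IEEE-style format (bias 15)."""
    sign = 1 if x < 0 else 0
    m, e = math.frexp(abs(x))            # abs(x) = m * 2**e with 0.5 <= m < 1
    m, e = m * 2, e - 1                  # renormalize to 1.xxx * 2**e
    bias = 2 ** (exp_bits - 1) - 1       # 15
    frac = round((m - 1) * 2 ** frac_bits)
    return f"{sign} {e + bias:0{exp_bits}b} {frac:0{frac_bits}b}"

print(encode12(1.5))    # 0 01111 100000
print(encode12(-0.75))  # 1 01110 100000
```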



94

1. Use the following three operations to process 10010011, and choose the correct
result.
(1) Logical right shift: (A)10010011 (B)11100100 (C) 00100100 (D)01001001
(E) 11001001.
(2) Arithmetic right shift: (A) 11001001 (B) 01001001 (C) 10010011 (D)
11100100 (E) 00100100.
(3) Right rotate: (A) 10010011 (B) 11100100 (C) 11001001 (D) 00100100 (E)
01001001.
Answer:
(1) (D)
(2) (A)
(3) (C)
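These three answers can be checked by modelling the operations on an 8-bit value; a sketch (shifting and rotating by one position):

```python
x = 0b10010011

logical = x >> 1                         # MSB filled with 0
arithmetic = (x >> 1) | 0b10000000       # MSB (the sign bit, here 1) is replicated
rotate = (x >> 1) | ((x & 1) << 7)       # LSB wraps around into the MSB

print(f"{logical:08b} {arithmetic:08b} {rotate:08b}")  # 01001001 11001001 11001001
```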

2. Which of the following operations is equivalent to division by 2 in two's
complement notation, when there is no overflow or underflow?
(A) arithmetic right rotate (B) arithmetic right shift (C) arithmetic left shift (D)
arithmetic left rotate (E) left rotate.
Answer: (B)

3. Which of the following situation cannot be a binding time? (A)When a program
is written (B) When a base register used for addressing is loaded (C)When the
instruction containing the address is executed (D)When the program is translated
(E) none of the above.
Answer: (A)

4. Addressing modes constitute a very important topic when people discuss
alternative designs of CPUs. Common addressing modes include register,
immediate, direct, and PC-relative mode, etc. Find a wrong statement from the
following choices. (A) Every CPU supports the direct addressing mode. (B) In
practice, having immediate, direct, register, and indexed mode is enough for
almost all applications. (C) Compilers are in charge of finding the best addressing
modes for statements written in high-level languages. (D) When offering a
limited number of addressing modes, the architecture must make sure that
common applications will be computable. (E) none of the above.
Answer: (A)

5. Assume that you are designing the instruction format for a strange new CPU that
will have 16 user-accessible registers. Further assume (1) that all instructions will
be encoded in exactly 16 bits and (2) that your boss wants you to include as many
instructions that employ register addressing as possible. If each instruction must
allow users to use at least two registers, how many different instructions can you
get? Which of the following choice is impossible?
(A) 64 (B) 128 (C) 256 (D) 512 (E) none of the above.
Answer: (D)

6. Which is not a possible effect of increasing the degree of associativity in the
design of cache? (A) decreasing the miss rate (B) increasing the hit time (C)
requiring more comparators (D) avoiding the needs to bind actual addresses to
variable names (E) none of the above.
Answer: (D)

7. Assume a cache of 2K blocks and a 16-bit address. Let α and β, respectively, be
the total number of sets and the total number of tag bits for a cache that is
two-way set associative. Let γ and δ, respectively, be the total number of sets and
the total number of tag bits for a cache that is fully associative. Compute α/γ and
β/δ. You must show the computation process for getting your answers to get
credit.
Answer:
α = 2K / 2 = 1K = 2^10
β = 2^10 × 2 × 6 = 6 × 2^11 (Note: ignoring the offset bits, the tag field =
16 − 10 = 6 bits)
γ = 1
δ = 1 × 2K × 16 = 32K = 2^15
α/γ = 1K = 2^10; β/δ = 6 × 2^11 / 2^15 = 0.375
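The same counts in a quick script (offset bits ignored, as in the answer above):

```python
blocks, addr_bits = 2 * 1024, 16

sets_2way = blocks // 2                            # alpha = 1K sets
tag_2way = sets_2way * 2 * (addr_bits - 10)        # beta: 6 tag bits per block
sets_full = 1                                      # gamma: one set
tag_full = blocks * addr_bits                      # delta: full 16-bit tag per block

print(sets_2way // sets_full, tag_2way / tag_full)  # 1024 0.375
```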

8. In the following C program fragment, which types of hazards might occur in a
pipelined machine? Explain your answers.
if (a == b) {
x = y + z;
w = x - 1;
}
r = w + x;
Answer:


Control hazard and data hazard.
Suppose the compiler assigns a, b, x, y, z, and w to registers $a0, $a1, $s0, $s1,
$s2, and $s3; the C fragment then compiles to:
1 bne $a0, $a1, L1
2 add $s0, $s1, $s2
3 addi $s3, $s0, -1
4 L1: add $s0, $s3, $s0
Line 1 causes a control hazard (the branch on a == b), and the register
dependences between lines (2,3), (2,4), and (3,4) can cause data hazards.

9. Which of the following architecture/model for multiprocessors systems is most
unlikely to adopt techniques of critical sections that are commonly discussed in
the course of Operating Systems? Explain your answer.
(A) uniform memory access (B) symmetric multiprocessors (C) nonuniform
memory access (D) message passing.
Answer: (D)

10. Explain the main difference between the actual meanings represented by the
following paired terms.
(1) CISC vs. RISC
(2) Big endian vs. little endian
(3) Programmed I/O vs. DMA
Answer:
(1) CISC stands for complex instruction set computer and is the name given to
processors that use a large number of complicated instructions, to try to do
more work with each one.
RISC stands for reduced instruction set computer and is the generic name
given to processors that use a small number of simple instructions, to try to do
less work with each instruction but execute them much faster.
(2) Big endian: the most significant byte of any multibyte data field is stored at
the lowest memory address.
Little endian: the least significant byte of any multibyte data field is stored at
the lowest memory address.
(3) Programmed I/O: the CPU has to write/read one byte at a time between main
memory and the device, which takes up a lot of CPU time.
DMA: the DMA controller does the programmed I/O on behalf of the CPU, so
that the CPU can do other work.



93

1. Acronyms:
Example: DMA Direct Memory Access (1) VHDL (2) SoC (3) TLB (4) RAID
(5) NUMA. (Within the context of computer architecture.)
Answer:
(1) VHDL: Very high speed integrated circuit Hardware Description Language
(2) SoC: System on a Chip
(3) TLB: Translation-Lookaside Buffer
(4) RAID: Redundant Array of Inexpensive Disks
(5) NUMA: NonUniform Memory Access

2. Number System, IEEE 754:
(1) Determine the base of the number system for the following operation to be
correct: 23 + 44 + 13 + 32 = 222
(2) Find the smallest normalized positive floating-point number in the IEEE-754
single-precision representation.
Answer:
(1) Suppose the base is B. Then (2B + 3) + (4B + 4) + (1B + 3) + (3B + 2) =
2B^2 + 2B + 2, i.e. 10B + 12 = 2B^2 + 2B + 2, so B^2 − 4B − 5 = 0 and B = 5.
So the base for the number system is 5.
(Check: in base 5, 23 + 44 + 13 + 32 = 222.)
(2) 1.0_2 × 2^−126, i.e. the bit pattern
0 00000001 00000000000000000000000
(sign 0, exponent 00000001, fraction all zeros).

3. Cache:
Consider data transfers between two levels of a hierarchical memory. We
logically organize all data stored in the lower level into blocks and store the
most-frequently-used blocks in the upper level. Assume that the hit rate is H,
the upper-level latency is T_u, the lower-level latency is T_l, and the miss
penalty is T_m.
(1) What is the average memory latency?
(2) What is the speedup by using the hierarchical memory system?
Answer:
(1) Average memory latency = T_u + (1 − H) × T_m
(2) Speedup = T_l / (T_u + (1 − H) × T_m)

4. Pipelining:
(1) Assuming no hazards, a pipelined processor with s stages (each stage takes 1
clock cycle) can execute n instructions in ____ clock cycles.
(2) Use the result in (1) to show that the ideal pipeline speedup equals the number
of stages.

273
Answer:
(1) (s − 1) + n
(2) Speedup = execution time before enhancement / execution time after
enhancement = (n × s) / (s − 1 + n).
As n → ∞, speedup → (n × s) / n = s.

5. Approximation Circuits:
In a traditional adder design, the calculation must consider all input bits to obtain
the final carry out. However, in real programs, inputs to the adder are not
completely random and the effective carry chain is much shorter for most cases.
Instead of designing for the worst-case scenario, it is possible to build a faster
adder with a much shorter carry chain to approximate the result. Suppose we
consider only the previous k inputs (lookahead k bits) instead of all previous
input bits to estimate the i-th carry bit c_i, i.e.,
c_i = f(a_{i−1}, b_{i−1}, a_{i−2}, b_{i−2}, …, a_{i−k}, b_{i−k})
where 0 < k < i + 1 and a_j = b_j = 0 if j < 0.
(1) With random inputs, show that c_i will generate a correct result with a
probability of 1 − 1/2^(k+2).
(2) What is the probability of having a correct carry result considering only k
previous inputs for an N-bit addition?
(3) Design a logic circuit to detect when the approximate adder will generate an
incorrect carry result for the i-th carry bit.
Answer:
(1) For a random input pair (a_j, b_j), position j propagates an incoming carry
when a_j ⊕ b_j = 1, which happens with probability 1/2 (patterns 01 and 10
out of 00, 01, 10, 11), and generates a carry when a_j = b_j = 1, with
probability 1/4.
The estimate of c_i is wrong only when all k inspected positions (i − 1 down
to i − k) propagate and a carry actually enters the window, approximated by a
generate at position i − k − 1:
(1/2)^k × 1/4 = 1/2^(k+2).
Hence c_i is correct with probability 1 − 1/2^(k+2).
(2) There are (N − k − 1) such k-bit stages in the N bits, so the probability
that every carry is correct is (1 − 1/2^(k+2))^(N−k−1).
(3) The approximation fails exactly when all k inspected positions propagate and
the true carry into the window is 1, so the error-detect function is
Error_i = (a_{i−1} ⊕ b_{i−1})(a_{i−2} ⊕ b_{i−2}) ⋯ (a_{i−k} ⊕ b_{i−k}) · c_{i−k}.


92

1. Cache:
Suppose that you have a computer system with the following properties:
Instruction miss rate (IMR): 1.5%
Data miss rate (DMR): 4.0%
Percentage of memory instructions (MI): 30%
Miss penalty (MP): 60 cycles
Assume that there is no penalty for a cache hit. Also assume that a cache block is
one-word (32 bits).
(1) Express the number of CPU cycles required to execute a program with K
instructions (assuming that CPI = 1) in terms of the miss rates, miss
percentage and miss penalty.
(2) You are allowed to upgrade the computer with one of the following
approaches:
(a) Get a new processor that is twice as fast as your current computer. The
new processor's cache is twice as fast too, so it can keep up with the
processor.
(b) Get a new memory that is twice as fast.
What is a better choice? Explain with a detailed quantitative comparison
between the two choices.
Answer:
(1) The number of CPU cycles = K × (1 + 0.015 × 60 + 0.04 × 0.3 × 60) = 2.62K
(2) (a) The CPI for the new processor = 0.5 + 0.015 × 60 + 0.04 × 0.3 × 60 = 2.12
(measured in original-machine cycles)
(b) The CPI for the new memory = 1 + 0.015 × 30 + 0.04 × 0.3 × 30 = 1.81
So, (b) is a better choice.

2. Floating-point Arithmetic :
(1) Why is biased notation used in IEEE 754 representation?
(2) Write the binary representation for the smallest negative floating point value
greater than -1 using single precision IEEE 754 format.
(3) Illustrate the key steps in floating-point multiplication using the example:
0.5_ten × (−0.4375_ten).
Answer:
(1) Biased notation lets floating-point numbers be compared as if they were
integers: with the bias, the smallest exponent is represented as 00…0 and the
largest as 11…1, so the bit patterns sort in numeric order.

(2) −1 = 1 01111111 00000000000000000000000. So the smallest negative
floating-point value greater than −1 is 1 01111110 11111111111111111111111
(= −(1 − 2^−24)).
(3) In binary, the task is (1.000_2 × 2^−1) × (−1.110_2 × 2^−2).

Step 1: Adding the exponents:
(−1 + 127) + (−2 + 127) − 127 = 124
Step 2: Multiplying the significands:
1.000_2 × 1.110_2 = 1.110000_2, so the product is 1.110000_2 × 2^−3;
keeping it to 4 bits, it is 1.110_2 × 2^−3.
Step 3: The product is already normalized and, since −126 ≤ −3 ≤ 127,
there is no overflow or underflow.
Step 4: Rounding the product makes no change: 1.110_2 × 2^−3.
Step 5: Make the sign of the product negative: −1.110_2 × 2^−3.
Converting to decimal: −1.110_2 × 2^−3 = −0.001110_2 = −7/2^5 = −0.21875_ten.


3. Parallel Computing :
(1) What are the two possible approaches for parallel processors to share data?
(2) Outline Flynn's taxonomy of parallel computers.
(3) Suppose you want to perform two sums: one is a sum of two scalar variables
and one is a matrix sum of a pair of two-dimensional arrays, size 500 by 500.
What speedup do you get with 500 processors?
Answer:
(1) Single address space and Message passing
(2) 1. Single instruction stream, single data stream (SISD)
2. Single instruction stream, multiple data streams (SIMD)
3. Multiple instruction streams, single data stream (MISD)
4. Multiple instruction streams, multiple data streams (MIMD)
(3) Speedup = (1 + 500 × 500) / (1 + (500 × 500) / 500) = 250001 / 501 ≈ 499

4. Disk I/O :
Suppose we have a magnetic disk with the following parameters.
Controller overhead 1 ms
Average seek time 12 ms
# sectors per track 32 sectors/track
Sector size 512 bytes
(1) If the disk's rotation rate is 3600 RPS, what is the transfer rate for this disk?
(2) What is the average time to read or write an entire track (16 consecutive KB)
if the disk's rotation rate is 3600 RPM? Assume sectors can be read or written
in any order.
(3) If we would like an average access time of 21.33 ms to read or write 8
consecutive sectors (4K bytes), what disk rotation rate is needed?
Answer:

(1) Data transfer rate = 3600 × 32 × 512 bytes/sec = 57,600 KB/sec
(2) Access time = seek + rotation + overhead = 12 ms + (1/60) s + 1 ms
= 12 ms + 16.67 ms + 1 ms = 29.67 ms
(3) Suppose the rotation rate is R cycles per second:
21.33 ms = 12 ms + 0.5/R + 8/(R × 32) + 1 ms, so R = 90.036 RPS
(Note: in (2), one full rotational time is needed to read the whole track, and
no rotational latency is added since the sectors can be read in any order.)

5. Pipelining :
(1) Explain the three types of hazards encountered in pipelining. (Note: State the
causes and possible solutions.)
(2) What are the characteristics of the MIPS instruction set architecture (ISA) that
facilitates pipelined execution? (Note: State at least two properties.)
Answer:
(1) 1. Structural hazards: hardware cannot support the instructions executing in
the same clock cycle (limited resources)
2. Data hazards: attempt to use item before it is ready (Data dependency)
3. Control hazards: attempt to make a decision before condition is evaluated
(branch instructions)
Type: Solutions
Structural hazard: add hardware resources (e.g., use two memories, one for
instructions and one for data).
Data hazard:
Software solutions: (a) the compiler inserts no-operation (nop)
instructions; (b) the compiler reorders instructions to avoid the hazard.
Hardware solutions: (a) forwarding resolves most data hazards;
(b) for load-use hazards, forwarding alone is not enough, so the pipeline
stalls one cycle and then forwards.
Control hazard:
Software solutions: (a) the compiler inserts no-operation (nop)
instructions; (b) delayed branch.
Hardware solutions: (a) stall the pipeline until the branch is resolved;
(b) predict the branch (statically or dynamically) and flush the pipeline
on a misprediction.
(2) 1. Instructions are the same length
2. Has only a few instruction formats, with the same source register field
located in the same place in each instruction
3. Memory operands only appear in loads or stores
4. Operands must be aligned in memory

96

1. Please explain the following concepts or terminologies:
(a) The concept of a stored-program computer. (b) Amdahl's law. (c) Branch delay
slot. (d) Miss penalty of a cache. (e) Page table.
Answer:
(a) The idea that instructions and data of many types can be stored in memory as
numbers, leading to the stored program computer.
(b) A rule stating that the performance enhancement possible with a given
improvement is limited by the amount that the improved feature is used.
(c) The slot directly after a delayed branch instruction, which in the MIPS
architecture is filled by an instruction that does not affect the branch.
(d) The time to replace a block in cache with the corresponding block from
memory, plus the time to deliver this block to the processor.
(e) The table containing the virtual to physical address translations in a virtual
memory system.

2. For a two-bit adder that implements twos complement addition, answer the
following questions: (Assume no carry in)
(a) Write all possible inputs and outputs for the 2-bit adder.
(b) Indicate which inputs result in overflow.
Answer: (a), (b)
Input (a1 a0 b1 b0) | Sum (s1 s0) | Overflow | Remark
0 0 0 0 | 0 0 | No  | 0 + 0 = 0
0 0 0 1 | 0 1 | No  | 0 + 1 = 1
0 0 1 0 | 1 0 | No  | 0 + (−2) = −2
0 0 1 1 | 1 1 | No  | 0 + (−1) = −1
0 1 0 0 | 0 1 | No  | 1 + 0 = 1
0 1 0 1 | 1 0 | Yes | 1 + 1 = 2
0 1 1 0 | 1 1 | No  | 1 + (−2) = −1
0 1 1 1 | 0 0 | No  | 1 + (−1) = 0
1 0 0 0 | 1 0 | No  | −2 + 0 = −2
1 0 0 1 | 1 1 | No  | −2 + 1 = −1
1 0 1 0 | 0 0 | Yes | −2 + (−2) = −4
1 0 1 1 | 0 1 | Yes | −2 + (−1) = −3
1 1 0 0 | 1 1 | No  | −1 + 0 = −1
1 1 0 1 | 0 0 | No  | −1 + 1 = 0
1 1 1 0 | 0 1 | Yes | −1 + (−2) = −3
1 1 1 1 | 1 0 | No  | −1 + (−1) = −2

3. Convert the following C language into MIPS codes.
void clear1(int array[], int size)
{
    int i;
    for (i = 0; i < size; i += 1)
        array[i] = 0;
}
Answer:
Suppose that $a0 and $a1 contain the starting address and the size of the array,
respectively.
move $t0, $zero
loop1: sll $t1, $t0, 2
add $t2, $a0, $t1
sw $zero, 0($t2)
addi $t0, $t0, 1
slt $t3, $t0, $a1
bne $t3, $zero, loop1

4. Explain the five MIPS addressing modes.
Answer:
Multiple forms of addressing are generically called addressing modes. The MIPS
addressing modes are the following:
1. Register addressing, where the operand is a register.
2. Base or displacement addressing, where the operand is at the memory location
whose address is the sum of a register and a constant in the instruction.
3. Immediate addressing, where the operand is a constant within the instruction
itself.
4. PC-relative addressing, where the address is the sum of the PC and a constant
in the instruction.
5. Pseudodirect addressing, where the jump address is the 26 bits of the
instruction concatenated with the upper bits of the PC.


5. If we wish to add the new instruction jr (jump register), explain any necessary
modification to the following datapath.















Answer:
A modification to the datapath is necessary to allow the new PC to come from a
register (Read data 1 port), and a new signal (e.g., JumpReg) to control it through
a multiplexor as shown in the following Figure.


















[Figure: the single-cycle MIPS datapath, and the modified datapath in which an
additional multiplexor, controlled by the new JumpReg signal, selects the Read
data 1 output of the register file as the next PC.]

95

1. The single-cycle datapath for the MIPS architecture is shown below.









(a) The single-cycle datapath is not used in modern designs. Why? Please explain
in detail.
(b) What is a multicycle datapath design? Modify the above single cycle datapath
as a multicycle datapath. Draw the modified datapath and explain your
modification.
(c) What is a pipelining implementation? Modify the above single cycle datapath
as a pipelined datapath. Draw the modified datapath and explain your
modification.
Answer:
(a) The single-cycle datapath is inefficient, because the clock cycle must have the
same length for every instruction in this design. The clock cycle is determined
by the longest path in the machine, but several instruction classes could fit
in a shorter clock cycle.
(b) A multicycle datapath is an implementation in which an instruction is
executed in multiple clock cycles.










[Figures: the single-cycle datapath referenced by the question, and the
corresponding multicycle datapath with a single memory, a single ALU, and
intermediate registers after the major functional units.]

Compared to the single-cycle datapath, the multicycle datapath is modified as
follows:
1. A single memory unit is used for both instructions and data.
2. There is a single ALU, rather than an ALU and two adders.
3. One or more registers are added after every major functional unit to hold
the output of that unit until the value is used in a subsequent clock cycle.
(c) In a pipelining implementation, multiple instructions are overlapped in
execution, much like an assembly line.









We separate the single-cycle datapath into five pieces, each corresponding
to a stage of instruction execution. Pipeline registers are added between
adjacent stages to hold data, so that portions of the datapath can be shared
during instruction execution.

2. Please explain the designs and advantages for RAID 0, 1, 2, 3, 4, respectively.
Answer:
RAID level | Design description | Advantages
0 | Striping, but no redundancy of data; it offers the best performance but no
fault tolerance. | 1. Best performance is achieved when data is striped across
multiple disks. 2. No parity-calculation overhead is involved, and it is easy
to implement.
1 | Each disk is fully duplicated (mirroring). | Very high availability can be
achieved.
2 | Striping across disks, with some disks storing error checking and
correcting (ECC) information. | Relatively simple controller design compared
to RAID levels 3, 4 and 5.
3 | Striping, with one drive dedicated to storing parity information. | Very
high read and write data transfer rate, since every read and write goes to all
disks.
4 | Differs from RAID 3 only in the size of the stripes sent to the various
disks. | Better for small reads (just one disk) and small writes (fewer reads).

3. An eight-block cache can be configured as direct mapped, two-way set
associative, and fully associative. Draw these cache configurations. Explain the
relationship between cache miss rate and associativity.
Answer: (1)
Direct-mapped: 8 blocks (Block 0–7), each holding one tag/data pair; a memory
block maps to exactly one cache block.
Two-way set associative: 4 sets (Set 0–3), each holding two tag/data pairs.
Fully associative: a single set holding eight tag/data pairs; a block can be
placed anywhere.

(2) Increasing associativity decreases the cache miss rate by reducing conflict
misses.


94

1. Translate the following C segment into MIPS assembly code.
while (data[i] == k)
i = i + j;
Assume: base address of array data is in $a0, size of each element in data is 4
bytes, and i, j, k correspond to $s0, $s1, $s2 respectively.
(1) Use both a conditional branch and an unconditional jump in the loop.
(2) Use only one branch or jump in the loop.
Answer:
(1) Loop: sll  $t1, $s0, 2
          add  $t1, $t1, $a0
          lw   $t0, 0($t1)
          bne  $t0, $s2, Exit
          add  $s0, $s0, $s1
          j    Loop
    Exit:
(2) Loop: sll  $t1, $s0, 2
          add  $t1, $t1, $a0
          lw   $t0, 0($t1)
          add  $s0, $s0, $s1
          beq  $t0, $s2, Loop
          sub  $s0, $s0, $s1

2. Explain structural hazard, data hazard and branch hazard in the pipeline design.
Give an illustrative example for each of them.
Answer:
(1) 1. Structural hazards: hardware cannot support the instructions executing in
the same clock cycle (limited resources)
2. Data hazards: attempt to use item before it is ready (Data dependency)
3. Control hazards: attempt to make a decision before condition is evaluated
(branch instructions)
(2) Example program:
1 lw $5, 50($2)
2 add $2, $5, $4
3 add $4, $2, $5
4 beq $8, $9, L1
5 sub $16, $17, $18
6 sw $5, 100($2)
7 L1:
Type: Example
Structural hazard: if the datapath had only one memory, then in cycle 4 the
MEM stage of instruction 1 (lw) and the IF stage of instruction 4 would both
access memory, causing a structural hazard.
Data hazard: instruction 1 loads $5, which instructions 2 and 3 read, and
instruction 2 writes $2, which instruction 3 reads; the pairs (1,2), (1,3),
and (2,3) are data hazards.
Control hazard: instruction 4 (beq) is not resolved until its MEM stage
(cycle 7), but by then instructions 5 and 6 have already entered the pipeline
(in the EX and ID stages); fetching them before the branch outcome is known is
a control hazard.


3. Rewrite −100_ten using a 16-bit binary representation. Use
(1) sign and magnitude representation
(2) ones complement representation
(3) twos complement representation
Answer:
(1) sign and magnitude representation: 1000000001100100
(2) ones complement representation: 1111111110011011
(3) twos complement representation: 1111111110011100

4. We have the following statistics for two processors M1 and M2 (they have the
same classes of instructions):
M1 (200 MHz):
Instruction class | CPI | Frequency
A | 5 | 25%
B | 2 | 40%
C | 3 | 35%
M2 (250 MHz):
Instruction class | CPI | Frequency
A | 3 | 40%
B | 3 | 35%
C | 4 | 25%
* CPI = clock cycles per instruction, Frequency: occurrence frequency of
the instruction class
(1) Calculate the average CPI for the two processors.
(2) Calculate the MIPS (Million Instructions Per Second) for them.
(3) Which machine is faster? How much faster?
Answer:
(1) Average CPI for M1 = 5 × 0.25 + 2 × 0.4 + 3 × 0.35 = 3.1
Average CPI for M2 = 3 × 0.4 + 3 × 0.35 + 4 × 0.25 = 3.25
(2) MIPS for M1 = (200 × 10^6) / (3.1 × 10^6) = 64.52
MIPS for M2 = (250 × 10^6) / (3.25 × 10^6) = 76.92
(3) M2 is faster than M1 by (3.1 × 5 ns) / (3.25 × 4 ns) ≈ 1.2 times


5. Draw the circuit of a 1-bit ALU that performs AND, OR, addition on inputs a and
b, or a and b . Use the basic AND, OR, Inverter, and Multiplexer gates.
Answer: 1-bit ALU













Control:
Binvert | CarryIn | Operation | Function
0       | 0       | 00        | AND
0       | 0       | 01        | OR
0       | 0       | 10        | ADD
1       | 1       | 10        | SUB


6. Assume that both the logical and physical address are 16-bits wide, the page size
is 1K (1024) bytes, and one-level paging is used
(a) How many entries are there in the page table?
(b) How many bits are occupied by each entry of the page table? If the logical
address is still 16 bits wide, but the physical address is extended to 20 bits
wide:
(c) In order to do the 16 → 20 mapping, some modification to the original page
table is needed. What is it?
Answer:
(a) 2^16 / 1024 = 64
(b) Length of an entry = valid bit + size of the physical page number
= 1 + (20 − 10) = 11 bits
(c) Increase the length of each entry to hold the enlarged physical page number.


[Figure: 1-bit ALU with inputs a and b, where Binvert selects b or its
complement; a full adder with CarryIn and CarryOut; and a multiplexor
selecting among the AND, OR, and adder results under the Operation control.]

93

1. We wish to compare the performance of two different machines: M1 and M2. The
following measurements have been made on these machines:
Program | Time on M1 | Time on M2
1 | 10 seconds | 5 seconds
2 | 4 seconds | 6 seconds

Program | Instructions executed on M1 | Instructions executed on M2
1 | 200 × 10^6 | 160 × 10^6
2 | 100 × 10^6 | 120 × 10^6

(a) Which machine is faster for each program and by how much?
(b) Find the instruction execution rate (instructions per second) for each machine
when running program 1 & 2.
(c) If the clock rate of machines M1 and M2 are 200 MHz and 500 MHz,
respectively, find the clock cycles per instruction for program 1 & 2 on both
machines using the data in Problem (a) and (b).

Answer:
(a) For Program 1, M2 is faster than M1 by 10/5 = 2 times
For Program 2, M1 is faster than M2 by 6/4 = 1.5 times
(b)
Exe. rate | M1 | M2
Program 1 | 200 × 10^6 / 10 = 20 × 10^6 | 160 × 10^6 / 5 = 32 × 10^6
Program 2 | 100 × 10^6 / 4 = 25 × 10^6 | 120 × 10^6 / 6 = 20 × 10^6
(c)
CPI | M1 | M2
Program 1 | (200 × 10^6 × 10) / (200 × 10^6) = 10 | (500 × 10^6 × 5) / (160 × 10^6) = 15.625
Program 2 | (200 × 10^6 × 4) / (100 × 10^6) = 8 | (500 × 10^6 × 6) / (120 × 10^6) = 25





2. Add 6.42_ten × 10^1 to 9.51_ten × 10^2, assuming that you have only three
significant decimal digits. Round to the nearest decimal number, first with
guard and round digits and then without them. Explain your work step by step.


Answer:
With guard and round digits:
6.42 × 10^1 + 9.51 × 10^2 = 0.642 × 10^2 + 9.51 × 10^2 = 10.152 × 10^2
= 1.0152 × 10^3 ≈ 1.02 × 10^3
Without guard and round digits:
6.42 × 10^1 + 9.51 × 10^2 = 0.64 × 10^2 + 9.51 × 10^2 = 10.1 × 10^2
(only three digits kept) = 1.01 × 10^3


3. Assuming a 32-bit address, design
(1) a direct-mapped cache with 1024 blocks and a block size of 16 bytes (4 words).
(2) a two-way set-associative cache with 1024 blocks and a block size of 16 bytes.
Answer:
[Figures: (1) the direct-mapped cache splits the 32-bit address into an 18-bit
tag, a 10-bit index selecting one of the 1024 entries, and a 4-bit byte
offset; each entry holds a valid bit, the tag, and a 128-bit (16-byte) data
block, and a hit is signaled when the valid bit is set and the stored tag
matches.
(2) the two-way set-associative cache splits the address into a 19-bit tag, a
9-bit index selecting one of 512 sets, and a 4-bit byte offset; the tag is
compared against both ways in parallel, and a 2-to-1 multiplexor selects the
data of the hitting way.]

4. Use add rd, rs, rt (addition) and addi rd, rs, imm (add immediate) instructions
only to show the minimal sequence of MIPS instructions for the statement
a = b × 7 − 8; Assume that a corresponds to register $s0 and b corresponds to
register $s1.
Answer:
add  $s0, $s1, $s1  # $s0 = 2b
add  $t0, $s0, $s1  # $t0 = 3b
add  $s0, $s0, $s0  # $s0 = 4b
add  $s0, $s0, $t0  # $s0 = 7b
addi $s0, $s0, -8   # $s0 = 7b − 8



92

1. Explain the following terminologies.
(1) stack frame
(2) nonuniform memory access
(3) write-through
(4) multiple-instruction issue
(5) out-of-order commit
Answer:
(1) When a function call is performed, information about the call is generated:
the location of the call, the arguments of the call, and the local variables
of the function being called. This information is saved in a block of data
called a stack frame.
(2) A type of single-address space multiprocessor in which some memory
accesses are faster than others depending which processor asks for which
word.
(3) A scheme in which writes always update both the cache and the memory,
ensuring that data is always consistent between the two.
(4) A scheme whereby multiple instructions are launched in 1 clock cycle.
(5) A commit in which the results of pipelined execution are written to the
programmer-visible state in a different order than the instructions are fetched.

2. Suppose that in 1000 memory references there are 50 misses in the first-level
cache, 20 misses in the second-level cache, and 5 misses in the third-level cache,
what are the various miss rates? Assume the miss penalty from the L3 cache to
memory is 100 clock cycles, the hit time of the L3 cache is 10 clocks, the hit time
of the L2 cache is 4 clocks, the hit time of L1 is 1 clock cycle, and there are 1.2
memory references per instruction. What is the average memory access time
(average cycles per memory access) and average stall cycles per instruction?
Ignore the impact of writes.
Answer:
Global miss rate for L1 = 50 / 1000 = 0.05, for L2 = 20 / 1000 = 0.02, and for L3
= 5 / 1000 = 0.005
Average memory access time = (1 + 0.05 4 + 0.02 10 + 0.005 100) = 1.9
The average stall cycles per instruction = 1.2 (1.9 1) = 1.08
:
Local miss rate for L1 = 50 / 1000 = 0.05, L2 = 20 / 50 = 0.4, L3 = 5 / 20 = 0.25
Average memory access time = 1 + 0.05 (4 + 0.4 (10 + 0.25 100)) = 1.9




3. Describe how to reduce the miss rate of a cache and list the classes of cache
misses that exist.
Answer:
(1) Compulsory (cold start or process migration, first reference): first access to a
block
Solution: increase block size
(2) Conflict (collision): Multiple memory locations mapped to the same cache
location. Occur in set associative or direct mapped cache
Solution: increase associativity
(3) Capacity: Cache cannot contain all blocks accessed by the program
Solution: increase cache size

4. Draw a configuration showing a processor, four 16k 8-bit ROMs, and a bus
containing 16 address lines and 8 data lines. Add a chip-select logic block that
will select one of the four ROM modules for each of the 64K addresses.
Answer:

[Figure: a 2-to-4 decoder takes the two high-order address lines A15 and A14
from the CPU and asserts one of four chip-select (CS) lines, one per 16K × 8
ROM. The low-order address lines A13–A0 go to the address inputs of every ROM,
and all four ROMs share the 8-bit data bus D7–D0.]




96

1. Consider a cache with 2K blocks and a block size of 16 bytes. Suppose the
address is 32 bits.
(a) Suppose the cache is direct-mapped. Find the number of sets in the cache.
Compute the number of tag bits per cache block.
(b) Repeat part (a) when the cache becomes a 2-way set associative cache.
(c) Repeat part (a) when the cache becomes a fully associative cache.
Answer:
(a) The number of sets = the number of blocks in cache = 2K
The length of index field =11, and the length of offset field = 4
Tag size = 32 11 4 = 17
(b) The number of sets = 2K/2 = 1K
The length of index field =10, and the length of offset field = 4
Tag size = 32 10 4 = 18
(c) The number of sets = 1
The tag size = 32 4 = 28

2. Suppose we have two implementations of the same instruction set architecture.
Computer A has a clock cycle time of 500 ps, and computer B has a clock rate of
2.5 GHz. Consider a program having 1000 instructions.
(a) Suppose computer A has a clock cycles per instruction (CPI) 2.3 for the
program. Find the CPU time (in ns) for the computer A.
(b) Suppose the CPU time of the computer B is 800 ns for the same program.
Compute the CPI of computer B for the program.
Answer:
(a) CPU time for computer A = 1000 × 2.3 × 500 ps = 1150 ns
(b) 800 ns = 1000 × CPI_B × 0.4 ns, so CPI_B = 2

3. Assume a MIPS processor executes a program having 800 instructions. The
frequency of loads and stores in the program is 25%. Moreover, an instruction
cache miss rate for the program is 1%, and a data cache miss rate is 4%. The miss
penalty is 100 cycles for all misses.
(a) Find the total number of instruction miss cycles.
(b) Find the total number of data miss cycles.
Answer:
(a) The total instruction miss cycles = 800 × 1% × 100 = 800
(b) The total data miss cycles = 800 × 0.25 × 4% × 100 = 800


4. Consider a 5-stage (IF, ID, EX, MEM, WB) MIPS pipeline processor with hazard
detection unit. Suppose the processor has instruction memory for IF stage, and
data memory for MEM stage so that the structural hazard for memory references
can be avoided.
(a) Assume no forwarding unit is employed for the pipeline. We are given a code
sequence shown below.
LD R1, 10(R2); R1 MEM[R2 + 10]
SUB R4, R1, R6; R4 R1 R6
ADD R5, R1, R6; R5 R1 + R6
Show the timing of each instruction of the code sequence. Your answer may
be in the following form.
Instruction
Clock Cycle
1 2 3 4 5 6 7 8 9 10
LD R1, 10(R2) IF ID EX MEM WB
SUB R4, R1, R6
ADD R5, R1, R6
(b) Repeat part (a) when a forwarding unit is used.
(c) Consider another code sequence shown below.
SUB R1, R3, R8; R1 R3 R8
SUB R4, R1, R6; R4 R1 R6
ADD R5, R1, R6; R5 R1 + R6
Suppose both hazard detector and forwarding unit are employed. Show the
timing of each instruction of the code sequence.
Answer:
(a) Suppose that the register read and write can happen in the same clock cycle.
Instruction
Clock Cycle
1 2 3 4 5 6 7 8 9 10
LD R1, 10(R2) IF ID EX MEM WB
SUB R4, R1, R6 IF ID ID ID EX MEM WB
ADD R5, R1, R6 IF IF IF ID EX MEM WB
(b)
Instruction
Clock Cycle
1 2 3 4 5 6 7 8 9 10
LD R1, 10(R2) IF ID EX MEM WB
SUB R4, R1, R6 IF ID ID EX MEM WB
ADD R5, R1, R6 IF IF ID EX MEM WB
(c)
Instruction
Clock Cycle
1 2 3 4 5 6 7 8 9 10
SUB R1, R3, R8 IF ID EX MEM WB
SUB R4, R1, R6 IF ID EX MEM WB
ADD R5, R1, R6 IF ID EX MEM WB

95

1. (1) What is the decimal value of the following 32-bit two's complement number?
11111111 11111111 11111011 01111100
(2) What is the decimal value of the following IEEE 754 single-precision binary
representation?
1 10000100 01000000000000000000000
Answer:
(1) 11111111111111111111101101111100_2 is negative; taking its two's
complement, the value is −(00000000000000000000010010000100_2)
= −(2^10 + 2^7 + 2^2) = −1156_ten
(2) −1.01_2 × 2^(132−127) = −1.01_2 × 2^5 = −101000_2 = −40_ten
= 40
10


2. Consider a five-stage (IF, ID, EX, MEM and WB) pipeline processor with hazard
detection and data forwarding units. Assume the processor has instruction
memory for IF stage and data memory for MEM stage so that the structural
hazard for memory references can be avoided.
(1) Suppose the following code sequence is executed on the processor. Determine
the average CPI (clock cycles per instruction) of the code sequence.
ADD R1, R2, R3; R1 R2+R3
SUB R4, R5, R6; R4 R5 R6
(2) Repeat Part (1) for the following code sequence.
ADD R1, R2, R3; R1 R2+R3
SUB R4, R1, R6; R4 R1 R6
(3) Repeat Part (1) for the following code sequence.
LD R3, 10(R7) R3 MEM[R7 + 10]
ADD R1, R2, R3; R1 R2 + R3
SUB R4, R1, R6; R4 R1 R6
Answer:
(1) Clock cycles = (5 1) + 2 = 6
CPI = clock cycles / instruction count = 6 / 2 = 3
(2) Although there is a data hazard between instruction ADD and SUB, it can be
resolved by forwarding unit and no pipeline stall is needed.
Clock cycles = (5 1) + 2 = 6
CPI = clock cycles / instruction count = 6 / 2 = 3
(3) The data between ADD and SUB can be resolved by forwarding unit but the
data hazard between LD and ADD require one clock stall.
Hence, Clock cycles = (5 1) + 3 + 1 = 8
CPI = clock cycles / instruction count = 8 / 3 = 2.67


3. Consider a five-stage (IF, ID, EX, MEM and WB) pipeline processor with
instruction memory for IF stage and data memory for MEM stage. Suppose the
following code sequence is executed on the processor.
LD R2, 100(R1); R2 MEM[R1 + 100]
LD R4, 200(R3); R4 MEM[R3 + 200]
ADD R6, R2, R4; R6 R2 + R4
SUB R8, R2, R4; R8 R2 R4
SD R6, 120(R1); MEM[R1 + 120] R6
SD R8, 120(R3); MEM[R3 + 120] R8
(1) Determine the total number of memory references.
(2) Determine the percentage of the memory references which are data
references.
Answer:
(1) Every instruction must be fetched from memory, and there are 4 load/store
instructions in the code sequence. Hence the number of memory references
= 6 + 4 = 10
(2) The percentage of memory references that are data references = 4 / 10 = 40%

4. (1) Consider a direct mapped cache with 64 blocks and a block size of 32 bytes.
What block number does byte address 1600 map to?
(2) Repeat Part (1) for the byte address 3209.
(3) With a 32-bit virtual address, 8-KB pages, and 4 bytes per page table entry,
determine the total page table size (in MB).
Answer:
(1) Memory block address = ⌊1600 / 32⌋ = 50; 50 mod 64 = 50
Byte address 1600 maps to cache block 50.
(2) Memory block address = ⌊3209 / 32⌋ = 100; 100 mod 64 = 36
Byte address 3209 maps to cache block 36.
(3) Number of virtual pages = 2^32 / 8K = 2^19, so there are 2^19 entries in the
page table.
Size of the page table = 2^19 × 4 bytes = 2 MB






94

1. Consider the 5-stage pipeline shown below.
















(a) Use the following load instruction as an example.
LD R5, 128(R1); #R5 ← M[128 + R1]
Briefly explain the major operations of the pipeline at each stage.
(b) Consider the following code sequence.
LD R5, 128(R1); #R5 ← M[128 + R1]
ADD R3, R2, R5; # R3 ← R2 + R5
Will the execution of the ADD instruction cause a data hazard? Justify your
answer. If your answer is YES, determine whether the data hazard can be
removed by a forwarding technique.
(c) Consider the following code sequence.
ADD R5, R6, R7; # R5 ← R6 + R7
BNZ R5, exit; # goto exit if R5 ≠ 0
LD R2, 64(R1); # R2 ← M[64 + R1]
exit: ADD R8, R7, R8; # R8 ← R7 + R8
Will the execution of the BNZ instruction cause a data hazard? Justify your
answer. If your answer is YES, determine whether the data hazard can be
removed by a forwarding technique.
Answer:
(a) IF stage: instruction LD R5, 128(R1) is fetched from instruction memory.
ID stage: the instruction is decoded and registers R5 and R1 are read from the
register file.
EX stage: the memory address is calculated: sign-ext(128) + [R1]
MEM stage: data is read from memory using the address calculated at EX
stage.
[Figure: 5-stage pipeline datapath — PC, instruction memory, register file, sign
extension, ALU, data memory, and the IF/ID, ID/EX, EX/MEM, MEM/WB pipeline
registers]

WB stage: the data read from memory at the MEM stage is now written into R5.
(b) YES. The load-use data hazard cannot be removed completely by the forwarding
technique. We must stall one clock cycle between these two instructions and
then use forwarding to resolve the hazard.
(c) YES. According to the datapath, the BNZ resolves the branch in the ID stage,
so the forwarding technique cannot resolve the hazard completely and we still
have to stall for one clock cycle.

2. Consider a cache having 8K blocks. There are two words in each block. Each
word contains 4 bytes. Suppose the main memory is byte-addressed with a 32-bit
address bus.
(a) Suppose the cache is a four-way set associative cache. Find the total number
of sets and total number of tag bits.
(b) Suppose the cache is a fully associative cache. Find the total number of sets
and total number of tag bits.
Answer:
(a) Total number of sets = 8K / 4 = 2K = 2^11
The tag field has 32 − 3 − 11 = 18 bits
The total number of tag bits = 2K × 4 × 18 = 144 Kbits
(b) Total number of sets = 1
The tag field has 32 − 3 = 29 bits
The total number of tag bits = 1 × 8K × 29 = 232 Kbits
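Both cases follow one formula, sketched below in Python; tag_storage is our own helper name, and the 8-byte block size comes from the problem (2 words × 4 bytes):

```python
import math

# 8K blocks, 8 bytes per block, 32-bit byte addresses (from the problem).
ADDR_BITS, BLOCKS, BLOCK_BYTES = 32, 8 * 1024, 8
offset_bits = int(math.log2(BLOCK_BYTES))        # 3 bits of block offset

def tag_storage(ways):
    sets = BLOCKS // ways
    index_bits = int(math.log2(sets))
    tag_bits = ADDR_BITS - index_bits - offset_bits
    return sets, tag_bits, BLOCKS * tag_bits     # total tag storage in bits

print(tag_storage(4))        # 4-way set associative
print(tag_storage(BLOCKS))   # fully associative (one set)
```

Note 8192 × 18 = 147456 bits = 144 Kbits and 8192 × 29 = 237568 bits = 232 Kbits, matching the answer.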

3. Briefly describe the LRU scheme for block replacement in a cache. Why may the
LRU scheme not be well suited for a fully associative cache? Justify your answer.
Answer:
(a) The block replaced is the one that has been unused for the longest time.
(b) Because, with all blocks in a single set, tracking exact usage information for
every block becomes costly in hardware.

4. Consider all the RAID systems (except the RAID 2).
(a) Which RAID system has no redundancy to tolerate disk failure?
(b) Which RAID system allows the recovery from the second failure?
Answer:
(a) RAID 0
(b) RAID 6 (P + Q)


5. What is the dynamic branch prediction? Briefly describe how a branch prediction
buffer can be used for the dynamic branch prediction.
Answer:
(a) Prediction of branches at runtime using runtime information
(b) A branch prediction buffer is a small memory indexed by the lower portion of
the address of the branch instruction. The memory contains a bit that says
whether the branch was recently taken or not.




93

1. Consider a direct-mapped cache with 32K bytes of data and one-word (4-byte)
block. Suppose the main memory is byte-addressed with a 32-bit address bus.
(a) Determine the number of blocks in the cache.
(b) How many bits are required in the tag field associated with each cache block?
(c) Determine the total cache size (in bits).
Answer:
(a) 32K / 4 = 8K blocks = 2^13 blocks
(b) The tag field has 32 − 13 − 2 = 17 bits
(c) The total cache size = 8K × (1 + 17 + 32) = 400K bits

2. (a) What is a translation look-aside buffer (TLB)?
(b) Does a TLB miss imply a page fault? Explain your answer.
Answer:
(a) A cache that keeps track of recently used address mappings to avoid an access
to the page table (i.e., a cache for the page table).
(b) No. The TLB holds only a subset of the page-table mappings, so a TLB miss
may still hit in the page table, in which case no page fault occurs. A page
fault occurs only when the page-table entry itself is invalid (the page is not
in physical memory).

3. Consider a processor with a five-stage pipeline as shown below:
Stage 1 IF Instruction fetch
Stage 2 ID Instruction decode and register file read
Stage 3 EX Execution or address calculation
Stage 4 MEM Data memory access
Stage 5 WB Write back
(a) Identify all the hazards in the following code.
Loop: ADD R2, R3, R4; R2 ← R3 + R4
ADD R5, R2, R6; R5 ← R2 + R6
SD R5, 100(R0); M[R0 + 100] ← R5
ADD R0, R0, -1; R0 ← R0 - 1
BNZ R0, Loop; If R0 ≠ 0, goto Loop
(b) Which hazards found in part (a) can be resolved via forwarding?
Answer:
(a) lines (1, 2) for R2, lines (2, 3) for R5, lines (4, 5) for R0
(b) Data hazards for (1, 2) and (2, 3) can be resolved via forwarding.
If the branch decision is made at the MEM stage then (4, 5) can be resolved via
forwarding. If the branch decision is made at the ID stage then (4, 5) cannot be
resolved via forwarding.

4. The snooping protocols are the most popular protocols for maintaining cache
coherence in a multiprocessor system. The snooping protocols are of two types:
write-invalidate and write-update.
(a) Briefly describe each type of the snooping protocol.
(b) Which type has less demand on bus bandwidth? Explain your answer.
Answer:
(1) Write-invalidate: The writing processor causes all copies in other caches to be
invalidated before changing its local copy; the writing processor issues an
invalidation signal over the bus, and all caches check to see if they have a
copy; if so, they must invalidate the block containing the word.
Write-update: Rather than invalidate every block that is shared, the writing
processor broadcasts the new data over the bus; all copies are then updated
with the new value. This scheme, also called write-broadcast, continuously
broadcasts writes to shared data, while write-invalidate deletes all other
copies so that there is only one local copy for subsequent writes.
(2) Write-invalidate has less demand on bus bandwidth, because write-update
must broadcast every write to shared data over the bus, which would require
too much bus bandwidth.


92

1. (a) Describe the IEEE 754 floating-point standard
(b) Show the IEEE 754 binary number representation of the decimal numbers
-0.1875 in single precision.
Answer:
(a) IEEE Standard 754 floating point is the most common representation today
for real numbers on computers. The characteristics of the IEEE 754 are
described as follows:
1. The sign bit is 0 for positive, 1 for negative.
2. The exponent's base is two.
3. The exponent field contains 127 plus the true exponent for
single-precision, or 1023 plus the true exponent for double precision.
4. The first bit of the mantissa is typically assumed to be 1.f, where f is the
field of fraction bits.
(b) 0.1875 (ten) = 0.0011 (two) = 1.1 (two) × 2^−3
With the sign bit set for the negative value, the single-precision format is:
1 01111100 10000000000000000000000

2. (a) Briefly describe three major types of pipeline hazards.
(b) What is the branch prediction technique? Which type of the hazard may be
solved by the branch prediction technique?
(c) What is the data forwarding technique? Which type of the hazard may be
solved by the data forwarding technique?
Answer:
(a) Structural hazards: hardware cannot support the instructions executing in the
same clock cycle (limited resources)
Data hazards: attempt to use item before it is ready. (Data dependency:
instruction depends on result of prior instruction still in the pipeline)
Control hazards: attempt to make a decision before condition is evaluated
(branch instructions)
(b) The processor tries to predict whether the branch instruction will jump or not.
Branch prediction may resolve control hazard
(c) A method of resolving a data hazard by retrieving the missing data element
from internal buffers rather than waiting for it to arrive from
programmer-visible registers or memory.
Data hazard can be solved by the data forwarding technique.


3. (a) Briefly describe the direct-mapped cache structure.
(b) Briefly describe the fully associative cache structure.
(c) Suppose, we consider only the direct-mapped and fully associative cache
structures. Which structure has higher hardware cost for block searching?
Which structure usually has higher cache miss rate? Explain your answer
Answer:
(a) A cache structure in which each memory location is mapped to exactly one
location in the cache.
(b) A cache structure in which a block can be placed in any location in the cache.
(c) A fully associative cache has higher hardware cost for block searching because
it needs more comparators for the parallel comparison. Besides, we need more
cache bits to store the tags.
Direct-mapped cache has higher cache miss rate because conflicts among
memory locations are high.

4. (1) Briefly describe the basic concept of the direct memory access (DMA). What
advantages may the DMA have as compared with the polling and
interrupt-driven data transfer techniques?
(2) Briefly describe the three steps in a DMA transfer.
Answer:
(1) DMA: a mechanism that provides a device controller the ability to transfer
data directly to or from the memory without involving the processor.
Unlike the polling and interrupt-driven techniques, which both consume CPU
cycles during data transfer, DMA operates independently of the processor and
does not consume processor cycles for the transfer itself.
(2)
Step 1: The processor sets up the DMA by supplying the identity of the
device, the operation to perform on the device, the memory address
that is the source or destination of the data to be transferred, and the
number of bytes to transfer.
Step 2: The DMA starts the operation on the device and arbitrates for the
bus.
Step 3: Once the DMA transfer is complete, the controller interrupts the
processor.



93

1. Suppose a computer's address size is k bits (using byte addressing), the cache size
is S bytes, the block size is B bytes, and the cache is A-way set-associative.
Assume that B is a power of two, so B = 2^b. Figure out what the following
quantities are in terms of S, B, A, b, and k: the number of sets in the cache, the
number of index bits in the address, and the number of bits needed to implement
the cache. Derive the quantities step by step clearly and explain the reason for
each step.
Answer:
Address size: k bits
Cache size: S bytes/cache
Block size: B = 2^b bytes/block
Associativity: A blocks/set
Number of sets in the cache = S / (A × B), since each set holds A blocks of B bytes.
Number of bits for the index = log2(S / (A × B)) = log2(S / A) − b
Number of bits for the tag = k − log2(S / (A × B)) − b = k − log2(S / A)
Number of bits needed to implement the cache
= sets/cache × associativity × (data + tag + valid)
= (S / (A × B)) × A × (8B + (k − log2(S / A)) + 1)
= (S / B) × (8B + k − log2(S / A) + 1) bits

2. Here is a series of address references given as word addresses: 1, 4, 8, 5, 20, 17,
19, 56, 9, 11, 4, 43, 5, 6, 9, 17. Assume a 2-way set-associative cache with
four-word blocks and a total size of 32 words. The cache is initially empty and
adopts an LRU replacement policy. Label each reference in the list as a hit or a miss
and show the final contents of the cache.
Answer:
Address (decimal) | Address (binary) | Tag | Index | Hit/Miss | Set | Block0 | Block1
1 000001 0 0 Miss 0 0,1,2,3
4 000100 0 1 Miss 1 4,5,6,7
8 001000 0 2 Miss 2 8,9,10,11
5 000101 0 1 Hit 1 4,5,6,7
20 010100 1 1 Miss 1 4,5,6,7 20,21,22,23
17 010001 1 0 Miss 0 0,1,2,3 16,17,18,19

19 010011 1 0 Hit 0 0,1,2,3 16,17,18,19
56 111000 3 2 Miss 2 8,9,10,11 56,57,58,59
9 001001 0 2 Hit 2 8,9,10,11 56,57,58,59
11 001011 0 2 Hit 2 8,9,10,11 56,57,58,59
4 000100 0 1 Hit 1 4,5,6,7 20,21,22,23
43 101011 2 2 Miss 2 8,9,10,11 40,41,42,43
5 000101 0 1 Hit 1 4,5,6,7 20,21,22,23
6 000110 0 1 Hit 1 4,5,6,7 20,21,22,23
9 001001 0 2 Hit 2 8,9,10,11 40,41,42,43
17 010001 1 0 Hit 0 0,1,2,3 16,17,18,19

Set Block0 Block1
0 0,1,2,3 16,17,18,19
1 4,5,6,7 20,21,22,23
2 8,9,10,11 40,41,42,43
3
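The trace above can be replayed with a small Python sketch of the same cache (32 words, 2-way, 4-word blocks, so 4 sets with LRU replacement); the variable names are ours:

```python
# Word-address trace from the problem.
refs = [1, 4, 8, 5, 20, 17, 19, 56, 9, 11, 4, 43, 5, 6, 9, 17]
BLOCK_WORDS, SETS, WAYS = 4, 4, 2      # 32 words / (2 ways * 4 words/block) = 4 sets
cache = {s: [] for s in range(SETS)}   # per-set tag list, LRU order (front = LRU)

results = []
for addr in refs:
    block = addr // BLOCK_WORDS
    s, tag = block % SETS, block // SETS
    ways = cache[s]
    if tag in ways:
        results.append("hit")
        ways.remove(tag)
        ways.append(tag)               # move to MRU position
    else:
        results.append("miss")
        if len(ways) == WAYS:
            ways.pop(0)                # evict the LRU block
        ways.append(tag)
print(results)
```

Running this reproduces the hit/miss column of the table: misses at 1, 4, 8, 20, 17, 56, 43 and hits everywhere else.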


3. A superscalar MIPS machine is implemented as follows. Two instructions are
issued per clock cycle. One of the instructions could be an integer ALU operation
or branch, and the other could be a load or store. Given the following loop, please
unroll the loop twice first and then schedule the codes to maximize performance.
Indicate which instruction(s) will be executed in each clock cycle. Assume that
the loop index is a multiple of three.
Loop: lw $t0, 0($s1) // $t0 = array element
addu $t0, $t0, $s2 // add scalar in $s2
sw $t0, 0($s1) // store result
addi $s1, $s1, -4 // decrement pointer
bne $s1, $zero, Loop // branch if $s1 != 0
Answer:
(1) lw $t0, 12($s1)
addu $t0, $t0, $s2
sw $t0, 12($s1)
lw $t1, 8($s1)
addu $t1, $t1, $s2
sw $t1, 8($s1)
lw $t2, 4($s1)
addu $t2, $t2, $s2
sw $t2, 4($s1)
addi $s1, $s1, -12
bne $s1, $zero, Loop
(2)

ALU or branch instruction | Data transfer instruction | Clock cycle
Loop: addi $s1, $s1, -12 lw $t0, 0($s1) 1
lw $t1, 8($s1) 2
addu $t0, $t0, $s2 lw $t2, 4($s1) 3
addu $t1, $t1, $s2 sw $t0, 12($s1) 4
addu $t2, $t2, $s2 sw $t1, 8($s1) 5
bne $s1, $zero, Loop sw $t2, 4($s1) 6


4. Consider a pipelined MIPS machine with the following five stages:
IF: fetch instruction form memory
ID: read registers while decoding the instruction
EXE: execute the operation or calculate an address
MEM: access an operand in data memory
WB: write the result into a register
Given the following codes, identify all of the data dependencies and explain
which hazards can be resolved via forwarding.
lw $s0, 12($s1) // load data into $s0
add $s4, $s0, $s2 // $s4 = $s0 + $s2
addi $s2, $s0, 4 // $s2 = $s0 + 4
sw $s4, 12($s1) // store $s4 to memory
add $s2, $s3, $s1 // $s2 = $s3 + $s1
Answer:
lines (1, 2) for $s0: cannot be resolved by forwarding completely; one stall
cycle is needed.
lines (1, 3) for $s0: can be resolved by forwarding.
lines (2, 4) for $s4: can be resolved by forwarding.



95

1. Assume that a processor is a load-store RISC CPU, running with 600 MHz. The
instruction mix and clock cycles for a program as follows:
Instruction type Frequency Clock cycles
A 25% 2
B 10% 2
C 15% 3
D 30% 4
E 20% 1
(a) Find the CPI.
(b) Find the MIPS.
Answer:
(a) CPI = 0.25 × 2 + 0.1 × 2 + 0.15 × 3 + 0.3 × 4 + 0.2 × 1 = 2.55
(b) MIPS = (600 × 10^6) / (2.55 × 10^6) = 235.29
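The CPI and MIPS arithmetic can be checked with a short Python sketch (instruction-type labels taken from the table above):

```python
# (frequency, cycles) for instruction types A-E, from the problem table.
mix = {"A": (0.25, 2), "B": (0.10, 2), "C": (0.15, 3), "D": (0.30, 4), "E": (0.20, 1)}
cpi = sum(freq * cycles for freq, cycles in mix.values())
mips = 600e6 / (cpi * 1e6)        # MIPS = clock rate / (CPI * 10^6)
print(round(cpi, 2), round(mips, 2))
```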

2. We make an enhancement to a computer that improves some mode of execution
by a factor of 10. Enhanced mode is used 80% of the time, measured as a
percentage of the execution time when the enhanced mode is in use.
(a) What is the speedup we have obtained from fast mode?
(b) What percentage of the original execution time has been converted to fast
mode.
Hint: Amdahl's Law depends on the fraction of the original, unenhanced
execution time that could make use of the enhanced mode. Thus, we cannot directly
use this 80% measurement to compute speedup with Amdahl's Law.
Answer:
(a) Speedup = Time_unenhanced / Time_enhanced
The unenhanced time is the sum of the time that does not benefit from the 10
times faster speedup, plus the time that does benefit, but before its reduction
by the factor of 10. Thus,
Time_unenhanced = 0.2 × Time_enhanced + 10 × 0.8 × Time_enhanced = 8.2 × Time_enhanced
Substituting into the equation for speedup gives us:
Speedup = Time_unenhanced / Time_enhanced = 8.2
(b) Using Amdahl's Law, the given value of 10 for the enhancement factor, and
the value for Speedup from Part (a), we have:
8.2 = 1 / [(1 − f) + (f / 10)], so f = 0.9756
Solving shows that the enhancement can be applied 97.56% of the original
time.
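The same algebra can be sketched numerically in Python (variable names are ours; the normalization to a unit enhanced time is an assumption for illustration):

```python
t_enh = 1.0                                  # normalized enhanced execution time
t_unenh = 0.2 * t_enh + 10 * 0.8 * t_enh     # undo the 10x speedup on the 80% part
speedup = t_unenh / t_enh                    # Amdahl speedup
f = (1 - 1 / speedup) * 10 / 9               # solve 8.2 = 1 / ((1 - f) + f/10) for f
print(speedup, round(f, 4))
```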


3. The following code fragment processes two arrays and produces an important
value in register $v0. Assume that each array consists of 1000 words indexed 0
through 999, that the base addresses of the arrays are stored in $a0 and $a1
respectively, and their sizes (1000) are stored in $a2 and $a3, respectively.
Assume that the code is run on a machine with a 1 GHz clock. The required
number of cycles for instruction add, addi and sll are all 1 and for instructions lw
and bne are 2. In the worst case, how many seconds will it take to execute this
code?
sll $a2, $a2, 2
sll $a3, $a3, 2
add $v0, $zero, $zero
add $t0, $zero, $zero
outer: add $t4, $a0, $t0
lw $t4, 0($t4)
add $t1, $zero, $zero
inner: add $t3, $a1, $t1
lw $t3, 0($t3)
bne $t3, $t4, skip
addi $v0, $v0, 1
skip: addi $t1, $t1, 4
bne $t1, $a3, inner
addi $t0, $t0, 4
bne $t0, $a2, outer
Answer:
1. Before the outer loop there are 4 instructions, requiring 4 cycles.
2. The outer loop has 3 instructions before the inner loop and 2 after. The cycles
needed to execute them are 1 + 2 + 1 + 1 + 2 = 7 cycles per iteration, or 1000 × 7
cycles.
3. The inner loop requires 1 + 2 + 2 + 1 + 1 + 2 = 9 cycles per iteration and it
repeats 1000 × 1000 times, for a total of 9 × 1000 × 1000 cycles.
The total number of cycles executed is therefore 4 + (1000 × 7) + (9 × 1000 ×
1000) = 9007004. The overall execution time is therefore 9007004 / (1 × 10^9)
≈ 9 ms.
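The cycle count can be tallied in a few lines of Python (the per-section constants mirror the breakdown above):

```python
setup = 4                            # sll, sll, add, add before the outer loop
outer = 1 + 2 + 1 + 1 + 2            # add, lw, add before inner; addi, bne after
inner = 1 + 2 + 2 + 1 + 1 + 2        # add, lw, bne, addi, addi, bne (worst case)
total = setup + 1000 * outer + 1000 * 1000 * inner
print(total, total / 1e9)            # total cycles and seconds at 1 GHz
```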


4. Draw the gates for the Sum bit of an adder for the following equation (!a means
NOT a).
Sum = (a · !b · !CarryIn) + (!a · b · !CarryIn) + (!a · !b · CarryIn) + (a · b · CarryIn)
Answer:
[Figure: two-level AND-OR gate network with inputs a, b, and CarryIn (and their
complements) implementing the Sum equation]
5. (a) Please explain the difference between write-through policy and write-back
policy?
(b) Assume that the instruction cache miss rate is 4% and the data cache miss rate
is 5%. If a processor has a CPI of 2.0 without any memory stalls and the miss
penalty is 200 cycles for all misses, determine how much faster a processor
would run with a perfect cache that never missed? Here, the frequency of loads
and stores is 35%.
Answer:
(a) Write-through: The information is written to both the block in the cache and to
the block in the lower level of the memory hierarchy.
Write-back: The information is written only to the block in the cache. The
modified block is written to the lower level of the hierarchy only when it is
replaced.
(b) The CPI considering stalls is 2 + 0.04 × 200 + 0.05 × 0.35 × 200 = 13.5
The processor with a perfect cache runs 13.5 / 2 = 6.75 times faster than
the processor without a perfect cache.
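The stall arithmetic above can be verified with a tiny Python sketch (the 0.35 load/store frequency and 200-cycle penalty come from the problem):

```python
cpi_stall = 2 + 0.04 * 200 + 0.05 * 0.35 * 200   # instruction + data miss stalls
speedup = cpi_stall / 2                          # vs. perfect-cache CPI of 2.0
print(round(cpi_stall, 2), round(speedup, 2))
```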






6. Please explain the following terms: (a) compulsory misses, (b) capacity misses,
and (c) conflict misses.
Answer:
(a) A cache miss caused by the first access to a block that has never been in the
cache.
(b) A cache miss that occurs because the cache, even with full associativity,
cannot contain all the blocks needed to satisfy the request.
(c) A cache miss that occurs in a set-associative or direct-mapped cache when
multiple blocks compete for the same set.




96

1. Given the number 0x811F00FE, what is it interpreted as:
(a) Four two's complement bytes?
(b) Four unsigned bytes?
Answer:
0x811F00FE = 1000 0001 0001 1111 0000 0000 1111 1110 (binary)
(a) −127, 31, 0, −2
(b) 129, 31, 0, 254
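A short Python sketch confirms both interpretations (variable names are ours):

```python
val = 0x811F00FE
raw = [(val >> s) & 0xFF for s in (24, 16, 8, 0)]      # four bytes, MSB first
signed = [b - 256 if b >= 128 else b for b in raw]     # two's complement bytes
print(signed, raw)
```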

2. Given the following instruction mix, what is the CPI for this processor?
Operation Frequency CPI
A 50% 1
B 15% 4
C 15% 3
D 10% 4
E 5% 1
F 5% 2
Answer:
CPI = 1 × 0.5 + 4 × 0.15 + 3 × 0.15 + 4 × 0.1 + 1 × 0.05 + 2 × 0.05 = 2.1

3. The following piece of code has pipeline hazard(s) in it. Please try to reorder the
instructions and insert the minimum number of NOP to make it hazard-free.
(Note: Assume all the necessary forwarding logics exist)
haz: move $5, $0
lw $10, 1000($20)
addiu $20, $20, -4
addu $5, $5, $10
bne $20, $0, haz
Answer:
haz: lw $10, 1000($20)
addiu $20, $20, -4
move $5, $0
bne $20, $0, haz
addu $5, $5, $10


4. Given a MIPS machine with 2-way set-associative cache that has 2-word blocks
and a total size of 32 words. Assume that the cache is initially empty, and that it
uses an LRU replacement policy. Given the following memory accesses in
sequence:
0ff00f70
0ff00f60
0fe0012c
0ff00f5c
0fe0012c
0fe001e8
0f000f64
0f000144
0fe00204
0ff00f74
0f000f64
0f000128
(a) Please label whether they will be hits or misses.
(b) Please calculate the hit rate.
Answer:
(a)
Byte address (hex) | Tag (hex part, binary part) | Index | Offset | Hit/Miss
0ff00f70 0ff00f 01 110 000 Miss
0ff00f60 0ff00f 01 100 000 Miss
0fe0012c 0fe001 00 101 100 Miss
0ff00f5c 0ff00f 01 011 100 Miss
0fe0012c 0fe001 00 101 100 Hit
0fe001e8 0fe001 11 101 000 Miss
0f000f64 0f000f 01 100 100 Miss
0f000144 0f0001 01 000 100 Miss
0fe00204 0fe002 00 000 100 Miss
0ff00f74 0ff00f 01 110 100 Hit
0f000f64 0f000f 01 100 100 Hit
0f000128 0f0001 00 101 000 Miss

(b) Hit rate = 3/12 = 0.25 = 25%


5. The speed of the memory system affects the designer's decision on the size of the
cache block. Which of the following cache designer guidelines are generally valid?
Why?
(a) The shorter the memory latency, the smaller the cache block.
(b) The shorter the memory latency, the larger the block.
(c) The higher the memory bandwidth, the smaller the cache block.
(d) The higher the memory bandwidth, the larger the cache block.
Answer: (a) and (d)
A shorter memory latency means a lower miss penalty, which favors smaller blocks;
higher memory bandwidth favors larger blocks, since the additional transfer time
for a larger block is only slightly larger.

6. Please state whether the following techniques are associated primarily with a
software- or hardware-based approach to exploiting ILP. In some cases, the
answer may be both.
(a) Branch prediction
(b) Dynamic scheduling
(c) Out-of-order execution
(d) EPIC
(e) Speculation
(f) Multiple issue
(g) Superscalar
(h) Reorder buffer
(i) Register renaming
(j) Predication
Answer:
(a) (b) (c) (d) (e) (f) (g) (h) (i) (j)
B H H B B B H H B S
H: hardware, S: software, B: both

7. What is Saturating Arithmetic? What kinds of instructions use this feature?
Answer:
(1) Saturation arithmetic is used in graphics routines. As an example, assume you
add together two medium-red pixels. Saturating arithmetic ensures the result
is a dark red or black. It's certainly different from regular integer math, where
the same operation could wrap around and end up with a light-colored result.
(2) Intel MMX supports both signed and unsigned saturating arithmetic.


8. Please describe the Shift-and-Add multiplier architecture and its control steps.
Answer:
Shift-and-Add multiplier architecture control steps



















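The control steps can be sketched as Python for 32-bit unsigned operands; the function name is ours, and Python's unbounded integers stand in for the 64-bit Product register:

```python
def shift_add_multiply(multiplicand, multiplier):
    # Product register: high 32 bits start at 0, low 32 bits hold the multiplier.
    product = multiplier & 0xFFFFFFFF
    for _ in range(32):                                  # 32 repetitions
        if product & 1:                                  # test Product0
            product += (multiplicand & 0xFFFFFFFF) << 32 # add into the left half
        product >>= 1                                    # shift Product right 1 bit
    return product

print(shift_add_multiply(6, 7))   # 42
```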
9. What fields are contained in a TLB (translation lookaside buffer)? What are the
purposes of these fields?
Answer:
(a) valid: indicates that the page to be accessed is in physical memory.
(b) dirty: indicates whether this page should be written back.
(c) reference: helps decide which page should be replaced.
(d) tag: identifies whether the associated mapping is in the TLB.
(e) physical page number: indicates which physical page the virtual page is
mapped to.

10. How many tag-comparators are needed in a 2-way set associative cache controller?
Why?
Answer:
2 comparators
Because each set has two blocks and both blocks must be searched in parallel.

11. What is Cache Line Width? Why is it larger than the word-size of CPU?
Answer:
(a) Cache line width: cache block sizes, i.e., byte in a cache block
(b) To include more spatial locality.

12. Use Verilog or VHDL languages to design a one-bit 8-to-1 multiplexer circuit.
Answer:
LIBRARY IEEE;
USE IEEE.STD_LOGIC_1164.ALL;

ENTITY mux8_1 IS

PORT
(sel :IN STD_LOGIC_VECTOR(2 downto 0);
d0, d1, d2, d3, d4, d5, d6, d7 :IN STD_LOGIC;
z :OUT STD_LOGIC);

END mux8_1;

ARCHITECTURE behavior OF mux8_1 IS
BEGIN

WITH sel SELECT
z <= d0 when "000",
d1 when "001",
d2 when "010",
d3 when "011",
d4 when "100",
d5 when "101",
d6 when "110",
d7 when "111",
'0' when others;

END behavior;



95

1. Convert these RTL descriptions for a multi-cycle MIPS CPU datapath into a
control specification and FSM state diagram.
Step Name | Action for R-type Instructions | Action for Memory-Reference Instructions | Action for branches | Action for jumps
Instruction fetch
IR ← Memory[PC]
PC ← PC + 4
Instruction decode/register fetch
A ← Reg[IR[25-21]]
B ← Reg[IR[20-16]]
ALUOut ← PC + sign-extend(IR[15-0]) << 2
Execution, address computation,
branch/jump completion
ALUOut ← A op B
ALUOut ← A + sign-extend(IR[15-0])
If (A==B) then PC ← ALUOut;
PC ← {PC[31-28], (IR[25-0] << 2)}
Memory Access or R-type completion
Reg[IR[15-11]] ← ALUOut
Load: MDR ← Memory[ALUOut]
or
Store: Memory[ALUOut] ← B

Memory read completion Load: Reg[IR[20-16]] ← MDR

Answer:






2. A multi-cycle CPU has 3 implementations. The first one is a 5-cycle
IF-ID-EX-MEM-WB design running at 4.8 GHz, where load takes 5 cycles,
store/R-type 4 cycles and branch/jump 3 cycles. The second one is a 6-cycle
design running at 5.6 GHz, with MEM replaced by MEM1 & MEM2. The third is a
7-cycle design running at 6.4 GHz, with IF further replaced by IF1 & IF2.
Assume we have an instruction mix: load 26%, store 10%, R-type 49%,
branch/jump 15%. Do you think it is worthwhile to go for the 6-cycle design over
the 5-cycle design? How about the 7-cycle design, is it worthwhile? Please give
your rationales.
Answer:
The average CPI for implementation 1 is:
5 × 0.26 + 4 × 0.1 + 4 × 0.49 + 3 × 0.15 = 4.11
The execution time for an instruction in implementation 1 = 4.11/4.8G = 0.86 ns
The average CPI for implementation 2 is:
6 × 0.26 + 5 × 0.1 + 4 × 0.49 + 3 × 0.15 = 4.47
The execution time for an instruction in implementation 2 = 4.47/5.6G = 0.80 ns
The average CPI for implementation 3 is:
7 × 0.26 + 6 × 0.1 + 5 × 0.49 + 4 × 0.15 = 5.47
The execution time for an instruction in implementation 3 = 5.47/6.4G = 0.85 ns
It is worthwhile to go for the 6-cycle design over the 5-cycle design, but it is not
worthwhile to go for 7-cycle design over the 6-cycle design.
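The comparison can be sketched in Python; the per-class cycle counts and clock rates come from the problem, and the design labels are ours:

```python
mix = (0.26, 0.10, 0.49, 0.15)                    # load, store, R-type, branch/jump
designs = [("5-cycle", (5, 4, 4, 3), 4.8),
           ("6-cycle", (6, 5, 4, 3), 5.6),
           ("7-cycle", (7, 6, 5, 4), 6.4)]
times = {}
for name, cycles, ghz in designs:
    cpi = sum(f * c for f, c in zip(mix, cycles))
    times[name] = cpi / ghz                       # ns per instruction
    print(name, round(cpi, 2), round(times[name], 2), "ns")
```

The 6-cycle design has the lowest time per instruction, so only the first upgrade pays off.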

3. We have a program core consisting of five conditional branches. The program
core will be executed thousands of times. Below are the outcomes of each branch
for one execution of the program core (T for taken, N for not taken).
Branch 1: T-T-T
Branch 2: N-N-N-N
Branch 3: T-N-T-N-T-N
Branch 4: T-T-T-N-T
Branch 5: T-T-N-T-T-N-T
Assume the behavior of each branch remains the same for each program core
execution. For dynamic schemes, assume each branch has its own prediction
buffer and each buffer initialized to the same state before each execution. List the
predictions for the following branch prediction schemes:
a. Always taken
b. Always not taken
c. 1-bit predictor, initialized to predict taken
d. 2-bit predictor, initialized to weakly predict taken
What are the prediction accuracies?
Answer:


(a)
Branch 1: prediction: T-T-T,
Branch 2: prediction: T-T-T-T,
Branch 3: prediction: T-T-T-T-T-T,
Branch 4: prediction: T-T-T-T-T,
Branch 5: prediction:
T-T-T-T-T-T-T,
right: 3, wrong: 0
right: 0, wrong: 4
right: 3, wrong: 3
right: 4, wrong: 1
right: 5, wrong: 2
Total: right: 15, wrong: 10, Accuracy = 100% × 15/25 = 60%
(b)
Branch 1: prediction: N-N-N,
Branch 2: prediction: N-N-N-N,
Branch 3: prediction: N-N-N-N-N-N,
Branch 4: prediction: N-N-N-N-N,
Branch 5: prediction:
N-N-N-N-N-N-N,
right: 0, wrong: 3
right: 4, wrong: 0
right: 3, wrong: 3
right: 1, wrong: 4
right: 2, wrong: 5
Total: right: 10, wrong: 15, Accuracy = 100% × 10/25 = 40%
(c)
Branch 1: prediction: T-T-T,
Branch 2: prediction: T-N-N-N,
Branch 3: prediction: T-T-N-T-N-T,
Branch 4: prediction: T-T-T-T-N,
Branch 5: prediction:
T-T-T-N-T-T-N,
right: 3, wrong: 0
right: 3, wrong: 1
right: 1, wrong: 5
right: 3, wrong: 2
right: 3, wrong: 4
Total: right: 13, wrong: 12, Accuracy = 100% × 13/25 = 52%
(d)
Branch 1: prediction: T-T-T,
Branch 2: prediction: T-N-N-N,
Branch 3: prediction: T-T-T-T-T-T,
Branch 4: prediction: T-T-T-T-T,
Branch 5: prediction:
T-T-T-T-T-T-T,
right: 3, wrong: 0
right: 3, wrong: 1
right: 3, wrong: 3
right: 4, wrong: 1
right: 5, wrong: 2
Total: right: 18, wrong: 7, Accuracy = 100% × 18/25 = 72%
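Scheme (d) can be simulated with a saturating 2-bit counter per branch, sketched below in Python (states 0-1 predict N, 2-3 predict T; the function name and encoding are ours):

```python
# Branch outcomes from the problem, one string per branch.
branches = ["TTT", "NNNN", "TNTNTN", "TTTNT", "TTNTTNT"]

def two_bit_accuracy(init=2):          # init=2 means weakly predict taken
    right = total = 0
    for seq in branches:
        state = init                   # fresh predictor for each branch
        for outcome in seq:
            right += ("T" if state >= 2 else "N") == outcome
            total += 1
            # saturating update: move toward taken on T, toward not-taken on N
            state = min(state + 1, 3) if outcome == "T" else max(state - 1, 0)
    return right / total

print(two_bit_accuracy())   # 0.72
```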


95

1. (a) Show the IEEE 754 binary representation for the floating-point number
0.1 (ten) in single precision.
(b) Add 2.56 (ten) × 10^2 to 2.34 (ten) × 10^4, assuming that we have only 3
significant decimal digits (no guard and round digits are used).
(c) What's the number of ulp (units in the last place) of error in (b)?
(d) Assume that you have only 4 significant decimal digits. Round the number
12.4650 to nearest even.
Answer:
(a) 0.1 (ten) = 0.0001100110011... (two) = 1.100110011... (two) × 2^−4
Sign = 0, Significand = .10011...
Exponent = −4 + 127 = 123
0 01111011 10011001100110011001100
(b) 2.56 (ten) × 10^2 + 2.34 (ten) × 10^4 = 0.02 (ten) × 10^4 + 2.34 (ten) × 10^4
= 2.36 (ten) × 10^4
(c) 2
(d) 12.46

2. Suppose that in 1000 memory references there are 60 misses in the first-level
cache, 30 misses in the second-level cache, and 5 misses in the third-level cache.
Assume the miss penalty from the L3 cache to memory is 100 clock cycles, the
hit time of the L3 cache is 10 clocks, the hit time of the L2 cache is 5 clocks, the
hit time of L1 is 1 clock cycle, and there are 1.5 memory references per
instruction.
(a) What's the global miss rate for each level of the caches?
(b) What's the local miss rate for each level of the caches?
(c) What is the average memory access time?
(d) What is the average number of stall cycles per instruction?
Answer:
(a) L1 = 60/1000 = 0.06, L2 = 30/1000 = 0.03, L3 = 5/1000 = 0.005
(b) L1 = 60/1000 = 0.06, L2 = 30/60 = 0.5, L3 = 5/30 = 0.167
(c) AMAT = 1 + 0.06 × 5 + 0.03 × 10 + 0.005 × 100 = 2.1 clock cycles
(d) (2.1 − 1) × 1.5 = 1.65 clock cycles
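The AMAT and stall arithmetic can be checked in Python (miss counts per 1000 references and hit times taken from the problem):

```python
# Global miss rates times the next level's hit time (or memory penalty for L3).
amat = 1 + (60 / 1000) * 5 + (30 / 1000) * 10 + (5 / 1000) * 100
stalls_per_instr = (amat - 1) * 1.5     # 1.5 memory references per instruction
print(round(amat, 2), round(stalls_per_instr, 2))
```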

3. (a) Consider a virtual memory system with the following properties: 38-bit virtual
byte address, 8 KB pages, 36-bit physical byte address. What is the total size of
the page table for each process on this processor, assuming that the memory
management bits take a total of 8 bits and that all the virtual pages are in use?
(Assume each entry in the page table should be rounded up to full bytes.)
(b) Briefly describe at least 3 techniques to minimize the memory dedicated to
page tables.

Answer:
(a) Number of page table entries = 2^38 / 2^13 = 2^25
The bits in an entry = 8 + 23 = 31, rounded up to full bytes: 4 bytes per entry
The size of the page table = 2^25 × 4 bytes = 128 Mbytes.
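The sizing in part (a) can be checked with a short Python sketch (all constants come from the problem statement):

```python
entries = 2**38 // 2**13              # virtual pages: 38-bit VA, 8-KB pages
entry_bits = 8 + (36 - 13)            # management bits + physical page number
entry_bytes = (entry_bits + 7) // 8   # round up to whole bytes
size_mb = entries * entry_bytes // 2**20
print(entries, entry_bytes, size_mb)
```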
(b)
1. To keep a limit register that restricts the size of the page table for a given
process. If the virtual page number becomes larger than the contents of the
limit register, entries must be added to the page table
2. Maintain two separate page tables and two separate limits. The high-order
bit of an address usually determines which segment and thus which page
table to use for that address
3. Apply a hashing function to the virtual address so that the page table data
structure need be only the size of the number of physical pages in main
memory. Such a structure is called an inverted page table
4. Multiple levels of page tables: First level maps large fixed-size blocks of
virtual address space by segment table; Each entry in the page table points to
a page table for that segment
5. Page tables to be paged: allow the page tables to reside in the virtual address
space

4. (a) Suppose a pipelined processor has S stages. If the processor takes 110 ns to
execute N instructions and 310 ns to execute 3N instructions. What are S and N,
respectively? (Assume that the clock rate is 500 MHz and no pipeline stalls
occur)
(b) For a pipelined implementation, assume that one-quarter of the load
instructions are immediately followed by an instruction that uses the result, that
the branch delay on misprediction is 1 clock cycle, and that half of the
branches are mispredicted. Assume that jumps always pay 1 full clock cycle of
delay, so their average time is 2 clock cycles. If the instruction mix is 25%
loads, 10% stores, 52% ALU instructions, 11% branches, and 2% jumps,
please calculate the average CPI.
Answer:
(a) (S − 1) + N = 110/2 = 55 ... (1)
(S − 1) + 3N = 310/2 = 155 ... (2)
Solving gives N = 50 and S = 6
(b) CPI = 1 + (0.25 × 0.25 × 1 + 0.11 × 0.5 × 1 + 0.02 × 1) = 1.1375
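Both parts reduce to a few lines of Python arithmetic (the 2 ns cycle time follows from the 500 MHz clock):

```python
# Part (a): subtract equation (1) from (2): 2N = 155 - 55.
N = (155 - 55) // 2
S = 55 - N + 1
# Part (b): base CPI 1 plus load-use, branch-misprediction, and jump penalties.
cpi = 1 + (0.25 * 0.25 * 1 + 0.11 * 0.5 * 1 + 0.02 * 1)
print(S, N, cpi)
```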


5. (a) Suppose we have a benchmark that executes in 100 seconds of elapsed time,
where 90 seconds is CPU time and the rest is I/O time. If CPU time improves
by 50% per year for the next five years but I/O time doesn't improve, how
much faster will our program run at the end of five years?
(b) Consider program P, which runs on a 1 GHz machine M in 10 seconds. An
optimization is made to P, replacing all instances of multiplying a value by 4
(mult X, X, 4) with two instructions that set X to X + X twice (add X, X; add X,
X). Call this new optimized program P′. The CPI of a multiply instruction is 4,
and the CPI of an add is 1. After recompiling, the program now runs in 9
seconds on machine M. How many multiplies were replaced by the new
compiler?
Answer:
(a)
After n years   CPU time              I/O time     Elapsed time
0               90 seconds            10 seconds   100 seconds
1               90/1.5 = 60 seconds   10 seconds   70 seconds
2               60/1.5 = 40 seconds   10 seconds   50 seconds
3               40/1.5 = 27 seconds   10 seconds   37 seconds
4               27/1.5 = 18 seconds   10 seconds   28 seconds
5               18/1.5 = 12 seconds   10 seconds   22 seconds
The improvement in elapsed time is 100/22 ≈ 4.5.
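The year-by-year table can be reproduced with a short loop (CPU time divides by 1.5 each year; I/O time stays fixed at 10 seconds):

```python
# CPU time improves by 50% per year (divides by 1.5); I/O time is constant.
cpu, io = 90.0, 10.0
for year in range(6):
    print(f"year {year}: CPU {cpu:.0f} s, I/O {io:.0f} s, elapsed {cpu + io:.0f} s")
    if year < 5:
        cpu /= 1.5

# Overall improvement, using the rounded final elapsed time of 22 s.
speedup = 100 / round(cpu + io)
print(round(speedup, 1))   # 4.5
```

Note that I/O time quickly dominates: after five years the unimproved 10 seconds of I/O is almost half the elapsed time, which is why the overall speedup is only about 4.5× despite the large CPU improvement.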

(b) The machine runs at 1 GHz, so 10 seconds is 10 × 10^9 cycles and 9 seconds
is 9 × 10^9 cycles; the optimization saves 10 × 10^9 − 9 × 10^9 = 10^9 cycles.
Replacing a mult with two adds saves 4 − 2 × 1 = 2 cycles per replacement.
Thus, we have 10^9 / 2 = 5 × 10^8 replacements.
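The cycle counting above checks out numerically:

```python
# 1 GHz clock: 10 s of execution is 10^10 cycles, 9 s is 9 * 10^9 cycles.
clock_hz = 1_000_000_000
cycles_saved = 10 * clock_hz - 9 * clock_hz    # 10^9 cycles saved

# Each mult (CPI 4) is replaced by two adds (CPI 1 each).
saved_per_replacement = 4 - 2 * 1              # 2 cycles per replacement

replacements = cycles_saved // saved_per_replacement
print(replacements)   # 500000000, i.e. 5 * 10^8
```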



95

1. Please define the following term:
a. Finite State Machine
b. Microprogramming
c. Pipeline Hazards
d. Branch Prediction
e. Superscalar
f. Dynamic Multiple Issue Execution (Out-of-order Execution)
Answer:
a. A sequential logic function consisting of a set of inputs and outputs, a next
state function that maps the current state and the inputs to a new state, and an
output function that maps the current state and possibly the inputs to a set of
asserted outputs.
b. A method of specifying control that uses microcode rather than a finite state
representation.
c. Situations in pipelining when the next instruction cannot execute in the
following clock cycle.
d. A method of resolving a branch hazard that assumes a given outcome for the
branch and proceeds from that assumption rather than waiting to ascertain the
actual outcome.
e. An advanced pipelining technique that enables the processor to execute more
than one instruction per clock cycle.
f. A situation in pipelined execution when an instruction blocked from executing
does not cause the following instructions to wait.

2. Identify all of the data dependencies in the following code. Which dependencies
are data hazards that will be resolved via forwarding? Which dependencies are
data hazards that will cause a stall?
add $3, $4, $2
sub $5, $3, $1
lw  $6, 200($3)
add $7, $3, $6
Answer:
Data dependencies: (line 1, line 2), (1, 3), (1, 4), (3, 4)
Data hazards: (1, 2), (1, 3), (3, 4)
Resolved via forwarding: (1, 2), (1, 3)
Causes a stall: (3, 4)
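The classification can be sketched in code for a standard 5-stage pipeline with forwarding: a RAW dependence at a distance of more than 2 instructions is resolved by the register file, a load immediately followed by a use of its result forces a stall, and the rest are covered by forwarding. The tuple encoding (op, dest, sources, is_load) is an illustrative assumption:

```python
# RAW-dependence classification for a 5-stage pipeline with forwarding.
code = [
    ("add", "$3", ["$4", "$2"], False),   # line 1
    ("sub", "$5", ["$3", "$1"], False),   # line 2
    ("lw",  "$6", ["$3"],       True),    # line 3
    ("add", "$7", ["$3", "$6"], False),   # line 4
]

results = {}
for i, (_, dest, _, is_load) in enumerate(code):
    for j in range(i + 1, len(code)):
        if dest in code[j][2]:            # RAW dependence: line i+1 -> line j+1
            dist = j - i
            if dist > 2:
                kind = "dependence only (resolved by the register file)"
            elif is_load and dist == 1:
                kind = "hazard: stall"
            else:
                kind = "hazard: forwarding"
            results[(i + 1, j + 1)] = kind

for pair, kind in sorted(results.items()):
    print(pair, kind)
```

Running this reproduces the answer: (1, 2) and (1, 3) are forwarded, (1, 4) is a dependence but not a hazard, and the load-use pair (3, 4) stalls.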


3. Please design a complete datapath of a pipelined processor with (a) Forwarding
Unit (b) Hazard Detection Unit (c) Stall (d) Exception/Interrupt (e) Branch
Prediction, for the following eight instructions: add, sub, and, or, lw,
sw, beq, j. Then explain how it works.
Answer:



(a) Forwarding Unit: resolve a data hazard by retrieving the missing data element
from internal buffers rather than waiting for it to arrive from programmer-visible
registers or memory.
(b) Hazard Detection Unit: stall and deassert the control fields if the load-use hazard
test is true.
(c) Stall: preserve the PC register and the IF/ID pipeline register from changing.
(d) Exception/Interrupt: a Cause register records the cause of the exception, and
the EPC saves the address of the instruction that caused the exception; the
instructions that follow the offending instruction are flushed.
(e) Branch Prediction: assume that the branch will not be taken; if the branch is
taken, the instructions that are being fetched must be discarded.
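The predict-not-taken policy in (e) can be illustrated with a toy fetch loop: instructions after a branch are fetched speculatively, and discarded when the branch turns out to be taken. The program encoding and helper name here are illustrative only:

```python
# Toy sketch of predict-not-taken: fetch falls through past a branch, and the
# speculatively fetched instruction is flushed when the branch is taken.
def run(program, taken_branches):
    """program: list of (pc, op); taken_branches: maps branch pc -> target pc."""
    executed, flushed = [], []
    pc = 0
    while pc < len(program):
        op = program[pc][1]
        executed.append(op)
        if op == "beq" and pc in taken_branches:
            if pc + 1 < len(program):
                flushed.append(program[pc + 1][1])   # fetched, then discarded
            pc = taken_branches[pc]                  # redirect fetch to target
        else:
            pc += 1
    return executed, flushed

prog = [(0, "add"), (1, "beq"), (2, "sub"), (3, "lw"), (4, "sw")]
executed, flushed = run(prog, {1: 3})   # the beq at pc 1 is taken, target pc 3
print(executed)   # ['add', 'beq', 'lw', 'sw']
print(flushed)    # ['sub']
```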
