Professional Documents
Culture Documents
PRACTICE PROBLEMS
COMPUTER ORGANIZATION AND
ARCHITECTURE
DESIGNING FOR PERFORMANCE
NINTH EDITION
WILLIAM STALLINGS
TABLE OF CONTENTS
-2-
Instruction
Comments
0
1
2
3L
3R
4L
4R
5L
5R
6L
6R
7L
7R
8L
8R
9L
9R
10L
10R
999
1
1000
LOAD M(2000)
ADD M(3000)
STOR M(4000)
LOAD M(0)
SUB M(1)
JUMP+ M(6, 20:39)
Constant (count N)
Constant
Constant
Transfer A(I) to AC
Compute A(I) + B(I)
Transfer sum to C(I)
Load count N
Decrement N by 1
Test N and branch to 6R if nonnegative
Halt
Update N
Increment AC by 1
Modify address in 3L
Modify address in 3R
Modify address in 4L
Branch to 3L
suite respectively.
Ni
m
j=1
or,
Nj
N1 + N 2 + N 3 ++ N m
or,
t1 + t 2 + t 3 ++ t m
t r + t r + t r ++ t m rm
WHM = 1 1 2 2 3 3
t1 + t 2 + t 3 ++ t m
WHM =
We know that T = t j
j=1
WHM =
t1
t
t
r1 + 2 r2 ++ m rm
T
T
T
-4-
and
Csys,A = 19$2,750 = $52,250.
So yes, chip B also lets us achieve a design with lower total cost that
still meets the $50,000 minimum cost constraint.
Incidentally, although we didnt ask you to compute it, the
optimal design with chip A would have a total throughput of 1980
w.u./s = 1,520 work units per second, while the design with chip B
will have a throughput of 3250 w.u/s = 1,600 work units per
second. So, the total throughput is also greater if we base our
design on chip B. This is because the total cost of the chip-B-based
design is about the same (near the $50k target), but the costperformance of chip B was greater.
2.6 Amdahls law is based on a fixed workload, where the problem size is
fixed regardless of the machine size. Gustafsons law is based on a
scaled workload, where the problem size is increased with the machine
size so that the solution time is the same for sequential and parallel
executions.
2.7 a. Say Program P1 consists of n x86 instructions, and hence 1.5 n
MIPS instructions. Computer A operates at 2.5 GHz, i.e. it takes
0.4ns per clock. So the time it takes to execute P1 is 0.4ns/clock 2
clocks/instructions 1.5 n instructions = 1.2 n ns. Computer B
operates at 3 GHz, i.e. 0.333ns per clock, so it executes P1 in 0.333
3 n = n ns. So Computer B is 1.2 times faster.
b. Going by similar calculations, A executes P2 in 0.4 1 1.5 n = 0.6
n ns while B executes it in 0.333 2 n = 0.666 n ns hence A is
1.11 times faster
2.8 a. CPI = .1 * 50 + .9 * 1 = 5.9
b. .1 * 50 * 100 / 5.9 = 84.75%
c. Divide would now take 25 cycles, so
CPI = .1 * 25 + .9 * 1 = 3.4. Speed up = 5.9/3.4 = 1.735x
d. Divide would now take 10 cycles, so
CPI = .1 * 10 + .9 * 1 = 1.9. Speed up = 5.9/1.9 = 3.105x
e. Divide would now take 5 cycles, so
CPI = .1 * 5 + .9 * 1 = 1.4. Speed up = 5.9/1.4 = 4.214x
f. Divide would now take 1 cycle, so
CPI = .1 * 1 + .9 * 1 = 1. Speed up = 5.9/1 = 5.9x
g. Divide would now take 0 cycles, so
CPI = .1 * 0 + .9 * 1 = .9. Speed up = 5.9/.9 = 6.55x
-6-
Bus System
Multistage Network
Crossbar Switch
Constant
O(logk n)
Constant
O(w/n) to O(w)
O(w) to O(nw)
O(w) to O(nw)
O(w)
O(n)
O(nwlogk n)
O(nlogk n)
O(n2w)
O(n2)
Some permutations
and broadcast, if
network
unblocked
nxn MIN using kk
switches with line
width of w bits
All permutations,
one at a time
Assume n processors
on the bus; bus width
is w bits
Assume nn
crossbar with line
width of w bits
3.2 a. Since the targeted memory module (MM) becomes available for
another transaction 600 ns after the initiation of each store
operation, and there are 8 MMs, it is possible to initiate a store
operation every 100 ns. Thus, the maximum number of stores that
can be initiated in one second would be:
109/102 = 107 words per second
Strictly speaking, the last five stores that are initiated are not
completed until sometime after the end of that second, so the
maximum transfer rate is really 107 5 words per second.
b. From the argument in part (a), it is clear that the maximum write
rate will be essentially 107 words per second as long as a MM is free
in time to avoid delaying the initiation of the next write. For clarity,
let the module cycle time include the bus busy time as well as the
internal processing time the MM needs. Thus in part (a), the module
cycle time was 600 ns. Now, so long as the module cycle time is 800
ns or less, we can still achieve 107 words per second; after that, the
maximum write rate will slowly drop off toward zero.
-7-
-8-
E=
th
th
1
=
=
t a p t h + (1 p)t m p + 1 p t m
( )
th
4.2 a. Using the rule x mod 8 for memory address x, we get {2, 10, 18,
26} contending for location 2 in the cache.
b. With 4-way set associativity, there are just two sets in 8 cache
blocks, which we will call 0 (containing blocks 0, 1, 2, and 3) and 1
(containing 4, 5, 6, and 7). The mapping of memory to these sets is
x mod 2; i.e., even memory blocks go to set 0 and odd memory
blocks go to set 1. Memory element 31 may therefore go to {4, 5, 6,
7}.
c. For the direct mapped cache, the cache blocks used for each memory
block are shown in the third row of the table below
order of reference 1 2 3 4 5 6 7 8
block referenced 0 15 18 5 1 13 15 26
Cache block
0 7 2 5 1 5 7 2
For the 4-way set associative cache, there is some choice as to
where the blocks end up, and we do not explicitly have any policy in
place to prefer one location over another or any history with which to
apply the policy. Therefore, we can imitate the direct-mapped
location until it becomes impossible, as shown below on cycle 5:
order of reference 1 2 3 4 5 6 7 8
block referenced 0 15 18 5 1 13 15 26
Cache block
0 7 2 5 x
It is required to map an odd-numbered memory block to one of {4,
5, 6, 7}, as shown in part (b).
d. In this example, nothing interesting happens beyond compulsory
misses until reference 6, before which we have the following
occupancy of the cache with memory blocks :
Cache block
0 1 2 3 4 5 6 7
-9-
Memory block 0 1 18
15
-10-
-13-
106 bytes/s. With two banks, the peak data rate is 1.6 109 bytes/s.
The average data rate is 1.5 800 106 bytes/s = 1.2 109 bytes/s.
5.4 a. SRAMs generally have lower latencies than DRAMs, so SRAMs would
be a better choice.
b. DRAMs have lower cost, so DRAMs would be the better choice.
-15-
b.
c.
d.
e.
-17-
CHAPTER 7 INPUT/OUTPUT
7.1
Memory Mapped
Advantages
Disadvantages
Simpler hardware
I/O Mapped
Advantages
Disadvantages
7.2 The first factor could be the limiting speed of the I/O device; Second
factor could be the speed of bus, the third factor could be no internal
buffering on the disk controller or too small internal buffering space.
Fourth factor could be erroneous disk or transfer of block.
7.3 Prefetching is a user-based activity, while spooling is a system-based
activity. Comparatively, spooling is a much more effective way of
overlapping I/O and processor operations. Another way of looking at it
is to say that prefetching is based upon what the processor might do in
the future, while spooling is based upon what the processor has done in
the past.
7.4 a. The device makes 150 requests, each of which require one interrupt.
Each interrupt takes 12,000 cycles (1000 to start the handler, 10,000
for the handler, 1000 to switch back to the original program), for a
total of 1,800,000 cycles spent handling this device each second.
b. The processor polls every 0.5 ms, or 2000 times/s. Each polling
attempt takes 500 cycles, so it spends 1,000,000 cycles/s polling. In
150 of the polling attempts, a request is waiting from the I/O device,
each of which takes 10,000 cycles to complete for another 1,500,000
cycles. Therefore, the total time spent on I/O each second is
2,500,000 cycles with polling.
-18-
c. In the polling case, the 150 polling attempts that find a request
waiting consume 1,500,000 cycles. Therefore, for polling to match
the interrupt case, an additional 300,000 cycles must be consumed,
or 600 polls, for a rate of 600 polls/s.
7.5 Without DMA, the processor must copy the data into the memory as the
I/O device sends it over the bus. Because the device sends 10 MB/s
over the I/O bus, which has a capacity of 100 MB/s, 10% of each
second is spent transferring data over the bus. Assuming the processor
is busy handling data during the time that each page is being
transferred over the bus (which is a reasonable assumption because the
time between transfers is too short to be worth doing a context switch),
then 10% of the processors time is spent copying data into memory.
With DMA, the processor is free to work on other tasks, except when
initiating each DMA and responding to the DMA interrupt at the end of
each transfer. This takes 2500 cycles/transfer, or a total of 6,250,000
cycles spent handling DMA each second. Because the processor operates
at 200 MHz, this means that 3.125% of each second, or 3.125% of the
processor's time, is spent handling DMA, less than 1/3 of the overhead
without DMA.
7.6 memory-mapped.
-19-
-21-
1096
446
10509
500
e.
f.
g.
h.
3666
63
63
63
9.2 :
a. 3123
b. 200
c. 21520
d. 1022200
e. 1000011
9.3
9.4
As with any other base, add one to the ones place. If that position
overflows, replace it with the lowest valued digit and carry one to the
next place.
0, 1, 1, 10, 11, 1, 10, 11, 10, 100, 101, 11, 110, 111,
1,10, 11, 10, 100, 101, 11
-22-
= (C1F3 0000)16
10.3
b.
c.
a.
999
-123
-----876
999
-000
-----999
999
-777
-----222
10.4 Notice that for a normalized floating point positive value, with a binary
exponent to the left (more significant) and the significand to the right
can be compared using integer compare instructions. This is especially
-23-
a.
10.6
a.
11111011
00101011
b. +15
c. 113
d.
b. 11011000 c. 00000011
-24-
+119
d.
Op0
0
1
0
1
0
1
0
1 1
-25-
r = a and b
r = a or b
co, r = ci+a+b
r=a xor b
r = a and b`
r = a or b`
co, r= ci+a-b
r = a xor b
12.2 From Table F.1 in Appendix F, 'z' is hex X'7A' or decimal 122; 'a' is
X'61' or decimal 97; 122 97 = 25
12.3 Let us consider the following: Integer divide 5/2 gives 1, 5/2 gives 2
on most machines.
Bit shift right, 5 is binary 101, shift right gives 10, or decimal 2.
In eight bit twos complement, -5 is 11111011. Arithmetic shift right
(shifting copies of the sign bit in) gives 11111101 or decimal -3.
Logical shift right (shifting in zeros) gives 01111101 or decimal 125.
-26-
-27-
11.3
rd
shift
amount
opcode
extension
R 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 1 0 0 1 0 0 0 0 0 1 0 0 0 0 0
I 1 0 0 0 1 1 0 1 0 0 1 0 1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0
opcode
rs
rt
operand/offset
0x02084820, 0x8d2afffc
-28-
a.
4 bytes wide
0x7fffe784
0x400100C
0x7fffe780
0x7fffe77c
0x400F028
0x7fffe778
0x7fffe774
0x400F028
0x7fffe770
0x7fffe76c
0x400F028
0x7fffe768
0x7fffe764
0x7fffe760
b. $a0 = 10 decimal
14.6 a. This branch is executed a total of ten times, with values of i from 1
to 10. The first nine times it isnt taken, the tenth time it is. The
first and second level predictors both predict correctly for the first
nine iterations, and mispredict on the tenth (which costs two
cycles). So, the mean branch cost is 2/10 or 0.2
b. This branch is executed ten times for every iteration of the outer
loop. Here, we have to consider the first outer loop iteration
separately from the remaining nine. On the first iteration of the
outer loop, both predictors are correct for the first nine inner loop
iterations, and both mispredict on the tenth. So on the first outer
loop iteration, we have a cost of 2 in the ten iterations. On the
remaining iterations of the outer loop, the first-level predictor will
mispredict on the first inner loop iteration, both predictors will be
correct for the next eight iterations, and both will mispredict on the
-30-
final iteration, for a cost of 3 in ten branches. The total cost will be
2 + 93 in 100 iterations, or 0.29
c. There are a total of 110 branch instructions, and a total cost of 31.
So the mean cost is 31/110, or .28
14.7 a.
1 2 3 4 5 6
S1 X
X
S2
X
X
S3
X
S4
X
b. {4}
c. Simple cycles: (1,5), (1,1,5), (1,1,1,5), (1,2,5), (1,2,3,5),
(1,2,3,2,5), (1,2,3,2,1,5),
(2,5), (2,1,5), (2,1,2,5), (2,1,2,3,5), (2,3,5), (3,5), (3,2,5),
(3,2,1,5), (3,2,1,2,5), (5),
(3,2,1,2), and (3).
The greedy cycles are (1,1,1,5) and (1,2,3,2)
d. MAL=(1+1+1+5)/4=2
14.8 a.
-31-
14.9 CPI = (base CPI) + (CPI due to branch hazards) + (CPI due to FP
hazards) + (CPI due to instruction accesses) + (CPI due to data
accesses)
Component
Base
Branch
FP
Instructions
Data
Explanation
Given in problem
branch % * misprediction % * misprediction penalty
FP % * avg penalty
AMAT for instructions minus 1 to take out the 1 cycle
of L1 cache hit that is included in the base CPI
(load % + store %) * AMAT for data adjusted as above
-32-
Value
1
0.1*0.2*1=0.02
0.2*0.9=0.18
1.476-1=0.476
(0.25+0.15)*(3.9281)=1.171
1
X
5
X
X
X
X
X
MemtoReg = X
MemWrite = 1
RegWrite = 0
15.3 a. The second instr. is dependent upon the first (because of $2)
The third instr. is dependent upon the first (because of $2)
The fourth instr. is dependent upon the first (because of $2)
The fourth instr. is dependent upon the second (because of $4)
There, All these dependencies can be solved by forwarding
-33-
b. Registers $11 and $12 are read during the fifth clock cycle.
Register $1 is written at the end of fifth clock cycle.
15.4 a. Expected instructions to next mispredict is e= 10/(1 p). Number
of instruction slots (two slots/cycle) to execute all those and then
get the next good path instruction to the issue unit is t = e +
2.20=10/(1 p) + 40 = (50 40p)/(1 p). So the utilization (rate
at which we execute useful instructions) is
e/t = 10/(50 40p)
b. If we increase the cache size the branch mispredict penalty
increases from 20 cycles to 25 cycles than, by part (a), utilization
would be 10/(60 50p). If we keep the i-cache small, it will have a
non-zero miss-rate, m. We still fetch e instructions between
mispredicts, but now of those e , em are 10 cycle (=20 slot) L1
misses. So the number of instruction slots to execute our
instructions is
t = e(1 + 20m)+ 2.20=[(10)/(1 p)][1 + 20m]+4
=(50 + 200m 40p)/(1 p)
The utilization will be 10/(50 + 200m 40p). The larger cache will
provide better utilization when
(10)/60 50p) > (10)/(50 + 200m 40p), or
50 + 200m 40p > 60 50p, or
p > 1 20m
15.5 Performance can be increased by:
Fewer loop conditional evaluations.
Fewer branches/jumps.
Opportunities to reorder instructions across iterations.
Opportunities to merge loads and stores across iterations.
Performance can be decreased by:
Increased pressure on the I-cache.
Large branch/jump distances may require slower instructions
15.6 The advantages stem from a simplification of the pipeline logic, which
can improve power consumption, area consumption, and maximum
speed. The disadvantages are an increase in compiler complexity and
a reduced ability to evolve the processor implementation
independently of the instruction set (newer generations of chip might
have a different pipeline depth and hence a different optimal number
of delay cycles).
-34-
15.7
a.
b.
-35-
-36-
-37-
Note that it is OK for CPU B to read x in between the time CPU A reads x
and then writes x because this case is the same as if CPU B read x
before CPU A did the fetch and increment. Therefore the fetch and
increment still appears atomic
-39-
-40-
0x0000B12C
0x0300EC04
0x0300EC04
0x0000EC04
0x0000009E
0x00000B4
PC MAR
M MBR
MBR IR
IR MAR
AC MBR
MBR M
0x0000B130
0x0100EC08
0x0100EC08
0x0000EC08
0x000000B4
0x000000B4
-41-
$v0,
$t0,
$t0,
$v0,
$zero, 0 #
$v0, $v0 #
$a0, end #
$v0, 1
# goto
$v0, $v0, -1
$ra
#
$a0
$v0
$t0
r := 0
$t0 := r*r
if (r*r > n) goto end
# r := r + 1
loop
# r := r 1
return r
Note that since we dont use any of the $s registers, and we dont use
jal, and since the problem said not to worry about the frame pointer, we
dont have to save/restore any registers on the stack. We use $a0 for
the argument n, and $v0 for the return value because these choices are
mandated by the standard MIPS calling conventions. To save registers,
we also use $v0 for the program variable r within the body of the
procedure.
B.2
# Local variables:
#
b $a0 Base value to raise to some power.
#
e $a1 Exponent remaining to raise the base to.
#
r $v0 Accumulated result so far.
intpower: addi $v0,
while:
blez
loop when e<=0.
mul
$v0,
addi
$a1,
b while
endwh:
jr
$ra
$zero, 1 # r := 0.
$a1, endwh
# Break out of while
$v0, $a0
$a1, -1
# r := r * b.
# b := b - 1.
# Continue looping.
# Return result r already in $v0.
-42-
B.3
AND
SLL
XOR
XOR
$r4,
$r4,
$r4,
$r4,
$r2,
$r4,
$r4,
$r4,
$r3
1
$r2
$r3
-43-