
2012 Advanced Computer Architecture Mid-term Exam Questions and solutions

1. Power and Frequency


Your company's internal studies show that a single-core system is sufficient for the demand
on your processing power; however, you are exploring whether you could save power by
using two cores.
a) Assume your application is 80% parallelizable. By how much could you decrease the
frequency and get the same performance? (That is, the dual-core system must complete the
application's task in the same time as the single-core system.)

b) Assume that the voltage may be decreased linearly with the frequency. Using the
equation:

Power_dynamic = 1/2 x Capacitive load x Voltage^2 x Frequency switched

how much dynamic power would the dual-core system require as compared to the
single-core system?

Solutions:

2. Amdahl's law
When making changes to optimize part of a processor, it is often the case that speeding up
one type of instruction comes at the cost of slowing down something else. For example, if we
put in a complicated fast floating-point unit, that takes space, and something else might have
to be moved farther away from the middle to accommodate it, adding an extra cycle of
delay to reach that unit. The basic Amdahl's law equation does not take this trade-off
into account.
a) If the new fast floating-point unit speeds up floating-point operations by, on average, 2x,
and floating-point operations take 20% of the original program's execution time, what
is the overall speedup (ignoring the penalty to any other instructions)?
b) Now assume that speeding up the floating-point unit slowed down data cache accesses
by a factor of 1.5 (3/2), and that data cache accesses consume 10% of the execution time.
What is the overall speedup now?

a. 1/(0.8 + 0.20/2) = 1.11


b. 1/(0.7 + 0.20/2 + 0.10 x 3/2) = 1.05

3. Cache performance optimization


The transpose of a matrix interchanges its rows and columns; this is illustrated below:
Here is a simple C loop to show the transpose:
for (i = 0; i < 3; i++) {
    for (j = 0; j < 3; j++) {
        output[j][i] = input[i][j];
    }
}

Assume that both the input and output matrices are stored in row-major order (row-major
order means that the elements of a row are contiguous in memory, so the column index changes
fastest). Assume that you are executing a 256 x 256 double-precision transpose on a processor
with a 16KB fully associative (don't worry about cache conflicts), least recently used (LRU)
replacement L1 data cache with 64-byte blocks. Assume that L1 cache misses require 16 cycles
and always hit in the L2 cache.

For the simple implementation given above, this execution order would be non-ideal for the
input matrix; however, applying a loop interchange optimization would create a non-ideal
order for the output matrix. Because loop interchange is not sufficient to improve its
performance, it must be blocked instead.

a) What should be the minimum size of the cache to take advantage of blocked execution?
Each element is 8B. Since a 64B cache line holds 8 elements, and each column access will result in
fetching a new line for the non-ideal matrix, we need a minimum of 8 x 8 (64 elements) for each matrix.
Hence, the minimum cache size is 128 x 8B = 1KB (128 = 64 + 64 for the two matrices).

b) How does the relative number of misses in the blocked and unblocked versions compare in
the minimum-sized cache above?
The blocked version only has to fetch each input and output element once. The unblocked version
will have one cache miss for every 64B/8B = 8 row elements. Each column requires 64B x 256 of
storage, or 16KB. Thus, column elements will be replaced in the cache before they can be used
again. Hence, for every 8 elements transposed, the unblocked version will have 9 misses (1 for the
row and 8 for the column) for every 2 in the blocked version.
c) Write code to perform a transpose with a block size parameter B which uses B x B
blocks.
for (i = 0; i < 256; i = i + B)
    for (j = 0; j < 256; j = j + B)
        for (m = 0; m < B; m++)
            for (n = 0; n < B; n++)
                output[j+n][i+m] = input[i+m][j+n];

4. Memory
Consider a desktop system with a processor connected to a 2GB DRAM DIMM with error
correcting code (ECC). Assume that there is only one memory channel of width 72 bits (64
bits for data and 8 bits for ECC).
a) How many chips are on the DIMM if 1Gbit DRAM chips are used, and how many data
I/Os must each DRAM chip have?
2GB of DRAM with parity or ECC effectively has 9-bit bytes, and would require 18 1Gb DRAMs
(2G x 9 bits = 18Gb). To create 72 output bits, each one would have to output 72/18 = 4 bits.
b) What burst length is required to support 32-byte L2 cache blocks?
Each transfer delivers 64 data bits (8B), so a burst length of 4 reads out 32B.

c) Calculate the peak bandwidth for DDR2-667 and DDR2-533 DIMMs for reads from an
active page, excluding the ECC overhead.
The DDR2-667 DIMM bandwidth is 667 x 8 = 5336 MB/s.
The DDR2-533 DIMM bandwidth is 533 x 8 = 4264 MB/s.

5. Flash memory
Which of the following are true about flash memory?
1. Like DRAM, flash is a semiconductor memory.
2. Like disks, flash does not lose information if it loses power.
3. The read access time of NOR flash is similar to DRAM.
4. The read bandwidth of NAND flash is similar to disk.
All are true.

6. Dependability
Which of the following are true about dependability?
1. If a system is up, then all its components are accomplishing their expected service.
2. Availability is a quantitative measure of the percentage of time a system is accomplishing
its expected service.
3. Reliability is a quantitative measure of continuous service accomplishment by a system.
4. The major source of outages today is software.

(2) and (3) are true.
