Trace Cache
+ better utilization of long blocks (don't exit in the middle of a block, don't enter at a label in the middle of a block)
- complicated address mapping, since addresses are no longer aligned to power-of-2 multiples of the word size
- instructions may appear multiple times, in multiple dynamic traces, due to different branch outcomes
[Bar chart: ratio of average memory stall time for SPEC92 integer (compress, eqntott, espresso, xlisp, ora) and floating-point (alvinn, doduc, ear, fpppp, hydro2d, mdljdp2, mdljsp2, nasa7, spice2g6, su2cor, swm256, tomcatv, wave5) benchmarks, comparing hit-under-1-miss (0->1), hit-under-2-misses (1->2), and hit-under-64-misses (2->64) nonblocking caches against a blocking Base cache]
FP programs on average: Miss Penalty = 0.68 -> 0.52 -> 0.34 -> 0.26
Int programs on average: Miss Penalty = 0.24 -> 0.20 -> 0.19 -> 0.19
8 KB Data Cache, Direct Mapped, 32B block, 16 cycle miss, SPEC 92
Data Prefetching
Pentium 4 can prefetch data into L2 cache from up to 8 streams from 8
different 4 KB pages
Prefetching is invoked after 2 successive L2 cache misses to a page,
if the distance between those cache blocks is < 256 bytes
Issues in Prefetching
Usefulness: should produce hits
Timeliness: not late and not too early
Cache and bandwidth pollution
Hardware Instruction Prefetching
[Diagram: on an instruction-cache miss, the requested block (Req block) is fetched from the unified L2 cache into the L1 instruction cache, while the next sequential block is prefetched from L2 into a stream buffer; on a hit in the stream buffer, the prefetched instruction block is moved into L1. CPU with register file (RF), L1 instruction cache, L1 data cache, unified L2 cache]
Strided prefetch
If we observe a sequence of accesses to blocks b, b+N, b+2N,
then prefetch b+3N, etc.
Example: IBM Power 5 [2003] supports eight independent
streams of strided prefetch per processor, prefetching 12
lines ahead of current access
Administrivia
Exam:
This Wednesday
Location: 310 Soda
TIME: 6:00-9:00pm
Compiler Optimizations (Data)
Merging Arrays: improve spatial locality by single array of compound elements vs.
2 arrays
Loop Interchange: change nesting of loops to access data in order stored in
memory
Loop Fusion: Combine 2 independent loops that have same looping and some
variables overlap
Blocking: Improve temporal locality by accessing blocks of data repeatedly vs.
going down whole columns or rows
Loop Interchange Example
/* Before */
for (k = 0; k < 100; k = k+1)
    for (j = 0; j < 100; j = j+1)
        for (i = 0; i < 5000; i = i+1)
            x[i][j] = 2 * x[i][j];

/* After */
for (k = 0; k < 100; k = k+1)
    for (i = 0; i < 5000; i = i+1)
        for (j = 0; j < 100; j = j+1)
            x[i][j] = 2 * x[i][j];
Blocking Example
/* Before */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1) {
        r = 0;
        for (k = 0; k < N; k = k+1)
            r = r + y[i][k]*z[k][j];
        x[i][j] = r;
    }
Blocking Example
/* After */
for (jj = 0; jj < N; jj = jj+B)
    for (kk = 0; kk < N; kk = kk+B)
        for (i = 0; i < N; i = i+1)
            for (j = jj; j < min(jj+B,N); j = j+1) {
                r = 0;
                for (k = kk; k < min(kk+B,N); k = k+1)
                    r = r + y[i][k]*z[k][j];
                x[i][j] = x[i][j] + r;
            }
[Graph: miss rate (roughly 0.05 to 0.1) vs. blocking factor (up to 150)]
[Bar chart: performance improvement (roughly 1.5x to 2.5x) from merged arrays, loop interchange, loop fusion, and blocking, measured both in time and in instructions]
[Graphs: one experiment measuring access time vs. array size (> L1 cache) and stride s, revealing the memory hierarchy parameters of two machines. One: L1 16 KB, 16 B line, 2 cycles (6 ns); L2 2 MB, 12 cycles (36 ns); memory 396 ns (132 cycles). The other: L1 32 KB, 128 B line, 0.5-2 cycles; L2 8 MB, 128 B line, 9 cycles]
[Heat maps: Mflop/s as a function of row block size (r) and column block size (c) for eight machines: Sun Ultra 2, Sun Ultra 3, AMD Opteron, Intel Pentium M, IBM Power 4, Intel/HP Itanium, Intel/HP Itanium 2, IBM Power 3. Best: 4x2 vs. the reference]
All possible column block sizes selected for 8 computers; How could
compiler know?
Technique                                      Hit   Band-  Miss     Miss  HW cost/    Comment
                                               time  width  penalty  rate  complexity
Way-predicting caches                          +                           1           Used in Pentium 4
Trace caches                                   +                           3           Used in Pentium 4
Pipelined caches                               -     +                     1           Widely used
Nonblocking caches                                   +      +              3           Widely used
Banked caches                                        +                     1           Widely used
Critical word first / early restart                         +              2
Merging write buffers                                       +              1
Victim caches                                               +        +     1
Compiler techniques to reduce cache misses                           +     0
Hardware prefetching of instructions and data               +        +     2 instr.,
                                                                           3 data
Compiler-controlled prefetching                             +        +     3
See: http://www.columbia.edu/acis/history/core.html
DRAM Architecture
[Diagram: a DRAM array. A row-address decoder drives 2^N word lines (Row 1 .. Row 2^N); 2^M bit-line columns (Col. 1 .. Col. 2^M) cross them, with a one-bit memory cell at each intersection; N+M address bits select one cell]
DRAM Operation
Write:
1. Drive bit line
2. Select row
Read:
1. Precharge bit line to Vdd/2
2. Select row
3. Cell and bit line share charges; very small voltage change on the bit line
4. Sense (fancy sense amp); can detect changes of ~1 million electrons
5. Write: restore the value
Refresh:
Just do a dummy read to every cell.
Trench capacitors: logic above the capacitor; gain in surface area of capacitor
Stacked capacitors: logic below the capacitor; gain in surface area of capacitor; 2-dim cross-section quite small
[Timing diagram: read cycle of a 256K x 8 DRAM. The row address is latched on the falling edge of RAS_L and the column address on the falling edge of CAS_L (the multiplexed address pins A carry junk otherwise); with WE_L deasserted and OE_L asserted, data D leaves High-Z after the read access time plus the output-enable delay, then returns to High-Z]
[Diagrams: access pattern without interleaving: the CPU starts the access for D1 and must wait until D1 is available before starting the next access. With 4-way interleaving, the CPU accesses Bank 0, then Banks 1, 2, 3 on successive cycles; by the time Bank 3 is started, Bank 0 can be accessed again]
Three main memory organizations:
Simple: CPU, cache, bus, and memory all one word wide
Wide: CPU/Mux 1 word; Mux/Cache, bus, memory N words (Alpha: 64 bits & 256 bits)
Interleaved: CPU, cache, and bus 1 word; memory in N word-interleaved banks

Miss penalty for a 4-word block:
Simple M.P. = 4 x (1+10+1) = 48
Wide M.P. = 1+10+1 = 12
Interleaved M.P. = 1+10+1+3 = 15
Four-way word-interleaved bank assignment:
Bank 0: addresses 0, 4, 8, 12
Bank 1: addresses 1, 5, 9, 13
Bank 2: addresses 2, 6, 10, 14
Bank 3: addresses 3, 7, 11, 15
Fast Page Mode
Only CAS is needed to access other M-bit blocks on the same row: RAS_L remains asserted while CAS_L is toggled.
[Diagram: an N-row x N-column DRAM; a row access latches the selected row into an N x M SRAM row buffer, from which each column address reads an M-bit output]
[Timing: one Row Address on pins A with RAS_L asserted, then four successive Col Address strobes on CAS_L deliver the 1st through 4th M-bit accesses]
[Timing diagram: DDR2 SDRAM burst READ (Micron 256Mb DDR2 SDRAM datasheet). A row activate (RAS) opens the row, a column READ (CAS) follows, and data bursts out after the CAS latency at a 400 Mb/s per-pin data rate; a precharge closes the row before the next row activate, and a RAS to a new bank can overlap the current access]
Standard   Clock Rate (MHz)   M transfers/s   DRAM Name   Mbytes/s/DIMM   DIMM Name
DDR        133                266             DDR266      2128            PC2100
DDR        150                300             DDR300      2400            PC2400
DDR        200                400             DDR400      3200            PC3200
DDR2       266                533             DDR2-533    4264            PC4300
DDR2       333                667             DDR2-667    5336            PC5300
DDR2       400                800             DDR2-800    6400            PC6400
DDR3       533                1066            DDR3-1066   8528            PC8500
DDR3       666                1333            DDR3-1333   10664           PC10700
DDR3       800                1600            DDR3-1600   12800           PC12800
DRAM Packaging
[Diagram: a DRAM chip with ~7 clock and control signals and a data bus of 4b, 8b, 16b, or 32b; row and column addresses share multiplexed address pins]
DIMM (Dual Inline Memory Module) contains multiple
chips arranged in ranks
Each rank has clock/control/address signals
connected in parallel (sometimes need buffers to
drive signals to all chips), and data pins work
together to return wide word
e.g., a rank could implement a 64-bit data bus using 16x4-bit
chips, or a 64-bit data bus using 8x8-bit chips.
DRAM Channel
[Diagram: a DRAM channel. The memory controller drives a shared command/address bus and a 64-bit data bus to two ranks; each rank contains four x16 chips (4 x 16 = 64 bits), and each chip holds multiple internal banks]
FB-DIMM Memories
Uses commodity DRAMs with a special controller on the actual DIMM board.
The connection is serial: the controller drives a daisy chain of FB-DIMMs, rather than the parallel bus of a regular DIMM.
[Diagram: regular DIMM on a parallel bus vs. Controller -> FB-DIMM -> FB-DIMM -> FB-DIMM -> FB-DIMM -> FB-DIMM in series]
FLASH Memory
Has a floating gate that can hold charge (Samsung 2007: 16 GB NAND Flash)
Two varieties:
NAND: denser; must be read and written in blocks
NOR: much less dense; fast to read and write
Conclusion
The memory wall inspires cache optimizations, since so much performance is otherwise lost to memory latency
Reducing hit time: Small and simple caches, Way prediction, Trace caches
Increasing cache bandwidth: Pipelined caches, Multibanked caches, Nonblocking
caches
Reducing Miss Penalty: Critical word first, Merging write buffers
Reducing Miss Rate: Compiler optimizations
Reducing miss penalty or miss rate via parallelism: Hardware prefetching,
Compiler prefetching
Wider Memory
Interleaved Memory: for sequential or independent accesses
Avoiding bank conflicts: SW & HW
DRAM specific optimizations: page mode & Specialty DRAM