Lecture 14:
Hit Time
Time to deliver a line in the cache to the processor (includes time to determine
whether the line is in the cache)
Typical numbers:
1-2 clock cycles for L1
5-20 clock cycles for L2
Miss Penalty
Additional time required because of a cache miss, typically tens to hundreds of cycles for main memory.
Say I gave you a machine and didn't tell you anything about
its cache size or speeds.
How would you figure these values out?
Idea: Write a program to measure the cache's behavior and
performance.
Simple approach: repeatedly read W words of an array with stride S, and time the accesses. For example:
S = 4 words
W = 32 words
2009 Matt Welsh Harvard University
As W gets larger than one level of the cache, performance of the program will drop.
If S is less than the size of a cache line, sequential accesses will be fast.
If S is greater than the size of a cache line, sequential accesses will be slower.
[Figure: read throughput (MB/sec, 0 to 1200) vs. working set size (1k to 8m bytes), showing the sizes and read throughputs of the different levels: an L1 cache region, an L2 cache region, and a main memory region.]
Varying stride
Keep working set constant at W = 256 KB, vary stride from 1-16
[Figure: read throughput (MB/sec, roughly 500 to 700) vs. stride (in words) at fixed W = 256 KB.]
[Figure: the memory mountain for a machine with a 16 KB L1 cache and a 512 KB L2 cache. Read throughput (MB/sec, 0 to 1200) is plotted against working set size (2k to 8m) and stride (s1 to s15 words). Ridges of temporal locality separate the L1, L2, and main memory regions; slopes of spatial locality run along the stride axis. The L1 dropoff occurs at 16 KB; the L2 dropoff occurs at 512 KB.]
[Figure: another memory mountain. Read throughput (MB/s, 0 to 6000) vs. working set size (4k to 128m bytes) and stride (s1 to s29 words), again showing ridges of temporal locality and slopes of spatial locality.]
[Figure: memory mountain for a further machine. Read throughput (MB/s, 0 to 3000) vs. working set size (4k to 128m bytes) and stride (s1 to s29 words).]
Matrix multiplication is also a nice example of a program that is highly sensitive to cache effects.
[Figure: a small numeric matrix multiplication example.]
Line size = 32B (big enough for four 64-bit double values)
Matrix dimension (N) is very large
Cache is not even big enough to hold multiple rows
Analysis method: look at the access pattern of the inner loop over the indices i, j, and k.
Stepping through a row (sum += a[0][i];): successive elements are adjacent in memory, so with a 32-byte line holding four doubles only one access in four misses (miss rate 0.25).
Stepping through a column (sum += a[i][0];): successive elements are a full row apart, so there is no spatial locality and every access misses (miss rate 1.0).
Matrix multiply ijk: inner loop over k, sum += a[i][k] * b[k][j]
A (i,*): row-wise, 0.25 misses/iteration
B (*,j): column-wise, 1.0 misses/iteration
C (i,j): fixed, 0.0 misses/iteration
Matrix multiply jik: same inner loop, same access pattern as ijk.
A (i,*): row-wise, 0.25 misses/iteration
B (*,j): column-wise, 1.0 misses/iteration
C (i,j): fixed, 0.0 misses/iteration
Matrix multiply kij: inner loop over j, c[i][j] += a[i][k] * b[k][j]
A (i,k): fixed, 0.0 misses/iteration
B (k,*): row-wise, 0.25 misses/iteration
C (i,*): row-wise, 0.25 misses/iteration
Now we suffer only 0.25 compulsory misses per iteration for the B and C accesses.
Also need to store back the temporary result c[i][j] on each innermost loop iteration!
Matrix multiply ikj: inner loop over j, c[i][j] += a[i][k] * b[k][j]
A (i,k): fixed, 0.0 misses/iteration
B (k,*): row-wise, 0.25 misses/iteration
C (i,*): row-wise, 0.25 misses/iteration
Same as kij.
Matrix multiply jki: inner loop over i, c[i][j] += a[i][k] * b[k][j]
A (*,k): column-wise, 1.0 misses/iteration
B (k,j): fixed, 0.0 misses/iteration
C (*,j): column-wise, 1.0 misses/iteration
Matrix multiply kji: same inner loop, same access pattern as jki.
A (*,k): column-wise, 1.0 misses/iteration
B (k,j): fixed, 0.0 misses/iteration
C (*,j): column-wise, 1.0 misses/iteration
Summary of the six orderings (per inner-loop iteration):
ijk or jik: 2 loads, 0 stores, misses/iter = 1.25
kij or ikj: 2 loads, 1 store, misses/iter = 0.5
jki or kji: 2 loads, 1 store, misses/iter = 2.0
[Figure: cycles per inner-loop iteration (0 to 50) vs. array size n (25 to 400), comparing jki/kji (3 memory accesses, 2 misses/iter), kij/ikj (3 memory accesses, 0.5 misses/iter), and ijk/jik (2 memory accesses, 1.25 misses/iter).]
Versions with the same number of memory accesses and the same miss rate perform about the same.
Lower misses/iter tends to do better.
The ijk and jik versions are fastest, although they have a higher miss rate than the kij and ikj versions.
Blocked matrix multiplication: partition each matrix into sub-blocks, e.g.

    A11 A12     B11 B12     C11 C12
    A21 A22  X  B21 B22  =  C21 C22

Key idea: sub-blocks (i.e., Axy) can be treated just like scalars:
    C11 = A11 B11 + A12 B21
[Figures: diagrams of the blocked algorithm. A bsize x bsize block of B (rows kk to kk+bsize, columns jj to jj+bsize) stays resident in the cache while row slivers of A (columns kk to kk+bsize) and of C (columns jj to jj+bsize) sweep past it.]
Locality properties
Iterate over all elements of the block N times (once for each row sliver in A)
Row sliver in matrix A accessed bsize times (once for each column sliver in B)
If block and row slivers fit in the cache, performance should rock!
[Figure: cycles per inner-loop iteration (0 to 50) vs. array size n (25 to 400) for all six orderings (ijk, jik, kij, ikj, jki, kji) plus blocked versions bijk and bikj with bsize = 25. The blocked versions perform best, and their performance stays essentially flat as n grows.]
test(elems, stride);                        /* warm up the cache */
start_cycles = get_cpu_cycle_counter();     /* read the cycle counter */
test(elems, stride);                        /* run the timed test */
end_cycles = get_cpu_cycle_counter();       /* read the counter again */
diff = end_cycles - start_cycles;           /* elapsed cycles */
return (size / stride) / (diff / CPU_MHZ);  /* accesses per microsecond */
int data[MAXELEMS];      /* the array we will be traversing */

int main()
{
    int size;            /* working set size, in bytes */
    int stride;          /* stride, in array elements */