
CS61: Systems Programming and Machine Organization

Harvard University, Fall 2009

Lecture 14: Cache Performance Measurement and Optimization

Prof. Matt Welsh


October 20, 2009

Topics for today

Cache performance metrics

Discovering your cache's size and performance

The Memory Mountain

Matrix multiply, six ways

Blocked matrix multiplication


Cache Performance Metrics


Miss Rate

Fraction of memory references not found in cache (# misses / # references)


Typical numbers:
3-10% for L1
Can be quite small (e.g., < 1%) for L2, depending on size and locality.

Hit Time

Time to deliver a line in the cache to the processor (includes time to determine
whether the line is in the cache)
Typical numbers:
1-2 clock cycles for L1
5-20 clock cycles for L2

Miss Penalty

Additional time required because of a miss


Typically 50-200 cycles for main memory
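A standard way to combine these three metrics (not spelled out on this slide):
average memory access time = hit time + miss rate * miss penalty.
For example, a 2-cycle L1 hit time, a 5% miss rate, and a 100-cycle miss penalty give
2 + 0.05 * 100 = 7 cycles per access on average.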


Writing Cache Friendly Code


Repeated references to variables are good (temporal locality)
Stride-1 reference patterns are good (spatial locality)
Examples below assume: cold cache, 4-byte words, 4-word cache blocks

int sum_array_rows(int a[M][N])
{
    int i, j, sum = 0;

    for (i = 0; i < M; i++)
        for (j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}

Miss rate = 1/4 = 25%
(row-major traversal: each miss loads a 4-word block that serves the next three accesses)



int sum_array_cols(int a[M][N])
{
    int i, j, sum = 0;

    for (j = 0; j < N; j++)
        for (i = 0; i < M; i++)
            sum += a[i][j];
    return sum;
}

Miss rate = 100%
(column-major traversal: consecutive accesses are a full row apart, so each one lands in a new block)




Determining cache characteristics

Say I gave you a machine and didn't tell you anything about
its cache size or speeds.
How would you figure these values out?
Idea: Write a program to measure the cache's behavior and
performance.

Program needs to perform memory accesses with different locality patterns.

Simple approach:

Allocate array of size W words


Loop over the array with a stride of S words and measure the speed of the memory accesses
Vary W and S to estimate the cache characteristics

[Figure: an array of W = 32 words traversed with stride S = 4 words.]

Determining cache characteristics


[Figure: the same W = 32 word array traversed with stride S = 4 words.]

What happens as you vary W and S?


Changing W varies the total amount of memory accessed by the
program.

Changing S varies the spatial locality of each access.

As W grows beyond the size of a given cache level, the program's performance will drop.

If S is less than the size of a cache line, sequential accesses will be fast.
If S is greater than the size of a cache line, sequential accesses will be slower.
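As a concrete illustration (simple arithmetic, not from the slide): with 32-byte cache lines and 4-byte words, a stride of S words misses on roughly S/8 of its accesses while S < 8, and on every access once S >= 8, since each access then lands in a new line.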

See end of lecture notes for example C program to do this.


Varying Working Set


Keep stride constant at S = 1, and vary W from 1 KB to 8 MB

Shows the sizes and read throughputs of the different cache levels and of main memory

[Figure: read throughput (MB/s), roughly 200 to 1200 MB/s, vs. working set size from 1 KB to 8 MB, with distinct L1 cache, L2 cache, and main memory regions. Annotated questions: "Why the dropoff in performance here?" and "Why is this dropoff not so drastic?"]

Varying stride
Keep working set constant at W = 256 KB, and vary the stride from 1 to 16 words

Shows the cache block size.


[Figure: read throughput (MB/s), roughly 100 to 800 MB/s, vs. stride s1 to s16 (in words). Throughput falls as the stride grows, then levels off once there is one access per cache line. Annotated question: "Why the gradual drop in performance?"]

The Memory Mountain

Pentium III, 550 MHz
16 KB L1 cache, 512 KB L2 cache

[Figure: read throughput (MB/s, up to about 1200) as a function of working set size (2 KB to 8 MB) and stride (s1 to s15 words). Ridges of temporal locality mark the L1, L2, and main memory regions; slopes of spatial locality run along the stride axis. The L1 dropoff occurs at a 16 KB working set and the L2 dropoff at 512 KB.]

X86-64 Memory Mountain

Pentium Nocona Xeon, x86-64, 3.2 GHz
12 Kuop on-chip L1 trace cache
16 KB on-chip L1 d-cache
1 MB off-chip unified L2 cache

[Figure: read throughput (MB/s, up to about 6000) vs. working set size (4 KB to 128 MB) and stride (s1 to s29 words), showing L1, L2, and main memory ridges of temporal locality and slopes of spatial locality.]

Opteron Memory Mountain


AMD Opteron, 2 GHz

[Figure: read throughput (MB/s, up to about 3000) vs. working set size (4 KB to 128 MB) and stride (s1 to s29 words), again showing distinct L1, L2, and main memory regions.]

Matrix Multiplication Example

Matrix multiplication is heavily used in numeric and scientific applications.

It's also a nice example of a program that is highly sensitive to cache effects.

Multiply two N x N matrices

O(N^3) total operations


Read N values for each source element
Sum up N values for each destination
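For example (simple arithmetic, not on the slide): each of the N^2 destination elements needs N multiply-add pairs, so N = 1000 already means roughly 2 * 10^9 floating-point operations.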
Variable sum held in register

/* ijk */
for (i=0; i<n; i++) {
for (j=0; j<n; j++) {
sum = 0.0;
for (k=0; k<n; k++)
sum += a[i][k] * b[k][j];
c[i][j] = sum;
}
}


Matrix Multiplication Example


/* ijk */
for (i=0; i<n; i++) {
for (j=0; j<n; j++) {
sum = 0.0;
for (k=0; k<n; k++)
sum += a[i][k] * b[k][j];
c[i][j] = sum;
}
}

Worked example: the first element of the result is the dot product of row 0 of A and column 0 of B:

    | 4 2 7 |   | 3 0 1 |   | 51  .  . |
    | 1 8 2 | x | 2 4 5 | = |  .  .  . |
    | 6 0 1 |   | 5 9 1 |   |  .  .  . |

    4*3 + 2*2 + 7*5 = 51

Miss Rate Analysis for Matrix Multiply


Assume:

Line size = 32B (big enough for four 64-bit double values)
Matrix dimension (N) is very large
Cache is not even big enough to hold multiple rows

Analysis Method:

Look at access pattern of inner loop

[Figure: diagrams of matrices A, B, and C showing which index varies in the inner loop of each.]

Layout of C Arrays in Memory (review)


C arrays allocated in row-major order

Each row in contiguous memory locations

Stepping through columns in one row:

for (i = 0; i < N; i++)

sum += a[0][i];

Accesses successive elements


Compulsory miss rate: (8 bytes per double) / (block size of cache)

Stepping through rows in one column:

for (i = 0; i < N; i++)

sum += a[i][0];

Accesses distant elements -- no spatial locality!


Compulsory miss rate = 100%
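Why the difference: in row-major layout, a[i][j] sits at byte offset (i*N + j) * sizeof(double) from the start of the array, so stepping j moves 8 bytes at a time, while stepping i jumps N*8 bytes and lands in a different cache block on every access once rows are larger than a block.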


Matrix Multiplication (ijk)


/* ijk */
for (i=0; i<n; i++) {
for (j=0; j<n; j++) {
sum = 0.0;
for (k=0; k<n; k++)
sum += a[i][k] * b[k][j];
c[i][j] = sum;
}
}

Inner loop:
  A (i,*): row-wise
  B (*,j): column-wise
  C (i,j): fixed

Misses per Inner Loop Iteration:
  A: 0.25    B: 1.0    C: 0.0

Assume a cache line size of 32 bytes.
Compulsory miss rate for A = 8 bytes per double / 32 bytes per line = 0.25

Matrix Multiplication (jik)


/* jik */
for (j=0; j<n; j++) {
for (i=0; i<n; i++) {
sum = 0.0;
for (k=0; k<n; k++)
sum += a[i][k] * b[k][j];
c[i][j] = sum;
}
}

Inner loop:
  A (i,*): row-wise
  B (*,j): column-wise
  C (i,j): fixed

Misses per Inner Loop Iteration:
  A: 0.25    B: 1.0    C: 0.0

Same as ijk. Just swapped i and j.

Matrix Multiplication (kij)


/* kij */
for (k=0; k<n; k++) {
for (i=0; i<n; i++) {
r = a[i][k];
for (j=0; j<n; j++)
c[i][j] += r * b[k][j];
}
}

Inner loop:
  A (i,k): fixed
  B (k,*): row-wise
  C (i,*): row-wise

Misses per Inner Loop Iteration:
  A: 0.0    B: 0.25    C: 0.25

Now we suffer 0.25 compulsory misses per iteration for B and C accesses.
Also need to store back temporary result c[i][j] on each innermost loop iteration!

Matrix Multiplication (ikj)


/* ikj */
for (i=0; i<n; i++) {
for (k=0; k<n; k++) {
r = a[i][k];
for (j=0; j<n; j++)
c[i][j] += r * b[k][j];
}
}

Inner loop:
  A (i,k): fixed
  B (k,*): row-wise
  C (i,*): row-wise

Misses per Inner Loop Iteration:
  A: 0.0    B: 0.25    C: 0.25

Same as kij.

Matrix Multiplication (jki)


/* jki */
for (j=0; j<n; j++) {
for (k=0; k<n; k++) {
r = b[k][j];
for (i=0; i<n; i++)
c[i][j] += a[i][k] * r;
}
}

Inner loop:
  A (*,k): column-wise
  B (k,j): fixed
  C (*,j): column-wise

Misses per Inner Loop Iteration:
  A: 1.0    B: 0.0    C: 1.0

Matrix Multiplication (kji)


/* kji */
for (k=0; k<n; k++) {
for (j=0; j<n; j++) {
r = b[k][j];
for (i=0; i<n; i++)
c[i][j] += a[i][k] * r;
}
}

Inner loop:
  A (*,k): column-wise
  B (k,j): fixed
  C (*,j): column-wise

Misses per Inner Loop Iteration:
  A: 1.0    B: 0.0    C: 1.0

Summary of Matrix Multiplication


for (i=0; i<n; i++) {
for (j=0; j<n; j++) {
sum = 0.0;
for (k=0; k<n; k++)
sum += a[i][k] * b[k][j];
c[i][j] = sum;
}
}

for (k=0; k<n; k++) {


for (i=0; i<n; i++) {
r = a[i][k];
for (j=0; j<n; j++)
c[i][j] += r * b[k][j];
}
}

for (j=0; j<n; j++) {


for (k=0; k<n; k++) {
r = b[k][j];
for (i=0; i<n; i++)
c[i][j] += a[i][k] * r;
}
}

ijk or jik:

2 loads, 0 stores
misses/iter = 1.25

kij or ikj:

2 loads, 1 store
misses/iter = 0.5

jki or kji:

2 loads, 1 store
misses/iter = 2.0
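These per-iteration miss counts are just the per-matrix numbers from the previous slides added up: ijk/jik: 0.25 + 1.0 + 0.0 = 1.25; kij/ikj: 0.0 + 0.25 + 0.25 = 0.5; jki/kji: 1.0 + 0.0 + 1.0 = 2.0.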


Pentium Matrix Multiply Performance


[Figure: cycles per inner-loop iteration (0 to 60) vs. array size n (25 to 400) for all six loop orderings.
  jki, kji: 3 memory accesses, 2 misses/iter (slowest curves)
  kij, ikj: 3 memory accesses, 0.5 misses/iter
  ijk, jik: 2 memory accesses, 1.25 misses/iter (fastest curves)]

Versions with the same number of memory accesses and the same miss rate perform about the same.
Lower misses/iter tends to do better.
The ijk and jik versions are fastest, even though they have a higher miss rate than the kij and ikj versions.


Using blocking to improve locality


Blocked matrix multiplication

Break each matrix into smaller blocks and perform independent multiplications on each block.
Improves locality by operating on one block at a time.
Best if each block can fit in the cache!

Example: Break each matrix into four sub-blocks


    | A11 A12 |   | B11 B12 |   | C11 C12 |
    | A21 A22 | X | B21 B22 | = | C21 C22 |

Key idea: Sub-blocks (i.e., Axy) can be treated just like scalars.
C11 = A11B11 + A12B21

C12 = A11B12 + A12B22

C21 = A21B11 + A22B21

C22 = A21B12 + A22B22
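A quick sanity check (simple arithmetic, not on the slide): each C sub-block is the sum of two block products, so there are 8 multiplications of (N/2) x (N/2) blocks, i.e. 8 * 2 * (N/2)^3 = 2N^3 operations -- the same total work as the unblocked algorithm, just reordered for locality.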


Blocked Matrix Multiply (bijk)


for (jj=0; jj<n; jj+=bsize) {
for (i=0; i<n; i++)
for (j=jj; j < min(jj+bsize,n); j++)
c[i][j] = 0.0;
for (kk=0; kk<n; kk+=bsize) {
for (i=0; i<n; i++) {
for (j=jj; j < min(jj+bsize,n); j++) {
sum = 0.0;
for (k=kk; k < min(kk+bsize,n); k++) {
sum += a[i][k] * b[k][j];
}
c[i][j] += sum;
}
}
}
}
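Note: min() is not a standard C function; the code above assumes a helper such as the following macro (an assumption, not shown on the slide):

#define min(a, b) ((a) < (b) ? (a) : (b))   /* assumed helper for the blocked loop bounds */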


Blocked matrix multiply operation


[Figure: matrices A, B, and C; a bsize x bsize block of B is highlighted, spanning rows kk..kk+bsize and columns jj..jj+bsize.]

Step 1: Pick location of block in matrix B

Block slides across matrix B left-to-right, top-to-bottom, by bsize units at a time


Blocked matrix multiply operation


[Figure: a 1 x bsize row sliver of A (row i, columns kk..kk+bsize) slides downward while the block of B, spanning rows kk..kk+bsize and columns jj..jj+bsize, stays fixed.]

Step 2: Slide row sliver across matrix A

Hold block in matrix B fixed.


Row sliver slides from top to bottom across matrix A
Row sliver spans columns [ kk ... kk+bsize ]


Blocked matrix multiply operation


[Figure: a bsize x 1 column sliver of the block in B (rows kk..kk+bsize, column j) slides left to right while the row sliver of A stays fixed.]

Step 3: Slide column sliver across block in matrix B

Row sliver in matrix A stays fixed.


Column sliver slides from left to right across the block
Column sliver spans rows [ kk ... kk+bsize ] in matrix B


Blocked matrix multiply operation


[Figure: the row sliver of A and the column sliver of B line up; their dot product is accumulated into element (i,j) of C.]

Step 4: Iterate over row and column slivers together

Compute dot product of both vectors of length bsize


Dot product is added to contents of cell (i,j) in matrix C


Locality properties
[Figure: same picture as the previous slides -- row sliver of A, block of B, and element (i,j) of C.]

What is the locality of this algorithm?

Iterate over all elements of the block N times (once for each row sliver in A)
Row sliver in matrix A accessed bsize times (once for each column sliver in B)

If block and row slivers fit in the cache, performance should rock!
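For example (simple arithmetic, assuming 8-byte doubles): with bsize = 25, the block of B is 25 * 25 * 8 = 5000 bytes and a row sliver of A is 25 * 8 = 200 bytes, so both fit comfortably in a 16 KB L1 cache.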


Pentium Blocked Matrix Multiply Performance

Blocking (bijk and bikj) improves performance by a factor of two over the unblocked versions (ijk and jik)

Relatively insensitive to array size.


[Figure: cycles per inner-loop iteration (0 to 60) vs. array size n (25 to 400) for the six unblocked orderings (kji, jki, kij, ikj, jik, ijk) and the two blocked versions, bijk (bsize = 25) and bikj (bsize = 25). The blocked versions are the lowest curves and stay nearly flat across all array sizes.]



Cache performance test program


/* The test function: read the array with the given stride. */
void test(int elems, int stride) {
    int i, result = 0;
    volatile int sink;

    for (i = 0; i < elems; i += stride)
        result += data[i];
    sink = result;   /* So compiler doesn't optimize away the loop */
}

/* Run test(elems, stride) and return read throughput (MB/s) */
double run(int size, int stride)
{
    uint64_t start_cycles, end_cycles, diff;
    int elems = size / sizeof(int);

    test(elems, stride);                        /* Warm up the cache */
    start_cycles = get_cpu_cycle_counter();     /* Read CPU cycle counter */
    test(elems, stride);                        /* Run test */
    end_cycles = get_cpu_cycle_counter();       /* Read CPU cycle counter again */
    diff = end_cycles - start_cycles;           /* Compute time */
    return (size / stride) / (diff / CPU_MHZ);  /* Convert cycles to MB/s */
}

Cache performance main routine


#define CPU_MHZ   2.8 * 1024.0 * 1024.0  /* e.g., 2.8 GHz */
#define MINBYTES  (1 << 10)              /* Working set size ranges from 1 KB */
#define MAXBYTES  (1 << 23)              /* ... up to 8 MB */
#define MAXSTRIDE 16                     /* Strides range from 1 to 16 */
#define MAXELEMS  MAXBYTES/sizeof(int)

int data[MAXELEMS];    /* The array we'll be traversing */

int main()
{
    int size;      /* Working set size (in bytes) */
    int stride;    /* Stride (in array elements) */

    init_data(data, MAXELEMS);   /* Initialize each element in data to 1 */

    for (size = MAXBYTES; size >= MINBYTES; size >>= 1) {
        for (stride = 1; stride <= MAXSTRIDE; stride++)
            printf("%.1f\t", run(size, stride));
        printf("\n");
    }
    exit(0);
}