
CS61: Systems Programming and Machine Organization

Harvard University, Fall 2009

Lecture 14: Cache Performance Measurement and Optimization

Prof. Matt Welsh


October 20, 2009

Topics for today

Cache performance metrics

Discovering your cache's size and performance

The Memory Mountain

Matrix multiply, six ways

Blocked matrix multiplication


Cache Performance Metrics


Miss Rate

Fraction of memory references not found in cache (# misses / # references)


Typical numbers:
3-10% for L1
Can be quite small (e.g., < 1%) for L2, depending on size and locality.

Hit Time

Time to deliver a line in the cache to the processor (includes time to determine
whether the line is in the cache)
Typical numbers:
1-2 clock cycles for L1
5-20 clock cycles for L2

Miss Penalty

Additional time required because of a miss


Typically 50-200 cycles for main memory
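A standard way to combine these three metrics (not spelled out on this slide):
average memory access time = hit time + miss rate * miss penalty.
For example, a 2-cycle L1 hit time, a 5% miss rate, and a 100-cycle miss penalty give
2 + 0.05 * 100 = 7 cycles per access on average.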


Writing Cache Friendly Code


Repeated references to variables are good (temporal locality)
Stride-1 reference patterns are good (spatial locality)
Examples below assume: cold cache, 4-byte words, 4-word cache blocks

int sum_array_rows(int a[M][N])
{
    int i, j, sum = 0;

    for (i = 0; i < M; i++)
        for (j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}

Miss rate = 1/4 = 25%
(row-major traversal: each miss loads a 4-word block that serves the next three accesses)



int sum_array_cols(int a[M][N])
{
    int i, j, sum = 0;

    for (j = 0; j < N; j++)
        for (i = 0; i < M; i++)
            sum += a[i][j];
    return sum;
}

Miss rate = 100%
(column-major traversal: consecutive accesses are a full row apart, so each one lands in a new block)




Determining cache characteristics

Say I gave you a machine and didn't tell you anything about
its cache size or speeds.
How would you figure these values out?
Idea: Write a program to measure the cache's behavior and
performance.

Program needs to perform memory accesses with different locality patterns.

Simple approach:

Allocate array of size W words


Loop over the array with a stride of S words and measure the speed of the memory accesses
Vary W and S to estimate the cache characteristics

[Figure: an array of W = 32 words traversed with stride S = 4 words.]

Determining cache characteristics


[Figure: the same W = 32 word array traversed with stride S = 4 words.]

What happens as you vary W and S?


Changing W varies the total amount of memory accessed by the
program.

Changing S varies the spatial locality of each access.

As W grows beyond the size of a given cache level, the program's performance will drop.

If S is less than the size of a cache line, sequential accesses will be fast.
If S is greater than the size of a cache line, sequential accesses will be slower.
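As a concrete illustration (simple arithmetic, not from the slide): with 32-byte cache lines and 4-byte words, a stride of S words misses on roughly S/8 of its accesses while S < 8, and on every access once S >= 8, since each access then lands in a new line.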

See end of lecture notes for example C program to do this.


Varying Working Set


Keep stride constant at S = 1, and vary W from 1 KB to 8 MB

Shows the sizes and read throughputs of the different cache levels and of main memory

[Figure: read throughput (MB/s), roughly 200 to 1200 MB/s, vs. working set size from 1 KB to 8 MB, with distinct L1 cache, L2 cache, and main memory regions. Annotated questions: "Why the dropoff in performance here?" and "Why is this dropoff not so drastic?"]

Varying stride
Keep working set constant at W = 256 KB, and vary the stride from 1 to 16 words

Shows the cache block size.


[Figure: read throughput (MB/s), roughly 100 to 800 MB/s, vs. stride s1 to s16 (in words). Throughput falls as the stride grows, then levels off once there is one access per cache line. Annotated question: "Why the gradual drop in performance?"]

The Memory Mountain

Pentium III, 550 MHz
16 KB L1 cache, 512 KB L2 cache

[Figure: read throughput (MB/s, up to about 1200) as a function of working set size (2 KB to 8 MB) and stride (s1 to s15 words). Ridges of temporal locality mark the L1, L2, and main memory regions; slopes of spatial locality run along the stride axis. The L1 dropoff occurs at a 16 KB working set and the L2 dropoff at 512 KB.]

X86-64 Memory Mountain

Pentium Nocona Xeon, x86-64, 3.2 GHz
12 Kuop on-chip L1 trace cache
16 KB on-chip L1 d-cache
1 MB off-chip unified L2 cache

[Figure: read throughput (MB/s, up to about 6000) vs. working set size (4 KB to 128 MB) and stride (s1 to s29 words), showing L1, L2, and main memory ridges of temporal locality and slopes of spatial locality.]

Opteron Memory Mountain


AMD Opteron, 2 GHz

[Figure: read throughput (MB/s, up to about 3000) vs. working set size (4 KB to 128 MB) and stride (s1 to s29 words), again showing distinct L1, L2, and main memory regions.]

Matrix Multiplication Example

Matrix multiplication is heavily used in numeric and scientific applications.

It's also a nice example of a program that is highly sensitive to cache effects.

Multiply two N x N matrices

O(N^3) total operations


Read N values for each source element
Sum up N values for each destination
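For example (simple arithmetic, not on the slide): each of the N^2 destination elements needs N multiply-add pairs, so N = 1000 already means roughly 2 * 10^9 floating-point operations.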
Variable sum held in register

/* ijk */
for (i=0; i<n; i++) {
for (j=0; j<n; j++) {
sum = 0.0;
for (k=0; k<n; k++)
sum += a[i][k] * b[k][j];
c[i][j] = sum;
}
}


Matrix Multiplication Example


/* ijk */
for (i=0; i<n; i++) {
for (j=0; j<n; j++) {
sum = 0.0;
for (k=0; k<n; k++)
sum += a[i][k] * b[k][j];
c[i][j] = sum;
}
}

Worked example: the first element of the result is the dot product of row 0 of A and column 0 of B:

    | 4 2 7 |   | 3 0 1 |   | 51  .  . |
    | 1 8 2 | x | 2 4 5 | = |  .  .  . |
    | 6 0 1 |   | 5 9 1 |   |  .  .  . |

    4*3 + 2*2 + 7*5 = 51

Miss Rate Analysis for Matrix Multiply


Assume:

Line size = 32B (big enough for four 64-bit double values)
Matrix dimension (N) is very large
Cache is not even big enough to hold multiple rows

Analysis Method:

Look at access pattern of inner loop

[Figure: diagrams of matrices A, B, and C showing which index varies in the inner loop of each.]

Layout of C Arrays in Memory (review)


C arrays allocated in row-major order

Each row in contiguous memory locations

Stepping through columns in one row:

for (i = 0; i < N; i++)

sum += a[0][i];

Accesses successive elements


Compulsory miss rate: (8 bytes per double) / (block size of cache)

Stepping through rows in one column:

for (i = 0; i < N; i++)

sum += a[i][0];

Accesses distant elements -- no spatial locality!


Compulsory miss rate = 100%
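Why the difference: in row-major layout, a[i][j] sits at byte offset (i*N + j) * sizeof(double) from the start of the array, so stepping j moves 8 bytes at a time, while stepping i jumps N*8 bytes and lands in a different cache block on every access once rows are larger than a block.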


Matrix Multiplication (ijk)


/* ijk */
for (i=0; i<n; i++) {
for (j=0; j<n; j++) {
sum = 0.0;
for (k=0; k<n; k++)
sum += a[i][k] * b[k][j];
c[i][j] = sum;
}
}

Inner loop:
  A (i,*): row-wise
  B (*,j): column-wise
  C (i,j): fixed

Misses per Inner Loop Iteration:
  A: 0.25    B: 1.0    C: 0.0

Assume a cache line size of 32 bytes.
Compulsory miss rate for A = 8 bytes per double / 32 bytes per line = 0.25

Matrix Multiplication (jik)


/* jik */
for (j=0; j<n; j++) {
for (i=0; i<n; i++) {
sum = 0.0;
for (k=0; k<n; k++)
sum += a[i][k] * b[k][j];
c[i][j] = sum;
}
}

Inner loop:
  A (i,*): row-wise
  B (*,j): column-wise
  C (i,j): fixed

Misses per Inner Loop Iteration:
  A: 0.25    B: 1.0    C: 0.0

Same as ijk. Just swapped i and j.

Matrix Multiplication (kij)


/* kij */
for (k=0; k<n; k++) {
for (i=0; i<n; i++) {
r = a[i][k];
for (j=0; j<n; j++)
c[i][j] += r * b[k][j];
}
}

Inner loop:
  A (i,k): fixed
  B (k,*): row-wise
  C (i,*): row-wise

Misses per Inner Loop Iteration:
  A: 0.0    B: 0.25    C: 0.25

Now we suffer 0.25 compulsory misses per iteration for B and C accesses.
Also need to store back temporary result c[i][j] on each innermost loop iteration!

Matrix Multiplication (ikj)


/* ikj */
for (i=0; i<n; i++) {
for (k=0; k<n; k++) {
r = a[i][k];
for (j=0; j<n; j++)
c[i][j] += r * b[k][j];
}
}

Inner loop:
  A (i,k): fixed
  B (k,*): row-wise
  C (i,*): row-wise

Misses per Inner Loop Iteration:
  A: 0.0    B: 0.25    C: 0.25

Same as kij.

Matrix Multiplication (jki)


/* jki */
for (j=0; j<n; j++) {
for (k=0; k<n; k++) {
r = b[k][j];
for (i=0; i<n; i++)
c[i][j] += a[i][k] * r;
}
}

Inner loop:
  A (*,k): column-wise
  B (k,j): fixed
  C (*,j): column-wise

Misses per Inner Loop Iteration:
  A: 1.0    B: 0.0    C: 1.0

Matrix Multiplication (kji)


/* kji */
for (k=0; k<n; k++) {
for (j=0; j<n; j++) {
r = b[k][j];
for (i=0; i<n; i++)
c[i][j] += a[i][k] * r;
}
}

Inner loop:
  A (*,k): column-wise
  B (k,j): fixed
  C (*,j): column-wise

Misses per Inner Loop Iteration:
  A: 1.0    B: 0.0    C: 1.0

Summary of Matrix Multiplication


for (i=0; i<n; i++) {
for (j=0; j<n; j++) {
sum = 0.0;
for (k=0; k<n; k++)
sum += a[i][k] * b[k][j];
c[i][j] = sum;
}
}

for (k=0; k<n; k++) {


for (i=0; i<n; i++) {
r = a[i][k];
for (j=0; j<n; j++)
c[i][j] += r * b[k][j];
}
}

for (j=0; j<n; j++) {


for (k=0; k<n; k++) {
r = b[k][j];
for (i=0; i<n; i++)
c[i][j] += a[i][k] * r;
}
}

ijk or jik:

2 loads, 0 stores
misses/iter = 1.25

kij or ikj:

2 loads, 1 store
misses/iter = 0.5

jki or kji:

2 loads, 1 store
misses/iter = 2.0
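These per-iteration miss counts are just the per-matrix numbers from the previous slides added up: ijk/jik: 0.25 + 1.0 + 0.0 = 1.25; kij/ikj: 0.0 + 0.25 + 0.25 = 0.5; jki/kji: 1.0 + 0.0 + 1.0 = 2.0.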


Pentium Matrix Multiply Performance


[Figure: cycles per inner-loop iteration (0 to 60) vs. array size n (25 to 400) for all six loop orderings.
  jki, kji: 3 memory accesses, 2 misses/iter (slowest curves)
  kij, ikj: 3 memory accesses, 0.5 misses/iter
  ijk, jik: 2 memory accesses, 1.25 misses/iter (fastest curves)]

Versions with the same number of memory accesses and the same miss rate perform about the same.
Lower misses/iter tends to do better.
The ijk and jik versions are fastest, even though they have a higher miss rate than the kij and ikj versions.


Using blocking to improve locality


Blocked matrix multiplication

Break each matrix into smaller blocks and perform independent multiplications on each block.
Improves locality by operating on one block at a time.
Best if each block can fit in the cache!

Example: Break each matrix into four sub-blocks


    | A11 A12 |   | B11 B12 |   | C11 C12 |
    | A21 A22 | X | B21 B22 | = | C21 C22 |

Key idea: Sub-blocks (i.e., Axy) can be treated just like scalars.
C11 = A11B11 + A12B21

C12 = A11B12 + A12B22

C21 = A21B11 + A22B21

C22 = A21B12 + A22B22
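A quick sanity check (simple arithmetic, not on the slide): each C sub-block is the sum of two block products, so there are 8 multiplications of (N/2) x (N/2) blocks, i.e. 8 * 2 * (N/2)^3 = 2N^3 operations -- the same total work as the unblocked algorithm, just reordered for locality.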


Blocked Matrix Multiply (bijk)


for (jj=0; jj<n; jj+=bsize) {
for (i=0; i<n; i++)
for (j=jj; j < min(jj+bsize,n); j++)
c[i][j] = 0.0;
for (kk=0; kk<n; kk+=bsize) {
for (i=0; i<n; i++) {
for (j=jj; j < min(jj+bsize,n); j++) {
sum = 0.0;
for (k=kk; k < min(kk+bsize,n); k++) {
sum += a[i][k] * b[k][j];
}
c[i][j] += sum;
}
}
}
}
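Note: min() is not a standard C function; the code above assumes a helper such as the following macro (an assumption, not shown on the slide):

#define min(a, b) ((a) < (b) ? (a) : (b))   /* assumed helper for the blocked loop bounds */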


Blocked matrix multiply operation


[Figure: matrices A, B, and C; a bsize x bsize block of B is highlighted, spanning rows kk..kk+bsize and columns jj..jj+bsize.]

Step 1: Pick location of block in matrix B

Block slides across matrix B left-to-right, top-to-bottom, by bsize units at a time


Blocked matrix multiply operation


[Figure: a 1 x bsize row sliver of A (row i, columns kk..kk+bsize) slides downward while the block of B, spanning rows kk..kk+bsize and columns jj..jj+bsize, stays fixed.]

Step 2: Slide row sliver across matrix A

Hold block in matrix B fixed.


Row sliver slides from top to bottom across matrix A
Row sliver spans columns [ kk ... kk+bsize ]


Blocked matrix multiply operation


[Figure: a bsize x 1 column sliver of the block in B (rows kk..kk+bsize, column j) slides left to right while the row sliver of A stays fixed.]

Step 3: Slide column sliver across block in matrix B

Row sliver in matrix A stays fixed.


Column sliver slides from left to right across the block
Column sliver spans rows [ kk ... kk+bsize ] in matrix B


Blocked matrix multiply operation


[Figure: the row sliver of A and the column sliver of B line up; their dot product is accumulated into element (i,j) of C.]

Step 4: Iterate over row and column slivers together

Compute dot product of both vectors of length bsize


Dot product is added to contents of cell (i,j) in matrix C


Locality properties
[Figure: same picture as the previous slides -- row sliver of A, block of B, and element (i,j) of C.]

What is the locality of this algorithm?

Iterate over all elements of the block N times (once for each row sliver in A)
Row sliver in matrix A accessed bsize times (once for each column sliver in B)

If block and row slivers fit in the cache, performance should rock!
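For example (simple arithmetic, assuming 8-byte doubles): with bsize = 25, the block of B is 25 * 25 * 8 = 5000 bytes and a row sliver of A is 25 * 8 = 200 bytes, so both fit comfortably in a 16 KB L1 cache.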


Pentium Blocked Matrix Multiply Performance

Blocking (bijk and bikj) improves performance by a factor of two over the unblocked versions (ijk and jik)

Relatively insensitive to array size.


[Figure: cycles per inner-loop iteration (0 to 60) vs. array size n (25 to 400) for the six unblocked orderings (kji, jki, kij, ikj, jik, ijk) and the two blocked versions, bijk (bsize = 25) and bikj (bsize = 25). The blocked versions are the lowest curves and stay nearly flat across all array sizes.]



Cache performance test program


/* The test function: read the array with the given stride. */
void test(int elems, int stride) {
    int i, result = 0;
    volatile int sink;

    for (i = 0; i < elems; i += stride)
        result += data[i];
    sink = result;   /* So compiler doesn't optimize away the loop */
}

/* Run test(elems, stride) and return read throughput (MB/s) */
double run(int size, int stride)
{
    uint64_t start_cycles, end_cycles, diff;
    int elems = size / sizeof(int);

    test(elems, stride);                        /* Warm up the cache */
    start_cycles = get_cpu_cycle_counter();     /* Read CPU cycle counter */
    test(elems, stride);                        /* Run test */
    end_cycles = get_cpu_cycle_counter();       /* Read CPU cycle counter again */
    diff = end_cycles - start_cycles;           /* Compute time */
    return (size / stride) / (diff / CPU_MHZ);  /* Convert cycles to MB/s */
}

Cache performance main routine


#define CPU_MHZ   2.8 * 1024.0 * 1024.0  /* e.g., 2.8 GHz */
#define MINBYTES  (1 << 10)              /* Working set size ranges from 1 KB */
#define MAXBYTES  (1 << 23)              /* ... up to 8 MB */
#define MAXSTRIDE 16                     /* Strides range from 1 to 16 */
#define MAXELEMS  MAXBYTES/sizeof(int)

int data[MAXELEMS];    /* The array we'll be traversing */

int main()
{
    int size;      /* Working set size (in bytes) */
    int stride;    /* Stride (in array elements) */

    init_data(data, MAXELEMS);   /* Initialize each element in data to 1 */

    for (size = MAXBYTES; size >= MINBYTES; size >>= 1) {
        for (stride = 1; stride <= MAXSTRIDE; stride++)
            printf("%.1f\t", run(size, stride));
        printf("\n");
    }
    exit(0);
}