Paolo.Ienne@epfl.ch
EPFL I&C LAP
(Largely based on slides by P. R. Panda, IITD
and P. Marwedel, University of Dortmund)
Motivation
Memories are the limiting performance factor
System-on-Chip memories and SRAMs embedded in
FPGAs are fast (1-2 cycles access) but:
On-chip memory might not be enough
eDRAM or eFLASH may be coming into the picture
Ienne 2003-08
Performance
10-90% of system performance may be memory related
Power
25-40% of system power may be memory related
Sub-banking
[Figure: energy and access times. Source: Marwedel, 2007]
Applications are getting larger and larger
The energy cost of keeping access times low is very high
Processor Energy
Cacheless monoprocessor: main memory energy 71%, processor energy 29%
Multiprocessor with I and D caches: processor energy, I-cache energy, D-cache energy, and main memory energy (values in the figure: 51.9%, 28.1%, 14.8%, 5.2%)
Source: Verma and Marwedel, Springer 2007
Outline
Memory data layout
Scratchpad memory
Custom memory architectures
Similarity
This is reminiscent of the study of placement for DSP variables (but the problem and strategy were different)
int a[1024];
int b[1024];
int c[1024];
...
for (i = 0; i < N; i++)
    c[i] = a[i] + b[i];
[Figure: a[i], b[i], and c[i] in memory and their positions in the data cache (direct-mapped, 512 words)]
Aliasing Example
Cache size C, line size M, array size N
Addresses and cache position:
a[i]: address i, cache line (i mod C) / M
b[i]: address i + N, cache line ((i + N) mod C) / M
c[i]: address i + 2N, cache line ((i + 2N) mod C) / M
Solutions?
Set-associative cache
Make C larger than N
?! Costly!
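The cache position formula can be checked numerically. The following is a minimal sketch, assuming word addresses, the three arrays laid out back to back from address 0, the 512-word direct-mapped cache and 1024-element arrays of the example, and a hypothetical line size of 4 words; the function names are illustrative only.

```c
#include <stddef.h>

/* Assumed parameters (in words): 512-word cache and 1024-element arrays
   as in the example; the 4-word line size is hypothetical. */
enum { CACHE_WORDS = 512, LINE_WORDS = 4, ARRAY_WORDS = 1024 };

/* Cache line selected by a word address in a direct-mapped cache:
   (address mod C) / M. */
size_t cache_line(size_t word_addr)
{
    return (word_addr % CACHE_WORDS) / LINE_WORDS;
}

/* Returns 1 if a[i], b[i], and c[i] all select the same cache line when
   a starts at address 0, b at N, and c at 2N. */
int all_alias(size_t i)
{
    size_t la = cache_line(i);
    size_t lb = cache_line(i + ARRAY_WORDS);
    size_t lc = cache_line(i + 2 * ARRAY_WORDS);
    return la == lb && lb == lc;
}
```

Under these assumptions N is a multiple of C, so the three accesses of every iteration compete for one cache line and keep evicting each other.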
Source: Banakar et al., IEEE 2002
int a[1024];
DUMMY (M words)
int b[1024];
DUMMY (M words)
int c[1024];
...
for (i = 0; i < N; i++)
    c[i] = a[i] + b[i];
[Figure: a[i], b[i], and c[i] in memory, separated by DUMMY pads, and their positions in the data cache (direct-mapped, 512 words)]
Data alignment avoids cache conflicts
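The effect of the DUMMY pads can be sketched in the same way. This is an illustration under assumptions, not the tool flow of the cited work: word addresses, a 512-word direct-mapped cache, a hypothetical 4-word line, and a pad of `pad` words inserted after each array. All names are hypothetical.

```c
#include <stddef.h>

/* 512-word cache and 1024-element arrays as in the example;
   the 4-word line size is hypothetical. */
enum { CACHE_WORDS = 512, LINE_WORDS = 4, ARRAY_WORDS = 1024 };

/* Cache line selected by a word address in a direct-mapped cache. */
size_t cache_line(size_t word_addr)
{
    return (word_addr % CACHE_WORDS) / LINE_WORDS;
}

/* Base address of array number k (0 = a, 1 = b, 2 = c) when `pad`
   dummy words follow each array. */
size_t padded_base(size_t k, size_t pad)
{
    return k * (ARRAY_WORDS + pad);
}

/* Returns 1 if a[i], b[i], and c[i] select three different cache lines. */
int lines_distinct(size_t i, size_t pad)
{
    size_t la = cache_line(padded_base(0, pad) + i);
    size_t lb = cache_line(padded_base(1, pad) + i);
    size_t lc = cache_line(padded_base(2, pad) + i);
    return la != lb && lb != lc && la != lc;
}
```

With a pad of one line (here `lines_distinct(i, LINE_WORDS)`), the three accesses fall into consecutive cache lines; with no pad they collide.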
Classic Transformation
Loop Blocking
Modify the loop exploration space in blocks (or tiles) so that all elements accessed at once fit in the cache
Original Code
for i = 1 to N
  for k = 1 to N
    r = X[i,k]
    for j = 1 to N
      Z[i,j] += r * Y[k,j]
Blocked Code
for kk = 1 to N step B
  for jj = 1 to N step B
    for i = 1 to N
      for k = kk to min(kk+B-1, N)
        r = X[i,k]
        for j = jj to min(jj+B-1, N)
          Z[i,j] += r * Y[k,j]
[Figure: N x N arrays traversed in B x B blocks]
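The blocked loop nest can be written in C roughly as follows. This is a sketch under stated assumptions: 0-based indices, products accumulated into Z (Z must be zeroed by the caller), and hypothetical sizes N = 6 and B = 4 chosen so that the last block is partial.

```c
#include <stddef.h>

#define N 6   /* hypothetical matrix dimension */
#define B 4   /* hypothetical block (tile) size */

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/* Z += X * Y with the loop order of the original code. */
void matmul_plain(double X[N][N], double Y[N][N], double Z[N][N])
{
    for (size_t i = 0; i < N; i++)
        for (size_t k = 0; k < N; k++) {
            double r = X[i][k];
            for (size_t j = 0; j < N; j++)
                Z[i][j] += r * Y[k][j];
        }
}

/* Same computation, visiting Y in B x B blocks so that the block of Y
   and the touched part of a row of Z can stay in the cache. */
void matmul_blocked(double X[N][N], double Y[N][N], double Z[N][N])
{
    for (size_t kk = 0; kk < N; kk += B)
        for (size_t jj = 0; jj < N; jj += B)
            for (size_t i = 0; i < N; i++)
                for (size_t k = kk; k < min_sz(kk + B, N); k++) {
                    double r = X[i][k];
                    for (size_t j = jj; j < min_sz(jj + B, N); j++)
                        Z[i][j] += r * Y[k][j];
                }
}
```

Both functions perform the same multiply-accumulate operations; only the order in which the iteration space is visited changes.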
Idea:
Split the array in blocks or tiles and group tiles of each array which are accessed at once
If the tiles are small enough, the set of tiles accessed at once will fit into the cache
Since they are adjacent in data memory, they will not conflict in the cache
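One ingredient of such a layout is storing each tile contiguously instead of row by row. The sketch below shows only that index mapping for a single square array; the grouping of tiles from different arrays is not covered. The dimension `DIM`, the tile size `TILE`, and the function name are hypothetical, and `TILE` is assumed to divide `DIM`.

```c
#include <stddef.h>

/* Hypothetical sizes; TILE is assumed to divide DIM. */
enum { DIM = 8, TILE = 4 };

/* Word offset of element (i, j) when the DIM x DIM array is stored tile
   by tile: each TILE x TILE tile occupies TILE * TILE consecutive words,
   and tiles are ordered row-major. */
size_t tiled_offset(size_t i, size_t j)
{
    size_t tiles_per_row = DIM / TILE;
    size_t tile_id = (i / TILE) * tiles_per_row + (j / TILE);
    size_t within  = (i % TILE) * TILE + (j % TILE);
    return tile_id * TILE * TILE + within;
}
```

For example, `tiled_offset(1, 0)` is 4 rather than 8: the second row of the first tile directly follows its first row in memory.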
Source: Panda et al., IEEE 2001
le = le / 2
for (i = j; i < 2048; i += 2*le)
{
    ... = sigreal[i]
    ... = sigreal[i + le]
    sigreal[i] = ...
    sigreal[i + le] = ...
}
[Figure: array sigreal, elements 0 and 1024, mapped onto the cache (positions 0 to 511)]
Padded FFT
double sigreal[2048 + 16]
le = le / 2; le = le + le / 128
for (i = j; i < 2048; i += 2*le) {
    i = i + i / 128
    ... = sigreal[i]
    ... = sigreal[i + le]
    sigreal[i] = ...
    sigreal[i + le] = ...
}
Pads (~1 cache line, every cache size)
[Figure: 1st outer loop iteration. Array sigreal, elements 0 and 1032, mapped onto the cache (positions 0 to 511)]
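The index remapping can be tried out in isolation. The following is a minimal sketch, assuming one array element per cache word, a direct-mapped cache of 512 words as in the figure, and a hypothetical line size of 4 words; `pad_index` and the two conflict checks are illustrative names, not part of the cited work.

```c
#include <stddef.h>

/* 512 cache positions as in the figure; the line size is hypothetical. */
enum { CACHE_WORDS = 512, LINE_WORDS = 4 };

/* Cache line selected by an element index (one element per word assumed). */
size_t cache_line(size_t idx)
{
    return (idx % CACHE_WORDS) / LINE_WORDS;
}

/* Padded index: one extra element is skipped after every 128 elements,
   which is why the array grows from 2048 to 2048 + 16 elements. */
size_t pad_index(size_t i)
{
    return i + i / 128;
}

/* Returns 1 if elements i and i + le select the same cache line. */
int conflict_plain(size_t i, size_t le)
{
    return cache_line(i) == cache_line(i + le);
}

/* Same check after remapping both indices into the padded array. */
int conflict_padded(size_t i, size_t le)
{
    return cache_line(pad_index(i)) == cache_line(pad_index(i + le));
}
```

With `le` equal to 1024, the unpadded pair always collides, while the padded pair lands 8 words apart in the cache (element 1024 moves to index 1032).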
Outline
Memory data layout
Scratchpad memory
Custom memory architectures
[Figure: CPU addressable memory, addresses 0 to N-1. Addresses 0 to P-1: on-chip memory (1 cycle). Addresses P to N-1: off-chip memory (10-20 cycles), accessed through an on-chip data cache (1 cycle)]
Increase determinism
Save power
[Figure: scratch pad curve plotted against memory size (256 to 16384); vertical axis 0 to 8. Source: Banakar et al., IEEE 2002]
Timing Predictability
Source: Marwedel, 2007
Scratchpad Memory
Embedded processor-based system
Processor core
Embedded memory
Design problems
1. How much on-chip memory?
2. Partitioning of on-chip memory in cache and scratchpad?
3. Which variables/arrays in the scratchpad?
Goals
Improve performance
Save power
Architecture Exploration
Explore exhaustively the design space
Requires an algorithm to perform partitioning between on- and off-chip memory
[Example: Histogram]
Effect of on-chip memory size
BrightnessLevel[512][512];
Hist[256];
BrightnessLevel: Regular Access, Off-chip + Cache
Hist: Irregular Access, Scratchpad
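The two access patterns are visible in a straightforward histogram loop. This sketch assumes 8-bit brightness values and unsigned counters, since the element types are not given; the function name is hypothetical.

```c
#include <stddef.h>

enum { ROWS = 512, COLS = 512, LEVELS = 256 };

/* Counts how many pixels have each brightness level.
   BrightnessLevel is read sequentially (regular access), while the
   element of Hist that is updated depends on the pixel value
   (irregular, data-dependent access). */
void histogram(unsigned char BrightnessLevel[ROWS][COLS],
               unsigned int Hist[LEVELS])
{
    for (size_t k = 0; k < LEVELS; k++)
        Hist[k] = 0;
    for (size_t i = 0; i < ROWS; i++)
        for (size_t j = 0; j < COLS; j++)
            Hist[BrightnessLevel[i][j]]++;
}
```

The large image streams well through a cache, whereas the small, unpredictably indexed table fits entirely in a scratchpad.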
[Figure: mask position at iteration (0,0) and at iteration (0,1)]
mask: Small, Scratchpad
source + dest: Large and Regular, Off-chip + Cache
Data Partitioning
Pre-Partitioning Scratchpad/Off-chip
Scalar variables and constants to scratchpad
Large arrays to off-chip memory
Detailed Partitioning
Identify critical data for scratchpad
Criteria:
Life-times of arrays
Access frequency of arrays
Loop conflicts
[Figure: processor with scratchpad memory, capacity SSP]
Which object (array, loop, etc.) to be stored in a scratchpad?
Non-overlaying allocation
Candidate objects: for i ...{ }, repeat..., function..., Array..., Int...
Solution: knapsack algorithm
Overlaying allocation
Moving objects back and forth between hierarchy levels
Solution: more complex...
Source: Steinke et al., IEEE 2002
Symbols:
S(var_k) = size of variable k
n(var_k) = number of accesses to variable k
e(var_k) = energy saved per variable access, if var_k is migrated
Source: Steinke et al., IEEE 2002
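With these symbols, non-overlaying allocation can be read as a 0/1 knapsack: choose the variables whose total size fits the scratchpad capacity and whose total saving n(var_k) * e(var_k) is largest. The following dynamic-programming sketch is one standard way to solve a knapsack and is not claimed to be the formulation used in the cited work; integer energy units, the `MAX_CAPACITY` bound, and all names are assumptions.

```c
#include <stddef.h>

enum { MAX_CAPACITY = 1024 };  /* hypothetical upper bound on capacity */

/* Best total energy saving obtainable by placing a subset of n variables
   in a scratchpad of `capacity` words (0/1 knapsack).
   size[k] = S(var_k), accesses[k] = n(var_k), saving[k] = e(var_k),
   with energy expressed in arbitrary integer units. */
long best_saving(size_t n, const size_t size[], const long accesses[],
                 const long saving[], size_t capacity)
{
    long best[MAX_CAPACITY + 1] = {0};  /* best[c]: optimum for capacity c */

    if (capacity > MAX_CAPACITY)
        capacity = MAX_CAPACITY;

    for (size_t k = 0; k < n; k++) {
        long gain = accesses[k] * saving[k];
        /* Walk capacities downwards so each variable is used at most once. */
        for (size_t c = capacity + 1; c-- > size[k]; ) {
            long candidate = best[c - size[k]] + gain;
            if (candidate > best[c])
                best[c] = candidate;
        }
    }
    return best[capacity];
}
```

For instance, with sizes {4, 3, 2}, access counts {10, 8, 3}, a saving of 2 units per access, and a capacity of 5 words, the two smaller variables together (saving 22) beat the single largest one (saving 20).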
[Figure: cycles [x100] and energy [J] for the multi_sort benchmark (mix of sorting algorithms). Source: Steinke et al., IEEE 2002]
Numbers will change with technology but algorithms will remain unchanged
Outline
Memory data layout
Scratchpad memory
Custom memory architectures
[Figure: address space split into banks (Bank #2, Bank #3) of small bitwidth, accessible at once]
[Figure: memory with a page buffer. The row address Addr[15:8] selects a page, the column address Addr[7:0] selects the data in the page buffer; pages labelled A, B, C]
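The address split suggests a simple way to estimate how often an access sequence stays within the page already held in the page buffer. This is a sketch assuming 16-bit addresses and a single page buffer that keeps the most recently accessed row and starts empty; the function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Row address: Addr[15:8]. */
uint8_t row_of(uint16_t addr) { return (uint8_t)(addr >> 8); }

/* Column address: Addr[7:0]. */
uint8_t col_of(uint16_t addr) { return (uint8_t)(addr & 0xFF); }

/* Counts the accesses whose row is already in the page buffer.
   Assumes one page buffer holding the most recently accessed row,
   empty before the first access. */
size_t count_page_hits(const uint16_t addrs[], size_t n)
{
    size_t hits = 0;
    int have_row = 0;
    uint8_t open_row = 0;

    for (size_t k = 0; k < n; k++) {
        uint8_t row = row_of(addrs[k]);
        if (have_row && row == open_row) {
            hits++;               /* column access only */
        } else {
            open_row = row;       /* new page must be loaded */
            have_row = 1;
        }
    }
    return hits;
}
```

Interleaved accesses to arrays that live in different rows produce few hits, which is the motivation for giving each array its own memory and page buffer.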
[Figure: arrays A[I], B[I], and C[2I], each with its own row and column address, page buffer, and path to the datapath]
DFG
Conflict Graph
Schedule
Minimal Allocation
Modified from Panda et al., ACM 2001
Source: Panda et al., ACM 2001
Useful Exploration Space
Summary
In SoCs and FPGAs the situation is different from general-purpose computers
Different design space (fast memories almost as fast as logic)
Fewer constraints to use standard components: any size possible, more types of memory available (e.g., dual port), etc.
More bandwidth exploitable (no pins)
References
P. R. Panda et al., Data and Memory Optimization Techniques for Embedded Systems, ACM Transactions on Design Automation of Electronic Systems, 6(2):149-206, April 2001
M. Verma and P. Marwedel, Advanced Memory