
Low Power Computing

Low-Power Memory Optimization


Outline

Low power caches and other optimizations
Memory customization

Low Power Caches and Optimizations
Memory Optimization

Most memory-oriented performance optimizations reduce the NUMBER of memory accesses
  this also indirectly reduces power
However, other optimizations can apply
  particularly when the memory architecture is flexible
Cache Architecture

[Figure: READ operation. The address splits into Tag, Index, and Offset fields; the indexed tag entry is compared (Valid, Match) to generate the Hit signal while the data array supplies the Data.]
Augmenting Cache Architecture

Basic idea: add a small buffer/cache
  minimize accesses to the larger memory core
  stores recently accessed data
  fetch directly from this buffer, inhibiting the L1 cache access
  if the hit rate at the buffer is high, power is reduced

Techniques of cache augmentation:
  Filter cache
  Block buffers
  Scratch pad memory
Filter Cache

Insert another small cache (the filter cache) before L1:
  CPU -> Filter cache -> L1 -> L2
  a high hit ratio in the tiny filter cache means most accesses never reach L1
  reduces power, at some cost in hit latency
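The power argument above can be sketched as a first-order energy model: every access probes the filter cache, and only misses pay for an L1 probe as well. The per-access energy numbers in the usage note are illustrative assumptions, not measurements.

```c
#include <assert.h>

/* First-order energy model for a filter cache (illustrative sketch):
 * every access probes the small filter cache; on a miss, L1 is probed too. */
double filter_cache_energy(double hit_rate, double e_filter, double e_l1)
{
    /* expected energy per access = filter probe + (miss fraction) * L1 probe */
    return e_filter + (1.0 - hit_rate) * e_l1;
}
```

With an assumed filter-cache probe cost of 1 unit and an L1 probe cost of 10, even a 50% filter hit rate gives 6 units per access versus 10 without the filter cache; below some break-even hit rate the extra probe stops paying off.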
Block Buffer

Aim: reduce accesses to the memory core
Block buffering:
  save the last accessed cache line's address (and contents) in a buffer
  if the next access is to the same line, read directly from the buffer
  saves an access to the memory core when spatial locality exists
  can be extended to a small fully associative buffer
Mapping techniques:
  direct mapping
  fully associative mapping
  set associative mapping
[Figure: address split into Tag, Index, Offset; small tag and data buffers are checked (Valid, Match -> Hit) before the full arrays.]
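A one-entry block buffer can be sketched as below: the structure, the 32-byte line size, and the hit/miss counters are assumptions for illustration, not a particular hardware design.

```c
#include <stddef.h>

/* Sketch of a one-entry block buffer in front of a cache's data array. */
#define LINE_BYTES 32

typedef struct {
    unsigned long last_line;   /* line address of the most recent access */
    int valid;
    unsigned long core_reads;  /* accesses that reached the memory core */
    unsigned long buf_hits;    /* accesses served by the buffer */
} block_buffer;

void bb_access(block_buffer *bb, unsigned long addr)
{
    unsigned long line = addr / LINE_BYTES;
    if (bb->valid && line == bb->last_line) {
        bb->buf_hits++;        /* same line: read the buffer, core stays idle */
    } else {
        bb->core_reads++;      /* new line: read the core, latch it in the buffer */
        bb->last_line = line;
        bb->valid = 1;
    }
}
```

For a sequential sweep over 64 bytes, only 2 of the 64 accesses reach the memory core; the other 62 are served by the buffer, which is exactly the spatial-locality win described above.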
Scratch Pad Memory

Compiler-managed memory:
  part of the memory address space, directly addressed
  can be on-chip SRAM
  fast, predictable, low power vs. a cache
  used in embedded processors, e.g. the IBM Cell
What data/code should reside in the scratch pad?
  a compiler decision
[Figure: the CPU address space maps partly to on-chip scratch pad memory (1 cycle) and partly, via an on-chip cache (1 cycle), to off-chip memory (10-20 cycles).]
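The compiler decision can be illustrated by a typical transformation: copy a hot row into the scratch pad once, then serve all inner-loop reads from it. The plain array standing in for on-chip SRAM and the function shape are assumptions for illustration.

```c
#include <string.h>

/* Sketch of compiler-managed scratch pad use: one row of `a` is copied
 * once into a small buffer (modelled here as a plain array standing in
 * for on-chip SRAM), so the inner loop reads the cheap scratch pad
 * instead of going back to main memory for a[row][k]. */
#define N 4

void mm_row(const int a[N][N], const int b[N][N], int c_row[N], int row)
{
    int spm[N];                        /* stands in for on-chip SRAM */
    memcpy(spm, a[row], sizeof spm);   /* one bulk copy into the scratch pad */
    for (int j = 0; j < N; j++) {
        int sum = 0;
        for (int k = 0; k < N; k++)
            sum += spm[k] * b[k][j];   /* these reads hit the scratch pad, not a */
        c_row[j] = sum;
    }
}
```

Unlike a cache, nothing here is speculative: the compiler knows exactly which accesses hit the 1-cycle memory, which is why scratch pads are both predictable and cheap.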
Tag Comparison

Basic idea: reduce tag comparisons to save power
Techniques:
  Conventional tag and data access
    all tag and data arrays accessed simultaneously
    more power consumption, but performance is high
  Sequential tag and data access
    power saving, performance penalty
  Way prediction (based on hit/miss history etc.)
    hit on the predicted way: result produced in a single cycle
    miss on the predicted way: result takes more than one cycle
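The way-prediction timing above can be sketched for a 2-way set with an MRU (most-recently-used) predictor; the structure layout and retraining policy are assumptions for illustration, not any specific processor's design.

```c
/* Sketch of MRU way prediction for a 2-way set: probe the predicted way
 * first (1 cycle, only one tag/data way powered); only on a wrong
 * prediction probe the remaining way (a second cycle). */
#define WAYS 2

typedef struct {
    unsigned tags[WAYS];
    int mru;        /* predicted way: the most recently used one */
} cache_set;

/* Returns cycles spent on the lookup (assumes the line is resident). */
int probe(cache_set *s, unsigned tag)
{
    if (s->tags[s->mru] == tag)
        return 1;                     /* hit on predicted way: single cycle */
    for (int w = 0; w < WAYS; w++) {
        if (w != s->mru && s->tags[w] == tag) {
            s->mru = w;               /* retrain the predictor */
            return 2;                 /* extra cycle for the second probe */
        }
    }
    return 2;                         /* genuine miss (not modelled further) */
}
```

The power win comes from the 1-cycle case: only one way's tag and data arrays are activated, instead of all ways in parallel as in the conventional scheme.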
Transformations and Disks

Basic idea:
  data transformation techniques change how data is represented and accessed
  assume the data is large, so disk access is needed
Techniques:
  Loop fusion
  Loop fission
Loop fusion vs. loop fission: which one is preferable?
Transformations and Disks

Assume arrays a and b are located on different disks
  loop fusion may be bad: both disks are continuously accessed
  loop fission (the reverse) could be preferable: the idle disk can sit in a low power mode

Loop fission (one disk idle at a time):
    for (i=0;i<N;i++) {
        a[i][j] = 0;          /* Disk 2 idle */
    }
    for (i=0;i<N;i++) {
        b[i][j] = 0;          /* Disk 1 idle */
    }

Loop fusion (both disks busy):
    for (i=0;i<N;i++) {
        a[i][j] = 0;
        b[i][j] = 0;
    }
DRAM

Two ways to manage dynamic RAM power:
  Power management by the memory controller
  Compiler / OS
    insert code to control the DRAM power state
    bring successive accesses together to create longer idle periods
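Why bringing accesses together helps can be sketched with a toy bank model: the controller drops a bank to a low-power state after a fixed idle period, and each wakeup is expensive. The timeout policy, state names, and access patterns are assumptions for illustration.

```c
/* Toy DRAM bank power model: the controller powers a bank down once it
 * has been idle for `timeout` time steps; touching a powered-down bank
 * costs an expensive wakeup. Clustered accesses leave longer idle runs,
 * so the bank sleeps more and wakes less. */
enum pstate { ACTIVE, POWER_DOWN };

typedef struct {
    enum pstate state;
    int idle;           /* steps since this bank was last touched */
    int wakeups;        /* expensive transitions out of POWER_DOWN */
} bank;

void tick(bank *b, int accessed, int timeout)
{
    if (accessed) {
        if (b->state == POWER_DOWN)
            b->wakeups++;             /* costly exit from the low-power state */
        b->state = ACTIVE;
        b->idle = 0;
    } else if (++b->idle >= timeout) {
        b->state = POWER_DOWN;        /* controller powers the bank down */
    }
}
```

Over the same seven steps with three accesses, a scattered pattern (access, gap, gap, repeated) forces two wakeups, while the clustered pattern (three accesses, then one long idle run) forces none: the compiler/OS goal above is exactly to produce the second shape.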
Encoding

Techniques for encoding words:
  reduce the Hamming distance between successive words in transmission
  add extra bits
  adjust data layout, etc.
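One classic instance of "add extra bits to reduce Hamming distance" is bus-invert coding: if more than half of the bus lines would toggle, transmit the complement and assert one extra invert line. The 8-bit bus width is an assumption for illustration.

```c
/* Bus-invert coding sketch for an 8-bit bus: if more than half of the
 * lines would toggle relative to the previous bus value, drive the
 * complement instead and raise the extra "invert" line, capping the
 * switching activity (and hence dynamic power) per transfer. */
int popcount8(unsigned v) { int c = 0; for (; v; v >>= 1) c += v & 1; return c; }

/* Encode `word` against the previous bus value *prev; returns the invert
 * bit. *prev is updated to the value actually driven on the bus. */
int bus_invert(unsigned *prev, unsigned word)
{
    int dist = popcount8((*prev ^ word) & 0xFF);  /* Hamming distance */
    if (dist > 4) {                               /* more than half of 8 lines */
        *prev = ~word & 0xFF;                     /* drive the complement */
        return 1;                                 /* invert line asserted */
    }
    *prev = word & 0xFF;
    return 0;
}
```

At the cost of one extra wire, no transfer ever toggles more than half the data lines, which bounds worst-case transmission energy.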
Compression

Opportunity: if we can compress instructions or data, we can use smaller memories
Issues?
Compression possibilities: mostly instructions
Instruction compression:
  a small subset of instructions is typically used
  the de-compressor may lie
    between cache and memory: activated only on a miss
    between CPU and cache: better compression benefit, since the cache itself holds compressed code
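The "small subset of instructions" observation can be exploited with a dictionary scheme: frequent instruction words are stored once and replaced by short indices. The dictionary size, 1-byte index, and escape-byte framing are assumptions for illustration, not a specific ISA's encoding.

```c
#include <stddef.h>

/* Dictionary-based instruction compression sketch: the most frequent
 * 32-bit instruction words are stored once in a small dictionary; each
 * occurrence in the program becomes a 1-byte index, while instructions
 * outside the dictionary are kept raw behind an escape byte. */
#define DICT_SIZE 4

/* Returns the compressed size in bytes of `n` 32-bit instructions. */
size_t compressed_bytes(const unsigned *code, size_t n,
                        const unsigned dict[DICT_SIZE])
{
    size_t bytes = 0;
    for (size_t i = 0; i < n; i++) {
        int hit = 0;
        for (int d = 0; d < DICT_SIZE; d++)
            if (code[i] == dict[d]) { hit = 1; break; }
        bytes += hit ? 1 : 1 + 4;   /* 1-byte index, or escape byte + raw word */
    }
    return bytes;
}
```

When most of the instruction stream falls in the dictionary, the program shrinks towards a quarter of its raw size, which is what lets the instruction memory (or cache) be smaller and cheaper to access.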


Memory Customization

Memory Architecture Customization
Tailoring memory architecture parameters to the application:
  cache size, line size
  banking structure
  scratch pad memory
Customization

Memory optimization in terms of memory access rate:
  design the algorithm (program logic) so that the memory access rate is as low as possible
  consider power consumption while designing the logic and minimize repeated accesses to the same data

Example: Matrix multiplication
  How many memory accesses (in terms of data/opcode)?
  What optimization is possible?

for (i = 0; i < N; i++) {
    for (j = 0; j < N; j++) {
        c[i][j] = 0;                        <==== N*N memory accesses for c
        for (k = 0; k < N; k++)
            c[i][j] += a[i][k] * b[k][j];   <==== N*N*N accesses each for a, b, and c
    }
}

Memory accesses: for c, N*N + N*N*N
                 for a and b, N*N*N each
Optimization: can we reduce the accesses to these variables? If so, less power is consumed.
Customization

Optimization for matrix multiplication:
  try to use a smaller memory
  add a local variable whose stack slot can live in a register or the scratch pad memory

for (i = 0; i < N; i++) {
    for (j = 0; j < N; j++) {
        result = 0;
        for (k = 0; k < N; k++)
            result += a[i][k] * b[k][j];    <==== still N*N*N accesses each for a and b
        c[i][j] = result;                   <==== only N*N accesses for c
    }
}

Optimized number of memory accesses: for c, only N*N
                                     for a and b, N*N*N each
So the power consumed accessing c becomes much lower.
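The access counts above can be verified by instrumenting both versions with a counter; each update of c is counted as one access, matching the slide's tally, and the counter itself is only bookkeeping for illustration.

```c
/* Count accesses to c[][] in both matrix-multiply versions (N kept small).
 * Each touch of c (the initialization, the in-place update, or the final
 * store) bumps c_accesses once, following the slide's counting. */
#define N 4
static long c_accesses;

void mm_naive(int c[N][N], int a[N][N], int b[N][N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            c[i][j] = 0; c_accesses++;                 /* N*N initial writes */
            for (int k = 0; k < N; k++) {
                c[i][j] += a[i][k] * b[k][j];
                c_accesses++;                          /* N*N*N updates of c */
            }
        }
}

void mm_scratch(int c[N][N], int a[N][N], int b[N][N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            int result = 0;                            /* lives in a register */
            for (int k = 0; k < N; k++)
                result += a[i][k] * b[k][j];           /* no c access here */
            c[i][j] = result; c_accesses++;            /* only N*N writes */
        }
}
```

For N = 4, the naive version touches c 80 times (N*N + N*N*N) while the scratch-variable version touches it only 16 times (N*N), which is exactly the reduction claimed above.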
Memory Banking

A memory bank is a part of (cache) memory:
  banks are addressed consecutively within the total set of memory banks,
  i.e., when data item a(n) is stored in bank b, data item a(n+1) is stored in bank b+1
  (cache) memory is divided into banks to evade the effects of the bank cycle time
  when data is stored or retrieved consecutively, each bank has enough time to recover before the next request for that bank arrives

Memory banking is important in power optimization:
  again, the memory access rate matters
  basic idea: idle banks can be turned off
  arrange data and data accesses into banks such that this opportunity is created
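The interleaving rule above (item n in bank b, item n+1 in bank b+1) amounts to bank = address mod NBANKS, so which banks a loop touches, and which can be turned off, follows directly from its access stride. The 4-bank configuration is an assumption for illustration.

```c
/* Bank interleaving sketch: consecutive words go to consecutive banks
 * (bank = word address mod NBANKS). A strided loop touches only some
 * banks, so the untouched banks can be turned off for the whole loop. */
#define NBANKS 4

int bank_of(unsigned word_addr) { return word_addr % NBANKS; }

/* Bitmask of banks touched by n accesses starting at `start` with `stride`. */
unsigned banks_touched(unsigned start, unsigned n, unsigned stride)
{
    unsigned mask = 0;
    for (unsigned i = 0; i < n; i++)
        mask |= 1u << bank_of(start + i * stride);
    return mask;
}
```

A unit-stride sweep activates all four banks, but a stride-4 sweep lands entirely in bank 0, so three of the four banks could be powered down: this is the data-layout opportunity the slide describes.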
Reconfigurable Caches

Configure cache blocks dynamically. How?
Adjusting cache parameters:
  dynamically adjust
    size
    associativity
    line size
How does this help?

Reconfigurable Caches

Reconsider parameter adjustability: what cache reconfiguration is possible?
  loop nesting behavior
  conflict prediction

How to adjust for the loop nesting requirement?
  Should the cache be flushed?
    if data reuse is expected, don't flush
  Should loops be reconsidered to decide which variables/data will be accessed?
    use a smaller or medium cache/memory based on the loop's nature
Reconfigurable Caches

How to adjust for conflict prediction?
  cache size: more associativity gives more space to utilize
  associativity: when few conflicts are predicted, associativity can be reduced

Example: matrix computations

Fig1:
    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++) {
            c[i][j] = 0;
            for (k = 0; k < N; k++)
                c[i][j] += a[i][k] * b[k][j];
        }
    }

Fig2:
    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++) {
            x = a[i][j] + b[i][j];
            for (k = 0; k < N; k++)
                c[i][k] = c[i][k] + x;
        }
    }

Fig1: a similar amount of data is accessed, so the same cache size is required
Fig2: fewer conflicts in the second loop, so a lower associativity can be used
Commercial Implementations

Software caches on the CELL processor:
  cache emulated in software
References

P. R. Panda, F. Catthoor, N. Dutt, K. Danckaert, E. Brockmeyer, C. Kulkarni, A. Vandecappelle, P. G. Kjeldsberg, "Data and Memory Optimization Techniques for Embedded Systems," ACM Transactions on Design Automation of Electronic Systems, 6(2), Apr 2001.

W. Wolf and M. Kandemir, "Memory System Optimization of Embedded Software," Proceedings of the IEEE, 91(1), Jan 2003.