Systems Laboratory
School of ECE, Purdue University
NEC Laboratories America
International Symposium on Microarchitecture 2013
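Is 433/21 > 20.4?
float x = 433.0f / 21;  /* floating-point division, about 20.62; integer 433/21 would truncate to 20 */
float y = 20.4f;
(x > y) ? YES : NO
YES
Is 433/21 > 1?
float x = 433.0f / 21;
float y = 1;
(x > y) ? YES : NO
YES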
But, I worked harder than needed
Leads to inefficiency
And, an overkill (for many applications)
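The two comparisons above need very different amounts of work. As a minimal sketch (the helper name `fractional_bits_needed` and the bit-by-bit refinement scheme are illustrative assumptions, not something shown on the slides), the snippet below counts how many fractional quotient bits must be computed before the outcome of `num/den > y` is certain. It assumes positive integers `num` and `den`.

```python
from fractions import Fraction

def fractional_bits_needed(num, den, y, max_bits=32):
    """Return (bits, result): the fewest fractional quotient bits that settle num/den > y.

    Hypothetical helper for illustration. With f fractional bits, the true
    quotient is only known to lie in [q / 2**f, (q + 1) / 2**f).
    """
    y = Fraction(y)
    for f in range(max_bits + 1):
        q = (num << f) // den          # quotient truncated to f fractional bits
        lo = Fraction(q, 1 << f)       # true quotient is >= lo
        hi = Fraction(q + 1, 1 << f)   # true quotient is <  hi
        if lo > y:
            return f, True             # certainly greater than y
        if hi <= y:
            return f, False            # certainly not greater than y
    raise ValueError("undecided within max_bits")

print(fractional_bits_needed(433, 21, 1))                  # (0, True)
print(fractional_bits_needed(433, 21, Fraction(204, 10)))  # (1, True)
```

Comparing against 1 is settled by the integer part of the quotient alone, while comparing against 20.4 needs one more bit. Computing the full-precision quotient in both cases is the extra effort the slide calls overkill.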
Applications with Intrinsic Application Resilience:
Search
Mining
Recognition
Vision
Video Processing
INTRINSIC APPLICATION RESILIENCE: SOURCES
Perceptual Limitations
Statistical, Probabilistic Computations
Redundant Input Data
Noisy Real World Inputs
Principal Component Analysis
Intrinsic Application Resilience: Self-Healing
Repeat until convergence:
Compute distances & assign points to clusters
Update cluster means
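The loop above can be sketched in a few lines. This is a toy 1D version under stated assumptions: integer points, integer means, and a hypothetical `drop_bits` knob that truncates the low-order bits of every distance to mimic approximate arithmetic. None of these names come from the slides.

```python
def kmeans_1d(points, means, max_iters=100, drop_bits=0):
    """Toy 1D clustering loop; distances lose their drop_bits low-order bits."""
    means = list(means)
    for _ in range(max_iters):
        # Compute distances & assign points to clusters
        clusters = [[] for _ in means]
        for p in points:
            dists = [abs(p - m) >> drop_bits for m in means]
            clusters[dists.index(min(dists))].append(p)
        # Update cluster means (an empty cluster keeps its old mean)
        new_means = [sum(c) // len(c) if c else m
                     for c, m in zip(clusters, means)]
        if new_means == means:   # converged
            break
        means = new_means
    return means

points = [10, 12, 14, 100, 104, 108]
print(kmeans_1d(points, [0, 50]))               # [12, 104]
print(kmeans_1d(points, [0, 50], drop_bits=2))  # [12, 104]
```

On this small input the approximate distances lead to the same cluster means as the exact ones, because later iterations re-evaluate every assignment. This only illustrates the self-healing idea on one example and does not guarantee it for arbitrary data.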
[Figure: MIN to MAX EFFORT tradeoff. Axes: Effort (Energy) and Quality, each from Min to Max. Annotation: Disproportionate benefit]
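[Figure: approximate computing across the design stack]
Stack levels: Specifications, Application, Software, Architecture, Circuits, Layout
Approximation techniques: Approximate Algorithm, Approximate Architecture, Approximate Circuit, Approximate Implementation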
APPROXIMATE ARCHITECTURE
Algorithm-specific accelerators
Domain-specific accelerators (image, video)
Programmable accelerators (GPGPUs, MIC) / Vector processors
General purpose processors / Multicores
Accurate and approximate instructions: Truffle (Esmaeilzadeh et al., ASPLOS 2012), EnerJ (Sampson et al., PLDI 2011)
Pros:
Large energy benefits
Broader applicability
Challenges:
Limited applicability
Opportunity:
Wide range of applications
Fine-grained parallelism
CONTRIBUTIONS
HW/SW INTERFACE
[Figure: % of Approximate instructions, two bar charts. Legend of one chart: Arbitrary, < 50%, < 12.5%, < 2.5%. Legend of the other: Arbitrary, < 50%, < 25%, < 7.5%. One labelled panel: Image Segmentation (K-means)]
HW/SW INTERFACE
Decode & Control: Translate instruction quality specification into accuracy knobs built in hardware
Register File
Quality Configurable Execution Unit: Capable of executing instructions with different quality levels
Any approximate HW design technique, e.g. precision scaling
QUALITY MONITORS AND ERROR FEEDBACK
quality to software
Software visible Error Registers
Instruction accuracy monitor
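A minimal software model of this interface, assuming unsigned integer operands and a single hypothetical knob (`quality_bits`, the number of low-order operand bits to drop), might look as follows. The class and method names are invented for illustration, and the model computes the exact result only to play the role of the accuracy monitor.

```python
class QualityConfigurableUnit:
    """Toy execution unit: adds integers with a precision knob and logs the error."""

    def __init__(self):
        # Software-visible error register: largest |error| observed so far
        self.error_register = 0

    def decode(self, quality_bits):
        # Decode & Control: translate the quality field into a hardware knob,
        # here a mask covering the low-order bits to be dropped
        return (1 << quality_bits) - 1

    def add(self, a, b, quality_bits=0):
        mask = self.decode(quality_bits)
        approx = (a & ~mask) + (b & ~mask)   # precision-scaled operands
        exact = a + b                        # reference value, used only by this model's monitor
        self.error_register = max(self.error_register, abs(exact - approx))
        return approx

unit = QualityConfigurableUnit()
print(unit.add(13, 22, quality_bits=0))  # 35
print(unit.add(13, 22, quality_bits=3))  # 24
print(unit.error_register)               # 11
```

Software can read `error_register` after a batch of approximate operations and decide whether to request a higher quality level for the next batch.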
Multimedia
Synthesis
Vision
Recognition
Search
Video Processing
Image Analysis
Mining
QUORA
Quality programmable 1D/2D vector processor
QUORA: OVERVIEW
3-tier processing element hierarchy:
2D array PEs (m rows, n columns)
2 sets of 1D array PEs
One scalar PE
2 streaming memory
Application characteristic: 2 levels of reduction operations
PE tier: Functionality / Size; PE count; PE type
Scalar PE: Similar to scalar uProcessor; (1); CAPE (Completely Accurate Processing Element)
1D-Array PE: Small register file, complex execution units; (m+n); MAPE
2D-array PE: Simple accumulator based data path; (m*n); APE (Approximate Processing Elements)
[Figure: Scope for approximation across the tiers, with axes PE count, Energy and Complexity; the 1D-Array PE tier is annotated (> 70%) and the 2D-array PE tier (> 90%)]
3-tiered PE hierarchy enables larger energy benefits from approximate computing
(while matching application characteristics)
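The PE counts per tier follow directly from the array dimensions. A small sketch (the function name `pe_counts` and the dictionary keys are illustrative):

```python
def pe_counts(m, n):
    """PE count per tier for an m x n array, using the (1), (m+n), (m*n) counts."""
    counts = {
        "2D-array PE": m * n,   # one PE per array position
        "1D-Array PE": m + n,   # two 1D sets: one per row and one per column
        "Scalar PE": 1,
    }
    counts["total"] = sum(counts.values())
    return counts

print(pe_counts(16, 16))
# {'2D-array PE': 256, '1D-Array PE': 32, 'Scalar PE': 1, 'total': 289}
```

For a 16 x 16 array this gives 256 + 32 + 1 = 289 PEs, so the simplest tier holds the large majority of the processing elements.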
Inst. Type: Instruction
Scalar Instructions:
LDRI Rd, value
ADDR Rd,Rs1,Rs2
BEZ Rs, Rel. address
HALT
Streaming Memory instructions:
LDSM R_length, stride, burst, R_st_add
Array instructions (types listed: 1D Array Reduction Instructions, 1D Array Streaming Instructions, 1D Array Self Operand Instructions, 2D Array Instructions):
qpMAC R_length, R_row_enb, R_col_enb, R_q_type, R_q_amt
qpMIN <r/c>, R_row_enb,
SEQ R_length, SReg, R_row_enb, R_col_enb
qpMOD2 R_length, R_row_enb, R_col_enb, R_q_type, R_q_amt
STR <r/c>, R_stride, R_burst, R_st_add, R_row_enb, R_col_enb
Instruction quality fields
e.g. qpMAC R_length, R_row_enb, R_col_enb, R_q_type, R_q_amt
[Equations garbled in extraction]
Amount of error
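To make the quality fields concrete, the sketch below splits a textual quality-programmable instruction into named operands. The operand names are copied from the instruction listing; the comma-separated assembly syntax, the register names `R1` to `R5`, and the helper names `QpInstruction`, `QP_FORMATS` and `parse_qp` are assumptions made for illustration.

```python
from typing import NamedTuple

class QpInstruction(NamedTuple):
    opcode: str
    operands: dict   # operand name -> register given in the assembly text

# Operand order taken from the listing; only two opcodes are modelled here.
QP_FORMATS = {
    "qpMAC":  ["R_length", "R_row_enb", "R_col_enb", "R_q_type", "R_q_amt"],
    "qpMOD2": ["R_length", "R_row_enb", "R_col_enb", "R_q_type", "R_q_amt"],
}

def parse_qp(line):
    """Parse e.g. 'qpMAC R1, R2, R3, R4, R5' into a QpInstruction."""
    opcode, _, rest = line.strip().partition(" ")
    fields = QP_FORMATS[opcode]
    values = [tok.strip() for tok in rest.split(",")]
    if len(values) != len(fields):
        raise ValueError(f"{opcode} expects {len(fields)} operands, got {len(values)}")
    return QpInstruction(opcode, dict(zip(fields, values)))

inst = parse_qp("qpMAC R1, R2, R3, R4, R5")
print(inst.operands["R_q_type"], inst.operands["R_q_amt"])  # R4 R5
```

The last two operands carry the quality specification, so the same opcode can run at different quality levels simply by loading different values into those registers.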
[Figure: QUORA block diagram]
Blocks: INST. MEMORY, DATA MEMORY, INTERFACE, INST. DECODE & CONTROL UNIT, Prog. Counter, Scalar Reg. File, CAPE (Completely Accurate Processing Element), Mixed Processing Element Array (MAPE), Approximate Processing Element Array (APE ARRAY), Streaming Memory Banks (SM), Quality Control Unit & Quality Monitors
Components inside the arrays: ALU, ACC, MUX, Reg, Scratch Registers, 1-to-many-DEMUX
Signals: CLK, RESET, Halt, Instruction, Inst. Read, Inst. Add, Data. Read, Data. Write, Data. Add, Data. IN, Data. OUT, SM_row_sel, SM_col_sel, MAPE_row_sel, MAPE_col_sel
Annotation: Enable quality configurable execution
Precision scaling: 4 different flavors
X[i]: operand
PSc: No. of bits to scale precision for (e.g., PSc == 3)
Xpsc[i]: Precision scaled operand
Up/Down Precision Scaling: X[2:0] >= 4 ? Round up : Round down
Xt: Modulate the threshold for round-off
P.Err, N.Err, Err.[i], Err.[i+1] (MAX/+): Enables error feedback to software
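Two of the precision scaling flavors can be sketched for non-negative integers as below. The function names are illustrative; the default threshold of half the dropped range matches the `X[2:0] >= 4` example for PSc == 3, and the optional `threshold` argument is an assumed stand-in for the modulated round-off threshold.

```python
def psc_truncate(x, psc):
    """Drop the psc low-order bits of x (always rounds down)."""
    return x & ~((1 << psc) - 1)

def psc_up_down(x, psc, threshold=None):
    """Round x up or down to a multiple of 2**psc based on the dropped bits."""
    if psc == 0:
        return x
    if threshold is None:
        threshold = 1 << (psc - 1)       # 4 when psc == 3
    low = x & ((1 << psc) - 1)           # the bits being dropped, X[psc-1:0]
    base = x - low
    return base + (1 << psc) if low >= threshold else base

def signed_error(x, scaled):
    """Positive when scaling overshoots, negative when it undershoots."""
    return scaled - x

for x in (13, 11, 20):
    t, r = psc_truncate(x, 3), psc_up_down(x, 3)
    print(x, t, signed_error(x, t), r, signed_error(x, r))
```

Truncation always produces a non-positive error, while up/down rounding produces errors of both signs that can partly cancel when values are accumulated. Tracking the per-operand error alongside the data is what makes error feedback to software possible.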
[Figure: MAPE LEVEL VIEW and Quality control unit]
Labels in the MAPE level view: MAPE, PSc. Unit, ACC, APE, Gated CLK, CLK, Error, Err.Reg, +/>, max (|. |, |. |)
Labels around the Quality control unit: Op-code, Error target, Actual Error, PSc, R.PSc, C.PSc
Example values shown: E. Reg = 2, R.PSc = 0, C.PSc = 1 +
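The figure labels suggest a loop that compares the actual error against an error target and adjusts the precision scaling amount. The sketch below is one simple way such a loop could work in software. It is not the controller from the slides: the function names, the truncation-based error measure and the stop-at-first-violation policy are all assumptions.

```python
def max_truncation_error(data, psc):
    """Largest error caused by dropping the psc low-order bits of each value."""
    mask = (1 << psc) - 1
    return max((x & mask for x in data), default=0)

def tune_psc(measure_error, error_target, max_psc=8):
    """Raise the precision scaling amount while the measured error stays within target."""
    psc = 0
    while psc < max_psc and measure_error(psc + 1) <= error_target:
        psc += 1
    return psc

data = [37, 90, 14, 255]
chosen = tune_psc(lambda p: max_truncation_error(data, p), error_target=5)
print(chosen)  # 2
```

With a target of 5, dropping 1 or 2 bits keeps the worst-case error at 1 and 3, while dropping 3 bits raises it to 7, so the loop settles on 2.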
EXPERIMENTAL METHODOLOGY
RTL implementation of QUORA using Verilog HDL
Synthesized to IBM 45nm technology node
Micro-architectural Parameters: Value
Array Dimensions: 16 X 16
No. of PEs: 289 (256 + 32 + 1)
32 / 8
No. of SM elements: 32
Depth of SM elements: 64
Operating Frequency: 250 MHz
Metric: Value
Feature Size: 45nm
Area: 2.6 mm2
Power: 367.8 mW
Gate Count: 502042
[Figure: two breakdown charts over APEs (%), MAPEs (%), CAPE (%), SMs (%), PScE (%), Misc. (%); surviving values: 1%, 0%, 1%, 19%, 51%, 28%]
BENCHMARKS
[Figure: example flow: Search image, Principal Component Analysis, SVM Classifier, Results (0: Burger, 1: Bread, 2: Food, ..., 25: McDonald's)]
Applications, listed as Algorithm: Dataset
SVM: MNIST
SVM: NORB
CNN: MNIST
GLVQ
K-NN: OCR digits
ANN: Adult
SSI: Subset of Wikipedia
K-means: Berkeley dataset
K-means: OCR digits
Quality Metric: Percentage classification accuracy
RESULTS SUMMARY
Energy savings
[Figure: bar chart, vertical axis from 0 to 1.2; legend: No Approx., < 0.5%, ~ 2.5%, ~ 7.5%]
QP-instructions in QUORA
[Figure: two pie charts with categories QP-APE, QP-MAPE and Accurate; panel labels: programmable instructions, Energy; surviving values: 10%, 1%, 3%, 2%, 96%, 88%]
Precision scaling mechanisms
[Figure: two panels, MAC - APE and ACC - MAPE; horizontal axis: Average Error (%); series: Trunc, Up/Down, Err. Comp]
SUMMARY
Intrinsic application resilience: A new dimension to optimize HW and SW
Objective: Energy-efficient & programmable processor