Kishore Dasari
Managing Director/GM
www.dspconcepts.com
Agenda
Introduction
Review of signal processing design
Software building blocks for signal processing
Optimization techniques
Conclusion
Quick Demos
Who is DSP Concepts?
An engineering services company specializing in embedded audio product and technology development
Market Size vs. Sophistication of Audio Processing
[Figure: markets plotted by annual unit sales (horizontal axis, 1k to 1B) against complexity of audio processing (vertical axis: Low, Medium, High). Markets shown: Broadcast, PA & Evacuation, Professional Powered Speakers, Guitar Pedal, OEM Auto Amplifier, Aftermarket Auto Amplifier, Head unit, TV, Home Theater, Set top box, iPod, Cell phone, PCs, Game Consoles, Cameras, Speaker phone, Aircraft Radio. Legend: Markets DSP Concepts has experience in; Markets to focus on; Audio is Primary; Audio is Secondary. Annotations: Only decoding; With Speakers.]
Digital signal control - blend
Digital Signal Controller
MCU: Low costs, Ease of use, C Programming, Interrupt handling, Ultra low power
DSP: Harvard architecture, Single cycle MAC, Floating Point, Barrel shifter
Mathematical details
FIR filter
\[ y[n] = \sum_{k=0}^{N-1} h[k]\, x[n-k] \]
IIR or recursive filter
\[ y[n] = b_0 x[n] + b_1 x[n-1] + b_2 x[n-2] + a_1 y[n-1] + a_2 y[n-2] \]
FFT Butterfly (radix-2)
\[ Y[k_1] = X[k_1] + X[k_2] \]
\[ Y[k_2] = \left( X[k_1] - X[k_2] \right) e^{-j\omega} \]
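The recursive filter above can be sketched in C as a single second-order section. This is only an illustration: the type `biquad_t` and the function `biquad_process` are hypothetical names, and the feedback terms are assumed to be added (the sign convention CMSIS-DSP documents for its biquad functions) rather than subtracted.

```c
#include <stddef.h>

/* One second-order (biquad) section. Names are illustrative only. */
typedef struct {
    float b0, b1, b2;   /* feedforward coefficients */
    float a1, a2;       /* feedback coefficients (assumed to be added) */
    float x1, x2;       /* x[n-1], x[n-2] */
    float y1, y2;       /* y[n-1], y[n-2] */
} biquad_t;

/* y[n] = b0*x[n] + b1*x[n-1] + b2*x[n-2] + a1*y[n-1] + a2*y[n-2] */
static void biquad_process(biquad_t *s, const float *in, float *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        float x0 = in[i];
        float y0 = s->b0 * x0 + s->b1 * s->x1 + s->b2 * s->x2
                 + s->a1 * s->y1 + s->a2 * s->y2;
        /* Age the history by one sample. */
        s->x2 = s->x1;
        s->x1 = x0;
        s->y2 = s->y1;
        s->y1 = y0;
        out[i] = y0;
    }
}
```

Each output sample costs five multiply-accumulates, which is why the MAC throughput discussed next matters so much for this kind of filter.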
Most operations are dominated by MACs
These can be on 8, 16 or 32 bit operations
Powerful MAC instructions
OPERATION                            INSTRUCTION
16 x 16 = 32                         SMULBB, SMULBT, SMULTB, SMULTT
16 x 16 + 32 = 32                    SMLABB, SMLABT, SMLATB, SMLATT
16 x 16 + 64 = 64                    SMLALBB, SMLALBT, SMLALTB, SMLALTT
16 x 32 = 32                         SMULWB, SMULWT
(16 x 32) + 32 = 32                  SMLAWB, SMLAWT
(16 x 16) ± (16 x 16) = 32           SMUAD, SMUADX, SMUSD, SMUSDX
(16 x 16) ± (16 x 16) + 32 = 32      SMLAD, SMLADX, SMLSD, SMLSDX
(16 x 16) ± (16 x 16) + 64 = 64      SMLALD, SMLALDX, SMLSLD, SMLSLDX
32 x 32 = 32                         MUL
32 ± (32 x 32) = 32                  MLA, MLS
32 x 32 = 64                         SMULL, UMULL
(32 x 32) + 64 = 64                  SMLAL, UMLAL
(32 x 32) + 32 + 32 = 64             UMAAL
32 ± (32 x 32) = 32 (upper)          SMMLA, SMMLAR, SMMLS, SMMLSR
(32 x 32) = 32 (upper)               SMMUL, SMMULR
All the above operations are single cycle on the Cortex-M4 processor
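To make the dual 16 x 16 multiply-accumulate row of the table concrete, here is a portable reference model of what such an instruction computes. It is a sketch, not the instruction itself: `dual_mac16`, `pack16` and `dot16` are hypothetical helper names, and on a Cortex-M4 a compiler intrinsic such as CMSIS's `__SMLAD` would be the usual route to the single-cycle instruction.

```c
#include <stdint.h>

/* Reference model of a dual 16 x 16 MAC:
 * acc + lo(x)*lo(y) + hi(x)*hi(y), where x and y each pack two
 * signed 16-bit values. Overflow of the 32-bit sum is not handled. */
static int32_t dual_mac16(uint32_t x, uint32_t y, int32_t acc)
{
    int16_t x_lo = (int16_t)(x & 0xFFFFu);
    int16_t x_hi = (int16_t)(x >> 16);
    int16_t y_lo = (int16_t)(y & 0xFFFFu);
    int16_t y_hi = (int16_t)(y >> 16);
    return acc + (int32_t)x_lo * y_lo + (int32_t)x_hi * y_hi;
}

/* Pack two signed 16-bit samples into one 32-bit word (low half first). */
static uint32_t pack16(int16_t lo, int16_t hi)
{
    return (uint32_t)(uint16_t)lo | ((uint32_t)(uint16_t)hi << 16);
}

/* Dot product of two 16-bit arrays, two taps per step. n must be even. */
static int32_t dot16(const int16_t *a, const int16_t *b, int n)
{
    int32_t acc = 0;
    for (int i = 0; i < n; i += 2) {
        acc = dual_mac16(pack16(a[i], a[i + 1]),
                         pack16(b[i], b[i + 1]), acc);
    }
    return acc;
}
```

With 16-bit data, each step of `dot16` covers two filter taps, so the loop runs half as many iterations as a one-MAC-per-iteration version.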
Floating point hardware
IEEE 754 standard compliance
Single-precision floating point math key to some algorithms
Add, subtract, multiply, divide, MAC and square root
Fused MAC - provides higher precision
OPERATION                    CYCLE COUNT USING FPU
Add/Subtract                 1
Divide                       14
Multiply                     1
Multiply Accumulate (MAC)    3
Fused MAC                    3
Square Root                  14
Design Example
7-band Graphic Equalizer
Cortex-M3 LPC1768 running at 120MHz
Cortex-M4 running at 120MHz
Designed using DSP Concepts' Audio Weaver development environment:
a graphical drag-and-drop design environment and a set of optimized audio processing libraries.
DSP example - graphic equalizer
Audio Weaver signal flow
Real-Time Demo
7 band parametric EQ
32-bit precision
Stereo processing
48 kHz sample rate
Results
Performance
Cortex-M3 needed 1241 cycles (47.4% processor loading)
Cortex-M4 needed only 244 cycles (11% processor loading).
How to program - assembly or C?
Assembly?
+ Can result in highest performance
- Difficult learning curve, longer development cycles
- Code reuse difficult - not portable
C?
+ Easy to write and maintain code, faster development cycles
+ Code reuse possible, using third party software is easier
+ Intrinsics provide direct access to certain processor features
- Highest performance might not be possible
  Get to know your compiler!
C is definitely the preferred approach!
Circular Addressing
[Figure: FIR delay line. coeffPtr steps through the coefficients h[0], h[1], ..., h[N-2], h[N-1]; statePtr steps through the state samples x[n], x[n-1], x[n-2], ..., x[n-(N-1)].]
Data in the delay chain is right shifted every sample. This is very wasteful. How can we avoid this?
Circular addressing avoids this data movement.
Linear addressing of coefficients.
Circular addressing of states.
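The difference between shifting the delay chain and addressing it circularly can be sketched as follows. The function names and the filter length `NUM_TAPS` are illustrative assumptions, not part of the original slides.

```c
#include <stdint.h>

#define NUM_TAPS 4   /* illustrative filter length */

/* Shifting delay line: every new sample moves NUM_TAPS-1 old samples. */
static void delay_push_shift(int32_t *state, int32_t x)
{
    for (int k = NUM_TAPS - 1; k > 0; k--) {
        state[k] = state[k - 1];
    }
    state[0] = x;   /* state[k] now holds x[n-k] */
}

/* Circular delay line: only the write index moves, the data stays put.
 * Returns the next write position. */
static int delay_push_circular(int32_t *state, int index, int32_t x)
{
    state[index] = x;
    index++;
    if (index >= NUM_TAPS) {
        index = 0;
    }
    return index;
}

/* Read x[n-age] (age 0 = newest) from the circular buffer, where index
 * is the next write position and 0 <= age < NUM_TAPS. */
static int32_t delay_read_circular(const int32_t *state, int index, int age)
{
    int pos = index - 1 - age;
    if (pos < 0) {
        pos += NUM_TAPS;
    }
    return state[pos];
}
```

Both versions present the same samples to the filter; the circular one replaces N-1 data moves per input sample with one index update and a wrap test.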
Block based processing
Inner loop consists of:
Dual memory fetches
MAC
Pointer updates with circular addressing
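FIR filter Standard C Code
#include <stdint.h>

typedef int32_t q31_t;   /* same underlying type as the CMSIS-DSP q31_t */

/* Block FIR using a circular state buffer of filtLen samples.
   *stateIndexPtr holds the next write position between calls. */
void fir(q31_t *in, q31_t *out, q31_t *coeffs, q31_t *state,
         int *stateIndexPtr, int filtLen, int blockSize)
{
    int sample;
    int k;
    q31_t sum;   /* simplified: real Q31 code accumulates in 64 bits and rescales */
    int stateIndex = *stateIndexPtr;
    for (sample = 0; sample < blockSize; sample++)
    {
        /* Store the newest sample; stateIndex now points at x[n]. */
        state[stateIndex] = in[sample];
        sum = 0;
        for (k = 0; k < filtLen; k++)
        {
            sum += coeffs[k] * state[stateIndex];
            stateIndex--;
            if (stateIndex < 0)
            {
                stateIndex = filtLen - 1;
            }
        }
        /* After filtLen decrements the index is back at x[n];
           advance it to the next write position, wrapping at the end. */
        stateIndex++;
        if (stateIndex >= filtLen)
        {
            stateIndex = 0;
        }
        out[sample] = sum;
    }
    *stateIndexPtr = stateIndex;
}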
FIR filter DSP Code
32-bit DSP processor assembly code
Only the inner loop is shown; it executes in a single cycle
Optimized assembly code, cannot be achieved in C
lcntr=r2, do FIRLoop until lce;
FIRLoop: f12=f0*f4, f8=f8+f12, f4=dm(i1,m4), f0=pm(i12,m12);
Zero overhead loop
State fetch with circular addressing
Coeff fetch with linear addressing
Multiply and accumulate previous
Cortex-M inner loop
Fetch coeffs[k]            2 cycles
Fetch state[stateIndex]    1 cycle
MAC                        1 cycle
stateIndex--               1 cycle
Circular wrap              4 cycles
Loop overhead              3 cycles
------------
Total                     12 cycles
for(k=0;k<filtLen;k++)
{
sum += coeffs[k] * state[stateIndex];
stateIndex--;
if (stateIndex < 0)
{
stateIndex = filtLen-1;
}
}
Even though the MAC executes in 1 cycle,
there is overhead compared to a DSP.
How can this be improved on the Cortex-M4?
Optimization strategies
Circular addressing alternatives
Loop unrolling
Caching of intermediate variables
Extensive use of SIMD and intrinsics
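Loop unrolling and caching of intermediate variables can be sketched on a plain dot product, the core of the FIR inner loop. This is a minimal illustration under stated assumptions: the function names are hypothetical, the data is addressed linearly (no circular wrap), and the products are accumulated in 64 bits without any fixed-point rescaling.

```c
#include <stdint.h>

/* Straightforward version: loop overhead is paid for every MAC. */
static int64_t dot_simple(const int32_t *coeffs, const int32_t *state, int n)
{
    int64_t sum = 0;
    for (int k = 0; k < n; k++) {
        sum += (int64_t)coeffs[k] * state[k];
    }
    return sum;
}

/* Unrolled by four: loop overhead is shared by four MACs, and the
 * operands are loaded into locals that a compiler can keep in registers. */
static int64_t dot_unrolled4(const int32_t *coeffs, const int32_t *state, int n)
{
    int64_t sum = 0;
    int k = 0;
    for (; k + 4 <= n; k += 4) {
        int32_t c0 = coeffs[k],     s0 = state[k];
        int32_t c1 = coeffs[k + 1], s1 = state[k + 1];
        int32_t c2 = coeffs[k + 2], s2 = state[k + 2];
        int32_t c3 = coeffs[k + 3], s3 = state[k + 3];
        sum += (int64_t)c0 * s0;
        sum += (int64_t)c1 * s1;
        sum += (int64_t)c2 * s2;
        sum += (int64_t)c3 * s3;
    }
    /* Handle the remaining 0 to 3 taps. */
    for (; k < n; k++) {
        sum += (int64_t)coeffs[k] * state[k];
    }
    return sum;
}
```

Whether the unrolled form is actually faster depends on the compiler and optimization level, so it is worth checking the generated code or measuring cycle counts on the target.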
Circular Buffering alternative
[Figure: coefficients h[0] to h[5]; state samples x[0] to x[8], shown as the spans x[0] to x[5] and x[5] to x[8].]
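One common alternative to circular buffering, and an assumption here because the slide's own details are not recoverable, is a linear state buffer of length filter length + block size - 1, the layout CMSIS-DSP documents for its FIR state arrays. The sketch below uses hypothetical names, 6 taps and a block of 4 samples (so 9 state entries), and plain integer arithmetic without fixed-point rescaling.

```c
#include <stdint.h>
#include <string.h>

#define NUM_TAPS   6
#define BLOCK_SIZE 4
#define STATE_LEN  (NUM_TAPS + BLOCK_SIZE - 1)   /* 9 entries */

/* Block FIR over a linear state buffer of STATE_LEN samples.
 * state[0 .. NUM_TAPS-2] carries history from the previous block. */
static void fir_block_linear(const int32_t *in, int32_t *out,
                             const int32_t *coeffs, int32_t *state)
{
    /* Append the new block after the NUM_TAPS-1 history samples. */
    memcpy(&state[NUM_TAPS - 1], in, BLOCK_SIZE * sizeof(int32_t));

    for (int i = 0; i < BLOCK_SIZE; i++) {
        const int32_t *newest = &state[i + NUM_TAPS - 1];   /* x[n] */
        int64_t sum = 0;
        for (int k = 0; k < NUM_TAPS; k++) {
            sum += (int64_t)coeffs[k] * newest[-k];   /* h[k] * x[n-k] */
        }
        out[i] = (int32_t)sum;
    }

    /* Keep the last NUM_TAPS-1 samples as history for the next block. */
    memmove(state, &state[BLOCK_SIZE], (NUM_TAPS - 1) * sizeof(int32_t));
}
```

The inner loop has no wrap test at all; the cost moves to one copy of NUM_TAPS-1 samples per block, which shrinks in relative terms as the block size grows.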