DSP Designing With The Cortex-M4

Designing with the Cortex-M4
Kishore Dasari
Managing Director/GM
www.dspconcepts.com
Agenda
Introduction
Review of signal processing design
Software building blocks for signal processing
Optimization techniques
Conclusion
Quick demos
Who is DSP Concepts?
An engineering services company specializing in embedded audio product and technology development
Market Size vs. Sophistication of Audio Processing
[Figure: markets plotted by annual unit sales (x-axis, 1k to 1B) against complexity of audio processing (y-axis: Low, Medium, High), split into "Audio is Primary" and "Audio is Secondary". Markets shown: Broadcast, PA & Evacuation, Professional Powered Speakers, Guitar Pedal, OEM Auto Amplifier, Aftermarket Auto Amplifier, TV, Home Theater, iPod, Cell phone, Set top box, Head unit, PCs, Game Consoles, Cameras, Speaker phone, Aircraft Radio. Additional labels: "Only decoding", "With Speakers". Legend: markets DSP Concepts has experience in; markets to focus on.]
Digital signal control - blend
A Digital Signal Controller blends:
MCU: low costs, ease of use, C programming, interrupt handling, ultra low power
DSP: Harvard architecture, single cycle MAC, floating point, barrel shifter
Mathematical details
FIR filter
$$y[n] = \sum_{k=0}^{N-1} h[k]\, x[n-k]$$
IIR or recursive filter
$$y[n] = b_0 x[n] + b_1 x[n-1] + b_2 x[n-2] + a_1 y[n-1] + a_2 y[n-2]$$
FFT butterfly (radix-2)
$$Y[k_1] = X[k_1] + X[k_2]$$
$$Y[k_2] = (X[k_1] - X[k_2])\, e^{-j\omega}$$
Most operations are dominated by MACs
These can be on 8, 16 or 32 bit operations
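To make the recursive filter concrete, here is a minimal host-runnable sketch of one second-order section in single-precision C. It assumes the convention in which the feedback terms are added (any sign is folded into a1 and a2); the `biquad_t` type and `biquad_process` function are hypothetical names for illustration, not part of any library mentioned in the slides.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical container for one second-order (biquad) section. */
typedef struct {
    float b0, b1, b2;   /* feedforward coefficients */
    float a1, a2;       /* feedback coefficients, added to the sum */
    float x1, x2;       /* previous inputs  x[n-1], x[n-2] */
    float y1, y2;       /* previous outputs y[n-1], y[n-2] */
} biquad_t;

/* y[n] = b0 x[n] + b1 x[n-1] + b2 x[n-2] + a1 y[n-1] + a2 y[n-2] */
void biquad_process(biquad_t *s, const float *in, float *out, size_t n)
{
    assert(s != NULL && in != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        float x0 = in[i];
        /* five multiplies accumulated into one output sample */
        float y0 = s->b0 * x0 + s->b1 * s->x1 + s->b2 * s->x2
                 + s->a1 * s->y1 + s->a2 * s->y2;
        s->x2 = s->x1;  s->x1 = x0;   /* shift the input history */
        s->y2 = s->y1;  s->y1 = y0;   /* shift the output history */
        out[i] = y0;
    }
}
```

Every output sample costs five multiply-accumulate steps, which is why MAC throughput dominates this kind of workload.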
Powerful MAC instructions
OPERATION                               INSTRUCTION
16 x 16 = 32                            SMULBB, SMULBT, SMULTB, SMULTT
16 x 16 + 32 = 32                       SMLABB, SMLABT, SMLATB, SMLATT
16 x 16 + 64 = 64                       SMLALBB, SMLALBT, SMLALTB, SMLALTT
16 x 32 = 32                            SMULWB, SMULWT
(16 x 32) + 32 = 32                     SMLAWB, SMLAWT
(16 x 16) ± (16 x 16) = 32              SMUAD, SMUADX, SMUSD, SMUSDX
(16 x 16) ± (16 x 16) + 32 = 32         SMLAD, SMLADX, SMLSD, SMLSDX
(16 x 16) ± (16 x 16) + 64 = 64         SMLALD, SMLALDX, SMLSLD, SMLSLDX
32 x 32 = 32                            MUL
32 ± (32 x 32) = 32                     MLA, MLS
32 x 32 = 64                            SMULL, UMULL
(32 x 32) + 64 = 64                     SMLAL, UMLAL
(32 x 32) + 32 + 32 = 64                UMAAL
32 ± (32 x 32) = 32 (upper)             SMMLA, SMMLAR, SMMLS, SMMLSR
(32 x 32) = 32 (upper)                  SMMUL, SMMULR
All the above operations are single cycle on the Cortex-M4 processor
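As a rough illustration of what three of the 32-bit rows in the table compute, the sketch below models them in portable C. The function names are hypothetical; on a Cortex-M4 a compiler would typically map such expressions to SMULL, SMLAL and SMMUL, but that mapping is an assumption about the toolchain, not something this code guarantees.

```c
#include <assert.h>
#include <stdint.h>

/* 32 x 32 = 64: full-width signed product (the SMULL row). */
int64_t mul_32x32_64(int32_t a, int32_t b)
{
    return (int64_t)a * b;
}

/* (32 x 32) + 64 = 64: product added to a 64-bit accumulator (the SMLAL row).
 * Accumulator overflow is not handled in this sketch. */
int64_t mac_32x32_64(int64_t acc, int32_t a, int32_t b)
{
    return acc + (int64_t)a * b;
}

/* (32 x 32) = 32 (upper): keep only the top 32 bits of the product (the SMMUL row).
 * Assumes >> on a negative int64_t is an arithmetic shift, as on mainstream compilers. */
int32_t mul_32x32_upper(int32_t a, int32_t b)
{
    return (int32_t)(((int64_t)a * b) >> 32);
}
```

The "upper" form is the natural building block for fractional (Q31-style) arithmetic, where only the most significant half of the product is kept.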
Floating point hardware
IEEE 754 standard compliance
Single-precision floating point math key to some algorithms
Add, subtract, multiply, divide, MAC and square root
Fused MAC - provides higher precision
OPERATION                     CYCLE COUNT USING FPU
Add/Subtract                  1
Divide                        14
Multiply                      1
Multiply Accumulate (MAC)     3
Fused MAC                     3
Square Root                   14
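The precision benefit of a fused MAC comes from rounding once instead of twice. The sketch below contrasts the two on a host machine. `mac_unfused` and `mac_fused_model` are hypothetical names, and the fused version is only a model: it keeps the exact product in a double (a float x float product always fits) and rounds the final sum, which matches a true fused operation for the inputs discussed here but may differ in rare double-rounding cases.

```c
#include <assert.h>

/* Multiply, round to float, then add and round again (two roundings). */
float mac_unfused(float a, float b, float c)
{
    volatile float product = a * b;   /* volatile forces the product to be stored as a float */
    return product + c;
}

/* Model of a fused MAC: the product stays exact, only the final sum is rounded.
 * A float x float product needs at most 48 significand bits, so it is exact in double. */
float mac_fused_model(float a, float b, float c)
{
    double exact_product = (double)a * (double)b;
    return (float)(exact_product + (double)c);
}
```

With a = b = 1 + 2^-12 and c = -(1 + 2^-11), the exact answer is 2^-24. The unfused version loses that term when the product is rounded and returns 0, while the fused model returns 2^-24.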
Design Example
7-band Graphic Equalizer
Cortex-M3 LPC1768 running at 120MHz
Cortex-M4 running at 120MHz
Designed using DSP Concepts' Audio Weaver development environment: a graphical drag-and-drop design environment and a set of optimized audio processing libraries.
DSP example - graphic equalizer
Audio Weaver signal flow
Real-time Demo
7 band parametric EQ
32-bit precision
Stereo processing
48 kHz sample rate
Results
Performance
Cortex-M3 needed 1291 cycles (47.4% processor loading)
Cortex-M4 needed only 299 cycles (11% processor loading).
How to program - assembly or C?
Assembly?
+ Can result in highest performance
- Difficult learning curve, longer development cycles
- Code reuse difficult - not portable
C?
+ Easy to write and maintain code, faster development cycles
+ Code reuse possible, using third party software is easier
+ Intrinsics provide direct access to certain processor features
- Highest performance might not be possible
Get to know your compiler!
C is definitely the preferred approach!
Circular Addressing
[Figure: FIR delay line. coeffPtr steps through the coefficients h[0], h[1], ..., h[N-2], h[N-1]; statePtr steps through the states x[n], x[n-1], x[n-2], ..., x[n-(N-1)].]
Data in the delay chain is right shifted every sample. This is very wasteful. How can we avoid this?
Circular addressing avoids this data movement.
Linear addressing of coefficients.
Circular addressing of states.
Block based processing
Inner loop consists of:
Dual memory fetches
MAC
Pointer updates with circular addressing
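FIR filter Standard C Code
/* state[] is the circular delay line of length filtLen, passed in by the caller. */
void fir(q31_t *in, q31_t *out, q31_t *coeffs, q31_t *state,
         int *stateIndexPtr, int filtLen, int blockSize)
{
    int sample;
    int k;
    q63_t sum;                              /* 64-bit accumulator: a Q31 x Q31 product needs 62 bits */
    int stateIndex = *stateIndexPtr;

    for (sample = 0; sample < blockSize; sample++)
    {
        state[stateIndex] = in[sample];     /* newest sample overwrites the oldest */
        sum = 0;
        for (k = 0; k < filtLen; k++)
        {
            sum += (q63_t)coeffs[k] * state[stateIndex];
            stateIndex--;
            if (stateIndex < 0)
            {
                stateIndex = filtLen - 1;   /* circular wrap */
            }
        }
        /* filtLen wrapped decrements leave stateIndex back at the newest sample */
        out[sample] = (q31_t)(sum >> 31);   /* scale the accumulated sum back to Q31 */
        stateIndex++;                       /* advance the write position for the next sample */
        if (stateIndex >= filtLen)
        {
            stateIndex = 0;
        }
    }
    *stateIndexPtr = stateIndex;
}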
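For experimenting off-target, the same circular-state idea can be sketched in plain floating-point C with no fixed-point types. The names `fir_circ_t`, `fir_circ` and `FIR_TAPS` are hypothetical and chosen only for this illustration; the filter length is fixed at compile time to keep the sketch short.

```c
#include <assert.h>
#include <stddef.h>

#define FIR_TAPS 4   /* hypothetical filter length for this sketch */

typedef struct {
    float  state[FIR_TAPS];  /* circular delay line */
    size_t newest;           /* index where the next input sample is written */
} fir_circ_t;

/* Processes one block; coeffs[k] multiplies x[n-k]. */
void fir_circ(fir_circ_t *f, const float *coeffs,
              const float *in, float *out, size_t blockSize)
{
    assert(f != NULL && f->newest < FIR_TAPS);
    for (size_t n = 0; n < blockSize; n++) {
        size_t idx = f->newest;
        f->state[idx] = in[n];                          /* overwrite the oldest sample */
        float sum = 0.0f;
        for (size_t k = 0; k < FIR_TAPS; k++) {
            sum += coeffs[k] * f->state[idx];           /* MAC */
            idx = (idx == 0) ? FIR_TAPS - 1 : idx - 1;  /* step back with circular wrap */
        }
        out[n] = sum;
        f->newest = (f->newest + 1) % FIR_TAPS;         /* advance the write position */
    }
}
```

Feeding a unit impulse through a zero-initialized filter returns the coefficients in order, which is a quick sanity check that the wrap logic is right.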
FIR filter DSP Code
32-bit DSP processor assembly code
Only the inner loop is shown, executes in a single cycle
Optimized assembly code, cannot be achieved in C
lcntr=r2, do FIRLoop until lce;
FIRLoop: f12=f0*f4, f8=f8+f12, f4=dm(i1,m4), f0=pm(i12,m12);
Callouts: zero overhead loop; state fetch with circular addressing; coeff fetch with linear addressing; multiply and accumulate previous
Cortex-M inner loop
for(k=0;k<filtLen;k++)
{
sum += coeffs[k] * state[stateIndex];
stateIndex--;
if (stateIndex < 0)
{
stateIndex = filtLen-1;
}
}
Fetch coeffs[k]             2 cycles
Fetch state[stateIndex]     1 cycle
MAC                         1 cycle
stateIndex--                1 cycle
Circular wrap               4 cycles
Loop overhead               3 cycles
------------
Total                       12 cycles
Even though the MAC executes in 1 cycle, there is overhead compared to a DSP.
How can this be improved on the Cortex-M4?
Optimization strategies
Circular addressing alternatives
Loop unrolling
Caching of intermediate variables
Extensive use of SIMD and intrinsics
Circular buffering alternative
Create a buffer of length N + blockSize-1 and shift this once per block
Example: N = 6, blockSize = 4. Size of state buffer = 9.
[Figure: the coefficients h[0] to h[5] slide across the state buffer x[0] to x[8]; the picture is repeated for Block 1 through Block 4.]
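The enlarged state buffer can be sketched as follows, using the example sizes N = 6 and blockSize = 4. This is a floating-point illustration under assumed names (`fir_block`, `N_TAPS`, `BLOCK_SIZE`, `STATE_LEN`), not the code behind the slides. The point is that the inner loop only ever indexes forward, and the data movement happens once per block instead of once per sample.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define N_TAPS     6                           /* N in the example */
#define BLOCK_SIZE 4                           /* blockSize in the example */
#define STATE_LEN  (N_TAPS + BLOCK_SIZE - 1)   /* 9 */

/* state[0 .. N_TAPS-2] holds the last N-1 inputs of the previous block, oldest first.
 * coeffs[k] multiplies x[n-k]. */
void fir_block(float state[STATE_LEN], const float coeffs[N_TAPS],
               const float in[BLOCK_SIZE], float out[BLOCK_SIZE])
{
    /* Append the new block after the saved history. */
    memcpy(&state[N_TAPS - 1], in, BLOCK_SIZE * sizeof(float));

    for (size_t n = 0; n < BLOCK_SIZE; n++) {
        const float *x = &state[n];            /* oldest sample needed for output n */
        float sum = 0.0f;
        for (size_t k = 0; k < N_TAPS; k++) {
            /* the oldest sample pairs with the last coefficient; no wrap test needed */
            sum += coeffs[N_TAPS - 1 - k] * x[k];
        }
        out[n] = sum;
    }

    /* Shift once per block: keep the newest N-1 samples as history. */
    memmove(&state[0], &state[BLOCK_SIZE], (N_TAPS - 1) * sizeof(float));
}
```

The cost of the wrap test moves out of the inner loop entirely, at the price of blockSize-1 extra state words and one copy per block.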
Cortex-M4 code with change
for(k=0;k<filtLen;k++)
{
sum += coeffs[k] * state[stateIndex];
stateIndex++;
}
Fetch coeffs[k]             2 cycles
Fetch state[stateIndex]     1 cycle
MAC                         1 cycle
stateIndex++                1 cycle
Loop overhead               3 cycles
------------
Total                       8 cycles
Improvement in performance
DSP assembly code = 1 cycle
Cortex-M4 standard C code takes 12 cycles
Using circular addressing alternative = 8 cycles
33% better but still not comparable to the DSP
Let's try loop unrolling
Loop unrolling
This is an efficient language-independent optimization technique and makes up for the lack of a zero overhead loop on the Cortex-M4
There is overhead inherent in every loop for checking the loop counter and incrementing it for every iteration (3 cycles on the Cortex-M)
Loop unrolling processes 'n' loop indexes in one loop iteration, reducing the overhead by 'n' times
Unroll inner loop by 4
Fetch coeffs[k]             2 x 4 = 8 cycles
Fetch state[stateIndex]     1 x 4 = 4 cycles
MAC                         1 x 4 = 4 cycles
stateIndex++                1 x 4 = 4 cycles
Loop overhead               3 x 1 = 3 cycles
------------
Total                       23 cycles for 4 taps
                            = 5.75 cycles per tap
{
stateIndex++;
stateIndex++;
stateIndex++;
stateIndex++;
}
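A complete unrolled-by-4 inner loop might look like the sketch below. It is written in floating point for host testing, `dot_unrolled4` is a hypothetical name, and it assumes the tap count is a multiple of 4 so that no clean-up loop is needed.

```c
#include <assert.h>
#include <stddef.h>

/* Dot product of numTaps coefficients and states with the loop body unrolled by 4.
 * Assumes numTaps is a multiple of 4 (a restriction of this sketch). */
float dot_unrolled4(const float *coeffs, const float *state, size_t numTaps)
{
    assert(numTaps % 4 == 0);
    float sum = 0.0f;
    for (size_t k = numTaps >> 2; k > 0; k--) {   /* one counter update per 4 taps */
        sum += *coeffs++ * *state++;
        sum += *coeffs++ * *state++;
        sum += *coeffs++ * *state++;
        sum += *coeffs++ * *state++;
    }
    return sum;
}
```

The loop test and branch now run once per four MACs, which is where the saving in the cycle count comes from.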
After loop unrolling < 6 cycles
25% further improvement
But a large gap still exists
Let's try SIMD
Apply SIMD
Many image and video processing, and communications applications use 8- or 16-bit data types.
SIMD speeds these up
16-bit data yields a 2x speed improvement over 32-bit
8-bit data yields a 4x speed improvement
Access to SIMD is via compiler intrinsics
Example dual 16-bit MAC
SUM = __SMLALD(C, S, SUM)
[Figure: two 32-bit registers, each split into 16-bit high (H) and low (L) halves, are multiplied pairwise; the two products are added to the 64-bit Sum to give a 64-bit result.]
Data organization with SIMD
16-bit example
Access two neighbouring values using a single 32-bit memory read
[Figure: coefficients h[0] to h[5] and states x[0] to x[8] stored as 16-bit values and read in pairs.]
Inner loop with 16-bit SIMD
filtLen = filtLen >> 2;
for(k = 0; k < filtLen; k++)
{
c = *coeffs++; // 2 cycles
s = *state++; // 1 cycle
sum = __SMLALD(c, s, sum); // 1 cycle
} // 3 cycles
With the load, load, MAC group repeated four times per iteration: 4 x 4 + 3 = 19 cycles total. Computes 8 MACs
2.375 cycles per filter tap
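To see what the dual 16-bit MAC does without target hardware, the sketch below models it in portable C: the low halves and the high halves of two packed words are multiplied and both products are added to a 64-bit accumulator. The names `dual_mac16`, `pack16` and `dot_q15_simd_model` are hypothetical; on a Cortex-M4 the real work would be done by the `__SMLALD` intrinsic, and the packing step would be replaced by a single 32-bit load.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model of a dual 16-bit MAC: acc + lo(x)*lo(y) + hi(x)*hi(y), halves treated as signed.
 * Assumes the usual two's-complement conversion from uint32_t to int16_t. */
int64_t dual_mac16(uint32_t x, uint32_t y, int64_t acc)
{
    int16_t xl = (int16_t)(x & 0xFFFFu), xh = (int16_t)(x >> 16);
    int16_t yl = (int16_t)(y & 0xFFFFu), yh = (int16_t)(y >> 16);
    return acc + (int64_t)xl * yl + (int64_t)xh * yh;
}

/* Packs two 16-bit values into one 32-bit word, first value in the low half. */
uint32_t pack16(int16_t lo, int16_t hi)
{
    return (uint32_t)(uint16_t)lo | ((uint32_t)(uint16_t)hi << 16);
}

/* 16-bit dot product that consumes two taps per step. numTaps must be even. */
int64_t dot_q15_simd_model(const int16_t *coeffs, const int16_t *state, size_t numTaps)
{
    assert(numTaps % 2 == 0);
    int64_t sum = 0;
    for (size_t k = 0; k < numTaps; k += 2) {
        uint32_t c = pack16(coeffs[k], coeffs[k + 1]);   /* one 32-bit read on target */
        uint32_t s = pack16(state[k], state[k + 1]);     /* one 32-bit read on target */
        sum = dual_mac16(c, s, sum);                     /* two MACs in one operation */
    }
    return sum;
}
```

Each step retires two filter taps, so the number of loads and MAC operations per tap is halved compared with the 32-bit loop.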
After using SIMD instructions < 2.5 cycles
That's much better!
But is there anything more?
One more idea left
Caching intermediate values
FIR filter is extremely memory intensive. 12 out of 19 cycles in the last code portion deal with memory accesses
2 consecutive loads take 4 cycles on Cortex-M3, 3 cycles on Cortex-M4
MAC takes 3-7 cycles on Cortex-M3, 1 cycle on Cortex-M4
When operating on a block of data, memory bandwidth can be reduced by simultaneously computing multiple outputs and caching several coefficients and state variables
Data Organization with Caching
[Figure: 16-bit coefficients h[0] to h[5] and states x[0] to x[8]. coeffsPtr++ increments by 32 bits and loads c0; statePtr++ increments by 16 bits and loads the overlapping state words x0, x1, x2, x3.]
Compute 4 outputs simultaneously:
sum0 = __SMLALD(x0, c0, sum0)
Final FIR Code
sample = blockSize/4;
do
{
sum0 = sum1 = sum2 = sum3 = 0;
statePtr = stateBasePtr;
coeffPtr = (q31_t *)(S->coeffs);
x0 = *(q31_t *)(statePtr++);
x1 = *(q31_t *)(statePtr++);
i = numTaps>>2;
do
{
c0 = *(coeffPtr++);              // first coefficient pair
x2 = *(q31_t *)(statePtr++);
x3 = *(q31_t *)(statePtr++);
sum0 = __SMLALD(x0, c0, sum0);
sum1 = __SMLALD(x1, c0, sum1);
sum2 = __SMLALD(x2, c0, sum2);
sum3 = __SMLALD(x3, c0, sum3);
c0 = *(coeffPtr++);              // second coefficient pair
x0 = *(q31_t *)(statePtr++);
x1 = *(q31_t *)(statePtr++);
sum0 = __SMLALD(x2, c0, sum0);
sum1 = __SMLALD(x3, c0, sum1);
sum2 = __SMLALD(x0, c0, sum2);
sum3 = __SMLALD(x1, c0, sum3);
} while(--i);
*pDst++ = (q15_t) (sum0>>15);
*pDst++ = (q15_t) (sum1>>15);
*pDst++ = (q15_t) (sum2>>15);
*pDst++ = (q15_t) (sum3>>15);
stateBasePtr= stateBasePtr + 4;
} while(--sample);
Uses loop unrolling, SIMD intrinsics, caching of states and coefficients, and work around circular addressing by using a large state buffer.
Inner loop is 26 cycles for a total of 16, 16-bit MACs.
Only 1.625 cycles per filter tap!
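The caching idea can be isolated from the SIMD details with a small floating-point sketch: four consecutive outputs are accumulated in one pass, so each coefficient and each new state value is loaded once and reused from local variables. `fir_four_outputs` is a hypothetical name, and the coefficient ordering (coeffs[k] multiplies x[k], i.e. time-reversed relative to h[k]) is an assumption made to keep the indexing simple.

```c
#include <assert.h>
#include <stddef.h>

/* Computes four consecutive FIR outputs in one pass over the coefficients.
 * x must hold numTaps + 3 samples; out[m] = sum over k of coeffs[k] * x[k + m]. */
void fir_four_outputs(const float *coeffs, const float *x, size_t numTaps, float out[4])
{
    assert(numTaps > 0);
    float sum0 = 0.0f, sum1 = 0.0f, sum2 = 0.0f, sum3 = 0.0f;
    float x0 = x[0], x1 = x[1], x2 = x[2];    /* cached state window */
    for (size_t k = 0; k < numTaps; k++) {
        float c  = coeffs[k];                 /* one coefficient load feeds four MACs */
        float x3 = x[k + 3];                  /* one new state load per tap */
        sum0 += c * x0;
        sum1 += c * x1;
        sum2 += c * x2;
        sum3 += c * x3;
        x0 = x1;  x1 = x2;  x2 = x3;          /* slide the cached window by one sample */
    }
    out[0] = sum0;  out[1] = sum1;  out[2] = sum2;  out[3] = sum3;
}
```

Here two loads serve four MACs, compared with two loads per MAC in the straightforward loop, which is the memory-bandwidth reduction the slides describe.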
FIR Application use case
Cortex-M4 FIR performance
After using SIMD instructions < 2.5 cycles
After caching intermediate values ~ 1.6 cycles
Cortex-M4 C code now comparable in performance
Summary of optimizations
Basic Cortex-M4 C code quite reasonable performance for simple algorithms
Through simple optimizations, you can get to high performance on the Cortex-M4
You DO NOT have to write Cortex-M4 assembly, all optimizations can be done completely in C
CMSIS DSP library snapshot
Basic math - vector mathematics
Fast math - sin, cos, sqrt etc
Interpolation - linear, bilinear
Complex math
Statistics - max, min, RMS etc
Filtering - IIR, FIR, LMS etc
Transforms - FFT (real and complex), Cosine transform etc
Matrix functions
PID Controller, Clarke and Park transforms
Support functions - copy/fill arrays, data type conversions etc
Variants for functions across q7, q15, q31 and f32 data types
DSP example - MP3 audio playback
[Figure: MHz bandwidth requirement for MP3 decode]
Cortex-M4 approaches specialized audio DSP performance!
Quick Demos
