
DSP Architectures for Next-Generation
Wireless Communications
Ingrid Verbauwhede
Department of Electrical Engineering
University of California Los Angeles
ingrid@ee.ucla.edu
Chris Nicol
Bell Laboratories Australia
Lucent Technologies
chrisn@lucent.com
ISSCC 2000, DSP Tutorial. Ingrid Verbauwhede, Chris Nicol
Mobile Wireless Trends

[Chart: global subscribers (thousands, 0 to 1,600,000) by year, 1995-2010, for global wireline and global wireless.
Wireline: CAGR 5%, global penetration (2010) 20%.
Wireless (Cellular + PCS + WLAS + Other): CAGR 21%, global penetration (2010) 21%.
Global population 7 billion, CAGR 1995-2010 1.4%.]
World-wide deployment of mobile communications is exceeding expectations

DSP Evolution and Markets
DSP market: $2B market, 30% growth rate (Source: Forward Concepts 1996)
[Pie chart labels: Disk, Cellular Infrastructure, Other, Wireless, Mobile Handsets, Cordless, Modem, GPS, V.34, V.90, xDSL, Consumer & Automotive. Values shown: $270 M, $1.01B, $727 M.]
[Chart: power (mW/MIP, log scale from 1 to 10K) versus year, 1980-2000. Microprocessors: M68000 ($200), 80286 ($200), 80386 ($300), Pentium ($300), Pentium (MMX) ($700). DSPs: DSP-1 ($150), DSP-32C ($250), DSP16A ($15), DSP1600 (<$10), DSP16210.]
The DSP Market Splits - and so does this tutorial
Today's general purpose, assembly coded DSP: 100 MOPS, 250 mW, $40

                Mobile Terminals              Infrastructure
Presenter       Ingrid Verbauwhede            Chris Nicol
DSP type        Low cost, low power DSPs      High Performance DSPs
Performance     200-1000 MOPS                 1-10 GOPS
Power           < 100 mW                      1-5 watts
Cost            $10                           < $50
Overview
Introduction
Low Power DSP Architectures for Handsets
Domain Specific Processors
DSP Processor Fundamentals
Datapath Design, Instruction Set Design
Pipeline Control, Memory Architecture, Low Power Design
for FIR - Viterbi - speech codec
High performance DSP Processors for BTS
2G and 3G Wireless Standards
Mobile Wireless Basestation Systems
Receiver Algorithms, Smart Antennas
Wideband TRX Architectures
Convolutional and Turbo coding
High Performance DSP Architectures for 3G Wireless
LU DSP16210, TI C6x, Starcore SC140
Future Trends - MIMD DSP

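Design spectrum (left to right): ASIC, Application Specific, Domain Specific, General DSP, General Purpose
Performance / Power: high at the ASIC end, low at the general purpose end
Programmability: none at the ASIC end, parameters only in between, very high at the general purpose end
Low power programmable DSPs for wireless communications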
Domain specific processors: to combine
High performance
Low Power
High degree of programmability
Application domains that need it:

Wireless communications (baseband processing)
Video processors
Embedded micro controllers
Etc.
Application domain is narrower, hence high volume is needed to compensate for the development cost.
Application domain: wireless communications

[Block diagram of a handset. RF board: PA, receiver, transmit, synthesizer, TCXO. Baseband board: external memories, microprocessor, DSP, digital ASIC, analog ASIC, audio codec. Also shown: power supply, battery pack, display and keypad.]
Performance requirements: digital cellular phone
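Receive: RF -> Demodulation -> Channel decoder -> Speech decoder
Send: Speech encoder -> Channel encoder -> Modulation -> RF
Demodulation/modulation and channel coding are the communication part; speech coding is the application part.
Goal: Minimum MIPS to get the job done.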
Note: Definition of MIPS, MOPS
What is inside a MIPS (Million Instructions per Second)?
DSPs use complex instructions
One instruction = 5 operations
E.g. Lode instruction: 2 memory operations, 2 address generations and 1 arithmetic operation
So: benchmarks are expressed as the minimum number of operations needed to finish a job, usually quoted in MIPS
Small Example: Viterbi butterfly operation in 4 cycles/butterfly

Large Example: GSM Half rate speech codec in only 12 MIPS
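The relation between MIPS and MOPS above can be checked with a small calculation. This is only a sketch: the function name is made up for illustration, and it assumes every instruction is a complex instruction of the 5-operation kind described above, so the result is an upper bound rather than a measured figure.

```python
# Operations packed into one complex instruction, as listed above:
# 2 memory operations + 2 address generations + 1 arithmetic operation.
OPS_PER_COMPLEX_INSTRUCTION = 2 + 2 + 1


def mops_from_mips(mips, ops_per_instruction=OPS_PER_COMPLEX_INSTRUCTION):
    """Million operations per second delivered at a given MIPS rate,
    assuming every instruction carries `ops_per_instruction` operations."""
    return mips * ops_per_instruction


# Hypothetical use: a 12 MIPS job made only of such instructions
# would correspond to at most 60 million operations per second.
upper_bound_mops = mops_from_mips(12)
```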
Application Domain: compute intensive functions
Source encoder/decoder = speech coders

Advanced vocoders for improved speech quality & higher capacity:
Example: ACELP derivatives for GSM and IS136A
Digital filtering (FIR, IIR)
Vector quantization, code book search
(square distance computation)
Channel encoder/decoder = error correcting

Complex wireless modems:
Galois field arithmetic
Convolution coders based on Viterbi trellis search
Turbo coders
Modulation/demodulation =
Receivers based on Maximum Likelihood Sequence Estimation
(requires again fast Viterbi butterfly operations)
Compute intensive functions: evolution of DSPs
Simple FIR example
Speed-up of FIR example
Viterbi acceleration
Square distance
Evolution of DSPs follows these examples
Evolution of DSP processors
Generation       Features                                      Examples
0 (1980)         Von Neumann architecture                      DSP-1 (AT&T)
1 (1982)         Basic Harvard architecture                    TMS320C10 (TI), NEC7720
2 (1986)         1 data/program bus, 1 data bus                TMS320C25 (TI), DSP16A (AT&T)
3 (1990)         Extra addressing modes, extra functions       TMS320C5x (TI), DSP16xx (AT&T)
4 (1994)         2 data busses, 1 program bus                  TMS320C54x (TI)
5 (1996 - now)   2 data busses, 1 program bus, multiple units  Lucent 16xxx, Atmel Lode, Siemens Carmel
DSP Processor Fundamentals
Processor Components [Skillicorn-88]
[Block diagram labels: Data Path, Interconnect, Processing Units, Instruction Processing Unit, Memory Management Unit.]
Basic Harvard Architecture
Separate data memory from program memory!
[Block diagram: program memory connected to the instruction processing unit; data memory connected to the 16 x 16 multiplier (mpy) and the accumulate ALU.]
Different from Von Neumann machine:
one address bus - one data bus - one memory space
Example 1: TMS320C10 (1982)
160/200 ns instruction cycle time
144 x 16 data RAM, 1.5K x 16 program ROM
4K word external address reach
60 general purpose and DSP specific instructions
Single cycle 16 x 16 multiply
16-bit barrel shifter
External interrupt and polled input pins
Eight 16-bit I/O ports
40-pin DIP/44-pin PLCC
[Block diagram: data RAM (144 x 16), program ROM (1.5K x 16), buses A (11-0), PA (7-0) and D (15-0); CPU with 16-bit T-register, 16 x 16 multiplier, 32-bit P-register, 16-bit barrel shifter (L), 32-bit ALU, 32-bit accumulator, ShiftL (0,1,4), 2 auxiliary registers, four level H/W stack and status register; I/O ports 8 x 16 (A 2-0, D 15-0).]
Courtesy: Texas Instruments
Compute Intensive function 1: FIR
[Diagram: tapped delay line x(n), x(n-1), ..., x(n-(N-1)) with coefficients c(0) ... c(N-1), multipliers and adders producing y(n); 50 taps.]
$$y(n) = \sum_{i=0}^{N-1} c(i)\, x(n-i)$$
Single Cycle Multiply - Accumulate!

TMS320C10                          TMS320C25
LTD                                RPTK 49
MPY                                MACD
LTD
...
MPY
100 cycles                         53 cycles
100 words program memory           3 words program memory
(LTD replaces LT, DMOV, APAC)      (MACD replaces LT, DMOV, MPY, APAC)
Example 2: Single Cycle MAC

TMS320C2x Multiplier/ALU
Single cycle 16 x 16 bit multiply yielding a 32-bit product
Supports simultaneous program and two data operand acquisition
Supports simultaneous ALU and multiplier operations
0-16 bit left post-shifter
[Block diagram labels: program bus, data bus (16), T register (16), left shifter (0-16), multiplier (16 x 16), P register (32), left shifter, MUX, arithmetic logic unit (ALU, 32), carry C, accumulator register (32), left shifter (0-7).]
Courtesy: Texas Instruments
Compute Intensive function 1: FIR (cont.)
[Diagram: tapped delay line x(n), x(n-1), ..., x(n-(N-1)) with coefficients c(0) ... c(N-1), multipliers and adders producing y(n); 50 taps.]
$$y(n) = \sum_{i=0}^{N-1} c(i)\, x(n-i)$$
y(0) = c(0)x(0) + c(1)x(-1) + c(2)x(-2) + . . . + c(N-1)x(1-N);

y(1) = c(0)x(1) + c(1)x(0) + c(2)x(-1) + . . . + c(N-1)x(2-N);
y(2) = c(0)x(2) + c(1)x(1) + c(2)x(0) + . . . + c(N-1)x(3-N);
. . .
y(n) = c(0)x(n) + c(1)x(n-1) + c(2)x(n-2)+ . . + c(N-1)x(n-(N-1));
One output = 2N reads, N MACs, 1 write
Classic Harvard: one output = N cycles
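The operation count above (2N reads, N MACs per output) can be reproduced with a direct-form FIR loop. This is a minimal sketch in Python rather than DSP assembly; the function name and the convention that samples before x(0) are zero (and still counted as reads) are assumptions made for illustration.

```python
def fir_output(c, x, n):
    """Compute y(n) = sum_{i=0}^{N-1} c(i) * x(n-i) and count the work.

    Returns (y, reads, macs). Samples before x[0] are taken as zero but
    are still counted as one data read each.
    """
    acc = 0
    reads = 0
    macs = 0
    for i in range(len(c)):
        coeff = c[i]                            # coefficient read
        reads += 1
        sample = x[n - i] if n - i >= 0 else 0  # data read
        reads += 1
        acc += coeff * sample                   # one multiply-accumulate
        macs += 1
    return acc, reads, macs


# Hypothetical 3-tap example: N = 3, so 6 reads and 3 MACs per output.
y2, reads, macs = fir_output([1, 2, 3], [1, 2, 3, 4, 5], 2)
```

On a classic Harvard machine the coefficient read and the data read of one tap happen in the same cycle as the MAC, which is where the N cycles per output come from.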

FIR speed-up
FIR filtering: two outputs in parallel
y(0) = c(0)x(0) + c(1)x(-1) + c(2)x(-2) + . . . + c(N-1)x(1-N);

y(1) = c(0)x(1) + c(1)x(0) + c(2)x(-1) + . . . + c(N-1)x(2-N);
y(2) = c(0)x(2) + c(1)x(1) + c(2)x(0) + . . . + c(N-1)x(3-N);
. . .
Two outputs = 4N reads, 2N MACs, 2 writes

Dual MAC architecture with ONLY 2 data busses?
Read two 32-bit numbers instead of four 16-bit numbers: solution by Lucent 16000 core with dual MAC
Run MAC at double frequency, read two 32-bit numbers: solution by Matsushita
Insert delay register: solution by Atmel's Lode
Example 3: Lucent DSP16210
Inner loop of 32-tap FIR Filter
    do 14 {                //one instruction !
        a0=a0+p0+p1  p0=xh*yh  p1=xl*yl  y=*r0++  x=*pt0++
    }
Outer Loop: 19 cycles, 38 bytes
1 cycle in inner loop
5 exec units used in inner loop
2 MACs per cycle
Horizontal parallelism, one sample at a time
2G mobile wireless base-stations
[Datapath diagram labels: XDB(32), IDB(32), Y(32), X(32), two 16 x 16 multipliers, p0 (32), p1 (32), Shift/Sat. units, ALU, ADD, BMU, ACC File 8 x 40.]
Courtesy: Gareth Hughes, Bell Labs Australia
FIR on Lode
FIR filter: two outputs in parallel with delay register
y(0) = c(0)x(0) + c(1)x(-1) + c(2)x(-2) + . . . + c(N-1)x(1-N);
y(1) = c(0)x(1) + c(1)x(0) + c(2)x(-1) + . . . + c(N-1)x(2-N);
y(2) = c(0)x(2) + c(1)x(1) + c(2)x(0) + . . . + c(N-1)x(3-N);
. . .
Total energy for one output sample:

Energy                       Single MAC   Dual MAC   Dual MAC with REG
No. of MAC operations        N            N          N
No. of memory reads          2N           2N         N
No. of instruction cycles    N            N/2        N/2
FIR on Lode
Two MAC units with dedicated bus network
DB0 fetches coefficient c(i)
DB1 fetches data
LREG delays input data
A0 stores y(n) output
A1 stores y(n+1) output
[Datapath diagram: DB0(16) carries c(i) to both multipliers; DB1(16) carries x(n-i); LREG holds x(n-i+1); MAC0 and MAC1 accumulate y(n) and y(n+1) in accumulators A0 and A1.]
Same structure can be used for IIR
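The delay-register idea can be sketched in a few lines: one coefficient fetch and one data fetch per cycle feed two MACs, because the second MAC reuses the sample fetched in the previous cycle. This is an illustrative Python model, not Lode code; the function names, the zero padding outside the sample buffer and the single priming read of the register are assumptions.

```python
def sample(x, k):
    """Return x(k), or zero outside the stored samples (an assumption)."""
    return x[k] if 0 <= k < len(x) else 0


def fir_pair_with_delay_register(c, x, n):
    """Compute y(n) and y(n+1) together, returning (y_n, y_n1, reads)."""
    reads = 0
    a0 = 0                       # accumulator for y(n)
    a1 = 0                       # accumulator for y(n+1)
    lreg = sample(x, n + 1)      # prime the delay register with x(n+1)
    reads += 1
    for i in range(len(c)):
        coeff = c[i]             # DB0: coefficient c(i)
        reads += 1
        data = sample(x, n - i)  # DB1: data x(n-i)
        reads += 1
        a0 += coeff * data       # MAC0: c(i) * x(n-i)
        a1 += coeff * lreg       # MAC1: c(i) * x(n-i+1), taken from LREG
        lreg = data              # delay the sample for the next cycle
    return a0, a1, reads


# Hypothetical 3-tap example: 7 reads for two outputs instead of 12.
y2, y3, reads = fir_pair_with_delay_register([1, 2, 3], [1, 2, 3, 4, 5], 2)
```

Two outputs cost 2N + 1 reads here, that is about N reads per output, which matches the "Dual MAC with REG" column of the energy table.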
Compute Intensive function 2: Viterbi

Viterbi butterfly: states i and i + s/2 connect to states 2i and 2i+1 with branch metrics +a and -a
i = state index
s = # of states = 2^(k-1)
w = decoding window
Basic equations:
d(2i) = min { d(i) + a, d(i + s/2) - a }
d(2i + 1) = min { d(i) - a, d(i + s/2) + a }
Key operation: Add-Compare-Select (ACS)
IS-95: k = 8, w = 192, corresponds to 2^7 x 192 x (cycles for one ACS)
Basic algorithm in Viterbi channel decoders and MLSE based receivers,

modified version in turbo decoders.
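The butterfly equations can be written out as one trellis stage of add-compare-select. This is a sketch in Python for illustration only; the function name, the per-butterfly list of branch metrics and the tie-breaking rule (the first candidate wins) are assumptions.

```python
def acs_stage(d, branch):
    """One trellis stage of Viterbi add-compare-select.

    d      : path metrics d(0) .. d(s-1) for the s current states
    branch : branch metric a for each of the s/2 butterflies
    Returns (new_metrics, decisions); decision 0 means the path from
    state i survived, 1 means the path from state i + s/2 survived.
    """
    s = len(d)
    half = s // 2
    new_metrics = [0] * s
    decisions = [0] * s
    for i in range(half):
        a = branch[i]
        upper, lower = d[i], d[i + half]
        # d(2i) = min { d(i) + a, d(i + s/2) - a }
        c0, c1 = upper + a, lower - a
        new_metrics[2 * i], decisions[2 * i] = (c0, 0) if c0 <= c1 else (c1, 1)
        # d(2i + 1) = min { d(i) - a, d(i + s/2) + a }
        c0, c1 = upper - a, lower + a
        new_metrics[2 * i + 1], decisions[2 * i + 1] = (c0, 0) if c0 <= c1 else (c1, 1)
    return new_metrics, decisions


# Hypothetical 4-state example with the same branch metric in both butterflies.
metrics, bits = acs_stage([0, 4, 2, 6], [2, 2])
```

Each butterfly performs two additions, two subtractions, two compares and two selects, which is the work the DSP datapaths on the following slides are built to do in a few cycles.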
Viterbi on Lode
Two MAC units & ALU: Add-Compare-Select
New path metric = min [(path metric 1 + branch metric 1), (path metric 2 + branch metric 2)]
DMAC operates as dual add/subtract unit
ALU finds minimum
Shortest distance saved
Path indicator (ALU decision bit) saved to memory
4 cycles / butterfly
[Datapath diagram labels: DB1(16), DB0(16), MAC1, MAC0, A0, A1, ALU Min(), A3, A2.]
Viterbi on TI C54x
ALU and CSSU: Add-Compare-Select
New path metric = min [(path metric 1 + branch metric 1), (path metric 2 + branch metric 2)]
ALU splits in 16 bit halves
ACC splits in half
CSSU compares halves
Shortest distance saved (select MSW/LSW)
Path indicator (decision bit) saved in TRN reg
4 cycles / butterfly
[Datapath diagram labels: DB1(16), DB0(16), TREG, ALU, accumulator, Comp, Select MSW/LSW, TRN reg, data bus EB to memory.]
Viterbi on LU DSP16210
GSM (K=5, 16 states)
    do 8 {
        a0=a4+y          a1=a5-y    *r3++=a0h
        a2=a4-y          a3=a5+y    *r5++=a2h
        a0=cmp1(a1,a0)   yh=*r0     r0=r1+j   j=k   k=*pt1++
        a2=cmp1(a3,a2)   a4_5h=*pt0++
    }
Comparison functions store ACS decision bits in AR0: a0=cmp1(a1,a0), a2=cmp1(a3,a2), ...
Hardware support for Viterbi algorithm:
ACS calculations are efficient
Minimal overhead
4 cycles per butterfly
32 cycles per GSM timeslot.
Results written to memory
Courtesy: Gareth Hughes, Bell Labs Australia
Square distance on Lode

ALU in parallel with MAC: Sum of square distance
$$D = \sum_{i=0}^{N-1} \| x(i) - y(i) \|^2$$
ALU performs subtraction and absolute value
MAC performs squaring and accumulation
Vector quantization in vocoders: vector size N = 50, codebook > 1000
[Datapath diagram: DB1(16) and DB0(16) deliver x(i) and y(i); the ALU subtracts, the MAC squares and accumulates D in A0.]
Lode Core Architecture
Domain specific instruction set
Basic instruction set for general purpose DSP
e.g. MAC, min, max, etc.
Extra instructions for performance with every new generation
e.g. square distance and accumulate
$$D = \sum_{i=0}^{N-1} \| x(i) - y(i) \|^2$$
One 32 bit instruction:
a3 = abs (*r0 - *r1 < asr), a0 = a0 + sqr(a3), r0++, r1++;
Bus network and instruction set design go together
CISC, thus compiler unfriendly
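What the single square-distance instruction does per element can be modelled in a short loop, together with the codebook search it is meant to speed up. This is a Python sketch for illustration; the function names are made up, the register names only mirror the instruction above, and the shift (asr) operand is ignored.

```python
def sum_square_distance(x, y):
    """D = sum_i |x(i) - y(i)|^2, one 'instruction' per loop iteration."""
    a0 = 0
    for xi, yi in zip(x, y):   # r0++, r1++: step through both vectors
        a3 = abs(xi - yi)      # ALU: subtraction and absolute value
        a0 = a0 + a3 * a3      # MAC: squaring and accumulation
    return a0


def nearest_codeword(x, codebook):
    """Index of the codebook vector with the smallest square distance to x."""
    return min(range(len(codebook)),
               key=lambda j: sum_square_distance(x, codebook[j]))


# Hypothetical example with a 3-entry codebook of 2-element vectors.
best = nearest_codeword([1, 1], [[0, 0], [1, 2], [5, 5]])
```

With vector size N = 50 and more than 1000 codebook entries, the inner loop runs over 50,000 times per search, which is why it earns a dedicated instruction.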

Control & Pipeline for DSPs
RISC: load/store machine
memory access with load/store instructions (DLX, MIPS, D10V)
Pipeline: Fetch | Decode | Execute | Memory Access | Write Back
Execute stage: execution / address generation
Memory Access stage: memory access / branch
Excellent for complex decision making!
DSP: register-memory architecture (TI, Lucent, HX, Lode)
Pipeline: Fetch | Decode | Memory Access | Execute | Write Back
Memory Access stage: memory access
Execute stage: execution
Excellent for number crunching!
Pipeline RISC compared to DSP

RISC example:
    r0 = *p0;        // load data
    a0 = a0 + r0;    // execute
RISC pipeline: Fetch | Decode | Execute | Memory Access, repeated for successive instructions. Too expensive for DSP.
DSP: memory intensive applications:
DSP pipeline: Fetch | Decode | Memory Access | Execute, repeated for successive instructions.
Penalty: data dependent branch is expensive
Other control features
Hardware looping:
Because software branch is expensive
Zero overhead loops, supported in hardware (for tight FIR loops)
Interrupts: hardware with shadow registers for extremely fast context switching.
Special instruction cache:
Single instruction repeat buffer
Multiple instruction cache: under programmer's control!
E.g. Lucent DSP16210: 31 x 32 instruction cache
Predictable worst case execution time!
Low Power DSPs

                 DSP 1600 Core                               C54x 1V DSP
                 (Lucent - 1609 low cost consumer 16-bit)    (Texas Instruments - ISSCC 1997)
Technology       0.35 µm 3LM CMOS                            0.25 µm 3LM CMOS
Performance      80 M 16b MAC/s at 3.3V                      65 M 16b MAC/s at 1.0V
Power            1.4 mW/MHz at 3.3V                          0.21 mW/MHz at 1.0V
Stand-by power   30 µW                                       4.0 mW
Dual Vt process
BUT: DSP Software Development
Complex DSP architecture not amenable to compiler technology
Algorithms are modeled in high level language (e.g. C++)
Solutions are implemented and debugged in hand-optimized assembler - large development effort with minimal tool support
Flow: HLL algorithmic model -> (hand coded assembler) -> prototype code -> (optimize & debug) -> production code
Long, frustrating time to market
Fragile legacy code
Still used in handhelds, but changing in basestations (Part II)
Mobile Wireless Evolution

[Chart: services and technology by generation.]
First Generation (past): Mobile telephone service: carphone. Analog cellular technology. Macrocellular systems. Standards: NMT, TACS, Analog AMPS.
Second Generation (now): Digital voice + messaging/data services. Fixed wireless loop. Digital cellular technology + IN emergence. Microcellular & picocellular: capacity, quality. Enhanced cordless technology. Standards: GSM, IS-54/136 TDMA, IS-95/cdmaOne, PDC, DECT.
Third Generation (year 2000-2005): Integrated high quality audio and data. Narrowband and broadband multimedia services + IN integration. Broader bandwidth efficient radio transmission. Information compression. Higher frequency spectrum utilization. IN + network management integration. Standards: WCDMA, UWC-136 TDMA, cdma2000. Global roaming.
Fourth Generation (year 2010?): TelePresencing. Education, training and dynamic information access. Wireless-wireline and broadband transparency. Knowledge-based network operations. Unified service network.
We are entering the decade of wireless data communications - and World-War 3G

Mobile Data Services
Carriers invest >$500 per subscriber but subscriber voice calls (and
therefore revenues) are reducing.
Data currently 3% of wireless traffic - projected to >50% by 2005
Wireless Internet : Average internet connection 30 mins
Text Messaging: Saturating 2G voice networks
2.5 Generation Mobile Standards [1]

GPRS: Packet Data over GSM - timeslot multiplexing, multi-slots per user.
EDGE: 8-PSK modulation + GPRS, 384 Kbps max to 1 user.
3G - IMT2000 Proposals
144 Kbps Automobile, 384 Kbps Pedestrian, 2 Mbps stationary.
Several Proposals - UWC 136 (200 kHz, TDMA, 8-PSK = EDGE).
UMTS, CDMA-2000 are both CDMA proposals.
Evolution of Mobile Wireless Network Architecture
[Diagram: 2G network versus IP-based 3G network.
2G network: PSTN, MSC switches, BSC, base stations.
IP-based 3G network: Internet / advanced services and PSTN; packet mode servers (high speed data, multimedia, voice over IP, etc.); wireless control servers (feature control, network management, billing, etc.); circuit mode servers (voice, low speed data, etc.); packet connectivity (ATM / IP); radio clients.]
Mobile networks are being upgraded in preparation for the delivery of high speed data services.
Mobile Wireless Infrastructure
[Photos: macro-cell GSM basestation (6-12 TRX) and micro-cell GSM basestation (2 TRX).]
2G Basestation Baseband Processing
Multiple DSPs used for baseband processing.

RISC Microcontroller for timing, framing, I/O control
Software upgradable over the network
DSPs dominate cost and power consumption
Future trend - integrate baseband processing - low cost Pico BTS
[Board diagram: Tx/Rx baseband processing board for 2-carrier GSM basestation. Each carrier has a Tx/Rx AFE and DSPs with RAM for channel equalization, channel de/coding and encryption; the board also carries a RISC micro controller, I/O ASICs and a T1/E1 interface.]
3G Basestation Baseband Processing
Increased DSP performance needed in next-generation basestation
Increased Receiver Algorithm Sensitivity
Antenna Arrays - Smart Antennas
Multi-Standard Basestations using Software Radio Architecture
3G - constraint length 9, rate 1/2 convolutional coding for voice.
3G - constraint length 4, Turbo codes for data
High Performance DSPs + Custom Logic needed for 3G (Viterbi decoding and Turbo decoding)
[Block diagram of 3G baseband functions, each annotated as DSP, ASIC or both:
Code generator: channelisation code, scrambling code
Synchronisation: cell search, slot sync, frame sync
SIR measurement
Power control: fast power control
Sliding correlator: despreading
RAKE combiner: reassemble multipath
Decoder: deinterleaver, Viterbi algorithm, Turbo decoding
Channel estimation
Path search
Code tracking: delay-lock-loop]
Courtesy: Bing Xu: Bell Labs Australia
Receiver Algorithms for GSM Basestation

Enhanced Receiver Sensitivity
Larger Cells in Suburban Areas = Reduced network cost
Mobile transmits with less power = Increased battery life
Existing Receiver:
Estimating Wireless Channel -> Equalizing Multi-path Effects -> Channel Decoding -> Speech Decoding
New Iterative Receiver (1.3dB improvement):
Estimating Wireless Channel -> Equalizing Multi-path Effects -> Channel Decoding -> Speech Decoding
[Diagram: the new receiver iterates over this chain and adds a "Speech Statistics" input.]
Challenge - requires 6x DSP MIPS of existing receiver in basestation
Courtesy: Magnus Sandell: Bell Labs UK
Smart Antennas
A multiple antenna element system
Combined with a base station architecture and signal processing
techniques designed to dynamically select or form the optimum
beam pattern per user
[Figure: omnidirectional cell site, three sector cell site, intelligent antenna cell site.]
Increased cost in RF electronics and enhanced DSP requirements.
Fixed Multi-Beam Versus Adaptive Beam

Fixed Multi-Beam: Select from--or use--multiple fixed antenna beams to optimize performance.
Adaptive Beam: Adaptively weight and combine multiple antenna elements to optimize performance.
[Figure: both cases show mobile 1, mobile 2, an interferer, direct rays and reflected rays.]
Digital Radio Trends - Software Radio
[Block diagram: Antennas -> RF/IF analog processing (AMP) -> A/D -> Digital processing -> Network interface -> Network]
Linear amplification, combining, multi-standard basestation
RF/IF: amplifiers, mixers, filters - higher dynamic range, smaller
DSPs - higher speed, more powerful: filtering, modulation, demodulation, equalization, rake receiver, correlator, channel coding, encryption, diversity, ...
Wideband Receiver Architecture
[Block diagram: RF-IF & filter -> high speed A/D -> digital channeliser -> baseband processing for channels CH1 ... CHM. Spectra of channels CH1 ... CHM are shown at fRF, fIF and fBB.]
Increased DSP performance needed for Software Radio
Turbo Codes
For 3G Wireless (UMTS and CDMA2000)
Voice service: BER requirement 10^-3
Data service: BER requirement 10^-5
Parallel concatenation of convolutional codes is used to give the codes structure so they can be decoded
Pseudorandom interleaving is used to give the codes performance which approaches that for random coding
Resulting encoder structure: Two Recursive Systematic Convolutional (RSC) Codes
[Encoder diagram: the input goes directly to the systematic output and to encoder #1; an interleaved copy of the input goes to encoder #2; a MUX combines the two encoder outputs into the parity output.]
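The encoder structure above (systematic output plus two RSC parity streams, the second fed through an interleaver) can be sketched as follows. This is a toy Python model, not a standard-compliant encoder: the generator polynomials (feedback 1 + D^2 + D^3, feedforward 1 + D + D^3, constraint length 4) and the short fixed interleaver are assumptions chosen for illustration, and trellis termination and puncturing are left out.

```python
def rsc_parity(bits):
    """Parity stream of a recursive systematic convolutional encoder.

    Assumed polynomials: feedback 1 + D^2 + D^3, feedforward 1 + D + D^3.
    """
    s1 = s2 = s3 = 0                  # shift register, s1 is the newest
    parity = []
    for u in bits:
        a = u ^ s2 ^ s3               # recursive (feedback) bit
        parity.append(a ^ s1 ^ s3)    # feedforward taps
        s1, s2, s3 = a, s1, s2        # shift
    return parity


def turbo_encode(bits, interleaver):
    """Return (systematic, parity1, parity2) for a parallel concatenation."""
    systematic = list(bits)
    parity1 = rsc_parity(systematic)
    interleaved = [systematic[j] for j in interleaver]
    parity2 = rsc_parity(interleaved)
    return systematic, parity1, parity2


# Hypothetical 4-bit block with a simple reversing interleaver.
sys_bits, p1, p2 = turbo_encode([1, 0, 0, 0], [3, 2, 1, 0])
```

Because the encoder is recursive, a single 1 at the input keeps producing parity bits, which is part of what gives the interleaved second parity stream its value.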
Turbo Decoding
Key idea: iterative decoding (up to 10 iterations for 3G)

There is one decoder for each elementary encoder.
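Each decoder estimates the a-posteriori probability (APP) of each data bit.
The APPs are used as a priori information by the other decoder.
[Decoder diagram: a DeMUX splits the parity data between decoder #1 and decoder #2; systematic data feeds decoder #1 and, through an interleaver, decoder #2; APP values pass from decoder #1 to decoder #2 through an interleaver and back through a deinterleaver; decoder #2 delivers the hard bit decisions.]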
Soft-Output Decoding Algorithms
Requirements for Turbo: Trellis-Based Estimation Algorithms (Soft-Input Soft-Output)
Accept Soft-Inputs in the form of a priori probabilities (APP)
Produce APP estimates of the data.
Algorithm families:
Sequence estimation: Viterbi Algorithm -> SOVA -> Improved SOVA
Symbol-by-symbol estimation: MAP Algorithm -> max-log-MAP -> log-MAP
Today's high-performance DSPs are highly MAC-focussed (for filtering in modem applications). Some DSPs provide hardware support for efficient implementation of Viterbi - none support SOVA or log-MAP
Iterative channel estimation also uses Soft-Input Soft-Output decoders.
SOVA and log-MAP use modified Add-Compare-Select operations - not only select the maximum path metric - but also need to keep the difference.
The Maximum A Posteriori (MAP) Algorithm

Log-Likelihood Ratio:
$$L(d) = \ln \frac{\Pr[d=1]}{\Pr[d=0]}$$
$$L(d \mid y) = \ln \frac{\Pr(d=1 \mid y)}{\Pr(d=0 \mid y)} = \ln \frac{p(y \mid d=1)}{p(y \mid d=0)} + L(d)$$
L(d): a priori value from Pr[d=1], Pr[d=0]
Output of decoder contains additional extrinsic information
The sum of the a priori information and the extrinsic information will be the a priori information for the next stage of decoding, either the 2nd decoder or the 1st decoder in the next iteration
$$L(u_k) = \ln \frac{\Pr[u_k = 1 \mid y]}{\Pr[u_k = 0 \mid y]} = \ln \frac{\sum_{(s',s):\, u_k = 1} p(s', s, y)}{\sum_{(s',s):\, u_k = 0} p(s', s, y)}$$
1) u_k is the kth bit of the desired data sequence, 2) y is the observed sequence, 3) the state transitions from state s' at time k-1 to state s at time k, 4) we want to evaluate this LLR for every k
Break the probability computation into:
$$p(s', s, y) = p(s', y_{j<k}) \cdot p(s, y_k \mid s') \cdot p(y_{j>k} \mid s) = \alpha_{k-1}(s')\, \gamma_k(s', s)\, \beta_k(s)$$
Gamma: $\gamma_k(s', s) = p(s, y_k \mid s')$
Alpha: $\alpha_{k-1}(s') = p(s', y_{j<k})$
Beta: $\beta_k(s) = p(y_{j>k} \mid s)$
Gamma, Alpha and Beta Calculations
Gamma: Calculated from known bits up to k, needs to be stored
$$\gamma_k(s', s) = p(s, y_k \mid s') = P(s \mid s')\, p(y_k \mid s', s) = P(u_k)\, p(y_k \mid u_k)$$
where $P(u_k)$ is calculated from the a priori information and $p(y_k \mid u_k)$ is calculated from the received bits
Alpha: Calculated by a forward recursion through the trellis based on Gamma
$$\alpha_k(s) = \sum_{s'} \gamma_k(s', s)\, \alpha_{k-1}(s')$$
Beta: Calculated by a backward recursion from the end of the trellis
$$\beta_{k-1}(s') = \sum_{s} \gamma_k(s', s)\, \beta_k(s)$$
[Diagram: window algorithm over the trellis, showing the alpha, gamma and beta spans and the dummy betas.]
Log MAP and MAX-log MAP

Compute logarithms of alpha, beta and gamma, which means we compute:
$$\ln\left(e^{\delta_1} + e^{\delta_2}\right)$$
Log-MAP: $\ln\left(e^{\delta_1} + e^{\delta_2}\right) = \max(\delta_1, \delta_2) + f_c(|\delta_1 - \delta_2|)$, with correction function $f_c$ (implemented as a table)
MAX-Log-MAP: $\ln\left(e^{\delta_1} + e^{\delta_2}\right) \approx \max(\delta_1, \delta_2)$
MAX-log-MAP suffers approx 0.5dB from log-MAP.
For log-MAP, small correction table needed (approx 6 non-zero values).
Absolute difference used as table look-up. We need the difference!
[Plot: BER (10^-1 down to 10^-6) for MaxlogAPP and LogAPP; horizontal axis from 2.2 to 2.9.]
Courtesy: Bing Xu: Bell Labs Australia
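The difference between log-MAP and MAX-log-MAP comes down to one correction term, which the following sketch makes concrete. It is illustrative Python, not DSP code; the function names, the table step of 0.5 and the table length of 8 entries are assumptions (the slide only says the table is small).

```python
import math


def max_star_exact(d1, d2):
    """ln(e^d1 + e^d2), computed directly."""
    return math.log(math.exp(d1) + math.exp(d2))


def correction(diff):
    """Correction function f_c(|d1 - d2|) = ln(1 + e^(-|d1 - d2|))."""
    return math.log1p(math.exp(-diff))


def log_map(d1, d2):
    """Log-MAP: max plus the exact correction term."""
    return max(d1, d2) + correction(abs(d1 - d2))


def max_log_map(d1, d2):
    """MAX-log-MAP: the correction term is dropped."""
    return max(d1, d2)


# Hypothetical small look-up table indexed by the quantized difference.
STEP = 0.5
TABLE = [correction(k * STEP) for k in range(8)]


def log_map_table(d1, d2):
    """Log-MAP with the correction taken from the small table."""
    index = int(abs(d1 - d2) / STEP)
    extra = TABLE[index] if index < len(TABLE) else 0.0
    return max(d1, d2) + extra
```

The table is addressed by the absolute difference of the two metrics, which is why the datapath has to keep the difference and not only the maximum.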
High Performance DSP Requirements
Very high levels of DSP integer performance
Support for complex real-time synchronous applications (latency, predictable throughput, synchronization)
Large memory and I/O bandwidth.
Scalability to meet wide range of cost, power, performance.
Cost & power efficient solution.
Friendly, compiler driven, programming environment.
[Chart: Some DSP Applications. MOPS (log scale, 10 to 100K) versus year (1997, 1999, 2001). Labels include V.34, K56, GSM term, PCS term, ADSL, 16 HR GSM, DAB rcvr, 24 ch. modem, set-top box, 3G Wireless, MPEGII encode, 1G eth. xcvr, soft radio, 3-D graphics?, and a "traditional DSP" region.]
Compiler Driven VLIW

Instruction format: cond/branch | ex1 | ex2 | ex3 | .. | exn
[Block diagram: data memory and register array connected through an interconnect to execution units ex1 (alu), ex2 (alu), ex3 (mpy), ex4 (ld/st), ... exn (ld/st).]
Large orthogonal register set, regular interconnect
Atomic RISC-like operations => heavily pipelined, high freq. clock
Explicitly Parallel Instruction Computing
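Execution Clusters
[Block diagram: data memory shared by two clusters, each with its own register array and interconnect; execution units ex1 (alu), ex2 (alu), ex3 (ld/st), ex4 (alu), ex5 (mpy), ex6 (ld/st).]
Execution Sets
[Diagram: a fetch set whose instructions carry marker bits 1 1 1 0 1 0 1 0, which delimit the execution sets.]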
Explicitly Parallel Instruction Computing

Predication (guarded) execution: format cond | any instruction
- eliminates branches - improves compiler efficiency
- eliminates branches - removes pipeline bubbles
- fill delayed branch slots with predicated instructions
Instruction modifiers: format modifier | instr1 | instr2 | instr3 | instr4
- allows shorter instruction length
- extend register addressing
- predication
- execution set identifier
- looping
- extended operations
Texas Instruments C6201
[Block diagram: program memory (16K x 32) with 256-bit fetch into instruction dispatch & decode; register bank A (16 x 32) and register bank B (16 x 32), each with ALU, shift, mpy and add units; data memory (32K x 16).]
8-way VLIW with two execution clusters
256 bit (8x32) instruction fetch with variable length execute set
Each 32 bit instruction individually predicated
11 stage pipeline
1600 MIPS, 400 MMACs @ 200 MHz
FIR Filter on TI C6x
Hand-coded assembly: 32-tap FIR filter
    loop:
            ldw   .d1t1  *a4++,a5
    ||      ldw   .d2t2  *b4++,b5
    ||[b0]  sub   .s2    b0,1,b0
    ||[b0]  b     .s1    loop
    ||      mpy   .m1x   a5,b5,a6
    ||      mpyh  .m2x   a5,b5,b6
    ||      add   .l1    a7,a6,a7
    ||      add   .l2    b7,b6,b7
Outer Loop: 23 cycles, 180 bytes
1 cycle in inner loop
All 8 exec units used in inner loop - maximum efficiency
2 MACs per cycle
Assembly syntax more difficult to learn.
Hard to get full use of all 8 execution units at once.
Software pipelining difficult to implement, and requires longer prolog/epilog (larger code size).
Courtesy: Gareth Hughes: Bell Labs Australia
Viterbi on TI C6x
3-cycle 2-ACS Inner-Loop 16-state Viterbi decoder for GSM
from TI WWW site: ftp://ftp.ti.com/pub/tms320bbs/c62xfiles/vitgsm.asm
    LOOP:
      [b1]  b      .s1   LOOP
    ||[b1]  sub    .s2   b1,1,b1
    ||[!a2] sth    .d1   b12,*+a6[8]
    ||[!a2] add    .d2   b0,b14,b14
    ||      cmpgt  .l1   a11,a10,a1
    ||      cmpgt  .l2   b11,b10,b0
    ||      mpy    .m1x  1,b5,a4

      [a2]  sub    .s1   a2,1,a2
    ||[!a2] sth    .d1   a12,*a6++
    ||[a1]  add    .s2   2,b0,b0
    ||[b0]  mpy    .m2   1,b11,b12
    ||      mpy    .m1   1,a10,a12
    ||      sub    .l2x  a7,b5,b10
    ||      ldh    .d2   *++b9,b5

            shl    .s2   b14,2,b14
    ||[a1]  mpy    .m1   1,a11,a12
    ||      add    .s1   a7,a4,a10
    ||      sub    .l1x  b13,a4,a11
    ||      add    .l2   b13,b5,b11
    ||      mpy    .m2   1,b10,b12
    ||      ldh    .d2   *b4++[2],a7
    ||      ldh    .d1   *a5++[2],b13
    ; end of LOOP
3 cycles per butterfly
32 cycles per GSM timeslot (8 butterflies)
x8 MPY instructions used to move data
[Chart: utilization of execution units in Viterbi decoder. Rows .D1, .D2, .M1, .M2, .L1, .L2, .S1, .S2; columns cycles 0 to 31; cells hold the operation issued on each unit (STH, LDH, ADD, SUB, MPY, CMPGT, SHL, B JLOOP, MVK).]
Lucent / Motorola Star*Core SC140

[Block diagram: program / data memory; program sequencer and instruction dispatcher; address registers (27) with two AAUs; data registers (16) with four MAC / ALU / BFU units.]
6-way VLIW with 128 bit (8x16) instruction fetch

Prefix instructions for high performance without sacrificing code density
Each execution set (parallel instructions + prefix) predicated
5 stage pipeline
1800 MIPS, 1200 MMACs @ 300 MHz
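The peak figures quoted for the two VLIW cores follow from clock rate times issue width and clock rate times number of multipliers. The sketch below checks that arithmetic; the function name is made up, and the MAC-unit counts (2 for the C6201, one mpy per cluster, and 4 for the SC140) are read off the block diagrams in this tutorial.

```python
def peak_rates(clock_mhz, instructions_per_cycle, macs_per_cycle):
    """Return (peak MIPS, peak MMACs) for a core that issues the given
    number of instructions and MACs every cycle."""
    return clock_mhz * instructions_per_cycle, clock_mhz * macs_per_cycle


# TI C6201: 8-way VLIW, 2 multipliers, 200 MHz.
c6201 = peak_rates(200, 8, 2)

# Star*Core SC140: 6-way VLIW, 4 MAC units, 300 MHz.
sc140 = peak_rates(300, 6, 4)
```

These are peak numbers; they are reached only when every issue slot is filled, which is the difficulty the C6x FIR slide points out.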
Viterbi on Star*Core
GSM (K=5, 16 states)
    [ move.2l (r0)+,d0:d1    move.2l (r1)+,d1:d2 ]
    [ add2 d0,d4             sub2 d6,d2
      sub2 d4,d0             add2 d2,d6 ]
    [ max2vit d4,d2          max2vit d0,d6 ]      x4
    [ vsl.4w d2:d6:d1:d3,(r2)+n0
      vsl.4f d2:d6:d1:d3,(r3)+n0 ]
Decision bits are manually stored using the Viterbi Shift Left (VSL) instruction
Hardware support for Viterbi algorithm: max2vit instruction, vsl instruction
1 cycle per butterfly through software-pipelining
[Diagram: max2vit d4,d2 and max2vit d0,d6 update the SR; vsl.4w d2:d6:d1:d3,(r2)+n0 writes the results to memory, with D1 and D3 holding decisions and D2 and D6 holding path metrics.]
Courtesy: Gareth Hughes: Bell Labs Australia
Log-MAP on Star*Core
Star*Core code for log-MAP Butterfly (inputs d0: a, d1: b, d6: x)

Cycle 1    move.w (r0)+,d0     move.w (r1)+,d1        d0: a, d1: b
Cycle 2    add d0,d6,d0        sub d6,d0,d5           d0: a+x, d5: a-x
Cycle 3    sub d6,d1,d4        add d1,d6,d1           d4: b-x, d1: b+x
Cycle 4    sub d0,d4,d2        sub d1,d5,d3           d2: d0-d4, d3: d1-d5
Cycle 5    max d0,d4           max d1,d5              d4: max(d0,d4), d5: max(d1,d5)
Cycle 6    abs d2              abs d3
Cycle 7    move.l d2,n0                               n0: |d2|
Cycle 8    move.l d3,n0        move.w (r6+n0),d2      n0: |d3|, d2: value at (r6+n0)
Cycle 9    add d4,d2,d4        move.w (r6+n0),d3      d4: d4+d2, d3: value at (r6+n0)
Cycle 10   add d5,d3,d5                               d5: d5+d3
Cycle 11   move.2w d4:d5,(r2)+

This code uses 2 of the 4 ALUs and can be software pipelined to achieve 6 cycles per LOG-MAP Butterfly
Courtesy: Gareth Hughes: Bell Labs Australia
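The 11-cycle sequence can be followed step by step in a register-level model. This Python sketch mirrors the register names of the listing above; the integer metrics and the contents of the correction table (the memory that r6 addresses) are invented values for illustration, and the clamping of the table index is an assumption.

```python
# Hypothetical correction table indexed by the absolute metric difference.
CORRECTION_TABLE = [3, 2, 2, 2, 1, 1, 1, 1, 0]


def lookup(n0):
    """Table read at (r6+n0); differences past the table end give 0."""
    return CORRECTION_TABLE[min(n0, len(CORRECTION_TABLE) - 1)]


def log_map_butterfly(a, b, x):
    """Return (d4, d5) as stored by the final move.2w."""
    d0, d1, d6 = a, b, x              # cycle 1: load a and b
    d0, d5 = d0 + d6, d0 - d6         # cycle 2: a+x and a-x (old d0 on both)
    d4, d1 = d1 - d6, d1 + d6         # cycle 3: b-x and b+x (old d1 on both)
    d2, d3 = d0 - d4, d1 - d5         # cycle 4: differences
    d4, d5 = max(d0, d4), max(d1, d5) # cycle 5: compare-select
    d2, d3 = abs(d2), abs(d3)         # cycle 6: absolute differences
    d2 = lookup(d2)                   # cycles 7-8: first table look-up
    d3 = lookup(d3)                   # cycles 8-9: second table look-up
    d4 = d4 + d2                      # cycle 9: add correction
    d5 = d5 + d3                      # cycle 10: add correction
    return d4, d5                     # cycle 11: store both results


# Hypothetical metrics: a = 10, b = 7, branch metric x = 2.
result = log_map_butterfly(10, 7, 2)
```

Compared with a plain Viterbi butterfly, the extra work is the difference, the absolute value and the table read, which is what stretches the butterfly from a few cycles to eleven before software pipelining.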
Parallel DSP Architectures
Arch.      Parallelism                       Compile?   Power?
S/scalar   Dynamic instruction level
VLIW       Static instruction level
SIMD       Highly regular, data dependent
MIMD       Task level
MIMD with VLIW / SIMD provides high order parallel execution
The future of high performance DSPs is MIMD
Daytona: A Multiprocessor DSP Architecture

[Block diagram: a chip containing programmable processing elements (PE) and a hardware accelerator on a split transaction bus (128 bits), with arbitration and synchronization, and a buffered I/O subsystem connecting to I/O interfaces and external memory.]
Scalable Architecture - multiple programmable DSPs on a single chip
1 Bus supports different programmable DSPs and Microcontrollers
Split Transaction Bus
Separate Address and Data busses - each with pipelined protocol
Multiple outstanding transactions - varying size/priority
Separate Bus Arbitration

[Timing diagram: a PE places ID and addr on the address bus (100MHz) under a round-robin arbiter; the memory controller later returns ID-tagged data on the data bus (128 bits, 100MHz) under a second round-robin arbiter.]
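The two mechanisms named on this slide, round-robin arbitration and ID-tagged split transactions, can be modelled in a few lines. This is a behavioural Python sketch, not the Daytona bus protocol: the class and method names, the one-outstanding-request-per-PE limit and the dictionary standing in for memory are all assumptions.

```python
class RoundRobinArbiter:
    """Grants the bus to the next requester after the last one served."""

    def __init__(self, n_requesters):
        self.n = n_requesters
        self.last = n_requesters - 1

    def grant(self, requests):
        """requests: set of requester IDs; returns the granted ID or None."""
        for offset in range(1, self.n + 1):
            candidate = (self.last + offset) % self.n
            if candidate in requests:
                self.last = candidate
                return candidate
        return None


class SplitTransactionBus:
    """Address and data phases are decoupled and matched by requester ID."""

    def __init__(self, memory):
        self.memory = memory
        self.pending = {}

    def address_phase(self, pe_id, addr):
        # The address bus is released as soon as the request is recorded.
        self.pending[pe_id] = addr

    def data_phase(self, pe_id):
        # Data may return in any order; the ID tells which PE it belongs to.
        addr = self.pending.pop(pe_id)
        return pe_id, self.memory[addr]


# Hypothetical use: two PEs issue reads, data comes back out of order.
bus = SplitTransactionBus({0x10: "a", 0x20: "b"})
bus.address_phase(0, 0x10)
bus.address_phase(1, 0x20)
first_back = bus.data_phase(1)
```

Because the address bus is free again right after the address phase, several transactions can be outstanding at once, which is the point of splitting them.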
Memory Hierarchy in MIMD DSPs

Multiple copies of 1 application (e.g. odd/even slot channel equalisation)
Multiple copies of same software - Shared memory multiprocessing
Flat Memory Architecture vs. Hierarchical Memory Architecture
Flat (inefficient): each DSP has its own SRAM, so 2 DSPs hold 2 copies of software
Hierarchical: each DSP has a cache in front of a shared DRAM, so 2 DSPs share 1 copy of software
Mix of different applications (e.g. equalisation, convolutional decoding)
Heterogeneous mix of applications
Shared Memory Multiprocessing
64 Semaphores provided for process synchronization
L-1 cache coherency using a snoopy protocol (modified MESI used)
Access to shared data uses coherent transaction.
Caches snoop the address and query their tag RAMs.
A cache hit prevents the memory controller from servicing the request.
[Diagram: coherent transaction. One DSP accesses shared data while the other three DSPs snoop (miss, hit, miss); the hit is signalled to the memory controller.]
Daytona Multiprocessor DSP Chip

Bell Laboratories Research Chip for 3G Wireless Base-stations / Head-end xDSL
[Figure: block diagram. Four processing elements, each a 32-b RISC core with a 64-b 4-MAC SIMD DSP and cache memory, share a 128-b Split Transaction Bus with the host interface, I/O & memory controller, test & JTAG port, arbiter, and semaphore block.]
Chip Characteristics
Core Area 120 mm^2
Speed 100 MHz
Power 4 W
Tech 0.25 um
Paper 4.2, ISSCC2000
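As a rough check on these figures, peak throughput can be estimated from the block diagram. The sketch below is one plausible accounting, not a calculation given in the slides: it assumes each MAC counts as two operations (a multiply and an add) and that every MAC unit issues each cycle. The function name `peak_gops` is hypothetical.

```python
def peak_gops(n_pes, macs_per_pe, clock_mhz, ops_per_mac=2):
    """Peak giga-operations per second for a multiprocessor DSP.

    ops_per_mac=2 is an assumption: one multiply plus one accumulate.
    """
    return n_pes * macs_per_pe * ops_per_mac * clock_mhz / 1000.0


if __name__ == "__main__":
    # Four PEs, four MACs per PE, 100 MHz clock (from the slide above).
    print(peak_gops(n_pes=4, macs_per_pe=4, clock_mhz=100))  # 3.2
```

Under these assumptions the result agrees with the 3.2 GOPs figure in the title of reference [23].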

Photomicrograph of Daytona Test Chip
[Photomicrograph. Labels recovered: Arbiter, Vector Unit (RVU), Semph, Processing Element (PE) (three instances), DLL, HDS, LRU, BUS, INT, SPARC, Split Transaction Bus, 8KB Re-configurable Memory, I/O Subsystem.]
Paper 4.2, ISSCC2000

Acknowledgements
The following people contributed to the work in this tutorial:
Low Power DSPs for Wireless

Wanda Gass: Texas Instruments
Mihran Touriguian: Atmel
High Performance DSPs for Wireless Infrastructure

Bryan Ackland: Bell Labs US - High Perf. DSP Architecture
Gareth Hughes: Bell Labs Australia - LU DSP16210, C6x and Starcore benchmarks
Bing Xu: Bell Labs Australia - SOVA, MAP, LOG-MAP
Ran-Hong Yan: Bell Labs UK - 3G Wireless
Daytona Team: (J Williams, K.J. Singh, J. Othmer, B. Ackland), Bell Labs US.
References
[1] P. Lapsley, J. Bier, A. Shoham, E. Lee, DSP Processor Fundamentals, IEEE Press, New York, 1997.
[2] D. Skillicorn, A Taxonomy for Computer Architectures, Computer Magazine, Nov. 1988.
[3] H. Kabuo, M. Okamoto, I. Tanaka, H. Yasoshima, S. Marui, M. Yamasaki, T. Sugimura, K. Ueda, T. Ishikawa, H. Suzuki, R. Asahi, An 80 MOPS-Peak High-Speed and Low-Power-Consumption 16-b Digital Signal Processor, IEEE Journal of Solid-State Circuits, Vol. 31, No. 4, pp. 494-503, April 1996.
[4] E. A. Lee, D. G. Messerschmitt, Digital Communication, Kluwer Academic Publishers, Boston, 1988.
[5] W. Lee et al., A 1V DSP for Wireless Communications, Proceedings IEEE International Solid-State Circuits Conference, pp. 92-93, February 1997.
[6] S. Lin, J. Costello Jr., Error Control Coding: Fundamentals and Applications, Prentice Hall, New Jersey, 1983.
[7] Lucent 16000, http://www.lucent.com/micro/ or http://www.lucent.dk/micro/dsp16000/
[8] Thomas Parsons, Voice and Speech Processing, McGraw-Hill Book Company, New York, 1987.
[9] TMS320C54x User's Guide, available from the Texas Instruments Literature Response Center.
[10] I. Verbauwhede, M. Touriguian, A Low Power DSP Engine for Wireless Communications, Journal of VLSI Signal Processing, Vol. 18, pp. 177-186, 1998, Kluwer Academic Publishers.
[11] I. Verbauwhede, M. Touriguian, Wireless Digital Signal Processors, chapter in Digital Signal Processing for Multimedia Systems, edited by K.K. Parhi, T. Nishitani, Marcel Dekker, New York, 1999.
[12] M. Okamoto, K. Stone, T. Sawai, H. Kabuo, S. Marui, M. Yamasaki, Y. Uto, Y. Sugisawa, Y. Sasagawa, T. Ishikawa, H. Suzuki, N. Minamida, R. Yamanaka, K. Ueda, A High Performance DSP Architecture for Next Generation Mobile Phone Systems, 1998 IEEE DSP Workshop.
[13] Lode specifications, available from www.atmel.com
[14] M.W. Oliphant, The Mobile Phone Meets the Internet, IEEE Spectrum, pp. 20-28, Aug. 1999.
[15] L. C. Godara, Application of Antenna Arrays to Mobile Communications: Part 1, Proc. IEEE, Vol. 85, No. 7, pp. 1031-1060, July 1997.
References (cont)
[16] G. D. Forney, Jr., Maximum Likelihood Sequence Estimation of Digital Sequences in the Presence of Intersymbol Interference, IEEE Trans. Inform. Theory, Vol. IT-18, pp. 363-378, May 1972.
[17] C. Berrou, A. Glavieux, P. Thitimajshima, Near Shannon Limit Error-Correcting Coding and Decoding: Turbo-Codes (1), Proc. ICC93, May 1993.
[18] J. Hagenauer, P. Hoeher, A Viterbi Algorithm with Soft-Decision Outputs and its Applications, Proc. Globecom 89, pp. 47.1.1-47.1.7, Nov. 1989.
[19] L. Bahl, J. Cocke, F. Jelinek, J. Raviv, Optimal Decoding of Linear Codes for Minimizing Symbol Error Rate, IEEE Trans. Inform. Theory, Vol. IT-20, pp. 284-287, Mar. 1974.
[20] J. Turley, H. Hakkarainen, TI's New C6x DSP Screams at 1600 MIPS, Microprocessor Report, Vol. 11, No. 2, pp. 14, Feb. 1997.
[21] Starcore Launched First Architecture, Microprocessor Report, Vol. 12, No. 14, pp. 22, Oct. 1998.
[22] B. Ackland, P. D'Arcy, A New Generation of DSP Architectures, Proc. IEEE CICC99, Paper 25.1.1.
[23] J. Williams, K.J. Singh, C.J. Nicol, B. Ackland, A 3.2 GOPs Multiprocessor DSP for Communication Applications, Proc. IEEE ISSCC2000, Paper 4.2.