
DSP Architectures for Next-Generation Wireless Communications

Ingrid Verbauwhede
Department of Electrical Engineering
University of California Los Angeles
ingrid@ee.ucla.edu

Chris Nicol
Bell Laboratories Australia
Lucent Technologies

chrisn@lucent.com

ISSCC 2000, DSP Tutorial. Ingrid Verbauwhede, Chris Nicol

Mobile Wireless Trends


[Figure: Global wireline and wireless subscribers (in thousands), 1995-2010; vertical axis 0 to 1,600,000.
 Global Wireline: CAGR 5%, global penetration (2010) 20%.
 Global Wireless (Cellular + PCS + WLAS + Other): CAGR 21%, global penetration (2010) 21%.
 Global population: 7 billion, CAGR 1995-2010 1.4%.]

World-wide deployment of mobile communications is exceeding expectations


DSP Evolution and Markets
DSP Market: $2B market, 30% growth rate (Source: Forward Concepts 1996)

[Figure: Pie chart of the DSP market by segment: Disk, Cellular Infrastructure, Wireless Mobile Handsets, Cordless, Modem (V.34, V.90, xDSL), GPS, Consumer & Automotive, Other. Segment values shown: $270 M, $1.01B, $727 M.]

[Figure: Power (mW/MIP, log scale from 1 to 10K) versus year, 1980-2000.
 Microprocessors: M68000 ($200), 80286 ($200), 80386 ($300), Pentium ($300), Pentium (MMX) ($700).
 DSPs: DSP-1 ($150), DSP-32C ($250), DSP16A ($15), DSP1600 (<$10), DSP16210.]

The DSP Market Splits - and so does this tutorial

Today's general purpose assembly coded DSP: 100 MOPS, 250 mW, $40

splits into:

                 Mobile Terminals            Infrastructure
                 (Ingrid Verbauwhede)        (Chris Nicol)
DSP type         Low cost, low power DSPs    High Performance DSPs
Performance      200-1000 MOPS               1-10 GOPS
Power            < 100 mW                    1-5 watts
Cost             $10                         < $50

Overview
Introduction
Low Power DSP Architectures for Handsets
    Domain Specific Processors
    DSP Processor Fundamentals
    Datapath Design, Instruction Set Design
    Pipeline Control, Memory Architecture, Low Power Design
    for FIR - Viterbi - speech codec
High performance DSP Processors for BTS
    2G and 3G Wireless Standards
    Mobile Wireless Basestation Systems
    Receiver Algorithms, Smart Antennas
    Wideband TRX Architectures
    Convolutional and Turbo coding
    High Performance DSP Architectures for 3G Wireless
    LU DSP16210, TI C6x, Starcore SC140
    Future Trends - MIMD DSP

Domain Specific Processors


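[Figure: Spectrum of implementation styles, from left to right: ASIC, Application Specific, Domain Specific, General DSP, General Purpose.]

Performance / Power: high at the ASIC end, low at the general purpose end

Programmability: none at the ASIC end, parameters in between, very high at the general purpose end

Low power programmable DSPs for wireless communications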

Domain Specific Processors
Domain specific processors combine:
    High performance
    Low power
    High degree of programmability

Application domains that need it:
    Wireless communications (baseband processing)
    Video processors
    Embedded micro controllers
    Etc.

Application domain is narrower, hence high volume is needed to compensate for development cost.

Application domain: wireless communications


[Figure: Handset block diagram.
 RF board: PA, receiver, transmit, synthesizer, TCXO.
 Baseband board: external memories, micro processor, DSP, digital ASIC, analog ASIC, audio codec.
 Also shown: power supply, battery pack, display and keypad.]

Performance requirements: digital cellular phone
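Communication part: RF, modulation/demodulation, channel encoder/decoder
Application part: speech encoder/decoder

Receive: RF -> Demodulation -> Channel decoder -> Speech decoder
Send:    Speech encoder -> Channel encoder -> Modulation -> RF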

Goal: Minimum MIPS to get the job done.


Note: Definition of MIPS, MOPS

What is inside a MIPS (Million Instructions Per Second)?

DSPs use complex instructions
    One instruction = 5 operations
    E.g. Lode instruction: 2 memory operations, 2 address generations and 1 arithmetic operation
    So: benchmarks are expressed as the minimum number of operations needed to finish a job, usually quoted in MIPS

Small example: Viterbi butterfly operation in 4 cycles/butterfly
Large example: GSM half rate speech codec in only 12 MIPS

Application Domain: compute intensive functions

Source encoder/decoder = speech coders
    Advanced vocoders for improved speech quality & higher capacity:
    Example: ACELP derivatives for GSM and IS136A
    Digital filtering (FIR, IIR)
    Vector quantization, code book search (square distance computation)

Channel encoder/decoder = error correcting
    Complex wireless modems:
    Galois field arithmetic
    Convolutional coders based on Viterbi trellis search
    Turbo coders

Modulation/demodulation =
    Receivers based on Maximum Likelihood Sequence Estimation
    (again requires fast Viterbi butterfly operations)

Compute intensive functions: evolution of DSPs

Simple FIR example

Speed-up of FIR example

Viterbi acceleration

Square distance

Evolution of DSPs follows these examples

Evolution of DSP processors
Generation       Features                              Examples
0 (1980)         Von Neumann architecture              DSP-1 (AT&T)
1 (1982)         Basic Harvard architecture            TMS320C10 (TI), NEC7720
2 (1986)         1 data/program bus, 1 data bus        TMS320C25 (TI), DSP16A (AT&T)
3 (1990)         Extra addressing modes,               TMS320C5x (TI), DSP16xx (AT&T)
                 extra functions
4 (1994)         2 data busses, 1 program bus          TMS320C54x (TI)
5 (1996 - now)   2 data busses, 1 program bus,         Lucent 16xxx, Atmel Lode,
                 multiple units                        Siemens Carmel

DSP Processor Fundamentals

Processor Components [Skillikorn-88]

    Data Path Processing Unit
    Interconnect Processing Unit
    Instruction Processing Unit
    Memory Management Unit

Basic Harvard Architecture
Separate data memory from program memory!

[Figure: Program Memory feeds the Instruction Processing Unit; Data Memory feeds the datapath (16 x 16 multiply, accumulate, ALU).]

Different from Von Neumann machine: one address bus - one data bus - one memory space


Example 1: TMS320C10 (1982)

Features:
    160/200 ns instruction cycle time
    4K word external address reach
    60 general purpose and DSP specific instructions
    Single cycle multiply
    16-bit barrel shifter
    External interrupt and polled input pins
    Eight 16-bit I/O ports
    40-pin DIP/44-pin PLCC

[Figure: TMS320C10 block diagram.
 Memories: Data RAM 144 x 16, Program ROM 1.5K x 16.
 Buses: A (11-0), PA (7-0), D (15-0).
 I/O Ports 8 x 16 (A 2-0, D 15-0).
 CPU: 16-bit T-register, 16-bit barrel shifter (L), 16 x 16 multiply, 32-bit P-register, 32-bit ALU, 32-bit accumulator, ShiftL (0,1,4), 2 auxiliary regs, four level H/W stack, status register.]

Courtesy: Texas Instruments


Compute Intensive function 1: FIR

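[Figure: Direct form FIR filter. A tapped delay line (z^-1 elements) holds x(n), x(n-1), ..., x(n-(N-1)); each tap is multiplied by its coefficient c(0) ... c(N-1) and the products are summed into y(n).]

$y(n) = \sum_{i=0}^{N-1} c(i)\, x(n-i)$   (50 taps)

Single Cycle Multiply - Accumulate!

TMS320C10 (LTD combines LT, DMOV and APAC):
    LTD
    MPY
    LTD
    MPY
    ...
    MPY
    100 cycles, 100 words program memory

TMS320C25:
    RPTK 49
    MACD
    53 cycles, 3 words program memory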
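As a reference point for the cycle counts above, here is a minimal Python sketch of the direct form FIR equation. It is only a behavioural model, not DSP code: the names `fir_output` and `fir_filter` are illustrative, the delay line is assumed to start at zero, and the list shift stands in for the data move that DMOV/LTD/MACD perform in hardware.

```python
def fir_output(c, x_hist):
    """One FIR output y(n) = sum_{i=0}^{N-1} c(i) * x(n-i).

    x_hist[i] is assumed to hold x(n-i).
    """
    acc = 0
    for ci, xi in zip(c, x_hist):
        acc += ci * xi  # one multiply-accumulate per tap
    return acc


def fir_filter(c, x):
    """Filter the sequence x with coefficients c (zero initial state)."""
    delay = [0] * len(c)
    y = []
    for sample in x:
        # Shift the delay line by one position (the "data move").
        delay = [sample] + delay[:-1]
        y.append(fir_output(c, delay))
    return y
```

Each output costs N multiply-accumulates, which is why a single-cycle MAC with a repeat instruction brings a 50-tap filter close to 50 cycles per sample.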

Example 2: Single Cycle MAC


TMS320C2x Multiplier/ALU

    Single cycle 16x16 bit multiply yielding a 32-bit product
    Supports simultaneous program and two data operand acquisition
    Supports simultaneous ALU and multiplier operations
    0-16 bit left post-shifter

[Figure: TMS320C2x multiplier/ALU datapath. Program bus and data bus (16 bit) feed the T register (16) and, through a MUX, the 16x16 multiplier; the 32-bit product goes to the P register (32) and a left shifter; a second left shifter (0-16) and a MUX feed the 32-bit ALU; the accumulator register (32, with carry C) drives a left shifter (0-7) back to the 16-bit data bus.]

Courtesy: Texas Instruments
Compute Intensive function 1: FIR (cont.)
$y(n) = \sum_{i=0}^{N-1} c(i)\, x(n-i)$   (50 taps)

y(0) = c(0)x(0) + c(1)x(-1) + c(2)x(-2) + . . . + c(N-1)x(1-N);
y(1) = c(0)x(1) + c(1)x(0) + c(2)x(-1) + . . . + c(N-1)x(2-N);
y(2) = c(0)x(2) + c(1)x(1) + c(2)x(0) + . . . + c(N-1)x(3-N);
. . .
y(n) = c(0)x(n) + c(1)x(n-1) + c(2)x(n-2) + . . . + c(N-1)x(n-(N-1));

One output = 2N reads, N MACs, 1 write

Classic Harvard: one output = N cycles



FIR speed-up
FIR filtering: two outputs in parallel

y(0) = c(0)x(0) + c(1)x(-1) + c(2)x(-2) + . . . + c(N-1)x(1-N);


y(1) = c(0)x(1) + c(1)x(0) + c(2)x(-1) + . . . + c(N-1)x(2-N);
y(2) = c(0)x(2) + c(1)x(1) + c(2)x(0) + . . . + c(N-1)x(3-N);
. . .
y(n) = c(0)x(n) + c(1)x(n-1) + c(2)x(n-2)+ . . + c(N-1)x(n-(N-1));

Two outputs = 4N reads, 2N MACs, 2 writes


Dual MAC architecture with ONLY 2 data busses??
    Read two 32-bit numbers instead of four 16-bit numbers
        Solution by Lucent 16000 core with dual MAC
    Run MAC at double frequency, read two 32-bit numbers
        Solution by Matsushita
    Insert delay register
        Solution by Atmel's Lode
Example 3: Lucent DSP16210
Inner loop of 32-tap FIR Filter

    do 14 {                 // one instruction!
        a0=a0+p0+p1
        p0=xh*yh  p1=xl*yl
        y=*r0++   x=*pt0++
    }

Outer Loop: 19 cycles, 38 bytes
1 cycle in inner loop
5 exec units used in inner loop
2 MACs per cycle
Horizontal parallelism, one sample at a time
2G mobile wireless base-stations

[Figure: DSP16210 datapath. XDB(32) and IDB(32) buses load the X(32) and Y(32) registers; two 16 x 16 multipliers produce p0 (32) and p1 (32), each followed by Shift/Sat.; ALU, ADD and BMU units operate on the accumulator file (8 x 40).]

Courtesy: Gareth Hughes, Bell Labs Australia



FIR on Lode
FIR filter: two outputs in parallel with delay register
y(0) = c(0)x(0) + c(1)x(-1) + c(2)x(-2) + . . . + c(N-1)x(1-N);
y(1) = c(0)x(1) + c(1)x(0) + c(2)x(-1) + . . . + c(N-1)x(2-N);
y(2) = c(0)x(2) + c(1)x(1) + c(2)x(0) + . . . + c(N-1)x(3-N);
. . .
y(n) = c(0)x(n) + c(1)x(n-1) + c(2)x(n-2)+ . . + c(N-1)x(n-(N-1));

Total energy for one output sample:

Energy                       Single MAC   Dual MAC   Dual MAC with REG
No. of MAC operations        N            N          N
No. of memory reads          2N           2N         N
No. of instruction cycles    N            N/2        N/2

FIR on Lode
Two MAC units with dedicated bus network

    DB0 fetches coefficient c(i)
    DB1 fetches data
    LREG delays input data
    A0 stores y(n) output
    A1 stores y(n+1) output

[Figure: DB0(16) delivers c(i) to both MACs; DB1(16) delivers x(n-i) to MAC0 and, delayed through LREG, x(n-i+1) to MAC1; each MAC multiplies and accumulates into its own accumulator.]

Same structure can be used for IIR
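The delay-register idea can be modelled in a few lines of Python. This is a sketch under assumptions, not Lode code: the function name `dual_mac_fir`, the zero padding outside the input range, and the explicit `reads` counter are all illustrative. It computes y(n) and y(n+1) together while fetching each coefficient and each data sample only once per tap.

```python
def dual_mac_fir(c, x, n):
    """Compute y(n) and y(n+1) together; return (y_n, y_n1, reads).

    Samples outside x are treated as zero. `reads` counts memory fetches.
    """
    def sample(k):
        return x[k] if 0 <= k < len(x) else 0

    reads = 0
    lreg = sample(n + 1)  # prime the delay register with x(n+1)
    reads += 1
    a0 = a1 = 0
    for i in range(len(c)):
        coeff = c[i]          # fetched on DB0
        data = sample(n - i)  # fetched on DB1
        reads += 2
        a1 += coeff * lreg    # MAC1: c(i) * x(n+1-i), taken from the delay register
        a0 += coeff * data    # MAC0: c(i) * x(n-i)
        lreg = data           # delayed sample for the next tap
    return a0, a1, reads
```

For N taps the two outputs need 2N + 1 reads in total, i.e. about N reads per output instead of 2N, which matches the "Dual MAC with REG" column of the energy table.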


Compute Intensive function 2: Viterbi


Viterbi butterfly: states i and i + s/2 feed states 2i and 2i+1 over branches with metrics +a and -a
    i = state index
    s = # of states = $2^{k-1}$
    w = decoding window

Basic equations:

$d(2i) = \min\{\, d(i) + a,\; d(i + s/2) - a \,\}$

$d(2i+1) = \min\{\, d(i) - a,\; d(i + s/2) + a \,\}$

Key operation: Add-Compare-Select (ACS)

IS-95: k = 8, w = 192, corresponds to $2^{7} \times 192 \times$ (cycles for one ACS)

Basic algorithm in Viterbi channel decoders and MLSE based receivers, modified version in turbo decoders.
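The butterfly equations translate directly into code. The following Python sketch performs one Add-Compare-Select stage over all states; the function name `acs_stage`, the per-butterfly branch metric list `a`, and the convention that decision bit 1 means "survivor came from state i + s/2" are assumptions made for illustration.

```python
def acs_stage(d, a):
    """One trellis stage of Add-Compare-Select.

    d: path metrics of the s current states.
    a: branch metric a for each of the s/2 butterflies.
    Returns (new path metrics, decision bits).
    """
    s = len(d)
    half = s // 2
    new_d = [0] * s
    decisions = [0] * s
    for i in range(half):
        upper, lower = d[i], d[i + half]
        # Add: candidate metrics into states 2i and 2i+1
        to_even = (upper + a[i], lower - a[i])
        to_odd = (upper - a[i], lower + a[i])
        # Compare-Select: keep the smaller metric and remember its origin
        for target, (from_upper, from_lower) in ((2 * i, to_even), (2 * i + 1, to_odd)):
            if from_lower < from_upper:
                new_d[target], decisions[target] = from_lower, 1
            else:
                new_d[target], decisions[target] = from_upper, 0
    return new_d, decisions
```

Each butterfly costs four additions, two comparisons and two selections, which is the work the 4 cycles/butterfly figures on the following slides refer to.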
Viterbi on Lode
Two MAC units & ALU: Add-Compare-Select

    DMAC operates as dual add/subtract unit
    ALU finds minimum
    Shortest distance saved
    Path indicator (ALU decision bit) saved to memory
    4 cycles / butterfly

[Figure: DB0(16) and DB1(16) feed MAC0 and MAC1, which form the two candidate sums in A0 and A1; the ALU takes Min() of the two and writes the result to A2/A3.]


Viterbi on TIC54x
ALU and CSSU: Add-Compare-Select

    ALU splits in 16 bit halves
    ACC splits in half
    Shortest distance saved
    CSSU compares halves
    Path indicator (decision bit) saved in TRN reg
    4 cycles / butterfly

[Figure: DB0(16), DB1(16) and TREG feed the split ALU, which forms the two candidate sums in the MSW and LSW of the accumulator; the CSSU (Comp, Select) compares the halves and sends the survivor over data bus EB to memory.]

Viterbi on LU DSP16210

GSM (K=5, 16 states)

    do 8 {
        a0=a4+y  a1=a5-y  *r3++=a0h
        a2=a4-y  a3=a5+y  *r5++=a2h
        a0=cmp1(a1,a0)  yh=*r0  r0=r1+j  j=k  k=*pt1++
        a2=cmp1(a3,a2)  a4_5h=*pt0++
    }

Comparison functions store ACS decision bits:
    a0=cmp1(a1,a0) and a2=cmp1(a3,a2) each shift a decision bit into AR0
    Results written to memory

Hardware support for Viterbi algorithm:
    ACS calculations are efficient
    Minimal overhead
    4 cycles per butterfly
    32 cycles per GSM timeslot.

Courtesy: Gareth Hughes, Bell Labs Australia

Square distance on Lode


ALU in parallel with MAC: Sum of square distance

$D = \sum_{i=0}^{N-1} \| x(i) - y(i) \|^{2}$

    ALU performs subtraction and absolute value
    MAC performs squaring and accumulation

[Figure: DB1(16) and DB0(16) deliver x(i) and y(i) to the ALU (subtract); the result feeds the MAC (multiply, add), which accumulates D in A0.]

Vector quantization in vocoders: vector size N = 50, codebook > 1000
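A short Python sketch shows where the cycles go in a codebook search. The names `square_distance` and `codebook_search` are illustrative, and an exhaustive search is assumed; real vocoders often use structured codebooks to prune it.

```python
def square_distance(x, y):
    """D = sum_i |x(i) - y(i)|^2, split the way the datapath splits it."""
    acc = 0
    for xi, yi in zip(x, y):
        diff = abs(xi - yi)  # ALU: subtraction and absolute value
        acc += diff * diff   # MAC: squaring and accumulation
    return acc


def codebook_search(x, codebook):
    """Exhaustive search: return (index, distance) of the closest entry."""
    best_index, best_dist = 0, square_distance(x, codebook[0])
    for index in range(1, len(codebook)):
        dist = square_distance(x, codebook[index])
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index, best_dist
```

With N = 50 and more than 1000 codebook entries, one search is over 50,000 subtract-square-accumulate steps, so folding the three operations into one instruction cycle pays off directly.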

Lode Core Architecture


Domain specific instruction set

Basic instruction set for general purpose DSP
    e.g. MAC, min, max, etc.

Extra instructions for performance with every new generation
    e.g. square distance and accumulate

    $D = \sum_{i=0}^{N-1} \| x(i) - y(i) \|^{2}$

One 32 bit instruction:

    a3 = abs (*r0 - *r1 < asr), a0 = a0 + sqr(a3), r0++, r1++;

Bus network and instruction set design go together

CISC, thus compiler unfriendly


Control & Pipeline for DSPs
RISC: load/store machine
    Memory access with load/store instructions (DLX, MIPS, D10V)
    Pipeline: Fetch | Decode | Execute | Memory Access | Write Back
    Execute stage: execution / address generation
    Memory Access stage: memory access / branch
    Excellent for complex decision making!

DSP: register-memory architecture (TI, Lucent, HX, Lode)
    Pipeline: Fetch | Decode | Memory Access | Execute | Write Back
    Memory Access stage: memory access
    Execute stage: execution
    Excellent for number crunching!



Pipeline RISC compared to DSP


RISC example:

    r0 = *p0;        // load data
    a0 = a0 + r0;    // execute

    Pipeline per instruction: Fetch | Decode | Execute | Memory Access
    The operand needs its own load instruction before it can be used: too expensive for DSP

DSP: memory intensive applications:

    Pipeline per instruction: Fetch | Decode | Memory Access | Execute
    Memory access precedes execution, so an instruction can fetch its operands and operate on them

Penalty: data dependent branch is expensive


Other control features

Hardware looping:
    Because software branch is expensive
    Zero overhead hardware loops (for tight FIR loops), hardware supported

Interrupts: hardware with shadow registers for extremely fast context switching.

Special instruction cache:
    Single instruction repeat buffer
    Multiple instruction cache: under programmer's control!
    E.g. Lucent DSP16210: 31 x 32 instruction cache

Predictable worst case execution time!


Low Power DSPs


DSP 1600 Core                                C54x 1V DSP
(Lucent - 1609 low cost consumer 16-bit)     (Texas Instruments - ISSCC 1997)

0.35 µm 3LM CMOS                             0.25 µm 3LM CMOS
80 M 16b MAC/s at 3.3V                       65 M 16b MAC/s at 1.0V
1.4 mW/MHz at 3.3V                           0.21 mW/MHz at 1.0V
30 µW stand-by power                         4.0 mW stand-by power
                                             Dual Vt process
BUT: DSP Software Development
Complex DSP architecture not amenable to compiler technology
Algorithms are modeled in high level language (e.g. C++)
Solutions are implemented and debugged in hand-optimized assembler - large development effort with minimal tool support

Flow: HLL algorithmic model -> hand coded assembler -> prototype code -> optimize & debug -> production code

Long, frustrating time to market

Fragile legacy code

Still used in handhelds, but changing in basestations (see Part II)



Mobile Wireless Evolution


First Generation (past)
    Service: Mobile Telephone Service: Carphone
    Technology: Analog Cellular Technology; Macrocellular Systems
    Standards: NMT, TACS, Analog AMPS

Second Generation (now)
    Service: Digital Voice + Messaging/Data Services; Fixed Wireless Loop
    Technology: Digital Cellular Technology + IN emergence; Microcellular & Picocellular: capacity, quality; Enhanced Cordless Technology
    Standards: GSM, IS-54/136 TDMA, IS-95/cdmaOne, PDC, DECT

Third Generation (year 2000-2005)
    Service: Integrated High Quality Audio and Data; Narrowband and Broadband Multimedia Services + IN integration
    Technology: Broader Bandwidth; Efficient Radio Transmission; Information Compression; Higher Frequency Spectrum Utilization; IN + Network Management integration
    Standards: WCDMA, UWC-136 TDMA, cdma2000
    Global roaming

Fourth Generation (year 2010?)
    Service: TelePresencing; Education, training and dynamic information access
    Technology: Wireless-Wireline and Broadband Transparency; Knowledge-Based Network Operations; Unified Service Network

We are entering the decade of wireless data communications - and World-War 3G


Mobile Data Services
Carriers invest >$500 per subscriber, but subscriber voice calls (and therefore revenues) are declining.

Data currently 3% of wireless traffic - projected to be >50% by 2005

Wireless Internet: average internet connection 30 mins

Text Messaging: saturating 2G voice networks

2.5 Generation Mobile Standards [1]
    GPRS: Packet data over GSM - timeslot multiplexing, multi-slots per user.
    EDGE: 8-PSK modulation + GPRS, 384 kbps max to 1 user.

3G - IMT2000 Proposals
    144 kbps automobile, 384 kbps pedestrian, 2 Mbps stationary.
    Several proposals - UWC 136 (200 kHz, TDMA, 8-PSK = EDGE).
    UMTS, CDMA-2000 are both CDMA proposals.


Evolution of Mobile Wireless Network Architecture

[Figure: 2G network versus IP-based 3G network.
 2G network: radio clients, base stations, BSC, MSC switches, PSTN.
 IP-based 3G network: radio clients and base stations on packet connectivity (ATM / IP); Packet Mode servers (high speed data, multimedia, voice over IP, etc.), Wireless Control servers (feature control, network management, billing, etc.) and Circuit Mode servers (voice, low speed data, etc.); connections to the PSTN and to Internet / Advanced Services.]

Mobile networks are being upgraded in preparation for the delivery of high speed data services.
Mobile Wireless Infrastructure

[Figure: Macro-cell GSM Basestation (6-12 TRX) and Micro-cell GSM Basestation (2 TRX)]

2G Basestation Baseband Processing

Multiple DSPs used for baseband processing.
RISC Microcontroller for timing, framing, I/O control
Software upgradable over the network
DSPs dominate cost and power consumption
Future trend - integrate baseband processing - low cost Pico BTS

[Figure: Tx/Rx baseband processing board for 2-carrier GSM basestation. Per carrier: Tx and Rx AFE, two DSPs with RAM for equalization, and three DSPs for channel de/coding and encryption; shared RISC micro controller, I/O ASICs and T1/E1 interface.]


3G Basestation Baseband Processing
Increased DSP performance needed in next-generation basestation
Increased Receiver Algorithm Sensitivity
Antenna Arrays - Smart Antennas
Multi-Standard Basestations using Software Radio Architecture
3G - constraint length 9, rate 1/2 convolutional coding for voice.
3G - constraint length 4, Turbo codes for data

High Performance DSPs + Custom Logic needed for 3G (Viterbi decoding and Turbo decoding)

[Figure: 3G receiver functions, each annotated as DSP, ASIC, or both:
 Code generator (channelisation code, scrambling code);
 Synchronisation (cell search; slot syn, frame syn.);
 SIR measurement;
 Power control (fast power control);
 Sliding correlator (despreading);
 RAKE combiner (reassemble multipath);
 Decoder (deinterleaver, Viterbi algorithm, Turbo decoding);
 Code tracking (delay-lock-loop);
 Channel estimation;
 Path search.]

Courtesy: Bing Xu, Bell Labs Australia



Receiver Algorithms for GSM Basestation


Enhanced Receiver Sensitivity
    Larger cells in suburban areas = reduced network cost
    Mobile transmits with less power = increased battery life

Existing Receiver:
    Estimating wireless channel -> Equalizing multi-path effects -> Channel decoding -> Speech decoding

New Iterative Receiver (1.3 dB improvement):
    Estimating wireless channel -> Equalizing multi-path effects -> Channel decoding -> Speech decoding, with speech statistics fed back into the chain

Challenge - requires 6x DSP MIPS of existing receiver in basestation

Courtesy: Magnus Sandell, Bell Labs UK
Smart Antennas
A multiple antenna element system
Combined with a base station architecture and signal processing techniques designed to dynamically select or form the optimum beam pattern per user

[Figure: Cell site coverage patterns: Omnidirectional, Three Sector, Intelligent Antenna]

Increased cost in RF electronics and enhanced DSP requirements.

Fixed Multi-Beam Versus Adaptive Beam


Fixed Multi-Beam: select from--or use--multiple fixed antenna beams to optimize performance.

Adaptive Beam: adaptively weight and combine multiple antenna elements to optimize performance.

[Figure: Both cases show Mobile 1, Mobile 2, an interferer, direct rays and reflected rays around the cell site.]
Digital Radio Trends - Software Radio
[Figure: Multi-standard basestation chain: Antennas -> AMP (linear amplification, combining) -> RF/Analog Processing -> A/D -> Digital Processing -> Network Interface -> Network]

RF/IF
    Higher dynamic range
    Smaller
    Amplifiers
    Mixers
    Filters

DSPs - higher speed, more powerful
    Filtering, Modulation
    Demodulation, Equalization
    Rake receiver, Correlator
    Channel coding, Encryption
    Diversity
    . . .

Wideband Receiver Architecture

[Figure: Wideband receiver chain: RF-IF & Filter -> High Speed A/D -> Digital Channeliser -> Baseband Processing (CH1 ... CHM). Spectrum sketches show channels CH1 ... CHM around f_RF, f_IF and f_BB.]

Increased DSP performance needed for Software Radio

Turbo Codes
For 3G Wireless (UMTS and CDMA2000)
    Voice service: BER requirement $10^{-3}$
    Data service: BER requirement $10^{-5}$

Parallel concatenation of convolutional codes is used to give the codes structure so they can be decoded
Pseudorandom interleaving is used to give the codes performance which approaches that for random coding
Resulting encoder structure: two Recursive Systematic Convolutional (RSC) codes

[Figure: Turbo encoder. The input goes directly to the systematic output and to Encoder #1; an interleaved copy goes to Encoder #2; a MUX combines the two encoder outputs into the parity output.]
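The parallel concatenation can be sketched in Python as follows. Everything specific here is an assumption for illustration: the RSC polynomials (feedback 1 + D^2 + D^3, feedforward 1 + D + D^3), the fixed permutation used as interleaver, and the names `rsc_parity` and `turbo_encode`. Trellis termination and the MUX/puncturing step are left out.

```python
def rsc_parity(bits):
    """Parity stream of a recursive systematic convolutional encoder.

    Assumed polynomials: feedback 1 + D^2 + D^3, feedforward 1 + D + D^3.
    """
    s1 = s2 = s3 = 0  # shift register contents (D, D^2, D^3)
    parity = []
    for u in bits:
        fb = u ^ s2 ^ s3              # recursive feedback bit
        parity.append(fb ^ s1 ^ s3)   # feedforward taps 1, D, D^3
        s1, s2, s3 = fb, s1, s2       # shift
    return parity


def turbo_encode(bits, interleaver):
    """Return (systematic, parity1, parity2) for a rate-1/3 turbo code.

    `interleaver` is a permutation of range(len(bits)).
    """
    systematic = list(bits)
    parity1 = rsc_parity(bits)
    parity2 = rsc_parity([bits[j] for j in interleaver])
    return systematic, parity1, parity2
```

Because the encoders are recursive, a single 1 in the input produces a long parity response; the interleaver makes it unlikely that both encoders see a low-weight pattern at the same time.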

Turbo Decoding

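Key idea: iterative decoding (up to 10 iterations for 3G)

There is one decoder for each elementary encoder.
Each decoder estimates the a-posteriori probability (APP) of each data bit.
The APPs are used as a priori information by the other decoder.

[Figure: Turbo decoder. A DeMUX separates systematic data and parity data. Decoder #1 passes its APP through an interleaver to Decoder #2; Decoder #2 returns its APP through a deinterleaver to Decoder #1 and delivers the hard bit decisions. The systematic data reaches Decoder #2 through an interleaver.]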

Soft-Output Decoding Algorithms
Requirements for Turbo (Soft-Input Soft-Output):
    Accept Soft-Inputs in the form of a priori probabilities (APP)
    Produce APP estimates of the data.

Today's high-performance DSPs are highly MAC-focussed (for filtering in modem applications). Some DSPs provide hardware support for efficient implementation of Viterbi - none support SOVA or log-MAP

Iterative channel estimation also uses Soft-Input Soft-Output decoders.

Trellis-Based Estimation Algorithms:
    Sequence estimation:         Viterbi Algorithm -> SOVA -> Improved SOVA
    Symbol-by-symbol estimation: MAP Algorithm -> max-log-MAP -> log-MAP

SOVA and log-MAP use modified Add-Compare-Select operations - not only select the maximum path metric - but also need to keep the difference.

The Maximum A Posteriori (MAP) Algorithm


Log-Likelihood Ratio:

$L(d) = \ln \frac{\Pr[d = 1]}{\Pr[d = 0]}$

$L(d \mid y) = \ln \frac{\Pr(d = 1 \mid y)}{\Pr(d = 0 \mid y)} = \ln \frac{p(y \mid d = 1)}{p(y \mid d = 0)} + L(d)$

$L(d)$: a priori value of Pr[d=1], Pr[d=0]

Output of decoder contains additional extrinsic information
The sum of the a priori information and the extrinsic information will be the a priori information for the next stage of decoding, for either the 2nd decoder or the 1st decoder in the next iteration

$L(u_k) = \ln \frac{\Pr[u_k = 1 \mid y]}{\Pr[u_k = 0 \mid y]} = \ln \frac{\sum_{(s', s):\, u_k = 1} p(s', s, y)}{\sum_{(s', s):\, u_k = 0} p(s', s, y)}$

1) $u_k$ is the kth bit of the desired data sequence, 2) $y$ is the observed sequence, 3) $(s', s)$ is the state transition from state $s'$ at time k-1 to state $s$ at time k, 4) we want to evaluate this LLR for every k

Break the probability computation into:

$p(s', s, y) = p(s', y_{j<k}) \; p(s, y_k \mid s') \; p(y_{j>k} \mid s)$

Gamma: $\gamma_k(s', s) = p(s, y_k \mid s')$
Alpha: $\alpha_{k-1}(s') = p(s', y_{j<k})$
Beta: $\beta_k(s) = p(y_{j>k} \mid s)$
Gamma, Alpha and Beta Calculations
Gamma: Calculated from known bits up to k, needs to be stored

$\gamma_k(s', s) = p(s, y_k \mid s') = P(s \mid s')\, p(y_k \mid s', s) = P(u_k)\, p(y_k \mid u_k)$

where $P(u_k)$ is calculated from the a priori information and $p(y_k \mid u_k)$ is calculated from the received bits

Alpha: Calculated by a forward recursion through the trellis based on Gamma

$\alpha_k(s) = \sum_{s'} \gamma_k(s', s)\, \alpha_{k-1}(s')$

Beta: Calculated by a backward recursion from the end of the trellis

$\beta_{k-1}(s') = \sum_{s} \gamma_k(s', s)\, \beta_k(s)$

[Figure: Window algorithm. Alpha, Gamma and Beta are computed over a window of the trellis, with dummy Betas used to start the backward recursion.]
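The two recursions can be checked with a small probability-domain Python sketch. The function name `forward_backward`, the 0-based indexing (`gamma[k]` links `alpha[k]` to `alpha[k + 1]`), the start in state 0 and the all-ones final Beta (unterminated trellis) are assumptions for illustration; a real decoder would work in the log domain and over a window.

```python
def forward_backward(gamma, n_states):
    """Alpha/Beta recursions of the MAP algorithm (probability domain).

    gamma[k][sp][s] is the branch metric of the transition sp -> s at step k.
    Returns (alpha, beta), each a list of K+1 state vectors.
    """
    K = len(gamma)
    alpha = [[0.0] * n_states for _ in range(K + 1)]
    beta = [[0.0] * n_states for _ in range(K + 1)]
    alpha[0][0] = 1.0                 # assumed: trellis starts in state 0
    for s in range(n_states):
        beta[K][s] = 1.0              # assumed: unterminated trellis
    # Forward recursion: alpha_k(s) = sum_{s'} gamma_k(s', s) * alpha_{k-1}(s')
    for k in range(K):
        for s in range(n_states):
            alpha[k + 1][s] = sum(gamma[k][sp][s] * alpha[k][sp]
                                  for sp in range(n_states))
    # Backward recursion: beta_{k-1}(s') = sum_s gamma_k(s', s) * beta_k(s)
    for k in range(K - 1, -1, -1):
        for sp in range(n_states):
            beta[k][sp] = sum(gamma[k][sp][s] * beta[k + 1][s]
                              for s in range(n_states))
    return alpha, beta
```

A useful sanity check is that the sum over states of alpha times beta is the same at every trellis step, since it always equals the total probability of the observed sequence.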

Log MAP and MAX-log MAP


Compute logarithms of alpha, beta and gamma, which means we compute:

$\ln\left(e^{\delta_1} + e^{\delta_2}\right)$

Log-MAP: $\ln\left(e^{\delta_1} + e^{\delta_2}\right) = \max(\delta_1, \delta_2) + f_c(|\delta_1 - \delta_2|)$, where $f_c$ is the correction function (implemented as a table)

MAX-Log-MAP: $\ln\left(e^{\delta_1} + e^{\delta_2}\right) \approx \max(\delta_1, \delta_2)$

MAX-log MAP loses approx. 0.5 dB relative to log MAP.

For log-MAP, a small correction table is needed (approx. 6 non-zero values). The absolute difference is used as table look-up index. We need the difference!

[Figure: BER (log scale, $10^{-1}$ down to $10^{-6}$) for MaxlogAPP and LogAPP; horizontal axis from 2.2 to 2.9.]

Courtesy: Bing Xu, Bell Labs Australia
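The relation between the three variants is easy to see in a Python sketch. The exact correction term ln(1 + e^(-|d1 - d2|)) is the standard identity; the table step of 0.5 and the table length of 6 entries are assumptions chosen only to mirror the "approx. 6 non-zero values" remark, and the function names are illustrative.

```python
import math


def max_star_exact(d1, d2):
    """Log-MAP: ln(e^d1 + e^d2) = max(d1, d2) + ln(1 + e^-|d1 - d2|)."""
    return max(d1, d2) + math.log1p(math.exp(-abs(d1 - d2)))


def max_log(d1, d2):
    """MAX-log-MAP: drop the correction term."""
    return max(d1, d2)


# Assumed quantisation: 6 table entries, step 0.5 in |d1 - d2|.
STEP = 0.5
TABLE = [math.log1p(math.exp(-STEP * n)) for n in range(6)]


def max_star_table(d1, d2):
    """Log-MAP with the correction read from a small look-up table."""
    index = int(abs(d1 - d2) / STEP)  # the absolute difference is the index
    correction = TABLE[index] if index < len(TABLE) else 0.0
    return max(d1, d2) + correction
```

All three need max(d1, d2), but only the log-MAP variants also need the difference d1 - d2, which is why an ACS unit that discards the difference cannot implement them efficiently.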
High Performance DSP Requirements
Very high levels of DSP integer performance

Support for complex real-time synchronous applications (latency, predictable throughput, synchronization)

Large memory and I/O bandwidth.

Scalability to meet wide range of cost, power, performance.

Cost & power efficient solution.

Friendly, compiler driven, programming environment.

[Figure: Some DSP Applications. MOPS (log scale, 10 to 100K) versus year (1997, 1999, 2001). Applications plotted include V.34, K56, ADSL, GSM term, PCS term, 16 HR GSM, DAB rcvr, 24 ch. modem, set-top box, MPEGII encode, 3G Wireless, 1G eth. xcvr, Soft radio and 3-D graphics?; a line marks the reach of the traditional DSP.]

Compiler Driven VLIW


Instruction format: cond/branch | ex1 | ex2 | ex3 | .. | exn

[Figure: Data memory and a register array connect through an interconnect to the execution units ex1 (alu), ex2 (alu), ex3 (mpy), ex4 (ld/st), ... exn (ld/st).]

Large orthogonal register set, regular interconnect

Atomic RISC-like operations => heavily pipelined, high freq. clock


Explicitly Parallel Instruction Computing
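Execution Clusters

[Figure: Data memory serves two clusters, each with its own register array and interconnect. Cluster 1: ex1 (alu), ex2 (alu), ex3 (ld/st). Cluster 2: ex4 (alu), ex5 (mpy), ex6 (ld/st).]

Execution Sets

[Figure: A fetch set carries marker bits (1 1 1 0 1 0 1 0) that delimit the exec. sets inside it.]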


Explicitly Parallel Instruction Computing


Predication (guarded) exec.: cond | any instruction
    - eliminates branches -> improves compiler efficiency
    - eliminates branches -> removes pipeline bubbles
    - fill delayed branch slots with predicated instructions

Instruction modifiers: modifier | instr1 | instr2 | instr3 | instr4
    - allows shorter instruction length
    - extend register addressing
    - predication
    - execution set identifier
    - looping
    - extended operations
Texas Instruments C6201
[Figure: C6201 block diagram. Program Memory (16K x 32) feeds Instruction Dispatch & Decode over a 256-bit path; Register Bank A (16 x 32) and Register Bank B (16 x 32) each serve four units (ALU, shift, mpy, add); Data Memory (32K x 16).]

8-way VLIW with two execution clusters
256 bit (8x32) instruction fetch with variable length execute set
Each 32 bit instruction individually predicated
11 stage pipeline
1600 MIPS, 400 MMACs @ 200 MHz

FIR Filter on TI C6x

Hand-coded assembly: 32-tap FIR filter

    loop:
            ldw   .d1t1  *a4++,a5
    ||      ldw   .d2t2  *b4++,b5
    ||[b0]  sub   .s2    b0,1,b0
    ||[b0]  b     .s1    loop
    ||      mpy   .m1x   a5,b5,a6
    ||      mpyh  .m2x   a5,b5,b6
    ||      add   .l1    a7,a6,a7
    ||      add   .l2    b7,b6,b7

Outer Loop: 23 cycles, 180 bytes
1 cycle in inner loop
All 8 exec units used in inner loop - maximum efficiency
2 MACs per cycle

Assembly syntax more difficult to learn.
Hard to get full use of all 8 execution units at once.
Software pipelining difficult to implement, and requires longer prolog/epilog (larger code size).

Courtesy: Gareth Hughes, Bell Labs Australia
Viterbi on TI C6x
3-cycle 2-ACS Inner-Loop 16-state Viterbi decoder for GSM
from TI WWW site: ftp://ftp.ti.com/pub/tms320bbs/c62xfiles/vitgsm.asm

    LOOP:
      [b1]  b     .s1   LOOP
    ||[b1]  sub   .s2   b1,1,b1
    ||[!a2] sth   .d1   b12,*+a6[8]
    ||[!a2] add   .d2   b0,b14,b14
    ||      cmpgt .l1   a11,a10,a1
    ||      cmpgt .l2   b11,b10,b0
    ||      mpy   .m1x  1,b5,a4

      [a2]  sub   .s1   a2,1,a2
    ||[!a2] sth   .d1   a12,*a6++
    ||[a1]  add   .s2   2,b0,b0
    ||[b0]  mpy   .m2   1,b11,b12
    ||      mpy   .m1   1,a10,a12
    ||      sub   .l2x  a7,b5,b10
    ||      ldh   .d2   *++b9,b5

            shl   .s2   b14,2,b14
    ||[a1]  mpy   .m1   1,a11,a12
    ||      add   .s1   a7,a4,a10
    ||      sub   .l1x  b13,a4,a11
    ||      add   .l2   b13,b5,b11
    ||      mpy   .m2   1,b10,b12
    ||      ldh   .d2   *b4++[2],a7
    ||      ldh   .d1   *a5++[2],b13
    ; end of LOOP

3 cycles per butterfly
32 cycles per GSM timeslot (8 butterflies)
x8 MPY instructions used to move data

[Figure: Utilization of execution units in Viterbi decoder. Per-cycle occupancy of units .D1, .D2, .M1, .M2, .L1, .L2, .S1 and .S2 over cycles 0 to 31.]



Lucent / Motorola Star*Core SC140


[Block diagram: Program / Data Memory; Program Sequencer; Instruction Dispatcher; Address Registers (27); Data Registers (16); 4 MAC units; 2 AAUs; 4 ALUs; 4 BFUs]

6-way VLIW with 128 bit (8x16) instruction fetch


Prefix instructions for high performance without sacrificing code density
Each execution set (parallel instructions + prefix) predicated
5 stage pipeline
1800 MIPS, 1200 MMACs @ 300 MHz
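
The headline rates follow from the issue width and the number of MAC units. As a quick check, assuming one instruction per issue slot and one multiply-accumulate per MAC unit on every cycle (peak figures, not sustained throughput):

```python
# Peak-rate arithmetic for the SC140 figures quoted on the slide.
CLOCK_MHZ = 300     # clock frequency stated on the slide
ISSUE_WIDTH = 6     # 6-way VLIW
MAC_UNITS = 4       # four MAC units in the block diagram

# Assumption: every issue slot and every MAC unit is busy on every cycle.
peak_mips = ISSUE_WIDTH * CLOCK_MHZ
peak_mmacs = MAC_UNITS * CLOCK_MHZ

print(peak_mips, "MIPS")
print(peak_mmacs, "MMACs")
```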
Viterbi on Star*Core
GSM (K=5, 16 states)

[ move.2l (r0)+,d0:d1 move.2l (r1)+,d1:d2
[ add2 d0,d4 sub2 d6,d2
]
sub2 d4,d0
[ max2vit d4,d2
add2 d2,d6
max2vit d0,d6
]
] x4
[ vsl.4w d2:d6:d1:d3,(r2)+n0
vsl.4f d2:d6:d1:d3,(r3)+n0 ]

Decision bits are manually stored using the Viterbi Shift Left (VSL) instruction.

Hardware support for Viterbi algorithm:
max2vit instruction
vsl instruction
1 cycle per butterfly through software-pipelining

[Diagram labels: max2vit d4,d2; max2vit d0,d6; SR; vsl.4w d2:d6:d1:d3,(r2)+n0; D1 decisions; D3 decisions; D2 path metrics; D6 path metrics; Results written to memory]

Courtesy: Gareth Hughes: Bell Labs Australia
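
For readers unfamiliar with the operation that add2/sub2 and max2vit accelerate, the following is a minimal sketch of a generic add-compare-select (ACS) butterfly. It is not the SC140 instruction semantics: the function names, the state numbering (predecessors 2j and 2j+1 feeding states j and j+N/2), the antipodal branch-metric convention, and the tie-breaking rule are all assumptions made for illustration.

```python
def acs_butterfly(pm0, pm1, bm):
    """One add-compare-select butterfly (larger metric wins).

    pm0, pm1: path metrics of the two predecessor states.
    bm: branch metric; one branch adds it, the other subtracts it.
    Returns (new_pm_a, new_pm_b, dec_a, dec_b); a decision bit is 1
    when the survivor comes from pm1 (ties keep pm0, by assumption).
    """
    cand_a0, cand_a1 = pm0 + bm, pm1 - bm
    cand_b0, cand_b1 = pm0 - bm, pm1 + bm
    dec_a = 1 if cand_a1 > cand_a0 else 0
    dec_b = 1 if cand_b1 > cand_b0 else 0
    return max(cand_a0, cand_a1), max(cand_b0, cand_b1), dec_a, dec_b


def acs_stage(path_metrics, branch_metrics):
    """Update every state for one trellis stage.

    Returns the new path metrics and the decision bits packed into an int
    (bit i holds the decision for state i).
    """
    n = len(path_metrics)
    half = n // 2
    new_pm = [0] * n
    decisions = 0
    for j in range(half):
        a, b, dec_a, dec_b = acs_butterfly(
            path_metrics[2 * j], path_metrics[2 * j + 1], branch_metrics[j])
        new_pm[j] = a
        new_pm[j + half] = b
        decisions |= (dec_a << j) | (dec_b << (j + half))
    return new_pm, decisions
```

For the 16-state GSM code on the slide, one trellis stage corresponds to eight such butterflies and sixteen decision bits.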



Log-MAP on Star*Core
Star*Core code for log-MAP Butterfly

Cycle 1    move.w (r0)+,d0    move.w (r1)+,d1
Cycle 2    add d0,d6,d0       sub d6,d0,d5
Cycle 3    sub d6,d1,d4       add d1,d6,d1
Cycle 4    sub d0,d4,d2       sub d1,d5,d3
Cycle 5    max d0,d4          max d1,d5
Cycle 6    abs d2             abs d3
Cycle 7    move.l d2,n0
Cycle 8    move.l d3,n0       move.w (r6+n0),d2
Cycle 9    add d4,d2,d4       move.w (r6+n0),d3
Cycle 10   add d5,d3,d5
Cycle 11   move.2w d4:d5,(r2)+

[Data-flow diagram labels: d0: a, d1: b, d6: x; d0: a+x, d1: b+x; d4: b-x, d5: a-x; d2: d0-d4, d3: d1-d5; max, max; d4: max(d0,d4), d5: max(d1,d5); n0: |d2|, n0: |d3|; r6, r6; d2:, d3:; d4: d4+d2, d5: d5+d3]

This code uses 2 of the 4 ALUs and can be software pipelined to achieve 6 cycles per LOG-MAP Butterfly

Courtesy: Gareth Hughes: Bell Labs Australia
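
The data flow above (two sums, two differences, two max operations, then a table lookup indexed by an absolute difference) matches the usual max* formulation, max*(p, q) = max(p, q) + log(1 + exp(-|p - q|)). A minimal floating-point sketch follows; the table step, table size, and function names are hypothetical, and the slide does not give the actual table contents or fixed-point format.

```python
import math

TABLE_STEP = 0.25   # hypothetical quantisation step for |p - q|
TABLE_SIZE = 32     # hypothetical number of table entries
# Correction term log(1 + exp(-d)) sampled at d = 0, 0.25, 0.5, ...
CORRECTION = [math.log1p(math.exp(-i * TABLE_STEP)) for i in range(TABLE_SIZE)]


def max_star(p, q):
    """max(p, q) plus a table-lookup correction indexed by |p - q|."""
    index = int(abs(p - q) / TABLE_STEP)
    # Beyond the end of the table the correction is treated as zero.
    correction = CORRECTION[index] if index < TABLE_SIZE else 0.0
    return max(p, q) + correction


def log_map_butterfly(a, b, x):
    """Combine state metrics a and b with branch metric x.

    Mirrors the register flow on the slide: the first output corresponds
    to d4 (from a+x and b-x), the second to d5 (from b+x and a-x).
    """
    out0 = max_star(a + x, b - x)
    out1 = max_star(b + x, a - x)
    return out0, out1
```

The lookup replaces an exponential and a logarithm with one indexed load.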
Parallel DSP Architectures

Arch.         Parallelism                       Compile?   Power?
Superscalar   Dynamic instruction level
VLIW          Static instruction level
SIMD          Highly regular, data dependent
MIMD          Task level

MIMD with VLIW / SIMD provides high order parallel execution

The future of high performance DSPs is MIMD



Daytona: A Multiprocessor DSP Architecture


[Block diagram: I/O Interfaces and External Memory outside the chip; on chip, a Buffered I/O Subsystem, Arbitration and Synchronization, a split transaction bus (128 bits), Programmable Processing Elements (PE), and a Hardware Accelerator]

Scalable Architecture - multiple programmable DSPs on a single chip


One bus supports different programmable DSPs and microcontrollers
Split Transaction Bus
Separate Address and Data busses - each with pipelined protocol
Multiple outstanding transactions - varying size/priority

Separate Bus Arbitration


[Diagram labels: Address Bus (100MHz) carrying ID and addr, with its own round-robin Arbiter; Data Bus (128 bits 100MHz) carrying ID and data, with its own round-robin Arbiter; Memory Controller; PE]
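
The two ideas on this slide, round-robin arbitration and transactions tagged with an ID so that address and data phases can be decoupled, can be sketched as follows. This is an illustrative model only: the class and function names are hypothetical, and the real bus protocol (sizes, priorities, pipelining) is not modelled.

```python
def round_robin(requests, last_granted):
    """Grant the first requester after last_granted.

    requests: list of booleans, one per bus master.
    Returns the index of the granted master, or None if nobody requests.
    """
    n = len(requests)
    for offset in range(1, n + 1):
        candidate = (last_granted + offset) % n
        if requests[candidate]:
            return candidate
    return None


class SplitTransactionBus:
    """Address and data phases are separate and matched by transaction ID."""

    def __init__(self):
        self.next_id = 0
        self.outstanding = {}   # transaction ID -> (requesting PE, address)

    def address_phase(self, pe, addr):
        """A PE wins the address bus and posts a request; the bus is then free."""
        tid = self.next_id
        self.next_id += 1
        self.outstanding[tid] = (pe, addr)
        return tid

    def data_phase(self, tid, data):
        """The responder returns data with the ID; replies may arrive out of order."""
        pe, addr = self.outstanding.pop(tid)
        return pe, addr, data
```

Because each reply carries its ID, a slow access does not block the address bus, and several transactions can be outstanding at once.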

Memory Hierarchy in MIMD DSPs


Multiple copies of 1 application (e.g. odd/even slot channel equalisation)
Multiple copies of same software - Shared memory multiprocessing
Flat Memory Architecture vs. Hierarchical Memory Architecture

[Diagram: flat architecture, each DSP with its own SRAM, 2 copies of software (inefficient); hierarchical architecture, each DSP with a cache in front of a shared DRAM, 1 copy of software]

Mix of different applications (e.g. equalisation, convolutional decoding)
Heterogeneous mix of applications
Shared Memory Multiprocessing

64 Semaphores provided for process synchronization

L-1 cache coherency using a snoopy protocol (modified MESI used)

Access to shared data uses coherent transaction.
Caches snoop the address and query their tag RAMs.
A cache hit prevents the memory controller from servicing the request.

[Diagram: Memory Controller; Coherent Transaction; four DSPs, one making an access to shared data while the other three snoop (miss), (hit), (miss); hit]
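
The snoop sequence described above can be sketched in a few lines. This is a simplified model under stated assumptions: the class and function names are hypothetical, MESI state transitions are omitted, and the hitting cache is assumed to supply the data in place of the memory controller.

```python
class Cache:
    """A cache reduced to a dictionary standing in for tag RAM plus data RAM."""

    def __init__(self):
        self.lines = {}   # address -> data

    def snoop(self, addr):
        """Query the tags for a snooped address; returns (hit, data)."""
        if addr in self.lines:
            return True, self.lines[addr]
        return False, None


def coherent_read(requester, caches, memory, addr):
    """Broadcast a coherent read on the bus.

    Every cache except the requester snoops the address. A snoop hit
    pre-empts the memory controller; otherwise memory services the read.
    Returns (data, source).
    """
    for index, cache in enumerate(caches):
        if index == requester:
            continue
        hit, data = cache.snoop(addr)
        if hit:
            return data, "cache %d" % index
    return memory[addr], "memory"
```

With four caches where only cache 2 holds the line, a read from DSP 0 is answered by cache 2, matching the miss / hit / miss pattern in the diagram.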

Daytona Multiprocessor DSP Chip


Bell Laboratories Research Chip for 3G Wireless Base-stations / Head-end xDSL

[Block diagram: four processing elements, each a 32-b RISC with a 64-b 4-MAC SIMD DSP and Cache Memory, on a 128-b Split Transaction Bus together with Host Interface, I/O & Memory Controller, Test & JTAG Port, Arbiter, and Semaphore]

Chip Characteristics
Core Area   120 mm²
Speed       100 MHz
Power       4 W
Tech        0.25 µm

Paper 4.2, ISSCC2000


Photomicrograph of Daytona Test Chip

[Photomicrograph labels: Arbiter; Semph; DLL; Vector Unit (RVU); HDS; LRU; BUS INT; SPARC; 8KB Re-configurable Memory; Split Transaction Bus; Processing Element (PE), labelled three times; I/O Subsystem]

Paper 4.2, ISSCC2000



Acknowledgements
The following people contributed to the work in this tutorial:

Low Power DSPs for Wireless


Wanda Gass: Texas Instruments
Mihran Touriguian: Atmel

High Performance DSPs for Wireless Infrastructure


Bryan Ackland: Bell Labs US - High Perf. DSP Architecture
Gareth Hughes: Bell Labs Australia - LU DSP16210, C6x and Star*Core benchmarks
Bing Xu: Bell Labs Australia - SOVA, MAP, LOG-MAP
Ran-Hong Yan: Bell Labs UK - 3G Wireless
Daytona Team: (J. Williams, K.J. Singh, J. Othmer, B. Ackland), Bell Labs US.

References

[1] P. Lapsley, J. Bier, A. Shoham, E. Lee, DSP Processor Fundamentals, IEEE Press, New York, 1997.
[2] D. Skillicorn, A Taxonomy for Computer Architectures, Computer Magazine, Nov. 1988.
[3] H. Kabuo, M. Okamoto, I. Tanaka, H. Yasoshima, S. Marui, M. Yamasaki, T. Sugimura, K. Ueda, T. Ishikawa, H. Suzuki, R. Asahi, An 80 MOPS-Peak High-Speed and Low-Power-Consumption 16-b Digital Signal Processor, IEEE Journal of Solid-State Circuits, Vol. 31, No. 4, pp. 494-503, April 1996.
[4] E. A. Lee, D. G. Messerschmitt, Digital Communication, Kluwer Academic Publishers, Boston, 1988.
[5] W. Lee et al., A 1V DSP for Wireless Communications, Proceedings IEEE International Solid-State Circuits Conference, pp. 92-93, February 1997.
[6] S. Lin and J. Costello Jr., Error Control Coding: Fundamentals and Applications, Prentice Hall, New Jersey, 1983.
[7] Lucent 16000, http://www.lucent.com/micro/ or http://www.lucent.dk/micro/dsp16000/
[8] Thomas Parsons, Voice and Speech Processing, McGraw-Hill Book Company, New York, 1987.
[9] TMS320C54x User's Guide, available from the Texas Instruments Literature Response Center.
[10] I. Verbauwhede, M. Touriguian, A Low Power DSP Engine for Wireless Communications, Journal of VLSI Signal Processing 18, pp. 177-186, 1998, Kluwer Academic Publishers.
[11] I. Verbauwhede, M. Touriguian, Wireless Digital Signal Processors, chapter in Digital Signal Processing for Multimedia Systems, edited by K.K. Parhi, T. Nishitani, Marcel Dekker, New York, 1999.
[12] M. Okamoto, K. Stone, T. Sawai, H. Kabuo, S. Marui, M. Yamasaki, Y. Uto, Y. Sugisawa, Y. Sasagawa, T. Ishikawa, H. Suzuki, N. Minamida, R. Yamanaka, K. Ueda, A High Performance DSP Architecture for Next Generation Mobile Phone Systems, 1998 IEEE DSP Workshop.
[13] Lode specifications, available from www.atmel.com
[14] M.W. Oliphant, The Mobile Phone Meets the Internet, IEEE Spectrum, pp. 20-28, Aug. 1999.
[15] L. C. Godara, Application of Antenna Arrays to Mobile Communications: Part 1, Proc. IEEE, Vol. 85, No. 7, pp. 1031-1060, July 1997.

[16] G. D. Forney, Jr., Maximum Likelihood Sequence Estimation of Digital Sequences in the Presence of Intersymbol Interference, IEEE Trans. Inform. Theory, Vol. IT-18, pp. 363-378, May 1972.
[17] C. Berrou, A. Glavieux, P. Thitimajshima, Near Shannon Limit Error-Correcting Coding and Decoding: Turbo-Codes (1), Proc. ICC'93, May 1993.
[18] J. Hagenauer, P. Hoeher, A Viterbi Algorithm with Soft-Decision Outputs and its Applications, Proc. Globecom '89, pp. 47.1.1-47.1.7, Nov. 1989.
[19] L. Bahl, J. Cocke, F. Jelinek, J. Raviv, Optimal Decoding of Linear Codes for Minimizing Symbol Error Rate, IEEE Trans. Inform. Theory, Vol. IT-20, pp. 284-287, Mar. 1974.
[20] J. Turley, H. Hakkarainen, TI's New 'C6x DSP Screams at 1600 MIPS, Microprocessor Report, Vol. 11, No. 2, pp. 14, Feb. 1997.
[21] Starcore Launched First Architecture, Microprocessor Report, Vol. 12, No. 14, pp. 22, Oct. 1998.
[22] B. Ackland, P. D'Arcy, A New Generation of DSP Architectures, Proc. IEEE CICC'99, Paper 25.1.1.
[23] J. Williams, K.J. Singh, C.J. Nicol, B. Ackland, A 3.2 GOPs Multiprocessor DSP for Communication Applications, Proc. IEEE ISSCC 2000, Paper 4.2.
