DSP TMS Processors PART1

Lecture 10a:
Digital Signal Processors:

A TI Architectural History
Collated by: Professor Kurt Keutzer
Computer Science 252, Spring 2000
With contributions from:
Dr. Brock Barton, Clark Hise TI;
Dr. Surendar S. Magar, Berkeley
Concept Research Corporation
1
DSP ARCHITECTURE EVOLUTION
[Figure: timeline from 1980 to 1995 of DSP approaches and application examples. Approaches: DSP building blocks and bit slice processors (multipliers, MUL); µC and analog µP; DSP µP and RISC; function/application specific chips; multi-processing (MP). Application examples, from low-end to high-end: industrial control and modems; instruments, voice coding, control, and modems; digital radios; radars, W-CDMA, and video/imaging.]

2
DSP ARCHITECTURE
Enabling Technologies
Time Frame     Approach                        Primary Application                     Enabling Technologies
Early 1970's   Discrete logic                  Non-real-time processing, simulation    Bipolar SSI, MSI; FFT algorithm
Late 1970's    Building block                  Military radars, digital comm.          Single chip bipolar multiplier; flash A/D
Early 1980's   Single chip DSP µP              Telecom, control                        µP architectures; NMOS/CMOS
Late 1980's    Function/application            Computers, communication                Vector processing; parallel processing
               specific chips
Early 1990's   Multiprocessing                 Video/image processing                  Advanced multiprocessing; VLIW, MIMD, etc.
Late 1990's    Single-chip multiprocessing     Wireless telephony, Internet related    Low power single-chip DSP; multiprocessing
3
Texas Instruments TMS320 Family
Multiple DSP µP Generations
Device        First    Bit size         Clock   Instruction   MAC execution   MOPS                Device density        Notes
              sample                    (MHz)   throughput    (ns)                                (# of transistors)
Uniprocessor based (Harvard architecture)
TMS32010      1982     16 integer       20      5 MIPS        400             5                   58,000 (3)
TMS320C25     1985     16 integer       40      10 MIPS       100             20                  160,000 (2)
TMS320C30     1988     32 flt. pt.      33      17 MIPS       60              33                  695,000 (1)
TMS320C50     1991     16 integer       57      29 MIPS       35              60                  1,000,000 (0.5)
TMS320C2XXX   1995     16 integer       -       40 MIPS       25              80                  -
Multiprocessor based
TMS320C80     1996     32 integer/flt.  -       -             -               2 GOPS, 120 MFLOP   -                     MIMD
TMS320C62XX   1997     16 integer       -       1600 MIPS     5               20 GOPS             -                     VLIW
TMS320C67XX   1997     32 flt. pt.      -       -             5               1 GFLOP             -                     VLIW
4
First Generation DSP µP Case Study
TMS32010 (Texas Instruments) - 1982
Features
 200 ns instruction cycle (5 MIPS)
 144 words (16 bit) on-chip data RAM
 1.5K words (16 bit) on-chip program ROM - TMS32010
 External program memory expansion to a total of 4K words at full
speed
 16-bit instruction/data word
 single cycle 32-bit ALU/accumulator
 Single cycle 16 x 16-bit multiply in 200 ns
 Two cycle MAC (5 MOPS)
 Zero to 15-bit barrel shifter
 Eight input and eight output channels
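The throughput figures in this feature list follow from the 200 ns cycle time. A minimal sketch of that arithmetic, assuming (as the MOPS figure suggests) that one multiply-accumulate is counted as two operations; the function names are illustrative only:

```python
# Figures quoted on the slide.
CYCLE_NS = 200      # instruction cycle time in ns
MAC_CYCLES = 2      # one cycle to multiply, one to accumulate

def mips(cycle_ns):
    """Millions of instructions per second for a given cycle time in ns."""
    return 1e3 / cycle_ns

def mac_mops(cycle_ns, cycles_per_mac, ops_per_mac=2):
    """MOPS when each MAC counts as ops_per_mac operations.

    ops_per_mac=2 (one multiply plus one add) is an assumption.
    """
    return ops_per_mac * 1e3 / (cycle_ns * cycles_per_mac)

tms32010_mips = mips(CYCLE_NS)                   # instruction throughput
tms32010_mac_ns = CYCLE_NS * MAC_CYCLES          # time per MAC
tms32010_mops = mac_mops(CYCLE_NS, MAC_CYCLES)   # MAC throughput
```

With these inputs the sketch gives 5 MIPS, a 400 ns MAC, and 5 MOPS, matching the slide.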
5
TMS32010 BLOCK DIAGRAM
6
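TMS32010 Program Memory Maps
[Figure: program memory maps, as 16-bit words, for Microcomputer Mode and Microprocessor Mode. In both modes address 0 holds Reset 1st Word, address 1 holds Reset 2nd Word, address 2 holds Interrupt, and the map ends at 4095. Microcomputer Mode: Internal Memory Space, then Internal Memory Space Reserved For Testing (boundary labels 1525 and 1536), then External Memory Space from 1536 to 4095. Microprocessor Mode: External Memory Space up to 4095.]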
7
Digital FIR Filter Implementation
(Uniprocessor-Circular Buffer)
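[Figure: circular-buffer FIR filter. The coefficients a(n-1), a(n-2), ..., a1, a0 are multiplied (X) with the samples X0, X1, ..., X(n-1) held in the circular buffer, and the products are summed (+) into the accumulator (Acc). The Start and End positions move by one sample between the 1st and 2nd cycle. Annotations: "Start each time here" and "Replace starting value with new value".]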
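The circular-buffer scheme in this figure can be sketched in a few lines. This is an illustrative model, not the TMS32010 program: the class name and the choice to track the index of the newest sample are assumptions.

```python
class CircularFir:
    """FIR filter whose sample history lives in a fixed-size circular buffer."""

    def __init__(self, coeffs):
        self.coeffs = list(coeffs)              # a0 ... a(n-1)
        self.buf = [0.0] * len(self.coeffs)     # X0 ... X(n-1)
        self.newest = 0                         # index of the most recent sample

    def step(self, x):
        n = len(self.buf)
        # Advance the start position; that slot holds the oldest sample,
        # which is replaced with the new value (no data is shifted).
        self.newest = (self.newest + 1) % n
        self.buf[self.newest] = x
        acc = 0.0
        for k, a in enumerate(self.coeffs):
            acc += a * self.buf[(self.newest - k) % n]   # a_k * x[n-k]
        return acc


fir = CircularFir([1, 2, 3])
impulse_response = [fir.step(x) for x in [1, 0, 0, 0]]
```

Feeding an impulse returns the coefficients in order followed by zero, which is a quick check that the index arithmetic wraps correctly.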
8
TMS32010 FIR FILTER PROGRAM
Indirect Addressing (Smaller Program Space)
Y(n) = x[n-(N-1)] . h(N-1) + x[n-(N-2)] . h(N-2) +…+ x(n) . h(0)
For N=50, Indirect Addressing t = 42 µs (23.8 kHz)
For N=50, Direct Addressing t = 21.6 µs (40.2 kHz)
9
TMS320C203/LC203 BLOCK DIAGRAM
DSP Core Approach - 1995
10
Third Generation DSP µP Case Study
TMS320C30 - 1988
TMS320C30 Key Features
 60 ns single-cycle instruction execution time
 33.3 MFLOPS (million floating-point operations per second)
 16.7 MIPS (million instructions per second)
 One 4K x 32-bit single-cycle dual-access on-chip ROM block
 Two 1K x 32-bit single-cycle dual-access on-chip RAM blocks
 64 x 32-bit instruction cache
 32-bit instruction and data words, 24-bit addresses
 40/32-bit floating-point/integer multiplier and ALU
 32-bit barrel shifter
11
Third Generation DSP µP Case Study
TMS320C30 - 1988
TMS320C30 Key Features (cont.)

 Eight extended precision registers (accumulators)
 Two address generators with eight auxiliary registers and two auxiliary
register arithmetic units
 On-chip direct memory Access (DMA) controller for concurrent I/O and
CPU operation
 Parallel ALU and multiplier instructions
 Block repeat capability
 Interlocked instructions for multiprocessing support
 Two serial ports to support 8/16/32-bit transfers
 Two 32-bit timers
 1  CDMOS Process
12
TMS320C30 BLOCK DIAGRAM
13
TMS320C3x CPU BLOCK DIAGRAM
14
TMS320C3x MEMORY BLOCK DIAGRAM
15
TMS320C30 Memory Organization
Microprocessor Mode
0h - 0BFh            Interrupt locations & reserved (192), external STRB active
0C0h - 7FFFFFh       External, STRB active
Microcomputer Mode
0h - 0BFh            Interrupt locations & reserved (192)
0C0h - 0FFFh         ROM (internal)
1000h - 7FFFFFh      External, STRB active
Both modes
800000h - 801FFFh    Expansion bus, MSTRB active (8K)
802000h - 803FFFh    Reserved (8K)
804000h - 805FFFh    Expansion bus, IOSTRB active (8K)
806000h - 807FFFh    Reserved (8K)
808000h - 8097FFh    Peripheral bus memory-mapped registers (internal) (6K)
809800h - 809BFFh    RAM block 0 (1K) (internal)
809C00h - 809FFFh    RAM block 1 (1K) (internal)
80A000h - 0FFFFFFh   External, STRB active
16
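The memory map above amounts to a lookup from address to region. The sketch below encodes it as a table, reading the two five-digit labels on the slide as 808000h and 80A000h (which the stated 6K and 1K region sizes imply). The names `COMMON`, `LOW`, and `region` are illustrative.

```python
# Regions shared by both modes: (start, end, description).
COMMON = [
    (0x800000, 0x801FFF, "expansion bus, MSTRB active"),
    (0x802000, 0x803FFF, "reserved"),
    (0x804000, 0x805FFF, "expansion bus, IOSTRB active"),
    (0x806000, 0x807FFF, "reserved"),
    (0x808000, 0x8097FF, "peripheral bus memory-mapped registers"),
    (0x809800, 0x809BFF, "RAM block 0"),
    (0x809C00, 0x809FFF, "RAM block 1"),
    (0x80A000, 0xFFFFFF, "external, STRB active"),
]

# The low part of the map is the only place where the two modes differ.
LOW = {
    "microprocessor": [
        (0x000000, 0x0000BF, "interrupt locations and reserved"),
        (0x0000C0, 0x7FFFFF, "external, STRB active"),
    ],
    "microcomputer": [
        (0x000000, 0x0000BF, "interrupt locations and reserved"),
        (0x0000C0, 0x000FFF, "internal ROM"),
        (0x001000, 0x7FFFFF, "external, STRB active"),
    ],
}

def region(addr, mode="microcomputer"):
    """Return the description of the region containing a 24-bit address."""
    for start, end, name in LOW[mode] + COMMON:
        if start <= addr <= end:
            return name
    raise ValueError(f"address {addr:#x} is outside the 24-bit map")
```

For example, `region(0xC0, "microcomputer")` falls in internal ROM, while the same address in microprocessor mode is external.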
TMS320C30 FIR FILTER PROGRAM
Y(n) = x[n-(N-1)] . h(N-1) + x[n-(N-2)] . h(N-2) +…+ x(n) . h(0)
For N=50, t = 3.6 µs (277 kHz)
17
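The quoted filter times convert directly into maximum sample rates and cycle counts. A small sketch of that conversion, using the 60 ns cycle of the TMS320C30 and, for comparison, the 42 µs indirect-addressing time and 200 ns cycle quoted earlier for the TMS32010; the helper names are illustrative.

```python
def sample_rate_khz(t_us):
    """Maximum sample rate in kHz when one output takes t_us microseconds."""
    return 1e3 / t_us

def cycles_per_output(t_us, cycle_ns):
    """Instruction cycles spent on one output sample."""
    return t_us * 1e3 / cycle_ns

c30_rate = sample_rate_khz(3.6)                # about 277.8 kHz
c30_cycles = cycles_per_output(3.6, 60)        # about 60 cycles for 50 taps
c10_rate = sample_rate_khz(42)                 # about 23.8 kHz
c10_cycles = cycles_per_output(42, 200)        # about 210 cycles for 50 taps
```

On these figures the TMS320C30 spends a little over one cycle per tap, against roughly four cycles per tap for the TMS32010 with indirect addressing.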

‘C54x Architecture
18
TMS320C54x Internal Block Diagram
19
Architecture optimized for DSP
#1: CPU designed for efficient DSP processing
 MAC unit, 2 Accumulators, Additional Adder,
Barrel Shifter
#2: Multiple busses for efficient data
and program flow
 Four busses and large on-chip memory that
result in sustained performance near peak
#3: Highly tuned instruction set for
powerful DSP computing
 Sophisticated instructions that execute in fewer
cycles, with less code and low power demands
20
Key #1: DSP engine
$$Y = \sum_{n=1}^{40} a_n x_n$$
[Figure: x and a feed a multiplier (MPY); the product feeds an adder (ADD) that produces y.]
21
Key #1: MAC Unit
MAC *AR2+, *AR3+, A
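[Figure: MAC unit. Multiplier (MPY) inputs are labeled Data, Acc A, Temp, Coeff, Prgm, with signed/unsigned (S/U) control on each operand and a fractional mode bit on the product. Adder (ADD) inputs are labeled A, B, 0. Results go to acc A and acc B.]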
22
Key #1: Accumulators + Adder
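General-Purpose Math example: t = s+e-r
LD @s, A
ADD @e, A
SUB @r, A
STL A, @t
[Figure: ALU with A Bus and B Bus inputs selected from A, B, C, T, D, and the shifter; a MUX selects among the ALU (U Bus), A, B, and the MAC; results go to acc A and acc B.]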
23
Key #1: Barrel shifter
LD @X, 16, A
STH @B, Y
[Figure: barrel shifter (-16 to +31) with inputs A, B, C, D and outputs to the S Bus, the ALU, and the E Bus.]
24
Key #1: Temporary register
For example: A = xa
LD @x, T
MPY @a, A
[Figure: temporary register (T) loaded from the D and X busses or from the EXP encoder (inputs A, B); the T Bus feeds the MAC and the ALU.]
25
Key #2: Efficient data/program flow
 MAC unit, 2 Accumulators, Additional Adder,
Barrel Shifter
#2: Multiple busses for efficient data
and program flow
26
Key #2: Multiple busses
MAC *AR2+, *AR3+, A
[Figure: internal memory and external memory connected through MUXes to the P, C, D, and E busses, which feed the central arithmetic logic unit: T, MAC, A, B, ALU, and shifter.]
27
Key #2: Pipeline
Prefetch Fetch Decode Access Read Execute
P F D A R E
 Prefetch: Calculate address of instruction

 Fetch: Collect instruction
 Decode: Interpret instruction
 Access: Collect address of operand
 Read: Collect operand
 Execute: Perform operation
28
Key #2: Bus usage
[Figure: bus usage. CNTL, PC, and ARs drive internal memory and external memory, which connect through MUXes to the P, C, D, and E busses and on to the central arithmetic logic unit: T, MAC, A, B, ALU, and shifter.]
29
Key #2: Pipeline performance
CYCLES
P1 F1 D1 A1 R1 X1
   P2 F2 D2 A2 R2 X2
      P3 F3 D3 A3 R3 X3
         P4 F4 D4 A4 R4 X4
            P5 F5 D5 A5 R5 X5
               P6 F6 D6 A6 R6 X6
Fully loaded pipeline (from the sixth cycle on)
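The staggered diagram can be generated programmatically, which makes the fill behaviour easy to check. This sketch assumes an ideal six-stage pipeline with no stalls; the function names are illustrative.

```python
STAGES = ["P", "F", "D", "A", "R", "X"]   # prefetch ... execute

def pipeline_table(n_instr):
    """Row i lists what instruction i+1 does in each cycle ('' when idle)."""
    total = n_instr + len(STAGES) - 1
    rows = []
    for i in range(n_instr):
        row = [""] * total
        for s, stage in enumerate(STAGES):
            row[i + s] = f"{stage}{i + 1}"   # instruction i+1 enters one cycle later
        rows.append(row)
    return rows

def total_cycles(n_instr):
    """Cycles needed to finish n_instr instructions when nothing stalls."""
    return n_instr + len(STAGES) - 1

table = pipeline_table(6)
sixth_cycle = [row[5] for row in table]   # every stage is busy in this cycle
```

In the sixth cycle all six stages hold a different instruction, so from that point one instruction completes per cycle.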
30
Key #3: Powerful instructions
 MAC Unit, 2 Accumulators, Additional Adder,
Barrel Shifter
#2: Multiple busses for efficient data and
program flow
31
Key #3: Advanced applications
Application               Instructions
Symmetric FIR filter      FIRS
Adaptive filtering        LMS
Polynomial evaluation     POLY
Code book search          STRCD, SACCD, SRCCD
Viterbi                   DADST, DSADT, CMPS
32
C62x Architecture
33
TMS320C6201 Revision 2
[Figure: TMS320C6201 block diagram. Program cache / program memory: 32-bit address, 256-bit data, 512K bits RAM. C6201 CPU megamodule: program fetch, instruction dispatch, instruction decode; data path 1 (A register file; units L1, S1, M1, D1) and data path 2 (B register file; units D2, M2, S2, L2); control registers, control logic, test, emulation, interrupts. Data memory: 32-bit address, 8-, 16-, 32-bit data, 512K bits RAM. Peripherals: power down, host port interface, 4-channel DMA, external memory interface, 2 timers, 2 multichannel buffered serial ports (T1/E1).]
34
C6201 Internal Memory
Architecture
 Separate Internal Program and Data Spaces
 Program
 16K 32-bit instructions (2K Fetch Packets)
 256-bit Fetch Width
 Configurable as either
 Direct Mapped Cache, Memory Mapped Program Memory
 Data
 32K x 16
 Single Ported Accessible by Both CPU Data Buses
 4 x 8K 16-bit Banks
 2 Possible Simultaneous Memory Accesses (4 Banks)
 4-Way Interleave, Banks and Interleave Minimize Access
Conflicts
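The sizes on this slide are mutually consistent, and the banked data memory can be modeled to see when two accesses collide. In the sketch below the capacity arithmetic uses only the slide's figures; the bank-selection rule (consecutive 16-bit halfwords rotate through the four banks) is an assumption made for illustration, and only 16-bit accesses are modeled.

```python
# Program memory: 16K instructions of 32 bits, fetched 256 bits at a time.
PROGRAM_INSTRS = 16 * 1024
INSTR_BITS = 32
FETCH_WIDTH_BITS = 256

program_bits = PROGRAM_INSTRS * INSTR_BITS                 # total program RAM
instrs_per_fetch = FETCH_WIDTH_BITS // INSTR_BITS          # instructions per fetch packet
fetch_packets = PROGRAM_INSTRS // instrs_per_fetch         # fetch packets that fit

# Data memory: 32K x 16, split into 4 banks of 8K 16-bit halfwords.
DATA_HALFWORDS = 32 * 1024
BANKS = 4
data_bits = DATA_HALFWORDS * 16

def bank(byte_addr):
    """Bank holding a 16-bit access (assumed 4-way halfword interleave)."""
    return (byte_addr // 2) % BANKS

def conflict(addr_a, addr_b):
    """True when two simultaneous 16-bit accesses hit the same bank."""
    return bank(addr_a) == bank(addr_b)
```

Under this assumed mapping, two pointers walking adjacent halfwords never collide, while two pointers whose byte addresses differ by a multiple of 8 always do.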
35
C62x Interrupts
 12 Maskable Interrupts , Non-Maskable Interrupt (NMI)
 Interrupt Return Pointers (IRP, NRP)
 Fast Interrupt Handling
 Branches Directly to 8-Instruction Service Fetch Packet
 Can Branch out with no overhead for longer service
 7 Cycle Overhead : Time When No Code is Running
 12 Cycle Latency : Interrupt Response Time
 Interrupt Acknowledge (IACK) and Number (INUM)
Signals
 Branch Delay Slots Protected From Interrupts
 Edge Triggered
36
C62x Datapaths
[Figure: C62x datapaths. Register file A (A0 - A15) serves units L1, S1, M1, D1; register file B (B0 - B15) serves units D2, M2, S2, L2. Each unit has source (S1, S2) and destination (D) ports; the L and S units also have long source (SL) and long destination (DL) ports. Cross paths 1X and 2X link the two sides. Data busses: DDATA_I1 and DDATA_I2 (load data), DDATA_O1 and DDATA_O2 (store data), DADR1 and DADR2 (address). Legend: cross paths; 40-bit write paths (8 MSBs); 40-bit read paths/store paths.]
37
Functional Units
 L-Unit (L1, L2)
 40-bit Integer ALU, Comparisons
 Bit Counting, Normalization
 S-Unit (S1, S2)
 32-bit ALU, 40-bit Shifter
 Bitfield Operations, Branching
 M-Unit (M1, M2)
 16 x 16 -> 32
 D-Unit (D1, D2)
 32-bit Add/Subtract
 Address Calculations
38
C62x Datapaths

[Figure: C62x datapaths, repeated. Units L1, S1, M1, D1 and D2, M2, S2, L2 with cross paths 1X and 2X; DDATA_O1 and DDATA_O2 (store data), DDATA_I1 and DDATA_I2 (load data), DADR1 and DADR2 (address). Legend: cross paths; 40-bit write paths (8 MSBs); 40-bit read paths/store paths.]
39
C62x Instruction Packing
Instruction Packing Advanced VLIW
 Fetch Packet
 CPU fetches 8 instructions/cycle
 Execute Packet
 CPU executes 1 to 8 instructions/cycle
 Fetch packets can contain multiple execute packets
 Parallelism determined at compile / assembly time
 Examples
 1) 8 parallel instructions
 2) 8 serial instructions
 3) Mixed Serial/Parallel Groups
 A // B
 C
 D
 E // F // G // H
 Reduces Codesize, Number of Program Fetches, Power Consumption
[Figure: Example 1 shows A B C D E F G H in one row (one execute packet). Example 2 shows A through H in one column (eight execute packets). Example 3 shows the rows A B, C, D, and E F G H.]
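Splitting a fetch packet into execute packets can be sketched as a single pass over the eight instructions. The per-instruction flag used here ("executes in parallel with the next instruction") is a hypothetical encoding chosen for illustration; the slide only states that the grouping is fixed at compile or assembly time.

```python
def execute_packets(fetch_packet):
    """Group a fetch packet into execute packets.

    fetch_packet is a list of (name, parallel_with_next) pairs; the flag
    is a hypothetical stand-in for whatever the assembler encodes.
    """
    packets, current = [], []
    for name, parallel_with_next in fetch_packet:
        current.append(name)
        if not parallel_with_next:      # this instruction closes the group
            packets.append(current)
            current = []
    if current:                         # tolerate a trailing open group
        packets.append(current)
    return packets

# Example 3 from the slide: A // B, C, D, E // F // G // H
mixed = [("A", True), ("B", False), ("C", False), ("D", False),
         ("E", True), ("F", True), ("G", True), ("H", False)]
mixed_packets = execute_packets(mixed)

fully_parallel = execute_packets([(n, i < 7) for i, n in enumerate("ABCDEFGH")])
fully_serial = execute_packets([(n, False) for n in "ABCDEFGH"])
```

The three calls reproduce the slide's three examples: one packet of eight, eight packets of one, and the mixed grouping.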
40
C62x Pipeline Operation
Pipeline Phases
Fetch Decode Execute
PG PS PW PR DP DC E1 E2 E3 E4 E5
Single-Cycle Throughput
Operate in Lock Step
 Fetch
 PG Program Address Generate
 PS Program Address Send
 PW Program Access Ready Wait
 PR Program Fetch Packet Receive
 Decode
 DP Instruction Dispatch
 DC Instruction Decode
 Execute
 E1 - E5 Execute 1 through Execute 5
Execute Packet 1 PG PS PW PR DP DC E1 E2 E3 E4 E5
41
Delay Slots
 Delay Slots: number of extra cycles until result is:
 written to register file
 available for use by subsequent instructions
 Multi-cycle NOP instruction can fill delay slots while
minimizing codesize impact
Instruction type    Execute phases            Delay slots
Most Instructions   E1                        No Delay
Integer Multiply    E1 E2                     1 Delay Slot
Loads               E1 E2 E3 E4 E5            4 Delay Slots
Branches            E1
  Branch Target     PG PS PW PR DP DC E1      5 Delay Slots
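The delay-slot counts above determine how long a dependent instruction must wait and how many NOP cycles are needed when no independent work is available. A minimal sketch using the C62x figures from this slide; the dictionary keys and function names are illustrative.

```python
# Delay slots per instruction class, as listed on the slide.
DELAY_SLOTS = {"most": 0, "multiply": 1, "load": 4, "branch": 5}

def first_use_cycle(issue_cycle, kind):
    """Earliest cycle in which a dependent instruction can execute."""
    return issue_cycle + DELAY_SLOTS[kind] + 1

def nop_cycles_needed(kind, independent_cycles=0):
    """NOP cycles left after filling delay slots with independent work."""
    return max(0, DELAY_SLOTS[kind] - independent_cycles)

load_use = first_use_cycle(0, "load")            # load issued in cycle 0
padding = nop_cycles_needed("load", 1)           # one useful cycle scheduled
```

A load issued in cycle 0 can feed an instruction in cycle 5 at the earliest; if only one cycle of independent work is available, three NOP cycles remain, which a single multi-cycle NOP can cover.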

42
C6000 Pipeline Operation
Benefits
 Cycle Time
 Allows 6 ns cycle time on 67x
 Allows 5 ns cycle time & single cycle execution on C62x
 Parallelism
 8 new instructions can always be dispatched every cycle
 High Performance Internal Memory Access
 Pipelined Program and Data Accesses
 Two 32-bit Data Accesses/Cycle (C62x)
 Two 64-bit Data Accesses/Cycle (C67x)
 256-bit Program Access/Cycle
 Good Compiler Target
 Visible: No Variable-Length Pipeline Flow
 Deterministic: Order and Time of Execution
 Orthogonal: Independent Instructions
43
C6000 Instruction Set Features
Conditional Instructions
 All Instructions can be Conditional

 A1, A2, B0, B1, B2 can be used as
Conditions
 Based on Zero or Non-Zero Value
 Compare Instructions can allow other
Conditions (<, >, etc)
 Reduces Branching
 Increases Parallelism
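Conditional execution replaces a branch with an instruction that is simply skipped when its condition register fails the zero or non-zero test. The sketch below models this with a tiny interpreter; the program representation (tuples of predicate, destination, and a Python function) is purely illustrative and not C6000 syntax.

```python
def run(program, regs):
    """Execute predicated instructions on a copy of the register dict.

    Each instruction is (pred, dest, fn). pred is None (always execute)
    or (reg, want_zero): execute only if (regs[reg] == 0) == want_zero.
    """
    regs = dict(regs)
    for pred, dest, fn in program:
        if pred is not None:
            reg, want_zero = pred
            if (regs[reg] == 0) != want_zero:
                continue                 # condition false: instruction is skipped
        regs[dest] = fn(regs)
    return regs

# Absolute value without a branch: a compare sets B0, then a conditional negate.
abs_program = [
    (None, "B0", lambda r: int(r["A4"] < 0)),   # B0 = (A4 < 0)
    (("B0", False), "A4", lambda r: -r["A4"]),  # [B0] A4 = -A4
]

negative_case = run(abs_program, {"A4": -7, "B0": 0})["A4"]
positive_case = run(abs_program, {"A4": 5, "B0": 0})["A4"]
```

The compare writes a 0/1 value into a condition register, which is how conditions other than zero or non-zero are expressed.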
44
C6000 Instruction Set Addressing
Features
 Load-Store Architecture
 Two Addressing Units (D1, D2)
 Orthogonal
 Any Register can be used for Addressing or
Indexing
 Signed/Unsigned Byte, Half-Word, Word,
Double-Word Addressable
 Indexes are Scaled by Type
 Register or 5-Bit Unsigned Constant
Index
45
Features
 Indirect Addressing Modes
 Pre-Increment *++R[index]
 Post-Increment *R++[index]
 Pre-Decrement *--R[index]
 Post-Decrement *R--[index]
 Positive Offset *+R[index]
 Negative Offset *-R[index]
 15-bit Positive/Negative Constant Offset
from Either B14 or B15
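The six indirect modes differ only in whether the scaled index is applied before or after the access and whether the base register is updated. A sketch of that behaviour, assuming byte addresses and the usual element sizes (1, 2, 4, 8 bytes) for the "scaled by type" rule; the function and table names are illustrative.

```python
SCALE = {"byte": 1, "halfword": 2, "word": 4, "doubleword": 8}

def access(mode, base, index, kind):
    """Return (effective_address, new_base) for one indirect access."""
    off = index * SCALE[kind]            # index is scaled by the data type
    if mode == "*++R":                   # pre-increment
        return base + off, base + off
    if mode == "*R++":                   # post-increment
        return base, base + off
    if mode == "*--R":                   # pre-decrement
        return base - off, base - off
    if mode == "*R--":                   # post-decrement
        return base, base - off
    if mode == "*+R":                    # positive offset, base unchanged
        return base + off, base
    if mode == "*-R":                    # negative offset, base unchanged
        return base - off, base
    raise ValueError(f"unknown mode {mode}")

post_inc = access("*R++", 0x100, 1, "word")
pre_dec = access("*--R", 0x100, 2, "halfword")
offset = access("*+R", 0x100, 3, "byte")
```

Because the index is scaled, the same index value steps by one element regardless of whether the data are bytes or words.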
46
Features
 Circular Addressing
 Fast and Low Cost: Power of 2 Sizes and
Alignment
 Up to 8 Different Pointers/Buffers, Up to 2
Different Buffer Sizes
 Dual Endian Support
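Restricting buffers to power-of-2 sizes and alignment is what makes circular addressing cheap: wrap-around reduces to masking the low address bits. A minimal sketch of that idea (the function name is illustrative, and this models the principle rather than the exact C6000 register fields):

```python
def circular_add(ptr, step, size):
    """Advance ptr by step inside a buffer of `size` bytes.

    size must be a power of two and the buffer must be aligned to size,
    so the buffer base is simply ptr with its low bits cleared.
    """
    assert size > 0 and (size & (size - 1)) == 0, "size must be a power of 2"
    mask = size - 1
    return (ptr & ~mask) | ((ptr + step) & mask)

wrapped_forward = circular_add(0x1006, 4, 8)     # runs past the end, wraps
wrapped_backward = circular_add(0x1000, -2, 8)   # runs before the start, wraps
```

No comparison against the buffer end is needed, which is why the slide describes the scheme as fast and low cost.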
47
C67x Architecture
48
TMS320C6701 DSP
Block Diagram
[Figure: TMS320C6701 block diagram. Program cache/program memory: 32-bit address, 256-bit data, 512K bits RAM. ’C67x floating-point CPU core: program fetch, instruction dispatch, instruction decode; data path 1 (A register file; units L1, S1, M1, D1) and data path 2 (B register file; units D2, M2, S2, L2); control registers, control logic, test, emulation, interrupts. Data memory: 32-bit address, 8-, 16-, 32-bit data, 512K bits RAM. Peripherals: power down, host port interface, 4-channel DMA, external memory interface, 2 timers, 2 multichannel buffered serial ports (T1/E1).]
49
TMS320C6701
Advanced VLIW CPU (VelociTI™)
 1 GFLOPS @ 167 MHz

 6-ns cycle time
 6 x 32-bit floating-point instructions/cycle
 Load store architecture
 3.3-V I/Os, 1.8-V internal
 Single- and double-precision IEEE floating-point
 Dual data paths
 6 floating-point units / 8 x 32-bit instructions
50
TMS320C6701
Memory /Peripherals
 Same as ’C6201
 External interface supports
 SDRAM, SRAM, SBSRAM
 4-channel bootloading DMA
 16-bit host port interface
 1Mbit on-chip SRAM
 2 multichannel buffered serial ports (T1/E1)
 Pin compatible with ’C6201
51
TMS320C67x CPU Core
[Figure: ’C67x floating-point CPU core. Program fetch, instruction dispatch, instruction decode; data path 1 (A register file; units L1, S1, M1, D1) and data path 2 (B register file; units D2, M2, S2, L2); control registers, control logic, test, emulation, interrupts. Floating-point capabilities are in the arithmetic logic unit (L), the auxiliary logic unit (S), and the multiplier unit (M).]
52
C67x Interrupts
 12 Maskable Interrupts
 Non-Maskable Interrupt (NMI)
 Interrupt Return Pointers (IRP, NRP)
 Fast Interrupt Handling
 Branches Directly to 8-Instruction Service Fetch Packet
 7 Cycle Overhead: Time When No Code is Running
 12 Cycle Latency : Interrupt Response Time
 Interrupt Acknowledge (IACK) and Number
(INUM) Signals
 Branch Delay Slots Protected From Interrupts
 Edge Triggered
53
C67x New Instructions
.L Unit                          .M Unit                         .S Unit
Floating Point Arithmetic Unit   Floating Point Multiply Unit    Floating Point Auxiliary Unit
ADDSP                            MPYSP                           ABSSP
ADDDP                            MPYDP                           ABSDP
SUBSP                            MPYI                            CMPGTSP
SUBDP                            MPYID                           CMPEQSP
INTSP                            MPY24                           CMPLTSP
INTDP                            MPY24H                          CMPGTDP
SPINT                                                            CMPEQDP
DPINT                                                            CMPLTDP
SPTRUNC                                                          RCPSP
DPTRUNC                                                          RCPDP
DPSP                                                             RSQRSP
                                                                 RSQRDP
                                                                 SPDP
54
C67x Datapaths
 2 Data Paths
 8 Functional Units
 Orthogonal/Independent
 2 Floating Point Multipliers
 2 Floating Point Arithmetic
 2 Floating Point Auxiliary
 Control
 Independent
 Up to 8 32-bit Instructions
 Registers
 2 Files
 32, 32-bit registers total
 Cross paths (1X, 2X)
 L-Unit (L1, L2)
 Floating-Point, 40-bit Integer ALU
 Bit Counting, Normalization
 S-Unit (S1, S2)
 Floating Point Auxiliary Unit
 32-bit ALU/40-bit shifter
 Bitfield Operations, Branching
 M-Unit (M1, M2)
 Multiplier: Integer & Floating-Point
 D-Unit (D1, D2)
 32-bit add/subtract
 Addr Calculations
[Figure: two register files with cross paths 1X and 2X feeding units L1 S1 M1 D1 and D2 M2 S2 L2.]
55
C67x Instruction Packing
Instruction Packing Enhanced VLIW
 Fetch Packet
 CPU fetches 8 instructions/cycle
 Execute Packet
 CPU executes 1 to 8 instructions/cycle
 Fetch packets can contain multiple execute packets
 Parallelism determined at compile/assembly time
 Examples
 1) 8 parallel instructions
 2) 8 serial instructions
 3) Mixed Serial/Parallel Groups
 A // B
 C
 D
 E // F // G // H
 Reduces
 Codesize
 Number of Program Fetches
 Power Consumption
[Figure: Example 1 shows A B C D E F G H in one row (one execute packet). Example 2 shows A through H in one column (eight execute packets). Example 3 shows the rows A B, C, D, and E F G H.]
56
Pipeline Phases
PG PS PW PR DP DC E1 E2 E3 E4 E5 E6 E7 E8 E9 E10
Operate in Lock Step
 Fetch
 PG Program Address Generate
 PS Program Address Send
 PW Program Access Ready Wait
 PR Program Fetch Packet Receive
 Decode
 DP Instruction Dispatch
 DC Instruction Decode
 Execute
 E1 - E5 Execute 1 through Execute 5
 E6 - E10 Double Precision Only
Execute Packet 1 PG PS PW PR DP DC E1 E2 E3 E4 E5 E6 E7 E8 E9 E10

57
Delay Slots
Delay Slots: number of extra cycles until result is:
 written to register file
 available for use by subsequent instructions
 Multi-cycle NOP instruction can fill delay slots while
minimizing codesize impact
Most Integer E1 No Delay
Single-Precision E1 E2 E3 E4 3 Delay Slots
Loads E1 E2 E3 E4 E5 4 Delay Slots
Branches E1
Branch Target PG PS PW PR DP DC E1 5 Delay Slots

58
’C67x and ’C62x Commonality
 Driving commonality between ’C67x & ’C62x shortens ’C67x design time.
 Maintaining symmetry between datapaths shortens the ’C67x design time.
[Figure: side-by-side CPU floorplans of the ’C62x and the ’C67x. Both contain program fetch & dispatch, decode, two register files, control registers, emulation, and paired units: M-Unit 1 and 2 (multiplier unit), D-Unit 1 and 2 (data load/store), S-Unit 1 and 2 (auxiliary logic unit), L-Unit 1 and 2 (arithmetic logic unit). In the ’C67x the M, S, and L units are marked "with Floating Point"; the D units, decode, register files, and program fetch & dispatch are common to both.]
59
TMS320C80 MIMD MULTIPROCESSOR
Texas Instruments - 1996
60
Copyright 1999
61
SIGNAL AND IMAGE PROCESSING ON THE TMS320C54x DSP
Prof. Brian L. Evans
in collaboration with Niranjan Damera-Venkata and Wade Schwartzkopf
Embedded Signal Processing Laboratory
The University of Texas at Austin
Austin, TX 78712-1084
http://signal.ece.utexas.edu/
[Figure labels: Accumulator architecture, Memory-register architecture, Load-store architecture]
Outline
 Introduction
 Instruction set architecture
 Vector dot product example
 Pipelining
 Algorithm acceleration
 C compiler
 Development tools and boards
 Conclusion
63
Introduction to TMS320C54x
 Lowest DSP in power consumption: 0.54 mW/MIP

 Acceleration for FIR and LMS filtering, code book
search, polynomial evaluation, Viterbi decoding
Roadmap
64
Instruction Set Architecture
65
Instruction Set Architecture
 Conventional 16-bit fixed-point DSP

 8 16-bit auxiliary/address registers (ar0-7)
 Two 40-bit accumulators (a and b)
 One 16 bit x 16 bit multiplier
 Accumulator architecture
 Four busses (may be active each cycle)
 Three read busses: program, data, coefficient
 One write bus: writeback
 Memory blocks
 ROM in 4k blocks
 Dual-access RAM in 2k blocks
 Single-access RAM in 8k blocks
 Two clock cycles per instruction cycle
66
C54x Addressing Modes
 Immediate
 Operand is part of the instruction
ADD #0FFh
 Absolute
 Address of operand is part of the instruction
LD *(LABEL), A
 Register
 Operand is specified in a register
READA DATA ; (data read from address in accumulator A)
67
C54x Addressing Modes
 Direct
 Address of operand is part of the instruction (added to implied memory page)
ADD 010h,A
 Indirect
 Address of operand is stored in a register
 Offset addressing
 Register offset (ar1+ar0)
 Autoincrement/decrement
 Bit reversed addressing
 Circular addressing
Examples, in slide order:
ADD *AR1
ADD *AR1(10)
ADD *AR1+0
ADD *AR1+
ADD *AR1+B
ADD *AR1+0B
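Bit-reversed addressing exists mainly to fetch FFT data in scrambled order without extra index arithmetic. The sketch below generates the access order by adding N/2 with the carry propagating from the most significant bit toward the least significant, which is the usual description of the technique; treating the index register as holding N/2 is an assumption about typical usage, not something the slide states.

```python
def bit_reversed_sequence(n):
    """Return indices 0..n-1 in bit-reversed order (n a power of two)."""
    assert n > 0 and (n & (n - 1)) == 0, "n must be a power of 2"
    order = []
    idx = 0
    for _ in range(n):
        order.append(idx)
        # Add n/2 with reverse carry: clear set bits from the top down,
        # then set the first clear bit encountered.
        bit = n >> 1
        while bit and (idx & bit):
            idx ^= bit
            bit >>= 1
        idx |= bit
    return order

order8 = bit_reversed_sequence(8)
```

For an 8-point transform the generated order is 0, 4, 2, 6, 1, 5, 3, 7, which is each index with its three bits reversed.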
68
Program Control
 Conditional execution
 XC n, cond [, cond [, cond ]] ; 23 possible conditions
 Executes next n (1 or 2) words if conditions (cond) are met
 Takes one cycle to execute
xc 1,ALEQ ; test for accumulator a <= 0
mac *ar1+,*ar2+,a ; perform MAC only if a <= 0
add #12,a,a ; always perform add
 Repeat single instruction or block

 Overhead: 1 cycle for RPT/RPTZ and 4 cycles for RPTB
 Hardware loop counters count down
rptz a,#39 ; zero accumulator a

; repeat next instruction 40 times
mac *ar2+,*ar3+,a ; a += a(n) * x(n)
69
Special Arithmetic Functions
 Scalar arithmetic
 ABS Absolute value
 SQUR Square
 POLY Polynomial evaluation
 Vector arithmetic acceleration
 Each instruction operates on one element at a time
 ABDST Absolute difference of vectors
 SQDST Squared distance between vectors
 SQURA Sum of squares of vector elements
 SQURS Difference of squares of vector elements
rptz a,#39 ; zero accumulator a, repeat next
; instruction over 40 elements
squra *ar2+,a ; a += x(n)^2
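Each of these vector instructions handles one element per execution, so a repeat loop around it computes a whole-vector quantity. The plain-Python equivalents below show what the repeated instruction accumulates; they model the arithmetic only (no fixed-point scaling or saturation), and the function names simply reuse the mnemonics.

```python
def squra(x):
    """Sum of squares: what `rptz` + `squra` accumulates over the vector."""
    acc = 0
    for v in x:
        acc += v * v
    return acc

def abdst(x, y):
    """Accumulated absolute difference between two vectors."""
    acc = 0
    for a, b in zip(x, y):
        acc += abs(a - b)
    return acc

def sqdst(x, y):
    """Accumulated squared distance between two vectors."""
    acc = 0
    for a, b in zip(x, y):
        acc += (a - b) ** 2
    return acc

energy = squra([1, -2, 3])
l1 = abdst([1, 5, 9], [4, 5, 6])
l2_squared = sqdst([1, 5, 9], [4, 5, 6])
```

The distance measures are the ones a code book search evaluates for every candidate vector, which is why they have dedicated instructions.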
70
C54X Instructions Set by Category
Arithmetic:            ADD, MAC, MAS, MPY, NEG, SUB, ZERO
Logical:               AND, BIT, BITF, CMPL, CMPM, OR, ROL, ROR, SFTA, SFTC, SFTL, XOR
Program Control:       B, BC, CALL, CC, IDLE, INTR, NOP, RC, RET, RPT, RPTB, RPTZ, TRAP, XC
Application Specific:  ABS, ABDST, DELAY, EXP, FIRS, LMS, MAX, MIN, NORM, POLY, RND, SAT, SQDST, SQUR, SQURA, SQURS
Data Management:       LD, MAR, MV(D,K,M,P), ST
Notes
CMPL complement
CMPM compare memory
MAR modify address reg.
MAS multiply and subtract
71
Example: Vector Dot Product
 A vector dot product is common in filtering

N 1
Y   a ( n) x ( n)
n 0
 Store a(n) and x(n) into an array of N elements

 C54x performance: N cycles
Coefficients a(n)
Data x(n)
72
Example: Vector Dot Product
 Prologue
 Initialize pointers: ar2 for a(n) and ar3 for x(n)
 Set accumulator (A) to zero
 Inner loop
 Multiply and accumulate a(n) and x(n)
 Epilogue
 Store the result into Y
Reg   Meaning
AR2   &a(n)
AR3   &x(n)
A     Y
; Initialize pointers ar2 and ar3 (not shown)
rptz a,#39 ; zero accumulator a
; repeat next instruction 40 times
mac *ar2+,*ar3+,a ; a += a(n) * x(n)
sth a,#Y ; store result in Y
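The same loop can be modeled in Python to show why the accumulator is wider than the 32-bit products. This sketch assumes 16-bit signed integer operands and a 40-bit two's-complement accumulator, as listed for the C54x earlier; `wrap40`, `dot40`, and `sth` are illustrative names, with `sth` mimicking a store of accumulator bits 31 to 16.

```python
ACC_BITS = 40

def wrap40(v):
    """Wrap an integer to a signed 40-bit two's-complement value."""
    v &= (1 << ACC_BITS) - 1
    return v - (1 << ACC_BITS) if v >> (ACC_BITS - 1) else v

def dot40(a, x):
    """Dot product accumulated the way rptz + mac would: acc += a(n) * x(n)."""
    acc = 0                               # rptz zeroes the accumulator
    for an, xn in zip(a, x):              # mac *ar2+,*ar3+,a
        acc = wrap40(acc + an * xn)
    return acc

def sth(acc):
    """High word of the accumulator (bits 31..16), as an unsigned 16-bit value."""
    return (acc >> 16) & 0xFFFF

small = dot40([1, 2, 3], [4, 5, 6])
worst_case = dot40([32767] * 40, [32767] * 40)   # 40 full-scale products
```

Forty full-scale products overflow 32 bits but still fit in 40, so the eight guard bits let the loop run without intermediate scaling.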
73
Pipelining
[Figure: timing of the Fetch, Decode, Read, and Execute stages for each style]
Sequential (Motorola 56000)
Pipelined (Most conventional DSP processors)
Superscalar (Pentium, MIPS)
Superpipelined (TMS320C6x)
Managing Pipelines
• compiler or programmer (TMS320C6x and C54x)
• pipeline interlocking in processor (TMS320C3x)
• hardware instruction scheduling

74
TMS320C54x Pipeline
 Six-stage pipeline
 Prefetch: load address of next instruction onto bus
 Fetch: get next instruction
 Decode: decode next instruction to determine type of memory
access for operands
 Access: read operands address
 Read: read data operand(s)
 Execute: write data to bus
 Instructions
 1-3 words long (most are one word long)
 1-6 cycles to execute (most take one cycle) not counting external
(off-chip) memory access penalty
75
TMS320C54x Pipeline
 Instructions affecting pipeline behavior

 Delayed branches (BD), calls (CALLD), and
returns (RETD)
 Conditional branches (BC), execution (XC), and
returns (RC)
 No hardware protection against pipeline hazards
 Compiler and assembler must prevent pipeline hazards
 Assembler/linker issues warnings about potential pipeline
hazards
76
Block FIR Filtering
 y[n] = h[0] x[n] + h[1] x[n-1] + ... + h[N-1] x[n-(N-1)]

 h stored as linear array of N elements (in prog. mem.)
 x stored as circular array of N elements (in data mem.)
; Addresses: a4 h, a5 N samples of x, a6 input buffer, a7 output buffer
; Modulo addressing prevents need to reinitialize regs each sample
; Moving filter coefficients from program to data memory is not shown
firtask: ld #firDP,dp ; initialize data page pointer
stm #frameSize-1,brc ; compute 256 outputs
rptbd firloop-1
stm #N,bk ; FIR circular buffer size
ld *ar6+,a ; load input value to accumulator a
stl a,*ar4+% ; replace oldest sample with newest
rptz a,#(N-1) ; zero accumulator a, do N taps
mac *ar4+0%,*ar5+0%,a ; one tap, accumulate in a
sth a,*ar7+ ; store y[n]
firloop: ret
77
Accelerating Symmetric FIR Filtering
 Coefficients in linear phase filters are either symmetric
or anti-symmetric
 Symmetric coefficients
y[n] = h0 x[n] + h1 x[n-1] + h1 x[n-2] + h0 x[n-3]
y[n] = h0 (x[n] + x[n-3]) + h1 (x[n-1] + x[n-2])
 Accelerated by FIRS (FIR Symmetric) instruction
[Figure: x in two circular buffers; h in program memory]
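The saving from symmetry can be checked numerically: adding the two samples that share a coefficient first halves the number of multiplies. A minimal sketch for an even-length symmetric filter; the function names are illustrative, and `x_hist` holds x[n], x[n-1], ..., x[n-(N-1)].

```python
def fir_direct(h, x_hist):
    """Direct form: one multiply per tap."""
    return sum(hk * xk for hk, xk in zip(h, x_hist))

def fir_symmetric(h_half, x_hist):
    """Folded form for symmetric h: one multiply per pair of taps."""
    n = len(x_hist)
    assert n == 2 * len(h_half), "expects an even-length symmetric filter"
    return sum(h_half[k] * (x_hist[k] + x_hist[n - 1 - k])
               for k in range(n // 2))

h_half = [1, 2]                      # h0, h1
h_full = h_half + h_half[::-1]       # h0, h1, h1, h0
x_hist = [3, 5, 7, 11]               # x[n], x[n-1], x[n-2], x[n-3]

y_direct = fir_direct(h_full, x_hist)
y_folded = fir_symmetric(h_half, x_hist)
```

Both forms give the same output for the four-tap example on the slide, with two multiplies instead of four in the folded form.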
78
Accelerating Symmetric FIR Filtering
; Addresses: a6 input buffer, a7 output buffer
; a4 array with x[n-4], x[n-3], x[n-2], x[n-1] for N = 8
; a5 array with x[n-5], x[n-6], x[n-7], x[n-8] for N = 8
; Modulo addressing prevents need to reinitialize regs each sample
firtask: ld #firDP,dp ; initialize data page pointer
stm #frameSize-1,brc ; compute 256 outputs
rptbd firloop-1
stm #N/2,bk ; FIR circular buffer size
ld *ar6+,b ; load input value to accumulator b
mvdd *ar4,*ar5+0% ; move old x[n-N/2] to new x[n-N/2-1]
stl b,*ar4% ; replace oldest sample with newest
add *ar4+0%,*ar5+0%,a ; a = x[n] + x[n-N/2-1]
rptz b,#(N/2-1) ; zero accumulator b, do N/2-1 taps
firs *ar4+0%,*ar5+0%,coeffs ; b += a * h[i], do next a
mar *+ar4(2)% ; to load the next newest sample
mar *ar5+% ; position for x[n-N/2] sample
sth b,*ar7+
firloop: ret
79
Accelerating LMS Filtering
 Adapt weights: $b_k(i+1) = b_k(i) + 2\beta\, e(i)\, x(i-k)$

 Accelerated by the LMS instruction (2 cycles/tap)
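One iteration of the weight update can be written out directly. In this sketch the step size is called `beta`, and the error is taken as desired output minus filter output; that error definition is the standard LMS choice and is an assumption here, since the slide gives only the update rule.

```python
def lms_step(b, x_hist, d, beta):
    """One LMS iteration.

    b      : current weights b_0 ... b_{K-1}
    x_hist : x(i), x(i-1), ..., x(i-K+1)
    d      : desired output at time i (assumed source of the error)
    beta   : adaptation step size
    Returns (new_weights, error).
    """
    y = sum(bk * xk for bk, xk in zip(b, x_hist))      # filter output
    e = d - y                                          # error e(i)
    new_b = [bk + 2 * beta * e * xk for bk, xk in zip(b, x_hist)]
    return new_b, e

weights, err1 = lms_step([0.0, 0.0], [1.0, 2.0], d=1.0, beta=0.1)
_, err2 = lms_step(weights, [1.0, 2.0], d=1.0, beta=0.1)
```

Each tap needs one multiply-accumulate for the output and one for the update, which is the pair of operations the two-cycle-per-tap figure refers to.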
80
Accelerating LMS Filtering
81
Accelerating Polynomial Evaluation
 Function approximation and spline interpolation

 Fast polynomial evaluation (N coefficients)
 y(x) = c0 + c1 x + c2 x^2 + c3 x^3 Expanded form
 y(x) = c0 + x (c1 + x (c2 + x (c3))) Horner’s form
 POLY reduces 2 N cycles using MAC+ADD to N cycles
; ar2 contains address of array [c3 c2 c1 c0]
; poly uses temporary register t for multiplicand x
; first two times poly instruction executes gives
; 1. a = c(3) + x * 0 = c(3); b = c2
; 2. a = c(2) + x * c(3); b = c1
ld *ar2+,16,b ; b = c3 << 16
ld *ar3,t ; t = x (ar3 contains addr of x)
rptz a,#3 ; a = 0, repeat next inst. 4 times
poly *ar2+ ; a = b + x*a || b = c(i-1) << 16
sth a,*ar4 ; store result (ar4 is addr of y)
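Horner's form evaluates the polynomial with one multiply and one add per coefficient, which is the recurrence the POLY loop implements. A plain-Python sketch of that recurrence, with coefficients ordered from highest to lowest power as in the array [c3 c2 c1 c0]; it ignores the fixed-point shifts and rounding of the real instruction.

```python
def horner(coeffs_high_to_low, x):
    """Evaluate a polynomial in Horner's form: acc = c + x * acc per step."""
    acc = 0
    for c in coeffs_high_to_low:
        acc = c + x * acc
    return acc

def expanded(coeffs_high_to_low, x):
    """Reference evaluation using explicit powers, for comparison."""
    n = len(coeffs_high_to_low)
    return sum(c * x ** (n - 1 - i) for i, c in enumerate(coeffs_high_to_low))

y_horner = horner([4, 3, 2, 1], 2)      # 4x^3 + 3x^2 + 2x + 1 at x = 2
y_expanded = expanded([4, 3, 2, 1], 2)
```

The first pass through the loop computes c3 + x * 0 = c3, matching the first comment in the assembly listing.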
82
C54x optimizing C compiler
 ANSI C compiler
 Intrinsics, in-line assembly and functions, pragmas
 Selected pragmas
CODE_SECTION    code section
DATA_SECTION    data section
FUNC_IS_PURE    no side effects
INTERRUPT       specifies interrupt routine
NO_INTERRUPT    cannot be interrupted
 Cl500 shell program contains

 C Compiler: parser, optimizer, and code generator
 Assembler: generates a relocatable (COFF) object file
 Linker: creates executable object file
83
Optimizing C Code
 Level 0 optimization: -o0 flag

 Performs control-flowgraph simplifications
 Allocates variables to registers
 Eliminates unused code
 Simplifies expressions and statements
 Expands inline function calls
 Level 1 optimization: -o1 flag
 Performs local copy/constant propagation
 Removes unused assignments
 Eliminates local common expressions
84
Optimizing C Code
 Level 2 optimization: -o2 flag
 Performs loop optimizations
 Eliminates global common sub-expressions
 Eliminates global unused assignments
 Performs loop unrolling
 Level 3 optimization: -o3 flag
 Removes all functions that are never called
 Performs file-level optimization
 Simplifies functions with unused return values
 Program-level optimization: -pm flag

85
Compiler Optimizations
 Cost-based register allocation

 Alias disambiguation
 Aliasing memory prevents compiler from keeping values in
registers
 Determines when 2 pointers cannot point to the same location,
allowing compiler to optimize expressions
 Branch optimizations
 Analyzes branching behavior and rearranges code to remove
branches or remove redundant conditions
86
 Copy propagation
 Following an assignment compiler replaces references to a
variable with its value
 Common sub-expression elimination

 When 2 or more expressions produce the same value, the
compiler computes the value once and reuses it
 Redundant assignment elimination

 Redundant assignments occur mainly due to the above two
optimizations and are completely eliminated
87
 Expression simplification
 Compiler simplifies expressions to equivalent forms requiring fewer
instructions/registers
/* Expression Simplification*/
g = (a + b) - (c + d); /* unoptimized */
g = ((a + b) - c) - d; /* optimized */
 Inline expansion
 Replaces calls to small run-time support functions with inline
code, saving function call overhead
88
 Induction variables
 Loop variables whose value directly depends on the number of
times a loop executes
 Strength reduction
 Loops controlled by counter increments are replaced by repeat
blocks
 Efficient expressions are substituted for inefficient use of
induction variables (e.g., code that indexes into an array is
replaced with code that increments pointers)
89
 Loop-invariant code motion

 Identifies expressions within loops that always compute the same
value and moves the computation to the front of the loop as a
precomputed expression
 Loop rotation
 Evaluates loop conditionals at the bottom of loop
 Auto-increment addressing
 Converts C-increments into efficient address-register indirect
access
90
Hypersignal Block Diagram Environments
 Hierarchical block diagrams (dataflow modeling)

 Block is defined by dynamically linked library function
 Create new blocks by using a design assistant GUI
 RIDE for graphical real-time debugging/display

 1-D, multirate, and m-D signal processing
 ANSI C source code generator
 C54x boards: support planned for 4Q99
 C6x boards: DNA McEVM, Innovative Integration, MicroLAB
TORNADO, and TI EVM
 OORVL DSP Graphical Compiler
 Generates DSP assembly code (C3x and C54x)
91
Hypersignal RIDE Environment
Download demonstration software from http://www.hyperception.com

92
Hypersignal RIDE Image Processing Library
Category              Blocks
Image arithmetic      Add, subtract, multiply, exponentiate
Image generation      Grayscale, noise, sprite
Image I/O             AVI, bitmaps, raw images, video capture
Image display         Bitmaps, RGB
Edge detection        Isotropic, Laplace, Prewitt, Roberts, Sobel
Line detection        Horizontal, 45°, vertical, 135°
1-D filtering         Convolution, DFT, FFT, FIR, IIR
2-D filtering         DFT, FFT, FIR
Nonlinear filtering   Max, median, min, rank order, threshold
Histograms            Histograms, histogram equalization
Manipulation          Contrast, flip, negate, resize, rotate, zoom
Object-based          Object count, object tracking
Networking            Internet transmit, Internet receive
Same as ImageDSP and Advanced Image Processing Library
93

TI C54x Evaluation Module (EVM) Board
 Offered through TI and Spectrum Digital

 100 MHz C549 & 100 MHz C5410 for under $1,000
 Memory: 192 kwords program, 64 kwords data
 Single/multi-channel audio data acquisition interfaces
 Standard JTAG interface (used by debugger)
 Spectrum sells 100 MHz C5402 & 66 MHz C548 EVMs
 Software features
 Compatible with TI Code Composer Studio
 Supports TI C debugger, compiler, assembler, linker
http://www.ti.com/sc/docs/tools/dsp/c5000developmentboards.html
94
Sampling of Other C54 Boards
Vendor                   Board           RAM      ROM      Processor            I/O
Kane Computing           KC542/PC        256 kb   256 kb   40-MIP C5402         16-bit stereo
Innovative Integration   SBC54                             100-MIP C549         Modular
DSP Research             Tiger 549/PC    256 kb   256 kb   100-MIP C549
DSP                      Tiger           256 kb   256 kb   100-MIP
Odin Telesystems         VIDAR 5x4PCI    2 Mb     0 kb     four 80-MIP C548
DSP                      Viper-12        12 Mb    0 kb     12 100-MIP
http://www.ti.com/sc/docs/tools/dsp/c5000developmentboards.html
95
Binary-to-Binary Translation
 Many of today’s DSP systems are implemented using the
TI C5x DSP (e.g. voiceband modems)
 TI is no longer developing members of C5x family in favor of
the C54x family
 3Com has shipped over 35 million modems with C5x
 C5x binaries are incompatible with C54x
 Significant architectural differences between them
 Need for automatic translator of binary C5x code to binary
C54x code
 Solutions for binary-to-binary translation
 Translation Assistance Program 5000 from TI
 C50-to-C54 translator from UT Austin
 Both provide assistance for cases they cannot handle
96
TI Translation Assistant Program 5000
 Assists in translating C5x code to C54x code

 Makes many assumptions about code being translated
 Requires a significant amount of user interaction
 Free evaluation for 60 days from TI Web site
 Static assembler to assembler translation
 Generates automatic translation when possible
 Twenty situations are not automatically translated: user must
intervene
 Many other situations result in inefficient code
 Warns user when translation difficulty is encountered
 Analyzes prior translations
http://www.ti.com/sc/docs/tools/dsp/tap5000freetool.html
97
Conclusion
 C54x is a conventional digital signal processor

 Separate data/program busses (3 reads & 1 write/cycle)
 Extended precision accumulators
 Single-cycle multiply-accumulate
 Saturation and wraparound arithmetic
 Bit-reversed and circular addressing modes
 Highest performance vs. power consumption/cost/vol.
 C54x has instructions to accelerate algorithms
 Communications: FIR & LMS filtering, Viterbi decoding
 Speech coding: vector distances for code book search
 Interpolation: polynomial evaluation
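The difference between saturation and wraparound arithmetic listed above can be sketched in a few lines. This is an illustration only: the 16-bit word width and the function names `wrap16` and `saturate16` are assumptions chosen for the example, not a model of the actual C54x accumulator logic.

```python
def wrap16(value):
    """Wraparound: keep the low 16 bits and reinterpret them as signed."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def saturate16(value):
    """Saturation: clamp to the signed 16-bit range instead of wrapping."""
    return max(-32768, min(32767, value))


# Adding two large positive samples overflows a signed 16-bit word.
total = 30000 + 10000
print("wraparound:", wrap16(total))
print("saturation:", saturate16(total))
```

With wraparound the overflowed sum changes sign, while saturation pins it at the largest representable value, which is usually the less audible or less damaging error in signal-processing code.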
98
Conclusion
 C54x reference set

 Mnemonic Instruction Set, vol. II, Doc. SPRU172B
 Applications Guide, vol. IV, Doc. SPRU173. Algorithm
acceleration examples (filtering, Viterbi decoding, etc.)
 C54x application notes
http://www.ti.com/sc/docs/apps/dsp/tms320c5000app.html
 C54x source code for applications and kernels
http://www.ti.com/sc/docs/dsps/hotline/wizsup5xx.htm
 Other resources
 comp.dsp newsgroup: FAQ www.bdti.com/faq/dsp_faq.html
 embedded processors and systems: www.eg3.com
 on-line courses and DSP boards: www.techonline.com
 DSP course: http://www.ece.utexas.edu/~bevans/courses/realtime/
99
TMS320C54x DSP
Design Workshop
Module 1
Introduction and Overview
Learning Objectives
 Describe the requirements of a DSP system

 Identify the CPU components of the ‘C54x
 List the ‘C54x internal buses and their usage
 List the ‘C54x pipeline stages and their actions
 Describe the memory map of the ‘C54x
 List memory and peripherals of the ‘C54x devices
 Become familiar with ‘C54x simulator
1 - 101
DSP: Sum-of-Products
[Figure: inputs x and a feed a multiplier (MPY) whose product feeds an adder (ADD) that accumulates the sum]
1 - 102
MAC Unit Details
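[Figure: MAC unit. The multiplier (MPY) takes two signed/unsigned (s/u) operands selected from the D, C, and P buses, the T register, and accumulator A; the product, adjusted by FRCT, feeds the adder (ADD), whose other input is A, B, or 0; the sum goes to acc A or acc B]
D = Data Bus
C = Coefficient Bus
P = Program Bus
A = A accumulator
B = B accumulator
T = Temporary register
s/u = signed/unsigned
FRCT = Fractional mode bit
MAC *AR2+, *AR3+, A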
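As a rough software analogue of repeating the MAC instruction over two arrays, the sketch below accumulates a sum of products. The 40-bit accumulator width and the name `mac_sum` are assumptions made for illustration; the fractional (FRCT) and signed/unsigned modes are not modeled.

```python
def mac_sum(samples, coeffs, acc_bits=40):
    """Accumulate sum(x * a), wrapping like a signed acc_bits-wide register.

    Each loop iteration stands in for one multiply-accumulate step with
    both pointers post-incremented (assumed behaviour, simplified).
    """
    mask = (1 << acc_bits) - 1
    acc = 0
    for x, a in zip(samples, coeffs):
        acc = (acc + x * a) & mask
    # Reinterpret the masked value as a signed number.
    if acc >= 1 << (acc_bits - 1):
        acc -= 1 << acc_bits
    return acc


print(mac_sum([1, 2, 3], [4, 5, 6]))
```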
1 - 103
Accumulators + ALU
General-Purpose Math, ex: t = s + e - r
[Figure: ALU block diagram showing the A, B, C, T, D and shifter inputs, the MUX, the ALU, accumulators A and B, and the MAC path]
LD @s, A
ADD @e, A
SUB @r, A
STL A, @t
1 - 104
Barrel Shifter
[Figure: barrel shifter (-16 to +31) taking the A, B, C, and D inputs and feeding the ALU and the E bus]
LD @X, 16, A
STH B, @y
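The two instructions above can be mimicked in software to show what the shifter contributes. This is a simplified sketch: the 40-bit accumulator width, the unconditional sign extension of the 16-bit operand, and the names `to_signed`, `ld_shift`, and `sth` are assumptions for illustration.

```python
ACC_BITS = 40  # assumed accumulator width


def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def ld_shift(word16, shift):
    """Load a 16-bit word into the accumulator with a shift of -16..+31."""
    if not -16 <= shift <= 31:
        raise ValueError("shift out of range")
    value = to_signed(word16, 16)
    shifted = value << shift if shift >= 0 else value >> -shift
    return to_signed(shifted, ACC_BITS)


def sth(acc):
    """Store the high word: return bits 31-16 of the accumulator."""
    return (acc >> 16) & 0xFFFF


acc = ld_shift(0x1234, 16)
print(hex(sth(acc)))
```

Loading with a shift of 16 places the operand in the high word of the accumulator, so storing the high word returns the original value.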
1 - 105
Temporary Register
[Figure: temporary register T with its connections to D, X, EXP, A, B, the MAC, and the ALU]
ex: A = xa
LD @x, T
MPY @a, A
1 - 106
'C54x Buses
[Figure: internal and external memory connected through MUXes to the P, D, C, and E buses, which serve T, MAC, A, B, ALU, and SHIFT]
MAC *AR2+, *AR3+, A

1 - 107
Pipeline - Concept
 F Fetch Get instruction from memory

 D Decode Schedule activity
 R Read Get operand from memory
 X Execute Perform operation
1 - 108
Fetch and Read - Memory Interaction
 Broken into two phases:

1. Calculate address
2. Collect data or instruction
 Allows more time for memory interface.
1 - 109
‘C54x Pipeline - Enhanced
 P Prefetch Calculate address of instruction

 F Fetch Collect instruction
 D Decode Interpret instruction
 A Access Calculate address of operand
 R Read Collect operand
 X Execute Perform operation
1 - 110
Memory Write
 When storing results back to memory

 Two phases
 Address set up
 Data written
 Overlaid onto R + X phases
 Best balance of:
 Processor loading
 Speed
 Cost
1 - 111
'C54x Pipeline Events
P  Drive address of instruction          PA
F  Collect instruction                   PD
D  Interpret instruction, plan job       ctlr
A  Set up pointers, calc data address    DA/CA
R  Collect operand                       DD/CD
   Calculate write address               EA
X  Execute operation                     *,+
   Send result                           ED
1 - 112
‘C54x Pipeline Hardware
P  PC, PA
F  Program Mem, PD
D  Controller
A  ARs, ARAUs, DA + CA
R  Data Mem, DD + CD; AR, ARAU, EA
X  CALU (MAC, ALU); ED, Data Mem
CALU = Combined Arithmetic Logic Unit (MAC + ALU)
1 - 113
'C54x Components and Bus Usage
CNTL PC ARs
D
M
INTERNAL M EXTERNAL
U
MEMORY C U MEMORY
X
X
E
S E
T MAC A B ALU SHIFT
1 - 114
Pipeline Performance
TIME
P1 F1 D1 A1 R1 X1
   P2 F2 D2 A2 R2 X2
      P3 F3 D3 A3 R3 X3
         P4 F4 D4 A4 R4 X4
            P5 F5 D5 A5 R5 X5
               P6 F6 D6 A6 R6 X6
FULLY LOADED 'PIPE'
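The staggered chart for a fully loaded pipe can be generated mechanically, which also makes the cycle count explicit. This is a minimal sketch assuming one instruction enters the pipe per cycle with no stalls; `pipeline_chart` and `total_cycles` are hypothetical helper names.

```python
STAGES = ["P", "F", "D", "A", "R", "X"]  # the six 'C54x pipeline phases


def pipeline_chart(n_instructions):
    """Return one text row per instruction, offset by one cycle each."""
    rows = []
    for i in range(n_instructions):
        # Two-character blank cells keep columns aligned for up to 9 rows.
        cells = ["  "] * i + [f"{stage}{i + 1}" for stage in STAGES]
        rows.append(" ".join(cells))
    return rows


def total_cycles(n_instructions):
    """Cycles to finish n instructions when the pipe never stalls."""
    return len(STAGES) + n_instructions - 1


for row in pipeline_chart(6):
    print(row)
print("cycles:", total_cycles(6))
```

Once the pipe is full, one instruction completes per cycle, so n instructions take n + 5 cycles rather than 6n.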

1 - 115
Pipeline Conflicts - External Memory
[Figure: 'C54x with both program (P) and data (D) in external memory]
P1 F1 D1 A1 R1 X1
   P2 F2 D2 A2 R2 X2
      P3 F3 D3 A3 R3 X3
         P4 -- -- -- F4 D4 A4 R4 X4
            -- -- -- P5 F5 D5 A5 R5 X5
               -- -- -- P6 F6 D6 A6 R6
1 - 116
Pipeline Flow: Internal and External Memories
[Figure: two 'C54x configurations, each with one of program (P) or data (D) memory external and the other internal]
P1 F1 D1 A1 R1 X1
   P2 F2 D2 A2 R2 X2
      P3 F3 D3 A3 R3 X3
         P4 F4 D4 A4 R4 X4
            P5 F5 D5 A5 R5 X5
               P6 F6 D6 A6 R6 X6
NO CONFLICT
1 - 117
Pipeline: Internal Memory Only
'C54x
ROM DARAM RAM
4K 2K 8K
4K 2K 8K
. . .
. . .
. . .
P
MAC ALU
ROM & RAM: One access per block per cycle

DARAM: Two accesses per block per cycle
1 - 118
'C541 Memory Maps
PROGRAM DATA
0000 0000
RAM ? OVLY MMR / RAM
1400 1400
EXT
EXT
9000
Internal
ROM ?
E000
FF80 DROM ROM ?
VECTORS FFFF
FFFF
1 - 119
'C541 Program Memory Options
All External 28K ROM** 'RAM' Option
MP/MC = 1 MP/MC = 0 OVLY = 1
0000 0000 0000
0080
RAM
1400
EXT EXT
EXT
9000
2K ROM
9800 2K ROM
A000
4K ROM
B000
4K ROM EXT
C000
4K ROM or
D000
4K ROM ROM
E000
4K ROM
FF80 F000
VECTORS* 4K ROM w VECs * VECTORS*
FFFF FFFF FFFF
* FF80 - FFFF are the default locations for vectors.

** Internal ROM FF00 - FF7F reserved for TI test.
1 - 120
'C541 Data Memory
0000 0000 0000

MMR / RAM MMR MMR
1400 + 0060
0080 SPRAM
0400 RAM a
RAM b
0800
EXT
RAM c
RAM a
0C00
RAM d
1000
E000
EXT or ROM RAM e
FFFF 1400 0400
1 - 121
'C54x Memory Mix
C54x DARAM SARAM ROM DROM

1 5 28 8
 2 10 2
 3 10 2
 4 4 24 8
 5 6 48
16
 6 6 48
16
 9 8 24 16
1 - 122
'C54x Peripheral Mix
C54x SER TDM BSP HPI

1 2
2 1 1 1
3 1 1
4 2
5 1 1 1
6 1 1
9 1 2 1
1 - 123
'C54x Review - CALU
CALU supports:
 General-purpose operations:
 MAC
 ALU
 Special functions:
 CSSU (Viterbi)
 EXP (Norm)
 FIRS: MAC + ALU
 16- or 32-bit operations:
 C16 mode (Double)
 Long operations
1 - 124
'C54x Review - System
 Four buses allow 1 fetch, 2 reads, and 1 write each cycle.

 Built from and for cDSP:
 Fast growing family
 Easy to modify for custom use
 Attributes
 Static design
 Low power
 Any clock below maximum
 Low $/MIP
 Fast/dense instructions
 Small size for functionality
 LC version for 3V operation, VC for 2.5/3.3V operation
1 - 125
Lab 1: Debugger Walkthrough
Follow the steps in your workbook to exercise the Debugger.
1 - 126
Debugger Screen
1 - 127
Simulator Quick Reference
Window Management
  Selecting Window
    F6                  rotates to next window
    WIN <name>          selects <name> window
    Click window frame  select window
    F4                  close selected window
  Moving Inside Window
    Up Arrow / Down Arrow
    Page Up / Page Down
    Click on window frame arrows
    For DISASSEM window; type ADDR <value>
    For MEMORY window; type MEM <value>
  Moving Window
    Click on top of frame; drag to new location
    Type MOVE and use arrows or type coordinates
  Sizing Window
    Click on bottom right corner; drag to new shape
    Type SIZE and use arrows or type coordinates
    ZOOM    click on top left corner
    UNZOOM  click again on top left corner
  Screen Configuration
    SCONFIG <name>  load configuration <name>
    SSAVE <name>    save configuration <name>
  Modes
    ASM  display ASM info, or <Alt> D,A
    C    display C info, or <Alt> D,C
    MIX  display both ASM and C, or <Alt> D,M
Running Code
  Reset
    Type RESET    forces PC to zero
    Type RESTART  return to "entry point"
  Stepping
    F8 or type STEP   one step
    F10 or type NEXT  condense subroutines
    Type STEP <n>     <n> steps
    Type NEXT <n>     <n> nexts
  Running
    RUN         run until <Esc> or breakpoint
    RUNB        run with benchmark
    GO <label>  run to <label>
Watches and Breakpoints
  Operation  Watch  Breakpoint
  ADD        WA     BA
  RESET      WR     BR
  LIST       WL     BL
  DELETE     WD #   BD #
  or hot keys or mouse clicks
Other Actions
  ? <label>        display value of <label>
  ? <label> = <n>  load <label> with <n>
  file <name>      load file <name> to file window
  TAB              scroll to prior commands
  SHIFT TAB        scroll to subsequent commands
  F9               alternate form of mouse click
  TAKE <name>      simulator 'batch' file
  LOAD <name>      download file <name>
Entry/Exit
  SIM54xw <file>  start simulator with <file>.out
  QUIT            exit simulator
  SYSTEM          go to DOS shell
1 - 128
Texas Instruments
TMS320C64x
CS433
Processor Presentation Series
Prof. Luddy Harrison
CS433 Prof. Luddy Harrison Copyright 2005 University of Illinois 130

Note on this presentation series
 These slide presentations were prepared by
students of CS433 at the University of Illinois at
Urbana-Champaign
 All the drawings and figures in these slides were
drawn by the students. Some drawings are based
on figures in the manufacturer’s documentation for
the processor, but none are electronic copies of
such drawings
 You are free to use these slides provided that you
leave the credits and copyright notices intact

Purpose
 TMS320C64x is a family of 16-bit Very Long
Instruction Word (VLIW) Digital Signal
Processors (DSPs) from Texas Instruments
 Target is applications that require high fixed
point performance of streaming data

History
 1982 Texas Instruments releases TMS32010,
the first fixed-point DSP in the TMS320 family
 C1x, C2x, C2xx, C5x, C54x, and C6x are
successive generations of fixed point DSP
offerings in the family
 C3x, C4x, C67x are floating point DSP
offerings
 C8x multiprocessor DSPs

History cont.
 The C64x is the successor of the earlier C62x
 C64x adds significant processing capabilities
for Single Instruction, Multiple Data (SIMD) to
the C62x.
 C64x can process all C62x object code
unmodified (but not vice-versa)

Performance
 TMS320C6418
 Up to 4800 Million Instructions Per Second
(MIPS) at a clock rate of 600 MHz
 Can produce 4 16-bit Multiply Accumulates
(MACs) per cycle, making 2400 Million MACs per
Second (MMACS) OR
 8 8-bit MACs per cycle for a total of 4800 MMACS
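The peak figures above follow directly from the clock rate. The sketch below reproduces the arithmetic, assuming eight instructions issued per cycle (one per functional unit); the function name `peak_rates` and its parameter names are illustrative only.

```python
def peak_rates(clock_mhz, issue_width=8, macs16_per_cycle=4, macs8_per_cycle=8):
    """Peak throughput in millions per second for a given clock in MHz.

    issue_width is an assumption: one instruction per functional unit.
    """
    return {
        "MIPS": clock_mhz * issue_width,
        "MMACS_16bit": clock_mhz * macs16_per_cycle,
        "MMACS_8bit": clock_mhz * macs8_per_cycle,
    }


print(peak_rates(600))
```

These are upper bounds that assume every functional unit is busy in every cycle; sustained rates depend on how well the code fills each execute packet.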

Pricing
 As of April 2005, the C6418 cost $55.94 per
unit when purchased in quantities of 1000
units or more.

Typical Applications
 Pooled Modems
 Wireless local loop base stations
 Remote Access Servers
 Digital Subscriber Loop (DSL) systems
 Cable modems
 Multichannel telephony systems

Applications for the C64x
TMS320C64x can be used as a CPU in the following devices:
 Wireless local loop base stations;
 Remote access server (RAS);
 Digital subscriber loop (DSL) systems;
 Cable modems;
 Multichannel telephony systems;
 Pooled modems;

Applications for the C64x
TMS320C64x is a solution for the following new applications:

 Security access;
 Video conferencing;
 Digital filtering;
 3D graphics;
 Speech recognition;
 Robot Vision;
 Image processing;
 Pattern recognition.
Pipeline Overview
 11-stage pipeline
 Non-interlocked
 The processor does not resolve resource or data
conflicts that are pipe-line related
 Assembly code must resolve all pipeline related
conflicts
 Simplifies the pipeline design

VelociTI™
At the heart of the TMS320C6000 is VelociTI™, an advanced
VLIW architecture that achieves high performance
through the following features:
 Packed instructions reduce code size, fetches and
power consumption;
 Predicated instructions reduce costly branching;
 Variable-width instructions allow flexible data types
(8/16/32-bit data support);
 Branches are fully pipelined (zero overhead for
not-taken branches);
Architecture Overview
 2 (almost) identical fixed-point data paths that
each contain
 1 ALU (The .L Unit)
 1 Shifter (The .S Unit)
 1 Multiplier (The .M Unit)
 1 Adder/Subtractor used for address generation
(The .D Unit)
 1 register file containing thirty-two 32-bit registers

Architecture Overview cont.
 The 8 execution units in the 2 data paths are
capable of executing up to 8 instructions in
parallel.
 Can operate on 8-, 16-, 32-, and 40-bit data
 Can perform double-word (64-bit) loads and
stores by using 2 registers for the one
operation.

Architecture
Program cache/program memory
32-bit addresses
256-bit data
TMS320C64x CPU
Program fetch
Instruction dispatch
Functional units: Instruction decode
6 ALUs
(L1, L2, S1, S2, D1, D2) Register file A Register file B
2 multipliers (M1, M2)
.L1 .S1 .M1 .D1 .D2 .M2 .S2 .L2
Data cache/data memory

32-bit address
8-, 16-, 32-, 64- bit data

C64x Core Functional Diagram
L1 Instruction Cache
Direct Mapped
16KBytes
C64x DSP Core
Instruction Fetch
Control
Instruction Dispatch Registers
Advanced Instruction Packet
Instruction Decode Control
Logic
Data Path A Data Path B
A Register File B Register File Test
A31-A16 B31-B16
A15-A0 B15-B0 Advanced
In-Circuit
Emulation
.L1 .S1 .M1 .D1 .D2 .M2 .S2 .L2 Interrupt
Control
L1 Data Cache
2-Way Set-Associative
16KBytes

TMS320C6418 Functional Diagram
[Figure: TMS320C6418 block diagram. Recoverable blocks: 32-bit EMIF A (SDRAM, SBSRAM, ZBT SRAM, FIFO, SRAM, ROM/FLASH, I/O devices); VCP; Timer 0, Timer 1, Timer 2; McBSP0, McBSP1; McASP0, McASP1; HPI16 or HPI32; I2C0, I2C1; GP0 (16); Enhanced DMA Controller (EDMA); L2 Cache/Memory, 512 KBytes; C64x DSP core with L1 Instruction Cache (direct mapped, 16 KBytes) and L1 Data Cache (2-way set-associative, 16 KBytes); Advanced In-Circuit Emulation; Oscillator and PLL (x1, x5-x12, x16, x18, x19-x22, x24); Power Down Logic; Boot Configuration]

Central Processing Unit
 Performance of up to 4000 million instructions per second (MIPS);
 Clock rate of 500 MHz;
 2 register banks of 32 32-bit registers each;
 Program fetch, instruction dispatch (advanced
instruction packing) and instruction decode units, which
can supply eight 32-bit instructions to the functional units per
cycle;
 Instructions are executed in 2 data paths (A and B), each
with four functional units (a multiplier and 3 ALUs) and a
register bank;
General-Purpose Register Files
 Each of the two C64x register files contains 32 32-bit registers (A0-
A31 for file A and B0-B31 for file B);
 GPRs can be used for data, pointers or conditions;
 Values larger than 32 bits (40-bit long and 64-bit
quantities) are stored in register pairs. The least significant 32 bits
are placed in an even-numbered register and the remaining
bits (8 for a 40-bit value and 32 for a 64-bit value) are placed in the next
upper (odd-numbered) register;
 Packed data types are: four 8-bit values or two 16-bit
values in a single 32-bit register, four 16-bit values in a
64-bit register pair.
[Figure: 40-bit value in a register pair. Bits 39-32 occupy the low 8 bits of the odd register, whose remaining bits are zero filled; bits 31-0 occupy the even register]
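The register-pair layout and the packed data types can be made concrete with a little bit manipulation. This is a sketch only; `split_long40`, `join_long40`, and `pack16x2` are hypothetical helper names, and the odd/even ordering follows the description above.

```python
def split_long40(value):
    """Split a 40-bit value into (odd_register, even_register) contents."""
    value &= (1 << 40) - 1
    even = value & 0xFFFFFFFF      # least significant 32 bits
    odd = (value >> 32) & 0xFF     # remaining 8 bits, upper bits zero filled
    return odd, even


def join_long40(odd, even):
    """Rebuild the 40-bit value from an odd/even register pair."""
    return ((odd & 0xFF) << 32) | (even & 0xFFFFFFFF)


def pack16x2(high, low):
    """Pack two 16-bit values into one 32-bit register image."""
    return ((high & 0xFFFF) << 16) | (low & 0xFFFF)


odd, even = split_long40(0x123456789A)
print(hex(odd), hex(even), hex(pack16x2(0xBEEF, 0x0001)))
```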

TMS320C64x Data Paths
[Figure: C64x data paths. Data path A: .L1, .S1, .M1, .D1 with register file A (A0-A31), store paths ST1a/ST1b, load paths LD1a/LD1b, and data address path DA1. Data path B: .L2, .S2, .M2, .D2 with register file B (B0-B31), store paths ST2a/ST2b, load paths LD2a/LD2b, and data address path DA2]
The data path of the C64x has the following components:
 Two load-from-memory data paths;
 Two store-to-memory data paths;
 Two data address paths;
 Two register file data cross paths;
[Figure: detailed view of data paths A and B. Each of .L1, .S1, .M1, .D1 (register file A, A0-A31) and .L2, .S2, .M2, .D2 (register file B, B0-B31) has src1, src2, and dst ports; the .L and .S units also have 8-bit long src and long dst ports, and the .M units a long dst port; ST1a/LD1a/ST2a/LD2a carry the 32 LSBs and ST1b/LD1b/ST2b/LD2b the 32 MSBs; DA1 and DA2 are the data address paths from .D1 and .D2; 1X and 2X are the cross paths; .S2 also connects to the control register file]

Functional Units (Structure)
 Each functional unit has its own 32-bit write port into a GPR. Each functional unit reads directly from its own data path;
 All units ending in 1 write to register file A, and all units ending in 2 write to register file B;
 Each functional unit has two 32-bit read ports for source operands src1 and src2;
 L and S units have an extra 8-bit-wide port for 40-bit long writes, as well as an 8-bit input for 40-bit long reads;
 Each C64x multiplier can return up to a 64-bit result;
[Figure: .L1, .S1, .D1, and .M1 with their src1, src2, dst, long src, and long dst ports]

.L (.L1 and .L2) Unit Operations
Performed
 32/40-bit arithmetic and compare operations
 32-bit logical operations
 Leftmost 1 or 0 counting for 32 bits
 Normalization count for 32 and 40 bits
 Byte shifts
 Data packing/unpacking
 5-bit constant generation
 Vector Operations:
 Dual 16-bit arithmetic operations
 Quad 8-bit arithmetic operations
 Dual 16-bit min/max operations
 Quad 8-bit min/max operations

.S (.S1 and .S2) Unit Operations
Performed
 32-bit arithmetic operations
 32/40-bit shifts and 32-bit bit-field operations
 Branches
 Constant generation
 Register transfers to/from control register file (.S2 only)
 Byte shifts
 Data packing/unpacking
 Vector Operations
 Dual 16-bit compare operations
 Quad 8-bit compare operations
 Dual 16-bit shift operations
 Dual 16-bit saturated arithmetic operations
 Quad 8-bit saturated arithmetic operations

.M (.M1 and .M2) Unit Operations
Performed
 16 x 16 multiply operations
 16 x 32 multiply operations
 Vector Operations
 Quad 8 x 8 multiply operations
 Dual 16 x 16 multiply operations
 Dual 16 x 16 multiply with add/subtract operations
 Quad 8 x 8 multiply with add operation
 Bit expansion
 Bit interleaving/de-interleaving
 Variable shift operations
 Rotation
 Galois Field Multiply

.D (.D1 and .D2) Unit Operations
Performed
 32-bit add, subtract, linear and circular address
calculation (for circular arrays)
 Loads and stores with 5-bit constant offset
 Loads and stores with 15-bit constant offset (.D2
only)
 Load and store double words with 5-bit constant
 Load and store non-aligned words and double
words
 5-bit constant generation

Functional Units (I)
Functional unit: .L unit (.L1, .L2)
  Fixed-point operations:
    32/40-bit arithmetic and compare operations
    32-bit logical operations
    Byte shifts
    Data packing/unpacking
    5-bit constant generation
    Dual and quad 8-, 16-bit arithmetic/min-max operations
  Floating-point operations:
    Arithmetic operations
    DP → SP, INT → DP, INT → SP operations
Functional unit: .S unit (.S1, .S2)
  Fixed-point operations:
    32-bit arithmetic and bit-field operations
    Branches
    Constant generation
    Dual and quad 8-, 16-bit saturated arithmetic and compare operations
    Byte shifts
  Floating-point operations:
    Compare
    Reciprocal and reciprocal square-root operations
    Absolute value operations

Functional Units (II)
Functional unit: .M unit (.M1, .M2)
  Fixed-point operations:
    16x16, 16x32, quad 8x8, and dual 16x16 multiply operations
    Bit expansion
    Variable shift operations
    Rotation
    Galois Field Multiply
  Floating-point operations:
    32x32-bit fixed-point multiply operations
    Floating-point multiply operations
Functional unit: .D unit (.D1, .D2)
  Fixed-point operations:
    32-bit add, subtract, linear and circular address calculation
    Loads and stores with 5-bit constant offset
    Loads and stores with 15-bit constant offset (.D2 only)
    Load and store double words with 5-bit constant
    Load and store non-aligned words and double words
    5-bit constant generation
  Floating-point operations:
    Load doubleword with 5-bit constant offset

Instruction to Functional Unit
Mapping
.L Unit:
  ABS, ADD, ADDU, AND, CMPEQ, CMPGT, CMPGTU, CMPLT, CMPLTU, LMBD, MV, NEG,
  NORM, NOT, OR, SADD, SAT, SSUB, SUB, SUBU, SUBC, XOR, ZERO
.M Unit:
  MPY, MPYU, MPYUS, MPYSU, MPYH, MPYHU, MPYHUS, MPYHSU, MPYHL, MPYHLU,
  MPYHULS, MPYHSLU, MPYLH, MPYLHU, MPYLUHS, MPYLSHU, SMPY, SMPYHL, SMPYLH,
  SMPYH
.S Unit:
  ADD, ADDK, ADD2, AND, B disp, B IRP†, B NRP†, B reg, CLR, EXT, EXTU, MV,
  MVC†, MVK, MVKH, MVKLH, NEG, NOT, OR, SET, SHL, SHR, SHRU, SSHL, SUB,
  SUBU, SUB2, XOR, ZERO
.D Unit:
  ADD, ADDAB, ADDAH, ADDAW, LDB, LDBU, LDH, LDHU, LDW,
  LDB (15-bit offset)‡, LDBU (15-bit offset)‡, LDH (15-bit offset)‡,
  LDHU (15-bit offset)‡, LDW (15-bit offset)‡, MV, STB, STH, STW,
  STB (15-bit offset)‡, STH (15-bit offset)‡, STW (15-bit offset)‡,
  SUB, SUBAB, SUBAH, SUBAW, ZERO

Register file cross-paths
 The register files are connected to the opposite side register file’s
functional units via the 1X and 2X cross paths;
 The 1X cross path allows the functional units of data path A to read
their source from register file B, and the 2X cross path allows the
functional units of data path B to read their source from register file A.
 All eight of the functional units have access to the register file on the
opposite side, via a cross path;
 The src2 inputs of .M, .S and .D units are selectable between the cross
path and the same-side register file. Both src1 and src2 inputs of .L units
are selectable between the cross path and the same-side register file;
 Since there are only two cross paths, the limit is one source read from
each data path’s opposite register file per cycle, or a total of two cross
path reads per cycle.

Load and Store Paths
Load path:
 Side A:
• LD1a is the load path for the 32 least significant bits;
• LD1b is the load path for the 32 most significant bits;
 Side B:
• LD2a is the load path for the 32 least significant bits;
• LD2b is the load path for the 32 most significant bits;
Store path:
 Side A:
• ST1a is the write path for the 32 least significant bits;
• ST1b is the write path for the 32 most significant bits;
 Side B:
• ST2a is the write path for the 32 least significant bits;
• ST2b is the write path for the 32 most significant bits;

Data Address Paths
Data address paths are referred to as T1 and T2, respectively.
T1: T2:
LD1 (LD1a, LD1b) LD2 (LD2a, LD2b)
ST1 (ST1a, ST1b) ST2 (ST2a, ST2b)

Instruction Packets
 Instructions are always fetched 8 (256-bits) at a time. This is called a fetch
packet
 If the p-bit of instruction i is set, then instruction i and i+1 are executed in the
same cycle in parallel.
 1 to 8 instructions can be executed in parallel. This is called an execute packet
 In the C62x, packets could not cross the 8-word boundary, and thus the 8th p-bit
was always 0 and padding with NOPs was needed. The C64x did away with
that restriction, and execute packets may now span multiple fetch packets.
Each of the eight instruction words occupies bits 31-0, with the p-bit in bit 0.
Instruction:                        A      B      C      D      E      F      G      H
LSBs of the byte address (binary):  00000  00100  01000  01100  10000  10100  11000  11100

Fetch Packet Example
Instruction:  A  B  C  D  E  F  G  H
p-bit:        1  1  0  0  1  0  1  0
Cycle/Execute Packet   Instructions
1                      ABC
2                      D
3                      EF
4                      GH
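The way p-bits partition a fetch packet into execute packets can be expressed as a short routine. This is a sketch under the rule stated above (a set p-bit chains the next instruction into the same cycle); `split_execute_packets` is a hypothetical name.

```python
def split_execute_packets(instructions, p_bits):
    """Group instructions into execute packets using their p-bits.

    A p-bit of 1 means the following instruction runs in the same cycle;
    a p-bit of 0 closes the current execute packet.
    """
    packets, current = [], []
    for instr, p in zip(instructions, p_bits):
        current.append(instr)
        if p == 0:
            packets.append(current)
            current = []
    if current:
        # Trailing p-bit was 1: the packet continues past this fetch packet.
        packets.append(current)
    return packets


packets = split_execute_packets("ABCDEFGH", [1, 1, 0, 0, 1, 0, 1, 0])
for cycle, packet in enumerate(packets, start=1):
    print(cycle, "".join(packet))
```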
Instruction Encoding (1/4)
Operations on the .L Unit
  Bits:   31-29  28  27-23  22-18  17-13     12  11-5  4-2    1  0
  Field:  creg   z   dst    src2   src1/cst  x   op    1 1 0  s  p

Operations on the .M Unit
  Bits:   31-29  28  27-23  22-18  17-13     12  11-7  6-2        1  0
  Field:  creg   z   dst    src2   src1/cst  x   op    0 0 0 0 0  s  p

Operations on the .D Unit
  Bits:   31-29  28  27-23  22-18  17-13     12-7  6-2        1  0
  Field:  creg   z   dst    src2   src1/cst  op    1 0 0 0 0  s  p

Load/store with 15-bit offset on the .D Unit
  Bits:   31-29  28  27-23    22-8    7  6-4    3-2  1  0
  Field:  creg   z   dst/src  ucst15  y  ld/st  1 1  s  p

Load/store baseR+offsetR/cst on the .D Unit
  Bits:   31-29  28  27-23    22-18  17-13          12-9  8  7  6-4    3-2  1  0
  Field:  creg   z   dst/src  baseR  offsetR/ucst5  mode  r  y  ld/st  0 1  s  p

Operations on the .S Unit
  Bits:   31-29  28  27-23  22-18  17-13     12  11-6  5-2      1  0
  Field:  creg   z   dst    src2   src1/cst  x   op    1 0 0 0  s  p

ADDK on the .S Unit
  Bits:   31-29  28  27-23  22-7  6-2        1  0
  Field:  creg   z   dst    cst   1 0 1 0 0  s  p

Field Operations (immediate forms) on the .S Unit
  Bits:   31-29  28  27-23  22-18  17-13  12-8  7-6  5-2      1  0
  Field:  creg   z   dst    src2   csta   cstb  op   0 0 1 0  s  p

MVK and MVKH on the .S Unit
  Bits:   31-29  28  27-23  22-7  6  5-2      1  0
  Field:  creg   z   dst    cst   h  1 0 1 0  s  p

Bcond disp on the .S Unit
  Bits:   31-29  28  27-7  6-2        1  0
  Field:  creg   z   cst   0 0 1 0 0  s  p

IDLE
  Bits:   31-18     17-2                             1  0
  Field:  Reserved  0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0  s  p

NOP
  Bits:   31-18     17  16-13  12-1                     0
  Field:  Reserved  0   src    0 0 0 0 0 0 0 0 0 0 0 0  p
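To show how such a layout is used, the sketch below packs and unpacks the fields of the .L unit format (creg 31-29, z 28, dst 27-23, src2 22-18, src1/cst 17-13, x 12, op 11-5, fixed pattern 110 in bits 4-2, s 1, p 0). The helper names and the op value in the usage line are arbitrary illustrations, not real opcodes.

```python
# (name, high bit, low bit) for the .L unit instruction format
L_UNIT_FIELDS = [
    ("creg", 31, 29), ("z", 28, 28), ("dst", 27, 23), ("src2", 22, 18),
    ("src1", 17, 13), ("x", 12, 12), ("op", 11, 5), ("s", 1, 1), ("p", 0, 0),
]
L_UNIT_PATTERN = 0b110  # fixed value of bits 4-2


def bits(word, high, low):
    """Extract bits high..low (inclusive) from a 32-bit word."""
    return (word >> low) & ((1 << (high - low + 1)) - 1)


def encode_l_unit(**fields):
    """Build a 32-bit word from named field values (missing fields are 0)."""
    word = L_UNIT_PATTERN << 2
    for name, high, low in L_UNIT_FIELDS:
        value = fields.get(name, 0)
        if value >> (high - low + 1):
            raise ValueError(f"{name} does not fit in its field")
        word |= value << low
    return word


def decode_l_unit(word):
    """Return the named fields of a .L unit instruction word."""
    if bits(word, 4, 2) != L_UNIT_PATTERN:
        raise ValueError("not a .L unit instruction format")
    return {name: bits(word, high, low) for name, high, low in L_UNIT_FIELDS}


word = encode_l_unit(dst=8, src2=7, src1=6, op=0x03, p=1)
print(hex(word), decode_l_unit(word))
```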

C64x Opcode Map
Operations on the .L unit:
31 29 28 27 23 22 18 17 13 12 11 5 4 3 2 1 0
creg z dst src2 src1/cst x op 1 1 0 s p
Operations on the .M unit:

31 29 28 27 23 22 18 17 13 12 11 7 6 5 4 3 2 1 0
creg z dst src2 src1/cst x op 0 0 0 0 0 s p
Operations on the .D unit:

31 29 28 27 23 22 18 17 13 12 7 6 5 4 3 2 1 0
creg z dst src2 src1/cst op 1 0 0 0 0 s p

C64x Opcode Map
Load/store with 15-bit offset on the .D unit :
31 29 28 27 23 22 8 7 6 4 3 2 1 0
creg z dst/src ucst15 y ld/st 1 1 s p
Load/store with baseR + offset/cst on the .D unit :

31 29 28 27 23 22 18 17 13 12 9 8 7 6 4 3 2 1 0
creg z dst/src baseR offset/ucst5 mode r y ld/st 0 1 s p
Operations on the .S unit:

31 29 28 27 23 22 18 17 13 12 11 6 5 4 3 2 1 0
creg z dst src2 src1/cst x op 1 0 0 0 s p
ADDK on the .S unit:

31 29 28 27 23 22 7 6 5 4 3 2 1 0
creg z dst cst 1 0 1 0 0 s p

Pipeline
The C64x fixed-point pipeline has the following features:

 11 phases divided into three stages: Fetch, Decode, Execute;
 The fetch stage has four phases for all instructions, and the decode stage has two
phases for all instructions;
 The execute stage of the pipeline requires a varying number of
phases (up to five), depending on the type of the instruction.
Pipeline
 In the C64x, instructions are fetched from the
instruction memory in groups of eight instructions,
called fetch packets (FPs);
 Each FP can be split into one to eight execute
packets (EPs). Each EP contains only instructions that can
execute in parallel. Each instruction in an EP executes in an
independent functional unit;
 The C64x pipe is most effective when it is kept as full
as possible by organizing instructions;

Pipeline Stages

Fetch Pipeline Stages
PG PS PW PR
Fetch
 PG: Program address Generate
 Program Address is generated in the CPU
 PS: Program address Send
 Program Address is sent to memory for a read operation
 PW: Program access ready Wait
 Memory read occurs
 PR: Program fetch packet Receive
 Fetch packet is received at the CPU

Decode Pipeline Stages
DP DC
Decode
 DP: Instruction Dispatch
 Fetch packets are split into execute packets
 Instructions in the execute packets are assigned to the
appropriate functional units
 DC: Instruction Decode
 Source and destination registers and associated paths are
decoded for use by the functional units

Execute Pipeline Stages: E1
E1 E2 E3 E4 E5
Execute
 E1: Execute stage 1

 Single cycle instructions are completed
 For all instructions, conditions are evaluated and operands are
read
 For load/store, address generation is performed, and address
modifications are written to register file
 For branch instructions, branch fetch packet in PG phase is
affected
 For single cycle instructions, results are written to register

Execute Pipeline Stages: E2
E1 E2 E3 E4 E5
Execute

 Multiply instructions are completed
 Load instructions send the address to memory
 Store instructions send the address and data to memory
 The SAT bit in the control status register (CSR) is set if a single-
cycle instruction saturated its result
 Results of single 16x16 multiply instructions are written to the register file
 Results of .M unit non-multiply instructions are written to the register file

Execute Pipeline Stages: E3
E1 E2 E3 E4 E5
Execute

 Store instructions are completed
 Data memory accesses are performed
 The SAT bit in the control status register (CSR) is set for
multiply instructions that saturate their results

Execute Pipeline Stages: E4
E1 E2 E3 E4 E5
Execute

 Multiply extension instructions are completed
 Load instructions bring the data to the CPU
 Multiply extension instruction (MPY2, MPY4, DOTPx2,
DOTPU4, MPYHIx, MPYLIx and MVD) results are written
to the register file

Execute Pipeline Stages: E5
E1 E2 E3 E4 E5
Execute

 Load instructions are completed
 Load instruction data is written to the register file

Delay Slots
 Delay slots mean “how many CPU cycles
come between the current instruction and
when the results of the instruction can be
used by another instruction”
 Single Cycle Instructions: 0 delay slots
 16x16 Single Multiply and .M Unit non-
multiply Instructions: 1 delay slot

Delay Slots cont.
 Store: 0 delay slots
 If a load occurs before a store (either in parallel or not), then the
old data is loaded from memory before the new data is stored.
 If a load occurs after a store, (either in parallel or not), then the
new data is stored before the data is loaded.
 C64x Multiply Extensions: 3 delay slots
 Load: 4 delay slots
 Branch: 5 delay slots
 The branch target is in the PG slot when the branch condition is
determined in E1. There are 5 slots between PG and E1 when
the branch target begins executing useful code again.
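Because the pipeline is non-interlocked, the programmer or compiler has to account for these delay slots explicitly. The sketch below tabulates the values listed above and computes when a result becomes usable; the dictionary keys and function names are hypothetical labels, not TI terminology.

```python
# Delay slots per instruction class, as listed on the slides above.
DELAY_SLOTS = {
    "single_cycle": 0,
    "multiply_16x16": 1,
    "store": 0,
    "multiply_extension": 3,
    "load": 4,
    "branch": 5,
}


def result_ready_cycle(issue_cycle, kind):
    """First cycle in which a dependent instruction may issue."""
    return issue_cycle + DELAY_SLOTS[kind] + 1


def nops_needed(kind, independent_instructions):
    """NOP cycles required if only some delay slots can be filled with work."""
    return max(0, DELAY_SLOTS[kind] - independent_instructions)


print(result_ready_cycle(0, "load"), nops_needed("load", 1))
```

A load issued in cycle 0 can feed a consumer in cycle 5 at the earliest; any of the four intervening cycles not filled with independent work must be padded with NOPs.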

Pipeline summary
Stage           Phase                     Symbol  During This Phase
Program fetch   Program address generate  PG      The address of the fetch packet is determined
                Program address send      PS      The address of the fetch packet is sent to memory
                Program wait              PW      A program memory access is performed
                Program data receive      PR      The fetch packet is at the CPU boundary
Program decode  Dispatch                  DP      The next execute packet in the fetch packet is determined and sent to the appropriate functional units to be decoded
                Decode                    DC      Instructions are decoded in functional units

Pipeline summary
Execute Execute 1 E1 For all instruction types, the conditions for the instructions are
evaluated and operands are read.
For load and store instructions, address generation is performed and
address modifications are written to a register file.
For branch instructions, branch fetch packet in PG phase is affected
For single-cycle instructions, results are written to a register file.
Execute 2 E2 For load instructions, the address is sent to memory. For store
instructions, the address and data are sent to memory.
Single-cycle instructions that saturate results set the SAT bit in the
control status register (CSR) if saturation occurs.
Execute 3 E3 Data memory accesses are performed. Any multiply instructions that
saturates results sets the SAT bit in the control status register (CSR)
if saturation occurs.
Execute 4 E4 For load instructions, data is brought to the CPU boundary. The
results of multiply extensions are written to a register file.
Execute 5 E5 For load instructions, data is written into a register.

Single-Cycle Instructions
PG PS PW PR DP DC E1
[Figure: in E1 the functional unit (.L, .S, .M, or .D) reads its operands from the register file and writes its results back to the register file]

Two-cycle instructions
The operations occurring in the pipeline for a multiply:
PG PS PW PR DP DC E1 E2 (1 delay slot)
[Figure: the .M functional unit reads its operands (data) from the register file in E1 and writes its results to the register file in E2]

Store instructions
PG PS PW PR DP DC E1 E2 E3
[Figure: in E1 the .D functional unit writes the address modification to the register file; in E2 the address and data are sent to the memory controller; in E3 the memory is accessed]

Load Instructions
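[Figure: load instruction, 4 delay slots. In E1 the .D functional unit writes the address modification to the register file; in E2 the address is sent to the memory controller; in E3 the memory is accessed; in E4 the data reaches the memory controller; in E5 the data is written to the register file]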

Branch Instructions
Branch
target
5 delay slots
Since branch target has to wait until it reaches the E1 phase to begin
execution, the branch takes five delay slots before the branch target
code executes.

Parallel operations
Fetch packets are aligned on 8-word boundaries. The basic format of a fetch packet is
eight 32-bit instruction words (A through H), each occupying bits 31-0 with its p-bit in bit 0.
The boundaries of execute packets are determined by a bit in each instruction, the p-bit. The p-bit
determines whether the instruction executes in parallel with another instruction. If the p-bit
of instruction i is 1, then instruction i+1 is to be executed in parallel with (in the same cycle
as) instruction i; otherwise instruction i+1 is executed in the cycle after instruction i. On the
C62x, where an execute packet cannot cross a fetch packet boundary, the last p-bit in a fetch
packet is always 0. Packets can be:
• Fully serial (all p-bits are 0);
• Fully parallel (all p-bits except the last one are set to 1);
• Partially serial;

Fetch Packet Example
Instruction:  A B C D E F G H
p-bit:        1 1 0 0 1 0 1 0

Cycle/Execute Packet   Instructions
1                      ABC
2                      D
3                      EF
4                      GH
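
The grouping in the fetch packet example can be reproduced mechanically from the p-bits. The following is a minimal sketch, not TI's dispatch logic; the function name `split_execute_packets` is made up for illustration.

```python
def split_execute_packets(instructions, p_bits):
    """Group instructions into execute packets according to their p-bits.

    A p-bit of 1 chains the instruction with the next one; a p-bit of 0
    closes the current execute packet.
    """
    packets, current = [], []
    for instr, p in zip(instructions, p_bits):
        current.append(instr)
        if p == 0:
            packets.append(current)
            current = []
    if current:
        # A trailing p-bit of 1 means the packet continues in the next
        # fetch packet (allowed on the C64x, not on the C62x).
        packets.append(current)
    return packets


packets = split_execute_packets("ABCDEFGH", [1, 1, 0, 0, 1, 0, 1, 0])
```

With the p-bits from the example, `packets` holds the four groups ABC, D, EF and GH, one per cycle.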
Specifying Execute Packets in Assembly
• Code lines preceded by double vertical bars, ||, are executed in
parallel with the previous instruction.
• Example:
   InstructionA
|| InstructionB
|| InstructionC
   InstructionD
   InstructionE
|| InstructionF
   InstructionG
|| InstructionH
Resulting p-bits (A through H): 1 1 0 0 1 0 1 0
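
Going the other way, the p-bits follow directly from where the || markers appear in the listing: an instruction gets p = 1 exactly when the next line starts with ||. The sketch below illustrates that rule only; `p_bits_from_listing` is a hypothetical helper, not part of any TI tool.

```python
LISTING = """
   InstructionA
|| InstructionB
|| InstructionC
   InstructionD
   InstructionE
|| InstructionF
   InstructionG
|| InstructionH
"""


def p_bits_from_listing(text):
    """Return one p-bit per instruction line of an assembly listing."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    bits = []
    for i in range(len(lines)):
        next_line = lines[i + 1] if i + 1 < len(lines) else ""
        # p = 1 when the following instruction is marked as parallel.
        bits.append(1 if next_line.startswith("||") else 0)
    return bits


p_bits = p_bits_from_listing(LISTING)
```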

Specifying Execution Unit in Instructions
• Instructions must specify the functional unit
that they execute on
• A functional unit can be used only once in
any given execute packet
• Example:
   ADD .L1 A6, A7, A8
|| MPY .M2 B9, B10, B11
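
The "one use per functional unit per execute packet" rule is easy to check on the text of a packet. This is a rough sketch that only looks at the unit field (.L1, .S2, .M1, .D2, and so on); the helper name `reused_units` is an assumption, and a real assembler checks many more constraints.

```python
import re

# Matches the unit specifier, e.g. ".L1", ".M2", ".S1X" or ".D1T1".
UNIT_RE = re.compile(r"\.([LSMD][12])")


def reused_units(packet):
    """Return the functional units named more than once in a packet."""
    seen, reused = set(), set()
    for instr in packet:
        match = UNIT_RE.search(instr)
        if match is None:
            continue  # no explicit unit on this line
        unit = match.group(1)
        if unit in seen:
            reused.add(unit)
        seen.add(unit)
    return sorted(reused)


ok = reused_units(["ADD .L1 A6, A7, A8", "MPY .M2 B9, B10, B11"])
bad = reused_units(["ADD .L1 A0, A1, A2", "SUB .L1 A3, A4, A5"])
```

Here `ok` is empty, while `bad` reports that .L1 is claimed twice.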

Cross Paths
• Units in one data path may read a single
operand from the other data path using the
cross paths (1X and 2X) shown in the Data
Path figures
• To use a cross path, append "X" to the functional unit name (for example, .L1X)

Cross Paths Examples
• Good example: B1 is used by the A-side data
path in 2 instructions with the 1X cross path
   ADD .L1X A0, B1, A3
|| SUB .S1X A5, B1, A7
• Bad example: the instructions use the 1X
cross path for 2 different registers (B1 and B2)
   ADD .L1X A0, B1, A3
|| SUB .S1X A5, B2, A7
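
The cross-path restriction can be stated as: within one execute packet, each cross path (1X or 2X) may carry at most one distinct register. The sketch below checks that on instruction text. It is a simplification under stated assumptions: the last operand is taken to be the destination, any source from the opposite register file is taken to travel over the cross path, and the function names are hypothetical.

```python
import re

# Captures the side (1 or 2) of a unit with the X suffix, then the operands.
CROSS_RE = re.compile(r"\.[LSMD]([12])X\s+(.*)")


def cross_path_registers(packet):
    """Map each cross path to the set of registers read through it."""
    used = {"1X": set(), "2X": set()}
    for instr in packet:
        match = CROSS_RE.search(instr)
        if match is None:
            continue  # instruction does not use a cross path
        side = match.group(1)
        operands = [op.strip() for op in match.group(2).split(",")]
        sources = operands[:-1]  # assumption: last operand is the destination
        other_file = "B" if side == "1" else "A"
        for op in sources:
            if op.startswith(other_file):
                used[side + "X"].add(op)
    return used


def cross_paths_ok(packet):
    """True if no cross path carries more than one distinct register."""
    return all(len(regs) <= 1 for regs in cross_path_registers(packet).values())


good = cross_paths_ok(["ADD .L1X A0, B1, A3", "SUB .S1X A5, B1, A7"])
bad = cross_paths_ok(["ADD .L1X A0, B1, A3", "SUB .S1X A5, B2, A7"])
```

Here `good` is true because both instructions read B1 over 1X, and `bad` is false because B1 and B2 would both need the 1X path in the same cycle.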

Conditional Execution
• All instructions can be executed conditionally
• 6 general-purpose registers can be used as
condition registers (A0, A1, A2, B0, B1, B2)
• A 3-bit opcode field (creg) specifies the condition
register tested
• A 1-bit field (z) selects the test: nonzero (z=0) or zero (z=1)
• The 4 MSBs of every instruction are creg and z
• creg=0 and z=0 means unconditional execution
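
Since creg and z occupy the 4 MSBs, they can be pulled out of an instruction word with two shifts. The sketch below assumes creg is bits 31 to 29 and z is bit 28, and the creg-to-register table is written from memory of the TI CPU reference guide, so it should be checked against that guide before being relied on. The function name `decode_condition` is hypothetical.

```python
# Assumed creg encoding (verify against the TI reference guide).
CREG_TO_REGISTER = {
    0b001: "B0",
    0b010: "B1",
    0b011: "B2",
    0b100: "A1",
    0b101: "A2",
    0b110: "A0",
}


def decode_condition(word):
    """Return the assembly condition prefix for a 32-bit instruction word.

    Returns None for unconditional execution (creg=0, z=0).
    """
    creg = (word >> 29) & 0b111  # bits 31..29
    z = (word >> 28) & 1         # bit 28
    if creg == 0 and z == 0:
        return None
    register = CREG_TO_REGISTER.get(creg)
    if register is None:
        raise ValueError("reserved creg/z encoding")
    # z=1 tests for zero, written [!R]; z=0 tests for nonzero, written [R].
    return "[!%s]" % register if z else "[%s]" % register
```

Under these assumptions a word whose top four bits are 1000 decodes to `[A1]`, and one whose top four bits are 0011 decodes to `[!B0]`.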

Conditional Execution cont.
• To execute an instruction conditionally, add
[R] before the instruction to test for nonzero,
and [!R] to test for zero, where R is a
condition register
• Example: The instruction below executes only
if condition register A1 is nonzero.
   [A1] ADD .L1 A5, A6, A7
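
The meaning of the [R] and [!R] prefixes can be written down as a one-line predicate. This is only an illustrative model of the rule stated above; `should_execute` and the register dictionary are hypothetical.

```python
def should_execute(condition, registers):
    """Decide whether a conditionally executed instruction takes effect.

    condition is None (unconditional), "[R]" (execute if R is nonzero)
    or "[!R]" (execute if R is zero); registers maps names to values.
    """
    if condition is None:
        return True
    test_for_zero = condition.startswith("[!")
    name = condition.strip("[]!")
    value = registers[name]
    return value == 0 if test_for_zero else value != 0


runs = should_execute("[A1]", {"A1": 5})      # nonzero, so the ADD executes
skipped = should_execute("[A1]", {"A1": 0})   # zero, so the ADD has no effect
```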

Opcode Symbol Definitions
Symbol    Meaning
baseR     base address register
creg      3-bit field specifying a conditional register
cst       constant
csta      constant a
cstb      constant b
dst       destination
h         MVK or MVKH bit
ld/st     load/store opfield
mode      addressing mode
offsetR   register offset
op        opfield, field within opcode that specifies a unique instruction
p         parallel execution
r         LDDW bit
rsv       reserved
s         select side A or B for destination
src2      source 2
src1      source 1
ucstn     n-bit unsigned constant field
x         use cross path for src2
y         select .D1 or .D2
z         test for equality with zero or nonzero

Sum of products example
C code:
int DotP(short* m, short* n, int count) {
    int i, product, sum = 0;
    for(i = 0; i < count; i++)
    {
        product = m[i] * n[i];
        sum+=product;
    }
    return(sum);
}
TI TMS C64x code:
LOOP:
   [A0]  SUB  .L1   A0, 1, A0
|| [!A0] ADD  .S1   A6, A5, A5
||       MPY  .M1X  B4, A4, A6
|| [B0]  BDEC .S2   LOOP, B0
         LDH  .D1T1 *A3++, A4
         LDH  .D2T2 *B5++, B4

Another code example
MIPS:
loop: LW   R1, 0(R11)
      MUL  R2, R1, R10
      SW   R2, 0(R12)
      ADDI R12, R12, #-4
      ADDI R11, R11, #-4
      BGTZ R12, loop
TI TMS C64x:
      ADDK .S1 #-4,A11 || LDW .D1 A1,0(A11) || MVK .S2 #-4,B1
      ADDK .S1 #-4,A11 || LDW .D1 A1,0(A11) || MUL .M1 A1,A10,A2 || ADDK .S2 #-12,B12
loop: ADDK .S1 #-4,A11 || LDW .D1 A1,0(A11) || MUL .M1 A1,A10,A2 || STW .D2x A2,0(B12) || ADD .L2 B12,B1,B12 || BGTZ .S2 B12, loop
      ADD .L2 B12, B1, B12 || MUL .M1 A1,A10,A2 || STW .D2x A2,0(B12)
      ADD .L2 B12, B1, B12 || STW .D2x A2,0(B12)

Special purpose instructions
Instruction      Description                          Example Application
BITC4            Bit counter                          Machine vision
GMPY4            Galois Field MPY                     Reed Solomon support
SHFL             Bit interleaving                     Convolution encoder
DEAL             Bit de-interleaving                  Cable modem
SWAP4            Byte swap                            Endian swap
XPNDx            Bit expansion                        Graphics
MPYHIx, MPYLIx   Extended precision 16x32 MPYs        Audio
AVGx             Quad 8-bit, Dual 16-bit average      Motion compensation
SUBABS4          Quad 8-bit Absolute of differences   Motion estimation
SSHVL, SSHVR     Signed variable shift                GSM
Memory
• The C64x has separate address spaces for program and data memory;
• It uses a two-level cache memory scheme;
• Memory is organized in interleaved, single-ported memory banks
(only one access to each bank is allowed per cycle);
• Two accesses to a single bank in the same cycle result in a memory
stall: the whole pipeline is halted for one cycle while the second value is
read from memory;
• Two memory operations per cycle are allowed if they do not access
the same bank;
• CPU accesses to memory occur during the PW phase for program
memory and during the E3 phase for data memory.
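
The bank-conflict rule can be illustrated with a toy model. The slide does not give the bank geometry, so the values below (8 banks, each 4 bytes wide, interleaved on word addresses) are assumptions chosen for the example, and the function names are hypothetical.

```python
def bank_of(address, num_banks=8, bank_width_bytes=4):
    """Bank index of a byte address in an interleaved memory.

    num_banks and bank_width_bytes are assumed values, not taken from
    the slides.
    """
    return (address // bank_width_bytes) % num_banks


def stall_cycles(addr_a, addr_b, num_banks=8, bank_width_bytes=4):
    """Stall cycles caused by two data accesses issued in the same cycle."""
    same_bank = (bank_of(addr_a, num_banks, bank_width_bytes)
                 == bank_of(addr_b, num_banks, bank_width_bytes))
    return 1 if same_bank else 0


conflict = stall_cycles(0x0000, 0x0020)  # both addresses map to bank 0
no_conflict = stall_cycles(0x0000, 0x0004)  # banks 0 and 1
```

With this geometry, arrays that are a multiple of 32 bytes apart collide on every parallel access, which is why data placement matters for loops that issue two loads per cycle.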

DSP and Memory
[Figure: block diagram of the DSP core and its memory system]
• L1 program cache: direct-mapped, 16 Kbytes total, feeding instruction fetch
• L1 data cache: 2-way set-associative, 16 Kbytes total
• L2 memory: 1024 Kbytes
• CPU: data path A (register file A; units .L1 .S1 .M1 .D1) and data path B (register file B; units .D2 .M2 .S2 .L2)
• Enhanced DMA controller (64-channel)
• EMIF A and EMIF B to external memory (SDRAM, SBSRAM, ZBT RAM, FIFO, SRAM) and I/O devices

Internal Memory
The C64x has a 32-bit byte-addressable address space with the
following features:
• Separate data and program address spaces;
• Large on-chip RAM, up to 7 Mbits;
• 2-level cache;
• Single internal program memory port with an
instruction-fetch bandwidth of 256 bits;
• Two 64-bit internal data memory ports.

External memory - CPU signals
Data path A
[Figure: EMIFA signal groups]
• Data (64 bits): AED[63:0]
• Memory map space select: ACE3, ACE2, ACE1, ACE0
• Address (20 bits): AEA[22:3]
• Byte enables: ABE7, ABE6, ABE5, ABE4, ABE3, ABE2, ABE1, ABE0
• External memory interface control: AECLKIN, AECLKOUT1, AECLKOUT2, ASDCKE,
AARE/ASDCAS/ASADS/ASRE, AAOE/ASDRAS/ASOE, AAWE/ASDWE/ASWE, AARDY, ASOE3, APDT
• Bus arbitration: AHOLD, AHOLDA, ABUSREQ

External memory - CPU signals
Data path B
[Figure: EMIFB signal groups]
• Data (16 bits): BED[15:0]
• Memory map space select: BCE3, BCE2, BCE1, BCE0
• Address (20 bits): BEA[22:3]
• Byte enables: BBE1, BBE0
• External memory interface control: BECLKIN, BECLKOUT1, BECLKOUT2, BSDCKE,
BARE/BSDCAS/BSADS/BSRE, BAOE/BSDRAS/BSOE, BAWE/BSDWE/BSWE, BARDY, BSOE3, BPDT
• Bus arbitration: BHOLD, BHOLDA, BBUSREQ

Memory Map (Internal and External Memory)
• Level 1 program cache is 128 Kbit, direct-mapped
• Level 1 data cache is 128 Kbit, 2-way set-associative
• Shared Level 2 program/data memory/cache of 4 Mbit, which can be configured as:
   • Mapped memory
   • Cache (up to 256 Kbytes)
   • A combination of the two

Memory Buses
• Instruction fetch uses a 32-bit address bus and a
256-bit data bus
• 2 64-bit load buses (LD1 and LD2)
• 2 64-bit store buses (ST1 and ST2)

L2 Memory/Cache Layout
L2 Mode   SRAM   Cache (4-way assoc.)
000       512K   none
001       480K   32K
010       448K   64K
011       384K   128K
111       256K   256K

L2 memory block   Base address   End address
256 Kbytes SRAM   0x0000 0000    0x0003 FFFF
128 Kbytes SRAM   0x0004 0000    0x0005 FFFF
64 Kbytes SRAM    0x0006 0000    0x0006 FFFF
32 Kbytes SRAM    0x0007 0000    0x0007 7FFF
32 Kbytes SRAM    0x0007 8000    0x0007 FFFF
In each mode the cache takes the blocks at the highest addresses.
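
The L2 layout can be summarised as a lookup from mode to SRAM and cache sizes. The sketch below reads the sizes off the layout figure and assumes, as the figure suggests, that the cache occupies the top of the 512-Kbyte region; the names `CACHE_KB_BY_MODE` and `l2_split` are hypothetical.

```python
L2_TOTAL_KB = 512
L2_BASE = 0x00000000

# Cache size in Kbytes for each L2 mode shown in the layout figure.
CACHE_KB_BY_MODE = {0b000: 0, 0b001: 32, 0b010: 64, 0b011: 128, 0b111: 256}


def l2_split(mode):
    """Return (sram_kbytes, cache_kbytes, cache_base_address) for a mode.

    cache_base_address is None when the whole L2 is mapped SRAM.
    Assumes the cache is carved from the highest addresses.
    """
    cache_kb = CACHE_KB_BY_MODE[mode]
    sram_kb = L2_TOTAL_KB - cache_kb
    cache_base = L2_BASE + sram_kb * 1024 if cache_kb else None
    return sram_kb, cache_kb, cache_base


sram_kb, cache_kb, cache_base = l2_split(0b001)
```

For mode 001 this gives 480 Kbytes of SRAM and a 32-Kbyte cache starting at 0x0007 8000, matching the last block in the layout.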

Interrupts
• 16 prioritized interrupts: INT_00 to INT_15
• INT_00 has the highest priority and is dedicated to
RESET. This halts the CPU and returns it to a
known state
• The first four interrupts (INT_00 – INT_03) are fixed
and nonmaskable
• INT_01 – INT_03 are generally used to alert the
CPU of an impending hardware problem, such as an
imminent power failure
• The remaining interrupts are maskable and can be
programmed
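
The priority and masking rules can be modelled with two 16-bit masks. This is a sketch of the policy described above, not of the actual interrupt hardware: it assumes that a lower interrupt number always means higher priority (the slide states this only for INT_00), and `next_interrupt` is a hypothetical name.

```python
NONMASKABLE = 0x000F  # INT_00 to INT_03 cannot be masked


def next_interrupt(pending, enabled):
    """Return the number of the interrupt to service next, or None.

    pending and enabled are 16-bit masks; bit n corresponds to INT_n.
    Assumes lower-numbered interrupts have higher priority.
    """
    eligible = pending & (enabled | NONMASKABLE)
    for number in range(16):
        if eligible & (1 << number):
            return number
    return None


# INT_04 and INT_07 pending, only INT_07 enabled: INT_07 is serviced.
chosen = next_interrupt(0b1001_0000, 0b1000_0000)
```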

Interrupt Performance Consideration
• Overhead for all CPU interrupts is 7 cycles
• Interrupt latency is 11 cycles
• Interrupts can be recognized every 2 cycles
• 2 occurrences of a specific interrupt can be
recognized in 2 cycles

Peripheral Set
• 2 multichannel buffered audio serial ports
• 2 inter-integrated circuit bus modules (I2Cs)
• 2 multichannel buffered serial ports (McBSPs)
• 3 32-bit general-purpose timers
• 1 user-configurable 16-bit or 32-bit host-port interface
(HPI16/HPI32)
• 1 16-pin general-purpose input/output port (GP0) with
programmable interrupt/event generation modes
• 1 32-bit glueless external memory interface (EMIFA), capable of
interfacing to synchronous and asynchronous memories and
peripherals.

Signals-Pins map
Signal name              Pin no.   Description
CLKIN                    H4        Clock input. This clock is the input to the on-chip PLL
PLLV                     J6        PLL voltage supply
AECLKIN                  H25       EMIFA external input clock
AARE/ASDCAS/ASADS/ASRE   J25       EMIFA asynchronous memory read-enable / SDRAM column-address
                                   strobe / programmable synchronous interface address strobe or
                                   read-enable
AAOE/ASDRAS/ASOE         J24       EMIFA asynchronous memory output-enable / SDRAM row-address
                                   strobe / programmable synchronous interface output-enable
AAWE/ASDWE/ASWE          K26       EMIFA asynchronous memory write-enable / SDRAM write-enable /
                                   programmable synchronous interface write-enable
AARDY                    L22       Asynchronous memory ready input
AHOLDA                   N22       EMIFA hold-request-acknowledge to the host
AHOLD                    V23       EMIFA hold request from the host
ABUSREQ                  P22       EMIFA bus request output
Signals-pins map
Signal name   Pin no.                   Description
ACE[1:3]      L26, K23, K24             EMIFA memory space enables: enabled by bits 28
                                        through 31 of the word address
ABE[0:7]      T,L[23,24], R,M[25,26]    EMIFA byte-enable control
APDT          M22                       EMIFA peripheral data transfer
AED[63:0]                               EMIFA external data
AEA[22:3]     T22, V24, V25, V26, U23,  EMIFA external address
              U24, U25, U26, T25, T26,
              R23, R24, P23, P24, P26,
              N23, N26, M23, M24
AECLKOUT1     J23                       EMIFA output clock 1 [at EMIFA input clock
                                        (AECLKIN, CPU/4 clock, CPU/6 clock) frequency]
AECLKOUT2     J26                       EMIFA output clock 2. Programmable by the EMIFA
                                        input clock
ASOE3         R22                       EMIFA synchronous memory output-enable for
                                        ACE3
RESET         AC7                       Device reset

Packaging – Top View
[Figure: package top view, with the A1 corner marked.]

Packaging - Bottom View
[Figure: package bottom view.]

Packaging
GLZ 532-PIN BALL GRID ARRAY (BGA) PACKAGE
[Figure: ball grid with rows lettered A through AF (bottom to top, skipping I, O, Q, S, X, and Z) and columns numbered 1 through 26.]

References
• TMS320C6000 CPU and Instruction Set Reference Guide. Texas Instruments.
http://www-s.ti.com/sc/psheets/spru189f/spru189f.pdf
• A BDTI Analysis of the Texas Instruments TMS320C64x. Staff of Berkeley Design Technology, Inc.
http://www.bdti.com/articles/c64_summary_report.pdf
• TMS320C6418 Fixed-Point Digital Signal Processor Data Manual. Texas Instruments.
http://focus.ti.com/lit/ds/sprs241b/sprs241b.pdf

Overflow Slides

Fetch
PG PS PW PR
[Figure: the CPU (functional units and registers) and memory, with the PG, PS, PW, and PR phases marked on the path between them.]
The phases of the fetch pipeline stage are:
• PG: Program address generate;
• PS: Program address send;
• PW: Program access ready wait;
• PR: Program fetch packet receive.

Decode
DP DC
The phases of the pipeline in the decode stage are:
• DP: Instruction dispatch (fetch packets are split into execute
packets, each consisting of one instruction or of two to eight
instructions that can be executed in parallel);
• DC: Instruction decode (the source registers, destination
registers, and associated paths are decoded for the execution of the
instructions in the functional units).

