Video/Imaging
Multi-Processing
Application Examples
W-CDMA
Radars
DSP Building Blocks Function/Application Specific
Digital Radios & Bit Slice Processors (MUL, etc.) ( MP)
High-End
Control
Modems
DSP µP and RISC
Voice Coding ( MP )
Instruments
µC and Analog µP
Low-End
Modems
Industrial
Control
Texas Instruments TMS320 Family
Multiple DSP µP Generations
Columns: Device; First Sample; Bit Size; Clock Speed (MHz); Instruction Throughput; MAC Execution (ns); MOPS; Device Density (# of transistors)
Uniprocessor
Based
(Harvard
Architecture)
TMS32010 1982 16 integer 20 5 MIPS 400 5 58,000 (3)
Multiprocessor
Based
TMS320C80 1996 32 integer/flt. 2 GOPS MIMD
120 MFLOP
TMS320C62XX 1997 16 integer 1600 MIPS 5 20 GOPS VLIW
TMS320C67XX 1997 32 flt. pt. 5 1 GFLOP VLIW
First Generation DSP µP Case Study
TMS32010 (Texas Instruments) - 1982
Features
200 ns instruction cycle (5 MIPS)
144 words (16 bit) on-chip data RAM
1.5K words (16 bit) on-chip program ROM - TMS32010
External program memory expansion to a total of 4K words at full
speed
16-bit instruction/data word
single cycle 32-bit ALU/accumulator
Single cycle 16 x 16-bit multiply in 200 ns
Two cycle MAC (5 MOPS)
Zero to 15-bit barrel shifter
Eight input and eight output channels
TMS32010 BLOCK DIAGRAM
TMS32010 Program Memory Maps
Microcomputer Mode Microprocessor Mode
External
Memory
1525
Space
Internal
Memory
Space Reserved
For Testing
1536
External
Memory
Space
4095 4095
Digital FIR Filter Implementation
(Uniprocessor-Circular Buffer)
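[Figure: circular buffer of samples X0 ... Xn-1 and coefficients a(n-1) ... a0. On each cycle the multiply-accumulate pass starts one position later in the buffer, the products are summed in the accumulator (Acc), and the starting value is replaced with the new value.]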
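The circular-buffer scheme in the figure can be sketched in a few lines of Python. This is a minimal illustration, not the TMS32010 program: the class name `CircularFIR` and its fields are hypothetical, and the buffer is modelled as a Python list rather than on-chip data RAM.

```python
class CircularFIR:
    """FIR filter whose delay line is a circular buffer (hypothetical helper)."""

    def __init__(self, coeffs):
        self.coeffs = list(coeffs)            # a0 ... a(n-1)
        self.buf = [0.0] * len(self.coeffs)   # delay line X0 ... X(n-1)
        self.newest = 0                       # index of the most recent sample

    def step(self, sample):
        n = len(self.buf)
        # Overwrite the oldest sample in place; no data is shifted.
        self.newest = (self.newest - 1) % n
        self.buf[self.newest] = sample
        acc = 0.0
        for k, a in enumerate(self.coeffs):
            # a[k] multiplies the sample that arrived k steps ago.
            acc += a * self.buf[(self.newest + k) % n]
        return acc
```

Feeding a unit impulse through the filter returns the coefficients in order, which is a quick way to check the index arithmetic.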
TMS32010 FIR FILTER PROGRAM
Indirect Addressing (Smaller Program Space)
Third Generation DSP µP Case Study
TMS320C30 - 1988
TMS320C30 Key Features
60 ns single-cycle instruction execution time
33.3 MFLOPS (million floating-point operations per second)
16.7 MIPS (million instructions per second)
One 4K x 32-bit single-cycle dual-access on-chip ROM block
Two 1K x 32-bit single-cycle dual-access on-chip RAM blocks
64 x 32-bit instruction cache
32-bit instruction and data words, 24-bit addresses
40/32-bit floating-point/integer multiplier and ALU
32-bit barrel shifter
TMS320C30 BLOCK DIAGRAM
TMS320C3x CPU BLOCK DIAGRAM
TMS320C3x MEMORY BLOCK DIAGRAM
TMS320C30 Memory Organization
Microprocessor Mode
0h - BFh: Interrupt locations & reserved (192), external STRB active
C0h - 7FFFFFh: External STRB Active
Microcomputer Mode
0h - BFh: Interrupt locations & reserved (192)
C0h - 0FFFh: ROM (Internal)
1000h - 7FFFFFh: External STRB Active
Both Modes
800000h - 801FFFh: Expansion Bus MSTRB Active (8K)
802000h - 803FFFh: Reserved (8K)
804000h - 805FFFh: Expansion Bus IOSTRB Active (8K)
806000h - 807FFFh: Reserved (8K)
808000h - 8097FFh: Peripheral Bus Memory Mapped Registers (Internal) (6K)
809800h - 809BFFh: RAM Block 0 (1K) (Internal)
809C00h - 809FFFh: RAM Block 1 (1K) (Internal)
80A000h - 0FFFFFFh: External STRB Active
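As a rough aid to reading the memory map, the sketch below looks up which region a 24-bit address falls in for either mode. It assumes the peripheral and upper external regions start at 808000h and 80A000h (reading the map's boundaries as 24-bit addresses); the names `COMMON`, `LOW`, and `region` are hypothetical.

```python
# Regions shared by both modes: (first address, last address, description).
COMMON = [
    (0x800000, 0x801FFF, "expansion bus, MSTRB active"),
    (0x802000, 0x803FFF, "reserved"),
    (0x804000, 0x805FFF, "expansion bus, IOSTRB active"),
    (0x806000, 0x807FFF, "reserved"),
    (0x808000, 0x8097FF, "peripheral bus memory-mapped registers"),
    (0x809800, 0x809BFF, "RAM block 0"),
    (0x809C00, 0x809FFF, "RAM block 1"),
    (0x80A000, 0xFFFFFF, "external, STRB active"),
]

# Regions below 800000h differ between the two modes.
LOW = {
    "microprocessor": [
        (0x000000, 0x0000BF, "interrupt locations and reserved"),
        (0x0000C0, 0x7FFFFF, "external, STRB active"),
    ],
    "microcomputer": [
        (0x000000, 0x0000BF, "interrupt locations and reserved"),
        (0x0000C0, 0x000FFF, "internal ROM"),
        (0x001000, 0x7FFFFF, "external, STRB active"),
    ],
}


def region(addr, mode):
    """Return the description of the region containing addr."""
    for first, last, name in LOW[mode] + COMMON:
        if first <= addr <= last:
            return name
    raise ValueError("address outside the 24-bit address space")
```

The only addresses whose meaning changes with the mode are C0h through 0FFFh, which map to internal ROM in microcomputer mode and to external memory otherwise.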
TMS320C30 FIR FILTER PROGRAM
TMS320C54x Internal Block Diagram
Architecture optimized for DSP
#1: CPU designed for efficient DSP processing
MAC unit, 2 Accumulators, Additional Adder,
Barrel Shifter
#2: Multiple busses for efficient data
and program flow
Four busses and large on-chip memory that
result in sustained performance near peak
#3: Highly tuned instruction set for
powerful DSP computing
Sophisticated instructions that execute in fewer
cycles, with less code and low power demands
Key #1: DSP engine
$$Y = \sum_{n=1}^{40} a_n x_n$$
x a
MPY
ADD
y
Key #1: MAC Unit
MAC *AR2+, *AR3+, A
MPY A
Fractional B
Mode Bit
ADD O
acc A acc B
Key #1: Accumulators + Adder
General-Purpose Math example: t = s+e-r
LD @s, A
acc A acc B ALU
ADD @e, A
MUX U Bus SUB @r, A
STL A, @t
A B MAC
Key #1: Barrel shifter
LD @X, 16, A
STH B, @Y
A B C D
Barrel Shifter
(-16 to +31)
S Bus
ALU E Bus
Key #1: Temporary register
LD @x, T
MPY @a, A
D X EXP A
Encoder B
MAC ALU
Key #2: Efficient data/program flow
#1: CPU designed for efficient DSP processing
MAC unit, 2 Accumulators, Additional Adder,
Barrel Shifter
#2: Multiple busses for efficient data
and program flow
Four busses and large on-chip memory that
result in sustained performance near peak
#3: Highly tuned instruction set for
powerful DSP computing
Sophisticated instructions that execute in fewer
cycles, with less code and low power demands
Key #2: Multiple busses
MAC *AR2+, *AR3+, A
P
INTERNAL
EXTERNAL
M
MEMORY
MEMORY
U D M
X C U
E X
S E
Central
C D
Arithmetic T MAC A B ALU SHIFTER
Logic Unit
M
Key #2: Pipeline
Prefetch Fetch Decode Access Read Execute
P F D A R E
Key #2: Bus usage
CNTL PC ARs
P
INTERNAL
EXTERNAL
MEMORY
MEMORY
U D M
X C U
E X
S E
Central
Arithmetic
Logic Unit
T MAC A B ALU SHIFTER
Key #2: Pipeline performance
CYCLES
P1 F1 D1 A1 R1 X1
P2 F2 D2 A2 R2 X2
P3 F3 D3 A3 R3 X3
P4 F4 D4 A4 R4 X4
P5 F5 D5 A5 R5 X5
P6 F6 D6 A6 R6 X6
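The staircase above can be generated programmatically, which also makes the cycle count explicit. This is a small sketch that assumes an ideal pipeline with no stalls; `pipeline_chart` and `total_cycles` are hypothetical helper names.

```python
STAGES = ["P", "F", "D", "A", "R", "X"]  # prefetch, fetch, decode, access, read, execute


def pipeline_chart(n_instructions):
    """For each instruction, map cycle number (1-based) to its stage label."""
    rows = []
    for i in range(n_instructions):
        rows.append({i + s + 1: f"{stage}{i + 1}" for s, stage in enumerate(STAGES)})
    return rows


def total_cycles(n_instructions):
    """Cycles until the last instruction leaves the final stage (no stalls)."""
    if n_instructions == 0:
        return 0
    return len(STAGES) + n_instructions - 1
```

Once the pipeline is full, one instruction completes per cycle, so six instructions need 11 cycles rather than 36.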
Key #3: Powerful instructions
#1: CPU designed for efficient DSP processing
MAC Unit, 2 Accumulators, Additional Adder,
Barrel Shifter
#2: Multiple busses for efficient data and
program flow
Four busses and large on-chip memory that
result in sustained performance near peak
#3: Highly tuned instruction set for
powerful DSP computing
Sophisticated instructions that execute in fewer
cycles, with less code and low power demands
Key #3: Advanced applications
C62x Architecture
TMS320C6201 Revision 2
Program Cache / Program Memory
32-bit address, 256-Bit data
512K Bits RAM
Memory
Interface
2 Timers
2 Multi-
Data Memory channel
buffered
32-Bit address, 8-, 16-, 32-Bit data serial ports
(T1/E1)
512K Bits RAM
C6201 Internal Memory
Architecture
Separate Internal Program and Data Spaces
Program
16K 32-bit instructions (2K Fetch Packets)
256-bit Fetch Width
Configurable as either
Direct Mapped Cache, Memory Mapped Program Memory
Data
32K x 16
Single Ported Accessible by Both CPU Data Buses
4 x 8K 16-bit Banks
2 Possible Simultaneous Memory Accesses (4 Banks)
4-Way Interleave, Banks and Interleave Minimize Access
Conflicts
C62x Interrupts
12 Maskable Interrupts , Non-Maskable Interrupt (NMI)
Interrupt Return Pointers (IRP, NRP)
Fast Interrupt Handling
Branches Directly to 8-Instruction Service Fetch Packet
Can Branch out with no overhead for longer service
7 Cycle Overhead : Time When No Code is Running
12 Cycle Latency : Interrupt Response Time
Interrupt Acknowledge (IACK) and Number (INUM)
Signals
Branch Delay Slots Protected From Interrupts
Edge Triggered
C62x Datapaths
Registers A0 - A15 Registers B0 - B15
1X 2X
S1 S2 D DL SL SL DL D S S D S S D S S S S D S S D S S D DL SL SL DL D S2 S1
1 2 1 2 1 2 2 1 2 1 2 1
L1 S1 M1 D1 D2 M2 S2 L2
DDATA_I1 DDATA_I2
(load data) (load data)
Cross Paths
40-bit Write Paths (8 MSBs)
40-bit Read Paths/Store Paths
Functional Units
L-Unit (L1, L2)
40-bit Integer ALU, Comparisons
Bit Counting, Normalization
S-Unit (S1, S2)
32-bit ALU, 40-bit Shifter
Bitfield Operations, Branching
M-Unit (M1, M2)
16 x 16 -> 32
D-Unit (D1, D2)
32-bit Add/Subtract
Address Calculations
C62x Datapaths
S1 S2 D DL SL SL DL D S S D S S D S S S S D S S D S S D DL SL SL DL D S2 S1
1 2 1 2 1 2 2 1 2 1 2 1
L1 S1 M1 D1 D2 M2 S2 L2
Cross Paths
DDATA_O1 DDATA_I1 DDATA_I2 DDATA_O2 40-bit Write Paths (8 MSBs)
(store data) (load data) (load data) (store data) 40-bit Read Paths/Store Paths
DADR1 DADR2
(address) (address)
C62x Instruction Packing
Instruction Packing: Advanced VLIW
Fetch Packet
CPU fetches 8 instructions/cycle
Execute Packet
CPU executes 1 to 8 instructions/cycle
Fetch packets can contain multiple execute packets
Parallelism determined at compile / assembly time
Examples
1) 8 parallel instructions: A // B // C // D // E // F // G // H
2) 8 serial instructions: A, B, C, D, E, F, G, H
3) Mixed Serial/Parallel Groups: A // B, C, D, E // F // G // H
Reduces Codesize, Number of Program Fetches, Power Consumption
C62x Pipeline Operation
Pipeline Phases
Fetch Decode Execute
PG PS PW PR DP DC E1 E2 E3 E4 E5
Single-Cycle Throughput
Operate in Lock Step
Fetch
PG Program Address Generate
PS Program Address Send
PW Program Access Ready Wait
PR Program Fetch Packet Receive
Decode
DP Instruction Dispatch
DC Instruction Decode
Execute
E1 - E5 Execute 1 through Execute 5
Execute Packet 1 PG PS PW PR DP DC E1 E2 E3 E4 E5
Execute Packet 2 PG PS PW PR DP DC E1 E2 E3 E4 E5
Execute Packet 3 PG PS PW PR DP DC E1 E2 E3 E4 E5
Execute Packet 4 PG PS PW PR DP DC E1 E2 E3 E4 E5
Execute Packet 5 PG PS PW PR DP DC E1 E2 E3 E4 E5
Execute Packet 6 PG PS PW PR DP DC E1 E2 E3 E4 E5
Execute Packet 7 PG PS PW PR DP DC E1 E2 E3 E4 E5
C62x Pipeline Operation
Delay Slots
Delay Slots: number of extra cycles until result is:
written to register file
available for use by a subsequent instruction
Multi-cycle NOP instruction can fill delay slots while
minimizing codesize impact
C6000 Instruction Set Features
Conditional Instructions
C6000 Instruction Set Addressing
Features
Load-Store Architecture
Two Addressing Units (D1, D2)
Orthogonal
Any Register can be used for Addressing or
Indexing
Signed/Unsigned Byte, Half-Word, Word,
Double-Word Addressable
Indexes are Scaled by Type
Register or 5-Bit Unsigned Constant
Index
C6000 Instruction Set Addressing
Features
Indirect Addressing Modes
Pre-Increment *++R[index]
Post-Increment *R++[index]
Pre-Decrement *--R[index]
Post-Decrement *R--[index]
Positive Offset *+R[index]
Negative Offset *-R[index]
15-bit Positive/Negative Constant Offset
from Either B14 or B15
C6000 Instruction Set Addressing
Features
Circular Addressing
Fast and Low Cost: Power of 2 Sizes and
Alignment
Up to 8 Different Pointers/Buffers, Up to 2
Different Buffer Sizes
Dual Endian Support
C67x Architecture
TMS320C6701 DSP
Block Diagram
Program Cache/Program Memory
32-bit address, 256-Bit data
512K Bits RAM
TMS320C6701
Advanced VLIW CPU (VelociTI™)
TMS320C6701
Memory /Peripherals
Same as ’C6201
External interface supports
SDRAM, SRAM, SBSRAM
4-channel bootloading DMA
16-bit host port interface
1Mbit on-chip SRAM
2 multichannel buffered serial ports (T1/E1)
Pin compatible with ’C6201
TMS320C67x CPU Core
’C67x Floating-Point CPU Core
Program Fetch
Control
Instruction Dispatch Registers
Instruction Decode
Data Path 1 Data Path 2 Control
Logic
A Register File B Register File
Test
Emulation
L1 S1 M1 D1 D2 M2 S2 L2
Interrupts
Floating-Point
Arithmetic Auxiliary
Logic Logic
Multiplier Capabilities
Unit
Unit Unit
C67x Interrupts
12 Maskable Interrupts
Non-Maskable Interrupt (NMI)
Interrupt Return Pointers (IRP, NRP)
Fast Interrupt Handling
Branches Directly to 8-Instruction Service Fetch Packet
7 Cycle Overhead: Time When No Code is Running
12 Cycle Latency : Interrupt Response Time
Interrupt Acknowledge (IACK) and Number
(INUM) Signals
Branch Delay Slots Protected From Interrupts
Edge Triggered
C67x New Instructions
.L Unit .M Unit .S Unit
Floating Point Arithmetic Unit
ABSSP
S1 S2 D DL SL SL DL D S S D S S D S S S S D S S D S S D DL SL SL DL D S2 S1
1 2 1 2 1 2 2 1 2 1 2 1
L1 S1 M1 D1 D2 M2 S2 L2
C67x Instruction Packing
Instruction Packing: Enhanced VLIW
Fetch Packet
CPU fetches 8 instructions/cycle
Execute Packet
CPU executes 1 to 8 instructions/cycle
Fetch packets can contain multiple execute packets
Parallelism determined at compile/assembly time
Examples
1) 8 parallel instructions: A // B // C // D // E // F // G // H
2) 8 serial instructions: A, B, C, D, E, F, G, H
3) Mixed Serial/Parallel Groups: A // B, C, D, E // F // G // H
Reduces
Codesize
Number of Program Fetches
Power Consumption
C67x Pipeline Operation
Pipeline Phases
Fetch Decode Execute
PG PS PW PR DP DC E1 E2 E3 E4 E5 E6 E7 E8 E9 E10
Operate in Lock Step
Fetch
PG Program Address Generate
PS Program Address Send
PW Program Access Ready Wait
PR Program Fetch Packet Receive
Decode
DP Instruction Dispatch
DC Instruction Decode
Execute
E1 - E5 Execute 1 through Execute 5
E6 - E10 Double Precision Only
C67x Pipeline Operation
Delay Slots
Delay Slots: number of extra cycles until result is:
written to register file
available for use by a subsequent instruction
Multi-cycle NOP instruction can fill delay slots while
minimizing codesize impact
TMS320C80 MIMD MULTIPROCESSOR
Texas Instruments - 1996
Copyright 1999
SIGNAL AND IMAGE PROCESSING ON THE TMS320C54x DSP
Accumulator architecture
Memory-register architecture
Introduction
Instruction set architecture
Vector dot product example
Pipelining
Algorithm acceleration
C compiler
Development tools and boards
Conclusion
Introduction to TMS320C54x
Roadmap
Instruction Set Architecture
Immediate
Operand is part of the instruction
ADD #0FFh
Absolute
Address of operand is part of the instruction
LD *(LABEL), A
Register
Operand is specified in a register
READA DATA ; (data read from address in accumulator A)
C54x Addressing Modes
Direct
Address of operand is part of the instruction (added to implied memory page)
ADD 010h,A
Indirect
Address of operand is stored in a register
Offset addressing
Register offset (ar1+ar0)
Autoincrement/decrement
Bit reversed addressing
Circular addressing
Examples:
ADD *AR1
ADD *AR1(10)
ADD *AR1+0
ADD *AR1+
ADD *AR1+B
ADD *AR1+0B
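Circular and bit-reversed addressing are easier to picture as index sequences. The sketch below models them abstractly in Python; it does not reproduce the exact hardware rules (buffer-size register, alignment, reverse-carry adder), and the function names are hypothetical.

```python
def circular_next(index, step, size):
    """Advance an index by step, wrapping inside a buffer of the given size."""
    return (index + step) % size


def bit_reversed_sequence(size):
    """Visit order 0 .. size-1 with the index bits reversed (size = power of 2)."""
    if size <= 0 or size & (size - 1):
        raise ValueError("size must be a positive power of two")
    bits = size.bit_length() - 1
    if bits == 0:
        return [0]
    return [int(format(i, f"0{bits}b")[::-1], 2) for i in range(size)]
```

Circular addressing keeps a pointer inside a delay line without explicit bounds checks, and the bit-reversed order is the one an in-place radix-2 FFT needs when reordering its input or output.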
Program Control
Conditional execution
XC n, cond [, cond [, cond ]] ; 23 possible conditions
Executes next n (1 or 2) words if conditions (cond) are met
Takes one cycle to execute
xc 1,ALEQ ; test for accumulator a <= 0
mac *ar1+,*ar2+,a ; perform MAC only if a <= 0
add #12,a,a ; always perform add
Scalar arithmetic
ABS Absolute value
SQUR Square
POLY Polynomial evaluation
Vector arithmetic acceleration
Each instruction operates on one element at a time
ABDST Absolute difference of vectors
SQDST Squared distance between vectors
SQURA Sum of squares of vector elements
SQURS Difference of squares of vector elements
rptz a,#39 ; zero accumulator a, repeat next
; instruction over 40 elements
squra *ar2+,a ; a += x(n)^2
C54x Instruction Set by Category
Arithmetic: ADD, MAC, MAS, MPY, NEG, SUB, ZERO
Logical: AND, BIT, BITF, CMPL, CMPM, OR, ROL, ROR, SFTA, SFTC, SFTL, XOR
Program Control: B, BC, CALL, CC, IDLE, INTR, NOP, RC, RET, RPT, RPTB, RPTZ, TRAP, XC
Application Specific: ABS, ABDST, DELAY, EXP, FIRS, LMS, MAX, MIN, NORM, POLY, RND, SAT, SQDST, SQUR, SQURA, SQURS
Data Management: LD, MAR, MV(D,K,M,P), ST
Notes
CMPL complement
CMPM compare memory
MAR modify address reg.
MAS multiply and subtract
Example: Vector Dot Product
Coefficients a(n)
Data x(n)
Prologue
Initialize pointers: ar2 for a(n) and ar3 for x(n)
Set accumulator (A) to zero
Inner loop
Multiply and accumulate a(n) and x(n)
Epilogue
Store the result into Y
Reg Meaning
AR2 &a(n)
AR3 &x(n)
A Y
; Initialize pointers ar2 and ar3 (not shown)
rptz a,#39 ; zero accumulator a
; repeat next instruction 40 times
mac *ar2+,*ar3+,a ; a += a(n) * x(n)
sth a,#Y ; store result in Y
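For readers unfamiliar with the repeat-and-MAC idiom, the three assembly lines above correspond to the loop below. This is only a Python paraphrase of the arithmetic; the function name `dot_product` is hypothetical, and fixed-point word lengths and saturation are not modelled.

```python
def dot_product(a, x):
    """Sum of a[n] * x[n], mirroring the rptz/mac/sth sequence."""
    if len(a) != len(x):
        raise ValueError("a and x must have the same length")
    acc = 0                              # rptz a,#39 zeroes accumulator A
    for coeff, sample in zip(a, x):      # ... and repeats the next instruction
        acc += coeff * sample            # mac *ar2+,*ar3+,a
    return acc                           # sth stores the result into Y
```

The pointer post-increments (`*ar2+`, `*ar3+`) are what `zip` stands in for here: both arrays are walked in step, one element per repetition.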
Pipelining
Sequential (Motorola 56000)
Fetch Decode Read Execute
Six-stage pipeline
Prefetch: load address of next instruction onto bus
Fetch: get next instruction
Decode: decode next instruction to determine type of memory
access for operands
Access: read operands address
Read: read data operand(s)
Execute: write data to bus
Instructions
1-3 words long (most are one word long)
1-6 cycles to execute (most take one cycle) not counting external
(off-chip) memory access penalty
TMS320C54x Pipeline
Block FIR Filtering
x in two h in
circular program
buffers memory
Accelerating Symmetric FIR Filtering
; Addresses: ar6 input buffer, ar7 output buffer
; ar4 array with x[n-4], x[n-3], x[n-2], x[n-1] for N = 8
; ar5 array with x[n-5], x[n-6], x[n-7], x[n-8] for N = 8
; Modulo addressing prevents need to reinitialize regs each sample
firtask: ld #firDP,dp ; initialize data page pointer
stm #frameSize-1,brc ; compute 256 outputs
rptbd firloop-1
stm #N/2,bk ; FIR circular buffer size
ld *ar6+,b ; load input value to accumulator b
mvdd *ar4,*ar5+0% ; move old x[n-N/2] to new x[n-N/2-1]
stl b,*ar4% ; replace oldest sample with newest
add *ar4+0%,*ar5+0%,a ; a = x[n] + x[n-N/2-1]
rptz b,#(N/2-1) ; zero accumulator b, do N/2-1 taps
firs *ar4+0%,*ar5+0%,coeffs ; b += a * h[i], do next a
mar *+ar4(2)% ; to load the next newest sample
mar *ar5+% ; position for x[n-N/2] sample
sth b,*ar7+ ; store the output sample
firloop: ret
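The saving behind the symmetric FIR routine is arithmetic: when h[i] = h[N-1-i], the two samples that share a coefficient can be added first, so only N/2 multiplies are needed. The sketch below checks that identity in Python; the function names and the convention `window[k] = x[n-k]` are assumptions for illustration, not part of the listing above.

```python
def fir_direct(h, window):
    """Plain N-tap FIR output; window[k] holds x[n-k]."""
    return sum(hk * xk for hk, xk in zip(h, window))


def fir_symmetric(h_half, window):
    """Same output using only the first N/2 coefficients of a symmetric filter."""
    n_taps = 2 * len(h_half)
    if len(window) != n_taps:
        raise ValueError("window must hold exactly N samples")
    acc = 0
    for i, coeff in enumerate(h_half):
        # Add the two samples that share coefficient h[i], then multiply once.
        acc += coeff * (window[i] + window[n_taps - 1 - i])
    return acc
```

For example, with `h_half = [1, 2, 3, 4]` both functions give the same result on any 8-sample window, but `fir_symmetric` performs four multiplies instead of eight.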
Accelerating LMS Filtering
Accelerating Polynomial Evaluation
ANSI C compiler
Intrinsics, in-line assembly and functions, pragmas
Optimizing C Code
Branch optimizations
Analyzes branching behavior and rearranges code to remove
branches or remove redundant conditions
Compiler Optimizations
Copy propagation
Following an assignment compiler replaces references to a
variable with its value
Compiler Optimizations
Expression simplification
Compiler simplifies expressions to equivalent forms requiring fewer
instructions/registers
/* Expression Simplification*/
g = (a + b) - (c + d); /* unoptimized */
g = ((a + b) - c) - d; /* optimized */
Inline expansion
Replaces calls to small run-time support functions with inline
code, saving function call overhead
Compiler Optimizations
Induction variables
Loop variables whose value directly depends on the number of
times a loop executes
Strength reduction
Loops controlled by counter increments are replaced by repeat
blocks
Efficient expressions are substituted for inefficient use of
induction variables (e.g., code that indexes into an array is
replaced with code that increments pointers)
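Strength reduction on an induction variable can be illustrated without a compiler. In the hypothetical sketch below, the first function recomputes each element address with a multiply, and the second keeps a running address that is incremented, which is the form a compiler would aim for; both names are made up for this example.

```python
def addresses_indexed(base, stride, count):
    """Address of element i computed as base + i * stride (one multiply each)."""
    return [base + i * stride for i in range(count)]


def addresses_reduced(base, stride, count):
    """Same addresses with the multiply replaced by a running increment."""
    out = []
    addr = base
    for _ in range(count):
        out.append(addr)
        addr += stride   # pointer-style increment instead of i * stride
    return out
```

Both functions return the same list, which is the property the optimizer must preserve when it rewrites the loop.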
Compiler Optimizations
Loop rotation
Evaluates loop conditionals at the bottom of loop
Auto-increment addressing
Converts C-increments into efficient address-register indirect
access
Hypersignal Block Diagram Environments
Software features
Compatible with TI Code Composer Studio
Supports TI C debugger, compiler, assembler, linker
http://www.ti.com/sc/docs/tools/dsp/c5000developmentboards.html
Sampling of Other C54 Boards
Vendor                  Board            RAM     ROM     Processor         I/O
Kane Computing          KC542/PC         256 kb  256 kb  40-MIP C5402      16-bit stereo
Innovative Integration  SBC54                            100-MIP C549      Modular
DSP Research            Tiger 549/PC     256 kb  256 kb  100-MIP C549
DSP Research            Tiger 5410/PC    256 kb  256 kb  100-MIP C5410
Odin Telesystems        VIDAR 5x4PCI     2 Mb    0 kb    four 80-MIP C548
DSP Research            Viper-12 549/PC  12 Mb   0 kb    12 100-MIP C549
http://www.ti.com/sc/docs/tools/dsp/c5000developmentboards.html
Binary-to-Binary Translation
http://www.ti.com/sc/docs/tools/dsp/tap5000freetool.html
Conclusion
TMS320C54x DSP
Design Workshop
Module 1
Introduction and Overview
Learning Objectives
DSP: Sum-of-Products
x a
MPY
ADD
MAC Unit Details
[Figure: MAC unit datapath with operand selection (s/u), multiplier (MPY) with FRCT, adder (ADD), and accumulators acc A and acc B]
Legend:
D = Data Bus
C = Coefficient Bus
P = Program Bus
A = A accumulator
B = B accumulator
T = Temporary register
s/u = signed/unsigned
FRCT = Fractional mode bit
Accumulators + ALU
A B C T D Shifter
LD @s, A
acc A acc B ALU
ADD @e, A
SUB @r, A
STL A, @t
MUX
A B MAC
Barrel Shifter
A B C D
ALU E BUS
LD @X, 16, A
STH B, @y
Temporary Register
A
D X EXP
B
T
ex: A = xa
LD @x, T
MPY @a, A
MAC ALU
'C54x Buses
P
D
M
INTERNAL U M EXTERNAL
MEMORY X C U MEMORY
E X
S E
C D
T MAC A B ALU SHIFT
Fetch and Read - Memory Interaction
‘C54x Pipeline - Enhanced
Memory Write
‘C54x Pipeline Hardware
P PC, PA
F Program Mem, PD
D Controller
A ARs, DA + CA , ARAUs
R
Data Mem, DD + CD
X ; AR, ARAU, EA
CALU (MAC, ALU)
; ED, Data Mem
'C54x Components and Bus Usage
CNTL PC ARs
D
M
INTERNAL M EXTERNAL
U
MEMORY C U MEMORY
X
X
E
S E
Pipeline Performance
TIME
P1 F1 D1 A1 R1 X1
P2 F2 D2 A2 R2 X2
P3 F3 D3 A3 R3 X3
P4 F4 D4 A4 R4 X4
P5 F5 D5 A5 R5 X5
P6 F6 D6 A6 R6 X6
P
54x
D
P1 F1 D1 A1 R1 X1
P2 F2 D2 A2 R2 X2
P3 F3 D3 A3 R3 X3
P4 -- -- -- F4 D4 A4 R4 X4
-- -- -- P5 F5 D5 A5 R5 X5
-- -- -- P6 F6 D6 A6 R6
Pipeline Flow: Internal and External Memories
54x 54x
P or D
D P
P1 F1 D1 A1 R1 X1
P2 F2 D2 A2 R2 X2
P3 F3 D3 A3 R3 X3
P4 F4 D4 A4 R4 X4
P5 F5 D5 A5 R5 X5
P6 F6 D6 A6 R6 X6
NO CONFLICT
Pipeline: Internal Memory Only
'C54x
ROM DARAM RAM
4K 2K 8K
4K 2K 8K
. . .
. . .
. . .
P
MAC ALU
EXT
EXT
9000
Internal
ROM ?
E000
FF80 DROM ROM ?
VECTORS FFFF
FFFF
'C541 Program Memory Options
All External 28K ROM** 'RAM' Option
MP/MC = 1 MP/MC = 0 OVLY = 1
0000 0000 0000
0080
RAM
1400
EXT EXT
EXT
9000
2K ROM
9800 2K ROM
A000
4K ROM
B000
4K ROM EXT
C000
4K ROM or
D000
4K ROM ROM
E000
4K ROM
FF80 F000
VECTORS* 4K ROM w VECs * VECTORS*
FFFF FFFF FFFF
RAM b
0800
EXT
RAM c
RAM a
0C00
RAM d
1000
E000
EXT or ROM RAM e
'C54x Memory Mix
'C54x Peripheral Mix
'C54x Review - CALU
CALU supports:
General-purpose operations:
MAC
ALU
Special functions:
CSSU (Viterbi)
EXP (Norm)
FIRS: MAC + ALU
16- or 32-bit operations:
C16 mode (Double)
Long operations
'C54x Review - System
Debugger Screen
Simulator Quick Reference
Window Management
  Selecting Window
    F6: rotates to next window
    WIN <name>: selects <name> window
    Click window frame: select window
    F4: close selected window
  Moving Inside Window
    Up Arrow / Down Arrow
    Page Up / Page Down
    Click on window frame arrows
    For DISASSEM window: type ADDR <value>
    For MEMORY window: type MEM <value>
  Moving Window
    Click on top of frame; drag to new location
    Type MOVE and use arrows or type coordinates
  Sizing Window
    Click on bottom right corner; drag to new shape
    Type SIZE and use arrows or type coordinates
    ZOOM: click on top left corner
    UNZOOM: click again on top left corner
  Screen Configuration
    SCONFIG <name>: load configuration <name>
    SSAVE <name>: save configuration <name>
  Modes (or hot keys or mouse clicks)
    ASM: display ASM info, or <Alt> D,A
    C: display C info, or <Alt> D,C
    MIX: display both ASM and C, or <Alt> D,M
Running Code
  Reset
    Type RESET: forces PC to zero
    Type RESTART: return to "entry point"
  Stepping
    F8 or type STEP: one step
    F10 or type NEXT: condense subroutines
    Type STEP <n>: <n> steps
    Type NEXT <n>: <n> nexts
  Running
    RUN: run until <Esc> or breakpoint
    RUNB: run with benchmark
    GO <label>: run to <label>
Watches and Breakpoints
  Operation  Watch  Breakpoint
  ADD        WA     BA
  RESET      WR     BR
  LIST       WL     BL
  DELETE     WD #   BD #
Other Actions
  ? <label>: display value of <label>
  ? <label> = <n>: load <label> with <n>
  file <name>: load file <name> to file window
  TAB: scroll to prior commands
  SHIFT TAB: scroll to subsequent commands
  F9: alternate form of mouse click
  TAKE <name>: simulator 'batch' file
  LOAD <name>: download file <name>
Entry/Exit
  SIM54xw <file>: start simulator with <file>.out
  QUIT: exit simulator
  SYSTEM: go to DOS shell
Texas Instruments
TMS320C64x
CS433
Processor Presentation Series
Prof. Luddy Harrison
256-bit data
TMS320C64x CPU
Program fetch
Instruction dispatch
Functional units: Instruction decode
6 ALUs
(L1, L2, S1, S2, D1, D2) Register file A Register file B
2 multipliers (M1, M2)
Instruction Fetch
Control
Instruction Dispatch Registers
Advanced Instruction Packet
Instruction Decode Control
Logic
Data Path A Data Path B
A Register File B Register File Test
A31-A16 B31-B16
A15-A0 B15-B0 Advanced
In-Circuit
Emulation
.L1 .S1 .M1 .D1 .D2 .M2 .S2 .L2 Interrupt
Control
L1 Data Cache
2-Way Set-Associative
16KBytes
FIFO Timer 2
C64x DSP Core
SRAM Timer 1
Instruction Fetch Advanced
ROM/FLASH Timer 0 In-Circuit
Emulation
I/O Devices
McBSP0 .D1 .D2
McBSP1 Enhanced
L2 Cache
DMA
Memory
Controller
McASP0 512KBytes
(EDMA) L1 Data Cache
2-Way Set-Associative
McASP1
16KBytes
HPI16
or Oscillator and Power Down
HPI32 PLL (x1, x5-x12, Logic
x16, x18, x19-
x22, x24)
I2C0
I2C1
Boot Configuration
GP0
16
The data path of C64x has the following components:
Two load-from-memory data paths;
Two store-to-memory data paths;
Two data address paths;
Two register file data cross paths;
[Figure: data path A (.L1, .S1, .M1, .D1, register file A, A0-A31) and data path B (.D2, .M2, .S2, .L2, register file B, B0-B31), with load paths, store paths (ST1a/ST1b, ST2a/ST2b) and data address paths (DA1, DA2)]
CS433 Prof. Luddy Harrison Copyright 2005 University of Illinois 149
[Figure: C64x data paths A and B. Each functional unit (.L1, .S1, .M1, .D1 on side A; .D2, .M2, .S2, .L2 on side B) has src1, src2 and dst ports, with long src / long dst (8-bit extension) ports on the .L and .S units. Register file A (A0-A31) and register file B (B0-B31) connect to the load paths LD1a/LD1b and LD2a/LD2b (32 LSB / 32 MSB), the store paths ST1a/ST1b and ST2a/ST2b (32 LSB / 32 MSB), the address paths DA1 and DA2, the cross paths 1X and 2X, and the control register.]
.M unit (.M1, .M2) 16x16, 16x32, quad 8x8, dual 16x16 32x32-bit fixed-point multiply
quad 8x8 multiply operations operations
Bit expansion Floating-point multiply
Variable shift operation operations
Rotation
Galois Field Multiply
The 1X cross path allows the functional units of data path A to read
their source from register file B, and the 2X cross path allows the
functional units of data path B to read their source from register file A.
All eight of the functional units have access to the register file on the
opposite side, via a cross path;
The src2 inputs of .M, .S and .D units are selectable between the cross
path and the same-side register file. Both src1 and src2 inputs of .L units
are selectable between the cross path and the same-side register file;
Since there are only two cross paths, the limit is one source read from
each data path's opposite register file per cycle, or a total of two cross
path reads per cycle.
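The cross-path limit stated above is easy to check mechanically for a candidate execute packet. The sketch below is a simplified model, not an assembler rule set: each instruction is reduced to a hypothetical `(unit, cross_source)` pair, where `cross_source` is the register read over the cross path or `None`.

```python
def cross_path_ok(packet):
    """True if each side reads at most one source over its cross path."""
    # Units ending in "1" sit on data path A (1X), units ending in "2" on B (2X).
    sources = {"1": set(), "2": set()}
    for unit, cross_source in packet:
        if cross_source is not None:
            sources[unit[-1]].add(cross_source)
    return all(len(regs) <= 1 for regs in sources.values())
```

A packet with one cross-path read on each side passes, while two different opposite-side registers read by units on the same side in one cycle is rejected.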
Store path:
Side A:
• ST1a is the write path for the 32 least significant bits;
• ST1b is the write path for the 32 most significant bits;
Side B:
• ST2a is the write path for the 32 least significant bits;
• ST2b is the write path for the 32 most significant bits;
T1: T2:
LD1 (LD1a, LD1b) LD2 (LD2a, LD2b)
ST1 (ST1a, ST1b) ST2 (ST2a, ST2b)
Instruction Encoding (1/4)
Operations on the .L Unit
[Figure: 32-bit instruction field layouts (bit positions 31 to 0) for the functional-unit formats. The last format shown has the fields creg (3 bits), z, cst (21 bits), 0 0 1 0 0, s, p.]
IDLE
Bits 31-18: Reserved (14 bits)
Bits 17-2: 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0
Bit 1: s
Bit 0: p
NOP
Bits 31-18: Reserved (14 bits)
Bit 17: 0
Bits 16-13: src (4 bits)
Bits 12-1: 0 0 0 0 0 0 0 0 0 0 0 0
Bit 0: p
PG PS PW PR DP DC E1 E2 E3 E4 E5
PG PS PW PR
Fetch
PG: Program address Generate
Program Address is generated in the CPU
PS: Program address Send
Program Address is sent to memory for a read operation
PW: Program access ready Wait
Memory read occurs
PR: Program fetch packet Receive
Fetch packet is received at the CPU
DP DC
Decode
DP: Instruction Dispatch
Fetch packets are split into execute packets
Instructions in the execute packets are assigned to the
appropriate functional units
DC: Instruction Decode
Source and destination registers and associated paths are
decoded for use by the functional units
Execute
Execute 2 (E2): For load instructions, the address is sent to memory. For store
instructions, the address and data are sent to memory.
Single-cycle instructions that saturate results set the SAT bit in the
control status register (CSR) if saturation occurs.
Execute 3 (E3): Data memory accesses are performed. Any multiply instruction that
saturates results sets the SAT bit in the control status register (CSR)
if saturation occurs.
Execute 4 (E4): For load instructions, data is brought to the CPU boundary. The
results of multiply extensions are written to a register file.
[Figures: pipeline usage by instruction type. Single-cycle instructions on .L, .S, .M or .D read and write the register file in E1. Multiply instructions on .M read operands in E1 and write results in E2 (1 delay slot). Store instructions on .D compute the address in E1, send address and data to the memory controller in E2, and access memory in E3. Load instructions on .D follow the same path and return data through the memory controller in E4 to the register file in E5.]
PG PS PW PR DP DC E1
Branch
PG PS PW PR DP DC E1
target
5 delay slots
Since branch target has to wait until it reaches the E1 phase to begin
execution, the branch takes five delay slots before the branch target
code executes.
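Delay slots translate directly into scheduling arithmetic. The sketch below tabulates the delay-slot counts mentioned in this section (1 for multiplies, 5 for branches) and assumes 0 for single-cycle instructions and 4 for loads, the latter inferred from loads completing in E5; the table and function names are hypothetical.

```python
# Delay slots per instruction class. "single" and "load" are assumptions
# inferred from the pipeline phases; "multiply" and "branch" come from the text.
DELAY_SLOTS = {"single": 0, "multiply": 1, "load": 4, "branch": 5}


def result_ready_cycle(issue_cycle, kind):
    """First cycle in which a dependent instruction can use the result."""
    return issue_cycle + DELAY_SLOTS[kind] + 1


def nops_needed(kind, useful_between=0):
    """NOP cycles left to fill after scheduling other useful work in the slots."""
    return max(0, DELAY_SLOTS[kind] - useful_between)
```

For instance, a branch with no useful instructions scheduled behind it needs a 5-cycle NOP, while a load followed by two independent instructions needs only 2 more cycles of padding.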
[Figure: a fetch packet of eight 32-bit instructions (bits 31 to 0 each); bit 0 of every instruction is its p-bit.]
The boundaries of execute packets within a fetch packet are determined by a bit in each
instruction, the p-bit. The p-bit determines whether the instruction executes in parallel with
the next instruction. If the p-bit of instruction i is 1, then instruction i+1 is executed in
parallel with (in the same cycle as) instruction i; otherwise instruction i+1 is executed in the
cycle after instruction i. Thus, the last p-bit in a fetch packet is always 0. Packets can be:
• Fully serial (all p-bits are 0);
• Fully parallel (all p-bits except the last one are set to 1);
• Partially serial;
Example: for instructions A to H with p-bits 1 1 0 0 1 0 1 0, the execute packets are:
Cycle 1: A B C
Cycle 2: D
Cycle 3: E F
Cycle 4: G H
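The p-bit rule can be applied mechanically to split a fetch packet into execute packets. The following is a minimal sketch of that rule in Python; `split_execute_packets` is a hypothetical name, and instructions are represented by plain labels rather than 32-bit words.

```python
def split_execute_packets(instructions, p_bits):
    """Group instructions into execute packets according to their p-bits."""
    if len(instructions) != len(p_bits):
        raise ValueError("need one p-bit per instruction")
    packets, current = [], []
    for ins, p in zip(instructions, p_bits):
        current.append(ins)
        if p == 0:              # p = 0: the next instruction starts a new cycle
            packets.append(current)
            current = []
    if current:
        raise ValueError("last p-bit of a fetch packet must be 0")
    return packets
```

Applied to the p-bit pattern 1 1 0 0 1 0 1 0 for instructions A to H, it yields the four groups ABC, D, EF and GH.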
Specifying Execute Packets in
Assembly
Code lines with preceding double vertical bars, ||, will be executed in
parallel with the previous instruction.
Example:
InstructionA
|| InstructionB
|| InstructionC
InstructionD
InstructionE
|| InstructionF
InstructionG
|| InstructionH
31 0 31 0 31 0 31 0 31 0 31 0 31 0 31 0
1 1 0 0 1 0 1 0
TI TMS C64x:
ADD .L2 B12, B1, B12 || MUL .M1 A1,A10,A2 || STW .D2x A2,0(B12)
ADD .L2 B12, B1, B12 || STW .D2x A2,0(B12)
EMIF A
SBSRAM
Instruction fetch
ZBT RAM EMIF B Data path A Data path B
Enhanced L2
DMA Memory
FIFO Controller 1024K Register file Register
(64-channel) A file B
bytes
SRAM
L2 Data cache
2-way set-associative
16 K Bytes total
2-level cache;
Single internal program memory port with an
instruction-fetch bandwidth of 256 bits;
EMIF A signals
Data: AED[63:0] (64)
Address: AEA[22:3] (20)
Memory map space select: ACE3, ACE2, ACE1, ACE0
Byte enables: ABE7, ABE6, ABE5, ABE4, ABE3, ABE2, ABE1, ABE0
External memory interface control: AECLKIN, AECLKOUT1, AECLKOUT2, ASDCKE, AARE/ASDCAS/ASADS/ASRE, AAOE/ASDRAS/ASOE, AAWE/ASDWE/ASWE, AARDY, ASOE3, APDT
Bus arbitration: AHOLD, AHOLDA, ABUSREQ
EMIF B signals
Data: BED[15:0] (16)
Address: BEA[22:3] (20)
Memory map space select: BCE3, BCE2, BCE1, BCE0
Byte enables: BBE1, BBE0
External memory interface control: BECLKIN, BECLKOUT1, BECLKOUT2, BSDCKE, BARE/BSDCAS/BSADS/BSRE, BAOE/BSDRAS/BSOE, BAWE/BSDWE/BSWE, BARDY, BSOE3, BPDT
Bus arbitration: BHOLD, BHOLDA, BBUSREQ
L2 memory configurations (SRAM / cache split):
512K SRAM, no cache
480K SRAM, 32K Cache (4 way assoc.)
448K SRAM, 64K Cache (4 way assoc.)
384K SRAM, 128K Cache (4 way assoc.)
256K SRAM, 256K Cache (4 way assoc.)
L2 block boundaries:
256 KBytes SRAM: up to 0x0003 FFFF
128 KBytes SRAM: 0x0004 0000 - 0x0005 FFFF
64 Kbytes SRAM: 0x0006 0000 - 0x0006 FFFF
32 Kbytes SRAM: 0x0007 0000 - 0x0007 7FFF
32 Kbytes SRAM: 0x0007 8000 - 0x0007 FFFF
A1 Corner
Top View
Packaging - Bottom View
[Figure: package bottom view; ball grid with rows A to AF and columns 1 to 26.]
PG PS PW PR
[Figure: the fetch phases PG, PS, PW and PR shown on the path between the CPU (functional units, registers) and memory.]
The phases of the fetch pipeline stage are:
PG: Program address generate;
PS: Program address send;
PW: Program access ready wait;
PR: Program fetch packet receive;
DP DC