DSP SHARC Processors PART2

Analog Devices
SHARC
CS 433
Processor Presentation Series
Prof. Luddy Harrison
CS 433 Prof. Luddy Harrison Copyright 2005 University of Illinois 1

Note on this presentation series
z These slide presentations were prepared by
students of CS433 at the University of Illinois at
Urbana-Champaign
z All the drawings and figures in these slides were
drawn by the students. Some drawings are based
on figures in the manufacturer’s documentation for
the processor, but none are electronic copies of
such drawings
z You are free to use these slides provided that you
leave the credits and copyright notices intact

Overview
z Processor History
z Physical packaging
z Data paths, register files, computational units
z Pipelining, timing information
z Memory
z Instruction Set Architecture (ISA)
z Applications targeted
z Systems employing the SHARC

SHARC Features
z Super Harvard ARChitecture

z Unique CISC architecture allows simultaneous fetch of
two operands and an instruction in one cycle
z Combines high performance DSP core with
integrated, on-chip system features
z Dual-ported (processor and I/O) SRAM
z DMA Controller
z Selective Instruction Cache
z Cache only those instructions whose fetches conflict with
program memory data accesses
SHARC Processor History
z ADSP-2106x (2000)
z Single computational units based on predecessor
ADSP-2100 Family
z 40 MHz core
z ADSP-2116x (2001)
z SIMD (Single-Instruction Multiple-Data) dual computational
unit architecture added
z 150-200 MHz core, 1-2 MB RAM
z ADSP-2126x, ADSP-2136x (2003 – Future)
z Integrated audio-centric peripherals (128-140 dB
Sample Rate Conversion) added
z 333-400 MHz core, 2-3 MB RAM
ADSP-2106x Overview
[Figure: ADSP-2106x block diagram. Core processor: timer, instruction cache, DAG1, DAG2, program sequencer, data register file (16 x 40-bit), multiplier, barrel shifter, ALU. Dual-ported SRAM: two independent dual-ported blocks (Block 0 and Block 1), each with a processor port and an I/O port. Buses: PM address bus, DM address bus, PM data bus, DM data bus, bus connect (PX). External port: address bus mux, data bus mux, multiprocessor interface, host port. I/O processor: IOP registers (control, status and data buffers), DMA controller, serial ports (2), link ports (6).]
ADSP-2106x Core
z Computational Units
z ALU, Multiplier, and Shifter can all perform independent
operations in a single cycle
z Register File
z Two sets (primary and alternate) of 16 registers, each
40-bits wide
z Program Sequencer and Data Address
Generators
z Allows computational units to operate independent of
instruction fetch and program counter increment

ADSP-2106x Packaging
[Figure: ADSP-2106x system connections. 1x clock on CLKIN; boot select pins EBOOT, LBOOT, BMS with a boot EPROM (CS, ADDR, DATA); external memory and peripherals on ADDR31-0 and DATA47-0 with RD, WR, ACK, MS3-0, PAGE, SBTS, SW, ADRCLK; DMA devices on DMAR1-2 and DMAG1-2; host processor interface on CS, HBR, HBG, REDY; link devices on LxCLK, LxACK, LxDAT; two serial devices on TCLKx, RCLKx, TFSx, RFSx, DTx, DRx (x = 0, 1); multiprocessing pins RPBA, ID2-0, BR1-6, CPA; control pins IRQ, FLAG, TIMEXP.]

ADSP-2106x Key Pins
PIN         FUNCTION                   NOTE
ADDR31-0    External Bus Address
DATA47-0    External Bus Data
MS3-0       Memory Select Lines        Asserted (low) as chip selects memory bank
PAGE        DRAM Page Boundary         Asserted if a page boundary is crossed
DMAR(1-2)   DMA Request 1 and 2
IRQ2-0      Interrupt Request Lines    Edge-triggered or level-sensitive

ADSP-2106x Registers
z Data Registers
z R15 – R0 (fixed-point), F15 – F0 (floating-point)
z Program Sequencer
z PC (program counter), PCSTKP (PC stack pointer),
FADDR (fetch address), etc.
z Data Address Generator
z I7 – I0 (DAG1 index), M7 – M0 (DAG1 modify)
z L7 – L0 (DAG1 length), B7 – B0 (DAG1 base)
z Bus Exchange, Timer, and System Registers

ADSP-2106x Buses
z Address
z Program Memory Address – 24 bits wide
z Data Memory Address – 32 bits wide
z Data
z Program Memory Data – 48 bits wide
z Stores instructions and data for dual-fetches
z Data Memory Data – 40 bits wide
z Stores data operands
z One PM Data bus and/or one DM Data bus register
file access per cycle

ADSP-2106x I/O
z Serial Ports
z Operate at clock rate of processor
z DMA
z Port data can be automatically transferred to and
from on-chip memory

ADSP-2106x DMA
z I/O port block transfers (link/serial)
z External memory block transfers
z DMA Channel setup by writing memory buffer
parameters to DMA parameter registers
z Starting Address for Buffer
z Address Modifier
z Word Count
z Interrupt generated when transfer completes (i.e.
Word Count = 0)
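
The channel setup described above can be sketched as a toy software model. This is only an illustration of the three parameters and the completion interrupt; the class name, field names, and the example address are assumptions and do not correspond to actual SHARC register names.

```python
from dataclasses import dataclass


@dataclass
class DmaChannel:
    """Toy model of one DMA channel's parameter registers (illustrative names)."""
    address: int             # starting address for the buffer
    modifier: int            # address modifier added after each word
    count: int               # words still to transfer
    interrupt: bool = False  # set when the word count reaches zero

    def transfer_word(self, memory: dict, word: int) -> None:
        """Store one word, then update the address and word count."""
        if self.count == 0:
            return  # transfer already complete
        memory[self.address] = word
        self.address += self.modifier
        self.count -= 1
        if self.count == 0:
            self.interrupt = True  # transfer complete: raise the interrupt


memory = {}
channel = DmaChannel(address=0x20000, modifier=1, count=4)
for sample in [10, 20, 30, 40]:
    channel.transfer_word(memory, sample)
```

After the fourth word the count is zero and `channel.interrupt` is set, which mirrors the "Word Count = 0" condition on the slide.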

ADSP-2106x DMA Registers

[Figure: DMA control register layout, bits 31-0. Fields shown:]
z DEN: DMA enable
z CHEN: DMA chaining enable
z TRAN: DMA channel direction
z PS: packing status
z DTYPE: data type
z PMODE: packing mode
z MSWF: most significant word first
z MASTER: DMA master mode
z HSHAKE: DMA handshake
z INTIO: single-word interrupts
z EXTERN: external devices to external memory
z FLSH: flush external port FIFO
z FS: external port FIFO

ADSP-2106x Pipelining
z Three phases
z Fetch
z Read from cache or program memory
z Decode
z Generate conditions for instruction
z Execute
z Operations specified by instruction completed

ADSP-2106x Branching and Pipelining
z Branches
z Delayed
z Two instructions after branch are executed
z Non-delayed
z Program sequencer suppresses instruction execution for
next two instructions

CLOCK CYCLES →
Fetch                   n+2   j       j+1     j+2
Decode                  n+1   n+2     j       j+1
Execute (non-delayed)   n     no-op   no-op   j
Execute (delayed)       n     n+1     n+2     j

ADSP-2106x Memory
On-Chip SRAM   ADSP-21060   ADSP-21062   ADSP-21061
Total Size     500 KB       250 KB       125 KB
z On-chip support for:
z 48-bit instructions (and 40-bit extended precision floating-point data)
z 32-bit floating-point data
z 16-bit short word data
z Off-chip memory up to 4 GB

ADSP-2106x Memory (2)

[Figure: ADSP-2106x internal memory map.]
0x0000 0000              IOP registers
0x0000 0100 - 0x0001 FFFF  Reserved address space
0x0002 0000 - 0x0003 FFFF  Normal word addressing: Block 0 from 0x0002 0000, Block 1 from 0x0003 0000 (128K x 32-bit words or 80K x 40-bit words)
0x0004 0000 - 0x0007 FFFF  Block 0 from 0x0004 0000, Block 1 from 0x0006 0000
The last two address ranges represent the same physical memory.
ADSP-2106x Memory (3)

z Memory divided into blocks
z Dual-ported (PM and DM bus share one port, I/O
bus uses the other)
z Allows independent access by processor core and I/O
processor
z Each block can be accessed by both in every cycle
z Typical DSP applications (digital filters, FFTs, etc.)
access two operands at once, such as a filter
coefficient and a data sample, so allowing single-
cycle execution is a must

ADSP-2106x Shadow Write

z Due to the need for high-speed operation,
memory writes go to a two-deep FIFO
z On write, data in FIFO from previous write is
loaded to memory and new data enters FIFO
z Reads of last two written locations are
intercepted and re-routed to the FIFO
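
The behaviour above can be illustrated with a small software model. It is a sketch under the assumption that the oldest pending write retires to memory when a new write arrives and the FIFO already holds two entries; the class and method names are hypothetical.

```python
from collections import deque


class ShadowWriteMemory:
    """Toy model of memory fronted by a two-deep write FIFO."""

    def __init__(self):
        self.memory = {}
        self.fifo = deque()  # pending (address, data) pairs, oldest first

    def write(self, address, data):
        """Queue a write; retire the oldest pending write if the FIFO is full."""
        if len(self.fifo) == 2:
            old_address, old_data = self.fifo.popleft()
            self.memory[old_address] = old_data
        self.fifo.append((address, data))

    def read(self, address):
        """Reads of pending locations are served from the FIFO, newest first."""
        for pending_address, pending_data in reversed(self.fifo):
            if pending_address == address:
                return pending_data
        return self.memory.get(address, 0)


mem = ShadowWriteMemory()
mem.write(0x10, 1)
mem.write(0x11, 2)
mem.write(0x12, 3)  # pushes the write to 0x10 out to memory
```

At this point only location 0x10 has reached `mem.memory`, yet reads of 0x11 and 0x12 still return the newly written values because they are intercepted in the FIFO.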

ADSP-2106x Instruction Cache

z Sequencer checks instruction cache on every
program memory data access
z Allows PM bus to be used for data fetches
instead of being tied up with an instruction
fetch
z When fetch conflict first occurs, instruction is
cached to prevent the same delay from
happening again

ADSP-2106x Instruction Cache (2)

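[Figure: instruction cache organization. Columns: LRU bit, valid bit, instruction, address bits 23-4, address bits 3-0. The cache has 16 sets (Set 0 to Set 15) with two entries each (Entry 0 and Entry 1). Address bits 3-0 select the set: 0000 for Set 0, 0001 for Set 1, ..., 1110 for Set 14, 1111 for Set 15.]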
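
The organization above (16 sets of two entries, set chosen by address bits 3-0, bits 23-4 stored as the tag) can be sketched as a lookup routine. This is a simplified model with hypothetical names; it assumes `lookup` is called only for instruction fetches that conflict with a program memory data access, since only those are cached.

```python
class InstructionCache:
    """Toy 2-way set-associative cache: 16 sets, LRU replacement."""

    def __init__(self):
        # Each set holds up to two tags, least recently used first.
        self.sets = [[] for _ in range(16)]

    def lookup(self, address):
        """Return True on a hit; on a miss, cache the address and return False."""
        index = address & 0xF            # address bits 3-0 pick the set
        tag = (address >> 4) & 0xFFFFF   # address bits 23-4 are the tag
        entries = self.sets[index]
        if tag in entries:
            entries.remove(tag)
            entries.append(tag)          # mark as most recently used
            return True
        if len(entries) == 2:
            entries.pop(0)               # evict the least recently used entry
        entries.append(tag)
        return False


cache = InstructionCache()
results = [cache.lookup(a) for a in [0x100, 0x110, 0x100, 0x120, 0x110]]
```

All five addresses map to Set 0, so the third access hits, the fourth evicts the tag for 0x110, and the fifth misses again.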
ADSP-2106x ISA Overview

z 24 operations, although some have more than one
syntactical form
z Instruction Types
z Compute and Move
z Compute operation in parallel with data moves or index
register modify
z Program Flow Control
z Branch, Call, Return, Loop
z Immediate Data Move
z Operand or addressing immediate fields
z Miscellaneous
z Bit Modify and Test, No-op, etc.

ADSP-2106x ISA
Compute and Move
z Instructions follow the format

IF condition op1, op2;
z IF and condition are optional
z op1 is an optional compute instruction
z op2 is an optional data move instruction

ADSP-2106x ISA
Compute Examples
z Single function
z F6 = (F2 + F3);
z Multi-function
z Distinct parallel operations supported
z Parallel computations and data transfers
z R1 = R2 * R6, M4 = R0;
z Simultaneous multiplier and ALU operations
z R1 = R2 * R6, F6 = F2 + F3;

ADSP-2106x ISA
Single function Compute

22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
0 CU OPCODE RN RX RY
z CU specifies
z 00 – ALU
z 01 – Multiplier
z 10 – Shifter
z OPCODE indicates operation type (add, subtract, etc.)
z RN specifies result register
z RX and RY specify operand registers
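
A decoder for this field can be sketched as follows. The slide gives the field order but not the widths, so the layout used here is an assumption: bit 22 = 0, CU in bits 21-20, OPCODE in bits 19-12, and RN, RX, RY in bits 11-8, 7-4, 3-0. The opcode value in the example is arbitrary.

```python
def decode_single_function(field):
    """Split a 23-bit single-function compute field into its parts.

    Assumed layout: bit 22 = 0, CU = bits 21-20, OPCODE = bits 19-12,
    RN = bits 11-8, RX = bits 7-4, RY = bits 3-0.
    """
    if (field >> 22) & 1:
        raise ValueError("bit 22 is set: this is a multi-function compute")
    units = {0b00: "ALU", 0b01: "Multiplier", 0b10: "Shifter"}
    return {
        "unit": units.get((field >> 20) & 0b11, "reserved"),
        "opcode": (field >> 12) & 0xFF,
        "rn": (field >> 8) & 0xF,
        "rx": (field >> 4) & 0xF,
        "ry": field & 0xF,
    }


# ALU operation with result register 6 and operand registers 2 and 3.
example = (0b00 << 20) | (0x81 << 12) | (6 << 8) | (2 << 4) | 3
decoded = decode_single_function(example)
```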

ADSP-2106x ISA
Multi-function Compute
z Parallel ALU and Multiplier operations
22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
1 OPCODE RM RA RXM RYM RXA RYA
z Registers restricted to particular sets

z Multiplier X: R3 – R0, Y: R7 – R4
z ALU X: R11 – R8, Y: R15 – R12
z OPCODE specifies ALU op, for example:
z 000100: Rm = R3-0 * R7-4, Ra = R11-8 + R15-12;
z 011111: Rm = R3-0 * R7-4, Ra = MIN(R11-8, R15-12);

ADSP-2106x ISA
Program Flow Control
IF condition JUMP/CALL, ELSE op2;
z IF, condition, ELSE are optional

z JUMP/CALL is a JUMP or CALL instruction
z op2 is an optional compute instruction

ADSP-2106x ISA
Program Flow Control (2)

DO <addr24> UNTIL termination;
z No optional fields
z <addr24> is the loop end address (the address of the last instruction in the loop)
z termination is the loop ending condition to
check after each iteration

ADSP-2106x ISA
Program Flow Examples
z Conditional Execution
z IF GT R1 = R2 * R6;
z IF NE JUMP label2;
z Also used for Call/Return
main: CALL routine;
routine: ...
RTS; /*return to main*/

ADSP-2106x ISA
Immediate Data Move
ureg = <data32>;
DM(<data32>, Ia) = ureg;
PM(<data24>, Ia) = ureg;
z Ia is an optional indirect addressor
z DM is a 32-bit data memory address
z PM is a 24-bit program memory address

ADSP-2106x ISA
Addressing Examples
z Direct
z JUMP <data24>;
z Relative to Program Counter
z JUMP (PC, <data24>);
z Register Indirect (using DAG registers)
z Pre-Modify (modification pre-address calculation)
z JUMP (M0, I0);
z Post-Modify (modification post-address calculation)

z JUMP (I0, M0);
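
The difference between the two register-indirect forms can be sketched in a few lines. This assumes the usual DAG behaviour: pre-modify computes I + M and leaves the index register unchanged, while post-modify uses I as the address and then updates I by M. The function names and register values are illustrative.

```python
def pre_modify(i_regs, m_regs, i, m):
    """(Mm, In): address is I + M; the index register is left unchanged."""
    return i_regs[i] + m_regs[m]


def post_modify(i_regs, m_regs, i, m):
    """(In, Mm): address is I; afterwards the index register becomes I + M."""
    address = i_regs[i]
    i_regs[i] += m_regs[m]
    return address


i_regs = [0x100]  # I0
m_regs = [4]      # M0

pre_address = pre_modify(i_regs, m_regs, 0, 0)
i0_after_pre = i_regs[0]
post_address = post_modify(i_regs, m_regs, 0, 0)
i0_after_post = i_regs[0]
```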

ADSP-2116x Overview
z Extension of 2106x, adding a 150 MHz core and SIMD
(Single-Instruction Multiple-Data) support via dual hardware
[Figure: ADSP-2116x SIMD core. The program sequencer sends the same instruction to both processing elements; different data goes to each element over the PM and DM data buses (with bus connect). Each element has its own data register file, multiplier, barrel shifter, and ALU.]

ADSP-2116x SIMD Engine

z Dual hardware allows same instruction to be
executed across different data
z 2 ALUs, multipliers, shifters, register files
z Two data values transferred with each memory or register
file access
z Very effective for stereo channel processing
z Can effectively double performance over similar
algorithms running on ADSP-2106x processors

ADSP-2116x SIMD Engine (2)
z Enabled/disabled via MODE1 bit

z When disabled, processor simply acts in SISD mode
z Program sequencer must be aware of status flags
set by each set of hardware elements
z Conditional compute operations can be specified
on both, either, or neither hardware set
z Conditional branches and loops executed by
AND’ing the condition tests on both hardware sets

ADSP-2116x SIMD Engine (3)

Instruction   Mode   Transfer 1           Transfer 2
Rx = Ry;      SISD   Rx loaded from Ry    n/a
              SIMD   Rx loaded from Ry    Sx loaded from Sy
Sx = Sy;      SISD   Sx loaded from Sy    n/a
              SIMD   Sx loaded from Sy    Rx loaded from Ry
Rx = Sy;      SISD   Rx loaded from Sy    n/a
              SIMD   Rx loaded from Sy    Sx loaded from Ry
Sx = Ry;      SISD   Sx loaded from Ry    n/a
              SIMD   Sx loaded from Ry    Rx loaded from Sy
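
The pattern in this table is that SIMD mode adds a second transfer with R and S swapped on both sides. A minimal sketch, using a dictionary of registers and a hypothetical `register_move` helper:

```python
def register_move(regs, dst, src, simd):
    """Apply 'dst = src' to a register dict keyed like 'R1' or 'S1'.

    In SIMD mode a second, complementary transfer is performed with the
    R and S prefixes swapped on both the destination and the source.
    """
    partner = {"R": "S", "S": "R"}
    old = dict(regs)  # both transfers read the values from before the move
    regs[dst] = old[src]
    if simd:
        regs[partner[dst[0]] + dst[1:]] = old[partner[src[0]] + src[1:]]
    return regs


# "R1 = S2;" in SIMD mode: R1 is loaded from S2 and S1 is loaded from R2.
simd_regs = register_move({"R1": 0, "R2": 5, "S1": 0, "S2": 9}, "R1", "S2", simd=True)
# The same instruction in SISD mode performs only the first transfer.
sisd_regs = register_move({"R1": 0, "R2": 5, "S1": 0, "S2": 9}, "R1", "S2", simd=False)
```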

ADSP-2126x Overview
z Direct extension of 2116x, instructions are fully
backward compatible
z Core increased to 150-200 MHz w/ 1MB SRAM
z Data buses increased from 32 to 64 bits
z Synchronous, independent serial ports increased
from 2 to 6
z ROM-based security
z Prevents piracy of code and algorithms
z Prevents peripheral devices from reading on-chip
memory

ADSP-2136x Overview
[Figure: ADSP-2136x block diagram. Core processor: instruction cache, timer, DAG1, DAG2, program sequencer, processing elements PEX and PEY, PX register. Four blocks of on-chip memory (Block 0 to Block 3) made up of SRAM and ROM (sizes shown: 2M bit, 1M bit, 0.5M bit), each block with its own address and data ports. Buses: PM address bus, DM address bus, PM data bus, DM data bus. I/O processor and peripherals: IOP registers, SPI, SPORTs, IDP, signal routing unit, timers, SRC, SPDIF.]

ADSP-2136x Overview (2)
z Direct extension of 2126x, instructions are fully
backward compatible
z On-chip memory expanded from 2 to 4 blocks
z Digital Audio Interface (DAI) set of audio
peripherals
z Interrupt controller, interface data port, signal routing
unit, clock generators, and timers
z Different units contain S/PDIF receiver/transmitter,
sample rate converters, or DTCP encrypting engine

SHARC Benchmarks
z Algorithm benchmarks supplied by manufacturer:

                          2106x        2116x        2126x        2136x
Clock Cycle               66 MHz       100 MHz      200 MHz      333 MHz
Instruction Cycle Time    15 ns        10 ns        6.67 ns      3 ns
MFLOPS Sustained          132 MFLOPS   400 MFLOPS   600 MFLOPS   1332 MFLOPS
MFLOPS Peak               198 MFLOPS   600 MFLOPS   900 MFLOPS   1998 MFLOPS
FIR Filter (per tap)      15 ns        5 ns         2.5 ns       1.5 ns
IIR Filter (per biquad)   61 ns        20 ns        10 ns        6 ns
Divide (y/x)              91 ns        30 ns        20 ns        9 ns

Applications Targeted
z SHARC designed to
z Simplify Development
z Speed time to Market
z Reduce Product Costs
z Products targeted
z A/V Receivers
z 7.1 Surround Sound Decoding
z Mixing Consoles
z Digital Synthesizers
z Automobiles

Systems Employing the SHARC

z SRS Circle Surround II
z Melody (w/ Auto Room Tuner)
z Metric Halo's Portable Pro Audio Hub
z Alacron FT-P5

SHARC in SRS Circle Surround II

z Full multi-channel surround sound from
simple right/left stereo sound
z Encoding can be transmitted over standard
stereo medium (broadcast television, radio,
etc.) and maintains full backward
compatibility

SHARC in SRS Circle Surround II (2)

z Output from each source is combined in constant
phase filter banks and encoded in quadrature to
prevent signal cancellation
z “Positional bias generator” analyzes ratios between
left and right surround signals which multipliers then
apply to the opposing left or right output
z Decoder uses this level imbalance to direct the
surround information to the correct output

SHARC Melody
z “Optimized Surround Sound for the Mass
Market”
z Core of high-fidelity audio decoders in
Denon, Bose, and Kenwood products
z Auto Room Tuner (ART) integrated
software simplifies setup of complex audio
systems

SHARC Melody ART

z Automatically measures and corrects multi-
channel sound system for room’s acoustics
z Corrects system deficiencies
z For each speaker, ART calculates:
z Sound pressure level (SPL)
z Distance of each speaker from listener
z Frequency response

SHARC in Metric Halo's Portable
Pro Audio Hub

z Portable FireWire-based recording device, used in
live recording applications by motion pictures and
major recording artists like “No Doubt” and “Dave
Matthews Band”
z Serial ports used to interface to digital and mixed-
signal peripheral devices
z Initially implemented on SHARC ADSP-2106x, later
upgraded to ADSP-2126x
z Future hybrid implementation will use an ADSP-2106x
for FireWire processing coupled with an
ADSP-2126x for audio processing

SHARC in Alacron FT-P5

z COTS (Commercial Off-The-Shelf) system for use in
“distributed, compute intensive, high data rate
applications” in commercial and military industries
z Supports 1 to 96 ADSP-2106x processors
z Makes extensive use of SHARC’s DMA through
external PMC interface, supporting full-duplex
communication in excess of 1 GB/sec
z In-cabinet SAN clusters
z Compute nodes in distributed systems

SHARC vs. RISC Processors

z RISC is...
z Less costly to design, test, and manufacture,
since processors are less specialized
z But...
z Parallel (stereo) computation requires two or more
interconnected processors accessing shared
memory
z Lower performance

Conclusion
z SHARC offers a great deal of computational
power, with on-chip SRAM and SIMD
architecture
z Variety of applications (especially audio
processing) exploit it

Citations
z Processor details taken from product
manuals and descriptions at
http://www.analog.com

Other ISAs

• Next, we discuss some alternative instruction set designs.
– Different ways of specifying memory addresses
– Different numbers and types of operands in ALU instructions
– A couple of advanced instruction sets
• VLIW (Very Long Instruction Word)
– Texas Instruments C64
– Analog Devices TigerSHARC
• ARM and Thumb

Addressing modes

• The first instruction set design issue we’ll see is addressing modes, which let you specify memory addresses in various ways.
– Each mode has its own assembly language notation.
– Different modes may be useful in different situations.
– The location that is actually used is called the effective address.
• The addressing modes that are available will depend on the datapath.
– Our simple datapath only supports two forms of addressing.
– Older processors like the 8086 have zillions of addressing modes.
• We’ll introduce some of the more common ones.
December 8, 2003 Other ISA's 1 December 8, 2003 Other ISA's 2
Immediate addressing

• One of the simplest modes is immediate addressing, where the operand itself is accessed.
LD R1, #1999        R1 ← 1999
• This mode is a good way to specify initial values for registers.
• We’ve already used immediate addressing several times.
– It appears in the string conversion program you just saw.

Direct addressing

• Another possible mode is direct addressing, where the operand is a constant that represents a memory address.
LD R1, 500          R1 ← M[500]
• Here the effective address is 500, the same as the operand.
• This is useful for working with pointers.
– You can think of the constant as a pointer.
– The register gets loaded with the data at that address.
Register indirect addressing

• We already saw register indirect mode, where the operand is a register that contains a memory address.
LD R1, (R0)         R1 ← M[R0]
• The effective address would be the value in R0.
• This is also useful for working with pointers. In the example above,
– R0 is a pointer, and R1 is loaded with the data at that address.
– This is similar to R1 = *R0 in C or C++.
• So what’s the difference between direct mode and this one?
– In direct mode, the address is a constant that is hard-coded into the program and cannot be changed.
– Here the contents of R0, and hence the address being accessed, can easily be changed.

Stepping through arrays

• Register indirect mode makes it easy to access contiguous locations in memory, such as elements of an array.
• If R0 is the address of the first element in an array, we can easily access the second element too:
LD R1, (R0)         // R1 contains the first element
ADD R0, R0, #1
LD R2, (R0)         // R2 contains the second element
• This is so common that some instruction sets can automatically increment the register for you:
LD R1, (R0)+        // R1 contains the first element
LD R2, (R0)+        // R2 contains the second element
• Such instructions can be used within loops to access an entire array.
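
The auto-increment form can be modelled in a few lines to show how a loop walks an array. This is a sketch only; the memory contents, the starting address, and the helper name are made up for illustration.

```python
memory = {100: 7, 101: 8, 102: 9}   # a three-element array starting at address 100
registers = {"R0": 100}             # R0 points at the first element


def load_post_increment(reg):
    """Model LD Rx, (reg)+ : read M[reg], then add 1 to reg."""
    value = memory[registers[reg]]
    registers[reg] += 1
    return value


# One auto-incrementing load per loop iteration reads the whole array.
array = [load_post_increment("R0") for _ in range(3)]
```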

Indexed addressing

• Operands with indexed addressing include a constant and a register.
LD R1, 500(R0)      R1 ← M[R0 + 500]
• The effective address is the register data plus the constant. For instance, if R0 = 25, the effective address here would be 525.
• We can use this addressing mode to access arrays also.
– The constant is the array address, while the register contains an index into the array.
– The example instruction above might be used to load the 25th element of an array that starts at memory location 500.
• It’s possible to use negative constants too, which would let you index arrays backwards.

PC-relative addressing

• We’ve seen PC-relative addressing already. The operand is a constant that is added to the program counter to produce the effective memory address.
200: LD R1, $30     R1 ← M[201 + 30]
• The PC usually points to the address of the next instruction, so the effective address here is 231 (assuming the LD instruction itself uses one word of memory).
• This is similar to indexed addressing, except the PC is used instead of a regular register.
• Relative addressing is often used in jump and branch instructions.
– For instance, JMP $30 lets you skip the next 30 instructions.
– A negative constant lets you jump backwards, which is common in writing loops.
Indirect addressing

• The most complicated mode that we’ll look at is indirect addressing.
LD R1, [360]        R1 ← M[M[360]]
• The operand is a constant that specifies a memory location which refers to another location, whose contents are then accessed.
• The effective address here is M[360].
• Indirect addressing is useful for working with multi-level pointers, or “handles.”
– The constant represents a pointer to a pointer.
– In C, we might write something like R1 = **ptr.

Addressing mode summary

Mode                Notation             Register transfer equivalent
Immediate           LD R1, #CONST        R1 ← CONST
Direct              LD R1, CONST         R1 ← M[CONST]
Register indirect   LD R1, (R0)          R1 ← M[R0]
Indexed             LD R1, CONST(R0)     R1 ← M[R0 + CONST]
Relative            LD R1, $CONST        R1 ← M[PC + CONST]
Indirect            LD R1, [CONST]       R1 ← M[M[CONST]]
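
The six modes can be compared side by side with a small function that computes the effective address for each. This is an illustrative sketch; the function name and argument names are assumptions, and immediate mode returns `None` because the operand is the data itself rather than an address.

```python
def effective_address(mode, const=0, reg=0, pc=0, memory=None):
    """Return the effective address for one addressing mode.

    const is the constant in the instruction, reg the value of R0,
    pc the program counter (address of the next instruction).
    """
    if mode == "immediate":
        return None           # no memory access: the operand is the value
    if mode == "direct":
        return const          # M[CONST]
    if mode == "register indirect":
        return reg            # M[R0]
    if mode == "indexed":
        return reg + const    # M[R0 + CONST]
    if mode == "relative":
        return pc + const     # M[PC + CONST]
    if mode == "indirect":
        return memory[const]  # M[M[CONST]]
    raise ValueError(f"unknown mode: {mode}")


direct_ea = effective_address("direct", const=500)
indexed_ea = effective_address("indexed", const=500, reg=25)
relative_ea = effective_address("relative", const=30, pc=201)
indirect_ea = effective_address("indirect", const=360, memory={360: 42})
```

The indexed and relative results reproduce the 525 and 231 worked out on the earlier slides.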
Number of operands

• Another way to classify instruction sets is according to the number of operands that each data manipulation instruction can have.
• Our example instruction set had three-address instructions, because each one had up to three operands: two sources and one destination.

operation   operands        Register transfer instruction:
ADD         R0, R1, R2      R0 ← R1 + R2
(R0 is the destination; R1 and R2 are the sources)

• This provides the most flexibility, but it’s also possible to have fewer than three operands.

Two-address instructions

• In a two-address instruction, the first operand serves as both the destination and one of the source registers.

operation   operands        Register transfer instruction:
ADD         R0, R1          R0 ← R0 + R1
(R0 is the destination and source 1; R1 is source 2)

• Some other examples and the corresponding C code:

ADD R3, #1          R3 ← R3 + 1         R3++;
MUL R1, #5          R1 ← R1 * 5         R1 *= 5;
NOT R1              R1 ← R1’            R1 = ~R1;

One-address instructions

• Some computers, like this old Apple II, have one-address instructions.
• The CPU has a special register called an accumulator, which implicitly serves as the destination and one of the sources.

operation   source          Register transfer instruction:
ADD         R0              ACC ← ACC + R0

• Here is an example sequence which increments M[R0]:

LD (R0)             ACC ← M[R0]
ADD #1              ACC ← ACC + 1
ST (R0)             M[R0] ← ACC

The ultimate: zero addresses

• If the destination and sources are all implicit, then you don’t have to specify any operands at all!
• This is possible with processors that use a stack architecture.
– HP calculators and their “reverse Polish notation” use a stack.
– The Java Virtual Machine is also stack-based.
• How can you do calculations with a stack?
– Operands are pushed onto a stack. The most recently pushed element is at the “top” of the stack (TOS).
– Operations use the topmost stack elements as their operands. Those values are then replaced with the operation’s result.
Stack architecture example

• From left to right, here are three stack instructions, and what the stack looks like after each example instruction is executed.

PUSH R1          PUSH R2          ADD
R1               R2               R1 + R2          (Top)
… stuff 1 …      R1               … stuff 1 …
… stuff 2 …      … stuff 1 …      … stuff 2 …
                 … stuff 2 …                       (Bottom)

• This sequence of stack operations corresponds to one register transfer instruction:
TOS ← R1 + R2

Data movement instructions

• Finally, the types of operands allowed in data manipulation instructions is another way of characterizing instruction sets.
– So far, we’ve assumed that ALU operations can have only register and constant operands.
– Many real instruction sets allow memory-based operands as well.
• We’ll use the book’s example and illustrate how the following operation can be translated into some different assembly languages.
X = (A + B)(C + D)
• Assume that A, B, C, D and X are really memory addresses.
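
For comparison with the translations that follow, the same expression X = (A + B)(C + D) can be evaluated on a zero-address stack machine. This is a sketch with an assumed instruction set: PUSH and POP name a memory cell (unlike the PUSH R1 example above, which pushes a register), while ADD and MUL take no operands at all.

```python
def run_stack_program(program, memory):
    """Execute a list of stack instructions against a dict of named memory cells."""
    stack = []
    for op, *args in program:
        if op == "PUSH":
            stack.append(memory[args[0]])   # push M[name]
        elif op == "POP":
            memory[args[0]] = stack.pop()   # store the top of stack into M[name]
        elif op in ("ADD", "MUL"):
            right = stack.pop()
            left = stack.pop()
            stack.append(left + right if op == "ADD" else left * right)
        else:
            raise ValueError(f"unknown instruction: {op}")
    return memory


program = [
    ("PUSH", "A"), ("PUSH", "B"), ("ADD",),   # TOS = A + B
    ("PUSH", "C"), ("PUSH", "D"), ("ADD",),   # TOS = C + D
    ("MUL",),                                 # TOS = (A + B) * (C + D)
    ("POP", "X"),
]
result = run_stack_program(program, {"A": 1, "B": 2, "C": 3, "D": 4})
```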
Register-to-register architectures

• Our programs so far assume a register-to-register, or load/store, architecture, which matches our datapath from last week nicely.
– Operands in data manipulation instructions must be registers.
– Other instructions are needed to move data between memory and the register file.
• With a register-to-register, three-address instruction set, we might translate X = (A + B)(C + D) into:

LD R1, A            R1 ← M[A]           // Use direct addressing
LD R2, B            R2 ← M[B]
ADD R3, R1, R2      R3 ← R1 + R2        // R3 = M[A] + M[B]
LD R1, C            R1 ← M[C]
LD R2, D            R2 ← M[D]
ADD R1, R1, R2      R1 ← R1 + R2        // R1 = M[C] + M[D]
MUL R1, R1, R3      R1 ← R1 * R3        // R1 has the result
ST X, R1            M[X] ← R1           // Store that into M[X]

Memory-to-memory architectures

• In memory-to-memory architectures, all data manipulation instructions use memory addresses as operands.
• With a memory-to-memory, three-address instruction set, we might translate X = (A + B)(C + D) into simply:

ADD X, A, B         M[X] ← M[A] + M[B]
ADD T, C, D         M[T] ← M[C] + M[D]  // T is temporary storage
MUL X, X, T         M[X] ← M[X] * M[T]

• How about with a two-address instruction set?

MOVE X, A           M[X] ← M[A]         // Copy M[A] to M[X] first
ADD X, B            M[X] ← M[X] + M[B]  // Add M[B]
MOVE T, C           M[T] ← M[C]         // Copy M[C] to M[T]
ADD T, D            M[T] ← M[T] + M[D]  // Add M[D]
MUL X, T            M[X] ← M[X] * M[T]  // Multiply
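
To check that the three-address and two-address memory-to-memory translations compute the same value, both can be run through a tiny interpreter. This is an illustrative sketch: memory is a dictionary keyed by name, and the sample values for A, B, C, D are arbitrary.

```python
def run_memory_program(program, memory):
    """Execute memory-to-memory MOVE/ADD/MUL instructions on named cells."""
    for op, dst, *srcs in program:
        if op == "MOVE":
            memory[dst] = memory[srcs[0]]
            continue
        if op not in ("ADD", "MUL"):
            raise ValueError(f"unknown instruction: {op}")
        # Two-address form: the destination doubles as the first source.
        a, b = srcs if len(srcs) == 2 else (dst, srcs[0])
        memory[dst] = memory[a] + memory[b] if op == "ADD" else memory[a] * memory[b]
    return memory


three_address = [("ADD", "X", "A", "B"), ("ADD", "T", "C", "D"), ("MUL", "X", "X", "T")]
two_address = [("MOVE", "X", "A"), ("ADD", "X", "B"),
               ("MOVE", "T", "C"), ("ADD", "T", "D"), ("MUL", "X", "T")]

data = {"A": 1, "B": 2, "C": 3, "D": 4}
x_three = run_memory_program(three_address, dict(data))["X"]
x_two = run_memory_program(two_address, dict(data))["X"]
```

Both programs leave (1 + 2) * (3 + 4) in X; the two-address version simply needs two extra MOVE instructions to avoid overwriting its sources.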

Register-to-memory architectures

• Finally, register-to-memory architectures let the data manipulation instructions access both registers and memory.
• With two-address instructions, we might do the following:

LD R1, A            R1 ← M[A]           // Load M[A] into R1 first
ADD R1, B           R1 ← R1 + M[B]      // Add M[B]
LD R2, C            R2 ← M[C]           // Load M[C] into R2
ADD R2, D           R2 ← R2 + M[D]      // Add M[D]
MUL R1, R2          R1 ← R1 * R2        // Multiply
ST X, R1            M[X] ← R1           // Store

Size and speed

• There are lots of tradeoffs in deciding how many and what kind of operands and addressing modes to support in a processor.
• These decisions can affect the size of machine language programs.
– Memory addresses are long compared to register file addresses, so instructions with memory-based operands are typically longer than those with register operands.
– Permitting more operands also leads to longer instructions.
• There is also an impact on the speed of the program.
– Memory accesses are much slower than register accesses.
– Longer programs require more memory accesses, just for loading the instructions!
• Most newer processors use register-to-register designs.
– Reading from registers is faster than reading from RAM.
– Using register operands also leads to shorter instructions.
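
The size tradeoff can be made concrete with a rough estimate for the four translations of X = (A + B)(C + D). The field widths below (8-bit opcode, 4-bit register number, 16-bit memory address) are hypothetical, chosen only to show how operand kinds drive instruction length; real encodings differ.

```python
OPCODE_BITS, REG_BITS, ADDR_BITS = 8, 4, 16  # hypothetical field widths


def program_bits(program):
    """Total size of a program in bits.

    Each instruction is a string of operand kinds:
    'R' for a register operand, 'M' for a memory address.
    """
    widths = {"R": REG_BITS, "M": ADDR_BITS}
    return sum(OPCODE_BITS + sum(widths[kind] for kind in instr) for instr in program)


programs = {
    "register-to-register, 3-address": ["RM", "RM", "RRR", "RM", "RM", "RRR", "RRR", "MR"],
    "memory-to-memory, 3-address": ["MMM", "MMM", "MMM"],
    "memory-to-memory, 2-address": ["MM", "MM", "MM", "MM", "MM"],
    "register-to-memory, 2-address": ["RM", "RM", "RM", "RM", "RR", "MR"],
}
sizes = {name: program_bits(program) for name, program in programs.items()}
```

Under these assumed widths the eight-instruction load/store program is no smaller than the three-instruction memory-to-memory one, because every load and store still carries a full address. The estimate counts instruction bits only, not the data memory accesses each instruction performs at run time.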
Texas Instruments C64
VLIW signal processor

TI C64: Architecture
[Figure: TMS320C64x block diagram. Program cache/program memory (32-bit addresses, 256-bit data) feeds the TMS320C64x CPU: program fetch, instruction dispatch, instruction decode, then register file A with units .L1, .S1, .M1, .D1 and register file B with units .D2, .M2, .S2, .L2. Data cache/data memory (32-bit address; 8-, 16-, 32-, 64-bit data).]
Functional units:
• 6 ALUs (L1, L2, S1, S2, D1, D2)
• 2 multipliers (M1, M2)
TMS320C64x Data Paths
[Figure: C64x data paths A and B. Data path A: units .L1, .S1, .M1, .D1 with register file A (A0-A31). Data path B: units .D2, .M2, .S2, .L2 with register file B (B0-B31). Each unit has src1, src2, and dst ports; the .L and .S units also have long src and long dst ports. Load paths (LD1a, LD1b), store paths (ST1a, ST1b, ST2a, ST2b), and data address paths (DA1, DA2) connect to memory.]
The data path of C64x has the following components:
• Two load-from-memory data paths;
• Two store-to-memory data paths;
• Two data address paths;
• Two register file data cross paths;

TI C64: Functional Units (Structure)
• Each functional unit has its own 32-bit write port into a GPR. Each functional unit reads directly from its own data path;
• All units ending in 1 write to register file A, and all units ending in 2 write to register file B;
• Each functional unit has two 32-bit read ports for source operands src1 and src2;
• L and S units have an extra 8-bit-wide port for 40-bit long writes, as well as an 8-bit input for 40-bit long reads;
• Each C64x multiplier can return up to a 64-bit result;

TI C64: .L (.L1 and .L2) Unit Operations Performed
• 32/40-bit arithmetic and compare operations
• 32-bit logical operations
• Leftmost 1 or 0 counting for 32 bits
• Normalization count for 32 and 40 bits
• Byte shifts
• Data packing/unpacking
• 5-bit constant generation
• Vector Operations:
– Dual 16-bit arithmetic operations
– Quad 8-bit arithmetic operations
– Dual 16-bit min/max operations
– Quad 8-bit min/max operations

.S (.S1 and .S2) Unit Operations Performed
• 32-bit arithmetic operations
• 32/40-bit shifts and 32-bit bit-field operations
• 32-bit logical operations
• Branches
• Constant generation
• Register transfers to/from control register file (.S2 only)
• Byte shifts
• Data packing/unpacking
• Vector Operations
– Dual 16-bit compare operations
– Quad 8-bit compare operations
– Dual 16-bit shift operations
– Dual 16-bit saturated arithmetic operations
– Quad 8-bit saturated arithmetic operations

.M (.M1 and .M2) Unit Operations Performed
• 16 x 16 multiply operations
• 16 x 32 multiply operations
• Vector Operations
– Quad 8 x 8 multiply operations
– Dual 16 x 16 multiply operations
– Dual 16 x 16 multiply with add/subtract operations
– Quad 8 x 8 multiply with add operation
• Bit expansion
• Bit interleaving/de-interleaving
• Variable shift operations
• Rotation
• Galois Field Multiply

.D (.D1 and .D2) Unit Operations Performed
• 32-bit add, subtract, linear and circular address calculation (for circular arrays)
• Loads and stores with 5-bit constant offset
• Loads and stores with 15-bit constant offset (.D2 only)
• Load and store double words with 5-bit constant
• Load and store non-aligned words and double words
• 5-bit constant generation
• 32-bit logical operations
Instruction to Functional Unit Mapping

.L Unit: ABS, ADD, ADDU, AND, CMPEQ, CMPGT, CMPGTU, CMPLT, CMPLTU, LMBD, MV, NEG, NORM, NOT, OR, SADD, SAT, SSUB, SUB, SUBU, SUBC, XOR, ZERO

.M Unit: MPY, MPYU, MPYUS, MPYSU, MPYH, MPYHU, MPYHUS, MPYHSU, MPYHL, MPYHLU, MPYHULS, MPYHSLU, MPYLH, MPYLHU, MPYLUHS, MPYLSHU, SMPY, SMPYHL, SMPYLH, SMPYH

.S Unit: ADD, ADDK, ADD2, AND, B disp, B IRP†, B NRP†, B reg, CLR, EXT, EXTU, MV, MVC†, MVK, MVKH, MVKLH, NEG, NOT, OR, SET, SHL, SHR, SHRU, SSHL, SUB, SUBU, SUB2, XOR, ZERO

.D Unit: ADD, ADDAB, ADDAH, ADDAW, LDB, LDBU, LDH, LDHU, LDW, LDB (15-bit offset)‡, LDBU (15-bit offset)‡, LDH (15-bit offset)‡, LDHU (15-bit offset)‡, LDW (15-bit offset)‡, MV, STB, STH, STW, STB (15-bit offset)‡, STH (15-bit offset)‡, STW (15-bit offset)‡, SUB, SUBAB, SUBAH, SUBAW, ZERO

Instruction Packets

• Instructions are always fetched 8 (256 bits) at a time. This is called a fetch packet.
• If the p-bit of instruction i is set, then instruction i and i+1 are executed in the same cycle in parallel.
• 1 to 8 instructions can be executed in parallel. This is called an execute packet.
• In the C62x, packets could not cross the 8-word boundary, and thus the 8th p-bit was always 0 and padding with NOPs was needed. The C64x did away with that restriction, and execute packets may now span multiple fetch packets.
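
The p-bit rule can be sketched as a short grouping routine. The function name is hypothetical, and the p-bit pattern in the example is an assumption chosen so that eight instructions A to H form the execute packets ABC, D, EF, GH.

```python
def split_execute_packets(instructions, p_bits):
    """Group instructions into execute packets using their p-bits.

    p_bits[i] == 1 means instruction i+1 executes in parallel with
    instruction i; p_bits[i] == 0 ends the current execute packet.
    """
    packets, current = [], []
    for name, p in zip(instructions, p_bits):
        current.append(name)
        if p == 0:
            packets.append(current)
            current = []
    if current:
        # The last p-bit was 1: on the C64x this packet would continue
        # into the next fetch packet.
        packets.append(current)
    return packets


packets = split_execute_packets(list("ABCDEFGH"), [1, 1, 0, 0, 1, 0, 1, 0])
```

With this pattern the fetch packet issues over four cycles: three instructions, then one, then two, then two.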

Fetch Packet Example

Cycle/Execute Packet   Instructions
1                      ABC
2                      D
3                      EF
4                      GH

C64x Opcode Map

Operations on the .L unit:
Bits    31-29   28   27-23   22-18   17-13      12   11-5   4-2     1   0
Field   creg    z    dst     src2    src1/cst   x    op     1 1 0   s   p

Operations on the .M unit:
Bits    31-29   28   27-23   22-18   17-13      12   11-7   6-2         1   0
Field   creg    z    dst     src2    src1/cst   x    op     0 0 0 0 0   s   p

Operations on the .D unit:
Bits    31-29   28   27-23   22-18   17-13      12-7   6-2         1   0
Field   creg    z    dst     src2    src1/cst   op     1 0 0 0 0   s   p
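
Reading the .L-unit layout above as bit ranges (creg 31-29, z 28, dst 27-23, src2 22-18, src1/cst 17-13, x 12, op 11-5, fixed pattern 110 in bits 4-2, s 1, p 0), a field extractor can be sketched as follows. The function name and the example field values are illustrative, not taken from the TI documentation.

```python
def decode_l_unit(word):
    """Split a 32-bit .L-unit instruction word into its fields."""
    if (word >> 2) & 0b111 != 0b110:
        raise ValueError("bits 4-2 are not 110: not a .L-unit operation")
    return {
        "creg": (word >> 29) & 0x7,
        "z": (word >> 28) & 0x1,
        "dst": (word >> 23) & 0x1F,
        "src2": (word >> 18) & 0x1F,
        "src1_cst": (word >> 13) & 0x1F,
        "x": (word >> 12) & 0x1,
        "op": (word >> 5) & 0x7F,
        "s": (word >> 1) & 0x1,   # side select: data path A or B
        "p": word & 0x1,          # parallel bit
    }


# Build a word with dst = 3, src2 = 2, src1 = 1, an arbitrary op, and p = 1.
word = (3 << 23) | (2 << 18) | (1 << 13) | (0x03 << 5) | (0b110 << 2) | 1
fields = decode_l_unit(word)
```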
C64x Opcode Map

Load/store with 15-bit offset on the .D unit:
Bits    31-29   28   27-23     22-8     7   6-4     3-2   1   0
Field   creg    z    dst/src   ucst15   y   ld/st   1 1   s   p

Load/store with baseR + offset/cst on the .D unit:
Bits    31-29   28   27-23     22-18   17-13          12-9   8   7   6-4     3-2   1   0
Field   creg    z    dst/src   baseR   offset/ucst5   mode   r   y   ld/st   1 1   s   p

Operations on the .S unit:
Bits    31-29   28   27-23   22-18   17-13      12   11-6   5-2       1   0
Field   creg    z    dst     src2    src1/cst   x    op     1 0 0 0   s   p

ADDK on the .S unit:
Bits    31-29   28   27-23   22-7   6-2         1   0
Field   creg    z    dst     cst    1 0 1 0 0   s   p

Analog Devices TigerSHARC
VLIW Vector Signal Processor
ADI TigerSHARC: Core Block Diagram
ADI TigerSHARC: Computation Block Block Diagram
Register Data Formats
Instruction Line Organization
Instruction Encoding
Compute Block
IALU Load and Store
Sequencer
ARM and Thumb
Low Power General Purpose Microprocessors

ARM Family Overview

• Architecture Versions
– ARM V3, V4, V5, V6
– Called “architecture” in their literature, this is the programmer’s view of the machine
• The externally visible architecture
• It is primarily a matter of Instruction Set Architecture
• Implementations
– ARM7, ARM9, ARM10, ARM11
• With letter extensions – to be explained shortly
– Called “cores” in their literature

ARM Evolution

ARM11 MicroArchitecture
28 Jan 2005 Copyright ARM Ltd. 2002 December 8, 2003 Other ISA's 48
ARMv5T (ARM)
ARMv5T (Thumb)

Summary
• Instruction sets can be classified along several lines.
– Addressing modes let instructions access memory in various ways.
– Data manipulation instructions can have from 0 to 3 operands.
– Those operands may be registers, memory addresses, or both.
• Instruction set design is intimately tied to processor datapath design.
• VLIW and compact, low-power instruction sets represent endpoints on
a continuum
– The VLIW uses enormous instruction fetch bandwidth to keep lots
of functional units busy
– Thumb mode attempts to pack irregular control code into as few
bits as possible to save instruction fetch bandwidth (power)

