Philips Research

The TriMedia CPU64 VLIW Media Processor


Kees Vissers
Philips Research
Visiting Industrial Fellow
vissers@eecs.berkeley.edu
ICS 252 class, February 3, 2000


Outline
Introduction
Application Domain
Processor Architecture
Retargetable compiler
Design Space Exploration
Results

Design Problem
Block diagram components: SDRAM, video-in, video-out, audio-in, audio-out, timers, I$, D$, serial I/O, I2C I/O, PCI bridge

Application Domain
High-volume consumer electronics products: future TV, home theatre, set-top box, etc.
Embedded core
Media processing: audio, video, graphics, communication

Media Processing CPU


Design considerations:
  Cost (embedded in high-volume products)
  Performance for the application domain
  Ease of use (programming model)
Benchmark suite needed

Application: Natural Motion

Figure: displacements D/2 and -D/2 plotted against picture number (n-1, n-1/2)

Project Approach: Y-chart


Figure (Y-chart): Machine Description and Benchmark Applications feed the Compiler and Simulator, which produce Performance Numbers

Benchmark Suite Characteristics


Each application is typical for a class of applications within the application domain
The set covers a significant area of the application domain
Each benchmark is sufficiently well tuned to the architecture to measure its performance

The Benchmark Suite


Data comm.      Viterbi decoding
Audio coding    AC3 decode
Video coding    MPEG2 encode, DVC decode
Video proc.     Layered natural motion, Dynamic noise reduction, Peaking
Graphics        3D renderer library

Processing Characteristics
Signal rates: audio, video (at block and pixel rate)
Basic data types: byte (8), half-word (16), word (32), float (32)
Data access patterns: sample stream, bitstream, random access
Data-independent and data-dependent load
Control processing and signal processing

Application Development
Figure: development flow from Algorithm to Reference C, then to TM1000-optimized and CPU64-optimized code; other labels: knowledge, video, IC, simulation, exploration

Initial Design Considerations


Goal: 6-8 times TM1000 performance
Standard ANSI C, reuse of code
Utilize instruction and data parallelism
Limited complexity (embedded core)
Compatibility through recompilation
VLIW architecture

Instruction Set Considerations


A machine operation must be sufficiently generic within the application domain
Sufficiently powerful operations
Limited number of operations
Consistency and orthogonality

Results: Natural Motion


Figure: cycle counts (x 10^7) on TM1000 and on CPU64 for the bars UPC, SEG, MED, SAD and SUB, each split (0-100%) into instructions, nops and cache stalls
Average gain: 2.7x (cycles)

Natural Motion Dynamic Load


Figure: CPU load (%) versus field number (0-110) for tm1000 @ 125MHz and cpu64 @ 300MHz; annotations: 90% and 16%

Summary
Application domain: media processing for the CE industry
Benchmark suite:
  Targeted to the application domain
  Optimized for a class of processors
Future products: gradual shift to software implementations of signal processing functions

Outline
Introduction
Application Domain
Processor Architecture
Retargetable compiler
Design Space Exploration
Results

Application Target
Processor core, to be embedded in different ICs and products
Real-time processing of media streams
Cost-sensitive consumer electronics market

Embedded Application
Block diagram components: SDRAM, video-in, video-out, audio-in, audio-out, timers, I$, D$, serial I/O, I2C I/O, PCI bridge

Performance Target
Relative to TriMedia TM1000 (5-slot VLIW, 100MHz, 32-bit datapath, media operations):

6x to 8x performance increase, to process more or higher-resolution video streams
Not more than double the transistor count

Efficiency
Good performance/silicon-cost ratio by:
  Optimizing the CPU architecture and the media benchmark source code towards each other
  Solving resource conflicts at compile time

Outline
Introduction
VLIW architecture
Instruction set
IDCT example
Results

Architectural Speedup
Not simply increasing the VLIW issue width:
  Diminishing gain of compiler-generated ILP
  Increasing implementation complexity (area, timing)

Architectural Speedup
Extended SIMD: uniform 64-bit design, vectors of 1-, 2-, and 4-byte elements (data throughput per cycle)
New, extensive media instruction set (functionality per cycle)
Improved cache control (prefetch, alloc)

CPU64 Architecture
Block diagram: 64-bit memory bus; 32-bit peripheral bus; multi-port register file, 128 words x 64 bits; bypass network; five FUs; PC; exceptions; data cache 16 KB with MMU; instruction cache 32 KB with MMU; VLIW instruction decode and launch

Code Example
int a[n], b[n], c[n];
int i;
for (i=0; i<n; i++)
    c[i]=a[i]+b[i];

RISC instructions:
 1  x = *a
 2  a += 1
 3  y = *b
 4  b += 1
 5  i += 1
 6  g = i<n
 7  z = x+y
 8  *c = z
 9  c += 1
10  jmpf g 12
11  jmpt g 1

VLIW instructions (5 issue slots):
    slot 1      slot 2      slot 3   slot 4   slot 5
1   x = *a      y = *b      i += 1   a += 1   b += 1
2   g = i<n     nop         nop      nop      nop
3   jmpf g 7    jmpt g 1    nop      nop      nop
4   z = x+y     nop         nop      nop      nop
5   *c = z      c += 1      nop      nop      nop
6   nop         nop         nop      nop      nop

VLIW challenges
Achieve high Instruction Level Parallelism
Compiler needs to know the pipeline for instructions
Code size
Design of a multi-ported register file
Design of the forwarding network

Instruction Level Parallelism


Loop unrolling, software pipelining
Guarded execution
Avoid branches: if-conversion, hand rewrite of the source! Valid in embedded applications; done for TI, Philips and Apple software (Adobe Photoshop)
Speculate branches
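A minimal sketch of the if-conversion mentioned above, in plain C. The function names `clip_branch` and `clip_select` are illustrative only and do not come from the TriMedia tool chain, and the sketch assumes `lo <= hi`. The second form has no early exits, so a compiler is free to turn its selects into guarded operations rather than branches.

```c
#include <stdint.h>

/* Branching form: up to two conditional branches per call. */
int32_t clip_branch(int32_t x, int32_t lo, int32_t hi)
{
    if (x < lo)
        return lo;
    if (x > hi)
        return hi;
    return x;
}

/* If-converted form: both conditions are computed unconditionally and
 * used as guards for simple selects (assumes lo <= hi). */
int32_t clip_select(int32_t x, int32_t lo, int32_t hi)
{
    int below = x < lo;   /* guard 1 */
    int above = x > hi;   /* guard 2 */

    x = below ? lo : x;
    x = above ? hi : x;
    return x;
}
```

Both functions return the same value for every input with `lo <= hi`; only the control flow differs.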


Compiler
Detect data dependencies at compile time. Examples:
  c[i]=a[i]+b[i]; d[i]=a[i]+c[j];    potential dependency: c[i] might be c[j]
  c[1]=a[i]+b[i]; d[i]=a[i]+c[2];    no dependency: c[1] is never c[2]
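The two cases above can be sketched as a toy classification routine. This is only an illustration of the conservative reasoning, not the analysis used by the actual compiler; the types `Index` and `Dependency` and the function `classify_dependency` are hypothetical names.

```c
/* Toy model of an array index: a compile-time constant or a symbol. */
typedef struct {
    int is_const;   /* 1 if the index is a literal, as in c[2] */
    int value;      /* literal value, used only when is_const is 1 */
} Index;

typedef enum { DEP_NONE, DEP_POSSIBLE, DEP_CERTAIN } Dependency;

/* Conservative test for a write c[w] followed by a read c[r]
 * of the same array. */
Dependency classify_dependency(Index w, Index r)
{
    if (w.is_const && r.is_const)
        return (w.value == r.value) ? DEP_CERTAIN : DEP_NONE;
    return DEP_POSSIBLE;   /* a symbolic index: assume the worst */
}
```

With `c[i]` and `c[j]` the routine reports a possible dependency, so the scheduler must keep the store before the load; with `c[1]` and `c[2]` it reports none, and the two operations may be reordered.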

Compiler
Schedule the instructions in the proper slot at the proper time, e.g.:
  loads and stores only in slots 1 and 2
  adds in all slots
  floating-point multiply only in slot 3
  add takes 1 cycle, mult takes 2 cycles
  3 branch delay slots
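The example constraints above could be captured in small lookup tables, as in the following sketch. The names (`OpClass`, `slot_mask`, `slot_allows`, `result_ready_cycle`) are hypothetical, the 2-cycle "mult" latency is assumed here to apply to the floating-point multiply, and the load/store latency is left at 0 because the slide does not give it.

```c
#define NUM_SLOTS 5

typedef enum { OP_LOAD_STORE, OP_ADD, OP_FMUL, OP_CLASS_COUNT } OpClass;

/* Bit k-1 set means the operation class may issue in slot k. */
static const unsigned slot_mask[OP_CLASS_COUNT] = {
    0x03u,  /* loads and stores: slots 1 and 2 */
    0x1Fu,  /* adds: all five slots */
    0x04u   /* floating-point multiply: slot 3 only */
};

/* Latency in cycles; 0 marks a value the slide does not give. */
static const int latency[OP_CLASS_COUNT] = { 0, 1, 2 };

/* Returns 1 if the operation class may issue in the 1-based slot. */
int slot_allows(OpClass op, int slot)
{
    if (slot < 1 || slot > NUM_SLOTS)
        return 0;
    return (int)((slot_mask[op] >> (slot - 1)) & 1u);
}

/* First cycle in which a dependent operation can issue. */
int result_ready_cycle(OpClass op, int issue_cycle)
{
    return issue_cycle + latency[op];
}
```

A scheduler would consult `slot_allows` when placing an operation and `result_ready_cycle` when placing its consumers.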


Code size
A large number of registers costs code size: add r1 r3 r127
A large number of media instructions costs code size
Use hardware compression techniques to remove nops
Variable-length instruction format!
Vector instructions: SIMD, great for data processing performance

Code Example
char a[n], b[n], c[n];
int i;
for (i=0; i<n; i++)
    c[i]=a[i]+b[i];

VLIW instructions (5 issue slots):
    slot 1      slot 2      slot 3   slot 4   slot 5
1   x = *a      y = *b      i += 1   a += 1   b += 1
2   g = i<n     nop         nop      nop      nop
3   jmpf g 7    jmpt g 1    nop      nop      nop
4   z = x+y     nop         nop      nop      nop
5   *c = z      c += 1      nop      nop      nop
6   nop         nop         nop      nop      nop

Code Example Vector


vec64sb a[n/8], b[n/8], c[n/8];
int i;
for (i=0; i<n/8; i++)
    c[i]=sb_add(a[i],b[i]);

VLIW instructions and SIMD (5 issue slots):
    slot 1            slot 2        slot 3   slot 4   slot 5
1   x = ld64b a       y = ld64b b   i += 1   a += 8   b += 8
2   g = i<n/8         nop           nop      nop      nop
3   jmpf g 7          jmpt g 1      nop      nop      nop
4   z = add64sb x y   nop           nop      nop      nop
5   st64b c z         c += 1        nop      nop      nop
6   nop               nop           nop      nop      nop
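For readers without the CPU64 tool chain, the vector loop above can be modelled in portable C. This is a sketch under assumptions: `vec64sb_model`, `sb_add_model` and `add_bytes_model` are made-up stand-ins for the real `vec64sb` type and `sb_add` operation, and wrap-around (rather than clipped) arithmetic is assumed because the slide does not say which variant `sb_add` is.

```c
#include <stdint.h>

/* Stand-in for a 64-bit vector of eight signed bytes. */
typedef struct {
    int8_t e[8];
} vec64sb_model;

/* Element-wise add of eight signed bytes, wrap-around assumed.
 * The sum is formed in unsigned 8-bit arithmetic; converting the
 * result back to int8_t is implementation-defined in C but wraps
 * on common two's-complement compilers. */
vec64sb_model sb_add_model(vec64sb_model x, vec64sb_model y)
{
    vec64sb_model z;
    int k;

    for (k = 0; k < 8; k++)
        z.e[k] = (int8_t)(uint8_t)((uint8_t)x.e[k] + (uint8_t)y.e[k]);
    return z;
}

/* Same loop shape as on the slide: n bytes processed as n/8 vectors. */
void add_bytes_model(const vec64sb_model *a, const vec64sb_model *b,
                     vec64sb_model *c, int n)
{
    int i;

    for (i = 0; i < n / 8; i++)
        c[i] = sb_add_model(a[i], b[i]);
}
```

The point of the comparison is the trip count: the scalar loop runs n iterations, the vector loop n/8, with the same six-instruction schedule per iteration.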


Vector Instruction Example


Figure: the eight byte elements x7..x0 and y7..y0 are combined pairwise as |x - y|, and the eight results are added into a single sum
Sum of absolute differences (ub_me)

Application Code Example


int calc_sad(vec64ub *prv, vec64ub *cur, int s)
{
    vec64ub left, right;
    int i;
    int sad = 0;

    for (i=0; i<8; i++) {
        left = prv[i*s];
        right = cur[i*s];
        sad += ub_me(left, right);
    }
    return(sad);
}
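A portable model of the code above may help when reading it without the CPU64 headers. The names `vec64ub_model`, `ub_me_model` and `calc_sad_model` are hypothetical stand-ins; the behaviour of `ub_me` is taken from the "sum of absolute differences" description on the previous slide.

```c
#include <stdint.h>
#include <stdlib.h>

/* Stand-in for a 64-bit vector of eight unsigned bytes. */
typedef struct {
    uint8_t e[8];
} vec64ub_model;

/* Sum of absolute differences over the eight byte lanes. */
int ub_me_model(vec64ub_model x, vec64ub_model y)
{
    int sum = 0;
    int k;

    for (k = 0; k < 8; k++)
        sum += abs((int)x.e[k] - (int)y.e[k]);
    return sum;
}

/* SAD of an 8x8 byte block; s is the row stride in vectors. */
int calc_sad_model(const vec64ub_model *prv, const vec64ub_model *cur, int s)
{
    int sad = 0;
    int i;

    for (i = 0; i < 8; i++)
        sad += ub_me_model(prv[i * s], cur[i * s]);
    return sad;
}
```

In the model each call to `ub_me_model` costs eight subtractions, eight absolute values and eight additions; on the CPU64 the slide presents this as a single operation.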


Architecture VLIW+SIMD
Issues a 5-slot instruction every cycle
Each slot supports a selection of FUs
All FUs support vectorized (SIMD) data
Double-slot FU allows powerful multi-argument, multi-result operations
All FUs are pipelined, latency 1 to 4 cycles (except floating-point divide and sqrt)

Instruction Format
Per operation, per slot:
  Up to 2 arguments, up to 1 result
  Optionally an extra register for guarding (conditional execution)
  Immediate argument size can be 32 bits
Instructions have a compressed variable-length format, decompressed during decode

Operations
Intends to cover combinations of:
  1-, 2-, 4-byte elements in an 8-byte vector
  Signed or unsigned type
  Clipped versus wrap-around arithmetic
  Set of basic functions (ld, st, add, mul, compare, shift, ...)
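The difference between clipped and wrap-around arithmetic is easiest to see on a single unsigned byte. The two helpers below are illustrative scalar models with made-up names, not CPU64 operations.

```c
#include <stdint.h>

/* Wrap-around: the result is taken modulo 256. */
uint8_t ub_add_wrap(uint8_t a, uint8_t b)
{
    return (uint8_t)(a + b);
}

/* Clipped (saturating): results above 255 are held at 255. */
uint8_t ub_add_clip(uint8_t a, uint8_t b)
{
    unsigned sum = (unsigned)a + (unsigned)b;

    return (uint8_t)(sum > 255u ? 255u : sum);
}
```

For pixel values 200 and 100 the wrap-around add gives 44, while the clipped add gives 255, which is usually what video code wants.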


Instruction Set
Category         Cnt   Comment
Load/store ops    39   vectorized, endian-correct
Byte shuffles     67   vector type convert
Bit shifts        48   round, fields
Multiplies        54   round, clip, wrap around
Integer arith    104   clip, wrap around
Floating point    59   scalar and vectorized
Lookup table       6   vectorized
Special ops       23   MMU, cache, special regs
Branch            10   (un)interruptible, trap

Example: msp-multiply
Each argument: 4-way 16-bit int
Multiply to internal double-precision integer
Add 1 to round the lower half into the upper half
Return upper half
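The steps above can be modelled per lane in portable C. This is a sketch under assumptions: the names are hypothetical, the rounding constant is assumed to be a 1 added at bit 15 (half of the discarded lower half), and the right shift of a negative product is implementation-defined in C, although it is arithmetic on common compilers.

```c
#include <stdint.h>

/* Stand-in for a 64-bit vector of four 16-bit integers. */
typedef struct {
    int16_t e[4];
} vec4x16_model;

/* One lane: widen, multiply, round, keep the upper 16 bits. */
int16_t mul_round_upper(int16_t a, int16_t b)
{
    int32_t wide = (int32_t)a * (int32_t)b;  /* exact 32-bit product */

    wide += (int32_t)1 << 15;                /* round lower half upward */
    return (int16_t)(wide >> 16);            /* return upper half */
}

/* The same operation applied to all four lanes. */
vec4x16_model mul_round_upper_4way(vec4x16_model x, vec4x16_model y)
{
    vec4x16_model z;
    int k;

    for (k = 0; k < 4; k++)
        z.e[k] = mul_round_upper(x.e[k], y.e[k]);
    return z;
}
```

Without the rounding step, 3 x 16384 would return 0 in the upper half; with it the result is 1.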


Example: Transpose
Figure: input elements a e i m, b f j n, c g k o, d h l p; output elements a b, e f, i j, m n
2-slot super-op: takes 4 arguments, shuffles bytes, produces 2 results
(2-dimensional filtering)

Example: 4-way average

Add with extended precision, then round

(MPEG 2-dimensional half-pixel average)
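A scalar model of the rounded 4-way average, as used for MPEG half-pixel interpolation in both dimensions. The function name is made up, and round-half-up is assumed; the slide only says "round".

```c
#include <stdint.h>

/* Average of four pixels. The sum needs 10 bits, so it is formed
 * in a wider type before rounding and dividing by 4. */
uint8_t avg4_round(uint8_t a, uint8_t b, uint8_t c, uint8_t d)
{
    unsigned sum = (unsigned)a + b + c + d;  /* at most 1020 */

    return (uint8_t)((sum + 2u) >> 2);       /* +2 rounds half upward */
}
```

The "extended precision" on the slide corresponds to `sum` here: adding four bytes in 8-bit arithmetic would overflow before the division.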


Branches
Branch operations have 3 branch delay slots
Compiler & scheduler try to fill these (profiling, loop unrolling, inlining, guarding)
Branch units are properly pipelined
Up to 3 branch ops in 1 instruction
No branch prediction hardware
Branches are preferred moments for interrupt servicing

2D-IDCT Example
IDCT code was tested early in the project, also for studying the programming model
Mapped through our experimental C compiler and instruction scheduler
Operates entirely on (vectors of) 16-bit data elements
Simulation showed compliance with IEEE 1180 accuracy

2D-IDCT Instructions
Execution time in cycles, excluding data cache misses
VLIW: CPU64, TM1000, TI 320C60; others superscalar
Figure: bar chart of cycles (0-400) for CPU64, TM-1000, TI 320C62, Altivec, PA-8000, PentiumI, PentiumII and NEC V830; a marker denotes accuracy-compliant implementations

Status
Now in construction at Philips Semiconductors, Sunnyvale
7M transistors (target)
300 MHz clock (target)
0.18 µm technology
1.8 volt

Conclusion Processor Architecture


The 5-slot 64-bit VLIW provides a powerful architecture for media processing
Performance gain of 6x to 8x over TM1000
Rich and regular media instruction set
Powerful multi-argument, multi-result super-op naturally fits the VLIW architecture

Outline
Introduction
Application Domain
Processor Architecture
Retargetable compiler
Design Space Exploration
Results

Traditional Compilation
Figure: App 1, App 2 and App 3 are compiled by Compiler A into Exe 1A, Exe 2A and Exe 3A for Machine A, and by Compiler B into Exe 1B, Exe 2B and Exe 3B for Machine B

Retargetable Compilation
Figure: a single Compiler takes App 1, App 2 and App 3 together with the Machine A Description or the Machine B Description, and produces Exe 1A, Exe 2A and Exe 3A for Machine A or Exe 1B, Exe 2B and Exe 3B for Machine B

MD file contents
MD describes an instance from a class of machines
Contains information such as:
  number & width of registers
  number of issue slots
  number & placement of function units
  instruction latencies
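As a rough illustration of the kind of information listed above, a machine description could be held in a structure like the following. The layout and every identifier are hypothetical; the real MD file format is not shown in the talk. The register and slot numbers come from the CPU64 architecture slide, and the two FU entries reuse the example placements from the scheduling slide.

```c
#define MD_MAX_FU_TYPES 32

/* One function-unit type and where it is placed. */
typedef struct {
    const char *name;
    unsigned slot_mask;   /* bit k-1 set: an instance sits in slot k */
    int latency;          /* cycles from issue to result */
} FuType;

typedef struct {
    int num_registers;
    int register_width_bits;
    int num_issue_slots;
    int num_fu_types;
    FuType fu[MD_MAX_FU_TYPES];
} MachineDescription;

/* Number of instances of FU type `type`, one per slot that carries it. */
int fu_instances(const MachineDescription *md, int type)
{
    unsigned mask = md->fu[type].slot_mask;
    int count = 0;

    for (; mask != 0; mask >>= 1)
        count += (int)(mask & 1u);
    return count;
}

/* A small example instance with illustrative FU entries. */
MachineDescription example_md(void)
{
    MachineDescription md = { 128, 64, 5, 2,
        { { "add",  0x1Fu, 1 },     /* all five slots */
          { "fmul", 0x04u, 2 } } }; /* slot 3 only */
    return md;
}
```

Changing `slot_mask` or `latency` and recompiling the benchmarks is the software analogue of the retargeting step described on these slides.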


Y-Chart
Figure (Y-chart): Machine Description and Benchmark Applications feed the Compiler and Simulator, which produce Performance Numbers

Software and Hardware Operations


Hardware view of instruction set: implement a minimal, changeable set
Software view of instruction set: provide a regular, orthogonal, stable set
Two instruction sets were defined for these purposes

Instruction set libraries


Hardware operations library: untyped, minimal, changing
Software operations library: vector-typed, orthogonal, stable
Mapping in MD file and library

Library structure
Figure: source code (simulator, hw_ops, sw_ops, application) is built into a workstation executable (simulator, application) and a CPU64 executable (application)

Functional Development
Figure: source code (hw_ops, sw_ops, application) is compiled with gcc into a workstation executable (application)

Code Tuning
Figure: source code (simulator, hw_ops, application); gcc builds the workstation executable (simulator); tmcc builds the CPU64 executable (application), which the simulator runs by interpretation

Conclusions Compiler
Applications are written in C code only
Machine Description supports changes in ISA
Vector types keep code legible
Fast functional development track
Accurate application tuning track

Outline
Introduction
Application Domain
Processor Architecture
Retargetable compiler
Design Space Exploration
Results

Methodology: the Y-chart


Figure (Y-chart): Machine Description and Benchmark Applications feed the Compiler and Simulator, which produce Performance Numbers

64-bit VLIW CPU

Block diagram: 64-bit memory bus; multi-port register file, 128 words x 64 bits; bypass network; five FUs; data cache 16 KB with MMU; instruction cache 32 KB with MMU; VLIW instruction decode and launch
Many parameters, large design space(s)

Multimedia benchmark
Nine multimedia applications from:
  Data communication
  Audio coding
  Video coding
  Video processing
  Graphics
Representative:
  Applications
  Code
  Datasets

Challenge for design space exploration


Simulation time for the benchmark for a single machine is 18 hours
The number of design points for functional unit configuration alone: 10^15
Results in 2,000,000,000,000 years of computation time
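The figure on the slide follows from simple arithmetic, sketched below. The function name is made up, and a 365-day year is assumed.

```c
/* Years needed to simulate every design point one after another. */
double exploration_years(double design_points, double hours_per_point)
{
    const double hours_per_year = 24.0 * 365.0;

    return design_points * hours_per_point / hours_per_year;
}
```

`exploration_years(1e15, 18.0)` comes to a little over 2 x 10^12 years, matching the slide. Even at the 4 minutes per machine quoted later for pseudo-simulation, the same formula still gives several billion years, which is why the design space itself also has to be reduced.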


Essential tooling
Retargetable toolchain: core compiler, scheduler, simulator
Design and experiment management
DSE support tools: machine generation, pseudosim, analysis
Glue software

Pseudo-simulation
Pseudosim: a retargetable pseudo-simulator
Performs a cycle-accurate calculation of the number of instruction cycles, without doing the actual machine simulation
Gathers other machine statistics, such as utilisation of slots, functional units, operations
Time for the whole benchmark is reduced from 18 hours to 4 minutes
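The talk does not spell out how Pseudosim computes its cycle count. One plausible scheme, shown here purely as an assumption, is to record once how often each basic block executes and then, for every new machine, multiply those counts by the block's schedule length on that machine. All names below are hypothetical.

```c
#include <stddef.h>

/* Assumed model: total cycles = sum over basic blocks of
 * (execution count from one full simulation) x
 * (schedule length of the block on the machine under study). */
unsigned long long pseudo_cycles(const unsigned long long *exec_count,
                                 const unsigned *sched_len,
                                 size_t num_blocks)
{
    unsigned long long total = 0;
    size_t b;

    for (b = 0; b < num_blocks; b++)
        total += exec_count[b] * sched_len[b];
    return total;
}
```

Under this assumption only rescheduling is repeated per machine, since the execution counts do not depend on the machine; this would be consistent with the "once" versus "many times" split described in the talk. Cache stalls would have to be accounted for separately.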


Pseudo-simulation
Figure: a machine description and the benchmark applications pass through the Compiler and Simulator once (18 hours), yielding code and statistics; Pseudosim combines these with a machine description many times (4 minutes each) to produce results

Functional unit configuration


Problem:
  How many of each type of FU do I need?
  Where do I place them?
  How do I prevent simulating too many machine variations?
Figure: grid of FU type versus slot

Naive calculation of the design space size

24 types of FUs, 31 configurations per type
6 types of super-op FUs, 7 configurations per type
Size of design space: 31^24 x 7^6, about 10^40
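The order of magnitude can be checked with logarithms; the intermediate values below are rounded.

```latex
\[
31^{24} \cdot 7^{6}
  = 10^{\,24\log_{10} 31 \,+\, 6\log_{10} 7}
  \approx 10^{\,35.79 + 5.07}
  = 10^{40.86}
  \approx 7 \times 10^{40}
\]
```

One reading, not stated on the slide, is that 31 = 2^5 - 1 counts the non-empty subsets of the five issue slots in which a given FU type can be placed.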


Not so naive calculation of the design space size

Assume relationship between FU types
Best-guess lower and upper bounds on number of FUs per type
Reduce permutations
Size of design space: 10^15

Systematic exploration
It is not feasible to exhaustively compute all design points in the design space
Therefore we probe, partition, and then explore the design space
Careful set-up of experiments

Reduction of the design space


Observation: over 93% of operations are done in 30% of FU types
Action: partition space
  Exhaustive exploration of important FUs
  Greedy exploration of the remaining functional unit types
Figure: 30% of FU types account for 93% of operations; the other 70% account for 7%

Exhaustive experiment
Figure: exhaustive experiment, scatter plot of cycles (x 10^7, about 8.6 to 9.8) against area (194 to 202)
Close to 3000 machine variations
Only 4 machines survive
Large performance variation due to placement
Time taken = 200 hours
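The slide does not define "survive". A natural reading, assumed here, is that a machine survives when no other machine is at least as good in both area and cycles and strictly better in one, that is, when it lies on the Pareto front of the plot. The sketch below implements that filter; the type and function names are hypothetical.

```c
#include <stddef.h>

typedef struct {
    double area;
    double cycles;
} Machine;

/* 1 if a is no worse than b on both axes and better on at least one. */
int dominates(Machine a, Machine b)
{
    return a.area <= b.area && a.cycles <= b.cycles &&
           (a.area < b.area || a.cycles < b.cycles);
}

/* Copies the non-dominated machines to out (capacity n) and
 * returns how many were kept. */
size_t pareto_filter(const Machine *in, size_t n, Machine *out)
{
    size_t kept = 0;
    size_t i, j;

    for (i = 0; i < n; i++) {
        int dominated = 0;

        for (j = 0; j < n && !dominated; j++)
            if (j != i && dominates(in[j], in[i]))
                dominated = 1;
        if (!dominated)
            out[kept++] = in[i];
    }
    return kept;
}
```

The filter is quadratic in the number of machines, which is negligible for the few thousand variations of one experiment compared with the simulation time.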


Greedy experiments
Figure: first greedy experiment, scatter plot of cycles (x 10^7, about 8.6 to 9.3) against area (191 to 197)
Fewer machine variations per experiment
More machines survive
Less performance variation due to placement
Close to another 3000 machine variations over all greedy experiments

Outline
Introduction
Tools
Exploration
Real numbers
Conclusions

Simulated machines
Experiment    Number
Probing          185
Exhaustive      3023
Greedy          2519
Total           5727

180 Gbyte data transferred
800 Mbyte compressed data stored
2T cycles simulated
10T operations issued

Outline
Introduction
Tools
Exploration
Real numbers
Conclusions

Summary and conclusions


A full-fledged design space exploration was done for the 64-bit TriMedia VLIW CPU core
The use of both pseudo-simulation and a systematic exploration has made this feasible
The outcome is:
  A range of machines in A-T space
  Rules for functional unit placement
  A flexible, integrated environment for continuation of DSE

Summary and conclusions Design Space Exploration


A large amount of time is spent on developing tools to enable the DSE
The design of experiments and the analysis of results take up more time than the execution
Performing a DSE stretches the capabilities of the tools to the maximum
The pseudo-simulator is a powerful tool that supports the full design space offered by the machine description

Conclusions VLIW
Difficult to exploit ILP in general-purpose code; great for embedded applications, streaming applications and DSP
Compiler is essential
Good in combination with vector or SIMD
Quantify, do experiments, analyse results and keep doing this: Y-chart
Promising for binary-to-binary translation (Transmeta: code morphing), Intel-HP: IA-64
Revival: Philips, Sun MAJC, TI 320C60, Crusoe 3120