Lec06 cs252

1
Philips Research
The Trimedia CPU64 VLIW Media Processor

Kees Vissers Philips Research Visiting Industrial Fellow vissers@eecs.berkeley.edu
ICS 252 class, February 3, 2000
Philips Research
Outline
Introduction Application Domain Processor Architecture Retargetable compiler Design Space Exploration Results
Philips Research
Design Problem
SDRAM video-in video-out timers I$ D$
Serial I/O
PCI bridge
I2C I/O audio-out
audio-in
Philips Research
Application Domain
High volume consumer electronics products future TV, home theatre, set-top box, etc. Embedded core Media processing: audio, video, graphics, communication
Philips Research
Media Processing CPU

Design considerations: Cost (embedded in high volume products) Performance for the application domain Ease of use (programming model)
Benchmark suite needed
Philips Research
Application: Natural Motion
D/2
n-1
-D/2
n-1/2 picture number
Philips Research
Project Approach: Y-chart

Machine Description Compiler Simulator Performance Numbers Benchmark Applications
Philips Research
Benchmark Suite Characteristics

Each application: typical for a class of applications within application domain The set covers a significant area of the application domain Each benchmark is sufficiently well tuned to the architecture to measure its performance
Philips Research
The Benchmark Suite

Data comm. Viterbi decoding Audio coding AC3 decode Video coding MPEG2 encode DVC decode Video proc. Layered natural motion Dynamic noise reduction Peaking Graphics 3D renderer library
10
Philips Research
Processing Characteristics
Signal rates audio, video (at block and pixel rate) Basic data types byte (8), half-words (16), words (32), float (32) Data access patterns sample stream, bitstream, random access Data independent and dependent load Control processing and signal processing
11
Philips Research
Application Development
Algorithm Reference C
TM1000 optimized
CPU64 optimized knowledge

video
IC video
simulation exploration
12
Philips Research
Initial Design Considerations

Goal: 6-8 times TM1000 performance Standard ANSI-C, reuse of code Utilize instruction and data parallelism Limited complexity (embedded core) Compatibility through recompilation VLIW architecture
13
Philips Research
Instruction Set Considerations

A machine operation must be sufficiently generic within the application domain Sufficiently powerful operations Limited number of operations Consistency and orthogonality
14
Philips Research
Results: Natural Motion

TM1000 cycles x 107 CPU64 cycles x 10 7
Average gain: 2.7x (cycles)
15 UPC
15
10 SEG
10
SEG
instructions nops cache stalls
MED SAD SUB
5 UPC SUB MED SAD 20 40 60 80 100%
0 0
20
40
60
80 100%
0 0
15
Philips Research
Natural Motion Dynamic Load

150 cpu load (%)
tm1000 @ 125MHz
100 90%
50
cpu64 @ 300MHz
16% 0 10 20 30 40 50 60 70 80 Field number 90 100 110
16
Philips Research
Summary
Application domain: Media processing for CE industry Benchmark suite: Targeted to application domain Optimized for class of processors Future products: Gradual shift to software implementations of signal processing functions
17
Philips Research
Outline
18
Philips Research
Application Target
Processor core, to be embedded in different ICs and products Real time processing of media streams Cost-sensitive consumer electronics market
19
Philips Research
Embedded Application
SDRAM video-in video-out timers I$ D$
Serial I/O
PCI bridge
I2C I/O audio-out
audio-in
20
Philips Research
Performance Target
Relative to TriMedia TM1000 (5-slot VLIW, 100MHz, 32-bit datapath, media operations):
6x to 8x performance increase, to process more or higher-resolution video streams Not more than double transistor count
21
Philips Research
Efficiency
Good performance/silicon cost ratio by: Optimize CPU architecture and media benchmark source code towards each other Solve resource conflicts at compile time
22
Philips Research
Outline
Introduction VLIW architecture Instruction set IDCT example Results
23
Philips Research
Architectural Speedup
Not simply increasing the VLIW issue width: Diminishing gain of compiler-generated ILP Increasing implementation complexity (area, timing)
24
Philips Research
Architectural Speedup
Extended SIMD: uniform 64-bit design, vectors of 1-, 2-, and 4-byte elements (data throughput per cycle) New, extensive, media instruction set (functionality per cycle) Improved cache control (prefetch, alloc)
25
Philips Research
CPU64 Architecture
64-bit memory bus 32-bit peripheral bus multi-port 128 words x 64 bits register file bypass network
data mmu cache 16KB mmu

exceptions
FU
PC
FU
FU
FU
FU
instruction cache mmu 32 KB
VLIW instruction decode and launch
26
Philips Research
Code Example
int a[n], b[n], c[n]; int i; for (i=0; i<n; i++) c[i]=a[i]+b[i]; RISC instructions: issueslots): 1 =1 2 3 x = *a a += 1 y = *b nop i += 1 nop g = i<n nop z = x+y nop *c = z nop jmpf g 12 jmpt g 1 1 b += 1 2 nop 3 nop jmpf g 7 jmpt g 1 nop g = i<n nop nop
VLIW x = *a
instructions y = *b
(5 i+
4
5 6
4
nop 5 *c = z 6 nop
z = x+y
c+=1 nop
nop
nop nop
nop
nop nop
7 8
27
Philips Research
VLIW challenges
Achieve high Instruction Level Parallelism Compiler needs to know the pipeline for instructions Code size Design of a multi-ported register file. Design of the forwarding network
28
Philips Research
Instruction Level Parallelism

Loop unrolling, software pipelining Guarded execution Avoid branches if conversion, hand rewrite source!, valid in Embedded applications, done for TI, Philips and Apple software (Adobe fotoshop). Speculate branches
29
Philips Research
Compiler
Detect data dependency at compile time: examples:
c[i]=a[i]+b[i]; d[i]=a[i]+c[j]; c[1]=a[i]+b[i]; d[i]=a[i]+c[2]; potential dependency c[i] might be c[j] no dependency c[1] is never c[2]
30
Philips Research
Compiler
Schedule the instructions in the proper slot at the proper time, e.g.: load and stores only in slot 1,2 adds in all slots floating point multiply only in slot 3 add take 1 cycle, mult takes 2 cycles 3 branch delay slots
31
Philips Research
Code size
Large number of registers cost code size add r1 r3 r127 Large number of media instructions: code size Use hardware compression techniques to remove nops Variable length instruction format! Vector instructions: SIMD, great for data processing performance
32
Philips Research
Code Example
char a[n], b[n], c[n]; int i; for (i=0; i<n; i++) c[i]=a[i]+b[i]; : 1 2 3 4 5 6 VLIW instructions (5 issueslots): x = *a g = i<n jmpf g 7 z = x+y *c = z nop y = *b nop jmpt g 1 nop c+=1 nop i+=1 nop nop nop nop nop a += 1 nop nop nop nop nop b += 1 nop nop nop nop nop
33
Philips Research
Code Example Vector

vec64sb a[n/8], b[n/8], c[n/8]; int i; for (i=0; i</8n; i++) c[i]=sb_add(a[i],b[i]); : 1 2 3 4 5 6 VLIW instructions and SIMD (5 issueslots): x = ld64b a y =ld64b b g = i<fraction n jmpf g 7 z = add64sb x y st64b c z nop i+=1 nop jmpt g 1 nop c+=1 nop a += 8 nop nop nop nop nop b += 8 nop nop nop nop nop nop nop nop nop nop
34
Philips Research
Vector Instruction Example

x7
y7
x6
y6
x5
y5
x4
y4
x3
y3
x2
y2
x1
y1
x0
y0
|.|
|.|
|.|
|.|
|.|
|.|
|.|
|.|
Sum of absolute differences (ub_me)
35
Philips Research
Application Code Example

int calc_sad(vec64ub *prv, vec64ub *cur, int s) { vec64ub left, right; int i; int sad = 0; for (i=0; i<8; i++) { left = prv[i*s]; right = cur[i*s]; sad += ub_me(left, right); } return(sad); }
36
Philips Research
Architecture VLIW+SIMD
Issues a 5-slot instruction every cycle Each slot supports a selection of FUs All FUs support vectorized (SIMD) data Double-slot FU allows powerful multiargument, multi-result operations All FUs are pipelined, latency 1 to 4 (except floating point divide and sqrt)
37
Philips Research
Instruction Format
Per operation, per slot: Up to 2 arguments, up to 1 result Optionally an extra register for guarding (conditional execution) Immediate argument size can be 32 bit Instructions have compressed variable length format, decompressed during decode
38
Philips Research
Operations
Intends to cover combinations of: 1-, 2-, 4-byte elements in an 8-byte vector Signed or unsigned type Clipped versus wrap-around arithmetic Set of basic functions (ld, st, add, mul, compare, shift, ...)
39
Philips Research
Instruction Set
Category Load/store ops Byte shuffles Bit shifts Multiplies Integer arith Floating point Lookup table Special ops Branch Cnt 39 67 48 54 104 59 6 23 10 Comment vectorized, endian-correct vector type convert round, fields round, clip, wrap around clip, wrap around scalar and vectorized vectorized MMU, cache, special regs (un)interruptible, trap
40
Philips Research
Example: msp-multiply
Each argument: 4-way 16-bit int
Multiply to internal double precision integer
+
1
Round lower half into upper half

Return upper half
41
Philips Research
Example: Transpose
a e i m b f j n c g k o d h l p
2-slot super-op: takes 4 arguments, shuffles bytes, produces 2 results

a b
e f
i j
m n
(2-dimensional filtering)
42
Philips Research
Example: 4-way average
add with extended precision round
(MPEG 2-dimensional half-pixel average)
43
Philips Research
Branches
Branch operations have 3 branch delay slots Compiler & scheduler try to fill these (profiling, loop unrolling, inlining, guarding) Branch units are properly pipelined Up to 3 branch ops in 1 instruction No branch prediction hardware Branches are preferred moments for interrupt servicing
44
Philips Research
2D-IDCT Example
IDCT code was tested early in the project, also for studying the programming model Mapped through our experimental C compiler and instruction scheduler Operates entirely on (vectors of) 16-bit data elements Simulation showed IEEE 1180 accuracy compliancy
45
Philips Research
2D-IDCT Instructions
Execution time in cycles, excluding data cache misses
VLIW: CPU64, TM1000, TI 320C60 others superscalar
400 350 300 250 200 150 100 50 0 CPU64 TM-1000 TI 320C62 Altivec PA-8000 PentiumI PentiumII NEC V830
= accuracy compliant
46
Philips Research
Status
Now in construction at Philips Semiconductors, Sunnyvale 7M transistors (target) 300Mhz clock (target) 0.18 technology 1.8 volt
47
Philips Research
Conclusion Processor Architecture

The 5-slot 64-bit VLIW provides a powerful architecture for media processing Performance gain of 6x to 8x on TM1000 Rich and regular media instruction set Powerful multi-argument, multi-result superop naturally fits VLIW architecture
48
Philips Research
Outline
49
Philips Research
Traditional Compilation
Compiler A App 1 App 2 App 3 Compiler B
Exe 1A Exe 2A
Exe 3A Exe 1B Exe 2B Exe 3B
Machine A
Machine B
50
Philips Research
Retargetable Compilation
Machine A Description App 1 App 2 App 3 Machine B Description
Exe 1A Exe 2A
Exe 3A
Machine A
Compiler Machine B
Exe 1B Exe 2B Exe 3B
51
Philips Research
MD file contents
MD describes instance from class of machines Contains information such as: number & width of registers number of issue slots number & placement of function units instruction latencies
52
Philips Research
Y-Chart
53
Philips Research
Software and Hardware Operations

Hardware view of instruction set Implement minimal, changeable set Software view of instruction set provide regular, orthogonal, stable set Two instruction sets were defined for these purposes
54
Philips Research
Instruction set libraries

Hardware operations library untyped, minimal, changing Software operations library vector-typed, orthogonal, stable Mapping in MD file and library
55
Philips Research
Library structure
source code simulator hw_ops sw_ops application
workstation executable
simulator
application
application CPU64 executable
56
Philips Research
Functional Development
source code hw_ops gcc workstation executable
sw_ops
application
application
57
Philips Research
Code Tuning
source code simulator hw_ops application
gcc
workstation executable
tmcc application CPU64 executable

simulator
interpretation
58
Philips Research
Conclusions Compiler
Applications are written in C code only Machine Description supports changes in ISA Vector types keep code legible Fast functional development track Accurate application tuning track
59
Philips Research
Outline
60
Philips Research
Methodology: the Y-chart

61
Philips Research
64-bit memory bus
64-bit VLIW CPU

multi-port 128 words x 64 bits register file bypass network
data mmu cache 16KB mmu instruction cache mmu 32 KB
FU
FU
FU
FU
FU
VLIW instruction decode and launch
Many parameters, large design space(s)
62
Philips Research
Multimedia benchmark
Nine multimedia applications from: Data communication Audio coding Video coding Video processing Graphics Representative Applications Code Datasets
63
Philips Research
Challenge for design space exploration

Simulation time for the benchmark for a single machine is 18 hours The number of design points for functional 1015 unit configuration alone: Results in 2000,000,000,000 years of computation time
64
Philips Research
Essential tooling
Retargetable toolchain Core compiler, scheduler, simulator Design and experiment management DSE support tools Machine generation, pseudosim, analysis Glue software
65
Philips Research
Pseudo-simulation
Pseudosim: a retargetable pseudo-simulator Performs a cycle-accurate calculation of the amount of instruction cycles, without doing the actual machine simulation Gathers other machine statistics, such as utilisation of slots, functional units, operations Time for the whole benchmark is reduced from 18 hours to 4 minutes
66
Philips Research
Pseudo-simulation
Machine description Benchmark applications Machine description
Compiler
once, 18 hours
Simulator
code Pseudosim
many times, 4 minutes
statistics
results
67
Philips Research
Functional unit configuration

Problem: How many of each type of FU do I need? Where do I place them? How do I prevent simulating too many machine variations?
FU type
slot
68
Philips Research
Nave calculation of the design space size

24 types of FUs 31 configurations per type 6 types of super-op FUs 7 configurations per type 3124 76 1040 Size of design space:
69
Philips Research
Not so nave calculation of the design space size

Assume relationship between FU types Best-guess lower and upper bounds on number of FUs per type Reduce permutations Size of design space: 1015
70
Philips Research
Systematic exploration
It is not feasible to exhaustively compute all design points in the design space Therefore we probe, partition, and then explore the design space Careful set-up of experiments
71
Philips Research
Reduction of the design space

Observation: over 93% of operations are done in 30% of FU types Action: partition space Exhaustive exploration of important FUs Greedy exploration of the remainder functional unit types
70% 30%
7%
operations
93%
72
Philips Research
Exhaustive experiment
7
x 10 10
Exhaustive experiment
8.8
8.6 194
195
196
197
198
199
200
201
202
area
Close to 3000 machine variations Only 4 machines survive Large performance variation due to placement Time taken = 200 hours
9.8
9.6
cycles
9.4
9.2
73
Philips Research
Greedy experiments
First greedy experiment
7 x 10 9.3
8.8
8.7
8.6 191
192
193
194
195
196
197
area
Less machine variations per experiment More machines survive Less performance variation due to placement Close to another 3000 machine variations over all greedy experiments
9.2
9.1
cycles
8.9
74
Philips Research
Outline
Introduction Tools Exploration Real numbers Conclusions
75
Philips Research
Simulated machines
Experiment number Probing Exhaustive Greedy Total 185 3023 2519 5727
180 Gbyte data transferred 800 Mbyte compressed data stored 2T cycles simulated 10T operations issued
76
Philips Research
Outline
Introduction Tools Exploration Real numbers Conclusions
77
Philips Research
Summary and conclusions

A full-fledged design space exploration was done for the 64-bit TriMedia VLIW CPU core The use of both pseudo-simulation and a systematic exploration has made this feasible The outcome is: A range of machines in A-T space Rules for functional unit placement A flexible, integrated environment for continuation of DSE
78
Philips Research
Summary and conclusions Design Space Exploration

A large amount of time is spent in developing tools to enable the DSE The design of experiments and the analysis of results takes up more time than the execution Performing a DSE stretches the capabilities of the tools to the maximum The pseudo-simulator is a powerful tool that supports the full design space offered by the machine description
79
Philips Research
Conclusions VLIW
Difficult to exploit ILP in general purpose, great for embedded applications, streaming applications and DSP. Compiler is essential Good in combination with vector or SIMD Quantify, do experiments, analyse results and keep doing this: Y-chart. Promising for binary to binary (Transmeta: code morphing), Intel-HP: IA-64 Revival: Philips, Sun MAJC, TI 320C60, Crusoe 3120.

Lec06 cs252

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Lec06 cs252

Uploaded by

Copyright:

Available Formats

1

The Trimedia CPU64 VLIW Media Processor

ICS 252 class, February 3, 2000

I2C I/O audio-out

ICS 252 class, February 3, 2000

Media Processing CPU

Application: Natural Motion

n-1/2 picture number

Project Approach: Y-chart

ICS 252 class, February 3, 2000

Benchmark Suite Characteristics

ICS 252 class, February 3, 2000

The Benchmark Suite

ICS 252 class, February 3, 2000

ICS 252 class, February 3, 2000

CPU64 optimized knowledge

Initial Design Considerations

ICS 252 class, February 3, 2000

Instruction Set Considerations

ICS 252 class, February 3, 2000

Results: Natural Motion

Average gain: 2.7x (cycles)

instructions nops cache stalls

MED SAD SUB

5 UPC SUB MED SAD 20 40 60 80 100%

ICS 252 class, February 3, 2000

Natural Motion Dynamic Load

ICS 252 class, February 3, 2000

ICS 252 class, February 3, 2000

ICS 252 class, February 3, 2000

I2C I/O audio-out

ICS 252 class, February 3, 2000

ICS 252 class, February 3, 2000

ICS 252 class, February 3, 2000

data mmu cache 16KB mmu

instruction cache mmu 32 KB

VLIW instruction decode and launch

ICS 252 class, February 3, 2000

Instruction Level Parallelism

ICS 252 class, February 3, 2000

ICS 252 class, February 3, 2000

ICS 252 class, February 3, 2000

ICS 252 class, February 3, 2000

Code Example Vector

ICS 252 class, February 3, 2000

Vector Instruction Example

Sum of absolute differences (ub_me)

Application Code Example

ICS 252 class, February 3, 2000

ICS 252 class, February 3, 2000

ICS 252 class, February 3, 2000

ICS 252 class, February 3, 2000

ICS 252 class, February 3, 2000

Round lower half into upper half

ICS 252 class, February 3, 2000

2-slot super-op: takes 4 arguments, shuffles bytes, produces 2 results

Example: 4-way average

add with extended precision round

(MPEG 2-dimensional half-pixel average)

ICS 252 class, February 3, 2000

ICS 252 class, February 3, 2000

ICS 252 class, February 3, 2000