Professional Documents
Culture Documents
Philips Research
Philips Research
Outline
Introduction Application Domain Processor Architecture Retargetable compiler Design Space Exploration Results
Philips Research
Design Problem
SDRAM video-in video-out timers I$ D$
Serial I/O
PCI bridge
ICS 252 class, February 3, 2000
audio-in
Philips Research
Application Domain
High volume consumer electronics products future TV, home theatre, set-top box, etc. Embedded core Media processing: audio, video, graphics, communication
Philips Research
Philips Research
D/2
n-1
ICS 252 class, February 3, 2000
-D/2
Philips Research
Philips Research
Philips Research
10
Philips Research
Processing Characteristics
Signal rates audio, video (at block and pixel rate) Basic data types byte (8), half-words (16), words (32), float (32) Data access patterns sample stream, bitstream, random access Data independent and dependent load Control processing and signal processing
11
Philips Research
Application Development
Algorithm Reference C
TM1000 optimized
video
IC video
simulation exploration
12
Philips Research
13
Philips Research
14
Philips Research
15 UPC
15
10 SEG
10
SEG
0 0
20
40
60
80 100%
0 0
15
Philips Research
tm1000 @ 125MHz
100 90%
50
cpu64 @ 300MHz
16% 0 10 20 30 40 50 60 70 80 Field number 90 100 110
16
Philips Research
Summary
Application domain: Media processing for CE industry Benchmark suite: Targeted to application domain Optimized for class of processors Future products: Gradual shift to software implementations of signal processing functions
17
Philips Research
Outline
Introduction Application Domain Processor Architecture Retargetable compiler Design Space Exploration Results
18
Philips Research
Application Target
Processor core, to be embedded in different ICs and products Real time processing of media streams Cost-sensitive consumer electronics market
ICS 252 class, February 3, 2000
19
Philips Research
Embedded Application
SDRAM video-in video-out timers I$ D$
Serial I/O
PCI bridge
ICS 252 class, February 3, 2000
audio-in
20
Philips Research
Performance Target
Relative to TriMedia TM1000 (5-slot VLIW, 100MHz, 32-bit datapath, media operations):
6x to 8x performance increase, to process more or higher-resolution video streams Not more than double transistor count
21
Philips Research
Efficiency
Good performance/silicon cost ratio by: Optimize CPU architecture and media benchmark source code towards each other Solve resource conflicts at compile time
ICS 252 class, February 3, 2000
22
Philips Research
Outline
Introduction VLIW architecture Instruction set IDCT example Results
23
Philips Research
Architectural Speedup
Not simply increasing the VLIW issue width: Diminishing gain of compiler-generated ILP Increasing implementation complexity (area, timing)
ICS 252 class, February 3, 2000
24
Philips Research
Architectural Speedup
Extended SIMD: uniform 64-bit design, vectors of 1-, 2-, and 4-byte elements (data throughput per cycle) New, extensive, media instruction set (functionality per cycle) Improved cache control (prefetch, alloc)
25
Philips Research
CPU64 Architecture
64-bit memory bus 32-bit peripheral bus multi-port 128 words x 64 bits register file bypass network
FU
PC
FU
FU
FU
FU
ICS 252 class, February 3, 2000
26
Philips Research
Code Example
int a[n], b[n], c[n]; int i; for (i=0; i<n; i++) c[i]=a[i]+b[i]; RISC instructions: issueslots): 1 =1 2 3 x = *a a += 1 y = *b nop i += 1 nop g = i<n nop z = x+y nop *c = z nop jmpf g 12 jmpt g 1 1 b += 1 2 nop 3 nop jmpf g 7 jmpt g 1 nop g = i<n nop nop
ICS 252 class, February 3, 2000
VLIW x = *a
instructions y = *b
(5 i+
4
5 6
4
nop 5 *c = z 6 nop
z = x+y
c+=1 nop
nop
nop nop
nop
nop nop
7 8
27
Philips Research
VLIW challenges
Achieve high Instruction Level Parallelism Compiler needs to know the pipeline for instructions Code size Design of a multi-ported register file. Design of the forwarding network
28
Philips Research
29
Philips Research
Compiler
Detect data dependency at compile time: examples:
c[i]=a[i]+b[i]; d[i]=a[i]+c[j]; c[1]=a[i]+b[i]; d[i]=a[i]+c[2]; potential dependency c[i] might be c[j] no dependency c[1] is never c[2]
ICS 252 class, February 3, 2000
30
Philips Research
Compiler
Schedule the instructions in the proper slot at the proper time, e.g.: load and stores only in slot 1,2 adds in all slots floating point multiply only in slot 3 add take 1 cycle, mult takes 2 cycles 3 branch delay slots
31
Philips Research
Code size
Large number of registers cost code size add r1 r3 r127 Large number of media instructions: code size Use hardware compression techniques to remove nops Variable length instruction format! Vector instructions: SIMD, great for data processing performance
32
Philips Research
Code Example
char a[n], b[n], c[n]; int i; for (i=0; i<n; i++) c[i]=a[i]+b[i]; : 1 2 3 4 5 6 VLIW instructions (5 issueslots): x = *a g = i<n jmpf g 7 z = x+y *c = z nop y = *b nop jmpt g 1 nop c+=1 nop i+=1 nop nop nop nop nop a += 1 nop nop nop nop nop b += 1 nop nop nop nop nop
33
Philips Research
34
Philips Research
x6
y6
x5
y5
x4
y4
x3
y3
x2
y2
x1
y1
x0
y0
|.|
|.|
|.|
|.|
|.|
|.|
|.|
|.|
ICS 252 class, February 3, 2000
35
Philips Research
36
Philips Research
Architecture VLIW+SIMD
Issues a 5-slot instruction every cycle Each slot supports a selection of FUs All FUs support vectorized (SIMD) data Double-slot FU allows powerful multiargument, multi-result operations All FUs are pipelined, latency 1 to 4 (except floating point divide and sqrt)
37
Philips Research
Instruction Format
Per operation, per slot: Up to 2 arguments, up to 1 result Optionally an extra register for guarding (conditional execution) Immediate argument size can be 32 bit Instructions have compressed variable length format, decompressed during decode
38
Philips Research
Operations
Intends to cover combinations of: 1-, 2-, 4-byte elements in an 8-byte vector Signed or unsigned type Clipped versus wrap-around arithmetic Set of basic functions (ld, st, add, mul, compare, shift, ...)
39
Philips Research
Instruction Set
Category Load/store ops Byte shuffles Bit shifts Multiplies Integer arith Floating point Lookup table Special ops Branch Cnt 39 67 48 54 104 59 6 23 10 Comment vectorized, endian-correct vector type convert round, fields round, clip, wrap around clip, wrap around scalar and vectorized vectorized MMU, cache, special regs (un)interruptible, trap
40
Philips Research
Example: msp-multiply
Each argument: 4-way 16-bit int
Multiply to internal double precision integer
+
1
41
Philips Research
Example: Transpose
a e i m b f j n c g k o d h l p
a b
e f
i j
m n
(2-dimensional filtering)
42
Philips Research
43
Philips Research
Branches
Branch operations have 3 branch delay slots Compiler & scheduler try to fill these (profiling, loop unrolling, inlining, guarding) Branch units are properly pipelined Up to 3 branch ops in 1 instruction No branch prediction hardware Branches are preferred moments for interrupt servicing
44
Philips Research
2D-IDCT Example
IDCT code was tested early in the project, also for studying the programming model Mapped through our experimental C compiler and instruction scheduler Operates entirely on (vectors of) 16-bit data elements Simulation showed IEEE 1180 accuracy compliancy
45
Philips Research
2D-IDCT Instructions
Execution time in cycles, excluding data cache misses
VLIW: CPU64, TM1000, TI 320C60 others superscalar
400 350 300 250 200 150 100 50 0 CPU64 TM-1000 TI 320C62 Altivec PA-8000 PentiumI PentiumII NEC V830
= accuracy compliant
46
Philips Research
Status
Now in construction at Philips Semiconductors, Sunnyvale 7M transistors (target) 300Mhz clock (target) 0.18 technology 1.8 volt
47
Philips Research
48
Philips Research
Outline
Introduction Application Domain Processor Architecture Retargetable compiler Design Space Exploration Results
49
Philips Research
Traditional Compilation
Compiler A App 1 App 2 App 3 Compiler B
Exe 1A Exe 2A
Exe 3A Exe 1B Exe 2B Exe 3B
Machine A
Machine B
50
Philips Research
Retargetable Compilation
Machine A Description App 1 App 2 App 3 Machine B Description
Exe 1A Exe 2A
Exe 3A
Machine A
Compiler Machine B
ICS 252 class, February 3, 2000
51
Philips Research
MD file contents
MD describes instance from class of machines Contains information such as: number & width of registers number of issue slots number & placement of function units instruction latencies
52
Philips Research
Y-Chart
Machine Description Compiler Simulator Performance Numbers Benchmark Applications
53
Philips Research
54
Philips Research
55
Philips Research
Library structure
source code simulator hw_ops sw_ops application
workstation executable
simulator
application
56
Philips Research
Functional Development
source code hw_ops gcc workstation executable
ICS 252 class, February 3, 2000
sw_ops
application
application
57
Philips Research
Code Tuning
source code simulator hw_ops application
gcc
workstation executable
simulator
interpretation
58
Philips Research
Conclusions Compiler
Applications are written in C code only Machine Description supports changes in ISA Vector types keep code legible Fast functional development track Accurate application tuning track
59
Philips Research
Outline
Introduction Application Domain Processor Architecture Retargetable compiler Design Space Exploration Results
60
Philips Research
61
Philips Research
FU
FU
FU
FU
FU
ICS 252 class, February 3, 2000
62
Philips Research
Multimedia benchmark
Nine multimedia applications from: Data communication Audio coding Video coding Video processing Graphics Representative Applications Code Datasets
ICS 252 class, February 3, 2000
63
Philips Research
64
Philips Research
Essential tooling
Retargetable toolchain Core compiler, scheduler, simulator Design and experiment management DSE support tools Machine generation, pseudosim, analysis Glue software
65
Philips Research
Pseudo-simulation
Pseudosim: a retargetable pseudo-simulator Performs a cycle-accurate calculation of the amount of instruction cycles, without doing the actual machine simulation Gathers other machine statistics, such as utilisation of slots, functional units, operations Time for the whole benchmark is reduced from 18 hours to 4 minutes
66
Philips Research
Pseudo-simulation
Machine description Benchmark applications Machine description
Compiler
ICS 252 class, February 3, 2000
once, 18 hours
Simulator
code Pseudosim
statistics
results
67
Philips Research
FU type
slot
68
Philips Research
69
Philips Research
70
Philips Research
Systematic exploration
It is not feasible to exhaustively compute all design points in the design space Therefore we probe, partition, and then explore the design space Careful set-up of experiments
71
Philips Research
7%
operations
93%
72
Philips Research
Exhaustive experiment
7
x 10 10
Exhaustive experiment
8.8
8.6 194
195
196
197
198
199
200
201
202
area
Close to 3000 machine variations Only 4 machines survive Large performance variation due to placement Time taken = 200 hours
9.8
9.6
cycles
9.4
9.2
73
Philips Research
Greedy experiments
First greedy experiment
7 x 10 9.3
8.8
8.7
8.6 191
192
193
194
195
196
197
area
Less machine variations per experiment More machines survive Less performance variation due to placement Close to another 3000 machine variations over all greedy experiments
9.2
9.1
cycles
8.9
74
Philips Research
Outline
Introduction Tools Exploration Real numbers Conclusions
75
Philips Research
Simulated machines
Experiment number Probing Exhaustive Greedy Total 185 3023 2519 5727
180 Gbyte data transferred 800 Mbyte compressed data stored 2T cycles simulated 10T operations issued
76
Philips Research
Outline
Introduction Tools Exploration Real numbers Conclusions
77
Philips Research
78
Philips Research
79
Philips Research
Conclusions VLIW
Difficult to exploit ILP in general purpose, great for embedded applications, streaming applications and DSP. Compiler is essential Good in combination with vector or SIMD Quantify, do experiments, analyse results and keep doing this: Y-chart. Promising for binary to binary (Transmeta: code morphing), Intel-HP: IA-64 Revival: Philips, Sun MAJC, TI 320C60, Crusoe 3120.