
Exploiting ILP, TLP, and DLP with the Polymorphous TRIPS Architecture


Ramadass Nagarajan
Karthikeyan Sankaralingam
Haiming Liu
Changkyu Kim
Jaehyuk Huh

Doug Burger
Stephen W. Keckler
Charles R. Moore

Computer Architecture and Technology Laboratory


Department of Computer Sciences
The University of Texas at Austin

8/31/05

CART UT-CS

Trends in Programmable Processors

Increased specialization among processors

Benefits of specialization
- Performance, power, area

Problems of specialization
- Poor performance outside the intended domain
- Little design re-use


Workloads are becoming diverse

[Figure: performance of Power4, Pentium4, Intel IXP, and GeForce across network, server, graphics, and desktop workloads. Courtesy: Bob Graybill, DARPA]

Homogeneity versus Heterogeneity

Heterogeneous - multiple different types of processors (e.g., Tarantula [Espasa et al., ISCA 2002])
+ Performance advantages
- Load-balancing inefficiencies
- Higher design complexity

Homogeneous - single or multiple copies of the same processor
+ Flexible/general purpose
+ Ample design reuse
- Processor-mismatch inefficiencies

[Figure: chip floorplans - a heterogeneous mix of UNI, VEC, DSP, and THR cores versus a homogeneous array of identical cores]

Approach: Hardware Polymorphism

- Start with a high-performance homogeneous substrate
- Add coarse-grained reconfigurability to micro-architectural elements
- Manage different elements appropriately for different applications


Challenges for Homogeneous Systems

- High degree of partitioning: necessary for fine-grained concurrency
- High computational density (ALUs/mm^2): necessary for data-parallel applications
- Localized communication: permits technology scalability
- Minimal specialized hardware: reduces design complexity

What is the Right Granularity of Processing?

Spectrum from fine-grained concurrency to coarse-grained/general-purpose:
- FPGA: ~10^6 gates
- PIM: 256 elements
- Fine-grain CMP: 64 in-order cores
- Coarse-grain CMP: 16 out-of-order cores
- 4 ultra-large cores

Configuring granularity through polymorphism:
- Synthesis: emulate coarse-grained cores using fine-grained PEs
- Partitioning: partition coarse-grained cores into fine-grained PEs
- Synthesizing coarse cores from fine-grained PEs is difficult at best

Partitioning ultra-large cores:
- Maximize resources for single-threaded performance
- Sub-divide for finer granularity
- Configure micro-architectural elements for different levels of parallelism

Outline

- Introduction to polymorphous systems
- TRIPS architectural overview
- Polymorphous components
- Supporting different granularities of parallelism
  - Instruction-level parallelism (ILP)
  - Thread-level parallelism (TLP)
  - Data-level parallelism (DLP)
- Conclusions


TRIPS Overview

CMP with large Grid Processor cores and L2 cache banks

[Figure: chip floorplan highlighting a Grid Processor core and an L2 cache bank]


TRIPS Overview

[Figure: Grid Processor core - instruction fetch (IF) and block commit (CT) logic, register file banks with move operations, load/store queues, L2 cache banks, and block termination logic surrounding the ALU grid]

- SPDI: Static Placement, Dynamic Issue
- ALU chaining
- Short wires / wire-delay constraints exposed at the architectural level
- Block-atomic execution

Challenges for Different Levels of Parallelism

Instruction-level parallelism [Nagarajan et al., MICRO 2001]
- Populate a large instruction window with useful instructions
- Schedule instructions to optimize communication and concurrency

Thread-level parallelism
- Partition the instruction window among different threads
- Reduce contention for instruction and data supply

Data-level parallelism
- Provide a high density of computational elements
- Provide high bandwidth to/from data memory


TRIPS Configurable Resources

[Figure: Grid Processor core with its configurable elements highlighted - reservation stations, instruction fetch control, register file banks, load/store queues, and L2 cache banks]

- Reservation stations: instruction window management
- Instruction fetch control: speculation vs. non-speculation, multiple threads, mapping re-use
- Register files: speculative vs. non-speculative data storage
- L2 cache banks: tag lookup, replacement, bandwidth to near banks


Aggregating Reservation Stations: Frames

[Figure: execution node - a stack of instruction buffers (opcode, src1, src2) feeding an ALU and router, with a small add/sub dataflow graph mapped into the buffers]

- Instruction buffers form a logical z-dimension in each node
- 4 logical frames, each with 16 instruction slots
- Instruction buffers add depth to the execution array: a 2D array of ALUs holds a 3D volume of instructions
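As a back-of-the-envelope sketch (the model and function name are ours; the grid dimensions come from this talk), the window capacity is simply the product of the 2D ALU array and the frame depth:

```python
# Toy model of a gridded instruction window: a 2D array of ALU nodes,
# each holding one instruction slot per logical frame (the z-dimension).

def window_capacity(rows, cols, frames):
    """Instructions in flight = one slot per node per frame."""
    return rows * cols * frames

# Illustrated configuration: 4x4 grid, 4 frames -> 64 instructions.
print(window_capacity(4, 4, 4))    # 64

# Peak configuration from the results summary: 128 slots per node.
print(window_capacity(4, 4, 128))  # 2048
```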

Extracting ILP: Frames for Speculation

[Figure: control-flow graph from start to end through blocks A-E; block A executes non-speculatively while predicted blocks C, D, and E execute speculatively in the remaining frame sets]

- 16-wide out-of-order issue
- 16 total frames (4 sets of 4)
- Predict and speculatively execute successive blocks: execute A; predict C, execute C; predict D, execute D; predict E, execute E
- Ultra-wide issue from a large distributed instruction window
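A minimal sketch of the frame-set idea (ours, not the TRIPS microarchitecture: the predictor, trace, and accounting below are invented for illustration) shows why block prediction accuracy matters - speculative frames past a misprediction are squashed work:

```python
# Toy model: 4 frame sets hold one non-speculative block plus up to
# 3 predicted successor blocks; mispredicted blocks are squashed.

FRAME_SETS = 4  # talk: 16 frames = 4 sets of 4

def map_blocks(trace, predict):
    """Run the block trace through the windowed model and return
    (committed, squashed) block counts."""
    committed = squashed = 0
    i = 0
    while i < len(trace):
        # Fill the window: block i plus FRAME_SETS-1 predictions.
        window = [trace[i]]
        while len(window) < FRAME_SETS:
            window.append(predict(window[-1]))
        # Speculative blocks commit until the first misprediction.
        k = 1
        while k < FRAME_SETS and i + k < len(trace) and window[k] == trace[i + k]:
            k += 1
        committed += k
        squashed += FRAME_SETS - k
        i += k
    return committed, squashed

# A loop alternating A->B that finally exits to C; a static
# "most likely successor" predictor does well until the exit.
def predictor(block):
    return {"A": "B", "B": "A"}.get(block, "C")

trace = ["A", "B", "A", "B", "A", "B", "C"]
print(map_blocks(trace, predictor))  # (7, 5): all 7 commit, 5 squashed
```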



ILP Results with Speculation

[Figure: IPC for SPECint programs (bzip2, compress, m88ksim, mcf, vortex, MEAN) and SPECfp programs (ammp, equake, mgrid, swim, tomcatv, MEAN), each with 1, 4, or 16 blocks in flight and with perfect prediction; IPC axis 0-12]

Configuring Frames for TLP

[Figure: frame space divided between Thread 1 (block A1 plus speculative block B1) and Thread 2 (block A2 plus speculative block B2)]

- Divide the frame space among threads
- Each thread's share can be further subdivided to enable some degree of speculation
- Shown: 2 threads, each with 1 speculative block
- An alternate configuration might provide 4 threads
- Result: multiple partitioned instruction windows for different threads
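As a sketch of the partitioning arithmetic (illustrative only; the function and its record format are ours, not a TRIPS interface), splitting a fixed pool of frame sets trades speculation depth per thread for thread count:

```python
# Toy model: divide frame sets evenly among threads; within each
# thread's share, one set is non-speculative and the rest speculate.

def partition_frames(total_frame_sets, n_threads):
    """Return each thread's allocation of frame sets."""
    assert total_frame_sets % n_threads == 0
    share = total_frame_sets // n_threads
    return [{"thread": t, "non_spec": 1, "spec": share - 1}
            for t in range(n_threads)]

# The configuration shown on the slide: 4 frame sets, 2 threads ->
# each thread gets one speculative block beside its non-speculative one.
print(partition_frames(4, 2))
```

With 4 threads instead, each thread keeps a non-speculative block but loses all speculation depth, matching the "alternate configuration" above.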



TLP Results

[Figure: rate of work (IPC) versus number of threads, comparing sequential execution, TLP-mode execution, and multiple processors]

- Speedup: 1.8x to 2.9x
- Reasons for performance losses:
  - Contention for resources (principally in instruction and data supply)
  - Reduced instruction window size


Using Frames for DLP

Streaming kernel:
- read input stream element
- process element
- write output stream element

[Figure: a loop executed N times is unrolled 8x into a block that loops N/8 times]

- Map very large unrolled kernels to the window
- Turn off speculation
- Keep communication localized
- Mapping re-use: fetch/map the loop body once, re-use it many times
- Re-vitalization initiates successive iterations
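The transformation above can be sketched in software terms (the kernel and every name here are illustrative, not from the TRIPS toolchain): unroll the streaming loop 8x so one fetched/mapped block covers 8 elements, and re-use ("revitalize") that block N/8 times instead of re-fetching it per iteration:

```python
# Toy 8x-unrolled streaming kernel.

UNROLL = 8

def kernel(x):
    # "process element" - stands in for any per-element computation
    return 2 * x + 1

def run_stream(inp):
    assert len(inp) % UNROLL == 0  # assume the stream divides evenly
    out = []
    for i in range(0, len(inp), UNROLL):  # N/8 re-uses of the mapped block
        # eight unrolled copies of the loop body form one block
        out.append(kernel(inp[i]))
        out.append(kernel(inp[i + 1]))
        out.append(kernel(inp[i + 2]))
        out.append(kernel(inp[i + 3]))
        out.append(kernel(inp[i + 4]))
        out.append(kernel(inp[i + 5]))
        out.append(kernel(inp[i + 6]))
        out.append(kernel(inp[i + 7]))
    return out

print(run_stream(list(range(16))))  # [1, 3, 5, ..., 31]
```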


Configuring Data Memory for DLP

[Figure: Grid Processor core with a subset of L2 cache banks configured as a stream register file feeding the ALU grid through streaming channels; the remaining banks and L1 banks serve regular data accesses]

- Stream register file (SRF), accessed with LMW
- Subset of L2 cache banks configured as SRF
- High-bandwidth data channels to the SRF
- Reduced address communication
- Constants saved in reservation stations
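To make "reduced address communication" concrete, here is a hypothetical model (ours; the bank count, striping scheme, and names are invented, not the TRIPS LMW definition) of a multi-word access to a banked SRF, where one address yields a word from every bank:

```python
# Toy banked stream register file: records striped word-by-word
# across banks, fetched with a single multi-word access.

BANKS = 4

def multi_word_load(srf_banks, row):
    """One communicated address (row) returns BANKS words - address
    traffic drops by a factor of BANKS versus per-word loads."""
    return [srf_banks[b][row] for b in range(BANKS)]

# Stripe 8 records of 4 words each across the 4 banks:
# record r occupies row r of every bank.
srf = [[r * BANKS + b for r in range(8)] for b in range(BANKS)]
print(multi_word_load(srf, 2))  # [8, 9, 10, 11] - the words of record 2
```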


DLP Results (4x4 GPA)

[Figure: compute instructions per cycle for convert, dct, fft8, fir16, idea, transform, and MEAN, comparing ILP mode, DLP mode, DLP mode with 1/4 load bandwidth, and DLP mode without revitalization; axis 0-16]

Performance metric omits overhead and LD/ST instructions



Results: Summary

ILP: instruction window occupancy
- Peak: 4x4x128 array -> 2048 instructions
- Sustained: 493 for SPECint, 1412 for SPECfp
- Bottleneck: branch prediction

TLP: instruction and data supply
- Peak: 100% efficiency
- Sustained: 87% for two threads, 61% for four threads

DLP: data supply bandwidth
- Peak: 16 ops/cycle
- Sustained: 6.9 ops/cycle


Related Work

Polymorphous homogeneous
- SmartMemories: a modular reconfigurable architecture [Mai et al., ISCA 2000]

Fine-grained homogeneous
- RAW: Baring it all to software [Waingold et al., IEEE Computer 1997]

Ultra-fine-grained homogeneous
- PipeRench: a reconfigurable architecture and compiler [Goldstein et al., IEEE Computer 2000]

Heterogeneous
- Tarantula: vector extensions to the EV8 [Espasa et al., ISCA 2002]


Conclusions

TRIPS: a coarse-grained homogeneous approach with polymorphism - sub-divide a powerful uniprocessor
- ILP: well-partitioned powerful uniprocessor (GPA)
- TLP: divide the instruction window among different threads
- DLP: mapping re-use of instructions and constants in the grid

Future work
- Demonstrate viability with a HW/SW prototype
- Design software interfaces to exploit the configurable hardware
- How well do homogeneous approaches compare with specialized cores?
- How large should these cores scale?

