
How a GPU Works

Kayvon Fatahalian
15-462 (Fall 2011)
Today

1. Review: the graphics pipeline

2. History: a few old GPUs

3. How a modern GPU works (and why it is so fast!)

4. Closer look at a real GPU design


[Image: NVIDIA GTX 285]

Part 1:
The graphics pipeline

(an abstraction)

Vertex processing
Vertices are transformed into screen space

EACH VERTEX IS TRANSFORMED INDEPENDENTLY

[Diagram: input vertices v0 through v5]
Primitive processing
Then organized into primitives that are clipped and culled

[Diagram: vertices v0 through v5 assembled into primitives (triangles)]
Rasterization
Primitives are rasterized into pixel fragments

EACH PRIMITIVE IS RASTERIZED INDEPENDENTLY

[Diagram: fragments covering each triangle]
Fragment processing
Fragments are shaded to compute a color at each pixel

EACH FRAGMENT IS PROCESSED INDEPENDENTLY

[Diagram: shaded fragments]
Pixel operations
Fragments are blended into the frame buffer at their pixel locations
(z-buffer determines visibility)

[Diagram: final pixels]
Pipeline entities

[Diagram: Vertices -> Primitives -> Fragments -> Fragments (shaded) -> Pixels]
Graphics pipeline

Memory buffers feed the pipeline; fixed-function stages alternate with programmable stages:

Vertex data buffers -> Vertex Generation (fixed-function)
  vertex stream
Vertex Processing (programmable, reads textures)
  vertex stream
Primitive Generation (fixed-function)
  primitive stream
Primitive Processing (programmable, reads textures)
  primitive stream
Fragment Generation (fixed-function)
  fragment stream
Fragment Processing (programmable, reads textures)
  fragment stream
Pixel Operations (fixed-function) -> output image (pixels)


Part 2:
Graphics architectures

(implementations of the graphics pipeline)

What's so important about independent computations?
Silicon Graphics RealityEngine (1993)
"graphics supercomputer"

[Diagram: the full pipeline, Vertex Generation through Pixel Operations]
Pre-1999 PC 3D graphics accelerator

CPU: Vertex Generation, Vertex Processing, Primitive Generation, Primitive Processing

3dfx Voodoo / NVIDIA RIVA TNT: clip/cull/rasterize (Fragment Generation), texture units, Fragment Processing, Pixel Operations
GPU* circa 1999

CPU: Vertex Generation

GPU (NVIDIA GeForce 256): Vertex Processing, Primitive Generation, Primitive Processing, Fragment Generation, Fragment Processing, Pixel Operations
Direct3D 9 programmability: 2002

ATI Radeon 9700

[Diagram: 4 programmable vertex processors (Vtx), a clip/cull/rasterize unit, 8 programmable fragment processors (Frag) each paired with a texture unit (Tex), and pixel-operation units; Vertex Processing and Fragment Processing are the programmable stages]
Direct3D 10 programmability: 2006

NVIDIA GeForce 8800 (unified shading GPU)

[Diagram: a scheduler feeding an array of 16 unified programmable cores shared by vertex, primitive, and fragment processing, plus texture units, a clip/cull/rasterize unit, and pixel-operation units]
Part 3:
How a shader core works

(three key ideas)

GPUs are fast

Intel Core i7 Quad Core                         AMD Radeon HD 5870
~100 GFLOPS peak                                ~2.7 TFLOPS peak
730 million transistors                         2.2 billion transistors
(obtainable if you code your program to         (obtainable if you write OpenGL programs
use 4 threads and SSE vector instructions)      like you've done in this class)
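Where those peak numbers roughly come from (back-of-the-envelope; the CPU clock is an assumption): Core i7: 4 cores × 4-wide SSE × (1 mul + 1 add per clock) × ~3 GHz ≈ 100 GFLOPS. Radeon HD 5870: 1600 ALUs × 2 FLOPs per mul-add × 0.85 GHz ≈ 2.72 TFLOPS.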

A diffuse reflectance shader

sampler mySamp;
Texture2D<float3> myTex;
float3 lightDir;

float4 diffuseShader(float3 norm, float2 uv)
{
  float3 kd;
  kd = myTex.Sample(mySamp, uv);
  kd *= clamp(dot(lightDir, norm), 0.0, 1.0);
  return float4(kd, 1.0);
}

Shader programming model:
Fragments are processed independently, but there is no explicit parallel programming.

Independent logical sequence of control per fragment. ***
Big Guy, lookin' diffuse
Compile shader

1 unshaded fragment input record

sampler mySamp;
Texture2D<float3> myTex;
float3 lightDir;

float4 diffuseShader(float3 norm, float2 uv)
{
  float3 kd;
  kd = myTex.Sample(mySamp, uv);
  kd *= clamp(dot(lightDir, norm), 0.0, 1.0);
  return float4(kd, 1.0);
}

<diffuseShader>:
sample r0, v4, t0, s0
mul  r3, v0, cb0[0]
madd r3, v1, cb0[1], r3
madd r3, v2, cb0[2], r3
clmp r3, r3, l(0.0), l(1.0)
mul  o0, r0, r3
mul  o1, r1, r3
mul  o2, r2, r3
mov  o3, l(1.0)

1 shaded fragment output record
Execute shader

[Diagram: a simple core with Fetch/Decode, ALU (Execute), and Execution Context steps through the compiled shader one instruction at a time]

<diffuseShader>:
sample r0, v4, t0, s0
mul  r3, v0, cb0[0]
madd r3, v1, cb0[1], r3
madd r3, v2, cb0[2], r3
clmp r3, r3, l(0.0), l(1.0)
mul  o0, r0, r3
mul  o1, r1, r3
mul  o2, r2, r3
mov  o3, l(1.0)
CPU-style cores

[Diagram: Fetch/Decode, ALU (Execute), and Execution Context, surrounded by a big data cache, out-of-order control logic, a fancy branch predictor, and a memory pre-fetcher]
Slimming down

Idea #1:
Remove the components that help a single instruction stream run fast.

[Diagram: only Fetch/Decode, ALU (Execute), and Execution Context remain]
Two cores (two fragments in parallel)

[Diagram: two slim cores, each with Fetch/Decode, ALU (Execute), and Execution Context; fragment 1 and fragment 2 each run their own copy of the compiled <diffuseShader>]
Four cores (four fragments in parallel)

[Diagram: four slim cores]
Sixteen cores (sixteen fragments in parallel)

16 cores = 16 simultaneous instruction streams


Instruction stream sharing

But ... many fragments should be able to share an instruction stream!

<diffuseShader>:
sample r0, v4, t0, s0
mul  r3, v0, cb0[0]
madd r3, v1, cb0[1], r3
madd r3, v2, cb0[2], r3
clmp r3, r3, l(0.0), l(1.0)
mul  o0, r0, r3
mul  o1, r1, r3
mul  o2, r2, r3
mov  o3, l(1.0)
Recall: simple processing core

[Diagram: Fetch/Decode, ALU (Execute), Execution Context]
Add ALUs

Idea #2:
Amortize the cost/complexity of managing an instruction stream across many ALUs.

SIMD processing

[Diagram: one Fetch/Decode unit feeding ALUs 1-8, eight per-fragment contexts (Ctx), and shared context data]
Modifying the shader

[Diagram: the SIMD core from the previous slide]

<diffuseShader>:
sample r0, v4, t0, s0
mul  r3, v0, cb0[0]
madd r3, v1, cb0[1], r3
madd r3, v2, cb0[2], r3
clmp r3, r3, l(0.0), l(1.0)
mul  o0, r0, r3
mul  o1, r1, r3
mul  o2, r2, r3
mov  o3, l(1.0)

Original compiled shader:
Processes one fragment using scalar ops on scalar registers
Modifying the shader

<VEC8_diffuseShader>:
VEC8_sample vec_r0, vec_v4, t0, vec_s0
VEC8_mul  vec_r3, vec_v0, cb0[0]
VEC8_madd vec_r3, vec_v1, cb0[1], vec_r3
VEC8_madd vec_r3, vec_v2, cb0[2], vec_r3
VEC8_clmp vec_r3, vec_r3, l(0.0), l(1.0)
VEC8_mul  vec_o0, vec_r0, vec_r3
VEC8_mul  vec_o1, vec_r1, vec_r3
VEC8_mul  vec_o2, vec_r2, vec_r3
VEC8_mov  o3, l(1.0)

New compiled shader:
Processes eight fragments using vector ops on vector registers
Modifying the shader

[Diagram: fragments 1-8, one per ALU, all executing <VEC8_diffuseShader> in lockstep on the SIMD core]
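To make the VEC8 instructions concrete, here is a minimal C++ sketch (illustrative only; the names, the fixed width of 8, and the helper functions are assumptions, not the GPU's actual ISA or hardware) of what one vector instruction does: a single instruction applies the same scalar operation to eight fragments' data, one value per ALU lane.

#include <algorithm>
#include <array>
#include <cstdio>

// One "vector register": a value for each of the 8 fragments (one per ALU lane).
using vec8 = std::array<float, 8>;

// VEC8_madd-style operation: one instruction's worth of work, eight lanes at once.
vec8 vec8_madd(const vec8& a, const vec8& b, const vec8& c) {
    vec8 out;
    for (int lane = 0; lane < 8; ++lane)      // in hardware, the 8 ALUs do this in lockstep
        out[lane] = a[lane] * b[lane] + c[lane];
    return out;
}

// VEC8_clmp-style operation: clamp every lane to [lo, hi].
vec8 vec8_clmp(const vec8& x, float lo, float hi) {
    vec8 out;
    for (int lane = 0; lane < 8; ++lane)
        out[lane] = std::min(std::max(x[lane], lo), hi);
    return out;
}

int main() {
    vec8 a{}, b{}, c{};
    a.fill(0.5f); b.fill(2.0f); c.fill(0.25f);
    vec8 r = vec8_clmp(vec8_madd(a, b, c), 0.0f, 1.0f);   // like VEC8_madd then VEC8_clmp
    std::printf("lane 0 = %.2f\n", r[0]);                  // 0.5*2.0 + 0.25 = 1.25, clamped to 1.0
}

The point of the transformation above is that the cost of fetching and decoding each instruction is paid once per vector instruction rather than once per fragment.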


128 fragments in parallel

16 cores = 128 ALUs, 16 simultaneous instruction streams

128 [ vertices / primitives / fragments / OpenCL work items / CUDA threads ] in parallel
But what about branches?

[Diagram: time (clocks) running down, ALU 1 through ALU 8 across; 8 fragments execute in lockstep]

<unconditional shader code>

if (x > 0) {
  y = pow(x, exp);
  y *= Ks;
  refl = y + Ka;
} else {
  x = 0;
  refl = Ka;
}

<resume unconditional shader code>
But what about branches?

[Same code as above, now with a per-ALU condition mask: T T F T F F F F]

Not all ALUs do useful work!
Worst case: 1/8 peak performance
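One common way a SIMD core handles the branch above is with per-lane execution masks; here is a minimal C++ sketch of that idea (the lane count, function names, and constants are illustrative assumptions, not a specific GPU's mechanism). Both sides of the branch are executed, and the mask decides which lanes keep each result, which is why divergent lanes waste ALU slots.

#include <array>
#include <cmath>
#include <cstdio>

constexpr int kLanes = 8;                       // 8 fragments run in lockstep on 8 ALUs
using vec8  = std::array<float, kLanes>;
using mask8 = std::array<bool, kLanes>;

// The if/else from the slide, run for 8 fragments at once.
// Both sides of the branch are evaluated; the mask selects results per lane.
vec8 shade_reflectance(vec8 x, float exp_, float Ks, float Ka) {
    mask8 mask;
    for (int i = 0; i < kLanes; ++i) mask[i] = (x[i] > 0.0f);

    vec8 refl{};
    // "Then" path: every ALU slot is spent, but only masked-on lanes keep the result.
    for (int i = 0; i < kLanes; ++i) {
        float y = std::pow(x[i], exp_) * Ks;
        if (mask[i]) refl[i] = y + Ka;
    }
    // "Else" path: only masked-off lanes keep this result.
    for (int i = 0; i < kLanes; ++i) {
        if (!mask[i]) { x[i] = 0.0f; refl[i] = Ka; }
    }
    return refl;  // cost = then-path time + else-path time, regardless of the mask
}

int main() {
    vec8 x = {0.5f, 0.2f, -1.0f, 0.9f, -0.3f, -2.0f, -0.1f, -0.7f};  // mask: T T F T F F F F
    vec8 r = shade_reflectance(x, 2.0f, 0.8f, 0.1f);
    for (float v : r) std::printf("%.3f ", v);
    std::printf("\n");
}

In practice a core can skip a path whose mask is entirely false, so the 1/8-of-peak worst case only occurs when the lanes genuinely diverge.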


Terminology

"Coherent" execution*** (admittedly a fuzzy definition): when processing of different
entities is similar, and thus can share resources for efficient execution
- Instruction stream coherence: different fragments follow the same sequence of logic
- Memory access coherence:
  - Different fragments access similar data (avoid memory transactions by reusing data in cache)
  - Different fragments simultaneously access contiguous data (enables efficient, bulk-granularity memory transactions)

"Divergence": lack of coherence
- Usually used in reference to instruction streams (divergent execution does not make full use of SIMD processing)

*** Do not confuse this use of the term "coherence" with cache coherence protocols
GPUs share instruction streams across many fragments

In modern GPUs: 16 to 64 fragments share an instruction stream.


Stalls!
Stalls occur when a core cannot run the next instruction because of a
dependency on a previous operation.
Recall: diffuse reflectance shader

sampler mySamp;
Texture2D<float3> myTex;
float3 lightDir;

float4 diffuseShader(float3 norm, float2 uv)
{
  float3 kd;
  kd = myTex.Sample(mySamp, uv);                  // texture access: latency of 100s of cycles
  kd *= clamp(dot(lightDir, norm), 0.0, 1.0);
  return float4(kd, 1.0);
}
Recall: CPU-style core

[Diagram: Fetch/Decode, ALU, and Execution Context, plus OOO exec logic, a branch predictor, and a big data cache (several MB)]
CPU-style memory hierarchy

[Diagram: each processing core (several per chip) has its OOO exec logic, branch predictor, Fetch/Decode, ALU, and execution contexts, backed by an L1 cache (32 KB) and L2 cache (256 KB) per core, an L3 cache (8 MB) shared across cores, and about 25 GB/sec of bandwidth to memory]

CPU cores run efficiently when data is resident in cache
(caches reduce latency, provide high bandwidth)
Stalls!
Stalls occur when a core cannot run the next instruction because of a
dependency on a previous operation.

Texture access latency = 100s to 1000s of cycles

We've removed the fancy caches and logic that help avoid stalls.
But we have LOTS of independent fragments.
(Way more fragments to process than ALUs)

Idea #3:
Interleave processing of many fragments on a single core to avoid
stalls caused by high-latency operations.
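Here is a minimal C++ sketch of idea #3 (a toy model: the instruction count, latency, and group count are made-up parameters, not a real GPU scheduler): the core keeps several fragment groups resident, and whenever the current group stalls on a long-latency operation it switches to another runnable group instead of idling.

#include <cstdio>
#include <vector>

// A resident group of fragments: either runnable, stalled on a fetch, or done.
struct Group {
    int  pc          = 0;      // next instruction to issue
    int  stall_until = 0;      // clock at which an outstanding fetch completes
    bool done        = false;
};

int main() {
    const int kInstructions = 12;   // length of the (made-up) shader
    const int kFetchAt      = 4;    // instruction index that issues a texture fetch
    const int kFetchLatency = 50;   // clocks until the fetch result arrives
    std::vector<Group> groups(4);   // four interleaved groups share one core

    int clock = 0, busy = 0, finished = 0;
    while (finished < static_cast<int>(groups.size())) {
        for (Group& g : groups) {                 // pick the first runnable group
            if (g.done || clock < g.stall_until) continue;
            if (g.pc == kFetchAt)                 // long-latency op: issue it, then stall
                g.stall_until = clock + kFetchLatency;
            if (++g.pc == kInstructions) { g.done = true; ++finished; }
            ++busy;                               // one instruction issued this clock
            break;
        }
        ++clock;                                  // if nothing was runnable, this clock is wasted
    }
    std::printf("ALUs busy on %d of %d clocks\n", busy, clock);
}

With more resident groups, the idle gap between stalls shrinks or disappears entirely; that is exactly the context-storage tradeoff on the "Storing contexts" slides below.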
Hiding shader stalls

[Diagram: time (clocks) running one group of fragments 1-8 on the SIMD core (Fetch/Decode, ALUs 1-8, eight contexts, shared context data)]
Hiding shader stalls

[Diagram: the core's context storage is partitioned into four groups (1-4) holding fragments 1-8, 9-16, 17-24, and 25-32]
Hiding shader stalls

[Diagram: when a group stalls (e.g., on a texture fetch), the core switches to another runnable group; groups 1-4 take turns running and stalling]
Throughput!

[Diagram: groups 1-4 each start, stall, resume, and finish; whenever one group stalls, another runnable group keeps the ALUs busy]

Increase run time of one group to increase throughput of many groups
Storing contexts

[Diagram: the SIMD core (Fetch/Decode, ALUs 1-8) with a pool of context storage (128 KB)]
Eighteen small contexts (maximal latency hiding)

[Diagram: the 128 KB pool divided into 18 small contexts]
Twelve medium contexts

[Diagram: the 128 KB pool divided into 12 medium contexts]
Four large contexts (low latency hiding ability)

[Diagram: the 128 KB pool divided into 4 large contexts (groups 1-4)]
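Rough arithmetic for the tradeoff, using the 128 KB pool above: eighteen small contexts get roughly 7 KB each (many resident groups, maximal latency hiding), while four large contexts get 32 KB each (room for shaders that need lots of state, but few groups to switch to when one stalls).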
My chip!

16 cores

8 mul-add ALUs per core (128 total)

16 simultaneous instruction streams

64 concurrent (but interleaved) instruction streams

512 concurrent fragments

= 256 GFLOPS (@ 1 GHz)
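The peak figure is just the ALU count times the math rate: 16 cores × 8 ALUs × 2 FLOPs per mul-add × 1 GHz = 256 GFLOPS (and likewise 32 × 16 × 2 × 1 GHz ≈ 1 TFLOP for the enthusiast chip on the next slide).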


My enthusiast chip!

32 cores, 16 ALUs per core (512 total) = 1 TFLOP (@ 1 GHz)


Summary: three key ideas for high-throughput execution

1. Use many slimmed-down cores, run them in parallel

2. Pack cores full of ALUs (by sharing instruction stream overhead across groups of fragments)
   - Option 1: explicit SIMD vector instructions
   - Option 2: implicit sharing managed by hardware

3. Avoid latency stalls by interleaving execution of many groups of fragments
   - When one group stalls, work on another group
Putting the three ideas into practice:
A closer look at a real GPU

NVIDIA GeForce GTX 480


NVIDIA GeForce GTX 480 (Fermi)

NVIDIA-speak:
- 480 stream processors ("CUDA cores")
- "SIMT execution"

Generic speak:
- 15 cores
- 2 groups of 16 SIMD functional units per core
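(The two descriptions agree: 15 cores × 2 groups × 16 SIMD functional units = 480 "CUDA cores".)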
NVIDIA GeForce GTX 480 core

[Diagram: Fetch/Decode feeding SIMD functional units, execution contexts (128 KB), and shared "scratchpad" memory (16+48 KB)]

= SIMD functional unit, control shared across 16 units (1 MUL-ADD per clock)

Groups of 32 fragments share an instruction stream

Up to 48 groups are simultaneously interleaved

Up to 1536 individual contexts can be stored

Source: Fermi Compute Architecture Whitepaper
        CUDA Programming Guide 3.1, Appendix G
NVIDIA GeForce GTX 480 core

[Diagram: two Fetch/Decode units feeding 32 SIMD functional units, execution contexts (128 KB), and shared "scratchpad" memory (16+48 KB)]

The core contains 32 functional units

Two groups are selected each clock
(decode, fetch, and execute two instruction streams in parallel)

Source: Fermi Compute Architecture Whitepaper
        CUDA Programming Guide 3.1, Appendix G
NVIDIA GeForce GTX 480 "SM"

[Diagram: two Fetch/Decode units feeding 32 CUDA cores (1 MUL-ADD per clock each), execution contexts (128 KB), and shared "scratchpad" memory (16+48 KB)]

The SM contains 32 CUDA cores

Two warps are selected each clock
(decode, fetch, and execute two warps in parallel)

Up to 48 warps are interleaved, totaling 1536 CUDA threads

Source: Fermi Compute Architecture Whitepaper
        CUDA Programming Guide 3.1, Appendix G
NVIDIA GeForce GTX 480

There are 15 of these things on the GTX 480:
That's 23,000 fragments!
(or 23,000 CUDA threads!)
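(The arithmetic: 15 SMs × 48 warps × 32 CUDA threads per warp = 15 × 1536 = 23,040 resident threads, i.e., roughly 23,000.)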
Looking Forward

Current and future: GPU architectures
- Bigger and faster (more cores, more FLOPS)
  - 2 TFLOPS today, and counting
- Addition of (select) CPU-like features
  - More traditional caches
- Tight integration with CPUs (CPU+GPU hybrids)
  - See AMD Fusion
- What fixed-function hardware should remain?
Recent trends

Support for alternative programming interfaces
- Accelerate non-graphics applications using the GPU (CUDA, OpenCL)

How does the graphics pipeline abstraction change to enable more advanced real-time graphics?
- Direct3D 11 adds three new pipeline stages
Global illumination algorithms

Ray tracing: for accurate reflections, shadows
(Image credits: NVIDIA, Ingo Wald, Bratincevic)

Alternative shading structures (e.g., deferred shading)
- For more efficient scaling to many lights (1000 lights, [Andersson 09])
Simulation

Cinematic scene complexity
(Image credit: Pixar, Toy Story 3, 2010)

Motion blur
Thanks!

Relevant CMU courses for students interested in high-performance graphics:

15-869: Graphics and Imaging Architectures (my special topics course)
15-668: Advanced Parallel Graphics (Treuille)
15-418: Parallel Architecture and Programming (spring semester)
