
Optimizing PCIe Performance in PCs & Embedded Systems
Mike Alford
Gennum
Copyright 2009, PCI-SIG, All Rights Reserved
Disclaimer
Presentation Disclaimer: All opinions, judgments, recommendations, etc. that are presented herein are the opinions of the presenter of the material and do not necessarily reflect the opinions of the PCI-SIG.
Some information in this presentation refers to a specification still in the development process.
Agenda
Latency
Link layer
Packet layer
Driver/SW
System level
DMA engine architecture
Conventional
PCIe optimized
Root Complex Characteristics
Measured vs. theoretical
Avoiding hazards and race conditions
Interrupt controller design
Overview
The objective of this presentation is to explore some of the important elements of system performance with respect to endpoint HW and SW design:
How should they work and perform?
How do they perform in actual implementations?
Best practices in endpoint design (FPGA and ASIC)
Latency
There are many forms or layers of latency in systems
[Figure: system-level, end-to-end latency path spanning application SW, driver SW, the RC, a switch, the endpoint DMA/PCIe core, and memory]
Link Level Latency (PM, etc.): 10s of ns
Packet Level Latency: 100s of ns
Driver Level Latency (interrupt service): 10s to 100s of us
Application Latency: > ms
Latency Impact of
Power Management
PM Scenario 1: Negotiate down the number of lanes
[Figure: one packet striped across 4 lanes vs. the same packet serialized over a single lane, incurring additional latency]
PM Scenario 2: Aggregate packets
[Figure: packets on Link A and Link B aggregated before transmission, incurring additional latency]
Packet Latency: Pre-PCI
Prior to PCI, the cost of polling IO was about
the same as polling memory
On the order of several hundred ns
This assumption was accepted in the architecture of
IO devices and driver software
[Figure: 486-class CPU with RAM and IO devices on a shared 32-bit/33 MHz local bus]
Packet Latency: PCI
With PCI, system aggregate
bandwidth improved while IO
latency degraded
Memory access has a lower latency cost compared to IO
Especially if the IO is below multiple layers of bridges
Packet Latency: PCI Express
PCI Express provides even more
options for system expansion
that can increase IO latency
[Figure: CPU and DRAM attached to the root complex, with a GPU, PCIe switches, PCIe repeaters, a PCIe cable, and IO endpoints hanging below it]
PCIe cable delay: ~43ns per 10m
Switch latency: ~200ns or more
Polling Latency Across the
System
[Chart: average read latency (ns, log scale) for a cache read, a memory read, and a PCIe endpoint read, measured on 2 different PC systems]
Cache read: ~1.5 to 2.5 ns
Memory read: ~67 to 71 ns
PCIe endpoint read: ~920 to 1541 ns
Interrupt Latency Factors
More cache layers (L1/L2/L3)
Cache miss probability decreases but penalty for a
miss increases
Results in less predictable interrupt latency (larger max/min
ratio)
Will tend to get worse in the future
Deeper processor pipelines result in longer interrupt latencies due to the larger amount of context information that must be flushed or stored
Interrupt Latency Experiment
Use a test endpoint card to generate an interrupt
under SW control
Repeatedly measure the time interval between
assertion of the interrupt and the de-assertion
from within the interrupt handler to generate a
histogram
Vary system loading to observe the effect on interrupt
latency
[Timing diagram: interval T measured from interrupt assertion by HW to de-assertion by the ISR]
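A minimal host-side sketch of this measurement loop, assuming a hypothetical test endpoint with a software-controlled INT_FORCE register mapped through BAR0 (register name and offset are assumptions), and approximating the assert-to-de-assert interval with host timestamps rather than the on-card measurement used in the experiment:

#include <stdint.h>
#include <time.h>

#define INT_FORCE 0x40              /* hypothetical endpoint register offset */
#define BINS      1024              /* 1 us histogram bins */

static volatile uint32_t *bar0;     /* endpoint BAR0, mapped elsewhere */
static uint64_t t_assert;
static uint32_t histogram[BINS];

static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Called under SW control: timestamp, then tell the endpoint to assert its interrupt. */
static void fire_interrupt(void)
{
    t_assert = now_ns();
    bar0[INT_FORCE / 4] = 1;
}

/* Called from the interrupt handler: de-assert and bin the measured latency. */
static void isr_body(void)
{
    bar0[INT_FORCE / 4] = 0;
    uint64_t bin = (now_ns() - t_assert) / 1000;   /* microsecond bins */
    histogram[bin < BINS ? bin : BINS - 1]++;
}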
Interrupt Latency Example
[Histogram: interrupt latency distribution, latency in microseconds (10 to 1000) on the x-axis, occurrence count (1 to 100,000, log scale) on the y-axis, one trace per load condition]
Latency (System Idle) Max=33us
Latency (>90% CPU Load) Max=18.2ms
Latency (20% CPU Load) Max=23.4ms
Data Measured from PC System Running Windows XP
Note: Max values are the largest latencies measured under those conditions.
Simple DMA Service Scenario
[Sequence diagram: the host repeatedly sets up a DMA transfer and the peripheral executes it]
If the latency between transfers is too long, the peripheral will starve (example: dropped video frames, audio breakup)
Service Latency vs. Buffer Size
[Chart: required buffer size (KB, 0-500) vs. throughput (MB/s, 0-500), one curve per service latency: 25, 50, 100, 250, 375, 500, 750, and 1000 us]
Example: 1 stream of 1080p60 video
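The chart follows from a simple rule: to ride out a worst-case service latency without starving, the endpoint buffer must hold at least throughput x latency worth of data. A minimal sketch of that arithmetic, using an assumed ~3 bytes/pixel for raw 1080p60 video (the exact rate for the slide's example is not given):

#include <stdio.h>

/* Buffer needed (KB) to ride out a given service latency at a given rate. */
static double buffer_kb(double throughput_mb_s, double latency_us)
{
    /* MB/s * us = bytes (1e6 B/s * 1e-6 s); divide by 1024 for KB. */
    return throughput_mb_s * latency_us / 1024.0;
}

int main(void)
{
    double video_mb_s = 1920.0 * 1080 * 60 * 3 / 1e6;   /* ~373 MB/s, assumed */
    printf("1080p60 (~%.0f MB/s), 250 us latency: %.0f KB\n",
           video_mb_s, buffer_kb(video_mb_s, 250));
    printf("1080p60 (~%.0f MB/s), 1000 us latency: %.0f KB\n",
           video_mb_s, buffer_kb(video_mb_s, 1000));
    return 0;
}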
DMA Engine Architecture
With PCIe, DMA can take advantage of the full
duplex nature of the link
DMA can consist of multiple upstream DMA
threads and multiple downstream DMA threads
However, only one thread in each direction can be
active on the link at any one instant
Conventional Multi-channel
DMA Approach
[Block diagram: multiple upstream DMA controllers (Up DMA1..DMAm) and downstream DMA controllers (Down DMA1..DMAn) sharing a single AHB bus through an AHB arbiter and AHB controller into the PCIe core (TL/LL/PHY)]
Interconnects like AHB are a poor choice because they don't provide full duplex data transfer
PCIe Optimized DMA
Approach
[Block diagram: PCIe-optimized DMA with independent upstream and downstream data paths, using a MUX on the upstream side and a DECODE on the downstream side]
Scatter/Gather Controller
Example
[Block diagram: the SG engine couples a sequencer (registers DPTR, RA, RB, SYS_ADDR_H, SYS_ADDR_L, XFER_CTL, EVENT_SET, EVENT_CLR, EVENT_EN, EVENT, CSR) with a descriptor RAM and instruction decoder; external conditional inputs and a JMP condition select feed the sequencer; the host has access to the descriptor RAM and the SG registers; upstream and downstream DMA masters move data through per-channel FIFOs selected by a MUX (upstream channel select) and a DECODE (downstream channel select); FIFO status feeds back to the sequencer; the application interacts with the EVENT register, which also drives the interrupt output]
Example SG List Entry
XFER_CTL: specifies xfer count, direction, stream ID, etc.
SYS_ADDR_H / SYS_ADDR_L: 64-bit host memory address
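A sketch of one list entry as a C structure; the three words mirror the slide, but the exact bit layout of XFER_CTL is device-specific and assumed here.

#include <stdint.h>

/* One scatter/gather list entry, three 32-bit words as described above. */
typedef struct {
    uint32_t xfer_ctl;      /* transfer count, direction, stream ID, etc. */
    uint32_t sys_addr_h;    /* upper 32 bits of the host memory address */
    uint32_t sys_addr_l;    /* lower 32 bits of the host memory address */
} sg_list_entry;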
SG Engine Instruction Set
Load, Store, Add System Address
Used to manipulate the system address register
System address register used to specify the host source/destination
address for DMA xfer
Load XFER_CTL
Pushes a command into either the upstream or downstream
DMA controller
Load, Store, Add RA/RB Registers
Used to manipulate the indexing/counting registers RA and RB
Conditional Jump
Used for polling FIFO status and for looping
Event assertion
Used as a semaphore mechanism and to signal interrupts
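A hypothetical sketch of how these instruction classes could combine into a looping descriptor program; none of the opcode names, encodings, or immediates below come from the slide:

#include <stdint.h>

typedef enum {
    OP_LOAD_SYS_ADDR,   /* set the host source/destination address        */
    OP_ADD_SYS_ADDR,    /* advance the system address by an immediate     */
    OP_LOAD_XFER_CTL,   /* push a command to the up/downstream DMA engine */
    OP_LOAD_RA,         /* load loop counter RA                           */
    OP_ADD_RA,          /* adjust loop counter RA                         */
    OP_JMP_COND,        /* conditional jump (loop / poll FIFO status)     */
    OP_EVENT_SET        /* raise an event / signal an interrupt           */
} sg_op;

typedef struct { sg_op op; uint64_t arg; } sg_insn;

/* Move 16 contiguous 64 KB buffers upstream, then signal the host. */
static const sg_insn program[] = {
    { OP_LOAD_SYS_ADDR, 0x100000000ull },  /* 0: base host address        */
    { OP_LOAD_RA,       16 },              /* 1: RA = 16 buffers          */
    { OP_LOAD_XFER_CTL, 0x10000 },         /* 2: 64 KB, upstream (loop)   */
    { OP_ADD_SYS_ADDR,  0x10000 },         /* 3: next host buffer         */
    { OP_ADD_RA,        (uint64_t)-1 },    /* 4: RA--                     */
    { OP_JMP_COND,      2 },               /* 5: while RA != 0, goto 2    */
    { OP_EVENT_SET,     1 },               /* 6: done - raise event/IRQ   */
};

The driver would write a program like this into the descriptor RAM once and let the sequencer run it, rather than re-arming the DMA from software on every buffer.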
Simplified DMA Main
Control Sequence
[Flowchart: Initialize, then loop forever; for each channel 0..n-1, if the channel is ready for servicing, service it, otherwise move on to the next channel]
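The flowchart translates to a simple round-robin loop; a sketch in C with placeholder channel checks (in the real design this sequence runs in the SG engine sequencer, not on the host):

#define NUM_CHANNELS 4              /* assumed channel count */

/* Placeholders for device-specific logic: "ready" might mean a FIFO has
 * data (upstream) or room (downstream); "service" queues the next transfer. */
static int  channel_ready(int ch)   { (void)ch; return 0; }
static void service_channel(int ch) { (void)ch; }

static void dma_main_sequence(void)
{
    for (;;) {                                   /* after initialization */
        for (int ch = 0; ch < NUM_CHANNELS; ch++)
            if (channel_ready(ch))
                service_channel(ch);
    }
}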
Root Complex Characteristics
Max Payload Size is 128B on most systems
256B on some server-class chipsets and newer desktops
512B seen in some of the latest systems at Plugfest
Spec allows up to 4KB
Max read request size supported
Typical is 512B
Spec allows up to 4KB
Typical read completion packet size
Most systems use a cache line based fetching mechanism resulting in 64B cache-aligned packets
Some RC chipsets provide a read combining feature that will opportunistically combine multiple sequential 64B packets together
Spec allows up to 4KB
Virtual channels
Typical RCs/switches/FPGA cores support only the default VC0
Spec allows for up to 8 hardware VCs
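The Max Payload Size and Max Read Request Size actually in effect are programmed into the endpoint's PCIe Device Control register (Max_Payload_Size in bits 7:5, Max_Read_Request_Size in bits 14:12), both encoded as 128 << field. A small decoding sketch; the example register value is illustrative only:

#include <stdint.h>
#include <stdio.h>

static unsigned mps_bytes(uint16_t dev_ctl)   /* Device Control bits 7:5 */
{
    return 128u << ((dev_ctl >> 5) & 0x7);
}

static unsigned mrrs_bytes(uint16_t dev_ctl)  /* Device Control bits 14:12 */
{
    return 128u << ((dev_ctl >> 12) & 0x7);
}

int main(void)
{
    uint16_t dev_ctl = 0x2000;   /* illustrative raw register value */
    printf("Max Payload Size: %u B, Max Read Request Size: %u B\n",
           mps_bytes(dev_ctl), mrrs_bytes(dev_ctl));   /* 128 B, 512 B */
    return 0;
}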
Outstanding Reads vs.
Performance
[Chart: measured throughput (MB/s, 0-800) for endpoint DMA reads to 4 different RCs (Systems 1-4) with 1, 2, 3, and 4 outstanding reads]
Measured results for endpoint DMA read to 4 different RCs (PCIe 1.x x4 link)
Completion Ordering
Completion Ordering
Summary
FIFO Order
Lowest latency for single stream traffic
Fewer outstanding requests needed to sustain throughput
Out of Order
Lowest latency for multi stream traffic
Better when you have multiple streams with small FIFOs
Can use one outstanding request per FIFO and thus avoid re-ordering
logic
Actual systems
Some use FIFO order, some use out-of-order
The endpoint generally needs to support both unless the RC is always known
Typical PCIe IP cores do not re-order completions for you
Requires additional logic
Link Efficiency
The specified link rate of 2.5GT/s (PCIe 1.x), 5GT/s (PCIe 2.x), or 8GT/s (PCIe 3.0) is not all usable
Example:
An x1 link at PCIe 1.x carries 312.5 MB/s of raw bandwidth
Subtract 8b/10b encoding = 250MB/s
Subtract link layer traffic (ack/nack, replay, FC updates, etc.)
Subtract packet overhead
Packet overhead:
PHY/PCS framing: STP (1 Byte), END (1 Byte)
DLL: Sequence Number (2 Bytes), LCRC (4 Bytes)
TLP: Header (12 Bytes for 32-bit requests and completions, 16 Bytes for 64-bit requests), Data Payload (between 4 Bytes and MAX_PAYLOAD size)
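Putting the framing numbers together gives the efficiency-vs-payload-size curves on the next slide; this sketch counts only the per-TLP overhead above and ignores DLLP traffic such as ACK/NAK and flow-control updates:

#include <stdio.h>

/* Fraction of link bytes carrying payload for one TLP:
 * overhead = STP(1) + sequence(2) + header(12 or 16) + LCRC(4) + END(1). */
static double link_efficiency(unsigned payload_bytes, unsigned header_bytes)
{
    unsigned overhead = 1 + 2 + header_bytes + 4 + 1;
    return (double)payload_bytes / (payload_bytes + overhead);
}

int main(void)
{
    for (unsigned p = 4; p <= 4096; p *= 2)
        printf("%4u B payload: %5.1f%% (12 B hdr)  %5.1f%% (16 B hdr)\n",
               p, 100.0 * link_efficiency(p, 12), 100.0 * link_efficiency(p, 16));
    return 0;
}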
Link Efficiency vs.
Payload Size
[Chart: link efficiency (0-100%) vs. payload size (4 Bytes to 4KB), one curve for a 12-byte header and one for a 16-byte header]
RC/Endpoint Performance
Under System Load
Experiment:
An x4 endpoint doing 600-700MB/s of DMA in each direction to host memory via the RC, with the system at ~2% CPU load (including the DMA driver)
What happens to endpoint throughput when the host is stressed?
Scenarios:
100% CPU load, memory stress test, high IO traffic, GPU to host memory traffic
Test results show a worst-case degradation of only ~6% on a variety of PC motherboards
Conclusion
Typical PC RC memory contention is minimal
Rule of thumb for sustainable throughput:
150MB/s per PCIe lane per direction (double for PCIe 2.x)
Avoiding Hazards and
Race Conditions
The definition of endpoint control/status registers needs to be multi-core/multi-thread friendly
Think like a driver/OS programmer (or at least have them review your spec)
Avoid registers that cause a state change on a read
Can be a problem for bridges/processors that do caching/prefetching
Where reads cause state changes, avoid packing multiple bit fields into the same naturally aligned DW (32 bits)
Example: 8-bit Read FIFOs from 2 different UARTs packed into the same DW (see the sketch below)
Why?
Byte lane selection is not available for block operations and prefetching
Some processors don't have byte lane information for reads
Use IOV-like constructs such as providing multiple views of the register space to different processors/processes
Impact on performance
Poor control mechanisms can result in ugly SW workarounds that can seriously impact performance
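A register-map sketch of the UART example above (offsets and structure names are hypothetical): packing two read-to-pop FIFO ports into one DW means a 32-bit read or a prefetch pops both, while giving each its own naturally aligned DW avoids the hazard.

#include <stdint.h>

/* Problematic: one 32-bit read (or a prefetch) consumes data from both UARTs. */
typedef struct {
    volatile uint8_t uart0_rx_fifo;   /* read pops UART0 RX FIFO */
    volatile uint8_t uart1_rx_fifo;   /* read pops UART1 RX FIFO */
    volatile uint8_t reserved[2];
} packed_uart_regs;

/* Safer: each read-sensitive register gets its own naturally aligned DW. */
typedef struct {
    volatile uint32_t uart0_rx_fifo;  /* offset 0x0 */
    volatile uint32_t uart1_rx_fifo;  /* offset 0x4 */
} aligned_uart_regs;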
Interrupt Controller Design
Problem:
Use of IP blocks can result in a poorly thought out interrupt controller that SW engineers will hate you for
Example: lack of a single register to determine the source of a shared interrupt
Common practice is to OR together all on-chip interrupt sources and use that to generate INTx or MSI/MSI-X
Best practices
Single read only status register where the status bits of all IP blocks are
always readable (even if interrupt forwarding/generation is disabled)
No need to poll multiple registers to determine the source of an interrupt
You always have the option of polling using a single register rather than
employing interrupts
When multiple interrupts (hard or messaged) are to be generated, have an enable (or mask) register per interrupt output
Make sure it is possible to clear each interrupt source separately
Clearing one interrupt source will not cause another to be cleared unintentionally
Make sure you support INTx (legacy interrupt mode) in addition to MSI or MSI-X
Windows XP and earlier do not support MSI/MSI-X
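A driver-side sketch of why the single always-readable status register matters: one read identifies every pending source, and each source is cleared individually. The INT_STAT name follows the example controller on the next slide; the write-1-to-clear INT_CLR register and the offsets are assumptions.

#include <stdint.h>

#define INT_STAT  (0x00 / 4)   /* single read-only interrupt status register */
#define INT_CLR   (0x04 / 4)   /* assumed write-1-to-clear register          */

typedef void (*int_handler)(void);

/* Works the same whether called from an ISR or a low-frequency poller. */
static void dispatch_interrupts(volatile uint32_t *regs, int_handler handlers[32])
{
    uint32_t status = regs[INT_STAT];        /* one read finds every source */
    for (unsigned bit = 0; bit < 32; bit++) {
        if (status & (1u << bit)) {
            handlers[bit]();                 /* service that IP block       */
            regs[INT_CLR] = 1u << bit;       /* clear only this source      */
        }
    }
}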
Example Interrupt Controller
[Block diagram: 8 programmable I/O pins driven by a GPIO controller with GPIO control and output registers; GPIO and on-chip interrupt sources feed an interrupt controller with outputs INT0-INT7, a single read-only INT_STAT interrupt status register, one INT_CFG interrupt configuration register per interrupt output, and MSI generation]
Tips for Scalable Endpoint
Design
Assume that packet latency and context
switching latency will increase in future systems
Don't rely on fast interrupt handling to keep your data pipes filled
Avoid interrupts altogether if possible
Use low-frequency polling
Rely on DMA with large scatter/gather lists that don't have to be updated very often
Assume that throughput AND latency will
increase in future systems
Have the host driver poll host-memory-based semaphores rather than polling the IO subsystem
Have the IO subsystem write semaphores into host memory
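A minimal sketch of the host-memory semaphore idea, assuming the endpoint DMA-writes a completion counter into a driver-allocated, DMA-coherent buffer (structure and field names are hypothetical): the driver polls its own memory, which costs a cache or DRAM access, instead of issuing ~1 us non-posted reads to the endpoint.

#include <stdint.h>

/* Written by the endpoint via upstream DMA; read by the host driver. */
struct write_back {
    volatile uint32_t completed;   /* running count of completed buffers */
};

/* Poll host memory: returns nonzero once at least 'wanted' more buffers
 * have completed since 'last_seen'. Wrap-safe unsigned arithmetic. */
static int buffers_done(const struct write_back *wb,
                        uint32_t last_seen, uint32_t wanted)
{
    return (uint32_t)(wb->completed - last_seen) >= wanted;
}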
Summary
SW/HW interaction should be designed to be relatively
insensitive to packet level latency
High sensitivity to packet level latency may be a sign of poor
HW/SW interaction
Assume that interrupt latency will be wildly variable
For isochronous data (example: video) rely on large SG lists so
that the endpoint can operate for a long period without interrupt
servicing
Take advantage of the bidirectional nature of the PCIe
link
Avoid internal buses that are not bidirectional
Or have separate upstream/downstream buses to feed the
transaction layer
Thank you for attending the
PCI-SIG Developers Conference 2009
For more information please go to
www.pcisig.com
