Packet Switching Eng 3pp

1
1
Scheduling algorithms for
input-queued IP routers
Andrea Bianco Paolo Giaccone
Gruppo Reti di Telecomunicazioni
Dipartimento di Elettronica
Politecnico di Torino
http://www.tlc-networks.polito.it
Switching Architectures 2011/12
2
Outline
IP routers
OQ routers
IQ routers
Scheduling
Optimal algorithms
Heuristic algorithms
Packet-mode algorithms
Networks of routers
QoS support
CIOQ routers
Multicast traffic
Conclusions
3
Note
The slides marked RWP are reproduced with
permission of Prof.Nick McKeown from the
Electrical Engineering and Computer Science
Dept. of Stanford University (CA,USA)
2
4
Outline
IP routers
OQ routers
IQ routers
Scheduling
Optimal algorithms
Networks of routers
QoS support
CIOQ routers
Multicast traffic
Conclusions
5
The Internet is a mesh of routers
core
router
access
router
enterprise
router
6
Access router:
high number of ports at low speed (kbps/Mbps)
several access protocols (modem, ADSL, cable)
Enterprise router:
medium number of ports at high speed (Mbps)
several services (IP classification, filtering)
Core router:
low number of ports at very high speed (Mbps/Gbps)
very high throughput
few services
The Internet is a mesh of routers
3
7
Basic architecture
Control Plane
Datapath
per-packet
processing
Switching
Forwarding
Table
Routing
table
Routing
protocols
8
Basic functions
Routing
computation of the output port of
an incoming packet (forwarding)
uses the routing tables computed by
the routing protocols
can be a complex procedure:
very large routing tables
dynamic variation of routes in the Internet
9
Basic functions
Switching
transfer of packets from input ports
to output ports
solution of the contentions for output ports
queueing methods
where to store
scheduling methods
what to transfer
4
Faster and faster
Need for high performance routers
to accommodate the bandwidth demands
for new users and new services
to support QoS (over-provisioning)
to reduce costs with respect to a cloud of smaller size
routers (maybe)
a smaller number of fibers is needed
a smaller number of devices (but it is less costly?)
May be more energy hunghry
to ease of the management task
10 Switching Architectures 2011/12
11
Packet processing and link speed
0,1
1
10
100
1000
10000
1985 1990 1995 2000
S
p
e
c
9
5
I
n
t

C
P
U

r
e
s
u
l
t
s
0,1
1
10
100
1000
10000
1985 1990 1995 2000
F
i
b
e
r

C
a
p
a
c
i
t
y

(
G
b
i
t
/
s
)
TDM DWDM
Packet processing Power Link Speed
Moores law
2x / 18 months
2x / 7 months
Source: SPEC95Int & David Miller, Stanford.
RWP
Increase of electronic packet processing power cannot
accommodate the increase in link speed
?
12
0,001
0,01
0,1
1
10
100
1000
1980 1983 1986 1989 1992 1995 1998 2001
A
c
c
e
s
s

T
i
m
e

(
n
s
)
Moores Law
2x / 18 months
1.1x / 18 months
RWP
Memory access time
5
13
Its hard to keep up with Moores law:
the bottleneck is memory speed
Moores law is too slow:
routers need to improve faster
than Moores law
RWP
Moores law
14
Router capacity exceeds Moores law
Growth in capacity of commercial routers:
1992 ~ 2 Gb/s
1995 ~ 10 Gb/s
1998 ~ 40 Gb/s
2001 ~ 160 Gb/s
2003 ~ 640 Gb/s
Average growth rate: 2.2x / 18 months
RWP
15
Single packet processing
The time to process one packet
is becoming shorter and shorter
worst case: 40-Byte packets (ACKs)
travelling over the Internet
3.2 s at 100 Mbps
320 ns at 1 Gps
32 ns at 10 Gps
3.2 ns at 100 Gbps
320 ps at 1Tbps
6
16

S F
LC
LC
LC
LC
CP
S F
IP
IP
IP
IP
CP
OP
OP
OP
OP
Hardware architecture
physical structure logical structure
17
Main elements
line cards
support input/output transmissions
adapt packets to the internal format of the switching fabric
support data link protocols
In most architectures
store packets
classify packets
schedule packets
support security
switching fabric
transfers packets from input ports to output ports
18
control processor/network processor
runs routing protocols
computes and stores routing tables
manages the overall system
sometimes
store packets
classify packets
schedule packets
support security
forwarding engines
inspect packet headers
compute the packet destination (lookup)
Searching routing or forwarding (chaching) tables
rewrite packet headers
7
19
switching fabric
line card line card
control processor &
forwarding engine
1 N
Interconnections among main
elements - I
20
switching fabric
line card line card
control
processor
forwarding
engine
forwarding
engine
1 N
elements - II
21
elements - II
switching fabric
line card &
forwarding engine
control
processor
1
line card &
forwarding engine
N
8
Cell-based routers
ISM: Input-
Segmentation Module
ORM: Output-
Reassembly Module
packet: variable-size
data unit
cell: fixed-size data
unit
22
Cell switch (fabric) ORM
1
ORM
N
1
ISM
N
ISM
packets cells cells packets
23
Switching fabric
Our assumptions:
bufferless
to reduce internal hardware complexity
non-blocking
given a non-conflicting set of inputs/outputs,
it is always possible to connect inputs with
outputs
24
Switching fabric
Examples:
bus
shared memory
crossbar
Multi-stage
rearrangeable Clos network
Benes network
Batcher-Banyan network (self-routing)
Switching constraints
at most one cell for each input and for each output
can be transferred
1
2
3
4
1 2 3 4
outputs
i
n
p
u
t
s
9
25
Switching fabric
We do not discuss switching fabrics with
internal buffers
e.g.: crossbars with buffer at each crosspoint
26
Generic switching architecture
Output 1
switching fabric
Input 1
Input N
Output N
S
in
S
in
S
out
S
out
input queues
output queues
27
Speedup
The speedup limits the switch performance
S
in
= reading speed from input queues
S
out
= writing speed to output queues
The main performance limit can be due to
the maximum speedup factor:
S = max(S
in
,S
out
)
10
28
Performance comparison
The performance of different switching
systems can be studied
with analytical models
introducing simplifying assumptions, but
obtaining general results
with simulation models
obtaining more detailed results
29
Traffic description
A
ij
(n) = 1 if a packet arrives at time n at input i,
with destination reachable through output j

ij
= E[A
ij
(n)]
An arrival process is admissible if:

i
ij
< 1

j
ij
< 1
that is, no input and no output are overloaded
on average
note that OQ switches exhibit finite delays only
for admissible traffic
traffic matrix: = [
ij
]
30
Traffic scenarios
Uniform traffic
Bernoulli i.i.d. arrivals
usual testbed in the literature
easy to schedule
Diagonal traffic
Bernoulli i.i.d arrivals
critical to schedule, since
only two matchings are good
(
(
(
(
=
2 0 0 1
1 2 0 0
0 1 2 0
0 0 1 2
3
(
(
(
(
=
1 1 1 1
1 1 1 1
1 1 1 1
1 1 1 1
N

11
31
Traffic scenarios
LogDiagonal traffic
Bernoulli i.i.d. arrivals
more critical than uniform,
less than diagonal traffic
(
(
(
(
=
8 1 2 4
4 8 1 2
2 4 8 1
1 2 4 8
1 2
N

32
Outline
IP routers
OQ routers
IQ routers
Scheduling
Optimal algorithms
Networks of routers
QoS support
CIOQ routers
Multicast traffic
Conclusions
33
Output Queued (OQ) switches
S
in
= 1 S
out
= N
used for low bandwidth routers
no coordination among ports
work-conserving
best average delays
complete control of delays
support of QoS scheduling
12
34
Output Queued (OQ) switch
speedup N
Output N
Output 1
switching fabric
Input 1
Input N
35
0% 20% 40% 60% 80% 100%
Normalized load
D
e
l
a
y
OQ performance
OQ
Note: OQ is optimal from the point
of view of average delay and
throughput
Uniform traffic
36
Stability, throughput and delays
Hp: stationary system, infinite queue
for a particular
in
stable finite occupancy finite delays
in
=
out
100% throughput stable under any
in
admissible
in

out
13
37
Stability, throughput and delays

out

max
stable(
in
)
in
=
out

in

max
hence,
max
is
maximum throughput achievable
maximum offered load for stability
unstable (
in
)
in
>
max
queue grows with rate
in
-
max
in
out
max
in
delay
max
max
38
Outline
IP routers
OQ routers
IQ routers
Scheduling
Optimal algorithms
Networks of routers
QoS support
CIOQ routers
Multicast traffic
Conclusions
39
Simple Input Queued (IQ) switches
S
in
= 1 S
out
= 1
1 FIFO queue for each input port
throughput limitations
due to head of the line (HOL) blocking
scheduling
to solve contentions
for the same output
Output N
Input 1 Output 1
switching fabric
Input 1
14
40
Head of the Line (HOL) Blocking
RWP
41
0% 20% 40% 60% 80% 100%
Normalized load
D
e
l
a
y
Simple IQ switch performance
OQ
Simple
IQ
Uniform traffic
% 58 2 2
Single IQ switch
Using a simple Markov chain model
2x2 throughput 0.75
states: (2,0), (1,1)
3x3 throughput ?
states: (3,0,0), (2,1,0), (1,1,1)
Switching Architectures 2011/12 42
15
Bufferless switch
Throughput=
uniform i.i.d. Bernoulli arrivals
input load p
63 . 0 1 1 1
1
|
\
|

e
N
p
N
44
Improving IQ switches performance
Window/bypass schedulers
the first w cells of each queue contend
for outputs
HOL blocking is reduced, not eliminated
w = 1 means FIFO at each input
higher complexity
the scheduler deals with wN cells
non-FIFO queues
45
Maximum throughput in an NxN switch with
variable window size w
N W=1 W=2 w=3 w=4 W=5 W=6 W=7 W=8
2 0.75 0.84 0.89 0.92 0.93 0.94 0.95 0.96
4 0.66 0.76 0.81 0.85 0.87 0.89 0.91 0.92
8 0.62 0.72 0.78 0.82 0.85 0.87 0.88 0.89
16 0.60 0.71 0.77 0.81 0.84 0.86 0.87 0.88
32 0.59 0.7 0.76 0.8 0.83 0.85 0.87 0.88
64 0.59 0.7 0.76 0.8 0.83 0.85 0.86 0.88
128 0.59 0.7 0.76 0.8 0.83 0.85 0.86 0.88
16
46
Virtual output queueing (VOQ)
one queue for each input/output pair
N queues at each input
N
2
queues in the whole switch
eliminates HOL blocking
used in high-bandwidth routers
scheduling implemented in hardware
at very high speed
47
IQ switches with VOQ
Output N
Input 1
1
N
Output 1
Input N
1
N
scheduler
switching fabric
Note: from now on, we always assume VOQ
at the switch inputs
input
constraints
output
constraints
48
Outline
IP routers
OQ routers
IQ routers
Scheduling
Optimal algorithms
Networks of routers
QoS support
CIOQ routers
Multicast traffic
Conclusions
17
49
Scheduling in IQ switches
Scheduling can be modeled as a matching
problem in a bipartite graph
the edge from node i to node j refers to packets
at input i and directed to output j
the weight of the edge can be
binary (not empty/empty queue)
queue length
HOL cell waiting time, or cell age
some other metric indicating the priority
of the HOL cell to be served
50
5
1
2
3
4
3
4
2
5
4
8
2
5
4
4
8
1
2
3
4
1
2
3
4
1
2
3
4
Graph Matching
inputs outputs
scheduler
51
Implementing schedulers
Scheduling is a complex task
a scheduling algorithm can be implemented
in hardware if:
it shows good performance for a wide range
of traffic patterns
it can be efficiently parallelized
it can be efficiently pipelined
it requires few iterations (or clock cycles)
it requires limited control information
18
52
Scheduling uniform traffic
A number of algorithms give 100%
throughput when traffic is uniform
For example:
TDM and a few variants
iSLIP (see later)
RWP
Example of TDM for a 4x4 switch
53
Scheduling non-uniform traffic
If the traffic is known and admissible, 100%
throughput can be achieved by a TDM
using:
for a fraction of time a
1
matching M
1
2
matching M
2
k
matching M
k
subject to
i
a
i
= 1
thanks to the Birkhoff - von Neumann
theorem
54
Outline
IP routers
OQ routers
IQ routers
Scheduling
Optimal algorithms
Networks of routers
QoS support
CIOQ routers
Multicast traffic
Conclusions
19
55
Maximum Weight Matching
Maximum Weight Matching (MWM)
among all the possible N! matchings, selects the one
with the highest weight (sum of the edge metrics)
MWM is generally not unique
MWM is too complex to be implemented in hardware
at high speed
the best MWM algorithm requires O(N
3
) iterations,
and cannot be implemented efficiently, since
it is based on a flow augmentation path algorithm
cannot be parallelized and pipelined efficiently
MWM has never been implemented in a commercial
chipset
56
Maximum Weight Matching
MWM is the optimal solution of the scheduling
problem when the traffic is unknown, when the
weight is either the queue length or the cell age
achieves 100% throughput under any traffic
also under non-Bernoulli arrival processes,
satisfying the law of large numbers
achieves low average delays, very close to those
of OQ switches
possible starvation for lightly loaded packet flows
57
MWM with pipeline and latency
Let T and P be fixed
D
t
denotes the matching used at time t
The following variations of MWM also achieve
100% throughput:
D
t
= MWM(t-P) MWM with pipeline degree P
D
t
= MWM(ceil(t/T)T) MWM with latency T
combinations of both
thus, it seems easy to achieve 100% throughput!
20
58
MWM with pipeline and latency
But:
What about throughput?
100% throughput
but needs the computation of a MWM
What about delays?
delays can be really bad!

59
General consideration
When scheduling in IQ switches, it is
very difficult to achieve simultaneously
high throughput
low delay
limited implementation complexity
60
Maximum Size Matching
Maximum Size Matching (MSM)
among all the possible matchings, selects the one
with the highest number of edges (like MWM with
binary edge weights)
MSM is generally not unique
the best MSM algorithm requires O(N
2.5
) iterations,
and cannot be implemented efficiently, since it is
based on a flow augmentation path algorithm
21
61
Maximum Size Matching
MSM maximizes the instantaneous
throughput
MSM may not yield 100% throughput
short term decisions can be inefficient
in the long term
non-binary edge weights allow MWM
to maximize the long-term throughput
62
Instability of MSM
Assume:
P(arrival at Q
12
) =
P(arrival at Q
11
) = P(arrival at Q
22
) = 1--
Q
12
= B 0 Q
11
= Q
22
= 0
in case of parity serve Q
11
and/or Q
22
instead of Q
12
Observe:
Q
12
is served only when A
11
= 0 and A
22
= 0, i.e. with probability:
P(serve Q
12
) = P(no arrivals at both Q
11
and Q
22
) = [1-(1--)]
2
= (+)
2
P(serve Q
12
) < P(arrival at Q
12
) if is small enough
Example: = 0.5; = 0.1;
P(serve Q
12
) = 0.36
In1
In2
Out1
Out2
1--
1--
Note: this proof is due to

I.Keslassy and R.Zhang,
Stanford Univ.
63
Uniform traffic
MWM and MSM behave almost identically
1
10
100
0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
M
e
a
n

d
e
la
y
Normalized Load
Uniform Traffic
MWM
MSM
22
64
LogDiagonal traffic
MSM is somewhat inferior to MWM
1
10
100
1000
0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
M
e
a
n

d
e
la
y
Normalized Load
LogDiagonal Traffic
MWM
MSM
65
Diagonal traffic
MSM yields much longer delays than MWM at medium/high loads
1
10
100
1000
0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
M
e
a
n

d
e
la
y
Normalized Load
Diagonal Traffic
MWM
MSM
66
Outline
IP routers
OQ routers
IQ routers
Scheduling
Optimal algorithms
Networks of routers
QoS support
CIOQ routers
Multicast traffic
Conclusions
23
67
Approximations of MSM and MWM
Motivation
strong interest in scheduling algorithms with
very low complexity
high performance
Usually
implementable schedulers (low complexity)
low throughput, long delays
theoretical schedulers (high complexity)
high throughput, short delays
68
Some implementable algorithms
Approximate MSM
WFA, iSLIP, 2DRR, RC, FIRM and many
others
Approximate MWM with w
ij
= X
ij
(queue length)
iLQF, RPA, learning algorithms
ij
= cell age
iOCF
ij
=
i
X
ij
+
j
X
ij
iLPF, MUCS
69
APPROXIMATIONS OF
MAXIMUM SIZE
MATCHING
24
70
Wave Front Arbiter
Requests Match
1
2
3
4
1
2
3
4
1
2
3
4
1
2
3
4
RWP
71
Wave Front Arbiter
Requests Match
RWP
2N-1 steps
72
Wrapped Wave Front Arbiter
Requests Match
N steps instead of
2N-1
RWP
25
73
iSLIP
iSLIP means iterative SLIP
iterates among the following 3 phases
Request
Grant
Accept
74
iSLIP
iSLIP demo
from: http://tiny-tera.stanford.edu/tiny-tera/demos/index.html
75
iSLIP
3 phases:
Request (from inputs to outputs)
each unmatched input sends a request
to every output for which it has a cell
Grant (from outputs to inputs)
if an unmatched output receives requests,
it sends a grant to one of the inputs
contentions solved by a round-robin mechanism
Accept (from inputs to outputs)
if an unmatched input receives grants, it selects
a single output and it becomes matched to it
contentions solved by a round-robin mechanism
26
76
iSLIP
The round robin mechanism in iSLIP is
designed so that, under uniform traffic,
iSLIP emulates a dynamic TDM scheduler
synchronized on the arrival pattern
77
iSLIP
iSLIP is maximal
often, with log N iterations
always, with N iterations
iSLIP was implemented on a single chip
in the Cisco 12000 router
http://www.cisco.com/warp/public/cc/pd/rt/12000/tech/fasts_wp.pdf
78
APPROXIMATIONS OF
MAXIMUM WEIGHT
MATCHING
27
79
iLQF
iLQF means iterative Longest Queue First
iterates among the following 3 phases
Request
Grant
Accept
80
iLQF
iLQF demo
from: http://tiny-tera.stanford.edu/tiny-tera/demos/index.html
81
iLQF
3 phases:
Request (from inputs to outputs)
each unmatched input sends all its queue lengths
as requests to corresponding outputs
Grant (from outputs to inputs)
if an unmatched output receives requests, it sends
a grant to the input corresponding to the longest queue
contentions solved by random choice
Accept (from inputs to outputs)
if an unmatched input receives grants, it selects
the output with the longest queue
contentions solved by random choice
28
82
iLQF
iLQF is maximal
often, with log N iterations
always, with N iterations
iLQF is robust to non-uniform traffic
83
RPA
RPA means Reservation with Preemption
and Acknowledgment
Two phases
Reservation (possibly preemptive)
Acknowledgement
Sequential accesses to a reservation vector
Urg
j
(if set) is the urgency of the transfer from
input In
j
to output j
Urg
1
,In
1
Urg
2
,In
2
Urg
3
,In
3
Urg
N
,In
N
Out 1 Out 2 Out 3 Out N
Vector
Res
RPA
Vector Res is sequentially
accessed by all inputs
Res
Input 1 Input 2
Input 4 Input 3
29
85
RPA
Initially, at each round: Urg
j
= 0 for all j
Reservation phase
when input i accesses Res
it computes W
j
= X
ij
Urg
j
for all j
finds j
*
such that W
j*
= max{ W
j
}
if W
j*
> 0,
reserve output j
*
and set Urg
j*
=X
ij*
, possibly
overwriting the previous reservation
otherwise,
leave the current reservation
86
RPA
Acknowledgement phase
if input i still finds its reservation at output j,
books output j
otherwise,
chooses an unreserved output j and books
output j
87
Uniform traffic
comparison between MWM, iSLIP, iLQF, and RPA
1
10
100
1000
0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
M
e
a
n

d
e
la
y
Normalized Load
Uniform Traffic
MWM
iSLIP
iLQF
RPA
30
88
LogDiagonal traffic
iSLIP saturates close to 84% throughput
1
10
100
1000
10000
100000
0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
M
e
a
n

d
e
la
y
Normalized Load
LogDiagonal Traffic
MWM
iSLIP
iLQF
RPA
89
Diagonal traffic
RPA achieves 98% throughput, iLQF 87%, iSLIP 83%
1
10
100
1000
10000
100000
0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
M
e
a
n

d
e
la
y
Normalized Load
Diagonal Traffic
MWM
iSLIP
iLQF
RPA
90
LEARNING ALGORITMS
31
91
Learning algorithms
Goal:
find a good compromise among
throughput, delay and complexity
92
Learning algorithms
Key observation
the matchings generated by MWM show limited
changes from one time slot to another
remembering the matching from the past
simplifies the computation of the new matching
the search implemented by MWM can be
enhanced
with a randomized approach
by observing arrivals
by searching in parallel
based on an extension of randomized
scheduling algorithms
93
Simple Randomized Schemes
Choose a matching at random and use it
as the schedule
doesnt yield 100% throughput
Choose 2 matchings at random and use
the heavier one as the schedule

Choose N matchings at random and use
the heaviest one as the schedule
None of these can give 100% throughput !
32
94
0.001
0.01
0.1
1
10
100
1000
10000
0.0 0.2 0.4 0.6 0.8 1.0
M
e
a
n

I
Q

L
e
n
Normalized Load
Diagonal Traffic
MWM
R32
R1
Simple randomized algorithms
32x32
95
Bounds on Maximum Throughput
by Devavrat Shah, Stanford
University
96
Tassiulas scheme
Consider the following policy
R
t
= matching picked at random (uniformly) among
all the possible N! matchings
D
t
= arg max { W(D
t-1
), W(R
t
) }
Complexity is very low
O(1) iterations
easy to pipeline
Yields 100% throughput !
note the boost in throughput is due to memory
of the past matching D
t-1
However, delays are very large
33
97
0.01
0.1
1
10
100
1000
10000
0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
M
e
a
n

I
Q

L
e
n
Normalized Load
Diagonal Traffic
MWM
Tassiulas
Tassiulas' scheme
32x32
98
Learning approach
Properties of COMP1
W(D
t
) W(D
t-1
)
W(D
t
) W(M
t
)
Examples:
COMP1 is the MAX among
D
t-1
and M
t
COMP1 is the MERGE
among D
t-1
and M
t
D
t-1
D
t
COMP1
M
t
99
The learning approach
D
t-1
D
t
COMP1
M
t
Properties of M
t
informally, M
t
should be a good sample
in the space of all possible matchings
Examples:
M
t
is a matching picked uniformly at
random
M
t
is derived from the arrival vector A
t
M
t
is a good neighbor of D
t-1
34
100
Theoretical properties
D
t-1
D
t
COMP1
M
t
Stability
100% throughput under any
admissible Bernoulli traffic
pattern
Delay
the better is the weight of M
t
,
the smaller are the queue
lengths, and hence the smaller
are the delays
101
D
t-1
M
t
D
t
MAX
MAX
N
1
N
K
A
t
K-th neighbor
of D
t-1
Example of practical implementation
Exploiting parallel search:
This scheme is called
APSARA
102
What is a neighbor of a matching?
Each neighbor
differs from D
t-1
in ONLY TWO edges
can be generated very easily in hardware
3 neighbors
Example: 3 x 3 switch
D
t-1
N
1 N
2
N
3
35
103
Max-APSARA
APSARA, as described before, is not
maximal
Max-APSARA is a modified version of
APSARA where a maximal size matching
algorithm runs on the remaining unmatched
inputs/outputs
e.g., if k inputs/outputs are unmatched,
run iSLIP with k iterations
select k random edges among the
unmatched inputs/outputs
104
APSARA performance
0.01
0.1
1
10
100
1000
10000
0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
M
e
a
n

I
Q

L
e
n
g
t
h
Normalized Load
Diagonal Traffic
MWM
MaxAPSARA
APSARA
iSLIP
iLQF
105
Outline
IP routers
OQ routers
IQ routers
Scheduling
Optimal algorithms
Networks of routers
QoS support
CIOQ routers
Multicast traffic
Conclusions
36
106
Routers and switches
IP routers deal with variable-size packets
Hardware switching fabrics often deal
with fixed-size cells
Question:
how to integrate an hardware switching fabric
within an IP router?
107
Router based on an IQ cell switch:
cell-mode
switching fabric
IQ cell switch
1
ISM
N
ISM
ORM
1
ORM
N
108
Cell-mode scheduling
Scheduling algorithms work at cell level
pros:
100% throughput achievable
cons:
interleaving of packets at the outputs
of the switching fabric
37
109
packet-mode
switching fabric
IQ cell switch
1
ISM
N
ISM
ORM
1
ORM
N
NO packet
interleaving
if packet-mode
110
packet-mode
switching fabric
IQ cell switch
1
ISM
N
ISM
ORM
1
ORM
N
NO packet
interleaving
if packet-mode
ORMs can be
removed
111
Packet-mode scheduling
Rule: packets transferred as trains of cells
when an input starts transferring the first cell
of a packet comprising k cells, it continues
to transfer in the following k-1 time slots
Pros:
no interleaving of packets at the outputs
easy extension of traditional schedulers
Cons:
starvation due to long packets
inherent in packet systems without preemption
negligible for high speed rates
38
112
Packet-mode scheduling
Questions
can packet mode provide high
throughput?
what about delays?
YES!
It depends
113
Packet-mode properties
Main theoretical results
MWM in packet-mode yields 100% throughput
Packet mode can provide shorter delays
than cell mode, depending on the
packet length distribution
114
Simulation scenario
Router with ISMs and ORMs
Uniform packet traffic
uniform packet load
uniform (1,192) packet size
distribution
Spotted packet traffic
non uniform packet load
bimodal (3,100) packet size
distribution
1 1 1 0 1 0 1 0
0 1 0 1 1 1 0 1
1 0 1 0 1 1 1 0
1 1 0 1 0 1 0 1
1 0 1 0 1 0 1 1
0 1 0 1 0 1 1 1
1 0 1 1 1 0 1 0
0 1 1 1 0 1 0 1
P
=
39
115
Uniform packet traffic
Packet mode and cell mode reach the same throughput
Cell-mode
Packet-mode
100
1000
10000
100000
0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
M
e
a
n
p
a
c
k
e
t d
e
la
y
Normalized Load
Uniform packet traffic for packet mode
100
1000
10000
100000
0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
M
e
a
n
p
a
c
k
e
t d
e
la
y
Normalized Load
Uniform packet traffic for cell mode
MWM
MSM
iSLIP
iLQF
116
Spotted packet traffic
Packet mode reaches higher throughput than cell mode
100
1000
10000
100000
0.5 0.6 0.6 0.7 0.7 0.8 0.8 0.9 0.9 1.0 1.0
M
e
a
n
p
a
c
k
e
t d
e
la
y
Normalized Load
Spotted packet traffic for packet mode
100
1000
10000
100000
0.5 0.6 0.6 0.7 0.7 0.8 0.8 0.9 0.9 1.0 1.0
M
e
a
n
p
a
c
k
e
t d
e
la
y
Normalized Load
Spotted packet traffic for cell mode
MWM
MSM
iSLIP
iLQF
Cell-mode Packet-mode
117
At high load PM
becomes
better
Effect of packet size distribution
iSLIP delay
CM
/delay
PM
for different packet size distributions
better
PM
better
CM
0
0.5
1
1.5
2
0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
P
a
c
k
e
t

m
o
d
e

g
a
in

f
o
r

iS
L
I
P
Normalized load
Uniform
Exponential
Trimodal
Bimodal
40
118
Packet mode features
Packet mode scheduling
is a feasible modification of schedulers
improves throughput
but it can generate some unfairness between
long and short packets
inherent to all variable-packet networks without
preemption
may give better packet delays than cell mode
depends on the packet size distribution
119
Outline
IP routers
OQ routers
IQ routers
Scheduling
Optimal algorithms
Networks of routers
QoS support
CIOQ routers
Multicast traffic
Conclusions
120
Network of IQ routers
Question:
given a network of IQ switches running MWM
and an admissible input traffic, is the network
always stable?
NO!
this is quite
counterintuitivebut true
41
121
Networks of IQ routers
Consider the acyclic network of IQ routers
in the following slide
derived from well established results
from adversarial queueing theory
a very specific scenario, but comprises
only few switches
this situation may not be common,
but cannot be excluded in real networks
122
Pathological network of IQ switches
Network with
8 switches
and 4 flows
123
Instability of MWM
If MWM is adopted at each IQ router, and
the traffic is admissible, the system can be
unstable under Bernoulli i.i.d. arrivals
42
124
Instability of MWM
MWM is too greedy, in the sense that it
can create traffic bursts that are amplified
by each scheduler
A server can be idling when large bursts
(directed to it) are blocked because
of the contentions upstream
the problem arises when a packet flow is
subject to priority changes along its path
through the network
125
Stability in networks of routers
Global policies
Oldest in the network and many others
problem: requires global information about
the network, and synchronized clocks at the
ingress of the network
126
Stability in networks of routers
Semi-local policies
MWM with local information about the router
neighbors can achieves 100% throughput
under i.i.d. Bernoulli arrivals
Virtual network queue
the weights used by MWM are:
w
ij
= max{0,X
ij
-X
down-queue(ij)
)}
where down-queue(ij) is the first downstream
queue which is receiving packets from X
ij
43
127
Outline
IP routers
OQ routers
IQ routers
Scheduling
Optimal algorithms
Networks of routers
QoS support
CIOQ routers
Multicast traffic
Conclusions
128
IQ and QoS
Problem:
support rate guarantees, with admissible rate
matrix
129
Birkhoff von Neumann
decomposition
goal: find a sequence of matchings M
k
and
their fraction of time
k
such that the service
given to all the queues satisfies R
44
130
IQ and frame scheduling
Example:
M
1
1
=1/4,
2
=1/2,
3
=1/4
M
2
M
2
M
3
M
1
M
2
M
2
M
3
frame i frame i+1
131
How to decompose R?
R double substochastic
R double stochastic such that RR
R decomposition
augmentation
algorithm
BvN algorithm
132
Augmentation algorithm
45
133
BvN algorithm
134
BvN algorithm
135
Outline
IP routers
OQ routers
IQ routers
Scheduling
Optimal algorithms
Networks of routers
QoS support
CIOQ routers
Multicast traffic
Conclusions
46
136
CIOQ routers
Output 1
switching fabric
Output N
S
S
o
1
o
N
Input 1
S
Input N
S
VOQ
137
CIOQ routers
Question:
if a low speedup S is allowed (and queues
are available at both inputs and outputs),
is it possible to design simple scheduling
algorithms, capable of achieving high
throughput and low delay?
YES!
138
OQ emulation
a CIOQ switch achieves perfect OQ
emulation if the departure order of all the
packets from each output is the same as
the emulated OQ
it is impossible to distinguish, by observing
arrivals and departures, if the switching
architecture is CIOQ or OQ
delays are perfectly controlled
easy to implement scheduling algorithms
born for OQ (eg: WFQ)
47
139
Work conservation
a CIOQ switch is work-conserving when each
output is busy at the same time as the
corresponding OQ switch
i.e., each output of the switch for which there are cells
(either at the inputs or at the outputs) at the beginning
of cell slot T is active at the end of the cell slot T
output never idling whenever a packet is present
destined to it
good delay performance: same average delays as OQ
note that OQ emulation implies work conservation
but not viceversa
140
Speedup and performance
speedup 4
exact OQ emulation
speedup 2
exact OQ emulation
work conservation
same average delay than OQ
141
CIOQ routers with S=2
If S = 2
easy to obtain 100% throughput
any maximal matching obtains 100%
throughput
less easy to obtain work conservation
LOOFA algorithm
it is difficult to obtain perfect OQ emulation
stable marriage algorithm with special
preference list
48
142
LOOFA
Occupancy o
j
: number of cells currently
residing at the j-th output queue
at each time slot, o
j
is decremented by one
because of departures
Basic idea of LOOFA
Higher priority is given to outputs with lower
occupancy, thereby attempting to maintain
work-conservation for all outputs
143
LOOFA
If S = 2, during each of the two phases
each unmatched input selects a non-empty
VOQ directed to the unmatched output with
the lowest occupancy, and sends a request
to that output
each unmatched output grants one request
the selection can be round robin, random, ...
repeat until the matching is maximal
144
LOOFA with S=2
TEO:
LOOFA achieves work conservation if S = 2
49
145
OQ emulation with S=4
urgency of a cell=departure time in OQ-
current time
MUCF (Most urgent cell first)
During each phase:
1. outputs request their most urgent cells from inputs
2. input grants output with the most urgent cell
3. loser output tries to obtain their next urgent cell
4. when no more matchings are possible, cells are
transferred and the next phase starts
146
OQ emulation with S=4
Note: picture reproduced from Balaji Prabhakar and Nick McKeown, "On the Speedup Required for
Combined Input and Output Queued Switching.", Computer Systems Technical Report, November 1997
147
OQ emulation and speedup 4
TEO:
MUCF with speedup 4 obtains OQ emulation
50
148
CIOQ routers
CIOQ are very promising architectures
many degrees of freedom in design
how to balance input/output buffers
how the buffers interact
e.g., by backpressure mechanisms
Several currently designed architectures are
supposed to be CIOQ
Speedup S is becoming closer and closer to 1 in
practical implementations of new switching
architectures (CIOQ IQ)
149
Outline
IP routers
OQ routers
IQ routers
Scheduling
Optimal algorithms
Networks of routers
QoS support
CIOQ routers
Multicast traffic
Conclusions
150
Multicast traffic
Misleading idea:
observe
1. OQ can achieve 100% throughput under
any admissible unicast and multicast traffic
2. OQ can be perfectly emulated by CIOQ
with S = 2
then, with S = 2 it is possible to achieve
100% throughput for multicast traffic
WRONG!
because observation 2 holds only for unicast traffic Switching Architectures 2011/12
51
151
Multicast traffic
Question:
what is the minimum speedup required
to achieve 100% throughput?
unknown!
152
Multicast traffic
Possible implementations
copy network before the switching fabric
a multicast cell with f destinations is treated as f cells
possible bandwidth inefficiency
dedicated queue
multicast packets are treated in some specific way
1
UC
MC
N
N N
UC+MC
N N
153
Multicast traffic: optimal queueing
MC-VOQ queueing
best throughput performance
avoids HOL blocking
2
N
-1 queues for each input, one for each fanout set
re-enqueuing process out-of-sequence problem
no re-enqueuing some throughput degradation
MC+UC
1
2
N
-1
N N
52
154
Multicast traffic: optimal scheduling
The optimal scheduling for multicast traffic
can be defined similarly to unicast traffic
it is a sort of max flow algorithm on all N(2
N
-1)
queues
Many heuristics can be envisaged
to approximate it
155
Summary
3 main ingredients for IQ scheduling
algorithms:
Weight computation
Matching computation
Contention resolution
156
Summary
Weight computation
obtains the priority of each input queue
the metric can be related to queue length,
waiting time of the cell at the HOL,
Contention resolution
whenever the selection is among situations
with equal weights
can be round robin, or random
53
157
Summary
Matching computation
computes the matching, trying to maximize
its total weight
can be based on
an iterative search, like in iSLIP, iOCF,
iLQF
a matrix greedy approach, like in MUCS,
WFA
a reservation vector, like in RPA
a learning approach, like in APSARA
158
Summary
Good IQ scheduling algorithms exist:
100% throughput
short delay
limited complexity
Performance differences are significant
only close to saturation
159
Summary
Open questions concerning IQ
schedulers:
QoS guarantees
stability of networks of switches
multicast traffic
54
160
References
Router functions and architectures
Keshav S., Sharma R., `Ìssues and trends in router design'', IEEE Communications Magazine, vol.36, n.5,
May 1998, p.144-151
Bux W., Denzel W.E., Engbersen T., Herkersdorf A., Luijten R.P.,``Technologies and building blocks for fast
packet forwarding'', IEEE Communications Magazine, Jan.2001, pp.70-77
Newman P., Minshall G., Lyon T., Huston L.,`ÌP switching and gigabit routers'', IEEE Communications
Magazine, Jan.1997, pp.64-69
Wolf T., Turner J.S., ``Design issues for high-performance active routers'', IEEE Journal on Selected Areas in
Communications, vol.19, n.3, Mar.2001, pp.404-409
Karol M., Hluchyj M., Morgan S., `Ìnput versus output queueing on a space division switch'', IEEE
Transactions on Communications, vol.35, n.12, Dec.1987
McKeown N., Anantharam V., Walrand J.,`Àchieving 100\% throughput in an input-queued switch'',IEEE
INFOCOM'96, vol.1, San Francisco, CA, Mar.1996, pp.296-302
McKeown N.,`ìSLIP: a scheduling algorithm for input-queued switches'', IEEE Transactions on Networking,
vol.7, n.2, Apr.1999, pp.188-201
McKeown N., Mekkittikul A.,`À practical scheduling algorithm to achieve 100\% throughput in input-queued
switches'', IEEE INFOCOM'98, vol.2, 1998, pp.792-9, New York, NY
Tamir Y., Chi H.-C., ``Symmetric crossbar arbiters for VLSI communication switches'', IEEE Transaction on
Parallel and Distributed Systems, vol.4, no.1, Jan.1993, pp.13 27
Chen H., Lambert J., Pitsilledes A.,``RC-BB switch. A high performance switching network for B-ISDN'',
GLOBECOM 95
161
References
Anderson T., Owicki S., Saxe J., Thacker C.,``High speed switch scheduling for local area networks'', ACM
Transactions on Computer Systems, vol.11, n.4, Nov.1993
LaMaire R.O., Serpanos D.N., ``Two dimensional round-robin schedulers for packet switches with multiple
input queues'', IEEE/ACM Transaction on Networking, vol.2, n.5, Oct.1994, p.471-482
Chen H., Lambert J., Pitsilledes A., ``RC-BB switch. A high performance switching network for B-ISDN'', IEEE
GLOBECOM 95, 1995
Duan H., Lockwood J.W., Kang S.M., Will J.D., `À high performance OC12/OC48 queue design prototype for
input buffered ATM switches'', IEEE INFOCOM'97, vol.1, 1997, pp.20-8, Los Alamitos, CA
Partridge C., et al., `À 50-Gb/s IP router'', IEEE Transactions on Networking, vol.6, n.3, June 1998, pp.237-
248
Ajmone Marsan M., Bianco A., Leonardi E., Milia L., ``RPA: a flexible scheduling algorithm for input buffered
switches'', IEEE Transactions on Communications, vol.47, n.12, Dec.1999, pp.1921-1933
Ajmone Marsan M., Bianco A., Filippi E., Giaccone P.,Leonardi E., Neri F.,`Òn the behavior of input queueing
switch architectures'', European Transactions on Telecommunications, vol.10, n.2, Mar.1999, pp.111-124
Christensen K.J.,``Design and evaluation of a parallel-polled virtual output queued switch'', IEEE ICC 2001,
vol.1, pp.112-116, 2001
Serpanos D.N., Antoniadis P.I., ``FIRM: a class of distributed scheduling algorithms for high-speed ATM
switches with multiple input queues'', IEEE INFOCOM 2000, vol.2, pp.548-555, 2000
Ying Jiang, Hamdi, M., A 2-stage matching scheduler for a VOQ packet switch architecture, IEEE ICC 2002,
vol.4, pp.2105-2110, 2002
Tassiulas L., ``Linear complexity algorithms for maximum throughput in radio networks and input queued
switches'', IEEE INFOCOM'98, vol.2, New York, NY, 1998, pp.533-539
Giaccone P., Prabhakar B., Shah D., ``Towards simple, high-performance schedulers for high-aggregate
bandwidth switches '', IEEE INFOCOM'02, New York, Jun.2002
162
References
Packet scheduling in IQ switches
Ajmone Marsan M., Bianco A., Giaccone P., Leonardi E., Neri F., ``Packet scheduling in input-queued cell-
based switches'', IEEE INFOCOM'01, Anchorage, Alaska, Apr.2001(extended version to appear in IEEE
Trans. on Networking, about Oct.2002)
Moon S.H., Sung D.K., ``High-performance variable-length packet scheduling algorithm for IP traffic'', IEEE
GLOBECOM'01, Dec.2001
Scheduling multicast traffic in IQ switches
Hayes J.F., Breault R., Mehmet-Ali M.K., ``Performance analysis of a multicast switch'', IEEE Transactions on
Communications, vol.39, n.4, Apr.1991, pp.581-587
Kim C.K., Lee T.T., ``Call scheduling algorithm in multicast switching systems'', IEEE Transactions on
Communications, vol.40, n.3, Mar.1992, pp.625-635
McKeown N., Prabhakar B., ``Scheduling multicast cells in an input-queued switch'', INFOCOM'96, vol.1, San
Francisco, CA, Mar.1996, pp.261-278
Prabhakar B., McKeown N., Ahuja R., ``Multicast scheduling for input-queued switches'', IEEE Journal on
Selected Areas in Communications, vol.15, n.5, Jun.1997, pp.855-866
Chen W., Chang Y., Hwang W., `À high performance cell scheduling algorithm in broadband multicast
switching systems'', IEEE GLOBECOM'97, vol.1, New York, NY, 1997, pp.170-174
Guo M., Chang R., ``Multicast ATM switches: survey and performance evaluation'', Computer Communication
Review, vol.28, n.2, Apr.1998, pp.98-131
Andrews M., Khanna S., Kumaran K., `Ìntegrated scheduling of unicast and multicast traffic in an input-
queued switch'', IEEE INFOCOM'99, vol.3, New York, NY, 1999, pp.1144-1151
Liu Z., Righter R., ``Scheduling multicast input-queued switches'', Journal of Scheduling, John Wiley & Sons,
May 1999
55
163
References
Scheduling multicast traffic in IQ switches
Nong G., Hamdi M., `Òn the provision of integrated QoS guarantees of unicast and multicast traffic in input-
queued switches'', IEEE GLOBECOM'99, vol.3, 1999
Ajmone Marsan M., Bianco A., Giaccone P., Leonardi E., Neri F., `Òn the throughput of input-queued cell-
based switches with multicast traffic'', IEEE INFOCOM'01, Anchorage Alaska, Apr.2001
Ge Nong, Hamdi M., Providing QoS guarantees for unicast/multicast traffic with fixed/variable-length packets
in multiple input-queued switches, IEEE Symposium on Computers and Communications, pp.166 171, 2001
Smiljanic A., Flexible bandwidth allocation in high-capacity packet switches, IEEE/ACM Transactions on
Networking, vol.10, n.2, pp.287-293, Apr.2002
QoS support in IQ switches
Tabatabaee V., Georgiadis L., Tassiulas L., ``QoS provisioning and tracking fluid policies in input queueing
switches'', IEEE INFOCOM'00, New York, Mar.2000
Chang C.S., Lee D.S., Jou Y.S., ``Load balanced Birkhoff-von Neumann switches'', 2001 IEEE Workshop on
High Performance Switching and Routing, 2001, pp.276-280.
Hung A., Kesidis G., McKeown N.,`ÀTM input-buffered switches with guaranteed-rate property'', IEEE
ISCC'98, July 1998, pp.331-335, Athens, Greece
Advanced architectures derived from pure IQ
Iyer S., McKeown N., ``Making parallel packet switches practical'', IEEE INFOCOM'01, Alaska, Mar.2001
Chang C.S., Lee D.S., Jou Y.S., ``Load balanced Birkhoff-von Neumann switches'', 2001 IEEE Workshop on
High Performance Switching and Routing, 2001, pp.276-280
Sivaram R., Stunkel C.B., Panda D.K., HIPIQS: a high-performance switch architecture using input queuing,
IEEE Transactions on Parallel and Distributed Systems, vol.13, n.3, pp.275-289, Mar.2002
164
References
Scheduling in networks of IQ switches
L.Tassiulas, A.Ephremides,``Stability properties of constrained queueing systems and scheduling policies for
maximum throughput in multihop radio networks'',IEEE Transactions on Automatic Control,vol.37, n.12,
Dec.1992, pp.1936-1948
M.Andrews, L.Zhang,`Àchieving Stability in Networks of Input-Queued Switches'',IEEE INFOCOM 2001,
Anchorage, Alaska, Apr.2001, pp.1673-1679
M.Ajmone Marsan, E.Leonardi, M.Mellia, F.Neri,`Òn the Throughput Achievable by Isolated and
Interconnected Input-Queued Switches under Multicass Traffic'',IEEE INFOCOM 2002, New York, NY (USA),
June 2002
M. Ajmone Marsan, P. Giaccone, E. Leonardi, F. Neri,`Òn the Stability of Local Scheduling Policies in
Networks of Packet Switches with Input Queues'', IEEE JSAC, to appear, 2003

Packet Switching Eng 3pp

Uploaded by

Document Information

Original Description:

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Packet Switching Eng 3pp

Uploaded by

Copyright:

Available Formats

1

Switching Architectures 2011/12

Switching Architectures 2011/12

Note: this proof is due to

You might also like