
The Network Layer

Principles behind network layer services
routing (path selection)
scalability
routers
Example implementation: the Internet (and ATM briefly)

Protocol stack: application, transport, network, link, physical

Network layer

transports segments from sending to receiving host
on sending side, encapsulates segments into datagrams
on receiving side, delivers segments to transport layer
network layer protocols in every host, router
router examines header fields in every IP datagram that passes through it

Network Layer functions

forwarding: moving packets from router's input port to appropriate output port

car analogy: process of getting through a single interchange

routing: determine the route taken by packets from source to destination

car analogy: process of planning trip from source to dest
use routing algorithms

connection setup: like in TCP, but involving all routers on the path

used in ATM, frame relay, X.25, but not IP

Routing and forwarding

Network service model

What service model can a datagram-transporting channel offer?
for individual datagrams:

guaranteed delivery
guaranteed delivery with < 40ms delay

for a flow of datagrams:

in-order datagram delivery
guaranteed minimum bandwidth for a flow
low jitter (variation in packet interarrival time)

Network service models

Network        Service       Guarantees?                                  Congestion
architecture   model         Bandwidth            Loss   Order   Timing   feedback
Internet       best-effort   none                 no     no      no       no (inferred via loss)
ATM            CBR           constant rate        yes    yes     yes      N/A (no congestion)
ATM            VBR           guaranteed rate      yes    yes     yes      N/A (no congestion)
ATM            ABR           guaranteed minimum   no     yes     no       yes
ATM            UBR           none                 no     yes     no       no

VC & datagram networks


Contrary to VC (Venture Capitalist) opinion, there is
more to networking than the Internet

Internet is a datagram network


VC (Virtual Circuit) networks, e.g., ATM, frame relay, X.25

Datagram network: network-layer connectionless service
VC network: network-layer connection service
Like transport layer, except

host-to-host service
no choice - network is one or the other
implementation is in the core, not the edge

Virtual circuits
src-to-dest path behaves like telephone circuit

recall circuit-switched vs. packet-switched discussion


signalling protocols used for call setup/teardown for each call before data can flow
each packet carries VC identifier (not dest address)
every router on path maintains state for each passing connection
link, router resources (bandwidth, buffers) allocated to VC

VC comprises

path from source to destination


VC numbers (one for each link along path)
entries in forwarding tables in routers along path

VC number changed on each link

new number comes from forwarding table

VC forwarding table

R1's forwarding table

Incoming interface   Incoming VC #   Outgoing interface   Outgoing VC #
1                    12              2                    22
2                    63              1                    18
3                    7               2                    17
1                    97              3                    87
...                  ...             ...                  ...

Routers maintain connection state information

Datagram networks
no call setup at network layer
routers have no state about end-to-end connections
packets forwarded using destination host address

packets between same src-dst pair may take different paths

Datagram forwarding table

Destination address range                                                     Link interface
11001000 00010111 00010000 00000000 to 11001000 00010111 00010111 11111111    0
11001000 00010111 00011000 00000000 to 11001000 00010111 00011000 11111111    1
11001000 00010111 00011001 00000000 to 11001000 00010111 00011111 11111111    2
otherwise                                                                     3

IPv4 = 32-bit address = 4.3 billion possible entries
(IPv6 = 128-bit address = lots and lots of entries)

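Prefix matching

Prefix match                  Link interface
11001000 00010111 00010       0
11001000 00010111 00011000    1
11001000 00010111 00011       2
otherwise                     3

When more than one entry matches, use the longest matching prefix.

DA: 11001000 00010111 00010110 10100001    Which link interface?
DA: 11001000 00010111 00011000 10101010    Which link interface?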
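The two lookups above can be worked through with a short sketch of longest-prefix matching. It assumes the interface numbers 0 to 3 follow the address-range table earlier in the deck; the names `TABLE`, `DEFAULT_INTERFACE` and `lookup` are made up for illustration, and a real router would use a trie or TCAM rather than a linear scan.

```python
# Forwarding table from the slide: (prefix bits, link interface).
# Spaces in the bit strings are only for readability.
TABLE = [
    ("11001000 00010111 00010", 0),
    ("11001000 00010111 00011000", 1),
    ("11001000 00010111 00011", 2),
]
DEFAULT_INTERFACE = 3  # the "otherwise" row


def lookup(dest_bits: str) -> int:
    """Return the interface of the longest prefix that matches dest_bits."""
    dest = dest_bits.replace(" ", "")
    best_len, best_iface = -1, DEFAULT_INTERFACE
    for prefix, iface in TABLE:
        p = prefix.replace(" ", "")
        if dest.startswith(p) and len(p) > best_len:
            best_len, best_iface = len(p), iface
    return best_iface


print(lookup("11001000 00010111 00010110 10100001"))  # 0
print(lookup("11001000 00010111 00011000 10101010"))  # 1 (24-bit prefix beats 21-bit)
```

The second address matches both the 24-bit and the 21-bit prefix; the longer one wins, so it leaves on interface 1 rather than 2.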

Datagram vs. VC

Internet
data exchange among computers
elastic service, no strict timing requirements
smart end systems (computers)
can adapt, recover from errors
simple inside network, complex edge
many link types
different characteristics
uniform service difficult

ATM
evolved from telephony mindset
human conversation
strict timing, reliability reqs
need for guaranteed service
dumb end systems (phones)
complexity inside network

Routers
Two main functions:

run routing algorithms/protocols (e.g., RIP, OSPF, BGP)


forwarding datagrams from incoming to outgoing links

Input ports

Decentralised switching

given a datagram destination, look up output port using


forwarding table in input port memory
goal: to complete input port processing at line speed
line speed: look-up time < time to receive pkt at input port
slow path vs. fast path
queueing: if datagrams arrive faster than the forwarding rate
into switch fabric

Switching fabric
The heart of the router
switching via memory (earliest type of router)

packets copied to system memory


speed limited by memory bandwidth (2 bus crossings/datagram)

switching via bus

datagrams go from input memory to output memory via a shared


bus
speed limited by bus bandwidth

switching via interconnection network

crossbar: 2n buses connect n input ports to n output ports
fragment packet into fixed-length cells, tag and switch

Output ports

Buffering required when datagrams arrive from fabric faster than transmission rate


Scheduling discipline chooses among queued datagrams
for transmission

e.g., FCFS, WFQ


QoS, fairness

Output port queueing

Buffering when arrival rate (from switch) exceeds output line speed

buffer overflow leads to queueing delay, packet loss

Packets dropped according to policy

e.g., drop-tail
AQM - mark or drop packets pre-emptively

Input port queueing

If fabric slower than input ports combined, queueing may occur at input queues

HOL (Head Of the Line) blocking: queued datagram at front of queue prevents others from moving forward, even if their output port is free
input buffer overflow leads to queueing delay and loss

The Internet Network layer


Transport layer
ICMP protocol
 error reporting
 router signaling
Routing protocols
 path selection
 RIP, OSPF, BGP

Forwarding
table

IP protocol
 addressing conventions
 datagram format
 packet handling
conventions

Link layer
Physical layer

IPv4 datagram format (RFC791)

Header fields, 32 bits per row:
version | header length | type of service | length (bytes)
ID | flags | fragment offset (all for fragmentation/reassembly)
time to live | upper layer protocol, e.g., TCP | Internet checksum
source IP address
destination IP address
options (variable length) (e.g. timestamp, record route)
data (e.g. a TCP or UDP segment)

Overhead?
 20 bytes TCP
 20 bytes IP

IP fragmentation and reassembly

network links have different MTU (maximum transmission unit)
different link-layer types, different MTUs
large IP datagrams are fragmented in the network
1 datagram becomes many datagrams
reassembled at final destination
bits in IP header used to identify and order fragments
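As a rough illustration of how the length, fragment offset and flag fields are filled in, the sketch below splits one datagram to fit a smaller MTU. The 4000-byte datagram and 1500-byte MTU are assumed example numbers, not taken from the slides, and `fragment` is a hypothetical helper that ignores IP options.

```python
def fragment(total_len: int, mtu: int, header_len: int = 20):
    """Split one IPv4 datagram into fragments that fit the MTU.

    Returns one (length, offset, more_fragments) tuple per fragment, where
    offset is counted in 8-byte units as in the IPv4 header.
    """
    payload = total_len - header_len
    # Every fragment except the last must carry a multiple of 8 payload bytes.
    max_chunk = (mtu - header_len) // 8 * 8
    fragments = []
    sent = 0
    while sent < payload:
        chunk = min(max_chunk, payload - sent)
        more = sent + chunk < payload
        fragments.append((chunk + header_len, sent // 8, int(more)))
        sent += chunk
    return fragments


for frag in fragment(4000, 1500):
    print(frag)
# (1500, 0, 1)
# (1500, 185, 1)
# (1040, 370, 0)
```

Each fragment repeats the 20-byte header, so the three fragments carry 1480 + 1480 + 1020 = 3980 payload bytes in total.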

IP addressing

IP address: 32-bit identifier for host, router interface
interface: connection between host/router and physical link
routers typically have multiple interfaces
hosts may have multiple interfaces (think wireless)
IP addresses are associated with each interface
223.1.3.1 = 11011111 00000001 00000011 00000001
            223      1        3        1
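The dotted-decimal to binary conversion can be checked with a few lines of Python. This is only a sketch; `to_binary` is an illustrative name.

```python
def to_binary(addr: str) -> str:
    """Render a dotted-decimal IPv4 address as four 8-bit groups."""
    return " ".join(f"{int(octet):08b}" for octet in addr.split("."))


print(to_binary("223.1.3.1"))  # 11011111 00000001 00000011 00000001
```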

Subnets

IP address:
high-order bits = subnet part
low-order bits = host part
Subnet:
device interfaces share same subnet part of IP address
can physically reach each other without router
Recipe:
detach each interface from host or router, create islands of isolated networks
CIDR
a.b.c.d/x, left-most x bits are subnet part
223.1.3.0/24 = left-most 24 bits are subnet part
subnet mask: 255.255.255.0
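Python's standard `ipaddress` module can confirm the relationship between the /24 prefix and the subnet mask. The addresses other than 223.1.3.0/24 and 223.1.3.1 are hypothetical interfaces added for the membership check.

```python
import ipaddress

net = ipaddress.ip_network("223.1.3.0/24")
print(net.netmask)        # 255.255.255.0
print(net.num_addresses)  # 256

# Two interfaces are on the same subnet if they share the subnet part.
a = ipaddress.ip_address("223.1.3.1")
b = ipaddress.ip_address("223.1.3.27")  # hypothetical second interface
c = ipaddress.ip_address("223.1.2.1")   # hypothetical interface on another subnet
print(a in net, b in net, c in net)     # True True False
```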

Getting an IP address
How does a host get an IP address?
Ask Wayne

address hard-coded by sysadmin in a file

e.g., Redhat: /etc/sysconfig/network-scripts/ifcfg-eth0
e.g., MacOSX: /etc/hostconfig (or System Preferences)

DHCP (Dynamic Host Configuration Protocol)

server dynamically assigns host a temporary IP address
no manual host configuration required
DHCP uses info from both network and link layer, so we'll cover it next week

Getting an IP address
How does network get subnet part of IP address?
Allocated by ISP

e.g., Dartmouth (ISP) has 129.170.0.0/16 (sometimes written as 129.170/16)
CS dept (customer) gets 129.170.212.0/22
/24 is common, a.k.a single-wide subnet

How does ISP get block of IP addresses?


Allocated by ICANN to RIR (Regional Internet
Registry)

e.g., ARIN, RIPE, APNIC, LACNIC


non-routable (local) IP addresses (RFC1918, also see RFC3330)
10/8, 172.16/12, 192.168/16

Classful addressing
Before CIDR, we had classes

Class A: high-order bit 0, /8 (1.0.0.0 to 126.255.255.255)
Class B: high-order bits 10, /16 (128.0.0.0 to 191.255.255.255)
Class C: high-order bits 110, /24 (192.0.0.0 to 223.255.255.255)
Class D = multicast: high-order bits 1110 (224.0.0.0/4)
Class E = experimental: high-order bits 1111 (240.0.0.0/4)

Led to wasteful address allocation

Class A makes up 50% of IPv4 address space


Do Ford, Halliburton, USPS need a /8 each?
http://www.iana.org/assignments/ipv4-address-space
Most allocations these days are from Class C space
CIDR allows efficient allocation (+ policy - RFC2050)
subnets can be any size, just maintain upstream

Summary of IPv4 blocks (RFC3330)

Address block      Present use                                      Reference
0.0.0.0/8          This Network                                     [RFC1700, p4]
10.0.0.0/8         Private-Use Networks                             [RFC1918]
14.0.0.0/8         Public-Data Networks                             [RFC1700, p181]
24.0.0.0/8         Cable Television Networks                        --
39.0.0.0/8         Reserved but subject to allocation               [RFC1797]
127.0.0.0/8        Loopback                                         [RFC1700, p5]
128.0.0.0/16       Reserved but subject to allocation               --
169.254.0.0/16     Link Local                                       --
172.16.0.0/12      Private-Use Networks                             [RFC1918]
191.255.0.0/16     Reserved but subject to allocation               --
192.0.0.0/24       Reserved but subject to allocation               --
192.0.2.0/24       Test-Net                                         --
192.88.99.0/24     6to4 Relay Anycast                               [RFC3068]
192.168.0.0/16     Private-Use Networks                             [RFC1918]
198.18.0.0/15      Network Interconnect Device Benchmark Testing    [RFC2544]
223.255.255.0/24   Reserved but subject to allocation               --
224.0.0.0/4        Multicast                                        [RFC3171]
240.0.0.0/4        Reserved for Future Use                          [RFC1700, p4]

A routing table

nickwaket:~> netstat -rn
Kernel IP routing table
Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
129.170.210.0   0.0.0.0         255.255.255.0   U         0 0          0 eth1
129.170.212.0   0.0.0.0         255.255.252.0   U         0 0          0 eth0
169.254.0.0     0.0.0.0         255.255.0.0     U         0 0          0 eth0
0.0.0.0         129.170.210.1   0.0.0.0         UG        0 0          0 eth1
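To see how a kernel might use such a table, here is a minimal sketch that picks the most specific matching row for a destination. The rows are copied from the netstat output; the function name `route` and the test destinations are assumptions for illustration, and real kernels use faster data structures than a linear scan.

```python
import ipaddress

# (destination, genmask, gateway, iface) rows from the netstat output.
ROUTES = [
    ("129.170.210.0", "255.255.255.0", "0.0.0.0", "eth1"),
    ("129.170.212.0", "255.255.252.0", "0.0.0.0", "eth0"),
    ("169.254.0.0", "255.255.0.0", "0.0.0.0", "eth0"),
    ("0.0.0.0", "0.0.0.0", "129.170.210.1", "eth1"),
]


def route(dest: str):
    """Return (gateway, iface) of the most specific route matching dest."""
    addr = ipaddress.ip_address(dest)
    best = None
    for destination, genmask, gateway, iface in ROUTES:
        net = ipaddress.ip_network(f"{destination}/{genmask}")
        if addr in net and (best is None or net.prefixlen > best[0]):
            best = (net.prefixlen, gateway, iface)
    return best[1], best[2]


print(route("129.170.213.5"))   # ('0.0.0.0', 'eth0'): on-link via the /22
print(route("129.170.210.77"))  # ('0.0.0.0', 'eth1'): on-link via the /24
print(route("8.8.8.8"))         # ('129.170.210.1', 'eth1'): default route
```

A gateway of 0.0.0.0 means the destination is directly reachable on that interface; only the default route (flag G) hands the packet to another router.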

NAT (Network Address Translation)

Not everyone can get a routable IP address


cost, resources (IPv4 = 4bn addresses, but lots are used up...)
Not everyone wants a routable IP address
changing addresses a lot
change upstream ISP without changing device addresses
might not want machines to be explicitly addressable (alleged security benefits)
NAT router
replaces (src IP addr, port#) in outgoing datagrams with (NAT IP addr, new port#)
records every (src IP addr, port#) to (NAT IP addr, new port#) mapping in NAT translation table
replaces (NAT IP addr, new port#) in the destination fields of each incoming datagram with the corresponding (src IP addr, port#) from the table
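The three NAT router steps can be sketched as a small translation table. This is a toy model under stated assumptions: the class name `Nat`, the starting port 5001 and all addresses are hypothetical, and real NATs also track protocol, timeouts and the remote endpoint.

```python
import itertools


class Nat:
    """Toy NAT: rewrites (src IP, port) on the way out and back on the way in."""

    def __init__(self, public_ip: str, first_port: int = 5001):
        self.public_ip = public_ip
        self._ports = itertools.count(first_port)
        self.out = {}   # (private IP, private port) -> public port
        self.back = {}  # public port -> (private IP, private port)

    def outgoing(self, src_ip: str, src_port: int):
        """Translate the source of an outgoing datagram, adding a mapping if new."""
        key = (src_ip, src_port)
        if key not in self.out:
            port = next(self._ports)
            self.out[key] = port
            self.back[port] = key
        return self.public_ip, self.out[key]

    def incoming(self, dst_port: int):
        """Translate the destination of an incoming datagram.

        Returns None when no mapping exists (unsolicited inbound traffic).
        """
        return self.back.get(dst_port)


nat = Nat("138.76.29.7")
print(nat.outgoing("10.0.0.1", 3345))  # ('138.76.29.7', 5001)
print(nat.incoming(5001))              # ('10.0.0.1', 3345)
print(nat.incoming(9999))              # None
```

The `None` case is why running a server behind a NAT is awkward: without an existing table entry, the router has no private host to deliver an inbound datagram to.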

NAT

But...
NAT breaks end-to-end principle

application designers may need to consider NAT (e.g. Skype)


port numbers are for addressing processes, not hosts

routers should only look up to layer 3 (not transport)


is there really an IPv4 address shortage?

controversial, also IPv6 on the horizon

difficult to run servers behind a NAT

lose the Internet's global addressability

scalability - 16-bit port number field

NAT limited to roughly 60,000 simultaneous connections (2^16 = 65,536 port numbers)

ICMP
Internet Control Message Protocol
Used by hosts & routers to communicate network-level information

error reporting (unreachable host, network, port...)


echo request/reply (used by ping)

Sits in the network layer, but above IP

ICMP messages carried in IP datagrams

Message format: type, code, plus first 8 bytes of IP datagram causing the error

e.g., type 8, code 0 = echo request (ping)


type 0, code 0 = echo reply (ping)
type 3, code 0 = destination network unreachable
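As an illustration of the type/code layout, the sketch below builds the bytes of an echo request (type 8, code 0). The identifier and sequence-number fields and the Internet checksum come from the ICMP echo format (RFC 792) rather than from the slide; the function names are illustrative, and actually sending the packet would need a raw socket, which is left out.

```python
import struct


def internet_checksum(data: bytes) -> int:
    """16-bit one's-complement of the one's-complement sum of 16-bit words."""
    if len(data) % 2:
        data += b"\x00"  # pad to a whole number of 16-bit words
    total = sum(struct.unpack(f"!{len(data) // 2}H", data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in
    return ~total & 0xFFFF


def echo_request(ident: int, seq: int, payload: bytes = b"") -> bytes:
    """Build an ICMP echo request: type 8, code 0, checksum, id, sequence."""
    header = struct.pack("!BBHHH", 8, 0, 0, ident, seq)  # checksum field zeroed
    checksum = internet_checksum(header + payload)
    return struct.pack("!BBHHH", 8, 0, checksum, ident, seq) + payload


packet = echo_request(ident=1, seq=1, payload=b"ping")
print(packet[0], packet[1])       # 8 0
print(internet_checksum(packet))  # 0 when the checksum field is correct
```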

traceroute
You played with traceroute earlier
Source sends series of UDP segments to destination

TTL=1, TTL=2, TTL=3, ...


random (unlikely) port number

When nth datagram arrives at nth router

router discards datagram


sends source ICMP message (type 11, code 0)
includes name of router and IP address

When ICMP message arrives, source calculates RTT


Eventually reach destination

destination returns ICMP port unreachable (type 3, code 3)


traceroute stops

IPv6
Next-generation IP datagram format. Why?
Address space allocation

in the future, everything will / can have an IP address


Internet-enabled fridge, toilet, ...
mobile devices
IPv4 allocation skewed towards America/Europe
What about Asia? Africa? China?

IPv4 header-processing, forwarding can be sped up

fixed-length header
no fragmentation - let end systems take care of it

QoS (Quality of Service)

might be easier if we had more bits

IPv6 datagram format (RFC2460)

Header fields, 32 bits per row:
version | traffic class | flow label
payload length | next header | hop limit (TTL)
source IP address (128 bits)
destination IP address (128 bits)
data

Priority (traffic class): identify priority among datagrams in flow
Flow label: identify datagrams in same flow
Next header: identify upper layer protocol for data

IPv6 address, e.g., 3ffe:2101:7:4:2e0:18ff:fe34:150b

IPv6 is great! Let's switch.


How to change when everyone still using IPv4?
Flag day (Swedish cars)

good luck...

Dual-stack

IPv6/IPv4 hosts can send and receive both

Tunneling

put IPv6 datagrams inside IPv4 datagrams


aside: other uses for tunneling?

http://www.6bone.net
Some applications need explicit IPv6 support
Difficult to change network layer

compared to e.g., new application-layer protocols

Routing
Routing = determining "good" paths between src & dst
(hosts attached to default router, so only consider src & dst routers)

Abstract network into a


graph
Graph: G = (N,E)
N = set of routers = {u,v,w,x,y,z}
E = set of links = {(u,v),(u,x),(v,x),(v,w),...}

Graph abstraction useful elsewhere, e.g., P2P


(application-layer, overlay routing)

Link costs
c(x,x') = cost of link (x,x')
e.g., c(w,z) = 5
cost of path (x1,x2,x3,...,xb) = c(x1,x2) + c(x2,x3) + ... + c(xb-1,xb)

Cost could always be 1, or related to bandwidth, congestion, ...
What is the least-cost path between u and z?
Routing algorithm: algorithm that finds least-cost path

Routing algorithms
Global or decentralised information?
Global
all routers have complete topology and link cost information
link state
Decentralised
router knows physically-connected neighbours and costs
iterative process of computation, exchanging info with neighbours
distance vector
Static or dynamic?
Static
routes change slowly over time
Dynamic
routes change more quickly - periodic changes, or in response to
link cost changes

Link-state routing
Use Dijkstra's algorithm to compute least-cost paths

D(v) = min ( D(v), D(w) + c(w,v) )


determine least-cost path from src node u to all other nodes

All nodes know the complete topology and all link costs

nodes broadcast link-state packets to all other nodes in the network
a centralised routing algorithm: needs global state

Iterative: after k iterations, know least cost path to k destinations
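A compact sketch of Dijkstra's algorithm on the six-router graph follows. The slides give only some of the link costs (c(u,v) = 2, c(u,x) = 1, c(u,w) = 5, c(w,z) = 5); the remaining costs in `GRAPH` are assumed values chosen to be consistent with those, so treat the topology as illustrative. The heap-based form is a common implementation choice, not necessarily the variant presented in the lecture.

```python
import heapq

# Link costs: partly from the slides, partly assumed (see lead-in).
GRAPH = {
    "u": {"v": 2, "x": 1, "w": 5},
    "v": {"u": 2, "x": 2, "w": 3},
    "w": {"u": 5, "v": 3, "x": 3, "y": 1, "z": 5},
    "x": {"u": 1, "v": 2, "w": 3, "y": 1},
    "y": {"x": 1, "w": 1, "z": 2},
    "z": {"w": 5, "y": 2},
}


def dijkstra(graph, source):
    """Return (dist, prev): least cost to each node and its predecessor."""
    dist = {source: 0}
    prev = {}
    done = set()  # nodes whose least-cost path is final (the set N')
    heap = [(0, source)]
    while heap:
        d, w = heapq.heappop(heap)
        if w in done:
            continue  # stale heap entry
        done.add(w)
        for v, cost in graph[w].items():
            # The update rule from the slide: D(v) = min(D(v), D(w) + c(w,v))
            if v not in done and d + cost < dist.get(v, float("inf")):
                dist[v] = d + cost
                prev[v] = w
                heapq.heappush(heap, (dist[v], v))
    return dist, prev


dist, prev = dijkstra(GRAPH, "u")
print(dist["z"], prev["z"])  # 4 y
```

Following `prev` back from z gives the least-cost path u, x, y, z with cost 1 + 1 + 2 = 4.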

Link-state example

Step  N'      D(v),p(v)  D(w),p(w)  D(x),p(x)  D(y),p(y)  D(z),p(z)
0     u       2,u        5,u        1,u        ∞          ∞
1     ux      2,u        4,x                   2,x        ∞
2     uxy     2,u        3,y                              4,y
3     uxyv               3,y                              4,y
4     uxyvw                                               4,y
5     uxyvwz

But costs may change

oscillations
Routers shouldn't run algorithm at same time

Distance Vector routing

Bellman-Ford equation

dx(y) = min_v {c(x,v) + dv(y)}, where the minimum is taken over all neighbours v of x

dv(z) = 5, dx(z) = 3, dw(z) = 3
du(z) = min {c(u,v) + dv(z), c(u,x) + dx(z), c(u,w) + dw(z)} = min {2+5, 1+3, 5+3} = 4

node that achieves minimum is next hop in the forwarding path; add it to the forwarding table

Distance Vector algorithm


Dx(y) = estimate of least cost from x to y
Distance vector: Dx = [ Dx(y): y ∈ N ]

Node x knows cost to each neighbour v : c(x,v)
Node x maintains Dx = [ Dx(y): y ∈ N ]
Node x also maintains its neighbours' distance vectors: for each neighbour v, Dv = [ Dv(y): y ∈ N ]

Each node periodically sends Dx to its neighbours

a distributed routing algorithm


estimate should converge to actual least cost dx(y)
when a node receives new DV estimate, updates its own DV
estimate using Bellman-Ford equation

DV example

Link costs: c(x,y) = 2, c(x,z) = 7, c(y,z) = 1
In each table, rows are "from" and columns are "cost to".

Node x's table, initially
   x  y  z
x  0  2  7
y  ∞  ∞  ∞
z  ∞  ∞  ∞

Node y's table, initially
   x  y  z
x  ∞  ∞  ∞
y  2  0  1
z  ∞  ∞  ∞

Node z's table, initially
   x  y  z
x  ∞  ∞  ∞
y  ∞  ∞  ∞
z  7  1  0

After receiving y's and z's vectors, x recomputes its own row:
Dx(y) = min {c(x,y)+Dy(y), c(x,z)+Dz(y)} = min {2+0, 7+1} = 2
Dx(z) = min {c(x,y)+Dy(z), c(x,z)+Dz(z)} = min {2+1, 7+0} = 3

Node x's table after the first exchange
   x  y  z
x  0  2  3
y  2  0  1
z  7  1  0

All three tables after convergence
   x  y  z
x  0  2  3
y  2  0  1
z  3  1  0
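The three-node example, with link costs c(x,y) = 2, c(x,z) = 7 and c(y,z) = 1, can be replayed with a small synchronous simulation of the distance vector algorithm. This is a sketch under simplifying assumptions: all nodes exchange vectors in lock-step rounds, whereas real DV protocols are asynchronous, and the names `run_dv` and `c` are illustrative.

```python
INF = float("inf")
NODES = ["x", "y", "z"]
COST = {("x", "y"): 2, ("x", "z"): 7, ("y", "z"): 1}


def c(a, b):
    """Direct link cost between a and b (INF if they are not neighbours)."""
    if a == b:
        return 0
    return COST.get((a, b), COST.get((b, a), INF))


def run_dv():
    """Iterate the Bellman-Ford update until no distance vector changes."""
    # Each node starts knowing only the direct costs to its neighbours.
    dv = {n: {m: c(n, m) for m in NODES} for n in NODES}
    rounds = 0
    changed = True
    while changed:
        changed = False
        # Vectors as "sent" at the start of this round.
        snapshot = {n: dict(vec) for n, vec in dv.items()}
        for x in NODES:
            for y in NODES:
                if x == y:
                    continue
                # Dx(y) = min over neighbours v of { c(x,v) + Dv(y) }
                best = min(c(x, v) + snapshot[v][y] for v in NODES if v != x)
                if best != dv[x][y]:
                    dv[x][y] = best
                    changed = True
        rounds += 1
    return dv, rounds


dv, rounds = run_dv()
for node in NODES:
    print(node, dv[node])
# x {'x': 0, 'y': 2, 'z': 3}
# y {'x': 2, 'y': 0, 'z': 1}
# z {'x': 3, 'y': 1, 'z': 0}
```

Both x and z discover that going via y (cost 2 + 1 = 3) beats their direct link of cost 7.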

Distance Vector - evaluation


Link cost changes

node detects cost change, updates routing info, recalculates DV


if DV changes, notify neighbours
if neighbours' least costs change, change will propagate
good news travels fast
but if cost increases?
bad news travels slow
count-to-infinity problem

Routing loops in DV

Cost between x and y changes from 4 to 60

at t0, y knows cost has increased, z doesn't

y: Dy(x) = min {c(y,x)+Dx(x), c(y,z)+Dz(x)} = min {60+0,1+5} = 6

But this is wrong! z's estimate Dz(x) = 5 was itself computed via y using the old cost of 4

y will send datagrams destined for x via z; z will send datagrams destined for x via y

How to solve?

poisoned reverse - if z routes via y to reach x, z advertises cost ∞ for x to y (see book)

Hierarchical routing
How does routing scale to the Internet?

Can we store routes to e.g., 200m destinations in routing


tables?

How long would it take to update/exchange tables?


Table exchange alone would swamp links

Administrative autonomy

Internet is a network of networks

Aggregate routers into regions = AS (Autonomous System)

Each network admin wants to control routing in their own network


routers in same AS run same routing protocol (intra-AS RP)
routers in different ASes can run different intra-AS routing protocol

Gateway routers

Direct link to a router in another AS

Interconnecting ASes

Forwarding table is configured by both intra- and inter-AS routing algorithms

Intra-AS sets entries for internal destinations


Inter-AS and Intra-AS set entries for external destinations

Choosing among multiple ASes

Suppose AS1 learns from inter-AS RP that subnet x is


reachable from AS3 and AS2

To configure forwarding table, router 1d must determine gateway to which it should forward packets for x
use inter-AS RP
hot-potato routing: send packet toward closest of the two (get
rid of packet as quickly, i.e., as cheaply, as possible)

Routing in the Internet


3 common intra-AS RPs (IGP - Interior Gateway
Protocol)

RIP (Routing Information Protocol)


OSPF (Open Shortest Path First)
IGRP (Interior Gateway Routing Protocol) - Cisco-proprietary, not so interesting...

inter-AS routing

BGP (Border Gateway Protocol)

RIP
One of the earlier Internet routing protocols

became popular when included in 1982 BSD 4.1a (TCP/IP)

Distance vector algorithm

distance metric: #hops (max = 15 hops)


DVs exchanged every 30 sec via advertisements
Each advertisement contains list of up to 25 destination
subnets within AS
destination   hops
u             1
v             2
w             2
x             3
y             3
z             2

RIP example

D's routing table and A's advertisement (columns: dst net, next router, hops to dst)

RIP implementation
Advertisements every 30 seconds

If no advertisement heard after 180 sec, neighbour/link


declared dead
routes via neighbour invalidated
new advertisements sent to neighbours, so link failures propagate quickly
poison reverse used to prevent loops (16 hops = ∞)

routed runs in application-layer

advertisements sent via UDP, port 520

OSPF
O = Open = public (good!)
link state
OSPF advertisement carries one entry per neighbour

advertisements sent to entire AS via flooding


OSPF messages carried directly over IP (no TCP or UDP)

Benefits over RIP

security: all messages authenticated


multiple same-cost paths (multipath) allowed (not in RIP)
for each link, can have multiple cost metrics for different TOS
unicast and multicast support
hierarchical OSPF in large ASes

Hierarchical OSPF

Two-level hierarchy: local area and backbone

Internal routers: non-backbone; only intra-AS routing


Area border routers: summarise distances to networks in their
own area, advertise to other area border routers
Backbone routers: run OSPF routing that is limited to backbone
Boundary routers: connect to other ASes
learn about paths to external networks

BGP

BGP allows AS to

obtain subnet reachability information from neighbouring ASes
propagate reachability information to all routers internal to AS
determine "good" routes to subnets based on reachability information and on AS policy

Pairs of routers (BGP peers) exchange routing info over semi-permanent TCP connections (port 179)
BGP sessions do not correspond to physical links
when AS2 advertises a prefix to AS1, AS2 is promising it will forward any datagrams destined to that prefix towards the prefix

AS2 can aggregate prefixes in advertisement

BGP example

AS3 uses eBGP (external BGP) to send prefix reachability info to AS1

aggregated, e.g., 138.16.64/24 + 138.16.65/24 = 138.16.64/23

1c uses iBGP (internal BGP) to send info to all routers in AS1
1b re-advertises info to AS2 over eBGP session
whenever router learns about a new prefix, it creates an entry for the prefix in its forwarding table
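The aggregation 138.16.64/24 + 138.16.65/24 = 138.16.64/23 can be checked with the standard `ipaddress` module. The second pair of prefixes is a made-up counter-example showing that adjacent /24s only merge when they share the same 23-bit prefix.

```python
import ipaddress

prefixes = [
    ipaddress.ip_network("138.16.64.0/24"),
    ipaddress.ip_network("138.16.65.0/24"),
]
aggregated = list(ipaddress.collapse_addresses(prefixes))
print(aggregated)  # [IPv4Network('138.16.64.0/23')]

# Counter-example (hypothetical): .65 and .66 differ in the 23rd bit,
# so they cannot be covered by a single /23.
not_aligned = [
    ipaddress.ip_network("138.16.65.0/24"),
    ipaddress.ip_network("138.16.66.0/24"),
]
print(len(list(ipaddress.collapse_addresses(not_aligned))))  # 2
```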

BGP path attributes


Prefix advertisements include BGP attributes

prefix + attributes = "route"

Two most important attributes

AS-PATH: contains the ASes through which the advert for the prefix passed, e.g. AS 67 AS 17
AS numbers assigned by ICANN (like IP blocks)
NEXT-HOP: indicates the specific internal-AS router to next-hop AS (may be multiple links from current to next-hop AS)

When a gateway router receives a route


advertisement, it uses import policy to accept or decline

not all ASes want to send traffic over every other AS


router may already know a better or preferred route

BGP route selection and policy


Router may learn about more than 1 route to a prefix
How to choose route?

local preference value attribute; policy decision


shortest AS-PATH
closest NEXT-HOP router (hot potato)

Policy

ISPs have peering arrangements with each other


commercially-sensitive (i.e., secret)
prevent free-riding
bandwidth exchanges (e.g., band-x.com)
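The selection order listed above (local preference, then shortest AS-PATH, then closest NEXT-HOP) can be expressed as a sort key. This is a simplified sketch: the `Route` class, its field names and the example values are hypothetical, and real BGP implementations apply several further tie-breakers.

```python
from dataclasses import dataclass


@dataclass
class Route:
    prefix: str
    local_pref: int     # higher is preferred (policy decision)
    as_path: tuple      # ASes the advertisement passed through
    next_hop_cost: int  # intra-AS cost to reach the NEXT-HOP router


def best_route(routes):
    """Pick by highest local preference, then shortest AS-PATH, then hot potato."""
    return min(
        routes,
        key=lambda r: (-r.local_pref, len(r.as_path), r.next_hop_cost),
    )


candidates = [
    Route("138.16.64.0/23", 100, (67, 17), 5),
    Route("138.16.64.0/23", 100, (42,), 9),
    Route("138.16.64.0/23", 100, (99,), 3),
    Route("138.16.64.0/23", 50, (), 1),
]
print(best_route(candidates).as_path)  # (99,)
```

The last candidate has the shortest path and the cheapest next hop, yet loses because local preference is compared first: policy outranks path length.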

Broadcast routing
Single source sends a datagram to all nodes in network

in IP, last address in network is broadcast, e.g. 129.170.215.255


or generic ("limited broadcast") address = 255.255.255.255
N-way-unicast: send individual packets to each node
wasteful
Flooding: forward to all neighbours, except neighbour that
sent the packet
can lead to broadcast storms (TTL is important)
use sequence numbers to prevent duplicates

Reverse path forwarding

When router receives broadcast packet with a given source address, only forward if the packet arrived on the link that is on shortest path back to source
otherwise drop it: router will eventually receive packet from the shortest path
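The reverse path forwarding check amounts to one comparison per packet. The sketch below assumes a hypothetical four-router topology and a precomputed table `NEXT_HOP` giving each router's next hop towards the source (in practice taken from the unicast routing table); all names are illustrative.

```python
# Hypothetical topology: neighbours of each router.
LINKS = {
    "A": ["B", "C"],
    "B": ["A", "C", "D"],
    "C": ["A", "B", "D"],
    "D": ["B", "C"],
}
# NEXT_HOP[router][source]: neighbour on router's shortest path back to source
# (assumed precomputed by unicast routing).
NEXT_HOP = {
    "B": {"A": "A"},
    "C": {"A": "A"},
    "D": {"A": "B"},
}


def rpf_forward(router, source, arrived_from):
    """Return the neighbours to forward to, or [] if the packet is dropped."""
    if NEXT_HOP[router][source] != arrived_from:
        return []  # did not arrive on the reverse shortest path: drop
    return [n for n in LINKS[router] if n != arrived_from]


print(rpf_forward("B", "A", "A"))  # ['C', 'D']
print(rpf_forward("B", "A", "C"))  # []
print(rpf_forward("D", "A", "C"))  # []
```

Note that B still sends a copy to C, which C then drops; RPF stops loops and storms but does not remove every redundant transmission.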

Spanning tree

RPF still causes duplication
We want a spanning tree
graph where all nodes are connected and there are no cycles
How to construct tree?
define core or rendezvous point
nodes send join messages towards the core
when join msgs reach the existing tree, the path up to tree is grafted onto tree

Multicast routing
Single node delivers to a subset of network nodes
Useful for lots of applications

software delivery (e.g., Windows Update, apt, RPM)


audio / video conferencing
games
file-sharing (most people want the same Britney Spears MP3)

Multicast saves duplication


How to identify receivers?
How to address a datagram to these receivers?

Multicast addressing & trees

Multicast uses address indirection
A single Class D IP address represents the entire group
nodes keep their own IP address and send to group address
receiver-driven: use IGMP to join group
Trees - not all paths between routers are used
shared tree: same tree used by all group members
source-based tree: different tree from each sender to receivers

Addressing multicast groups

hosts join group


any datagrams addressed to group address are delivered to all members of the group

Source-based trees

Use RPF (Reverse Path


Forwarding) like in unicast
But what if no-one downstream
of a link is subscribed to group?
why send to G?
routers downstream of G will
send prune messages
upstream

Group-shared trees
Steiner Tree: minimum-cost tree connecting all routers
with attached group members

NP-complete
But heuristics available
But requires knowledge about entire network

Designate one router as centre or "core"

edge routers join by sending unicast join-msg to core router


path taken by join-msg becomes new branch for this router

Multicast routing in the Internet


DVMRP (Distance Vector Multicast Routing Protocol)

oldest and most common; supported by most routers


reverse path forwarding, source-based tree
soft-state - routers periodically forget pruning

PIM (Protocol Independent Multicast)

dense-mode - group members close together, more bandwidth
group membership assumed until explicit prune
flood-and-prune RPF
sparse-mode - far apart, less bandwidth
no membership until explicit join
receiver-driven tree-construction (core-based trees)
join via RP (Rendezvous Point), then can switch to source-specific tree

Multicast - will it ever come?


Deering paper 1985, RFC1112 1989
So where is it?

Akamai, etc. use application-level multicast / overlays


P2P, etc. use unicast (>50% of Internet traffic)

ISPs scared?

congestion control, UDP

Don't understand?

yet another thing to configure

No business model?

How to charge for shared bandwidth?
