
Network Load Balancing (NLB)

A Core Technology Overview

Sean B. House
June 18, 2002


(Updated October 31, 2002)

Agenda
    

- NLB architecture and fundamentals
- NLB cluster membership protocol
- Packet filtering and TCP connection affinity
- Limitations of NLB
- Advanced NLB topics
  - Multicast
  - Bi-Directional Affinity
  - VPN (PPTP & IPSec/L2TP)
- Q&A

Introduction to NLB


- Fully distributed, symmetric, software-based TCP/IP load balancing
  - Cloned services (e.g., IIS) run on each host in the cluster
  - Client requests are partitioned across all hosts
  - Load distribution is static, but configurable through load weights (percentages)
- Design goals include:
  - Use commodity hardware
  - Simple and robust
  - Highly available

Introduction to NLB


- NLB provides:
  - Scale-out for IP services
  - High availability (no single point of failure)
  - An inexpensive alternative to HW LB devices
- NLB is appropriate for load balancing:
  - Stateless services
  - Short-lived connections from many clients
  - Downloads such as HTTP or FTP GETs
  - In small clusters (less than 10 nodes)
- High-end hardware load balancers cover a much broader range of load-balancing scenarios

NLB Architecture


- NLB is an NDIS intermediate filter driver inserted between the physical NIC and the protocols in the network stack
  - Protocols think NLB is a NIC
  - NICs think NLB is a protocol
- One instance of NLB per NIC to which it's bound
  - All NLB instances operate independently of each other

Fundamental Algorithm
   

- NLB is fundamentally just a packet filter
- Via the NLB membership protocol, all hosts in the cluster agree on the load distribution
- NLB requires that all hosts see all inbound packets
- Each host discards those packets intended for other hosts in the cluster
  - Each host makes accept/drop decisions independently
- Packets accepted on each host are passed up to the protocol(s) and one response is sent back to the client

[Diagram: NLB cluster of Host 1, Host 2 and Host 3, connected to clients via the Internet]

1. A client initiates a request to an NLB cluster.
2. The network floods the incoming client request.
3. One server accepts the client request.
4. A response is sent back to the client.

Cluster Operation Modes




- NLB modes of operation:
  - Unicast, multicast and IGMP multicast
  - Unicast makes up approximately 98% of deployments
  - For the rest of this talk, assume unicast operation
    - Advanced Topics covers multicast and IGMP multicast
- To project a single system image:
  - All hosts share the same set of virtual IP addresses
  - All hosts share a common network (MAC) address
- In unicast, NLB actually alters the MAC address of the NIC
  - This precludes inter-host communication over the NLB NIC
  - Communication with specific cluster hosts is accomplished through the use of dedicated NICs or dedicated IP addresses

Unicast Mode


- Each host in the cluster is configured with the same unicast MAC address
  - 02-bf-WW-XX-YY-ZZ
    - 02 = locally administered address
    - bf = arbitrary (Bain/Faenov)
    - WW-XX-YY-ZZ = the primary cluster IP address
- All ARP requests for virtual IP addresses resolve to this cluster MAC address automagically
- NLB must ensure that all inbound packets are received by all hosts in the cluster
  - All N hosts in the cluster receive every packet and N-1 hosts discard each packet
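
To make the address construction concrete, here is a minimal sketch (illustrative code, not NLB's implementation) that derives the 02-bf-WW-XX-YY-ZZ unicast cluster MAC from the primary cluster IP address:

    import ipaddress

    def unicast_cluster_mac(primary_cluster_ip: str) -> str:
        """Build the 02-bf-WW-XX-YY-ZZ unicast cluster MAC from the primary cluster IP."""
        ww_xx_yy_zz = ipaddress.IPv4Address(primary_cluster_ip).packed  # the four IP octets
        return "-".join(f"{b:02x}" for b in (0x02, 0xBF, *ww_xx_yy_zz))

    print(unicast_cluster_mac("10.0.0.1"))  # -> 02-bf-0a-00-00-01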

Inbound Packet Flooding




- NLB is well-suited for (arguably designed for) a hub environment
  - Hubs forward inbound packets on all ports by their very nature
- Switches associate MAC addresses with ports in an effort to eliminate flooding
  - On each port, switches snoop the source MAC addresses of all packets received
  - Those source MAC addresses are learned on that port
  - When packets arrive destined for a learned MAC address, they are forwarded only on the associated port
  - This breaks NLB compatibility with switches

NLB/Switch Incompatibility
 

- All NLB hosts share the same cluster MAC address
- Switches only allow a particular MAC address to be associated with one switch port at a given time
- This results in the cluster MAC address / port association thrashing between ports
- Inbound packets are only forwarded to the port with which the switch currently believes the cluster MAC address is associated
- Connectivity to the cluster will be intermittent at best

[Diagram: Host 1, Host 2 and Host 3 attached to a switch; the cluster MAC address / port association moves between switch ports]

MAC Address Masking




- NLB uses MAC address masking to keep switches from learning the cluster MAC address and associating it with a particular port
- NLB spoofs the source MAC address of all outgoing packets
  - The second byte of the source MAC address is overwritten with the host's unique NLB host ID
    - E.g., 02-bf-0a-00-00-01 -> 02-09-0a-00-00-01
- This prevents switches from associating the cluster MAC address with a particular port
  - Switches only associate the masked MAC addresses
  - Enables inbound packet flooding
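
A minimal sketch of the masking step, assuming dash-separated MAC strings (illustrative code, not NLB's):

    def mask_source_mac(cluster_mac: str, host_id: int) -> str:
        """Overwrite the second byte of the cluster MAC with the host's unique NLB host ID."""
        octets = cluster_mac.split("-")
        octets[1] = f"{host_id:02x}"
        return "-".join(octets)

    print(mask_source_mac("02-bf-0a-00-00-01", 9))  # -> 02-09-0a-00-00-01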

[Diagram: NLB cluster of Host 1, Host 2 and Host 3 at 10.0.0.1 / 02-bf-0a-00-00-01 behind a switch; client at 10.0.0.5 / 00-a0-cc-a1-cd-9f]

    Inbound:  From 00-a0-cc-a1-cd-9f / 10.0.0.5:29290   To 02-bf-0a-00-00-01 / 10.0.0.1:80
    Outbound: From 02-03-0a-00-00-01 / 10.0.0.1:80      To 00-a0-cc-a1-cd-9f / 10.0.0.5:29290

1. A client initiates a request to the NLB cluster. An ARP request for 10.0.0.1 resolves to the cluster MAC address, 02-bf-0a-00-00-01.
2. The switch does not know to which port 02-bf-0a-00-00-01 belongs, so it floods the request to all ports.
3. One server accepts the client request.
4. A response is sent back to the client. The source MAC address is masked using the host's unique host ID.
5. The switch will continue to associate 02-03-0a-00-00-01, not 02-bf-0a-00-00-01, with this switch port. This enables switch flooding.

Load-Balancing Overview

- Each host periodically sends a heartbeat packet to announce its presence and distribute load
- Load distribution is quantized into 60 buckets that are distributed amongst hosts
  - Each host owns a subset of the buckets
- Incoming packets hash to one of the 60 buckets
  - Typically using the IP 2-tuple or 4-tuple as input to the hashing function
  - The owner of the bucket accepts the packet, the others drop the packet
- What happens to existing connections if bucket ownership changes?
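
The sketch below illustrates the bucket-based accept/drop decision; zlib.crc32 stands in for NLB's internal hash, which is not described here -- the only property shown is that every host computes the same bucket for the same packet:

    import zlib

    BUCKET_COUNT = 60  # NLB quantizes the load distribution into 60 buckets

    def bucket_for(client_ip: str, client_port: int | None = None) -> int:
        """Map a packet to one of the 60 buckets from the client IP (and optionally the port)."""
        key = f"{client_ip}:{client_port}" if client_port is not None else client_ip
        return zlib.crc32(key.encode()) % BUCKET_COUNT

    def should_accept(owned_buckets: set[int], client_ip: str, client_port: int) -> bool:
        """Each host independently accepts only the packets that hash to buckets it owns."""
        return bucket_for(client_ip, client_port) in owned_buckets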


  

Cluster Membership


- Each NLB host is assigned a unique host ID in the range from 1 to 32 (the maximum cluster size)
- Using Ethernet broadcast, each host sends heartbeat packets to announce its presence and distribute load
  - Twice per second during convergence
  - Once per second after convergence completes
- Heartbeats are MTU-sized, un-routable Ethernet frames
  - Registered Ethernet type = 0x886f
  - Contain configuration information such as host ID, dedicated IP address, port rules, etc.
  - Contain load-balancing state such as the load distribution, load weights (percentages), activity indicators, etc.

Convergence


- Convergence is a distributed mechanism for determining cluster membership and load distribution
  - Conveyed via the NLB heartbeat messages
- Hosts initiate convergence primarily to partition the load
- When consensus is reached, the cluster is said to be converged
- Misconfiguration can cause perpetual convergence
- Network problems can cause periodic convergence
  - Can result in disruption to or denial of client service
- Cluster operations continue during convergence

Triggering Convergence


- Joining hosts
  - New hosts trigger convergence to repartition the load distribution and begin accepting client requests
- Departing hosts
  - The other hosts pick up the slack when a fixed number of heartbeats are missed from the departing host
- New configuration on hosts
  - Configuration changes
  - Administrative operations that change the configured load of a server (disable, enable, drain, etc.)


 

The Convergence Algorithm


1. All hosts enter the CONVERGING state
2. The host with the smallest host ID is elected the default host
3. Each host moves from the CONVERGING state to the STABLE state after a fixed number of epochs* in which consistent membership and load distribution are observed
4. The default host enters the CONVERGED state after a fixed number of epochs in which all hosts are observed to be in the STABLE state
5. Other hosts enter the CONVERGED state when they see that the default host has converged

* For all intents and purposes, an epoch is a heartbeat period
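
A rough sketch of the per-host state progression described above; the epoch threshold and parameter names below are illustrative simplifications, not NLB's actual values:

    from enum import Enum, auto

    class HostState(Enum):
        CONVERGING = auto()
        STABLE = auto()
        CONVERGED = auto()

    STABLE_EPOCHS = 5  # illustrative; the real epoch counts are internal to NLB

    def next_state(state: HostState, host_id: int, default_host_id: int,
                   consistent_epochs: int, all_stable_epochs: int,
                   default_host_converged: bool) -> HostState:
        """One epoch (heartbeat period) of the convergence state machine for a single host."""
        if state is HostState.CONVERGING and consistent_epochs >= STABLE_EPOCHS:
            return HostState.STABLE  # consistent membership and load distribution observed
        if state is HostState.STABLE:
            if host_id == default_host_id and all_stable_epochs >= STABLE_EPOCHS:
                return HostState.CONVERGED  # the default host (smallest host ID) converges first
            if host_id != default_host_id and default_host_converged:
                return HostState.CONVERGED  # other hosts follow the default host
        return state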

Bucket Distribution


- Load distribution is quantized into buckets
  - Incoming packets hash to one of 60 buckets
    - Largely for reasons of equal divisibility
  - Buckets are distributed amongst hosts, but based on configuration (load weights/percentages), may or may not be shared equally
  - Buckets are not dynamically re-distributed based on load (no dynamic load balancing)
- Goal: minimize disruption to existing connections during bucket transfer
- Non-goal: optimize remaps across a series of convergences

Bucket Distribution


- During convergence, each host computes identical target load distributions
  - Based on the existing load distribution and the new membership information
- When convergence completes, hosts transfer buckets pair-wise via heartbeats
  - First, the donor host surrenders ownership of the buckets and notifies the recipient
  - Soon thereafter, the recipient picks up the buckets, asserts ownership of them and notifies the donor
  - During the transfer (~2 seconds), nobody is accepting new connections on those buckets

Bucket Distribution


- Advantages
  - Easy method by which to divide the client population among hosts
  - Convenient for adjusting relative load weights (percentages) between hosts
  - Avoids state lookup in optimized cases
- Disadvantages
  - Quantized domain has limited granularity
  - Unbalanced for some cluster sizes, but 60 divides nicely: 1, 2, 3, 4, 5, 6, 10, 12, 15, 20, 30
    - Worst case is a load distribution ratio of 2:1 in 31- and 32-host clusters
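
A small sketch of how 60 buckets split across equally weighted hosts, showing the 2:1 worst case for 31- and 32-host clusters:

    BUCKET_COUNT = 60

    def distribute_buckets(host_count: int) -> dict[int, int]:
        """Split the 60 buckets as evenly as possible among equally weighted hosts.
        Returns a map of host ID -> number of buckets owned."""
        base, extra = divmod(BUCKET_COUNT, host_count)
        return {h: base + (1 if h <= extra else 0) for h in range(1, host_count + 1)}

    print(distribute_buckets(6))   # every host owns 10 buckets
    print(distribute_buckets(32))  # 28 hosts own 2 buckets, 4 hosts own 1 -- a 2:1 ratio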

[Diagram: Host 1 joins an NLB cluster of Host 2 and Host 3; clients reach the cluster through a switch and the Internet]

    Host 1:  Now: none     Next: 10-19, 50-59
    Host 2:  Now: 0-29     Next: 0-9, 20-29
    Host 3:  Now: 30-59    Next: 30-49

1. Hosts 2 and 3 are converged and sending CONVERGED heartbeats.
2. Host 1 joins the cluster. Convergence begins, and all three hosts in the cluster begin sending CONVERGING heartbeats.
3. Each host uses the same algorithm to compute the new load distribution.
4. When convergence completes, each pair of hosts transfers the designated buckets via the heartbeats. Buckets are removed from the donating host's bucket map before being handed off to the new owner.

Packet Filtering
 

- Filtered packets are those for which NLB will make an accept/drop decision (load-balance)
- IP protocols that are filtered by NLB:
  - TCP, UDP
  - GRE
    - Assumes a relationship with a corresponding PPTP tunnel
  - ESP/AH (IPSec)
    - Assumes a relationship with a corresponding IPSec/L2TP tunnel
  - ICMP
    - By default, all hosts accept ICMP; can be optionally filtered
- Other protocols and Ethernet types are passed directly up to the protocol(s)

Client Affinity


- None
  - Typically provides the best load balance
  - Uses both client IP address and port when hashing
- Single
  - Uses only the client IP address when hashing
  - Used primarily for session support for SSL and multi-connection protocols (IPSec/L2TP, PPTP, FTP)
- Class C
  - Uses only the class C subnet of the client IP address when hashing
  - Used primarily for session support for users behind scaling proxy arrays
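
The sketch below shows how the affinity setting changes the hash input (crc32 again stands in for NLB's internal hash):

    import zlib
    from ipaddress import IPv4Network

    BUCKET_COUNT = 60

    def hash_key(affinity: str, client_ip: str, client_port: int) -> str:
        """Choose the hash input according to the configured client affinity."""
        if affinity == "none":      # client IP address and port
            return f"{client_ip}:{client_port}"
        if affinity == "single":    # client IP address only
            return client_ip
        if affinity == "class_c":   # class C (/24) subnet of the client IP
            return str(IPv4Network(f"{client_ip}/24", strict=False).network_address)
        raise ValueError(f"unknown affinity: {affinity}")

    def bucket(affinity: str, client_ip: str, client_port: int) -> int:
        return zlib.crc32(hash_key(affinity, client_ip, client_port).encode()) % BUCKET_COUNT

    # With Single affinity, every connection from the same client lands in the same bucket:
    print(bucket("single", "24.100.7.9", 4711) == bucket("single", "24.100.7.9", 5022))  # True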
 

Hashing
 

- Packets hash to one of 60 buckets, which are distributed amongst hosts
- NLB employs bi-level hashing
  - Level 1: Bucket ownership
  - Level 2: State lookup
- NLB hashing operates in one of two modes
  - Optimized
    - Level 1 hashing only
    - The bucket owner accepts the packet unconditionally
  - Non-optimized
    - Level 1 and level 2 hashing
    - State lookup is necessary to resolve ownership ambiguity
 

Hashing


- Protocols such as UDP always operate in optimized mode
  - No state is maintained for UDP, which eliminates the need for level 2 hashing
- Protocols such as TCP can operate in either optimized or non-optimized mode
  - State is maintained for all TCP connections
  - When ambiguity arises, state lookup determines ownership
  - New connections always belong to the bucket owner
  - Global aggregation determines when other hosts complete service of a lingering connection and optimize out level 2 hashing
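
A condensed sketch of the resulting accept/drop decision for a TCP packet on one host, assuming simple booleans for bucket ownership and the level 2 state lookup:

    def accept_tcp_packet(owns_bucket: bool, optimized: bool,
                          is_syn: bool, has_descriptor: bool) -> bool:
        """Bi-level accept/drop decision for a TCP packet on one host (simplified)."""
        if optimized or is_syn:
            # Level 1 only: the bucket owner accepts unconditionally;
            # new connections (SYN) always belong to the bucket owner.
            return owns_bucket
        # Level 2: ownership is ambiguous, so the state lookup decides --
        # a lingering connection stays with the host holding its descriptor.
        return has_descriptor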

[Diagram: NLB cluster of Host 1, Host 2 and Host 3 at 10.0.0.1 / 02-bf-0a-00-00-01; the 60 buckets (0-19, 20-39, 40-59) are divided among the three hosts; client at 10.0.0.5 / 00-a0-cc-a1-cd-9f]

    Inbound:  From 10.0.0.5:29290   To 10.0.0.1:80
    Outbound: From 10.0.0.1:80      To 10.0.0.5:29290

1. A client initiates a request to an NLB cluster.
2. The hash on the IP 5-tuple (10.0.0.5, 29290, 10.0.0.1, 80, TCP) maps to bucket 14, owned by Host 3.
3. Host 3 accepts the request; all other hosts drop the request.
4. A response is sent back to the client.

Connection Tracking


- Ensures that connections are serviced by the same host for their duration even if a change in bucket ownership occurs
- Sessionful vs. sessionless hashing
  - E.g., UDP is sessionless
    - If an ownership change occurs, existing streams shift immediately to the new bucket owner
  - E.g., TCP is sessionful
    - If an ownership change occurs, existing connections continue to be serviced by the old bucket owner
    - Requires state maintenance and lookup to resolve packet ownership ambiguity

No TCP Connection Affinity

[Diagram: NLB cluster of Host 1, Host 2 and Host 3; Host 3 is initially the owner of Client 1's bucket; the client connects through a switch and the Internet]

1. A client initiates a TCP connection by sending a SYN to the NLB cluster.
2. The SYN is accepted by Host 3.
3. A SYN+ACK is sent back to the client.
4. The client completes the three-way handshake by sending an ACK back to the NLB cluster.
5. The ACK is accepted by Host 1, breaking the TCP connection.

TCP Connection State




- NLB maintains a connection descriptor for each active TCP connection
  - A connection descriptor is basically an IP 5-tuple: (Client IP, Client port, Server IP, Server port, TCP)
  - In optimized mode, descriptors are maintained, but not needed when making accept/drop decisions
- State and its associated lifetime are maintained by either:
  - TCP packet snooping
    - Monitoring TCP packet flags (SYN, FIN, RST)
  - Explicit notifications from TCP/IP
    - Using kernel callbacks




TCP Packet Snooping




- NLB monitors the TCP packet flags
  - Upon seeing a SYN
    - If accepted, a descriptor is created to track the connection
    - Only one host should have a descriptor for this IP 5-tuple
  - Upon seeing a FIN/RST
    - Destroys the associated descriptor, if one is found
- Problems include:
  - Routing and interface metrics can cause traffic flow to be asymmetric
    - NLB cannot rely on seeing outbound packets
  - State is created before the connection is accepted
  - Both of which can result in stale/orphaned descriptors
    - Wasted memory, performance degradation, packet loss
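
A minimal sketch of descriptor maintenance in packet-snooping mode, using an in-memory set keyed by the IP 5-tuple (illustrative only, not NLB's data structures):

    from typing import NamedTuple

    class Descriptor(NamedTuple):
        """A connection descriptor is essentially the IP 5-tuple."""
        client_ip: str
        client_port: int
        server_ip: str
        server_port: int
        protocol: str  # "TCP"

    descriptors: set[Descriptor] = set()

    def snoop_tcp(desc: Descriptor, syn: bool, fin: bool, rst: bool, accepted: bool) -> None:
        """Maintain descriptor lifetime by watching TCP packet flags."""
        if syn and accepted:
            descriptors.add(desc)      # SYN: create a descriptor to track the connection
        elif fin or rst:
            descriptors.discard(desc)  # FIN/RST: destroy the descriptor, if one is found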

TCP Connection Notifications


 

- NLB receives explicit notifications from TCP/IP when connection state is created or destroyed
- TCP/IP notifies NLB when a connection enters:
  - SYN_RCVD
    - A descriptor is created to track the inbound connection
  - CLOSED
    - Destroys the associated descriptor, if one is found
- Advantages include:
  - NLB state maintenance remains in very close synchronization with TCP/IP
    - NLB TCP connection tracking is more reliable
  - Affords NLB protection against SYN attacks, etc.

TCP Connection Affinity

[Diagram: NLB cluster of Host 1, Host 2 and Host 3; Host 3, the owner of Client 1's bucket, holds the TCP connection descriptor; the client connects through a switch and the Internet]

1. A client initiates a TCP connection by sending a SYN to the NLB cluster.
2. The SYN is accepted by Host 3, which creates a TCP connection descriptor.
3. A SYN+ACK is sent back to the client.
4. The client completes the three-way handshake by sending an ACK back to the NLB cluster.
5. The ACK is accepted by Host 3 because it has a matching TCP connection descriptor.
6. The ACK is rejected by Host 1 because it knows that other hosts have active TCP connections and it does NOT have a matching TCP connection descriptor.

Session Tracking


- Session tracking complications in NLB are of one of two forms, or a combination thereof:
  - The inability of NLB to detect the start or end of a session in the protocol itself
  - The need to associate multiple seemingly unrelated streams and provide affinity to a single server
- PPTP and IPSec/L2TP sessions are supported
  - Requires specialized support from PPTP and IPSec
- UDP sessions are not supported
  - Session start and end are unknown to NLB
- SSL sessions are not supported
  - SSL sessions span TCP connections




Session Tracking


- The classic NLB answer to session support is to use client affinity
  - Assumes that client identity remains the same throughout the session
- AOL proxy problem (scaling proxy arrays)
  - Different connections in the same session may have different client IP addresses and/or ports
  - Using Class C affinity can help, but is likely to highly skew the achievable load balance
- Terminal Server
  - Session lifetime is highly indeterminate
  - Sessions can span many connections
  - Subsequent connections may be from different locations

Scale-Out Limitations

- Network limitations
  - Switch flooding
    - The pipe to each host in the cluster must be as fat as the uplink pipe
    - Not allowing the switch to learn the MAC address causes degraded switch performance as well
  - Incompatible with layer 3 switches
    - All hosts share the same virtual IP address(es)
- CPU limitations
  - Packet drop overhead
    - Every host drops (N-1)/N of all packets on average

Load-Balancing Limitations

- The NLB load-balancing algorithm is static
  - Only the IP 5-tuple is considered when making load-balancing decisions
  - No dynamic metrics are taken into consideration
    - E.g., CPU, memory, total number of connections
  - No application semantics are taken into consideration
    - E.g., Terminal Server vs. IIS
- NLB requires a sufficiently large (and varied) client population to achieve the configured balance
  - A small number of clients will result in poor balance
  - Mega proxies can significantly skew the load balance

Other Limitations


- No inter-host communication is possible without a second NIC
  - Hosts are cloned and traffic destined for local MAC addresses doesn't reach the wire
  - Both multicast modes address this issue, but require a static ARP entry in Cisco routers
- NLB provides connection, not session, affinity
  - TCP connections are preserved during a rebalance
  - NLB generally has no session awareness
    - E.g., SSL can/will break during a rebalance
  - Specialized support from NLB and VPN allows VPN sessions (tunnels) to be preserved during a rebalance

Summary


- NLB is fully distributed, symmetric, software-based TCP/IP load balancing
  - Cloned services run on each host in the cluster and client requests are partitioned across all hosts
- NLB provides high availability and scale-out for IP services
- NLB is appropriate for load balancing:
  - Stateless services
  - Short-lived connections from many clients
  - Downloads such as HTTP or FTP GETs
  - In small clusters (less than 10 nodes)

Advanced Topics
  

- Multicast
- Bi-directional affinity
- VPN session support

Multicast


- All hosts share a common multicast MAC address
  - Each host retains its unique MAC address
  - NLB munges ARP requests to resolve all virtual IP addresses to the shared multicast MAC address
  - All ARP requests for the dedicated IP address of a host resolve to the unique hardware MAC address
- Does not limit switch flooding
  - Packets addressed to multicast MAC addresses are flooded by switches
- Does allow inter-host communication

Multicast


- NLB multicast modes break an Internet RFC
  - Unicast IP addresses cannot resolve to multicast MAC addresses
- Requires a static ARP entry on Cisco routers
  - Cisco routers won't dynamically add the ARP entry
  - Cisco plans to eliminate support for static ARP entries for multicast addresses
- The ping-pong effect
  - In a redundant router configuration, multicast packets may be repeatedly replayed onto the network
    - Typically until the TTL reaches zero
  - Router utilization skyrockets
  - Network bandwidth plummets

IGMP Multicast
 

- All hosts share a common IGMP multicast MAC address
  - ARP requests for all virtual IP addresses resolve to the shared IGMP multicast MAC address
- IGMP does limit switch flooding
  - All cluster hosts join the same IGMP group
    - Hosts periodically send IGMP join messages on the network
  - Switches forward packets destined for the IGMP multicast MAC address only on the ports on which the switch has recently received a join message for that IGMP group
- Still requires a static ARP entry in Cisco routers

Bi-Directional Affinity

- Proxy/firewall scalability and availability
- By default, NLB instances on distinct network adapters operate independently
  - Independently configured
  - Independently converge and distribute load
  - Independently make packet accept/drop decisions
- Firewall stateful packet inspection requires:
  - That load-balancers associate multiple packet streams
  - That all related packet streams get load-balanced to the same firewall server
  - This is Bi-Directional Affinity (BDA)

No Bi-Directional Affinity

[Diagram: clients on the Internet, an NLB/Firewall cluster of Host 1, Host 2 and Host 3 maintaining firewall state (SPI), and an internal published server]

1. The client initiates a request to the NLB/Firewall cluster.
2. One NLB/Firewall server accepts the client request.
3. A firewall routes the request to the appropriate internal server.
4. The internal server sends a response to the client via the NLB/Firewall cluster.
5. The internal server response may be accepted by a different NLB/Firewall server than the one that handled the initial request. This breaks stateful packet inspection.

Stateful Packet Inspection


  

- Firewalls maintain state [generally] on a per-connection basis
- This state is necessary to perform advanced inspection of traffic through the firewall
- Requires special load-balancing semantics
  - Load-balance incoming external requests for internal resources
  - Load-balance outgoing internal requests for external resources
  - Maintain firewall server affinity for the responses
    - Return traffic must pass through the same firewall server as the request

The Affinity Problem




- Firewalls/proxies often alter TCP connection parameters
  - Source and destination ports
  - Source IP address
    - If translated, the host's dedicated IP address should be used
  - Destination IP address
    - In many scenarios, a published IP address is translated at the firewall into a private IP address
- The packets of the request and associated response are often very different
  - Difficult for load-balancers to associate the two seemingly unrelated streams and provide affinity

The Affinity Problem




- Incoming packets utilize the conventional NLB hashing algorithm
  - Look up the applicable port rule using the server port
  - Hash on the IP 2-tuple or 4-tuple
  - Map that result to an owner, who accepts the packet
- For firewalls/proxies, problems include:
  - Port rule lookup
    - The server port is different on the client and server sides of the firewall
  - Hashing function
    - Ports and IP addresses have been altered by the firewall
  - Bucket ownership
    - Each NLB instance has independent bucket ownership


BDA Teaming
 

- Abandons some aspects of independence between designated NLB instances
- All members of a BDA team belong to a different cluster that continues to converge independently
  - Primarily useful for consistency and failure detection
- However, all members of a BDA team share load-balancing state, including:
  - Connection descriptors
  - Bucket distribution
- Allows all team members to make consistent accept/drop decisions and preserve affinity

Preserving Affinity with BDA




- Requirements include:
  - A single port rule, ports = (0 - 65535)
    - Eliminates problems with port rule lookup due to port translation
  - Single or Class C affinity on the only port rule
    - Eliminates hashing problems due to port translation
  - Server IP address not used in hashing
    - Eliminates hashing problems due to IP address translation
- The lone common element in hashing is then the client IP address
  - Use the source IP address on incoming client requests
  - Use the destination IP address on server responses
    - This is often called reverse-hashing
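
A small sketch of reverse hashing: the hash input is always the client IP address, taken from the source of an inbound client request and from the destination of a server response, so both directions map to the same bucket and therefore to the same firewall host (crc32 stands in for NLB's hash; the addresses are examples only):

    import zlib

    BUCKET_COUNT = 60

    def bda_bucket(src_ip: str, dst_ip: str, reverse: bool) -> int:
        """Forward hash uses the source IP (client requests);
        reverse hash uses the destination IP (server responses)."""
        client_ip = dst_ip if reverse else src_ip
        return zlib.crc32(client_ip.encode()) % BUCKET_COUNT

    # The inbound request (client -> published IP) and the outbound response
    # (internal server -> client) land in the same bucket, preserving firewall affinity:
    assert bda_bucket("192.0.2.10", "203.0.113.5", reverse=False) == \
           bda_bucket("198.51.100.7", "192.0.2.10", reverse=True)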

Bi-Directional Affinity

[Diagram: clients on the Internet, an NLB/Firewall cluster of Host 1, Host 2 and Host 3 maintaining firewall state (SPI), and an internal published server]

1. The client initiates a request to the NLB/Firewall cluster.
2. NLB hashes on the source IP address (the client IP address) of the request, and one NLB/Firewall server accepts the client request.
3. A firewall routes the request to the appropriate internal server.
4. The internal server sends a response to the client via the NLB/Firewall cluster.
5. NLB hashes on the destination IP address (the client IP address) of the response; Bi-Directional Affinity ensures that the response is handled by the same NLB/Firewall server that handled the initial request.
6. The response is sent back to the client.

BDA Miscellaneous


- External entities are expected to monitor the health of BDA teams through NLB WMI events
  - E.g., if one member of a BDA team fails, the entire team should be stopped
    - All load will then be re-distributed to the surviving hosts
- Reverse hashing is set on a per-adapter basis
- To override the configured hashing scheme on a per-packet basis, NLB provides a kernel-mode hook
  - Entities register to see all packets in the send and/or receive paths and can influence NLB's decision to accept/drop them
  - The hook returns ACCEPT, REJECT, FORWARD hash, REVERSE hash or PROCEED with the default hash
  - Enables extensions to BDA support for more complex firewall/proxy scenarios without explicit changes to NLB

VPN Session Support




- Support for clustering VPN servers
- PPTP
  - TCP tunnel
  - GRE Call IDs
- IPSec/L2TP
  - Notifications from IKE
  - No FINs
  - MM and QM SAs
  - INITIAL_CONTACT
