Agenda
NLB architecture and fundamentals
NLB cluster membership protocol
Packet filtering and TCP connection affinity
Limitations of NLB
Advanced NLB topics
Q&A
Introduction to NLB
Cloned services (e.g., IIS) run on each host in the cluster
Client requests are partitioned across all hosts
Load distribution is static, but configurable through load weights (percentages)
Uses commodity hardware
Simple and robust
Highly available
Introduction to NLB
NLB Provides:
Scale-out for IP services
High availability (no single point of failure)
An inexpensive alternative to HW LB devices
NLB is appropriate for load balancing:
Stateless services
Short-lived connections from many clients
Downloads such as HTTP or FTP GETs
In small clusters (less than 10 nodes)
High-end hardware load-balancers cover a much broader range of load-balancing scenarios
NLB Architecture
NLB is an NDIS intermediate filter driver inserted between the physical NIC and the protocols in the network stack
Fundamental Algorithm
NLB is fundamentally just a packet filter
Via the NLB membership protocol, all hosts in the cluster agree on the load distribution
NLB requires that all hosts see all inbound packets
Each host discards those packets intended for other hosts in the cluster
Packets accepted on each host are passed up to the protocol(s) and one response is sent back to the client
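The accept/drop decision described above can be sketched as an illustrative Python model. The hash function and bucket map here are placeholders (not the actual driver's algorithm); the point is that every host evaluates the same function and exactly one host accepts each packet.

```python
# Sketch: every host sees every inbound packet, applies the same
# agreed-upon hash, and only the bucket owner accepts it, so exactly
# one response goes back to the client.
import hashlib

NUM_BUCKETS = 60  # NLB quantizes the load distribution into 60 buckets

def bucket_for(client_ip: str, client_port: int) -> int:
    # Placeholder hash; the real driver uses its own hashing function.
    digest = hashlib.md5(f"{client_ip}:{client_port}".encode()).digest()
    return digest[0] % NUM_BUCKETS

def accepting_host(bucket_map, client_ip: str, client_port: int) -> int:
    # bucket_map: bucket number -> host ID that owns it (agreed via heartbeats)
    return bucket_map[bucket_for(client_ip, client_port)]

# Three hosts each own 20 of the 60 buckets.
bucket_map = {b: (b // 20) + 1 for b in range(NUM_BUCKETS)}
host = accepting_host(bucket_map, "10.0.0.5", 29290)
assert 1 <= host <= 3  # exactly one host accepts; the others drop the packet
```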
Host 3
Host 2
Host 1
NLB Cluster
A client initiates a request to an NLB cluster. The network floods the incoming client request. One server accepts the client request. A response is sent back to the client.
Internet
Client(s)
Unicast, multicast and IGMP multicast
Unicast makes up approximately 98% of deployments
For the rest of this talk, assume unicast operation
All hosts share the same set of virtual IP addresses All hosts share a common network (MAC) address
Communication with specific cluster hosts is accomplished through the use of dedicated NICs or dedicated IP addresses
Unicast Mode
Each host in the cluster is configured with the same unicast MAC address
02-bf-WW-XX-YY-ZZ
02 = locally administered address
bf = arbitrary (Bain/Faenov)
WW-XX-YY-ZZ = the primary cluster IP address
All ARP requests for virtual IP addresses resolve to this cluster MAC address automagically
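The MAC construction above can be sketched in a few lines. The helper name `cluster_mac` is hypothetical; the derivation simply follows the 02-bf-WW-XX-YY-ZZ convention just described.

```python
# Sketch: derive the unicast cluster MAC address 02-bf-WW-XX-YY-ZZ from
# the primary cluster IP address, per the convention described above.
import ipaddress

def cluster_mac(primary_cluster_ip: str) -> str:
    octets = ipaddress.IPv4Address(primary_cluster_ip).packed
    return "02-bf-" + "-".join(f"{o:02x}" for o in octets)

print(cluster_mac("10.0.0.1"))  # 02-bf-0a-00-00-01
```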
NLB must ensure that all inbound packets are received by all hosts in the cluster
All N hosts in the cluster receive every packet and N-1 hosts discard each packet
On each port, switches snoop the source MAC addresses of all packets received
Those source MAC addresses are learned on that port When packets arrive destined for a learned MAC address, they are forwarded only on the associated port
NLB/Switch Incompatibility
All NLB hosts share the same cluster MAC address
Switches only allow a particular MAC address to be associated with one switch port at a given time
This results in the cluster MAC address / port association thrashing between ports
Host 3
Host 2
Host 1
Switch
Inbound packets are only forwarded to the port with which the switch currently believes the cluster MAC address is associated
Connectivity to the cluster will be intermittent at best
NLB uses MAC address masking to keep switches from learning the cluster MAC address and associating it with a particular port
NLB spoofs the source MAC address of all outgoing packets
The second byte of the source MAC address is overwritten with the host's unique NLB host ID
This prevents switches from associating the cluster MAC address with a particular port
Switches associate only the masked MAC addresses with ports, which enables inbound packet flooding
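The masking rule above amounts to a one-byte rewrite of the source MAC address. A minimal sketch (illustrative names, not driver code):

```python
# Sketch: mask the source MAC by overwriting the second byte (0xbf)
# with the host's unique NLB host ID, so switches learn a per-host
# address instead of the shared cluster MAC.
def mask_source_mac(cluster_mac: str, host_id: int) -> str:
    parts = cluster_mac.split("-")
    parts[1] = f"{host_id:02x}"
    return "-".join(parts)

print(mask_source_mac("02-bf-0a-00-00-01", 3))  # 02-03-0a-00-00-01
```

This matches the diagram: Host 3's outgoing packets carry 02-03-0a-00-00-01, so the switch never associates 02-bf-0a-00-00-01 with any port.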
Host 3
Host 2
Host 1
NLB Cluster
A client initiates a request to the NLB cluster. An ARP request for 10.0.0.1 resolves to the cluster MAC address, 02-bf-0a-00-00-01. The switch does not know to which port 02-bf-0a-00-00-01 belongs, so it floods the request to all ports. One server accepts the client request. A response is sent back to the client. The source MAC address is masked using the host's unique host ID. The switch will continue to associate 02-03-0a-00-00-01, not 02-bf-0a-00-00-01, with this switch port. This enables switch flooding.
Switch
Client(s)
10.0.0.5 00-a0-cc-a1-cd-9f
Load-Balancing Overview
Each host periodically sends a heartbeat packet to announce its presence and distribute load
Load distribution is quantized into 60 buckets that are distributed amongst hosts
Each host owns a subset of the buckets
Typically using the IP 2-tuple or 4-tuple as input to the hashing function
The owner of the bucket accepts the packet; the others drop the packet
What happens to existing connections if bucket ownership changes?
Cluster Membership
Each NLB host is assigned a unique host ID in the range from 1 to 32 (the maximum cluster size)
Using Ethernet broadcast, each host sends heartbeat packets to announce its presence and distribute load
Twice per second during convergence
Once per second after convergence completes
Registered Ethernet type = 0x886f
Contains configuration information such as host ID, dedicated IP address, port rules, etc.
Contains load balancing state such as the load distribution, load weights (percentages), activity indicators, etc.
Convergence
Convergence is a distributed mechanism for determining cluster membership and load distribution
Misconfiguration can cause perpetual convergence Network problems can cause periodic convergence Cluster operations continue during convergence
Triggering Convergence
Joining hosts
New hosts trigger convergence to repartition the load distribution and begin accepting client requests
Departing hosts
The other hosts pick up the slack when a fixed number of heartbeats are missed from the departing host
Configuration changes
Administrative operations that change the configured load of a server (disable, enable, drain, etc.)
1. All hosts enter the CONVERGING state
2. The host with the smallest host ID is elected the default host
3. Each host moves from the CONVERGING state to the STABLE state after a fixed number of epochs in which consistent membership and load distribution are observed
4. The default host enters the CONVERGED state after a fixed number of epochs in which all hosts are observed to be in the STABLE state
5. Other hosts enter the CONVERGED state when they see that the default host has converged
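The steps above can be modeled as a toy state machine. This is a loose sketch assuming one epoch per heartbeat period and an arbitrary threshold of 5 epochs; the real thresholds are NLB implementation details.

```python
# Sketch of the convergence state progression (simplified; illustrative
# names and an assumed threshold, not the driver's actual logic).
CONVERGING, STABLE, CONVERGED = "CONVERGING", "STABLE", "CONVERGED"
STABLE_EPOCHS = 5  # assumption: fixed, implementation-defined threshold

class Member:
    def __init__(self, host_id):
        self.host_id = host_id
        self.state = CONVERGING
        self.consistent_epochs = 0

def epoch(members, membership_consistent):
    # The host with the smallest host ID is elected the default host.
    default = min(members, key=lambda m: m.host_id)
    for m in members:
        if not membership_consistent:
            m.state, m.consistent_epochs = CONVERGING, 0
            continue
        m.consistent_epochs += 1
        if m.state == CONVERGING and m.consistent_epochs >= STABLE_EPOCHS:
            m.state = STABLE
    # The default host converges once every host is observed STABLE;
    # the others follow when they see the default host has converged.
    if all(m.state in (STABLE, CONVERGED) for m in members):
        default.state = CONVERGED
    if default.state == CONVERGED:
        for m in members:
            m.state = CONVERGED

members = [Member(i) for i in (1, 2, 3)]
for _ in range(2 * STABLE_EPOCHS):
    epoch(members, membership_consistent=True)
assert all(m.state == CONVERGED for m in members)
```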
Bucket Distribution
Buckets are distributed amongst hosts, but based on configuration (load weights/percentages), may or may not be shared equally
Buckets are not dynamically re-distributed based on load (no dynamic load balancing)
Goal: minimize disruption to existing connections during bucket transfer
Non-goal: optimize remaps across a series of convergences
Bucket Distribution
Based on the existing load distribution and the new membership information
When convergence completes, hosts transfer buckets pair-wise via heartbeats
First, the donor host surrenders ownership of the buckets and notifies the recipient
Soon thereafter, the recipient picks up the buckets, asserts ownership of them and notifies the donor
During the transfer (~2 seconds), nobody is accepting new connections on those buckets
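The two-phase handoff above can be sketched with plain sets. The key property is that ownership is released before it is asserted, so the two hosts never both own a bucket:

```python
# Sketch of the pair-wise bucket handoff: the donor gives up the buckets
# first, and only afterwards does the recipient assert ownership.
def transfer(donor: set, recipient: set, buckets: set):
    donor -= buckets            # phase 1: donor surrenders ownership
    # ... heartbeat exchanges pass; during this window (~2 seconds)
    # no host accepts new connections on the transferred buckets ...
    recipient |= buckets        # phase 2: recipient asserts ownership

host1, host2 = set(range(0, 30)), set(range(30, 60))
transfer(host1, host2, set(range(20, 30)))
assert host1.isdisjoint(host2) and host1 | host2 == set(range(60))
```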
Bucket Distribution
Advantages
Easy method by which to divide client population among hosts
Convenient for adjusting relative load weights (percentages) between hosts
Avoids state lookup in optimized cases
Disadvantages
Quantized domain has limited granularity
Host 3
Host 2
30-49 30-59
NLB Cluster
Host 1 joins the cluster and begins sending CONVERGING heartbeats. Convergence begins, and all three hosts in the cluster are sending CONVERGING heartbeats. Each host uses the same algorithm to compute the new load distribution. Hosts 2 and 3 converge and begin sending CONVERGED heartbeats. When convergence completes, each pair of hosts transfers the designated buckets via heartbeats. Buckets are removed from the donating host's bucket map before being handed off to the new owner.
Switch
Internet
Client 1
Packet Filtering
Filtered packets are those for which NLB will make an accept/drop decision (load-balance)
IP protocols that are filtered by NLB:
TCP
UDP
GRE: assumes a relationship with a corresponding PPTP tunnel
ESP/AH (IPSec): assumes a relationship with a corresponding IPSec/L2TP tunnel
ICMP: by default, all hosts accept ICMP; can be optionally filtered
Other protocols and Ethernet types are passed directly up to the protocol(s)
Client Affinity
None
Uses both client IP address and port when hashing
Typically provides the best load balance
Single
Uses only the client IP address when hashing
Used primarily for session support for SSL and multi-connection protocols (IPSec/L2TP, PPTP, FTP)
Class C
Uses only the class C subnet of the client IP address when hashing
Used primarily for session support for users behind scaling proxy arrays
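The three affinity modes reduce to a choice of hash key. A sketch follows; the hash function and names are illustrative, not NLB's actual implementation:

```python
# Sketch: the affinity setting determines which parts of the client
# address feed the hash (illustrative, not the driver's function).
import hashlib

def hash_key(affinity: str, client_ip: str, client_port: int) -> str:
    if affinity == "none":       # IP address + port: best balance
        return f"{client_ip}:{client_port}"
    if affinity == "single":     # IP address only: per-client stickiness
        return client_ip
    if affinity == "classc":     # class C subnet: proxy-array stickiness
        return ".".join(client_ip.split(".")[:3])
    raise ValueError(affinity)

def bucket(affinity, client_ip, client_port, buckets=60):
    key = hash_key(affinity, client_ip, client_port)
    return hashlib.md5(key.encode()).digest()[0] % buckets

# Two connections from the same client land in the same bucket under
# "single" affinity, though they may differ under "none".
assert bucket("single", "10.0.0.5", 1111) == bucket("single", "10.0.0.5", 2222)
```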
Hashing
Packets hash to one of 60 buckets, which are distributed amongst hosts
NLB employs bi-level hashing
Optimized
Level 1 hashing only
The bucket owner accepts the packet unconditionally
Non-optimized
Level 1 and level 2 hashing
State lookup is necessary to resolve ownership ambiguity
Hashing
No state is maintained for UDP, which eliminates the need for level 2 hashing
Protocols such as TCP can operate in either optimized or non-optimized mode
State is maintained for all TCP connections
When ambiguity arises, state lookup determines ownership
New connections always belong to the bucket owner
Global aggregation determines when other hosts complete service of a lingering connection and optimize out level 2 hashing
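Bi-level hashing for TCP can be sketched as follows. This simplifies the real behavior (it ignores the global-aggregation optimization, for example) and all names are illustrative:

```python
# Sketch: level 1 maps the packet to a bucket and its current owner;
# after a rebalance, a level 2 lookup of per-connection descriptors
# decides which host keeps servicing an existing stream.
class NlbHost:
    def __init__(self, host_id, owned_buckets):
        self.host_id = host_id
        self.owned = owned_buckets       # buckets this host currently owns
        self.descriptors = set()         # IP 5-tuples of serviced connections

    def accept(self, five_tuple, bucket, is_syn):
        if is_syn:
            # New connections always belong to the current bucket owner.
            return bucket in self.owned
        # Existing streams: only the host holding a descriptor accepts.
        return five_tuple in self.descriptors

conn = ("10.0.0.5", 29290, "10.0.0.1", 80, "TCP")
old_owner = NlbHost(1, set())            # lost bucket 14 in a rebalance
old_owner.descriptors.add(conn)          # but still services this connection
new_owner = NlbHost(2, {14})             # gained bucket 14

assert old_owner.accept(conn, 14, is_syn=False)       # lingering stream stays
assert not new_owner.accept(conn, 14, is_syn=False)
assert new_owner.accept(("1.2.3.4", 5, "10.0.0.1", 80, "TCP"), 14, is_syn=True)
```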
Host 3
Host 2
Host 1
10.0.0.1
From: 10.0.0.1:80 To: 10.0.0.5:29290 20-39 0-19 40-59
02-bf-0a-00-00-01
NLB Cluster
A client initiates a request to an NLB cluster. A hash on the IP 5-tuple (10.0.0.5, 29290, 10.0.0.1, 80, TCP) maps to bucket 14, owned by Host 3. Host 3 accepts the request; all other hosts drop the request. A response is sent back to the client.
Switch
Internet
Client(s)
10.0.0.5 00-a0-cc-a1-cd-9f
Connection Tracking
Ensures that connections are serviced by the same host for their duration even if a change in bucket ownership occurs
Sessionful vs. sessionless hashing
Sessionless: if an ownership change occurs, existing streams shift immediately to the new bucket owner
Sessionful: if an ownership change occurs, existing connections continue to be serviced by the old bucket owner; requires state maintenance and lookup to resolve packet ownership ambiguity
Host 2
NLB Cluster
A client initiates a TCP connection by sending a SYN to the NLB cluster. The SYN is accepted by Host 3. A SYN+ACK is sent back to the client. The client completes the three-way handshake by sending an ACK back to the NLB cluster. The ACK is accepted by Host 1, breaking the TCP connection.
Switch
Internet
Client 1
In optimized mode, descriptors are maintained, but not needed when making accept/drop decisions
Connection state can be tracked by monitoring TCP packet flags (SYN, FIN, RST) using kernel callbacks
SYN: if accepted, a descriptor is created to track the connection; only one host should have a descriptor for this IP 5-tuple
FIN/RST: destroys the associated descriptor, if one is found
Problems include:
State is created before the connection is accepted
Such problems can result in stale/orphaned descriptors
NLB receives explicit notifications from TCP/IP when connection state is created or destroyed
TCP/IP notifies NLB when a connection enters:
SYN_RCVD: a descriptor is created to track the inbound connection
CLOSED: destroys the associated descriptor, if one is found
Advantages include:
Host 3
TCP Connection Descriptor
Host 2
Client 1 Owner
NLB Cluster
A client initiates a TCP connection by sending a SYN to the NLB cluster. The SYN is accepted by Host 3. A SYN+ACK is sent back to the client. The client completes the three-way handshake by sending an ACK back to the NLB cluster. The ACK is accepted by Host 3 because it knows it has active TCP connections and a matching TCP connection descriptor.
Switch
Internet
The ACK is rejected by Host 1 because it knows that other hosts have active TCP connections and it does NOT have a matching TCP connection descriptor.
Client 1
Session Tracking
Session tracking complications in NLB are of one of two forms, or a combination thereof:
The inability of NLB to detect the start or end of a session in the protocol itself
The need to associate multiple seemingly unrelated streams and provide affinity to a single server
Requires specialized support from PPTP and IPSec
Session start and end are unknown to NLB
SSL sessions span TCP connections
Session Tracking
Assumes that client identity remains the same throughout the session
Different connections in the same session may have different client IP addresses and/or ports
Using class C affinity can help, but is likely to highly skew the achievable load balance
Session lifetime is highly indeterminate
Sessions can span many connections
Terminal Server
Scale-Out Limitations
Network limitations
Switch flooding
The pipe to each host in the cluster must be as fat as the uplink pipe
Not allowing the switch to learn the MAC address causes degraded switch performance as well
All hosts share the same virtual IP address(es)
CPU limitations
Load-Balancing Limitations
E.g., CPU, memory, total number of connections
E.g., Terminal Server vs. IIS
NLB requires a sufficiently large (and varied) client population to achieve the configured balance
A small number of clients will result in poor balance Mega proxies can significantly skew the load balance
Other Limitations
Hosts are cloned and traffic destined for local MAC addresses doesn't reach the wire
Both multicast modes address this issue, but require a static ARP entry in Cisco routers
TCP connections are preserved during a rebalance
NLB generally has no session awareness
E.g., SSL can/will break during a rebalance Specialized support from NLB and VPN allows VPN sessions (tunnels) to be preserved during a rebalance
Summary
Cloned services run on each host in the cluster and client requests are partitioned across all hosts
NLB provides high availability and scale-out for IP services
NLB is appropriate for load balancing:
Stateless services
Short-lived connections from many clients
Downloads such as HTTP or FTP GETs
In small clusters (less than 10 nodes)
Advanced Topics
Multicast
Each host retains its unique MAC address
Packets addressed to multicast MAC addresses are flooded by switches
NLB munges ARP requests to resolve all virtual IP addresses to the shared multicast MAC address
All ARP requests for the dedicated IP address of a host resolve to the unique hardware MAC address
Multicast
Unicast IP addresses cannot resolve to multicast MAC addresses Requires a static ARP entry on Cisco routers
Cisco routers won't dynamically add the ARP entry
Cisco plans to eliminate support for static ARP entries for multicast addresses
In a redundant router configuration, multicast packets may be repeatedly replayed onto the network
IGMP Multicast
All hosts share a common IGMP multicast MAC address IGMP does limit switch flooding
ARP requests for all virtual IP addresses resolve to the shared IGMP multicast MAC address Switches forward packets destined for IGMP multicast MAC address only on the ports on which the switch has recently received a join message for that IGMP group
Proxy/firewall scalability and availability
By default, NLB instances on distinct network adapters operate independently
Independently configured
Independently converge and distribute load
Independently make packet accept/drop decisions
Proxy/firewall load balancing requires:
That load-balancers associate multiple packet streams
That all related packet streams get load-balanced to the same firewall server
This is Bi-Directional Affinity (BDA)
No Bi-Directional Affinity
The client initiates a request to the NLB/Firewall cluster. One NLB/Firewall server accepts the client request. A firewall routes the request to the appropriate internal server. The internal server sends a response to the client via the NLB/Firewall cluster. The internal server response may be accepted by a different NLB/Firewall server than the one that handled the initial request. This breaks stateful packet inspection.
NLB/Firewall Cluster
Firewall State (SPI)
Host 1
Host 2
Host 3
Client(s)
Internet
Published Server
Firewalls maintain state [generally] on a per-connection basis
This state is necessary to perform advanced inspection of traffic through the firewall
Requires special load-balancing semantics
Load-balance incoming external requests for internal resources
Load-balance outgoing internal requests for external resources
Maintain firewall server affinity for the responses
Return traffic must pass through the same firewall server as the request
Destination IP address
In many scenarios, a published IP address is translated at the firewall into a private IP address
If translated, the host's dedicated IP address should be used
The packets of the request and associated response are often very different
Difficult for load-balancers to associate the two seemingly unrelated streams and provide affinity
To make an accept/drop decision, NLB must:
Lookup the applicable port rule using the server port
Hash on the IP 2-tuple or 4-tuple
Map that result to an owner, who accepts the packet
Each step is complicated by the firewall scenario:
Port rule lookup: the server port is different on the client and server sides of the firewall
Hashing function: ports and IP addresses have been altered by the firewall
Bucket ownership: each NLB instance has independent bucket ownership
BDA Teaming
Abandons some aspects of independence between designated NLB instances
Each member of a BDA team belongs to a different cluster, which continues to converge independently
Allows all team members to make consistent accept/drop decisions and preserve affinity
Requirements include:
Use the source IP address on incoming client requests
Use the destination IP address on server responses
Eliminates problems with port rule lookup due to port translation
Eliminates hashing problems due to port translation
Eliminates hashing problems due to IP address translation
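The two rules above can be sketched as a direction-aware hash. The hash itself is a placeholder, but the symmetry is the point: both legs of the exchange hash the client IP address, so they land on the same host.

```python
# Sketch of Bi-Directional Affinity: hash the client IP on both legs,
# i.e. the source IP of the inbound request and the destination IP of
# the outbound response (illustrative hash, not NLB's actual function).
import hashlib

def owner(ip: str, num_hosts: int = 3) -> int:
    return hashlib.md5(ip.encode()).digest()[0] % num_hosts + 1

def bda_owner(src_ip: str, dst_ip: str, inbound: bool) -> int:
    client_ip = src_ip if inbound else dst_ip
    return owner(client_ip)

# Request (client -> published server) and response (server -> client)
# hash to the same firewall host.
assert bda_owner("10.0.0.5", "192.168.1.10", inbound=True) == \
       bda_owner("192.168.1.10", "10.0.0.5", inbound=False)
```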
Bi-Directional Affinity
The client initiates a request to the NLB/Firewall cluster. NLB hashes on the source IP address (the client IP address) of the request. One NLB/Firewall server accepts the client request. A firewall routes the request to the appropriate internal server. The internal server sends a response to the client via the NLB/Firewall cluster. NLB hashes on the destination IP address (the client IP address) of the response. Bi-Directional Affinity ensures that the response is handled by the same NLB/Firewall server that handled the initial request. The response is sent back to the client.
NLB/Firewall Cluster
Firewall State (SPI)
Host 1
Host 2
Host 3
Client(s)
Internet
Published Server
BDA Miscellaneous
External entities are expected to monitor the health of BDA teams through NLB WMI events
E.g., if one member of a BDA team fails, the entire team should be stopped
To override the configured hashing scheme on a per-packet basis, NLB provides a kernel-mode hook
Entities register to see all packets in the send and/or receive paths and can influence NLB's decision to accept/drop them
Hook returns ACCEPT, REJECT, FORWARD hash, REVERSE hash or PROCEED with the default hash
Enables extensions to BDA support for more complex firewall/proxy scenarios without explicit changes to NLB
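The hook contract can be sketched as an enumeration of verdicts. This is a Python stand-in for what is in reality a kernel-mode C interface; the names are illustrative.

```python
# Sketch of the per-packet hook: a registered callback returns a verdict
# that either decides the packet outright or selects which hash to apply.
from enum import Enum, auto

class HookVerdict(Enum):
    ACCEPT = auto()    # accept the packet unconditionally
    REJECT = auto()    # drop the packet unconditionally
    FORWARD = auto()   # apply the normal (forward) hash
    REVERSE = auto()   # hash on the reverse direction (as in BDA)
    PROCEED = auto()   # fall through to the default hash

def filter_packet(pkt, hook, forward_hash, reverse_hash):
    verdict = hook(pkt)
    if verdict is HookVerdict.ACCEPT:
        return True
    if verdict is HookVerdict.REJECT:
        return False
    if verdict is HookVerdict.REVERSE:
        return reverse_hash(pkt)
    return forward_hash(pkt)   # FORWARD or PROCEED (default hash)

accept_all = lambda pkt: HookVerdict.ACCEPT
assert filter_packet({"src": "10.0.0.5"}, accept_all, lambda p: False, lambda p: False)
```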
PPTP
IPSec/L2TP