Big news! GitHub releases GLB, an open source load balancing component

Introduction: GLB Director is GitHub's recently open-sourced load balancer, positioned as a better data center load balancer. This article introduces its design and features in detail.

At GitHub, we serve tens of thousands of requests per second on bare metal at our network edge. In a previous article we introduced GLB, our scalable load balancing solution for bare metal data centers. It powers most of GitHub's external services and also provides load balancing for our most critical internal systems, such as our highly available MySQL clusters. Today, we are happy to share more details about the load balancer's design and to open source GLB Director.

GLB Director is a Layer 4 load balancer that scales a single IP address across a large number of physical machines while minimizing connection disruption during changes. GLB Director does not replace services like haproxy and nginx; instead, it sits in front of these services (or any TCP service), allowing them to scale across multiple physical machines without requiring each machine to have a unique IP address.

Scaling an IP address with ECMP

The basic property of a Layer 4 load balancer is the ability to balance connections across multiple servers using a single IP address. To scale a single IP to handle more traffic, we not only need to split traffic among the backend servers, but also need to be able to scale the load balancer tier itself; this is effectively another layer of load balancing.

Typically, we think of an IP address as identifying a single physical machine, and of a router as a device that moves packets to the next closest router. In the simplest case, there is always a single best next hop; the router selects that hop and forwards all packets along it until the destination is reached.

In reality, most networks are much more complicated. There are usually multiple paths available between two computers, for example when multiple ISPs are in use or when two routers are connected by multiple physical cables to increase capacity and provide redundancy. This is where Equal-Cost Multi-Path (ECMP) routing comes into play: instead of the router choosing a single best next hop, in ECMP many paths have the same cost (usually defined as the number of AS hops to the destination), and the router spreads traffic across all available paths of equal cost so that connections are balanced.

ECMP picks one of the available paths by hashing each packet. The hash function used varies from device to device, but it is usually a consistent hash based on the source and destination IP addresses and, for TCP traffic, the source and destination ports. This means that packets belonging to the same TCP connection will usually traverse the same path, so they arrive in order even when the paths have different latencies. Notably, in this case paths can change without disrupting connections, because every packet still ends up at the same destination server, and the exact path it takes is mostly irrelevant.
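To make the per-flow behavior concrete, here is a small Python sketch of hashing a flow's 5-tuple to pick one of several equal-cost next hops. The hash function and field choice here are purely illustrative, not what any particular router actually uses:

```python
import hashlib

def ecmp_next_hop(src_ip, dst_ip, src_port, dst_port, proto, next_hops):
    """Map a flow's 5-tuple to one of the equal-cost next hops.

    Packets of the same TCP connection always produce the same hash,
    so they keep taking the same path even though several paths exist.
    """
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    digest = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
    return next_hops[digest % len(next_hops)]

paths = ["router-a", "router-b", "router-c"]
# Every packet of this connection maps to the same next hop.
print(ecmp_next_hop("203.0.113.7", "192.0.2.10", 54321, 443, "tcp", paths))
```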

Another use of ECMP is to spread traffic across multiple servers rather than across multiple paths to the same server. Each server can announce the same IP address using BGP or a similar routing protocol, and connections are then sharded across those servers; the routers are unaware that connections are being handled in different places rather than, as in the traditional approach, all on the same machine.

Although ECMP shards traffic this way, it has a huge drawback: when the set of servers announcing the same IP changes (or any path or router along the way changes), connections must be rebalanced so that each server keeps a balanced share. Routers are usually stateless devices that simply make the best decision for each packet regardless of which connection it belongs to, which means some connections will break in this situation.
[Figure: ECMP rehashing connections when a proxy server announcing the same IP is added]

In the example above, imagine that each color represents an active connection. A new proxy server is added announcing the same IP. The router applies consistent hashing and moves roughly 1/3 of the connections to the new server while keeping 2/3 of the connections on the old servers. Unfortunately, for that 1/3 of in-flight connections, packets now arrive at a server that knows nothing about the connection, so those connections fail.

Separating the director and proxy tiers

The problem with the ECMP-only approach above is that it is unaware of the full context of a given packet and cannot store state for each packet or connection. Tools such as Linux Virtual Server (LVS) are commonly used for this. We create a new "director" tier of servers that receives packets from the routers via ECMP, but instead of relying on the routers' ECMP hashing to choose the backend proxy server, the directors control the hashing and store state (the backend selection) for all connections. When we change the set of proxy tier servers, the director tier remains unchanged, so our connections are not broken.
[Figure: a director tier receiving traffic via ECMP and selecting backend proxy servers]

Although this works well in many situations, it has some drawbacks. In the example above, we add an LVS director and a backend proxy server at the same time. The new director receives some packets for which it has no state (or has stale state), so it hashes them as if they were new connections and may get the choice wrong (causing the connection to fail). The typical workaround with LVS is to use multicast connection syncing to share connection state among all of the LVS director servers. This still requires connection state to be propagated, and it duplicates state: not only does each proxy need the state of each connection in its Linux kernel network stack, but every LVS director also needs to store the mapping from connection to backend proxy server.

Removing all state from the director tier

When we designed GLB, we decided to improve on this situation rather than duplicate state. GLB takes a different approach from the designs above: it relies on the flow state already stored in the proxy servers as part of maintaining the established Linux TCP connections from clients.

For each incoming connection, we select a primary server and a secondary server that could handle it. When a packet arrives at the primary server and is not valid there, it is forwarded to the secondary server. The hashing that selects the primary and secondary servers is done once, up front, and stored in a lookup table, so it does not need to be recomputed per flow or per packet. When a new proxy server is added, it becomes the new primary for 1/N of connections, and the old primary becomes the secondary. This allows existing flows to complete, because the proxy server can use its own local state, the single source of truth, to make the decision. Essentially, this gives packets a "second chance" to reach the server that already holds their state.

Even if a director sends a connection to the "wrong" server, that server knows how to forward the packet to the correct one. As far as TCP flows are concerned, the GLB director tier is completely stateless: director servers can come and go at any time and will always select the same primary/secondary pair, as long as their forwarding tables match (and those change only rarely). There are some details that need attention when changing the proxy servers, which we cover below.

Maintaining invariants with rendezvous hashing

The core of the GLB Director design comes down to selecting the primary and secondary servers consistently, while allowing proxy tier servers to be drained and filled as needed. We treat each proxy server as having a state, and we adjust that state as servers join or leave.

We create a static binary forwarding table, generated in the same way on every director server, that maps incoming connections to a given primary and secondary server. Rather than running complex logic to select a server from the full list during packet processing, we do it indirectly through the table: it has 65k rows, each row containing the IP addresses of a primary and a secondary server. The table is stored in memory as a two-dimensional array and is only about 512kb per table. When a packet arrives, we always hash it (based only on the packet data) to the same row of the table (using the hash as an index into the array), which gives us a consistent primary/secondary pair of servers.
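As a rough, hypothetical illustration of that per-packet path (the real glb-director data plane is written in C on DPDK, and the exact fields it hashes differ), the work amounts to one hash and one array lookup:

```python
import hashlib

TABLE_SIZE = 65536  # 65k rows of (primary_ip, secondary_ip) pairs, roughly 512kb

def table_row(src_ip, src_port, dst_ip, dst_port):
    """Hash the packet's flow fields into a forwarding table row index."""
    key = f"{src_ip}:{src_port}->{dst_ip}:{dst_port}".encode()
    return int.from_bytes(hashlib.sha256(key).digest()[:4], "big") % TABLE_SIZE

def pick_primary_secondary(forwarding_table, src_ip, src_port, dst_ip, dst_port):
    """forwarding_table is a list of (primary_ip, secondary_ip) pairs."""
    return forwarding_table[table_row(src_ip, src_port, dst_ip, dst_port)]
```

A table produced by the rendezvous hashing sketch later in this section can be passed in as forwarding_table.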

We want each server to appear roughly equally often in the primary and secondary columns, and never in both columns of the same row. When we add a new server, we want it to become the primary in some rows, with the old primary demoted to secondary, and to become the secondary in some other rows. When we delete a server, in any row where it was the primary, we want the secondary to become the primary and another server to become the secondary.

This sounds complicated, but it can be summarized succinctly with a few invariants:

  • When we change the set of servers, the relative order of the existing servers should be maintained.
  • The ordering of servers should be computable from the server list alone (plus, possibly, some predefined seeds), with no other state.
  • Each server should appear at most once in each row.
  • Each server should appear approximately the same number of times in each column.

For these requirements, rendezvous hashing is an ideal choice, because it satisfies the invariants nicely. Each server (in our case, its IP) is hashed together with the row number, the servers are sorted by that hash (which is just a number), and we get a unique ordering of servers for the given row. We take the first two as the primary and secondary respectively.

The relative order is maintained because each server's hash for a row is the same regardless of which other servers are in the list. The only information needed to generate the table is the servers' IPs. Since we are just sorting a set of servers, each server appears only once per row. Finally, if we use a good pseudo-random hash function, the ordering will be pseudo-random, and the distribution will be as uniform as we expect.
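Here is a minimal Python sketch of generating such a table with rendezvous hashing; glb-director's actual generator differs in its hash function, seeding, and binary output format:

```python
import hashlib

TABLE_SIZE = 65536

def _score(server_ip, row):
    """Hash each server together with the row number; higher score wins."""
    key = f"{server_ip}|{row}".encode()
    return int.from_bytes(hashlib.sha256(key).digest()[:8], "big")

def build_forwarding_table(server_ips):
    """Return a list of (primary, secondary) pairs, one per row.

    For every row, servers are ordered by their per-row hash (rendezvous
    hashing); the first two become primary and secondary. Adding or removing
    a server never changes the relative order of the remaining servers.
    """
    table = []
    for row in range(TABLE_SIZE):
        ranked = sorted(server_ips, key=lambda ip: _score(ip, row), reverse=True)
        table.append((ranked[0], ranked[1]))
    return table

# Adding a server only inserts it into existing per-row orderings, so any row
# it doesn't "win" keeps its previous primary.
old = build_forwarding_table(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
new = build_forwarding_table(["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"])
changed = sum(1 for a, b in zip(old, new) if a[0] != b[0])
print(f"rows whose primary changed: {changed}/{TABLE_SIZE}")  # roughly 1/4
```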

Proxy-related operations

Adding or removing a proxy server requires some special handling, because a forwarding table entry only defines a primary and a secondary, so draining and failover only work for a single proxy host at a time. We define the following valid states and state transitions for a proxy server:
[Figure: valid proxy server states and state transitions]
A proxy server is included in forwarding table entries when it is in the active, draining, or filling state. In the steady state, all proxy servers are active, and the rendezvous hashing described above gives each of them a roughly uniform, random distribution across the primary and secondary columns.

When a proxy server transitions to draining, we adjust the forwarding table by swapping the primary and secondary entries in every row that originally contained it as the primary.

This has the effect of sending packets to the server that was previously the secondary. Since it now receives packets first, it accepts SYN packets, and therefore all new connections. For any packet it does not recognize as belonging to a local flow, it forwards the packet to the other server (the previous primary), which lets existing connections complete.
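Continuing the table sketch above, draining amounts to swapping the entries in rows where the draining server is the primary (the real state machine in glb-director covers more cases):

```python
def drain(table, server_ip):
    """Swap primary and secondary in every row where the draining server is primary.

    New SYNs then land on the old secondary first, while the draining server
    still gets its second chance at packets for connections it already holds.
    """
    return [
        (secondary, primary) if primary == server_ip else (primary, secondary)
        for primary, secondary in table
    ]

table = [("10.0.0.1", "10.0.0.2"), ("10.0.0.2", "10.0.0.3"), ("10.0.0.3", "10.0.0.1")]
print(drain(table, "10.0.0.2"))
# [('10.0.0.1', '10.0.0.2'), ('10.0.0.3', '10.0.0.2'), ('10.0.0.3', '10.0.0.1')]
```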

This allows the server in question to gracefully drain its connections, after which it can be removed entirely, and proxy servers can be filled back into the vacated secondary slots.

Filling servers look just like active ones, because the forwarding table itself provides the second chance.

This implementation requires that only one proxy server be in a state other than active at any time, which in practice works well at GitHub. Proxy server state changes can proceed as quickly as the longest connection duration that needs to be maintained. We are working on extensions to the design that support arbitrary lists of servers rather than just a primary and secondary, and some of the components listed below already include initial support for this.

Encapsulation within the datacenter

We now have an algorithm for consistently selecting backend proxy servers, but how do we encapsulate the secondary server's address inside the packet, so that the primary server can forward packets it does not recognize without having to understand them?

The traditional approach in LVS is an IP-over-IP (IPIP) tunnel: the client's IP packet is encapsulated inside another IP packet and forwarded to the proxy server, which decapsulates it. It is hard to encode metadata about other servers in an IPIP packet, though, because the only available space is the IP options field, and datacenter routers punt packets with unknown IP options to software (the "layer 2 slow path"), dropping throughput from millions to mere thousands of packets per second.

To avoid this, we need to hide the data in a packet format that the routers do not try to interpret. We initially used raw Foo-over-UDP (FOU) with a custom GRE payload, essentially encapsulating everything inside a UDP packet. We recently switched to Generic UDP Encapsulation (GUE), which provides a standard way to encapsulate an inner IP protocol inside a UDP packet. We place the secondary server's IP in the private data of the GUE header. From the routers' perspective, these packets are just ordinary UDP packets between two normal servers inside the datacenter.
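As a purely illustrative sketch (the on-the-wire GUE header used by glb-director is defined in that repository and differs in its exact fields), the idea is a small header inside the UDP payload whose private data carries the second-chance server's address:

```python
import socket
import struct

def gue_style_header(secondary_ip, inner_proto=4):
    """Build an illustrative GUE-style header carrying a second-chance hop IP.

    First byte: 2-bit version, 1-bit control flag, 5-bit header length in
    32-bit words; then the inner protocol and flags; then private data.
    This is not the exact layout glb-director uses.
    """
    hlen_words = 2                                   # 8 bytes of private data
    first_byte = (0 << 6) | (0 << 5) | hlen_words
    base = struct.pack("!BBH", first_byte, inner_proto, 0)
    private = struct.pack("!4s4x", socket.inet_aton(secondary_ip))
    return base + private

# The outer UDP source port can carry a per-flow hash (see below), so
# datacenter ECMP and the proxy NIC's RSS spread flows across paths/queues.
print(gue_style_header("10.0.0.2").hex())
```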
Another benefit of using UDP is that the source port can be filled with a per-connection hash, so that flows take different paths within the datacenter (where ECMP is in use) and arrive on different RX queues on the proxy server's NIC (which similarly hashes the TCP/IP header fields). This is not possible with IPIP, because most datacenter NICs only understand plain IP, TCP/IP, and UDP/IP; notably, they cannot look inside IP-in-IP packets.

When the proxy server wants to send a packet back to the client, it does not need to encapsulate it or route it back through our director tier; it can send the data directly to the client (commonly known as "Direct Server Return"). This is a typical load balancer design and is especially useful for content providers, where outbound traffic is usually much larger than inbound traffic.

The data packet flow is shown in the figure below:
[Figure: GLB packet flow through the director and proxy tiers]
Introducing DPDK

Since we first publicly discussed our initial design, we have completely rewritten glb-director using DPDK, an open source project that bypasses the Linux kernel to allow very fast packet processing from user space. This lets us achieve NIC line-rate processing on commodity NICs using ordinary CPUs, and lets the director tier scale to handle as much inbound traffic as our public connectivity requires. This is particularly important when absorbing DDoS attacks, where we do not want the load balancer to become a bottleneck.

One of the original goals of GLB was that it could run on commodity datacenter hardware without any special hardware configuration. Both GLB's directors and proxy servers can be provisioned like ordinary servers in the datacenter. Each server has a pair of bonded network interfaces, and on the GLB director servers these interfaces are shared between DPDK and the Linux system.

Modern NICs support SR-IOV, a technology that makes a single NIC look like multiple NICs from the operating system's perspective. It is typically used by hypervisors, which ask the real NIC (the "Physical Function") to create multiple virtual NICs (the "Virtual Functions"), one per VM. To let DPDK and the Linux kernel share the NIC, we use flow bifurcation, which sends specific traffic (destined for GLB IP addresses) to our DPDK process on a Virtual Function, while leaving the remaining packets with the Linux kernel's network stack on the Physical Function.

We found that the packet processing rate of DPDK on a Virtual Function meets our requirements. GLB Director uses the DPDK Packet Distributor pattern to distribute the work of encapsulating packets across the machine's CPUs; since this work is stateless, it can be highly parallelized across any number of CPU cores.
GLB Director supports matching and forwarding of inbound IPv4 and IPv6 packets containing TCP payloads, as well as inbound ICMP Fragmentation Required messages used as part of Path MTU Discovery.

Adding test cases to DPDK with Scapy

A typical problem when building (or using) technologies that work with low-level primitives (such as talking directly to the NIC) at high speed is that they become very difficult to test. As part of building GLB Director, we also created a test environment that supports simple end-to-end packet flow tests of our DPDK application, by leveraging DPDK's Environment Abstraction Layer (EAL), which lets a physical NIC and a libpcap-based local interface look the same from the application's point of view.

This allows us to write tests in Scapy, a simple Python library for inspecting, manipulating, and writing packets. By creating a pair of Linux virtual network interfaces, with Scapy on one side and DPDK on the other, we can feed in crafted packets and verify what our software emits on the other side, for example the complete GUE-encapsulated packet a backend proxy server would expect to receive.
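A hypothetical test in that spirit might look like the following. The GUE layer definition, port number, and addresses here are assumptions (glb-director ships its own Scapy helpers); this only sketches the crafting-and-asserting pattern:

```python
from scapy.all import (IP, TCP, UDP, Packet, BitField, ByteField, ShortField,
                       IPField, IntField, bind_layers, raw)

class SimpleGUE(Packet):
    """Simplified, illustrative GUE-style header with a second-chance hop IP."""
    name = "SimpleGUE"
    fields_desc = [
        BitField("version", 0, 2),
        BitField("control", 0, 1),
        BitField("hlen", 2, 5),           # private data length, in 32-bit words
        ByteField("proto", 4),            # inner protocol: IPv4
        ShortField("flags", 0),
        IPField("secondary", "0.0.0.0"),  # private data: second-chance server
        IntField("pad", 0),
    ]

bind_layers(UDP, SimpleGUE, dport=19523)  # illustrative GUE port
bind_layers(SimpleGUE, IP)

# Packet a client would send to the load-balanced IP.
client_pkt = IP(src="203.0.113.7", dst="192.0.2.10") / TCP(sport=54321, dport=443, flags="S")

# Packet we would expect the director to emit towards the primary proxy.
expected = (
    IP(src="10.10.0.1", dst="10.0.0.1")
    / UDP(sport=0x1234, dport=19523)
    / SimpleGUE(secondary="10.0.0.2")
    / client_pkt
)

# In a real test, the bytes read back from the DPDK side would be compared
# against raw(expected); here we just dissect our own bytes and assert.
decoded = IP(raw(expected))
assert decoded[SimpleGUE].secondary == "10.0.0.2"
assert decoded[SimpleGUE][TCP].dport == 443
```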

This approach lets us test more complex behaviors, such as walking into the payload of ICMPv4/ICMPv6 messages to retrieve the original source IP and TCP port for correct routing, so that ICMP messages originating from external routers are forwarded correctly.
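For instance, an ICMP "fragmentation needed" message quotes the original IP and TCP headers, so the original flow tuple can be recovered with Scapy along these lines (the addresses are illustrative):

```python
from scapy.all import IP, ICMP, IPerror, TCPerror

# An ICMP error from a router on the path, quoting the client's original packet.
icmp_pkt = (
    IP(src="198.51.100.1", dst="192.0.2.10")
    / ICMP(type=3, code=4)                       # destination unreachable / frag needed
    / IPerror(src="203.0.113.7", dst="192.0.2.10")
    / TCPerror(sport=54321, dport=443)
)

inner = icmp_pkt[IPerror]
flow = (inner.src, icmp_pkt[TCPerror].sport, inner.dst, icmp_pkt[TCPerror].dport)
print(flow)  # the tuple a director would hash so the ICMP reaches the same proxy pair
```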

Health checking

The GLB design includes graceful handling of server failures. Since the current design provides a primary and a secondary for a given forwarding table row (client), we can work around a single server failure by having every director run health checks. We run a service called glb-healthcheck that continuously verifies each backend server's GUE tunnel and an arbitrary HTTP port.
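The following is not the actual glb-healthcheck implementation, just a sketch of the idea of periodically probing each backend's HTTP port (the endpoint path and addresses are assumptions):

```python
import http.client
import time

def http_healthy(ip, port, path="/health", timeout=2.0):
    """Return True if the backend answers the health endpoint with a non-error status."""
    try:
        conn = http.client.HTTPConnection(ip, port, timeout=timeout)
        conn.request("GET", path)
        status = conn.getresponse().status
        conn.close()
        return status < 400
    except (OSError, http.client.HTTPException):
        return False

backends = {"10.0.0.1": 80, "10.0.0.2": 80}  # illustrative addresses

for _ in range(3):
    state = {ip: http_healthy(ip, port) for ip, port in backends.items()}
    # A real health checker would also probe the GUE tunnel and only adjust the
    # forwarding table (swap primary/secondary) on a confirmed failure.
    print(state)
    time.sleep(5)
```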

When a server fails its health check, we swap the primary and the secondary, so the secondary takes over as primary. This is a "soft drain" of the server and a safe way to handle failover: if the health check failure was a false positive, connections are not interrupted, they simply traverse a slightly different path.

Proxies use iptables to provide the second chance

The final component of GLB is a Netfilter module providing an iptables target, which runs on every proxy server and implements the "second chance" in the design.

This module performs a simple task: it decides whether the inner TCP/IP packet of each GUE packet is valid locally according to the Linux kernel's TCP stack and, if not, forwards it to the next proxy server (the secondary) rather than decapsulating it on the current server.

If the packet is a SYN (a new connection) or is locally valid for an established connection, the current server accepts it. We then receive and decapsulate the GUE packet locally, using the GUE support in the Linux 4.x kernel's fou module.
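The real second-chance logic is a Netfilter module written in C; this Python sketch only captures the per-packet decision it makes:

```python
def second_chance_decision(inner_tcp_flags, has_local_flow_state, secondary_ip):
    """Decide what a proxy should do with the inner packet of a GUE datagram."""
    is_syn = "S" in inner_tcp_flags and "A" not in inner_tcp_flags
    if is_syn or has_local_flow_state:
        return "accept: decapsulate locally and hand to the kernel TCP stack"
    # Otherwise re-send the still-encapsulated packet to the secondary server
    # named in the GUE header's private data.
    return f"forward GUE packet to {secondary_ip}"

print(second_chance_decision("S", False, "10.0.0.2"))  # new connection -> accept
print(second_chance_decision("A", False, "10.0.0.2"))  # unknown flow   -> forward
```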

Open source

When we set out to build a better datacenter load balancer, we decided to open source it so that others could benefit from our work. We are thrilled to open source all of these components at github/glb-director, and we hope others can use them as a general-purpose load balancing solution running on commodity hardware in physical datacenter environments.

Open source project address:
https://github.com/github/glb-director

Original article (English):
https://githubengineering.com/glb-director-open-source-load-balancer/


This article was written by Theo Julienne and translated by Fang Yuan, Wang Yuanming, Lin Tongchuan, and Deng Qiming. Please credit the source when reposting. More friends are welcome to join the translation and contribution team; for details, see "Contact Us" in the official account menu.
