Analysis of common load balancing strategies

Background

In general, the QPS a single machine can withstand in a production environment is on the order of 20,000; beyond that, the server risks being overwhelmed. Even when a single machine could sustain 20,000 QPS, it is usually not run that hard: production environments typically reserve about 50% capacity headroom so that a traffic spike from a popular event does not blow past the limit. Once QPS exceeds what a single machine can bear, the natural step is to introduce a distributed cluster. Which server, then, handles a given request: is it random, or governed by certain rules? That is the job of the load balancing algorithm.

Load Balancer

A load balancer is a software or hardware device that implements one or more load balancing algorithms. Load balancers are usually divided into two types by protocol layer: those implemented at Layer 4 (the transport layer) and those implemented at Layer 7 (the application layer).
Many dedicated hardware load balancers balance at the TCP layer with high efficiency. Of course, TCP-layer load balancing has its shortcomings, such as the inability to maintain persistent (long) connections to a specific backend. Large companies like BAT therefore generally combine multiple layers of load balancing.
Pure software implementations usually work at the application layer, which is also the most widely deployed approach. Popular implementations include Nginx, HAProxy, and Keepalived; LVS (Linux Virtual Server), which ships with the Linux kernel, is a Layer 4 implementation.

Round Robin

Round robin is a simple strategy that distributes requests to the backend servers in turn. Its advantage is that it is simple to implement and distributes requests evenly.
Its disadvantage is precisely that even distribution: backend servers usually differ in performance, so we would prefer the better-performing servers to bear more load. It is also unsuitable for scenarios that depend on long connections or cache hit rates.
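The strategy can be sketched in a few lines of Python (the backend names are illustrative):

```python
import itertools

# Round robin: hand requests to the backends in a fixed rotating order.
class RoundRobin:
    def __init__(self, backends):
        self._cycle = itertools.cycle(backends)

    def pick(self):
        return next(self._cycle)

rr = RoundRobin(["s1", "s2", "s3"])
rr_picks = [rr.pick() for _ in range(6)]
# Every backend receives exactly one request per full rotation.
```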

Weighted Round Robin

Weighting is essentially a way of expressing priority, and weighted round robin is an improved round robin: plain round robin is just weighted round robin with all weights equal. Each backend server is given a weight that determines the proportion of requests it receives. This algorithm is widely applied and is very well suited to stateless load scenarios.
Its advantage is that it handles servers of differing performance; its disadvantage is that the weights are statically configured and cannot adjust automatically. It is also unsuitable for scenarios that depend on long connections or cache hit rates.
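One common realization is the "smooth" weighted round robin popularized by Nginx; a minimal sketch, with made-up weights:

```python
# Smooth weighted round robin: each backend accumulates its weight every
# round; the backend with the largest accumulated value is picked and has
# the total weight subtracted, so high-weight backends win proportionally
# more often without being chosen in long bursts.
class SmoothWRR:
    def __init__(self, weights):          # weights: {backend: int}
        self.weights = dict(weights)
        self.current = {b: 0 for b in weights}

    def pick(self):
        total = sum(self.weights.values())
        for b, w in self.weights.items():
            self.current[b] += w
        best = max(self.current, key=self.current.get)
        self.current[best] -= total
        return best

wrr = SmoothWRR({"s1": 5, "s2": 1, "s3": 1})
wrr_picks = [wrr.pick() for _ in range(7)]  # s1 gets 5 of every 7 requests
```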

Random

Randomly distribute requests to the backend servers. How evenly the requests spread depends on the random algorithm. Because it is so simple to implement, it is often used to handle extreme cases: for example, a hotspot request can be randomly scattered across any backend to disperse the hotspot. Its shortcomings, of course, are self-evident.
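A minimal sketch using Python's standard random module (backend names are illustrative):

```python
import random

backends = ["s1", "s2", "s3"]

def pick_random(rng=random):
    # Uniform choice; how evenly requests spread depends entirely on the RNG.
    return rng.choice(backends)

# Note there is no affinity: consecutive identical requests can land on
# different backends, which is exactly what disperses a hotspot.
```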

Hash

Everyone is familiar with hashing, and it is the most widely used approach here. Compute a hash (or MD5) of the source IP, destination IP, URL, or some other key, then take it modulo the number of servers. For example, with N servers S1, S2, S3 ... Sn:

hash value % N

Obviously, the same request always maps to the same backend. This is great for maintaining long connections and improving hit rates.
But it also has some inherent disadvantages. Suppose a request maps to S3 via hashing, and S3 goes down: the request has to be hashed again, with the dead backend excluded when the route is recalculated:

hash value % (N - 1)

This changes the routing of almost all requests, so the hit rate drops sharply. Production environments generally mitigate this by providing a standby machine for S3, but failing over between primary and standby takes time, and data synchronization between them also lags, so the trade-off must be weighed against the business scenario.
Scaling out has a similar problem. The routing formula becomes:
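The scale of that churn is easy to measure. The sketch below (the key names are made up) hashes 10,000 keys with MD5 and compares routing modulo N = 10 backends against N - 1 = 9:

```python
import hashlib

def route(key, n):
    # Hash the request key (source IP, URL, ...) and take it modulo the
    # number of backends.
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return h % n

keys = [f"user-{i}" for i in range(10_000)]
moved = sum(route(k, 10) != route(k, 9) for k in keys)
# Roughly 90% of keys change backend: for a random hash, the chance that
# h % 10 == h % 9 is only about 1 in 10.
```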

hash value % (N + 1)

To solve this, production environments may expand by doubling capacity, N -> 2N: each key's new slot is then either its old slot i or slot i + N, so if server i + N mirrors server i, routing stays consistent with the original. Of course, this inevitably wastes machine resources, so judge for yourself.
For hotspot requests, this hash algorithm may also produce an avalanche effect, depending on which key is hashed (URL-based, IP-based, and so on). In short, you cannot route all hotspot requests to a single machine; otherwise that machine will be overwhelmed, and the backends will be knocked out one by one, which is the avalanche effect.

Least Connections (LC)

Least Connections allocates each request to the backend server with the fewest active connections, using connection count as an estimate of server load. It is smarter than round robin, but it requires maintaining a table of connections per backend server.
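A minimal sketch, given such a table of active connection counts (the counts here are illustrative):

```python
def least_connections(active):
    # active: {backend: number of in-flight connections}; pick the backend
    # currently handling the fewest.
    return min(active, key=active.get)

lc_choice = least_connections({"s1": 12, "s2": 3, "s3": 7})  # -> "s2"
```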

Weighted Least Connections (WLC)

Weighted Least Connections improves on LC when backend server performance varies greatly: servers with higher weights bear more of the connection load.
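A common formulation, sketched below, picks the backend with the smallest connections-to-weight ratio (the numbers are illustrative):

```python
def weighted_least_connections(active, weights):
    # A backend with twice the weight is allowed twice the connections
    # before it looks equally loaded.
    return min(active, key=lambda b: active[b] / weights[b])

# s1 holds more connections but carries 5x the weight, so it is still
# preferred (ratio 10/5 = 2 vs 4/1 = 4):
wlc_choice = weighted_least_connections(
    {"s1": 10, "s2": 4}, {"s1": 5, "s2": 1}
)
```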

Least Response Time (LRT)

Least Response Time allocates requests to the backend server with the shortest average response time, which can be measured from ping probes or from the response times of normal requests.
RT (response time) is a very important indicator of server load: a server that responds very slowly is generally under heavy load, and the QPS sent to it should be lowered.
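A sketch that smooths the observed response times with an exponentially weighted moving average (the smoothing factor and backend names are assumptions):

```python
class LeastResponseTime:
    def __init__(self, backends, alpha=0.3):
        self.alpha = alpha
        # Averages start at 0.0, so never-tried backends get probed first.
        self.avg_rt = {b: 0.0 for b in backends}

    def record(self, backend, rt_ms):
        # Exponentially weighted moving average of observed response times.
        prev = self.avg_rt[backend]
        self.avg_rt[backend] = self.alpha * rt_ms + (1 - self.alpha) * prev

    def pick(self):
        # Route to the backend with the shortest smoothed response time.
        return min(self.avg_rt, key=self.avg_rt.get)
```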

Some argue for using CPU usage as the load balancing indicator, but that misses what CPU usage actually means. In theory, higher CPU usage is better, since it means the service is making full use of its CPU resources. High CPU usage caused by a badly designed program is a program design problem, not a contradiction of this principle.

Consistent Hash

Introduction

Consistent hashing is a distributed hash table (DHT) algorithm proposed at the Massachusetts Institute of Technology in 1997, designed to solve hot spot problems on the Internet.

Principle

Imagine an abstract hash ring: positions are 32-bit integers, giving 2^32 bucket slots, with the two ends joined to form a ring. Backend nodes s0, s1 ... sn are hashed onto different slots of this ring. (The following figures are shamelessly borrowed.)
(figure: hash ring)
All backend nodes are mapped onto the ring via the hash, as shown below:
(figure: backend node mapping)
Each actual request (Job) is mapped onto the hash ring in the same way:
(figure: request mapping)
Then, following the clockwise rule, each Job walks along the ring to the nearest node. As shown in the figure, request Job_1 is assigned to Node_1 according to the rule, and requests Job_k and Job_k+1 are assigned to Node_n.

Advantages

  1. Node downtime
    (figure: Node_1 is down)
    Suppose backend node Node_1 goes down. By the clockwise rule, request Job_1 is reassigned to Node_k, while Job_k, Job_k+1, and Job_i are unaffected. Put bluntly, only the requests that were on the downed node are affected.
  2. Capacity expansion
    (figure: Node_i added)
    By the rule, only request Job_k is reassigned to the new Node_i; the other requests Job_1, Job_k+1, and Job_i are unaffected. Put bluntly, only a small number of requests are shifted onto the new node.
  3. Virtual nodes for an uneven hash
    An uneven hash function inevitably maps nodes (and requests) unevenly around the ring, so some backend nodes bear more requests than others. If such a node suddenly goes down, all of its requests are transferred to the next node on the ring, whose request volume then surges; it may go down as well, and a similar avalanche effect can occur.
    To keep the distribution uniform, each physical node is assigned a number of virtual nodes, spread as randomly as possible around the hash ring. Requests carried by a virtual node are actually served by its real physical node. When a physical node goes down, the requests on its virtual nodes are transferred to the next nodes on the ring, which generally belong to different physical nodes, avoiding the avalanche effect.
    The fewer the physical nodes, the more virtual nodes each one needs in order to maintain good balance.
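The ring, virtual nodes, and clockwise lookup described above can be sketched with Python's bisect module (the MD5 hash and the vnode count are assumptions):

```python
import bisect
import hashlib

class HashRing:
    def __init__(self, nodes, vnodes=50):
        self.vnodes = vnodes
        self.ring = []                    # sorted list of (position, node)
        for node in nodes:
            for i in range(self.vnodes):
                # Spread this node's virtual nodes around the 2^32 ring.
                pos = self._hash(f"{node}#vn{i}")
                bisect.insort(self.ring, (pos, node))

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16) % (2 ** 32)

    def remove(self, node):
        # Drop all virtual nodes of a downed physical node.
        self.ring = [(p, n) for p, n in self.ring if n != node]

    def pick(self, key):
        # Clockwise rule: first virtual node at or after the key's position,
        # wrapping around at the end of the ring.
        i = bisect.bisect_left(self.ring, (self._hash(key),))
        if i == len(self.ring):
            i = 0
        return self.ring[i][1]
```

Removing a node only remaps the keys that were on it; every other key keeps its backend, which is the whole point of the structure.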
