1. Background
The search and recommendation algorithm architecture provides services for all of JD.com's search and recommendation businesses and returns processing results to upstream callers in real time. Each subsystem in the department has implemented CPU-based adaptive rate limiting, but calls from client to server still use round-robin (RR) polling, which ignores performance differences among downstream machines. As a result, the cluster's overall CPU cannot be fully utilized, and a server-side CPU imbalance problem exists.
The load balancing method developed by JD's advertising department for its own business scenarios is a valuable reference. Their RALB (Remote Aware Load Balance) algorithm improves the CPU resource efficiency of downstream service clusters, avoids the CPU short-board effect, and lets better-performing machines handle more traffic. We applied its core ideas to our system and achieved good results.
The structure of this article is as follows:
1. Introduction to RALB
◦A brief introduction to the principle of the algorithm.
2. Functional verification
◦Apply the RALB load balancing technique to the search and recommendation architecture for functional verification.
3. Throughput test
◦Mainly compares the RALB and RR load balancing techniques. The tests verified that there is no significant throughput difference between the two when the cluster is either not rate limited at all or fully rate limited. When RR is partially rate limited, the two differ in throughput, and there is a point of maximum difference. Under RALB, the server side jumps from no rate limiting straight to full rate limiting; partial rate limiting almost never occurs.
4. Boundary testing
◦By simulating various boundary conditions, the system was tested to verify the stability and reliability of RALB.
5. Function launch
◦The RALB load balancing mode is fully enabled on all server-side clusters. Before and after launch, the server-side QPS gradually becomes stratified, while the server-side CPU gradually converges.
2. Introduction to RALB
RALB is a high-performance load balancing algorithm targeting CPU balancing.
2.1 Algorithm goals
1. Balance CPU usage across server-side nodes so that each node's CPU is relatively even, avoiding cluster rate limiting triggered by excessive CPU usage
2. QPS has a roughly linear relationship with CPU usage, so adjusting QPS can achieve the goal of balancing CPU usage
2.2 Algorithm principle
2.2.1 Algorithm steps
1. When distributing traffic, distribute it according to weight (weighted random algorithm, wr)
2. Collect CPU usage: the server feeds its CPU usage (1s average) back to the client via RPC
3. Weight adjustment: periodically (every 3s) adjust weights according to the CPU usage (windowed average) of the cluster and of each node, to balance the CPU across nodes
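Step 1 above, weighted random (wr) traffic distribution, can be sketched as follows. The function and node list are illustrative only, not the production implementation:

```python
import random

def pick_node(nodes, weights):
    """Weighted random (wr) selection: a node's chance of being chosen
    is proportional to its current weight."""
    return random.choices(nodes, weights=weights, k=1)[0]

# Illustrative: with a base weight of 10000, a node adjusted down to 5000
# should receive roughly half as much traffic as each of the others.
nodes = ["11.17.80.238", "11.18.159.191", "11.17.191.137"]
weights = [5000, 10000, 10000]
counts = {n: 0 for n in nodes}
for _ in range(100_000):
    counts[pick_node(nodes, weights)] += 1
```

Because selection is random rather than strictly proportional, shares only approach the weight ratios over many requests, which is acceptable here since the weights themselves are re-tuned every few seconds anyway.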
2.2.2 Indicator dependencies
No. | Indicator | Purpose | Source |
---|---|---|---|
1 | IP | Available IP list | Maintained by the service registration/discovery and fault-masking modules |
2 | Real-time health | IP availability changes in real time, providing the algorithm's boundary conditions | Maintained by the RPC framework's health-check function |
3 | Historical health | History of the health value, used to judge boundary conditions such as IP failure and recovery | Historical values of indicator 2 |
4 | Dynamic target (CPU usage) | Provides the most direct target basis for the balancing algorithm | Periodic server-side statistics, returned by the RPC framework via RPC |
5 | Weight | The basis for real-time load distribution | Updated by the algorithm |
2.2.3 Weight adjustment algorithm
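The article does not reproduce the actual adjustment formula, so the following is only a hedged sketch of one plausible shape: each node's weight is scaled by the ratio of the cluster-average CPU to the node's CPU, then clamped by the "maximum weight adjustment ratio" relative to the fixed base weight of 10000 (both parameters appear later in the text). The real RALB formula may differ.

```python
BASE_WEIGHT = 10_000     # fixed base weight (per the text)
MAX_ADJUST_RATIO = 1.5   # "maximum weight adjustment ratio" (tuned 1.5 -> 2.0 -> 2.5 in the tests)

def adjust_weights(cpu_by_node):
    """Illustrative (assumed) update rule: scale each node's weight by
    cluster-average CPU / node CPU, clamped to
    [BASE_WEIGHT / MAX_ADJUST_RATIO, BASE_WEIGHT * MAX_ADJUST_RATIO]."""
    avg = sum(cpu_by_node.values()) / len(cpu_by_node)
    new_weights = {}
    for node, cpu in cpu_by_node.items():
        if cpu <= 0:
            # CPU == 0 means "unknown", not "idle" -- do not reward it
            # (simplification here: fall back to the base weight)
            new_weights[node] = BASE_WEIGHT
            continue
        w = BASE_WEIGHT * (avg / cpu)  # cooler node -> more weight
        lo, hi = BASE_WEIGHT / MAX_ADJUST_RATIO, BASE_WEIGHT * MAX_ADJUST_RATIO
        new_weights[node] = max(lo, min(hi, w))
    return new_weights
```

For example, with node CPUs of 60%, 40%, and 50%, the 40% node ends up with weight 12500 and the 60% node with about 8333, nudging traffic toward the cooler machine on the next 3s cycle.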
2.2.4 Boundary processing
Boundary 1: Within a feedback window (3s), if a downstream IP receives no requests, its average CPU value is 0; the weight adjustment algorithm would then treat the node as performing excellently and raise its weight
Boundary 2: During a network failure, the RPC framework marks the failed node as unavailable, and its CPU and weight are 0; after the network recovers, the RPC framework marks the IP as available again, but a node with weight 0 cannot receive traffic, so it would remain effectively unavailable
Handling: Weight updates are triggered by a timer, and each node's availability status is recorded. When a node recovers from unavailable to available, it is given a low weight so that it ramps back up gradually
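The recovery handling above can be sketched as a small state machine. The class and hook name are hypothetical; the real RPC framework's callbacks differ:

```python
RECOVERY_WEIGHT = 100  # small non-zero weight: enough to be probed, a tiny share of traffic

class NodeState:
    """Illustrative per-node state kept by the client (hypothetical structure)."""

    def __init__(self):
        self.available = False
        self.weight = 0

    def on_health_change(self, available: bool) -> None:
        """Called when the health check flips a node's availability."""
        was_available = self.available
        self.available = available
        if not available:
            # Network failure: the node is marked down and gets no traffic.
            self.weight = 0
        elif not was_available:
            # Recovered from unavailable: start with a low weight so the node
            # receives a trickle of traffic; the periodic (timer-driven)
            # weight adjustment then ramps it up, instead of leaving it
            # stuck at weight 0 forever.
            self.weight = RECOVERY_WEIGHT
```

The key design choice is that recovery assigns a small but non-zero weight: zero would starve the node of the very RPC traffic needed to measure its CPU and raise its weight again.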
2.3 Keys to production rollout
Be fast and steady: avoid stalemates and avalanches in every situation, and handle boundary conditions with particular care.
Algorithm points:
1. Each dependent factor in the formula keeps an independent meaning and update mechanism, preserving the reliability and simplicity of the algorithm
◦The IP list is maintained jointly by service registration/discovery and the RPC framework
◦CPU values are updated via RPC
2. Pay attention to the meaning of boundary values, which must be distinguished from ordinary continuous values
◦CPU = 0 means "unknown"; it does not mean the CPU is performing well
◦w = 0 means no traffic is allocated, and should occur only when the node is unavailable; an available node should keep at least a small weight so that RPC can still be triggered and the weight can then be updated
3. Weight updates should not depend on RPC triggers; they should run on a regular timer
3. Function verification
3.1 Pressure test preparation
Module | IP | CPU cores |
---|---|---|
Client | 10.173.102.36 | 8 |
Server | 11.17.80.238 | 8 |
Server | 11.18.159.191 | 8 |
Server | 11.17.191.137 | 8 |
3.2 Pressure test data
Indicator | RR load balancing | RALB load balancing |
---|---|---|
QPS | Server-side QPS is balanced | As the figure shows, server-side QPS is stratified |
CPU | CPU usage is fairly even, staying around 10%, but the gap between nodes is larger than under RALB | Server-side CPU usage is even, staying around 10% |
TP99 | Latency is stable, with some variance | Latency is stable, with slight differences, and lower than under RR |
Because the performance differences between the machines are small, the CPU effect of this test is not obvious. To make it more pronounced, an initial load was applied to node 11.17.80.238 (i.e., with no traffic, its CPU usage is 12.5%).
Indicator | LA load balancing | RR load balancing | RALB load balancing |
---|---|---|---|
QPS | QPS is extremely uneven: traffic is highly skewed and concentrates on a single node | QPS is uniform | QPS is clearly stratified. The QPS values change because the "maximum weight adjustment ratio" was adjusted twice (1.5 → 2.0 → 2.5): 11.17.80.238: 125 → 96 → 79; 11.18.159.191: 238 → 252 → 262; 11.17.191.137: 239 → 254 → 263 |
CPU | CPU is not LA's balancing target, so it follows the QPS trend and concentrates on a single node | CPU is clearly stratified: the CPU of 11.17.80.238 is obviously higher than that of the other nodes | 1. At the start of the test, the CPU of 11.17.80.238 is higher than the other two nodes, because the "maximum weight adjustment ratio" of 1.5 (relative to the fixed base weight of 10000) hits its adjustment limit. 2. After the ratio is raised to 2.0, the gap between nodes shrinks. 3. After the ratio is raised to 2.5, the gap shrinks further |
TP99 | The latency of nodes that receive traffic is stable. Nodes that receive very little traffic (almost none) appear to fluctuate greatly in latency, but LA's effect on latency should be considered stable, since the great majority of requests are handled with fairly balanced latency | Latency is stable, with slight variance | Latency is stable, with slight differences, and lower than under RR |
3.3 Pressure test conclusion
The pressure test shows that both RR and LA suffer from CPU imbalance: differences in machine performance create a short-board effect, so resources are not fully utilized.
RALB takes CPU as its balancing target, adjusting the QPS each node handles in real time according to that node's CPU, thereby achieving CPU balance. The function is verified as working, and the CPU behavior meets expectations.
4. Throughput test
4.1 Pressure test objective
RALB is a load balancing algorithm that uses CPU usage as its dynamic indicator. It solves CPU imbalance, avoids the CPU short-board effect, and lets better-performing machines handle more traffic. We therefore expect the RALB strategy to deliver some throughput improvement over RR polling.
4.2 Pressure test preparation
100 server-side machines were used for the test. The server side uses pure CPU adaptive rate limiting, with the threshold set to 55%.
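As a minimal illustration of "pure CPU adaptive rate limiting" with a 55% threshold, assuming the simplest possible policy of rejecting requests whenever windowed CPU usage exceeds the threshold (the production limiter is certainly more elaborate):

```python
CPU_LIMIT = 0.55  # rate-limiting threshold used in this test

def should_reject(windowed_cpu_usage: float) -> bool:
    """Shed load whenever the node's windowed CPU usage exceeds the threshold."""
    return windowed_cpu_usage > CPU_LIMIT

# A node at 54% CPU still accepts traffic; a node at 56% sheds it.
assert not should_reject(0.54)
assert should_reject(0.56)
```

This threshold behavior is what produces the three regimes measured below: no limiting, partial limiting (only some nodes above 55%), and full limiting (all nodes above 55%).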
4.3 Pressure test data
Pressure tests were run under both the RALB and RR load balancing modes to observe how server-side throughput changes with traffic, comparing the impact of the two strategies on cluster throughput.
4.3.1 RALB
4.3.1.1 Throughput data
The table below shows server-side throughput data, with traffic sent from the client and the load balancing mode set to RALB. At 18:17, the server side was just on the verge of rate limiting. The whole test covered three situations: no rate limiting, partial rate limiting, and full rate limiting.
Time | 17:40 | 17:45 | 17:52 | 18:17 | 18:22 |
---|---|---|---|---|---|
Total traffic | 2270 | 1715 | 1152 | 1096 | 973 |
Handled traffic | 982 | 1010 | 1049 | 1061 | 973 |
Rate-limited traffic | 1288 | 705 | 103 | 35 | 0 |
Rate-limiting ratio | 56.74% | 41% | 8.9% | 3.2% | 0% |
Average CPU usage | 55% | 55% | 54% | 54% | 49% |
4.3.1.2 Indicator Monitoring
The traffic received by the server-side machines is distributed according to performance, and CPU remains balanced.
QPS | CPU |
---|---|
4.3.2 RR
4.3.2.1 Throughput data
The table below shows server-side throughput data, with traffic sent from the client and the load balancing mode set to RR. The total server-side traffic at 18:46 is close to that at 18:17 under RALB; the comparison below focuses on these two critical moments.
Time | 18:40 | 18:46 | 19:57 | 20:02 | 20:04 | 20:09 |
---|---|---|---|---|---|---|
Total traffic | 967 | 1082 | 1149 | 1172 | 1263 | 1314 |
Handled traffic | 927 | 991 | 1024 | 1036 | 1048 | 1047 |
Rate-limited traffic | 40 | 91 | 125 | 136 | 216 | 267 |
Rate-limiting ratio | 4.18% | 8.4% | 10.92% | 11.6% | 17.1% | 20.32% |
Average CPU usage | 45% (partial limiting) | 51% (partial limiting) | 53% (partial limiting) | 54% (near full limiting) | 55% (full limiting) | 55% (full limiting) |
4.3.2.2 Indicator monitoring
The traffic received by the server side is balanced, but CPU usage differs across nodes.
QPS | CPU |
---|---|
4.4 Pressure test analysis
4.4.1 Throughput curves
Using the pressure test data from Section 4.3, we plot the server-side throughput curves and compare the throughput trends under the RALB and RR load balancing modes.
```python
import matplotlib.pyplot as plt

# Unit on both axes: 100 QPS (e.g. 10.61 = 1061 QPS from the tables above).
# RALB: no partial-limiting stage -- the curve turns sharply from
# "no rate limiting" straight into "full rate limiting".
ralb_total   = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 9.73, 10.958, 11.52, 17.15, 22.7]
ralb_handled = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 9.73, 10.61, 10.49, 10.10, 9.82]
# RR: a gradual partial-limiting transition stage exists.
rr_total   = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9.674, 10.823, 11.496, 11.723, 12.639, 13.141, 17.15, 22.7]
rr_handled = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9.27, 9.91, 10.24, 10.36, 10.48, 10.47, 10.10, 9.82]

plt.plot(ralb_total, ralb_handled, 'r-o', label='RALB')
plt.plot(rr_total, rr_handled, 'g-o', label='RR')
plt.xlabel('Total traffic (100 QPS)')
plt.ylabel('Handled traffic (100 QPS)')
plt.legend()
plt.show()
```
4.4.2 Curve analysis
Load balancing strategy | RALB | RR |
---|---|---|
Stage 1: no machine rate limited | Received QPS = handled QPS; the curve is the line y = x | Received QPS = handled QPS; the curve is the line y = x |
Stage 2: some machines rate limited | This stage essentially does not exist. RALB distributes traffic according to downstream CPU, and downstream nodes rate-limit according to CPU, so in theory all downstream CPUs stay consistent and all machines hit the limit at the same moment. In the figure, the curve therefore turns sharply from "no limiting" to "full limiting", with no smooth transition stage | Under RR, every downstream machine is assigned the same QPS, but since the machines rate-limit according to CPU, they start limiting at different moments. Compared with RALB, RR begins limiting earlier, and before full limiting is reached, RR's throughput is consistently lower than RALB's |
Stage 3: all machines rate limited | After every machine reaches the 55% limiting threshold, handled QPS should in theory stay constant no matter how much traffic increases. The figure shows a slight drop in handled QPS, because handling rate-limited requests also consumes some CPU | RR reaches full limiting later than RALB. After full limiting, the handled QPS of the two modes is the same |
4.5 Pressure test conclusion
Critical point: the point of maximum throughput difference, i.e., the turning point in RALB mode between no rate limiting and full rate limiting.
The analysis above shows that the throughput difference between RR and RALB is largest at this critical point.
At this point, RALB mode is calculated to improve the server cluster's throughput by 7.06%.
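The 7.06% figure can be checked directly against the two throughput tables: at the critical point, RALB handles 1061 QPS (18:17, total 1096), while RR at nearly the same total traffic handles 991 QPS (18:46, total 1082):

```python
ralb_handled = 1061  # RALB handled QPS at 18:17
rr_handled = 991     # RR handled QPS at 18:46

improvement = (ralb_handled - rr_handled) / rr_handled
print(f"{improvement:.2%}")  # → 7.06%
```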
5. Boundary testing
By simulating various boundary conditions, the system was tested to verify its stability under those conditions.
Boundary condition | Test scenario | Test conclusion |
---|---|---|
Downstream node rate limited | CPU rate limiting | Tuning the penalty factor has an important influence on traffic distribution |
Downstream node rate limited | QPS rate limiting | Meets expectations |
Downstream node timeout | The server times out every request (fixed 1s sleep) | During sustained timeouts, essentially no traffic is allocated to the node |
Downstream node abnormal exit | The server process is killed directly with kill -9 pid | After the process is killed and automatically restarted, traffic distribution recovers quickly |
Downstream nodes added/removed | The server is manually taken offline/online in JSF | The node receives no traffic while offline in JSF |
Downstream nodes added/removed | The server is restarted (stop + start) | The server process deregisters and re-registers normally; traffic distribution meets expectations |
6. Function launch
The client-side configuration was launched in the Suqian data center, fully enabling the RALB load balancing mode on all server-side clusters. Before and after launch, the server-side QPS gradually becomes stratified, while the server-side CPU gradually converges.
Server-side QPS distribution before and after launch | Server-side CPU distribution before and after launch |
---|---|
Author: Hu Peidong, JD Retail
Source: JD Cloud developer community