Application of RALB load balancing algorithm | JD Cloud technical team

1. Background

The search and recommendation algorithm architecture serves all search and recommendation businesses at JD.com and returns processing results to the upstream in real time. Each subsystem of the department has implemented CPU-based adaptive rate limiting, but calls from client to server still use round-robin (RR) polling, which ignores the performance differences among downstream machines. As a result, the cluster's overall CPU cannot be fully utilized and the server-side CPU remains unbalanced.

The load balancing method developed by JD's advertising department for its own business scenarios is a valuable reference. Their RALB (Remote Aware Load Balance) algorithm improves the CPU resource efficiency of downstream service cluster machines, avoids the CPU short-board (bottleneck) effect, and lets better-performing machines handle more traffic. We applied its core ideas to our system and achieved good results.

The structure of this article is as follows:

1. Introduction to RALB

◦ A brief introduction to the principle of the algorithm.

2. Functional verification

◦ Apply RALB load balancing to the search and recommendation architecture for functional verification.

3. Throughput test

◦ Compare the RALB and RR load balancing strategies. Verification shows no significant throughput difference between the two when the cluster is either not rate limited at all or fully rate limited. When RR is partially rate limited, the two differ in throughput, and there is a point of maximum difference. For RALB, this point is the turning point from no rate limiting to full rate limiting on the server side; partial rate limiting almost never occurs under RALB.

4. Boundary testing

◦ Simulate various boundary conditions to verify the stability and reliability of RALB.

5. Function launch

◦ Enable RALB load balancing on all server-side clusters. Before and after launch, server-side QPS gradually becomes stratified while server-side CPU gradually converges.

2. Introduction to RALB

RALB is a high-performance load balancing algorithm targeting CPU balancing.

2.1 Algorithm goals

1. Adjust server-side CPU usage so that the CPUs of the nodes are relatively balanced, avoiding cluster rate limiting triggered by excessive CPU usage

2. QPS has a linear relationship with CPU usage, so adjusting QPS can achieve the goal of balanced CPU usage

2.2 Algorithm principle

2.2.1 Algorithm steps

1. When allocating traffic, distribute it by weight (weighted random, wr)

2. Collect CPU usage: the server feeds its CPU usage (1s average) back to the client through RPC

3. Weight adjustment: periodically (every 3s) adjust weights according to the CPU usage of the cluster and of each node (average within the window), to balance the CPU across nodes
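The per-request allocation step above (weighted random, wr) can be sketched in a few lines. The node IPs are taken from the stress test later in this article; the weight values and the `pick_node` helper are illustrative, not from the RALB source:

```python
import random

# Illustrative per-node weights (the stress-test notes mention a fixed
# base weight of 10000); a higher weight means more traffic.
weights = {
    "11.17.80.238": 8000,
    "11.18.159.191": 10000,
    "11.17.191.137": 10000,
}

def pick_node(weights):
    """Weighted random selection (the 'wr' step): each node is chosen
    with probability proportional to its current weight."""
    nodes = list(weights)
    return random.choices(nodes, weights=[weights[n] for n in nodes], k=1)[0]

# Over many requests, traffic splits roughly 8000:10000:10000.
counts = {n: 0 for n in weights}
for _ in range(10000):
    counts[pick_node(weights)] += 1
```

In the real system the weights are not static: step 3 rewrites them every 3s from the CPU feedback of step 2.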

2.2.2 Indicator dependencies

| No. | Indicator | Role | Source |
| --- | --- | --- | --- |
| 1 | IP | Available IP list | Maintained by service registration/discovery and the fault-masking module |
| 2 | Real-time health | IP availability that changes in real time, providing the algorithm's boundary conditions | Maintained by the RPC framework's health check |
| 3 | Historical health | Historical health values, used to judge boundary conditions such as IP failure and recovery | History of indicator 2 |
| 4 | Dynamic target (CPU usage) | The most direct target basis for the balancing algorithm | Computed periodically on the server side, returned through RPC |
| 5 | Weight | Basis for real-time load distribution | Updated by the algorithm |

2.2.3 Weight adjustment algorithm
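The original article presents the weight-update formula as an image that is not reproduced here. As a hedged sketch only, the update plausibly scales each node's weight by the ratio of cluster-average CPU to the node's own CPU, clamped by the "maximum weight adjustment ratio" mentioned in the stress tests (1.5 / 2.0 / 2.5, relative to a fixed base of 10000); the function below is an assumption, not JD's actual formula:

```python
BASE_WEIGHT = 10000  # fixed base mentioned in the stress-test notes
MAX_RATIO = 1.5      # "maximum weight adjustment ratio" (1.5/2.0/2.5 in the tests)

def adjust_weight(cluster_avg_cpu, node_cpu):
    """One plausible CPU-proportional update: a node whose CPU is below
    the cluster average gets a weight above base (it can take more traffic),
    and vice versa, clamped to [BASE_WEIGHT / MAX_RATIO, BASE_WEIGHT * MAX_RATIO]."""
    if node_cpu <= 0:
        # CPU = 0 means "unknown", not "idle" (see section 2.3) - keep the base.
        return BASE_WEIGHT
    w = BASE_WEIGHT * cluster_avg_cpu / node_cpu
    return max(BASE_WEIGHT / MAX_RATIO, min(BASE_WEIGHT * MAX_RATIO, w))
```

The clamp is what the stress test below tunes: raising MAX_RATIO from 1.5 to 2.5 lets a loaded node shed more traffic.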

2.2.4 Boundary processing

Boundary 1: Within a feedback window (3s), if a downstream IP receives no requests, its average CPU value is 0, and the weight adjustment algorithm would treat the node as performing excellently and increase its weight.

Boundary 2: When the network fails, the RPC framework marks the failed node as unavailable, and its CPU and weight become 0. After the network recovers, the RPC framework marks the IP as available again, but a node with weight 0 cannot receive traffic, so it would remain effectively unavailable.

Handling: Weight updates are triggered by a timer, and each node's availability is recorded. When a node recovers from unavailable to available, it is given a low weight so that traffic ramps up gradually.
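A minimal sketch of the timer-driven handling just described; `RECOVERY_WEIGHT`, the field names, and the data shapes are assumptions for illustration:

```python
BASE_WEIGHT = 10000
RECOVERY_WEIGHT = 100  # assumed small ramp-up value, not from the source

def on_timer(nodes):
    """Runs every 3s: handles availability transitions before the normal
    CPU-based weight adjustment."""
    for node in nodes:
        if not node["available"]:
            node["weight"] = 0                # unavailable -> receives no traffic
        elif node["weight"] == 0:
            node["weight"] = RECOVERY_WEIGHT  # just recovered -> ramp up slowly
        # otherwise: the normal CPU-based weight adjustment would run here

nodes = [{"available": True, "weight": 0}]    # a node that just recovered
on_timer(nodes)                               # its weight becomes RECOVERY_WEIGHT
```

Because the update is driven by the timer rather than by RPC responses, a node that receives no traffic (and therefore no RPC feedback) still gets its weight corrected.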

2.3 Keys to putting it into production

Be fast and stable: avoid deadlock and avalanche in any situation, and handle boundary conditions with particular care.

Algorithm points:

1. Each dependent factor in the formula keeps an independent meaning and an independent update mechanism, to preserve the reliability and simplicity of the algorithm

◦ The IP list update is guaranteed jointly by service registration/discovery and the RPC framework

◦ CPU usage is updated through RPC

2. Pay attention to the meaning of boundary values, which must be distinguished from continuous values

◦ CPU = 0 means unknown; it does not mean good CPU performance

◦ w = 0 means no traffic will be allocated; it should be 0 only when the node is unavailable. An available node should have at least a small weight, so that RPC can still be triggered and the weight can then be updated

3. Weight updates should not depend on RPC triggers; they should happen periodically

3. Function verification

3.1 Stress test preparation

| Module | IP | CPU cores |
| --- | --- | --- |
| Client | 10.173.102.36 | 8 |
| Server | 11.17.80.238 | 8 |
| Server | 11.18.159.191 | 8 |
| Server | 11.17.191.137 | 8 |

3.2 Stress test data

| Indicator | RR load balancing | RALB load balancing |
| --- | --- | --- |
| QPS | Server-side QPS is balanced | Server-side QPS is stratified |
| CPU | CPU is also relatively uniform, around 10%, but the CPU gap between nodes is larger than under RALB | Server-side CPU is even, around 10% |
| TP99 | Latency is stable, with some variance | Latency is stable, with slight differences smaller than under RR |

Because the machines differ little in performance, the CPU effect in this stress test is not obvious. To make it more visible, an initial load is applied to node 11.17.80.238 (i.e., its CPU usage is 12.5% even with no traffic).

| Indicator | LA load balancing | RR load balancing | RALB load balancing |
| --- | --- | --- | --- |
| QPS | QPS is extremely uneven: traffic is highly skewed and concentrates on a single node | QPS is uniform | QPS is clearly stratified. The QPS values change because the "maximum weight adjustment ratio" was adjusted twice (1.5 → 2.0 → 2.5): 11.17.80.238: 125 → 96 → 79; 11.18.159.191: 238 → 252 → 262; 11.17.191.137: 239 → 254 → 263 |
| CPU | CPU is not LA's balancing target, so it follows the QPS trend and concentrates on a single node | CPU is clearly stratified; the CPU of 11.17.80.238 is obviously higher than that of the other nodes | 1. At the start of the test, the CPU of 11.17.80.238 is higher than the other two nodes, because the "maximum weight adjustment ratio" of 1.5 (relative to the fixed base of 10000) hits the adjustment limit. 2. With the ratio raised to 2.0, the gap between nodes narrows. 3. With the ratio raised to 2.5, it narrows further |
| TP99 | Latency is stable for the nodes that receive traffic. Nodes that receive almost no traffic appear to show large latency fluctuations, but LA's effect on latency should be stable, since most requests are processed with balanced latency | Latency is stable, with slight variance | Latency is stable, with slight differences smaller than under RR |

3.3 Stress test conclusions

The stress test shows that both RR and LA suffer from CPU imbalance: differences in machine performance create a short-board effect, so resources are not fully utilized.

RALB takes the CPU as its balancing target, adjusting the QPS carried by each node in real time according to that node's CPU, thereby achieving CPU balance. The function is verified as available, and the CPU behavior meets expectations.

4. Throughput test

4.1 Stress test goal

RALB is a load balancing algorithm that uses CPU usage as its dynamic indicator. It can solve the CPU imbalance problem, avoid the CPU short-board effect, and let better-performing machines handle more traffic. We therefore expect the RALB strategy to deliver some throughput improvement over the RR polling strategy.

4.2 Stress test preparation

100 server-side machines are used for the test; the server side uses pure CPU adaptive rate limiting, with the rate-limiting threshold set to 55%.
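As a rough illustration of what "pure CPU adaptive rate limiting with a 55% threshold" means (the article does not show the actual limiter, so this minimal sketch is an assumption):

```python
CPU_LIMIT = 55.0  # rate-limiting threshold used in the stress test (%)

def should_reject(current_cpu_percent):
    """CPU adaptive rate limiting in its minimal form: reject incoming
    requests while the node's CPU usage exceeds the threshold. Note that
    rejecting a request itself still costs a little CPU, which is why
    handled QPS dips slightly under full limiting (see section 4.4.2)."""
    return current_cpu_percent > CPU_LIMIT
```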

4.3 Stress test data

By stress testing under the RALB and RR load balancing modes, we observe how server-side throughput changes as traffic grows, and compare the impact of the two strategies on cluster throughput.

4.3.1 RALB

4.3.1.1 Throughput data

The following table shows server-side throughput as test traffic is sent from the client, with the load balancing mode set to RALB. At 18:17 the server side had just come close to the rate-limiting point. Across the whole stress test, three situations were covered: no rate limiting, partial rate limiting, and full rate limiting.

| Time | 17:40 | 17:45 | 17:52 | 18:17 | 18:22 |
| --- | --- | --- | --- | --- | --- |
| Total traffic | 2270 | 1715 | 1152 | 1096 | 973 |
| Handled traffic | 982 | 1010 | 1049 | 1061 | 973 |
| Rate-limited traffic | 1288 | 705 | 103 | 35 | 0 |
| Rate-limiting ratio | 56.74% | 41% | 8.9% | 3.2% | 0% |
| Average CPU usage | 55% | 55% | 54% | 54% | 49% |
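The rate-limiting ratio row is simply rate-limited traffic divided by total traffic, which can be checked against the table (the article rounds some entries more coarsely, e.g. 41% for 41.11%):

```python
# RALB throughput data from the table above.
total   = [2270, 1715, 1152, 1096, 973]
limited = [1288,  705,  103,   35,   0]

# Rate-limiting ratio (%) = rate-limited traffic / total traffic.
ratios = [round(100 * l / t, 2) for l, t in zip(limited, total)]
# -> [56.74, 41.11, 8.94, 3.19, 0.0], matching the table up to rounding
```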

4.3.1.2 Indicator Monitoring

The traffic received by each server-side machine is distributed according to its performance, and the CPU stays balanced.

(Figures: server-side QPS and CPU monitoring)

4.3.2 RR

4.3.2.1 Throughput data

The following table shows server-side throughput as test traffic is sent from the client, with the load balancing mode set to RR. The total server-side traffic at 18:46 is close to the total at 18:17 in the RALB test; the comparison below focuses on these two critical moments.

| Time | 18:40 | 18:46 | 19:57 | 20:02 | 20:04 | 20:09 |
| --- | --- | --- | --- | --- | --- | --- |
| Total traffic | 967 | 1082 | 1149 | 1172 | 1263 | 1314 |
| Handled traffic | 927 | 991 | 1024 | 1036 | 1048 | 1047 |
| Rate-limited traffic | 40 | 91 | 125 | 136 | 216 | 267 |
| Rate-limiting ratio | 4.18% | 8.4% | 10.92% | 11.6% | 17.1% | 20.32% |
| Average CPU usage | 45% (partial limiting) | 51% (partial limiting) | 53% (partial limiting) | 54% (close to full limiting) | 55% (full limiting) | 55% (full limiting) |

4.3.2.2 Indicator monitoring

The traffic received by the server side is balanced, but the CPU differs across nodes.

(Figures: server-side QPS and CPU monitoring)


4.4 Stress test analysis

4.4.1 Throughput curves

Based on the stress test data in section 4.3, the server-side throughput curves are plotted to compare the throughput trends under the RALB and RR load balancing modes.

import matplotlib.pyplot as plt

# Throughput data from section 4.3, in hundreds of QPS.
# x/y: RALB total traffic vs. handled traffic; the 0..9 points are the
# unthrottled region where handled traffic equals total traffic.
x = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 9.73, 10.958, 11.52, 17.15, 22.7]
y = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 9.73, 10.61, 10.49, 10.10, 9.82]

# w/z: RR total traffic vs. handled traffic.
w = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9.674, 10.823, 11.496, 11.723, 12.639, 13.141, 17.15, 22.7]
z = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9.27, 9.91, 10.24, 10.36, 10.48, 10.47, 10.10, 9.82]

plt.plot(x, y, 'r-o', label='RALB')  # red: RALB
plt.plot(w, z, 'g-o', label='RR')    # green: RR
plt.xlabel('Total traffic (x100 QPS)')
plt.ylabel('Handled traffic (x100 QPS)')
plt.legend()
plt.show()







4.4.2 Curve analysis

| Load balancing strategy | RALB | RR |
| --- | --- | --- |
| Phase 1: no machine rate limited | Received QPS = handled QPS, i.e. the line y = x | Received QPS = handled QPS, i.e. the line y = x |
| Phase 2: some machines rate limited | Does not occur. RALB allocates traffic according to downstream CPU, and downstream machines rate limit based on CPU, so in theory all downstream CPUs stay aligned and all machines hit the limit at the same moment; partial limiting does not happen. In the figure, "no limiting" to "full limiting" is therefore a single turning point with no smooth transition. | Under RR, downstream machines receive the same QPS; since rate limiting is CPU-based, different machines hit the limit at different moments. Compared with RALB, RR begins limiting earlier, and before full limiting its throughput is always lower than RALB's. |
| Phase 3: all machines rate limited | After every machine reaches the 55% limiting threshold, handled QPS should in theory stay constant no matter how traffic grows. The figure shows a slight decline, because rejecting requests also consumes some CPU. | RR reaches full limiting later than RALB. After full limiting, the handled QPS of the two modes is identical. |

4.5 Stress test conclusions

Critical point: the moment of maximum throughput difference, i.e., the turning point between no rate limiting and full rate limiting in RALB mode.

From the analysis above, the throughput gap between RR and RALB is largest at this critical point.

At this point, the server cluster throughput under RALB is calculated to be 7.06% higher.
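The 7.06% figure matches a direct comparison of the two comparable moments in section 4.3: handled traffic 1061 under RALB at 18:17 versus 991 under RR at 18:46, when total traffic was similar (1096 vs. 1082):

```python
ralb_handled = 1061  # RALB at 18:17 (section 4.3.1), total traffic 1096
rr_handled = 991     # RR at 18:46 (section 4.3.2), total traffic 1082

# Relative throughput improvement of RALB over RR at the critical point.
improvement = (ralb_handled - rr_handled) / rr_handled * 100
# -> about 7.06%
```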

5. Boundary testing

By simulating various boundary conditions, the stability and reliability of the system under those conditions was verified.

| Boundary condition | Test scenario | Conclusion |
| --- | --- | --- |
| Downstream node rate limited | CPU-based rate limiting | Tuning of the penalty factor has an important influence on traffic allocation |
| Downstream node rate limited | QPS-based rate limiting | As expected |
| Downstream node timeout | The server times out every request (fixed 1s sleep) | While timeouts persist, the node is allocated essentially zero traffic |
| Downstream node abnormal exit | The server process is killed directly with kill -9 pid | The process is killed and automatically restarted; traffic allocation recovers quickly |
| Downstream nodes added or removed | The server is manually taken offline/online via JSF | The node receives no traffic while offline from JSF |
| Downstream nodes added or removed | The server is restarted (stop + start) | The server process deregisters and re-registers normally; traffic allocation behaves as expected |

6. Function launch

The client-side configuration was rolled out in the Suqian data center, enabling RALB load balancing on all server-side clusters. Before and after launch, server-side QPS gradually becomes stratified while server-side CPU gradually converges.

(Figures: server-side QPS distribution and CPU distribution before and after launch)


Author: Hu Peidong, JD Retail

Source: JD Cloud Developer Community
