GaussDB (DWS) Network Flow Control and Its Effect

Abstract: This article introduces the network flow control capability of GaussDB (DWS) and verifies its effect.

This article is shared from the Huawei Cloud Community post "GaussDB (DWS) Network Flow Control and Control Effects", author: A grape tree in front of the door.

In the previous blog post, GaussDB (DWS) network scheduling and isolation management and control capabilities, we described GaussDB's network scheduling logic in detail and briefly showed how to apply its network isolation capabilities. This post focuses on the network flow control capability of GaussDB (DWS) and verifies its effect.

1. Analysis of network overload impact

The impact of network overload on performance is mainly reflected in two aspects:

  1. The impact of network scheduling on performance. The causes of this impact and GaussDB's network scheduling design are covered in the earlier blog post GaussDB (DWS) network scheduling and isolation control capabilities;
  2. The impact of the TCP cache on performance. This post analyzes that impact and explains how GaussDB keeps the TCP cache under control through flow control.

As is well known, TCP is a connection-oriented, reliable transport protocol. To guarantee reliable delivery, the receiver must acknowledge every packet the sender transmits, and failed transmissions are retransmitted. This mechanism ensures reliability, but its drawback is equally obvious: if the sender waited for an acknowledgment after every packet before sending the next, the interval between two transmissions would depend on the send/receive latency and the receiver's processing capacity, and the larger that interval, the lower the communication efficiency. To solve this problem, TCP introduces the concept of a window: the operating system sets aside buffer space for caching outgoing and incoming packets, which improves communication efficiency and network throughput. For the detailed principles, refer to the TCP sliding window mechanism.

The TCP cache solves TCP's communication-efficiency problem. However, when the network is overloaded, the TCP cache tends to stay full, so newly enqueued business packets must wait behind the buffered data before they can be sent. This waiting time is called the sending delay. Clearly, for a fixed network bandwidth, the larger the TCP cache backlog, the greater the sending delay.

Assume the network bandwidth is 1 GB/s and there are 2 MB of data in the TCP cache. The time to send all the buffered data is 2/1024 × 1000 ≈ 1.95 ms. Taking the receiver's processing and acknowledgment delay into account, the actual sending delay falls between 2 and 4 ms. If a high-priority job has to wait 2~4 ms every time it sends a packet, the accumulated delay is considerable.
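
The arithmetic above can be reproduced directly (a quick sanity check; the 1 GB/s bandwidth and 2 MB backlog figures come from the example in this post):

```python
# Sending delay caused by data already queued in the TCP send buffer.
# Values from the example: 1 GB/s link bandwidth, 2 MB of buffered data.
bandwidth_mb_per_s = 1024   # 1 GB/s expressed in MB/s
buffered_mb = 2             # data sitting in the TCP cache

# Time to drain the buffer, in milliseconds.
drain_ms = buffered_mb / bandwidth_mb_per_s * 1000
print(f"{drain_ms:.2f} ms")  # ≈ 1.95 ms, before receiver processing delay
```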

In a laboratory environment, a network overload scenario was constructed to test the impact of the TCP cache on business performance. The test environment is configured as follows:

A large-table broadcast serves as the background pressure, and a simple join between two tables is measured as the normal business. The test data are as follows:

Note: To show the impact of the TCP cache on performance more intuitively, we use the increase in execution time relative to the no-background-pressure run as the performance degradation metric.

During the background stress test, the TCP cache stayed above 2 MB. The test data above show that network scheduling alone cannot fully eliminate the impact of network overload on business performance. With all other environment parameters unchanged, the impact of the TCP cache size on performance was then tested:

Comparing these results with the test data under the default TCP cache configuration shows that, with or without network control, the larger the TCP cache, the worse the performance. At this point we can conclude that, once network scheduling is applied in an overload scenario, the TCP cache is the key remaining performance factor. However, directly shrinking the TCP cache configuration would hurt overall network throughput and communication latency, so another mechanism is needed to keep the TCP cache size within a bounded range.

2. GaussDB network flow control

2.1 Network rate-limiting algorithms

Rate limiting is one of the three classic tools for protecting system stability (rate limiting, caching, and degradation). It can limit concurrency or resource usage, and it can protect either the limited service itself or its neighbors. In a mixed-load database scenario, rate limiting prevents low-priority services from consuming too many resources and overloading them, ensuring that high-priority service performance is not significantly affected. Common rate-limiting algorithms include counter-based limiting, the leaky bucket algorithm, and the token bucket algorithm:

  1. Counter-based limiting: limits the number of requests within each limiting window. Within a single window the request count stays under the limit, but at the boundary between two adjacent windows the instantaneous traffic can exceed it (the critical-window problem).
  2. Leaky bucket: requests are first placed into a bucket (queue), and the bucket drains at a fixed rate, capping the amount that can be sent per unit of time and smoothing out burst traffic.
  3. Token bucket: the service adds tokens to a bucket at a fixed rate and stops adding once the total reaches a threshold; each request consumes a certain number of tokens, and if there are not enough, the rejection policy is triggered. The token bucket allows short bursts of traffic.
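
As an illustration of the third algorithm (a toy sketch, not GaussDB code), a minimal token bucket can be written in a few lines; tokens refill at a fixed rate and a request is admitted only if enough tokens remain, which is exactly what permits short bursts:

```python
class TokenBucket:
    """Minimal token-bucket rate limiter (illustrative sketch only).

    The current time is passed in explicitly so the behaviour is
    deterministic and easy to test.
    """

    def __init__(self, rate, capacity):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # bucket size = maximum burst
        self.tokens = capacity    # start with a full bucket
        self.last = 0.0

    def allow(self, now, cost=1):
        # Refill tokens for the elapsed time, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True           # request admitted
        return False              # tokens exhausted: rejection policy triggers

bucket = TokenBucket(rate=100, capacity=10)
burst = sum(bucket.allow(now=0.0) for _ in range(15))   # burst at t = 0
later = sum(bucket.allow(now=0.05) for _ in range(15))  # 50 ms later
print(burst, later)  # 10 5: a full burst first, then only what has refilled
```

Note how the full bucket absorbs an initial burst, after which throughput is bounded by the refill rate; a leaky bucket, by contrast, drains at a fixed rate and never admits a burst.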

2.2 Implementation of network flow control

GaussDB network flow control is mainly used to prevent network-intensive ("network-poor") SQL from keeping the network overloaded, which would make the TCP cache surge, inflate transmission delay, and prevent high-priority network requests from being sent in time, hurting high-priority performance. For TCP cache growth caused by excessive concurrency of normal business, limiting concurrency through query scheduling is the recommended remedy instead. Flow control of network-poor SQL is designed around the low-priority queue in network scheduling and is implemented with the leaky bucket algorithm.

A new GUC parameter, low_priority_bandwidth (default value: 256MB), limits the network bandwidth the low-priority queue may occupy. Under the default configuration, the parameter has two implications:

  • The transmission rate of the low-priority queue does not exceed 256 MB/s.
  • The amount of data transmitted within any 1 ms does not exceed 256 KB (256 MB/s ≈ 256 KB/ms), so the low-priority data in the TCP cache stays below 256 KB, preventing the low-priority queue from bloating the TCP cache and severely degrading high-priority service performance.
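
The per-millisecond interpretation can be sketched as a leaky-bucket sender that drains a fixed quota each tick (a toy illustration of the idea, not GaussDB's implementation; the 256 MB/s figure is the default low_priority_bandwidth, and packets are assumed no larger than one tick's quota):

```python
BANDWIDTH_MB_PER_S = 256
QUOTA_BYTES_PER_MS = BANDWIDTH_MB_PER_S * 1024  # 256 MB/s ≈ 256 KB per ms

def pace(packets):
    """Assign each packet (size in bytes) to the earliest 1 ms tick whose
    remaining quota it fits into, emulating a leaky bucket that drains
    at a fixed rate."""
    schedule = []       # list of (tick, packet_size)
    tick, used = 0, 0
    for size in packets:
        while used + size > QUOTA_BYTES_PER_MS:
            tick, used = tick + 1, 0   # quota exhausted: wait for next tick
        schedule.append((tick, size))
        used += size
    return schedule

# Three 128 KB packets: two fit in the first millisecond, the third waits.
sched = pace([128 * 1024, 128 * 1024, 128 * 1024])
print(sched)  # [(0, 131072), (0, 131072), (1, 131072)]
```

Because at most one tick's quota is ever in flight, the low-priority data sitting in the TCP cache is bounded by roughly that same quota, which is the second implication above.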

The bandwidth of the low-priority queue should be set with the network environment and cluster deployment in mind. Too large a value defeats the flow control; too small a value degrades low-priority performance too much. For example, on a 10GE network with 3 nodes and 12 DNs, the low-priority queue bandwidth should not exceed 256 MB. Within that bound, the lower the configured bandwidth, the stronger the limiting effect and the smaller the impact on high-priority service performance. Conversely, when the configured bandwidth approaches the network's upper limit, the limiting effect weakens as the concurrency of network-poor SQL grows. For example, on a 10GE network with 3 nodes and 12 DNs and the low-priority queue limited to 256 MB, the limiting effect starts to decline once large-table broadcasts exceed about 15 concurrent executions.

2.3 Verification of flow control effect

Test environment configuration:

  • NIC: 10GE
  • CPU: 72 cores
  • Memory: 350 GB
  • Cluster: 3 nodes, 12 DNs (4 DNs per node)
  • low_priority_bandwidth: 256

Create an exception rule that downgrades any query that has run for more than 1 minute and whose network bandwidth exceeds 128 MB (per DN, 5 s average transfer rate):

CREATE EXCEPT RULE bandwidth_rule1 WITH(bandwidth=128, ELAPSEDTIME=60, action='penalty');

Create resource pool rp1 and associate the above exception rules:

CREATE RESOURCE POOL rp1 WITH(EXCEPT_RULE='bandwidth_rule1');

Create user user1 and associate it with resource pool rp1:

CREATE USER user1 RESOURCE POOL 'rp1' PASSWORD 'xxxxxxxx';

When user1 executes a query that matches the rule (running time over 1 minute and bandwidth over 128 MB), the query is downgraded; from then on, its network requests are scheduled through the low-priority queue.

Use user1 to perform the following tests to verify the rate-limiting effect:

  • Create a sample table and import data
-- tables used by the background-pressure SQL
CREATE TABLE wt1(c1 int, c2 int, b1 char(1000), b2 char(7000)) distribute by hash(c1);
CREATE TABLE wt2(c1 int, c2 int, b1 char(1000), b2 char(7000)) distribute by hash(c1);
INSERT INTO wt1 select generate_series(1,10000), generate_series(1,10000), repeat('a',900), repeat('b',6888);
INSERT INTO wt2 select * from wt1;
INSERT INTO wt1 select * from wt1; -- execute several times to load more than 3 GB of data
-- tables used by the high-priority business SQL
CREATE TABLE wt3(c1 int, c2 int, b1 char(1000), b2 char(7000)) distribute by hash(c1);
CREATE TABLE wt4(c1 int, c2 int, b1 char(1000), b2 char(7000)) distribute by hash(c1);
INSERT INTO wt3 select generate_series(1,10000), generate_series(1,10000), repeat('a',900), repeat('b',6888);
INSERT INTO wt4 select * from wt3;
  • Use the following SQL for background pressure
select count(1) from (select /*+ broadcast(wt1)*/ wt1.c1,wt1.c2 from wt1, wt2 where wt1.c2 = wt2.c2);
  • Use the following SQL as a high-quality business for performance test verification
select count(1) from (select /*+ broadcast(wt3)*/ wt3.c1,wt3.c2 from wt3, wt4 where wt3.c2 = wt4.c2);
  • Under different background pressure levels (different numbers of concurrent background-pressure SQL statements), measure performance both without network control and with the background pressure downgraded, recording the SQL completion time.

The performance test data show that:

  • Without network control, business performance degrades sharply when the network is overloaded; with 10 concurrent background-pressure statements the degradation reaches 55×.
  • Without network control, the heavier the background pressure, the worse the business performance.
  • With the background pressure downgraded, business performance changes little across different pressure levels.
  • With the background pressure downgraded, performance degradation is essentially bounded; no dramatic degradation occurs.

With the background pressure downgraded, business performance still degrades somewhat. The main reason is that flow control can only reduce the TCP cache backlog, not eliminate it completely. If even this residual impact is unacceptable, the network-poor SQL can be terminated outright rather than merely downgraded.

The test results show that downgrade exception rules combined with low-priority-queue flow control effectively contain the impact of background pressure on business performance, ensuring that network-poor SQL does not cause significant degradation of high-priority business performance.



Reprinted from: my.oschina.net/u/4526289/blog/8705215