Apache Doris's load isolation capability based on Workload Group｜Deep Dive


Author: SelectDB technical team

Enterprise query demands keep growing. When multiple business lines or analysis workloads share the same cluster, their concurrent queries compete for limited resources, which can degrade performance and even destabilize the cluster. Effective load management is therefore essential.

Starting from business scenarios, the requirements for user load management mainly come from the following aspects:

When multiple business departments or tenants share the same cluster, each tenant needs independent resource usage and stable performance so that the loads of different tenants do not interfere with each other.
Different businesses have different requirements for the responsiveness and priority of query tasks. Key businesses or high-priority tasks, such as real-time data analysis and online transactions, must obtain sufficient resources and be executed with priority, so that resource competition does not affect their query performance.
Users not only care about resource allocation and management, but also pay attention to cost control and resource utilization. The load management solution needs to meet the isolation requirements while also realizing the user's demands for low usage cost and high resource utilization.

In early versions, Apache Doris introduced an isolation solution based on resource tags, which combines node-level resource group division within the cluster with resource limits for individual queries to physically isolate resources between different users. To provide a more complete load management solution, Apache Doris introduced Workload Group in version 2.0, which implements soft limits on CPU resources and gives users higher resource utilization. The newly released version 2.1 builds on the CGroup technology provided by the Linux kernel to implement hard limits on CPU resources as well, giving users better query stability.

Physical isolation solution based on Resource Tag

There are two types of nodes in Apache Doris, FE and BE. The FE node is responsible for metadata storage, cluster management, user request access, query plan analysis, etc., while the BE node is responsible for data storage and calculation. The main resource consumption involved in the query execution process is the BE node, so the Apache Doris load isolation solution is designed for the BE node.

In the Resource Tag resource physical isolation solution, you can set tags on BE nodes in the same cluster. BE nodes with the same tags will form a resource group (Resource Group), and the resource group can be regarded as a unit of data storage and computing. When data is entered into the database, copies of the data will be written to different resource groups according to the resource group configuration. When querying, the computing resources on the corresponding resource group will be used for calculation according to the division of resource groups.

Reference documentation: https://doris.apache.org/zh-CN/docs/2.0/admin-manual/resource-admin/multi-tenant

Let's take a common read-write analysis scenario as an example. Assume there are 3 BEs in the cluster. The specific usage steps are as follows:

BE node binding Resource Tag: Bind two BEs to Tag Read to serve the read load; bind one BE to Tag Write to serve the write load. Read workloads and write workloads are located on different machines to achieve read and write isolation.
Data copies are bound to Resource Tag: Table 1 has three copies, two copies are bound to Tag Read, and one copy is bound to Tag Write. The data written to replica 3 will be automatically synchronized to replica 1 and replica 2. The synchronization process will not occupy too much computing resources of BE 1 and BE 2.
The workload is bound to the Resource Tag: If the Tag carried by a query SQL is Read, the query is automatically routed to the machines whose Tag is Read (BE 1, BE 2) for execution; if a Stream Load import specifies the Tag Write, it is routed to the machine whose Tag is Write (BE 3). Apart from the overhead of replica synchronization, queries and imports no longer compete for resources.
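The routing described in the steps above can be sketched as a small model. This is an illustration only: the BE_TAGS mapping and the route() helper are hypothetical names chosen for this example and are not Doris APIs.

```python
# Illustrative model only: BE_TAGS and route() are hypothetical, not Doris APIs.
# Two BEs carry the Tag "Read" and one BE carries the Tag "Write", as in the example.
BE_TAGS = {"BE1": "Read", "BE2": "Read", "BE3": "Write"}


def route(workload_tag):
    """Return the BE nodes whose Resource Tag matches the workload's tag."""
    return sorted(be for be, tag in BE_TAGS.items() if tag == workload_tag)


print(route("Read"))   # ['BE1', 'BE2']
print(route("Write"))  # ['BE3']
```

Queries tagged Read never land on BE3 and imports tagged Write never land on BE1 or BE2, which is what keeps the two workloads on separate machines.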

Physical isolation solution based on Resource Tag.png

Resource Tag can also implement multi-tenancy. For example, suppose two users, UserA and UserB, want independent tenants so that they do not affect each other. You can bind UserA's computing and storage resources to a Tag named UserA and bind UserB's computing and storage resources to a Tag named UserB. The two users then have their resources isolated between tenants on the BE side.

Physical isolation solution based on Resource Tag-2.png

The essence of Resource Tag is to achieve resource isolation by grouping BE nodes. The advantages of this solution are:

Good isolation, multiple tenants are isolated through physical machines, and complete isolation of CPU, memory, and IO is achieved;
Fault isolation, when a problem occurs in one tenant (such as a process crash), the other tenant is not affected at all;

Based on this technology, some users place different resource groups in different physical data centers to run two data centers in the same city in active-active mode.

But there are also certain limitations:

In the read-write isolation scenario, when the write load stops, the machine with the Tag Write will be in an idle state, thus reducing the resource utilization of the entire cluster, which obviously cannot meet the user's expectations for full resource utilization.
In a multi-tenant scenario, the loads of multiple business parties within the same tenant will also affect each other. Even if isolation can be achieved by configuring separate physical machines for each business party, this will bring about problems such as high cost and low resource utilization.
The flexibility is poor. The number of tenants is actually bound to the number of replicas. If you want to establish 5 tenants, you need at least 5 replicas, which causes a waste of storage space to a certain extent.

Load management solution based on Workload Group

To solve the above problems, Apache Doris introduced a management solution based on Workload Group. It supports a finer-grained mechanism, intra-process resource isolation, which means that multiple queries within the same BE can also be isolated from each other to a certain extent. This effectively avoids resource competition within the process and improves resource utilization.

Workload Group manages workloads in groups to achieve refined management and control of memory and CPU resources. By associating the queries a user executes with a Workload Group, you can limit the percentage of CPU and memory resources that a single query may use on a single BE node. You can also configure and enable memory resource limits: when cluster resources are tight, queries with high memory usage in the group are automatically terminated to relieve pressure. When resources are idle, multiple Workload Groups share the idle resources and automatically exceed their limits to keep query execution stable.

The limits of CPU resources can be subdivided into soft limits and hard limits. CPU soft limits have the characteristics of higher resource utilization and allow flexible allocation of resources when resources are idle; while CPU hard limits focus more on ensuring performance stability and ensuring Groups will not interfere with each other due to load changes.

(The CPU hard limit and the CPU soft limit suit different usage scenarios, but they cannot be applied at the same time. Users can choose between them according to their own needs.)

Load management solution based on Workload Group.png

The main differences between Workload Group and Resource Tag solutions are as follows:

From the perspective of computing resources, Workload Group further divides the CPU and memory resources within the BE process. Multiple Workload Groups need to compete for resources on the same BE. Resource Tag groups BE nodes, and the loads of different business parties are sent to BEs in different groups to achieve resource isolation. There will be no direct resource competition between business loads in different BE groups.
From the perspective of storage resources, Workload Group does not need to pay attention to storage resources, but only focuses on the allocation of computing resources within a single BE. Resource Tag requires grouping copies of data to ensure that business side data that needs to be isolated is distributed on different BEs.

01 CPU soft limit

CPU priority is mainly reflected through the cpu_share parameter, which can be thought of as a weight. In the same time period, a Group with a higher weight gets more CPU time.

Take Group A and Group B as an example. Suppose Group A is configured with a cpu_share of 1 and Group B with a cpu_share of 9, and consider a 10-second time period. When both loads are saturated, Group B, with the higher weight, obtains 9 seconds of CPU time (90% of all resources), and Group A obtains 1 second (10% of all resources). In actual use, not all services run at full load. If Group B has a low load or no load, Group A can use up to the full 10 seconds of CPU time. This provides more flexible resource allocation and improves the overall utilization of cluster CPU resources.
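The weight arithmetic in this example can be written out as a short sketch. It is a simplified model that assumes every active group is fully saturated; the function name soft_limit_cpu_seconds is hypothetical and is not part of Doris.

```python
def soft_limit_cpu_seconds(cpu_share, active, period_s=10.0):
    """Split one period of CPU time among the active groups in proportion to cpu_share.

    Simplified model: every active group is assumed to be saturated, and
    groups that are not active receive no CPU time.
    """
    total = sum(cpu_share[g] for g in active)
    return {
        g: (period_s * cpu_share[g] / total if g in active else 0.0)
        for g in cpu_share
    }


shares = {"A": 1, "B": 9}
print(soft_limit_cpu_seconds(shares, {"A", "B"}))  # {'A': 1.0, 'B': 9.0}
print(soft_limit_cpu_seconds(shares, {"A"}))       # {'A': 10.0, 'B': 0.0}
```

The second call shows the property that distinguishes a soft limit: when Group B is idle, its unused share flows to Group A.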

CPU

02 CPU hard limit

With the CPU soft limit, query performance may fluctuate when the system load is high or CPU resources are tight. To meet users' requirements for stable query performance, Apache Doris implemented the CPU hard limit for Workload Group in the latest version 2.1: regardless of whether the overall CPU of the current physical machine is idle, the CPU usage of a Group configured with a hard limit cannot exceed the preconfigured limit value.

Take Group A and Group B as an example. Suppose Group A is configured with cpu_hard_limit=10% and Group B with cpu_hard_limit=90%. When both groups saturate the CPU of a single machine, Group A's CPU utilization is 10% and Group B's is 90%, which is the same as with the CPU soft limit. However, when the load of Group B decreases or disappears, Group A's maximum CPU utilization is still strictly limited to 10% even if its query load increases, and it cannot obtain more resources. Although this approach sacrifices flexibility in resource allocation, it ensures stable query performance.
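The behavior in this example can be captured in a minimal sketch, in which each group's utilization is simply its demand capped at its configured limit. The function name hard_limit_cpu_pct and the demand figures are assumptions made for illustration, not Doris internals.

```python
def hard_limit_cpu_pct(demand_pct, hard_limit_pct):
    """Cap each group's CPU demand (in percent of the machine) at its cpu_hard_limit."""
    return {g: min(demand_pct[g], hard_limit_pct[g]) for g in hard_limit_pct}


limits = {"A": 10, "B": 90}

# Both groups saturated: utilization equals the configured limits.
print(hard_limit_cpu_pct({"A": 100, "B": 100}, limits))  # {'A': 10, 'B': 90}

# Group B idle: Group A is still capped at 10% even though the CPU is free.
print(hard_limit_cpu_pct({"A": 100, "B": 0}, limits))    # {'A': 10, 'B': 0}
```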

hard limit

03 Memory resource limitations

Instructions for use: BE node memory is mainly divided into the following parts:

Operating system reserves memory

The non-query part of the memory in the BE process cannot be counted by the Workload Group for the time being.

The memory of the query part within the BE process (including import operations) can be counted and managed by the Workload Group.

Memory resource limits are mainly set through the memory_limit parameter (the percentage of BE memory that the group can use). This parameter not only sets the preconfigured memory quota, but also affects the priority with which memory is reclaimed after overcommit.

In the initial state, high-priority resource groups are allocated more memory and low-priority resource groups less. To improve memory utilization, you can turn on the memory soft limit of a resource group through the enable_memory_overcommit parameter, so that the group can use memory beyond its limit when the system has free memory.

In order to ensure the stable operation of the system, when the overall memory resources of the system are insufficient, the system will give priority to canceling tasks that occupy large amounts of memory to reclaim overcommitted memory resources. During this process, the system will try to reserve the memory resources of high-priority resource groups, and the excess memory of low-priority resource groups will be reclaimed faster.

Memory resource limits

04 Query queue

When the business load exceeds the upper limit of the system, newly submitted queries not only fail to execute effectively but also affect queries that are already running. To avoid this problem, Workload Group supports query queuing. Once the number of running queries reaches the preset maximum concurrency, newly submitted queries enter the queue. When the queue is full or the wait times out, the query is rejected to relieve pressure on the system under high load.

Query queue.png

The query queuing function mainly has three attributes:

max_concurrency: The maximum number of SQL statements allowed to run simultaneously in the current Group. If the maximum number is exceeded, queuing logic will be entered.
max_queue_size: The maximum number of queries allowed in the queue. If the queue is full, the query will be rejected and execution will fail.
queue_timeout: The time limit for queuing in the queue. If it times out, it will fail directly. The unit is milliseconds.
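How the three attributes interact can be sketched with a toy admission controller. The QueryQueue class below is hypothetical and greatly simplified (single-threaded, with timeouts checked only when a running query finishes); it illustrates the semantics described above and is not Doris's implementation.

```python
class QueryQueue:
    """Toy admission control for one Workload Group (hypothetical, not Doris code)."""

    def __init__(self, max_concurrency, max_queue_size, queue_timeout_ms):
        self.max_concurrency = max_concurrency
        self.max_queue_size = max_queue_size
        self.queue_timeout_ms = queue_timeout_ms
        self.running = 0
        self.waiting = []  # enqueue timestamps (ms) of queued queries, oldest first

    def submit(self, now_ms):
        """Return 'RUN', 'QUEUED', or 'REJECTED' for a newly submitted query."""
        if self.running < self.max_concurrency:
            self.running += 1
            return "RUN"
        if len(self.waiting) < self.max_queue_size:
            self.waiting.append(now_ms)
            return "QUEUED"
        return "REJECTED"  # queue is full

    def finish_one(self, now_ms):
        """A running query finished: fail timed-out waiters, then admit the oldest survivor.

        Returns the number of queued queries that failed because of queue_timeout.
        """
        self.running -= 1
        alive = [t for t in self.waiting if now_ms - t <= self.queue_timeout_ms]
        timed_out = len(self.waiting) - len(alive)
        self.waiting = alive
        if self.waiting:
            self.waiting.pop(0)
            self.running += 1
        return timed_out


q = QueryQueue(max_concurrency=2, max_queue_size=1, queue_timeout_ms=1000)
results = [q.submit(now_ms=0) for _ in range(4)]
print(results)  # ['RUN', 'RUN', 'QUEUED', 'REJECTED']
```

With two running slots and one queue slot, the third query waits and the fourth is rejected outright, which is how the queue sheds load instead of letting extra queries slow down the ones already running.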

Reference documentation: https://doris.apache.org/zh-CN/docs/admin-manual/workload-group

Workload Group usage test

Next, we conduct detailed tests on the CPU soft limit and hard limit of the Workload Group to clearly demonstrate to users the load management effect and performance of these two limits under the same hardware conditions.

Test environment: 16-core 64G memory single physical machine
Deployment method: 1 FE, 1 BE
Test data set: Clickbench, TPCH
Stress measurement tool: JMeter

01 CPU soft limit test

Start two clients (1 and 2) and run the test twice, once without and once with the CPU soft limit, to observe the effect of the soft limit on load management. Note that Page Cache affects the results of this test, so it needs to be turned off to obtain the ideal results.

CPU soft limit test.png

By comparing and analyzing the client throughput data in the two tests, we can draw the following conclusions:

Without Workload Group, the throughput ratio of the two clients is 1:1, indicating that they receive the same CPU resources during the same run time.
With Workload Group and cpu_share set to 2048 and 1024 respectively, the throughput ratio becomes 2:1. This shows that client 1, which has the larger cpu_share, gets a higher proportion of CPU resources in the same running time.

02 CPU hard limit test

As described above, the CPU hard limit ensures good isolation when the load is high. We therefore use a hard limit to cap CPU usage at 50% (cpu_hard_limit=50%) and use the same client to run the q23 query test at concurrency levels of 1, 2, and 4 (simulating different loads). Each test runs for 5 minutes.

CPU hard limit test.jpeg

The test results show that as the number of concurrent queries increases, CPU utilization stays stable at around 800% (on a 16-core machine, 800% means that 8 cores are in use, so the actual CPU utilization is 50%). Because CPU resources are hard-limited, it is expected that tp99 latency increases as concurrency increases.
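The conversion between the per-core percentage reported in the test and the configured hard limit is simple arithmetic, sketched below. The helper name per_core_cpu_pct is hypothetical and exists only for this illustration.

```python
def per_core_cpu_pct(cores, cpu_hard_limit_pct):
    """Convert a whole-machine CPU limit into a per-core percentage (100% = one core)."""
    return cores * cpu_hard_limit_pct


pct = per_core_cpu_pct(cores=16, cpu_hard_limit_pct=50)
print(pct)        # 800
print(pct / 100)  # 8.0, the number of cores in use
```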

03 Simulate production environment testing

In actual production environments, users often care more about query latency than about pure throughput. To get closer to real application scenarios and evaluate performance accurately, we selected a series of query SQLs with a latency of about 1 second (ClickBench's q15, q17, q23 and TPCH's q3, q7, q19) to form a SQL set. These queries cover features such as single-table aggregation and Join calculation, and the TPCH data set size used is 100G.

We designed two sets of tests to simulate scenarios without Workload Group and with Workload Group respectively. Four tests were conducted on Client 1 and Client 2, focusing on the latency of tp90 and tp99.

Simulate production environment testing.png

By observing the query delays in the four tests in the above table, we can draw the following conclusions:

Without Workload Group (Tests 1 and 2): When the concurrency of client 2 increases from 1 to 4, the query latency of both client 1 and client 2 rises significantly. For client 1, the median, tp90, and tp95 query response times all increase by 2-3 times.
With Workload Group (Tests 3 and 4): CPU hard limits were applied in these two tests, with cpu_hard_limit=90% for client 1 and cpu_hard_limit=10% for client 2. The results show that even when the concurrency of client 2 increases, the query latency of client 1 rises only slightly, which is significantly better than in Test 2. This demonstrates the effectiveness of Workload Group for load isolation and stable performance.

Conclusion

At present, the Resource Tag and Workload Group functions have been launched in the production services of multiple community users and have been verified on a large scale. They are recommended for users with resource isolation needs.

For both Resource Tag and Workload Group, the goal is to balance resource isolation against resource utilization. The former adopts a more thorough isolation solution, while the latter achieves isolation while making full use of resources, and further ensures system stability in high-load scenarios through the query queuing mechanism.

In practice, we recommend combining the two solutions according to the business scenario:

If the same cluster is shared across systems/cross-business departments and you want to achieve physical isolation of resources and data, you can adopt the Resource Tag solution;
If you are facing multiple types of query loads at the same time in the same cluster, you can distinguish different loads through Workload Group, and ensure that various query loads can obtain appropriate resources through flexible resource allocation;

We still have many plans for subsequent functional improvements:

The current memory limit releases memory by canceling queries. In the future, spilling operators to disk can further improve the stability of large queries and avoid query failures when resources are tight.
Currently, in the memory model of the BE process, some non-query memory is not counted, which may lead to differences between the memory of the BE process and the memory used by the Workload Group seen by users. We will try to solve this problem in future versions.
The query queuing function only supports queuing based on the maximum number of concurrent queries. In the future, the maximum number of concurrent queries will be restricted by the resource usage of BE, thereby forming automatic back pressure on the client and improving the availability of Doris when the client continues to submit high loads.
Resource Tag divides resources among BE machines, and Workload Group divides resources within the process on a single machine. Both methods expose the concept of BE nodes to users. In essence, users of the resource management function only need to care about the amount of resources available to their own workloads in the entire cluster and the priority of resource allocation. In the future, we will explore new ways of dividing resources to reduce the cost of understanding and using them.

Acknowledgments

The Workload Group function was jointly developed by the open source community. Thanks to the following contributors: Luo Zenglin (luozenglin), Liu Lijia (liutang123), Zhao Liwei (levy5307)