Theory and examples: GaussDB (DWS) resource management explained in detail

Abstract: Reasonable management and allocation of system resources is the key to ensuring the stable and efficient operation of a database system.

This article is shared from the Huawei Cloud Community post "GaussDB (DWS) Resource Management Capability Introduction and Application Examples", author: A grape tree in front of the door.

1. Resource management capabilities

1.1 Overview

Resources commonly used while a database is running include system resources (CPU, memory, network, etc.) and database shared resources (locks, counters, etc.). A running job always wants as many shared resources as possible in order to get the best execution performance. However, abuse of shared resources can destabilize the database system, overload resources, degrade the QoS (quality of service) of high-priority services, and even block business operations. Therefore, rational management and allocation of system resources is the key to ensuring the stable and efficient operation of the database system.

Goals of resource management:

Prevent resource overload from causing system-level failures;
Schedule high-priority services first and guarantee their QoS;
Isolate resources between businesses to prevent serious resource contention between them;
Stagger peaks and schedule businesses in time slots, so that instantaneous high concurrency does not affect system stability;
Quickly identify abnormal queries to ensure the stable operation of normal business.

1.2 Basic Principles

A resource pool is a technology used to divide system resources. By setting a resource upper limit for a resource pool, resource management and control of jobs running in it can be realized. Resource pools can help system administrators better manage and allocate system resources, and improve system availability and stability.

GaussDB (DWS) provides resource management functions, using resource pools to realize resource isolation and query scheduling between services (different services are routed to different resource pools). GaussDB (DWS) supports two routing strategies:

User-resource pool: Associate a user with a resource pool. When a user executes a job, the job is routed to the corresponding resource pool for execution according to the "user-resource pool" association.
Query_band load identification: The business sets query_band (a USERSET parameter), and the job is routed to the corresponding resource pool for execution according to the association between the query_band and the resource pool.

In the actual application process, it is recommended to use the user-resource pool routing method first; when the user-resource pool routing method cannot meet the isolation requirements, use query_band load identification to achieve business isolation.

1.3 Capability introduction

GaussDB (DWS) supports resource management capabilities such as load management, resource control, exception rules, query filters, and load planning. Each capability has its own usage scenarios, and one or more of them may be combined in practical applications.

1.3.1 Load management

Load management supports query scheduling based on concurrency and estimated memory, preventing the serious resource contention and query pile-up caused by excessive concurrency. In a multi-CN scenario, CNs are not aware of each other's load, so the concurrency and memory usage of the whole cluster cannot be controlled precisely, which may trigger out-of-memory errors. To keep concurrency and memory controllable in multi-CN scenarios, queries are therefore scheduled centrally by the CCN. When the cluster starts for the first time, the CM selects the CN with the smallest number as the CCN according to the cluster deployment layout. If the CCN fails, the CM selects a new CCN to replace it.
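The central scheduling idea can be sketched as a small admission controller. This is only an illustrative model, not GaussDB (DWS) code: the class name, the FIFO queuing order, and the parameters max_active and mem_limit_mb are assumptions made for the example.

```python
from collections import deque


class CentralScheduler:
    """Hypothetical model of one central node admitting queries for the cluster."""

    def __init__(self, max_active: int, mem_limit_mb: int):
        self.max_active = max_active      # assumed cluster-wide concurrency limit
        self.mem_limit_mb = mem_limit_mb  # assumed cluster-wide memory budget
        self.active = {}                  # query id -> estimated memory (MB)
        self.waiting = deque()            # queued (query id, estimated memory)

    def _fits(self, est_mem_mb: int) -> bool:
        used = sum(self.active.values())
        return len(self.active) < self.max_active and used + est_mem_mb <= self.mem_limit_mb

    def submit(self, query_id: str, est_mem_mb: int) -> bool:
        """Admit the query if there is room and nobody is queued, else queue it."""
        if not self.waiting and self._fits(est_mem_mb):
            self.active[query_id] = est_mem_mb
            return True
        self.waiting.append((query_id, est_mem_mb))
        return False

    def finish(self, query_id: str) -> list:
        """Release a finished query and admit queued queries in FIFO order."""
        del self.active[query_id]
        admitted = []
        while self.waiting and self._fits(self.waiting[0][1]):
            qid, mem = self.waiting.popleft()
            self.active[qid] = mem
            admitted.append(qid)
        return admitted


sched = CentralScheduler(max_active=2, mem_limit_mb=100)
print(sched.submit("q1", 60))  # admitted
print(sched.submit("q2", 60))  # queued: 60 + 60 exceeds the memory budget
print(sched.finish("q1"))      # q2 is admitted once q1 releases its memory
```

Because every CN would ask the same scheduler, the limits hold for the whole cluster rather than per CN.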

Although CCN control is more precise, its logic is more complex and involves communication between the CN and the CCN. Communication delay (between 10 ms and 1 s) and the complex control logic may make job performance unstable. In addition, the CCN may become a bottleneck that limits system concurrency. The "short query acceleration" function was therefore designed to manage simple queries and complex queries separately. Complex queries take a long time to execute and consume a lot of memory, so CCN control has limited impact on their performance; simple queries execute quickly and consume little memory, so controlling them on the CN reduces the impact on their performance. The main purpose of CCN control is to prevent out-of-memory errors, so queries are classified by estimated memory:

Simple query: the estimated memory is less than 32MB;
Complex query: the estimated memory is greater than or equal to 32MB.

To improve the performance of simple queries, by default simple queries are subject only to concurrency control, not memory or CPU control. In real deployments, however, low-priority services may not be performance-sensitive, and the CPU and memory they use must be controlled precisely; in that case simple queries need resource control too. To suit more usage scenarios, short query acceleration can be enabled or disabled:

Enable short query acceleration: simple queries are controlled in CN, and complex queries are controlled in CCN;
Disable short query acceleration: no distinction is made between simple and complex queries, and all queries are managed and controlled by CCN.

For ease of distinction, we refer to CN control as the fast lane, and CCN control as the slow lane. The fast lane only performs concurrency control on the CN, and does not support memory and CPU control; the slow lane performs concurrency and memory control on the CCN, and supports CPU control at the same time. Among them, the fast lane concurrency corresponds to the resource pool max_dop parameter, the slow lane concurrency corresponds to the resource pool active_statements parameter, and the slow lane memory corresponds to the resource pool mem_percent parameter.
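The lane selection described above can be summarized in a few lines. This is a minimal sketch based only on the rules stated in this section; the names Lane and choose_lane are hypothetical and do not correspond to any GaussDB (DWS) interface.

```python
from dataclasses import dataclass

SIMPLE_QUERY_LIMIT_MB = 32  # from the text: estimated memory below 32MB means "simple"


@dataclass(frozen=True)
class Lane:
    name: str           # "fast" or "slow"
    controlled_on: str  # node that performs the control
    controls: tuple     # resources managed in this lane


FAST_LANE = Lane("fast", "CN", ("concurrency",))
SLOW_LANE = Lane("slow", "CCN", ("concurrency", "memory", "cpu"))


def choose_lane(estimated_mem_mb: float, short_query_acceleration: bool) -> Lane:
    """Pick the lane for a query from its estimated memory."""
    is_simple = estimated_mem_mb < SIMPLE_QUERY_LIMIT_MB
    if short_query_acceleration and is_simple:
        return FAST_LANE
    # Complex queries, or any query when acceleration is disabled, go through the CCN.
    return SLOW_LANE


print(choose_lane(8, True).name)    # fast
print(choose_lane(32, True).name)   # slow
print(choose_lane(8, False).name)   # slow
```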

1.3.2 CPU resource control

GaussDB (DWS) implements two CPU control capabilities based on cgroups: shared quota control based on cpu.shares and exclusive quota control based on cpuset.

Exclusive quotas limit the CPU cores that jobs in the resource pool can use, which gives more thorough isolation. They are used to prevent low-priority jobs from occupying too much CPU and affecting the performance of high-priority jobs.

Shared quotas only take effect when the CPU is busy: resource pools then share the CPU in proportion to their quotas. Because different services still compete for the CPU, service performance may be affected.
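The proportional behavior of shared quotas can be illustrated with a small calculation. The sketch below models the general cpu.shares idea (busy groups split the CPU in proportion to their shares, and idle groups give up their portion); the pool names and share values are made up for the example.

```python
def cpu_fraction_under_contention(shares: dict, busy_pools: set) -> dict:
    """Split CPU time among the busy pools in proportion to their shares.

    Idle pools get 0.0 and their portion is redistributed to the busy pools,
    which is why shared quotas only matter when the CPU is contended.
    """
    total = sum(shares[pool] for pool in busy_pools)
    return {
        pool: (shares[pool] / total if pool in busy_pools else 0.0)
        for pool in shares
    }


# Hypothetical pools and share values.
shares = {"pool_a": 50, "pool_b": 40, "pool_c": 10}

# All three pools are busy: the CPU is split 50% / 40% / 10%.
print(cpu_fraction_under_contention(shares, {"pool_a", "pool_b", "pool_c"}))

# pool_a is idle: pool_b and pool_c split the CPU 80% / 20%.
print(cpu_fraction_under_contention(shares, {"pool_b", "pool_c"}))
```

An exclusive quota behaves differently: the cores are reserved for the pool whether or not it is busy, which is what makes the isolation more thorough.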

1.3.3 Network control (supported by version 8.2.1 and above)

GaussDB (DWS) supports network scheduling based on the SP+DWRR algorithm and network flow control based on token buckets. This provides proportional network scheduling between resource pools, as well as degradation and flow control of SQL statements that consume excessive network bandwidth.
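For readers unfamiliar with token buckets, the following is a generic sketch of the technique, not the GaussDB (DWS) implementation. The class name, the units, and the explicit now argument (used instead of a real clock so that the example is deterministic) are assumptions.

```python
class TokenBucket:
    """Generic token bucket: `rate` tokens are added per second, up to `capacity`."""

    def __init__(self, rate: float, capacity: float, now: float = 0.0):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity  # start full, so an initial burst is allowed
        self.last = now

    def try_send(self, size: float, now: float) -> bool:
        """Consume `size` tokens and return True if sending is allowed at time `now`."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if size <= self.tokens:
            self.tokens -= size
            return True
        return False  # not enough tokens: the sender is throttled


bucket = TokenBucket(rate=10, capacity=20)
print(bucket.try_send(20, now=0.0))  # True: a burst up to the capacity is allowed
print(bucket.try_send(5, now=0.1))   # False: only about 1 token has been refilled
print(bucket.try_send(5, now=1.0))   # True: enough tokens have accumulated again
```

The rate bounds the long-term throughput, while the capacity bounds how large a burst can be.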

1.3.4 Space control (not the focus of this article)

There are two main purposes of space control: one is to prevent the disk from being full, and the other is to limit the disk space used by the business. GaussDB (DWS) supports the following space management and control capabilities:

Database read-only: The CM checks the data disk usage every 10 minutes. When the usage exceeds the threshold, the database is set to read-only; when the usage falls back below the threshold, read-only mode is lifted. While the database is read-only, only read-only jobs are allowed to execute. To free disk space in this state, execute DROP/TRUNCATE inside a read-write transaction: "START TRANSACTION READ WRITE; DROP/TRUNCATE TABLE; COMMIT;".

User space control: limits the space that a user can use on a single CN/DN. Usage is charged to the owner of the table, regardless of which user performs the insert. Related syntax: CREATE/ALTER USER xxx PERM SPACE 'xxx G/M/K'.

Schema space control: Provides Schema-level single-instance space control capabilities, related syntax: CREATE/ALTER SCHEMA xxx WITH PERM SPACE 'xx G/M/K'.
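The owner-based accounting of user space control can be modeled as follows. This is a rough sketch for a single instance: the helper names are hypothetical, and treating K/M/G as powers of 1024 is an assumption rather than something stated in this article.

```python
UNIT_BYTES = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}  # assumed binary units


def parse_perm_space(text: str) -> int:
    """Convert a quota string such as '10G' or '512 M' to bytes."""
    text = text.strip().upper()
    if len(text) < 2:
        raise ValueError(f"unsupported quota string: {text!r}")
    number, unit = text[:-1].strip(), text[-1]
    if unit not in UNIT_BYTES or not number.isdigit():
        raise ValueError(f"unsupported quota string: {text!r}")
    return int(number) * UNIT_BYTES[unit]


def insert_allowed(table_owner: str, new_bytes: int, used: dict, quota: dict) -> bool:
    """Charge the write to the table owner, whoever actually runs the INSERT."""
    return used.get(table_owner, 0) + new_bytes <= quota[table_owner]


MB = 1024 ** 2
quota = {"user1": parse_perm_space("1G")}
used = {"user1": 1000 * MB}

# user2 inserts into a table owned by user1: the space is charged to user1.
print(insert_allowed("user1", 20 * MB, used, quota))  # True: 1020 MB fits in 1 GB
print(insert_allowed("user1", 30 * MB, used, quota))  # False: 1030 MB exceeds 1 GB
```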

1.3.5 Exception rules

Exception rules prevent a single job from occupying a large amount of resources for a long time, which would reduce overall system throughput and affect the performance of other jobs. GaussDB (DWS) supports the following exception rules:

1.3.6 Query Filters

The GaussDB (DWS) query filter blocks jobs that have been added to the blacklist from executing. It has two main application scenarios:

Exception circuit breaker: once exception rules are configured, a query that triggers them repeatedly is automatically added to the blacklist when its trigger count reaches the threshold.
Emergency interception: when a job causes problems such as a core dump, a hang, or a large performance drop and must be avoided urgently, the job can be added to the blacklist for filtering.
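Both scenarios can be modeled with one small class. This is an illustrative sketch only: QueryFilter, its method names, and the use of a string query_id to identify a query are assumptions, not the GaussDB (DWS) interface.

```python
from collections import Counter


class QueryFilter:
    """Hypothetical model of a query blacklist with a circuit breaker."""

    def __init__(self, trigger_threshold: int):
        self.trigger_threshold = trigger_threshold
        self.trigger_counts = Counter()
        self.blacklist = set()

    def record_exception(self, query_id: str) -> None:
        """Circuit breaker: count each triggered exception rule per query."""
        self.trigger_counts[query_id] += 1
        if self.trigger_counts[query_id] >= self.trigger_threshold:
            self.blacklist.add(query_id)

    def block(self, query_id: str) -> None:
        """Emergency interception: blacklist a query immediately."""
        self.blacklist.add(query_id)

    def is_allowed(self, query_id: str) -> bool:
        return query_id not in self.blacklist


qf = QueryFilter(trigger_threshold=3)
for _ in range(3):
    qf.record_exception("heavy_join")
qf.block("crashing_query")

print(qf.is_allowed("heavy_join"))      # False: reached the trigger threshold
print(qf.is_allowed("crashing_query"))  # False: blocked manually
print(qf.is_allowed("normal_query"))    # True
```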

1.3.7 Resource Management Plan

The business that users care about most may differ from one time period to another, and the concurrency and resources each business needs may also vary over time, so the resource management configuration a user needs may change with the time of day. Resource management plans support automatic switching of resource management configurations at specified times, and users can create multiple resource management plans as needed. After a plan is created, its effective time is configured, and the plan is started, GaussDB (DWS) automatically switches the resource configuration at the plan's effective time.
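Time-based switching amounts to looking up the most recently started stage of a plan. The sketch below assumes a plan that repeats daily; the stage names, the settings keys, and the function name are hypothetical.

```python
from datetime import time

# Hypothetical plan: (stage start time, settings that apply from that time on).
PLAN = [
    (time(8, 0), {"stage": "day", "report_concurrency": 20, "batch_concurrency": 5}),
    (time(20, 0), {"stage": "night", "report_concurrency": 5, "batch_concurrency": 20}),
]


def active_settings(plan, now: time) -> dict:
    """Return the settings of the most recent stage that has started."""
    stages = sorted(plan, key=lambda stage: stage[0])
    # Before today's first start time, yesterday's last stage still applies.
    current = stages[-1][1]
    for start, settings in stages:
        if start <= now:
            current = settings
    return current


print(active_settings(PLAN, time(9, 30))["stage"])  # day
print(active_settings(PLAN, time(23, 0))["stage"])  # night
print(active_settings(PLAN, time(3, 0))["stage"])   # night (carried over from the previous evening)
```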

1.4 Capability Boundary (Notes)

First, resource management is not a panacea. Not every resource problem can be solved by it, and most resource shortage problems cannot. Second, resource management has two main purposes: one is to prevent the disorderly use of resources, and thereby prevent system-level failures and query pile-up; the other is to isolate resources between businesses, so that resource contention between services does not degrade the performance, and hence the QoS, of high-priority services. With these points in mind, it is easier to understand what resource management can and cannot do:

Resource management can limit business concurrency and shave traffic peaks. However, limiting concurrency also reduces throughput. If business concurrency keeps rising, limiting concurrency may leave the business unable to finish, while not limiting it will affect other businesses. In that case, other ways of improving business performance may be needed (scale-out, upgrade, or SQL optimization).
Resource management cannot improve the overall system throughput. On the contrary, resource isolation is likely to reduce it. For example, before resource management is applied, CPU usage stays above 90%, queries pile up badly, and the business cannot finish. In this case, isolating low-priority services with CPU quotas may greatly reduce CPU usage; however, the affected low-priority services may report large numbers of errors or fail to complete.
Resource management can isolate and control resources, but compared with running with no background load, business performance may still degrade after isolation. For example, some users apply CPU quota control only to low-priority services and not to high-priority ones. The two kinds of service may then share the same CPUs. Even if CPU usage stays below 100%, or even below 90%, more threads request the same CPU at the same time than when there is no background load, so CPU scheduling latency increases. To isolate the CPU's impact on performance, CPU quotas must be applied to both businesses. However, with a quota in place, performance may also be lower than when a business can use all CPUs with no background load; this is not discussed further here.
Resource management does not improve job performance. Some users have highly concurrent services with high resource requirements but few system resources, so the business may struggle to run normally even when executed serially. In this situation resource management basically cannot guarantee normal business operation; the best approach is scale-out, upgrade, or SQL optimization.

2. Application examples

2.1 Confirm the business scenario

The purpose of resource management is to isolate and control services. Therefore, before designing a resource management solution, it is first necessary to confirm user business scenarios: business classification, business priority, business type, business concurrency, business execution time period, and business peak time period.

Only by confirming the business classification can we determine how many resource pools should be divided;
Only by confirming the business priority can we determine which resource pool should reserve more resources and concurrency;
Only by confirming the business concurrency and business type can we determine whether concurrency control and exception rules are required;
Only after confirming the business execution time period can it be confirmed whether the resource management plan needs to be used;
Only by confirming the business peak time period can we confirm how many resources to reserve and the appropriate concurrency.

To avoid the negative impact of too many resource pools on overall throughput, and to keep the resource management solution simple, it is recommended to keep the number of resource pools below 5, preferably 3 or fewer. If there are few business categories (3 or fewer), you can create one resource pool per category for easier management. If there are many business categories, businesses with the same priority can be grouped into one category and managed by the same resource pool.

Business priorities generally fall into the following categories:

Performance-sensitive high-priority business: This type of business generally has a short execution time and low concurrency, is very sensitive to performance, and generally does not tolerate performance jitter;
Low-priority business without timeliness requirements: This type of business has no requirements on execution performance and generally only needs to produce results eventually. It includes, but is not limited to, peripheral application queries and self-service analysis;
Other businesses with certain performance requirements: This type of business generally has high concurrency and long execution times and has some performance requirements, but can tolerate performance fluctuations. Examples include standard reports and real-time warehousing (data loading).

2.2 Clarify control demands

After confirming the business scenario, the next step is to clarify the control requirements with the user. The confirmed business scenario already yields a preliminary resource management plan: how many resource pools to create, what limits to place on the concurrency and resource usage of low-priority business, whether to use a resource management plan, and so on.

Example:

Business scenario: The user's business includes warehousing (data loading), query, and external access. Warehousing and query use the same user (user1), and external access uses another user (user2). Warehousing has slightly lower priority than query and must not affect query performance. External access has very low priority: errors and long waits without results are acceptable. Warehousing concurrency is high and consumes a lot of CPU.

For the above business scenarios, there are the following basic demands and preliminary resource management solutions:

There are three types of business with different priorities, and each needs isolation and control, so 3 resource pools need to be created;
All businesses use user-created resource pools, and the resource pools perform concurrency control, so the single-CN concurrency limit (max_active_statements) can be set to a large value (300 or 500 is recommended);
External access has low priority, so give it a lower concurrency limit and fewer CPU cores, and disable short query acceleration so that its CPU usage is strictly controlled;
Warehousing and query use the same user, so query_band load identification is required to distinguish warehousing from query;
Warehousing has slightly lower priority than query and must not significantly affect query performance, so consider applying CPU quota control to both warehousing and query, and turn off short query acceleration at the same time;
The estimated memory of a query may deviate from its actual usage, and an overestimate may cause abnormal queuing, so it is recommended to set an upper limit on query estimated memory.
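The routing that this plan relies on can be sketched as a two-step lookup. The pool names and the query_band string are borrowed from the configuration section of this article; the rule that a matching query_band takes precedence over the user association, and the fallback pool name, are assumptions made for the sketch.

```python
from typing import Optional

# query_band -> resource pool (assumed to win over the user association when matched)
QUERY_BAND_ROUTES = {"Jobname=copy_upsert": "respool_upsert"}

# user -> resource pool
USER_ROUTES = {"user1": "respool_select", "user2": "respool_other"}

FALLBACK_POOL = "default_pool"  # assumed pool for users with no association


def route(user: str, query_band: Optional[str] = None) -> str:
    """Return the resource pool that a job would run in."""
    if query_band in QUERY_BAND_ROUTES:
        return QUERY_BAND_ROUTES[query_band]
    return USER_ROUTES.get(user, FALLBACK_POOL)


print(route("user1"))                         # respool_select (query)
print(route("user1", "Jobname=copy_upsert"))  # respool_upsert (warehousing)
print(route("user2"))                         # respool_other (external access)
```

This shows why user-resource pool routing alone is not enough here: warehousing and query share user1, so only the query_band can tell them apart.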

The resource pool configuration of the preliminary resource management plan is as follows:

Single CN concurrency upper limit: max_active_statements = 300/500 ??

Among them, "??" after the parameter value indicates that the parameter size is a preliminary estimate, and it is necessary to communicate with the user to confirm the final size of the parameter.

After forming a preliminary resource management plan, discuss it with the user and confirm the control requirements:

Resource pool: confirm whether the number of resource pools is appropriate, and whether the mapping between services and resource pools is appropriate (several services may map to one resource pool);
Concurrency control: whether each business needs concurrency control, and whether the concurrency limit is appropriate (the limit is hard to predict before resource management goes online; generally a relatively large value is used at first and then adjusted according to actual behavior after launch);
Memory control: whether the business needs memory control, and whether the memory size is reasonable (memory control is not needed when memory is sufficient and contention for memory between businesses is not serious; if the memory setting is too small, memory control may cause unnecessary queuing and reduce system throughput);
Query estimated memory limit: judge whether it is reasonable by comparing the estimated memory of jobs in the TopSQL history with the memory they actually used (a limit that is too small may hurt job performance, and one that is too large may cause abnormal queuing due to estimation deviation, so business performance and memory usage must be weighed together);
CPU control: use the CPU usage observed before resource management goes online to confirm whether there is a CPU bottleneck, and confirm whether new business is about to go online (if there is a CPU bottleneck, or a large amount of new business may go online later, it is recommended to configure CPU control in advance so that the new business does not severely affect high-priority business performance after it goes online);
Resource management plan: when different businesses peak at different times, a resource management plan may be needed to guarantee different businesses in different time periods;
Other control requirements: confirm whether the user has other control requirements, such as setting exception rules to prevent query pile-up and serious resource contention (even with resource isolation in place, bad SQL may still affect other jobs within the same business).

The above lists only the common control requirements to discuss; practical applications may raise others.

3. Configuring resource management

Take the resource management solution in the previous chapter as an example to illustrate how to configure the resource management solution.

3.1 Configuring resource pools

As shown in the figure below, take the respool_select resource pool as an example and add a resource pool by following the steps in "Adding a Resource Pool" in the User Guide. Enter "respool_select" as the name; for CPU resources, select the exclusive quota and set it to 50%; set the memory resource quota to 50%. Then click OK to finish adding the resource pool. For resource pool operation instructions, see: Add Resource Pool, Modify Resource Pool, Delete Resource Pool.

After the resource pool is added, the interface shown in the figure below is displayed. The resource configuration tab shows the parameters just configured. Short query acceleration is enabled by default, with no limit on the concurrency of simple statements. The middle of the page shows the exception rules associated with the resource pool. Two default exception rules are associated: the average CPU usage on a single DN must not exceed 50%, and operator spill to disk on a single DN must not exceed 1/10 of the data disk; a job that exceeds either limit reports an error and exits. The users associated with the resource pool are shown at the bottom. Click the Associate User button to associate the "query user" with the resource pool. Similarly, create the other two resource pools and associate the "external access user" with the respool_other resource pool.

Note: For resource pools that need to set custom exception rules, click Edit in the exception rules tab bar below to configure exception rules.

3.2 Short query acceleration configuration

Select the newly added resource pool respool_select in the resource pool drop-down list, click Edit on the upper right of the short query configuration, modify the simple statement concurrency to 150, and click Save after the modification is complete.

For the other two resource pools, short query acceleration needs to be disabled. After selecting the corresponding resource pool in the resource pool drop-down list, click the "Short query acceleration" switch in the short query configuration to disable short query acceleration.

3.3 Query estimated memory settings

The upper limit of query estimated memory does not support configuration on the interface for the time being, and can be directly modified by executing SQL through DAS:

3.4 Configure query_band

1. Modify the user's warehousing business so that it sets query_band in the warehousing session before the job runs and resets it afterwards: run SET query_band='Jobname=copy_upsert'; execute the business; then run RESET query_band; once the job has finished. Example:

postgres=> SET query_band='Jobname=copy_upsert';
SET
postgres=> INSERT INTO t1 SELECT generate_series(1,10000);
INSERT 0 10000
postgres=> RESET query_band;
RESET

2. Configure query_band so that the warehousing business is routed to the resource pool respool_upsert:

postgres=# SELECT * FROM gs_wlm_set_queryband_action('Jobname=copy_upsert', 'respool=respool_upsert');
 gs_wlm_set_queryband_action
-----------------------------
 t
(1 row)

3. Query the pg_queryband_action view to confirm that the query_band configuration is successful:

postgres=# select * from pg_queryband_action;
        qband        | respool_id |    respool     | priority | qborder
---------------------+------------+----------------+----------+---------
 Jobname=copy_upsert | 2147483648 | respool_upsert | medium   |      -1
(1 row)

4. While the warehousing job is running, query the TopSQL real-time view to confirm that the query_band takes effect:

postgres=# SELECT username, query_band, resource_pool, substr(query, 1, 30) FROM pgxc_wlm_session_statistics WHERE query_band IS NOT NULL;
 username |     query_band      | resource_pool  |             substr
----------+---------------------+----------------+--------------------------------
 user_elk | Jobname=copy_upsert | respool_upsert | INSERT INTO t1 SELECT generate
(1 row)

3.5 Other configurations

1. If you need a resource management plan, configure it by referring to the resource management plan operation instructions.

2. GaussDB (DWS) 8.2.1 and later versions support network resource management and control. Assuming that the network bandwidth weight ratio of the three resource pools is respool_select:respool_upsert:respool_other = 5:4:1, use DAS to execute the following SQL to configure network management and control:

ALTER RESOURCE POOL respool_select WITH(WEIGHT=5);
ALTER RESOURCE POOL respool_upsert WITH(WEIGHT=4);
ALTER RESOURCE POOL respool_other WITH(WEIGHT=1);
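To see what a 5:4:1 weight ratio means when all three pools are sending at once, here is a generic sketch of deficit weighted round robin (the DWRR part of SP+DWRR; the strict-priority part is omitted). It is not the GaussDB (DWS) implementation, and the packet size, quantum, and function name are assumptions.

```python
from collections import deque


def dwrr_schedule(queues: dict, weights: dict, quantum: int, rounds: int) -> dict:
    """Run deficit weighted round robin and return the bytes sent per pool.

    queues:  pool name -> deque of packet sizes (bytes) waiting to be sent
    weights: pool name -> integer weight; every round a backlogged pool
             earns weight * quantum bytes of sending credit
    """
    deficit = {pool: 0 for pool in queues}
    sent = {pool: 0 for pool in queues}
    for _ in range(rounds):
        for pool, packets in queues.items():
            if not packets:
                deficit[pool] = 0  # an empty queue may not save up credit
                continue
            deficit[pool] += weights[pool] * quantum
            while packets and packets[0] <= deficit[pool]:
                size = packets.popleft()
                deficit[pool] -= size
                sent[pool] += size
    return sent


weights = {"respool_select": 5, "respool_upsert": 4, "respool_other": 1}
queues = {pool: deque([1000] * 100) for pool in weights}  # every pool is backlogged
print(dwrr_schedule(queues, weights, quantum=1000, rounds=10))
# {'respool_select': 50000, 'respool_upsert': 40000, 'respool_other': 10000}
```

With every pool backlogged, the bytes sent follow the 5:4:1 ratio; a pool with nothing to send simply gives up its turn.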

See the figure below for a configuration example:

Assume that an external access job should be degraded once it has run for more than 20 minutes and its network bandwidth usage exceeds 128 MB:

CREATE EXCEPT RULE bandwidth_rule1 WITH(bandwidth=128, ELAPSEDTIME=1200, action='penalty'); -- Create the exception rule
ALTER RESOURCE POOL respool_other WITH(EXCEPT_RULE='bandwidth_rule1'); -- Associate the exception rule with resource pool respool_other
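The effect of such a rule can be modeled in a few lines. The sketch below assumes that a rule fires only when all of its thresholds are exceeded, as in the scenario just described; the class and function names are hypothetical and the field units are taken from the example values.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ExceptRule:
    name: str
    bandwidth_mb: int    # network bandwidth threshold (MB)
    elapsed_time_s: int  # run time threshold (seconds)
    action: str          # "penalty" means the job is degraded


def evaluate(rule: ExceptRule, elapsed_s: float, bandwidth_mb: float) -> Optional[str]:
    """Return the rule's action when every threshold is exceeded, else None."""
    if elapsed_s > rule.elapsed_time_s and bandwidth_mb > rule.bandwidth_mb:
        return rule.action
    return None


rule = ExceptRule("bandwidth_rule1", bandwidth_mb=128, elapsed_time_s=1200, action="penalty")

print(evaluate(rule, elapsed_s=600, bandwidth_mb=300))   # None: has not run long enough
print(evaluate(rule, elapsed_s=1500, bandwidth_mb=64))   # None: bandwidth is under the limit
print(evaluate(rule, elapsed_s=1500, bandwidth_mb=300))  # penalty
```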

See the figure below for a configuration example:
