The principles and applications of GaussDB (DWS) user monitoring, explained in detail

Abstract: This article will focus on the principle and application of user monitoring.

This article is shared from the Huawei Cloud Community post "GaussDB (DWS) Monitoring Tool Guide (2) User-level Monitoring", author: Little black claw behind the scenes.

Foreword

Resource monitoring is an important part of operation and maintenance, and of the entire product life cycle. It allows faults to be detected early, before they escalate, and provides detailed data afterwards for tracing and locating problems. The resource monitoring system of GaussDB (DWS) is divided into job-level monitoring, user monitoring, and resource pool monitoring. This article focuses on the principles and applications of user monitoring.

1. GaussDB (DWS) user system

For a product, the simplest user classification is a three-tier system of ordinary users, system administrators, and super administrators. Ordinary users are the most basic users. System administrators hold part of the system's administrative authority and can change the permissions of ordinary users. The super administrator has the highest level of authority and holds all permissions, but the account should not be used lightly.

1.1 Introduction to the two-tier user mechanism

For an enterprise, database operations are also divided by department. Each department has its own tables and its own priority. For this reason, the user system designed by GaussDB (DWS) is divided into two layers:

The first layer is group users. Users at this layer are associated with group resource pools and are not used as users for executing jobs.

The second layer is business users. Users at this layer are associated with business resource pools and can be used as users who execute jobs.

The resources available to each group user can be set individually, and so can the resources of each business user. Compared with the previous single-layer user mechanism, the two-layer mechanism allows finer-grained control over user resources.

Example:

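# Create the cgroup control groups.
gs_ssh -c "gs_cgroup -c -S ClassG1 -G wn1"
-- Create the group resource pool resource_pool_a and bind it to the ClassG1 control group.
CREATE RESOURCE POOL resource_pool_a WITH (control_group = 'ClassG1');
-- Create the business resource pool resource_pool_a1 and bind it to the wn1 control group.
CREATE RESOURCE POOL resource_pool_a1 WITH (control_group = 'ClassG1:wn1');
-- Create a group user associated with the group resource pool. Here the group user "tenant_a" is associated with the group resource pool "resource_pool_a".
CREATE USER tenant_a RESOURCE POOL 'resource_pool_a' PASSWORD '********';
-- Create a business user associated with a business resource pool and a group user. Here the business user "tenant_a1" is associated with the business resource pool "resource_pool_a1" and the group user "tenant_a".
CREATE USER tenant_a1 RESOURCE POOL 'resource_pool_a1' USER GROUP 'tenant_a' PASSWORD '********';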

1.2 Granting permissions

When ordinary users need access to a table, permissions can be granted with the GRANT syntax and withdrawn with REVOKE. These operations must be performed by a user with sysadmin permissions. For example:

-- Grant SELECT on the lineitem table in the public schema to user_1:
GRANT SELECT ON public.lineitem TO user_1;
-- Revoke SELECT on the lineitem table in the public schema from user_1:
REVOKE SELECT ON public.lineitem FROM user_1;
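
To confirm the effect of the statements above, one option is the has_table_privilege function that GaussDB (DWS) inherits from PostgreSQL. The following is a sketch under that assumption, reusing the user_1 and public.lineitem names from the example:

```sql
-- Returns true if user_1 currently holds SELECT on public.lineitem
-- (directly or through a role or PUBLIC grant), false otherwise.
SELECT has_table_privilege('user_1', 'public.lineitem', 'SELECT');
```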

2. User resource monitoring

2.1 Objectives

Data warehouse products generally have multiple users operating on the database at the same time, each consuming a different amount of resources. To give an extreme example, a single user issuing slow SQL can degrade the performance of the whole cluster. In that case, we need to determine which user issued the job, then find the corresponding slow SQL and deal with it.

For administrators, user monitoring shows the performance status of the system from the users' perspective, helps discover and resolve resource bottlenecks and faults in time, and improves system reliability and stability. It also breaks down resource usage across the cluster by user, so administrators can identify which users exceed the standard and then restrict them.

2.2 Monitoring dimensions

User monitoring covers CPU, memory, storage space, temporary space, operator spill space, disk I/O, and network. By monitoring these resources, administrators can understand system load, process running status, disk space usage, network bandwidth utilization, and related information. This helps them discover system abnormalities early and act in time to avoid system crashes or service interruptions.

Example usage:

postgres=# SELECT * FROM PG_TOTAL_USER_RESOURCE_INFO;
     username     | used_memory | total_memory | used_cpu | total_cpu | used_space | total_space | used_temp_space | total_temp_space | used_spill_space | total_spill_space | read_kbytes | write_kbytes | read_counts | write_counts | read_speed | write_speed | send_speed | recv_speed
------------------+-------------+--------------+----------+-----------+------------+-------------+-----------------+------------------+------------------+-------------------+-------------+--------------+-------------+--------------+------------+-------------+------------+------------
 user_grp_1       |           0 |         4928 |        0 |        16 |    1573880 |          -1 |               0 |               -1 |                0 |                -1 |           0 |            0 |           0 |            0 |          0 |           0 |          0 |          0
 perfadm          |           0 |            0 |        0 |         0 |          0 |          -1 |               0 |               -1 |                0 |                -1 |           0 |            0 |           0 |            0 |          0 |           0 |          0 |          0
 user_normal      |           0 |        24643 |        0 |        16 |          0 |          -1 |               0 |               -1 |                0 |                -1 |           0 |            0 |           0 |            0 |          0 |           0 |          0 |          0
 usr1             |           0 |        69763 |        0 |        40 |          0 |          -1 |               0 |               -1 |                0 |                -1 |           0 |            0 |           0 |            0 |          0 |           0 |          0 |          0
 logical_cluster1 |           0 |        24643 |        0 |        16 |    1834424 |          -1 |               0 |               -1 |                0 |                -1 |           0 |            0 |           0 |            0 |          0 |           0 |          0 |          0
 user_2           |           0 |          985 |        0 |        16 |          0 |          -1 |               0 |               -1 |                0 |                -1 |           0 |            0 |           0 |            0 |          0 |           0 |          0 |          0
 user_1           |           0 |         3942 |        0 |        16 |    1573880 |          -1 |               0 |               -1 |                0 |                -1 |           0 |            0 |           0 |            0 |          0 |           0 |          0 |          0
 logical_cluster2 |           0 |        45120 |        0 |        24 |          0 |          -1 |               0 |               -1 |                0 |                -1 |           0 |            0 |           0 |            0 |          0 |           0 |          0 |          0
 user_default     |           0 |        24643 |        0 |        16 |          0 |          -1 |               0 |               -1 |                0 |                -1 |           0 |            0 |           0 |            0 |          0 |           0 |          0 |          0
 wjx              |           0 |        24643 |        0 |        16 |          0 |          -1 |               0 |               -1 |                0 |                -1 |           0 |            0 |           0 |            0 |          0 |           0 |          0 |          0
(10 rows)
postgres=# select * from GS_WLM_USER_RESOURCE_HISTORY;
     username     |           timestamp           | used_memory | total_memory | used_cpu | total_cpu | used_space | total_space | used_temp_space | total_temp_space | used_spill_space | total_spill_space | read_kbytes | write_kbytes | read_counts | write_counts | read_speed | write_speed | send_speed | recv_speed
------------------+-------------------------------+-------------+--------------+----------+-----------+------------+-------------+-----------------+------------------+------------------+-------------------+-------------+--------------+-------------+--------------+------------+-------------+------------+------------
 user_grp_1       | 2023-05-22 16:51:03.380482+08 |           0 |         4928 |        0 |        16 |    1573880 |          -1 |               0 |               -1 |                0 |                -1 |           0 |            0 |           0 |            0 |          0 |           0 |          0 |          0
 wjx              | 2023-05-22 16:51:03.380482+08 |           0 |        24643 |        0 |        16 |          0 |          -1 |               0 |               -1 |                0 |                -1 |           0 |            0 |           0 |            0 |          0 |           0 |          0 |          0
 user_default     | 2023-05-22 16:51:03.380482+08 |           0 |        24643 |        0 |        16 |          0 |          -1 |               0 |               -1 |                0 |                -1 |           0 |            0 |           0 |            0 |          0 |           0 |          0 |          0
 logical_cluster2 | 2023-05-22 16:51:03.380482+08 |           0 |        45120 |        0 |        24 |          0 |          -1 |               0 |               -1 |                0 |                -1 |           0 |            0 |           0 |            0 |          0 |           0 |          0 |          0
 user_1           | 2023-05-22 16:51:03.380482+08 |           0 |         3942 |        0 |        16 |    1573880 |          -1 |               0 |               -1 |                0 |                -1 |           0 |            0 |           0 |            0 |          0 |           0 |          0 |          0
 user_2           | 2023-05-22 16:51:03.380482+08 |           0 |          985 |        0 |        16 |          0 |          -1 |               0 |               -1 |                0 |                -1 |           0 |            0 |           0 |            0 |          0 |           0 |          0 |          0
 logical_cluster1 | 2023-05-22 16:51:03.380482+08 |           0 |        24643 |        0 |        16 |    1834424 |          -1 |               0 |               -1 |                0 |                -1 |           0 |            0 |           0 |            0 |          0 |           0 |          0 |          0
 usr1             | 2023-05-22 16:51:03.380482+08 |           0 |        69763 |        0 |        40 |          0 |          -1 |               0 |               -1 |                0 |                -1 |           0 |            0 |           0 |            0 |          0 |           0 |          0 |          0
 user_normal      | 2023-05-22 16:51:03.380482+08 |           0 |        24643 |        0 |        16 |          0 |          -1 |               0 |               -1 |                0 |                -1 |           0 |            0 |           0 |            0 |          0 |           0 |          0 |          0

2.3 Monitoring principle

While a job runs, the kernel accumulates the relevant resource fields according to the user information carried by the job, and periodically writes the summarized information into the user monitoring history table. The following notes apply when using this feature:

2.3.1 Related GUC parameters

enable_logical_io_statistics: Switch for the I/O-related values in user resource monitoring and resource pool resource monitoring. The default is on. When enabled, the I/O-related fields in user monitoring (read_kbytes, write_kbytes, read_counts, write_counts, read_speed, and write_speed) are collected.

enable_user_metric_persistent: Whether to enable dumping of historical user and resource pool monitoring data. When enabled, the monitoring records are dumped into the history table.

user_metric_retention_time: Sets the number of days that historical user resource monitoring data is retained. The default is 7 days.
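
A minimal sketch for checking how these parameters are currently set, assuming the standard SHOW statement is available in your session. The parameter names are the three listed above; the values returned depend on the cluster.

```sql
-- Is I/O accounting enabled for user and resource pool monitoring?
SHOW enable_logical_io_statistics;
-- Are monitoring records being dumped into the history table?
SHOW enable_user_metric_persistent;
-- How many days of history are retained?
SHOW user_metric_retention_time;
```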

2.3.2 Relevant instructions

User monitoring currently covers the CPU, I/O, and memory usage of all jobs in both the fast and slow lanes.

When a user runs the query on a CN, the result shows the cumulative sum of resource pool usage and resource limits across all DNs. When the query runs on a DN, only the resource pool usage and resource limits of that DN are counted.

The data collection period on a DN is 5s, and the CN collects information from the DNs every 5s. An auxiliary thread automatically persists the user monitoring data every 30s.

Resource monitoring is currently not performed for the initial administrative user: this user is the super administrator, and there is no need to monitor it.
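
As a sketch of how the persisted data might be read back, the query below lists one user's samples from the past hour. The view and column names are those shown in the example output in section 2.2; the user name user_1 and the one-hour window are placeholders chosen for illustration.

```sql
-- Recent persisted samples for a single user, oldest first.
-- 'user_1' and the one-hour window are illustrative placeholders.
SELECT h.username,
       h.timestamp,
       h.used_memory,
       h.used_cpu,
       h.read_speed,
       h.write_speed
FROM GS_WLM_USER_RESOURCE_HISTORY AS h
WHERE h.username = 'user_1'
  AND h.timestamp >= now() - interval '1 hour'
ORDER BY h.timestamp;
```

With persistence running every 30s, a one-hour window should hold roughly 120 samples per user.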

2.4 Case Study

2.4.1 When memory is unavailable, you can use this view to check which user is using too much memory.
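
A minimal sketch of such a check, using only columns that appear in the PG_TOTAL_USER_RESOURCE_INFO output in section 2.2. The memory_pct alias and the limit of 5 rows are illustrative choices, and NULLIF guards against users whose total_memory is 0.

```sql
-- Users ranked by current memory usage, highest first.
SELECT username,
       used_memory,
       total_memory,
       -- Share of the user's memory limit in use; NULL when total_memory is 0.
       round(used_memory * 100.0 / NULLIF(total_memory, 0), 2) AS memory_pct
FROM PG_TOTAL_USER_RESOURCE_INFO
ORDER BY used_memory DESC
LIMIT 5;
```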

2.4.2 The view can also be used to monitor a user's network usage, such as network send and receive rates.
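
For example, a sketch that lists the users currently generating the most network traffic, assuming only the send_speed and recv_speed columns shown in the view output in section 2.2 (the limit of 5 rows is an illustrative choice):

```sql
-- Users ranked by combined network send and receive rate, highest first.
SELECT username,
       send_speed,
       recv_speed
FROM PG_TOTAL_USER_RESOURCE_INFO
ORDER BY send_speed + recv_speed DESC
LIMIT 5;
```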

 


Origin my.oschina.net/u/4526289/blog/9104847