GaussDB Technology Interpretation Series | Operation and Maintenance Autonomous Driving Exploration

This article is shared from the Huawei Cloud Community "DTCC 2023 Expert Interpretation | GaussDB Technology Interpretation Series Operation and Maintenance Autonomous Driving Exploration", author: GaussDB database.

Recently, at the GaussDB "Five Highs and Two Easy" core technology session, "Giving the World a Better Choice", held at the 14th China Database Technology Conference (DTCC 2023), Li Dong, Director of Operations and Maintenance R&D for Huawei Cloud Database, explained in detail GaussDB's exploration and practice of autonomous driving for its operation and maintenance system.


The following is the transcript of the speech:

Hello everyone, I am Li Dong from Huawei Cloud GaussDB. As enterprise digital transformation enters deep water, database systems are becoming more and more complex, and the number of databases that operation and maintenance teams look after keeps growing. Traditional tool-based operation and maintenance can no longer meet current requirements, and database operation and maintenance is gradually becoming intelligent.

How to better perceive and predict database failures, and then carry out intelligent diagnosis and adaptive recovery, is what we have been exploring. Next, I will share GaussDB's exploration and practice in autonomous driving for operations and maintenance from four angles: the challenges of cloud database operation and maintenance, the GaussDB operation and maintenance architecture, rapid fault detection, and rapid diagnosis.

1. What challenges does cloud database operation and maintenance face?

As enterprises move their databases to the cloud, the challenges faced by cloud database operation and maintenance become more complex. Databases may be deployed on diverse infrastructures such as bare metal, virtual machines, and containers, and the failure scenarios faced by different infrastructures are also diverse.

If we encounter a sub-health problem such as performance jitter, the usual approach is to analyze it comprehensively across multiple layers: application, database, operating system, compute, storage, and network. Failures may occur at any layer, and failures at different layers may produce the same sub-health symptom. For example, a slow SQL statement may be caused by a disk fault or by network jitter, but to the customer it simply appears as slow SQL.

If operation and maintenance capabilities are insufficient, it is difficult to locate and solve sub-health problems in a short time, because handling them generally involves three challenges:

First, sub-health problems cannot be accurately predicted and monitored. This is a matter of both too little and too much. "Too little" refers to the lack of sub-health monitoring at each layer. For example, a complete disk failure is easy to detect, but if a disk has bad blocks or has become slow, we have no way to detect it quickly. "Too much" refers to the fact that monitoring and alarms exist at every layer but the alarms are not correlated with one another, so it is difficult to pick out the real problem from them and to pinpoint the current sub-health issue.

Second, sub-health problems cannot be diagnosed quickly once they are discovered. For problems inside the database, diagnosis and decision-making often rely on DBA experience, which demands strong operation and maintenance skills and cannot guarantee efficiency.

Third, recovery capabilities are insufficient. After the root cause is diagnosed we still need to recover, but recovery measures such as throttling and overload control are lacking. Measures such as database parameter tuning and bad SQL optimization require deep database expertise and accumulated experience.

2. GaussDB overall operation and maintenance architecture

How does GaussDB handle these problems? Let's take a look at the overall operation and maintenance architecture of GaussDB.

GaussDB unified operation and maintenance platform is divided into 5 parts:

The first part is instance operation and maintenance, which is mainly responsible for the management of the GaussDB cluster life cycle, such as creation, change, backup, recovery and parameter adjustment.

The second part is disaster recovery management, which mainly provides streaming disaster recovery, intra-city disaster recovery, and two-site, three-center disaster recovery capabilities.

The third part is intelligent operation and maintenance, which is the most critical part of the GaussDB operation and maintenance system. It uses the capabilities of AI4DB to create an autonomous operation and maintenance system. It is divided into four layers:

The first layer is the data collection layer, which is responsible for collecting the monitoring data of each layer and the operation instructions issued by the upper layer.

The second layer is the data computing layer, which is responsible for caching, persisting, and processing the collected data, and includes a time series database, an algorithm model library, a fault rule library, and so on.

The third layer is the autonomous service layer, including SQL diagnosis and tuning, security, database operation and maintenance and other capabilities.

The fourth layer is panoramic monitoring, including alarm monitoring, full SQL and other capabilities.

GaussDB creates a full-link, end-to-end autonomous driving experience by building an integrated intelligent operation and maintenance center.


3. Rapid detection of GaussDB faults

Panoramic database monitoring and real-time perception of system operating status

Let's take a look at the GaussDB panoramic monitoring system. Through panoramic monitoring, we can see which of the database clusters we maintain are normal, which are abnormal, and which are in a sub-healthy state. For example, the alarm statistics module shows the current alarm status of the entire cluster, so emergency and important alarms can be handled first. You can also see resource usage risks: the resource usage prediction capability gives early warning of upcoming resource bottlenecks, such as insufficient disk space or CPU, so that we can handle them and recover in time. Intelligent diagnosis analyzes the performance bottlenecks of the current database. The dashboard also supports custom monitoring: users can select their key business databases and customize key monitoring indicators to safeguard key businesses along custom dimensions.

The panoramic dashboard relies first of all on full-link, all-round monitoring. Observability is very important for database operation and maintenance. GaussDB full-link monitoring covers the hardware, OS, and DB layers, builds full-link capabilities from collection, sending, display, and analysis through to inspection, and connects the entire monitoring chain from hardware to operating system to database. For example, on the cloud, hardware is maintained by a different department, and the database department can only see the database cluster. How, then, do we detect that the hardware is in sub-health or that the operating system has a problem? Here we have to work with the hardware department to open up a notification mechanism that runs from hardware to operating system to database. If a sub-health problem occurs at one of these layers, the fault information can be sent to the relevant operation and maintenance department for resolution.


Full SQL data collection and analysis

Next, let's take a look at the full SQL collection capability. Compared with the traditional methods of obtaining full SQL through traffic capture or logs, GaussDB builds full SQL insight in a lightweight way. The collector establishes a memory buffer channel with the database, collects SQL information directly from that channel, and transfers it to external devices. This process requires no disk I/O and has minimal impact on performance.
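To make the memory channel idea concrete, here is a minimal Python sketch. It is not GaussDB's implementation: the class name `SqlRecordChannel`, the fixed capacity, and the drop-oldest policy when the collector falls behind are all assumptions chosen for illustration. The sketch is single-threaded and only shows the producer and consumer sides sharing a bounded in-memory buffer without touching disk.

```python
from collections import deque


class SqlRecordChannel:
    """Hypothetical bounded in-memory channel; oldest records are dropped when full."""

    def __init__(self, capacity):
        self._buffer = deque(maxlen=capacity)
        self.dropped = 0  # number of records overwritten before being collected

    def publish(self, record):
        # Database side: append only, never blocks and never writes to disk.
        if len(self._buffer) == self._buffer.maxlen:
            self.dropped += 1
        self._buffer.append(record)

    def drain(self, max_records):
        # Collector side: take up to max_records in arrival order.
        batch = []
        while self._buffer and len(batch) < max_records:
            batch.append(self._buffer.popleft())
        return batch


channel = SqlRecordChannel(capacity=3)
for sql_id in range(1, 6):
    channel.publish({"sql_id": sql_id})
first_batch = channel.drain(max_records=2)
```

Dropping the oldest records keeps the database side from ever waiting on the collector, at the cost of losing data under sustained overload. A real collector would have to choose that trade-off deliberately.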

In traditional collection scenarios, enabling full SQL can affect business performance by about 30%, so under normal circumstances users do not dare to enable it. GaussDB's full SQL minimizes this risk. The full SQL solution provided by GaussDB has the following four characteristics:

  • Low risk. GaussDB reads all the log information through memory and then transfers it to a third party, which costs no more than 5% of database performance.
  • Low cost. GaussDB does not write the full SQL data to local disk but transfers it directly to a third party, such as OBS on the cloud, and supports compressing the full SQL data, which keeps costs relatively low.
  • High extensibility. GaussDB full SQL collects a default set of information. If this does not meet current requirements, it can be extended.
  • High security. When full SQL needs to be transferred to a third-party device, GaussDB desensitizes the data by default during the transfer to ensure data security.
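The desensitization step can be illustrated with a short Python sketch. The talk does not describe GaussDB's actual masking rules, so the function name `desensitize_sql` and the two regular expressions below are assumptions: they simply replace string and numeric literals with `?` so that statement shapes survive while the values do not.

```python
import re

# Hypothetical masking rules: quoted strings first, then standalone numbers.
_STRING_LITERAL = re.compile(r"'(?:[^']|'')*'")
_NUMBER_LITERAL = re.compile(r"\b\d+(?:\.\d+)?\b")


def desensitize_sql(sql: str) -> str:
    """Replace literal values in a SQL statement with '?' placeholders."""
    masked = _STRING_LITERAL.sub("?", sql)
    masked = _NUMBER_LITERAL.sub("?", masked)
    return masked


masked = desensitize_sql("SELECT * FROM users WHERE name = 'Alice' AND age > 30")
# masked == "SELECT * FROM users WHERE name = ? AND age > ?"
```

Identifiers that contain digits, such as `t1`, are left untouched because the number pattern only matches at word boundaries. A production implementation would more likely work on the parsed statement than on raw text.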

Through full-link monitoring, GaussDB supports fast automatic inspection. It inspects availability, reliability, performance, and resource usage trends and outputs inspection reports that show current problems and give corresponding suggestions. With these full-link monitoring and automatic inspection capabilities, GaussDB achieves rapid perception.

4. GaussDB intelligent diagnosis

Let's take a look at what capabilities GaussDB has in fault diagnosis.

  • SQL self-diagnosis

We perform SQL diagnosis with an offline + online approach. First, we collect scenarios where slow SQL may exist, including the SQL text, the execution time, and the numbers of rows scanned and returned, together with key indicators of the database, operating system, and resources. We then run offline training on the business SQL information and key indicator information, and finally obtain a slow SQL feature library.

What is this feature library used for? When we encounter slow SQL in a production environment, we can perform online inference based on the feature library and the KNN algorithm to diagnose the root cause of the slow SQL and then give optimization suggestions for that root cause.
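The online inference step can be sketched as a plain k-nearest-neighbors vote. Everything specific in this Python example is an assumption for illustration: the three normalized features, the sample entries in `FEATURE_LIBRARY`, the root-cause labels, and the `diagnose` function name. The talk only states that a feature library and KNN are used.

```python
import math
from collections import Counter

# Hypothetical feature library produced by offline training. Each entry maps
# (normalized scan-to-return row ratio, CPU utilization, disk I/O wait)
# to a labeled root cause. All values are invented for this sketch.
FEATURE_LIBRARY = [
    ((0.95, 0.30, 0.10), "missing index"),
    ((0.90, 0.35, 0.15), "missing index"),
    ((0.85, 0.25, 0.20), "missing index"),
    ((0.10, 0.95, 0.10), "CPU contention"),
    ((0.15, 0.90, 0.05), "CPU contention"),
    ((0.10, 0.20, 0.90), "slow disk"),
    ((0.20, 0.15, 0.95), "slow disk"),
]


def diagnose(features, k=3):
    """Return the majority root cause among the k nearest library entries."""
    nearest = sorted(FEATURE_LIBRARY, key=lambda entry: math.dist(entry[0], features))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


root_cause = diagnose((0.12, 0.18, 0.88))  # a slow SQL observed in production
```

In this toy data the sample sits next to the two "slow disk" entries, so they win the vote. With real data the features would need consistent scaling, since Euclidean distance is sensitive to units.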

  • SQL full link analysis

Customers have told us: "Our SQL usually executes very quickly and returns within one second, so why has it occasionally been timing out recently?" By checking the status of each layer and analyzing the client, database CN, DN, operating system, and so on one after another, business staff may eventually find that the occasional problem occurs on a particular shard. This kind of diagnosis is very inefficient.

The SQL full-link monitoring and analysis capability provided by GaussDB solves this problem well. It includes full-link tracing and aggregate analysis. It looks up the database SQLID by business SQL keywords, client traceID, or other conditions, tracks how the SQL is parsed and how long it executes in the database cluster, including the time spent on forwarding and data aggregation within the cluster, and then traces the source of the problem.
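The aggregation idea can be shown with a small Python sketch that sums the time one traced statement spends on each node and picks the slowest node. The record layout `(trace_id, node, stage, elapsed_ms)`, the sample values, and the function names are hypothetical. They are not the format GaussDB stores.

```python
from collections import defaultdict

# Hypothetical trace records: (trace_id, node, stage, elapsed_ms).
SPANS = [
    ("t-1", "CN1", "parse", 2),
    ("t-1", "CN1", "dispatch", 3),
    ("t-1", "DN1", "execute", 40),
    ("t-1", "DN2", "execute", 1850),
    ("t-1", "DN3", "execute", 45),
    ("t-1", "CN1", "aggregate", 10),
]


def time_by_node(spans, trace_id):
    """Total elapsed milliseconds per node for one traced statement."""
    totals = defaultdict(int)
    for tid, node, _stage, elapsed_ms in spans:
        if tid == trace_id:
            totals[node] += elapsed_ms
    return dict(totals)


def slowest_node(spans, trace_id):
    """Node that accounts for the most time in the given trace."""
    totals = time_by_node(spans, trace_id)
    return max(totals, key=totals.get)


per_node = time_by_node(SPANS, "t-1")
suspect = slowest_node(SPANS, "t-1")
```

Here the per-node totals point straight at `DN2`, which is the kind of answer the layer-by-layer manual analysis described above takes much longer to reach.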

  • Multidimensional indicator correlation analysis

A large number of indicators need to be monitored during database operation and maintenance. When one or more key indicators become abnormal, operation and maintenance personnel need to locate the root cause quickly and accurately in order to decide on the next step. With so many indicators, however, the work of filtering information is huge, so an efficient tool is needed. We know that some database indicators are strongly correlated. When an abnormality occurs, the directional correlation algorithm compares the indicators over the abnormal time period and, based on correlation strength, selects the indicators that are related to the key indicator. The current detection algorithm supports spikes, continuous growth, drift, periodicity, and other scenarios, which helps operation and maintenance personnel locate problems quickly, reduces their workload, and helps identify the root cause of the problem.
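As a rough illustration of correlation-based filtering, the Python sketch below ranks candidate indicators by the absolute Pearson correlation with a key indicator over the abnormal window. Pearson correlation is a stand-in chosen for this sketch, since the talk does not specify how the directional correlation algorithm is computed. The indicator names, the sample series, and the 0.8 threshold are also assumptions.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat series carries no correlation signal
    return cov / math.sqrt(var_x * var_y)


def related_metrics(key_series, candidates, threshold=0.8):
    """Candidates whose |correlation| with the key indicator meets the threshold."""
    scored = {name: pearson(key_series, series) for name, series in candidates.items()}
    strong = [(name, r) for name, r in scored.items() if abs(r) >= threshold]
    return sorted(strong, key=lambda item: abs(item[1]), reverse=True)


# Hypothetical samples taken over the same abnormal time window.
p99_latency_ms = [10, 12, 11, 50, 55, 52]
candidates = {
    "disk_io_wait": [1, 1, 2, 9, 10, 9],
    "connections": [100, 101, 99, 100, 102, 100],
    "cache_hit_ratio": [0.99, 0.98, 0.99, 0.60, 0.55, 0.58],
}
related = related_metrics(p99_latency_ms, candidates)
```

With these numbers, disk I/O wait and cache hit ratio track the latency jump (the latter negatively), while the connection count does not, so only the first two are kept for further inspection.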

  • Trend forecast

In daily system operation, maintenance, and fault handling, load changes often reflect the system's current sub-health and the impact of faults. Traditional component-level indicator monitoring and alarms struggle to detect anomalies in time. By monitoring key indicators at the instance level, GaussDB uses historical data and algorithms such as time series prediction and anomaly detection to forecast golden KPI indicators, discover anomalies, and then remind users to take action to avoid the serious consequences of abnormal situations.
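A very simple form of trend forecasting is to fit a straight line to recent samples and estimate when it will cross a limit. The Python sketch below does this with ordinary least squares. It is only an illustration under stated assumptions: the talk does not say which prediction models GaussDB uses, and the function names, the disk-usage samples, and the 90% limit are invented here. At least two samples are required.

```python
def fit_line(values):
    """Ordinary least-squares fit of y = slope * t + intercept for t = 0..n-1."""
    n = len(values)
    mean_t = (n - 1) / 2
    mean_y = sum(values) / n
    cov = sum((t - mean_t) * (y - mean_y) for t, y in enumerate(values))
    var_t = sum((t - mean_t) ** 2 for t in range(n))
    slope = cov / var_t
    return slope, mean_y - slope * mean_t


def steps_until(values, limit):
    """Future samples before the fitted trend reaches `limit`, or None if not rising."""
    slope, intercept = fit_line(values)
    if slope <= 0:
        return None
    crossing_t = (limit - intercept) / slope
    return max(0.0, crossing_t - (len(values) - 1))


# Hypothetical daily disk usage in percent.
disk_usage = [50, 52, 54, 56, 58]
days_left = steps_until(disk_usage, limit=90)
```

With usage growing two points per day from 58%, the fitted line reaches 90% in 16 more days, which is the kind of early warning the resource prediction capability is meant to give. Real workloads are rarely this linear, so seasonal or more robust models would be needed in practice.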

  • Index recommendation

When application developers optimize SQL, index optimization is a key part of the work. However, performance analysis and tuning methods are complex and have a high barrier to practice, which makes SQL optimization challenging.

The core method of index recommendation is based on native lexical and syntactic analysis: it analyzes the terms and predicates in the query statement and then combines field selectivity, aggregation conditions, multi-table join relationships, and so on to output index suggestions. GaussDB provides an index recommendation function that gives a list of recommended indexes together with the positive and negative SQL benefits of each index, identifies redundant and unused indexes in the current database, and improves query speed. GaussDB also provides an optimizer evaluation capability, which offers virtual indexes: without actually creating an index, a virtual index is used to evaluate whether a recommended index is appropriate. By continuously optimizing the index configuration, drift in users' workloads can be handled, and suboptimal and redundant indexes can be discovered in time to avoid failures.
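Two of the ideas above, ranking predicate columns by selectivity and spotting redundant indexes, can be sketched in a few lines of Python. This is a simplified heuristic and not GaussDB's recommendation algorithm. The statistics format `(distinct_values, row_count)`, the threshold, the prefix rule for redundancy, and all table, column, and index names are assumptions made for this example.

```python
def rank_candidates(predicate_columns, column_stats, min_selectivity=0.01):
    """Rank predicate columns by selectivity = distinct values / row count."""
    ranked = []
    for col in predicate_columns:
        distinct, rows = column_stats[col]
        selectivity = distinct / rows if rows else 0.0
        if selectivity >= min_selectivity:
            ranked.append((col, selectivity))
    return sorted(ranked, key=lambda item: item[1], reverse=True)


def redundant_indexes(indexes):
    """Treat an index as redundant if its columns are a leading prefix of another index."""
    redundant = []
    for name, cols in indexes.items():
        for other, other_cols in indexes.items():
            if name != other and len(cols) < len(other_cols) and other_cols[:len(cols)] == cols:
                redundant.append(name)
                break
    return redundant


# Hypothetical statistics: column -> (distinct values, row count).
stats = {
    "order_id": (1_000_000, 1_000_000),
    "customer_id": (50_000, 1_000_000),
    "status": (5, 1_000_000),
}
suggested = rank_candidates(["status", "customer_id", "order_id"], stats)

existing = {
    "idx_customer": ("customer_id",),
    "idx_customer_created": ("customer_id", "created_at"),
    "idx_status": ("status",),
}
redundant = redundant_indexes(existing)
```

The low-selectivity `status` column is filtered out, and `idx_customer` is flagged because `idx_customer_created` already covers its leading column. The prefix rule ignores cases such as unique constraints, so flagged indexes would still need review before being dropped.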

  • SQL session checking and killing

Complex application logic can introduce logic problems that are hard to find manually and that produce abnormal SQL. Operation and maintenance personnel need a way to detect and limit abnormal sessions quickly. The GaussDB application platform provides session management. The real-time session page supports session statistics, active sessions, session lock analysis, session killing, and other functions, helping operation and maintenance personnel quickly grasp an instance's session information, manage its sessions, and efficiently locate logical problems related to database session connections that are hard to find manually.

  • SQL throttling (current limiting) and autonomous throttling

Imagine a scenario: during normal database operation, an application launches a new feature that introduces a very bad SQL statement. The database goes from serving requests normally to steadily rising resource usage, a large number of SQL statements slow down because they cannot obtain resources such as threads and CPU, and the business eventually fails. Abnormal SQL (for example, SQL with poor indexing) or a surge in SQL concurrency has a relatively large impact on the serviceability of the entire database. In this case, we can suppress the offending SQL through precise throttling to keep the business running normally.

The SQL throttling provided by GaussDB offers the following capabilities:

Global fast and slow lanes. The global fast and slow lanes define two resource pools. The normal resource pool, which we call the fast lane, provides ample resources, and normal business runs in it. A "traffic accident" here means abnormal SQL. When one occurs, we can move the abnormal SQL into the slow lane with one click on the page. The slow lane restricts resource usage, so that once the accident has been dealt with in this way, the fast lane can continue to run at high speed.

Precise control of a single SQL type. A single type of SQL is precisely controlled in terms of execution time, I/O usage, and so on, limiting the resources that this type of SQL occupies. This serves as an emergency throttle.

Memory circuit breaking. Upper and lower memory limits can be configured. Once memory usage exceeds the upper limit, new connections are refused and existing sessions are killed. Once memory usage falls back to the lower limit, session killing stops and new connections are allowed again.

Autonomous SQL throttling. SQL is throttled autonomously according to SQL rules or resource usage rules such as CPU and memory, to prevent the corresponding types of SQL from slowing down the entire database.
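The upper and lower memory limits described above form a two-threshold (hysteresis) switch, which the following Python sketch illustrates. The class name `MemoryCircuitBreaker`, the 0 to 1 usage scale, and the sample thresholds are assumptions for this example. The sketch only tracks the state and does not kill sessions or reject connections itself.

```python
class MemoryCircuitBreaker:
    """Hypothetical two-threshold breaker: trips above `upper`, resets at or below `lower`."""

    def __init__(self, lower, upper):
        if not 0 <= lower < upper <= 1:
            raise ValueError("expected 0 <= lower < upper <= 1")
        self.lower = lower
        self.upper = upper
        self.tripped = False

    def observe(self, memory_usage):
        """Update the state from a usage sample in [0, 1]; return True while tripped."""
        if not self.tripped and memory_usage > self.upper:
            self.tripped = True   # start refusing connections and killing sessions
        elif self.tripped and memory_usage <= self.lower:
            self.tripped = False  # memory has recovered, resume normal service
        return self.tripped

    def allow_new_connection(self):
        return not self.tripped


breaker = MemoryCircuitBreaker(lower=0.7, upper=0.9)
states = [breaker.observe(u) for u in (0.5, 0.95, 0.8, 0.75, 0.65, 0.85)]
```

Using two thresholds instead of one keeps the breaker from flapping: after tripping at 0.95 it stays tripped at 0.8 and 0.75, and only resets once usage drops to 0.65.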

Our GaussDB operation and maintenance center has many other capabilities, which will be released in October. Everyone is welcome to try them then and give us feedback.

This is the main content I wanted to share today. Thank you all!



Origin blog.csdn.net/devcloud/article/details/132838351