Intelligent Analysis and Diagnosis of Database Abnormalities

DAS (Database Autonomy Service) is a database autonomy service aimed at developers and DBAs, providing database performance analysis, fault diagnosis, security management, and other capabilities. DAS combines big data methods, machine learning, and expert experience to free users from the complexity of database management and from service failures caused by manual operations, and to keep database services running stably and efficiently. This article describes the background, evolution strategy, key functions, and implementation ideas of DAS, and we hope it proves helpful or inspiring to engineers working on related systems.

1 Current situation and problems

1.1 The growing imbalance between database scale and operation and maintenance capability

With the rapid development of Meituan's business in recent years, the database fleet has also kept growing rapidly. As the "nerve endings" of the entire business system, a database that runs into trouble can cause heavy losses to the business. At the same time, because the fleet grows so quickly, the number of problems has also increased sharply, and purely manual, reactive analysis and localization can no longer keep up. The following figure shows the growth trend of database instances over the preceding years:

Figure 1 Growth trend of database instances

1.2 The ideal is rich, but the reality is harsh

The main contradiction facing the Meituan database team was the imbalance between instance growth and the development of operation and maintenance capability, which showed up concretely as high requirements on database stability combined with a lack of the key data needed to meet them. Because product capabilities were insufficient, we could only rely on professional DBAs to troubleshoot problems by hand, so handling an exception took a long time. We therefore decided to fill in the missing key information and provide self-service or automatic problem-localization capabilities to reduce handling time.

We reviewed the faults and alarms of the preceding period and analyzed their root causes in depth. We found that any anomaly can be divided chronologically into three stages: anomaly prevention, anomaly handling, and anomaly recovery. For these three stages, combined with the definition of MTTR, we surveyed the solutions inside Meituan and across the industry and drew up a panorama of database exception handling. As shown below:

Figure 2 Status Quo of Operation and Maintenance Capability

By comparison, we found:

  • Every link has supporting tools, but their capabilities are not strong enough; we estimated them at roughly 20% to 30% of what leading cloud vendors offer, so the gaps are obvious.
  • Self-service and automation are also insufficient: although there are many tools, the whole chain has not been connected, so no synergy has formed.

So how do we solve this? After in-depth analysis and discussion within the team, we proposed a solution better suited to our current stage of development.

2 Solution ideas

2.1 Solve short-term conflicts and base on long-term development

Reviewing historical failures showed that in 80% of failures, 80% of the time was spent on analysis and localization, so improving the efficiency of anomaly analysis and localization had the highest short-term ROI (return on investment). In the long run, only by completing the capability map can the stability and assurance capability of the whole database be improved continuously. One of our guiding ideas at the time was therefore to resolve the short-term conflict while building for the long term (Think Big Picture, Think Long Term): the new plan should leave enough room for future development, rather than just treating the head when the head aches and the foot when the foot aches.

At the macro level, we want more anomalies to be located automatically and, building on automatic localization, to be handled automatically or through self-service, improving recovery efficiency and ultimately the user experience. Once exception handling becomes more efficient and the user experience improves, the communication burden on operation and maintenance staff (mainly DBAs) drops sharply, which frees them to invest more time in technology; more of the anomalies that used to be handled by hand become self-service or automated, forming a "flywheel effect" and finally achieving efficient stability assurance.

At the micro level, based on the data we already have, we improve observability through structured information output and fill in the missing key data. On top of this more complete information, we provide self-service or automatic localization through the cooperation of rules (expert experience) and AI, shortening handling time.

Figure 3 Macro and Micro

2.2 Consolidate basic capabilities, empower upper-layer services, and realize database autonomy

With the guiding ideology clear, what development strategy and path should we take? Given the team's staffing at the time, none of the engineers had experience building anomaly self-healing systems, and some did not even have database anomaly analysis skills; the talent structure could not yet support the product's ultimate goal. As the saying goes, "the difficult things of the world must begin with the easy, and the great things of the world must begin with the small." Our idea was to start with small, easy functions: first complete basics such as metric monitoring, slow queries, and active sessions, then gradually move into complex functions such as full SQL, root cause analysis of anomalies, and slow query optimization suggestions. This foundational work steadily builds the team's ability to tackle hard problems while laying the groundwork for intelligence.

The following is the two-year roadmap we planned based on the talent structure at the time and our future goals (the plan for the database autonomy goal starts after 2022 and is omitted from the figure below):

Figure 4 Evolution strategy

2.3 Establish a scientific evaluation system to continuously track product quality

The famous American management scholar Kaplan said, "Without measurement there is no management." Only by establishing a scientific evaluation system can we push the product to a higher level. But how do we evaluate product quality and keep improving it? We had defined many metrics before, but they were all outside our control and could not guide our work. For example, when we first considered root cause localization we used the result metrics precision and recall, but result metrics are not directly controllable and are hard to act on day to day. We needed to find the controllable factors and improve them continuously.
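For reference, these two output metrics follow their standard definitions, applied here to root cause diagnoses (TP, FP, and FN count correct diagnoses, wrong diagnoses, and missed root causes, respectively):

$$\text{precision} = \frac{TP}{TP + FP}, \qquad \text{recall} = \frac{TP}{TP + FN}$$

Precision asks how many of the root causes we reported were actually correct; recall asks how many of the true root causes we managed to report. Pushing these up directly is hard, which is why we looked for controllable inputs instead.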

While studying Amazon, we discovered that they have a methodology of controllable input metrics and output metrics, which guides our work well: as long as we keep optimizing the right controllable input metrics, the output metrics will eventually improve as well (this also echoes Zeng Guofan's saying, "work hard on the causes, and let the results follow").

Below are our metric design and technical implementation ideas for root cause localization: the parts that are controllable can be improved continuously in a simulated environment, and the final online effect improves with them. This covers two parts: the design of controllable input and output metrics for root cause localization, and the technical approach for obtaining those controllable input metrics.

Design of controllable input and output metrics for root cause localization

Figure 5 Design of controllable input and output indicators

Technical approach for obtaining the controllable input metrics for root cause localization

Figure 6 Technical design of controllable input and output indicators

In Figure 5, for the majority of anomalies that can be reproduced at low cost, we simulate them through scenario reproduction and technical means. For the few anomalies that are expensive to reproduce, such as machine anomalies and hardware failures, our current approach is to find and fix problems through manual review, then wait for the next time the anomaly occurs online and compare the optimized diagnosis result with the expected result to decide whether the improvement passes acceptance.

In the future, we will build a replay system that saves the abnormal indicators recorded when a problem occurred; feeding those indicators into the replay system and checking its output lets us judge whether a system improvement is effective, giving us a lighter-weight reproduction method with wider coverage. Figure 6 shows the technical design of this replay system.

With the guiding ideology, a roadmap suited to our current stage, and a scientific evaluation system in place, let's move on to the technical solution.

3 Technical solution

3.1 Top-level design of technical architecture

In the top-level design of the technical architecture, we adhere to the four-step evolution strategy of platformization, self-service, intelligence and automation.

First, we improve observability and build an easy-to-use database monitoring platform by surfacing key information. Then, based on this key information, we empower changes (such as data changes and index changes): for part of the high-frequency operation and maintenance work, the structured key information lets users make decisions themselves (for example, before an index change we can check whether the index has received traffic recently to make sure the change is safe), which is self-service. Next, we add intelligence (expert experience + AI) to cover more self-service scenarios and gradually automate some low-risk functions, finally reaching highly or fully automated operation through continuous improvement.

Why does automation come after intelligence? Because we believe the goal of intelligence is automation: intelligence is the premise of automation, and automation is the result of intelligence. Only by continuously improving intelligence can we reach highly or fully automated operation. The figure below shows our top-level architecture design (the evolution strategy on the left; the top-level technical architecture and the status as of the end of 2021 on the right):

Figure 7 Architecture top-level design

The top-level design is only the first step of a long march. Next, we introduce, from the bottom up, the specific work carried out under this design: the data acquisition layer, the computation and storage layer, and the analysis and decision-making layer.

3.2 Design of Data Acquisition Layer

In the architecture diagram above, the data collection layer is the bottom layer and the most important link of all: the quality of the collected data directly determines the capability of the whole system. It also interacts directly with database instances, so any design flaw could lead to large-scale failures. The technical solution therefore has to balance collection quality against instance stability, and when the two cannot both be satisfied, we would rather sacrifice collection quality to protect database stability.

For data collection, the industry generally uses a kernel-based approach, but Meituan's self-developed kernel came relatively late and its rollout cycle is long, so in the short term we use packet capture as a transition and will switch over gradually once kernel-based collection is deployed widely enough. The following is our survey of packet-capture-based options:

| Solution | Performance | Generality | Remark |
| --- | --- | --- | --- |
| pcap | Low | High | The Meituan hotel & travel team has already run it in production |
| pf_ring | Medium | Medium | Requires modifying MySQL |
| dpdk | High | Low | Requires recompiling the NIC driver |

The survey shows that the pf_ring- and dpdk-based solutions carry heavy dependencies, are hard to land in the short term, and we had no prior experience with them. The pcap-based approach has no such dependencies, and we already had some experience: the Meituan hotel & travel team had built a full SQL collection tool based on packet capture and had run it in production for a year. We therefore adopted the pcap-based packet capture solution. Below are the architecture of the collection layer, the collection quality, and the impact on database performance.

Technical design of the Agent

Figure 8 Technical design of Agent

Impact on the database

Figure 9 Test of the impact of Agent on the database
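To make the pcap-based approach above more concrete, here is a minimal, illustrative sketch of pulling SQL text out of MySQL traffic. It is not the production Agent; it assumes Python with the scapy library, the default MySQL port 3306, and an interface named eth0, and it ignores TCP stream reassembly.

```python
# Illustrative sketch only: capture SQL text from MySQL packets with scapy.
from scapy.all import sniff, TCP, Raw

MYSQL_PORT = 3306  # assumed port of the monitored instance

def handle_packet(pkt):
    """Print the SQL text of COM_QUERY packets sent to the MySQL port.

    A real collector would reassemble TCP streams and fully decode the MySQL
    wire protocol; here we only look at single packets whose payload starts
    with a 4-byte packet header followed by the COM_QUERY command byte (0x03).
    """
    if pkt.haslayer(TCP) and pkt[TCP].dport == MYSQL_PORT and pkt.haslayer(Raw):
        payload = bytes(pkt[Raw].load)
        if len(payload) > 5 and payload[4] == 0x03:
            print(payload[5:].decode("utf-8", errors="replace"))

# The BPF filter keeps kernel-side cost low; store=0 avoids buffering packets.
sniff(iface="eth0", filter="tcp dst port 3306", prn=handle_packet, store=0)
```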

3.3 Design of the Computation and Storage Layer

Meituan's database instances are huge in both number and traffic, and both keep growing rapidly with the business, so our design must satisfy not only current requirements but also those of the next five years and beyond. At the same time, for database fault analysis the real-time availability and completeness of data are key to locating problems quickly and efficiently, yet the capacity cost of guaranteeing them cannot be ignored. Combining these requirements with some other considerations, we set the following design principles for this layer:

  • Full-memory computing: all computation is done purely in memory within a single thread or single process, pursuing extreme performance and throughput.
  • Report raw data: data reported from the MySQL instance stays as close to its raw form as possible, with little or no processing on the reporting side.
  • Data compression: because the reported volume is huge, the reported data must be compressed aggressively.
  • Controllable memory consumption: memory overflow is made virtually impossible, verified through both theory and stress testing.
  • Minimize the impact on the MySQL instance: push computation downstream as much as possible and avoid complex computation on the Agent, so that production RDS instances are not materially affected. The specific architecture is shown below:

Figure 10 Architecture of the computation and storage layer

Full SQL (all SQL statements that access the database) is the most challenging function of the whole system and one of the most important inputs for database anomaly analysis. Below we introduce the key points of the full SQL aggregation method, the effect of aggregation and compression, and the compensation design.

3.3.1 Aggregation method of full SQL

Because the volume of detailed data is huge, we adopted an aggregation approach: the consumer thread aggregates messages belonging to the same SQL template at minute granularity, using "RDS IP + DB name + SQL template ID + the minute in which the SQL query ended" as the aggregation key. The key is computed as AggKey = RDS_IP_DBName_SQL_Template_ID_Time_Minute, where Time_Minute is taken from the year, month, day, hour, and minute of the SQL query's end time.
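A minimal sketch of this per-minute aggregation follows; the message field names and the aggregated statistics are assumptions for illustration, not the real schema.

```python
from collections import defaultdict

def agg_key(msg):
    """Aggregation key: RDS IP + DB name + SQL template ID + the minute
    (year-month-day-hour-minute) in which the query finished."""
    time_minute = msg["query_end_time"].strftime("%Y%m%d%H%M")
    return f'{msg["rds_ip"]}_{msg["db_name"]}_{msg["sql_template_id"]}_{time_minute}'

# One consumer thread aggregates messages of the same template per minute.
buckets = defaultdict(lambda: {"count": 0, "total_latency_us": 0, "errors": 0})

def consume(msg):
    bucket = buckets[agg_key(msg)]
    bucket["count"] += 1
    bucket["total_latency_us"] += msg["latency_us"]
    bucket["errors"] += 1 if msg.get("error") else 0
```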

Figure 11 SQL Template Aggregation Design

Figure 12 SQL template aggregation method

3.3.2 The effect of full SQL data aggregation and compression

For data compression we follow the principle of reducing traffic layer by layer: message compression, pre-aggregation, dictionary encoding, and minute-level aggregation together ensure the traffic shrinks as it passes through each component, ultimately saving bandwidth and reducing storage. Below are the relevant compression stages and test results (sensitive data has been desensitized and does not represent the actual Meituan figures):

Figure 13 Design and effect of full SQL compression
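The dictionary-encoding step can be pictured as replacing each repeated SQL template text with a short integer ID before the payload is compressed. The sketch below is a rough illustration with invented field names, and zlib stands in for whatever codec is actually used.

```python
import json
import zlib

template_dict = {}  # SQL template text -> compact integer ID

def dict_id(template_sql):
    """Dictionary encoding: repeated template text is assigned an ID once and
    referenced by that ID afterwards."""
    return template_dict.setdefault(template_sql, len(template_dict))

def encode_batch(aggregated_rows):
    """Turn pre-aggregated per-minute rows into a dictionary-encoded,
    compressed payload ready to be reported."""
    rows = [
        {"tid": dict_id(r["template_sql"]), "cnt": r["count"], "lat": r["total_latency_us"]}
        for r in aggregated_rows
    ]
    return zlib.compress(json.dumps(rows).encode("utf-8"))
```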

3.3.3 Compensation mechanism for full SQL data

As mentioned above, the aggregation side aggregates on a one-minute basis and tolerates at most one extra minute of message delay; messages delayed by more than one minute are simply discarded. During serious delays at business peaks this would lose a large amount of data, which would significantly hurt the accuracy of subsequent database anomaly analysis.

We therefore added a compensation mechanism for delayed messages: expired data is sent to a compensation queue (using Mafka, Meituan's message queue service), and by backfilling this expired data we ensure that long-delayed messages are still stored normally, preserving the accuracy of subsequent anomaly analysis. The design of the compensation mechanism is as follows:

Figure 14 Design of the full SQL data compensation mechanism
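A sketch of the routing decision for late messages is shown below. Mafka is Meituan's internal queue; here a generic Kafka producer and made-up topic names stand in, and messages are assumed to be already-serialized bytes.

```python
import time
from kafka import KafkaProducer  # stand-in for the internal Mafka client

producer = KafkaProducer(bootstrap_servers="broker:9092")
MAX_DELAY_SECONDS = 60  # messages older than this miss the live one-minute window

def route(msg_bytes, event_time_seconds):
    """Send on-time messages to the normal aggregation topic; late messages go
    to a compensation topic so they are still persisted and merged later."""
    delay = time.time() - event_time_seconds
    topic = "full_sql_agg" if delay <= MAX_DELAY_SECONDS else "full_sql_compensation"
    producer.send(topic, msg_bytes)
```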

3.4 Analysis and decision-making design

With more complete data in hand, the next step is to make decisions on top of it and infer the possible root causes. For this part we combine expert experience with AI, and we divide the evolution path into four stages:

The first stage: completely rule-based, accumulating domain experience and exploring feasible paths.
The second stage: exploring AI scenarios while still relying mainly on expert experience, applying AI algorithms to a small number of low-frequency scenarios to validate AI's capabilities.
The third stage: expert experience and AI advance in parallel; expert experience keeps iterating and extending in existing scenarios while AI lands in new scenarios, and the dual-track system ensures existing capabilities do not regress.
The fourth stage: AI replaces most of the expert experience and becomes the primary method, with expert experience as a supplement, maximizing AI's capabilities.

The following is the overall technical design of the analysis and decision-making part (we drew on a Huawei article, "Exploration and Practice of Intelligent Root Cause Recommendation for the Network Cloud"):

Figure 15. Technical design for analytical decision making

In the analysis and decision-making layer, we mainly use both expert experience and AI. Next we introduce the implementation of the rule-based method (expert experience) and the AI-algorithm-based method.

3.4.1 Rule-based approach

For the expert experience part, we adopted the GRAI review methodology (Goal, Result, Analysis, Insight) to guide our work and iteratively improve precision and recall. The following is an example of how the master-slave delay rule was refined:

Figure 16 Review and improvement of expert experience
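To make the rule-based style concrete, a simplified, hypothetical master-slave delay rule might look like the sketch below; the metric names and thresholds are illustrative examples of the kind of conditions refined through GRAI reviews, not the production rules.

```python
def diagnose_replication_delay(metrics):
    """Toy expert rule: attribute master-slave delay to the most likely causes.

    `metrics` is a dict of recent instance indicators; the names and thresholds
    are illustrative only.
    """
    findings = []
    if not metrics["slave_io_running"] or not metrics["slave_sql_running"]:
        findings.append("replication thread stopped")
    if metrics["big_transaction_rows"] > 100_000:
        findings.append("large transaction on the master is blocking apply")
    if metrics["slave_disk_util_pct"] > 90:
        findings.append("slave disk I/O saturated")
    if metrics["ddl_running_on_slave"]:
        findings.append("long-running DDL on the slave delays the SQL thread")
    return findings or ["unclassified: escalate to a DBA and feed back into the rule set"]
```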

3.4.2 AI-algorithm-based approach

Anomaly detection for database indicators

Anomaly detection for core database indicators relies on sound modeling of each indicator's historical data. Combining periodic modeling in an offline process with streaming detection in a real-time process helps us detect failures and risks on database instances effectively and locate the root problem, so it can be resolved in a timely and effective manner.

The modeling process has three steps. First, preprocessing modules clean the indicator's historical data, including filling missing values, smoothing, and aggregation. Then a classification module creates subsequent branches so that different types of indicators are modeled with different methods. Finally, the models are serialized and stored, and Flink tasks read them to perform streaming detection.

The detection design is shown below:

Figure 17 AI-based anomaly detection design
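As a simplified stand-in for the approach in Figure 17, the sketch below builds a daily periodic baseline offline and checks new points against it online; the real system classifies indicators and serializes per-type models for Flink, which is omitted here. It assumes whole days of per-minute data that have already been preprocessed.

```python
import numpy as np

def build_baseline(history, period=1440):
    """Offline step: model a metric with daily periodicity (1440 one-minute points).

    history: 1-D array of per-minute values covering whole days, with missing
    values already filled and smoothed by the preprocessing step.
    """
    days = history.reshape(-1, period)
    return {"mean": days.mean(axis=0), "std": days.std(axis=0) + 1e-6}

def detect(baseline, minute_of_day, value, k=3.0):
    """Online step (e.g. inside a Flink job): flag points more than k standard
    deviations away from the expected value for that minute of the day."""
    z = abs(value - baseline["mean"][minute_of_day]) / baseline["std"][minute_of_day]
    return z > k
```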

Root cause diagnosis (under construction)

The diagnosis process is triggered by subscribing to alarm messages (rule-based or produced by anomaly detection); it then collects and analyzes data, infers the root cause, and filters out the useful information to help users resolve the problem. The diagnosis result is pushed to the user through Elephant (Meituan's internal IM), together with a diagnosis details page, and users can improve diagnosis accuracy by labeling the results.

Figure 18 Root cause diagnosis design

  • Data collection: collect database performance indicators, database status snapshots, system indicators, hardware problems, logs, records, and other data.
  • Feature extraction: extract features from each type of data, including time-series features extracted by algorithms, text features, and domain features built from database knowledge.
  • Root cause classification: covers feature preprocessing, feature selection, algorithmic classification, root cause ranking, and other steps.
  • Root cause expansion: starting from the root cause category, mine the related information in depth to help users handle the problem efficiently; this includes modules such as SQL behavior analysis, expert rules, indicator correlation, dimension drill-down, and log analysis.
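Read together, the four steps form a small pipeline. The sketch below only illustrates that structure: the step implementations are supplied elsewhere on the platform, and the example labels are invented.

```python
def diagnose(alarm, collect, extract, classify, expand):
    """Illustrative end-to-end flow for one alarm; the four callables correspond
    to the four steps described above."""
    raw = collect(alarm)                      # metrics, status snapshots, logs, hardware events
    features = extract(raw)                   # time-series, text, and domain features
    candidates = classify(features)           # e.g. [("lock_contention", 0.82), ("slow_sql", 0.11)]
    root_cause, score = max(candidates, key=lambda c: c[1])
    detail = expand(root_cause, raw)          # SQL behavior analysis, expert rules, drill-down
    return {"root_cause": root_cause, "confidence": score, "detail": detail}
```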

4 Achievements

4.1 Metric performance

At present we iterate through the closed loop of "sort out the alarm-triggering scenarios -> simulate and reproduce the scenario -> analyze and diagnose the root cause -> plan improvements -> accept the improved quality -> sort out the alarm-triggering scenarios again" (see the earlier section on establishing a scientific evaluation system and continuously tracking product quality for details). By improving the controllable input metrics, the online output metrics improve as well, ensuring the system keeps developing in the right direction. Here is how the root cause recall and precision metrics have performed recently.

Root cause accuracy based on user feedback on alarms

Figure 19 User feedback accuracy

Overall recall rate of alarm diagnosis analysis

Figure 20 Root cause analysis recall rate

4.2 User Cases

For pushing root cause results, we integrated with Meituan's internal IM system (Elephant). After a problem occurs, the flow is: an alarm surfaces the anomaly -> Elephant pushes the diagnosed root cause -> click the diagnosis link to view the details -> handle it with a one-click plan -> track and feed back the handling result -> the execution succeeds or is rolled back, closing the loop of anomaly discovery, localization, confirmation, and handling. The following is an example of root cause analysis after an alarm triggered by an active session rule.

A chat group is created automatically and the root cause is given

Figure 21 Lock blocking leads to high active sessions

Click on the diagnostic report to view the details

Figure 22 Lock blocking leads to high active sessions

The following is a case where a slow query optimization suggestion was pushed after a Load alarm (for desensitization reasons, the case was simulated in a test environment).

Figure 23 Slow query optimization suggestion

5 Conclusion and future outlook

After about two years of development, the database autonomy service has largely consolidated its basic capabilities and has achieved preliminary empowerment in some business scenarios (for example, problematic SQL is identified automatically before a business service goes live and its potential risks are flagged; for incorrect index changes, the index's recent access traffic is checked automatically before the work order executes and wrong changes are blocked). Next, besides continuing to deepen existing work and strengthen capabilities, our goal is to focus on database autonomy. The main planning revolves around the following three directions:

(1) Stronger computation and storage capabilities: with database instances and business traffic continuing to grow rapidly and the collected information becoming richer, the existing data pipeline urgently needs strengthening so that its processing capacity can support the next 3-5 years.

(2) Landing autonomy capabilities in a small number of scenarios: for database autonomy, we will adopt a three-step strategy:

  • Step 1: link root cause diagnoses to SOP documents, making diagnosis and handling transparent;
  • Step 2: put the SOP documents onto the platform, making diagnosis and handling procedural;
  • Step 3: allow unattended handling of some low-risk operations, with automatic diagnosis and handling, gradually realizing database autonomy.

(3) A more flexible anomaly replay system: verifying a scenario's root cause localization algorithm before and after it goes online is critical. We will improve the verification system, build a flexible anomaly replay system, and continuously improve localization accuracy through replay based on the information captured at the scene.

6 Author and team

Jinlong, from Meituan Basic Technology Department/Database R&D Center/Database Platform R&D Team.

The Database Technology Center of Meituan's Basic Technology Department is recruiting senior engineers and technical experts, based in Shanghai and Beijing. Meituan's relational database fleet is large and grows rapidly every year, carrying hundreds of billions of accesses per day. Here you can face the business challenges of high concurrency, high availability, and high scalability, keep up with and advance cutting-edge technology in the industry, and experience the productivity gains brought by technical progress. Interested candidates are welcome to send their resumes to: [email protected].


| This article was produced by the Meituan technical team, and the copyright belongs to Meituan. You are welcome to reprint or use the content for non-commercial purposes such as sharing and communication, provided you credit "Content reproduced from the Meituan technical team". The article may not be reproduced or used commercially without permission. For any commercial use, please email [email protected] to apply for authorization.
