Fault detection and location efficiency improved by more than 70%: what optimizations has Qunar made to its observability system?

A quick overview of the highlights in one minute

Qunar.com's original monitoring system looked impressive by the numbers: hundreds of millions of metrics and millions of alarms. Its fault data, however, told a different story: the average discovery time for order-type faults was as long as 4 minutes, only 20% of order faults were discovered within 1 minute, and nearly half of all faults took more than 30 minutes to handle. To solve these problems, Qunar.com decided to start from the fault indicators themselves and comprehensively optimize fault discovery, fault root cause location, fault repair, and other stages.

This article walks through this series of optimizations in detail, analyzing the monitoring methods and tools used at each stage and the key issues encountered in practice.

About the author

Xiao Shuang, Infrastructure Technology TL at Qunar.com

Member of the TakinTalks Stability Community panel. He joined Qunar.com in 2018 and is currently responsible for building Qunar.com's CI/CD, monitoring, and cloud-native platforms. During this time he led the implementation of Qunar.com's container platform, helped business lines migrate applications to it at scale, and completed the Watcher 2.0 upgrade of the monitoring system and the rollout of the root cause analysis system. He has in-depth knowledge and hands-on experience in monitoring and alerting, CI/CD, and DevOps.

Friendly reminder: this article is about 7,500 words and takes about 12 minutes to read.

"TakinTalks Stability Community" public account backend reply "Communication" to enter the reader communication group; reply "1102" to obtain courseware information;

Background

Whenever Qunar has shared its monitoring system externally, it has usually talked about scale: hundreds of millions of metrics, millions of alarms, the number of monitored machines, the volume of storage, and so on. After several years of running and observing the system, however, we began to question these numbers: do they really mean the monitoring system meets our needs?

When we analyzed our fault data, we found that some faults were not discovered in time, and some were even discovered manually rather than by monitoring alarms. Our statistics on such cases showed that the average discovery time for order faults was about 4 minutes, only 20% of order faults were discovered within 1 minute, and 48% of faults took more than 30 minutes to handle. This exposed a contradiction: on the one hand, we believed the monitoring system already met the company's needs; on the other hand, the fault data was unsatisfactory, and some of it was frankly bad.

From this point, we began to adjust the monitoring system. Borrowing the concept of MTTR, we split a fault into three indicators: discovery time, diagnosis time, and repair time, and adopted different monitoring methods or tools to optimize each stage. After nearly a year of practice, Qunar's order-fault detection time has been reduced from 3 minutes to 1 minute, and the accuracy of fault root cause location has reached 70%-80%. The key practices were the construction and implementation of Qunar.com's second-level monitoring and root cause analysis platforms.


(Key time points must be recorded after each fault is handled)


(The digital platform automatically analyzes the fault handling level of each team)

1. What problems did we encounter when building second-level monitoring?

1.1 Current situation and challenges

Before deciding to implement second-level monitoring, we first took stock of the existing monitoring platform. In the process, we found that it faced three main challenges.

First, storage IO was high and space usage was excessive. Our previous time series database (TSDB) used Graphite's Carbon and Whisper. Because of Whisper's space pre-allocation strategy and write amplification, disk IO pressure was too high and too much storage space was consumed. Solving the monitoring data storage problem therefore became our top priority.

Secondly, the entire monitoring pipeline needed to be reworked for second-level monitoring. Our existing data collection and alarm systems were designed at minute granularity; to achieve second-level monitoring, the whole path from data collection through storage to alarming had to be modified on a large scale.

Finally, there was Graphite protocol compatibility. Our storage was based on Graphite, so our collection protocols and queries were as well. Any second-level monitoring system we built had to remain compatible with the Graphite protocol.

1.2 Storage solution selection

With these constraints in mind, we evaluated two storage solutions, M3DB and VictoriaMetrics (hereinafter "VM"), both of which support the Graphite protocol.

After a detailed comparison, we found that M3DB has a high compression rate, excellent performance, and an open source version that runs as a cluster, but its deployment and maintenance are quite complex. VM, on the other hand, showed its advantages in our stress tests: a single machine can handle reads and writes for up to 10 million metrics, each component can be scaled independently, deployment is relatively simple, and the community is very active.

After comparing the two options, we chose VM as our time series database (TSDB).

1.2.1 Problem: Performance is severely degraded in aggregate read scenarios

We conducted a stress test on the VM, and the server was configured with a 32-core CPU, 64GB of memory, and 3.2TB of SSD storage.

In the single-machine stress test, we wrote 10 million metrics per minute with a query load of 2,000 QPS. Under this setting, the average response time was 100ms, one day's worth of data used about 40GB of disk, and the host load stayed between 5 and 6.

However, we also found problems. For single-metric queries, VM performed well and fully met our needs. For complex queries, such as function queries and aggregate metric queries, performance dropped significantly and queries sometimes even timed out.

1.2.2 Solution: Separating storage and computation

To solve these problems, we restructured the query path. Since VM performs well on single-metric queries, we let VM focus on those, while complex metric queries and aggregations are handled by CarbonAPI, an open source toolset that supports the Graphite protocol and implements most of Graphite's aggregate calculations and aggregate metric parsing.

However, CarbonAPI did not fully meet our needs: its parsing of aggregate metric queries was incomplete. We therefore extended it. We added a metadata DB, and every time a metric is written to VM, we store information such as the metric name and query URL in that DB. When parsing a query, CarbonAPI resolves metrics with multiple labels or functions into single metrics and sends those to VM for querying, which greatly improves VM's query performance.

It is worth mentioning that CarbonAPI is stateless and can be scaled out arbitrarily, which gives us a separation of storage and computation that supports very high query QPS. We can also add customized features on top of it, such as monitoring-specific preprocessing and data trimming. With VM selected and the storage/compute separation in place, we had solved the storage, query, and write problems of second-level monitoring.
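
To make the division of labor concrete, here is a minimal sketch of the idea in Go. All names (metaDB, queryVM, sumSeries) are invented for illustration rather than taken from CarbonAPI: a metadata store resolves a wildcard Graphite pattern into the single metrics registered at write time, each single metric is queried from VM individually, and the aggregation happens in the stateless compute layer.

```go
package main

import "fmt"

// metaDB stands in for the metadata DB that is updated every time a metric
// is written to VictoriaMetrics: pattern -> concrete single-metric names.
// (Hypothetical in-memory stand-in for illustration only.)
var metaDB = map[string][]string{
	"app.order.*.qps": {"app.order.host1.qps", "app.order.host2.qps"},
}

// queryVM stands in for a single-metric read against VM's
// Graphite-compatible endpoint -- the query shape VM handles well.
func queryVM(metric string) []float64 {
	return []float64{1, 2, 3} // placeholder series
}

// sumSeries resolves the pattern via the metadata DB, queries each single
// metric separately, and aggregates in the stateless compute layer, so the
// TSDB never executes the expensive aggregate query itself.
func sumSeries(pattern string) []float64 {
	var sum []float64
	for _, name := range metaDB[pattern] {
		series := queryVM(name)
		if sum == nil {
			sum = make([]float64, len(series))
		}
		for i, v := range series {
			sum[i] += v
		}
	}
	return sum
}

func main() {
	fmt.Println(sumSeries("app.order.*.qps")) // [2 4 6]
}
```

Because the aggregation layer holds no state, adding more of these compute nodes is enough to absorb higher query QPS without touching the storage tier.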

1.3 Client indicator collection optimization

1.3.1 Problem: The scheduler and metric warehouse did not meet the requirements

The original minute-level monitoring relied mainly on our self-developed SDK for data collection rather than an open source SDK such as Prometheus's. When we set out to achieve second-level monitoring, we found several problems on the client side.

(Client-side minute-level indicator collection architecture)

First, the scheduler. As shown in the figure, after a Counter finishes counting a metric, it stores the metric and its related data in the local metric warehouse. Every minute, at a fixed time, the scheduler extracts data from the warehouse, generates a snapshot, and stores it. When the server collects data from the client, it pulls this snapshot instead of reading real-time data directly from the warehouse. This differs from common open source practice; we chose snapshots over real-time data mainly to align the data on minute boundaries and simplify server-side processing. No matter how many times the server pulls within a given minute, it gets the same fixed data from the previous minute.

Secondly, our metric warehouse only supported minute-level storage, because all of our earlier designs were built around minute-level data.

1.3.2 Solution: The client computes and stores data at multiple resolutions and generates multiple snapshots

In the process of transforming the client, we considered two options.

Option 1: ❌

We considered the Prometheus model, which does not generate snapshots but reads real-time data from the warehouse directly, only accumulating or recording values. The puller can then either store the raw data in the TSDB or compute the increments itself.

Disadvantages of this solution include:

If the client computes increments itself, it needs the data from the previous minute or the previous pull interval before it can do so. If the raw data is written straight into the TSDB, increments have to be computed every time a user views the data, which hurts the user experience.

Although this mode saves client memory, it would require drastic changes to our collection architecture and could introduce data accuracy issues.

Option 2: ✅

The second option keeps client-generated snapshots, but performs the calculations and storage at multiple resolutions and generates multiple snapshots. The advantages are that it requires fewer architectural changes, has no data accuracy issues, and puts less pressure on the server.

The disadvantage of this solution is that it takes up more memory because we need to store second-level data.

We optimized this, however. Counter data is just an Int or Float64, so it does not occupy much memory. For Timer data, we use the t-digest sampling algorithm to compress what might originally be 1,000 data points down to about 100. With these optimizations, the memory usage is acceptable.
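
As a rough illustration of why the extra memory stays acceptable, the sketch below compresses a timer's raw samples into a fixed number of (mean, count) centroids. This is a deliberately simplified stand-in for t-digest, which uses non-uniform cluster sizes to keep tail quantiles accurate, so treat it only as an illustration of the compression idea.

```go
package main

import (
	"fmt"
	"math/rand"
	"sort"
)

// centroid is a compressed cluster of timer samples: its mean and how many
// raw samples it represents. 100 centroids can summarize 1,000+ samples.
type centroid struct {
	Mean  float64
	Count int
}

// compress sorts the raw timer samples and merges them into at most k
// centroids. Simplified stand-in for t-digest, which keeps finer
// resolution near the tails instead of uniform cluster sizes.
func compress(samples []float64, k int) []centroid {
	sort.Float64s(samples)
	out := make([]centroid, 0, k)
	per := (len(samples) + k - 1) / k // samples per centroid, rounded up
	for i := 0; i < len(samples); i += per {
		end := i + per
		if end > len(samples) {
			end = len(samples)
		}
		sum := 0.0
		for _, v := range samples[i:end] {
			sum += v
		}
		out = append(out, centroid{Mean: sum / float64(end-i), Count: end - i})
	}
	return out
}

func main() {
	raw := make([]float64, 1000) // e.g. 1,000 latency samples in one window
	for i := range raw {
		raw[i] = rand.Float64() * 200 // ms
	}
	cs := compress(raw, 100)
	fmt.Printf("raw=%d points, compressed=%d centroids\n", len(raw), len(cs))
}
```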

1.3.3 Architecture after transformation


After choosing the second option, we modified the client and introduced a new computing layer. It does two things: data sampling, and deciding whether a metric needs second-level collection. At present we only collect core order-related metrics and P1-level system metrics at second granularity, because collecting everything would consume far too many resources. The layer computes both second-level and minute-level data.

The scheduler was changed to add a snapshot manager that manages multiple snapshots; when the server pulls data, it receives a different snapshot depending on the request parameters. A configuration management service acts as the interface between server and client and can push second-level configuration to the client in real time.

After this transformation, the client meets our needs and can produce second-level statistics.
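
A minimal sketch of how such a snapshot manager might look (type and method names are invented for illustration and are not the actual SDK API): the compute layer hands over aggregated counters per resolution, each rotation freezes a snapshot, and the server's pull parameter selects which snapshot it receives.

```go
package main

import (
	"fmt"
	"sync"
)

// Snapshot is a frozen view of the metric warehouse for one resolution.
// Once generated it never changes, so repeated pulls return identical data.
type Snapshot map[string]float64

// SnapshotManager keeps one snapshot per resolution ("10s", "60s", ...).
// All names here are illustrative, not Qunar's actual SDK API.
type SnapshotManager struct {
	mu        sync.RWMutex
	snapshots map[string]Snapshot
}

func NewSnapshotManager() *SnapshotManager {
	return &SnapshotManager{snapshots: map[string]Snapshot{}}
}

// Rotate is called by the scheduler at the end of each interval: the compute
// layer hands over the aggregated counters for that resolution and they are
// frozen as the latest snapshot.
func (m *SnapshotManager) Rotate(resolution string, counters map[string]float64) {
	snap := make(Snapshot, len(counters))
	for k, v := range counters {
		snap[k] = v
	}
	m.mu.Lock()
	m.snapshots[resolution] = snap
	m.mu.Unlock()
}

// Pull is what the server calls; the resolution parameter decides whether
// the second-level or the minute-level snapshot is returned.
func (m *SnapshotManager) Pull(resolution string) Snapshot {
	m.mu.RLock()
	defer m.mu.RUnlock()
	return m.snapshots[resolution]
}

func main() {
	mgr := NewSnapshotManager()
	mgr.Rotate("10s", map[string]float64{"order.create.qps": 120})
	mgr.Rotate("60s", map[string]float64{"order.create.qps": 700})
	fmt.Println(mgr.Pull("10s"), mgr.Pull("60s"))
}
```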

1.4 Server-side indicator collection optimization

1.4.1 Current Issues: Data Breakpoints and High Resource Consumption

In our original architecture, we adopted the Master-Worker pattern, which is a relatively simple but powerful design. In this architecture, the Master acts as a global scheduler, regularly pulling all tasks from the database and distributing tasks to various Workers through the message queue. This is a classic producer-consumer pattern. The advantage is that Workers can be easily expanded because they are stateless. If there are too many tasks, Workers can simply be added to meet demand.

However, when we attempted second-level collection, we ran into problems. We have hundreds of thousands of tasks, and dispatching them through the message queue sometimes took as long as 12 seconds. That is incompatible with second-level collection: if dispatch alone takes 12 seconds while the collection interval is only 10 seconds, there will be gaps in the second-level data.

Another problem was that the system was written in Python with a multi-process/multi-thread model. When large volumes of node data had to be pulled and aggregated, CPU consumption was too high, a typical problem with this model. We needed a solution that met the second-level collection requirements while keeping resource consumption under control.

1.4.2 Transformation strategy: Move task scheduling to the Worker nodes

To solve these problems, we reworked the server side. First, we removed the message queue. We kept the Master-Worker model but added task partitioning to the Master node. With hundreds of thousands of tasks, partitioning lets the Master know how many Worker nodes exist and assign different task ranges to different Workers; the partition assignments are written to etcd. Each Worker listens for etcd events, and once it sees an assignment event it knows which tasks to run, for example tasks with IDs 1 to 1000. The Worker then fetches those tasks, caches them in memory, and begins execution.

In this transformation we moved task scheduling onto the Worker nodes. Although this makes a Worker a stateful service, if a Worker fails, the Master observes the change and reassigns that Worker's tasks to other nodes.

The new architecture can still be scaled out easily, and we chose Go with goroutines as the development model because it suits high-concurrency scenarios better. After this transformation, the system supports both minute-level and second-level data collection.
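
The partitioning idea can be sketched roughly as follows (illustrative only; in the real system the Master writes each assignment into etcd and Workers watch for change events rather than being called directly): the Master splits the task ID space evenly across the available Workers, and on a Worker failure the same mechanism re-partitions across the survivors.

```go
package main

import "fmt"

// Task is a collection job; in reality there are hundreds of thousands.
type Task struct{ ID int }

// partition splits the full task list into one contiguous slice per worker.
// In the real system the Master would write each worker's range into etcd;
// a worker watches etcd and, on an assignment event, loads its slice into
// memory and starts executing it on its collection tick.
func partition(tasks []Task, workers int) [][]Task {
	out := make([][]Task, workers)
	per := (len(tasks) + workers - 1) / workers
	for w := 0; w < workers; w++ {
		start := w * per
		if start >= len(tasks) {
			break
		}
		end := start + per
		if end > len(tasks) {
			end = len(tasks)
		}
		out[w] = tasks[start:end]
	}
	return out
}

// reassign is what the Master does when a worker dies: the remaining workers
// take over the full task set via a fresh partition.
func reassign(tasks []Task, aliveWorkers int) [][]Task {
	return partition(tasks, aliveWorkers)
}

func main() {
	tasks := make([]Task, 10) // stand-in for hundreds of thousands of tasks
	for i := range tasks {
		tasks[i] = Task{ID: i + 1}
	}
	for w, slice := range partition(tasks, 3) {
		fmt.Printf("worker %d -> %d tasks: %v\n", w, len(slice), slice)
	}
	// Failover: one of three workers dies, the two survivors split the tasks.
	fmt.Println("partitions after failover:", len(reassign(tasks, 2)))
}
```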

1.5 Practical results

Ultimately, our fault detection time dropped from an average of 3 minutes to less than 1 minute, a significant improvement.

(Final architecture after transformation)

2. How does Qunar.com design fault location tools?

Microservices bring many conveniences, but also new challenges. One is the complexity of service links: at Qunar.com, a single air-ticket order request may pass through more than a hundred applications, making the whole link long and complex. Another is that many applications have complex dependencies of their own: they depend not only on other services but also on middleware such as MySQL, Redis, and MQ, and their runtime environment is itself a dependency.

Therefore, Qunar.com's root cause analysis platform aims to solve one core problem: how to find the likely root causes of failures or alarms when both the links and the dependencies are complex.

2.1 Analysis model

The following is an overview of the analysis model of Qunar.com's root cause analysis platform. Each part is introduced in detail below.

(Analytical model overview)

2.2 Knowledge graph construction

The construction of the knowledge graph is divided into four parts:

1) Basic data: including a unified event center (Qunar.com has a unified event center that can obtain events such as releases, configuration changes, operating system execution actions, etc.), logs, Traces, monitoring alarms, application portraits, etc.

2) Establishment of application associations: including service call chain and strong and weak dependencies.

3) Establishment of resource relationships: including various resource relationships that applications depend on (i.e., various resources that applications depend on, such as MySQL, MQ, etc.); physical topology awareness (awareness of the host and network environment of applications running in containers or KVM).

4) Establishing correlations between exceptions: The corresponding Trace, Log, etc. can be found accurately and quickly through abnormal indicators; and the correlations between abnormal alarms can be mined.

2.3 Anomaly analysis

Anomaly analysis is divided into two parts, one is application analysis and the other is link analysis.

2.3.1 Application Analysis

The main task of application analysis is to examine an application's dependencies. We inspect every link it depends on, looking for possible abnormal events. Application analysis consists of four modules: runtime analysis, middleware analysis, event analysis, and log analysis.

Runtime analysis: When an application alarms or fails, we check whether its running environment is stable, including the state of the KVM, host, and container, and whether the JVM has problems such as Full GC or excessive GC time. We also perform single-instance analysis: when a metric is abnormal or alarming, we run outlier detection on that metric across every machine. For example, if the metric stays stable on five machines but fluctuates sharply on one, we treat that machine's anomaly as a likely cause of the overall metric anomaly (a simple sketch of this check follows the four modules below).

Middleware analysis: Based on the application and its topological relationship, we will check whether there are exceptions or alarms on the resources that the application depends on, such as MySQL. At the same time, we will also analyze whether there are a large number of slow queries and other problems during this period.

Log analysis: We extract exception types from the logs and compare them against previous periods. If an exception spikes shortly before an alarm, or a new exception appears, we treat it as a possible cause of the alarm or failure; exceptions that have always been present in the logs are not considered causes. We also provide a subscription feature so business lines can follow the exceptions they care about.

Event analysis: In the event center, we will check whether important events such as release events and configuration changes occurred during the time period of the fault or alarm to help us locate the problem more accurately.
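
The single-instance outlier check mentioned under runtime analysis could look roughly like the following; it is an illustrative heuristic (compare each host against the median of its peers), not Qunar's actual detection algorithm.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// outlierHosts flags hosts whose value for one metric deviates from the
// median of all hosts by more than tol (relative). Illustrative heuristic
// only; a production detector can use more robust statistics.
func outlierHosts(perHost map[string]float64, tol float64) []string {
	vals := make([]float64, 0, len(perHost))
	for _, v := range perHost {
		vals = append(vals, v)
	}
	sort.Float64s(vals)
	median := vals[len(vals)/2]

	var out []string
	for host, v := range perHost {
		if median != 0 && math.Abs(v-median)/math.Abs(median) > tol {
			out = append(out, host)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	// p95 latency (ms) of the same metric on each machine of the application.
	latency := map[string]float64{
		"host-1": 102, "host-2": 98, "host-3": 101, "host-4": 99, "host-5": 480,
	}
	// host-5 fluctuates sharply while the others stay stable, so it is the
	// likely cause of the overall metric anomaly.
	fmt.Println(outlierHosts(latency, 0.5)) // [host-5]
}
```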

The goal of application analysis is to take a comprehensive look at the health of the application itself and of its dependencies. Sometimes an application's alarm or failure is caused not by the application itself but by another application on the link, which is where link analysis comes in.

2.3.2 Link analysis

The fundamental goal of link analysis is to find the specific links that cause application anomalies. In an application link, there may be multiple applications that are interdependent, and an abnormality in any one link may lead to an abnormality in the entire application. Therefore, we need to analyze the call link and find the source of the problem.

Challenge 1: How to find the calling link that is truly related to the current abnormal indicator?

For example, suppose application A exposes interface A and interface B. At any moment, a large number of requests enter both interfaces and generate a large number of traces. When application A alarms, extracting traces by time alone would pull in many traces irrelevant to the alarm: traces that entered through interface B do not help us analyze an anomaly on interface A and may even interfere with the analysis.

Solution:

To solve this, we modified the monitoring SDK and linked it with QTracer. For each incoming request, QTracer checks whether a TraceID is already present on the current link; if so, it creates a Span object, and if not, it starts a new trace. When we instrument code and record statistics such as QPS, we simply check whether a QTracer object exists in the current environment, and if it does, we associate our metrics with the trace data. This guarantees that the metrics associated with a trace come from traffic that actually passed through it. If the metric of interface B is abnormal, we can use that metric to find the related traces in reverse. We index this data to make looking up related traces easy.
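
A hedged sketch of the linkage is shown below; the function and type names are invented for illustration and are not the real QTracer or SDK API. The point is simply that a metric write checks the current context for a trace and, if one exists, records the (metric, trace ID) pair in an index that can later be queried in reverse.

```go
package main

import (
	"context"
	"fmt"
)

type traceKey struct{}

// traceIDFrom returns the trace carried by the current request context, if
// any. Stand-in for checking whether a QTracer object exists on the link.
func traceIDFrom(ctx context.Context) (string, bool) {
	id, ok := ctx.Value(traceKey{}).(string)
	return id, ok
}

// metricTraceIndex maps metric name -> trace IDs whose traffic produced it,
// so an abnormal metric can be used to find its related traces in reverse.
var metricTraceIndex = map[string][]string{}

// recordQPS is an illustrative instrumentation call: it would bump the
// counter (omitted) and, if the request carries a trace, associate the
// metric with that trace in the index.
func recordQPS(ctx context.Context, metric string) {
	if id, ok := traceIDFrom(ctx); ok {
		metricTraceIndex[metric] = append(metricTraceIndex[metric], id)
	}
}

func main() {
	ctx := context.WithValue(context.Background(), traceKey{}, "trace-42")
	recordQPS(ctx, "app_a.interface_b.qps")

	// Later, when app_a.interface_b.qps alarms, we can look up exactly the
	// traces whose traffic produced that metric.
	fmt.Println(metricTraceIndex["app_a.interface_b.qps"]) // [trace-42]
}
```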

Challenge 2: Too many traces are found; how do we converge them?

After solving the first problem, the second was how to converge when the number of traces found is too large. Take interface A: we can obtain the traces related to it, but because its QPS is very high, with thousands or even tens of thousands of requests per second, looking up traces within the last three minutes can return an enormous amount of data, even tens of thousands of items. Among so many traces we need to find which call actually has a problem, and because different calls may take different paths, the problem becomes even more complicated.

Solution:

To solve this problem, we adopted three strategies to converge the traces.

First, an exception-trace strategy. We mark abnormal calls: for example, when application A calls application B and B returns a status code other than 200 or a similar abnormal status, we mark node B as abnormal. Such abnormal traces are very useful and must be kept.

Secondly, some applications may return a 200 status code while the request they handled was actually an exception or error. Because our exception-marking rules are relatively simple, such cases may go unnoticed. For these traces we apply T-value classification screening, that is, classification by entry point.

Finally, we apply topological similarity screening. If we still get too many traces through interface A and interface B, we compare their topological similarity; if the similarity exceeds 90%, we randomly discard some of them and keep only one or two.

Through these convergence strategies, we reduce the number of traces to a manageable range, for example no more than ten, which makes locating and analyzing the problem much easier.
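
The topological-similarity screening in the third strategy can be sketched roughly as follows, using an invented similarity measure (shared caller->callee edges) purely for illustration.

```go
package main

import "fmt"

// Trace is reduced to its topology: the set of caller->callee edges.
type Trace struct {
	ID    string
	Edges map[string]bool // e.g. "A->B": true
}

// similarity is the share of edges the two topologies have in common,
// relative to the smaller trace. Illustrative measure only.
func similarity(a, b Trace) float64 {
	common, small := 0, len(a.Edges)
	if len(b.Edges) < small {
		small = len(b.Edges)
	}
	if small == 0 {
		return 0
	}
	for e := range a.Edges {
		if b.Edges[e] {
			common++
		}
	}
	return float64(common) / float64(small)
}

// converge keeps a trace only if its topology is not >90% similar to one
// already kept, shrinking thousands of traces down to a handful.
func converge(traces []Trace) []Trace {
	var kept []Trace
	for _, t := range traces {
		dup := false
		for _, k := range kept {
			if similarity(t, k) > 0.9 {
				dup = true
				break
			}
		}
		if !dup {
			kept = append(kept, t)
		}
	}
	return kept
}

func main() {
	t1 := Trace{"t1", map[string]bool{"A->B": true, "B->C": true}}
	t2 := Trace{"t2", map[string]bool{"A->B": true, "B->C": true}} // same path, dropped
	t3 := Trace{"t3", map[string]bool{"A->D": true}}               // different path, kept
	fmt.Println(len(converge([]Trace{t1, t2, t3}))) // 2
}
```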

Challenge 3: How to locate the AppCode that may be the root cause?

Once we have the traces, we also have the upstream and downstream calling relationships. Next, we need to determine which application's anomaly may have caused the alarm.

Solution:

Our strategy is to take the alarming AppCode as the starting vertex, find its connected subgraph, and traverse it. During the traversal we mark every abnormal AppCode on the link and select them as candidates; at this stage we tend to assume these abnormal AppCodes may be the cause of the alarm.

Another case is an application that is not itself abnormal but whose alarm concentration exceeds a threshold. We define alarm concentration as the proportion of the application's configured alarms that are actually firing; if that proportion exceeds a certain threshold, we consider the application unhealthy and select it as a candidate as well.

The last rule covers applications with high-level alarms, such as L1/L2 or P1/P2 alarms; these are also treated as suspicious and selected.

After screening out these suspicious applications, we analyze them further, for example by examining their runtime status and logs for possible anomalies. If none are found, the application is excluded. Once these analyses are complete, we can list the applications and anomalies we consider suspicious.
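
Putting the three selection rules together, a simplified sketch of the traversal might look like this; the data structures and the 0.5 threshold are illustrative assumptions, not the production implementation.

```go
package main

import "fmt"

// App is one AppCode node in the call graph.
type App struct {
	Code               string
	Abnormal           bool    // marked abnormal on a converged trace
	AlarmConcentration float64 // share of configured alarms actually firing
	HighLevelAlarm     bool    // has an L1/L2 or P1/P2 alarm
}

// suspicious walks the subgraph reachable from the alarming AppCode and
// picks out candidate root-cause applications by the three rules above.
func suspicious(start string, graph map[string][]string, apps map[string]App, threshold float64) []string {
	var out []string
	seen := map[string]bool{}
	queue := []string{start}
	for len(queue) > 0 {
		code := queue[0]
		queue = queue[1:]
		if seen[code] {
			continue
		}
		seen[code] = true
		a := apps[code]
		if code != start && (a.Abnormal || a.AlarmConcentration > threshold || a.HighLevelAlarm) {
			out = append(out, code)
		}
		queue = append(queue, graph[code]...)
	}
	return out
}

func main() {
	// Downstream adjacency of the alarming application A's connected subgraph.
	graph := map[string][]string{"A": {"B", "C"}, "B": {"D"}, "C": {}, "D": {}}
	apps := map[string]App{
		"A": {Code: "A"}, // the alarming application itself
		"B": {Code: "B", Abnormal: true},
		"C": {Code: "C", AlarmConcentration: 0.6},
		"D": {Code: "D"},
	}
	// B and C are selected for further runtime/log analysis; D is excluded.
	fmt.Println(suspicious("A", graph, apps, 0.5)) // [B C]
}
```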

2.4 Weighting system

After analyzing these anomalies, we will submit the anomaly information to our weighting system for evaluation. Once the assessment is complete, we can generate the final report.

Our weight system has four types of weight: static weight, dynamic weight, application weight, and strong/weak dependency weight. Static and dynamic weights are relatively easy to understand, so let's focus on how application weight is calculated.

2.4.1 Application weights

Application weight is calculated in two main ways. The first is based on the converged traces: during trace convergence, the abnormal AppCodes are identified and their weights are accumulated. For example, if application C is abnormal in traces A, B, and C, its weight accumulates across all three.

The second is application distance: the closer an application is to the alarming AppCode, the higher its weight. As shown in the figure above, application B, being the closest to application A, gets a correspondingly higher weight, because most problems are caused by a direct downstream application and rarely go more than three levels deep.
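
A rough sketch of how the two signals could combine is shown below; the base weight and the 1/distance decay are made-up parameters for illustration, not Qunar's actual formula.

```go
package main

import (
	"fmt"
	"sort"
)

// occurrence: an abnormal application seen on one converged trace, together
// with its call distance from the alarming AppCode (1 = direct downstream).
type occurrence struct {
	AppCode  string
	Distance int
}

// appWeights accumulates a weight per application: every abnormal occurrence
// adds a contribution, and closer applications contribute more. The base
// weight of 1.0 and the 1/distance decay are illustrative choices only.
func appWeights(occs []occurrence) map[string]float64 {
	w := map[string]float64{}
	for _, o := range occs {
		if o.Distance < 1 {
			o.Distance = 1
		}
		w[o.AppCode] += 1.0 / float64(o.Distance)
	}
	return w
}

func main() {
	// Application C is abnormal in three converged traces at distance 2;
	// application B is abnormal once but is the direct downstream (distance 1).
	occs := []occurrence{
		{"C", 2}, {"C", 2}, {"C", 2},
		{"B", 1},
	}
	w := appWeights(occs)

	codes := make([]string, 0, len(w))
	for c := range w {
		codes = append(codes, c)
	}
	sort.Slice(codes, func(i, j int) bool { return w[codes[i]] > w[codes[j]] })
	for _, c := range codes {
		fmt.Printf("%s: %.2f\n", c, w[c]) // C: 1.50, B: 1.00
	}
}
```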

2.4.2 Strong and weak dependency pruning

Strong/weak dependency pruning relies on a chaos engineering tool that determines whether each dependency of an application is strong or weak. If application A has a problem, and applications B and C both show some anomalies, where B is a weak dependency and C is a strong dependency, we tend to conclude that C caused A's problem.


2.5 Report output

After weight calculation and sorting, a report is produced that tells the user the probable cause of the failure. More detailed exception information is available via View Details.


2.6 Practical results

With this approach, we have reduced the proportion of faults with slow fault location to 20%, and the location accuracy is between 70% and 80%.

(Analysis results page display - locating fault causes and abnormal logs)

3. Scrambling when something goes wrong? The plan system assists in troubleshooting

(Qunar.com's plan system, currently under development)

The design of this system is mainly divided into three parts: plan triggering, plan recommendation and management module. The management module is currently being implemented internally, and the triggering and recommendation modules are still in the planning stage.

First, the plan system matters because most alarms and faults come with abnormal metrics, yet some alarm thresholds are set too sensitively: even a core alarm, when triggered, does not always correspond to a real fault. We therefore need a second confirmation of the weight of these alarms.

When we confirm a probable fault event, the plan trigger module starts. The plan recommendation module then performs event monitoring and rule matching. The rules mainly come from our experience working with the business lines; many of them already have standard SOPs, for example: if indicator A is abnormal, check whether indicator B is also abnormal, and if so, perform certain operations. If a rule is matched, we turn the corresponding SOP into a recommendation and present it to the user. If no rule matches, we run root cause analysis and generate SOPs or actions from the analysis report; for example, if we find that the fault was probably caused by a release, we recommend considering a rollback.

The management module is relatively simple; it is mainly responsible for entry and execution. Users record their risk scenarios and the corresponding actions, and once these are formed into SOPs, the recommendation module can recommend them.
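
A hedged sketch of the rule-matching idea in the recommendation module is shown below; the rule format and metric names are invented, and the module itself is still being planned, so this is only an illustration of the intended flow.

```go
package main

import "fmt"

// Rule encodes a business-line SOP: if all listed metrics are abnormal,
// recommend the associated actions. The format is illustrative only.
type Rule struct {
	Name     string
	Requires []string // metrics that must all be abnormal
	Actions  []string // SOP steps to recommend
}

var rules = []Rule{
	{
		Name:     "payment-degradation",
		Requires: []string{"order.pay.success_rate", "pay.gateway.error_rate"},
		Actions:  []string{"switch to the backup payment gateway", "notify the payment on-call"},
	},
}

// recommend returns the SOP of the first rule whose required metrics are all
// abnormal; if nothing matches, the event falls through to root cause
// analysis, whose report is turned into actions (e.g. "roll back the release").
func recommend(abnormal map[string]bool) []string {
	for _, r := range rules {
		matched := true
		for _, m := range r.Requires {
			if !abnormal[m] {
				matched = false
				break
			}
		}
		if matched {
			return r.Actions
		}
	}
	return []string{"run root cause analysis and act on its report"}
}

func main() {
	abnormal := map[string]bool{
		"order.pay.success_rate": true,
		"pay.gateway.error_rate": true,
	}
	fmt.Println(recommend(abnormal))
}
```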

4. Personal summary

Using the fault MTTR indicators to drive the optimization and construction of our monitoring system is a results-oriented approach, rather than focusing, as before, only on the monitoring system itself or its internal performance metrics. Second-level monitoring mainly targets high-level faults, especially order-type faults, because they cause the greatest losses; other faults have relatively little impact and some may not even draw the attention of senior leadership, whereas order-related faults certainly will. Our goal is to detect such failures within one minute. Second-level monitoring is also useful in other scenarios, such as flash sale activities.

Finally, our root cause analysis platform mainly helps us locate faults in complex system environments and dependency relationships. It confirms the components an application depends on and the health of the applications it depends on, and calculates fault-related weights to help us locate problems more accurately. (Full text ends)

Q&A

1. Were the rules for root cause location's abnormal data and the knowledge graph defined from the start? How was the initial base data collected? How is it maintained afterwards?

2. How long did it take for your root cause location platform to achieve its current effect?

3. An interface times out occasionally; the call chain only shows the name of the timed-out interface but not the internal method, so the root cause cannot be located and the problem is hard to reproduce. What should we do?

For the answers to the above questions, please see the full video version of the talk.


