How We Cut Per-Capita Alarms by 90%: Design and Practice of Bilibili's New-Generation Alarm Platform


A one-minute overview of the highlights

As Bilibili's business scale and user base continue to grow, so do its requirements for service stability and availability. Its monitoring and alerting system must detect and locate problems promptly and accurately, so that they can be resolved as quickly as possible and the user experience preserved.

This article is a detailed record of a major iteration and optimization of Bilibili's alarm and monitoring system. It elaborates on the design ideas and optimization iterations of the alarm platform, along with the problems encountered during implementation and their solutions. In particular, it presents new design approaches and practical methods for improving the accuracy and efficiency of alarm localization.


About the author


Wang Chengtian, Senior Development Engineer at Bilibili

Member of the TakinTalks stability community expert team and senior development engineer at Bilibili. Joined Bilibili in 2020 and has led the technical evolution and platform iteration of the event platform, link tracing, AIOps, and the alarm platform. Delivered the new-generation alarm platform, achieving end-to-end anomaly detection within one minute at the 99th percentile and reducing alarms handled per person from 1,000+ per week to 70+ per week.

Friendly reminder: this article is about 6,000 words and takes roughly 8 minutes to read.

Reply "Communication" in the background of TakinTalks Stability Community to enter the reader communication group; reply "1130" to obtain courseware;

Background

Across Bilibili's diverse businesses, the alarm platform plays a vital role. Video playback, danmaku (bullet comment) sending, user comments, live-room management, backend content review, data statistics: all of these depend on stable system operation. The alarm platform monitors the running state of these business systems in real time; once an anomaly occurs, it raises an alarm promptly so that operations staff can quickly locate and handle the problem.

However, keeping these businesses running stably is no easy task. Viewed across the three stages before, during, and after an alarm occurs, that is, the production side, the transmission side, and the consumption side, the design of Bilibili's alarm platform faces considerable challenge and complexity.

Given these business needs and this complexity, we undertook a comprehensive upgrade to a new-generation alarm platform. As a result, we achieved a 90% reduction in alarms per person and 87.9% accuracy in root cause analysis. In this article, I will outline the design concept of the new-generation alarm platform, focusing on its strategies for alarm noise reduction and intelligent alarm analysis.

1. What key designs have been made for the alarm platform?

1.1 Core business demands

Throughout the design and iteration of the alarm platform, we continually received business requirements of many kinds. Broadly, they center on "taking quality as the core, discovering and handling anomalies promptly, and ensuring business stability." The demand scenarios divide into risk scenarios and failure scenarios: in risk scenarios, potential problems must be sensed in advance and dealt with effectively; in failure scenarios, online problems must be discovered quickly and responded to promptly to achieve rapid recovery.

From these business needs we distilled three core demands and goals:

Validity: every alarm received should be valid, meaning that each one corresponds to a real anomaly and is meaningful to its current recipient.

Timeliness: high-priority anomalies should reach users as quickly as possible and be noticed, so that timely action can be taken.

Coverage and follow-up: alarms should be comprehensively covered and followed up. From the user's perspective, every business or application scenario they are responsible for should be covered by alarms, and they should be able to see the alarm coverage in each scenario. When a rule fires on an anomaly, users need a convenient way to handle and follow up on it quickly.

1.2 Detailed design of alarm platform

1.2.1 Closed-loop model

In the detailed design of the alarm platform, we built a closed-loop model around the goals and business requirements above. The model is designed to keep all relevant parties actively involved so that the goals improve continuously. Alarm definition and alarm management are the two key links in the model, because they determine how effective noise reduction and recall will be.

Next, I will introduce the design of Bilibili's alarm definition, detection, and channel-side functions, focusing on alarm handling, root cause analysis, and practical experience with alarm management.

1.2.2 Alarm access

The alarm access process distinguishes three main scenarios, whose designs cover most business needs for alarm access.

1) Platform-oriented scenarios

For alarm scenarios covered by the platform, we provide open interfaces for alarm rules and templates and support multi-tenant rule integration. Different tenants configure against predefined templates, so alarm definition and coverage can be completed quickly when applying for a service or registering a resource. The whole process is low-cost and convenient.

2) Customized scenarios

We open alarm-rule definition directly to the business, including configuration of trigger conditions, rule expressions, and notification strategies.

3) Third-party events

We provide open event-integration capabilities. Users register events and send them to the alarm platform by active triggering; the platform then completes the subsequent closed-loop processing. A minimal sketch of such an integration follows.
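To make the third scenario concrete, here is a minimal sketch of what an active event push could look like. The endpoint, payload schema, and field names below are illustrative assumptions; the article does not document the platform's actual integration API.

```python
import json
import urllib.request

# Hypothetical endpoint and payload schema: the article does not document the
# platform's actual integration API, so everything below is illustrative.
ALARM_EVENT_ENDPOINT = "https://alarm.example.internal/api/v1/events"

def send_third_party_event(event_name: str, entity: str,
                           severity: str, detail: str) -> int:
    """Actively push a pre-registered event to the alarm platform, which then
    handles the closed loop (noise reduction, rendering, distribution)."""
    payload = {
        "event_name": event_name,   # must match an event registered in advance
        "entity": entity,           # affected service or resource
        "severity": severity,       # e.g. "P0" .. "P3"
        "detail": detail,
    }
    req = urllib.request.Request(
        ALARM_EVENT_ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status

# Example: a deploy system reports a failed rollout it wants alerted on.
# send_third_party_event("rollout.failed", "svc.comment", "P1",
#                        "canary error rate 40%")
```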

1.2.3 Alarm calculation

We designed a distributed alarm computation engine with multi-level scheduling. Based on the full set of rules in effect online, global scheduling assigns alarm scenarios to availability zones and computing clusters. Within a cluster, local scheduling assigns tasks to individual computing nodes to balance load. Each computing node periodically evaluates the data to determine whether an alarm fires and, on firing, delivers it to the alarm channel.
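The article does not specify the scheduling algorithm, but a common way to realize this kind of stable two-level placement is rendezvous (highest-random-weight) hashing. The sketch below is one hypothetical realization: global scheduling picks a cluster for a rule, local scheduling picks a node inside it.

```python
import hashlib

def _weight(key: str, target: str) -> int:
    # Rendezvous (highest-random-weight) hashing: stable assignment with
    # minimal reshuffling when clusters or nodes join and leave.
    return int(hashlib.md5(f"{key}:{target}".encode()).hexdigest(), 16)

def schedule(rule_id: str, scenario: str, clusters: dict) -> tuple:
    """Two-level placement of one alarm rule: global scheduling picks a
    computing cluster (keyed by scenario), local scheduling picks a node
    inside it. `clusters` maps cluster name -> list of node names."""
    cluster = max(clusters, key=lambda c: _weight(f"{scenario}:{rule_id}", c))
    node = max(clusters[cluster], key=lambda n: _weight(rule_id, n))
    return cluster, node

clusters = {
    "az1-cluster": ["node-a", "node-b"],
    "az2-cluster": ["node-c", "node-d"],
}
print(schedule("rule-42", "slo-availability", clusters))
```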

1.2.4 Alarm channel

The channel is mainly responsible for noise reduction, rendering, and distribution, to achieve accurate delivery and fast reach. For each alarm event generated on the engine side, the channel generates an alarm message, which then passes through the noise-reduction module: notification-window interception, notification-frequency interception, alarm interception, silence interception, suppression interception, and alarm aggregation, in that order. Next, the rendering and distribution module resolves the recipients, the notification channel, and the notification template; finally the message reaches the user and the alarm's delivery status is updated.
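A minimal sketch of such a sequential noise-reduction pipeline is shown below. The stage logic and thresholds are invented for illustration; only the stage ordering follows the description above.

```python
import time
from dataclasses import dataclass, field

@dataclass
class AlarmMessage:
    rule_id: str
    receiver: str
    fired_at: int                          # unix seconds
    labels: dict = field(default_factory=dict)

# Each stage returns True when the message should be intercepted (dropped).
# The concrete logic and thresholds are illustrative assumptions.

def outside_notify_window(msg: AlarmMessage, window=(9, 21)) -> bool:
    hour = time.gmtime(msg.fired_at).tm_hour        # UTC for simplicity
    return not (window[0] <= hour < window[1])

def over_frequency(msg: AlarmMessage, sent_log: dict, limit=5, period=3600) -> bool:
    recent = [t for t in sent_log.get(msg.rule_id, []) if msg.fired_at - t < period]
    return len(recent) >= limit

def silenced(msg: AlarmMessage, silences: list) -> bool:
    return any(s["rule_id"] == msg.rule_id and s["until"] > msg.fired_at
               for s in silences)

def passes_noise_reduction(msg: AlarmMessage, sent_log: dict, silences: list) -> bool:
    stages = (
        outside_notify_window,
        lambda m: over_frequency(m, sent_log),
        lambda m: silenced(m, silences),
        # ... suppression interception and aggregation would follow here
    )
    for stage in stages:
        if stage(msg):
            return False      # intercepted; never reaches rendering
    return True               # proceed to rendering and distribution
```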

2. What are the practical experiences of alarm management?

Below I explain how Bilibili implements alarm management strategies in practice, reduces alarm noise, and improves alarm effectiveness.

2.1 Alarm management background

For a long time, the flood of alarm notifications at Bilibili was a years-old problem and a major pain point for the technical teams and platforms. On one hand, alarm rules and configurations kept growing as stability issues surfaced and new platforms were connected; on the other, there was no effective alarm management or operational analysis mechanism. Alarms grew ever more numerous, and many users, rather than managing them, simply switched on do-not-disturb. In the end, real risks and anomalies were often buried among countless alarms and went unnoticed.

To sum it up in one sentence: "too many alarms is the same as no alarms."

2.2 Problem analysis

We attribute this mainly to the following four causes:

Unreasonable alarm definitions: many rules lacked effective maintenance; a large share fired without any real anomaly, so the business would not handle them. In addition, some alarms were too fine-grained, so a single anomaly generated a large number of alarms and amplified the overall notification volume.

Amplified notification audiences: because of historical organizational changes, and because permissions were coupled to notification for temporary troubleshooting, the service tree went ungoverned for a long time. Alarm notifications were thus greatly amplified and frequently sent to unrelated personnel.

Lack of analysis and management tools: although everyone knew there were too many alarms and that they needed governing, they had no starting point and no suitable platform capability to show where alarms concentrated, so targeted governance was impossible.

Lack of effective operating mechanisms: motivation to manage alarms was insufficient, and there were no effective mechanisms or normative constraints. Historically, after short bursts of governance, alarm volumes rebounded and the gains evaporated.

2.3 Three stages of alarm management

After analyzing the causes of alarm proliferation, we began alarm management, which proceeded in three main stages.

Phase 1: Goal Setting

After many rounds of meetings and discussion, we set the target for alarm volume: reduce from more than 1,000 alarms per person per week to 80 per person per week. In the early stages this goal looked almost impossible. But we believed that 80 alarms per week is a range business owners can actually respond to and handle one by one, so we committed to the goal and did everything we could to reach it.

Phase 2: Data Analysis

We integrated alarm data into the data warehouse, providing multi-dimensional analysis views and data support for alarm management. The governance impact factor is derived from the notification count: number of alarm notifications = number of alarms × recipients notified per trigger × noise-reduction coefficient. This formula clarified the direction and focus of governance.
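In symbols, and with purely hypothetical numbers to show why each factor matters:

```latex
N_{\text{notify}} = N_{\text{alarm}} \times \bar{R} \times k
```

where \(N_{\text{alarm}}\) is the number of alarm triggers, \(\bar{R}\) the average number of recipients per trigger, and \(k\) the fraction of messages surviving channel noise reduction. For instance, 200 triggers with 10 recipients each and k = 0.8 produce 200 × 10 × 0.8 = 1,600 notifications; trimming recipients to 4 and tightening interception to k = 0.5 yields 200 × 4 × 0.5 = 400, a 75% reduction without suppressing a single underlying alarm.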

Phase 3: Governance Actions

At this stage we began executing a series of governance actions.

First, we optimized the alarm items: with SRE and platform colleagues we assessed the rationality of the default alarm items and closed the invalid ones. For unreasonable items, we reviewed and tuned their expressions, default thresholds, and other conditions to reduce alarm noise and improve alarm effectiveness.

Second, to address audience amplification, we optimized the alarm recipient settings. We participated deeply in building out the service tree and calibrating owner roles, and launched capabilities such as on-duty rotation and escalation to narrow the notification scope, effectively reducing the noise amplified by recipient counts.

Finally, we enriched the channel's noise-reduction strategies, adding fine-grained interception at the rule-group level and multi-dimensional alarm summary and aggregation, reducing noise more efficiently. A sketch of windowed aggregation follows.
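As a sketch of the aggregation idea: alarms sharing the same aggregation dimensions inside a time window collapse into one summary notification. The dimensions, window size, and record shape below are illustrative assumptions; the text only states that rule-group interception and multi-dimensional summary/aggregation are supported.

```python
from collections import defaultdict

def aggregate(alarms: list, dims=("rule_group", "cluster"), window: int = 120) -> list:
    """Collapse alarms that share the chosen dimensions within a time window
    into one summary notification."""
    buckets = defaultdict(list)
    for a in alarms:
        # Bucket key: the alarm's values on each dimension, plus the window slot.
        key = tuple(a["labels"].get(d) for d in dims) + (a["fired_at"] // window,)
        buckets[key].append(a)
    summaries = []
    for key, group in buckets.items():
        summaries.append({
            "title": f"{len(group)} alarms for {dict(zip(dims, key))}",
            "samples": group[:3],          # a few representative alarms
        })
    return summaries
```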

2.4 Summary of experience in alarm management process

After this series of alarm governance actions, we reviewed and reflected on the process. Three lessons stand out:

1) Anomaly recall is the bottom line

Throughout governance, the focus was on invalid, noisy alarms, but real, valid alarms must never be dropped. We always held to the principle that anomaly recall cannot be sacrificed. We therefore concentrated on unreasonable rules such as those that fire continuously, fire repeatedly, or amplify anomalies, together with the top noisiest rules, and governed those specifically.

2) Operational advancement is essential

The governance process is, in essence, an operational push. Dozens of working groups were created and countless governance, promotion, and follow-up documents were written to land this intensive, mandatory alarm-management work. We also worked closely with the various scenario platforms and SREs to drive alarm management jointly.

3) Layer, grade, and focus on key directions

We drew on Teacher Mao Jian's ideas of layering and grading with clear priorities, especially for core alarms that safeguard business availability. Event-type alarms are best recorded and made queryable rather than pushed, to avoid drowning out other important notifications. Dependency alarms unsuited to direct notification of the business side are best surfaced through correlation, which effectively reduces alarm noise.

2.5 Alarm management effect

Over the past six months, push-driven alarm management has markedly improved the alarm data:

The median number of alarms per person fell from 1,000 per week to 74 per week, down to 7.4% of the original;

Total alarm notifications fell from 300,000+ before governance to 22,000+, roughly 7% of the original;

Per-capita alarm notifications fell from 1,600+ to 140, or 8.8% of the original.

2.6 Establish a long-term governance mechanism

To keep the gains from alarm management sustainable and stable, we built a long-term governance mechanism with three parts:

1) Convenient governance analysis tools

An alarm-governance analysis dashboard is integrated into the alarm platform. It supports analyzing alarm distribution and trends from business, individual, and company perspectives, across multiple dimensions. We also provide governance tools that supply the capabilities needed for rapid cleanup.

2) Data report subscription

We reached consensus with each SRE and component platform, set shared goals, and generated subscriptions to various data reports. We keep following up on these reports to prevent alarm regression and to continue governance.

3) Access control

For newly connected rules and platform scenarios, we added manual verification of alarm data access, plus platform-side validation and noise monitoring. For unreasonable rules, we advise the corresponding integrating party to optimize them, preserving the results of alarm management.

3. How is alarm root cause analysis designed and applied?

Drawing on technologies now widely used in the industry, such as anomaly detection, intelligent noise reduction, intelligent merging strategies, and root cause analysis, I will focus on Bilibili's practice and hands-on experience with root cause analysis.

3.1 Root cause analysis background

As the SLO system has matured, more businesses have been connected to it. This gives us clear anomaly definitions and trigger objects, allowing root cause analysis to be performed more accurately.

In addition, Bilibili has set a "1-5-10" goal: discover problems within 1 minute, locate them within 5, and recover within 10. By optimizing alarm computation, collection, and channels, we have achieved end-to-end problem discovery within one minute at the 99th percentile. Five-minute localization, however, still has bottlenecks. The localization process leans too heavily on manual experience and carries real uncertainty: expert experience varies from person to person and from scenario to scenario, so localization efficiency is inconsistent. Platform capabilities and their interactions, as well as delayed responses in extreme situations such as early-morning failures or business trips, also hurt localization efficiency.

Finally, user needs kept accumulating. As the business and infrastructure grew, people paid ever more attention to alarm-localization efficiency and built up substantial experience in locating alarms. The need for root cause analysis thus grew increasingly urgent, just as the conditions for implementing it matured.

3.2 Root cause analysis design

3.2.1 Root Cause Analysis Design Version 1.0

We first designed localization capabilities for the golden signals under a microservice architecture, based on SLO alarms. This was the version 1.0 root cause analysis design, consisting of three stages.

In the first stage, we listen for SLO alarm triggers, correlate the error scope and data, and analyze logs and metrics to dig out the anomalous dimensions; for example, errors may concentrate on one instance of a particular cluster, or on one interface of a particular upstream caller (a simplified drill-down sketch appears after the three stages).

In the second stage, we use these error dimensions to correlate application-side link calls and find the abnormal traces during the anomaly window. Aggregation analysis separates latency scenarios from error scenarios, and critical-path analysis with pruning and drill-down locates the abnormal nodes on the abnormal links.

In the final stage, the located abnormal nodes are mapped onto the knowledge graph and related, through the data graph, to the relevant databases, caches, message queues, containers, machines, switches, cabinets, and machine rooms. Combined with anomalies, alarms, changes, and other signals, a correlation-analysis model scores and recommends candidates, finally producing a ranked list of the most likely root causes.
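The drill-down in stage one can be illustrated with a much-simplified sketch: for each dimension, flag values whose share among errors far exceeds their overall share of traffic. The field names, thresholds, and in-memory representation are assumptions; the real platform works over log and metric stores.

```python
from collections import Counter

def drill_down(error_records, total_records,
               dims=("cluster", "instance", "caller", "api")):
    """Much-simplified stage-one dimension analysis: flag dimension values
    whose share among errors greatly exceeds their share of overall traffic."""
    if not error_records or not total_records:
        return []
    findings = []
    for dim in dims:
        err = Counter(r[dim] for r in error_records if dim in r)
        base = Counter(r[dim] for r in total_records if dim in r)
        for value, count in err.items():
            err_share = count / len(error_records)
            base_share = base[value] / len(total_records)
            # Errors concentrate on this value well beyond its normal share.
            if err_share > 0.8 and err_share > 3 * max(base_share, 1e-9):
                findings.append((dim, value, err_share))
    return sorted(findings, key=lambda f: -f[2])

# e.g. drill_down(errs, reqs) -> [("instance", "app-03", 0.97), ...]
```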

Although version 1.0's accuracy was acceptable in some scenarios, it still had bottlenecks.


3.2.2 Root Cause Analysis Design Version 2.0

When designing version 2.0 of root cause analysis, we conducted a comprehensive survey and studied in depth how senior SREs and business colleagues in the industry locate problems.

In summary, the process runs: after an alarm fires, expert experience relates it to specific abnormal events; from that experience, the possible causes of the anomaly are analyzed and the relevant metrics, logs, and other observational data for those causes are found; finally, observing that data confirms the root cause of the anomaly.

In designing version 2.0, we strove to generalize this process, building an anomaly knowledge graph that captures and accumulates expert experience. We also opened up root-cause-analysis integration and a knowledge catalog, providing general-purpose root cause analysis capabilities.


The implementation of root cause analysis revolves around the definition of knowledge, which is grounded in anomaly definitions on nodes. Each anomaly is defined on a specific entity node, such as an application's cluster instance or a database. Each anomaly is also associated with specific data nodes, such as logs, metrics, and alarms; from this data we can detect and judge whether the anomaly occurred. Anomalies, in turn, have propagation relationships among themselves: which anomalies can cause a given anomaly, and what are the propagation links and correlations between them? This is the knowledge structure we define in our knowledge engineering. A toy fragment of such a definition follows.
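The class and field names below are invented for illustration, not the platform's actual schema; they only mirror the structure just described: an anomaly defined on an entity, backed by data evidence, with propagation edges to other anomalies.

```python
from dataclasses import dataclass, field

@dataclass
class DataRef:
    kind: str      # "metric" | "log" | "alarm": evidence used to detect the anomaly
    query: str     # how to fetch or detect that evidence

@dataclass
class AnomalyNode:
    name: str                       # e.g. "redis.slow_request"
    entity_type: str                # entity it is defined on: app cluster, instance, database...
    evidence: list = field(default_factory=list)        # DataRef items
    propagates_to: list = field(default_factory=list)   # anomalies this one can cause

# Toy fragment: a Redis slow-request anomaly can propagate upward into a
# service-availability anomaly, which is what the graph walks during analysis.
redis_slow = AnomalyNode(
    name="redis.slow_request",
    entity_type="redis_cluster",
    evidence=[DataRef("metric", "redis_cmd_latency_p99"),
              DataRef("log", "redis slowlog")],
    propagates_to=["service.availability_drop"],
)
```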

This knowledge-graph-based method of root cause analysis solves the problems of version 1.0 well. It lets us understand the causes of anomalies more deeply and locate root causes more accurately. Systematically accumulating expert experience not only improves localization efficiency but also makes that knowledge inheritable and cumulative.

3.2.3 AIOps algorithm support

Building root cause analysis capability is inseparable from AIOps algorithm support. We have built a comprehensive algorithm system designed for metrics, logs, events, links, and other scenarios, supporting capabilities such as time-series prediction, anomaly detection, metric classification, log clustering, multi-dimensional drill-down, event clustering, and link analysis over the different data types. A minimal illustration of the anomaly-detection core follows.
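As a flavor of the anomaly-detection building block, here is only the simplest core idea; production detection at this scale typically layers seasonality, trend handling, and per-metric model selection on top of primitives like this.

```python
import statistics

def zscore_anomaly(history: list, latest: float, threshold: float = 3.0) -> bool:
    """Flag a point more than `threshold` standard deviations from the
    recent mean: the crudest static-threshold-free detector."""
    mean = statistics.fmean(history)
    stdev = statistics.pstdev(history) or 1e-9   # avoid division by zero
    return abs(latest - mean) / stdev > threshold

print(zscore_anomaly([100, 98, 103, 101, 99, 102], 160))  # True: clear spike
```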

3.2.4 Upgrade of correlation analysis model

In correlation analysis, we build the anomaly graph through feature construction, model loading, and prediction and scoring based on model inference, finally recommending the root cause and its propagation path. Feature design covers time-correlation features, graph relationship/distance features, causal features, path-similarity features, and event-category features.

For model construction, the early stage relied on a cold start: propagation coefficients between anomalies were predefined. As labeled samples accumulate, model training begins and inference and recommendation are driven by the model. We also support a GBDT model to further strengthen correlation analysis. A sketch of both stages follows.
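A sketch of the two stages under stated assumptions: a hand-tuned cold-start score before labels exist, then a GBDT classifier (here scikit-learn's GradientBoostingClassifier, one common GBDT implementation) trained on labeled anomaly edges. The feature names mirror the families listed above, but the encoding and weights are invented.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Hypothetical feature vector for one candidate "anomaly A caused anomaly B"
# edge, mirroring the feature families above (time correlation, graph
# relationship/distance, causality, path similarity, event category).
FEATURES = ["time_corr", "graph_dist", "causal_score", "path_sim", "same_category"]

def cold_start_score(f: dict) -> float:
    # Before enough labeled samples exist: predefined propagation coefficients.
    # Weights are illustrative, not the platform's actual values.
    return 0.4 * f["time_corr"] + 0.3 * f["causal_score"] + 0.3 * f["path_sim"]

def train(X: np.ndarray, y: np.ndarray) -> GradientBoostingClassifier:
    """X: one row of FEATURES per candidate edge; y: 1 if that edge was on
    the true root-cause propagation path in a labeled incident."""
    model = GradientBoostingClassifier(n_estimators=100, max_depth=3)
    model.fit(X, y)
    return model

def rank_edges(model: GradientBoostingClassifier, X: np.ndarray) -> np.ndarray:
    # Probability each candidate edge lies on the root-cause path; sort
    # descending to produce the Top-N root cause recommendation.
    return model.predict_proba(X)[:, 1]
```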


3.3 Difficulties and key points

Root cause analysis faces certain difficulties and key points. The main difficulty is how to effectively evaluate recommended root causes and how to obtain accuracy and evaluation data from users. This is critical for comparing versions, assessing the accuracy and effectiveness of root cause analysis, and iterating the model. To address it, we carried out four streams of work:


Throughout the process, the focus has been on gathering and building up expert experience. We communicate in depth with different business and SRE colleagues, sort out the relevant anomaly knowledge and expert experience, and complete the definition of that knowledge.

For components, we continue to improve the knowledge definition and construction capabilities together with the relevant colleagues. For special scenarios, we also provide integration solutions based on external expert experience, exposing root-cause-analysis integration capabilities and interfaces so that externally produced root causes can likewise be recommended, displayed, and evaluated alongside alarms.

3.4 Case analysis

3.4.1 A surge in upstream traffic triggers service throttling and reduces availability

In this case, the alarm lets us associate the SLO anomaly with specific services and drill down to the specific anomalous interfaces and errors. Anomaly correlation analysis detects the surge in upstream traffic and identifies which entry point surged. With this information, the business side can apply rate-limiting configuration to the upstream services and repair the issue.


3.4.2 Downstream changes lead to service anomalies affecting gateway interface availability

Root cause analysis associates the alarm with specific links, and with the specific downstream anomalies and changes. It ultimately recommends that a change to the AI service is the main cause affecting gateway interface availability. Business colleagues can quickly find the corresponding change owner and perform stop-loss operations such as troubleshooting and rollback.


3.4.3 Downstream Redis request exception affects service interface

Through root cause analysis, the anomaly is linked to specific services and the anomaly propagation path is constructed. The specific abnormal request times and logs are recommended alongside. The business side can then work with colleagues on the related components for deeper localization.


3.5 Effect evaluation

Root cause analysis has been implemented and has delivered results:

Number of root cause analyses: 20,652/day

Accuracy: 87.9%

Recall rate: 77.5%

95th-percentile analysis time: 10s

Average analysis time: within 4 seconds. (After an alarm is issued, it takes on average only 4 seconds to recommend a root cause; business staff see the corresponding root cause directly when they open the alarm card.)


4. Summary and Outlook

When designing an alarm platform, you need to consider varied business needs and application scenarios: understand the essence of those requirements, build the platform's capabilities and key functional models on that essence, and design the closed-loop logic, so that the platform stays close to the business. Second, the alarm management process is essential: however many alarms there are, only a complete governance mechanism closes the loop on the whole system, optimizes rule definitions, continuously improves alarm effectiveness, and maximizes the value of alarms. Finally, combining alarms with artificial intelligence is the future trend: deep integration of these capabilities can improve fault-detection accuracy, shorten localization and recovery time, and safeguard quality management and system stability. (End of main text)

Q&A:

1. Does alarm management mainly rely on manually screening and reducing rules?

2. Can you give an example of alarm correlation and noise reduction on the business side? Is the correlation done on the monitoring side or the alarm side? Can business alarms be associated with infrastructure alarms?

3. Are there further scenarios for automatic analysis after an alarm?

4. How do you manage the delivery status and effectiveness of alarm notifications? How do you handle notification failures or exceptions?

5. When merging alarms, how do you keep high-value alarms from being drowned out while still auditing the quality of low-value alarms?

6. How do you optimize the efficiency and accuracy of root cause analysis to support both real-time analysis and batch processing? How do you manage and store the large volumes of data and logs involved in root cause analysis? What best practices and experiences can you share?

For the answers to these questions, click "Read the full text" to watch the complete replies.

Statement: this article was originally produced by the "TakinTalks Stability Community" public account and community experts. To reprint, reply "Reprint" in the background to obtain authorization.

