DataLeap's full-link intelligent monitoring and alarming in practice (1): common problems

As ByteDance's business grows rapidly, big data development involves more and more tasks that need to be managed. Ordinary monitoring systems, however, only support configuring monitoring rules on individual tasks, which no longer fully meets current needs. Data developers often face the following problems:

  1. Many tasks and complex dependencies: It is difficult to find and monitor all upstream tasks of an important task. If every task is monitored, many useless alarms are generated, and the useful ones get ignored;

  2. High configuration and maintenance costs: Each task runs differently and has a different promised completion time. Setting up monitoring for each task separately makes the cost of analyzing and manually aligning task SLAs very high;

  3. Diverse alarm requirements: For hourly tasks, how urgent an alarm is depends on the time of day, and ordinary monitoring cannot adequately handle alarm requirements that vary across time periods.

To operate and maintain daily tasks effectively and ensure data quality, the data development team behind ByteDance's data platform development suite built baseline monitoring, a dependency-based, full-link intelligent monitoring and alarm capability. It intelligently decides whether to raise an alarm, when to raise it, how to deliver it, and whom to notify, safeguarding the entire output link of a task. Baseline monitoring is widely used within ByteDance, covering 100+ projects such as Douyin, e-commerce, and advertising, and its coverage of SLA tasks exceeds 80%.

This capability is now also available to enterprises through Volcano Engine DataLeap. Enterprises can use DataLeap's baseline monitoring to reduce monitoring configuration costs and to avoid both invalid alarms and alarm floods.

A practical case

This section uses a practical case to introduce the core advantages of baseline monitoring over ordinary monitoring.

A user, Xiao Ming, owns a task with a promised SLA: its output must be ready before 10 o'clock. Its upstream and downstream relationships are shown in the figure below. The SLA task and tasks 4 and 5 belong to project B, and the other tasks belong to project A. Xiao Ming has operation and maintenance permissions only for project B.

 

Before baseline monitoring existed, Xiao Ming would configure a series of alarm rules on the SLA task and on its upstream tasks within project B, so that a delay in an upstream task would not cause the SLA to be breached. For example, three basic alarms are configured on each of the SLA task, task 4, and task 5, so that the risk of an SLA delay is detected and exposed in time, as shown in the following figure.

 

The problems with this approach are obvious. With basic monitoring rules, at least 9 rules must be configured just to cover the SLA task. Most rule configurations rely on expert experience, so there is still a risk of omission. And basic monitoring rules can only be applied to projects for which the user has operation and maintenance permissions, not to upstream tasks in other projects, so Xiao Ming cannot perceive the risk of a delay in advance.

With baseline monitoring, Xiao Ming only needs to add the SLA task to a baseline as a "guarantee task". All upstream nodes of the guarantee task are then covered by baseline monitoring by default. Xiao Ming no longer needs to configure multiple basic alarm rules, which greatly reduces the difficulty of configuring alarms. Once the baseline is configured, he can quickly perceive a delay in any upstream task, which effectively guarantees that the SLA task is produced on time.
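The difference in coverage between the two approaches can be illustrated with a small sketch. It is not DataLeap's implementation. The task names, the dependency graph, and the helper functions below are hypothetical, and because the original figure is not reproduced here, the topology is only an assumption consistent with the description (the SLA task and tasks 4 and 5 in project B, the rest in project A).

```python
from collections import deque

# Hypothetical dependency graph: task -> its direct upstream tasks.
# The real topology is in the article's figure; this one is assumed.
UPSTREAM = {
    "sla_task": ["task_4", "task_5"],
    "task_4": ["task_1", "task_2"],
    "task_5": ["task_3"],
    "task_1": [],
    "task_2": [],
    "task_3": [],
}

# Project ownership as described in the text.
PROJECT = {
    "sla_task": "B", "task_4": "B", "task_5": "B",
    "task_1": "A", "task_2": "A", "task_3": "A",
}

RULES_PER_TASK = 3  # "three basic alarms" per monitored task


def all_upstream(graph, target):
    """Return every direct and indirect upstream task of target (BFS)."""
    seen = set()
    queue = deque(graph.get(target, []))
    while queue:
        task = queue.popleft()
        if task in seen:
            continue
        seen.add(task)
        queue.extend(graph.get(task, []))
    return seen


def basic_rule_scope(graph, target, project_of, my_project):
    """Tasks a user can attach basic rules to: only those in their own project."""
    candidates = all_upstream(graph, target) | {target}
    return {t for t in candidates if project_of[t] == my_project}


# Baseline-style coverage: the guarantee task plus all of its upstream nodes.
baseline_scope = all_upstream(UPSTREAM, "sla_task") | {"sla_task"}
# Basic-rule coverage: limited to project B, where Xiao Ming has permissions.
manual_scope = basic_rule_scope(UPSTREAM, "sla_task", PROJECT, "B")

print(sorted(manual_scope), len(manual_scope) * RULES_PER_TASK)
print(sorted(baseline_scope - manual_scope))
```

Under these assumptions, the basic rules reach only the three project B tasks (3 tasks × 3 rules = 9 rules), while the upstream traversal also reaches the project A tasks that the basic rules cannot cover.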

The case above should give you a general understanding of baselines. The next article covers the related concepts and system architecture of baseline monitoring and takes a closer look at its core implementation logic.

Origin blog.csdn.net/m0_60025795/article/details/131068224