How to embed "intelligent inspection" into your business system

Introduction: With the powerful SLS "Alert 2.0" message system, intelligent inspection can connect to many internal and external systems (EventBridge, FC, etc.), so that the next-step "analysis task" can run on the alert results and problems can be investigated and resolved more effectively.

Product Architecture

The intelligent anomaly analysis application is built around the core elements of operation and maintenance scenarios: monitoring indicators, program logs, and service relationships. It generates anomaly events through machine learning and other means, and analyzes time series data and events through service topology correlation, ultimately reducing the complexity of enterprise operation and maintenance and improving service quality. The product architecture diagram is shown below.

Capabilities:

  • A single task supports single-dimensional and multi-dimensional anomaly detection for 3,000 to 5,000 observation objects.
  • For each detection result, we quantify the anomaly score and the anomaly shape, which makes subsequent processing easier.
  • For anomalous points with a score above 0.75, we push the relevant information (a visualized graph) to your DingTalk system through Alert 2.0.
  • For all detection results, we write the detection information to the internal-ml-log LogStore of the current Project, so that you can carry out further integration through the SDK.
  • On the task page of the App, we also support an "Annotation Feedback" function: you can annotate the detection results to improve the learning accuracy of the model.
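The 0.75 threshold mentioned above can be illustrated with a minimal sketch. The record layout used here (dicts with `entity`, `score`, and `anomaly_type` keys) is an assumption made for illustration and is not the actual schema of internal-ml-log; only the threshold value comes from the description above.

```python
from typing import Dict, List

# From the text: points scoring above 0.75 are pushed as alerts.
ALERT_THRESHOLD = 0.75


def select_alertable(results: List[Dict], threshold: float = ALERT_THRESHOLD) -> List[Dict]:
    """Return results whose score is strictly above the threshold, highest score first."""
    hits = [r for r in results if float(r["score"]) > threshold]
    return sorted(hits, key=lambda r: float(r["score"]), reverse=True)


# Hypothetical detection results; the keys are assumptions, not the real schema.
sample = [
    {"entity": "GetLogs", "score": 0.91, "anomaly_type": "spike"},
    {"entity": "PutLogs", "score": 0.40, "anomaly_type": "none"},
    {"entity": "ListShards", "score": 0.75, "anomaly_type": "shift"},
]
alertable = select_alertable(sample)
```

Because the comparison is strict, a point scoring exactly 0.75 is not selected in this sketch.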

Next, let's look at how to embed the "inspection" capability into your business system.

Capability integration

With the help of the powerful SLS "Alert 2.0" message system, intelligent inspection can connect to many internal and external systems (EventBridge, FC, etc.). You can also use the SLS SDK and custom functions to run the next-step "analysis task" on the alert results, so that problems can be investigated and resolved more effectively.

Task creation

Here we use one of SLS's own monitoring scenarios as an example to show how the tool can be put to good use. The scenario is as follows: in a LogStore, parsing the access log yields the following structured information (see the figure below). Many customers' business scenarios are similar: customer access behavior is recorded in the access log, and by inspecting the golden indicators of the business we can understand the service capability of each API interface in the current service.

According to the above structure, we define the current golden indicators that need to be inspected:

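  • The number of successful responses per minute for each service interface of a cluster

  • The number of failed responses per minute for each service interface of a cluster

  • The average latency of successful responses per minute for each service interface of a cluster

  • The average latency of failed responses per minute for each service interface of a cluster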

* | SELECT __time__ - __time__ % 60 AS time,
         method,
         Count(*) AS total,
         Count_if(status=200) AS n_succ,
         Sum(CASE WHEN status=200 THEN latency ELSE 0 END) / (1 + Count_if(status=200)) AS avg_succ_latency,
         Sum(CASE WHEN status!=200 THEN latency ELSE 0 END) / (1 + Count_if(status!=200)) AS avg_fail_latency
FROM     log
GROUP BY time,
         method limit 100000
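To make the aggregation in the query above easier to follow, here is a minimal Python sketch of the same per-minute, per-method computation. The input record keys (`time` as epoch seconds, `method`, `status`, `latency`) are assumed from the column names in the query, and the sample data is invented. The sketch uses floating-point division, whereas the query's result type depends on the column types in the LogStore.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def golden_metrics(logs: Iterable[Dict]) -> Dict[Tuple[int, str], Dict[str, float]]:
    """Aggregate access logs per (minute, method), mirroring the query's output columns."""
    acc = defaultdict(lambda: {"total": 0, "n_succ": 0, "succ_lat": 0.0,
                               "n_fail": 0, "fail_lat": 0.0})
    for log in logs:
        minute = log["time"] - log["time"] % 60  # truncate to the start of the minute
        a = acc[(minute, log["method"])]
        a["total"] += 1
        if log["status"] == 200:
            a["n_succ"] += 1
            a["succ_lat"] += log["latency"]
        else:
            a["n_fail"] += 1
            a["fail_lat"] += log["latency"]

    out = {}
    for key, a in acc.items():
        out[key] = {
            "total": a["total"],
            "n_succ": a["n_succ"],
            # As in the query, 1 + count keeps the denominator non-zero.
            "avg_succ_latency": a["succ_lat"] / (1 + a["n_succ"]),
            "avg_fail_latency": a["fail_lat"] / (1 + a["n_fail"]),
        }
    return out


# Invented sample records for illustration only.
logs = [
    {"time": 120, "method": "GetLogs", "status": 200, "latency": 30},
    {"time": 150, "method": "GetLogs", "status": 200, "latency": 60},
    {"time": 170, "method": "GetLogs", "status": 500, "latency": 100},
    {"time": 185, "method": "GetLogs", "status": 200, "latency": 10},
]
metrics = golden_metrics(logs)
```

Note that the `1 + count` denominator avoids division by zero but slightly underestimates the true mean, which matters most for buckets with few requests.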

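Of course, there is another form of golden indicator that can be used for subsequent monitoring: we can focus only on changes in the number of failed requests for each interface. The SQL is as follows: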

not status: 200 | SELECT __time__ - __time__ % 60 AS time,
         method,
         status,
         Count(*) AS num
FROM     log
GROUP BY time,
         method,
         status limit 100000

We complete the job configuration in the "Intelligent Anomaly Detection" App. Entry address: sls.console.aliyun.com/lognext/pro…

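Result description

With the configuration above, we get an "intelligent time series inspection" task. Based on the results below, we explain the meaning of each part of the screenshot:

  • "Number of inspected entities": how many observation objects the current task contains in total
  • "Number of inspected indicators": the number of dimensions observed for each observation object in the current task
  • "Entity information list": all observation objects that take part in the inspection in the current task, each assigned a unique code
  • "Anomaly event list": the anomaly scores and anomaly types of the currently selected entity, within the given time window and under the given filter conditions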

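All of the visualized information in the screenshot above comes from the LogStore "internal-ml-log" under the corresponding Project. For a detailed description of the data stored in this LogStore, see our official documentation: help.aliyun.com/document_de…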

Using alerts

In the last step of creating an "inspection task", you can configure several ways of sending messages:

  • DingTalk (custom)
  • EventBridge
  • Function Compute (FC)

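Sending messages through the SDK/DingTalk

The detailed configuration logic and its explanation are not repeated here. For more information, see this link: developer.aliyun.com/article/851… It describes in some detail which fields you can use in alerts for subsequent operations and decisions. When an inspection task finds an anomaly, it pushes the specific information to the DingTalk webhook address according to the following template.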
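As a rough illustration of what a push to a DingTalk custom-robot webhook involves, the sketch below builds a markdown message and posts it with the Python standard library. The message content (`entity`, `score`, `anomaly_type`) is hypothetical and is not the SLS alert template; the only part taken from DingTalk's custom-robot format is the `msgtype`/`markdown` envelope. The webhook URL is a placeholder you would supply yourself.

```python
import json
import urllib.request
from typing import Dict


def build_dingtalk_message(entity: str, score: float, anomaly_type: str) -> Dict:
    """Build a DingTalk custom-robot markdown message (content fields are hypothetical)."""
    text = (
        "### Inspection anomaly\n"
        f"- entity: {entity}\n"
        f"- score: {score:.2f}\n"
        f"- type: {anomaly_type}\n"
    )
    return {"msgtype": "markdown",
            "markdown": {"title": "Inspection anomaly", "text": text}}


def send_to_dingtalk(webhook_url: str, message: Dict, timeout: float = 5.0) -> int:
    """POST the message as JSON to the webhook and return the HTTP status code."""
    data = json.dumps(message).encode("utf-8")
    req = urllib.request.Request(
        webhook_url,
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status


# Building the message needs no network access; sending is left to the caller.
message = build_dingtalk_message("GetLogs", 0.91, "spike")
```

In practice SLS sends the alert for you once the DingTalk channel is configured; a sender like this is only relevant if you forward results from your own code.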

Function Compute (FC)

For some of the details of configuring Function Compute to carry out follow-up operations, see help.aliyun.com/practice_de…
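A function that receives the alert could look roughly like the sketch below. It assumes the Python runtime's `handler(event, context)` entry point with the event delivered as bytes. The payload layout (a `results` list whose items carry `entity` and `score`) is invented for illustration; the real fields are described in the alert documentation linked in this article.

```python
import json
from typing import Any, Dict, List


def extract_anomalies(payload: Dict[str, Any], threshold: float = 0.75) -> List[Dict[str, Any]]:
    """Keep only results scoring above the threshold (payload layout is hypothetical)."""
    results = payload.get("results", [])
    return [r for r in results if float(r.get("score", 0)) > threshold]


def handler(event, context):
    """Entry point: parse the alert payload and summarize the anomalies it contains."""
    raw = event.decode("utf-8") if isinstance(event, (bytes, bytearray)) else event
    payload = json.loads(raw)
    anomalies = extract_anomalies(payload)
    # A real function would start its follow-up analysis here,
    # for example by querying the LogStore for the affected entities.
    return json.dumps({
        "anomaly_count": len(anomalies),
        "entities": [a.get("entity") for a in anomalies],
    })


# Local invocation with an invented event, for illustration only.
sample_event = json.dumps({
    "results": [
        {"entity": "GetLogs", "score": 0.91},
        {"entity": "PutLogs", "score": 0.20},
    ]
}).encode("utf-8")
response = json.loads(handler(sample_event, None))
```

Keeping the parsing logic in `extract_anomalies` separate from `handler` makes it possible to test the function locally without deploying it.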

Here we briefly describe the approach for the next step of the analysis:

References

Original link: click.aliyun.com/m/100034811…

This article is original content from Alibaba Cloud and may not be reproduced without permission.

Origin juejin.im/post/7117188252787802120